Fréchet Sufficient Dimension Reduction for Random Objects
Chao Ying and Zhou Yu
School of Statistics, East China Normal University
July 2, 2020
Abstract
We consider Fréchet sufficient dimension reduction for regressions in which the response is a complex random object in a metric space and the predictors are high-dimensional Euclidean vectors. We propose a novel approach, the weighted inverse regression ensemble method, for linear Fréchet sufficient dimension reduction. The method is further generalized, through a new operator defined on reproducing kernel Hilbert spaces, to nonlinear Fréchet sufficient dimension reduction. We provide theoretical guarantees for the new methods via asymptotic analysis. Intensive simulation studies verify their performance, and an application to handwritten digits data demonstrates their use in real applications.
Keywords:
Metric Space; Sliced Inverse Regression; Sufficient Dimension Reduction
Introduction

Sufficient dimension reduction (Li (1991); Cook (1998)), as a powerful tool for extracting the core information hidden in high-dimensional data, has become an important and rapidly developing research field. For regression with a multivariate response Y ∈ R^q and predictors X ∈ R^p, classical linear sufficient dimension reduction seeks a p × d matrix β such that

Y ⫫ X | β^T X,   (1)

where ⫫ stands for statistical independence. The smallest subspace (Yin et al. (2008)) spanned by a β satisfying relation (1) is called the central subspace and is denoted by S_{Y|X}. Classical methods for identifying the central subspace with a one-dimensional response include sliced inverse regression (Li (1991)), sliced average variance estimation (Cook & Weisberg (1991)), the central kth moment method (Yin & Cook (2002)), the inverse third moment approach (Yin (2003)), contour regression (Li et al. (2005)), directional regression (Li & Wang (2007)), the constructive approach (Xia (2007)), semiparametric estimation (Ma & Zhu (2012, 2013)), and many others. Li et al. (2003), Li et al. (2008) and Zhu et al. (2010) made important extensions of sufficient dimension reduction to multivariate responses.

Li et al. (2011), Lee et al. (2013) and Li (2018) further articulated the general formulation of nonlinear sufficient dimension reduction as

Y ⫫ X | f(X),   (2)

where f: R^p → R^d is an unknown vector-valued function of X. Nonlinear sufficient dimension reduction replaces the linear sufficient predictor β^T X by a nonlinear predictor f(X). The smallest class spanned by the functions satisfying relation (2) is called the central class and is denoted by G_{Y|X}. See Lee et al. (2013) and Li (2018) for more details.

Due to the rapid development of data collection technologies, statisticians nowadays more frequently encounter complex data that are non-Euclidean and, in particular, do not lie in a vector space. Images (Peyré (2009); González-Briones et al. (2018)), shapes (Small (1996); Simeoni & Panaretos (2013)), graphs (Tsochantaridis et al. (2004); Ferretti et al. (2018)), tensors (Zhu et al. (2009); Li & Zhang (2017)) and random densities (Petersen & Müller (2016); Liu et al. (2019)) are examples of complex data types that appear naturally as responses in image completion, computer vision, biomedical analysis, signal processing and other application areas. In particular, in image completion for handwritten digits (Tsagkrasoulis & Montana (2018)), the upper part of each image is taken as the predictors X and the bottom half is set as the response Y. Figure 1 below illustrates the idea of such image analysis for the digits {0, 8, 9}. Predicting the bottom half of a handwritten digit from its upper half is not an easy task, as the upper parts of the digits {0, 8, 9} are quite similar to each other. In image analysis, it is common to assume that the images lie on an unknown manifold equipped with a meaningful distance metric. It is then of great interest to develop a general Fréchet sufficient dimension reduction method for metric space valued responses.
Fréchet sufficient dimension reduction for such X and Y is then an immediate need that can facilitate graphical understanding of the regression structure, and is certainly helpful for further image clustering or classification and for outlier diagnostics.

Figure 1: The first row consists of the predictors X, the upper halves of the image digits {0, 8, 9}; the second row consists of the responses Y, the bottom halves of the image digits {0, 8, 9}; the third row consists of the whole image digits {0, 8, 9}.

Dubey & Müller (2019) and Petersen & Müller (2019b) provided some fundamental tools for Fréchet analysis of such random objects. Petersen & Müller (2019a) further proposed a general global and local Fréchet regression paradigm for responses that are complex random objects in a metric space with Euclidean predictors. Along this pioneering line of work in Fréchet analysis, it is of great interest to consider linear and nonlinear sufficient dimension reduction for response objects in a metric space when the dimension of the Euclidean predictors is relatively high.

As an illustration of Fréchet sufficient dimension reduction, we consider two models:

(i) Y = (sin(β₁ᵀX + ε₁) sin(β₂ᵀX + ε₂), sin(β₁ᵀX + ε₁) cos(β₂ᵀX + ε₂), cos(β₁ᵀX + ε₁));
(ii) Y = (sin(f(X) + ε₁), cos(f(X) + ε₁)),

where (ε₁, ε₂)ᵀ ∼ N(0₂, I₂), X = (x₁, …, x_p)ᵀ ∼ N(0_p, I_p) with p = 30, f(X) = x₁² + x₂², and β₁ and β₂ are p-vectors whose nonzero entries occupy only the first two and the last two coordinates, respectively. For models (i) and (ii), the responses lie on the unit sphere and the unit circle, respectively. Linear Fréchet sufficient dimension reduction for model (i) aims at finding the central subspace S_{Y|X} with d = 2, which is the column space spanned by (β₁, β₂). The purpose of nonlinear Fréchet sufficient dimension reduction for model (ii) is to identify the central class G_{Y|X} with d = 1, which is comprised of all measurable functions of f(X).

To address this issue, we propose a novel linear Fréchet sufficient dimension reduction method to recover the central subspace S_{Y|X} defined through (1) with a metric space valued response Y. We also provide a consistent estimator of the structural dimension d, the dimension of the central subspace. The new method is further generalized to nonlinear Fréchet sufficient dimension reduction (2) via reproducing kernel Hilbert spaces. The proposed linear and nonlinear Fréchet sufficient dimension reduction estimators are shown to be unbiased for the central subspace S_{Y|X} and the central class G_{Y|X}, respectively. Moreover, by taking advantage of the distance metric of the random objects, both estimators require no numerical optimization or nonparametric smoothing, because they can be implemented through spectral decompositions of linear operators. Asymptotic convergence results for our proposals are derived for theoretical justification. We also examine our methods via comprehensive simulation studies, including responses that consist of probability distributions or lie on spheres, and an application to handwritten digits data demonstrates the practical value of our proposals.

Linear Fréchet Sufficient Dimension Reduction
Let (Ω, d) be a metric space. Linear Fréchet sufficient dimension reduction considers regression with response Y ∈ Ω and predictors X ∈ R^p. Let F be the joint distribution of (X, Y) defined on R^p × Ω, and assume that the conditional distributions F_{Y|X} and F_{X|Y} exist.

Under the linearity condition that E(X | β^T X) is linear in X, Li (1991) discovered the fundamental property of sliced inverse regression,

Σ^{-1}{E(X | Y) − E(X)} ∈ S_{Y|X},   (3)

where Σ = var(X). However, the inverse regression mean E(X | Y) is difficult to estimate when the responses live in a metric space, because only distances between response objects are computable.

Our goal for linear Fréchet sufficient dimension reduction is therefore to borrow the strength of sliced inverse regression without estimating the inverse regression function E(X | Y). To introduce the new method, we first recall the martingale difference divergence (MDD) proposed by Shao & Zhang (2014) for Y ∈ R^q and X ∈ R^p, which measures the conditional mean (in)dependence of Y on X, i.e. whether E(Y | X) = E(Y) almost surely. To be specific, MDD(Y | X) is the nonnegative number that satisfies

MDD²(Y | X) = −E[{Y − E(Y)}ᵀ{Y′ − E(Y′)} ‖X − X′‖],

where (X′, Y′) is an independent copy of (X, Y) and ‖·‖ stands for the Euclidean norm. To inherit the spirit of sliced inverse regression, we switch the roles of X and Y in the martingale difference divergence and define the p × p matrix

Λ = −E[{X − E(X)}{X′ − E(X′)}ᵀ d(Y, Y′)],   (X, Y) ∈ R^p × Ω.

By the property of conditional expectation, we have

Λ = −E[E{X − E(X) | Y} E{X′ − E(X′) | Y′}ᵀ d(Y, Y′)].   (4)

Invoking the appealing property (3) of sliced inverse regression, we see that

Σ^{-1}Λ = −Σ^{-1}E[E{X − E(X) | Y} E{X′ − E(X′) | Y′}ᵀ d(Y, Y′)] ∈ S_{Y|X}.

We summarize this property in the following proposition.
Proposition 1. Λ is positive semidefinite. Under the linearity condition, Span{Σ^{-1}Λ} ⊆ S_{Y|X}.

From (4), Λ can be viewed as a weighted average ensemble of the inverse regression mean E(X | Y), where the weight function is the distance d(Y, Y′). We therefore call the new method the weighted inverse regression ensemble. The weighted inverse regression ensemble can also be applied to classical linear sufficient dimension reduction with Y ∈ R^q and d(Y, Y′) = ‖Y − Y′‖ the Euclidean distance. Moreover, choosing the number of slices for sliced inverse regression is a longstanding issue in the literature. Compared to sliced inverse regression, our proposal is completely slicing free and is readily applicable to multivariate response data.

Let M = Σ^{-1}Λ and let (β₁, …, β_d) be the left singular vectors of M corresponding to the d largest singular values. Proposition 1 then suggests that (β₁, …, β_d) provides a basis of S_{Y|X}. Given a random sample {(X_i, Y_i), i = 1, …, n} from (X, Y), the quantities μ = E(X) and Σ = var(X) can be estimated by μ̂ = E_n(X) and Σ̂ = E_n{(X − μ̂)(X − μ̂)ᵀ}, where E_n(·) denotes the sample average n^{-1} Σ_{i=1}^{n}(·). Moreover, we adopt a U-statistic to estimate Λ:

Λ̂ = −Σ_{1 ≤ i ≠ j ≤ n} (X_i − μ̂)(X_j − μ̂)ᵀ d(Y_i, Y_j) / {n(n − 1)}.

Let M̂ = Σ̂^{-1}Λ̂. We then take the top d left singular vectors (β̂₁, …, β̂_d) of M̂ to recover S_{Y|X} at the sample level.
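At the sample level, the weighted inverse regression ensemble estimator only requires the pairwise response distances and one singular value decomposition. The following Python sketch illustrates one possible implementation; the function name and interface are ours rather than the paper's, and it uses the fact that d(Y_i, Y_i) = 0, so the diagonal of the distance matrix drops out of the U-statistic.

```python
import numpy as np

def wire_directions(X, D, d):
    """Sketch of the linear weighted inverse regression ensemble estimator.

    X : (n, p) matrix of predictors; D : (n, n) matrix of pairwise response
    distances d(Y_i, Y_j) with zero diagonal; d : assumed structural dimension.
    Returns a (p, d) matrix whose columns estimate a basis of the central subspace.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                       # center the predictors
    Sigma_hat = Xc.T @ Xc / n                     # sample covariance Sigma-hat
    # U-statistic Lambda-hat = -sum_{i != j} (X_i - mu)(X_j - mu)^T d(Y_i, Y_j) / {n(n-1)};
    # the i = j terms vanish because the diagonal of D is zero
    Lambda_hat = -(Xc.T @ D @ Xc) / (n * (n - 1))
    M_hat = np.linalg.solve(Sigma_hat, Lambda_hat)   # M-hat = Sigma-hat^{-1} Lambda-hat
    U, s, _ = np.linalg.svd(M_hat)
    return U[:, :d]                               # top-d left singular vectors
```

Only the distance matrix of the responses enters the computation, so the same sketch applies verbatim to distributional, spherical or other metric space valued responses.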
We introduce the following notation to present the central limit theory for the estimation of the central subspace:

Γ(X) = (X − μ)(X − μ)ᵀ − Σ,
Λ^{(1)}(X, Y, X′, Y′) = −(X − μ)(X′ − μ)ᵀ d(Y, Y′),
Λ^{(1)}_1(X′, Y′) = E{Λ^{(1)}(X, Y, X′, Y′) | X′, Y′},
ϑ = E{(X − μ) d(Y, Y′)},
Θ(X, Y) = Λ^{(1)}_1(X, Y) − Λ + (X − μ)ϑᵀ + ϑ(X − μ)ᵀ,
ζ_ℓ(X, Y) = Σ^{-1}{Θ(X, Y)Λᵀ + ΛΘ(X, Y)ᵀ − Γ(X)Σ^{-1}ΛΛᵀ − ΛΛᵀΣ^{-1}Γ(X)}Σ^{-1},
Υ_ℓ(X, Y) = Σ_{j=1, j≠ℓ}^{p} β_j β_jᵀ ζ_ℓ(X, Y) β_ℓ / (λ_j − λ_ℓ),   ℓ = 1, …, d.

Theorem 1. Assume the linearity condition holds and that the singular values λ_ℓ are distinct for ℓ = 1, …, d. In addition, assume that E d(Y, Y′) < ∞ and X has finite fourth moment. Then

n^{1/2}(β̂_ℓ − β_ℓ) →_D N(0_p, Σ_ℓ),   (5)

as n → ∞, where Σ_ℓ = cov{Υ_ℓ(X, Y)}.

The estimation of the structural dimension d is another focus in sufficient dimension reduction. We adopt the ladle estimator proposed by Luo & Li (2016) for order determination, which extracts the information contained in both the singular values and the left singular vectors of M̂. Let B_k = (β̂₁, …, β̂_k) be the p × k matrix consisting of the principal k left singular vectors of M̂. We randomly draw n bootstrap samples of size n and denote the realization of B_k based on the ith bootstrap sample by B*_{k,i}. The following function evaluates the difference between B_k and its bootstrap counterparts:

f_n⁰(k) = 0 for k = 0, and f_n⁰(k) = n^{-1} Σ_{i=1}^{n} {1 − |det(B_kᵀ B*_{k,i})|} for k = 1, …, p.

It is further normalized as f_n(k) = f_n⁰(k) / {1 + Σ_{i=0}^{r_p} f_n⁰(i)}, where r_p = p − 1 if p ≤ 10 and r_p = ⌊p/log p⌋ if p > 10 (following Luo & Li (2016)), and ⌊a⌋ stands for the largest integer no greater than a. The effect of the singular values is measured by g_n(k) = λ̂_{k+1} / (1 + Σ_{i=0}^{r_p} λ̂_{i+1}), k = 0, 1, …, r_p. The ladle estimator of the structural dimension d is then constructed as d̂ = argmin_{k=0,…,r_p} {f_n(k) + g_n(k)}.
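The ladle criterion can be computed by combining the estimator above with a simple bootstrap loop. Below is a minimal, self-contained sketch of one possible implementation; the helper name, the default number of bootstrap replications and the use of the natural logarithm in r_p are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def _wire_svd(X, D):
    """Left singular vectors and singular values of M-hat = Sigma-hat^{-1} Lambda-hat."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    M = np.linalg.solve(Xc.T @ Xc / n, -(Xc.T @ D @ Xc) / (n * (n - 1)))
    U, s, _ = np.linalg.svd(M)
    return U, s

def ladle(X, D, n_boot=None, seed=0):
    """Ladle estimate of the structural dimension d (Luo & Li, 2016) applied to
    the weighted inverse regression ensemble; n_boot defaults to the sample size."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_boot = n if n_boot is None else n_boot
    r = p - 1 if p <= 10 else int(p / np.log(p))     # r_p as described above
    U, s = _wire_svd(X, D)
    g = s[:r + 1] / (1.0 + s[:r + 1].sum())          # g_n(k), k = 0, ..., r
    f0 = np.zeros(r + 1)                             # f_n^0(k), with f_n^0(0) = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                  # bootstrap resample of (X, Y)
        Ub, _ = _wire_svd(X[idx], D[np.ix_(idx, idx)])
        for k in range(1, r + 1):
            f0[k] += 1.0 - abs(np.linalg.det(U[:, :k].T @ Ub[:, :k]))
    f0 /= n_boot
    f = f0 / (1.0 + f0.sum())                        # normalized f_n(k)
    return int(np.argmin(f + g))                     # d-hat = argmin_k {f_n(k) + g_n(k)}
```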
To obtain the desired estimation consistency of the structural dimension, we impose the following assumptions.

Assumption 1. The bootstrap version M* of the kernel matrix satisfies

n^{1/2}{vech(M*(M*)ᵀ) − vech(M̂M̂ᵀ)} →_D N(0, var[vech{H(X, Y)}]),   (6)

where vech(·) is the vectorization of the upper triangular part of a matrix and

H(X, Y) = −Σ^{-1}Γ(X)Σ^{-1}Λ + Σ^{-1}{Λ^{(1)}_1(X, Y) − Λ} − Σ^{-1}(X − μ)ϑᵀ − Σ^{-1}ϑ(X − μ)ᵀ.
Assumption 2. For any sequence of nonnegative random variables {Z_n : n = 1, 2, …} involved in this paper, if Z_n = O_p(c_n) for some sequence {c_n : n ∈ N} with c_n > 0, then E(c_n^{-1} Z_n) exists for each n and E(c_n^{-1} Z_n) = O(1).

From the proof of Theorem 1, we know that n^{1/2}{vech(M̂M̂ᵀ) − vech(MMᵀ)} also converges in distribution to the right-hand side of (6). Assumption 1 thus amounts to asserting that the asymptotic behaviour of n^{1/2}{M*(M*)ᵀ − M̂M̂ᵀ} mimics that of n^{1/2}(M̂M̂ᵀ − MMᵀ). The validity of this self-similarity was discussed in Bickel & Freedman (1981) and Luo & Li (2016). Assumption 2 has also been adopted and verified by Luo & Li (2016). The following theorem confirms that the number of useful sufficient predictors for linear Fréchet sufficient dimension reduction can be consistently estimated.
Theorem 2. Assume E d(Y, Y′) < ∞ and X has finite fourth moment, and suppose Assumptions 1–2 hold. Then Pr{lim_{n→∞} Pr(d̂ = d | D) = 1} = 1, where D = {(X₁, Y₁), (X₂, Y₂), …} is a sequence of independent copies of (X, Y).

As a descendant of sliced inverse regression, the weighted inverse regression ensemble method shares a similar limitation with sliced inverse regression when dealing with regression functions that are symmetric about the origin (Cook & Weisberg (1991)). To remedy this problem and to further extend the scope of our method, we next consider nonlinear Fréchet sufficient dimension reduction, as defined in (2), using reproducing kernel Hilbert spaces.

Nonlinear Fréchet Sufficient Dimension Reduction

Let H_X be a reproducing kernel Hilbert space of functions of X generated by a positive definite kernel κ_X. To extend the idea of the weighted inverse regression ensemble to nonlinear Fréchet sufficient dimension reduction, we introduce a new type of operator.

Definition 1.
Let μ_X(·) = Eκ_X(·, X). For (X, Y) and its independent copy (X′, Y′), we define the weighted inverse regression ensemble operator Λ_{XX′}: H_{X′} → H_X by

Λ_{XX′} = −E{(κ_X(·, X) − μ_X(·)) ⊗ (κ_X(·, X′) − μ_{X′}(·)) d(Y, Y′)}.

We impose the following regularity assumptions for the theoretical investigation of Λ_{XX′}.

Assumption 3. Eκ_X(X, X) < ∞.

Assumption 4.
The operator Λ_{XX′} has the representation Λ_{XX′} = Σ_{XX} S, where S is a unique bounded linear operator such that S: H_X → H_X and S = Q_X S Q_X, with Q_X the projection operator mapping H_X onto the closure of ran(Σ_{XX}), where ran(Σ_{XX}) stands for the range of the covariance operator Σ_{XX}.

Assumption 5. G_{Y|X} is dense in L₂(P_X | M_{Y|X}), where L₂(P_X | M_{Y|X}) denotes the collection of M_{Y|X}-measurable functions in L₂(P_X) and M_{Y|X} = σ[f(X)].

Assumption 6.
The eigenfunctions ψ_i are included in R(Σ_{XX}), where R(Σ_{XX}) = {Σ_{XX} f : f ∈ H_X}.

Assumption 7.
Let (ε_n)_{n=1}^{∞} be a sequence of positive numbers such that lim_{n→∞} ε_n = 0 and lim_{n→∞} n^{-1/2}/ε_n^{3/2} = 0.

Assumptions 3, 5 and 6 are commonly used conditions for reproducing kernel Hilbert spaces in the literature (Lee et al. (2013); Li (2018)). Assumption 4 is similar to the result of Theorem 1 of Baker (1973), which defines the correlation operator, and it guarantees that our proposed operator is compact. Assumption 7 is adopted by Fukumizu et al. (2007) for the asymptotic analysis of kernel-type methods and is helpful for establishing the estimation consistency of the nonlinear weighted inverse regression ensemble method.
Proposition 2. Λ_{XX′} is a bounded linear and self-adjoint operator. For any f, g ∈ H_X,

⟨f, Λ_{XX′} g⟩ = −E{(f(X) − Ef(X))(g(X′) − Eg(X′)) d(Y, Y′)}.

Moreover, there exist a separable R-Hilbert space H and a mapping φ: Ω → H such that

⟨f, Λ_{XX′} f⟩ = 2{E[(f(X) − Ef(X))(φ(Y) − Eφ(Y))]}² = 2(cov[f(X), φ(Y)])².

Proposition 2 indicates that Λ_{XX′} can also serve as a measure of dependence between X and the random object Y, owing to its similarity to the popular Hilbert–Schmidt independence criterion (Gretton et al. (2005)). Denote the covariance operator of X by Σ_{XX} = E{κ_X(·, X) ⊗ κ_X(·, X)} − Eκ_X(·, X) ⊗ Eκ_X(·, X). The next proposition reveals the relationship between Λ_{XX′} and the central class G_{Y|X}.

Proposition 3.
Suppose Assumptions 3–5 hold. Then ran{Σ_{XX}^{-1} Λ_{XX′}} ⊆ G_{Y|X}.

Proposition 4.
Suppose Assumptions 3–5 hold and G_{Y|X} is complete. Then ran{Σ_{XX}^{-1} Λ_{XX′}} = G_{Y|X}.

Proposition 3 suggests that the range of Σ_{XX}^{-1} Λ_{XX′} is always contained in the central class G_{Y|X}. Proposition 4 further extends the scope in the following aspects. First, it confirms that the nonlinear weighted inverse regression ensemble method is exhaustive in recovering the central class. The exhaustiveness of our nonlinear proposal is an appealing property which may not hold in the linear setting. Second, the nonlinear weighted inverse regression ensemble method leads to the minimal sufficient predictor satisfying (2), as sufficiency and completeness together imply minimal sufficiency in classical statistical inference. Last but not least, the nonlinear weighted inverse regression ensemble method does not rely on the linear conditional mean assumption requiring that E(X | β^T X) be linear in X. By relaxing this stringent condition, the nonlinear method has a wide range of applications.

Let Λ*_{XX′} be the adjoint operator of Λ_{XX′}. Proposition 4 indicates that

ran{Σ_{XX}^{-1} Λ_{XX′} Λ*_{XX′} Σ_{XX}^{-1}} = G_{Y|X}.   (7)

The space in (7) can be recovered by solving the following sequence of generalized eigenvalue problems:

max ⟨f, Λ_{XX′} Λ*_{XX′} f⟩_{H_X}, subject to ⟨f, Σ_{XX} f⟩_{H_X} = 1, f ⊥ L_{k−1},   (8)

where L_{k−1} = Span(f₁, …, f_{k−1}) and f₁, …, f_{k−1} are the solutions of this constrained maximization problem obtained in the previous steps. Define the following sample-level estimators:

μ̂_X(·) = E_n[κ_X(·, X_i)],
Σ̂_{XX} = E_n{(κ_X(·, X_i) − μ̂_X(·)) ⊗ (κ_X(·, X_i) − μ̂_X(·))},
Λ̂_{XX′} = −Σ_{1 ≤ i ≠ j ≤ n} (κ_X(·, X_i) − μ̂_X(·)) ⊗ (κ_X(·, X_j) − μ̂_X(·)) d(Y_i, Y_j) / {n(n − 1)}.

The sample version of (8) then becomes

max ⟨f, Λ̂_{XX′} Λ̂*_{XX′} f⟩_{H_X}, subject to ⟨f, (Σ̂_{XX} + ε_n I) f⟩_{H_X} = 1.   (9)

Let V_{XX′} = Σ_{XX}^{-1/2} Λ_{XX′} Λ*_{XX′} Σ_{XX}^{-1/2}. Then we can verify that f₁ = Σ_{XX}^{-1/2} ψ₁, where ψ₁ = argmax_{g ∈ H_X, ‖g‖_{H_X} = 1} ⟨g, V_{XX′} g⟩_{H_X}. Let V̂_{XX′} = (Σ̂_{XX} + ε_n I)^{-1/2} Λ̂_{XX′} Λ̂*_{XX′} (Σ̂_{XX} + ε_n I)^{-1/2}. Then we have f̂₁(X) = (Σ̂_{XX} + ε_n I)^{-1/2} ψ̂₁, where ψ̂₁ = argmax_{g ∈ H_X, ‖g‖_{H_X} = 1} ⟨g, V̂_{XX′} g⟩_{H_X}.

We next establish the estimation consistency of our nonlinear Fréchet sufficient dimension reduction approach. Although we only focus on the first eigenfunction in the following theorem, similar asymptotic results can be derived for the entire central class.
Theorem 3.
Suppose Assumptions 3–7 hold. In addition, assume that E d(Y, Y′) < ∞. Then, as n → ∞,

‖V̂_{XX′} − V_{XX′}‖_{HS} = o_p(1),  |⟨ψ̂₁, ψ₁⟩_{H_X}| →_P 1,  ‖{f̂₁(X) − E f̂₁(X)} − {f₁(X) − E f₁(X)}‖ →_P 0,

where ‖·‖ in this theorem is the standard L₂ norm measuring the distance between functions and ‖·‖_{HS} denotes the Hilbert–Schmidt norm.

Let η_i = κ_X(·, X_i) − μ̂_X(·), i = 1, …, n. The estimated eigenfunctions f̂_ℓ solved from (9) can be further characterized as linear combinations of the η_i, namely f̂_ℓ = Σ_{i=1}^{n} a_{ℓ,i} η_i. Denote α_ℓ = (a_{ℓ,1}, …, a_{ℓ,n})ᵀ. The next proposition indicates that α_ℓ can be obtained by solving an eigen-decomposition problem.

Proposition 5. Let K_n be the n × n kernel matrix whose (i, j)th element is κ_X(X_i, X_j), and let J_n be the n × n matrix whose elements are all one. Define G_X = (I_n − J_n/n) K_n (I_n − J_n/n), and let D_Y be the n × n matrix whose (i, j)th element is d(Y_i, Y_j). Then G_X α_ℓ = γ_ℓ, where γ_ℓ is the ℓth eigenvector of the matrix

(G_X + ε_n I_n)^{-1} G_X D_Y G_X D_Y G_X (G_X + ε_n I_n)^{-1}.

Let α̂_ℓ = (G_X + ε_n I_n)^{-1} γ_ℓ. The ℓth estimated sufficient predictor can then be represented as f̂_ℓ = Σ_{i=1}^{n} â_{ℓ,i} η_i, where â_{ℓ,i} is the ith element of the n × 1 vector α̂_ℓ.
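Proposition 5 reduces the sample-level problem (9) to an ordinary matrix eigen-decomposition that involves only the kernel Gram matrix of X and the distance matrix of Y. A minimal Python sketch is given below; the Gaussian kernel, the bandwidth and the ridge parameter are illustrative choices and not the paper's settings, and the returned predictor values are defined only up to scale.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nonlinear_wire(X, D, eps=0.1, sigma=0.5, n_comp=2):
    """Sketch of the sample-level nonlinear weighted inverse regression ensemble
    (Proposition 5).  X : (n, p) predictors; D : (n, n) response distance matrix;
    eps : ridge parameter epsilon_n; sigma : Gaussian kernel bandwidth.
    Returns the estimated sufficient predictors evaluated at the sample points."""
    n = X.shape[0]
    K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))   # Gram matrix K_n
    H = np.eye(n) - np.ones((n, n)) / n
    G = H @ K @ H                                                # G_X = (I - J/n) K (I - J/n)
    R = np.linalg.solve(G + eps * np.eye(n), np.eye(n))          # (G_X + eps I)^{-1}
    A = R @ G @ D @ G @ D @ G @ R                                # matrix in Proposition 5
    w, V = np.linalg.eigh((A + A.T) / 2)                         # symmetric eigen-decomposition
    gamma = V[:, np.argsort(w)[::-1][:n_comp]]                   # leading eigenvectors gamma_ell
    alpha = R @ gamma                                            # alpha-hat = (G_X + eps I)^{-1} gamma
    # f-hat_ell(X_j) = sum_i alpha_i { kappa(X_j, X_i) - n^{-1} sum_k kappa(X_j, X_k) }
    return (K - K.mean(axis=1, keepdims=True)) @ alpha
```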
Simulation Studies

We consider the following models with responses being complex random objects.

Model I. Let β₁ and β₂ be fixed p-vectors whose nonzero entries occupy only the first two and the last two coordinates, respectively, with the leading entry of β₁ equal to 1. Let X ∼ U[0, 1]^p, and let Y be the distribution function whose quantile function is Q_Y(τ) = μ_Y + σ_Y Φ^{-1}(τ), where Φ(·) is the cumulative distribution function of the standard normal and μ_Y | X follows a normal distribution with mean exp(β₁ᵀX) and a small fixed variance. We consider σ_Y = 1 as case (i) and σ_Y = |β₂ᵀX| as case (ii). As Y and its independent copy Y′ are random distribution functions, we adopt the Wasserstein distance as the metric d(Y, Y′). For case (i), S_{Y|X} = Span(β₁) and d = 1. For case (ii), S_{Y|X} = Span(β₁, β₂) and d = 2.

Model II. Consider the Fréchet regression function m(X) = (cos(f(X)), sin(f(X))). Generate ε from a centered normal distribution with small variance on the tangent line of m(X), and generate the response as Y = cos(ε) m(X) ⊕ sin(ε) ε/|ε|, where ⊕ stands for vector addition. One can verify that Y ∈ Ω, where Ω is the unit circle in R². Then d(Y, Y′) is naturally chosen as the geodesic distance arccos(YᵀY′). Moreover, we consider case (i) with f(X) = β₁ᵀX and X ∼ U[0, 1]^p, and case (ii) with f(X) = (x₁² + x₂²)/2 and X ∼ N(0_p, I_p), for both linear and nonlinear Fréchet sufficient dimension reduction.

Model III. Generate ε_i, i = 1, 2, from a centered normal distribution with small variance. We consider two cases in this study. The model structure of case (i) is exactly the same as our motivating example (i) in the Introduction, with X ∼ U[0, 1]^p. For case (ii), the response Y is generated as

Y = (sin(f₁(X) + ε₁) sin(f₂(X) + ε₂), sin(f₁(X) + ε₁) cos(f₂(X) + ε₂), cos(f₁(X) + ε₁)),

where f₁(X) = x₁² + x₂², f₂(X) = x²_{p−1} + x²_p, and X ∼ N(0_p, I_p). We see that Y ∈ Ω, where Ω is the unit sphere in R³. Again, d(Y, Y′) = arccos(YᵀY′) is the geodesic distance.

Model I and case (i) of Models II and III are adopted for linear Fréchet sufficient dimension reduction, while the two remaining cases are examples for nonlinear Fréchet sufficient dimension reduction. Let β̂ and f̂(X) be our proposed linear and nonlinear weighted inverse regression ensemble estimators. To evaluate the linear proposal, we adopt the trace correlation (Ferré (1998)) defined as r = tr(P_β P_{β̂})/d, where P_β = β(βᵀβ)^{-1}βᵀ. To assess the performance of the nonlinear proposal, we utilize the squared distance correlation ρ(f(X), f̂(X)) proposed by Székely et al. (2007). The squared distance correlation can also be adapted to linear Fréchet sufficient dimension reduction as ρ(βᵀX, β̂ᵀX). Larger values of r or ρ indicate better estimation.

We consider n = 100, 200, 300, 400 and p = 10, 20, 30. Treating d as known, Tables 1 and 2 summarize the mean values of r and ρ based on 100 repetitions for different combinations of n and p.
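Before turning to the results, we note that the two response metrics used above are simple to evaluate. The following sketch assumes distributional responses are represented by their quantile functions on an equally spaced grid (with the 2-Wasserstein distance as one concrete choice of Wasserstein metric) and spherical responses by unit vectors; the grid size and the Gaussian parameters in the usage example are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm

def wasserstein2(q1, q2):
    """2-Wasserstein distance between two distributions given their quantile
    functions evaluated on the same equally spaced grid in (0, 1)."""
    return np.sqrt(np.mean((q1 - q2) ** 2))

def geodesic(y1, y2):
    """Geodesic distance arccos(y1^T y2) between two points on the unit sphere."""
    return np.arccos(np.clip(np.dot(y1, y2), -1.0, 1.0))

# Model I style responses: Gaussian distributions represented through quantiles
tau = np.linspace(0.001, 0.999, 1000)
d_wass = wasserstein2(1.0 + 1.0 * norm.ppf(tau), 0.5 + 2.0 * norm.ppf(tau))
# Model II/III style responses: points on a sphere
d_geo = geodesic(np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0]))  # equals pi/2
```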
We can see from Table 1 that the original weighted inverse regression ensemble works well except for case (ii) of Models II and III, which have a U-shaped structure; this is consistent with our theoretical anticipation. As an effective remedy, the nonlinear weighted inverse regression ensemble produces satisfying results, as seen from Table 2, in which the tuning parameter ε_n and the bandwidth σ_κ of the Gaussian kernel κ_X(X, X′) = exp{−‖X − X′‖²/(2σ_κ²)} are simply set to fixed small constants. The results for order determination are presented in Table 3, where the entries are the numbers of correct estimates of d out of 100 repetitions. Table 3 shows that the ladle estimator in combination with the weighted inverse regression ensemble works well, with the percentage of correct estimation reaching 100% in most cases.

To better illustrate the performance of our nonlinear Fréchet sufficient dimension reduction method, Figure 2 presents 2-D scatter plots of the nonlinear sufficient predictors from case (ii) of Model III versus their sample estimates obtained by the nonlinear weighted inverse regression ensemble method with n = 100 and p = 10. The left panel plots the first nonlinear sufficient predictor f₁(X) = x₁² + x₂² against its estimate f̂₁(X); the right panel plots the second nonlinear sufficient predictor f₂(X) = x²_{p−1} + x²_p against its estimate f̂₂(X). Figure 2 shows a strong relationship between f_i and f̂_i for i = 1, 2: f̂_i behaves like a measurable function of f_i, which is consistent with our theoretical development, as the target of nonlinear Fréchet sufficient dimension reduction is the σ-field generated by f₁ and f₂ rather than f₁ and f₂ themselves. As f_i^{1/3} is a measurable function of f_i, f_i^{1/3} can also be regarded as a nonlinear sufficient predictor. Figure 3 presents 2-D scatter plots of the nonlinear sufficient predictors f₁^{1/3} and f₂^{1/3} versus f̂₁ and f̂₂. We observe a strong linear pattern between f_i^{1/3} and f̂_i, which again verifies that the proposed nonlinear weighted inverse regression ensemble method is effective in recovering the central class G_{Y|X} when the responses Y are metric space valued random objects.

Figure 2: Scatter plots of the nonlinear sufficient predictors f_i versus their estimates f̂_i.

Figure 3: Scatter plots of the nonlinear sufficient predictors f_i^{1/3} versus their estimates f̂_i.

Real Data Analysis

To further investigate the performance of our proposals and demonstrate their use in real applications, we extract gray-scale images of the three handwritten digit classes {0, 8, 9} from the UCI Machine Learning Repository. The dataset contains a training group and a testing group, and each digit is represented by a pixel image. The upper part of each image is taken as the predictors X, and the bottom half is set as the response Y.

We focus on sufficient dimension reduction of the high-dimensional feature vectors X in the training set, which serves as a preparatory step for further clustering or classification. Because the response Y is also high-dimensional, we include the projective resampling approach (Li et al. (2008)) for comparison, as it is the state-of-the-art sufficient dimension reduction paradigm for multivariate response data.
To be specific, we consider the projective resampling approach in combination with three classical methods: sliced inverse regression, sliced average variance estimation and directional regression. We adopt three different distance metrics for our proposals: the Euclidean distance, the distance metric learned by locally linear embedding (Roweis & Saul (2000)), and the distance metric learned by the Isomap approach (Tenenbaum et al. (2000)).

For the training data, Figure 4 presents the scatter plots of the first two sufficient predictors estimated by the projective resampling based classical methods, as well as by our proposed linear and nonlinear weighted inverse regression ensemble methods, with the cases for digits 0, 8 and 9 represented by red, green and blue dots, respectively. Figure 4 shows that the linear weighted inverse regression ensemble method performs much better than the classical methods. We also observe that our nonlinear weighted inverse regression ensemble method based on the distance metric induced by the Isomap approach provides better separation in both location and variation, which should be useful for further classification.

Figure 5 presents the corresponding plots for the testing data. Similar to our previous findings, our linear and nonlinear weighted inverse regression ensemble approaches again do a much better job in separating the three digit classes compared to the classical methods. The upper parts of these digit classes are generally difficult to distinguish; however, our proposals provide valid and useful information for classification, as seen from the scatter plots.

Discussion

When the predictor dimension is excessively large, one may consider sparse Fréchet sufficient dimension reduction with random object responses and ultrahigh dimensional predictors. The proposed weighted inverse regression ensemble method can be further extended to model-free variable selection and screening (Yin & Hilafu (2015); Yu et al. (2016)) and to minimax estimation of S_{Y|X} (Tan et al. (2019)). The full potential of sparse Fréchet dimension reduction will be explored in future research.

References

Baker, C. R. (1973). Joint measures and cross-covariance operators. Trans. Amer. Math. Soc., 273–289.
Bickel, P. J. & Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist., 1196–1217.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. New York: John Wiley.

Cook, R. D. & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. J. Am. Statist. Assoc., 328–332.

Dubey, P. & Müller, H.-G. (2019). Fréchet analysis of variance for random objects. To appear in Biometrika.

Ferré, L. (1998). Determining the dimension in sliced inverse regression and related methods. J. Am. Statist. Assoc., 132–140.

Fukumizu, K., Bach, F. R. & Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 361–383.

Ferretti, M., Iulita, M., Cavedo, E., Chiesa, P., Schumacher Dimech, A., Santuccione Chadha, A., Baracchi, F., Girouard, H., Misoch, S., Giacobini, E., Depypere, H. & Hampel, H. (2018). Sex differences in Alzheimer disease - the gateway to precision medicine. Nat. Rev. Neurol., 457–469.

Gretton, A., Bousquet, O., Smola, A. & Schölkopf, B. (2005). Kernel methods for measuring independence. J. Mach. Learn. Res., 1343–1368.

González-Briones, A., Villarrubia, G., De Paz, J. & Corchado, J. (2018). A multi-agent system for the classification of gender and age from images. Comput. Vis. Image Underst., 98–106.

Lee, K.-Y., Li, B. & Chiaromonte, F. (2013). A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Ann. Statist., 3182–3210.

Li, B. (2018). Sufficient Dimension Reduction: Methods and Applications with R. Chapman and Hall/CRC.

Li, B., Artemiou, A. & Li, L. (2011). Principal support vector machines for linear and nonlinear sufficient dimension reduction. Ann. Statist., 3182–3210.

Li, B. & Wang, S. (2007). On directional regression for dimension reduction. J. Am. Statist. Assoc., 997–1008.

Li, B., Wen, S. Q. & Zhu, L.-X. (2008). On a projective resampling method for dimension reduction with multivariate responses. J. Am. Statist. Assoc., 1177–1186.

Li, B., Zha, H. & Chiaromonte, F. (2005). Contour regression: a general approach to dimension reduction. Ann. Statist., 1580–1616.

Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Am. Statist. Assoc., 316–327.

Li, K. C., Aragon, Y., Shedden, K. & Agnan, C. T. (2003). Dimension reduction for multivariate response data. J. Am. Statist. Assoc., 99–109.

Luo, W. & Li, B. (2016). Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 875–887.

Liu, A., Liu, J. & Lu, Y. (2019). On the rate of convergence of empirical measure in ∞-Wasserstein distance for unbounded density function. Q. Appl. Math., 811–829.

Li, L. & Zhang, X. (2017). Parsimonious tensor response regression. J. Am. Statist. Assoc., 1131–1146.

Ma, Y. & Zhu, L. (2012). A semiparametric approach to dimension reduction. J. Am. Statist. Assoc., 168–179.

Ma, Y. & Zhu, L. (2013). Efficient estimation in sufficient dimension reduction. Ann. Statist., 250–268.

Petersen, A. & Müller, H.-G. (2016). Functional data analysis for density functions by transformation to a Hilbert space. Ann. Statist., 183–218.

Petersen, A. & Müller, H.-G. (2019a). Fréchet regression for random objects with Euclidean predictors. Ann. Statist., 691–719.

Petersen, A. & Müller, H.-G. (2019b). Functional models for time-varying random objects. To appear in J. R. Statist. Soc. B.

Peyré, G. (2009). Manifold models for signals and images. Comput. Vis. Image Underst., 249–260.

Roweis, S. & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 2323–2326.

Small, C. (1996). The Statistical Theory of Shape. Springer Series in Statistics.

Simeoni, M. & Panaretos, V. (2013). Statistics on manifolds applied to shape theory. infoscience.epfl.ch.

Székely, G. J., Rizzo, M. L. & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 2769–2794.

Shao, X. & Zhang, J. (2014). Martingale difference correlation and its use in high-dimensional variable screening. J. Am. Statist. Assoc., 1302–1318.

Tsochantaridis, I., Hofmann, T., Joachims, T. & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. J. Mach. Learn. Res.

Tsagkrasoulis, D. & Montana, G. (2018). Random forest regression for manifold-valued responses. Pattern Recognit. Lett., 6–13.

Tenenbaum, J., Silva, V. & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 2319–2323.

Tan, K., Shi, L. & Yu, Z. (2019). Sparse SIR: optimal rates and adaptive estimation. Ann. Statist., 64–85.

Xia, Y. (2007). A constructive approach to the estimation of dimension reduction directions. Ann. Statist., 2654–2690.

Yin, X. (2003). Estimating central subspaces via inverse third moments. Biometrika, 113–125.

Yin, X. & Bura, E. (2006). Moment based dimension reduction for multivariate response regression. J. Statist. Plan. Inf., 3675–3688.

Yin, X. & Cook, R. D. (2002). Dimension reduction for the conditional k-th moment in regression. J. R. Statist. Soc. B, 159–176.

Yin, X., Li, B. & Cook, R. D. (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Mult. Anal., 1733–1757.

Yin, X. & Hilafu, H. (2015). Sequential sufficient dimension reduction for large p, small n problems. J. R. Statist. Soc. B, 879–892.

Yu, Z., Dong, Y. & Shao, J. (2016). On marginal sliced inverse regression for ultrahigh dimensional model-free feature selection. Ann. Statist., 2594–2623.

Zhu, H., Chen, Y., Ibrahim, J., Li, Y., Hall, C. & Li, W. (2009). Intrinsic regression models for positive-definite matrices with applications to diffusion tensor imaging. J. Am. Statist. Assoc., 1203–1212.

Zhu, L., Zhu, L.-X. & Wen, S. (2010). On dimension reduction in regressions with multivariate responses. Statist. Sinica, 1291–1307.
Table 1: The averages of r and ρ for the estimation of S_{Y|X} based on 100 simulation runs. For each p, the averages of r and ρ are listed in the first and second rows, respectively; within each model, the four columns correspond to n = 100, 200, 300, 400.

              Model I                     Model II                    Model III
Case i
p = 10   0.979 0.988 0.993 0.995    0.988 0.996 0.997 0.998    0.987 0.995 0.996 0.997
         0.973 0.984 0.990 0.993    0.988 0.994 0.996 0.967    0.987 0.994 0.995 0.997
p = 20   0.947 0.976 0.976 0.989    0.971 0.991 0.994 0.995    0.966 0.986 0.991 0.994
         0.943 0.970 0.970 0.985    0.970 0.989 0.992 0.993    0.873 0.986 0.990 0.993
p = 30   0.908 0.960 0.976 0.982    0.961 0.987 0.990 0.992    0.945 0.977 0.986 0.989
         0.917 0.955 0.970 0.976    0.964 0.985 0.988 0.990    0.959 0.978 0.985 0.989
Case ii
p = 10   0.990 0.995 0.997 0.998    0.255 0.253 0.234 0.278    0.379 0.369 0.382 0.377
         0.989 0.994 0.996 0.997    0.122 0.092 0.078 0.085    0.231 0.178 0.166 0.156
p = 20   0.976 0.988 0.993 0.995    0.136 0.125 0.123 0.134    0.177 0.188 0.188 0.196
         0.978 0.989 0.992 0.994    0.092 0.053 0.038 0.039    0.163 0.097 0.078 0.068
p = 30   0.956 0.982 0.988 0.991    0.094 0.085 0.084 0.094    0.118 0.122 0.126 0.134
         0.967 0.982 0.987 0.991    0.085 0.048 0.032 0.028    0.147 0.081 0.060 0.049

Table 2: The averages of ρ for the estimation of G_{Y|X} based on 100 simulation runs; within each model, the four columns correspond to n = 100, 200, 300, 400.

              Model II (case ii)          Model III (case ii)
p = 10   0.948 0.952 0.952 0.952    0.818 0.828 0.827 0.828
p = 20   0.952 0.952 0.951 0.952    0.823 0.834 0.828 0.829
p = 30   0.952 0.951 0.952 0.952    0.827 0.826 0.827 0.827

Table 3: The number of correct estimates of d among 100 simulation runs; within each model, the four columns correspond to n = 100, 200, 300, 400.

              Model I (case i)     Model I (case ii)    Model II (case i)    Model III (case i)
p = 10   100 100 100 100      100 100 100 100      100 100 100 100      100 100 100 100
p = 20   100 100 100 100      100 100 100 100      100 100 100 100      100 100 100 100
p = 30   100 100 100 100      100 100 100 100      100 100 100 100       99 100 100 100

Figure 4: The first row consists of the scatter plots for the training data via projective resampling based sliced inverse regression (SIR), sliced average variance estimation (SAVE) and directional regression (DR), respectively. The second row consists of the scatter plots based on our linear proposal with Euclidean distance, locally linear embedding (LLE) and Isomap, respectively. The third row consists of the scatter plots based on our nonlinear proposal with Euclidean distance, LLE and Isomap, respectively. (Red: digit 0; green: digit 8; blue: digit 9.)

Figure 5: The first row consists of the perspective plots of the first two sufficient predictors for the testing data by projective resampling based SIR, SAVE and DR, respectively. The second row consists of the perspective plots for the testing data based on our linear proposal with Euclidean distance, LLE and Isomap, respectively. The third row consists of the perspective plots for our nonlinear proposal with Euclidean distance, LLE and Isomap. (Red: digit 0; green: digit 8; blue: digit 9.)

Supplement to "Fréchet Sufficient Dimension Reduction for Random Objects"

Abstract
This Supplementary Material includes the following topics: A. Additional results for the simulation examples; B. Additional results for the application to the handwritten digits data; C. Detailed proofs of the technical results.
A Additional Simulation Studies
In addition to Models I–III adopted in the main paper, we consider a new model here.

Model IV. Y = (cos(ε) sin(f₁(X)) sin(f₂(X)), cos(ε) sin(f₁(X)) cos(f₂(X)), cos(ε) cos(f₁(X)), sin(ε)). For Model IV, the response is a 4-dimensional vector whose fourth coordinate can be viewed as a noise term that does not contain any valid information. Moreover, Y ∈ Ω, where Ω is the unit sphere in R⁴, and the metric on Ω is the geodesic distance d(Y, Y′) = arccos(YᵀY′). Generate ε from a centered normal distribution with small variance. We consider case (i), where f₁(X) = β₁ᵀX and f₂(X) = β₂ᵀX with β₁ and β₂ sparse p-vectors whose nonzero entries occupy the first two and the last two coordinates, respectively, and case (ii), where f₁(X) and f₂(X) are scaled versions of x₁² + x₂² and x²_{p−1} + x²_p, respectively, for both linear and nonlinear Fréchet sufficient dimension reduction.

We design the following scenarios for the predictors in Models I–IV; a simulation sketch is given after the list of scenarios.

Scenario 1. X is generated from the multivariate normal distribution N(α, I_p), where α ∼ U[0, 1]^p. The results for the four models, including linear Fréchet sufficient dimension reduction, nonlinear Fréchet sufficient dimension reduction and order determination, are presented in Tables A.1–A.3.

Scenario 2. X is generated from the multivariate normal distribution N(α, Σ), where α ∼ U[0, 1]^p and Σ = {(σ_{ij})_{p×p} : σ_{ij} = 0.5^{|i−j|}}. The results of the four models under Scenario 2 are summarized in Tables A.4–A.6.

Scenario 3. Each x_i is generated from the Poisson distribution P(λ) with λ = 1 for i = 1, …, p, with x_i and x_j independent of each other. The results of the four models under Scenario 3 are presented in Tables A.7–A.9.

Scenario 4. Each x_i is generated from the exponential distribution Exp(λ) with λ = 1 for i = 1, …, p, with x_i and x_j independent of each other. The results of the four models under Scenario 4 are presented in Tables A.10–A.12.

We see from these tables that our proposal gives quite promising results for Fréchet sufficient dimension reduction and order determination. When the weighted inverse regression ensemble method fails to work with a symmetric regression function, its nonlinear extension always provides a good remedy. Our proposed methods, along with the asymptotic theories, are robust to different model settings, except for order determination in case (ii) of Model I under Scenario 2.
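For concreteness, the four predictor designs can be simulated as in the following sketch; the function name and defaults are illustrative, and the AR(1) parameter follows the covariance stated above.

```python
import numpy as np

def make_predictors(n, p, scenario=1, rng=None):
    """Generate predictors under the four supplementary scenarios (a sketch)."""
    rng = np.random.default_rng(rng)
    if scenario == 1:
        # N(alpha, I_p) with alpha ~ U[0,1]^p
        return rng.uniform(0, 1, p) + rng.standard_normal((n, p))
    if scenario == 2:
        # N(alpha, Sigma) with sigma_ij = 0.5^{|i-j|}
        Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
        return rng.uniform(0, 1, p) + rng.multivariate_normal(np.zeros(p), Sigma, n)
    if scenario == 3:
        return rng.poisson(1.0, (n, p)).astype(float)   # independent Poisson(1)
    return rng.exponential(1.0, (n, p))                 # independent Exp(1)
```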
Table 5: The means of r and ρ for estimating S_{Y|X} among 100 repetitions under Scenario 1. For each p, the first and second rows give r and ρ; within each block, the four columns correspond to n = 100, 200, 300, 400.
100 200 300 40010 0.912 0.938 0.955 0.967 10 0.869 0.923 0.954 0.9610.891 0.917 0.938 0.952 0.872 0.917 0.947 0.95520 0.790 0.876 0.910 0.927 20 0.771 0.869 0.911 0.9190.787 0.851 0.884 0.905 0.798 0.862 0.904 0.90930 0.707 0.817 0.858 0.888 30 0.688 0.830 0.854 0.8940.731 0.799 0.827 0.861 0.752 0.839 0.849 0.884Model II(case i) Model II(case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.915 0.963 0.974 0.982 10 0.471 0.510 0.529 0.5310.915 0.953 0.966 0.974 0.375 0.392 0.421 0.39220 0.827 0.918 0.946 0.961 20 0.387 0.439 0.451 0.4670.866 0.914 0.936 0.951 0.372 0.361 0.338 0.38130 0.766 0.874 0.912 0.937 30 0.314 0.382 0.413 0.4240.850 0.886 0.909 0.929 0.328 0.337 0.339 0.326Model III (case i) Model III (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.891 0.946 0.971 0.976 10 0.538 0.577 0.586 0.5840.895 0.941 0.966 0.973 0.476 0.508 0.512 0.51420 0.810 0.898 0.924 0.941 20 0.401 0.464 0.463 0.4930.843 0.901 0.919 0.934 0.453 0.466 0.445 0.46530 0.735 0.856 0.900 0.922 30 0.333 0.391 0.427 0.4440.813 0.870 0.902 0.919 0.424 0.426 0.446 0.438Model IV (case i) Model IV (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.920 0.949 0.966 0.978 10 0.591 0.607 0.622 0.6270.921 0.945 0.962 0.974 0.430 0.433 0.451 0.44820 0.803 0.877 0.934 0.944 20 0.431 0.484 0.501 0.5200.832 0.879 0.930 0.938 0.387 0.379 0.373 0.38830 0.733 0.840 0.888 0.911 30 0.358 0.419 0.449 0.4540.806 0.856 0.890 0.907 0.365 0.356 0.371 0.366Table 6: *The average r and ρ are listed in the first and second rows for each p .30able 7: The means of ρ for estimating G Y | X among repetitions with scenario 1. Model II (case ii) Model III (case ii) Model IV (case ii) ( p, n )
100 200 300 400 100 200 300 400 100 200 300 40010 0.986 0.957 0.957 0.957 0.843 0.840 0.839 0.844 0.940 0.940 0.942 0.94120 0.954 0.958 0.956 0.956 0.843 0.835 0.837 0.839 0.941 0.942 0.942 0.94330 0.956 0.956 0.955 0.955 0.840 0.838 0.834 0.838 0.940 0.942 0.943 0.941Table 8:
The number of correct estimates of d among 100 repetitions under Scenario 1.
100 200 300 400 ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 100 100 100 100 10 41 59 63 70 10 85 96 99 9920 100 100 100 100 20 22 38 40 53 20 78 96 100 10030 100 100 100 100 30 18 31 36 43 30 23 70 94 100Model III (case i) Model IV (case i) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 38 45 48 60 10 74 90 87 9020 10 21 29 35 20 49 72 82 8330 2 16 27 32 30 37 65 70 6731able 9:
The means of r and ρ for estimating S_{Y|X} among 100 repetitions under Scenario 2.
100 200 300 400 ( p, n )
100 200 300 40010 0.854 0.913 0.933 0.947 10 0.840 0.880 0.931 0.9370.862 0.909 0.929 0.943 0.872 0.895 0.939 0.94520 0.720 0.823 0.848 0.881 20 0.711 0.781 0.846 0.8810.764 0.830 0.848 0.878 0.791 0.820 0.866 0.89630 0.617 0.766 0.795 0.837 30 0.584 0.767 0.794 0.8360.689 0.783 0.807 0.839 0.716 0.816 0.826 0.856Model II (case i) Model II (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.853 0.920 0.948 0.963 10 0.249 0.258 0.247 0.2500.888 0.930 0.947 0.962 0.131 0.101 0.084 0.08820 0.707 0.853 0.900 0.921 20 0.128 0.141 0.121 0.1580.814 0.881 0.908 0.925 0.097 0.059 0.042 0.05130 0.603 0.753 0.842 0.883 30 0.086 0.086 0.085 0.0990.778 0.827 0.873 0.897 0.100 0.049 0.032 0.029Model III(case i) Model III (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.864 0.916 0.943 0.957 10 0.365 0.383 0.373 0.3530.896 0.930 0.950 0.961 0.243 0.192 0.173 0.15420 0.749 0.830 0.898 0.904 20 0.181 0.189 0.184 0.1820.819 0.862 0.915 0.918 0.160 0.106 0.078 0.06930 0.687 0.794 0.842 0.862 30 0.114 0.123 0.118 0.1220.799 0.840 0.873 0.886 0.144 0.086 0.063 0.050Model IV (case i) Model IV (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.844 0.915 0.946 0.959 10 0.595 0.613 0.631 0.6440.877 0.930 0.953 0.964 0.504 0.478 0.510 0.52620 0.731 0.843 0.881 0.916 20 0.424 0.472 0.505 0.5230.808 0.873 0.901 0.927 0.439 0.435 0.444 0.47330 0.660 0.803 0.828 0.877 30 0.339 0.412 0.445 0.4620.783 0.853 0.864 0.898 0.409 0.415 0.411 0.44332able 10:
The means of ρ for estimating G_{Y|X} among 100 repetitions under Scenario 2.
100 200 300 400 100 200 300 400 100 200 300 40010 0.954 0.951 0.951 0.951 0.824 0.820 0.825 0.826 0.933 0.939 0.938 0.94120 0.952 0.951 0.952 0.952 0.817 0.825 0.825 0.823 0.937 0.939 0.941 0.93930 0.950 0.950 0.951 0.952 0.836 0.821 0.825 0.821 0.935 0.939 0.940 0.939Table 11:
The number of correct estimates of d among 100 repetitions under Scenario 2.
100 200 300 400 ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 100 100 100 100 10 38 38 41 45 10 73 97 100 10020 100 100 100 100 20 12 15 22 35 20 45 93 98 10030 100 100 100 100 30 10 13 17 18 30 6 53 62 96Model III (case i) Model IV (case i) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 73 80 81 84 10 68 81 85 8720 49 60 66 67 20 46 55 64 6830 43 53 58 59 30 31 44 62 6433able 12:
The means of r and ρ for estimating S_{Y|X} among 100 repetitions under Scenario 3.
100 200 300 400 ( p, n )
100 200 300 40010 0.852 0.903 0.918 0.927 10 0.942 0.952 0.961 0.9720.821 0.874 0.889 0.899 0.937 0.942 0.953 0.96520 0.756 0.843 0.846 0.874 20 0.869 0.915 0.928 0.9450.734 0.802 0.805 0.834 0.877 0.908 0.922 0.93530 0.659 0.757 0.805 0.832 30 0.846 0.880 0.898 0.9100.669 0.720 0.758 0.786 0.870 0.882 0.890 0.899Model II (case i) Model II (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.932 0.965 0.974 0.981 10 0.562 0.562 0.582 0.5700.925 0.957 0.966 0.975 0.584 0.568 0.571 0.55220 0.852 0.927 0.950 0.967 20 0.504 0.523 0.517 0.5280.877 0.922 0.940 0.958 0.566 0.572 0.566 0.54830 0.776 0.888 0.926 0.945 30 0.447 0.502 0.513 0.5130.853 0.892 0.919 0.936 0.565 0.565 0.555 0.553Model III (case i) Model III (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.981 0.990 0.995 0.996 10 0.568 0.577 0.578 0.5810.981 0.989 0.993 0.995 0.668 0.653 0.646 0.64920 0.954 0.979 0.986 0.991 20 0.497 0.522 0.520 0.5250.962 0.979 0.985 0.989 0.663 0.638 0.637 0.65330 0.928 0.969 0.979 0.984 30 0.476 0.499 0.512 0.5070.951 0.971 0.978 0.983 0.657 0.647 0.648 0.639Model IV (case i) Model IV (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.984 0.992 0.995 0.996 10 0.642 0.657 0.659 0.6540.984 0.991 0.994 0.996 0.619 0.605 0.605 0.59420 0.959 0.982 0.987 0.991 20 0.562 0.574 0.571 0.5690.966 0.981 0.986 0.990 0.611 0.598 0.581 0.56930 0.928 0.969 0.981 0.985 30 0.517 0.532 0.542 0.5360.951 0.971 0.980 0.984 0.596 0.571 0.571 0.55034able 13:
The means of ρ for estimating G_{Y|X} among 100 repetitions under Scenario 3.
100 200 300 400 100 200 300 400 100 200 300 40010 0.886 0.888 0.889 0.8911 0.773 0.780 0.780 0.776 0.913 0.915 0.918 0.91420 0.881 0.885 0.893 0.8848 0.770 0.774 0.777 0.774 0.917 0.915 0.918 0.91630 0.901 0.892 0.886 0.8914 0.777 0.773 0.774 0.777 0.915 0.918 0.918 0.918Table 14:
The number of correct estimates of d among 100 repetitions under Scenario 3.
100 200 300 400 ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 100 100 100 100 10 100 100 100 100 10 90 99 98 9920 100 100 100 100 20 100 100 100 100 20 76 95 99 10030 100 100 100 100 30 97 98 100 100 30 28 81 96 100Model III (case i) Model IV (case i) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 98 99 100 100 10 100 100 100 10020 98 97 98 100 20 100 100 100 10030 96 96 99 99 30 100 100 100 10035able 15:
The means of r and ρ for estimating S_{Y|X} among 100 repetitions under Scenario 4.
100 200 300 400 ( p, n )
100 200 300 40010 0.792 0.830 0.865 0.898 10 0.908 0.932 0.931 0.9320.775 0.802 0.835 0.869 0.909 0.929 0.925 0.92520 0.682 0.755 0.758 0.787 20 0.839 0.878 0.884 0.8900.670 0.713 0.710 0.743 0.859 0.876 0.876 0.87930 0.608 0.687 0.706 0.739 30 0.809 0.838 0.845 0.8740.618 0.654 0.660 0.688 0.851 0.842 0.836 0.863Model II (case i) Model II (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.907 0.947 0.966 0.975 10 0.580 0.589 0.585 0.5850.913 0.942 0.957 0.967 0.640 0.616 0.610 0.60120 0.808 0.906 0.937 0.955 20 0.505 0.531 0.531 0.5360.853 0.904 0.929 0.944 0.604 0.613 0.612 0.59830 0.725 0.866 0.908 0.935 30 0.459 0.500 0.510 0.5240.825 0.879 0.904 0.925 0.581 0.598 0.600 0.605Model III (case i) Model III (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.970 0.984 0.989 0.992 10 0.620 0.634 0.652 0.6600.974 0.984 0.988 0.991 0.710 0.713 0.716 0.71320 0.935 0.969 0.981 0.984 20 0.535 0.551 0.561 0.5670.952 0.971 0.980 0.983 0.711 0.709 0.694 0.70030 0.910 0.954 0.972 0.978 30 0.501 0.528 0.536 0.5320.946 0.960 0.973 0.977 0.713 0.707 0.693 0.686Model IV (case i) Model IV (case ii) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 0.99 0.982 0.990 0.992 10 0.687 0.702 0.697 0.7000.974 0.982 0.989 0.991 0.675 0.656 0.653 0.65220 0.931 0.972 0.981 0.984 20 0.587 0.606 0.599 0.6120.950 0.974 0.981 0.983 0.665 0.639 0.622 0.62930 0.909 0.956 0.971 0.979 30 0.537 0.554 0.559 0.5580.945 0.962 0.972 0.978 0.662 0.634 0.622 0.60736able 16:
The means of ρ for estimating G_{Y|X} among 100 repetitions under Scenario 4.
100 200 300 400 100 200 300 400 100 200 300 40010 0.826 0.835 0.826 0.828 0.775 0.771 0.779 0.774 0.893 0.897 0.899 0.90220 0.825 0.828 0.826 0.831 0.892 0.897 0.898 0.898 0.777 0.779 0.779 0.77730 0.829 0.822 0.828 0.822 0.775 0.773 0.781 0.778 0.893 0.897 0.899 0.899Table 17:
The number of correct estimates of d among 100 repetitions under Scenario 4.
100 200 300 400 ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 100 100 100 100 10 98 99 100 100 10 84 100 99 10020 100 100 100 100 20 98 98 100 100 20 80 99 100 10030 100 100 100 100 30 94 99 100 100 30 29 92 99 100Model III (case i) Model IV (case i) ( p, n )
100 200 300 400 ( p, n )
100 200 300 40010 100 99 100 99 10 100 100 100 10020 96 100 100 100 20 99 100 100 10030 94 94 95 98 30 97 100 100 10037
B Additional Results for the Handwritten Digits Data
B.1 Application to Handwritten Digit Classes

To further investigate the performance of our proposals and demonstrate their use in real applications, we extract gray-scale images of three handwritten digit classes, shown in Figure 6, from the UCI Machine Learning Repository. This dataset contains a training group and a testing group, and each digit is represented by a pixel image. The bottom part of each image is taken as the predictors X, and the upper half is set as the response Y.

Figure 6: The first row consists of the responses Y, the upper halves of the image digits; the second row consists of the predictors X, the bottom halves of the image digits; the third row consists of the whole image digits.

For these image digits, we also include the projective resampling approach in combination with three classical methods, sliced inverse regression, sliced average variance estimation and directional regression, for comparison. We adopt three different distance metrics for our proposals: the Euclidean distance, the distance metric learned by locally linear embedding (Roweis & Saul (2000)), and the distance metric learned by the Isomap approach (Tenenbaum et al. (2000)). Similar to the conclusions drawn from the application in the main paper, we again find that our proposals provide valid and useful information for classification, as seen from Figures 7 and 8, especially for the nonlinear approach in combination with Isomap.

Figure 7: The first row consists of the scatter plots for the training data via projective resampling based SIR, SAVE and DR, respectively. The second row consists of the scatter plots based on our linear proposal with Euclidean distance, LLE and Isomap, respectively. The third row consists of the scatter plots based on our nonlinear proposal with Euclidean distance, LLE and Isomap, respectively. (Red, green and blue indicate the three digit classes.)

Figure 8: The first row consists of the perspective plots of the first two sufficient predictors for the testing data by SIR, SAVE and DR, respectively. The second row consists of the perspective plots for the testing data based on our linear proposal with Euclidean distance, LLE and Isomap, respectively. The third row consists of the perspective plots for our nonlinear proposal with Euclidean distance, LLE and Isomap, respectively. (Red, green and blue indicate the three digit classes.)

B.2 Structural Dimension Determination

Figure 9: Ladle plots for the two groups of digit classes. The vertical axis in panels (a) and (b) represents a combination of the measures based on the eigenvalues and eigenvectors, g_n(k), plotted against k, for the two digit groups, respectively.

For the handwritten digits data, we apply the ladle estimator with the distance metric learned by the Isomap method. Figure 9 displays the ladle plots for the two groups of digit classes, respectively. We find that in both cases the ladle estimator yields d̂ = 3 or a nearby value. From the scatter plots accumulated above, we know that the first two sufficient predictors already provide useful information for separating the image digits.

C Proofs of Theoretical Results
C.1 Proof of Proposition 1
Proposition 6 (Proposition 1 of the main paper). Λ is positive semidefinite. Assume the linearity condition holds true; then Span{Σ^{-1}Λ} ⊆ S_{Y|X}.

Proof.
To prove the proposition, we use the fact that there exist a separable R-Hilbert space H and a mapping φ: Ω → H such that, for all Y, Y′ ∈ Ω, d(Y, Y′) = ‖φ(Y) − φ(Y′)‖²_H, as shown by Schoenberg (1937, 1938). Let β_φ(μ) = Eφ(Y). For any a ∈ R^p with a ≠ (0, …, 0)ᵀ, we have

aᵀΛa = −E{⟨aᵀ(X − EX), aᵀ(X′ − EX′)⟩ d(Y, Y′)}
     = −E{⟨aᵀ(X − EX), aᵀ(X′ − EX′)⟩ ‖φ(Y) − φ(Y′)‖²}
     = −E{⟨aᵀ(X − EX), aᵀ(X′ − EX′)⟩ ⟨φ(Y) − φ(Y′), φ(Y) − φ(Y′)⟩}
     = −E{⟨aᵀ(X − EX), aᵀ(X′ − EX′)⟩ ⟨φ(Y) − β_φ(μ) + β_φ(μ) − φ(Y′), φ(Y) − β_φ(μ) + β_φ(μ) − φ(Y′)⟩}
     = 2E{⟨aᵀ(X − EX), aᵀ(X′ − EX′)⟩ ⟨φ(Y) − β_φ(μ), φ(Y′) − β_φ(μ)⟩}
     = 2‖E[{aᵀ(X − EX)}{φ(Y) − β_φ(μ)}]‖²_H ≥ 0.

Therefore, Λ is positive semidefinite. By double expectation,

−Σ^{-1}E{(X − EX)(X′ − EX′)ᵀ d(Y, Y′)} = −Σ^{-1}E[E{X − E(X) | Y} E{X′ − E(X′) | Y′}ᵀ d(Y, Y′)].

By the property of sliced inverse regression, Σ^{-1}E{X − E(X) | Y} ∈ S_{Y|X}, so Span{Σ^{-1}Λ} ⊆ S_{Y|X}. The proof is completed.

C.2 Proof of Theorem 1

Theorem 4 (Theorem 1 of the main paper).
C.2 Proof of Theorem 1

Theorem 4. Assume the linearity condition holds and that the singular values $\lambda_\ell$, $\ell=1,\ldots,d$, are distinct. In addition, assume that $Ed(Y,Y')<\infty$ and that $X$ has finite fourth moment. Then
\[
n^{1/2}(\hat\beta_\ell-\beta_\ell)\xrightarrow{\;D\;}N(0_p,\Sigma_\ell) \tag{C.10}
\]
as $n\to\infty$, where $\Sigma_\ell=\mathrm{cov}\{\Upsilon_\ell(X,Y)\}$.

The following lemmas are needed before we prove Theorem 1. Let $E(X)=\mu$, $\bar X=n^{-1}\sum_{i=1}^nX_i$ and $\hat\Sigma=n^{-1}\sum_{i=1}^n(X_i-\bar X)(X_i-\bar X)^T$. Lemma 1 provides the asymptotic expansion of $\hat\Sigma$; its proof is standard and thus omitted.

Lemma 1. Assume $X$ has finite fourth moment. Then
\[
\hat\Sigma-\Sigma=\frac{1}{n}\sum_{i=1}^n\Gamma(X_i)+o_P(n^{-1/2}), \tag{C.11}
\]
where $\Gamma(X_i)=(X_i-\mu)(X_i-\mu)^T-\Sigma$.

Let
\[
\Lambda=-E\{(X-\mu)(X'-\mu)^Td(Y,Y')\}
\quad\text{and}\quad
\hat\Lambda=-\frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}(X_i-\bar X)(X_j-\bar X)^Td(Y_i,Y_j). \tag{C.12}
\]
Lemma 2 provides the asymptotic expansion of $\hat\Lambda$.

Lemma 2. Assume $X$ has finite fourth moment. Then
\[
\hat\Lambda-\Lambda=\frac{1}{n}\sum_{i=1}^n\Theta(X_i,Y_i)+o_P(n^{-1/2}), \tag{C.13}
\]
where the exact form of $\Theta(X_i,Y_i)$ is provided in the proof.

Proof. First we decompose $\hat\Lambda$ in (C.12) as $\hat\Lambda=\hat U^{(1)}+\hat U^{(2)}+\hat U^{(3)}+\hat U^{(4)}$, where, with $\hat\mu=\bar X$,
\[
\begin{aligned}
\hat U^{(1)}&=-\frac{1}{n(n-1)}\sum_{i\ne j}(X_i-\mu)(X_j-\mu)^Td(Y_i,Y_j), &
\hat U^{(2)}&=\frac{1}{n(n-1)}\sum_{i\ne j}(\hat\mu-\mu)(X_j-\mu)^Td(Y_i,Y_j),\\
\hat U^{(3)}&=\frac{1}{n(n-1)}\sum_{i\ne j}(X_i-\mu)(\hat\mu-\mu)^Td(Y_i,Y_j), &
\hat U^{(4)}&=-\frac{1}{n(n-1)}\sum_{i\ne j}(\hat\mu-\mu)(\hat\mu-\mu)^Td(Y_i,Y_j).
\end{aligned} \tag{C.14}
\]
Let $\Lambda^{(1)}(X_i,Y_i,X_j,Y_j)=-(X_i-\mu)(X_j-\mu)^Td(Y_i,Y_j)$ and denote $\Lambda^{(1)}_1(x,y)=E\{\Lambda^{(1)}(X,Y,x,y)\}$. For the first term, we have
\[
\hat U^{(1)}-\Lambda=\frac{1}{n(n-1)}\sum_{i\ne j}\{\Lambda^{(1)}(X_i,Y_i,X_j,Y_j)-\Lambda\}
=\frac{1}{n}\sum_{i=1}^n\{\Lambda^{(1)}_1(X_i,Y_i)-\Lambda\}+\frac{1}{n(n-1)}\sum_{i\ne j}A(X_i,Y_i,X_j,Y_j), \tag{C.15}
\]
where $A(X_i,Y_i,X_j,Y_j)=\Lambda^{(1)}(X_i,Y_i,X_j,Y_j)-\Lambda^{(1)}_1(X_i,Y_i)$.

By a direct calculation,
\[
E\Big\|\frac{1}{n(n-1)}\sum_{i\ne j}A(X_i,Y_i,X_j,Y_j)\Big\|_F^2
=\frac{1}{\{n(n-1)\}^2}\sum_{i\ne j}\sum_{k\ne t}\mathrm{tr}\,E\{A(X_i,Y_i,X_j,Y_j)A(X_k,Y_k,X_t,Y_t)^T\}.
\]
The terms with $\{i,j\}\cap\{k,t\}=\varnothing$ vanish because the two factors are independent with mean zero; the terms sharing exactly one index reduce to $\mathrm{tr}\,E\{E(A(X,Y,X',Y')\mid X,Y)\,E(A(X,Y,X'',Y'')\mid X,Y)^T\}$, which also vanishes; and each of the $O(n^2)$ remaining terms is bounded by $\mathrm{tr}\,E\{A(X,Y,X',Y')A(X,Y,X',Y')^T\}<\infty$. Hence the expectation above is $O(n^{-2})$, so that $\{n(n-1)\}^{-1}\sum_{i\ne j}A(X_i,Y_i,X_j,Y_j)=O_p(n^{-1})$.

Let $\vartheta=E\{(X-\mu)d(Y,Y')\}$. Note that $\{n(n-1)\}^{-1}\sum_{i\ne j}(X_j-\mu)d(Y_i,Y_j)\xrightarrow{P}\vartheta$. It follows that
\[
\hat U^{(2)}=\frac{1}{n}\sum_{i=1}^n(X_i-\mu)\vartheta^T+o_P(n^{-1/2}). \tag{C.16}
\]
Similarly,
\[
\hat U^{(3)}=\frac{1}{n}\sum_{i=1}^n\vartheta(X_i-\mu)^T+o_P(n^{-1/2}), \tag{C.17}
\]
and $\hat U^{(4)}=o_P(n^{-1/2})$. Combining (C.15), (C.16) and (C.17) leads to (C.13) with
\[
\Theta(X_i,Y_i)=\Lambda^{(1)}_1(X_i,Y_i)-\Lambda+(X_i-\mu)\vartheta^T+\vartheta(X_i-\mu)^T.
\]

We now turn to the proof of Theorem 1. By algebraic calculation and Lemma A.2 of Cook & Ni (2005), which gives $\hat\Sigma^{-1}-\Sigma^{-1}=-\Sigma^{-1}(\hat\Sigma-\Sigma)\Sigma^{-1}+o_p(n^{-1/2})$, we have
\[
\hat M-M=\hat\Sigma^{-1}\hat\Lambda-\Sigma^{-1}\Lambda
=(\hat\Sigma^{-1}-\Sigma^{-1})\Lambda+\Sigma^{-1}(\hat\Lambda-\Lambda)+O_p(n^{-1})
=\frac{1}{n}\sum_{i=1}^nH(X_i,Y_i)+o_p(n^{-1/2}),
\]
where, by (C.11) and (C.13),
\[
H(X,Y)=-\Sigma^{-1}\Gamma(X)\Sigma^{-1}\Lambda+\Sigma^{-1}\Theta(X,Y). \tag{C.18}
\]
Simple calculations lead to
\[
\hat M\hat M^T-MM^T
=\Sigma^{-1}(\hat\Lambda-\Lambda)\Lambda^T\Sigma^{-1}+(\hat\Sigma^{-1}-\Sigma^{-1})\Lambda\Lambda^T\Sigma^{-1}
+\Sigma^{-1}\Lambda\Lambda^T(\hat\Sigma^{-1}-\Sigma^{-1})+\Sigma^{-1}\Lambda(\hat\Lambda-\Lambda)^T\Sigma^{-1}+o_p(n^{-1/2}).
\]
The $\lambda_\ell$ and $\beta_\ell$ satisfy the singular value decomposition equation $MM^T\beta_\ell=\lambda_\ell^2\beta_\ell$, $\ell=1,\ldots,p$; hence
\[
\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}\beta_\ell=\lambda_\ell^2\beta_\ell,\qquad \ell=1,\ldots,p,
\]
where $\beta_\ell^T\beta_\ell=1$ and $\beta_\ell^T\beta_k=0$ for $\ell\ne k$. Similarly, at the sample level,
\[
\hat\Sigma^{-1}\hat\Lambda\hat\Lambda^T\hat\Sigma^{-1}\hat\beta_\ell=\hat\lambda_\ell^2\hat\beta_\ell,\qquad \ell=1,\ldots,p,
\]
where $\hat\beta_\ell^T\hat\beta_\ell=1$ and $\hat\beta_\ell^T\hat\beta_k=0$ for $\ell\ne k$. The sample-level singular value decomposition then implies that
\[
\begin{aligned}
&\Sigma^{-1}(\hat\Lambda-\Lambda)\Lambda^T\Sigma^{-1}\beta_\ell+(\hat\Sigma^{-1}-\Sigma^{-1})\Lambda\Lambda^T\Sigma^{-1}\beta_\ell
+\Sigma^{-1}\Lambda\Lambda^T(\hat\Sigma^{-1}-\Sigma^{-1})\beta_\ell+\Sigma^{-1}\Lambda(\hat\Lambda-\Lambda)^T\Sigma^{-1}\beta_\ell
+\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}(\hat\beta_\ell-\beta_\ell)\\
&\qquad=\lambda_\ell(\hat\lambda_\ell-\lambda_\ell)\beta_\ell+(\hat\lambda_\ell-\lambda_\ell)\lambda_\ell\beta_\ell
+\lambda_\ell^2(\hat\beta_\ell-\beta_\ell)+o_p(n^{-1/2}),
\end{aligned} \tag{C.19}
\]
for $\ell=1,\ldots,p$.

Multiplying both sides of (C.19) by $\beta_\ell^T$ from the left, we get
\[
\beta_\ell^T\big[\Sigma^{-1}(\hat\Lambda-\Lambda)\Lambda^T\Sigma^{-1}+(\hat\Sigma^{-1}-\Sigma^{-1})\Lambda\Lambda^T\Sigma^{-1}
+\Sigma^{-1}\Lambda\Lambda^T(\hat\Sigma^{-1}-\Sigma^{-1})+\Sigma^{-1}\Lambda(\hat\Lambda-\Lambda)^T\Sigma^{-1}\big]\beta_\ell
=2\lambda_\ell(\hat\lambda_\ell-\lambda_\ell)+o_p(n^{-1/2}),
\]
which further suggests that
\[
\hat\lambda_\ell=\lambda_\ell+\frac{\beta_\ell^T}{2\lambda_\ell}\big[\Sigma^{-1}(\hat\Lambda-\Lambda)\Lambda^T\Sigma^{-1}
+(\hat\Sigma^{-1}-\Sigma^{-1})\Lambda\Lambda^T\Sigma^{-1}
+\Sigma^{-1}\Lambda\Lambda^T(\hat\Sigma^{-1}-\Sigma^{-1})+\Sigma^{-1}\Lambda(\hat\Lambda-\Lambda)^T\Sigma^{-1}\big]\beta_\ell+o_p(n^{-1/2}). \tag{C.20}
\]
By Lemma A.2 of Cook & Ni (2005), $\hat\Sigma^{-1}-\Sigma^{-1}=-\Sigma^{-1}(\hat\Sigma-\Sigma)\Sigma^{-1}+o_p(n^{-1/2})$, and therefore
\[
\begin{aligned}
\hat\lambda_\ell
&=\lambda_\ell+\frac{\beta_\ell^T}{2\lambda_\ell}\big[\Sigma^{-1}(\hat\Lambda-\Lambda)\Lambda^T\Sigma^{-1}
-\Sigma^{-1}(\hat\Sigma-\Sigma)\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}
-\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}(\hat\Sigma-\Sigma)\Sigma^{-1}
+\Sigma^{-1}\Lambda(\hat\Lambda-\Lambda)^T\Sigma^{-1}\big]\beta_\ell+o_p(n^{-1/2})\\
&=\lambda_\ell+\frac{1}{n}\sum_{i=1}^nC_{i,\lambda_\ell}+o_p(n^{-1/2}),
\end{aligned}
\]
where
\[
C_{i,\lambda_\ell}=\frac{\beta_\ell^T}{2\lambda_\ell}\big[\Sigma^{-1}\Theta(X_i,Y_i)\Lambda^T\Sigma^{-1}
-\Sigma^{-1}\Gamma(X_i)\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}
-\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}\Gamma(X_i)\Sigma^{-1}
+\Sigma^{-1}\Lambda\Theta(X_i,Y_i)^T\Sigma^{-1}\big]\beta_\ell.
\]

Now we turn to the expansion of $\hat\beta_\ell$. Since $(\beta_1,\ldots,\beta_p)$ is a basis of $\mathbb R^p$, there exist $c_{\ell j}$, $j=1,\ldots,p$, such that $\hat\beta_\ell-\beta_\ell=\sum_{j=1}^pc_{\ell j}\beta_j$ with $c_{\ell j}=O_p(n^{-1/2})$. We derive the explicit form of the $c_{\ell j}$ in the next step. Note that (C.19) can be rewritten as
\[
\begin{aligned}
(\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}-\lambda_\ell^2I_p)\sum_{j=1}^pc_{\ell j}\beta_j
&=\lambda_\ell(\hat\lambda_\ell-\lambda_\ell)\beta_\ell+(\hat\lambda_\ell-\lambda_\ell)\lambda_\ell\beta_\ell
+\big[\Sigma^{-1}(\hat\Lambda-\Lambda)\Lambda^T\Sigma^{-1}
-\Sigma^{-1}(\hat\Sigma-\Sigma)\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}\\
&\qquad-\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}(\hat\Sigma-\Sigma)\Sigma^{-1}
+\Sigma^{-1}\Lambda(\hat\Lambda-\Lambda)^T\Sigma^{-1}\big]\beta_\ell+o_p(n^{-1/2})\\
&=\lambda_\ell(\hat\lambda_\ell-\lambda_\ell)\beta_\ell+(\hat\lambda_\ell-\lambda_\ell)\lambda_\ell\beta_\ell
+\frac{1}{n}\sum_{i=1}^n\zeta_\ell(X_i,Y_i)\beta_\ell+o_p(n^{-1/2}),
\end{aligned} \tag{C.21}
\]
where
\[
\zeta_\ell(X_i,Y_i)=\Sigma^{-1}\Theta(X_i,Y_i)\Lambda^T\Sigma^{-1}
-\Sigma^{-1}\Gamma(X_i)\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}
-\Sigma^{-1}\Lambda\Lambda^T\Sigma^{-1}\Gamma(X_i)\Sigma^{-1}
+\Sigma^{-1}\Lambda\Theta(X_i,Y_i)^T\Sigma^{-1}.
\]
Multiplying both sides of (C.21) by $\beta_j^T$ ($j\ne\ell$) from the left, we get
\[
c_{\ell j}=\frac{1}{n}\sum_{i=1}^n\frac{\beta_j^T\zeta_\ell(X_i,Y_i)\beta_\ell}{\lambda_j^2-\lambda_\ell^2}+o_p(n^{-1/2}),\qquad j\ne\ell. \tag{C.22}
\]
In addition, $\beta_\ell^T\beta_\ell=\hat\beta_\ell^T\hat\beta_\ell=1$ indicates that
$\sum_{j=1}^pc_{\ell j}\beta_j^T\beta_\ell+\beta_\ell^T\sum_{j=1}^pc_{\ell j}\beta_j=o_p(n^{-1/2})$,
which further implies that $c_{\ell\ell}=o_p(n^{-1/2})$. We define
\[
\Sigma_\ell=\mathrm{cov}\{\Upsilon_\ell(X,Y)\}, \tag{C.23}
\]
where the $p\times1$ random vector $\Upsilon_\ell(X,Y)=\sum_{j=1,j\ne\ell}^p\beta_j\,\dfrac{\beta_j^T\zeta_\ell(X,Y)\beta_\ell}{\lambda_j^2-\lambda_\ell^2}$. Plugging (C.22) into $\hat\beta_\ell-\beta_\ell=\sum_{j=1}^pc_{\ell j}\beta_j$, we get
\[
\hat\beta_\ell=\beta_\ell+\frac{1}{n}\sum_{i=1}^n\Upsilon_\ell(X_i,Y_i)+o_p(n^{-1/2}). \tag{C.24}
\]
The conclusion then follows from the central limit theorem.
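Theorem 4 implies in particular that each estimated direction converges at the parametric root-$n$ rate. The rough simulation sketch below (not from the paper; the single-index model, sample sizes and function names are illustrative) re-estimates the leading direction of $\hat M=\hat\Sigma^{-1}\hat\Lambda$ over replications and reports the scaled error $\sqrt{n}\,\|\hat\beta_1-\beta_1\|$, which should stay roughly stable as $n$ grows if the rate in (C.10) holds.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def wire_first_direction(X, D):
    """First left singular vector of M_hat = Sigma_hat^{-1} Lambda_hat, see (C.12)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Sigma_hat = Xc.T @ Xc / n
    Lambda_hat = -(Xc.T @ D @ Xc) / (n * (n - 1))   # D has zero diagonal, so i = j adds nothing
    M_hat = np.linalg.solve(Sigma_hat, Lambda_hat)
    return np.linalg.svd(M_hat)[0][:, 0]


def scaled_error(n, n_reps=200, seed=0):
    """Average of sqrt(n) * ||beta_hat - beta|| over replications (illustrative)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(6)
    beta[0] = 1.0                                   # true direction e_1
    errs = []
    for _ in range(n_reps):
        X = rng.normal(size=(n, 6))
        Y = np.tanh(X @ beta) + 0.2 * rng.normal(size=n)
        D = squareform(pdist(Y[:, None]))           # d(Y_i, Y_j) = |Y_i - Y_j|
        b = wire_first_direction(X, D)
        b = np.sign(b[0]) * b                       # resolve the sign ambiguity
        errs.append(np.sqrt(n) * np.linalg.norm(b - beta))
    return float(np.mean(errs))


for n in (200, 400, 800):
    print(n, round(scaled_error(n), 2))             # roughly stable across n
```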
C.3 Proof of Theorem 2

Theorem 5.
Assume that $Ed(Y,Y')<\infty$, that $X$ has finite fourth moment, and that Assumptions (1)–(2) hold. Then
\[
\Pr\Big\{\lim_{n\to\infty}\Pr(\hat d=d\mid\mathcal D)=1\Big\}=1,
\]
where $\mathcal D=\{(X_1,Y_1),(X_2,Y_2),\ldots\}$ is a sequence of independent copies of $(X,Y)$.

Proof. The singular values of $\hat M$ are the square roots of the corresponding eigenvalues of the matrix $\hat M\hat M^T$, and the left singular vectors of $\hat M$ are the same as the eigenvectors of $\hat M\hat M^T$. We then apply Theorem 2 in Luo & Li (2016) to get the desired result.

C.4 Proof of Proposition 2

Proposition 7. $\Lambda_{XX'}$ is a bounded, linear and self-adjoint operator. For any $f,g\in\mathcal H_X$,
\[
\langle f,\Lambda_{XX'}g\rangle_{\mathcal H_X}=-E\{(f(X)-Ef(X))(g(X')-Eg(X'))\,d(Y,Y')\}.
\]
Moreover, there exist a separable $\mathbb R$-Hilbert space $\mathbb H$ and a mapping $\phi:\Omega\to\mathbb H$ such that
\[
\langle f,\Lambda_{XX'}f\rangle_{\mathcal H_X}
=2\,\big\|E[(f(X)-Ef(X))(\phi(Y)-E\phi(Y))]\big\|_{\mathbb H}^2
=2\,\|\mathrm{cov}[f(X),\phi(Y)]\|_{\mathbb H}^2.
\]

Proof.
For arbitrary $f,g\in\mathcal H_X$, we have
\[
\begin{aligned}
|\langle f,\Lambda_{XX'}g\rangle_{\mathcal H_X}|
&\le E\big|\big\langle f,\{(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')\}g\big\rangle_{\mathcal H_X}\big|\\
&=E\{|\langle f,\kappa_X(\cdot,X)-\mu_X\rangle_{\mathcal H_X}|\,|\langle\kappa_X(\cdot,X')-\mu_X,g\rangle_{\mathcal H_X}|\,d(Y,Y')\}\\
&\le\|f\|_{\mathcal H_X}\|g\|_{\mathcal H_X}\,E\langle\kappa_X(\cdot,X)-\mu_X,\kappa_X(\cdot,X)-\mu_X\rangle_{\mathcal H_X}\,Ed(Y,Y')\\
&=\|f\|_{\mathcal H_X}\|g\|_{\mathcal H_X}\,\{E\kappa_X(X,X)-\|\mu_X\|_{\mathcal H_X}^2\}\,Ed(Y,Y').
\end{aligned}
\]
Since $\|\mu_X\|_{\mathcal H_X}^2\le\{E\|\kappa_X(\cdot,X)\|_{\mathcal H_X}\}^2=\{E\kappa_X(X,X)^{1/2}\}^2\le E\kappa_X(X,X)<\infty$ and $Ed(Y,Y')<\infty$, $\Lambda_{XX'}$ is a bounded, linear and self-adjoint operator. Next,
\[
\begin{aligned}
\langle f,\Lambda_{XX'}g\rangle_{\mathcal H_X}
&=-E\big\langle f,\{(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')\}g\big\rangle_{\mathcal H_X}\\
&=-E\{\langle f,\kappa_X(\cdot,X)-\mu_X\rangle_{\mathcal H_X}\langle\kappa_X(\cdot,X')-\mu_X,g\rangle_{\mathcal H_X}\,d(Y,Y')\}\\
&=-E\{(f(X)-Ef(X))(g(X')-Eg(X'))\,d(Y,Y')\}.
\end{aligned}
\]
For arbitrary $f\in\mathcal H_X$, we have
\[
\begin{aligned}
\langle f,\Lambda_{XX'}f\rangle_{\mathcal H_X}
&=-\big\langle f,E\{(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')\}f\big\rangle_{\mathcal H_X}\\
&=-E\big\{\big\langle f,\{(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\}f\big\rangle_{\mathcal H_X}\,\|\phi(Y)-\phi(Y')\|_{\mathbb H}^2\big\}\\
&=2E\{(f(X)-Ef(X))(f(X')-Ef(X'))\,\langle\phi(Y)-\beta_\phi(\mu),\phi(Y')-\beta_\phi(\mu)\rangle_{\mathbb H}\}\\
&=2\,\big\|E[(f(X)-Ef(X))(\phi(Y)-\beta_\phi(\mu))]\big\|_{\mathbb H}^2\ \ge0.
\end{aligned}
\]
Therefore $\Lambda_{XX'}$ is a positive semidefinite operator.
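Parallel to the linear case, when $Y$ is scalar and $d(y,y')=(y-y')^2$ one can take $\phi(y)=y$, and the last display reduces to $\langle f,\Lambda_{XX'}f\rangle_{\mathcal H_X}=2\,\mathrm{cov}\{f(X),Y\}^2$. The snippet below is a rough numerical check of this identity for one nonlinear feature $f$; the model, the choice $f(x)=\cos x$ and the variable names are illustrative, and the agreement is only up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(size=n)
Y = X ** 2 + 0.5 * rng.normal(size=n)        # Y depends on X only through X^2

f = lambda x: np.cos(x)                       # an arbitrary feature standing in for f in H_X
fX = f(X)
fXc = fX - fX.mean()

# Quadratic form <f, Lambda_{XX'} f> = -E{(f(X)-Ef(X))(f(X')-Ef(X'))(Y-Y')^2},
# approximated by pairing each observation with a permuted (approximately independent) copy.
perm = rng.permutation(n)
lhs = -np.mean(fXc * fXc[perm] * (Y - Y[perm]) ** 2)

# Closed form 2*cov(f(X), Y)^2 from Proposition 7 with phi(y) = y.
rhs = 2.0 * np.cov(fX, Y, bias=True)[0, 1] ** 2

print(round(lhs, 2), round(rhs, 2))           # the two values should be close
```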
C.5 Proof of Proposition 3

Proposition 8.
Suppose Assumptions (3)–(5) hold. Then $\mathrm{ran}\{\Sigma_{XX}^{-1}\Lambda_{XX'}\}\subseteq\mathcal G_{Y|X}$.

Proof. Firstly, we show that $\mathrm{ran}(\Lambda_{XX'})\subseteq\Sigma_{XX}\mathcal G_{Y|X}$, which is equivalent to $(\Sigma_{XX}\mathcal G_{Y|X})^\perp\subseteq\mathrm{ran}(\Lambda_{XX'})^\perp$. Since $\mathrm{ran}(\Lambda_{XX'})^\perp=\ker(\Lambda_{XX'})$, where $\ker(\Lambda_{XX'})$ denotes the null space of the operator $\Lambda_{XX'}$, it suffices to show that $(\Sigma_{XX}\mathcal G_{Y|X})^\perp\subseteq\ker(\Lambda_{XX'})$. Let $\mathcal G_{\phi(Y)|X}$ denote the central class for the regression of $\phi(Y)$ on $X$. We first establish
\[
(\Sigma_{XX}\mathcal G_{\phi(Y)|X})^\perp\subseteq\ker(\Lambda_{XX'}). \tag{C.25}
\]
Let $f\in(\Sigma_{XX}\mathcal G_{\phi(Y)|X})^\perp$. Then, for all $g\in\mathcal G_{\phi(Y)|X}$, we have $\langle f,\Sigma_{XX}g\rangle_{\mathcal H_X}=\mathrm{cov}\{f(X),g(X)\}=0$. Because $g$ is measurable with respect to $\mathcal M_{\phi(Y)|X}$, we have $g(X)=E[g(X)\mid\mathcal M_{\phi(Y)|X}]$, and
\[
\begin{aligned}
\mathrm{cov}\{f(X),E[g(X)\mid\mathcal M_{\phi(Y)|X}]\}
&=E\{f(X)E[g(X)\mid\mathcal M_{\phi(Y)|X}]\}-E[f(X)]E[g(X)]\\
&=E\{E[f(X)\mid\mathcal M_{\phi(Y)|X}]g(X)\}-E[f(X)]E[g(X)]\\
&=\mathrm{cov}\{E[f(X)\mid\mathcal M_{\phi(Y)|X}],g(X)\}=0.
\end{aligned}
\]
Since $\mathcal G_{\phi(Y)|X}$ is dense in $L_2(P_X\mid\mathcal M_{\phi(Y)|X})$ modulo constants, there exists a sequence $\{f_n\}\subseteq\mathcal G_{\phi(Y)|X}$ such that $\mathrm{var}[f_n(X)-f(X)]\to0$. Then
\[
\mathrm{cov}\{E[f(X)\mid\mathcal M_{\phi(Y)|X}],f_n(X)\}
=E\{E[f(X)\mid\mathcal M_{\phi(Y)|X}]f_n(X)\}-E[f(X)]E[f_n(X)]=0. \tag{C.26}
\]
On the other hand,
\[
\mathrm{cov}\{E[f(X)\mid\mathcal M_{\phi(Y)|X}],f_n(X)\}
\to\mathrm{cov}\{E[f(X)\mid\mathcal M_{\phi(Y)|X}],f(X)\}
=\mathrm{cov}\{E[f(X)\mid\mathcal M_{\phi(Y)|X}],E[f(X)\mid\mathcal M_{\phi(Y)|X}]\}. \tag{C.27}
\]
Combining (C.26) and (C.27), we have $\mathrm{var}\{E[f(X)\mid\mathcal M_{\phi(Y)|X}]\}=0$, which implies that $E[f(X)\mid\mathcal M_{\phi(Y)|X}]$ is constant almost surely. Since $\mathcal M_{\phi(Y)|X}$ is sufficient, we have $E[f(X)\mid\mathcal M_{\phi(Y)|X}]=E[f(X)\mid\phi(Y),\mathcal M_{\phi(Y)|X}]$, so $E[f(X)\mid\phi(Y),\mathcal M_{\phi(Y)|X}]$ is constant almost surely. Consequently, $E[f(X)\mid\phi(Y)]$ is constant almost surely. Therefore
\[
\begin{aligned}
\Lambda_{XX'}f
&=-E[(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')]\,f\\
&=-E[(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,\|\phi(Y)-\phi(Y')\|_{\mathbb H}^2]\,f\\
&=-E\big\{E[\kappa_X(\cdot,X)-\mu_X\mid\phi(Y)]\otimes E[\kappa_X(\cdot,X')-\mu_X\mid\phi(Y')]\,\|\phi(Y)-\phi(Y')\|_{\mathbb H}^2\big\}\,f,
\end{aligned}
\]
and
\[
\big\langle E[\kappa_X(\cdot,X')-\mu_X\mid\phi(Y')],f\big\rangle_{\mathcal H_X}
=E[f(X')\mid\phi(Y')]-E[f(X')]=0\quad\text{almost surely}.
\]
Therefore $\Lambda_{XX'}f=0$, and we have proved (C.25).

By (C.25), we have $\mathrm{ran}(\Lambda_{XX'})\subseteq\Sigma_{XX}\mathcal G_{\phi(Y)|X}$, which implies $\Sigma_{XX}^{-1}\mathrm{ran}(\Lambda_{XX'})\subseteq\mathcal G_{\phi(Y)|X}$. Note that
\[
\Sigma_{XX}^{-1}\mathrm{ran}(\Lambda_{XX'})
=\{\Sigma_{XX}^{-1}f: f=\Lambda_{XX'}g,\ g\in\mathcal H_X\}
=\{\Sigma_{XX}^{-1}\Lambda_{XX'}g: g\in\mathcal H_X\}
=\mathrm{ran}(\Sigma_{XX}^{-1}\Lambda_{XX'}).
\]
Then, because $\mathcal G_{\phi(Y)|X}$ is closed, we have $\mathrm{ran}(\Sigma_{XX}^{-1}\Lambda_{XX'})\subseteq\mathcal G_{\phi(Y)|X}$. Finally, we show that $\mathcal G_{\phi(Y)|X}\subseteq\mathcal G_{Y|X}$. It is easy to see that $Y$ being independent of $X$ given $\mathcal G_{Y|X}$ implies that $\phi(Y)$ is independent of $X$ given $\mathcal G_{Y|X}$; therefore $\mathcal G_{\phi(Y)|X}\subseteq\mathcal G_{Y|X}$. The proof is completed.
C.6 Proof of Proposition 4
Proposition 9.
Suppose Assumptions (3)–(5) hold and $\mathcal G_{Y|X}$ is complete. Then $\mathrm{ran}\{\Sigma_{XX}^{-1}\Lambda_{XX'}\}=\mathcal G_{Y|X}$.

Proof. From Proposition 8, we know that $\mathrm{ran}(\Lambda_{XX'})\subseteq\Sigma_{XX}\mathcal G_{Y|X}$. Therefore, we only need to show $\Sigma_{XX}\mathcal G_{Y|X}\subseteq\mathrm{ran}(\Lambda_{XX'})$, or equivalently, $\ker(\Lambda_{X'X})\subseteq(\Sigma_{XX}\mathcal G_{Y|X})^\perp$. Let $f\in\ker(\Lambda_{X'X})$. Then $\Lambda_{X'X}f=0$, which implies that $\Sigma_{X'X'}^{-1}\Lambda_{X'X}f=0$. By the proof of Proposition 8, we have $E(f(X)\mid\mathcal M_{\phi(Y)|X})=\text{constant}$. Since $\mathcal M_{\phi(Y)|X}\subseteq\mathcal M_{Y|X}$, we have $E(f(X)\mid\mathcal M_{Y|X})=\text{constant}$. It follows that, for any $g\in\mathcal G_{Y|X}$,
\[
\mathrm{cov}(f(X),g(X))=\mathrm{cov}\{f(X),E(g(X)\mid\mathcal M_{Y|X})\}=\mathrm{cov}\{E(f(X)\mid\mathcal M_{Y|X}),g(X)\}=0.
\]
That is, $f\in(\Sigma_{XX}\mathcal G_{Y|X})^\perp$. The proof is completed.

C.7 Proof of Theorem 3

Theorem 6.
Suppose Assumptions (3)–(7) hold. In addition, assume that $Ed(Y,Y')<\infty$. Then, as $n\to\infty$,
\[
\|\hat V_{XX'}-V_{XX'}\|_{\mathrm{HS}}=o_p(1),\qquad
|\langle\hat\psi,\psi\rangle_{\mathcal H_X}|\xrightarrow{\;P\;}1,\qquad
\|\{\hat f(X)-E\hat f(X)\}-\{f(X)-Ef(X)\}\|\xrightarrow{\;P\;}0,
\]
where $\|\cdot\|$ in this theorem is the standard $L_2$ norm used to measure the distance between functions and $\|\cdot\|_{\mathrm{HS}}$ denotes the Hilbert–Schmidt norm. To prove this theorem, we need the following lemmas.
Lemma 3.
The cross-covariance operator $\Lambda_{XX'}$ is a Hilbert–Schmidt operator, and its Hilbert–Schmidt norm satisfies
\[
\begin{aligned}
\|\Lambda_{XX'}\|_{\mathrm{HS}}^2
&=\big\langle E\{(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')\},\
E\{(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')\}\big\rangle_{\mathcal H_X\otimes\mathcal H_X}\\
&=E_{XX'YY'}E_{X''X'''Y''Y'''}\big[\langle\kappa_X(\cdot,X)-\mu_X,\kappa_X(\cdot,X'')-\mu_X\rangle_{\mathcal H_X}
\langle\kappa_X(\cdot,X')-\mu_X,\kappa_X(\cdot,X''')-\mu_X\rangle_{\mathcal H_X}\,d(Y,Y')\,d(Y'',Y''')\big]\\
&=\big\|E_{XX'YY'}[(\kappa_X(\cdot,X)-\mu_X)\otimes(\kappa_X(\cdot,X')-\mu_X)\,d(Y,Y')]\big\|_{\mathcal H_X\otimes\mathcal H_X}^2,
\end{aligned}
\]
where $(X,Y)$, $(X',Y')$, $(X'',Y'')$ and $(X''',Y''')$ are independently and identically distributed with distribution $P_{XY}$.

Since $\mathcal H_X\subset L_2(P_X)$, the law of large numbers implies that, for each $f\in\mathcal H_X$,
\[
\lim_{n\to\infty}\langle f,\hat\Lambda_{XX'}f\rangle_{\mathcal H_X}=\langle f,\Lambda_{XX'}f\rangle_{\mathcal H_X}
\]
in probability; moreover, the central limit theorem shows that the above convergence is of order $O_p(n^{-1/2})$. The following lemma gives the stronger uniform result that $\|\hat\Lambda_{XX'}-\Lambda_{XX'}\|_{\mathrm{HS}}$ converges to zero at the rate $O_p(n^{-1/2})$.

Lemma 4. Under Assumption 3 and $Ed(Y,Y')<\infty$, we have $\|\hat\Lambda_{XX'}-\Lambda_{XX'}\|_{\mathrm{HS}}=O_p(n^{-1/2})$.

Proof. Write for simplicity $\eta_i=\kappa_X(\cdot,X_i)-\mu_X$, $\bar\eta=n^{-1}\sum_{s=1}^n\eta_s$, and $\mathcal F=\mathcal H_X\otimes\mathcal H_X$. Then $\eta_1,\ldots,\eta_n$ are i.i.d. random elements in $\mathcal H_X$, and
\[
\hat\Lambda_{XX'}=-\frac{1}{n(n-1)}\sum_{i\ne j}(\eta_i-\bar\eta)\otimes(\eta_j-\bar\eta)\,d(Y_i,Y_j),\qquad
\Lambda_{XX'}=-E\{\eta\otimes\eta'\,d(Y,Y')\}.
\]
Therefore, by the triangle inequality,
\[
\begin{aligned}
\|\hat\Lambda_{XX'}-\Lambda_{XX'}\|_{\mathrm{HS}}
&=\Big\|\frac{1}{n(n-1)}\sum_{i\ne j}(\eta_i-\bar\eta)\otimes(\eta_j-\bar\eta)\,d(Y_i,Y_j)-E\{\eta\otimes\eta'\,d(Y,Y')\}\Big\|_{\mathcal F}\\
&\le\Big\|\frac{1}{n(n-1)}\sum_{i\ne j}\{\eta_i\otimes\eta_j\,d(Y_i,Y_j)-E[\eta\otimes\eta'\,d(Y,Y')]\}\Big\|_{\mathcal F}
+\|\bar\eta\|_{\mathcal H_X}\Big\|\frac{1}{n(n-1)}\sum_{i\ne j}\{(\eta_i+\eta_j)-\bar\eta\}\,d(Y_i,Y_j)\Big\|_{\mathcal H_X}.
\end{aligned}
\]
By a direct calculation,
\[
\begin{aligned}
E\Big\|\frac{1}{n(n-1)}\sum_{i\ne j}\{\eta_i\otimes\eta_j\,d(Y_i,Y_j)-E[\eta\otimes\eta'\,d(Y,Y')]\}\Big\|_{\mathcal F}^2
&=\frac{C}{n}\,E\big\langle\eta\otimes\eta'\,d(Y,Y')-E[\eta\otimes\eta'\,d(Y,Y')],\ \eta\otimes\eta''\,d(Y,Y'')-E[\eta\otimes\eta'\,d(Y,Y')]\big\rangle_{\mathcal F}\\
&\quad+\frac{2}{n(n-1)}\,E\big\|\eta\otimes\eta'\,d(Y,Y')-E[\eta\otimes\eta'\,d(Y,Y')]\big\|_{\mathcal F}^2=O(n^{-1}),
\end{aligned} \tag{C.28}
\]
where $C$ is a bounded constant, because $E\|\eta\otimes\eta'\,d(Y,Y')\|_{\mathcal F}^2<\infty$ by Assumption 3 and $Ed(Y,Y')<\infty$. In addition,
\[
E\Big\|\frac{1}{n}\sum_{s=1}^n\eta_s\Big\|_{\mathcal H_X}^2=\frac{1}{n}E\|\eta\|_{\mathcal H_X}^2=O(n^{-1}). \tag{C.29}
\]
Moreover, since $E\{\eta''\,d(Y,Y')\}=E\eta''\cdot Ed(Y,Y')=0$ for an independent copy $\eta''$, the law of large numbers yields
\[
\frac{1}{n(n-1)}\sum_{i\ne j}\{(\eta_i+\eta_j)-\bar\eta\}\,d(Y_i,Y_j)=O_p(1). \tag{C.30}
\]
Combining (C.28), (C.29) and (C.30), we have $\|\hat\Lambda_{XX'}-\Lambda_{XX'}\|_{\mathrm{HS}}=O_p(n^{-1/2})$.

Lemma 5.
Let $\varepsilon_n$ be a positive number such that $\varepsilon_n\to0$ as $n\to\infty$. Then, for the i.i.d. sample $(X_1,Y_1),\ldots,(X_n,Y_n)$, we have
\[
\big\|\hat V_{XX'}-(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big\|
=O_p\Big(\frac{1}{\varepsilon_n^{3/2}n^{1/2}}\Big).
\]

Proof. The left-hand side can be decomposed as
\[
\begin{aligned}
&\hat V_{XX'}-(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\\
&\quad=\big[(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}-(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big]\hat\Lambda_{XX'}\hat\Lambda_{XX'}^*(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\\
&\qquad+(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big[\hat\Lambda_{XX'}-\Lambda_{XX'}\big]\hat\Lambda_{XX'}^*(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\\
&\qquad+(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Lambda_{XX'}\big[\hat\Lambda_{XX'}^*-\Lambda_{XX'}^*\big](\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\\
&\qquad+(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*\big[(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}-(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big].
\end{aligned} \tag{C.31}
\]
From the identity $A^{-1/2}-B^{-1/2}=A^{-1/2}(B^{1/2}-A^{1/2})B^{-1/2}$, the first term equals
\[
(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big\{(\Sigma_{XX}+\varepsilon_nI)^{1/2}-(\hat\Sigma_{XX}+\varepsilon_nI)^{1/2}\big\}
(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\hat\Lambda_{XX'}\hat\Lambda_{XX'}^*(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}.
\]
From $\|(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\|\le\varepsilon_n^{-1/2}$, $\|(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\hat\Lambda_{XX'}\hat\Lambda_{XX'}^*(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\|\le C$, and Lemma 8 in Fukumizu et al. (2007), the norm of this operator is bounded above by
\[
\frac{C}{\varepsilon_n^{3/2}}\Big\{\sqrt{\varepsilon_n}\,\max\big(\|\Sigma_{XX}+\varepsilon_nI\|^{1/2},\|\hat\Sigma_{XX}+\varepsilon_nI\|^{1/2}\big)+1\Big\}\,\|\hat\Sigma_{XX}-\Sigma_{XX}\|
=O_p(\varepsilon_n^{-3/2}n^{-1/2}).
\]
For the second term, we have
\[
\big\|(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big[\hat\Lambda_{XX'}-\Lambda_{XX'}\big]\hat\Lambda_{XX'}^*(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big\|
=O_p(\varepsilon_n^{-1}n^{-1/2}).
\]
The third and fourth terms are analogous to the second and first terms; their bounds are $O_p(\varepsilon_n^{-1}n^{-1/2})$ and $O_p(\varepsilon_n^{-3/2}n^{-1/2})$, respectively.

Lemma 6.
Assume that $V_{XX'}$ is compact. Then, for a sequence $\varepsilon_n\to0$,
\[
\big\|(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*(\Sigma_{XX}+\varepsilon_nI)^{-1/2}-V_{XX'}\big\|=o(1).
\]

Proof. An upper bound for the left-hand side of the assertion is given by
\[
\big\|\{(\Sigma_{XX}+\varepsilon_nI)^{-1/2}-\Sigma_{XX}^{-1/2}\}\Lambda_{XX'}\Lambda_{XX'}^*(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\big\|
+\big\|\Sigma_{XX}^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*\{(\Sigma_{XX}+\varepsilon_nI)^{-1/2}-\Sigma_{XX}^{-1/2}\}\big\|. \tag{C.32}
\]
By taking adjoints, the second term in (C.32) equals
\[
\big\|\{(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Sigma_{XX}^{1/2}-I\}\Sigma_{XX}^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*\Sigma_{XX}^{-1/2}\big\|. \tag{C.33}
\]
Note that the range of $\Sigma_{XX}^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*\Sigma_{XX}^{-1/2}$ is included in $\mathcal R(\Sigma_{XX})$. Let $v$ be an arbitrary element of $\mathcal R(\Sigma_{XX}^{-1/2}\Lambda_{XX'}\Lambda_{XX'}^*\Sigma_{XX}^{-1/2})\cap\mathcal R(\Sigma_{XX})$. Then there exists $u\in\mathcal H_X$ such that $v=\Sigma_{XX}u$. Noting that $\Sigma_{XX}$ and $(\Sigma_{XX}+\varepsilon_nI)^{1/2}$ commute, we have
\[
\begin{aligned}
\big\|\{(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Sigma_{XX}^{1/2}-I\}v\big\|_{\mathcal H_X}
&=\big\|\{(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Sigma_{XX}^{1/2}-I\}\Sigma_{XX}u\big\|_{\mathcal H_X}\\
&=\big\|(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Sigma_{XX}^{1/2}\{\Sigma_{XX}^{1/2}-(\Sigma_{XX}+\varepsilon_nI)^{1/2}\}\Sigma_{XX}^{1/2}u\big\|_{\mathcal H_X}\\
&\le\big\|\Sigma_{XX}^{1/2}-(\Sigma_{XX}+\varepsilon_nI)^{1/2}\big\|\,\big\|\Sigma_{XX}^{1/2}u\big\|_{\mathcal H_X}.
\end{aligned}
\]
Since $\Sigma_{XX}+\varepsilon_nI\to\Sigma_{XX}$ in norm implies $(\Sigma_{XX}+\varepsilon_nI)^{1/2}\to\Sigma_{XX}^{1/2}$ in norm, the convergence
\[
\{(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\Sigma_{XX}^{1/2}-I\}v\longrightarrow0\qquad(n\to\infty)
\]
holds for all such $v$. Because $\Sigma_{XX}^{-1/2}\Lambda_{XX'}$ is compact, Lemma 9 in Fukumizu et al. (2007) shows that (C.33) converges to zero. The convergence of the first term in (C.32) can be proved similarly.

Lemma 7.
Let $A$ be a compact positive operator on a Hilbert space $H$, and let $A_n$ ($n\in\mathbb N$) be bounded positive operators on $H$ such that $A_n$ converges to $A$ in norm. Assume that the eigenspace of $A$ corresponding to the largest eigenvalue is one-dimensional and spanned by a unit eigenvector $\psi_1$, and that the maximum of the spectrum of $A_n$ is attained by a unit eigenvector $f_n$. Then $|\langle f_n,\psi_1\rangle_H|\to1$ as $n\to\infty$.

Proof. Because $A$ is compact and positive, the eigen-decomposition
\[
A=\sum_{i=1}^\infty\rho_i\,\psi_i\langle\psi_i,\cdot\rangle_H
\]
holds, where $\rho_1>\rho_2\ge\rho_3\ge\cdots\ge0$ are the eigenvalues and $\{\psi_i\}$ are the corresponding eigenvectors, so that $\{\psi_i\}$ is a complete orthonormal system of $H$. Let $\delta_n=|\langle f_n,\psi_1\rangle_H|^2$. We have
\[
\langle f_n,Af_n\rangle_H=\rho_1\langle f_n,\psi_1\rangle_H^2+\sum_{i=2}^\infty\rho_i\langle f_n,\psi_i\rangle_H^2
\le\rho_1\delta_n+\rho_2(1-\delta_n).
\]
On the other hand, the convergence
\[
|\langle f_n,Af_n\rangle_H-\langle\psi_1,A\psi_1\rangle_H|
\le|\langle f_n,Af_n\rangle_H-\langle f_n,A_nf_n\rangle_H|+|\langle f_n,A_nf_n\rangle_H-\langle\psi_1,A\psi_1\rangle_H|
\le\|A-A_n\|+\big|\,\|A_n\|-\|A\|\,\big|\to0
\]
implies that $\langle f_n,Af_n\rangle_H$ converges to $\rho_1$. These two facts, together with $\rho_1>\rho_2$, yield $\delta_n\to1$. From the norm convergence $Q_nA_nQ_n\to QAQ$, where $Q_n$ and $Q$ are the orthogonal projections onto the orthogonal complements of $f_n$ and $\psi_1$, respectively, we obtain convergence of the eigenvector corresponding to the next eigenvalue, and it is not difficult to obtain convergence of the eigenspaces corresponding to the $m$-th eigenvalue in a similar way.

Proof of Theorem 3. The first and second equations are proved by Lemmas 5 and 6, combined with Lemma 7. Now we prove the third equation. Without loss of generality, we can assume $\hat\psi\to\psi$ in $\mathcal H_X$. Writing $\psi=\Sigma_{XX}^{1/2}f$ and $\hat f=(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}\hat\psi$, the squared $L_2(P_X)$ distance between $\hat f-E\hat f(X)$ and $f-Ef(X)$ is given by
\[
\|\Sigma_{XX}^{1/2}(\hat f-f)\|_{\mathcal H_X}^2
=\|\Sigma_{XX}^{1/2}\hat f\|_{\mathcal H_X}^2-2\langle\psi,\Sigma_{XX}^{1/2}\hat f\rangle_{\mathcal H_X}+\|\psi\|_{\mathcal H_X}^2,
\]
so it suffices to show that $\Sigma_{XX}^{1/2}\hat f$ converges to $\psi\in\mathcal H_X$ in probability. We have
\[
\begin{aligned}
\|\Sigma_{XX}^{1/2}\hat f-\psi\|_{\mathcal H_X}
&\le\big\|\Sigma_{XX}^{1/2}\{(\hat\Sigma_{XX}+\varepsilon_nI)^{-1/2}-(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\}\hat\psi\big\|_{\mathcal H_X}
+\big\|\Sigma_{XX}^{1/2}(\Sigma_{XX}+\varepsilon_nI)^{-1/2}(\hat\psi-\psi)\big\|_{\mathcal H_X}\\
&\quad+\big\|\Sigma_{XX}^{1/2}(\Sigma_{XX}+\varepsilon_nI)^{-1/2}\psi-\psi\big\|_{\mathcal H_X}.
\end{aligned} \tag{C.34}
\]
Using the same argument as in the bound of the first term of (C.31), the first term in (C.34) converges to zero; the second term obviously converges to zero; and, similarly to Lemma 6, the third term converges to zero, which completes the proof.

C.8 Proof of Proposition 5
Proposition 10.
Let $K_n$ be the $n\times n$ kernel matrix whose $(i,j)$-th element is $\kappa_X(X_i,X_j)$. Denote by $J_n$ the $n\times n$ matrix whose elements are all one. Define $G_X=(I_n-J_n/n)K_n(I_n-J_n/n)$, and let $D_Y$ be the $n\times n$ matrix whose $(i,j)$-th element is $d(Y_i,Y_j)$. Then we have $G_X\alpha_\ell=\gamma_\ell$, where $\gamma_\ell$ is the $\ell$-th eigenvector of the matrix
\[
(G_X+\varepsilon_nI_n)^{-1}G_XD_YG_XD_YG_X(G_X+\varepsilon_nI_n)^{-1}.
\]

Proof.
The subspace $\mathrm{ran}(\hat\Lambda_{XX'})$ is spanned by the set
\[
\mathcal C_X=\{\kappa_X(\cdot,X_i)-E_n\kappa_X(\cdot,X): i=1,\ldots,n\}=\{\eta_1,\ldots,\eta_n\}.
\]
Define $[\,\cdot\,]_{\mathcal C_X}$ as the coordinate representation with respect to the system $\mathcal C_X$. Note that the members of this spanning system are not linearly independent, because their sum is the zero function. We next find the coordinate representation of $-\hat\Lambda_{XX'}$:
\[
[-\hat\Lambda_{XX'}\eta_i]_{\mathcal C_X}
=\{n(n-1)\}^{-1}\Big[\Big(\sum_{k\ne t}\eta_k\otimes\eta_t\,d(Y_k,Y_t)\Big)\eta_i\Big]_{\mathcal C_X}
=\{n(n-1)\}^{-1}\Big(\sum_{k\ne t}[\eta_k]_{\mathcal C_X}[\eta_t]_{\mathcal C_X}^T\,d(Y_k,Y_t)\,G_X\Big)[\eta_i]_{\mathcal C_X}.
\]
Since $\eta_i$ is simply the $i$-th member of the spanning system $\mathcal C_X$, we have $[\eta_i]_{\mathcal C_X}=e_i$. Moreover,
\[
\langle\eta_i,\eta_j\rangle_{\mathcal H_X}
=\kappa_X(X_i,X_j)-n^{-1}\sum_{l=1}^n\kappa_X(X_i,X_l)-n^{-1}\sum_{k=1}^n\kappa_X(X_j,X_k)+n^{-2}\sum_{k=1}^n\sum_{l=1}^n\kappa_X(X_k,X_l),
\]
so the Gram matrix of the set $\mathcal C_X$ is $G_X=(I_n-J_n/n)K_n(I_n-J_n/n)$. Then
\[
[-\hat\Lambda_{XX'}\eta_i]_{\mathcal C_X}
=\{n(n-1)\}^{-1}\Big(\sum_{k\ne t}e_ke_t^T\,d(Y_k,Y_t)\Big)G_Xe_i
=\{n(n-1)\}^{-1}D_YG_Xe_i,
\]
and hence, up to the constant factor $\{n(n-1)\}^{-1}$, which does not affect the eigenvectors,
\[
[-\hat\Lambda_{XX'}]_{\mathcal C_X}=([-\hat\Lambda_{XX'}\eta_1]_{\mathcal C_X},\ldots,[-\hat\Lambda_{XX'}\eta_n]_{\mathcal C_X})=D_YG_X(e_1,\ldots,e_n)=D_YG_X.
\]
Similarly, we can get $[\hat\Sigma_{XX}]_{\mathcal C_X}=G_X$. The proof is completed.
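Proposition 10 is, in effect, the recipe for computing the nonlinear sufficient predictors from data. The sketch below spells that recipe out under the assumption of a Gaussian kernel for $\kappa_X$; the kernel choice, the regularized solve used to recover $\alpha_\ell$ from $G_X\alpha_\ell=\gamma_\ell$ (the centred Gram matrix is singular), and the function names are illustrative rather than the implementation used in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def nonlinear_sufficient_predictors(X, D_Y, n_components=2, eps=1e-3, gamma=None):
    """Sketch of the sample-level eigenproblem in Proposition 10.

    X    : (n, p) predictor matrix.
    D_Y  : (n, n) matrix of response distances d(Y_i, Y_j).
    Returns the centred values of the estimated functions f_hat_l at the sample points.
    """
    n = X.shape[0]
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    # Gaussian kernel matrix K_n and doubly centred Gram matrix G_X.
    K = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))
    H = np.eye(n) - np.ones((n, n)) / n
    G = H @ K @ H
    # Matrix whose leading eigenvectors are gamma_l in Proposition 10:
    # (G + eps I)^{-1} G D_Y G D_Y G (G + eps I)^{-1} = B G B^T with B = (G + eps I)^{-1} G D_Y.
    B = np.linalg.solve(G + eps * np.eye(n), G @ D_Y)
    M = B @ G @ B.T
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]
    gammas = eigvecs[:, order[:n_components]]
    # One convenient (regularized) way to solve G alpha_l = gamma_l for alpha_l.
    alphas = np.linalg.solve(G + eps * np.eye(n), gammas)
    # Centred values of f_hat_l at the sample points via the Gram matrix: (G @ alpha_l)_i.
    return G @ alphas


# Toy usage: scalar responses standing in for metric-space objects.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
Y = np.abs(X[:, 0]) + 0.1 * rng.normal(size=150)
D_Y = squareform(pdist(Y[:, None]))
F_hat = nonlinear_sufficient_predictors(X, D_Y, n_components=2)
print(F_hat.shape)    # (150, 2) nonlinear sufficient predictors
```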
References

Cook, R. D. & Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. J. Am. Statist. Assoc. 100, 410–428.

Fukumizu, K., Bach, F. R. & Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res. 8, 361–383.

Li, B., Wen, S. Q. & Zhu, L.-X. (2008). On a projective resampling method for dimension reduction with multivariate responses. J. Am. Statist. Assoc. 103, 1177–1186.

Luo, W. & Li, B. (2016). Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875–887.

Roweis, S. & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.

Schoenberg, I. J. (1937). On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space. Ann. of Math. 38, 787–793.

Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Trans. Amer. Math. Soc. 44, 522–536.

Tenenbaum, J., de Silva, V. & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.