A Generalization of the Pearson Correlation to Riemannian Manifolds
P. Michl orcid.org/0000-0002-6398-0654
The increasing application of deep learning is accompanied by a shift towards highly non-linear statistical models. In terms of their geometry it is natural to identify these models with Riemannian manifolds. The further analysis of such statistical models therefore raises the issue of a correlation measure that, in the cutting planes of the tangent spaces, equals the respective Pearson correlation and extends to a correlation measure that is normalized with respect to the underlying manifold. To this end the article reconstitutes elementary properties of the Pearson correlation in order to successively derive a linear generalization to multiple dimensions and thereupon a nonlinear generalization to principal manifolds, given by the Riemann-Pearson correlation.
Keywords:
Nonlinear Correlation, Riemann-Pearson Correlation, Principal Manifold
A fundamental issue that accompanies the analysis of multivariate data concerns the quantification of statistical dependency structures by association measures. Many approaches in this direction can be traced back to the late 19th century, when the issue was closely related to the task of extracting laws of nature from two-dimensional scatter plots. This in particular applies to the widespread Pearson correlation coefficient.

Definition (Pearson Correlation). Let $X\colon \Omega \to \mathbb{R}$ and $Y\colon \Omega \to \mathbb{R}$ be random variables with finite variances $\sigma_X^2$ and $\sigma_Y^2$. Then the Pearson correlation $\rho_{X,Y}$ is defined by:
$$\rho_{X,Y} := \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{1.1}$$
Due to its popularity and simplicity the Pearson correlation has been generalized to a variety of different domains of application, including generic monotonous relationships, relationships between sets of random variables and asymmetric relationships (Zheng et al. 2010). In order to provide a generalization to smooth curves and submanifolds that allows an incorporation of structural assumptions, some elementary considerations have to be taken into account, which allow a separation between the pairwise quantification of dependencies and their global modelling.

Pearson's original motivation was the regression of a straight line that minimizes the averaged Euclidean distance to points scattered about it (Pearson 1901, p. 561). His investigations were preceded by the observation that for a measurement series the assumed "direction of causality" influences the estimate of the slope of the regression line. The direction of causality is implicated by the choice of an error model that assumes one random variable to be error free and the other to account for the whole observed error. Pearson empirically observed that for $n \in \mathbb{N}$ points, given by i.i.d. realizations $\mathbf{x} \in \mathbb{R}^n$ of $X$ and $\mathbf{y} \in \mathbb{R}^n$ of $Y$, the least squares regression line of $\mathbf{y}$ on $\mathbf{x}$ only equals the regression line of $\mathbf{x}$ on $\mathbf{y}$ if all points perfectly fit on a straight line. In all other cases, however, the slopes of the respective regression lines turned out not to be reciprocal, and their product was found within the interval $[0, 1]$. This observation was decisive for Pearson's definition of the correlation coefficient. Thereby $\rho_{X,Y}$ is estimated by its empirical counterpart $\rho_{\mathbf{x},\mathbf{y}}$, which replaces the variances by sample variances and the covariance by the sample covariance.
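The following minimal numerical sketch (assuming NumPy; the data is synthetic and purely illustrative) reproduces Pearson's observation: the two regression slopes are not reciprocal, their product lies in $[0, 1]$, and the geometric mean of the slopes recovers the absolute empirical correlation, as formalized in lemma 1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noisily linear data (illustrative only).
n = 1000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.5, size=n)

# OLS slope of y on x and of x on y.
cov_xy = np.cov(x, y, bias=True)[0, 1]
beta_x = cov_xy / np.var(x)   # y = beta_x * x + alpha_x
beta_y = cov_xy / np.var(y)   # x = beta_y * y + alpha_y

rho = np.corrcoef(x, y)[0, 1]

print(beta_x * beta_y)                      # in [0, 1], not equal to 1
print(np.sqrt(beta_x * beta_y), abs(rho))   # geometric mean of slopes = |rho|
```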
Lemma 1. Let $X\colon \Omega \to \mathbb{R}$ and $Y\colon \Omega \to \mathbb{R}$ be random variables with $n \in \mathbb{N}$ i.i.d. realizations $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^n$. Furthermore let $\beta_x \in \mathbb{R}$ denote the slope of the linear regression of $\mathbf{y}$ on $\mathbf{x}$ and $\beta_y \in \mathbb{R}$ the slope of the linear regression of $\mathbf{x}$ on $\mathbf{y}$. Then:
$$\rho_{\mathbf{x},\mathbf{y}}^2 = \beta_x \beta_y \tag{2.1}$$
Proof of Lemma 1. The following proof is based on (Kenney et al. 1962). The least squares regression of $\mathbf{y}$ on $\mathbf{x}$ implicates that for regression coefficients $\alpha_x, \beta_x \in \mathbb{R}$ and a normally distributed random error $\varepsilon := Y - (\beta_x X + \alpha_x)$ the log-likelihood of the realizations is maximized if and only if the squared $\ell_2$-norm of the realizations of $\varepsilon$ is minimized, such that:
$$\operatorname{SSE}_y(\alpha_x, \beta_x) := \sum_{i=1}^n \left( y_i - (\beta_x x_i + \alpha_x) \right)^2 \to \min \tag{2.2}$$
Since $\operatorname{SSE}_y$ is a quadratic function of $\alpha_x$ and $\beta_x$ and therefore convex, it has a unique global minimum at:
$$\frac{\partial}{\partial \alpha_x} \operatorname{SSE}_y = 2 \sum_{i=1}^n \left( y_i - (\beta_x x_i + \alpha_x) \right)(-1) = 0 \tag{2.3}$$
$$\frac{\partial}{\partial \beta_x} \operatorname{SSE}_y = 2 \sum_{i=1}^n \left( y_i - (\beta_x x_i + \alpha_x) \right)(-x_i) = 0 \tag{2.4}$$
By equating the coefficients, equations 2.3 and 2.4 can be rewritten as a system of linear equations in $\alpha_x$ and $\beta_x$:
$$\alpha_x n + \beta_x \sum_{i=1}^n x_i = \sum_{i=1}^n y_i \tag{2.5}$$
$$\alpha_x \sum_{i=1}^n x_i + \beta_x \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \tag{2.6}$$
Consequently, in matrix notation the vector $(\alpha_x, \beta_x)^T$ is determined by:
$$\begin{pmatrix} \alpha_x \\ \beta_x \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{pmatrix} \tag{2.7}$$
Let $\bar{x}, \bar{y}$ respectively denote the sample means. Then by calculating the matrix inverse, the slope $\beta_x$ equates to:
$$\beta_x = \left( \sum_{i=1}^n x_i y_i - n \bar{x} \bar{y} \right) \left( \sum_{i=1}^n x_i^2 - n \bar{x}^2 \right)^{-1} \tag{2.8}$$
Thereupon by substituting the sample variance:
$$\sigma_x^2 := \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n} \left( \sum_{i=1}^n x_i^2 - n \bar{x}^2 \right) \tag{2.9}$$
and the sample covariance:
$$\operatorname{Cov}(\mathbf{x}, \mathbf{y}) := \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n} \left( \sum_{i=1}^n x_i y_i - n \bar{x} \bar{y} \right) \tag{2.10}$$
it follows from equation 2.8 that:
$$\beta_x = \frac{\operatorname{Cov}(\mathbf{x}, \mathbf{y})}{\sigma_x^2} \tag{2.11}$$
Conversely the slope $\beta_y$ of the linear regression of $\mathbf{x}$ on $\mathbf{y}$ mutatis mutandis equates to:
$$\beta_y = \frac{\operatorname{Cov}(\mathbf{y}, \mathbf{x})}{\sigma_y^2} \tag{2.12}$$
By the symmetry of $\operatorname{Cov}$ it then follows from equations 2.11 and 2.12 that:
$$\beta_x \beta_y = \frac{\operatorname{Cov}(\mathbf{x}, \mathbf{y})^2}{\sigma_x^2 \sigma_y^2} = \rho_{\mathbf{x},\mathbf{y}}^2 \tag{2.13}$$

Lemma 1 shows that $|\rho_{\mathbf{x},\mathbf{y}}|$ may be regarded as the geometric mean of the regression slopes $\beta_x$ and $\beta_y$, where $\beta_x$ and $\beta_y$ respectively describe the causal relationships $X \to Y$ and $Y \to X$. Thereby $X$ and $Y$ are respectively treated as error-free regressor variables to predict the corresponding response variable, which captures the overall error. The mutual linear relationship $X \leftrightarrow Y$ is then described by a regression line that treats errors in both variables equally. As an immediate consequence of this symmetry it follows that this total least squares regression line is unique, and its slope $\beta^\star_x$, which describes $y$ by $x$, is reciprocal to the slope $\beta^\star_y$, which describes $x$ by $y$, such that $\beta^\star_x \beta^\star_y = 1$. In this sense $\beta_x$ and $\beta_y$ may be regarded as biased estimations of $\beta^\star_x$ and $\beta^\star_y$. This bias is generally known as "regression dilution" or "regression attenuation". For the case that both errors are independent and normally distributed, the bias can be corrected by a prefactor that incorporates the error of the respective regressor variable. An application of this correction to lemma 1 then shows that $\rho_{X,Y}$ has a consistent estimation by the sample variances of $\mathbf{x}$ and $\mathbf{y}$ and the variances of their respective errors $\varepsilon_X$ and $\varepsilon_Y$.
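A short simulation (a sketch assuming NumPy; the error scales $\eta_X$, $\eta_Y$ are arbitrary illustrative choices) makes the attenuation of proposition 2 below tangible: the product of the empirical slopes converges to the product of the two correction factors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Error-free linear relationship, observed under independent normal errors.
x_star = rng.normal(size=n)
y_star = 2.0 * x_star
eta_x, eta_y = 0.5, 1.0          # error standard deviations (arbitrary)
x = x_star + rng.normal(scale=eta_x, size=n)
y = y_star + rng.normal(scale=eta_y, size=n)

cov = np.cov(x, y, bias=True)[0, 1]
beta_x = cov / np.var(x)         # slope of y on x
beta_y = cov / np.var(y)         # slope of x on y

# Product of slopes vs. the attenuation factors of equation 2.14.
lhs = beta_x * beta_y
rhs = (1 - eta_x**2 / np.var(x)) * (1 - eta_y**2 / np.var(y))
print(lhs, rhs)                  # nearly equal for large n
```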
Proposition 2. Let $X\colon \Omega \to \mathbb{R}$ and $Y\colon \Omega \to \mathbb{R}$ be random variables with $n \in \mathbb{N}$ i.i.d. realizations $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^n$ and random errors $\varepsilon_X \sim \mathcal{N}(0, \eta_X^2)$ and $\varepsilon_Y \sim \mathcal{N}(0, \eta_Y^2)$. Then:
$$\rho_{\mathbf{x},\mathbf{y}}^2 \xrightarrow{P} \left(1 - \frac{\eta_X^2}{\sigma_x^2}\right) \left(1 - \frac{\eta_Y^2}{\sigma_y^2}\right), \quad \text{for } n \to \infty \tag{2.14}$$
Proof of Proposition 2. Let $\beta_x \in \mathbb{R}$ be the slope of the ordinary least squares (OLS) regression line of $\mathbf{y}$ on $\mathbf{x}$, where $\mathbf{x}$ is assumed to realize $X$ with a normally distributed random error $\varepsilon_X \sim \mathcal{N}(0, \eta_X^2)$. Then $X$ decomposes into (i) an unobserved error-free regressor variable $X^\star$ and (ii) the random error $\varepsilon_X$, such that:
$$X \sim X^\star + \varepsilon_X \tag{2.15}$$
With respect to this decomposition, the slope $\beta^\star_x$ of the total least squares (TLS) regression line, which also considers $\varepsilon_X$, is then identified with the slope of the OLS regression of $\mathbf{y}$ on $\mathbf{x}^\star$, where $\mathbf{x}^\star$ realizes $X^\star$. Thereupon let $\sigma_{x^\star}^2$ be the empirical variance of $\mathbf{x}^\star$; then according to (Snedecor et al. 1967) it follows that:
$$\beta_x \xrightarrow{P} \frac{\sigma_{x^\star}^2}{\sigma_{x^\star}^2 + \eta_X^2} \beta^\star_x, \quad \text{for } n \to \infty \tag{2.16}$$
Since furthermore $\varepsilon_X$ by definition is statistically independent from $X^\star$, it can be concluded that:
$$\sigma_x^2 = \operatorname{Var}(X^\star + \varepsilon_X) = \operatorname{Var}(X^\star) + \operatorname{Var}(\varepsilon_X) = \sigma_{x^\star}^2 + \eta_X^2 \tag{2.17}$$
Such that:
$$\frac{\sigma_{x^\star}^2}{\sigma_{x^\star}^2 + \eta_X^2} = \frac{\sigma_x^2 - \eta_X^2}{\sigma_x^2} = 1 - \frac{\eta_X^2}{\sigma_x^2} \tag{2.18}$$
And therefore by equation 2.16:
$$\beta_x \xrightarrow{P} \left(1 - \frac{\eta_X^2}{\sigma_x^2}\right) \beta^\star_x, \quad \text{for } n \to \infty \tag{2.19}$$
Conversely let now $\beta_y \in \mathbb{R}$ be the slope of the OLS regression of $\mathbf{x}$ on $\mathbf{y}$, where $\mathbf{y}$ is assumed to realize $Y$ with a random error $\varepsilon_Y \sim \mathcal{N}(0, \eta_Y^2)$. Then the corrected slope $\beta^\star_y$ mutatis mutandis satisfies the relation given by equation 2.19, and by the representation of $\rho_{\mathbf{x},\mathbf{y}}^2$, as given by lemma 1, it can be concluded that:
$$\rho_{\mathbf{x},\mathbf{y}}^2 = \beta_x \beta_y \xrightarrow{P} \left(1 - \frac{\eta_X^2}{\sigma_x^2}\right) \left(1 - \frac{\eta_Y^2}{\sigma_y^2}\right) \beta^\star_x \beta^\star_y, \quad \text{for } n \to \infty \tag{2.20}$$
The proposition then follows from the uniqueness of the total least squares regression line for known variances $\eta_X^2$ and $\eta_Y^2$, such that:
$$\beta^\star_y = \frac{1}{\beta^\star_x}$$

Within the same publication in which Pearson introduced the correlation coefficient, he also developed a structured approach to determine the straight line that minimizes the Euclidean distance (Pearson 1901, p. 563). His method, which later received attribution as the method of
Principal Component Analysis (PCA), however, went even further and allowed a canonical generalization of the problem in the following sense: For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be a multivariate random vector and for $n \in \mathbb{N}$ let $\mathbf{x} \in \mathbb{R}^{n \times d}$ be an i.i.d. realization of $X$. Then for any given $k \in \mathbb{N}$ with $k \leq d$ the goal is to determine an affine linear subspace $L \subseteq \mathbb{R}^d$ of dimension $k$ that minimizes the summed Euclidean distance to $\mathbf{x}$. In order to solve this problem, the fundamental idea of Pearson was to transfer the principal axis theorem from ellipsoids to multivariate Gaussian distributed random vectors. Thereupon, however, the method can also be formulated with respect to generic elliptical distributions.

Definition (Elliptical Distribution). For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be a random vector. Then $X$ is elliptically distributed, iff there exists a random vector $S\colon \Omega \to \mathbb{R}^k$ with $k \leq d$, whose distribution is invariant to rotations, a matrix $A \in \mathbb{R}^{d \times k}$ of rank $k$ and a vector $b \in \mathbb{R}^d$, such that:
$$X \sim A S + b \tag{3.1}$$
Consequently a random vector $X$ is elliptically distributed if it can be represented by an affine transformation of a radially symmetric distributed random vector $S$. The decisive property that underpins the choice of elliptical distributions lies within the coincidence of linear and statistical dependencies, which allows $X$ to be decomposed into statistically independent components by a linear decomposition. This property allows the multidimensional "linear fitting problem" to be substantiated with respect to an orthogonal projection.
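As a concrete illustration, the following sketch (assuming NumPy; the matrix $A$ and vector $b$ are arbitrary choices) samples an elliptically distributed vector as an affine image of a spherical Gaussian, the simplest rotation-invariant distribution, and shows that its covariance is fully determined by the linear part.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 50_000, 3, 2

# S: rotation-invariant (spherical Gaussian) in R^k.
S = rng.normal(size=(n, k))

# Arbitrary rank-k affine transformation X = A S + b (illustrative choice).
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 0.5]])
b = np.array([1.0, -2.0, 0.0])
X = S @ A.T + b

# Cov(X) coincides with A A^T: for an elliptical distribution the linear
# structure captures the full dependency structure.
print(np.round(np.cov(X.T), 2))
print(A @ A.T)
```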
Proposition 3. For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be an elliptically distributed random vector, $L \subseteq \mathbb{R}^d$ an affine linear subspace of $\mathbb{R}^d$ and $\pi_L$ the orthogonal projection of $\mathbb{R}^d$ onto $L$. Then the following statements are equivalent:
(i) $L$ minimizes the Euclidean distance to $X$
(ii) $\mathbb{E}(X) \in L$ and $L$ maximizes the variance $\operatorname{Var}(\pi_L(X))$
Proof. Let $Y_L := X - \pi_L(X)$; then the squared Euclidean distance between $X$ and $L$ can be written:
$$d(X, \pi_L(X))^2 = \mathbb{E}\left( \lVert X - \pi_L(X) \rVert^2 \right) = \mathbb{E}\left( \lVert Y_L \rVert^2 \right) \tag{3.2}$$
This representation can furthermore be decomposed by using the algebraic formula for the variance:
$$\mathbb{E}\left( \lVert Y_L \rVert^2 \right) = \operatorname{Var}(Y_L) + \lVert \mathbb{E}(Y_L) \rVert^2 \tag{3.3}$$
Let now $Y^\perp_L := \pi_L(X)$; then $X = Y_L + Y^\perp_L$ and $Y_L$ and $Y^\perp_L$ are uncorrelated, such that:
$$\operatorname{Var}(X) = \operatorname{Var}(Y_L + Y^\perp_L) = \operatorname{Var}(Y_L) + \operatorname{Var}(Y^\perp_L) \tag{3.4}$$
From equations 3.2, 3.3 and 3.4 it follows that:
$$d(X, \pi_L(X))^2 = \operatorname{Var}(X) - \operatorname{Var}(Y^\perp_L) + \lVert \mathbb{E}(Y_L) \rVert^2 \tag{3.5}$$
Consequently the Euclidean distance is minimized if and only if the right side of equation 3.5 is minimized. The first term $\operatorname{Var}(X)$, however, does not depend on $L$, and since $X$ is elliptically distributed, the linear independence of $Y^\perp_L$ and $Y_L$ is sufficient for statistical independence. It follows that the Euclidean distance is minimized if and only if (1) the term $\lVert \mathbb{E}(Y_L) \rVert^2$ is minimized and (2) the term $\operatorname{Var}(Y^\perp_L)$ is maximized. Concerning (1) it follows that:
$$\lVert \mathbb{E}(Y_L) \rVert^2 = \lVert \mathbb{E}(X - \pi_L(X)) \rVert^2 = \lVert \mathbb{E}(X) - \pi_L(\mathbb{E}(X)) \rVert^2$$
Therefore the term $\lVert \mathbb{E}(Y_L) \rVert^2$ is minimized if and only if $\pi_L(\mathbb{E}(X)) = \mathbb{E}(X)$, which in turn means that $\mathbb{E}(X) \in L$. Concerning (2), the proposition immediately follows by the definition of $Y^\perp_L$.

Let now $k \leq d$. In order to derive an affine linear subspace $L \subseteq \mathbb{R}^d$ that minimizes the Euclidean distance to $X$, proposition 3 states that it suffices to provide an $L$ which (1) is centred in $X$, such that $\mathbb{E}(X) \in L$, and (2) maximizes the variance of the projection. In order to maximize $\operatorname{Var}(\pi_L(X))$, however, it is beneficial to give a further representation.
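Before deriving that representation, the decomposition of equation 3.4 can be checked numerically. The following sketch (assuming NumPy; the distribution and the subspace basis are arbitrary illustrative choices) confirms that the total variance splits into the variance of the projection and the variance of the residual:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 100_000, 4, 2

X = rng.multivariate_normal(np.zeros(d), np.diag([4.0, 2.0, 1.0, 0.5]), size=n)

# Orthonormal basis of an arbitrary k-dimensional subspace through the mean.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # columns: basis of L

P = X @ U @ U.T                                # orthogonal projection onto L
R = X - P                                      # residual, orthogonal to L

total = np.trace(np.cov(X.T))
explained = np.trace(np.cov(P.T))
unexplained = np.trace(np.cov(R.T))
print(total, explained + unexplained)          # matches equation 3.4
```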
Lemma 4. For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be an elliptically distributed random vector and $L \subseteq \mathbb{R}^d$ an affine linear subspace of $\mathbb{R}^d$, which for a $k \leq d$, a vector $v \in \mathbb{R}^d$ and an orthonormal basis $u_1, \ldots, u_k \in \mathbb{R}^d$ is given by:
$$L = v + \bigoplus_{i=1}^k \mathbb{R} u_i$$
Let further $\pi_L\colon \mathbb{R}^d \to L$ be the orthogonal projection of $\mathbb{R}^d$ onto $L$. Then the variance of the projection is given by:
$$\operatorname{Var}(\pi_L(X)) = \sum_{i=1}^k u_i^T \operatorname{Cov}(X) u_i$$
Proof. Let $L' := L + \mathbb{E}(X) - v$; then the orthogonal projection $\pi_{L'}(X)$ decomposes into individual orthogonal projections onto the respective basis vectors, such that:
$$\pi_{L'}(X) = \mathbb{E}(X) + \sum_{i=1}^k \langle X - \mathbb{E}(X), u_i \rangle u_i \tag{3.6}$$
Let $\hat{X}_i := \langle X, u_i \rangle u_i$ for $i \in \{1, \ldots, k\}$. The total variance of this projection is then given by:
$$\operatorname{Var}(\pi_{L'}(X)) = \operatorname{Var}\left( \mathbb{E}(X) + \sum_{i=1}^k \hat{X}_i - \sum_{i=1}^k \mathbb{E}(\hat{X}_i) \right) = \operatorname{Var}\left( \sum_{i=1}^k \hat{X}_i \right) \tag{3.7}$$
Since the random variables $\hat{X}_i$ by definition are uncorrelated, the algebraic formula for the variance can be used to decompose the variance:
$$\operatorname{Var}\left( \sum_{i=1}^k \hat{X}_i \right) = \sum_{i=1}^k \operatorname{Var}(\hat{X}_i) \tag{3.8}$$
By equating the term $\operatorname{Var}(\hat{X}_i)$ for $i \in \{1, \ldots, k\}$ it follows that:
$$\operatorname{Var}(\hat{X}_i) = \operatorname{Var}\left( \langle X, u_i \rangle u_i \right) = \operatorname{Var}\left( X^T u_i \right) \lVert u_i \rVert^2 = \operatorname{Var}\left( X^T u_i \right) \tag{3.9}$$
And furthermore by introducing the covariance matrix
$\operatorname{Cov}(X)$:
$$\operatorname{Var}\left( X^T u_i \right) \overset{\text{def}}{=} \mathbb{E}\left( (X^T u_i)^T (X^T u_i) \right) = u_i^T\, \mathbb{E}\left( X X^T \right) u_i \overset{\text{def}}{=} u_i^T \operatorname{Cov}(X)\, u_i \tag{3.10}$$
Summarized, the equations 3.7, 3.8, 3.9 and 3.10 provide a representation for the variance of the projection onto $L'$:
$$\operatorname{Var}(\pi_{L'}(X)) = \sum_{i=1}^k u_i^T \operatorname{Cov}(X)\, u_i$$
Finally the total variance of the projection is invariant under translations of $L$, such that:
$$\operatorname{Var}(\pi_L(X)) = \operatorname{Var}(\pi_L(X) + \mathbb{E}(X) - v) = \operatorname{Var}(\pi_{L'}(X)) = \sum_{i=1}^k u_i^T \operatorname{Cov}(X)\, u_i \tag{3.11}$$

Lemma 4 shows that for elliptically distributed random vectors $X$ the best fitting linear subspaces are completely determined by the expectation $\mathbb{E}(X)$ and the covariance matrix $\operatorname{Cov}(X)$. At this point it is important to notice that the covariance matrix is symmetric, which allows its diagonalization with regard to real-valued eigenvalues.
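The representation of lemma 4 is easy to verify numerically (a sketch assuming NumPy; basis and distribution are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 200_000, 4, 2

X = rng.multivariate_normal(np.zeros(d), np.diag([3.0, 2.0, 1.0, 0.5]), size=n)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis u_1, ..., u_k

C = np.cov(X.T)

# Left side: total variance of the projection onto span(u_1, ..., u_k).
proj_var = np.trace(np.cov((X @ U @ U.T).T))

# Right side: sum of quadratic forms u_i^T Cov(X) u_i.
quad_sum = sum(U[:, i] @ C @ U[:, i] for i in range(k))
print(proj_var, quad_sum)   # agree up to sampling error
```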
Lemma 5. For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be an elliptically distributed random vector and $L \subseteq \mathbb{R}^d$ an affine linear subspace of $\mathbb{R}^d$, which for a $k \leq d$, a vector $v \in \mathbb{R}^d$ and an orthonormal basis $u_1, \ldots, u_k \in \mathbb{R}^d$ is given by:
$$L = v + \bigoplus_{i=1}^k \mathbb{R} u_i$$
Let further $\pi_L\colon \mathbb{R}^d \to L$ be the orthogonal projection of $\mathbb{R}^d$ onto $L$, as well as $\lambda_1, \ldots, \lambda_d \in \mathbb{R}$ the eigenvalues of $\operatorname{Cov}(X)$. Then there exist numbers $a_1, \ldots, a_d \in [0, 1]$ with $\sum_{i=1}^d a_i = k$, such that:
$$\operatorname{Var}(\pi_L(X)) = \sum_{i=1}^d \lambda_i a_i$$
Proof. From lemma 4 it follows that:
$$\operatorname{Var}(\pi_L(X)) = \sum_{j=1}^k u_j^T \operatorname{Cov}(X)\, u_j$$
Since the covariance matrix $\operatorname{Cov}(X)$ is symmetric, there exists an orthonormal basis transformation matrix $S \in \mathbb{R}^{d \times d}$ and a diagonal matrix $D \in \mathbb{R}^{d \times d}$, such that $\operatorname{Cov}(X) = S^T D S$. Then the variance $\operatorname{Var}(\pi_L(X))$ has a decomposition, given by:
$$\operatorname{Var}(\pi_L(X)) = \sum_{j=1}^k u_j^T S^T D S u_j = \sum_{j=1}^k (S u_j)^T D\, S u_j$$
For $j \in \{1, \ldots, k\}$ let now $c_j := S u_j$ and for $i \in \{1, \ldots, d\}$ let the number $a_i \in \mathbb{R}$ be defined by:
$$a_i := \sum_{j=1}^k (c_{ji})^2$$
Then according to lemma 4 the variance
$\operatorname{Var}(\pi_L(X))$ can be decomposed:
$$\sum_{j=1}^k c_j^T D c_j = \sum_{j=1}^k \sum_{i=1}^d (c_{ji})^2 \lambda_i = \sum_{i=1}^d \lambda_i \sum_{j=1}^k (c_{ji})^2 = \sum_{i=1}^d \lambda_i a_i$$
Since $u_1, \ldots, u_k$ is an orthonormal system and $S$ an orthogonal matrix, also $c_1, \ldots, c_k$ is an orthonormal system, which can be extended to an orthonormal basis $c_1, \ldots, c_d$ of $\mathbb{R}^d$. Consequently for $i \in \{1, \ldots, d\}$ it holds that:
$$a_i = \sum_{j=1}^k (c_{ji})^2 \leq \sum_{j=1}^d (c_{ji})^2 = 1$$
And furthermore by its definition it follows that $a_i \geq 0$, such that $a_i \in [0, 1]$. Besides this, the sum over all $a_i$ equates to:
$$\sum_{i=1}^d a_i = \sum_{i=1}^d \sum_{j=1}^k (c_{ji})^2 = \sum_{j=1}^k c_j^T c_j = k$$

With reference to the principal axis transformation, the eigenvectors of the covariance matrix are then termed principal components, and affine linear subspaces of the embedding space spanned by principal components are termed linear principal manifolds.

Definition (Linear Principal Manifold). For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be a random vector. Then a vector $c \in \mathbb{R}^d$ with $c \neq 0$ is a principal component for $X$, iff there exists a $\lambda \in \mathbb{R}$, such that:
$$\operatorname{Cov}(X) \cdot c = \lambda c \tag{3.12}$$
Furthermore let $L \subseteq \mathbb{R}^d$ be an affine linear subspace of $\mathbb{R}^d$ with dimension $k \leq d$. Then $L$ is a linear $k$-principal manifold for $X$, if there exists a set $c_1, \ldots, c_k$ of linearly independent principal components for $X$, such that:
$$L = \mathbb{E}(X) + \bigoplus_{i=1}^k \mathbb{R} c_i$$
Then $L$ is termed maximal, iff the sum of the eigenvalues $\lambda_1, \ldots, \lambda_k$ that correspond to the principal components $c_1, \ldots, c_k$ is maximal.
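In practice a maximal linear $k$-principal manifold is obtained from the eigendecomposition of the sample covariance matrix. The following sketch (assuming NumPy; the data is synthetic) selects the $k$ leading eigenvectors and confirms that the explained variance equals the sum of the corresponding eigenvalues, in line with lemma 5 and proposition 6 below:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 100_000, 4, 2

X = rng.multivariate_normal(np.zeros(d), np.diag([5.0, 3.0, 1.0, 0.5]), size=n)

C = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort descending
U = eigvecs[:, order[:k]]                     # k leading principal components

# Project onto the maximal linear k-principal manifold through the mean.
Xc = X - X.mean(axis=0)
explained = np.trace(np.cov((Xc @ U @ U.T).T))
print(explained, eigvals[order[:k]].sum())    # agree up to sampling error
```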
Proposition 6. For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be an elliptically distributed random vector and $L \subseteq \mathbb{R}^d$ an affine linear subspace. Then the following statements are equivalent:
(i) $L$ minimizes the Euclidean distance to $X$
(ii) $L$ is a maximal linear principal manifold for $X$

Proof. "$\Longrightarrow$" Let $\pi_L\colon \mathbb{R}^d \to L$ denote the orthogonal projection of $\mathbb{R}^d$ onto $L$. Then according to proposition 3, $L$ minimizes the averaged Euclidean distance to $X$ if and only if (i) $\mathbb{E}(X) \in L$ and (ii) $L$ maximizes the variance $\operatorname{Var}(\pi_L(X))$. In particular (i) is satisfied if and only if an orthonormal basis $u_1, \ldots, u_k \in \mathbb{R}^d$ can be chosen such that:
$$L = \mathbb{E}(X) + \bigoplus_{i=1}^k \mathbb{R} u_i$$
Then according to lemma 5 there exist numbers $a_1, \ldots, a_d \in [0, 1]$ with $\sum_{i=1}^d a_i = k$, such that:
$$\operatorname{Var}(\pi_L(X)) = \sum_{i=1}^d \lambda_i a_i$$
Thereupon (ii) is satisfied if and only if the numbers $a_i$ maximize this sum. Since the covariance matrix $\operatorname{Cov}(X)$ is positive semi-definite, the eigenvalues $\lambda_i$ are non-negative, such that for eigenvalues sorted in descending order the sum is maximized for:
$$a_i = \begin{cases} 1 & \text{for } i \in \{1, \ldots, k\} \\ 0 & \text{else} \end{cases}$$
Such that:
$$\sum_{j=1}^k u_j^T \operatorname{Cov}(X)\, u_j = \sum_{i=1}^d \lambda_i a_i = \sum_{i=1}^k \lambda_i = \sum_{j=1}^k c_j^T \operatorname{Cov}(X)\, c_j$$
Accordingly the choice $u_j = c_j$ for $j \in \{1, \ldots, k\}$ maximizes $\operatorname{Var}(\pi_L(X))$ and $L$ has a representation given by:
$$L = \mathbb{E}(X) + \bigoplus_{i=1}^k \mathbb{R} c_i$$
"$\Longleftarrow$" Let $L$ have a representation as given by (ii); then (1) $\mathbb{E}(X) \in L$ and (2) the variance $\operatorname{Var}(\pi_L(X))$ is maximized. According to proposition 3 it follows that $L$ minimizes the Euclidean distance to $X$.

Definition ($L$-Correlation). For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be a random vector and $L$ a maximal linear principal manifold for $X$. Then for any $i, j \in \{1, \ldots, d\}$ let the $L$-correlation between $X_i$ and $X_j$ be defined by:
$$\rho_{X_i, X_j \mid L} := R_i R_j \tag{3.13}$$
where, with the orthogonal projection $\pi_L\colon \mathbb{R}^d \to L$, for any $i \in \{1, \ldots, d\}$ the reliability of $X_i$ with respect to $L$ is given by:
$$R_i := 1 - \frac{\operatorname{Var}_i(X - \pi_L(X))}{\operatorname{Var}_i(X)} \tag{3.14}$$
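The empirical $L$-correlation can be computed directly from a PCA projection. The following sketch is offered under the stated definitions (assuming NumPy; the helper name `l_correlation` and the synthetic data are illustrative choices, not part of the paper):

```python
import numpy as np

def l_correlation(X, k):
    """Empirical L-correlation matrix w.r.t. the top-k linear
    principal manifold, following equations 3.13 and 3.14."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    resid = Xc - Xc @ U @ U.T                 # X - pi_L(X)
    # Componentwise reliabilities R_i (equation 3.14).
    R = 1 - resid.var(axis=0) / Xc.var(axis=0)
    return np.outer(R, R)                     # rho_{X_i, X_j | L} = R_i R_j

rng = np.random.default_rng(6)
S = rng.normal(size=(50_000, 1))
X = S @ np.array([[1.0, 0.8, -0.5]]) + rng.normal(scale=0.3, size=(50_000, 3))
print(np.round(l_correlation(X, k=1), 2))
```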
Proposition 7. For an elliptically distributed random vector the $L$-correlation generalizes the Pearson correlation to maximal linear principal manifolds.

Proof. Let $X\colon \Omega \to \mathbb{R}^2$ be an elliptically distributed random vector and $L$ a maximal linear $1$-principal manifold for $X$. Then for $i \in \{1, 2\}$ the random error of the variable $X_i$ has the variance:
$$\eta_{X_i}^2 = \operatorname{Var}_{X_i}(X - \pi_L(X)), \qquad R_i = 1 - \frac{\eta_{X_i}^2}{\sigma_{X_i}^2}$$
Consequently:
$$\rho_{X_i, X_j \mid L} = \left(1 - \frac{\eta_{X_i}^2}{\sigma_{X_i}^2}\right) \left(1 - \frac{\eta_{X_j}^2}{\sigma_{X_j}^2}\right)$$
With $n \in \mathbb{N}$ i.i.d. realizations $\mathbf{x} \in \mathbb{R}^{n \times 2}$ of $X$, an empirical $L$-correlation $\rho_{\mathbf{x}_i, \mathbf{x}_j \mid L}$ is then given by replacing the variances by the sample variances. Then by proposition 2 it follows that:
$$\rho_{\mathbf{x}_i, \mathbf{x}_j}^2 \xrightarrow{P} \rho_{\mathbf{x}_i, \mathbf{x}_j \mid L}, \quad \text{for } n \to \infty$$

Linear principal manifolds allow the projection of a random vector $X\colon \Omega \to \mathbb{R}^d$ onto a linear subspace $L \subseteq \mathbb{R}^d$, which maximally preserves the linear dependency structure of $X$ in terms of its covariances. Thereby for the orthogonal projection $\pi_L\colon \mathbb{R}^d \to L$, the variance on $L$, given by $\operatorname{Var}(\pi_L(X))$, is referred to as the explained variance and the orthogonal deviation $\operatorname{Var}(X - \pi_L(X))$ as the unexplained variance. Thereupon, by the assumption that $X$ is elliptically distributed, it can be concluded that linear independence coincides with statistical independence, so that $\pi_L(X)$ and $X - \pi_L(X)$ are statistically independent and therefore allow the following decomposition:
$$\underbrace{\operatorname{Var}(X)}_{\text{total variance}} = \underbrace{\operatorname{Var}(\pi_L(X))}_{\text{explained variance}} + \underbrace{\operatorname{Var}(X - \pi_L(X))}_{\text{unexplained variance}}$$
This decomposition, as shown by proposition 7, is of fundamental importance for the correlation over linear principal manifolds, since it determines the reliability of the respective random variable $X_i$ by the ratio:
$$R_i = 1 - \frac{\text{unexplained variance}}{\text{total variance}}$$
At this point of the discussion it is just a small step to generalize the principal components by smooth curves $\gamma\colon [a, b] \to \mathbb{R}^d$ (figure 4.1). This is particularly appropriate if the assumption of an elliptical distribution can hardly be justified, as for observed dynamical systems. Thereby the evolution function generates a smooth submanifold $M \subseteq \mathbb{R}^d$ within the observation space $\mathbb{R}^d$, and an "error free" observation can be identified by a random vector $X^\star$ with outcomes on $M$. Additionally, however, the observation function may be regarded to be subjected to a measurement error $\varepsilon$. By the assumption that $\varepsilon$ has an elliptical distribution, the distribution of the observable random vector $X$ is represented by an elliptical $M$-distribution.

Figure 4.1: Principal curve $\gamma$ for a 2-dimensional realization $\mathbf{x}$.

Definition ($M$-Distribution). For $d \in \mathbb{N}$ let $X^\star\colon \Omega \to \mathbb{R}^d$ be a random vector and $M \subseteq \mathbb{R}^d$ a smooth
$k$-submanifold of $\mathbb{R}^d$ with $k \leq d$. Then $X^\star$ is $M$-distributed, iff for the probability density $P$, which is induced by $X^\star$, it holds that:
$$P(X^\star = x) > 0 \iff x \in M \tag{4.1}$$
Thereupon a random vector $X\colon \Omega \to \mathbb{R}^d$ is elliptically $M$-distributed, iff there exists an $M$-distributed random vector $X^\star\colon \Omega \to \mathbb{R}^d$ and an elliptically distributed random error $\varepsilon\colon \Omega \to \mathbb{R}^d$, such that:
$$X \sim X^\star + \varepsilon \tag{4.2}$$
Figure 4.2: Elliptical $M$-distribution in two dimensions.

The assumption that the observed random vector $X$ is elliptically $M$-distributed is very general, but allows an estimation of $M$ by minimizing the averaged Euclidean distance to $X$. Thereby the tangent spaces $T_x M$ have a basis given by $k$ principal components of local infinitesimal covariances, such that the remaining $d - k$ principal components describe the normal space $N_x M$, which is orthogonal to the tangent space $T_x M$. Since $T_x M$ and $N_x M$ are equipped with an induced Riemannian metric, which is simply given by the standard scalar product, there exists a minimal orthogonal projection $\pi_M\colon \mathbb{R}^d \hookrightarrow M$ that maps any realization $x$ of $X$ to a closest point on $M$. Then proposition 3 motivates properties for $M$ to minimize the averaged Euclidean distance to realizations of $X$. This provides the definition of smooth $k$-principal manifolds (Hastie et al. 1989, p. 513).

Definition (Principal Manifold). For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be a random vector, $M \subseteq \mathbb{R}^d$ a (smooth) $k$-submanifold of $\mathbb{R}^d$ with $k \leq d$ and $\pi_M\colon \mathbb{R}^d \hookrightarrow M$ a minimal orthogonal projection onto $M$. Then $M$ is a (smooth) $k$-principal manifold for $X$, iff for all $x \in M$ it holds that:
$$\mathbb{E}\left( X \mid X \in \pi_M^{-1}(x) \right) = x \tag{4.3}$$
Furthermore $M$ is termed maximal, iff $M$ maximizes the explained variance $\operatorname{Var}(\pi_M(X))$.

By extending the local properties of the tangent spaces to the underlying manifold, it can be concluded by propositions 3 and 6 that maximal principal manifolds minimize the Euclidean distance to $X$. Intuitively this can be understood as follows: The principal manifold property assures that:
$$\operatorname{Var}(X) = \operatorname{Var}(\pi_M(X)) + \operatorname{Var}(X - \pi_M(X))$$
Consequently the choice of $M$ maximizes $\operatorname{Var}(\pi_M(X))$ if and only if it minimizes $\operatorname{Var}(X - \pi_M(X))$, which equals the variance of the error and therefore the Euclidean distance. At closer inspection, however, it turns out that, in difference to linear principal manifolds, the maximization problem is ill-defined for arbitrary smooth principal manifolds, since for any finite number of realizations trivial solutions can be found by smooth principal manifolds that interpolate the realizations and therefore provide a perfect explanation. In order to close this gap, further structural assumptions have to be incorporated, either by a parametric family $\{f_\theta\}_{\theta \in \Theta}$ that restricts the possible solutions, or by a regularization, as given in the elastic map algorithm, which penalizes long distances and strong curvature (Gorban et al. 2008). Due to the complexity of this topic, however, it is left to the second chapter, where energy based models are used to overcome this deficiency. In the following, the generalization of the correlation to smooth principal manifolds is for convenience defined with respect to a principal manifold $M$ which is maximal "with respect to appropriate restrictions".

Definition (Riemann-Pearson Correlation). For $d \in \mathbb{N}$ let $X\colon \Omega \to \mathbb{R}^d$ be a random vector, $M$ a smooth principal manifold for $X$, which is maximal "with respect to appropriate restrictions", and $\pi_M\colon \mathbb{R}^d \to M$ a minimal orthogonal projection. Then for any $i, j \in \{1, \ldots, d\}$ the Riemann-Pearson correlation between $X_i$ and $X_j$ is given by:
$$\rho_{X_i, X_j \mid M} := R_i R_j \int_M S_{i,j}(x)\, S_{j,i}(x)\, dP_M \tag{4.4}$$
where for $i \in \{1, \ldots, d\}$ the reliability of $X_i$ with respect to $M$ is given by:
$$R_i := 1 - \frac{\operatorname{Var}_i(X - \pi_M(X))}{\operatorname{Var}_i(X)} \tag{4.5}$$
and for $i, j \in \{1, \ldots, d\}$ the local sensitivity of $X_i$ with respect to $X_j$ by:
$$S_{i,j}(x) := \left. \frac{\partial}{\partial x_j} \left( x - \pi_M(x) \right) \right|_i \tag{4.6}$$
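To make the definition concrete, the following sketch (assuming NumPy; the unit circle as manifold, nearest-point projection, finite-difference sensitivities and Monte Carlo integration are all illustrative choices, not the paper's algorithm) approximates the Riemann-Pearson correlation for a noisy circle in $\mathbb{R}^2$:

```python
import numpy as np

rng = np.random.default_rng(7)

def pi_M(x):
    """Minimal orthogonal projection onto the unit circle M."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Noisy observations of an M-distributed vector (M = unit circle).
phi = rng.uniform(0, 2 * np.pi, size=20_000)
X = np.stack([np.cos(phi), np.sin(phi)], axis=1)
X += rng.normal(scale=0.1, size=X.shape)

# Reliabilities R_i (equation 4.5), componentwise.
resid = X - pi_M(X)
R = 1 - resid.var(axis=0) / X.var(axis=0)

def sensitivity(x, i, j, h=1e-5):
    """Finite-difference estimate of S_ij (equation 4.6)."""
    e = np.zeros_like(x)
    e[j] = h
    return ((x + e - pi_M(x + e)) - (x - e - pi_M(x - e)))[i] / (2 * h)

# Monte Carlo estimate of the integral over M w.r.t. the projected points.
pts = pi_M(X[:500])
integral = np.mean([sensitivity(p, 0, 1) * sensitivity(p, 1, 0) for p in pts])
print(R[0] * R[1] * integral)   # approximate Riemann-Pearson correlation
```

For this particular manifold the Jacobian of the residual $x - \pi_M(x)$ at a point $p \in M$ equals $pp^T$, so the integrand reduces to $(p_1 p_2)^2$; the code merely estimates this by finite differences and sample averaging.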
Proposition 8. For an elliptically $M$-distributed random vector $X$ the Riemann-Pearson correlation generalizes the $L$-correlation to smooth principal manifolds.

Proof. Let $L$ be a maximal linear principal manifold for $X\colon \Omega \to \mathbb{R}^d$ and $x$ a realization of $X$; then there exists a $\beta \in \mathbb{R}$ with:
$$S_{i,j}(x) = \left. \frac{\partial}{\partial x_j} \left( x - \pi_L(x) \right) \right|_i \equiv \beta$$
Furthermore for $\beta \neq 0$:
$$S_{j,i}(x) = \left. \frac{\partial}{\partial x_i} \left( x - \pi_L(x) \right) \right|_j \equiv \frac{1}{\beta}$$
Such that $S_{i,j}(x)\, S_{j,i}(x) = 1$. Consequently for $M = L$ it follows that:
$$\rho_{X_i, X_j \mid M} = R_i R_j \int_M S_{i,j}(x)\, S_{j,i}(x)\, dP_M = R_i R_j \int_M dP_M = R_i R_j = \rho_{X_i, X_j \mid L}$$

References

[1] Shurong Zheng, Ning-zhong Shi, and Zhengjun Zhang. Generalized measures of correlation.
Manuscript, pages 1-45, 2010.

[2] Karl Pearson. On lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.

[3] J. Kenney and E. Keeping. Linear regression and correlation.
Mathematics of Statistics, 1:252-285, 1962.

[4] G. W. Snedecor and W. G. Cochran.
Statistical Methods. Iowa State University Press, Ames, 6th edition, 1967.

[5] Trevor Hastie and Werner Stuetzle. Principal curves.
Journal of the American Statistical Association, 84(406):502-516, 1989.

[6] Alexander N. Gorban and Andrei Y. Zinovyev. Principal graphs and manifolds, 2008.