Nonparametric independence tests in metric spaces: What is known and what is not
Fernando Castro-Prado
University of Santiago de Compostela and Health Research Institute, Santiago de Compostela, Spain.
E-mail: [email protected]
Wenceslao González-Manteiga
University of Santiago de Compostela, Santiago de Compostela, Spain.
Summary. Distance correlation is a recent extension of Pearson's correlation that characterises general statistical independence between Euclidean-space-valued random variables, not only linear relations. This review delves into how and when distance correlation can be extended to metric spaces, combining the information that is available in the literature with some original remarks and proofs, in a way that is comprehensible for any mathematical statistician.
Keywords: Distance correlation; Association measures; Nonparametric statistics
1. Introduction
The energy of data (Székely and Rizzo, 2017) and all the mathematical statistics that stems from it, including the characterisation of independence in Euclidean spaces (§ 2) and many other interesting results (Székely and Rizzo, 2009, 2010, 2013), have a very strong and well-established theoretical basis (Bakirov et al., 2006; Székely et al., 2007; Székely and Rizzo, 2017).

Nevertheless, the article (Lyons, 2013) that introduces distance correlation in metric spaces leaves a surprising amount of details to the reader (Jakobsen, 2017, p. 2). The elision of so many intermediate steps meant that, for several years, it went unnoticed that most of the theory was incorrect (Lyons, 2018). Such mistakes were largely discovered by Jakobsen (2017), who devoted 150 pages to going through and correcting the glitches of the original 10-page paper.

The goal of the present review is to present a corrected version of Lyons' theory, by summarising and explaining the work by Jakobsen (2017) and by adding a few original proofs, all of this taking into account the recent corrigendum of the original article (Lyons, 2018). In addition, the reader will be provided with a gentle introduction to the abstract mathematical concepts that this theory requires. Thus, for the first time, a clear and concise bottom-up explanation of the theory of distance correlation in metric spaces is available to the scientific community.
2. Distance correlation in Euclidean spaces
When two random elements (vectors) $X$ and $Y$ are Euclidean-space-valued (let $X$ be $L$-dimensional and $Y$ be $M$-dimensional, for $L, M \in \mathbb{Z}^{+}$), it is possible to define an association measure that characterises their independence, called distance correlation (Székely et al., 2007). Firstly, distance covariance should be defined, as a certain norm of the difference between the joint characteristic function and the product of the marginal ones:
\[ \operatorname{dCov}(X,Y) := \|\varphi_{X,Y} - \varphi_X \varphi_Y\|_w \equiv \sqrt{\int_{\mathbb{R}^L \times \mathbb{R}^M} |\varphi_{X,Y}(t,s) - \varphi_X(t)\,\varphi_Y(s)|^2\, w(t,s)\, \mathrm{d}t\, \mathrm{d}s}\,; \]
where $w$ is a weight function which depends on the dimensions of the Euclidean spaces in which the supports of $X$ and $Y$ are contained (and which enjoys a uniqueness property [Székely and Rizzo, 2012]):
\[ w(t,s) := \frac{\Gamma\!\left(\frac{L+1}{2}\right)}{(\|t\|\sqrt{\pi})^{L+1}} \cdot \frac{\Gamma\!\left(\frac{M+1}{2}\right)}{(\|s\|\sqrt{\pi})^{M+1}}, \qquad (t,s) \in \mathbb{R}^L \times \mathbb{R}^M. \]
And, as usual:
\[ \varphi_X(t) := \mathrm{E}\left[e^{i\langle t, X\rangle}\right],\ t \in \mathbb{R}^L; \qquad \varphi_Y(s) := \mathrm{E}\left[e^{i\langle s, Y\rangle}\right],\ s \in \mathbb{R}^M. \]
Logically, distance correlation is defined as the quotient of the covariance and the product of the standard deviations, and so it has no sign:
\[ \operatorname{dCor}(X,Y) := \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dCov}(X,X)\,\operatorname{dCov}(Y,Y)}}\,, \]
whenever $\operatorname{dCov}(X,X)\operatorname{dCov}(Y,Y) \ne 0$. If $\operatorname{dCov}(X,X)\operatorname{dCov}(Y,Y) = 0$, then $\operatorname{dCor}(X,Y) := 0$.

The reasons why distance correlation is an improved version of the squared (Pearson's) correlation are:
• It has values in $[0,1]$. This is unsurprising: $\mathbb{R}$ is totally ordered and, as such, one can only move "leftwards" or "rightwards", so the sign of (Pearson's) correlation expresses this structure. However, this notion is not valid in Euclidean spaces of arbitrary dimensionality.
• It is zero if and only if $X$ and $Y$ are independent (thus, its interest).

Notwithstanding the convoluted initial definition of $\operatorname{dCor}$, its sample version can easily be computed. Given a paired sample
\[ (X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} (X,Y); \]
let $a_{ij} := d(X_i, X_j)$ for $i, j \in [1,n] \cap \mathbb{Z}$. Using this notation, the doubly centred distances are:
\[ A_{ij} := a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}\,. \]
If $\{b_{ij}\}_{i,j}$ and $\{B_{ij}\}_{i,j}$ are analogously defined for $\{Y_i\}_i$, the empirical distance covariance is simply the nonnegative real number whose square is:
\[ \widehat{\operatorname{dCov}}_n^2(X,Y) := \frac{1}{n^2} \sum_{i,j=1}^{n} A_{ij} B_{ij}\,, \]
so that it is, indeed, a correlation of distances.

The above estimator comes from the alternative definition of $\operatorname{dCov}$ derived by Székely and Rizzo (2009):
\[ \operatorname{dCov}^2(X,Y) = \mathrm{E}[d(X,X')\,d(Y,Y')] + \mathrm{E}[d(X,X')]\,\mathrm{E}[d(Y,Y')] - 2\,\mathrm{E}[d(X,X')\,d(Y,Y'')]\,, \]
which is valid as long as first-order moments are finite. Primed letters refer to independent and identically distributed copies of the corresponding random element.

Whenever $X$ and $Y$ are independent and have finite first moments, the asymptotic distribution of a scaled version of the preceding statistic is a linear combination of independent chi-squared variables with one degree of freedom. More precisely:
\[ n\,\widehat{\operatorname{dCov}}_n^2(X,Y) \overset{\mathcal{D}}{\underset{n\to\infty}{\longrightarrow}} \sum_{j=1}^{\infty} \lambda_j Z_j^2\,, \]
where $\{Z_j\}_j$ are i.i.d. $\mathrm{N}(0,1)$ and $\{\lambda_j\}_j \subset \mathbb{R}^{+}$. Unfortunately, this null distribution is not useful in practice.

Instead, it is resampling techniques that should be used. The most sensible choice when it comes to approximating the null distribution of the test statistic is to base the design of the resampling scheme on the information that $H_0$ provides, which in this case (i.e., independence) leads to permutation tests.
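Since the empirical statistic involves nothing beyond the two matrices of pairwise distances, the whole test fits in a few lines. The following Python sketch (NumPy only) is ours and purely illustrative: the helper names double_centre, dcov2_sample, dcor_sample and perm_test are not from the literature, and a real analysis would rather rely on a dedicated implementation, such as the R package energy.
\begin{verbatim}
import numpy as np

def double_centre(D):
    # A_ij = a_ij - rowmean_i - colmean_j + grand mean, as in the text
    return D - D.mean(axis=1, keepdims=True) - D.mean(axis=0) + D.mean()

def dcov2_sample(DX, DY):
    # Squared sample distance covariance from two n x n distance matrices
    return (double_centre(DX) * double_centre(DY)).mean()

def dcor_sample(X, Y):
    # Sample distance correlation for Euclidean data (rows = observations)
    DX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    v_xy = dcov2_sample(DX, DY)
    v_xx, v_yy = dcov2_sample(DX, DX), dcov2_sample(DY, DY)
    if v_xx * v_yy == 0:
        return 0.0                      # degenerate case: dCor := 0
    return np.sqrt(max(v_xy, 0.0) / np.sqrt(v_xx * v_yy))

def perm_test(X, Y, n_perm=999, seed=0):
    # Permutation test of H0: independence, with dCor as test statistic
    rng = np.random.default_rng(seed)
    t0 = dcor_sample(X, Y)
    t_perm = [dcor_sample(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    return t0, (1 + sum(t >= t0 for t in t_perm)) / (1 + n_perm)

# Uncorrelated but dependent data: Pearson correlation ~ 0, dCor clearly positive
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = x ** 2 + 0.1 * rng.normal(size=(200, 1))
print(perm_test(x, y))                  # large statistic, small p-value
\end{verbatim}
The permutation scheme in perm_test is precisely the $H_0$-driven resampling advocated in the previous paragraph.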
3. Context and notations
Let $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ be two arbitrary separable metric spaces (the need for separability is dealt with in 3.2). The random element $Z = (X, Y)$ is defined on $(\Omega, \mathcal{F}, \mathrm{P})$ and takes values in $\mathcal{X} \times \mathcal{Y}$, with its distribution being $\theta: \mathcal{B}(\mathcal{X} \times \mathcal{Y}) \to [0, 1]$. The following notation will be used for the marginal distributions:
• $X \sim \mu := \theta \circ \pi_1^{-1}$, marginal over $\mathcal{X}$; where $\pi_1: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto x \in \mathcal{X}$.
• $Y \sim \nu := \theta \circ \pi_2^{-1}$, marginal over $\mathcal{Y}$; where $\pi_2: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto y \in \mathcal{Y}$.
Thus, the nonparametric test of independence for $X$ and $Y$ consists in testing $H_0: \theta = \mu \times \nu$ versus $H_1: \theta \ne \mu \times \nu$. For the sake of clarity, it is important to note that the product $\mu \times \nu$ is defined conventionally: it is the only measure on $\mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y})$ such that
\[ (\mu \times \nu)(A \times B) := \mu(A)\,\nu(B); \qquad A \in \mathcal{B}(\mathcal{X}),\ B \in \mathcal{B}(\mathcal{Y}). \]
The first perquisite of assuming the separability of $\mathcal{X}$ and $\mathcal{Y}$ is that, this way, the $\sigma$-algebra generated by their topological product is simply the product $\sigma$-algebra:
\[ \mathcal{B}(\mathcal{X} \times \mathcal{Y}) = \mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y}) := \sigma\{A \times B : A \in \mathcal{B}(\mathcal{X}),\ B \in \mathcal{B}(\mathcal{Y})\}. \]
This equality is useful by itself (e.g., it is crucial to the proof of lemma 3.10 in Jakobsen [2017]), but its most important corollary is that it guarantees that the metrics of the marginal spaces are jointly measurable: for $\mathcal{Z} \in \{\mathcal{X}, \mathcal{Y}\}$, $d_{\mathcal{Z}}$ is $\mathcal{B}(\mathcal{Z}) \otimes \mathcal{B}(\mathcal{Z}) / \mathcal{B}(\mathbb{R})$-measurable. This, in turn, is what ensures that the Lebesgue integrals that appear in the definition of distance covariance (§ 4) are well defined. A counterexample would be $\mathcal{X} := \mathbb{R}^{\mathbb{R}}$, equipped with the discrete metric. This is a particular case of Nedoma's pathology (see Schechter [1996, proposition 21.8] and Bogachev [2007, example 6.4.3] for further details), which states that the diagonal set $\{(x, x) : x \in \mathcal{X}\}$ is not in $\mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{X})$ when the cardinality of $\mathcal{X}$ is greater than that of the continuum.

Finally, separability is explicitly used in the proofs of some important properties of distance covariance (Jakobsen, 2017, theorem 4.4 and lemma 5.8), which indicates that it is not an ungodly hypothesis. The original article that presented distance correlation in metric spaces (Lyons, 2013) was oblivious to the crucial role of separability in the theory.

The map $\mu: \mathcal{B}(\mathcal{X}) \to \mathbb{R}$ is said to be a finite signed (Borel) measure, denoted $\mu \in M(\mathcal{X})$, if and only if $|\mu|$ is a finite measure. For each $\mu \in M(\mathcal{X})$, there is a Hahn–Jordan decomposition and it is essentially unique (Billingsley, 1995, theorem 32.1) or, in other words, it is possible to find a couple of nonnegative measures $\mu^{\pm} \in M(\mathcal{X})$ such that $\mu = \mu^{+} - \mu^{-}$, together with a partition of the space $\mathcal{X} = \mathcal{X}^{+} \sqcup \mathcal{X}^{-}$ satisfying:
\[ \mu^{+}(\mathcal{X}^{-}) = 0 = \mu^{-}(\mathcal{X}^{+}); \]
which is to say that $\mu^{+}$ and $\mu^{-}$ are orthogonal (mutually singular).

This allows one to naturally define (Lebesgue) integrals with respect to signed measures. For $f: \mathcal{X} \to \mathbb{R}$ measurable,
\[ \int_{\mathcal{X}} f \,\mathrm{d}\mu := \int_{\mathcal{X}} f \,\mathrm{d}\mu^{+} - \int_{\mathcal{X}} f \,\mathrm{d}\mu^{-}; \]
which is well defined whenever $f$ is integrable with respect to $|\mu| = \mu^{+} + \mu^{-}$.

On the other hand, it will also be necessary to integrate with respect to product measures. To begin with, consider $\nu \in M(\mathcal{Y})$, with Hahn–Jordan decomposition given by $(\mathcal{Y}^{\pm}, \nu^{\pm})$. Then:
• $\mu^{+} \times \nu^{+} + \mu^{-} \times \nu^{-}$ is a (nonnegative) measure with support $(\mathcal{X}^{+} \times \mathcal{Y}^{+}) \sqcup (\mathcal{X}^{-} \times \mathcal{Y}^{-})$;
• $\mu^{+} \times \nu^{-} + \mu^{-} \times \nu^{+}$ is a (nonnegative) measure with support $(\mathcal{X}^{+} \times \mathcal{Y}^{-}) \sqcup (\mathcal{X}^{-} \times \mathcal{Y}^{+})$.
Because of their disjoint supports, the aforementioned two measures are mutually singular and, consequently (Rudin, 1987, corollary of theorem 6.14), they form the Hahn–Jordan decomposition of $\mu \times \nu$:
\[ \mu \times \nu = (\mu^{+} \times \nu^{+} + \mu^{-} \times \nu^{-}) - (\mu^{+} \times \nu^{-} + \mu^{-} \times \nu^{+}). \]
Thus, the integral of a Borel-measurable function $h: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ with respect to $\mu \times \nu$ is:
\[ \int h \,\mathrm{d}(\mu \times \nu) = \int h \,\mathrm{d}(\mu^{+} \times \nu^{+}) + \int h \,\mathrm{d}(\mu^{-} \times \nu^{-}) - \int h \,\mathrm{d}(\mu^{+} \times \nu^{-}) - \int h \,\mathrm{d}(\mu^{-} \times \nu^{+}); \]
which entails that $L^{1}(\mu \times \nu)$ is the intersection of the four function spaces $L^{1}(\mu^{\pm} \times \nu^{\pm})$.

In the last equation, the integration sets were omitted, as it is superfluous to underscore that each of them is the largest possible one (in this case, $\mathcal{X} \times \mathcal{Y}$).
This notation abuse, taken from Lyons (2013), is among the few that will be used in the present paper, while the ones that caused mistakes and confusion in Lyons' article (and even in its corrigendum [Lyons, 2018]) will be avoided.

The last relevant remark about integration with respect to the product of signed measures is that it satisfies a generalised Fubini–Tonelli theorem (Bogachev, 2007, § 3.3):
\[ \forall h \in L^{1}(\mu \times \nu), \qquad \int h \,\mathrm{d}(\mu \times \nu) = \int\!\!\int h \,\mathrm{d}\mu \,\mathrm{d}\nu = \int\!\!\int h \,\mathrm{d}\nu \,\mathrm{d}\mu. \]

For the sake of clarity, it is convenient to state and prove the $c_r$-inequality. For any $\alpha, \beta, r \in \mathbb{R}^{+}$:
\[ (\alpha + \beta)^{r} \le c_{r}\,(\alpha^{r} + \beta^{r}), \qquad \text{where } c_{r} = \begin{cases} 1, & r < 1 \\ 2^{\,r-1}, & r \ge 1 \end{cases}. \]

Proof. (1) Let $r < 1$. The goal is to show that $(t+1)^{r} \le t^{r} + 1$, with $t := \alpha/\beta$, or, equivalently, that $f(t) := t^{r} + 1 - (t+1)^{r} \ge 0$. And the latter inequality holds because $r - 1 < 0$:
\[ \forall t \in \mathbb{R}^{+},\ f'(t) = r\left(t^{r-1} - (t+1)^{r-1}\right) > 0 \ \Rightarrow\ \forall t \in \mathbb{R}^{+},\ f(t) \ge f(0) = 0. \]
(2) For $r \ge 1$, the function $g(x) := x^{r}$ is convex at every $x \in \mathbb{R}^{+}$. When $r > 1$:
\[ g''(x) = r\,(r-1)\,x^{r-2} > 0, \qquad x \in \mathbb{R}^{+}. \]
Geometrically, convexity implies that:
\[ g\left(\frac{\alpha + \beta}{2}\right) \le \frac{g(\alpha) + g(\beta)}{2} \ \Leftrightarrow\ (\alpha + \beta)^{r} \le 2^{\,r-1}(\alpha^{r} + \beta^{r}). \]
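Although the proof above is complete, a quick numerical probe of both regimes of $c_r$ costs nothing. The snippet below (illustrative Python, with randomly drawn positive pairs) is a sanity check, not an argument:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.exponential(size=100_000), rng.exponential(size=100_000)
for r in (0.3, 0.5, 1.0, 2.0, 3.5):
    c_r = 1.0 if r < 1 else 2.0 ** (r - 1)   # the constant from the statement
    # multiplicative slack only guards against floating-point rounding
    assert np.all((a + b) ** r <= c_r * (a ** r + b ** r) * (1 + 1e-12))
print("c_r-inequality verified on random draws")
\end{verbatim}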
At this point, it is possible to introduce the concept of regularity of a signed measure: $\mu \in M(\mathcal{X})$ is said to have finite moments of order $r$, written $\mu \in M^{r}(\mathcal{X})$, if and only if
\[ \exists o \in \mathcal{X}, \qquad \int d_{\mathcal{X}}(o, x)^{r} \,\mathrm{d}|\mu|(x) < +\infty. \]
Applying the $c_r$-inequality, it is straightforward to see that, when the condition above holds, it does so for any origin:
\[ \mu \in M^{r}(\mathcal{X}) \ \Leftrightarrow\ \forall o \in \mathcal{X}, \quad \int d_{\mathcal{X}}(o, x)^{r} \,\mathrm{d}|\mu|(x) < +\infty. \]
In addition, a signed measure on a product of two spaces $\theta \in M(\mathcal{X} \times \mathcal{Y})$ is said to belong to $M^{r,r}(\mathcal{X} \times \mathcal{Y})$ if both of its marginals have finite moments of order $r$. Finally, the subindex 1 will be used as a notation for probability measures:
\[ M_{1}(\mathcal{X}) := \left\{\mu \in M(\mathcal{X}) : \mu \ge 0,\ \mu(\mathcal{X}) = 1\right\}; \qquad M_{1}^{r}(\mathcal{X}) := M^{r}(\mathcal{X}) \cap M_{1}(\mathcal{X}); \]
\[ M_{1}^{r,r}(\mathcal{X} \times \mathcal{Y}) := M^{r,r}(\mathcal{X} \times \mathcal{Y}) \cap M_{1}(\mathcal{X} \times \mathcal{Y}). \]
4. Formal definition of dcov
The previous section set up the theoretical framework in which speaking of distance covariance makes sense, thus solving some inconsistencies of Lyons (2013). This will make it possible to define the operator dcov rigorously, simplifying and illustrating the explanations by Jakobsen (2017).
In order to define dcov, it is important to keep in mind that:
\[ \forall \mu_{1}, \mu_{2} \in M^{1}(\mathcal{X}): \qquad d_{\mathcal{X}} \in L^{1}(\mu_{1} \times \mu_{2}). \]
This is a consequence of Fubini and the triangle inequality:
\[ \int d_{\mathcal{X}} \,\mathrm{d}(|\mu_{1}| \times |\mu_{2}|) \le \int d_{\mathcal{X}}(x, o) \,\mathrm{d}(|\mu_{1}| \times |\mu_{2}|)(x, x') + \int d_{\mathcal{X}}(o, x') \,\mathrm{d}(|\mu_{1}| \times |\mu_{2}|)(x, x') \]
\[ = |\mu_{2}|(\mathcal{X}) \int d_{\mathcal{X}}(x, o) \,\mathrm{d}|\mu_{1}|(x) + |\mu_{1}|(\mathcal{X}) \int d_{\mathcal{X}}(x', o) \,\mathrm{d}|\mu_{2}|(x') < +\infty. \]

The definition of distance covariance involves doubly centred distances (§ 4.3), but first the various expected values that are to appear should be checked to be well defined. For $\mu \in M^{1}(\mathcal{X})$, the following function maps each point $x \in \mathcal{X}$ to its expected distance to the random element $X \sim \mu$:
\[ a_{\mu}: \mathcal{X} \longrightarrow \mathbb{R}; \qquad x \longmapsto \int d_{\mathcal{X}}(x, x') \,\mathrm{d}\mu(x'). \]
Obviously, it is well defined. On top of that, it is $|\mu|(\mathcal{X})$-Lipschitzian (and, therefore, continuous):
\[ \forall x, x' \in \mathcal{X}: \quad |a_{\mu}(x) - a_{\mu}(x')| \le \int |d_{\mathcal{X}}(x, z) - d_{\mathcal{X}}(x', z)| \,\mathrm{d}|\mu|(z) \le \int d_{\mathcal{X}}(x, x') \,\mathrm{d}|\mu|(z) = |\mu|(\mathcal{X})\, d_{\mathcal{X}}(x, x'). \]
On the other hand, recalling 4.1, the integral $D(\mu)$ is always a real number:
\[ D(\mu) := \int a_{\mu} \,\mathrm{d}\mu = \int d_{\mathcal{X}} \,\mathrm{d}(\mu \times \mu). \]

The following four inequalities can easily be derived from the previous results, and they will be very useful hereinafter. For $\mu \in M_{1}^{1}(\mathcal{X})$ and $x, y \in \mathcal{X}$:
(a) $D(\mu) \le 2\, a_{\mu}(x)$;
(b) $D(\mu) \le a_{\mu}(x) + a_{\mu}(y)$;
(c) $d_{\mathcal{X}}(x, y) \le a_{\mu}(x) + a_{\mu}(y)$;
(d) $a_{\mu}(x) \le d_{\mathcal{X}}(x, y) + a_{\mu}(y)$.

Proof.
(a) $D(\mu) = \int d_{\mathcal{X}}(x', x'') \,\mathrm{d}(\mu \times \mu)(x', x'') \le \mu(\mathcal{X}) \int d_{\mathcal{X}}(x', x) \,\mathrm{d}\mu(x') + \mu(\mathcal{X}) \int d_{\mathcal{X}}(x, x'') \,\mathrm{d}\mu(x'') = 2\, a_{\mu}(x)$.
(b) Applying (a) to $x$ and to $y$ and adding side by side the resulting inequalities, one gets: $2\, D(\mu) \le 2\, a_{\mu}(x) + 2\, a_{\mu}(y)$.
(c) Integrate, with respect to $\mathrm{d}\mu(z)$, both sides of: $d_{\mathcal{X}}(x, y) \le d_{\mathcal{X}}(x, z) + d_{\mathcal{X}}(y, z)$.
(d) Idem to (c): $d_{\mathcal{X}}(x, z) \le d_{\mathcal{X}}(x, y) + d_{\mathcal{X}}(y, z)$.

For $\mu \in M_{1}^{1}(\mathcal{X})$, the doubly $\mu$-centred version of $d_{\mathcal{X}}$ is:
\[ d_{\mu}: \mathcal{X} \times \mathcal{X} \longrightarrow \mathbb{R}; \qquad (x_{1}, x_{2}) \longmapsto d_{\mathcal{X}}(x_{1}, x_{2}) - a_{\mu}(x_{1}) - a_{\mu}(x_{2}) + D(\mu). \]
This modification of $d_{\mathcal{X}}$ is not, in general, a metric; although it is always continuous (since $d_{\mathcal{X}}$, $a_{\mu}$, $\pi_{1}$ and $\pi_{2}$ are) and, in particular, Borel-measurable. Moreover, it is important to note that, when writing $d_{\mu}$, there is no explicit reference to the metric space over which this map is defined. Such an abuse of notation makes formulae easier to read and write without creating any misunderstanding. That is not the case for some abbreviations by Lyons, such as the usage of $d := d_{\mathcal{X}}$ and $d := d_{\mathcal{Y}}$, which mistakenly suggests that $\mathcal{X}$ and $\mathcal{Y}$ need to share the same metric structure, an unnecessary restriction for the theory that would render some interesting applications impossible.

The last remarkable property of $d_{\mu}$ is:
\[ \forall \mu, \mu_{1}, \mu_{2} \in M_{1}^{1}(\mathcal{X}): \qquad d_{\mu} \in L^{2}(\mu_{1} \times \mu_{2}). \]
Proof. In the first instance, it is convenient to justify that, for any $(x, y) \in \mathcal{X}^{2}$, $|d_{\mu}(x, y)| \le 2\, a_{\mu}(y)$. To see this, there are two cases to be considered:
• If $d_{\mu}(x, y) \ge 0$, it suffices to apply the inequalities in 4.2:
\[ |d_{\mu}(x, y)| = d_{\mu}(x, y) \overset{\text{(c)}}{\le} D(\mu) \overset{\text{(a)}}{\le} 2\, a_{\mu}(y). \]
• For $d_{\mu}(x, y) < 0$, the arguments of Jakobsen (2017, p. 10) make use of unnecessarily strong hypotheses. Instead, the following rationale:
\[ \forall z, t \in \mathcal{X}: \ d_{\mathcal{X}}(x, z) \le d_{\mathcal{X}}(x, y) + d_{\mathcal{X}}(y, t) + d_{\mathcal{X}}(t, z) \ \Rightarrow\ a_{\mu}(x) \le d_{\mathcal{X}}(x, y) + a_{\mu}(y) + D(\mu) \]
(integrating in $z$ and $t$ with respect to $\mu$) yields $|d_{\mu}(x, y)| = a_{\mu}(x) + a_{\mu}(y) - D(\mu) - d_{\mathcal{X}}(x, y) \le 2\, a_{\mu}(y)$.

Now, using the aforementioned inequality (and its symmetric counterpart $|d_{\mu}(x, y)| \le 2\, a_{\mu}(x)$), proving that $d_{\mu} \in L^{2}(\mu_{1} \times \mu_{2})$ turns out to be quite straightforward:
\[ \int d_{\mu}(x, y)^{2} \,\mathrm{d}(\mu_{1} \times \mu_{2})(x, y) \le \int 4\, a_{\mu}(x)\, a_{\mu}(y) \,\mathrm{d}(\mu_{1} \times \mu_{2})(x, y) \overset{\text{Fubini}}{=} 4 \int d_{\mathcal{X}}(x, z) \,\mathrm{d}(\mu_{1} \times \mu)(x, z) \int d_{\mathcal{X}}(y, z) \,\mathrm{d}(\mu_{2} \times \mu)(y, z) \overset{d_{\mathcal{X}} \in L^{1}}{<} +\infty. \]

dcov
The generalised distance covariance is defined as:
\[ \operatorname{dcov}(\theta) := \int_{(\mathcal{X} \times \mathcal{Y})^{2}} d_{\mu}(x, x')\, d_{\nu}(y, y') \,\mathrm{d}\theta^{2}\big((x, y), (x', y')\big), \qquad \theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y}); \]
where $\theta^{2} := \theta \times \theta$ and, once again, $\mu := \theta \circ \pi_{1}^{-1}$ and $\nu := \theta \circ \pi_{2}^{-1}$.

In order to check that dcov is well defined, it suffices to note that the integral of the product of two functions with respect to a (nonnegative) measure is always a scalar product (bilinear, positive semidefinite) and, as a result, it satisfies the Cauchy–Bunyakovsky–Schwarz inequality. It is also possible to prove this particular case of Hölder's inequality more directly, viewing $d_{\mu}$ and $d_{\nu}$ as functions on $(\mathcal{X} \times \mathcal{Y})^{2}$:
\[ 0 \le \int\!\!\int \left[d_{\mu}(v)\, d_{\nu}(w) - d_{\mu}(w)\, d_{\nu}(v)\right]^{2} \mathrm{d}\theta^{2}(v)\, \mathrm{d}\theta^{2}(w) = 2 \int d_{\mu}^{2} \,\mathrm{d}\theta^{2} \int d_{\nu}^{2} \,\mathrm{d}\theta^{2} - 2 \left(\int d_{\mu}\, d_{\nu} \,\mathrm{d}\theta^{2}\right)^{2} \]
\[ \overset{d_{\mu},\, d_{\nu} \in L^{2}}{\Longrightarrow} \quad |\operatorname{dcov}(\theta)| \le \sqrt{\int d_{\mu}^{2} \,\mathrm{d}\theta^{2} \int d_{\nu}^{2} \,\mathrm{d}\theta^{2}} < +\infty. \]
A third approach is to derive a particular case of the AM–GM inequality (and also of Young's):
\[ (d_{\mu} \pm d_{\nu})^{2} \ge 0 \ \Leftrightarrow\ d_{\mu}^{2} + d_{\nu}^{2} \ge \mp 2\, d_{\mu}\, d_{\nu} \ \Leftrightarrow\ d_{\mu}^{2} + d_{\nu}^{2} \ge 2\, |d_{\mu}\, d_{\nu}|. \]
Anyhow, the key step is to show that the integrals on the right-hand side are finite. For instance, in the case of $d_{\mu}$:
\[ \int d_{\mu}(x, x')^{2} \,\mathrm{d}\theta^{2}\big((x, y), (x', y')\big) \overset{\text{Fubini}}{=} \int\!\!\int d_{\mu}(x, x')^{2} \,\mathrm{d}\theta(x, y) \,\mathrm{d}\theta(x', y') \overset{\text{ACOV}}{=} \int d_{\mu}(x, x')^{2} \,\mathrm{d}(\mu \times \mu)(x, x') \overset{d_{\mu} \in L^{2}(\mu \times \mu)}{<} +\infty; \]
where the acronym "ACOV" stands for abstract change of variables, which in this case takes a projection as the change-of-variables function. More formally, let $f$ be a measurable function in the following diagram:
\[ (\mathcal{X} \times \mathcal{Y},\ \mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y}),\ \theta) \overset{\pi_{1}}{\longrightarrow} (\mathcal{X}, \mathcal{B}(\mathcal{X})) \overset{f}{\longrightarrow} (\mathbb{R}, \mathcal{B}(\mathbb{R})). \]
When $f \in L^{1}(\theta \circ \pi_{1}^{-1})$, the aforementioned ACOV theorem ensures that:
\[ \int_{\pi_{1}(\mathcal{X} \times \mathcal{Y})} f \,\mathrm{d}(\theta \circ \pi_{1}^{-1}) = \int_{\mathcal{X} \times \mathcal{Y}} (f \circ \pi_{1}) \,\mathrm{d}\theta \]
or, recalling that $\mu \overset{\text{def.}}{=} \theta \circ \pi_{1}^{-1}$:
\[ \int_{\mathcal{X}} f(x) \,\mathrm{d}\mu(x) = \int_{\mathcal{X} \times \mathcal{Y}} f(x) \,\mathrm{d}\theta(x, y). \]
The different integrability checks that have been conducted so far allow one to write dcov in terms of expected values.
Taking $X \sim \mu \in M_{1}^{1}(\mathcal{X})$ and $Y \sim \nu \in M_{1}^{1}(\mathcal{Y})$, with joint distribution $\theta := \mathrm{P} \circ (X, Y)^{-1}$, their distance covariance is given by:
\[ \operatorname{dcov}(X, Y) \overset{\text{abuse}}{:=} \operatorname{dcov}(\theta) = \mathrm{E}[d_{\mu}(X, X')\, d_{\nu}(Y, Y')] \]
\[ = \mathrm{E}\Big\{ \big( d_{\mathcal{X}}(X, X') - \mathrm{E}[d_{\mathcal{X}}(X, X') \mid X] - \mathrm{E}[d_{\mathcal{X}}(X, X') \mid X'] + \mathrm{E}[d_{\mathcal{X}}(X, X')] \big) \]
\[ \cdot \big( d_{\mathcal{Y}}(Y, Y') - \mathrm{E}[d_{\mathcal{Y}}(Y, Y') \mid Y] - \mathrm{E}[d_{\mathcal{Y}}(Y, Y') \mid Y'] + \mathrm{E}[d_{\mathcal{Y}}(Y, Y')] \big) \Big\}; \]
where primed letters refer to independent and identically distributed copies of the corresponding random element.

Finally, note that dcov is always an association measure, in the sense that it vanishes under independence:
\[ \operatorname{dcov}(\mu \times \nu) = \int d_{\mu}\, d_{\nu} \,\mathrm{d}(\mu \times \nu)^{2} \overset{\text{Fubini}}{=} \left( \int d_{\mathcal{X}} \,\mathrm{d}\mu^{2} - 2 \int a_{\mu} \,\mathrm{d}\mu + \int D(\mu) \,\mathrm{d}\mu \right) \left( \int d_{\mathcal{Y}} \,\mathrm{d}\nu^{2} - 2 \int a_{\nu} \,\mathrm{d}\nu + \int D(\nu) \,\mathrm{d}\nu \right) \]
\[ = \left[D(\mu) - 2\, D(\mu) + D(\mu)\right]\left[D(\nu) - 2\, D(\nu) + D(\nu)\right] = 0. \]
Moreover, under certain conditions, dcov is nonnegative and it can be rescaled into the interval $[0, 1]$ (see 6.1), becoming a normalised association measure (Bishop et al., 1975, pages 375–376).
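It is worth stressing that the expression above uses no coordinates whatsoever: an empirical (plug-in) version of $\mathrm{E}[d_{\mu}(X, X')\, d_{\nu}(Y, Y')]$ needs only the two matrices of pairwise distances, whichever the metric spaces are. The sketch below illustrates this in Python; the helper names are ours, and the choice of bit strings under the Hamming distance is our own arbitrary example of a non-Euclidean separable metric space:
\begin{verbatim}
import numpy as np

def dcov_plugin(DX, DY):
    # Empirical E[d_mu(X,X') d_nu(Y,Y')]: doubly centre each distance
    # matrix (a_mu ~ row/column means, D(mu) ~ grand mean) and average
    dmu = DX - DX.mean(1, keepdims=True) - DX.mean(0) + DX.mean()
    dnu = DY - DY.mean(1, keepdims=True) - DY.mean(0) + DY.mean()
    return (dmu * dnu).mean()

def hamming(Z):
    # Pairwise Hamming distances between the rows of a binary matrix
    return (Z[:, None, :] != Z[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 16))                 # random bit strings
Y = np.where(rng.random(X.shape) < 0.1, 1 - X, X)      # noisy copies: dependent
print(dcov_plugin(hamming(X), hamming(Y)))             # clearly positive
print(dcov_plugin(hamming(X), hamming(rng.permutation(Y))))  # close to zero
\end{verbatim}
This plug-in quantity is precisely the $V$-statistic that § 7 studies in detail.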
5. Distance covariance in negative type spaces
The fact that:
\[ \theta = \mu \times \nu \ \Rightarrow\ \operatorname{dcov}(\theta) = 0, \]
makes it natural to wonder which spaces ensure that the reciprocal implication also holds. The answer is: strong negative type spaces, since in them $\operatorname{dcov}(\theta)$ can be presented as an injective function of $\theta - \mu \times \nu$.

In order to explain this, negative type spaces will firstly be introduced (§ 5.1), as they are the ones in which dcov admits the aforementioned representation (although injectivity is not guaranteed). Then the strong version of this condition will be defined (§ 5.3) and a pivotal result will be put forward: strong negative type is not only a sufficient condition for dcov to characterise independence, but it is also a necessary one (with a little exception, by no means restrictive).

The concept of negative type is not a recent invention (Wilson, 1935) and it has recently been enjoying its "second youth": firstly, because of its role in computational algorithmics (Deza and Laurent, 1997, § 6.1; Naor, 2010) and, more recently, in relation to the energy of data (Székely and Rizzo, 2017). The metric space $(\mathcal{X}, d_{\mathcal{X}})$ is said to be of negative type if and only if:
\[ \forall n \in \mathbb{Z}^{+};\ \forall x, y \in \mathcal{X}^{n}: \qquad 2 \sum_{i,j=1}^{n} d_{\mathcal{X}}(x_{i}, y_{j}) \ge \sum_{i,j=1}^{n} \left[d_{\mathcal{X}}(x_{i}, x_{j}) + d_{\mathcal{X}}(y_{i}, y_{j})\right]. \]
The analytic expression above has the following geometrical interpretation: given $n$ red points and as many blue ones, twice the sum of the distances among the $n^{2}$ ordered pairs of different colours is not less than the corresponding sum for pairs of the same colour. Moreover, this condition can be stated in another way, which is apparently more general: the conditional negative definiteness of the metric. Both are actually equivalent (which can be checked by taking repetitions of the points and recalling that $\mathbb{Q}$ is dense in $\mathbb{R}$):
\[ \forall n \in \mathbb{N};\ \forall x \in \mathcal{X}^{n};\ \forall \alpha \in \mathbb{R}^{n},\ \sum_{i=1}^{n} \alpha_{i} = 0: \qquad \sum_{i,j=1}^{n} \alpha_{i} \alpha_{j}\, d_{\mathcal{X}}(x_{i}, x_{j}) \le 0. \]
This is not to say that negative type metric spaces are the ones in which the metric acts as a negative definite kernel (such as the ones thoroughly studied by Klebanov [2005] and Berg et al. [1984]). However, an equivalent definition in terms of the definiteness of a certain kernel exists. Namely, $(\mathcal{X}, d_{\mathcal{X}})$ is a negative type space if and only if there is a point $o \in \mathcal{X}$ such that the absolute antipodal divergence
\[ d_{o}(x, y) := d_{\mathcal{X}}(x, o) + d_{\mathcal{X}}(y, o) - d_{\mathcal{X}}(x, y), \qquad (x, y) \in \mathcal{X}^{2} \]
is positive definite.

There are many familiar examples of negative type spaces, like the Euclidean ones and, more generally, all Hilbert spaces (as will be explained in 5.2).

Now some results involving Hilbert spaces are to be presented. For the sake of simplicity, assume that the scalar field is $\mathbb{R}$ in every case; but, as a general rule, every statement that will be made is also true for $\mathbb{C}$, mutatis mutandis. This can be proven by realifying or complexifying (pages 132–135 of Jakobsen, 2017), according to the case. It will be necessary to integrate functions $f: \mathcal{X} \to H$ which have a Hilbert space as their codomain.
Had $\mathcal{X}$ not been assumed to be separable (see § 3.2), as in Lyons (2013), the spaces $H$ that arise later on would not necessarily be separable, which would only allow one to perform weak integration (Pettis, 1938), and not the strong one (Bochner, 1933). Given $\mu \in M(\mathcal{X})$, if $f$ is Pettis-integrable (or, specifically, scalarly $\mu$-integrable), the integral $I \in H$ is unambiguously defined by its commutativity with respect to every map of the dual space $H^{*}$:
\[ I = \int_{\mathcal{X}} f \,\mathrm{d}\mu \ \Leftrightarrow\ \forall h^{*}: H \to \mathbb{R} \text{ linear and continuous},\ h^{*}(I) = \int_{\mathcal{X}} (h^{*} \circ f) \,\mathrm{d}\mu. \]
Hereinafter, every Hilbert space that arises is going to be separable, which means that Pettis integrals are Bochner integrals.

After these technical remarks, Schoenberg's theorem (Schoenberg, 1937 and 1938) can be stated. It characterises negative type spaces $(\mathcal{X}, d_{\mathcal{X}})$ as those such that $(\mathcal{X}, \sqrt{d_{\mathcal{X}}})$ can be isometrically embedded into a Hilbert space:
\[ \exists H \text{ Hilbert space};\ \exists \varphi: \mathcal{X} \to H;\ \forall x, y \in \mathcal{X}: \qquad \|\varphi(x) - \varphi(y)\|_{H}^{2} = d_{\mathcal{X}}(x, y). \]
For a simple proof, using the absolute antipodal divergence (see 5.1), refer to Jakobsen (2017, theorem 3.7), which corrects Lyons (2013). Regardless of this, Schoenberg's theorem ensures that the separability of the original metric spaces (§ 3.2) is inherited by all the Hilbert spaces that arise.

Before the Hilbert space representation of dcov can be tackled, the barycentre operator has to be defined: given an isometric map $\varphi: (\mathcal{X}, \sqrt{d_{\mathcal{X}}}) \to H$ (like the one in the preceding theorem) and $\mu \in M_{1}^{1}(\mathcal{X})$, the following Pettis integral always exists:
\[ \beta_{\varphi}(\mu) := \int_{\mathcal{X}} \varphi \,\mathrm{d}\mu \in H; \]
and it is called barycentre, because it is the average of an $H$-field over $\mathcal{X}$ according to the distribution given by $\mu$ (thus resembling the geometrical idea of a gravity centre). In fact, if $X \sim \mu \in M_{1}^{1}(\mathcal{X})$, then $\beta_{\varphi}(\mu) = \mathrm{E}[\varphi(X)]$.
On the other hand, if $\psi: (\mathcal{Y}, \sqrt{d_{\mathcal{Y}}}) \to H_{2}$ is also isometric, the barycentre of the tensor product $\varphi \otimes \psi$ for $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ is defined as:
\[ \beta_{\varphi \otimes \psi}(\theta) := \int_{\mathcal{X} \times \mathcal{Y}} (\varphi \otimes \psi) \,\mathrm{d}\theta \in H_{1} \otimes H_{2}. \]
More importantly, if $(\mu, \nu)$ are the marginals of $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$, the following equality holds:
\[ \operatorname{dcov}(\theta) = 4\, \|\beta_{\varphi \otimes \psi}(\theta - \mu \times \nu)\|_{H_{1} \otimes H_{2}}^{2}. \]
In conclusion, dcov will characterise independence in those spaces in which the previous kernel is injective, which are going to be dealt with right below.

If $(\mathcal{X}, d_{\mathcal{X}})$ has negative type, one can derive the following inequality (whose proof is surprisingly long [Jakobsen, 2017, lemma 3.16]):
\[ \forall \mu_{1}, \mu_{2} \in M_{1}^{1}(\mathcal{X}): \qquad D(\mu_{1} - \mu_{2}) \le 0. \]
On top of that, if the operator $D$ separates probability measures (with finite first moments) on $(\mathcal{X}, d_{\mathcal{X}})$, that space is said to have strong negative type:
\[ D(\mu_{1} - \mu_{2}) = 0 \ \Leftrightarrow\ \mu_{1} = \mu_{2}. \]
The extended Schoenberg's theorem shows the equivalence of the strong negative type of $(\mathcal{X}, d_{\mathcal{X}})$ and the existence of an isometric map $\varphi: (\mathcal{X}, \sqrt{d_{\mathcal{X}}}) \to H$ such that $\beta_{\varphi}$ is injective. Furthermore, for $\mathcal{X}$ and $\mathcal{Y}$ of strong negative type, two isometric maps $\varphi: (\mathcal{X}, \sqrt{d_{\mathcal{X}}}) \to H_{1}$ and $\psi: (\mathcal{Y}, \sqrt{d_{\mathcal{Y}}}) \to H_{2}$ can be found so that $\beta_{\varphi \otimes \psi}: M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y}) \to H_{1} \otimes H_{2}$ is injective. As a result, whenever $\mathcal{X}$ and $\mathcal{Y}$ have strong negative type,
\[ \operatorname{dcov}(X, Y) = 0 \ \Leftrightarrow\ X, Y \text{ independent} \]
holds for any random element $Z = (X, Y): \Omega \to \mathcal{X} \times \mathcal{Y}$.

Thus, the strong negative type of the marginal spaces is a sufficient condition for the equivalence above to hold, but is it also necessary? The answer is yes, but with the exception of a "pathological" case. If $(\mathcal{Y}, d_{\mathcal{Y}})$ is not of strong negative type (symmetrically for $\mathcal{X}$), it is indeed possible to find $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ so that:
\[ \operatorname{dcov}(\theta) = 0 \quad \text{and, at the same time,} \quad \theta \ne (\theta \circ \pi_{1}^{-1}) \times (\theta \circ \pi_{2}^{-1}); \]
whenever $\min\{\#\mathcal{X}, \#\mathcal{Y}\} > 1$. Such a $\theta$ can be constructed as follows:
\[ \theta := \tfrac{1}{2}\,(\delta_{x_{1}} \times \nu_{1}) + \tfrac{1}{2}\,(\delta_{x_{2}} \times \nu_{2}); \]
where $\nu_{1}, \nu_{2}$ are two different measures in $M_{1}^{1}(\mathcal{Y})$ such that $D(\nu_{1} - \nu_{2}) = 0$, while $x_{1}, x_{2} \in \mathcal{X}$ are two distinct points. For each $x \in \mathcal{X}$, $\delta_{x} \in M_{1}(\mathcal{X})$ denotes the point mass at $x$: $\delta_{x}(\{x\}) = 1$.

This way, the aforementioned pathological case consists of one of the marginal spaces being a singleton. Such an exception is not a restriction because, whenever $\#\mathcal{Y} = 1$ (symmetrically for $\mathcal{X}$), $\operatorname{dcov} \equiv 0$ (since $d_{\nu} \equiv 0$) and every $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ is the product of its marginals. To see this last part, note that:
\[ \mathcal{Y} = \{y_{0}\} \ \Rightarrow\ \mathcal{B}(\mathcal{Y}) = \{\emptyset, \{y_{0}\}\} = \{\emptyset, \mathcal{Y}\}. \]
And consequently, for $B \in \mathcal{B}(\mathcal{Y})$ and any $A \in \mathcal{B}(\mathcal{X})$,
\[ \theta(A \times B) = \begin{cases} \theta(A \times \emptyset) = \theta(\emptyset) = 0 = \mu(A)\,\nu(\emptyset) \\ \theta(A \times \mathcal{Y}) = \theta\left[\pi_{1}^{-1}(A)\right] \equiv \mu(A) = \mu(A)\,\nu(\mathcal{Y}) \end{cases}; \]
and so $\theta = \mu \times \nu$. This analytical result is the formalisation of the intuitive notion that, if a random element $Y$ constantly takes a certain value, the observations of any other random element $X$ are bound to be independent of those of $Y$.

After the previous theoretical discussion, the interest of identifying practical examples of strong negative type spaces is clear. In this regard, for the scope of the present article (and for most real data applications), it suffices to know that all separable Hilbert spaces have strong negative type. Although this is an unsurprising result, its proof is by no means straightforward (Jakobsen, 2017, pages 49–60).
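On a finite configuration $x_{1}, \ldots, x_{n}$, negative type can actually be checked numerically: writing $J := I_{n} - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, the conditional negative definiteness of the distance matrix $\Delta$ is equivalent to $-\frac{1}{2} J \Delta J$ being positive semidefinite, and the spectral decomposition of the latter matrix realises Schoenberg's $\sqrt{d}$ embedding on the sample points. A sketch (illustrative Python; the helper names are ours):
\begin{verbatim}
import numpy as np

def gram_of_sqrt_embedding(Delta):
    # K = -0.5 * J Delta J is PSD iff the configuration has negative type;
    # it is then the Gram matrix of points with ||phi_i - phi_j||^2 = Delta_ij
    n = len(Delta)
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ Delta @ J

def is_negative_type(Delta, tol=1e-9):
    return np.linalg.eigvalsh(gram_of_sqrt_embedding(Delta)).min() >= -tol

def sqrt_embedding(Delta):
    # Rows are coordinates of phi(x_i) in a Euclidean (hence Hilbert) space
    w, V = np.linalg.eigh(gram_of_sqrt_embedding(Delta))
    return V * np.sqrt(np.clip(w, 0, None))

# Euclidean distances have negative type...
P = np.random.default_rng(0).normal(size=(10, 3))
D_euc = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
print(is_negative_type(D_euc))        # True

# ...but |x - y|^3 on the points {0, 1, 2} of the real line does not:
# alpha = (1, -2, 1) gives sum_ij alpha_i alpha_j d_ij = 2 (2^3 - 4) > 0
x = np.array([0.0, 1.0, 2.0])
print(is_negative_type(np.abs(x[:, None] - x[None, :]) ** 3))   # False
\end{verbatim}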
6. Distance correlation in metric spaces
dcor
As previously, let $(X, Y) \sim \theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ have marginals $(\mu, \nu)$, where $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ are two separable metric spaces. Then, the following inequalities hold:
\[ |\operatorname{dcov}(X, Y)| \le \sqrt{\operatorname{dvar}(X)\, \operatorname{dvar}(Y)} \le D(\mu)\, D(\nu); \]
where $\operatorname{dvar}(X) := \operatorname{dcov}(X, X)$. If, in addition, $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ have negative type:
\[ \operatorname{dcov}(X, Y) = 4\, \|\beta_{\varphi \otimes \psi}(\theta - \mu \times \nu)\|_{H_{1} \otimes H_{2}}^{2} \ge 0. \]
In this context, distance correlation (for metric spaces) is defined as:
\[ \operatorname{dcor}(X, Y) := \frac{\operatorname{dcov}(X, Y)}{\sqrt{\operatorname{dvar}(X)\, \operatorname{dvar}(Y)}} \in [0, 1], \]
whenever the denominator is nonzero. For nondegenerate cases, this will not be a matter of concern, for $\operatorname{dvar}(X)$ only reaches the extreme values of its range $[0, D(\mu)^{2}]$ when $\mu$ is concentrated on one or two points (respectively):
\[ \operatorname{dvar}(X) = 0 \ \Leftrightarrow\ \exists x_{0} \in \mathcal{X},\ \mu = \delta_{x_{0}} \ \text{``}\mu\text{-almost surely''}; \]
\[ \operatorname{dvar}(X) = D(\mu)^{2} \ \Leftrightarrow\ \exists x, x' \in \mathcal{X},\ \mu = \tfrac{1}{2}(\delta_{x} + \delta_{x'}) \ \text{``}\mu\text{-almost surely''}. \]
When $\operatorname{dvar}(X)\, \operatorname{dvar}(Y) = 0$, as in the Euclidean case, $\operatorname{dcor}(X, Y) := 0$.

dcor in Euclidean spaces
It has already been shown that dcor has range $[0, 1]$ and is zero if and only if there is independence, which recapitulates the property for Euclidean spaces (§ 2). Indeed, it is possible to prove (via the Hilbert space representations introduced in 5.2) that, when $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ are (finite-dimensional) Euclidean spaces, the notion of distance correlation of § 6.1 (Lyons, 2013) generalises the square of the one in § 2 (Székely et al., 2007):
\[ \operatorname{dcov}(X, Y) = \operatorname{dCov}(X, Y)^{2}; \qquad \operatorname{dcor}(X, Y) = \operatorname{dCor}(X, Y)^{2}. \]
For $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$, $\operatorname{dcov}(X, Y)$ can be written in terms of expectations of products of distances. By expanding it and simplifying, one can easily get the generalisation of Brownian distance covariance (Székely and Rizzo, 2009, theorems 7–8) to general metric spaces:
\[ \operatorname{dcov}(X, Y) = \mathrm{E}[d_{\mathcal{X}}(X, X')\, d_{\mathcal{Y}}(Y, Y')] + \mathrm{E}[d_{\mathcal{X}}(X, X')]\, \mathrm{E}[d_{\mathcal{Y}}(Y, Y')] - 2\, \mathrm{E}[d_{\mathcal{X}}(X, X')\, d_{\mathcal{Y}}(Y, Y'')]. \]
In conclusion, dcov satisfactorily extends the square of dCov.
7. Nonparametric test of independence in metric spaces
The following map will be key to the construction of the sample version of dcov:
\[ h: (\mathcal{X} \times \mathcal{Y})^{6} \longrightarrow \mathbb{R}; \qquad \big((x_{i}, y_{i})\big)_{i=1}^{6} \longmapsto f_{\mathcal{X}}(x_{1}, x_{2}, x_{3}, x_{4})\, f_{\mathcal{Y}}(y_{1}, y_{2}, y_{5}, y_{6}); \]
where, for $\mathcal{Z} \in \{\mathcal{X}, \mathcal{Y}\}$,
\[ f_{\mathcal{Z}}(z_{1}, z_{2}, z_{3}, z_{4}) := d_{\mathcal{Z}}(z_{1}, z_{2}) + d_{\mathcal{Z}}(z_{3}, z_{4}) - d_{\mathcal{Z}}(z_{1}, z_{3}) - d_{\mathcal{Z}}(z_{2}, z_{4}), \qquad z \in \mathcal{Z}^{4}. \]
The functions $f_{\mathcal{Z}}$ and $h$ are clearly measurable, and proving their integrability can be accomplished by sequentially deriving inequalities from the triangle inequality (see pages 148–150 of Jakobsen [2017] for the correction of the attempt by Lyons [2013]). Integrating these functions is pretty straightforward. Firstly, for $f_{\mathcal{X}}$:
\[ \int_{(\mathcal{X} \times \mathcal{Y})^{2}} f_{\mathcal{X}}(x_{1}, x_{2}, x_{3}, x_{4}) \,\mathrm{d}\theta^{2}\big((x_{3}, y_{3}), (x_{4}, y_{4})\big) \overset{\text{ACOV}}{=} d_{\mathcal{X}}(x_{1}, x_{2}) - a_{\mu}(x_{1}) - a_{\mu}(x_{2}) + D(\mu) \equiv d_{\mu}(x_{1}, x_{2}), \quad (x_{1}, x_{2}) \in \mathcal{X}^{2}; \]
where $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ has marginals $(\mu, \nu)$. Given that the same (mutatis mutandis) holds for $f_{\mathcal{Y}}$,
\[ \operatorname{dcov}(\theta) = \int_{(\mathcal{X} \times \mathcal{Y})^{2}} d_{\mu}(x_{1}, x_{2})\, d_{\nu}(y_{1}, y_{2}) \,\mathrm{d}\theta^{2}\big((x_{1}, y_{1}), (x_{2}, y_{2})\big) = \int_{(\mathcal{X} \times \mathcal{Y})^{6}} h \,\mathrm{d}\theta^{6}. \]
This means that, if $\big((X_{i}, Y_{i})\big)_{i=1}^{6}$ denotes a vector containing random elements that are independent and identically distributed as $(X, Y) \sim \theta$, then $\operatorname{dcov}(\theta) = \mathrm{E}\big[h\big(((X_{i}, Y_{i}))_{i=1}^{6}\big)\big]$ and, consequently, its sample version is a $V$-statistic, like the ones that Lyons (2013) derived (erroneously), as will be shown next.

For $n \in \mathbb{Z}^{+}$, the following notation will be used for the empirical measure associated with a certain sample $\{(X_{i}, Y_{i})\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} (X, Y) \sim \theta$:
\[ \theta_{n} := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_{i}, Y_{i})}: \Omega \longrightarrow M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y}). \]
A few routine computations yield that the natural estimator $\widehat{\operatorname{dcov}}(\theta) := \operatorname{dcov}(\theta_{n})$ is, unsurprisingly, the $V$-statistic with (nonsymmetric) kernel $h$:
\[ \operatorname{dcov}(\theta_{n}) = \frac{1}{n^{6}} \sum_{i_{1}=1}^{n} \cdots \sum_{i_{6}=1}^{n} h\big((X_{i_{\lambda}}, Y_{i_{\lambda}})_{\lambda=1}^{6}\big) \equiv V_{n}(h). \]
On the other hand, it is logical to consider the analogous $U$-statistic as an alternative estimator, which will be shown to require less stringent conditions than $\operatorname{dcov}(\theta_{n})$ in order to behave satisfactorily. For $n \ge 6$, let:
\[ \tilde{U}_{n}(h) := \frac{1}{6!\,\binom{n}{6}} \sum_{\substack{\{i_{\lambda}\}_{\lambda} \subset [1, n] \cap \mathbb{Z} \\ \text{different}}} h\big((X_{i_{\lambda}}, Y_{i_{\lambda}})_{\lambda=1}^{6}\big); \]
where the tilde indicates that this is not a $U$-statistic sensu stricto, but rather one built upon a kernel that is nonsymmetric. To correct this, let $\bar{h}$ be the symmetrisation of $h$:
\[ \bar{h}(z) := \frac{1}{6!} \sum_{\sigma \in S_{6}} h\big((z_{\sigma(j)})_{j=1}^{6}\big) \equiv \frac{1}{6!} \sum_{\sigma \in S_{6}} h(z_{\sigma}), \qquad z \in (\mathcal{X} \times \mathcal{Y})^{6}; \]
where $S_{6} := \{\sigma: [1, 6] \cap \mathbb{Z} \to [1, 6] \cap \mathbb{Z} : \sigma \text{ bijective}\}$ is the symmetric group of degree 6. So $\tilde{U}_{n}(h)$ is the $U$-statistic based on $\bar{h}$:
\[ \tilde{U}_{n}(h) = \frac{1}{\binom{n}{6}} \sum_{i_{1} < \ldots < i_{6}} \bar{h}\big((X_{i_{\lambda}}, Y_{i_{\lambda}})_{\lambda=1}^{6}\big). \]
In particular, Hoeffding's (1961) strong law of large numbers for $U$-statistics applies: as long as $\bar{h}$ is integrable, which holds for any $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$, the estimator $\tilde{U}_{n}(h)$ converges almost surely to $\operatorname{dcov}(\theta)$.
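Evaluating $V_{n}(h)$ from its definition costs $O(n^{6})$ operations but, by the same marginalisation of $f_{\mathcal{X}}$ and $f_{\mathcal{Y}}$ as above, it coincides exactly with the $O(n^{2})$ plug-in expression based on doubly centred distances. For a deliberately tiny $n$, this identity can even be verified by brute force (illustrative Python, in the same spirit as the earlier sketches):
\begin{verbatim}
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6                                   # n^6 = 46656 summands: still tractable
X, Y = rng.normal(size=(n, 2)), rng.normal(size=(n, 3))
DX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)

def f(D, i1, i2, i3, i4):
    # f_Z(z1, z2, z3, z4) = d(z1,z2) + d(z3,z4) - d(z1,z3) - d(z2,z4)
    return D[i1, i2] + D[i3, i4] - D[i1, i3] - D[i2, i4]

# Brute-force V-statistic with the order-6 kernel h
V = sum(f(DX, i1, i2, i3, i4) * f(DY, i1, i2, i5, i6)
        for i1, i2, i3, i4, i5, i6
        in itertools.product(range(n), repeat=6)) / n ** 6

# O(n^2) form: average of products of doubly centred distances
A = DX - DX.mean(1, keepdims=True) - DX.mean(0) + DX.mean()
B = DY - DY.mean(1, keepdims=True) - DY.mean(0) + DY.mean()
print(np.isclose(V, (A * B).mean()))    # True: V_n(h) = dcov(theta_n)
\end{verbatim}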
Lyons (2013) mistook the hypotheses of the aforementioned Hoeffding theorem for those of the SLLN for $V$-statistics (Giné and Zinn, 1992, page 274). The weakest conditions under which the SLLN for $V$-statistics holds in this context are: $\theta \in M_{1}^{5/3, 5/3}(\mathcal{X} \times \mathcal{Y})$ (Jakobsen, 2017, theorem 5.5). In other words, the finiteness of moments of order $5/3$ suffices to ensure asymptotic consistency:
\[ V_{n}(h) \overset{\text{a.s.}}{\underset{n \to \infty}{\longrightarrow}} \operatorname{dcov}(\theta). \]
If $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ is the product of its marginals $(\mu, \nu)$ and these are nondegenerate, the asymptotic distributions of the estimators introduced in 7.2 are:
\[ n\, V_{n}(h) \overset{\mathcal{D}}{\underset{n \to \infty}{\longrightarrow}} \sum_{i=1}^{\infty} \lambda_{i} (Z_{i}^{2} - 1) + D(\mu)\, D(\nu); \qquad n\, \tilde{U}_{n}(h) \overset{\mathcal{D}}{\underset{n \to \infty}{\longrightarrow}} \sum_{i=1}^{\infty} \lambda_{i} (Z_{i}^{2} - 1); \]
where $\{Z_{i}\}_{i \in \mathbb{N}^{*}} \overset{\text{i.i.d.}}{\sim} \mathrm{N}(0, 1)$ and where $\{\lambda_{i}\}_{i \in \mathbb{N}^{*}}$ are the eigenvalues (with multiplicity) of the linear operator $S: L^{2}(\theta) \to L^{2}(\theta)$ that maps $f$ into $S(f): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which is defined as:
\[ S(f)(x, y) := \int_{\mathcal{X} \times \mathcal{Y}} d_{\mu}(x, x')\, d_{\nu}(y, y')\, f(x', y') \,\mathrm{d}\theta(x', y'), \qquad (x, y) \in \mathcal{X} \times \mathcal{Y}. \]
The original attempt at proving the result for the $V$-statistic (Lyons, 2013) included some incorrect arguments to conclude that $\sum_{i=1}^{\infty} \lambda_{i} = D(\mu)\, D(\nu)$. Lyons (2018) states that the previous identity does hold as long as both marginal spaces have negative type, but the justification of this is somewhat abstruse. Were it true, it would yield the exact same asymptotic distribution that Székely et al. (2007) had derived.

Anyhow, this limit cannot be brought to practical usefulness (as in § 2), since the eigenvalues $\{\lambda_{i}\}_{i}$ depend on $\theta$ (unknown) and cannot be easily estimated. The most logical approach is, once again as in § 2, a resampling strategy. One way of arguing for this procedure would be to summon the results of Arcones and Giné (1992), which ensure that approximating the thresholds for the test statistic via the naïve bootstrap leads to a consistent resampling technique, as $\bar{h}$ satisfies the integrability condition required by those authors.

References
Arcones, M. Á. and Giné, E. (1992) On the bootstrap of U- and V-statistics. Annals of Statistics, 20, 655–674.
Bakirov, N. K.; Rizzo, M. L. and Székely, G. J. (2006) A multivariate nonparametric test of independence. Journal of Multivariate Analysis, 97, 1742–1756.
Berg, C.; Christensen, J. P. R. and Ressel, P. (1984) Harmonic analysis on semigroups. 1st edition. Springer.
Billingsley, P. (1995) Probability and measure. 3rd edition. John Wiley & Sons.
Bishop, Y. M. M.; Fienberg, S. E. and Holland, P. W. (1975) Discrete multivariate analysis: theory and practice. MIT Press.
Bochner, S. (1933) Integration von Funktionen, deren Werte die Elemente eines Vektorraumes sind. Fundamenta Mathematicae, 20, 262–276.
Bogachev, V. I. (2007) Measure theory (volumes 1–2). 1st edition. Springer.
Deza, M. M. and Laurent, M. (1997) Geometry of cuts and metrics. 1st edition. Springer.
Giné, E. and Zinn, J. (1992) Marcinkiewicz type laws of large numbers and convergence of moments for U-statistics. Chapter of Probability in Banach Spaces 8: Proceedings of the Eighth International Conference (pages 273–291). Springer.
Hoeffding, W. (1961) The strong law of large numbers for U-statistics. Institute of Statistics Mimeo Series, 302. https://repository.lib.ncsu.edu/handle/1840.4/2128
Jakobsen, M. E. (2017) Distance covariance in metric spaces: non-parametric independence testing in metric spaces. Master's thesis, University of Copenhagen. arXiv:1706.03490.
Klebanov, L. B. (2005) N-distances and their applications. The Karolinum Press.
Lyons, R. (2013) Distance covariance in metric spaces. Annals of Probability, 41, 3284–3305.
Lyons, R. (2018) Errata to "Distance covariance in metric spaces". Annals of Probability, 46, 2400–2405.
Naor, A. (2010) L1 embeddings of the Heisenberg group and fast estimation of graph isoperimetry. Proceedings of the International Congress of Mathematicians, III, 1549–1575.
Pettis, B. J. (1938) On integration in vector spaces. Transactions of the American Mathematical Society, 44, 277–304.
Rudin, W. (1987) Real and complex analysis. 3rd edition. McGraw-Hill. ISBN 0071002766.
Schechter, E. (1996) Handbook of analysis and its foundations. 1st edition. Academic Press. ISBN 0126227608.
Schoenberg, I. J. (1937) On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space. Annals of Mathematics (Second Series), 38, 787–793.
Schoenberg, I. J. (1938) Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44, 522–536.
Székely, G. J. and Rizzo, M. L. (2009) Brownian distance covariance. Annals of Applied Statistics, 3, 1236–1265.
Székely, G. J. and Rizzo, M. L. (2010) DISCO analysis: a nonparametric extension of analysis of variance. Annals of Applied Statistics, 4, 1034–1055.
Székely, G. J. and Rizzo, M. L. (2012) On the uniqueness of distance covariance. Statistics and Probability Letters, 82, 2278–2282.
Székely, G. J. and Rizzo, M. L. (2013) The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117, 193–213.
Székely, G. J. and Rizzo, M. L. (2017) The energy of data. Annual Review of Statistics and Its Application, 4, 447–479.
Székely, G. J.; Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing dependence by correlation of distances. Annals of Statistics, 35, 2769–2794.
Wilson, W. A. (1935) On certain types of continuous transformations of metric spaces. American Journal of Mathematics, 57, 62–68.