Nonparametric independence tests in metric spaces: What is known and what is not
Fernando Castro-Prado
University of Santiago de Compostela and Health Research Institute, Santiago de Compostela, Spain.
E-mail: [email protected]
Wenceslao González-Manteiga
University of Santiago de Compostela, Santiago de Compostela, Spain.
Summary. Distance correlation is a recent extension of Pearson's correlation that characterises general statistical independence between Euclidean-space-valued random variables, not only linear relations. This review delves into how and when distance correlation can be extended to metric spaces, combining the information that is available in the literature with some original remarks and proofs, in a way that is comprehensible for any mathematical statistician.
Keywords: Distance correlation; Association measures; Nonparametric statistics
1. Introduction
The energy of data (Székely and Rizzo, 2017) and all the mathematical statistics that stems from it, including the characterisation of independence in Euclidean spaces (§ 2) and many other interesting results (Székely and Rizzo, 2009, 2010, 2013), have a very strong and well-established theoretical basis (Bakirov et al., 2006; Székely et al., 2007; Székely and Rizzo, 2017).

Nevertheless, the article (Lyons, 2013) that introduces distance correlation in metric spaces leaves a surprising amount of details to the reader (Jakobsen, 2017, p. 2). The elision of so many intermediate steps meant that, for several years, it went unnoticed that most of the theory was incorrect (Lyons, 2018). Such mistakes were largely discovered by Jakobsen (2017), who devoted 150 pages to going through and correcting the glitches of the original 10-page paper.

The goal of the present review is to present a corrected version of Lyons' theory, by summarising and explaining the work by Jakobsen (2017) and by adding a few original proofs, all of this taking into account the recent corrigendum of the original article (Lyons, 2018). In addition, the reader will be provided with a gentle introduction to the abstract mathematical concepts that this theory requires. Thus, for the first time, a clear and concise bottom-up explanation of the theory of distance correlation in metric spaces is available to the scientific community.
2. Distance correlation in Euclidean spaces
When two random elements (vectors) $X$ and $Y$ are Euclidean-space-valued (let $X$ be $L$-dimensional and $Y$ be $M$-dimensional, for $L, M \in \mathbb{Z}^{+}$), it is possible to define an association measure that characterises their independence, called distance correlation (Székely et al., 2007). Firstly, distance covariance should be defined, as a certain norm of the difference between the joint characteristic function and the product of the marginal ones:
\[ \operatorname{dCov}(X,Y) := \|\varphi_{X,Y} - \varphi_X \varphi_Y\|_w \equiv \sqrt{\int_{\mathbb{R}^L \times \mathbb{R}^M} |\varphi_{X,Y}(t,s) - \varphi_X(t)\,\varphi_Y(s)|^2\, w(t,s)\, \mathrm{d}t\, \mathrm{d}s}\,; \]
where $w$ is a weight function which depends on the dimensions of the Euclidean spaces in which the supports of $X$ and $Y$ are contained (and which enjoys a uniqueness property [Székely and Rizzo, 2012]):
\[ w(t,s) := \frac{\Gamma\!\left(\frac{L+1}{2}\right)}{(\|t\|\sqrt{\pi})^{L+1}} \cdot \frac{\Gamma\!\left(\frac{M+1}{2}\right)}{(\|s\|\sqrt{\pi})^{M+1}}, \qquad (t,s) \in \mathbb{R}^L \times \mathbb{R}^M. \]
And, as usual:
\[ \varphi_X(t) := \mathrm{E}\left[e^{i\langle t, X\rangle}\right],\ t \in \mathbb{R}^L; \qquad \varphi_Y(s) := \mathrm{E}\left[e^{i\langle s, Y\rangle}\right],\ s \in \mathbb{R}^M. \]
Logically, distance correlation is defined as the quotient of the covariance and the product of the standard deviations, and so it has no sign:
\[ \operatorname{dCor}(X,Y) := \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dCov}(X,X)\,\operatorname{dCov}(Y,Y)}}\,, \]
whenever $\operatorname{dCov}(X,X)\operatorname{dCov}(Y,Y) \ne 0$. If $\operatorname{dCov}(X,X)\operatorname{dCov}(Y,Y) = 0$, then $\operatorname{dCor}(X,Y) := 0$.

The reasons why distance correlation is an improved version of the squared (Pearson's) correlation are:
• It has values in $[0,1]$. This is unsurprising: $\mathbb{R}$ is totally ordered and, as such, one can only move "leftwards" or "rightwards", so the sign of (Pearson's) correlation expresses this structure. However, this notion is not valid in Euclidean spaces of arbitrary dimensionality.
• It is zero if and only if $X$ and $Y$ are independent (thus, its interest).

Notwithstanding the convoluted initial definition of $\operatorname{dCor}$, its sample version can easily be computed. Given a paired sample
\[ (X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} (X,Y); \]
let $a_{ij} := d(X_i, X_j)$ for $i, j \in [1,n] \cap \mathbb{Z}$. Using this notation, the doubly centred distances are:
\[ A_{ij} := a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}\,. \]
If $\{b_{ij}\}_{i,j}$ and $\{B_{ij}\}_{i,j}$ are analogously defined for $\{Y_i\}_i$, the empirical distance covariance is simply the nonnegative real number whose square is:
\[ \widehat{\operatorname{dCov}}_n^2(X,Y) := \frac{1}{n^2} \sum_{i,j=1}^{n} A_{ij} B_{ij}\,, \]
so that it is, indeed, a correlation of distances.

The above estimator comes from the alternative definition of $\operatorname{dCov}$ derived by Székely and Rizzo (2009):
\[ \operatorname{dCov}^2(X,Y) = \mathrm{E}[d(X,X')\,d(Y,Y')] + \mathrm{E}[d(X,X')]\,\mathrm{E}[d(Y,Y')] - 2\,\mathrm{E}[d(X,X')\,d(Y,Y'')]\,, \]
which is valid as long as first-order moments are finite. Primed letters refer to independent and identically distributed copies of the corresponding random element.

Whenever $X$ and $Y$ are independent and have finite first moments, the asymptotic distribution of a scaled version of the preceding statistic is a linear combination of independent chi-squared variables with one degree of freedom. More precisely:
\[ n\,\widehat{\operatorname{dCov}}_n^2(X,Y) \overset{\mathcal{D}}{\underset{n\to\infty}{\longrightarrow}} \sum_{j=1}^{\infty} \lambda_j Z_j^2\,, \]
where $\{Z_j\}_j$ are i.i.d. $\mathrm{N}(0,1)$ and $\{\lambda_j\}_j \subset \mathbb{R}^{+}$. Unfortunately, this null distribution is not useful in practice.

Instead, it is resampling techniques that should be used. The most sensible choice when it comes to approximating the null distribution of the test statistic is to base the design of the resampling scheme on the information that $H_0$ provides, which in this case (i.e., independence) leads to permutation tests.
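Since the empirical statistic involves nothing beyond the two matrices of pairwise distances, the whole test fits in a few lines. The following Python sketch (NumPy only) is ours and purely illustrative: the helper names double_centre, dcov2_sample, dcor_sample and perm_test are not from the literature, and a real analysis would rather rely on a dedicated implementation, such as the R package energy.
\begin{verbatim}
import numpy as np

def double_centre(D):
    # A_ij = a_ij - rowmean_i - colmean_j + grand mean, as in the text
    return D - D.mean(axis=1, keepdims=True) - D.mean(axis=0) + D.mean()

def dcov2_sample(DX, DY):
    # Squared sample distance covariance from two n x n distance matrices
    return (double_centre(DX) * double_centre(DY)).mean()

def dcor_sample(X, Y):
    # Sample distance correlation for Euclidean data (rows = observations)
    DX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    v_xy = dcov2_sample(DX, DY)
    v_xx, v_yy = dcov2_sample(DX, DX), dcov2_sample(DY, DY)
    if v_xx * v_yy == 0:
        return 0.0                      # degenerate case: dCor := 0
    return np.sqrt(max(v_xy, 0.0) / np.sqrt(v_xx * v_yy))

def perm_test(X, Y, n_perm=999, seed=0):
    # Permutation test of H0: independence, with dCor as test statistic
    rng = np.random.default_rng(seed)
    t0 = dcor_sample(X, Y)
    t_perm = [dcor_sample(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    return t0, (1 + sum(t >= t0 for t in t_perm)) / (1 + n_perm)

# Uncorrelated but dependent data: Pearson correlation ~ 0, dCor clearly positive
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = x ** 2 + 0.1 * rng.normal(size=(200, 1))
print(perm_test(x, y))                  # large statistic, small p-value
\end{verbatim}
The permutation scheme in perm_test is precisely the $H_0$-driven resampling advocated in the previous paragraph.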
3. Context and notations
Let $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ be two arbitrary separable metric spaces (the need for separability is dealt with in 3.2). The random element $Z = (X, Y)$ is defined on $(\Omega, \mathcal{F}, \mathrm{P})$ and takes values in $\mathcal{X} \times \mathcal{Y}$, with its distribution being $\theta: \mathcal{B}(\mathcal{X} \times \mathcal{Y}) \to [0, 1]$. The following notation will be used for the marginal distributions:
• $X \sim \mu := \theta \circ \pi_1^{-1}$, marginal over $\mathcal{X}$; where $\pi_1: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto x \in \mathcal{X}$.
• $Y \sim \nu := \theta \circ \pi_2^{-1}$, marginal over $\mathcal{Y}$; where $\pi_2: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto y \in \mathcal{Y}$.
Thus, the nonparametric test of independence for $X$ and $Y$ consists in testing $H_0: \theta = \mu \times \nu$ versus $H_1: \theta \ne \mu \times \nu$. For the sake of clarity, it is important to note that the product $\mu \times \nu$ is defined conventionally: it is the only measure on $\mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y})$ such that
\[ (\mu \times \nu)(A \times B) := \mu(A)\,\nu(B); \qquad A \in \mathcal{B}(\mathcal{X}),\ B \in \mathcal{B}(\mathcal{Y}). \]
The first perquisite of assuming the separability of $\mathcal{X}$ and $\mathcal{Y}$ is that, this way, the $\sigma$-algebra generated by their topological product is simply the product $\sigma$-algebra:
\[ \mathcal{B}(\mathcal{X} \times \mathcal{Y}) = \mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y}) := \sigma\{A \times B : A \in \mathcal{B}(\mathcal{X}),\ B \in \mathcal{B}(\mathcal{Y})\}. \]
This equality is useful by itself (e.g., it is crucial to the proof of lemma 3.10 in Jakobsen [2017]), but its most important corollary is that it guarantees that the metrics of the marginal spaces are jointly measurable: for $\mathcal{Z} \in \{\mathcal{X}, \mathcal{Y}\}$, $d_{\mathcal{Z}}$ is $\mathcal{B}(\mathcal{Z}) \otimes \mathcal{B}(\mathcal{Z}) / \mathcal{B}(\mathbb{R})$-measurable. This, in turn, is what ensures that the Lebesgue integrals that appear in the definition of distance covariance (§ 4) are well defined. A counterexample would be $\mathcal{X} := \mathbb{R}^{\mathbb{R}}$, equipped with the discrete metric. This is a particular case of Nedoma's pathology (see Schechter [1996, proposition 21.8] and Bogachev [2007, example 6.4.3] for further details), which states that the diagonal set $\{(x, x) : x \in \mathcal{X}\}$ is not in $\mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{X})$ when the cardinality of $\mathcal{X}$ is greater than that of the continuum.

Finally, separability is explicitly used in the proofs of some important properties of distance covariance (Jakobsen, 2017, theorem 4.4 and lemma 5.8), which indicates that it is not an ungodly hypothesis. The original article that presented distance correlation in metric spaces (Lyons, 2013) was oblivious to the crucial role of separability in the theory.

The map $\mu: \mathcal{B}(\mathcal{X}) \to \mathbb{R}$ is said to be a finite signed (Borel) measure, denoted $\mu \in M(\mathcal{X})$, if and only if $|\mu|$ is a finite measure. For each $\mu \in M(\mathcal{X})$, there is a Hahn–Jordan decomposition and it is essentially unique (Billingsley, 1995, theorem 32.1) or, in other words, it is possible to find a couple of nonnegative measures $\mu^{\pm} \in M(\mathcal{X})$ such that $\mu = \mu^{+} - \mu^{-}$, together with a partition of the space $\mathcal{X} = \mathcal{X}^{+} \sqcup \mathcal{X}^{-}$ satisfying:
\[ \mu^{+}(\mathcal{X}^{-}) = 0 = \mu^{-}(\mathcal{X}^{+}); \]
which is to say that $\mu^{+}$ and $\mu^{-}$ are orthogonal (mutually singular).

This allows one to naturally define (Lebesgue) integrals with respect to signed measures. For $f: \mathcal{X} \to \mathbb{R}$ measurable,
\[ \int_{\mathcal{X}} f \,\mathrm{d}\mu := \int_{\mathcal{X}} f \,\mathrm{d}\mu^{+} - \int_{\mathcal{X}} f \,\mathrm{d}\mu^{-}; \]
which is well defined whenever $f$ is integrable with respect to $|\mu| = \mu^{+} + \mu^{-}$.

On the other hand, it will also be necessary to integrate with respect to product measures. To begin with, consider $\nu \in M(\mathcal{Y})$, with Hahn–Jordan decomposition given by $(\mathcal{Y}^{\pm}, \nu^{\pm})$. Then:
• $\mu^{+} \times \nu^{+} + \mu^{-} \times \nu^{-}$ is a (nonnegative) measure with support $(\mathcal{X}^{+} \times \mathcal{Y}^{+}) \sqcup (\mathcal{X}^{-} \times \mathcal{Y}^{-})$;
• $\mu^{+} \times \nu^{-} + \mu^{-} \times \nu^{+}$ is a (nonnegative) measure with support $(\mathcal{X}^{+} \times \mathcal{Y}^{-}) \sqcup (\mathcal{X}^{-} \times \mathcal{Y}^{+})$.
Because of their disjoint supports, the aforementioned two measures are mutually singular and, consequently (Rudin, 1987, corollary of theorem 6.14), they form the Hahn–Jordan decomposition of $\mu \times \nu$:
\[ \mu \times \nu = (\mu^{+} \times \nu^{+} + \mu^{-} \times \nu^{-}) - (\mu^{+} \times \nu^{-} + \mu^{-} \times \nu^{+}). \]
Thus, the integral of a Borel-measurable function $h: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ with respect to $\mu \times \nu$ is:
\[ \int h \,\mathrm{d}(\mu \times \nu) = \int h \,\mathrm{d}(\mu^{+} \times \nu^{+}) + \int h \,\mathrm{d}(\mu^{-} \times \nu^{-}) - \int h \,\mathrm{d}(\mu^{+} \times \nu^{-}) - \int h \,\mathrm{d}(\mu^{-} \times \nu^{+}); \]
which entails that $L^{1}(\mu \times \nu)$ is the intersection of the four function spaces $L^{1}(\mu^{\pm} \times \nu^{\pm})$.

In the last equation, the integration sets were omitted, as it is superfluous to underscore that each of them is the largest possible one (in this case, $\mathcal{X} \times \mathcal{Y}$).
This notation abuse, taken from Lyons (2013), is among the few that will be used in the present paper, while the ones that caused mistakes and confusion in Lyons' article (and even in its corrigendum [Lyons, 2018]) will be avoided.

The last relevant remark about integration with respect to the product of signed measures is that it satisfies a generalised Fubini–Tonelli theorem (Bogachev, 2007, § 3.3):
\[ \forall h \in L^{1}(\mu \times \nu), \qquad \int h \,\mathrm{d}(\mu \times \nu) = \int\!\!\int h \,\mathrm{d}\mu \,\mathrm{d}\nu = \int\!\!\int h \,\mathrm{d}\nu \,\mathrm{d}\mu. \]

For the sake of clarity, it is convenient to state and prove the $c_r$-inequality. For any $\alpha, \beta, r \in \mathbb{R}^{+}$:
\[ (\alpha + \beta)^{r} \le c_{r}\,(\alpha^{r} + \beta^{r}), \qquad \text{where } c_{r} = \begin{cases} 1, & r < 1 \\ 2^{\,r-1}, & r \ge 1 \end{cases}. \]

Proof. (1) Let $r < 1$. The goal is to show that $(t+1)^{r} \le t^{r} + 1$, with $t := \alpha/\beta$, or, equivalently, that $f(t) := t^{r} + 1 - (t+1)^{r} \ge 0$. And the latter inequality holds because $r - 1 < 0$:
\[ \forall t \in \mathbb{R}^{+},\ f'(t) = r\left(t^{r-1} - (t+1)^{r-1}\right) > 0 \ \Rightarrow\ \forall t \in \mathbb{R}^{+},\ f(t) \ge f(0) = 0. \]
(2) For $r \ge 1$, the function $g(x) := x^{r}$ is convex at every $x \in \mathbb{R}^{+}$. When $r > 1$:
\[ g''(x) = r\,(r-1)\,x^{r-2} > 0, \qquad x \in \mathbb{R}^{+}. \]
Geometrically, convexity implies that:
\[ g\left(\frac{\alpha + \beta}{2}\right) \le \frac{g(\alpha) + g(\beta)}{2} \ \Leftrightarrow\ (\alpha + \beta)^{r} \le 2^{\,r-1}(\alpha^{r} + \beta^{r}). \]
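Although the proof above is complete, a quick numerical probe of both regimes of $c_r$ costs nothing. The snippet below (illustrative Python, with randomly drawn positive pairs) is a sanity check, not an argument:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.exponential(size=100_000), rng.exponential(size=100_000)
for r in (0.3, 0.5, 1.0, 2.0, 3.5):
    c_r = 1.0 if r < 1 else 2.0 ** (r - 1)   # the constant from the statement
    # multiplicative slack only guards against floating-point rounding
    assert np.all((a + b) ** r <= c_r * (a ** r + b ** r) * (1 + 1e-12))
print("c_r-inequality verified on random draws")
\end{verbatim}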
At this point, it is possible to introduce the concept of regularity of a signed measure: $\mu \in M(\mathcal{X})$ is said to have finite moments of order $r$, written $\mu \in M^{r}(\mathcal{X})$, if and only if
\[ \exists o \in \mathcal{X}, \qquad \int d_{\mathcal{X}}(o, x)^{r} \,\mathrm{d}|\mu|(x) < +\infty. \]
Applying the $c_r$-inequality, it is straightforward to see that, when the condition above holds, it does so for any origin:
\[ \mu \in M^{r}(\mathcal{X}) \ \Leftrightarrow\ \forall o \in \mathcal{X}, \quad \int d_{\mathcal{X}}(o, x)^{r} \,\mathrm{d}|\mu|(x) < +\infty. \]
In addition, a signed measure on a product of two spaces $\theta \in M(\mathcal{X} \times \mathcal{Y})$ is said to belong to $M^{r,r}(\mathcal{X} \times \mathcal{Y})$ if both of its marginals have finite moments of order $r$. Finally, the subindex 1 will be used as a notation for probability measures:
\[ M_{1}(\mathcal{X}) := \left\{\mu \in M(\mathcal{X}) : \mu \ge 0,\ \mu(\mathcal{X}) = 1\right\}; \qquad M_{1}^{r}(\mathcal{X}) := M^{r}(\mathcal{X}) \cap M_{1}(\mathcal{X}); \]
\[ M_{1}^{r,r}(\mathcal{X} \times \mathcal{Y}) := M^{r,r}(\mathcal{X} \times \mathcal{Y}) \cap M_{1}(\mathcal{X} \times \mathcal{Y}). \]
4. Formal definition of dcov
The previous section set up the theoretical framework in which speaking of distance covariance makes sense, thus solving some inconsistencies of Lyons (2013). This will make it possible to define the operator dcov rigorously, simplifying and illustrating the explanations by Jakobsen (2017).
In order to define dcov, it is important to keep in mind that:
\[ \forall \mu_{1}, \mu_{2} \in M^{1}(\mathcal{X}): \qquad d_{\mathcal{X}} \in L^{1}(\mu_{1} \times \mu_{2}). \]
This is a consequence of Fubini and the triangle inequality:
\[ \int d_{\mathcal{X}} \,\mathrm{d}(|\mu_{1}| \times |\mu_{2}|) \le \int d_{\mathcal{X}}(x, o) \,\mathrm{d}(|\mu_{1}| \times |\mu_{2}|)(x, x') + \int d_{\mathcal{X}}(o, x') \,\mathrm{d}(|\mu_{1}| \times |\mu_{2}|)(x, x') \]
\[ = |\mu_{2}|(\mathcal{X}) \int d_{\mathcal{X}}(x, o) \,\mathrm{d}|\mu_{1}|(x) + |\mu_{1}|(\mathcal{X}) \int d_{\mathcal{X}}(x', o) \,\mathrm{d}|\mu_{2}|(x') < +\infty. \]

The definition of distance covariance involves doubly centred distances (§ 4.3), but first the various expected values that are to appear should be checked to be well defined. For $\mu \in M^{1}(\mathcal{X})$, the following function maps each point $x \in \mathcal{X}$ to its expected distance to the random element $X \sim \mu$:
\[ a_{\mu}: \mathcal{X} \longrightarrow \mathbb{R}; \qquad x \longmapsto \int d_{\mathcal{X}}(x, x') \,\mathrm{d}\mu(x'). \]
Obviously, it is well defined. On top of that, it is $|\mu|(\mathcal{X})$-Lipschitzian (and, therefore, continuous):
\[ \forall x, x' \in \mathcal{X}: \quad |a_{\mu}(x) - a_{\mu}(x')| \le \int |d_{\mathcal{X}}(x, z) - d_{\mathcal{X}}(x', z)| \,\mathrm{d}|\mu|(z) \le \int d_{\mathcal{X}}(x, x') \,\mathrm{d}|\mu|(z) = |\mu|(\mathcal{X})\, d_{\mathcal{X}}(x, x'). \]
On the other hand, recalling 4.1, the integral $D(\mu)$ is always a real number:
\[ D(\mu) := \int a_{\mu} \,\mathrm{d}\mu = \int d_{\mathcal{X}} \,\mathrm{d}(\mu \times \mu). \]

The following four inequalities can easily be derived from the previous results, and they will be very useful hereinafter. For $\mu \in M_{1}^{1}(\mathcal{X})$ and $x, y \in \mathcal{X}$:
(a) $D(\mu) \le 2\, a_{\mu}(x)$;
(b) $D(\mu) \le a_{\mu}(x) + a_{\mu}(y)$;
(c) $d_{\mathcal{X}}(x, y) \le a_{\mu}(x) + a_{\mu}(y)$;
(d) $a_{\mu}(x) \le d_{\mathcal{X}}(x, y) + a_{\mu}(y)$.

Proof.
(a) $D(\mu) = \int d_{\mathcal{X}}(x', x'') \,\mathrm{d}(\mu \times \mu)(x', x'') \le \mu(\mathcal{X}) \int d_{\mathcal{X}}(x', x) \,\mathrm{d}\mu(x') + \mu(\mathcal{X}) \int d_{\mathcal{X}}(x, x'') \,\mathrm{d}\mu(x'') = 2\, a_{\mu}(x)$.
(b) Applying (a) to $x$ and to $y$ and adding side by side the resulting inequalities, one gets: $2\, D(\mu) \le 2\, a_{\mu}(x) + 2\, a_{\mu}(y)$.
(c) Integrate, with respect to $\mathrm{d}\mu(z)$, both sides of: $d_{\mathcal{X}}(x, y) \le d_{\mathcal{X}}(x, z) + d_{\mathcal{X}}(y, z)$.
(d) Idem to (c): $d_{\mathcal{X}}(x, z) \le d_{\mathcal{X}}(x, y) + d_{\mathcal{X}}(y, z)$.

For $\mu \in M_{1}^{1}(\mathcal{X})$, the doubly $\mu$-centred version of $d_{\mathcal{X}}$ is:
\[ d_{\mu}: \mathcal{X} \times \mathcal{X} \longrightarrow \mathbb{R}; \qquad (x_{1}, x_{2}) \longmapsto d_{\mathcal{X}}(x_{1}, x_{2}) - a_{\mu}(x_{1}) - a_{\mu}(x_{2}) + D(\mu). \]
This modification of $d_{\mathcal{X}}$ is not, in general, a metric; although it is always continuous (since $d_{\mathcal{X}}$, $a_{\mu}$, $\pi_{1}$ and $\pi_{2}$ are) and, in particular, Borel-measurable. Moreover, it is important to note that, when writing $d_{\mu}$, there is no explicit reference to the metric space over which this map is defined. Such an abuse of notation makes formulae easier to read and write without creating any misunderstanding. That is not the case for some abbreviations by Lyons, such as the usage of $d := d_{\mathcal{X}}$ and $d := d_{\mathcal{Y}}$, which mistakenly suggests that $\mathcal{X}$ and $\mathcal{Y}$ need to share the same metric structure, an unnecessary restriction for the theory that would render some interesting applications impossible.

The last remarkable property of $d_{\mu}$ is:
\[ \forall \mu, \mu_{1}, \mu_{2} \in M_{1}^{1}(\mathcal{X}): \qquad d_{\mu} \in L^{2}(\mu_{1} \times \mu_{2}). \]
Proof. In the first instance, it is convenient to justify that, for any $(x, y) \in \mathcal{X}^{2}$, $|d_{\mu}(x, y)| \le 2\, a_{\mu}(y)$. To see this, there are two cases to be considered:
• If $d_{\mu}(x, y) \ge 0$, it suffices to apply the inequalities in 4.2:
\[ |d_{\mu}(x, y)| = d_{\mu}(x, y) \overset{\text{(c)}}{\le} D(\mu) \overset{\text{(a)}}{\le} 2\, a_{\mu}(y). \]
• For $d_{\mu}(x, y) < 0$, the arguments of Jakobsen (2017, p. 10) make use of unnecessarily strong hypotheses. Instead, the following rationale:
\[ \forall z, t \in \mathcal{X}: \ d_{\mathcal{X}}(x, z) \le d_{\mathcal{X}}(x, y) + d_{\mathcal{X}}(y, t) + d_{\mathcal{X}}(t, z) \ \Rightarrow\ a_{\mu}(x) \le d_{\mathcal{X}}(x, y) + a_{\mu}(y) + D(\mu) \]
(integrating in $z$ and $t$ with respect to $\mu$) yields $|d_{\mu}(x, y)| = a_{\mu}(x) + a_{\mu}(y) - D(\mu) - d_{\mathcal{X}}(x, y) \le 2\, a_{\mu}(y)$.

Now, using the aforementioned inequality (and its symmetric counterpart $|d_{\mu}(x, y)| \le 2\, a_{\mu}(x)$), proving that $d_{\mu} \in L^{2}(\mu_{1} \times \mu_{2})$ turns out to be quite straightforward:
\[ \int d_{\mu}(x, y)^{2} \,\mathrm{d}(\mu_{1} \times \mu_{2})(x, y) \le \int 4\, a_{\mu}(x)\, a_{\mu}(y) \,\mathrm{d}(\mu_{1} \times \mu_{2})(x, y) \overset{\text{Fubini}}{=} 4 \int d_{\mathcal{X}}(x, z) \,\mathrm{d}(\mu_{1} \times \mu)(x, z) \int d_{\mathcal{X}}(y, z) \,\mathrm{d}(\mu_{2} \times \mu)(y, z) \overset{d_{\mathcal{X}} \in L^{1}}{<} +\infty. \]

dcov
The generalised distance covariance is defined as:
\[ \operatorname{dcov}(\theta) := \int_{(\mathcal{X} \times \mathcal{Y})^{2}} d_{\mu}(x, x')\, d_{\nu}(y, y') \,\mathrm{d}\theta^{2}\big((x, y), (x', y')\big), \qquad \theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y}); \]
where $\theta^{2} := \theta \times \theta$ and, once again, $\mu := \theta \circ \pi_{1}^{-1}$ and $\nu := \theta \circ \pi_{2}^{-1}$.

In order to check that dcov is well defined, it suffices to note that the integral of the product of two functions with respect to a (nonnegative) measure is always a scalar product (bilinear, positive semidefinite) and, as a result, it satisfies the Cauchy–Bunyakovsky–Schwarz inequality. It is also possible to prove this particular case of Hölder's inequality more directly, viewing $d_{\mu}$ and $d_{\nu}$ as functions on $(\mathcal{X} \times \mathcal{Y})^{2}$:
\[ 0 \le \int\!\!\int \left[d_{\mu}(v)\, d_{\nu}(w) - d_{\mu}(w)\, d_{\nu}(v)\right]^{2} \mathrm{d}\theta^{2}(v)\, \mathrm{d}\theta^{2}(w) = 2 \int d_{\mu}^{2} \,\mathrm{d}\theta^{2} \int d_{\nu}^{2} \,\mathrm{d}\theta^{2} - 2 \left(\int d_{\mu}\, d_{\nu} \,\mathrm{d}\theta^{2}\right)^{2} \]
\[ \overset{d_{\mu},\, d_{\nu} \in L^{2}}{\Longrightarrow} \quad |\operatorname{dcov}(\theta)| \le \sqrt{\int d_{\mu}^{2} \,\mathrm{d}\theta^{2} \int d_{\nu}^{2} \,\mathrm{d}\theta^{2}} < +\infty. \]
A third approach is to derive a particular case of the AM–GM inequality (and also of Young's):
\[ (d_{\mu} \pm d_{\nu})^{2} \ge 0 \ \Leftrightarrow\ d_{\mu}^{2} + d_{\nu}^{2} \ge \mp 2\, d_{\mu}\, d_{\nu} \ \Leftrightarrow\ d_{\mu}^{2} + d_{\nu}^{2} \ge 2\, |d_{\mu}\, d_{\nu}|. \]
Anyhow, the key step is to show that the integrals on the right-hand side are finite. For instance, in the case of $d_{\mu}$:
\[ \int d_{\mu}(x, x')^{2} \,\mathrm{d}\theta^{2}\big((x, y), (x', y')\big) \overset{\text{Fubini}}{=} \int\!\!\int d_{\mu}(x, x')^{2} \,\mathrm{d}\theta(x, y) \,\mathrm{d}\theta(x', y') \overset{\text{ACOV}}{=} \int d_{\mu}(x, x')^{2} \,\mathrm{d}(\mu \times \mu)(x, x') \overset{d_{\mu} \in L^{2}(\mu \times \mu)}{<} +\infty; \]
where the acronym "ACOV" stands for abstract change of variables, which in this case takes a projection as the change-of-variables function. More formally, let $f$ be a measurable function in the following diagram:
\[ (\mathcal{X} \times \mathcal{Y},\ \mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y}),\ \theta) \overset{\pi_{1}}{\longrightarrow} (\mathcal{X}, \mathcal{B}(\mathcal{X})) \overset{f}{\longrightarrow} (\mathbb{R}, \mathcal{B}(\mathbb{R})). \]
When $f \in L^{1}(\theta \circ \pi_{1}^{-1})$, the aforementioned ACOV theorem ensures that:
\[ \int_{\pi_{1}(\mathcal{X} \times \mathcal{Y})} f \,\mathrm{d}(\theta \circ \pi_{1}^{-1}) = \int_{\mathcal{X} \times \mathcal{Y}} (f \circ \pi_{1}) \,\mathrm{d}\theta \]
or, recalling that $\mu \overset{\text{def.}}{=} \theta \circ \pi_{1}^{-1}$:
\[ \int_{\mathcal{X}} f(x) \,\mathrm{d}\mu(x) = \int_{\mathcal{X} \times \mathcal{Y}} f(x) \,\mathrm{d}\theta(x, y). \]
The different integrability checks that have been conducted so far allow one to write dcov in terms of expected values.
Taking $X \sim \mu \in M_{1}^{1}(\mathcal{X})$ and $Y \sim \nu \in M_{1}^{1}(\mathcal{Y})$, with joint distribution $\theta := \mathrm{P} \circ (X, Y)^{-1}$, their distance covariance is given by:
\[ \operatorname{dcov}(X, Y) \overset{\text{abuse}}{:=} \operatorname{dcov}(\theta) = \mathrm{E}[d_{\mu}(X, X')\, d_{\nu}(Y, Y')] \]
\[ = \mathrm{E}\Big\{ \big( d_{\mathcal{X}}(X, X') - \mathrm{E}[d_{\mathcal{X}}(X, X') \mid X] - \mathrm{E}[d_{\mathcal{X}}(X, X') \mid X'] + \mathrm{E}[d_{\mathcal{X}}(X, X')] \big) \]
\[ \cdot \big( d_{\mathcal{Y}}(Y, Y') - \mathrm{E}[d_{\mathcal{Y}}(Y, Y') \mid Y] - \mathrm{E}[d_{\mathcal{Y}}(Y, Y') \mid Y'] + \mathrm{E}[d_{\mathcal{Y}}(Y, Y')] \big) \Big\}; \]
where primed letters refer to independent and identically distributed copies of the corresponding random element.

Finally, note that dcov is always an association measure, in the sense that it vanishes under independence:
\[ \operatorname{dcov}(\mu \times \nu) = \int d_{\mu}\, d_{\nu} \,\mathrm{d}(\mu \times \nu)^{2} \overset{\text{Fubini}}{=} \left( \int d_{\mathcal{X}} \,\mathrm{d}\mu^{2} - 2 \int a_{\mu} \,\mathrm{d}\mu + \int D(\mu) \,\mathrm{d}\mu \right) \left( \int d_{\mathcal{Y}} \,\mathrm{d}\nu^{2} - 2 \int a_{\nu} \,\mathrm{d}\nu + \int D(\nu) \,\mathrm{d}\nu \right) \]
\[ = \left[D(\mu) - 2\, D(\mu) + D(\mu)\right]\left[D(\nu) - 2\, D(\nu) + D(\nu)\right] = 0. \]
Moreover, under certain conditions, dcov is nonnegative and it can be rescaled into the interval $[0, 1]$ (see 6.1), becoming a normalised association measure (Bishop et al., 1975, pages 375–376).
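It is worth stressing that the expression above uses no coordinates whatsoever: an empirical (plug-in) version of $\mathrm{E}[d_{\mu}(X, X')\, d_{\nu}(Y, Y')]$ needs only the two matrices of pairwise distances, whichever the metric spaces are. The sketch below illustrates this in Python; the helper names are ours, and the choice of bit strings under the Hamming distance is our own arbitrary example of a non-Euclidean separable metric space:
\begin{verbatim}
import numpy as np

def dcov_plugin(DX, DY):
    # Empirical E[d_mu(X,X') d_nu(Y,Y')]: doubly centre each distance
    # matrix (a_mu ~ row/column means, D(mu) ~ grand mean) and average
    dmu = DX - DX.mean(1, keepdims=True) - DX.mean(0) + DX.mean()
    dnu = DY - DY.mean(1, keepdims=True) - DY.mean(0) + DY.mean()
    return (dmu * dnu).mean()

def hamming(Z):
    # Pairwise Hamming distances between the rows of a binary matrix
    return (Z[:, None, :] != Z[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 16))                 # random bit strings
Y = np.where(rng.random(X.shape) < 0.1, 1 - X, X)      # noisy copies: dependent
print(dcov_plugin(hamming(X), hamming(Y)))             # clearly positive
print(dcov_plugin(hamming(X), hamming(rng.permutation(Y))))  # close to zero
\end{verbatim}
This plug-in quantity is precisely the $V$-statistic that § 7 studies in detail.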
5. Distance covariance in negative type spaces
The fact that:
\[ \theta = \mu \times \nu \ \Rightarrow\ \operatorname{dcov}(\theta) = 0, \]
makes it natural to wonder which spaces ensure that the reciprocal implication also holds. The answer is: strong negative type spaces, since in them $\operatorname{dcov}(\theta)$ can be presented as an injective function of $\theta - \mu \times \nu$.

In order to explain this, negative type spaces will firstly be introduced (§ 5.1), as they are the ones in which dcov admits the aforementioned representation (although injectivity is not guaranteed). Then the strong version of this condition will be defined (§ 5.3) and a pivotal result will be put forward: strong negative type is not only a sufficient condition for dcov to characterise independence, but it is also a necessary one (with a little exception, by no means restrictive).

The concept of negative type is not a recent invention (Wilson, 1935) and it has recently been enjoying its "second youth": firstly, because of its role in computational algorithmics (Deza and Laurent, 1997, § 6.1; Naor, 2010) and, more recently, in relation to the energy of data (Székely and Rizzo, 2017). The metric space $(\mathcal{X}, d_{\mathcal{X}})$ is said to be of negative type if and only if:
\[ \forall n \in \mathbb{Z}^{+};\ \forall x, y \in \mathcal{X}^{n}: \qquad 2 \sum_{i,j=1}^{n} d_{\mathcal{X}}(x_{i}, y_{j}) \ge \sum_{i,j=1}^{n} \left[d_{\mathcal{X}}(x_{i}, x_{j}) + d_{\mathcal{X}}(y_{i}, y_{j})\right]. \]
The analytic expression above has the following geometrical interpretation: given $n$ red points and as many blue ones, twice the sum of the distances among the $n^{2}$ ordered pairs of different colours is not less than the corresponding sum for pairs of the same colour. Moreover, this condition can be stated in another way, which is apparently more general: the conditional negative definiteness of the metric. Both are actually equivalent (which can be checked by taking repetitions of the points and recalling that $\mathbb{Q}$ is dense in $\mathbb{R}$):
\[ \forall n \in \mathbb{N};\ \forall x \in \mathcal{X}^{n};\ \forall \alpha \in \mathbb{R}^{n},\ \sum_{i=1}^{n} \alpha_{i} = 0: \qquad \sum_{i,j=1}^{n} \alpha_{i} \alpha_{j}\, d_{\mathcal{X}}(x_{i}, x_{j}) \le 0. \]
This is not to say that negative type metric spaces are the ones in which the metric acts as a negative definite kernel (such as the ones thoroughly studied by Klebanov [2005] and Berg et al. [1984]). However, an equivalent definition in terms of the definiteness of a certain kernel exists. Namely, $(\mathcal{X}, d_{\mathcal{X}})$ is a negative type space if and only if there is a point $o \in \mathcal{X}$ such that the absolute antipodal divergence
\[ d_{o}(x, y) := d_{\mathcal{X}}(x, o) + d_{\mathcal{X}}(y, o) - d_{\mathcal{X}}(x, y), \qquad (x, y) \in \mathcal{X}^{2} \]
is positive definite.

There are many familiar examples of negative type spaces, like the Euclidean ones and, more generally, all Hilbert spaces (as will be explained in 5.2).

Now some results involving Hilbert spaces are to be presented. For the sake of simplicity, assume that the scalar field is $\mathbb{R}$ in every case; but, as a general rule, every statement that will be made is also true for $\mathbb{C}$, mutatis mutandis. This can be proven by realifying or complexifying (pages 132–135 of Jakobsen, 2017), according to the case. It will be necessary to integrate functions $f: \mathcal{X} \to H$ which have a Hilbert space as their codomain.
Had $\mathcal{X}$ not been assumed to be separable (see § 3.2), as in Lyons (2013), the spaces $H$ that arise later on would not necessarily be separable, which would only allow one to perform weak integration (Pettis, 1938), and not the strong one (Bochner, 1933). Given $\mu \in M(\mathcal{X})$, if $f$ is Pettis-integrable (or, specifically, scalarly $\mu$-integrable), the integral $I \in H$ is unambiguously defined by its commutativity with respect to every map of the dual space $H^{*}$:
\[ I = \int_{\mathcal{X}} f \,\mathrm{d}\mu \ \Leftrightarrow\ \forall h^{*}: H \to \mathbb{R} \text{ linear and continuous},\ h^{*}(I) = \int_{\mathcal{X}} (h^{*} \circ f) \,\mathrm{d}\mu. \]
Hereinafter, every Hilbert space that arises is going to be separable, which means that Pettis integrals are Bochner integrals.

After these technical remarks, Schoenberg's theorem (Schoenberg, 1937 and 1938) can be stated. It characterises negative type spaces $(\mathcal{X}, d_{\mathcal{X}})$ as those such that $(\mathcal{X}, \sqrt{d_{\mathcal{X}}})$ can be isometrically embedded into a Hilbert space:
\[ \exists H \text{ Hilbert space};\ \exists \varphi: \mathcal{X} \to H;\ \forall x, y \in \mathcal{X}: \qquad \|\varphi(x) - \varphi(y)\|_{H}^{2} = d_{\mathcal{X}}(x, y). \]
For a simple proof, using the absolute antipodal divergence (see 5.1), refer to Jakobsen (2017, theorem 3.7), which corrects Lyons (2013). Regardless of this, Schoenberg's theorem ensures that the separability of the original metric spaces (§ 3.2) is inherited by all the Hilbert spaces that arise.

Before the Hilbert space representation of dcov can be tackled, the barycentre operator has to be defined: given an isometric map $\varphi: (\mathcal{X}, \sqrt{d_{\mathcal{X}}}) \to H$ (like the one in the preceding theorem) and $\mu \in M_{1}^{1}(\mathcal{X})$, the following Pettis integral always exists:
\[ \beta_{\varphi}(\mu) := \int_{\mathcal{X}} \varphi \,\mathrm{d}\mu \in H; \]
and it is called barycentre, because it is the average of an $H$-field over $\mathcal{X}$ according to the distribution given by $\mu$ (thus resembling the geometrical idea of a gravity centre). In fact, if $X \sim \mu \in M_{1}^{1}(\mathcal{X})$, then $\beta_{\varphi}(\mu) = \mathrm{E}[\varphi(X)]$.
On the other hand, if $\psi: (\mathcal{Y}, \sqrt{d_{\mathcal{Y}}}) \to H_{2}$ is also isometric, the barycentre of the tensor product $\varphi \otimes \psi$ for $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ is defined as:
\[ \beta_{\varphi \otimes \psi}(\theta) := \int_{\mathcal{X} \times \mathcal{Y}} (\varphi \otimes \psi) \,\mathrm{d}\theta \in H_{1} \otimes H_{2}. \]
More importantly, if $(\mu, \nu)$ are the marginals of $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$, the following equality holds:
\[ \operatorname{dcov}(\theta) = 4\, \|\beta_{\varphi \otimes \psi}(\theta - \mu \times \nu)\|_{H_{1} \otimes H_{2}}^{2}. \]
In conclusion, dcov will characterise independence in those spaces in which the previous kernel is injective, which are going to be dealt with right below.

If $(\mathcal{X}, d_{\mathcal{X}})$ has negative type, one can derive the following inequality (whose proof is surprisingly long [Jakobsen, 2017, lemma 3.16]):
\[ \forall \mu_{1}, \mu_{2} \in M_{1}^{1}(\mathcal{X}): \qquad D(\mu_{1} - \mu_{2}) \le 0. \]
On top of that, if the operator $D$ separates probability measures (with finite first moments) on $(\mathcal{X}, d_{\mathcal{X}})$, that space is said to have strong negative type:
\[ D(\mu_{1} - \mu_{2}) = 0 \ \Leftrightarrow\ \mu_{1} = \mu_{2}. \]
The extended Schoenberg's theorem shows the equivalence of the strong negative type of $(\mathcal{X}, d_{\mathcal{X}})$ and the existence of an isometric map $\varphi: (\mathcal{X}, \sqrt{d_{\mathcal{X}}}) \to H$ such that $\beta_{\varphi}$ is injective. Furthermore, for $\mathcal{X}$ and $\mathcal{Y}$ of strong negative type, two isometric maps $\varphi: (\mathcal{X}, \sqrt{d_{\mathcal{X}}}) \to H_{1}$ and $\psi: (\mathcal{Y}, \sqrt{d_{\mathcal{Y}}}) \to H_{2}$ can be found so that $\beta_{\varphi \otimes \psi}: M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y}) \to H_{1} \otimes H_{2}$ is injective. As a result, whenever $\mathcal{X}$ and $\mathcal{Y}$ have strong negative type,
\[ \operatorname{dcov}(X, Y) = 0 \ \Leftrightarrow\ X, Y \text{ independent} \]
holds for any random element $Z = (X, Y): \Omega \to \mathcal{X} \times \mathcal{Y}$.

Thus, the strong negative type of the marginal spaces is a sufficient condition for the equivalence above to hold, but is it also necessary? The answer is yes, but with the exception of a "pathological" case. If $(\mathcal{Y}, d_{\mathcal{Y}})$ is not of strong negative type (symmetrically for $\mathcal{X}$), it is indeed possible to find $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ so that:
\[ \operatorname{dcov}(\theta) = 0 \quad \text{and, at the same time,} \quad \theta \ne (\theta \circ \pi_{1}^{-1}) \times (\theta \circ \pi_{2}^{-1}); \]
whenever $\min\{\#\mathcal{X}, \#\mathcal{Y}\} > 1$. Such a $\theta$ can be constructed as follows:
\[ \theta := \tfrac{1}{2}\,(\delta_{x_{1}} \times \nu_{1}) + \tfrac{1}{2}\,(\delta_{x_{2}} \times \nu_{2}); \]
where $\nu_{1}, \nu_{2}$ are two different measures in $M_{1}^{1}(\mathcal{Y})$ such that $D(\nu_{1} - \nu_{2}) = 0$, while $x_{1}, x_{2} \in \mathcal{X}$ are two distinct points. For each $x \in \mathcal{X}$, $\delta_{x} \in M_{1}(\mathcal{X})$ denotes the point mass at $x$: $\delta_{x}(\{x\}) = 1$.

This way, the aforementioned pathological case consists of one of the marginal spaces being a singleton. Such an exception is not a restriction because, whenever $\#\mathcal{Y} = 1$ (symmetrically for $\mathcal{X}$), $\operatorname{dcov} \equiv 0$ (since $d_{\nu} \equiv 0$) and every $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ is the product of its marginals. To see this last part, note that:
\[ \mathcal{Y} = \{y_{0}\} \ \Rightarrow\ \mathcal{B}(\mathcal{Y}) = \{\emptyset, \{y_{0}\}\} = \{\emptyset, \mathcal{Y}\}. \]
And consequently, for $B \in \mathcal{B}(\mathcal{Y})$ and any $A \in \mathcal{B}(\mathcal{X})$,
\[ \theta(A \times B) = \begin{cases} \theta(A \times \emptyset) = \theta(\emptyset) = 0 = \mu(A)\,\nu(\emptyset) \\ \theta(A \times \mathcal{Y}) = \theta\left[\pi_{1}^{-1}(A)\right] \equiv \mu(A) = \mu(A)\,\nu(\mathcal{Y}) \end{cases}; \]
and so $\theta = \mu \times \nu$. This analytical result is the formalisation of the intuitive notion that, if a random element $Y$ constantly takes a certain value, the observations of any other random element $X$ are bound to be independent of those of $Y$.

After the previous theoretical discussion, the interest of identifying practical examples of strong negative type spaces is clear. In this regard, for the scope of the present article (and for most real data applications), it suffices to know that all separable Hilbert spaces have strong negative type. Although this is an unsurprising result, its proof is by no means straightforward (Jakobsen, 2017, pages 49–60).
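On a finite configuration $x_{1}, \ldots, x_{n}$, negative type can actually be checked numerically: writing $J := I_{n} - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, the conditional negative definiteness of the distance matrix $\Delta$ is equivalent to $-\frac{1}{2} J \Delta J$ being positive semidefinite, and the spectral decomposition of the latter matrix realises Schoenberg's $\sqrt{d}$ embedding on the sample points. A sketch (illustrative Python; the helper names are ours):
\begin{verbatim}
import numpy as np

def gram_of_sqrt_embedding(Delta):
    # K = -0.5 * J Delta J is PSD iff the configuration has negative type;
    # it is then the Gram matrix of points with ||phi_i - phi_j||^2 = Delta_ij
    n = len(Delta)
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ Delta @ J

def is_negative_type(Delta, tol=1e-9):
    return np.linalg.eigvalsh(gram_of_sqrt_embedding(Delta)).min() >= -tol

def sqrt_embedding(Delta):
    # Rows are coordinates of phi(x_i) in a Euclidean (hence Hilbert) space
    w, V = np.linalg.eigh(gram_of_sqrt_embedding(Delta))
    return V * np.sqrt(np.clip(w, 0, None))

# Euclidean distances have negative type...
P = np.random.default_rng(0).normal(size=(10, 3))
D_euc = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
print(is_negative_type(D_euc))        # True

# ...but |x - y|^3 on the points {0, 1, 2} of the real line does not:
# alpha = (1, -2, 1) gives sum_ij alpha_i alpha_j d_ij = 2 (2^3 - 4) > 0
x = np.array([0.0, 1.0, 2.0])
print(is_negative_type(np.abs(x[:, None] - x[None, :]) ** 3))   # False
\end{verbatim}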
6. Distance correlation in metric spaces
dcor
As previously, let $(X, Y) \sim \theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ have marginals $(\mu, \nu)$, where $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ are two separable metric spaces. Then, the following inequalities hold:
\[ |\operatorname{dcov}(X, Y)| \le \sqrt{\operatorname{dvar}(X)\, \operatorname{dvar}(Y)} \le D(\mu)\, D(\nu); \]
where $\operatorname{dvar}(X) := \operatorname{dcov}(X, X)$. If, in addition, $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ have negative type:
\[ \operatorname{dcov}(X, Y) = 4\, \|\beta_{\varphi \otimes \psi}(\theta - \mu \times \nu)\|_{H_{1} \otimes H_{2}}^{2} \ge 0. \]
In this context, distance correlation (for metric spaces) is defined as:
\[ \operatorname{dcor}(X, Y) := \frac{\operatorname{dcov}(X, Y)}{\sqrt{\operatorname{dvar}(X)\, \operatorname{dvar}(Y)}} \in [0, 1], \]
whenever the denominator is nonzero. For nondegenerate cases, this will not be a matter of concern, for $\operatorname{dvar}(X)$ only reaches the extreme values of its range $[0, D(\mu)^{2}]$ when $\mu$ is concentrated on one or two points (respectively):
\[ \operatorname{dvar}(X) = 0 \ \Leftrightarrow\ \exists x_{0} \in \mathcal{X},\ \mu = \delta_{x_{0}} \ \text{``}\mu\text{-almost surely''}; \]
\[ \operatorname{dvar}(X) = D(\mu)^{2} \ \Leftrightarrow\ \exists x, x' \in \mathcal{X},\ \mu = \tfrac{1}{2}(\delta_{x} + \delta_{x'}) \ \text{``}\mu\text{-almost surely''}. \]
When $\operatorname{dvar}(X)\, \operatorname{dvar}(Y) = 0$, as in the Euclidean case, $\operatorname{dcor}(X, Y) := 0$.

dcor in Euclidean spaces
It has already been shown that dcor has range $[0, 1]$ and is zero if and only if there is independence, which recapitulates the property for Euclidean spaces (§ 2). Indeed, it is possible to prove (via the Hilbert space representations introduced in 5.2) that, when $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ are (finite-dimensional) Euclidean spaces, the notion of distance correlation of § 6.1 (Lyons, 2013) generalises the square of the one in § 2 (Székely et al., 2007):
\[ \operatorname{dcov}(X, Y) = \operatorname{dCov}(X, Y)^{2}; \qquad \operatorname{dcor}(X, Y) = \operatorname{dCor}(X, Y)^{2}. \]
For $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$, $\operatorname{dcov}(X, Y)$ can be written in terms of expectations of products of distances. By expanding it and simplifying, one can easily get the generalisation of Brownian distance covariance (Székely and Rizzo, 2009, theorems 7–8) to general metric spaces:
\[ \operatorname{dcov}(X, Y) = \mathrm{E}[d_{\mathcal{X}}(X, X')\, d_{\mathcal{Y}}(Y, Y')] + \mathrm{E}[d_{\mathcal{X}}(X, X')]\, \mathrm{E}[d_{\mathcal{Y}}(Y, Y')] - 2\, \mathrm{E}[d_{\mathcal{X}}(X, X')\, d_{\mathcal{Y}}(Y, Y'')]. \]
In conclusion, dcov satisfactorily extends the square of dCov.
7. Nonparametric test of independence in metric spaces
The following map will be key to the construction of the sample version of dcov:
\[ h: (\mathcal{X} \times \mathcal{Y})^{6} \longrightarrow \mathbb{R}; \qquad \big((x_{i}, y_{i})\big)_{i=1}^{6} \longmapsto f_{\mathcal{X}}(x_{1}, x_{2}, x_{3}, x_{4})\, f_{\mathcal{Y}}(y_{1}, y_{2}, y_{5}, y_{6}); \]
where, for $\mathcal{Z} \in \{\mathcal{X}, \mathcal{Y}\}$,
\[ f_{\mathcal{Z}}(z_{1}, z_{2}, z_{3}, z_{4}) := d_{\mathcal{Z}}(z_{1}, z_{2}) + d_{\mathcal{Z}}(z_{3}, z_{4}) - d_{\mathcal{Z}}(z_{1}, z_{3}) - d_{\mathcal{Z}}(z_{2}, z_{4}), \qquad z \in \mathcal{Z}^{4}. \]
The functions $f_{\mathcal{Z}}$ and $h$ are clearly measurable, and proving their integrability can be accomplished by sequentially deriving inequalities from the triangle inequality (see pages 148–150 of Jakobsen [2017] for the correction of the attempt by Lyons [2013]). Integrating these functions is pretty straightforward. Firstly, for $f_{\mathcal{X}}$:
\[ \int_{(\mathcal{X} \times \mathcal{Y})^{2}} f_{\mathcal{X}}(x_{1}, x_{2}, x_{3}, x_{4}) \,\mathrm{d}\theta^{2}\big((x_{3}, y_{3}), (x_{4}, y_{4})\big) \overset{\text{ACOV}}{=} d_{\mathcal{X}}(x_{1}, x_{2}) - a_{\mu}(x_{1}) - a_{\mu}(x_{2}) + D(\mu) \equiv d_{\mu}(x_{1}, x_{2}), \quad (x_{1}, x_{2}) \in \mathcal{X}^{2}; \]
where $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ has marginals $(\mu, \nu)$. Given that the same (mutatis mutandis) holds for $f_{\mathcal{Y}}$,
\[ \operatorname{dcov}(\theta) = \int_{(\mathcal{X} \times \mathcal{Y})^{2}} d_{\mu}(x_{1}, x_{2})\, d_{\nu}(y_{1}, y_{2}) \,\mathrm{d}\theta^{2}\big((x_{1}, y_{1}), (x_{2}, y_{2})\big) = \int_{(\mathcal{X} \times \mathcal{Y})^{6}} h \,\mathrm{d}\theta^{6}. \]
This means that, if $\big((X_{i}, Y_{i})\big)_{i=1}^{6}$ denotes a vector containing random elements that are independent and identically distributed as $(X, Y) \sim \theta$, then $\operatorname{dcov}(\theta) = \mathrm{E}\big[h\big(((X_{i}, Y_{i}))_{i=1}^{6}\big)\big]$ and, consequently, its sample version is a $V$-statistic, like the ones that Lyons (2013) derived (erroneously), as will be shown next.

For $n \in \mathbb{Z}^{+}$, the following notation will be used for the empirical measure associated with a certain sample $\{(X_{i}, Y_{i})\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} (X, Y) \sim \theta$:
\[ \theta_{n} := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_{i}, Y_{i})}: \Omega \longrightarrow M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y}). \]
A few routine computations yield that the natural estimator $\widehat{\operatorname{dcov}}(\theta) := \operatorname{dcov}(\theta_{n})$ is, unsurprisingly, the $V$-statistic with (nonsymmetric) kernel $h$:
\[ \operatorname{dcov}(\theta_{n}) = \frac{1}{n^{6}} \sum_{i_{1}=1}^{n} \cdots \sum_{i_{6}=1}^{n} h\big((X_{i_{\lambda}}, Y_{i_{\lambda}})_{\lambda=1}^{6}\big) \equiv V_{n}(h). \]
On the other hand, it is logical to consider the analogous $U$-statistic as an alternative estimator, which will be shown to require less stringent conditions than $\operatorname{dcov}(\theta_{n})$ in order to behave satisfactorily. For $n \ge 6$, let:
\[ \tilde{U}_{n}(h) := \frac{1}{6!\,\binom{n}{6}} \sum_{\substack{\{i_{\lambda}\}_{\lambda} \subset [1, n] \cap \mathbb{Z} \\ \text{different}}} h\big((X_{i_{\lambda}}, Y_{i_{\lambda}})_{\lambda=1}^{6}\big); \]
where the tilde indicates that this is not a $U$-statistic sensu stricto, but rather one built upon a kernel that is nonsymmetric. To correct this, let $\bar{h}$ be the symmetrisation of $h$:
\[ \bar{h}(z) := \frac{1}{6!} \sum_{\sigma \in S_{6}} h\big((z_{\sigma(j)})_{j=1}^{6}\big) \equiv \frac{1}{6!} \sum_{\sigma \in S_{6}} h(z_{\sigma}), \qquad z \in (\mathcal{X} \times \mathcal{Y})^{6}; \]
where $S_{6} := \{\sigma: [1, 6] \cap \mathbb{Z} \to [1, 6] \cap \mathbb{Z} : \sigma \text{ bijective}\}$ is the symmetric group of degree 6. So $\tilde{U}_{n}(h)$ is the $U$-statistic based on $\bar{h}$:
\[ \tilde{U}_{n}(h) = \frac{1}{\binom{n}{6}} \sum_{i_{1} < \ldots < i_{6}} \bar{h}\big((X_{i_{\lambda}}, Y_{i_{\lambda}})_{\lambda=1}^{6}\big). \]
In particular, Hoeffding's (1961) strong law of large numbers for $U$-statistics applies: as long as $\bar{h}$ is integrable, which holds for any $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$, the estimator $\tilde{U}_{n}(h)$ converges almost surely to $\operatorname{dcov}(\theta)$.
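Evaluating $V_{n}(h)$ from its definition costs $O(n^{6})$ operations but, by the same marginalisation of $f_{\mathcal{X}}$ and $f_{\mathcal{Y}}$ as above, it coincides exactly with the $O(n^{2})$ plug-in expression based on doubly centred distances. For a deliberately tiny $n$, this identity can even be verified by brute force (illustrative Python, in the same spirit as the earlier sketches):
\begin{verbatim}
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6                                   # n^6 = 46656 summands: still tractable
X, Y = rng.normal(size=(n, 2)), rng.normal(size=(n, 3))
DX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)

def f(D, i1, i2, i3, i4):
    # f_Z(z1, z2, z3, z4) = d(z1,z2) + d(z3,z4) - d(z1,z3) - d(z2,z4)
    return D[i1, i2] + D[i3, i4] - D[i1, i3] - D[i2, i4]

# Brute-force V-statistic with the order-6 kernel h
V = sum(f(DX, i1, i2, i3, i4) * f(DY, i1, i2, i5, i6)
        for i1, i2, i3, i4, i5, i6
        in itertools.product(range(n), repeat=6)) / n ** 6

# O(n^2) form: average of products of doubly centred distances
A = DX - DX.mean(1, keepdims=True) - DX.mean(0) + DX.mean()
B = DY - DY.mean(1, keepdims=True) - DY.mean(0) + DY.mean()
print(np.isclose(V, (A * B).mean()))    # True: V_n(h) = dcov(theta_n)
\end{verbatim}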
Lyons (2013) mistook the hypotheses of the aforementioned Hoeffding theorem for those of the SLLN for $V$-statistics (Giné and Zinn, 1992, page 274). The weakest conditions under which the SLLN for $V$-statistics holds in this context are: $\theta \in M_{1}^{5/3, 5/3}(\mathcal{X} \times \mathcal{Y})$ (Jakobsen, 2017, theorem 5.5). In other words, the finiteness of moments of order $5/3$ suffices to ensure asymptotic consistency:
\[ V_{n}(h) \overset{\text{a.s.}}{\underset{n \to \infty}{\longrightarrow}} \operatorname{dcov}(\theta). \]
If $\theta \in M_{1}^{1,1}(\mathcal{X} \times \mathcal{Y})$ is the product of its marginals $(\mu, \nu)$ and these are nondegenerate, the asymptotic distributions of the estimators introduced in 7.2 are:
\[ n\, V_{n}(h) \overset{\mathcal{D}}{\underset{n \to \infty}{\longrightarrow}} \sum_{i=1}^{\infty} \lambda_{i} (Z_{i}^{2} - 1) + D(\mu)\, D(\nu); \qquad n\, \tilde{U}_{n}(h) \overset{\mathcal{D}}{\underset{n \to \infty}{\longrightarrow}} \sum_{i=1}^{\infty} \lambda_{i} (Z_{i}^{2} - 1); \]
where $\{Z_{i}\}_{i \in \mathbb{N}^{*}} \overset{\text{i.i.d.}}{\sim} \mathrm{N}(0, 1)$ and where $\{\lambda_{i}\}_{i \in \mathbb{N}^{*}}$ are the eigenvalues (with multiplicity) of the linear operator $S: L^{2}(\theta) \to L^{2}(\theta)$ that maps $f$ into $S(f): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which is defined as:
\[ S(f)(x, y) := \int_{\mathcal{X} \times \mathcal{Y}} d_{\mu}(x, x')\, d_{\nu}(y, y')\, f(x', y') \,\mathrm{d}\theta(x', y'), \qquad (x, y) \in \mathcal{X} \times \mathcal{Y}. \]
The original attempt at proving the result for the $V$-statistic (Lyons, 2013) included some incorrect arguments to conclude that $\sum_{i=1}^{\infty} \lambda_{i} = D(\mu)\, D(\nu)$. Lyons (2018) states that the previous identity does hold as long as both marginal spaces have negative type, but the justification of this is somewhat abstruse. Were it true, it would yield the exact same asymptotic distribution that Székely et al. (2007) had derived.

Anyhow, this limit cannot be brought to practical usefulness (as in § 2), since the eigenvalues $\{\lambda_{i}\}_{i}$ depend on $\theta$ (unknown) and cannot be easily estimated. The most logical approach is, once again as in § 2, a resampling strategy. One way of arguing for this procedure would be to summon the results of Arcones and Giné (1992), which ensure that approximating the thresholds for the test statistic via the naïve bootstrap leads to a consistent resampling technique, as $\bar{h}$ satisfies the integrability condition required by those authors.

References
Arcones, M. Á. and Giné, E. (1992) On the bootstrap of U- and V-statistics. Annals of Statistics, 20, 655–674.
Bakirov, N. K.; Rizzo, M. L. and Székely, G. J. (2006) A multivariate nonparametric test of independence. Journal of Multivariate Analysis, 97, 1742–1756.
Berg, C.; Christensen, J. P. R. and Ressel, P. (1984) Harmonic analysis on semigroups. 1st edition. Springer.
Billingsley, P. (1995) Probability and measure. 3rd edition. John Wiley & Sons.
Bishop, Y. M. M.; Fienberg, S. E. and Holland, P. W. (1975) Discrete multivariate analysis: theory and practice. MIT Press.
Bochner, S. (1933) Integration von Funktionen, deren Werte die Elemente eines Vektorraumes sind. Fundamenta Mathematicae, 20, 262–276.
Bogachev, V. I. (2007) Measure theory (volumes 1–2). 1st edition. Springer.
Deza, M. M. and Laurent, M. (1997) Geometry of cuts and metrics. 1st edition. Springer.
Giné, E. and Zinn, J. (1992) Marcinkiewicz type laws of large numbers and convergence of moments for U-statistics. Chapter of Probability in Banach Spaces 8: Proceedings of the Eighth International Conference (pages 273–291). Springer.
Hoeffding, W. (1961) The strong law of large numbers for U-statistics. Institute of Statistics Mimeo Series, 302. https://repository.lib.ncsu.edu/handle/1840.4/2128
Jakobsen, M. E. (2017) Distance covariance in metric spaces: non-parametric independence testing in metric spaces. Master's thesis, University of Copenhagen. arXiv:1706.03490.
Klebanov, L. B. (2005) N-distances and their applications. The Karolinum Press.
Lyons, R. (2013) Distance covariance in metric spaces. Annals of Probability, 41, 3284–3305.
Lyons, R. (2018) Errata to "Distance covariance in metric spaces". Annals of Probability, 46, 2400–2405.
Naor, A. (2010) L1 embeddings of the Heisenberg group and fast estimation of graph isoperimetry. Proceedings of the International Congress of Mathematicians, III, 1549–1575.
Pettis, B. J. (1938) On integration in vector spaces. Transactions of the American Mathematical Society, 44, 277–304.
Rudin, W. (1987) Real and complex analysis. 3rd edition. McGraw-Hill. ISBN 0071002766.
Schechter, E. (1996) Handbook of analysis and its foundations. 1st edition. Academic Press. ISBN 0126227608.
Schoenberg, I. J. (1937) On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space. Annals of Mathematics (Second Series), 38, 787–793.
Schoenberg, I. J. (1938) Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44, 522–536.
Székely, G. J. and Rizzo, M. L. (2009) Brownian distance covariance. Annals of Applied Statistics, 3, 1236–1265.
Székely, G. J. and Rizzo, M. L. (2010) DISCO analysis: a nonparametric extension of analysis of variance. Annals of Applied Statistics, 4, 1034–1055.
Székely, G. J. and Rizzo, M. L. (2012) On the uniqueness of distance covariance. Statistics and Probability Letters, 82, 2278–2282.
Székely, G. J. and Rizzo, M. L. (2013) The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117, 193–213.
Székely, G. J. and Rizzo, M. L. (2017) The energy of data. Annual Review of Statistics and Its Application, 4, 447–479.
Székely, G. J.; Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing dependence by correlation of distances. Annals of Statistics, 35, 2769–2794.
Wilson, W. A. (1935) On certain types of continuous transformations of metric spaces. American Journal of Mathematics, 57, 62–68.