ON DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES
SVANTE JANSON
Abstract.
Distance covariance is a measure of dependence between two random variables that take values in two, in general different, metric spaces; see Székely, Rizzo and Bakirov (2007) and Lyons (2013). It is known that the distance covariance, and its generalization α-distance covariance, can be defined in several different ways that are equivalent under some moment conditions. The present paper considers four such definitions and finds minimal moment conditions for each of them, together with some partial results when these conditions are not satisfied. The paper also studies the special case when the variables are Hilbert space valued, and shows under weak moment conditions that two such variables are independent if and only if their (α-)distance covariance is 0; this extends results by Lyons (2013) and Dehling et al. (2018+). The proof uses a new definition of distance covariance in the Hilbert space case, generalizing the definition for Euclidean spaces using characteristic functions by Székely, Rizzo and Bakirov (2007).

1. Introduction
Distance covariance is a measure of dependence between two random variables X and Y that take values in two, in general different, spaces X and Y. This measure appears in Feuerverger [9] as a test statistic when X = Y = R; it was more generally introduced by Székely, Rizzo and Bakirov [26] for the case of random variables in Euclidean spaces, possibly of different dimensions. This was extended to general separable metric spaces by Lyons [18], see also Jakobsen [12], and to semimetric spaces (of negative type, see below) by Sejdinovic et al. [23].
Our setting throughout this paper is the following (see also Remark 1.7): (X, Y) is a pair of random variables taking values in X × Y, where X and Y are separable metric spaces, with metrics d_X and d_Y; we write just d for both metrics when there is no risk of confusion.
We denote the distance covariance by dcov_α(X, Y), where α > 0 is a parameter; the main case is α = 1, and in this case we may drop the subscript and write dcov(X, Y).
One interesting feature of distance covariance is that it can be defined in several ways that look very different but are equivalent (at least assuming sufficient moment conditions). We will give several definitions (sometimes for special cases) and begin with three related definitions that work in the general setting just described.

Date: 29 October, 2019.
Mathematics Subject Classification.
Let, throughout the paper, (X₁, Y₁), (X₂, Y₂), ... be independent copies of (X, Y). Also, let x₀ ∈ X and y₀ ∈ Y be two fixed points, and write for convenience ‖x‖ := d(x, x₀) and ‖y‖ := d(y, y₀) for x ∈ X and y ∈ Y. (In the case of Euclidean spaces, or Hilbert spaces, we choose x₀ = y₀ = 0, and ‖x‖ is the usual norm.) We use x₀ and y₀ for moment conditions of the type E‖X‖^α < ∞; note that by the triangle inequality, for this condition the choice of x₀ does not matter, and that this condition is equivalent to E d(X₁, X₂)^α < ∞.
Also, define for convenience

α* := max(α, 2(α − 1)) = { α, for 0 < α ≤ 2; 2(α − 1), for α ≥ 2 }.    (1.1)

As will be seen below, the case of main interest is α ∈ (0, 2), where α* = α. When necessary, we distinguish the versions of distance covariance by different superscripts such as dcov*_α, dcov^b_α, dcov~_α, but usually this is omitted because the choice of definition does not matter, or is clear from the context.

Definition 1.1.
Assume E‖X‖^{2α} < ∞ and E‖Y‖^{2α} < ∞. Then

dcov_α(X, Y) = dcov*_α(X, Y) := E[d(X₁, X₂)^α d(Y₁, Y₂)^α] + E[d(X₁, X₂)^α] E[d(Y₁, Y₂)^α] − 2 E[d(X₁, X₂)^α d(Y₁, Y₃)^α].    (1.2)

Definition 1.2.
Assume E‖X‖^{α*} < ∞ and E‖Y‖^{α*} < ∞. Then

dcov_α(X, Y) = dcov^b_α(X, Y) := (1/4) E[X̂_α Ŷ_α],    (1.3)

where

X̂_α := d(X₁, X₂)^α − d(X₂, X₃)^α + d(X₃, X₄)^α − d(X₄, X₁)^α    (1.4)

and similarly for Ŷ_α.

Definition 1.3.
Assume E‖X‖^{α*} < ∞ and E‖Y‖^{α*} < ∞. Then

dcov_α(X, Y) = dcov~_α(X, Y) := E[X̃_α Ỹ_α],    (1.5)

where

X̃_α := E(X̂_α | X₁, X₂)    (1.6)
    = d(X₁, X₂)^α − E_X d(X₁, X)^α − E_X d(X₂, X)^α + E d(X₁, X₂)^α    (1.7)

and similarly for Ỹ_α, where E_X denotes integrating over X only, i.e., the conditional expectation given all X_j (but not X).

The role of the parameter α is thus to replace the metric d by d^α in the definition of dcov = dcov₁. See further Remark 1.7 below. Note that dcov_α(X, Y) only depends on the joint distribution of X and Y; thus distance covariance can be seen as a functional on distributions in X × Y.

The moment conditions E‖X‖^{2α} < ∞ and E‖Y‖^{2α} < ∞ in Definition 1.1 are equivalent to E d(X₁, X₂)^{2α} < ∞ and E d(Y₁, Y₂)^{2α} < ∞, which implies that all expectations in (1.2) are finite; it implies also X̂_α, Ŷ_α ∈ L² and thus X̃_α, Ỹ_α ∈ L², so the expectations in (1.3) and (1.5) are also finite. Moreover, in this case, it is easy to see that Definitions 1.1–1.3 are equivalent: by expanding the products X̂_α Ŷ_α and X̃_α Ỹ_α in (1.3) and (1.5), we obtain (1.2) after simple calculations. It is less obvious that the weaker moment condition in Definitions 1.2 and 1.3 is enough to guarantee that the expectations in (1.3) and (1.5) are finite and equal; we show this, and in particular that X̂_α, Ŷ_α, X̃_α, Ỹ_α ∈ L², in Section 3 (Theorem 3.5). In Section 8 we show that the exponents 2α and α* in the moment conditions are optimal in general; in Section 9 we discuss extensions when the moment conditions fail.

The original definition of distance covariance by Székely, Rizzo and Bakirov [26], for random variables X and Y in Euclidean spaces R^p and R^q, see also Feuerverger [9], is quite different and is based on characteristic functions. The general version with α ∈ (0, 2) [26, Section 3.1] is as follows.
Let φ_X(t) := E e^{it·X}, φ_Y(u) := E e^{iu·Y} and φ_{X,Y}(t, u) := E e^{i(t·X + u·Y)} be the characteristic functions of X, Y and (X, Y). Define also the constants

c_{α,k} := 2^α Γ((k + α)/2) / (−π^{k/2} Γ(−α/2)) = α 2^{α−1} Γ((k + α)/2) / (π^{k/2} Γ(1 − α/2)) > 0.    (1.8)

(The values of these normalization constants are unimportant; they are chosen to make the definition agree with the preceding ones.)

Definition 1.4.
Let (X, Y) be a pair of random vectors in R^p and R^q, respectively, where p, q ≥ 1, and let 0 < α < 2. Then

dcov_α(X, Y) = dcov^E_α(X, Y) := c_{α,p} c_{α,q} ∫_{t∈R^p} ∫_{u∈R^q} |φ_{X,Y}(t, u) − φ_X(t) φ_Y(u)|² dt du / (|t|^{p+α} |u|^{q+α}).    (1.9)

Remark 1.5.
No moment condition is needed in Definition 1.4, since the integrand in (1.9) is non-negative; with this definition (for Euclidean spaces and α < 2), dcov_α(X, Y) is always defined, although it may be ∞. As shown in [26], dcov_α(X, Y) is finite at least when E‖X‖^α < ∞ and E‖Y‖^α < ∞; this also follows from the equivalence with Definitions 1.2 and 1.3, see Theorems 6.2 and 6.4.
In contrast, we have in Definitions 1.1–1.3 imposed moment conditions making dcov_α(X, Y) finite. These definitions can be used somewhat more generally when the expectations in them are finite, and even when the result is +∞; see Sections 8 and 9. However, without moment conditions, there are cases, even with X = Y = R, when Definitions 1.1–1.3 yield results of the type ∞ − ∞ and thus cannot be used at all; see Examples 8.4, 8.7, 8.9 and 8.15. □

Remark 1.6.
Definition 1.4 requires α < 2, since typically the integral in (1.9) diverges for α ≥ 2. For example, if p = q and X = Y ∼ N(0, I_p), then |φ_{X,Y}(t, u) − φ_X(t)φ_Y(u)| ∼ |⟨t, u⟩| as t, u → 0, and (1.9) diverges for α ≥ 2. □

Feuerverger [9] gave Definition 1.4 with α = 1 for X = Y = R and the special case when (X, Y) have the empirical distribution of a finite sample from an unknown bivariate distribution, thus defining a test statistic for independence. He also showed that it has the equivalent forms (1.3) and (1.2). More generally, for arbitrary random (X, Y) in Euclidean spaces and 0 < α <
2, Székely, Rizzo and Bakirov [26] gave Definition 1.4; they also showed that it is equivalent to Definition 1.1 when the moment condition in the latter holds [26, Remark 3 for α = 1; implicitly for general α ∈ (0, 2)]. The name distance covariance was introduced by [26] (for the case α = 1, and α-distance covariance in general). (Actually, [26] and [24] define the distance covariance as the square root of dcov(X, Y); we ignore this difference in terminology.)
In the Euclidean setting in [9] and [26], with α < 2, Definition 1.4 implies immediately the fundamental property that dcov_α(X, Y) ≥ 0 for all X and Y, and furthermore

dcov_α(X, Y) = 0  ⟺  X and Y are independent.    (1.10)

Hence, dcov_α(X, Y) can be regarded as a measure of dependency, and distance covariance can be used to test independence. (As noted in [26], (1.10) does not hold for α = 2; see Section 7.)
Lyons [18] extended the theory to general (separable) metric spaces, with α = 1, using Definition 1.3 as his definition. (This was also suggested in [25].) Of main interest is the case when X and Y are metric spaces of negative type (see [18] for a definition; see also [23], [4] and Remark 1.7 below), because in this case, but not otherwise, dcov(X, Y) ≥ 0 for all X and Y such that dcov(X, Y) is defined; if furthermore the spaces are of strong negative type (see again [18]), then also (1.10) holds for α = 1. (The implication that dcov_α(X, Y) = 0 for independent variables is trivial, for any α, but not the converse.) Hence, for metric spaces of strong negative type, dcov can be regarded as a measure of dependence and used for tests of independence just as in the Euclidean case.
We have here, as [18], assumed that d_X and d_Y are metrics. However, we can formally use Definitions 1.1–1.3 for any symmetric measurable functions d_X : X × X → [0, ∞) and d_Y : Y × Y → [0, ∞). (For X and Y such that the expectations exist, and still assuming X and Y to be separable metric spaces, to avoid technical problems.) It seems natural to assume at least that d_X and d_Y are semimetrics; a semimetric on a space X is a symmetric function d : X × X → [0, ∞) such that d(x₁, x₂) = 0 ⟺ x₁ = x₂. (Thus, the triangle inequality is not assumed. Note that the term semimetric is also used in other contexts with a different meaning.) This extension was made by Sejdinovic et al. [23]; they considered semimetrics of negative type and showed that much of the theory extends to this case.

Remark 1.7.
If 0 < α ≤ 1, then d^α is also a metric for any metric d, and dcov_α is just dcov applied to the spaces X and Y equipped with the metrics d^α_X and d^α_Y. (From an abstract point of view, the case α ≤ 1 thus reduces to the case α = 1.) More generally, d^α is a semimetric for every α > 0, and dcov_α is just dcov applied to the semimetrics d^α_X and d^α_Y for any α > 0.
On the other hand, see [18] and [22], a semimetric d on a space X is of negative type if and only if there exists an embedding φ : X → H into a Hilbert space such that

d(x₁, x₂) = ‖φ(x₁) − φ(x₂)‖².    (1.11)

In particular, (1.11) implies that d^{1/2} is a metric. (We assume that balls for the semimetric define the topology, and thus the metric d^{1/2} defines the topology of X.) Hence, for semimetrics of negative type, dcov_α is the same as dcov_{2α} for the metrics d^{1/2}_X and d^{1/2}_Y; in particular, dcov equals dcov₂ for these metrics. Consequently, our setting with metrics but arbitrary α > 0 includes also semimetrics of negative type. Furthermore, using the embedding φ, we see that dcov for semimetric spaces of negative type can be reduced to dcov for Hilbert spaces, see Remark 7.4. (This is implicit in [23], where this embedding is used to give another interpretation of distance covariance, see Remarks 1.10 and 7.5.)
We will in the sequel assume that d_X and d_Y are metrics (without assuming negative type), but note that as just said, by changing α, this really includes the case of semimetrics of negative type.
In this context we note that if X is a Euclidean space R^q, or more generally a Hilbert space, then the semimetric ‖x₁ − x₂‖^α is of negative type if and only if 0 < α ≤ 2, see [22]. (It is thus a metric of negative type if and only if 0 < α ≤ 1.) Hence, for 0 < α ≤ 2, we can conversely regard dcov_α as dcov for the semimetric of negative type ‖x₁ − x₂‖^α. □

In the first part of the present paper, we consider general metric spaces and general α >
0, and study and compare Definitions 1.1–1.3. In particular, we show that the definitions agree under the moment conditions above (Section 3). We also show that dcov_α depends continuously on the distribution of (X, Y), assuming convergence of the α* moments E‖X‖^{α*} and E‖Y‖^{α*} (Theorem 4.2 and Remark 4.3).
Székely, Rizzo and Bakirov [26] showed that, in the Euclidean case and with α ∈ (0, 2), dcov_α for the empirical distribution of a sample gives a strongly consistent estimator of dcov_α, provided α moments are finite. This was extended to general metric spaces, with α = 1, by Lyons [18], who claimed consistency in this sense assuming only finite first moments; however, the proof is incorrect as noted in the Errata. As also noted in [18], there is a simple proof assuming second moments, and Jakobsen [12] proved the result when E(‖X‖‖Y‖)^{5/6} < ∞, and thus in particular when X and Y have moments of order 5/3. We remove this condition and show (Theorem 4.4) consistency assuming only first moments (as stated in [18]); furthermore, this is extended to all α > 0, now assuming α* moments.
In the second part of the paper, we consider Hilbert spaces. Dehling et al. [8] studied dcov_α for α ∈ (0,
2) in the infinite-dimensional Hilbert space L²[0, 1]; there (1.10) was known for α = 1 by the results of Lyons [18], and Dehling et al. [8, Theorem 4.2] extended this to all α ∈ (0, 2).
We consider the Hilbert space case in Sections 5–7. We give yet another definition of dcov_α in this case (Definition 6.1), which is related to Definition 1.4 in Euclidean spaces, but where we replace the characteristic functions by certain characteristic random variables, which are Gaussian random variables that can be defined also for variables in infinite-dimensional Hilbert spaces. We show that this definition is equivalent to the ones above under suitable moment conditions. We then use this definition to give a new proof, assuming only α moments, of the theorem by Dehling et al. [8] just mentioned that (1.10) holds for Hilbert spaces and any α ∈ (0, 2) (our Theorem 6.6). Our proof (and Definition 6.1) is based on the ideas in [8]; however, the proof in [8] is formulated for the Hilbert space L²[0, 1] and uses arguments with Brownian motion. Our proof can be regarded as a more abstract version of their proof, stated for arbitrary (separable) Hilbert spaces and using i.i.d. Gaussian sequences instead of Brownian motion; we believe that this makes the proof clearer since it avoids irrelevant details related to the particular choice L²[0, 1] of the Hilbert space.
Section 7 studies the case α = 2 for Hilbert spaces. This case is rather trivial, and markedly different from α <
2. In particular, even in one dimension, (1.10) does not hold for α = 2, as is well known since [26]; see Section 7.

Remark 1.8.
Another version of the definitions above is obtained if we denote the right-hand side of (1.4) by X̂_α(X₁, X₂, X₃, X₄) and then define

dcov_α(X, Y) = dcov^=_α(X, Y) := E(X̂_α(X₁, X₂, X₃, X₄) Ŷ_α(Y₁, Y₂, Y₅, Y₆)).    (1.12)

This version is used in proofs in [18] and [12].
It is obvious that if X̂_α, Ŷ_α ∈ L², then the expectation in (1.12) is finite, and, using Fubini's theorem to integrate first over X₃, X₄, Y₅, Y₆, it equals E(X̃_α Ỹ_α); thus, at least in this case, (1.12) agrees with (1.5). In particular, by Lemma 3.3 below, this holds when E‖X‖^{α*} < ∞ and E‖Y‖^{α*} < ∞. We will not consider this definition further, and we leave the case when the moment condition just stated fails to the reader. (We conjecture results similar to those in Sections 8 and 9.) □

Remark 1.9.
We have defined X̃_α as a conditional expectation of X̂_α; this can be regarded as an orthogonal projection in the Hilbert space L²(P). If E‖X‖^{2α} < ∞, so that d(X₁, X₂)^α ∈ L², then, as noted by Jakobsen [12], X̃_α can also be regarded as a projection in another way, viz. as the orthogonal projection of d(X₁, X₂)^α onto the subspace of L²(P) consisting of functions g(X₁, X₂) with E(g(X₁, X₂) | X₁) = E(g(X₁, X₂) | X₂) = 0 a.s. □

Remark 1.10.
For semimetrics of negative type, another interpretation of distance covariance is given by Sejdinovic et al. [23, Theorem 24], showing that it coincides with the Hilbert-Schmidt independence criterion, a distance measure between the distributions L(X, Y) and L(X) × L(Y) that is defined using reproducing kernel Hilbert spaces given by some kernels on the spaces, provided one chooses the kernels to be defined in a specific way by the metrics d_X and d_Y. See also Remark 7.5. □

Remark 1.11.
Yet another interpretation (or definition) of distance covariance was given by Székely and Rizzo [24] for Euclidean spaces; it was called Brownian distance covariance. In the one-dimensional case X = Y = R, and with α = 1, let W and W′ be two two-sided Brownian motions, independent of each other and of X and Y; then

dcov(X, Y) = E[Cov(W(X), W′(Y) | W, W′)²].    (1.13)

This was extended, also in [24], to arbitrary dimension by using Brownian fields on R^k, and to α ∈ (0, 2) by using fractional Brownian fields.
This approach was further generalized to arbitrary spaces with semimetrics of negative type by Kanagawa et al. [16, Section 6.4], letting W and W′ be Gaussian stochastic processes on X and Y, with suitable covariance kernels. □

Remark 1.12.
Definitions 1.2–1.4 show immediately that dcov_α(X, X) ≥ 0. Moreover, dcov_α(X, X) > 0 unless X is degenerate (i.e., is concentrated at a single value); this is immediate for Definition 1.4; it was shown by Lyons [18] for Definition 1.3 (for α = 1), and his proof extends to general α, and to Definition 1.2, for the latter even without any moment assumption (allowing +∞). □

Remark 1.13.
Distance correlation is defined by [26] as

dcov_α(X, Y) / (dcov_α(X, X)^{1/2} dcov_α(Y, Y)^{1/2}),    (1.14)

provided X and Y are non-degenerate so that the denominator is strictly positive (see Remark 1.12).
Various properties of distance correlation follow from properties of distance covariance; we leave this to the reader. □

2. Some notation
As said in the introduction, (X, Y) is a pair of random variables taking values in separable metric spaces X and Y, and (X_i, Y_i), i ≥ 1, are independent copies of (X, Y). α is a fixed parameter, and α* is given by (1.1). Unless stated otherwise, we assume only α > 0. (This condition is sometimes repeated for emphasis.)
P(X) denotes the set of all Borel probability measures in X.
Convergence almost surely, in probability, in distribution and in L^p are denoted by →_{a.s.}, →_p, →_d, →_{L^p}.
We use the standard definition of covariance

Cov(Z, W) := E[ZW] − E Z E W    (2.1)

not only for real random variables, but also more generally for any complex random variables Z and W with E|Z|², E|W|² < ∞; we further extend this notation to conditional covariance.
For real x, y, x ∧ y := min{x, y} and x ∨ y := max{x, y}; also x₊ := x ∨ 0 and x₋ := (−x)₊ = −(x ∧ 0), so that x = x₊ − x₋.
The inner product in a Hilbert space is denoted by ⟨x, y⟩; for finite-dimensional R^q we also use x · y. All Hilbert spaces have real scalars, so the inner product is real-valued.
C and c will denote some unimportant positive constants that depend only on α (and may be taken as universal constants for bounded α).

3. Existence and continuity
We begin by recording the simple fact that with enough moments, Definitions 1.1–1.3 agree.
Lemma 3.1.
Let α > 0. If E‖X‖^{2α} < ∞ and E‖Y‖^{2α} < ∞, then all expectations in (1.2), (1.3) and (1.5) are finite, and the three definitions of dcov_α(X, Y) agree, i.e., dcov*_α(X, Y) = dcov^b_α(X, Y) = dcov~_α(X, Y).

Proof. As said in the introduction, this is elementary; we omit the details. □
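For a distribution with finite support, all three expressions can be evaluated exactly as finite sums, which makes Lemma 3.1 easy to check numerically. The following Python sketch is our own illustration, not part of the paper; it assumes the normalizations dcov*_α as in (1.2), dcov^b_α = (1/4)E[X̂_α Ŷ_α] and dcov~_α = E[X̃_α Ỹ_α], with X = Y = R and the usual metric.

```python
import itertools
import numpy as np

def dcov_three_ways(points_x, points_y, prob, alpha=1.0):
    """Evaluate dcov*_a, dcov^b_a and dcov~_a exactly for a finite discrete
    distribution on pairs (x_i, y_i) with probabilities prob (illustration)."""
    p = np.asarray(prob, dtype=float)
    n = len(p)
    # Matrices of alpha-th powers of pairwise distances on R.
    A = np.array([[abs(a - b) ** alpha for b in points_x] for a in points_x])
    B = np.array([[abs(a - b) ** alpha for b in points_y] for a in points_y])

    # Definition 1.1: E[A12 B12] + E[A12] E[B12] - 2 E[A12 B13].
    T1 = np.einsum('i,j,ij,ij->', p, p, A, B)
    T2 = np.einsum('i,j,ij->', p, p, A) * np.einsum('i,j,ij->', p, p, B)
    T3 = np.einsum('i,j,k,ij,ik->', p, p, p, A, B)
    dcov_star = T1 + T2 - 2 * T3

    # Definition 1.3: double centering, then E[At(X1,X2) Bt(Y1,Y2)].
    aX, aY = A @ p, B @ p
    At = A - aX[:, None] - aX[None, :] + p @ A @ p
    Bt = B - aY[:, None] - aY[None, :] + p @ B @ p
    dcov_tilde = np.einsum('i,j,ij,ij->', p, p, At, Bt)

    # Definition 1.2: (1/4) E[Xhat Yhat] over an independent quadruple.
    s = 0.0
    for i, j, k, l in itertools.product(range(n), repeat=4):
        bX = A[i, j] - A[j, k] + A[k, l] - A[l, i]
        bY = B[i, j] - B[j, k] + B[k, l] - B[l, i]
        s += p[i] * p[j] * p[k] * p[l] * bX * bY
    dcov_hat = s / 4.0

    return dcov_star, dcov_hat, dcov_tilde

# A small dependent distribution on R x R:
v = dcov_three_ways([0.0, 1.0, 2.0], [0.0, 2.0, 1.0], [0.5, 0.3, 0.2])
assert max(v) - min(v) < 1e-12 and v[0] > 0   # the three values coincide
```

The factor 1/4 balances the four cyclic terms of X̂_α, each of which reproduces the doubly centered distance on average.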
We will extend this to the weaker moment conditions used in Definitions 1.2 and 1.3. We argue similarly to Lyons [18], who showed the case α = 1 (and thus implicitly 0 < α ≤ 1, see Remark 1.7). We first show some useful estimates of the variable X̂_α defined in (1.4). Note the symmetry up to sign under cyclic permutations of the indices 1, ..., 4. Although the next lemma is stated for the random variables X_i, it is really a pointwise inequality that could have been stated for four non-random points x₁, ..., x₄. In sums such as (3.2) and (3.3), the indices are interpreted modulo 4; moreover, a term containing an index i ± 1 is summed over both i + 1 and i − 1; the sum in (3.3) is thus really a sum of 8 terms.
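The antisymmetry under cyclic shifts mentioned above is easy to verify directly. A minimal sketch (our own illustration, with arbitrary points in R³): shifting the indices 1 → 2 → 3 → 4 → 1 changes the sign of the right-hand side of (1.4).

```python
import numpy as np

rng = np.random.default_rng(3)

def b_hat(p, alpha):
    """Right-hand side of (1.4) for four points p[0..3]."""
    d = lambda i, j: np.linalg.norm(p[i] - p[j]) ** alpha
    return d(0, 1) - d(1, 2) + d(2, 3) - d(3, 0)

pts = rng.normal(size=(4, 3))
for alpha in (0.5, 1.0, 1.7):
    v = b_hat(pts, alpha)
    shifted = np.roll(pts, -1, axis=0)     # cyclic shift of the four indices
    assert abs(b_hat(shifted, alpha) + v) < 1e-12
```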
Lemma 3.2.
Let X be a metric space.
(i) If 0 < α ≤ 1, then

|X̂_α| ≤ 2 Σ_{i=1}^4 (‖X_i‖^α ∧ ‖X_{i+1}‖^α).    (3.1)

(ii) If 0 < α ≤ 2, then

|X̂_α| ≤ C Σ_{i=1}^4 ‖X_i‖^{α/2} ‖X_{i+1}‖^{α/2}.    (3.2)

(iii) If α ≥ 1, then

|X̂_α| ≤ C Σ_{i=1}^4 ‖X_i‖^{α−1} ‖X_{i±1}‖.    (3.3)
Write d_ij := d(X_i, X_j). Thus X̂_α = d₁₂^α − d₂₃^α + d₃₄^α − d₄₁^α. Note the triangle inequality

d_ij ≤ ‖X_i‖ + ‖X_j‖.    (3.4)

Case 1: α ≤ 1. Since d^α is a metric when α ≤ 1, it suffices to consider the case α = 1. The triangle inequality yields

|X̂₁| ≤ |d₁₂ − d₂₃| + |d₃₄ − d₄₁| ≤ d₁₃ + d₁₃ = 2d₁₃.    (3.5)

Similarly, by shifting the indices,

|X̂₁| ≤ 2d₂₄.    (3.6)

Hence, using (3.5)–(3.6) and (3.4),

|X̂₁| ≤ 2 min(d₁₃, d₂₄) ≤ 2 min(‖X₁‖ + ‖X₃‖, ‖X₂‖ + ‖X₄‖).    (3.7)

We claim that for any real x₁, ..., x₄ ≥ 0,

(x₁ + x₃) ∧ (x₂ + x₄) ≤ Σ_{i=1}^4 (x_i ∧ x_{i+1}).    (3.8)

In fact, by cyclic symmetry, we may without loss of generality assume that x₁ is the largest of x₁, ..., x₄, and in this case

x₂ + x₄ = x₁ ∧ x₂ + x₄ ∧ x₁ ≤ Σ_{i=1}^4 (x_i ∧ x_{i+1}),    (3.9)

and (3.8) follows. Hence (3.8) holds, and (3.7) implies (3.1) for α = 1. As said above, this shows (3.1) in general.
Furthermore, for α ≤ 1, (3.2) follows from (3.1) since x ∧ y ≤ x^{1/2} y^{1/2} when x, y ≥ 0.

Case 2: α ≥ 1. By the cyclic symmetry we may assume that ‖X₁‖ is the largest of ‖X₁‖, ..., ‖X₄‖. Then, (3.4) implies

d_ij ≤ 2‖X₁‖,  i, j = 1, ..., 4.    (3.10)

As above, the triangle inequality yields

|d₁₂ − d₂₃| ≤ d₁₃    (3.11)

and thus, by the mean value theorem, for some θ ∈ [0, 1],

|d₁₂^α − d₂₃^α| ≤ d₁₃ · α(θd₁₂ + (1 − θ)d₂₃)^{α−1}.    (3.12)

Using (3.10), this yields

|d₁₂^α − d₂₃^α| ≤ d₁₃ α 2^{α−1} ‖X₁‖^{α−1}.    (3.13)

Similarly,

|d₃₄^α − d₄₁^α| ≤ d₁₃ α(θ′d₃₄ + (1 − θ′)d₄₁)^{α−1} ≤ d₁₃ α 2^{α−1} ‖X₁‖^{α−1}.    (3.14)

Summing (3.13) and (3.14) yields, using again (3.4),

|X̂_α| ≤ |d₁₂^α − d₂₃^α| + |d₃₄^α − d₄₁^α| ≤ α 2^α ‖X₁‖^{α−1} d₁₃ ≤ α 2^α ‖X₁‖^{α−1} (‖X₁‖ + ‖X₃‖).    (3.15)

This proves (3.3) for any α ≥ 1. For 1 ≤ α ≤ 2, we further note that our assumption ‖X_j‖ ≤ ‖X₁‖ implies

‖X₁‖^{α−1} ‖X_j‖ ≤ ‖X₁‖^{α/2} ‖X_j‖^{α/2},  j = 1, ..., 4,    (3.16)

and thus (3.15) also yields (3.2). □

Lemma 3.3. If E‖X‖^{α*} < ∞, then E X̂_α² < ∞ and E X̃_α² < ∞.

For α = 1, this is shown by Lyons [18, Errata].

Proof. Case 1: α ≤ 2. In this case α* = α. Recall that, by definition, X_i and X_{i+1} are independent. Hence,

E(‖X_i‖^{α/2} ‖X_{i+1}‖^{α/2})² = E‖X_i‖^α E‖X_{i+1}‖^α < ∞,    (3.17)

so each term in the sum in (3.2) belongs to L², and thus (3.2) implies X̂_α ∈ L². Since X̃_α is defined by (1.6) as a conditional expectation of X̂_α, this further implies X̃_α ∈ L².

Case 2: α ≥ 2. In this case α* = 2(α − 1) ≥
2, and the result follows in the same way from (3.3). □
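The pointwise bound (3.1) can be stress-tested numerically. The sketch below is our own check, not from the paper; it assumes the constant 2 in (3.1) and takes the base point x₀ = 0 in R², so that ‖X_i‖ is the Euclidean norm.

```python
import numpy as np

rng = np.random.default_rng(0)

def bX(points, alpha):
    """X-hat_alpha of (1.4) for four points."""
    d = lambda a, b: np.linalg.norm(a - b)
    x1, x2, x3, x4 = points
    return (d(x1, x2) ** alpha - d(x2, x3) ** alpha
            + d(x3, x4) ** alpha - d(x4, x1) ** alpha)

for _ in range(1000):
    pts = rng.normal(size=(4, 2)) * rng.exponential(5)   # random scale
    alpha = rng.uniform(0.05, 1.0)                        # case 0 < alpha <= 1
    norms = np.linalg.norm(pts, axis=1) ** alpha          # ||x_i||^alpha
    rhs = 2 * sum(min(norms[i], norms[(i + 1) % 4]) for i in range(4))
    assert abs(bX(pts, alpha)) <= rhs + 1e-9              # inequality (3.1)
```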
In the following lemma, we consider together with X also a sequence (X^(n))_{n≥1} of random variables in X. We then define X_i^(n) for i ≥ 1 such that the pairs (X_i, (X_i^(n))_n) in X^∞ are independent copies of (X, (X^(n))_n). This extends in the obvious way when we consider sequences ((X^(n), Y^(n)))_n. We use the superscript (n) in the natural way and let e.g. X̂_α^(n) be defined as in (1.4) using X_i^(n).

Lemma 3.4.
Let X and X^(n), n ≥ 1, be random variables in X, and assume that E‖X‖^{α*} < ∞ and E d(X^(n), X)^{α*} → 0 as n → ∞. Then E(X̂_α^(n) − X̂_α)² → 0 and E(X̃_α^(n) − X̃_α)² → 0.

Proof. We use without further comment some elementary facts about uniform integrability, see e.g. [11, Theorems 5.5.4, 5.4.5 and 5.4.6]. Since E d(X^(n), X)^{α*} →
0, the sequence d(X^(n), X)^{α*} of random variables is uniformly integrable. The triangle inequality yields ‖X^(n)‖ ≤ d(X^(n), X) + ‖X‖, and thus

‖X^(n)‖^{α*} ≤ C(d(X^(n), X)^{α*} + ‖X‖^{α*}),    (3.18)

and it follows that the sequence ‖X^(n)‖^{α*} is uniformly integrable. Lemma 3.2 and the argument in the proof of Lemma 3.3, using Lemma A.1 in the appendix, show that the sequence ((X̂_α^(n))²) is uniformly integrable. Furthermore, we have d(X^(n), X) →_p
0, and thus d(X_i^(n), X_i) →_p 0 for every i. The triangle inequality then implies d(X_i^(n), X_j^(n)) →_p d(X_i, X_j) for every i and j, and thus the definition (1.4) implies X̂_α^(n) →_p X̂_α.
This and the uniform square integrability just established yield E(X̂_α^(n) − X̂_α)² → 0.
If F is the σ-field generated by all X_j and X_j^(n) with j ∈ {1, 2}, then X̃_α = E(X̂_α | F) and X̃_α^(n) = E(X̂_α^(n) | F). Consequently,

E|X̃_α^(n) − X̃_α|² = E|E(X̂_α^(n) − X̂_α | F)|² ≤ E|X̂_α^(n) − X̂_α|² → 0.    (3.19)  □

Theorem 3.5.
Definitions 1.1–1.3 are well-defined; more precisely, for any α > 0, assuming the stated moment conditions, the expectations in (1.2), (1.3) and (1.5) are finite. Furthermore, any two of these definitions yield the same result whenever the moment conditions in both are satisfied.

Proof. Lemma 3.1 shows that all three definitions are valid and agree under the condition of Definition 1.1, i.e., when E‖X‖^{2α} < ∞ and E‖Y‖^{2α} < ∞. It remains to show that (1.3) and (1.5) are finite and agree under the weaker assumption E‖X‖^{α*} < ∞ and E‖Y‖^{α*} < ∞. In this case, Lemma 3.3 shows that X̂_α, Ŷ_α, X̃_α, Ỹ_α ∈ L², and thus (1.3) and (1.5) are finite.
We do not know a simple direct argument to show the equality of the two expressions, so we use truncations as follows. Let, for n ≥ 1,

X^(n) := X if ‖X‖ ≤ n, and X^(n) := x₀ otherwise,    (3.20)

and define Y^(n) similarly. Then

E d(X^(n), X)^{α*} = E[‖X‖^{α*} 1{‖X‖ > n}] → 0, as n → ∞.    (3.21)

Thus, Lemma 3.4 yields ‖X̂_α^(n) − X̂_α‖_{L²} → 0 and ‖X̃_α^(n) − X̃_α‖_{L²} → 0, and similarly for Y. The L²-convergence just shown implies that, as n → ∞,

dcov^b_α(X^(n), Y^(n)) = (1/4) E[X̂_α^(n) Ŷ_α^(n)] → (1/4) E[X̂_α Ŷ_α] = dcov^b_α(X, Y)    (3.22)

and similarly

dcov~_α(X^(n), Y^(n)) = E[X̃_α^(n) Ỹ_α^(n)] → E[X̃_α Ỹ_α] = dcov~_α(X, Y).    (3.23)

Furthermore, for each n, ‖X^(n)‖ and ‖Y^(n)‖ are bounded, and thus Lemma 3.1 applies and shows dcov^b_α(X^(n), Y^(n)) = dcov~_α(X^(n), Y^(n)). Consequently, (3.22)–(3.23) imply dcov^b_α(X, Y) = dcov~_α(X, Y). □

We return in Section 8 to the case when the moment conditions fail.
4. Continuity and consistency
The lemmas in Section 3 also yield continuity results. Unspecified convergence is as n → ∞.

Theorem 4.1.
Let α > 0. Let (X, Y) and (X^(n), Y^(n)), n ≥ 1, be pairs of random variables in X × Y, and assume that E‖X‖^{α*} < ∞, E‖Y‖^{α*} < ∞ and, as n → ∞, E d(X^(n), X)^{α*} → 0 and E d(Y^(n), Y)^{α*} → 0. Then

dcov_α(X^(n), Y^(n)) → dcov_α(X, Y).    (4.1)
Lemma 3.4 yields X̂_α^(n) →_{L²} X̂_α and Ŷ_α^(n) →_{L²} Ŷ_α, and thus

dcov_α(X^(n), Y^(n)) = (1/4) E[X̂_α^(n) Ŷ_α^(n)] → (1/4) E[X̂_α Ŷ_α] = dcov_α(X, Y).    (4.2)  □

We can extend this result and assume only convergence in distribution of (X^(n), Y^(n)) together with a moment condition.

Theorem 4.2.
Let α > 0. Let (X, Y) and (X^(n), Y^(n)), n ≥ 1, be pairs of random variables in X × Y, and assume that, as n → ∞, (X^(n), Y^(n)) →_d (X, Y). Assume further one of the following two conditions:
(i) The sequences ‖X^(n)‖^{α*} and ‖Y^(n)‖^{α*} are uniformly integrable.
(ii) E‖X^(n)‖^{α*} → E‖X‖^{α*} < ∞ and E‖Y^(n)‖^{α*} → E‖Y‖^{α*} < ∞.
Then

dcov_α(X^(n), Y^(n)) → dcov_α(X, Y).    (4.3)

Proof. (i): Since
X × Y is a separable metric space, we may by the Skorohod coupling theorem [15, Theorem 4.30] without loss of generality assume that (X^(n), Y^(n)) →_{a.s.} (X, Y). Furthermore, the assumption in (i) implies that sup_n E‖X^(n)‖^{α*} < ∞, and thus E‖X‖^{α*} < ∞ by Fatou's lemma. Since d(X^(n), X) ≤ ‖X^(n)‖ + ‖X‖, it follows, similarly to (3.18), that the sequence d(X^(n), X)^{α*} is uniformly integrable. Since we have assumed d(X^(n), X) →_{a.s.} 0, this implies E d(X^(n), X)^{α*} → 0. Similarly, E d(Y^(n), Y)^{α*} → 0. Thus Theorem 4.1 applies and yields (4.3).
(ii): We have X^(n) →_d X and thus ‖X^(n)‖ →_d ‖X‖. This and our assumption E‖X^(n)‖^{α*} → E‖X‖^{α*} imply that the sequence ‖X^(n)‖^{α*} is uniformly integrable [11, Theorem 5.5.9]. The same holds for Y^(n), and thus part (i) applies. □

Remark 4.3.
Suppose that the metric spaces X and Y are complete. (This ensures that all probability measures are tight; see e.g. [2].) Give X × Y the metric (for example)

d((x₁, y₁), (x₂, y₂)) := d_X(x₁, x₂) + d_Y(y₁, y₂).    (4.4)

Let P_α(X × Y) be the space of all Borel probability measures µ on X × Y such that ∫_{X×Y} ‖(x, y)‖^α dµ(x, y) < ∞. In other words, P_α(X × Y) is the space of all distributions of pairs of random variables (X, Y) ∈ X × Y such that E‖X‖^α < ∞ and E‖Y‖^α < ∞.
Define a metric in P_α(X × Y) by

d_α(µ, µ′) := inf E[d((X, Y), (X′, Y′))^α],  if 0 < α ≤ 1;
d_α(µ, µ′) := (inf E[d((X, Y), (X′, Y′))^α])^{1/α},  if α ≥ 1,    (4.5)

taking the infimum over all pairs of random variables (X, Y) and (X′, Y′) in X × Y such that (X, Y) ∼ µ and (X′, Y′) ∼ µ′; see e.g. [6, pp. 796–799 (in the English translation)]. (This is known under various names, including Kantorovich distance, Wasserstein distance and minimal L^α distance, see also [21].) Convergence of a sequence L(X^(n), Y^(n)) of distributions to L(X, Y) in this metric is equivalent to convergence in distribution (X^(n), Y^(n)) →_d (X, Y) (i.e., weak convergence of the distributions) together with uniform integrability of ‖(X^(n), Y^(n))‖^{α*} (or, equivalently, convergence of moments E‖(X^(n), Y^(n))‖^{α*} → E‖(X, Y)‖^{α*}).
Theorem 4.1 then says that dcov_α is a continuous functional on P_{α*}(X × Y), for every α > 0. □

Consistency.
Let µ ∈ P ( X × Y ) be the distribution of ( X , Y ). Then,( X , Y ) , . . . can be regarded as a sequence of independent samples from µ .Let ν n be the empirical distribution of the first n samples, i.e., ν n := 1 n n X i =1 δ ( X i , Y i ) ∈ P ( X × Y ) . (4.6)Note that ν n is a random probability measure. Hence, its distance covari-ance dcov α ( ν n ) is a random variable. The following theorem shows thatthis random variable converges to dcov α ( µ ) a.s.; in other words, the dis-tance covariance of the empirical distribution is a consistent estimator ofthe covariance distance of µ . As said in the introduction, this was provedby Sz´ekely, Rizzo and Bakirov [26] for the Euclidean case with α ∈ (0 , α = 1, the result was stated by Lyons [18],but his proof requires a stronger moment condition. Second moments areenough for α = 1, see [26, Remark 3]; Jakobsen [12] improved this andshowed that 5 / α >
0, assuming $2\alpha$ moments. We can now show consistency assuming only $\alpha^*$ moments, as required by our definitions. In particular, this shows that for $\alpha = 1$, first moments suffice, as stated in [18].

Theorem 4.4.
Let $\mu$ be the distribution of $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ and assume that $E\|X\|^{\alpha^*}, E\|Y\|^{\alpha^*} < \infty$. If $\nu_n$ is the empirical distribution (4.6), then
$$\operatorname{dcov}_\alpha(\nu_n) \overset{\mathrm{a.s.}}{\longrightarrow} \operatorname{dcov}_\alpha(\mu). \qquad (4.7)$$

Proof.
Conditionally on the sequence $(\nu_n)_n$ of empirical measures, let $(X^{(n)}, Y^{(n)})$ be a random variable with distribution $\nu_n$. Since $\mathcal{X} \times \mathcal{Y}$ is a separable metric space, the distribution $\nu_n$ converges a.s. to $\mu$ (in the usual weak topology); see [27] or [2, Problem 4.4]. In other words, a.s., conditionally on $(\nu_n)_n$, $(X^{(n)}, Y^{(n)}) \overset{d}{\longrightarrow} (X, Y)$. Furthermore, by the definition (4.6) of $\nu_n$, conditioning on the sequence $(\nu_k)_k$,
$$E\bigl(\|X^{(n)}\|^{\alpha^*} \bigm| (\nu_k)_k\bigr) = \frac{1}{n} \sum_{i=1}^n \|X_i\|^{\alpha^*}. \qquad (4.8)$$
Hence, the strong law of large numbers (in $\mathbb{R}$) shows that a.s., conditioned on $(\nu_k)_k$, $E\|X^{(n)}\|^{\alpha^*} \to E\|X\|^{\alpha^*}$, and similarly also $E\|Y^{(n)}\|^{\alpha^*} \to E\|Y\|^{\alpha^*}$. Consequently, Theorem 4.2(ii) applies a.s. to the sequence $(\nu_n)_n$ and the corresponding random variables $(X^{(n)}, Y^{(n)})$; hence $\operatorname{dcov}_\alpha(\nu_n) \overset{\mathrm{a.s.}}{\longrightarrow} \operatorname{dcov}_\alpha(\mu)$. □

Our proofs of Theorems 4.2 and 4.4 give no information on the rate of convergence, leading to the following problems.
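In practice, the estimator $\operatorname{dcov}_\alpha(\nu_n)$ of Theorem 4.4 can be computed in $O(n^2)$ time from the two matrices of pairwise $\alpha$-th powers of distances, via the familiar double-centering identity for the empirical measure. The following Python sketch is ours, not the paper's; the function name `dcov_alpha_emp` is hypothetical, and Euclidean samples are assumed:

```python
import numpy as np

def dcov_alpha_emp(x, y, alpha=1.0):
    """Plug-in estimator dcov_alpha(nu_n): the alpha-distance covariance of
    the empirical measure of the paired samples (x_i, y_i).

    x: (n, p) array, y: (n, q) array; rows are paired samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # pairwise alpha-th powers of Euclidean distances
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** alpha
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1) ** alpha
    # double centering: A_ij = a_ij - (row mean)_i - (column mean)_j + grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # dcov_alpha of the empirical measure: average of the entrywise product
    return (A * B).mean()
```

On simulated data the value stays near 0 for independent samples and bounded away from 0 for dependent ones as $n$ grows, in line with Theorem 4.4.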
Problem 4.5.
What is the rate of convergence in (4.3), under suitable hypotheses on $(X_n, Y_n)$?

Problem 4.6.
What is the rate of convergence in (4.7), under suitable hypotheses on $(X, Y)$?

5. Hilbert spaces, preliminaries
In this and the next two sections we assume that $\mathcal{X}$ and $\mathcal{Y}$ are separable Hilbert spaces; we therefore change notation and write $\mathcal{X} = H$ and $\mathcal{Y} = H'$. We give our extension of Definition 1.4 of distance covariance in Section 6, but we first need some preliminaries.

5.1. Characteristic random variables.
Let $H$ be a separable Hilbert space, of finite or infinite dimension $\dim H$. Fix an ON-basis $(e_i)_1^{\dim H}$ in $H$, and let $\xi_i$, $i = 1, 2, \dots$, be i.i.d. $N(0, 1)$; let $\xi := (\xi_i)_1^{\dim H}$, a random vector of length $\dim H$ (finite or infinite). Define for any $x \in H$,
$$\xi \cdot x = x \cdot \xi := \sum_{i=1}^{\dim H} \langle x, e_i \rangle \xi_i. \qquad (5.1)$$
Note that in the finite-dimensional case, $\xi \in H$ and this is the usual inner product. In the infinite-dimensional case $\xi \notin H$ a.s., but the sum in (5.1) converges a.s. since $\sum_i |\langle x, e_i \rangle|^2 = \|x\|^2 < \infty$. Hence, $\xi \cdot x$ is defined a.s. in any case. Note also that $\xi \cdot x$ is a real-valued random variable, and that
$$\xi \cdot x \sim N\bigl(0, \|x\|^2\bigr). \qquad (5.2)$$
Let $X$ be an $H$-valued random variable, and assume that $\xi$ is independent of $X$. Then $\xi \cdot X$ exists a.s.; thus $\xi \cdot X$ is a well-defined real-valued random variable. Consider the conditional expectation
$$\Phi_X(\xi) := E\bigl(e^{\mathrm{i}\, \xi \cdot X} \bigm| \xi\bigr). \qquad (5.3)$$
This is a complex-valued random variable (determined a.s.), which can be written as a (deterministic) function of $\xi$. In the finite-dimensional case $\dim H = q < \infty$, we may identify $H$ with $\mathbb{R}^q$, with $(e_j)_1^q$ as the standard basis. Then (5.1) and (5.3) show that
$$\Phi_X(\xi) = \varphi_X(\xi) \quad \text{a.s.,} \qquad (5.4)$$
where $\varphi_X(t) := E\, e^{\mathrm{i}\, t \cdot X}$ is the usual characteristic function. For this reason, we say, for a general Hilbert space $H$, that $\Phi_X(\xi)$ is the characteristic random variable of $X$. Note that $\Phi_X(\xi)$ is a complex random variable, with $|\Phi_X(\xi)| \le 1$. $\Phi_X(\xi)$ depends on the choices of $(e_j)_j$ and $(\xi_j)_j$, but these choices are regarded as fixed. Moreover, the following theorem says that $\Phi_X(\xi)$ has the same fundamental property as the usual characteristic function: it depends on $X$ only through its distribution, and conversely, it characterizes the distribution.
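In the finite-dimensional case, the identity (5.4) can be checked numerically: conditionally on $\xi$, the characteristic random variable $\Phi_X(\xi)$ is the ordinary characteristic function evaluated at the Gaussian point $\xi$, and the conditional expectation (5.3) can be approximated by Monte Carlo. A small Python sketch (ours, not the paper's; it takes $X \sim N(0, I_q)$, for which $\varphi_X(t) = e^{-|t|^2/2}$ is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
q = 3                                  # dim H = q < infinity, so H = R^q
xi = rng.standard_normal(q)            # one realisation of the Gaussian vector xi

# For X ~ N(0, I_q) the characteristic function is phi_X(t) = exp(-|t|^2/2),
# so Phi_X(xi) = phi_X(xi) is known exactly (cf. (5.4)).
phi_exact = np.exp(-(xi @ xi) / 2)

# Monte Carlo approximation of the conditional expectation E(e^{i xi.X} | xi),
# using i.i.d. samples of X drawn independently of xi:
X = rng.standard_normal((200_000, q))
phi_mc = np.exp(1j * (X @ xi)).mean()

err = abs(phi_mc - phi_exact)          # Monte Carlo error, O(n^{-1/2})
```

The same scheme works for any distribution of $X$ that can be sampled; only the closed-form benchmark `phi_exact` is specific to the Gaussian example.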
Theorem 5.1.
Let $H$ be a separable Hilbert space, and let $X$ and $Y$ be $H$-valued random variables. Fix as above an ON-basis $(e_i)_1^{\dim H}$ in $H$, and a random vector $\xi := (\xi_i)$ of i.i.d. standard normal random variables $\xi_i$, $i = 1, 2, \dots$, and assume further that these are independent of $X$ and $Y$. Then
$$X \overset{d}{=} Y \iff \Phi_X(\xi) = \Phi_Y(\xi) \text{ a.s.} \qquad (5.6)$$
We prove first a lemma that will help to reduce to the finite-dimensional case.

Lemma 5.2.
Let X be an H -valued random variable and let ξ = ( ξ i ) i be asabove, and in particular independent of X . Then, for any ε > , the event (cid:8) E (cid:0) ∧ | ξ · X | (cid:12)(cid:12) ξ (cid:1) < ε (cid:9) has positive probability.More generally, for any finite set of random variables X (1) , . . . , X ( m ) in H ,all independent of ξ , the events (cid:8) E (cid:0) ∧| ξ · X ( j ) | (cid:12)(cid:12) ξ (cid:1) < ε (cid:9) hold simultaneouslywith positive probability.Proof. For finite N dim H , let Π N be the orthogonal projection of H ontothe subspace H N spanned by e , . . . , e N . Let X N := Π N X and X >N := X − X N , and define ξ N := ( ξ , . . . , ξ N ) and ξ >N := ( ξ N +1 , ξ N +2 , . . . ). Thenwe can write, interpreting the dot products in the obvious way in analogywith (5.1), ξ · X = ξ N · X N + ξ >N · X >N . (5.7)Assume in the remainder of the proof that dim H = ∞ ; the case dim H < ∞ is similar but simpler, taking N := dim H below so X >N = 0.Since the sum in (5.1) converges a.s., and ξ >N · X >N is the tail of this sum,it follows that ξ >N · X >N a . s . −→ N → ∞ . Consequently, by dominatedconvergence, E (cid:0) ∧ | ξ >N · X >N | (cid:1) → N → ∞ . (5.8)Let W N := E (cid:0) ∧ | ξ >N · X >N | (cid:12)(cid:12) ξ (cid:1) . (5.9)Then (5.8) shows E W N →
0; hence we may choose
N < ∞ such that E W N < ε/
4. Then Markov’s inequality yields P (cid:0) W N < ε/ (cid:1) > − E W N ε/ > . (5.10)Moreover, for each i N , again by dominated convergence, E (cid:0) ∧ | s h X , e i i| (cid:1) → s → , (5.11)and thus there exists δ i > | s | < δ i , then E (cid:0) ∧ | s h X , e i i| (cid:1) < ε N . (5.12)Recalling (5.7) and (5.1), we see that | ξ · X | N X i =1 | ξ i h X , e i i| + | ξ >N · X >N | (5.13) and thus1 ∧ | ξ · X | N X i =1 (cid:0) ∧ | ξ i h X , e i i| (cid:1) + (cid:0) ∧ | ξ >N · X >N | (cid:1) . (5.14)Hence, recalling (5.9), E (cid:0) ∧ | ξ · X | (cid:12)(cid:12) ξ (cid:1) N X i =1 E (cid:0) ∧ | ξ i h X , e i i| (cid:12)(cid:12) ξ i (cid:1) + W N . (5.15)Consequently, if ξ is such that W N < ε/ | ξ i | < δ i for i = 1 , . . . , N , then(5.12) implies E (cid:0) ∧ | ξ · X | (cid:12)(cid:12) ξ (cid:1) < N X i =1 ε N + ε ε. (5.16)Since the events { W N < ε/ } and {| ξ i | < δ i } are independent and each haspositive probability, they occur together with positive probability, and thus(5.16) holds with positive probability.This proves the first part of the lemma. The second is proved in the sameway, choosing N so large that (5.10) holds with W N replaced by P mj =1 W ( j ) N ,where W ( j ) N is defined by (5.9) but using X ( j ) instead of X , and then choosing δ i so small that (5.12) holds for each X ( j ) (cid:3) Proof of Theorem 5.1. = ⇒ : If X d = Y , then ( X , ξ ) d = ( Y , ξ ) and (5.1)implies ( ξ · X , ξ ) d = ( ξ · Y , ξ ) which by (5.3) implies Φ X ( ξ ) = Φ Y ( ξ ) a.s. ⇐ = : We let N dim H be finite and use the notation in the proof ofLemma 5.2. Then (5.7) holds, and thus (cid:12)(cid:12) e i ξ · X − e i ξ N · X N (cid:12)(cid:12) = (cid:12)(cid:12) e i ξ >N · X >N − (cid:12)(cid:12) ∧ | ξ >N · X >N | . (5.17)Hence, (cid:12)(cid:12) E (cid:0) e i ξ · X (cid:12)(cid:12) ξ (cid:1) − E (cid:0) e i ξ N · X N (cid:12)(cid:12) ξ (cid:1)(cid:12)(cid:12) E (cid:0) ∧ | ξ >N · X >N | (cid:12)(cid:12) ξ (cid:1) a.s. 
(5.18)

Using (5.3), (5.18) can be written, since $\xi_N$ and $\xi_{>N}$ are independent,
$$\bigl|\Phi_X(\xi) - \Phi_{X_N}(\xi_N)\bigr| \le E\bigl(1 \wedge |\xi_{>N} \cdot X_{>N}| \bigm| \xi_{>N}\bigr) \quad \text{a.s.} \qquad (5.19)$$
Similarly, with analogous notation,
$$\bigl|\Phi_Y(\xi) - \Phi_{Y_N}(\xi_N)\bigr| \le E\bigl(1 \wedge |\xi_{>N} \cdot Y_{>N}| \bigm| \xi_{>N}\bigr) \quad \text{a.s.} \qquad (5.20)$$
The assumption $\Phi_X(\xi) = \Phi_Y(\xi)$ a.s. thus implies
$$\bigl|\Phi_{X_N}(\xi_N) - \Phi_{Y_N}(\xi_N)\bigr| \le E\bigl(1 \wedge |\xi_{>N} \cdot X_{>N}| \bigm| \xi_{>N}\bigr) + E\bigl(1 \wedge |\xi_{>N} \cdot Y_{>N}| \bigm| \xi_{>N}\bigr) \quad \text{a.s.} \qquad (5.21)$$
Lemma 5.2 (applied to $X_{>N}$ and $Y_{>N}$) implies that for any $\varepsilon >$
0, the right-hand side of (5.21) is less than 4 ε with positive probability. Furthermore,the left-hand side of (5.21) is a function of ξ N , and the right-hand side is afunction of ξ >N ; thus the two sides are independent. Consequently, (5.21)implies (cid:12)(cid:12) Φ X N ( ξ N ) − Φ Y N ( ξ N ) (cid:12)(cid:12) < ε a.s. (5.22) N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 17
Since ε is arbitrary, this showsΦ X N ( ξ N ) = Φ Y N ( ξ N ) a.s. (5.23)Since X N and Y N live in the finite-dimensional space H N , (5.4) appliesand shows ϕ X N ( ξ N ) = Φ X N ( ξ N ) = Φ Y N ( ξ N ) = ϕ Y N ( ξ N ) a.s. , (5.24)where ϕ X N ( t ) and ϕ Y N ( t ) are the ordinary characteristic functions in R N (identified with H N ). Hence, ϕ X N ( t ) = ϕ Y N ( t ) (5.25)for a.e. t ∈ R N , and since characteristic functions are continuous, (5.25)holds for all t ∈ R N , and thus X N d = Y N . (5.26)If dim H < ∞ , we may choose N = dim H and the result X d = Y follows.(Much of the argument above is not needed in this case.)If dim H = ∞ , then (5.26) holds for every finite N . Furthermore, as N → ∞ , we have X N a . s . −→ X and thus X N d −→ X and similarly Y N d −→ Y . Consequently, X d = Y , which completes the proof. (cid:3) Remark 5.3.
The mapping x ξ · x is an isometry of H onto the GaussianHilbert space spanned by the random variables ξ i , and it can be regardedas an abstract stochastic integral, cf. [13, Chapter VII.2]. It replaces the Itˆointegrals used in [8]. (cid:3) Remark 5.4.
The arguments above are related to the proof of [18, Theorem3.16]. We sketch the connection: That proof uses an embedding φ of theHilbert space into L ( R ∞ × R ); if we compose φ with the Fourier transform f R e π i tx f ( x ) d x acting on the last variable (which is an isometry), weobtain an equivalent embedding ˆ φ , which in our notation equalsˆ φ : x → i2 πt (cid:0) e i c ′ t ξ · x − (cid:1) ∈ L ( P × d t ) (5.27)for a constant c ′ >
0. Hence, if µ = L ( X ), the distribution of X , then,combining the notation of [18] and ours, β ˆ φ ( µ ) := E (cid:0) φ ′ ( X ) | ξ (cid:1) = i2 πt (cid:0) Φ c ′ t X ( ξ ) − (cid:1) . (5.28)Hence, the result in [18, Theorem 3.16] that β φ ( µ ) characterises µ is closelyrelated to, and follows from, Theorem 5.1. Furthermore, the two proofs aresimilar; both are based on approximating with the finite-dimensional casewhich is easy. (cid:3) Independence and characteristic random variables.
Now consider a pair of random variables $(X, Y)$ taking values in two, possibly different, separable Hilbert spaces $H$ and $H'$. Fix, as above, an ON-basis $(e_i)_1^{\dim H}$ in $H$, and i.i.d. $N(0, 1)$ random variables $\xi_i$, $i = 1, 2, \dots$. Similarly, fix an ON-basis $(e'_j)_1^{\dim H'}$ in $H'$, and i.i.d. $N(0, 1)$ random variables $\eta_j$, $j = 1, 2, \dots$. Assume that all $\xi_i$ and $\eta_j$ are independent of each other and of $(X, Y)$. Then $(X, Y)$ is a random variable in the Hilbert space $H \oplus H' = H \times H'$, and $e_1, e'_1, e_2, e'_2, \dots$ is an ON-basis in this space. Let $\xi = (\xi_i)_1^{\dim H}$, $\eta := (\eta_i)_1^{\dim H'}$, and $\zeta := (\xi_1, \eta_1, \xi_2, \eta_2, \dots)$.

Theorem 5.5.
Let $(X, Y)$ be a pair of random variables taking values in separable Hilbert spaces $H$ and $H'$. Then, with notation as above, $X$ and $Y$ are independent if and only if
$$E\bigl(e^{\mathrm{i}\, \xi \cdot X + \mathrm{i}\, \eta \cdot Y} \bigm| \xi, \eta\bigr) = E\bigl(e^{\mathrm{i}\, \xi \cdot X} \bigm| \xi\bigr)\, E\bigl(e^{\mathrm{i}\, \eta \cdot Y} \bigm| \eta\bigr) \quad \text{a.s.} \qquad (5.29)$$

Proof.
Let $Y'$ be a copy of $Y$, independent of $X$, $\xi$, $\eta$. Then $X$ and $Y$ are independent if and only if $(X, Y) \overset{d}{=} (X, Y')$, and the result follows from Theorem 5.1, applied to the Hilbert space $H \times H'$, noting that with the bases and Gaussian variables above, $\zeta \cdot (X, Y) = \xi \cdot X + \eta \cdot Y$ a.s., and thus
$$\Phi_{(X, Y)}(\zeta) = E\bigl(e^{\mathrm{i}\, \zeta \cdot (X, Y)} \bigm| \xi, \eta\bigr) = E\bigl(e^{\mathrm{i}\, \xi \cdot X + \mathrm{i}\, \eta \cdot Y} \bigm| \xi, \eta\bigr), \qquad (5.30)$$
while, by independence and $Y \overset{d}{=} Y'$,
$$\Phi_{(X, Y')}(\zeta) = E\bigl(e^{\mathrm{i}\, \xi \cdot X + \mathrm{i}\, \eta \cdot Y'} \bigm| \xi, \eta\bigr) = E\bigl(e^{\mathrm{i}\, \xi \cdot X} \bigm| \xi\bigr)\, E\bigl(e^{\mathrm{i}\, \eta \cdot Y'} \bigm| \eta\bigr) = E\bigl(e^{\mathrm{i}\, \xi \cdot X} \bigm| \xi\bigr)\, E\bigl(e^{\mathrm{i}\, \eta \cdot Y} \bigm| \eta\bigr) \quad \text{a.s.} \qquad (5.31)$$
□

Note that, by (2.1), (5.29) may be written
$$\operatorname{Cov}\bigl(e^{\mathrm{i}\, \xi \cdot X},\, e^{\mathrm{i}\, \eta \cdot Y} \bigm| \xi, \eta\bigr) = 0 \quad \text{a.s.} \qquad (5.32)$$

6. Distance covariance in Hilbert space
We give a new definition of distance covariance for Hilbert spaces; it can be seen as a version of Definition 1.4 for Euclidean spaces, where we replace the characteristic functions there by the characteristic random variables defined in Section 5, which makes the extension to infinite-dimensional Hilbert spaces possible. (The definition is inspired by [8, Lemma 4.1]; see Remark 5.3.) Define, for $0 < \alpha < 2$,
$$c_\alpha := \frac{2^{\alpha/2}}{-\Gamma(-\alpha/2)} = \frac{\alpha\, 2^{\alpha/2}}{2\, \Gamma(1 - \alpha/2)}. \qquad (6.1)$$

Definition 6.1.
Let $(X, Y)$ be a pair of random vectors in separable Hilbert spaces, and let $0 < \alpha < 2$. Then, with notation as in Section 5,
$$\operatorname{dcov}_\alpha(X, Y) = \operatorname{dcov}^H_\alpha(X, Y) := c_\alpha^2 \int_0^\infty \int_0^\infty E\bigl|\Phi_{(rX, sY)}(\xi, \eta) - \Phi_{rX}(\xi)\, \Phi_{sY}(\eta)\bigr|^2 \, \frac{dr\, ds}{r^{\alpha+1} s^{\alpha+1}} \qquad (6.2)$$
$$= c_\alpha^2 \int_0^\infty \int_0^\infty E\Bigl| E\bigl(e^{\mathrm{i}\, r \xi \cdot X + \mathrm{i}\, s \eta \cdot Y} \bigm| \xi, \eta\bigr) - E\bigl(e^{\mathrm{i}\, r \xi \cdot X} \bigm| \xi\bigr)\, E\bigl(e^{\mathrm{i}\, s \eta \cdot Y} \bigm| \eta\bigr) \Bigr|^2 \, \frac{dr\, ds}{r^{\alpha+1} s^{\alpha+1}} \qquad (6.3)$$
$$= c_\alpha^2 \int_0^\infty \int_0^\infty E\Bigl| \operatorname{Cov}\bigl(e^{\mathrm{i}\, r \xi \cdot X},\, e^{\mathrm{i}\, s \eta \cdot Y} \bigm| \xi, \eta\bigr) \Bigr|^2 \, \frac{dr\, ds}{r^{\alpha+1} s^{\alpha+1}}. \qquad (6.4)$$
The expressions (6.2)–(6.4) are equal by the definitions (5.3) and (2.1) above, cf. (5.30) and (5.32). Note that no moment assumptions are made; as for Definition 1.4, the definition works for any $(X, Y)$ in these spaces, but $\operatorname{dcov}_\alpha(X, Y)$ may be infinite. Furthermore, as shown in the next theorem, for the special case of Euclidean spaces, Definition 6.1 agrees with Definition 1.4, again without moment conditions.

Theorem 6.2.
Let $0 < \alpha < 2$. If $(X, Y)$ is a pair of random vectors in Euclidean spaces $\mathbb{R}^p$ and $\mathbb{R}^q$, then Definitions 1.4 and 6.1 agree, i.e., $\operatorname{dcov}^E_\alpha(X, Y) = \operatorname{dcov}^H_\alpha(X, Y)$.

Proof. Assume that $H = \mathbb{R}^p$ and $H' = \mathbb{R}^q$. Then (5.4) implies
$$\Phi_{(rX, sY)}(\xi, \eta) = \varphi_{(rX, sY)}(\xi, \eta) = \varphi_{(X, Y)}(r\xi, s\eta) \qquad (6.5)$$
and thus, since $r\xi \sim N(0, r^2 I_p)$ and $s\eta \sim N(0, s^2 I_q)$, where $I_k$ is the identity matrix in $\mathbb{R}^k$,
$$E\bigl|\Phi_{(rX, sY)}(\xi, \eta) - \Phi_{rX}(\xi)\, \Phi_{sY}(\eta)\bigr|^2 = E\bigl|\varphi_{(X, Y)}(r\xi, s\eta) - \varphi_X(r\xi)\, \varphi_Y(s\eta)\bigr|^2$$
$$= \int_{t \in \mathbb{R}^p} \int_{u \in \mathbb{R}^q} \bigl|\varphi_{(X, Y)}(t, u) - \varphi_X(t)\, \varphi_Y(u)\bigr|^2 \, \frac{e^{-|t|^2/2r^2}}{(2\pi r^2)^{p/2}} \, \frac{e^{-|u|^2/2s^2}}{(2\pi s^2)^{q/2}} \, dt\, du. \qquad (6.6)$$
Substituting this in (6.2), we obtain (1.9) by interchanging the order of integration, because, by elementary calculations,
$$\int_0^\infty \frac{e^{-|t|^2/2r^2}}{(2\pi r^2)^{p/2}} \, \frac{dr}{r^{\alpha+1}} = 2^{\alpha/2 - 1}\, \Gamma\bigl((p + \alpha)/2\bigr)\, \pi^{-p/2}\, |t|^{-p-\alpha}, \qquad (6.7)$$
which matches the constants $c_{\alpha, p}$ and $c_\alpha$, see (1.8) and (6.1); similarly for the integral over $s$. □

Remark 6.3.
The proof of Theorem 6.2 together with Remark 1.6 shows that the restriction $\alpha < 2$ is necessary: for $\alpha > 2$, the integrals typically diverge, for example for $H = H' = \mathbb{R}$ and $X = Y \sim N(0, 1)$. (Possibly, for $\alpha > 2$ the integrals always diverge except when $X$ and $Y$ are independent, but we have not verified that.) □

We return to the general Hilbert space case, and show that Definition 6.1 agrees with the earlier ones; this is an abstract version of [8, Lemma 4.1], where the Hilbert spaces are $L^2[0, 1]$.

Theorem 6.4.
Let < α < . If ( X , Y ) is a pair of random vectorsin Hilbert spaces H and H ′ , and E k X k α < ∞ and E k Y k α < ∞ , thenDefinitions 1.2, 1.3 and 6.1 agree, i.e., dcov H α ( X , Y ) = dcov b α ( X , Y ) =dcov ∼ α ( X , Y ) , and this value is finite.Proof. Let again ( X , Y ) , . . . be i.i.d. copies of ( X , Y ), and assume that ξ and η are independent of all of them. Then, using (5.30)–(5.31) and (5.2), E (cid:12)(cid:12) Φ ( r X ,s Y ) ( ξ , η ) − Φ r X ( ξ )Φ s Y ( η ) (cid:12)(cid:12) = E E h(cid:16) e i r ξ · X +i s η · Y − e i r ξ · X +i s η · Y (cid:17)(cid:16) e − i r ξ · X − i s η · Y − e − i r ξ · X − i s η · Y (cid:17) (cid:12)(cid:12) ξ , η i = E h(cid:16) e i r ξ · X +i s η · Y − e i r ξ · X +i s η · Y (cid:17)(cid:16) e − i r ξ · X − i s η · Y − e − i r ξ · X − i s η · Y (cid:17)i = E e i r ξ · ( X − X )+i s η · ( Y − Y ) − E e i r ξ · ( X − X )+i s η · ( Y − Y )0 SVANTE JANSON − E e i r ξ · ( X − X )+i s η · ( Y − Y ) + E e i r ξ · ( X − X )+i s η · ( Y − Y ) = E e − r k X − X k − s k Y − Y k − E e − r k X − X k − s k Y − Y k − E e − r k X − X k − s k Y − Y k + E e − r k X − X k − s k Y − Y k . (6.8)Define the real-valued random variableΛ X ( u ) := E e − u k X − X k − E e − u k X − X k + E e − u k X − X k − E e − u k X − X k (6.9)and define Λ Y ( u ) similarly. Then, by expanding the product and usingsymmetry, E (cid:2) Λ X ( u )Λ Y ( v ) (cid:3) = 4 (cid:16) E e − u k X − X k − v k Y − Y k − E e − u k X − X k − v k Y − Y k − E e − u k X − X k − v k Y − Y k + E e − u k X − X k − v k Y − Y k (cid:17) . (6.10)Consequently, (6.8) yields E (cid:12)(cid:12) Φ ( r X ,s Y ) ( ξ , η ) − Φ r X ( ξ )Φ s Y ( η ) (cid:12)(cid:12) = 14 E h Λ X (cid:16) r (cid:17) Λ Y (cid:0) s (cid:1)i (6.11)and the definition (6.3) yields, with a change of variables,dcov H α ( X , Y ) = c α Z ∞ Z ∞ E h Λ X (cid:16) r (cid:17) Λ Y (cid:0) s (cid:1)i d r d sr α +1 s α +1 = c α α Z ∞ Z ∞ E (cid:2) Λ X ( u )Λ Y ( v ) (cid:3) d u d vu α/ v α/ . 
(6.12)We rewrite (6.9) as, with indices interpreted modulo 4,Λ X ( u ) = X i =1 ( − i − e − u k X i − X i +1 k = X i =1 ( − i (cid:0) − e − u k X i − X i +1 k (cid:1) . (6.13)Recall that for 0 < γ <
1, see [19, (5.9.5)], Z ∞ (cid:0) − e − x (cid:1) x − γ − d x = − Γ( − γ ) . (6.14)Hence, (6.13) and a change of variables yield Z ∞ Λ X ( u ) d uu α/ = X i =1 ( − i Z ∞ (cid:0) − e − u k X i − X i +1 k (cid:1) d uu α/ = − Γ( − α/ X i =1 ( − i k X i − X i +1 k α = Γ( − α/ b X α . (6.15)If we naively interchange order of integrations and expectation in (6.12), anduse (6.15), we obtain (1.3) and thus dcov H α ( X , Y ) = dcov b α ( X , Y ), since c α is defined in (6.1) so that constant factors cancel. However, this interchangerequires justification; indeed it is not always allowed, since the expectation N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 21 in (1.3) does not always exist, not even as an extended real number, seeExample 8.7, while (6.2)–(6.4) always exist in [0 , ∞ ].Hence, we introduce an integrating factor. Let M >
0; we will later let M → ∞ . Similarly to (6.15), we have Z ∞ e − Mu Λ X ( u ) d uu α/ = X i =1 ( − i Z ∞ (cid:0) e − Mu − e − u ( k X i − X i +1 k + M ) (cid:1) d uu α/ = − Γ( − α/ X i =1 ( − i (cid:0)(cid:0) k X i − X i +1 k + M (cid:1) α/ − M α/ (cid:1) . (6.16)Let α ∈ (0 ,
2) be given and define, for x > h M ( x ) := x α/ + M α/ − ( x + M ) α/ . (6.17)Then, (6.15) and (6.16) yield Z ∞ (cid:0) − e − Mu (cid:1) Λ X ( u ) d uu α/ = Γ( − α/ X i =1 ( − i − h M (cid:0) k X i − X i +1 k (cid:1) =: Γ( − α/ b X α ; M , (6.18)where thus we define b X α ; M := X i =1 ( − i − h M (cid:0) k X i − X i +1 k (cid:1) . (6.19)Note also that the integrand in (6.12) is non-negative by (6.11). Hence,(6.12) and monotone convergence yielddcov H α ( X , Y )= lim M →∞ c α α Z ∞ Z ∞ (cid:0) − e − Mu (cid:1)(cid:0) − e − Mv (cid:1) E (cid:2) Λ X ( u )Λ Y ( v ) (cid:3) d u d vu α/ v α/ . (6.20)Furthermore, | Λ X ( u ) | and | Λ Y ( v ) | are bounded (by 4) by (6.13), and thusFubini applies so we may interchange expectation and integrations in (6.20),which by (6.18) yields, recalling (6.1),dcov H α ( X , Y ) = lim M →∞ E [ b X α ; M b Y α ; M ] . (6.21)Since α/ ∈ (0 , h M in (6.17) is increasing, with h M (0) = 0and h M ( x ) ր M α/ as x → ∞ . Similarly, h M ( x ) = h x ( M ) ր x α/ as M → ∞ ; hence, the definitions (6.19) and (1.4) yield b X α ; M → b X α as M → ∞ . (6.22)Furthermore, if 0 x y , then0 h M ( y ) − h M ( x ) y α/ − x α/ , (6.23)and it follows that for any Z , Z ∈ H , (cid:12)(cid:12) h M (cid:0) k Z k (cid:1) − h M (cid:0) k Z k (cid:1)(cid:12)(cid:12) (cid:12)(cid:12) k Z k α − k Z k α (cid:12)(cid:12) . (6.24) We claim that Lemma 3.2 holds for b X α ; M too, so that, in particular, | b X α ; M | C X i =1 k X i k α/ k X i +1 k α/ , (6.25)where the constant C does not depend on M . This is seen by repeating theproof of Lemma 3.2, recalling the definition (6.19) of b X α ; M and using (6.24);we omit the details.Let b X ∗ be the right-hand side of (6.25). We now use the assumption E k X k α < ∞ , which implies that b X ∗ ∈ L . Similarly, | b Y α ; M | b Y ∗ with b Y ∗ ∈ L . 
Consequently, | b X α ; M b Y α ; M | b X ∗ b Y ∗ ∈ L , so dominated convergenceapplies to (6.21) and we obtain, by (6.22),dcov H α ( X , Y ) = E [ lim M →∞ b X α ; M b Y α ; M ] = E [ b X α b Y α ] = dcov b α ( X , Y ) , (6.26)using (1.3). Hence, Definitions 6.1 and 1.2 agree (under the given momentcondition). By Theorem 3.5, they agree with Definition 1.3 too; furthermore,the value is finite. (cid:3) Remark 6.5.
Note that the proof shows that (6.21) holds for any random variables in Hilbert spaces, without any moment condition. (With the result possibly $+\infty$.) □

Independence and distance covariance.
For (separable) Hilbertspaces, as said in the introduction, Lyons [18, Theorem 3.16] showed that(1.10) holds for α = 1, and Dehling et al. [8, Theorem 4.2] extended this toall α ∈ (0 , Theorem 6.6 (Dehling et al. [8, Theorem 4.2]) . Let X = H and Y = H ′ be separable Hilbert spaces and let α ∈ (0 , . Use Definition 1.1, 1.2, 1.3or 6.1, and assume (for the first three) the moment condition there. Then dcov α ( X , Y ) = 0 if and only if X and Y are independent.Proof. For Definitions 1.1–1.3, the moment condition there and Theorems 3.5and 6.2 show that dcov α ( X , Y ) equals dcov H α ( X , Y ) given by Definition 6.1.Hence, we may in all cases use dcov H α . It follows from (6.3) that dcov H α ( X , Y ) =0 if and only if (5.29) holds, and the result follows by Theorem 5.5. (cid:3) Remark 6.7.
This theorem is stated in [8] for the case $\mathcal{X} = \mathcal{Y} = L^2[0, 1]$ (so $X$ and $Y$ are stochastic processes on $[0, 1]$). (Only $X$, $Y$ that satisfy some smoothness conditions are considered in [8], but this is for other reasons and is not needed for Theorem 6.6.) The theorem in [8] is stated assuming only finite $\alpha$ moments, as we do above for Definitions 1.2 and 1.3; however, [8] uses Definition 1.1, which in general requires somewhat more for existence, see Theorem 8.1 below. □

Remark 6.8.
Theorem 6.6 includes the case when X or Y has finite dimen-sion, i.e., is a Euclidean space.Furthermore, although the theorem is stated for separable Hilbert spaces,it extends also to non-separable spaces, provided we assume that X and Y N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 23 are Bochner measurable, for the trivial reason that this implies that X and Y a.s. take values in some separable subspaces H and H ′ . (cid:3) Remark 6.9.
The proof of Theorem 6.6 would be much simpler if distancecovariance was monotone under orthogonal projections, so that we wouldhave dcov α (Π N X , Π N Y ) dcov α ( X , Y ). However, this is not always thecase, even in finite dimension, as is seen by the following example. (cid:3) Example 6.10.
Let X = Y = R and let X = ( X ′ , X ′′ ) and Y = ( Y ′ , Y ′′ ),where X ′ = Y ′ , but X ′ , X ′′ , Y ′′ are independent and non-degenerate. (Fordefiniteness, we may take X ′ , X ′′ , Y ′′ ∼ Be(1 / N (0 , R → R be the standard projection onto the first coordinate, so (Π X , Π Y ) =( X ′ , Y ′ ).For a ∈ R , let X ( a ) := ( X ′ , aX ′′ ) and Y ( a ) := ( Y ′ , aY ′′ ); thus ( X (1) , Y (1)) =( X , Y ) and ( X (0) , Y (0)) = ( X ′ , Y ′ ) (regarding R as a subspace of R ). For t = ( t ′ , t ′′ ) and u = ( u ′ , u ′′ ), we have ϕ X ( a ) , Y ( a ) ( t , u ) = E e i( t ′ X ′ + u ′ X ′ + t ′′ aX ′′ + u ′′ aY ′′ ) = ϕ X ′ ( t ′ + u ′ ) ϕ X ′′ ( at ′′ ) ϕ Y ′′ ( au ′′ ) (6.27)and similarly (or by taking t = 0 or u = 0 in (6.27)) ϕ X ( a ) ( t ) = ϕ X ′ ( t ′ ) ϕ X ′′ ( at ′′ ) , ϕ Y ( a ) ( u ) = ϕ X ′ ( u ′ ) ϕ Y ′′ ( au ′′ ) . (6.28)Hence, (1.9) yieldsdcov α ( X ( a ) , Y ( a ))= c α, c α, Z t ∈ R Z u ∈ R (cid:12)(cid:12) ϕ X ′ ( t ′ + u ′ ) − ϕ X ′ ( t ′ ) ϕ X ′ ( u ′ ) (cid:12)(cid:12) (cid:12)(cid:12) ϕ X ′′ ( at ′′ ) ϕ Y ′′ ( au ′′ ) (cid:12)(cid:12) d t d u | t | α | u | α (6.29)and it is obvious thatdcov α ( X , Y ) = dcov α ( X (1) , Y (1)) < dcov α ( X (0) , Y (0)) = dcov α ( X ′ , Y ′ ) = dcov α (Π X , Π Y ) . (6.30)Thus, an orthogonal projection might increase distance covariance.It can obviously also decrease it; for example the projection onto thesecond coordinate above yields ( X ′′ , Y ′′ ) with dcov α ( X ′′ , Y ′′ ) = 0. (cid:3) Hilbert spaces and α = 2We continue to assume that X and Y are Hilbert spaces; we now considerthe case α = 2. Note that Definition 6.1 does not apply (it requires α < k X i − X j k , b X = − h X , X i + 2 h X , X i − h X , X i + 2 h X , X i = 2 h X − X , X − X i . (7.1)Assume, as in Definitions 1.2 and 1.3, that E k X k < ∞ . Then E X exists,in Bochner sense (see Appendix B), and (1.6) together with (7.1) yield e X = E (cid:0) b X | X , X (cid:1) = − h X − E X , X − E X i . 
(7.2) We thus see directly that (3.2) and (3.3) hold, and thus b X , e X ∈ L if E k X k < ∞ , as asserted by Lemma 3.3.In particular, in the 1-dimensional case X = R , b X = 2( X − X )( X − X ) , e X = − X − E X )( X − E X ) , (7.3)with the latter assuming E | X | < ∞ . Consequently, if X = Y = R and E | X | , E | Y | < ∞ , then Definition 1.3 yields, using (7.3) and independence,dcov ( X , Y ) = E (cid:2) e X e Y (cid:3) = 4Cov( X , Y ) , (7.4)as noted by Sz´ekely, Rizzo and Bakirov [26]. (Definitions 1.1–1.2 agree byTheorems 3.5.) This extends to higher dimensional Euclidean spaces and,more generally, Hilbert spaces as follows. Let H ⊗ H ′ denote the Hilbertspace tensor product of H and H ′ , see e.g. [13, Appendix E]; recall that thisis a Hilbert space such that there is a bilinear map ⊗ : H × H ′ → H ⊗ H ′ with h x ⊗ y , x ⊗ y i H⊗H ′ = h x , x i H h y , y i H ′ ; (7.5)furthermore, if { e i } i and { e ′ j } j are ON-bases in H and H ′ , then { e i ⊗ e ′ j } i,j is an ON-basis in H ⊗ H ′ . (Note that the mapping ⊗ is neither injectivenor surjective, but the set of finite linear combinations P i x i ⊗ y j is densein H ⊗ H ′ .) Hence, X ⊗ Y is a random variable in H ⊗ H ′ with k X ⊗ Y k = k X k k Y k . Theorem 7.1.
Let X = H and Y = H ′ be separable Hilbert spaces, andassume E k X k < ∞ and E k Y k < ∞ . Let ( e i ) i and ( e ′ j ) j be ON-bases in H and H ′ . Then, dcov ( X , Y ) = 4 X i,j Cov (cid:0) h X , e i i , h Y , e ′ j i (cid:1) (7.6)= 4 (cid:13)(cid:13) E ( X ⊗ Y ) − E X ⊗ E Y (cid:13)(cid:13) H⊗H ′ (7.7) Proof.
Since dcov α ( X , Y ) and the expressions in (7.6)–(7.7) are invariantunder (deterministic) shifts of X and Y , we may for convenience assume E X = E Y = 0. Then, by (7.2), e X e Y = 4 h X , X ih Y , Y i = 4 X i,j h X , e i ih X , e i ih Y , e ′ j ih Y , e ′ j i . (7.8)We have, by the Cauchy–Schwarz inequality, X i (cid:12)(cid:12) h X , e i ih X , e i i (cid:12)(cid:12) (cid:16)X i h X , e i i (cid:17) / (cid:16)X i h X , e i i (cid:17) / = k X k k X k (7.9)and thus, by independence and the Cauchy–Schwarz inequality again, E X i,j (cid:12)(cid:12) h X , e i ih X , e i ih Y , e ′ j ih Y , e ′ j i (cid:12)(cid:12) E (cid:2) k X k k X k k Y k k Y k (cid:3) = (cid:0) E (cid:2) k X k k Y k (cid:3)(cid:1) E k X k E k Y k < ∞ . (7.10)Hence, (7.8) yields by Fubini’s theorem, justified by (7.10), E (cid:2) e X e Y (cid:3) = 4 X i,j E (cid:2) h X , e i ih X , e i ih Y , e ′ j ih Y , e ′ j i (cid:3) N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 25 = 4 X i,j (cid:0) E (cid:2) h X , e i ih Y , e ′ j i (cid:3)(cid:1) (7.11)which yields (7.6).Moreover, { e i ⊗ e ′ j } i,j is an ON-basis in H ⊗ H ′ , and thus (cid:13)(cid:13) E ( X ⊗ Y ) (cid:13)(cid:13) = X i,j h E ( X ⊗ Y ) , e i ⊗ e ′ j i = X i,j (cid:0) E h X ⊗ Y , e i ⊗ e ′ j i (cid:1) = X i,j (cid:0) E (cid:2) h X , e i ih Y , e ′ j i (cid:3)(cid:1) (7.12)which together with (7.11) yields (7.7). (cid:3) Corollary 7.2.
Let X = H and Y = H ′ be separable Hilbert spaces, andassume E k X k < ∞ and E k Y k < ∞ . Then, the following are equivalent: (i) dcov ( X , Y ) = 0 . (ii) Cov (cid:0) h X , x i , h Y , y i (cid:1) = 0 for every x ∈ H , y ∈ H ′ . (iii) E ( X ⊗ Y ) − E X ⊗ E Y = 0 .Proof. For (i) = ⇒ (ii), and x , y = 0, choose ON-bases such that e = x / k x k and e ′ = y / k y k . The rest is immediate from Theorem 7.1. (cid:3) Sz´ekely, Rizzo and Bakirov [26] observed that for α = 2 and real-valuedvariables, dcov ( X , Y ) = 0 does not characterize independence but insteadthat X and Y are uncorrelated; Corollary 7.2 extends this to Hilbert spaces,in the sense (ii) or (iii) above. Remark 7.3. E ( X ⊗ Y ) − E X ⊗ E Y ∈ H ⊗ H ′ can be regarded as thecovariance of the vector-valued variables X and Y ; cf. the general theoryof higher moments of Banach space valued variables in [14], where the mo-ment lives in a suitable tensor product. (The general theory in [14] focusseson a single variable and on the projective and injective tensor products,but see [14, Remarks 3.24 and 3.25]. Since we assume separable spacesand E k X k , E k Y k < ∞ , there are no problems with integrability; cf. [14,Theorem 5.14].) (cid:3) Remark 7.4.
Let $\mathcal{X}$ and $\mathcal{Y}$ both be metric spaces such that $d^\alpha$ is a semimetric of negative type. Then, see Remark 1.7, there are embeddings $\varphi: \mathcal{X} \to H$ and $\varphi': \mathcal{Y} \to H'$ into Hilbert spaces such that
$$d_{\mathcal{X}}(x_1, x_2)^\alpha = \|\varphi(x_1) - \varphi(x_2)\|^2, \qquad d_{\mathcal{Y}}(y_1, y_2)^\alpha = \|\varphi'(y_1) - \varphi'(y_2)\|^2. \qquad (7.13)$$
It follows immediately that, for any of Definitions 1.1–1.3,
$$\operatorname{dcov}_\alpha(X, Y) = \operatorname{dcov}_2\bigl(\varphi(X), \varphi'(Y)\bigr). \qquad (7.14)$$
Hence, $\operatorname{dcov}_\alpha(X, Y)$ can be interpreted as in Theorem 7.1 for the embedded variables, as shown (for $\alpha = 1$) in [18, Proposition 3.7]. □

Remark 7.5.
The Hilbert space tensor product
H ⊗ H ′ can be identifiedwith the space of Hilbert–Schmidt operators H → H ′ (see (B.18) and theproof of Lemma B.7); then E ( X ⊗ Y ) − E X ⊗ E Y = E [( X − E X ) ⊗ ( Y − E Y )]corresponds to the operator x E [ h x , X − E X i ( Y − E Y )], known as the covariance operator (or cross-covariance operator [1]). Thus Theorem 7.1 says that dcov ( X , Y ) is 4 times the squared Hilbert–Schmidt norm of thecovariance operator.More generally, if X and Y both are metric spaces such that d α is asemimetric of negative type, then (7.14) shows that dcov α ( X , Y ) equalsdcov ( ϕ ( X ) , ϕ ′ ( Y )) for some embeddings ϕ : X → H and ϕ ′ : Y → H ′ intoHilbert spaces. Hence, dcov α ( X , Y ) equals 4 times the squared Hilbert–Schmidt norm of the covariance operator corresponding to the embeddedvariables, as shown in [23, Theorem 24]; this Hilbert–Schmidt norm (or itssquare) is called the Hilbert–Schmidt independence criterion (HSIC) [10],[23, § (cid:3) Remark 7.6. If α is an even integer larger than 2, we can similarly expressdcov α in moments of X and Y , but the resulting formulas are complicatedand do not seem to be of any interest. For example, for α = 4, for X = Y = R , and taking for simplicity X = Y with E X = 0,dcov ( X , X ) = 32 E [ X ] E [ X ] − E [ X ] E [ X ] + 68( E [ X ]) − E [ X ]) E [ X ] + 64 E [ X ]( E [ X ]) + 36( E [ X ]) . (7.15)We do not know any application or interesting properties of dcov α with α > (cid:3) Optimality of moment conditions
We have so far assumed the moment conditions stated in Definitions 1.1–1.3; these seem natural and convenient for applications. Nevertheless, it is of interest to study whether they really are required for the definitions, and what happens when we try to extend one of the definitions to cases when the moment condition fails. Definitions 1.4 and 6.1 are stated without moment conditions, but we similarly can ask when the results are finite and whether they agree with the other definitions.

In this section, we will give examples showing that the moment conditions in Definitions 1.1–1.3 are optimal in general, in the sense that if we reduce the exponent in the moment condition, then there exist counterexamples where the definition either yields an infinite value or is meaningless. On the other hand, there are also cases where the moment conditions do not hold but the definitions yield a finite value. We explore these possibilities in the next section, but our results are incomplete, and we leave a number of (explicit or implicit) open problems.

In general, if we try to define $\operatorname{dcov}_\alpha(X, Y)$ by (1.2) or (1.3) for some $X$ and $Y$, there are three possibilities:

(dc1) The expression yields a finite value; this may then be taken to be $\operatorname{dcov}_\alpha(X, Y)$. This happens when all expectations in (1.2) or (1.3), respectively, are finite. (For (1.2), it also includes the trivial case when $X$ or $Y$ is degenerate, so $d(X_i, X_j) = 0$ a.s. or $d(Y_i, Y_j) = 0$ a.s.; then all terms in (1.2) are 0, if necessary interpreting $0 \cdot \infty = 0$.)

(dc2) The expression makes sense as either $+\infty$ or $-\infty$. We may then take it as defining $\operatorname{dcov}_\alpha(X, Y)$, now with an infinite value in $\{-\infty, \infty\}$. (We do not know whether $-\infty$ can happen, see Problem 9.20.) Thus, at least one expectation is infinite.
Furthermore, for (1.2), where all expectations are of non-negative variables and thus defined in [0, ∞], this means that either the first two expectations are finite, or the third expectation is; for (1.3) this means that one of E[(X̂_α Ŷ_α)₊] and E[(X̂_α Ŷ_α)₋] is finite and the other infinite, so the expectation E[X̂_α Ŷ_α] is defined as +∞ or −∞.
(dc3) The expression (1.2) or (1.3) is of the type ∞ − ∞. Then it is meaningless, and dcov_α(X,Y) is undefined (by this definition).
For Definition 1.3, we have the same possibilities as for Definition 1.2, but also the complication that X̃_α and Ỹ_α have to be defined, see (1.6)–(1.7). We thus have another bad case:
(dc4) X̃_α or Ỹ_α is not defined. Then dcov∼_α(X,Y) is undefined.
For Euclidean spaces, we also have Definition 1.4, and for Hilbert spaces we have Definition 6.1. Since (1.9) and (6.2)–(6.4) are integrals of non-negative functions, Definitions 1.4 and 6.1 are always meaningful, but may yield +∞. In other words, we have only the cases (dc1) and (dc2). Again we may ask when the definition yields a finite value, and when it agrees with other definitions; in particular whether the moment conditions in Theorem 6.4 are best possible.
The moment conditions assumed in Definitions 1.1–1.3 guarantee, as seen in Theorem 3.5, that the good case (dc1) occurs. In the following subsections we investigate more generally when the cases (dc1)–(dc4) occur, and whether the different definitions still agree when more than one of them applies.
8.1. Optimality in Definition 1.1.
We begin with Definition 1.1, wherewe have a simple necessary and sufficient condition.
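For orientation, the three expectations entering (1.2) can be evaluated exactly for a finite joint distribution. The sketch below assumes the standard three-term form of dcov*_α (as in Lyons [18]); the support points and probabilities are arbitrary illustrative choices. For independent X and Y all three terms cancel and the value is 0 (cf. Example 8.3 below).

```python
import itertools

# Hedged sketch (not the paper's code): the three expectations in (1.2),
# i.e. the three-term form of dcov*_alpha, computed exactly for a finite
# joint distribution on R x R with alpha = 1.
xs, px = [0.0, 1.0, 4.0], [0.2, 0.5, 0.3]
ys, py = [1.0, 2.0], [0.6, 0.4]
# independent coupling: P(X = x, Y = y) = P(X = x) P(Y = y)
joint = [((x, y), p * q) for x, p in zip(xs, px) for y, q in zip(ys, py)]
alpha = 1.0

def d(a, b):  # distance on R, raised to the power alpha
    return abs(a - b) ** alpha

def E(f):
    # exact expectation over three independent copies (X1,Y1), (X2,Y2), (X3,Y3)
    return sum(p1 * p2 * p3 * f(v1, v2, v3)
               for (v1, p1), (v2, p2), (v3, p3)
               in itertools.product(joint, repeat=3))

t1 = E(lambda a, b, c: d(a[0], b[0]) * d(a[1], b[1]))  # E[d(X1,X2)^a d(Y1,Y2)^a]
t2 = E(lambda a, b, c: d(a[0], b[0])) * E(lambda a, b, c: d(a[1], b[1]))
t3 = E(lambda a, b, c: d(a[0], b[0]) * d(a[1], c[1]))  # E[d(X1,X2)^a d(Y1,Y3)^a]
dcov_star = t1 + t2 - 2 * t3
print("dcov*_1(X, Y) for this independent pair:", dcov_star)
```

Since X and Y are independent here, all three terms coincide and the sum vanishes (up to floating-point error); a dependent coupling would in general give a nonzero value.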
Theorem 8.1. (i) If E‖X‖^α + E‖Y‖^α + E[‖X‖^α ‖Y‖^α] < ∞, then all expectations in (1.2) are finite, so (1.2) defines dcov*_α(X,Y) as a finite number.
Moreover, in this case also the definitions (1.3) and (1.5) yield the same result, i.e., dcov*_α(X,Y) = dcov^b_α(X,Y) = dcov∼_α(X,Y).
(ii) Conversely, if E‖X‖^α + E‖Y‖^α + E[‖X‖^α ‖Y‖^α] = ∞, and X and Y are non-degenerate, then (1.2) is of the type ∞ − ∞ and thus meaningless.
In particular, Case (dc2), i.e., a well-defined infinite value of dcov*_α, never occurs for Definition 1.1.
Proof. (i): This follows by minor modifications of the argument used under slightly stronger assumptions in Section 1 and Lemma 3.1. Note that the assumption implies that E[‖X_i‖^α ‖Y_j‖^α] < ∞ for all i and j, and thus it follows from the triangle inequality (3.4) that all expectations in (1.2) are finite. Moreover, the assumption implies, using (3.4) again, that X̂_α, Ỹ_α ∈ L¹, and thus X̃_α and Ỹ_α are defined by (1.6)–(1.7), and also that X̂_α Ŷ_α ∈ L¹ and X̃_α Ỹ_α ∈ L¹. We omit the details.
(ii): If E‖X‖^α = ∞, then E d(x,X)^α = ∞ for any x, and thus, by first conditioning on (X₁,Y₁) and (X₃,Y₃) and integrating over X₂ only, both E[d(X₁,X₂)^α] = ∞ and E[d(X₁,X₂)^α d(Y₁,Y₃)^α] = ∞; hence, since E[d(Y₁,Y₂)^α] >
0, we see that (1.2) is of the type ∞ − ∞ .By symmetry, the same holds if E k Y k α = ∞ . Finally, suppose that E [ k X k α k Y k α ] = ∞ . By the cases just treated,we may assume that also E k X k α < ∞ and E k Y k α < ∞ . Then, using thetriangle inequality and integrating only over the event {k X k , k Y k , k Y k M } , for an M so large that this event has positive probability, we see thatboth the first and last expectations in (1.2) are ∞ , and thus (1.2) is ∞ −∞ . (cid:3) Remark 8.2.
If Theorem 8.1(i) applies and dcov^E_α(X,Y) or dcov^H_α(X,Y) is defined, i.e., if α < 2, then this too equals dcov*_α(X,Y). This follows by Theorem 8.1 together with Theorems 6.2 and 6.4. ∎
Example 8.3. If X and Y are independent with E‖X‖^α < ∞ and E‖Y‖^α < ∞, then Theorem 8.1(i) applies and (1.2) makes perfect sense; Definitions 1.1–1.3 all can be used, and all yield 0. ∎ Example 8.4.
Let X be arbitrary with E k X k α = ∞ , and let Y = X .Then, Theorem 8.1(ii) shows that dcov ∗ α ( X , X ) is of the type ∞ − ∞ anddoes not make sense. Consequently, in general, the moment condition inDefinition 1.1 is necessary. (In particular, for every ( X , Y ) with Y = X .) (cid:3) Optimality in Definition 1.2.
We have already seen in Example 8.4 that the moment condition in Definition 1.1 is necessary, in a strong sense. We next show that the moment conditions in Definitions 1.2 and 1.3 also are optimal, in the sense that if we reduce the exponent, there are counterexamples. However, there are also examples where these definitions yield finite values although the moment condition fails.
Consider first Definition 1.2. X̂_α and Ŷ_α are always defined by (1.4), so the question is whether E[X̂_α Ŷ_α] exists or not, and whether its value is finite or not. Note, in particular, that dcov^b_α(X,X) := E[X̂_α²] always is defined, although it may be +∞; we have
dcov^b_α(X,X) < ∞ ⇐⇒ X̂_α ∈ L². (8.1)
Note also that, by rotational symmetry in the indices in (1.4), X̂_α has a symmetric distribution. Thus E X̂_α = 0 whenever the expectation exists. Example 8.5.
Let X = Y = ℝ, and suppose that X ≥ 0 with P(X = 0) > 0. On the event X₃ = X₄ = 0, we have
−X̂_α = X₁^α + X₂^α − |X₁ − X₂|^α ≥ X₁^α ∧ X₂^α. (8.2)
Hence, if E|X̂_α|² < ∞, then
∞ > E[(X₁^α ∧ X₂^α)²] = E[X₁^{2α} ∧ X₂^{2α}] = ∫₀^∞ P[X₁^{2α} ∧ X₂^{2α} > t] dt = ∫₀^∞ P[X^{2α} > t]² dt = 2α ∫₀^∞ P[X > x]² x^{2α−1} dx. (8.3)
If we choose X such that, for x ≥ 1,
P(X > x) = x^{−α}/2, (8.4)
then E|X|^γ < ∞ for every γ < α, but the integral in (8.3) diverges and thus E|X̂_α|² = ∞; hence (1.3) yields dcov^b_α(X,X) = ∞ by (8.1). (Case (dc2).) Consequently, when α
2, the exponent α ∗ = α is optimal in Definition 1.2(in order to yield a finite value). (cid:3) Example 8.6.
Let α > 2, X = Y = ℝ, and suppose, for simplicity, that X ≥ 0 with P(X = 0) = P(X = 1) = 1/4. On the event X₂ = X₃ = 0, X₄ = 1, we have for some c > 0, assuming that X₁ ≥ 2, say,
X̂_α = X₁^α − |X₁ − 1|^α + 1 ≥ c X₁^{α−1}. (8.5)
Hence, for these values of X₂, X₃, X₄, we have X̂_α² ≥ c² X₁^{2α−2} − C. Consequently, E|X̂_α|² < ∞ ⟹ E X^{2α−2} < ∞.
We can choose X as above such that E X^γ < ∞ for every γ < 2α − 2, but E X^{2α−2} = ∞ and consequently E|X̂_α|² = ∞; thus, (1.3) yields dcov^b_α(X,X) = ∞. (Case (dc2).) Hence, when α > 2, the exponent α* = 2α − 2 is optimal in Definition 1.2. ∎ Example 8.7.
We have here given examples with E|X̂_α|² = ∞, so that (1.3) gives dcov^b_α(X,X) = +∞.
Similarly, (8.2) and a calculation as in (8.3) show that if, say, P(X > x) = x^{−α/2} for x > 2, then E|X̂_α| = ∞. Since X̂_α has a symmetric distribution by (1.4), it follows that if Y is any non-degenerate random variable such that X and Y are independent, then the expectation in (1.3) is of the type ∞ − ∞ and thus undefined (Case (dc3) above); hence Definition 1.2 cannot be applied at all (even allowing ±∞ as a result). ∎ Example 8.8.
Let X = Y = R and consider the special (and rather excep-tional) case α = 2, cf. Section 7. Then b X is given by (7.3), and it followseasily that b X ∈ L ⇐⇒ E | X | < ∞ , (8.6)and that for X = Y , (7.4) holds in the formdcov b ( X , X ) = E [ b X ] = 4 (cid:0) Var X (cid:1) (8.7)for any X , where the expressions all are infinite when E | X | = ∞ . Thisshows again that the condition of finite α ∗ moment in Definition 1.2 cannotbe improved when α = 2, if we want dcov ( X , X ) to be finite. Furthermore,if we take Y = ζ X where X and ζ are independent with E X = 0, E [ X ] = ∞ , ζ ∈ {± } and E ζ = 0, then E [ b X b Y ] is of the type ∞ − ∞ ; hence, evenallowing infinite values, dcov b ( X , Y ) cannot be defined by Definition 1.2without assuming second moments. (cid:3) Optimality in Definition 1.3.
We now turn to Definition 1.3. Asnoted above, e X α is only defined for some X . If we use the conditionalexpectation definition in (1.6), then we have to require b X α ∈ L , i.e., E | b X α | < ∞ . On the other hand, the explicit formula (1.7) makes senseonly if E d ( X , X ) α < ∞ , or equivalently E k X k α < ∞ , since otherwise alsothe conditional expectations in (1.7) are + ∞ a.s., and thus (1.7) is ∞ − ∞ .Moreover, if E k X k α < ∞ , then b X α ∈ L by (1.4), and (1.6) agrees with (1.7).Hence we may take (1.6) as the primary definition of e X α , and say that e X α is defined when b X α ∈ L . This holds in particular when E k X k α < ∞ , andthen (1.7) holds too, but note that Lemma 3.2 shows that E k X k α ∗ / < ∞ suffices for b X α ∈ L .Hence, e X α is defined if and only if b X α ∈ L , and then e X α ∈ L ; further-more E e X α = E E (cid:0) b X α | X , X (cid:1) = E b X α = 0 . (8.8)Moreover, in this case, also E (cid:0) e X α | X (cid:1) = E (cid:0) E (cid:0) b X α | X , X (cid:1) | X (cid:1) = E (cid:0) b X α | X (cid:1) = 0 , (8.9)since b X α has a symmetric distribution also when conditioned on X , bysymmetry in (1.4). Example 8.9.
Recall that Example 8.7 gives an example where X̂_α ∉ L¹; hence, X̃_α is not defined and thus dcov∼_α(X,X) is undefined. ∎
We note a general result relating X̃_α and X̂_α. By (1.6), X̃_α is (a.s.) a function of X₁ and X₂; let us (temporarily) write X̃_α as X̃_α(X₁,X₂), so that we can substitute other X_i as arguments. The following lemma shows that X̂_α can be recovered from X̃_α. Lemma 8.10.
Suppose that b X α ∈ L . Then, a.s., b X α = e X α ( X , X ) − e X α ( X , X ) + e X α ( X , X ) − e X α ( X , X ) . (8.10) Consequently, for any p > , e X α exists and e X α ∈ L p ⇐⇒ b X α ∈ L p . (8.11) Proof. If E k X k α < ∞ , this is obvious from (1.7) and cancellations. Ingeneral, we use truncations. Let, for M > I M := {| X | M } , I Mi := {| X i | M } , and let p M := E I M = P (cid:0) | X i | M (cid:1) . Then, E (cid:0) I M I M b X α | X , X (cid:1) = p M d ( X , X ) α − p M E X (cid:0) I M d ( X , X ) α (cid:1) + p M E (cid:0) I M I M d ( X , X ) α (cid:1) − p M E X (cid:0) I M d ( X , X ) α (cid:1) (8.12)and consequently, by rotational symmetry and cancellations, interpreting allindices modulo 4, X i =1 ( − i − E (cid:0) I Mi +2 I Mi +3 b X α | X i , X i +1 (cid:1) = p M X i =1 ( − i − d ( X i , X i +1 ) α = p M b X α . (8.13)Since we assume b X α ∈ L , we have I Mi +2 I Mi +3 b X α L −→ b X α as M → ∞ , andthus E (cid:0) I Mi +2 I Mi +3 b X α | X i , X i +1 (cid:1) L −→ E (cid:0) b X α | X i , X i +1 (cid:1) = e X α ( X i , X i +1 ) . (8.14)Hence, as M → ∞ , the left-hand side of (8.13) converges in L to the right-hand side of (8.10), while the right-hand side of (8.13) obviously convergesto b X α . Hence, (8.10) follows.Finally, (8.11) is an immediate consequence of (8.10) and (1.6). (cid:3) N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 31
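The purely algebraic cancellation behind (8.10) can be checked exactly on a finite distribution, where the conditional expectation (1.6) reduces to a finite sum. A minimal sketch, assuming the alternating-sum convention for X̂_α from (1.4)/(9.8) and an arbitrary three-point distribution on ℝ with α = 1:

```python
import itertools

# Hedged sketch: verify the identity (8.10),
#   xhat = xtilde(X1,X2) - xtilde(X2,X3) + xtilde(X3,X4) - xtilde(X4,X1),
# exactly for a finite distribution on R with alpha = 1. The support,
# probabilities and the sign convention for xhat are illustrative assumptions.
support, prob = [0.0, 1.0, 3.0], [0.5, 0.3, 0.2]
alpha = 1.0

def d(x, y):
    return abs(x - y) ** alpha

def xhat(x1, x2, x3, x4):
    # alternating sum of distances around the 4-cycle, cf. (1.4)/(9.8)
    return d(x1, x2) - d(x2, x3) + d(x3, x4) - d(x4, x1)

# exact expectations over the finite support
a = {x: sum(p * d(x, y) for y, p in zip(support, prob)) for x in support}
m = sum(p * q * d(x, y) for x, p in zip(support, prob)
                        for y, q in zip(support, prob))

def xtilde(x1, x2):
    # doubly centered distance, cf. (1.7): E[xhat | X1 = x1, X2 = x2]
    return d(x1, x2) - a[x1] - a[x2] + m

for q in itertools.product(support, repeat=4):
    lhs = xhat(*q)
    rhs = (xtilde(q[0], q[1]) - xtilde(q[1], q[2])
           + xtilde(q[2], q[3]) - xtilde(q[3], q[0]))
    assert abs(lhs - rhs) < 1e-12
print("identity (8.10) holds on all", len(support) ** 4, "quadruples")
```

The check works because the single-point centering terms a(·) and the constant m cancel in the alternating sum, leaving exactly the alternating sum of distances; this is the cancellation the lemma's truncation argument extends beyond the case E‖X‖^α < ∞.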
In particular, this leads to the following for the case Y = X . Note thatdcov ∼ α ( X , X ) := E e X α is defined whenever e X α is, although it may be + ∞ ;cf. dcov b α ( X , X ) discussed above. Theorem 8.11.
Let X be a random variable in a metric space. Then thefollowing are equivalent: (i) dcov b α ( X , X ) < ∞ . (ii) dcov ∼ α ( X , X ) < ∞ (which includes that e X α is defined). (iii) b X α ∈ L . (iv) e X α is defined and e X α ∈ L .Furthermore, if these hold, then dcov b α ( X , X ) = dcov ∼ α ( X , X ) .Proof. (i) ⇐⇒ (iii): This follows directly from the definition (1.3), as notedin (8.1).(ii) ⇐⇒ (iv): Follows similarly from the definition (1.5).(iii) ⇐⇒ (iv): By Lemma 8.10.Finally, suppose that (i)–(iv) hold. Use (8.10) and expand ( b X α ) as a sumof products. Since e X α ∈ L , each product is in L , so we may take theirexpectations separately. Furthermore, (8.9) implies that all off-diagonalterms such as E [ e X α ( X , X ) e X α ( X , X )] = 0, and we obtain E (cid:2) b X α (cid:3) = X i =1 E (cid:2) e X α ( X i , X i +1 ) (cid:3) = 4 E (cid:2) e X α (cid:3) . (8.15)Hence, dcov b α ( X , X ) = E (cid:2) b X α (cid:3) = E (cid:2) e X α (cid:3) = dcov ∼ α ( X , X ). (cid:3) Corollary 8.12. (i) If b X α ∈ L , so e X α is defined, then dcov b α ( X , X ) =dcov ∼ α ( X , X ) (finite or infinite). (ii) If b X α / ∈ L , then dcov b α ( X , X ) = ∞ and dcov ∼ α ( X , X ) is undefined.Proof. Follows from Theorem 8.11, considering the three cases b X α ∈ L , b X α ∈ L \ L and b X α / ∈ L separately. (cid:3) If we only care about finite values and regard ∞ as ’undefined’, we thussee that dcov ∼ α ( X , X ) = dcov b α ( X , X ) for all X . Example 8.13.
Let α ≤ 2 and let X ∈ ℝ be as in Example 8.5; thus X ≥ 0 and E|X|^γ < ∞ for every γ < α, and in particular E|X|^{α/2} < ∞; hence Lemma 3.2(ii) implies that X̂_α ∈ L¹. Thus, X̃_α exists, but X̂_α ∉ L² by Example 8.5; hence Theorem 8.11 shows that dcov∼_α(X,X) = ∞. Consequently, the exponent α* = α is optimal in Definition 1.3 when α ≤ 2. ∎ Example 8.14.
Similarly, let α > 2 and let X ∈ ℝ be as in Example 8.6. Then, E|X|^γ < ∞ for every γ < 2α − 2, and in particular E|X|^{α−1} < ∞; hence Lemma 3.2(iii) implies that X̂_α ∈ L¹. Thus, X̃_α exists, but X̂_α ∉ L² by Example 8.6; hence Theorem 8.11 shows that dcov∼_α(X,X) = ∞. Consequently, the exponent α* = 2α − 2 is optimal in Definition 1.3 also when α > 2. ∎
Hence, the exponent α* is optimal in Definition 1.3 too. Example 8.15.
Let α = 2 and X = Y = R as in Example 8.8, and assumethat E X = 0. Then, by (7.3), e X exists and e X = − X X . Hence, we finddirectly the same conclusions for dcov ∼ α as found for dcov b α in Example 8.8.In particular, with X and Y = ζ X as in the final part of Example 8.8, E (cid:2) e X e Y (cid:3) is of the type ∞ − ∞ and thus undefined. (Case (dc3).) (cid:3) Optimality for dcov E α and dcov H α . Definitions 1.4 and 6.1 do notrequire any moment conditions; if X and Y are Euclidean spaces or Hilbertspaces, respectively, then dcov E α ( X , Y ) and dcov H α ( X , Y ) are always defined,but may be + ∞ . (Recall also that Theorem 6.2 shows that for spaces whereboth are defined, we always have dcov E α ( X , Y ) = dcov H α ( X , Y ), finite ornot.) Theorem 6.4 shows that the moment condition E k X k α , E k Y k α < ∞ is sufficient to guarantee that dcov E α ( X , Y ) = dcov H α ( X , Y ) is finite. (Recallthat this is the same moment condition as in Definitions 1.2 and 1.3.) Thefollowing example shows that the exponent α in this moment condition isoptimal, even for random variables in R . Example 8.16.
Let 0 < α <
2, and let X be a symmetric stable random variable in ℝ with the characteristic function φ_X(t) = e^{−|t|^α}. Then E|X|^α = ∞, but E|X|^γ < ∞ for every γ < α.
Take Y = X. Then, for 0 ≤ t ≤ 1 and t ≤ u ≤ 2t,
φ_{X,X}(t, −u) − φ_X(t) φ_X(−u) = e^{−|t−u|^α} − e^{−|t|^α−|u|^α} ≥ e^{−t^α} − e^{−2t^α} ≥ c t^α, (8.16)
for some c >
0. Consequently, (1.9) yields, changing the sign of u,
dcov^E_α(X,X) ≥ c ∫₀¹ ∫_t^{2t} t^{2α} (du dt)/(t^{1+α} u^{1+α}) ≥ c₁ ∫₀¹ t^{2α} t^{−α} t^{−1−α} dt = c₁ ∫₀¹ dt/t = ∞. (8.17)
Hence, using Theorem 6.2, dcov^H_α(X,X) = dcov^E_α(X,X) = ∞. The condition in Theorem 6.4 on finite α-moments thus cannot be replaced by any lower moments in order to guarantee finite values. ∎
9. Beyond the moment conditions
We continue to investigate cases when the moment condition in Defini-tions 1.2–1.3 fails; now with the aim of obtaining positive results.9.1.
A weaker condition.
We begin with dcov b α in Definition 1.2, andshow first that the counterexample in Example 8.5 is optimal, at least when α Theorem 9.1.
Let X be any separable metric space, and let 0 < α ≤ 1. If
∫₀^∞ P[‖X‖ > x]² x^{2α−1} dx < ∞, (9.1)
then E|X̂_α|² < ∞ and thus dcov^b_α(X,X) < ∞.
Proof. The calculation in (8.3) shows that (9.1) is equivalent to
E[(‖X₁‖^α ∧ ‖X₂‖^α)²] < ∞. (9.2)
In other words, ‖X₁‖^α ∧ ‖X₂‖^α ∈ L². Hence, Lemma 3.2(i) shows that X̂_α ∈ L². ∎
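The dichotomy between the tails (8.4) and (9.3) can also be seen numerically: with α = 1 the integrand of (9.1) is of order 1/x for a tail of order x^{−1} (divergent), but 1/(x log² x) for the lighter tail x^{−1}/log x (convergent). A rough sketch, using crude midpoint quadrature on a logarithmic grid, purely for illustration:

```python
import math

# Hedged numerical sketch of (9.1) for alpha = 1: approximate
#   I(b) = int_e^b P(X > x)^2 x^(2a-1) dx,  a = 1,
# for two tails, by a midpoint rule on a logarithmic grid.
def integral(tail, a, b, n=100000):
    s = 0.0
    for i in range(n):
        lo = a * (b / a) ** (i / n)
        hi = a * (b / a) ** ((i + 1) / n)
        mid = math.sqrt(lo * hi)
        s += tail(mid) ** 2 * mid * (hi - lo)  # x^(2a-1) = x for a = 1
    return s

heavy = lambda x: 1.0 / x                  # tail of the type in (8.4): divergent case
lighter = lambda x: 1.0 / (x * math.log(x))  # tail of the type in (9.3): convergent case

for b in (1e3, 1e6, 1e9):
    print(b, integral(heavy, math.e, b), integral(lighter, math.e, b))
# the first value grows like log b, while the second approaches
# 1 = int_e^infinity dx / (x log^2 x) as b grows
```

This is only a numerical illustration of the boundary behaviour, not a substitute for the proof: Theorem 9.1 is exactly the statement that finiteness of this integral suffices for dcov^b_α(X,X) < ∞.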
Remark 9.2.
Let 0 < α
1. Then, e.g. using Lemma 9.4 below, theargument in Example 8.5 is easily extended to show that if X = R , then(9.1) is also necessary for E | b X α | < ∞ . Thus, at least for α X = R ,(9.1) is both necessary and sufficient for dcov b α ( X , X ) < ∞ . (cid:3) Remark 9.3.
It is easy to see directly that the condition (9.1) follows fromthe condition E k X k α < ∞ in Lemma 3.3. (We omit the details.) Further-more, (9.1) is a strictly weaker condition, and thus, for α
1, Theorem 9.1is stronger than Lemma 3.3. For example, if we instead of (8.4) choose, for x > e , P ( X > x ) = x − α / log x, (9.3)then E X α = ∞ , but the integral in (8.3) converges and Theorem 9.1 showsthat E | b X α | < ∞ and dcov b α ( X , X ) < ∞ . (cid:3) Hence, although we have seen that the exponent in the moment conditionin Definition 1.2 is best possible, Theorem 9.1 shows that for α
1, themoment condition can be weakened to the condition (9.1) (together withthe same for Y ); we postpone the details to Theorem 9.6, where we alsoextend it to dcov E α and dcov H α .Before proceeding, we note that when α
1, we may simplify the condi-tion b X α ∈ L by the following lemma. Lemma 9.4.
Let p > . If < α , then b X α ∈ L p ⇐⇒ k X k α + k X k α − d ( X , X ) α ∈ L p . (9.4)Note that (for α
1) the right-hand side is non-negative by the triangleinequality.
Proof. = ⇒ : Since E | b X α | p < ∞ , the conditional expectation E (cid:0) | b X α | p | X , X (cid:1) ∈ L . Hence, there exist some x and x such that E (cid:0) | b X α | p | X = x , X = x (cid:1) ∈ L , which by the definition (1.4) means d ( X , X ) α − d ( X , x ) α − d ( X , x ) α ∈ L p . (9.5)The triangle inequality yields, for j = 3 , (cid:12)(cid:12) d ( X , x j ) − k X k (cid:12)(cid:12) d ( x j , x o ) = O (1) , (9.6)and thus, since α (cid:12)(cid:12) d ( X , x j ) α − k X k α (cid:12)(cid:12) = O (1) , (9.7)and the result follows. ⇐ = : Immediate (for any α ), since the definition (1.4) can be written b X α = X i =1 ( − i (cid:0) k X i k α + k X i +1 k α − d ( X i , X i +1 ) α (cid:1) . (9.8) (cid:3) Remark 9.5.
We do not know (even for X = R ) whether Lemma 9.4 holdsalso for α >
1, and leave that as an open problem. (It holds, by a minor modification of the proof above, for α > 1 if E‖X‖^{p(α−1)} < ∞, but that seems less useful.) ∎ We next introduce a class of function spaces.
Lorentz spaces.
The condition (9.1) can be expressed as follows using
Lorentz spaces , a generalization of the Lebesgue spaces L p ; see e.g. [3; 5]. Let X ∗ be the decreasing rearrangement of k X k ; this is the (weakly) decreasingfunction (0 , → [0 , ∞ ) defined by X ∗ ( t ) := inf (cid:8) x : P ( k X k > x ) t (cid:9) . (9.9)In probabilistic terms, X ∗ is characterized as the decreasing function on(0 ,
1) that, regarded as a random variable when (0 ,
1) is equipped with the Lebesgue measure, has the same distribution as ‖X‖.
For a given probability space (Ω, F, P), and p, q ∈ (0, ∞), the Lorentz space L^{p,q}(Ω, F, P) is defined as the linear space of all real-valued random variables X such that
∫₀¹ (t^{1/p} X*(t))^q dt/t < ∞. (9.10)
It is well-known that L^{p,p} = L^p, and that if q₁ < q₂ then L^{p,q₁} ⊂ L^{p,q₂}, with strict inclusion provided the probability space is large enough.
A standard Fubini argument shows that
∫₀¹ (t^{1/p} X*(t))^q dt/t = q ∫₀¹ ∫₀^∞ 1{X*(t) > x} x^{q−1} t^{q/p−1} dx dt = q ∫₀¹ ∫₀^∞ 1{P(‖X‖ > x) > t} x^{q−1} t^{q/p−1} dx dt = p ∫₀^∞ P[‖X‖ > x]^{q/p} x^{q−1} dx. (9.11)
In particular, taking p = α and q = 2α, we see that (9.1) is equivalent to ‖X‖ ∈ L^{α,2α}. Consequently, for α
1, Theorem 9.1 says that if ‖X‖ ∈ L^{α,2α}, then X̂_α ∈ L², which weakens the condition ‖X‖ ∈ L^α in Lemma 3.3 to L^{α,2α}. Hence, we can extend the use of Definition 1.2; moreover, as shown below, also Definitions 1.3, 1.4 and 6.1 yield the same result in this case. Theorem 9.6.
Let 0 < α ≤ 1, and assume that ‖X‖, ‖Y‖ ∈ L^{α,2α}. Then:
(i) Definition 1.2 yields a finite value dcov^b_α(X,Y).
(ii) Definition 1.3 yields a finite value dcov∼_α(X,Y), and dcov∼_α(X,Y) = dcov^b_α(X,Y).
(iii) If X and Y are Euclidean spaces, then Definition 1.4 yields a finite value dcov^E_α(X,Y), and dcov^E_α(X,Y) = dcov^b_α(X,Y).
(iv) If X and Y are Hilbert spaces, then Definition 6.1 yields a finite value dcov^H_α(X,Y), and dcov^H_α(X,Y) = dcov^b_α(X,Y).
Thus, the values that are defined are all equal (and finite).
The proof is postponed to the following subsections. Note also that by Remark 9.2, the Lorentz space L^{α,2α} is optimal in the strong sense that, for α ≤ 1 and X = ℝ,
dcov^b_α(X,X) < ∞ ⇐⇒ X̂_α ∈ L² ⇐⇒ ‖X‖ ∈ L^{α,2α}. (9.12)
The proofs of the results above assume α
1. We leave the case α >
N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 35
Problem 9.7.
For α >
1, what is the optimal Lorentz space condition thatguarantees E | b X α | < ∞ and thus dcov b α ( X , X ) < ∞ ?By Theorem 8.11, the answer for dcov ∼ α ( X , X ) is the same. Remark 9.8.
Example 8.8 shows that in the special case α = 2, the condi-tion k X k ∈ L in Definition 1.2 cannot be improved; it is actually necessaryfor b X α ∈ L and dcov b ( X , X ) < ∞ in the case X = R . Hence, for α = 2,the answer to Problem 9.7 is L = L , .A naive interpolation with (9.12) yields the conjecture that for 1 < α < L α, . (cid:3) Remark 9.9.
The equivalence (9.12) does not hold for all metric spaces X ,not even for α = 1. For a counterexample, let X = ℓ with the standardbasis ( e n ) ∞ , let 0 < γ /
2, and let N be an integer-valued random variablewith P ( N = n ) = p n := cn − − γ , n >
1, where c is a normalization constant.Finally, let X := N / e N . It is easily seen that, with X i defined in the sameway by N i , b X X i =1 N / i { N i = N i +1 } , (9.13)and thus, using Cauchy–Schwarz’s (or Minkowski’s) inequality, E b X C E (cid:2) N { N = N } (cid:3) = C ∞ X n =1 np n = C ∞ X n =1 n − − γ < ∞ , (9.14)while for x > P (cid:0) k X k > x (cid:1) = P ( N > x ) = X n>x cn − − γ > cx − γ > cx − , (9.15)so (9.1) fails, and thus k X k / ∈ L , . (cid:3) Problem 9.10.
Does the equivalence (9.12) hold in Euclidean spaces? Ininfinite-dimensional Hilbert spaces?We have not investigated whether the results on continuity and consis-tency in Section 4 can be extended (for α
1) by replacing the momentconditions with the corresponding Lorentz space condition. In particular:
Problem 9.11.
Let α
1. Does Theorem 4.4 hold if the moment conditionis replaced by X , Y ∈ L α, α ?9.3. More on dcov b α and dcov ∼ α . Theorem 8.11 considers only the case X = Y . We do not know whether it extends to dcov α ( X , Y ) in general,without further conditions. We give a partial result. Theorem 9.12.
Suppose that X̂_α, Ŷ_α ∈ L¹, so X̃_α and Ỹ_α exist. Suppose further that X̃_α Ỹ_α ∈ L¹ and X̃_α(X₁,X₂) Ỹ_α(Y₂,Y₃) ∈ L¹. Then X̂_α Ŷ_α ∈ L¹, and dcov^b_α(X,Y) = dcov∼_α(X,Y); furthermore, this value is finite.
In particular, this holds if X̂_α, Ŷ_α ∈ L².
Proof. This is similar to the proof of Theorem 8.11. We have X̃_α, Ỹ_α ∈ L¹ by (1.6), and thus X̃_α(X₁,X₂) Ỹ_α(Y₃,Y₄) ∈ L¹ by independence. Express X̂_α and Ŷ_α by (8.10) and expand X̂_α Ŷ_α as a sum of 16 terms. By the assumptions (and symmetry), every term is in L¹, so we may take their expectations separately. Furthermore, (8.9) implies that, e.g., E[X̃_α(X₁,X₂) Ỹ_α(Y₂,Y₃)] = 0, and we obtain
E[X̂_α Ŷ_α] = Σ_{i=1}^{4} E[X̃_α(X_i,X_{i+1}) Ỹ_α(Y_i,Y_{i+1})] = 4 E[X̃_α Ỹ_α]. (9.16)
If X̂_α, Ŷ_α ∈ L², then X̃_α, Ỹ_α ∈ L² and the assumptions above follow by the Cauchy–Schwarz inequality. ∎
Proof of Theorem 9.6 (i)(ii). By the comments before Theorem 9.6, the assumptions imply X̂_α, Ŷ_α ∈ L², and thus Theorem 9.12 shows (i) and (ii). ∎ Problem 9.13.
Let either X and Y be arbitrary, or consider only X = Y = R .(i) Is it true for arbitrary random X ∈ X and Y ∈ Y that dcov b α ( X , Y )is defined and finite ⇐⇒ dcov ∼ α ( X , Y ) is defined and finite?(ii) If this holds, is furthermore always dcov b α ( X , Y ) = dcov ∼ α ( X , Y )?9.4. More on dcov E α and dcov H α . Consider now the case of Euclidean or,more generally, Hilbert spaces and Definitions 1.4 and 6.1. We complete theproof of Theorem 9.6; recall that this assumes α Proof of Theorem 9.6 (iii)(iv) . (iv): This follows by essentially the same proofas for Theorem 6.4. As noted in Remark 6.5, (6.21) holds without any mo-ment condition. Moreover, as said in the proof of Theorem 6.4, Lemma 3.2holds for b X α ; M defined in (6.19) too, uniformly in M ; we now use Lemma 3.2(i),and denote the right-hand side by b X ∗∗ . Hence, | b X α ; M | b X ∗∗ , and similarly | b Y α ; M | b Y ∗∗ .As noted above, X ∈ L α, α is equivalent to (9.1) and to (9.2). Conse-quently, b X ∗∗ ∈ L and, similarly, b Y ∗∗ ∈ L . Hence, b X ∗∗ b Y ∗∗ ∈ L and dom-inated convergence applies to (6.21), just as in the proof of Theorem 6.4,yielding dcov H α ( X , Y ) = dcov b α ( X , Y ) < ∞ .(iii): Theorem 6.2 shows the general equality dcov E α ( X , Y ) = dcov H α ( X , Y ),and thus (iii) follows from (iv).This completes the proof of Theorem 9.6. (cid:3) Problem 9.14.
For 1 < α <
2, what is the optimal Lorentz space conditionthat guarantees dcov H α ( X , X ) < ∞ (for variables in a Hilbert space)? Doesthis also imply dcov H α ( X , X ) = dcov b α ( X , X )? Does this condition implydcov H α ( X , Y ) = dcov b α ( X , Y ) for two variables X and Y ? Problem 9.15.
Let either X and Y be arbitrary Hilbert spaces, or consider only X = Y = ℝ. Let 0 < α < 2.
(i) Is it true for arbitrary random X ∈ X and Y ∈ Y that dcov^b_α(X,Y) is defined and finite ⇐⇒ dcov^H_α(X,Y) is finite?
(ii) If this holds, is furthermore always dcov^b_α(X,Y) = dcov^H_α(X,Y)?
Hilbert spaces, α = 2 . Consider now the case when X = H and Y = H ′ are Hilbert spaces, as in the preceding subsection, but take α = 2. Thendcov H α ( X , Y ) is not defined, so we consider dcov b ( X , Y ) and dcov ∼ ( X , Y ).In Section 7, we did this assuming second moments; we now remove thatassumption and generalise the results. (This is partly for its own sake, butmainly for the application in the next subsection.)In this subsection, expectations E X of Hilbert space valued random vari-ables are always interpreted in Pettis sense, see Appendix B. (This is some-times said explicitly for emphasis.) We use some technical results statedand proved in Appendix B.Recall that b X = 2 h X − X , X − X i by (7.1), for any X . We next showthat (7.2) holds under weaker conditions than assumed in Section 7. Lemma 9.16.
Let X = H be a Hilbert space. If e X exists, then E X existsin Pettis sense, and e X = − h X − E X , X − E X i . (9.17) Proof.
By (7.1), we have b X = 2 h Z , Z ′ i where Z := X − X and Z ′ := X − X . Assume that e X exists, which by our definition means that E | b X | < ∞ .Thus E |h Z , Z ′ i| < ∞ . Lemma B.1(ii) applies and shows that Z = X − X is Pettis integrable. Hence, for every x ∈ H , h X , x i − h X , x i = h X − X , x i = h Z , x i ∈ L . (9.18)Since h X , x i and h X , x i are independent random variables, this implies E |h X , x i| < ∞ , for any x ∈ H , and thus E X exists in Pettis sense.Using (7.1), we may now integrate over first X and then X and obtain E (cid:0) b X | X , X , X (cid:1) = 2 h X − X , E X − X i , (9.19) e X = E (cid:0) b X | X , X (cid:1) = 2 h X − E X , E X − X i . (9.20)showing (9.17). (cid:3) Theorem 9.17.
Let X = H and Y = H ′ be Hilbert spaces. (i) If dcov b ( X , Y ) is defined, i.e., E (cid:2) b X b Y (cid:3) is defined as an extended realnumber, then dcov b ( X , Y ) ∈ [0 , ∞ ] . (ii) If dcov b ( X , Y ) < ∞ , then dcov b ( X , Y ) = (cid:13)(cid:13) E (cid:2) ( X − X ) ⊗ ( Y − Y ) (cid:3)(cid:13)(cid:13) H⊗H ′ , (9.21) where the expectation exists in Pettis sense. (iii) If dcov ∼ ( X , Y ) is defined, i.e., e X and e Y are defined and E (cid:2) e X e Y (cid:3) isdefined as an extended real number, then dcov ∼ ( X , Y ) ∈ [0 , ∞ ] . (iv) If furthermore dcov ∼ ( X , Y ) < ∞ , then dcov ∼ ( X , Y ) = 4 (cid:13)(cid:13) E [ X ⊗ Y ] − E X ⊗ E Y (cid:13)(cid:13) H⊗H ′ , (9.22) where the expectations exist in Pettis sense. (v) If dcov b ( X , Y ) and dcov ∼ ( X , Y ) both are defined, as in (i) and (iii) ,and furthermore dcov b ( X , Y ) and dcov ∼ ( X , Y ) both are finite, then dcov b ( X , Y ) = dcov ∼ ( X , Y ) . (9.23) Proof. (i),(ii): By (7.1) and (7.5), b X b Y = 4 h X − X , X − X ih Y − Y , Y − Y i = 4 (cid:10) ( X − X ) ⊗ ( Y − Y ) , ( X − X ) ⊗ ( Y − Y ) (cid:11) H⊗H ′ . (9.24)This is an example of h Z , Z ′ i as in Lemma B.1, with Z := ( X − X ) ⊗ ( Y − Y ) d = ( X − X ) ⊗ ( Y − Y ). Thus, (i) follows from Lemma B.1(iii), and(ii) from Lemma B.1(ii).(iii),(iv): Similarly, if e X and e Y exist, then Lemma 9.16 shows that E X and E Y exist in Pettis sense, and furthermore, using (7.5), e X e Y = 4 h X − E X , X − E X ih Y − E Y , Y − E Y i = 4 h ( X − E X ) ⊗ ( Y − E Y ) , ( X − E X ) ⊗ ( Y − E Y ) i . (9.25)This is another example of h Z , Z ′ i as in Lemma B.1, now with Z := ( X − E X ) ⊗ ( Y − E Y ). Thus, (iii) follows from Lemma B.1(iii).Finally, assume dcov ∼ ( X , Y ) < ∞ , i.e., e X e Y ∈ L . Then (9.25) andLemma B.1(ii) show that E Z exists in Pettis sense, and thatdcov ∼ ( X , Y ) = E [ e X e Y ] = 4 k E Z k = 4 k E (cid:2) ( X − E X ) ⊗ ( Y − E Y ) (cid:3) k . (9.26)We have X ⊗ Y = Z + ( E X ) ⊗ ( Y − E Y ) + X ⊗ E Y . 
(9.27)Furthermore, since E X and E Y are constant vectors, it is easy to see that E [ X ⊗ E Y ] = E X ⊗ E Y and E [( E X ) ⊗ ( Y − E Y )] = E X ⊗ E [ Y − E Y ] = 0.(This also follows from the more general Lemma B.7.) Hence, (9.27) showsthat E [ X ⊗ Y ] exists, and E [ X ⊗ Y ] = E [ X ⊗ Y ] = E Z + E X ⊗ E Y . (9.28)Thus (9.22) follows from (9.26).(v): In this case, (i)–(iv) all hold. By (iv), the expectations E X , E Y and E [ X ⊗ Y ] exist. Hence, E [ X i ⊗ Y i ] = E [ X ⊗ Y ] exists for every i .Furthermore, if i = j so X i and Y j are independent, E [ X i ⊗ Y j ] exists byLemma B.7 and equals E [ X i ] ⊗ E [ Y j ]. Hence, E [ X i ⊗ Y j ] exists for every i and j , and thus E (cid:2) ( X − X ) ⊗ ( Y − Y ) (cid:3) = E [ X ⊗ Y ] + E [ X ⊗ Y ] − E [ X ⊗ Y ] − E [ X ⊗ Y ]= 2 (cid:0) E [ X ⊗ Y ] − E [ X ] ⊗ E [ Y ] (cid:1) . (9.29)Consequently, (9.23) follows from (9.21) and (9.22). (cid:3) Metric spaces of negative type.
In this subsection we assume that X and Y are metric spaces such that d α X and d α Y both are of negative type,see Remark 1.7. We then can embed the spaces into Hilbert spaces as inRemark 7.4 and transfer the results in Section 9.5. Theorem 9.18.
Let α > and let X and Y be metric spaces such that d α X and d α Y both are of negative type. N DISTANCE COVARIANCE IN METRIC AND HILBERT SPACES 39 (i) If dcov b α ( X , Y ) is defined, i.e., E (cid:2) b X α b Y α (cid:3) is defined as an extended realnumber, then dcov b α ( X , Y ) ∈ [0 , ∞ ] . (ii) If dcov ∼ α ( X , Y ) is defined, i.e., e X α and e Y α are defined and E (cid:2) e X α e Y α (cid:3) is defined as an extended real number, then dcov ∼ α ( X , Y ) ∈ [0 , ∞ ] . (iii) If dcov b α ( X , Y ) and dcov ∼ α ( X , Y ) both are defined, as in (i) and (ii) ,and furthermore both are finite, then dcov b α ( X , Y ) = dcov ∼ α ( X , Y ) . (9.30) Proof.
Immediate by Remark 7.4 and Theorem 9.17(i)(iii)(v). (cid:3)
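In the Hilbert-space setting underlying the proof above, the key identity (9.29) from the proof of Theorem 9.17(v), E[(X₁−X₂) ⊗ (Y₁−Y₂)] = 2(E[X ⊗ Y] − EX ⊗ EY), can be checked exactly in finite dimensions, where the Hilbert–Schmidt norm is the Frobenius norm. A hedged sketch with an arbitrary finite joint distribution in ℝ² × ℝ²:

```python
import itertools

# Hedged finite-dimensional sketch of (9.29): for independent copies
# (X1,Y1), (X2,Y2) of (X,Y),
#   E[(X1 - X2) (x) (Y1 - Y2)] = 2 (E[X (x) Y] - EX (x) EY),
# where (x) denotes the outer (tensor) product, so the squared Frobenius
# (Hilbert-Schmidt) norm of the left side is 4 times that of the cross-
# covariance, as used in (9.21)-(9.22). Arbitrary example distribution.
atoms = [((0.0, 1.0), (1.0, 2.0), 0.2),
         ((1.0, 0.0), (0.0, -1.0), 0.5),
         ((-2.0, 1.0), (1.0, 1.0), 0.3)]  # (x, y, probability), x, y in R^2

def outer(x, y):
    return [[xi * yj for yj in y] for xi in x]

def add(a, b, s=1.0):
    return [[a[i][j] + s * b[i][j] for j in range(2)] for i in range(2)]

def frob2(a):  # squared Frobenius / Hilbert-Schmidt norm
    return sum(v * v for row in a for v in row)

Exy = [[0.0] * 2 for _ in range(2)]
Ex, Ey = [0.0, 0.0], [0.0, 0.0]
for x, y, p in atoms:
    Exy = add(Exy, outer(x, y), p)
    Ex = [Ex[i] + p * x[i] for i in range(2)]
    Ey = [Ey[i] + p * y[i] for i in range(2)]
cov = add(Exy, outer(Ex, Ey), -1.0)  # E[X (x) Y] - EX (x) EY

M = [[0.0] * 2 for _ in range(2)]    # E[(X1 - X2) (x) (Y1 - Y2)], exactly
for (x1, y1, p1), (x2, y2, p2) in itertools.product(atoms, repeat=2):
    diff = outer([x1[i] - x2[i] for i in range(2)],
                 [y1[i] - y2[i] for i in range(2)])
    M = add(M, diff, p1 * p2)

assert all(abs(M[i][j] - 2 * cov[i][j]) < 1e-12 for i in range(2) for j in range(2))
assert abs(frob2(M) - 4 * frob2(cov)) < 1e-12
print("(9.29) verified: M = 2*cov, so ||M||^2 = 4*||cov||^2")
```

The identity is elementary (expand the expectation of the difference product and use independence of the two copies); the point of Section 9.5 is that it persists, via Pettis integrals, without any second-moment assumptions.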
This gives a partial (but not complete) answer to Problem 9.13 for spaceswith d α of negative type; recall from Remark 1.7 that when 0 < α
2, this includes Hilbert spaces, in particular ℝ.
Remark 9.19. If d is a metric of negative type, then so is d^α for every α ≤ 1. Hence, if X and Y are metric spaces of negative type, then Theorem 9.18 applies at least with 0 < α ≤ 1. ∎
Negative values? If X and Y are metric spaces such that d^α is of negative type, then Theorem 9.18 shows that dcov^b_α(X,Y) and dcov∼_α(X,Y) can be neither negative and finite, nor −∞. Theorem 8.1 then shows the same for dcov*_α(X,Y). The same is also, trivially, true for dcov^E_α(X,Y) and dcov^H_α(X,Y) when they are applicable. More precisely, we have the possibilities shown in Table 1, by Theorems 3.5, 6.4, 8.1 and 9.18; Examples 8.5, 8.7, 8.9, 8.13, 8.16; (1.9) and (6.2).

                 [0,∞)   +∞   (−∞,0)   −∞   undefined
dcov*_α(X,Y)       +      −      −      −       +
dcov^b_α(X,Y)      +      +      −      −       +
dcov∼_α(X,Y)       +      +      −      −       +
dcov^E_α(X,Y)      +      +      −      −       −
dcov^H_α(X,Y)      +      +      −      −       −

Table 1.
Possibilities when $d^\alpha$ is of negative type

Conversely, if $\mathcal{X}$ or $\mathcal{Y}$ is a metric space such that $d^\alpha$ is not of negative type, then $\operatorname{dcov}_\alpha(X,Y) < 0$ is possible, for each of $\operatorname{dcov}^*_\alpha(X,Y)$, $\widehat{\operatorname{dcov}}_\alpha(X,Y)$, $\widetilde{\operatorname{dcov}}_\alpha(X,Y)$. Theorem 8.1 still rules out $\pm\infty$ for $\operatorname{dcov}^*_\alpha(X,Y)$, and we find the possibilities shown in Table 2.

                                                   $[0,\infty)$   $+\infty$   $(-\infty,0)$   $-\infty$   undefined
  $\operatorname{dcov}^*_\alpha(X,Y)$                   +             −             +             −            +
  $\widehat{\operatorname{dcov}}_\alpha(X,Y)$           +             +             +             ?            +
  $\widetilde{\operatorname{dcov}}_\alpha(X,Y)$         +             +             +             ?            +

Table 2.
Possibilities when $d^\alpha$ is not of negative type

For $\widehat{\operatorname{dcov}}_\alpha(X,Y)$ and $\widetilde{\operatorname{dcov}}_\alpha(X,Y)$, we do not know whether $-\infty$ is possible (in Case (dc2) in Section 8):

Problem 9.20.
Is $\widehat{\operatorname{dcov}}_\alpha(X,Y) = -\infty$ or $\widetilde{\operatorname{dcov}}_\alpha(X,Y) = -\infty$ possible?

Acknowledgement
This work was inspired by a lecture by Thomas Mikosch at a mini-mini-workshop in Gothenburg in April 2019. I thank Thomas Mikosch for helpful comments.
Appendix A. A uniform integrability lemma
We use above some well-known standard results on uniform integrability, see e.g. [11, §5.4]. We also use the following lemma, which for convenience we state for sequences; it holds, with the same proof, for arbitrary families $(X_\iota)_{\iota\in\mathcal{I}}$ and $(Y_\iota)_{\iota\in\mathcal{I}}$ with an arbitrary index set $\mathcal{I}$.

Lemma A.1.
Let $(X_n)_n$ and $(Y_n)_n$ be uniformly integrable sequences of random variables, and suppose that for each $n$, $X_n$ and $Y_n$ are independent. Then the sequence $(X_nY_n)_n$ is also uniformly integrable.

To prove this, we use another simple result that perhaps is less well known than it deserves to be.
Lemma A.2.
Let $(X_n)_n$ be a sequence of random variables. Then $(X_n)_n$ is uniformly integrable if and only if for every $\varepsilon > 0$ there exist $K_\varepsilon < \infty$ and a sequence $(X_{\varepsilon,n})_n$ of random variables such that for every $n$,
$$|X_{\varepsilon,n}| \le K_\varepsilon \quad \text{a.s.}, \tag{A.1}$$
$$\operatorname{E}|X_n - X_{\varepsilon,n}| < \varepsilon. \tag{A.2}$$

Proof.
This is a simple exercise, using your favourite definition of uniform integrability. (See e.g. [11, Definition 5.4.1 and Theorem 5.4.1].) □
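The truncation construction behind Lemma A.2 can be illustrated numerically: for an integrable $X$, the truncations $X_\varepsilon := X\mathbf{1}\{|X| \le K\}$ satisfy (A.1) trivially, and $\operatorname{E}|X - X_\varepsilon| = \operatorname{E}[|X|; |X| > K] \to 0$ as $K \to \infty$ by dominated convergence. A minimal Monte Carlo sketch (ours; the distribution and cutoffs are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed integrable random variable, simulated by Monte Carlo:
# X = xi^2 with xi standard normal, so E|X| = 1.
x = rng.standard_normal(10**6) ** 2

def truncation_error(x, K):
    """Estimate E|X - X_eps| for the truncation X_eps = X * 1{|X| <= K};
    this equals E[|X|; |X| > K], which tends to 0 as K grows."""
    x_eps = np.where(np.abs(x) <= K, x, 0.0)
    return np.mean(np.abs(x - x_eps))

errors = [truncation_error(x, K) for K in (1.0, 4.0, 16.0, 64.0)]
print(errors)  # decreasing towards 0
```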
Proof of Lemma A.1.
The uniform integrability implies the existence of constants $B$ and $B'$ such that $\operatorname{E}|X_n| \le B$ and $\operatorname{E}|Y_n| \le B'$ for all $n$.

Let $0 < \varepsilon < 1$. Lemma A.2 shows that there exist $K_\varepsilon < \infty$ and random variables $X_{\varepsilon,n}$ and $Y_{\varepsilon,n}$ such that both (A.1)–(A.2) and the corresponding inequalities with $Y$ hold. Then $|X_{\varepsilon,n}Y_{\varepsilon,n}| \le K_\varepsilon^2$ a.s. Since $X_n$ and $Y_n$ are independent, we may also assume that the pairs $(X_n, X_{\varepsilon,n})$ and $(Y_n, Y_{\varepsilon,n})$ are independent, and then
$$\operatorname{E}\bigl|X_nY_n - X_{\varepsilon,n}Y_{\varepsilon,n}\bigr| \le \operatorname{E}\bigl|X_n(Y_n - Y_{\varepsilon,n})\bigr| + \operatorname{E}\bigl|(X_n - X_{\varepsilon,n})Y_n\bigr| + \operatorname{E}\bigl|(X_n - X_{\varepsilon,n})(Y_n - Y_{\varepsilon,n})\bigr| \le B\varepsilon + B'\varepsilon + \varepsilon^2 \le (B + B' + 1)\varepsilon. \tag{A.3}$$
Lemma A.2 in the opposite direction shows that the sequence $(X_nY_n)_n$ is uniformly integrable. □

Appendix B. Bochner and Pettis integrals
The expectation $\operatorname{E}X$ of an $H$-valued random variable $X$, where $H$ is a separable Hilbert space, can be defined using either the Bochner integral or the Pettis integral; see e.g. the summary in [14] and the references given there. Both integrals are defined for general Banach spaces, but in this paper we need them only for separable Hilbert spaces. In this case, $\operatorname{E}X$ exists in Bochner sense if and only if $\operatorname{E}\|X\| < \infty$, and $\operatorname{E}X$ exists in Pettis sense if and only if $\operatorname{E}|\langle X, x\rangle| < \infty$ for every $x \in H$, and then $\operatorname{E}X$ is the element of $H$ determined by
$$\langle \operatorname{E}X, x\rangle = \operatorname{E}\langle X, x\rangle, \qquad x \in H. \tag{B.1}$$
If $\operatorname{E}X$ exists in Bochner sense, then it exists in Pettis sense, and the value is the same. (Hence, the reader may choose to always interpret $\operatorname{E}X$ in Pettis sense. However, the Bochner integral is more convenient when applicable.) The converse is not true; there are $X$ such that $\operatorname{E}X$ exists in Pettis sense but not in Bochner sense. (See e.g. Example B.3.)

It is well known, and easy to see, that if $\operatorname{E}X$ exists in Pettis sense, then there exists $C < \infty$ (depending on $X$) such that
$$\operatorname{E}|\langle X, x\rangle| \le C\|x\|, \qquad x \in H. \tag{B.2}$$

We use in Section 9.5 some results on Pettis integrals in (separable) Hilbert spaces, stated in the lemmas below. We believe that at least some of these are known, but since we have not found references, we give complete proofs.

Lemma B.1.
Let $Z$ be a random variable in a separable Hilbert space $H$, and let $Z'$ be an independent copy of $Z$.

(i) If $Z$ is Bochner integrable, i.e., if $\operatorname{E}\|Z\| < \infty$, then
$$\operatorname{E}|\langle Z, Z'\rangle| < \infty. \tag{B.3}$$

(ii) If (B.3) holds, then $Z$ is Pettis integrable, i.e., $\operatorname{E}Z$ exists in Pettis sense. Moreover,
$$\operatorname{E}\langle Z, Z'\rangle = \|\operatorname{E}Z\|^2 \ge 0. \tag{B.4}$$

(iii) If $\operatorname{E}\bigl[\langle Z, Z'\rangle_+\bigr] < \infty$, then (B.3) holds. In other words, $\operatorname{E}\langle Z, Z'\rangle$ may be finite (and then $\ge 0$ by (B.4)), $+\infty$ or undefined, but never $-\infty$.

Remark B.2.
We show in Examples B.3 and B.4 that the implications in (i) and (ii) are strict, i.e., their converses do not hold. Furthermore, it is easy to find examples, even with $H = \mathbb{R}$, where $\operatorname{E}\langle Z, Z'\rangle$ is $+\infty$ or undefined (i.e., $\infty - \infty$); take any real-valued random $Z$ with $Z > 0$ and $\operatorname{E}|Z| = \infty$ for the first case, and a symmetric $Z$ with $\operatorname{E}|Z| = \infty$ for the second. □

Proof of Lemma B.1. (i): By the Cauchy–Schwarz inequality, $|\langle Z, Z'\rangle| \le \|Z\|\,\|Z'\|$, and (B.3) follows by the independence of $Z$ and $Z'$.

(ii): Let $A := \operatorname{E}|\langle Z, Z'\rangle|$ and let $u \in H$ with $\|u\| = 1$. Furthermore, let $W := \operatorname{sgn}\langle Z, u\rangle$ and $W' := \operatorname{sgn}\langle Z', u\rangle$, and let, for $M > 0$, $I_M := \mathbf{1}\{\|Z\| \le M\}$ and $I'_M := \mathbf{1}\{\|Z'\| \le M\}$. Since $I_M W Z \overset{d}{=} I'_M W' Z'$ is measurable and bounded, $\operatorname{E}[I_M W Z] = \operatorname{E}[I'_M W' Z']$ exists, even in Bochner sense, and we have, for any finite $M$,
$$A \ge \operatorname{E}\bigl[I_M W I'_M W' \langle Z, Z'\rangle\bigr] = \operatorname{E}\langle I_M W Z, I'_M W' Z'\rangle = \operatorname{E}\bigl[\operatorname{E}\bigl(\langle I_M W Z, I'_M W' Z'\rangle \mid Z\bigr)\bigr] = \operatorname{E}\langle I_M W Z, \operatorname{E}[I'_M W' Z']\rangle = \langle \operatorname{E}[I_M W Z], \operatorname{E}[I'_M W' Z']\rangle = \bigl\|\operatorname{E}[I_M W Z]\bigr\|^2. \tag{B.5}$$
Hence, by the Cauchy–Schwarz inequality, $\|u\| = 1$, and (B.5),
$$\operatorname{E}\bigl[I_M |\langle u, Z\rangle|\bigr] = \operatorname{E}\bigl[I_M W \langle u, Z\rangle\bigr] = \operatorname{E}\langle u, I_M W Z\rangle = \langle u, \operatorname{E}[I_M W Z]\rangle \le \bigl\|\operatorname{E}[I_M W Z]\bigr\| \le A^{1/2}. \tag{B.6}$$
Letting $M \to \infty$ yields, by monotone convergence,
$$\operatorname{E}\bigl|\langle u, Z\rangle\bigr| \le A^{1/2} \tag{B.7}$$
for every $u$ with $\|u\| = 1$, which (since $H$ is reflexive) shows that $Z$ is Pettis integrable.

Finally, the Pettis integrability yields first
$$\operatorname{E}\bigl(\langle Z, Z'\rangle \mid Z\bigr) = \langle Z, \operatorname{E}Z'\rangle \tag{B.8}$$
and then, taking the expectation of (B.8),
$$\operatorname{E}\langle Z, Z'\rangle = \operatorname{E}\langle Z, \operatorname{E}Z'\rangle = \langle \operatorname{E}Z, \operatorname{E}Z'\rangle = \langle \operatorname{E}Z, \operatorname{E}Z\rangle, \tag{B.9}$$
which is (B.4).

(iii): We have, similarly to (B.5),
$$\operatorname{E}\bigl[I_M I'_M \langle Z, Z'\rangle\bigr] = \operatorname{E}\langle I_M Z, I'_M Z'\rangle = \operatorname{E}\bigl[\operatorname{E}\bigl(\langle I_M Z, I'_M Z'\rangle \mid Z\bigr)\bigr] = \operatorname{E}\langle I_M Z, \operatorname{E}[I'_M Z']\rangle = \langle \operatorname{E}[I_M Z], \operatorname{E}[I'_M Z']\rangle = \bigl\|\operatorname{E}[I_M Z]\bigr\|^2 \ge 0. \tag{B.10}$$
Hence,
$$\operatorname{E}\bigl[I_M I'_M \langle Z, Z'\rangle_-\bigr] \le \operatorname{E}\bigl[I_M I'_M \langle Z, Z'\rangle_+\bigr] \le \operatorname{E}\bigl[\langle Z, Z'\rangle_+\bigr] < \infty, \tag{B.11}$$
and letting $M \to \infty$ yields $\operatorname{E}\bigl[\langle Z, Z'\rangle_-\bigr] < \infty$ by monotone convergence. Hence, (B.3) holds, and the result follows. □

We give counterexamples to converses of the statements in Lemma B.1.
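Before the counterexamples, note that (B.4) admits an exact finite-dimensional sanity check (ours, not part of the argument): for a finitely supported $Z$ in $\mathbb{R}^d$ with atoms $a_k$ and probabilities $p_k$, independence gives $\operatorname{E}\langle Z, Z'\rangle = \sum_{j,k} p_j p_k \langle a_j, a_k\rangle$, which should equal $\|\sum_k p_k a_k\|^2$. The data below are arbitrary.

```python
import numpy as np

# A finitely supported random vector Z in R^3: atoms (rows) and probabilities.
atoms = np.array([[1.0, 0.0, 2.0],
                  [0.0, -1.0, 1.0],
                  [3.0, 1.0, 0.0]])
probs = np.array([0.5, 0.3, 0.2])

# E<Z, Z'> for an independent copy Z': sum_{j,k} p_j p_k <a_j, a_k>.
gram = atoms @ atoms.T            # Gram matrix of the atoms
lhs = float(probs @ gram @ probs)

# ||E Z||^2, with E Z = sum_k p_k a_k.
mean = probs @ atoms
rhs = float(mean @ mean)

print(lhs, rhs)  # equal, illustrating (B.4)
```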
Example B.3.
Let $N$ be a positive integer-valued random variable with $p_n := \operatorname{P}(N = n)$, let $(a_n)_{n=1}^\infty$ be a sequence of positive numbers, and let $(e_i)_i$ be an ON-basis in $H$. Define $Z := a_N e_N$. Then
$$\operatorname{E}\|Z\| = \operatorname{E}a_N = \sum_{n=1}^\infty a_n p_n. \tag{B.12}$$
If $N'$ is an independent copy of $N$, and $Z' := a_{N'} e_{N'}$, then $\langle Z, Z'\rangle = a_N^2 \mathbf{1}\{N = N'\}$, and thus
$$\operatorname{E}\bigl|\langle Z, Z'\rangle\bigr| = \operatorname{E}\langle Z, Z'\rangle = \sum_{n=1}^\infty a_n^2 p_n^2. \tag{B.13}$$
Consequently, choosing $p_n = c/n^2$ and $a_n = n$, we have $\operatorname{E}|\langle Z, Z'\rangle| < \infty$ but $\operatorname{E}\|Z\| = \infty$, so $\operatorname{E}Z$ does not exist in Bochner sense. Hence the converse to Lemma B.1(i) does not hold.

In this example, as is easily seen, $\operatorname{E}Z$ exists in Pettis sense if and only if $\sum_{n=1}^\infty a_n^2 p_n^2 < \infty$, and then $\operatorname{E}Z = \sum_n a_n p_n e_n$. □
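The concrete choice in Example B.3 can be checked numerically (our illustration, not from the paper): with $p_n = c/n^2$ and $a_n = n$, the partial sums of $a_np_n = c/n$ grow like $\log n$, while $\sum_n (a_np_n)^2 = \sum_n c^2/n^2$ converges. The constant $c$ below is set to $1$ since its exact value does not affect convergence versus divergence.

```python
import numpy as np

c = 1.0
n = np.arange(1, 10**6 + 1, dtype=float)

bochner_terms = n * (c / n**2)        # a_n p_n = c/n : harmonic, diverges
pettis_terms = (n * (c / n**2)) ** 2  # (a_n p_n)^2 = c^2/n^2 : converges

print(bochner_terms.sum())  # grows like log(10^6), unbounded as the range grows
print(pettis_terms.sum())   # close to pi^2/6
```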
Example B.4.
Let $(e_i)_i$ be an ON-basis in $H$, let $\xi_i \sim N(0,1)$, $i \ge 1$, be independent standard normal random variables, and let $N$ be a positive integer-valued random variable, independent of $(\xi_i)_i$. Define $Z := \sum_{i=1}^N \xi_i e_i$. Then, for any $x \in H$,
$$\langle Z, x\rangle = \sum_{i=1}^N \langle e_i, x\rangle \xi_i. \tag{B.14}$$
Conditioned on $N$, this has a normal distribution with variance $\sum_{i=1}^N \langle e_i, x\rangle^2 \le \|x\|^2$. Hence,
$$\operatorname{E}\bigl(|\langle Z, x\rangle| \bigm| N\bigr) = \sqrt{2/\pi}\,\Bigl(\sum_{i=1}^N \langle e_i, x\rangle^2\Bigr)^{1/2} \le \|x\| \tag{B.15}$$
and thus $\operatorname{E}|\langle Z, x\rangle| \le \|x\| < \infty$. Consequently, $\operatorname{E}Z$ exists in Pettis sense. (With $\operatorname{E}Z = 0$, by symmetry.)

On the other hand, if $N' \overset{d}{=} N$ and $\xi'_i \sim N(0,1)$ are independent of each other and of $N$ and $(\xi_i)_i$, so that $Z' := \sum_{i=1}^{N'} \xi'_i e_i$ is an independent copy of $Z$, then $\langle Z, Z'\rangle = \sum_{i=1}^{N \wedge N'} \xi_i \xi'_i$. The sequence $(\xi_i \xi'_i)_i$ is i.i.d. with mean $0$ and variance $\operatorname{E}[(\xi_i \xi'_i)^2] = \operatorname{E}[\xi_i^2]\operatorname{E}[(\xi'_i)^2] = 1$, and thus by the central limit theorem, for some $c > 0$ and all $n \ge 1$,
$$\operatorname{E}\bigl(|\langle Z, Z'\rangle| \bigm| N \wedge N' = n\bigr) = \operatorname{E}\Bigl|\sum_{i=1}^n \xi_i \xi'_i\Bigr| \ge c\sqrt{n}. \tag{B.16}$$
Hence,
$$\operatorname{E}\bigl|\langle Z, Z'\rangle\bigr| \ge c \operatorname{E}\sqrt{N \wedge N'} = c\int_0^\infty \operatorname{P}\bigl(\sqrt{N \wedge N'} > t\bigr)\,\mathrm{d}t = c\int_0^\infty \operatorname{P}\bigl(N > t^2, N' > t^2\bigr)\,\mathrm{d}t = c\int_0^\infty \operatorname{P}\bigl(N > t^2\bigr)^2\,\mathrm{d}t. \tag{B.17}$$
Choose $N$ with $\operatorname{P}(N > n) = n^{-\gamma}$ for $n \ge 1$, where $0 < \gamma \le 1/4$. Then $\operatorname{P}(N > t^2) \ge t^{-2\gamma}$ for $t \ge 1$, and (B.17) yields $\operatorname{E}|\langle Z, Z'\rangle| \ge c\int_1^\infty t^{-4\gamma}\,\mathrm{d}t = \infty$. Consequently, $\operatorname{E}Z$ exists in Pettis sense, but (B.3) does not hold. Hence, the converse to Lemma B.1(ii) does not hold.

Note also that (B.16) and (B.17) hold in the opposite direction with another $c$; hence, in this example, (B.3) holds if we take $\gamma > 1/4$. Moreover, $\|Z\| = \bigl(\sum_{i=1}^N \xi_i^2\bigr)^{1/2}$, and it follows from the law of large numbers that $\operatorname{E}\bigl(\|Z\| \mid N = n\bigr) \sim \sqrt{n}$ as $n \to \infty$; thus, if $\gamma \le 1/2$, we have $\operatorname{E}\|Z\| \ge c\operatorname{E}N^{1/2} = \infty$. Consequently, taking $\gamma \in (1/4, 1/2]$ gives another example showing that the converse to (i) does not hold. □

Recall that a Hilbert–Schmidt operator $T: H \to H'$, where $H$ and $H'$ are Hilbert spaces, is a linear operator such that if $(e_i)_i$ is an ON-basis in $H$, then
$$\|T\|_{\mathrm{HS}}^2 := \sum_i \|Te_i\|^2 < \infty. \tag{B.18}$$
(This is independent of the choice of basis $(e_i)_i$.) See e.g. [17, §30].

Lemma B.5.
Let $H$ and $H'$ be separable Hilbert spaces, let $X$ be a random variable in $H$ such that $\operatorname{E}X$ exists in Pettis sense, and let $T: H \to H'$ be a Hilbert–Schmidt operator. Then $\operatorname{E}\|TX\| < \infty$.

Proof. Since $T$ is a Hilbert–Schmidt operator, $T^*T$ is a positive self-adjoint trace class operator in $H$, and thus there exists an ON-basis $(e_i)_i$ in $H$ consisting of eigenvectors, so $T^*Te_i = \lambda_i e_i$, where $\lambda_i \ge 0$ and
$$\sum_i \lambda_i = \|T\|_{\mathrm{HS}}^2 < \infty. \tag{B.19}$$
(See again e.g. [17, §30] and [7, Exercise IX.2.19].) Let $s_i := \lambda_i^{1/2}$. (These are known as the singular values of $T$.) Then, for any $x \in H$,
$$\|Tx\|^2 = \langle T^*Tx, x\rangle = \sum_i \langle T^*Tx, e_i\rangle\langle x, e_i\rangle = \sum_i \langle x, T^*Te_i\rangle\langle x, e_i\rangle = \sum_i \lambda_i \langle x, e_i\rangle^2 = \sum_i s_i^2 \langle x, e_i\rangle^2. \tag{B.20}$$
Let $(\varepsilon_i)_i$ be i.i.d. random variables with $\operatorname{P}(\varepsilon_i = 1) = \operatorname{P}(\varepsilon_i = -1) = \tfrac12$, and let them also be independent of $X$. Let
$$Z := \sum_i s_i \varepsilon_i e_i, \tag{B.21}$$
where the sum converges in $H$ (surely) since $\sum_i s_i^2 < \infty$ by (B.19). Let $x \in H$ and note that
$$\langle x, Z\rangle = \sum_i s_i \langle x, e_i\rangle \varepsilon_i. \tag{B.22}$$
Hence, using (B.20),
$$\operatorname{E}|\langle x, Z\rangle|^2 = \operatorname{E}\Bigl|\sum_i s_i \langle x, e_i\rangle \varepsilon_i\Bigr|^2 = \sum_i s_i^2 \langle x, e_i\rangle^2 = \|Tx\|^2. \tag{B.23}$$
Moreover, Khintchine's inequality [11, Lemma 3.8.1] applies to (B.22) and yields
$$\bigl(\operatorname{E}|\langle x, Z\rangle|^2\bigr)^{1/2} \le C \operatorname{E}|\langle x, Z\rangle|. \tag{B.24}$$
Combining (B.23) and (B.24) we find
$$\|Tx\| \le C \operatorname{E}|\langle x, Z\rangle|. \tag{B.25}$$
Let $\operatorname{E}_X$ and $\operatorname{E}_\varepsilon$ denote integration over $X$ and $(\varepsilon_i)$, respectively. Then (B.25) yields $\|TX\| \le C \operatorname{E}_\varepsilon|\langle X, Z\rangle|$ and thus
$$\operatorname{E}\|TX\| \le C \operatorname{E}_X \operatorname{E}_\varepsilon|\langle X, Z\rangle| = C \operatorname{E}|\langle X, Z\rangle|. \tag{B.26}$$
On the other hand, (B.2) yields, using also the definition (B.21) and (B.19),
$$\operatorname{E}_X|\langle X, Z\rangle| \le C\|Z\| = C\Bigl(\sum_i s_i^2\Bigr)^{1/2} = C\|T\|_{\mathrm{HS}}. \tag{B.27}$$
Thus,
$$\operatorname{E}|\langle X, Z\rangle| = \operatorname{E}_\varepsilon \operatorname{E}_X|\langle X, Z\rangle| \le C\|T\|_{\mathrm{HS}} < \infty. \tag{B.28}$$
The result follows by (B.26) and (B.28). □
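In finite dimensions, (B.18)–(B.23) can be checked directly. The following sketch is ours (with arbitrary test data): it verifies that $\sum_i \|Te_i\|^2$ equals the sum of the eigenvalues of $T^*T$, and that averaging (B.22) exactly over all sign patterns $(\varepsilon_i)$ recovers $\|Tx\|^2$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
T = rng.standard_normal((3, 4))  # a linear map R^4 -> R^3 (finite rank,
                                 # hence trivially Hilbert-Schmidt)

# (B.18): ||T||_HS^2 = sum_i ||T e_i||^2 over an ON-basis; for the standard
# basis the i-th term is the squared norm of the i-th column of T.
hs_sq = sum(np.linalg.norm(T[:, i]) ** 2 for i in range(T.shape[1]))

# (B.19): the same number is the sum of the eigenvalues of T*T,
# i.e. of the squared singular values of T.
lam = np.linalg.eigvalsh(T.T @ T)
assert np.isclose(hs_sq, lam.sum())

# (B.23): with Z = sum_i s_i eps_i e_i, E|<x,Z>|^2 = ||Tx||^2.
# In dimension 4 we can average over all 2^4 sign patterns exactly,
# working in the eigenbasis (columns of V) of T*T with s_i = sqrt(lambda_i).
lam_v, V = np.linalg.eigh(T.T @ T)
s = np.sqrt(np.clip(lam_v, 0.0, None))
x = rng.standard_normal(4)
coords = V.T @ x                 # the coefficients <x, e_i> in the eigenbasis
vals = [np.dot(s * np.array(eps), coords) ** 2
        for eps in itertools.product([1, -1], repeat=4)]
assert np.isclose(np.mean(vals), np.linalg.norm(T @ x) ** 2)
print(hs_sq)
```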
Remark B.6.
Example B.3 shows that the result in Lemma B.5 does not hold for $T = I$, the identity operator (if $\dim H = \infty$). In fact, the result holds if and only if $T$ is Hilbert–Schmidt: if $T$ is a bounded operator that is not Hilbert–Schmidt, then there exists $X$ such that $\operatorname{E}X$ exists but $\operatorname{E}\|TX\| = \infty$; this can be seen by a modification of Example B.3. (We omit the details.) □

Lemma B.7.
Let $X$ and $Y$ be independent random variables with values in separable Hilbert spaces $H$ and $H'$. If $\operatorname{E}X$ and $\operatorname{E}Y$ exist in Pettis sense, then $\operatorname{E}[X \otimes Y]$ exists in Pettis sense, in $H \otimes H'$, and $\operatorname{E}[X \otimes Y] = (\operatorname{E}X) \otimes (\operatorname{E}Y)$.

Proof. Let $z \in H \otimes H'$, and define a linear operator $T_z: H \to H'$ by
$$\langle T_z x, y\rangle = \langle x \otimes y, z\rangle. \tag{B.29}$$
Let $(e_i)_i$ and $(e'_j)_j$ be ON-bases in $H$ and $H'$. Then $(e_i \otimes e'_j)_{i,j}$ is an ON-basis in $H \otimes H'$, and thus, using (B.18) and (B.29),
$$\|T_z\|_{\mathrm{HS}}^2 = \sum_i \|T_z e_i\|^2 = \sum_i \sum_j \langle T_z e_i, e'_j\rangle^2 = \sum_i \sum_j \langle e_i \otimes e'_j, z\rangle^2 = \|z\|^2 < \infty, \tag{B.30}$$
and thus $T_z$ is a Hilbert–Schmidt operator. (In fact, as is well known, it is easy to see that $z \mapsto T_z$ yields an isometry between $H \otimes H'$ and the space of Hilbert–Schmidt operators $H \to H'$.) Hence, Lemma B.5 applies and shows $\operatorname{E}\|T_zX\| < \infty$.

Furthermore, since $Y$ is Pettis integrable, (B.29) and (B.2) show that for every $x \in H$,
$$\operatorname{E}|\langle x \otimes Y, z\rangle| = \operatorname{E}|\langle T_zx, Y\rangle| \le C\|T_zx\|. \tag{B.31}$$
Consequently, with $\operatorname{E}_Y$ denoting the integral over $Y$,
$$\operatorname{E}|\langle X \otimes Y, z\rangle| = \operatorname{E}\operatorname{E}_Y|\langle X \otimes Y, z\rangle| \le C \operatorname{E}\|T_zX\| < \infty. \tag{B.32}$$
Since $z \in H \otimes H'$ is arbitrary, this shows that $X \otimes Y$ is Pettis integrable, i.e., that $\operatorname{E}[X \otimes Y]$ exists in Pettis sense.

Finally, by (B.1), (7.5) and independence, for any $e_i$ and $e'_j$ in the bases,
$$\langle \operatorname{E}[X \otimes Y], e_i \otimes e'_j\rangle = \operatorname{E}\langle X \otimes Y, e_i \otimes e'_j\rangle = \operatorname{E}\bigl[\langle X, e_i\rangle\langle Y, e'_j\rangle\bigr] = \operatorname{E}[\langle X, e_i\rangle]\operatorname{E}[\langle Y, e'_j\rangle] = \langle \operatorname{E}X, e_i\rangle\langle \operatorname{E}Y, e'_j\rangle = \langle (\operatorname{E}X) \otimes (\operatorname{E}Y), e_i \otimes e'_j\rangle. \tag{B.33}$$
Since the set of such $e_i \otimes e'_j$ is a basis, $\operatorname{E}[X \otimes Y] = (\operatorname{E}X) \otimes (\operatorname{E}Y)$ follows. □

Remark B.8.
In this paper we consider only the Hilbert space tensor product defined in Section 7. Nevertheless, we note that Lemma B.7 a fortiori holds also for the injective tensor product $H \check{\otimes} H'$, since there is a natural continuous mapping $H \otimes H' \to H \check{\otimes} H'$ mapping $x \otimes y \mapsto x \otimes y$. On the other hand, the result does not hold for the projective tensor product $H \hat{\otimes} H'$, which can be seen as follows: Let $H = H'$ and note that then $x \otimes y \mapsto \langle x, y\rangle$ extends to a continuous linear functional on $H \hat{\otimes} H'$. Hence, if $\operatorname{E}[X \otimes Y]$ exists in $H \hat{\otimes} H'$, then $\operatorname{E}\langle X, Y\rangle$ exists in $\mathbb{R}$, so $\operatorname{E}|\langle X, Y\rangle| < \infty$, but Example B.4 shows that this does not always hold for independent Pettis integrable $X$ and $Y$. □

References

[1] Charles R. Baker: Joint measures and cross-covariance operators.
Trans. Amer. Math. Soc. (1973), 273–289.
[2] Patrick Billingsley: Convergence of Probability Measures. Wiley, New York, 1968.
[3] Colin Bennett & Robert Sharpley: Interpolation of Operators. Academic Press, Boston, 1988.
[4] Christian Berg, Jens Peter Reus Christensen & Paul Ressel: Harmonic Analysis on Semigroups. Theory of Positive Definite and Related Functions. Springer-Verlag, New York, 1984.
[5] Jöran Bergh & Jörgen Löfström: Interpolation Spaces. Springer-Verlag, Berlin, 1976.
[6] V. I. Bogachev & A. V. Kolesnikov: The Monge–Kantorovich problem: achievements, connections, and prospects. (Russian) Uspekhi Mat. Nauk (2012), no. 5(407), 3–110; English translation: Russian Math. Surveys 67 (2012), no. 5, 785–890.
[7] John B. Conway: A Course in Functional Analysis. Springer-Verlag, New York, 1990.
[8] Herold Dehling, Muneya Matsui, Thomas Mikosch, Gennady Samorodnitsky & Laleh Tafakori: Distance covariance for discretized stochastic processes. Preprint, 2018. arXiv:1806.09369v4
[9] Andrey Feuerverger: A consistent test for bivariate dependence. International Statistical Review (1993), no. 3, 419–433.
[10] Arthur Gretton, Olivier Bousquet, Alex Smola & Bernhard Schölkopf: Measuring statistical dependence with Hilbert–Schmidt norms. Algorithmic Learning Theory, 63–77, Lecture Notes in Artificial Intelligence 3734, Springer, Berlin, 2005.
[11] Allan Gut: Probability: A Graduate Course. 2nd ed., Springer, New York, 2013.
[12] Martin Emil Jakobsen: Distance covariance in metric spaces: non-parametric independence testing in metric spaces. Master's thesis, Copenhagen, 2017. arXiv:1706.03490v1
[13] Svante Janson: Gaussian Hilbert Spaces. Cambridge Univ. Press, Cambridge, UK, 1997.
[14] Svante Janson & Sten Kaijser: Higher moments of Banach space valued random variables. Memoirs Amer. Math. Soc., no. 1127 (2015).
[15] Olav Kallenberg: Foundations of Modern Probability.
[16] arXiv:1807.02582v1
[17] Peter D. Lax: Functional Analysis. Wiley, 2002.
[18] Russell Lyons: Distance covariance in metric spaces. Ann. Probab. (2013), no. 5, 3284–3305. Errata: Ann. Probab. (2018), no. 4, 2400–2405.
[19] NIST Handbook of Mathematical Functions. Edited by Frank W. J. Olver, Daniel W. Lozier, Ronald F. Boisvert and Charles W. Clark. Cambridge Univ. Press, 2010.
Also available as NIST Digital Library of Mathematical Functions, http://dlmf.nist.gov/
[20] Albrecht Pietsch: Nukleare lokalkonvexe Räume. 2. ed., Akademie-Verlag, Berlin, 1969. English translation: Nuclear Locally Convex Spaces. Springer-Verlag, Berlin, 1972.
[21] Ludger Rüschendorf: Wasserstein metric. Encyclopedia of Mathematics.
[22] I. J. Schoenberg: Metric spaces and positive definite functions. Trans. Amer. Math. Soc. (1938), no. 3, 522–536.
[23] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton & Kenji Fukumizu: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41 (2013), no. 5, 2263–2291.
[24] Gábor J. Székely & Maria L. Rizzo: Brownian distance covariance. Ann. Appl. Stat. (2009), no. 4, 1236–1265.
[25] Gábor J. Székely & Maria L. Rizzo: Rejoinder: Brownian distance covariance. Ann. Appl. Stat. (2009), no. 4, 1303–1308.
[26] Gábor J. Székely, Maria L. Rizzo & Nail K. Bakirov: Measuring and testing dependence by correlation of distances. Ann. Statist. (2007), no. 6, 2769–2794.
[27] V. S. Varadarajan: On the convergence of sample probability distributions. Sankhyā (1958), 23–26.

Department of Mathematics, Uppsala University, PO Box 480, SE-751 06 Uppsala, Sweden
E-mail address: [email protected]