MEASURING ASSOCIATION WITH WASSERSTEIN DISTANCES
JOHANNES WIESEL
Abstract.
Let π ∈ Π(µ, ν) be a coupling between two probability measures µ and ν on a Polish space. In this article we propose and study a class of non-parametric measures of association between µ and ν. The analysis is based on the Wasserstein distance between ν and the disintegration π_{x¹} of π with respect to the first coordinate. We also establish basic statistical properties of this new class of measures: we develop a statistical theory for strongly consistent estimators and determine their convergence rate. Throughout our analysis we make use of the so-called adapted/causal Wasserstein distance, in particular we rely on results established in [Backhoff, Bartl, Beiglböck, Wiesel. Estimating processes in adapted Wasserstein distance. 2020]. Our class of measures offers an alternative to the correlation coefficients proposed by [Dette, Siburg and Stoimenov (2013). A copula-based non-parametric measure of regression dependence. Scandinavian Journal of Statistics 40(1), 21–41] and [Chatterjee (2020). A new coefficient of correlation. Journal of the American Statistical Association, 1–21]. In contrast to these works, our approach also applies to probability laws on general Polish spaces.

1. Introduction
Given a sample (X¹_1, X²_1), (X¹_2, X²_2), ..., (X¹_N, X²_N) generated from a measure π with marginals µ and ν on a product X × Y of topological spaces, a number of works have recently asked whether it is possible to define a simple empirical measure T_N, which provides an estimate for a non-parametric measure of association between µ and ν. More concretely, [5, Abstract] states the following desirable conditions:

"Is it possible to define a coefficient of correlation which is:
(i) as simple as the classical coefficients like Pearson's correlation or Spearman's correlation, and yet
(ii) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and
(iii) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients?"

As is argued in [5], none of the various past works based on joint cumulative distribution functions and ranks, kernel-based methods, information theoretic coefficients, coefficients based on copulas or on pairwise distances (see e.g. [19, 15, 3, 20, 22, 9, 21, 24, 16, 10, 27] and the references therein) satisfies all three properties stated

Date: February 2, 2021.
Key words and phrases.
Independence, measure of association, correlation, optimal transport, (causal) Wasserstein distance. MSC 2010 Classification: 62G10, 62H20, 60F05, 60D05. We thank Bodhi Sen for helpful discussions.

above. It turns out that the articles [7] and [5] are the first to answer this question in the affirmative for spaces X = ℝ^d and Y = ℝ^{d'}, where d = d' = 1. Since then their correlation coefficient has attracted a lot of attention, see e.g. [23, 4]. More recently [6] (see also [12] for a comparison) show how to build a corresponding estimator T_N for general d ≥
1. The analysis in [6] is restricted to estimators arising from RKHS with specific properties and thus cannot be applied to arbitrary Polish spaces X = Y. In this article we offer an alternative construction of T_N based on Wasserstein distances. Directly utilising the underlying compatible metric structure of the space X, properties (i)-(iii) are then shown to hold without further assumptions. Furthermore, by varying the metric d and the Wasserstein exponent p, one can naturally construct a whole family of different estimators. We are thus able to build an estimator directly from well-known quantities studied in the theory of optimal transportation. In fact, it will turn out that once we have defined a specific measure of association T, our estimator can be computed via the plug-in approach T_N = T(π̂^N) for the so-called adapted empirical measure π̂^N. In this article we derive consistency and convergence rates of the estimator T(π̂^N) under different assumptions.

2. Notation and main results
Let X be a Polish space with a compatible metric d and let us denote by Prob(X) the set of Borel probability measures on X. Let us take µ, ν ∈ Prob(X) and denote by Π(µ, ν) the set of couplings between µ and ν, i.e.

Π(µ, ν) = { π ∈ Prob(X × X) : π(· × X) = µ(·), π(X × ·) = ν(·) }.

The Wasserstein distance W(µ, ν) is defined via

W(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫ d(x¹, x²) π(dx¹, dx²).

The pushforward of the measure µ via a function f : X → X is denoted by f∗µ, i.e.

(f∗µ)(A) := µ({ x ∈ X : f(x) ∈ A })

for all Borel sets A ⊆ X. Generalising the above definition to Borel probability measures on X² := X × X, we often write π¹ = (x¹)∗π and π² = (x²)∗π for π ∈ Prob(X²), where (x¹, x²) ↦ x¹ and (x¹, x²) ↦ x² are the canonical projection maps from X² to the first and second coordinates respectively. We also recall that any coupling π ∈ Π(µ, ν) has a µ-a.s. unique disintegration with respect to the first coordinate, i.e. there exists a Borel measurable function x¹ ↦ π_{x¹} such that

π(A × B) = ∫_A π_{x¹}(B) µ(dx¹)

for all Borel sets A, B ⊆ X. The product coupling with marginals µ and ν is denoted by µ ⊗ ν. One of the key notions used in this article is the so-called adapted/causal Wasserstein distance. It can be introduced as follows: for Borel probability measures π, π̃ on X² we define the nested/causal/adapted Wasserstein distance AW(π, π̃) via

AW(π, π̃) = inf_{γ ∈ Π(π¹, π̃¹)} ∫ [ d(x¹, y¹) + W(π_{x¹}, π̃_{y¹}) ] γ(dx¹, dy¹).   (1)

On an intuitive level, the nested Wasserstein distance only considers those couplings γ ∈ Π(π, π̃) which respect the information flow formalised by the canonical (i.e. coordinate) filtration (F_t)_{t ∈ {1,2}}: in (1) this is achieved by first taking an infimum over couplings of π¹, π̃¹ (i.e. "couplings at time one") and then a second (nested) infimum with respect to the respective disintegrations (i.e. "conditional couplings at time two"). This feature distinguishes AW from the Wasserstein distance W, which also includes "anticipative couplings". We refer to [2, pp. 2-3] for a well-written introduction to this topic.
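For empirical measures on the real line with the same number of atoms, the distance W₁ reduces to matching sorted samples (the monotone coupling). A minimal sketch, assuming d(x, y) = |x − y|; the function name is our own:

```python
import numpy as np

def w1_empirical(xs, ys):
    # W1 between (1/n) sum_i delta_{xs[i]} and (1/n) sum_i delta_{ys[i]} on the line:
    # the optimal coupling matches order statistics
    xs, ys = np.sort(np.asarray(xs, float)), np.sort(np.asarray(ys, float))
    return np.abs(xs - ys).mean()

print(w1_empirical([0.0, 1.0], [0.5, 1.5]))  # → 0.5
```

This quantile-matching identity is specific to dimension one; in higher dimensions W₁ requires solving a transport problem.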
The nested distance was introduced in [17], [18] in the context of multistage stochastic optimisation and was independently analysed in [14]. Let us also remark here that we always have the inequality

W(π, π̃) ≤ AW(π, π̃),   (2)

where the Wasserstein distance W(π, π̃) is correspondingly defined as

W(π, π̃) = inf_{γ ∈ Π(π, π̃)} ∫ [ d(x¹, y¹) + d(x², y²) ] γ(d(x¹, x²), d(y¹, y²))

and

Π(π, π̃) = { γ ∈ Prob(X² × X²) : γ(· × X²) = π(·), γ(X² × ·) = π̃(·) }.

For the rest of this article we fix two measures µ, ν ∈ Prob(X). For any π ∈ Prob(X²) let us define the functional π ↦ T(π),

T(π) := ∫ W(π_{x¹}, ν) π¹(dx¹) / ∫ d(y, z) π²(dy) π²(dz).

If π ∈ Π(µ, ν), then in particular

T(π) = ∫ W(π_{x¹}, ν) µ(dx¹) / ∫ d(y, z) ν(dy) ν(dz),

where throughout we assume that ν is not concentrated in a single point, i.e.

∫ d(y, z) ν(dy) ν(dz) > 0.

It turns out that T defines a convenient measure of association, whose properties and estimation are discussed in the upcoming sections. In particular we show that T indeed satisfies the main requirement (ii) stated in [5, Abstract], as cited in the introduction:

Theorem 2.1.
For any π ∈ Π(µ, ν) the functional π ↦ T(π) satisfies:
(i) T(π) ∈ [0, 1].
(ii) T(π) = 0 if and only if π = µ ⊗ ν.
(iii) T(π) = 1 if and only if ν = f∗µ for some measurable function f : X → X.

A natural estimator for T is given via the following plug-in approach:

Theorem 2.2.
Let π ∈ Π(µ, ν) be such that ∫ d(x¹, x²) ν(dx²) < ∞ for any x¹ ∈ X and let π̂^N be an AW-consistent estimator of π. Then T(π̂^N) is a consistent estimator of T(π).
One such AW-consistent estimator of π has recently been constructed in [1] and throughout this article we will make use of results established there. In particular, continuity of T in AW will directly enable us to establish convergence rates. Let us also remark that our analysis can easily be extended to p-Wasserstein distances W_p for p ≥ 1, by considering

T_p(π) := ( ∫ W_p(π_{x¹}, ν)^p π¹(dx¹) )^{1/p} / ( ∫ d(x, y)^p π²(dx) π²(dy) )^{1/p}

and replacing W, AW by the (adapted) p-Wasserstein distances W_p, AW_p in all results. The restriction to p = 1 is thus only chosen for notational simplicity. This article is structured as follows: in Section 3 we derive basic properties of T and compare it to the measure of association derived in [6] as well as Pearson's correlation coefficient in the case of a bivariate Gaussian distribution π. In Section 4 we state general continuity properties of the functional π ↦ T(π) with respect to AW and give a first consistency result. Sections 5 and 6 then exhibit convergence rates for the independent case π = µ ⊗ ν and the general case respectively. We relegate longer proofs to the appendix.

3. Basic results and discussion
As explained in the introduction, the functional π ↦ T(π) is not the only one satisfying the basic properties stated in Theorem 2.1. In the following remark we compare T to the functional obtained in [6] for the specific case (X, d) = (ℝ^d, |·|):

Remark 3.1.
Let us point out here that T(π) is different from the measure of association proposed in [6]. For the case (X, d) = (ℝ^d, |·|), this measure can be written as

Ṫ(π) = 1 − ∫ |x² − y²| π_{x¹}(dx²) π_{x¹}(dy²) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
     = [ ∫ |y − z| ν(dy) ν(dz) − ∫ |x² − y²| π_{x¹}(dx²) π_{x¹}(dy²) µ(dx¹) ] / ∫ |y − z| ν(dy) ν(dz).

The denominator is obviously the same as in our definition of T(π). The numerator is different: indeed, choosing γ_{x¹} ∈ Π(ν, π_{x¹}) such that

W(ν, π_{x¹}) = ∫ |y − z| γ_{x¹}(dy, dz)

for each x¹ ∈ X, it is not hard to see that

Ṫ(π) = ∫ ( |y − z| − |ỹ − z̃| ) γ_{x¹}(dy, dỹ) γ_{x¹}(dz, dz̃) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
 ≤ ∫ |y − z − (ỹ − z̃)| γ_{x¹}(dy, dỹ) γ_{x¹}(dz, dz̃) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
 ≤ ∫ ( |y − ỹ| + |z − z̃| ) γ_{x¹}(dy, dỹ) γ_{x¹}(dz, dz̃) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
 = [ ∫ |y − ỹ| γ_{x¹}(dy, dỹ) µ(dx¹) + ∫ |z − z̃| γ_{x¹}(dz, dz̃) µ(dx¹) ] / ∫ |y − z| ν(dy) ν(dz)
 = 2 ∫ W(π_{x¹}, ν) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz) = 2 T(π).

In conclusion, in the case (X, d) = (ℝ^d, |·|), the functional Ṫ(π) is dominated by 2T(π). By a similar reasoning, we can derive the following corollary:
Corollary 3.2.
Let (X, ‖·‖) be a normed space and let us define the measure of association derived from the norm ‖·‖ by

Ṫ(π) = 1 − ∫ ‖y − z‖ π_{x¹}(dy) π_{x¹}(dz) µ(dx¹) / ∫ ‖y − z‖ ν(dy) ν(dz).

Then we have Ṫ(π) ≤ 2T(π). In particular all upper bounds derived in this article also hold for Ṫ(π), adjusting for a factor of 2. However, the relation "Ṫ(π) = 0 if and only if π = µ ⊗ ν" might not hold, e.g. if Ṫ(π) only depends on a finite number of moments of π. Thus in general T(π) offers greater flexibility than Ṫ(π), as it can be defined for any metric d instead of just any norm ‖·‖, while it always satisfies the properties (i)-(iii) of Theorem 2.1.

Let us now compare our measure of association T to a different benchmark: recall that if π is a bivariate Gaussian distribution, then the association between µ and ν is famously quantified via Pearson's correlation coefficient. It turns out we can also compute T(π) explicitly in this case:

Example 3.3 (Comparison with Pearson's correlation coefficient in the case p = 2). Let (X, d) = (ℝ, |·|) and let π = N(a, Σ), where a = (a¹, a²) is the mean and

Σ = [ σ₁²     ρσ₁σ₂ ;
      ρσ₁σ₂   σ₂²   ]

is the covariance matrix of the bivariate normal distribution π. Here we assume σ₁, σ₂ > 0 and note that ρ ∈ [−1, 1] is Pearson's correlation coefficient. Then

T(π) = 1 − √(1 − ρ²).
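Before turning to the proof, the closed form can be sanity-checked numerically. A minimal sketch (all helper names are our own), assuming the standard identity W₂²(N(m₁, s₁²), N(m₂, s₂²)) = (m₁ − m₂)² + (s₁ − s₂)² for one-dimensional Gaussians, Monte Carlo over x¹, and the squared-distance (p = 2) normalisation used in the example:

```python
import numpy as np

def w2_sq_gauss1d(m1, s1, m2, s2):
    # squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) on the line
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def T_gaussian(rho, a=(0.0, 0.0), s=(1.0, 1.0), n=200_000, seed=0):
    # Monte Carlo over x^1 of the ratio  E[W_2^2(pi_{x^1}, nu)] / int |y - z|^2 nu nu
    rng = np.random.default_rng(seed)
    a1, a2 = a
    s1, s2 = s
    x1 = rng.normal(a1, s1, n)
    cond_mean = a2 + (s2 / s1) * rho * (x1 - a1)   # mean of the disintegration pi_{x^1}
    cond_std = np.sqrt(1.0 - rho ** 2) * s2        # std of pi_{x^1}
    num = w2_sq_gauss1d(cond_mean, cond_std, a2, s2).mean()
    den = 2.0 * s2 ** 2                            # int |y - z|^2 nu(dy) nu(dz)
    return num / den

rho = 0.8
print(T_gaussian(rho), 1.0 - np.sqrt(1.0 - rho ** 2))
```

The two printed values agree up to Monte Carlo error, matching the claim T(π) = 1 − √(1 − ρ²).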
Proof.
Note that we can immediately read off the marginal distributions µ = N(a¹, σ₁²) and ν = N(a², σ₂²), as well as

π_{x¹} = N( a² + (σ₂/σ₁) ρ (x¹ − a¹), (1 − ρ²) σ₂² ).

Furthermore, by the explicit formula for the 2-Wasserstein distance between Gaussians (see e.g. [13, Simple example]) one can compute

W₂²(π_{x¹}, ν) = ( a² + (σ₂/σ₁) ρ (x¹ − a¹) − a² )² + σ₂² + (1 − ρ²) σ₂² − 2 √( (1 − ρ²) σ₂⁴ )
             = ( (σ₂/σ₁) ρ (x¹ − a¹) )² + σ₂² + (1 − ρ²) σ₂² − 2 σ₂² √(1 − ρ²),

so that

∫ W₂²(π_{x¹}, ν) µ(dx¹) = ρ² σ₂² + σ₂² + (1 − ρ²) σ₂² − 2 σ₂² √(1 − ρ²) = 2 σ₂² ( 1 − √(1 − ρ²) ).

Lastly

∫ |y − z|² ν(dy) ν(dz) = 2 ∫ |y|² ν(dy) − 2 ( ∫ z ν(dz) )² = 2 σ₂²

and the claim follows. □

4. Estimator for T(π) and asymptotic consistency

We now investigate continuity properties of the functional π ↦ T(π), which will enable us to construct a plug-in estimator. We then check its asymptotic consistency. Let us thus first show that the functional π ↦ T(π) is continuous in the adapted Wasserstein distance AW:

Theorem 4.1.
For π ∈ Π(µ, ν) and π̃ ∈ Π(µ̃, ν̃) we have

| ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) | ≤ AW(π, π̃) + W(ν, ν̃) ≤ 2 AW(π, π̃)

and thus in particular

| T(π) − T(π̃) | ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] ( AW(π, π̃) + W(ν, ν̃) + g(ν, ν̃) )
               ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] ( AW(π, π̃) + 3 W(ν, ν̃) )
               ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] · 4 AW(π, π̃)

for any x¹ ∈ X, where

f(ν, ν̃) := ∫ d(y, z) ν(dy) ν(dz) · ∫ d(y, z) ν̃(dy) ν̃(dz),
g(ν, ν̃) := | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z) ν(dy) ν(dz) |.
We have the following immediate corollary:
Corollary 4.2.
Let π ∈ Π(µ, ν) be such that ∫ d(x¹, x²) ν(dx²) < ∞ for any x¹ ∈ X and let π̂^N be an AW-consistent estimator of π. Then T(π̂^N) is an asymptotically consistent estimator of T(π).

Proof. Theorem 4.1 yields

| T(π) − T(π̂^N) | ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, (π̂^N)²) ] · 4 AW(π, π̂^N).

By assumption we have lim_{N→∞} AW(π, π̂^N) = 0. By the proof of Theorem 4.1 in the appendix we conclude that g(ν, (π̂^N)²) ≤ 2 AW(π, π̂^N), so that

lim_{N→∞} f(ν, (π̂^N)²) = f(ν, ν), where f(ν, ν) > 0. □

We now give an explicit example of an AW-consistent estimator π̂^N, which will then naturally facilitate a plug-in estimator T(π̂^N) for T(π). For simplicity we only discuss here the case where π is a probability measure on ([0,1]^d)², where we equip [0,1]^d with the Euclidean metric |·|. Of course, our analysis can then easily be extended to probability measures on any compact subset of ℝ^d. Before we explain the details of the construction, we need to introduce some additional notation: for a subset F of ℝ^d let diam(F) := sup_{x,y ∈ F} |x − y| and for any set A, let |A| denote the number of elements in A. Lastly, for any π ∈ Prob(([0,1]^d)²) and any Borel set G ⊆ [0,1]^d we define the conditional probability

π_G(·) = (1/π¹(G)) ∫_G π_{x¹}(·) π¹(dx¹) ∈ Prob([0,1]^d),

where we make the convention that π_G := δ₀ if π¹(G) = 0. Let us assume that we are given i.i.d. samples (X¹_1, X²_1), (X¹_2, X²_2), ..., (X¹_N, X²_N) of π. Let us partition the unit cube [0,1]^d into a disjoint union of a finite number of cubes and let ϕ_N : [0,1]^d → [0,1]^d map each cube to its center. Then in particular ϕ_N has a finite range for each N ≥
1. We now set

π̂^N := (1/N) Σ_{n=1}^N δ_{(ϕ_N(X¹_n), ϕ_N(X²_n))}

for each N ≥
1. In order to fix some additional notation we can reformulate the assumptions on the function ϕ_N as follows: if we define

Φ_N := { (ϕ_N)^{−1}({x}) : x ∈ ϕ_N([0,1]^d) },

then [0,1]^d = ⋃_{G ∈ Φ_N} G is a disjoint union. One of the main results of [1] is the following:
Lemma 4.3 ([1, Theorem 1.3]). Assume that lim_{N→∞} |Φ_N|/N = 0. Then the adapted empirical measure is a strongly consistent estimator, that is,

lim_{N→∞} AW(π, π̂^N) = 0   P-almost surely.

As a preparation for the next sections we make two additional remarks here: first we note that T(π̂^N) can be written as

T(π̂^N) = Σ_{G ∈ Φ_N} ( |{n ∈ {1,...,N} s.t. X¹_n ∈ G}| / N ) W(π̂^N_G, (π̂^N)²) / ( (1/N²) Σ_{n,m=1}^N |ϕ_N(X²_n) − ϕ_N(X²_m)| ).   (3)

Second, while the estimate of |f(ν, (π̂^N)²) − f(ν, ν)| in terms of AW(π, π̂^N) is useful for the proof of Corollary 4.2, the following result provides sharper convergence rates for the adapted empirical measure π̂^N:

Lemma 4.4.
We have √N ( f(ν, (π̂^N)²) − f(ν, ν) ) = O_P(1).

Proof.
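Formula (3) can be implemented directly when d = 1, where W₁ between two measures supported on the same grid equals the integrated absolute difference of their CDFs. A minimal sketch (the grid size m and all names are our choices, not from the paper):

```python
import numpy as np

def T_hat(x, y, m):
    # plug-in estimator of T for samples (x_n, y_n) in [0,1]^2,
    # using a uniform grid of m cells per axis (the map phi_N)
    N = len(x)
    ix = np.clip((np.asarray(x) * m).astype(int), 0, m - 1)
    iy = np.clip((np.asarray(y) * m).astype(int), 0, m - 1)
    joint = np.zeros((m, m))
    np.add.at(joint, (ix, iy), 1.0)          # joint cell counts
    ny = joint.sum(axis=0)                   # counts of y_n per cell
    marg_cdf = np.cumsum(ny / N)             # CDF of the binned second marginal
    num = 0.0
    for g in range(m):                       # cells G of the first coordinate
        ng = joint[g].sum()
        if ng == 0:
            continue
        cond_cdf = np.cumsum(joint[g] / ng)  # CDF of the binned conditional
        w1 = np.abs(cond_cdf - marg_cdf).sum() / m   # grid spacing is 1/m
        num += (ng / N) * w1
    centers = (np.arange(m) + 0.5) / m
    den = (np.abs(centers[:, None] - centers[None, :])
           * (ny / N)[:, None] * (ny / N)[None, :]).sum()
    return num / den
```

For i.i.d. uniform samples with y independent of x the value is close to 0, while for y = x it equals 1 exactly, in line with Theorem 2.1.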
We note that, up to the constant factor ∫ |y − z| ν(dy) ν(dz) coming from the definition of f,

f(ν, (π̂^N)²) − f(ν, ν) is a multiple of ∫ |y − z| (π̂^N)²(dy) (π̂^N)²(dz) − ∫ |y − z| ν(dy) ν(dz) = (1/N²) Σ_{i,j=1}^N |ϕ_N(X²_i) − ϕ_N(X²_j)| − ∫ |x − y| ν(dx) ν(dy).

Using the CLT for U-statistics we conclude that √N ( f(ν, (π̂^N)²) − f(ν, ν) ) = O_P(1). □

In the following sections we discuss convergence rates of the estimator T(π̂^N), first for the independent case π = µ ⊗ ν and subsequently for the general case.

5. The case π = µ ⊗ ν

In this section we discuss convergence rates of T(π̂^N) for the case π = µ ⊗ ν. We then show how to construct a test for independence of µ and ν using the estimator T(π̂^N). As T(π) ∈ [0,
1] for all π ∈ Π(µ, ν), we cannot hope for a CLT as in [6, Theorem 4.1]. However, we can still obtain parametric convergence rates. Indeed, our core insight will be the following result:

Theorem 5.1. If π = µ ⊗ ν then we have for all ε > 0

P( ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) ≥ ε ) ≤ exp( N ( log(2) (|Φ_N| + 1)/N − ε²/d ) )

and consequently

T(π̂^N) = O_P( √(|Φ_N| + 1) / √N ).

In particular, if

lim_{N→∞} |Φ_N|/N = 0 and lim_{N→∞} |Φ_N|/log N = ∞,

then there exists C = C(ν) > 0 such that the test: reject π = µ ⊗ ν if

T(π̂^N) > C √(|Φ_N| + 1) / √N,

makes no error after a random sample size under π = µ ⊗ ν. Furthermore, if π ≠ µ ⊗ ν then the same test again makes no error after a random sample size.

We note here that the construction of π̂^N is fully explicit and no additional assumptions on the measure π are necessary, which makes the above result conceptually easy to apply. Lastly, we can construct the following simple test statistic for independence of µ and ν:

Corollary 5.2.
Under the assumptions that µ and ν are non-atomic and π = µ ⊗ ν, there exists a constant C(ν) such that the test: reject π = µ ⊗ ν if

T(π̂^N) > C(ν) ( √(2/π) |Φ_N|/√N + (σ/√N) Φ^{−1}(1 − α) ),

where Φ^{−1} denotes the quantile function of the standard normal distribution, has asymptotic significance level α.

Proof. As in the proof of Theorem 5.1, this follows from the inequality

T(π̂^N) ≤ √d T̃_N(π) / ∫ |x − y| (π̂^N)²(dx) (π̂^N)²(dy) ≤ C(ν) T̃_N(π),

which holds for all sufficiently large N ∈ ℕ. Here T̃_N is given by

T̃_N(π) := Σ_{G ∈ Φ_N} Σ_{H ∈ Φ_N} | |{n ∈ {1,...,N} s.t. X¹_n ∈ G, X²_n ∈ H}|/N − ( |{n ∈ {1,...,N} s.t. X¹_n ∈ G}|/N ) · ( |{n ∈ {1,...,N} s.t. X²_n ∈ H}|/N ) |.

We can then conclude by Lemma A.1. □

6. General convergence rates for T(π)

We now derive general rates of convergence for T(π), using results recently obtained in [1]. In particular we slightly refine the definition of ϕ_N and thus the adapted empirical measure given in Section 4 as follows: we set r = 1/3 for d = 1 and r = 1/(2d) for all d ≥
2. For all N ≥
1, let us now partition the cube [0,1]^d into the disjoint union of N^{rd} cubes with edges of length N^{−r} and let ϕ_N : [0,1]^d → [0,1]^d map each such small cube to its center. As before we then set

π̂^N := (1/N) Σ_{n=1}^N δ_{(ϕ_N(X¹_n), ϕ_N(X²_n))}

for each N ≥ 1. We impose the following assumption on π for the remainder of this section:

Assumption 6.1 (Lipschitz kernels). There is a version of the (µ-a.s. uniquely defined) disintegration such that the mapping [0,1]^d ∋ x¹ ↦ π_{x¹} ∈ Prob([0,1]^d) is Lipschitz continuous, where Prob([0,1]^d) is endowed with its usual Wasserstein distance W.

Lemma 6.2 (Average rate of AW(π, π̂^N), see [1, Theorem 1.5]). Under Assumption 6.1, there is a constant
C > 0 such that

E[ AW(π, π̂^N) ] ≤ C · { N^{−1/2} for d = 1; N^{−1/2} log(N + 1) for d = 2; N^{−1/(2d)} for d ≥ 3 } =: C · rate(N)   (4)

for all N ≥ 1. In the lemma above, the constant C depends on d and the Lipschitz constant in Assumption 6.1. Furthermore, [1] also show the following concentration inequality:

Lemma 6.3 (Deviation of AW(π, π̂^N), see [1, Theorem 1.7]). Under Assumption 6.1, there are constants c, C > 0 such that

P[ AW(π, π̂^N) ≥ C rate(N) + ε ] ≤ exp( −cNε² )

for all N ≥ 1 and all ε > 0. As above, the constants c, C depend on d and the Lipschitz constant in Assumption 6.1. The above lemmas immediately enable us to prove average convergence rates and a deviation result for the plug-in estimator T(π̂^N). More concretely we obtain the following:

Theorem 6.4.
Under Assumption 6.1, there is a constant
C(ν) > 0 such that

E | ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≤ C(ν) · { N^{−1/2} for d = 1; N^{−1/2} log(N + 1) for d = 2; N^{−1/(2d)} for d ≥ 3 }.

In particular we have

| T(π̂^N) − T(π) | = { O_P(N^{−1/2}) for d = 1; O_P(N^{−1/2} log(N + 1)) for d = 2; O_P(N^{−1/(2d)}) for d ≥ 3 }.

Proof.
By Theorem 4.1 we have

| ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≤ 2 AW(π, π̂^N),

so the first claim follows from Lemma 6.2, replacing C by 2C. Moreover, Theorem 4.1 also states that

| T(π) − T(π̂^N) | ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, (π̂^N)²) ] · 4 AW(π, π̂^N).

Combining this with Lemma 4.4, which states that √N ( f(ν, (π̂^N)²) − f(ν, ν) ) = O_P(1), and f(ν, ν) > 0, the second claim follows. □

In a similar fashion we can derive concentration bounds from Lemma 6.3:
Theorem 6.5.
Under Assumption 6.1, there are constants c, C > 0 such that

P[ | ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≥ C rate(N) + ε ] ≤ exp( −cNε² )

for all N ≥ 1 and all ε > 0.

Proof. Using again Theorem 4.1 we obtain the existence of two constants c, C > 0 such that

P[ | ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≥ C rate(N) + ε ] ≤ P[ AW(π, π̂^N) ≥ (C/2) rate(N) + ε/2 ] ≤ exp( −cNε² ),

replacing c by c/4. □

References

[1] Julio Backhoff, Daniel Bartl, Mathias Beiglböck, and Johannes Wiesel. Estimating processes in adapted Wasserstein distance. arXiv preprint arXiv:2002.07261, 2020.
[2] Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglböck, and Manu Eder. Adapted Wasserstein distances and stability in mathematical finance.
Financ. Stoch., to appear.
[3] Julius Blum, Jack Kiefer, and Murray Rosenblatt. Distribution free tests of independence based on the sample distribution function. Ann. Math. Stat., pages 485–498, 1961.
[4] Sky Cao and Peter J. Bickel. Correlations with tailored extremal properties. arXiv preprint arXiv:2008.10177, 2020.
[5] S. Chatterjee. A new coefficient of correlation. J. Amer. Statist. Assoc., pages 1–21, 2020.
[6] Nabarun Deb, Promit Ghosal, and Bodhisattva Sen. Measuring association on topological spaces using kernels and geometric graphs. arXiv preprint arXiv:2010.01768, 2020.
[7] Holger Dette, Karl Siburg, and Pavel Stoimenov. A copula-based non-parametric measure of regression dependence. Scand. J. Stat., 40(1):21–41, 2013.
[8] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738, 2015.
[9] Jerome Friedman and Lawrence Rafsky. Graph-theoretic measures of multivariate association and prediction. Ann. Statist., 11(2):377–391, 1983.
[10] Fabrice Gamboa, Thierry Klein, and Agnès Lagnoux. Sensitivity analysis based on Cramér–von Mises distance. SIAM/ASA J. Uncertain. Quantif., 6(2):522–548, 2018.
[11] Arthur Gretton and László Györfi. Consistent nonparametric tests of independence. J. Mach. Learn. Res., 11:1391–1423, 2010.
[12] Chenlu Ke and Xiangrong Yin. Expected conditional characteristic function-based measures for testing independence. J. Amer. Statist. Assoc., 2019.
[13] Martin Knott and Cyril S. Smith. On the optimal mapping of distributions. J. Optim. Theory Appl., 43(1):39–49, 1984.
[14] Rémi Lassalle. Causal transport plans and their Monge–Kantorovich problems. Stoch. Anal. Appl., 36(3):452–484, 2018.
[15] Edward H. Linfoot. An informational measure of correlation. Inf. Control, 1(1):85–89, 1957.
[16] Russell Lyons. Distance covariance in metric spaces. Ann. Probab., 41(5):3284–3305, 2013.
[17] Georg Pflug. Version-independence and nested distributions in multistage stochastic optimization. SIAM J. Optim., 20(3):1406–1420, 2009.
[18] Georg Pflug and Alois Pichler. A distance for multistage stochastic optimization models. SIAM J. Optim., 22(1):1–23, 2012.
[19] Alfréd Rényi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10(3-4):441–451, 1959.
[20] Murray Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Ann. Statist., pages 1–14, 1975.
[21] Marco Scarsini. On measures of concordance. Stochastica, 8(3):201–218, 1984.
[22] Berthold Schweizer and Edward F. Wolff. On nonparametric measures of dependence for random variables. Ann. Statist., 9(4):879–885, 1981.
[23] Hongjian Shi, Mathias Drton, and Fang Han. On the power of Chatterjee's rank correlation. arXiv preprint arXiv:2008.11619, 2020.
[24] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6):2769–2794, 2007.
[25] C. Villani. Optimal Transport: Old and New, volume 338. Springer, 2008.
[26] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
[27] Kai Zhang. BET on independence. J. Amer. Statist. Assoc., 114(528):1620–1637, 2019.
Appendix A. Remaining proofs
Proof of Theorem 2.1. (i) Clearly T(π) ≥
0. Furthermore, replacing the W(π_{x¹}, ν)-optimal coupling in the numerator of T(π) by the product coupling π_{x¹} ⊗ ν we obtain the upper bound

∫ W(π_{x¹}, ν) µ(dx¹) ≤ ∫∫ d(y, z) π_{x¹}(dy) ν(dz) µ(dx¹) = ∫ d(y, z) ν(dy) ν(dz).   (5)

(ii) If T(π) = 0 then W(π_{x¹}, ν) = 0 µ-a.s. and thus π_{x¹} = ν µ-a.s. by positive definiteness of the Wasserstein distance. In particular

π(A × B) = ∫_A π_{x¹}(B) µ(dx¹) = ∫_A ν(B) µ(dx¹) = (µ ⊗ ν)(A × B)   (6)

for any Borel subsets A, B ⊆ X and thus π = µ ⊗ ν. On the other hand, if π = µ ⊗ ν, then using again (6) we conclude that π_{x¹} = ν µ-a.s. by µ-a.s. uniqueness of disintegrations. Thus W(π_{x¹}, ν) = 0 µ-a.s., which in turn implies T(π) = 0. This shows the claim.

(iii) Note that cyclical monotonicity of optimal transport for the cost function c(x, y) = d(x, y) (see e.g. [25, Def. 5.1]) implies that inequality (5) is strict unless π_{x¹} = δ_{f(x¹)} for some function f : X → X: indeed, consider the product coupling π_{x¹} ⊗ ν and define the set

A := { x¹ ∈ X : ∃ y², ỹ² ∈ supp(π_{x¹}), y² ≠ ỹ² }.
Let us assume towards a contradiction that µ ( A ) >
0. By the definition of the disintegration x¹ ↦ π_{x¹} and tightness of probability measures we then obtain

µ( { x¹ ∈ X : ∃ y², ỹ² ∈ supp(π_{x¹}) ∩ supp(ν), y² ≠ ỹ² } ) > 0.

Next, by the definition of the product coupling π_{x¹} ⊗ ν we have that

µ( { x¹ ∈ X : ∃ (y², ỹ²), (ỹ², y²) ∈ supp(π_{x¹} ⊗ ν), y² ≠ ỹ² } ) > 0.

Now we note that

d(y², ỹ²) + d(ỹ², y²) > d(y², y²) + d(ỹ², ỹ²) = 0,

so that

µ( { x¹ ∈ X : supp(π_{x¹} ⊗ ν) is not cyclically monotone } ) > 0,

a contradiction. On the other hand, for π_{x¹} = δ_{f(x¹)} we have

∫ W(π_{x¹}, ν) µ(dx¹) = ∫∫ d(f(x¹), x²) µ(dx¹) ν(dx²) = ∫ d(y, z) ν(dy) ν(dz).

This concludes the proof. □
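The dichotomy in (ii) and (iii) can be checked by hand on a two-point space. The following sketch (all helper names are our own) computes T exactly for a finitely supported π on the real line:

```python
import numpy as np

def w1_discrete(p, q, support):
    # exact W1 between two distributions on a common real support,
    # via the integrated absolute difference of their CDFs
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(support)))

def T_discrete(joint, support):
    # joint: (k, k) probability table of pi on support x support
    mu, nu = joint.sum(axis=1), joint.sum(axis=0)
    num = sum(mu[i] * w1_discrete(joint[i] / mu[i], nu, support)
              for i in range(len(support)) if mu[i] > 0)
    den = sum(nu[i] * nu[j] * abs(support[i] - support[j])
              for i in range(len(support)) for j in range(len(support)))
    return num / den

support = np.array([0.0, 1.0])
prod = np.array([[0.25, 0.25], [0.25, 0.25]])   # independent fair coins: pi = mu ⊗ nu
diag = np.array([[0.5, 0.0], [0.0, 0.5]])       # x2 = x1 almost surely
print(T_discrete(prod, support), T_discrete(diag, support))  # → 0.0 1.0
```

The product coupling gives T = 0 and the deterministic coupling gives T = 1, exactly as Theorem 2.1 predicts.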
Proof of Theorem 4.1.
Fix δ > 0 and take γ ∈ Π(µ, µ̃) such that

∫ ( d(x¹, y¹) + W(π_{x¹}, π̃_{y¹}) ) γ(dx¹, dy¹) ≤ AW(π, π̃) + δ.   (7)

A repeated application of the triangle inequality now yields

| ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) |
 = | ∫ W(π_{x¹}, ν) γ(dx¹, dy¹) − ∫ W(π̃_{y¹}, ν̃) γ(dx¹, dy¹) |
 ≤ ∫ | W(π_{x¹}, ν) − W(π̃_{y¹}, ν̃) | γ(dx¹, dy¹)
 ≤ ∫ [ | W(π_{x¹}, ν) − W(π̃_{y¹}, ν) | + | W(π̃_{y¹}, ν) − W(π̃_{y¹}, ν̃) | ] γ(dx¹, dy¹)
 ≤ ∫ [ W(π_{x¹}, π̃_{y¹}) + W(ν, ν̃) ] γ(dx¹, dy¹)
 ≤ AW(π, π̃) + δ + W(ν, ν̃),

where the last inequality follows from the particular choice of γ in (7). As δ > 0 was arbitrary and

W(ν, ν̃) ≤ W(π, π̃) ≤ AW(π, π̃),   (8)

we conclude that

| ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) | ≤ AW(π, π̃) + W(ν, ν̃) ≤ 2 AW(π, π̃),

which shows the first claim. The second claim now follows by writing

| T(π) − T(π̃) | = (1/f(ν, ν̃)) | ∫ W(π_{x¹}, ν) µ(dx¹) · ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) · ∫ d(y, z) ν(dy) ν(dz) |
 ≤ (1/f(ν, ν̃)) [ | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z) ν(dy) ν(dz) | · ∫ W(π_{x¹}, ν) µ(dx¹)
   + | ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) | · ∫ d(y, z) ν(dy) ν(dz) ]
 ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] [ g(ν, ν̃) + AW(π, π̃) + W(ν, ν̃) ]
 ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] [ g(ν, ν̃) + 2 AW(π, π̃) ]

for any x¹ ∈ X.
Now let γ ∈ Π(ν̃, ν) be a W-optimal coupling between ν̃ and ν. Using again the triangle inequality we then conclude that

g(ν, ν̃) = | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z) ν(dy) ν(dz) |
 ≤ | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z̃) ν̃(dy) ν(dz̃) | + | ∫ d(ỹ, z) ν̃(dỹ) ν(dz) − ∫ d(y, z) ν(dy) ν(dz) |
 ≤ ∫ | d(y, z) − d(y, z̃) | ν̃(dy) γ(dz, dz̃) + ∫ | d(ỹ, z) − d(y, z) | ν(dz) γ(dỹ, dy)
 ≤ ∫ d(z, z̃) ν̃(dy) γ(dz, dz̃) + ∫ d(ỹ, y) ν(dz) γ(dỹ, dy)
 = 2 W(ν, ν̃) ≤ 2 AW(π, π̃).

This concludes the proof. □
Proof of Lemma 4.3.
The proof follows from the same arguments as in [1, Proof of Theorem 1.3] with a few minor changes. We first remark that it is enough to show the claim for π with continuous disintegration x¹ ↦ π_{x¹}. Indeed, the general case then follows exactly as in [1, Proof of Theorem 1.3]. We now note that [1, Proof of Lemma 3.4] states explicitly that

E[ W(µ, µ̂^N) ] ≤ C R(N),

where the function R is defined as

R : [0, +∞) → [0, +∞], R(u) := { u^{−1/2} if d = 1; u^{−1/2} log(u + 3) if d = 2; u^{−1/d} if d ≥ 3 }.

Furthermore, [1, Proof of Lemma 3.4] also states that

E[ Σ_{G ∈ Φ_N} µ̂^N(G) W(µ_G, µ̂^N_G) | G_N ] ≤ R( N / |Φ_N| ),

so that we can conclude as in [1, Proof of Lemma 5.3] that

AW(µ, µ̂^N) ≤ δ + C(δ) ( ∆_N + R( N / |Φ_N| ) )

for all N ∈ ℕ large enough, where

∆_N := Σ_{G ∈ Φ_N} ∆^N_G, ∆^N_G := µ̂^N(G) ( W(µ_G, µ̂^N_G) − E[ W(µ_G, µ̂^N_G) | G_N ] ).

We can now follow the arguments in [1, Proof of Theorem 5.3], noting that

lim_{N→∞} R( N / |Φ_N| ) = 0

as lim_{N→∞} |Φ_N|/N = 0 by assumption. This concludes the proof. □

Proof of Theorem 5.1.
We first bound the Wasserstein distance $\mathcal{W}_1(\hat\pi^N_G, \hat\pi^N)$ from above by quantities whose distributions are easier to control. This goes back to a classical argument, see e.g. [8, Lemma 5], or also [26, Appendix A] for a detailed discussion. In our specific case we use the fact that both $\hat\pi^N_G$ and $\hat\pi^N$ are finitely supported on $\varphi^N([0,1]^d)$. Together with the observation that $\operatorname{diam}([0,1]^d) = \sqrt{d}$ we can thus bound the Wasserstein distance in (3) from above as follows (writing $\{n : A\}$ as shorthand for $\{n \in \{1, \dots, N\} \text{ s.t. } A\}$ and $X_n = (X^1_n, X^2_n)$):
\begin{align*}
\mathcal{W}_1(\hat\pi^N_G, \hat\pi^N) &\le \sqrt{d} \sum_{H \in \Phi^N} \big| \hat\pi^N_G(H) - \hat\pi^N(H) \big| \\
&\le \sqrt{d} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{|\{n : X^1_n \in G\}|} - \frac{|\{n : X^2_n \in H\}|}{N} \right|,
\end{align*}
so that
\begin{align*}
\sum_{G \in \Phi^N} \frac{|\{n : X^1_n \in G\}|}{N}\, \mathcal{W}_1(\hat\pi^N_G, \hat\pi^N) &\le \sqrt{d} \sum_{G \in \Phi^N} \frac{|\{n : X^1_n \in G\}|}{N} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{|\{n : X^1_n \in G\}|} - \frac{|\{n : X^2_n \in H\}|}{N} \right| \\
&= \sqrt{d} \sum_{G \in \Phi^N} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{N} - \frac{|\{n : X^1_n \in G\}|}{N} \cdot \frac{|\{n : X^2_n \in H\}|}{N} \right| =: \sqrt{d}\, \tilde T^N(\pi).
\end{align*}
Up to the constant $\sqrt{d}$, the term $\tilde T^N(\pi)$ is a classical non-parametric estimator for independence of $\mu$ and $\nu$, see e.g. [11]. More precisely, [11, Theorem 1] states that under the assumption $\pi = \mu \otimes \nu$ one has
$$\mathbb{P}\big(\tilde T^N(\pi) \ge \varepsilon\big) \le 2^{|\Phi^N| + 1} \exp\big( -N \varepsilon^2 \big) = \exp\Big( N \Big( \frac{\log(2)\, (|\Phi^N| + 1)}{N} - \varepsilon^2 \Big) \Big) \tag{9}$$
for any $\varepsilon > 0$. We thus conclude that
$$\mathbb{P}\Big( \int \mathcal{W}_1\big(\hat\pi^N_x, \hat\pi^N\big)\, \hat\pi^N(dx) \ge \varepsilon \Big) \le \mathbb{P}\big( \sqrt{d}\, \tilde T^N(\pi) \ge \varepsilon \big) \le \exp\Big( N \Big( \frac{\log(2)\, (|\Phi^N| + 1)}{N} - \frac{\varepsilon^2}{d} \Big) \Big).$$
This shows the first claim. In particular, choosing $\varepsilon = 2\sqrt{d \log(2)\, (|\Phi^N| + 1)} / \sqrt{N}$ in (9) yields
$$\mathbb{P}\Big( \tilde T^N(\pi) \ge 2\sqrt{d \log(2)}\, \frac{\sqrt{|\Phi^N| + 1}}{\sqrt{N}} \Big) \le \exp\big( -|\Phi^N| \big),$$
which is summable by assumption. Thus a Borel–Cantelli argument implies that
$$\tilde T^N(\pi) \le 2\sqrt{d \log(2)}\, \frac{\sqrt{|\Phi^N| + 1}}{\sqrt{N}}$$
almost surely for all sufficiently large $N \in \mathbb{N}$. Lastly, noting that by the law of large numbers
$$\lim_{N \to \infty} \int |y - z|\, \hat\pi^N(dy)\, \hat\pi^N(dz) = \int |y - z|\, \nu(dy)\, \nu(dz),$$
where the term on the right is positive by assumption, we can choose an appropriate constant $C(\nu) > 0$ such that
$$T(\hat\pi^N) \le C(\nu)\, \frac{\sqrt{|\Phi^N|}}{\sqrt{N}}$$
for all $N \in \mathbb{N}$ sufficiently large. On the other hand, if $\pi \ne \mu \otimes \nu$, then $\mathcal{AW}_1$-consistency of $\hat\pi^N$ implies that
$$\int \mathcal{W}_1\big(\hat\pi^N_x, \hat\pi^N\big)\, \hat\pi^N(dx)$$
does not converge to zero, so that there exists $\delta > 0$ such that $T(\hat\pi^N) \ge \delta$ for all $N \in \mathbb{N}$ sufficiently large. $\square$

Let us lastly recall the following lemma, which is used in the proof of Corollary 5.2.
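The estimator $\tilde T^N(\pi)$ — total variation between the empirical joint cell frequencies and the product of the empirical marginal frequencies — is straightforward to compute from a sample. A minimal sketch for scalar coordinates binned into equal cells of $[0,1]$ (all names are ours; this is an illustration, not the paper's implementation):

```python
import numpy as np

def t_tilde(x, y, n_bins):
    """Sketch of T~_N(pi): sum over grid cells G x H of
    |joint frequency of (G, H) - marginal frequency of G * marginal frequency of H|,
    for scalar samples x, y in [0, 1] binned into n_bins equal cells each."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    ix = np.minimum((x * n_bins).astype(int), n_bins - 1)  # cell index of X^1_n
    iy = np.minimum((y * n_bins).astype(int), n_bins - 1)  # cell index of X^2_n
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (ix, iy), 1.0 / n)   # empirical joint frequencies
    marg_x = joint.sum(axis=1)            # empirical marginal of X^1
    marg_y = joint.sum(axis=0)            # empirical marginal of X^2
    return np.abs(joint - np.outer(marg_x, marg_y)).sum()

rng = np.random.default_rng(0)
x = rng.random(20_000)
print(t_tilde(x, rng.random(20_000), 10))  # independent sample: close to 0
print(t_tilde(x, x, 10))                   # fully dependent sample: bounded away from 0
```

This illustrates the dichotomy in the proof: under independence the statistic vanishes at rate roughly $\sqrt{|\Phi^N|/N}$, while under dependence it stays bounded away from zero.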
Lemma A.1 ([11, Theorem 3]). Under the assumption that $\mu$ and $\nu$ are non-atomic and $\pi = \mu \otimes \nu$, there exists a centering sequence $C_N = C_N(\mu, \nu) \le r_\pi \sqrt{|\Phi^N|} / \sqrt{N}$ such that
$$\sqrt{N}\, \big( \tilde T^N(\pi) - C_N \big) / \sigma \Rightarrow \mathcal{N}(0, 1),$$
where $\sigma^2 = 1 - 2/\pi$ and
$$\tilde T^N(\pi) := \sum_{G \in \Phi^N} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{N} - \frac{|\{n : X^1_n \in G\}|}{N} \cdot \frac{|\{n : X^2_n \in H\}|}{N} \right|.$$

Johannes Wiesel
Columbia University, Department of Statistics
1255 Amsterdam Avenue
New York, NY 10027, USA
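As a numerical aside (ours, not from the paper): the limiting variance $\sigma^2 = 1 - 2/\pi$ in Lemma A.1 coincides with $\operatorname{Var}(|Z|)$ for $Z \sim \mathcal{N}(0,1)$, since $\mathbb{E}|Z| = \sqrt{2/\pi}$ and $\mathbb{E}[Z^2] = 1$. A quick Monte Carlo check of this numerical identity:

```python
import math
import numpy as np

# Var(|Z|) for Z ~ N(0, 1) equals 1 - 2/pi, because E|Z| = sqrt(2/pi)
# and E[Z^2] = 1. Estimate it from a large standard normal sample.
rng = np.random.default_rng(42)
z = rng.standard_normal(1_000_000)

empirical = np.abs(z).var()
exact = 1.0 - 2.0 / math.pi  # approximately 0.3634

print(empirical, exact)
```

The two printed values agree to roughly two decimal places, as expected from a sample of one million draws.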
Email address: