MEASURING ASSOCIATION WITH WASSERSTEIN DISTANCES
JOHANNES WIESEL
Abstract.
Let π ∈ Π(µ, ν) be a coupling between two probability measures µ and ν on a Polish space. In this article we propose and study a class of non-parametric measures of association between µ and ν. The analysis is based on the Wasserstein distance between ν and the disintegration π_{x¹} of π with respect to the first coordinate. We also establish basic statistical properties of this new class of measures: we develop a statistical theory for strongly consistent estimators and determine their convergence rate. Throughout our analysis we make use of the so-called adapted/causal Wasserstein distance, in particular we rely on results established in [Backhoff, Bartl, Beiglböck, Wiesel. Estimating processes in adapted Wasserstein distance. 2020]. Our class of measures offers an alternative to the correlation coefficients proposed by [Dette, Siburg and Stoimenov (2013). A copula-based non-parametric measure of regression dependence. Scandinavian Journal of Statistics 40(1), 21–41] and [Chatterjee (2020). A new coefficient of correlation. Journal of the American Statistical Association, 1–21]. In contrast to these works, our approach also applies to probability laws on general Polish spaces.

1. Introduction
Given a sample (X¹_1, X²_1), (X¹_2, X²_2), ..., (X¹_N, X²_N) generated from a measure π with marginals µ and ν on a product X × Y of topological spaces, a number of works have recently asked whether it is possible to define a simple empirical measure T_N, which provides an estimate for a non-parametric measure of association between µ and ν. More concretely, [5, Abstract] states the following desirable conditions:

"Is it possible to define a coefficient of correlation which is:
(i) as simple as the classical coefficients like Pearson's correlation or Spearman's correlation, and yet
(ii) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and
(iii) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients?"

As is argued in [5], none of the various past works based on joint cumulative distribution functions and ranks, kernel-based methods, information theoretic coefficients, coefficients based on copulas or on pairwise distances (see e.g. [19, 15, 3, 20, 22, 9, 21, 24, 16, 10, 27] and the references therein) satisfies all three properties stated

Date: February 2, 2021.
Key words and phrases.
Independence, measure of association, correlation, optimal transport, (causal) Wasserstein distance. MSC 2010 Classification: 62G10, 62H20, 60F05, 60D05. We thank Bodhi Sen for helpful discussions.

above. It turns out that the articles [7] and [5] are the first to answer this question in the affirmative for spaces X = ℝ^d and Y = ℝ^{d'}, where d = d' = 1. Since then their correlation coefficient has attracted a lot of attention, see e.g. [23, 4]. More recently [6] (see also [12] for a comparison) show how to build a corresponding estimator T_N for general d ≥
1. The analysis in [6] is restricted to estimators arising from RKHS with specific properties and thus cannot be applied to arbitrary Polish spaces X = Y. In this article we offer an alternative construction of T_N based on Wasserstein distances. Directly utilising the underlying compatible metric structure of the space X, properties (i)-(iii) are then shown to hold without further assumptions. Furthermore, by varying the metric d and the Wasserstein exponent p, one can naturally construct a whole family of different estimators. We are thus able to build an estimator directly from well-known quantities studied in the theory of optimal transportation. In fact, it will turn out that once we have defined a specific measure of association T, our estimator can be computed via the plug-in approach T_N = T(π̂^N) for the so-called adapted empirical measure π̂^N. In this article we derive consistency and convergence rates of the estimator T(π̂^N) under different assumptions.

2. Notation and main results
Let X be a Polish space with a compatible metric d and let us denote by Prob(X) the set of Borel probability measures on X. Let us take µ, ν ∈ Prob(X) and denote by Π(µ, ν) the set of couplings between µ and ν, i.e.

Π(µ, ν) = { π ∈ Prob(X × X) : π(· × X) = µ(·), π(X × ·) = ν(·) }.

The Wasserstein distance W(µ, ν) is defined via

W(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫ d(x¹, x²) π(dx¹, dx²).

The pushforward of the measure µ via a function f : X → X is denoted by f∗µ, i.e.

(f∗µ)(A) := µ({ x ∈ X : f(x) ∈ A })

for all Borel sets A ⊆ X. Generalising the above definition to Borel probability measures on X² := X × X, we often write π¹ = (x¹)∗π and π² = (x²)∗π for π ∈ Prob(X²), where (x¹, x²) ↦ x¹ and (x¹, x²) ↦ x² are the canonical projection maps from X² to the first and second coordinates respectively. We also recall that any coupling π ∈ Π(µ, ν) has a µ-a.s. unique disintegration with respect to the first coordinate, i.e. there exists a Borel measurable function x¹ ↦ π_{x¹} such that

π(A × B) = ∫_A π_{x¹}(B) µ(dx¹)

for all Borel sets A, B ⊆ X. The product coupling with marginals µ and ν is denoted by µ ⊗ ν. One of the key notions used in this article is the so-called adapted/causal Wasserstein distance. It can be introduced as follows: for Borel probability measures π, π̃ on X² we define the nested/causal/adapted Wasserstein distance AW(π, π̃) via

AW(π, π̃) = inf_{γ ∈ Π(π¹, π̃¹)} ∫ [ d(x¹, y¹) + W(π_{x¹}, π̃_{y¹}) ] γ(dx¹, dy¹).   (1)

On an intuitive level, the nested Wasserstein distance only considers those couplings γ ∈ Π(π, π̃) which respect the information flow formalised by the canonical (i.e. coordinate) filtration (F_t)_{t ∈ {1,2}}: in (1) this is achieved by first taking an infimum over couplings of π¹, π̃¹ (i.e. "couplings at time one") and then a second (nested) infimum with respect to the respective disintegrations (i.e. "conditional couplings at time two"). This feature distinguishes AW from the Wasserstein distance W, which also includes "anticipative couplings". We refer to [2, pp. 2-3] for a well-written introduction to this topic.
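For empirical measures on the real line with the same number of atoms, the distance W₁ reduces to matching sorted samples (the monotone coupling). A minimal sketch, assuming d(x, y) = |x − y|; the function name is our own:

```python
import numpy as np

def w1_empirical(xs, ys):
    # W1 between (1/n) sum_i delta_{xs[i]} and (1/n) sum_i delta_{ys[i]} on the line:
    # the optimal coupling matches order statistics
    xs, ys = np.sort(np.asarray(xs, float)), np.sort(np.asarray(ys, float))
    return np.abs(xs - ys).mean()

print(w1_empirical([0.0, 1.0], [0.5, 1.5]))  # → 0.5
```

This quantile-matching identity is specific to dimension one; in higher dimensions W₁ requires solving a transport problem.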
The nested distance was introduced in [17], [18] in the context of multistage stochastic optimisation and was independently analysed in [14]. Let us also remark here that we always have the inequality

W(π, π̃) ≤ AW(π, π̃),   (2)

where the Wasserstein distance W(π, π̃) is correspondingly defined as

W(π, π̃) = inf_{γ ∈ Π(π, π̃)} ∫ [ d(x¹, y¹) + d(x², y²) ] γ(d(x¹, x²), d(y¹, y²))

and

Π(π, π̃) = { γ ∈ Prob(X² × X²) : γ(· × X²) = π(·), γ(X² × ·) = π̃(·) }.

For the rest of this article we fix two measures µ, ν ∈ Prob(X). For any π ∈ Prob(X²) let us define the functional π ↦ T(π),

T(π) := ∫ W(π_{x¹}, ν) π¹(dx¹) / ∫ d(y, z) π²(dy) π²(dz).

If π ∈ Π(µ, ν), then in particular

T(π) = ∫ W(π_{x¹}, ν) µ(dx¹) / ∫ d(y, z) ν(dy) ν(dz),

where throughout we assume that ν is not concentrated in a single point, i.e.

∫ d(y, z) ν(dy) ν(dz) > 0.

It turns out that T defines a convenient measure of association, whose properties and estimation are discussed in the upcoming sections. In particular we show that T indeed satisfies the main requirement (ii) stated in [5, Abstract], as cited in the introduction:

Theorem 2.1.
For any π ∈ Π(µ, ν) the functional π ↦ T(π) satisfies:
(i) T(π) ∈ [0, 1].
(ii) T(π) = 0 if and only if π = µ ⊗ ν.
(iii) T(π) = 1 if and only if ν = f∗µ for some measurable function f : X → X.

A natural estimator for T is given via the following plug-in approach:

Theorem 2.2.
Let π ∈ Π(µ, ν) be such that ∫ d(x¹, x²) ν(dx²) < ∞ for any x¹ ∈ X and let π̂^N be an AW-consistent estimator of π. Then T(π̂^N) is a consistent estimator of T(π).
One such AW-consistent estimator of π has recently been constructed in [1] and throughout this article we will make use of results established there. In particular, continuity of T in AW will directly enable us to establish convergence rates. Let us also remark that our analysis can easily be extended to p-Wasserstein distances W_p for p ≥ 1, by considering

T_p(π) := ( ∫ W_p(π_{x¹}, ν)^p π¹(dx¹) )^{1/p} / ( ∫ d(x, y)^p π²(dx) π²(dy) )^{1/p}

and replacing W, AW by the (adapted) p-Wasserstein distances W_p, AW_p in all results. The restriction to p = 1 is thus only chosen for notational simplicity. This article is structured as follows: in Section 3 we derive basic properties of T and compare it to the measure of association derived in [6] as well as Pearson's correlation coefficient in the case of a bivariate Gaussian distribution π. In Section 4 we state general continuity properties of the functional π ↦ T(π) with respect to AW and give a first consistency result. Sections 5 and 6 then exhibit convergence rates for the independent case π = µ ⊗ ν and the general case respectively. We relegate longer proofs to the appendix.

3. Basic results and discussion
As explained in the introduction, the functional π ↦ T(π) is not the only one satisfying the basic properties stated in Theorem 2.1. In the following remark we compare T to the functional obtained in [6] for the specific case (X, d) = (ℝ^d, |·|):

Remark 3.1.
Let us point out here that T(π) is different from the measure of association proposed in [6]. For the case (X, d) = (ℝ^d, |·|), this measure can be written as

Ṫ(π) = 1 − ∫ |x² − y²| π_{x¹}(dx²) π_{x¹}(dy²) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
     = [ ∫ |y − z| ν(dy) ν(dz) − ∫ |x² − y²| π_{x¹}(dx²) π_{x¹}(dy²) µ(dx¹) ] / ∫ |y − z| ν(dy) ν(dz).

The denominator is obviously the same as in our definition of T(π). The numerator is different: indeed, choosing γ_{x¹} ∈ Π(ν, π_{x¹}) such that

W(ν, π_{x¹}) = ∫ |y − z| γ_{x¹}(dy, dz)

for each x¹ ∈ X, it is not hard to see that

Ṫ(π) = ∫ ( |y − z| − |ỹ − z̃| ) γ_{x¹}(dy, dỹ) γ_{x¹}(dz, dz̃) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
 ≤ ∫ |y − z − (ỹ − z̃)| γ_{x¹}(dy, dỹ) γ_{x¹}(dz, dz̃) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
 ≤ ∫ ( |y − ỹ| + |z − z̃| ) γ_{x¹}(dy, dỹ) γ_{x¹}(dz, dz̃) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz)
 = [ ∫ |y − ỹ| γ_{x¹}(dy, dỹ) µ(dx¹) + ∫ |z − z̃| γ_{x¹}(dz, dz̃) µ(dx¹) ] / ∫ |y − z| ν(dy) ν(dz)
 = 2 ∫ W(π_{x¹}, ν) µ(dx¹) / ∫ |y − z| ν(dy) ν(dz) = 2 T(π).

In conclusion, in the case (X, d) = (ℝ^d, |·|), the functional Ṫ(π) is dominated by 2T(π). By a similar reasoning, we can derive the following corollary:
Corollary 3.2.
Let (X, ‖·‖) be a normed space and let us define the measure of association derived from the norm ‖·‖ by

Ṫ(π) = 1 − ∫ ‖y − z‖ π_{x¹}(dy) π_{x¹}(dz) µ(dx¹) / ∫ ‖y − z‖ ν(dy) ν(dz).

Then we have Ṫ(π) ≤ 2T(π). In particular all upper bounds derived in this article also hold for Ṫ(π), adjusting for a factor of 2. However, the relation "Ṫ(π) = 0 if and only if π = µ ⊗ ν" might not hold, e.g. if Ṫ(π) only depends on a finite number of moments of π. Thus in general T(π) offers greater flexibility than Ṫ(π), as it can be defined for any metric d instead of just any norm ‖·‖, while it always satisfies the properties (i)-(iii) of Theorem 2.1.

Let us now compare our measure of association T to a different benchmark: recall that if π is a bivariate Gaussian distribution, then the association between µ and ν is famously quantified via Pearson's correlation coefficient. It turns out we can also compute T(π) explicitly in this case:

Example 3.3 (Comparison with Pearson's correlation coefficient in the case p = 2). Let (X, d) = (ℝ, |·|) and let π = N(a, Σ), where a = (a¹, a²) is the mean and

Σ = [ σ₁²     ρσ₁σ₂ ;
      ρσ₁σ₂   σ₂²   ]

is the covariance matrix of the bivariate normal distribution π. Here we assume σ₁, σ₂ > 0 and note that ρ ∈ [−1, 1] is Pearson's correlation coefficient. Then

T(π) = 1 − √(1 − ρ²).
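Before turning to the proof, the closed form can be sanity-checked numerically. A minimal sketch (all helper names are our own), assuming the standard identity W₂²(N(m₁, s₁²), N(m₂, s₂²)) = (m₁ − m₂)² + (s₁ − s₂)² for one-dimensional Gaussians, Monte Carlo over x¹, and the squared-distance (p = 2) normalisation used in the example:

```python
import numpy as np

def w2_sq_gauss1d(m1, s1, m2, s2):
    # squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) on the line
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def T_gaussian(rho, a=(0.0, 0.0), s=(1.0, 1.0), n=200_000, seed=0):
    # Monte Carlo over x^1 of the ratio  E[W_2^2(pi_{x^1}, nu)] / int |y - z|^2 nu nu
    rng = np.random.default_rng(seed)
    a1, a2 = a
    s1, s2 = s
    x1 = rng.normal(a1, s1, n)
    cond_mean = a2 + (s2 / s1) * rho * (x1 - a1)   # mean of the disintegration pi_{x^1}
    cond_std = np.sqrt(1.0 - rho ** 2) * s2        # std of pi_{x^1}
    num = w2_sq_gauss1d(cond_mean, cond_std, a2, s2).mean()
    den = 2.0 * s2 ** 2                            # int |y - z|^2 nu(dy) nu(dz)
    return num / den

rho = 0.8
print(T_gaussian(rho), 1.0 - np.sqrt(1.0 - rho ** 2))
```

The two printed values agree up to Monte Carlo error, matching the claim T(π) = 1 − √(1 − ρ²).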
Proof.
Note that we can immediately read off the marginal distributions µ = N(a¹, σ₁²) and ν = N(a², σ₂²), as well as

π_{x¹} = N( a² + (σ₂/σ₁) ρ (x¹ − a¹), (1 − ρ²) σ₂² ).

Furthermore, by the explicit formula for the 2-Wasserstein distance between Gaussians (see e.g. [13, Simple example]) one can compute

W₂²(π_{x¹}, ν) = ( a² + (σ₂/σ₁) ρ (x¹ − a¹) − a² )² + σ₂² + (1 − ρ²) σ₂² − 2 √( (1 − ρ²) σ₂⁴ )
             = ( (σ₂/σ₁) ρ (x¹ − a¹) )² + σ₂² + (1 − ρ²) σ₂² − 2 σ₂² √(1 − ρ²),

so that

∫ W₂²(π_{x¹}, ν) µ(dx¹) = ρ² σ₂² + σ₂² + (1 − ρ²) σ₂² − 2 σ₂² √(1 − ρ²) = 2 σ₂² ( 1 − √(1 − ρ²) ).

Lastly

∫ |y − z|² ν(dy) ν(dz) = 2 ∫ |y|² ν(dy) − 2 ( ∫ z ν(dz) )² = 2 σ₂²

and the claim follows. □

4. Estimator for T(π) and asymptotic consistency

We now investigate continuity properties of the functional π ↦ T(π), which will enable us to construct a plug-in estimator. We then check its asymptotic consistency. Let us thus first show that the functional π ↦ T(π) is continuous in the adapted Wasserstein distance AW:

Theorem 4.1.
For π ∈ Π(µ, ν) and π̃ ∈ Π(µ̃, ν̃) we have

| ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) | ≤ AW(π, π̃) + W(ν, ν̃) ≤ 2 AW(π, π̃)

and thus in particular

| T(π) − T(π̃) | ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] ( AW(π, π̃) + W(ν, ν̃) + g(ν, ν̃) )
               ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] ( AW(π, π̃) + 3 W(ν, ν̃) )
               ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] · 4 AW(π, π̃)

for any x¹ ∈ X, where

f(ν, ν̃) := ∫ d(y, z) ν(dy) ν(dz) · ∫ d(y, z) ν̃(dy) ν̃(dz),
g(ν, ν̃) := | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z) ν(dy) ν(dz) |.
We have the following immediate corollary:
Corollary 4.2.
Let π ∈ Π(µ, ν) be such that ∫ d(x¹, x²) ν(dx²) < ∞ for any x¹ ∈ X and let π̂^N be an AW-consistent estimator of π. Then T(π̂^N) is an asymptotically consistent estimator of T(π).

Proof. Theorem 4.1 yields

| T(π) − T(π̂^N) | ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, (π̂^N)²) ] · 4 AW(π, π̂^N).

By assumption we have lim_{N→∞} AW(π, π̂^N) = 0. By the proof of Theorem 4.1 in the appendix we conclude that g(ν, (π̂^N)²) ≤ 2 AW(π, π̂^N), so that

lim_{N→∞} f(ν, (π̂^N)²) = f(ν, ν), where f(ν, ν) > 0. □

We now give an explicit example of an AW-consistent estimator π̂^N, which will then naturally facilitate a plug-in estimator T(π̂^N) for T(π). For simplicity we only discuss here the case where π is a probability measure on ([0,1]^d)², where we equip [0,1]^d with the Euclidean metric |·|. Of course, our analysis can then easily be extended to probability measures on any compact subset of ℝ^d. Before we explain the details of the construction, we need to introduce some additional notation: for a subset F of ℝ^d let diam(F) := sup_{x,y ∈ F} |x − y| and for any set A, let |A| denote the number of elements in A. Lastly, for any π ∈ Prob(([0,1]^d)²) and any Borel set G ⊆ [0,1]^d we define the conditional probability

π_G(·) = (1/π¹(G)) ∫_G π_{x¹}(·) π¹(dx¹) ∈ Prob([0,1]^d),

where we make the convention that π_G := δ₀ if π¹(G) = 0. Let us assume that we are given i.i.d. samples (X¹_1, X²_1), (X¹_2, X²_2), ..., (X¹_N, X²_N) of π. Let us partition the unit cube [0,1]^d into a disjoint union of a finite number of cubes and let ϕ_N : [0,1]^d → [0,1]^d map each cube to its center. Then in particular ϕ_N has a finite range for each N ≥
1. We now set

π̂^N := (1/N) Σ_{n=1}^N δ_{(ϕ_N(X¹_n), ϕ_N(X²_n))}

for each N ≥
1. In order to fix some additional notation we can reformulate the assumptions on the function ϕ_N as follows: if we define

Φ_N := { (ϕ_N)^{−1}({x}) : x ∈ ϕ_N([0,1]^d) },

then [0,1]^d = ⋃_{G ∈ Φ_N} G is a disjoint union. One of the main results of [1] is the following:
Lemma 4.3 ([1, Theorem 1.3]). Assume that lim_{N→∞} |Φ_N|/N = 0. Then the adapted empirical measure is a strongly consistent estimator, that is,

lim_{N→∞} AW(π, π̂^N) = 0   P-almost surely.

As a preparation for the next sections we make two additional remarks here: first we note that T(π̂^N) can be written as

T(π̂^N) = Σ_{G ∈ Φ_N} ( |{n ∈ {1,...,N} s.t. X¹_n ∈ G}| / N ) W(π̂^N_G, (π̂^N)²) / ( (1/N²) Σ_{n,m=1}^N |ϕ_N(X²_n) − ϕ_N(X²_m)| ).   (3)

Second, while the estimate of |f(ν, (π̂^N)²) − f(ν, ν)| in terms of AW(π, π̂^N) is useful for the proof of Corollary 4.2, the following result provides sharper convergence rates for the adapted empirical measure π̂^N:

Lemma 4.4.
We have √N ( f(ν, (π̂^N)²) − f(ν, ν) ) = O_P(1).

Proof.
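Formula (3) can be implemented directly when d = 1, where W₁ between two measures supported on the same grid equals the integrated absolute difference of their CDFs. A minimal sketch (the grid size m and all names are our choices, not from the paper):

```python
import numpy as np

def T_hat(x, y, m):
    # plug-in estimator of T for samples (x_n, y_n) in [0,1]^2,
    # using a uniform grid of m cells per axis (the map phi_N)
    N = len(x)
    ix = np.clip((np.asarray(x) * m).astype(int), 0, m - 1)
    iy = np.clip((np.asarray(y) * m).astype(int), 0, m - 1)
    joint = np.zeros((m, m))
    np.add.at(joint, (ix, iy), 1.0)          # joint cell counts
    ny = joint.sum(axis=0)                   # counts of y_n per cell
    marg_cdf = np.cumsum(ny / N)             # CDF of the binned second marginal
    num = 0.0
    for g in range(m):                       # cells G of the first coordinate
        ng = joint[g].sum()
        if ng == 0:
            continue
        cond_cdf = np.cumsum(joint[g] / ng)  # CDF of the binned conditional
        w1 = np.abs(cond_cdf - marg_cdf).sum() / m   # grid spacing is 1/m
        num += (ng / N) * w1
    centers = (np.arange(m) + 0.5) / m
    den = (np.abs(centers[:, None] - centers[None, :])
           * (ny / N)[:, None] * (ny / N)[None, :]).sum()
    return num / den
```

For i.i.d. uniform samples with y independent of x the value is close to 0, while for y = x it equals 1 exactly, in line with Theorem 2.1.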
We note that, up to the constant factor ∫ |y − z| ν(dy) ν(dz) coming from the definition of f,

f(ν, (π̂^N)²) − f(ν, ν) is a multiple of ∫ |y − z| (π̂^N)²(dy) (π̂^N)²(dz) − ∫ |y − z| ν(dy) ν(dz) = (1/N²) Σ_{i,j=1}^N |ϕ_N(X²_i) − ϕ_N(X²_j)| − ∫ |x − y| ν(dx) ν(dy).

Using the CLT for U-statistics we conclude that √N ( f(ν, (π̂^N)²) − f(ν, ν) ) = O_P(1). □

In the following sections we discuss convergence rates of the estimator T(π̂^N), first for the independent case π = µ ⊗ ν and subsequently for the general case.

5. The case π = µ ⊗ ν

In this section we discuss convergence rates of T(π̂^N) for the case π = µ ⊗ ν. We then show how to construct a test for independence of µ and ν using the estimator T(π̂^N). As T(π) ∈ [0,
1] for all π ∈ Π(µ, ν), we cannot hope for a CLT as in [6, Theorem 4.1]. However, we can still obtain parametric convergence rates. Indeed, our core insight will be the following result:

Theorem 5.1. If π = µ ⊗ ν then we have for all ε > 0

P( ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) ≥ ε ) ≤ exp( N ( log(2) (|Φ_N| + 1)/N − ε²/d ) )

and consequently

T(π̂^N) = O_P( √(|Φ_N| + 1) / √N ).

In particular, if

lim_{N→∞} |Φ_N|/N = 0 and lim_{N→∞} |Φ_N|/log N = ∞,

then there exists C = C(ν) > 0 such that the test: reject π = µ ⊗ ν if

T(π̂^N) > C √(|Φ_N| + 1) / √N,

makes no error after a random sample size under π = µ ⊗ ν. Furthermore, if π ≠ µ ⊗ ν then the same test again makes no error after a random sample size.

We note here that the construction of π̂^N is fully explicit and no additional assumptions on the measure π are necessary, which makes the above result conceptually easy to apply. Lastly, we can construct the following simple test statistic for independence of µ and ν:

Corollary 5.2.
Under the assumptions that µ and ν are non-atomic and π = µ ⊗ ν, there exists a constant C(ν) such that the test: reject π = µ ⊗ ν if

T(π̂^N) > C(ν) ( √(2/π) |Φ_N|/√N + (σ/√N) Φ^{−1}(1 − α) ),

where Φ^{−1} denotes the quantile function of the standard normal distribution, has asymptotic significance level α.

Proof. As in the proof of Theorem 5.1, this follows from the inequality

T(π̂^N) ≤ √d T̃_N(π) / ∫ |x − y| (π̂^N)²(dx) (π̂^N)²(dy) ≤ C(ν) T̃_N(π),

which holds for all sufficiently large N ∈ ℕ. Here T̃_N is given by

T̃_N(π) := Σ_{G ∈ Φ_N} Σ_{H ∈ Φ_N} | |{n ∈ {1,...,N} s.t. X¹_n ∈ G, X²_n ∈ H}|/N − ( |{n ∈ {1,...,N} s.t. X¹_n ∈ G}|/N ) · ( |{n ∈ {1,...,N} s.t. X²_n ∈ H}|/N ) |.

We can then conclude by Lemma A.1. □

6. General convergence rates for T(π)

We now derive general rates of convergence for T(π), using results recently obtained in [1]. In particular we slightly refine the definition of ϕ_N and thus the adapted empirical measure given in Section 4 as follows: we set r = 1/3 for d = 1 and r = 1/(2d) for all d ≥
2. For all N ≥
1, let us now partition the cube [0,1]^d into the disjoint union of N^{rd} cubes with edges of length N^{−r} and let ϕ_N : [0,1]^d → [0,1]^d map each such small cube to its center. As before we then set

π̂^N := (1/N) Σ_{n=1}^N δ_{(ϕ_N(X¹_n), ϕ_N(X²_n))}

for each N ≥ 1. We impose the following assumption on π for the remainder of this section:

Assumption 6.1 (Lipschitz kernels). There is a version of the (µ-a.s. uniquely defined) disintegration such that the mapping [0,1]^d ∋ x¹ ↦ π_{x¹} ∈ Prob([0,1]^d) is Lipschitz continuous, where Prob([0,1]^d) is endowed with its usual Wasserstein distance W.

Lemma 6.2 (Average rate of AW(π, π̂^N), see [1, Theorem 1.5]). Under Assumption 6.1, there is a constant
C > 0 such that

E[ AW(π, π̂^N) ] ≤ C · { N^{−1/2} for d = 1; N^{−1/2} log(N + 1) for d = 2; N^{−1/(2d)} for d ≥ 3 } =: C · rate(N)   (4)

for all N ≥ 1. In the lemma above, the constant C depends on d and the Lipschitz constant in Assumption 6.1. Furthermore, [1] also show the following concentration inequality:

Lemma 6.3 (Deviation of AW(π, π̂^N), see [1, Theorem 1.7]). Under Assumption 6.1, there are constants c, C > 0 such that

P[ AW(π, π̂^N) ≥ C rate(N) + ε ] ≤ exp( −cNε² )

for all N ≥ 1 and all ε > 0. As above, the constants c, C depend on d and the Lipschitz constant in Assumption 6.1. The above lemmas immediately enable us to prove average convergence rates and a deviation result for the plug-in estimator T(π̂^N). More concretely we obtain the following:

Theorem 6.4.
Under Assumption 6.1, there is a constant
C(ν) > 0 such that

E | ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≤ C(ν) · { N^{−1/2} for d = 1; N^{−1/2} log(N + 1) for d = 2; N^{−1/(2d)} for d ≥ 3 }.

In particular we have

| T(π̂^N) − T(π) | = { O_P(N^{−1/2}) for d = 1; O_P(N^{−1/2} log(N + 1)) for d = 2; O_P(N^{−1/(2d)}) for d ≥ 3 }.

Proof.
By Theorem 4.1 we have

| ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≤ 2 AW(π, π̂^N),

so the first claim follows from Lemma 6.2, replacing C by 2C. Moreover, Theorem 4.1 also states that

| T(π) − T(π̂^N) | ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, (π̂^N)²) ] · 4 AW(π, π̂^N).

Combining this with Lemma 4.4, which states that √N ( f(ν, (π̂^N)²) − f(ν, ν) ) = O_P(1), and f(ν, ν) > 0, the second claim follows. □

In a similar fashion we can derive concentration bounds from Lemma 6.3:
Theorem 6.5.
Under Assumption 6.1, there are constants c, C > 0 such that

P[ | ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≥ C rate(N) + ε ] ≤ exp( −cNε² )

for all N ≥ 1 and all ε > 0.

Proof. Using again Theorem 4.1 we obtain the existence of two constants c, C > 0 such that

P[ | ∫ W(π̂^N_{x¹}, (π̂^N)²) (π̂^N)¹(dx¹) − ∫ W(π_{x¹}, ν) µ(dx¹) | ≥ C rate(N) + ε ] ≤ P[ AW(π, π̂^N) ≥ (C/2) rate(N) + ε/2 ] ≤ exp( −cNε² ),

replacing c by c/4. □

References

[1] Julio Backhoff, Daniel Bartl, Mathias Beiglböck, and Johannes Wiesel. Estimating processes in adapted Wasserstein distance. arXiv preprint arXiv:2002.07261, 2020.
[2] Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglböck, and Manu Eder. Adapted Wasserstein distances and stability in mathematical finance.
Financ. Stoch., to appear.
[3] Julius Blum, Jack Kiefer, and Murray Rosenblatt. Distribution free tests of independence based on the sample distribution function. Ann. Math. Stat., pages 485–498, 1961.
[4] Sky Cao and Peter J. Bickel. Correlations with tailored extremal properties. arXiv preprint arXiv:2008.10177, 2020.
[5] S. Chatterjee. A new coefficient of correlation. J. Amer. Statist. Assoc., pages 1–21, 2020.
[6] Nabarun Deb, Promit Ghosal, and Bodhisattva Sen. Measuring association on topological spaces using kernels and geometric graphs. arXiv preprint arXiv:2010.01768, 2020.
[7] Holger Dette, Karl Siburg, and Pavel Stoimenov. A copula-based non-parametric measure of regression dependence. Scand. J. Stat., 40(1):21–41, 2013.
[8] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738, 2015.
[9] Jerome Friedman and Lawrence Rafsky. Graph-theoretic measures of multivariate association and prediction. Ann. Statist., 11(2):377–391, 1983.
[10] Fabrice Gamboa, Thierry Klein, and Agnès Lagnoux. Sensitivity analysis based on Cramér–von Mises distance. SIAM/ASA J. Uncertain. Quantif., 6(2):522–548, 2018.
[11] Arthur Gretton and László Györfi. Consistent nonparametric tests of independence. J. Mach. Learn. Res., 11:1391–1423, 2010.
[12] Chenlu Ke and Xiangrong Yin. Expected conditional characteristic function-based measures for testing independence. J. Amer. Statist. Assoc., 2019.
[13] Martin Knott and Cyril S. Smith. On the optimal mapping of distributions. J. Optim. Theory Appl., 43(1):39–49, 1984.
[14] Rémi Lassalle. Causal transport plans and their Monge–Kantorovich problems. Stoch. Anal. Appl., 36(3):452–484, 2018.
[15] Edward H. Linfoot. An informational measure of correlation. Inf. Control, 1(1):85–89, 1957.
[16] Russell Lyons. Distance covariance in metric spaces. Ann. Probab., 41(5):3284–3305, 2013.
[17] Georg Pflug. Version-independence and nested distributions in multistage stochastic optimization. SIAM J. Optim., 20(3):1406–1420, 2009.
[18] Georg Pflug and Alois Pichler. A distance for multistage stochastic optimization models. SIAM J. Optim., 22(1):1–23, 2012.
[19] Alfréd Rényi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10(3-4):441–451, 1959.
[20] Murray Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Ann. Statist., pages 1–14, 1975.
[21] Marco Scarsini. On measures of concordance. Stochastica, 8(3):201–218, 1984.
[22] Berthold Schweizer and Edward F. Wolff. On nonparametric measures of dependence for random variables. Ann. Statist., 9(4):879–885, 1981.
[23] Hongjian Shi, Mathias Drton, and Fang Han. On the power of Chatterjee's rank correlation. arXiv preprint arXiv:2008.11619, 2020.
[24] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6):2769–2794, 2007.
[25] C. Villani. Optimal Transport: Old and New, volume 338. Springer, 2008.
[26] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
[27] Kai Zhang. BET on independence. J. Amer. Statist. Assoc., 114(528):1620–1637, 2019.
Appendix A. Remaining proofs
Proof of Theorem 2.1. (i) Clearly T(π) ≥
0. Furthermore, replacing the W(π_{x¹}, ν)-optimal coupling in the numerator of T(π) by the product coupling π_{x¹} ⊗ ν we obtain the upper bound

∫ W(π_{x¹}, ν) µ(dx¹) ≤ ∫∫ d(y, z) π_{x¹}(dy) ν(dz) µ(dx¹) = ∫ d(y, z) ν(dy) ν(dz).   (5)

(ii) If T(π) = 0 then W(π_{x¹}, ν) = 0 µ-a.s. and thus π_{x¹} = ν µ-a.s. by positive definiteness of the Wasserstein distance. In particular

π(A × B) = ∫_A π_{x¹}(B) µ(dx¹) = ∫_A ν(B) µ(dx¹) = (µ ⊗ ν)(A × B)   (6)

for any Borel subsets A, B ⊆ X and thus π = µ ⊗ ν. On the other hand, if π = µ ⊗ ν, then using again (6) we conclude that π_{x¹} = ν µ-a.s. by µ-a.s. uniqueness of disintegrations. Thus W(π_{x¹}, ν) = 0 µ-a.s., which in turn implies T(π) = 0. This shows the claim.

(iii) Note that cyclical monotonicity of optimal transport for the cost function c(x, y) = d(x, y) (see e.g. [25, Def. 5.1]) implies that inequality (5) is strict unless π_{x¹} = δ_{f(x¹)} for some function f : X → X: indeed, consider the product coupling π_{x¹} ⊗ ν and define the set

A := { x¹ ∈ X : ∃ y², ỹ² ∈ supp(π_{x¹}), y² ≠ ỹ² }.
Let us assume towards a contradiction that µ ( A ) >
0. By the definition of the disintegration x¹ ↦ π_{x¹} and tightness of probability measures we then obtain

µ( { x¹ ∈ X : ∃ y², ỹ² ∈ supp(π_{x¹}) ∩ supp(ν), y² ≠ ỹ² } ) > 0.

Next, by the definition of the product coupling π_{x¹} ⊗ ν we have that

µ( { x¹ ∈ X : ∃ (y², ỹ²), (ỹ², y²) ∈ supp(π_{x¹} ⊗ ν), y² ≠ ỹ² } ) > 0.

Now we note that

d(y², ỹ²) + d(ỹ², y²) > d(y², y²) + d(ỹ², ỹ²) = 0,

so that

µ( { x¹ ∈ X : supp(π_{x¹} ⊗ ν) is not cyclically monotone } ) > 0,

a contradiction. On the other hand, for π_{x¹} = δ_{f(x¹)} we have

∫ W(π_{x¹}, ν) µ(dx¹) = ∫∫ d(f(x¹), x²) µ(dx¹) ν(dx²) = ∫ d(y, z) ν(dy) ν(dz).

This concludes the proof. □
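The dichotomy in (ii) and (iii) can be checked by hand on a two-point space. The following sketch (all helper names are our own) computes T exactly for a finitely supported π on the real line:

```python
import numpy as np

def w1_discrete(p, q, support):
    # exact W1 between two distributions on a common real support,
    # via the integrated absolute difference of their CDFs
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(support)))

def T_discrete(joint, support):
    # joint: (k, k) probability table of pi on support x support
    mu, nu = joint.sum(axis=1), joint.sum(axis=0)
    num = sum(mu[i] * w1_discrete(joint[i] / mu[i], nu, support)
              for i in range(len(support)) if mu[i] > 0)
    den = sum(nu[i] * nu[j] * abs(support[i] - support[j])
              for i in range(len(support)) for j in range(len(support)))
    return num / den

support = np.array([0.0, 1.0])
prod = np.array([[0.25, 0.25], [0.25, 0.25]])   # independent fair coins: pi = mu ⊗ nu
diag = np.array([[0.5, 0.0], [0.0, 0.5]])       # x2 = x1 almost surely
print(T_discrete(prod, support), T_discrete(diag, support))  # → 0.0 1.0
```

The product coupling gives T = 0 and the deterministic coupling gives T = 1, exactly as Theorem 2.1 predicts.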
Proof of Theorem 4.1.
Fix δ > 0 and take γ ∈ Π(µ, µ̃) such that

∫ ( d(x¹, y¹) + W(π_{x¹}, π̃_{y¹}) ) γ(dx¹, dy¹) ≤ AW(π, π̃) + δ.   (7)

A repeated application of the triangle inequality now yields

| ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) |
 = | ∫ W(π_{x¹}, ν) γ(dx¹, dy¹) − ∫ W(π̃_{y¹}, ν̃) γ(dx¹, dy¹) |
 ≤ ∫ | W(π_{x¹}, ν) − W(π̃_{y¹}, ν̃) | γ(dx¹, dy¹)
 ≤ ∫ [ | W(π_{x¹}, ν) − W(π̃_{y¹}, ν) | + | W(π̃_{y¹}, ν) − W(π̃_{y¹}, ν̃) | ] γ(dx¹, dy¹)
 ≤ ∫ [ W(π_{x¹}, π̃_{y¹}) + W(ν, ν̃) ] γ(dx¹, dy¹)
 ≤ AW(π, π̃) + δ + W(ν, ν̃),

where the last inequality follows from the particular choice of γ in (7). As δ > 0 was arbitrary and

W(ν, ν̃) ≤ W(π, π̃) ≤ AW(π, π̃),   (8)

we conclude that

| ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) | ≤ AW(π, π̃) + W(ν, ν̃) ≤ 2 AW(π, π̃),

which shows the first claim. The second claim now follows by writing

| T(π) − T(π̃) | = (1/f(ν, ν̃)) | ∫ W(π_{x¹}, ν) µ(dx¹) · ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) · ∫ d(y, z) ν(dy) ν(dz) |
 ≤ (1/f(ν, ν̃)) [ | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z) ν(dy) ν(dz) | · ∫ W(π_{x¹}, ν) µ(dx¹)
   + | ∫ W(π_{x¹}, ν) µ(dx¹) − ∫ W(π̃_{y¹}, ν̃) µ̃(dy¹) | · ∫ d(y, z) ν(dy) ν(dz) ]
 ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] [ g(ν, ν̃) + AW(π, π̃) + W(ν, ν̃) ]
 ≤ [ ∫ d(x¹, x²) ν(dx²) / f(ν, ν̃) ] [ g(ν, ν̃) + 2 AW(π, π̃) ]

for any x¹ ∈ X.
Now let γ ∈ Π(ν̃, ν) be a W-optimal coupling between ν̃ and ν. Using again the triangle inequality we then conclude that

g(ν, ν̃) = | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z) ν(dy) ν(dz) |
 ≤ | ∫ d(y, z) ν̃(dy) ν̃(dz) − ∫ d(y, z̃) ν̃(dy) ν(dz̃) | + | ∫ d(ỹ, z) ν̃(dỹ) ν(dz) − ∫ d(y, z) ν(dy) ν(dz) |
 ≤ ∫ | d(y, z) − d(y, z̃) | ν̃(dy) γ(dz, dz̃) + ∫ | d(ỹ, z) − d(y, z) | ν(dz) γ(dỹ, dy)
 ≤ ∫ d(z, z̃) ν̃(dy) γ(dz, dz̃) + ∫ d(ỹ, y) ν(dz) γ(dỹ, dy)
 = 2 W(ν, ν̃) ≤ 2 AW(π, π̃).

This concludes the proof. □
Proof of Lemma 4.3.
The proof follows from the same arguments as in [1, Proof of Theorem 1.3] with a few minor changes. We first remark that it is enough to show the claim for π with continuous disintegration x¹ ↦ π_{x¹}. Indeed, the general case then follows exactly as in [1, Proof of Theorem 1.3]. We now note that [1, Proof of Lemma 3.4] states explicitly that

E[ W(µ, µ̂^N) ] ≤ C R(N),

where the function R is defined as

R : [0, +∞) → [0, +∞], R(u) := { u^{−1/2} if d = 1; u^{−1/2} log(u + 3) if d = 2; u^{−1/d} if d ≥ 3 }.

Furthermore, [1, Proof of Lemma 3.4] also states that

E[ Σ_{G ∈ Φ_N} µ̂^N(G) W(µ_G, µ̂^N_G) | G_N ] ≤ R( N / |Φ_N| ),

so that we can conclude as in [1, Proof of Lemma 5.3] that

AW(µ, µ̂^N) ≤ δ + C(δ) ( ∆_N + R( N / |Φ_N| ) )

for all N ∈ ℕ large enough, where

∆_N := Σ_{G ∈ Φ_N} ∆^N_G, ∆^N_G := µ̂^N(G) ( W(µ_G, µ̂^N_G) − E[ W(µ_G, µ̂^N_G) | G_N ] ).

We can now follow the arguments in [1, Proof of Theorem 5.3], noting that

lim_{N→∞} R( N / |Φ_N| ) = 0

as lim_{N→∞} |Φ_N|/N = 0 by assumption. This concludes the proof. □

Proof of Theorem 5.1.
We first bound the Wasserstein distance $\mathcal{W}_1(\hat\pi^N_G, \hat\pi^N)$ from above by quantities whose distributions are easier to control. This goes back to a classical argument, see e.g. [8, Lemma 5], or also [26, Appendix A] for a detailed discussion. In our specific case we use the fact that both $\hat\pi^N_G$ and $\hat\pi^N$ are finitely supported on $\varphi^N([0,1]^d)$. Together with the observation that $\operatorname{diam}([0,1]^d) = \sqrt{d}$ we can thus bound the Wasserstein distance in (3) from above as follows (writing $\{n : A\}$ as shorthand for $\{n \in \{1, \dots, N\} \text{ s.t. } A\}$ and $X_n = (X^1_n, X^2_n)$):
\begin{align*}
\mathcal{W}_1(\hat\pi^N_G, \hat\pi^N) &\le \sqrt{d} \sum_{H \in \Phi^N} \big| \hat\pi^N_G(H) - \hat\pi^N(H) \big| \\
&\le \sqrt{d} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{|\{n : X^1_n \in G\}|} - \frac{|\{n : X^2_n \in H\}|}{N} \right|,
\end{align*}
so that
\begin{align*}
\sum_{G \in \Phi^N} \frac{|\{n : X^1_n \in G\}|}{N}\, \mathcal{W}_1(\hat\pi^N_G, \hat\pi^N) &\le \sqrt{d} \sum_{G \in \Phi^N} \frac{|\{n : X^1_n \in G\}|}{N} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{|\{n : X^1_n \in G\}|} - \frac{|\{n : X^2_n \in H\}|}{N} \right| \\
&= \sqrt{d} \sum_{G \in \Phi^N} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{N} - \frac{|\{n : X^1_n \in G\}|}{N} \cdot \frac{|\{n : X^2_n \in H\}|}{N} \right| =: \sqrt{d}\, \tilde T^N(\pi).
\end{align*}
Up to the constant $\sqrt{d}$, the term $\tilde T^N(\pi)$ is a classical non-parametric estimator for independence of $\mu$ and $\nu$, see e.g. [11]. More precisely, [11, Theorem 1] states that under the assumption $\pi = \mu \otimes \nu$ one has
$$\mathbb{P}\big(\tilde T^N(\pi) \ge \varepsilon\big) \le 2^{|\Phi^N| + 1} \exp\big( -N \varepsilon^2 \big) = \exp\Big( N \Big( \frac{\log(2)\, (|\Phi^N| + 1)}{N} - \varepsilon^2 \Big) \Big) \tag{9}$$
for any $\varepsilon > 0$. We thus conclude that
$$\mathbb{P}\Big( \int \mathcal{W}_1\big(\hat\pi^N_x, \hat\pi^N\big)\, \hat\pi^N(dx) \ge \varepsilon \Big) \le \mathbb{P}\big( \sqrt{d}\, \tilde T^N(\pi) \ge \varepsilon \big) \le \exp\Big( N \Big( \frac{\log(2)\, (|\Phi^N| + 1)}{N} - \frac{\varepsilon^2}{d} \Big) \Big).$$
This shows the first claim. In particular, choosing $\varepsilon = 2\sqrt{d \log(2)\, (|\Phi^N| + 1)} / \sqrt{N}$ in (9) yields
$$\mathbb{P}\Big( \tilde T^N(\pi) \ge 2\sqrt{d \log(2)}\, \frac{\sqrt{|\Phi^N| + 1}}{\sqrt{N}} \Big) \le \exp\big( -|\Phi^N| \big),$$
which is summable by assumption. Thus a Borel–Cantelli argument implies that
$$\tilde T^N(\pi) \le 2\sqrt{d \log(2)}\, \frac{\sqrt{|\Phi^N| + 1}}{\sqrt{N}}$$
almost surely for all sufficiently large $N \in \mathbb{N}$. Lastly, noting that by the law of large numbers
$$\lim_{N \to \infty} \int |y - z|\, \hat\pi^N(dy)\, \hat\pi^N(dz) = \int |y - z|\, \nu(dy)\, \nu(dz),$$
where the term on the right is positive by assumption, we can choose an appropriate constant $C(\nu) > 0$ such that
$$T(\hat\pi^N) \le C(\nu)\, \frac{\sqrt{|\Phi^N|}}{\sqrt{N}}$$
for all $N \in \mathbb{N}$ sufficiently large. On the other hand, if $\pi \ne \mu \otimes \nu$, then $\mathcal{AW}_1$-consistency of $\hat\pi^N$ implies that
$$\int \mathcal{W}_1\big(\hat\pi^N_x, \hat\pi^N\big)\, \hat\pi^N(dx)$$
does not converge to zero, so that there exists $\delta > 0$ such that $T(\hat\pi^N) \ge \delta$ for all $N \in \mathbb{N}$ sufficiently large. $\square$

Let us lastly recall the following lemma, which is used in the proof of Corollary 5.2.
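The estimator $\tilde T^N(\pi)$ — total variation between the empirical joint cell frequencies and the product of the empirical marginal frequencies — is straightforward to compute from a sample. A minimal sketch for scalar coordinates binned into equal cells of $[0,1]$ (all names are ours; this is an illustration, not the paper's implementation):

```python
import numpy as np

def t_tilde(x, y, n_bins):
    """Sketch of T~_N(pi): sum over grid cells G x H of
    |joint frequency of (G, H) - marginal frequency of G * marginal frequency of H|,
    for scalar samples x, y in [0, 1] binned into n_bins equal cells each."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    ix = np.minimum((x * n_bins).astype(int), n_bins - 1)  # cell index of X^1_n
    iy = np.minimum((y * n_bins).astype(int), n_bins - 1)  # cell index of X^2_n
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (ix, iy), 1.0 / n)   # empirical joint frequencies
    marg_x = joint.sum(axis=1)            # empirical marginal of X^1
    marg_y = joint.sum(axis=0)            # empirical marginal of X^2
    return np.abs(joint - np.outer(marg_x, marg_y)).sum()

rng = np.random.default_rng(0)
x = rng.random(20_000)
print(t_tilde(x, rng.random(20_000), 10))  # independent sample: close to 0
print(t_tilde(x, x, 10))                   # fully dependent sample: bounded away from 0
```

This illustrates the dichotomy in the proof: under independence the statistic vanishes at rate roughly $\sqrt{|\Phi^N|/N}$, while under dependence it stays bounded away from zero.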
Lemma A.1 ([11, Theorem 3]). Under the assumption that $\mu$ and $\nu$ are non-atomic and $\pi = \mu \otimes \nu$, there exists a centering sequence $C_N = C_N(\mu, \nu) \le r_\pi \sqrt{|\Phi^N|} / \sqrt{N}$ such that
$$\sqrt{N}\, \big( \tilde T^N(\pi) - C_N \big) / \sigma \Rightarrow \mathcal{N}(0, 1),$$
where $\sigma^2 = 1 - 2/\pi$ and
$$\tilde T^N(\pi) := \sum_{G \in \Phi^N} \sum_{H \in \Phi^N} \left| \frac{|\{n : X^1_n \in G,\ X^2_n \in H\}|}{N} - \frac{|\{n : X^1_n \in G\}|}{N} \cdot \frac{|\{n : X^2_n \in H\}|}{N} \right|.$$

Johannes Wiesel
Columbia University, Department of Statistics
1255 Amsterdam Avenue
New York, NY 10027, USA
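As a numerical aside (ours, not from the paper): the limiting variance $\sigma^2 = 1 - 2/\pi$ in Lemma A.1 coincides with $\operatorname{Var}(|Z|)$ for $Z \sim \mathcal{N}(0,1)$, since $\mathbb{E}|Z| = \sqrt{2/\pi}$ and $\mathbb{E}[Z^2] = 1$. A quick Monte Carlo check of this numerical identity:

```python
import math
import numpy as np

# Var(|Z|) for Z ~ N(0, 1) equals 1 - 2/pi, because E|Z| = sqrt(2/pi)
# and E[Z^2] = 1. Estimate it from a large standard normal sample.
rng = np.random.default_rng(42)
z = rng.standard_normal(1_000_000)

empirical = np.abs(z).var()
exact = 1.0 - 2.0 / math.pi  # approximately 0.3634

print(empirical, exact)
```

The two printed values agree to roughly two decimal places, as expected from a sample of one million draws.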
Email address: