Central Limit Theorems for General Transportation Costs
Eustasio del Barrio, Alberto González-Sanz, Jean-Michel Loubes
Eustasio del Barrio (1)∗, Alberto González-Sanz (2), and Jean-Michel Loubes (3)†

(1)(2) IMUVA, Universidad de Valladolid, Spain
(2)(3) IMT, Université de Toulouse, France

(1) [email protected] (2) alberto.gonzalez [email protected] (3) [email protected]

February 24, 2021
Abstract
We consider the problem of optimal transportation with general cost between an empirical measure and a general target probability on $\mathbb{R}^d$, with $d \geq 1$. We extend results in [19] and prove asymptotic stability of both optimal transport maps and potentials for a large class of costs in $\mathbb{R}^d$. We derive a central limit theorem (CLT) towards a Gaussian distribution for the empirical transportation cost under minimal assumptions, with a new proof based on the Efron-Stein inequality and on the sequential compactness of the closed unit ball in $L^2(P)$ for the weak topology.

Keywords: Optimal transport, optimal matching, CLT, Efron-Stein's inequality.
1 Introduction

In the last few years new techniques based on the optimal transportation problem have become popular to handle statistical and machine learning problems over the space of probability distributions. Dealing with distributions has shed light on the need for probabilistic tools that are well adapted to the intrinsic geometry of the data, and the theory of optimal transport provides a natural framework to tackle such issues. In particular the transportation cost distance is a convenient metric in many problems encountered in data science, and the range of application fields is huge, including for instance computational statistics, biology, image analysis, economy, finance or fairness in machine learning. We refer for instance to [34], [14], [15], [12], [5], [24], [7] and references therein. Understanding the approximations made when dealing with empirical distributions, and providing better controls on the asymptotic distribution of the optimal transport cost, is of importance for further research on this subject.

In all this work, we will be concerned with probabilities on the measurable space $\mathbb{R}^d$, endowed with the Borel $\sigma$-field, denoted as $\mathcal{P}(\mathbb{R}^d)$. In this setting, the optimal transport problem is formulated as follows. Let $P, Q$ be probability measures in $\mathcal{P}(\mathbb{R}^d)$ and $c: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a function referred to as the cost. We say that a measurable map $T: \mathbb{R}^d \to \mathbb{R}^d$ is an optimal transport map from $P$ to $Q$ if it is a minimizer in the problem
$$\mathcal{T}_c(P,Q) := \inf_{T:\, T\#P = Q} \int_{\mathbb{R}^d} c(x, T(x))\, dP(x); \tag{1.1}$$

∗ Research partially supported by FEDER, Spanish Ministerio de Economía y Competitividad, grant MTM2017-86061-C2-1-P and Junta de Castilla y León, grants VA005P17 and VA002G18.

† Research partially supported by the AI Interdisciplinary Institute ANITI, which is funded by the French "Investing for the Future – PIA3" program under the Grant agreement ANR-19-PI3A-0004.
here the notation $T\#P$ represents the push-forward measure, that is, the measure such that for each measurable set $A$ we have $T\#P(A) := P(T^{-1}(A))$. This formulation of the problem is known as the Monge formulation and is closely related to the following problem, known as the Kantorovich optimal transportation problem. A probability measure $\pi \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ is said to be an optimal transport plan for the cost $c$ between $P$ and $Q$ if it is a minimizer in the problem
$$\mathcal{T}_c(P,Q) = \inf_{\pi \in \Pi(P,Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} c(x,y)\, d\pi(x,y), \tag{1.2}$$
where $\Pi(P,Q)$ is the set of probability measures $\pi \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ such that $\pi(A \times \mathbb{R}^d) = P(A)$ and $\pi(\mathbb{R}^d \times B) = Q(B)$ for all measurable sets $A, B$. We have used the same notation for the minimal value in both (1.1) and (1.2); this identification, and the existence of optimal transport maps, indeed hold for rather general costs, as shown in [23], including the potential costs $c_p(x,y) = |x-y|^p$, $p \geq 1$. We write in this case $\mathcal{T}_p(P,Q)$ for the minimal value in (1.1) or (1.2) and $W_p(P,Q) := (\mathcal{T}_p(P,Q))^{1/p}$. Note that $W_p(P,Q)$ is a distance on the subset of $\mathcal{P}(\mathbb{R}^d)$ of distributions with finite moment of order $p$, denoted as $\mathcal{P}_p(\mathbb{R}^d)$, referred to as the $p$-Wasserstein or Monge-Kantorovich distance. This distance is closely related to the weak topology of $\mathcal{P}(\mathbb{R}^d)$, in the sense that $P_n \xrightarrow{w} P$ together with $\int |x|^p dP_n(x) \to \int |x|^p dP(x)$ is equivalent to $W_p(P_n, P) \to 0$.

In statistical applications one is often interested in the empirical transportation cost, $\mathcal{T}_c(P_n, Q)$ or $\mathcal{T}_c(P_n, Q_m)$, where $P_n$ (resp. $Q_m$) denotes the empirical measure on a sample $X_1, \ldots, X_n$ of i.i.d. observations with law $P$ (resp. a sample $Y_1, \ldots, Y_m$ i.i.d. $Q$). Early work on this topic, starting with [2] (see also [37, 38, 40] and the more recent [22]), focused on the case $P = Q$ and provided rates of decay of $\mathcal{T}_p(P_n, P)$ (in fact, $\mathcal{T}_p(P_n, P) \to 0$ a.s. as soon as $P$ has a finite moment of order $p$), which turns out to depend on the dimension of the sample space. The problem becomes simpler when this dimension is one, since, in this case, there is a common representation using the quantile function for all convex costs. This was exploited in [16] and [17] for proving distributional limit theorems for $\mathcal{T}_p(P_n, P)$, $p = 1,$
2. We refer to [19] for a more detailed account of the history of the problem. The problem has received renewed interest in the last few years, both in the setup $P = Q$ (see [3, 25, 39]) and for general $P$ and $Q$ (see [36] and [41] for finitely and countably supported probabilities, [19] for the case $p = 2$ and general probabilities and dimension, and [18, 6] for dimension $d = 1$ and general costs).

In this paper we provide central limit theorems for $\mathcal{T}_c(P_n, Q)$ or $\mathcal{T}_c(P_n, Q_m)$ for general cost functions and general dimension, under minimal moment and regularity assumptions on $P$ and $Q$. Our contribution covers the strictly convex costs in [23] for which existence of the optimal transport map is guaranteed. Strict convexity of the cost appears to be a minimal requirement for a general central limit theorem with a Gaussian limiting distribution. In fact, for the non strictly convex cost $p = 1$ and in a univariate setup, [16] shows that $\{\sqrt{n}\,\mathcal{T}_1(P_n, P)\}_{n \in \mathbb{N}}$ converges to a non-Gaussian distribution under some regularity assumptions. Our moment assumptions improve upon those in [19]. The main result there is that, under mild regularity assumptions on $P$ and $Q$ (these are assumed to be absolutely continuous probabilities on $\mathbb{R}^d$ with convex supports),
$$\sqrt{n}\,\big(\mathcal{T}_2(P_n, Q) - \mathbb{E}\,\mathcal{T}_2(P_n, Q)\big) \xrightarrow{w} N(0, \sigma^2(P,Q)) \tag{1.3}$$
provided that $P$ and $Q$ have finite moments of order $4 + \delta$ for some $\delta > 0$. We could take $Q = \delta_0$ (Dirac measure at $0$) and see that in that case $\mathcal{T}_2(P_n, Q) = \frac{1}{n}\sum_{i=1}^n |X_i|^2$, and therefore that a finite moment of order $4$ is necessary (and sufficient in this case) for a CLT. One may wonder if (1.3) still holds under the weaker assumption of finite fourth moments. In fact, in dimension $d = 1$, [18] proves that for $p > 1$
$$\sqrt{n}\,\big(\mathcal{T}_p(P_n, Q) - \mathbb{E}\,\mathcal{T}_p(P_n, Q)\big) \xrightarrow{w} N(0, \sigma_p^2(P,Q)), \tag{1.4}$$
assuming only finite moments of order $2p$ and continuity of the quantile function (an equivalent formulation of this last assumption is that the support of $Q$ is an interval, see Proposition A.7 in [9]). Similar results, but with quite a few more requirements on the regularity of the probabilities, are proved in [6]. The key to proving (1.3) and (1.4) (the same approach has been used in [27] to study the asymptotic behaviour of entropically regularized Wasserstein distances) is a linearization technique based on the Efron-Stein inequality for variances, coupled with stability results for optimal transportation potentials. For continuous costs the Kantorovich problem (1.2) admits an equivalent dual form, namely,
$$\mathcal{T}_c(P,Q) = \sup_{(f,g) \in \Phi_c(P,Q)} \int f(x)\, dP(x) + \int g(y)\, dQ(y), \tag{1.5}$$
where $\Phi_c(P,Q) = \{(f,g) \in L^1(P) \times L^1(Q) : f(x) + g(y) \leq c(x,y)\}$. It is said that $\psi \in L^1(P)$ is an optimal transport potential from $P$ to $Q$ for the cost $c$ if there exists $\varphi \in L^1(Q)$ such that the pair $(\psi, \varphi)$ solves (1.5).

The present contribution generalizes the sharp results in [18] to multivariate probabilities and to much more general costs than $c_p$. Also, our results cover those in [19], improving them in the sense that here we do not require $P$ and $Q$ to have a convex support, but only a connected support with a negligible boundary.
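The Dirac example is easy to probe numerically: for the quadratic cost, $\mathcal{T}_2(P_n, \delta_0) = \frac{1}{n}\sum_{i=1}^n |X_i|^2$, so the fluctuations of the empirical cost are exactly those of a sample mean, with limiting variance $\mathrm{Var}(|X|^2)$, which is finite precisely when $P$ has a finite fourth moment. A minimal Monte Carlo sketch (the standard Gaussian choice of $P$, with $\mathrm{Var}(X^2) = 2$, is just an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 2000

# T_2(P_n, delta_0) = (1/n) * sum_i |X_i|^2 for the quadratic cost
costs = np.array([np.mean(rng.normal(size=n) ** 2) for _ in range(reps)])

# sqrt(n) * (T - E T); for P = N(0,1), E T = 1 and the CLT variance is Var(X^2) = 2
stats = np.sqrt(n) * (costs - 1.0)
print(np.var(stats))
```

The printed sample variance of the standardized statistics should be close to $2$, in line with the classical CLT for $\frac{1}{n}\sum |X_i|^2$.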
Furthermore, to avoid the technical need for stronger-than-necessary moment assumptions, in this work we describe a completely new tool to prove a central limit theorem for transportation costs. This approach can be summarized as follows. We try to show that the empirical transportation cost can be approximated by a linear term. The linearization error, say $R_n$ (see (4.7) for details), has a variance that can be bounded using the Efron-Stein inequality. The upper bound is the expected value of a random variable, say $U_n$, which converges to $0$ a.s. (this convergence follows from the stability results for optimal transport potentials), but one cannot conclude from this alone that $\mathbb{E}(U_n) \to 0$. However, $U_n$ is bounded in $L^2(P)$, and then the Banach-Alaoglu theorem yields weak convergence in $L^2(P)$ of $U_n$ along subsequences. By taking Cesàro means we can go from weak to strong convergence and, with some additional work, conclude that $\sqrt{n}(R_n - \mathbb{E}R_n) \to 0$.

A key ingredient in this program is the stability of optimal transport maps and potentials from $P_n$ to $Q_n$ when these sequences converge weakly to some probabilities $P$ and $Q$. There is a large amount of literature on these topics. Convergence of optimal maps is a topic of general interest, beyond our application to CLTs, and results on this issue have a long history, tracing back at least to [13]. To our knowledge, interest in the convergence of potentials is more recent and requires some additional guarantee on the uniqueness of the potentials. Seminal results on it can be found in Theorem 2.8 in [19] for the quadratic cost. Corollary 5.23 in [42] deals with this problem for general costs, but one of the two probabilities is supposed to be fixed. Some results are provided in Theorem 1.52 in [33] when the involved probabilities are compactly supported.

The problem of uniqueness of optimal transport potentials is linked to the smoothness of the probabilities and also to the topology of their supports. For a probability $Q$, its support $R_Q$ is the smallest closed set such that $Q(R_Q) = 1$.
Here, however, we will use the notation
$$\mathrm{supp}(Q) := \mathrm{int}(R_Q) \tag{1.6}$$
for the interior of $R_Q$. Moreover, we say that a probability $Q$ has negligible boundary if $\ell_d(R_Q \setminus \mathrm{supp}(Q)) = 0$, where $\ell_d$ denotes Lebesgue measure on $\mathbb{R}^d$. A probability with a convex support has a negligible boundary, but the condition is far from necessary. When the cost is of the form $c(x,y) = h(x-y)$ with $h$ satisfying some regularity assumptions (see (A1)-(A3) and the related discussion in Section 2), and $Q$ has a density with respect to Lebesgue measure (in the sequel, when $Q \ll \ell_d$) and a connected support with negligible boundary, then we prove (Corollary 2.7) that optimal transport potentials are unique up to an additive constant (it is easy to see that this fails if $Q$ has a disconnected support). From this uniqueness we move on to give general stability results for optimal transport potentials under only the following assumption.

Assumption 1. $Q \in \mathcal{P}(\mathbb{R}^d)$ is such that $Q \ll \ell_d$ and has connected support with negligible boundary; $Q_n, P_n, P \in \mathcal{P}(\mathbb{R}^d)$ are such that $P_n \xrightarrow{w} P$, $Q_n \xrightarrow{w} Q$, $\mathcal{T}_c(P_n, Q_n) < \infty$ and $\mathcal{T}_c(Q, P) < \infty$, for a cost $c(x,y) = h(x-y)$ with $h$ differentiable and satisfying (A1)-(A3).

If $\psi_n$ (resp. $\psi$) are the $c$-optimal transport potentials from $Q_n$ to $P_n$ (resp. from $Q$ to $P$), then we prove in Theorem 3.4 that:

(a) There exist constants $a_n \in \mathbb{R}$ such that $\tilde\psi_n := \psi_n - a_n \to \psi$ in the sense of uniform convergence on compact sets.

(b) For each compact $K \subset \mathrm{supp}(Q) \cap \mathrm{dom}(\nabla\psi)$,
$$\sup_{x \in K}\ \sup_{y_n \in \partial_c\psi_n(x)} |y_n - \nabla_c\psi(x)| \longrightarrow 0,$$
where $\mathrm{dom}(\nabla\psi)$ denotes the set of points where $\psi$ is differentiable.

The second main contribution of this paper is to provide a general result on the nature of the fluctuation of the empirical transportation cost around its expected value. More precisely, we consider the asymptotic behaviour of $\{\sqrt{n}\,(\mathcal{T}_c(P_n, Q) - \mathbb{E}\,\mathcal{T}_c(P_n, Q))\}_{n \in \mathbb{N}}$.
Our main result (Theorem 4.5) establishes the convergence
$$\sqrt{n}\,\big(\mathcal{T}_c(P_n, Q) - \mathbb{E}\,\mathcal{T}_c(P_n, Q)\big) \xrightarrow{w} N(0, \sigma_c^2(P,Q)), \quad \text{with} \quad \sigma_c^2(P,Q) := \int \varphi^2(x)\, dP(x) - \Big(\int \varphi(x)\, dP(x)\Big)^2,$$
where $\varphi$ is an optimal transport potential for the cost $c$ from $P$ to $Q$. This CLT holds assuming only that $c(x,y) = h(x-y)$ with $h$ differentiable and satisfying (A1)-(A3), and that $P, Q \in \mathcal{P}(\mathbb{R}^d)$ satisfy

Assumption 2. $P \ll \ell_d$ and $Q \ll \ell_d$ have connected supports with negligible boundary; moreover,
$$\int h(2x)\, dP(x) < \infty, \qquad \int h(-2y)\, dQ(y) < \infty,$$
and
$$\inf_{q_1, q_2 \in [1,\infty]:\ \frac{1}{q_1} + \frac{1}{q_2} = 1} \mathbb{E}|X - X'|^{q_1}\, \mathbb{E}\Big(\int_{\mathbb{R}^d} |\nabla h(X - y)|^{q_2}\, dQ(y)\Big) < \infty,$$
where $X$ has law $P$ and $X'$ is an independent copy of $X$.

We note that $\sigma_c^2(P,Q)$ is well defined, in the sense that it does not depend on the chosen potential, which is proved to be unique up to an additive constant in Corollary 2.7.

The linearization technique that we use yields CLTs for the transportation cost under minimal assumptions. We discuss this in detail in the case of potential costs in Section 4. As a minor price to pay, the approach does not yield moment convergence. We show in Theorem 4.6 that moment convergence holds under some additional moment assumptions. Finally, we derive a CLT for the empirical transportation cost in a two-sample setup and a further CLT for the empirical $p$-Wasserstein distance.

We end this Introduction with some details about our setup and notation. We assume all the involved random variables (we use this term for both $\mathbb{R}$- and $\mathbb{R}^d$-valued random elements) to be defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We write $L^2(\mathbb{P})$ for the Hilbert space of square integrable random variables on this space. $\xrightarrow{w}$ denotes weak convergence of probability measures, while we write $\rightharpoonup$ for weak convergence (in the usual sense in Functional Analysis) in the space $L^2(\mathbb{P})$.
At some points we write $A \subset\subset B$ to mean that there is some compact set $K$ such that $A \subset K \subset B$.

2 Preliminary results on optimal transport maps and potentials
This section presents some results related to optimal transport potentials and maps for general costs. The main reference on the topic is [23]. We give two main results, which are necessary tools for the study of stability in Section 3: we prove uniqueness, up to an additive constant, of the optimal transport potential (Corollary 2.7) and a weak continuity result for a version of the optimal transport maps (Lemma 2.10).

We consider the optimal transport problem formulated in its dual form (1.5). Convexity plays a key role in the optimal transportation problem with quadratic cost. This idea can be adapted to general costs through the notion of $c$-concavity. Recall that $f: \mathbb{R}^d \to \mathbb{R} \cup \{-\infty\}$ is said to be $c$-concave if there exists a set $\mathcal{T} \subset \mathbb{R}^d \times \mathbb{R}$ such that
$$f(x) = \inf_{(y,t) \in \mathcal{T}} \{c(x,y) - t\}. \tag{2.1}$$
For a function $f: \mathbb{R}^d \to \mathbb{R} \cup \{-\infty\}$ the $c$-conjugate of $f$ (see [23]) is defined as
$$f^c(y) = \inf_{x \in \mathbb{R}^d} \{c(x,y) - f(x)\} \quad \text{for all } y \in \mathbb{R}^d. \tag{2.2}$$
$c$-conjugation can be seen as a generalization of the Legendre transform in convex analysis, see [29]. Obviously, $f^c$ is $c$-concave and it is easy to check that its own $c$-conjugate, $f^{cc}$, satisfies $f^{cc} \geq f$, with equality if $f$ is $c$-concave. This means that we can restrict the collection of pairs $(f,g)$ in (1.5) to pairs $(f, f^c)$, with $f$ $c$-concave, without changing the optimal value.

For a $c$-concave function $f: \mathbb{R}^d \to \mathbb{R} \cup \{-\infty\}$ the $c$-superdifferential of $f$, $\partial_c f$, is the set of pairs $(x,y) \in \mathbb{R}^d \times \mathbb{R}^d$ such that
$$f(z) \leq f(x) + [c(z,y) - c(x,y)] \quad \text{for all } z \in \mathbb{R}^d$$
(see, e.g., Definition 1.1 in [23]). We write $\partial_c f(x)$ for the set of $y$ such that $(x,y) \in \partial_c f$ and, more generally, $\partial_c f(U) = \cup_{x \in U}\, \partial_c f(x)$ for $U \subset \mathbb{R}^d$. Under mild assumptions (implied by (A1)-(A3) below; see Propositions C.3 and C.4 in [23]) $\partial_c f(x)$ is nonempty if $f$ is finite in a neighborhood of $x$. When $\partial_c f(x)$ is a singleton we denote this point as $\nabla_c f(x)$.
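On a finite grid the $c$-transform (2.2) becomes a minimum over grid points, and the inequality $f^{cc} \geq f$ (with equality exactly for $c$-concave $f$) can be checked directly. A small sketch with the quadratic cost on a shared one-dimensional grid (the grid and the random choice of $f$ are arbitrary illustration choices):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 50)               # grid supporting both arguments
c = (x[:, None] - x[None, :]) ** 2           # c(x_i, y_j) = |x_i - y_j|^2 (symmetric)

def ctransform(f):
    # discrete version of (2.2): f^c(y) = min_x { c(x, y) - f(x) }
    return (c - f[:, None]).min(axis=0)

rng = np.random.default_rng(3)
f = rng.normal(size=x.size)                  # a generic (not c-concave) function
fc = ctransform(f)
fcc = ctransform(fc)

print(np.all(fcc >= f - 1e-12))              # f^{cc} >= f always holds
print(np.allclose(ctransform(fcc), fc))      # f^{cc} is c-concave: transforms stabilize
```

The second check reflects the identity $f^{ccc} = f^c$: one application of the transform already lands in the class of $c$-concave functions.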
It is easy to see, for a $c$-concave function $f$, that $f(x) + f^c(y) \leq c(x,y)$, with equality if and only if $y \in \partial_c f(x)$. As a consequence of these key observations, $\pi \in \Pi(P,Q)$ is an optimal transport plan (a minimizer in (1.2)) and the $c$-concave function $f$ is an optimal transport potential ($(f, f^c)$ is a maximizer in (1.5)) if and only if $\pi$ is concentrated on the set $\partial_c f$. This yields a characterization of optimal transport plans, provided a maximizer in (1.5) exists. In that case we can get an equivalent description of optimal transport plans in terms of cyclical monotonicity (see [35, 32]). A set $\Gamma \subset \mathbb{R}^d \times \mathbb{R}^d$ is said to be $c$-cyclically monotone if for all $n \in \mathbb{N}$ and $\{(x_k, y_k)\}_{k=1}^n \subset \Gamma$,
$$\sum_{k=1}^n c(x_k, y_k) \leq \sum_{k=1}^n c(x_{\sigma(k)}, y_k), \tag{2.3}$$
for every permutation $\sigma$ of $\{1, \ldots, n\}$. Optimal transport plans are supported in $c$-cyclically monotone sets (see Theorem 2.4 below). In the convex case (which corresponds to the quadratic cost $c(x,y) = |x-y|^2$) cyclically monotone sets are those contained in the subdifferential of a convex function, and the subdifferential of a convex function is maximal cyclically monotone (this is known as Rockafellar's Theorem, see for instance [28]). For general costs a similar result holds. We quote it for convenience in the next lemma. A proof can be found in [31] (Lemma 2.1). Note that Lemma 2.1 is weaker than Rockafellar's Theorem for convex functions, since it does not claim that the set $\partial_c f$ is maximal.

Lemma 2.1. If $c \geq 0$ is a continuous cost, then a set $\Gamma \subset \mathbb{R}^d \times \mathbb{R}^d$ is $c$-cyclically monotone if and only if there exists a $c$-concave function $f$ such that $\Gamma \subset \partial_c f$.
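Condition (2.3) can be verified by brute force for small sets, since it quantifies over permutations. A sketch for the quadratic cost in dimension one, where the monotone (sorted) pairing is $c$-cyclically monotone while a crossed pairing is not (the specific point values are arbitrary illustration choices):

```python
from itertools import permutations

def is_c_cyclically_monotone(pairs, c):
    # brute-force check of (2.3) over all permutations (feasible for small sets)
    n = len(pairs)
    base = sum(c(x, y) for x, y in pairs)
    for sigma in permutations(range(n)):
        # reassign x_{sigma(k)} to y_k and compare total costs
        if sum(c(pairs[sigma[k]][0], pairs[k][1]) for k in range(n)) < base - 1e-12:
            return False
    return True

c2 = lambda x, y: (x - y) ** 2
monotone = [(0.0, 0.1), (1.0, 1.2), (2.0, 2.5)]   # sorted pairing
crossed  = [(0.0, 2.5), (1.0, 1.2), (2.0, 0.1)]   # swaps the extreme targets

print(is_c_cyclically_monotone(monotone, c2))     # True
print(is_c_cyclically_monotone(crossed, c2))      # False
```

The crossed pairing fails because un-crossing the extreme points strictly lowers the total quadratic cost, which is exactly the violation of (2.3).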
In the sequel we consider costs of the form $c(x,y) = h(x-y)$, where $h: \mathbb{R}^d \to [0, \infty)$ is a nonnegative function satisfying:

(A1) $h$ is strictly convex on $\mathbb{R}^d$;

(A2) given a height $r \in \mathbb{R}_+$ and an angle $\theta \in (0, \pi)$, there exists some $M := M(r, \theta) > 0$ such that, for all $|p| > M$, one can find a cone
$$K(r, \theta, z, p) := \big\{ x \in \mathbb{R}^d : |x - p||z| \cos(\theta/2) \leq \langle z, x - p \rangle \leq r|z| \big\},$$
with vertex at $p$, on which $h$ attains its maximum at $p$;

(A3) $\lim_{|x| \to \infty} \frac{h(x)}{|x|} = \infty$.

Remark 2.2.
The potential cost $c_p(x,y) := |x-y|^p$ satisfies conditions (A1)-(A3) for $p > 1$, see [23].

In the case of a quadratic cost, the crucial step to turn the characterization of optimal transport plans into a characterization of optimal transport maps relies on the fact that convex functions are locally Lipschitz; hence, by Rademacher's Theorem (see, e.g., Theorem 9.60 in [30]), they are differentiable at almost every point in the interior of their domain. For general costs convexity does not hold, but the Lipschitz property remains with great generality. In fact, if $g$ is a $c$-concave function then for every $(a,b), (x,y) \in \partial_c g$ we have
$$|g(x) - g(a)| \leq |c(x,y) - c(a,y)| + |c(x,b) - c(a,b)|. \tag{2.4}$$
When $c(x,y) = h(x-y)$ with $h$ convex and differentiable, (2.4) can be combined with the bound
$$|c(x,y) - c(a,y)| \leq |x - a| \big( |\nabla h(x-y)| + |\nabla h(a-y)| \big). \tag{2.5}$$
As a consequence we obtain that
$$|g(x) - g(a)| \leq |x - a|\, |\zeta(a, b, x, y)|, \quad \text{for all } (a,b), (x,y) \in \partial_c g,$$
where $\zeta(a,b,x,y)$ is a continuous function (we recall that a differentiable convex function is, in fact, continuously differentiable, see Corollary 25.5.1 in [29]). Elaborating on these bounds, it can be proved that under (A1)-(A3) $c$-concave functions are locally Lipschitz, hence differentiable at almost every point. For convenience we quote here a precise result (see Theorem 3.3 in [23]).

Lemma 2.3.
Let $c(x,y) = h(x-y)$ be a cost satisfying (A1)-(A3) and let $f$ be a $c$-concave function. Then there exists a convex set $K \subset \mathbb{R}^d$ with interior $\Omega$ such that:

(i) $\Omega \subset \mathrm{dom}(f) = \{x : f(x) \in \mathbb{R}\} \subset K$;

(ii) $f$ is locally Lipschitz in $\Omega$.

Now we can relate the shape of the gradient of a $c$-concave function to the shape of the $c$-superdifferential. We write $h^*$ for the convex conjugate of $h$, namely, $h^*(y) = \sup_x (\langle x, y \rangle - h(x))$. Then, if $f$ is $c$-concave (see Proposition 3.4 in [23]):

a) the relation $s(x) = x - \nabla h^*(\nabla f(x))$ defines a Borel function on the set where $f$ is differentiable, $\mathrm{dom}(\nabla f)$;

b) for all $x \in \mathrm{dom}(\nabla f)$ it holds that $\partial_c f(x) = \{\nabla_c f(x)\} = \{s(x)\}$;

c) the set $\mathrm{dom}(f) \setminus \mathrm{dom}(\nabla f)$ has Lebesgue measure zero.

Now, with all the ingredients above, a characterization of optimal transport plans and maps is given in the next result, which summarizes Theorems 1.2, 2.3 and 2.7 in [23].

Theorem 2.4.
For any cost $c(x,y) = h(x-y)$ satisfying (A1)-(A3), and Borel probability measures $P, Q$ on $\mathbb{R}^d$:

(i) There exists at least one optimal transport plan. $\gamma \in \Pi(P,Q)$ is an optimal transport plan if and only if its support, $\mathrm{Supp}(\gamma)$, is a $c$-cyclically monotone set, or, equivalently, if there exists a $c$-concave function $\psi$ such that $\mathrm{Supp}(\gamma) \subset \partial_c \psi$. In this case $\psi$ is an optimal transport potential.

(ii) If $P \ll \ell_d$, then there exists a unique optimal transport plan $\gamma := (\mathrm{id} \times T)\#P$, where $T(x) := x - \nabla h^*(\nabla \psi(x)) = \nabla_c \psi(x)$ is $P$-a.s. unique and the $c$-concave function $\psi$ is an optimal transport potential.

The approach in this work to CLTs for the empirical transportation cost relies on the stability results for optimal transport potentials that we prove in Section 3. There cannot be any result in that sense without some kind of uniqueness of this potential. Of course, a look at (1.5) shows that if $\psi$ is an optimal transport potential and $C \in \mathbb{R}$, then $\psi + C$ is also an optimal transport potential. With the next results we show that, under some minimal assumptions, the optimal transport potential is unique up to the addition of a constant.

Lemma 2.5.
Let $\Omega \subset \mathbb{R}^d$ be an open, bounded, convex set and $f: \Omega \to \mathbb{R}$ a Lipschitz function such that $\nabla f = 0$ almost everywhere in $\Omega$. Then there exists a constant $C \in \mathbb{R}$ such that $f = C$ in $\Omega$.

Proof. This is a straightforward consequence of Poincaré's inequality in convex domains (see, e.g., Theorem 3.2 in [1]).
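For completeness, the Poincaré step behind this proof can be spelled out as a one-line derivation (a sketch; $C(\Omega)$ denotes the Poincaré constant of the bounded convex domain):

```latex
% Poincaré inequality on the bounded convex domain $\Omega$, applied to
% $f - \bar f$ with $\bar f := \frac{1}{|\Omega|}\int_\Omega f$:
\| f - \bar f \|_{L^2(\Omega)} \;\le\; C(\Omega)\, \| \nabla f \|_{L^2(\Omega)} \;=\; 0,
% since $\nabla f = 0$ a.e.; hence $f = \bar f$ a.e. in $\Omega$, and, $f$ being
% Lipschitz (hence continuous), $f \equiv \bar f$ everywhere in $\Omega$.
```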
Theorem 2.6.
Assume $c(x,y) = h(x-y)$ satisfies (A1)-(A3) and $f_1, f_2$ are $c$-concave functions such that $\nabla f_1 = \nabla f_2$ almost everywhere in the open connected set $\Omega$. Then there exists a constant $C \in \mathbb{R}$ such that $f_1 = f_2 + C$ in $\Omega$.

Proof. Assume $p \in \Omega \subset \mathrm{dom}(f_1) \cap \mathrm{dom}(f_2)$. By Lemma 2.3, $f_1, f_2$ are locally Lipschitz; hence, there exists $\epsilon_p > 0$ such that $f_1, f_2$ are Lipschitz in $B(p, \epsilon_p)$. Then the function $f_1 - f_2$ satisfies the assumptions of Lemma 2.5. As a consequence, there exists $C_p \in \mathbb{R}$ such that $f_1 = f_2 + C_p$ in $B(p, \epsilon_p)$, for each $p \in \Omega$. The proof will be complete if we show that the previous constant does not depend on $p$. But this follows from the connectedness of $\Omega$, since if we set
$$\Gamma := \{ q \in \Omega : C_q = C_p \},$$
then $\Gamma$ is obviously open and, by continuity, it is also closed; hence, $\Gamma = \Omega$.

Let us assume now that $P$ and $Q$ are probabilities on $\mathbb{R}^d$ with $P$ absolutely continuous and that $\psi_1, \psi_2$ are optimal transport potentials. By Theorem 2.4 we have $\nabla h^*(\nabla \psi_1(x)) = \nabla h^*(\nabla \psi_2(x))$ $P$-a.s. If $h$ is differentiable, then $\nabla h^*(\nabla \psi_1(x)) = \nabla h^*(\nabla \psi_2(x)) = y$ implies $\nabla \psi_1(x) = \nabla \psi_2(x) = \nabla h(y)$. Hence, $P$-a.s., $\nabla \psi_1(x) = \nabla \psi_2(x)$. If, additionally, $P$ is supported in an open, connected set, we can apply Theorem 2.6 and conclude that $\psi_1 = \psi_2 + C$ on the support of $P$ for some constant $C \in \mathbb{R}$. This proves the following uniqueness result for optimal transport potentials.

Corollary 2.7. If $c(x,y) = h(x-y)$, where $h$ is differentiable and satisfies (A1)-(A3), $P \ll \ell_d$ is supported on an open, connected set $A$, and $\psi_1, \psi_2$ are optimal transport potentials from $P$ to $Q$ for the cost $c$, then there exists a constant $C \in \mathbb{R}$ such that $\psi_1(x) = \psi_2(x) + C$ for every $x \in A$.
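The map of Theorem 2.4(ii) is explicit in dimension one: for the quadratic cost $h(x) = |x|^2$ one has $h^*(y) = |y|^2/4$, so $T(x) = x - \nabla h^*(\nabla\psi(x)) = x - \psi'(x)/2$, and the classical theory identifies this with the monotone rearrangement $T = G^{-1} \circ F$, where $F$ and $G$ are the distribution functions of $P$ and $Q$. A quick numerical sanity check that this $T$ pushes $P$ forward to $Q$; the concrete choices $P = \mathrm{Exp}(1)$ and $Q = N(0,1)$ are hypothetical, for illustration only:

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(5)
x = expon.rvs(size=20000, random_state=rng)   # sample from P = Exp(1)

# monotone rearrangement T = G^{-1} o F pushes P forward to Q = N(0,1)
T = lambda s: norm.ppf(expon.cdf(s))
y = T(x)

# the push-forward sample should have approximately standard normal moments
print(y.mean(), y.var())
```

Since $F(X) \sim U(0,1)$ for continuous $F$, the transformed sample $G^{-1}(F(X))$ has law $Q$ exactly; the printed moments deviate from $(0, 1)$ only by sampling error.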
In the next section we will state and prove results related to the stability of optimal transport maps and potentials; namely, we will prove convergence, in different senses, of optimal transport potentials ($\varphi_n$) or maps ($\nabla_c \varphi_n$) from $P_n$ to $Q_n$ under the assumption that (at least) $P_n \xrightarrow{w} P$ and $Q_n \xrightarrow{w} Q$. Results of this kind have a long history, tracing back at least to [13] for the case of optimal transport maps under quadratic costs. Stability of the potentials is crucial for the Efron-Stein approach to CLTs in [19] or in Section 4 in this paper, and has only been investigated recently.

For smooth probabilities, optimal transport potentials are a.s. differentiable, and there is a simple relation between their gradients and the optimal transport maps, as noted above. Hence, it is natural to try to go from stability results for optimal maps to stability results for optimal potentials. We should note, additionally, that the points of nondifferentiability of the potentials are those points at which the superdifferentials are not singletons and that, for this reason, the better way to deal with stability of the optimal plans is to think of them as multivalued maps ($x \mapsto \partial_c \varphi_n(x) \subset \mathbb{R}^d$) or, equivalently, as subsets ($\partial_c \varphi_n = \{(x,y) \in \mathbb{R}^d \times \mathbb{R}^d : y \in \partial_c \varphi_n(x)\}$, the graph of $\partial_c \varphi_n$). The notion of convergence that fits our goals is the commonly called Painlevé-Kuratowski convergence (see [30]), which is defined as follows: for a sequence $\{\Gamma_n\}_{n \in \mathbb{N}}$ of subsets of $\mathbb{R}^m$,

• the outer limit, $\limsup_n \Gamma_n$, is the set of $x \in \mathbb{R}^m$ for which there exists a sequence $\{x_n\}$ with $x_n \in \Gamma_n$ having a subsequence which converges to $x$;

• the inner limit, $\liminf_n \Gamma_n$, is the set of $x \in \mathbb{R}^m$ for which there exists a sequence $\{x_n\}$ with $x_n \in \Gamma_n$ which converges to $x$.

When the outer and inner limit sets are equal, the sequence is said to converge in the Painlevé-Kuratowski sense and the common set is the limit.
This notion of convergence transfers easily to multivalued maps. In this case $\{T_n\}_{n \in \mathbb{N}}$, where $T_n: \mathbb{R}^d \to \mathbb{R}^d$ is multivalued, is said to converge graphically to another multivalued map $T$ if
$$\mathrm{Gph}(T_n) := \{(x,y) : y \in T_n(x)\} \to \mathrm{Gph}(T)$$
in the Painlevé-Kuratowski sense. A very convenient feature of Painlevé-Kuratowski convergence is that sequential compactness can be easily described in terms of a simple condition. To be precise, it is said that a sequence of sets $\Gamma_n \subset \mathbb{R}^d$, $n \geq 1$, does not escape to the horizon if there exist $\epsilon > 0$ and a subsequence $\{n_j\}$ such that $\Gamma_{n_j} \cap B(0, \epsilon) \neq \emptyset$ for all $j \geq 1$. For convenience we quote next a version of Theorem 4.18 in [30].
Theorem 2.8.
Let $\{\Gamma_n\}_{n \geq 1}$ be a sequence of subsets of $\mathbb{R}^m$ that does not escape to the horizon. Then there exist a subsequence $\{n_{j_k}\}$ and a nonempty subset $\Gamma \subset \mathbb{R}^m$ such that
$$\Gamma_{n_{j_k}} \longrightarrow \Gamma$$
in the sense of Painlevé-Kuratowski.

In the next lemma we show that when a sequence of $c$-cyclically monotone sets converges in the sense of Painlevé-Kuratowski, the limit set is also $c$-cyclically monotone, generalizing the result for classical convexity in [30].

Lemma 2.9.
Assume $c$ is a continuous cost function and $\{\Gamma_n\}_{n \in \mathbb{N}} \subset \mathbb{R}^d \times \mathbb{R}^d$ is a sequence of $c$-cyclically monotone sets. If $\Gamma_n \to \Gamma$ in the sense of Painlevé-Kuratowski, then $\Gamma$ is also $c$-cyclically monotone.

Proof. We consider $\{(x_k, y_k)\}_{k=1}^m \subset \Gamma$. For each pair $(x_k, y_k)$ there exists a sequence $(x_k^n, y_k^n) \in \Gamma_n$ such that $(x_k^n, y_k^n) \to (x_k, y_k)$ as $n \to \infty$. Since $\Gamma_n$ is $c$-cyclically monotone,
$$\sum_{k=1}^m c(x_k^n, y_k^n) \leq \sum_{k=1}^m c(x_{\sigma(k)}^n, y_k^n),$$
for every permutation $\sigma$ of $\{1, \ldots, m\}$. Continuity of $c$ guarantees that
$$\sum_{k=1}^m c(x_k, y_k) \leq \sum_{k=1}^m c(x_{\sigma(k)}, y_k).$$

Combining the last two results, we see that if a sequence of $c$-superdifferentials does not escape to the horizon, then there exists a subsequence converging to a set, and this set is also $c$-cyclically monotone. We finish the section with a weak continuity result for the multivalued map $\partial_c \psi$, which will be very useful in the following section.

Lemma 2.10. Assume $c(x,y) = h(x-y)$ with $h$ satisfying (A1)-(A3). Let $f$ be a $c$-concave function and $x \in \mathrm{dom}(\nabla_c f)$. Then for each sequence $x_n \to x$ and $y_n \in \partial_c f(x_n)$ we have that $y_n \to \nabla_c f(x)$. As a consequence, for each $\epsilon > 0$ there exists some $\delta > 0$ such that $\partial_c f(B(x, \delta)) \subset B(\nabla_c f(x), \epsilon)$.

Proof. Let $(x_n, y_n)$ be as in the statement. Then for every $z \in \mathbb{R}^d$ we have
$$f(z) \leq f(x_n) + [c(z, y_n) - c(x_n, y_n)]. \tag{2.6}$$
Since $f$ is differentiable at $x$, it is bounded in a neighborhood of $x$, say $U$, which we can choose to be compact. By Proposition C.4 in [23], $\partial_c f(U)$ is bounded. Hence, the sequence $y_n$ must be bounded and, taking subsequences if necessary, we can assume that it is convergent. Taking limits in (2.6) and noticing that $f$ is continuous in its domain, we get the first conclusion. To check the second claim, assume it is false.
Then we can choose some $\epsilon > 0$ such that for every $n \in \mathbb{N}$ there exist $x_n$ with $|x_n - x| \leq 1/n$ and some $y_n \in \partial_c f(x_n)$ with $|y_n - \nabla_c f(x)| > \epsilon$. To conclude, note that the sequences $\{x_n\}_{n \in \mathbb{N}}$ and $\{y_n\}_{n \in \mathbb{N}}$ lead to a contradiction with the first assertion.

3 Stability of optimal transport maps and potentials

The main goal of this section is to prove a general result (Theorem 3.4) on the stability of optimal maps and potentials for a very large class of costs, using the tools presented in Section 2. The path to this main result starts by proving stability along subsequences of the $c$-superdifferentials of optimal transport potentials (Lemma 3.1), extending a similar result in [19] for the particular setup of classical convexity. We then prove (Lemmas 3.2 and 3.3) a uniform boundedness result which, once the potentials are fixed at a convenient point (see (3.2) below), allows us to prove the anticipated stability result. For the sake of readability we present here the results and defer most of the proofs to the Appendix.

The first step in the plan above is this result on the stability of $c$-superdifferentials.

Lemma 3.1.
Let $Q \in \mathcal{P}(\mathbb{R}^d)$ be such that $Q \ll \ell_d$ and has connected support and negligible boundary. Let $Q_n, P_n, P \in \mathcal{P}(\mathbb{R}^d)$ be such that $P_n \xrightarrow{w} P$, $Q_n \xrightarrow{w} Q$, $\mathcal{T}_c(P_n, Q_n) < \infty$ for all $n \in \mathbb{N}$ and $\mathcal{T}_c(Q, P) < \infty$, for a cost $c(x,y) = h(x-y)$ with $h$ differentiable and satisfying (A1)-(A3). If $\psi_n$ (resp. $\psi$) are the optimal transport $c$-potentials from $Q_n$ to $P_n$ (resp. from $Q$ to $P$), then there exists a $c$-cyclically monotone set $\Gamma$ such that
$$\partial_c \psi_n \to \Gamma \subset \partial_c \psi \tag{3.1}$$
in the sense of Painlevé-Kuratowski along subsequences. Moreover, if $x \in \mathrm{dom}(\nabla \psi)$, then $(x, \nabla_c \psi(x)) \in \Gamma$.

In our next results we pay attention to the optimal transportation potentials, $\psi_n$, which are well defined, under the assumptions of Corollary 2.7, up to the addition of a constant. The possibility of arbitrarily choosing that constant could lead to some difficulties, which we avoid by fixing it as follows. We choose some $p \in \mathrm{dom}(\nabla \psi) \cap \mathrm{supp}(Q)$ and assume
$$\psi(p) = 0 \quad \text{and} \quad \psi_n(p) = 0 \text{ for large } n. \tag{3.2}$$
Of course, we can ensure that the potential $\psi$ vanishes at any $p$ where it is finite by taking $\tilde\psi(x) = \psi(x) - \psi(p)$. Under the assumptions of Lemma 3.1 (see the proof for further details), we must have $p \in \mathrm{dom}(\nabla \psi_n)$ for large enough $n$; hence, $p \in \mathrm{dom}(\psi_n)$ and we can choose the potentials as in (3.2).

Next, we present two technical lemmas in which assumptions (A2) and (A3) play the main roles. These results, crucial in the proof of Theorem 3.4, are proved by elaborating on the arguments in [23] used to prove that a $c$-concave function is locally Lipschitz. The geometric interpretation of these results is shown in Figure 1.

Figure 1: Geometric interpretation of Lemma 3.2 and Lemma 3.3.

Lemma 3.2 shows that for any point $p$ at which the boundedness condition fails, there is a hyperplane $H$ passing through $p$ and splitting the space into two parts such that in one of them (the grey one in Figure 1) this property holds for any other point.

Lemma 3.2.
Under the same assumptions as in Lemma 3.1, let p ∈ R^d be such that there exists a sequence {p_n}_{n∈N} ⊂ R^d with p_n → p and ψ_n(p_n) not bounded. Then there exists z ∈ R^d such that, for every bounded sequence {x_n}_{n∈N} ⊂ {x : ⟨z, x − p⟩ > 0}, the sequence ψ_n(x_n) is not bounded.

Lemma 3.2 is the key to the next technical result, which proves boundedness of both ∪_{k∈N} ψ_{n_k}(K) and ∪_{k∈N} ∂_c ψ_{n_k}(K) for compact K ⊂ supp(Q). Lemma 3.3.
Let P, Q, P_n, Q_n be probability measures satisfying the assumptions of Lemma 3.1. Assume that p ∈ supp(Q) and ψ_n(p) → 0. Then for each compact K ⊂ supp(Q) there exists a subsequence {ψ_{n_k}}_{k∈N} such that ∪_{k∈N} ψ_{n_k}(K) and ∪_{k∈N} ∂_c ψ_{n_k}(K) are bounded sets.

Now, as an application of the uniform boundedness results in Lemma 3.3, we are ready to apply the classical Arzelà-Ascoli theorem to prove the main theorem of the section.
Theorem 3.4.
Let Q ∈ P(R^d) be such that Q ≪ ℓ_d and has a connected support with negligible boundary. Assume Q_n, P_n, P ∈ P(R^d) are such that P_n w→ P, Q_n w→ Q, and T_c(P_n, Q_n) < ∞ and T_c(Q, P) < ∞, for a cost c(x, y) = h(x − y) with h differentiable and satisfying (A1)-(A3). Let ψ_n (resp. ψ) be optimal transport potentials from Q_n to P_n (resp. from Q to P) for the cost c. Then:

(i) There exist constants a_n ∈ R such that ψ̃_n := ψ_n − a_n → ψ in the sense of uniform convergence on the compact sets of supp(Q).

(ii) For each compact K ⊂ supp(Q) ∩ dom(∇ψ),

sup_{x∈K} sup_{y_n ∈ ∂_c ψ_n(x)} |y_n − ∇_c ψ(x)| → 0. (3.3)

We note that Theorem 3.4 generalizes Theorem 2.8 in [19] to a more general class of costs. Moreover, it also generalizes results on the stability of optimal transport maps, such as Corollary 5.23 in [42]. It also yields an important improvement of Theorem 1.52 in [33], since we do not require a compactness assumption. Finally, we will see in the following sections that it is a useful tool to prove a central limit theorem for general Wasserstein distances.

Under stronger assumptions on the way that P_n approaches P and Q_n approaches Q it is possible to prove L² convergence of the potentials. We show this next for potential costs. We recall that the hypotheses of Corollary 3.5 are fulfilled when we have weak convergence P_n w→ P, Q_n w→ Q plus convergence of moments of order 2p,

∫ |x|^{2p} dP_n(x) → ∫ |x|^{2p} dP(x),   ∫ |y|^{2p} dQ_n(y) → ∫ |y|^{2p} dQ(y).

Corollary 3.5.
Let Q ∈ P_{2p}(R^d) be such that Q ≪ ℓ_d and has connected support with negligible boundary. Assume P_n, P ∈ P(R^d) are such that T_{2p}(P_n, P) → 0. If ψ_n (resp. ψ) are optimal transport potentials from Q to P_n (resp. from Q to P) for the cost c_p(x, y) = |x − y|^p and p > 1, then there exist constants a_n ∈ R such that ψ̃_n := ψ_n − a_n → ψ in the sense of L²(Q).

Proof. We can apply Theorem 3.4 to see that there exist constants a_n ∈ R such that ψ̃_n = ψ_n − a_n → ψ and ∇_c ψ_n → ∇_c ψ Q-a.s. We note also that the assumption implies that ∫ |∇_c ψ_n|^{2p} dQ → ∫ |∇_c ψ|^{2p} dQ and, therefore, ∫ |∇_c ψ_n − ∇_c ψ|^{2p} dQ →
0. In particular, |∇_c ψ_n|^{2p} is Q-uniformly integrable. We relabel the potentials and write ψ_n instead of ψ̃_n and assume (with no loss of generality) that ψ(x_0) = ψ_n(x_0) = 0 for some x_0 ∈ supp(Q) ∩ dom(∇ψ). To conclude, it suffices to show that ψ_n² is Q-uniformly integrable. To check this we set y_0 = ∇_c ψ(x_0), take y_n ∈ ∂_c ψ_n(x_0) and recall that, by Theorem 3.4, y_n → y_0. Now, we observe that

ψ_n(x) ≤ ψ_n(x_0) + |x − y_n|^p − |x_0 − y_n|^p ≤ |x − y_n|^p (3.4)

for every x. Similarly, ψ_n^c(y) ≤ ψ_n^c(y_n) + |y − x_0|^p − |y_n − x_0|^p ≤ |y − x_0|^p for every y. Since Q-a.s. we have ψ_n(x) + ψ_n^c(∇_c ψ_n(x)) = |x − ∇_c ψ_n(x)|^p, we conclude that

ψ_n(x) ≥ |x − ∇_c ψ_n(x)|^p − |∇_c ψ_n(x) − x_0|^p.

This last bound together with (3.4) shows that ψ_n² is Q-uniformly integrable and completes the proof.

Let P ∈ P(R^d) and for each n ∈ N let X_1, . . . , X_n denote a sample of independent random variables with distribution P. Consider also the corresponding empirical measure P_n := (1/n) Σ_{k=1}^n δ_{X_k}. We are interested in the behavior of the sequence {√n (T_p(P_n, Q) − E T_p(P_n, Q))}_{n∈N}. We will first prove tightness of this sequence from a suitable variance bound, following similar arguments to those in [19]. We recall the Efron-Stein inequality and refer to Chapter 3.1 in [10] for further details. Let (X'_1, . . . , X'_n) be an independent copy of (X_1, . . . , X_n), set Z := f(X_1, . . . , X_n) and for each i ∈ {1, . . . , n} denote Z'_i := f(X_1, . . . , X_{i−1}, X'_i, X_{i+1}, . . . , X_n). The Efron-Stein inequality states then that

Var(Z) ≤ ½ Σ_{i=1}^n E(Z − Z'_i)² = Σ_{i=1}^n E((Z − Z'_i)_+²).

If X_1, . . . , X_n are i.i.d., the inequality can be written as Var(Z) ≤ (n/2) E(Z − Z'_1)² = n E((Z − Z'_1)_+²).
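As a quick sanity check (not part of the paper), the Efron-Stein inequality can be verified numerically for a simple statistic. The sketch below, in plain Python with NumPy, uses the sample mean of n standard normals, for which both sides of the i.i.d. form Var(Z) ≤ (n/2) E(Z − Z'_1)² equal Var(X_1)/n, so the bound is attained up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 20000

# Z = f(X_1, ..., X_n) is the sample mean of n standard normals.
X = rng.normal(size=(reps, n))
Z = X.mean(axis=1)

# Z'_1 replaces X_1 by an independent copy X'_1.
X1_prime = rng.normal(size=reps)
Z1_prime = Z + (X1_prime - X[:, 0]) / n

var_Z = Z.var()
# Efron-Stein bound in the i.i.d. form: Var(Z) <= (n/2) E(Z - Z'_1)^2.
es_bound = 0.5 * n * ((Z - Z1_prime) ** 2).mean()

# For the sample mean both sides equal Var(X_1)/n = 1/n; allow Monte Carlo slack.
assert abs(var_Z - 1.0 / n) < 0.2 / n
assert var_Z <= 1.1 * es_bound
```

For the sample mean the inequality is in fact an equality, which makes it a convenient smoke test; for nonlinear statistics such as T_c(P_n, Q) only the upper bound survives.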
In this work we present a general bound for the variance of T_c(P_n, Q) assuming only that one of the two probabilities is absolutely continuous with respect to Lebesgue measure and that the cost is convex. We note that for X with law P the set of points where h(X − ·) is not differentiable is a set of Lebesgue measure 0; hence, if Q ≪ ℓ_d then it is differentiable Q-a.s. As a consequence ∇h(X − y) is well defined Q-a.s., and so is E|∇h(X − Y)|^{2q_2} in the next statement. Lemma 4.1.
Assume c(x, y) = h(x − y), with h satisfying (A1)-(A3). Let P, Q ∈ P(R^d) be such that Q ≪ ℓ_d. Assume X, X', Y are independent random variables with X ∼ P, X' ∼ P and Y ∼ Q. Then

n Var(T_c(P_n, Q)) ≤ inf_{(q_1,q_2)∈α} [ (E|X − X'|^{2q_1})^{1/q_1} (E|∇h(X − Y)|^{2q_2})^{1/q_2} ], (4.1)

where α = {(q_1, q_2) : q_i ∈ [1, ∞], 1/q_1 + 1/q_2 = 1}. We remark that assumptions (A1)-(A3) are only used in Lemma 4.1 to ensure the existence of an optimal transport map.
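The O(1/n) variance scaling behind Lemma 4.1 can be seen in the simplest possible case by taking Q = δ_0 (a degenerate choice used here purely for illustration; it is not absolutely continuous, so it sits outside the hypotheses of the lemma, but the variance is then computable exactly, as exploited in Remark 4.4 below): every point is transported to 0, T_p(P_n, δ_0) = (1/n) Σ_i |X_i|^p is a sample mean, and n Var(T_p(P_n, δ_0)) = Var(|X_1|^p). A minimal NumPy check, with P = U[0,1] and p = 2 assumed for concreteness:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, p = 100, 20000, 2.0

# X_i ~ U[0,1]; the optimal map to Q = delta_0 sends every point to 0,
# so T_p(P_n, Q) = (1/n) sum_i |X_i|^p, a plain sample mean.
X = rng.uniform(size=(reps, n))
T = (np.abs(X) ** p).mean(axis=1)

scaled_var = n * T.var()
# Exact value: Var(|X|^2) = E X^4 - (E X^2)^2 = 1/5 - 1/9 = 4/45 for U[0,1].
assert abs(scaled_var - 4.0 / 45.0) < 0.01
```

Here n·Var(T_p(P_n, Q)) is constant in n, which is exactly the kind of bound (4.1) provides for general pairs (P, Q).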
Remark 4.2.
As a consequence of Lemma 4.1, under the same assumptions, if

inf_{(q_1,q_2)∈α} [ (E|X − X'|^{2q_1})^{1/q_1} (E|∇h(X − Y)|^{2q_2})^{1/q_2} ] < ∞, (4.2)

then the sequence {√n (T_c(P_n, Q) − E T_c(P_n, Q))}_{n∈N} is tight.

We show next that we can replace assumption (4.2) with a simpler version in the case of potential costs. It should be noted that absolute continuity of Q is not needed for the following result.

Corollary 4.3. If c(x, y) = |x − y|^p and p > 1 then

n Var(T_p(P_n, Q)) ≤ (E|X − X'|^{2p})^{1/p} (p² E|X − Y|^{2p})^{(p−1)/p}.

Proof.
We assume that the right-hand side in the last bound is finite (there is nothing to prove otherwise). Since |∇h(X − y)| = p|X − y|^{p−1}, the result follows by taking q_1 = p, q_2 = p/(p−1) in (4.1) if Q ≪ ℓ_d. For general Q we can take random variables Y ∼ Q, Y_m ∼ Q_m, m ∈ N, with Q_m ≪ ℓ_d and E|Y_m − Y|^{2p} →
0. Without loss of generality we can assume that (X, X') is independent of (Y, {Y_m}_{m≥1}). For fixed n ∈ N we have that T_p(P_n, Q_m) converges to T_p(P_n, Q) a.s. as m → ∞. Also, for each m ∈ N, we have

n Var(T_p(P_n, Q_m)) ≤ (E|X − X'|^{2p})^{1/p} (p² E|X − Y_m|^{2p})^{(p−1)/p} =: A_m.

We observe that A_m → A := (E|X − X'|^{2p})^{1/p} (p² E|X − Y|^{2p})^{(p−1)/p}. Finally, Fatou's lemma enables us to conclude that

n Var(T_p(P_n, Q)) ≤ n lim inf_m Var(T_p(P_n, Q_m)) ≤ lim inf_m A_m = A.

Remark 4.4.
As in Remark 4.2, Corollary 4.3 yields the conclusion that {√n (T_p(P_n, Q) − E T_p(P_n, Q))}_{n∈N} is tight if P and Q have finite moments of order 2p. This assumption is sharp in the sense that if P is such that {√n (T_p(P_n, Q) − E T_p(P_n, Q))}_{n∈N} is tight for Q = δ_0, then P must have a finite moment of order 2p. In fact, the optimal transport map from P_n to Q is T(x) = 0; hence T_p(P_n, Q) = ∫ |x|^p dP_n(x) and

√n (T_p(P_n, Q) − E T_p(P_n, Q)) = (1/√n) Σ_{j=1}^n (|X_j|^p − E|X_1|^p). (4.3)

It is well known (see, e.g., Chapter 10 in [26]) that the random variable in (4.3) is tight if and only if E(|X_1|^{2p}) < ∞. Hence, as claimed, a finite moment of order 2p is a minimal requirement on P to guarantee that {√n (T_p(P_n, Q) − E T_p(P_n, Q))}_{n∈N} is tight for, say, every Q with bounded support.

We consider now costs c satisfying assumptions (A1)-(A3). In the following theorem we show that, with these assumptions on the cost, there exists a unique weak cluster point of the sequence {√n (T_c(P_n, Q) − E T_c(P_n, Q))}_{n∈N}, which is Gaussian. Similar work, in the particular case of the cost |·|², was done in [19], where a version of the Efron-Stein inequality is used to prove that the empirical transport cost is approximately linear. This approach has also been used for the entropic regularization of the empirical transport cost in [27]. This tool based on the Efron-Stein inequality requires some sort of uniform integrability, which can be guaranteed assuming finite moments of order 4 + δ. Following the arguments developed in Remark 4.4, the following result proves that the moment assumption can be relaxed. Theorem 4.5.
Assume c(x, y) = h(x − y) with h differentiable and satisfying (A1)-(A3). Let P, Q ∈ P(R^d) be such that P ≪ ℓ_d, Q ≪ ℓ_d, and P has connected support and negligible boundary. Assume further that

∫ h(2x)² dP(x) < ∞ and ∫ h(−2y)² dQ(y) < ∞, (4.4)

and that (4.2) holds. Then

√n (T_c(P_n, Q) − E T_c(P_n, Q)) w→ N(0, σ²_c(P, Q)), (4.5)

where

σ²_c(P, Q) := ∫ φ(x)² dP(x) − (∫ φ(x) dP(x))², (4.6)

and φ is an optimal transport potential for the cost c from P to Q.

It should be noted at this point that the optimal transport potential in Theorem 4.5 is unique, up to the addition of a constant, as a consequence of Corollary 2.7. It follows from the proof of Theorem 4.5 that φ ∈ L²(P). This implies that the limiting variance, σ²_c(P, Q), is well defined and finite.

The proof of Theorem 4.5 initially follows the path in [19]. This means that we look at

R_n := T_c(P_n, Q) − ∫ φ(x) dP_n(x), (4.7)

where φ is an optimal transport potential from P to Q for the cost c. We write R'_n for the version of R_n computed from X'_1, X_2, . . . , X_n. Using the stability results for optimal transport potentials one can prove that n(R_n − R'_n) → 0 a.s. If, in addition, n² E(R_n − R'_n)² → 0, the CLT follows easily. One can prove that n² E(R_n − R'_n)² ≤ M under mild moment assumptions. However, the convergence n² E(R_n − R'_n)² → 0 requires some uniform integrability (guaranteed by the finite moments of order 4 + δ assumption in [19]). Our proof of Theorem 4.5 avoids these stronger assumptions by using the following workaround. First, the bound n² E(R_n − R'_n)² ≤ M and the Banach-Alaoglu theorem (see, e.g., Theorem 3.16 in [11]) show that, along subsequences, n(R_n − R'_n) converges weakly to 0 in the Hilbert (hence reflexive) space L²(P). Then, the Banach-Saks property of Hilbert spaces (see, e.g., Exercise 5.34 in [11]) shows that (taking further subsequences if necessary) there exists a Cesàro mean of {n|R_n − R'_n|}_{n∈N} convergent to 0 in L²(P) in the strong sense. We show then that the same holds for the Cesàro means of the sequence √n(R_n − ER_n), and from this we conclude that √n(R_n − ER_n) → 0 in probability. Theorem 4.6.
Assume c(x, y) = h(x − y) with h differentiable and satisfying (A1)-(A3). Let P, Q ∈ P(R^d) be such that P ≪ ℓ_d, Q ≪ ℓ_d and P has connected support and negligible boundary. Suppose that (4.4) holds and assume R_n is as in (4.7). Assume further that X, X' and Y are independent random variables with X ∼ P, X' ∼ P and Y ∼ Q. If there exists some δ > 0 such that

inf_{q_1,q_2 ∈ [1,∞]: 1/q_1 + 1/q_2 = 1} [ E|X − X'|^{(2+δ)q_1} E|∇h(X − Y)|^{(2+δ)q_2} ] < ∞, (4.8)

then n Var(R_n) → 0. As a consequence,

n Var(T_c(P_n, Q)) → σ²_c(P, Q). (4.9)

To get a clearer picture of the sharpness of the assumptions in Theorems 4.5 and 4.6, we include the particular version for potential costs, c_p(x, y) = |x − y|^p for p > 1. Recall that c_p satisfies (A1)-(A3) for p > 1. Corollary 4.7.
Assume p > 1. Let P, Q ∈ P(R^d) be such that P ≪ ℓ_d and has connected support and negligible boundary. If P and Q have finite moments of order 2p, then

√n (T_p(P_n, Q) − E T_p(P_n, Q)) w→ N(0, σ²_p(P, Q)), (4.10)

where

σ²_p(P, Q) := ∫ φ(x)² dP(x) − (∫ φ(x) dP(x))², (4.11)

and φ is an optimal transport potential from P to Q for c_p. Moreover, if P has a finite moment of order 2p + ε for some ε > 0, then

n Var(T_p(P_n, Q)) → σ²_p(P, Q). (4.12)

Proof.
A look at the proof of Corollary 4.3 shows that finite 2p moments guarantee that (4.2) holds. Clearly, (4.4) holds too, and we can apply Theorem 4.5 to conclude (4.10) (the fact that absolute continuity of Q is not necessary follows using the approximation argument in the proof of Corollary 4.3). For (4.12) we take in (4.8) the conjugate pair q_1 = (2p + ε)/(2 + δ) and q_2 = q_1/(q_1 − 1) = 2p/((p − 1)(2 + δ)), where δ > 0 is determined by the conjugacy relation 1/q_1 + 1/q_2 = 1. With these choices (4.8) becomes, up to a multiplicative constant,

(E|X − X'|^{2p+ε}) (E(∫_{R^d} |X − y|^{2p} dQ(y))) < ∞,

and we apply Theorem 4.6. The case of a finite moment of order 2p + ε for Q follows similarly. Remark 4.8.
As noted in Remark 4.4, the assumption of finite moments of order 2p (at least for P) cannot be relaxed for tightness and, in that sense, the moment assumptions in Corollary 4.7 are sharp and cannot be improved. On the other hand, in the case p = 2, Corollary 4.7 improves Theorem 4.1 in [19], not only by proving that finite fourth moments are enough (the original assumption was finite moments of order 4 + ε in [19]), but also by assuming milder regularity assumptions on P and Q. In this new setting, P must have a connected support with a negligible boundary, relaxing the assumption of a convex support. The only price to pay is that variance convergence may fail under these relaxed assumptions.

So far we have considered CLTs for T_p(P, Q). Its p-th root, W_p(P, Q) := (T_p(P, Q))^{1/p}, defines a well-known metric in the space of probabilities with finite moments of order p, the p-Wasserstein distance. Proving a CLT for the empirical Wasserstein distance is not a straightforward application of a delta-method and Corollary 4.7, since we do not have a fixed centering constant in (4.10). Yet, we can circumvent this issue and prove the following result.

Theorem 4.9. Let P ≠ Q ∈ P(R^d) be such that P ≪ ℓ_d and has connected support and negligible boundary. Assume P and Q have finite moments of order 2p and p > 1. Then, if σ²_p(P, Q) is defined as in Corollary 4.7,

√n (W_p(P_n, Q) − (E[W_p^p(P_n, Q)])^{1/p}) w→ N(0, β_p(P, Q)),

where β_p(P, Q) := (p W_p^{p−1}(P, Q))^{−2} σ²_p(P, Q).

Proof. Setting A_n := W_p(P_n, Q) and B_n := (E[W_p^p(P_n, Q)])^{1/p}, we know from Corollary 4.7 that

√n (A_n^p − B_n^p) w→ N(0, σ²_p(P, Q)). (4.13)

Moreover, the bound

W_p^p(P_n, Q) ≤ 2^{p−1} ∫ |x|^p dP_n(x) + 2^{p−1} ∫ |y|^p dQ(y),

together with the assumption of finite moments of order 2p, implies that W_p^p(P_n, Q) is uniformly integrable.
It follows that A_n a.s.→ W_p(P, Q) and

B_n → W_p(P, Q). (4.14)

By the mean value theorem applied to the function t ↦ t^p, there exists ε_n ∈ (0, 1) such that

A_n^p − B_n^p = (A_n − B_n) p (A_n ε_n + B_n(1 − ε_n))^{p−1}. (4.15)

The limits in (4.14) imply that necessarily p(A_n ε_n + B_n(1 − ε_n))^{p−1} a.s.→ p W_p(P, Q)^{p−1} > 0. This fact, together with the limit (4.13) and Slutsky's theorem applied in (4.15), concludes the proof.
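In dimension one the empirical transport cost has a closed form, which makes the limiting variances in Corollary 4.7 and Theorem 4.9 easy to probe numerically. The sketch below is an illustration under assumed choices (P = U[0,2], Q = U[0,1], p = 2), not code from the paper: the optimal map is T(x) = x/2 and a Kantorovich potential is φ(x) = x²/2 up to a constant (since φ′(x) = ∂_x c(x, T(x)) = 2(x − T(x)) = x), so σ²_2(P, Q) = Var(X²/2) = 16/45, W_2²(P, Q) = ∫_0^1 (2u − u)² du = 1/3, and β_2(P, Q) = σ²_2/(2W_2)² = (16/45)(3/4) = 4/15.

```python
import numpy as np

def t2_to_uniform01(x):
    """T_2(P_n, U[0,1]) in 1-d: sum_i of the integral of (x_(i) - u)^2
    over u in [(i-1)/n, i/n], computed in closed form (quantile coupling)."""
    x = np.sort(x)
    n = x.size
    left = np.arange(n) / n
    right = np.arange(1, n + 1) / n
    return np.sum((x - left) ** 3 - (x - right) ** 3) / 3.0

rng = np.random.default_rng(2)
n, reps = 200, 4000
costs = np.array([t2_to_uniform01(rng.uniform(0.0, 2.0, size=n))
                  for _ in range(reps)])

scaled_var_T = n * costs.var()           # should approach sigma^2_2(P,Q) = 16/45
scaled_var_W = n * np.sqrt(costs).var()  # should approach beta_2(P,Q) = 4/15

assert abs(scaled_var_T - 16.0 / 45.0) < 0.08
assert abs(scaled_var_W - 4.0 / 15.0) < 0.08
```

The second assertion is exactly the delta-method relation of Theorem 4.9: the fluctuations of W_2 = T_2^{1/2} are those of T_2 rescaled by the derivative (2W_2)^{−1}.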
For n, m ∈ N let X_1, . . . , X_n and Y_1, . . . , Y_m be independent i.i.d. random samples with distributions P and Q. Consider the corresponding empirical measures P_n := (1/n) Σ_{k=1}^n δ_{X_k} and Q_m := (1/m) Σ_{k=1}^m δ_{Y_k}. At first sight one may conjecture that the approach leading to Theorems 4.5 and 4.6 trivially extends to the two-sample setup, yielding a CLT for T_c(P_n, Q_m). However, a closer look at the proof shows that major issues appear when extending Claim 3. For this reason an adaptation of Theorem 4.5 to the two-sample setup is left for further work. On the other hand, under stronger moment assumptions, such as (4.8), the extension is straightforward. We present the result, avoiding additional details. Theorem 4.10.
Assume c(x, y) = h(x − y) with h differentiable and satisfying (A1)-(A3). Let P, Q ∈ P(R^d) be such that P ≪ ℓ_d, Q ≪ ℓ_d and both have connected support and negligible boundary. Assume that (4.4) holds and also that there exists some δ > 0 such that (4.8) holds, as well as the corresponding conditions exchanging the roles of P and Q. Then, if n/(n + m) → λ ∈ (0, 1) as n, m → ∞,

√(nm/(n + m)) (T_c(P_n, Q_m) − E T_c(P_n, Q_m)) w→ N(0, (1 − λ) σ²_c(P, Q) + λ σ²_c(Q, P)),

with σ²_c(Q, P) as in (4.6). Furthermore,

(nm/(n + m)) Var(T_c(P_n, Q_m)) → (1 − λ) σ²_c(P, Q) + λ σ²_c(Q, P).

Appendix
Proof of Theorem 3.4.
We prove each claim separately. To prove (i) we take, without loss of generality, p ∈ supp(Q) ∩ dom(∇ψ) as in (3.2) (hence, ψ_n(p) = 0). From (2.5) and Lemma 3.3 we see that for each compact K ⊂ supp(Q) there exist a subsequence ψ_{n_k} and a constant R = R(K) > 0 such that, for all a, x ∈ K,

|ψ_{n_k}(x) − ψ_{n_k}(a)| ≤ |x − a| R.

Hence, the functions of the sequence {ψ_{n_k}} are R-Lipschitz on each compact set and ψ_{n_k}(p) = 0, and we can apply the Arzelà-Ascoli theorem on each compact set to conclude that there exists a continuous function f such that ψ_{n_{k_m}} → f uniformly on the compact sets of supp(Q) for some subsequence. We claim that f = ψ + C. To prove it we consider x ∈ supp(Q) and any sequence y_n ∈ ∂_c ψ_n(x); by Lemma 3.3 we know that there exists a subsequence {y_{n_k}}_{k∈N} which is bounded. Hence, by Lemma 3.1, there exists y such that y_{n_k} → y ∈ ∂_c ψ(x) along a subsequence. We keep the notation for this subsequence and note that it satisfies

ψ_{n_k}(z) ≤ ψ_{n_k}(x) + [c(z, y_{n_k}) − c(x, y_{n_k})] for all z ∈ R^d,

and, by taking limits,

f(z) ≤ f(x) + [c(z, y) − c(x, y)] for all z ∈ dom(f).

Therefore, ∂_c f(x) is non-empty for every x ∈ supp(Q). This entails that f is c-concave and, as a consequence, almost surely differentiable. Moreover, y ∈ ∂_c f(x) ∩ ∂_c ψ(x). We conclude that ∇_c f = ∇_c ψ a.s. in supp(Q) and (i) follows by Corollary 2.7.

We turn now to (ii) and assume, on the contrary, that there exist a sequence {x_n} ⊂ K and y_n ∈ ∂_c ψ_n(x_n) such that

|y_n − ∇_c ψ(x_n)| > ε for some ε > 0 and all n. (5.1)

Compactness of K implies that there exists x ∈ K such that x_n → x along a subsequence which, to ease notation, we denote also as x_n. Lemma 3.3 implies that y_n also converges to some y_0 along a subsequence. But then Lemma 3.1 shows that y_0 = ∇_c ψ(x), which contradicts (5.1). □

Proof of Theorem 4.5.
We write (X'_1, . . . , X'_n) for an independent copy of (X_1, . . . , X_n) and denote by P_n^{(i)} the empirical measure on (X_1, . . . , X'_i, . . . , X_n). As in (4.7),

R_n = T_c(P_n, Q) − ∫ φ(x) dP_n(x),

where φ is an optimal transport potential from P to Q. We write R_n^{(i)} for the version of R_n computed from P_n^{(i)} instead of P_n. To ease notation it will be convenient to write P'_n rather than P_n^{(1)} and R'_n instead of R_n^{(1)} at some points. The guideline of the proof is to show that n(R_n − R'_n) → 0 a.s. and that n² E(R_n − R'_n)² ≤ M. From this we can obtain, using the Banach-Alaoglu theorem and the Banach-Saks property (see details below), that there exists a Cesàro mean of {n|R_n − R'_n|}_{n∈N} convergent to 0 in L²(P). Finally, the same holds for the Cesàro means of the sequence √n(R_n − ER_n). To conclude we will prove that these three claims imply the central limit theorem. We follow this path in the following complete proof, which we split into three main steps.

Claim 1: n(R_n − R'_n) a.s.→ 0 and n² E(R_n − R'_n)² ≤ M.

We write φ_n for an optimal transport potential between P_n and Q. Since

T_c(P'_n, Q) = sup_{(f,g)∈Φ_c(P,Q)} ∫ f(x) dP'_n(x) + ∫ g(y) dQ(y) ≥ ∫ φ_n(x) dP'_n(x) + ∫ φ_n^c(y) dQ(y),

we have

R'_n ≥ (1/n) φ_n(X'_1) + (1/n) Σ_{k=2}^n φ_n(X_k) − (1/n) Σ_{k=2}^n φ(X_k) − (1/n) φ(X'_1) + ∫ φ_n^c(y) dQ(y).

This implies that

R_n − R'_n ≤ (1/n) (φ_n(X_1) − φ(X_1) − φ_n(X'_1) + φ(X'_1)). (5.2)

By Theorem 3.4 we can assume, without loss of generality, that, almost surely, φ_n → φ uniformly on compact subsets of supp(P). This entails that n(R_n − R'_n)_+ a.s.→ 0. By symmetry, n(R'_n − R_n)_+ a.s.→ 0 and, consequently, n(R'_n − R_n) a.s.→ 0. For the second part of the claim we note that

n(R_n − R'_n) = n(T_c(P_n, Q) − T_c(P'_n, Q)) − (φ(X_1) − φ(X'_1)).

It follows from (4.2) and the proof of Lemma 4.1 that n² E(T_c(P_n, Q) − T_c(P'_n, Q))² is a bounded sequence and, therefore, it suffices to show that Eφ(X_1)² < ∞. To check this, we fix x_0 ∈ supp(P) ∩ dom(∇φ). From (2.4) we get that

|φ(X_1)| ≤ |φ(x_0)| + |c(X_1, y) − c(x_0, y)| + |c(X_1, b) − c(x_0, b)| ≤ |φ(x_0)| + c(X_1, y) + c(x_0, y) + c(X_1, b) + c(x_0, b),

for all (x_0, b), (X_1, y) ∈ ∂_c φ. Since φ is differentiable at x_0, if X_1 ∈ dom(∇φ) we have

|φ(X_1)| ≤ |φ(x_0)| + c(X_1, ∇_c φ(X_1)) + c(x_0, ∇_c φ(X_1)) + c(X_1, ∇_c φ(x_0)) + c(x_0, ∇_c φ(x_0)).

Recalling that c(x, y) = h(x − y) and that h is convex, we see that

c(X_1, ∇_c φ(X_1)) = h(X_1 − ∇_c φ(X_1)) ≤ ½ h(2X_1) + ½ h(−2∇_c φ(X_1)).

Hence, using the fact that Q = (∇_c φ)#P and (4.4), we deduce that

E(c(X_1, ∇_c φ(X_1))²) ≤ ½ ∫ h(2x)² dP(x) + ½ ∫ h(−2y)² dQ(y) < ∞.

Similarly, we check that E(c(X_1, ∇_c φ(x_0))²) < ∞ and E(c(x_0, ∇_c φ(X_1))²) < ∞. This shows that φ(X_1) has a finite second moment, as claimed.

Claim 2:
From every subsequence of {n|R_n − R'_n|}_{n∈N} we can extract a further subsequence whose Cesàro means converge to 0 in L²(P).

From Claim 1 and the Banach-Alaoglu theorem (see Theorem 3.16 in [11]) applied to the Hilbert space L²(P), we see that, along subsequences, n|R_n − R'_n| ⇀ 0, where ⇀ denotes weak convergence in the space L²(P). By a theorem of Banach and Saks (see the Banach-Saks property, Exercise 5.24 in [11]), we conclude that there exists a subsequence {n_k|R_{n_k} − R'_{n_k}|}_{k∈N} such that the Cesàro means

g_m = (1/m) Σ_{k=1}^m n_k |R_{n_k} − R'_{n_k}| (5.3)

converge strongly to 0 in L²(P), that is,

E((1/m) Σ_{k=1}^m n_k |R_{n_k} − R'_{n_k}|)² → 0. (5.4)

Claim 3:
From every subsequence of {√n(R_n − ER_n)}_{n∈N} we can extract a further subsequence whose Cesàro means converge to 0 in L²(P).

For ease of notation, we write k instead of n_k in (5.3). We set G_m := (1/m) Σ_{k=1}^m √k R_k. By the Efron-Stein inequality,

Var(G_m) ≤ ½ Σ_{i=1}^m E(G_m − G_m^{(i)})². (5.5)

Next, we observe that

E(G_m − G_m^{(i)})² = E((1/m) Σ_{k=1}^m √k (R_k − R_k^{(i)}))²
= (1/m²) Σ_{k=1}^m k E(R_k − R_k^{(i)})² + (2/m²) Σ_{k=1}^m Σ_{j=k+1}^m √k √j E(R_k − R_k^{(i)})(R_j − R_j^{(i)}),

since for the terms with k < i the difference is 0. Hence

E(G_m − G_m^{(i)})² = (1/m²) Σ_{k=i}^m k E(R_k − R_k^{(i)})² + (2/m²) Σ_{k=i}^m Σ_{j=k+1}^m √k √j E(R_k − R_k^{(i)})(R_j − R_j^{(i)})
= (1/m²) Σ_{k=i}^m k E(R_k − R'_k)² + (2/m²) Σ_{k=i}^m Σ_{j=k+1}^m √k √j E(R_k − R'_k)(R_j − R'_j).

Here, the second equality comes from the fact that (R_k − R'_k) has the same distribution as (R_k − R_k^{(i)}) when i ≤ k, and the same happens with (R_k − R'_k)(R_j − R'_j) and (R_k − R_k^{(i)})(R_j − R_j^{(i)}). Turning back to (5.5) we have

Var(G_m) ≤ ½ (1/m²) Σ_{i=1}^m Σ_{k=i}^m k E(R_k − R'_k)² + (1/m²) Σ_{i=1}^m Σ_{k=i}^m Σ_{j=k+1}^m √k √j E(R_k − R'_k)(R_j − R'_j)
≤ ½ (1/m²) Σ_{i=1}^m Σ_{k=i}^m k E(R_k − R'_k)² + (1/m²) Σ_{i=1}^m Σ_{k=i}^m Σ_{j=k+1}^m √k √j E|R_k − R'_k||R_j − R'_j|
= ½ (1/m²) Σ_{k=1}^m k² E(R_k − R'_k)² + (1/m²) Σ_{i=1}^m Σ_{k=i}^m Σ_{j=k+1}^m √k √j E|R_k − R'_k||R_j − R'_j|.

We compute the last term to obtain

Σ_{i=1}^m Σ_{k=i}^m Σ_{j=k+1}^m √k √j E|R_k − R'_k||R_j − R'_j| = Σ_{j=1}^m Σ_{k=1}^{j−1} Σ_{i=1}^k √k √j E|R_k − R'_k||R_j − R'_j|
= Σ_{j=1}^m Σ_{k=1}^{j−1} k^{3/2} √j E|R_k − R'_k||R_j − R'_j| ≤ Σ_{j=1}^m Σ_{k=1}^{j−1} k j E|R_k − R'_k||R_j − R'_j|.

We conclude that

Var(G_m) ≤ ½ (1/m²) [ Σ_{k=1}^m k² E(R_k − R'_k)² + 2 Σ_{j=1}^m Σ_{k=1}^{j−1} k j E|R_k − R'_k||R_j − R'_j| ] = ½ E((1/m) Σ_{k=1}^m k |R_k − R'_k|)²,

which, together with (5.4), shows that

E((1/m) Σ_{k=1}^m √k (R_k − ER_k))² = Var(G_m) → 0. (5.6)

Thus we have proven that from every subsequence of {G_m}_{m∈N} we can extract a further subsequence converging to 0 strongly in L²(P), and Claim 3 follows.

Now we are ready to prove the central limit theorem. Note that by the classical central limit theorem we have

√n ( ∫ φ(x) dP_n(x) − E(∫ φ(x) dP_n(x)) ) w→ N(0, σ²_c(P, Q)).
As a consequence, the Cesàro means converge to the same limit,

(1/m) Σ_{k=1}^m √k { ∫ φ(x) dP_k(x) − E(∫ φ(x) dP_k(x)) } w→ N(0, σ²_c(P, Q)). (5.7)

Both (5.7) and (5.6) imply that

(1/m) Σ_{k=1}^m √k {T_c(P_k, Q) − E T_c(P_k, Q)} w→ N(0, σ²_c(P, Q)). (5.8)

The variance bound of Lemma 4.1 and Remark 4.2 yield tightness of {√n (T_c(P_n, Q) − E T_c(P_n, Q))}_{n∈N}. Hence each subsequence has a further subsequence converging to some limiting distribution, say γ. The Cesàro means must then converge to γ as well. Finally, from (5.8) we conclude that γ = N(0, σ²_c(P, Q)) and the proof follows. □
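The limit just obtained, and its two-sample counterpart in Theorem 4.10, can be checked by simulation in dimension one. The sketch below is an illustration under assumed choices (P = U[0,2], Q = U[0,1], quadratic cost, n = m), not code from the paper: here σ²_c(P, Q) = Var(X²/2) = 16/45 and σ²_c(Q, P) = Var(Y²) = 4/45 (the conjugate potential is ψ(y) = −y² up to a constant), so with λ = 1/2 the predicted limit of (nm/(n+m)) Var(T_c(P_n, Q_m)) is (16/45 + 4/45)/2 = 2/9.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 4000

def t2_two_sample(x, y):
    """Two-sample quadratic transport cost in 1-d with equal sample sizes:
    the optimal coupling matches sorted (quantile) values."""
    return np.mean((np.sort(x) - np.sort(y)) ** 2)

costs = np.array([
    t2_two_sample(rng.uniform(0.0, 2.0, size=n), rng.uniform(0.0, 1.0, size=n))
    for _ in range(reps)
])

# With n = m we have nm/(n+m) = n/2; predicted limit (16/45 + 4/45)/2 = 2/9.
scaled_var = (n / 2.0) * costs.var()
assert abs(scaled_var - 2.0 / 9.0) < 0.09
```

The two one-sample variances enter with weights (1 − λ) and λ, exactly as in the statement of Theorem 4.10.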
Proof of Theorem 4.6.
We keep the same notation as in the proof of Theorem 4.5, noting that the new assumption (4.8) has no influence on the proof of Claim 1. Hence, we only have to prove that n²(R_n − R'_n)² is uniformly integrable and, in fact, recalling that

n(R_n − R'_n) = n(T_c(P_n, Q) − T_c(P'_n, Q)) − (φ(X_1) − φ(X'_1))

and that φ(X_1) has a finite second moment (as shown in the proof of Theorem 4.5), it suffices to prove uniform integrability of n²(T_c(P_n, Q) − T_c(P'_n, Q))².

To check this we denote Z := T_c(P_n, Q) and Z' := T_c(P'_n, Q). Arguing as in the proof of Lemma 4.1 we see that

(Z − Z')_+ ≤ |X_1 − X'_1| ∫_{C'_1} |∇h(X_1 − y)| dQ(y).

Hence, by Hölder's inequality, for every pair (q_1, q_2) ∈ α it holds that

E(n(Z − Z')_+)^{2+δ} ≤ E{ |X_1 − X'_1|^{2+δ} (n ∫_{C'_1} |∇h(X_1 − y)| dQ(y))^{2+δ} } ≤ (E|X_1 − X'_1|^{(2+δ)q_1})^{1/q_1} (E(n ∫_{C'_1} |∇h(X_1 − y)| dQ(y))^{(2+δ)q_2})^{1/q_2}.

A further use of Hölder's inequality yields that

∫_{C'_1} |∇h(X_1 − y)| dQ(y) ≤ (∫_{C'_1} dQ(y))^{((2+δ)q_2 − 1)/((2+δ)q_2)} (∫_{C'_1} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y))^{1/((2+δ)q_2)} = n^{−((2+δ)q_2 − 1)/((2+δ)q_2)} (∫_{C'_1} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y))^{1/((2+δ)q_2)}.

Note that (X'_1, X_2, . . . , X_n) is independent of X_1; hence, the same holds for C'_k, for k = 1, . . . , n. By exchangeability, we have that ∫_{C'_1} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y) is equally distributed as ∫_{C'_k} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y), k = 2, . . . , n. This implies

E{ ∫_{C'_1} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y) } = (1/n) E{ Σ_{i=1}^n ∫_{C'_i} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y) } ≤ (1/n) E{ ∫_{R^d} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y) },

which, in turn, entails

E( n ∫_{C'_1} |∇h(X_1 − y)| dQ(y) )^{(2+δ)q_2} ≤ E( ∫_{R^d} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y) ).

Hence,

E(n(Z − Z')_+)^{2+δ} ≤ (E|X_1 − X'_1|^{(2+δ)q_1})^{1/q_1} (E(∫_{R^d} |∇h(X_1 − y)|^{(2+δ)q_2} dQ(y)))^{1/q_2},

and the proof follows. □

Proof of Theorem 4.10.
We set

R_{n,m} := T_c(P_n, Q_m) − ∫ φ(x) dP_n(x) − ∫ ψ(y) dQ_m(y),

with φ an optimal transport potential from P to Q for the cost c and ψ = φ^c, and observe that it suffices to show that (nm/(n+m)) Var(R_{n,m}) → 0. Once again the key of the proof is the Efron-Stein inequality. Note that R_{n,m}, as a function of X_1, . . . , X_n, Y_1, . . . , Y_m, is symmetric in its first n variables as well as in the last m. Let X'_1 (resp. Y'_1) be a copy of X_1 (resp. Y_1), both independent of X_1, . . . , X_n, Y_1, . . . , Y_m; finally, let P'_n (resp. Q'_m) be the empirical distribution of X'_1, X_2, . . . , X_n (resp. Y'_1, Y_2, . . . , Y_m). Hence, if we denote

R'_{n,m} := T_c(P'_n, Q_m) − ∫ φ(x) dP'_n(x) − ∫ ψ(y) dQ_m(y),
R''_{n,m} := T_c(P_n, Q'_m) − ∫ φ(x) dP_n(x) − ∫ ψ(y) dQ'_m(y),

by the Efron-Stein inequality we have

(nm/(n+m)) Var(R_{n,m}) ≤ ½ (n²m/(n+m)) E(R_{n,m} − R'_{n,m})² + ½ (nm²/(n+m)) E(R_{n,m} − R''_{n,m})².

Now, to conclude, it suffices to prove that

n² E((R_{n,m} − R'_{n,m})²) → 0, (5.9)
m² E((R_{n,m} − R''_{n,m})²) → 0. (5.10)

We handle (5.9), which will follow if we prove that n(R_{n,m} − R'_{n,m}) → 0 a.s. and that n²(R_{n,m} − R'_{n,m})² is uniformly integrable. For the first claim note that if φ_n (resp. ψ_m) is an optimal transport potential from P_n to Q_m (resp. from Q_m to P_n) then

R'_{n,m} ≥ ∫ φ_n(x) dP'_n(x) + ∫ ψ_m(y) dQ_m(y) − ∫ φ(x) dP'_n(x) − ∫ ψ(y) dQ_m(y).

As a consequence,

R_{n,m} − R'_{n,m} ≤ ∫_{R^d} (φ_n(x) − φ(x)) (dP_n(x) − dP'_n(x)) = (1/n) (φ_n(X_1) − φ(X_1) − φ_n(X'_1) + φ(X'_1)),

and we see that

n(R_{n,m} − R'_{n,m})_+ ≤ |φ_n(X_1) − φ(X_1) − φ_n(X'_1) + φ(X'_1)|. (5.11)

By Theorem 3.4, with a right choice of potentials we can guarantee that, P-a.s., φ_n → φ, and conclude that n(R_{n,m} − R'_{n,m})_+ → 0 P-a.s. Finally, it only remains to prove that n²(R_{n,m} − R'_{n,m})² is uniformly integrable, which follows arguing as in the proof of Theorem 4.6. □

5.2 Proofs of Lemmas

Proof of Lemma 3.1.
Set $x_0\in\operatorname{dom}(\nabla_c\psi)\cap\operatorname{supp}(Q)$ and $y_0=\nabla_c\psi(x_0)$. By Lemma 2.10 we see that for each $\varepsilon>0$ there exists $\delta>0$ such that if $|z-x_0|<\delta$ then $\partial_c\psi(z)\subset B(y_0,\varepsilon)$. Let $\pi$ be the unique optimal transport plan between $Q$ and $P$; uniqueness holds under the assumption $Q\ll\ell_d$. By Theorem 2.4, $\operatorname{supp}(\pi)\subset\partial_c\psi$. This entails
\[
\pi\big(B(x_0,\delta)\times B(y_0,\varepsilon)\big)=\pi\big(B(x_0,\delta)\times\mathbb{R}^d\big)=Q\big(B(x_0,\delta)\big)=:\eta>0,
\]
where the positivity follows since $x_0\in\operatorname{supp}(Q)$. Repeating the argument with a decreasing sequence $\varepsilon_k\to 0$, we obtain a sequence $\delta_k\le\frac1k$ such that
\[
\pi\big(B(x_0,\delta_k)\times B(y_0,\varepsilon_k)\big)=\pi\big(B(x_0,\delta_k)\times\mathbb{R}^d\big)=Q\big(B(x_0,\delta_k)\big)=:\eta_k>0.
\]
Let $\pi_n$ be an optimal transport plan between $P_n$ and $Q_n$. We observe that
(a) $\pi_n\xrightarrow{w}\pi$ by Theorem 5.20 in [42];
(b) $\operatorname{supp}(\pi_n)\subset\partial_c\psi_n$ by Theorem 2.4.
By (a) there exists $N_k$ such that, for $n\ge N_k$, $\pi_n\big(B(x_0,\delta_k)\times B(y_0,\varepsilon_k)\big)\ge\eta_k/2$. Hence, by (b) we can choose a pair
\[
(x_{n,k},y_{n,k})\in\partial_c\psi_n\cap\big(B(x_0,\delta_k)\times B(y_0,\varepsilon_k)\big). \quad (5.12)
\]
As a consequence of (5.12), since $\varepsilon_k,\delta_k\to 0$, we can extract a sub-sequence $(x_n,y_n)\in\partial_c\psi_n$ converging to $(x_0,y_0)$. Define $a_n:=\psi_n(x_n)-\psi(x_0)$ and $\tilde\psi_n:=\psi_n-a_n$ (which has the same $c$-superdifferential as $\psi_n$). Now, (5.12) implies that the $\partial_c\tilde\psi_n$ are $c$-cyclically monotone sets which do not escape to the horizon. By Theorem 2.8 and Lemma 2.9 we deduce that $\partial_c\tilde\psi_n$ converges to a $c$-cyclically monotone set $\Gamma$ along a sub-sequence. Necessarily $\Gamma\subset\partial_c f$ for some $c$-concave function $f$. We observe that $(x_0,y_0)\in\partial_c f$. If we take another arbitrary point $x_1\in\operatorname{dom}(\nabla_c\psi)$ with $\partial_c\psi(x_1)=\{y_1\}$, we can apply the same arguments to check that $(x_1,y_1)\in\partial_c f$. Hence, $\operatorname{dom}(\nabla_c\psi)\subset\operatorname{dom}(f)$. Since $f$ is differentiable a.s., $\partial_c f$ is a singleton a.s. and, therefore, $\nabla f=\nabla\psi$ a.s. in the support of $Q$, which is connected. Using Theorem 2.6 we conclude that there exists a constant $C$ such that $\psi=f-C$ in $\Omega$. Hence $\partial_c\psi=\partial_c f$ and the result follows. $\square$

Proof of Lemma 3.2.
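The proof below manipulates the truncated cones of condition (A2). Their precise definition is fixed in Section 2 and is not reproduced in this excerpt; a definition consistent with the way they are used here would be the cone with vertex $x$, axis $z\in S^{d-1}$, half-angle $\theta$ and height $r$:

```latex
K(r,\theta,z,x) := \big\{\, y\in\mathbb{R}^d :\ |y-x|\le r,\ \ \langle z,\, y-x\rangle \ge |y-x|\cos\theta \,\big\}.
```

With this reading, membership in $K(r,\theta,z,x)$ is checked exactly through the two inequalities appearing in (5.14).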
We can assume, without loss of generality, that $p$ is in the interior of the domain of $\psi$, since otherwise the result is trivial. With this assumption, we check first that we cannot have $\psi_n(p_n)\to+\infty$. In fact, in that case, by $c$-concavity we would have $\psi_n(p_n)\le c(p_n,y)-\psi_n^c(y)$ for all $y$. Hence, we would have $\psi_n^c(y_n)\to-\infty$ for all $y_n\to y$. Now, take $p_0$ as in (3.2). By Lemma 3.1 we can choose $(\tilde p_n,y_n)$ with $y_n\in\partial_c\psi_n(\tilde p_n)$, $\tilde p_n\to p_0$ and $y_n\to\nabla_c\psi(p_0)=y_0$. But then we would have $\psi_n(\tilde p_n)\to\psi(p_0)=0$, while, on the other hand, $\psi_n(\tilde p_n)=c(\tilde p_n,y_n)-\psi_n^c(y_n)\to+\infty$, which is a contradiction.
Now we can assume, taking subsequences if necessary, that $\psi_n(p_n)<-n$ for all $n\in\mathbb{N}$. Taking $y_n\in\partial_c\psi_n(p_n)$ we have that
\[
\psi_n(x)\le c(x,y_n)+\lambda_n, \quad\text{for all } x\in\mathbb{R}^d, \quad (5.13)
\]
where $\lambda_n=\psi_n(p_n)-c(p_n,y_n)$. Hence, by assumption we have that $c(p_n,y_n)+\lambda_n\le-n$ for all $n\in\mathbb{N}$. Now, let $\{x_n\}$ be a bounded sequence such that $\psi_n(x_n)$ is bounded. Then
\[
\psi_n(x_n)<c(x_n,y_n)-c(p_n,y_n)-n.
\]
Since $\psi_n(x_n)$, $p_n$ and $x_n$ are bounded, it follows that $|y_n|\to\infty$. For each $n$ we choose the height $r_n\in[0,\infty]$ and the direction $z_n$ of the largest cone with vertex $p_n-y_n$ such that
\[
K\big(r_n,\tfrac{\pi}{2}-r_n^{-1},z_n,p_n-y_n\big)\subset\{x: h(x)\le h(p_n-y_n)\}.
\]
Since $z_n\in S^{d-1}$, up to a sub-sequence we can assume that $z_n\to z\in S^{d-1}$. Also, since $|p_n-y_n|\to\infty$, condition (A2) implies that $r_n\to\infty$ (note that otherwise, if $r_n<R$ along a subsequence, then (A2) would no longer be true for $r=R+1$ and $\theta=\tfrac{\pi}{2}-r_n^{-1}$).
Now let $\{x_n\}_{n\in\mathbb{N}}\subset\subset\{x:\langle z,x-p\rangle>0\}$ be a bounded sequence. From the fact that $r_n\to\infty$ we see that $\cos\big(\tfrac{\pi}{2}-r_n^{-1}\big)\to 0$. Therefore, for big enough $n$,
\[
|x_n-p_n|\cos\big(\tfrac{\pi}{2}-r_n^{-1}\big)<\langle z,x_n-p_n\rangle<r_n. \quad (5.14)
\]
As a consequence, $x_n\in K\big(r_n,\tfrac{\pi}{2}-r_n^{-1},z,p_n\big)$, which implies that
\[
x_n-y_n\in K\big(r_n,\tfrac{\pi}{2}-r_n^{-1},z,p_n-y_n\big)\subset\{x: h(x)\le h(p_n-y_n)\}.
\]
From this we conclude that $c(x_n,y_n)\le c(p_n,y_n)$ and, turning back to (5.13), that
\[
\psi_n(x_n)\le c(x_n,y_n)+\lambda_n\le c(p_n,y_n)+\lambda_n\le-n,
\]
and the proof follows. $\square$

Proof of Lemma 3.3.
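Step 3 of the proof below rests on two elementary convexity estimates for the cost, the subgradient inequality for $h$ at $\xi_n v_n$ and the superlinear growth condition (A3). The following one-dimensional sanity check, for the hypothetical superlinear choice $h(v)=v^4$ (an illustration only, not the paper's general cost), verifies both numerically:

```python
# One-dimensional sanity check of the convexity estimates used in Step 3,
# for the hypothetical superlinear cost h(v) = v**4 (so h'(v) = 4 v**3).
def h(v):
    return v ** 4

def h_prime(v):
    return 4 * v ** 3

eps = 0.5                              # plays the role of epsilon in the proof
vals = []
for v in (2.0, 10.0, 50.0, 250.0):     # |v_n| -> infinity along the sequence
    xi = 1 - eps / abs(v)              # the dilation factor xi_n
    s = h_prime(xi * v)                # subgradient of h at xi_n * v_n
    # subgradient inequality at xi*v tested against 0:
    # h(0) >= h(xi*v) + <0 - xi*v, s>, i.e. xi*v*s >= h(xi*v) - h(0)
    assert xi * v * s >= h(xi * v) - h(0)
    vals.append((v / abs(v)) * s)      # the normalized inner products of (5.16)
# superlinearity of h makes these inner products blow up,
# which is exactly what contradicts the uniform bound (5.16)
assert vals == sorted(vals)
```

The blow-up of `vals` is the one-dimensional shadow of (5.17): the normalized inner products cannot stay below the constant $2M/\varepsilon$ of (5.16).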
We split the proof into the following steps:
Step 1 (Pointwise boundedness): Fix $x_0\in\operatorname{supp}(Q)\cap\operatorname{dom}(\psi)$. By Lemma 3.1 there exists a $c$-cyclically monotone set $\Gamma$ such that, up to taking sub-sequences, $\partial_c\psi_n\to\Gamma$ in the sense of Painlevé–Kuratowski. Hence, there exists a sequence $(x_{n_k},y_{n_k})\in\partial_c\psi_{n_k}$ satisfying
\[
(x_{n_k},y_{n_k})\to(x_0,y_0)\in\Gamma.
\]
Assume $\{\psi_{n_k}(x_{n_k})\}_{k\in\mathbb{N}}$ is not bounded. Then there exists a sub-sequence $\psi_{n_{k_m}}(x_{n_{k_m}})\to-\infty$ (the case $\psi_{n_{k_m}}(x_{n_{k_m}})\to+\infty$ can be excluded arguing as at the beginning of the proof of Lemma 3.2). Now we take $p_0$ as in (3.2) and observe that
\[
0\le\psi_{n_{k_m}}(x_{n_{k_m}})+c(p_0,y_{n_{k_m}})-c(x_{n_{k_m}},y_{n_{k_m}}). \quad (5.15)
\]
Taking limits as $m\to\infty$ in (5.15) leads to a contradiction. Hence, the sequence $\{\psi_{n_k}(x_{n_k})\}_{k\in\mathbb{N}}$ must be bounded.
For ease of reading we will use the same notation for the subsequence $\{\psi_{n_k}\}_{k\in\mathbb{N}}$ and the main sequence $\{\psi_n\}_{n\in\mathbb{N}}$ in the subsequent Steps 2 and 3.

Step 2 (For every compact $K\subset\operatorname{supp}(Q)$ there exists $M>0$ such that $|\psi_n(K)|\le M$ for large enough $n$): Assume, on the contrary, that for every $m\in\mathbb{N}$ there exists some $n_m\in\mathbb{N}$ such that $k_{n_m}\in K$ and $|\psi_{n_m}(k_{n_m})|>m$. Then $|\psi_{n_m}(k_{n_m})|\to\infty$ as $m\to\infty$ and, by compactness, $k_{n_m}\to k\in K$ along a subsequence. By Lemma 3.2 we see that there exists $z\in\mathbb{R}^d$ such that $\psi_n(x_n)$ is not bounded for every bounded sequence $\{x_n\}\subset\subset\{x:\langle z,x-k\rangle>0\}$. Now take $x_0\in\operatorname{supp}(Q)\cap\{x:\langle z,x-k\rangle>0\}$. Since this last set is open, there exists $\varepsilon>0$ such that
\[
B(x_0,\varepsilon)\subset\subset\operatorname{supp}(Q)\cap\{x:\langle z,x-k\rangle>0\},
\]
and this contradicts Step 1 applied to the point $x_0$.

Step 3 (For every compact $K\subset\operatorname{supp}(Q)$ there exists $M>0$ such that $\partial_c\psi_n(K)\subset B(0,M)$ for large enough $n$): Assume this fails for a compact $K\subset\operatorname{supp}(Q)$. Since $\operatorname{supp}(Q)$ is open, there exists $\varepsilon>0$ such that
\[
K_\varepsilon:=\{x:d(x,K)\le\varepsilon\}\subset\subset\operatorname{supp}(Q).
\]
By Step 2 there exist $M>0$ and $n_0\in\mathbb{N}$ such that $|\psi_n(k)|\le M$ for all $k\in K_\varepsilon$ and $n\ge n_0$. Now we can take $\{k_n\}_{n\in\mathbb{N}}\subset K$ and $y_n\in\partial_c\psi_n(k_n)$ such that $|y_n|\to\infty$; define $v_n:=k_n-y_n$ and observe that for $n$ big enough $|v_n|>1$. Define $\xi_n:=1-\frac{\varepsilon}{|v_n|}$ and note that $\xi_n\to1$. All $k_n$ belong to the compact set $K$; hence define
\[
z_n:=k_n+(\xi_n-1)v_n=k_n-\frac{\varepsilon}{|v_n|}v_n\in K_\varepsilon,
\]
for which we can ensure $\psi_n(z_n)>-M$. By definition of superdifferentials we have
\[
2M\ge\psi_n(k_n)-\psi_n(z_n)\ge h(v_n)-h(\xi_n v_n),
\]
and by convexity of $h$ there exists $s_n\in\partial h(\xi_n v_n)$ for which we have
\[
2M\ge\langle(1-\xi_n)v_n,s_n\rangle=\varepsilon\Big\langle\frac{v_n}{|v_n|},s_n\Big\rangle. \quad (5.16)
\]
Observe that we also have $h(0)\ge h(\xi_n v_n)+\langle-\xi_n v_n,s_n\rangle$. Now, since $\xi_n>1-\varepsilon>0$ (we may assume $\varepsilon<1$) and $|v_n|\to\infty$, we have $|\xi_n v_n|\to\infty$ and, consequently,
\[
\liminf_{n\to\infty}\Big\langle\frac{v_n}{|v_n|},s_n\Big\rangle\ge\liminf_{n\to\infty}\frac{h(\xi_n v_n)-h(0)}{|\xi_n v_n|}=\infty, \quad (5.17)
\]
with the last limit following from condition (A3). This contradicts (5.16). $\square$

Proof of Lemma 4.1.
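As an aside, the proof culminates in a bound of order $n^{-2}$ on the mean squared increment of the empirical transport cost under resampling of a single observation. This order can be checked numerically in dimension one, where the quadratic-cost transport cost to $Q=\mathrm{Uniform}(0,1)$ has a closed form through quantiles. The sketch below (function names hypothetical, quadratic cost only, not the paper's general setting) estimates $\mathbb{E}(Z-Z')^2$ by Monte Carlo and checks that $n^2\,\mathbb{E}(Z-Z')^2$ stays bounded:

```python
import random

def w22_to_uniform(xs):
    """Exact quadratic-cost transport cost from the empirical measure of xs
    to Q = Uniform(0,1), via the quantile representation: the empirical
    quantile function equals the i-th order statistic on ((i-1)/n, i/n]."""
    xs = sorted(xs)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        a, b = i / n, (i + 1) / n
        # closed form of the integral of (x - u)^2 over (a, b)
        total += ((x - a) ** 3 - (x - b) ** 3) / 3
    return total

def mean_sq_increment(n, reps=300, seed=1):
    """Monte Carlo estimate of E (Z - Z')^2, where Z' is the cost after
    resampling a single observation (the Efron-Stein perturbation)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        z = w22_to_uniform(xs)
        ys = xs[:]
        ys[0] = rng.random()
        acc += (z - w22_to_uniform(ys)) ** 2
    return acc / reps

# a bound of order n^{-2} forces n^2 * E(Z - Z')^2 to stay bounded;
# in this one-dimensional example it even decays with n
scaled = [n ** 2 * mean_sq_increment(n) for n in (25, 50, 100)]
```

The decay of `scaled` is consistent with the bound of the lemma being conservative in this smooth one-dimensional example.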
We write $X'$ for a random variable with law $P$, independent of $(X_1,\dots,X_n)$. Denote by $P'_n$ the empirical measure associated to $(X',X_2,\dots,X_n)$ and $Z':=\mathcal{T}_c(P'_n,Q)$. Since $Q\ll\ell_d$, there exists an optimal transport map from $Q$ to $P'_n$, which we denote by $T$. We set
\[
C'_1:=\{y\in\mathbb{R}^d:T(y)=X'\},\qquad C'_i:=\{y\in\mathbb{R}^d:T(y)=X_i\},\ i\ge2,
\]
and observe that $Q(C'_i)=\frac1n$ and
\[
Z'=\int c(x,y)\,d\pi'(x,y)=\int_{C'_1}c(X',y)\,dQ(y)+\sum_{i=2}^n\int_{C'_i}c(X_i,y)\,dQ(y),
\qquad
Z\le\int_{C'_1}c(X_1,y)\,dQ(y)+\sum_{i=2}^n\int_{C'_i}c(X_i,y)\,dQ(y).
\]
From this we see that (recall that $h(X_1-\cdot)$ is convex and $Q$-a.s. differentiable)
\[
Z-Z'\le\int_{C'_1}\big(c(X_1,y)-c(X',y)\big)\,dQ(y)\le\int_{C'_1}\big\langle\nabla h(X_1-y),X_1-X'\big\rangle\,dQ(y)\le|X_1-X'|\int_{C'_1}|\nabla h(X_1-y)|\,dQ(y).
\]
Hence, by Hölder's inequality, for any pair $(q_1,q_2)$ of conjugate exponents ($\tfrac1{q_1}+\tfrac1{q_2}=1$) as in the statement of the lemma,
\[
\mathbb{E}\big(Z-Z'\big)_+^2\le\mathbb{E}\Big\{|X_1-X'|^2\Big(\int_{C'_1}|\nabla h(X_1-y)|\,dQ(y)\Big)^2\Big\}
\le\Big(\mathbb{E}|X_1-X'|^{2q_1}\Big)^{\frac1{q_1}}\Big(\mathbb{E}\Big(\int_{C'_1}|\nabla h(X_1-y)|\,dQ(y)\Big)^{2q_2}\Big)^{\frac1{q_2}}. \quad (5.18)
\]
Using again Hölder's inequality we get that
\[
\int_{C'_1}|\nabla h(X_1-y)|\,dQ(y)\le\Big(\int_{C'_1}dQ(y)\Big)^{\frac{2q_2-1}{2q_2}}\Big(\int_{C'_1}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big)^{\frac1{2q_2}}
=n^{-\frac{2q_2-1}{2q_2}}\Big(\int_{C'_1}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big)^{\frac1{2q_2}}.
\]
Finally, by exchangeability (note that $X_1$ is independent of the partition $\{C'_i\}_{i=1}^n$, whose cells are exchangeable),
\[
\mathbb{E}\Big\{\int_{C'_1}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big\}=\frac1n\,\mathbb{E}\Big\{\sum_{i=1}^n\int_{C'_i}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big\}=\frac1n\,\mathbb{E}\Big\{\int_{\mathbb{R}^d}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big\},
\]
which implies that
\[
\mathbb{E}\Big(\int_{C'_1}|\nabla h(X_1-y)|\,dQ(y)\Big)^{2q_2}\le n^{-2q_2}\,\mathbb{E}\Big(\int_{\mathbb{R}^d}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big).
\]
Combining the last estimates with (5.18) leads to
\[
\mathbb{E}\big(Z-Z'\big)_+^2\le\frac1{n^2}\Big(\mathbb{E}|X_1-X'|^{2q_1}\Big)^{\frac1{q_1}}\Big(\mathbb{E}\Big(\int_{\mathbb{R}^d}|\nabla h(X_1-y)|^{2q_2}\,dQ(y)\Big)\Big)^{\frac1{q_2}}. \qquad\square
\]

References
[1] Acosta, G. and Durán, R. G. (2004). An optimal Poincaré inequality in L^1 for convex domains. Proc. Amer. Math. Soc. 132, 195–202.
[2] Ajtai, M., Komlós, J. and Tusnády, G. (1984). On optimal matchings. Combinatorica 4, 259–264.
[3] Ambrosio, L., Stra, F. and Trevisan, D. (2019). A PDE approach to a 2-dimensional matching problem. Probab. Theory Relat. Fields 173, 433–477.
[4] Avron, D. (1965). Solutions in the large for multi-dimensional non linear partial differential equations of first order. Annales de l'institut Fourier 15(2), 1–35.
[5] Bachoc, F., Gamboa, F., Loubes, J.-M. and Venet, N. (2017). A Gaussian process regression model for distribution inputs. IEEE Transactions on Information Theory 64(10), 6620–6637.
[6] Berthet, P., Fort, J.-C. and Klein, T. (2017). A central limit theorem for Wasserstein type distances between two different laws. HAL preprint hal-01624786v2.
[7] Black, E., Yeom, S. and Fredrikson, M. (2020). FlipTest: fairness testing via optimal transport. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 111–121.
[8] Bonnans, J. F. and Shapiro, A. (2000). Perturbation Analysis of Optimization Problems. Springer, New York.
[9] Bobkov, S. and Ledoux, M. (2019). One-dimensional empirical measures, order statistics and Kantorovich transport distances. Memoirs Amer. Math. Soc. 261, no. 1259.
[10] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
[11] Brezis, H. (2011). Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer, New York.
[12] Courty, N., Flamary, R. and Ducoffe, M. (2018). Learning Wasserstein embeddings. https://openreview.net/forum?id=SJyEH91A-.
[13] Cuesta-Albertos, J. A., Matrán, C. and Tuero-Díaz, A. (1997). Optimal transportation plans and convergence in distribution. J. Multivariate Anal. 60, 72–83.
[14] Cuturi, M. and Peyré, G. (2019). Special issue on optimal transport in data sciences. Information and Inference: A Journal of the IMA 8(4).
[15] Cuturi, M. and Peyré, G. (2019). Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning 11(5–6), 355–607.
[16] del Barrio, E., Giné, E. and Matrán, C. (1999). Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab. 27, 1009–1071.
[17] del Barrio, E., Giné, E. and Utzet, F. (2005). Asymptotics for L_2 functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli 11, 131–189.
[18] del Barrio, E., Gordaliza, P. and Loubes, J.-M. (2019). A central limit theorem for L_p transportation cost on the real line with application to fairness assessment in machine learning. Information and Inference: A Journal of the IMA 8(4), 817–849.
[19] del Barrio, E. and Loubes, J.-M. (2019). Central limit theorems for empirical transportation cost in general dimension. Ann. Probab. 47, 926–951.
[20] Evans, L. C. (2010). Partial Differential Equations. American Mathematical Society.
[21] Federer, H. (1969). Geometric Measure Theory. Springer.
[22] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162, 707–738.
[23] Gangbo, W. and McCann, R. J. (1996). The geometry of optimal transportation. Acta Math. 177, no. 2, 113–161.
[24] Gordaliza, P., del Barrio, E., Gamboa, F. and Loubes, J.-M. (2019). Obtaining fairness using optimal transport theory. International Conference on Machine Learning, 2357–2365.
[25] Ledoux, M. (2019). On optimal matching of Gaussian samples. Journal of Mathematical Sciences 238, 495–522.
[26] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer.
[27] Mena, G. and Niles-Weed, J. (2019). Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. Advances in Neural Information Processing Systems 32, 4541–4551.
[28] Rockafellar, R. T. (1966). Characterization of the subdifferentials of convex functions. Pacific J. Math. 17, no. 3, 497–510.
[29] Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press.
[30] Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational Analysis. Springer Science and Business Media.
[31] Rüschendorf, L. (1996). On c-optimal random variables. Statistics and Probability Letters 27(3).
[32] Rüschendorf, L. (1995). Optimal solutions of multivariate coupling problems. Appl. Math. 23, no. 3, 325–338.
[33] Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser.
[34] Schiebinger, G. et al. (2019). Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176(4), 928–943.e22.
[35] Smith, C. and Knott, M. (1992). On Hoeffding–Fréchet bounds and cyclic monotone relations. Journal of Multivariate Analysis 40, 328–334.
[36] Sommerfeld, M. and Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society, Series B 80, 219–238.
[37] Talagrand, M. (1992). Matching random samples in many dimensions. Ann. Appl. Probab. 2, 846–856.
[38] Talagrand, M. (1994). The transportation cost from the uniform measure to the empirical measure in dimension ≥ 3. Ann. Probab. 22, 919–959.
[39] Talagrand, M. (2018). Scaling and non-standard matching theorems. Comptes Rendus Mathématique 356, 692–695.
[40] Talagrand, M. and Yukich, J. E. (1993). The integrability of the square exponential transportation cost. Ann. Appl. Probab. 3, 1100–1111.
[41] Tameling, C., Sommerfeld, M. and Munk, A. (2019). Empirical optimal transport on countable metric spaces: distributional limits and statistical applications. Ann. Appl. Probab. 29(5), 2744–2781.
[42] Villani, C. (2008). Optimal Transport: Old and New. Springer Science and Business Media.
[43] Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.