A short proof on the rate of convergence of the empirical measure for the Wasserstein distance
arXiv preprint [math.ST]
Vincent Divol
Université Paris-Saclay and Inria Saclay, France
Abstract
We provide a short proof that the Wasserstein distance between the empirical measure of an $n$-sample and the estimated measure is of order $n^{-1/d}$, if the measure has a lower and upper bounded density on the $d$-dimensional flat torus.

Email address: [email protected] (Vincent Divol)

For $1 \le p < \infty$, let $W_p$ be the $p$-Wasserstein distance between measures, defined for two probability measures $\mu, \nu$ with finite $p$th moments supported on a metric space $(\Omega, \rho)$ by
$$
W_p(\mu, \nu) := \inf_{\pi \in \Pi(\mu,\nu)} C_p(\pi)^{1/p}, \qquad (1)
$$
where $\Pi(\mu,\nu)$ is the set of transport plans between $\mu$ and $\nu$, that is, the set of probability measures on $\Omega \times \Omega$ with first marginal $\mu$ and second marginal $\nu$, and $C_p(\pi) = \iint \rho(x,y)^p \,\mathrm{d}\pi(x,y)$ is the cost of the plan $\pi$. We define the distance $W_\infty$ by replacing the quantity $C_p(\pi)^{1/p}$ by the $\pi$-essential supremum of $\rho$.

Let $\mu$ be a probability measure on some metric space $(\Omega, \rho)$, and let $\mu_n$ be the empirical measure associated with an $n$-sample $X_1, \dots, X_n$ of law $\mu$. The question of the rate of convergence of $\mu_n$ to $\mu$ for the Wasserstein distances $W_p$ has attracted a lot of attention over recent years (see e.g. [5, 6]). If no bounds on the density are assumed, then the quantity $\mathbb{E} W_p(\mu_n, \mu)$ is known to be bounded by a quantity of order $n^{-1/(2p)} + n^{-1/d}$ when $\Omega$ is a $d$-dimensional domain, and this bound is tight (see e.g. [5]). For $p = \infty$, Nicolás García Trillos and Dejan Slepčev [6] have shown that $\mathbb{E} W_\infty(\mu_n, \mu)$ is of order $(\log n / n)^{1/d}$ (for $d \ge 3$) in the case where $\mu$ has a density $f$ which is lower bounded and upper bounded on some convex domain $\Omega$. As $W_p \le W_\infty$, the same rate also holds for any $1 \le p \le \infty$. This exhibits the following phenomenon: when $2p > d$, the problem of reconstructing $\mu$ for the Wasserstein distance is strictly harder if no bounds on the underlying density are assumed.

In this note, we propose to give a short proof of the fact that $\mathbb{E} W_p(\mu_n, \mu) \lesssim n^{-1/d}$ (for $d \ge 3$) for bounded densities. We restrict ourselves to the case where $\Omega$ is the $d$-dimensional flat torus in order to avoid complications due to boundary effects. Let $\mathcal{P}$ be the set of probability distributions on $\Omega$ having a density $f$ satisfying $f_{\min} \le f \le f_{\max}$ for some $f_{\max} \ge f_{\min} > 0$.
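As a quick numerical illustration of the object under study (not part of the argument), the following sketch estimates $\mathbb{E} W_1(\mu_n, \mu)$ for $\mu$ uniform, on the unit interval rather than the torus and ignoring boundary effects, using the one-dimensional identity $W_1(\mu_n, \mu) = \int_0^1 |F_n(t) - t| \,\mathrm{d}t$. The sample sizes and the grid resolution are arbitrary choices; the observed decay is consistent with the $n^{-1/2}$ rate for $d = 1$.

```python
import numpy as np

def w1_empirical_vs_uniform(sample, grid_size=10_000):
    """W_1 between the empirical measure of `sample` and Uniform[0,1],
    via W_1(mu_n, mu) = integral over [0,1] of |F_n(t) - t| dt."""
    t = np.linspace(0.0, 1.0, grid_size)
    f_n = np.searchsorted(np.sort(sample), t, side="right") / len(sample)
    return np.mean(np.abs(f_n - t))  # Riemann sum on the uniform grid

rng = np.random.default_rng(0)
# average over 20 repetitions to smooth out the randomness
w100 = np.mean([w1_empirical_vs_uniform(rng.uniform(size=100)) for _ in range(20)])
w10000 = np.mean([w1_empirical_vs_uniform(rng.uniform(size=10_000)) for _ in range(20)])
# multiplying n by 100 should divide W_1 by roughly 10
```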
Theorem. Let $\mu \in \mathcal{P}$ and $1 \le p < \infty$. Then, there exists a constant $C$ such that
$$
\mathbb{E} W_p(\mu_n, \mu) \le C \begin{cases} n^{-1/d} & \text{if } d \ge 3, \\ (\log n)^{1/2} n^{-1/2} & \text{if } d = 2, \\ n^{-1/2} & \text{if } d = 1. \end{cases} \qquad (2)
$$

The standard approach for bounding the distance $W_p(\mu_n, \mu)$ consists in precisely assessing the masses given by the measures $\mu_n$ and $\mu$ on dyadic partitions of the domain $\Omega$ (see e.g. [6]). We propose to take a different route, relying on a result from [3] which asserts that the Wasserstein distance is controlled by the pointed negative Sobolev distance when comparing measures having lower bounded densities. The proof is then completed by using tools from Fourier analysis.

We also note that the minimax results from [7] (proven for measures on the cube) can be straightforwardly adapted to the setting of the flat torus. In particular, those results imply that the rates exhibited in the theorem are optimal on the class $\mathcal{P}$ (up to a logarithmic factor for $d = 2$).

The proof

As $W_p \ge W_q$ if $p \ge q$, we may assume that $p \ge 2$. The proof of the theorem is heavily based on the following result of optimal transport theory, appearing in [3, 2]. Let $p^*$ be the conjugate exponent of $p$. For $\phi \in L^p$ with $\int \phi = 0$, introduce the pointed negative Sobolev norm
$$
\|\phi\|_{\dot H^{-1,p}} := \sup\left\{ \int \phi\psi \;:\; \|\nabla\psi\|_{L^{p^*}} \le 1 \right\}, \qquad (3)
$$
where the supremum is taken over all smooth functions $\psi$ defined on $\Omega$.
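For $p = 2$ (anticipating Remark 1), the norm (3) has a closed Fourier form: by Parseval, $\|\phi\|_{\dot H^{-1,2}}^2 = \sum_{m \ne 0} |\hat\phi(m)|^2 / (2\pi|m|)^2$, under the convention that the torus is $[0,1)^d$ with characters $e^{2\pi i m \cdot x}$ (a convention of mine; the note does not fix one). A one-dimensional numerical sketch:

```python
import numpy as np

def sobolev_minus_one_norm(phi_values):
    """Pointed H^{-1} norm (p = 2) of a mean-zero function sampled on a
    uniform grid of [0, 1), via Fourier coefficients and Parseval."""
    n = len(phi_values)
    coeffs = np.fft.fft(phi_values) / n        # Fourier coefficients phi_hat(m)
    m = np.fft.fftfreq(n, d=1.0 / n)           # integer frequencies
    mask = m != 0                              # drop the mean (phi_hat(0) = 0)
    return np.sqrt(np.sum(np.abs(coeffs[mask]) ** 2 / (2 * np.pi * m[mask]) ** 2))

x = np.arange(256) / 256
val = sobolev_minus_one_norm(np.cos(2 * np.pi * x))
# for cos(2*pi*x) the exact value is 1 / (2*pi*sqrt(2))
```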
Lemma 1. Let $\mu, \nu$ be two measures on $\Omega$ having densities $f, g$. Assume that $f \ge f_{\min}$. Then,
$$
W_p(\mu, \nu) \le p f_{\min}^{1/p - 1} \|f - g\|_{\dot H^{-1,p}}. \qquad (4)
$$

Let $K$ be a smooth radial nonnegative function with $\int K = 1$, supported on the unit ball and, for $h > 0$, let $K_h = h^{-d} K(\cdot/h)$. Let $\mu_{n,h}$ be the measure having density $K_h * \mu_n$ on $\Omega$, i.e. the density at a point $x \in \Omega$ is given by $f_{n,h}(x) := \sum_{j=1}^n K_h(x - X_j)/n$.
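The smoothed density $f_{n,h}$ is easy to write down explicitly. A sketch on the one-dimensional torus $[0,1)$, with the Epanechnikov kernel $K(x) = \tfrac34(1 - x^2)_+$ standing in for $K$ (it is supported on the unit ball and integrates to 1, but is not smooth, so it is only illustrative):

```python
import numpy as np

def epanechnikov(x):
    """Stand-in kernel: nonnegative, supported on [-1, 1], integral 1."""
    return 0.75 * np.maximum(1.0 - x ** 2, 0.0)

def f_nh(x, sample, h):
    """Smoothed empirical density f_{n,h}(x) = (1/n) sum_j K_h(x - X_j)
    on the 1-d torus, using the periodic representative of x - X_j."""
    diff = (x[:, None] - sample[None, :] + 0.5) % 1.0 - 0.5   # in [-0.5, 0.5)
    return np.mean(epanechnikov(diff / h) / h, axis=1)

rng = np.random.default_rng(1)
sample = rng.uniform(size=500)
grid = np.arange(1024) / 1024
dens = f_nh(grid, sample, h=0.05)
# dens is a probability density: nonnegative, integrating to ~1 on the torus
```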
Lemma 2. We have $W_p(\mu_n, \mu_{n,h}) \le C_0 h$, where $C_0 = \left(\int |x|^p K(x)\,\mathrm{d}x\right)^{1/p}$.

Proof. Consider the unique transport plan $\pi_j$ between $K_h * \delta_{X_j}$ and $\delta_{X_j}$. The cost of $\pi_j$ is equal to $\int |x - X_j|^p K_h(x - X_j)\,\mathrm{d}x = h^p \int |x|^p K(x)\,\mathrm{d}x$. The measure $\frac{1}{n}\sum_{j=1}^n \pi_j$ is a transport plan between $\mu_{n,h}$ and $\mu_n$, with associated cost equal to $h^p \int |x|^p K(x)\,\mathrm{d}x$.

By Lemmas 1 and 2,
$$
\mathbb{E} W_p(\mu_n, \mu) \le \mathbb{E} W_p(\mu_n, \mu_{n,h}) + \mathbb{E} W_p(\mu_{n,h}, \mu) \le C_0 h + p f_{\min}^{1/p - 1}\, \mathbb{E}\|f_{n,h} - f\|_{\dot H^{-1,p}}. \qquad (5)
$$

To further bound this quantity, we use the following relation between the negative Sobolev norm and the Fourier decomposition of a signal. Given $\phi \in L^p$, we let $\hat\phi$ be the sequence of Fourier coefficients of $\phi$ (indexed by $\mathbb{Z}^d$) and denote by $\vee$ the inverse Fourier transform. Let $|x| := \sum_{i=1}^d |x_i|$ for $x \in \mathbb{R}^d$. A multiplier $s$ is a bounded sequence indexed by $\mathbb{Z}^d$ such that the operator $\phi \in L^p \mapsto (s\hat\phi)^\vee \in L^p$ is bounded. A sufficient condition for a sequence to be a multiplier is given by the Mikhlin multiplier theorem [1, Theorem 3.6.7, Theorem 5.2.7].
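A multiplier acts diagonally on Fourier coefficients, which makes it concrete on a grid. For instance, the sign sequence $\varepsilon(m) = \operatorname{sign}(m)$ (a multiplier by Lemma 3; it reappears in the proof of Lemma 4) sends $\cos(2\pi k x)$ to $i\sin(2\pi k x)$. A one-dimensional FFT sketch:

```python
import numpy as np

def apply_multiplier(phi_values, s_of_m):
    """Apply the multiplier operator phi -> (s * phi_hat)^vee on a
    uniform grid of [0, 1): multiply Fourier coefficients pointwise."""
    n = len(phi_values)
    m = np.fft.fftfreq(n, d=1.0 / n)            # integer frequencies
    return np.fft.ifft(s_of_m(m) * np.fft.fft(phi_values))

x = np.arange(256) / 256
out = apply_multiplier(np.cos(2 * np.pi * 3 * x), np.sign)
# the sign multiplier maps cos(2*pi*3*x) to i * sin(2*pi*3*x)
```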
Lemma 3. Let $s : \mathbb{R}^d \to \mathbb{R}$ be a smooth function such that $|\partial^\alpha s(\xi)| \le B |\xi|^{-|\alpha|}$ for every multi-index $\alpha$ with $|\alpha| \le \lfloor d/2 \rfloor + 1$. Then, the sequence $(s(m))_{m \in \mathbb{Z}^d}$ is a multiplier with corresponding operator of norm smaller than $C_{p,d} B$.

Let $a : \mathbb{R}^d \to \mathbb{R}$ be a smooth function with $a(\xi) = 1/|\xi|$ for $|\xi| \ge 1$ and $a(0) = 0$. Let $A$ be the associated multiplier operator (by Lemma 3), defined by $A(\phi) = (a\hat\phi)^\vee$.
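The operator $A$ is equally concrete: multiply the Fourier coefficients by $a(m) = 1/|m|$ for $m \ne 0$ and invert. In a one-dimensional sketch (where the $\ell^1$ norm $|m|$ is just the absolute value), a single mode $\cos(2\pi k x)$ is simply divided by $k$:

```python
import numpy as np

def apply_A(phi_values):
    """Apply A(phi) = (a * phi_hat)^vee on a uniform grid of [0, 1),
    with a(m) = 1/|m| for m != 0 and a(0) = 0."""
    n = len(phi_values)
    coeffs = np.fft.fft(phi_values)
    m = np.fft.fftfreq(n, d=1.0 / n)            # integer frequencies
    a = np.zeros(n)
    a[m != 0] = 1.0 / np.abs(m[m != 0])
    return np.real(np.fft.ifft(a * coeffs))

x = np.arange(256) / 256
out = apply_A(np.cos(2 * np.pi * 4 * x))
# A maps cos(2*pi*4*x) to cos(2*pi*4*x) / 4
```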
Lemma 4. Let $\phi \in L^p$ with $\int \phi = 0$. Then, $\|\phi\|_{\dot H^{-1,p}} \le C \|A(\phi)\|_{L^p}$.

Proof. Let $\psi : \Omega \to \mathbb{R}$ be a smooth function with $\|\nabla\psi\|_{L^{p^*}} \le 1$. As $\hat\phi(0) = 0$, we have
$$
\int \phi\psi = \sum_{m \in \mathbb{Z}^d} \hat\phi(m)\hat\psi(m) = \sum_{m \in \mathbb{Z}^d} a(m)\hat\phi(m)\, |m|\hat\psi(m) \le \|A(\phi)\|_{L^p} \|(|\cdot|\hat\psi)^\vee\|_{L^{p^*}}.
$$
Note that $|\cdot| = \sum_{i=1}^d \varepsilon_i e_i$, where $e_i(m) = m_i$ and $\varepsilon_i(m)$ is the sign of $m_i$. As $\varepsilon_i$ is a multiplier (by Lemma 3), we have $\|(|\cdot|\hat\psi)^\vee\|_{L^{p^*}} \le c \sum_{i=1}^d \|(e_i\hat\psi)^\vee\|_{L^{p^*}} = c \sum_{i=1}^d \|\partial_i\psi\|_{L^{p^*}} \le C$.

Hence, to conclude, it suffices to bound, with $f_h := K_h * f$,
$$
\mathbb{E}\|A(f_{n,h} - f)\|_{L^p} \le \|A(f_h - f)\|_{L^p} + \mathbb{E}\|A(f_{n,h} - f_h)\|_{L^p}.
$$
Bound of the bias. Let $\kappa$ be the Fourier transform of $K$. As $K$ is smooth and compactly supported, $\kappa$ is a multiplier by Lemma 3. Also, the function $M = a \cdot (\kappa - 1)$ is a multiplier, as a product of multipliers. Remark that $\hat f_h - \hat f = (\kappa(h\cdot) - 1)\hat f$, so that $A(f_h - f) = h (M(h\cdot)\hat f)^\vee$. As the multiplier norms of $M$ and $M(h\cdot)$ are equal [1, Theorem 3.6.7], we have
$$
\|A(f_h - f)\|_{L^p} \le h C \|f\|_{L^p} \le h C f_{\max}. \qquad (6)
$$
Bound of the fluctuations. Eventually, we bound
$$
\mathbb{E}\|A(f_{n,h} - f_h)\|_{L^p} \le \mathbb{E}\left[\|A(f_{n,h} - f_h)\|_{L^p}^p\right]^{1/p}. \qquad (7)
$$
The random variable $A(f_{n,h})$ is equal to $n^{-1}\sum_{j=1}^n U_j$, where $U_j := A(K_h * \delta_{X_j}) = A(K_h)(\cdot - X_j)$ and $\mathbb{E} U_j = A(f_h)$. We control the expectation of the $L^p$-norm of the sum of i.i.d. centered functions thanks to the next lemma, which is a direct consequence of the Rosenthal inequality [4].

Lemma 5. Let $U_1, \dots, U_n$ be i.i.d. functions in $L^p$. Then, the expectation $\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n (U_i - \mathbb{E} U_i)\right\|_{L^p}^p$ is smaller than
$$
C_p n^{-p/2} \int \left(\mathbb{E}|U_1(x)|^2\right)^{p/2} \mathrm{d}x + C_p n^{1-p} \int \mathbb{E}\left[|U_1(x)|^p\right] \mathrm{d}x. \qquad (8)
$$

Let $v_h$ be the sequence in $\ell^{p^*}(\mathbb{Z}^d)$ defined by $v_h(m) = a(m)\kappa(hm)$ for $m \in \mathbb{Z}^d$. By a change of variable, we obtain
$$
\mathbb{E}[|U_1(x)|^p] = \int f(y) |A(K_h)(x - y)|^p \,\mathrm{d}y \le f_{\max} \|A(K_h)\|_{L^p}^p \le f_{\max} \|v_h\|_{\ell^{p^*}}^p, \qquad (9)
$$
where, at the last line, we applied the Hausdorff–Young inequality [8, Section XII.2]. The last step consists in bounding $\|v_h\|_{\ell^{p^*}}^{p^*}$. We separate this quantity into two parts: $S_1 = \sum_{|hm| \le 1} |v_h(m)|^{p^*}$ and $S_2 = \sum_{|hm| > 1} |v_h(m)|^{p^*}$. To bound $S_1$, we use that $\kappa$ is bounded on the unit ball, so that $S_1$ is of the order
$$
\sum_{0 < |hm| \le 1} |m|^{-p^*} \lesssim \begin{cases} h^{p^* - d} & \text{if } d \ge 3, \text{ or } d = 2 \text{ and } p > 2, \\ -\log h & \text{if } p = d = 2, \\ 1 & \text{if } d = 1. \end{cases} \qquad (10)
$$
To bound $S_2$, we use that $|\kappa(hm)| \le C_\gamma |hm|^{-\gamma}$ for any $\gamma > 0$. Choosing $\gamma$ such that $\gamma p^* + p^* > d$, we obtain that $S_2$ is of the order
$$
h^{-\gamma p^*} \sum_{|hm| > 1} |m|^{-\gamma p^* - p^*} \lesssim h^{p^* - d}. \qquad (11)
$$
Putting together inequalities (8), (10) and (11) yields that, for $h$ of the order $n^{-1/d}$, the expectation $\mathbb{E}\|A(f_{n,h} - f_h)\|_{L^p}$ is of the order
$$
\begin{cases} h/\sqrt{n h^d} \lesssim n^{-1/d} & \text{if } d \ge 3, \\ \sqrt{(-\log h)/n} \lesssim (\log n)^{1/2} n^{-1/2} & \text{if } d = 2, \\ n^{-1/2} & \text{if } d = 1. \end{cases} \qquad (12)
$$
We conclude the proof by putting together the estimates (5), (6) and (12).
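The lattice counting behind (10) can be sanity-checked numerically. For $d = 3$ and $p^* = 2$ (i.e. $p = 2$), the sum $\sum_{0 < |m| \le 1/h} |m|^{-2}$ over $\mathbb{Z}^3$ (with $|m|$ the $\ell^1$ norm, as in the text) should grow like $h^{2-3} = h^{-1}$, so halving $h$ should roughly double it. A brute-force sketch with arbitrary cutoffs $1/h = 10$ and $20$:

```python
import numpy as np

def s1(inv_h, d=3, p_star=2):
    """S_1 = sum over 0 < |m|_1 <= 1/h of |m|_1^{-p*}, brute force on Z^d."""
    r = np.arange(-inv_h, inv_h + 1)
    grids = np.meshgrid(*([r] * d), indexing="ij")
    norm1 = sum(np.abs(g) for g in grids)               # |m|_1 on the grid
    mask = (norm1 > 0) & (norm1 <= inv_h)
    return np.sum(norm1[mask].astype(float) ** (-p_star))

ratio = s1(20) / s1(10)
# order h^{p* - d} = h^{-1}: halving h should roughly double S_1
```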
Remark 1. For $p = 2$, the Mikhlin multiplier theorem can be replaced by Parseval's theorem, further simplifying the proof.
Remark 2. A similar proof shows that the risk of the measure $\mu_{n,h}$ satisfies $\mathbb{E} W_p(\mu_{n,h}, \mu) \lesssim n^{-(s+1)/(2s+d)}$ if $f$ is assumed to be of regularity $s$. Indeed, we can exploit the regularity of $f$ to show that, if $\kappa$ has sufficiently many zero derivatives at $0$, then the bias term is of order $h^{s+1}$, while the fluctuation term is bounded in the same way. We then obtain the desired rate by choosing $h$ of the order $n^{-1/(2s+d)}$. This rate is in accordance with the minimax result of [7], where a modified wavelet density estimator is shown to attain the same rate of convergence.

References

[1] Loukas Grafakos. Classical Fourier Analysis, volume 2. Springer.
[2] Sloan Nietert, Ziv Goldfeld, and Kengo Kato. From smooth Wasserstein distance to dual Sobolev norm: Empirical approximation and statistical applications, 2021.
[3] Rémi Peyre. Comparison between $W_2$ distance and $\dot H^{-1}$ norm, and localization of Wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4), 2018.
[4] Haskell P. Rosenthal. On the subspaces of $L^p$ ($p > 2$) spanned by sequences of independent random variables. Israel Journal of Mathematics, 8(3):273–303, 1970.
[5] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.
[6] Nicolás García Trillos and Dejan Slepčev. On the rate of convergence of empirical measures in ∞-transportation distance. Canadian Journal of Mathematics, 67(6):1358–1383, 2015.
[7] Jonathan Weed and Quentin Berthet. Estimation of smooth densities in Wasserstein distance. In Conference on Learning Theory, pages 3118–3119, 2019.
[8] A. Zygmund and R. Fefferman. Trigonometric Series. Cambridge University Press.