A short proof on the rate of convergence of the empirical measure for the Wasserstein distance
arXiv preprint [math.ST]
Vincent Divol
Université Paris-Saclay and Inria Saclay, France
Abstract
We provide a short proof that the Wasserstein distance between the empirical measure of an $n$-sample and the estimated measure is of order $n^{-1/d}$, if the measure has a lower and upper bounded density on the $d$-dimensional flat torus.

Email address: [email protected] (Vincent Divol)

For $1 \le p < \infty$, let $W_p$ be the $p$-Wasserstein distance between measures, defined for two probability measures $\mu, \nu$ with finite $p$th moments supported on a metric space $(\Omega, \rho)$ by
$$
W_p(\mu, \nu) := \inf_{\pi \in \Pi(\mu,\nu)} C_p(\pi)^{1/p}, \qquad (1)
$$
where $\Pi(\mu,\nu)$ is the set of transport plans between $\mu$ and $\nu$, that is, the set of probability measures on $\Omega \times \Omega$ with first marginal $\mu$ and second marginal $\nu$, and $C_p(\pi) = \iint \rho(x,y)^p \,\mathrm{d}\pi(x,y)$ is the cost of the plan $\pi$. We define the distance $W_\infty$ by replacing the quantity $C_p(\pi)^{1/p}$ by the $\pi$-essential supremum of $\rho$.

Let $\mu$ be a probability measure on some metric space $(\Omega, \rho)$, and let $\mu_n$ be the empirical measure associated with an $n$-sample $X_1, \dots, X_n$ of law $\mu$. The question of the rate of convergence of $\mu_n$ to $\mu$ for the Wasserstein distances $W_p$ has attracted a lot of attention over recent years (see e.g. [5, 6]). If no bounds on the density are assumed, then the quantity $\mathbb{E} W_p(\mu_n, \mu)$ is known to be bounded by a quantity of order $n^{-1/(2p)} + n^{-1/d}$ when $\Omega$ is a $d$-dimensional domain, and this bound is tight (see e.g. [5]). For $p = \infty$, Nicolás García Trillos and Dejan Slepčev [6] have shown that $\mathbb{E} W_\infty(\mu_n, \mu)$ is of order $(\log n / n)^{1/d}$ (for $d \ge 3$) in the case where $\mu$ has a density $f$ which is lower bounded and upper bounded on some convex domain $\Omega$. As $W_p \le W_\infty$, the same rate also holds for any $1 \le p \le \infty$. This exhibits the following phenomenon: when $2p > d$, the problem of reconstructing $\mu$ for the Wasserstein distance is strictly harder if no bounds on the underlying density are assumed.

In this note, we propose to give a short proof of the fact that $\mathbb{E} W_p(\mu_n, \mu) \lesssim n^{-1/d}$ (for $d \ge 3$) for bounded densities. We restrict ourselves to the case where $\Omega$ is the $d$-dimensional flat torus in order to avoid complications due to boundary effects. Let $\mathcal{P}$ be the set of probability distributions on $\Omega$ having a density $f$ satisfying $f_{\min} \le f \le f_{\max}$ for some $f_{\max} \ge f_{\min} > 0$.
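As a quick numerical illustration of the object under study (not part of the argument), the following sketch estimates $\mathbb{E} W_1(\mu_n, \mu)$ for $\mu$ uniform, on the unit interval rather than the torus and ignoring boundary effects, using the one-dimensional identity $W_1(\mu_n, \mu) = \int_0^1 |F_n(t) - t| \,\mathrm{d}t$. The sample sizes and the grid resolution are arbitrary choices; the observed decay is consistent with the $n^{-1/2}$ rate for $d = 1$.

```python
import numpy as np

def w1_empirical_vs_uniform(sample, grid_size=10_000):
    """W_1 between the empirical measure of `sample` and Uniform[0,1],
    via W_1(mu_n, mu) = integral over [0,1] of |F_n(t) - t| dt."""
    t = np.linspace(0.0, 1.0, grid_size)
    f_n = np.searchsorted(np.sort(sample), t, side="right") / len(sample)
    return np.mean(np.abs(f_n - t))  # Riemann sum on the uniform grid

rng = np.random.default_rng(0)
# average over 20 repetitions to smooth out the randomness
w100 = np.mean([w1_empirical_vs_uniform(rng.uniform(size=100)) for _ in range(20)])
w10000 = np.mean([w1_empirical_vs_uniform(rng.uniform(size=10_000)) for _ in range(20)])
# multiplying n by 100 should divide W_1 by roughly 10
```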
Theorem. Let $\mu \in \mathcal{P}$ and $1 \le p < \infty$. Then, there exists a constant $C$ such that
$$
\mathbb{E} W_p(\mu_n, \mu) \le C \begin{cases} n^{-1/d} & \text{if } d \ge 3, \\ (\log n)^{1/2} n^{-1/2} & \text{if } d = 2, \\ n^{-1/2} & \text{if } d = 1. \end{cases} \qquad (2)
$$

The standard approach for bounding the distance $W_p(\mu_n, \mu)$ consists in precisely assessing the masses given by the measures $\mu_n$ and $\mu$ on dyadic partitions of the domain $\Omega$ (see e.g. [6]). We propose to take a different route, relying on a result from [3] which asserts that the Wasserstein distance is controlled by the pointed negative Sobolev distance when comparing measures having lower bounded densities. The proof is then completed by using tools from Fourier analysis.

We also note that the minimax results from [7] (proven for measures on the cube) can be straightforwardly adapted to the setting of the flat torus. In particular, those results imply that the rates exhibited in the theorem are optimal on the class $\mathcal{P}$ (up to a logarithmic factor for $d = 2$).

The proof

As $W_p \ge W_q$ if $p \ge q$, we may assume that $p \ge 2$. The proof of the theorem is heavily based on the following result of optimal transport theory, appearing in [3, 2]. Let $p^*$ be the conjugate exponent of $p$. For $\phi \in L^p$ with $\int \phi = 0$, introduce the pointed negative Sobolev norm
$$
\|\phi\|_{\dot H^{-1,p}} := \sup\left\{ \int \phi\psi \;:\; \|\nabla\psi\|_{L^{p^*}} \le 1 \right\}, \qquad (3)
$$
where the supremum is taken over all smooth functions $\psi$ defined on $\Omega$.
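For $p = 2$ (anticipating Remark 1), the norm (3) has a closed Fourier form: by Parseval, $\|\phi\|_{\dot H^{-1,2}}^2 = \sum_{m \ne 0} |\hat\phi(m)|^2 / (2\pi|m|)^2$, under the convention that the torus is $[0,1)^d$ with characters $e^{2\pi i m \cdot x}$ (a convention of mine; the note does not fix one). A one-dimensional numerical sketch:

```python
import numpy as np

def sobolev_minus_one_norm(phi_values):
    """Pointed H^{-1} norm (p = 2) of a mean-zero function sampled on a
    uniform grid of [0, 1), via Fourier coefficients and Parseval."""
    n = len(phi_values)
    coeffs = np.fft.fft(phi_values) / n        # Fourier coefficients phi_hat(m)
    m = np.fft.fftfreq(n, d=1.0 / n)           # integer frequencies
    mask = m != 0                              # drop the mean (phi_hat(0) = 0)
    return np.sqrt(np.sum(np.abs(coeffs[mask]) ** 2 / (2 * np.pi * m[mask]) ** 2))

x = np.arange(256) / 256
val = sobolev_minus_one_norm(np.cos(2 * np.pi * x))
# for cos(2*pi*x) the exact value is 1 / (2*pi*sqrt(2))
```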
Lemma 1. Let $\mu, \nu$ be two measures on $\Omega$ having densities $f, g$. Assume that $f \ge f_{\min}$. Then,
$$
W_p(\mu, \nu) \le p f_{\min}^{1/p - 1} \|f - g\|_{\dot H^{-1,p}}. \qquad (4)
$$

Let $K$ be a smooth radial nonnegative function with $\int K = 1$, supported on the unit ball and, for $h > 0$, let $K_h = h^{-d} K(\cdot/h)$. Let $\mu_{n,h}$ be the measure having density $K_h * \mu_n$ on $\Omega$, i.e. the density at a point $x \in \Omega$ is given by $f_{n,h}(x) := \sum_{j=1}^n K_h(x - X_j)/n$.
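The smoothed density $f_{n,h}$ is easy to write down explicitly. A sketch on the one-dimensional torus $[0,1)$, with the Epanechnikov kernel $K(x) = \tfrac34(1 - x^2)_+$ standing in for $K$ (it is supported on the unit ball and integrates to 1, but is not smooth, so it is only illustrative):

```python
import numpy as np

def epanechnikov(x):
    """Stand-in kernel: nonnegative, supported on [-1, 1], integral 1."""
    return 0.75 * np.maximum(1.0 - x ** 2, 0.0)

def f_nh(x, sample, h):
    """Smoothed empirical density f_{n,h}(x) = (1/n) sum_j K_h(x - X_j)
    on the 1-d torus, using the periodic representative of x - X_j."""
    diff = (x[:, None] - sample[None, :] + 0.5) % 1.0 - 0.5   # in [-0.5, 0.5)
    return np.mean(epanechnikov(diff / h) / h, axis=1)

rng = np.random.default_rng(1)
sample = rng.uniform(size=500)
grid = np.arange(1024) / 1024
dens = f_nh(grid, sample, h=0.05)
# dens is a probability density: nonnegative, integrating to ~1 on the torus
```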
Lemma 2. We have $W_p(\mu_n, \mu_{n,h}) \le C_0 h$, where $C_0 = \left(\int |x|^p K(x)\,\mathrm{d}x\right)^{1/p}$.

Proof. Consider the unique transport plan $\pi_j$ between $K_h * \delta_{X_j}$ and $\delta_{X_j}$. The cost of $\pi_j$ is equal to $\int |x - X_j|^p K_h(x - X_j)\,\mathrm{d}x = h^p \int |x|^p K(x)\,\mathrm{d}x$. The measure $\frac{1}{n}\sum_{j=1}^n \pi_j$ is a transport plan between $\mu_{n,h}$ and $\mu_n$, with associated cost equal to $h^p \int |x|^p K(x)\,\mathrm{d}x$.

By Lemmas 1 and 2,
$$
\mathbb{E} W_p(\mu_n, \mu) \le \mathbb{E} W_p(\mu_n, \mu_{n,h}) + \mathbb{E} W_p(\mu_{n,h}, \mu) \le C_0 h + p f_{\min}^{1/p - 1}\, \mathbb{E}\|f_{n,h} - f\|_{\dot H^{-1,p}}. \qquad (5)
$$

To further bound this quantity, we use the following relation between the negative Sobolev norm and the Fourier decomposition of a signal. Given $\phi \in L^p$, we let $\hat\phi$ be the sequence of Fourier coefficients of $\phi$ (indexed by $\mathbb{Z}^d$) and denote by $\vee$ the inverse Fourier transform. Let $|x| := \sum_{i=1}^d |x_i|$ for $x \in \mathbb{R}^d$. A multiplier $s$ is a bounded sequence indexed by $\mathbb{Z}^d$ such that the operator $\phi \in L^p \mapsto (s\hat\phi)^\vee \in L^p$ is bounded. A sufficient condition for a sequence to be a multiplier is given by the Mikhlin multiplier theorem [1, Theorem 3.6.7, Theorem 5.2.7].
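A multiplier acts diagonally on Fourier coefficients, which makes it concrete on a grid. For instance, the sign sequence $\varepsilon(m) = \operatorname{sign}(m)$ (a multiplier by Lemma 3; it reappears in the proof of Lemma 4) sends $\cos(2\pi k x)$ to $i\sin(2\pi k x)$. A one-dimensional FFT sketch:

```python
import numpy as np

def apply_multiplier(phi_values, s_of_m):
    """Apply the multiplier operator phi -> (s * phi_hat)^vee on a
    uniform grid of [0, 1): multiply Fourier coefficients pointwise."""
    n = len(phi_values)
    m = np.fft.fftfreq(n, d=1.0 / n)            # integer frequencies
    return np.fft.ifft(s_of_m(m) * np.fft.fft(phi_values))

x = np.arange(256) / 256
out = apply_multiplier(np.cos(2 * np.pi * 3 * x), np.sign)
# the sign multiplier maps cos(2*pi*3*x) to i * sin(2*pi*3*x)
```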
Lemma 3. Let $s : \mathbb{R}^d \to \mathbb{R}$ be a smooth function such that $|\partial^\alpha s(\xi)| \le B |\xi|^{-|\alpha|}$ for every multi-index $\alpha$ with $|\alpha| \le \lfloor d/2 \rfloor + 1$. Then, the sequence $(s(m))_{m \in \mathbb{Z}^d}$ is a multiplier with corresponding operator of norm smaller than $C_{p,d} B$.

Let $a : \mathbb{R}^d \to \mathbb{R}$ be a smooth function with $a(\xi) = 1/|\xi|$ for $|\xi| \ge 1$ and $a(0) = 0$. Let $A$ be the associated multiplier operator (by Lemma 3), defined by $A(\phi) = (a\hat\phi)^\vee$.
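The operator $A$ is equally concrete: multiply the Fourier coefficients by $a(m) = 1/|m|$ for $m \ne 0$ and invert. In a one-dimensional sketch (where the $\ell^1$ norm $|m|$ is just the absolute value), a single mode $\cos(2\pi k x)$ is simply divided by $k$:

```python
import numpy as np

def apply_A(phi_values):
    """Apply A(phi) = (a * phi_hat)^vee on a uniform grid of [0, 1),
    with a(m) = 1/|m| for m != 0 and a(0) = 0."""
    n = len(phi_values)
    coeffs = np.fft.fft(phi_values)
    m = np.fft.fftfreq(n, d=1.0 / n)            # integer frequencies
    a = np.zeros(n)
    a[m != 0] = 1.0 / np.abs(m[m != 0])
    return np.real(np.fft.ifft(a * coeffs))

x = np.arange(256) / 256
out = apply_A(np.cos(2 * np.pi * 4 * x))
# A maps cos(2*pi*4*x) to cos(2*pi*4*x) / 4
```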
Lemma 4. Let $\phi \in L^p$ with $\int \phi = 0$. Then, $\|\phi\|_{\dot H^{-1,p}} \le C \|A(\phi)\|_{L^p}$.

Proof. Let $\psi : \Omega \to \mathbb{R}$ be a smooth function with $\|\nabla\psi\|_{L^{p^*}} \le 1$. As $\hat\phi(0) = 0$, we have
$$
\int \phi\psi = \sum_{m \in \mathbb{Z}^d} \hat\phi(m)\hat\psi(m) = \sum_{m \in \mathbb{Z}^d} a(m)\hat\phi(m)\, |m|\hat\psi(m) \le \|A(\phi)\|_{L^p} \|(|\cdot|\hat\psi)^\vee\|_{L^{p^*}}.
$$
Note that $|\cdot| = \sum_{i=1}^d \varepsilon_i e_i$, where $e_i(m) = m_i$ and $\varepsilon_i(m)$ is the sign of $m_i$. As $\varepsilon_i$ is a multiplier (by Lemma 3), we have $\|(|\cdot|\hat\psi)^\vee\|_{L^{p^*}} \le c \sum_{i=1}^d \|(e_i\hat\psi)^\vee\|_{L^{p^*}} = c \sum_{i=1}^d \|\partial_i\psi\|_{L^{p^*}} \le C$.

Hence, to conclude, it suffices to bound, with $f_h := K_h * f$,
$$
\mathbb{E}\|A(f_{n,h} - f)\|_{L^p} \le \|A(f_h - f)\|_{L^p} + \mathbb{E}\|A(f_{n,h} - f_h)\|_{L^p}.
$$
Bound of the bias. Let $\kappa$ be the Fourier transform of $K$. As $K$ is smooth and compactly supported, $\kappa$ is a multiplier by Lemma 3. Also, the function $M = a \cdot (\kappa - 1)$ is a multiplier, as a product of multipliers. Remark that $\hat f_h - \hat f = (\kappa(h\cdot) - 1)\hat f$, so that $A(f_h - f) = h (M(h\cdot)\hat f)^\vee$. As the multiplier norms of $M$ and $M(h\cdot)$ are equal [1, Theorem 3.6.7], we have
$$
\|A(f_h - f)\|_{L^p} \le h C \|f\|_{L^p} \le h C f_{\max}. \qquad (6)
$$
Bound of the fluctuations. Eventually, we bound
$$
\mathbb{E}\|A(f_{n,h} - f_h)\|_{L^p} \le \mathbb{E}\left[\|A(f_{n,h} - f_h)\|_{L^p}^p\right]^{1/p}. \qquad (7)
$$
The random variable $A(f_{n,h})$ is equal to $n^{-1}\sum_{j=1}^n U_j$, where $U_j := A(K_h * \delta_{X_j}) = A(K_h)(\cdot - X_j)$ and $\mathbb{E} U_j = A(f_h)$. We control the expectation of the $L^p$-norm of the sum of i.i.d. centered functions thanks to the next lemma, which is a direct consequence of the Rosenthal inequality [4].

Lemma 5. Let $U_1, \dots, U_n$ be i.i.d. functions in $L^p$. Then, the expectation $\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n (U_i - \mathbb{E} U_i)\right\|_{L^p}^p$ is smaller than
$$
C_p n^{-p/2} \int \left(\mathbb{E}|U_1(x)|^2\right)^{p/2} \mathrm{d}x + C_p n^{1-p} \int \mathbb{E}\left[|U_1(x)|^p\right] \mathrm{d}x. \qquad (8)
$$

Let $v_h$ be the sequence in $\ell^{p^*}(\mathbb{Z}^d)$ defined by $v_h(m) = a(m)\kappa(hm)$ for $m \in \mathbb{Z}^d$. By a change of variable, we obtain
$$
\mathbb{E}[|U_1(x)|^p] = \int f(y) |A(K_h)(x - y)|^p \,\mathrm{d}y \le f_{\max} \|A(K_h)\|_{L^p}^p \le f_{\max} \|v_h\|_{\ell^{p^*}}^p, \qquad (9)
$$
where, at the last line, we applied the Hausdorff–Young inequality [8, Section XII.2]. The last step consists in bounding $\|v_h\|_{\ell^{p^*}}^{p^*}$. We separate this quantity into two parts: $S_1 = \sum_{|hm| \le 1} |v_h(m)|^{p^*}$ and $S_2 = \sum_{|hm| > 1} |v_h(m)|^{p^*}$. To bound $S_1$, we use that $\kappa$ is bounded on the unit ball, so that $S_1$ is of the order
$$
\sum_{0 < |hm| \le 1} |m|^{-p^*} \lesssim \begin{cases} h^{p^* - d} & \text{if } d \ge 3, \text{ or } d = 2 \text{ and } p > 2, \\ -\log h & \text{if } p = d = 2, \\ 1 & \text{if } d = 1. \end{cases} \qquad (10)
$$
To bound $S_2$, we use that $|\kappa(hm)| \le C_\gamma |hm|^{-\gamma}$ for any $\gamma > 0$. Choosing $\gamma$ such that $\gamma p^* + p^* > d$, we obtain that $S_2$ is of the order
$$
h^{-\gamma p^*} \sum_{|hm| > 1} |m|^{-\gamma p^* - p^*} \lesssim h^{p^* - d}. \qquad (11)
$$
Putting together inequalities (8), (10) and (11) yields that, for $h$ of the order $n^{-1/d}$, the expectation $\mathbb{E}\|A(f_{n,h} - f_h)\|_{L^p}$ is of the order
$$
\begin{cases} h/\sqrt{n h^d} \lesssim n^{-1/d} & \text{if } d \ge 3, \\ \sqrt{(-\log h)/n} \lesssim (\log n)^{1/2} n^{-1/2} & \text{if } d = 2, \\ n^{-1/2} & \text{if } d = 1. \end{cases} \qquad (12)
$$
We conclude the proof by putting together the estimates (5), (6) and (12).
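The lattice counting behind (10) can be sanity-checked numerically. For $d = 3$ and $p^* = 2$ (i.e. $p = 2$), the sum $\sum_{0 < |m| \le 1/h} |m|^{-2}$ over $\mathbb{Z}^3$ (with $|m|$ the $\ell^1$ norm, as in the text) should grow like $h^{2-3} = h^{-1}$, so halving $h$ should roughly double it. A brute-force sketch with arbitrary cutoffs $1/h = 10$ and $20$:

```python
import numpy as np

def s1(inv_h, d=3, p_star=2):
    """S_1 = sum over 0 < |m|_1 <= 1/h of |m|_1^{-p*}, brute force on Z^d."""
    r = np.arange(-inv_h, inv_h + 1)
    grids = np.meshgrid(*([r] * d), indexing="ij")
    norm1 = sum(np.abs(g) for g in grids)               # |m|_1 on the grid
    mask = (norm1 > 0) & (norm1 <= inv_h)
    return np.sum(norm1[mask].astype(float) ** (-p_star))

ratio = s1(20) / s1(10)
# order h^{p* - d} = h^{-1}: halving h should roughly double S_1
```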
Remark 1. For $p = 2$, the Mikhlin multiplier theorem can be replaced by Parseval's theorem, further simplifying the proof.
Remark 2. A similar proof shows that the risk of the measure $\mu_{n,h}$ satisfies $\mathbb{E} W_p(\mu_{n,h}, \mu) \lesssim n^{-(s+1)/(2s+d)}$ if $f$ is assumed to be of regularity $s$. Indeed, we can exploit the regularity of $f$ to show that, if $\kappa$ has sufficiently many zero derivatives at $0$, then the bias term is of order $h^{s+1}$, while the fluctuation term is bounded in the same way. We then obtain the desired rate by choosing $h$ of the order $n^{-1/(2s+d)}$. This rate is in accordance with the minimax result of [7], where a modified wavelet density estimator is shown to attain the same rate of convergence.

References

[1] Loukas Grafakos. Classical Fourier Analysis, volume 2. Springer.
[2] Sloan Nietert, Ziv Goldfeld, and Kengo Kato. From smooth Wasserstein distance to dual Sobolev norm: Empirical approximation and statistical applications, 2021.
[3] Rémi Peyre. Comparison between $W_2$ distance and $\dot H^{-1}$ norm, and localization of Wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4), 2018.
[4] Haskell P. Rosenthal. On the subspaces of $L^p$ ($p > 2$) spanned by sequences of independent random variables. Israel Journal of Mathematics, 8(3):273–303, 1970.
[5] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.
[6] Nicolás García Trillos and Dejan Slepčev. On the rate of convergence of empirical measures in ∞-transportation distance. Canadian Journal of Mathematics, 67(6):1358–1383, 2015.
[7] Jonathan Weed and Quentin Berthet. Estimation of smooth densities in Wasserstein distance. In Conference on Learning Theory, pages 3118–3119, 2019.
[8] A. Zygmund and R. Fefferman. Trigonometric Series. Cambridge University Press.