Asymptotics of smoothed Wasserstein distances
HONG-BIN CHEN AND JONATHAN NILES-WEED
Courant Institute of Mathematical Sciences, New York University
Abstract.
We investigate contraction of the Wasserstein distances on R^d under Gaussian smoothing. It is well known that the heat semigroup is exponentially contractive with respect to the Wasserstein distances on manifolds of positive curvature; however, on flat Euclidean space, where the heat semigroup corresponds to smoothing the measures by Gaussian convolution, the situation is more subtle. We prove precise asymptotics for the 2-Wasserstein distance under the action of the Euclidean heat semigroup, and show that, in contrast to the positively curved case, the contraction rate is always polynomial, with exponent depending on the moment sequences of the measures. We establish similar results for the p-Wasserstein distances for p ≠ 2 as well as the χ² divergence, relative entropy, and total variation distance. Together, these results establish the central role of moment matching arguments in the analysis of measures smoothed by Gaussian convolution.

1. Introduction
Given two probability distributions µ and ν on a Riemannian manifold M, what can be said about the Wasserstein distance W_2(µP_t, νP_t), where P_t is the heat semigroup? The seminal works of Otto and Villani [27] and von Renesse and Sturm [35] show that this question is intimately related to the geometry of M, in particular, its curvature. Specifically, von Renesse and Sturm [35] show that the Ricci curvature of M is bounded below by K if and only if

(1.1) W_2(µP_t, νP_t) ≤ e^{−Kt} W_2(µ, ν)

for all probability measures µ and ν on M and t ≥ 0. In particular, when M is positively curved, the convergence of W_2(µP_t, νP_t) to zero is exponentially fast.

If we specialize to the flat space M = R^d, then the application of the heat semigroup is nothing more than convolution by the Gaussian measure ρ_t with density

(1.2) ρ_t(x) = (2πt)^{−d/2} e^{−|x|²/(2t)}, x ∈ R^d.

It is immediate that W_2(µ∗ρ_t, ν∗ρ_t) ≤ W_2(µ, ν), but since R^d has zero curvature, (1.1) does not imply any strict contraction as t → ∞. And, indeed, there may be none: if µ = δ_x and ν = δ_y for x, y ∈ R^d, then

W_2(µ∗ρ_t, ν∗ρ_t) = W_2(µ, ν) = |x − y|  ∀ t ≥ 0.

If x ≠ y, then we do not even have W_2(µ∗ρ_t, ν∗ρ_t) → 0. More generally, it is straightforward to see that if µ and ν have different means, then W_2(µ∗ρ_t, ν∗ρ_t) is bounded away from 0 as t → ∞.

(E-mail addresses: [email protected], [email protected]. Date: May 5, 2020. JNW gratefully acknowledges the support of the Institute for Advanced Study, where a portion of this research was conducted.)

The fact that (1.1) is uninformative on R^d is well known and has spurred an interest in refinements for finite-dimensional flat spaces. Bolley et al.
[5] performed a careful analysis of the heat semigroup on R^d and established an elegant improvement of (1.1):

(1.3) W_2²(µ∗ρ_t, ν∗ρ_t) ≤ W_2²(µ, ν) − (2/d) ∫_0^t ( h(µ∗ρ_s) − h(ν∗ρ_s) )² ds,

where h is the differential entropy (i.e., the negative of the relative entropy with respect to the Lebesgue measure). Unlike (1.1), this result can yield strict contraction even in the absence of curvature. However, (1.3) does not make it easy to answer questions of the following type:

(1) Under what conditions on µ and ν does W_2(µ∗ρ_t, ν∗ρ_t) → 0?
(2) If W_2(µ∗ρ_t, ν∗ρ_t) → 0, at what rate does this contraction occur?

In this work, we give sharp answers to both questions. A consequence of our main theorem is that, under suitable tail bounds, the quantity W_2(µ∗ρ_t, ν∗ρ_t) always approaches zero as t → ∞ if µ and ν have the same mean, but that this convergence always happens at a polynomial rather than exponential rate. Indeed, if the first n moments of µ and ν match but their (n+1)th moments do not, then W_2(µ∗ρ_t, ν∗ρ_t) = Θ(t^{−n/2}). Moreover, we show that the rescaled quantity t^{n/2} W_2(µ∗ρ_t, ν∗ρ_t) has a positive limit as t → ∞:

lim_{t→∞} t^{n/2} W_2(µ∗ρ_t, ν∗ρ_t) = c_{µ,ν} > 0,

where c_{µ,ν} is an explicit positive constant depending on the (n+1)st moments of µ and ν. We complement these results by showing that, up to a trivial rescaling, c_{µ,ν} is also the limiting value of the relative entropy and χ² divergence between µ∗ρ_t and ν∗ρ_t. We establish similar results for the total variation distance, which decays at the same rate but possesses a different limiting value. Together, these results imply that a variety of measures of discrepancy between probability distributions on R^d agree in the limit under application of Gaussian smoothing.

Our results also extend to the p-Wasserstein distances for p ≠ 2.
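These two regimes, no contraction for distinct means versus polynomial decay for matching means, can be checked numerically in one dimension using the standard closed form W_2²(N(m_1, v_1), N(m_2, v_2)) = (m_1 − m_2)² + (√v_1 − √v_2)². The snippet below is an illustration only, not part of the paper's argument:

```python
import math

def w2_gauss_1d(m1, v1, m2, v2):
    # Closed form for the 2-Wasserstein distance between N(m1, v1), N(m2, v2):
    # W2^2 = (m1 - m2)^2 + (sqrt(v1) - sqrt(v2))^2
    return math.hypot(m1 - m2, math.sqrt(v1) - math.sqrt(v2))

# Smoothed point masses: delta_x * rho_t = N(x, t).  The distance |x - y|
# never contracts, no matter how much smoothing is applied.
x, y = 0.0, 3.0
for t in (1.0, 100.0, 10000.0):
    assert abs(w2_gauss_1d(x, t, y, t) - abs(x - y)) < 1e-9

# Equal means, different variances (moments match to order n = 1):
# W2(N(0,1)*rho_t, N(0,2)*rho_t) = sqrt(2+t) - sqrt(1+t) ~ (1/2) t^{-1/2},
# the polynomial rate t^{-n/2} with n = 1.
for t in (1e2, 1e4, 1e6):
    print(t, math.sqrt(t) * w2_gauss_1d(0.0, 1.0 + t, 0.0, 2.0 + t))
```

The second loop shows √t · W_2 approaching the constant 1/2, the kind of rescaled limit the main theorems make precise.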
For example, if the first n moments of µ and ν match but the (n+1)th moments do not and the distributions satisfy sufficiently strong tail conditions, we obtain bounds of the form

0 < liminf_{t→∞} t^{n/2} W_p(µ∗ρ_t, ν∗ρ_t) ≤ limsup_{t→∞} t^{n/2} W_p(µ∗ρ_t, ν∗ρ_t) < ∞.

Together with the sharp asymptotics we obtain for p = 2, these results show that all the Wasserstein distances W_p for p ∈ [1, ∞) decay at the same, polynomial rate, under a simple moment matching condition.

1.1. Related work.
The Wasserstein contraction of diffusion semigroups is deeply connected to several modern areas of geometry and probability theory. Otto and Villani [27] first established a link between contraction of the heat semigroup, Talagrand's inequality, and log-Sobolev inequalities. The connection between these ideas is the formal understanding due to Otto [26] of the heat semigroup as a gradient flow associated with the entropy functional on the space of probability measures equipped with the Wasserstein metric [1]. This powerful analogy reveals the central role of the convexity of the entropy functional along geodesics in this space and forms the
basis for the synthetic notions of Ricci curvature [19, 30, 31, 34]. This perspective also sheds new light on the concentration of measure phenomenon [17], via the transportation-entropy inequalities developed by Marton [21, 22] and Talagrand [32]. More generally, these ideas parallel the development of a general set of techniques for studying Riemannian manifolds via diffusion processes [36].

The condition in (1.1) that the Ricci curvature be bounded below by K is known as the CD(K, ∞) condition. The general CD(
K, N) ("curvature-dimension") condition expresses in a certain sense that the Ricci curvature is bounded below by K and the dimension is at most N [3]. The result of Bolley et al. [5] given in (1.3) is the correct analogue of (1.1) for spaces satisfying the CD(0, d) condition. Bolley et al. [5] develop similar contractive results involving a different measure of distance for spaces satisfying CD(
K, N) for general K and N. Establishing the correct CD(
K, N) analogues for statements first formulated under a CD(K, ∞) condition is an area of active research [see 6, and references therein]. This line of work is closely related to the problem of proving similar contractive estimates for general diffusion processes [11, 12, 20, 37, 40].

To obtain asymptotics for the 2-Wasserstein distance, we employ a technique similar to one recently used to establish sharp limiting constants for the 2-dimensional matching problem [2]. We solve a linearized form of the Monge-Ampère equation to obtain a candidate feasible transport solution in the form of a coupling investigated by Moser [24]. Evaluating the cost of this ansatz, and establishing a matching lower bound via a strategy developed by Peyre [28], shows that this coupling is asymptotically optimal.

The behavior of smoothed versions of the Wasserstein distance is also of statistical interest. Several recent works in statistics and information theory examine the behavior of W(µ∗ρ_t, ν∗ρ_t) when ν = µ_n is an empirical measure comprising n i.i.d. samples from µ. Weed [38] noticed that when µ and ν are compactly supported, the 1-Wasserstein distance satisfies E W_1(µ∗ρ_t, µ_n∗ρ_t) ≪ E W_1(µ, µ_n) when t is sufficiently large. This observation implies that certain statistical tasks involving the Wasserstein distance become easier if samples are first smoothed by Gaussian noise. Goldfeld et al. [16] extended this analysis to the total variation distance, relative entropy, and 2-Wasserstein distance, as well as to distributions with unbounded support. Motivated by these findings, Goldfeld and Greenewald [15] propose to study this smoothed Wasserstein distance as a statistically attractive variant of the standard Wasserstein distances.

Our asymptotic results on the behavior of the χ²-divergence and relative entropy agree with several nonasymptotic bounds in the statistics literature for Gaussian mixtures [4, 39].
To our knowledge, the asymptotic connection with smoothed Wasserstein distances is new.

1.2. Notation.
We denote by N the set of nonnegative integers. Sometimes, we shall write N ∪ {0} to emphasize that 0 is admissible. The symbol g will denote the standard Gaussian measure on R^d, namely, g(dx) = ρ_1(x) dx, where ρ_1 is defined as in (1.2).

We recall the following multi-index notation. For y ∈ R^d and α ∈ N^d, we write y^α = y_1^{α_1} y_2^{α_2} ··· y_d^{α_d} and α! = ∏_{i=1}^d (α_i!). For j ∈ N, let [j] = {α ∈ N^d : Σ_{i=1}^d α_i = j}.

The symbol c > 0 denotes a universal constant whose value may change from line to line. We use subscripts to indicate when such a constant depends on other parameters of the problem. For real numbers a and b, we write a ∨ b for max{a, b}.

2. Setting and Main Results
Consider two probability measures µ and ν on R^d. Let X and Y be random variables with laws µ and ν, respectively. Throughout, we consider measures with sufficiently light tails, which we quantify via the following condition:

— Condition E(β), for β > 0, is said to hold if E e^{β|X−EX|²}, E e^{β|Y−EY|²} < ∞.

The following moment-matching condition plays a central role in our results.

— Condition M(n), for n ∈ N ∪ {0}, is said to hold if n is the largest nonnegative integer such that

E X^α = E Y^α, for all α ∈ [k] and all k ≤ n.

In other words, M(n) holds if moment tensors of X up to order n match those of Y, but those of order n + 1 do not.

2.1. Exact asymptotics for the 2-Wasserstein distance. Our first main result gives exact asymptotics for the 2-Wasserstein distance under the tail condition E(β) and the moment matching condition M(n).

Theorem 2.1.
Suppose E(β) holds for some β > 0. If M(n) holds for some n ∈ N ∪ {0}, then

lim_{t→∞} t^n W_2²(µ∗ρ_t, ν∗ρ_t) = (1/(n+1)) Σ_{α∈[n+1]} (1/α!) |E X^α − E Y^α|².

If n = 0, then Theorem 2.1 reads lim_{t→∞} W_2(µ∗ρ_t, ν∗ρ_t) = |E X − E Y|, which recovers exactly the situation identified in the introduction. Moreover, as Corollary 2.4 below makes clear, this zeroth-order behavior is common to all Wasserstein distances.

We obtain the upper bound in Theorem 2.1 by constructing a coupling between µ∗ρ_t and ν∗ρ_t via the solution to a PDE obtained by linearizing the Monge-Ampère equation. The proof of this upper bound appears in Section 3. To show the lower bound, we employ the concept of displacement interpolation, due to McCann [23], and control the solution of the PDE considered in Section 3 along a geodesic in Wasserstein space. The proof appears in Section 5.

Theorem 2.1 implies several useful estimates, including the following corollary, showing that a good approximation of W_2(µ∗ρ_t, ν∗ρ_t) can be obtained by replacing µ and ν with appropriate Gaussian measures.

Corollary 2.2.
Suppose E(β) holds for some β > 0. Let N_µ (respectively, N_ν) be Gaussian with the same mean and covariance as µ (respectively, ν). Then we have

|W_2(µ∗ρ_t, ν∗ρ_t) − W_2(N_µ∗ρ_t, N_ν∗ρ_t)| = O(t^{−1}).

Note that the quantity W_2(N_µ∗ρ_t, N_ν∗ρ_t) has an explicit expression (see, e.g., [14, Proposition 7]). Since the first two moments of µ and N_µ (respectively, ν and N_ν) match, Corollary 2.2 follows immediately from Theorem 2.1 after applying the triangle inequality:

|W_2(µ∗ρ_t, ν∗ρ_t) − W_2(N_µ∗ρ_t, N_ν∗ρ_t)| ≤ W_2(µ∗ρ_t, N_µ∗ρ_t) + W_2(ν∗ρ_t, N_ν∗ρ_t) = O(t^{−1}).
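To make Theorem 2.1 concrete, here is a numerical sanity check (an illustration, not part of the paper) in d = 1: take µ = (δ_{−1} + δ_{1})/2 and ν = δ_0, so the means agree but E X² = 1 ≠ 0 = E Y², i.e. M(1) holds. The only multi-index in [2] is α = 2, so the predicted limit is t W_2²(µ∗ρ_t, ν∗ρ_t) → (1/2)(1/2!)(1 − 0)² = 1/4. The sketch evaluates W_2² by the one-dimensional quantile formula:

```python
import math
from statistics import NormalDist

ND = NormalDist()

def mix_cdf(x, t):
    # CDF of mu * rho_t for mu = (delta_{-1} + delta_{+1}) / 2
    s = math.sqrt(t)
    return 0.5 * (ND.cdf((x + 1.0) / s) + ND.cdf((x - 1.0) / s))

def mix_quantile(p, t):
    # quantile of mu * rho_t by bisection
    s = math.sqrt(t)
    lo, hi = -12.0 * s - 2.0, 12.0 * s + 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if mix_cdf(mid, t) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w2_sq(t, nodes=2001):
    # 1D formula W_2^2 = int_0^1 (F^{-1}(q) - G^{-1}(q))^2 dq with the
    # substitution q = Phi(z); here G^{-1}(q) = sqrt(t) * z since
    # nu * rho_t = N(0, t).  Trapezoid rule on z in [-8, 8].
    s = math.sqrt(t)
    h = 16.0 / (nodes - 1)
    vals = []
    for k in range(nodes):
        z = -8.0 + h * k
        diff = mix_quantile(ND.cdf(z), t) - s * z
        vals.append(diff * diff * ND.pdf(z))
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

t = 200.0
ratio = t * w2_sq(t)   # Theorem 2.1 predicts a limit of 1/4
assert abs(ratio - 0.25) < 0.02
```

At t = 200 the rescaled quantity already sits within a couple of percent of the predicted constant 1/4, consistent with the O(t^{−n−1}) corrections appearing later in the proofs.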
2.2. Generalization to W_p for p ≠ 2. Though we do not develop exact asymptotics when p ≠ 2, our main result of this section shows that the other Wasserstein distances exhibit the same qualitative behavior as W_2, namely, that under M(n) the quantity W_p(µ∗ρ_t, ν∗ρ_t) decays at the rate t^{−n/2}.

Theorem 2.3.
Let n ∈ N \ {0} and p ≥ 1. If E(β) holds for some β > 0, and M(n) holds, then there are positive constants c_{µ,ν} and c_{d,n,p,β} and functions h_1(t) and h_2(t) satisfying lim_{t→∞} h_1(t) = lim_{t→∞} h_2(t) = 1 such that

c_{µ,ν} h_1(t) ≤ t^{n/2} W_p(µ∗ρ_t, ν∗ρ_t) ≤ c_{d,n,p,β} h_2(t), ∀ t > 2(p−1)/β.

We show the upper bound in Theorem 2.3 in Section 3, where it follows from the same construction used to obtain sharp bounds in the W_2 case. The lower bound follows from simpler ideas and appears in Section 6.

Theorem 2.3 implies the following corollary, which gives exact zeroth-order asymptotics for W_p.

Corollary 2.4.
Let p ≥ 1. Assume E(β) holds for some β > 0. Then we have

lim_{t→∞} W_p(µ∗ρ_t, ν∗ρ_t) = |E X − E Y|.

To obtain Corollary 2.4, we let µ̃ be the law of X̃ = X − E X + E Y. Clearly W_p(µ∗ρ_t, µ̃∗ρ_t) = |E X − E Y| for all p ≥ 1 and t ≥ 0. The triangle inequality implies

| W_p(µ∗ρ_t, ν∗ρ_t) − |E X − E Y| | ≤ W_p(µ̃∗ρ_t, ν∗ρ_t).

Applying Theorem 2.3 to the measures µ̃ and ν yields the claim.

2.3. Asymptotics for f-divergences. Theorem 2.1 implies that though the 2-Wasserstein distance is highly nonlinear, its asymptotic behavior under Gaussian smoothing is entirely determined by linear functionals of the measures (i.e., their moments). In fact, under the same conditions, we show that similar limiting behavior holds for the χ² divergence and Kullback-Leibler divergence (relative entropy) between µ∗ρ_t and ν∗ρ_t as well. Given two probability measures µ and ν with Lebesgue densities, recall

χ²(µ, ν) = ∫ (µ/ν − 1)² ν dx = ∫ (µ − ν)²/ν dx,

D_KL(µ ‖ ν) = ∫ log(µ/ν) µ dx.

The χ² and Kullback-Leibler divergence, as well as the total variation distance defined below, are examples of f-divergences, which are common measures of dissimilarity in information theory and statistics [10, 18]. The following theorem shows that these divergences have the same asymptotic form as the squared 2-Wasserstein distance, but decay at the rate t^{−(n+1)} rather than t^{−n}.

Theorem 2.5. If E(β) and M(n) hold for some β > 0 and some n ∈ N ∪ {0}, then

(2.1) lim_{t→∞} t^{n+1} χ²(µ∗ρ_t, ν∗ρ_t) = Σ_{α∈[n+1]} (1/α!) |E X^α − E Y^α|²;

(2.2) lim_{t→∞} t^{n+1} D_KL(µ∗ρ_t ‖ ν∗ρ_t) = (1/2) Σ_{α∈[n+1]} (1/α!) |E X^α − E Y^α|².
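For a concrete check of (2.1) and (2.2) (again an illustration, not part of the paper), take d = 1, µ = (δ_{−1} + δ_{1})/2, and ν = δ_0, so M(1) holds and the predicted limits are Σ_{α∈[2]} (1/α!)|E X^α − E Y^α|² = 1/2 for t²χ² and 1/4 for t²D_KL. For this particular pair the χ² divergence even has an elementary closed form, χ²(µ∗ρ_t, ν∗ρ_t) = cosh(1/t) − 1, which the quadrature below reproduces:

```python
import math

def densities(t):
    s = math.sqrt(t)
    c = 1.0 / (s * math.sqrt(2 * math.pi))
    def f(x):  # density of mu * rho_t, mu = (delta_{-1} + delta_{+1}) / 2
        return 0.5 * c * (math.exp(-(x + 1) ** 2 / (2 * t))
                          + math.exp(-(x - 1) ** 2 / (2 * t)))
    def g(x):  # density of nu * rho_t, nu = delta_0
        return c * math.exp(-x * x / (2 * t))
    return f, g

def quad(h, t, nodes=4001):
    # trapezoid rule over x in [-10 sqrt(t), 10 sqrt(t)]; the integrands
    # decay like Gaussians, so this is extremely accurate
    L = 10.0 * math.sqrt(t)
    step = 2 * L / (nodes - 1)
    vals = [h(-L + step * k) for k in range(nodes)]
    return step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

t = 100.0
f, g = densities(t)
chi2 = quad(lambda x: (f(x) - g(x)) ** 2 / g(x), t)
kl = quad(lambda x: f(x) * math.log(f(x) / g(x)), t)

assert abs(chi2 - (math.cosh(1 / t) - 1)) < 1e-10  # closed form for this pair
assert abs(t * t * chi2 - 0.5) < 0.01              # (2.1): limit 1/2
assert abs(t * t * kl - 0.25) < 0.01               # (2.2): limit 1/4
```

Note that the ratio of the two limits is exactly 2, the factor relating χ² and D_KL in the second-order (Gaussian chaos) regime.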
We note that (2.2) combined with Theorem 2.1 implies that

(2.3) W_2²(µ∗ρ_t, ν∗ρ_t) ∼ (2t/(n+1)) D_KL(µ∗ρ_t ‖ ν∗ρ_t)

as t → ∞. This can be compared with Talagrand's inequality [32], which states that the measure ρ_t satisfies

W_2²(µ, ρ_t) ≤ 2t D_KL(µ ‖ ρ_t), ∀ µ.

Equation (2.3) says that µ∗ρ_t and ν∗ρ_t asymptotically enjoy a similar bound.

Finally, we prove exact asymptotics for the total variation distance, defined by

d_TV(µ, ν) = (1/2) ∫ |µ − ν| dx.

In contrast to the asymptotics for W_2, χ², and D_KL, which have an L² flavor, the asymptotic behavior of d_TV is governed by L¹.

Theorem 2.6.
Suppose that µ and ν have finite (n+2)th moments and M(n) holds for some n ∈ N ∪ {0}. Then

lim_{t→∞} t^{(n+1)/2} d_TV(µ∗ρ_t, ν∗ρ_t) = (1/2) ∫ | Σ_{α∈[n+1]} (1/α!) (E X^α − E Y^α) H_α(x) | g(dx),

where H_α is the Hermite polynomial defined by

(2.4) H_α(x) = ∏_{i=1}^d H_{α_i}(x_i), x ∈ R^d, α ∈ N^d,

(2.5) H_m(x) = (−1)^m e^{x²/2} (d^m/dx^m) e^{−x²/2}, x ∈ R, m ∈ N.

As a consequence of the fact that the Hermite polynomials form an orthogonal basis for L²(R^d, g), the condition M(n) implies that the integral in Theorem 2.6 is strictly positive. Note that Theorem 2.6 does not require assuming that µ and ν satisfy the exponential tail condition E(β) for any nonzero β.

Proofs of Theorems 2.5 and 2.6 appear in Section 7.

3. Upper bounds via the Moser coupling
The goal of this section is to prove upper bounds on the quantity W_p(µ∗ρ_t, ν∗ρ_t) by exhibiting a particular coupling between µ∗ρ_t and ν∗ρ_t. This coupling is obtained by a method due to Moser [24]. We assume throughout this section that M(n) holds for some n ∈ N \ {0} and will handle the case M(0) separately. In particular, we assume in this section without loss of generality that

(3.1) E X = E Y = 0.

We first show how to motivate the Moser coupling on a purely heuristic level, following Caracciolo et al. [8]. If we assume the existence of a suitably regular map T pushing µ∗ρ_t to ν∗ρ_t, then this map must satisfy the Monge-Ampère equation:

µ∗ρ_t(x) = ν∗ρ_t(T(x)) J_T(x),

where J_T is the Jacobian determinant of T at x. Let us linearize this equation by assuming that µ∗ρ_t and ν∗ρ_t are close to ρ_t, so that we can write µ∗ρ_t = (1 + δ_µ)ρ_t and ν∗ρ_t = (1 + δ_ν)ρ_t, where δ_µ and δ_ν are small. Under the additional assumption that T(x) = x + δ_T(x) for a small perturbation δ_T(x), we have the approximation J_T(x) ≈ 1 + ∇·δ_T(x). Combining these approximations yields the first-order expansion

∇·δ_T(x) + (∇ log ρ_t(x))·δ_T(x) = δ_µ(x) − δ_ν(x) = (µ∗ρ_t(x) − ν∗ρ_t(x)) ρ_t^{−1}(x).

Brenier's theorem suggests writing δ_T = ∇u for some u : R^d → R. Using the definition of ρ_t, we obtain

(3.2) ∆u(x) − t^{−1} x·∇u(x) = (µ∗ρ_t(x) − ν∗ρ_t(x)) ρ_t^{−1}(x).

The Moser coupling is finally defined by using a solution to (3.2) to construct a vector field that evolves µ∗ρ_t into ν∗ρ_t.

In the remainder of this section, we first give the rigorous details of this construction. We then prove upper bounds on the cost of this coupling and, in the special case p = 2, develop exact asymptotics.

3.1. Construction of the Moser coupling.
For each fixed t > 0, we study the key equation (3.2). There is a weak solution u to this equation satisfying u ∈ C¹(R^d) (see Lemma 3.1 below). Hence, ∇u makes sense pointwise. We now show how to use such a solution to construct a coupling.

For s ∈ [0, 1], define the linear interpolation

(3.3) m_s = (1 − s)(µ∗ρ_t) + s(ν∗ρ_t)

and the vector field ξ_s(x) = ρ_t(x)∇u(x)/m_s(x). Using (3.2), one can check

∂_s m_s + ∇·(m_s ξ_s) = 0.

This allows us to apply the Benamou-Brenier formula [7, Corollary 3.2 and Remark 3.3] to obtain

(3.4) W_p^p(µ∗ρ_t, ν∗ρ_t) ≤ ∫_0^1 ∫_{R^d} |ξ_s(x)|^p m_s(x) dx ds = ∫_{R^d} |∇u(x)|^p ( ∫_0^1 (ρ_t(x)/m_s(x))^{p−1} ds ) ρ_t(x) dx.

To estimate the term inside parentheses in the above display, we first establish a lower bound for µ∗ρ_t (and similarly for ν∗ρ_t):

(3.5) µ∗ρ_t(x) = (2πt)^{−d/2} E e^{−|x−X|²/(2t)} ≥ (2πt)^{−d/2} e^{−E|x−X|²/(2t)} = ρ_t(x) e^{⟨x, EX⟩/t − E|X|²/(2t)} = ρ_t(x) e^{−E|X|²/(2t)}.

Here, we used Jensen's inequality and the assumption (3.1). By (3.3), (3.5) and a similar lower bound for ν∗ρ_t, we obtain

m_s(x) ρ_t^{−1}(x) ≥ e^{−(E|X|² ∨ E|Y|²)/(2t)}, x ∈ R^d,

which then gives

∫_0^1 (ρ_t(x)/m_s(x))^{p−1} ds ≤ e^{(p−1)(E|X|² ∨ E|Y|²)/(2t)}, x ∈ R^d.
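As an aside, the coupling above is easy to test numerically in one dimension (this check is not in the paper). In d = 1, integrating (3.2) gives ρ_t u′ = F − G, where F and G are the CDFs of µ∗ρ_t and ν∗ρ_t, so the flux is m_s ξ_s = F − G and the right-hand side of (3.4) with p = 2 becomes ∫ (F − G)² (∫_0^1 ds/m_s) dx, with ∫_0^1 ds/m_s = (log f − log g)/(f − g). For µ = (δ_{−1} + δ_{1})/2 and ν = δ_0 (so n = 1), Theorem 2.1 gives W_2² ≈ 1/(4t), and the Moser bound lands on the same constant:

```python
import math

t = 200.0
s = math.sqrt(t)
C = 1.0 / (s * math.sqrt(2 * math.pi))

def f(x):  # mu * rho_t, mu = (delta_{-1} + delta_{+1}) / 2
    return 0.5 * C * (math.exp(-(x + 1) ** 2 / (2 * t))
                      + math.exp(-(x - 1) ** 2 / (2 * t)))

def g(x):  # nu * rho_t, nu = delta_0
    return C * math.exp(-x * x / (2 * t))

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def F(x):
    return 0.5 * (Phi((x + 1) / s) + Phi((x - 1) / s))

def G(x):
    return Phi(x / s)

def inv_logmean(a, b):
    # int_0^1 ds / ((1-s) a + s b) = (log a - log b) / (a - b)
    if abs(a - b) < 1e-14 * a:
        return 1.0 / a
    return (math.log(a) - math.log(b)) / (a - b)

# Cost of the Moser coupling, (3.4) with p = 2, by the trapezoid rule.
L, nodes = 10.0 * s, 4001
step = 2 * L / (nodes - 1)
vals = [(F(x) - G(x)) ** 2 * inv_logmean(f(x), g(x))
        for x in (-L + step * k for k in range(nodes))]
bound = step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# Theorem 2.1 gives t * W_2^2 -> 1/4 for this pair; the upper bound matches.
assert 0.24 <= t * bound <= 0.27
```

The bound is a genuine upper bound on W_2², yet t times its value is already within a few percent of the sharp constant 1/4, in line with the asymptotic optimality of the coupling established in this section and Section 5.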
Plug this into (3.4) and apply a change of variables to see

(3.6) W_p^p(µ∗ρ_t, ν∗ρ_t) ≤ e^{(p−1)(E|X|² ∨ E|Y|²)/(2t)} ∫_{R^d} |∇u(x)|^p ρ_t(x) dx = e^{(p−1)(E|X|² ∨ E|Y|²)/(2t)} ∫ |∇u(√t x)|^p g(dx).

To evaluate the integral appearing in (3.6), we first define some notation. The following objects will appear many times in this paper:

(3.7) η(x, y) = exp( ⟨x, y⟩ − |y|²/2 ), Θ_t(x) = t^{1/2} ( E η(x, t^{−1/2} X) − E η(x, t^{−1/2} Y) ).

By this definition, we immediately have

(3.8) (µ∗ρ_t − ν∗ρ_t)(x) = t^{−1/2} Θ_t(t^{−1/2} x) ρ_t(x), x ∈ R^d.

Let us introduce

(3.9) w(x) = t^{−1/2} u(t^{1/2} x), x ∈ R^d,

and the Ornstein-Uhlenbeck operator

(3.10) L = ∆ − x·∇.

Hence, due to (3.8) and (3.2), we know w solves

(3.11) Lw = Θ_t.

Adopting the notation from the Malliavin calculus, we write the gradient ∇ as D. Under this notation, (3.6) becomes

(3.12) W_p^p(µ∗ρ_t, ν∗ρ_t) ≤ e^{(p−1)(E|X|² ∨ E|Y|²)/(2t)} ∫ |Dw|^p d g.

Our upper bounds follow from analysis of (3.12). We first show how to obtain an upper bound of the right order for general p before performing a more careful argument for the p = 2 case.

3.2. A general upper bound.
Our first bound shows that (3.12) is of order t^{−np/2} for any p ≥ 1. This will suffice to prove the upper bound of Theorem 2.3.

We introduce the following notion. For k ∈ N, let D^{k,p} be the completion of smooth functions on R^d whose derivatives grow at most polynomially, under the norm

(3.13) ‖·‖_{k,p} = ( Σ_{j=0}^k ∫ |D^j ·|^p d g )^{1/p},

where D^j = ∂^j denotes the j-th derivative. The space D^{k,p} is the Sobolev space with g as the underlying measure. We collect useful properties of w in the lemma stated below, the proof of which is given in Section 4.2.

Lemma 3.1.
Under the assumption (3.1), there is a weak solution w to (3.11), which also satisfies:

(1) w ∈ D^{2,p} for all p ∈ [1, ∞), and w ∈ C¹(R^d);
(2) the mean of Dw is zero, namely, ∫ Dw d g = 0;
(3) there is a constant c_p > 0 depending only on p such that

∫ |D²w|^p d g ≤ c_p ∫ |Lw|^p d g.

The first two parts of the lemma allow us to apply the Poincaré inequality (Lemma 4.1) to see

∫ |Dw|^p d g ≤ c_p ∫ |D²w|^p d g.

Hence, part (3) of Lemma 3.1 and (3.11) imply

(3.14) ∫ |Dw|^p d g ≤ c_p ∫ |Lw|^p d g = c_p ∫ |Θ_t|^p d g.

Combining this bound with (3.6), we obtain

W_p^p(µ∗ρ_t, ν∗ρ_t) ≤ c_p e^{(p−1)(E|X|² ∨ E|Y|²)/(2t)} ∫ |Θ_t|^p d g.

Finally, to prove the upper bound of Theorem 2.3, we apply the following lemma, whose proof appears in Section 4.3.1.
Lemma 3.2.
Suppose E(β) holds for some β > 0 and M(n) holds for some n ∈ N (assumption (3.1) is not needed). For each p ≥ 1 and each δ > 0, there is c_{d,n,p,δ,β} such that

( ∫ |Θ_t|^p d g )^{1/p} ≤ c_{d,n,p,δ,β} t^{−n/2} max_{Z∈{X,Y}} E e^{δ(p−1)|Z|²/t}, t > δ(p−1)/β.

Choosing δ = 2 and letting h_2(t) = e^{(p−1)(E|X|² ∨ E|Y|²)/(2pt)} · max_{Z∈{X,Y}} E e^{2(p−1)|Z|²/t}, we obtain the desired claim.

3.3. Upper bound for p = 2. To obtain a sharper estimate when p = 2, we apply integration by parts to (3.12) to obtain

(3.15) W_2²(µ∗ρ_t, ν∗ρ_t) ≤ e^{(E|X|² ∨ E|Y|²)/(2t)} ∫ −wLw d g.

In Section 4.3.2, we prove the following result.

Lemma 3.3. As t → ∞,

(3.16) ∫ −wLw d g = (t^{−n}/(n+1)) Σ_{α∈[n+1]} (1/α!) |E X^α − E Y^α|² + O(t^{−n−1}).

Applying this lemma along with the fact that e^{(E|X|² ∨ E|Y|²)/(2t)} approaches 1 as t → ∞ yields the upper bound of Theorem 2.1.

Recall that we have assumed (3.1). If E X = E Y = v ≠ 0, then applying the same proof to centered versions of µ and ν yields an upper bound which depends on

(1/(n+1)) Σ_{α∈[n+1]} (1/α!) |E(X − v)^α − E(Y − v)^α|².

But under the condition M(n) for n ≥ 1, we have

E(X − v)^α − E(Y − v)^α = E X^α − E Y^α  ∀ α ∈ [n+1],

so we recover precisely the desired bound. Finally, under M(0), we use the argument of Corollary 2.4 to reduce to the n ≥ 1 case.

4. Estimates for solutions to the Ornstein-Uhlenbeck PDE
In Section 3, we established that good upper bounds for the Wasserstein distances can be obtained by understanding solutions to (3.11), which reads:

Lw = Θ_t,

where L is the Ornstein-Uhlenbeck operator L = ∆ − x·∇ and Θ_t is defined in (3.7). In this section, we derive the key estimates on solutions to (3.11), via which we obtain the bounds given in Section 3. As we shall see, these estimates also play a role in obtaining good lower bounds, a question we turn to in Sections 5 and 6.

We first establish several preliminaries involving the Malliavin calculus, before giving the promised proof of Lemma 3.1. In the remainder of the section, we derive the necessary estimates on Θ_t.

4.1. Preliminaries.
We begin by reviewing several concepts from analysis on Gaussian spaces.

Consider the stochastic process W = {W(h) : R^d → R}_{h∈R^d} given by W(h)(x) = ⟨h, x⟩. Under the probability measure g, one can see that W is a (centered) isonormal Gaussian process with covariance E_g[W(h)W(h′)] = ⟨h, h′⟩, where E_g denotes the expectation with respect to g.

Recall the Hermite polynomials H_m given in (2.5). For m ∈ N, let H_m be the closed linear subspace of L² = L²(g) generated by {H_m(W(h)) : h ∈ R^d, |h| = 1}. The space H_m is called the m-th Wiener chaos, and {H_m}_{m≥0} forms an orthogonal decomposition of L². Let J_m : L² → H_m be the orthogonal projection. In particular, for ϕ ∈ L², we have J_0 ϕ = ∫ ϕ d g.

Recall the Sobolev space D^{k,p} with norm given in (3.13). Let P denote the set of polynomials on R^d, which is dense in D^{k,p} for p > 1 and k ≥ 1 (see [25, Corollary 1.5.1 and Exercise 1.1.7]). On P, the operator L in (3.10) can be expressed as L = Σ_{m=0}^∞ −m J_m and its pseudo-inverse as L^{−1} = Σ_{m=1}^∞ −(1/m) J_m. Still on P, we can define the negative square root of −L by C = Σ_{m=0}^∞ −√m J_m and its pseudo-inverse C^{−1} = Σ_{m=1}^∞ −(1/√m) J_m. For more details, see [25].

Finally, we require the following L^p version of the Poincaré inequality for Gaussian measures [29, Corollary 2.4].

Lemma 4.1.
Let p ≥ 1. There is c_p > 0 (depending only on p) such that, for all ϕ ∈ D^{1,p},

∫ |ϕ − ϕ̄|^p d g ≤ c_p ∫ |Dϕ|^p d g,

where ϕ̄ = ∫ ϕ d g.

4.2. Proof of Lemma 3.1.
Part (1). We first construct a solution to (3.11) using L^{−1}. Lemma 3.2 implies Θ_t ∈ L^p. By density, let ϕ_k ∈ P be polynomials such that ϕ_k → Θ_t in L^p as k → ∞. The mean of Θ_t is zero because, due to (3.1), (3.7) and (3.8),

∫ Θ_t d g = t^{1/2} ∫ (µ∗ρ_t − ν∗ρ_t) dx = 0.
Hence, we may assume the ϕ_k all have zero means, namely, J_0 ϕ_k = 0. Applying the multiplier theorem [25, Theorem 1.4.2] to L^{−1}, and using [25, Theorem 1.5.1] with the relation C² = −L, one can see that the limit

(4.1) w = lim_{k→∞} L^{−1} ϕ_k = L^{−1} Θ_t in D^{2,p}

exists and L : D^{2,p} → L^p is continuous. Therefore, we have Lw = lim_{k→∞} L L^{−1} ϕ_k = lim_{k→∞} (ϕ_k − J_0 ϕ_k) = Θ_t in L^p.

It remains to check w ∈ C¹(R^d). On each Euclidean ball B ⊂ R^d, the standard Gaussian measure g has a density both bounded above and below. Hence, due to w ∈ D^{2,p}, we know that w also belongs to the standard Sobolev space W^{2,p}(B) for the Lebesgue measure, for all p ∈ [1, ∞). The standard Sobolev embedding theorem (see, e.g., [13, part 3 of Theorem 3.26]) implies w ∈ C¹(B), and thus w ∈ C¹(R^d).

Part (2). Recall that we write D_i = ∂_i. Using the approximation (4.1) and performing integration by parts for polynomial integrands, we have

∫ D_i w d g = lim_{k→∞} ⟨D_i L^{−1} ϕ_k, 1⟩_g = lim_{k→∞} ⟨L^{−1} ϕ_k, x_i⟩_g = lim_{k→∞} ⟨ϕ_k, L^{−1} x_i⟩_g = −lim_{k→∞} ⟨ϕ_k, x_i⟩_g = −⟨Θ_t, x_i⟩_g,

where ⟨·,·⟩_g is the L²(g) inner product. Here in the third equality, we used the self-adjointness of L^{−1}, which is evident from its formula on polynomials. In the penultimate equality, we used the fact that L^{−1} x_i = −J_1 x_i = −x_i because x_i belongs to the first order Wiener chaos H_1. Hence it is sufficient to check ∫ x_i Θ_t d g = 0. Indeed, due to (3.1), we have

∫ x_i E exp( ⟨x, t^{−1/2} X⟩ − |t^{−1/2} X|²/2 ) g(dx) = E (2π)^{−d/2} ∫ x_i e^{−|x − t^{−1/2} X|²/2} dx = E t^{−1/2} X_i = 0,

and a similar equality with X replaced by Y. Finally, by (3.7), we conclude that ∫ x_i Θ_t d g = 0.

Part (3). This is an immediate consequence of [25, Theorem 1.5.1], the density of P, and the fact that C² = −L on P.

4.3. Proofs of Some Estimates.
The bounds in Section 3 relied on two estimates: Lemma 3.2, which showed ‖Θ_t‖_{L^p(g)} = O(t^{−n/2}), and Lemma 3.3, which gave exact asymptotics for −∫ wLw d g. In this section, we prove both lemmas.

For a multi-index α ∈ N^d, we write ∂^α = ∂_1^{α_1} ∂_2^{α_2} ··· ∂_d^{α_d}. All derivatives below are with respect to y.

Fix α ∈ [j] for some j ∈ N. In view of (3.7), to study asymptotics of Θ_t, we shall derive the expansion of η(x, y) in y for fixed x. In the following, we express ∂^α η(x, y) in terms of Hermite polynomials. Recall that our notation for Hermite polynomials is given in (2.4) and (2.5). It can be checked that

(d^m/dy^m) e^{xy − y²/2} = H_m(x − y) e^{xy − y²/2}, x, y ∈ R.

Since η(x, y) = ∏_{i=1}^d e^{x_i y_i − y_i²/2}, the above two displays imply that

∂^α η(x, y) = H_α(x − y) η(x, y), x, y ∈ R^d.

Apply the Taylor expansion (with the remainder expressed as an integral) to η(x, y) in y around 0 to see that, for each n ∈ N,

(4.2) η(x, y) = Σ_{j=0}^n a_j(x, y) + r_{n+1}(x, y), x, y ∈ R^d,

where

a_j(x, y) = Σ_{α∈[j]} (y^α/α!) ∂^α η(x, 0) = Σ_{α∈[j]} (y^α/α!) H_α(x),

(4.3) r_{n+1}(x, y) = (n+1) ∫_0^1 (1−s)^n Σ_{α∈[n+1]} (y^α/α!) ∂^α η(x, sy) ds = (n+1) ∫_0^1 (1−s)^n Σ_{α∈[n+1]} (y^α/α!) H_α(x − sy) η(x, sy) ds.

We can now prove the required estimates.

4.3.1. Proof of Lemma 3.2.
The condition M(n) implies that

E a_j(x, t^{−1/2} X) = E a_j(x, t^{−1/2} Y), x ∈ R^d, j = 0, 1, 2, . . . , n,

which together with (3.7) and (4.2) yields the following

Lemma 4.2.
Suppose µ and ν have finite (n+1)th moments and M(n) holds for some n ∈ N ∪ {0}. With r_{n+1} given in (4.3), we have

Θ_t(x) = t^{1/2} ( E r_{n+1}(x, t^{−1/2} X) − E r_{n+1}(x, t^{−1/2} Y) ), x ∈ R^d, t > 0.

Lemma 3.2 then follows from Lemma 4.2 and the following result, whose proof appears in Section 8.1.
Lemma 4.3.
Let p ≥ 1. Suppose E(β) holds for some β > 0. Then for each m ∈ N ∪ {0} and δ > 0, there is c_{d,m,p,δ,β} > 0 such that, for Z ∈ {X, Y},

(4.4) ∫ | t^{1/2} E r_{m+1}(x, t^{−1/2} Z) |^p g(dx) ≤ c_{d,m,p,δ,β} t^{−mp/2} ( E e^{δ(p−1)|Z|²/t} )^p, t > δ(p−1)/β.

If p = 1, (4.4) holds for all t > 0 under the weaker assumption E|X|^{m+1}, E|Y|^{m+1} < ∞.

4.3.2. Proof of Lemma 3.3.
Using (3.7), (4.2) and the assumption M(n), we can write

(4.5) Θ_t = Q + R,

where

(4.6) Q(x) = t^{1/2} ( E a_{n+1}(x, t^{−1/2} X) − E a_{n+1}(x, t^{−1/2} Y) ) = t^{−n/2} Σ_{α∈[n+1]} (1/α!) (E X^α − E Y^α) H_α(x),

(4.7) R(x) = t^{1/2} ( E r_{n+2}(x, t^{−1/2} X) − E r_{n+2}(x, t^{−1/2} Y) ).

To carry out our computations, we need the result that Q and R are orthogonal to each other, namely,

(4.8) ∫ QR d g = 0.
We prove this fact in Section 8.2.

Due to (4.1), we have w = L^{−1} Θ_t = L^{−1}(Q + R). Since J_{n+1} Q = Q ∈ P, by the definition of L^{−1}, we have L^{−1} Q = −(1/(n+1)) Q. Using this, (4.8) and the self-adjointness of L^{−1}, we can compute

(4.9) ∫ −wLw d g = ∫ −(L^{−1}Q + L^{−1}R)(Q + R) d g = (1/(n+1)) ∫ |Q|² d g − ∫ R L^{−1} R d g.

To determine the first term on the right hand side, we need the following standard fact.

Lemma 4.4.
Let α and β be two multi-indices. Then

∫ H_α H_β d g = α! if α = β, and 0 if α ≠ β.

This lemma together with (4.6) gives

(4.10) ∫ |Q|² d g = t^{−n} Σ_{α∈[n+1]} (1/α!) |E X^α − E Y^α|².

To estimate the second term, recall the definition of the operator C^{−1} in Section 4.1 and see that

−∫ R L^{−1} R d g = ∫ |C^{−1} R|² d g ≤ c ∫ |R|² d g,

where we used [25, Theorem 1.4.2] in the last inequality. Recall the formula (4.7). Then, Lemma 4.3 with m = n + 1 implies that for each δ > 0, there is c_{d,n,δ} > 0 such that

(4.11) ∫ |R|² d g ≤ c_{d,n,δ} t^{−n−1} max_{Z∈{X,Y}} ( E e^{δ|Z|²/t} )² = O(t^{−n−1}).

Hence, the above display gives −∫ R L^{−1} R d g = O(t^{−n−1}). Combining (4.9), (4.10), and (4.11) yields the claim.

5. Exact Asymptotics for p = 2

In Section 3, we exhibited a valid coupling between µ∗ρ_t and ν∗ρ_t achieving the claimed upper bound of Theorem 2.1. A key step in the proof was the derivation of inequality (3.15):

W_2²(µ∗ρ_t, ν∗ρ_t) ≤ e^{(E|X|² ∨ E|Y|²)/(2t)} ∫ −wLw d g.

The goal of this section is to prove the following complementary lower bound.
Proposition 5.1.
Assume $E(\beta)$ and $M(n)$ hold for some $\beta>0$ and $n\in\mathbb N\setminus\{0\}$, respectively. Then as $t\to\infty$,
\[ W_2^2(\mu*\rho_t,\nu*\rho_t) \ge (1+o(1)) \int -wLw\, dg. \tag{5.1} \]
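As a quick sanity check of this rate (not taken from the proofs here; a hypothetical one-dimensional example): take $\mu=N(0,a^2)$ and $\nu=N(0,b^2)$ with equal means, so that $M(1)$ holds with $n=1$, and use the exact formula $W_2(N(0,\sigma_1^2),N(0,\sigma_2^2))=|\sigma_1-\sigma_2|$ for centered Gaussians to compare $W_2^2$ with its leading term $(a^2-b^2)^2/(4t)$, of order $t^{-n}$:

```python
import numpy as np

# Hypothetical 1D check: mu = N(0, a2), nu = N(0, b2) share their mean
# (so first moments match, M(1), n = 1) but have different variances.
# mu * rho_t = N(0, a2 + t) and nu * rho_t = N(0, b2 + t); for centered
# Gaussians, W_2 is the difference of standard deviations.
a2, b2 = 1.0, 4.0
for t in [1e2, 1e4, 1e6]:
    w2_sq = (np.sqrt(a2 + t) - np.sqrt(b2 + t)) ** 2   # exact W_2^2
    predicted = (a2 - b2) ** 2 / (4 * t)               # leading-order t^{-1} term
    assert abs(w2_sq / predicted - 1) < 10 / np.sqrt(t)  # relative error shrinks
```

The contraction is only polynomial: doubling $t$ roughly halves $W_2^2$, in contrast with the exponential decay (1.1) available under positive curvature.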
Combined with the explicit expansion developed in Lemma 3.3, this proposition proves the desired lower bound of Theorem 2.1 under $M(n)$ for $n\ge 1$. As in Section 3, the bound for $M(0)$ follows immediately from the argument of Corollary 2.4.

We now proceed with the proof. We first define the displacement interpolation between two measures [23]. Denote by $\sharp$ the push-forward of a measure under a map. Since $\mu*\rho_t$ and $\nu*\rho_t$ are absolutely continuous with respect to the Lebesgue measure, the optimal coupling between them is given by a convex function $\varphi$ on $\mathbb R^d$ such that $(\nabla\varphi)_\sharp(\mu*\rho_t)=\nu*\rho_t$. We then define the displacement interpolation $\mu_s$ by
\[ \mu_s = \big((1-s)\mathrm{Id} + s\nabla\varphi\big)_\sharp(\mu*\rho_t), \qquad s\in[0,1]. \tag{5.2} \]
Recall $u$ in (3.2). Since $u$ is locally Lipschitz, we have the following inequality (see [19, Lemma A.1; 9, Lemma 13]):
\[ \int -u\,(\mu*\rho_t-\nu*\rho_t)\,dx \le W_2(\mu*\rho_t,\nu*\rho_t)\,\Big( \sup_{s\in[0,1]}\int |\nabla u|^2\,\mu_s\,dx \Big)^{1/2}. \tag{5.3} \]
Let us set
\[ a = a(t) = 1 + t^{-1/2}. \tag{5.4} \]
We work with the Gaussian measure $\rho_{at}$, which is a slight perturbation of $\rho_t$. The following bound holds:
\[ \Big\| \frac{\mu_s}{\rho_{at}} \Big\|_\infty \le \Big\| \frac{\mu*\rho_t}{\rho_{at}} \Big\|_\infty \vee \Big\| \frac{\nu*\rho_t}{\rho_{at}} \Big\|_\infty. \tag{5.5} \]
Indeed, choosing $\rho_{at}$ as the reference measure on $\mathbb R^d$, the curvature-dimension criterion $\mathrm{CD}(0,\infty)$ (cf. [33, Theorem 14.8]) is satisfied.
This allows us to apply [33, Theorem 17.15] to see that, for each $p>1$, the following functional is displacement-convex:
\[ U_p(\rho) = \begin{cases} \int_{\mathbb R^d} \big| \frac{\rho}{\rho_{at}} \big|^p \rho_{at}\, dx, & \text{if } \rho \text{ is a probability density,} \\ \infty, & \text{otherwise.} \end{cases} \]
Hence, we have
\[ \Big\| \frac{\mu_s}{\rho_{at}} \Big\|_{L^p(\rho_{at})} \le \Big\| \frac{\mu*\rho_t}{\rho_{at}} \Big\|_{L^p(\rho_{at})} \vee \Big\| \frac{\nu*\rho_t}{\rho_{at}} \Big\|_{L^p(\rho_{at})}. \]
Sending $p\to\infty$, we obtain (5.5). Then, we estimate the right-hand side of (5.5):
\[ \frac{\mu*\rho_t}{\rho_{at}}(x) = \Big(\frac{2\pi a t}{2\pi t}\Big)^{d/2}\, \mathbb E \exp\Big( -\frac{1}{2t}\big( |x|^2 - 2\langle x,X\rangle + |X|^2 - a^{-1}|x|^2 \big) \Big) = a^{d/2}\, \mathbb E\, e^{-\frac{1}{2t}\big|\sqrt{\frac{a-1}{a}}\,x - \sqrt{\frac{a}{a-1}}\,X\big|^2}\, e^{\frac{1}{2t(a-1)}|X|^2} \le (1+t^{-1/2})^{d/2}\, \mathbb E\, e^{\frac12 t^{-1/2}|X|^2}. \tag{5.6} \]
An analogous bound holds for $\nu*\rho_t/\rho_{at}$. Now let
\[ c(t) = (1+t^{-1/2})^{d/2}\Big( \mathbb E\, e^{\frac12 t^{-1/2}|X|^2} \vee \mathbb E\, e^{\frac12 t^{-1/2}|Y|^2} \Big). \]
Clearly, $\lim_{t\to\infty}c(t)=1$. The above two displays and (5.5) imply that
\[ \frac{\mu_s}{\rho_{at}}(x) \le c(t), \qquad x\in\mathbb R^d,\ s\in[0,1]. \tag{5.7} \]
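The bound (5.6) is easy to test numerically. The following minimal sketch (a hypothetical discrete $\mu$ in $d=1$; the atoms, probabilities, and grid are ours, not the paper's) verifies that $(\mu*\rho_t)/\rho_{at}$ stays below $(1+t^{-1/2})^{d/2}\,\mathbb E\,e^{\frac12 t^{-1/2}|X|^2}$ on a grid:

```python
import numpy as np

# Hypothetical 1D example: mu is a three-point measure, t and the grid are ours.
t = 50.0
a = 1 + t**-0.5
xs = np.array([-2.0, 1.0, 3.0])   # atoms of mu
ps = np.array([0.2, 0.5, 0.3])    # their probabilities

grid = np.linspace(-60, 60, 4001)
mu_rho_t = (ps * np.exp(-(grid[:, None] - xs) ** 2 / (2 * t))).sum(1) / np.sqrt(2 * np.pi * t)
rho_at = np.exp(-grid ** 2 / (2 * a * t)) / np.sqrt(2 * np.pi * a * t)

# Right-hand side of (5.6) with d = 1
bound = (1 + t**-0.5) ** 0.5 * (ps * np.exp(0.5 * t**-0.5 * xs ** 2)).sum()
assert np.all(mu_rho_t / rho_at <= bound + 1e-12)
```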
Use the estimate (5.7) in (5.3) to see
\[ \int -u\,(\mu*\rho_t-\nu*\rho_t)\,dx \le W_2(\mu*\rho_t,\nu*\rho_t)\, c(t)^{1/2}\Big( \int |\nabla u|^2\rho_{at}\,dx \Big)^{1/2}. \tag{5.8} \]
We now show that $\int|\nabla u|^2\rho_{at}\,dx$ is a good approximation of $\int|\nabla u|^2\rho_t\,dx$. We begin with an elementary computation: for $t>1$,
\[ \big| e^{-\frac{|x|^2}{2at}} - e^{-\frac{|x|^2}{2t}} \big| = \big(1-e^{\frac{1-a}{2at}|x|^2}\big)\,e^{-\frac{|x|^2}{2at}} \le \frac{a-1}{2at}\,|x|^2\,e^{-\frac{|x|^2}{2at}} \le c\,t^{-1/2}\,e^{-\frac{|x|^2}{4at}}, \tag{5.9} \]
where $c>0$ is a universal constant. This estimate implies that
\[ \Big| \int |\nabla u|^2\rho_{at}\,dx - a^{-d/2}\int |\nabla u|^2\rho_t\,dx \Big| \le c\,t^{-1/2}\,2^{d/2}\int |\nabla u|^2\rho_{2at}\,dx. \tag{5.10} \]
Apply Hölder's inequality to see, for $t$ large,
\[ \int |\nabla u|^2\rho_{2at}\,dx \le \Big( \int \big(\tfrac{\rho_{2at}}{\rho_t}\big)^2 \rho_t\,dx \Big)^{1/2}\Big( \int |\nabla u|^4\rho_t\,dx \Big)^{1/2} = (2a)^{-d/2}\Big( \int e^{\frac{a-1}{2a}|x|^2}\,g(dx) \Big)^{1/2}\Big( \int |\nabla u|^4\rho_t\,dx \Big)^{1/2} \le c_d\Big( \int |\nabla u|^4\rho_t\,dx \Big)^{1/2}. \tag{5.11} \]
Applying a change of variables, (3.9) implies $\int|\nabla u|^2\rho_t\,dx = \int|\nabla w|^2\,dg$. Then (3.14) and Lemma 3.2 imply that $\int|\nabla u|^4\rho_t\,dx = O(t^{-2n})$. From this, (5.10) and (5.11), we obtain
\[ \int |\nabla u|^2\rho_{at}\,dx = a^{-d/2}\int |\nabla u|^2\rho_t\,dx + O(t^{-n-1/2}). \tag{5.12} \]
Recall (3.8), (3.9) and (3.11). Changing variables and integrating by parts, we have
\[ \int |\nabla u|^2\rho_t\,dx = \int |\nabla w|^2\,dg = \int -wLw\,dg, \]
\[ \int -u\,(\mu*\rho_t-\nu*\rho_t)\,dx = \int -u\; t^{-1/2}\Theta_t(t^{-1/2}x)\,\rho_t\,dx = \int -w\,\Theta_t\,dg = \int -wLw\,dg. \]
Plug the above display and (5.12) into (5.8) to get a lower bound
\[ \frac{c^{-1}(t)\,\big( \int -wLw\,dg \big)^2}{a^{-d/2}(t)\int -wLw\,dg + O(t^{-n-1/2})} \le W_2^2(\mu*\rho_t,\nu*\rho_t). \]
Since $c(t)$ and $a(t)$ both converge to $1$ as $t\to\infty$, and Lemma 3.3 implies that $\int-wLw\,dg$ is of order $t^{-n}$, we obtain Proposition 5.1.

6. Lower Bound for p = 1

Section 3 establishes an upper bound on $W_p(\mu*\rho_t,\nu*\rho_t)$ valid for all $p\ge 1$. To complete the proof of Theorem 2.3, we complement this upper bound with a lower bound on $W_1(\mu*\rho_t,\nu*\rho_t)$ of the same order.
Since $W_1\le W_p$ for all $p\ge 1$, this lower bound suffices to establish the desired two-sided bound on $W_p(\mu*\rho_t,\nu*\rho_t)$. As before, it suffices to assume (3.1). Our technique is to employ Kantorovich–Rubinstein duality, which reads
\[ W_1(\mu*\rho_t,\nu*\rho_t) = \sup_{f\in\mathrm{Lip}_1} \int f(x)\,(\mu*\rho_t-\nu*\rho_t)(x)\,dx, \]
where the supremum is taken over all $1$-Lipschitz functions on $\mathbb R^d$. To apply the Kantorovich–Rubinstein duality, we need to construct a suitable Lipschitz test function. For this, we need a smooth bump function $\varphi:\mathbb R^d\to\mathbb R$ with the following properties:
\[ 0\le\varphi\le 1;\quad \varphi(x)=1,\ \forall x\in B_1;\quad \varphi(x)=0,\ \forall x\notin B_2;\quad |\nabla\varphi|\le 2, \tag{6.1} \]
where $B_r$ denotes the centered Euclidean ball with radius $r>0$. Now, let us consider
\[ f(x) = \varphi(t^{-1/2}x)\,\Theta_t(t^{-1/2}x), \qquad x\in\mathbb R^d. \]
We first employ the following estimate, whose proof is deferred to Section 8.3:
\[ |\nabla f| \le c_{d,n}\,t^{-\frac{n+1}{2}} \max_{Z\in\{X,Y\}}\Big( \mathbb E|Z|^{n+1} + t^{-\frac{n+2}{2}}\,\mathbb E|Z|^{n+3} \Big) = c_{d,n}\,t^{-\frac{n+1}{2}} \max_{Z\in\{X,Y\}}\big( \mathbb E|Z|^{n+1} \big)\,(1+o(1)) \tag{6.2} \]
as $t\to\infty$. Then, the Kantorovich–Rubinstein duality implies that
\[ W_1(\mu*\rho_t,\nu*\rho_t) \ge (1-o(1))\,c_{\mu,\nu}\,t^{\frac{n+1}{2}}\int f(x)\,\big(\mu*\rho_t-\nu*\rho_t\big)(x)\,dx = (1-o(1))\,c_{\mu,\nu}\,t^{\frac n2}\int \varphi\,|\Theta_t|^2\,dg \ge (1-o(1))\,c_{\mu,\nu}\,t^{\frac n2}\int_{B_1}|\Theta_t|^2\,dg, \]
where (3.8) is used to derive the equality. To lower bound the last integral, we use (4.5) to see
\[ \int_{B_1}|\Theta_t|^2\,dg \ge \frac12\int_{B_1}|Q|^2\,dg - \int_{B_1}|R|^2\,dg. \]
From this, (4.6) and (4.11), we can derive
\[ \int_{B_1}|\Theta_t|^2\,dg \ge \frac12\,t^{-n}\int_{B_1}\Big| \sum_{\alpha\in[n+1]}\frac{1}{\alpha!}\big(\mathbb E X^\alpha - \mathbb E Y^\alpha\big)H_\alpha(x) \Big|^2 g(dx) - c_{d,n,\delta}\,t^{-n-1}\max_{Z\in\{X,Y\}}\Big( \mathbb E\,e^{\delta t^{-1/2}|Z|^2} \Big)^2, \]
where the fact that the Hermite polynomials form an orthogonal basis of $L^2(\mathbb R^d,g)$, combined with the condition $M(n)$, implies that the integrand on the right side is not identically zero. We therefore obtain $W_1(\mu*\rho_t,\nu*\rho_t) \ge (1-o(1))\,c_{\mu,\nu}\,t^{-\frac n2}$, as claimed.
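The Kantorovich–Rubinstein duality used above is easy to test numerically: every $1$-Lipschitz $f$ yields a lower bound on $W_1$, with equality at an optimal $f$. A minimal sketch on empirical measures (the sample sizes and distributions are hypothetical, not from the paper):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 20000)   # samples from a stand-in mu
y = rng.normal(0.5, 1.5, 20000)   # samples from a stand-in nu

# W_1 between the two empirical measures (computed exactly in 1D)
w1 = wasserstein_distance(x, y)

# Any 1-Lipschitz test function gives |int f d(mu - nu)| <= W_1(mu, nu).
for f in (lambda z: z, np.abs, lambda z: np.minimum(z, 1.0)):
    assert abs(f(x).mean() - f(y).mean()) <= w1 + 1e-9
```

The proof above implements the same idea with the specific test function $f(x)=\varphi(t^{-1/2}x)\Theta_t(t^{-1/2}x)$, normalized by its Lipschitz constant (6.2).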
7. Asymptotics for f-divergences

In this section, we prove Theorems 2.5 and 2.6. In contrast to our results on the Wasserstein distances, asymptotics for $f$-divergences are significantly easier to obtain, because they are defined as explicit functions of the densities. Our results show that, when properly rescaled, both $\chi^2(\mu*\rho_t,\nu*\rho_t)$ and $D_{\mathrm{KL}}(\mu*\rho_t\,\|\,\nu*\rho_t)$ possess the same limiting value, which also agrees with the limiting value of the rescaled squared $2$-Wasserstein distance. Somewhat surprisingly, while $\chi^2$ and $D_{\mathrm{KL}}$ are not symmetric in their arguments, their limiting values are symmetric.

7.1. Exact asymptotics for the χ²-divergence. The goal is to prove (2.1). The $\chi^2$-divergence between $\mu*\rho_t$ and $\nu*\rho_t$ admits the following representation:
\[ \chi^2(\mu*\rho_t,\nu*\rho_t) = \int \Big( \frac{\mu*\rho_t}{\nu*\rho_t} \Big)^2 \nu*\rho_t\,dx - 1 = \int \frac{(\mu*\rho_t-\nu*\rho_t)^2}{\nu*\rho_t}\,dx. \tag{7.1} \]
To derive an upper bound, we need a lower bound for $\nu*\rho_t$. Apply Jensen's inequality to see
\[ \nu*\rho_t(x) = (2\pi t)^{-d/2}\,\mathbb E\,e^{-\frac{1}{2t}|x-Y|^2} \ge (2\pi t)^{-d/2}\,e^{-\frac{1}{2t}\mathbb E|x-Y|^2} = (2\pi t)^{-d/2}\,e^{-\frac{1}{2t}|x|^2+\frac1t\langle x,\mathbb E Y\rangle-\frac{1}{2t}\mathbb E|Y|^2}. \]
By Cauchy–Schwarz, we have $\langle x,\mathbb E Y\rangle \ge -\frac12 t^{-1/2}|x|^2 - \frac12 t^{1/2}|\mathbb E Y|^2$. Let $a=a(t)$ be given in (5.4). The above two displays imply that
\[ \nu*\rho_t(x) \ge a^{-d/2}\,e^{-\frac12 t^{-1/2}|\mathbb E Y|^2 - (2t)^{-1}\mathbb E|Y|^2}\,\rho_{t/a}(x). \tag{7.2} \]
Denote the right-hand side of this display by $c_{1,t}\,\rho_{t/a}(x)$; clearly, we have $\lim_{t\to\infty}c_{1,t}=1$. Hence, we obtain an upper bound:
\[ \chi^2(\mu*\rho_t,\nu*\rho_t) \le c_{1,t}^{-1}\int \frac{(\mu*\rho_t-\nu*\rho_t)^2}{\rho_{t/a}}\,dx. \tag{7.3} \]
For a lower bound, we use a version of (5.6) for $\nu$ to obtain
\[ \chi^2(\mu*\rho_t,\nu*\rho_t) \ge c_{2,t}^{-1}\int \frac{(\mu*\rho_t-\nu*\rho_t)^2}{\rho_{at}}\,dx. \]
Here $c_{2,t} = (1+t^{-1/2})^{d/2}\,\mathbb E\,e^{\frac12 t^{-1/2}|Y|^2}$, which converges to $1$ as $t\to\infty$. The desired result follows from this, (7.3) and the following lemma.

Lemma 7.1.
Suppose $M(n)$ holds for some $n\in\mathbb N\cup\{0\}$. If $z=z(t)$ is a function of $t$ satisfying $|z(t)-1|\le c\,t^{-1/2}$ for all $t$ with some constant $c>0$, then it holds that
\[ \lim_{t\to\infty}\, t^{n+1}\int \frac{(\mu*\rho_t-\nu*\rho_t)^2}{\rho_{zt}}\,dx = \sum_{\alpha\in[n+1]}\frac{1}{\alpha!}\,\big|\mathbb E X^\alpha-\mathbb E Y^\alpha\big|^2. \]

Proof.
By a computation similar to (5.9), we have
\[ \Big| \frac{\rho_t(x)}{\rho_{zt}(x)} - z^{d/2} \Big| \le c\,z^{d/2}\,t^{-1/2}\,e^{\frac{|x|^2}{4zt}}, \qquad x\in\mathbb R^d. \]
This display together with (3.8) yields, for $t$ large,
\[ \Big| \int \frac{(\mu*\rho_t-\nu*\rho_t)^2}{\rho_{zt}}\,dx - z^{d/2}\,t^{-1}\int|\Theta_t|^2\,dg \Big| = t^{-1}\Big| \int |\Theta_t(t^{-1/2}x)|^2\,\frac{\rho_t(x)}{\rho_{zt}(x)}\,\rho_t(x)\,dx - z^{d/2}\int |\Theta_t(t^{-1/2}x)|^2\rho_t(x)\,dx \Big| \le t^{-1}\int |\Theta_t(t^{-1/2}x)|^2\,c\,z^{d/2}\,t^{-1/2}\,e^{\frac{|x|^2}{4zt}}\,\rho_t(x)\,dx \le c_d\,t^{-3/2}\int |\Theta_t(x)|^2\,\rho_{\frac{2z}{2z-1}}(x)\,dx. \tag{7.4} \]
Invoke Hölder's inequality to see, for $t$ large,
\[ \int |\Theta_t(x)|^2\,\rho_{\frac{2z}{2z-1}}(x)\,dx \le c_d\Big( \int|\Theta_t(x)|^4\,dg \Big)^{1/2}\Big( \int \Big(\frac{\rho_{\frac{2z}{2z-1}}}{\rho_1}\Big)^2 dg \Big)^{1/2} \le c_d\Big( \int|\Theta_t|^4\,dg \Big)^{1/2} = O(t^{-n}), \tag{7.5} \]
where the last identity follows from Lemma 3.2. To compute $\int|\Theta_t|^2\,dg$, let us recall (4.5) and (4.8). These imply
\[ \int|\Theta_t|^2\,dg = \int|Q|^2\,dg + \int|R|^2\,dg. \]
From this decomposition, (4.10) and (4.11), we obtain
\[ \lim_{t\to\infty}\, t^n\int|\Theta_t|^2\,dg = \sum_{\alpha\in[n+1]}\frac{1}{\alpha!}\,\big|\mathbb E X^\alpha-\mathbb E Y^\alpha\big|^2. \tag{7.6} \]
The proof is complete by combining (7.4), (7.5) and (7.6). □

7.2. Exact asymptotics for relative entropy.
In this subsection, we prove (2.2). For simplicity, let us write $f=\mu*\rho_t$ and $g=\nu*\rho_t$. Using the Taylor expansion
\[ \log(x+1) = x + \int_0^1 \frac{(s-1)\,x^2}{(1+sx)^2}\,ds, \qquad x>-1, \]
we have
\[ \log\Big(\frac{f-g}{g}+1\Big) = \frac{f-g}{g} + \int_0^1 \frac{(s-1)(f-g)^2}{(sf+(1-s)g)^2}\,ds. \]
Therefore, we obtain
\[ D_{\mathrm{KL}}(f\,\|\,g) = \chi^2(f,g) + \int_0^1 (s-1)\int_{\mathbb R^d} \frac{(f-g)^2}{(sf+(1-s)g)^2}\,f\,dx\,ds. \tag{7.7} \]
Since we have already established the asymptotic behavior of $\chi^2$, we focus on the second term. Due to (5.6) and (7.2) for both $\mu$ and $\nu$, there are $c_{1,t}$ and $c_{2,t}$, both converging to $1$ as $t\to\infty$, such that
\[ c_{1,t}\,\rho_{t/a}(x) \le f(x),\,g(x) \le c_{2,t}\,\rho_{at}(x), \qquad x\in\mathbb R^d, \]
with $a=a(t)$ given in (5.4). Therefore, we have
\[ \frac{c_{1,t}}{c_{2,t}^2}\int (f-g)^2\,\frac{\rho_{t/a}}{\rho_{at}^2}\,dx \le \int \frac{(f-g)^2}{(sf+(1-s)g)^2}\,f\,dx \le \frac{c_{2,t}}{c_{1,t}^2}\int (f-g)^2\,\frac{\rho_{at}}{\rho_{t/a}^2}\,dx. \]
After computing $\rho_{t/a}/\rho_{at}^2$ and $\rho_{at}/\rho_{t/a}^2$, one can see that the above becomes
\[ \frac{c_{1,t}}{c_{2,t}^2}\,\big(2a^2-a^4\big)^{d/2}\int \frac{(f-g)^2}{\rho_{\frac{a}{2-a^2}t}}\,dx \le \int \frac{(f-g)^2}{(sf+(1-s)g)^2}\,f\,dx \le \frac{c_{2,t}}{c_{1,t}^2}\,\Big(\frac{2a^2-1}{a^4}\Big)^{d/2}\int \frac{(f-g)^2}{\rho_{\frac{a}{2a^2-1}t}}\,dx. \]
Note that both the upper bound and the lower bound are independent of $s$, and that $\frac{a}{2-a^2}$ and $\frac{a}{2a^2-1}$ both differ from $1$ by $O(t^{-1/2})$, so Lemma 7.1 applies. These bounds along with Lemma 7.1 imply
\[ \lim_{t\to\infty}\, t^{n+1}\int_0^1 (s-1)\int_{\mathbb R^d} \frac{(f-g)^2}{(sf+(1-s)g)^2}\,f\,dx\,ds = -\frac12\sum_{\alpha\in[n+1]}\frac{1}{\alpha!}\,\big|\mathbb E X^\alpha-\mathbb E Y^\alpha\big|^2. \]
This together with (7.7) and (2.1) finishes the proof.
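The cancellation just derived can be seen in closed form in the simplest case $n=0$ (a hypothetical pair of point masses, using only standard Gaussian formulas, not the estimates above): for $\mu=\delta_0$ and $\nu=\delta_m$ one has $\chi^2(\mu*\rho_t,\nu*\rho_t)=e^{m^2/t}-1$ and $D_{\mathrm{KL}}(\mu*\rho_t\,\|\,\nu*\rho_t)=m^2/(2t)$, so the rescaled KL limit is exactly half the rescaled $\chi^2$ limit:

```python
import numpy as np

# Hypothetical point-mass example: the smoothed measures are N(0, t) and
# N(m, t). Closed forms for equal-variance Gaussians:
#   chi^2 = exp(m^2/t) - 1,   D_KL = m^2 / (2t),
# so t*chi^2 -> m^2 while t*D_KL -> m^2/2, matching the factor-1/2
# produced by the (s - 1) integral in (7.7).
m = 2.0
for t in [1e2, 1e4, 1e6]:
    chi2 = np.expm1(m**2 / t)
    dkl = m**2 / (2 * t)
    assert abs(t * chi2 - m**2) < 2 * m**4 / t
    assert abs(t * dkl - m**2 / 2) < 1e-12
```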
7.3. Exact asymptotics for the total variation distance.
Recall (3.7), and we have
\[ d_{\mathrm{TV}}(\mu*\rho_t,\nu*\rho_t) = \frac12\int \big|\mu*\rho_t-\nu*\rho_t\big|\,dx = \frac12\int \big| t^{-1/2}\,\Theta_t(t^{-1/2}x) \big|\,\rho_t(x)\,dx = \frac12\,t^{-1/2}\int |\Theta_t|\,dg. \]
Due to (3.7) and (4.2), for fixed $n$, we write $\Theta_t = Q + R$ with
\[ Q(x) = t^{1/2}\sum_{j=0}^{n+1}\big( \mathbb E\,a_j(x,t^{-1/2}X) - \mathbb E\,a_j(x,t^{-1/2}Y) \big) = \sum_{j=1}^{n+1} t^{-\frac{j-1}{2}}\sum_{\alpha\in[j]}\frac{1}{\alpha!}\big( \mathbb E X^\alpha-\mathbb E Y^\alpha \big)H_\alpha(x), \]
and $R$ as in (4.7). The triangle inequality implies that
\[ \Big| \int |\Theta_t|\,dg - \int |Q|\,dg \Big| \le \int |R|\,dg. \]
Applying Lemma 4.3 with $m=n+1$ and $p=1$, we obtain the desired result.

8. Auxiliary Results
8.1. Proof of Lemma 4.3.
Proof.
Formula (4.3) implies that
\[ \int \big| t^{1/2}\,\mathbb E\,r_{m+1}(x,t^{-1/2}Z) \big|^p g(dx) = (m+1)^p\,t^{-\frac{mp}{2}}\int \Big| \int_0^1 (1-s)^m \sum_{\alpha\in[m+1]}\frac{1}{\alpha!}\,\mathbb E\big[ Z^\alpha H_\alpha(x-st^{-1/2}Z)\,\eta(x,st^{-1/2}Z) \big]\,ds \Big|^p g(dx) \]
\[ \le c_{m,p}\,t^{-\frac{mp}{2}}\sum_{\alpha\in[m+1]}\int \Big| \int_0^1 \mathbb E\big[ Z^\alpha H_\alpha(x-st^{-1/2}Z)\,\eta(x,st^{-1/2}Z) \big]\,ds \Big|^p g(dx) \le c_{m,p}\,t^{-\frac{mp}{2}}\sum_{\alpha\in[m+1]}\Bigg( \int_0^1 \mathbb E\Bigg( \int \big| Z^\alpha H_\alpha(x-st^{-1/2}Z)\,\eta(x,st^{-1/2}Z) \big|^p g(dx) \Bigg)^{1/p} ds \Bigg)^p, \]
where in the last inequality we used Minkowski's integral inequality. Then, we estimate the integral with respect to $g$. Recall $\eta$ in (3.7). Due to $\alpha\in[m+1]$, we have $|Z^\alpha|\le|Z|^{m+1}$. Since $H_\alpha$ is a polynomial of degree $m+1$, as is evident from (2.4), one can see that $|H_\alpha(x)|\le c_{d,m}(1+|x|^{m+1})$. Therefore, we obtain
\[ \int \big| Z^\alpha H_\alpha(x-st^{-1/2}Z)\,\eta(x,st^{-1/2}Z) \big|^p g(dx) \le c_{d,m,p}\,|Z|^{p(m+1)}\int \big( 1+|x-st^{-1/2}Z|^{p(m+1)} \big)\,e^{\langle x,\,p s t^{-1/2}Z\rangle - \frac p2|st^{-1/2}Z|^2}\,e^{-\frac{|x|^2}{2}}\,dx \]
\[ = c_{d,m,p}\,|Z|^{p(m+1)}\Big( \int \big( 1+|x-st^{-1/2}Z|^{p(m+1)} \big)\,e^{-\frac12|x-pst^{-1/2}Z|^2}\,dx \Big)\,e^{\frac{p^2-p}{2}|st^{-1/2}Z|^2} = c_{d,m,p}\,|Z|^{p(m+1)}\big( 1+|st^{-1/2}(p-1)Z|^{p(m+1)} \big)\,e^{\frac{p^2-p}{2}|st^{-1/2}Z|^2}. \]
The above two displays yield, for $\delta>0$,
\[ \int \big| t^{1/2}\,\mathbb E\,r_{m+1}(x,t^{-1/2}Z) \big|^p g(dx) \le c_{d,m,p}\,t^{-\frac{mp}{2}}\Big( \mathbb E\big( |Z|^{m+1} + t^{-\frac{m+1}{2}}|(p-1)Z|^{m+2} \big)\,e^{\frac{p-1}{2}t^{-1}|Z|^2} \Big)^p \le c_{d,m,p,\delta,\beta}\,t^{-\frac{mp}{2}}\Big( \mathbb E\,e^{\delta(p-1)t^{-1/2}|Z|^2} \Big)^p, \qquad t>\delta^2(p-1)^2\beta^{-2}. \]
For $p=1$, it can be checked that the above is valid as long as $\mathbb E|Z|^{m+1}<\infty$. □

8.2. Proof of (4.8).

Proof.
Due to (4.6), it suffices to show
\[ \int H_\alpha\,(\Theta_t-Q)\,dg = 0, \qquad \text{for all } \alpha\in[n+1]. \tag{8.1} \]
Using (3.7) and changing variables, we have, for $\alpha\in[n+1]$,
\[ \int H_\alpha\,\Theta_t\,dg = \frac{t^{1/2}}{(2\pi)^{d/2}}\int \mathbb E\,H_\alpha(x)\big( e^{-\frac12|x-t^{-1/2}X|^2} - e^{-\frac12|x-t^{-1/2}Y|^2} \big)\,dx = \frac{t^{1/2}}{(2\pi)^{d/2}}\int \mathbb E\big( H_\alpha(x+t^{-1/2}X) - H_\alpha(x+t^{-1/2}Y) \big)\,e^{-\frac{|x|^2}{2}}\,dx. \]
We claim that
\[ \mathbb E\big( H_\alpha(x+t^{-1/2}X) - H_\alpha(x+t^{-1/2}Y) \big) = t^{-\frac{n+1}{2}}\big( \mathbb E X^\alpha - \mathbb E Y^\alpha \big). \tag{8.2} \]
This immediately gives
\[ \int H_\alpha\,\Theta_t\,dg = t^{-\frac n2}\big( \mathbb E X^\alpha - \mathbb E Y^\alpha \big). \]
On the other hand, by Lemma 4.4, we have
\[ \int H_\alpha\,Q\,dg = t^{-\frac n2}\big( \mathbb E X^\alpha - \mathbb E Y^\alpha \big). \]
The above two displays give us (8.1). To show (8.2), we introduce the following notation. For $\beta\in\mathbb N^d$, we write $\beta\le\alpha$ if $\beta_i\le\alpha_i$ for all $i=1,2,\dots,d$. If $\beta\le\alpha$ and $\beta\neq\alpha$, we write $\beta<\alpha$. Lastly, let $|\beta|=\sum_i\beta_i$.
By (2.5) and (2.4), we know that $H_\alpha(x)$ is a polynomial of degree $|\alpha|=n+1$ whose leading-order term is $x^\alpha$. Hence, there are coefficients $c_\beta$ for $\beta\le\alpha$ such that the left-hand side of (8.2) admits the following expansion:
\[ \mathbb E\big( H_\alpha(x+t^{-1/2}X) - H_\alpha(x+t^{-1/2}Y) \big) = \sum_{\beta\le\alpha} c_\beta\, x^{\alpha-\beta}\, t^{-\frac{|\beta|}{2}}\big( \mathbb E X^\beta - \mathbb E Y^\beta \big). \]
If $\beta<\alpha$, then $|\beta|\le n$. Hence, by the assumption $M(n)$, we must have $\mathbb E X^\beta = \mathbb E Y^\beta$ for all $\beta<\alpha$. Therefore, the only term that does not vanish on the right of the above display is $c_\alpha\,t^{-\frac{n+1}{2}}(\mathbb E X^\alpha-\mathbb E Y^\alpha)$, and $c_\alpha=1$, as is evident from (2.5). This verifies (8.2) and completes the proof. □

8.3. Proof of (6.2). To show that $f$ is uniformly Lipschitz and to determine its Lipschitz constant, we start by estimating, using (6.1),
\[ |\nabla f(x)| \le t^{-1/2}\big( 2\,\big|\Theta_t(t^{-1/2}x)\big| + \big|\nabla\Theta_t(t^{-1/2}x)\big| \big)\,\mathbf 1_{\{|t^{-1/2}x|\le 2\}}. \]
To bound $\Theta_t(x)$ for $|x|\le 2$, due to Lemma 4.2, we only need to estimate, using (4.3), for $|x|\le 2$ and $Z\in\{X,Y\}$,
\[ t^{1/2}\,\mathbb E\big| r_{n+1}(x,t^{-1/2}Z) \big| \le c_{d,n}\,t^{-\frac n2}\int_0^1 \mathbb E\,|Z|^{n+1}\Big| \sum_{\alpha\in[n+1]} H_\alpha(x-st^{-1/2}Z)\,\eta(x,st^{-1/2}Z) \Big|\,ds \le c_{d,n}\,t^{-\frac n2}\,\mathbb E\,|Z|^{n+1}\big( |x|^{n+1}+|t^{-1/2}Z|^{n+1} \big)\,e^{\frac{|x|^2}{2}} \le c_{d,n}\,t^{-\frac n2}\big( \mathbb E|Z|^{n+1} + t^{-\frac{n+1}{2}}\,\mathbb E|Z|^{n+2} \big). \]
Here, to derive the second inequality, we used the fact that $H_\alpha$ is a polynomial of degree $n+1$, and also the formula for $\eta$ in (3.7). Again by Lemma 4.2, to bound $|\nabla\Theta_t(t^{-1/2}x)|$, we only need to show, for $|x|\le 2$ and $Z\in\{X,Y\}$,
\[ t^{1/2}\,\mathbb E\big| \nabla r_{n+1}(x,t^{-1/2}Z) \big| \le c_{d,n}\,t^{-\frac n2}\int_0^1 \mathbb E\,|Z|^{n+1}\Big| \sum_{\alpha\in[n+1]} \nabla\big( H_\alpha(x-st^{-1/2}Z)\,\eta(x,st^{-1/2}Z) \big) \Big|\,ds \]
\[ \le c_{d,n}\,t^{-\frac n2}\,\mathbb E\,|Z|^{n+1}\Big( \big( |x|^n+|t^{-1/2}Z|^n \big)\,e^{\frac{|x|^2}{2}} + \big( |x|^{n+1}+|t^{-1/2}Z|^{n+1} \big)\,|t^{-1/2}Z|\,e^{\frac{|x|^2}{2}} \Big) \le c_{d,n}\,t^{-\frac n2}\big( \mathbb E|Z|^{n+1} + t^{-\frac{n+2}{2}}\,\mathbb E|Z|^{n+3} \big). \]
In the second inequality, we applied the product rule of differentiation, and again used the fact that $H_\alpha$ is a polynomial of degree $n+1$ and the definition of $\eta$ in (3.7). From all these estimates, we derive that
\[ |\nabla f| \le c_{d,n}\,t^{-\frac{n+1}{2}}\max_{Z\in\{X,Y\}}\Big( \mathbb E|Z|^{n+1} + t^{-\frac{n+2}{2}}\,\mathbb E|Z|^{n+3} \Big), \]
as claimed.

References

[1] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2005. ISBN 978-3-7643-2428-5; 3-7643-2428-7.
[2] L. Ambrosio, F. Stra, and D. Trevisan. A PDE approach to a 2-dimensional matching problem. Probab. Theory Related Fields, 173(1-2):433–477, 2019. ISSN 0178-8051. doi: 10.1007/s00440-018-0837-x. URL https://doi.org/10.1007/s00440-018-0837-x.
[3] D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion operators, volume 348 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Cham, 2014. ISBN 978-3-319-00226-2; 978-3-319-00227-9. doi: 10.1007/978-3-319-00227-9.
[4] A. S. Bandeira, J. Niles-Weed, and P. Rigollet. Optimal rates of estimation for multi-reference alignment. Mathematical Statistics and Learning, 2020. To appear.
[5] F. Bolley, I. Gentil, and A. Guillin.
Dimensional contraction via Markov transportation distance. Journal of the London Mathematical Society. Second Series, 90(1):309–332, 2014. ISSN 0024-6107. doi: 10.1112/jlms/jdu027.
[6] F. Bolley, I. Gentil, and A. Guillin. Dimensional improvements of the logarithmic Sobolev, Talagrand and Brascamp–Lieb inequalities. The Annals of Probability, 46(1):261–301, 2018. ISSN 0091-1798. doi: 10.1214/17-AOP1184.
[7] L. Brasco. A survey on dynamical transport distances. Journal of Mathematical Sciences, 181(6):755–781, 2012.
[8] S. Caracciolo, C. Lucibello, G. Parisi, and G. Sicuro. Scaling hypothesis for the Euclidean bipartite matching problem. Phys. Rev. E, 90:012118, Jul 2014. doi: 10.1103/PhysRevE.90.012118. URL https://link.aps.org/doi/10.1103/PhysRevE.90.012118.
[9] S. Chewi, T. Maunu, P. Rigollet, and A. J. Stromme. Gradient descent algorithms for Bures–Wasserstein barycenters. arXiv preprint arXiv:2001.01700, 2020.
[10] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85–108, 1963.
[11] A. Eberle. Reflection coupling and Wasserstein contractivity without convexity. Comptes Rendus Mathématique. Académie des Sciences. Paris, 349(19-20):1101–1104, 2011. ISSN 1631-073X. doi: 10.1016/j.crma.2011.09.003.
[12] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3-4):851–886, 2016. ISSN 0178-8051. doi: 10.1007/s00440-015-0673-1.
[13] M. Giaquinta and L. Martinazzi. An introduction to the regularity theory for elliptic systems, harmonic maps and minimal graphs. Springer Science & Business Media, 2013.
[14] C. R. Givens, R. M. Shortt, et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
[15] Z. Goldfeld and K. Greenewald. Gaussian-smooth optimal transport: Metric structure and statistical efficiency.
arXiv preprint arXiv:2001.09206, 2020.
[16] Z. Goldfeld, K. Greenewald, J. Niles-Weed, and Y. Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Trans. Inform. Theory, 2020. To appear.
[17] M. Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001. ISBN 0-8218-2864-9.
[18] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Trans. Inform. Theory, 52(10):4394–4412, 2006. ISSN 0018-9448. doi: 10.1109/TIT.2006.881731. URL https://doi.org/10.1109/TIT.2006.881731.
[19] J. Lott and C. Villani. Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics. Second Series, 169(3):903–991, 2009. ISSN 0003-486X. doi: 10.4007/annals.2009.169.903.
[20] D. Luo and J. Wang. Exponential convergence in $L^p$-Wasserstein distance for diffusion processes without uniformly dissipative drift. Mathematische Nachrichten, 289(14-15):1909–1926, 2016.
[21] K. Marton. Bounding $\bar d$-distance by informational divergence: a method to prove measure concentration. Ann. Probab., 24(2):857–866, 1996. ISSN 0091-1798. doi: 10.1214/aop/1039639365. URL https://doi.org/10.1214/aop/1039639365.
[22] K. Marton. A measure concentration inequality for contracting Markov chains. Geom. Funct. Anal., 6(3):556–571, 1996. ISSN 1016-443X. doi: 10.1007/BF02249263. URL https://doi.org/10.1007/BF02249263.
[23] R. J. McCann. A convexity principle for interacting gases. Adv. Math., 128(1):153–179, 1997. ISSN 0001-8708. doi: 10.1006/aima.1997.1634. URL https://doi.org/10.1006/aima.1997.1634.
[24] J. Moser. On the volume elements on a manifold. Trans. Amer. Math. Soc., 120:286–294, 1965. ISSN 0002-9947. doi: 10.2307/1994022. URL https://doi.org/10.2307/1994022.
[25] D. Nualart. The Malliavin calculus and related topics. Springer, 2006.
[26] F. Otto.
The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001. ISSN 0360-5302. doi: 10.1081/PDE-100002243.
[27] F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal., 173(2):361–400, 2000. ISSN 0022-1236. doi: 10.1006/jfan.1999.3557. URL https://doi.org/10.1006/jfan.1999.3557.
[28] R. Peyre. Comparison between $W_2$ distance and $\dot H^{-1}$ norm, and localization of Wasserstein distance. ESAIM Control Optim. Calc. Var., 24(4):1489–1501, 2018. ISSN 1292-8119. doi: 10.1051/cocv/2017050. URL https://doi.org/10.1051/cocv/2017050.
[29] G. Pisier. Probabilistic methods in the geometry of Banach spaces. In Probability and analysis, pages 167–241. Springer, 1986.
[30] K.-T. Sturm. On the geometry of metric measure spaces. II. Acta Mathematica, 196(1):133–177, 2006. ISSN 0001-5962. doi: 10.1007/s11511-006-0003-7.
[31] K.-T. Sturm. On the geometry of metric measure spaces. I. Acta Mathematica, 196(1):65–131, 2006. ISSN 0001-5962. doi: 10.1007/s11511-006-0002-8.
[32] M. Talagrand. Transportation cost for Gaussian and other product measures. Geom. Funct. Anal., 6(3):587–600, 1996. ISSN 1016-443X. doi: 10.1007/BF02249265. URL https://doi.org/10.1007/BF02249265.
[33] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2009.
[34] C. Villani. Synthetic theory of Ricci curvature bounds. Japanese Journal of Mathematics, 11(2):219–263, 2016. ISSN 0289-2316. doi: 10.1007/s11537-016-1531-3.
[35] M.-K. von Renesse and K.-T. Sturm. Transport inequalities, gradient estimates, entropy and Ricci curvature. Communications on Pure and Applied Mathematics, 58(7):923–940, 2005. doi: 10.1002/cpa.20060. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.20060.
[36] F.-Y. Wang. Analysis for diffusion processes on Riemannian manifolds, volume 18 of Advanced Series on Statistical Science & Applied Probability.
World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2014. ISBN 978-981-4452-64-9.
[37] F.-Y. Wang. Exponential contraction in Wasserstein distances for diffusion semigroups with negative curvature. arXiv preprint arXiv:1603.05749, 2016.
[38] J. Weed. Sharper rates for estimating differential entropy under Gaussian convolutions. Massachusetts Institute of Technology (MIT), Tech. Rep., 2018.