A Brief Note on the Convergence of Langevin Monte Carlo in Chi-Square Divergence
Murat A. Erdogdu∗   Rasa Hosseinzadeh†

July 31, 2020
Abstract
We study sampling from a target distribution $\nu^* \propto e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm when the target $\nu^*$ satisfies the Poincaré inequality and the potential $f$ is first-order smooth and dissipative. Under an opaque uniform warmness condition on the LMC iterates, we establish that $\widetilde{O}(\epsilon^{-1})$ steps are sufficient for LMC to reach an $\epsilon$-neighborhood of the target in Chi-square divergence. We hope that this note serves as a step towards establishing a complete convergence analysis of LMC under Chi-square divergence.

∗ Department of Computer Science and Department of Statistical Sciences at the University of Toronto, and Vector Institute, [email protected]
† Department of Computer Science at the University of Toronto, and Vector Institute, [email protected]

1 Introduction

We consider sampling from a target distribution $\nu^* \propto e^{-f}$ using the Langevin Monte Carlo (LMC) algorithm
$$x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\, W_k, \tag{1.1}$$
where $f : \mathbb{R}^d \to \mathbb{R}$ is the potential function, $W_k$ is a $d$-dimensional isotropic Gaussian random vector independent from $\{x_l\}_{l \le k}$, and $\eta$ is the step size. This algorithm is the Euler discretization of the following stochastic differential equation (SDE)
$$dz_t = -\nabla f(z_t)\, dt + \sqrt{2}\, dB_t, \tag{1.2}$$
where $B_t$ denotes the $d$-dimensional Brownian motion. The solution of the above SDE is referred to as the first-order Langevin diffusion, and the convergence behavior of the LMC algorithm (1.1) is intimately related to the properties of the diffusion process (1.2). Denoting the distribution of $z_t$ with $\rho_t$, the following Fokker–Planck equation describes the evolution of the above dynamics [Ris96]
$$\frac{\partial \rho_t(x)}{\partial t} = \nabla \cdot \big( \nabla f(x) \rho_t(x) \big) + \Delta \rho_t(x) = \nabla \cdot \Big( \rho_t(x) \nabla \log \frac{\rho_t(x)}{\nu^*(x)} \Big). \tag{1.3}$$
Convergence to the equilibrium of the above equation has been studied extensively under various assumptions and distance measures. Defining the Chi-square divergence and the KL divergence between two probability distributions $\rho$ and $\nu$ on $\mathbb{R}^d$ as
$$\chi^2(\rho \| \nu) = \int \Big( \frac{\rho(x)}{\nu(x)} \Big)^2 \nu(x)\, dx - 1 \quad \text{and} \quad \mathrm{KL}(\rho \| \nu) = \int \log \Big( \frac{\rho(x)}{\nu(x)} \Big) \rho(x)\, dx, \tag{1.4}$$
we say that $\nu$ satisfies the Poincaré inequality (PI) if the following holds for every probability density $\rho$,
$$\chi^2(\rho \| \nu) \le C_P \int_{\mathbb{R}^d} \Big\| \nabla \frac{\rho(x)}{\nu(x)} \Big\|^2 \nu(x)\, dx, \tag{PI}$$
where $C_P$ is termed the Poincaré constant. If $\nu^*$ satisfies (PI), it is straightforward to show that $d\chi^2(\rho_t \| \nu^*)/dt = -2 \int \| \nabla \frac{\rho_t(x)}{\nu^*(x)} \|^2 \nu^*(x)\, dx$; thus, PI implies the exponential convergence of the diffusion:
$$\frac{d}{dt} \chi^2(\rho_t \| \nu^*) \le -\frac{2}{C_P} \chi^2(\rho_t \| \nu^*) \implies \chi^2(\rho_t \| \nu^*) \le e^{-2t/C_P} \chi^2(\rho_0 \| \nu^*). \tag{1.5}$$
Under additional smoothness assumptions on the potential function, the convergence behavior of (1.3) to equilibrium can be translated to that of the LMC algorithm. In particular, the implications of LSI for the convergence of LMC are relatively well understood [Dal17b, DM17, VW19, EH20]; however, there is no convergence analysis of LMC in Chi-square divergence when the target satisfies PI. In this note, we aim to provide an analysis under a uniform warmness condition, which we verify for a simple Gaussian example. We hope that this serves as a step towards establishing a complete convergence analysis of LMC under Chi-square divergence.
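For concreteness, the recursion (1.1) is only a few lines of code. The following is a minimal sketch of ours (not part of the original note) for a user-supplied gradient `grad_f`; the Gaussian example, step size, and horizon below are illustrative placeholders.

```python
import numpy as np

def lmc(grad_f, x0, eta, n_steps, rng=None):
    """Unadjusted Langevin Monte Carlo, eq. (1.1):
    x_{k+1} = x_k - eta * grad_f(x_k) + sqrt(2 * eta) * W_k,
    with W_k a standard Gaussian vector. Returns the full trajectory."""
    rng = np.random.default_rng() if rng is None else rng
    xs = np.empty((n_steps + 1, x0.shape[0]))
    xs[0] = x0
    for k in range(n_steps):
        w = rng.standard_normal(x0.shape[0])
        xs[k + 1] = xs[k] - eta * grad_f(xs[k]) + np.sqrt(2.0 * eta) * w
    return xs

# Illustrative use: sample from nu* ∝ exp(-||x||^2 / 2), i.e. N(0, I_5).
traj = lmc(grad_f=lambda x: x, x0=np.zeros(5), eta=0.01, n_steps=10_000)
```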
Related work. Starting with the pioneering works [DM16, Dal17b, DM17], the non-asymptotic analysis of LMC has drawn a lot of interest [Dal17a, CB18, CCAY+18, DM19, DMM19, VW19, DK19, BDMS19, LWME19]. It is known that $\widetilde{O}(d\epsilon^{-1})$ steps of LMC yield an $\epsilon$-accurate sample in KL divergence for strongly convex and first-order smooth potentials [CB18, DMM19]. This is still the best-known rate in this setup, and it recovers the best-known rates in the total variation and 2-Wasserstein metrics [DM17, Dal17b, DM19]. Recently, these global curvature assumptions have been relaxed to growth conditions [CCAY+18, EMS18]. For example, [VW19] established convergence guarantees for LMC when sampling from target distributions that satisfy a log-Sobolev inequality (LSI) and have a smooth potential. This corresponds to potentials with quadratic tails [BÉ85, BG99] up to finite perturbations [HS87]; thus, this result is able to deal with non-convex potentials while achieving the same rate of $\widetilde{O}(d\epsilon^{-1})$ in KL divergence.
Notation. Throughout the note, $\log$ denotes the natural logarithm. For a real number $x \in \mathbb{R}$, we denote its absolute value with $|x|$. We denote the Euclidean norm of a vector $x \in \mathbb{R}^d$ with $\|x\|$. The gradient, divergence, and Laplacian of $f$ are denoted by $\nabla f(x)$, $\nabla \cdot f(x)$, and $\Delta f(x)$, respectively. We use $\mathbb{E}[x]$ to denote the expected value of a random variable or a vector $x$, where expectations are over all the randomness inside the brackets. For probability densities $p, q$ on $\mathbb{R}^d$, we use $\mathrm{KL}(p \| q)$ and $\chi^2(p \| q)$ to denote their KL divergence (or relative entropy) and Chi-square divergence, respectively, which are defined in (1.4). We denote the Borel $\sigma$-field of $\mathbb{R}^d$ with $\mathcal{B}(\mathbb{R}^d)$. The 2-Wasserstein distance and the total variation (TV) metric are defined respectively as
$$W_2(p, q) = \inf_\nu \Big( \int \|x - y\|^2\, d\nu(x, y) \Big)^{1/2} \quad \text{and} \quad \mathrm{TV}(p, q) = \sup_{A \in \mathcal{B}(\mathbb{R}^d)} \Big| \int_A \big( p(x) - q(x) \big)\, dx \Big|,$$
where in the first formula the infimum runs over the set of probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals have the densities $p$ and $q$. The multivariate Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ is denoted with $\mathcal{N}(\mu, \Sigma)$.
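Both divergences in (1.4) are straightforward to evaluate numerically in one dimension, which can be handy for checking relations such as $\mathrm{KL}(\rho \| \nu) \le \log(1 + \chi^2(\rho \| \nu)) \le \chi^2(\rho \| \nu)$ on toy densities. The sketch below is ours (the grid limits and the Gaussian test case are illustrative assumptions) and uses a plain Riemann sum.

```python
import numpy as np

def divergences(rho, nu, lo=-10.0, hi=10.0, n=200_001):
    """Chi-square and KL divergences of (1.4) for 1-d densities rho, nu,
    approximated by a Riemann sum on [lo, hi] (assumed to carry all mass)."""
    x, dx = np.linspace(lo, hi, n, retstep=True)
    r, v = rho(x), nu(x)
    chi_sq = np.sum(r**2 / v) * dx - 1.0   # integral of (rho/nu)^2 nu, minus 1
    kl = np.sum(r * np.log(r / v)) * dx    # integral of log(rho/nu) rho
    return chi_sq, kl

def gauss(m, s):
    return lambda x: np.exp(-((x - m) ** 2) / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

# For N(m, 1) vs N(0, 1): KL = m^2/2 and chi^2 = exp(m^2) - 1.
print(divergences(gauss(0.5, 1.0), gauss(0.0, 1.0)))  # approx (0.284, 0.125)
```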
2 Convergence Analysis under Uniform Warmness
Our first assumption is on the target: we study target distributions that satisfy the Poincaré inequality.
Assumption 1 (Poincaré inequality). The target $\nu^* \propto e^{-f}$ satisfies (PI) with Poincaré constant $C_P$.

PI can be seen as a linearization of LSI [OV00] and is sufficient to achieve exponential convergence of the diffusion in Chi-square divergence [CGL+20]. Moreover, distances in the TV and $W_2$ metrics can be bounded as
$$\mathrm{TV}(\rho, \nu^*) \le \sqrt{\mathrm{KL}(\rho \| \nu^*)/2} \le \sqrt{\chi^2(\rho \| \nu^*)/2} \quad \text{and} \quad W_2(\rho, \nu^*)^2 \le 2 C_P\, \chi^2(\rho \| \nu^*).$$
Above, the former inequality is due to [Tsy08, Lemma 2.7] together with the Csiszár–Kullback–Pinsker inequality, and the latter is shown to hold in, for example, [Liu20, Theorem 1.1]. Therefore, convergence in Chi-square divergence implies convergence in these measures of distance as well.

A representative analysis of LMC (see e.g. [VW19, EH20]) starts with the interpolation process
$$d\tilde{x}_{k,t} = -\nabla f(x_k)\, dt + \sqrt{2}\, dB_t \quad \text{with} \quad \tilde{x}_{k,0} = x_k. \tag{2.1}$$
We denote by $\tilde{\rho}_{k,t}$ and $\rho_k$ the distributions of $\tilde{x}_{k,t}$ and $x_k$, respectively, and we easily observe that $\tilde{\rho}_{k,\eta} = \rho_{k+1}$. The advantage of analyzing the interpolation process is its continuity in time, which allows one to work with the Fokker–Planck equation. Using this diffusion, in Lemma 1, we show that the time derivative of $\chi^2(\tilde{\rho}_{k,t} \| \nu^*)$ differs from the differential inequality (1.5) by an additive error term.

Lemma 1. If $\nu^*$ satisfies Assumption 1, then the following inequality governs the evolution of the Chi-square divergence of the interpolated process (2.1) from the target:
$$\frac{d}{dt} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) \le -\frac{1}{C_P} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) + 2\, \mathbb{E}\Big[ \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \big\|\nabla f(\tilde{x}_{k,t}) - \nabla f(x_k)\big\|^2 \Big]. \tag{2.2}$$
Proof. The proof follows lines similar to those leading to a differential inequality in KL divergence (see for example [VW19]). Let $\tilde{\rho}_{t|k}$ denote the distribution of $\tilde{x}_{k,t}$ conditioned on $x_k$, which satisfies
$$\frac{\partial \tilde{\rho}_{t|k}(x)}{\partial t} = \nabla \cdot \big( \nabla f(x_k) \tilde{\rho}_{t|k}(x) \big) + \Delta \tilde{\rho}_{t|k}(x).$$
Taking expectation with respect to $x_k$, we get
$$\begin{aligned} \frac{\partial \tilde{\rho}_{k,t}(x)}{\partial t} &= \nabla \cdot \Big( \int \nabla f(x_k)\, \tilde{\rho}_{t,k}(x, x_k)\, dx_k \Big) + \Delta \tilde{\rho}_{k,t}(x) \\ &= \nabla \cdot \Big( \tilde{\rho}_{k,t}(x) \int \nabla f(x_k)\, \tilde{\rho}_{k|t}(x_k \mid \tilde{x}_{k,t} = x)\, dx_k + \nabla \tilde{\rho}_{k,t}(x) \Big) \\ &= \nabla \cdot \Big( \tilde{\rho}_{k,t}(x) \big( \mathbb{E}[\nabla f(x_k) \mid \tilde{x}_{k,t} = x] + \nabla \log \tilde{\rho}_{k,t}(x) \big) \Big) \\ &= \nabla \cdot \Big( \tilde{\rho}_{k,t}(x) \Big( \mathbb{E}[\nabla f(x_k) - \nabla f(x) \mid \tilde{x}_{k,t} = x] + \nabla \log \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \Big) \Big). \end{aligned}$$
Using this, we compute the time derivative of the Chi-square divergence of $\tilde{\rho}_{k,t}$ from the target $\nu^*$:
$$\begin{aligned} \frac{d}{dt} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) &= 2 \int \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \times \frac{\partial \tilde{\rho}_{k,t}(x)}{\partial t}\, dx \\ &= 2 \int \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \times \nabla \cdot \Big( \tilde{\rho}_{k,t}(x) \Big( \mathbb{E}[\nabla f(x_k) - \nabla f(x) \mid \tilde{x}_{k,t} = x] + \nabla \log \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \Big) \Big)\, dx \\ &= -2 \int \Big\langle \nabla \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)},\; \mathbb{E}[\nabla f(x_k) - \nabla f(x) \mid \tilde{x}_{k,t} = x] + \nabla \log \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \Big\rangle \tilde{\rho}_{k,t}(x)\, dx \\ &= -2 \int \Big\| \nabla \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \Big\|^2 \nu^*(x)\, dx + 2 \int \Big\langle \nabla \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)},\; \mathbb{E}[\nabla f(x) - \nabla f(x_k) \mid \tilde{x}_{k,t} = x] \Big\rangle \tilde{\rho}_{k,t}(x)\, dx \\ &\le -\int \Big\| \nabla \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \Big\|^2 \nu^*(x)\, dx + 2 \int \mathbb{E}\Big[ \frac{\tilde{\rho}_{k,t}(x)}{\nu^*(x)} \|\nabla f(x) - \nabla f(x_k)\|^2 \,\Big|\, \tilde{x}_{k,t} = x \Big] \tilde{\rho}_{k,t}(x)\, dx \\ &\le -\frac{1}{C_P} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) + 2\, \mathbb{E}\Big[ \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \|\nabla f(\tilde{x}_{k,t}) - \nabla f(x_k)\|^2 \Big], \end{aligned}$$
where step 1 (the third equality) follows from the divergence theorem, step 2 from the weighted Young inequality $2\langle a, b \rangle \le \frac{1}{2}\|a\|^2 + 2\|b\|^2$ together with Jensen's inequality, and in step 3 we used Assumption 1. $\square$

The above differential inequality (2.2) is the discrete analogue of (1.5), and it will be used to establish a single-step bound that can be iterated to yield the final convergence result. For this, we control the additive error term in (2.2), namely $\|\nabla f(\tilde{x}_{k,t}) - \nabla f(x_k)\|^2$, under a smoothness condition on the potential function. The ratio of densities $\tilde{\rho}_{k,t}(\tilde{x}_{k,t})/\nu^*(\tilde{x}_{k,t})$ is harder to control, for which we proceed under an opaque uniform warmness condition, as stated below.

Assumption 2 (Uniform warmness of LMC iterates). The ratio of densities satisfies
$$\forall k \text{ and } t \in [0, \eta], \quad \mathbb{E}\Big[ \Big( \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \Big)^2 \Big] \le B_d \quad \text{where} \quad \tilde{x}_{k,t} \sim \tilde{\rho}_{k,t}, \tag{2.3}$$
and $B_d$ is a constant that does not depend on $k$.

This assumption states that the LMC iterates stay sufficiently close to the target. The constant $B_d$ may have an exponential dimension dependence, as seen in the next example. We emphasize that the above condition needs to be proven to establish a complete convergence analysis; however, in this note, we only verify it for the Gaussian case, where we know $\tilde{\rho}_{k,t}$ explicitly.
Simple example where Assumption 2 is satisfied. In this toy example, the above assumption is verified for the Gaussian sampling problem with LMC. Assume for simplicity that $f(x) = \frac{1}{2}\|x\|^2$, and $x_0 \sim \mathcal{N}(0, \sigma_0^2 I)$. One can easily verify that
$$\tilde{\rho}_{k,t} = \mathcal{N}(0, \sigma_{k,t}^2 I) \quad \text{where} \quad \sigma_{k,t}^2 := (1 - t)^2 \sigma_k^2 + 2t \quad \text{and} \quad \sigma_k^2 := (1 - \eta)^{2k} \sigma_0^2 + \frac{1 - (1 - \eta)^{2k}}{1 - \eta/2}.$$
Let, for simplicity, $\sigma_0^2 = 0.5$. A direct computation gives
$$\mathbb{E}\Big[ \Big( \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \Big)^2 \Big] = \Big( \frac{3}{\sigma_{k,t}^2} - 2 \Big)^{-d/2} \sigma_{k,t}^{-3d} \quad \text{whenever} \quad \sigma_{k,t}^2 < 3/2.$$
For the above choice of $\sigma_0^2$, and a sufficiently small step size $\eta$, one can verify that $\sigma_k^2$ is monotonically increasing and converges to $1/(1 - \eta/2)$. Moreover, $\sigma_{k,t}^2$ stays in an interval of the form $[1/2,\, 3/2 - c]$ for a constant $c > 0$; plugging these bounds into the formula above yields $B_d = O(2^d)$. A similar analysis can be carried out for the non-isotropic Gaussian case.
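The recursions for $\sigma_k^2$ and $\sigma_{k,t}^2$ and the resulting warmness constant are easy to check numerically. The sketch below is ours (the values of $\eta$, $d$, and the horizon are illustrative); it tracks $(3/\sigma_{k,t}^2 - 2)^{-d/2} \sigma_{k,t}^{-3d}$ along the iterations and returns a valid $B_d$ for the run.

```python
import numpy as np

def warmness_constant(eta=0.05, sigma0_sq=0.5, d=10, n_steps=500):
    """For f(x) = ||x||^2 / 2, track sigma_{k,t}^2 = (1-t)^2 sigma_k^2 + 2t
    over t in [0, eta] and bound E[(rho_{k,t}/nu*)^2] in closed form."""
    t_grid = np.linspace(0.0, eta, 50)
    sigma_sq, worst = sigma0_sq, 0.0
    for _ in range(n_steps):
        s_sq = (1.0 - t_grid) ** 2 * sigma_sq + 2.0 * t_grid
        assert np.all(s_sq < 1.5), "closed form requires sigma_{k,t}^2 < 3/2"
        ratio = (3.0 / s_sq - 2.0) ** (-d / 2) * s_sq ** (-1.5 * d)
        worst = max(worst, ratio.max())
        sigma_sq = (1.0 - eta) ** 2 * sigma_sq + 2.0 * eta  # one LMC step
    return worst

print(warmness_constant())  # finite, but grows exponentially with d
```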
Alternatively, one could assume that the LMC iterates stay uniformly warm in the sense that $\sup_x \rho_k(x)/\nu^*(x) < \infty$ (see [VW19, page 11] for a comment on a similar warmness). In this case, the proof is simpler and does not require dissipativity (Assumption 4); however, unlike the previous condition, this notion of warmness cannot be verified for the Gaussian example.

Next, we assume that the potential is first-order smooth as follows.

Assumption 3 (Smoothness). The gradient of the potential $f$ satisfies
$$\forall x, y, \quad \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|.$$

We will need to bound the moments of the LMC iterates, for which we use the following assumption.
Assumption 4 (2-dissipativity). For some constants $a > 0$ and $b \ge 0$, we have
$$\forall x, \quad \langle \nabla f(x), x \rangle \ge a \|x\|^2 - b.$$

The above condition implies quadratic growth of $f$; thus, we note that LSI may be more appropriate in this setup. Under the above assumption, for a sufficiently small step size, it is known that the LMC iterates have finite moments of all orders. Specifically, for $\eta \le \psi$ (for explicit values of $\psi$ and $\alpha$, refer to [EMS18]), we have
$$\mathbb{E}\big[ \|x_k\|^4 \big] \le \alpha \big( 1 + \mathbb{E}\big[ \|x_0\|^4 \big] \big).$$
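This fourth-moment bound can be sanity-checked by simulation. Below is a sketch of ours with a non-convex double-well potential $f(x) = \frac{1}{4}\|x\|^4 - \frac{1}{2}\|x\|^2$, which satisfies Assumption 4 since $\langle \nabla f(x), x \rangle = \|x\|^4 - \|x\|^2 \ge \|x\|^2 - 1$ (i.e. $a = b = 1$); the dimension, step size, and chain count are illustrative choices, not the constants $\psi, \alpha$ of [EMS18].

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, n_steps, n_chains = 5, 0.01, 2_000, 200
x = np.zeros((n_chains, d))
fourth_moments = []
for _ in range(n_steps):
    # grad f(x) = (||x||^2 - 1) x for the double-well potential above
    gx = (np.sum(x * x, axis=1, keepdims=True) - 1.0) * x
    x = x - eta * gx + np.sqrt(2.0 * eta) * rng.standard_normal((n_chains, d))
    fourth_moments.append(np.mean(np.sum(x * x, axis=1) ** 2))
print(max(fourth_moments))  # E[||x_k||^4] stays bounded along the run
```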
Lemma 2. If Assumptions 1, 2, 3 and 4 hold and $\eta \le \psi$, then the following inequality is satisfied between consecutive iterates of the LMC algorithm:
$$\chi^2(\rho_{k+1} \| \nu^*) \le e^{-\eta/C_P} \chi^2(\rho_k \| \nu^*) + 2 \gamma L^2 \eta^2,$$
where
$$\gamma^2 \triangleq B_d \times \Big( 64 \psi^2 \big( \alpha L^4\, \mathbb{E}\big[\|x_0\|^4\big] + \alpha L^4 + \|\nabla f(0)\|^4 \big) + 32\, d(d + 2) \Big).$$
Proof. We use Lemma 1 and bound the additive error term. We start by using Assumptions 2 and 3 together with Hölder's inequality to get
$$\begin{aligned} \mathbb{E}\Big[ \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \|\nabla f(\tilde{x}_{k,t}) - \nabla f(x_k)\|^2 \Big] &\le \mathbb{E}\Big[ \Big( \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \Big)^2 \Big]^{1/2} \mathbb{E}\big[ \|\nabla f(\tilde{x}_{k,t}) - \nabla f(x_k)\|^4 \big]^{1/2} \\ &\le L^2 \sqrt{B_d}\, \mathbb{E}\big[ \|\tilde{x}_{k,t} - x_k\|^4 \big]^{1/2} = L^2 \sqrt{B_d}\, \mathbb{E}\big[ \|{-\nabla f(x_k)\, t} + \sqrt{2} B_t\|^4 \big]^{1/2}. \end{aligned}$$
Next, we use the smoothness assumption to derive a growth condition:
$$\|\nabla f(x)\|^4 \le 8 \|\nabla f(x) - \nabla f(0)\|^4 + 8 \|\nabla f(0)\|^4 \le 8 L^4 \|x\|^4 + 8 \|\nabla f(0)\|^4.$$
We use this growth bound and Assumption 4 to bound $\mathbb{E}[\|{-\nabla f(x_k)\, t} + \sqrt{2} B_t\|^4]$. We write
$$\begin{aligned} \mathbb{E}\big[ \|{-\nabla f(x_k)\, t} + \sqrt{2} B_t\|^4 \big] &\le 8 t^4\, \mathbb{E}\big[ \|\nabla f(x_k)\|^4 \big] + 32\, \mathbb{E}\big[ \|B_t\|^4 \big] \\ &\le 64 t^4 \big( L^4\, \mathbb{E}\big[ \|x_k\|^4 \big] + \|\nabla f(0)\|^4 \big) + 32\, t^2 d(d + 2) \\ &\le 64 t^4 \big( \alpha L^4\, \mathbb{E}\big[ \|x_0\|^4 \big] + \alpha L^4 + \|\nabla f(0)\|^4 \big) + 32\, t^2 d(d + 2) \\ &\le \eta^2 \Big( 64 \psi^2 \big( \alpha L^4\, \mathbb{E}\big[ \|x_0\|^4 \big] + \alpha L^4 + \|\nabla f(0)\|^4 \big) + 32\, d(d + 2) \Big) = \eta^2 \gamma^2 / B_d, \end{aligned}$$
where for the last step we used $t \le \eta \le \psi$. Putting all of these together, we get
$$\mathbb{E}\Big[ \frac{\tilde{\rho}_{k,t}(\tilde{x}_{k,t})}{\nu^*(\tilde{x}_{k,t})} \|\nabla f(\tilde{x}_{k,t}) - \nabla f(x_k)\|^2 \Big] \le \gamma L^2 \eta,$$
and consequently, by Lemma 1,
$$\frac{d}{dt} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) \le -\frac{1}{C_P} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) + 2 \gamma L^2 \eta.$$
By rearranging and multiplying with $\exp(t/C_P)$, and using the fact that $t \le \eta$, we obtain
$$\frac{d}{dt} \Big( e^{t/C_P} \chi^2(\tilde{\rho}_{k,t} \| \nu^*) \Big) \le e^{\eta/C_P}\, 2 \eta L^2 \gamma.$$
Integrating both sides over $t \in [0, \eta]$ and rearranging yields the desired inequality. $\square$

By iterating the bound in the previous lemma, we establish the following convergence guarantee.
Proposition 3.
If Assumptions 1, 2, 3 and 4 hold, then for a small $\epsilon$ satisfying
$$\epsilon \le 8 C_P \gamma L^2 (C_P \wedge \psi),$$
where $\gamma$ is defined in Lemma 2, and for some $\Delta$ upper bounding the error at initialization, i.e. $\chi^2(\rho_0 \| \nu^*) \le \Delta$, if we choose the step size as
$$\eta = \frac{\epsilon}{8 C_P \gamma L^2},$$
then LMC reaches $\epsilon$ accuracy (i.e. $\chi^2(\rho_N \| \nu^*) \le \epsilon$) after $N$ steps for any $N$ satisfying
$$N \ge 16 (C_P L)^2 \times \frac{\gamma}{\epsilon} \times \log\Big( \frac{2\Delta}{\epsilon} \Big).$$
Proof. For this choice of $\eta$ and the bound on $\epsilon$, we have $\eta \le \psi \wedge C_P$. This implies that we can use Lemma 2 and that
$$e^{-\eta/C_P} \le 1 - \frac{\eta}{2 C_P}.$$
Combining the above bound with Lemma 2, we easily get
$$\chi^2(\rho_k \| \nu^*) \le \Big( 1 - \frac{\eta}{2 C_P} \Big) \chi^2(\rho_{k-1} \| \nu^*) + 2 \gamma L^2 \eta^2. \tag{2.4}$$
We state the following elementary lemma.
Lemma 4. For a real sequence $\{\theta_k\}_{k \ge 0}$, if we have $\theta_k \le (1 - a) \theta_{k-1} + b$ for some $a \in (0, 1)$ and $b \ge 0$, then
$$\theta_k \le e^{-ak} \theta_0 + b/a.$$
Proof of Lemma 4. Recursion on $\theta_k \le (1 - a) \theta_{k-1} + b$ yields
$$\theta_k \le (1 - a)^k \theta_0 + b \big( 1 + (1 - a) + (1 - a)^2 + \cdots + (1 - a)^{k-1} \big) \le (1 - a)^k \theta_0 + \frac{b}{a}.$$
Using the fact that $1 - a \le e^{-a}$ completes the proof. $\square$

Iterating (2.4) with the help of Lemma 4, we obtain the following bound:
$$\chi^2(\rho_N \| \nu^*) \le e^{-\frac{\eta N}{2 C_P}} \Delta + 4 C_P \gamma L^2 \eta.$$
The conditions on $\eta$ and $N$ imply that each term on the right-hand side is bounded by $\epsilon/2$, which completes the proof. $\square$
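Proposition 3 amounts to an explicit recipe for choosing the step size and iteration count from the problem constants. The helper below is a sketch of ours that simply evaluates those formulas; all inputs, in particular $\gamma$ (which hides the warmness constant $B_d$), are user-supplied assumptions.

```python
import math

def lmc_schedule(eps, C_P, L, gamma, Delta):
    """Step size and iteration count from Proposition 3:
    eta = eps / (8 C_P gamma L^2) and
    N >= 16 (C_P L)^2 (gamma / eps) log(2 Delta / eps)."""
    eta = eps / (8.0 * C_P * gamma * L**2)
    N = math.ceil(16.0 * (C_P * L) ** 2 * (gamma / eps) * math.log(2.0 * Delta / eps))
    return eta, N

# Illustrative constants only.
print(lmc_schedule(eps=0.1, C_P=1.0, L=1.0, gamma=10.0, Delta=100.0))
```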
3 Conclusion

In this note, we provided a simple analysis of the convergence of the LMC algorithm in Chi-square divergence under the condition that the LMC iterates stay uniformly warm across all iterations (Assumption 2). This condition needs to be verified in the general case to establish a complete convergence analysis; however, we only verified it for a simple Gaussian example.

Acknowledgements

The authors would like to thank Andre Wibisono for pointing out an error in the proof of Lemma 1 in an early version of this note.
References

[BDMS19] Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis, The tamed unadjusted Langevin algorithm, Stochastic Processes and their Applications 129 (2019), no. 10, 3638–3663.

[BÉ85] D. Bakry and M. Émery, Diffusions hypercontractives, Séminaire de Probabilités XIX 1983/84, Springer, Berlin, Heidelberg, 1985, pp. 177–206.

[BG99] Sergej G. Bobkov and Friedrich Götze, Exponential integrability and transportation cost related to logarithmic Sobolev inequalities, Journal of Functional Analysis 163 (1999), no. 1, 1–28.

[CB18] Xiang Cheng and Peter L. Bartlett, Convergence of Langevin MCMC in KL-divergence, PMLR 83 (2018), 186–211.

[CCAY+18] Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan, Sharp convergence rates for Langevin dynamics in the nonconvex setting, arXiv preprint arXiv:1805.01648 (2018).

[CGL+20] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, Philippe Rigollet, and Austin Stromme, Exponential ergodicity of mirror-Langevin diffusions, arXiv preprint arXiv:2005.09669 (2020).

[Dal17a] Arnak Dalalyan, Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, Proceedings of the 2017 Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 65, PMLR, 2017, pp. 678–689.

[Dal17b] Arnak S. Dalalyan, Theoretical guarantees for approximate sampling from smooth and log-concave densities, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (2017), no. 3, 651–676.

[DK19] Arnak S. Dalalyan and Avetik Karagulyan, User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, Stochastic Processes and their Applications 129 (2019), no. 12, 5278–5311.

[DM16] Alain Durmus and Éric Moulines, Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm, arXiv preprint arXiv:1605.01559 (2016).

[DM17] Alain Durmus and Éric Moulines, Nonasymptotic convergence analysis for the unadjusted Langevin algorithm, The Annals of Applied Probability 27 (2017), no. 3, 1551–1587.

[DM19] Alain Durmus and Éric Moulines, High-dimensional Bayesian inference via the unadjusted Langevin algorithm, Bernoulli 25 (2019), no. 4A, 2854–2882.

[DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow, Analysis of Langevin Monte Carlo via convex optimization, Journal of Machine Learning Research 20 (2019), no. 73, 1–46.

[EH20] Murat A. Erdogdu and Rasa Hosseinzadeh, On the convergence of Langevin Monte Carlo: The interplay between tail growth and smoothness, arXiv preprint arXiv:2005.13097 (2020).

[EMS18] Murat A. Erdogdu, Lester Mackey, and Ohad Shamir, Global non-convex optimization with discretized diffusions, Advances in Neural Information Processing Systems, 2018, pp. 9671–9680.

[HS87] Richard Holley and Daniel Stroock, Logarithmic Sobolev inequalities and stochastic Ising models, Journal of Statistical Physics 46 (1987), no. 5, 1159–1194.

[Liu20] Yuan Liu, The Poincaré inequality and quadratic transportation-variance inequalities, Electronic Journal of Probability 25 (2020), 16 pp.

[LWME19] Xuechen Li, Yi Wu, Lester Mackey, and Murat A. Erdogdu, Stochastic Runge-Kutta accelerates Langevin Monte Carlo and beyond, Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 7748–7760.

[OV00] Felix Otto and Cédric Villani, Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality, Journal of Functional Analysis 173 (2000), no. 2, 361–400.

[Ris96] Hannes Risken, Fokker-Planck equation, The Fokker-Planck Equation, Springer, 1996, pp. 63–95.

[Tsy08] Alexandre B. Tsybakov, Introduction to nonparametric estimation, Springer Science & Business Media, 2008.

[VW19] Santosh Vempala and Andre Wibisono, Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices, Advances in Neural Information Processing Systems, 2019, pp. 8092–8104.