The Langevin Monte Carlo algorithm in the non-smooth log-concave case
Joseph Lehec

January 27, 2021
Abstract
We prove non-asymptotic polynomial bounds on the convergence of the Langevin Monte Carlo algorithm in the case where the potential is a convex function which is globally Lipschitz on its domain, typically the maximum of a finite number of affine functions on an arbitrary convex set. In particular the potential is not assumed to be gradient Lipschitz, in contrast with most (if not all) existing works on the topic.
Setting.
Sampling from a high-dimensional log-concave probability measure is a problem dating back to the early nineties and the seminal work of Dyer, Frieze and Kannan [11], and which has many applications to various fields such as Bayesian statistics, convex optimization and statistical inference. This problem is always addressed via Markov Chain Monte Carlo methods, but there is a large variety of those: Metropolis-Hastings type random walks (ball walk), Glauber like dynamics (hit and run) or Hamiltonian Monte Carlo. In this article, we will consider the so-called Langevin algorithm, which is the natural discretization of the following continuous time diffusion. Let $\mu$ be a probability measure on $\mathbb R^n$ and let $\varphi$ be its potential, namely $\mu$ has density $\mathrm e^{-\varphi}$ with respect to the Lebesgue measure. The Langevin diffusion associated to $\mu$ is the solution $(X_t)$ of the following stochastic differential equation
$$dX_t = dB_t - \tfrac12 \nabla\varphi(X_t)\, dt, \qquad (1)$$
where $(B_t)$ is a standard $n$-dimensional Brownian motion. The Langevin algorithm is the Euler scheme associated to this diffusion: given a time step parameter $\eta$ we let $(\xi_k)_{k\ge 1}$ be a sequence of i.i.d. centered Gaussian vectors with covariance $\eta\,\mathrm{Id}$ and set
$$x_{k+1} = x_k + \xi_{k+1} - \tfrac\eta2 \nabla\varphi(x_k). \qquad (2)$$
We shall focus on the log-concave case, namely the case where the potential $\varphi$ is convex. One originality of this work is that we will consider the constrained case, allowing the measure $\mu$ to be supported on a set $K$ different from $\mathbb R^n$. In other words the potential $\varphi$ is allowed to take the value $+\infty$ outside some set $K$. Notice that the log-concavity assumption implies that $K$ is convex. In the constrained case the Langevin diffusion (1) becomes
$$dX_t = dB_t - \tfrac12 \nabla\varphi(X_t)\, dt - d\Phi_t, \qquad (3)$$
where $(\Phi_t)$ is a process that repels $(X_t)$ inward when it reaches the boundary of $K$, see the next section for a precise definition. The discretization then becomes
$$x_{k+1} = P_K\big( x_k + \xi_{k+1} - \tfrac\eta2 \nabla\varphi(x_k) \big), \qquad (4)$$
where $P_K$ is the projection onto $K$: for $x\in\mathbb R^n$ the point $P_K(x)$ is the closest point to $x$ in $K$. This is the algorithm we will study throughout the article. It was first introduced in our joint paper with Bubeck and Eldan [5] and to the best of our knowledge it has not been investigated since.

The second originality of this work is that we do not assume the potential $\varphi$ to be smooth. More precisely we will assume that the gradient of $\varphi$ (or rather its subdifferential) is uniformly bounded on $K$, but we do not assume it to be Lipschitz or even continuous. Let us point out that this is by no means an exotic situation, the reader could think for instance of $\varphi$ being the maximum of a finite number of affine functions on $K$. We do not make any assumption whatsoever on the convex set $K$.

The drawback of this very generic situation and of our approach is that we are only able to get convergence estimates in Wasserstein distance.
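For concreteness, here is a minimal sketch of the update (4) in Python, using the normalization of (1)-(2) as reconstructed above (Gaussian increments of covariance $\eta\,\mathrm{Id}$ and drift $\frac\eta2\nabla\varphi$). The max-of-affine potential, the subgradient oracle, the ball-shaped $K$ and all parameter values are hypothetical choices made for illustration; this is not code from the paper.

```python
import numpy as np

def subgrad_max_affine(x, A, b):
    """A subgradient of phi(x) = max_i (<A[i], x> + b[i]):
    the slope of any affine piece achieving the maximum."""
    i = np.argmax(A @ x + b)
    return A[i]

def project_ball(x, radius):
    """Euclidean projection onto the centered ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_lmc(x0, A, b, radius, eta, n_steps, rng):
    """Projected Langevin Monte Carlo, update (4):
    x_{k+1} = P_K(x_k + xi_{k+1} - (eta/2) g(x_k)),
    with xi_{k+1} ~ N(0, eta * Id) and g a subgradient of phi."""
    x = x0.copy()
    for _ in range(n_steps):
        xi = rng.normal(scale=np.sqrt(eta), size=x.shape)
        x = project_ball(x + xi - 0.5 * eta * subgrad_max_affine(x, A, b),
                         radius)
    return x

rng = np.random.default_rng(0)
n, m = 10, 5                      # dimension, number of affine pieces
A = rng.normal(size=(m, n))       # hypothetical affine pieces
b = rng.normal(size=m)
sample = projected_lmc(np.zeros(n), A, b, radius=2.0, eta=1e-3,
                       n_steps=20_000, rng=rng)
```

Note that the subgradient oracle is exactly the kind of non-smooth, discontinuous gradient the analysis below allows.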
Recall that given $p\ge 1$, the Wasserstein distance $W_p$ between two probability measures $\mu$ and $\nu$ is defined as
$$W_p^p(\mu,\nu) = \inf_{X\sim\mu,\ Y\sim\nu} \big\{ \mathbb E[ |X-Y|^p ] \big\}.$$
By a slight abuse of notation, if $X, Y$ are random vectors we also write $W_p(X,Y)$ for the Wasserstein distance between the law of $X$ and that of $Y$. In the sequel we shall only consider the distances $W_1$ and $W_2$. Note that by Jensen's inequality we have $W_1 \le W_2$.

Main results.
Our main result is the following bound between the Langevin algorithm (4) after $k$ steps and the corresponding point $X_{k\eta}$ of the true Langevin diffusion (3).

Theorem 1.
Assume that $\mu$ is log-concave, with globally Lipschitz potential $\varphi$ on its support $K$, and let $L$ be the Lipschitz constant. Assume that the time step $\eta$ satisfies $\eta \le n L^{-2}$ and suppose that the Langevin algorithm and diffusion are initiated at the same point $x_0$. Then for every integer $k$ we have
$$\frac1n\, W_2( X_{k\eta}, x_k)^2 \le A\, k\, \eta^{3/2}, \qquad (5)$$
where
$$A = \frac{4(1+\sigma)( n + 2\log k)^{1/2}}{r_0} + 32\, L\, n^{-1/2}, \qquad (6)$$
and
$$r_0 = d(x_0, \partial K) \quad\text{and}\quad \sigma = \frac1n\big( \varphi(x_0) - \inf_{y\in K}\{\varphi(y)\}\big). \qquad (7)$$

Remark. The transport cost $W_2^2$ behaves additively when taking tensor products, so the Wasserstein distance between any two probability measures on $\mathbb R^n$ is typically of order $\sqrt n$. Therefore $n^{-1} W_2^2$ is of order 1, which explains why we wrote the theorem this way. The reader should thus have in mind that the theorem provides some non trivial information as soon as the right-hand side of (5) is smaller than some small constant $\varepsilon$.

The result depends on the starting point via the parameters $r_0$ and $\sigma$. In order to get a meaningful bound the algorithm should not be initiated too close to the boundary of $K$ or at a point where the potential $\varphi$ is too large. Let us also point out that the theorem is also valid when there is no support constraint, namely when $K = \mathbb R^n$. One just replaces $r_0$ by $+\infty$, so that $A = O( L n^{-1/2})$ in this case. Let us comment also on the parameter $\sigma$. Obviously $\sigma = 0$ if the potential is minimal at $x_0$. Fradelizi's theorem [13, Theorem 4] asserts that if $\mu$ is log-concave on $\mathbb R^n$ with density $f$ and has its barycenter at $x_0$ then $\sup_{x\in\mathbb R^n}\{f(x)\} \le \mathrm e^n f(x_0)$. In terms of the parameter $\sigma$ this means that if $\mu$ has its barycenter at $x_0$ then $\sigma \le 1$. Since $\varphi$ is assumed to be Lipschitz with constant $L$, if $x_0$ is at distance $O(nL^{-1})$ from the barycenter then again $\sigma$ is of order 1. In general we shall think of $\sigma$ as a parameter of order 1. Also we are never going to take more than $\mathrm{poly}(n)$ steps so $\log k$ will always be negligible compared to $n$. Under the previous assumptions the parameter $A$ thus satisfies
$$A = O\left( \max\left( \frac{n^{1/2}}{r_0};\ L\, n^{-1/2}\right)\right).$$

In order to estimate the complexity of the Langevin algorithm, we need to combine the previous theorem with some estimate on the speed of convergence of the Langevin diffusion $(X_t)$ towards its equilibrium measure $\mu$. For this we shall use two functional inequalities, the Poincaré inequality and the logarithmic Sobolev inequality. Recall that the measure $\mu$ is said to satisfy the logarithmic Sobolev inequality if for every probability measure $\nu$ on $\mathbb R^n$ we have
$$D(\nu\,|\,\mu) \le \frac{C_{LS}}2\, I(\nu\,|\,\mu), \qquad (8)$$
where $D(\nu\,|\,\mu)$ and $I(\nu\,|\,\mu)$ denote respectively the relative entropy and the relative Fisher information of $\nu$ with respect to $\mu$:
$$D(\nu\,|\,\mu) = \int_{\mathbb R^n} \log\left( \frac{d\nu}{d\mu}\right) d\nu \quad\text{and}\quad I(\nu\,|\,\mu) = \int_{\mathbb R^n} \left| \nabla\log\left( \frac{d\nu}{d\mu}\right)\right|^2 d\nu.$$
The smallest constant $C_{LS}$ for which (8) holds true is called the log-Sobolev constant of $\mu$. The factor $1/2$ is just a matter of convention, with this normalization the log-Sobolev constant of the standard Gaussian measure is 1, in any dimension. It is well-known that the log-Sobolev inequality is stronger than the Poincaré inequality. More precisely, letting $C_P$ be the best constant in the Poincaré inequality
$$\mathrm{var}_\mu(f) \le C_P \int_{\mathbb R^n} |\nabla f|^2\, d\mu,$$
we have $C_P \le C_{LS}$.

Theorem 2. Again assume that $\mu$ is log-concave with globally Lipschitz potential on its support, with Lipschitz constant $L$. Let $x_0$ be a point in the support of $\mu$ and recall the definition (7) of $\sigma$ and $r_0$.
Assume in addition that the measure $\mu$ satisfies the log-Sobolev inequality with constant $C_{LS}$. Then after $k$ steps of the Langevin algorithm started at $x_0$ with time step parameter $\eta \le nL^{-2}$ we have
$$\frac1n\, W_2(x_k,\mu)^2 \le 2B\, \mathrm e^{-k\eta/2C_{LS}} + 2A\, k\eta^{3/2},$$
where again $A$ is given by (6) and
$$B = 4\, C_{LS}\left( 32 + \log\left( \frac{\max( C_{LS}, 1)\, n}{\min( r_0^2, 1)}\right) + \sigma + \frac Ln \right).$$
Remark. Note that we initiate the Langevin algorithm at a Dirac point mass, we do not need any warm start hypothesis. The starting point only plays a role through the parameters $r_0$ and $\sigma$.

Let us describe what the theorem gives in terms of the complexity of the Langevin algorithm. Say we want $\frac1n W_2(x_k,\mu)^2 \le \varepsilon$ for some small $\varepsilon$. The first term of the right-hand side is a little bit intricate, so let us assume that all parameters of the problem are at most polynomial in the dimension. Then that term is just $\mathrm{poly}(n)\exp( -k\eta/2C_{LS})$, which is negligible as soon as $k\eta = \Omega( C_{LS}\log n)$. Let us also assume that $\sigma = O(1)$ (see the discussion above). Then the theorem shows that choosing
$$\eta = \Theta^*\left( \frac{\varepsilon^2}{C_{LS}^2}\, \min\left( \frac{r_0^2}n,\ \frac n{L^2}\right)\right)$$
and running the algorithm for
$$k = \Theta^*\left( \frac{C_{LS}^3}{\varepsilon^2}\, \max\left( \frac n{r_0^2};\ \frac{L^2}n\right)\right)$$
steps produces a point $x_k$ satisfying $\frac1n W_2(x_k,\mu)^2 \le \varepsilon$. The notation $\Theta^*$ hides universal constants as well as possible $\mathrm{polylog}(n)$ dependencies, this is a common practice in this field. Note in particular that if we treat all parameters other than the dimension as constants, then we already get a non trivial information after a number of steps of the algorithm which is nearly linear in the dimension.

Of course not every log-concave measure satisfies log-Sobolev, simply because log-Sobolev implies sub-Gaussian tails. However there are a number of interesting cases in which the log-Sobolev inequality is known to hold true, which we list below.
1. If the potential $\varphi$ is $\alpha$-uniformly convex for some $\alpha > 0$, in the sense that $x \mapsto \varphi(x) - \frac\alpha2 |x|^2$ is convex, then $\mu$ satisfies log-Sobolev with constant $1/\alpha$. This is the celebrated Bakry-Émery criterion, see [1]. See also [3] for an alternate proof based on the Prékopa-Leindler inequality.

2. If $\mu$ is log-concave and is supported on a ball of radius $R$, then $\mu$ satisfies log-Sobolev with constant $R^2$, up to a universal constant. This follows trivially from E. Milman's result that, within the class of log-concave measures, Gaussian concentration and the log-Sobolev inequality are equivalent, see [18, Theorem 1.2], or [16, Theorem 2].

3. If $\mu$ is log-concave, supported on a ball of radius $R$ and isotropic, in the sense that its covariance matrix is the identity matrix, then Lee and Vempala [17] have shown that $\mu$ satisfies log-Sobolev with constant $R$, up to a universal factor. Note that the isotropy condition implies that $R \ge \sqrt n$, so this improves greatly upon the previous result in the isotropic case.

In the first case, notice that since the potential is at the same time globally Lipschitz and uniformly convex, the support of $\mu$ must be bounded. Actually if the potential is globally Lipschitz it cannot grow fast enough at infinity to insure log-Sobolev. So if we insist on assuming that the potential is Lipschitz and on using log-Sobolev then we have to assume that the support is bounded.

One way around this issue is to use the Poincaré inequality rather than log-Sobolev. Indeed every log-concave measure satisfies the Poincaré inequality. Kannan, Lovász and Simonovits [14] proved that the Poincaré constant of an isotropic log-concave measure on $\mathbb R^n$ is $O(n)$ and conjectured that it should actually be bounded. This conjecture, which was the major open problem in the field of asymptotic convex geometry, was recently nearly solved by Yuansi Chen [8], who proved an $n^{o(1)}$ bound for the Poincaré constant of an isotropic log-concave vector in dimension $n$. The result of Chen relies on a technique invented by Eldan [12] which was also used by Lee and Vempala [17] to prove a $O(\sqrt n)$ bound for the KLS constant, as well as the aforementioned log-Sobolev result. Recall that if $\nu$ is a probability measure, absolutely continuous with respect to $\mu$, the chi-square divergence of $\nu$ with respect to $\mu$ is defined as
$$\chi^2(\nu\,|\,\mu) = \int_{\mathbb R^n} \left( \frac{d\nu}{d\mu} - 1\right)^2 d\mu.$$
Our next theorem then states as follows.
Theorem 3.
Assume that $\mu$ is a log-concave probability measure with globally Lipschitz potential $\varphi$ on its domain, with constant $L$. Then after $k$ steps of the Langevin algorithm initiated at a random point $x_0$ taking values in the domain, and with time step parameter $\eta$ satisfying $\eta \le n L^{-2}$, we have
$$\frac1n\, W_1(x_k,\mu)^2 \le \frac{2C_P}n\, \chi^2(x_0\,|\,\mu)\, \mathrm e^{-k\eta/C_P} + 2A\, k\eta^{3/2},$$
where $C_P$ is the Poincaré constant of $\mu$ and where
$$A = 4\,( n + 2\log k)^{1/2}\ \mathbb E\left[ \frac{1+\sigma}{r_0}\right] + 32\, L\, n^{-1/2};$$
note that $r_0$ and $\sigma$ are random here.

The price we have to pay for using Poincaré rather than log-Sobolev is that we only get a convergence result for $W_1$ rather than $W_2$, and, more importantly, that the result depends on a warm start hypothesis. Namely the algorithm must be initiated at a random point $x_0$ whose chi-square divergence to $\mu$ is finite. In the unconstrained case, namely when $\mu$ is supported on the whole $\mathbb R^n$, a natural choice for a warm start is an appropriate Gaussian measure. One can indeed get the following estimate.

Lemma 4. Suppose $\mu$ is log-concave, supported on the whole $\mathbb R^n$, with globally Lipschitz potential, with Lipschitz constant $L$. Let $\gamma$ be the Gaussian measure centered at a point $x_0$ and with covariance $\frac n{L^2}\,\mathrm{Id}$. Then
$$\log \chi^2(\gamma\,|\,\mu) \le n(1+\sigma) + \frac n2\, \log\left( \frac{L^2 C_P}n\right),$$
where $C_P$ is the Poincaré constant of $\mu$.

In particular when $\sigma = O(1)$ and all other parameters of the problem are at most polynomial in $n$, we get $\log\chi^2(\gamma\,|\,\mu) \le O(n\log n)$. With this choice of a warm start, and observing that in the unconstrained case the parameter $A$ is just $O(L/\sqrt n)$, the previous theorem gives $\frac1n W_1(x_k,\mu)^2 \le \varepsilon$ after
$$k = \Theta^*\left( \frac{C_P^3\, L^2\, n^2}{\varepsilon^2}\right)$$
steps, with $\eta$ chosen appropriately. Also in the constrained case, one can get a non trivial complexity estimate from Theorem 3 by choosing the uniform measure on a ball contained in the support as a warm start. We leave this annoying computation to the reader.

Finally let us point out that it is also possible to obtain interesting bounds from our result when the potential is not globally Lipschitz, simply by restricting the measure to a large ball. For simplicity let us only spell out the argument when the measure $\mu$ is supported on the whole $\mathbb R^n$ and when we have a linear control on the gradient of the potential, but the method could give non trivial bounds in more general situations. So let $\mu$ be a log-concave measure supported on the whole $\mathbb R^n$, let $\varphi$ be its potential, and consider the Langevin algorithm associated to the measure $\mu$ conditioned on the ball of radius $R$:
$$x_{k+1} = P\big( x_k + \sqrt\eta\, \xi_{k+1} - \tfrac\eta2 \nabla\varphi(x_k)\big), \qquad (9)$$
where $(\xi_k)$ are now i.i.d. standard Gaussian vectors and $P$ is the orthogonal projection onto the ball of radius $R$:
$$P(x) = \begin{cases} x & \text{if } |x| \le R,\\ R\, x/|x| & \text{otherwise.}\end{cases}$$
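The truncated algorithm (9) only changes the projection step of the earlier sketch. Here is a minimal Python version under the same reconstructed normalization, with a hypothetical quadratic potential whose gradient grows (at most) linearly; none of these choices come from the paper:

```python
import numpy as np

def lmc_in_ball(x0, grad_phi, radius, eta, n_steps, rng):
    """Langevin algorithm (9) for mu conditioned on the ball of radius R:
    each iterate is projected back onto the ball."""
    x = x0.copy()
    for _ in range(n_steps):
        xi = rng.normal(scale=np.sqrt(eta), size=x.shape)
        x = x + xi - 0.5 * eta * grad_phi(x)
        norm = np.linalg.norm(x)
        if norm > radius:              # orthogonal projection onto the ball
            x *= radius / norm
    return x

# Hypothetical example: phi(x) = beta |x|^2 / 2, so |grad phi(x)| = beta |x|
# satisfies the linear growth condition |grad phi(x)| <= beta (|x| + 1).
beta = 0.5
grad_phi = lambda x: beta * x
rng = np.random.default_rng(1)
n = 20
x_final = lmc_in_ball(np.zeros(n), grad_phi, radius=3 * np.sqrt(n),
                      eta=1e-3, n_steps=10_000, rng=rng)
```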
In this special case, Theorem 2 yields the following complexity for the Langevin algorithm.

Theorem 5. Assume that $\mu$ is log-concave, supported on the whole $\mathbb R^n$, and that the gradient of the potential $\varphi$ grows at most linearly:
$$|\nabla\varphi(x)| \le \beta( |x| + 1), \quad \text{for all } x\in\mathbb R^n$$
and for some $\beta > 0$. Assume that the Langevin algorithm is initiated at $0$, that $\sigma = O(1)$, that $\int_{\mathbb R^n} |x|^2\, d\mu = O(n)$, and let $C_{LS}$ be the log-Sobolev constant of $\mu$, with the convention that it equals $+\infty$ if $\mu$ does not satisfy log-Sobolev. Then choosing $R = \Theta^*(\sqrt n)$,
$$\eta = \Theta^*\big( \varepsilon^2\, \max(\beta,1)^{-2}\, \min( C_{LS}, n)^{-2}\big)$$
and running the algorithm (9) initiated at $0$ for
$$k = \Theta^*\left( \frac{\min( C_{LS}, n)^3\, \max(\beta,1)^2}{\varepsilon^2}\right)$$
steps produces a point $x_k$ satisfying $\frac1n W_2(x_k,\mu)^2 \le \varepsilon$. Note that if $C_{LS}$ and $\beta$ are of constant order the complexity does not depend on the dimension.
Related works.

We end this introduction with a discussion of a short selection of related works. Let us first mention that as far as we know, the Langevin algorithm with support constraint was only investigated in our previous paper with Bubeck and Eldan [5]. In this paper the potential was assumed to be gradient Lipschitz. In all the works that we could find on the Langevin Monte Carlo algorithm the potential is always assumed to be smooth, most of the time gradient Lipschitz. This hypothesis is somewhat relaxed in the recent article [7], but the authors analyze the Langevin algorithm for a smoothed out approximation of $\mu$, and in any case they still require the gradient of the potential to be Hölder continuous. The present work appears to be the first where $\nabla\varphi$ is allowed to be discontinuous. More generally, it seems to be the case that in all the literature on MCMC algorithms for log-concave sampling, the potential is always assumed to be smooth.

Let us give the state of the art convergence bounds for the Langevin algorithm in the smooth, unconstrained case. The first quantitative result appears to be Dalalyan's article [9]. The result is in total variation distance rather than Wasserstein but as in the present work the strategy consists in writing
$$TV(x_k,\mu) \le TV(x_k, X_{k\eta}) + TV( X_{k\eta}, \mu) \qquad (10)$$
and estimating both terms separately. His assumption is that the potential $\varphi$ satisfies
$$\alpha\,\mathrm{Id} \le \nabla^2\varphi \le \beta\,\mathrm{Id}$$
pointwise on the whole $\mathbb R^n$, where $\alpha$ and $\beta$ are positive constants. Actually a closer look at his argument shows that he does not really use log-concavity. Indeed, his main contribution is a bound for the relative entropy of the Langevin algorithm at time $k$ with respect to the corresponding point in the Langevin diffusion. That part of the argument is a nice application of Girsanov's formula and does not use log-concavity at all, only the fact that $\nabla\varphi$ is Lipschitz is needed. Dalalyan only uses strict log-concavity to estimate how fast the diffusion $(X_t)$ converges to $\mu$. But that only requires Poincaré for an exponentially fast decay in chi-square divergence or log-Sobolev for a decay in relative entropy. Dalalyan's theorem can thus be rewritten as follows: if $d\mu = \mathrm e^{-\varphi}dx$ is supported on the whole $\mathbb R^n$, if $\nabla\varphi$ is Lipschitz with constant $\beta$ and if $\mu$ satisfies the log-Sobolev inequality with constant $C_{LS}$ then after $k$ steps of the Langevin algorithm with time step parameter $\eta$ we have
$$TV(x_k,\mu) \le D(x_0\,|\,\mu)^{1/2}\, \mathrm e^{-k\eta/2C_{LS}} + \beta\, n^{1/2}\, (1 + \mathbb E[\sigma])^{1/2}\, k^{1/2}\eta,$$
where again $\sigma = \frac1n( \varphi(x_0) - \min_{\mathbb R^n}\{\varphi\})$. The result depends on a warm start hypothesis, the algorithm must be initiated from a random point $x_0$ whose relative entropy to the target measure is finite. On the other hand, it is not hard to see that one can find a Gaussian measure whose relative entropy to $\mu$ is $O^*(n)$. As a result, it follows from the previous bound that if $\eta$ is chosen appropriately then one has $TV(x_k,\mu) \le \varepsilon$ after
$$k = \Theta^*\left( \frac{C_{LS}^2\, \beta^2\, n}{\varepsilon^2}\right)$$
steps. Durmus and Moulines [10] work in the Wasserstein distance instead, under the stronger assumption that the potential is uniformly convex: $\nabla^2\varphi \ge \alpha\,\mathrm{Id}$ for some positive $\alpha$. Also their approach is a bit different from that of Dalalyan: instead of bounding $W_2(x_k, X_{k\eta})$ and $W_2( X_{k\eta},\mu)$ separately they directly obtain a recursive inequality for $W_2(x_k,\mu)$. Their approach essentially yields the following result.

Theorem 6.
Suppose that $\alpha\,\mathrm{Id} \le \nabla^2\varphi \le \beta\,\mathrm{Id}$ pointwise on the whole $\mathbb R^n$ for some positive constants $\alpha,\beta$. Assume also that the time step parameter $\eta$ satisfies $\eta \le \min\big( \frac\alpha{\beta^2},\ \frac1\alpha\big)$. Then the Langevin algorithm after $k$ steps satisfies
$$\frac1n\, W_2(x_k,\mu)^2 \le \mathrm e^{-\alpha k\eta/2}\, \frac1n\, W_2(x_0,\mu)^2 + \frac{8\,\beta^2\,\eta}{\alpha^2}.$$

They do not quite obtain such a neat statement and we shall provide a proof of this result in the last section of the paper. This result implies that $\frac1n W_2(x_k,\mu)^2 \le \varepsilon$ after a number of steps $k = O^*\big( \frac{\beta^2}{\alpha^3\varepsilon}\big)$, with time step parameter of order $\alpha^2\varepsilon/\beta^2$. This should be compared to the complexity given by Theorem 5 in this case. Indeed, observe that the hypothesis $\alpha\,\mathrm{Id} \le \nabla^2\varphi \le \beta\,\mathrm{Id}$ implies that the log-Sobolev constant is $1/\alpha$ at most and that $\nabla\varphi$ grows linearly. Therefore Theorem 5 applies, and it gives the following complexity: $k = \Theta^*\big( \frac{\beta^2}{\alpha^3\varepsilon^2}\big)$. The dependence in $\varepsilon$ is thus worse ($\varepsilon^{-2}$ rather than $\varepsilon^{-1}$) but the dependence in the other parameters is the same, which is quite remarkable given the fact that Theorem 5 holds under considerably weaker assumptions.

Lastly, let us also mention Vempala and Wibisono's work [21] whose approach is similar in spirit to that of Durmus and Moulines but gives a result closer to Dalalyan's. They prove that if $\nabla\varphi$ is Lipschitz with constant $\beta$, if $\mu$ satisfies log-Sobolev with constant $C_{LS}$, and if the time step parameter satisfies $\eta \le 1/(4\, C_{LS}\, \beta^2)$ then after $k$ steps of the algorithm one has
$$D(x_k\,|\,\mu) \le \mathrm e^{-k\eta/C_{LS}}\, D(x_0\,|\,\mu) + 8\, n\beta^2 C_{LS}\, \eta. \qquad (11)$$
As in the result of Dalalyan the measure $\mu$ is not assumed to be log-concave, only log-Sobolev is required. Let us note that this result recovers Dalalyan's by Pinsker's inequality. Let us also remark that combining it with the transport inequality $W_2(x_k,\mu)^2 \le 2\, C_{LS}\, D(x_k\,|\,\mu)$, which is a consequence of log-Sobolev, one gets
$$\frac1n\, W_2(x_k,\mu)^2 \le \frac{2\,C_{LS}}n\, D(x_0\,|\,\mu)\, \mathrm e^{-k\eta/C_{LS}} + 16\,\beta^2\, C_{LS}^2\, \eta.$$
This almost recovers Theorem 6, under a weaker hypothesis, but with the caveat that this is from a warm start in the relative entropy sense.
Acknowledgement.
The author is grateful to Sébastien Bubeck and Ronen Eldan for a number of useful discussions related to this work.
Discretization of the Langevin diffusion
In this section we prove Theorem 1. This is the main contribution of this article. We begin this section by defining more precisely the Langevin diffusion with reflection at the boundary of $K$.

Recall that $d\mu = \mathrm e^{-\varphi}dx$ is a log-concave probability measure on $\mathbb R^n$. The potential $\varphi$ is thus a convex function allowed to take the value $+\infty$. The closure of its domain, which is also the support of the measure $\mu$, is denoted $K$. It is a closed convex set of $\mathbb R^n$ with non-empty interior.

Let $(B_t)$ be a Brownian motion. Given a starting point $x_0 \in K$ there is a unique couple $(X_t, \Phi_t)$ of processes satisfying

1. $X_0 = x_0$, $\Phi_0 = 0$;
2. $X_t \in K$ for all $t > 0$;
3. $dX_t = dB_t - \frac12\nabla\varphi(X_t)\, dt - d\Phi_t$ for all $t > 0$;
4. $(\Phi_t)$ has bounded variation and its Stieltjes derivative has the property that for every process $(Y_t)$ taking values in $K$ the measure $\langle X_t - Y_t, d\Phi_t\rangle$ is non-negative.

The last condition can be rewritten as follows: the process $(\Phi_t)$ has bounded variation, its derivative is a measure supported on the set of times $\{ t \ge 0 : X_t \in \partial K\}$ and for every such time $d\Phi_t$ is an outer normal to the boundary of $K$ at $X_t$. Note that we do not assume that $\varphi$ is everywhere differentiable in the interior of $K$, so the third condition should rather read:

3'. $dX_t = dB_t - \frac12 V_t\, dt - d\Phi_t$ where $V_t$ belongs to the subdifferential of $\varphi$ at $X_t$.

In the sequel, whenever we write $\nabla\varphi(x)$ we actually mean some vector belonging to $\partial\varphi(x)$. Since the only property of $\nabla\varphi(x)$ that we use is the inequality
$$\varphi(y) - \varphi(x) \ge \langle y - x, \nabla\varphi(x)\rangle,$$
this simplification does not affect the validity of the analysis. Such stochastic differential equations with reflection at the boundary of a convex set were first considered by Tanaka [20]. Since we do not assume $\nabla\varphi$ to be Lipschitz (not even continuous actually) the existence of a solution for this equation does not formally follow from Tanaka's theorem, but it was established by Cépa in [6].

The proof of Theorem 1 requires some preparation. We begin with a bound on the total variation of the reflection process $(\Phi_t)$. In the next lemma we let $\ell_t$ be the total variation of $(\Phi_t)$ on the interval $[0,t]$. The process $(\ell_t)$ is also called the local time of $(X_t)$ at the boundary of $K$. We need to show that $\ell_t = O(t)$. That lemma is essentially taken from our previous work with Bubeck and Eldan [5], except that we have simplified the proof and improved the result a bit.
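The reflected diffusion has no exact simulator in general. A common heuristic, in the spirit of the discretization (4), is to run the projected Euler scheme with a very fine internal time step; the following toy sketch is our own illustration of that heuristic, not a construction from the paper, and is only expected to approach the Skorokhod reflection as the internal step goes to zero.

```python
import numpy as np

def reflected_langevin_approx(x0, subgrad_phi, project_K, t_end, dt, rng):
    """Heuristic approximation of the reflected Langevin diffusion (3):
    fine-step Euler whose iterates are projected back onto K."""
    x = x0.copy()
    for _ in range(int(t_end / dt)):
        dB = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = project_K(x + dB - 0.5 * dt * subgrad_phi(x))
    return x
```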
Lemma 7. Assume that the Langevin diffusion $(X_t)$ is initiated at a point $x_0$ in the interior of $K$ and recall the definition (7) of $r_0$ and $\sigma$. Then for every $t > 0$ we have
$$\mathbb E[\ell_t^2]^{1/2} \le \frac{n(1+\sigma)\, t}{r_0}.$$

Proof. By Itô's formula we have
$$d|X_t - x_0|^2 = 2\,\langle X_t - x_0, dB_t\rangle - \langle X_t - x_0, \nabla\varphi(X_t)\rangle\, dt - 2\,\langle X_t - x_0, d\Phi_t\rangle + n\, dt.$$
Recall that $d\Phi_t = \nu_t\, d\ell_t$ where $\nu_t$ is an outer unit normal at $X_t$. By definition of $(\Phi_t)$, $r_0$ and $\ell_t$ we have
$$\langle X_t - x_0, d\Phi_t\rangle \ge \sup_{x\in K}\,\langle x - x_0, d\Phi_t\rangle \ge r_0\, d\ell_t.$$
Also by convexity of $\varphi$,
$$-\langle X_t - x_0, \nabla\varphi(X_t)\rangle \le \varphi(x_0) - \varphi(X_t) \le n\sigma.$$
We thus obtain
$$|X_t - x_0|^2 + 2\, r_0\, \ell_t \le n(1+\sigma)\, t + 2\int_0^t \langle X_s - x_0, dB_s\rangle. \qquad (12)$$
Taking expectation already gives a bound on the first moment of $\ell_t$. To get a bound on the second moment observe that (12) implies that
$$4\, r_0^2\, \mathbb E[\ell_t^2] \le 2\, n^2(1+\sigma)^2 t^2 + 8\, \mathbb E\left[ \left( \int_0^t \langle X_s - x_0, dB_s\rangle\right)^2\right].$$
By Itô's isometry and using (12) again, this time to bound $\mathbb E[|X_t - x_0|^2]$, we get
$$\mathbb E\left[ \left( \int_0^t \langle X_s - x_0, dB_s\rangle\right)^2\right] = \mathbb E\left[ \int_0^t |X_s - x_0|^2\, ds\right] \le \frac{n(1+\sigma)\, t^2}2.$$
Plugging this into the previous display yields the desired inequality.

We also need the following elementary bound on the maximum of Gaussian vectors. We provide a proof for completeness.
Lemma 8.
Let $G_1,\dots,G_k$ be standard Gaussian vectors on $\mathbb R^n$. Then
$$\mathbb E\left[ \max_{i\le k}\big\{ |G_i|^2\big\}\right] \le \mathrm e\,( n + 2\log k).$$

Proof.
Set $\chi_i = |G_i|^2$ for every $i$. The $p$-th moment of $\chi_i$ satisfies
$$\mathbb E[\chi_i^p] = 2^p\, \frac{\Gamma\big( \frac n2 + p\big)}{\Gamma\big( \frac n2\big)} \le \big( n + 2(p-1)\big)^p,$$
at least when $p$ is an integer. Therefore
$$\mathbb E\left[ \max_{i\le k}\{\chi_i\}\right] \le \left[ \sum_{i\le k} \mathbb E[\chi_i^p]\right]^{1/p} \le k^{1/p}\, \big( n + 2(p-1)\big).$$
Choosing $p$ to be the smallest integer larger than $\log k$ yields the result.
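As a quick numerical sanity check of Lemma 8 (our own snippet, not part of the paper), one can compare a Monte Carlo estimate of $\mathbb E[\max_{i\le k}|G_i|^2]$ with the bound $\mathrm e( n + 2\log k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 25, 100, 2000
# Monte Carlo estimate of E[max_{i<=k} |G_i|^2] for standard Gaussians in R^n
G = rng.normal(size=(trials, k, n))
estimate = np.mean(np.max(np.sum(G ** 2, axis=2), axis=1))
bound = np.e * (n + 2 * np.log(k))
print(estimate, bound)   # the estimate should stay below the bound
```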
We are now in a position to prove the main result.

Proof of Theorem 1. We first couple the diffusion $(X_t)$ and its discretization $(x_k)$ in the most natural way one could think of, by choosing the sequence $(\xi_k)$ as follows:
$$\xi_k = B_{k\eta} - B_{(k-1)\eta}, \quad k \ge 1. \qquad (13)$$
Observe that for any $x\in K$ and $y\in\mathbb R^n$ we have $|x - P_K(y)| \le |x - y|$. Therefore
$$|X_{(i+1)\eta} - x_{i+1}| = \Big| X_{(i+1)\eta} - P_K\Big( x_i + \xi_{i+1} - \frac\eta2\nabla\varphi(x_i)\Big)\Big| \le \Big| X_{(i+1)\eta} - x_i - \xi_{i+1} + \frac\eta2\nabla\varphi(x_i)\Big|.$$
Let $(\widetilde X_t)$ be the process defined by
$$\widetilde X_t = x_i + B_t - B_{i\eta} - \frac{t - i\eta}2\, \nabla\varphi(x_i)$$
for all $t$ between $i\eta$ and $(i+1)\eta$. Then $\widetilde X_{i\eta} = x_i$ and the previous display can be rewritten as
$$|X_{(i+1)\eta} - x_{i+1}| \le |X_{(i+1)\eta} - \widetilde X_{(i+1)\eta}|.$$
The process $(X_t - \widetilde X_t)$ is continuous with finite variation (the Brownian part cancels out). Therefore
$$d|X_t - \widetilde X_t|^2 = 2\,\langle X_t - \widetilde X_t, dX_t - d\widetilde X_t\rangle = -\langle X_t - \widetilde X_t, \nabla\varphi(X_t) - \nabla\varphi(x_i)\rangle\, dt - 2\,\langle X_t - \widetilde X_t, d\Phi_t\rangle.$$
Since $\varphi$ is convex we have $\langle X_t - x_i, \nabla\varphi(X_t) - \nabla\varphi(x_i)\rangle \ge 0$. Also, since $x_i \in K$ and by definition of the reflection $(\Phi_t)$ we have $\langle X_t - x_i, d\Phi_t\rangle \ge 0$. Plugging this back in the previous display we get
$$d|X_t - \widetilde X_t|^2 \le \big\langle \widetilde X_t - x_i,\ (\nabla\varphi(X_t) - \nabla\varphi(x_i))\, dt + 2\, d\Phi_t\big\rangle.$$
Now we replace $\widetilde X_t$ by its value, and we integrate between $i\eta$ and $(i+1)\eta$. We get
$$|X_{(i+1)\eta} - x_{i+1}|^2 \le |X_{i\eta} - x_i|^2 + \int_{i\eta}^{(i+1)\eta} \Big\langle B_t - B_{i\eta} - \frac{t - i\eta}2\,\nabla\varphi(x_i),\ (\nabla\varphi(X_t) - \nabla\varphi(x_i))\, dt + 2\, d\Phi_t\Big\rangle.$$
We take expectation, use the martingale property of the Brownian motion, and the hypothesis $|\nabla\varphi| \le L$. We obtain
$$\mathbb E\big[ |X_{(i+1)\eta} - x_{i+1}|^2\big] \le \mathbb E\big[ |X_{i\eta} - x_i|^2\big] + 2\,\mathbb E\big[ \langle \xi_{i+1}, \Phi_{(i+1)\eta} - \Phi_{i\eta}\rangle\big] + L\eta\, \mathbb E\big[ |\xi_{i+1}| + |\Phi_{(i+1)\eta} - \Phi_{i\eta}|\big] + \frac12\, L^2\eta^2, \qquad (14)$$
where $\xi_{i+1} = B_{(i+1)\eta} - B_{i\eta}$. Observe that $\mathbb E[|\xi_{i+1}|] \le \sqrt{n\eta}$. Also by Cauchy-Schwarz and Lemma 8,
$$\mathbb E\left[ \sum_{i=0}^{k-1} \langle \xi_{i+1}, \Phi_{(i+1)\eta} - \Phi_{i\eta}\rangle\right] \le \mathbb E\left[ \max_{i\le k}\{ |\xi_i|\}\, \ell_{k\eta}\right] \le \mathrm e^{1/2}\, ( n + 2\log k)^{1/2}\, \eta^{1/2}\, \mathbb E[\ell_{k\eta}^2]^{1/2}.$$
Summing (14) over $i$ thus yields
$$\mathbb E\big[ |X_{k\eta} - x_k|^2\big] \le 2\,\mathrm e^{1/2}( n + 2\log k)^{1/2}\, \eta^{1/2}\, \mathbb E[\ell_{k\eta}^2]^{1/2} + L\, n^{1/2}\, k\eta^{3/2} + L\eta\, \mathbb E[\ell_{k\eta}] + \frac12\, L^2 k\eta^2.$$
Now recall from Lemma 7 that
$$\mathbb E[\ell_{k\eta}^2]^{1/2} \le \frac{n(1+\sigma)\, k\eta}{r_0}.$$
Lastly, we use the assumption $\eta \le n L^{-2}$ to simplify the inequality a bit. We finally obtain
$$\mathbb E\big[ |X_{k\eta} - x_k|^2\big] \le 4\,( n + 2\log k)^{1/2}\, \frac{n(1+\sigma)}{r_0}\, k\eta^{3/2} + 32\, L\, n^{1/2}\, k\eta^{3/2},$$
which is the result.
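The coupling (13) is straightforward to emulate numerically. In the sketch below (our own illustration, not from the paper), the diffusion point $X_{k\eta}$ is replaced by a fine-step projected Euler proxy driven by the same Brownian increments, for a hypothetical non-smooth potential $\varphi(x) = L\max_i x_i$ on a ball; the printed quantity mimics the discretization error $|X_{k\eta} - x_k|^2$ controlled by Theorem 1.

```python
import numpy as np

def step(x, xi, h, L, radius):
    """One projected Euler step for phi(x) = L * max_i x_i on a ball."""
    g = np.zeros_like(x)
    g[np.argmax(x)] = L                    # a subgradient of phi
    y = x + xi - 0.5 * h * g
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

rng = np.random.default_rng(3)
n, L, radius, eta, k, refine = 10, 1.0, 1.0, 1e-2, 200, 50
x = np.zeros(n)   # algorithm iterate, step eta
X = np.zeros(n)   # proxy for the diffusion: same noise, step eta/refine
for _ in range(k):
    incs = rng.normal(scale=np.sqrt(eta / refine), size=(refine, n))
    x = step(x, incs.sum(axis=0), eta, L, radius)   # coupling (13)
    for dW in incs:                                 # fine-step proxy
        X = step(X, dW, eta / refine, L, radius)
print(np.sum((X - x) ** 2))
```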
In this section we prove Theorem 2. Observe first that the Wasserstein distance is indeed a distance, so it satisfies the triangle inequality and we have
$$\frac1n\, W_2(x_k,\mu)^2 \le \frac2n\, W_2(x_k, X_{k\eta})^2 + \frac2n\, W_2( X_{k\eta},\mu)^2.$$
The first term of the right-hand side is handled by Theorem 1, so we only need to bound the second term. This is the purpose of the following lemma. In this lemma $(P_t)$ stands for the semigroup of the Langevin diffusion (3). In other words $\nu P_t$ denotes the law of $X_t$ when $X_0$ has law $\nu$.

Lemma 9. Assume that $\mu$ is log-concave with globally Lipschitz potential on its support, with Lipschitz constant $L$. Assume also that $\mu$ satisfies log-Sobolev with constant $C_{LS}$. Then for every $x_0$ in the interior of the support of $\mu$ and every $t > 0$ we have
$$\frac1n\, W_2( \delta_{x_0} P_t, \mu)^2 \le 4\, C_{LS}\left( 32 + \log\left( \frac{\max( C_{LS}, 1)\, n}{\min( r_0^2, 1)}\right) + \sigma + \frac Ln\right) \mathrm e^{-t/2\,C_{LS}},$$
where again the parameters $\sigma$ and $r_0$ are defined by (7).

Proof. If $\mu$ satisfies the logarithmic Sobolev inequality then it satisfies the transport inequality
$$W_2(\nu,\mu)^2 \le 2\, C_{LS}\, D(\nu\,|\,\mu)$$
for every probability measure $\nu$. This is due to Otto and Villani [19], see also [2]. The log-Sobolev inequality also implies that the relative entropy decays exponentially fast along the semigroup $(P_t)$:
$$D(\nu P_t\,|\,\mu) \le \mathrm e^{-t/C_{LS}}\, D(\nu\,|\,\mu).$$
This is really folklore, one just needs to observe that the derivative of the entropy is the Fisher information, and combine log-Sobolev with a Grönwall type argument. Combining the two inequalities yields
$$W_2(\nu P_t, \mu)^2 \le 2\, C_{LS}\, \mathrm e^{-t/C_{LS}}\, D(\nu\,|\,\mu).$$
This cannot be applied directly to a Dirac point mass. However, observe that the convexity of $\varphi$ implies that the Wasserstein distance is non increasing along the semigroup: for any two probability measures $\nu_1, \nu_2$ and every time $t$ we have
$$W_2( \nu_1 P_t, \nu_2 P_t) \le W_2( \nu_1, \nu_2).$$
This is a well-known fact, which is easily seen using parallel coupling: start two diffusions, from $\nu_1$ and $\nu_2$ respectively, driven by the same Brownian motion, and use the convexity of $\varphi$ (and $K$). Combining this with the triangle inequality for $W_2$ we thus get
$$W_2( \delta_{x_0} P_t, \mu)^2 \le 2\, W_2( \delta_{x_0} P_t, \nu P_t)^2 + 2\, W_2( \nu P_t, \mu)^2 \le 2\, W_2( \delta_{x_0}, \nu)^2 + 4\, C_{LS}\, \mathrm e^{-t/C_{LS}}\, D(\nu\,|\,\mu). \qquad (15)$$
This is valid for every $\nu$ and it is natural to take $\nu$ to be the measure $\mu$ conditioned to the ball $B(x_0,\delta)$ for some small $\delta > 0$. Then $W_2( \delta_{x_0}, \nu) \le \delta$ and
$$D(\nu\,|\,\mu) = \log\left( \frac1{\mu( B(x_0,\delta))}\right).$$
If $\delta \le r_0$ then $B(x_0,\delta)$ is included in the support of $\mu$ and we have
$$\log\left( \frac1{\mu( B(x_0,\delta))}\right) \le \max_{B(x_0,\delta)}\{\varphi\} + n\log\left( \frac1\delta\right) + \log\left( \frac1{v_n}\right),$$
where $v_n$ is the Lebesgue measure of the unit ball in dimension $n$. Recall that $\log( 1/v_n) \le n\log n$. Also
$$\max_{B(x_0,\delta)}\{\varphi\} \le \varphi(x_0) + L\delta \le \min_K\{\varphi\} + n\sigma + L\delta.$$
Moreover
$$\min_K\{\varphi\} \le \int_{\mathbb R^n} \varphi\, d\mu = S(\mu),$$
where $S$ denotes the Shannon entropy. It is well-known that among measures of fixed covariance the Gaussian measure maximizes the Shannon entropy (this is just Jensen actually). Therefore
$$S(\mu) \le \frac n2\, \log( 2\pi\mathrm e) + \frac12\, \log\det( \mathrm{cov}(\mu)) \le \frac n2\, \log( 2\pi\mathrm e\, C_{LS}).$$
The last inequality is just a consequence of the fact that the log-Sobolev inequality implies Poincaré, which in turn implies a bound on the covariance matrix. Plugging everything back in (15) we get
$$\frac1n\, W_2( \delta_{x_0} P_t, \mu)^2 \le \frac{2\,\delta^2}n + 4\, C_{LS}\left( 32 + \log\left( \frac n{\delta^2}\right) + \sigma + \frac Ln\right) \mathrm e^{-t/C_{LS}}$$
for every $\delta \le \min( r_0, 1)$. Choosing $\delta = \min\big( (2\,n\,C_{LS})^{1/2}\, \mathrm e^{-t/2\,C_{LS}},\ r_0,\ 1\big)$ and using the inequality $x\,\mathrm e^{-x} \le \mathrm e^{-x/2}$ yields the result.

In this section we prove Theorem 3. Again the idea is to write
$$\frac1n\, W_1(x_k,\mu)^2 \le \frac2n\, W_1(x_k, X_{k\eta})^2 + \frac2n\, W_1( X_{k\eta},\mu)^2,$$
and to bound the first term using Theorem 1. Actually, note that here we allow $x_0 = X_0$ to be random, so we rather condition on $x_0$, apply Theorem 1 and then take expectation again. Therefore it is enough to bound the second term. This is where the Poincaré inequality enters the picture. Note that this part of the argument does use the log-concavity of $\mu$.

Lemma 10.
Suppose that $\mu$ satisfies Poincaré with constant $C_P$. Then for every probability measure $\nu$ on $\mathbb R^n$ we have
$$W_1(\nu,\mu)^2 \le C_P\, \chi^2(\nu\,|\,\mu).$$

Proof.
Let $\rho$ be the density of $\nu$ with respect to $\mu$ and let $f$ be a 1-Lipschitz function on $\mathbb R^n$. Applying Cauchy-Schwarz and the Poincaré inequality we get
$$\int f\, d\nu - \int f\, d\mu = \int f\, (\rho - 1)\, d\mu \le \mathrm{var}_\mu(f)^{1/2} \left( \int (\rho - 1)^2\, d\mu\right)^{1/2} \le C_P^{1/2}\, \chi^2(\nu\,|\,\mu)^{1/2}.$$
Taking the supremum in $f$ and using the dual formulation of $W_1$ yields the result.
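As a concrete check of Lemma 10 (our own example, not from the paper), take $\mu = \mathcal N(0,1)$ and $\nu = \mathcal N(m,1)$ on the real line, for which $C_P = 1$ and both sides can be computed in closed form:

```latex
% Worked example of Lemma 10 for one-dimensional Gaussians.
% With \mu = N(0,1) and \nu = N(m,1) one computes directly
W_1(\nu,\mu) = |m|,
\qquad
\chi^2(\nu \,|\, \mu) = \int \Big(\frac{d\nu}{d\mu}\Big)^2 d\mu - 1
  = \mathrm{e}^{m^2} - 1,
% and since C_P = 1 for the standard Gaussian, Lemma 10 reads
W_1(\nu,\mu)^2 = m^2 \;\le\; \mathrm{e}^{m^2} - 1 = C_P\, \chi^2(\nu \,|\, \mu),
% which indeed holds for every real m.
```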
Remark. We are not aware of an analogous statement for $W_2$, maybe it is simply false, we did not investigate this question too much.

On the other hand it is well-known that under Poincaré the chi-square divergence decays exponentially fast along the Langevin diffusion. Letting $(P_t)$ be the semigroup of the Langevin diffusion associated to $\mu$ we have
$$\chi^2( \nu P_t\,|\,\mu) \le \mathrm e^{-t/C_P}\, \chi^2(\nu\,|\,\mu).$$
We thus get the following:

Corollary 11.
Suppose that $\mu$ satisfies Poincaré with constant $C_P$. Then for every probability measure $\nu$ on $\mathbb R^n$ and every $t > 0$ we have
$$W_1( \nu P_t, \mu)^2 \le C_P\, \chi^2(\nu\,|\,\mu)\, \mathrm e^{-t/C_P}.$$
This finishes the proof of Theorem 3.

We end this section with a simple estimate of the chi-square divergence of an appropriate Gaussian measure to $\mu$ in the unconstrained case.

Proof of Lemma 4.
Recall that $\mu$ is assumed to be supported on the whole $\mathbb R^n$ with convex and globally Lipschitz potential $\varphi$. Let $\gamma$ be the Gaussian measure centered at a point $x_0$ and with covariance $\alpha\,\mathrm{Id}$ for some $\alpha > 0$. Then
$$\chi^2(\gamma\,|\,\mu) \le (2\pi\alpha)^{-n} \int_{\mathbb R^n} \mathrm e^{-\frac1\alpha |x - x_0|^2 + \varphi(x)}\, dx \le (2\pi\alpha)^{-n} \int_{\mathbb R^n} \mathrm e^{-\frac1\alpha |x - x_0|^2 + \varphi(x_0) + L|x - x_0|}\, dx \le (2\pi\alpha)^{-n} \int_{\mathbb R^n} \mathrm e^{-\frac1{2\alpha} |x - x_0|^2 + \varphi(x_0) + \frac{L^2\alpha}2}\, dx = (2\pi\alpha)^{-n/2}\, \mathrm e^{\varphi(x_0) + L^2\alpha/2}.$$
Also, reasoning along the same lines as in the previous section we get
$$\varphi(x_0) \le \min_{\mathbb R^n}\{\varphi\} + n\sigma \le \frac n2\, \log( 2\pi\mathrm e\, C_P) + n\sigma.$$
Putting everything together and choosing $\alpha = n/L^2$ yields
$$\log \chi^2(\gamma\,|\,\mu) \le \frac n2\, \big( -\log\alpha + 1 + \log C_P + 2\sigma\big) + \frac12\, L^2\alpha = n\left( 1 + \sigma + \frac12\, \log\left( \frac{L^2 C_P}n\right)\right),$$
which is the result.
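In code, the warm start of Lemma 4 amounts to a single Gaussian draw with covariance $\frac n{L^2}\mathrm{Id}$ before running the unconstrained updates (2). A minimal sketch under the normalization reconstructed above, with hypothetical parameters:

```python
import numpy as np

def lmc_warm_start(grad_phi, x_center, n, L, eta, n_steps, rng):
    """Unconstrained LMC started from the Gaussian warm start of Lemma 4:
    x_0 ~ N(x_center, (n / L**2) * Id), then updates (2)."""
    x = x_center + np.sqrt(n / L**2) * rng.normal(size=n)
    for _ in range(n_steps):
        xi = rng.normal(scale=np.sqrt(eta), size=n)
        x = x + xi - 0.5 * eta * grad_phi(x)
    return x
```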
We begin this section with a simple lemma about the Wasserstein distance of $\mu$ to $\mu$ restricted to a large ball.

Lemma 12. Let $\mu$ be a log-concave measure on $\mathbb R^n$, and let $\mu_R$ be the measure $\mu$ conditioned on the ball centered at $0$ of radius $R$. There exists a universal constant $C$ such that
$$W_2(\mu, \mu_R)^2 \le C\, M\, \exp\left( -\frac R{C\sqrt M}\right), \quad \forall R \ge C\sqrt M,$$
where $M = \int_{\mathbb R^n} |x|^2\, d\mu$.

Proof. Let $X$ have law $\mu$. Note that by Borell's lemma [4, Lemma 3.1] we have $\mathbb P( |X| \ge t) \le \mathrm e^{-t/C_1\sqrt M}$ for all $t \ge C_1\sqrt M$, for some universal constant $C_1$. This also implies that $\mathbb E[ |X|^4] \le C_2\, M^2$ for some $C_2$. Now assume that $R \ge C_1\sqrt M$, let $\widetilde X$ have law $\mu_R$ and be independent of $X$, and let
$$Y = \begin{cases} X & \text{if } |X| \le R,\\ \widetilde X & \text{otherwise.}\end{cases}$$
Then $Y$ also has law $\mu_R$, so that
$$W_2(\mu, \mu_R)^2 \le \mathbb E[ |X - Y|^2] = \mathbb E[ |X - \widetilde X|^2;\ |X| > R] \le \mathbb E[ |X - \widetilde X|^4]^{1/2}\, \mathbb P( |X| > R)^{1/2} \le C_3\, M\, \mathrm e^{-R/( 2 C_1 \sqrt M)},$$
which is the result.

Proof of Theorem 5.
Assuming that $\int_{\mathbb R^n} |x|^2\, d\mu = O(n)$, the previous lemma shows that $\frac1n W_2(\mu, \mu_R)^2$ will be negligible as soon as $R$ is a sufficiently large multiple of $\sqrt n\, \log n$. Now we apply Theorem 2 to $\mu_R$, initiating the Langevin algorithm at $0$. Then the parameter $r_0$ is of order $\sqrt n\, \log n$. Moreover the hypothesis
$$|\nabla\varphi(x)| \le \beta( |x| + 1)$$
shows that the potential of $\mu_R$ is Lipschitz with constant $O^*(\beta\sqrt n)$. Therefore the constant $A$ defined by (6) satisfies $A = O^*( \max(\beta, 1))$. Since $\mu_R$ is log-concave and supported on a ball of radius $O^*(\sqrt n)$ its log-Sobolev constant is $O^*(n)$ at most. Also, if $\mu$ satisfies log-Sobolev, then the log-Sobolev constant of $\mu_R$ cannot be larger than a constant factor times that of $\mu$. This follows easily from the fact that within log-concave measures log-Sobolev and Gaussian concentration are equivalent, see [18, Theorem 1.2]. To sum up, the log-Sobolev constant of $\mu_R$ is $O^*( \min( n, C_{LS}))$ where $C_{LS}$ is the log-Sobolev constant of $\mu$ (which is possibly infinite). Applying Theorem 2 we see that after
$$k = \Theta^*\left( \frac{\min( C_{LS}, n)^3\, \max(\beta, 1)^2}{\varepsilon^2}\right)$$
steps of the Langevin algorithm for $\mu_R$ initiated at $0$, and with appropriate time step parameter, we have $\frac1n W_2( x_k, \mu_R)^2 \le \varepsilon$. Since $\frac1n W_2(\mu_R, \mu)^2$ is negligible this implies $\frac1n W_2( x_k, \mu)^2 \le \varepsilon$, up to universal constants.

In this final section we prove the Durmus-Moulines type result. The argument is essentially taken from their article [10], but we also borrow some ideas from Vempala and Wibisono's work [21].
Proof of Theorem 6.
We assume that the Langevin diffusion is in equilibrium: the starting point $X_0$ is already distributed according to $\mu$. As in the proof of the main result the diffusion $(X_t)$ and the algorithm $(x_k)$ are coupled together by choosing the sequence $\xi_k = B_{k\eta} - B_{(k-1)\eta}$ for all $k$. For $t \in [i\eta, (i+1)\eta]$ we set
$$\widetilde X_t = x_i + B_t - B_{i\eta} - \frac{t - i\eta}2\, \nabla\varphi(x_i),$$
so that $\widetilde X_{i\eta} = x_i$ and $\widetilde X_{(i+1)\eta} = x_{i+1}$. Using $\nabla^2\varphi \ge \alpha\,\mathrm{Id}$ and the fact that $\nabla\varphi$ is Lipschitz with constant $\beta$ we get
$$\begin{aligned} d|X_t - \widetilde X_t|^2 &= -\langle X_t - \widetilde X_t,\ \nabla\varphi(X_t) - \nabla\varphi(x_i)\rangle\, dt\\ &\le -\alpha\, |X_t - \widetilde X_t|^2\, dt + \langle X_t - \widetilde X_t,\ \nabla\varphi(\widetilde X_t) - \nabla\varphi(x_i)\rangle\, dt\\ &\le -\frac\alpha2\, |X_t - \widetilde X_t|^2\, dt + \frac1{2\alpha}\, |\nabla\varphi(\widetilde X_t) - \nabla\varphi(x_i)|^2\, dt\\ &\le -\frac\alpha2\, |X_t - \widetilde X_t|^2\, dt + \frac{\beta^2}{2\alpha}\, |\widetilde X_t - x_i|^2\, dt. \end{aligned} \qquad (16)$$
Now observe that
$$\mathbb E[ |\widetilde X_t - x_i|^2] \le n\eta + \frac{\eta^2}4\, \mathbb E[ |\nabla\varphi(x_i)|^2].$$
To bound the second term we proceed as follows:
$$\mathbb E[ |\nabla\varphi(x_i)|^2] \le 2\, \mathbb E[ |\nabla\varphi( X_{i\eta})|^2] + 2\, \mathbb E[ |\nabla\varphi( X_{i\eta}) - \nabla\varphi(x_i)|^2] = 2\, \mathbb E[ \Delta\varphi( X_{i\eta})] + 2\, \mathbb E[ |\nabla\varphi( X_{i\eta}) - \nabla\varphi(x_i)|^2] \le 2\beta n + 2\beta^2\, \mathbb E[ |X_{i\eta} - x_i|^2],$$
where the second line follows from the fact that $X_{i\eta}$ has law $\mu$ and an integration by parts. Taking expectation in (16) and using the inequality $\beta\eta \le 1$ we arrive at
$$\frac d{dt}\, \mathbb E[ |X_t - \widetilde X_t|^2] \le -\frac\alpha2\, \mathbb E[ |X_t - \widetilde X_t|^2] + \frac{\beta^2}{2\alpha}\left( 2n\eta + \frac{\beta^2\eta^2}2\, \mathbb E[ |X_{i\eta} - x_i|^2]\right).$$
Then we integrate this differential inequality between $i\eta$ and $(i+1)\eta$. We get
$$\begin{aligned} \mathrm e^{\alpha\eta/2}\, \mathbb E[ |X_{(i+1)\eta} - x_{i+1}|^2] &\le \mathbb E[ |X_{i\eta} - x_i|^2] + (\mathrm e^{\alpha\eta/2} - 1)\, \frac{\beta^2}{\alpha^2}\left( 2n\eta + \frac{\beta^2\eta^2}2\, \mathbb E[ |X_{i\eta} - x_i|^2]\right)\\ &\le \left( 1 + \frac{\beta^4\eta^3}{2\alpha}\right) \mathbb E[ |X_{i\eta} - x_i|^2] + \frac{3\,\beta^2 n\eta^2}\alpha\\ &\le \left( 1 + \frac{\alpha\eta}2\right) \mathbb E[ |X_{i\eta} - x_i|^2] + \frac{3\,\beta^2 n\eta^2}\alpha\\ &\le \mathrm e^{\alpha\eta/2}\, \mathbb E[ |X_{i\eta} - x_i|^2] + \frac{3\,\beta^2 n\eta^2}\alpha, \end{aligned}$$
where we used the inequalities $\eta \le \alpha\beta^{-2}$ and $\eta \le \alpha^{-1}$, as well as the inequality $\mathrm e^x \le 1 + 2x$ for $x \in [0, 1]$. This is a recursive inequality for $\mathbb E[ |X_{i\eta} - x_i|^2]$. It is easy to solve it and we obtain
$$\mathbb E[ |X_{k\eta} - x_k|^2] \le \mathrm e^{-\alpha k\eta/2}\, \mathbb E[ |X_0 - x_0|^2] + \frac{3\,\beta^2 n\eta^2}\alpha \sum_{i=0}^{k-1} \mathrm e^{-\alpha i\eta/2} \le \mathrm e^{-\alpha k\eta/2}\, \mathbb E[ |X_0 - x_0|^2] + \frac{3\,\beta^2 n\eta^2}{\alpha( 1 - \mathrm e^{-\alpha\eta/2})} \le \mathrm e^{-\alpha k\eta/2}\, \mathbb E[ |X_0 - x_0|^2] + \frac{8\,\beta^2 n\eta}{\alpha^2},$$
where we used the hypothesis $\eta \le 1/\alpha$ in the last line. Now since $X_{k\eta}$ has law $\mu$ the left-hand side is at least $W_2(\mu, x_k)^2$. Optimizing over the coupling of $(X_0, x_0)$ we thus get
$$\frac1n\, W_2( x_k, \mu)^2 \le \mathrm e^{-\alpha k\eta/2}\, \frac1n\, W_2( x_0, \mu)^2 + \frac{8\,\beta^2\eta}{\alpha^2},$$
which is the result.
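The contraction driving this recursion is easy to observe numerically: couple two copies of the algorithm through the same Gaussian increments, exactly as in the proof. The sketch below (our own illustration, with the hypothetical potential $\varphi(x) = |x|^2/2$, so that $\alpha = \beta = 1$) prints the squared distance between the two chains, which collapses at an exponential rate.

```python
import numpy as np

# Synchronous coupling of two LMC chains for phi(x) = |x|^2 / 2
# (alpha = beta = 1): both chains share the same noise, so
# |x_k - y_k|^2 contracts geometrically, mirroring the recursion above.
rng = np.random.default_rng(2)
n, eta, n_steps = 50, 0.01, 2000
x = np.zeros(n)
y = 10.0 * np.ones(n)            # a second chain started far away
for k in range(n_steps):
    xi = rng.normal(scale=np.sqrt(eta), size=n)   # shared increment
    x = x + xi - 0.5 * eta * x                    # grad phi(x) = x
    y = y + xi - 0.5 * eta * y
    if k % 500 == 0:
        print(k, np.sum((x - y) ** 2))
```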
References

[1] Bakry, D.; Gentil, I.; Ledoux, M. Analysis and geometry of Markov diffusion operators. Grundlehren der Mathematischen Wissenschaften, 348. Springer, Cham, 2014.

[2] Bobkov, S.G.; Gentil, I.; Ledoux, M. Hypercontractivity of Hamilton-Jacobi equations. J. Math. Pures Appl. (9) 80 (2001), no. 7, 669-696.

[3] Bobkov, S.G.; Ledoux, M. From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities. Geom. Funct. Anal. 10 (2000), no. 5, 1028-1052.

[4] Borell, C. Convex measures on locally convex spaces. Ark. Mat. 12 (1974), 239-252.

[5] Bubeck, S.; Eldan, R.; Lehec, J. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete Comput. Geom. 59 (2018), no. 4, 757-783.

[6] Cépa, E. Équations différentielles stochastiques multivoques. (French) [Multivalued stochastic differential equations] Séminaire de Probabilités, XXIX, 86-107, Lecture Notes in Math., 1613, Springer, Berlin, 1995.

[7] Chatterji, N.S.; Diakonikolas, J.; Jordan, M.I.; Bartlett, P.L. Langevin Monte Carlo without smoothness. Preprint, arXiv:1905.13285, 2019.

[8] Chen, Y. An almost constant lower bound of the isoperimetric coefficient in the KLS conjecture. Preprint, arXiv:2011.13661, 2020.

[9] Dalalyan, A.S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 (2017), no. 3, 651-676.

[10] Durmus, A.; Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25 (2019), no. 4A, 2854-2882.

[11] Dyer, M.; Frieze, A.; Kannan, R. A random polynomial-time algorithm for approximating the volume of convex bodies. J. Assoc. Comput. Mach. 38 (1991), no. 1, 1-17.

[12] Eldan, R. Thin shell implies spectral gap up to polylog via a stochastic localization scheme. Geom. Funct. Anal. 23 (2013), no. 2, 532-569.

[13] Fradelizi, M. Sections of convex bodies through their centroid. Arch. Math. (Basel) 69 (1997), no. 6, 515-522.

[14] Kannan, R.; Lovász, L.; Simonovits, M. Isoperimetric problems for convex bodies and a localization lemma. Discrete Comput. Geom. 13 (1995), no. 3-4, 541-559.

[15] Ledoux, M. A simple analytic proof of an inequality by P. Buser. Proc. Amer. Math. Soc. 121 (1994), no. 3, 951-959.

[16] Ledoux, M. From concentration to isoperimetry: semigroup proofs. Concentration, functional inequalities and isoperimetry, 155-166, Contemp. Math., 545, Amer. Math. Soc., Providence, RI, 2011.

[17] Lee, Y.T.; Vempala, S.S. Eldan's stochastic localization and the KLS conjecture: isoperimetry, concentration and mixing. Preprint, arXiv:1712.01791, 2017.

[18] Milman, E. Isoperimetric and concentration inequalities: equivalence under curvature lower bound. Duke Math. J. 154 (2010), no. 2, 207-239.

[19] Otto, F.; Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal. 173 (2000), no. 2, 361-400.

[20] Tanaka, H. Stochastic differential equations with reflecting boundary condition in convex regions. Hiroshima Math. J. 9 (1979), no. 1, 163-177.

[21] Vempala, S.; Wibisono, A. Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).