Double-Loop Unadjusted Langevin Algorithm
aa r X i v : . [ m a t h . S T ] J u l Double-Loop Unadjusted Langevin Algorithm
Paul Rolland Armin Eftekhari Ali Kavis Volkan Cevher Abstract
A well-known first-order method for samplingfrom log-concave probability distributions is theUnadjusted Langevin Algorithm (ULA). Thiswork proposes a new annealing step-size sched-ule for ULA, which allows to prove new con-vergence guarantees for sampling from a smoothlog-concave distribution, which are not coveredby existing state-of-the-art convergence guaran-tees. To establish this result, we derive a newtheoretical bound that relates the Wasserstein dis-tance to total variation distance between any twolog-concave distributions that complements thereach of Talagrand T inequality. Moreover, ap-plying this new step size schedule to an existingconstrained sampling algorithm, we show state-of-the-art convergence rates for sampling from aconstrained log-concave distribution, as well asimproved dimension dependence.
1. Introduction
Let d µ ∗ ( x ) ∝ e − f ( x ) d x be a probability measure over R d , where f : R d → R is a convex function withLipschitz continuous gradient. In order to sample fromsuch distributions, first-order sampling schemes based onthe discretization of Langevin dynamics and, in particu-lar the Unadjusted Langevin Algorithm (ULA), have foundwidespread success in various applications (Welling & Teh,2011; Li et al., 2016b; Patterson & Teh, 2013; Li et al.,2016a).An ever-growing body of literature has been devoted solelyto the study of ULA and its variations (Ahn et al., 2012;Chen et al., 2015; Cheng & Bartlett, 2017; Cheng et al.,2017; Dalalyan & Karagulyan, 2017; Durmus et al.,2017; 2018a; Dwivedi et al., 2018; Luu et al., 2017; LIONS, Ecole Polytechnique Fédérale de Lausanne, Switzer-land Department of Mathematics and Mathematical Statistics,Umea University, Sweden. Correspondence to: Paul Rolland
Proceedings of the th International Conference on MachineLearning , Vienna, Austria, PMLR 119, 2020. Copyright 2020 bythe author(s).
Welling & Teh, 2011; Ma et al., 2015). The ULA iteratesare given as x k +1 = x k − γ k +1 ∇ f ( x k ) + p γ k +1 g k , (1)where ∇ f is the gradient of f , { γ k } k ≥ is a non-increasingsequence of positive step-sizes, and the entries of g k ∈ R d are zero-mean and unit-variance Gaussian randomvariables, independent from each another and everythingelse. In its standard form (1), ULA can provably sam-ple from any log-concave and smooth probability measure(Durmus et al., 2017; 2018a).The recent analysis of (Durmus et al., 2018a) studies ULAthrough the lens of convex optimization. Their analysisshows strong resemblance with the convergence analysisof stochastic gradient descent (SGD) algorithm for mini-mizing a convex continuously differentiable function f : R d → R . Starting from x ∈ R d , SGD iterates similarlyas (1): x k +1 = x k − γ k +1 ∇ f ( x k ) + γ k +1 Θ( x k ) , where { γ k } k ≥ is a non-increasing sequence of positivestep-sizes, and Θ : R d → R d is a stochastic perturbation to ∇ f . One way of proving convergence guarantees for thismethod is to show the following inequality: γ k +1 ( E [ f ( x k +1 )] − f ( x ∗ )) ≤ E (cid:2) k x k − x ∗ k (cid:3) − E (cid:2) k x k +1 − x ∗ k (cid:3) + Cγ k +1 (2)for some constant C ≥ , ∀ k ≥ and x ∗ ∈ arg min x ∈ R d f ( x ) . From this inequality, and using stepsize γ k ∝ √ k , it is then possible to show convergence, inexpectation, of the average iterate ¯ x T = T P T − t =0 x t to theoptimal value, i.e., E [ f (¯ x T )] − f ( x ∗ ) = O (cid:16) √ T (cid:17) .In their paper, (Durmus et al., 2018a) showed a similar de-scent lemma as (2) for the sequence of generated measures { µ k } k ≥ denoting the distributions of the iterates { x k } k ≥ in (1), in which the objective gap E [ f ( x k )] − f ( x ∗ ) is re-placed with the Kullback-Leibler divergence KL( µ k ; µ ∗ ) ,and the Euclidean distance k x k − x ∗ k is replaced with the -Wasserstein distance W ( µ k , µ ∗ ) : γ k +1 KL( µ k ; µ ∗ ) ≤ W ( µ k , µ ∗ ) − W ( µ k +1 , µ ∗ )+ 2 Ldγ k +1 , (3) ouble-Loop Unadjusted Langevin Algorithm where L is the Lipschitz constant of the gradient of f . Thenagain, using γ k ∝ √ k , it is possible to show convergenceof the average sample distribution ¯ µ T = T P Tt =0 µ t to µ ∗ in KL divergence, with rate O (cid:16) d √ T (cid:17) .In this work, we improve this convergence rate to O (cid:16) d T (cid:17) .To this end, we first establish a new bound that relatesthe W distance and the KL divergence between any twolog-concave distributions. When applied to inequality (3),this new bound can be exploited to design a new step-sizesequence { γ k } k ≥ that allows to derive new convergencerates for ULA.We introduce a new multistage decaying step size sched-ule, which proceeds in a double loop fashion by geomet-rically decreasing the step-size after a certain number ofiterations, and that we call Double-loop ULA (DL-ULA).To the best of our knowledge, all existing convergenceproof for ULA use either constant, or polynomially decay-ing step sizes, i.e. of the form γ k = k − α for some α ≥ ,and this is the first work introducing a multistage decayingstep size for a sampling algorithm. Interestingly, there isprecedence to support our approach in that such step decayschedule can improve convergence of optimization algo-rithms (Hazan & Kale, 2014; Ge et al., 2019; Aybat et al.,2019; Yousefian et al., 2012).Our new inequality relating KL divergence and W dis-tance serves as an alternative to the powerful T inequality(Gozlan & Léonard, 2010), the latter requiring stronger as-sumptions on the distributions. The literature on Langevindynamics commonly proves the convergence of an algo-rithm in KL divergence and then extends it to the totalvariation ( TV ) distance using the famous Pinsker’s inequal-ity (Pinsker, 1960; Cheng & Bartlett, 2017; Durmus et al.,2018a). Our new inequality enables to do the same for ex-tending convergence results to W distance in the case ofgeneral log-concave distributions, and hence, might be ofindependent interest. Note, however, that this inequality ap-plied alone to extend the result of (Durmus et al., 2018a)to W distance provides a suboptimal convergence rate, andmodifying the step-size schedule and the analysis appearsto be crucial for improving the rate.Finally, we apply this multistage strategy to the constrainedsampling algorithm MYULA (Brosse et al., 2017), whichallows us to obtain improved convergence guarantees, bothin terms of rate and dimension dependence. This approachprovides state-of-the-art convergence guarantees for sam-pling from a log-concave distribution over a general convexset.We summarize our contributions as follows: • We introduce a variant of the Unadjusted Langevin Algo- rithm, using a new multistage decaying step-size sched-ule as well as a clipping step. Our new approach, calledDL-ULA, yields new convergence guarantees, that arenot covered by existing convergence result (i.e., eitherbetter convergence rate or better dimension dependencecompared to state-of-the-art results). • We apply our new step-size schedule to an existingLangevin-based constrained sampling algorithm, calledMYULA (Brosse et al., 2017), and improve its conver-gence both in terms of iteration and dimension depen-dences. • We introduce a new bound relating the -Wassersteinand the TV distance between any two log-concave dis-tributions.A summary of our convergence rates can be found in Ta-bles 1 and 2. Road map
In section , we define several metrics onprobability measures that we will use, recall some prop-erties of log-concave distributions that we will exploit, aswell as some results on convergence of ULA. In section ,we present our new extension of ULA for unconstrainedsampling, by introducing a new multistage step size sched-ule. We then prove convergence guarantees by making useof a new bound relating the KL divergence and the W dis-tance. Finally, in section , we apply this procedure to theexisting algorithm MYULA for constrained sampling, andshow that it yields improved convergence guarantees, bothin terms of convergence rate and dimension dependence.
2. Related work
Unconstrained sampling
Sampling algorithms basedon Langevin dynamics have been widely studied(Ahn et al., 2012; Chen et al., 2015; Cheng & Bartlett,2017; Cheng et al., 2017; Dalalyan & Karagulyan, 2017;Durmus et al., 2018a; Dwivedi et al., 2018; Durmus et al.,2017; Luu et al., 2017; Welling & Teh, 2011). Althoughmost convergence rates have been established in thestrongly log-concave setting, rates have also been shownfor general log-concave distributions, and in particularexhibit larger dimension dependences (see Table 1).Convergence guarantees for ULA applied to a generalunconstrained log-concave distribution have been succes-sively improved over the years. To the best of our knowl-edge, the best existing convergence results are the oneobtained by (Durmus et al., 2018a) and (Durmus et al.,2017), that respectively show O ( d ǫ − ) and O ( d ǫ − ) con-vergence guarantees in TV distance. In this paper, we im-prove upon the former one, by showing a O ( d ǫ − ) con-vergence rate. This result is not absolutely better than the ouble-Loop Unadjusted Langevin Algorithm one of (Durmus et al., 2017), but enjoys better dimensiondependence.Until recently, convergence rate in Wasserstein distancehad not been proven in the general log-concave setting.Only recently, (Zou et al., 2018) presented a method basedon underdamped Langevin dynamics that provably con-verges in W -distance for a general log-concave distribu-tion.In (Zou et al., 2018), the authors show a O ( d . ǫ − ) convergence rate in W distance for general log-concavedistributions. However, they make the assumption that E X ∼ µ (cid:2) k X k (cid:3) ≤ ¯ U d for some scalar ¯ U . However, let d µ ( x ) ∝ e −k x k d x , which is a log-concave distribution.Then, E X ∼ µ (cid:2) k X k (cid:3) = Ω( d ) and their assumption doesnot hold. For comparison purpose, if we replace it with ourweaker Assumption 4, their rate becomes O ( d . ǫ − ) . Constrained sampling
Extensions of ULA have been de-signed in order to sample from constrained distributions(Bubeck et al., 2018; Brosse et al., 2017; Hsieh et al., 2018;Patterson & Teh, 2013). In (Bubeck et al., 2018), the au-thors propose to apply ULA, and project the sample ontothe constraint at each iteration. They show a convergencerate of O ( d ǫ − ) in TV distance for log-concave distri-butions (i.e., O ( d ǫ − ) iterations of the algorithm are suf-ficient in order to obtain an error smaller than ǫ in TV dis-tance).In (Brosse et al., 2017), the authors propose to smooth theconstraint using its Moreau-Yoshida envelope, and obtain aconvergence rate of O ( d ǫ − ) in TV distance when the ob-jective distribution is log-concave. To do so, they penalizethe domain outside the constrain directly inside the targetdistribution via its Moreau-Yoshida envelop.The analysis of MYULA in (Brosse et al., 2017) only holdswhen the penalty parameter is fixed and chosen in advance,leading to a natural saturation after a certain number of it-erations. In this work, we extend this procedure using ourDouble-loop approach. This allows to obtain improved con-vergence both in terms of rate and dimension dependence,i.e., O ( d . ǫ − ) in TV distance, and to ensure asymptoticconvergence of the algorithm since the penalty is allowedto vary along the iterations.The special case of sampling from simplices was solvedin (Hsieh et al., 2018), introducing Mirrored Langevin Dy-namics (MLD). Their work relies on finding a mirror mapfor the given constraint domain, and then performing ULAin the dual space. However, this method requires stronglog-concavity of the distribution in the dual space. More-over, finding a suitable mirror map for a general convex setis not an easy task.
3. Preliminaries
Let us recall the distances/divergences between probabilitymeasures which are used frequently throughout the paper.The Kullback–Leibler (KL) divergence between two prob-ability measures µ, ν on R d is defined as KL( µ ; ν ) = E µ log(d µ/ d ν ) , (4)assuming that µ is dominated by ν . Their Total Variation(TV) distance is defined as k µ − ν k TV = sup S | µ ( S ) − ν ( S ) | , (5)where the supremum is over all measurable sets S of R d . Fi-nally, the -Wasserstein (or W for short) distance between µ and ν is defined as W ( µ, ν ) = inf φ ∈ Φ( µ,ν ) Z R d × R d k x − y k dγ ( x, y ) , (6)where Φ( µ, ν ) denotes the set of all joint probability mea-sures φ on R d that marginalize to µ and ν , namely, forall measurable sets A, B ⊆ R d , φ ( A × R d ) = µ ( A ) and φ ( R d × B ) = ν ( B ) .The main difference between W and TV distances is that W associates a higher cost when the difference betweenthe distributions occurs at points that are further appart (interms of Euclidean distance). Due to this property, errorsoccurring at the tail of the distributions (i.e., when k x k →∞ ) can have a small impact in terms of TV distance, but amajor impact in terms of W distance. We start by recalling the basic property that we will assumeon the probability measure. We will then present someknown results about this class of measures which will beexploited in the convergence analysis of our algorithm.
Definition 1.
We say that a function f : R d → R has L -Lipschitz continuous gradient for L ≥ if ∀ x, y ∈ R d , k∇ f ( x ) − ∇ f ( y ) k ≤ L k x − y k . Definition 2.
We say a function f : R d → R in convex if ∀ ≤ t ≤ and ∀ x, y ∈ R d , f ( tx + (1 − t ) y ) ≤ tf ( x ) + (1 − t ) f ( y ) Definition 3.
We say that probability measure µ ∝ e − f ( x ) d x is logconcave if f is convex. Moreover, we saythat µ is L -smooth if f has a L -Lipschitz continuous gradi-ent. ouble-Loop Unadjusted Langevin Algorithm As mentioned previously, bounding the Wasserstein dis-tance between two probability measures requires control-ling the error at the tail of the distributions. In order to dealwith such a distance without injecting large dependence inthe dimension, we make the following assumption on thetail of the target distribution, which is quite standard whenworking with unconstrained non-strongly log-concave dis-tributions (Durmus et al., 2018a; 2017):
Assumption 4.
There exists η > , M η > such that forall x ∈ R d such that k x k ≥ M η , f ( x ) − f ( x ∗ ) ≥ η k x − x ∗ k where x ∗ = arg min x ∈ R d f ( x ) . Without loss of generality,we will also assume x ∗ = 0 and f ( x ∗ ) = 0 . Note that in the case of a distribution constrained to a set Ω ⊂ R d , this assumption is naturally satisfied with η arbi-trary, and M η = diam (Ω) where diam (Ω) is the diameterof Ω .In order to see how this assumption transfers into aconstraint on the tail of the distribution, we recall twofollowing results shown in (Durmus et al., 2018a) and(Lovász & Vempala, 2007) respectively. Lemma 5.
Let X ∈ R d be a random vector from a log-concave distribution µ satisfying assumption 4. Then E X ∼ µ (cid:2) k X k (cid:3) ≤ d ( d + 1) η + M η Lemma 6.
Let X ∈ R d be a random vector from a log-concave distribution µ such that E (cid:2) k X k (cid:3) ≤ C . Then,for any R > , we have P r ( k X k > RC ) < e − R +1 It is thus possible to combine both lemmas to show that anydistribution satisfying assumption 4 necessarily has a sub-exponential tail. This property will allow us to control theWasserstein distance in terms of the total variation distance.
Lemma 7.
Let X be a random vector from a log-concavedistribution µ satisfying assumption 4. Then, ∀ R > , P r k X k > R s d ( d + 1) η + M η ! < e − R +1 Finally, we recall the standard Unadjusted Langevin Algo-rithm as well as a very useful inequality bounding the KL divergence between the target distribution and the k -th sam-ple distribution.Consider the probability space ( R d , B , µ ∗ ) , where B is theBorel sigma algebra and µ ∗ is the target distribution. Sup-pose that µ ∗ is log-concave and dominated by the Lebesguemeasure on R d , namely, d µ ∗ ( x ) = Ce − f ( x ) d x, ∀ x ∈ S, (7)where C is an unknown normalizing constant and the func-tion f : R d → R is convex and ∇ f is L -Lipschitz contin-uous. We wish to sample from µ ∗ without calculating thenormalizing constant C .A well-known scheme for sampling for such a distributionis called ULA. Initialized at x ∈ R d , the iterates of ULAare x k +1 = x k − γ ∇ f ( x k ) + p γg k (8)for all k ≥ , where γ > is the step-size and the entries of g k ∈ R d are zero-mean and unit-variance Gaussian randomvariables, independent from each another and everythingelse. Let µ k be the probability measure associated to iterate x k , ∀ k ≥ . It is well-known that ULA converges to thetarget measure in KL divergence.More specifically, for n ≥ n ǫ = O ( d Lǫ − ) iterations, wereach KL( µ n ; µ ∗ ) ≤ ǫ , where µ n = n P nk =1 µ k is theaverage of the probability measures associated to the iter-ates { x k } nk =0 (Durmus et al., 2018a). The averaging sum n P nk =1 µ k is to be understood in the sense of measures,i.e., sampling from the ¯ µ n is equivalent to choosing an in-dex k uniformly at random among { , ..., n } , and then sam-pling from µ k .To prove this result, the authors showed the following use-ful inequality that we will exploit in our analysis: Lemma 8.
Suppose that we apply the ULA iterations (8) for sampling from a smooth log-concave probability mea-sure µ ∗ ∝ e − f ( x ) d x with constant step-size γ > , start-ing from x ∼ µ . Then, ∀ n > , KL(¯ µ n ; µ ∗ ) ≤ W ( µ , µ ∗ )2 γn + Ldγ. (9)
4. DL-ULA for unconstrained sampling
In this section, we present a modified version of the stan-dard ULA for sampling from an unconstrained distributionand provide convergence guarantees. This modified ver-sion of ULA involves a new step size schedule as well asa projection step. We will show that it allows to obtain im-proved convergence rate, as well as the first convergencerate in W -distance for overdamped Langevin dynamics. ouble-Loop Unadjusted Langevin Algorithm Algorithm 1
Double-loop Unadjusted Langevin Algorithm(DL-ULA)
Input : Smooth unconstrained probability measure µ ∗ , step sizes { γ k } k ≥ , number of (inner) iterations { n k } k ≥ , initial probability measure µ on R d , andthresholds { τ k } k ≥ . Initialization:
Draw a sample x from the probabilitymeasure µ . for k = 1 , . . . do x k, ← x k − for n = 1 , . . . , n k do x k,n +1 ← x k,n − γ k ∇ f ( x k,n ) + √ γ k g k,n , where g k,n ∼ N (0 , I d ) . end for x k ← x k,i , where i is drawn from the uniform distri-bution on { , · · · , n k } . if k x k k > τ k then x k ← τ k x k / k x k k . end ifend for4.1. DL-ULA algorithm We consider the problem of sampling from a smooth andunconstrained probability measure µ ∗ ∝ e − f ( x ) d x , where f : R d → R is differentiable. To this end, we apply thestandard ULA in a double-loop fashion, and decrease thestep size only between each inner loop. Moreover, eachinner loop is followed by a projection step onto some Eu-clidean ball. The procedure is summarized in Algorithm 1.The projection step appears to be crucial in our analysis inorder to control the tail of the sample distribution, whichis necessary for bounding its Wasserstein distance to thetarget distribution.In the following sections, we derive the convergence ratefor Algorithms 1. The global idea for showing the con-vergence of this algorithm is to use the inequality (9) re-cursively between each successive outer loop. We denoteas ¯ µ k the average distribution associated to the iterates ofouter iteration k just before the projection step. Similarly,we denote as ˜ µ k the same distribution after the projectionstep.Each outer iteration k uses as a starting point a sample fromthe previous outer iteration x k, ∼ ˜ µ k − . Therefore, wecan apply the inequality (9) to the outer iteration k to obtain KL(¯ µ k ; µ ∗ ) ≤ W (˜ µ k − , µ ∗ )2 γ k n k + Ldγ k . (10)In order to unfold the recursion, we must have a bound on W (˜ µ k − , µ ∗ ) in terms of KL(¯ µ k − , µ ∗ ) . Using the lighttail property of log-concave distributions, it is easy to ob-tain a bound between W (˜ µ k − , µ ∗ ) and W (¯ µ k − , µ ∗ ) . However, it is not clear how to bound W (¯ µ k − , µ ∗ ) by KL(¯ µ k − , µ ∗ ) .As an intermediate step in the convergence analysis, we de-rive in the next section a bound between the W -distanceand the TV -distance between two general log-concaveprobability measures, which can then be extended to a W - KL bound using Pinsker’s inequality. W - and TV -Distances When µ and ν are both compactly supported on an Eu-clidean ball of diameter D , then it is well-known that W ( µ, ν ) ≤ D p k µ − ν k TV (Gibbs & Su, 2002). Other-wise, if µ and ν are not compactly supported, their fast-decaying tail (Lemma 7) allows us to derive a similarbound, as summarized next and proved in Appendix A. Lemma 9. ( W - TV distances inequality) Let µ, ν be log-concave probability measures on R d both satisfying As-sumption 4 with ( η, M η ) . Then, for some scalar c ∈ R , W ( µ, ν ) ≤ cd max (cid:18) log (cid:18) k µ − ν k TV (cid:19) , (cid:19) p k µ − ν k TV . (11)In a sense, (11) is an alternative to the powerful T in-equality which does not apply generally in our setting(Gozlan & Léonard, 2010). Indeed, for C µ > , recall thata probability measure µ satisfies Talagrand’s T ( C µ ) trans-portation inequality if W ( µ, ν ) ≤ C µ p KL( µ ; ν ) , (12)for any probability measure ν . Above, C µ depends onlyon µ and, in particular, if µ is κ strongly log-concave, then (12) holds with C µ = O (1 / √ κ ) (Gozlan & Léonard,2010). In this work, the target measures that we considerare not necessarily strongly log-concave measures, leav-ing us in need for a replacement to (12). In our analysis,(11) serves as a replacement for (12). Indeed, using thePinsker’s inequality (Pinsker, 1960), an immediate conse-quence of (11) is that W ( µ, ν ) = e O (KL( µ ; ν ) ) . (13)In fact, (11) might also be of interest in its own right, espe-cially when working with non-strongly log-concave mea-sures. For example, it is easy to use (13) to extend the well-known O ( ǫ − ) convergence rate of ULA in KL divergenceto a e O ( ǫ − ) convergence rate in W distance in the non-strongly log-concave setting. To the best of our knowledge,such a result does not exist in the literature. If d µ ∝ e − f d x , then we say that µ is κ is strongly log-concave if f is κ strongly convex. ouble-Loop Unadjusted Langevin Algorithm Literature W TV KL (Durmus et al., 2018a) - e O (cid:0) Ld ǫ − (cid:1) e O (cid:0) Ld ǫ − (cid:1) (Durmus et al., 2017) - e O (cid:0) L d ǫ − (cid:1) -(Zou et al., 2018) e O (cid:0) L d . ǫ − (cid:1) ∗ - - Our work e O (cid:0) Ld ǫ − (cid:1) e O (cid:0) Ld ǫ − (cid:1) e O (cid:16) Ld ǫ − (cid:17) Table1. Complexity of sampling from a smooth and log-concave probability distribution. For each metric, the entry corresponds to thetotal number of iterations to use in order to reach an ǫ accuracy in the specified metric. ( ∗ For comparison purpose, we extended the proofin (Zou et al., 2018) in the case where the distribution satisfies the weaker assumption 4. The dimension dependence is thus differentfrom (Zou et al., 2018)).
Having covered the necessary technical tools above, wenow turn our attention to the convergence rate of Algo-rithm 1. The final step to take care of is to choose thesequences { γ k } k ≥ and { n k } k ≥ so as to obtain the bestpossible convergence guarantees. We summarize our resultin Theorem 10. Theorem 10. (iteration complexity of DL-ULA)
Let µ ∗ be a L -smooth log-concave distribution satisfying assump-tion 4. Suppose that µ also satisfies assumption 4. Forevery k ≥ , let n k = LM dk e k (14) γ k = 1 Ld e − k (15) τ k = M k. (16) where M = q d ( d +1) η + M η = O ( d ) .Let ¯ µ k , ˜ µ k be the average distributions associated with theiterates of outer iteration k of DL-ULA using the parame-ters above, just before and after the projection step respec-tively. Then, ∀ ǫ > , we have: • After N KL = ˜ O ( Ld ǫ − ) total iterations, we obtain KL(¯ µ k ; µ ∗ ) ≤ ǫ . • After N TV = ˜ O ( Ld ǫ − ) total iterations, we obtain k ˜ µ k − µ ∗ k TV ≤ ǫ . • After N W = ˜ O ( Ld ǫ − ) total iterations, we obtain W (˜ µ k , µ ∗ ) ≤ ǫ . A few remarks about Theorem 10 are in order.
Geometric sequences.
Theorem 10 prescribes a geomet-ric sequence for the choice of { γ k } k and { n k } k . As outeriteration counter k increases, more and more ULA (in-ner) iterations are performed with the constant step-size γ k . Asymptotically, we observe that the step size decreases ata rate n − where n is the total number of ULA iterations.This decaying rate is faster than the standard decaying rateof n − for ULA (Durmus et al., 2018a).In constrast to convex optimization where a global opti-mum can provably be reached with constant step-size, ULAcannot converge to the target distribution µ ∗ when usingconstant step-size, since the stationary distribution of ULAiterates (8) when using a constant step size is different fromthe target distribution. Asymptotically, it is thus desirableto use as small a step-size as possible. Projection step.
Although the initial and target distribu-tions are both log-concave, and thus have a sub-exponentialtail, the sample distributions ¯ µ k are not generally log-concave, and it is not clear whether they also share the sub-exponential tail property. The projection step at the end ofeach outer iteration provides a way to enforce the light tailproperty, so that we can still apply a bound similar to (11).This procedure is made clearer in the proof of the theorem.This procedure also provides more stability in the earlyouter iterations where the step-size is the largest. Moreover,since lim k →∞ τ k = ∞ , the projection step asymptoticallynever applies in practice. Convergence rate comparison
Table 1 summarizes var-ious convergence rates of Langevin dynamics based meth-ods applied to general log-concave distributions. We ob-serve that DL-ULA achieves improved convergence guar-antees either in terms of rate or dimension dependence.Compared to (Durmus et al., 2017), the convergence ratein TV distance is worse in terms of accuracy ǫ but enjoysmuch better dimension dependence, and is also better interms of Lipschitz constant dependence.
5. DL-MYULA for constrained sampling
We now apply the same multistage idea to an existing con-strained sampling algorithm, and show that it allows bothto obtain an asymptotic convergence and improved conver-gence guarantees. ouble-Loop Unadjusted Langevin Algorithm
Consider sampling from a log-concave distribution over aconvex set Ω ⊂ R d , i.e., µ ∗ ( x ) = ( e − f ( x ) / R Ω e − f ( x ′ ) dx ′ x ∈ Ω0 x / ∈ Ω . (17)In (Durmus et al., 2018b; Brosse et al., 2017), the authorspropose to reduce this problem to an unconstrained sam-pling problem by penalizing the domain outside Ω directlyinside the probability measure using its Moreau-Yoshidaenvelop. More precisely, they propose to sample from thefollowing unconstrained probability measure d µ λ ( x ) ∝ e − f λ ( x ) d x where f λ : R d → R is defined as: f λ ( x ) = f ( x ) + 12 λ k x − proj Ω ( x ) k , ∀ x ∈ R d , (18)where proj Ω : R d → Ω is the standard projection opera-tor onto Ω defined as proj Ω ( x ) = arg min y ∈ Ω k x − y k .Note that this penalty is easily differentiable as soon asthe projection onto Ω can be computed since ∇ f λ ( x ) = ∇ f ( x ) + λ ( x − proj Ω ( x )) .By bounding the TV distance between µ λ and µ ∗ , theyshowed that, by sampling from µ λ with λ small enough,it is possible to sample from µ ∗ with arbitrary precision.This algorithm is called Moreau-Yoshida ULA (MYULA).Building on this approach, we can apply our double loopalgorithm, by modifying both the step size as well asthe penalty parameter λ between each inner loop (Algo-rithm 2).In addition to providing improved rate, as we will showlater, our algorithm also has the advantage to use a decreas-ing penalty parameter λ so as to guarantee asymptotic con-vergence of the algorithm to the target distribution. On theother hand, MYULA uses constant penalty λ , and thus sat-urates after a certain number of iterations. Although thislooks like a trivial extension, using varying penalty param-eter makes the analysis more challenging since the targetdistribution of the algorithm is regularly changing. We now analyze the convergence of DL-MYULA. In Algo-rithm 2, both the step-size γ and the penalty parameter λ are decreased after each outer iteration. Therefore, at eachouter iteration k , we aim to sample from the unconstrainedpenalized distribution d µ λ k ∝ e − f λk ( x ) d x where f λ k isdefined in equation (18).Similarly as for DL-ULA, we will use Lemma 9 after eachouter iteration. However, since the target distribution of Algorithm 2
DL-MYULA
Input : Smooth constrained probability measure µ ∗ , stepsizes { γ k } k ≥ , penalty parameters { λ k } k ≥ , number of(inner) iterations { n k } k ≥ , initial probability measure µ on R d , and thresholds { τ k } k ≥ . Initialization:
Draw a sample x from the probabilitymeasure µ init . for k = 1 , . . . do x k, ← x k − for n = 1 , . . . , n k do x k,n +1 ← x k,n − γ k ( ∇ f ( x k,n ) + λ k ( x k,n − proj Ω ( x k,n ))) + √ γ k g k,n , where g k,n ∼ N (0 , I d ) . end for x k ← x k,i , where i is drawn from the uniform distri-bution on { , · · · , n k } . if k x k k > τ k then x k ← τ k x k / k x k k . end ifend for outer iteration is µ λ k instead of µ ∗ , the inequality reads asfollows: KL(¯ µ k ; µ λ k ) ≤ W (˜ µ k − , µ λ k )2 γ k n k + Ldγ k . where we recall that ¯ µ k is the average iterate distribution ofouter iteration k just before the projection step, and ˜ µ k isthe one just after the projection step.In order to use a similar recursion argument as previously,we must thus bound W (˜ µ k − , µ λ k ) by W (˜ µ k − , µ λ k − ) .Using the triangle inequality for W , we have W (˜ µ k − , µ λ k ) ≤ W (˜ µ k − , µ λ k − ) + W ( µ λ k − , µ ∗ )+ W ( µ λ k , µ ∗ ) . In (Brosse et al., 2017), the authors showed a bound for k µ λ − µ ∗ k TV in terms of λ > , and it is easy to extend theirproof to obtain a bound for W ( µ λ , µ ∗ ) (see Lemma 12 andits proof in Appendix C).In order to prove our result, we make the same assumptionson the constraint set Ω as in (Brosse et al., 2017): Assumption 11.
There exist r, R, ∆ > such that1. B (0 , r ) ⊂ Ω ⊂ B (0 , D ) where B (0 , r ) = { y ∈ R d : k x − y k ≤ r } ∀ r > ,2. e inf Ω c ( f ) − max Ω ( f ) ≥ ∆ , where Ω c = R d \ Ω . Lemma 12.
Let Ω ⊂ R d satisfy Assumption 11. Then ∀ λ < r d , W ( µ λ , µ ∗ ) ≤ C d √ λ (19) for some scalar C Ω > depending on D, r and ∆ . ouble-Loop Unadjusted Langevin Algorithm The proof of the previous Lemma is given in Appendix C.Using these results, the convergence proof is then very sim-ilar as for DL-ULA, and is summarized in Theorem 13,whose proof can be found in Appendix D.
Theorem 13. (iteration complexity of DL-MYULA)
Let Ω ⊂ R d be a convex set satisfying Assumption 11 and µ ∗ be a log-concave distribution given by (17) where f has L -Lipschitz continuous gradient. For every k ≥ , let λ k = 1 d r + de k (20) n k = Ldk e k (21) γ k = 1 Ld e − k (22) τ k = Dk (23) for every k ≥ . Then, ∀ ǫ > , we have: • After N TV = O (cid:0) d . ǫ − (cid:1) total iterations, we obtain k ˆ µ K − µ ∗ k TV ≤ ǫ . • After N W = ˜ O (cid:0) d . ǫ − (cid:1) total iterations, we obtain W (ˆ µ K , µ ∗ ) . ǫ . We make a few comments about this convergence result.
Smoothness of µ λ k One can notice that outer iterationsin DL-MYULA are longer than in DL-ULA. In order to ex-plain this choice, first observe that the Lipschitz constant as-sociated with the penalized distribution µ λ grows as O (cid:0) λ (cid:1) as λ goes to . As k increases and λ k decreases, µ λ k be-comes less and less smooth. Thus, for ULA to succeed inapproximating µ λ k , the step size γ k of ULA iterations re-duces accordingly, and the number of iterations increases.The choice for λ k ensures that λ k < r d as required forLemma 12 to be applicable. Convergence rate comparison
Table 2 summarizes con-vergence rates in TV distance for various first-order con-strained sampling algorithms. We can see that DL-MYULA outperforms existing approaches, both in termsof rate and dimension dependence.
6. Conclusion
In this work, we proposed and analyzed a new step-sizeschedule for the well-known Unadjusted Langevin Algo-rithm. Our approach works by applying ULA successivelywith constant step-size, and by geometrically decreasing it
Algorithm TV Literature
PLMC d e O (cid:0) ǫ − (cid:1) (Bubeck et al., 2018)MYULA d e O (cid:0) ǫ − (cid:1) (Brosse et al., 2017)DL-MYULA d . e O (cid:0) ǫ − (cid:1) Our work
Table2. Upper bounds on the number of iterations required in or-der to guarantee an error smaller than ǫ in TV distance for variousconstrained sampling algorithms. after a certain number of iterations. Exploiting a new resulton the relation between the -Wasserstein distance and the TV distance of two log-concave distributions, we were ableto prove new convergence guarantees for this procedure.We also applied our approach to an existing first-order con-strained sampling, and showed improved convergence guar-antees, both in terms of rate and dimension dependence.
7. Acknowledgements
This project has received funding from the European Re-search Council (ERC) under the European Union’s Horizon2020 research and innovation programme (grant agreementn ◦ References
Ahn, S., Korattikara, A., and Welling, M. Bayesian pos-terior sampling via stochastic gradient fisher scoring. arXiv preprint arXiv:1206.6380 , 2012.Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar,A. A universally optimal multistage accelerated stochas-tic gradient method. arXiv preprint arXiv:1901.08022 ,2019.Brosse, N., Durmus, A., Moulines, É., and Pereyra, M.Sampling from a log-concave distribution with compactsupport with proximal langevin monte carlo. arXivpreprint arXiv:1705.08964 , 2017.Bubeck, S., Eldan, R., and Lehec, J. Sampling from alog-concave distribution with projected langevin montecarlo.
Discrete & Computational Geometry , 59(4):757–783, 2018.Chen, C., Ding, N., and Carin, L. On the convergence ofstochastic gradient mcmc algorithms with high-order in- ouble-Loop Unadjusted Langevin Algorithm tegrators. In
Advances in Neural Information ProcessingSystems , pp. 2278–2286, 2015.Cheng, X. and Bartlett, P. Convergence of langevin mcmcin kl-divergence. arXiv preprint arXiv:1705.09048 ,2017.Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I.Underdamped langevin mcmc: A non-asymptotic analy-sis. arXiv preprint arXiv:1707.03663 , 2017.Dalalyan, A. S. and Karagulyan, A. G. User-friendly guar-antees for the langevin monte carlo with inaccurate gra-dient. arXiv preprint arXiv:1710.00095 , 2017.Durmus, A., Moulines, E., et al. Nonasymptotic conver-gence analysis for the unadjusted langevin algorithm.
The Annals of Applied Probability , 27(3):1551–1587,2017.Durmus, A., Majewski, S., and Miasojedow, B. Analysisof langevin monte carlo via convex optimization. arXivpreprint arXiv:1802.09188 , 2018a.Durmus, A., Moulines, E., and Pereyra, M. Efficientbayesian computation by proximal markov chain montecarlo: when langevin meets moreau.
SIAM Journal onImaging Sciences , 11(1):473–506, 2018b.Dwivedi, R., Chen, Y., Wainwright, M. J., and Yu, B. Log-concave sampling: Metropolis-hastings algorithms arefast! arXiv preprint arXiv:1801.02309 , 2018.Ge, R., Kakade, S. M., Kidambi, R., and Netrapalli, P.The step decay schedule: A near optimal, geometri-cally decaying learning rate procedure. arXiv preprintarXiv:1904.12838 , 2019.Gibbs, A. L. and Su, F. E. On choosing and bounding prob-ability metrics.
International statistical review , 70(3):419–435, 2002.Gozlan, N. and Léonard, C. Transport inequalities. a survey. arXiv preprint arXiv:1003.3852 , 2010.Hazan, E. and Kale, S. Beyond the regret minimization bar-rier: optimal algorithms for stochastic strongly-convexoptimization.
The Journal of Machine Learning Re-search , 15(1):2489–2512, 2014.Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. Mir-rored langevin dynamics. In
Advances in Neural Infor-mation Processing Systems , pp. 2883–2892, 2018.Li, C., Chen, C., Carlson, D., and Carin, L. Preconditionedstochastic gradient langevin dynamics for deep neuralnetworks. In
Thirtieth AAAI Conference on Artificial In-telligence , 2016a. Li, W., Ahn, S., and Welling, M. Scalable mcmc for mixedmembership stochastic blockmodels. In
Artificial Intelli-gence and Statistics , pp. 723–731, 2016b.Lovász, L. and Vempala, S. The geometry of logconcavefunctions and sampling algorithms.
Random Structures& Algorithms , 30(3):307–358, 2007.Luu, T., Fadili, J., and Chesneau, C. Sampling from non-smooth distribution through langevin diffusion. 2017.Ma, Y.-A., Chen, T., and Fox, E. A complete recipe forstochastic gradient mcmc. In
Advances in Neural Infor-mation Processing Systems , pp. 2917–2925, 2015.Patterson, S. and Teh, Y. W. Stochastic gradient rieman-nian langevin dynamics on the probability simplex. In
Advances in neural information processing systems , pp.3102–3110, 2013.Pinsker, M. S. Information and information stability of ran-dom variables and processes. 1960.Villani, C. Optimal transport–old and new, volume 338 ofa series of comprehensive studies in mathematics, 2009.Welling, M. and Teh, Y. W. Bayesian learning via stochas-tic gradient langevin dynamics. In
Proceedings ofthe 28th international conference on machine learning(ICML-11) , pp. 681–688, 2011.Yousefian, F., Nedi´c, A., and Shanbhag, U. V. Onstochastic gradient and subgradient methods with adap-tive steplength sequences.
Automatica , 48(1):56–67,2012.Zou, D., Xu, P., and Gu, Q. Stochastic variance-reduced hamilton monte carlo methods. arXiv preprintarXiv:1802.04791 , 2018. ouble-Loop Unadjusted Langevin Algorithm
A. Proof of Lemma 9
Before proving Lemma 9, we first prove some intermediate Lemmas.
Lemma 14.
Let µ, ν be any two distributions. Then, ∀ R > , we have W ( µ, ν ) ≤ R k µ − ν k TV + 2 E X ∼ µ (cid:2) k X k {k X k >R } (cid:3) + 2 R E X ∼ µ (cid:2) {k X k >R } (cid:3) + 2 E Y ∼ ν (cid:2) k Y k {k Y k >R } (cid:3) + 2 R E Y ∼ ν (cid:2) {k Y k >R } (cid:3) where {k X k >R } is the indicator function of the set B (0 , R ) c = { x ∈ R d : k x k > R } .Proof. Let X ∼ µ, Y ∼ ν . W -distance between probability measures µ and ν can be interpreted as the most cost-efficienttransport plan to transform µ into ν , defined as W ( µ, ν ) = min ( X,Y ) ∼ γ E k X − Y k , (24)where the minimization is over all probability measures γ that marginalize to µ, ν , namely, γ ( A × R d ) = µ ( A ) , γ ( R d × B ) = ν ( B ) , (25)for any measurable sets A, B ⊆ R d . For a fixed such measure γ , let us decompose the right-hand side of (24) as E k X − Y k = E (cid:2) k X − Y k E R (cid:3) + E (cid:2) k X − Y k E cR (cid:3) , (26)where E R stands for the indicator of the event E R = {k X k ≤ R, k Y k ≤ R } . Above, E cR is the complement of E R .For the first expectation on the right-hand side above, we write that E (cid:2) k X − Y k E R (cid:3) ≤ R E [1 X = Y E R ] ≤ R E [1 X = Y ] . (27)For the second expectation on the right-hand side of (26), we write that E (cid:2) k X − Y k E cR (cid:3) ≤ E (cid:2) k X k E cR (cid:3) + 2 E (cid:2) k Y k E cR (cid:3) . (( a + b ) ≤ a + 2 b ) (28)Let us in turn focus on, say, the first expectation on the right-hand side of (28). Since E cR = 1 {k X k >R } + 1 {k X k ≤ R } {k Y k >R } , we can write that E (cid:2) k X k E cR (cid:3) = E (cid:2) k X k {k X k >R } (cid:3) + E (cid:2) k X k {k X k ≤ R } {k Y k >R } (cid:3) ≤ E (cid:2) k X k {k X k >R } (cid:3) + R E (cid:2) {k Y k >R } (cid:3) . (29)Bounding E (cid:2) k Y k E cR (cid:3) similarly, we obtain E k X − Y k ≤ R E [1 X = Y ] + 2 E X ∼ µ (cid:2) k X k {k X k >R } (cid:3) + 2 R E X ∼ µ (cid:2) {k X k >R } (cid:3) + 2 E Y ∼ ν (cid:2) k Y k {k Y k >R } (cid:3) + 2 R E Y ∼ ν (cid:2) {k Y k >R } (cid:3) The result is then obtained by minimizing the above inequality over all coupling γ , and using the fact that k µ − ν k TV =min ( X,Y ) ∼ γ E [1 X = Y ] (Gibbs & Su, 2002). Lemma 15.
Suppose that µ, ν both satisfy Assumption 4 with η, M η > and such that E X ∼ µ (cid:2) k X k (cid:3) , E Y ∼ ν (cid:2) k Y k (cid:3) ≤ C . Then, for any R ≥ C , W ( µ, ν ) ≤ R k µ − ν k TV + 8 (cid:0) R + RC + C (cid:1) e − RC +1 . (30) ouble-Loop Unadjusted Langevin Algorithm Proof.
We start from the result of Lemma 14. The goal is then to bound the each term on the right hand side using the tailproperty of log-concave distributions (Lemma 6).We have E (cid:2) k X k {k X k >R } (cid:3) = 2 Z k x k >R Z z ∈ R {k x k ≥ z } zdzdµ ( x )= 2 Z z ∈ R zdz Z k x k ≥ max( R,z ) dµ ( x )= 2 Z z ∈ R z Pr [ k X k ≥ max( R, z )] dz = 2 Pr[ k X k ≥ R ] Z R zdz + 2 Z ∞ R z Pr[ k X k ≥ z ] dz ≤ R e − RC +1 + 2 Z ∞ R ze − zC +1 dz ≤ (cid:0) R + 2 CR + 2 C (cid:1) e − RC +1 . (31)Similarly, we have E [1 {k X k >R } ] = Pr[ k X | > R ] ≤ e − RC +1 . (32)Doing the same calculation for Y and replacing the terms in Lemma 14 provides the result.Using the previous Lemma, it is now easy to prove the result of Lemma 9. Proof of Lemma 9.
Let us apply Lemma 15 using R = C max (cid:18) log (cid:18) k µ − ν k TV (cid:19) , (cid:19) . With this choice of R and if k µ − ν k TV ≤ , note that e − RC = k µ − ν k TV . (33)On the other hand, if k µ − ν k TV > , then e − RC ≤ ≤ k µ − ν k TV . (34)Thus, Lemma 15 gives W ( µ, ν ) ≤ C max (cid:18) log (cid:18) k µ − ν k TV (cid:19) , (cid:19) k µ − ν k TV + 8 C (cid:18) (cid:18) log (cid:18) k µ − ν k TV (cid:19) , (cid:19)(cid:19) k µ − ν k TV ≤ C max (cid:18) log (cid:18) k µ − ν k TV (cid:19) , (cid:19) k µ − ν k TV . (35)Lemma 9 then follows from taking the square root of (35) and using C = d ( d +1) η + M η according to Lemma 5. B. Proof of Theorem 10
We start by showing the following result in the case where the target distribution µ ∗ satisfies E X ∼ µ ∗ (cid:2) k X k (cid:3) ≤ . Theorem 16. (iteration complexity of DL-ULA)
Let µ ∗ be a L -smooth log-concave distribution such that E X ∼ µ ∗ (cid:2) k X k (cid:3) ≤ . Suppose that µ also satisfies E X ∼ µ (cid:2) k X k (cid:3) ≤ . For every k ≥ , let n k = Ldk e k (36) ouble-Loop Unadjusted Langevin Algorithm γ k = 1 Ld e − k (37) τ k = k. (38) Then, ∀ ǫ > , we have: • After N KL = ˜ O ( Ldǫ − ) total iterations, we obtain KL(˜ µ k ; µ ∗ ) ≤ ǫ where ˜ µ k is the distribution associated to theiterates of outer iteration k just before the projection step. • After N TV = ˜ O ( Ldǫ − ) total iterations, we obtain k ˜ µ k − µ ∗ k TV ≤ ǫ . • After N W = ˜ O ( Ldǫ − ) total iterations, we obtain W (˜ µ k , µ ∗ ) ≤ ǫ .Proof. Recall that in Algorithm 1, we denote as ¯ µ k the average of the distributions associated to the iterates of outeriteration k just before the projection step, i.e., just before the projection step, x k ∼ ¯ µ k . We also denote as ˜ µ k the samedistribution, but after the projection step, i.e. the iterate that will be used as a warm start for the next outer iteration.In order to show the result, we will show by induction that ∀ k ≥ , k ˜ µ k − µ ∗ k TV ≤ u k e − k (39)where { u k } k ≥ is a real-valued sequence defined as u = min(2 √ eW ( µ , µ ∗ ) + 1 + 2 √ , e ) and u k = 4 √ eu k − + 9 +2 √ .Let us fix k ≥ . Thanks to the inequality (10), k ¯ µ k − µ ∗ k TV ≤ p KL (¯ µ k ; µ ∗ ) (Pinsker’s inequality) ≤ s W (˜ µ k − , µ ∗ ) γ k n k + 2 Ldγ k ≤ W (˜ µ k − , µ ∗ ) √ γ k n k + p Ldγ k (40)In order to use a recursion argument, we need to bound W (˜ µ k − , µ ∗ ) by k ˜ µ k − − µ ∗ k TV . Note that the projection stepfor ˜ µ k − with τ k − = ( k − ensures that Pr X ∼ ˜ µ k − ( k X k ≥ k −
1) = 0 . Knowing that E X ∼ µ ∗ (cid:2) k X k (cid:3) ≤ , we canapply Lemma 15 on W (˜ µ k − , µ ∗ ) using R = k . Also, by replacing the values for γ k , n k , we get W (˜ µ k − , µ ∗ ) ≤ k k ˜ µ k − − µ k − k TV + 16 ek e − k . Thus, k ¯ µ k − µ ∗ k TV ≤ k k ˜ µ k − − µ ∗ k TV + 4 √ eke − k ke k + √ e − k Now, by using the recursion hypothesis, i.e. that k ˜ µ k − − µ ∗ k TV ≤ u k − e − k +1 , we have: k ¯ µ k − µ ∗ k TV ≤ (cid:16) √ eu k − + 4 √ e + √ (cid:17) e − k (41)Then, by taking into account the projection step at the end of outer iteration k , we obtain k ˜ µ k − µ k k TV ≤ k ˜ µ k − ¯ µ k k TV + k ¯ µ k − µ ∗ k TV (triangle inequality) = Pr X ∼ ¯ µ k [ k X k > τ k ] + k ¯ µ k − µ ∗ k TV , (42) ouble-Loop Unadjusted Langevin Algorithm where the last line above follows because the projection step ensures Pr X ∼ ˜ µ k [ k X k > τ k ] = 0 . In turn, to compute theprobability in the last line above, we write that Pr X ∼ ¯ µ k [ k X k ≥ τ k ] ≤ Pr X ∼ µ ∗ [ k X k ≥ τ k ] + | ¯ µ k ([ τ k , ∞ ]) − µ ∗ ([ τ k , ∞ ]) | (triangle inequality) ≤ e − k + k ¯ µ k − µ ∗ k TV , (43)By combining (41), (42) and (43), we finally obtain k ˜ µ k − µ ∗ k TV ≤ k ¯ µ k − µ ∗ k TV + e − k ≤ (cid:16) √ eu k − + 9 + 2 √ (cid:17) e − k = u k e − k Finally, using equations (40), (42) and (43) applied at k = 1 , we can also apply Lemma 15 and we get: k ˜ µ − µ k TV ≤ (cid:16) W ( µ , µ ∗ ) + 2 √ (cid:17) e − (44)which proves the result for the initial case. We thus showed that equation (39) holds for all k ≥ .It is easy to verify that the sequence { u k } k ≥ converges, and is upper bounded by U = max( u , u ∗ ) where u ∗ =lim k →∞ u k . Moreover, since E X ∼ µ ∗ (cid:2) k X k (cid:3) , E X ∼ µ (cid:2) k X k (cid:3) ≤ we have that W ( µ , µ ∗ ) ≤ , and thus U is di-mension independent.After each outer iteration k , we thus have k ˜ µ k − µ ∗ k TV ≤ U e − k . Therefore, after K TV = log( Uǫ ) iterations, we have k ˜ µ k − µ ∗ k TV ≤ ǫ . The total number of iterations required is N TV = K TV X k =1 n k ≤ LdK K TV X k =1 e k = 11 − e − Ld log (cid:18) Uǫ (cid:19) U ǫ − Similarly, we also have W (˜ µ k , µ ∗ ) ≤ k k ˜ µ k − µ ∗ k TV +16 ek e − k ≤ (4 U +16 e ) k e − k . Thus, after K W = log( U +16 eǫ ) iterations, we have W (˜ µ k , µ ∗ ) ≤ ǫ log( U +16 eǫ ) . The total number of iterations required is N W = O ( Ldǫ − ) .Finally, we have KL(¯ µ k ; µ ∗ ) ≤ W (˜ µ k − ,µ ∗ )2 γ k n k + Ldγ k ≤ k ˜ µ k − − µ ∗ k TV e − k + e − k ≤ ( U + 1) e − k . Therefore, after K KL = log( U +1 ǫ ) iterations, we have KL(¯ µ k ; µ ∗ ) ≤ ǫ . The total number of iterations required is N KL = O ( Ldǫ − ) .In order to show the more general theorem 10, we must get rid of the assumption that E X ∼ µ ∗ (cid:2) k X k (cid:3) ≤ . To this end, wewill suppose that we apply DL-ULA to a contracted version of µ ∗ , for which theorem 10 applies. Then, we will dilate theobtained sample in order to recover samples from the desired measure µ ∗ and bound the error induced by this dilatation inorder to obtain the final convergence result.Let us first recall the notion of push-forward measure. Definition 17.
Let h : R d → R be a strongly convex function whose gradient is denoted as ∇ h : R d → R d . We say that ν is the push-forward measure of µ under ∇ h , and we write ν = ∇ h µ , if ν is the distribution obtained by sampling from µ , and then applying the map ∇ h to the samples.More precisely, it means that for every Borel set E on R d , we have ν ( E ) = µ ( ∇ h − ( E )) . ouble-Loop Unadjusted Langevin Algorithm Lemma 18.
Let d µ = e − f ( x ) d x and d ν = e − g ( x ) d x be such that ν = ∇ h µ for some strongly convex function h . Then,the triplet ( µ, ν, h ) must satisfy the Monge-Ampère equation: e − f = e − g ◦∇ h det ∇ h. Let d µ ∗ = e − f ( x ) d x be an L -smooth log-concave target distribution such that E X ∼ µ ∗ (cid:2) k X k (cid:3) ≤ M . Instead of directlysample from µ ∗ , suppose that we sample from the shrunk distribution ν ∗ = ∇ h µ ∗ with h ( x ) = M k x k for some M ≥ , i.e., ∇ h ( x ) = xM . In this particular case, we have that det ∇ h ( x ) is independent of x . Therefore, we haveaccording the Lemma 18 that d ν ∗ ∝ e − f ( Mx ) d x .This means that ν ∗ is the same distribution as µ ∗ , after the samples have been divided by M . It is easy to see that thisscaling procedure implies that E X ∼ ν ∗ [ k X k ] = M E X ∼ µ ∗ [ k X k ] ≤ .Thus, if we apply DL-ULA for sampling from ν ∗ , then we can apply the convergence result provided by theorem 16. Notethat this push-forward implies that ν ∗ is M L -smooth, i.e., the Lipschitz constant has been multiplied by M . Indeed, if g ( x ) = f ( M x ) and f is L -smooth, then, k∇ g ( y ) − ∇ g ( x ) k = M k∇ f ( M y ) − ∇ f ( M x ) k ≤ M k y − x k . Let ˜ ν be the approximated distribution obtained using DL-ULA on ν with n k = LM dk e k , γ k = LM d e − k and τ k = k .Then, according to Theorem 16, we have the following convergence results: • After N KL = ˜ O ( LM dǫ − ) total iterations, we obtain KL(˜ ν − ν ∗ ) ≤ ǫ . • After N TV = ˜ O ( LM dǫ − ) total iterations, we obtain k ˜ ν − ν ∗ k TV ≤ ǫ . • After N W = ˜ O ( LM dǫ − ) total iterations, we obtain W (˜ ν, ν ∗ ) ≤ ǫ .By applying the inverse mapping ∇ h − ( x ) = M x , we obtain samples from ˜ µ = ∇ h − ν . Interestingly, it can beshown that applying the same push-forward on two measures does not change their TV -distance not their KL divergence(Hsieh et al., 2018): k ˜ ν − ν ∗ k TV = k∇ h − ν − ∇ h − ν ∗ k TV = k ˜ µ − µ ∗ k TV , KL(˜ ν ; ν ∗ ) = KL( ∇ h − ν ; ∇ h − ν ∗ ) = KL(˜ µ ; µ ∗ ) . In terms of W -distance, when applying the same mapping ∇ h − to two measures, it can be shown that W (˜ µ ; µ ∗ ) ≤ M W ( ∇ h µ ; ∇ h µ ∗ ) = M W (˜ ν ; ν ∗ ) . Therefore, by sampling from ν ∗ , and then multiplying the obtained samples by M , we obtain the following convergenceresults: • After N KL = ˜ O ( LM dǫ − ) total iterations, we obtain KL(˜ µ − µ ∗ ) ≤ ǫ . • After N TV = ˜ O ( LM dǫ − ) total iterations, we obtain k ˜ µ − µ ∗ k TV ≤ ǫ . • After N W = ˜ O ( LM d (cid:0) ǫM (cid:1) − ) = ˜ O ( LM dǫ − ) total iterations, we obtain W (˜ µ, µ ∗ ) ≤ ǫ .Finally, we make the following important observation. By modifying the parameters γ k , τ k , it is possible to mimic theabove procedure by directly applying DL-ULA to µ ∗ . Suppose that we apply DL-ULA for sampling from d ν ∗ = e g ( y ) d y ,where g ( y ) = f ( M y ) , using parameters γ k , n k , τ k . Let y i be the iterates of some arbitrary outer iteration k , and let x i = M y i be their scaled version. The ULA iterates are: (cid:26) y i +1 = y i + γ i ∇ g ( y i ) + √ γ i g i x i +1 = M y i +1 ouble-Loop Unadjusted Langevin Algorithm Since ∇ g ( y i ) = M ∇ f ( M y i ) , we can rewrite this scheme only in terms of { x i } : x i +1 = x i + M γ i ∇ f ( x i ) + p M γ i g i Moreover, applying the projection step to y i with parameter τ k is the same as applying this projection to x i with parameter M τ k .Therefore, applying DL-ULA to ν ∗ using parameters n k , γ k , τ k , and then multiplying the iterates by M is the same asdirectly applying DL-ULA to µ ∗ using parameters n k , M γ k , M τ k .Overall, if we apply DL-ULA to a distribution µ ∗ such that E X ∼ µ ∗ (cid:2) k X k (cid:3) ≤ M using n k = LM dk e k , γ k = Ld e − k and τ k = M k , then we can guarantee convergence rates of ˜ O ( LM dǫ − ) , ˜ O ( LM dǫ − ) and ˜ O ( LM dǫ − ) in KL divergence, TV -distance and W -distance respectively.Finally, thanks to Lemma 5, we know that we can choose M = q d ( d +1) η + M η = O ( d ) . Thus, plugging this value insidethe convergence results above concludes the theorem. C. Proof of Lemma 12
Proof.
A similar result has been shown in (Brosse et al., 2017) (Proposition 5) for W distance, and it is only a matter oftrivial technicalities to extend their result to W distance. Since the full proof requires to introduce several concepts thatare out of the scope of this paper, we only present the required modifications that allow us to extend the result from W - to W -distance.Using (Villani, 2009), Theorem 6.15, we have: W ( µ λ , µ ∗ ) ≤ Z R d k x k | µ ∗ ( x ) − µ λ ( x ) | dx = A + B (45)where A = Z K c k x k µ λ ( x ) dx , B = − R K e − f R R d e − f λ ! Z K k x k µ ∗ ( x ) dx (46)Following very closely the proof in (Brosse et al., 2017) (equations 48 to 51), we can easily obtain: A ≤ ∆ − d − X i =0 dr r πλ ! d − i (cid:16) R + 2 R p λ ( d − i + 2) + λ ( d − i + 2) (cid:17) . (47)Therefore, for λ ≤ r πd , A ≤ ∆ − √ πλdr − R + 2 Rr r dπ + r dπ ! . (48)Moreover, it is also shown in (Brosse et al., 2017) (equations 17, 30, 42) that (cid:16) − R K e − f R R d e − fλ (cid:17) ≤ ∆ − πλdr − , whichimplies: B ≤ ∆ − √ πλdr − R (49)We thus showed that W ( µ λ , µ ∗ ) ≤ C √ dλ for some C > depending on D, r, ∆ . D. Convergence rate of HULA for sampling from a distribution over a bounded domain
The proof of Theorem 13 is very similar to the one for DL-ULA. Before presenting it, we will need an auxiliary Lemma,showing the light tail property of the distributions µ λ . ouble-Loop Unadjusted Langevin Algorithm Lemma 19.
For λ ≤ r d , the distribution µ λ as defined in equation (18) satisfiesPr X ∼ µ λ ( k X k ≥ R ) ≤ σe − RD for some scalar σ > and any R > , where D is the diameter of the constraint set Ω .Proof. Suppose first that R ≥ D . Then, Pr X ∼ µ λ [ k X k ≥ R ] = R B (0 ,R ) c e − f ( x ) − λ k x − proj Ω ( x ) k d x R Ω e − f ( x ) d x + R Ω c e − f ( x ) − λ k x − proj Ω ( x ) k d x ≤ ∆ R B (0 ,R ) c e − λ ( k x k − D ) d x Vol (Ω) ≤ ∆ Vol (Ω) − Z ∞ R u d − e − λ ( u − D ) d u = ∆ Vol (Ω) − d Vol ( B (0 , Z ∞ R u d − e − λ ( u − D ) d u ≤ ∆ d Vol ( B (0 , Vol ( B (0 , r )) D d − Z ∞ R − D ( u + D ) d − e − λ u d u ≤ ∆ d r d Z ∞ R − D (2 u ) d − e − λ u d u since u ≥ R − D ≥ D ≤ ∆ d r d d − Z ∞ λ ( R − D ) (2 vλ ) d − e − v r λ v d u ( v = 12 λ u ) ≤ ∆ d d − λ d r d Γ (cid:18) d λ ( R − D ) (cid:19) where Γ( s ; x ) is the incomplete Gamma function ≤ ∆ d − d d d (cid:18) λ ( R − D ) (cid:19) d e − λ ( R − D ) since for x ≥ s , Γ( s ; x ) ≤ sx s e − x , λ ≤ r d ≤ ∆ d − d d d (cid:18) ( R − D ) λd (cid:19) d e − λd ( R − D ) ! d ≤ (cid:16) c d e − √ λd ( R − D ) (cid:17) d since xe − x ≤ e − x ∀ x ≥ and λd ( R − D ) ≥ where in the last line, c d = ∆ d − d d d . If c d e − √ λd ( R − D ) ≥ , then, this does not provide a useful bound, and we canalways write Pr X ∼ µ λ [ k X k ≥ R ] ≤ ≤ c d e − √ λd ( R − D ) . On the other hand, if c d e − √ λd ( R − D ) ≤ , then we have Pr X ∼ µ λ [ k X k ≥ R ] ≤ c d e − √ λd ( R − D ) ! d ≤ c d e − √ λd ( R − D ) .Therefore, we can write: Pr X ∼ µ λ [ k X k ≥ R ] ≤ c d e − √ λd ( R − D ) ≤ c d e − RD − since λ ≤ r d ≤ D d ≤ max(1 , c d ) e e − RD . Moreover, in the case R ≤ D , we have max(1 , c d ) e e − RD ≥ ≥ Pr X ∼ µ λ [ k X k ≥ R ] . We thus showed the result with σ = max(1 , c d ) e . Note that although c d depends on d , it is bounded and converges to as d → ∞ , thus it does notinvolve any asymptotic dependence in d . ouble-Loop Unadjusted Langevin Algorithm Using this Lemma, we can now prove our convergence result for DL-MYULA (Theorem 13).
Proof.
Let denote µ k ≡ µ λ k the target distributions of the ULA iterations at outer iteration k ≥ , and µ init the initialdistribution. It is straightforward to show that the distributions µ k are L k -smooth with L k = L + λ k .The proof goes exactly the same way as for Theorem 10. We will show by induction that ∀ k ≥ , k ˜ µ k − µ k k TV ≤ u k e − k + r d Lr e − k where { u k } k ≥ is defined u = √ e (cid:16) W ( µ init , µ ∗ + C Ω d ) (cid:17) and the recurrence relation u k = 4 D √ eu k − + 4 D √ σ + 2 C Ω d ( √ e + 1) k + 2 √ d L + σ. For any k ≥ , we have: k ¯ µ k − µ k k TV ≤ p µ k ; µ k ) (Pinsker’s inequality) ≤ s W (˜ µ k − , µ k ) γ k n k + 2 L k dγ k ≤ W (˜ µ k − , µ k ) √ γ k n k + p L k dγ k ≤ W (˜ µ k − , µ k − ) √ γ k n k + W ( µ k − , µ ∗ ) √ γ k n k + W ( µ k , µ ∗ ) √ γ k n k + p L k dγ k (50)For the second and third term, we can use Lemma 12 and the values of λ k to show that ∀ k ≥ , W ( µ k , µ ∗ ) ≤ C Ω d e − k (51)For the first term, we use Lemma 14 with R = Dk together with the fact that Pr X ∼ ˜ µ k − ( k X k ≥ Dk ) = 0 thanks to theprojection step, and the light tail property of µ k to obtain W (˜ µ k − , µ k − ) ≤ D k k ˜ µ k − − µ k − k TV + 4 D k σe − k +1 . (52)By replacing (51) and (52) in (50), and using the recursion hypothesis for k ˜ µ k − − µ k − k TV , we obtain k ¯ µ k − µ k k TV ≤ D √ eu k − + 2 D √ σ + C Ω d ( √ e + 1) k + √ d L ! e − k + r d Lr e − k (53)Similarly as for DL-ULA, and using Lemma 19 we can show that k ˜ µ k − µ k k TV ≤ k ¯ µ k − µ k k TV + σe − k Thus, using the recurrence relation for u k , we have k ˜ µ k − µ k k TV ≤ u k e − k + r d Lr e − k (54)as required to show the induction property. The case for k = 1 is shown analogous to DL-ULA. ouble-Loop Unadjusted Langevin Algorithm Finally, in order to relate ˜ µ k to the target distribution µ ∗ , we use the result shown in (Bubeck et al., 2018) that k µ λ − µ ∗ k TV ≤ C ′ d √ λ for some constant C ′ > and ∀ λ < r d .We can easily show that the sequence { u k } k ≥ increasingly converges to the following limit: U = 8 eD + 4 D √ σ + 2 C Ω d ( √ e + 1) k + 2 √ d L + σ + 4 D s eD + 2 D √ σ + C Ω d ( √ e + 1) k + √ d L + σ O ( √ d ) . We thus have for all k ≥ : k ˜ µ k − µ ∗ k TV ≤ ( U + C ′ √ d ) e − k + r d Lr e − k Therefore, after K TV = log (cid:18) U + C ′ √ d, (cid:16) d Lr (cid:17) (cid:19) ǫ iterations, we have k ˜ µ k − µ ∗ k TV ≤ ǫ . The total number ofiterations required is N TV = ˜ O ( Ld . ǫ − ) .Finally, using W (˜ µ k , µ ∗ ) ≤ D k k ˜ µ k − µ ∗ k TV , we can obtain a similar convergence result, i.e., after K W =log D max (cid:18) U + C ′ √ d, (cid:16) d Lr (cid:17) (cid:19) ǫ iterations, we have W (˜ µ k , µ ∗ ) ≤ ǫ log ( K ) . The total number of iterations re-quired is N W = ˜ O ( Ld . ǫ − ))