CONSTRAINED ENSEMBLE LANGEVIN MONTE CARLO∗

ZHIYAN DING† AND QIN LI‡

Abstract.
The classical Langevin Monte Carlo method looks for i.i.d. samples from a target distribution by descending along the gradient of the target distribution. It is popular partially due to its fast convergence rate. However, the numerical cost is sometimes high because the gradient can be hard to obtain. One approach to eliminate the gradient computation is to employ the concept of "ensemble," where a large number of particles are evolved together so that neighboring particles provide gradient information to each other. In this article, we discuss two algorithms that integrate the ensemble feature into LMC, and the associated properties. There are two sides to our discovery:
• By directly surrogating the gradient using the ensemble approximation, we develop Ensemble Langevin Monte Carlo. We show that this method is unstable due to a potentially small denominator that induces high variance. We provide a counterexample to explicitly show this instability.
• We then change the strategy and enact the ensemble approximation to the gradient only in a constrained manner, to eliminate the unstable points. The algorithm is termed Constrained Ensemble Langevin Monte Carlo. We show that, with proper tuning, the surrogation takes place often enough to bring a reasonable numerical saving, while the induced error is still low enough for us to maintain the fast convergence rate, up to a controllable discretization and ensemble error.
Such a combination of the ensemble method and LMC sheds light on inventing gradient-free algorithms that produce i.i.d. samples almost exponentially fast.
Key words.
Langevin Monte Carlo, ensemble methods, variance, gradient free
AMS subject classifications.
1. Introduction.
Bayesian sampling is one of the core problems in Bayesian inference, and it sees wide applications in data assimilation and inverse problems [27, 1] that arise in remote sensing and imaging [20], atmospheric science and earth science [14], petroleum engineering [23, 24] and epidemiology [21]. The goal is to find i.i.d. samples, or approximately i.i.d. samples, from a probability distribution that encodes the information of an unknown parameter. Throughout the paper we denote
(1.1) $p(x) \propto e^{-f(x)}, \quad x \in \mathbb{R}^d$
the distribution function of the unknown parameter $x$, and we assume that $f(x)$ is $L$-smooth, meaning $\nabla f$ is Lipschitz continuous with $L$ being its Lipschitz constant: $|\nabla f(y) - \nabla f(x)| \le L|x - y|$.

The research on sampling is rich, and many algorithms have been proposed in the literature [26, 2, 8, 25]. One classical sampling approach is the celebrated Markov chain Monte Carlo (MCMC) [25, 28, 17, 9, 16]. This is a class of methods that sets the target distribution as the invariant measure of the Markov transition kernel, so that after many rounds of iteration, the sample can be viewed as drawn from the invariant measure. Since there are many ways to design the Markov chain, there are many subcategories of MCMC methods. Among them, the Langevin Monte Carlo (LMC) stands out for its simplicity and fast convergence rate.

∗Submitted to the editors DATE.
Funding:
Both authors acknowledge generous support from NSF-DMS 1750488, NSF-TRIPODS 2023239 and the Wisconsin Alumni Research Foundation.
†Department of Mathematics, University of Wisconsin-Madison, Madison, WI ([email protected]).
‡Department of Mathematics, University of Wisconsin-Madison, Madison, WI ([email protected]).
This manuscript is for review purposes only.
Z. DING, Q. LI
The key idea of LMC is to design a stochastic differential equation whose long-time equilibrium coincides with the target distribution. The samples are then drawn by following the trajectory of the (discretized) SDE. Typically the SDE converges exponentially fast, and thus the probability distribution of LMC samples, viewed as the discrete version of the SDE, also converges to the target distribution exponentially fast, up to a discretization error. The non-asymptotic convergence rate for these methods and their variations was recently made rigorous in [4, 5, 11, 10, 12] for log-concave probability distribution functions (or equivalently, for convex $f(x)$).

One key drawback of LMC is that it requires the frequent calculation of the gradients. For each sample, at each iteration, one needs to compute at least one full gradient. For a problem in $\mathbb{R}^d$, this is a calculation of $d$ partial derivatives per sample per iteration, and in the case when $d \gg 1$, the cost is rather high. Therefore, in the most practical setting, one looks for substitutes of LMC that achieve the "gradient-free" property, so that the number of partial derivative computations is relaxed [7, 29].

Another sampling strategy that is completely parallel to the MCMC method is the ensemble-type method. Unlike MCMC, or LMC in particular, ensemble methods evolve a large number of samples altogether, and these samples interplay with each other. A Fokker-Planck type PDE is formulated to drive an arbitrarily given distribution toward the target distribution, and the ensemble methods can be viewed as particle methods applied to numerically evolve the PDE, with the ensemble distribution of the samples approximating the solution of the PDE. Two famous ensemble methods are Ensemble Kalman Inversion [19] and Ensemble Kalman Sampling [15]. Earlier works are found in [27, 13]. See also the numerical analysis and other follow-up works in [?, ?, 18].

The main drawbacks of ensemble methods are also obvious: the algorithms surrogate the statistical quantities with the ensemble version, and the numerical analysis is significantly harder. Numerically, at each time step, one needs to calculate these statistical quantities, a cost that is unseen in MCMC methods. There is, however, one factor of ensemble methods that can potentially bring a great benefit: since a lot of samples are evolved together on $\mathbb{R}^d$, it is easy to imagine that close neighbors of each sample can already approximately provide the gradient information. This may make gradient-free methods possible. Indeed, suppose one has a large number of particles, sampled from a certain probability distribution, in a small neighborhood of a sample $x^*$; then taking the average of the finite differences between these particles can give a rather good estimate of the gradient $\nabla f(x^*)$ to be used in LMC.

This idea was already explored in EKS, where the authors inserted a variance term in the underlying SDE of LMC, and by combining the gradient term with the variance term, they form a covariance that requires no gradient computation. However, such a strategy holds true either if the forward map is linear, or if the samples are all controllably close to each other, neither of which is valid in real practice. Nevertheless, such exploration sets a stepping stone for designing gradient-free methods under the ensemble framework.

To summarize, the non-asymptotic convergence rate of LMC is thoroughly studied for a large class of nonlinear $f(x)$, while the validity of ensemble methods is generally lacking. On the other hand, LMC requires the computation of gradients, but the strategy of evolving a large number of samples, as is done in the ensemble methods, can potentially eliminate the gradient computation.

It is thus natural to ask if it is possible to bring together the two approaches, for a new method that may inherit the advantages of both. To be specific, we look for an algorithm that requires as few gradient calculations as possible, while being able to sample from a relatively general target distribution $p(x)$ (almost) exponentially fast.
• We first study the most straightforward approach by directly replacing every gradient in LMC by the ensemble approximation. This is to evolve a large number of samples together, and for each sample at each iteration, we use neighbors to calculate the approximated gradients, as surrogates to the gradients in LMC. We term this method Ensemble LMC (EnLMC). However, this algorithm, despite being intuitive, will be shown to be unstable. Indeed, at the "outskirts" of $p(x)$, the approximated gradients can be very sensitive, suggesting that the replacement should not be enacted in these regions.
• We therefore propose an alternative, termed Constrained Ensemble LMC (CEnLMC). The constrained version of EnLMC enacts the ensemble approximation to the gradient only in the stable region, and for samples in the unstable region, we directly compute $\nabla f$. We can show that this method provides samples that are close to LMC samples, and thus converges to the target distribution at the same rate (exponential, up to a controllable error term). Furthermore, we present how the parameters in the constraints determine the stability of the algorithm and the chance of enacting ensemble approximations.

We stress that the method CEnLMC is not completely "gradient-free," since it enacts the ensemble approximation to replace the gradient computation only in the "stable" regions. However, the study conducted here presents a thorough understanding of fusing the concepts of ensemble methods and LMC. While it provides a possibility to reduce the gradient computation, it also embraces the fast convergence that can be achieved by LMC for nonlinear $f$.

We also mention that there are many means of approximating the gradients. We cannot claim the optimality of the ensemble approximation used in this article. It is highly possible that one can replace the gradients in LMC using other methods that explore information from neighboring ensemble samples in a more efficient way. That requires a more detailed study on multiple choices of ensemble approximation and is beyond the scope of the current paper. The current result is one of the pioneering attempts to integrate ensemble features into LMC, and it sheds light on inventing algorithms that both converge fast and are gradient-free.

The paper is organized as follows. In Section 2, we review the two main ingredients of our methods: the classical LMC, and the ensemble gradient approximation. In Section 3, we propose the two new methods and discuss their properties. More specifically, we will show that the brute-force combination of LMC and the ensemble gradient approximation leads to instability, but that the constrained version recovers the target distribution with a high numerical saving. The proofs are given in Section 4.
2. Two main ingredients.
The main ingredients of our method are the classical Langevin Monte Carlo and an ensemble approximation to the gradient. We review them in this section.
LMC is a very popular MCMC-type sampling method. Under mild conditions, it provides fast convergence: samples produced by LMC can be viewed as sampled from the target distribution after only a small number of iterations.

The classical LMC starts with a sample, denoted as $x^0$, and updates the sample
position according to:
(2.1) $x^{m+1} = x^m - \nabla f(x^m)\,h + \sqrt{2h}\,\xi^m_d$,
where $h$ is the time stepsize, $\xi^m_d$ is drawn i.i.d. from $N(0, I_d)$, and $I_d$ denotes the identity matrix of size $d \times d$. For a fixed small $h$, as $m \to \infty$, it is expected that $q^m$, the probability distribution of $x^m$, is close to $p$, the target distribution.

To intuitively understand the convergence of this algorithm, we can view the updating formula as the Euler-Maruyama discretization of the following SDE:
(2.2) $\mathrm{d}X_t = -\nabla f(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t$,
where $B_t$ is a $d$-dimensional Brownian motion. The SDE characterizes the trajectory of $X_t$ by two forcing terms: $\nabla f(X_t)\,\mathrm{d}t$ and $\mathrm{d}B_t$. While the first term $\nabla f$ drives $X_t$ to the minimum of $f$, the Brownian motion term introduces the fluctuation. Denote $q_0(x)$ the initial distribution from which $X_0$ is drawn, and $q(x,t)$ the probability density function of $X_t$; then it is a well-known result that $q(x,t)$ satisfies the following Fokker-Planck equation:
(2.3) $\partial_t q = \nabla \cdot (\nabla f\, q + \nabla q)$, with $q(x, 0) = q_0$.
It was shown in [22] that $q(x,t)$ converges to the target density function $p(x) \propto e^{-f}$ exponentially fast in time, meaning:
$\lim_{t \to \infty} X_t \sim p(x)$.
Considering that the updating formula for LMC (2.1) is merely a discretization of (2.2), then $x^m \approx X_{mh}$, and thus for large enough $m$, $q^m$, the distribution of $x^m$, should also be close to $p$. This was made rigorous recently in a number of papers [4, 5, 11, 10], most of which quantify the difference between $q^m$ and $p$ using the Wasserstein distance. To be more specific, it was shown in [5] that for strongly convex, gradient-Lipschitz $f$, to achieve $\epsilon$ accuracy in the Wasserstein-2 distance, the number of iterations needs to be $m \geq \widetilde{O}(d/\epsilon^2)$. Here the notation $\widetilde{O}$ hides a log factor.

We should note, however, that in each iteration of LMC, one local gradient needs to be computed, and this is equivalent to a calculation of $d$ partial derivatives per iteration. This essentially means a cost of $\widetilde{O}(d^2/\epsilon^2)$ is needed for one good sample. For a problem with high dimensionality $d \gg 1$, the cost is prohibitive. It would be desirable to combine this method with strategies that eliminate gradient computation for a gradient-free fast-converging sampling method.
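To make the iteration (2.1) concrete, here is a minimal Python sketch (our own illustration; the Gaussian target, step size, and sample counts are arbitrary choices, not taken from the paper):

```python
import numpy as np

def lmc(grad_f, x0, h, n_steps, rng):
    """Classical Langevin Monte Carlo, Eq. (2.1):
    x^{m+1} = x^m - h * grad_f(x^m) + sqrt(2h) * xi,  xi ~ N(0, I_d)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x - h * grad_f(x) + np.sqrt(2.0 * h) * xi
    return x

# Example: f(x) = |x|^2 / 2, so grad f(x) = x and p is the standard Gaussian.
rng = np.random.default_rng(0)
d = 2
samples = np.array([lmc(lambda x: x, np.zeros(d), h=0.05, n_steps=500, rng=rng)
                    for _ in range(200)])
print(samples.mean(axis=0))  # close to 0 for this Gaussian target
```

Running many independent chains as above produces approximately i.i.d. samples; the remaining bias is the $O(h)$ discretization error discussed in the text.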
Ensemble sampling methods have been gaining ground in recent years. The idea is to evolve a large number of samples altogether so that the samples can provide information to each other. In particular, if two samples are close to each other, the finite difference roughly provides approximate gradient information. There are various choices of using neighbors to find approximated gradients. We look for a probability ensemble in this article.

Suppose we look for an approximate gradient of $f$ at $x^* \in \mathbb{R}^d$ using its neighbors $x$ that are within $\eta$ distance, and assume the neighbor $x$ is drawn from an arbitrary probability density function $q(x)$, independent of $x^*$. Then, calling
(2.4) $\tilde{d}_{\eta,q}(x^*) = \alpha_d\,\dfrac{\langle \nabla f(x^*),\, x - x^* \rangle}{|x - x^*|^2}\,\dfrac{\mathbb{1}_{|x - x^*| \le \eta}}{q(x)}\,(x - x^*)$,
where $\alpha_d$ is the normalization constant:
(2.5) $\alpha_d = \dfrac{d}{V} = \dfrac{d}{S_d\,\eta^d}$, where $V = \displaystyle\int_{|x - x^*| \le \eta} \mathrm{d}x = \int_0^\eta r^{d-1} S_d\,\mathrm{d}r = \frac{\eta^d S_d}{d}$,
with $S_d$ being the surface area of the unit sphere in $\mathbb{R}^d$, we can formulate an ensemble gradient approximation:
(2.6) $\nabla f(x^*) = \mathbb{E}_q\big(\tilde{d}_{\eta,q}(x^*)\big)$.
The formula (2.6) is valid merely because:
$\nabla f(x^*) = \dfrac{d}{V} \displaystyle\int_{|x - x^*| \le \eta} \frac{(x - x^*) \otimes (x - x^*)}{|x - x^*|^2}\,\mathrm{d}x \cdot \nabla f(x^*) = \alpha_d \int_{|x - x^*| \le \eta} \frac{(x - x^*) \otimes (x - x^*)}{|x - x^*|^2}\,\mathrm{d}x \cdot \nabla f(x^*) = \alpha_d \int_{\mathbb{R}^d} \frac{\langle \nabla f(x^*),\, x - x^* \rangle}{|x - x^*|^2}\,\frac{\mathbb{1}_{|x - x^*| \le \eta}}{q(x)}\,(x - x^*)\, q(x)\,\mathrm{d}x = \mathbb{E}_q\big(\tilde{d}_{\eta,q}(x^*)\big)$.

One key idea of the ensemble gradient approximation is to realize that the inner-product term in $\tilde{d}_{\eta,q}$ can be approximated when $\eta$ is small, namely:
$\langle \nabla f(x^*),\, x - x^* \rangle \approx f(x) - f(x^*)$.
Replacing the $\langle \nabla f,\, x - x^* \rangle$ term in $\tilde{d}_{\eta,q}$ by the finite difference term, and defining
(2.7) $d_{\eta,q}(x^*) = \alpha_d\,\dfrac{f(x) - f(x^*)}{|x - x^*|^2}\,\dfrac{\mathbb{1}_{|x - x^*| \le \eta}}{q(x)}\,(x - x^*)$,
the gradient $\nabla f(x^*)$ then has a finite difference approximation, replacing (2.6):
(2.8) $\nabla f(x^*) \approx \mathbb{E}_q\big(d_{\eta,q}(x^*)\big) = \mathbb{E}_q\Big(\alpha_d\,\dfrac{f(x) - f(x^*)}{|x - x^*|^2}\,\dfrac{\mathbb{1}_{|x - x^*| \le \eta}}{q(x)}\,(x - x^*)\Big)$.
We can further justify the error in this approximation.
Suppose $\nabla f$ is Lipschitz continuous; then
(2.9) $|f(x) - f(x^*) - \langle \nabla f(x^*),\, x - x^* \rangle| \le \dfrac{L}{2}|x - x^*|^2 \le \dfrac{L\eta^2}{2}$,
and we have:
(2.10) $|\nabla f(x^*) - \mathbb{E}_q(d_{\eta,q}(x^*))| \le \mathbb{E}_q\big(|d_{\eta,q}(x^*) - \tilde{d}_{\eta,q}(x^*)|\big) = \mathbb{E}_q\Big(\Big|\alpha_d\,\dfrac{f(x) - f(x^*) - \langle \nabla f(x^*),\, x - x^* \rangle}{|x - x^*|^2}\,\dfrac{\mathbb{1}_{|x - x^*| \le \eta}}{q(x)}\,(x - x^*)\Big|\Big) \le \mathbb{E}_q\Big(\Big|\alpha_d\,\dfrac{L\eta}{2}\,\dfrac{\mathbb{1}_{|x - x^*| \le \eta}}{q(x)}\Big|\Big) \le \dfrac{L\eta d}{2}$.
This formula suggests that the approximation is first order in $\eta$, and the smallness of $\eta$ needs to dominate the largeness of $d$. We also stress that the derivation is valid only if the neighbors are distributed according to $q(x)$, a known distribution that is independent of $x^*$.

Suppose, in reality, we have $N$ independent particles around $x^*$, denoted as $\{x_j\}_{j=1}^N$, sampled from $q_j(x)$ respectively; then the ensemble gradient approximation formula is further reduced to:
(2.11) $\nabla f(x^*) \approx \dfrac{\alpha_d}{N} \displaystyle\sum_{j=1}^N \frac{f(x_j) - f(x^*)}{|x_j - x^*|^2}\,\frac{\mathbb{1}_{|x_j - x^*| \le \eta}}{q_j(x_j)}\,(x_j - x^*)$.
We note that the $q_j(x)$ do not have to be the same.
3. Algorithms and properties.
We propose our new methods in this section. The strategy is to sample a large number of particles according to LMC (2.1), and to replace the gradients in LMC using the ensemble gradient approximation (2.11). Then the samples are immediately no longer i.i.d., but they are still sampled from the same marginal distribution.

We discuss in Section 3.1 the straightforward combination of the two. We term the method the Ensemble LMC (EnLMC). However, we will find that the algorithm is rather unstable due to the gradient approximation in the unstable regions. This suggests that we enact the ensemble gradient approximation only in a constrained manner. The new algorithm, termed the Constrained Ensemble LMC (CEnLMC), will be discussed in Section 3.2, in which we provide a number of constraints and enact the ensemble gradient approximation only when these constraints are satisfied. The intuition behind how these constraints are formulated will also be discussed. The theoretical results will be summarized in Section 3.3.
We now study the direct combination of LMC and the ensemble gradient approximation. Denote $\{x^m_i\}_{i=1}^N$ the $N$ samples at the $m$-th iteration; then, following the LMC formula, we would like to write Ensemble LMC (EnLMC) in the form of:
(3.1) $x^{m+1}_i = x^m_i - hF^m_i + \sqrt{2h}\,\xi^m_i$,
with the force $F^m_i = \frac{1}{N-1}\sum_{j \neq i} F^m_{ij}$ approximating $\nabla f(x^m_i)$. Here $F^m_{ij}$ stands for the contribution of $x^m_j$ towards calculating $\nabla f(x^m_i)$.

Denote $\mathcal{F}^{m-1} = \sigma\big(x^n_j:\ n \le m-1,\ 1 \le j \le N\big)$ the filtration, and denote $p^m_j$ the conditional probability density of $x^m_j$ given $\mathcal{F}^{m-1}$. Then it can be seen that $x^m_i$ and $x^m_j$ are independent conditioned on $\mathcal{F}^{m-1}$. Replacing $x^*$ and $q(x)$ by $x^m_i$ and $p^m_j(x)$ respectively in (2.4), we define:
(3.2) $G^m_{ij} = \alpha_d\,\dfrac{\langle \nabla f(x^m_i),\, x^m_j - x^m_i \rangle}{|x^m_j - x^m_i|^2}\,\dfrac{x^m_j - x^m_i}{p^m_j}\,\mathbb{1}_{|x^m_j - x^m_i| \le \eta}$,
where $p^m_j = p^m_j(x^m_j)$. Then (2.6) still holds true, meaning, for all $j \neq i$:
(3.3) $\nabla f(x^m_i) = \mathbb{E}_{p^m_j}\big(G^m_{ij}\big) = \mathbb{E}\big(G^m_{ij}\,\big|\,\mathcal{F}^{m-1}, x^m_i\big)$.
Recalling the definition of $d_{\eta,q}$ in (2.7), we define
(3.4) $F^m_{ij} = \alpha_d\,\dfrac{\delta f^m_{ij}}{|\delta x^m_{ij}|^2}\,\dfrac{\delta x^m_{ij}}{p^m_j}\,\mathbb{1}_{|\delta x^m_{ij}| \le \eta}$, with $\delta f^m_{ij} = f(x^m_j) - f(x^m_i)$, $\delta x^m_{ij} = x^m_j - x^m_i$,
and, similar to (2.10),
(3.5) $\mathbb{E}\big(|G^m_{ij} - F^m_{ij}|\,\big|\,\mathcal{F}^{m-1}, x^m_i\big) \le \dfrac{L\eta d}{2}$.
Summing up the contributions from all $j \neq i$, we approximate $\nabla f(x^m_i)$ by:
(3.6) $\nabla f(x^m_i) \approx F^m_i = \dfrac{1}{N-1}\displaystyle\sum_{j \neq i} F^m_{ij}$.
We note that, according to (3.1), $p^m_j = p^m_j(x^m_j)$ can be explicitly calculated. Indeed, to update $x^m_j$ from $x^{m-1}_j$, we need $x^{m-1}_j$, $F^{m-1}_j$ and a random variable $\xi^{m-1}_j$. Realizing that, when conditioned on $\mathcal{F}^{m-1}$, both $x^{m-1}_j$ and $F^{m-1}_j$ are determined, the only randomness comes from the Gaussian variable $\xi^{m-1}_j$, meaning $x^m_j$ is merely a Gaussian variable as well when conditioned on $\mathcal{F}^{m-1}$:
$x^m_j \,\big|\, \mathcal{F}^{m-1} \sim N\big(x^{m-1}_j - hF^{m-1}_j,\ 2hI_d\big)$,
or in other words:
(3.7) $p^m_j(x) = \dfrac{1}{(4\pi h)^{d/2}} \exp\big(-|x - (x^{m-1}_j - hF^{m-1}_j)|^2/(4h)\big)$.
Plugging in the definition of $x^m_j$, we can compute $p^m_j$ explicitly:
(3.8) $p^m_j = \dfrac{1}{(4\pi h)^{d/2}} \exp\big(-|\xi^{m-1}_j|^2/2\big)$.

Remark 3.1. We emphasize that the density $p^m_j$ is conditioned on $\mathcal{F}^{m-1}$. If one uses (2.11) in a brute-force manner, including all randomness, then one arrives at
$F^m_{ij} = \alpha_d\,\dfrac{\delta f^m_{ij}}{|\delta x^m_{ij}|^2}\,\dfrac{\mathbb{1}_{|\delta x^m_{ij}| \le \eta}}{p^m(x^m_j)}\,\delta x^m_{ij}$,
where $p^m$ is the true distribution of $x^m_j$ without the conditioning. However, this definition of $F^m_{ij}$ cannot be used in the ensemble approximation: $x^m_i$ and $x^m_j$ are not independent of each other, and thus the ensemble average $\mathbb{E}_{p^m}(F^m_{ij})$ may not recover $\nabla f(x^m_i)$. More importantly, $p^m(x)$ is unknown in practice, making the calculation impossible.

We plug (3.8) into (3.6) and run (3.1) for the update. The method is termed Ensemble Langevin Monte Carlo (EnLMC), as presented in Algorithm 3.1.
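In Python, one iteration of this scheme, i.e. (3.9)-(3.10) together with the density bookkeeping (3.8), can be sketched as follows (a schematic illustration only; the toy target and the initial densities below are our own choices):

```python
import numpy as np
from math import pi, gamma

def enlmc_step(x, p_cond, f, h, eta, rng):
    """One EnLMC iteration (Algorithm 3.1).
    x      : (N, d) particles x_i^m
    p_cond : (N,) conditional densities p_j^m(x_j^m), cf. (3.8)
    Returns updated particles and new conditional densities."""
    N, d = x.shape
    S_d = 2.0 * pi ** (d / 2) / gamma(d / 2)
    alpha_d = d / (S_d * eta ** d)                      # normalization (2.5)
    fx = np.array([f(xp) for xp in x])
    F = np.zeros_like(x)
    for i in range(N):
        dx = x - x[i]                                   # delta x_ij
        r = np.linalg.norm(dx, axis=1)
        mask = (r > 0) & (r < eta)
        contrib = np.zeros_like(x)
        contrib[mask] = (alpha_d * (fx[mask] - fx[i])
                         / (r[mask] ** 2 * p_cond[mask]))[:, None] * dx[mask]
        F[i] = contrib.sum(axis=0) / (N - 1)            # force (3.9); note p_j^m in the denominator
    xi = rng.standard_normal((N, d))
    x_new = x - h * F + np.sqrt(2.0 * h) * xi           # update (3.10)
    p_new = (4.0 * pi * h) ** (-d / 2) * np.exp(-np.sum(xi ** 2, axis=1) / 2.0)
    return x_new, p_new

# Toy run with f(x) = |x|^2/2 in d = 1; initial densities from the N(0,1) draw.
rng = np.random.default_rng(2)
N, d, h, eta = 50, 1, 0.01, 0.3
x = rng.standard_normal((N, d))
p = np.exp(-0.5 * np.sum(x ** 2, axis=1)) / np.sqrt(2.0 * pi)
x, p = enlmc_step(x, p, lambda z: 0.5 * float(np.dot(z, z)), h, eta, rng)
```

As the discussion below shows, this naive scheme is unstable: the densities $p^m_j$ sit in the denominator of the force, so a single small $p^m_j$ can blow the update up.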
Algorithm 3.1 Ensemble Langevin Monte Carlo (EnLMC)
Preparation:
1. Input: $h$ (time stepsize); $N$ (particle number); $\eta$ (parameter); $d$ (dimension); $M$ (stopping index); $f(x)$.
2. Initial: $\{x^0_i\}_{i=1}^N$ i.i.d. sampled from an initial distribution $q_0(x)$.
Run: For $m = 0, 1, \cdots, M$:
 For $i = 1, 2, \cdots, N$:
 – Define
 (3.9) $F^m_i = \dfrac{1}{N-1}\displaystyle\sum_{j \neq i} F^m_{ij}$, with $F^m_{ij} = \alpha_d\,\dfrac{\delta f^m_{ij}}{|\delta x^m_{ij}|^2}\,\dfrac{\mathbb{1}_{|\delta x^m_{ij}| < \eta}}{p^m_j}\,\delta x^m_{ij}$;
 – Draw $\xi^m_i$ from $N(0, I_d)$;
 – Update
 (3.10) $x^{m+1}_i = x^m_i - hF^m_i + \sqrt{2h}\,\xi^m_i$, $\quad p^{m+1}_i = \dfrac{1}{(4\pi h)^{d/2}}\exp\big(-|\xi^m_i|^2/2\big)$.
 end
end
Output: $\{x^M_i\}_{i=1}^N$.

The design of this algorithm follows straightforwardly from intuition: one replaces the gradient in LMC by the ensemble approximation using the neighbors' information. Since the difference between the true gradient and the ensemble approximation shrinks to zero as $\eta$, the neighboring range, vanishes, one may be inclined to conclude that this method converges as well, as long as $\eta$ is small enough.

However, this is not true. This ensemble surrogate of the gradient induces strong instability in the algorithm. Indeed, $\xi^m_j$ is a Gaussian variable, and for every fixed $\epsilon$, there is a non-trivial probability that $p^m_j(x^m_j) < \epsilon$, which blows up the force term (3.9). We explicitly show this instability using the following example with $d = 1$ and $f(x) = x^2/2$.

Theorem 3.2.
Assume $\{x^m_i\}_{i=1}^N$ are generated from Algorithm 3.1. Then, for $d = 1$ and $f(x) = x^2/2$, we have, for any $m > 0$, $1 \le i \le N$:
(3.11) $\mathbb{E}|x^m_i|^2 = \infty$.

This negative example suggests that directly replacing the gradient by the ensemble approximation leads to an unstable method.

We leave the proof to Section 4.1, but quickly discuss the intuition of the proof here. Indeed, to compute the variance of the $x^{m+1}$ term, $\mathbb{E}|x^{m+1}_i|^2$, it is necessary to compute the variance of the force term, $\mathbb{E}\big(|F^m_{ij}|^2\big)$. The trajectory of $\{x_i\}_{i=1}^N$ is hard to trace, but one can nevertheless compute the conditional variance, conditioned on $\mathcal{F}^{m-1}$:
(3.12) $\mathbb{E}\big(|F^m_{ij}|^2\,\big|\,\mathcal{F}^{m-1}\big) = \displaystyle\int |F^m_{ij}|^2\, p^m_j(x^m_j)\, p^m_i(x^m_i)\,\mathrm{d}x^m_j\,\mathrm{d}x^m_i$,
where $p^m_i$ is the conditional probability distribution given $\mathcal{F}^{m-1}$.
Recalling the definition of $F^m_{ij}$ in (3.9), for $f(x) = |x|^2/2$ we have:
(3.13) $F^m_{ij} = \dfrac{1}{2\eta}\,\dfrac{(x^m_j + x^m_i)(x^m_j - x^m_i)}{2|x^m_j - x^m_i|^2}\,\dfrac{\mathbb{1}_{|\delta x^m_{ij}| < \eta}}{p^m_j(x^m_j)}\,(x^m_j - x^m_i) = \dfrac{x^m_j + x^m_i}{4\eta}\,\dfrac{\mathbb{1}_{|\delta x^m_{ij}| < \eta}}{p^m_j(x^m_j)}$.
At the same time, denoting $w^m_i = x^{m-1}_i - hF^{m-1}_i$ the deterministic part of the update for $x^m_i$, we know that, for all $i$:
(3.14) $x^m_i - w^m_i = \sqrt{2h}\,\xi^{m-1}_i \sim N(0, 2h) \ \Rightarrow\ p^m_i(x) = \dfrac{1}{(4\pi h)^{1/2}}\exp\Big(-\dfrac{|x - w^m_i|^2}{4h}\Big)$.
Plugging (3.13) and (3.14) into (3.12), we have:
(3.15) $\mathbb{E}\big(|F^m_{ij}|^2\,\big|\,\mathcal{F}^{m-1}\big) = \displaystyle\int_{\mathbb{R}} \int_{B_\eta(x^m_i)} \frac{(x^m_j + x^m_i)^2}{16\eta^2}\,\exp\Big(\frac{-|x^m_i - w^m_i|^2 + |x^m_j - w^m_j|^2}{4h}\Big)\,\mathrm{d}x^m_j\,\mathrm{d}x^m_i$.
Since the $p^m_j$ term is in the denominator of (3.13), it gets squared when one takes the variance, so in the end the exponential term from $x^m_j$ appears with a positive sign in (3.15). This already suggests the blow-up of this variance term. A more careful derivation shows:
(3.16) $\mathbb{E}\big(|F^m_{ij}|^2\,\big|\,\mathcal{F}^{m-1}\big) = \displaystyle\int_{\mathbb{R}} e^{-\frac{|x^m_i - w^m_i|^2}{4h}} \int_{B_\eta(0)} \frac{(z + 2x^m_i)^2}{16\eta^2}\, e^{\frac{|z + x^m_i - w^m_j|^2}{4h}}\,\mathrm{d}z\,\mathrm{d}x^m_i = \int_{B_\eta(0)} e^{\frac{-|w^m_i|^2 + |z - w^m_j|^2}{4h}} \int_{\mathbb{R}} \frac{(z + 2x^m_i)^2}{16\eta^2}\, e^{\frac{x^m_i(z + w^m_i - w^m_j)}{2h}}\,\mathrm{d}x^m_i\,\mathrm{d}z = \infty$.
In the second equality we used the change of variables $z = x^m_j - x^m_i$. The infinity comes from the inner integral, where we are essentially looking at the second moment of an exponential function.

This infinite variance of $F^m_{ij}$, calculated in (3.16), suggests that the variance of $x^{m+1}_i$, as claimed in (3.11), is also infinite. Proving Theorem 3.2 then amounts to carrying out the detailed derivation of how $\mathbb{E}|x^{m+1}_i|^2$ depends on $\mathbb{E}|F^m_{ij}|^2$, and we leave this to Section 4.1.

We now take a more careful look at the instability in the ensemble gradient approximation to LMC.
Intuitively, there are two sources of instability:
• When $x^m_i$ is at the "outskirts" of $p(x)$, $f(x^m_i)$ is high, and $p(x^m_i) \propto \exp\{-f(x^m_i)\}$ is extremely small. This could bring high relative error, and we should avoid making any approximations in this region.
• In the formula (3.6), $p^m_j(x^m_j)$ is in the denominator. Considering the way this term is defined in (3.8), it takes an $O(1)$ value with high probability when $\xi^{m-1}_j$ is moderately small. However, there is a small chance for $|\xi^{m-1}_j|$ to take large values, which will make $p^m_j(x^m_j)$ extremely small, bringing infinite variance, as shown in (3.16).
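The second failure mode can be isolated in a tiny experiment (our own illustration): by (3.8), $1/p^m_j \propto \exp(|\xi_j|^2/2)$ for a standard Gaussian $\xi_j$, and the second moment $\mathbb{E}\exp(|\xi_j|^2)$ is infinite, so the empirical second moment never settles as the sample size grows, consistent with (3.16):

```python
import numpy as np

rng = np.random.default_rng(3)
# Empirical second moment of exp(xi^2/2), i.e. of 1/p up to a constant,
# for xi ~ N(0,1). The true value E[exp(xi^2)] is infinite, so the
# estimates are dominated by the largest draw and typically keep growing.
means = []
for n in (10 ** 3, 10 ** 5, 10 ** 7):
    xi = rng.standard_normal(n)
    means.append(float(np.mean(np.exp(xi ** 2))))
    print(n, means[-1])
```

The estimate is driven by the single largest $|\xi|$, which grows like $\sqrt{2\log n}$; this is exactly the mechanism that makes the inner integral in (3.16) diverge.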
To avoid these two scenarios, we essentially need to identify:
• $x^m_i$ that are at the "outskirts" of $p$;
• $x^m_j$ that are within $\eta$ distance from $x^m_i$ but have a large $|\xi^{m-1}_j|$.
When these happen, the ensemble approximation is disabled and we fall back to computing the true gradient $\nabla f(x^m_i)$.

Identifying the first scenario is relatively straightforward: we simply set a threshold, call it $M_f$, and employ the ensemble gradient approximation only when $f(x^m_i)$ is smaller than $f^* + M_f$:
$f(x^m_i) < f^* + M_f$,
where $f^*$ is the optimal (minimum) value of $f$.

Identifying the second scenario is slightly more involved. We now consider
$\sqrt{2h}\,|\xi^{m-1}_j| = |x^m_j - w^m_j| \le |x^m_j - x^m_i| + |x^m_i - w^m_i| + |w^m_i - w^m_j| = |\delta x^m_{ij}| + \sqrt{2h}\,|\xi^{m-1}_i| + |\delta w^m_{ij}|$,
where we denote the deterministic component of the updating formula:
(3.17) $w^m_i = x^{m-1}_i - hF^{m-1}_i$,
and set $\delta w^m_{ij} = w^m_j - w^m_i$. Having a moderate $|\xi^{m-1}_j|$ calls for the smallness of all three terms on the right-hand side. For a fixed $x^m_i$, since we only consider $x^m_j$ that are already within $\eta$ distance, the first term is already bounded by $\eta$ and is small. We therefore need to ensure that the other two terms are bounded as well. To do so, we propose to enact the ensemble gradient approximation only if $|\xi^{m-1}_i|$ is at most moderately large, and, for those $x^m_i$, to include the $x^m_j$ contribution in the calculation of $F^m_i$ only if $|\delta w^m_{ij}|$ is at most moderately large. That is, for a fixed preset constant pair $(R_1, R_2)$:
• When $\sqrt{2h}\,|\xi^{m-1}_i| > R_1$:
(3.18) $F^m_i = \nabla f(x^m_i)$;
• When $\sqrt{2h}\,|\xi^{m-1}_i| \le R_1$:
(3.19) $F^m_i = \dfrac{1}{N^m_i}\displaystyle\sum_{j \neq i} F^m_{ij}$, with $F^m_{ij} = \alpha_d\,\dfrac{\delta f^m_{ij}}{|\delta x^m_{ij}|^2}\,\dfrac{\delta x^m_{ij}}{p^m_j}\,\mathbb{1}_{|\delta x^m_{ij}| \le \eta,\ |\delta w^m_{ij}| \le R_2}$,
where
(3.20) $N^m_i = \displaystyle\sum_{j \neq i} \mathbb{1}_{|\delta w^m_{ij}| \le R_2}$
is the number of neighbors within $\eta$ distance whose corresponding $|\delta w^m_{ij}|$ is controlled.

Note that, compared with (3.4), we add another indicator function in (3.19) to ensure that $\delta w^m_{ij}$ is controlled by $R_2$.
Furthermore, to have statistical stability numerically, we also preset a value $N^*$ and require $N^m_i \ge N^*$. If $N^m_i < N^*$, we do not enact the ensemble approximation and use the true gradient $\nabla f(x^m_i)$.

Summarizing the discussion above, we have:
(3.21) $F^m_i = \begin{cases} \nabla f(x^m_i), & \sqrt{2h}\,|\xi^{m-1}_i| > R_1 \ \text{or}\ f(x^m_i) > f^* + M_f \ \text{or}\ N^m_i < N^*, \\ \dfrac{1}{N^m_i}\displaystyle\sum_{j \neq i} F^m_{ij}, & \text{otherwise}. \end{cases}$
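In code, the selection rule (3.21) is a short predicate in front of the ensemble force. A schematic Python rendering (function and variable names such as `cenlmc_force` are ours, and the toy setup below is not from the paper):

```python
import numpy as np

def cenlmc_force(i, x, w, xi_prev, grad_f, f, f_star,
                 h, eta, R1, R2, Mf, N_star, alpha_d, p_cond):
    """Force selection of CEnLMC, Eq. (3.21): fall back to the exact
    gradient unless all stability constraints are satisfied."""
    # Constraint 1: the last Gaussian increment of particle i is moderate.
    if np.sqrt(2.0 * h) * np.linalg.norm(xi_prev[i]) > R1:
        return grad_f(x[i])
    # Constraint 2: particle i is not at the "outskirts" of p.
    if f(x[i]) > f_star + Mf:
        return grad_f(x[i])
    # Constraint 3: enough usable neighbors with |delta w_ij| <= R2, cf. (3.20).
    dw = np.linalg.norm(w - w[i], axis=1)
    usable = (dw <= R2) & (np.arange(len(x)) != i)
    N_i = int(usable.sum())
    if N_i < N_star:
        return grad_f(x[i])
    # Otherwise: constrained ensemble approximation, Eq. (3.19).
    dx = x[usable] - x[i]
    r = np.linalg.norm(dx, axis=1)
    near = (r > 0) & (r <= eta)
    df = np.array([f(xj) for xj in x[usable]]) - f(x[i])
    terms = np.zeros_like(dx)
    terms[near] = (alpha_d * df[near]
                   / (r[near] ** 2 * p_cond[usable][near]))[:, None] * dx[near]
    return terms.sum(axis=0) / N_i

# Toy setup (ours): f(x) = |x|^2/2, so grad f(x) = x and f* = 0.
rng = np.random.default_rng(4)
x = rng.standard_normal((10, 2)); w = x.copy()
xi_prev = np.zeros((10, 2)); p_cond = np.ones(10)
f = lambda z: 0.5 * float(np.dot(z, z))
F = cenlmc_force(0, x, w, xi_prev, lambda z: z, f, 0.0,
                 h=0.01, eta=0.3, R1=1.0, R2=1.0, Mf=1.0,
                 N_star=100, alpha_d=1.0, p_cond=p_cond)
print(np.allclose(F, x[0]))  # True: the constraints are not met, so it falls back to grad f
```

Tightening $(R_1, R_2, M_f, N^*)$ pushes the gate toward the exact gradient (classical LMC); loosening them turns the cheap ensemble branch on more often, which is exactly the cost/stability trade-off analyzed below.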
Algorithm 3.2 Constrained Ensemble Langevin Monte Carlo (CEnLMC)
Preparation:
1. Input: $h$ (time stepsize); $N$ (particle number); $\eta, R_1, R_2, N^*, M_f$ (parameters); $d$ (dimension); $M$ (stopping index); $\nabla f(x)$; $f(x)$; $f^*$ (minimal value).
2. Initial: $\{x^0_i\}_{i=1}^N$ i.i.d. sampled from an initial distribution $q_0(x)$. Set $w^0_i = \infty$ for $1 \le i \le N$.
Run: For $m = 0, 1, \cdots, M$:
 For $i = 1, 2, \cdots, N$:
 – Define $N^m_i = \displaystyle\sum_{j \neq i} \mathbb{1}_{|\delta w^m_{ij}| \le R_2}$ and compute $F^m_i$ according to (3.21);
 – Draw $\xi^m_i$ from $N(0, I_d)$;
 – Update $x^{m+1}_i$ and $p^{m+1}_i$ as in (3.10), and $w^{m+1}_i = x^m_i - hF^m_i$ as in (3.17).
 end
end
Output: $\{x^M_i\}_{i=1}^N$.
Our strategy is to show that the particles computed from CEnLMC are close to the particles computed from the classical LMC if they start from the same initial data. Since it is well known that the distribution of LMC samples converges to the target distribution, the samples found by CEnLMC recover the target distribution as $m \to \infty$ as well.

We first introduce the particle system that solves the classical LMC (2.1). Define $z^0_i = x^0_i$ for $1 \le i \le N$ and update
(3.24) $z^{m+1}_i = z^m_i - \nabla f(z^m_i)\,h + \sqrt{2h}\,\xi^m_i$,
where $\xi^m_i$ is the same as in (3.23). This is the classical LMC algorithm, and all samples $z_i$ are decoupled from each other. Our first goal is to show that $x^m_i$ and $z^m_i$ are approximately the same, as seen in the following theorem.

Theorem 3.3. Assume $\{x^m_i\}_{i=1}^N$ are generated from Algorithm 3.2, and $\{z^m_i\}_{i=1}^N$ are generated from (3.24), with the parameters chosen to satisfy
$h \le \min\Big\{\dfrac{1}{L}, \dfrac{1}{d}\Big\}, \quad \max\{\eta, \ldots\} \le R_2$.
Assume $f$ is $L$-smooth. Then, for $m \ge 0$, $1 \le i \le N$:
(3.25) $\mathbb{E}|x^m_i - z^m_i|^2 \le O\Bigg(\exp(Lmh)\Bigg[\sqrt{\dfrac{R_2^d\, M_f\, d}{L\,\eta^d\, N^*}\,\exp\Big(\dfrac{R_1(R_1 + R_2)}{2h}\Big)} + \eta d\Bigg]\Bigg)$.
If we further assume $f$ is $\mu$-convex, then, denoting $\kappa = L/\mu$, for any $m \ge 0$, $1 \le i \le N$:
(3.26) $\mathbb{E}|x^m_i - z^m_i|^2 \le O\Bigg(\sqrt{\dfrac{R_2^d\, \kappa\, M_f\, d}{\mu\,\eta^d\, N^*}\,\exp\Big(\dfrac{R_1(R_1 + R_2)}{2h}\Big)} + \kappa\eta d\Bigg)$.

We leave the proof to Section 4.2.

This theorem gives a bound on the distance between CEnLMC samples and LMC samples. By tuning the parameters wisely, we can make the two systems close to each other. Take the convex case for example. There are two terms in the bound. The second term mainly comes from the finite difference approximation, induced in (3.5), and the first term traces back to the ensemble error ($\mathbb{E}|\nabla f(x^m_i) - G^m_i|^2$). After adding the constraints (3.18)-(3.20), this error contributes the $1/\sqrt{N^*}$ term. This is optimal in terms of $N^*$ according to the central limit theorem.

To make the distance small, we first need $\eta$ to be small so that the error from the finite differencing is small. Upon choosing a small $\eta$, with $R_1$, $R_2$ fixed, we need to select a moderate $M_f/N^*$ to make the first term small. Since $M_f$ is the bound we set to turn on or off the ensemble gradient approximation, we expect it to be relatively large. $N^*$ is the minimum number of neighbors needed to enact the ensemble approximation to ensure statistical accuracy, and it is thus also expected to be large. To accommodate both, we set $M_f = (N^*)^\rho$ with $\rho < 1$.

Corollary 3.1. Under the same assumptions as in Theorem 3.3, and let $f$ be $\mu$-convex. For any small number $\epsilon > 0$ and $0 < \rho < 1$, by setting
(3.27) $M_f = (N^*)^\rho$, $\quad \eta < \dfrac{\epsilon}{\kappa d}$, $\quad N^* = \dfrac{R_2^{d/(1-\rho)}\,\kappa^{1/(1-\rho)}\, d^{1/(1-\rho)}}{\mu^{1/(1-\rho)}\,\eta^{d/(1-\rho)}\,\epsilon^{2/(1-\rho)}}\,\exp\Big(\dfrac{R_1(R_1 + R_2)}{2(1-\rho)h}\Big)$,
ONSTRAINED ENSEMBLE LANGEVIN MONTE CARLO we have: for any m ≥ , ≤ i ≤ N : (3.28) E | x mi − z mi | ≤ O ( ǫ ) . This is obtained by simply setting both terms in (3.26) smaller than ǫ . We omit theproof.Now we are ready to combine this result with the well-known convergence resultof LMC to show the convergence of CEnLMC. The convergence is discussed in bothWasserstein distance sense, and weak sense. Theorem
Under the same assumption as in Theorem 3.3 and let f be µ -convex, we denote κ = L/µ the condition number, q mi the probability density of x mi .Assume R | x | q d x < ∞ , we have: W convergence: For any m ≥ , ≤ i ≤ N , (3.29) W ( q mi , p ) ≤ exp (cid:18) − µhm (cid:19) W ( q , p )+ O κ ( √ hd + ηd ) + s R d κd M f µη d N ∗ exp (cid:18) R ( R + R )2 h (cid:19) . Weak convergence: For any Lipschitz function g : R d → R with E p ( g ) < ∞ and m ≥ , we have (3.30) E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N X i =1 g ( x mi ) − E p ( g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤O (cid:18) exp (cid:18) − µhm (cid:19) W ( q , p ) (cid:19) + O √ N + κ ( √ hd + ηd ) + s R d κd M f µη d N ∗ exp (cid:18) R ( R + R )2 h (cid:19) We leave the proof to Section 4.2. We note that in both (3.29) and (3.30),there is one exponentially decaying term, and the rest can be seen as the remainderterm. Therefore we can call the convergence rate exponential, up to a controllablediscretization and ensemble error. The exponentially decaying term comes from thefact that the distribution of z mi decays to the target distribution exponentially fast,and the remainder term mostly comes from the distance between { x mi } and { z mi } systems. We now discuss the numerical savingof CEnLMC compared with the classical LMC.The main reason to utilize the ensemble gradient approximation is to avoid thegradient computation. In the algorithm, the ensemble approximation is enacted onlyif: √ h | ξ m − i | ≤ R , f ( x mi ) ≤ f ∗ + M f , N mi ≥ N ∗ , where the size of N mi depends on the number of samples who satisfy | δw mij | ≤ R .Therefore the probability of not using the ensemble approximation (but using ∇ f )can be bounded by: This manuscript is for review purposes only. Z. DING, Q. 
$$(3.31)\qquad \mathbb{P}\left(\{F^m_i = \nabla f(x^m_i)\}\right) \le \mathbb{P}\left(\left\{\sqrt{2h}\,|\xi^{m-1}_i| > R_2\right\}\right) + \mathbb{P}\left(\{f(x^m_i) - f_* > M_f\}\right) + \mathbb{P}\left(\{N^m_i < N_*\}\right).$$

Theorem 3.5. With the same choice of parameters as in Corollary 3.1 (in particular the same $\rho$) and with $\epsilon = 1$, assuming $\int_{\mathbb{R}^d}|x|^2 q^0(x)\,\mathrm{d}x < \infty$, we have, for any $m > 0$, $1 \le i \le N$:

$$(3.32)\qquad \mathbb{P}\left(\left\{\sqrt{2h}\,|\xi^{m-1}_i| = |x^m_i - w^m_i| > R_2\right\}\right) \lesssim \int_{R_2/\sqrt{2h}}^{\infty} r^{d-1}\exp\left(-\frac{r^2}{2}\right)\mathrm{d}r\,,$$

$$(3.33)\qquad \mathbb{P}\left(\{f(x^m_i) - f_* > M_f\}\right) \lesssim \exp(-\mu hm)\,W(q^0,p) + \frac{1}{(R_2/\eta)^{d/2}\,d}\exp\left(-\frac{R_2(R_1+R_2)}{2h}\right),$$

$$(3.34)\qquad \mathbb{P}\left(\left\{|\delta w^m_{ij}| > R_1\right\}\right) \lesssim \exp(-\mu hm)\,W(q^0,p) + \frac{\kappa^{d/2} L^{d/2}}{(2\pi)^d R_1^2}\,.$$

Here we assume $\kappa$ and $\mu$ to be of $O(1)$. We leave the proof to Section 4.3.

This theorem gives bounds on the first two terms in (3.31); the estimate for the third term is only implicit. According to (3.32)-(3.33), if one has moderate $R_1$ and $R_2$, the first two probabilities are already small (since $d \gg 1$ and $h \ll 1$). With $R_1$ fixed, the probability $\mathbb{P}(\{|\delta w^m_{ij}| > R_1\})$ is fixed, and thus one needs to increase $N$ to obtain a low $\mathbb{P}(\{N^m_i < N_*\})$.

4. Proof of theoretical results.

4.1. Proof of Theorem 3.2. In this section, we prove Theorem 3.2. According to Algorithm 3.1, we have

$$x^m_i = x^{m-1}_i - hF^{m-1}_i + \sqrt{2h}\,\xi^{m-1}_i\,,$$

and the $\{\xi^{m-1}_i\}_{i=1}^N$ are i.i.d. Conditioned on the filtration $\mathcal{F}^{m-1}$, the distributions of $\{x^m_i\}_{i=1}^N$ are independent.

To prove the theorem, we need the following proposition:

Proposition 4.1. Assume $\{x^m_i\}_{i=1}^N$ are generated from Algorithm 3.1 with $F^m_i$ defined as in (3.9). Then for $f(x) = |x|^2/2$ we have: for any $m > 0$, $1 \le i \le N$,

$$(4.1)\qquad \mathbb{E}\left(|E^m_i|^2\right) = \mathbb{E}|F^m_i - \nabla f(x^m_i)|^2 = \infty\,.$$

Proof of Proposition 4.1.
Since $f(x) = |x|^2/2$, we have $f(x^m_j) - f(x^m_i) = \langle x^m_j + x^m_i,\,x^m_j - x^m_i\rangle/2$, so according to (3.13):

$$F^m_{i,j} = \alpha_d\,\frac{\langle x^m_j + x^m_i,\,\delta x^m_{ij}\rangle}{2\,|\delta x^m_{ij}|^2}\,\frac{\mathbb{1}_{|\delta x^m_{ij}|<\eta}}{p^m_j(x^m_j)}\,\delta x^m_{ij} = \frac{\alpha_d\,\delta x^m_{ij}}{2}\,\frac{\mathbb{1}_{|\delta x^m_{ij}|<\eta}}{p^m_j(x^m_j)} + \alpha_d\,\frac{\langle x^m_i,\,\delta x^m_{ij}\rangle\,\delta x^m_{ij}}{|\delta x^m_{ij}|^2}\,\frac{\mathbb{1}_{|\delta x^m_{ij}|<\eta}}{p^m_j(x^m_j)}\,,$$

where $\delta x^m_{ij} = x^m_j - x^m_i$. The two terms carry different information:

• The conditional expectation of the first term equals zero:

$$\mathbb{E}\left(\frac{\alpha_d\,\delta x^m_{ij}}{2}\,\frac{\mathbb{1}_{|\delta x^m_{ij}|<\eta}}{p^m_j(x^m_j)}\,\Bigg|\,\mathcal{F}^{m-1}\right) = \frac{\alpha_d}{2}\int\!\!\int_{|x^m_j - x^m_i|<\eta}(x^m_j - x^m_i)\,p^m_i(x^m_i)\,\mathrm{d}x^m_j\,\mathrm{d}x^m_i = 0\,.$$

• The second term is consistent with $\nabla f(x^m_i) = x^m_i$, meaning:

$$\mathbb{E}\left(\alpha_d\,\frac{\langle x^m_i,\,\delta x^m_{ij}\rangle\,\delta x^m_{ij}}{|\delta x^m_{ij}|^2}\,\frac{\mathbb{1}_{|\delta x^m_{ij}|<\eta}}{p^m_j(x^m_j)}\,\Bigg|\,\mathcal{F}^{m-1}, x^m_i\right) = \alpha_d\left(\int_{|u|<\eta}\frac{uu^\top}{|u|^2}\,\mathrm{d}u\right)x^m_i = x^m_i\,,$$

where we use that $x^m_j$ and $x^m_i$ are conditionally independent in the first equality, and the normalization of $\alpha_d$ in the second. These imply, for all $j \neq i$:

$$(4.2)\qquad \mathbb{E}\left(F^m_{i,j} - x^m_i\,\big|\,\mathcal{F}^{m-1}\right) = 0\,.$$

Furthermore, since the conditional distributions of $x^m_{j_1}$, $x^m_{j_2}$, $x^m_i$ are independent, for $j_1 \neq j_2$, $i \neq j_1$, and $i \neq j_2$:

$$(4.3)\qquad \begin{aligned}\mathbb{E}\left(\langle F^m_{i,j_1} - x^m_i,\,F^m_{i,j_2} - x^m_i\rangle\,\big|\,\mathcal{F}^{m-1}\right) &= \mathbb{E}\left(\mathbb{E}\left(\langle F^m_{i,j_1} - x^m_i,\,F^m_{i,j_2} - x^m_i\rangle\,\big|\,\mathcal{F}^{m-1}, x^m_i\right)\,\big|\,\mathcal{F}^{m-1}\right)\\ &= \mathbb{E}\left(\left\langle\mathbb{E}\left(F^m_{i,j_1} - x^m_i\,\big|\,\mathcal{F}^{m-1}, x^m_i\right),\,\mathbb{E}\left(F^m_{i,j_2} - x^m_i\,\big|\,\mathcal{F}^{m-1}, x^m_i\right)\right\rangle\,\Big|\,\mathcal{F}^{m-1}\right) = 0\,.\end{aligned}$$

Plugging (4.2) and (4.3) into $\mathbb{E}\left(|E^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right) = \mathbb{E}\left(|F^m_i - \nabla f(x^m_i)|^2\,\big|\,\mathcal{F}^{m-1}\right)$, we have

$$(4.4)\qquad \mathbb{E}\left(|E^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right) = \frac{1}{(N-1)^2}\sum_{j\neq i}\mathbb{E}\left(|F^m_{i,j} - x^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right) = \frac{1}{(N-1)^2}\sum_{j\neq i}\mathbb{E}\left(|F^m_{i,j}|^2\,\big|\,\mathcal{F}^{m-1}\right) - \frac{1}{N-1}\,\mathbb{E}\left(|x^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right),$$
where we use (4.3) in the second equality. Noting that in (3.16) we already showed

$$\mathbb{E}\left(|F^m_{i,j}|^2\,\big|\,\mathcal{F}^{m-1}\right) = \infty\,,$$

and that the second term in (4.4) is finite:

$$\mathbb{E}\left(|x^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right) = \left|x^{m-1}_i - hF^{m-1}_i\right|^2 + 2hd < \infty\,,$$

we obtain

$$\mathbb{E}\left(|F^m_i - \nabla f(x^m_i)|^2\,\big|\,\mathcal{F}^{m-1}\right) = \infty\,,$$

which proves (4.1), concluding this proposition.

Now, we are ready to prove Theorem 3.2.

Proof of Theorem 3.2. For each $m \ge 0$, $1 \le i \le N$, we consider

$$x^{m+1}_i = x^m_i - h\nabla f(x^m_i) + \sqrt{2h}\,\xi^m_i + hE^m_i\,,$$

where $E^m_i = \nabla f(x^m_i) - F^m_i$ denotes the deviation from the classical LMC formula. Using that $x^m_j$ and $x^m_i$ are conditionally independent for $i \neq j$, we obtain

$$(4.5)\qquad \begin{aligned}\mathbb{E}\left(\langle E^m_i,\,x^m_i - h\nabla f(x^m_i) + \sqrt{2h}\,\xi^m_i\rangle\,\big|\,\mathcal{F}^{m-1}\right) &= \mathbb{E}\left(\langle E^m_i,\,x^m_i - h\nabla f(x^m_i)\rangle\,\big|\,\mathcal{F}^{m-1}\right)\\ &= \mathbb{E}\left(\mathbb{E}\left(\langle E^m_i,\,x^m_i - h\nabla f(x^m_i)\rangle\,\big|\,\mathcal{F}^{m-1}, x^m_i\right)\,\big|\,\mathcal{F}^{m-1}\right)\\ &= \mathbb{E}\left(\left\langle\frac{1}{N-1}\sum_{j\neq i}\mathbb{E}\left(x^m_i - F^m_{i,j}\,\big|\,\mathcal{F}^{m-1}, x^m_i\right),\,x^m_i - h\nabla f(x^m_i)\right\rangle\,\Bigg|\,\mathcal{F}^{m-1}\right) = 0\,,\end{aligned}$$

where we use $\mathbb{E}\left(\xi^m_i\,\big|\,\mathcal{F}^{m-1}\right) = \mathbb{E}(\xi^m_i) = \vec{0}$ in the first equality and (4.2) in the last. Consequently,

$$\mathbb{E}\left(|x^{m+1}_i|^2\,\big|\,\mathcal{F}^{m-1}\right) = \mathbb{E}\left(\left|x^m_i - h\nabla f(x^m_i) + \sqrt{2h}\,\xi^m_i\right|^2\,\Big|\,\mathcal{F}^{m-1}\right) + h^2\,\mathbb{E}\left(|E^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right) \ge h^2\,\mathbb{E}\left(|E^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right),$$

where we use (4.5) in the first equality.
Finally, using the previous proposition, we have

$$\mathbb{E}\left(\mathbb{E}\left(|x^{m+1}_i|^2\,\big|\,\mathcal{F}^{m-1}\right)\right) \ge h^2\,\mathbb{E}\left(\mathbb{E}\left(|E^m_i|^2\,\big|\,\mathcal{F}^{m-1}\right)\right) = \infty\,,$$

which proves (3.11).

4.2. Proof of Theorem 3.3 and Theorem 3.4. We now analyze Algorithm 3.2, the Constrained Ensemble LMC. The strategy is to compare the evolution of $x^m_i$ with $z^m_i$, the solution to the classical LMC (3.24), before utilizing the convergence of $z^m_i$ to find the convergence of $x^m_i$. Theorem 3.3 discusses the closeness of $x^m_i$ and $z^m_i$, while Theorem 3.4 discusses the convergence of $x^m_i$. The following two subsections are dedicated to these two theorems respectively.

To show the smallness of $x^m_i - z^m_i$, we first rewrite the updating formula (3.23) for $x^m_i$ as

$$(4.6)\qquad x^{m+1}_i = x^m_i - h\nabla f(x^m_i) + hE^m_i + \sqrt{2h}\,\xi^m_i\,,$$

where

$$(4.7)\qquad E^m_i = \nabla f(x^m_i) - F^m_i\,.$$

Comparing with the updating formula (3.24) for $z^m_i$, it is easy to see that the key lies in bounding the term $E^m_i$. This is done in the following lemma.

Lemma 4.2. Under the same conditions as Theorem 3.3, we have: for any $m \ge 0$, $1 \le i \le N$,

$$(4.8)\qquad \mathbb{E}|E^m_i| \lesssim \sqrt{\frac{R_2^d\,L M_f\,d}{\eta^d N_*}}\,\exp\left(\frac{R_2(R_1+R_2)}{2h}\right) + L\eta d\,.$$

Theorem 3.3 is a direct consequence of this lemma.

Proof of Theorem 3.3. For each $m \ge 0$, $1 \le i \le N$, we subtract (3.24) from (4.6) to obtain

$$(4.9)\qquad \mathbb{E}\left|x^{m+1}_i - z^{m+1}_i\right| \le \mathbb{E}\left|(x^m_i - z^m_i) - h\left(\nabla f(x^m_i) - \nabla f(z^m_i)\right)\right| + h\,\mathbb{E}|E^m_i|\,.$$

Noting that $\nabla f$ is $L$-Lipschitz continuous, $|\nabla f(x^m_i) - \nabla f(z^m_i)| \le L|x^m_i - z^m_i|$, and thus

$$\left|(x^m_i - z^m_i) - h\left(\nabla f(x^m_i) - \nabla f(z^m_i)\right)\right| \le (1 + Lh)\,|x^m_i - z^m_i|\,.$$

We take the expectation and utilize Lemma 4.2:

$$\mathbb{E}\left|x^{m+1}_i - z^{m+1}_i\right| \le (1 + Lh)\,\mathbb{E}|x^m_i - z^m_i| + h\left(\sqrt{\frac{R_2^d\,L M_f\,d}{\eta^d N_*}}\,\exp\left(\frac{R_2(R_1+R_2)}{2h}\right) + L\eta d\right).$$
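The iteration that follows is the standard discrete Grönwall argument: a recursion of the form $a_{m+1} \le (1+Lh)\,a_m + hb$ with $a_0 = 0$ yields $a_m \le \left((1+Lh)^m - 1\right)b/L$. A quick numerical sanity check of this closed form, with hypothetical constants chosen only for illustration:

```python
# Sanity check of the discrete Gronwall step used to iterate
#   a_{m+1} <= (1 + L*h) * a_m + h*b,  with  a_0 = 0,
# whose closed form (with equality in the recursion) is
#   a_m = ((1 + L*h)**m - 1) * b / L.
import math

L, h, b = 2.0, 0.01, 0.5   # hypothetical constants, not from the paper
m = 100

a = 0.0
for _ in range(m):
    a = (1.0 + L * h) * a + h * b

closed_form = ((1.0 + L * h) ** m - 1.0) * b / L
assert math.isclose(a, closed_form, rel_tol=1e-10)
```

Taking $b$ to be the bracketed error term above reproduces the $(1+Lh)^m$ prefactor in the iterated bound.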
Using this formula iteratively, we have:

$$\mathbb{E}|x^m_i - z^m_i| \le (1 + Lh)^m\,\mathbb{E}|x^0_i - z^0_i| + (1 + Lh)^m\left(\sqrt{\frac{R_2^d\,M_f\,d}{L\,\eta^d N_*}}\,\exp\left(\frac{R_2(R_1+R_2)}{2h}\right) + \eta d\right).$$

Noting $x^0_i = z^0_i$, the first term is eliminated, and we conclude (3.25). When $f$ is $\mu$-convex,

$$\left\langle\nabla f(x^m_i) - \nabla f(z^m_i),\,x^m_i - z^m_i\right\rangle \ge \mu\,|x^m_i - z^m_i|^2\,,$$

so for $h$ small enough:

$$\left|(x^m_i - z^m_i) - h\left(\nabla f(x^m_i) - \nabla f(z^m_i)\right)\right| \le (1 - \mu h)\,|x^m_i - z^m_i|\,.$$

Running the same argument as above, and relaxing $(1 - \mu h)^m \le 1$, we conclude (3.26).

We now prove Lemma 4.2.

Proof of Lemma 4.2. We first define:

$$(4.10)\qquad G^m_i = \begin{cases}\nabla f(x^m_i)\,, & \sqrt{2h}\,|\xi^{m-1}_i| > R_2\ \text{ or }\ f(x^m_i) > f_* + M_f\ \text{ or }\ N^m_i < N_*\,,\\[4pt] \dfrac{1}{N^m_i}\displaystyle\sum_{j\neq i} G^m_{i,j}\,, & \text{otherwise}\,,\end{cases}$$

where

$$G^m_{i,j} = \alpha_d\,\frac{\langle\nabla f(x^m_i),\,\delta x^m_{ij}\rangle}{|\delta x^m_{ij}|^2}\,\frac{\mathbb{1}_{|\delta x^m_{ij}|\le\eta}\,\mathbb{1}_{|\delta w^m_{ij}|\le R_1}}{p^m_j(x^m_j)}\,\delta x^m_{ij}$$

is the counterpart of $F^m_{i,j}$ that eliminates the discretization error. Then

$$|E^m_i| = |\nabla f(x^m_i) - F^m_i| \le |\nabla f(x^m_i) - G^m_i| + |G^m_i - F^m_i|\,.$$

Clearly the term $|\nabla f(x^m_i) - G^m_i|$ is the ensemble error, and the term $|G^m_i - F^m_i|$ takes care of the discretization error.

To control $|G^m_i - F^m_i|$, we define the event

$$\Omega_i = \left\{N^m_i \ge N_*\right\}\cap\left\{f(x^m_i) - f_* \le M_f\right\}\cap\left\{\sqrt{2h}\,|\xi^{m-1}_i| \le R_2\right\},$$

then

$$(4.11)\qquad \mathbb{E}\left(|G^m_i - F^m_i|\,\big|\,\mathcal{F}^{m-1}\right) = \mathbb{E}\left(\mathbb{1}_{\Omega_i}\,|G^m_i - F^m_i|\,\big|\,\mathcal{F}^{m-1}\right) \le \mathbb{E}\left(\frac{\mathbb{1}_{\Omega_i}}{N^m_i}\sum_{j\neq i}\left|G^m_{i,j} - F^m_{i,j}\right|\,\Bigg|\,\mathcal{F}^{m-1}\right) \le \max_{1\le j\le N}\mathbb{E}\left(\left|G^m_{i,j} - F^m_{i,j}\right|\,\big|\,\mathcal{F}^{m-1}\right).$$

Plugging (3.5) into (4.11), we obtain

$$(4.12)\qquad \mathbb{E}\left(|G^m_i - F^m_i|\right) = \mathbb{E}\left(\mathbb{E}\left(|G^m_i - F^m_i|\,\big|\,\mathcal{F}^{m-1}\right)\right) \le L\eta d\,.$$

It remains to control $|G^m_i - \nabla f(x^m_i)|$.
We note

$$(4.13)\qquad \mathbb{E}\left(|G^m_i - \nabla f(x^m_i)|^2\right) = \mathbb{E}\left(\mathbb{E}\left(\mathbb{1}_{\Omega_i}\,|G^m_i - \nabla f(x^m_i)|^2\,\big|\,\mathcal{F}^{m-1}\right)\right).$$

Define $E^m_{i,j} = G^m_{i,j} - \nabla f(x^m_i)\,\mathbb{1}_{|\delta w^m_{ij}|\le R_1}$, so that

$$(4.14)\qquad \mathbb{E}\left(|G^m_i - \nabla f(x^m_i)|^2\right) = \mathbb{E}\left(\mathbb{E}\left(\mathbb{1}_{\Omega_i}\,\Bigg|\frac{1}{N^m_i}\sum_{j\neq i} E^m_{i,j}\Bigg|^2\,\Bigg|\,\mathcal{F}^{m-1}\right)\right).$$

Here in (I) we used $\frac{1}{2L}|\nabla f(x^m_i)|^2 \le f(x^m_i) - f_* < M_f$; in (II) we used the change of variables $y = x^m_i$, $z = x^m_j - x^m_i$. In (III), we used:

$$\begin{aligned}\exp\left(\frac{|y + z - w^m_j|^2 - |y - w^m_i|^2}{4h}\right) &= \exp\left(\frac{|z + w^m_i - w^m_j|^2}{4h} + \frac{\langle y - w^m_i,\,z + w^m_i - w^m_j\rangle}{2h}\right)\\ &\le \exp\left(\frac{|z|^2 + |w^m_i - w^m_j|^2 + 2|z||w^m_i - w^m_j|}{4h} + \frac{|y - w^m_i|\left(|z| + |w^m_i - w^m_j|\right)}{2h}\right).\end{aligned}$$

Plugging (4.16) into (4.14), we have

$$(4.17)\qquad \mathbb{E}\left(|G^m_i - \nabla f(x^m_i)|^2\right) \lesssim \frac{R_2^d\,d\,L M_f}{N_*\,\eta^d}\,\exp\left(\frac{\eta^2 + 2(\eta R_1 + \eta R_2 + R_1 R_2) + R_2^2}{4h}\right).$$

Using $\eta < R_2$ and Hölder's inequality, we have

$$\mathbb{E}\left(|G^m_i - \nabla f(x^m_i)|\right) \le \left(\mathbb{E}\left(|G^m_i - \nabla f(x^m_i)|^2\right)\right)^{1/2} \lesssim \sqrt{\frac{R_2^d\,d\,L M_f}{N_*\,\eta^d}}\,\exp\left(\frac{R_2(R_1+R_2)}{2h}\right).$$

Combining it with (4.12), we prove (4.8).

The validity of Theorem 3.4 is built upon the fact that the $x^m_i$ system and the $z^m_i$ system are close, as shown above, and that the $z^m_i$ system follows LMC, which converges to the target distribution. It is a classical result that the LMC solution converges to the target distribution. To show this, one constructs another particle system drawn from the target distribution. Let $y^0$ be a random vector drawn from the target distribution induced by $p$, and set

$$(4.18)\qquad y_i(t) = y^0_i - \int_0^t\nabla f(y_i(s))\,\mathrm{d}s + \sqrt{2}\int_0^t\mathrm{d}B_i(s)\,,$$

where we construct the Brownian motion so that

$$(4.19)\qquad B_i(h(m+1)) - B_i(hm) = \sqrt{h}\,\xi^m_i\,.$$

Then $y_i(t)$ is drawn from the distribution induced by $p$ for all $t$.
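For intuition, consider the 1D Gaussian example $f(x) = x^2/2$, where the diffusion above is an Ornstein–Uhlenbeck process: its exact flow preserves the unit target variance, while the Euler–Maruyama (LMC) step $x \mapsto (1-h)x + \sqrt{2h}\,\xi$ equilibrates at variance $1/(1-h/2)$, an $O(h)$ stationary bias in line with the $O(\kappa\sqrt{hd})$ remainder. A minimal numerical check of the two variance recursions (a sketch for intuition only, not the paper's coupling argument):

```python
# Variance recursions for f(x) = x^2/2 in 1D:
#   exact OU flow over time h:          v <- exp(-2h)*v + (1 - exp(-2h))  (fixed point 1)
#   LMC step x <- (1-h)x + sqrt(2h)*xi: v <- (1-h)^2 * v + 2h             (fixed point 1/(1 - h/2))
import math

h = 0.1
v_exact = v_lmc = 4.0     # same non-stationary initial variance for both
for _ in range(500):
    v_exact = math.exp(-2 * h) * v_exact + (1 - math.exp(-2 * h))
    v_lmc = (1 - h) ** 2 * v_lmc + 2 * h

assert abs(v_exact - 1.0) < 1e-12                  # exact flow: no stationary bias
assert abs(v_lmc - 1.0 / (1.0 - h / 2)) < 1e-12    # LMC: O(h) stationary bias
```

Both recursions contract exponentially toward their fixed points, which mirrors the exponential decay term in (4.21); only the discretized chain carries the step-size-dependent bias.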
On the discrete level, let $y^m_i = y_i(hm)$; then:

$$(4.20)\qquad y^{m+1}_i = y^m_i - \int_{mh}^{(m+1)h}\nabla f(y_i(s))\,\mathrm{d}s + \sqrt{2h}\,\xi^m_i\,.$$

Since $y^m_i \sim p(x)$, we have

$$W(q^m_i, p) \le \mathbb{E}|x^m_i - y^m_i|\,,$$

where $\mathbb{E}$ takes all randomness into account. Choose the initial data $y^0$ so that $W(q^0, p) = \mathbb{E}|x^0_i - y^0_i|$. Then the problem boils down to showing that $x^m_i$ is close to $y^m_i$. Since we know already that $x^m_i$ and $z^m_i$ are close, we now need to show the closeness between $z$ and $y$. This classical result regarding the convergence of LMC was shown in [3, 5], and we cite it here for the completeness of the paper (with notations adjusted to our setting).

Proposition 4.3 (closeness of $z$ and $y$). Assume the conditions of Theorem 3.3, and let $f$ be $L$-smooth and $\mu$-convex with $\kappa = L/\mu$. We have: for any $m \ge 0$, $1 \le i \le N$,

$$(4.21)\qquad \mathbb{E}|z^m_i - y^m_i| \le \exp(-\mu hm)\,W(q^0, p) + O\left(\kappa\sqrt{hd}\right).$$

We leave the proof to Appendix A. We should emphasize that this result is essentially the same as the one in [5, 11, 6]; the only difference is that we use the $L^1$ norm for bounding $z^m_i - y^m_i$, for consistency with the result in Theorem 3.3.

Now, we are ready to prove Theorem 3.4.

Proof of Theorem 3.4. Combining Theorem 3.3 and Proposition 4.3 by adding (3.26) and (4.21) through the triangle inequality, we obtain

$$(4.22)\qquad \mathbb{E}|x^m_i - y^m_i| \le \mathbb{E}|x^m_i - z^m_i| + \mathbb{E}|z^m_i - y^m_i| \le \exp(-\mu hm)\,W(q^0, p) + O\left(\kappa(\sqrt{hd} + \eta d) + \sqrt{\frac{R_2^d\,\kappa d\,M_f}{\mu\,\eta^d N_*}}\,\exp\left(\frac{R_2(R_1+R_2)}{2h}\right)\right).$$

Since $W(q^m_i, p) \le \mathbb{E}|x^m_i - y^m_i|$, we prove (3.29). To prove (3.30), we use

$$(4.23)\qquad \mathbb{E}\left|\frac{1}{N}\sum_{i=1}^N g(x^m_i) - \mathbb{E}_p(g)\right| \le \frac{1}{N}\sum_{i=1}^N\mathbb{E}\left|g(x^m_i) - g(y^m_i)\right| + \mathbb{E}\left|\frac{1}{N}\sum_{i=1}^N g(y^m_i) - \mathbb{E}_p(g)\right|.$$
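The second term in (4.23) is a plain Monte Carlo error of an average over $N$ independent samples from $p$, which decays like $1/\sqrt{N}$, as (4.25) makes precise. A quick empirical illustration with the hypothetical choices $g(y) = y$ and $p = \mathcal{N}(0,1)$, used here only as a sketch:

```python
# RMS error of (1/N) * sum g(y_i) for i.i.d. y_i ~ N(0,1) and g(y) = y:
# since E_p(g) = 0, the error is the sample mean itself, with RMS exactly 1/sqrt(N).
import math
import random

def rms_error(N, reps=500, seed=0):
    rng = random.Random(seed)
    sq = 0.0
    for _ in range(reps):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(N)) / N
        sq += mean * mean
    return math.sqrt(sq / reps)

# multiplying N by 25 should shrink the RMS error by about sqrt(25) = 5
ratio = rms_error(100) / rms_error(2500)
assert 4.0 < ratio < 6.0
```

The loose bracket around the ratio accounts for the Monte Carlo noise of the check itself; the scaling, not the constant, is the point.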
Using the Lipschitz continuity of $g$, the first term is easily controlled:

$$(4.24)\qquad \frac{1}{N}\sum_{i=1}^N\mathbb{E}\left|g(x^m_i) - g(y^m_i)\right| \le O\left(\frac{1}{N}\sum_{i=1}^N\mathbb{E}\left|x^m_i - y^m_i\right|\right).$$

Here the $O$ notation includes the Lipschitz constant of $g$. The second term of (4.23) is a standard central-limit-theorem estimate:

$$(4.25)\qquad \mathbb{E}\left|\frac{1}{N}\sum_{i=1}^N g(y^m_i) - \mathbb{E}_p(g)\right| \le \left(\mathbb{E}\left(\frac{1}{N}\sum_{i=1}^N g(y^m_i) - \mathbb{E}_p(g)\right)^2\right)^{1/2} \le \left(\frac{1}{N^2}\sum_{i=1}^N\mathbb{E}\left(g(y^m_i) - \mathbb{E}_p(g)\right)^2\right)^{1/2} \le O\left(\frac{1}{\sqrt{N}}\right).$$

Combining (4.22), (4.24) and (4.25) into (4.23), we prove the weak convergence (3.30).

4.3. Proof of Theorem 3.5. We prove Theorem 3.5 in this section.

Proof of Theorem 3.5. First, we prove (3.32). Noticing that for $m \ge 1$, $1 \le i \le N$,

$$x^m_i - w^m_i = \sqrt{2h}\,\xi^{m-1}_i\,,$$

which implies

$$\mathbb{P}\left(\{|x^m_i - w^m_i| > R_2\}\right) = \mathbb{P}\left(\left\{|\xi^{m-1}_i| > \frac{R_2}{\sqrt{2h}}\right\}\right) = \int_{|x| > R_2/\sqrt{2h}}\frac{1}{(2\pi)^{d/2}}\exp\left(-\frac{|x|^2}{2}\right)\mathrm{d}x = \frac{S_d}{(2\pi)^{d/2}}\int_{R_2/\sqrt{2h}}^{\infty} r^{d-1}\exp\left(-\frac{r^2}{2}\right)\mathrm{d}r \lesssim \int_{R_2/\sqrt{2h}}^{\infty} r^{d-1}\exp\left(-\frac{r^2}{2}\right)\mathrm{d}r\,,$$

where $S_d$ is the surface area of the unit sphere in $\mathbb{R}^d$.

Second, we can use $M_f = \sqrt{N_*}$ and $f(x^m_i) - f_* \le \frac{1}{2\mu}|\nabla f(x^m_i)|^2$ to obtain

$$(4.26)\qquad \begin{aligned}\mathbb{P}\left(\{f(x^m_i) - f_* > M_f\}\right) &= \mathbb{P}\left(\left\{f(x^m_i) - f_* > \sqrt{N_*}\right\}\right) \le \mathbb{P}\left(\left\{|\nabla f(x^m_i)| > \sqrt{2\mu M_f}\right\}\right)\\ &\le \mathbb{P}\left(\left\{|\nabla f(x^m_i) - \nabla f(y^m_i)| + |\nabla f(y^m_i)| > \sqrt{2\mu M_f}\right\}\right)\\ &\le \mathbb{P}\left(\left\{|\nabla f(x^m_i) - \nabla f(y^m_i)| > \frac{\sqrt{2\mu M_f}}{2}\right\}\right) + \mathbb{P}\left(\left\{|\nabla f(y^m_i)| > \frac{\sqrt{2\mu M_f}}{2}\right\}\right),\end{aligned}$$

where $y^m_i$ is defined in (4.18)-(4.20) and we use $|a| \le |a - b| + |b|$ in the second inequality.

The second term of (4.26) is easy to bound:

$$(4.27)\qquad \mathbb{P}\left(\left\{|\nabla f(y^m_i)| > \frac{\sqrt{2\mu M_f}}{2}\right\}\right) \le \frac{2}{\mu\sqrt{N_*}}\,\mathbb{E}\left(|\nabla f(y^m_i)|^2\right) \le \frac{2\kappa d}{\sqrt{N_*}}\,,$$

where we use $\mathbb{E}_p|\nabla f(y)|^2 \le Ld$ according to Lemma 3 in [5].

The first term can be bounded by

$$(4.28)\qquad \mathbb{P}\left(\left\{|\nabla f(x^m_i) - \nabla f(y^m_i)| > \frac{\sqrt{2\mu M_f}}{2}\right\}\right) \le \mathbb{P}\left(\left\{|x^m_i - y^m_i| > \frac{\sqrt{2\mu M_f}}{2L}\right\}\right)$$
$$\le \sqrt{\frac{\kappa L}{(N_*)^{1/2}}}\;\mathbb{E}\left(|x^m_i - y^m_i|\right),$$

where we use $|\nabla f(x^m_i) - \nabla f(y^m_i)| \le L|x^m_i - y^m_i|$ in the first inequality and Markov's inequality in the second. Combining Corollary 3.1 (3.28) and Proposition 4.3 (4.21), we obtain

$$\mathbb{E}\left(|x^m_i - y^m_i|\right) \lesssim \exp(-\mu hm)\,W(q^0, p) + 1\,,$$

where we use $\epsilon = 1$, $h \le 1/d$, and $\kappa \sim \mu \sim O(1)$. Plugging this into (4.28), we have

$$(4.29)\qquad \mathbb{P}\left(\left\{|\nabla f(x^m_i) - \nabla f(y^m_i)| > \frac{\sqrt{2\mu M_f}}{2}\right\}\right) \lesssim \exp(-\mu hm)\,W(q^0, p) + \frac{1}{(N_*)^{1/4}} \lesssim \exp(-\mu hm)\,W(q^0, p) + \frac{1}{(R_2/\eta)^{d/2}\,d}\exp\left(-\frac{R_2(R_1+R_2)}{2h}\right),$$

where we use $\kappa \sim \mu \sim O(1)$ and (3.27). Plugging (4.29) and (4.27) into (4.26), we prove (3.33).

Finally, we prove (3.34). Considering $w^m_i = x^{m-1}_i - hF^{m-1}_i$, we decompose

$$(4.30)\qquad \begin{aligned}\mathbb{P}\left(\left\{|w^m_j - w^m_i| > R_1\right\}\right) &\le \mathbb{P}\left(\left\{h|F^{m-1}_i - \nabla f(x^{m-1}_i)| > R_1/4\right\}\right) + \mathbb{P}\left(\left\{h|F^{m-1}_j - \nabla f(x^{m-1}_j)| > R_1/4\right\}\right)\\ &\quad + \mathbb{P}\left(\left\{\left|x^{m-1}_j - x^{m-1}_i - h\left(\nabla f(x^{m-1}_j) - \nabla f(x^{m-1}_i)\right)\right| > R_1/2\right\}\right).\end{aligned}$$

The first two terms can be bounded by

$$(4.31)\qquad \mathbb{P}\left(\left\{h|F^{m-1}_i - \nabla f(x^{m-1}_i)| > R_1/4\right\}\right) + \mathbb{P}\left(\left\{h|F^{m-1}_j - \nabla f(x^{m-1}_j)| > R_1/4\right\}\right) \le \frac{8h}{R_1}\,\mathbb{E}\left(|F^{m-1}_i - \nabla f(x^{m-1}_i)|\right) \lesssim \frac{h}{R_1}\,,$$

where we use Lemma 4.2 (4.8) in the last inequality. To bound the last term of (4.30), we rewrite it as

$$(4.32)\qquad \begin{aligned}\mathbb{P}\left(\left\{\left|x^{m-1}_j - x^{m-1}_i - h\left(\nabla f(x^{m-1}_j) - \nabla f(x^{m-1}_i)\right)\right| > R_1/2\right\}\right) &\le \mathbb{P}\left(\left\{|x^{m-1}_j - x^{m-1}_i| > R_1/2\right\}\right)\\ &\le \mathbb{P}\left(\left\{|x^{m-1}_j - y^{m-1}_j| > R_1/6\right\}\right) + \mathbb{P}\left(\left\{|x^{m-1}_i - y^{m-1}_i| > R_1/6\right\}\right) + \mathbb{P}\left(\left\{|y^{m-1}_j - y^{m-1}_i| > R_1/6\right\}\right).\end{aligned}$$

In the first inequality above, we used $\mu h < 1$ and that $f$ is $\mu$-convex.
Similar to (4.28)-(4.29), we can bound the first two terms:

$$(4.33)\qquad \mathbb{P}\left(\left\{|x^{m-1}_j - y^{m-1}_j| > R_1/6\right\}\right) + \mathbb{P}\left(\left\{|x^{m-1}_i - y^{m-1}_i| > R_1/6\right\}\right) \lesssim \exp(-\mu hm)\,W(q^0, p) + \frac{1}{R_1}\,.$$

Since $f$ is $L$-smooth and $\mu$-convex, we have

$$p(x) \le \frac{1}{(2\pi/L)^{d/2}}\exp\left(-\frac{\mu|x - x_*|^2}{2}\right),$$

where $x_*$ is the minimal point of $f$. Then the last term of (4.32) can be bounded by

$$\begin{aligned}\mathbb{P}\left(\left\{|y^{m-1}_j - y^{m-1}_i| > R_1/6\right\}\right) &= \int\!\!\int_{|y - z| > R_1/6} p(y)\,p(z)\,\mathrm{d}y\,\mathrm{d}z = \int\!\!\int_{|y - z| > R_1/6} p(y + x_*)\,p(z + x_*)\,\mathrm{d}y\,\mathrm{d}z\\ &\lesssim \frac{1}{R_1^2}\int\!\!\int\frac{|y - z|^2}{(2\pi/L)^d}\exp\left(-\frac{\mu(|y|^2 + |z|^2)}{2}\right)\mathrm{d}y\,\mathrm{d}z \lesssim \frac{\kappa^{d/2} L^{d/2}}{(2\pi)^d R_1^2}\,.\end{aligned}$$

In the first equality, we use that $y^{m-1}_j$ and $y^{m-1}_i$ are independent. These reduce (4.32) to:

$$\mathbb{P}\left(\left\{\left|x^{m-1}_j - x^{m-1}_i - h\left(\nabla f(x^{m-1}_j) - \nabla f(x^{m-1}_i)\right)\right| > R_1/2\right\}\right) \lesssim \frac{\kappa^{d/2} L^{d/2}}{(2\pi)^d R_1^2} + \exp(-\mu hm)\,W(q^0, p) + \frac{1}{R_1}\,.$$

Combining it with (4.31) and (4.30), we obtain (3.34).

Appendix A. Proof of Proposition 4.3. In this section, we prove Proposition 4.3. For convenience, we drop the index $i$ and define

$$\Delta^m = z^m - y^m\,.$$

It then suffices to prove the smallness of $\mathbb{E}|\Delta^m|$.

Proof of Proposition 4.3. We first divide $\Delta^{m+1}$ into several parts:

$$(\mathrm{A.1})\qquad \begin{aligned}\Delta^{m+1} &= \Delta^m + (y^{m+1} - y^m) - (z^{m+1} - z^m)\\ &= \Delta^m + \left(-\int_{mh}^{(m+1)h}\nabla f(y(s))\,\mathrm{d}s + \sqrt{2h}\,\xi^m\right) - \left(-\int_{mh}^{(m+1)h}\nabla f(z^m)\,\mathrm{d}s + \sqrt{2h}\,\xi^m\right)\\ &= \Delta^m - \int_{mh}^{(m+1)h}\left(\nabla f(y(s)) - \nabla f(z^m)\right)\mathrm{d}s\\ &= \Delta^m - \int_{mh}^{(m+1)h}\left(\nabla f(y(s)) - \nabla f(y^m) + \nabla f(y^m) - \nabla f(z^m)\right)\mathrm{d}s\\ &= \Delta^m - h\left(\nabla f(y^m) - \nabla f(z^m)\right) - \int_{mh}^{(m+1)h}\left(\nabla f(y(s)) - \nabla f(y^m)\right)\mathrm{d}s\\ &= \Delta^m - hU^m - V^m\,,\end{aligned}$$

where

$$U^m = \nabla f(y^m) - \nabla f(z^m)\,,\qquad V^m = \int_{mh}^{(m+1)h}\left(\nabla f(y(s)) - \nabla f(y^m)\right)\mathrm{d}s\,.$$

Now the first two terms of (A.1) can be bounded by

$$(\mathrm{A.2})\qquad |\Delta^m - hU^m| \le (1 - \mu h)\,|\Delta^m|\,,$$
since $f$ is $\mu$-convex. Next, for the remaining term on the right-hand side of (A.1), we first bound its $L^2$-norm:

$$(\mathrm{A.3})\qquad \begin{aligned}\mathbb{E}\left(|V^m|^2\right) &\overset{(\mathrm{I})}{\le} h\int_{mh}^{(m+1)h}\mathbb{E}\left(|\nabla f(y(s)) - \nabla f(y^m)|^2\right)\mathrm{d}s \overset{(\mathrm{II})}{\le} hL^2\int_{mh}^{(m+1)h}\mathbb{E}\left(|y(s) - y^m|^2\right)\mathrm{d}s\\ &= hL^2\int_{mh}^{(m+1)h}\mathbb{E}\left(\left|\int_{mh}^s\nabla f(y(t))\,\mathrm{d}t + \sqrt{2}\,(B(s) - B(mh))\right|^2\right)\mathrm{d}s\\ &\overset{(\mathrm{III})}{\le} 2h^2L^2\int_{mh}^{(m+1)h}\int_{mh}^s\mathbb{E}\left(|\nabla f(y(t))|^2\right)\mathrm{d}t\,\mathrm{d}s + 4hL^2\int_{mh}^{(m+1)h}(s - mh)\,d\,\mathrm{d}s\\ &\overset{(\mathrm{IV})}{=} h^4L^2\,\mathbb{E}_p|\nabla f|^2 + 2h^3L^2 d \overset{(\mathrm{V})}{\le} h^4L^3 d + 2h^3L^2 d\,,\end{aligned}$$

where (II) comes from the $L$-Lipschitz condition, (I) and (III) come from the use of Young's and Jensen's inequalities when we move $|\cdot|^2$ inside the integrals, (IV) holds true because $y(t) \sim p$ for all $t$, and in (V) we use $\mathbb{E}_p|\nabla f|^2 \le Ld$ by [5, Lemma 3].

Using Hölder's inequality and $h \le 1/L$, (A.3) implies

$$\mathbb{E}(|V^m|) \le \left(\mathbb{E}\left(|V^m|^2\right)\right)^{1/2} \le 2h^{3/2}L\,d^{1/2}\,.$$

Plugging this and (A.2) into (A.1), we obtain

$$\mathbb{E}\left(\left|\Delta^{m+1}\right|\right) \le \mathbb{E}(|\Delta^m - hU^m|) + \mathbb{E}(|V^m|) \le (1 - \mu h)\,\mathbb{E}(|\Delta^m|) + 2h^{3/2}L\,d^{1/2}\,.$$

Using this iteratively and $\mathbb{E}|\Delta^0| = \mathbb{E}|z^0 - y^0| = W(q^0, p)$, we prove (4.21).

Acknowledgments. Both authors acknowledge generous support from NSF-DMS 1750488, NSF-TRIPODS 2023239 and the Wisconsin Alumni Research Foundation.

REFERENCES

[1] C. Andrieu, N. Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[2] A. Beskos, A. Jasra, K. Law, R. Tempone, and Y. Zhou. Multilevel sequential Monte Carlo samplers. Stochastic Processes and their Applications, 127(5):1417–1440, 2017.
[3] N. Chatterji, N. Flammarion, Y. Ma, P. Bartlett, and M. Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo.
In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 764–773, 2018.
[4] A. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[5] A. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
[6] A. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. arXiv, abs/1807.09382, 2018.
[7] Z. Ding and Q. Li. Variance reduction for random coordinate descent-Langevin Monte Carlo. In Advances in Neural Information Processing Systems 33, 2020.
[8] A. Doucet, N. Freitas, and N. Gordon. An Introduction to Sequential Monte Carlo Methods, pages 3–14. 2001.
[9] S. Duane, A. Kennedy, B. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[10] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20:73:1–73:46, 2019.
[11] A. Durmus and É. Moulines. Non-asymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
[12] R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research, 20(183):1–42, 2019.
[13] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer-Verlag, 2006.
[14] P. Fabian. Atmospheric sampling. Advances in Space Research, 1(11):17–27, 1981.
[15] A. Garbuno-Inigo, F. Hoffmann, W. Li, and A. Stuart. Interacting Langevin diffusions: Gradient structure and Ensemble Kalman sampler. SIAM Journal on Applied Dynamical Systems, 19(1):412–441, 2020.
[16] S. Geman and D. Geman.
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6:721–741, 1984.
[17] W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[18] M. Herty and G. Visconti. Continuous limits for constrained Ensemble Kalman filter. Inverse Problems, 36(7):075006, 2020.
[19] M. Iglesias, K. Law, and A. Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29(4):045001, 2013.
[20] Q. Li and K. Newton. Diffusion equation-assisted Markov chain Monte Carlo methods for the inverse radiative transfer equation. Entropy, 21(3), 2019.
[21] R. Li, S. Pei, B. Chen, Y. Song, T. Zhang, W. Yang, and J. Shaman. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science, 368(6490):489–493, 2020.
[22] P. Markowich and C. Villani. On the trend to equilibrium for the Fokker-Planck equation: An interplay between physics and functional analysis. In Physics and Functional Analysis, Matematica Contemporanea (SBM) 19, pages 1–29, 1999.
[23] J. Martin, L. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
[24] N. Nagarajan, M. Honarpour, and K. Sampath. Reservoir-fluid sampling and characterization — key to efficient reservoir management. Journal of Petroleum Technology, 59, 2007.
[25] R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.
[26] R. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[27] S. Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2011.
[28] G. Roberts and J. Rosenthal.
General state space Markov chains and MCMC algorithms. Probability Surveys, 1, 2004.
[29] X. Tong, M. Morzfeld, and Y. Marzouk. MALA-within-Gibbs samplers for high-dimensional distributions with sparse conditional structure. SIAM Journal on Scientific Computing, 42(3):A1765–A1788, 2020.