Simulated annealing from continuum to discretization: a convergence analysis via the Eyring--Kramers law
WENPIN TANG AND XUN YU ZHOU
Abstract.
We study the convergence rate of continuous-time simulated annealing (X_t; t ≥ 0) and its discretization (x_k; k = 0, 1, ...) for approximating the global optimum of a given function f. We prove that the tail probability P(f(X_t) > min f + δ) (resp. P(f(x_k) > min f + δ)) decays polynomially in time (resp. in cumulative step size), and provide an explicit rate as a function of the model parameters. Our argument applies recent developments on functional inequalities for the Gibbs measure at low temperatures – the Eyring-Kramers law. In the discrete setting, we obtain a condition on the step size to ensure the convergence.

Key words: Simulated annealing, convergence rate, Euler discretization, Eyring-Kramers law, functional inequalities, overdamped Langevin equation.

1. Introduction
Simulated annealing (SA) is an umbrella term which denotes a set of stochastic optimization methods. The goal of SA is to find the global minimum of a function f : R^d → R, in particular when f is non-convex. These methods have many applications in physics, operations research and computer science, see e.g. van Laarhoven and Aarts (1987); Koulamas et al. (1994); Delahaye et al. (2019). The name is inspired by annealing in metallurgy, a process aiming to increase the size of crystals by heating and controlled cooling. The stochastic version of SA was independently proposed by Kirkpatrick et al. (1983) and Cerny (1985). The idea is as follows: consider a stochastic process related to f which is subject to thermal noise. When simulating this process, one decreases the temperature slowly over time. As a result, the stochastic process escapes from saddle points and local optima, and converges to the global minimum of f with high probability. This is generally true if the cooling is slow enough, and it is important to find the right stochastic process with the fastest possible cooling schedule which approximates the global optimum.

In this paper, we explore the convergence rate of continuous-time SA and its discretization. To be more precise, define the continuous-time SA process (X_t; t ≥ 0) by

dX_t = −∇f(X_t) dt + √(2τ_t) dB_t,  X_0 ∼ μ(dx),  (1)

where (B_t; t ≥ 0) is a d-dimensional Brownian motion, τ_t is the cooling schedule of temperature, and μ(dx) is some initial distribution. This formulation was first considered by Geman and Hwang (1986). If τ_t ≡ τ is constant in time, the process (1) is the well-known overdamped Langevin equation whose stationary distribution is the Gibbs measure ν_τ(dx) ∝ exp(−f(x)/τ) dx. Thus, we sometimes call the process (1) an SA adapted overdamped Langevin equation as well. See Section 3 for background. We also consider the Euler-Maruyama discretization of the continuous-time simulated annealing process. Let η_k be the step size at iteration k, and Θ_k := Σ_{j≤k} η_j be the cumulative step size up to iteration k. The discrete SA process (x_k; k = 0, 1, ...) is defined by

x_{k+1} = x_k − ∇f(x_k) η_k + √(2 τ_{Θ_k} η_k) Z_k,  x_0 ∼ μ(dx),  (2)

where (Z_k; k = 0, 1, ...) are independent and identically distributed standard normal vectors, and τ_{Θ_k} is the cooling schedule at iteration k. For τ_t ≡ τ constant in time, the scheme (2) is known as the unadjusted Langevin algorithm (ULA), which approximates the Gibbs measure ν_τ. The algorithm was introduced in Parisi (1981); Grenander and Miller (1994), and further studied by Roberts and Tweedie (1996); Durmus and Moulines (2017).

Date: February 10, 2021.

The goal of this paper is to study the decay in time of the tail probability

P(f(X_t) > min f + δ)  or  P(f(x_k) > min f + δ),

under suitable conditions on the function f, the cooling schedule τ_t, and the discretization scheme η_k, Θ_k. Let us mention a few motivations. First, there is a line of works on the interplay between sampling and optimization (Raginsky et al. (2017); Ma et al.
(2019a,b)). Note that if τ_t ≡ τ is constant in time, the overdamped Langevin equation converges to the Gibbs measure ν_τ, and for τ sufficiently small, the Gibbs measure ν_τ approximates the Dirac mass at the global minimum of f. Combining these two ingredients, SA sets the cooling schedule τ_t to decrease to 0 over time. It is then expected that the SA process (1) or (2) converges to the global minimum as t → ∞ or k → ∞. There are also recent works on various noisy gradient-based algorithms (Ge et al. (2015); Jin et al. (2017); Chen et al. (2020); Guo et al. (2020)), which aim to escape saddle points and find a local minimum of f. In comparison with these methods, SA has the advantage of finding the global minimum, but at the cost of longer exploration time.

The main tool in our analysis is the Eyring-Kramers law, which is a set of functional inequalities for the Gibbs measure at low temperatures. Let us explain our results. It was shown in Geman and Hwang (1986); Chiang et al. (1987) that the correct order of τ_t for the process (1) to converge to the global minimum of f is (ln t)^{−1}. In fact, there is a phase transition related to the critical depth E* of the function f:

(a) If τ_t ≤ E/ln t for t large enough with E < E*, then limsup_{t→∞} P(f(X_t) ≤ min f + δ) < 1.
(b) If τ_t ≥ E/ln t for t large enough with E > E*, then lim_{t→∞} P(f(X_t) ≤ min f + δ) = 1.

Roughly speaking, the critical depth E* is the largest hill one needs to climb starting from a local minimum to the global minimum. The formal definition of the critical depth E* will be given in Assumption 2, but see Figure 1 below for an illustration when f is a double-well function. Part (a) was proved by Holley et al. (1989) using a sophisticated argument that involves the Poincaré inequality. Part (b) was proved by Miclo (1992) who characterized the fastest cooling schedule by the Eyring-Kramers law for the log-Sobolev inequality. See also Miclo (1995); Zitt (2008); Fournier and Tardif (2020) for similar results under different conditions on the function f. The convergence rate of (f(X_t); t ≥ 0) to the global minimum of f was considered by Márquez (1997) only for δ sufficiently small. However, the rate of P(f(X_t) > min f + δ) for general δ > 0 has not been addressed.
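As a concrete illustration of the critical depth described above (a numerical sketch of ours, not from the paper; the tilted double-well f and the grid resolution are illustrative assumptions), note that in one dimension the lowest barrier between two minima is simply the maximum of f on the interval joining them, so E* can be read off a grid:

```python
import numpy as np

def f(x):
    return (x**2 - 1.0)**2 + 0.25 * x   # tilted double-well (illustrative)

xs = np.linspace(-2.0, 2.0, 200_001)
ys = f(xs)

# Grid points that are strict local minima of f.
interior = (ys[1:-1] < ys[:-2]) & (ys[1:-1] < ys[2:])
minima = xs[1:-1][interior]              # two minima, near x = -1 and x = +1
m_global = minima[np.argmin(f(minima))]  # global minimum (deeper well)
m_local = minima[np.argmax(f(minima))]   # non-global local minimum

# In one dimension, the saddle height between the two minima is the
# maximum of f on the interval joining them.
lo, hi = sorted((m_global, m_local))
saddle_height = ys[(xs >= lo) & (xs <= hi)].max()

# Critical depth: the largest hill seen from the non-global minimum.
E_star = saddle_height - f(m_local)
```

For this example the barrier from the shallow well is about 0.76, while the (irrelevant) barrier from the global well is larger; the critical depth is the former.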
Figure 1. Illustration of the critical depth of a double-well function.

To simplify the notation, we assume throughout this paper that

min_{R^d} f(x) = 0,

i.e. the global minimum of f is 0, by considering f − min f. Our results are outlined as follows, and the precise statement of these results will be given in Section 2.

Main Results (Informal). Under some assumptions on the function f, we have:

(i) Assume that τ_t ≥ E/ln t for t large enough with E > E*. For δ > 0, there exists C > 0 (depending on δ, f, E, d) such that

P(f(X_t) > δ) ≤ C t^{−min(δ/E, (1/2)(1−E*/E))}.

(ii) Assume that τ_t ≥ E/ln t for t large enough with E > E*. Also assume that Θ_k → ∞, η_{k+1} ln Θ_k → 0 and ln Θ_k / Σ_{j≤k} η_{j+1} Θ_j^{−E*/E} → 0 as k → ∞. For δ > 0, there exists C > 0 (depending on δ, f, E, d) such that

P(f(x_k) > δ) ≤ C Θ_k^{−min(δ/E, (1/2)(1−E*/E))}.

The contribution of this paper is twofold:

• Polynomial decay rate. We prove that the tail probability P(f(X_t) > δ) (resp. P(f(x_k) > δ)) decays polynomially in time (resp. in cumulative step size). In the continuous setting, Monmarché (2018) obtained the same rate of convergence for the SA adapted underdamped Langevin equation, and Menz et al. (2018) considered an improvement of SA via parallel tempering. However, the convergence rate for continuous-time SA, i.e. part (i), has not yet been recorded in the literature, though this result is probably understood in the Eyring-Kramers folklore. We provide a self-contained treatment which bridges the literature and, more importantly, can be applied to obtain the new result in the discrete setting.

• Choice of step size. Part (ii) for the discrete simulated annealing is, to our best knowledge, completely novel, and it also gives practical guidance on the choice of step size for discretization. The condition Θ_k → ∞ indicates that the step size cannot be chosen too small, while the condition η_{k+1} ln Θ_k → 0 indicates that the step size cannot be chosen too large.
The condition ln Θ_k / Σ_{j≤k} η_{j+1} Θ_j^{−E*/E} → 0 ensures that the contraction accumulated along the iterations dominates the slowly decreasing temperature. For instance, the step size η_k = k^{−θ} with θ ∈ (0, 1] satisfies the conditions in (ii) to ensure the convergence.

Also note that the rate min(δ/E, (1/2)(1−E*/E)) is smaller than 1/2. It is interesting to know whether this rate is optimal, and we leave the problem for future work.

There is another interesting problem: the dependence of the constant C on the dimension d. The issue is subtle, since most analysis, including the Eyring-Kramers law, uses Laplace's method. However, Laplace's method may fail if both the dimension d and the inverse temperature 1/τ tend to infinity (Shun and McCullagh (1995)). As shown in Remark 1, we obtain an upper bound for C which is exponential in d. This suggests the convergence rate is exponentially slow as the dimension increases, which concurs with the fact that finding the global minimum of a general nonconvex function is NP-hard (Jain and Kar (2017)).

Finally, we mention a few approaches to accelerate or improve SA. Fang et al. (1997) considered a cooling schedule depending on both time and state; Pavlyukevich (2007) used the Lévy flight; Monmarché (2018) studied SA adapted to the underdamped Langevin equation; Menz et al. (2018) applied the replica exchange technique; Gao et al. (2020) employed a relaxed stochastic control formulation, originally proposed by Wang et al. (2020) for reinforcement learning, to derive a state-dependent temperature control schedule.

The remainder of the paper is organized as follows. Section 2 presents the assumptions and our main results. Section 3 provides background on diffusion processes and functional inequalities. The result for the continuous-time simulated annealing (Theorem 1) is proved in Section 4. The result for the discrete simulated annealing (Theorem 2) is proved in Section 5. We conclude with Section 6.

2. Main results
In this section, we make precise the informal statements in the introduction, and present the main results of the paper. We first collect the notations that will be used throughout this paper.

– The notation |·| is the Euclidean norm of a vector, and a·b is the scalar product of vectors a and b.
– For a function f : R^d → R, let ∇f, ∇²f and Δf denote its gradient, Hessian and Laplacian respectively.
– For X, Y two random variables, ||X − Y||_TV denotes the total variation norm of the signed measure corresponding to X − Y.
– The symbol a ∼ b means that a/b → 1, and the symbol a = O(b) means that a/b is bounded, as some problem parameter tends to 0 or ∞.

We use C for a generic constant which depends on problem parameters (δ, f, E, ...), and may change from line to line.

Next, we present a few assumptions on the function f. These assumptions are standard in the study of metastability.

Assumption 1.
Let f : R^d → R be smooth, bounded from below, and satisfy the conditions:

(i) f is non-degenerate on the set of critical points. That is, for some C > 0,

|ξ|/C ≤ |∇²f(x) ξ| ≤ C |ξ|  for each x ∈ {z : ∇f(z) = 0} and ξ ∈ R^d.

(ii) There exists C > 0 such that

liminf_{|x|→∞} (|∇f(x)|² − Δf(x))/|x|² ≥ C,  inf_x ∇²f(x) ≥ −C I_d.

Let us make a few comments on Assumption 1. The condition (ii) implies that f has at least quadratic growth at infinity. This is a necessary and sufficient condition to obtain the log-Sobolev inequality (see (Royer, 2007, Theorem 3.1.21) and Section 3.2), which is key to the convergence analysis. The conditions (i) and (ii) imply that the set of critical points is discrete and finite (Menz and Schlichting, 2014, Remark 1.6). In particular, it follows that the set of local minimum points {m_1, ..., m_N} is also finite, with N the number of local minimum points of f.

To keep the presentation simple, we make additional assumptions on f, following (Menz et al., 2018, Assumption 2.5). Define the saddle height f̂(m_i, m_j) between two local minimum points m_i, m_j by

f̂(m_i, m_j) := inf { max_{s∈[0,1]} f(γ(s)) : γ ∈ C([0,1]; R^d), γ(0) = m_i, γ(1) = m_j }.  (3)

See Figure 1 for an illustration of the saddle height f̂(m_1, m_2) when f is a double-well function with m_1 the global minimum and m_2 the local minimum.

Assumption 2.
Let m_1, ..., m_N be the positions of the local minima of f.

(i) m_1 is the unique global minimum point of f, and m_2, ..., m_N are ordered in the sense that there exists δ_0 > 0 such that

f(m_N) ≥ f(m_{N−1}) ≥ ... ≥ f(m_2) ≥ δ_0  and  f(m_1) = 0.

(ii) For each i, j ∈ {1, ..., N}, the saddle height between m_i, m_j is attained at a unique critical point s_{ij} of index one. That is, f(s_{ij}) = f̂(m_i, m_j), and if {λ_1, ..., λ_d} are the eigenvalues of ∇²f(s_{ij}), then λ_1 < 0 and λ_i > 0 for i ∈ {2, ..., d}. The point s_{ij} is called the communicating saddle point between the minima m_i and m_j.

(iii) There exists p ∈ {2, ..., N} such that the energy barrier f(s_{p1}) − f(m_p) dominates all the others. That is, there exists δ_0 > 0 such that for all i ∈ {2, ..., N} \ {p},

E* := f(s_{p1}) − f(m_p) ≥ f(s_{i1}) − f(m_i) + δ_0.

The dominating energy barrier E* is called the critical depth.

The convergence result for the continuous-time SA (1) is stated as follows. The proof will be given in Section 4.
Theorem 1. Let f satisfy Assumptions 1 & 2. Assume that τ_t ∼ E/ln t and (d/dt)(1/τ_t) = O(1/t) as t → ∞ with E > E*. Also assume the moment condition for the initial distribution μ: for each p ≥ 1, there exists C_p > 0 such that

∫_{R^d} f(x)^p μ(dx) ≤ C_p.  (4)

Then for each δ, ε > 0, there exists C > 0 (depending on δ, ε, f, μ, E, d) such that

P(f(X_t) > δ) ≤ C t^{−min(δ/E, (1/2)(1−E*/E)) + ε}.  (5)

To get the convergence result for the discrete simulated annealing, we need an additional condition on the function f.

Assumption 3.
The gradient ∇f is globally Lipschitz, i.e. |∇f(x) − ∇f(y)| ≤ L|x − y| for some L > 0.

The convergence result for the discrete simulated annealing (2) is stated as follows. The proof will be given in Section 5.
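Before the formal statement, the scheme (2) with logarithmic cooling and a polynomially decaying step size can be sketched as follows (a minimal illustration of ours, not from the paper; the double-well f, the constants E, θ, c, the shift inside the logarithm, and the run length are assumptions, and this quartic f satisfies Assumption 3 only on bounded sets):

```python
import math, random

def f(x):
    return (x * x - 1.0) ** 2 + 0.25 * x        # tilted double-well (illustrative)

def grad_f(x):
    return 4.0 * x * (x * x - 1.0) + 0.25

def discrete_sa(E=2.0, theta=0.6, c=0.02, n_iters=50_000, x0=1.0, seed=1):
    """Scheme (2): x_{k+1} = x_k - grad_f(x_k)*eta_k + sqrt(2*tau_k*eta_k)*Z_k,
    with step size eta_k = c*k^(-theta), cumulative step Theta_k, and
    logarithmic cooling tau_k = E/ln(Theta_k), where E exceeds the barrier E*."""
    rng = random.Random(seed)
    x, Theta = x0, 0.0
    for k in range(1, n_iters + 1):
        eta = c * k ** (-theta)
        Theta += eta
        tau = E / math.log(Theta + math.e)  # shift keeps tau finite while Theta is small
        x += -grad_f(x) * eta + math.sqrt(2.0 * tau * eta) * rng.gauss(0.0, 1.0)
    return x

x_final = discrete_sa()
```

With these choices the temperature is still moderate after 50,000 iterations, so the iterate is only biased toward, not pinned at, the global minimum; longer runs cool further.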
Theorem 2. Let f satisfy Assumptions 1, 2 & 3, and let the condition (4) for μ hold. Assume that τ_t ∼ E/ln t and (d/dt)(1/τ_t) = O(1/t) as t → ∞ with E > E*. Also assume that

Θ_k → ∞,  η_{k+1} ln Θ_k → 0,  (6)

and

ln Θ_k / Σ_{j≤k} η_{j+1} Θ_j^{−E*/E} → 0,  (7)

as k → ∞. Then for each δ, ε > 0, there exists C > 0 (depending on δ, ε, f, μ, E, d) such that

P(f(x_k) > δ) ≤ C Θ_k^{−min(δ/E, (1/2)(1−E*/E)) + ε}.  (8)

3. Preliminaries
In this section, we present some terminology and basic results on diffusion processes and functional inequalities. We also explain how these results are applied in the setting of SA, which will be useful in our convergence analysis.

3.1. Diffusion processes and SA.
Consider the general diffusion process (X_t; t ≥ 0) in R^d of the form:

dX_t = b(t, X_t) dt + σ(t, X_t) dB_t,  X_0 ∼ μ(dx),  (9)

where (B_t; t ≥ 0) is a d-dimensional Brownian motion, with the drift b : R_+ × R^d → R^d and the covariance matrix σ : R_+ × R^d → R^{d×d}. To ensure the well-posedness of the SDE (9), some growth and regularity conditions on b and σ are required. For instance,

• If b and σ are Lipschitz and have linear growth in x uniformly in t, then (9) has a strong solution which is pathwise unique.
• If b is bounded, and σ is bounded, continuous and strictly elliptic, then (9) has a weak solution which is unique in distribution.

We refer to Stroock and Varadhan (1979); Rogers and Williams (1987) for background and further developments on the well-posedness of SDEs, and to (Cherny and Engelbert, 2005, Chapter 1) for a review of related results.
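The Lipschitz/linear-growth condition in the first bullet is what makes pathwise (strong) approximation meaningful: an Euler-Maruyama scheme driven by the same Brownian increments converges to the solution as the step size shrinks. A quick check of this (ours, not from the paper; the Ornstein-Uhlenbeck coefficients and sample sizes are illustrative) compares two coarse step sizes against a fine-grid reference on a common Brownian path:

```python
import math, random

def em_rmse(n_fine=1024, n_paths=200, T=1.0, seed=7):
    """Euler-Maruyama for the OU equation dX = -X dt + 0.5 dB on [0, T]:
    coarse discretizations are compared against a fine-grid reference
    driven by the same Brownian increments (strong, pathwise error)."""
    rng = random.Random(seed)
    sq_err = {64: 0.0, 256: 0.0}
    for _ in range(n_paths):
        dW = [rng.gauss(0.0, math.sqrt(T / n_fine)) for _ in range(n_fine)]
        x = 1.0                      # fine-grid reference path
        for w in dW:
            x += -x * (T / n_fine) + 0.5 * w
        for n in sq_err:
            m = n_fine // n          # m fine increments make one coarse step
            y, h = 1.0, T / n
            for i in range(n):
                y += -y * h + 0.5 * sum(dW[i * m:(i + 1) * m])
            sq_err[n] += (y - x) ** 2
    return {n: math.sqrt(e / n_paths) for n, e in sq_err.items()}

rmse = em_rmse()   # pathwise error shrinks as the step size decreases
```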
Another important aspect is the distributional property of (X_t; t ≥ 0) governed by the SDE (9). Let L be the infinitesimal generator of the diffusion process X, defined by

L g(t, x) := b(t, x)·∇g(x) + (1/2) σ(t, x)σ(t, x)^T : ∇²g(x)
           = Σ_{i=1}^d b_i(t, x) ∂g(x)/∂x_i + (1/2) Σ_{i,j=1}^d (σ(t, x)σ(t, x)^T)_{ij} ∂²g(x)/∂x_i∂x_j,  (10)

and L* be the corresponding adjoint operator given by

L* g(t, x) := −∇·(b(t, x) g(x)) + (1/2) ∇² : (σ(t, x)σ(t, x)^T g(x))
            = −Σ_{i=1}^d ∂/∂x_i (b_i(t, x) g(x)) + (1/2) Σ_{i,j=1}^d ∂²/∂x_i∂x_j ((σ(t, x)σ(t, x)^T g(x))_{ij}),  (11)

where g : R^d → R is a suitably smooth test function, and : denotes the Frobenius inner product, which is the component-wise inner product of two matrices. The probability density ρ_t(·) of the process X at time t then satisfies the Fokker-Planck equation:

∂ρ_t/∂t = L* ρ_t.  (12)

Specializing (12) to the SA process (1) with b(t, x) = −∇f(x) and σ(t, x) = √(2τ_t) I_d, we have that the probability density μ_t(·) of X governed by the SDE (1) satisfies

∂μ_t/∂t = ∇·(μ_t ∇f) + τ_t Δμ_t.  (13)

Under further growth conditions on b and σ, it can be shown that as t → ∞, ρ_t(·) → ρ_∞(·), which is the stationary distribution of (X_t; t ≥ 0). The density ρ_∞ is characterized by the equation L* ρ_∞ = 0; see Ethier and Kurtz (1986); Meyn and Tweedie (1993) for a general theory on stability of diffusion processes, and (Tang, 2019, Section 2) for a summary with various pointers to the literature.

However, for general b and σ, the stationary distribution ρ_∞(·) does not have a closed-form expression. One good exception is b(t, x) = −∇f(x) and σ(t, x) = √(2τ) I_d, where X is governed by the overdamped Langevin equation:

dX_t = −∇f(X_t) dt + √(2τ) dB_t,  X_0 ∼ μ(dx).
(14)

Such a process is time-reversible, and the stationary distribution, under some growth condition on f, is the Gibbs measure

ν_τ(dx) = (1/Z_τ) exp(−f(x)/τ) dx,  (15)

where Z_τ := ∫_{R^d} exp(−f(x)/τ) dx is the normalizing constant. Much is known about the overdamped Langevin dynamics. For instance, if f is λ-convex (i.e. ∇²f − λI_d is positive semidefinite for some λ > 0), the overdamped Langevin process (14) converges exponentially in the Wasserstein metric with rate λ to the Gibbs measure ν_τ (Carrillo et al., 2006). See also Bakry et al. (2014) for modern techniques to analyze the evolution of the overdamped Langevin equation and generalizations.
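The stationarity of the Gibbs measure (15) for the overdamped Langevin equation (14) can be checked numerically (an illustration of ours, not from the paper): on a grid, the Fokker-Planck right-hand side of (13) with constant temperature, namely ∇·(ν_τ ∇f) + τ Δν_τ, should vanish at ν_τ ∝ exp(−f/τ) up to discretization error. The one-dimensional double-well f and the temperature below are illustrative choices.

```python
import numpy as np

tau = 0.5
x = np.linspace(-4.0, 4.0, 2001)
h = x[1] - x[0]

f = (x**2 - 1.0)**2                 # illustrative double-well potential
nu = np.exp(-f / tau)               # unnormalized Gibbs density

fp = np.gradient(f, h)                                 # f'
drift_term = np.gradient(nu * fp, h)                   # d/dx (nu * f')
diff_term = tau * np.gradient(np.gradient(nu, h), h)   # tau * nu''
residual = drift_term + diff_term                      # should be ~ 0

# Relative size of the residual versus the individual terms (interior points only,
# since one-sided differences near the boundary are less accurate).
rel_err = np.max(np.abs(residual[5:-5])) / np.max(np.abs(drift_term))
```

The two terms cancel to within the second-order accuracy of the finite differences, confirming that ν_τ is a stationary solution of (13) for constant τ.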
Now we turn to the SA process (1). The difference between the overdamped Langevin process (14) and the process (1) is that the temperature τ_t of the latter is decreasing in time. Due to the time dependence, the limiting distribution of SA is unknown. As we will see in Section 4, the idea is to approximate (1) by the time-dependent Gibbs measure with temperature τ_t. Since τ_t decreases to 0 in the limit, the problem boils down to studying Gibbs measures at low temperatures. In the next section, we recall some results on Gibbs measures at low temperatures, which are motivated by applications in molecular dynamics and Bayesian statistics.

3.2. Functional inequalities and the Eyring-Kramers law.
Here we present functional inequalities for Gibbs measures at low temperatures (τ → 0). Let μ and ν be two probability measures on R^d such that μ is absolutely continuous relative to ν, with dμ/dν the Radon-Nikodym derivative. Define the relative entropy or KL-divergence H(μ | ν) of μ with respect to ν by

H(μ | ν) := ∫ log(dμ/dν) dμ = ∫ (dμ/dν) log(dμ/dν) dν,  (16)

and the Fisher information I(μ | ν) of μ with respect to ν by

I(μ | ν) := (1/2) ∫ |∇(dμ/dν)|² (dμ/dν)^{−1} dν.  (17)

We say that the probability measure ν satisfies the log-Sobolev inequality (LSI) with constant α > 0, if for all probability measures μ with I(μ | ν) < ∞,

H(μ | ν) ≤ (1/α) I(μ | ν).  (18)

The constant α is called the LSI constant for the probability measure ν. For instance, the LSI constant α = 1 for ν the multivariate Gaussian with mean 0 and covariance matrix I_d. The LSI also plays an important role in studying the convergence rate of the overdamped Langevin equation. Recall that ν_τ is the Gibbs measure defined by (15), and assume that ν_τ satisfies the LSI with constant α_τ > 0. It follows from (Schlichting, 2012, Theorem 1.7) that, by letting μ_{τ,t} be the probability distribution of X_t defined by (14), we have

H(μ_{τ,t} | ν_τ) ≤ e^{−2τα_τ t} H(μ_{τ,0} | ν_τ).  (19)

So the larger the value of α_τ, the faster the convergence of the overdamped Langevin process in the KL divergence. The subscript 'τ' in α_τ suggests the dependence of the LSI constant on the temperature τ, and we are interested in the asymptotics of α_τ at low temperatures as τ → 0. This problem was considered by (Menz and Schlichting, 2014, Corollary 3.18), who derived a sharp lower bound for α_τ as τ → 0.

Lemma 1.
Let f satisfy Assumptions 1 & 2. Then the Gibbs measure ν_τ defined by (15) satisfies the LSI with constant α_τ > 0 such that

α_τ ∼ C exp(−E*/τ)  as τ → 0,  (20)

where C > 0 depends on f, d.
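The exponentially small scaling exp(−E*/τ) in (20) can be observed numerically (an experiment of ours, not from the paper): discretizing the one-dimensional generator L g = −f′g′ + τ g″ by a birth-death chain that is reversible with respect to the Gibbs measure, and computing its spectral gap, exhibits the Arrhenius law, so −τ ln(gap) lands near the critical depth (E* = 1 for the symmetric double-well below; at moderate τ the prefactor biases the estimate slightly). The grid and temperature are illustrative assumptions.

```python
import numpy as np

def spectral_gap(tau, n=301, L=2.0):
    """Spectral gap of a birth-death discretization of Lg = -f'g' + tau*g'',
    reversible w.r.t. the Gibbs measure exp(-f/tau) on a grid of [-L, L]."""
    x = np.linspace(-L, L, n)
    h = x[1] - x[0]
    f = (x**2 - 1.0)**2          # symmetric double-well, barrier E* = f(0) - f(1) = 1
    # Jump rates q_{i,i+1} = (tau/h^2) exp(-(f_{i+1}-f_i)/(2 tau)) and symmetrically down.
    up = (tau / h**2) * np.exp(-(f[1:] - f[:-1]) / (2.0 * tau))
    down = (tau / h**2) * np.exp(-(f[:-1] - f[1:]) / (2.0 * tau))
    # Symmetrized generator: off-diagonals become tau/h^2, diagonal = -(exit rates).
    S = np.diag(np.full(n - 1, tau / h**2), 1) + np.diag(np.full(n - 1, tau / h**2), -1)
    diag = np.zeros(n)
    diag[:-1] -= up      # rate of jumping right from site i
    diag[1:] -= down     # rate of jumping left from site i
    S += np.diag(diag)
    evals = np.linalg.eigvalsh(S)   # ascending; the top eigenvalue is ~ 0
    return -evals[-2]               # spectral gap

tau = 0.2
gap = spectral_gap(tau)
E_star_estimate = -tau * np.log(gap)   # Arrhenius: gap ~ C * exp(-E*/tau)
```

Repeating this for a decreasing sequence of τ shows the estimate approaching E*, which is the content of the exponential factor in (20).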
The Eyring-Kramers law provides an estimate on the spectral gap of the overdamped Langevin equation (14). It dates back to Eyring (1935); Kramers (1940) in the study of metastability in chemical reactions, and was proved rigorously by Bovier et al. (2004, 2005). Lemma 1 is the LSI version of the Eyring-Kramers law, which is stronger than the spectral gap estimate implied by the Poincaré inequality.

Further define the Wasserstein distance W(μ, ν) between μ and ν by

W(μ, ν) := inf_Π √( ∫ |x − y|² Π(dx, dy) ),  (21)

where the infimum is over all joint distributions Π coupling μ and ν. We say that the probability measure ν satisfies Talagrand's inequality with constant γ > 0, if for all probability measures μ with H(μ | ν) < ∞,

W(μ, ν) ≤ √( (2/γ) H(μ | ν) ).  (22)

It follows from (Otto and Villani, 2000, Theorem 1) that the LSI implies Talagrand's inequality with the same constant. That is, if ν satisfies the LSI with constant α > 0, then ν also satisfies Talagrand's inequality with constant γ = α. Combining with Lemma 1, we get a lower bound estimate of the Talagrand's inequality constant for the Gibbs measure ν_τ.

Lemma 2.
Let f satisfy Assumptions 1 & 2. Then the Gibbs measure ν_τ defined by (15) satisfies Talagrand's inequality with constant γ_τ > 0 such that

γ_τ ∼ C exp(−E*/τ)  as τ → 0,  (23)

where C > 0 depends on f, d.

4. Continuous-time simulated annealing
In this section, we prove Theorem 1 by using the ideas developed in Miclo (1992); Monmarché (2018); Menz et al. (2018). Let μ_t be the probability measure of X_t defined by (1). The key idea is to compare μ_t with the time-dependent Gibbs measure ν_{τ_t} given by

ν_{τ_t}(dx) = (1/Z_{τ_t}) exp(−f(x)/τ_t) dx,  (24)

where Z_{τ_t} := ∫_{R^d} exp(−f(x)/τ_t) dx is the normalizing constant. Note that ν_{τ_t} will concentrate on the minimum point of f as t → ∞ since τ_t → 0 as t → ∞. We will see that ν_{τ_t} is close to μ_t in some sense as t → ∞. The proof of Theorem 1 is broken into four steps.

Step 1: Reduce μ_t to ν_{τ_t}. We establish a bound that relates ν_{τ_t} to μ_t. Let (X̃_t; t ≥ 0) be a process whose distribution is ν_{τ_t} at time t, coupled with (X_t; t ≥ 0) on the same probability space. Fix δ > 0. We have

P(f(X_t) > δ) = P(f(X_t) > δ, f(X̃_t) > δ) + P(f(X_t) > δ, f(X̃_t) ≤ δ)
             ≤ P(f(X̃_t) > δ) + ||X_t − X̃_t||_TV
             ≤ P(f(X̃_t) > δ) + √( H(μ_t | ν_{τ_t})/2 ),  (25)

where we use Pinsker's inequality (Tsybakov, 2009, Lemma 2.5) in the last inequality. Now the problem boils down to estimating P(f(X̃_t) > δ) and H(μ_t | ν_{τ_t}).

Step 2: Long-time behavior of f(X̃_t). We study the asymptotics of P(f(X̃_t) > δ) as t → ∞. The following lemma provides a quantitative estimate of how ν_{τ_t}, or equivalently X̃_t, concentrates on the minimum point of f as t → ∞.

Lemma 3.
Let f satisfy Assumptions 1 & 2. Assume that τ_t ∼ E/ln t as t → ∞ with E > E*. For each ε ∈ (0, δ), there exists C > 0 (depending on δ, ε, f, E, d) such that

P(f(X̃_t) > δ) ≤ C t^{−(δ−ε)/E}.  (26)

Proof. Note that

P(f(X̃_t) > δ) = ∫_{f(x)>δ} exp(−f(x)/τ_t) dx / ∫_{R^d} exp(−f(x)/τ_t) dx.  (27)

Under Assumption 1, f has quadratic growth, so at least linear growth at infinity (Menz and Schlichting, 2014, Lemma 3.14): there exists C > 0 such that for R large enough,

f(x) ≥ min_{|z|=R} f(z) + C(|x| − R)  for |x| > R.

We can also choose R sufficiently large so that min_{|z|=R} f(z) > δ. Consequently,

∫_{f(x)>δ} exp(−f(x)/τ_t) dx = ∫_{f(x)>δ, |x|≤R} exp(−f(x)/τ_t) dx + ∫_{f(x)>δ, |x|>R} exp(−f(x)/τ_t) dx
                             = e^{−δ/τ_t} Vol(B_R)(1 + O(τ_t)),  (28)

where Vol(B_R) is the volume of a ball with radius R. Further, by Laplace's method,

∫_{R^d} exp(−f(x)/τ_t) dx ∼ C τ_t^{d/2}.  (29)

By injecting (28), (29) into (27), we get

P(f(X̃_t) > δ) ≤ C t^{−δ/E} (ln t)^{d/2},  (30)

which clearly yields (26) since (ln t)^{d/2}/t^{ε/E} → 0 as t → ∞.  □

Remark 1.
It is interesting to get a bound for P(f(X̃_t) > δ) when the dimension d is large. As mentioned in the introduction, the Laplace bound (29) may fail when d, t → ∞ simultaneously. Recall that m_1 is the minimum point of f. By continuity of f, there exists r > 0 such that f(x) < ε when |x − m_1| < r. Thus,

∫_{R^d} exp(−f(x)/τ_t) dx ≥ ∫_{|x−m_1|<r} exp(−f(x)/τ_t) dx ≥ e^{−ε/τ_t} Vol(B_r).  (31)

Moreover, for t ≥ e^{Ed/(CR)}, which tends to ∞ as d → ∞,

∫_{f(x)>δ} exp(−f(x)/τ_t) dx = e^{−δ/τ_t} Vol(B_R)(1 + O(τ_t d)).  (32)

Combining (31) and (32), we get

P(f(X̃_t) > δ) ≤ C γ^d t^{−(δ−ε)/E},  (33)

where C > 0 depends on δ, ε, f, E, and γ = max(R/r, e^{(δ−ε)/(CR)}). Also note that Raginsky et al. (2017) obtained the bound E f(X̃_t) ≤ Cd/ln t. By Markov's inequality, we get

P(f(X̃_t) > δ) ≤ C δ^{−1} d (ln t)^{−1}.  (34)

In comparison with (34), the bound (33) is better in 't' but worse in 'd'. In terms of relaxation time, i.e. letting P(f(X̃_t) > δ) be of constant order, both estimates show an exponential dependence of t on d. This suggests that SA is exponentially slow as the dimension increases.

Step 3: Differential inequality for H(μ_t | ν_{τ_t}). To get an estimate of H(μ_t | ν_{τ_t}), we need to consider the time derivative (d/dt) H(μ_t | ν_{τ_t}). The following lemma is a reformulation of (Miclo, 1992, Proposition 3). For ease of reference, we give a simplified proof here. First let us fix some notation. For an absolutely continuous measure μ(dx), we abuse the notation μ(dx) = μ(x) dx, i.e. μ(x) is the density of μ(dx). So for two such probability measures μ and ν, the Radon-Nikodym derivative (dμ/dν)(x) is identified with μ(x)/ν(x).

Lemma 4. Let τ_t be decreasing in t. We have

(d/dt) H(μ_t | ν_{τ_t}) ≤ −2τ_t I(μ_t | ν_{τ_t}) + (d/dt)(1/τ_t) E f(X_t),  (35)

where I(μ_t | ν_{τ_t}) is the Fisher information defined by (17).

Proof.
Observe that

(d/dt) H(μ_t | ν_{τ_t}) = (d/dt) ∫ μ_t ln(μ_t/ν_{τ_t}) dx
                        = ∫ (∂μ_t/∂t) ln(μ_t/ν_{τ_t}) dx + ∫ μ_t (∂/∂t)(ln(μ_t/ν_{τ_t})) dx =: (a) + (b).  (36)

We first consider the term (a). Recall that μ_t satisfies the Fokker-Planck equation (13). Together with the fact that τ_t ∇ν_{τ_t} = −ν_{τ_t} ∇f, we have

∂μ_t/∂t = ∇·(τ_t ν_{τ_t} ∇(μ_t/ν_{τ_t})).  (37)

By injecting (37) into the term (a) and further by integration by parts, we get

(a) = ∫ ∇·(τ_t ν_{τ_t} ∇(μ_t/ν_{τ_t})) ln(μ_t/ν_{τ_t}) dx
    = −∫ τ_t ν_{τ_t} ∇(μ_t/ν_{τ_t}) · ∇ln(μ_t/ν_{τ_t}) dx
    = −τ_t ∫ |∇(μ_t/ν_{τ_t})|² (μ_t/ν_{τ_t})^{−1} dν_{τ_t} = −2τ_t I(μ_t | ν_{τ_t}).  (38)

Now we consider the term (b). Direct computation leads to

(b) = ∫ (∂μ_t/∂t − (μ_t/ν_{τ_t}) ∂ν_{τ_t}/∂t) dx = −∫ (∂/∂t)(ln ν_{τ_t}) dμ_t
    = ∫ (d/dt)(ln Z_{τ_t}) dμ_t + (d/dt)(1/τ_t) E f(X_t) ≤ (d/dt)(1/τ_t) E f(X_t),  (39)

where we use the fact that ∫ μ_t dx = 1 in the second equation, and that τ_t is decreasing in t in the last inequality. Combining (36) with (38) and (39) yields (35).  □

Step 4: Estimating H(μ_t | ν_{τ_t}) via the Eyring-Kramers law. Note that there are two terms on the right hand side of (35). We start with an estimate of the second term.

Lemma 5. Let f satisfy Assumption 1, and assume that the condition (4) for μ holds. For each ε > 0, there exists C > 0 (depending on ε, f) such that

E f(X_t) ≤ C(1 + t)^ε.  (40)

Proof. It is easy to see that Assumption 1 implies Assumption H in Miclo (1992).
Together with the moment condition (4), the proof follows the line of reasoning in (Miclo, 1992, Lemma 2).  □

Now we apply the Eyring-Kramers law, combined with a Grönwall-type argument, to bound H(μ_t | ν_{τ_t}) for large t.

Lemma 6. Let f satisfy Assumptions 1 & 2, and assume that the condition (4) for μ holds. Assume that τ_t ∼ E/ln t and (d/dt)(1/τ_t) = O(1/t) as t → ∞ with E > E*. For each ε > 0, there exists C > 0 (depending on ε, f, E, d) such that

H(μ_t | ν_{τ_t}) ≤ C t^{−(1−E*/E−ε)}.  (41)

Proof. Using Lemma 1 and the bound (35), we have

(d/dt) H(μ_t | ν_{τ_t}) ≤ −2τ_t α_t H(μ_t | ν_{τ_t}) + C t^{−1} E f(X_t),  (42)

where α_t is the LSI constant for the Gibbs measure ν_{τ_t}. By the Eyring-Kramers formula (20), for each ε > 0, there exist C > 0 and t_0 > 0 such that

τ_t α_t ≥ C t^{−(E*/E+ε)}  for t ≥ t_0.  (43)

Combining (42) with (40), (43), we get

(d/dt) H(μ_t | ν_{τ_t}) ≤ −C t^{−(E*/E+ε)} H(μ_t | ν_{τ_t}) + C′ t^{−1+ε}.  (44)

Fix ε ∈ (0, 1 − E*/E), and let

Q(t) := H(μ_t | ν_{τ_t}) − (C′/C) t^{−1+E*/E+2ε}.

Then for t sufficiently large and t ≥ t_0, we have (d/dt) Q(t) ≤ −C t^{−(E*/E+ε)} Q(t) by (44). This implies that Q(t) ≤ Q(t_0) e^{−C ∫_{t_0}^t s^{−(E*/E+ε)} ds}. Thus,

H(μ_t | ν_{τ_t}) ≤ (C′/C) t^{−(1−E*/E−2ε)} + H(μ_{t_0} | ν_{τ_{t_0}}) e^{−(C/κ)(t^κ − t_0^κ)},  (45)

where κ := 1 − E*/E − ε > 0. Note that the first term on the right hand side of (45) dominates, and the conclusion follows.  □

Finally, by injecting (26), (41) into (25), we get the desired estimate (5).

5. Discrete simulated annealing

This section is devoted to the proof of Theorem 2. The idea is close to that employed for the continuous-time SA process (1). However, the analysis is more complicated due to discretization, and additional tools from Vempala and Wibisono (2019) on the convergence of ULA are used. Recall that η_k is the step size at iteration k, and Θ_k := Σ_{j≤k} η_j is the cumulative step size up to iteration k.
Let μ_k be the probability density of x_k defined by (2), and

ν_{τ_{Θ_k}}(dx) = (1/Z_{τ_{Θ_k}}) exp(−f(x)/τ_{Θ_k}) dx,  (46)

where Z_{τ_{Θ_k}} := ∫_{R^d} exp(−f(x)/τ_{Θ_k}) dx is the normalizing constant. We divide the proof into four steps.

Step 1: Reduce μ_k to ν_{τ_{Θ_k}}. This step is similar to Steps 1 & 2 in Section 4. Let (x̃_k; k ≥ 0) be a sequence whose distribution is ν_{τ_{Θ_k}} at epoch k, coupled with (x_k; k ≥ 0) on the same probability space. Fix δ > 0. The same argument as in (25) shows that

P(f(x_k) > δ) ≤ P(f(x̃_k) > δ) + √( H(μ_k | ν_{τ_{Θ_k}})/2 ).  (47)

Assume that Θ_k → ∞ and τ_{Θ_k} ∼ E/ln Θ_k as k → ∞ with E > E*. By Lemma 3, we get a bound for the first term on the right hand side of (47). That is, for each ε ∈ (0, δ), there exists C > 0 (depending on ε, δ, f, E, d) such that

P(f(x̃_k) > δ) ≤ C Θ_k^{−(δ−ε)/E}.  (48)

So it remains to estimate H(μ_k | ν_{τ_{Θ_k}}), which is the task of the next three steps.

Step 2: Continuous-time coupling. To make use of continuous-time tools, we couple the sequence (x_k; k ≥ 0) by a continuous-time process (X_t; t ≥ 0) such that (X_{Θ_k}; k ≥ 0) has the same distribution as (x_k; k ≥ 0). Define the process X by

dX_t = −∇f(x_k) dt + √(2τ_{Θ_k}) dB_t,  t ∈ [Θ_k, Θ_{k+1}),  (49)

where we identify X_{Θ_k} with x_k. Note that the Fokker-Planck equation (13) plays an important role in the analysis of the continuous-time SA (1). It is desirable to get a version of the Fokker-Planck equation for the coupled process (49). The result is stated as follows.

Lemma 7. For t ∈ [Θ_k, Θ_{k+1}), the probability density μ_t of X_t defined by (49) satisfies the following equation:

∂μ_t/∂t = ∇·(τ_{Θ_k} ν_{τ_{Θ_k}} ∇(μ_t/ν_{τ_{Θ_k}})) + ∇·(μ_t E[∇f(x_k) − ∇f(X_t) | X_t = x]).  (50)

Proof. Let μ_{t|s}(x|y) be the conditional density of X_t = x given X_s = y. By conditioning on X_{Θ_k} = x_k, we have

∂μ_{t|Θ_k}(x|x_k)/∂t = ∇·(μ_{t|Θ_k}(x|x_k) ∇f(x_k)) + τ_{Θ_k} Δμ_{t|Θ_k}(x|x_k).
By integrating (51) against $\mu_{\Theta_k}$, and using the fact that $\mu_{t|\Theta_k}(x|x_k)\,\mu_{\Theta_k}(x_k) = \mu_t(x)\,\mu_{\Theta_k|t}(x_k|x)$, we get

$$\frac{\partial \mu_t}{\partial t} = \nabla\cdot\big(\mu_t(x)\,\mathbb{E}[\nabla f(x_k) \,|\, X_t = x]\big) + \tau_{\Theta_k}\Delta\mu_t. \tag{52}$$

Further, by (37) we have

$$\nabla\cdot\left(\tau_{\Theta_k}\,\nu_{\tau_{\Theta_k}}\nabla\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right) = \nabla\cdot(\mu_t\nabla f(x)) + \tau_{\Theta_k}\Delta\mu_t. \tag{53}$$

Combining (52) and (53) yields (50). □

There are two terms on the right-hand side of (50). Compared with (37), the first term is the usual Fokker–Planck term, while the second term corresponds to the discretization error.

Step 3: One-step analysis of $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$. Here we use the coupled process (49) to study the one-step decay of $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$.

Lemma 8. Let $f$ satisfy Assumptions 1, 2 & 3, and assume that the condition (4) for $\mu_0$ holds. Assume that $\tau_t \sim E/\ln t$ and $\frac{d}{dt}\tau_t = O(1/t)$ as $t \to \infty$ with $E > E^*$. Also assume that $\Theta_k \to \infty$ and $\eta_{k+1}/\Theta_k \to 0$ as $k \to \infty$. Then, for each $\varepsilon > 0$, there exist $C, C' > 0$ (depending on $\varepsilon, f, E, d$) such that

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) \le \left(1 - C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\right) H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k + \eta_{k+1}\Theta_k^{\varepsilon-1}\big). \tag{54}$$

Proof. Write

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) = \underbrace{H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}})}_{(a)} + \underbrace{\big(H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}})\big)}_{(b)}. \tag{55}$$

We first use the coupled process (49) to study the term (a).
Note that

$$\frac{d}{dt}H(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) = \int \frac{\partial \mu_t}{\partial t}\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx + \int \mu_t\,\frac{d}{dt}\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx$$
$$= \int \nabla\cdot\left(\tau_{\Theta_k}\nu_{\tau_{\Theta_k}}\nabla\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right)\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx + \underbrace{\int \nabla\cdot\big(\mu_t\,\mathbb{E}[\nabla f(x_k) - \nabla f(X_t) \,|\, X_t = x]\big)\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx}_{(c)} + \frac{d}{dt}\int \mu_t\,dx$$
$$= -\tau_{\Theta_k}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) + (c), \tag{56}$$

where we use (50) in the second equality and (38) in the third (the term $\frac{d}{dt}\int\mu_t\,dx$ vanishes since $\mu_t$ integrates to one). Now we need to estimate the term (c) in (56). By integration by parts and the fact that $a\cdot b \le \frac{1}{2\tau_{\Theta_k}}|a|^2 + \frac{\tau_{\Theta_k}}{2}|b|^2$, we get

$$(c) = \mathbb{E}\left((\nabla f(X_t) - \nabla f(x_k))\cdot\nabla\log\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right) \le \frac{1}{2\tau_{\Theta_k}}\,\mathbb{E}|\nabla f(X_t) - \nabla f(x_k)|^2 + \frac{\tau_{\Theta_k}}{2}\,\mathbb{E}\left|\nabla\log\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right|^2$$
$$\le \frac{L^2}{2\tau_{\Theta_k}}\,\mathbb{E}|X_t - x_k|^2 + \frac{\tau_{\Theta_k}}{2}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}), \tag{57}$$

where $L$ is the Lipschitz constant of $\nabla f$ from Assumption 3. Recall from (49) that $X_t - x_k = -\nabla f(x_k)(t - \Theta_k) + \sqrt{\tau_{\Theta_k}(t - \Theta_k)}\,Z$, where $Z$ is standard normal. Consequently,

$$\mathbb{E}|X_t - x_k|^2 = (t - \Theta_k)^2\,\mathbb{E}|\nabla f(x_k)|^2 + \tau_{\Theta_k}(t - \Theta_k)\,d \le \eta_{k+1}^2\,\mathbb{E}|\nabla f(x_k)|^2 + C\tau_{\Theta_k}\eta_{k+1}. \tag{58}$$

According to Lemma 2, $\nu_{\tau_{\Theta_k}}$ satisfies Talagrand's inequality with constant $\gamma_{\tau_{\Theta_k}} \sim \kappa\exp(-E^*/\tau_{\Theta_k})$. Further, by (Vempala and Wibisono, 2019, Lemma 10),

$$\mathbb{E}|\nabla f(x_k)|^2 \le \frac{C}{\gamma_{\tau_{\Theta_k}}}\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C. \tag{59}$$

Combining (57) with (58), (59) and the fact that $\tau_{\Theta_k} \sim E/\ln\Theta_k$ as $k \to \infty$, we have

$$(c) \le C\big(\eta_{k+1}^2\,\Theta_k^{E^*/E}\ln\Theta_k\big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C\big(\eta_{k+1} + \eta_{k+1}^2\ln\Theta_k\big) + \frac{\tau_{\Theta_k}}{2}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}). \tag{60}$$
Injecting (60) into (56), and further applying Lemma 1, we get

$$\frac{d}{dt}H(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) \le -\frac{\tau_{\Theta_k}}{2}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2\,\Theta_k^{E^*/E}\ln\Theta_k\big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1} + \eta_{k+1}^2\ln\Theta_k\big)$$
$$\le -C\,\Theta_k^{-(E^*/E)-\varepsilon}\,H(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) + C'\Big(\eta_{k+1}^2\,\Theta_k^{(E^*/E)+\varepsilon}\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + \eta_{k+1} + \eta_{k+1}^2\ln\Theta_k\Big).$$

Now, by a Grönwall argument, we have

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}}) \le e^{-C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}}\Big(\big(1 + C'\eta_{k+1}^2\,\Theta_k^{(E^*/E)+\varepsilon}\big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k\big)\Big)$$
$$\le e^{-C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}}\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k\big)$$
$$\le \Big(1 - C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\Big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k\big), \tag{61}$$

where we use the fact that $\eta_{k+1}\Theta_k^{(E^*/E)+\varepsilon} \to 0$ as $k \to \infty$ in the second inequality (after adjusting the constant $C$).

Now we consider the term (b) in (55). Note that

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}}) = \ln\left(\frac{Z_{\tau_{\Theta_{k+1}}}}{Z_{\tau_{\Theta_k}}}\right) + \left(\frac{1}{\tau_{\Theta_{k+1}}} - \frac{1}{\tau_{\Theta_k}}\right)\mathbb{E}f(x_{k+1}) \le C\,\frac{\eta_{k+1}}{\Theta_k}\,\mathbb{E}f(x_{k+1}). \tag{62}$$

We claim that $\mathbb{E}f(x_{k+1}) \le C$ for some constant $C > 0$ independent of $k$. Choose $C > 0$ large, and suppose toward a contradiction that $\mathbb{E}f(x_{k+1})$ is the first term exceeding $C$. By Assumption 3,

$$f(x_{k+1}) \le f(x_k) - \eta_{k+1}|\nabla f(x_k)|^2 + \sqrt{\tau_{\Theta_k}\eta_{k+1}}\,\nabla f(x_k)\cdot Z_k + \frac{L}{2}\big|\eta_{k+1}\nabla f(x_k) - \sqrt{\tau_{\Theta_k}\eta_{k+1}}\,Z_k\big|^2.$$

Further, by taking expectations, we get

$$\mathbb{E}f(x_{k+1}) - \mathbb{E}f(x_k) \le -\eta_{k+1}\left(1 - \frac{\eta_{k+1}L}{2}\right)\mathbb{E}|\nabla f(x_k)|^2 + \frac{Ld}{2}\,\tau_{\Theta_k}\eta_{k+1}. \tag{63}$$

Thus, $\mathbb{E}f(x_{k+1}) - \mathbb{E}f(x_k) \le \frac{Ld}{2}\tau_{\Theta_k}\eta_{k+1}$, which implies that $\mathbb{E}f(x_k) > C - 1$ for $k$ large enough. By Assumption 1(ii), $\mathbb{E}|\nabla f(x_k)|^2 > C'$ for some $C' > 0$. Combining with (63), we have $\mathbb{E}f(x_k) > \mathbb{E}f(x_{k+1}) \ge C$ for $k$ large enough. This contradicts the fact that $\mathbb{E}f(x_{k+1})$ is the first term exceeding $C$. Now, by (62), we get

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}}) \le C\eta_{k+1}\Theta_k^{-1} \le C\eta_{k+1}\Theta_k^{\varepsilon-1}. \tag{64}$$

Combining (55) with (61) and (64) yields (54). □
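The discrete dynamics that Lemma 8 controls — a gradient step plus Gaussian noise scaled by the current temperature, under the logarithmic cooling $\tau_t \sim E/\ln t$ — can be sketched numerically. The following Python snippet is purely illustrative, not the paper's experiment: the double-well $f$, the value of $E$, and the step-size schedule $\eta_k = 0.05/(k+1)$ (so that $\Theta_k$ grows logarithmically, in the spirit of the step-size condition) are all assumed demo choices.

```python
import numpy as np

# Illustrative double-well: global minimum near x = 1.04, local minimum near x = -0.96.
def f(x):
    return (x**2 - 1.0)**2 - 0.3 * x

def grad_f(x):
    return 4.0 * x * (x**2 - 1.0) - 0.3

E = 2.0                       # cooling parameter (demo value; the theory needs E > E*)

def tau(theta):               # logarithmic cooling tau_t ~ E / ln t
    return E / np.log(np.e + theta)

def eta(k):                   # step sizes; Theta_k = sum_j eta_j grows like log k
    return 0.05 / (k + 1)

rng = np.random.default_rng(0)
n_runs, K = 200, 20000
x = np.full(n_runs, -1.0)     # start every run in the wrong (local) basin
theta = 0.0                   # cumulative step size Theta_k
for k in range(K):
    h = eta(k)
    # x_{k+1} = x_k - eta_{k+1} grad f(x_k) + sqrt(tau_{Theta_k} eta_{k+1}) Z_k
    x = x - h * grad_f(x) + np.sqrt(tau(theta) * h) * rng.standard_normal(n_runs)
    theta += h

grid = np.linspace(-2.0, 2.0, 4001)
min_f = f(grid).min()         # numerical proxy for min f
delta = 0.2
frac = float(np.mean(f(x) <= min_f + delta))

# Plain gradient descent from the same start stays stuck in the local basin.
y = -1.0
for k in range(K):
    y = y - eta(k) * grad_f(y)

print(f"annealed runs within delta of min f: {frac:.2f}")
print(f"gradient-descent gap above min f:    {f(y) - min_f:.2f}")
```

The point of the sketch is qualitative: the deterministic descent from the same start never leaves the local basin, while the annealed iterates can cross the barrier; the exact success fraction depends on the illustrative schedule and horizon.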
Step 4: Estimating $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$. We use Lemma 8 to derive an estimate for $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$. Under the condition (6), the term $\eta_{k+1}\Theta_k^{\varepsilon-1}$ dominates $\eta_{k+1}^2$ and $\eta_{k+1}^2\ln\Theta_k$ as $k \to \infty$. Thus, the recursion (54) yields

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) \le \left(1 - C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\right) H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\eta_{k+1}\Theta_k^{\varepsilon-1}.$$

Since $E^*/E < 1$, a similar argument as in Lemma 6 shows that

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - C\,\Theta_k^{-(1-\frac{E^*}{E}-\varepsilon)} \le \left(1 - C'\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\right)\left(H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) - C\,\Theta_{k-1}^{-(1-\frac{E^*}{E}-\varepsilon)}\right).$$

This, together with the condition (7), implies that

$$H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) \le C\,\Theta_k^{-(1-\frac{E^*}{E}-\varepsilon)}. \tag{65}$$

By injecting (48) and (65) into (47), we obtain (8). □

Conclusion

In this paper, we study the convergence rate of SA in both the continuous and discrete settings. The main tool is functional inequalities for the Gibbs measure at low temperatures. We prove that the tail probability, in both settings, exhibits a polynomial decay in time. The decay rate is also given as a function of the model parameters. In the discrete setting, we derive a condition on the step size that ensures convergence to the global minimum. This condition may be useful in tuning the step size.

There are a few directions in which to extend this work. For instance, one can study the convergence rate of SA for Lévy flights with a suitable cooling schedule. Another problem is to study the dependence of the convergence rate on the dimension $d$. This requires a deep understanding of the Eyring–Kramers law in high dimensions, and is related to the Laplace approximation of high-dimensional integrals. Both problems are worth exploring, but may be challenging.

Acknowledgement: Tang thanks Georg Menz for helpful discussions. He also gratefully acknowledges financial support through a start-up grant at Columbia University.
Zhou gratefully acknowledges financial support through a start-up grant at Columbia University and through the Nie Center for Intelligent Asset Management.

References

D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion operators, volume 348 of Grundlehren der Mathematischen Wissenschaften. Springer, 2014.
A. Bovier, M. Eckhoff, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes. I. Sharp asymptotics for capacities and exit times. J. Eur. Math. Soc., 6(4):399–424, 2004.
A. Bovier, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes. II. Precise asymptotics for small eigenvalues. J. Eur. Math. Soc., 7(1):69–99, 2005.
J. A. Carrillo, R. J. McCann, and C. Villani. Contractions in the 2-Wasserstein length space and thermalization of granular media. Arch. Ration. Mech. Anal., 179(2):217–263, 2006.
V. Černý. Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J. Optim. Theory Appl., 45(1):41–51, 1985.
X. Chen, S. S. Du, and X. T. Tong. On stationary-point hitting time and ergodicity of stochastic gradient Langevin dynamics. J. Mach. Learn. Res., 21: Paper No. 68, 41 pp., 2020.
A. S. Cherny and H.-J. Engelbert. Singular stochastic differential equations, volume 1858 of Lecture Notes in Mathematics. Springer-Verlag, 2005.
T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diffusion for global optimization in R^n. SIAM J. Control Optim., 25(3):737–753, 1987.
D. Delahaye, S. Chaimatanan, and M. Mongeau. Simulated annealing: from basics to applications. In Handbook of metaheuristics, volume 272 of Internat. Ser. Oper. Res. Management Sci., pages 1–35. Springer, 2019.
A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
S. N. Ethier and T. G. Kurtz. Markov processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., 1986.
H. Eyring. The activated complex in chemical reactions. J. Chem. Phys., 3(2):107–115, 1935.
H. Fang, M. Qian, and G. Gong. An improved annealing method and its large-time behavior. Stochastic Process. Appl., 71(1):55–74, 1997.
N. Fournier and C. Tardif. On the simulated annealing in R^d. 2020. arXiv:2003.06360.
X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin diffusions. 2020. arXiv:2005.04507.
R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.
S. Geman and C.-R. Hwang. Diffusions for global optimization. SIAM J. Control Optim., 24(5):1031–1043, 1986.
U. Grenander and M. I. Miller. Representations of knowledge in complex systems. J. Roy. Statist. Soc. Ser. B, 56(4):549–603, 1994.
X. Guo, J. Han, and W. Tang. Perturbed gradient descent with occupation time. 2020. arXiv:2005.04507.
R. A. Holley, S. Kusuoka, and D. W. Stroock. Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Funct. Anal., 83(2):333–347, 1989.
P. Jain and P. Kar. Non-convex optimization for machine learning. Found. Trends Mach. Learn., 10:142–336, 2017.
C. Jin, R. Ge, P. Netrapalli, S. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In ICML, pages 1724–1732, 2017.
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
C. Koulamas, S. Antony, and R. Jaen. A survey of simulated annealing applications to operations research problems. Omega, 22(1):41–56, 1994.
H. A. Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
Y. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. Bartlett, and M. I. Jordan. Is there an analog of Nesterov acceleration for MCMC? 2019a. arXiv:1902.00996.
Y. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proc. Natl. Acad. Sci.
USA, 116(42):20881–20885, 2019b.
D. Márquez. Convergence rates for annealing diffusion processes. Ann. Appl. Probab., 7(4):1118–1139, 1997.
G. Menz and A. Schlichting. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab., 42(5):1809–1884, 2014.
G. Menz, A. Schlichting, W. Tang, and T. Wu. Ergodicity of the infinite swapping algorithm at low temperature. 2018. arXiv:1811.10174.
S. P. Meyn and R. L. Tweedie. Stability of Markovian processes. III. Foster-Lyapunov criteria for continuous-time processes. Adv. in Appl. Probab., 25(3):518–548, 1993.
L. Miclo. Recuit simulé sur R^n. Étude de l'évolution de l'énergie libre. Annales de l'Institut Henri Poincaré, 28(2):235–266, 1992.
L. Miclo. Une étude des algorithmes de recuit simulé sous-admissibles. Ann. Fac. Sci. Toulouse Math. (6), 4(4):819–877, 1995.
P. Monmarché. Hypocoercivity in metastable settings and kinetic simulated annealing. Probability Theory and Related Fields, pages 1–34, 2018.
F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal., 173(2):361–400, 2000.
G. Parisi. Correlation functions and computer simulations. Nuclear Phys. B, 180(3, FS 2):378–384, 1981.
I. Pavlyukevich. Lévy flights, non-local search and simulated annealing. J. Comput. Phys., 226(2):1830–1844, 2007.
M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In COLT, pages 1674–1703, 2017.
G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
L. C. G. Rogers and D. Williams. Diffusions, Markov processes, and martingales. Vol. 2, Itô Calculus. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., 1987.
G. Royer.
An initiation to logarithmic Sobolev inequalities, volume 14 of SMF/AMS Texts and Monographs. American Mathematical Society, 2007.
A. Schlichting. The Eyring-Kramers formula for Poincaré and logarithmic Sobolev inequalities. PhD thesis, Universität Leipzig, 2012. Available at http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-97965.
Z. Shun and P. McCullagh. Laplace approximation of high-dimensional integrals. J. Roy. Statist. Soc. Ser. B, 57(4):749–760, 1995.
D. W. Stroock and S. R. S. Varadhan. Multidimensional diffusion processes, volume 233 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, 1979.
W. Tang. Exponential ergodicity and convergence for generalized reflected Brownian motion. Queueing Syst., 92(1-2):83–101, 2019.
A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, 2009.
P. J. M. van Laarhoven and E. H. L. Aarts. Simulated annealing: theory and applications, volume 37 of Mathematics and its Applications. D. Reidel Publishing Co., 1987.
S. Vempala and A. Wibisono. Rapid convergence of the Unadjusted Langevin Algorithm: isoperimetry suffices. In NeurIPS, volume 32, pages 8094–8106, 2019.
H. Wang, T. Zariphopoulou, and X. Y. Zhou. Reinforcement learning in continuous time and space: a stochastic control approach. J. Mach. Learn. Res., 21:1–34, 2020.
P. A. Zitt. Annealing diffusions in a potential function with a slow growth. Stochastic Process. Appl., 118(1):76–119, 2008.

Department of Industrial Engineering and Operations Research, Columbia University.
Email address: [email protected]

Department of Industrial Engineering and Operations Research, Columbia University.
Email address: