Simulated annealing from continuum to discretization: a convergence analysis via the Eyring--Kramers law
WENPIN TANG AND XUN YU ZHOU
Abstract.
We study the convergence rate of continuous-time simulated annealing (X_t; t ≥ 0) and its discretization (x_k; k = 0, 1, ...) for approximating the global optimum of a given function f. We prove that the tail probability P(f(X_t) > min f + δ) (resp. P(f(x_k) > min f + δ)) decays polynomially in time (resp. in cumulative step size), and provide an explicit rate as a function of the model parameters. Our argument applies recent developments on functional inequalities for the Gibbs measure at low temperatures – the Eyring-Kramers law. In the discrete setting, we obtain a condition on the step size to ensure the convergence.

Key words: Simulated annealing, convergence rate, Euler discretization, Eyring-Kramers law, functional inequalities, overdamped Langevin equation.

1. Introduction
Simulated annealing (SA) is an umbrella term which denotes a set of stochastic optimization methods. The goal of SA is to find the global minimum of a function f : R^d → R, in particular when f is non-convex. These methods have many applications in physics, operations research and computer science, see e.g. van Laarhoven and Aarts (1987); Koulamas et al. (1994); Delahaye et al. (2019). The name is inspired by annealing in metallurgy, a process aiming to increase the size of crystals by heating and controlled cooling. The stochastic version of SA was independently proposed by Kirkpatrick et al. (1983) and Cerny (1985). The idea is as follows: consider a stochastic process related to f which is subject to thermal noise. When simulating this process, one decreases the temperature slowly over time. As a result, the stochastic process escapes from saddle points and local optima, and converges to the global minimum of f with high probability. This is generally true if the cooling is slow enough, and it is important to find the right stochastic process with the fastest possible cooling schedule which approximates the global optimum.

In this paper, we explore the convergence rate of continuous-time SA and its discretization. To be more precise, define the continuous-time SA process (X_t; t ≥ 0) by

dX_t = −∇f(X_t) dt + √(2τ_t) dB_t,  X_0 ∼ μ(dx),  (1)

where (B_t; t ≥ 0) is a d-dimensional Brownian motion, τ_t is the cooling schedule of temperature, and μ(dx) is some initial distribution. This formulation was first considered by Geman and Hwang (1986). If τ_t ≡ τ is constant in time, the process (1) is the well-known overdamped Langevin equation whose stationary distribution is the Gibbs measure ν_τ(dx) ∝ exp(−f(x)/τ) dx. Thus, we sometimes call the process (1) an SA adapted overdamped Langevin equation as well. See Section 3 for background. We also consider the Euler-Maruyama discretization of the continuous-time simulated annealing process. Let η_k be the step size at iteration k, and Θ_k := Σ_{j≤k} η_j be the cumulative step size up to iteration k. The discrete SA process (x_k; k = 0, 1, ...) is defined by

x_{k+1} = x_k − ∇f(x_k) η_k + √(2 τ_{Θ_k} η_k) Z_k,  x_0 ∼ μ(dx),  (2)

where (Z_k; k = 0, 1, ...) are independent and identically distributed standard normal vectors, and τ_{Θ_k} is the cooling schedule at iteration k. For τ_t ≡ τ constant in time, the scheme (2) is known as the unadjusted Langevin algorithm (ULA), which approximates the Gibbs measure ν_τ. The algorithm was introduced in Parisi (1981); Grenander and Miller (1994), and further studied by Roberts and Tweedie (1996); Durmus and Moulines (2017).

Date: February 10, 2021.

The goal of this paper is to study the decay in time of the tail probability

P(f(X_t) > min f + δ)  or  P(f(x_k) > min f + δ),

under suitable conditions on the function f, the cooling schedule τ_t, and the discretization scheme η_k, Θ_k. Let us mention a few motivations. First, there is a line of works on the interplay between sampling and optimization (Raginsky et al. (2017); Ma et al.
(2019a,b)). Note that if τ_t ≡ τ is constant in time, the overdamped Langevin equation converges to the Gibbs measure ν_τ, and for τ sufficiently small, the Gibbs measure ν_τ approximates the Dirac mass at the global minimum of f. Combining these two ingredients, SA sets the cooling schedule τ_t to decrease to 0 over time. It is then expected that the SA process (1) or (2) converges to the global minimum as t → ∞ or k → ∞. There are also recent works on various noisy gradient-based algorithms (Ge et al. (2015); Jin et al. (2017); Chen et al. (2020); Guo et al. (2020)), which aim to escape saddle points and find a local minimum of f. In comparison with these methods, SA has the advantage of finding the global minimum, but at the cost of longer exploration time.

The main tool in our analysis is the Eyring-Kramers law, which is a set of functional inequalities for the Gibbs measure at low temperatures. Let us explain our results. It was shown in Geman and Hwang (1986); Chiang et al. (1987) that the correct order of τ_t for the process (1) to converge to the global minimum of f is (ln t)^{−1}. In fact, there is a phase transition related to the critical depth E* of the function f:

(a) If τ_t ≤ E/ln t for t large enough with E < E*, then limsup_{t→∞} P(f(X_t) ≤ min f + δ) < 1.
(b) If τ_t ≥ E/ln t for t large enough with E > E*, then lim_{t→∞} P(f(X_t) ≤ min f + δ) = 1.

Roughly speaking, the critical depth E* is the largest hill one needs to climb starting from a local minimum to the global minimum. The formal definition of the critical depth E* will be given in Assumption 2, but see Figure 1 below for an illustration when f is a double-well function. Part (a) was proved by Holley et al. (1989) using a sophisticated argument that involves the Poincaré inequality. Part (b) was proved by Miclo (1992) who characterized the fastest cooling schedule by the Eyring-Kramers law for the log-Sobolev inequality. See also Miclo (1995); Zitt (2008); Fournier and Tardif (2020) for similar results under different conditions on the function f. The convergence rate of (f(X_t); t ≥ 0) to the global minimum of f was considered by Márquez (1997) only for δ sufficiently small. However, the rate of P(f(X_t) > min f + δ) for general δ > 0 has not been addressed.
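As a concrete illustration of the critical depth described above (a numerical sketch of ours, not from the paper; the tilted double-well f and the grid resolution are illustrative assumptions), note that in one dimension the lowest barrier between two minima is simply the maximum of f on the interval joining them, so E* can be read off a grid:

```python
import numpy as np

def f(x):
    return (x**2 - 1.0)**2 + 0.25 * x   # tilted double-well (illustrative)

xs = np.linspace(-2.0, 2.0, 200_001)
ys = f(xs)

# Grid points that are strict local minima of f.
interior = (ys[1:-1] < ys[:-2]) & (ys[1:-1] < ys[2:])
minima = xs[1:-1][interior]              # two minima, near x = -1 and x = +1
m_global = minima[np.argmin(f(minima))]  # global minimum (deeper well)
m_local = minima[np.argmax(f(minima))]   # non-global local minimum

# In one dimension, the saddle height between the two minima is the
# maximum of f on the interval joining them.
lo, hi = sorted((m_global, m_local))
saddle_height = ys[(xs >= lo) & (xs <= hi)].max()

# Critical depth: the largest hill seen from the non-global minimum.
E_star = saddle_height - f(m_local)
```

For this example the barrier from the shallow well is about 0.76, while the (irrelevant) barrier from the global well is larger; the critical depth is the former.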
Figure 1. Illustration of the critical depth of a double-well function.

To simplify the notation, we assume throughout this paper that

min_{R^d} f(x) = 0,

i.e. the global minimum of f is 0, by considering f − min f. Our results are outlined as follows, and the precise statement of these results will be given in Section 2.

Main Results (Informal). Under some assumptions on the function f, we have:

(i) Assume that τ_t ≥ E/ln t for t large enough with E > E*. For δ > 0, there exists C > 0 (depending on δ, f, E, d) such that

P(f(X_t) > δ) ≤ C t^{−min(δ/E, (1/2)(1−E*/E))}.

(ii) Assume that τ_t ≥ E/ln t for t large enough with E > E*. Also assume that Θ_k → ∞, η_{k+1} ln Θ_k → 0 and ln Θ_k / Σ_{j≤k} η_{j+1} Θ_j^{−E*/E} → 0 as k → ∞. For δ > 0, there exists C > 0 (depending on δ, f, E, d) such that

P(f(x_k) > δ) ≤ C Θ_k^{−min(δ/E, (1/2)(1−E*/E))}.

The contribution of this paper is twofold:

• Polynomial decay rate. We prove that the tail probability P(f(X_t) > δ) (resp. P(f(x_k) > δ)) decays polynomially in time (resp. in cumulative step size). In the continuous setting, Monmarché (2018) obtained the same rate of convergence for the SA adapted underdamped Langevin equation, and Menz et al. (2018) considered an improvement of SA via parallel tempering. However, the convergence rate for continuous-time SA, i.e. part (i), has not yet been recorded in the literature, though this result is probably understood in the Eyring-Kramers folklore. We provide a self-contained treatment which bridges the literature and, more importantly, can be applied to obtain the new result in the discrete setting.

• Choice of step size. Part (ii) for the discrete simulated annealing is, to our best knowledge, completely novel, and it also gives practical guidance on the choice of step size for discretization. The condition Θ_k → ∞ indicates that the step size cannot be chosen too small, while the condition η_{k+1} ln Θ_k → 0 indicates that the step size cannot be chosen too large.
The condition ln Θ_k / Σ_{j≤k} η_{j+1} Θ_j^{−E*/E} → 0 ensures that the contraction accumulated along the iterations dominates the slowly decreasing temperature. For instance, the step size η_k = k^{−θ} with θ ∈ (0, 1] satisfies the conditions in (ii) to ensure the convergence.

Also note that the rate min(δ/E, (1/2)(1−E*/E)) is smaller than 1/2. It is interesting to know whether this rate is optimal, and we leave the problem for future work.

There is another interesting problem: the dependence of the constant C on the dimension d. The issue is subtle, since most analysis, including the Eyring-Kramers law, uses Laplace's method. However, Laplace's method may fail if both the dimension d and the inverse temperature 1/τ tend to infinity (Shun and McCullagh (1995)). As shown in Remark 1, we obtain an upper bound for C which is exponential in d. This suggests the convergence rate is exponentially slow as the dimension increases, which concurs with the fact that finding the global minimum of a general nonconvex function is NP-hard (Jain and Kar (2017)).

Finally, we mention a few approaches to accelerate or improve SA. Fang et al. (1997) considered a cooling schedule depending on both time and state; Pavlyukevich (2007) used the Lévy flight; Monmarché (2018) studied SA adapted to the underdamped Langevin equation; Menz et al. (2018) applied the replica exchange technique; Gao et al. (2020) employed a relaxed stochastic control formulation, originally proposed by Wang et al. (2020) for reinforcement learning, to derive a state-dependent temperature control schedule.

The remainder of the paper is organized as follows. Section 2 presents the assumptions and our main results. Section 3 provides background on diffusion processes and functional inequalities. The result for the continuous-time simulated annealing (Theorem 1) is proved in Section 4. The result for the discrete simulated annealing (Theorem 2) is proved in Section 5. We conclude with Section 6.

2. Main results
In this section, we make precise the informal statements in the introduction, and present the main results of the paper. We first collect the notations that will be used throughout this paper.

– The notation |·| is the Euclidean norm of a vector, and a·b is the scalar product of vectors a and b.
– For a function f : R^d → R, let ∇f, ∇²f and Δf denote its gradient, Hessian and Laplacian respectively.
– For X, Y two random variables, ||X − Y||_TV denotes the total variation norm of the signed measure corresponding to X − Y.
– The symbol a ∼ b means that a/b → 1, and the symbol a = O(b) means that a/b is bounded, as some problem parameter tends to 0 or ∞.

We use C for a generic constant which depends on problem parameters (δ, f, E, ...), and may change from line to line.

Next, we present a few assumptions on the function f. These assumptions are standard in the study of metastability.

Assumption 1.
Let f : R^d → R be smooth, bounded from below, and satisfy the conditions:

(i) f is non-degenerate on the set of critical points. That is, for some C > 0,

|ξ|/C ≤ |∇²f(x) ξ| ≤ C |ξ|  for each x ∈ {z : ∇f(z) = 0} and ξ ∈ R^d.

(ii) There exists C > 0 such that

liminf_{|x|→∞} (|∇f(x)|² − Δf(x))/|x|² ≥ C,  inf_x ∇²f(x) ≥ −C I_d.

Let us make a few comments on Assumption 1. The condition (ii) implies that f has at least quadratic growth at infinity. This is a necessary and sufficient condition to obtain the log-Sobolev inequality (see (Royer, 2007, Theorem 3.1.21) and Section 3.2), which is key to the convergence analysis. The conditions (i) and (ii) imply that the set of critical points is discrete and finite (Menz and Schlichting, 2014, Remark 1.6). In particular, it follows that the set of local minimum points {m_1, ..., m_N} is also finite, with N the number of local minimum points of f.

To keep the presentation simple, we make additional assumptions on f, following (Menz et al., 2018, Assumption 2.5). Define the saddle height f̂(m_i, m_j) between two local minimum points m_i, m_j by

f̂(m_i, m_j) := inf { max_{s∈[0,1]} f(γ(s)) : γ ∈ C([0,1]; R^d), γ(0) = m_i, γ(1) = m_j }.  (3)

See Figure 1 for an illustration of the saddle height f̂(m_1, m_2) when f is a double-well function with m_1 the global minimum and m_2 the local minimum.

Assumption 2.
Let m_1, ..., m_N be the positions of the local minima of f.

(i) m_1 is the unique global minimum point of f, and m_2, ..., m_N are ordered in the sense that there exists δ_0 > 0 such that

f(m_N) ≥ f(m_{N−1}) ≥ ... ≥ f(m_2) ≥ δ_0  and  f(m_1) = 0.

(ii) For each i, j ∈ {1, ..., N}, the saddle height between m_i, m_j is attained at a unique critical point s_{ij} of index one. That is, f(s_{ij}) = f̂(m_i, m_j), and if {λ_1, ..., λ_d} are the eigenvalues of ∇²f(s_{ij}), then λ_1 < 0 and λ_i > 0 for i ∈ {2, ..., d}. The point s_{ij} is called the communicating saddle point between the minima m_i and m_j.

(iii) There exists p ∈ {2, ..., N} such that the energy barrier f(s_{p1}) − f(m_p) dominates all the others. That is, there exists δ_0 > 0 such that for all i ∈ {2, ..., N} \ {p},

E* := f(s_{p1}) − f(m_p) ≥ f(s_{i1}) − f(m_i) + δ_0.

The dominating energy barrier E* is called the critical depth.

The convergence result for the continuous-time SA (1) is stated as follows. The proof will be given in Section 4.
Theorem 1. Let f satisfy Assumptions 1 & 2. Assume that τ_t ∼ E/ln t and (d/dt)(1/τ_t) = O(1/t) as t → ∞ with E > E*. Also assume the moment condition for the initial distribution μ: for each p ≥ 1, there exists C_p > 0 such that

∫_{R^d} f(x)^p μ(dx) ≤ C_p.  (4)

Then for each δ, ε > 0, there exists C > 0 (depending on δ, ε, f, μ, E, d) such that

P(f(X_t) > δ) ≤ C t^{−min(δ/E, (1/2)(1−E*/E)) + ε}.  (5)

To get the convergence result for the discrete simulated annealing, we need an additional condition on the function f.

Assumption 3.
The gradient ∇f is globally Lipschitz, i.e. |∇f(x) − ∇f(y)| ≤ L|x − y| for some L > 0.

The convergence result for the discrete simulated annealing (2) is stated as follows. The proof will be given in Section 5.
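Before the formal statement, the scheme (2) with logarithmic cooling and a polynomially decaying step size can be sketched as follows (a minimal illustration of ours, not from the paper; the double-well f, the constants E, θ, c, the shift inside the logarithm, and the run length are assumptions, and this quartic f satisfies Assumption 3 only on bounded sets):

```python
import math, random

def f(x):
    return (x * x - 1.0) ** 2 + 0.25 * x        # tilted double-well (illustrative)

def grad_f(x):
    return 4.0 * x * (x * x - 1.0) + 0.25

def discrete_sa(E=2.0, theta=0.6, c=0.02, n_iters=50_000, x0=1.0, seed=1):
    """Scheme (2): x_{k+1} = x_k - grad_f(x_k)*eta_k + sqrt(2*tau_k*eta_k)*Z_k,
    with step size eta_k = c*k^(-theta), cumulative step Theta_k, and
    logarithmic cooling tau_k = E/ln(Theta_k), where E exceeds the barrier E*."""
    rng = random.Random(seed)
    x, Theta = x0, 0.0
    for k in range(1, n_iters + 1):
        eta = c * k ** (-theta)
        Theta += eta
        tau = E / math.log(Theta + math.e)  # shift keeps tau finite while Theta is small
        x += -grad_f(x) * eta + math.sqrt(2.0 * tau * eta) * rng.gauss(0.0, 1.0)
    return x

x_final = discrete_sa()
```

With these choices the temperature is still moderate after 50,000 iterations, so the iterate is only biased toward, not pinned at, the global minimum; longer runs cool further.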
Theorem 2. Let f satisfy Assumptions 1, 2 & 3, and let the condition (4) for μ hold. Assume that τ_t ∼ E/ln t and (d/dt)(1/τ_t) = O(1/t) as t → ∞ with E > E*. Also assume that

Θ_k → ∞,  η_{k+1} ln Θ_k → 0,  (6)

and

ln Θ_k / Σ_{j≤k} η_{j+1} Θ_j^{−E*/E} → 0,  (7)

as k → ∞. Then for each δ, ε > 0, there exists C > 0 (depending on δ, ε, f, μ, E, d) such that

P(f(x_k) > δ) ≤ C Θ_k^{−min(δ/E, (1/2)(1−E*/E)) + ε}.  (8)

3. Preliminaries
In this section, we present some terminology and basic results on diffusion processes and functional inequalities. We also explain how these results are applied in the setting of SA, which will be useful in our convergence analysis.

3.1. Diffusion processes and SA.
Consider the general diffusion process (X_t; t ≥ 0) in R^d of the form:

dX_t = b(t, X_t) dt + σ(t, X_t) dB_t,  X_0 ∼ μ(dx),  (9)

where (B_t; t ≥ 0) is a d-dimensional Brownian motion, with the drift b : R_+ × R^d → R^d and the covariance matrix σ : R_+ × R^d → R^{d×d}. To ensure the well-posedness of the SDE (9), some growth and regularity conditions on b and σ are required. For instance,

• If b and σ are Lipschitz and have linear growth in x uniformly in t, then (9) has a strong solution which is pathwise unique.
• If b is bounded, and σ is bounded, continuous and strictly elliptic, then (9) has a weak solution which is unique in distribution.

We refer to Stroock and Varadhan (1979); Rogers and Williams (1987) for background and further developments on the well-posedness of SDEs, and to (Cherny and Engelbert, 2005, Chapter 1) for a review of related results.
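The Lipschitz/linear-growth condition in the first bullet is what makes pathwise (strong) approximation meaningful: an Euler-Maruyama scheme driven by the same Brownian increments converges to the solution as the step size shrinks. A quick check of this (ours, not from the paper; the Ornstein-Uhlenbeck coefficients and sample sizes are illustrative) compares two coarse step sizes against a fine-grid reference on a common Brownian path:

```python
import math, random

def em_rmse(n_fine=1024, n_paths=200, T=1.0, seed=7):
    """Euler-Maruyama for the OU equation dX = -X dt + 0.5 dB on [0, T]:
    coarse discretizations are compared against a fine-grid reference
    driven by the same Brownian increments (strong, pathwise error)."""
    rng = random.Random(seed)
    sq_err = {64: 0.0, 256: 0.0}
    for _ in range(n_paths):
        dW = [rng.gauss(0.0, math.sqrt(T / n_fine)) for _ in range(n_fine)]
        x = 1.0                      # fine-grid reference path
        for w in dW:
            x += -x * (T / n_fine) + 0.5 * w
        for n in sq_err:
            m = n_fine // n          # m fine increments make one coarse step
            y, h = 1.0, T / n
            for i in range(n):
                y += -y * h + 0.5 * sum(dW[i * m:(i + 1) * m])
            sq_err[n] += (y - x) ** 2
    return {n: math.sqrt(e / n_paths) for n, e in sq_err.items()}

rmse = em_rmse()   # pathwise error shrinks as the step size decreases
```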
Another important aspect is the distributional property of (X_t; t ≥ 0) governed by the SDE (9). Let L be the infinitesimal generator of the diffusion process X, defined by

L g(t, x) := b(t, x)·∇g(x) + (1/2) σ(t, x)σ(t, x)^T : ∇²g(x)
           = Σ_{i=1}^d b_i(t, x) ∂g(x)/∂x_i + (1/2) Σ_{i,j=1}^d (σ(t, x)σ(t, x)^T)_{ij} ∂²g(x)/∂x_i∂x_j,  (10)

and L* be the corresponding adjoint operator given by

L* g(t, x) := −∇·(b(t, x) g(x)) + (1/2) ∇² : (σ(t, x)σ(t, x)^T g(x))
            = −Σ_{i=1}^d ∂/∂x_i (b_i(t, x) g(x)) + (1/2) Σ_{i,j=1}^d ∂²/∂x_i∂x_j ((σ(t, x)σ(t, x)^T g(x))_{ij}),  (11)

where g : R^d → R is a suitably smooth test function, and : denotes the Frobenius inner product, which is the component-wise inner product of two matrices. The probability density ρ_t(·) of the process X at time t then satisfies the Fokker-Planck equation:

∂ρ_t/∂t = L* ρ_t.  (12)

Specializing (12) to the SA process (1) with b(t, x) = −∇f(x) and σ(t, x) = √(2τ_t) I_d, we have that the probability density μ_t(·) of X governed by the SDE (1) satisfies

∂μ_t/∂t = ∇·(μ_t ∇f) + τ_t Δμ_t.  (13)

Under further growth conditions on b and σ, it can be shown that as t → ∞, ρ_t(·) → ρ_∞(·), which is the stationary distribution of (X_t; t ≥ 0). The density ρ_∞ is characterized by the equation L* ρ_∞ = 0; see Ethier and Kurtz (1986); Meyn and Tweedie (1993) for a general theory on stability of diffusion processes, and (Tang, 2019, Section 2) for a summary with various pointers to the literature.

However, for general b and σ, the stationary distribution ρ_∞(·) does not have a closed-form expression. One good exception is b(t, x) = −∇f(x) and σ(t, x) = √(2τ) I_d, where X is governed by the overdamped Langevin equation:

dX_t = −∇f(X_t) dt + √(2τ) dB_t,  X_0 ∼ μ(dx).
(14)

Such a process is time-reversible, and the stationary distribution, under some growth condition on f, is the Gibbs measure

ν_τ(dx) = (1/Z_τ) exp(−f(x)/τ) dx,  (15)

where Z_τ := ∫_{R^d} exp(−f(x)/τ) dx is the normalizing constant. Much is known about the overdamped Langevin dynamics. For instance, if f is λ-convex (i.e. ∇²f − λI_d is positive semidefinite for some λ > 0), the overdamped Langevin process (14) converges exponentially in the Wasserstein metric with rate λ to the Gibbs measure ν_τ (Carrillo et al., 2006). See also Bakry et al. (2014) for modern techniques to analyze the evolution of the overdamped Langevin equation and generalizations.
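The stationarity of the Gibbs measure (15) for the overdamped Langevin equation (14) can be checked numerically (an illustration of ours, not from the paper): on a grid, the Fokker-Planck right-hand side of (13) with constant temperature, namely ∇·(ν_τ ∇f) + τ Δν_τ, should vanish at ν_τ ∝ exp(−f/τ) up to discretization error. The one-dimensional double-well f and the temperature below are illustrative choices.

```python
import numpy as np

tau = 0.5
x = np.linspace(-4.0, 4.0, 2001)
h = x[1] - x[0]

f = (x**2 - 1.0)**2                 # illustrative double-well potential
nu = np.exp(-f / tau)               # unnormalized Gibbs density

fp = np.gradient(f, h)                                 # f'
drift_term = np.gradient(nu * fp, h)                   # d/dx (nu * f')
diff_term = tau * np.gradient(np.gradient(nu, h), h)   # tau * nu''
residual = drift_term + diff_term                      # should be ~ 0

# Relative size of the residual versus the individual terms (interior points only,
# since one-sided differences near the boundary are less accurate).
rel_err = np.max(np.abs(residual[5:-5])) / np.max(np.abs(drift_term))
```

The two terms cancel to within the second-order accuracy of the finite differences, confirming that ν_τ is a stationary solution of (13) for constant τ.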
Now we turn to the SA process (1). The difference between the overdamped Langevin process (14) and the process (1) is that the temperature τ_t of the latter is decreasing in time. Due to the time dependence, the limiting distribution of SA is unknown. As we will see in Section 4, the idea is to approximate (1) by the time-dependent Gibbs measure with temperature τ_t. Since τ_t decreases to 0 in the limit, the problem boils down to studying Gibbs measures at low temperatures. In the next section, we recall some results on Gibbs measures at low temperatures, which are motivated by applications in molecular dynamics and Bayesian statistics.

3.2. Functional inequalities and the Eyring-Kramers law.
Here we present functional inequalities for Gibbs measures at low temperatures (τ → 0). Let μ and ν be two probability measures on R^d such that μ is absolutely continuous relative to ν, with dμ/dν the Radon-Nikodym derivative. Define the relative entropy or KL-divergence H(μ | ν) of μ with respect to ν by

H(μ | ν) := ∫ log(dμ/dν) dμ = ∫ (dμ/dν) log(dμ/dν) dν,  (16)

and the Fisher information I(μ | ν) of μ with respect to ν by

I(μ | ν) := (1/2) ∫ |∇(dμ/dν)|² (dμ/dν)^{−1} dν.  (17)

We say that the probability measure ν satisfies the log-Sobolev inequality (LSI) with constant α > 0, if for all probability measures μ with I(μ | ν) < ∞,

H(μ | ν) ≤ (1/α) I(μ | ν).  (18)

The constant α is called the LSI constant for the probability measure ν. For instance, the LSI constant α = 1 for ν the multivariate Gaussian with mean 0 and covariance matrix I_d. The LSI also plays an important role in studying the convergence rate of the overdamped Langevin equation. Recall that ν_τ is the Gibbs measure defined by (15), and assume that ν_τ satisfies the LSI with constant α_τ > 0. It follows from (Schlichting, 2012, Theorem 1.7) that, by letting μ_{τ,t} be the probability distribution of X_t defined by (14), we have

H(μ_{τ,t} | ν_τ) ≤ e^{−2τα_τ t} H(μ_{τ,0} | ν_τ).  (19)

So the larger the value of α_τ, the faster the convergence of the overdamped Langevin process in the KL divergence. The subscript 'τ' in α_τ suggests the dependence of the LSI constant on the temperature τ, and we are interested in the asymptotics of α_τ at low temperatures as τ → 0. This problem was considered by (Menz and Schlichting, 2014, Corollary 3.18), who derived a sharp lower bound for α_τ as τ → 0.

Lemma 1.
Let f satisfy Assumptions 1 & 2. Then the Gibbs measure ν_τ defined by (15) satisfies the LSI with constant α_τ > 0 such that

α_τ ∼ C exp(−E*/τ)  as τ → 0,  (20)

where C > 0 depends on f, d.
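The exponentially small scaling exp(−E*/τ) in (20) can be observed numerically (an experiment of ours, not from the paper): discretizing the one-dimensional generator L g = −f′g′ + τ g″ by a birth-death chain that is reversible with respect to the Gibbs measure, and computing its spectral gap, exhibits the Arrhenius law, so −τ ln(gap) lands near the critical depth (E* = 1 for the symmetric double-well below; at moderate τ the prefactor biases the estimate slightly). The grid and temperature are illustrative assumptions.

```python
import numpy as np

def spectral_gap(tau, n=301, L=2.0):
    """Spectral gap of a birth-death discretization of Lg = -f'g' + tau*g'',
    reversible w.r.t. the Gibbs measure exp(-f/tau) on a grid of [-L, L]."""
    x = np.linspace(-L, L, n)
    h = x[1] - x[0]
    f = (x**2 - 1.0)**2          # symmetric double-well, barrier E* = f(0) - f(1) = 1
    # Jump rates q_{i,i+1} = (tau/h^2) exp(-(f_{i+1}-f_i)/(2 tau)) and symmetrically down.
    up = (tau / h**2) * np.exp(-(f[1:] - f[:-1]) / (2.0 * tau))
    down = (tau / h**2) * np.exp(-(f[:-1] - f[1:]) / (2.0 * tau))
    # Symmetrized generator: off-diagonals become tau/h^2, diagonal = -(exit rates).
    S = np.diag(np.full(n - 1, tau / h**2), 1) + np.diag(np.full(n - 1, tau / h**2), -1)
    diag = np.zeros(n)
    diag[:-1] -= up      # rate of jumping right from site i
    diag[1:] -= down     # rate of jumping left from site i
    S += np.diag(diag)
    evals = np.linalg.eigvalsh(S)   # ascending; the top eigenvalue is ~ 0
    return -evals[-2]               # spectral gap

tau = 0.2
gap = spectral_gap(tau)
E_star_estimate = -tau * np.log(gap)   # Arrhenius: gap ~ C * exp(-E*/tau)
```

Repeating this for a decreasing sequence of τ shows the estimate approaching E*, which is the content of the exponential factor in (20).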
The Eyring-Kramers law provides an estimate on the spectral gap of the overdamped Langevin equation (14). It dates back to Eyring (1935); Kramers (1940) in the study of metastability in chemical reactions, and was proved rigorously by Bovier et al. (2004, 2005). Lemma 1 is the LSI version of the Eyring-Kramers law, which is stronger than the spectral gap estimate implied by the Poincaré inequality.

Further define the Wasserstein distance W(μ, ν) between μ and ν by

W(μ, ν) := inf_Π √( ∫ |x − y|² Π(dx, dy) ),  (21)

where the infimum is over all joint distributions Π coupling μ and ν. We say that the probability measure ν satisfies Talagrand's inequality with constant γ > 0, if for all probability measures μ with H(μ | ν) < ∞,

W(μ, ν) ≤ √( (2/γ) H(μ | ν) ).  (22)

It follows from (Otto and Villani, 2000, Theorem 1) that the LSI implies Talagrand's inequality with the same constant. That is, if ν satisfies the LSI with constant α > 0, then ν also satisfies Talagrand's inequality with constant γ = α. Combining with Lemma 1, we get a lower bound estimate of the Talagrand's inequality constant for the Gibbs measure ν_τ.

Lemma 2.
Let f satisfy Assumptions 1 & 2. Then the Gibbs measure ν_τ defined by (15) satisfies Talagrand's inequality with constant γ_τ > 0 such that

γ_τ ∼ C exp(−E*/τ)  as τ → 0,  (23)

where C > 0 depends on f, d.

4. Continuous-time simulated annealing
In this section, we prove Theorem 1 by using the ideas developed in Miclo (1992); Monmarché (2018); Menz et al. (2018). Let μ_t be the probability measure of X_t defined by (1). The key idea is to compare μ_t with the time-dependent Gibbs measure ν_{τ_t} given by

ν_{τ_t}(dx) = (1/Z_{τ_t}) exp(−f(x)/τ_t) dx,  (24)

where Z_{τ_t} := ∫_{R^d} exp(−f(x)/τ_t) dx is the normalizing constant. Note that ν_{τ_t} will concentrate on the minimum point of f as t → ∞ since τ_t → 0 as t → ∞. We will see that ν_{τ_t} is close to μ_t in some sense as t → ∞. The proof of Theorem 1 is broken into four steps.

Step 1: Reduce μ_t to ν_{τ_t}. We establish a bound that relates ν_{τ_t} to μ_t. Let (X̃_t; t ≥ 0) be a process whose distribution is ν_{τ_t} at time t, coupled with (X_t; t ≥ 0) on the same probability space. Fix δ > 0. We have

P(f(X_t) > δ) = P(f(X_t) > δ, f(X̃_t) > δ) + P(f(X_t) > δ, f(X̃_t) ≤ δ)
             ≤ P(f(X̃_t) > δ) + ||X_t − X̃_t||_TV
             ≤ P(f(X̃_t) > δ) + √( H(μ_t | ν_{τ_t})/2 ),  (25)

where we use Pinsker's inequality (Tsybakov, 2009, Lemma 2.5) in the last inequality. Now the problem boils down to estimating P(f(X̃_t) > δ) and H(μ_t | ν_{τ_t}).

Step 2: Long-time behavior of f(X̃_t). We study the asymptotics of P(f(X̃_t) > δ) as t → ∞. The following lemma provides a quantitative estimate of how ν_{τ_t}, or equivalently X̃_t, concentrates on the minimum point of f as t → ∞.

Lemma 3.
Let f satisfy Assumptions 1 & 2. Assume that τ_t ∼ E/ln t as t → ∞ with E > E*. For each ε ∈ (0, δ), there exists C > 0 (depending on δ, ε, f, E, d) such that

P(f(X̃_t) > δ) ≤ C t^{−(δ−ε)/E}.  (26)

Proof. Note that

P(f(X̃_t) > δ) = ∫_{f(x)>δ} exp(−f(x)/τ_t) dx / ∫_{R^d} exp(−f(x)/τ_t) dx.  (27)

Under Assumption 1, f has quadratic growth, so at least linear growth at infinity (Menz and Schlichting, 2014, Lemma 3.14): there exists C > 0 such that for R large enough,

f(x) ≥ min_{|z|=R} f(z) + C(|x| − R)  for |x| > R.

We can also choose R sufficiently large so that min_{|z|=R} f(z) > δ. Consequently,

∫_{f(x)>δ} exp(−f(x)/τ_t) dx = ∫_{f(x)>δ, |x|≤R} exp(−f(x)/τ_t) dx + ∫_{f(x)>δ, |x|>R} exp(−f(x)/τ_t) dx
                             = e^{−δ/τ_t} Vol(B_R)(1 + O(τ_t)),  (28)

where Vol(B_R) is the volume of a ball with radius R. Further, by Laplace's method,

∫_{R^d} exp(−f(x)/τ_t) dx ∼ C τ_t^{d/2}.  (29)

By injecting (28), (29) into (27), we get

P(f(X̃_t) > δ) ≤ C t^{−δ/E} (ln t)^{d/2},  (30)

which clearly yields (26) since (ln t)^{d/2}/t^{ε/E} → 0 as t → ∞.  □

Remark 1.
It is interesting to get a bound for P(f(X̃_t) > δ) when the dimension d is large. As mentioned in the introduction, the Laplace bound (29) may fail when d, t → ∞ simultaneously. Recall that m_1 is the minimum point of f. By continuity of f, there exists r > 0 such that f(x) < ε when |x − m_1| < r. Thus,

∫_{R^d} exp(−f(x)/τ_t) dx ≥ ∫_{|x−m_1|<r} exp(−f(x)/τ_t) dx ≥ e^{−ε/τ_t} Vol(B_r).  (31)

Moreover, for t ≥ e^{Ed/(CR)}, which tends to ∞ as d → ∞,

∫_{f(x)>δ} exp(−f(x)/τ_t) dx = e^{−δ/τ_t} Vol(B_R)(1 + O(τ_t d)).  (32)

Combining (31) and (32), we get

P(f(X̃_t) > δ) ≤ C γ^d t^{−(δ−ε)/E},  (33)

where C > 0 depends on δ, ε, f, E, and γ = max(R/r, e^{(δ−ε)/(CR)}). Also note that Raginsky et al. (2017) obtained the bound E f(X̃_t) ≤ Cd/ln t. By Markov's inequality, we get

P(f(X̃_t) > δ) ≤ C δ^{−1} d (ln t)^{−1}.  (34)

In comparison with (34), the bound (33) is better in 't' but worse in 'd'. In terms of relaxation time, i.e. letting P(f(X̃_t) > δ) be of constant order, both estimates show an exponential dependence of t on d. This suggests that SA is exponentially slow as the dimension increases.

Step 3: Differential inequality for H(μ_t | ν_{τ_t}). To get an estimate of H(μ_t | ν_{τ_t}), we need to consider the time derivative (d/dt) H(μ_t | ν_{τ_t}). The following lemma is a reformulation of (Miclo, 1992, Proposition 3). For ease of reference, we give a simplified proof here. First let us fix some notation. For an absolutely continuous measure μ(dx), we abuse the notation μ(dx) = μ(x) dx, i.e. μ(x) is the density of μ(dx). So for two such probability measures μ and ν, the Radon-Nikodym derivative (dμ/dν)(x) is identified with μ(x)/ν(x).

Lemma 4. Let τ_t be decreasing in t. We have

(d/dt) H(μ_t | ν_{τ_t}) ≤ −2τ_t I(μ_t | ν_{τ_t}) + (d/dt)(1/τ_t) E f(X_t),  (35)

where I(μ_t | ν_{τ_t}) is the Fisher information defined by (17).

Proof.
Observe that

(d/dt) H(μ_t | ν_{τ_t}) = (d/dt) ∫ μ_t ln(μ_t/ν_{τ_t}) dx
                        = ∫ (∂μ_t/∂t) ln(μ_t/ν_{τ_t}) dx + ∫ μ_t (∂/∂t)(ln(μ_t/ν_{τ_t})) dx =: (a) + (b).  (36)

We first consider the term (a). Recall that μ_t satisfies the Fokker-Planck equation (13). Together with the fact that τ_t ∇ν_{τ_t} = −ν_{τ_t} ∇f, we have

∂μ_t/∂t = ∇·(τ_t ν_{τ_t} ∇(μ_t/ν_{τ_t})).  (37)

By injecting (37) into the term (a) and further by integration by parts, we get

(a) = ∫ ∇·(τ_t ν_{τ_t} ∇(μ_t/ν_{τ_t})) ln(μ_t/ν_{τ_t}) dx
    = −∫ τ_t ν_{τ_t} ∇(μ_t/ν_{τ_t}) · ∇ln(μ_t/ν_{τ_t}) dx
    = −τ_t ∫ |∇(μ_t/ν_{τ_t})|² (μ_t/ν_{τ_t})^{−1} dν_{τ_t} = −2τ_t I(μ_t | ν_{τ_t}).  (38)

Now we consider the term (b). Direct computation leads to

(b) = ∫ (∂μ_t/∂t − (μ_t/ν_{τ_t}) ∂ν_{τ_t}/∂t) dx = −∫ (∂/∂t)(ln ν_{τ_t}) dμ_t
    = ∫ (d/dt)(ln Z_{τ_t}) dμ_t + (d/dt)(1/τ_t) E f(X_t) ≤ (d/dt)(1/τ_t) E f(X_t),  (39)

where we use the fact that ∫ μ_t dx = 1 in the second equation, and that τ_t is decreasing in t in the last inequality. Combining (36) with (38) and (39) yields (35).  □

Step 4: Estimating H(μ_t | ν_{τ_t}) via the Eyring-Kramers law. Note that there are two terms on the right hand side of (35). We start with an estimate of the second term.

Lemma 5. Let f satisfy Assumption 1, and assume that the condition (4) for μ holds. For each ε > 0, there exists C > 0 (depending on ε, f) such that

E f(X_t) ≤ C(1 + t)^ε.  (40)

Proof. It is easy to see that Assumption 1 implies Assumption H in Miclo (1992).
Together with the moment condition (4), the proof follows the line of reasoning in (Miclo, 1992, Lemma 2).  □

Now we apply the Eyring-Kramers law, combined with a Grönwall-type argument, to bound H(μ_t | ν_{τ_t}) for large t.

Lemma 6. Let f satisfy Assumptions 1 & 2, and assume that the condition (4) for μ holds. Assume that τ_t ∼ E/ln t and (d/dt)(1/τ_t) = O(1/t) as t → ∞ with E > E*. For each ε > 0, there exists C > 0 (depending on ε, f, E, d) such that

H(μ_t | ν_{τ_t}) ≤ C t^{−(1−E*/E−ε)}.  (41)

Proof. Using Lemma 1 and the bound (35), we have

(d/dt) H(μ_t | ν_{τ_t}) ≤ −2τ_t α_t H(μ_t | ν_{τ_t}) + C t^{−1} E f(X_t),  (42)

where α_t is the LSI constant for the Gibbs measure ν_{τ_t}. By the Eyring-Kramers formula (20), for each ε > 0, there exist C > 0 and t_0 > 0 such that

τ_t α_t ≥ C t^{−(E*/E+ε)}  for t ≥ t_0.  (43)

Combining (42) with (40), (43), we get

(d/dt) H(μ_t | ν_{τ_t}) ≤ −C t^{−(E*/E+ε)} H(μ_t | ν_{τ_t}) + C′ t^{−1+ε}.  (44)

Fix ε ∈ (0, 1 − E*/E), and let

Q(t) := H(μ_t | ν_{τ_t}) − (C′/C) t^{−1+E*/E+2ε}.

Then for t sufficiently large and t ≥ t_0, we have (d/dt) Q(t) ≤ −C t^{−(E*/E+ε)} Q(t) by (44). This implies that Q(t) ≤ Q(t_0) e^{−C ∫_{t_0}^t s^{−(E*/E+ε)} ds}. Thus,

H(μ_t | ν_{τ_t}) ≤ (C′/C) t^{−(1−E*/E−2ε)} + H(μ_{t_0} | ν_{τ_{t_0}}) e^{−(C/κ)(t^κ − t_0^κ)},  (45)

where κ := 1 − E*/E − ε > 0. Note that the first term on the right hand side of (45) dominates, and the conclusion follows.  □

Finally, by injecting (26), (41) into (25), we get the desired estimate (5).

5. Discrete simulated annealing

This section is devoted to the proof of Theorem 2. The idea is close to that employed for the continuous-time SA process (1). However, the analysis is more complicated due to discretization, and additional tools from Vempala and Wibisono (2019) on the convergence of ULA are used. Recall that η_k is the step size at iteration k, and Θ_k := Σ_{j≤k} η_j is the cumulative step size up to iteration k.
Let μ_k be the probability density of x_k defined by (2), and

ν_{τ_{Θ_k}}(dx) = (1/Z_{τ_{Θ_k}}) exp(−f(x)/τ_{Θ_k}) dx,  (46)

where Z_{τ_{Θ_k}} := ∫_{R^d} exp(−f(x)/τ_{Θ_k}) dx is the normalizing constant. We divide the proof into four steps.

Step 1: Reduce μ_k to ν_{τ_{Θ_k}}. This step is similar to Steps 1 & 2 in Section 4. Let (x̃_k; k ≥ 0) be a sequence whose distribution is ν_{τ_{Θ_k}} at epoch k, coupled with (x_k; k ≥ 0) on the same probability space. Fix δ > 0. The same argument as in (25) shows that

P(f(x_k) > δ) ≤ P(f(x̃_k) > δ) + √( H(μ_k | ν_{τ_{Θ_k}})/2 ).  (47)

Assume that Θ_k → ∞ and τ_{Θ_k} ∼ E/ln Θ_k as k → ∞ with E > E*. By Lemma 3, we get a bound for the first term on the right hand side of (47). That is, for each ε ∈ (0, δ), there exists C > 0 (depending on ε, δ, f, E, d) such that

P(f(x̃_k) > δ) ≤ C Θ_k^{−(δ−ε)/E}.  (48)

So it remains to estimate H(μ_k | ν_{τ_{Θ_k}}), which is the task of the next three steps.

Step 2: Continuous-time coupling. To make use of continuous-time tools, we couple the sequence (x_k; k ≥ 0) by a continuous-time process (X_t; t ≥ 0) such that (X_{Θ_k}; k ≥ 0) has the same distribution as (x_k; k ≥ 0). Define the process X by

dX_t = −∇f(x_k) dt + √(2τ_{Θ_k}) dB_t,  t ∈ [Θ_k, Θ_{k+1}),  (49)

where we identify X_{Θ_k} with x_k. Note that the Fokker-Planck equation (13) plays an important role in the analysis of the continuous-time SA (1). It is desirable to get a version of the Fokker-Planck equation for the coupled process (49). The result is stated as follows.

Lemma 7. For t ∈ [Θ_k, Θ_{k+1}), the probability density μ_t of X_t defined by (49) satisfies the following equation:

∂μ_t/∂t = ∇·(τ_{Θ_k} ν_{τ_{Θ_k}} ∇(μ_t/ν_{τ_{Θ_k}})) + ∇·(μ_t E[∇f(x_k) − ∇f(X_t) | X_t = x]).  (50)

Proof. Let μ_{t|s}(x|y) be the conditional density of X_t = x given X_s = y. By conditioning on X_{Θ_k} = x_k, we have

∂μ_{t|Θ_k}(x|x_k)/∂t = ∇·(μ_{t|Θ_k}(x|x_k) ∇f(x_k)) + τ_{Θ_k} Δμ_{t|Θ_k}(x|x_k).
By integrating (51) against $\mu_{\Theta_k}$, and using the fact that $\mu_{t|\Theta_k}(x|x_k)\,\mu_{\Theta_k}(x_k) = \mu_t(x)\,\mu_{\Theta_k|t}(x_k|x)$, we get

$$\frac{\partial \mu_t}{\partial t} = \nabla\cdot\big(\mu_t(x)\,\mathbb{E}[\nabla f(x_k) \,|\, X_t = x]\big) + \tau_{\Theta_k}\Delta\mu_t. \tag{52}$$

Further, by (37) we have

$$\nabla\cdot\left(\tau_{\Theta_k}\,\nu_{\tau_{\Theta_k}}\nabla\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right) = \nabla\cdot(\mu_t\nabla f(x)) + \tau_{\Theta_k}\Delta\mu_t. \tag{53}$$

Combining (52) and (53) yields (50). □

There are two terms on the right-hand side of (50). Compared with (37), the first term is the usual Fokker–Planck term, while the second term corresponds to the discretization error.

Step 3: One-step analysis of $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$. Here we use the coupled process (49) to study the one-step decay of $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$.

Lemma 8. Let $f$ satisfy Assumptions 1, 2 & 3, and assume that the condition (4) for $\mu_0$ holds. Assume that $\tau_t \sim E/\ln t$ and $\frac{d}{dt}\tau_t = O(1/t)$ as $t \to \infty$ with $E > E^*$. Also assume that $\Theta_k \to \infty$ and $\eta_{k+1}/\Theta_k \to 0$ as $k \to \infty$. Then, for each $\varepsilon > 0$, there exist $C, C' > 0$ (depending on $\varepsilon, f, E, d$) such that

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) \le \left(1 - C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\right) H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k + \eta_{k+1}\Theta_k^{\varepsilon-1}\big). \tag{54}$$

Proof. Write

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) = \underbrace{H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}})}_{(a)} + \underbrace{\big(H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}})\big)}_{(b)}. \tag{55}$$

We first use the coupled process (49) to study the term (a).
Note that

$$\frac{d}{dt}H(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) = \int \frac{\partial \mu_t}{\partial t}\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx + \int \mu_t\,\frac{d}{dt}\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx$$
$$= \int \nabla\cdot\left(\tau_{\Theta_k}\nu_{\tau_{\Theta_k}}\nabla\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right)\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx + \underbrace{\int \nabla\cdot\big(\mu_t\,\mathbb{E}[\nabla f(x_k) - \nabla f(X_t) \,|\, X_t = x]\big)\ln\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)dx}_{(c)} + \frac{d}{dt}\int \mu_t\,dx$$
$$= -\tau_{\Theta_k}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) + (c), \tag{56}$$

where we use (50) in the second equality and (38) in the third (the term $\frac{d}{dt}\int\mu_t\,dx$ vanishes since $\mu_t$ integrates to one). Now we need to estimate the term (c) in (56). By integration by parts and the fact that $a\cdot b \le \frac{1}{2\tau_{\Theta_k}}|a|^2 + \frac{\tau_{\Theta_k}}{2}|b|^2$, we get

$$(c) = \mathbb{E}\left((\nabla f(X_t) - \nabla f(x_k))\cdot\nabla\log\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right) \le \frac{1}{2\tau_{\Theta_k}}\,\mathbb{E}|\nabla f(X_t) - \nabla f(x_k)|^2 + \frac{\tau_{\Theta_k}}{2}\,\mathbb{E}\left|\nabla\log\left(\frac{\mu_t}{\nu_{\tau_{\Theta_k}}}\right)\right|^2$$
$$\le \frac{L^2}{2\tau_{\Theta_k}}\,\mathbb{E}|X_t - x_k|^2 + \frac{\tau_{\Theta_k}}{2}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}), \tag{57}$$

where $L$ is the Lipschitz constant of $\nabla f$ from Assumption 3. Recall from (49) that $X_t - x_k = -\nabla f(x_k)(t - \Theta_k) + \sqrt{\tau_{\Theta_k}(t - \Theta_k)}\,Z$, where $Z$ is standard normal. Consequently,

$$\mathbb{E}|X_t - x_k|^2 = (t - \Theta_k)^2\,\mathbb{E}|\nabla f(x_k)|^2 + \tau_{\Theta_k}(t - \Theta_k)\,d \le \eta_{k+1}^2\,\mathbb{E}|\nabla f(x_k)|^2 + C\tau_{\Theta_k}\eta_{k+1}. \tag{58}$$

According to Lemma 2, $\nu_{\tau_{\Theta_k}}$ satisfies Talagrand's inequality with constant $\gamma_{\tau_{\Theta_k}} \sim \kappa\exp(-E^*/\tau_{\Theta_k})$. Further, by (Vempala and Wibisono, 2019, Lemma 10),

$$\mathbb{E}|\nabla f(x_k)|^2 \le \frac{C}{\gamma_{\tau_{\Theta_k}}}\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C. \tag{59}$$

Combining (57) with (58), (59) and the fact that $\tau_{\Theta_k} \sim E/\ln\Theta_k$ as $k \to \infty$, we have

$$(c) \le C\big(\eta_{k+1}^2\,\Theta_k^{E^*/E}\ln\Theta_k\big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C\big(\eta_{k+1} + \eta_{k+1}^2\ln\Theta_k\big) + \frac{\tau_{\Theta_k}}{2}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}). \tag{60}$$
Injecting (60) into (56), and further applying Lemma 1, we get

$$\frac{d}{dt}H(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) \le -\frac{\tau_{\Theta_k}}{2}\,I(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2\,\Theta_k^{E^*/E}\ln\Theta_k\big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1} + \eta_{k+1}^2\ln\Theta_k\big)$$
$$\le -C\,\Theta_k^{-(E^*/E)-\varepsilon}\,H(\mu_t \,|\, \nu_{\tau_{\Theta_k}}) + C'\Big(\eta_{k+1}^2\,\Theta_k^{(E^*/E)+\varepsilon}\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + \eta_{k+1} + \eta_{k+1}^2\ln\Theta_k\Big).$$

Now, by a Grönwall argument, we have

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}}) \le e^{-C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}}\Big(\big(1 + C'\eta_{k+1}^2\,\Theta_k^{(E^*/E)+\varepsilon}\big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k\big)\Big)$$
$$\le e^{-C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}}\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k\big)$$
$$\le \Big(1 - C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\Big)\,H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\big(\eta_{k+1}^2 + \eta_{k+1}^2\ln\Theta_k\big), \tag{61}$$

where we use the fact that $\eta_{k+1}\Theta_k^{(E^*/E)+\varepsilon} \to 0$ as $k \to \infty$ in the second inequality (after adjusting the constant $C$).

Now we consider the term (b) in (55). Note that

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}}) = \ln\left(\frac{Z_{\tau_{\Theta_{k+1}}}}{Z_{\tau_{\Theta_k}}}\right) + \left(\frac{1}{\tau_{\Theta_{k+1}}} - \frac{1}{\tau_{\Theta_k}}\right)\mathbb{E}f(x_{k+1}) \le C\,\frac{\eta_{k+1}}{\Theta_k}\,\mathbb{E}f(x_{k+1}). \tag{62}$$

We claim that $\mathbb{E}f(x_{k+1}) \le C$ for some constant $C > 0$ independent of $k$. Choose $C > 0$ large, and suppose toward a contradiction that $\mathbb{E}f(x_{k+1})$ is the first term exceeding $C$. By Assumption 3,

$$f(x_{k+1}) \le f(x_k) - \eta_{k+1}|\nabla f(x_k)|^2 + \sqrt{\tau_{\Theta_k}\eta_{k+1}}\,\nabla f(x_k)\cdot Z_k + \frac{L}{2}\big|\eta_{k+1}\nabla f(x_k) - \sqrt{\tau_{\Theta_k}\eta_{k+1}}\,Z_k\big|^2.$$

Further, by taking expectations, we get

$$\mathbb{E}f(x_{k+1}) - \mathbb{E}f(x_k) \le -\eta_{k+1}\left(1 - \frac{\eta_{k+1}L}{2}\right)\mathbb{E}|\nabla f(x_k)|^2 + \frac{Ld}{2}\,\tau_{\Theta_k}\eta_{k+1}. \tag{63}$$

Thus, $\mathbb{E}f(x_{k+1}) - \mathbb{E}f(x_k) \le \frac{Ld}{2}\tau_{\Theta_k}\eta_{k+1}$, which implies that $\mathbb{E}f(x_k) > C - 1$ for $k$ large enough. By Assumption 1(ii), $\mathbb{E}|\nabla f(x_k)|^2 > C'$ for some $C' > 0$. Combining with (63), we have $\mathbb{E}f(x_k) > \mathbb{E}f(x_{k+1}) \ge C$ for $k$ large enough. This contradicts the fact that $\mathbb{E}f(x_{k+1})$ is the first term exceeding $C$. Now, by (62), we get

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_k}}) \le C\eta_{k+1}\Theta_k^{-1} \le C\eta_{k+1}\Theta_k^{\varepsilon-1}. \tag{64}$$

Combining (55) with (61) and (64) yields (54). □
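The discrete dynamics that Lemma 8 controls — a gradient step plus Gaussian noise scaled by the current temperature, under the logarithmic cooling $\tau_t \sim E/\ln t$ — can be sketched numerically. The following Python snippet is purely illustrative, not the paper's experiment: the double-well $f$, the value of $E$, and the step-size schedule $\eta_k = 0.05/(k+1)$ (so that $\Theta_k$ grows logarithmically, in the spirit of the step-size condition) are all assumed demo choices.

```python
import numpy as np

# Illustrative double-well: global minimum near x = 1.04, local minimum near x = -0.96.
def f(x):
    return (x**2 - 1.0)**2 - 0.3 * x

def grad_f(x):
    return 4.0 * x * (x**2 - 1.0) - 0.3

E = 2.0                       # cooling parameter (demo value; the theory needs E > E*)

def tau(theta):               # logarithmic cooling tau_t ~ E / ln t
    return E / np.log(np.e + theta)

def eta(k):                   # step sizes; Theta_k = sum_j eta_j grows like log k
    return 0.05 / (k + 1)

rng = np.random.default_rng(0)
n_runs, K = 200, 20000
x = np.full(n_runs, -1.0)     # start every run in the wrong (local) basin
theta = 0.0                   # cumulative step size Theta_k
for k in range(K):
    h = eta(k)
    # x_{k+1} = x_k - eta_{k+1} grad f(x_k) + sqrt(tau_{Theta_k} eta_{k+1}) Z_k
    x = x - h * grad_f(x) + np.sqrt(tau(theta) * h) * rng.standard_normal(n_runs)
    theta += h

grid = np.linspace(-2.0, 2.0, 4001)
min_f = f(grid).min()         # numerical proxy for min f
delta = 0.2
frac = float(np.mean(f(x) <= min_f + delta))

# Plain gradient descent from the same start stays stuck in the local basin.
y = -1.0
for k in range(K):
    y = y - eta(k) * grad_f(y)

print(f"annealed runs within delta of min f: {frac:.2f}")
print(f"gradient-descent gap above min f:    {f(y) - min_f:.2f}")
```

The point of the sketch is qualitative: the deterministic descent from the same start never leaves the local basin, while the annealed iterates can cross the barrier; the exact success fraction depends on the illustrative schedule and horizon.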
Step 4: Estimating $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$. We use Lemma 8 to derive an estimate for $H(\mu_k \,|\, \nu_{\tau_{\Theta_k}})$. Under the condition (6), the term $\eta_{k+1}\Theta_k^{\varepsilon-1}$ dominates $\eta_{k+1}^2$ and $\eta_{k+1}^2\ln\Theta_k$ as $k \to \infty$. Thus, the recursion (54) yields

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) \le \left(1 - C\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\right) H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) + C'\eta_{k+1}\Theta_k^{\varepsilon-1}.$$

Since $E^*/E < 1$, a similar argument as in Lemma 6 shows that

$$H(\mu_{k+1} \,|\, \nu_{\tau_{\Theta_{k+1}}}) - C\,\Theta_k^{-(1-\frac{E^*}{E}-\varepsilon)} \le \left(1 - C'\eta_{k+1}\Theta_k^{-(E^*/E)-\varepsilon}\right)\left(H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) - C\,\Theta_{k-1}^{-(1-\frac{E^*}{E}-\varepsilon)}\right).$$

This, together with the condition (7), implies that

$$H(\mu_k \,|\, \nu_{\tau_{\Theta_k}}) \le C\,\Theta_k^{-(1-\frac{E^*}{E}-\varepsilon)}. \tag{65}$$

By injecting (48) and (65) into (47), we obtain (8). □

Conclusion

In this paper, we study the convergence rate of SA in both the continuous and discrete settings. The main tool is functional inequalities for the Gibbs measure at low temperatures. We prove that the tail probability, in both settings, exhibits a polynomial decay in time. The decay rate is also given as a function of the model parameters. In the discrete setting, we derive a condition on the step size that ensures convergence to the global minimum. This condition may be useful in tuning the step size.

There are a few directions in which to extend this work. For instance, one can study the convergence rate of SA for Lévy flights with a suitable cooling schedule. Another problem is to study the dependence of the convergence rate on the dimension $d$. This requires a deep understanding of the Eyring–Kramers law in high dimensions, and is related to the Laplace approximation of high-dimensional integrals. Both problems are worth exploring, but may be challenging.

Acknowledgement: Tang thanks Georg Menz for helpful discussions. He also gratefully acknowledges financial support through a start-up grant at Columbia University.
Zhou gratefully acknowledges financial support through a start-up grant at Columbia University and through the Nie Center for Intelligent Asset Management.

References

D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion operators, volume 348 of Grundlehren der Mathematischen Wissenschaften. Springer, 2014.
A. Bovier, M. Eckhoff, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes. I. Sharp asymptotics for capacities and exit times. J. Eur. Math. Soc., 6(4):399–424, 2004.
A. Bovier, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes. II. Precise asymptotics for small eigenvalues. J. Eur. Math. Soc., 7(1):69–99, 2005.
J. A. Carrillo, R. J. McCann, and C. Villani. Contractions in the 2-Wasserstein length space and thermalization of granular media. Arch. Ration. Mech. Anal., 179(2):217–263, 2006.
V. Černý. Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J. Optim. Theory Appl., 45(1):41–51, 1985.
X. Chen, S. S. Du, and X. T. Tong. On stationary-point hitting time and ergodicity of stochastic gradient Langevin dynamics. J. Mach. Learn. Res., 21: Paper No. 68, 41 pp., 2020.
A. S. Cherny and H.-J. Engelbert. Singular stochastic differential equations, volume 1858 of Lecture Notes in Mathematics. Springer-Verlag, 2005.
T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diffusion for global optimization in R^n. SIAM J. Control Optim., 25(3):737–753, 1987.
D. Delahaye, S. Chaimatanan, and M. Mongeau. Simulated annealing: from basics to applications. In Handbook of metaheuristics, volume 272 of Internat. Ser. Oper. Res. Management Sci., pages 1–35. Springer, 2019.
A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
S. N. Ethier and T. G. Kurtz. Markov processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., 1986.
H. Eyring. The activated complex in chemical reactions. J. Chem. Phys., 3(2):107–115, 1935.
H. Fang, M. Qian, and G. Gong. An improved annealing method and its large-time behavior. Stochastic Process. Appl., 71(1):55–74, 1997.
N. Fournier and C. Tardif. On the simulated annealing in R^d. 2020. arXiv:2003.06360.
X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin diffusions. 2020. arXiv:2005.04507.
R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.
S. Geman and C.-R. Hwang. Diffusions for global optimization. SIAM J. Control Optim., 24(5):1031–1043, 1986.
U. Grenander and M. I. Miller. Representations of knowledge in complex systems. J. Roy. Statist. Soc. Ser. B, 56(4):549–603, 1994.
X. Guo, J. Han, and W. Tang. Perturbed gradient descent with occupation time. 2020. arXiv:2005.04507.
R. A. Holley, S. Kusuoka, and D. W. Stroock. Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Funct. Anal., 83(2):333–347, 1989.
P. Jain and P. Kar. Non-convex optimization for machine learning. Found. Trends Mach. Learn., 10:142–336, 2017.
C. Jin, R. Ge, P. Netrapalli, S. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In ICML, pages 1724–1732, 2017.
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
C. Koulamas, S. Antony, and R. Jaen. A survey of simulated annealing applications to operations research problems. Omega, 22(1):41–56, 1994.
H. A. Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
Y. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. Bartlett, and M. I. Jordan. Is there an analog of Nesterov acceleration for MCMC? 2019a. arXiv:1902.00996.
Y. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proc. Natl. Acad. Sci.
USA, 116(42):20881–20885, 2019b.
D. Márquez. Convergence rates for annealing diffusion processes. Ann. Appl. Probab., 7(4):1118–1139, 1997.
G. Menz and A. Schlichting. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Ann. Probab., 42(5):1809–1884, 2014.
G. Menz, A. Schlichting, W. Tang, and T. Wu. Ergodicity of the infinite swapping algorithm at low temperature. 2018. arXiv:1811.10174.
S. P. Meyn and R. L. Tweedie. Stability of Markovian processes. III. Foster-Lyapunov criteria for continuous-time processes. Adv. in Appl. Probab., 25(3):518–548, 1993.
L. Miclo. Recuit simulé sur R^n. Étude de l'évolution de l'énergie libre. Annales de l'Institut Henri Poincaré, 28(2):235–266, 1992.
L. Miclo. Une étude des algorithmes de recuit simulé sous-admissibles. Ann. Fac. Sci. Toulouse Math. (6), 4(4):819–877, 1995.
P. Monmarché. Hypocoercivity in metastable settings and kinetic simulated annealing. Probability Theory and Related Fields, pages 1–34, 2018.
F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal., 173(2):361–400, 2000.
G. Parisi. Correlation functions and computer simulations. Nuclear Phys. B, 180(3, FS 2):378–384, 1981.
I. Pavlyukevich. Lévy flights, non-local search and simulated annealing. J. Comput. Phys., 226(2):1830–1844, 2007.
M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In COLT, pages 1674–1703, 2017.
G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
L. C. G. Rogers and D. Williams. Diffusions, Markov processes, and martingales. Vol. 2, Itô Calculus. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., 1987.
G. Royer.
An initiation to logarithmic Sobolev inequalities, volume 14 of SMF/AMS Texts and Monographs. American Mathematical Society, 2007.
A. Schlichting. The Eyring-Kramers formula for Poincaré and logarithmic Sobolev inequalities. PhD thesis, Universität Leipzig, 2012. Available at http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-97965.
Z. Shun and P. McCullagh. Laplace approximation of high-dimensional integrals. J. Roy. Statist. Soc. Ser. B, 57(4):749–760, 1995.
D. W. Stroock and S. R. S. Varadhan. Multidimensional diffusion processes, volume 233 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, 1979.
W. Tang. Exponential ergodicity and convergence for generalized reflected Brownian motion. Queueing Syst., 92(1-2):83–101, 2019.
A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, 2009.
P. J. M. van Laarhoven and E. H. L. Aarts. Simulated annealing: theory and applications, volume 37 of Mathematics and its Applications. D. Reidel Publishing Co., 1987.
S. Vempala and A. Wibisono. Rapid convergence of the Unadjusted Langevin Algorithm: isoperimetry suffices. In NeurIPS, volume 32, pages 8094–8106, 2019.
H. Wang, T. Zariphopoulou, and X. Y. Zhou. Reinforcement learning in continuous time and space: a stochastic control approach. J. Mach. Learn. Res., 21:1–34, 2020.
P. A. Zitt. Annealing diffusions in a potential function with a slow growth. Stochastic Process. Appl., 118(1):76–119, 2008.

Department of Industrial Engineering and Operations Research, Columbia University.
Email address: [email protected]

Department of Industrial Engineering and Operations Research, Columbia University.
Email address: