A thorough study of the performance of simulated annealing with geometric cooling in correlated and long tailed spatial scenarios
Roberto da Silva, Eliseu Venites Filho, Alexandre Alves

Abstract
Metaheuristics, such as the simulated annealing used in the optimization of disordered systems, go beyond physics, and the travelling salesman problem is a paradigmatic NP-complete problem that allows one to infer important theoretical properties of the algorithm in different random environments. Many versions of the algorithm are explored in the literature, but so far the effects of the statistical distribution of the coordinates of the cities on the performance of the algorithm have been neglected. We propose a simple way to explore this aspect by analyzing the performance of a standard version of the simulated annealing (geometric cooling) in correlated systems, with a simple and useful method based on a linear combination of independent random variables. Our results suggest that performance depends on the shape of the statistical distribution of the coordinates but not necessarily on its variance, as corroborated by the cases of uniform and normal distributions. On the other hand, a study with different power laws (different decay exponents) for the coordinates always produces different performances. We show that the performance of the simulated annealing, even in its best version, is not improved when the distribution of the coordinates does not have a defined first moment. However, surprisingly, we still observe improvements in situations where the second moment is not defined, but not where the first one is not defined. Finite size scaling, fits, and universal laws support all of our results. In addition, our study shows when the cost must be scaled.
Contents

1 Introduction
2 Pedagogical aspects of the Simulated Annealing in the context of the Travelling salesman problem
3 Correlated and long-tailed environments
4 Results
5 Conclusions

Preprint submitted to Elsevier, January 1, 2021

1. Introduction

Some magnetic systems, known as spin glasses, present random couplings which lead to high levels of frustration and disorder. The metastability of many structures of such spin glasses leads to experimental physical measurements on the order of days, due to the slow decay of the magnetization to zero. This peculiarity is also hard to explore in computer simulations. Sherrington and Kirkpatrick [1] proposed an interesting exactly solvable spin-glass model that shows such slow dynamics of the magnetization.

Based on the annealing phenomenon in solids (the heat treatment of materials to enlarge their ductility), Kirkpatrick, Gelatt, and Vecchi [2] observed a deep connection between the Statistical Mechanics of systems with many degrees of freedom, which can be driven to thermal equilibrium by the Metropolis prescription [3], and the combinatorial optimization of problems related to obtaining the minima of functions with many parameters in complex landscapes.

Many combinatorial optimization problems belong to the complexity class known as NP-complete, which is a subset of the class simply known as NP. Let us better explain this fundamental point in the complexity of algorithms. A problem belongs to the class NP if it can be decided in polynomial time by a nondeterministic Turing machine (a universal computer model; for more details see [4]). It is important to say that this does not necessarily mean that it possesses an algorithm that can solve it in polynomial time. Of course, we know that there are problems which can be solved by an algorithm in polynomial time (a fast solution) and not only decided! A simple example is sorting a list. An algorithm in these conditions belongs to a complexity class known as P, and obviously P ⊂ NP. However, there are some "rebel" problems in NP, the so-called NP-complete problems, which can be understood, in a non-rigorous way, as the hardest problems in NP, since to date there is no known algorithm that solves one of these problems in polynomial time, differently from what occurs with problems in P. Should we then believe in the conjecture P ≠ NP? Yes, and the reason is simple!
If we know one NP-complete problem, any algorithm that can solve it can be used to solve any other NP-complete problem (this is technically called polynomial reduction; for a good reference on complexity theory, see again [4]). Therefore, if somebody discovers a polynomial-time algorithm that solves one of these NP-complete problems, as a consequence all of them will also have solutions in polynomial time. However, such an algorithm has never been found, which makes the conjecture stronger.

But what can one do with these problems? An alternative is to use heuristics, and physically motivated ones can be good suggestions. Thus, based on Ref. [2], which proposed a metaheuristic known as simulated annealing (SA), an algorithm which performs a random walk in the configuration space until it reaches the global equilibrium or, at worst, an interesting local minimum, physicists and computer scientists have explored several different combinatorial problems, including, for example, the study of the thermodynamics of protein folding [5], the evaluation of periodic orbits of multi-electron atomic systems [6], and many others.

Basically, the SA algorithm works in the following way: we start with an initial solution/configuration σ_old and calculate the energy (cost, in a more general context) of this initial solution, E(σ_old). After that, another state, denoted by σ_new, is randomly chosen, and this new solution is accepted with a probability that follows the Metropolis prescription [3]:

Pr(σ_old → σ_new) = min{1, exp[−(E(σ_new) − E(σ_old))/T]},   (1)

otherwise the system remains in the state σ_old. This process is repeated until the sampled ensemble reaches equilibrium at the given temperature. Finally, the temperature T is decreased according to a cooling schedule.
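In code, the acceptance rule of Eq. 1 can be sketched as follows (a minimal sketch; the function name and the way energies are passed in are our own illustrative choices, not the paper's):

```python
import math
import random

def metropolis_accept(e_old, e_new, temperature, rng=random):
    """Decide whether to accept a proposed move, following Eq. 1."""
    if e_new <= e_old:
        # Downhill (or equal-energy) moves are always accepted.
        return True
    # Uphill moves are accepted with probability exp(-(E_new - E_old)/T).
    return rng.random() < math.exp(-(e_new - e_old) / temperature)
```

At high T almost every proposal is accepted; as T → 0 the rule reduces to pure energy descent, which is why the cooling schedule matters.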
The temperature is decreased until a state with low enough energy is found. Here, it is important to mention that many cooling schedules can be applied. There are cooling schedules that asymptotically converge towards the global minimum, such as the one that cools the system with a logarithmic rule [7]. However, such a schedule converges very slowly and requires a large computation time. There are good alternative schedules which, although without a rigorous guarantee of convergence towards the global optimum, are computationally faster. One of them is the geometric cooling schedule, which considers that at time t the temperature is given by T_t = α^t T_0, with 0 < α < 1. An interesting version of this heuristic works with two loops, an internal and an external one, and can be summarized by the SAGCS heuristic (Simulated Annealing with Geometric Cooling Schedule) presented in Algorithm 1.
Algorithm 1:
SAGCS (Standard Simulated Annealing with Geometric Cooling Schedule)

parameters: σ_0: initial configuration; T_0: initial temperature; T_min: final temperature; N_iter: number of iterations of the internal loop; α: coefficient of the geometric cooling
Result: low energy state σ

T ← T_0; σ ← σ_0;
while T > T_min do
    for i ← 1 to N_iter do
        σ_new ← randomState();
        if E(σ_new) < E(σ) then
            σ ← σ_new;
        else
            x ← rand([0, 1]);
            if x < exp[−(E(σ_new) − E(σ))/T] then
                σ ← σ_new;
            end
        end
    end
    T ← αT;
end

It is important to observe that this heuristic works with two parameters: N_iter, which controls the number of internal iterations, while the external loop (iterations over different temperatures) is governed by

N_steps = ⌊ln(T_min/T_0)/ln α⌋,   (2)

and, of course, for appropriate N_steps and N_iter the system must converge to a local minimum that is very close to the global minimum. Amongst the many interesting NP-complete problems, the traveling salesman problem (TSP) deserves much attention, and it became a natural way to test the SA heuristic using different cooling schedules, variations of the method, and parallelization techniques in many different contexts (see for example [10, 11, 12, 13]). The TSP is to find the optimal Hamiltonian cycle in a given graph (vertices and valued edges), i.e., the minimum cost closed path including all vertices, passing through each vertex only once. Let us consider graphs with N vertices corresponding to points (x_i, y_i), i = 1, ..., N, distributed in a two-dimensional scenario.
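Algorithm 1 translates almost line by line into Python. The sketch below assumes a user-supplied energy function and neighbor generator; both names are placeholders of ours, not part of the paper:

```python
import math
import random

def sagcs(sigma0, energy, random_state, t0, t_min, n_iter, alpha, rng=random):
    """Simulated annealing with geometric cooling (sketch of Algorithm 1).

    sigma0:       initial configuration
    energy:       callable E(sigma)
    random_state: callable proposing a new configuration from the current one
    t0, t_min:    initial and final temperatures
    n_iter:       internal-loop iterations per temperature
    alpha:        geometric cooling coefficient, 0 < alpha < 1
    """
    sigma, t = sigma0, t0
    e = energy(sigma)
    while t > t_min:
        for _ in range(n_iter):
            candidate = random_state(sigma)
            e_new = energy(candidate)
            # Metropolis rule of Eq. 1 (downhill always; uphill with prob.)
            if e_new < e or rng.random() < math.exp(-(e_new - e) / t):
                sigma, e = candidate, e_new
        t *= alpha  # geometric cooling: T -> alpha * T
    return sigma
```

As a toy usage, minimizing f(x) = x² with Gaussian proposals drives the state close to 0, illustrating the two-loop structure before applying it to the TSP.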
In this case, the graph naturally has valued edges defined by the Euclidean distances:

d(i, j) = √[(x_i − x_j)² + (y_i − y_j)²],   (3)

with i, j = 1, ..., N, and therefore the simplest situation is to consider the graph having all edges, since all costs d(i, j) can be defined for all pairs of points, which is denoted a complete graph.

In this paper we are not interested in testing cooling schedules, or even other SA heuristics, which are very well explored in the literature, but in building computer experiments to test the SAGCS on the TSP in order to understand its efficiency considering the effects of correlation and variance on the random coordinates x_i and y_i. It is important to mention that our method to include correlation is very simple and allows one to change the correlation while keeping the same variance, which is very important for a fair comparison: if we varied the two parameters simultaneously, we would not know whether the effects are caused only by the correlation.

Our results show how the performance of the algorithm transits from d = 2 (ρ = 0) to d = 1 (ρ = 1) by deforming the region where the points (cities) are distributed. We also investigate a parameter that controls the moments of the random variables x_i and y_i, and we quantitatively show how the increase of the variance affects the performance. We particularly show that the shape of the distribution is more important than its variance in certain situations. On the other hand, when the points have coordinates that are power-law distributed, in the region where the average cannot be defined, the efficiency of the algorithm is negligible independently of scaling.

In the next section we will show some fundamental and pedagogical aspects of the SA for the TSP. In the following, in section 3, we define the different environments where we will apply the SAGCS. We will show in detail how to generate points with correlated coordinates, and how to generate points with power-law distributed coordinates. In section 4 we show our results and finally in section 5 we present some conclusions.
2. Pedagogical aspects of the Simulated Annealing in the context of the Travelling salesman problem
The simulated annealing with geometric cooling schedule (SAGCS) is a very simple heuristic used to optimize combinatorial problems such as the Traveling Salesman Problem. The points (x_i, y_i), i = 1, ..., N, are scattered over a two-dimensional region, where the coordinates follow some joint probability distribution p(x_i, y_i).

Denote by v(i) ≡ (x_i, y_i) the i-th point of the cycle C ≡ v(1) → v(2) → ... → v(N−1) → v(N) → v(1), whose cost is given by

E(C) = Σ_{k=1}^{N} d(v(k), v(k+1)),   (4)

where v(N+1) = v(1) and d(i, j) is given by Eq. 3. We aim to determine the best cycle C, i.e., the cycle of minimal cost.

In this work the new configuration can be obtained by different mechanisms. The first one is the "simple swap" (SS): a point is randomly chosen, for example v(j), and the candidate new configuration C′ corresponds to C with the positions of v(j) ≡ (x_j, y_j) and v(j+1) ≡ (x_{j+1}, y_{j+1}) exchanged; then we can apply the Metropolis prescription to optimize the problem in the context of the simulated annealing.

On the other hand, there is a more interesting way to obtain a new configuration, according to [8, 9]: the 2-opt move. It is a popular procedure used to improve algorithms that approximate the solution of the Travelling Salesman Problem. Starting from a given cycle, it consists in exchanging two links of the cycle to construct a new one. This is performed by reversing the sequence of nodes between the selected links and then reconnecting the cycle back together. For example, if one has a route/cycle C ≡ v(1) → v(2) → ... → v(8) → v(1), choosing two links produces a cycle C′ in which the nodes between the selected links appear in reversed order. It allows us to explore the state space of Hamiltonian cycles efficiently, and it provides a simple way to calculate the energy difference between the states before and after the move, which is a very important feature for working in conjunction with the Metropolis algorithm.

Now, let us explore the complexity of the problem. There are N! different cycles. Testing all of them is impossible for large N! For example, using Stirling's approximation N! ≈ √(2πN) N^N e^{−N}, one estimates 2048! to be of order 10^{5894}. This is a staggering number when compared, for example, with the number of atoms of the universe, of order 10^{80}. Actually, among this "astronomical" number of different cycles (much more than astronomical), certainly there are a lot of them corresponding to good local minima. It is exactly here that SA shows its usefulness. Let us understand how it works by showing the power of this algorithm, examining Fig. 1.

This figure shows three histograms: the red one, represented in Fig. 1 (a), shows a sample of the costs of one million randomly drawn cycles of the same configuration, with coordinates x_i ∈ [−1, 1] and y_i ∈ [−1, 1]. The black one, Fig. 1 (b), represents a histogram with only 400 different costs obtained from N_run = 400 different runs of the SA over random initial configurations, using the simple swap and geometric cooling with α = 0.99.
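The 2-opt move described above, together with its cheap cost difference (only the two exchanged links change, so the update is O(1)), can be sketched as follows; the distance-matrix helper implements Eq. 3, and all names are ours:

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances d(i, j) of Eq. 3 for a complete graph."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            xj, yj = points[j]
            d[i][j] = d[j][i] = math.hypot(xi - xj, yi - yj)
    return d

def two_opt_delta(route, i, j, dist):
    """Cost change of reversing route[i+1 .. j]: only two links are replaced."""
    n = len(route)
    a, b = route[i], route[(i + 1) % n]   # first removed link a -> b
    c, d = route[j], route[(j + 1) % n]   # second removed link c -> d
    return (dist[a][c] + dist[b][d]) - (dist[a][b] + dist[c][d])

def two_opt_apply(route, i, j):
    """Return the new cycle with the segment between the links reversed."""
    return route[:i + 1] + route[i + 1:j + 1][::-1] + route[j + 1:]
```

On a unit square with a self-crossing tour, a single 2-opt move removes the crossing, and the delta matches the full-cost difference, which is exactly why it pairs well with the Metropolis rule.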
Figure 1: The reason why the SA is so important. We show the histogram of the costs of the 10^6 randomly drawn cycles (a), 400 different runs of the SA using the simple swap (b), and 400 different runs of the SA using the 2-opt choice for new configurations (c). The green line is only to highlight the huge difference between our best result obtained with SA and the "brute force sampling".

One can observe two very distinct Gaussian distributions whose means differ by about 1000 cost units. It is important to notice that not even one among the one million costs of randomly drawn Hamiltonian cycles reached the cost of the SA. Actually, it is worse than this: the distribution of C_rand lies so far above the distribution of C_SA that the overlapping probability between the two distributions is zero within the precision of any known machine! It is important to mention that we provocatively used only N_run = 400 runs of the SA against the brute force (randomly drawn cycles).

The final blow comes when we use the 2-opt choice to sample new configurations in the SAGCS (2-opt-SAGCS), which is shown in Fig. 1 (c), where we used exactly the same parameters as in Fig. 1 (b). In this case one obtains an extremely low cost when compared with both the brute-force method and the simple swap.

Thus, this pedagogical explanation is only to show that exhaustive sampling is not a solution for combinatorial problems. Heuristics such as the SA are an important alternative to obtain good solutions to the TSP (not always the optimal one, but in very disordered systems this may not be a meaningful difference). Now, after this preparatory study, we present the details of the scenarios in which we intend to analyze the performance of the SAGCS. In the next section we will present the method to generate points with correlated coordinates, and how to generate points with long-tailed distributed coordinates. In these environments, we intend to explore some effects on the performance of the SA, which will be done in section 4.

3. Correlated and long-tailed environments
One of the important questions addressed in this paper is to study how the SAGCS works considering that x_i and y_i are random variables with a correlation ρ. On the other hand, for the particular case p(x_i, y_i) = p(x_i) p(y_i) (ρ = 0), we consider long-tailed distributions p(x_i) ∼ x_i^{−γ} and p(y_i) ∼ y_i^{−γ}, looking at the effect of γ on the performance of the algorithm. Now, we discuss how to generate scenarios following these prescriptions.

In this section we show that we can generate correlated random variables from non-correlated random variables, considering that both (correlated and non-correlated) have the same variance and average, by imposing an additional constraint: the average of the uncorrelated random variables is equal to 0. This is exactly the same procedure used in [14] in the context of the emergence of rogue waves in the superposition of electrical waves with correlated phases.

Let us consider spatial coordinates x and y of points in a two-dimensional environment that are random variables:

x = α_1 z_1 + α_2 z_2
y = β_1 z_1 + β_2 z_2   (5)

where z_1 and z_2 are independent and identically distributed random variables, such that ⟨z_1⟩ = ⟨z_2⟩ = ⟨z⟩ and ⟨z_1 z_2⟩ = ⟨z_1⟩⟨z_2⟩ = ⟨z⟩².

The variance of the variable x is given by

⟨(Δx)²⟩ = ⟨x²⟩ − ⟨x⟩² = (α_1² + α_2²) ⟨(Δz)²⟩,

where ⟨z_1²⟩ − ⟨z_1⟩² = ⟨z_2²⟩ − ⟨z_2⟩² = ⟨z²⟩ − ⟨z⟩² = ⟨(Δz)²⟩, and similarly ⟨(Δy)²⟩ = (β_1² + β_2²) ⟨(Δz)²⟩. Now, we impose the condition

⟨(Δx)²⟩ = ⟨(Δy)²⟩ = ⟨(Δz)²⟩,   (6)

which implies α_1² + α_2² = β_1² + β_2² = 1. Although z_1 and z_2 are non-correlated random variables, x and y are correlated, and the correlation between these random variables can be calculated as

ρ = ⟨(x − ⟨x⟩)(y − ⟨y⟩)⟩ / √(⟨(Δx)²⟩ ⟨(Δy)²⟩).   (7)

Thus, after some cancellations and combinations, ρ = α_1 β_1 + α_2 β_2. And after a little algebra, the random variables can be written as

x = sin(½ sin⁻¹ρ) z_1 + cos(½ sin⁻¹ρ) z_2   (8)

and

y = cos(½ sin⁻¹ρ) z_1 + sin(½ sin⁻¹ρ) z_2,   (9)

which have the same average, given by

⟨x⟩ = ⟨y⟩ = (1/√2) [(1 − √(1 − ρ²))^{1/2} + (1 + √(1 − ρ²))^{1/2}] ⟨z⟩.

From this, we can draw two important conclusions:

1. The random variables x and y have the same variance as z_1 and z_2, which are identically distributed; we required this according to Eq. 6, and therefore it does not depend on ρ;
2. If ⟨z_1⟩ = ⟨z_2⟩ = ⟨z⟩ = 0, then ⟨x⟩ = ⟨y⟩ = 0.

Thus, if one considers x and y as ρ-correlated random variables generated from two independent random variables z_1 and z_2 with average zero and variance σ² = ⟨(Δz)²⟩, then x and y also have average zero and the same variance σ². This is an important point, because working with different averages and dispersions between the cases ρ = 0 and ρ ≠ 0 would spoil a fair comparison.
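Eqs. 8 and 9 can be implemented directly. The sketch below uses uniform z's on [−1, 1] and empirically checks that the induced Pearson correlation equals ρ (the function name is ours, not the paper's):

```python
import math
import random

def correlated_pair(rho, rng=random):
    """One point (x, y) with correlation rho, built from Eqs. 8 and 9."""
    z1 = 2.0 * rng.random() - 1.0   # uniform on [-1, 1], zero mean
    z2 = 2.0 * rng.random() - 1.0
    s = math.sin(0.5 * math.asin(rho))
    c = math.cos(0.5 * math.asin(rho))
    return s * z1 + c * z2, c * z1 + s * z2
```

Since s² + c² = 1 and 2sc = sin(sin⁻¹ρ) = ρ, the variances of x and y equal that of z while the correlation equals ρ, exactly the two properties derived above.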
Figure 2: Effects of correlation on the points using Eqs. 8 and 9.
For example, using two uniform identically distributed random variables z_1 = (2ξ_1 − 1) and z_2 = (2ξ_2 − 1), which assume values in [−1, 1] and are generated from two uniform random variables ξ_1 and ξ_2 assuming values in [0, 1], Fig. 2 shows a plot of y versus x obtained from Eqs. 8 and 9 for different values of ρ. In this paper, the idea is to study the SAGCS on these correlated scenarios by analyzing its performance.

Another important point of our study is to look at the effects of long-tailed distributions of the coordinates of the points on the SA performance. Thus, we use a power-law probability density function to generate the coordinates of the two-dimensional points. For that, we initially propose the following distribution for the coordinates:

p(x; x_0, γ) = [(γ − 1)/2] x_0^{γ−1} |x|^{−γ} if |x| ≥ x_0, and 0 otherwise,   (10)

a two-tailed power law with a gap of width Δ = 2x_0 centered at the origin, since the two branches cannot touch the origin by normalization, but x_0 can theoretically be made as small as we want. Just for illustration, we show plots of p(x; x_0, γ) in Fig. 3.

Figure 3: Two-tailed power law distribution. Fig. (a): a plot of p(x; x_0, γ) for several values of x_0. Fig. (b): a plot of p(x; x_0, γ) for x_0 = 10^{−3} and different values of γ. The histogram is obtained with points sampled with Eq. 11, only to check the formulae.
To draw a variable that follows the distribution of Eq. 10, one simply uses two uniform random variables (or, more precisely, pseudo-random variables) ξ_1 and ξ_2 assuming values in [0, 1]. First, one chooses whether the variable falls on the positive or on the negative branch; this is performed by checking the sign of (ξ_2 − 1/2). After that, one sets ξ_1 equal to the area (γ − 1) x_0^{γ−1} ∫_{x_0}^{w} x^{−γ} dx, which results in w = x_0 (1 − ξ_1)^{−1/(γ−1)}. Therefore, in the general case, the random variable built from ξ_1 and ξ_2 is

x = [(ξ_2 − 1/2)/|ξ_2 − 1/2|] x_0 (1 − ξ_1)^{−1/(γ−1)},   (11)

and naturally, with two other random variables ξ_3 and ξ_4, one has

y = [(ξ_4 − 1/2)/|ξ_4 − 1/2|] x_0 (1 − ξ_3)^{−1/(γ−1)}.   (12)

In Fig. 3 (b), a histogram of points sampled in this way is shown, only to check the agreement with the exact result. The idea is to observe the performance of the SAGCS in this scenario, answering how important γ and its effects are on the final cost obtained with the SA algorithm. Notice that points generated by Eqs. 11 and 12 have border effects. For instance, in Fig. 4, we present some examples of the points scattered according to different distributions, focusing on the power-law ones.

Figure 4: Scattering of points for a comparison: (a) Gaussian coordinates, (b-e) power-law coordinates for different approaches and different exponents γ. It is important to notice that the panels are shown with different intervals for good visualization. Thus, for power laws, in some cases we observe the gap around the origin, but in other cases we do not.

Using the Gaussian distribution (with average zero and variance 1), the pattern of points shown in Fig. 4 (a) is very different from the case which uses points generated by Eqs. 11 and 12 (Fig. 4 (b)). Here, we will denote the points generated using Eqs. 11 and 12 by power law I. We can observe some outlier points since, in this case, the distribution has no defined second moment. However, radial symmetry is absent in the power law I, and the border effects previously mentioned are notorious. The question is whether they are really important. Alternatively, in this paper we also propose a second power-law type (power law II) to study the SA, which respects the radial symmetry:

x = r_0 (1 − ξ_1)^{−1/(γ−1)} cos(2πξ_2)
y = r_0 (1 − ξ_1)^{−1/(γ−1)} sin(2πξ_2)   (13)

and in this case the distribution of points is represented in Fig. 4 (c). Thus, relevant questions should be related to the performance of the SA in both situations: the symmetric and the asymmetric one. Completing the picture, in Figs. 4 (d) and (e) we show the distribution of points for a case with defined variance and for a case without it.
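Both sampling prescriptions can be sketched as below (power law I from Eqs. 11 and 12, power law II from Eq. 13; the function names are ours):

```python
import math
import random

def power_law_I(x0, gamma, rng=random):
    """One coordinate from the two-tailed power law of Eq. 10,
    via the inverse-transform rule of Eq. 11."""
    xi1, xi2 = rng.random(), rng.random()
    sign = 1.0 if xi2 >= 0.5 else -1.0               # choose the branch
    return sign * x0 * (1.0 - xi1) ** (-1.0 / (gamma - 1.0))

def power_law_II(r0, gamma, rng=random):
    """One radially symmetric point (x, y), Eq. 13:
    power-law-distributed radius, uniform angle."""
    xi1, xi2 = rng.random(), rng.random()
    r = r0 * (1.0 - xi1) ** (-1.0 / (gamma - 1.0))
    theta = 2.0 * math.pi * xi2
    return r * math.cos(theta), r * math.sin(theta)
```

A quick check of the tail: for power law I the survival probability is P(|x| > w) = (w/x_0)^{1−γ}, which matches the empirical fraction of samples beyond w, and all samples respect the gap |x| ≥ x_0.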
4. Results
We performed computer experiments to analyze two effects on the optimization by the SA for the TSP: (a) the correlation between the spatial coordinates, and (b) the variance/distribution shape of the spatial random coordinates. We used the SAGCS, which is a fast and standard way to perform the optimization. It is natural to think that such effects must be proportionally important in other variations of the SA with slower cooling schedules, independently of whether the final cost obtained is better. The goal of this paper is not to compare different SA algorithms, but to perform a quantitative study of the SAGCS considering different spatial distributions for the coordinates of the points in the TSP.

However, before starting the main core of our results, it is important to understand some preliminary aspects: the effects on the final cost of the number of external-loop iterations, N_steps = ⌊ln(T_min/T_0)/ln α⌋, and of N_iter, the number of internal-loop iterations. We consider points (x_i, y_i), i = 1, ..., N, built from uniformly distributed non-correlated random variables (case ρ = 0), with averages taken over N_run = 60 different runs of the SA.

So we performed simulations considering T_0 = 10 and a small final temperature T_min, and initially using the simple swap scheme to generate new candidate configurations. Under these conditions, one obtains the plot of the final cost which can be observed in Fig. 5 (a) as a function of the total number of iterations N_total = N_steps N_iter. Here we intend to show that this parameter is more important than the parameters N_steps (α) and N_iter separately. Our range for N_total covered several orders of magnitude, and the procedure was to change N_iter over successive powers of 2. Now, given N_total and N_iter, we obtain the coefficient α of the cooling:

α = (T_min/T_0)^{N_iter/N_total}.
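The relation between α, N_iter, and N_total can be sketched numerically as follows (the parameter values in the usage below are illustrative, not the paper's):

```python
import math

def cooling_alpha(t0, t_min, n_iter, n_total):
    """Alpha such that N_total = N_steps * N_iter iterations span the
    cooling from t0 down to t_min (inverting Eq. 2)."""
    return (t_min / t0) ** (n_iter / n_total)

def n_steps(t0, t_min, alpha):
    """Number of external-loop iterations, Eq. 2."""
    return math.floor(math.log(t_min / t0) / math.log(alpha))
```

Fixing N_total and varying N_iter thus amounts to varying α, which is exactly the knob explored in Fig. 5.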
For small N_total, α covers a wide range (from values as low as 0.001 up to values close to 1), while for the largest N_total, α is always close to 1. In all cases, the averaged optimal cost C̄_opt depends on N_total and not on N_steps and N_iter separately. For this experiment T_0 = 10 was used; for T_0 = 100 the behavior is similar. The result suggests a reasonable universal power-law behavior

C̄_opt ∼ N_total^{−δ}

for a sufficiently large total number of iterations, independently of the internal or external loops. For clarity, in Fig. 6 we show a plot of C̄_opt as a function of N_total for both T_0 = 10 and T_0 = 100 in the same plot, forgetting at this moment the effects of α and N_iter separately. In log-log scale we obtain the exponent δ for each temperature by linear fits, with a crossover from δ ≈ 0.43 to δ ≈ 0.02; this shows that for sufficiently large N_total the optimal cost found does not show a meaningful improvement, changing only from C̄_opt ≈ 90 to C̄_opt ≈ 70.

Figure 5: Final cost as a function of N_total = N_steps · N_iter, for T_0 = 10 and T_0 = 100. The color denotes the different values of α, while the size of the points is proportional to N_iter.

After this preparatory study, we can proceed to the main study of this work: to investigate the effects of correlation and of long-tailed distributions of the coordinates of the points. From now on, all of our results were obtained using a fixed set of parameters N_iter, α, N_run, T_min, and T_0 = 10. Except for the finite size scaling analysis, all of our experiments with the SAGCS were applied to scenarios with a fixed number N of cities. Fig. 8 (a) shows the cost as a function of the iteration of the external loop (iteration of the cooling schedule), n, for different values of ρ, in scenarios with cities/points (x, y) built from identical uniform random variables z_1 and z_2 defined on [−1, 1], as in Fig. 2. We will first study the SS-SAGCS performance.

In Fig. 8 (a) we can observe a decrease of the cost until it reaches a steady state after approximately 500 iterations. However, an interesting analysis is to consider the ratio C̄_opt/⟨C⟩, where

⟨C⟩ = N [Σ_{i<j} d(i, j) / (N(N−1)/2)] = [2/(N−1)] Σ_{i<j} d(i, j),   (14)

i.e., the average cost is calculated considering that the cycle has N links, each with the average pairwise distance between cities, and here Ō = (1/N_run) Σ_{i=1}^{N_run} O_i denotes the average of a quantity O over the different runs (different steady costs obtained by the SA).

Figure 6: Highlighting the power-law behavior of the final cost as a function of N_total, discarding the effects of N_iter and α separately, for both cases T_0 = 10 and T_0 = 100, with the corresponding fitted exponents δ.

It is important to say that ⟨C⟩ calculated by Eq. 14 is essentially indistinguishable from the cost of any randomly chosen cycle. This is a typical characteristic of the TSP: randomly generated Hamiltonian cycles are so far from the local minima found by heuristics such as the SA that, even generating billions of such cycles, the probability of reaching a local minimum as good as the SA's can be disregarded, as previously discussed in Sec. 2.

The lower C̄_opt/⟨C⟩, the better the performance. Fig. 8 (b-I) shows the behavior of this ratio as a function of ρ for several different values of N. We observe an improvement of the performance as ρ enlarges. A reasonable (not so good) empirical scaling can be observed in Fig. 8 (b-II): if one uses b = L_max/L and adjusts the exponent z, the different curves show a good agreement in the collapse.

We can observe that the performance of the SS-SAGCS is even better for a small number of cities under the simple swap prescription. But an important point is to investigate more closely the performance of the SAGCS as a function of ρ.
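The normalization of Eq. 14 can be sketched as follows (a minimal sketch; `dist` is assumed to be a pairwise distance matrix as in Eq. 3):

```python
def average_cycle_cost(dist):
    """Average cost <C> of Eq. 14: N times the mean pairwise distance,
    i.e. (2/(N-1)) * sum_{i<j} d(i, j)."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n - 1)
```

For an equilateral triangle of side 1, every cycle costs 3, and Eq. 14 gives exactly that value, a sanity check that the normalization matches the cost of a typical tour.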
Let us focus on the extreme cases ρ = 0 and ρ = 1. For ρ = 1, one has exactly the points scattered on a straight line (case ρ = 1 of Fig. 2). In this particular case, we know the global optimal Hamiltonian cycle in advance: it goes from one end of the segment to the other and back, so its cost C_glob^(theor) is exactly twice the length of the segment containing the points. On the other hand, we can analytically estimate the average cycle cost:

⟨C⟩_{ρ=1}^(theor) = [2/(N−1)] Σ_{i<j} d(i, j) ≈ [2/(N−1)] Σ_{j=1}^{N−1} Σ_{i=1}^{N−j} i ⟨Δ⟩,

where ⟨Δ⟩ is the average displacement between two adjacent points randomly distributed over the line. Performing the sums,

⟨C⟩_{ρ=1}^(theor) = [N(N+1)/3] ⟨Δ⟩,

which grows linearly with N once ⟨Δ⟩ ∝ 1/N.

Figure 7: Power-law behavior of the final cost as a function of N_total for the 2-opt prescription, exactly as performed in Fig. 6 for the simple swap. One observes a transition from δ ≈ 0.43 to a much smaller δ.

Precisely, the ratio behaves as C_glob^(theor)(ρ=1)/⟨C⟩_{ρ=1}^(theor) = O(1/N) ≈ 0.0015 for the size used here, i.e., C_glob is around three orders of magnitude, O(10⁻³), smaller than ⟨C⟩_{ρ=1}^(theor). However, the SA with simple swap finds C̄_opt^(SS)/⟨C⟩ ≈ 0.38, which is a modest reduction. But one is using the simple swap, and such a modest reduction is indeed expected. What about 2-opt? With the same parameters, and using this prescription, should the reduction of O(10⁻³) be expected? Yes, as can be observed in Table 1, which shows a comparison of C̄_opt for different values of ρ. One observes that C̄_opt^(2-opt), the cost obtained with the 2-opt-SAGCS, is smaller than C̄_opt^(SS) (the one obtained with the SS-SAGCS), and the difference is huge indeed.
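The ρ = 1 limit can be checked directly: on a line the optimal cycle is obtained by sorting, its cost is twice the span of the points, and the ratio to Eq. 14 shrinks as ∼1/N. A minimal sketch, assuming uniform points (the line carries an overall √2 factor from y = x that cancels in the ratio):

```python
import random

def line_ratio(n, rng=random):
    """C_glob / <C> for n random points on a line (1D coordinates)."""
    xs = sorted(rng.random() for _ in range(n))
    span = xs[-1] - xs[0]
    c_glob = 2.0 * span                       # go to the far end and come back
    total = sum(abs(a - b) for i, a in enumerate(xs) for b in xs[i + 1:])
    c_avg = 2.0 * total / (n - 1)             # Eq. 14 in one dimension
    return c_glob / c_avg
```

The ratio decreases with N, consistent with the O(1/N) estimate above.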
Figure 8: Plot (a): time evolution of the cost $C(n)$, where $n$ is the iteration of the external loop (cooling schedule). Plot (b-I): performance of the SA as a function of $\rho$ for different numbers of cities. Plot (b-II): an approximate scaling for plot (b-I). The simple-swap prescription was used to generate new configurations in both plots.

Table 1: Comparison between the simple swap ($C^{(SS)}_{\mathrm{opt}}$) and 2-opt ($C^{(2\text{-}\mathrm{opt})}_{\mathrm{opt}}$) prescriptions in the SAGCS for the final average costs as a function of $\rho$.

Thus, considering the estimate for $\rho = 1$, one obtains $C^{(2\text{-}\mathrm{opt})}_{\mathrm{opt}}/\langle C\rangle \approx 0.002$, which is very similar to the measure obtained by analytical means.
Actually, finding the optimal Hamiltonian cycle with points scattered on a straight line is equivalent to sorting a list, and there are good polynomial-time algorithms (heapsort, quicksort, ...) that efficiently perform such a classification. Nevertheless, it is important to mention that we have no previous information about the topology of the points, and the simulated annealing with the 2-opt prescription simply works, finding a result similar to the optimal one for $\rho = 1$.
It is also important to estimate $C_{\mathrm{glob}}/\langle C\rangle_{\rho=0}$, to check whether the curve $C_{\mathrm{opt}}/\langle C\rangle \times \rho$ obtained with SA is compatible with these two extremes. First, let us start with a calculation that can be performed exactly (see Appendix):
$$\langle C\rangle^{(\mathrm{theor})}_{\rho=0} = \frac{N}{15}\left(4 + 2\sqrt{2} + 10\ln(\sqrt{2}+1)\right),$$
which amounts to approximately 2136 for the $N$ considered here. With $N_{\mathrm{run}} = 60$, we numerically obtain $\langle C\rangle_{\rho=0}$ close to this value. For $C_{\mathrm{glob}}$, constructive heuristics (out of the scope of the Boltzmann machines) can supply an estimate which should work as a benchmark. A good suggestion is the nearest neighbor (NN) algorithm [15], a greedy algorithm that corresponds to a particular case of the tourist random walk [16], [17] with the maximal memory $N-1$. For $\rho = 0$, it yields a ratio $C^{(NN)}_{\mathrm{opt}}/\langle C\rangle_{\rho=0}$ far below the $C^{(SS)}_{\mathrm{opt}}/\langle C\rangle_{\rho=0} \approx 0.479$ obtained with SS-SAGCS. Surprisingly, the 2-opt-SAGCS gives $C^{(2\text{-}\mathrm{opt})}_{\mathrm{opt}}/\langle C\rangle_{\rho=0} \approx 0.036$, which is even better than the NN algorithm.
There exist many specially arranged city distributions which make the NN algorithm give the worst route. Here, however, we observe that it is an excellent heuristic; still, the 2-opt-SAGCS is a slightly better benchmark (almost a technical draw between the two heuristics), and the SAGCS has other advantages. The simulated annealing is $O(N_{\mathrm{sample}})$, while the NN algorithm has a complexity $O(N^2)$, which can be very costly for extremely large instances. SA yields modest optimal costs only when one uses the simple swap scheme, which is a naive technique when compared with 2-opt, which can be applied in general scenarios. In realistic situations involving a generalization of the TSP, the VRP (vehicle routing problem), alternatives such as the Tabu search algorithm are suggested by the authors in [18].
But again, our proposal in this paper is to analyze possible effects of the environment on the SA, not a detailed comparison among methods; still, we could not fail to show the 2-opt-SAGCS in comparison with SS-SAGCS and the NN algorithm.
Thus, it is interesting to analyze the size effects on the 2-opt-SAGCS exactly as performed in Fig. 8 (b). Fig. 9 (a) I shows a similar decay of $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\rho$, but with values much lower than those of SS-SAGCS. Following what we applied to the SS-SAGCS, we performed a similar scaling for the 2-opt-SAGCS (Fig. 9 (a) II), though not with the same quality. In this case the value is z ≈ 0.
39, greater than the z ≈ 0.06 found for SS-SAGCS.
It is also interesting to analyze the effects of considering different statistical distributions for the coordinates of the points. Thus, we prepared some experiments to capture the effects on the ratio $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\rho$.
Figure 9: (a) I: Performance of the 2-opt-SAGCS as a function of $\rho$ for different numbers of cities. (a) II: An approximate scaling for this plot, obtained with $z \approx 0.39$. (b): Analysis for different distributions of the points (Gaussian and uniform); case (b) I corresponds to SS-SAGCS and (b) II to 2-opt-SAGCS.

In addition to identically distributed uniform random variables $z_1$ and $z_2$, we now also consider Gaussian random variables for comparison. It is important to mention that the $x$ and $y$ coordinates have the same average (zero) and variance as the variables $z_1$ and $z_2$, independently of $\rho$. As we previously reported, the effects to be observed depend strictly on $\rho$, since the average and the variance do not change.
We can observe that in both cases, SS-SAGCS and 2-opt-SAGCS, respectively described by Fig. 9 (b-I) and Fig. 9 (b-II), the shape of the distribution seems to be more important than the variance: Gaussian and uniform random variables, with the same or different variances, lead to different curves, but Gaussians with different variances produce practically the same behavior.
Finally, we also use a scaled cost:
$$C_s = \frac{C}{\sqrt{(x_{\max}-x_{\min})(y_{\max}-y_{\min})}}, \qquad (15)$$
where $x(y)_{\max(\min)} = \max(\min)\{x(y)_1, x(y)_2, \ldots, x(y)_N\}$, since, differently from the uniform distribution, the Gaussian distribution is not supported on a finite interval. However, as the same Figs. 9 (b-I) and (b-II) show, no changes were observed in the two versions of the simulated annealing (the blue triangles correspond to the Gaussian distribution with scaled cost).
For SS-SAGCS, the lower the $N$, the better the performance; on the other hand, for the 2-opt-SAGCS, the higher the $N$, the better the performance, as can also be observed in Figs. 8 (b-I) and 9 (b-I).

4.2. $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\rho$

We also focus our results on finding good fits for $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\rho$, following a heuristic and qualitative method to find suitable fits for this behavior.
First, we try a polynomial fit $p(\rho) = \sum_{i=0}^{n} a_i \rho^i$ by testing $n =$
2, 3 and 4 (with 3, 4, and 5 parameters, respectively). The idea here is only to make a preliminary investigation. After that, we test exponential decay fits with only two parameters:
$$e_I(\rho) = e_I(0) + a_1 e^{-\rho/\rho_1},$$
where $e_I(0)$ is fixed and taken as $C_{\mathrm{opt}}/\langle C\rangle(0)$ (visually determined). Next, one considers a simple linear combination of exponential decays (with four parameters):
$$e_{II}(\rho) = e_{II}(0) + a_1 e^{-\rho/\rho_1} + a_2 e^{-\rho/\rho_2}, \qquad (16)$$
also assuming empirically that $e_{II}(0) = C_{\mathrm{opt}}/\langle C\rangle(0)$.
Alternatively, we also experimented with other functions with four parameters, and the one that presented a good result was the rational function
$$r(\rho) = \frac{a + b\rho}{1 + c\rho + d\rho^2}.$$
This is shown in Fig. 10, which presents the different fits; the coefficients obtained in the different cases are summarized in Table 2.
We can observe that the larger the degree of the fitted polynomial, the better the coefficient of determination $\alpha$ (the closer to one, the better). Our analysis also shows that, in both cases, a single exponential (exponential I) is not enough to nicely fit the curve, but with two exponentials (four parameters) an excellent fit with $\alpha \approx 0.999$ can be obtained. With the same four parameters, the rational function gives a good result, although not as good as the linear combination of exponentials. It is important to mention that the coefficient $a$ found is exactly $C_{\mathrm{opt}}/\langle C\rangle(0)$ in both cases (SS- and 2-opt-SAGCS), so the fit could equally be performed with three parameters. After all, one can observe that the best match is obtained by considering a linear combination of exponentials, as described by Eq. 16. It seems to be a universal fit for $C_{\mathrm{opt}}/\langle C\rangle \times \rho$ for both SS-SAGCS and 2-opt-SAGCS.
Figure 10: Fits of performance versus $\rho$ for SS-SAGCS and 2-opt-SAGCS. (a) Preliminary polynomial fits. (b) Simple exponential. (c) A reasonable fit with the rational function. (d) The best fit, obtained by combining two exponentials.
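The fitting procedure behind these curves can be reproduced with standard tooling. The sketch below is our illustration (SciPy's `curve_fit`, synthetic data with purely illustrative parameter values, not the measured ones): it fits the double-exponential model of Eq. 16 and computes the coefficient of determination $\alpha$.

```python
import numpy as np
from scipy.optimize import curve_fit

def e_II(rho, a1, rho1, a2, rho2, c):
    """Double-exponential decay of Eq. (16); parameter names are ours."""
    return c + a1 * np.exp(-rho / rho1) + a2 * np.exp(-rho / rho2)

# Synthetic C_opt/<C> data with small noise (illustrative values only)
rho = np.linspace(0.0, 1.0, 60)
true = (0.06, 0.08, 0.04, 0.45, 0.38)
rng = np.random.default_rng(2)
y = e_II(rho, *true) + 5e-4 * rng.standard_normal(rho.size)

# Nonlinear least squares (Levenberg-Marquardt style) fit
popt, _ = curve_fit(e_II, rho, y, p0=(0.05, 0.1, 0.05, 0.5, 0.4),
                    maxfev=20000)
resid = y - e_II(rho, *popt)
alpha = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(alpha)
```

With a sensible starting guess `p0`, the recovered $\alpha$ is close to one, mirroring the quality of fit reported in the text.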
Table 2: Values of the coefficients found after nonlinear fitting using the Levenberg-Marquardt method, for SS-SAGCS and 2-opt-SAGCS with the quadratic, cubic, and quartic polynomials, the exponential fits I and II, and the rational function. The parameter $\alpha$ corresponds to the coefficient of determination of each fit.

4.3. Long tail effects

Finally, it is important to analyze the long-tail effects on the coordinates of the points distributed over the environment. In this case, we concentrate our analysis on $\rho =$
0. Firstly, one chooses the variables $x$ and $y$ according to Eqs. 11 and 12 for the SS-SAGCS, plotting $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\gamma$ in Fig. 11 (a) and (b).
As expected, $C_{\mathrm{opt}}/\langle C\rangle$ decreases as a function of the exponent $\gamma$. Fig. 11 (a) shows this behavior for different values of $\varepsilon$, considering the scaling in Eq. 15. We can observe a collapse of all curves independently of $\varepsilon$; the inset plot shows the same curves without the scaling. It is interesting to notice the instability in the region where the variance is not defined, $2 \le \gamma \le 3$. For $\gamma < 2$, the region where the average cannot be defined, the SA has no efficiency, which suggests that the algorithm should not be applied in these situations, since $C_{\mathrm{opt}}/\langle C\rangle \approx 1$. Fig. 11 (b) is a refinement of one case of Fig. 11 (a), with uncertainty bars, also showing the values obtained with Gaussian and uniform distributions for comparison (dashed blue lines).
Performing simulations similar to those of Fig. 11 (a) for the 2-opt-SAGCS, shown in Fig. 12 (a), the same conclusions can be drawn for $\gamma < 2$.
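For reference, long-tailed coordinates of this kind can be drawn by inverse-transform sampling. The sketch below is our illustration and need not match Eqs. 11-13 exactly; the cutoff name `eps` is ours. It samples $p(x) \propto x^{-\gamma}$ for $x \ge \varepsilon$, where the first moment exists only for $\gamma > 2$ and the second only for $\gamma > 3$, the regimes discussed above.

```python
import numpy as np

def power_law(gamma, eps=1.0, size=1, rng=None):
    """Inverse-transform sampling of p(x) ~ x^(-gamma) for x >= eps,
    valid for gamma > 1. The mean diverges for gamma <= 2 and the
    variance for gamma <= 3."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=size)
    return eps * (1.0 - u) ** (-1.0 / (gamma - 1.0))

x = power_law(3.5, size=200_000, rng=0)
print(x.mean())   # should approach eps * (gamma-1)/(gamma-2) = 5/3
```

Lowering $\gamma$ toward 2 makes the sample mean increasingly erratic, which is the mechanism behind the instabilities seen in Figs. 11 and 12.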
5. Summary and Conclusions
In this paper, we studied the effects of the statistics of the coordinates of the points when applying a standard simulated annealing algorithm to the travelling salesman problem. Our results concern the long-tail effects on the coordinates as well as the correlation effects between these coordinates.
Our study shows that the performance of the simulated annealing increases as the correlation increases, in both versions of the SA. The main reason is the dimensionality reduction, which in the limit transforms the simulated annealing into an approximate sorting algorithm. The correlation effects show that the shape of the distribution attributed to the coordinates is more important than its variance when we compare Gaussian and uniform distributions.
Our results also suggest that the higher the exponent of the power law, the lower the simulated annealing performance, and for $\gamma < 2$, i.e., distributions without the first moment, the SA shows no performance at all, i.e., $C_{\mathrm{opt}}/\langle C\rangle \approx 1$. We also obtained universal fits for $C_{\mathrm{opt}}/\langle C\rangle \times \rho$ and $C_{\mathrm{opt}}/\langle C\rangle \times \gamma$, behaviors not previously explored in optimization problems with SA. In addition, we showed that the 2-opt-SAGCS performs better the larger the system, differently from its less efficient version, the SS-SAGCS, where the reverse occurs.
We believe that both studies (correlation and long-range effects) can bring interesting knowledge to more technical applications in artificial intelligence, machine learning, and other areas, in particular the search for global minima in neural network algorithms, perhaps reviving interest in simulated annealing as a viable alternative to stochastic gradient descent at the optimization step.

Acknowledgments
R. da Silva thanks CNPq for financial support under grant numbers 311236/2018-9 and 424052/2018-0. AA thanks Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for its financial support, grant 307265/2017-0. This research was partially carried out using the computational resources of the Cluster-Slurm, IF-UFRGS.
References

[1] D. Sherrington, S. Kirkpatrick, Phys. Rev. Lett. 35, 1792 (1975)
[2] S. Kirkpatrick, C. D. Gelatt Jr., M. P. Vecchi, Science 220, 671-680 (1983)
[3] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller, J. Chem. Phys. 21, 1087 (1953)
[4] C. Papadimitriou, Computational Complexity, Pearson (1993)
[5] U. H. E. Hansmann, Y. Okamoto, Braz. J. Phys. 29, 187 (1999)

Figure 11: (a) $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\gamma$ for different values of $\varepsilon$ for the SS-SAGCS. We perform the scaling $C_s = C/\sqrt{(x_{\max}-x_{\min})(y_{\max}-y_{\min})}$. The inset plot shows the same curves without scaling; the regions where the average ($\gamma < 2$) and the variance ($\gamma < 3$) are not defined are indicated. (b) Refinement of this region presenting the error bars; the dashed lines indicate the values obtained with uniform and normal distributions.
Figure 12: (a) $C_{\mathrm{opt}}/\langle C\rangle$ as a function of $\gamma$ for different values of $\varepsilon$ for the 2-opt-SAGCS. We perform the scaling $C_s = C/\sqrt{(x_{\max}-x_{\min})(y_{\max}-y_{\min})}$. The inset plot shows the same curves without scaling. (b) The same plot using power-law coordinates with radial symmetry (Eq. 13).
[6] F. Mauger, C. Chandre, T. Uzer, Commun. Nonlinear Sci. Numer. Simul. 16, 2845 (2011)
[7] S. Geman, D. Geman, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721 (1984)
[8] S. Lin, B. W. Kernighan, Operations Research 21, 498-516 (1973)
[9] S. Kirkpatrick, J. Stat. Phys., 157-162 (1987)
[13] L. Ingber, Mathl. Comput. Modelling 18, 29-57 (1993)
[14] R. da Silva, S. D. Prado, Phys. Lett. A (2020)
[15] G. Gutin, A. Yeo, A. Zverovich, Discrete Applied Mathematics 117, 81-86 (2002)
[16] G. F. Lima, A. S. Martinez, O. Kinouchi, Phys. Rev. Lett. 87, 010603 (2001)
[17] H. E. Stanley, S. V. Buldyrev, Nature 413, 373-374 (2001)
[18] P. Adi Wicaksono, D. Puspitasari, S. Ariyandanu, R. Hidayanti, IOP Conf. Ser.: Earth Environ. Sci.

Appendix: Average distance between two points uniformly distributed in a square
Our original problem considers points uniformly distributed in the square $x \in [-1,1]$, $y \in [-1,1]$, i.e., a square of side 2. This is equivalent to considering points in the unit square $x \in [0,1]$, $y \in [0,1]$ and multiplying the resulting average distance by 2, and thus the average distance between the points can be calculated as
$$\langle d\rangle = 2\int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}\; dx_i\, dx_j\, dy_i\, dy_j.$$
Performing the change of variables $z_1 = x_i$, $z_2 = x_j$, $z_3 = y_i$, and $z_4 = y_j$,
$$\langle d\rangle = 2\int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 \sqrt{(z_1-z_2)^2+(z_3-z_4)^2}\; dz_1\, dz_2\, dz_3\, dz_4.$$
If $z_1$ and $z_2$ are uniformly distributed, the random variable $u = |z_1-z_2|$ has a triangular distribution with probability density function $2(1-u)$. Thus, making the change of variables $u = |z_1-z_2|$, $v = |z_3-z_4|$, $w = z_2$, and $t = z_4$, one obtains
$$\langle d\rangle = 2\int_0^1\!\!\int_0^1 4(1-u)(1-v)\sqrt{u^2+v^2}\; du\, dv.$$
Now it is almost done! Make $u = r\cos\theta$ and $v = r\sin\theta$. A trick is to perform the integration over the lower triangle, where $0 \le \theta \le \pi/4$ and $0 \le r \le 1/\cos\theta$, and double the result by symmetry:
$$
\begin{aligned}
\langle d\rangle &= 16\int_0^{\pi/4}\!\!\int_0^{1/\cos\theta} (1-r\cos\theta)(1-r\sin\theta)\, r^2\, dr\, d\theta\\
&= 16\int_0^{\pi/4}\!\!\int_0^{1/\cos\theta} \left(r^2 - r^3(\sin\theta+\cos\theta) + r^4\sin\theta\cos\theta\right) dr\, d\theta\\
&= 16\left[\frac{1}{12}\int_0^{\pi/4} \frac{d\theta}{\cos^3\theta} - \frac{1}{20}\int_0^{\pi/4} \frac{\sin\theta}{\cos^4\theta}\, d\theta\right].
\end{aligned}
$$
And finally, performing these integrals (using $\int \sec^3\theta\, d\theta = \frac{1}{2}\left(\sec\theta\tan\theta + \ln|\sec\theta+\tan\theta|\right)$ and $\int \sin\theta/\cos^4\theta\, d\theta = \frac{1}{3}\sec^3\theta$), one obtains
$$\langle d\rangle = \frac{1}{15}\left(4 + 2\sqrt{2} + 10\ln(\sqrt{2}+1)\right) \approx 1.043.$$
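This closed-form value is easy to verify numerically; a minimal Monte Carlo sketch (our illustration, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p = rng.uniform(-1.0, 1.0, size=(n, 2))   # first point in the side-2 square
q = rng.uniform(-1.0, 1.0, size=(n, 2))   # second point
mc = np.linalg.norm(p - q, axis=1).mean() # Monte Carlo average distance

exact = (4.0 + 2.0 * np.sqrt(2.0) + 10.0 * np.log(np.sqrt(2.0) + 1.0)) / 15.0
print(mc, exact)   # both close to 1.0428
```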