Self-Adjusting Evolutionary Algorithms for Multimodal Optimization
Amirhossein Rajabi, Technical University of Denmark, Kgs. Lyngby, Denmark, [email protected]
Carsten Witt, Technical University of Denmark, Kgs. Lyngby, Denmark, [email protected]
June 3, 2020
Abstract
Recent theoretical research has shown that self-adjusting and self-adaptive mechanisms can provably outperform static settings in evolutionary algorithms for binary search spaces. However, the vast majority of these studies focuses on unimodal functions which do not require the algorithm to flip several bits simultaneously to make progress. In fact, existing self-adjusting algorithms are not designed to detect local optima and do not have any obvious benefit to cross large Hamming gaps.

We suggest a mechanism called stagnation detection that can be added as a module to existing evolutionary algorithms (both with and without prior self-adjusting schemes). Added to a simple (1+1) EA, we prove an expected runtime on the well-known
Jump benchmark that corresponds to an asymptotically optimal parameter setting and outperforms other mechanisms for multimodal optimization like heavy-tailed mutation. We also investigate the module in the context of a self-adjusting (1+λ) EA and show that it combines the previous benefits of this algorithm on unimodal problems with more efficient multimodal optimization.

To explore the limitations of the approach, we additionally present an example where both self-adjusting mechanisms, including stagnation detection, do not help to find a beneficial setting of the mutation rate. Finally, we investigate our module for stagnation detection experimentally.

Introduction

Recent theoretical research on self-adjusting algorithms in discrete search spaces has produced a remarkable body of results showing that self-adjusting and self-adaptive mechanisms outperform static parameter settings. Examples include an analysis of the well-known (1+(λ,λ)) GA using a 1/5-rule on OneMax (Doerr and Doerr, 2018), of a self-adjusting (1+λ) EA sampling offspring with different mutation rates (Doerr et al., 2019), matching the parallel black-box complexity of the OneMax function, and a self-adaptive variant of the latter (Doerr, Witt and Yang, 2018). Furthermore, self-adjusting schemes have been studied for algorithms over search spaces {0, . . . , r}^n for r > 1. For example, the self-adjusting (1+λ) EA from Doerr et al. (2019) samples half of the offspring with strength r/2 (i.e., half the current mutation probability) and the other half with strength 2r. The strength is afterwards adjusted to the one used by a fittest offspring. Similarly, the 1/5-rule relies on the assumption that the run yields some improvements with the different parameters tried or, at least, that the smallest disimprovement observed in unsuccessful mutations gives reliable hints on the choice of the parameter. However, there are situations where the algorithm cannot make progress and does not learn from unsuccessful mutations either.
This can be the case when the algorithm reaches local optima, escaping from which requires an unlikely event (such as flipping many bits simultaneously) to happen. Classical self-adjusting algorithms would observe many unsuccessful steps in such situations and suggest setting the mutation rate to its minimum although that might not be the best choice to leave the local optimum. In fact, the vast majority of runtime results for self-adjusting EAs is concerned with unimodal functions that have no other local optima than the global optimum. An exception is the work of Dang and Lehre (2016), which considers a self-adaptive EA allowing two different mutation probabilities on a specifically designed multimodal problem. Altogether, there is a lack of theoretical results giving guidance on how to design self-adjusting algorithms that can leave local optima efficiently.

In this paper, we address this question and propose a self-adjusting mechanism called stagnation detection that adjusts the mutation rate when the algorithm has reached a local optimum. In contrast to previous self-adjusting algorithms, this mechanism is likely to increase the mutation rate in such situations, leading to a more efficient escape from local optima. This idea has been mentioned before, e.g., in the context of population sizing in stagnation (Eiben, Marchiori and Valkó, 2004); also, in recent empirical studies of the above-mentioned 2-rate (1+λ) EA, handling stagnation by increasing the variance was explicitly suggested (Ye, Doerr and Bäck, 2019).
Our contribution has several advantages over previous discussions of stagnation detection: it represents a simple module that can be added to several existing evolutionary algorithms with little effort, it provably does not change the behavior of the algorithm on unimodal functions (except for small error terms), allowing the transfer of previous results, and we provide rigorous runtime analyses showing general upper bounds for multimodal functions, including its benefits on the well-known Jump benchmark function.

In a nutshell, our stagnation detection mechanism works in the setting of pseudo-boolean optimization and standard bit mutation. Starting from strength r = 1, it increases the strength from r to r + 1 after a long waiting time without improvement has elapsed, meaning it is unlikely that an improving bit string at Hamming distance r exists. This approach bears some resemblance with variable neighborhood search (VNS) (Hansen and Mladenovic, 2018); however, the idea of VNS is to apply local search with a fixed neighborhood until reaching a local optimum and then to adapt the neighborhood structure. There have also been so-called quasirandom evolutionary algorithms (Doerr, Fouz and Witt, 2010) that search the set of Hamming neighbors of a search point more systematically; however, these approaches do not change the expected number of bits flipped. In contrast, our stagnation detection uses an unbiased randomized global search operator in an EA throughout the whole run and just adjusts the underlying mutation probability.
Long waiting times are used as statistical evidence that improvements at Hamming distance r are unlikely to exist; this is rather remotely related to (but clearly inspired by) the estimation-of-distribution algorithm sig-cGA of Doerr and Krejca (2018) that uses statistical significance to counteract genetic drift.

This paper is structured as follows: In Section 2, we introduce the concrete mechanism for stagnation detection and employ it in the context of a simple, static (1+1) EA and the already self-adjusting 2-rate (1+λ) EA. Moreover, we collect tools for the analysis that are used in the rest of the paper. Section 3 deals with concrete runtime bounds for the (1+1) EA and (1+λ) EA with stagnation detection. Besides general upper bounds, we prove a concrete result for the Jump benchmark function that is asymptotically optimal for algorithms using standard bit mutation and outperforms previous mutation-based algorithms for this function like the heavy-tailed EA from Doerr et al. (2017). Elementary techniques are sufficient to show these results. To explore the limitations of stagnation detection and other self-adjusting schemes, we propose in Section 4 a function where these mechanisms provably fail to set the mutation rate to a beneficial regime. As a technical tool, we use drift analysis and analyses of occupation times for processes with strong drift. To that purpose, we use a theorem by Hajek (Hajek, 1982) on occupation times that, to the best of our knowledge, was not used for the analysis of randomized search heuristics before and may be of independent interest. Finally, in Section 5, we add some empirical results, showing that the asymptotically smaller runtime of our algorithm on
Jump is also visible for small problem dimensions. We finish with some conclusions.
Preliminaries
We shall now formally define the algorithms analyzed and present some fundamental tools for the analysis.
We are concerned with pseudo-boolean functions f : {0,1}^n → ℝ that w.l.o.g. are to be maximized. A simple and well-studied EA considered in many runtime analyses (e.g., Droste, Jansen and Wegener (2002)) is the (1+1) EA displayed in Algorithm 1. It uses standard bit mutation with strength r, where 1 ≤ r ≤ n/2, which means that every bit is flipped independently with probability r/n. Usually, r = 1 is used, which is the optimal strength on linear functions (Witt, 2013). Smaller strengths lead to less than 1 bit being flipped in expectation, and strengths above n/2 flip more than half of the bits in expectation.

Algorithm 1 (1+1) EA with static strength r
Select x uniformly at random from {0,1}^n.
for t ← 1, 2, . . . do
  Create y by flipping each bit in a copy of x independently with probability r/n.
  if f(y) ≥ f(x) then x ← y.

The runtime (also called optimization time) of the (1+1) EA on a function f is the first point of time t where a search point of maximal fitness has been created; often the expected runtime, i.e., the expected value of this time, is analyzed. The (1+1) EA with r = 1 has been extensively studied on simple unimodal problems like

OneMax(x_1, . . . , x_n) := x_1 + · · · + x_n = |x|_1  and  LeadingOnes(x_1, . . . , x_n) := Σ_{i=1}^n Π_{j=1}^i x_j,

but also on the multimodal Jump_m function with gap size m, defined as follows:

Jump_m(x_1, . . . , x_n) = m + |x|_1 if |x|_1 ≤ n − m or |x|_1 = n, and n − |x|_1 otherwise.

The classical (1+1) EA with r = 1 optimizes these functions in expected time Θ(n log n), Θ(n²) and Θ(n^m + n log n), respectively (see, e.g., Droste, Jansen and Wegener (2002)). The first two problems are unimodal functions, while Jump_m for m ≥ 2 has a local optimum consisting of all points with |x|_1 = n − m. To overcome this optimum, m bits have to flip simultaneously. It is well known (Doerr et al., 2017) that the time to leave this optimum is minimized at strength m instead of strength 1 (see below for a more detailed exposition of this phenomenon). Hence, the (1+1) EA would benefit from increasing its strength when sitting at the local optimum. The algorithm does not immediately know that it sits at a local optimum.
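For concreteness, these benchmark functions can be written down in a few lines of Python; this is our own illustrative sketch (function names are ours), not code from the paper.

```python
def one_max(x):
    # OneMax: the number of one-bits in x
    return sum(x)

def leading_ones(x):
    # LeadingOnes: the length of the longest prefix of ones
    count = 0
    for bit in x:
        if bit == 0:
            break
        count += 1
    return count

def jump(x, m):
    # Jump_m: OneMax shifted by m, except that the m fitness levels just
    # below the optimum form a slope leading away from it (the "gap")
    n, ones = len(x), sum(x)
    if ones <= n - m or ones == n:
        return m + ones
    return n - ones
```

With n = 10 and m = 2, for instance, the local optimum consists of all points with 8 one-bits (fitness 10), any point with 9 one-bits has fitness 1, and the global optimum with fitness 12 can only be reached from the local optimum by flipping both remaining zero-bits at once.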
However, if there is an improvement at Hamming distance 1, then with strength 1 such an improvement has probability at least (1/n)(1 − 1/n)^{n−1} ≥ 1/(en), and the probability of not finding it in en ln n steps is at most

(1 − 1/(en))^{en ln n} ≤ 1/n.

Similarly, if there is an improvement that can be reached by flipping k bits simultaneously and the current strength equals k, then the probability of not finding it within ((en)^k/k^k) ln n steps is at most

(1 − k^k/(en)^k)^{((en)^k/k^k) ln n} ≤ 1/n.

Hence, after ((en)^k/k^k) ln n steps without improvement there is high evidence that no improvement at Hamming distance k exists.

We put this idea into an algorithmic framework by counting the number of so-called unsuccessful steps, i.e., steps that do not improve fitness. Starting from strength 1, the strength is increased from r to r + 1 when the counter exceeds the threshold 2((en)^r/r^r) ln(nR) for a parameter R to be discussed shortly. Both counter and strength are reset (to 0 and 1, respectively) when an improvement is found, i.e., a search point of strictly better fitness. In the context of the (1+1) EA, the stagnation detection (SD) is incorporated in Algorithm 2. We see that the counter u is increased in every iteration that does not find a strict improvement. However, search points of equal fitness are still accepted as in the classical (1+1) EA. We note that the strength stays at its initial value 1 if finding an improvement never takes longer than the corresponding threshold 2en ln(Rn); if the threshold is never exceeded, the algorithm behaves identically to the (1+1) EA with strength 1 according to Algorithm 1.

The parameter R can be used to control the probability of failing to find an improvement at the "right" strength. More precisely, the probability of not finding an improvement at distance r with strength r is at most

(1 − r^r/(en)^r)^{2((en)^r/r^r) ln(nR)} ≤ 1/(nR)².
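The waiting-time bounds above are easy to check numerically. The following snippet (our own sanity check, not code from the paper) evaluates the probability of missing an existing improvement at Hamming distance k within ((en)^k/k^k) ln n steps at strength k:

```python
import math

def miss_probability(n, k):
    """Probability of not finding an existing improvement at Hamming
    distance k within ((en)^k / k^k) * ln(n) steps at strength k."""
    # the one-step success probability is at least k^k / (en)^k
    q = k ** k / (math.e * n) ** k
    steps = math.log(n) / q          # equals ((en)^k / k^k) * ln(n)
    return (1 - q) ** steps

# e.g. at n = 100 the miss probability stays below 1/n for k = 1, 2, 3
table = {k: miss_probability(100, k) for k in (1, 2, 3)}
```

Since (1 − q)^{1/q} < 1/e, the returned value is below e^{−ln n} = 1/n for every n and k, matching the bound in the text.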
As shown below in Theorem 3, if R is set to the number of fitness values of the underlying function f, i.e., R = |Im(f)|, then the probability of ever missing an improvement at the right strength is sufficiently small throughout the run. We recommend at least R = n if nothing is known about the range of f, resulting in a threshold of at least 4((en)^r/r^r) ln(n) at strength r.

We also add stagnation detection to the (1+λ) EA with self-adjusting mutation rate defined in Doerr et al. (2019) (adapted to maximization of the fitness function), where half of the offspring are created with strength r/2 and the other half with strength 2r; see Algorithm 3. Unsuccessful mutations are counted in the same way as in Algorithm 2, taking into account that λ offspring are used per iteration. The algorithm can be in two states. Unless the counter threshold is reached and a strength increase is triggered, the algorithm behaves the same as the self-adjusting (1+λ) EA from Doerr et al. (2019) (State 2). If, however, the counter threshold 2en ln(nR)/λ is reached, then the algorithm changes to the module that keeps increasing the strength until a strict improvement is found (State 1). Since it does not make sense to decrease the strength in this situation, all offspring use the same strength until finally an improvement is found and the algorithm changes back to the original behavior using two strengths for the offspring. The boolean variable g keeps track of the state. From the discussion of these two algorithms, we see that stagnation detection, consisting of a counter for unsuccessful steps, a threshold, and a strength increase, can also be added to other algorithms while keeping their original behavior unless the counter threshold is reached.

Algorithm 2 (1+1) EA with stagnation detection (SD-(1+1) EA)
Select x uniformly at random from {0,1}^n and set r_1 ← 1, u ← 0.
for t ← 1, 2, . . . do
  Create y by flipping each bit in a copy of x independently with probability r_t/n.
  u ← u + 1.
  if f(y) > f(x) then
    x ← y; r_{t+1} ← 1; u ← 0.
  else if f(y) = f(x) and r_t = 1 then
    x ← y.
  if u > 2(en/r_t)^{r_t} ln(nR) then
    r_{t+1} ← min{r_t + 1, n/2}; u ← 0.
  else
    r_{t+1} ← r_t.
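Algorithm 2 translates into the following Python sketch; this is our own reimplementation (names and the OneMax test run are ours, not from the paper):

```python
import math
import random

def sd_one_plus_one_ea(f, n, R, max_iters, seed=0):
    """Sketch of the SD-(1+1) EA: standard bit mutation at strength r,
    where r is raised to r + 1 once 2 * (en/r)^r * ln(nR) iterations
    pass without a strict fitness improvement."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    r, u = 1, 0
    for _ in range(max_iters):
        y = [bit ^ (rng.random() < r / n) for bit in x]
        u += 1
        if f(y) > f(x):
            x, r, u = y, 1, 0             # strict improvement: reset strength and counter
        elif f(y) == f(x) and r == 1:
            x = y                         # equal fitness is accepted only at strength 1
        if u > 2 * (math.e * n / r) ** r * math.log(n * R):
            r, u = min(r + 1, n // 2), 0  # stagnation detected: raise the strength
    return x

# On OneMax the threshold is never exceeded with high probability,
# so the run behaves like the plain (1+1) EA with strength 1.
best = sd_one_plus_one_ea(sum, n=30, R=30, max_iters=20000)
```

Note that the threshold check is a no-op directly after an improvement, since the counter has just been reset to 0.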
Lemma 1.
For m < n, we have Σ_{i=1}^m (en/i)^i < (n/(n−m)) (en/m)^m.

Algorithm 3 (1+λ) EA with two-rate standard bit mutation and stagnation detection (SASD-(1+λ) EA)
Select x uniformly at random from {0,1}^n and set r_1 ← r_init, u ← 0, g ← False (boolean variable indicating stagnation detection).
for t ← 1, 2, . . . do
  u ← u + 1.
  if g = True then
    State 1 – Stagnation Detection
    for i ← 1, . . . , λ do
      Create x_i by flipping each bit in a copy of x independently with probability r_t/n.
    y ← arg max_{x_i} f(x_i) (breaking ties randomly).
    if f(y) > f(x) then
      x ← y; r_{t+1} ← r_init; g ← False; u ← 0.
    else if u > 2(en/r_t)^{r_t} ln(nR)/λ then
      r_{t+1} ← min{r_t + 1, n/2}; u ← 0.
    else
      r_{t+1} ← r_t.
  else (i.e., g = False)
    State 2 – Self-Adjusting (1+λ) EA
    for i ← 1, . . . , λ do
      Create x_i by flipping each bit in a copy of x independently with probability r_t/(2n) if i ≤ λ/2 and with probability 2r_t/n otherwise.
    y ← arg max_{x_i} f(x_i) (breaking ties randomly).
    if f(y) ≥ f(x) then
      if f(y) > f(x) then u ← 0.
      x ← y.
    Perform one of the following two actions with prob. 1/2 each:
      – Replace r_t with the strength that y has been created with.
      – Replace r_t with either r_t/2 or 2r_t, each with probability 1/2.
    r_{t+1} ← min{max{2, r_t}, n/4}.
    if u > 2(en/r_t)^{r_t} ln(nR)/λ then
      r_{t+1} ← 1; g ← True; u ← 0.

Proof. We have (en/(m−i))^{m−i} = (m/(en))^i (m/(m−i))^{m−i} (en/m)^m for all i < m, so

Σ_{i=1}^m (en/i)^i = Σ_{i=0}^{m−1} (en/(m−i))^{m−i} = (en/m)^m Σ_{i=0}^{m−1} (m/(en))^i (m/(m−i))^{m−i} < (en/m)^m Σ_{i=0}^{m−1} (m/n)^i < (n/(n−m)) (en/m)^m. ∎

The following result due to Hajek applies to processes with a strong drift towards some target state, resulting in decreasing occupation probabilities with respect to the distance from the target. On top of these occupation probabilities, the theorem bounds occupation times, i.e., the number of steps that the process spends in a non-target state over a certain time period.
Theorem 1 (Theorem 3.1 in Hajek (1982)). Let X_t, t ≥ 0, be a stochastic process adapted to a filtration F_t on ℝ. Let a ∈ ℝ. Assume for Δ = X_{t+1} − X_t that there are η > 0, ρ < 1 and D > 0 such that

(a) E(e^{ηΔ} | F_t ; X_t > a) ≤ ρ,
(b) E(e^{ηΔ} | F_t ; X_t ≤ a) ≤ D.

If additionally X_0 is of exponential type (i.e., E(e^{λX_0}) is finite for some λ > 0), then for any constant ε > 0 there exist absolute constants K ≥ 0, δ < 1 such that for all b ≥ a and T ≥ 1,

Pr( (1/T) Σ_{t=1}^T 1{X_t ≤ b} ≤ 1 − ε − (1/((1−ε)(1−ρ))) D e^{η(a−b)} ) ≤ K δ^T.

In this section, we study the SD-(1+1) EA from Algorithm 2 in greater detail. We show general upper and lower bounds on multimodal functions and then analyze the special case of
Jump more precisely. We also show the important result that on unimodal functions, the SD-(1+1) EA with high probability behaves in the same way as the classical (1+1) EA with strength 1, including the same asymptotic bound on the expected optimization time.
In the following, given a fitness function f : {0,1}^n → ℝ, we call the gap of a point x ∈ {0,1}^n the minimum Hamming distance to points with strictly larger fitness value. Formally,

gap(x) := min{H(x, y) : f(y) > f(x), y ∈ {0,1}^n}.

No strict improvement is possible by flipping fewer than gap(x) bits of the current search point. However, if the algorithm creates a point at Hamming distance gap(x) from the current search point x, it can make progress with positive probability. Note that gap(x) = 1 is allowed, so the definition also covers points that are not local optima.

Hereinafter, T_x denotes the number of steps of the SD-(1+1) EA to find an improvement when the current search point is x. Let phase r consist of all points of time where strength r is used in the algorithm with stagnation counter. Let E_r be the event of not finding an improvement by the end of phase r, and U_r be the event of not finding an improvement during phases 1 to r − 1; in other words, U_r = E_1 ∩ · · · ∩ E_{r−1}.

The following lemma will be used throughout this section. It shows that the probability of not finding a search point with larger fitness value in phases of larger strength than the real gap size is small; however, by definition, the strength is never increased beyond n/2. Recall that R controls the threshold for the number of unsuccessful steps in stagnation detection.

Lemma 2.
Let x ∈ {0,1}^n be the current search point of the SD-(1+1) EA on a pseudo-boolean fitness function f : {0,1}^n → ℝ and let m = gap(x). Then

Pr(E_r) ≤ 1/(nR)²  if m ≤ r < n/2,   and   Pr(E_r) = 0  if r = n/2.

Proof.
The algorithm spends 2(en/r)^r ln(nR) steps at strength r before it increases the strength. The probability of not improving during these steps at strength r ≥ m is at most

Pr(E_r) ≤ (1 − (r/n)^m (1 − r/n)^{n−m})^{2(en/r)^r ln(nR)} ≤ 1/(nR)².

During phase n/2, the algorithm does not increase the strength any further, and it continues to mutate each bit with probability 1/2. As each point of the domain is then created with positive probability in every step, the probability of eventually failing to find an improvement is 0. ∎
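Plugging numbers into the r = m case confirms the bound; this is our own numeric check, not part of the paper:

```python
import math

def miss_probability(n, m, R):
    """Probability of not finding an improvement at Hamming distance m
    within the 2 * (en/m)^m * ln(nR) steps spent at strength r = m."""
    p = (m / n) ** m * (1 - m / n) ** (n - m)      # one-step improvement probability
    steps = 2 * (math.e * n / m) ** m * math.log(n * R)
    return (1 - p) ** steps

# with R = n, the miss probability is below 1/(nR)^2 = n^{-4}
checks = [miss_probability(n, m, R=n) <= n ** -4
          for n in (100, 400) for m in (1, 2, 4)]
```

The inequality holds because (1 − m/n)^{n−m} ≥ e^{−m}, so p ≥ (m/(en))^m and the exponent of the failure probability is at least 2 ln(nR).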
We turn the previous observation into a general theorem on improvement times.
Theorem 2.
Let x ∈ {0,1}^n be the current search point of the SD-(1+1) EA on a pseudo-boolean function f : {0,1}^n → ℝ. Define T_x as the time to create a strict improvement and L_{x,k} := E(T_x) if gap(x) = k. Then, using m = min{k, n/2}, we have for all x with gap(x) = k that

(en/m)^m (1 − m²/(n − m)) < L_{x,k} ≤ (en/m)^m (1 + O((m/n) ln(nR))).

Proof.
Using the law of total probability with respect to the events U_i defined above, we have

E(T_x) = Σ_{i=1}^{n/2} E(T_x | U_i) Pr(U_i).   (1)

Note that the algorithm does not increase the strength to more than n/2. By assuming that the algorithm pessimistically does not find a better point for r < m, we can bound formula (1) as follows:

E(T_x) < E(T_x | U_m) + Σ_{i=m+1}^{n/2} E(T_x | U_i) Pr(U_i),

where we call the first summand S_1 and the second S_2. Regarding S_1, it takes Σ_{i=1}^{m−1} 2(en/i)^i ln(nR) steps until the SD-(1+1) EA increases the strength to m. When the mutation probability is m/n, a better point will be found within an expected number of ((m/n)^m (1 − m/n)^{n−m})^{−1} steps. Thus, by using Lemma 1, we have

E(T_x | U_m) ≤ Σ_{i=1}^{m−1} 2(en/i)^i ln(nR) + 1/((m/n)^m (1 − m/n)^{n−m})
< (2n/(n − m + 1)) (en/(m−1))^{m−1} ln(nR) + (en/m)^m
< (en/m)^m (1 + (2m/(en)) (m/(m−1))^{m−1} (n/(n − m + 1)) ln(nR))
≤ (en/m)^m (1 + O((m/n) ln(nR))).

In order to estimate S_2, if m = n/2, the value of S_2 equals zero. Otherwise, by using Lemma 2, Pr(U_i) < Π_{j=m}^{i−1} Pr(E_j) < n^{−2(i−m)} for i ≥ m + 1 since R ≥ 1. We compute

Σ_{i=m+1}^{n/2} E(T_x | U_i) Pr(U_i) ≤ Σ_{i=m+1}^{n/2} O((en/i)^i ln(nR)) n^{−2(i−m)} = ln(nR) Σ_{i=m+1}^{n/2} O((e/i)^i n^{2m−i}) = o((en/m)^m).

Altogether, we have E(T_x) ≤ (en/m)^m (1 + O((m/n) ln(nR))) + o((en/m)^m).

Moreover, the expected number of iterations for finding an improvement is at least p^{−m}(1 − p)^{−(n−m)} for any mutation rate p. Using the same arguments as in the analysis of the (1+1) EA on Jump in Doerr et al. (2017), since m/n is the unique minimum point of this expression in the interval [0, 1/2], we obtain

E(T_x) ≥ (m/n)^{−m} (1 − m/n)^{−(n−m)} ≥ (en/m)^m (1 − m²/(n − m)). ∎

We now present the above-mentioned important "simulation result" implying that on unimodal functions, the stagnation detection of the SD-(1+1) EA is unlikely ever to trigger a strength increase during its run. Moreover, for a wide range of runtime bounds obtained via the fitness level method (Wegener, 2001), we show that these bounds transfer to the SD-(1+1) EA up to vanishingly small error terms. The proof carefully estimates the probability of the strength ever exceeding 1.
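The escape probability (m/n)^m (1 − m/n)^{n−m} appearing in the proof can also be validated by a quick Monte Carlo experiment (our own sketch, not from the paper):

```python
import math
import random

def escape_probability(n, m, trials=200_000, seed=1):
    """Monte Carlo estimate of the probability that strength-m standard
    bit mutation flips exactly a fixed set of m bits and no others."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flips = [rng.random() < m / n for _ in range(n)]
        if all(flips[:m]) and not any(flips[m:]):
            hits += 1
    return hits / trials

n, m = 20, 2
estimate = escape_probability(n, m)
exact = (m / n) ** m * (1 - m / n) ** (n - m)
```

With n = 20 and m = 2 the exact value is about 1.5 · 10⁻³, i.e., an expected waiting time of roughly 670 steps once the strength has reached m.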
Lemma 3.
Let f : {0,1}^n → ℝ be a unimodal function and consider the SD-(1+1) EA with R ≥ |Im(f)|. Then, with probability 1 − o(1), the SD-(1+1) EA never increases the strength and behaves stochastically like the (1+1) EA before finding an optimum of f. Denote by T_sd and T_classic the runtime of the SD-(1+1) EA and of the classical (1+1) EA with strength 1 on f, respectively. If U is an upper bound on E(T_classic) obtained by summing up worst-case expected waiting times for improvements over all fitness values in Im(f), then E(T_sd) ≤ U + o(1).

The same statements hold with SD-(1+1) EA replaced with SASD-(1+λ) EA, and (1+1) EA replaced with the self-adjusting (1+λ) EA without stagnation detection.

Proof.
We let the random set W contain the search points from which the SD-(1+1) EA does not find an improvement within phase 1 (i.e., while r_t = 1). As above, E_1 denotes the event of not finding an improvement within phase 1. As on unimodal functions the gap of all points is 1, we have by Lemma 2 that Pr(E_1) ≤ 1/(Rn)². This argumentation holds for each improvement that has to be found. Since at most |Im(f)| ≤ R improving steps happen before finding the optimum, by a union bound the probability of the SD-(1+1) EA ever increasing the strength beyond 1 is at most R/(Rn)² = o(1), which proves the first claim of the lemma.

To prove the second claim, we consider all fitness values f_1 < · · · < f_{|Im(f)|} in increasing order and sum up upper bounds on the expected times to improve from each of these fitness values. Under the condition that the strength is not increased before leaving a fitness level, the worst-case time to leave a level (over all search points with the same fitness value) is clearly not increased. Hence, we bound the expected optimization time of the SD-(1+1) EA from above by adding the waiting times on all fitness levels for the (1+1) EA, which is given by U, and the expected times spent to leave the points in W; formally,

E(T_sd) ≤ U + Σ_{x∈W} E(T_x).

Each point in Im(f) contributes with probability Pr(E_1) to W. Hence, E(|W|) ≤ |Im(f)| Pr(E_1) ≤ R Pr(E_1). As on unimodal functions the gap of all points is 1, by Lemma 2 we have Pr(U_i) < Π_{j=1}^{i−1} Pr(E_j) < n^{−2(i−1)}. Hence,

E(T_sd) < U + Σ_{x∈W} E(T_x) < U + R · Pr(E_1) Σ_{i=1}^{n/2} E(T_x | U_i) Pr(U_i) < U + R · (nR)^{−2} Σ_{i=1}^{n/2} O((e/i)^i n^{2−i} ln(nR)).

The second term is o(1), hence E(T_sd) ≤ U + o(1), as suggested.

All the arguments apply in the same way with respect to the SASD-(1+λ) EA and its original formulation without stagnation detection.
∎

It is well known that strength 1 for the (1+1) EA leads to an expected runtime of Θ(n^m) on Jump_m if m ≥ 2 since m bits must flip simultaneously to leave the local optimum at n − m one-bits. To minimize the time for such an escaping mutation, mutation rate m/n is optimal (Doerr et al., 2017), leading to an expected time of (1 + o(1))(n/m)^m (1 − m/n)^{m−n} to optimize Jump, which is Θ((en/m)^m) for m = o(√n). However, a static rate of m/n cannot be chosen without knowing the gap size m. Therefore, different heavy-tailed mutation operators have been proposed for the (1+1) EA (Doerr et al., 2017; Friedrich, Quinzan and Wagner, 2018), which most of the time choose strength 1 but also use strength r, for arbitrary r ∈ {1, . . . , n/2}, with at least inverse-polynomial probability. This results in optimization times on Jump of Θ((en/m)^m · p(n)) for some small polynomial p(n) (roughly, p(n) = ω(√m) in Doerr et al. (2017) and p(n) = Θ(n) in Friedrich, Quinzan and Wagner (2018)). Similar polynomial overheads occur with hypermutations as used in artificial immune systems (Corus, Oliveto and Yazdani, 2018); in fact, such overheads cannot be completely avoided with heavy-tailed mutation operators, as proved in Doerr et al. (2017). We also remark that Jump can be optimized faster than O((en/m)^m) if crossover is used (Whitley et al., 2018; Rowe and Aishwaryaprajna, 2019), by simple estimation-of-distribution algorithms (Doerr, 2019) or specific black-box algorithms (Buzdalov, Doerr and Kever, 2016). In addition, an optimization time of n^{(m+1)/2} e^{O(m)} m^{−m/2} is shown for the (1+(λ,λ)) GA on Jump with 2 < m < n/16 in Antipov, Doerr and Karavaev (2020). All of this is outside the scope of this study, which concentrates on mutation-only algorithms.

We now state our main result, implying that the SD-(1+1) EA achieves an asymptotically optimal runtime on
Jump_m for m = o(√n), hence being faster than the heavy-tailed mutations mentioned above. Recall that this does not come at a significant extra cost for simple unimodal functions like OneMax according to Lemma 3.
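As a small-scale illustration (our own experiment; the paper's systematic experiments follow in Section 5), a sketch of the SD-(1+1) EA can be run on Jump to observe that it escapes the plateau quickly once the strength reaches m:

```python
import math
import random

def jump(x, m):
    # Jump_m fitness: slope toward all-ones except on the gap region
    n, ones = len(x), sum(x)
    return m + ones if (ones <= n - m or ones == n) else n - ones

def sd_ea_hitting_time(n, m, R, max_iters, seed=3):
    """Iterations until our SD-(1+1) EA sketch (Algorithm 2) optimizes Jump_m."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    r, u = 1, 0
    for t in range(1, max_iters + 1):
        if sum(x) == n:
            return t                      # global optimum reached
        y = [bit ^ (rng.random() < r / n) for bit in x]
        u += 1
        if jump(y, m) > jump(x, m):
            x, r, u = y, 1, 0
        elif jump(y, m) == jump(x, m) and r == 1:
            x = y
        if u > 2 * (math.e * n / r) ** r * math.log(n * R):
            r, u = min(r + 1, n // 2), 0  # raise strength toward the gap size
    return max_iters

steps = sd_ea_hitting_time(n=20, m=2, R=20, max_iters=500_000)
```

For n = 20 and m = 2, the run typically spends on the order of 2(en) ln(nR) iterations waiting at strength 1 on the plateau and then jumps the gap within roughly (en/m)^m further iterations.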
Theorem 3.
Let n ∈ ℕ. For all 2 ≤ m = O(n/ln n), the expected runtime E(T) of the SD-(1+1) EA on Jump_m satisfies

Ω((en/m)^m (1 − m²/(n − m))) ≤ E(T) ≤ O((en/m)^m).

Proof.
It is well known that the (1+1) EA with mutation rate 1/n finds the optimum of the n-dimensional OneMax function in an expected number of at most en ln n + O(n) iterations. Until reaching the plateau consisting of all points with n − m one-bits, Jump is equivalent to OneMax; hence, according to Lemma 3, the expected time until the SD-(1+1) EA reaches the plateau is at most O(n ln n) (noting that this bound was obtained via the fitness level method).

Every plateau point x with n − m one-bits satisfies gap(x) = m according to the definition of Jump. Thus, using Theorem 2, the algorithm finds the optimum from the plateau within expected time

Ω((en/m)^m (1 − m²/(n − m))) ≤ E(T_x) ≤ O((en/m)^m).

This dominates the expected time of the algorithm before reaching the plateau. Finally,

Ω((en/m)^m (1 − m²/(n − m))) ≤ E(T) ≤ O((en/m)^m). ∎

It is easy to see (similarly to the analysis of Theorem 3) that for all m = Θ(n), the expected runtime E(T) of the SD-(1+1) EA on Jump_m satisfies E(T) = O((en/m)^m ln n).

The Jump function only has one local optimum, which usually has to be overcome on the way to the global optimum. We generalize the previous analysis to functions that have multiple local optima of possibly different gap sizes. As a special case, we asymptotically recover the expected runtime on the
LeadingOnes function in Corollary 1.
Theorem 4.
The expected runtime of the SD-(1+1) EA on a pseudo-Boolean fitness function f is at most

E(T | V_1, . . . , V_n) = O(Σ_{k=1}^n V_k L_k),

where V_k is the number of points x with gap(x) = k visited by the algorithm and L_k := max{L_{x,k} | x ∈ {0,1}^n ∧ gap(x) = k}, with L_{x,k} as defined in Theorem 2. Moreover,

E(T) = O(Σ_{k=1}^n E(V_k) L_k).

Proof. The SD-(1+1) EA visits a random trajectory of search points x_1, x_2, . . . , x_M = x* in order to find an optimum x*. For any search point x with gap(x) = k, the expected time to find a better search point is E(T_x) ≤ L_k according to Theorem 2. Also, we have T = T_{x_1} + T_{x_2} + · · · + T_{x_M}, where the points of gap size k contribute a total expected time of at most V_k L_k. Therefore, as the strength r is reset to 1 after each improvement, we have

E(T | V_1, . . . , V_n) = O(Σ_{k=1}^n V_k L_k),

which proves the first statement of this theorem. The second follows by the law of total expectation. ∎

Corollary 1.
The expected runtime of the SD-(1+1) EA on
LeadingOnes is at most O(n²).

Proof. On LeadingOnes, the algorithm visits at most n points of gap size 1, so according to Theorem 4, the expected runtime is O(n²). ∎

Corollary 1 can also be inferred from Lemma 3 since
LeadingOnes is unimodal and the O(n²) bound was inferred via the fitness level method.

We finally specialize Theorem 4 into a result for the well-known Trap function (Droste, Jansen and Wegener, 2002) that is identical to
OneMax except for the all-zeros string, which has optimal fitness n + 1. We obtain a bound of 2^{Θ(n)} instead of the Θ(n^n) bound for the classical (1+1) EA. The base of our result is somewhat larger than for the fast GA from Doerr et al. (2017); however, it is still close to the 2^n bound that would be obtained by uniform search.

Corollary 2.
The expected runtime of the SD-(1+1) EA on
Trap is at most O(2.34^n ln n).

Proof. On Trap, there is one point of gap size n and at most n points of gap size 1. So according to Theorem 4, the expected runtime is O((2e)^{n/2} ln n) = O(2.34^n ln n). ∎

While our previous analyses have shown the benefits of the self-adjusting scheme, in particular highlighting stagnation detection on multimodal functions, it is clear that our scheme also has limitations. In this section, we present an example of a pseudo-Boolean function where stagnation detection does not help to find its global optimum in polynomial time; moreover, the function is hard for other self-adjusting schemes since measuring the number of successes does not hint at the location of the global optimum. In fact, the function demonstrates a more general effect where the behavior is very sensitive with respect to the choice of the mutation probability. More precisely, a plain (1+1) EA with mutation probability 1/n with overwhelming probability gets stuck in a local optimum from which it needs exponential time to escape, while the (1+1) EA with mutation probability 2/n (and also above) finds the global optimum in polynomial time with overwhelming probability. Since the function is unimodal except at the local optimum, our self-adjusting (1+1) EA with stagnation detection fails as well.

To the best of our knowledge, a phase transition with respect to the mutation probability, where an increase by a small constant factor leads from exponential to polynomial optimization time, has been unknown in the literature of runtime analysis so far and may be of independent interest.
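The rate sensitivity underlying the construction below can be previewed with a short calculation (our own numeric sketch) comparing the expected numbers of specific one-bit and two-bit flips within T steps under the rates 1/n and 2/n:

```python
def expected_flips(T, p, n):
    # expected number of occurrences, in T steps of standard bit mutation
    # with rate p, of a specific 1-bit flip and of a specific 2-bit flip
    one_bit = T * p * (1 - p) ** (n - 1)
    two_bit = T * p ** 2 * (1 - p) ** (n - 2)
    return one_bit, two_bit

n, T = 100, 10 ** 6
one_lo, two_lo = expected_flips(T, 1 / n, n)   # strength 1
one_hi, two_hi = expected_flips(T, 2 / n, n)   # strength 2
# strength 2 trades one-bit flips for two-bit flips
```

For n = 100 and T = 10⁶, this gives roughly 3700 vs. 2700 one-bit flips and 37 vs. 55 two-bit flips at the rates 1/n and 2/n, respectively.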
We are aware of opposite phase transitions on monotone functions (Lengler, 2018) where increasing the mutation rate is detrimental; however, we feel that our function and the general underlying construction principle are easier to understand than these specific monotone functions.

The construction of our function, called NeedHighMut, is based on a general principle that was introduced in Witt (2003) to show the benefits of populations and was subsequently applied in Jansen and Wiegand (2004) to separate a coevolutionary variant of the (1+1) EA from the standard (1+1) EA. Section 5 of the latter paper also beautifully describes the general construction technique, which involves creating two differently pronounced gradients for the algorithms to follow. Further applications are given in Witt (2006) and Witt (2008) to show the benefit of populations in elitist and non-elitist EAs. Also, Rohlfshagen, Lehre and Yao (2009) use a very similar construction technique for their Balance function, which is easier to optimize in frequently changing than in slowly changing environments; however, they did not seem to be aware that their approach resembles earlier work from the papers above.

We now describe the construction of our function
NeedHighMut. The crucial observation is that strength 1 (i.e., probability p = 1/n) makes it more likely to flip exactly one specific bit than strength 2; in fact, strength 1 is asymptotically optimal since the probability of flipping one specific bit is p(1 − p)^{n−1} ≈ pe^{−pn}, which is maximized for p = 1/n. However, to flip two specific bits, which has probability p²(1 − p)^{n−2} ≈ p²e^{−pn}, the choice p = 2/n is asymptotically optimal and clearly better than 1/n. Now, given a hypothetical time span of T steps, we expect approximately T_1(p) := T · p · e^{−pn} specific one-bit flips and T_2(p) := T · p² · e^{−pn} specific two-bit flips. Assuming the actual numbers to be concentrated and just arguing with expected values, we have T_1(1/n) > T_1(2/n) but T_2(2/n) > T_2(1/n), i.e., there will be considerably more two-bit flips at strength 2 than at strength 1 and considerably fewer one-bit flips. The fitness function will account for this. It leads to a trap at a local optimum if a certain number of one-bit flips is exceeded before a certain minimum number of two-bit flips has happened; however, if the number of one-bit flips is low enough before the minimum number of two-bit flips has been reached, the process is on track to the global optimum.

We proceed with the formal definition of NeedHighMut, making these ideas precise and overcoming technical hurdles. Since we have at most n specific one-bit flips but a specific two-bit flip is already by a factor of O(1/n) less likely than a one-bit flip, we will work with two-bit flips happening in small blocks of size ⌈n^{1/4}⌉, leading to a probability of roughly n^{−3/2} for a two-bit flip in a block. In the following, we will imagine a bit string x of length n as being split into a prefix a := a(x) of length n − m and a suffix b := b(x) of length m, where m still has to be defined. Hence, x = a(x) ∘ b(x), where ∘ denotes the concatenation.

The prefix a(x) is called valid if it is of the form 1^i 0^{n−m−i}, i.e., i leading ones and n − m − i trailing zeros. The prefix fitness pre(x) of a string x ∈ {0, 1}^n with valid prefix a(x) = 1^i 0^{n−m−i} equals just i, the number of leading ones. The suffix consists of ⌈ξ√n⌉ blocks of ⌈n^{1/4}⌉ bits each, where ξ ≥ 1 is a parameter of the function; altogether m ≤ 2ξn^{3/4} = o(n) bits. Such a block is called valid if it contains either 0 or 2 one-bits; moreover, it is called active if it contains 2 and inactive if it contains 0 one-bits. A suffix where all blocks are valid and where all blocks following the first inactive block are also inactive is called valid itself, and the suffix fitness suff(x) of a string x with valid suffix b(x) is the number of leading active blocks before the first inactive block. Finally, we call a string x ∈ {0, 1}^n valid if both its prefix and suffix are valid.

Our final fitness function is a weighted combination of pre(x) and suff(x). We define for x ∈ {0, 1}^n, where x = a ∘ b with the above-introduced a and b,

NeedHighMut_ξ(x) :=
  n · suff(x) + pre(x)                       if pre(x) ≤ 9(n − m)/10 ∧ x valid,
  n · ⌈ξ√n⌉ + pre(x) + suff(x) − n           if pre(x) > 9(n − m)/10 ∧ x valid,
  −OneMax(x)                                 otherwise.

We note that all search points in the second case have a fitness greater than n⌈ξ√n⌉ + 9(n − m)/10 − n = n(⌈ξ√n⌉ − 1) + 9(n − m)/10, which is an upper bound on the fitness of search points that fall into the first case without having ⌈ξ√n⌉ leading active blocks in the suffix. Hence, search points x where pre(x) = n − m and suff(x) = ⌈ξ√n⌉ represent local optima of second-best overall fitness. The set of global optima equals the points where pre(x) = 9(n − m)/10 and suff(x) = ⌈ξ√n⌉, which implies that (n − m)/10 = Ω(n) bits have to be flipped simultaneously to escape from the local toward the global optimum.

The parameter ξ ≥ 1 controls how high the mutation rate has to be to reach the global optimum: already for ξ = 1, strength 1 usually leads to the local optimum first, while strengths above 2 usually lead directly to the global optimum. Using larger ξ increases the threshold for the strength necessary to find the global optimum instead of being trapped in the local one.

We now formally show with respect to different algorithms that NeedHighMut is challenging to optimize without setting the right mutation probability in advance. We start with an analysis of the classical (1+1) EA, where we for simplicity only show the negative result for p = 1/n even though it would even hold for mutation probabilities as large as ξ/n.

Theorem 5.
Consider the plain (1+1) EA with mutation probability p on NeedHighMut_ξ for a constant ξ ≥ 1. If p = 1/n, then with probability 1 − 2^{−Ω(n^{1/8})}, its optimization time is n^{Ω(n)}. If p = (cξ)/n for any constant c ≥ 3, then the optimization time is O(n²) with probability 1 − 2^{−Ω(√n)}.

Proof. It is easy to see (similarly to the analysis of the SufSamp function from Jansen, Jong and Wegener (2005)) that the first valid search point (i.e., the first search point of non-negative fitness) has both pre- and suff-value of at most n^{2/3} with probability 1 − 2^{−Ω(n^{1/3})}. This follows from the fact that the function is symmetric on invalid search points and that from each level set of i one-bits, only O(1) search points are valid. In the following, we tacitly assume that we have reached a valid search point of the described maximum pre- and suff-value and note that this changes the required number of improvements to reach the local or global optimum only by a 1 − o(1) factor. For readability, this factor will not be spelt out any more.

We prepare the main analysis by bounding the probability of a mutation being accepted after a valid search point has been reached. Even if a mutation changes up to o(n) consecutive bits of the prefix or suffix, it must maintain n − o(n) prefix bits in order to result in a valid search point. Hence, the probability of an accepted step at mutation probability c/n (valid for any constant c) is at most (1 − c/n)^{n−m−o(n)} = (1 + o(1))e^{−c}. Steps flipping Ω(n) consecutive bits have probability n^{−Ω(n)} and are subsumed by the failure probabilities stated in this theorem. Clearly, the probability of an accepted step is also at least (1 − c/n)^n = (1 − o(1))e^{−c}.

Using this knowledge of accepted steps, we shall now prove the statement for p = 1/n. The probability of improving the pre-value is at least e^{−1}/n since it is sufficient to flip the leftmost zero of the prefix to 1. In a phase of en² steps, there are at least n − m prefix-improving mutations with probability 1 − 2^{−Ω(√n)} by Chernoff bounds. All these improve the function value and are accepted unless the suff-value increases to ⌈ξ√n⌉ before the pre-value exceeds 9(n − m)/10. The probability of an accepted step activating a single suffix block is at most (⌈n^{1/4}⌉ choose 2) · n^{−2} · e^{−1} · (1 + o(1)) ≤ (1 + o(1))(e^{−1}/2)n^{−3/2} since it is necessary to flip two zeros of one block into ones and to have an accepted mutation.
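As a quick numerical sanity check of the two competing event rates in this race (our own illustration, not part of the formal proof), the following sketch compares the per-step probability of a specific one-bit flip with that of a two-bit flip inside one block of b bits at rates 1/n and 2/n; the block size b = ⌈n^{1/4}⌉ follows the construction described above.

```python
from math import comb

def specific_one_bit_flip(p, n):
    # probability to flip one fixed bit and keep the remaining n - 1 bits
    return p * (1 - p) ** (n - 1)

def two_bit_flip_in_block(p, n, b):
    # probability to flip exactly some pair among the b block bits, keep the rest
    return comb(b, 2) * p ** 2 * (1 - p) ** (n - 2)

n = 10 ** 6
b = round(n ** 0.25)  # block size, here roughly n^(1/4)
for c in (1, 2):
    p = c / n
    print(c, specific_one_bit_flip(p, n), two_bit_flip_in_block(p, n, b))
```

Doubling the rate from 1/n to 2/n multiplies the two-bit rate by roughly 4/e ≈ 1.47 while reducing the one-bit rate by a factor of roughly 2/e ≈ 0.74, which is exactly the asymmetry the fitness function exploits.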
By the same reasoning, steps that activate k = o(√n) blocks simultaneously have a probability of at most (1 + o(1))((e^{−1}/2)n^{−3/2})^k. We consider a phase of s := en² steps and bound the number of accepted steps increasing the suff-value by k by applying Chernoff bounds, since this number is stochastically dominated by a binomial distribution with parameters s and p_k := (1 + o(1))((e^{−1}/2)n^{−3/2})^k. Hence, the number of accepted steps activating one suffix block in en² steps is less than (3/4)√n with probability 1 − 2^{−Ω(√n)}. The expected number of accepted steps activating k ≥ 2 blocks simultaneously is O(1/n), and by Chernoff bounds the actual number is at most n^{1/8} with probability 1 − 2^{−Ω(n^{1/8})}. Hence, by a union bound over k ∈ {2, ..., n^{1/8}}, the steps adding more than one valid suffix block increase the suff-value by at most n^{1/8} · n^{1/8} = n^{1/4} with probability 1 − 2^{−Ω(n^{1/8})}. Steps adding k > n^{1/8} valid blocks have probability 2^{−Ω(n^{1/8})} and are subsumed by the failure probability. If none of the failure events occurs, the total increase of the suff-value is at most (3/4)√n + n^{1/4} < √n. Also, with probability 1 − 2^{−Ω(√n)}, the pre-value decreases by altogether at most O(√n) in the O(√n) mutations that improve the suffix, which can be subsumed in a lower-order term in the above analysis of pre-improving steps. Altogether, with overwhelming probability 1 − 2^{−Ω(n^{1/8})}, the prefix is optimized before the suffix. The probability of reaching the global optimum from the local one is n^{−Ω(n)} since it is necessary to flip (n − m)/10 bits simultaneously to leave the local optimum. In a phase of n^{c′n} steps for a sufficiently small constant c′, this does not happen with probability 1 − 2^{−Ω(n)}. This completes the proof of the statement for the case p = 1/n.

For p = c/n, where c ≥ 3ξ, we argue similarly with inverted roles of prefix and suffix. The probability of activating a block in the suffix is at least (1 − o(1))(c²/2)e^{−c}n^{−3/2} now. In a phase of (7/3)ξ(e^c/c²)n² steps, we expect (7/6)ξ√n activated blocks, and with overwhelming probability we have at least ξ√n such blocks. The probability of improving the pre-value by k is only (1 + o(1))(ce^{−c}/n)^k, amounting to an expected number of improvements by 1 of (1 + o(1))(7/3)(ξ/c)n ≤ (1 + o(1))(7/9)n since c ≥ 3ξ, and, using similar Chernoff and union bounds as above, the probability of at least 9(n − m)/10 pre-improving steps in the phase is 2^{−Ω(n^{1/8})}. □

The previous analysis can be transferred to the SD-(1+1) EA with stagnation detection, showing that this mechanism does not help to increase the success probability significantly compared to the plain (1+1) EA with p = 1/n. The proof shows that the SD-(1+1) EA with high probability does not behave differently from the (1+1) EA. The only major difference is visible after reaching the local optimum of NeedHighMut, where stagnation detection kicks in. This results in the bound 2^{Ω(n)} in the following theorem, compared to n^{Ω(n)} in the previous one.

Theorem 6.
With probability at least 1 − O(1/n), the SD-(1+1) EA needs at least 2^{Ω(n)} steps to optimize NeedHighMut_ξ for ξ ≥ 1.

Proof. We assume that the parameter |R| of the algorithm is set to at least n and follow the analysis of the case p = 1/n from the proof of Theorem 5. In a phase of en² steps, there are at least n − m pre-improving mutations (each having probability at least 1/(en)) with probability 1 − 2^{−Ω(√n)} by Chernoff bounds. For each of these improving mutations, the probability that it does not happen within the threshold of en ln(n|R|) ≥ en ln(n²) iterations is at most (1 − 1/(en))^{en ln(n²)} ≤ 1/n². By a union bound, the probability that at least one of the mutations does not happen within this number of iterations is at most 1/n. Together with the analysis of the number of suff-increasing mutations, this means that the strength stays at 1 until the local optimum is reached, and that the local optimum is reached first, with probability at least 1 − O(1/n).

Leaving the local optimum requires a mutation flipping at least (n − m)/10 = Ω(n) bits simultaneously. As already analyzed in Theorem 2, even at the optimal strength this requires 2^{Ω(n)} steps with probability 1 − 2^{−Ω(n)}. Taking a union bound over all failure probabilities completes the proof. □

Finally, we also show that the self-adaptation scheme of the SASD-(1+λ) EA does not help to concentrate the mutation rate on the right regime for NeedHighMut_ξ if ξ is a sufficiently large constant and λ is not too large. This still applies in connection with stagnation detection.

Theorem 7. Let ξ be a sufficiently large constant and assume λ = o(n) and λ = ω(1). Then with probability at least 1 − O(1/n), the SASD-(1+λ) EA with stagnation detection (Algorithm 3) needs at least 2^{Ω(n)}/λ generations to optimize NeedHighMut_ξ.

The proof of this theorem uses more advanced techniques, more precisely Theorem 1, to analyze the distribution of the mutation strength in the offspring over time. This technique allows us to show that only a small constant fraction of the steps uses strengths that are more beneficial for the suffix than for the prefix.
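For reference, the offspring-based strength adjustment analyzed here can be sketched as follows. This is our own simplification of one generation of the SASD-(1+λ) EA of Doerr et al. (2019): half of the offspring mutate with strength r/2, the other half with 2r, and the strength of a fittest offspring is adopted with uniform tie-breaking. We omit stagnation detection and the algorithm's additional random adjustment step, and the capping interval [2, n/4] is our assumption.

```python
import random

def one_generation(parent, fitness, r, lam, n):
    """One generation of a self-adjusting (1+lambda) EA (simplified sketch).
    Half of the offspring use strength r/2, the other half 2r; the strength
    of a fittest offspring is adopted, with ties broken uniformly."""
    offspring = []
    for i in range(lam):
        # the caps [2, n/4] on the strength are our assumption
        s = max(r / 2, 2) if i < lam // 2 else min(2 * r, n / 4)
        y = [1 - bit if random.random() < s / n else bit for bit in parent]
        offspring.append((fitness(y), s, y))
    best_fitness = max(f for f, _, _ in offspring)
    f_best, s_best, y_best = random.choice(
        [o for o in offspring if o[0] == best_fitness])
    # elitist selection: the parent survives unless an offspring is at least as fit
    new_parent = y_best if f_best >= fitness(parent) else parent
    return new_parent, s_best
```

The proof below shows that, on NeedHighMut, this tie-breaking produces a drift of the strength towards its minimum, so the scheme cannot learn the high rate needed for the suffix.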
Proof.
The idea is to show that the strength has a drift towards its minimum and then to apply Theorem 1 to bound the number of steps in which a mutation rate is used that could be beneficial for the suffix. Then, since most of the steps use small mutation rates, the prefix is optimized before the suffix with high probability and the local optimum is reached.

To make these ideas precise, we pick up and extend the analysis of the acceptance and improvement probabilities from Theorem 5. Hence (with respect to the creation of a single offspring):

• The probability of accepting a mutation at strength r = o(n) is (1 ± o(1))e^{−r} since (1 − o(1))n bits have to be preserved (not flipped). At strengths r = Ω(n), the probability of an accepted mutation is 2^{−Ω(n)}.

• The probability of improving the pre-value by k = o(n) is (1 ± o(1))(r/n)^k e^{−r}.

• The probability of improving the suff-value by k = o(√n) blocks is (1 ± o(1))(r²/(2n^{3/2}))^k e^{−r}.

Clearly, the probability that at least one out of λ offspring improves the function value is at most λ times as large. Since we have λ = o(n) offspring and each improvement has probability p_i = O(1/n), the probability of having at least one improving offspring is at least 1 − (1 − p_i)^λ = (1 − o(1))λp_i, hence also larger by a factor of at least (1 − o(1))λ.

Using these bounds on the acceptance and improvement probabilities, we now use ideas similar to the analysis of the near region in Doerr et al. (2019) to show a drift of the strength towards small values. We distinguish several cases.

Case r_t ≤ (ln λ)/4: the probability of creating a copy of the parent at strength r_t/2 is at least (1 − o(1))e^{−(ln λ)/8} = (1 − o(1))λ^{−1/8}; at strength 2r_t, this probability is smaller by a factor of (1 − o(1))e^{3r_t/2}. Using Chernoff bounds and exploiting λ = ω(1), we have that with probability 1 − o(1), the number of copies produced at strength r_t/2 is larger than the number of copies produced at strength 2r_t, and there is at least one copy produced at strength r_t/2. Due to the uniform choice of the individual adjusting the strength in case of ties, the probability of increasing the strength is at most 1/2 − ε for some constant ε > 0.

Case r_t ≥ 2 ln λ: with probability 1 − o(1), all offspring are invalid in prefix or suffix and therefore worse than the parent, since the fitness function equals −OneMax(x) in this case. Now, since the minimum number of bits flipped at strength 2r_t is with probability 1 − o(1) larger than the maximum number of bits flipped at strength r_t/2, with probability 1 − o(1) an offspring produced at strength r_t/2 has the highest fitness, so the probability of increasing the strength is at most 1/2 − ε again.

Case L := (ln λ)/4 ≤ r_t ≤ 2 ln λ =: U: here we only know that the probability of decreasing the strength is at least 1/2 due to the random strength adjustment of the SASD-(1+λ) EA. However, a constant number of such decreasing steps is enough to reach strength at most L from any strength in [L, U]. Using a potential function with an exponential slope in the range [L, U] like in Doerr et al. (2019), we arrive at a process that increases with probability at most 1/2 − ε and decreases with the remaining probability. We choose a suitable constant ε > 0 such that the process decreases with probability at least 1/2 + ε, except for the case r_t = 2, where the strength stays the same with probability at least 1/2 + ε. Hence, for the process X_t := log₂(r_t), which lives on the non-negative integers, we obtain, writing Δ_t := X_{t+1} − X_t, that

E(e^{ηΔ_t} | F_t; X_t > 2) = e^{−η}(1/2 + ε) + e^{η}(1/2 − ε) ≤ 1 − 2ηε + η² ≤ ρ

for a constant ρ < 1 if η is chosen as a sufficiently small constant (depending on the constant ε). Similarly, given this choice of η, we immediately have

E(e^{ηΔ_t} | F_t; X_t ≤ 2) ≤ D

for a constant D > 0. If we choose b in Theorem 1 as a sufficiently large constant, we obtain, noting a = 2, that T*, the number of generations in which X_t > b holds, is at most T/10 with probability 1 − 2^{−Ω(T)}. Let b* := 2^b, i.e., the strength corresponding to X_t = b. We set T := (5/4)e^{b*}n²/(b*λ). Since a pre-improving mutation has probability at least (1 − o(1))λ(b*/n)e^{−b*} in every generation with X_t ≤ b, and at least (9/10)T of the generations are of this kind, we have an expected number of at least (1 − o(1))(9/8)n pre-improving mutations in the phase, and with probability 1 − 2^{−Ω(n)} at least n − m such mutations by Chernoff bounds. This is sufficient to reach the local optimum unless there are at least ξ√n suff-improving mutations in the phase. Note that the choice of the constant ξ only impacts the length of the prefix in lower-order terms that vanish in O-notation.

We bound the number of suff-improving mutations separately for the generations where X_t ≤ b and where X_t > b. For the first set of generations, we note that the probability of a suff-improving mutation by k ≥ 1 blocks is at most (1 + o(1))λ(2e^{−2}n^{−3/2})^k since the term x²e^{−x} takes its maximum at x = 2. Using similar arguments based on Chernoff and union bounds as in the proof of Theorem 5, we bound the total improvement of the suff-value in the at most T generations where X_t ≤ b by i_1 := 3e^{b*−2}√n/b* with probability 1 − 2^{−Ω(√n)}. For the generations where X_t > b, the probability of a suff-improving mutation is maximized (up to lower-order terms) at strength b* since the function x²e^{−x} is monotonically decreasing for x > 2. Assuming at most T/10 such generations (which holds with probability at least 1 − 2^{−Ω(T)}), we obtain an expected number of suff-improving mutations by 1 of at most

(T/10) · λ(b*²/(2n^{3/2}))e^{−b*} = (1/16)b*√n,

and using Chernoff and union bounds we bound the total improvement of the suff-value in these generations by i_2 := (1/8)b*√n with probability 1 − 2^{−Ω(√n)}. Now, if we choose ξ large enough, then i_1 + i_2 ≤ ξ√n, so that the prefix is optimized before the suffix with probability altogether 1 − 2^{−Ω(n^{1/8})}.

Together with the analysis in Theorem 6 for the case that the stagnation counter exceeds its threshold, this means that with probability 1 − O(1/n) the local optimum is reached before the global one. Again arguing in the same way as in the proof of Theorem 6, the time to reach the global optimum from the local one is 2^{Ω(n)}/λ generations with probability 1 − 2^{−Ω(n)}. The sum of all failure probabilities is O(1/n). □

Experiments

Our theoretical results are asymptotic. In this section, we present experimental results in order to see how the different algorithms perform in practice for small n.

In the first experiment, we ran an implementation of Algorithm 2 (SD-(1+1) EA) and Algorithm 3 (SASD-(1+λ) EA) on the Jump fitness function with jump size m = 4 and n varying from 40 to 160. We compared our algorithms against the (1+1) EA with standard mutation rate 1/n, the (1+1) EA with mutation probability m/n, and the algorithm (1+1) FEA_β from Doerr et al. (2017) with three different choices of β.

In Figures 1 and, more precisely, 2, we observe that the stagnation detection technique makes the algorithm faster than the algorithms with the heavy-tailed mutation operator (1+1) FEA_β.
Also, the SD-(1+1) EA is not much slower than the (1+1) EA with mutation probability m/n even though it does not need to know the gap size.

In the second experiment, we ran our algorithms and the classic (1+1) EA with different mutation probabilities on NeedHighMut_ξ with n ∈ {200, 400, 600, 800, 1000} and ξ = 3. The outcomes support that the theory from Section 4 already holds for small n. In Table 1, one can see that for ξ = 3, the (1+1) EA with p = 6/n and p = 8/n is much more successful in finding global optimum points than the rest of the algorithms.

The implementation is available at https://github.com/DTUComputeTONIA/StagnationDetection.

Figure 2: Box plots comparing the number of fitness calls (over 1000 runs) that the mentioned algorithms take to optimize Jump.

n      (1+1) EA, p=·/n   (1+1) EA, p=·/n   (1+1) EA, p=6/n   (1+1) EA, p=8/n   SD-(1+1) EA   SASD-(1+ln n) EA
200    0.00000           0.00000           0.01181           0.19380           0.00000       0.00000
400    0.00000           0.00000           0.33858           0.87402           0.00100       0.00000
600    0.00000           0.00000           0.42449           0.85950           0.00051       0.00000
800    0.00000           0.00000           0.84000           0.97273           0.00056       0.00229
1000   0.00000           0.00000           0.80769           0.97917           0.00058       0.00121

Table 1: Ratio of runs (out of 1000) that reached a global optimum for ξ = 3.

Conclusions
We have designed and analyzed self-adjusting EAs for multimodal optimization. In particular, we have proposed a module called stagnation detection that can be added to existing EAs without essentially changing their behavior on unimodal (sub)problems. Our stagnation detection keeps track of the number of unsuccessful steps and increases the mutation rate based on statistically significant waiting times without improvement. Hence, there is high evidence for being at a local optimum when the strength is increased.

Theoretical analyses reveal that the (1+1) EA equipped with stagnation detection optimizes the Jump function in asymptotically optimal time corresponding to the best static choice of the mutation rate. Moreover, we have proved a general upper bound for multimodal functions that recovers the asymptotic runtimes on well-known example functions, and we have shown that on unimodal functions, the (1+1) EA with stagnation detection with high probability never deviates from the classical (1+1) EA; a related statement was proved for the self-adjusting (1+λ) EA from Doerr et al. (2019). Finally, to show the limitations of the approach, we have presented a function on which all of our investigated self-adjusting EAs provably fail to be efficient.

In the future, we would like to investigate our module for stagnation detection in other EAs and study its benefits on combinatorial optimization problems.

Acknowledgement
This work was supported by a grant from the Danish Council for Independent Research (DFF-FNU 8021-00260B).
References
Antipov, Denis, Doerr, Benjamin, and Karavaev, Vitalii (2019). A tight runtime analysis for the (1 + (λ, λ)) GA on LeadingOnes. In Proc. of FOGA '19, 169–182. ACM Press.

Antipov, Denis, Doerr, Benjamin, and Karavaev, Vitalii (2020). The (1 + (λ, λ)) GA is even faster on multimodal problems. CoRR, abs/2004.06702. URL http://arxiv.org/abs/2004.06702.

Buzdalov, Maxim, Doerr, Benjamin, and Kever, Mikhail (2016). The unrestricted black-box complexity of jump functions. Evolutionary Computation, 24(4), 719–744.

Corus, Dogan, Oliveto, Pietro Simone, and Yazdani, Donya (2018). Fast artificial immune systems. In Proc. of PPSN '18, 67–78. Springer.

Dang, Duc-Cuong and Lehre, Per Kristian (2016). Self-adaptation of mutation rates in non-elitist populations. In Proc. of PPSN '16, 803–813. Springer.

Doerr, Benjamin (2019). A tight runtime analysis for the cGA on jump functions: EDAs can cross fitness valleys at no extra cost. In Proc. of GECCO '19, 1488–1496. ACM Press.

Doerr, Benjamin and Doerr, Carola (2018). Optimal static and self-adjusting parameter choices for the (1+(λ, λ)) genetic algorithm. Algorithmica, 80(5), 1658–1709.

Doerr, Benjamin and Doerr, Carola (2020). Theory of parameter control for discrete black-box optimization: Provable performance gains through dynamic parameter choices. In Doerr, B. and Neumann, F. (eds.), Theory of Evolutionary Computation – Recent Developments in Discrete Optimization, 271–321. Springer.

Doerr, Benjamin, Doerr, Carola, and Kötzing, Timo (2018). Static and self-adjusting mutation strengths for multi-valued decision variables. Algorithmica, 80(5), 1732–1768.

Doerr, Benjamin, Fouz, Mahmoud, and Witt, Carsten (2010). Quasirandom evolutionary algorithms. In Proc. of GECCO '10, 1457–1464. ACM Press.

Doerr, Benjamin, Gießen, Christian, Witt, Carsten, and Yang, Jing (2019). The (1+λ) evolutionary algorithm with self-adjusting mutation rate. Algorithmica, 81(2), 593–631.

Doerr, Benjamin and Krejca, Martin S. (2018). Significance-based estimation-of-distribution algorithms. In Proc. of GECCO '18, 1483–1490. ACM Press.

Doerr, Benjamin, Le, Huu Phuoc, Makhmara, Régis, and Nguyen, Ta Duy (2017). Fast genetic algorithms. In Proc. of GECCO '17, 777–784. ACM Press.

Doerr, Benjamin, Witt, Carsten, and Yang, Jing (2018). Runtime analysis for self-adaptive mutation rates. In Proc. of GECCO '18, 1475–1482. ACM Press.

Doerr, Carola and Wagner, Markus (2018). Sensitivity of parameter control mechanisms with respect to their initialization. In Proc. of PPSN '18, 360–372. Springer.

Doerr, Carola, Ye, Furong, van Rijn, Sander, Wang, Hao, and Bäck, Thomas (2018). Towards a theory-guided benchmarking suite for discrete black-box optimization heuristics: Profiling (1+λ) EA variants on OneMax and LeadingOnes. In Proc. of GECCO '18, 951–958. ACM Press.

Droste, Stefan, Jansen, Thomas, and Wegener, Ingo (2002). On the analysis of the (1+1) evolutionary algorithm. Theoretical Computer Science, 276, 51–81.

Eiben, A. E., Marchiori, Elena, and Valkó, V. A. (2004). Evolutionary algorithms with on-the-fly population size adjustment. In Proc. of PPSN '04, 41–50. Springer.

Fajardo, Mario A. Hevia (2019). An empirical evaluation of success-based parameter control mechanisms for evolutionary algorithms. In Proc. of GECCO '19, 787–795. ACM Press.

Friedrich, Tobias, Quinzan, Francesco, and Wagner, Markus (2018). Escaping large deceptive basins of attraction with heavy-tailed mutation operators. In Proc. of GECCO '18, 293–300. ACM Press.

Hajek, Bruce (1982). Hitting and occupation time bounds implied by drift analysis with applications. Advances in Applied Probability, 14, 502–525.

Hansen, Pierre and Mladenović, Nenad (2018). Variable neighborhood search. In Martí, Rafael, Pardalos, Panos M., and Resende, Mauricio G. C. (eds.), Handbook of Heuristics, 759–787. Springer.

Jansen, Thomas, Jong, Kenneth A. De, and Wegener, Ingo (2005). On the choice of the offspring population size in evolutionary algorithms. Evolutionary Computation, 13(4), 413–440.

Jansen, Thomas and Wiegand, R. Paul (2004). The cooperative coevolutionary (1+1) EA. Evolutionary Computation, 12(4), 405–434.

Lässig, Jörg and Sudholt, Dirk (2011). Adaptive population models for offspring populations and parallel evolutionary algorithms. In Proc. of FOGA '11, 181–192. ACM Press.

Lengler, Johannes (2018). A general dichotomy of evolutionary algorithms on monotone functions. In Proc. of PPSN '18, 3–15. Springer.

Lissovoi, Andrei, Oliveto, Pietro S., and Warwicker, John Alasdair (2020). Simple hyper-heuristics control the neighbourhood size of randomised local search optimally for LeadingOnes. Evolutionary Computation. In print.

Rodionova, Anna, Antonov, Kirill, Buzdalova, Arina, and Doerr, Carola (2019). Offspring population size matters when comparing evolutionary algorithms with self-adjusting mutation rates. In Proc. of GECCO '19, 855–863. ACM Press.

Rohlfshagen, Philipp, Lehre, Per Kristian, and Yao, Xin (2009). Dynamic evolutionary optimisation: an analysis of frequency and magnitude of change. In Proc. of GECCO '09, 1713–1720. ACM Press.

Rowe, Jonathan E. and Aishwaryaprajna (2019). The benefits and limitations of voting mechanisms in evolutionary optimisation. In Proc. of FOGA '19, 34–42. ACM Press.

Wegener, Ingo (2001). Methods for the analysis of evolutionary algorithms on pseudo-Boolean functions. In Sarker, Ruhul, Mohammadian, Masoud, and Yao, Xin (eds.), Evolutionary Optimization. Kluwer Academic Publishers.

Whitley, Darrell, Varadarajan, Swetha, Hirsch, Rachel, and Mukhopadhyay, Anirban (2018). Exploration and exploitation without mutation: Solving the jump function in Θ(n) time. In Proc. of PPSN '18, 55–66. Springer.

Witt, Carsten (2003). Population size vs. runtime of a simple EA. In