Self-Adjusting Evolutionary Algorithms for Multimodal Optimization
Amirhossein Rajabi, Technical University of Denmark, Kgs. Lyngby, Denmark, [email protected]
Carsten Witt, Technical University of Denmark, Kgs. Lyngby, Denmark, [email protected]
June 3, 2020
Abstract
Recent theoretical research has shown that self-adjusting and self-adaptive mechanisms can provably outperform static settings in evolutionary algorithms for binary search spaces. However, the vast majority of these studies focuses on unimodal functions which do not require the algorithm to flip several bits simultaneously to make progress. In fact, existing self-adjusting algorithms are not designed to detect local optima and do not have any obvious benefit to cross large Hamming gaps.

We suggest a mechanism called stagnation detection that can be added as a module to existing evolutionary algorithms (both with and without prior self-adjusting schemes). Added to a simple (1+1) EA, we prove an expected runtime on the well-known
Jump benchmark that corresponds to an asymptotically optimal parameter setting and outperforms other mechanisms for multimodal optimization like heavy-tailed mutation. We also investigate the module in the context of a self-adjusting (1+λ) EA and show that it combines the previous benefits of this algorithm on unimodal problems with more efficient multimodal optimization.

To explore the limitations of the approach, we additionally present an example where both self-adjusting mechanisms, including stagnation detection, do not help to find a beneficial setting of the mutation rate. Finally, we investigate our module for stagnation detection experimentally.

Introduction

Recent theoretical research on self-adjusting algorithms in discrete search spaces has produced a remarkable body of results showing that self-adjusting and self-adaptive mechanisms outperform static parameter settings. Examples include an analysis of the well-known (1+(λ,λ)) GA using a 1/5-rule on OneMax (Doerr and Doerr, 2018), of a self-adjusting (1+λ) EA sampling offspring with different mutation rates (Doerr et al., 2019), matching the parallel black-box complexity of the OneMax function, and a self-adaptive variant of the latter (Doerr, Witt and Yang, 2018). Furthermore, self-adjusting schemes have been studied for algorithms over search spaces {0, . . . , r}^n for r > 1. For example, the self-adjusting (1+λ) EA from Doerr et al. (2019) samples half of the offspring with strength r/2 (i.e., half the current mutation probability) and the other half with strength 2r. The strength is afterwards adjusted to the one used by a fittest offspring. Similarly, the 1/5-rule relies on the assumption that the run yields some improvements with the different parameters tried or, at least, that the smallest disimprovement observed in unsuccessful mutations gives reliable hints on the choice of the parameter. However, there are situations where the algorithm cannot make progress and does not learn from unsuccessful mutations either.
This can be the case when the algorithm reaches local optima, escaping from which requires an unlikely event (such as flipping many bits simultaneously) to happen. Classical self-adjusting algorithms would observe many unsuccessful steps in such situations and suggest setting the mutation rate to its minimum although that might not be the best choice to leave the local optimum. In fact, the vast majority of runtime results for self-adjusting EAs is concerned with unimodal functions that have no other local optima than the global optimum. An exception is the work of Dang and Lehre (2016), which considers a self-adaptive EA allowing two different mutation probabilities on a specifically designed multimodal problem. Altogether, there is a lack of theoretical results giving guidance on how to design self-adjusting algorithms that can leave local optima efficiently.

In this paper, we address this question and propose a self-adjusting mechanism called stagnation detection that adjusts the mutation rate when the algorithm has reached a local optimum. In contrast to previous self-adjusting algorithms, this mechanism is likely to increase the mutation rate in such situations, leading to a more efficient escape from local optima. This idea has been mentioned before, e.g., in the context of population sizing in stagnation (Eiben, Marchiori and Valkó, 2004); also, in recent empirical studies of the above-mentioned 2-rate (1+λ) EA, handling stagnation by increasing the variance was explicitly suggested (Ye, Doerr and Bäck, 2019).
Our contribution has several advantages over previous discussions of stagnation detection: it represents a simple module that can be added to several existing evolutionary algorithms with little effort, it provably does not change the behavior of the algorithm on unimodal functions (except for small error terms), allowing the transfer of previous results, and we provide rigorous runtime analyses showing general upper bounds for multimodal functions, including its benefits on the well-known Jump benchmark function.

In a nutshell, our stagnation detection mechanism works in the setting of pseudo-boolean optimization and standard bit mutation. Starting from strength r = 1, it increases the strength from r to r + 1 after a long waiting time without improvement has elapsed, meaning it is unlikely that an improving bit string at Hamming distance r exists. This approach bears some resemblance with variable neighborhood search (VNS) (Hansen and Mladenovic, 2018); however, the idea of VNS is to apply local search with a fixed neighborhood until reaching a local optimum and then to adapt the neighborhood structure. There have also been so-called quasirandom evolutionary algorithms (Doerr, Fouz and Witt, 2010) that search the set of Hamming neighbors of a search point more systematically; however, these approaches do not change the expected number of bits flipped. In contrast, our stagnation detection uses an unbiased randomized global search operator in an EA throughout the whole run and just adjusts the underlying mutation probability.
Long waiting times are used as statistical evidence that improvements at Hamming distance r are unlikely to exist; this is rather remotely related to (but clearly inspired by) the estimation-of-distribution algorithm sig-cGA of Doerr and Krejca (2018) that uses statistical significance to counteract genetic drift.

This paper is structured as follows: In Section 2, we introduce the concrete mechanism for stagnation detection and employ it in the context of a simple, static (1+1) EA and the already self-adjusting 2-rate (1+λ) EA. Moreover, we collect tools for the analysis that are used in the rest of the paper. Section 3 deals with concrete runtime bounds for the (1+1) EA and (1+λ) EA with stagnation detection. Besides general upper bounds, we prove a concrete result for the Jump benchmark function that is asymptotically optimal for algorithms using standard bit mutation and outperforms previous mutation-based algorithms for this function like the heavy-tailed EA from Doerr et al. (2017). Elementary techniques are sufficient to show these results. To explore the limitations of stagnation detection and other self-adjusting schemes, we propose in Section 4 a function where these mechanisms provably fail to set the mutation rate to a beneficial regime. As a technical tool, we use drift analysis and analyses of occupation times for processes with strong drift. To that purpose, we use a theorem by Hajek (Hajek, 1982) on occupation times that, to the best of our knowledge, was not used for the analysis of randomized search heuristics before and may be of independent interest. Finally, in Section 5, we add some empirical results, showing that the asymptotically smaller runtime of our algorithm on
Jump is also visible for small problem dimensions. We finish with some conclusions.
Preliminaries
We shall now formally define the algorithms analyzed and present some fundamental tools for the analysis.
We are concerned with pseudo-boolean functions f : {0,1}^n → ℝ that w.l.o.g. are to be maximized. A simple and well-studied EA considered in many runtime analyses (e.g., Droste, Jansen and Wegener (2002)) is the (1+1) EA displayed in Algorithm 1. It uses standard bit mutation with strength r, where 1 ≤ r ≤ n/2, which means that every bit is flipped independently with probability r/n. Usually, r = 1 is used, which is the optimal strength on linear functions (Witt, 2013). Smaller strengths lead to less than 1 bit being flipped in expectation, and strengths above n/2 flip more than half of the bits in expectation.

Algorithm 1 (1+1) EA with static strength r
Select x uniformly at random from {0,1}^n.
for t ← 1, 2, . . . do
  Create y by flipping each bit in a copy of x independently with probability r/n.
  if f(y) ≥ f(x) then x ← y.

The runtime (also called optimization time) of the (1+1) EA on a function f is the first point of time t where a search point of maximal fitness has been created; often the expected runtime, i.e., the expected value of this time, is analyzed. The (1+1) EA with r = 1 has been extensively studied on simple unimodal problems like

OneMax(x_1, . . . , x_n) := x_1 + · · · + x_n = |x|_1  and  LeadingOnes(x_1, . . . , x_n) := Σ_{i=1}^n Π_{j=1}^i x_j,

but also on the multimodal Jump_m function with gap size m, defined as follows:

Jump_m(x_1, . . . , x_n) = m + |x|_1 if |x|_1 ≤ n − m or |x|_1 = n, and n − |x|_1 otherwise.

The classical (1+1) EA with r = 1 optimizes these functions in expected time Θ(n log n), Θ(n²) and Θ(n^m + n log n), respectively (see, e.g., Droste, Jansen and Wegener (2002)). The first two problems are unimodal functions, while Jump_m for m ≥ 2 has a local optimum consisting of all points with |x|_1 = n − m. To overcome this optimum, m bits have to flip simultaneously. It is well known (Doerr et al., 2017) that the time to leave this optimum is minimized at strength m instead of strength 1 (see below for a more detailed exposition of this phenomenon). Hence, the (1+1) EA would benefit from increasing its strength when sitting at the local optimum. The algorithm does not immediately know that it sits at a local optimum.
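For concreteness, these benchmark functions can be written down in a few lines of Python; this is our own illustrative sketch (function names are ours), not code from the paper.

```python
def one_max(x):
    # OneMax: the number of one-bits in x
    return sum(x)

def leading_ones(x):
    # LeadingOnes: the length of the longest prefix of ones
    count = 0
    for bit in x:
        if bit == 0:
            break
        count += 1
    return count

def jump(x, m):
    # Jump_m: OneMax shifted by m, except that the m fitness levels just
    # below the optimum form a slope leading away from it (the "gap")
    n, ones = len(x), sum(x)
    if ones <= n - m or ones == n:
        return m + ones
    return n - ones
```

With n = 10 and m = 2, for instance, the local optimum consists of all points with 8 one-bits (fitness 10), any point with 9 one-bits has fitness 1, and the global optimum with fitness 12 can only be reached from the local optimum by flipping both remaining zero-bits at once.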
However, if there is an improvement at Hamming distance 1, then with strength 1 such an improvement has probability at least (1/n)(1 − 1/n)^{n−1} ≥ 1/(en), and the probability of not finding it in en ln n steps is at most

(1 − 1/(en))^{en ln n} ≤ 1/n.

Similarly, if there is an improvement that can be reached by flipping k bits simultaneously and the current strength equals k, then the probability of not finding it within ((en)^k/k^k) ln n steps is at most

(1 − k^k/(en)^k)^{((en)^k/k^k) ln n} ≤ 1/n.

Hence, after ((en)^k/k^k) ln n steps without improvement there is high evidence that no improvement at Hamming distance k exists.

We put this idea into an algorithmic framework by counting the number of so-called unsuccessful steps, i.e., steps that do not improve fitness. Starting from strength 1, the strength is increased from r to r + 1 when the counter exceeds the threshold 2((en)^r/r^r) ln(nR) for a parameter R to be discussed shortly. Both counter and strength are reset (to 0 and 1, respectively) when an improvement is found, i.e., a search point of strictly better fitness. In the context of the (1+1) EA, the stagnation detection (SD) is incorporated in Algorithm 2. We see that the counter u is increased in every iteration that does not find a strict improvement. However, search points of equal fitness are still accepted as in the classical (1+1) EA. We note that the strength stays at its initial value 1 if finding an improvement never takes longer than the corresponding threshold 2en ln(Rn); if the threshold is never exceeded, the algorithm behaves identically to the (1+1) EA with strength 1 according to Algorithm 1.

The parameter R can be used to control the probability of failing to find an improvement at the "right" strength. More precisely, the probability of not finding an improvement at distance r with strength r is at most

(1 − r^r/(en)^r)^{2((en)^r/r^r) ln(nR)} ≤ 1/(nR)².
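The waiting-time bounds above are easy to check numerically. The following snippet (our own sanity check, not code from the paper) evaluates the probability of missing an existing improvement at Hamming distance k within ((en)^k/k^k) ln n steps at strength k:

```python
import math

def miss_probability(n, k):
    """Probability of not finding an existing improvement at Hamming
    distance k within ((en)^k / k^k) * ln(n) steps at strength k."""
    # the one-step success probability is at least k^k / (en)^k
    q = k ** k / (math.e * n) ** k
    steps = math.log(n) / q          # equals ((en)^k / k^k) * ln(n)
    return (1 - q) ** steps

# e.g. at n = 100 the miss probability stays below 1/n for k = 1, 2, 3
table = {k: miss_probability(100, k) for k in (1, 2, 3)}
```

Since (1 − q)^{1/q} < 1/e, the returned value is below e^{−ln n} = 1/n for every n and k, matching the bound in the text.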
As shown below in Theorem 3, if R is set to the number of fitness values of the underlying function f, i.e., R = |Im(f)|, then the probability of ever missing an improvement at the right strength is sufficiently small throughout the run. We recommend at least R = n if nothing is known about the range of f, resulting in a threshold of at least 4((en)^r/r^r) ln(n) at strength r.

We also add stagnation detection to the (1+λ) EA with self-adjusting mutation rate defined in Doerr et al. (2019) (adapted to maximization of the fitness function), where half of the offspring are created with strength r/2 and the other half with strength 2r; see Algorithm 3. Unsuccessful mutations are counted in the same way as in Algorithm 2, taking into account that λ offspring are used per iteration. The algorithm can be in two states. Unless the counter threshold is reached and a strength increase is triggered, the algorithm behaves the same as the self-adjusting (1+λ) EA from Doerr et al. (2019) (State 2). If, however, the counter threshold 2en ln(nR)/λ is reached, then the algorithm changes to the module that keeps increasing the strength until a strict improvement is found (State 1). Since it does not make sense to decrease the strength in this situation, all offspring use the same strength until finally an improvement is found and the algorithm changes back to the original behavior using two strengths for the offspring. The boolean variable g keeps track of the state. From the discussion of these two algorithms, we see that stagnation detection, consisting of a counter for unsuccessful steps, a threshold, and a strength increase, can also be added to other algorithms while keeping their original behavior unless the counter threshold is reached.

Algorithm 2 (1+1) EA with stagnation detection (SD-(1+1) EA)
Select x uniformly at random from {0,1}^n and set r_1 ← 1, u ← 0.
for t ← 1, 2, . . . do
  Create y by flipping each bit in a copy of x independently with probability r_t/n.
  u ← u + 1.
  if f(y) > f(x) then
    x ← y; r_{t+1} ← 1; u ← 0.
  else if f(y) = f(x) and r_t = 1 then
    x ← y.
  if u > 2(en/r_t)^{r_t} ln(nR) then
    r_{t+1} ← min{r_t + 1, n/2}; u ← 0.
  else
    r_{t+1} ← r_t.
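Algorithm 2 translates into the following Python sketch; this is our own reimplementation (names and the OneMax test run are ours, not from the paper):

```python
import math
import random

def sd_one_plus_one_ea(f, n, R, max_iters, seed=0):
    """Sketch of the SD-(1+1) EA: standard bit mutation at strength r,
    where r is raised to r + 1 once 2 * (en/r)^r * ln(nR) iterations
    pass without a strict fitness improvement."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    r, u = 1, 0
    for _ in range(max_iters):
        y = [bit ^ (rng.random() < r / n) for bit in x]
        u += 1
        if f(y) > f(x):
            x, r, u = y, 1, 0             # strict improvement: reset strength and counter
        elif f(y) == f(x) and r == 1:
            x = y                         # equal fitness is accepted only at strength 1
        if u > 2 * (math.e * n / r) ** r * math.log(n * R):
            r, u = min(r + 1, n // 2), 0  # stagnation detected: raise the strength
    return x

# On OneMax the threshold is never exceeded with high probability,
# so the run behaves like the plain (1+1) EA with strength 1.
best = sd_one_plus_one_ea(sum, n=30, R=30, max_iters=20000)
```

Note that the threshold check is a no-op directly after an improvement, since the counter has just been reset to 0.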
Lemma 1.
For m < n, we have Σ_{i=1}^m (en/i)^i < (n/(n−m)) (en/m)^m.

Algorithm 3 (1+λ) EA with two-rate standard bit mutation and stagnation detection (SASD-(1+λ) EA)
Select x uniformly at random from {0,1}^n and set r_1 ← r_init, u ← 0, g ← False (boolean variable indicating stagnation detection).
for t ← 1, 2, . . . do
  u ← u + 1.
  if g = True then
    State 1 – Stagnation Detection
    for i ← 1, . . . , λ do
      Create x_i by flipping each bit in a copy of x independently with probability r_t/n.
    y ← arg max_{x_i} f(x_i) (breaking ties randomly).
    if f(y) > f(x) then
      x ← y; r_{t+1} ← r_init; g ← False; u ← 0.
    else if u > 2(en/r_t)^{r_t} ln(nR)/λ then
      r_{t+1} ← min{r_t + 1, n/2}; u ← 0.
    else
      r_{t+1} ← r_t.
  else (i.e., g = False)
    State 2 – Self-Adjusting (1+λ) EA
    for i ← 1, . . . , λ do
      Create x_i by flipping each bit in a copy of x independently with probability r_t/(2n) if i ≤ λ/2 and with probability 2r_t/n otherwise.
    y ← arg max_{x_i} f(x_i) (breaking ties randomly).
    if f(y) ≥ f(x) then
      if f(y) > f(x) then u ← 0.
      x ← y.
    Perform one of the following two actions with prob. 1/2 each:
      – Replace r_t with the strength that y has been created with.
      – Replace r_t with either r_t/2 or 2r_t, each with probability 1/2.
    r_{t+1} ← min{max{2, r_t}, n/4}.
    if u > 2(en/r_t)^{r_t} ln(nR)/λ then
      r_{t+1} ← 1; g ← True; u ← 0.

Proof. We have (en/(m−i))^{m−i} = (m/(en))^i (m/(m−i))^{m−i} (en/m)^m for all i < m, so

Σ_{i=1}^m (en/i)^i = Σ_{i=0}^{m−1} (en/(m−i))^{m−i} = (en/m)^m Σ_{i=0}^{m−1} (m/(en))^i (m/(m−i))^{m−i} < (en/m)^m Σ_{i=0}^{m−1} (m/n)^i < (n/(n−m)) (en/m)^m. ∎

The following result due to Hajek applies to processes with a strong drift towards some target state, resulting in decreasing occupation probabilities with respect to the distance from the target. On top of these occupation probabilities, the theorem bounds occupation times, i.e., the number of steps that the process spends in a non-target state over a certain time period.
Theorem 1 (Theorem 3.1 in Hajek (1982)). Let X_t, t ≥ 0, be a stochastic process adapted to a filtration F_t on ℝ. Let a ∈ ℝ. Assume for Δ = X_{t+1} − X_t that there are η > 0, ρ < 1 and D > 0 such that

(a) E(e^{ηΔ} | F_t ; X_t > a) ≤ ρ,
(b) E(e^{ηΔ} | F_t ; X_t ≤ a) ≤ D.

If additionally X_0 is of exponential type (i.e., E(e^{λX_0}) is finite for some λ > 0), then for any constant ε > 0 there exist absolute constants K ≥ 0, δ < 1 such that for all b ≥ a and T ≥ 1,

Pr( (1/T) Σ_{t=1}^T 1{X_t ≤ b} ≤ 1 − ε − (1/((1−ε)(1−ρ))) D e^{η(a−b)} ) ≤ K δ^T.

In this section, we study the SD-(1+1) EA from Algorithm 2 in greater detail. We show general upper and lower bounds on multimodal functions and then analyze the special case of
Jump more precisely. We also show the important result that on unimodal functions, the SD-(1+1) EA with high probability behaves in the same way as the classical (1+1) EA with strength 1, including the same asymptotic bound on the expected optimization time.
In the following, given a fitness function f : {0,1}^n → ℝ, we call the gap of a point x ∈ {0,1}^n the minimum Hamming distance to points with strictly larger fitness value. Formally,

gap(x) := min{H(x, y) : f(y) > f(x), y ∈ {0,1}^n}.

No strict improvement is possible by flipping fewer than gap(x) bits of the current search point. However, if the algorithm creates a point at Hamming distance gap(x) from the current search point x, it can make progress with positive probability. Note that gap(x) = 1 is allowed, so the definition also covers points that are not local optima.

Hereinafter, T_x denotes the number of steps of the SD-(1+1) EA to find an improvement when the current search point is x. Let phase r consist of all points of time where strength r is used in the algorithm with stagnation counter. Let E_r be the event of not finding an improvement by the end of phase r, and U_r be the event of not finding an improvement during phases 1 to r − 1; in other words, U_r = E_1 ∩ · · · ∩ E_{r−1}.

The following lemma will be used throughout this section. It shows that the probability of not finding a search point with larger fitness value in phases of larger strength than the real gap size is small; however, by definition, the strength is never increased beyond n/2. Recall that R controls the threshold for the number of unsuccessful steps in stagnation detection.

Lemma 2.
Let x ∈ {0,1}^n be the current search point of the SD-(1+1) EA on a pseudo-boolean fitness function f : {0,1}^n → ℝ and let m = gap(x). Then

Pr(E_r) ≤ 1/(nR)²  if m ≤ r < n/2,   and   Pr(E_r) = 0  if r = n/2.

Proof.
The algorithm spends 2(en/r)^r ln(nR) steps at strength r before it increases the strength. The probability of not improving during these steps at strength r ≥ m is at most

Pr(E_r) ≤ (1 − (r/n)^m (1 − r/n)^{n−m})^{2(en/r)^r ln(nR)} ≤ 1/(nR)².

During phase n/2, the algorithm does not increase the strength any further, and it continues to mutate each bit with probability 1/2. As each point of the domain is then created with positive probability in every step, the probability of eventually failing to find an improvement is 0. ∎
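Plugging numbers into the r = m case confirms the bound; this is our own numeric check, not part of the paper:

```python
import math

def miss_probability(n, m, R):
    """Probability of not finding an improvement at Hamming distance m
    within the 2 * (en/m)^m * ln(nR) steps spent at strength r = m."""
    p = (m / n) ** m * (1 - m / n) ** (n - m)      # one-step improvement probability
    steps = 2 * (math.e * n / m) ** m * math.log(n * R)
    return (1 - p) ** steps

# with R = n, the miss probability is below 1/(nR)^2 = n^{-4}
checks = [miss_probability(n, m, R=n) <= n ** -4
          for n in (100, 400) for m in (1, 2, 4)]
```

The inequality holds because (1 − m/n)^{n−m} ≥ e^{−m}, so p ≥ (m/(en))^m and the exponent of the failure probability is at least 2 ln(nR).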
We turn the previous observation into a general theorem on improvement times.
Theorem 2.
Let x ∈ {0,1}^n be the current search point of the SD-(1+1) EA on a pseudo-boolean function f : {0,1}^n → ℝ. Define T_x as the time to create a strict improvement and L_{x,k} := E(T_x) if gap(x) = k. Then, using m = min{k, n/2}, we have for all x with gap(x) = k that

(en/m)^m (1 − m²/(n − m)) < L_{x,k} ≤ (en/m)^m (1 + O((m/n) ln(nR))).

Proof.
Using the law of total probability with respect to the events U_i defined above, we have

E(T_x) = Σ_{i=1}^{n/2} E(T_x | U_i) Pr(U_i).   (1)

Note that the algorithm does not increase the strength to more than n/2. By assuming that the algorithm pessimistically does not find a better point for r < m, we can bound formula (1) as follows:

E(T_x) < E(T_x | U_m) + Σ_{i=m+1}^{n/2} E(T_x | U_i) Pr(U_i),

where we call the first summand S_1 and the second S_2. Regarding S_1, it takes Σ_{i=1}^{m−1} 2(en/i)^i ln(nR) steps until the SD-(1+1) EA increases the strength to m. When the mutation probability is m/n, a better point will be found within an expected number of ((m/n)^m (1 − m/n)^{n−m})^{−1} steps. Thus, by using Lemma 1, we have

E(T_x | U_m) ≤ Σ_{i=1}^{m−1} 2(en/i)^i ln(nR) + 1/((m/n)^m (1 − m/n)^{n−m})
< (2n/(n − m + 1)) (en/(m−1))^{m−1} ln(nR) + (en/m)^m
< (en/m)^m (1 + (2m/(en)) (m/(m−1))^{m−1} (n/(n − m + 1)) ln(nR))
≤ (en/m)^m (1 + O((m/n) ln(nR))).

In order to estimate S_2, if m = n/2, the value of S_2 equals zero. Otherwise, by using Lemma 2, Pr(U_i) < Π_{j=m}^{i−1} Pr(E_j) < n^{−2(i−m)} for i ≥ m + 1 since R ≥ 1. We compute

Σ_{i=m+1}^{n/2} E(T_x | U_i) Pr(U_i) ≤ Σ_{i=m+1}^{n/2} O((en/i)^i ln(nR)) n^{−2(i−m)} = ln(nR) Σ_{i=m+1}^{n/2} O((e/i)^i n^{2m−i}) = o((en/m)^m).

Altogether, we have E(T_x) ≤ (en/m)^m (1 + O((m/n) ln(nR))) + o((en/m)^m).

Moreover, the expected number of iterations for finding an improvement is at least p^{−m}(1 − p)^{−(n−m)} for any mutation rate p. Using the same arguments as in the analysis of the (1+1) EA on Jump in Doerr et al. (2017), since m/n is the unique minimum point of this expression in the interval [0, 1/2], we obtain

E(T_x) ≥ (m/n)^{−m} (1 − m/n)^{−(n−m)} ≥ (en/m)^m (1 − m²/(n − m)). ∎

We now present the above-mentioned important "simulation result" implying that on unimodal functions, the stagnation detection of the SD-(1+1) EA is unlikely ever to trigger a strength increase during its run. Moreover, for a wide range of runtime bounds obtained via the fitness level method (Wegener, 2001), we show that these bounds transfer to the SD-(1+1) EA up to vanishingly small error terms. The proof carefully estimates the probability of the strength ever exceeding 1.
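The escape probability (m/n)^m (1 − m/n)^{n−m} appearing in the proof can also be validated by a quick Monte Carlo experiment (our own sketch, not from the paper):

```python
import math
import random

def escape_probability(n, m, trials=200_000, seed=1):
    """Monte Carlo estimate of the probability that strength-m standard
    bit mutation flips exactly a fixed set of m bits and no others."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flips = [rng.random() < m / n for _ in range(n)]
        if all(flips[:m]) and not any(flips[m:]):
            hits += 1
    return hits / trials

n, m = 20, 2
estimate = escape_probability(n, m)
exact = (m / n) ** m * (1 - m / n) ** (n - m)
```

With n = 20 and m = 2 the exact value is about 1.5 · 10⁻³, i.e., an expected waiting time of roughly 670 steps once the strength has reached m.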
Lemma 3.
Let f : {0,1}^n → ℝ be a unimodal function and consider the SD-(1+1) EA with R ≥ |Im(f)|. Then, with probability 1 − o(1), the SD-(1+1) EA never increases the strength and behaves stochastically like the (1+1) EA before finding an optimum of f. Denote by T_sd and T_classic the runtime of the SD-(1+1) EA and of the classical (1+1) EA with strength 1 on f, respectively. If U is an upper bound on E(T_classic) obtained by summing up worst-case expected waiting times for improvements over all fitness values in Im(f), then E(T_sd) ≤ U + o(1).

The same statements hold with SD-(1+1) EA replaced with SASD-(1+λ) EA, and (1+1) EA replaced with the self-adjusting (1+λ) EA without stagnation detection.

Proof.
We let the random set W contain the search points from which the SD-(1+1) EA does not find an improvement within phase 1 (i.e., while r_t = 1). As above, E_1 denotes the event of not finding an improvement within phase 1. As on unimodal functions the gap of all points is 1, we have by Lemma 2 that Pr(E_1) ≤ 1/(Rn)². This argumentation holds for each improvement that has to be found. Since at most |Im(f)| ≤ R improving steps happen before finding the optimum, by a union bound the probability of the SD-(1+1) EA ever increasing the strength beyond 1 is at most R/(Rn)² = o(1), which proves the first claim of the lemma.

To prove the second claim, we consider all fitness values f_1 < · · · < f_{|Im(f)|} in increasing order and sum up upper bounds on the expected times to improve from each of these fitness values. Under the condition that the strength is not increased before leaving a fitness level, the worst-case time to leave a level (over all search points with the same fitness value) is clearly not increased. Hence, we bound the expected optimization time of the SD-(1+1) EA from above by adding the waiting times on all fitness levels for the (1+1) EA, which is given by U, and the expected times spent to leave the points in W; formally,

E(T_sd) ≤ U + Σ_{x∈W} E(T_x).

Each point in Im(f) contributes with probability Pr(E_1) to W. Hence, E(|W|) ≤ |Im(f)| Pr(E_1) ≤ R Pr(E_1). As on unimodal functions the gap of all points is 1, by Lemma 2 we have Pr(U_i) < Π_{j=1}^{i−1} Pr(E_j) < n^{−2(i−1)}. Hence,

E(T_sd) < U + Σ_{x∈W} E(T_x) < U + R · Pr(E_1) Σ_{i=1}^{n/2} E(T_x | U_i) Pr(U_i) < U + R · (nR)^{−2} Σ_{i=1}^{n/2} O((e/i)^i n^{2−i} ln(nR)).

The second term is o(1), hence E(T_sd) ≤ U + o(1), as suggested.

All the arguments apply in the same way with respect to the SASD-(1+λ) EA and its original formulation without stagnation detection.
∎

It is well known that strength 1 for the (1+1) EA leads to an expected runtime of Θ(n^m) on Jump_m if m ≥ 2 since m bits must flip simultaneously to leave the local optimum at n − m one-bits. To minimize the time for such an escaping mutation, mutation rate m/n is optimal (Doerr et al., 2017), leading to an expected time of (1 + o(1))(n/m)^m (1 − m/n)^{m−n} to optimize Jump, which is Θ((en/m)^m) for m = o(√n). However, a static rate of m/n cannot be chosen without knowing the gap size m. Therefore, different heavy-tailed mutation operators have been proposed for the (1+1) EA (Doerr et al., 2017; Friedrich, Quinzan and Wagner, 2018), which most of the time choose strength 1 but also use strength r, for arbitrary r ∈ {1, . . . , n/2}, with at least inverse-polynomial probability. This results in optimization times on Jump of Θ((en/m)^m · p(n)) for some small polynomial p(n) (roughly, p(n) = ω(√m) in Doerr et al. (2017) and p(n) = Θ(n) in Friedrich, Quinzan and Wagner (2018)). Similar polynomial overheads occur with hypermutations as used in artificial immune systems (Corus, Oliveto and Yazdani, 2018); in fact, such overheads cannot be completely avoided with heavy-tailed mutation operators, as proved in Doerr et al. (2017). We also remark that Jump can be optimized faster than O((en/m)^m) if crossover is used (Whitley et al., 2018; Rowe and Aishwaryaprajna, 2019), by simple estimation-of-distribution algorithms (Doerr, 2019) or specific black-box algorithms (Buzdalov, Doerr and Kever, 2016). In addition, an optimization time of n^{(m+1)/2} e^{O(m)} m^{−m/2} is shown for the (1+(λ,λ)) GA on Jump with 2 < m < n/16 in Antipov, Doerr and Karavaev (2020). All of this is outside the scope of this study, which concentrates on mutation-only algorithms.

We now state our main result, implying that the SD-(1+1) EA achieves an asymptotically optimal runtime on
Jump_m for m = o(√n), hence being faster than the heavy-tailed mutations mentioned above. Recall that this does not come at a significant extra cost for simple unimodal functions like OneMax according to Lemma 3.
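As a small-scale illustration (our own experiment; the paper's systematic experiments follow in Section 5), a sketch of the SD-(1+1) EA can be run on Jump to observe that it escapes the plateau quickly once the strength reaches m:

```python
import math
import random

def jump(x, m):
    # Jump_m fitness: slope toward all-ones except on the gap region
    n, ones = len(x), sum(x)
    return m + ones if (ones <= n - m or ones == n) else n - ones

def sd_ea_hitting_time(n, m, R, max_iters, seed=3):
    """Iterations until our SD-(1+1) EA sketch (Algorithm 2) optimizes Jump_m."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    r, u = 1, 0
    for t in range(1, max_iters + 1):
        if sum(x) == n:
            return t                      # global optimum reached
        y = [bit ^ (rng.random() < r / n) for bit in x]
        u += 1
        if jump(y, m) > jump(x, m):
            x, r, u = y, 1, 0
        elif jump(y, m) == jump(x, m) and r == 1:
            x = y
        if u > 2 * (math.e * n / r) ** r * math.log(n * R):
            r, u = min(r + 1, n // 2), 0  # raise strength toward the gap size
    return max_iters

steps = sd_ea_hitting_time(n=20, m=2, R=20, max_iters=500_000)
```

For n = 20 and m = 2, the run typically spends on the order of 2(en) ln(nR) iterations waiting at strength 1 on the plateau and then jumps the gap within roughly (en/m)^m further iterations.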
Theorem 3.
Let n ∈ ℕ. For all 2 ≤ m = O(n/ln n), the expected runtime E(T) of the SD-(1+1) EA on Jump_m satisfies

Ω((en/m)^m (1 − m²/(n − m))) ≤ E(T) ≤ O((en/m)^m).

Proof.
It is well known that the (1+1) EA with mutation rate 1/n finds the optimum of the n-dimensional OneMax function in an expected number of at most en ln n + O(n) iterations. Until reaching the plateau consisting of all points with n − m one-bits, Jump is equivalent to OneMax; hence, according to Lemma 3, the expected time until the SD-(1+1) EA reaches the plateau is at most O(n ln n) (noting that this bound was obtained via the fitness level method).

Every plateau point x with n − m one-bits satisfies gap(x) = m according to the definition of Jump. Thus, using Theorem 2, the algorithm finds the optimum from the plateau within expected time

Ω((en/m)^m (1 − m²/(n − m))) ≤ E(T_x) ≤ O((en/m)^m).

This dominates the expected time of the algorithm before reaching the plateau. Finally,

Ω((en/m)^m (1 − m²/(n − m))) ≤ E(T) ≤ O((en/m)^m). ∎

It is easy to see (similarly to the analysis of Theorem 3) that for all m = Θ(n), the expected runtime E(T) of the SD-(1+1) EA on Jump_m satisfies E(T) = O((en/m)^m ln n).

The Jump function only has one local optimum, which usually has to be overcome on the way to the global optimum. We generalize the previous analysis to functions that have multiple local optima of possibly different gap sizes. As a special case, we asymptotically recover the expected runtime on the
LeadingOnes function in Corollary 1.
Theorem 4.
The expected runtime of the SD-(1+1) EA on a pseudo-Boolean fitness function f is at most

E(T | V_1, . . . , V_n) = O(Σ_{k=1}^n V_k L_k),

where V_k is the number of points x with gap(x) = k visited by the algorithm and L_k := max{L_{x,k} | x ∈ {0,1}^n ∧ gap(x) = k}, with L_{x,k} as defined in Theorem 2. Moreover,

E(T) = O(Σ_{k=1}^n E(V_k) L_k).

Proof. The SD-(1+1) EA visits a random trajectory of search points x_1, x_2, . . . , x_M = x* in order to find an optimum x*. For any search point x with gap(x) = k, the expected time to find a better search point is E(T_x) ≤ L_k according to Theorem 2. Also, we have T = T_{x_1} + T_{x_2} + · · · + T_{x_M}, where the points of gap size k contribute a total expected time of at most V_k L_k. Therefore, as the strength r is reset to 1 after each improvement, we have

E(T | V_1, . . . , V_n) = O(Σ_{k=1}^n V_k L_k),

which proves the first statement of this theorem. The second follows by the law of total expectation. ∎

Corollary 1.
The expected runtime of the SD-(1+1) EA on
LeadingOnes is at most O(n²).

Proof. On LeadingOnes, the algorithm visits at most n points of gap size 1, so according to Theorem 4, the expected runtime is O(n²). ∎

Corollary 1 can also be inferred from Lemma 3 since
LeadingOnes is unimodal and the O(n²) bound was inferred via the fitness level method.

We finally specialize Theorem 4 into a result for the well-known Trap function (Droste, Jansen and Wegener, 2002) that is identical to
OneMax except for the all-zeros string, which has optimal fitness n + 1. We obtain a bound of 2^{Θ(n)} instead of the Θ(n^n) bound for the classical (1+1) EA. The base of our result is somewhat larger than for the fast GA from Doerr et al. (2017); however, it is still close to the 2^n bound that would be obtained by uniform search.

Corollary 2.
The expected runtime of the SD-(1+1) EA on
Trap is at most O(2.34^n ln n).

Proof. On Trap, there is one point of gap size n and at most n points of gap size 1. So according to Theorem 4, the expected runtime is O((2e)^{n/2} ln n) = O(2.34^n ln n). ∎

While our previous analyses have shown the benefits of the self-adjusting scheme, in particular highlighting stagnation detection on multimodal functions, it is clear that our scheme also has limitations. In this section, we present an example of a pseudo-Boolean function where stagnation detection does not help to find its global optimum in polynomial time; moreover, the function is hard for other self-adjusting schemes since measuring the number of successes does not hint at the location of the global optimum. In fact, the function demonstrates a more general effect where the behavior is very sensitive with respect to the choice of the mutation probability. More precisely, a plain (1+1) EA with mutation probability 1/n with overwhelming probability gets stuck in a local optimum from which it needs exponential time to escape, while the (1+1) EA with mutation probability 2/n (and also above) finds the global optimum in polynomial time with overwhelming probability. Since the function is unimodal except at the local optimum, our self-adjusting (1+1) EA with stagnation detection fails as well.

To the best of our knowledge, a phase transition with respect to the mutation probability, where an increase by a small constant factor leads from exponential to polynomial optimization time, has been unknown in the literature of runtime analysis so far and may be of independent interest.
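The rate sensitivity underlying the construction below can be previewed with a short calculation (our own numeric sketch) comparing the expected numbers of specific one-bit and two-bit flips within T steps under the rates 1/n and 2/n:

```python
def expected_flips(T, p, n):
    # expected number of occurrences, in T steps of standard bit mutation
    # with rate p, of a specific 1-bit flip and of a specific 2-bit flip
    one_bit = T * p * (1 - p) ** (n - 1)
    two_bit = T * p ** 2 * (1 - p) ** (n - 2)
    return one_bit, two_bit

n, T = 100, 10 ** 6
one_lo, two_lo = expected_flips(T, 1 / n, n)   # strength 1
one_hi, two_hi = expected_flips(T, 2 / n, n)   # strength 2
# strength 2 trades one-bit flips for two-bit flips
```

For n = 100 and T = 10⁶, this gives roughly 3700 vs. 2700 one-bit flips and 37 vs. 55 two-bit flips at the rates 1/n and 2/n, respectively.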
We are aware of opposite phase transitions on monotone functions (Lengler, 2018) where increasing the mutation rate is detrimental; however, we feel that our function and the general underlying construction principle are easier to understand than these specific monotone functions.

The construction of our function, called NeedHighMut, is based on a general principle that was introduced in Witt (2003) to show the benefits of populations and was subsequently applied in Jansen and Wiegand (2004) to separate a coevolutionary variant of the (1+1) EA from the standard (1+1) EA. Section 5 of the latter paper also beautifully describes the general construction technique, which involves creating two differently pronounced gradients for the algorithms to follow. Further applications are given in Witt (2006) and Witt (2008) to show the benefit of populations in elitist and non-elitist EAs. Also, Rohlfshagen, Lehre and Yao (2009) use a very similar construction technique for their Balance function, which is easier to optimize in frequently changing than in slowly changing environments; however, they did not seem to be aware that their approach resembles earlier work from the papers above.

We now describe the construction of our function
NeedHighMut. The crucial observation is that strength 1 (i.e., probability p = 1/n) makes it more likely to flip exactly one specific bit than strength 2; in fact, strength 1 is asymptotically optimal since the probability of flipping one specific bit is p(1 − p)^{n−1} ≈ pe^{−pn}, which is maximized for p = 1/n. However, to flip two specific bits, which has probability p²(1 − p)^{n−2} ≈ p²e^{−pn}, the choice p = 2/n is asymptotically optimal and clearly better than 1/n. Now, given a hypothetical time span of T steps, we expect approximately T_1(p) := T · p · e^{−pn} specific one-bit flips and T_2(p) := T · p² · e^{−pn} specific two-bit flips. Assuming the actual numbers to be concentrated and just arguing with expected values, we have T_1(1/n) > T_1(2/n) but T_2(2/n) > T_2(1/n), i.e., there will be considerably more two-bit flips at strength 2 than at strength 1 and considerably fewer one-bit flips. The fitness function will account for this. It leads to a trap at a local optimum if a certain number of one-bit flips is exceeded before a certain minimum number of two-bit flips has happened; however, if the number of one-bit flips is low enough before the minimum number of two-bit flips has been reached, the process is on track to the global optimum.

We proceed with the formal definition of NeedHighMut, making these ideas precise and overcoming technical hurdles. Since we have at most n specific one-bit flips but a specific two-bit flip is already by a factor of O(1/n) less likely than a one-bit flip, we will work with two-bit flips happening in small blocks of size ⌈n^{1/4}⌉, leading to a probability of roughly n^{−3/2} for a two-bit flip in a block. In the following, we will imagine a bit string x of length n as being split into a prefix a := a(x) of length n − m and a suffix b := b(x) of length m, where m still has to be defined. Hence, x = a(x) ∘ b(x), where ∘ denotes the concatenation.

The prefix a(x) is called valid if it is of the form 1^i 0^{n−m−i}, i.e., i leading ones and n − m − i trailing zeros. The prefix fitness pre(x) of a string x ∈ {0, 1}^n with valid prefix a(x) = 1^i 0^{n−m−i} equals just i, the number of leading ones. The suffix consists of ⌈ξ√n⌉ blocks of ⌈n^{1/4}⌉ bits each, where ξ ≥ 1 is a parameter of the function; altogether m ≤ 2ξn^{3/4} = o(n) bits. Such a block is called valid if it contains either 0 or 2 one-bits; moreover, it is called active if it contains 2 and inactive if it contains 0 one-bits. A suffix where all blocks are valid and where all blocks following the first inactive block are also inactive is called valid itself, and the suffix fitness suff(x) of a string x with valid suffix b(x) is the number of leading active blocks before the first inactive block. Finally, we call a string x ∈ {0, 1}^n valid if both its prefix and suffix are valid.

Our final fitness function is a weighted combination of pre(x) and suff(x). We define for x ∈ {0, 1}^n, where x = a ∘ b with the above-introduced a and b,

NeedHighMut_ξ(x) :=
  n · suff(x) + pre(x)                       if pre(x) ≤ 9(n − m)/10 ∧ x valid,
  n · ⌈ξ√n⌉ + pre(x) + suff(x) − n           if pre(x) > 9(n − m)/10 ∧ x valid,
  −OneMax(x)                                 otherwise.

We note that all search points in the second case have a fitness greater than n⌈ξ√n⌉ + 9(n − m)/10 − n = n(⌈ξ√n⌉ − 1) + 9(n − m)/10, which is an upper bound on the fitness of search points that fall into the first case without having ⌈ξ√n⌉ leading active blocks in the suffix. Hence, search points x where pre(x) = n − m and suff(x) = ⌈ξ√n⌉ represent local optima of second-best overall fitness. The set of global optima equals the points where pre(x) = 9(n − m)/10 and suff(x) = ⌈ξ√n⌉, which implies that (n − m)/10 = Ω(n) bits have to be flipped simultaneously to escape from the local toward the global optimum.

The parameter ξ ≥ 1 controls how high the mutation rate has to be to reach the global optimum: already for ξ = 1, strength 1 usually leads to the local optimum first, while strengths above 2 usually lead directly to the global optimum. Using larger ξ increases the threshold for the strength necessary to find the global optimum instead of being trapped in the local one.

We now formally show with respect to different algorithms that NeedHighMut is challenging to optimize without setting the right mutation probability in advance. We start with an analysis of the classical (1+1) EA, where we for simplicity only show the negative result for p = 1/n even though it would even hold for mutation probabilities as large as ξ/n.

Theorem 5.
Consider the plain (1+1) EA with mutation probability p on NeedHighMut_ξ for a constant ξ ≥ 1. If p = 1/n, then with probability 1 − 2^{−Ω(n^{1/8})}, its optimization time is n^{Ω(n)}. If p = (cξ)/n for any constant c ≥ 3, then the optimization time is O(n²) with probability 1 − 2^{−Ω(√n)}.

Proof. It is easy to see (similarly to the analysis of the SufSamp function from Jansen, Jong and Wegener (2005)) that the first valid search point (i.e., the first search point of non-negative fitness) has both pre- and suff-value of at most n^{2/3} with probability 1 − 2^{−Ω(n^{1/3})}. This follows from the fact that the function is symmetric on invalid search points and that from each level set of i one-bits, only O(1) search points are valid. In the following, we tacitly assume that we have reached a valid search point of the described maximum pre- and suff-value and note that this changes the required number of improvements to reach the local or global optimum only by a 1 − o(1) factor. For readability, this factor will not be spelt out any more.

We prepare the main analysis by bounding the probability of a mutation being accepted after a valid search point has been reached. Even if a mutation changes up to o(n) consecutive bits of the prefix or suffix, it must maintain n − o(n) prefix bits in order to result in a valid search point. Hence, the probability of an accepted step at mutation probability c/n (valid for any constant c) is at most (1 − c/n)^{n−m−o(n)} = (1 + o(1))e^{−c}. Steps flipping Ω(n) consecutive bits have probability n^{−Ω(n)} and are subsumed by the failure probabilities stated in this theorem. Clearly, the probability of an accepted step is also at least (1 − c/n)^n = (1 − o(1))e^{−c}.

Using this knowledge of accepted steps, we shall now prove the statement for p = 1/n. The probability of improving the pre-value is at least e^{−1}/n since it is sufficient to flip the leftmost zero of the prefix to 1. In a phase of en² steps, there are at least n − m prefix-improving mutations with probability 1 − 2^{−Ω(√n)} by Chernoff bounds. All these improve the function value and are accepted unless the suff-value increases to ⌈ξ√n⌉ before the pre-value exceeds 9(n − m)/10. The probability of an accepted step activating a single suffix block is at most (⌈n^{1/4}⌉ choose 2) · n^{−2} · e^{−1} · (1 + o(1)) ≤ (1 + o(1))(e^{−1}/2)n^{−3/2} since it is necessary to flip two zeros of one block into ones and to have an accepted mutation.
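As a quick numerical sanity check of the two competing event rates in this race (our own illustration, not part of the formal proof), the following sketch compares the per-step probability of a specific one-bit flip with that of a two-bit flip inside one block of b bits at rates 1/n and 2/n; the block size b = ⌈n^{1/4}⌉ follows the construction described above.

```python
from math import comb

def specific_one_bit_flip(p, n):
    # probability to flip one fixed bit and keep the remaining n - 1 bits
    return p * (1 - p) ** (n - 1)

def two_bit_flip_in_block(p, n, b):
    # probability to flip exactly some pair among the b block bits, keep the rest
    return comb(b, 2) * p ** 2 * (1 - p) ** (n - 2)

n = 10 ** 6
b = round(n ** 0.25)  # block size, here roughly n^(1/4)
for c in (1, 2):
    p = c / n
    print(c, specific_one_bit_flip(p, n), two_bit_flip_in_block(p, n, b))
```

Doubling the rate from 1/n to 2/n multiplies the two-bit rate by roughly 4/e ≈ 1.47 while reducing the one-bit rate by a factor of roughly 2/e ≈ 0.74, which is exactly the asymmetry the fitness function exploits.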
By the same reasoning, steps that activate k = o(√n) blocks simultaneously have a probability of at most (1 + o(1))((e^{−1}/2)n^{−3/2})^k. We consider a phase of s := en² steps and bound the number of accepted steps increasing the suff-value by k by applying Chernoff bounds, since this number is stochastically dominated by a binomial distribution with parameters s and p_k := (1 + o(1))((e^{−1}/2)n^{−3/2})^k. Hence, the number of accepted steps activating one suffix block in en² steps is less than (3/4)√n with probability 1 − 2^{−Ω(√n)}. The expected number of accepted steps activating k ≥ 2 blocks simultaneously is O(1/n), and by Chernoff bounds the actual number is at most n^{1/8} with probability 1 − 2^{−Ω(n^{1/8})}. Hence, by a union bound over k ∈ {2, ..., n^{1/8}}, the steps adding more than one valid suffix block increase the suff-value by at most n^{1/8} · n^{1/8} = n^{1/4} with probability 1 − 2^{−Ω(n^{1/8})}. Steps adding k > n^{1/8} valid blocks have probability 2^{−Ω(n^{1/8})} and are subsumed by the failure probability. If none of the failure events occurs, the total increase of the suff-value is at most (3/4)√n + n^{1/4} < √n. Also, with probability 1 − 2^{−Ω(√n)}, the pre-value decreases by altogether at most O(√n) in the O(√n) mutations that improve the suffix, which can be subsumed in a lower-order term in the above analysis of pre-improving steps. Altogether, with overwhelming probability 1 − 2^{−Ω(n^{1/8})}, the prefix is optimized before the suffix. The probability of reaching the global optimum from the local one is n^{−Ω(n)} since it is necessary to flip (n − m)/10 bits simultaneously to leave the local optimum. In a phase of n^{c′n} steps for a sufficiently small constant c′, this does not happen with probability 1 − 2^{−Ω(n)}. This completes the proof of the statement for the case p = 1/n.

For p = c/n, where c ≥ 3ξ, we argue similarly with inverted roles of prefix and suffix. The probability of activating a block in the suffix is at least (1 − o(1))(c²/2)e^{−c}n^{−3/2} now. In a phase of (7/3)ξ(e^c/c²)n² steps, we expect (7/6)ξ√n activated blocks, and with overwhelming probability we have at least ξ√n such blocks. The probability of improving the pre-value by k is only (1 + o(1))(ce^{−c}/n)^k, amounting to an expected number of improvements by 1 of (1 + o(1))(7/3)(ξ/c)n ≤ (1 + o(1))(7/9)n since c ≥ 3ξ, and, using similar Chernoff and union bounds as above, the probability of at least 9(n − m)/10 pre-improving steps in the phase is 2^{−Ω(n^{1/8})}. □

The previous analysis can be transferred to the SD-(1+1) EA with stagnation detection, showing that this mechanism does not help to increase the success probability significantly compared to the plain (1+1) EA with p = 1/n. The proof shows that the SD-(1+1) EA with high probability does not behave differently from the (1+1) EA. The only major difference is visible after reaching the local optimum of NeedHighMut, where stagnation detection kicks in. This results in the bound 2^{Ω(n)} in the following theorem, compared to n^{Ω(n)} in the previous one.

Theorem 6.
With probability at least 1 − O(1/n), the SD-(1+1) EA needs at least 2^{Ω(n)} steps to optimize NeedHighMut_ξ for ξ ≥ 1.

Proof. We assume that the parameter |R| of the algorithm is set to at least n and follow the analysis of the case p = 1/n from the proof of Theorem 5. In a phase of en² steps, there are at least n − m pre-improving mutations (each having probability at least 1/(en)) with probability 1 − 2^{−Ω(√n)} by Chernoff bounds. For each of these improving mutations, the probability that it does not happen within the threshold of en ln(n|R|) ≥ en ln(n²) iterations is at most (1 − 1/(en))^{en ln(n²)} ≤ 1/n². By a union bound, the probability that at least one of the mutations does not happen within this number of iterations is at most 1/n. Together with the analysis of the number of suff-increasing mutations, this means that the strength stays at 1 until the local optimum is reached, and that the local optimum is reached first, with probability at least 1 − O(1/n).

Leaving the local optimum requires a mutation flipping at least (n − m)/10 = Ω(n) bits simultaneously. As already analyzed in Theorem 2, even at the optimal strength this requires 2^{Ω(n)} steps with probability 1 − 2^{−Ω(n)}. Taking a union bound over all failure probabilities completes the proof. □

Finally, we also show that the self-adaptation scheme of the SASD-(1+λ) EA does not help to concentrate the mutation rate on the right regime for NeedHighMut_ξ if ξ is a sufficiently large constant and λ is not too large. This still applies in connection with stagnation detection.

Theorem 7. Let ξ be a sufficiently large constant and assume λ = o(n) and λ = ω(1). Then with probability at least 1 − O(1/n), the SASD-(1+λ) EA with stagnation detection (Algorithm 3) needs at least 2^{Ω(n)}/λ generations to optimize NeedHighMut_ξ.

The proof of this theorem uses more advanced techniques, more precisely Theorem 1, to analyze the distribution of the mutation strength in the offspring over time. This technique allows us to show that only a small constant fraction of the steps uses strengths that are more beneficial for the suffix than for the prefix.
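For reference, the offspring-based strength adjustment analyzed here can be sketched as follows. This is our own simplification of one generation of the SASD-(1+λ) EA of Doerr et al. (2019): half of the offspring mutate with strength r/2, the other half with 2r, and the strength of a fittest offspring is adopted with uniform tie-breaking. We omit stagnation detection and the algorithm's additional random adjustment step, and the capping interval [2, n/4] is our assumption.

```python
import random

def one_generation(parent, fitness, r, lam, n):
    """One generation of a self-adjusting (1+lambda) EA (simplified sketch).
    Half of the offspring use strength r/2, the other half 2r; the strength
    of a fittest offspring is adopted, with ties broken uniformly."""
    offspring = []
    for i in range(lam):
        # the caps [2, n/4] on the strength are our assumption
        s = max(r / 2, 2) if i < lam // 2 else min(2 * r, n / 4)
        y = [1 - bit if random.random() < s / n else bit for bit in parent]
        offspring.append((fitness(y), s, y))
    best_fitness = max(f for f, _, _ in offspring)
    f_best, s_best, y_best = random.choice(
        [o for o in offspring if o[0] == best_fitness])
    # elitist selection: the parent survives unless an offspring is at least as fit
    new_parent = y_best if f_best >= fitness(parent) else parent
    return new_parent, s_best
```

The proof below shows that, on NeedHighMut, this tie-breaking produces a drift of the strength towards its minimum, so the scheme cannot learn the high rate needed for the suffix.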
Proof.
The idea is to show that the strength has a drift towards its minimum and then to apply Theorem 1 to bound the number of steps in which a mutation rate is used that could be beneficial for the suffix. Then, since most of the steps use small mutation rates, the prefix is optimized before the suffix with high probability and the local optimum is reached.

To make these ideas precise, we pick up and extend the analysis of the acceptance and improvement probabilities from Theorem 5. Hence (with respect to the creation of a single offspring):

• The probability of accepting a mutation at strength r = o(n) is (1 ± o(1))e^{−r} since (1 − o(1))n bits have to be preserved (not flipped). At strengths r = Ω(n), the probability of an accepted mutation is 2^{−Ω(n)}.

• The probability of improving the pre-value by k = o(n) is (1 ± o(1))(r/n)^k e^{−r}.

• The probability of improving the suff-value by k = o(√n) blocks is (1 ± o(1))(r²/(2n^{3/2}))^k e^{−r}.

Clearly, the probability that at least one out of λ offspring improves the function value is at most λ times as large. Since we have λ = o(n) offspring and each improvement has probability p_i = O(1/n), the probability of having at least one improving offspring is at least 1 − (1 − p_i)^λ = (1 − o(1))λp_i, hence also larger by a factor of at least (1 − o(1))λ.

Using these bounds on the acceptance and improvement probabilities, we now use ideas similar to the analysis of the near region in Doerr et al. (2019) to show a drift of the strength towards small values. We distinguish several cases.

Case r_t ≤ (ln λ)/4: the probability of creating a copy of the parent at strength r_t/2 is at least (1 − o(1))e^{−(ln λ)/8} = (1 − o(1))λ^{−1/8}; at strength 2r_t, this probability is smaller by a factor of (1 − o(1))e^{3r_t/2}. Using Chernoff bounds and exploiting λ = ω(1), we have that with probability 1 − o(1), the number of copies produced at strength r_t/2 is larger than the number of copies produced at strength 2r_t, and there is at least one copy produced at strength r_t/2. Due to the uniform choice of the individual adjusting the strength in case of ties, the probability of increasing the strength is at most 1/2 − ε for some constant ε > 0.

Case r_t ≥ 2 ln λ: with probability 1 − o(1), all offspring are invalid in prefix or suffix and therefore worse than the parent, since the fitness function equals −OneMax(x) in this case. Now, since the minimum number of bits flipped at strength 2r_t is with probability 1 − o(1) larger than the maximum number of bits flipped at strength r_t/2, with probability 1 − o(1) an offspring produced at strength r_t/2 has the highest fitness, so the probability of increasing the strength is at most 1/2 − ε again.

Case L := (ln λ)/4 ≤ r_t ≤ 2 ln λ =: U: here we only know that the probability of decreasing the strength is at least 1/2 due to the random strength adjustment of the SASD-(1+λ) EA. However, a constant number of such decreasing steps is enough to reach strength at most L from any strength in [L, U]. Using a potential function with an exponential slope in the range [L, U] like in Doerr et al. (2019), we arrive at a process that increases with probability at most 1/2 − ε and decreases with the remaining probability. We choose a suitable constant ε > 0 such that the process decreases with probability at least 1/2 + ε, except for the case r_t = 2, where the strength stays the same with probability at least 1/2 + ε. Hence, for the process X_t := log₂(r_t), which lives on the non-negative integers, we obtain, writing Δ_t := X_{t+1} − X_t, that

E(e^{ηΔ_t} | F_t; X_t > 2) = e^{−η}(1/2 + ε) + e^{η}(1/2 − ε) ≤ 1 − 2ηε + η² ≤ ρ

for a constant ρ < 1 if η is chosen as a sufficiently small constant (depending on the constant ε). Similarly, given this choice of η, we immediately have

E(e^{ηΔ_t} | F_t; X_t ≤ 2) ≤ D

for a constant D > 0. If we choose b in Theorem 1 as a sufficiently large constant, we obtain, noting a = 2, that T*, the number of generations in which X_t > b holds, is at most T/10 with probability 1 − 2^{−Ω(T)}. Let b* := 2^b, i.e., the strength corresponding to X_t = b. We set T := (5/4)e^{b*}n²/(b*λ). Since a pre-improving mutation has probability at least (1 − o(1))λ(b*/n)e^{−b*} in every generation with X_t ≤ b, and at least (9/10)T of the generations are of this kind, we have an expected number of at least (1 − o(1))(9/8)n pre-improving mutations in the phase, and with probability 1 − 2^{−Ω(n)} at least n − m such mutations by Chernoff bounds. This is sufficient to reach the local optimum unless there are at least ξ√n suff-improving mutations in the phase. Note that the choice of the constant ξ only impacts the length of the prefix in lower-order terms that vanish in O-notation.

We bound the number of suff-improving mutations separately for the generations where X_t ≤ b and where X_t > b. For the first set of generations, we note that the probability of a suff-improving mutation by k ≥ 1 blocks is at most (1 + o(1))λ(2e^{−2}n^{−3/2})^k since the term x²e^{−x} takes its maximum at x = 2. Using similar arguments based on Chernoff and union bounds as in the proof of Theorem 5, we bound the total improvement of the suff-value in the at most T generations where X_t ≤ b by i_1 := 3e^{b*−2}√n/b* with probability 1 − 2^{−Ω(√n)}. For the generations where X_t > b, the probability of a suff-improving mutation is maximized (up to lower-order terms) at strength b* since the function x²e^{−x} is monotonically decreasing for x > 2. Assuming at most T/10 such generations (which holds with probability at least 1 − 2^{−Ω(T)}), we obtain an expected number of suff-improving mutations by 1 of at most

(T/10) · λ(b*²/(2n^{3/2}))e^{−b*} = (1/16)b*√n,

and using Chernoff and union bounds we bound the total improvement of the suff-value in these generations by i_2 := (1/8)b*√n with probability 1 − 2^{−Ω(√n)}. Now, if we choose ξ large enough, then i_1 + i_2 ≤ ξ√n, so that the prefix is optimized before the suffix with probability altogether 1 − 2^{−Ω(n^{1/8})}.

Together with the analysis in Theorem 6 for the case that the stagnation counter exceeds its threshold, this means that with probability 1 − O(1/n) the local optimum is reached before the global one. Again arguing in the same way as in the proof of Theorem 6, the time to reach the global optimum from the local one is 2^{Ω(n)}/λ generations with probability 1 − 2^{−Ω(n)}. The sum of all failure probabilities is O(1/n). □

Experiments

Our theoretical results are asymptotic. In this section, we present experimental results in order to see how the different algorithms perform in practice for small n.

In the first experiment, we ran an implementation of Algorithm 2 (SD-(1+1) EA) and Algorithm 3 (SASD-(1+λ) EA) on the Jump fitness function with jump size m = 4 and n varying from 40 to 160. We compared our algorithms against the (1+1) EA with standard mutation rate 1/n, the (1+1) EA with mutation probability m/n, and the algorithm (1+1) FEA_β from Doerr et al. (2017) with three different choices of β.

In Figures 1 and, more precisely, 2, we observe that the stagnation detection technique makes the algorithm faster than the algorithms with the heavy-tailed mutation operator (1+1) FEA_β.
Also, the SD-(1+1) EA is not much slower than the (1+1) EA with mutation probability m/n even though it does not need to know the gap size.

In the second experiment, we ran our algorithms and the classic (1+1) EA with different mutation probabilities on NeedHighMut_ξ with n ∈ {200, 400, 600, 800, 1000} and ξ = 3. The outcomes support that the theory from Section 4 already holds for small n. In Table 1, one can see that for ξ = 3, the (1+1) EA with p = 6/n and p = 8/n is much more successful in finding global optimum points than the rest of the algorithms.

The implementation is available at https://github.com/DTUComputeTONIA/StagnationDetection.

Figure 2: Box plots comparing the number of fitness calls (over 1000 runs) that the mentioned algorithms take to optimize Jump.

n      (1+1) EA, p=·/n   (1+1) EA, p=·/n   (1+1) EA, p=6/n   (1+1) EA, p=8/n   SD-(1+1) EA   SASD-(1+ln n) EA
200    0.00000           0.00000           0.01181           0.19380           0.00000       0.00000
400    0.00000           0.00000           0.33858           0.87402           0.00100       0.00000
600    0.00000           0.00000           0.42449           0.85950           0.00051       0.00000
800    0.00000           0.00000           0.84000           0.97273           0.00056       0.00229
1000   0.00000           0.00000           0.80769           0.97917           0.00058       0.00121

Table 1: Ratio of runs (out of 1000) that reached a global optimum for ξ = 3.

Conclusions
We have designed and analyzed self-adjusting EAs for multimodal optimization. In particular, we have proposed a module called stagnation detection that can be added to existing EAs without essentially changing their behavior on unimodal (sub)problems. Our stagnation detection keeps track of the number of unsuccessful steps and increases the mutation rate based on statistically significant waiting times without improvement. Hence, there is high evidence for being at a local optimum when the strength is increased.

Theoretical analyses reveal that the (1+1) EA equipped with stagnation detection optimizes the Jump function in asymptotically optimal time corresponding to the best static choice of the mutation rate. Moreover, we have proved a general upper bound for multimodal functions that recovers the asymptotic runtimes on well-known example functions, and we have shown that on unimodal functions, the (1+1) EA with stagnation detection with high probability never deviates from the classical (1+1) EA; a related statement was proved for the self-adjusting (1+λ) EA from Doerr et al. (2019). Finally, to show the limitations of the approach, we have presented a function on which all of our investigated self-adjusting EAs provably fail to be efficient.

In the future, we would like to investigate our module for stagnation detection in other EAs and study its benefits on combinatorial optimization problems.

Acknowledgement
This work was supported by a grant from the Danish Council for Independent Research (DFF-FNU 8021-00260B).
References
Antipov, Denis, Doerr, Benjamin, and Karavaev, Vitalii (2019). A tight runtime analysis for the (1 + (λ, λ)) GA on LeadingOnes. In Proc. of FOGA '19, 169–182. ACM Press.

Antipov, Denis, Doerr, Benjamin, and Karavaev, Vitalii (2020). The (1 + (λ, λ)) GA is even faster on multimodal problems. CoRR, abs/2004.06702. URL http://arxiv.org/abs/2004.06702.

Buzdalov, Maxim, Doerr, Benjamin, and Kever, Mikhail (2016). The unrestricted black-box complexity of jump functions. Evolutionary Computation, 24(4), 719–744.

Corus, Dogan, Oliveto, Pietro Simone, and Yazdani, Donya (2018). Fast artificial immune systems. In Proc. of PPSN '18, 67–78. Springer.

Dang, Duc-Cuong and Lehre, Per Kristian (2016). Self-adaptation of mutation rates in non-elitist populations. In Proc. of PPSN '16, 803–813. Springer.

Doerr, Benjamin (2019). A tight runtime analysis for the cGA on jump functions: EDAs can cross fitness valleys at no extra cost. In Proc. of GECCO '19, 1488–1496. ACM Press.

Doerr, Benjamin and Doerr, Carola (2018). Optimal static and self-adjusting parameter choices for the (1+(λ, λ)) genetic algorithm. Algorithmica, 80(5), 1658–1709.

Doerr, Benjamin and Doerr, Carola (2020). Theory of parameter control for discrete black-box optimization: Provable performance gains through dynamic parameter choices. In Doerr, B. and Neumann, F. (eds.), Theory of Evolutionary Computation – Recent Developments in Discrete Optimization, 271–321. Springer.

Doerr, Benjamin, Doerr, Carola, and Kötzing, Timo (2018). Static and self-adjusting mutation strengths for multi-valued decision variables. Algorithmica, 80(5), 1732–1768.

Doerr, Benjamin, Fouz, Mahmoud, and Witt, Carsten (2010). Quasirandom evolutionary algorithms. In Proc. of GECCO '10, 1457–1464. ACM Press.

Doerr, Benjamin, Gießen, Christian, Witt, Carsten, and Yang, Jing (2019). The (1+λ) evolutionary algorithm with self-adjusting mutation rate. Algorithmica, 81(2), 593–631.

Doerr, Benjamin and Krejca, Martin S. (2018). Significance-based estimation-of-distribution algorithms. In Proc. of GECCO '18, 1483–1490. ACM Press.

Doerr, Benjamin, Le, Huu Phuoc, Makhmara, Régis, and Nguyen, Ta Duy (2017). Fast genetic algorithms. In Proc. of GECCO '17, 777–784. ACM Press.

Doerr, Benjamin, Witt, Carsten, and Yang, Jing (2018). Runtime analysis for self-adaptive mutation rates. In Proc. of GECCO '18, 1475–1482. ACM Press.

Doerr, Carola and Wagner, Markus (2018). Sensitivity of parameter control mechanisms with respect to their initialization. In Proc. of PPSN '18, 360–372. Springer.

Doerr, Carola, Ye, Furong, van Rijn, Sander, Wang, Hao, and Bäck, Thomas (2018). Towards a theory-guided benchmarking suite for discrete black-box optimization heuristics: Profiling (1+λ) EA variants on OneMax and LeadingOnes. In Proc. of GECCO '18, 951–958. ACM Press.

Droste, Stefan, Jansen, Thomas, and Wegener, Ingo (2002). On the analysis of the (1+1) evolutionary algorithm. Theoretical Computer Science, 276, 51–81.

Eiben, A. E., Marchiori, Elena, and Valkó, V. A. (2004). Evolutionary algorithms with on-the-fly population size adjustment. In Proc. of PPSN '04, 41–50. Springer.

Fajardo, Mario A. Hevia (2019). An empirical evaluation of success-based parameter control mechanisms for evolutionary algorithms. In Proc. of GECCO '19, 787–795. ACM Press.

Friedrich, Tobias, Quinzan, Francesco, and Wagner, Markus (2018). Escaping large deceptive basins of attraction with heavy-tailed mutation operators. In Proc. of GECCO '18, 293–300. ACM Press.

Hajek, Bruce (1982). Hitting and occupation time bounds implied by drift analysis with applications. Advances in Applied Probability, 14, 502–525.

Hansen, Pierre and Mladenović, Nenad (2018). Variable neighborhood search. In Martí, Rafael, Pardalos, Panos M., and Resende, Mauricio G. C. (eds.), Handbook of Heuristics, 759–787. Springer.

Jansen, Thomas, Jong, Kenneth A. De, and Wegener, Ingo (2005). On the choice of the offspring population size in evolutionary algorithms. Evolutionary Computation, 13(4), 413–440.

Jansen, Thomas and Wiegand, R. Paul (2004). The cooperative coevolutionary (1+1) EA. Evolutionary Computation, 12(4), 405–434.

Lässig, Jörg and Sudholt, Dirk (2011). Adaptive population models for offspring populations and parallel evolutionary algorithms. In Proc. of FOGA '11, 181–192. ACM Press.

Lengler, Johannes (2018). A general dichotomy of evolutionary algorithms on monotone functions. In Proc. of PPSN '18, 3–15. Springer.

Lissovoi, Andrei, Oliveto, Pietro S., and Warwicker, John Alasdair (2020). Simple hyper-heuristics control the neighbourhood size of randomised local search optimally for LeadingOnes. Evolutionary Computation. In print.

Rodionova, Anna, Antonov, Kirill, Buzdalova, Arina, and Doerr, Carola (2019). Offspring population size matters when comparing evolutionary algorithms with self-adjusting mutation rates. In Proc. of GECCO '19, 855–863. ACM Press.

Rohlfshagen, Philipp, Lehre, Per Kristian, and Yao, Xin (2009). Dynamic evolutionary optimisation: an analysis of frequency and magnitude of change. In Proc. of GECCO '09, 1713–1720. ACM Press.

Rowe, Jonathan E. and Aishwaryaprajna (2019). The benefits and limitations of voting mechanisms in evolutionary optimisation. In Proc. of FOGA '19, 34–42. ACM Press.

Wegener, Ingo (2001). Methods for the analysis of evolutionary algorithms on pseudo-Boolean functions. In Sarker, Ruhul, Mohammadian, Masoud, and Yao, Xin (eds.), Evolutionary Optimization. Kluwer Academic Publishers.

Whitley, Darrell, Varadarajan, Swetha, Hirsch, Rachel, and Mukhopadhyay, Anirban (2018). Exploration and exploitation without mutation: Solving the jump function in Θ(n) time. In Proc. of PPSN '18, 55–66. Springer.

Witt, Carsten (2003). Population size vs. runtime of a simple EA. In