Standard Steady State Genetic Algorithms Can Hillclimb Faster than Mutation-only Evolutionary Algorithms
Dogan Corus
Department of Computer Science, University of Sheffield, Sheffield, UK
Pietro S. Oliveto
Department of Computer Science, University of Sheffield, Sheffield, UK
September 10, 2018
Abstract
Explaining to what extent the real power of genetic algorithms lies in the ability of crossover to recombine individuals into higher quality solutions is an important problem in evolutionary computation. In this paper we show how the interplay between mutation and crossover can make genetic algorithms hillclimb faster than their mutation-only counterparts. We devise a Markov Chain framework that allows us to rigorously prove an upper bound on the runtime of standard steady state genetic algorithms to hillclimb the OneMax function. The bound establishes that the steady-state genetic algorithms are 25% faster than all standard bit mutation-only evolutionary algorithms with static mutation rate up to lower order terms for moderate population sizes. The analysis also suggests that larger populations may be faster than populations of size 2. We present a lower bound for a greedy (2+1) GA that matches the upper bound for populations larger than 2, rigorously proving that 2 individuals cannot outperform larger population sizes under greedy selection and greedy crossover up to lower order terms. In complementary experiments the best population size is greater than 2 and the greedy genetic algorithms are faster than standard ones, further suggesting that the derived lower bound also holds for the standard steady state (2+1) GA.
Introduction

Genetic algorithms (GAs) rely on a population of individuals that simultaneously explore the search space. The main distinguishing features of GAs from other randomised search heuristics are their use of a population and of crossover to generate new solutions. Rather than slightly modifying the current best solution as in more traditional heuristics, the idea behind GAs is that new solutions are generated by recombining individuals of the current population (i.e., crossover). Such individuals are selected to reproduce probabilistically according to their fitness (i.e., reproduction). Occasionally, random mutations may slightly modify the offspring produced by crossover. The original motivation behind these mutations is to avoid that some genetic material may be lost forever, thus allowing to avoid premature convergence [1, 2]. For these reasons the GA community traditionally regards crossover as the main search operator while mutation is considered a “background operator” [2, 3, 4] or a “secondary mechanism of genetic adaptation” [1].

Explaining when and why GAs are effective has proved to be a non-trivial task. Schema theory and its resulting building block hypothesis [1] were devised to explain such working principles. However, these theories did not allow to rigorously characterise the behaviour and performance of GAs. The hypothesis was disputed when a class of functions (i.e., Royal Road), thought to be ideal for GAs, was designed and experiments revealed that the simple (1+1) EA was more efficient [5, 6].

Runtime analysis approaches have provided rigorous proofs that crossover may indeed speed up the evolutionary process of GAs in ideal conditions (i.e., if sufficient diversity is available in the population). The Jump function was introduced by Jansen and Wegener as a first example where crossover considerably improves the expected runtime compared to mutation-only Evolutionary Algorithms (EAs) [7].
The proof required an unrealistically small crossover probability to allow mutation alone to create the necessary population diversity for the crossover operator to then escape the local optimum. Dang et al. recently showed that the sufficient diversity, and even faster upper bounds on the runtime for not too large jump gaps, can be achieved also for realistic crossover probabilities by using diversity mechanisms [8]. Further examples that show the effectiveness of crossover have been given for both artificially constructed functions and standard combinatorial optimisation problems (see the next section for an overview).

Excellent hillclimbing performance of crossover based GAs has also been proved. B. Doerr et al. proposed a (1+(λ,λ)) GA which optimises the OneMax function in Θ(n √(log(n) log log log(n) / log log(n))) fitness evaluations (i.e., runtime) [9, 10]. Since the unbiased unary black box complexity of OneMax is Ω(n log n) [11], the algorithm is asymptotically faster than any unbiased mutation-only evolutionary algorithm (EA). Furthermore, the algorithm runs in linear time when the population size is self-adapted throughout the run [12]. Through this work, though, it is hard to derive conclusions on the working principles of standard GAs because these are very different compared to the (1+(λ,λ)) GA in several aspects. In particular, the (1+(λ,λ)) GA was especially designed to use crossover as a repair mechanism that follows the creation of new solutions via high mutation rates. This makes the algorithm work in a considerably different way compared to traditional GAs.

More traditional GAs have been analysed by Sudholt [13]. Concerning OneMax, he shows how (µ+λ) GAs are twice as fast as their standard bit mutation-only counterparts. As a consequence, he showed an upper bound of (e/2) n log n (1 + o(1)) function evaluations for a (2+1) GA versus the e n log n (1 − o(1)) function evaluations required by any standard bit mutation-only EA [14, 15].
This bound further reduces to 1.19 n ln n ± O(n log log n) if the optimal mutation rate is used (i.e., (1+√5)/2 · 1/n ≈ 1.618/n). However, the analysis requires that diversity is artificially enforced in the population by breaking ties always preferring genotypically different individuals. This mechanism ensures that once diversity is created on a given fitness level, it will never be lost unless a better fitness level is reached, giving ample opportunities for crossover to exploit this diversity.

Recently, it has been shown that it is not necessary to enforce diversity for standard steady state GAs to outperform standard bit mutation-only EAs [16]. In particular, the Jump function was used as an example to show how the interplay between crossover and mutation may be sufficient for the emergence of the necessary diversity to escape from local optima more quickly. Essentially, a runtime of O(n^{k−1}) may be achieved for any sublinear jump length k > 2, compared to the Θ(n^k) function evaluations required by standard bit mutation-only EAs.

In this paper, we show that this interplay between mutation and crossover may also speed up the hillclimbing capabilities of steady state GAs without the need of enforcing diversity artificially. In particular, we consider a standard steady state (µ+1) GA [17, 2, 18] and prove an upper bound on the runtime to hillclimb the OneMax function of (3/4) e n log n + O(n) for any 3 ≤ µ = o(log n / log log n) when the standard 1/n mutation rate is used. Apart from showing that standard (µ+1) GAs are faster than their standard bit mutation-only counterparts up to population sizes µ = o(log n / log log n), the framework provides two other interesting insights. Firstly, it delivers better runtime bounds for mutation rates that are higher than the standard 1/n rate. The best upper bound of 0.72 e n log n + O(n) is achieved for mutation rate c/n with c ≈ 1.
3. Secondly, the framework provides a larger upper bound, up to lower order terms, for the (2+1) GA compared to that of any 3 ≤ µ = o(log n / log log n). The reason for the larger constant in the leading term of the runtime is that, for populations of size 2, there is always a constant probability that any selected individual takes over the population in the next generation. This is not the case for population sizes larger than 2.

To shed light on the exact runtime for population size µ = 2 we present a lower bound analysis for a greedy genetic algorithm, which we call (2+1)_S GA, that always selects individuals of highest fitness for crossover and always successfully recombines them if their Hamming distance is greater than 2. This algorithm is similar to the one analysed by Sudholt [13] to allow the derivation of a lower bound, with the exception that we do not enforce any diversity artificially and that our crossover operator is slightly less greedy (i.e., in [13] crossover always recombines individuals correctly also when the Hamming distance is exactly 2). Our analysis delivers a matching lower bound for all mutation rates c/n, where c is a constant, for the greedy (2+1)_S GA (thus also (3/4) e n log n + O(n) and 0.72 e n log n + O(n) respectively for mutation rates 1/n and 1.3/n). This result rigorously proves that, under greedy selection and semi-greedy crossover, the (2+1) GA cannot outperform any (µ+1) GA with 3 ≤ µ = o(log n / log log n).

We present some experimental investigations to shed light on the questions that emerge from the theoretical work. In the experiments we consider the commonly used parent selection that chooses uniformly at random from the population with replacement (i.e., our theoretical upper bounds hold for a larger variety of parent selection operators).
We first compare the performance of the standard steady state GAs against the fastest standard bit mutation-only EA with fixed mutation rate (i.e., the (1+1) EA [14, 15]) and the GAs that have been proved to outperform it. The experiments show that the speedups over the (1+1) EA occur already for small problem sizes n and that population sizes larger than 2 are faster than the standard (2+1) GA. Furthermore, the greedy (2+1)_S GA indeed appears to be faster than the standard (2+1) GA, further suggesting that the theoretical lower bound also holds for the latter algorithm. Finally, experiments confirm that larger mutation rates than 1/n are more efficient. In particular, better runtimes are achieved for mutation rates that are even larger than the ones that minimise our theoretical upper bound (i.e., c/n with 1.5 ≤ c ≤ 1.6 rather than the c ≈ 1.3 we have derived mathematically; interestingly this experimental rate is similar to the optimal mutation rate for OneMax of the algorithm analysed in [13]). These theoretical and experimental results seem to be in line with those recently presented for the same steady state GAs for the Jump function [16, 8]: higher mutation rates than 1/n are also more effective on Jump.

The rest of the paper is structured as follows. In the next section we briefly review previous related works that consider algorithms using crossover operators. In Section 3 we give precise definitions of the steady state (µ+1) GA and of the OneMax function. In Section 4 we present the Markov Chain framework that we will use for the analysis of steady state elitist GAs. In Section 5 we apply the framework to analyse the (µ+1) GA and present the upper bound on the runtime for any 3 ≤ µ = o(log n / log log n) and mutation rate c/n for any constant c. In Section 6 we present the matching lower bound on the runtime of the greedy (2+1)_S GA. In Section 7 we present our experimental findings. In the Conclusion we present a discussion and open questions for future work.
We thank an anonymous reviewer for pointing out that this is not obvious.

Related Work
The first rigorous groundbreaking proof that crossover can considerably improve the performance of EAs was given by Jansen and Wegener for the (µ+1) GA with an unrealistically low crossover probability [7]. A series of following works on the analysis of the Jump function have made the algorithm characteristics increasingly realistic [8, 19]. Today it has been rigorously proved that the standard steady state (µ+1) GA with realistic parameter settings does not require artificial diversity enforcement to outperform its standard bit mutation-only counterpart to escape the plateau of local optima of the Jump function [16].

Proofs that crossover may make a difference between polynomial and exponential time for escaping local optima have also been available for some time [20, 6]. The authors devised example functions where, if sufficient diversity was enforced by some mechanism, then crossover could efficiently combine different individuals into an optimal solution. Mutation, on the other hand, required a long time because of the great Hamming distance between the local and global optima. The authors chose to call the artificially designed functions
Real Royal Road functions because the Royal Road functions devised to support the building block hypothesis had failed to do so [21]. The Real Royal Road functions, though, had no resemblance to the schemata structures required by the building block hypothesis.

The utility of crossover has also been proved for less artificial problems such as colouring problems inspired by the Ising model from physics [22], computing input-output sequences in finite state machines [23], shortest path problems [24], vertex cover [25] and multi-objective optimisation problems [26]. The above works show that crossover allows the algorithm to escape from local optima that have large basins of attraction for the mutation operator. Hence, they establish the usefulness of crossover as an operator to enhance the exploration capabilities of the algorithm.

The interplay between crossover and mutation may produce a speed-up also in the exploitation phase, for instance when the algorithm is hillclimbing. Research in this direction has recently appeared. The design of the (1+(λ,λ)) GA was theoretically driven to beat the Ω(n ln n) lower bound of all unary unbiased black box algorithms. Since the dynamics of the algorithm differ considerably from those of standard GAs, it is difficult to achieve more general conclusions about the performance of GAs from the analysis of the (1+(λ,λ)) GA. From this point of view the work of Sudholt is more revealing when he shows that any standard (µ+λ) GA outperforms its standard bit mutation-only counterpart for hillclimbing the OneMax function [13]. The only caveat is that the selection stage enforces diversity artificially, similarly to how Jansen and Wegener had enforced diversity for the Real Royal Road function analysis. In this paper we rigorously prove that it is not necessary to enforce diversity artificially for standard steady state GAs to outperform their standard bit mutation-only counterpart.
We will analyse the runtime (i.e., the expected number of fitness function evaluations before an optimal search point is found) of a steady state genetic algorithm with population size µ and offspring size 1 (Algorithm 1). In steady state GAs the entire population is not changed at once, but rather a part of it. In this paper we consider the most common option of creating one new solution per generation [17, 18]. Rather than restricting the algorithm to the most commonly used uniform selection of two parents, we allow more flexibility in the choice of which parent selection mechanism is used. This approach was also followed by Sudholt for the analysis of the (µ+1) GA with diversity [13]. In each generation the algorithm picks two parents from its population with replacement using a selection operator that satisfies the following condition:

∀ x, y : f(x) ≥ f(y) ⇒ Pr(select x) ≥ Pr(select y).   (1)

Algorithm 1: (µ+1) GA [17, 2, 18, 16]
  P ← µ individuals, uniformly at random from {0, 1}^n;
  repeat
    Select x, y ∈ P with replacement using an operator abiding (1);
    z ← uniform crossover(x, y) (each bit of z is taken from x or y with probability 1/2);
    Flip each bit in z with probability c/n;
    P ← P ∪ {z};
    Choose one element from P with lowest fitness and remove it from P, breaking ties at random;
  until termination condition satisfied;

The condition allows the use of most of the popular parent selection mechanisms with replacement such as fitness proportional selection, rank selection or the one commonly used in steady state GAs, i.e., uniform selection [2]. Afterwards, uniform crossover between the selected parents (i.e., each bit of the offspring is chosen from each parent with probability 1/
2) provides an offspring to which standard bit mutation (i.e., each bit is flipped with probability c/n) is applied. The best µ among the µ + 1 solutions are carried over to the next generation and ties are broken uniformly at random.

In the paper we use the standard convention for naming steady state algorithms: the (µ+1) EA differs from the (µ+1) GA by only selecting one individual per generation for reproduction and applying standard bit mutation to it (i.e., no crossover). Otherwise the two algorithms are identical.

We will analyse Algorithm 1 for the well-studied OneMax function that is defined on bit-strings x ∈ {0, 1}^n of length n and returns the number of 1-bits in the string: OneMax(x) = Σ_{i=1}^{n} x_i. Here x_i is the i-th bit of the solution x ∈ {0, 1}^n. The OneMax benchmark function is very useful to assess the hillclimbing capabilities of a search heuristic. It displays the characteristic function optimisation property that finding improving solutions becomes harder as the algorithm approaches the optimum. The problem is the same as that of identifying the hidden solution of the Mastermind game, where we assume for simplicity that the target string is the one of all 1-bits. Any other target string z ∈ {0, 1}^n may also be used without loss of generality. If a bitstring is used, then OneMax is equivalent to Mastermind with two colours [27]. This can be generalised to many colours if alphabets of greater size are used [28, 29].
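For concreteness, Algorithm 1 can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' code: it fixes uniform parent selection (which satisfies condition (1)), and the function names are ours.

```python
import random

def one_max(x):
    # OneMax(x): number of 1-bits in the bit-string x
    return sum(x)

def mu_plus_one_ga(n, mu=3, c=1.0, rng=None):
    """Steady state (mu+1) GA: uniform parent selection, uniform
    crossover and standard bit mutation with rate c/n. Returns the
    number of fitness evaluations until the optimum is found."""
    rng = rng or random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    evaluations = mu
    while not any(one_max(x) == n for x in pop):
        # Uniform parent selection with replacement (satisfies (1)).
        x, y = rng.choice(pop), rng.choice(pop)
        # Uniform crossover: each bit taken from either parent w.p. 1/2.
        z = [xi if rng.random() < 0.5 else yi for xi, yi in zip(x, y)]
        # Standard bit mutation: flip each bit with probability c/n.
        z = [bit ^ (rng.random() < c / n) for bit in z]
        pop.append(z)
        evaluations += 1
        # Remove one worst individual, breaking ties uniformly at random.
        worst = min(one_max(v) for v in pop)
        pop.pop(rng.choice([i for i, v in enumerate(pop) if one_max(v) == worst]))
    return evaluations
```

Replacing the two selected parents by a single parent and dropping the crossover line turns the sketch into the corresponding (µ+1) EA.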
The recent analysis of the (µ+1) GA for the Jump function shows that the interplay between crossover and mutation may create the diversity required for crossover to decrease the expected time to jump towards the optimum [16]. At the heart of the proof is the analysis of a random walk on the number of diverse individuals on the local optima of the function. The analysis delivers improved asymptotic expected runtimes of the (µ+1) GA over mutation-only EAs only for population sizes µ = ω(1). This happens because, for larger population sizes, it takes more time to lose diversity once created, hence crossover has more time to exploit it. For OneMax the technique delivers worse asymptotic bounds for population sizes µ = ω(1) and an O(n ln n) bound for constant population size. Hence, the techniques of [16] cannot be directly applied to show a speed-up of the (µ+1) GA over mutation-only EAs and a careful analysis of the leading constant in the runtime is necessary. In this section we present the Markov chain framework that we will use to obtain the upper bounds on the runtime of the elitist steady state GAs. We will afterwards discuss how this approach builds upon and generalises Sudholt’s approach in [13].

[Figure 1: The Markov chain for fitness level i, with transient states S_{1,i} and S_{2,i}, absorbing state S_{3,i}, and transition probabilities p_d, p_r, p_c and p_m.]

The OneMax function has n + 1 distinct fitness values. We divide the search space into the following canonical fitness levels [30, 31]: L_i = { x ∈ {0, 1}^n | OneMax(x) = i }. We say that a population is in fitness level i if and only if its best solution is in level L_i. We use a Markov chain (MC) for each fitness level i to represent the different states the population may be in before reaching the next fitness level. The MC depicted in Fig. 1 distinguishes between states where the population has no diversity (i.e., all individuals have the same genotype), hence crossover is ineffective, and states where diversity is available to be exploited by the crossover operator.
The MC has one absorbing state and two transient states. The first transient state S_{1,i} is adopted if the whole population consists of copies of the same individual at level i (i.e., all the individuals have the same genotype). The second state S_{2,i} is reached if the population consists of µ individuals in fitness level i and at least two individuals x and y are not identical. The second transient state S_{2,i} differs from the state S_{1,i} in having diversity which can be exploited by the crossover operator. S_{1,i} and S_{2,i} are mutually accessible from each other since diversity can be introduced at state S_{1,i} via mutation with some probability p_d and can be lost at state S_{2,i} with some relapse probability p_r when copies of a solution take over the population.

The absorbing state S_{3,i} is reached when a solution at a better fitness level is found, an event that happens with probability p_m when the population is at state S_{1,i} and with probability p_c when the population is at state S_{2,i}. We pessimistically assume that in S_{2,i} there is always only one single individual with a different genotype (i.e., with more than one distinct individual, p_c would be higher and p_r would be zero). Formally, when S_{3,i} is reached the population is no longer in level i because a better fitness level has been found. However, we will bound the expected time to reach the absorbing state for the next level only when the whole population has reached it (or a higher level). We do this because we assume that initially all the population is in level i when calculating the transition probabilities in the MC for each level i. This implies that bounding the expected times to reach the absorbing states of each fitness level is not sufficient to achieve an upper bound on the total expected runtime. When S_{3,i} is reached for the first time, the population only has one individual at the next fitness level or in a higher one.
Only when all the individuals have reached level i + 1 (i.e., either in state S_{1,i+1} or S_{2,i+1}) may we use the MC to bound the runtime to overcome level i + 1. Then the MC can be applied, once per fitness level, to bound the total runtime until the optimum is reached.

The main distinguishing aspect between the analysis presented herein and that of Sudholt [13] is that we take into account the possibility to transition back and forth (i.e., resp. with probability p_d and p_r) between states S_{1,i} and S_{2,i} as in standard steady state GAs (see Fig. 1). By enforcing that different genotypes on the same fitness level are kept in the population, the genetic algorithm considered in [13] has a good probability of exploiting this diversity to recombine the different individuals. In particular, once the diversity is created it will never be lost, giving many opportunities for crossover to take advantage of it. A crucial aspect is that the probability of increasing the number of ones via crossover is much higher than the probability of doing so via mutation once many 1-bits have been collected. Hence, by enforcing that once state S_{2,i} is reached it cannot be left until a higher fitness level is found, Sudholt could prove that the resulting algorithm is faster compared to only using standard bit mutation. In the standard steady state GA, instead, once the diversity is created it may subsequently be lost before crossover successfully recombines the diverse individuals. This behaviour is modelled in the MC by considering the relapse probability p_r. Hence, the algorithm spends less time in state S_{2,i} compared to the GA with diversity enforcement. Nevertheless, it will still spend some optimisation time in state S_{2,i}, where it will have a higher probability of improving its fitness by exploiting the diversity via crossover than when in state S_{1,i} (i.e., no diversity), where it has to rely on mutation only.
For this reason the algorithm will not be as fast for OneMax as the GA with enforced diversity, but will still be faster than standard bit mutation-only EAs.

An interesting consequence of the possibility of losing diversity is that populations of size greater than 2 can be beneficial. In particular, the diversity (i.e., state S_{2,i}) may be completely lost in the next step when there is only one diverse individual left in the population. When this is the case, the relapse probability p_r decreases with the population size µ because the probability of selecting the diverse individual for removal is 1/µ. Furthermore, for population size µ = 2 there is a positive probability that diversity is lost in every generation by either of the two individuals taking over, while for larger population sizes this is not the case. As a result our MC framework analysis will deliver a better upper bound for µ > 2. This interesting insight into the utility of larger populations could not be seen in the analysis of [13] because there, once the diversity is achieved, it cannot be lost.

We first concentrate on the expected absorbing time of the MC. Afterwards we will calculate the takeover time before we can transition from one MC to the next. Since it is not easy to derive the exact transition probabilities, a runtime analysis is considerably simplified by using bounds on these probabilities. The main result of this section is stated in the following theorem, which shows that we can use lower bounds on the transition probabilities moving in the direction of the absorbing state (i.e., p_m, p_d and p_c) and an upper bound on the probability of moving in the opposite direction to no diversity (i.e., p_r) to derive an upper bound on the expected absorbing time of the Markov chain. In particular, we define a Markov chain M' that uses the bounds on the exact transition probabilities and show that its expected absorbing time is greater than the absorbing time of the original chain.
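The behaviour of the chain in Fig. 1 is easy to check empirically. The following sketch (ours; the four probabilities are arbitrary example values, not quantities derived in the paper) estimates by Monte Carlo the expected number of generations to reach the absorbing state S_3 from S_1:

```python
import random

def absorbing_time(p_d, p_m, p_r, p_c, rng):
    """One run of the Fig. 1 chain started in S_1 (no diversity).
    Returns the number of steps until absorption in S_3."""
    state, steps = 1, 0
    while True:
        steps += 1
        u = rng.random()
        if state == 1:
            if u < p_m:            # improvement via mutation: absorbed
                return steps
            if u < p_m + p_d:      # diversity introduced
                state = 2
        else:
            if u < p_c:            # improvement via crossover: absorbed
                return steps
            if u < p_c + p_r:      # diversity lost (relapse)
                state = 1

rng = random.Random(1)
p_d, p_m, p_r, p_c = 0.2, 0.01, 0.1, 0.05  # example values only
runs = 20000
estimate = sum(absorbing_time(p_d, p_m, p_r, p_c, rng) for _ in range(runs)) / runs
```

Increasing p_r or decreasing p_d in such a simulation visibly increases the estimate, matching the intuition that losing diversity more often slows the algorithm down.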
Hereafter, we drop the level index i for brevity and use E[T_1] and E[T_2] instead of E[T_{1,i}] and E[T_{2,i}] (similarly, S_1 will denote state S_{1,i}).

Theorem 1.
Consider two Markov chains M and M' with the topology in Figure 1, where the transition probabilities for M are p_c, p_m, p_d, p_r and the transition probabilities for M' are p'_c, p'_m, p'_d and p'_r. Let the expected absorbing time for M be E[T] and the expected absorbing time of M' starting from state S_1 be E[T'_1]. If

• p_m < p_c,
• p'_d ≤ p_d,
• p'_r ≥ p_r,
• p'_c ≤ p_c,
• p'_m ≤ p_m,

then

E[T] ≤ E[T'_1] ≤ (p'_c + p'_r)/(p'_c p'_d + p'_c p'_m + p'_m p'_r) + 1/p'_c.

We first concentrate on the second inequality in the statement of the theorem, which will follow immediately from the next lemma. It allows us to obtain the expected absorbing time of the MC if the exact values for the transition probabilities are known. In particular, the lemma establishes the expected times E[T_1] and E[T_2] to reach the absorbing state, starting from the states S_1 and S_2 respectively.

Lemma 2.
The expected times E[T_1] and E[T_2] to reach the absorbing state, starting from states S_1 and S_2 respectively, are as follows:

E[T_1] = (p_c + p_r + p_d)/(p_c p_d + p_c p_m + p_m p_r) ≤ (p_c + p_r)/(p_c p_d + p_c p_m + p_m p_r) + 1/p_c,

E[T_2] = (p_m + p_r + p_d)/(p_c p_d + p_c p_m + p_m p_r).

Proof.
We analyse the MC and, using the law of total expectation together with the conditional probabilities, we establish the following recurrence equations:

E[T_1] = (E[T_2] + 1) p_d + p_m + (1 + E[T_1])(1 − p_d − p_m),
E[T_2] = (E[T_1] + 1) p_r + p_c + (1 + E[T_2])(1 − p_c − p_r).

We start by solving the system of equations for the Markov chain. In order to get an expression for E[T_1], we will first express E[T_2] in terms of E[T_1]:

E[T_2] = (E[T_1] + 1) p_r + p_c + (1 + E[T_2])(1 − p_c − p_r),

implying

E[T_2] = ((E[T_1] + 1) p_r − p_r + 1)/(p_c + p_r) = (p_r E[T_1] + 1)/(p_c + p_r).

We now substitute the expression for E[T_2] into the equation for E[T_1]:

E[T_1] = ((p_r E[T_1] + 1)/(p_c + p_r) + 1) p_d + p_m + (1 + E[T_1])(1 − p_d − p_m).

Hence,

E[T_1] = (p_c + p_d + p_r)/(p_c p_d + p_c p_m + p_m p_r).

E[T_1] can be bounded from above by separating the p_d term in the numerator:

E[T_1] = (p_c + p_r)/(p_c p_d + p_c p_m + p_m p_r) + p_d/(p_c p_d + p_c p_m + p_m p_r)
       ≤ (p_c + p_r)/(p_c p_d + p_c p_m + p_m p_r) + p_d/(p_c p_d)
       ≤ (p_c + p_r)/(p_c p_d + p_c p_m + p_m p_r) + 1/p_c.

If we substitute the value of E[T_1] into the above expression for E[T_2] we obtain:

E[T_2] = (p_r (p_c + p_d + p_r)/(p_c p_d + p_c p_m + p_m p_r) + 1)/(p_c + p_r)
       = (p_r (p_c + p_d + p_r) + p_c p_d + p_c p_m + p_m p_r)/((p_c + p_r)(p_c p_d + p_c p_m + p_m p_r))
       = (p_c (p_m + p_r + p_d) + p_r (p_m + p_r + p_d))/((p_c + p_r)(p_c p_d + p_c p_m + p_m p_r))
       = (p_m + p_r + p_d)/(p_c p_d + p_c p_m + p_m p_r).

Before we prove the first inequality in the statement of Theorem 1, we will derive some helper propositions. We first show that as long as the transition probability of reaching the absorbing state from the state S_2 (with diversity) is greater than that of reaching the absorbing state from the state with no diversity S_1 (i.e., p_m < p_c), then the expected absorbing time from state S_1 is at least as large as the expected time unconditional of the starting point.
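The closed-form expressions of Lemma 2 can be cross-checked numerically: solving the two recurrences as a 2×2 linear system (here by Cramer's rule, with example probabilities of our choosing) reproduces them to machine precision.

```python
def closed_form(p_d, p_m, p_r, p_c):
    # E[T1] and E[T2] as given by Lemma 2
    denom = p_c * p_d + p_c * p_m + p_m * p_r
    return (p_c + p_d + p_r) / denom, (p_m + p_d + p_r) / denom

def solve_recurrences(p_d, p_m, p_r, p_c):
    """Solve the linear system equivalent to the recurrences:
         (p_d + p_m) * T1 - p_d * T2       = 1
         -p_r * T1      + (p_c + p_r) * T2 = 1
       by Cramer's rule."""
    a11, a12 = p_d + p_m, -p_d
    a21, a22 = -p_r, p_c + p_r
    det = a11 * a22 - a12 * a21   # = p_c*p_d + p_c*p_m + p_m*p_r
    t1 = (a22 - a12) / det        # first column replaced by (1, 1)
    t2 = (a11 - a21) / det        # second column replaced by (1, 1)
    return t1, t2
```

For any choice of probabilities with p_m < p_c the same computation also exhibits E[T_1] > E[T_2], in line with Proposition 3.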
This will allow us to achieve a correct upper bound on the runtime by just bounding the absorbing time from state S_1. In particular, it allows us to pessimistically assume that the algorithm starts each new fitness level in state S_1 (i.e., there is no diversity in the population).

Proposition 3.
Consider a Markov chain with the topology given in Figure 1. Let E[T_1] and E[T_2] be the expected absorbing times starting from state S_1 and S_2 respectively. If p_m < p_c, then E[T_1] > E[T_2] and E[T], the unconditional expected absorbing time, satisfies E[T] ≤ E[T_1].

Proof. From Lemma 2,

E[T_1] = (p_c + p_d + p_r)/(p_c p_d + p_c p_m + p_m p_r),
E[T_2] = (p_m + p_d + p_r)/(p_c p_d + p_c p_m + p_m p_r).

Since the denominators in both expressions are the same, E[T_1] > E[T_2] follows from p_c + p_d + p_r > p_m + p_d + p_r, which in turn follows from p_c > p_m. The unconditional expected absorbing time is calculated as the weighted sum E[T] = p · E[T_1] + (1 − p) · E[T_2], where p is the probability that the initial state is S_1 and 1 − p is the probability that the initial state is S_2. Since E[T_1] ≥ E[T_2], the weighted sum E[T] is smaller than or equal to E[T_1].

In the following proposition we show that if we overestimate the probability of losing diversity and underestimate the probability of increasing it, then we achieve an upper bound on the expected absorbing time as long as p_m < p_c. Afterwards, in Proposition 5 we show that an upper bound on the absorbing time is also achieved if the probabilities p_c and p_m are underestimated.

Proposition 4.
Consider two Markov chains M and M' with the topology in Figure 1, where the transition probabilities for M are p_c, p_m, p_d, p_r and the transition probabilities for M' are p_c, p_m, p'_d and p'_r. Let the expected absorbing times starting from state S_1 for M and M' be E[T_1] and E[T'_1] respectively. If p'_d ≤ p_d, p'_r ≥ p_r and p_m < p_c, then E[T_1] ≤ E[T'_1].

Proof. Let r and d be non-negative slack variables such that p'_d = p_d − d and p'_r = p_r + r. We prove the claim that the absorbing times

E[T_1] = (p_c + p_d + p_r)/(p_c p_d + p_c p_m + p_m p_r),
E[T'_1] = (p_c + (p_d − d) + (p_r + r))/(p_c (p_d − d) + p_c p_m + p_m (p_r + r))

satisfy E[T'_1] − E[T_1] ≥ 0. For readability purposes let A = p_c + p_d + p_r and B = p_c p_d + p_c p_m + p_m p_r. Then,

E[T'_1] = (p_c + p_d + p_r − d + r)/(p_c p_d + p_c p_m + p_m p_r − p_c d + p_m r) = (A + (r − d))/(B − p_c d + p_m r),

E[T'_1] − E[T_1] = (A + (r − d))/(B − p_c d + p_m r) − A/B = (Br − Bd + A p_c d − A p_m r)/(B (B − p_c d + p_m r)).

Since the denominator is the product of the denominators of E[T'_1] and E[T_1], we already know that it is positive. We now show that

Br − Bd + A p_c d − A p_m r ≥ 0.

If we insert the values of A and B we obtain:

Br − Bd + A p_c d − A p_m r
= (p_c p_d + p_c p_m + p_m p_r) r − (p_c p_d + p_c p_m + p_m p_r) d + (p_c + p_d + p_r) p_c d − (p_c + p_d + p_r) p_m r
= p_c p_d r − p_c p_m d − p_m p_r d + p_c^2 d + p_r p_c d − p_d p_m r
= p_d r (p_c − p_m) + p_c d (p_c − p_m) + p_r d (p_c − p_m).

According to our assumption p_c − p_m > 0, hence all three terms are non-negative and the proposition follows.

Proposition 5.
Consider two Markov chains M and M' with the topology in Figure 1, where the transition probabilities for M are p_c, p_m, p_d, p_r and the transition probabilities for M' are p'_c, p'_m, p_d and p_r. Let the expected absorbing times starting from state S_1 for M and M' be E[T_1] and E[T'_1] respectively. If p'_c <= p_c and p'_m <= p_m, then E[T_1] <= E[T'_1].

Proof. Let c and m be non-negative slack variables such that p'_c + c = p_c and p'_m + m = p_m. Similarly to the proof of Proposition 4, we prove the claim that the absorbing times

    E[T'_1] = (p'_c + p_d + p_r) / (p'_c p_d + p'_c p'_m + p'_m p_r),
    E[T_1] = ((p'_c + c) + p_d + p_r) / ((p'_c + c) p_d + (p'_c + c)(p'_m + m) + (p'_m + m) p_r),

satisfy E[T'_1] - E[T_1] >= 0. Again for readability purposes let A = p'_c + p_d + p_r and B = p'_c p_d + p'_c p'_m + p'_m p_r. Then,

    E[T'_1] - E[T_1] = A/B - (A + c) / (B + c p_d + p'_c m + c p'_m + cm + m p_r)
                     = (A c p_d + A p'_c m + A c p'_m + A c m + A m p_r - Bc) / (B (B + c p_d + p'_c m + c p'_m + cm + m p_r)).

Since the denominator is positive, we focus on proving that the numerator N is also non-negative:

    N = A c p_d + A p'_c m + A c p'_m + A c m + A m p_r - Bc >= 0.

Substituting the actual values for A and B, we obtain the following equivalent expression:

    N = (p'_c + p_d + p_r) c p_d + (p'_c + p_d + p_r) p'_c m + (p'_c + p_d + p_r) c p'_m + (p'_c + p_d + p_r) cm + (p'_c + p_d + p_r) m p_r - (p'_c p_d + p'_c p'_m + p'_m p_r) c
      = (p_d + p_r) c p_d + (p'_c + p_d + p_r) p'_c m + p_d c p'_m + (p'_c + p_d + p_r) cm + (p'_c + p_d + p_r) m p_r.
Since all of the above terms are non-negative, the proposition follows.

The propositions use the fact that, by lower bounding p_d and upper bounding p_r, we overestimate the expected number of generations the population spends in state S_1 compared to the time spent in state S_2. Hence, if p_c > p_m, we can safely use a lower bound for p_d and an upper bound for p_r and still obtain a valid upper bound on the runtime E[T]. This is rigorously shown by combining the results of the previous propositions to prove the main result, i.e., Theorem 1.

Proof of Theorem 1.
Consider a third Markov chain M* whose transition probabilities are p_c, p_m, p'_r, p'_d. Let the absorbing time of M* starting from state S_1 be E[T*_1]. In order to prove the statement we will prove the following sequence of inequalities:

    E[T] <= E[T_1] <= E[T*_1] <= E[T'_1].

According to Proposition 3, E[T] <= E[T_1] since p_c > p_m. According to Proposition 4, E[T_1] <= E[T*_1] since p'_d <= p_d, p'_r >= p_r and p_c > p_m. Finally, according to Proposition 5, p'_c <= p_c and p'_m <= p_m imply E[T*_1] <= E[T'_1], and our proof is completed by using Lemma 2 to show that the last inequality of the statement holds.

The algorithm may skip some levels, or a new fitness level may be found before the whole population has reached the current fitness level. Hence, by summing up the expected runtimes to leave each of the n + 1 levels and the expected times for the whole population to take over each level, we obtain an upper bound on the expected runtime. The next lemma establishes an upper bound on the expected time it takes to move from the absorbing state of the previous Markov chain (S_{3,i}) to any transient state (S_{1,i+1} or S_{2,i+1}) of the next Markov chain. The lemma uses standard takeover arguments originally introduced in the first analysis of the (mu+1) EA for OneMax [32]. To achieve a tight upper bound, Witt had to carefully wait for only a fraction of the population to take over a level before the next level was discovered. In our case, the calculation of the transition probabilities of the MC is actually simplified if we wait for the whole population to take over each level. Hence, in our analysis the takeover time calculations are more similar to the first analysis of the (mu+1) EA with and without diversity mechanisms taking over the local optimum of TwoMax [33].
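The framework above (the Lemma 2 closed forms and the monotonicity claims of Propositions 3-5) can be sanity-checked numerically. The sketch below uses our own helper names and arbitrary illustrative transition probabilities; it solves the first-step equations (p_d+p_m)T_1 - p_d T_2 = 1 and -p_r T_1 + (p_r+p_c)T_2 = 1 by Cramer's rule and compares them with the closed forms:

```python
from fractions import Fraction as F
import random

def t1(pc, pm, pd, pr):
    """E[T1] from Lemma 2: expected absorbing time starting in S1 (no diversity)."""
    return (pc + pd + pr) / (pc * pd + pc * pm + pm * pr)

def t2(pc, pm, pd, pr):
    """E[T2] from Lemma 2: expected absorbing time starting in S2 (diversity)."""
    return (pm + pd + pr) / (pc * pd + pc * pm + pm * pr)

# Lemma 2 agrees with the first-step equations solved by Cramer's rule.
pc, pm, pd, pr = F(1, 2), F(1, 4), F(1, 8), F(1, 8)
det = (pd + pm) * (pr + pc) - pd * pr      # = pc*pd + pc*pm + pm*pr
assert t1(pc, pm, pd, pr) == (pr + pc + pd) / det
assert t2(pc, pm, pd, pr) == (pd + pm + pr) / det
assert t1(pc, pm, pd, pr) > t2(pc, pm, pd, pr)   # Proposition 3: pc > pm

# Propositions 4 and 5 as monotonicity statements, tested at random points.
random.seed(0)
for _ in range(10_000):
    pm_ = random.uniform(0.01, 0.2)
    pc_ = random.uniform(pm_ + 0.01, 0.5)       # keep pc > pm
    pd_, pr_ = random.uniform(0.01, 0.3), random.uniform(0.01, 0.3)
    d = random.uniform(0, pd_ - 0.005)          # p'_d = pd - d <= pd
    r = random.uniform(0, 0.2)                  # p'_r = pr + r >= pr
    c_ = random.uniform(0, pc_ - 0.005)         # p'_c <= pc
    m_ = random.uniform(0, pm_ - 0.005)         # p'_m <= pm
    assert t1(pc_, pm_, pd_ - d, pr_ + r) >= t1(pc_, pm_, pd_, pr_) - 1e-9
    assert t1(pc_ - c_, pm_ - m_, pd_, pr_) >= t1(pc_, pm_, pd_, pr_) - 1e-9
```

Exact rational arithmetic is used for the Lemma 2 comparison so that the equality check is not clouded by floating-point error.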
Lemma 6.
Let the best individual of the current population be in level i and all individuals be in level at least i - 1. Then, the expected time for the whole population to be in level at least i is O(mu log mu).

Proof. Let k be the number of individuals of the population at fitness level i. Assume that one of these k solutions is selected as a parent. If the other parent is also on level i but has a different genotype, then the Hamming distance between the parents is equal to 2d for some d in N, and the number of 1-bits in the outcome of the crossover operator is i - d plus a binomially distributed random variable with parameters 2d and 1/2. With probability at least 1/2 this variable is at least d, due to the symmetry of the binomial distribution, in which case the offspring has at least i 1-bits. If the other parent is on level i - 1, then the Hamming distance between the parents is 2d + 1 for some d, while the number of 1-bits in the outcome of the crossover operator is i - d - 1 plus a binomially distributed random variable with parameters 2d + 1 and 1/2. The probability that the first 2d trials have an outcome of at least d is at least 1/2. On top of that, for the offspring to have at least i 1-bits, it suffices that the last trial is also a success, which happens with probability 1/2. Hence, if at least one level-i solution is picked as a parent, then with probability at least 1/4 the outcome of the crossover operator has at least i 1-bits, and a solution with i or more 1-bits is added to the population. The solution will be accepted by selection unless the population has already been taken over. The probability that a solution at level i is picked as a parent is at least 2k/mu, and the probability that mutation does not flip any bit is (1 - c/n)^n >= 1/(e^c + 1). So the expected time between adding the k-th and the (k+1)-th i-level solution to the population is less than 2(e^c + 1)(mu/k). By summing over all k in {1, ..., mu - 1}, we obtain the following upper bound for the whole population to take over level i:

    sum_{k=1}^{mu-1} 2(e^c + 1)(mu/k) <= 2(e^c + 1) mu sum_{k=1}^{mu-1} 1/k <= 2(e^c + 1) mu * O(log mu) = O(mu log mu).

The lemma shows that, once a new fitness level is discovered for the first time, it takes at most O(mu log mu) generations until the whole population consists of individuals from the newly discovered fitness level or higher. While the absorption time of the Markov chain might decrease with the population size, for too large population sizes the upper bound on the expected total takeover time will dominate the runtime. As a result, the MC framework will deliver larger upper bounds on the runtime unless the expected time until the population takes over the fitness levels is asymptotically smaller than the expected absorption time of all MCs. For this reason, our results will require population sizes of mu = o(log n / log log n), to allow all fitness levels to be taken over in expected o(n log n) time, such that the latter time does not affect the leading constant of the total expected runtime.

In this section we use the Markov chain framework devised in Section 4 to prove that the (mu+1) GA is faster than any standard bit mutation-only (mu + lambda) EA. In order to satisfy the requirements of Theorem 1, we first show in Lemma 7 that p_c > p_m if the population is at one of the final n/(4c(1 + e^c)) fitness levels.
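The harmonic-sum bound from the proof of Lemma 6 can be checked numerically; the helper name below is our own, and the point of the check is only that the bound grows like Theta(mu log mu):

```python
import math

def takeover_bound(mu, c=1.0):
    """Sum, over k = 1..mu-1, of the waiting-time bound 2(e^c + 1)(mu/k)
    between the k-th and (k+1)-th level-i individual (proof of Lemma 6)."""
    return sum(2 * (math.exp(c) + 1) * mu / k for k in range(1, mu))

# The harmonic sum makes the bound Theta(mu log mu): the ratio to
# mu * ln(mu) stays bounded (and slowly approaches 2(e^c + 1) ~ 7.44).
for mu in (8, 64, 512, 4096):
    assert takeover_bound(mu) / (mu * math.log(mu)) < 10
```

This is what makes the requirement mu = o(log n / log log n) sufficient: the total takeover time O(n mu log mu) then stays in o(n log n).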
The lemma also shows that it is easy for the algorithm to reach such a fitness level. Afterwards we bound the transition probabilities of the MC in Lemma 8. We conclude the section by stating and proving the main result, essentially by applying Theorem 1 with the transition probabilities calculated in Lemma 8.

Lemma 7.
For the (mu+1) GA with mutation rate c/n for any constant c, if the population is in any fitness level i > n - n/(4c(1 + e^c)), then p_c is always larger than p_m. The expected time for the (mu+1) GA to sample a solution in fitness level n - n/(4c(1 + e^c)) for the first time is O(n mu log mu).

Proof. We first consider the probability p_c. If two individuals on the same fitness level with non-zero Hamming distance 2d are selected as parents with probability p', then the probability that the crossover operator yields an improved solution is at least (see the proof of Theorem 4 in [13]):

    Pr(X > d) = (1/2)(1 - Pr(X = d)) = (1/2)(1 - 2^{-2d} C(2d, d)) >= 1/4,   (2)

where X is a binomial random variable with parameters 2d and 1/2. Afterwards, with probability (1 - c/n)^n no bits are flipped by mutation and the absorbing state is reached. If any individual is selected twice as parent, then the improvement can only be achieved by mutation (i.e., with probability p_m), since crossover is ineffective. So p_c > p' (1/4)(1 - c/n)^n + (1 - p') p_m; hence if p_m < p' (1/4)(1 - c/n)^n + (1 - p') p_m, it follows that p_m < p_c. The condition can be simplified to p_m < (1/4)(1 - c/n)^n with simple algebraic manipulation. For large enough n, (1 - c/n)^n >= 1/(1 + e^c) and the condition reduces to p_m < 1/(4(1 + e^c)).

Since p_m < (n - i) c/n is an upper bound on the transition probability (i.e., at least one of the zero bits has to flip to increase the OneMax value), the condition is satisfied for i >= n - n/(4c(1 + e^c)). For any level i <= n - n/(4c(1 + e^c)), after the takeover of the level, which occurs in O(mu log mu) expected time, the probability of improving is at least Omega(1) due to the linear number of 0-bits that can be flipped. Hence, we can upper bound the total number of generations necessary to reach fitness level i = n - n/(4c(1 + e^c)) by O(n mu log mu).

The lemma has shown that p_c > p_m holds after a linear number of fitness levels have been traversed.
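The crossover success probability of Eq. (2) can be evaluated exactly with a short script; this is a sanity check only, and `crossover_improve_prob` is our own name:

```python
from math import comb

def crossover_improve_prob(d):
    """Pr(X > d) for X ~ Bin(2d, 1/2): the probability, used in Eq. (2),
    that uniform crossover of two parents at Hamming distance 2d on the
    same fitness level yields a strictly better offspring."""
    return (1 - comb(2 * d, d) / 4**d) / 2

# The bound Pr(X > d) >= 1/4 holds for every d >= 1, with equality at
# d = 1, and the probability increases monotonically towards 1/2.
probs = [crossover_improve_prob(d) for d in range(1, 200)]
assert probs[0] == 0.25
assert all(0.25 <= p < 0.5 for p in probs)
assert all(a < b for a, b in zip(probs, probs[1:]))
```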
Now we bound the transition probabilities of the Markov chain.

Lemma 8. Let mu >= 3. Then the transition probabilities p_d, p_c, p_r and p_m are bounded as follows:

    p_d >= (mu/(mu + 1)) * (i (n - i) c^2) / (n^2 (e^c + O(1/n))),        p_c >= (mu - 1) / (2 mu^2 (e^c + O(1/n))),
    p_r <= ((mu - 1)(2 mu - 1 + O(1/n))) / (2 e^c mu^2 (mu + 1)),         p_m >= (c (n - i)) / (n (e^c + O(1/n))).

Proof.
We first bound the probability p_d of transitioning from the state S_{1,i} to the state S_{2,i}. In order to introduce a new solution at level i with a different genotype, it is sufficient that the mutation operator simultaneously flips one of the n - i 0-bits and one of the i 1-bits, and no other bits. In state S_{1,i} all individuals are identical, hence crossover is ineffective. Moreover, when the diverse solution is created, it should stay in the population, which occurs with probability mu/(mu + 1), since one of the mu copies of the majority individual should be removed by selection instead of the offspring. So p_d can be lower bounded as follows:

    p_d >= (mu/(mu + 1)) * (i c / n) * ((n - i) c / n) * (1 - c/n)^{n-2}.

Using the inequality (1 - 1/x)^{x-1} >= 1/e >= (1 - 1/x)^x, we now bound (1 - c/n)^{n-2} as follows:

    (1 - c/n)^{n-2} >= (1 - c/n)^{n-1} >= (1 - c/n)^n = ((1 - c/n)^{(n/c)-1} (1 - c/n))^c >= ((1/e)(1 - c/n))^c >= (1/e^c)(1 - c^2/n),

where in the last step we used Bernoulli's inequality. We can further absorb the c^2/n term into an asymptotic O(1/n) term as follows:

    (1 - c/n)^{n-2} >= (1/e^c)(1 - c^2/n) >= e^{-c} - O(1/n) = 1/(e^c + O(1/n)).   (3)

The bound for p_d is then

    p_d >= (mu/(mu + 1)) * (i (n - i) c^2) / (n^2 (e^c + O(1/n))).

We now consider p_c. To transition from state S_{2,i} to S_{3,i} (i.e., p_c), it is sufficient that two genotypically different individuals are selected as parents (i.e., with probability at least 2(mu - 1)/mu^2), that crossover provides a better solution (i.e., with probability at least 1/4 according to Eq. (2)), and that mutation does not flip any bit (i.e., with probability (1 - c/n)^n >= 1/(e^c + O(1/n)) according to Eq. (3)).
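The inequality chain behind Eq. (3) is easy to verify at concrete parameter values; a quick numerical sanity check (the function name is ours):

```python
import math

def no_flip_lower_bound_ok(c, n):
    """Check (1 - c/n)^(n-2) >= e^{-c} (1 - c^2/n), the concrete form of
    the lower bound derived via Bernoulli's inequality in Eq. (3)."""
    return (1 - c / n) ** (n - 2) >= math.exp(-c) * (1 - c * c / n)

# Holds for all the constant mutation-rate parameters used in the paper.
for c in (0.5, 1.0, 1.3, 2.0):
    for n in (100, 1000, 10**5):
        assert no_flip_lower_bound_ok(c, n)
```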
Therefore, the probability is

    p_c >= ((mu - 1)/(2 mu^2)) (1 - c/n)^n >= (mu - 1) / (2 mu^2 (e^c + O(1/n))).

For calculating p_r we pessimistically assume that the Hamming distance between the individuals in the population is 2 and that there is always only one individual with a different genotype. A population in state S_{2,i}, which has diversity, goes back to state S_{1,i} when:

1. A majority individual is selected twice as parent (i.e., probability ((mu - 1)/mu)^2), mutation does not flip any bit (i.e., probability (1 - c/n)^n) and the minority individual is discarded (i.e., probability 1/(mu + 1)).

2. Two different individuals are selected as parents, crossover chooses the values of the majority individual in both bit locations where they differ (i.e., prob. 1/4) and mutation does not flip any bit (i.e., probability (1 - c/n)^n <= 1/e^c), or mutation must flip at least one specific bit (i.e., probability O(1/n)). Finally, the minority individual is discarded (i.e., probability 1/(mu + 1)).

3. A minority individual is chosen twice as parent, the mutation operator flips at least two specific bit positions (i.e., with probability O(1/n^2)) and finally the minority individual is discarded (i.e., probability 1/(mu + 1)).

Hence, the probability of losing diversity is:

    p_r <= (1/(mu + 1)) [ ((mu - 1)/mu)^2 (1 - c/n)^n + 2 (1/mu)((mu - 1)/mu)((1/4)(1 - c/n)^n + O(1/n)) + O(1/n^2) ]
        <= (2(mu - 1)^2 + (mu - 1) + 4 e^c (mu - 1) O(1/n)) / (2 e^c mu^2 (mu + 1)) + O(1/n)/(mu + 1)
        = ((mu - 1)[2(mu - 1) + 1 + 4 e^c O(1/n)] + 2 e^c mu^2 O(1/n)) / (2 e^c mu^2 (mu + 1))
        = ((mu - 1)[(2 mu - 1) + O(1/n)] + O(1/n)) / (2 e^c mu^2 (mu + 1))
        <= ((mu - 1)(2 mu - 1 + O(1/n))) / (2 e^c mu^2 (mu + 1)).

In the last inequality we absorbed the standalone O(1/n) term into the O(1/n) term inside the brackets.

The transition probability p_m from state S_{1,i} to state S_{3,i} is the probability of improvement by mutation only, because crossover is ineffective at state S_{1,i}. The number of 1-bits in the offspring increases if the mutation operator flips one of the (n - i) 0-bits (i.e., with probability c(n - i)/n) and does not flip any other bit (i.e., with probability (1 - c/n)^{n-1} >= (e^c + O(1/n))^{-1} according to Eq. (3)). Therefore, the lower bound on the probability p_m is:

    p_m >= (c (n - i)) / (n (e^c + O(1/n))).

We are finally ready to state our main result.

Theorem 9.
The expected runtime of the (mu+1) GA with mu >= 3 and mutation rate c/n, for any constant c, on OneMax is:

    E[T] <= (3 e^c n log n) / (c (3 + c)) + O(n mu log mu).

For mu = o(log n / log log n), the bound reduces to:

    E[T] <= (3 / (c (3 + c))) e^c n log n (1 + o(1)).
We use Theorem 1 to bound E[T_i], the expected time until the (mu+1) GA creates an offspring at fitness level i + 1 or above given that all individuals in its initial population are at level i. The bounds on the transition probabilities established in Lemma 8 will be set as the exact transition probabilities of another Markov chain, M', with absorbing time larger than E[T_i] (by Theorem 1). Since Theorem 1 requires that p_c > p_m, and Lemma 7 establishes that p_c > p_m holds for all fitness levels i > n - n/(4c(1 + e^c)), we will only analyse E[T_i] for n - 1 >= i > n - n/(4c(1 + e^c)). Recall that, by Lemma 7, level n - n/(4c(1 + e^c)) is reached in expected O(n mu log mu) time.

Consider the expected absorbing time E[T'_{i,1}] of the Markov chain M' with transition probabilities:

    p'_d := (mu i (n - i) c^2) / ((mu + 1) n^2 (e^c + O(1/n))),        p'_c := (mu - 1) / (2 mu^2 (e^c + O(1/n))),
    p'_r := ((mu - 1)(2 mu - 1 + O(1/n))) / (2 e^c mu^2 (mu + 1)),     p'_m := (c (n - i)) / (n (e^c + O(1/n))).

According to Theorem 1 (bounding the p'_d term of the numerator via p'_d / (p'_c p'_d + p'_c p'_m + p'_m p'_r) <= 1/p'_c):

    E[T_i] <= E[T'_{i,1}] <= (p'_c + p'_r) / (p'_c p'_d + p'_c p'_m + p'_m p'_r) + 1/p'_c.   (4)

We simplify the numerator and the denominator of the first term separately. The numerator is

    p'_c + p'_r = (mu - 1) / (2 mu^2 (e^c + O(1/n))) + ((mu - 1)(2 mu - 1 + O(1/n))) / (2 e^c mu^2 (mu + 1))
               <= ((mu - 1) / (2 mu^2 e^c)) (1 + (2 mu - 1 + O(1/n)) / (mu + 1))
               <= ((mu - 1)[3 mu + O(1/n)]) / (2 mu^2 e^c (mu + 1)).
(5)

We can also rearrange the denominator D = p'_c p'_d + p'_c p'_m + p'_m p'_r as follows:

    D = p'_c (p'_d + p'_m) + p'_m p'_r
      = ((mu - 1) / (2 mu^2 (e^c + O(1/n)))) ( (mu i (n - i) c^2) / ((mu + 1) n^2 (e^c + O(1/n))) + (c (n - i)) / (n (e^c + O(1/n))) )
        + ((c (n - i)) / (n (e^c + O(1/n)))) ((mu - 1)(2 mu - 1 + O(1/n))) / (2 e^c mu^2 (mu + 1))
      >= ((c (n - i)(mu - 1)) / (2 mu^2 (e^c + O(1/n))^2)) ( (mu i c) / ((mu + 1) n^2) + 1/n + (2 mu - 1 + O(1/n)) / (n (mu + 1)) )
      = ((c (n - i)(mu - 1)) / (2 mu^2 (e^c + O(1/n))^2)) * (mu i c + (mu + 1) n + n (2 mu - 1 + O(1/n))) / ((mu + 1) n^2)
      = (c (n - i)(mu - 1) (mu i c + n [3 mu + O(1/n)])) / (2 mu^2 (e^c + O(1/n))^2 (mu + 1) n^2).   (6)

Note that the term in square brackets is the same in both the numerator (i.e., Eq. (5)) and the denominator (i.e., Eq. (6)), including the small order terms in O(1/n) (i.e., they are identical). Let A = [3 mu + c'/n], where c' > 0 is the constant of the O(1/n) term in the upper bound on p_r in Lemma 8. We can now put the numerator and denominator together and simplify the expression:

    (p'_c + p'_r) / (p'_c (p'_d + p'_m) + p'_m p'_r)
      <= ((mu - 1) A / (2 mu^2 e^c (mu + 1))) * (2 mu^2 (e^c + O(1/n))^2 (mu + 1) n^2) / (c (n - i)(mu - 1)(mu i c + nA))
      = (A (e^c + O(1/n))^2 n^2) / (e^c c (n - i)(mu i c + nA)).

By using that (e^c + O(1/n))^2 / e^c <= e^c + O(1/n), we get:

    (p'_c + p'_r) / (p'_c (p'_d + p'_m) + p'_m p'_r)
      <= (A (e^c + O(1/n)) n^2) / (c (n - i)(mu i c + nA))
      <= (e^c A n^2) / (c (n - i)(mu i c + nA)) + O(1/n) * (A n^2) / (c (n - i)(mu i c + nA)).

The facts n - i >= 1, A = Omega(1), and mu, i, c > 0 imply nA + mu i c = Omega(n) and (A n^2) / (c (n - i)(mu i c + nA)) = O(n).
When multiplied by the O(1/n) term, we get:

    (e^c n / (c (n - i))) * (A n / (mu i c + nA)) + O(1).

By adding and subtracting mu i c to the numerator of A n / (mu i c + nA), we obtain:

    <= (e^c n / (c (n - i))) (1 - (mu i c) / (mu i c + nA)) + O(1).

Note that the multiplier outside the brackets, (e^c n) / (c (n - i)), is in the order of O(n / (n - i)). We now add and subtract mu n c to the numerator of -(mu i c) / (mu i c + nA) to create a positive additive term in the order of O(mu (n - i)/n):

    = (e^c n / (c (n - i))) (1 - (mu n c) / (mu i c + nA) + (mu (n - i) c) / (mu i c + nA)) + O(1)
    = (e^c n / (c (n - i))) (1 - (mu n c) / (mu i c + nA)) + (e^c n / (c (n - i))) * (mu (n - i) c) / (mu i c + nA) + O(1)
    = (e^c n / (c (n - i))) (1 - (mu n c) / (mu i c + nA)) + O(mu).

Since p'_c = Omega(1/mu), we can similarly absorb 1/p'_c into the O(mu) term. After the addition of the remaining term 1/p'_c from Eq. (4), we obtain a valid upper bound on E[T_i]:

    E[T_i] <= (p'_c + p'_r) / (p'_c p'_d + p'_c p'_m + p'_m p'_r) + 1/p'_c <= (e^c n / (c (n - i))) (1 - (mu n c) / (mu i c + nA)) + O(mu).

In order to bound the negative term, we will rearrange its denominator (i.e., nA + mu i c):

    n [3 mu + c'/n] + mu i c = 3 mu n + c' + mu i c = 3 mu n + c' - (n - i) mu c + mu n c < mu n (3 + c) + c'.

Altogether,

    E[T_i] <= (e^c n / (c (n - i))) (1 - (mu n c) / (mu n (3 + c) + c')) + O(mu)
           = (e^c n / (c (n - i))) (1 - (mu n c + c' c/(3 + c)) / (mu n (3 + c) + c') + (c' c/(3 + c)) / (mu n (3 + c) + c')) + O(mu)
           = (e^c n / (c (n - i))) (1 - c/(3 + c) + O(1/n)) + O(mu)
           = (e^c n / (c (n - i))) * 3/(3 + c) + O(mu).
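The per-level leading constant 3 e^c / (c(3 + c)) that emerges here can be optimised over c numerically; the sketch below (our own helper names; ternary search is valid because the function is unimodal on the interval) confirms the (3/4)e constant at c = 1 and the minimiser c = (sqrt(13) - 1)/2 ≈ 1.3028 reported after Theorem 9:

```python
import math

def leading_constant(c):
    """Leading constant of the upper bound 3 e^c / (c (3 + c)) * n log n."""
    return 3 * math.exp(c) / (c * (3 + c))

# c = 1 (standard 1/n mutation rate) gives (3/4) e.
assert abs(leading_constant(1.0) - 0.75 * math.e) < 1e-12

# Ternary search for the minimiser over (0.1, 3).
lo, hi = 0.1, 3.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if leading_constant(m1) < leading_constant(m2):
        hi = m2
    else:
        lo = m1
c_star = (lo + hi) / 2

assert abs(c_star - (math.sqrt(13) - 1) / 2) < 1e-6   # minimiser ~ 1.3028
assert abs(leading_constant(c_star) - 1.9692) < 1e-3  # best bound ~ 1.97 n log n
```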
If we add the expected time to take over each fitness level from Lemma 6 and sum over all fitness levels, the upper bound on the runtime is:

    sum_{i = n - n/(4c(1+e^c))}^{n-1} ( (e^c n / (c (n - i))) * 3/(3 + c) + O(mu) + O(mu log mu) )
      <= sum_{i=0}^{n-1} ( (e^c n / (c (n - i))) * 3/(3 + c) + O(mu log mu) )
      <= (3 e^c n log n) / (c (3 + c)) + O(n mu log mu)
      <= (3 e^c n log n) / (c (3 + c)) (1 + o(1)),

where in the last inequality we use mu = o(log n / log log n) to prove the second statement of the theorem.

The second statement of the theorem provides an upper bound of (3/4) e n log n for the standard mutation rate 1/n (i.e., c = 1) and mu = o(log n / log log n). The upper bound is minimised for c = (sqrt(13) - 1)/2. Hence, the best upper bound is delivered for a mutation rate of about 1.3/n. The resulting leading term of the upper bound is:

    E[T] <= (3 e^{(sqrt(13)-1)/2} n log n) / ( ((sqrt(13)-1)/2) (((sqrt(13)-1)/2) + 3) ) ~= 1.97 n log n.

We point out that Theorem 9 holds for any mu >=
3. Our framework provides a higher upper bound when mu = 2 compared to larger values of mu. The main difference lies in the probability p_r, as shown in the following lemma.

Lemma 10.
The transition probabilities p_m, p_r, p_c and p_d for the (2+1) GA, with mutation rate c/n and c constant, are bounded as follows:

    p_d >= (2 i (n - i) c^2) / (3 n^2 (e^c + O(1/n))),        p_c >= 1 / (8 (e^c + O(1/n))),
    p_r <= 1 / (4 e^c) + O(1/n),                              p_m >= (c (n - i)) / (n (e^c + O(1/n))).

Proof.
While the other probabilities are obtained by setting mu = 2 in the expressions from Lemma 8, the probability of losing diversity is larger for a population of size two than it is for mu >= 3.

    Algorithm 2: (2+1)_S GA
    1  P <- mu individuals, uniformly at random from {0,1}^n;
    2  repeat
    3      Choose x, y uniformly at random among P*, the individuals with the current best fitness f*;
    4      z <- uniform crossover of (x, y) (each bit taken from either parent with probability 1/2);
    5      Flip each bit in z with probability c/n;
    6      if f(z) = f* and max_{w in P*} HD(w, z) > 2 then z <- z OR argmax_{w in P*} HD(w, z);
    7      P <- P union {z};
    8      Choose one element from P with lowest fitness and remove it from P, breaking ties at random;
    9  until termination condition satisfied;

When either individual is picked twice as the parent (which occurs with probability 1/2) and mutation does not flip any bit (probability (1 - c/n)^n <= 1/e^c), a copy of a solution is introduced into the population. Moreover, a copy can also be introduced if two different genotypes are selected and then crossover picks the values of the same parent in both bit positions where the parents differ, which occurs with probability at most 1/4 for each parent. Any other event which produces a copy of one of the individuals requires flipping a constant number of specific bits, which occurs with probability O(1/n). Once a copy is added to the population, the diversity is lost if the minority solution is removed from the population, which occurs with probability 1/3. Hence,

    p_r <= (1/3) ( (1/2) e^{-c} + (1/2) ((1/2) e^{-c} + O(1/n)) ) <= 1/(4 e^c) + O(1/n).

The upper bound on p_r from Lemma 8 evaluated at mu = 2 is 1/(8 e^c), which is smaller than the bound we have just found. This is due to the assumption in the lemma that there can be only one genotype in the population at a given time which can take over the population in the next iteration. However, when mu = 2, either individual can take over the population in the next iteration.

This larger upper bound on p_r for mu = 2 leads to a larger upper bound on the runtime of E[T] <= (4.5/(c + 4.5)) * (e^c n log n)/c * (1 + o(1)) for the (2+1) GA. The calculations are omitted as they are the same as those of the proof of Theorem 9, where p_r <= 1/(4 e^c) + O(1/n) is used and mu is set to 2.

In the previous section we provided a higher upper bound for the (2+1) GA compared to the (mu+1) GA with population size greater than 2 and mu = o(log n / log log n). To rigorously prove that the (2+1) GA is indeed slower, we require a lower bound on the runtime of the algorithm that is higher than the upper bound provided in the previous section for the (mu+1) GA (mu >= 3). We consider a (2+1) GA with greedy parent selection and greedy crossover (i.e., Algorithm 2) in the sense that:

1. Parents are selected uniformly at random only among the solutions from the highest fitness level (greedy selection).

2.
If the offspring has the same fitness as its parents and its Hamming distance to any individual with equal fitness in the population is larger than 2, then the algorithm automatically performs an OR operation between the offspring and the individual of equal fitness with the largest Hamming distance, breaking ties arbitrarily, and adds the resulting offspring to the population, i.e., we pessimistically allow it to skip as many fitness levels as possible (semi-greedy crossover).

The greedy selection allows us to ignore the improvements that occur via crossover between solutions from different fitness levels. Thus, crossover is only beneficial when there are at least two different genotypes in the population at the highest fitness level discovered so far. The difference with the algorithm analysed by Sudholt [13] is that the (2+1)_S GA we consider does not use any diversity mechanism and does not automatically crossover correctly when the Hamming distance between parents is exactly 2. As a result, there is still a non-zero probability of losing diversity before a successful crossover occurs. The crossover operator of the (2+1)_S GA is less greedy than the one analysed in [13] (i.e., there crossover is automatically successful also when the Hamming distance between the parents is 2). We point out that the upper bounds on the runtime derived in the previous section also hold for the greedy (2+1)_S GA.

The Markov chain structure of Figure 1 is still representative of the states that the algorithm can be in. When there is no diversity in the population, either an improvement via mutation occurs or diversity is introduced into the population by the mutation operator.
When diversity is present, both crossover and mutation can reach a higher fitness level, while there is also a probability that the population will lose diversity by replicating one of the existing genotypes. With a population size of two, the diversity can be lost by creating a copy of either solution and removing the other one from the population during environmental selection (i.e., line 8 in Algorithm 2). With population sizes greater than two, the loss of diversity can only occur when the majority genotype (i.e., the genotype with the most copies in the population) is replicated. Building upon this, we will show that the asymptotic performance of the (2+1)_S GA for
OneMax cannot be better than that of the (mu+1) GAs for mu > 2.

Theorem 11. [14] Consider a partition of the search space into non-empty sets A_1, ..., A_m. For a search algorithm A, we say that it is in A_i or on level i if the best individual created so far is in A_i. Let the probability of A traversing from level i to level j in one step be at most u_i * gamma_{i,j} for all j > i, with sum_{j=i+1}^{m} gamma_{i,j} = 1 for all i. Assume that for all j > i and some 0 <= chi <= 1 it holds that

    gamma_{i,j} >= chi sum_{k=j}^{m} gamma_{i,k}.

Then the expected hitting time of A_m is at least

    sum_{i=1}^{m-1} Pr(A starts in A_i) (1/u_i + chi sum_{j=i+1}^{m-1} 1/u_j) >= sum_{i=1}^{m-1} Pr(A starts in A_i) chi sum_{j=i}^{m-1} 1/u_j.

Due to the greedy crossover and the greedy parent selection used in [13], the population could be represented by the trajectory of a single individual. If an offspring with lower fitness was added to the population, then the greedy parent selection never chose it. If instead a solution with equally high fitness and different genotype was created, then the algorithm immediately reduced the population to a single individual, namely the best possible outcome of crossing over the two genotypes. The main difference between the following analysis and that of [13] is that we want to take into account the possibility that the gained diversity may be lost before crossover exploits it. To this end, when individuals of equal fitness and Hamming distance 2 are created, crossover only exploits this successfully (i.e., goes to the next fitness level) with the conditional probability that crossover is successful before the diversity is lost. Otherwise, the diversity is lost. Only when individuals of Hamming distance larger than 2 are created do we allow crossover to immediately provide the best possible outcome, as in [13].

Now, we can state the main result of this section.
Theorem 12.
The expected runtime of the (2+1)_S GA with mutation probability p = c/n, for any constant c, on OneMax is no less than:

    (e^c / (c (1 + (c/3) max_{1 <= k <= n} { (np)^k / (k!)^2 }))) n log n - O(n log log n).
To prove the theorem statement we wish to apply Theorem 11. We say that the (2+1)_S GA is on level i if its current best solution has i 1-bits. We will bound the expected runtime of the (2+1)_S GA starting from fitness level l = ceil( n - min{ n/log n, 1/(p log n) } ). Given the greedy selection, the algorithm always reaches a new level in state S_1 (i.e., no diversity, see Fig. 1). We underestimate the expected runtime of the algorithm by considering as one single iteration the phases starting when state S_2 is reached and ending when state S_2 is left: either a higher fitness level is reached (i.e., the absorbing state) or the diversity is lost (i.e., back in S_1).

In order to apply Theorem 11, we need to provide an upper bound on the probability p_{i,j} of reaching fitness level j from level i. In particular, we need to show that there exist u_i and gamma_{i,j} such that p_{i,j} <= u_i * gamma_{i,j}. We first concentrate on deriving a bound on p_{i,j}. For the (2+1)_S GA with the current solution at level i >= l, let p_{i,i+k} be the probability that the algorithm reaches level j = i + k in the next iteration. We first calculate the probability when k >=
2. The (2+1)_S GA will be on level i + k for k >= 2 in the next iteration only if one of the following events occurs:

* The mutation operator flips k more 0-bits than 1-bits, which occurs with probability p_{m,k};

* The mutation operator flips exactly k 0-bits and k 1-bits, which occurs with probability p_{d,k} (because the (2+1)_S GA automatically makes the largest possible improvement achievable by crossover when a Hamming distance greater than 2 is created);
* The mutation operator flips exactly one 0-bit and one 1-bit which, leading to state S_2, initiates a phase that will end once S_2 is left. As said, such a phase will be counted as one iteration only. To achieve this, we calculate the conditional probability that level i + k is reached before the diversity is lost (i.e., before the algorithm has returned to state S_1). From S_2, level i + k may be reached in either of the following ways:

  - After a successful crossover which increases the number of 1-bits by one, the mutation operator flips k - 1 more 0-bits than 1-bits (probability p_{m,k-1});

  - The mutation operator flips k - 1 0-bits and k - 1 1-bits, creating a Hamming distance larger than 2 (probability p_{d,k-1}).

So the total probability p_{i,i+k} for k >= 2 satisfies

    p_{i,i+k} <= p_{m,k} + p_{d,k} + p_{d,1} (p_{d,k-1} + p_{m,k-1}) / (p_{d,k-1} + p_{m,k-1} + p_r)
              <= p_{m,k} + p_{d,k} + p_{d,1} (p_{d,k-1} + p_{m,k-1}) / p_r.   (7)

Here, the term multiplying p_{d,1} is the conditional probability of reaching level i + k before losing diversity, an event which occurs with probability p_r. In the conditional probability we use 1 * p_{m,k-1} instead of multiplying it by the crossover probability, which still gives a correct bound. Overall, to bound the probability p_{i,i+k} we need upper bounds on the probabilities p_{m,k}, p_{d,k}, and a lower bound on p_r.

We start with p_r. To lose diversity it is sufficient that the outcome of crossover is a copy of a parent and then that mutation does not flip any bit (probability at least (1 - p)^n). Finally, we need environmental selection to remove the different individual (probability 1/3). Either the same individual is selected twice as parent (probability 1/2) and crossover is ineffective (probability 1), or two different individuals are selected (probability 1/2) and crossover yields a copy of one of the parents (probability 1/2). Hence p_r, the probability of losing diversity, is at least

    p_r >= (1/3)(1 - p)^n ((1/2) * 1 + (1/2)(1/2)) = (1 - p)^n / 4.

We derive p_{m,k} from Lemma 2 in [14], where it is proved that, for levels i >= l, the probability that standard bit mutation with mutation rate p flips k more 0-bits than 1-bits is upper bounded by:

    p_{m,k} <= p^k (1 - p)^{n-k} ((n - i)^k / k!) (1 + 2 i (n - i) p^2 / (1 - p)^2)
            = ((1 - p)^n / k!) (p (n - i)/(1 - p))^k (1 + 2 i (n - i) p^2 / (1 - p)^2)
            <= (1 - p)^n (p (n - i)/(1 - p))^k (1 + 2 i (n - i) p^2 / (1 - p)^2)
            = (1 - p)^n (p (n - i)/(1 - p))^k (1 + O(1/log n)).   (8)

Here, the last equality follows from i < n and n - i <= n - l, which implies i (n - i) p^2 = O(1/log n). Finally, the probability p_{d,k} that the mutation operator flips exactly k 0-bits and exactly k 1-bits (and no other bits) is bounded by:

    p_{d,k} <= (1 - p)^n ((n - i)^k p^k (np)^k) / (k! k! (1 - p)^k) <= (1 - p)^n (p (n - i)/(1 - p))^k (np)^k / (k!)^2.

Now, we separately bound some terms from Eq. (7):

    p_{d,1} <= (1 - p)^n (n - i) n p^2 / (1 - p),   (9)

    p_{d,k-1} + p_{m,k-1} <= (1 - p)^n (p (n - i)/(1 - p))^{k-1} ( (np)^{k-1} / ((k-1)!)^2 + 1 + O(1/log n) ),

    p_{d,1} / p_r <= (1 - p)^n ((n - i) n p^2 / (1 - p)) * 4/(1 - p)^n = 4 np * p (n - i)/(1 - p),

    (p_{d,1} / p_r)(p_{d,k-1} + p_{m,k-1}) <= 4 np (1 - p)^n (p (n - i)/(1 - p))^k ( (np)^{k-1} / ((k-1)!)^2 + 1 + O(1/log n) ).

Therefore, the upper bound on p_{i,i+k} for k >= 2 is:

    p_{i,i+k} <= p_{m,k} + p_{d,k} + (p_{d,1} / p_r)(p_{d,k-1} + p_{m,k-1})
              <= (1 - p)^n (p (n - i)/(1 - p))^k [ 1 + O(1/log n) + (np)^k / (k!)^2 + 4 np ( (np)^{k-1} / ((k-1)!)^2 + 1 + O(1/log n) ) ].

For y := max_{1 <= k <= n} { (np)^k / (k!)^2 }, the above bound reduces to:

    p_{i,i+k} <= (1 - p)^n (p (n - i)/(1 - p))^k ( 1 + O(1/log n) + y + 4 np (y + 1) ),

where we used that np = c is a constant. We now calculate the missing term of p_{i,i+k} (i.e., k = 1).
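The quantity y = max_{1<=k<=n} {(np)^k/(k!)^2} used above is easy to evaluate for constant np = c; a small sketch (the function name is ours):

```python
from math import factorial

def y_max(c, kmax=60):
    """y = max over 1 <= k <= kmax of c^k / (k!)^2, with c = np constant.

    The terms decay super-exponentially in k, so scanning k up to 60 is
    more than enough for any constant c of interest."""
    return max(c**k / factorial(k) ** 2 for k in range(1, kmax + 1))

# For c <= 4 the maximum is attained at k = 1, so y = c.
assert y_max(1.0) == 1.0
assert y_max(2.0) == 2.0
assert y_max(4.0) == 4.0   # k = 1 and k = 2 tie: 4 = 16/4
```

In particular, for the standard mutation rate p = 1/n (c = 1), y = 1 and the bracket in the bound for p_{i,i+k} is an absolute constant.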
For k = 1, an improvement can beachieved only if one of the following events occurs: • The mutation operator flips one more 0-bit than it flips 1-bits which happens with proba-bility p m, (Eq. (8) with k = 1). • The mutation operator flips exactly one 1-bit and one 0-bit with probability p d, (Eq. (9))and a phase in S starts. Then, before the population loses its diversity either: – A successful crossover between the two solutions with different genotypes occurs andthe mutation does not flip more 1-bits than 0-bits (probability p ∗ c ), – The crossover operator yields a solution on the same fitness level with probability p s and then the mutation operator flips one more 0-bit than it flips 1-bits with probability p m, . – The crossover operator yields a solution on a worse fitness level (with probability lessthan 1 − p s ) and then the mutation operator flips two more 0-bits than it flips 1-bitswhich happens with probability p m, (Eq. (8) with k = 2).So p i,i +1 can be upper bounded as follows: p i,i +1 ≤ p m, + p d, p ∗ c + p s p m, + (1 − p s ) p m, p ∗ c + p r + p s p m, + (1 − p s ) p m, ≤ p m, + p d, p ∗ c + p m, p ∗ c + p r . (10)The second inequality is due to p s p m, + (1 − p s ) p m, ≤ p m, . Substituting the first p m, term25nd p d, we get: p i,i +1 ≤ (1 − p ) n p ( n − i )(1 − p ) (1 + O (1 / log n )) + (1 − p ) n ( n − i ) np (1 − p ) · p ∗ c + p m, p ∗ c + p r = (1 − p ) n p ( n − i )(1 − p ) (cid:18) O (1 / log n ) + np p ∗ c + p m, p ∗ c + p r (cid:19) . (11)We now bound p ∗ c . For crossover to increase the number of 1-bits by one, parent selectionmust pick two different individuals (probability 1 / / − p ) n ) or if it flips at least one of the n − i ≤ (cid:96) O (1 / log n )).So, the probability that crossover increases the number of 1-bits by one is p ∗ c ≤
12 14 (cid:18) (1 − p ) n + O (1 / log n ) (cid:19) ≤ (1 − p ) n O (1 / log n ) . The probability p m, is in the order of O (1 / log n ) because the number of 0-bits is less than (cid:96) .The term ( p ∗ c + p m, ) / ( p ∗ c + p r ) in Eq. (11) is therefore at most: p ∗ c + p m, p ∗ c + p r ≤ (1 − p ) n + O (1 / log n ) (1 − p ) n + (1 − p ) n = 13 + O (1 / log n ) . Hence, Eq. (11) is bounded as follows: p i,i +1 ≤ (1 − p ) n p ( n − i )(1 − p ) (cid:18) O (1 / log n ) + np (cid:0)
13 + O (1 / log n ) (cid:1)(cid:19) ≤ (1 − p ) n p ( n − i )(1 − p ) (cid:18) c O (1 / log n ) (cid:19) ≤ (1 − p ) n p ( n − i )(1 − p ) (cid:18) c O (1 / log n ) (cid:19) . We now can determine the parameters u i and γ i,k for the application of Theorem 11. Wedefine, u (cid:48) i := e − c ( n − i ) p (1 − p ) (cid:18) y O (1 / log n ) (cid:19) , (12) γ (cid:48) i,i + k := (cid:18) (3 + 12 c ) p ( n − i )(1 − p ) (cid:19) k − . (13)Observe that y ≥ c and for large enough n , u (cid:48) i γ (cid:48) i,i + k ≥ p i,i + k . Consider the normalised variables26 i := u (cid:48) i (cid:80) nj = i +1 γ (cid:48) i,j and γ i,j := γ (cid:48) i,j (cid:80) nj = i +1 γ (cid:48) i,j . Since u i γ i,j = u (cid:48) i γ (cid:48) i,j ≥ p i,j , it follows that u i and γ i,j satisfy their definitions in Theorem 11.Now we turn to the main condition of Theorem 11. n − i (cid:88) k = j − i γ (cid:48) i,i + k ≤ (cid:18) (3 + 12 c ) p ( n − i )(1 − p ) (cid:19) j − i − · ∞ (cid:88) k =0 (cid:18) (3 + 12 c ) p ( n − i )(1 − p ) (cid:19) k . For large enough n , (3+12 c ) p ( n − i )(1 − p ) <
1. Therefore, n − i (cid:88) k = j − i γ (cid:48) i,i + k ≤ γ (cid:48) i,j − (3+12 c ) p ( n − i )(1 − p ) ≤ γ (cid:48) i,j − O (1 / log n )Since γ (cid:48) i,j ≥ (cid:80) nk = j γ (cid:48) i,k implies γ i,j ≥ (cid:80) nk = j γ i,k , for χ = 1 − O (1 / log n ) the main conditionof Theorem 11 is satisfied.All that remains is to calculate u i . u i = u (cid:48) i n (cid:88) j = i +1 γ (cid:48) i,j ≤ u (cid:48) i n − i (cid:88) k =1 (cid:18) (3 + 12 c ) p ( n − i )(1 − p ) (cid:19) k − ≤ u (cid:48) i ∞ (cid:88) k =0 (cid:18) (3 + 12 c ) p ( n − i )(1 − p ) (cid:19) k ≤ e − c ( n − i ) p (1 − p ) (cid:18) y O (1 / log n ) (cid:19) − O (1 / log n ) , where in the first inequality we substituted γ (cid:48) i,j from Eq. (13) and in the last u (cid:48) i from Eq. (12).Overall, according to Theorem 11 the runtime is E [ T ] ≥ P r { Initialise at level (cid:96) or below } · χ · n − (cid:88) i = (cid:96) u i ≥ (1 − / log n )(1 − O (1 / log n )) (cid:18) y O (1 / log n ) (cid:19) − · e c (1 − p ) p (1 − O (1 / log n )) n − (cid:88) i = (cid:96) n − i . Finally, we combine the (1 − O (1 / log n ) terms and (1 − p ) = 1 − O (1 / log n ) into a single271 − O (1 / log n ) term and change the index of the sum: ≥ (1 − O (1 / log n )) (cid:18)
33 + y − O (1 / log n ) (cid:19) e c p n − (cid:96) (cid:88) i =1 i ≥ (1 − O (1 / log n )) (cid:18)
33 + y − O (1 / log n ) (cid:19) e c p log ( n − (cid:96) ) ≥ (1 − O (1 / log n )) (cid:18)
33 + y − O (1 / log n ) (cid:19) e c p (log n − log log n ) ≥ (cid:18)
33 + y − O (1 / log n ) (cid:19) e c p (log n − O (log log n )) ≥ e c n log nc (3 + max ≤ k ≤ n ( ( np ) k ( k !) )) − O ( n log log n ) . Note that for c ≤
4, max ≤ k ≤ n ( ( np ) k ( k !) ) ≤ pn = c . Since E [ T ] ≥ en log n for c ≥
3, for the purposeof finding the mutation rate that minimises the lower bound, we can reduce the statement of thetheorem to: 3 e c n log nc (3 + c ) − O ( n log log n ) . The theorem provides a lower bound of (3 / en log n − O ( n log log n ) for the standard mutationrate 1 /n (i.e., c = 1). The lower bound is minimised for c = (cid:0) √ − (cid:1) . Hence, the smallestlower bound is delivered for a mutation rate of about 1 . /n . The resulting lower bound is : E [ T ] ≥ e ( √ − ) n log n (cid:0) √ − (cid:1) (cid:0) (cid:0) √ − (cid:1) + 3 (cid:1) − O ( n log log n ) ≈ . n log n − O ( n log log n ) . Since the lower bound for the (2 + 1) S GA matches the upper bound for the ( µ +1) GA with µ >
2, the theorem proves that, under greedy selection and semi-greedy crossover, populations of size 2 cannot be faster than larger population sizes up to µ = o(log n/log log n). In the following section we give experimental evidence that the greedy algorithms are faster than the standard (2+1) GA, thus suggesting that the same conclusions hold also for the standard non-greedy algorithms.

The theoretical results presented in the previous sections pose some new interesting questions. On the one hand, the theory suggests that population sizes greater than 2 benefit the (µ+1) GA for hillclimbing the OneMax function. On the other hand, the best runtime bounds are obtained for a mutation rate of approximately 1.3/n, suggesting that mutation rates higher than the standard 1/n rate may improve the performance of the (µ+1) GA. In this section we present the outcome of some experimental investigations to shed further light on these questions.

Figure 2: Average runtime over 1000 independent runs versus problem size n. (Plotted algorithms: (1+1) EA; (2+1) GA; (5+1) GA; Sudholt (2+1) GA with standard and optimal mutation rates; Sudholt greedy selection + very greedy XO with optimal mutation rate [9]; self-adjusting (1+(λ,λ)) GA [9].)

In particular, we
will investigate the effects of the population size and mutation rate on the runtime of the steady-state GA for
OneMax and compare its runtime against other GAs that have been proved to be faster than mutation-only EAs in the literature.

We start with an overview of the performance of the algorithms. In Fig. 2, we plot the average runtime over 1000 independent runs of the (µ+1) GA with µ = 2 and µ = 5 (with uniform parent selection and standard 1/n mutation rate) for exponentially increasing problem sizes and compare it against the fastest standard bit mutation-only EA with static mutation rate (i.e., the (1+1) EA with 1/n mutation rate). While the algorithm using µ = 5 outperforms the µ = 2 version, they are both faster than the (1+1) EA already for small problem sizes. We also compare the algorithms against the (2+1) GA investigated by Sudholt [13], where diversity is enforced by the environmental selection always preferring distinct individuals of equal fitness, the same GA variant that was first proposed and analysed in [7]. We run the algorithm both with the standard mutation rate 1/n and with the optimal mutation rate (1+√5)/(2n). Obviously, when diversity is enforced, the algorithms are faster. Finally, we also compare the algorithms against the (1+(λ,λ)) GA with self-adjusting population sizes and Sudholt's (2+1) GA as they were compared previously in [34]. Note that in [34] (Fig. 8 therein) Sudholt's algorithm was implemented with a very greedy parent selection operator that always prefers distinct individuals on the highest fitness level for reproduction.

Figure 3: Comparison between standard selection, greedy selection and greedy selection + greedy crossover GAs. The runtime is averaged over 1000 independent runs. (Plotted algorithms: (2+1) GA; greedy selection (2+1) GA; (2+1)_S GA; Sudholt (2+1) GA; Sudholt greedy selection (2+1) GA; Sudholt greedy selection + greedy XO (2+1) GA.)

In order to decompose the effects of the greedy parent selection, the greedy crossover and the use of diversity, we conducted further experiments shown in Figure 3. Here, we see that it is indeed the enforced diversity that creates the fundamental performance difference. Moreover, the results show that the greedy selection/greedy crossover GA is slightly faster than the greedy parent selection GA, and that greedy parent selection is slightly faster than standard selection. Overall, the figure suggests that the lower bound presented in Theorem 12 is also valid for the standard (2+1) GA with uniform parent selection (i.e., no greediness). In Figure 3, it can be noted that the performance difference between the GA with greedy crossover and greedy parent selection analysed in [13] and the (2+1) GA with enforced diversity and without greedy crossover is more pronounced than the performance difference between the standard (2+1) GA analysed in Section 5 and the (2+1)_S GA analysed in Section 6. The reason behind the difference in discrepancies is that the (2+1)_S GA does not implement the greedy crossover operator when the Hamming distance is 2. We speculate that cases where the Hamming distance is just enough for the crossover to exploit it occur much more frequently than cases where a larger Hamming distance is present. As a result, the performance of the (2+1)_S GA does not deviate much from that of the standard algorithm. Table 1 presents the mean and standard deviation of the runtimes of the algorithms depicted in Figure 2 and Figure 3 over 1000 independent runs.

Table 1: Mean and standard deviation of the runtimes, over 1000 independent runs, of the algorithms depicted in Figure 2 and Figure 3. Entries whose values were lost in extraction are marked "—".

Algorithm                                    n=64 Mean (Std)   n=128 Mean (Std)   n=256 Mean (Std)   n=512 Mean (Std)
(1+1) EA                                     612.66 (208.88)   1456.81 (450.51)   3397.72 (887.07)   7804.65 (1791.44)
(2+1) GA                                     546.57 (179.61)   1271.30 (357.41)   2952.70 (727.84)   6586.60 (1378.50)
Greedy (2+1) GA                              519.93 (177.28)   1228.86 (355.23)   2854.18 (730.07)   6548.51 (1434.21)
(5+1) GA                                     529.29 (156.19)   1194.50 (281.92)   2744.80 (595.41)   6087.60 (1164.30)
Sudholt's (2+1) GA, 1/n                      —                 —                  —                  —
Sudholt's (2+1) GA, opt                      —                 —                  —                  —
(2+1)_S GA                                   484.40 (174.87)   1183.80 (366.28)   2705.09 (710.85)   6183.55 (1451.07)
Sudholt's greedy sel. + greedy XO, 1/n       —                 —                  —                  —
Sudholt's greedy sel. + greedy XO, opt       —                 —                  —                  —
Self-adjusting (1+(λ,λ)) GA                  583.91 (146.58)   1209.14 (164.96)   2478.51 (294.53)   5084.68 (462.34)

Algorithm                                    n=1024 Mean (Std)    n=2048 Mean (Std)    n=4096 Mean (Std)     n=8192 Mean (Std)
(1+1) EA                                     17267.39 (3653.38)   38636.71 (6966.43)   84286.18 (13563.25)   186012.84 (28660.69)
(2+1) GA                                     14715.00 (2876.00)   32843.00 (5574.70)   71346.00 (10810.00)   156800.00 (23357.00)
Greedy (2+1) GA                              14553.66 (2892.29)   32667.89 (6075.17)   71149.61 (11990.71)   154354.66 (23250.14)
(5+1) GA                                     13538.00 (2436.00)   29907.00 (4909.60)   65136.00 (9758.00)    139590.00 (18622.00)
Sudholt's (2+1) GA, 1/n                      —                    —                    —                     —
Sudholt's (2+1) GA, opt                      —                    —                    —                     —
(2+1)_S GA                                   14028.86 (2852.73)   31403.85 (5935.81)   68957.18 (11905.57)   151635.40 (25489.27)
Sudholt's greedy sel. + greedy XO, 1/n       —                    —                    —                     —
Sudholt's greedy sel. + greedy XO, opt       —                    —                    —                     —
Self-adjusting (1+(λ,λ)) GA                  10324.62 (695.95)    20951.38 (1157.73)   42216.53 (1862.68)    85028.97 (2703.08)

Now we investigate the effects of the population size on the (µ+1) GA. We perform 1000 independent runs of the (µ+1) GA with uniform parent selection and standard mutation rate 1/n for increasing population sizes up to µ = 16.

Figure 4: Average runtime gain of the (µ+1) GA versus the (2+1) GA for different population sizes; error bars show the standard deviation normalised by the average runtime for µ = 2.

In Fig. 4 we present average runtimes divided by the runtime of the (2+1) GA, and in Fig. 5 normalised against the runtime of the (1+1) EA. In both figures, we see that the runtime improves for µ larger than 2 and, after reaching its lowest value, increases again with the population size. It is not clear whether there is a constant optimal static value for µ around 4 or 5. The experiments, however, do not rule out the possibility that the optimal static population size increases slowly with the problem size (i.e., µ = 3 for n = 256, µ = 4 for n = 4096 and µ = 5 for n = 16384). On the other hand, we clearly see that as the problem size increases we get a larger improvement in the runtime. This indicates that the harder the problem, the more useful populations are. In particular, in Figure 5 we see that the theoretical asymptotic gain of 25% with respect to the runtime of the (1+1) EA is approached more and more closely as n increases. For the considered problem sizes, the (µ+1) GA is faster than the (1+1) EA for all tested values of µ. However, to see the runtime improvement of the (µ+1) GA against the (2+1) GA for µ > 15, the experiments (Fig. 4) suggest that greater problem sizes would need to be used.

Finally, we investigate the effect of the mutation rate on the runtime. Based on our previous experiments we set the population size to the best seen value of µ = 5 and perform 1000 independent runs for a range of values of c. In Figure 6, we see that even though the best theoretical bounds are obtained for a mutation rate of c ≈ 1.3, higher mutation rates perform better in the experiments.

Figure 5: Average runtime gain of the (µ+1) GA versus the (1+1) EA for different population sizes; error bars show the standard deviation normalised by the average runtime of the (1+1) EA.
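The steady-state algorithm evaluated in these experiments is simple enough to reproduce in a few lines. The sketch below is an illustration only, assuming uniform parent selection with replacement, uniform crossover followed by standard bit mutation with rate c/n, and environmental selection that removes an individual of worst fitness breaking ties uniformly at random; tie-breaking details of the actual experimental setup may differ.

```python
import random

def onemax(x):
    """OneMax fitness: the number of 1-bits in the bit string."""
    return sum(x)

def mu_plus_one_ga(n, mu, c=1.0, seed=None):
    """Steady-state (mu+1) GA on OneMax.

    Returns the number of offspring created before the optimum is found.
    With mu = 1 both parents coincide, crossover has no effect, and the
    algorithm essentially reduces to a (1+1) EA.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    fitness = [onemax(ind) for ind in pop]
    offspring = 0
    while max(fitness) < n:
        # Uniform parent selection with replacement.
        p1, p2 = rng.choice(pop), rng.choice(pop)
        # Uniform crossover: each bit taken from either parent equiprobably.
        child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
        # Standard bit mutation with rate c/n.
        child = [bit ^ 1 if rng.random() < c / n else bit for bit in child]
        offspring += 1
        # Environmental selection: remove one worst individual,
        # breaking ties uniformly at random.
        pop.append(child)
        fitness.append(onemax(child))
        worst = min(fitness)
        victim = rng.choice([i for i, f in enumerate(fitness) if f == worst])
        pop.pop(victim)
        fitness.pop(victim)
    return offspring
```

Averaging `mu_plus_one_ga(n, 5)` over many seeds and comparing it with `mu_plus_one_ga(n, 1)` as a (1+1) EA baseline reproduces the qualitative picture of Figure 2, with the gap growing towards the asymptotic 25% as n increases.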
The question of whether genetic algorithms can hillclimb faster than mutation-only algorithms is a long-standing one. On the one hand, in his pioneering book, Rechenberg had given preliminary experimental evidence that crossover may speed up the runtime of population-based EAs for generalised OneMax [35]. On the other hand, further experiments suggested that genetic algorithms were slower hillclimbers than the (1+1) EA [6, 21]. In recent years it has been rigorously shown that crossover and mutation can outperform algorithms using only mutation. Firstly, a new theory-driven GA called the (1+(λ,λ)) GA has been shown to be asymptotically faster for hillclimbing the OneMax function than any unbiased mutation-only EA [34]. Secondly, it has been shown how standard (µ+λ) GAs are twice as fast as their standard bit mutation-only counterparts for OneMax as long as diversity is enforced through environmental selection [13]. In this paper we have rigorously proven that standard steady-state GAs with µ ≥ 2 and µ = o(log n/log log n) are at least 25% faster than all unbiased standard bit mutation-based EAs with static mutation rate for OneMax, even if no diversity is enforced. The Markov chain framework we used to achieve the upper bounds on the runtimes should be general enough to allow future analyses of more complicated GAs, for instance with greater offspring population sizes or more sophisticated crossover operators.

Figure 6: Average runtime gain of the (5+1) GA for various mutation rates versus the standard 1/n mutation rate; error bars show the standard deviation normalised by the average runtime for the 1/n mutation rate.

A limitation of the approach is that it applies to classes of problems that have plateaus of equal fitness. Hence, for functions where each genotype has a different fitness value our approach would not apply. An open question is whether the limitation is inherent to our framework or whether it is crossover that would not help steady-state EAs at all on such fitness landscapes. Our results also explain that populations are useful not only in the exploration phase of the optimization, but also to improve exploitation during the hillclimbing phases.
In particular, largerpopulation sizes increase the probability of creating and maintaining diversity in the population.This diversity can then be exploited by the crossover operator. Recent results had already shownhow the interplay between mutation and crossover may allow the emergence of diversity, which inturn allows to escape plateaus of local optima more efficiently compared to relying on mutationalone [16]. Our work sheds further light on the picture by showing that populations, crossoverand mutation together, not only may escape optima more efficiently, but may be more effectivealso in the exploitation phase.Another additional insight gained from the analysis is that the standard mutation rate 1 /n may not be optimal for the ( µ +1) GA on OneMax . This result is also in line with, and nicelycomplements, other recent findings concerning steady state GAs. For escaping plateaus of local34ptima it has been recently shown that increasing the mutation rate above the standard 1 /n rate leads to smaller upper bounds on escaping times [16]. However, when jumping large low-fitness valleys, mutation rates of about 2 . /n seem to be optimal static rates (see the experimentsection in [36, 37]). For OneMax lower mutation rates seem to be optimal static rates, but stillconsiderably larger than the standard 1 /n rate.New interesting questions for further work have spawned. Concerning population sizes anopen problem is to rigorously prove whether the optimal size grows with the problem size andat what rate. Also determining the optimal mutation rate remains an open problem. While ourtheoretical analysis delivers the best upper bound on the runtime with a mutation rate of about1.3/ n , experiments suggest a larger optimal mutation rate. Interestingly, this experimental rateis very similar to the optimal mutation rate (i.e., approximately 1.618/ n ) of the ( µ +1) GA withenforced diversity proven in [13]. 
The benefits of higher than standard mutation rates in elitist algorithms are a topic that is gaining increasing interest [38, 39, 40]. Further improvements may be achieved by dynamically adapting the population size and mutation rate during the run. Advantages, in this sense, have been shown for the (1+(λ,λ)) GA by adapting the population size [12] and for single-individual algorithms by adapting the mutation rate [41, 42]. Generalising the results to larger classes of hillclimbing problems is intriguing. In particular, proving whether speed-ups of the (µ+1) GA compared to the (1+1) EA are also achieved for royal road functions would give a definitive answer to a long-standing question [21]. Analyses for larger problem classes such as linear functions and classical combinatorial optimisation problems would lead to further insights. Yet another natural question is how the (µ+1) GA hillclimbing capabilities compare to those of (µ+λ) GAs and generational GAs.

References

[1] D. E. Goldberg,
Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[2] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing. Springer, 2003.
[3] T. Bäck, Evolutionary Algorithms in Theory and Practice. Oxford University Press, 1996.
[4] J. H. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[5] M. Mitchell, J. Holland, and S. Forrest, “Relative building-block fitness and the building block hypothesis,” in Foundations of Genetic Algorithms (FOGA ’93), vol. 2, 1993, pp. 109–126.
[6] T. Jansen and I. Wegener, “Real royal road functions — where crossover provably is essential,” Discrete Applied Mathematics, vol. 149, no. 1-3, pp. 111–125, 2005.
[7] ——, “The analysis of evolutionary algorithms — a proof that crossover really can help,” Algorithmica, vol. 34, no. 1, pp. 47–66, 2002.
[8] D. Dang, T. Friedrich, T. Kötzing, M. S. Krejca, P. K. Lehre, P. S. Oliveto, D. Sudholt, and A. M. Sutton, “Escaping local optima with diversity mechanisms and crossover,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016). ACM, 2016, pp. 645–652.
[9] B. Doerr and C. Doerr, “A tight runtime analysis of the (1+(λ,λ)) genetic algorithm on OneMax,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2015). ACM, 2015, pp. 1423–1430.
[10] B. Doerr, C. Doerr, and F. Ebel, “From black-box complexity to designing new genetic algorithms,”
Theoretical Computer Science, vol. 567, pp. 87–104, 2015.
[11] P. K. Lehre and C. Witt, “Black-box search by unbiased variation,” Algorithmica, vol. 64, no. 4, pp. 623–642, 2012.
[12] B. Doerr and C. Doerr, “Optimal parameter choices through self-adjustment: Applying the 1/5-th rule in discrete settings,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2015). ACM, 2015, pp. 1335–1342.
[13] D. Sudholt, “How crossover speeds up building-block assembly in genetic algorithms,” Evolutionary Computation, vol. 25, no. 2, pp. 237–274, 2017.
[14] ——, “A new method for lower bounds on the running time of evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 17, no. 3, pp. 418–435, 2013.
[15] C. Witt, “Tight bounds on the optimization time of a randomized search heuristic on linear functions,” Combinatorics, Probability & Computing, vol. 22, no. 2, pp. 294–318, 2013.
[16] D. Dang, T. Friedrich, T. Kötzing, M. S. Krejca, P. K. Lehre, P. S. Oliveto, D. Sudholt, and A. M. Sutton, “Emergence of diversity and its benefits for crossover in genetic algorithms,” in International Conference on Parallel Problem Solving from Nature (PPSN XIV), 2016, pp. 890–900.
[17] J. Sarma and K. D. Jong, “Generation gap methods,” in Handbook of Evolutionary Computation, T. Bäck, D. B. Fogel, and Z. Michalewicz, Eds. IOP Publishing Ltd., 1997, ch. C2.7.
[18] J. E. Rowe, “Genetic algorithms,” in
Handbook of Computational Intelligence, W. Pedrycz and J. Kacprzyk, Eds. Springer, 2011, pp. 825–844.
[19] T. Kötzing, D. Sudholt, and M. Theile, “How crossover helps in pseudo-Boolean optimization,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2011). ACM Press, 2011, pp. 989–996.
[20] T. Storch and I. Wegener, “Real royal road functions for constant population size,” Theoretical Computer Science, vol. 320, no. 1, pp. 123–134, 2004.
[21] M. Mitchell, J. Holland, and S. Forrest, “When will a genetic algorithm outperform hill climbing?” in Neural Information Processing Systems (NIPS 6). Morgan Kaufmann, 1994, pp. 51–58.
[22] D. Sudholt, “Crossover is provably essential for the Ising model on trees,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2005). New York, NY, USA: ACM Press, 2005, pp. 1161–1167.
[23] P. K. Lehre and X. Yao, “Crossover can be constructive when computing unique input–output sequences,” Soft Computing, vol. 15, no. 9, pp. 1675–1687, 2011.
[24] B. Doerr, E. Happ, and C. Klein, “Crossover can provably be useful in evolutionary computation,” Theoretical Computer Science, vol. 425, pp. 17–33, 2012.
[25] F. Neumann, P. S. Oliveto, G. Rudolph, and D. Sudholt, “On the effectiveness of crossover for migration in parallel evolutionary algorithms,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2011). ACM Press, 2011, pp. 1587–1594.
[26] C. Qian, Y. Yu, and Z. Zhou, “An analysis on recombination in multi-objective evolutionary optimization,”
Artificial Intelligence, vol. 204, pp. 99–119, 2013.
[27] B. Doerr and C. Winzen, “Playing Mastermind with constant-size memory,” Theory of Computing Systems, vol. 55, no. 4, pp. 658–684, 2014.
[28] B. Doerr, C. Doerr, R. Spöhel, and T. Henning, “Playing Mastermind with many colors,” Journal of the ACM, vol. 63, no. 5, pp. 42:1–42:23, 2016.
[29] B. Doerr, C. Doerr, and T. Kötzing, “The right mutation strength for multi-valued decision variables,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016). ACM, 2016, pp. 1115–1122.
[30] T. Jansen, Analyzing Evolutionary Algorithms: The Computer Science Perspective. Springer, 2013.
[31] P. S. Oliveto and X. Yao, “Runtime analysis of evolutionary algorithms for discrete optimization,” in Theory of Randomized Search Heuristics: Foundations and Recent Developments, B. Doerr and A. Auger, Eds. World Scientific, 2011, pp. 21–52.
[32] C. Witt, “Runtime analysis of the (µ+1) EA on simple pseudo-Boolean functions,” Evolutionary Computation, vol. 14, no. 1, pp. 65–86, 2006.
[33] T. Friedrich, P. S. Oliveto, D. Sudholt, and C. Witt, “Analysis of diversity-preserving mechanisms for global exploration,” Evolutionary Computation, vol. 17, no. 4, pp. 455–476, 2009.
[34] B. Doerr, C. Doerr, and F. Ebel, “From black-box complexity to designing new genetic algorithms,” Theoretical Computer Science, vol. 567, pp. 87–104, 2015.
[35] I. Rechenberg,
Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Friedrich Frommann Verlag, 1973.
[36] D.-C. Dang, T. Friedrich, T. Kötzing, M. S. Krejca, P. K. Lehre, P. S. Oliveto, D. Sudholt, and A. M. Sutton, “Escaping local optima using crossover with emergent or reinforced diversity,” ArXiv, Aug. 2016.
[37] ——, “Escaping local optima using crossover with emergent diversity,” IEEE Transactions on Evolutionary Computation, pp. –, 2017, in press.
[38] P. S. Oliveto, P. K. Lehre, and F. Neumann, “Theoretical analysis of rank-based mutation — combining exploration and exploitation,” in IEEE Congress on Evolutionary Computation (CEC 2009). IEEE, 2009, pp. 1455–1462.
[39] D. Corus, P. S. Oliveto, and D. Yazdani, “On the runtime analysis of the Opt-IA artificial immune system,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2017). ACM, 2017, pp. 83–90.
[40] B. Doerr, H. P. Le, R. Makhmara, and T. D. Nguyen, “Fast genetic algorithms,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2017). ACM, 2017, pp. 777–784.
[41] B. Doerr, C. Doerr, and J. Yang, “k-bit mutation with self-adjusting k outperforms standard bit mutation,” in International Conference on Parallel Problem Solving from Nature (PPSN XIV). Springer, 2016, pp. 824–834.
[42] A. Lissovoi, P. S. Oliveto, and J. A. Warwicker, “On the runtime analysis of generalised selection hyper-heuristics for pseudo-boolean optimisation,” in