Tight Bounds on the Optimization Time of the (1+1) EA on Linear Functions
Carsten Witt
DTU Informatics, Technical University of Denmark

June 18, 2018
Abstract
The analysis of randomized search heuristics on classes of functions is fundamental for the understanding of the underlying stochastic process and the development of suitable proof techniques. Recently, remarkable progress has been made in bounding the expected optimization time of the simple (1+1) EA on the class of linear functions. We improve the best known bound in this setting from (1.39 + o(1)) en ln n to en ln n + O(n) in expectation and with high probability, which is tight up to lower-order terms. Moreover, upper and lower bounds for arbitrary mutation probabilities p are derived, which imply expected polynomial optimization time as long as p = O((ln n)/n) and which are tight if p = c/n for a constant c. As a consequence, the standard mutation probability p = 1/n is optimal for all linear functions, and the (1+1) EA is found to be an optimal mutation-based algorithm. The proofs are based on adaptive drift functions and the recent multiplicative drift theorem.

1 Introduction

The rigorous runtime analysis of randomized search heuristics, in particular of evolutionary computation, is a growing research area where many results have been obtained in recent years. This line of research started off in the early 1990's (Mühlenbein, 1992) with the consideration of very simple evolutionary algorithms such as the well-known (1+1) EA on very simple example functions such as the well-known
OneMax function. Later on, results regarding the runtime on classes of functions were derived (e.g. Droste, Jansen, and Wegener, 2002; He and Yao, 2001; Wegener and Witt, 2005a,b) and important tools for the analysis were developed. Nowadays the state of the art in the field allows for the analysis of different types of search heuristics on problems from combinatorial optimization (Neumann and Witt, 2010).

Recently, the analysis of evolutionary algorithms on linear pseudo-boolean functions has experienced a great renaissance. The first proof that the (1+1) EA optimizes any linear function in expected time O(n log n) by Droste, Jansen and Wegener (2002) was highly technical since it did not yet explicitly use the analytic framework of drift analysis (Hajek, 1982), which allowed for a considerably simplified proof of the O(n log n) bound; see He and Yao (2004) for the first complete proof using the method. Another major improvement was made by Jägersküpper (2008), who for the first time stated bounds on the implicit constant hidden in the O(n log n) term. This constant was finally improved by Doerr, Johannsen, and Winzen (2010a) to the bound (1.39 + o(1)) en ln n using a clean framework for the analysis of multiplicative drift (Doerr, Johannsen, and Winzen, 2010b). The best known lower bound for general linear functions with non-zero weights is en ln n − O(n) and was also proven by Doerr, Johannsen and Winzen (2010a), building upon the case of the OneMax function analyzed by Doerr, Fouz, and Witt (2010, 2011).

The standard (1+1) EA flips each bit with probability p = 1/n, but also different values for the mutation probability p have been studied in the literature. Recently, it has been proved by Doerr and Goldberg (2011) that the O(n log n) bound on the expected optimization time of the (1+1) EA still holds (also with high probability) if p = c/n for an arbitrary constant c. This result uses the multiplicative drift framework mentioned above and a drift function being cleverly tailored towards the particular linear function. However, the analysis is also highly technical and does not yield explicit constants in the O-term. For p = ω(1/n), no runtime analyses were known so far.

In this paper, we prove that the (1+1) EA optimizes all linear functions in expected time en ln n + O(n), thereby closing the gap between the upper and the lower bound up to terms of lower order. Moreover, we show a general upper bound depending on the mutation probability p, which implies that the expected optimization time is polynomial as long as p = O((ln n)/n) (and p = Ω(1/poly(n))). Since the expected optimization time is proved to be superpolynomial for p = ω((ln n)/n), this implies a phase transition in the regime Θ((ln n)/n). If the mutation probability is c/n for some constant c, the expected optimization time is proved to be (1 ± o(1)) (e^c/c) n ln n. Altogether, we obtain that the standard choice p = 1/n of the mutation probability is optimal for all linear functions.
This is remarkable since this seems to be the choice that is most often recommended by practitioners in evolutionary computation (Bäck, 1993). In fact, the lower bounds hold for the large class of so-called mutation-based EAs, in which the (1+1) EA with p = 1/n is found to be an optimal algorithm.

The proofs of the upper bounds use the recent multiplicative drift theorem and a drift function that is adapted towards both the linear function and the mutation probability; this adaptation is what yields the tight leading constant of the n ln n-term. All these bounds hold also with high probability, which follows from the recent tail bounds added to the multiplicative drift theorem by Doerr and Goldberg (2011). (Note, however, that not the original (1+1) EA but a variant rejecting offspring of equal fitness is studied in that paper.) The lower bounds are based on a new multiplicative drift theorem for lower bounds.

This paper is structured as follows. Section 2 sets up definitions, notations and other preliminaries. Section 3 summarizes and explains the main results. In Sections 4 and 5, respectively, we prove an upper bound for general mutation probabilities and a refined result for p = 1/n. Lower bounds are shown in Section 6. We finish with some conclusions.

2 Preliminaries

The (1+1) EA is a basic search heuristic for the optimization of pseudo-boolean functions f: {0,1}^n → R. It reflects the typical behavior of more complicated evolutionary algorithms, serves as basis for the study of more complex approaches and is therefore intensively investigated in the theory of randomized search heuristics (Auger and Doerr, 2011). For the case of minimization, it is defined as Algorithm 1.

Algorithm 1 (1+1) EA
t := 0.
choose uniformly at random an initial bit string x_0 ∈ {0,1}^n.
repeat
    create x′ by flipping each bit in x_t independently with prob. p (mutation).
    x_{t+1} := x′ if f(x′) ≤ f(x_t), and x_{t+1} := x_t otherwise (selection).
    t := t + 1.
until forever.

The (1+1) EA can be considered a simple hill-climber where search points are drawn from a stochastic neighborhood based on the mutation operator. The parameter p, where 0 < p < 1, is often chosen as 1/n, which then is called standard mutation probability. We call a mutation from x_t to x′ accepted if f(x′) ≤ f(x_t), i.e., if the new search point is taken over; otherwise we call it rejected. In our theoretical studies, we ignore the fact that the algorithm in practice will be stopped at some time. The runtime (synonymously, optimization time) of the (1+1) EA is defined as the first random point in time t such that the search point x_t has optimal, i.e., minimum f-value. This corresponds to the number of f-evaluations until reaching the optimum. In many cases, one is aiming for results on the expected optimization time. Here, we also prove results that hold with high probability (w.h.p.), which means probability 1 − o(1).

The (1+1) EA is also an instantiation of the algorithmic scheme that is called mutation-based EA by Sudholt (2010) and is displayed as Algorithm 2. It is a general population-based approach that includes many variants of evolutionary algorithms with parent and offspring populations as well as parallel evolutionary algorithms. Any mechanism for managing the populations, which are multisets, is allowed as long as the mutation operator is the only variation operator and follows the independent bit-flip property with probability 0 < p ≤ 1/2. Again the smallest t such that x_t is optimal defines the runtime. Sudholt has proved for p = 1/n that no mutation-based EA can locate a unique optimum faster than the (1+1) EA can optimize OneMax. We will see that the (1+1) EA is the best mutation-based EA on a broad class of functions, also for different mutation probabilities.
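For concreteness, the basic (1+1) EA (Algorithm 1) can be sketched in a few lines of Python. This is a minimal illustration only; the function names, the seeded random generator, and the finite iteration cap are our additions (the algorithm in the paper runs forever):

```python
import random

def one_plus_one_ea(f, n, p=None, max_iters=1_000_000, seed=0):
    """(1+1) EA minimizing a pseudo-boolean function f: {0,1}^n -> R.

    Returns (search_point, iterations). The mutation probability p
    defaults to the standard choice 1/n."""
    rng = random.Random(seed)
    if p is None:
        p = 1.0 / n
    x = [rng.randrange(2) for _ in range(n)]      # uniform initial bit string
    fx = f(x)
    for t in range(1, max_iters + 1):
        # Mutation: flip each bit independently with probability p.
        y = [bit ^ (rng.random() < p) for bit in x]
        fy = f(y)
        # Selection: keep the offspring if it is at least as good.
        if fy <= fx:
            x, fx = y, fy
        if fx == 0:                               # all-zeros optimum reached
            return x, t
    return x, max_iters

# Minimizing OneMax, i.e., the number of one-bits:
best, steps = one_plus_one_ea(sum, n=30)
```

For a linear function with positive weights w (the normalization used throughout the paper), the objective is simply `lambda x: sum(w_i * b for w_i, b in zip(w, x))`, and the all-zeros string is the optimum.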
Algorithm 2
Scheme of a mutation-based EA
for t := 0 → µ − 1 do
    create x_t ∈ {0,1}^n uniformly at random.
end for
repeat
    select a parent x ∈ {x_0, ..., x_t} according to t and f(x_0), ..., f(x_t).
    create x_{t+1} by flipping each bit in x independently with probability p ≤ 1/2.
    t := t + 1.
until forever.

Throughout this paper, we are concerned with linear pseudo-boolean functions. A function f: {0,1}^n → R is called linear if it can be written as f(x_n, ..., x_1) = w_n x_n + ··· + w_1 x_1 + w_0. As common in the analysis of the (1+1) EA, we assume w.l.o.g. that w_0 = 0 and w_n ≥ ··· ≥ w_1 > 0. We number the bits from x_n down to x_1 such that x_n, the most significant bit, is said to be on the left-hand side and x_1, the least significant bit, on the right-hand side. Since it fits the proof techniques more naturally, we assume also w.l.o.g. that the (1+1) EA (or, more generally, the mutation-based EA at hand) is minimizing f, implying that the all-zeros string is the optimum. Our assumptions do not lose generality since we can permute bits and negate the weights of a linear function without affecting the stochastic behavior of the (1+1) EA/mutation-based EA.

The probably most intensively studied linear function is OneMax(x_n, ..., x_1) = x_n + ··· + x_1, occasionally also called the CountingOnes problem (which would be the more appropriate name here since we will be minimizing the function). In this paper, we will see that on the one hand,
OneMax is not only the easiest linear function definition-wise but also in terms of expected optimization time. On the other hand, the upper bounds obtained for OneMax hold for every linear function up to lower-order terms. Hence, surprisingly, the (1+1) EA is basically as efficient on an arbitrary linear function as it is on OneMax. This underlines the robustness of the randomized search heuristic and, in retrospect and for the future, is a strong motivation to investigate the behavior of randomized search heuristics on the OneMax problem thoroughly.

Our proofs of the forthcoming upper bounds use the multiplicative drift theorem in its most recent version (cf. Doerr, Johannsen and Winzen, 2010b and Doerr and Goldberg, 2011). The key idea of multiplicative drift is to identify a time-independent relative progress called drift.
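This robustness can be observed empirically. The following simulation sketch (our own illustration, not part of the paper's argument) runs the (1+1) EA with p = 1/n on OneMax and on the binary-value function BinVal considered later, and compares the average runtimes with the reference value en(ln n + 1):

```python
import math
import random

def hitting_time(f, n, p, rng):
    """Iterations of the (1+1) EA (minimization) until the all-zeros
    optimum is reached."""
    x = [rng.randrange(2) for _ in range(n)]
    fx = f(x)
    t = 0
    while fx > 0:
        t += 1
        y = [bit ^ (rng.random() < p) for bit in x]
        fy = f(y)
        if fy <= fx:
            x, fx = y, fy
    return t

n, runs = 40, 30
onemax = sum                                             # weights w_i = 1
binval = lambda x: sum(b << i for i, b in enumerate(x))  # weights w_i = 2^(i-1)

rng = random.Random(1)
avg_onemax = sum(hitting_time(onemax, n, 1 / n, rng) for _ in range(runs)) / runs
avg_binval = sum(hitting_time(binval, n, 1 / n, rng) for _ in range(runs)) / runs
reference = math.e * n * (math.log(n) + 1)               # en(ln n + 1)
```

Both averages come out in the same ballpark as the reference value, in line with the en ln n + O(n) upper bound that holds for every linear function.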
Theorem 1 (Multiplicative Drift, Upper Bound). Let S ⊆ R be a finite set of positive numbers with minimum 1. Let {X^(t)}_{t≥0} be a sequence of random variables over S ∪ {0}. Let T be the random first point in time t ≥ 0 for which X^(t) = 0. Suppose that there exists a δ > 0 such that

E(X^(t) − X^(t+1) | X^(t) = s) ≥ δs

for all s ∈ S with Prob(X^(t) = s) > 0. Then for all s_0 ∈ S with Prob(X^(0) = s_0) > 0,

E(T | X^(0) = s_0) ≤ (ln(s_0) + 1)/δ.

Moreover, it holds that Prob(T > (ln(s_0) + t)/δ) ≤ e^{−t}.

As an easy example application, consider the (1+1) EA on OneMax and let X^(t) denote the number of one-bits at time t. As worse search points are not accepted, X^(t) is non-increasing over time. We obtain E(X^(t) − X^(t+1) | X^(t) = s) ≥ s(1/n)(1 − 1/n)^{n−1} ≥ s/(en), in other words a multiplicative drift of at least δ = 1/(en), since there are s disjoint single-bit flips that decrease the X-value by 1. Theorem 1 applied with δ = 1/(en) and ln(X^(0)) ≤ ln n gives us the upper bound en(ln n + 1) on the expected optimization time, which is the same as the classical method of fitness-based partitions (Wegener, 2001; Sudholt, 2010) or coupon collector arguments (Motwani and Raghavan, 1995) would yield.

On a general linear function, it is not necessarily a good choice to let X^(t) count the current number of one-bits. Consider, for example, the well-known function BinVal(x_n, ..., x_1) = ∑_{i=1}^n 2^{i−1} x_i. The (1+1) EA might replace the search point (1, 0, ..., 0) by the better search point (0, 1, ..., 1), which has n − 1 one-bits. In general, replacing (1, 0, ..., 0) by a better search point is equivalent to flipping the leftmost one-bit. In such a step, an expected number of (n − 1)p zero-bits flip, which decreases the expected number of zero-bits by only 1 − (n − 1)p. The latter expectation (the so-called additive drift) is only 1/n for the standard mutation probability p = 1/n and might be negative for larger p. Therefore, X^(t) is typically defined as X^(t) := g(x^(t)), where x^(t) is the current search point at time t and g(x_n, ..., x_1) is another linear function called drift function or potential function. Doerr, Johannsen and Winzen (2010b) use x_1 + ··· + x_{n/2} + (5/4)(x_{n/2+1} + ··· + x_n) as potential function in their application of the multiplicative drift theorem. This leads to a good lower bound on the multiplicative drift on the one hand and a small maximum value of X^(t) on the other hand. In our proofs of upper bounds in the Sections 4 and 5, it is crucial to define appropriate potential functions.

For the lower bounds in Section 6, we need the following variant of the multiplicative drift theorem.

Theorem 2 (Multiplicative Drift, Lower Bound). Let S ⊆ R be a finite set of positive numbers with minimum 1. Let {X^(t)}_{t≥0} be a sequence of random variables over S, where X^(t+1) ≤ X^(t) for any t ≥ 0, and let s_min > 0. Let T be the random first point in time t ≥ 0 for which X^(t) ≤ s_min. If there exist positive reals β, δ ≤ 1 such that for all s > s_min and all t ≥ 0 with Prob(X^(t) = s) > 0 it holds that

1. E(X^(t) − X^(t+1) | X^(t) = s) ≤ δs,
2. Prob(X^(t) − X^(t+1) ≥ βs | X^(t) = s) ≤ βδ/ln(s),

then for all s_0 ∈ S with Prob(X^(0) = s_0) > 0,

E(T | X^(0) = s_0) ≥ ((ln(s_0) − ln(s_min))/δ) · (1 − β)/(1 + β).

Compared to the upper bound, the lower-bound version includes a condition on the maximum stepwise progress and requires non-increasing sequences. As a technical detail, the theorem allows for a positive target s_min, which is required in our applications.
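The two ingredients of the OneMax example above, the drift lower bound δ = 1/(en) and the resulting runtime bound en(ln n + 1), are simple to check numerically (a sketch with our own variable names):

```python
import math

n = 100

# Each of the s one-bits flips alone (all other bits unchanged) with
# probability (1/n)(1 - 1/n)^(n-1); these s events are disjoint, so the
# drift is at least s times this value.
single_bit_flip = (1 / n) * (1 - 1 / n) ** (n - 1)
delta = 1 / (math.e * n)

# Theorem 1 with ln(X(0)) <= ln n then yields E(T) <= en(ln n + 1).
drift_bound = (math.log(n) + 1) / delta
```

Indeed single_bit_flip ≥ delta, since (1 − 1/n)^{n−1} ≥ 1/e, and drift_bound equals en(ln n + 1).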
3 Main Results

We now list the main consequences from the lower bounds and upper bounds that we will prove in the following sections.
Theorem 3.
On any linear function, the following holds for the expected optimization time E(T_p) of the (1+1) EA with mutation probability p.

1. If p = ω((ln n)/n) or p = o(1/poly(n)) then E(T_p) is superpolynomial.
2. If p = Ω(1/poly(n)) and p = O((ln n)/n) then E(T_p) is polynomial.
3. If p = c/n for a constant c then E(T_p) = (1 ± o(1)) (e^c/c) n ln n.
4. E(T_p) is minimized for mutation probability p = 1/n if n is large enough.
5. No mutation-based EA has an expected optimization time that is smaller than E(T_{1/n}) (up to lower-order terms).
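Statements 3 and 4 can be made tangible numerically: the leading constant e^c/c of the bound for p = c/n has its unique minimum at c = 1. A small grid check (the code and the grid are ours):

```python
import math

def leading_constant(c):
    """Leading constant e^c / c of the (1 +- o(1)) (e^c/c) n ln n runtime
    for mutation probability p = c/n."""
    return math.exp(c) / c

grid = [i / 100 for i in range(10, 301)]   # c in [0.10, 3.00]
best_c = min(grid, key=leading_constant)
```

The minimizer on the grid is c = 1.0 with value e ≈ 2.718, i.e., the standard mutation probability 1/n; for instance, c = 2 already gives a constant of e²/2 ≈ 3.69.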
In fact, our forthcoming analyses are more precise; in particular, we do not state available tails on the upper bounds above and leave them in the more general, but also more complicated Theorem 4 in Section 4. The first statement of our summarizing Theorem 3 follows from the Theorems 7, 8 and 9 in Section 6. The second statement is proven in Corollary 2, which follows from the already mentioned Theorem 4. The third statement takes together the Corollaries 1 and 3. Since e^c/c is minimized for c = 1, the fourth statement follows from the third one in conjunction with Corollary 3. The fifth statement is also contained in the Theorems 7 and 9.

It is worth noting that the optimality of p = 1/n apparently was never proven rigorously before, not even for the case of OneMax, where tight upper and lower bounds on the expected optimization time were only available for the standard mutation probability (Sudholt, 2010; Doerr, Fouz and Witt, 2011). For the general case of linear functions, the strongest previous result said that p = Θ(1/n) is optimal (Droste, Jansen and Wegener, 2002). Our result on the optimality of the mutation probability 1/n is interesting since this is the commonly recommended choice by practitioners.

4 A General Upper Bound

In this section, we show a general upper bound that applies to any non-trivial mutation probability.
Theorem 4.
On any linear function, the optimization time of the (1+1) EA with mutation probability 0 < p < 1 is at most

(α/(α − 1)) · (αnp(1 − p)^{1−n} + ln(1/p) + (n − 1) ln(1/(1 − p)) + t) / (p(1 − p)^{n−1}) =: b(t)

with probability at least 1 − e^{−t}, and it is at most b(1) in expectation, where α > 1 can be chosen arbitrarily (also depending on n).

Before we prove the theorem, we note two important consequences in more readable form. The first one (Corollary 1) displays upper bounds for mutation probabilities c/n. The second one (Corollary 2) is used in Theorem 3 above, which states a phase transition from polynomial to superpolynomial expected optimization times at mutation probability p = Θ((ln n)/n).

Corollary 1.
On any linear function, the optimization time of the (1+1) EA with mutation probability p = c/n, where c > 0 is a constant, is bounded from above by (1 + o(1))((e^c/c) n ln n) with probability 1 − o(1) and also in expectation.

(A recent technical report extending Sudholt (2010) shows the optimality of p = 1/n in the case of OneMax using a different approach, see http://arxiv.org/abs/1109.1504.)

Proof. Let α := ln ln n or any other sufficiently slowly growing function. Then α/(α − 1) = 1 + O(1/ln ln n) and α²/(α − 1) = O(ln ln n). Moreover, (1 − c/n)^{−n} = e^c(1 + o(1)). The b(t) in Theorem 4 becomes at most

e^c · (O(n ln ln n) + (1 + o(1)) n(ln(n) + ln(1/c) + t)/c),

and the corollary follows by choosing, e.g., t := ln ln n. □

Corollary 2.
On any linear function, the optimization time of the (1+1) EA with mutation probability p = O((ln n)/n) and p = Ω(1/poly(n)) is polynomial with probability 1 − o(1) and also in expectation.

Proof. Let α := 2. By making all positive terms at least 1 and multiplying them, we obtain that the upper bound b(t) from Theorem 4 is at most

8n(1 − p)^{−2n} · (ln(e/p) + t)/p.

Assume 1/p = O(poly(n)) and p ≤ c(ln n)/n for some constant c and sufficiently large n. Then (1 − p)^{−2n} ≤ e^{4pn} ≤ n^{4c} and the whole expression is polynomial for t = 1 (proving the expectation) and also if t = ln n (proving the probability 1 − o(1)). □

The proof of Theorem 4 uses an adaptive potential function as in Doerr and Goldberg (2011). That is, the random variables X^(t) used in Theorem 1 map the current search point of the (1+1) EA via a potential function to some value in a way that depends also on the linear function at hand. As a special case, if the given linear function happens to be OneMax, X^(t) just counts the number of one-bits at time t. The general construction shares some similarities with the one in Doerr and Goldberg (2011), but both construction and proof are less involved.

Proof of Theorem 4.
Let f(x) = w_n x_n + ··· + w_1 x_1 be the linear function at hand. Define

γ_i := (1 + αp(1 − p)^{1−n})^{i−1} for 1 ≤ i ≤ n,

and let g(x) = g_n x_n + ··· + g_1 x_1 be the potential function defined by g_1 := 1 = γ_1 and

g_i := min{γ_i, g_{i−1} · w_i/w_{i−1}} for 2 ≤ i ≤ n.

Note that the g_i are non-decreasing w.r.t. i. Intuitively, if the ratio of w_i and w_{i−1} is too extreme, the minimum function caps it appropriately; otherwise g_i and g_{i−1} are in the same ratio. We consider the stochastic process X^(t) := g(a^(t)), where a^(t) is the current search point of the (1+1) EA at time t. Obviously, X^(t) = 0 if and only if f has been optimized.

Let ∆_t := X^(t) − X^(t+1). We will show below that

E(∆_t | X^(t) = s) ≥ s · p · (1 − p)^{n−1} · (1 − 1/α). (∗)

The initial value satisfies

X^(0) ≤ g_n + ··· + g_1 ≤ ∑_{i=1}^n γ_i ≤ ((1 + αp(1 − p)^{1−n})^n − 1)/(αp(1 − p)^{1−n}) ≤ e^{αnp(1−p)^{1−n}}/(αp(1 − p)^{1−n}),

which means ln(X^(0)) ≤ αnp(1 − p)^{1−n} + ln(1/p) + (n − 1) ln(1/(1 − p)). The multiplicative drift theorem (Theorem 1) yields that the optimization time T is bounded from above by

(ln(X^(0)) + t)/(p(1 − p)^{n−1}(1 − 1/α)) ≤ α(αnp(1 − p)^{1−n} + ln(1/p) + (n − 1) ln(1/(1 − p)) + t)/((α − 1) p(1 − p)^{n−1}) = b(t)

with probability at least 1 − e^{−t}, and E(T) ≤ b(1), which proves the theorem.

To show (∗), we fix an arbitrary current value s and an arbitrary search point a^(t) satisfying g(a^(t)) = s. In the following, we implicitly assume X^(t) = s but mostly omit this for the sake of readability. We denote by I := {i | a^(t)_i = 1} the index set of the one-bits in a^(t) and by Z := {1, ..., n} \ I the zero-bits. We assume I ≠ ∅ since there is nothing to show otherwise. Denote by a′ the random (not necessarily accepted) offspring produced by the (1+1) EA when mutating a^(t) and by a^(t+1) the next search point after selection. Recall that a^(t+1) = a′ if and only if f(a′) ≤ f(a^(t)). In the following, we will use the event A that a^(t+1) = a′ ≠ a^(t) since obviously ∆_t = 0 otherwise. Let I* := {i ∈ I | a′_i = 0} be the random set of flipped one-bits and Z* := {i ∈ Z | a′_i = 1} be the set of flipped zero-bits in a′ (not conditioned on A). Note that I* ≠ ∅ if A occurs.

We need further definitions to analyze the drift carefully. For i ∈ I, we define k(i) := max{j ≤ i | g_j = γ_j} as the most significant position to the right of i (possibly i itself) where the potential function might be capping; note that k(i) ≥ 1 since g_1 = γ_1. Let L(i) := {k(i), ..., n} ∩ Z be the set of zero-bits left of (and including) k(i) and let R(i) := {1, ..., k(i) − 1} ∩ Z be the remaining zero-bits. Both sets may be empty. For event A to occur, it is necessary that there is some i ∈ I such that bit i flips to zero and

∑_{j ∈ I*} w_j − ∑_{j ∈ Z* ∩ L(i)} w_j ≥ 0.

For i ∈ I, let A_i be the event that

1. i is the leftmost flipping one-bit (i.e., i ∈ I* and {i + 1, ..., n} ∩ I* = ∅) and
2. ∑_{j ∈ I*} w_j − ∑_{j ∈ Z* ∩ L(i)} w_j ≥ 0.

Unless some A_i occurs, ∆_t = 0. Furthermore, the A_i are mutually disjoint. For any i ∈ I, ∆_t can be written as the sum of the two terms

∆L(i) := ∑_{j ∈ I*} g_j − ∑_{j ∈ Z* ∩ L(i)} g_j and ∆R(i) := − ∑_{j ∈ Z* ∩ R(i)} g_j.

By the law of total probability and the linearity of expectation, we have

E(∆_t) = ∑_{i ∈ I} (E(∆L(i) | A_i) · Prob(A_i) + E(∆R(i) | A_i) · Prob(A_i)). (∗∗)

In the following, the bits in R(i) are pessimistically assumed to flip to 1 independently with probability p each if A_i happens. This leads to E(∆R(i) | A_i) ≥ −p ∑_{j ∈ R(i)} g_j.

In order to estimate E(∆L(i)), we carefully inspect the relation between the weights of the original function and the potential function. By definition, we obtain g_j/g_{k(i)} = w_j/w_{k(i)} for k(i) ≤ j ≤ i and g_j/g_{k(i)} ≤ w_j/w_{k(i)} for j > i, whereas g_j/g_{k(i)} ≥ w_j/w_{k(i)} for j < k(i). Hence, if A_i occurs then g_j ≥ g_{k(i)} · w_j/w_{k(i)} for j ∈ I* (since i is the leftmost flipping one-bit) whereas g_j ≤ g_{k(i)} · w_j/w_{k(i)} for j ∈ L(i). Together, we obtain under A_i the nonnegativity of the random variable ∆L(i):

∆L(i) | A_i = ∑_{j ∈ I* | A_i} g_j − ∑_{j ∈ (Z* ∩ L(i)) | A_i} g_j ≥ ∑_{j ∈ I* | A_i} g_{k(i)} · w_j/w_{k(i)} − ∑_{j ∈ (Z* ∩ L(i)) | A_i} g_{k(i)} · w_j/w_{k(i)} ≥ 0.

Now let S_i := {|Z* ∩ L(i)| = 0} be the event that no zero-bit from L(i) flips. Using the law of total probability, we obtain that

E(∆L(i) | A_i) · Prob(A_i) = E(∆L(i) | A_i ∩ S_i) · Prob(A_i ∩ S_i) + E(∆L(i) | A_i ∩ ¬S_i) · Prob(A_i ∩ ¬S_i).

Since ∆L(i) | A_i ≥ 0, the conditional expectations are non-negative. We bound the second term on the right-hand side by 0. In conjunction with (∗∗), we get

E(∆_t) ≥ ∑_{i ∈ I} (E(∆L(i) | A_i ∩ S_i) · Prob(A_i ∩ S_i) + E(∆R(i) | A_i) · Prob(A_i)).

Clearly, E(∆L(i) | A_i ∩ S_i) ≥ g_i. We estimate Prob(A_i ∩ S_i) ≥ p(1 − p)^{n−1} since it is sufficient to flip only bit i, and Prob(A_i) ≤ p since it is necessary to flip this bit. Further above, we have bounded E(∆R(i) | A_i). Taking everything together, we get

E(∆_t) ≥ ∑_{i ∈ I} (p(1 − p)^{n−1} g_i − p² ∑_{j ∈ R(i)} g_j) ≥ ∑_{i ∈ I} (p(1 − p)^{n−1} (g_i/g_{k(i)}) γ_{k(i)} − p² ∑_{j=1}^{k(i)−1} γ_j).

The term for i equals

p(1 − p)^{n−1} (g_i/g_{k(i)}) (1 + αp(1 − p)^{1−n})^{k(i)−1} − p² · ((1 + αp(1 − p)^{1−n})^{k(i)−1} − 1)/(αp(1 − p)^{1−n})
≥ (1 − 1/α) p(1 − p)^{n−1} (g_i/g_{k(i)}) (1 + αp(1 − p)^{1−n})^{k(i)−1} = (1 − 1/α) p(1 − p)^{n−1} g_i,

where the inequality uses g_i ≥ g_{k(i)}. Hence,

E(∆_t) ≥ ∑_{i ∈ I} (1 − 1/α) p(1 − p)^{n−1} g_i = (1 − 1/α) p(1 − p)^{n−1} g(a^(t)),

which proves (∗) and, therefore, the theorem. □

5 An Upper Bound for p = 1/n

In this section, we consider the standard mutation probability p = 1/n and refine the result from Corollary 1. More precisely, we obtain that the lower-order terms are O(n). The proof will be shorter and uses a simpler potential function.

Theorem 5.
On any linear function, the expected optimization time of the (1+1) EA with p = 1/n is at most en ln n + 2en + O(1), and the probability that the optimization time exceeds en ln n + (1 + t)en + O(1) is at most e^{−t}.

Proof.
Let f(x) = w_n x_n + ··· + w_1 x_1 be the linear function at hand and let g(x) = g_n x_n + ··· + g_1 x_1 be the potential function defined by

g_i := (1 + 1/(n − 1))^{min{j ≤ i | w_j = w_i} − 1}.

Note that g_i = (1 + 1/(n − 1))^{i−1} for all i if and only if the w_i are mutually distinct. We consider the stochastic process X^(t) := g(a^(t)), where a^(t) is the current search point of the (1+1) EA at time t. Obviously, X^(t) = 0 if and only if f has been optimized. Let ∆_t := X^(t) − X^(t+1). In a case analysis (partly inspired by Doerr, Johannsen and Winzen, 2010b), we will show below for n ≥ 4 that E(∆_t | X^(t) = s) ≥ s/(en).

The initial value satisfies

X^(0) ≤ g_n + ··· + g_1 ≤ ∑_{i=0}^{n−1} (1 + 1/(n − 1))^i = ((1 + 1/(n − 1))^n − 1)(n − 1) ≤ (n − 1)(1 + 1/(n − 1))(1 + 1/(n − 1))^{n−1} ≤ en,

where we have used (1 + 1/(n − 1))^{n−1} ≤ e. Hence, ln(X^(0)) ≤ (ln n) + 1. Assuming n ≥ 4, Theorem 1 yields E(T) ≤ en(ln(n) + 2) and Prob(T > en((ln n) + t + 1)) ≤ e^{−t} regardless of the starting point, from which the theorem follows.

The case analysis fixes an arbitrary current search point a^(t). We denote by I := {i | a^(t)_i = 1} the index set of its one-bits and by Z := {1, ..., n} \ I its zero-bits. We assume I ≠ ∅ since there is nothing to show otherwise. Denote by a′ the random (not necessarily accepted) offspring produced by the (1+1) EA when mutating a^(t) and by a^(t+1) the next search point after selection. Recall that a^(t+1) = a′ if and only if f(a′) ≤ f(a^(t)). In what follows, we will often condition on the event A that a^(t+1) = a′ ≠ a^(t) holds since ∆_t = 0 otherwise. Let I* := {i ∈ I | a′_i = 0} be the set of flipped one-bits and Z* := {i ∈ Z | a′_i = 1} be the set of flipped zero-bits. Note that I* ≠ ∅ if A occurs.

Case 1: Event S_1 := {|I*| ≥ 2} ∩ A occurs. Under this condition, each zero-bit in a^(t) has been flipped to 1 in a^(t+1) with probability at most 1/n. Since g_i ≥ 1 for 1 ≤ i ≤ n, we have

E(∆_t | S_1) ≥ |I*| − (1/n) ∑_{i ∉ I} g_i ≥ 2 − (1/n) ∑_{i=1}^n (1 + 1/(n − 1))^{i−1} = 2 − ((1 + 1/(n − 1))^n − 1)(n − 1)/n ≥ 2 − (e − (1 − 1/n)) ≥ 0

for n ≥ 4, where we have used 1 + 1/(n − 1) = 1/(1 − 1/n). Hence, we pessimistically assume E(∆_t | S_1) = 0.

Case 2: Event S_2 := {|I*| = 1} ∩ A occurs. Let i* be the single element of I* and note that this is a random variable.

Subcase 2.1: S_{2.1} := {|I*| = 1} ∩ {Z* = ∅} ∩ A occurs. Since {|I*| = 1} and {Z* = ∅} together imply A, the index i* of the flipped one-bit is uniform over I. Hence, E(∆_t | S_{2.1}) = ∑_{i ∈ I} g_i/|I|. Moreover, Prob(S_{2.1}) ≥ |I|(1/n)(1 − 1/n)^{n−1} ≥ |I|/(en), implying E(∆_t | S_{2.1}) · Prob(S_{2.1}) ≥ g(a^(t))/(en) = X^(t)/(en). If we can show that E(∆_t | {|I*| = 1} ∩ {|Z*| ≥ 1} ∩ A) ≥ 0, which will be proven in Subcase 2.2 below, then E(∆_t | X^(t) = s) ≥ s/(en) follows by the law of total probability and the proof is complete.

Subcase 2.2: S_{2.2} := {|I*| = 1} ∩ {|Z*| ≥ 1} ∩ A occurs. Let j* := max{j | j ∈ Z*} be the index of the leftmost flipping zero-bit, and note that also j* is random. Since we work under |I*| = 1 and the w_j are monotone increasing w.r.t. j, it is necessary for A to occur that w_{j*} ≤ w_{i*} holds.

Subcase 2.2.1: S_{2.2.1} := {|I*| = 1} ∩ {|Z*| ≥ 1} ∩ {j* > i*} ∩ A occurs. Then w_{j*} = w_{i*} and |Z*| = 1 must hold. In this case, g_{j*} = g_{i*} by the definition of g and E(∆_t | S_{2.2.1}) = 0 follows immediately.

Subcase 2.2.2: S_{2.2.2} := {|I*| = 1} ∩ {|Z*| ≥ 1} ∩ {j* < i*} ∩ A occurs. If w_{j*} = w_{i*} then |Z*| = 1 must hold for A to occur, and zero drift follows as in the previous subcase. Now let us assume w_{j*} < w_{i*} and thus g_{j*} < g_{i*}. For notational convenience, we redefine i* := min{i | w_i = w_{i*}}. We consider Z_r := Z ∩ {1, ..., i* − 1}, the set of potentially flipping zero-bits right of i*, denote k := |Z_r| and note that in the worst case, Z_r = {i* − 1, ..., i* − k} as the g_i are non-decreasing. By using p̃ := Prob(Z* ∩ Z_r ≠ ∅) = 1 − (1 − 1/n)^k and the definition of conditional probabilities, we obtain under S_{2.2.2} that every bit from Z_r is flipped (not necessarily independently) with probability at most (1/n)/p̃ = 1/(n(1 − (1 − 1/n)^k)). We now assume that all the corresponding a′ are accepted. This is pessimistic for the following reasons: Consider a rejected a′. If |Z*| = 1 then our prerequisite j* < i* and the monotonicity of the g_i imply a negative ∆_t-value. If |Z*| > 1, a negative ∆_t-value is due to the fact that g_i < g_{i−1} + g_{i−2} for 3 ≤ i ≤ n. Hence, using the linearity of expectation, we get

E(∆_t | S_{2.2.2}) ≥ g_{i*} − (1/(np̃)) · ∑_{j ∈ Z_r} g_j ≥ g_{i*} − ∑_{j=1}^k g_{i*−j}/(n(1 − (1 − 1/n)^k))
≥ (1 + 1/(n − 1))^{i*−1} − ∑_{j=0}^{k−1} (1 + 1/(n − 1))^{i*−1−j}/(n(1 − (1 − 1/n)^k))
= (1 + 1/(n − 1))^{i*−k} ((1 + 1/(n − 1))^{k−1} − ((1 + 1/(n − 1))^k − 1)(n − 1)/(n(1 − (1 − 1/n)^k))) = 0,

where the last equality follows since 1 + 1/(n − 1) = (1 − 1/n)^{−1} and

((1 + 1/(n − 1))^k − 1)(n − 1)/(n(1 − (1 − 1/n)^k)) = (1 − 1/n)((1 − 1/n)^{−k} − 1)/(1 − (1 − 1/n)^k) = (1 − 1/n)^{1−k}.

This completes the proof. □

6 Lower Bounds
In this section, we state lower bounds that prove the results from Theorem 4 to be tight up to lower-order terms for a wide range of mutation probabilities. Moreover, we show that the lower bounds hold for the very large class of mutation-based algorithms (Algorithm 2). Recall that a list of the most important consequences is given above in Theorem 3. For technical reasons, we split the proof of the lower bounds into two main cases, namely p = O(n^{−2/3−ε}) and p = Ω(n^{ε−2/3}) for any constant ε > 0. Unless p > 1/2, the proofs go back to OneMax as a worst case, as outlined in the following subsection.

6.1 OneMax as the Easiest Linear Function

Doerr, Johannsen and Winzen (2010a) show with respect to the (1+1) EA with standard mutation probability 1/n that OneMax is the "easiest" function from the class of functions with unique global optimum, which comprises the class of linear functions. More precisely, the expected optimization time on OneMax is proved to be smallest within the class.

We will generalize this result to p ≤ 1/2 and from the (1+1) EA on OneMax to the (1+1) EA_µ in a similar way to Sudholt (2010, Section 7). The latter algorithm, displayed as Algorithm 3, creates search points uniformly at random from time 0 to time µ − 1; afterwards it works as the standard (1+1) EA. Note that we obtain the standard (1+1) EA for µ = 1. Moreover, we will only consider the case µ = poly(n) in order to bound the running time of the initialization. This makes sense since a unique optimum (such as the all-zeros string for OneMax) is with overwhelming probability not found even when drawing 2^{√n} random search points.

Algorithm 3 (1+1) EA_µ
for t := 0 → µ − 1 do
    choose x_t ∈ {0,1}^n uniformly at random.
end for
x_t := arg min{f(x) | x ∈ {x_0, ..., x_t}} (breaking ties uniformly).
repeat
    create x′ by flipping each bit in x_t independently with prob. p.
    x_{t+1} := x′ if f(x′) ≤ f(x_t), and x_{t+1} := x_t otherwise.
    t := t + 1.
until forever.

Our analyses need the monotonicity statement from Lemma 1 below, which is similar to Lemma 11 in Doerr, Johannsen and Winzen (2010a) and whose proof is already sketched in Droste, Jansen, and Wegener (2000, Section 5). Note, however, that Doerr, Johannsen and Winzen (2010a) only consider p = 1/n and have a stronger statement for this case. More precisely, they show Prob(|mut(a)| = j) ≥ Prob(|mut(b)| = j), which does not hold for large p. Here and hereinafter, |x| denotes the number of ones in a bit string x.

Lemma 1.
Let a, b ∈ {0,1}^n be two search points satisfying |a| < |b|. Denote by mut(x) the random string obtained by mutating each bit of x independently with probability p. Let 0 ≤ j ≤ n be arbitrary. If p ≤ 1/2 then

Prob(|mut(a)| ≤ j) ≥ Prob(|mut(b)| ≤ j).

Proof.
We prove the result only for |b| = |a| + 1. The general statement then follows by induction on |b| − |a|.

By the symmetry of the mutation operator, Prob(|mut(x)| ≤ j) is the same for all x with |x| = |a|. We therefore assume b ≥ a (i.e., b is component-wise not less than a). In the following, let s* be the unique index where b_{s*} = 1 and a_{s*} = 0. Let S(x) be the event that bit s* flips when x is mutated. Since bits are flipped independently, it holds that Prob(S(x)) = p for any x. We write a' := mut(a) and b' := mut(b). Assuming p ≤ 1/2, the aim is to show Prob(|a'| ≤ j) ≥ Prob(|b'| ≤ j), which by the law of total probability is equivalent to

( Prob(|a'| ≤ j | ¬S(a)) − Prob(|b'| ≤ j | ¬S(b)) ) · (1 − p)
+ ( Prob(|a'| ≤ j | S(a)) − Prob(|b'| ≤ j | S(b)) ) · p ≥ 0.   (∗)

Note that the relation Prob(|a'| ≤ j | ¬S(a)) ≥ Prob(|b'| ≤ j | ¬S(b)) follows from a simple coupling argument as a' ≤ b' holds if the mutation operator flips the bits other than s* in the same way with respect to a and b. Moreover,

Prob(|a'| ≤ j | S(a)) − Prob(|b'| ≤ j | S(b)) = Prob(|b'| ≤ j | ¬S(b)) − Prob(|a'| ≤ j | ¬S(a))

since a is obtained from b by flipping bit s* and vice versa. Denoting the non-negative difference from the coupling argument by D, the left-hand side of (∗) thus equals D·(1 − p) − D·p = D·(1 − 2p) ≥ 0 for p ≤ 1/2. Hence, (∗) follows. □

The following theorem is a generalization of Theorem 9 by Doerr, Johannsen and Winzen (2010a) to the case p ≤ 1/2 instead of p = 1/n. However, we not only generalize to higher mutation probabilities, but also consider the more general class of mutation-based algorithms. Finally, we prove stochastic ordering, while Doerr, Johannsen and Winzen (2010a) inspect only the expected optimization times. Still, many ideas of the original proof can be taken over and be combined with the proof of Theorem 5 in Sudholt (2010).

Theorem 6.
Consider a mutation-based EA A with population size μ and mutation probability p ≤ 1/2 on any function with a unique global optimum. Then the optimization time of A is stochastically at least as large as the optimization time of the (1+1) EA_μ on OneMax.

Proof.
Let f denote the function with unique global optimum, which we w.l.o.g. assume to be the all-zeros string. For any sequence X = (x_0, ..., x_{ℓ−1}) of search points over {0,1}^n, let q(X) be the probability that X represents the first ℓ search points x_0, ..., x_{ℓ−1} created by Algorithm A on f (its so-called history up to time ℓ−1). For any X with q(X) > 0, let T_f(X) denote the random optimization time of Algorithm A on f, given that its history up to time ℓ−1 equals X. Let

Ξ_ℓ := { X = (x_0, ..., x_{ℓ−1}) ∈ ({0,1}^n)^ℓ | q(X) > 0 }

denote the set of all possible histories of length ℓ with respect to Algorithm A on f, and let Ξ := ∪_{ℓ∈ℕ} Ξ_ℓ denote all possible histories of finite length. Finally, for any X ∈ Ξ, let L(X) denote the length of X.

Given any X ∈ Ξ, let (1+1) EA(X) be the algorithm that chooses a search point with minimal number of ones from X as current search point at time L(X) − 1 and from then on proceeds like the (1+1) EA on OneMax. Now, let T_OneMax(X) denote the random optimization time of the (1+1) EA(X). We claim that the stochastic ordering

Prob(T_f(X) ≥ t) ≥ Prob(T_OneMax(X) ≥ t)

holds for every X ∈ Ξ satisfying L(X) ≥ μ and every t ≥ 0. Note that the random vector of initial search points X* := (x_0, ..., x_{μ−1}) follows the same distribution in both Algorithm A and the (1+1) EA_μ. In particular, the two algorithms are identical before time μ − 1, i.e., before initialization is finished. Furthermore, (1+1) EA(X*) is the (1+1) EA_μ initialized with X*. Altogether, the claimed stochastic ordering implies the theorem. Moreover, regardless of the length L(X), the claim is obvious for t ≤ L(X) since the behavior up to time L(X) is fixed.

For any X ∈ Ξ, let |X| := min{ |x| : x ∈ X } denote the best number of ones in the history, where x ∈ (x_0, ..., x_{ℓ−1}) means that x = x_i for some i ∈ {0, ..., ℓ−1}. For every k ∈ {0, ..., n}, every ℓ ≥ μ and every t ≥ 0, let

p_{k,ℓ}(t) := min{ Prob(T_OneMax(X) ≥ ℓ + t) | X ∈ Ξ_ℓ, |X| = k }

be the minimum probability of the (1+1) EA(X) needing at least ℓ + t steps to optimize OneMax from a history of length ℓ whose best search point has exactly k one-bits. Due to the symmetry of the OneMax function and the definition of (1+1) EA(X), we have Prob(T_OneMax(X) ≥ ℓ + t) = p_{k,ℓ}(t) for every X satisfying L(X) = ℓ and |X| = k. In other words, the minimum can be omitted from the definition of p_{k,ℓ}.

Furthermore, for every k ∈ {0, ..., n}, every ℓ ≥ μ and every t ≥ 0, let

p̃_{k,ℓ}(t) := min{ Prob(T_f(X) ≥ ℓ + t) | X ∈ Ξ_ℓ, |X| ≥ k }

be the minimum probability of Algorithm A needing at least ℓ + t steps to optimize f from a history of length ℓ ≥ μ whose best search point has at least k one-bits. We will show p̃_{k,ℓ}(t) ≥ p_{k,ℓ}(t) for any k ∈ {0, ..., n} and ℓ ≥ μ by induction on t. In particular, by choosing ℓ := μ and applying the law of total probability with respect to the outcomes of |X*|, this will imply the above-mentioned stochastic ordering and, therefore, the theorem.

If k ≥ 1, we have p_{k,ℓ}(0) = p̃_{k,ℓ}(0) = 1 for any ℓ ≥ μ since the condition means that the first ℓ search points do not contain the optimum. Moreover, p_{0,ℓ}(t) = p̃_{0,ℓ}(t) = 0 for any t ≥ 0 and ℓ ≥ μ since a history containing the all-zeros string corresponds to an optimization time of at most ℓ − 1 and thus minimizes both Prob(T_f(X) ≥ t + ℓ) and Prob(T_OneMax(X) ≥ t + ℓ). Now let us assume that there is some t ≥ 0 such that p̃_{k,ℓ}(t') ≥ p_{k,ℓ}(t') holds for all 0 ≤ t' ≤ t, k ∈ {0, ..., n}, and ℓ ≥ μ. Note that the inequality has already been proven for all t if k = 0.

Consider the (1+1) EA(X) for an arbitrary X satisfying L(X) = ℓ ≥ μ and |X| = k + 1 for some k ∈ {0, ..., n−1}. Let some x ∈ {0,1}^n, where |x| = k + 1, be chosen from X and let y ∈ {0,1}^n be the random search point generated by flipping each bit in x independently with probability p. The (1+1) EA(X) will accept y as new search point at time ℓ + 1 > μ if and only if |y| ≤ |x| = k + 1. Hence,

p_{k+1,ℓ}(t+1) = Prob(|y| ≥ k+1) · p_{k+1,ℓ+1}(t) + Σ_{j=0}^{k} Prob(|y| = j) · p_{j,ℓ+1}(t).   (∗)

Next, let X, where again L(X) = ℓ ≥ μ, be a history satisfying Prob(T_f(X) ≥ ℓ + t + 1) = p̃_{k+1,ℓ}(t+1) and let x̃ be the (random) search point that is chosen for mutation at time ℓ in order to obtain the equality of the two probabilities. Note that |x̃| ≥ k + 1. Moreover, let ỹ ∈ {0,1}^n be the random search point generated by flipping each bit in x̃ independently with probability p. Let X' be the concatenation of X and ỹ. Then

p̃_{k+1,ℓ}(t+1) = Prob(|ỹ| ≥ k+1) · Prob(T_f(X') ≥ ℓ + 1 + t | |ỹ| ≥ k+1) + Σ_{j=0}^{k} Prob(|ỹ| = j) · Prob(T_f(X') ≥ ℓ + 1 + t | |ỹ| = j),

which, since the conditional probabilities are bounded from below by the corresponding minima p̃_{j,ℓ+1}(t), gives us the lower bound

p̃_{k+1,ℓ}(t+1) ≥ Prob(|ỹ| ≥ k+1) · p̃_{k+1,ℓ+1}(t) + Σ_{j=0}^{k} Prob(|ỹ| = j) · p̃_{j,ℓ+1}(t).

To relate the last inequality to (∗) above, we interpret the right-hand side as a function of k + 2 variables. More precisely, let φ(a_0, ..., a_{k+1}) := Σ_{j=0}^{k+1} a_j · p̃_{j,ℓ+1}(t) and consider the vectors

v^{(f)} = (v^{(f)}_0, ..., v^{(f)}_{k+1}) := (Prob(|ỹ| = 0), ..., Prob(|ỹ| = k), Prob(|ỹ| ≥ k+1))

and

v^{(O)} = (v^{(O)}_0, ..., v^{(O)}_{k+1}) := (Prob(|y| = 0), ..., Prob(|y| = k), Prob(|y| ≥ k+1)).

If we can show that φ(v^{(f)}) ≥ φ(v^{(O)}), then we can conclude

p̃_{k+1,ℓ}(t+1) ≥ φ(v^{(f)}) ≥ φ(v^{(O)}) ≥ Prob(|y| ≥ k+1) · p_{k+1,ℓ+1}(t) + Σ_{j=0}^{k} Prob(|y| = j) · p_{j,ℓ+1}(t) = p_{k+1,ℓ}(t+1),

where the last inequality follows from the induction hypothesis and the equality is from (∗). This will complete the induction step.

To show the outstanding inequality, we use that for 0 ≤ j ≤ k

Prob(|y| ≤ j) ≥ Prob(|ỹ| ≤ j),

which follows from Lemma 1 since |x̃| ≥ |x| and p ≤ 1/2. In other words,

Σ_{i=0}^{j} v^{(O)}_i ≥ Σ_{i=0}^{j} v^{(f)}_i for 0 ≤ j ≤ k,

and Σ_{i=0}^{k+1} v^{(O)}_i = Σ_{i=0}^{k+1} v^{(f)}_i = 1 since we are dealing with probability distributions. Altogether, the vector v^{(O)} majorizes the vector v^{(f)}. Since they are based on increasingly restrictive conditions, the p̃_{j,ℓ+1}(t) are non-decreasing in j. Hence, φ is Schur-concave (cf. Theorem A.3 in Chapter 3 of Marshall, Olkin, and Arnold, 2011), which proves φ(v^{(f)}) ≥ φ(v^{(O)}) as desired. □

Large Mutation Probabilities

It is not too difficult to show that mutation probabilities p = Ω(n^{ε−1}), where ε > 0 is an arbitrary constant, make a mutation-based EA (in particular the (1+1) EA_μ) flip too many bits for it to optimize linear functions efficiently.

Theorem 7.
On any linear function, the optimization time of an arbitrary mutation-based EA with μ = poly(n) and p = Ω(n^{ε−1}) for some constant ε > 0 is bounded from below by 2^{Ω(n^ε)} with probability 1 − 2^{−Ω(n^ε)}.

Proof. Due to Theorem 6, it suffices to show the result for the (1+1) EA_μ on OneMax. The following two statements follow from Chernoff bounds (and a union bound over the μ = poly(n) search points in the second statement).

1. Due to the lower bound on p, the probability of a single step not flipping at least ⌊pi/2⌋ bits out of a set of i bits is at most 2^{−Ω(pi)} = 2^{−Ω(i·n^{ε−1})}.

2. The search point x_{μ−1} has at least n/3 and at most 2n/3 one-bits with probability 1 − 2^{−Ω(n)}.

Furthermore, as we consider OneMax, the number of one-bits is non-increasing over time. We assume x_{μ−1} being non-optimal and having at most 2n/3 one-bits, which adds only a term 2^{−Ω(n)} to the failure probability. The assumption means that all future search points accepted by the (1+1) EA_μ will have at least n/3 zero-bits. Since a step creating the all-zeros optimum must not flip any of these zero-bits, the first statement (applied to the set of at least i = n/3 zero-bits) shows that a single step creates the optimum with probability at most 2^{−Ω(n^ε)}, and by the union bound, the total probability is still 2^{−Ω(n^ε)} in a number of 2^{cn^ε} steps if the constant c is chosen small enough. □

Mutation-based EAs have only been defined for p ≤ 1/2. The following theorem therefore considers only the (1+1) EA for p > 1/2, where we can no longer rely on OneMax being the easiest linear function in this case.
Theorem 8.
On any linear function, the expected optimization time of the (1+1) EA with mutation probability p > 1/2 is bounded from below by 2^{Ω(n)}.

Proof.
We distinguish between two cases.

Case 1: p ≥ 3/4. Here we assume that the initial search point has at least n/3 zero-bits, which happens with probability 1 − 2^{−Ω(n)}. Since a step creating the all-zeros optimum must not flip any of the at least n/3 zero-bits of the current search point, and each of these flips with probability p ≥ 3/4, the probability of creating the optimum in a single step is at most 4^{−n/3}; hence the expected optimization time under the assumed initialization is at least 4^{n/3}. Altogether, the unconditional expected optimization time is at least (1 − 2^{−Ω(n)}) · 4^{n/3} = 2^{Ω(n)}.

Case 2: 1/2 < p ≤ 3/4. Now the aim is to show that all search points created within 2^{cn} steps, for a sufficiently small constant c > 0, have a number of ones that is in the interval I := [n/8, 7n/8] with probability 1 − 2^{−Ω(n)}. This will imply the theorem by the usual waiting time argument.

Let x be a search point such that |x| ∈ I. We consider the event of mutating x to some x' where |x'| < n/8. Since p > 1/2, the expected number of one-bits in x', which equals (1−p)|x| + p(n−|x|), is decreasing in |x| and therefore smallest for |x| = 7n/8. Still, using p ≤ 3/4, this expectation is at least (1/4)·(7n/8) + (3/4)·(n/8) = 5n/16 > n/8, so by Chernoff bounds |x'| ≥ n/8 holds with probability 1 − 2^{−Ω(n)}. By a symmetrical argument, the probability is 2^{−Ω(n)} that |x'| > 7n/8. Since also the initial search point lies in I with probability 1 − 2^{−Ω(n)}, a union bound over the 2^{cn} steps proves the claim. □

As was to be expected, no polynomial expected optimization times are possible for the range of p considered in this subsection.

We now turn to mutation probabilities that are bounded from above by roughly 1/n^{2/3}. Here relatively precise lower bounds can be obtained.

Theorem 9.
On any linear function, the expected optimization time of an arbitrary mutation-based EA with μ = poly(n) and p = O(n^{−2/3−ε}) for some constant ε > 0 is bounded from below by

(1 − o(1)) (1 − p)^{−n} (1/p) min{ ln n, ln(1/(p³n²)) }.

As a consequence of Theorem 9, we obtain that the bound from Theorem 4 is tight (up to lower-order terms) for the (1+1) EA as long as ln(1/(p³n²)) = ln n − o(ln n). This condition is weaker than p = O((ln n)/n). If p = ω((ln n)/n) or p = o(1/poly(n)), then Theorem 9 in conjunction with Theorems 7 and 8 implies superpolynomial expected optimization time. Thus, the bounds are tight for all p that allow polynomial optimization times.

Before the proof, we state another important consequence, implying the statement from Theorem 3 that using the (1+1) EA with mutation probability 1/n is optimal for any linear function.

Corollary 3.
On any linear function, the expected optimization time of a mutation-based EA with μ = poly(n) and p = c/n, where c > 0 is a constant, is bounded from below by (1 − o(1)) ((e^c/c) n ln n). If p = ω(1/n) or p = o(1/n), the expected optimization time is ω(n ln n).

Proof. The first statement follows immediately from Theorem 9 using (1 − c/n)^{−n} ≥ e^c and ln(1/(p³n²)) = ln n − O(ln c). The second one follows, depending on p, either from Theorem 7 or, in that case assuming p = O((ln n)/n), from Theorem 9, noting that (1 − p)^{−n}(1/p) ≥ e^{pn}/p = ω(n) if p = ω(1/n) or p = o(1/n). □

Recall that by Theorem 6, it is enough to prove Theorem 9 for the (1+1) EA_μ on OneMax. As mentioned above, this is a well-studied function, for which strong upper and lower bounds are known in the case p = 1/n. Our result for general p is inspired by the proof of Theorem 1 in Doerr, Fouz and Witt (2010), which uses an implicit multiplicative drift theorem for lower bounds. Therefore, we now need an upper bound on the multiplicative drift, which is given by the following generalization of Lemma 6 in Doerr, Fouz and Witt (2011).

Lemma 2.
Consider the (1+1) EA with mutation probability p for the minimization of OneMax. Given a current search point with i one-bits, let I' denote the random number of one-bits in the subsequent search point (after selection). Then we have

E[i − I'] ≤ ip (1 − p + ip²/(1−p))^{n−i}.

Proof.
Note that I' ≤ i since the number of one-bits in the process is non-increasing. Hence, only mutations that flip at least as many one-bits as zero-bits have to be considered. The event that the total number of one-bits is decreased by k ≥ 1 is composed of the disjoint events F_{k,j} that k+j one-bits and j zero-bits flip, for all j ∈ ℕ₀. The probability of an individual event F_{k,j} equals

C(i, k+j) · C(n−i, j) · p^{k+2j} (1−p)^{n−k−2j},

where C(a, b) denotes the binomial coefficient, with C(a, b) := 0 for b > a. Thus, we have

E(i − I') ≤ Σ_{k=1}^{i} k Σ_{j≥0} C(i, k+j) C(n−i, j) p^{k+2j} (1−p)^{n−k−2j}
≤ ( Σ_{k=1}^{i} k C(i, k) p^k (1−p)^{n−k} ) · ( Σ_{j=0}^{n−i} i^j C(n−i, j) (p/(1−p))^{2j} ) =: S₁ · S₂,

where the second inequality uses C(i, k+j) ≤ i^j · C(i, k). Factoring out (1−p)^{n−i} of S₁, we recognize the expected value of a binomial distribution with parameters i and p, which means S₁ = (1−p)^{n−i} · ip. Regarding S₂, we apply the Binomial Theorem and obtain S₂ = (1 + i(p/(1−p))²)^{n−i}. The product of S₁ and S₂ is the upper bound from the lemma. □

Proof of Theorem 9. As already mentioned, we may assume that the linear function is OneMax and that the algorithm is the (1+1) EA_μ. The idea is to apply Theorem 2, which is the above-mentioned multiplicative drift theorem for lower bounds, for a suitable choice of the parameters. Let p̃ := max{p, 1/n}. We first observe that the probability of flipping at least b := p̃n ln n bits in a single step is bounded from above by

C(n, p̃n ln n) · p^{p̃n ln n} ≤ (e p̃n/(p̃n ln n))^{p̃n ln n} = 2^{−Ω(p̃n (ln n)(ln ln n))},

where we have used p ≤ p̃. Hence, the probability is superpolynomially small. In the following, we assume that the number of one-bits changes by at most b in each of a total number of at most (1−p)^{−n} n ln n = 2^{O(p̃n)+O(ln n)} steps that are considered for the lower bound we want to prove. This event holds with probability 1 − o(1), which, using the law of total probability, decreases the bound only by a factor of 1 − o(1).

Let X^{(t)} denote the number of one-bits at time t and note that this is non-increasing over time. We choose s_min := p̃n ln²n and β := 1/ln n and introduce s_max := 1/(2p̃²n ln n) as an additional upper bound. Note that s_max ≤ n/(2 ln n) due to p̃ ≥ 1/n. Since the μ initial search points are drawn uniformly at random and μ = poly(n), it holds that X^{(μ−1)} ≥ s_max with probability 1 − o(1). Again, assuming this to happen, we lose a factor 1 − o(1) in the bound we want to prove. Moreover, due to our assumption p = O(n^{−2/3−ε}) (which implies p̃ = O(n^{−2/3−ε})), we have b = p̃n ln n ≤ 1/(4p̃²n ln n) = s_max/2 for n large enough. Altogether, it holds that s_max/2 ≤ X^{(t*)} ≤ s_max at the first point of time t* where X^{(t*)} ≤ s_max. To simplify issues, we consider the process only from time t* on. Skipping the first t* steps, we pessimistically assume s₀ := s_max/2 and X^{(t)} ≤ s_max for all t ≥ 0. The second condition of the drift theorem is now fulfilled since the bound on p̃ also implies b = p̃n ln n ≤ 1/(2p̃²n ln²n) = βs_max, where βs_max is the largest value for βs to be taken into account.

Assembling the factors from the lower bound in Theorem 2, we get (1−β)/(1+β) = 1 − o(1). Furthermore, we have ln(s₀/s_min) = ln(1/(4p̃³n²ln³n)) = ln(1/(p̃³n²)) − O(ln ln n), which is (1 − o(1)) ln(1/(p̃³n²)) by our assumption on p̃; by the definition of p̃, the latter expression equals (1 − o(1)) min{ ln n, ln(1/(p³n²)) }. If we can prove that 1/δ = (1 − o(1))(1 − p)^{−n}(1/p), the proof is complete.

To bound δ, we use Lemma 2. Note that i ≤ s_max holds in our simplified process. Using the lemma and recalling that p ≤ p̃, we get

E(X^{(t)} − X^{(t+1)} | X^{(t)} = i)/i ≤ p (1 − p + s_max p²/(1−p))^{n−s_max} ≤ p (1 − p + 1/(n ln n))^{n−s_max}
≤ p ((1 − p)(1 + 2/(n ln n)))^{n−s_max} = (1 + o(1)) p (1 − p)^n,

using p ≤ 1/2, (1 + 2/(n ln n))^n = 1 + o(1) and (1 − p)^{−s_max} = (1 − p)^{−1/(2p̃²n ln n)} = 1 + o(1). Hence, 1/δ ≥ (1 − o(1))(1/p)(1 − p)^{−n} as suggested, which completes the proof. □

Finally, we remark that the expected optimization time of the (1+1) EA with p = 1/n on OneMax is known to be en ln n − Θ(n) (Doerr, Fouz and Witt, 2011). Hence, in conjunction with Theorems 5 and 6, we obtain for p = 1/n that the expected optimization time of the (1+1) EA varies by at most an additive term Θ(n) within the class of linear functions.

Conclusions
We have presented new bounds on the expected optimization time of the (1+1) EA on the class of linear functions. The results are now tight up to lower-order terms, which applies to any mutation probability p = O((ln n)/n). This means that 1/n is the optimal mutation probability on any linear function. We have for the first time studied the case p = ω(1/n) and proved a phase transition from polynomial to exponential running time in the regime Θ((ln n)/n). The lower bounds show that OneMax is the easiest linear function for all p ≤ 1/2, and they apply not only to the (1+1) EA but also to the large class of mutation-based EAs. They thus exhibit the (1+1) EA as an optimal mutation-based algorithm on linear functions. The upper bounds hold with high probability. As proof techniques, we have employed multiplicative drift in conjunction with adaptive potential functions. In the future, we hope to see these techniques applied to the analysis of other randomized search heuristics.

We finish with an open problem. Even though our proofs of upper bounds would simplify for the function BinVal, this function is often considered as a worst case. Is it true that the runtime of the (1+1) EA on BinVal is stochastically largest within the class of linear functions, thereby complementing the result that the runtime on OneMax is stochastically smallest?
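As a modest empirical companion to this open problem, one can compare average observed optimization times of the (1+1) EA with p = 1/n on OneMax and BinVal. The sketch below is ours, with arbitrarily chosen parameters, and of course cannot settle the stochastic-ordering question:

```python
import random

def run_ea(n, p, f, rng, max_steps=10**6):
    # standard (1+1) EA minimizing f, started from a uniform search point
    x = [rng.randint(0, 1) for _ in range(n)]
    for step in range(max_steps):
        if f(x) == 0:            # both OneMax and BinVal have optimum value 0
            return step
        y = [b ^ (rng.random() < p) for b in x]
        if f(y) <= f(x):
            x = y
    return max_steps

one_max = sum

def bin_val(x):
    # BinVal: weight 2^(n-1-i) on bit i, i.e. the bit string read as a binary number
    return int("".join(map(str, x)), 2)

def mean_time(f, n=20, trials=30, seed=0):
    rng = random.Random(seed)
    return sum(run_ea(n, 1 / n, f, rng) for _ in range(trials)) / trials
```

A quick comparison is then `print(mean_time(one_max), mean_time(bin_val))`; any observed ordering for a single n and seed is anecdotal, not evidence of stochastic dominance.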
Acknowledgments
The author thanks Benjamin Doerr, Timo Kötzing, Per Kristian Lehre and Carola Winzen for constructive discussions on the subject and for ruling out an early proof attempt. Moreover, he thanks Daniel Johannsen for proofreading a draft of Sections 4 and 5 and pointing out a simplification of the proof of Theorem 4. Finally, he thanks Dirk Sudholt, who suggested to study the class of mutation-based EAs.
Multiplicative Drift for Lower Bounds
In this appendix, we supply the proof of Theorem 2, the lower-bound version of the multiplicative drift theorem. The proof follows the one of Theorem 5 in Lehre and Witt (2010) and uses the following additive drift theorem.

Theorem 10 (Jägersküpper (2007)). Let X^{(1)}, X^{(2)}, ... be random variables with bounded support and let T be the stopping time defined by T := min{ t | X^{(1)} + ··· + X^{(t)} ≥ g } for a given g > 0. If E(T) exists and E(X^{(i)} | T ≥ i) ≤ u for i ∈ ℕ, then E(T) ≥ g/u.

The proof of Theorem 2 also makes use of the following simple lemma.
Lemma 3.
Let X be any random variable, and k any real number. If it holds that Prob(X < k) > 0, then E(X) ≥ E(X | X < k).

Proof.
Define p := Prob(X < k) and μ_k := E(X | X < k). The lemma clearly holds when p = 1, such that we assume 0 < p < 1 in the following. If E(X) is positive infinite then E(X) ≥ μ_k is obvious. If E(X) is negative infinite then so is μ_k by the law of total probability. Finally, for finite E(X), the law of total probability yields

E(X) = (1 − p) · E(X | X ≥ k) + p · μ_k ≥ (1 − p) · k + p · μ_k > (1 − p) · μ_k + p · μ_k = E(X | X < k). □

Proof of Theorem 2.
The proof generalizes the proof of Theorem 1 in Doerr, Fouz and Witt (2010). The random variable T is non-negative. Hence, if the expectation of T does not exist, then it is positive infinite and the theorem holds. We condition on the event T > t, but we omit stating this event in the expectations for notational convenience. We define the stochastic process Y^{(t)} := ln(X^{(t)}) (note that X^{(t)} ≥ 1) and consider the drift

Δ_{t+1}(s) := E(Y^{(t)} − Y^{(t+1)} | X^{(t)} = s) = E(ln(s/X^{(t+1)}) | X^{(t)} = s).

We consider the time until X^{(t)} ≤ s_min if X^{(0)} = s₀ and use the parameter g := ln(s₀/s_min). By the law of total probability, the expectation of Δ_{t+1}(s) can be expressed as

Prob(s − X^{(t+1)} ≥ βs) · E(Δ_{t+1}(s) | s − X^{(t+1)} ≥ βs)
+ Prob(s − X^{(t+1)} < βs) · E(Δ_{t+1}(s) | s − X^{(t+1)} < βs).   (1)

By applying the second condition from the theorem, the first term in (1) can be bounded from above by (βδ/ln s) · ln s = βδ. The logarithmic function is concave. Hence, by Jensen's inequality, the second term in (1) is at most

ln( E(s/X^{(t+1)} | s − X^{(t+1)} < βs ∧ X^{(t)} = s) ) = ln( 1 + E((s − X^{(t+1)})/X^{(t+1)} | s − X^{(t+1)} < βs ∧ X^{(t)} = s) ).

By using the inequality ln(1 + x) ≤ x as well as the conditions X^{(t+1)} > (1 − β)s and X^{(t+1)} ≤ X^{(t)}, this simplifies to

E((s − X^{(t+1)})/X^{(t+1)} | s − X^{(t+1)} < βs ∧ X^{(t)} = s) < E((s − X^{(t+1)})/((1 − β)s) | s − X^{(t+1)} < βs ∧ X^{(t)} = s).

By Lemma 3 and the first condition from the theorem, it follows that the second term in (1) is at most

E((s − X^{(t+1)})/((1 − β)s) | X^{(t)} = s) ≤ δ/(1 − β).

Altogether, we obtain E(Δ_{t+1}(s)) ≤ (β + 1/(1 − β)) δ ≤ ((β + 1)/(1 − β)) δ. From Theorem 10, it now follows that

E(T | X^{(0)} = s₀) ≥ (1/δ) · ((1 − β)/(1 + β)) · ln(s₀/s_min). □

References
Auger, Anne and Doerr, Benjamin (2011). Theory of Randomized Search Heuristics – Foundations and Recent Developments. World Scientific Publishing.

Bäck, Thomas (1993). Optimal mutation rates in genetic search. In Proc. of ICGA '93, 2–8. Morgan Kaufmann.

Doerr, Benjamin and Goldberg, Leslie Ann (2011). Adaptive drift analysis. Algorithmica. To appear; preprint: http://arxiv.org/abs/1108.0295.

Doerr, Benjamin, Johannsen, Daniel, and Winzen, Carola (2010a). Drift analysis and linear functions revisited. In Proc. of CEC '10, 1–8. IEEE Press.

Doerr, Benjamin, Johannsen, Daniel, and Winzen, Carola (2010b). Multiplicative drift analysis. In Proc. of GECCO '10, 1449–1456. ACM Press.

Doerr, Benjamin, Fouz, Mahmoud, and Witt, Carsten (2010). Quasirandom evolutionary algorithms. In Proc. of GECCO '10, 1457–1464. ACM Press.

Doerr, Benjamin, Fouz, Mahmoud, and Witt, Carsten (2011). Sharp bounds by probability-generating functions and variable drift. In Proc. of GECCO '11, 2083–2090. ACM Press.

Droste, Stefan, Jansen, Thomas, and Wegener, Ingo (2000). A natural and simple function which is hard for all evolutionary algorithms. In Proc. of IECON '00, 2704–2709. DOI 10.1109/IECON.2000.972425.

Droste, Stefan, Jansen, Thomas, and Wegener, Ingo (2002). On the analysis of the (1+1) evolutionary algorithm. Theoretical Computer Science, 276, 51–81.

Hajek, Bruce (1982). Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3), 502–525.

He, Jun and Yao, Xin (2001). Drift analysis and average time complexity of evolutionary algorithms. Artificial Intelligence, 127, 57–85.

He, Jun and Yao, Xin (2004). A study of drift analysis for estimating computation time of evolutionary algorithms. Natural Computing, 3(1), 21–35.

Jägersküpper, Jens (2007). Algorithmic analysis of a basic evolutionary algorithm for continuous optimization. Theoretical Computer Science, 379(3), 329–347.

Jägersküpper, Jens (2008). A blend of Markov-chain and drift analysis. In Proc. of PPSN '08, vol. 5199 of LNCS, 41–51. Springer.

Lehre, Per Kristian and Witt, Carsten (2010). Black-box search by unbiased variation. ECCC report TR10-102, http://eccc.hpi-web.de/report/2010/102/.

Marshall, Albert W., Olkin, Ingram, and Arnold, Barry (2011). Inequalities: Theory of Majorization and Its Applications. Springer, 2nd ed.

Motwani, Rajeev and Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press.

Mühlenbein, Heinz (1992). How genetic algorithms really work: I. Mutation and hillclimbing. In Proc. of PPSN '92, 15–26. Elsevier.

Neumann, Frank and Witt, Carsten (2010). Bioinspired Computation in Combinatorial Optimization – Algorithms and Their Computational Complexity. Natural Computing Series. Springer.

Sudholt, Dirk (2010). General lower bounds for the running time of evolutionary algorithms. In Proc. of PPSN '10, 124–133. Springer. Extended version: http://arxiv.org/abs/1109.1504.

Wegener, Ingo (2001). Methods for the analysis of evolutionary algorithms on pseudo-Boolean functions. In Sarker, Ruhul, Mohammadian, Masoud, and Yao, Xin (eds.), Evolutionary Optimization. Kluwer Academic Publishers.

Wegener, Ingo and Witt, Carsten (2005a). On the analysis of a simple evolutionary algorithm on quadratic pseudo-boolean functions. Journal of Discrete Algorithms, 3(1), 61–78.

Wegener, Ingo and Witt, Carsten (2005b). On the optimization of monotone polynomials by simple randomized search heuristics. Combinatorics, Probability & Computing, 14.