Novel Analysis of Population Scalability in Evolutionary Algorithms
Jun He∗, Department of Computer Science, Aberystwyth University, Aberystwyth, SY23 3DB, U.K.
Tianshi Chen, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Boris Mitavskiy, Department of Computer Science, Aberystwyth University, Aberystwyth, SY23 3DB, U.K.

June 13, 2018
Abstract
Population-based evolutionary algorithms (EAs) have been widely applied to solve various optimization problems. The question of how the performance of a population-based EA depends on the population size arises naturally. The performance of an EA may be evaluated by different measures, such as the average convergence rate to the optimal set per generation or the expected number of generations to encounter an optimal solution for the first time. Population scalability is the performance ratio between a benchmark EA and another EA using identical genetic operators but a larger population size. Although intuitively the performance of an EA may improve if its population size increases, currently there exist only a few case studies for simple fitness functions. This paper aims at providing a general study for discrete optimisation. A novel approach is introduced to analyse population scalability using the fundamental matrix. The following two contributions summarize the major results of the current article. (1) We demonstrate rigorously that for elitist EAs with identical global mutation, using a larger population size always increases the average rate of convergence to the optimal set; and yet, sometimes, the expected number of generations needed to find an optimal solution (measured by either the maximal value or the average value) may increase, rather than decrease. (2) We establish sufficient and/or necessary conditions for superlinear scalability, that is, for when the average convergence rate of a (µ+µ) EA (where µ ≥ 2) is more than µ times that of a (1+1) EA.

1 Introduction

Population-based evolutionary algorithms (EAs) have been widely applied to tackle a variety of optimization problems. A wide range of approaches is available to design efficient population-based EAs, and using a population delivers many benefits [1]. A commonly accepted intuition is that the performance of such an EA may improve if its population size increases.
Nonetheless, sometimes an intuitive rule of thumb may turn out to be wrong; hence a rigorous analysis is highly desirable. Currently there are only a few case studies for simple fitness functions, and no general result has been established so far.

Population scalability describes the relationship between the performance of an EA and its population size. A population-based EA should be interpreted as a family of EAs using identical genetic operators but different population sizes. Given a benchmark EA and another EA in the family with a population size larger than that of the benchmark, population scalability (scalability for short) is measured by the ratio

population scalability = (performance of the EA with the larger population size) / (performance of the benchmark EA).   (1)

To make use of the above formula, it is necessary to clarify the meaning of the "performance" of an EA. Since EAs are iterative methods, the following two measures are fundamental in evaluating their performance from both theoretical and practical points of view.

Convergence rate: the rate at which an EA converges to the optimal set per generation [2, 3]. The convergence rate is a measure applicable to both numerical and discrete optimization problems.

∗ Corresponding author: [email protected].
Expected number of generations: the average number of generations needed to encounter an optimal solution for the first time. This measure is not suitable for numerical optimization, where an EA usually needs an infinite number of generations to find an exact solution.

To simplify the analysis, a (1+1) EA plays the role of the benchmark EA in this paper. The other EA is a (µ+µ) EA, where µ (≥ 2) is the population size (an integer). Furthermore, the EAs under consideration satisfy the following conditions: (1) they are applied to tackle discrete optimization problems; (2) they are convergent; (3) their genetic operators include mutation and selection only; and (4) the genetic operators are applied in the same fashion at every generation.

The problem of population scalability is rather challenging in the theory of population-based EAs. In order to estimate population scalability, it is necessary to acquire the exact value of the convergence rate, or of the expected number of generations it takes to reach an optimal solution, for both the (1+1) and the (µ+µ) EA. The difficulty is that the convergence rate depends on the initial population and varies from one generation to another, and the expected number of generations depends on the initial population. Furthermore, the search spaces corresponding to the (1+1) and (µ+µ) EAs have different dimensions. Therefore the notion of population scalability should be based on the "overall" performance of an EA; but how does one define the "overall" performance?

Convergent EAs can be modelled via absorbing Markov chains [4, 5], and one of the most elegant aspects of the theory of absorbing Markov chains revolves around the notion of the fundamental matrix {N(X, Y)}, where X and Y range over the transient states and N(X, Y) is the expected number of visits to the transient state Y starting from the transient state X prior to absorption. Given a (1+1) EA and a (µ+µ) EA (where µ ≥ 2), each has its own fundamental matrix N with spectral radius ρ(N). In Section 3 we show that 1/ρ(N) is the average rate of convergence to the optimal set, and that ρ(N) is a "max-min" value related to the expected number of visits to a transient state. Population scalability is measured rigorously in terms of the ratio of the spectral radii of the corresponding fundamental matrices:

(spectral radius of the fundamental matrix of the (1+1) EA) / (spectral radius of the fundamental matrix of the (µ+µ) EA).
(2)

The aim of the current article is to compare the average convergence rates towards an optimal solution (i.e. an absorbing set of states) of the Markov chains modelling a (µ+µ) EA and a (1+1) EA. This makes the paper very different from the majority of prior research, which studies the expected number of generations. This work addresses the following two fundamental questions.

1. Does the average convergence rate increase as the population size does, and, if so, under what kind of circumstances does this take place? Intuitively this seems trivial, since a (µ+µ) EA employs more individuals than a (1+1) EA does; nonetheless, a proof is required to confirm it.¹

¹In EAs (but not in general iterative methods), performance is also evaluated by the expected number of fitness evaluations needed to find an optimal solution. Since in a (µ+µ) EA the number of fitness evaluations per generation is fixed to µ, the expected number of fitness evaluations = µ × the expected number of generations. Therefore it is sufficient to study population scalability using the number of generations.
2. As the population size increases from 1 to µ, under what kind of circumstances does the average convergence rate increase by a factor bigger than µ? This question is also intuitive, since the number of individuals employed by a (µ+µ) EA is µ times that employed by the corresponding (1+1) EA.

The paper is organized as follows: A review of previous related research is given in Section 2. Convergence rate and population scalability are formally introduced in Section 3. Section 4 aims at answering the first question stated above, while Section 5 aims at answering the second one. Sections 6 and 7 are devoted to case studies of non-bridgeable and bridgeable fitness landscapes. Finally, Section 8 concludes the paper and discusses other types of EAs.

2 Related Work

The study of population scalability in evolutionary computation can be traced back to the early 1990s. Goldberg et al. [6] presented a population sizing equation to show how a large population size helps an EA to distinguish between good and bad building blocks on some test problems. Mühlenbein and Schlierkamp-Voosen [7] studied the critical (minimal) population size that can guarantee convergence to the optimum. An adaptive scheme to control the population size was proposed by Arabas et al. [8], and the effectiveness of the proposed methodology was validated through an empirical study. A review of various techniques to control an EA's parameters, in which adjustment of the population size is emphasized as an important research issue, appears in Eiben et al. [9]. A link between the population size and the quality of the solution was exhibited by Harik et al. [10] via an analogy between one-dimensional random walks and EAs.
While the approximate population sizing models proposed in the investigations mentioned above may shed some light on choosing a "promising" population size, the effectiveness of these models has been validated only via case studies on specific optimization problems.

There do exist a few rigorous results about population scalability. In one of the earliest rigorous analyses, He and Yao [11] investigated how the expected hitting time of EAs varies as the population size increases. Later, He and Yao [12] exhibited a link between population scalability and parallelism. A study of the population scalability of the (1+µ) EA on three pseudo-Boolean functions, Leading-Ones, One-Max and Suf-Samp, appears in Jansen et al. [13]. Lässig and Sudholt [14] presented a runtime analysis of a (1+µ) EA with an adaptive offspring size µ on several pseudo-Boolean functions. An analysis of how the running time of a (µ+1) EA on the Sphere function scales with respect to the problem size n appears in Jägersküpper and Witt [15]. Jansen and Wegener [16] showed that the running time of the (µ+1) EA with a crossover operator on the Real Royal Road function is polynomial on average, while that of an EA with mutation and selection only is exponentially large with an overwhelming probability. A rigorous runtime analysis of the (µ+1) EA on a specific pseudo-Boolean function was carried out by Witt [17, 18]. A rigorous runtime analysis of selecting the population size for the (µ+1) EA on several pseudo-Boolean functions appears in Storch [19]. A runtime analysis of both the (1+µ) and the (µ+1) EA on some instances of Vertex Cover Problems is provided by Oliveto et al. [20]. A runtime analysis of (µ+1) EAs with diversity-preserving mechanisms on the Two-Max problem was carried out by Friedrich et al. [21].
An upper bound on the number of generations it takes a (µ+µ) EA to encounter an optimal solution for the first time on the two well-known unimodal problems, Leading-Ones and One-Max, was obtained in [22]. The effect of population size in evolutionary multi-objective optimization was considered by Giel and Lehre [23]: it was shown that only the population-based EA is successful, while all the other individual-based algorithms fail on a specified class of pseudo-Boolean functions. Nonetheless, all of the available theoretical results are mainly restricted to several simple algorithms for tackling specific problems. In other words, the up-to-date knowledge is limited to case studies only [24].

In contrast with the previous investigations, the current paper aims at drawing general results that apply to all discrete optimisation problems. The study is based on the fundamental matrix of the Markov chain modelling an EA. Such an approach can be traced back to an early work on asymptotic convergence properties of EAs by Fogel [25], and it has been applied to analysing elitist EAs by He and Yao [5].

3 Evolutionary Algorithms, Absorbing Markov Chains and Population Scalability
Without loss of generality, consider the problem of maximizing a fitness function f(x):

max f(x), x ∈ D, subject to constraints,

where f(x) is a fitness function and D is its domain (a finite set). For instance, D is the set of all Boolean formulas in the satisfiability problem [26], or the set of all possible vertex covers in the vertex cover problem [20]. Arrange the values of the fitness function in order from high to low; these values are called fitness levels.

To alleviate the complexity of the theoretical analysis, suppose that all constraints in the above problem have been removed through a constraint handling method. Under this circumstance, all solutions in D are considered feasible. Practical and theoretical analyses of constraint handling in evolutionary computation may be found in [27, 28], for instance.

We consider an EA that makes use of an extra archive for keeping the best solution found. The archive itself is not involved in generating a new population. The general design of a (µ+µ) EA with an archive is described in Algorithm 1. In the description, t denotes the generation counter. Φ_t = (φ_{t,1}, ···, φ_{t,µ}) represents the population at the t-th generation, where φ_{t,1}, ···, φ_{t,µ} are µ individuals. Φ_{t+1/2} denotes the offspring population generated via mutation from the population Φ_t.

Algorithm 1
A (µ+µ) EA with an Archive (where µ is the population size)

input: fitness function;
generation counter t ← 0;
initialize Φ_0;
an archive keeps the best solution in Φ_0;
while (no optimal solution is found) do
    Φ_{t+1/2} ← each individual in Φ_t generates a child by mutation;
    evaluate the fitness of each individual in Φ_{t+1/2};
    Φ_{t+1} ← selected from Φ_t and Φ_{t+1/2};
    update the archive if the best solution in Φ_{t+1} is better than it;
    t ← t + 1;
end while
output: the maximum of the fitness function.

The initial population is selected at random in such a way that any possible population may be chosen with a positive probability. Rather general definitions of the mutation and selection operators appear below.

• A mutation operator is represented via an S^(1)-by-S^(1) Markov transition probability matrix, the entries of which are given as

P_M(x, y) = P(φ_{t+1/2} = y | φ_t = x), x, y ∈ S^(1).

Here P_M(x, y) denotes the probability of going from x to y; φ_t is an individual in the t-th generation population and φ_{t+1/2} the child of φ_t after mutation. S^(1) = D is called the space of individuals. The individuals φ_t and φ_{t+1/2} are random variables, while x and y denote their states.

• A selection operator is represented via an (S^(µ) × S^(µ))-by-S^(µ) probability transition matrix, the entries of which are given as

P_S(X, Y; Z) = P(Φ_{t+1} = Z | Φ_t = X, Φ_{t+1/2} = Y), X, Y, Z ∈ S^(µ).

Here P_S(X, Y; Z) represents the probability of selecting µ individuals from the populations X and Y (the children of X) and forming the next parent population Z. Φ_t is the t-th generation population, Φ_{t+1/2} the population of children of Φ_t after mutation, and Φ_{t+1} is the (t+1)-th generation population. S^(µ) is the Cartesian product ∏_{i=1}^{µ} S^(1), called the space of populations when µ ≥ 2. The populations Φ_t, Φ_{t+1/2} and Φ_{t+1} are random variables, while X, Y, Z denote their states in the space of populations. The superscripts (1) and (µ) are used to distinguish between the space of individuals and the space of populations.

A natural requirement on the selection operator is that all the individuals in Z must come from those in X or in Y. If Z contains an individual that is neither an individual of X nor of Y, then the probability of going from X and Y to Z is 0.

The stopping criterion is that the algorithm terminates once an optimal solution is found. This criterion is assumed only for the sake of convenience of the theoretical analysis; apparently, we can simply ignore what happens after an EA encounters an optimal solution. The EAs considered in this paper are convergent: starting from any initial population, an EA can find an optimal solution after a finite number of generations. The mathematical framework introduced above incorporates a wide class of EAs as it does not assume any implementation details. Table 1 summarizes the special notation appearing in the paper.
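The (µ+µ) EA with an archive described in Algorithm 1 can be sketched as follows. The concrete choices here — the OneMax fitness function, bitwise mutation with rate 1/n, and truncation-style selection — are illustrative assumptions for the sketch, not prescribed by the framework, which allows any mutation and selection operators of the general form above.

```python
import random

def one_max(x):
    """Illustrative fitness function: the number of 1-bits (an assumed example)."""
    return sum(x)

def mutate(x, p):
    """Bitwise mutation: flip each bit independently with probability p."""
    return [b ^ (random.random() < p) for b in x]

def mu_plus_mu_ea(fitness, n, mu, p=None, max_gens=10_000, seed=0):
    """Sketch of Algorithm 1: a (mu+mu) EA with an archive.

    Each parent generates one child by mutation; the next population is
    selected from parents plus children.  Truncation selection is used
    here for concreteness; the framework allows other selection rules."""
    random.seed(seed)
    p = p if p is not None else 1.0 / n
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    archive = max(pop, key=fitness)               # best solution found so far
    for t in range(max_gens):
        if fitness(archive) == n:                 # stopping criterion: optimum found
            return archive, t
        children = [mutate(x, p) for x in pop]    # Phi_{t+1/2}
        pop = sorted(pop + children, key=fitness, reverse=True)[:mu]  # Phi_{t+1}
        if fitness(pop[0]) > fitness(archive):    # update the archive
            archive = pop[0]
    return archive, max_gens

best, gens = mu_plus_mu_ea(one_max, n=20, mu=4)
```

Note that the archive only records the best solution found; exactly as in Algorithm 1, it plays no role in producing the next population.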
(1), (µ)            distinguish between a (1+1) EA and a (µ+µ) EA when necessary
S^(1)               the set of individuals
S^(µ)               the set of populations of size µ, equal to the Cartesian product ∏_{i=1}^{µ} S^(1)
x, y, z             individuals, also called states in S^(1)
X, Y, Z             populations, also called states in S^(µ)
S^(1)_opt           the set of individuals that are optimal
S^(1)_non           the set of individuals that are not optimal
S^(µ)_opt           the set of populations that contain at least one optimal individual
S^(µ)_non           the set of populations that contain no optimal individual
S^(µ)_same(x)       the set of populations whose best individual is x
S^(µ)_high(x)       the set of populations whose best individual's fitness is > f(x)
S^(µ)_bridge(x)     the set of populations that contain a bridgeable point of x
P(X, S^(µ)_set(x))  the probability of going from X to the set S^(µ)_set(x)
x_ρ                 arg max P(x, x) among all non-optimal states x ∈ S^(1)_non
Φ_t                 the population at the t-th generation
Φ_{t+1/2}           the offspring population of Φ_t after mutation
q_t(X)              the probability that Φ_t = X
q_t                 the vector of probabilities of Φ_t over all non-optimal states
m(X)                the expected number of generations needed to find an optimal solution when starting from X
Q_{x,x}             the transition probability submatrix within the set S^(µ)_same(x)

Table 1: Notation

The discussion in this subsection follows the general analytic framework of absorbing Markov chains for analysing EAs initiated by He and Yao [5]. According to the stopping criterion, an EA halts once an optimal solution is found. So if Φ_t = X is an optimal state, we let Φ_{t+1} = Φ_{t+2} = ··· = X for all future states. Thus the sequence {Φ_t, t = 0, 1, ···} can be modelled by an absorbing Markov chain. Let P be its transition matrix, with entries

P(X, Y) = P(Φ_{t+1} = Y | Φ_t = X), X, Y ∈ S.

Individuals or populations are called states when we speak about the corresponding Markov chains.
For the Markov chains modelling convergent EAs, an optimal state is always an absorbing state, while a non-optimal state is always a transient state.¹ The transition probability of going from a state X to a set S_set is denoted by

P(X, S_set) = P(Φ_{t+1} ∈ S_set | Φ_t = X).

The transition matrix P of an absorbing Markov chain can be written in the following canonical form:

P = [ I  O
      ∗  Q ],   (3)

where I is the identity matrix indicating transitions within the optimal set, and O denotes the zero matrix (representing the impossible transitions from optimal to non-optimal states). The matrix Q describes transitions within the non-optimal states, and the part ∗ represents transitions from the non-optimal states to the optimal ones.

Perhaps the most elegant part of the theory of absorbing Markov chains revolves around the notion of a fundamental matrix.

Definition 1. [29, Definition 11.3] The matrix N = (I − Q)^{−1} is called the fundamental matrix; its entry N(X, Y) gives the expected number of visits to a transient state Y starting from a transient state X before absorption.

The expected number of visits has a direct link to the expected number of generations needed to encounter an optimal solution for the first time.

¹An absorbing Markov chain is a Markov chain in which, starting from every state, the chain can reach an absorbing state. An absorbing state is a state from which it is impossible to leave [29, p. 416].
Lemma 1. [29, Theorem 11.5] Let m(X) denote the expected number of generations needed to encounter an optimal solution for the first time when starting from a transient state X. Then

m(X) = Σ_{Y ∈ S_non} N(X, Y).   (4)

In this paper we consider two special values of the expected number of generations it takes to reach an optimal solution. The first is the maximum value of the expected number of generations, given as

max{m(X); X ∈ S_non} = max{Σ_{Y ∈ S_non} N(X, Y); X ∈ S_non}.   (5)

This value is the supremum of the expected number of generations it takes to reach an optimal solution among all possible initializations.

The second is the average value of the expected number of generations over all of the transient states, given as

(1/|S_non|) Σ_{X ∈ S_non} m(X) = (1/|S_non|) Σ_{X,Y ∈ S_non} N(X, Y),   (6)

where |S_non| denotes the cardinality of the set of transient states. This value is achieved when the initial population is chosen uniformly at random from the set of all transient states. Analogously, an alternative notion of the expected number of generations it takes to encounter an optimal individual for the first time, when any population (not necessarily one containing no optimal individual) is selected uniformly at random, is given as

(1/|S|) Σ_{X ∈ S} m(X).   (7)

In this subsection, we introduce the notion of the average convergence rate. The convergence rate of an EA measures how fast Φ_t converges to the optimal set per generation. It is formally defined via the conditional probability of an EA converging to the optimal set, that is,

P(Φ_t ∈ S_opt | Φ_{t−1} ∈ S_non) = 1 − P(Φ_t ∈ S_non | Φ_{t−1} ∈ S_non).
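On a small absorbing chain, the fundamental matrix and the quantities in Eqs. (4)-(6) can be computed directly. The 3-state transition matrix below is an illustrative assumption, not an example from the paper:

```python
import numpy as np

# Hypothetical absorbing chain: state 0 is optimal (absorbing),
# states 1 and 2 are transient.  Rows sum to 1.
P = np.array([
    [1.0, 0.0, 0.0],   # optimal state: absorbing
    [0.3, 0.5, 0.2],   # transient state 1
    [0.1, 0.4, 0.5],   # transient state 2
])
Q = P[1:, 1:]                        # transitions within the non-optimal states
N = np.linalg.inv(np.eye(2) - Q)     # fundamental matrix N = (I - Q)^{-1}

m = N.sum(axis=1)                    # Eq. (4): m(X) = sum over Y of N(X, Y)
max_m = m.max()                      # Eq. (5): worst-case expected generations
avg_m = m.mean()                     # Eq. (6): average over the transient states
```

Each row sum of N is the expected number of generations to absorption from that transient state, which is exactly the content of Lemma 1.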
We denote the probability that Φ_t is at a transient state X by

q_t(X) = P(Φ_t = X).

Write all transient states in vector form: (X_1, X_2, ···)^T. Then the vector q_t = (q_t(X_1), q_t(X_2), ···)^T collects the probabilities of Φ_t over all transient states. The corresponding Markov chain is then represented by a matrix iteration:

q_{t+1}^T = q_t^T Q, or q_{t+1} = Q^T q_t.   (8)

Since the EA is initialized at random so that every possible population is selected with a positive probability (for instance, uniformly at random), each state can be chosen as an initial state with a positive probability. This means that q_0(X) > 0 for all X ∈ S_non, which we denote by q_0 > 0. The probability of Φ_t being in the transient set equals the 1-norm of q_t:

‖q_t‖ = P(Φ_t ∈ S_non).

Since the EA is initialized at random, ‖q_0‖ = P(Φ_0 ∈ S_non) > 0. Noting that P(Φ_{t−1} ∈ S_opt, Φ_t ∈ S_non) = 0, the conditional probability of staying within the non-optimal set at the t-th generation is

P(Φ_t ∈ S_non | Φ_{t−1} ∈ S_non)
= P(Φ_{t−1} ∈ S_non, Φ_t ∈ S_non) / P(Φ_{t−1} ∈ S_non)
= [P(Φ_{t−1} ∈ S_opt, Φ_t ∈ S_non) + P(Φ_{t−1} ∈ S_non, Φ_t ∈ S_non)] / P(Φ_{t−1} ∈ S_non)
= P(Φ_t ∈ S_non) / P(Φ_{t−1} ∈ S_non)
= ‖q_t‖ / ‖q_{t−1}‖.

Thus the geometric mean of the conditional probabilities of staying within the non-optimal set for t generations is

( ∏_{s=1}^{t} P(Φ_s ∈ S_non | Φ_{s−1} ∈ S_non) )^{1/t} = ( ‖q_t‖ / ‖q_0‖ )^{1/t}.   (9)

Next we define the average convergence rate for t generations based on the geometric mean above.

Definition 2.
The average rate of convergence to the optimal set for t generations is

1 − ( ∏_{s=1}^{t} P(Φ_s ∈ S_non | Φ_{s−1} ∈ S_non) )^{1/t}.

Now we consider the limit of this rate as t increases towards +∞.

Lemma 2.
The average rate of convergence to the optimal set for t generations satisfies

lim_{t→+∞} [ 1 − ( ∏_{s=1}^{t} P(Φ_s ∈ S_non | Φ_{s−1} ∈ S_non) )^{1/t} ] = 1 − ρ(Q).

Proof. (1) From the matrix iteration q_t = Q^T q_{t−1}, we get¹

‖q_t‖ ≤ ‖(Q^T)^t‖ · ‖q_0‖.

Since the EA is initialized at random, q_0 > 0, and the average rate of convergence to the optimal set after t iterations satisfies

1 − ( ‖q_t‖ / ‖q_0‖ )^{1/t} ≥ 1 − ‖(Q^T)^t‖^{1/t}.

According to Gelfand's spectral radius formula [30, p. 619], as t → +∞,

1 − lim_{t→+∞} ( ‖q_t‖ / ‖q_0‖ )^{1/t} ≥ 1 − lim_{t→+∞} ‖(Q^T)^t‖^{1/t} = 1 − ρ(Q^T) = 1 − ρ(Q).   (10)

(2) Since Q ≥ 0, according to the Perron-Frobenius theorems [30, p. 670], ρ(Q) is an eigenvalue of Q (and also of Q^T) with a corresponding eigenvector v ≥ 0; in particular, ρ(Q) v = Q^T v. Let min q_0 denote the minimum value of all the entries of the vector q_0. Since q_0 > 0, min q_0 > 0. Scale v so that ‖v‖_∞ (= max_i |v_i|) = min q_0, and split q_0 into two parts, q_0 = v + w, where w ≥ 0. Thus, since w ≥ 0 and Q ≥ 0, we deduce that

q_t = Q^T q_{t−1} = (Q^T)^t q_0 = (Q^T)^t (v + w) ≥ (Q^T)^t v = (ρ(Q))^t v.

It follows that

( ‖q_t‖ / ‖q_0‖ )^{1/t} ≥ ρ(Q) ( ‖v‖ / ‖q_0‖ )^{1/t} → ρ(Q)

as t → +∞. This inequality is equivalent to

1 − lim_{t→+∞} ( ‖q_t‖ / ‖q_0‖ )^{1/t} ≤ 1 − ρ(Q).

The desired conclusion follows by combining the inequality above with Inequality (10) and Equality (9).

¹v represents a column vector, and the transpose notation v^T represents it as a row vector. The vector v is also denoted by [v(X)] and its entry v(X) by [v]_X. For a vector v, ‖v‖ = Σ_{i=1}^{n} |v_i|. For a square matrix A = [A_ij], ‖A‖ = max_{1≤j≤n} Σ_{i=1}^{n} |A_ij|.
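Lemma 2 can be checked numerically: iterating Eq. (8) and evaluating the rate of Definition 2 for a large t should approach 1 − ρ(Q). The 2×2 matrix Q below is an assumed toy example, not one analysed in the paper:

```python
import numpy as np

# Assumed transient-transition submatrix Q of an absorbing chain.
Q = np.array([
    [0.5, 0.2],
    [0.4, 0.5],
])
rho = max(abs(np.linalg.eigvals(Q)))     # spectral radius rho(Q)

q0 = np.array([0.5, 0.5])                # q_0 > 0: randomized initialization
qt = q0.copy()
t = 200
for _ in range(t):
    qt = Q.T @ qt                        # Eq. (8): q_{t+1} = Q^T q_t

# Average convergence rate for t generations (Definition 2 via Eq. (9)):
rate_t = 1 - (qt.sum() / q0.sum()) ** (1 / t)
```

For this Q the rate after 200 generations already agrees with 1 − ρ(Q) to well within one percent, illustrating the limit in Lemma 2.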
Definition 3.
The average rate of convergence to the optimal set is

1 − lim_{t→+∞} ( ∏_{s=1}^{t} P(Φ_s ∈ S_non | Φ_{s−1} ∈ S_non) )^{1/t} = 1 − ρ(Q).

Here the "average" is the geometric mean taken over all generations under the condition of randomized initialization.¹

¹Gelfand's spectral radius formula says that for any induced matrix norm ‖A‖, the spectral radius satisfies ρ(A) = lim_{t→∞} ‖A^t‖^{1/t} [30, p. 619]. For a vector v, ‖v‖_∞ = max_{i=1,···,n} |v_i|. The rate above is similar to another average rate of convergence based on the logarithmic mean [31, p. 73], −ln ρ(Q); the difference between the two notions of rate is not particularly significant [32].

4 The Spectral Radius of the Fundamental Matrix

Given a (1+1) EA and a (µ+µ) EA (where µ ≥ 2), each has its own fundamental matrix N, and 1/ρ(N) equals the average convergence rate. This can be seen from the following lemma.

Lemma 3.
The spectral radii of the transition probability submatrix Q and the fundamental matrix N are related as follows:

ρ(N) = (1 − ρ(Q))^{−1}.   (11)

Proof. From the definition of the fundamental matrix, it follows that λ is an eigenvalue of Q if and only if (1 − λ)^{−1} is an eigenvalue of N. Since Q is non-negative, according to the Perron-Frobenius theorems [30, p. 670], ρ(Q) is an eigenvalue of Q such that ρ(Q) ≥ |λ| for any eigenvalue λ of Q. On the other hand, (1 − ρ(Q))^{−1} is an eigenvalue of N and satisfies

1/(1 − ρ(Q)) ≥ 1/(1 − |λ|) ≥ |1/(1 − λ)|,

so that (1 − ρ(Q))^{−1} is the spectral radius of N.

Next, the lemma below shows that ρ(N) is a "max-min" value related to the expected number of visits to transient states.

Lemma 4.
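Lemma 3 is easy to verify numerically on a small matrix; the matrix Q below is an illustrative assumption:

```python
import numpy as np

# Assumed non-negative transient submatrix Q.
Q = np.array([
    [0.5, 0.2],
    [0.4, 0.5],
])
N = np.linalg.inv(np.eye(2) - Q)           # fundamental matrix N = (I - Q)^{-1}

rho_Q = max(abs(np.linalg.eigvals(Q)))
rho_N = max(abs(np.linalg.eigvals(N)))
# Lemma 3 / Eq. (11): rho(N) = (1 - rho(Q))^{-1}
```

The two spectral radii agree with Eq. (11) up to floating-point error.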
The spectral radius of the fundamental matrix equals

ρ(N) = max_q min_{Y: q(Y) ≠ 0} [ Σ_{X ∈ S_non} N(Y, X) q(X) ] / q(Y),   (12)

where q(X) = P(Φ_0 = X) is the probability that the initial population Φ_0 is at the state X, and N(Y, X) is the expected number of visits to the state X starting from the state Y.

Proof. The lemma is a direct application of the Collatz-Wielandt formula¹ to the non-negative matrix N:

ρ(N) = max_q min_{Y: q(Y) ≠ 0} [Nq]_Y / [q]_Y = max_q min_{Y: q(Y) ≠ 0} Σ_{X ∈ S_non} N(Y, X) q(X) / q(Y),

which is the conclusion.

Finally, the lemma below establishes lower and upper bounds on the spectral radius of the fundamental matrix.

Lemma 5.
The spectral radius of the fundamental matrix satisfies the following inequality:

min_{X ∈ S_non} m(X) ≤ ρ(N) ≤ max_{X ∈ S_non} m(X).   (13)

¹The "max-min" version of the Collatz-Wielandt formula states that for a non-negative square matrix A = [A_ij], the spectral radius is ρ(A) = max_{x ∈ N} g(x), where g(x) = min_{1 ≤ i ≤ n, x_i ≠ 0} [Ax]_i / [x]_i and N = {x : x ≥ 0 with x ≠ 0} [30, p. 670]. A "min-max" version of the Collatz-Wielandt formula also exists and is applicable too.

Proof. The lemma is a direct consequence of the following fact: given any n × n non-negative matrix A = [a_ij], its spectral radius satisfies the inequalities

min_i Σ_{j=1}^{n} a_ij ≤ ρ(A) ≤ max_i Σ_{j=1}^{n} a_ij.

Substituting N in place of A yields the desired conclusion, since by Lemma 1 the row sums of N equal the values m(X).

Matrix norms can be used as a measure of the expected number of visits. The ∞-norm of the fundamental matrix is given as

‖N‖_∞ = max_{X ∈ S_non} Σ_{Y ∈ S_non} N(X, Y).   (14)

Lemma 6.
The ∞-norm of the fundamental matrix equals

‖N‖_∞ = max_{X ∈ S_non} m(X).

Proof. This follows immediately from the definition of the matrix ∞-norm and Lemma 1.

The definition and the lemma above provide us with two equivalent interpretations of ‖N‖_∞. ‖N‖_∞ is the maximal value of the expected number of visits to the set of transient states among all possible initializations. Equivalently, ‖N‖_∞ is the maximal value of the expected number of generations to reach the optimal set among all possible starting transient states.

The a-norm of the fundamental matrix is defined as

‖N‖_a = (1/|S_non|) Σ_{X,Y ∈ S_non} N(X, Y).   (15)

Lemma 7.
The a-norm of the fundamental matrix is alternatively described as follows:

‖N‖_a = (1/|S_non|) Σ_{X ∈ S_non} m(X).   (16)

Proof. This is an immediate consequence of the definition of the matrix a-norm and Lemma 1.

The definition and the lemma above reveal the following two equivalent meanings of ‖N‖_a. ‖N‖_a is the average value of the expected number of visits to the set of transient states over all possible initial transient states. Equivalently, ‖N‖_a is the average value of the expected number of generations it takes to reach the optimal set over all possible initial transient states.

Given a (1+1) EA and a (µ+µ) EA (where µ ≥ 2) that exploit an identical mutation operator to optimize the same fitness function, population scalability is measured by the ratio between their performances. As discussed in the previous sections, there are different approaches to evaluate the performance of an EA and, hence, several ways to measure population scalability.
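Before turning to the scalability definitions, the norm and spectral-radius quantities of Lemmas 5-7 can be checked together on a small example; the matrix Q below is an assumed toy instance:

```python
import numpy as np

# Illustrative transient submatrix Q of an absorbing chain (assumed example).
Q = np.array([
    [0.5, 0.2],
    [0.4, 0.5],
])
N = np.linalg.inv(np.eye(2) - Q)        # fundamental matrix
m = N.sum(axis=1)                       # m(X), by Lemma 1

inf_norm = np.abs(N).sum(axis=1).max()  # ||N||_inf, Eq. (14): equals max m(X) (Lemma 6)
a_norm = np.abs(N).sum() / N.shape[0]   # ||N||_a,  Eq. (15): equals average m(X) (Lemma 7)
rho_N = max(abs(np.linalg.eigvals(N)))  # spectral radius of N
# Lemma 5: min m(X) <= rho(N) <= max m(X)
```

For this Q the spectral radius indeed falls between the smallest and largest row sums of N, as Lemma 5 guarantees for any non-negative matrix.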
Definition 4.
Population scalability under the spectral radius of the fundamental matrix is

ρ-scalability(µ) = ρ(N^(1)) / ρ(N^(µ)) = (1 − ρ(Q^(µ))) / (1 − ρ(Q^(1)))   (17)
                 = (average convergence rate of the (µ+µ) EA) / (average convergence rate of the (1+1) EA).   (18)

Definition 5. Population scalability under the ∞-norm of the fundamental matrix is

∞-scalability(µ) = ‖N^(1)‖_∞ / ‖N^(µ)‖_∞   (19)
                 = (maximum value of the expected number of generations of the (1+1) EA) / (maximum value of the expected number of generations of the (µ+µ) EA).   (20)

¹The fact used in the proof of Lemma 5 is given in [30, Exercise 8.2.7]. The result in Exercise 8.2.7 is stated only for positive matrices, yet an identical argument that replaces the Collatz-Wielandt formula for positive matrices by the Collatz-Wielandt formula for non-negative matrices shows that the same fact holds for all non-negative matrices.
²For a square matrix A = [A_ij], ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^{n} |A_ij| and ‖A‖_a = (Σ_{i,j=1}^{n} |A_ij|)/n.

Definition 6.
Population scalability under the a-norm of the fundamental matrix is

a-scalability(µ) = ‖N^(1)‖_a / ‖N^(µ)‖_a   (21)
                = (average value of the expected number of generations of the (1+1) EA) / (average value of the expected number of generations of the (µ+µ) EA),   (22)

where the average is taken over all of the transient states, excluding the absorbing state(s). If the average is instead taken over all states, an alternative definition is given by

â-scalability(µ) = (|S^(1)_non| / |S^(µ)_non|) · (‖N^(1)‖_a / ‖N^(µ)‖_a).   (23)

An essential part of the definitions above is that both EAs must adopt identical mutation operators; this ensures that the comparison is meaningful. Nonetheless, it is impossible for the selection operators to be identical: even if they are of the same type, for example roulette wheel selection, the conditional probabilities determining the actual selection operators are never identical under distinct population sizes.

The following questions are fundamental when studying population scalability.

1. As the population size increases from 1 to µ (where µ ≥ 2), is scalability(µ) > 1? If the population scalability is not greater than 1, then we say that the (µ+µ) EA has no scalability with respect to the (1+1) EA.

2. As the population size increases from 1 to µ (where µ ≥ 2), is scalability(µ) > µ? If the population scalability is greater than µ, then we say that the (µ+µ) EA has superlinear scalability with respect to the (1+1) EA.

Population scalability is different from the relationship between the performance of an EA and its population size discussed in previous references such as [13], where the comparison of the two EAs is carried out in terms of the big O notation. The difference is clearly demonstrated through the following question: is

(maximum value of the expected number of generations for the (1+1) EA) / (maximum value of the expected number of generations for the (µ+µ) EA) = O(1)?

Here O(1) is big O notation.
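As a concrete illustration of Definition 4, ρ-scalability can be computed exactly for a small assumed landscape: three fitness levels 0 < 1 < 2 (state 2 optimal), a hypothetical global mutation matrix, and truncation-style elitist selection for the (2+2) EA. None of these choices come from the paper; they only make the computation concrete.

```python
import numpy as np
from itertools import product

# Assumed landscape: individuals 0 < 1 < 2 ranked by fitness, state 2 optimal.
# Hypothetical mutation matrix P_M (every row can reach state 2).
PM = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# (1+1) EA with elitist selection: transient states {0, 1}.
Q1 = np.array([
    [PM[0, 0], PM[0, 1]],
    [0.0,      PM[1, 0] + PM[1, 1]],
])

# (2+2) EA with truncation (elitist) selection: transient populations.
pops = [(0, 0), (0, 1), (1, 1)]
Q2 = np.zeros((3, 3))
for r, (i, j) in enumerate(pops):
    for c1, c2 in product(range(3), repeat=2):
        p = PM[i, c1] * PM[j, c2]                 # both parents mutate independently
        nxt = tuple(sorted((i, j, c1, c2))[-2:])  # keep the best 2 of 4
        if 2 not in nxt:                          # next population still transient
            Q2[r, pops.index(nxt)] += p

rate1 = 1 - max(abs(np.linalg.eigvals(Q1)))       # average convergence rates
rate2 = 1 - max(abs(np.linalg.eigvals(Q2)))
rho_scalability = rate2 / rate1                   # Definition 4, Eqs. (17)-(18)
```

With this particular mutation matrix the ratio comes out strictly between 1 and µ = 2: the (2+2) EA scales, but not superlinearly.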
Nonetheless, there is a drawback in using the big O notation when studying population scalability. For example, O(1) does not distinguish between the case when the expected number of generations the (1 + 1) EA takes to reach an optimum for the first time is 100 times that the (µ + µ) EA takes, and the case when it is 1/100 of that the (µ + µ) EA takes. In this sense, population scalability analysis is different from the work in Jansen et al. [13].

The notion of population scalability is similar to that of the speedup widely used when analysing parallel algorithms. Nonetheless, population scalability does not depend on the number of parallel computing processors. There is a link between superlinear population scalability and superlinear speedup in parallel EAs: if each individual is assigned to a processor, then an EA turns into a parallel EA, and under this circumstance superlinear scalability implies superlinear speedup if the communication cost is ignored. An interesting question in parallel EAs is when and how the superlinear speedup phenomenon happens [33, 34, 35].

There is an essential difference between the notion of population scalability and that of the No Free Lunch Theorems [36]. Population scalability compares the performance of two EAs that exploit identical genetic operators but different population sizes to optimize the same fitness function, while the No Free Lunch Theorems compare the average performance of two EAs over all possible fitness functions.

This section focuses on investigating elitist EAs that adopt global mutation and elitist selection operators. The corresponding definitions appear below.
Definition 7.
A mutation operator is called global if any individual can reach the optimal set via mutation within a single iteration.
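As a concrete (and standard) illustration, bitwise mutation with flipping probability 1/n is global: from any string, every target string, and in particular every optimal one, is produced in one step with positive probability. A quick check, on a setup of our own choosing:

```python
# Bitwise mutation with rate 1/n assigns positive one-step probability to
# every target string, hence it is a "global" mutation operator.

def transition_prob(x, y, n):
    h = sum(a != b for a, b in zip(x, y))            # Hamming distance
    return (1.0 / n) ** h * (1.0 - 1.0 / n) ** (n - h)

n = 4
space = [tuple((i >> k) & 1 for k in range(n)) for i in range(2 ** n)]
x = (0, 0, 0, 0)
probs = [transition_prob(x, y, n) for y in space]
assert min(probs) > 0                                # every state is reachable
assert abs(sum(probs) - 1.0) < 1e-12                 # a proper distribution
print(min(probs))  # (1/4)^4 = 0.00390625
```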
Definition 8.
A selection operator is called elitist if the best parent individual is replaced by the best child individual only when the best child individual is fitter. There is no restriction on selecting non-best individuals, and any selection strategy can be applied.
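A minimal sketch of one generation of a (µ + µ) EA obeying this rule follows; the bitwise mutation and the OneMax fitness are placeholder choices of ours, not prescribed by the paper:

```python
import random

# One (mu + mu) generation with an elitist replacement rule as in Definition 8.

def mutate(x, n):
    # flip each bit independently with probability 1/n (a global mutation)
    return tuple(b ^ (random.random() < 1.0 / n) for b in x)

def elitist_step(parents, fitness, n):
    children = [mutate(x, n) for x in parents]
    best_parent = max(parents, key=fitness)
    best_child = max(children, key=fitness)
    # the best parent survives unless a strictly fitter child appears ...
    survivor = best_child if fitness(best_child) > fitness(best_parent) else best_parent
    # ... while the remaining slots may be filled by any strategy
    rest = random.choices(children, k=len(parents) - 1)
    return [survivor] + rest

random.seed(0)
onemax = lambda x: sum(x)
pop = [(0,) * 8 for _ in range(4)]
for _ in range(50):
    pop = elitist_step(pop, onemax, n=8)
print(max(onemax(x) for x in pop))  # the best-so-far fitness never decreases
```

Whatever strategy fills the non-best slots, the population's best fitness is monotone non-decreasing, which is exactly the elitist selection property used below.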
Global mutation guarantees that the optimal set is reachable starting from any initial state, while elitist selection aims at maintaining the best solution found over time. An alternative elitist operator replaces the best parent individual by a child with better or equal fitness [5]; in the current paper we do not consider such a variant.

Let us now emphasize two useful properties of mutation and elitist selection. The first is called the mutation property. It compares the probability of going from a population to a higher fitness level with the probability of going from a single individual to a higher fitness level.
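In elementary terms the property is just the independence identity 1 − Π(1 − p_i) together with the union bound; with hypothetical per-individual probabilities p_i of reaching a higher fitness level:

```python
# With independent mutations, a population of mu individuals reaches a higher
# fitness level with probability 1 - prod(1 - p_i), which lies between
# max(p_i) and sum(p_i).  The values p_i below are hypothetical.
p = [0.10, 0.25, 0.05]
miss = 1.0
for pi in p:
    miss *= (1.0 - pi)
pop_prob = 1.0 - miss
assert max(p) <= pop_prob <= sum(p)
print(round(pop_prob, 6))  # → 0.35875
```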
Lemma 8.
Suppose X = (x_1, ..., x_µ) ∈ S^(µ)_non is a population with x_1 = x being one of the best individuals. Then the following mutation property holds for elitist EAs: for i = 1, ..., µ and µ ≥ 2,

P_M(X, S^(µ)_high(x)) ≥ P_M(x_i, S^(1)_high(x)),    (24)

P_M(X, S^(µ)_high(x)) ≤ Σ_{i=1}^{µ} P_M(x_i, S^(1)_high(x)),    (25)

where S^(µ)_high(x) denotes the set consisting of all populations whose best individual's fitness is higher than f(x). Furthermore, if the mutation operator is global, then the inequalities above are strict.

Proof. Notice that the event of going from X to a higher fitness level can be expressed as the event that at least one of the individuals x_i goes to a higher fitness level. Since mutation is performed independently on each individual, for µ ≥ 2 the probability of going from X to a higher fitness level is

P_M(X, S^(µ)_high(x)) = 1 − Π_{i=1}^{µ} (1 − P_M(x_i, S^(1)_high(x))),

implying that for i = 1, ..., µ,

P_M(X, S^(µ)_high(x)) ≥ P_M(x_i, S^(1)_high(x)).

The inequality (25) for µ ≥ 2,

P_M(X, S^(µ)_high(x)) ≤ Σ_{i=1}^{µ} P_M(x_i, S^(1)_high(x)),

follows from the fact that the probability of a union of events is always bounded above by the sum of the probabilities of the constituent events. Moreover, it is easy to see that the above inequalities are strict if the mutation operator is global.

Elitist selection ensures that the best individual in a population either enters a higher fitness level or remains unchanged, thereby never getting worse. This phenomenon is called the elitist selection property, and it can be reformulated as follows.

Lemma 9.
Given a population X whose best individual is x, the elitist selection property implies that

P(X, S^(µ)_same(x)) + P(X, S^(µ)_high(x)) = 1,    (26)

where S^(µ)_same(x) denotes the set consisting of all populations whose best individual is x.

First consider the (1 + 1) elitist EA. Arrange all states in S^(1)_non in the order of their fitness from high to low (individuals at the same fitness level may be arranged in any order), and write them in vector form (x_1, x_2, x_3, ...)^T, where f(x_1) ≥ f(x_2) ≥ f(x_3) ≥ ... .

Thanks to the elitist selection property discussed in the previous section, the best individual never enters a lower fitness level, meaning that for any distinct individuals x and y with f(y) ≤ f(x), the entry of the Markov transition matrix that stands for the probability of going from x to y satisfies P(x, y) = 0. It follows that the transition matrix Q^(1) is lower triangular and can be written in the following form:

Q^(1) = [ P(x_1, x_1)   0             0   ···
          P(x_2, x_1)   P(x_2, x_2)   0   ···
          ···                               ].    (27)

The following simple fact now follows naturally from the definition of eigenvalues and the spectral radius [30, p. 490].

Lemma 10.
Given the transition matrix Q^(1), let x_ρ be the state such that

x_ρ = arg max { P(x, x) : x ∈ S^(1)_non }.    (28)

Then the spectral radius satisfies

ρ(Q^(1)) = P(x_ρ, x_ρ).    (29)

Here P(x, x) is the probability of the Markov chain remaining in state x. The above lemma shows that the spectral radius of the Markov transition submatrix that models a (1 + 1) elitist EA (restricted to the non-optimal states) is the maximal probability of a non-optimal state remaining unchanged in the next generation.

Next we consider a (µ + µ) EA (where µ ≥ 2). Arrange all populations in S^(µ)_non in the order of the fitness of their best individual from high to low (populations with the same best individual are arranged together in an arbitrary order), and write them in vector form (X_1, X_2, ...)^T. If x_1, x_2, ... denote the corresponding best individuals, then their fitness decreases: f(x_1) ≥ f(x_2) ≥ ... .

Once again, thanks to the elitist selection property, the best individual in a population never revisits any state at a lower fitness level. Thus the probability of going from a population X (with best individual x) to a population Y that is in neither S^(µ)_same(x) nor S^(µ)_high(x) is 0. It follows that the matrix Q^(µ) is block lower triangular and can be written in the form

Q^(µ) = [ Q^(µ)_{x_1,x_1}   O                 O   ···
          Q^(µ)_{x_2,x_1}   Q^(µ)_{x_2,x_2}   O   ···
          ···                                      ],    (30)

where O denotes a zero matrix and Q^(µ)_{x,y} is the submatrix consisting of transition probabilities from the states in S^(µ)_same(x) to the states in S^(µ)_same(y). In particular, Q^(µ)_{x,x} is the submatrix consisting of transition probabilities within the set S^(µ)_same(x).

The following lemma extends Lemma 10 to the case µ ≥ 2.
Let Q^(µ)_{x,x} denote the diagonal block submatrix in (30) that represents transitions within the set S^(µ)_same(x). Then

ρ(Q^(µ)) = max_{x ∈ S^(1)_non} ρ(Q^(µ)_{x,x}).    (31)

Proof.
The proof is based on a simple fact [30, Exercise 7.1.4]: if A is a block lower triangular matrix of the form

A = [ A_{1,1}   O
      A_{2,1}   A_{2,2} ],    (32)

then λ is an eigenvalue of A if and only if λ is an eigenvalue of A_{1,1} or of A_{2,2}. Thus ρ(A) = max{ρ(A_{1,1}), ρ(A_{2,2})}. Letting A = Q^(µ) (see the matrix (30)) and applying this fact repeatedly yields

ρ(Q^(µ)) = max_{x ∈ S^(1)_non} ρ(Q^(µ)_{x,x}),

as claimed.

The following lemma provides lower and upper bounds on the spectral radius of the transition probability submatrix Q^(µ)_{x,x}.

Lemma 12.
The spectral radius of the transition submatrix Q^(µ)_{x,x} satisfies

min_{X ∈ S^(µ)_same(x)} P(X, S^(µ)_same(x)) ≤ ρ(Q^(µ)_{x,x}) ≤ max_{X ∈ S^(µ)_same(x)} P(X, S^(µ)_same(x)).    (33)

Proof.
The proof is the same as that of Lemma 5: substituting Q^(µ)_{x,x} in place of A yields the desired conclusion.

ρ-Scalability Always Happens for Elitist EAs Exploiting Global Mutation

An intuitive reason behind the use of population-based EAs is that a larger population size is likely to increase the convergence rate. The following theorem proves that this is, indeed, the case.
Theorem 1.
Suppose a (1 + 1) elitist EA and a (µ + µ) elitist EA (where µ ≥ 2) exploit an identical global mutation operator to maximize the same fitness function. Then ρ-scalability(µ) > 1.

Proof.
For the (1 + 1) elitist EA, let x_ρ ∈ S^(1)_non be an individual such that

ρ(Q^(1)) = P(x_ρ, x_ρ).

Likewise, for a (µ + µ) elitist EA, Lemma 11 says that

ρ(Q^(µ)) = max_{x ∈ S^(1)_non} ρ(Q^(µ)_{x,x}).

Since the set S^(1)_non is finite, there exists an x ∈ S^(1)_non such that

ρ(Q^(µ)) = ρ(Q^(µ)_{x,x}).    (34)

Consider the transition probability matrix Q^(µ)_{x,x} for such an x. According to Lemma 12, the spectral radius ρ(Q^(µ)_{x,x}) is bounded above as

ρ(Q^(µ)_{x,x}) ≤ max_{X ∈ S^(µ)_same(x)} P(X, S^(µ)_same(x)).

Choose a population X = (x_1, x_2, ..., x_µ) in the set S^(µ)_same(x), with x_1 = x, for which the above maximum is attained; then

ρ(Q^(µ)_{x,x}) ≤ P(X, S^(µ)_same(x)).

Combining the inequality above with the elitist selection property (26),

P(X, S^(µ)_high(x)) + P(X, S^(µ)_same(x)) = 1,

yields

ρ(Q^(µ)_{x,x}) ≤ 1 − P(X, S^(µ)_high(x)).    (35)

Now the global mutation property (Lemma 8, with strict inequalities) tells us that, for µ ≥ 2,

P(X, S^(µ)_high(x)) > P(x, S^(1)_high(x)) = 1 − P(x, x).

Recall that P(x_ρ, x_ρ) is the maximal self-transition probability, so that

P(X, S^(µ)_high(x)) > 1 − P(x_ρ, x_ρ) = 1 − ρ(Q^(1)).

Substituting this bound into Inequality (35) yields

ρ(Q^(µ)_{x,x}) < ρ(Q^(1)).

Recalling (34), ρ(Q^(µ)) = ρ(Q^(µ)_{x,x}), we deduce that ρ(Q^(µ)) < ρ(Q^(1)), so that ρ-scalability(µ) > 1, thereby establishing the desired conclusion.

∞-Scalability May Not Happen for Elitist EAs Using Global Mutation

Another intuitive reason behind the use of population-based EAs is that a larger population size is likely to shorten the expected number of generations. Unfortunately, sometimes this is wrong. The following example shows that increasing the population size increases, rather than reduces, the maximum of the expected number of generations for an EA to find an optimal solution.
Equivalently, ∞-scalability(µ) < 1 for µ ≥ 2 in the following example. The search space consists of five states x_1, ..., x_5 whose fitness values are given in Table 2.

state    x_1   x_2   x_3   x_4   x_5
fitness  5     4     3     2     1

Table 2: Fitness function.

Consider the following (µ + µ) EA (= (1 + µ) EA): the best individual is replicated µ times and each copy generates a child via mutation; the best individual is replaced only when a child is better than it. The mutation transition probabilities P(x, y), x, y = x_1, ..., x_5, are given in Table 3, where ǫ ≥ 0 (the role of ǫ will be discussed later). At ǫ = 0 the only transitions with positive probability are

P(x_2, x_1) = P(x_3, x_2) = P(x_4, x_1) = 1,   P(x_5, x_3) = P(x_5, x_4) = 0.5,

with x_1 absorbing; for ǫ > 0 every remaining entry receives a small probability of order ǫ, with the dominant entries reduced accordingly so that each row sums to one. When ǫ > 0, the mutation operator is global.

Table 3: Mutation transition probability matrix.

First, set ǫ = 0. The maximal value of the expected number of generations for the (1 + 1) EA to encounter the optimal individual x_1 is attained when the EA starts from x_5. According to the transition probabilities above, the EA reaches the optimal state in 2 generations along the road x_5 → x_4 → x_1 (with probability 0.5) and in 3 generations along the road x_5 → x_3 → x_2 → x_1 (also with probability 0.5), so that

m^(1)(x_5) = 2 · 0.5 + 3 · 0.5 = 2.5.

The maximal value of the expected number of generations for the (2 + 2) EA to reach a population containing the optimal individual x_1 is attained when the EA starts from (x_5, x_5). Because of elitist selection, the only possible offspring populations are (x_4, x_4) and (x_3, x_3). Since going from (x_5, x_5) to (x_4, x_4) happens only if both copies of x_5 mutate into x_4, the probability of this event is 0.5 · 0.5 = 0.25; consequently, the probability of going from (x_5, x_5) to (x_3, x_3) is 1 − 0.25 = 0.75. Thus the EA reaches the optimum along the road (x_5, x_5) → (x_4, x_4) → (x_1, x_1) with probability 0.25, while it does so along the road (x_5, x_5) → (x_3, x_3) → (x_2, x_2) → (x_1, x_1) with probability 0.75, so that

m^(2)(x_5, x_5) = 2 · 0.25 + 3 · 0.75 = 2.75.

This demonstrates explicitly that the maximal value of the expected number of generations that the (1 + 1) EA needs to reach an optimum is smaller than the one that the (2 + 2) EA needs, since

m^(1)(x_5) = 2.5 < m^(2)(x_5, x_5) = 2.75.    (36)

Furthermore, the reasoning above generalizes to the case µ > 2:

m^(µ)(x_5, ..., x_5) = 2 · 0.5^µ + 3 · (1 − 0.5^µ),

thereby demonstrating that m^(µ)(x_5, ..., x_5) is a strictly increasing function of the population size µ: increasing the population size also increases the maximal value of the expected number of generations.

Now observe that m^(1)(x_5) − m^(µ)(x_5, ..., x_5) is a continuous function of ǫ, so that for small enough ǫ > 0 Inequality (36), as well as the conclusion in the paragraph above, still holds. Moreover, the same continuity argument implies that elitist selection can be relaxed to certain non-elitist selection schemes (non-best individuals may replace the parent individual, but only with tiny probabilities), and all the same conclusions remain valid.

a-Scalability and â-Scalability May Not Happen for Elitist EAs Using Global Mutation

The following modification of the example in the previous subsection shows that increasing the population size may increase, rather than reduce, the average value of the expected number of generations, regardless of whether the average is taken over the set of all transient states or over the set of all possible states. Equivalently, a-scalability(µ) < 1 and â-scalability(µ) < 1 for µ ≥ 2. The single worst state x_5 is replaced by a family of states x_i (i ≥ 5), all at the lowest fitness level, as shown in Table 4.

state    x_1   x_2   x_3   x_4   x_i (i ≥ 5)
fitness  5     4     3     2     1

Table 4: Fitness function.

Consider the same (µ + µ) EA (= (1 + µ) EA) as before: the best individual is replicated µ times and each copy generates a child via mutation; the best individual is replaced only when a child is better than it. The rows of the mutation transition matrix for x_1, ..., x_4 are as in Table 3, while each state x_i (i ≥ 5) behaves as x_5 did there, moving to x_3 or x_4 with probability 0.5 each at ǫ = 0.
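With ǫ = 0 the starting-state expectation in the ∞-scalability counterexample has the closed form derived above; a quick check (numbers taken from the example):

```python
# From the all-x5 start, the (mu + mu) EA takes the 2-generation road through
# x4 only when all mu copies mutate to x4 (probability 0.5 ** mu); otherwise
# it takes the 3-generation road through x3 and x2.

def m(mu):
    p_short = 0.5 ** mu
    return 2 * p_short + 3 * (1 - p_short)

print(m(1))  # 2.5  for the (1 + 1) EA
print(m(2))  # 2.75 for the (2 + 2) EA
assert all(m(k) < m(k + 1) for k in range(1, 20))  # strictly increasing in mu
```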
Table 5 (the mutation transition probability matrix, extending Table 3 with the rows for the states x_i, i ≥ 5) completes the description, where ǫ ≥ 0. Again, ǫ can be chosen sufficiently small, and elitist selection can be relaxed by the same type of continuity argument as in the previous example; either way, the (1 + 1) EA outperforms the (2 + 2), (3 + 3) and (4 + 4) EAs.

Sufficient and Necessary Conditions for Superlinear ρ-Scalability

A Sufficient and Necessary Condition for Superlinear ρ-Scalability to Take Place

In the current subsection we present a rather general sufficient and necessary condition for superlinear scalability to take place that applies to both elitist and non-elitist EAs. The condition is based on the concept of a "road". Intuitively, a road is a transition path between two states X and Y. A rigorous definition appears below [37].

Definition 9.
Given two states
X, Y ∈ S^(µ), if there exist k + 1 states X = X_0 → X_1 → ··· → X_k = Y such that

P(X_0, X_1) ··· P(X_{k−1}, X_k) > 0,

then {X_0, ..., X_k} is called a road from X to Y. We also say that k is the length of the road. We write road(X, Y, k) to denote the set of all roads from X to Y having length k. Let P(road(X, S^(µ)_opt, k)) denote the probability of going from X to the set S^(µ)_opt via roads of length k.

A general sufficient and necessary condition for superlinear scalability to take place appears in the following theorem. The theorem is largely based on the classical Gelfand spectral radius formula.

Theorem 2.
Suppose we are given a (1+1)
EA and a (µ + µ) EA (where µ ≥ 2) that exploit an identical mutation operator to maximize the same fitness function. For the (µ + µ) EA, superlinear scalability happens if and only if there exists some k > 0 such that, for all X ∈ S^(µ)_non,

P(road(X, S^(µ)_opt, k)) > 1 − (1 − µ(1 − ρ(Q^(1))))^k.    (37)

Proof. (1) The proof that the condition is sufficient. Suppose Inequality (37) holds for all populations X in the set S^(µ)_non. It follows from the assumption that

max_{X ∈ S^(µ)_non} P(Φ_k ∈ S^(µ)_non | Φ_0 = X) < (1 − µ(1 − ρ(Q^(1))))^k.

Rewriting this in terms of the ∞-norm, we obtain

‖(Q^(µ))^k‖_∞ < (1 − µ(1 − ρ(Q^(1))))^k.    (38)

Since the spectral radius of a matrix is not bigger than its maximum norm [30, p. 619], ρ((Q^(µ))^k) ≤ ‖(Q^(µ))^k‖_∞, so that

ρ(Q^(µ)) = (ρ((Q^(µ))^k))^{1/k} ≤ (‖(Q^(µ))^k‖_∞)^{1/k}.

Combining Inequality (38) with the above inequality yields

ρ(Q^(µ)) < 1 − µ(1 − ρ(Q^(1))),

that is,

(1 − ρ(Q^(µ))) / (1 − ρ(Q^(1))) > µ.

This means that superlinear scalability takes place.

(2) The proof that the condition is necessary. Suppose Inequality (37) does not hold. This means that for any k >
0, there exists some X ∈ S^(µ)_non such that

P(road(X, S^(µ)_opt, k)) ≤ 1 − (1 − µ(1 − ρ(Q^(1))))^k.    (39)

It follows then that for any k > 0,

P(Φ_k ∈ S^(µ)_opt | Φ_0 = X) ≤ 1 − (1 − µ(1 − ρ(Q^(1))))^k,
P(Φ_k ∈ S^(µ)_non | Φ_0 = X) ≥ (1 − µ(1 − ρ(Q^(1))))^k,

and this, in turn, implies that

max_{Y ∈ S^(µ)_non} P(Φ_k ∈ S^(µ)_non | Φ_0 = Y) ≥ (1 − µ(1 − ρ(Q^(1))))^k.

Rewriting the inequality above in terms of the ∞-norm, we deduce that

‖(Q^(µ))^k‖_∞ ≥ (1 − µ(1 − ρ(Q^(1))))^k.

Taking the limit as k → +∞ and applying Gelfand's spectral radius formula,

ρ(Q^(µ)) = lim_{k→∞} (‖(Q^(µ))^k‖_∞)^{1/k},

we obtain

ρ(Q^(µ)) ≥ 1 − µ(1 − ρ(Q^(1))),

that is,

(1 − ρ(Q^(µ))) / (1 − ρ(Q^(1))) ≤ µ.

This means that no superlinear scalability takes place.

Sufficient and Necessary Condition for Superlinear ρ-Scalability to Happen for Elitist EAs

The sufficient and necessary condition for superlinear scalability established in Theorem 2 can be reformulated in a more explicit fashion when dealing with elitist EAs. We call this reformulation "road through bridge". A detailed analysis is provided in the current subsection.
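Gelfand's formula, the engine of the necessity argument above, can be watched converging on a small example (the 2 × 2 matrix is our toy choice; being triangular, its spectral radius 0.6 is read off the diagonal, as in Lemma 10):

```python
# Numerical illustration of Gelfand's spectral radius formula
# rho(Q) = lim_{k -> inf} ||Q^k||_inf^(1/k).

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm_inf(A):
    return max(sum(abs(v) for v in row) for row in A)

Q = [[0.2, 0.0],
     [0.5, 0.6]]
P, estimates = Q, []
for k in range(1, 400):
    estimates.append(norm_inf(P) ** (1.0 / k))
    P = matmul(P, Q)

print(estimates[0], estimates[-1])  # the estimates drift from 1.1 towards 0.6
```

The first estimate is just ‖Q‖_∞ = 1.1, an overestimate; as k grows the k-th root of the norm of Q^k approaches the true spectral radius 0.6 from above.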
Definition 10.
An individual y is called a bridgeable point of an individual x if y satisfies the following conditions:

1. the fitness of x is not lower than that of y: f(x) ≥ f(y);

2. the probability of going from x to the set S^(1)_high(x) via mutation is not larger than that of going from y to the same set S^(1)_high(x):

P_M(x, S^(1)_high(x)) ≤ P_M(y, S^(1)_high(x)).    (40)

The term "bridgeable point" is motivated by the following intuition: y may serve as a "bridge" for x to step towards a higher fitness level. To achieve superlinear scalability, it is important for elitist EAs to go through some bridgeable point. Intuitively, there are two types of roads going from a state towards a higher fitness level: one goes from the current fitness level directly towards the higher fitness level; the other passes through some bridgeable point before reaching a higher fitness level.

Given a population X whose best individual is x and a population Y in the set S^(µ)_high(x), the roads from X to Y can be classified into two categories:

• Road through bridge {X_0 = X, X_1, ..., X_{k−1}, X_k = Y}: at least one of the intermediate populations X_1, ..., X_{k−1} contains a bridgeable point of x.

• Road over gap {X_0 = X, X_1, ..., X_{k−1}, X_k = Y}: none of the intermediate populations X_1, ..., X_{k−1} contains a bridgeable point of x.

Let P(road(X, S^(µ)_high(x), k) through bridge) denote the probability of going from X to the set S^(µ)_high(x) via roads through bridge of length k. Likewise, let P(road(X, S^(µ)_high(x), k) over gap) denote the probability of going from X to the set S^(µ)_high(x) via roads over gap of length k.

The following theorem provides a sufficient and necessary condition for superlinear scalability to occur in the case of elitist EAs in terms of the road through bridge.

Theorem 3.
Suppose we are given a (1 + 1) elitist EA and a (µ + µ) elitist EA (where µ ≥ 2) that exploit an identical mutation operator to maximize the same fitness function. Consider an individual x_ρ such that

x_ρ = arg max { P(x, x) : x ∈ S^(1)_non }.

For the (µ + µ) EA, superlinear ρ-scalability happens if and only if the following road through bridge condition holds: there exists some k > 0 such that, for any x ∈ S^(1)_non and any X ∈ S^(µ)_same(x),

P(road(X, S^(µ)_high(x), k) over gap) + P(road(X, S^(µ)_high(x), k) through bridge) > 1 − (1 − µ(1 − P(x_ρ, x_ρ)))^k.    (41)

Proof.
For the (1 + 1) elitist EA, Lemma 10 tells us that ρ(Q^(1)) = P(x_ρ, x_ρ). For the (µ + µ) EA, from Lemma 11 it follows that

ρ(Q^(µ)) = max_{x ∈ S^(1)_non} ρ(Q^(µ)_{x,x}).

Since S^(1)_non is finite, there is some x ∈ S^(1)_non such that

ρ(Q^(µ)) = ρ(Q^(µ)_{x,x}).    (42)

Thanks to elitist selection, Inequality (41) is equivalent to saying that for any X ∈ S^(µ)_same(x),

P(road(X, S^(µ)_high(x), k)) > 1 − (1 − µ(1 − ρ(Q^(1))))^k.

The desired conclusion now follows directly from Theorem 2 applied to the Markov transition submatrix Q^(µ)_{x,x}, replacing S^(µ)_non and S^(µ)_opt in Theorem 2 by S^(µ)_same(x) and S^(µ)_high(x), respectively.

A Necessary Condition for Superlinear ρ-Scalability to Happen for Elitist EAs

The following theorem informs us further that the existence of a road through bridge is a necessary condition for superlinear scalability to take place. Let S^(µ)_bridge(x) denote the set of all populations which contain a bridgeable point of x.

Theorem 4.
Suppose we are given a (1 + 1) elitist EA and a (µ + µ) elitist EA (where µ ≥ 2) that exploit an identical mutation operator to maximize the same fitness function. Consider an individual x_ρ such that

x_ρ = arg max { P(x, x) : x ∈ S^(1)_non },

and let X_ρ = (x_ρ, ..., x_ρ). If for some population size µ ≥ 2, and for any k > 0,

P(road(X_ρ, S^(µ)_high(x_ρ), k) through bridge) = 0,    (43)

then no superlinear ρ-scalability ever takes place for such a µ.

Proof. For the (1 + 1) elitist EA, from Lemma 10, ρ(Q^(1)) = P(x_ρ, x_ρ). For a (µ + µ) elitist EA (where µ ≥ 2), consider the submatrix Q^(µ)_{x_ρ,x_ρ}. Split the set S^(µ)_same(x_ρ) into two subsets: S^(µ)_bridge(x_ρ) and S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ). From the condition of the theorem, for any k > 0,

P(road(X_ρ, S^(µ)_high(x_ρ), k) through bridge) = 0,

so that the probability of going from any state in the set S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ) to the set S^(µ)_bridge(x_ρ) is 0. Hence the matrix Q^(µ)_{x_ρ,x_ρ} is reducible, and we write it in the following form:

[ Q̂^(µ)_{x_ρ,x_ρ}   O
  *                 ** ],

where Q̂^(µ)_{x_ρ,x_ρ} represents the transition probability submatrix within the set S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ), the part labelled * represents transition probabilities from the set S^(µ)_bridge(x_ρ) to the set S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ), the part labelled ** stands for the transition probabilities within the set S^(µ)_bridge(x_ρ), and O denotes a zero matrix.

For any population X = (x_1, ..., x_µ) in the set S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ), since none of its individuals is a bridgeable point of x_ρ, we have, for each individual x of X,

P_M(x, S^(1)_high(x_ρ)) ≤ P_M(x_ρ, S^(1)_high(x_ρ)) = 1 − ρ(Q^(1)).

According to the mutation property (25), for any µ ≥ 2,

P(X, S^(µ)_high(x_ρ)) ≤ µ(1 − ρ(Q^(1))).
Since any population in S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ) can leave this set only by entering S^(µ)_high(x_ρ), we have

min_{X ∈ S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ)} P(X, S^(µ)_same(x_ρ) \ S^(µ)_bridge(x_ρ)) ≥ 1 − µ(1 − ρ(Q^(1))).

According to Lemma 12,

ρ(Q̂^(µ)_{x_ρ,x_ρ}) ≥ 1 − µ(1 − ρ(Q^(1))).

Since ρ(Q^(µ)) ≥ ρ(Q̂^(µ)_{x_ρ,x_ρ}), we deduce that

1 − ρ(Q^(µ)) ≤ µ(1 − ρ(Q^(1))),

which means that no superlinear ρ-scalability takes place.

From the theorem, we deduce two necessary conditions for superlinear ρ-scalability to happen: first, bridgeable point(s) must exist; furthermore, bridgeable point(s) must be preserved during selection. Consequently, if there is no population diversity, then there is no superlinear ρ-scalability.

No Superlinear ρ-Scalability on Non-bridgeable Fitness Landscapes

Superlinear ρ-Scalability Never Happens on Non-bridgeable Fitness Landscapes

Definition 11.
Given a fitness function f(x), we say that the fitness landscape associated with a (1 + 1) EA is non-bridgeable if, for any two non-optimal states x and y such that x has better fitness than y (that is, f(x) ≥ f(y)), the EA starting from x has a larger probability of entering the higher fitness level S^(1)_high(x) than it has starting from y (and arriving at the same subset S^(1)_high(x)), that is,

P_M(x, S^(1)_high(x)) > P_M(y, S^(1)_high(x)).

In terms of the average convergence rate, using a population delivers no superlinear scalability if the fitness landscape associated with the (1 + 1) EA is non-bridgeable. The following proposition demonstrates this.
Proposition 1.
Suppose we are given a (1 + 1) elitist EA and a (µ + µ) elitist EA (where µ ≥ 2) that exploit an identical mutation operator for maximizing the same fitness function. If the fitness landscape associated with the (1 + 1) EA is non-bridgeable, then no superlinear ρ-scalability happens.

Proof. Let x_ρ be an individual such that

x_ρ = arg max { P(x, x) : x ∈ S^(1)_non },

and let X_ρ = (x_ρ, ..., x_ρ). It is easy to see that there is no bridgeable point for X_ρ, so that for any k > 0,

P(road(X_ρ, S^(µ)_high(x_ρ), k) through bridge) = 0.

The desired conclusion now follows directly from Theorem 4.

Analogous results have been established under ∞-scalability and a-scalability for non-bridgeable fitness landscapes, which are called monotonic fitness landscapes in He and Chen [38].

An Example of Non-bridgeable Fitness Landscapes

Consider the average capacity 0-1 knapsack problem [39], described as follows: let x be a binary string (s_1, ..., s_n) ∈ {0, 1}^n, and

max_x Σ_{i=1}^{n} v_i s_i   subject to   Σ_{i=1}^{n} w_i s_i ≤ C,

where v_i is the value of the i-th item, w_i its weight, and C = 0.5 Σ_{i=1}^{n} w_i is the capacity. Consider the instance where the values and the weights of the items are v_i = w_i = 1 for i = 1, ..., n, and C = 0.5 n. The fitness function for this instance is similar to the OneMax function:

f(x) = { Σ_{i=1}^{n} s_i,   if Σ_{i=1}^{n} s_i ≤ 0.5 n;
         infeasible,        otherwise. }    (44)

An individual is represented by a binary string. The (µ + µ) EA for the instance uses the following mutation and selection operators.

• Randomised Initialisation: generate µ feasible solutions (individuals) at random.

• Bitwise Mutation: given a string x, flip each bit independently with flipping probability 1/n. If an individual generates an infeasible offspring, the offspring is rejected immediately, while the parent is automatically transferred into the intermediate population of children after mutation.
• Elitist Selection: any elitist selection operator will do.

The fitness landscape associated with the (1 + 1) EA is non-bridgeable. According to Proposition 1, superlinear ρ-scalability never happens, meaning that using a population does not increase the average convergence rate. Similar results have been established before in terms of the expected number of generations it takes to reach the optimum for the OneMax problem: Sudholt [40] proved that the (1 + 1) EA is the best EA to tackle this problem, and Jansen et al. [13] also analysed the relationship between the runtime and the population size, but under the big O notation.

Superlinear ρ-Scalability on Bridgeable Fitness Landscapes

Superlinear ρ-Scalability May Happen on Certain Bridgeable Fitness Landscapes

Definition 12.
Given a fitness function f(x), we say that the fitness landscape associated with a (1 + 1) EA is bridgeable if there exist two non-optimal states x and y such that x has better fitness than y (that is, f(x) ≥ f(y)) while the probability of entering the higher fitness level S^(1)_high(x) starting from x is not greater than that starting from y, that is,

P_M(x, S^(1)_high(x)) ≤ P_M(y, S^(1)_high(x)).

Proposition 2 below investigates a particular scenario where roads through bridge exist on bridgeable fitness landscapes, thereby demonstrating that the use of a population may be helpful when coping with bridgeable fitness landscapes, in the sense that superlinear scalability can be achieved under certain conditions.
Proposition 2.
Suppose we are given a (1 + 1) elitist EA and a (µ + µ) elitist EA (where µ ≥ 2) that exploit an identical mutation operator to maximize the same fitness function. Suppose that the fitness landscape associated with the (1 + 1) EA is bridgeable. Assume further that the following conditions hold:

1.
Fitness diversity preservation: given the parent population Φ_t = X and the intermediate population of children Φ_{t+1/2} = Y, if there exist one or more individuals in X or Y whose fitness is less than that of the best individual of X, then at least one of these individuals must be selected into the next population with positive probability.

2. Existence of bridgeable points: let x_ρ be a state at the second highest fitness level with

P(x_ρ, x_ρ) = max_{z ∈ S^(1)_non} P(z, z).

All other states at lower fitness levels are bridgeable points of x_ρ, and the probability of going from a bridgeable point y to the optimal set via mutation is larger than that from x_ρ by a factor of µ:

P_M(y, S^(1)_opt) ≥ µ P_M(x_ρ, S^(1)_opt).    (45)

3. Pass through bridgeable points:
The probability of going from the x_ρ above to the set of bridgeable points via mutation is large enough in the following sense:

P_M(x_ρ, S^(1)_bridge(x_ρ)) ≥ µ P_M(x_ρ, S^(1)_opt).    (46)

Then superlinear ρ-scalability happens for such a µ.

Proof. For the (µ + µ) EA, let X be any population in the non-optimal set, and consider the probability of going from X to the optimal set in two generations. It is convenient to analyze two complementary cases according to the type of the population X.

Case 1.
The population X = (x_ρ, x_ρ, ..., x_ρ) consists of repeated copies of the fittest individual x_ρ. From Conditions (45) and (46), the probability of going from X to the optimal set in two generations is greater than

P(Φ_{t+2} ∈ S^(µ)_opt | Φ_{t+1} ∈ S^(µ)_bridge(x_ρ)) · P(Φ_{t+1} ∈ S^(µ)_bridge(x_ρ) | Φ_t = X) ≥ µ² (P_M(x_ρ, S^(1)_opt))² = µ² (1 − P(x_ρ, x_ρ))².

Case 2.
A population X = (x_1, ..., x_µ) that contains at least one bridgeable point of x_ρ (recall Condition 2). From Condition (45), the probability of going from X to the optimal set in two generations is greater than

P(Φ_{t+2} ∈ S^(µ)_opt | Φ_{t+1} ∈ S^(µ)_opt) · P(Φ_{t+1} ∈ S^(µ)_opt | Φ_t = X) ≥ µ (P_M(x_ρ, S^(1)_opt)) = µ (1 − P(x_ρ, x_ρ)) ≥ µ² (1 − P(x_ρ, x_ρ))².

Thus, after examining the two mutually exhaustive cases above, we deduce that for all populations X in the non-optimal set,

P(Φ_{t+2} ∈ S^(µ)_opt | Φ_t = X) ≥ µ² (1 − P(x_ρ, x_ρ))² > 1 − (1 − µ(1 − P(x_ρ, x_ρ)))².

The inequality above shows that the road through bridge condition (41) holds with k = 2, implying that superlinear scalability takes place thanks to Theorem 3.
Consider another instance of the average capacity 0-1 knapsack problem: the value of the first item is v_1 = n, while the remaining items have values v_i = 1 for i = 2, ···, n; the weight of the first item is w_1 = n − 1, while the weights of the remaining items are w_i = 1 for i = 2, ···, n. The capacity is C = n − 1. The fitness function resembles a fully deceptive function [11]:

f(x) = n, if s_1 = 1 and s_2 = ··· = s_n = 0;
f(x) = Σ_{i=2}^n s_i, if s_1 = 0;
f(x) = infeasible, otherwise.    (47)

An individual is represented by a binary string. The (µ + µ) EA for this instance uses the following mutation and selection operators.

• Randomised Initialisation: generate µ feasible solutions (individuals) at random.
• Bitwise Mutation: given a binary string x, flip each bit independently with probability 1/n. If an individual generates an infeasible offspring, the offspring is rejected immediately and the parent is transferred into the population of children.
• Elitist Proportional Selection: the best individual is replaced if the best child individual is fitter, while the non-best individuals are selected from the two populations X and Y (disregarding the best individual) via fitness proportional selection.

Notice that the self-transition probability P(x, x) is maximal when x = (0, 1, ···, 1). The optimal solution, with fitness n, is (1, 0, ···, 0); the only individual at the second highest fitness level, n − 1, is x = (0, 1, ···, 1).

The event of going from x = (0, 1, ···, 1) to (1, 0, ···, 0) via mutation occurs if and only if all of the bits are flipped. The probability of this event happening is

P_M(x, S_opt^(1)) = (1/n)^n.

The event of going from any other feasible state y (except (1, 0, ···, 0) and (0, 1, ···, 1)) to (1, 0, ···, 0) via mutation requires flipping the first bit together with every 1-valued bit of y. Let |y| denote the number of 1-valued bits in y. Since y is a feasible solution other than (1, 0, ···, 0) and (0, 1, ···, 1), we have |y| < n − 1. The probability of this event happening is

P_M(y, S_opt^(1)) = (1 − 1/n)^{n−|y|−1} (1/n)^{|y|+1}.

The event of going from the only individual x = (0, 1, ···, 1) at the second highest fitness level to the set of bridgeable points happens if and only if the first bit is not flipped and at least one of the other bits is flipped. The probability of this event is then

P_M(x, S_bridge^(1)(x)) = (1 − 1/n) Σ_{k=1}^{n−1} C(n−1, k) (1/n)^k (1 − 1/n)^{n−1−k}.

Thus, Conditions (45) and (46) hold for any population size µ ≤ n, implying that superlinear ρ-scalability happens for µ ≤ n.

A novel approach, based on the fundamental matrix of absorbing Markov chains, is introduced to study the population scalability of EAs. The spectral radius and matrix norms of the fundamental matrix are used as measures of the performance of an EA. The reciprocal of the spectral radius, 1/ρ(N), is the average convergence rate interpreted through the notion of the geometric mean. The ∞-norm ‖N‖_∞ is the maximum value of the expected number of generations to encounter an optimal solution for the first time. The a-norm ‖N‖_a is the average value, over all transient initial states, of the expected number of generations to encounter an optimal solution for the first time. Three different notions of population scalability are proposed in the paper: ρ-scalability (based on the spectral radius of the fundamental matrix), ∞-scalability (based on the ∞-norm of the fundamental matrix) and a-scalability (based on the a-norm of the fundamental matrix).

The main results of the paper may be summarized in two parts.

1. Theorem 1 shows that ρ-scalability always happens for elitist EAs using global mutation. For a population-based EA using identical mutation, the average convergence rate of a (µ + µ) EA (where µ ≥ 2) is always larger than that of the (1 + 1) EA. Nonetheless, a-scalability and ∞-scalability may not take place: using a larger population size sometimes increases, rather than reduces, the expected number of generations to encounter an optimal solution for the first time (measured either by the maximum value or by the average value). This fact runs counter to a commonly accepted "rule of thumb" in evolutionary computation.

2. Theorems 2, 3 and 4 provide sufficient and/or necessary conditions for superlinear ρ-scalability to take place. The conditions indicate that, for elitist EAs optimizing the same fitness function and using identical mutation operators, the average convergence rate of a (µ + µ) EA (where µ ≥ 2) is more than µ times that of the corresponding (1 + 1) EA if and only if the probability of passing through the "roads through bridge" is sufficiently large for the (µ + µ) EA.

To illustrate the theoretical findings above, two case studies are provided in the paper. The first shows that the average convergence rate of a (µ + µ) EA (where µ ≥ 2) is never larger than µ times that of the (1 + 1) EA on non-bridgeable fitness landscapes. The second illustrates that the average convergence rate of a (µ + µ) EA (where µ ≥ 2) might be larger than µ times that of the (1 + 1) EA on certain bridgeable fitness landscapes.

The notion of population scalability is not intended to compare the performance of the corresponding (1 + 1) and (µ + µ) EAs on all instances of a given combinatorial optimization problem, such as the 0-1 knapsack problem, at once. Indeed, as we have seen in Subsections 6.2 and 7.2, for the same pair of corresponding (1 + 1) and (µ + µ) EAs, population scalability may take place on one instance and, simultaneously, not on another instance of the 0-1 knapsack problem, so it is meaningless to consider population scalability on all instances at once.

While the approach based on the fundamental matrix is fruitful for analysing and understanding the population scalability of EAs, it is unlikely to be practical when it comes to calculating ρ-scalability for a specified pair of EAs optimizing a given fitness function, because the fundamental matrix is usually difficult to compute. Likewise, calculating ∞-scalability and a-scalability is not an easy task.

There are still many open questions, some of which are listed below. Can we pin down insightful sufficient and/or necessary conditions under which the expected number of generations that a (µ + µ) EA (where µ ≥ 2) takes to encounter an optimal solution for the first time (measured by either the average value or the maximum value) is greater than that of the corresponding (1 + 1) EA? Can we address the same question in terms of superlinear scalability? How can we determine the threshold of the population size at which an EA loses its superlinear scalability? Is there a feasible approach to calculating, or at least estimating, the population scalability?
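Returning to the knapsack case study above, the three mutation probabilities P_M(x, S_opt^(1)), P_M(y, S_opt^(1)) and P_M(x, S_bridge^(1)(x)) can be checked against exhaustive enumeration for a small n; this sketch (the helper names are ours) does so for n = 4:

```python
from itertools import product

def flip_prob(x, y, n):
    """Probability that bitwise mutation with rate 1/n turns string x into y."""
    p = 1.0
    for xb, yb in zip(x, y):
        p *= (1.0 / n) if xb != yb else (1.0 - 1.0 / n)
    return p

n = 4
x_rho = (0,) + (1,) * (n - 1)     # second highest fitness level: (0,1,...,1)
optimum = (1,) + (0,) * (n - 1)   # optimal solution: (1,0,...,0)

# Reaching the optimum from x_rho requires flipping all n bits: (1/n)^n.
p_opt = flip_prob(x_rho, optimum, n)
assert abs(p_opt - (1.0 / n) ** n) < 1e-12

# From another feasible y (first bit 0, |y| one-bits among the rest):
y = (0, 1, 0, 0)
ones = sum(y)
p_y = flip_prob(y, optimum, n)
assert abs(p_y - (1 - 1 / n) ** (n - ones - 1) * (1 / n) ** (ones + 1)) < 1e-12

# Reaching a bridgeable point from x_rho: keep the first bit, flip >= 1 other bit.
p_bridge = sum(flip_prob(x_rho, (0,) + rest, n)
               for rest in product((0, 1), repeat=n - 1)
               if rest != (1,) * (n - 1))
# The binomial sum in the text collapses to (1 - 1/n)(1 - (1 - 1/n)^(n-1)).
assert abs(p_bridge - (1 - 1 / n) * (1 - (1 - 1 / n) ** (n - 1))) < 1e-12
print(p_opt, p_bridge)
```

For n = 4, the direct jump to the optimum has probability (1/4)^4 ≈ 0.004, while reaching a bridgeable point has probability ≈ 0.434, which is why the bridge route dominates.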
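The performance measures summarized above can also be made concrete on a toy absorbing chain. The sketch below (the numbers and the two-state restriction are ours) computes the fundamental matrix N = (I − Q)^{-1} for a 2 × 2 transient block Q and the resulting ρ-scalability ratio:

```python
# Toy illustration (numbers are ours, not from the paper): compute the
# fundamental matrix N = (I - Q)^(-1) of an absorbing Markov chain with
# two transient states, then the three performance measures used above.
def fundamental_2x2(Q):
    """Invert I - Q for a 2x2 matrix Q of transient-state transitions."""
    a = 1 - Q[0][0]; b = -Q[0][1]
    c = -Q[1][0];    d = 1 - Q[1][1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def measures(Q):
    N = fundamental_2x2(Q)
    # For an upper-triangular Q, N is also upper triangular, so the
    # eigenvalues of N sit on its diagonal and the spectral radius is
    # the larger diagonal entry.
    rho = max(N[0][0], N[1][1])
    inf_norm = max(sum(N[0]), sum(N[1]))   # worst-case expected hitting time
    a_norm = (sum(N[0]) + sum(N[1])) / 2   # average expected hitting time
    return rho, inf_norm, a_norm

Q1 = [[0.5, 0.3], [0.0, 0.6]]    # benchmark "(1+1)" chain
Q2 = [[0.25, 0.2], [0.0, 0.36]]  # "(mu+mu)" chain, faster toward absorption
rho1, inf1, a1 = measures(Q1)
rho2, inf2, a2 = measures(Q2)
print(rho1 / rho2)  # rho-scalability: ratio of spectral radii
```

Here 1/ρ(N) plays the role of the average convergence rate, so the printed ratio ρ(N^(1))/ρ(N^(µ)) is the ρ-scalability; a value exceeding µ would indicate superlinear scalability.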
The condition that the EAs are convergent is not necessary. If an EA is not convergent, then ρ(Q^(µ)) = 1. In this case, the corresponding notion of population scalability is revised as follows:

ρ-scalability(µ) = ρ(N^(1)) / ρ(N^(µ)), if ρ(Q^(1)) < 1;
ρ-scalability(µ) = +∞, if ρ(Q^(1)) = 1 and ρ(Q^(µ)) < 1;
ρ-scalability(µ) = undefined, if ρ(Q^(1)) = 1 and ρ(Q^(µ)) = 1.    (48)

If a mutation operator is not global, we can easily revise it by exploiting a mixed strategy [32]: apply this mutation operator with probability 1 − ε and apply a global mutation with probability ε, for a small ε > 0. Evidently, the mixed strategy mutation operator obtained in this manner is global.

Crossover is widely used in EAs. Since a (1 + 1) EA does not include any crossover operator, it is not an appropriate candidate as the benchmark EA. Instead, a (2 + 2) EA with crossover would play the role of the benchmark EA. The notion of scalability would then be revised accordingly:

ρ-scalability(µ) = ρ(N^(2)) / ρ(N^(µ)).    (49)

The EAs above can still be modelled by absorbing Markov chains, and Theorem 2 is applicable. Nonetheless, it seems rather difficult to apply the fundamental matrix approach to EAs that exploit time-dependent genetic operators.

Acknowledgements
We would especially like to thank Professor Günter Rudolph, who was involved in initiating the research on the issue of population scalability. This work is supported by the EPSRC under Grant EP/I009809/1 and the National Natural Science Foundation of China under Grant 61170081.
References

[1] A. Prügel-Bennett. Benefits of a population: five mechanisms that advantage population-based algorithms. IEEE Transactions on Evolutionary Computation, 14(4):500–517, 2010.
[2] J. Suzuki. A Markov chain analysis on simple genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 25(4):655–659, 1995.
[3] J. He and L. Kang. On the convergence rate of genetic algorithms. Theoretical Computer Science, 229(1-2):23–39, 1999.
[4] G. Rudolph. Finite Markov chain results in evolutionary computation: a tour d'horizon. Fundamenta Informaticae, 35(1):67–89, 1998.
[5] J. He and X. Yao. Towards an analytic framework for analysing the computation time of evolutionary algorithms. Artificial Intelligence, 145(1-2):59–97, 2003.
[6] D.E. Goldberg, K. Deb, and J.H. Clark. Genetic algorithms, noise, and the sizing of populations. Complex Systems, 6:333–362, 1992.
[7] H. Mühlenbein and D. Schlierkamp-Voosen. The science of breeding and its application to the breeder genetic algorithm (BGA). Evolutionary Computation, 1(4):335–360, 1993.
[8] J. Arabas, Z. Michalewicz, and J. Mulawka. GAVaPS - a genetic algorithm with varying population size. In Proceedings of the IEEE Conference on World Congress on Computational Intelligence, pages 73–78. IEEE, 1994.
[9] A.E. Eiben, R. Hinterding, and Z. Michalewicz. Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2):124–141, 1999.
[10] G. Harik, E. Cantú-Paz, D.E. Goldberg, and B.L. Miller. The gambler's ruin problem, genetic algorithms, and the sizing of populations. Evolutionary Computation, 7(3):231–253, 1999.
[11] J. He and X. Yao. From an individual to a population: an analysis of the first hitting time of population-based evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 6(5):495–511, 2002.
[12] J. He and X. Yao. Analysis of scalable parallel evolutionary algorithms. In Proceedings of 2006 IEEE World Congress on Computational Intelligence, pages 427–434, Vancouver, Canada, July 2006. IEEE Press.
[13] T. Jansen, K.A. de Jong, and I. Wegener. On the choice of the offspring population size in evolutionary algorithms. Evolutionary Computation, 13(4):413–440, 2005.
[14] J. Lässig and D. Sudholt. Adaptive population models for offspring populations and parallel evolutionary algorithms. In Hans-Georg Beyer and William B. Langdon, editors, Proceedings of the 11th International Workshop on Foundations of Genetic Algorithms, pages 181–192, Schwarzenberg, Austria, 2011. ACM.
[15] J. Jägersküpper and C. Witt. Rigorous runtime analysis of a (µ + 1) ES for the sphere function. In Hans-Georg Beyer and Una-May O'Reilly, editors, Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pages 849–856, Washington DC, USA, 2005. ACM.
[16] T. Jansen and I. Wegener. Real royal road functions - where crossover provably is essential. Discrete Applied Mathematics, 149(1-3):111–125, 2005.
[17] C. Witt. Runtime analysis of the (µ + 1) EA on simple pseudo-Boolean functions. Evolutionary Computation, 14(1):65–86, 2006.
[18] C. Witt. Population size versus runtime of a simple evolutionary algorithm. Theoretical Computer Science, 403(1):104–120, 2008.
[19] T. Storch. On the choice of the parent population size. Evolutionary Computation, 16(4):557–578, 2008.
[20] P.S. Oliveto, J. He, and X. Yao. Analysis of the (1+1)-EA for finding approximate solutions to vertex cover problems. IEEE Transactions on Evolutionary Computation, 13(5):1006–1029, 2009.
[21] T. Friedrich, P.S. Oliveto, D. Sudholt, and C. Witt. Analysis of diversity-preserving mechanisms for global exploration. Evolutionary Computation, 17(4):455–476, 2009.
[22] T. Chen, J. He, G. Sun, G. Chen, and X. Yao. A new approach for analyzing average time complexity of population-based evolutionary algorithms on unimodal problems. IEEE Transactions on Systems, Man and Cybernetics, Part B, 39(5):1092–1106, 2009.
[23] O. Giel and P.K. Lehre. On the effect of populations in evolutionary multi-objective optimisation. Evolutionary Computation, 18(3):335–356, 2010.
[24] P.S. Oliveto, J. He, and X. Yao. Time complexity of evolutionary algorithms for combinatorial optimization: a decade of results. International Journal of Automation and Computing, 4(3):281–293, 2007.
[25] D.B. Fogel. Asymptotic convergence properties of genetic algorithms and evolutionary programming: analysis and experiments. Cybernetics and Systems, 25(3):389–407, 1994.
[26] Y. Zhou, J. He, and Q. Nie. A comparative runtime analysis of heuristic algorithms for satisfiability problems. Artificial Intelligence, 173(2):240–257, 2009.
[27] C.A. Coello Coello. Theoretical and numerical constraint-handling techniques used with evolutionary algorithms: a survey of the state of the art. Computer Methods in Applied Mechanics and Engineering, 191(11-12):1245–1287, 2002.
[28] Y. Zhou and J. He. A runtime analysis of evolutionary algorithms for constrained optimization problems. IEEE Transactions on Evolutionary Computation, 11(5):608–619, 2007.
[29] C.M. Grinstead and J.L. Snell. Introduction to Probability. American Mathematical Society, 1997.
[30] C.D. Meyer. Matrix Analysis and Applied Linear Algebra: Solutions Manual. SIAM, 2000.
[31] R.S. Varga. Matrix Iterative Analysis. Springer, 2009.
[32] J. He, F. He, and H. Dong. Pure strategy or mixed strategy? In Jin-Kao Hao and Martin Middendorf, editors, Evolutionary Computation in Combinatorial Optimization (LNCS 7245), pages 218–229. Springer, 2012.
[33] D. Andre and J.R. Koza. A parallel implementation of genetic programming that achieves superlinear performance. Information Sciences, 106(3-4):201–218, 1998.
[34] E. Alba and M. Tomassini. Parallelism and evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 6(5):443–462, 2002.
[35] E. Alba. Parallel evolutionary algorithms can achieve superlinear performance. Information Processing Letters, 82(1):7–13, 2002.
[36] D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.
[37] J. He, L. Kang, and Y. Chen. Convergence of genetic evolution algorithms for optimization. International Journal of Parallel, Emergent and Distributed Systems, 5(1):37–56, 1995.
[38] J. He and T. Chen. A general analysis of evolutionary algorithms for hard and easy fitness functions. CoRR, abs/1203.6286, 2012.
[39] S. Martello and P. Toth. Knapsack Problems. John Wiley & Sons, Chichester, 1990.
[40] D. Sudholt. General lower bounds for the running time of evolutionary algorithms. In