Level-Based Analysis of the Population-Based Incremental Learning Algorithm*

Per Kristian Lehre & Phan Trung Hai Nguyen
School of Computer Science
University of Birmingham
Birmingham B15 2TT, United Kingdom

June 6, 2018

* A preliminary version of this work will appear in the Proceedings of the 15th International Conference on Parallel Problem Solving from Nature (PPSN XV), 2018.
Abstract
The Population-Based Incremental Learning (PBIL) algorithm uses a convex combination of the current model and the empirical model to construct the next model, which is then sampled to generate offspring. The Univariate Marginal Distribution Algorithm (UMDA) is a special case of the PBIL in which the current model is ignored. Dang and Lehre (GECCO 2015) showed that the UMDA can optimise LeadingOnes efficiently. The question remained open whether the PBIL performs equally well. Here, by applying the level-based theorem in addition to the Dvoretzky–Kiefer–Wolfowitz inequality, we show that the PBIL optimises the function LeadingOnes in expected time $O(n\lambda\log\lambda + n^2)$ for a population size $\lambda = \Omega(\log n)$, which matches the bound of the UMDA. Finally, we show that the result carries over to BinVal, giving the first runtime result for the PBIL on the BinVal problem.
Index terms— Population-based incremental learning, LeadingOnes, BinVal, Runtime analysis, Level-based analysis, Theory
1 Introduction

Estimation of distribution algorithms (EDAs) are a class of randomised search heuristics that optimise objective functions by constructing probabilistic models and then sampling the models to generate offspring for the next generation. Various variants of EDAs have been proposed over the last decades; they differ from each other in the way their models are represented, updated and sampled over generations. In general, EDAs are categorised into two main classes: univariate and multivariate. Univariate EDAs take advantage of first-order statistics (i.e. the mean) to build a univariate model, whereas multivariate EDAs apply higher-order statistics to model the correlations between the decision variables.

There are only a few runtime results available for EDAs. Recently, there has been a growing interest in the optimisation time of the UMDA, introduced by Mühlenbein and Paaß [11], on standard benchmark functions [4, 13, 8, 7, 14]. Recall that the optimisation time of an algorithm is the number of fitness evaluations the algorithm needs before a global optimum is sampled for the first time. Dang and Lehre [4] analysed a variant of the UMDA using truncation selection and derived the first upper bounds of $O(n\lambda\log\lambda)$ and $O(n\lambda\log\lambda + n^2)$ on the expected optimisation times of the UMDA on OneMax and LeadingOnes, respectively, where the population size is $\lambda = \Omega(\log n)$. These results were obtained using a relatively new technique called level-based analysis [3]. Very recently, Witt [13] proved that the UMDA optimises OneMax within $O(\mu n)$ and $O(\mu\sqrt{n})$ when $\mu \geq c\log n$ and $\mu \geq c'\sqrt{n}\log n$ for some constants $c, c' > 0$, respectively. However, these bounds only hold when $\lambda = (1+\Theta(1))\mu$. This constraint on $\lambda$ and $\mu$ was relaxed by Lehre and Nguyen [8], where the upper bound $O(\lambda n)$ holds for $\lambda = \Omega(\mu)$ and $c\log n \leq \mu = O(\sqrt{n})$ for some constant $c > 0$.

The first rigorous runtime analysis of the PBIL [1] was presented very recently by Wu et al. [14], where the PBIL was referred to as a cross-entropy algorithm. The study proved an upper bound of $O(n^{2+\varepsilon})$ for the PBIL with margins $[1/n, 1-1/n]$ on LeadingOnes, where $\lambda = n^{1+\varepsilon}$, $\mu = O(n^{\varepsilon/2})$, $\eta \in \Omega(1)$ and $\varepsilon \in (0,1)$. The population sizes required for the PBIL were thus significantly larger than those for the UMDA. It is therefore of interest to determine whether the PBIL is less efficient than the UMDA, or whether the bounds derived in the early works were too loose.

This paper makes two contributions. First, we address the question above by deriving a tighter bound of $O(n\lambda\log\lambda + n^2)$ on the expected optimisation time of the PBIL on LeadingOnes. The bound holds for population sizes $\lambda = \Omega(\log n)$, which is a much weaker assumption than the $\lambda = \omega(n)$ required in [14]. Our proof is more straightforward than that in [14] because much of the complexity of the analysis is already handled by the level-based method [3].

The second contribution is the first runtime bound of the PBIL on BinVal. This function was shown to be the hardest among all linear functions for the cGA [5]. The result carries over easily from the level-based analysis of LeadingOnes using an identical partitioning of the search space. This observation further shows that runtime bounds of the PBIL, or of other non-elitist population-based algorithms using truncation selection, derived by the level-based method with the canonical partition on LeadingOnes, also hold for BinVal.

The paper is structured as follows. Section 2 introduces the PBIL with margins as well as the level-based theorem, which is the main method employed in the paper. Given all necessary tools, the next two sections then provide upper bounds on the expected optimisation time of the PBIL on LeadingOnes and BinVal. Finally, our concluding remarks are given in Section 5.
2 Preliminaries

We first introduce the notation used throughout the paper. Let $\mathcal{X} := \{0,1\}^n$ be a finite binary search space with dimension $n$. The univariate model in generation $t \in \mathbb{N}$ is represented by a vector $p^{(t)} := (p^{(t)}_1, \ldots, p^{(t)}_n) \in [0,1]^n$, where each $p^{(t)}_i$ is called a marginal. Let $X^{(t)}_1, \ldots, X^{(t)}_n$ be $n$ independent Bernoulli random variables with success probabilities $p^{(t)}_1, \ldots, p^{(t)}_n$. Furthermore, let $X^{(t)}_{i:j} := \sum_{k=i}^{j} X^{(t)}_k$ be the number of ones sampled from $p^{(t)}_{i:j} := (p^{(t)}_i, \ldots, p^{(t)}_j)$ for all $1 \leq i \leq j \leq n$. Each individual (or bitstring) is denoted as $x = (x_1, \ldots, x_n) \in \mathcal{X}$. We aim at maximising an objective function $f : \mathcal{X} \to \mathbb{R}$. We are primarily interested in the optimisation time of these algorithms, so tools to analyse runtime are of importance. We will make use of the level-based theorem [3].

We consider two pseudo-Boolean functions, LeadingOnes and BinVal, which are widely used theoretical benchmark problems in runtime analyses of EDAs [5, 4, 14]. The former aims at maximising the number of leading ones, while the latter maximises the binary value of the bitstring. The global optimum of both functions is the all-ones bitstring. Furthermore, BinVal is an extreme linear function, where the fitness contribution of the bits decreases exponentially with the bit position. Droste [5] showed that among all linear functions, BinVal is difficult for the cGA. Given a bitstring $x = (x_1, \ldots, x_n) \in \mathcal{X}$, the two functions are formally defined as follows:

Definition 1. $\textsc{LeadingOnes}(x) := \sum_{i=1}^{n} \prod_{j=1}^{i} x_j$.

Definition 2. $\textsc{BinVal}(x) := \sum_{i=1}^{n} 2^{n-i} x_i$.
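To make the two definitions concrete, here is a minimal Python sketch of both fitness functions (the function names are ours):

    def leading_ones(x):
        """LeadingOnes(x): length of the prefix of consecutive one-bits."""
        count = 0
        for bit in x:
            if bit != 1:
                break
            count += 1
        return count

    def bin_val(x):
        """BinVal(x) = sum_i 2^(n-i) * x_i: binary value, most significant bit first."""
        n = len(x)
        return sum(bit * 2 ** (n - 1 - i) for i, bit in enumerate(x))

For example, leading_ones([1, 1, 0, 1]) returns 2, while bin_val([1, 1, 0, 1]) returns 13; both functions are maximised by the all-ones bitstring.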
The PBIL algorithm maintains a univariate model over the generations. The probability of sampling a bitstring $x = (x_1, \ldots, x_n)$ from the current model $p^{(t)}$ is given by

\[ \Pr\left(x \mid p^{(t)}\right) = \prod_{i=1}^{n} \left(p^{(t)}_i\right)^{x_i} \left(1 - p^{(t)}_i\right)^{1 - x_i}. \tag{1} \]

Let $p^{(0)} := (1/2, \ldots, 1/2)$ be the initial model. In generation $t$, the algorithm samples a population of $\lambda$ individuals, denoted $P^{(t)} := \{x^{(1)}, x^{(2)}, \ldots, x^{(\lambda)}\}$, which are sorted in descending order according to fitness. The $\mu$ fittest individuals are then selected to derive the next model $p^{(t+1)}$ using the component-wise formula $p^{(t+1)}_i := (1-\eta)\, p^{(t)}_i + (\eta/\mu) \sum_{j=1}^{\mu} x^{(j)}_i$ for all $i \in \{1, 2, \ldots, n\}$, where $x^{(j)}_i$ is the $i$-th bit of the $j$-th individual in the sorted population, and $\eta \in (0,1]$ is the smoothing parameter (sometimes known as the learning rate). The ratio $\gamma_0 := \mu/\lambda \in (0,1)$ is called the selective pressure of the algorithm. Univariate EDAs often employ margins to prevent the marginals from fixating at either 0 or 1. In particular, the marginals are restricted to the interval $[1/n, 1-1/n]$ after being updated, where the quantities $1/n$ and $1-1/n$ are called the lower and upper borders, respectively. The resulting algorithm is called the PBIL with margins. Algorithm 1 gives a full description of the PBIL (with margins).

Algorithm 1: PBIL with margins
    t ← 0; p^(0) ← (1/2, 1/2, ..., 1/2)
    repeat
        for j = 1, 2, ..., λ do
            sample an offspring x^(j) ~ Pr(· | p^(t)) as defined in (1)
            evaluate the fitness f(x^(j))
        sort P^(t) ← {x^(1), x^(2), ..., x^(λ)} such that f(x^(1)) ≥ f(x^(2)) ≥ ... ≥ f(x^(λ))
        for i = 1, 2, ..., n do
            p_i^(t+1) ← max{1/n, min{1 − 1/n, (1 − η) p_i^(t) + (η/µ) Σ_{j=1}^{µ} x_i^(j)}}
        t ← t + 1
    until termination condition is fulfilled
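The following Python sketch of one generation of Algorithm 1 may help fix the notation; it assumes NumPy and our own parameter names, and is an illustration rather than a reference implementation:

    import numpy as np

    def pbil_generation(p, f, lam, mu, eta, rng):
        """One generation of the PBIL with margins (sketch of Algorithm 1).

        p: current model (array of n marginals); f: fitness function on 0/1 arrays;
        lam, mu: offspring and parent population sizes; eta: smoothing parameter."""
        n = len(p)
        # Sample lambda offspring; bit i is 1 independently with probability p[i].
        pop = (rng.random((lam, n)) < p).astype(int)
        # Sort the population in descending order of fitness.
        order = np.argsort([-f(x) for x in pop], kind="stable")
        best = pop[order[:mu]]
        # Convex combination of the current model and the empirical frequencies
        # of the mu fittest individuals, then clip to the margins [1/n, 1 - 1/n].
        p_next = (1 - eta) * p + (eta / mu) * best.sum(axis=0)
        return np.clip(p_next, 1.0 / n, 1.0 - 1.0 / n)

Note that with $\eta = 1$ the current model is ignored and the update reduces to the empirical frequencies of the $\mu$ fittest individuals, i.e. the UMDA.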
Algorithm 2: Non-elitist population-based algorithm
    t ← 0; create initial population P^(0)
    repeat
        for i = 1, ..., λ do
            sample P_i^(t+1) ~ D(P^(t))
        t ← t + 1
    until termination condition is fulfilled

Introduced in [3], the level-based theorem is a general tool that provides upper bounds on the expected optimisation time of many non-elitist population-based algorithms on a wide range of optimisation problems [3, 8, 4]. The theorem assumes that the algorithm to be analysed can be described in the form of Algorithm 2, which maintains a population $P^{(t)} \in \mathcal{X}^\lambda$, where $\mathcal{X}^\lambda$ is the space of all populations of size $\lambda$. The theorem is general since it never assumes specific fitness functions, selection mechanisms, or generic operators like mutation and crossover. Furthermore, the theorem assumes that the search space $\mathcal{X}$ can be partitioned into $m$ disjoint subsets $A_1, \ldots, A_m$, which we call levels, and that the last level $A_m$ consists of all global optima of the objective function. The theorem is formally stated as Theorem 1 [3]. We will use the notation $[n] := \{1, 2, \ldots, n\}$ and $A_{\geq j} := \cup_{k=j}^{m} A_k$.

Theorem 1 (Level-Based Theorem). Given a partition $(A_i)_{i \in [m]}$ of $\mathcal{X}$, define $T := \min\{t\lambda \mid |P^{(t)} \cap A_m| > 0\}$, where for all $t \in \mathbb{N}$, $P^{(t)} \in \mathcal{X}^\lambda$ is the population of Algorithm 2 in generation $t$. Let $y \sim D(P^{(t)})$. If there exist $z_1, \ldots, z_{m-1}, \delta \in (0,1]$ and $\gamma_0 \in (0,1)$ such that for any population $P^{(t)} \in \mathcal{X}^\lambda$,

(G1) for each level $j \in [m-1]$, if $|P^{(t)} \cap A_{\geq j}| \geq \gamma_0\lambda$, then $\Pr(y \in A_{\geq j+1}) \geq z_j$,

(G2) for each level $j \in [m-2]$ and all $\gamma \in (0, \gamma_0]$, if $|P^{(t)} \cap A_{\geq j}| \geq \gamma_0\lambda$ and $|P^{(t)} \cap A_{\geq j+1}| \geq \gamma\lambda$, then $\Pr(y \in A_{\geq j+1}) \geq (1+\delta)\gamma$,

(G3) and the population size $\lambda \in \mathbb{N}$ satisfies
\[ \lambda \geq \frac{4}{\gamma_0\delta^2} \ln\left(\frac{128m}{z_*\delta^2}\right), \quad \text{where } z_* := \min_{j \in [m-1]} z_j, \]
then
\[ E[T] \leq \frac{8}{\delta^2} \sum_{j=1}^{m-1} \left[\lambda \ln\left(\frac{6\delta\lambda}{4 + z_j\delta\lambda}\right) + \frac{1}{z_j}\right]. \]

Algorithm 2 assumes a mapping $D$ from the space of populations $\mathcal{X}^\lambda$ to the space of probability distributions over the search space. The mapping $D$ is often said to depend on the current population only [3]; however, this is not necessarily always the case, especially for the PBIL with a sufficiently large offspring population size $\lambda$. The rationale is that in each generation the PBIL draws $\lambda$ samples from the current model $p^{(t)}$, corresponding to the $\lambda$ individuals in the current population, and if the number of samples $\lambda$ is sufficiently large, it is highly likely that the empirical distributions of all positions over the entire population do not deviate too far from the true distributions, i.e. the marginals $p^{(t)}_i$. Moreover, the theorem relies on the three conditions (G1), (G2) and (G3); thus, as long as these three can be fully verified, the PBIL, whose model is constructed from the current population $P^{(t)}$ in addition to the current model $p^{(t)}$, remains amenable to level-based analysis.

In addition to the level-based theorem, we also make use of some other mathematical results. The first is the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality [9], which provides an estimate of how close an empirical distribution function is to the true distribution from which the samples are drawn. The following theorem is obtained by substituting $\varepsilon = \varepsilon'\sqrt{\lambda}$ into [9, Corollary 1].

Theorem 2 (DKW Inequality). Let $X_1, \ldots, X_\lambda$ be $\lambda$ i.i.d. real-valued random variables with cumulative distribution function $F$. Let $\hat{F}_\lambda$ be the empirical distribution function, defined by $\hat{F}_\lambda(x) := (1/\lambda)\sum_{i=1}^{\lambda} \mathbb{1}\{X_i \leq x\}$. For any $\lambda \in \mathbb{N}$ and $\varepsilon > 0$, we always have
\[ \Pr\left(\sup_{x \in \mathbb{R}} \left|\hat{F}_\lambda(x) - F(x)\right| > \varepsilon\right) \leq 2e^{-2\lambda\varepsilon^2}. \]
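As a quick sanity check of Theorem 2 (ours, with arbitrary illustrative parameters), one can estimate the left-hand side by Monte Carlo simulation for Bernoulli samples, where the supremum is attained at the single jump point of the CDF:

    import numpy as np

    def dkw_check(lam=200, trials=2000, eps=0.1, p=0.3, seed=1):
        """Compare the empirical frequency of {sup |F_hat - F| > eps} with the
        DKW bound 2*exp(-2*lam*eps^2) for Bernoulli(p) samples."""
        rng = np.random.default_rng(seed)
        exceed = 0
        for _ in range(trials):
            x = rng.random(lam) < p
            # For Bernoulli variables, sup_x |F_hat(x) - F(x)| = |F_hat(0) - F(0)|,
            # where F(0) = 1 - p and F_hat(0) is the fraction of zeros sampled.
            if abs((1.0 - x.mean()) - (1 - p)) > eps:
                exceed += 1
        return exceed / trials, 2 * np.exp(-2 * lam * eps**2)

With these parameters the simulated exceedance frequency lies far below the bound $2e^{-4} \approx 0.037$, as expected from an inequality that holds uniformly over all distributions.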
Furthermore, properties of majorisation between two vectors are also exploited. The concept is formally defined in Definition 3 [6], followed by an important property (Lemma 1) that we use intensively throughout the paper.

Definition 3. Given vectors $p^{(1)} := (p^{(1)}_1, \ldots, p^{(1)}_n)$ and $p^{(2)} := (p^{(2)}_1, \ldots, p^{(2)}_n)$, where $p^{(1)}_1 \geq p^{(1)}_2 \geq \ldots \geq p^{(1)}_n$ and similarly for the $p^{(2)}_i$. Vector $p^{(1)}$ is said to majorise vector $p^{(2)}$, in symbols $p^{(1)} \succ p^{(2)}$, if $\sum_{i=1}^{k} p^{(1)}_i \geq \sum_{i=1}^{k} p^{(2)}_i$ for all $k \in [n-1]$, and $\sum_{i=1}^{n} p^{(1)}_i = \sum_{i=1}^{n} p^{(2)}_i$.

Lemma 1 ([2]). Let $X_1, \ldots, X_n$ be $n$ independent Bernoulli random variables with success probabilities $p_1, \ldots, p_n$, respectively. Denote $p := (p_1, p_2, \ldots, p_n)$; let $S(p) := \sum_{i=1}^{n} X_i$ and $D_\lambda := \{p : p_i \in [0,1],\ i \in [n],\ \sum_{i=1}^{n} p_i = \lambda\}$. For two vectors $p^{(1)}, p^{(2)} \in D_\lambda$, if $p^{(1)} \prec p^{(2)}$, then $\Pr(S(p^{(1)}) = n) \geq \Pr(S(p^{(2)}) = n)$.

Lemma 2. Let $p^{(1)}, p^{(2)} \in D_\lambda$ be two vectors as defined in Lemma 1, where all components of $p^{(\cdot)}$ are arranged in descending order. Let $z^{(1)} := (z^{(1)}_1, \ldots, z^{(1)}_n)$, where each $z^{(1)}_i := (1-\eta)p^{(1)}_i + \eta$, and $z^{(2)} := (z^{(2)}_1, \ldots, z^{(2)}_n)$, where each $z^{(2)}_i := (1-\eta)p^{(2)}_i + \eta$, for any constant $\eta \in (0,1]$. If $p^{(2)} \succ p^{(1)}$, then $z^{(2)} \succ z^{(1)}$.

Proof. For all $j \in [n-1]$, $\sum_{i=1}^{j} z^{(2)}_i \geq \sum_{i=1}^{j} z^{(1)}_i$ since $\sum_{i=1}^{j} p^{(2)}_i \geq \sum_{i=1}^{j} p^{(1)}_i$. Furthermore, for $j = n$ we have $\sum_{i=1}^{n} z^{(2)}_i = \sum_{i=1}^{n} z^{(1)}_i$ due to $\sum_{i=1}^{n} p^{(2)}_i = \sum_{i=1}^{n} p^{(1)}_i$. By Definition 3, $z^{(2)} \succ z^{(1)}$.
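Lemma 1 is easy to verify numerically in the special case used later, where $\Pr(S(p) = n) = \prod_i p_i$. A small Python illustration (the example vectors are ours):

    from itertools import accumulate
    from math import prod

    def majorises(p1, p2, tol=1e-12):
        """Check p1 ≻ p2 (Definition 3): p1, p2 sorted in descending order,
        prefix sums of p1 dominate those of p2, and the total sums agree."""
        s1, s2 = list(accumulate(p1)), list(accumulate(p2))
        return (all(a >= b - tol for a, b in zip(s1[:-1], s2[:-1]))
                and abs(s1[-1] - s2[-1]) < tol)

    # p2 majorises p1 (both sum to 1.5), so Lemma 1 predicts
    # Pr(S(p1) = n) >= Pr(S(p2) = n).
    p1 = [0.5, 0.5, 0.5]
    p2 = [0.9, 0.4, 0.2]
    assert majorises(p2, p1)
    assert prod(p1) >= prod(p2)  # 0.125 >= 0.072

Intuitively, the more spread out the marginals are for a fixed sum, the harder it is to sample the all-ones outcome.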
3 PBIL on LeadingOnes

We now show how to apply the level-based theorem to analyse the runtime of the PBIL. We use a canonical partition of the search space, where each subset $A_j$ contains the bitstrings with exactly $j$ leading ones:

\[ A_j := \{x \in \{0,1\}^n \mid \textsc{LeadingOnes}(x) = j\}. \tag{2} \]

Conditions (G1) and (G2) of Theorem 1 assume that there are at least $\gamma_0\lambda$ individuals in levels $A_{\geq j}$ in generation $t$. Recall that $\gamma_0 := \mu/\lambda$. This implies that the first $j$ bits of the $\mu$ fittest individuals are all ones. Denote by $\hat{p}^{(t)}_i := (1/\lambda)\sum_{j=1}^{\lambda} x^{(j)}_i$ the frequency of ones at position $i$ in the current population. We first show that, under the assumptions of the two conditions of Theorem 1 and with a population size $\lambda = \Omega(\log n)$, the first $j$ marginals cannot be too close to the lower border $1/n$, with probability at least $1 - n^{-\Omega(1)}$.

Lemma 3. If $|P^{(t)} \cap A_{\geq j}| \geq \gamma_0\lambda$ and $\lambda \geq c\,((1 + 1/\varepsilon)/\gamma_0)^2 \ln(n)$ for any constants $c, \varepsilon > 0$ and $\gamma_0 \in (0,1)$, then it holds with probability at least $1 - n^{-c}$ that $p^{(t)}_i \geq \gamma_0/(1+\varepsilon)$ for all $i \in [j]$.

Proof. Consider an arbitrary bit $i \in [j]$. Let $Q_i$ be the number of ones sampled at position $i$ in the current population. The corresponding empirical distribution function of the number of zeros is $\hat{F}_\lambda(0) = (1/\lambda)\sum_{j=1}^{\lambda} \mathbb{1}\{x^{(j)}_i \leq 0\} = (\lambda - Q_i)/\lambda = 1 - \hat{p}^{(t)}_i$, and the true distribution function is $F(0) = 1 - p^{(t)}_i$. The DKW inequality (see Theorem 2) yields $\Pr(\hat{p}^{(t)}_i - p^{(t)}_i > \phi) \leq \Pr(|\hat{p}^{(t)}_i - p^{(t)}_i| > \phi) \leq 2e^{-2\lambda\phi^2}$ for all $\phi > 0$. Therefore, with probability at least $1 - 2e^{-2\lambda\phi^2}$ we have $\hat{p}^{(t)}_i - p^{(t)}_i \leq \phi$ and thus $p^{(t)}_i \geq \hat{p}^{(t)}_i - \phi \geq \gamma_0 - \phi$, since $\hat{p}^{(t)}_i \geq \gamma_0\lambda/\lambda = \gamma_0$ due to $|P^{(t)} \cap A_{\geq j}| \geq \gamma_0\lambda$. We then choose $\phi \leq \varepsilon\gamma_0/(1+\varepsilon)$ for some constant $\varepsilon > 0$ and $\lambda \geq c\,((1 + 1/\varepsilon)/\gamma_0)^2 \ln(n)$. Putting everything together, it holds that $p^{(t)}_i \geq \gamma_0(1 - \varepsilon/(1+\varepsilon)) = \gamma_0/(1+\varepsilon)$ with probability at least $1 - n^{-c}$.
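For completeness, here is a short calculation (ours) behind the choice of $\lambda$ in Lemma 3. Note that $\phi := \varepsilon\gamma_0/(1+\varepsilon) = \gamma_0/(1 + 1/\varepsilon)$, so

\[ 2e^{-2\lambda\phi^2} \leq n^{-c} \iff \lambda \geq \frac{c\ln n + \ln 2}{2\phi^2} = \frac{c\ln n + \ln 2}{2}\left(\frac{1 + 1/\varepsilon}{\gamma_0}\right)^2, \]

which is implied by $\lambda \geq c\,((1+1/\varepsilon)/\gamma_0)^2 \ln(n)$ for sufficiently large $n$.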
Given that the $\mu$ fittest individuals have at least $j$ leading ones, we now estimate the probability of sampling $j$ leading ones from the current model $p^{(t)}$.

Lemma 4. For any non-empty subset $I \subseteq [n]$, define $C_I := \{x \in \{0,1\}^n \mid \prod_{i \in I} x_i = 1\}$. If $|P^{(t)} \cap C_I| \geq \gamma_0\lambda$ and $\lambda \geq c\,((1 + 1/\varepsilon)/\gamma_0)^2 \ln(n)$ for any constants $c, \varepsilon > 0$ and $\gamma_0 \in (0,1)$, then it holds with probability at least $1 - n^{-c}$ that $q^{(t)} := \prod_{i \in I} p^{(t)}_i \geq \gamma_0/(1+\varepsilon)$.

Proof. We prove the statement using the DKW inequality (see Theorem 2). Let $m = |I|$. Given an offspring $Y \sim \Pr(\cdot \mid p^{(t)})$ sampled from the current model, let $Y_I := \sum_{i \in I} Y_i$ be the number of one-bits in the bit positions $I$. By the assumption $|P^{(t)} \cap C_I| \geq \gamma_0\lambda$ on the current population, the empirical distribution function of $Y_I$ satisfies $\hat{F}_\lambda(m-1) = (1/\lambda)\sum_{i=1}^{\lambda} \mathbb{1}\{Y_{I,i} \leq m-1\} = 1 - \hat{q}^{(t)}$, where $\hat{q}^{(t)} \geq \gamma_0$ is the fraction of individuals in the current population that belong to $C_I$, and the true distribution function satisfies $F(m-1) = 1 - q^{(t)}$. The DKW inequality yields $\Pr(\hat{q}^{(t)} - q^{(t)} > \phi) \leq \Pr(|\hat{q}^{(t)} - q^{(t)}| > \phi) \leq 2e^{-2\lambda\phi^2}$ for all $\phi > 0$. Therefore, with probability at least $1 - 2e^{-2\lambda\phi^2}$ it holds that $\hat{q}^{(t)} - q^{(t)} \leq \phi$ and thus $q^{(t)} \geq \hat{q}^{(t)} - \phi \geq \gamma_0 - \phi$. Choosing $\phi := \varepsilon\gamma_0/(1+\varepsilon)$, we get $q^{(t)} \geq \gamma_0(1 - \varepsilon/(1+\varepsilon)) = \gamma_0/(1+\varepsilon)$ with probability at least $1 - 2e^{-2\phi^2\lambda} \geq 1 - n^{-c}$.

Given that the current level is $j$, we speak of a success if the first $j$ marginals never drop below $\gamma_0/(1+\varepsilon)$; otherwise, we speak of a failure. Suppose that, in the absence of failures, $O(n\log\lambda + n^2/\lambda)$ is an upper bound on the expected number of generations of the PBIL on LeadingOnes. The following lemma shows that this is then also, up to a constant factor, the expected optimisation time of the PBIL on LeadingOnes.
Lemma 5. If the expected number of generations required by the PBIL to optimise LeadingOnes in the case of no failure is at most $t^* \in O(n\log\lambda + n^2/\lambda)$, regardless of the initial probability vector of the PBIL, then the expected number of generations of the PBIL on LeadingOnes is at most $4t^*(1 + o(1))$.

Proof. From the point when the algorithm starts, we divide the time into identical phases, each lasting $t^*$ generations. Let $E_i$ denote the event that the $i$-th phase contains a failure, for $i \in \mathbb{N}$ (and let $\overline{E_i}$ denote its complement). According to Lemma 3, $\Pr(E_i) \leq n^{-c} \cdot O(n\log\lambda + n^2/\lambda) = O(n^{-c'+2})$ by a union bound over the generations of a phase, for another constant $c' > 0$; here we assume that $\lambda$ is at most polynomial in $n$ and that the constant $c$ is chosen large enough such that $c' > 2$. Then $\Pr(\overline{E_1} \wedge \overline{E_2}) \geq 1 - \Pr(E_1) - \Pr(E_2) \geq 1 - O(n^{-c'+2})$ by a union bound. Let $T$ be the number of generations performed by the algorithm until a global optimum is found for the first time. We know that $E[T \mid \wedge_{i \in \mathbb{N}} \overline{E_i}] \leq t^*$, and, by Markov's inequality, $\Pr(T \leq 2t^* \mid \wedge_{i \in \mathbb{N}} \overline{E_i}) \geq 1/2$, i.e. $\Pr(T \geq 2t^* \mid \wedge_{i \in \mathbb{N}} \overline{E_i}) \leq 1/2$. Hence,

\begin{align*}
E[T \mid \overline{E_1} \wedge \overline{E_2}]
&\leq 2t^* \Pr(T \leq 2t^* \mid \overline{E_1} \wedge \overline{E_2}) + (2t^* + E[T]) \Pr(T \geq 2t^* \mid \overline{E_1} \wedge \overline{E_2}) \\
&= 2t^* + \Pr(T \geq 2t^* \mid \overline{E_1} \wedge \overline{E_2})\, E[T]
\leq 2t^* + (1/2)\, E[T],
\end{align*}

since $\Pr(T \leq 2t^* \mid \overline{E_1} \wedge \overline{E_2}) \geq \Pr(T \leq 2t^* \mid \wedge_{i \in \mathbb{N}} \overline{E_i}) \geq 1/2$. Substituting this into the following yields

\begin{align*}
E[T] &= \Pr(\overline{E_1} \wedge \overline{E_2})\, E[T \mid \overline{E_1} \wedge \overline{E_2}] + \Pr(E_1 \vee E_2)\, (2t^* + E[T]) \\
&\leq \Pr(\overline{E_1} \wedge \overline{E_2})\, (2t^* + (1/2)\, E[T]) + \Pr(E_1 \vee E_2)\, (2t^* + E[T]) \\
&= 2t^* + \left((1/2)\Pr(\overline{E_1} \wedge \overline{E_2}) + \Pr(E_1 \vee E_2)\right) E[T] \\
&= 2t^* + E[T] - (1/2)\Pr(\overline{E_1} \wedge \overline{E_2})\, E[T].
\end{align*}

Thus, $E[T] \leq 4t^*/\Pr(\overline{E_1} \wedge \overline{E_2}) = 4t^*(1 + o(1))$.

By the result of Lemma 5, the phase-based analysis, which is repeated until there is a pair of phases with no failure, only costs a multiplicative constant in the expectation. It remains to calculate the value of $t^*$, which will then asymptotically also be the overall expected number of generations of the PBIL on LeadingOnes. We now give our runtime bound for the PBIL on LeadingOnes with a sufficiently large population size $\lambda$. The proof is very straightforward compared to that in [14]. The floor and ceiling functions of $x \in \mathbb{R}$ are denoted $\lfloor x \rfloor$ and $\lceil x \rceil$, respectively.
Theorem 3. The PBIL with margins, offspring population size $\lambda \geq c\log n$ for a sufficiently large constant $c > 0$, and parent population size $\mu = \gamma_0\lambda$ for any constant $\gamma_0$ satisfying $\gamma_0 \leq \eta^{\lceil\xi\rceil+1}/((1+\delta)e)$, where $\xi = \ln(p_0)/(p_0 - 1)$ and $p_0 := \gamma_0/(1+\varepsilon)$, for any positive constants $\delta, \varepsilon$ and smoothing parameter $\eta \in (0,1]$, has expected optimisation time $O(n\lambda\log\lambda + n^2)$ on LeadingOnes.
Proof. We strictly follow the procedure recommended in [3].

Step 1: Recall that we use the canonical partition defined in (2), in which each subset $A_j$ contains the individuals with exactly $j$ leading ones. There are a total of $m = n+1$ levels, ranging from $A_0$ to $A_n$.

Step 2: Given $|P^{(t)} \cap A_{\geq j}| \geq \gamma_0\lambda = \mu$ and $|P^{(t)} \cap A_{\geq j+1}| \geq \gamma\lambda$, we prove that the probability of sampling an offspring in $A_{\geq j+1}$ in generation $t+1$ is lower bounded by $(1+\delta)\gamma$ for some constant $\delta > 0$. By Lemma 1, if there exists a vector $z^{(t)} = (z^{(t)}_1, \ldots, z^{(t)}_j)$ that majorises $p^{(t)}_{1:j}$, then the probability of obtaining $j$ successes from a Poisson-binomial distribution with parameters $j$ and $p^{(t)}_{1:j}$ is lower bounded by the probability under the same distribution with parameters $j$ and $z^{(t)}$. Following [14], we compare $X^{(t)}_1, \ldots, X^{(t)}_j$ with another sequence of independent Bernoulli random variables $Z^{(t)}_1, \ldots, Z^{(t)}_j$ with success probabilities $z^{(t)}_1, \ldots, z^{(t)}_j$; let $Z^{(t)} := \sum_{k=1}^{j} Z^{(t)}_k$. Define $m_0 := \lfloor(\sum_{i=1}^{j} p^{(t)}_i - jp_0)/(1 - 1/n - p_0)\rfloor$, where $p_0 := \gamma_0/(1+\varepsilon)$, and let $Z^{(t)}_1, \ldots, Z^{(t)}_{m_0}$ all have success probability $z^{(t)}_1 = \ldots = z^{(t)}_{m_0} = 1 - 1/n$, let $Z^{(t)}_{m_0+2}, \ldots, Z^{(t)}_j$ have success probability $p_0$, and possibly let $Z^{(t)}_{m_0+1}$ take an intermediate success probability in $[p_0, 1-1/n]$ so as to guarantee $\sum_{i=1}^{j} p^{(t)}_i = \sum_{i=1}^{j} z^{(t)}_i$. Since $\sum_{i=1}^{j} p^{(t)}_i \geq j\,(\prod_{i=1}^{j} p^{(t)}_i)^{1/j} \geq j\, p_0^{1/j}$ by the arithmetic mean–geometric mean inequality (see Lemma 7 in the Appendix) and Lemma 4, we get $m_0 \geq \lfloor j(p_0^{1/j} - p_0)/(1 - 1/n - p_0)\rfloor$. Let us consider the function

\[ g(j) = \frac{j(p_0^{1/j} - p_0)}{1 - p_0} - j = \frac{j(p_0^{1/j} - 1)}{1 - p_0}. \]

This function has a horizontal asymptote at $y = -\xi$, where $\xi := \frac{\ln p_0}{p_0 - 1}$ (see the calculation in the Appendix). Thus, $m_0 \geq j - \lceil\xi\rceil$ for all $j \geq 1$.

The PBIL then updates the current model $p^{(t)}$ to obtain $p^{(t+1)}$ using the component-wise formula $p^{(t+1)}_i = (1-\eta)p^{(t)}_i + (\eta/\mu)\sum_{k=1}^{\mu} x^{(k)}_i$. For all $i \in [j]$, we know that $\sum_{k=1}^{\mu} x^{(k)}_i = \mu$ due to the assumption of condition (G2). Letting $z^{(t+1)}_i := (1-\eta)z^{(t)}_i + \eta$, we obtain after the update:

• $z^{(t+1)}_i \geq 1 - 1/n$ for all $i \leq j - \lceil\xi\rceil$,

• $z^{(t+1)}_i \geq (1-\eta)p_0 + \eta \geq \eta$ for all $j - \lceil\xi\rceil < i \leq j$, and

• $p^{(t+1)}_{j+1} \geq (1-\eta)p^{(t)}_{j+1} + \eta\gamma/\gamma_0 \geq \eta\gamma/\gamma_0$, due to $\sum_{k=1}^{\mu} x^{(k)}_{j+1} \geq \gamma\lambda$.

Lemmas 1 and 2 assert that $z^{(t+1)}$ majorises $p^{(t+1)}_{1:j}$ and that $\Pr(X^{(t+1)}_{1:j} = j) \geq \Pr(Z^{(t+1)} = j)$. In words, the probability of sampling an offspring in $A_{\geq j}$ in generation $t+1$ is lower bounded by the probability of obtaining $j$ successes from a Poisson-binomial distribution with parameters $j$ and $z^{(t+1)}$. More precisely, in generation $t+1$,

\[ \Pr(X^{(t+1)}_{1:j+1} = j+1) \geq \Pr(X^{(t+1)}_{1:j} = j) \cdot \Pr(X^{(t+1)}_{j+1} = 1) \geq \Pr(Z^{(t+1)} = j) \cdot p^{(t+1)}_{j+1} \geq (1 - 1/n)^{j - \lceil\xi\rceil}\, \eta^{\lceil\xi\rceil+1}\, \gamma/\gamma_0 \geq (1+\delta)\gamma, \]

where $(1 - 1/n)^{j - \lceil\xi\rceil} \geq e^{-1}$ and $\gamma_0 \leq \frac{\eta^{\lceil\xi\rceil+1}}{e(1+\delta)}$ for any constant $\delta > 0$. Thus, condition (G2) of Theorem 1 is verified.
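Before moving on, the horizontal asymptote of $g$ claimed above can be verified directly (a short calculation of ours; the paper defers the full version to the Appendix). Writing $p_0^{1/j} = e^{(\ln p_0)/j}$ and using $e^u = 1 + u + O(u^2)$ as $u \to 0$,

\[ \lim_{j \to \infty} j\left(p_0^{1/j} - 1\right) = \lim_{j \to \infty} j\left(e^{(\ln p_0)/j} - 1\right) = \ln p_0, \qquad \text{hence} \qquad \lim_{j \to \infty} g(j) = \frac{\ln p_0}{1 - p_0} = -\xi. \]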
Step 3: Given that $|P^{(t)} \cap A_{\geq j}| \geq \gamma_0\lambda$, we aim at showing that the probability of sampling an offspring in $A_{\geq j+1}$ in generation $t+1$ is at least $z_j$. Note in particular that Lemma 4 yields $\Pr(X^{(t+1)}_{1:j} = j) \geq \gamma_0/(1+\varepsilon)$, since the update can only increase the first $j$ marginals in this situation. The probability of sampling an offspring in $A_{\geq j+1}$ in generation $t+1$ is then lower bounded by

\[ \Pr(X^{(t+1)}_{1:j} = j) \cdot \Pr(X^{(t+1)}_{j+1} = 1) \geq \frac{\gamma_0}{1+\varepsilon} \cdot \frac{1}{n} =: z_j, \]

where $\Pr(X^{(t+1)}_{j+1} = 1) = p^{(t+1)}_{j+1} \geq 1/n$ due to the margins. Therefore, condition (G1) of Theorem 1 is satisfied with $z_j = z_* = \frac{\gamma_0}{(1+\varepsilon)n}$.

Step 4: Condition (G3) of Theorem 1 requires a population size $\lambda = \Omega(\log n)$. This matches the condition $\lambda \geq c\log n$, for some sufficiently large constant $c > 0$, required by Lemmas 3 and 4; overall, $\lambda = \Omega(\log n)$ suffices.

Step 5: With $z_j = \frac{\gamma_0}{(1+\varepsilon)n}$, where $\gamma_0 \leq \frac{\eta^{\lceil\xi\rceil+1}}{(1+\delta)e}$, and $\lambda \geq c\log n$ for some constants $\varepsilon > 0$, $\eta \in (0,1]$ and sufficiently large $c > 0$, all conditions of Theorem 1 are verified. Using $\ln\left(\frac{6\delta\lambda}{4 + z_j\delta\lambda}\right) < \ln\left(\frac{3\delta\lambda}{2}\right)$, an upper bound on the expected optimisation time of the PBIL on LeadingOnes follows:

\[ \frac{8}{\delta^2} \sum_{j=0}^{n-1} \left[\lambda \ln\left(\frac{3\delta\lambda}{2}\right) + \frac{1}{z_j}\right] = \frac{8}{\delta^2}\, n\lambda \ln\left(\frac{3\delta\lambda}{2}\right) + \frac{8(1+\varepsilon)}{\delta^2\gamma_0}\, n^2 \in O(n\lambda\log\lambda + n^2). \]

Hence, in the case of no failure, the expected number of generations is $t^* = O(n\log\lambda + n^2/\lambda)$ for a sufficiently large $\lambda$, which meets the assumption of Lemma 5. By Lemma 5, the expected optimisation time of the PBIL on LeadingOnes therefore remains asymptotically $O(n\lambda\log\lambda + n^2)$. This completes the proof.

Our improved upper bound of $O(n^2)$ on the optimisation time of the PBIL with population size $\lambda = \Theta(\log n)$ on LeadingOnes is significantly better than the previous bound $O(n^{2+\varepsilon})$ from [14]. Our result is not only stronger, but the proof is also much simpler, as most of the complexity of the population dynamics of the algorithm is handled by Theorem 1 [3]. Furthermore, we provide specific values for the multiplicative constants, namely $8/\delta^2$ and $8(1+\varepsilon)/(\delta^2\gamma_0)$ for the terms $n\lambda\log\lambda$ and $n^2$, respectively (see Step 5). Moreover, the result matches the runtime bound of the UMDA on LeadingOnes for a small population size $\lambda = \Theta(\log n)$ [4].

Note that Theorem 3 requires a condition on the selective pressure, namely $\gamma_0 \leq \frac{\eta^{\lceil\xi\rceil+1}}{(1+\delta)e}$, where $\xi = \frac{\ln p_0}{p_0 - 1}$ and $p_0 := \frac{\gamma_0}{1+\varepsilon}$, for any positive constants $\delta, \varepsilon$ and smoothing parameter $\eta \in (0,1]$. Although this condition on $\gamma_0$ is implicit, it tells us that there exist settings of the PBIL such that it can optimise LeadingOnes within $O(n\lambda\log\lambda + n^2)$ time in expectation.
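As a concrete illustration of the selective-pressure condition (ours; the parameter values are arbitrary), the constraint of Theorem 3 is easy to check numerically for given constants:

    import math

    def pressure_ok(gamma0, eta, delta=0.1, eps=0.1):
        """Check the condition of Theorem 3:
        gamma0 <= eta**(ceil(xi) + 1) / ((1 + delta) * e),
        where xi = ln(p0) / (p0 - 1) and p0 = gamma0 / (1 + eps)."""
        p0 = gamma0 / (1 + eps)
        xi = math.log(p0) / (p0 - 1)
        return gamma0 <= eta ** (math.ceil(xi) + 1) / ((1 + delta) * math.e)

    print(pressure_ok(gamma0=0.05, eta=1.0))  # True: eta = 1 is the UMDA case
    print(pressure_ok(gamma0=0.05, eta=0.5))  # False: smaller eta forces lower pressure

The check makes visible how the admissible selective pressure shrinks as the smoothing parameter $\eta$ decreases, since the right-hand side scales with $\eta^{\lceil\xi\rceil+1}$.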
4 PBIL on BinVal

We first partition the search space into the non-empty disjoint subsets $A_0, \ldots, A_n$.

Lemma 6. Define the levels as

\[ A_j := \left\{ x \in \{0,1\}^n \;\middle|\; \sum_{i=1}^{j} 2^{n-i} \leq \textsc{BinVal}(x) < \sum_{i=1}^{j+1} 2^{n-i} \right\} \]

for $j \in [n] \cup \{0\}$, where $\sum_{i=1}^{0} 2^{n-i} := 0$. If a bitstring $x$ has exactly $j$ leading ones, i.e. $\textsc{LeadingOnes}(x) = j$, then $x \in A_j$.

Proof. Consider a bitstring $x = 1^j 0 \{0,1\}^{n-j-1}$. The fitness contribution of the $j$ leading ones to $\textsc{BinVal}(x)$ is $\sum_{i=1}^{j} 2^{n-i}$. The $(j+1)$-th bit contributes nothing, while the contribution of the last $n-j-1$ bits is at most $\sum_{i=j+2}^{n} 2^{n-i} = \sum_{i=0}^{n-j-2} 2^i = 2^{n-j-1} - 1$. Hence, $\sum_{i=1}^{j} 2^{n-i} \leq \textsc{BinVal}(x) \leq \sum_{i=1}^{j+1} 2^{n-i} - 1 < \sum_{i=1}^{j+1} 2^{n-i}$, so the bitstring $x$ belongs to level $A_j$.
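Lemma 6 can be checked exhaustively for small $n$ (a sketch of ours, reusing leading_ones and bin_val from the Section 2 sketch):

    import itertools

    def binval_level(x):
        """Largest j with sum_{i<=j} 2^(n-i) <= BinVal(x), i.e. the level of x
        in the partition of Lemma 6."""
        n, v = len(x), bin_val(x)
        j = 0
        while j < n and v >= sum(2 ** (n - i) for i in range(1, j + 2)):
            j += 1
        return j

    # The BinVal level of every bitstring equals its number of leading ones.
    assert all(binval_level(x) == leading_ones(x)
               for x in itertools.product((0, 1), repeat=6))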
In both problems, all that matters for determining the level of a bitstring is the position of the first 0-bit when scanning from the most significant to the least significant bit. Two bitstrings in the same BinVal level may additionally be ranked by some less significant bits once the population is sorted; however, the proof of Theorem 3 never takes these bits into account. Thus, the following corollary yields the first upper bound on the expected optimisation time of the PBIL, and of the UMDA (when $\eta = 1$), on BinVal.
Corollary 1. The PBIL with margins, offspring population size $\lambda \geq c\log n$ for a sufficiently large constant $c > 0$, and parent population size $\mu = \gamma_0\lambda$ for any constant $\gamma_0$ satisfying $\gamma_0 \leq \eta^{\lceil\xi\rceil+1}/((1+\delta)e)$, where $\xi = \ln(p_0)/(p_0 - 1)$ and $p_0 := \gamma_0/(1+\varepsilon)$, for any positive constants $\delta, \varepsilon$ and smoothing parameter $\eta \in (0,1]$, has expected optimisation time $O(n\lambda\log\lambda + n^2)$ on BinVal.
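To get a feel for the magnitudes involved, the bound of Theorem 1 can be evaluated numerically with the level probabilities $z_j = \gamma_0/((1+\varepsilon)n)$ from Step 5 (a hypothetical instantiation of ours; all parameter values below are arbitrary):

    import math

    def level_based_bound(n, lam, gamma0, eps=0.1, delta=0.1):
        """Upper bound of Theorem 1 on E[T] (fitness evaluations) with
        m = n + 1 levels and z_j = gamma0 / ((1 + eps) * n) for every level."""
        z = gamma0 / ((1 + eps) * n)
        per_level = lam * math.log(6 * delta * lam / (4 + z * delta * lam)) + 1 / z
        return (8 / delta**2) * n * per_level

    print(level_based_bound(n=100, lam=50, gamma0=0.05))  # roughly 1.9e8

The $1/z_j$ terms dominate here, reflecting the $n^2$ part of the runtime bound.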
5 Conclusion

Runtime analyses of EDAs are scarce. Motivated by this, we have derived an upper bound of $O(n\lambda\log\lambda + n^2)$ on the expected optimisation time of the PBIL on both LeadingOnes and BinVal for a population size $\lambda = \Omega(\log n)$. The result improves upon the previously best-known bound $O(n^{2+\varepsilon})$ from [14], requires a much smaller population size $\lambda = \Omega(\log n)$, and uses relatively straightforward arguments. We also present the first upper bound on the expected optimisation time of the PBIL on BinVal.

Furthermore, our analysis demonstrates that the level-based theorem can yield runtime bounds for EDAs whose models are updated using information gathered from both the current and previous generations. An additional aspect of our analysis is the use of the DKW inequality to bound the true distribution by the empirical population sample when the number of samples is large enough. We expect these arguments to lead to new results in the runtime analysis of evolutionary algorithms.
Appendix