Blending Dynamic Programming with Monte Carlo Simulation for Bounding the Running Time of Evolutionary Algorithms
Kirill Antonov, Maxim Buzdalov, Arina Buzdalova, Carola Doerr
Kirill Antonov
ITMO University, Saint Petersburg, Russia
Maxim Buzdalov
ITMO University, Saint Petersburg, Russia
Arina Buzdalova
ITMO University, Saint Petersburg, Russia
Carola Doerr
Sorbonne Université, CNRS, LIP6, Paris, France
Abstract—With the goal to provide absolute lower bounds for the best possible running times that can be achieved by (1+λ)-type search heuristics on common benchmark problems, we recently suggested a dynamic programming approach that computes optimal expected running times and the regret values inferred when deviating from the optimal parameter choice. Our previous work is restricted to problems for which transition probabilities between different states can be expressed by relatively simple mathematical expressions. With the goal to cover broader sets of problems, we suggest in this work an extension of the dynamic programming approach to settings in which the transition probabilities cannot necessarily be computed exactly, but in which they can be approximated numerically, up to arbitrary precision, by Monte Carlo sampling. We apply our hybrid Monte Carlo dynamic programming approach to a concatenated jump function and demonstrate how the obtained bounds can be used to gain a deeper understanding of parameter control schemes.

I. INTRODUCTION
Running time analysis of evolutionary algorithms and other search heuristics contributes to our understanding of black-box optimization not only by providing insights into the basic working principles that drive algorithms' performance, but also by providing lower bounds for the performance of broad classes of black-box approaches [1], [2]. Typically expressed as black-box complexity bounds, these lower bounds can be seen as a baseline against which we can compare state-of-the-art solvers, with the goal to quantify the potential of further algorithm development. By comparing black-box complexity bounds for different classes of algorithms, we obtain insight into the impact of certain algorithm features, such as their degree of parallelism [3], [4], their memory [5], [11], their selection principles [7], properties of their sampling strategies [8]–[10], etc. More refined lower bounds can be obtained by specifying a family of algorithms, the available choices for configuring it, and by studying the dependency of the running time on the choices of these configurable components. This has been classically studied in the asymptotic sense [12]–[15]; however, some recent works provide tight bounds also for concrete problem dimensions [16], [17]. Most of these works assume a static setting of the parameters. In practice, however, it is well known that dynamic parameter settings can greatly improve the efficiency of evolutionary algorithms [18]–[20].
Precise lower bounds for algorithms with dynamic parameter settings are rare; see [21] for a survey of rigorous results for algorithms with dynamic parameter choices and [22]–[24] for a few exceptions that provide precise lower bounds for algorithms with dynamic parameter settings. Proving precise lower bounds for algorithms with dynamic parameter choices is challenging for several reasons: (i) lower bounds require to make statements about all possible settings (as opposed to upper bounds, in which one concrete parameter setting is analyzed), (ii) the (often very complex and trajectory-based) dependency of the parameters on the state of the algorithm needs to be taken into account, and (iii) all bounds need to be very tight to obtain an overall bound that is precise enough to give meaningful results for concrete problem dimensions (and not just for the asymptotic behavior). In lack of precise methods to derive optimal algorithm behavior, several author teams have resorted to computing actions that maximize the step-wise progress, in the hope that this "drift maximization" is close to the true optimal parameter setting policy. This step-wise progress maximization was either done exactly [25] or via Monte Carlo simulations [26]. It is known, however, that even for the ONEMAX problem with perfect fitness-distance correlation, the drift-maximization strategy is only close to being optimal [23], but not strictly optimal [27], [28]. A different approach to compute precise lower bounds for algorithms with dynamic parameter settings was therefore taken in [27]. For the problem of minimizing the expected running time of (1+1)-type evolutionary algorithms on ONEMAX, the authors first derive an exact method to compute the optimal dynamic parameter choices and then evaluate this formula for concrete problem dimensions using a dynamic programming approach. The work was greatly extended in [28], by considering different (1+λ)-type heuristics and by not only computing the optimal parameter choices at each state, but also the cost of deviating from them.

Our contribution:
We demonstrate in this work how the approach taken in [27], [28] can be extended to settings in which we do not necessarily have a closed-form expression from which we can derive exact parameter settings. Our overall approach is still based on dynamic programming, but we replace the exact formulae that model the probabilities to transition between different states by Monte Carlo simulations. That is, we take a similar approach as in [26], but we do not simply maximize the expected drift; instead, we use these approximated probabilities to derive strategies that minimize the expected running time – the quantity that most running time analyses focus on. We apply our approach to the RUGGEDNESS function suggested in [29]. RUGGEDNESS extends the classical jump function [30] by introducing several small jumps that the algorithms need to make in order to find the optimal solution; see Section IV-A for a formal definition. We have chosen this example because in [31] we observed that classical parameter control schemes essentially fail on RUGGEDNESS, in the sense that they show worse performance than the (1+λ) evolutionary algorithm (EA) with static mutation rate 1/n (with n denoting the dimension of the problem). Interestingly, a better performance is obtained when the mutation rates are capped at 1/n instead of 1/n²; see Figure 1 for an illustration. That is, despite having less flexibility, the algorithms perform better. High-performing parameter control mechanisms should not suffer from more flexibility, and we therefore see this example as an interesting use-case to identify deficiencies of state-of-the-art parameter control techniques, and to develop appropriate remedies to mitigate these.

We suggested in [28] heatmaps to illustrate the behavior of parameter control schemes. While indeed being very useful to bound the regret (i.e., the cost of deviating from the optimal parameter setting) of a single iteration, these heatmaps turned out to be suboptimal when analyzing the whole search trajectory. We therefore suggest in this work to complement them with plots that illustrate the regret per iteration. With these plots, one can now make a clear distinction between quick cheap "hiccups" and long-term harmful deviations.

Implications for Ruggedness:
The application of our method to RUGGEDNESS shows that the (1+λ) EA with static mutation rate 1/n is around 50-75% worse than the (1+λ) EA variant using optimal dynamic mutation rates. We also illustrate the usefulness of our regret plots by analyzing why the (1+λ) EA_{A,b} with the generalized 1/5-th success rule, shown efficient for other problems in [6], fails on this problem.

Impact and limitations of our work:
While the precise values of the lower bound and of the individual regrets for RUGGEDNESS may be of interest to a rather specialized community only, we believe that our general approach of hybridizing Monte Carlo simulation with dynamic programming should be of interest to everyone willing to derive precise lower bounds against which sampling-based optimization algorithms can be compared. At present, our work is restricted to settings in which states are never visited twice. In practice, this limits our work to greedy ("elitist") algorithms. Extensions to algorithms and/or problems in which states may be visited more than once form one of the most challenging directions for future work, on which we comment in more detail in Section V.
Structure of the paper:
The (1+λ) EA framework and its main variants studied in [31] are introduced in Section II. In Section III we describe our hybrid dynamic programming approach using Monte Carlo simulation. In Section IV we present the results of this approach applied to derive the dynamic parameter settings which minimize the expected running time of the (1+λ) EA on RUGGEDNESS. We comment on further extensions of our approach in Section V. Section VI concludes the paper.

Algorithm 1: A family of (1+λ)-type algorithms
  Data: n: problem size; f: {0,1}^n → ℝ: function to maximize; λ: population size; D(p): a family of parameterized distributions over [0..n]
  Sample parent x ∈ {0,1}^n uniformly at random;
  for t ← 1, 2, ... do
    for i ∈ [1..λ] do
      Choose a distribution parameter p_i^t;
      Sample k_i ∼ D(p_i^t), the number of bits to flip;
      Create y_i by flipping k_i bits in x chosen uniformly at random without replacement;
    Select x ← argmax_{z ∈ {x, y_1, ..., y_λ}} f(z);

II. (1+λ) EVOLUTIONARY ALGORITHMS
We present in Algorithm 1 a fairly general family of (1+λ) EAs, for the context of maximizing a "fitness" function f: {0,1}^n → ℝ defined on bit strings. Algorithms covered by this framework are initialized by selecting their first solution candidate uniformly at random. In each iteration, λ "offspring" are sampled by modifying the best-so-far solution x. The modification is done by unary unbiased variation operators [8], which – according to a characterization provided in [23] – are defined by fixing a distribution D over the set of possible mutation strengths k ∈ [0..n] := {0, 1, 2, ..., n}. The operator first samples a mutation strength k from this distribution and then flips k uniformly selected, pairwise different bits in x. The best-so-far solution "survives" and determines the center of variation in the following iteration. Ties can be broken arbitrarily; in practice and in the remainder of this work we assume uniform selection among the best offspring (if at least one of them is at least as good as the parent). We do not specify in Algorithm 1 a stopping criterion, because we are – in line with the majority of running time analysis papers – interested in the expected optimization time, i.e., the average number of function evaluations that an algorithm performs until it evaluates for the first time a solution x* with f(x*) ≥ f(y) for all y ∈ {0,1}^n. In [31], the following (1+λ) EA variants were studied:

(1) Shift (1+λ) EA:
The standard (1+λ) EA with mutation rate p is Algorithm 1 with D(p_i^t) = B(n, p) for all t ∈ ℕ, i ∈ [1..λ]. (The two most commonly used unary unbiased variation operators are 1-bit-flips and standard bit mutation (SBM). 1-bit-flips, as used within Randomized Local Search (RLS), correspond to the 1-point distribution P[k = 1] = 1, whereas SBM, used by many evolutionary algorithms, is exactly the operator characterized by the binomial distribution B(n, p).) It was argued in [32], [33] that for (1+λ) schemes this operator should not be used in practice, since it assigns probability mass to flipping 0 bits, which cannot advance the search. Two strategies were suggested to mitigate this unfavorable behavior: "shifting" all probability mass from k = 0 to k = 1, and "resampling" k until k > 0. We use the shift strategy in all algorithms considered in this work, i.e., we use the mutation operator characterized by the distribution B₀→₁(n, p), which has P[ℓ = 1 | ℓ ∼ B₀→₁(n, p)] = P[ℓ ≤ 1 | ℓ ∼ B(n, p)] and P[ℓ = k | ℓ ∼ B₀→₁(n, p)] = P[ℓ = k | ℓ ∼ B(n, p)] for k ≥ 2. Where p is not specified, we tacitly assume p = 1/n.

(2) (1+λ) EA_{A,b}: This algorithm, analyzed in [6], uses a multiplicative parameter update scheme inspired by the 1/5-th success rule [34]. Our version uses shift mutation (as introduced above) instead of resampling mutation, but the update of p remains the same as in [6] with A = 2 and b = 1/2. That is, p is initialized as 1/n. At the end of each iteration, p is updated to 2p if max{f(y_1), ..., f(y_λ)} ≥ f(x) and to p/2 otherwise.

(3) (1+λ) EA_{r/2,2r}: This algorithm was suggested in [35]. It creates y_1, ..., y_{λ/2} by the shift mutation with mutation rate p/2 and the other offspring, y_{λ/2+1}, ..., y_λ, by the shift mutation with mutation rate 2p. Let f_h = max{f(y_1), ..., f(y_{λ/2})} and f_d = max{f(y_{λ/2+1}), ..., f(y_λ)}. If f_h > f_d, p is updated to p/2 with probability 3/4 and to 2p otherwise. If f_d > f_h, p is updated to 2p with probability 3/4 and to p/2 otherwise. If f_d = f_h, p is updated to either p/2 or 2p equiprobably.

(4) (1+λ) HQEA:
This algorithm was suggested in [31]. It hybridizes reinforcement learning with the multiplicative update used within the (1+λ) EA_{A,b}. We cannot present the details, for reasons of space, and refer the interested reader to [31]. For this work, it is interesting to note that the algorithm also uses shift mutation, with a mutation rate p that is updated after each iteration. All λ offspring are sampled from the same distribution.

Capping of the mutation rate p: An advantage of the shift mutation operator is that it converges to the 1-bit-flip operator when p → 0. It can therefore be used to interpolate between global and local search. However, we typically want to maintain some probability of escaping local optima. In practice, p is therefore often capped to remain within an interval [p_min, p_max]. In our work, when considering parameter control methods, we fix p_max = 1/2. As discussed in the introduction, p_min can have a decisive influence on the efficiency of the algorithms, as can be seen in the example on the right side of Figure 1. We study two values, p_min = 1/n and p_min = 1/n².

Scope of our work:
We will focus in the following on computing a lower bound for the family of algorithms that follow Algorithm 1, but which use identical distributions to sample the λ offspring (i.e., we require that D(p_i^t) = D(p_j^t) for all i, j ∈ [λ]). We also restrict the algorithms to those in which D(p^t) (using the previous convention, we will from now on drop the index i) may depend on n and on f(x), but not on x, nor on t, nor on any other information. This assumption is irrelevant for our use-case, but the structure of x can have an important impact in general, as is easily seen for the LEADINGONES function, for which f(00...0) = f(01...1) = 0, but the optimal mutation rate for 00...0 is one, whereas the optimal one for 01...1 is p_min.
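To make the framework of Section II concrete, the following is a minimal Python sketch of Algorithm 1 with the shift mutation B₀→₁(n, p) and a static mutation rate; all function names are ours, and ONEMAX stands in for the fitness function f. It is an illustration under these assumptions, not the exact implementation used in the paper.

```python
import random

def shift_binomial(n, p, rng):
    # Sample from B_{0->1}(n, p): draw k ~ Binomial(n, p) and move all
    # probability mass from k = 0 to k = 1 (the "shift" strategy).
    k = sum(rng.random() < p for _ in range(n))
    return max(k, 1)

def one_plus_lambda_ea(f, n, lam, p, rng, max_iters=10**6):
    # (1+lambda) EA with static mutation rate p and shift mutation.
    # Elitist: the parent is replaced only by an offspring that is at
    # least as good. Returns (final x, number of iterations used).
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = f(x)
    for t in range(1, max_iters + 1):
        best_y, best_fy = None, -float("inf")
        for _ in range(lam):
            k = shift_binomial(n, p, rng)
            y = x[:]
            for i in rng.sample(range(n), k):  # k distinct bit flips
                y[i] ^= 1
            fy = f(y)
            if fy > best_fy:
                best_y, best_fy = y, fy
        if best_fy >= fx:
            x, fx = best_y, best_fy
        if fx == n:  # here we assume the optimum has fitness n
            return x, t
    return x, max_iters

rng = random.Random(42)
x, iters = one_plus_lambda_ea(sum, 20, 4, 1 / 20, rng)  # ONEMAX, n = 20
```

Since the shift distribution always flips at least one bit, the algorithm cannot waste iterations on zero-bit mutations, which is exactly the motivation given above for preferring B₀→₁ over plain SBM in (1+λ) schemes.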
III. ALGORITHM DESCRIPTION

In this section we describe our algorithm that computes good choices of distribution parameters p for each possible fitness value of the parent.

A. High-Level Description
Our algorithm is based on dynamic programming. We iterate over the possible fitness values, starting from the second best and continuing towards the smallest value. For each fitness value we aim at finding the best possible parameter value that minimizes the remaining expected running time, as well as the remaining time itself, assuming that the (1+λ) EA will subsequently choose the optimal parameter values for all better fitness values. As a side effect we also compute the remaining expected running times for a number of parameter values, which will later be useful to evaluate the regrets associated with particular parameter control schemes. While doing so, we assume, similarly to [27], [28], that for all higher fitness values the best possible expected running times are already computed. However, since we aim at dealing with various fitness functions, we use the Monte Carlo approach to approximate transition probabilities instead.
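The high-level scheme just described can be sketched as follows. This toy version is our own illustration: it uses ONEMAX as a stand-in fitness, a fixed grid of candidate rates, and a fixed number of Monte Carlo samples per estimate, and it omits the early-stopping rule based on successful iterations that the detailed description introduces.

```python
import random
from math import inf

def estimate_transitions(n, lam, f, p, n_iters, rng):
    # Monte Carlo estimate of the probability that one iteration of the
    # (1+lambda) EA with shift mutation rate p lifts a parent of fitness
    # f to fitness f + i (i = 0 covers "no strict improvement").
    parent = [1] * f + [0] * (n - f)  # for ONEMAX any string of fitness f works
    counts = {}
    for _ in range(n_iters):
        best = f  # elitist: staying at fitness f is the worst outcome
        for _ in range(lam):
            k = max(sum(rng.random() < p for _ in range(n)), 1)  # B_{0->1}(n, p)
            y = parent[:]
            for j in rng.sample(range(n), k):
                y[j] ^= 1
            best = max(best, sum(y))
        counts[best - f] = counts.get(best - f, 0) + 1
    return {i: c / n_iters for i, c in counts.items()}

def hybrid_dp(n, lam, rates, n_iters=3000, rng=None):
    # Backward dynamic programming over fitness values: T[f] estimates
    # the optimal expected number of remaining iterations from fitness f.
    rng = rng or random.Random(0)
    T = [0.0] * (n + 1)  # T[n] = 0: the optimum is already found
    for f in range(n - 1, -1, -1):
        best_time = inf
        for p in rates:
            probs = estimate_transitions(n, lam, f, p, n_iters, rng)
            stay = probs.get(0, 0.0)
            if stay == 1.0:
                continue  # no improving transition was ever sampled
            # T_{f,p} = (1 + sum_{i>0} q_i * T[f+i]) / (1 - q_0)
            t = (1 + sum(q * T[f + i] for i, q in probs.items() if i > 0)) / (1 - stay)
            best_time = min(best_time, t)
        T[f] = best_time
    return T

T = hybrid_dp(10, 2, [0.05, 0.1, 0.2])
```

The backward order is what makes the elitism assumption essential: T[f] is computed only from T[f'] with f' > f, so every needed value is already available when it is used.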
B. Requirements for the Fitness Function
Before diving into the details, we first discuss the limitations of this algorithm. Our main limitation follows from our assumption that, apart from the problem size, the optimal choice of the distribution parameter p depends on the fitness value exclusively. Hence, if two individuals have the same fitness but a different structure, this may result in different transition probabilities to higher fitness values. Failing to account for that may result in both overestimation and underestimation of running times. For this reason, we formulate our requirement to the fitness function as follows.

Requirement 1. For any two individuals x₁ and x₂ such that f(x₁) = f(x₂):
• either both x₁ and x₂ shall be optima;
• or for each fitness value f' > f(x₁), assuming Y = {y | f(y) = f'} is the set of bit strings with fitness f', there shall exist a bijective mapping π: Y → Y such that for each y ∈ Y the following transition probabilities are equal: P[x₁ → y] = P[x₂ → π(y)].

When applied inductively, this requirement informally means that the fitness value unambiguously determines the further stochastic behavior of the evolutionary algorithm. Note that some popular benchmark fitness functions, such as ONEMAX, as well as the function RUGGEDNESS used in this study, satisfy this requirement. Some other functions satisfy it only partially: for instance, with certain definitions of the JUMP function, the individuals that form the small fitness valley may have the same fitness but be structurally different. However, in this particular case the probability that the parent of the (1+λ) EA becomes such an individual during the optimization run is overwhelmingly small, so we can still apply our algorithm and get results that are imprecise only up to a factor whose difference from one is negligible.

Algorithm 2: High-level outline of the hybrid algorithm
  f_min, f_max ← minimum and maximum fitness values;
  Initialize optimal times: T*_{f_max} ← 0;
  for f ← f_max − 1, ..., f_min do
    for p ∈ {p_1^(f), p_2^(f), ..., p_{m_f}^(f)} do
      Compute approximate probabilities (p̃_i)_{i=0,1,...} of increasing the fitness by i with mutation rate p using the Monte Carlo approach;
      T_{f,p} ← (1 / (1 − p̃_0)) · (1 + Σ_{i>f} T*_i · p̃_{i−f});
    Store optimal time: T*_f ← min_p T_{f,p};
    Store optimal rate: P^opt_f ← argmin_p T_{f,p};
  return {P^opt, T*, T}

C. Detailed Description
The high-level pseudocode of the proposed algorithm is given in Algorithm 2. We maintain T*_f, our approximation of the optimal expected remaining time starting at fitness f until the optimum is hit. This value for the maximum fitness f_max, which corresponds to the optimum, is obviously zero: T*_{f_max} = 0. We compute the remaining values starting from f_max − 1 and stepping down until the minimum fitness value f_min is processed. Since the algorithms from the (1+λ) EA family are elitist, T*_f depends only on T*_{f'} with f' > f, but not on other entries.

To evaluate a particular T*_f, we evaluate the expected remaining times T_{f,p} starting at fitness f until the optimum is hit, provided that (i) while the parent's fitness is f, the (1+λ) EA uses the mutation rate p, and (ii) when the parent's fitness is updated, the (1+λ) EA uses the previously computed optimal mutation rates. Since we aim at an easy-to-compute approximation scheme that does not depend on the structure of the fitness function, we employ the following simplifications.
• Instead of testing all possible p, which is computationally infeasible, we use a finite set of mutation rates {p_1^(f), ..., p_{m_f}^(f)} that may depend on f. To obtain a good approximation, this set shall be dense enough around the assumed optimal mutation rate, but at the same time it should have a small enough size so that the computation finishes in affordable time. We detail our choices of these sets later in Section III-D.
• Instead of analytically computing the transition probabilities of the (1+λ) EA from the current fitness f to each larger fitness f' (which may be very hard, error-prone and computationally demanding for certain problems), we use the Monte Carlo approach, which amounts to simulating a part of the run of the (1+λ) EA.

The Monte Carlo evaluation of T_{f,p} is performed as follows.
1) Choose a large number of modeled iterations, N_I.
2) Choose a large number N_T of successful iterations (the iterations that resulted in a strict fitness increase) to wait for: if the transitions are easy enough, N_T successes already make a good estimation of the majority of likely transitions, so we do not execute all N_I iterations and hence save computational resources.
3) Find an individual with fitness f and set it as the parent individual for the next iterations of the (1+λ) EA.
4) Perform N_I iterations of the (1+λ) EA (or fewer if N_T is hit earlier); however, at the end of each iteration do not update the parent regardless of the outcome. Instead, for the j-th iteration we record the fitness f'_j of the individual that would otherwise become the next parent.
Note that the correctness of not updating the parent even in the case of a neutral move is motivated by Requirement 1: any individual with fitness f induces the same behavior, so the execution of a neutral move may be safely omitted.

Let N_A ≤ N_I be the actual number of simulated iterations. The probability estimations p̃_i of increasing the fitness by i are computed as p̃_i = |{j | 1 ≤ j ≤ N_A; f'_j = f + i}| / N_A. Note that, as the (1+λ) EA is elitist, p̃_0 also counts the occasions where the best offspring was worse than the parent. Finally, T_{f,p} is computed based on p̃_i and T*_{f'} for f' > f using the equation in line 6 of Algorithm 2, which is the solution of the recurrence relation T_{f,p} = 1 + p̃_0 · T_{f,p} + Σ_{i>0} p̃_i · T*_{f+i}.

D. Choice of Mutation Rate Sets
The choice of the mutation rate sets {p_1^(f), ..., p_{m_f}^(f)} has to be a trade-off between the accuracy of the resulting values and the computation time. However, the freedom to choose different sets for different fitness values partially reduces the effects of this trade-off. In particular, one may conduct preliminary experiments to see which probabilities for which fitness values are most promising, and increase the coverage near these probabilities in more involved experiments. Technically, this opens the possibility of a self-adaptive scheme, where first some predefined grid is used for each fitness value, and then it is refined in the most promising regions to get better results. In this paper, however, we used a more conservative scheme detailed below. In our experiments, we used a union of two grids: the multiplicative grid of the form p_i = p_base · α^i, which we use for large fitness values and which spans relatively small probabilities, and the additive grid of the form p_i = p_base + i · p_step to cover the whole ranges of fitness values and probabilities. As the particular parameter values need to depend on the problem size, we give these values, as well as the Monte Carlo simulation parameters N_I and N_T, in the next section.

IV. EXAMPLE APPLICATION OF OUR APPROACH
In this section we outline several kinds of insights that can be derived from the results computed by the proposed algorithm. Some of them, namely lower bounds for parameter control methods, plots of optimal parameter values, and parameter efficiency heatmaps, have been previously proposed in [27], [28], whereas the regret plots are new to this paper.

A. The Ruggedness Function

We have chosen as use-case the RUGGEDNESS function introduced in [29] (function F6 there). It is basically a function of concatenated jumps. That is, assuming that the unique optimum is located in some bit string z, with fitness value n, all points at Hamming distance one from z have fitness n − 2, while those at distance two have fitness n − 1, those at distance three have fitness n − 4, those at distance four have fitness n − 3, and so on. Formally, letting OM_z(x) := |{i ∈ [1..n] | z_i = x_i}|, RUGGEDNESS assigns r(x) = n if OM_z(x) = n, r(x) = OM_z(x) + 1 if OM_z(x) ≡ n mod 2, and r(x) = OM_z(x) − 1 otherwise.

As mentioned above, all the mutation rate control methods considered in [31] perform worse on RUGGEDNESS than the shift (1+λ) EA with static mutation rate p = 1/n, see Fig. 1. This raises two important questions, which we will answer in the remainder of this section: (i) How far are the algorithms benchmarked in [31] from the best possible performance? (ii) What is the impact of the individual parameter choices that are made during the optimization process?

Our experiments consistently use problem size n = 100. This choice is motivated by three main factors: it is large enough to see absolute differences in algorithms' behavior, yet small enough to finish the Monte Carlo simulations, and it allows a straightforward comparison with the previous works [28], [31]. To compute the necessary data for RUGGEDNESS, we use the following parameters for our Monte Carlo approach: N_I = 10^…, N_T = 5 · 10^…. We use the same set of probabilities for each fitness, which is constructed using a multiplicative grid with the following parameters: p_base = 10⁻⁴ to match p_min = 1/n² = 10⁻⁴, and α = 10^{1/25}, so that 100 probabilities less than 1 are employed, whose logarithms are evenly distributed.

B. Lower Runtime Bounds for Parameter Control Methods
The near-optimal runtime values T*_f for each fitness f obtained from the proposed algorithm may be directly used to construct lower bounds on the expected running times possible for parameter control methods. For this, we compute for each fitness value f the probability p^init_f of hitting it with the first created individual. The value T = Σ_f T*_f · p^init_f, which is the expected runtime of the (1+λ) EA assuming near-optimal parameter choices, is then used as the lower bound.

Fig. 1 presents the comparison of these lower bounds, measured in iterations, with the performance of the standard (1+λ) EA, as well as of the (1+λ) EA employing a few existing parameter control methods, on RUGGEDNESS and, for comparison, on ONEMAX, all using the shift mutation. For ONEMAX we use the exact method for computing the lower bounds from [28]. For the parameter control methods we additionally employ two different lower bounds on the mutation rate, p_min = 1/n and p_min = 1/n².

Now we compare the insights achieved for RUGGEDNESS with the ones achieved before. In [31] one was only able to see that some parameter control methods are far away from the (1+λ) EA when using p_min = 1/n², and that the situation becomes better when p_min = 1/n is used. More precisely, with the less generous lower bound on the mutation rate, all the considered methods are able to perform just as the (1+λ) EA. But is it possible to come up with a more efficient parameter control method that would outperform the (1+λ) EA? With the just obtained lower bounds on expected runtimes, represented by the green lines in Fig. 1, one can conclude that even if one makes the optimal mutation rate choices for each fitness value, only a relatively small constant-factor improvement would be possible. This gives an important bound which was not previously known.

One of the interesting observations on RUGGEDNESS, which can be derived from Fig. 1, is that the (1+λ) EA_{A,b} has a much worse performance than both the (1+λ) EA_{r/2,2r} and the (1+λ) EA. However, the reason for such a behavior cannot be derived from Fig. 1. We return to this question further in this section by looking more closely at how these methods choose mutation rates and how they relate to the best choices.

Fig. 1. Lower bounds for the expected running times of the (1+λ) EA using the binomial distribution with varying p for mutation, compared with the performance of various parameter control methods averaged over 100 independent runs. Problem size is n = 100 for all four plots. Legend entries are: (1+λ): the (1+λ) EA with fixed p = 1/n; 2-rate: the (1+λ) EA_{r/2,2r}; (A,b): the (1+λ) EA_{A,b}; HQEA: the (1+λ) HQEA.

Fig. 2. Best mutation rates for different values of λ (2, 4, 16, 64, 512) and fitness values. Problem size is n = 100 for both plots.
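Given the per-fitness values T*_f and the initial fitness distribution, the aggregation T = Σ_f T*_f · p^init_f is straightforward; a minimal sketch, where the function name and all numbers are ours and purely illustrative, not data from the paper:

```python
def runtime_lower_bound(T_star, p_init):
    # Combine near-optimal remaining times T*_f with the distribution
    # p_init[f] of the first individual's fitness: T = sum_f T*_f * p_init[f].
    assert abs(sum(p_init.values()) - 1.0) < 1e-9  # p_init is a distribution
    return sum(p_init[f] * T_star[f] for f in p_init)

# Hypothetical values for a problem with four fitness levels:
T_star = {0: 30.0, 1: 18.0, 2: 7.0, 3: 0.0}
p_init = {0: 0.1, 1: 0.4, 2: 0.4, 3: 0.1}
bound = runtime_lower_bound(T_star, p_init)  # 0.1*30 + 0.4*18 + 0.4*7 = 13.0
```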
C. Plots of Optimal Parameter Values
The observation of running times alone cannot provide deeper insights about the problem structure, including the optimal mutation rates for different fitness values. Such knowledge can shed some light on which issues a parameter control method may face when the (1+λ) EA solves the problem in question. The plots of near-optimal mutation rates, which the proposed algorithm provides, may serve this purpose. What is more, one can infer the influence of the population size λ on the optimal parameters from such plots.

Fig. 2 presents the plots of near-optimal mutation rates as a function of the fitness value for ONEMAX, computed as in [28], and for RUGGEDNESS. One can see that for ONEMAX these plots demonstrate a steady tendency of decreasing while getting closer to the optimum, which matches the general expectations. At the same time, most of the plots for RUGGEDNESS, while having a similar general trend, show regular oscillations with a period of 2. Indeed, a small mutation rate is better for the (1+λ) EA to improve from a fitness value that has a different parity than n (by flipping a single bit), but at least two bits need to be flipped to improve from a fitness value with the same parity as n, which requires a larger mutation rate that tends to 2/n as one gets closer to the optimum. Note that such optimal parameters may be difficult to track precisely by most parameter control methods, which assume a slow change of the optimal mutation rate.

D. Parameter Efficiency Heatmaps
Since Algorithm 2 provides expected runtimes T_{f,p} for each fitness value and mutation rate, assuming that for higher fitness values the optimal mutation rates are chosen, we can use this data to estimate the relative efficiency of the mutation rate choices for each fitness value. In Fig. 3 we present this information as a heatmap for n = 100 and λ = 512. Each cell of the heatmap corresponds to a pair of f and p, where the following p are chosen: p = 10^(−i/…) for … ≤ i ≤ ….

The colors of the cells represent the relative efficiency of the corresponding p among all mutation rates for the corresponding f. They are derived from the value C_{f,p} = exp(α_f · (T*_f − T_{f,p})), where α_f ≥ 0 is chosen in such a way that at least half of the C_{f,p} are at least 0.5. This is done to display enough information about at least 50% of the possible choices, while not artificially emphasizing differences that are negligible.

Using Fig. 3, one can make the following observations about dynamic mutation rates for the (1 + λ) EA on RUGGEDNESS:
• Overall, near-optimal mutation rates start with a rather high value at the beginning of the optimization and show a trend to gradually decrease towards the optimum.
• This trend is non-monotone. By comparison with ONEMAX, one can see that for even fitness values RUGGEDNESS shows behavior similar to ONEMAX. However, for odd fitness values the best mutation rates are higher.
These observations agree well with the findings from the previous subsection and Fig. 2. However, one can also see that for most fitness values, except for the few last ones, there is a sufficiently wide range of nearly equally good parameters, which shows that most parameter control methods have, in theory, a fair chance of maintaining good mutation rates until the last moments.

Fig. 3 also shows traces of the actual mutation rates chosen by the (1 + λ) EA_{r/2,2r} and (1 + λ) EA_{(A,b)} algorithms on RUGGEDNESS, both with p_min = 1/n, for the corresponding fitness values. The following observations can be made:
• Both algorithms quickly, in less than 10 iterations, reach near-optimal mutation rates at a fitness of roughly 70.
• For a fitness of roughly 80, both algorithms choose reasonably good values often enough, which ensures some progress, with the (1 + λ) EA_{r/2,2r} doing slightly better.
• For the last few fitness values, both algorithms visit odd fitness values exclusively, but the (1 + λ) EA_{r/2,2r} still chooses good mutation rates, whereas the (1 + λ) EA_{(A,b)} gets stuck with too small rates.
The latter observation clearly points at the behavior that causes the poor performance of the (1 + λ) EA_{(A,b)} seen in Fig. 1. It appears that the (A, b) adaptation rule forces a switch to local search as progress slows down, which drives the mutation rate down to p_min = 1/n. Such a rate makes the chance of finding two-bit improvements a factor of Θ(n) smaller, which explains the large performance gap observed in Fig. 1.

Fig. 3. Parameter efficiency heatmaps for the (1 + λ) EA, n = 100, λ = 512 (panels: ONEMAX and RUGGEDNESS; axes: fitness vs. scaled mutation rate p · n). The heatmap for RUGGEDNESS also contains traces of the mutation rate choices in runs of the (1 + λ) EA_{r/2,2r} (black) and the (1 + λ) EA_{(A,b)} (red), both with p_min = 1/n. Color denotes the relative efficiency, the larger the better.

Fig. 4. Difference between the expected runtimes for the best mutation rate choice and the one actually chosen by the parameter control algorithms in example runs (both plots: regret in runtime against the iteration number, λ = 512). The left plot shows a complete run of the (1 + λ) EA_{r/2,2r}. The right plot shows selected iterations of the (1 + λ) EA_{(A,b)}; the omitted iterations have the same fitness and regret. Each red line denotes an artificial visual boundary between finite expected runtimes (below) and infinite ones (above).

E. Regret Plots
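The regret values examined in this section can be computed directly from the tables T_{f,p} and the per-fitness optima T*_f produced by the dynamic program. A minimal sketch follows, where the run trace format and all names are assumptions made for illustration, not our actual implementation:

```python
import math

def regret_trace(trace, T, T_star):
    """Per-iteration regret of a parameter control run.

    trace:  list of (f, p) pairs, one per iteration, holding the parent's
            fitness value f and the index p of the chosen mutation rate.
    T:      T[f][p] is the expected remaining runtime when rate p is used
            at fitness f (and optimal rates are used afterwards).
    T_star: T_star[f] is the optimal such runtime, i.e. the minimum over p.
    An infinite expected runtime yields an infinite regret."""
    return [abs(T[f][p] - T_star[f]) if math.isfinite(T[f][p]) else math.inf
            for f, p in trace]
```

Plotting this list against the iteration number, with a color change at every change of f, reproduces the structure of the plots in Fig. 4.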
Overlaying the parameter choices of particular parameter control methods on the parameter efficiency heatmaps may indicate the presence of problems in a method, and how large the deviation is. However, it cannot show how much performance the method loses by acting suboptimally.

A possible solution is to show how much the method loses to the optimal choice in each iteration. More precisely, for each iteration we take the parameter value p chosen by the parameter control method in question for the current parent's fitness value f, and plot the value |T_{f,p} − T*_f| for that iteration. Such regret plots are presented in Fig. 4. We change the plot color whenever the fitness value changes, so that one can see which iterations share the same fitness value.

We can observe in the example runs in Fig. 4 that both methods initially reduce their regret quickly. However, the (1 + λ) EA_{r/2,2r} keeps it at a low level, whereas the (1 + λ) EA_{(A,b)} shows a more complicated behavior. First, it features two peaks of local regret maxima, which match the two occasions of too high mutation rates seen in Fig. 3. It then quickly moves through a good region and falls down to small mutation rates, where it spends most of its time with very large regrets, whose sum dominates the total running time.

Note that, occasionally, a parameter control method may enter a region where T_{f,p} reaches very high values, or where it is even infinite (which happens twice in Fig. 4). If the method is quick enough to return to better mutation rates, this does not harm the performance, as is the case for the (1 + λ) EA_{r/2,2r}.

V. GENERALIZATION OF THE HYBRID APPROACH
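One generalization concerns states of equal fitness that can be revisited: their expected remaining runtimes then satisfy a linear system rather than a single backward recurrence. A minimal sketch of how such a per-level system could be solved; the interface and names are our own illustrative assumptions:

```python
def level_expected_times(P_stay, b):
    """Expected remaining runtimes for the states of one fitness level
    when these states may transition into each other.

    P_stay: P_stay[i][j] is the probability that one iteration moves
            level-state i to level-state j (staying on the level).
    b:      b[i] is the cost of one iteration plus the probability-weighted
            expected remaining runtimes of transitions leaving the level
            (known from the dynamic program for higher fitness values).
    Solves T = b + P_stay T, i.e. (I - P_stay) T = b, by Gaussian
    elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [I - P_stay | b].
    A = [[(1.0 if i == j else 0.0) - P_stay[i][j] for j in range(n)] + [b[i]]
         for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    T = [0.0] * n
    for i in reversed(range(n)):
        T[i] = (A[i][n] - sum(A[i][j] * T[j] for j in range(i + 1, n))) / A[i][i]
    return T
```

For a single state that stays in place with probability 1/2 and otherwise leaves the level towards the optimum, this recovers the familiar expected waiting time of 2 iterations.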
As discussed in the introduction, our hybrid approach is at the moment restricted to settings in which states are not visited more than once. This can be resolved, at the price of increased computational complexity, by constructing Markov chains on all states with equal fitness and solving the resulting systems of equations. However, this may also require a revision of the definitions used, such as the one for regret: if the mutation rate is fixed to some value for one of the states, globally suboptimal choices for the other states may actually make the runtime smaller.

Our approach is, in fact, not limited to (1 + λ)-type algorithms: since it simulates an iteration of the algorithm as a parameterized black box, a wide range of algorithms can be modeled, which is much harder with exact approaches.

For certain problems, as well as for population-based algorithms, the number of states may become too large to consider them all explicitly. While we see no obvious solution to this problem yet, state-merging approaches might be an option in some of these cases.

VI. CONCLUSIONS
We have extended in this work the dynamic programming approach for computing optimal state-dependent, dynamic mutation rates suggested in [27], [28] to settings in which the transition probabilities cannot necessarily be computed by closed-form expressions, but need to be approximated by Monte Carlo simulations. We have applied this approach to derive optimal parameter choices for the (1 + λ) EA on RUGGEDNESS. We have also introduced regret plots, which not only show deficiencies in parameter control methods, but also indicate their impact on the running time.

We plan to extend our work to the more complicated settings detailed in Section V, as well as to use the results presented in this work to improve some of the parameter control methods.
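The Monte Carlo approximation of transition probabilities at the heart of this approach amounts to repeatedly simulating one iteration of the algorithm as a black box. A minimal sketch for a (1 + λ) EA with standard bit mutation; the interface and names are illustrative, not our actual implementation:

```python
import random
from collections import Counter

def estimate_transitions(parent, p, lam, fitness, samples=1000):
    """Empirical distribution of the next parent's fitness after one
    (1 + lambda) EA iteration: each of lam offspring flips every bit of
    `parent` independently with probability p, and elitist selection
    keeps the best fitness among the parent and the offspring."""
    counts = Counter()
    for _ in range(samples):
        best = fitness(parent)  # elitism: the parent competes as well
        for _ in range(lam):
            child = [b ^ (random.random() < p) for b in parent]
            best = max(best, fitness(child))
        counts[best] += 1
    return {g: c / samples for g, c in counts.items()}
```

These relative frequencies replace the exact transition probabilities in the dynamic program; the sampling error can be made arbitrarily small by increasing `samples`.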
Acknowledgments.
The reported study was funded by RFBR and CNRS, project number 20-51-15009, and by the Paris Ile-de-France region.

REFERENCES
[1] B. Doerr and F. Neumann, Eds., Theory of Evolutionary Computation: Recent Developments in Discrete Optimization. Springer, 2020.
[2] C. Doerr, "Complexity theory for black-box optimization heuristics," in Theory of Evolutionary Computation: Recent Developments in Discrete Optimization. Springer, 2020.
[3] G. Badkobeh, P. K. Lehre, and D. Sudholt, "Unbiased black-box complexity of parallel search," in PPSN. Springer, 2014.
[4] P. K. Lehre and D. Sudholt, "Parallel black-box complexity with tail bounds," IEEE Trans. Evol. Comp., vol. 24, no. 6, 2020.
[5] S. Droste, T. Jansen, and I. Wegener, "Upper and lower bounds for randomized search heuristics in black-box optimization," Theory of Computing Systems, vol. 39, 2006.
[6] C. Doerr and M. Wagner, "Simple on-the-fly parameter selection mechanisms for two classical discrete black-box optimization benchmark problems," in GECCO. ACM, 2018.
[7] C. Doerr and J. Lengler, "Introducing elitist black-box models: When does elitist behavior weaken the performance of evolutionary algorithms?" Evol. Comp., vol. 25, 2017.
[8] P. K. Lehre and C. Witt, "Black-box search by unbiased variation," Algorithmica, vol. 64, no. 4, 2012.
[9] B. Doerr, T. Kötzing, J. Lengler, and C. Winzen, "Black-box complexities of combinatorial problems," Theor. Comput. Sci., vol. 471, 2013.
[10] J. Rowe and M. Vose, "Unbiased black box search algorithms," in GECCO. ACM, 2011.
[11] B. Doerr and C. Winzen, "Playing Mastermind with constant-size memory," Theory of Computing Systems, vol. 55, 2014.
[12] C. Witt, "Tight bounds on the optimization time of a randomized search heuristic on linear functions," Combinatorics, Probability & Computing, vol. 22, 2013.
[13] D. Sudholt, "A new method for lower bounds on the running time of evolutionary algorithms," IEEE Trans. Evol. Comp., vol. 17, 2013.
[14] C. Gießen and C. Witt, "The interplay of population size and mutation probability in the (1+λ) EA on OneMax," Algorithmica, vol. 78, no. 2, 2017.
[15] P. S. Oliveto, D. Sudholt, and C. Witt, "A tight lower bound on the expected runtime of standard steady state genetic algorithms," in GECCO. ACM, 2020.
[16] F. Chicano, A. M. Sutton, L. D. Whitley, and E. Alba, "Fitness probability distribution of bit-flip mutation," Evolutionary Computation, vol. 23, no. 2, 2015.
[17] C. Gießen and C. Witt, "Optimal mutation rates for the (1 + λ) EA on OneMax through asymptotically tight drift analysis," Algorithmica, vol. 80, 2018.
[18] G. Karafotias, M. Hoogendoorn, and A. Eiben, "Parameter control in evolutionary algorithms: Trends and challenges," IEEE Trans. Evol. Comp., vol. 19, 2015.
[19] A. Aleti and I. Moser, "A systematic literature review of adaptive parameter control methods for evolutionary algorithms," ACM Computing Surveys, vol. 49, 2016.
[20] A. E. Eiben, R. Hinterding, and Z. Michalewicz, "Parameter control in evolutionary algorithms," IEEE Trans. Evol. Comp., vol. 3, 1999.
[21] B. Doerr and C. Doerr, "Theory of parameter control mechanisms for discrete black-box optimization: Provable performance gains through dynamic parameter choices," in Theory of Evolutionary Computation: Recent Developments in Discrete Optimization. Springer, 2020.
[22] S. Böttcher, B. Doerr, and F. Neumann, "Optimal fixed and adaptive mutation rates for the LeadingOnes problem," in PPSN. Springer, 2010.
[23] B. Doerr, C. Doerr, and J. Yang, "Optimal parameter choices via precise black-box analysis," Theor. Comput. Sci., vol. 801, 2020.
[24] A. Lissovoi, P. S. Oliveto, and J. A. Warwicker, "Simple hyper-heuristics control the neighbourhood size of randomised local search optimally for LeadingOnes," Evol. Comp., vol. 28, 2020.
[25] T. Bäck, "The interaction of mutation rate, selection, and self-adaptation within a genetic algorithm," in PPSN. Elsevier, 1992.
[26] Á. Fialho, L. D. Costa, M. Schoenauer, and M. Sebag, "Extreme value based adaptive operator selection," in PPSN. Springer, 2008.
[27] N. Buskulic and C. Doerr, "Maximizing drift is not optimal for solving OneMax," Evolutionary Computation, 2021, to appear. DOI: 10.1162/evco_a_00290.
[28] M. Buzdalov and C. Doerr, "Optimal mutation rates for the (1 + λ) EA on OneMax," in PPSN. Springer, 2020.
[29] C. Doerr, F. Ye, N. Horesh, H. Wang, O. M. Shir, and T. Bäck, "Benchmarking discrete optimization heuristics with IOHprofiler," Applied Soft Computing, vol. 88, 2020.
[30] S. Droste, T. Jansen, and I. Wegener, "On the analysis of the (1+1) evolutionary algorithm," Theor. Comput. Sci., vol. 276, 2002.
[31] A. Buzdalova, C. Doerr, and A. Rodionova, "Hybridizing the 1/5-th success rule with Q-learning for controlling the mutation rate of an evolutionary algorithm," in PPSN. Springer, 2020.
[32] E. Carvalho Pinto and C. Doerr. (2018) Towards a more practice-aware runtime analysis of evolutionary algorithms. [Online]. Available: https://arxiv.org/abs/1812.00493
[33] E. C. Pinto and C. Doerr, "A simple proof for the usefulness of crossover in black-box optimization," in PPSN. Springer, 2018.
[34] I. Rechenberg,