Tuning as a Means of Assessing the Benefits of New Ideas in Interplay with Existing Algorithmic Modules
Jacob de Nobel, Diederick Vermetten, Hao Wang, Carola Doerr, Thomas Bäck
Jacob de Nobel, Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
Diederick Vermetten, Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
Hao Wang, Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
Carola Doerr, Sorbonne Université, CNRS, LIP6, Paris, France
Thomas Bäck, Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
ABSTRACT
Introducing new algorithmic ideas is a key part of the continuous improvement of existing optimization algorithms. However, when introducing a new component into an existing algorithm, assessing its potential benefits is a challenging task. Often, the component is added to a default implementation of the underlying algorithm and compared against a limited set of other variants. This assessment ignores any potential interplay with other algorithmic ideas that share the same base algorithm, which is critical in understanding the exact contributions being made.

We introduce a more extensive procedure, which uses hyperparameter tuning as a means of assessing the benefits of new algorithmic components. This allows for a more robust analysis by not only focusing on the impact on performance, but also by investigating how this performance is achieved.

We implement our suggestion in the context of the Modular CMA-ES framework, which was redesigned and extended to include some new modules and several new options for existing modules, mostly focused on the step-size adaptation method. Our analysis highlights the differences between these new modules, and identifies the situations in which they have the largest contribution.
1 INTRODUCTION

With the continuous increase in interest in the field of optimization, many new algorithms are introduced every year. A large number of these algorithms are not completely novel, but instead add new algorithmic ideas to existing methods. Originally referring to one particular algorithm, CMA-ES has developed into a whole family of algorithms that are built around the core design of the original CMA-ES algorithm from [20]. While this growth of the algorithm set helps to keep improving state-of-the-art performance, it also raises a simple question: "How can we assess the benefits of new algorithmic ideas?"

The naive way of performing such an assessment is to implement the algorithmic idea in a bare-bones version of the base algorithm, and to benchmark it against the default (and maybe some other variants). While this technique does manage to give an indication of the usefulness of the newly introduced component, the results are not always practical and hide important information, since they only consider the idea in isolation. Often, there tends to be an important interplay between algorithmic components, which is completely missed when doing the type of assessment described above.

We aim to provide in this work a roadmap for assessing these algorithmic ideas in a way which takes component interactions into account. This is achieved by considering the different algorithmic ideas as modules in a modular framework. Several of these types of frameworks have been developed over the years [10, 29, 31, 32, 38]. We will work here with the Modular CMA-ES (ModCMA), which significantly extends the existing ModEA framework [32], by both adding new modules and adding new options for existing modules (see Section 2 for details).

With this modular framework, we show in this work how hyperparameter tuning can be used to assess the contributions of the newly implemented components. We illustrate how this approach gives a detailed perspective on the benefits of new algorithmic ideas, by not only looking at pure performance metrics, but also considering the interplay with existing modules. We show, among other things, that the introduction of new step-size adaptation methods can be beneficial, but that it requires careful consideration of the interactions with other modules, such as the recombination weights. We also discuss the limitations of this approach, and how to best use it to gain the most understanding about these new algorithmic ideas.

*Equal contribution between first and second authors.
2 THE MODULAR CMA-ES

Our work relies heavily on the Modular Evolutionary Algorithms (ModEA) framework introduced in [32]. Since this framework hasn't undergone any active development in recent times, we decided to redesign the framework to our specifications. The modifications we made rendered the name of the framework no longer fitting, as only CMA-ES variants can now be created using the framework, whereas the original framework also supported the design of other evolutionary algorithms. The new framework was dubbed the Modular CMA-ES (ModCMA) and is available as an open-source Python package (https://github.com/IOHprofiler/ModularCMAES) within the IOHprofiler [14] environment. It is integrated with the IOHexperimenter, giving access to a broad set of benchmark problems, including a C++ implementation of the BBOB functions [19] from the COCO environment [18]. In addition, this allows for easy data logging, which can be used directly with the interactive performance analysis and visualization from IOHanalyzer [37].
Motivation.
The primary goal behind redesigning the framework was to reduce its complexity, and to only include functionality compatible with the CMA-ES and its variants. The reasoning behind this is that the framework mostly revolves around the CMA-ES. Other EAs are available in the framework, but are quite underdeveloped compared to the CMA-ES. Moreover, introducing working interactions between the CMA-ES and operators from other EAs overly complicates the framework's structure. For example, ModEA contains a range of different methods for performing recombination. However, the canonical CMA-ES does not explicitly perform recombination. Instead, it updates its mean m by taking a weighted average of the individuals in its current population, which it then uses to sample new individuals from a normal distribution. In other EAs, recombination occurs in a much more pronounced sense, for example by crossover. In order to make the modular version of the CMA-ES function with these other forms of recombination, its original method for "recombination" had to be adapted. The CMA-ES, however, is still only able to function properly with one of these recombination methods, the canonical one. As this pattern could be observed in other parts of the framework as well (i.e., mutation, selection), it was decided to remove these other methods altogether and to focus solely on the CMA-ES.

2.1 Modules

To design the Modular CMA-ES, we use the implementation from the popular CMA-ES tutorial [17] as a starting point. This work provides a detailed description of the CMA-ES algorithm, including a practical guide to its implementation. From this basic design, we separate the CMA-ES into a number of functionally related blocks, in order to allow customization of specific parts of the algorithm. This allows us to implement algorithmic variants of the CMA-ES as functional modules.
From a user perspective, any of these modules can then be combined in order to create a custom instantiation of the CMA-ES, by selecting an option for each available module. In ModEA, eleven such modules were already implemented. These were all reimplemented in the Modular CMA-ES, with a few changes to the structure of the options. Specifically, we removed Pairwise Selection as a module. Instead, we incorporated this option in the Mirrored Sampling module as the option Mirrored Sampling with Pairwise Selection, converting this module from binary to ternary. This is done because the pairwise selection method is not suited for use without mirrored sampling [3]. We implemented a new module for performing boundary correction (see Section 2.2), and added five alternative options for performing step-size adaptation (see Section 2.3). These two extensions to the framework will be the focus of our analysis throughout this work. This set of changes gives us the following list of modules for the redesigned Modular CMA-ES:

(1) Active Update: Bad candidate solutions are penalized in the covariance matrix update using negative weights [22]. Note that in [17], this is given as the default version; here we consider it to be optional.
(2) Elitism: (μ + λ)-selection instead of (μ, λ)-selection.
(3) Orthogonal Sampling: All the newly sampled points in the population are orthonormalized using a Gram-Schmidt procedure [35].
(4) Sequential Selection: Candidate solutions are immediately ranked and compared with the current best solution. If improvement is found, no additional objective function evaluations are performed [11].
(5) Threshold Convergence: A method for balancing exploration with exploitation, scaling the mutation vectors to a required length threshold, which decays over time [30].
(6) Step-Size Adaptation: Supplementary to the default Cumulative Step-size Adaptation (CSA), Two-Point step-size Adaptation (TPA) [15] is implemented. TPA requires two additional objective function evaluations, used for evaluating both a shorter and a longer version of the population's center of mass. The version which shows the better objective function value determines whether the step size should be increased or decreased. Five newly added mechanisms for performing step-size adaptation are implemented; they are described in detail in Section 2.3.
(7) Mirrored Sampling: For every newly sampled point, its mirror image is added to the population, by reversing its sign [3]. With Pairwise Selection, only the best point of each mirrored pair is used in recombination.
(8) Quasi-Gaussian Sampling: Instead of performing simple random sampling from the multivariate Gaussian, new solutions can alternatively be drawn from quasi-random sequences (a.k.a. low-discrepancy sequences) [7]. We implemented two options for this module, the Halton and Sobol sequences.
(9) Recombination Weights: Three options are implemented: 1) default weights (see [17]), 2) equal weights: w_i = 1/μ, and 3) w_i = 2^(−i) + 1/(λ·2^λ) for i = 1, 2, . . . , λ.
(10) Restart Strategy: When the optimization process stagnates, the CMA-ES can be restarted using a restart strategy. Two strategies are implemented: IPOP [5] increases the population size after every restart by a constant factor; BIPOP [16] also changes the size of the population, but alternates between larger and smaller population sizes.
(11) Boundary Correction: If candidate solutions are sampled outside the search domain, they can be transformed back into the search domain by applying a boundary correction operation. In Section 2.2, we describe the six options for performing boundary correction which have been implemented.

In Table 1, a detailed overview is given of all currently implemented modules and their options in the Modular CMA-ES framework.
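To make the recombination-weights options of module (9) concrete, the three schemes can be sketched as follows. This is an illustrative reconstruction, not the framework's API: the function name is ours, and the default-weight formula follows the CMA-ES tutorial [17].

```python
import numpy as np

def recombination_weights(lam: int, mu: int, option: str) -> np.ndarray:
    """Sketch of the three recombination-weight options for the mu best of lambda."""
    if option == "default":
        # Log-linear weights from the CMA-ES tutorial, normalized to sum to 1.
        w = np.log(lam / 2 + 0.5) - np.log(np.arange(1, mu + 1))
        return w / w.sum()
    if option == "equal":
        return np.full(mu, 1 / mu)
    if option == "1/2^i":
        # w_i = 2^{-i} + 1/(lambda * 2^lambda); note this option is defined
        # over all i = 1..lambda, as in the text.
        i = np.arange(1, lam + 1)
        return 2.0 ** -i + 1 / (lam * 2 ** lam)
    raise ValueError(f"unknown option: {option}")
```

The third scheme sums to exactly 1 over i = 1, …, λ, since the geometric part contributes 1 − 2^(−λ) and the constant term the remaining 2^(−λ).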
[Table 1: The modules available for the Modular CMA-ES. The numeric index for each module corresponds to the index used in the text of Section 2.1. Newly added modules/options are given in bold.]

2.2 Boundary Correction Methods

In the original framework, a boundary correction function taken from [25] was implemented, and always applied after each mutation. In some cases, however, this operator can degrade the performance of the algorithm quite drastically. We therefore decided to make the boundary correction operation optional, and to implement it as a module, so that it is only used when beneficial. A number of different boundary correction strategies were implemented, taken from [12]:

(1) None: no correction is applied to infeasible coordinates of solutions.
(2) Uniform Resample (UR): replaces all infeasible coordinates of a solution with new coordinates sampled uniformly at random within the search space.
(3) Mirror Correction Strategy (MCS): mirrors all infeasible coordinates of a solution with respect to the closest boundary.
(4) Complete One-tailed Normal Correction Strategy (COTN): all infeasible coordinates are replaced with new coordinates inside the search space, drawn from a rescaled one-sided normal distribution centered on the boundary.
(5) Saturation Correction Strategy (SCS): all infeasible coordinates are set to the closest corresponding bound.
(6) Toroidal Correction Strategy (TCS): all infeasible coordinates re-enter the search space from the opposite boundary (wrap-around).
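The strategies above act coordinate-wise and can be sketched as follows. This is an illustrative reconstruction based on the descriptions above: the function name and exact formulas are ours, and COTN is omitted since its rescaling details are not fully specified here.

```python
import numpy as np

def correct_bounds(x, lb, ub, strategy="saturate", rng=None):
    """Sketch of per-coordinate boundary correction strategies (Section 2.2).

    Only infeasible coordinates are modified; the strategy names mirror the
    options in the text, but the implementation is an illustrative sketch.
    """
    x, lb, ub = np.asarray(x, float), np.asarray(lb, float), np.asarray(ub, float)
    rng = rng or np.random.default_rng()
    out = x.copy()
    bad = (x < lb) | (x > ub)          # mask of infeasible coordinates
    width = ub - lb
    if strategy == "none":
        return out
    if strategy == "saturate":         # SCS: clip to the nearest bound
        return np.clip(x, lb, ub)
    if strategy == "uniform":          # UR: resample uniformly in the box
        out[bad] = (lb + rng.uniform(size=x.shape) * width)[bad]
        return out
    if strategy == "toroidal":         # TCS: wrap around to the opposite side
        out[bad] = (lb + np.mod(x - lb, width))[bad]
        return out
    if strategy == "mirror":           # MCS: reflect off the violated bound
        y = np.mod(x - lb, 2 * width)
        out[bad] = (lb + np.minimum(y, 2 * width - y))[bad]
        return out
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with the domain [−1, 1], saturation maps 1.5 to 1.0, the toroidal correction maps it to −0.5, and mirroring maps 1.2 to 0.8.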
2.3 Step-Size Adaptation Methods

In this work, we consider a number of alternative step-size adaptation mechanisms as new options for the Modular CMA-ES. We take inspiration from [23], which provides a qualitative evaluation of multiple step-size adaptation mechanisms used in ESs. In addition to the CSA and TPA step-size adaptation methods, which were already implemented, we implemented the following procedures:

(1) Median success rule (MSR): The MSR mechanism [1] adapts the step-size σ as follows: it first computes a success rate by checking the number of current individuals that are better than some user-defined quantile of the function values in the previous population, then accumulates these success rates over the iterations, and finally increases the step-size if the accumulated value exceeds a fixed threshold and decreases it otherwise.
(2) Population success rule (PSR): determines the success rate of the current population using a rank-based approach. It first sorts all individuals in the current and previous populations together, then retrieves the set of ranks of individuals belonging to the current iteration and the one for the previous iteration, and finally calculates the average rank difference between those two sets as the population success rate, which controls the step-size updates.
(3) xNES step-size adaptation (xNES): calculates the length of each standardized mutation vector and subtracts from it the expected length of a standard Gaussian vector.
The resulting difference is then scalarized using the same weights used in the recombination, and finally fed into an exponential function to generate a multiplicative coefficient that modifies the step-size.
(4) mean-xNES step-size adaptation (m-xNES): functions similarly to xNES, with the exception that it takes the standardized differential vector between the current center of mass and the one in the previous iteration and compares it to the expected length of a standard Gaussian vector.
(5) xNES with log-normal prior step-size adaptation (p-xNES): resembles the principle of self-adaptation for step-sizes, where λ trial step-sizes are generated from a log-normal distribution which takes the current step-size as its mean, and each trial step-size is used to sample a candidate point. To determine the new step-size, this method calculates the weighted sum of the log-transformed trial step-sizes, where the weights are those assigned to the corresponding candidate points in the recombination.

With the introduction of these new module settings, we have a clear use-case for the assessment of algorithmic ideas within the CMA-ES algorithm. Since these options are implemented into a framework with many existing modules, it will not suffice to look at them in isolation. Instead, we should carefully consider the potential interactions with the existing modules and investigate their impact on the empirical performance of ModCMA. Previous work [33] used data from a complete enumeration of all module settings to analyze the contribution of each individual module. However, such an approach becomes intractable when we are confronted with a huge set of modules, or, more importantly, if we aim to obtain the contribution of new modules added incrementally to an existing portfolio of modules that has already been investigated extensively.
Besides, this complete enumeration approach entirely ignores the configuration of continuous strategy parameters, e.g., c_1, c_μ, and c_c, which have been shown to significantly impact the per-instance performance of the resulting configurations [9].

To properly address the problem of determining the contribution of a single module setting to an existing portfolio of modules, we make use of hyperparameter optimization, which has previously been shown to achieve results comparable to the complete enumeration method, while being much more easily extendable to other hyperparameters [34]. We propose the following roadmap to formalize this procedure:

(1) Select a modular implementation of the base algorithm to which the new module has been added, a hyperparameter optimizer, and a performance metric.
(2) Collect a list of the existing modules and relevant hyperparameters (without the new module to assess). This will be the search space for the hyperparameter optimization.
(3) Run the selected hyperparameter optimizer on this search space, ideally for a wide set of relevant benchmark functions. This data will then serve as the baseline performance.
(4) Extend the original search space by including the new module to assess, and run the hyperparameter optimization on this extended search space (using the exact same setup as the baseline).
(5) Compare the data from the baseline to the experiment with the extended search space. This should not only be done from a performance perspective, but also from the resulting configurations themselves. This allows for the analysis of potential interactions between modules.

In order to compare the different configurations of the ModCMA, we need to define the ways in which we measure their performance. Assuming a set of optimization algorithms A = {A_1, A_2, . . .}, a set of objective functions F = {f_1, f_2, . . .
}, a function evaluation budget B, and N repeated runs of each algorithm, we denote by T(A, f, v, i), i ∈ [1..N], the number of function evaluations consumed by algorithm A to find, in its i-th run on function f, a solution of quality at least v. Among various methods for quantifying the empirical performance of optimizers, the expected running time (ERT) [6] is commonly chosen, which estimates the expected number of function evaluations (a.k.a. the running time) an optimizer needs to hit a predefined target value when an unlimited evaluation budget is provided. However, a performance comparison based on ERT is largely biased towards the target value prescribed by the user. This value can be difficult to determine a priori for a configuration task on many optimizers, and it also adds another design factor to our experimental setup. Instead, we propose to take a measure that relies on a set of target values, since it will be less sensitive to the choice of each individual value therein and can cover more perspectives of the running profile of optimizers. One such measure is the Area Under the ECDF Curve (AUC) of the running time, defined as follows:

AUC(A, f, V) = ∫₀^B F̂(t; A, f, V) dt,

where F̂(t; A, f, V) = (1 / (N·|V|)) Σ_{v∈V} Σ_{i=1}^{N} 𝟙(T(A, f, v, i) ≤ t) (𝟙 is the characteristic function). In this work, we evaluate the algorithms for the target values V = {10^(−i) : i ∈ [−2..8]} ⊂ [10⁻⁸, 10²].

We note that most hyperparameter tuning methods are built with minimization in mind. As such, we use the Area Over the Curve (AOC) instead of the AUC, since AOC(A, f, V) = B − AUC(A, f, V).

The roadmap proposed above is designed to be generic, so that it can function with any modular algorithm, hyperparameter tuner, and performance metric.
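For a single run and target, the inner integral reduces to ∫₀^B 𝟙(T ≤ t) dt = max(0, B − T), so the AUC and AOC can be computed directly from recorded runtimes. A minimal sketch (the data layout and function name are ours, not part of any framework):

```python
def auc_aoc(runtimes, budget):
    """Compute the area under/over the running-time ECDF curve.

    `runtimes[i][v]` is the number of evaluations run i needed to reach
    target v, or None if the target was never reached within the budget.
    For one (run, target) pair, the integral of 1(T <= t) over [0, B]
    equals max(0, B - T); the AUC averages these areas over all pairs.
    """
    areas = []
    for run in runtimes:
        for t in run:
            hit = t if t is not None else float("inf")
            areas.append(max(0.0, budget - min(hit, budget)))
    auc = sum(areas) / len(areas)
    return auc, budget - auc  # (AUC, AOC)
```

For instance, with two runs and two targets each, `auc_aoc([[10, None], [50, 90]], budget=100)` averages the areas 90, 0, 50, and 10.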
In order to collect the AOC measure from the runs of the ModCMA, we integrated it into the IOHprofiler [14]. This tool is used because it offers an easy way of accessing the BBOB functions, while providing the logging functionality needed to easily calculate the AOC of each run. As our baseline, we tune the existing modules from ModCMA, which are shown (in plain text) in Table 1, totalling 6 binary and 4 ternary modules. In addition, we tune the four continuous hyperparameters c_1, c_μ, c_c, and c_σ, which control the dynamics of the adaptation of the covariance matrix (c_1, c_μ, and c_c) and of the step-size (c_σ). We then run two experiments to assess both the new step-size adaptation methods and the boundary correction module, as introduced in Section 2.3 and Section 2.2 respectively. All of the code used in these experiments, and the resulting data, is available in [13].

Hyperparameter tuning using irace:
In this paper, we use the irace [26, 27] library as our hyperparameter optimizer. irace is based on the principle of iterated racing: each race repeatedly executes configurations on different problem instances until there is a statistically significant reason to discard enough of them to move to the next race (thus inherently allocating more runs to more promising configurations). Using this procedure, one or more configurations emerge as the final elites at the end of the optimization. The number of times irace has evaluated the elite configurations can differ significantly between two runs. To obtain a fair comparison, we therefore perform an independent set of 25 validation runs, with the same random seeds for all configurations. We use the results of these runs to assess the final performance.

BBOB problem suite:
We configure irace to use the first instance (iid = 1) of each of the 24 BBOB functions [18, 19], in 5D. While the argument can be made that tuning should be done over multiple instances, we favoured running more repetitions of irace over using more instances. Each irace run is given a budget of 1 000 algorithm evaluations, which themselves have a budget of
10 000 · D function evaluations.

Before considering our proposed method, we run a basic benchmarking experiment on each of the individual module options. This is similar to the common approach of benchmarking the new module against a set of other algorithm variants. We show the resulting best single-module configurations (a.k.a. the virtual best solver, VBS for short) relative to the default CMA-ES in Table 2. In this table, we see that among the new modules, only two have been selected: MSR for F23 and m-xNES for F5. We can further look at the overall contributions of the newly introduced step-size settings by plotting the ECDF curves over all functions, as done in Figure 1. In this figure, we can clearly see that most methods are quite competitive, with the only exception being xNES, which has a significantly worse performance than the others. Overall, the MSR method seems to be quite effective, but there is no strict domination over the other settings.
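Selecting the per-function virtual best solver as reported in Table 2 amounts to a per-function argmin over average AOC values. A minimal sketch (the data layout and function name are ours):

```python
def virtual_best(results):
    """Pick the best (lowest-AOC) configuration per function.

    `results` maps function id -> {configuration name: average AOC};
    lower AOC is better, matching the minimization setup in the text.
    Returns fid -> (best configuration, its AOC, improvement over 'default').
    """
    vbs = {}
    for fid, by_config in results.items():
        best = min(by_config, key=by_config.get)
        improvement = 1 - by_config[best] / by_config["default"]
        vbs[fid] = (best, by_config[best], improvement)
    return vbs
```

Applied to the F1 row of Table 2 (default AOC 326, elitist AOC 247), this yields the reported 24% improvement.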
To illustrate the usefulness of hyperparameter tuning in the Modular CMA-ES, we conduct a baseline experiment where we tune all modules (excluding the newly introduced ones) and the selected hyperparameters. We perform this tuning on each benchmark function separately, and compare this baseline to the default CMA-ES as well as the virtual best solver on each function from Table 2.

Implemented in R, freely available at [28].
The initial iteration of irace consists of random configurations and the default CMA-ES setting.
Fid | VBS | AOC of VBS | AOC of Default | Improvement
1 | elitist_True | 247 | 326 | 24%
2 | active_True | 1272 | 1659 | 23%
3 | local_restart_BIPOP | 38374 | 44518 | 14%
4 | local_restart_IPOP | 41746 | 44613 | 6%
5 | step_size_adaptation_m-xnes | 43 | 63 | 31%
6 | elitist_True | 655 | 904 | 28%
7 | step_size_adaptation_tpa | 1312 | 39199 | 97%
8 | base_sampler_halton | 1186 | 4544 | 74%
9 | base_sampler_sobol | 959 | 2470 | 61%
10 | active_True | 1309 | 1729 | 24%
11 | active_True | 1162 | 1749 | 34%
12 | base_sampler_sobol | 2186 | 2980 | 27%
13 | active_True | 1627 | 2191 | 26%
14 | active_True | 601 | 831 | 28%
15 | local_restart_BIPOP | 30380 | 43313 | 30%
16 | local_restart_BIPOP | 8172 | 34132 | 76%
17 | threshold_convergence_True | 12464 | 26884 | 54%
18 | threshold_convergence_True | 15764 | 33724 | 53%
19 | mirrored_mirrored | 33567 | 36688 | 9%
20 | threshold_convergence_True | 36482 | 40691 | 10%
21 | local_restart_IPOP | 38028 | 40371 | 6%
22 | mirrored_mirrored | 566 | 8632 | 93%
23 | step_size_adaptation_msr | 11060 | 34433 | 68%
24 | local_restart_IPOP | 42099 | 44351 | 5%
Table 2: The AOC of the best single-module configuration for each function (VBS), compared to that of the default CMA-ES. Note that these values do not include benefits from tuning the continuous hyperparameters, which are set to their default values for all configurations in this table.

Figure 1: ECDF curves of all single-module step-size options (default, tpa, msr, psr, xnes, m-xnes, lp-xnes; x-axis: function evaluations, y-axis: proportion of (run, target, ...) pairs). Figure generated using IOHanalyzer [37].
Since we run 4 runs of irace for each function, this results in 4 sets of elites (each set has up to 5 configurations), for which we then perform the verification runs. We plot the distribution of the AOC for each of these configurations in Figure 2. From this figure, it is clear that tuning all parameters at once is much better than simply selecting a single-module variant, as is to be expected. This plot also highlights the significant differences in performance of the final found configurations. There are two main reasons for this: the inherent stochasticity of the CMA-ES itself, and the large impact of the initially generated configurations of irace. We discuss these challenges in detail in Section 5.

Figure 2: Distribution of the area over the ECDF curve for the final elite configurations of the baseline irace runs. All AOCs are averages of 25 verification runs. The VBS single-module configurations can be seen in Table 2.

From this baseline data, we can also study the resulting configurations themselves. This can be done by aggregating the modules which have been selected in the final elite configurations in the separate irace runs, as is visualized in Figure 3. In this figure, we can see that there is a large variability in the selected module options, which seems to indicate that they are all usable for at least some functions. One notable exception is the weights option "equal", which is chosen in only a small fraction of the configurations.

Figure 3: Module counts of all elites found in the baseline experiment, over all 24 BBOB functions. The option numbers correspond to those in Table 1.

For generating our experimental data, we conduct two hyperparameter optimization experiments with irace, one where we allow the new SSA methods to be selected, and another including the new boundary correction methods. Note that in the boundary correction experiment, the new SSA methods cannot be selected, and vice versa. We use the same experimental setup for running these experiments as for the baseline experiment.

Based on these experiments, there are two main approaches to analyzing the contributions of the newly introduced modules: the performance perspective and the perspective of the selected modules. We start by looking at the performance: for each experiment, we look at the impact on the final performance of the elite configurations found by the irace runs. First, we visualize the distributions
Figure 4: Distribution of the single best elites from the baseline and the tuning with the additional modules. AOC values are the result of averaging over 25 verification runs.
of the AOC of the single best configuration found in each run of irace (based on the verification runs) in Figure 4. In this plot, we can see that the effect of introducing the new modules is quite mixed. For some functions, the performance even worsens significantly (e.g., on F8) after introducing the new modules, while for others we see the desired improvement (e.g., on F23).

In order to better show these differences, we plot in Figure 6 the AOC of the single best configurations found in both the SSA and bounds experiments relative to the best configuration from the baseline. We see that the general trend of performance is somewhat negative, with some outliers in both directions. This seems to indicate that these new modules are not always beneficial to the final performance. For example, we can consider F12, where the configuration found by the baseline reaches a considerably lower average AOC than the best configuration found when including the new SSA methods in the search space. We show the expected running time of these two configurations in Figure 5, where we can clearly see this difference. However, it is also clear that the variance between runs is significant, which can partly explain the poor performance. Indeed, if we look at the average AOC during the irace runs, the difference between these two configurations is much smaller.

Figure 5: Comparison of the expected running time of the best configurations found on F12 by both the baseline and the SSA experiments. Shaded areas indicate the outer quantiles (20-80).
Figure 6: The relative improvement per function of the best configuration found by irace relative to the baseline experiment's best configuration.

This leads to an important observation about the assessment of new algorithmic modules: when judging results purely from average performance measures, it is necessary to also consider the overall variability of the experiment, as well as the inherent stochasticity of the base algorithm.

We perform the same procedure for the boundary correction methods. The impact of this module is expected to be smaller, since for most of the "easier" functions, the boundary condition is rarely violated. For some of the more challenging functions, however, the penalty value given by the BBOB function itself might not be sufficient to "guide" the algorithm back in bounds, and an explicit boundary correction can be beneficial in these cases. We can see that this indeed seems to be the case in Figure 6, where on the more complex functions, e.g., F21, the performance is improved when the boundary correction module is tuned. We also see that for these functions, the "None" option is rarely selected, which confirms that the algorithm jumping out of bounds without being corrected has a negative impact on the performance.

In Figure 6, we also see that the inclusion of the new step-size adaptation methods does manage to improve the overall performance for some functions. As an example, on F23 we saw a clear improvement over the best baseline configuration. If we consider all four elite configurations and compare the average performance differences, the average improvement is even higher. The stability of this improvement is promising, but in order to fully grasp how the inclusion of the new step-size adaptation mechanisms leads to this improvement, we need to analyze the selected modules across these different experiments.
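The module-selection analyses that follow reduce to counting how often each option is chosen across the final elite configurations. A minimal sketch (the data layout and function name are ours):

```python
from collections import Counter

def option_counts(elite_configs):
    """Count, per module, how often each option appears among elite configs.

    `elite_configs` is a list of {module name: chosen option} dicts; the
    result backs plots like the module-count bars described in the text.
    """
    counts = {}
    for config in elite_configs:
        for module, option in config.items():
            counts.setdefault(module, Counter())[option] += 1
    return counts
```

For instance, feeding it 20 elite configurations in which the step-size adaptation option is PSR 14 times, MSR once, and CSA 5 times reproduces the tallies discussed for F23.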
We have seen that the performance of the elite configuration found on F23 improves when we include the new step-size adaptation modules in the search space. In order to identify what this performance can tell us about the new modules themselves, we should study the configurations in more detail. The obvious way to see the difference is by looking at how often the new module options have been selected in the final elite configurations. Over 20 elites, the PSR update was selected 14 times, MSR once, and CSA five times. This shows that these new modules are indeed used in the successful configurations. To see how the inclusion of these module options changes the interactions with the other modules, we look at the combined module activation plot, which is shown in Figure 7.

Figure 7: Combined module activation plot for the elites found in the baseline and SSA experiments, for function 23. The lower the line, the better its performance, scaled within each band according to the AOC. The option numbers correspond to those in Table 1.

From this figure, we can see that there are some interesting differences between the two sets of configurations: the options for the restart and mirrored modules are not as uniform when using the new step-size adaptation methods, and the weights option is changed completely. These observations show that there is a clear interplay between these modules.

Next to the module activations, we can also look into the distributions of the configured continuous hyperparameters. To illustrate this, we study F3, and plot the pairwise relations between the four continuous hyperparameters and the final AOC value in Figure 11. From the marginal distributions (shown on the diagonal), we can see that the optimized setting of c_σ differs the most across the SSA, boundary correction, and baseline experiments. This is a direct result of the introduction of the new step-size adaptation methods, each of which prefers slightly different settings for this parameter. This indicates that even though the final performance of the elite configurations is similar between the baseline and the SSA experiment, the inclusion of new step-size adaptation methods significantly alters the found elite configurations.

We can extend this module analysis to all functions by aggregating the most important differences found between the baseline and SSA experiments. First, we can plot how often each new module option is selected in the elites for each function, as is done in Figure 8. We can use the same principle to study the interaction with the other modules.
For the binary modules, we can directly capture the module difference by looking at which modules occur more or less often in the final set of elites, as visualized in Figure 9. While this does not directly generalize to modules with more settings, we can create a similar plot for the other modules by considering the overlap in the selected module distributions, as visualized in Figure 10. From these figures, it becomes clear that the elites on some functions are barely affected by the inclusion of the new modules, while others require completely different module settings to properly exploit the changed search space.

We should note that considering only the final elite configurations does not tell the full story of a module's contribution.

Figure 8: Heatmap showing the fraction of the elite configurations in which each of the options for either SSA (top) or boundary correction (bottom) is active.
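The Figure 8 fractions amount to a per-function tally of how often each option appears across that function's elites. A small sketch of this tally, with illustrative (not the paper's) elite data:

```python
# Sketch of the Figure 8-style tally: for each function, the fraction of
# elite configurations in which each step-size-adaptation option appears.
# The `elites` data below is illustrative, not the paper's actual results.
from collections import Counter

elites = {  # function id -> option chosen in each of that function's elites
    3:  ["psr", "psr", "csa", "msr", "psr"],
    23: ["psr", "csa", "psr", "psr", "msr"],
}
options = ["csa", "msr", "psr"]

fractions = {}
for fid, chosen in elites.items():
    counts = Counter(chosen)
    fractions[fid] = {opt: counts[opt] / len(chosen) for opt in options}

print(fractions[23])  # {'csa': 0.2, 'msr': 0.2, 'psr': 0.6}
```

Collecting these per-function dictionaries into a matrix (functions × options) gives exactly the data behind a heatmap like Figure 8.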
Figure 9: Heatmap showing the difference in the fraction of the elite configurations in which each of the binary modules is active, between the baseline and the SSA experiment. Positive values indicate a module is turned on more often in the SSA experiments.
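The distributional difference used for the ternary modules in Figure 10 (0 for identical distributions, 1 for no overlap at all) is consistent with the total variation distance, 1 − Σ min(p, q). A sketch under that assumption, with illustrative option distributions:

```python
# A metric matching the "0 = identical, 1 = no overlap" description of
# Figure 10 is the total variation distance between the option
# distributions of the baseline and SSA elites. Input data is illustrative.

def tv_distance(p: dict, q: dict) -> float:
    """0.5 * sum |p - q|; equals 1 - sum(min(p, q)) for probability dicts."""
    opts = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in opts)

baseline_dist = {"Opt0": 0.7, "Opt1": 0.2, "Opt2": 0.1}
ssa_dist      = {"Opt0": 0.1, "Opt1": 0.3, "Opt2": 0.6}
print(round(tv_distance(baseline_dist, ssa_dist), 2))  # → 0.6
```

For binary modules (Figure 9), the same inputs reduce to a simple difference of on-fractions, which is why those can be plotted directly.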
Figure 10: Heatmap showing the difference in the distribution of the ternary modules selected in the final elites, between the baseline and the SSA experiment. 0 indicates that the distributions are identical, while 1 indicates that there is no overlap at all.

As briefly noted previously, introducing a new module increases the size and complexity of the search space, which has a large impact on the hyperparameter tuning. If a module is very dependent on the settings of other hyperparameters, this can lead to a deterioration of the final results, since the initially sampled configurations are likely to have worse performance than those in the baseline. This is clearly seen for function F5 in Figure 12. F5 is a linear slope function, but the BBOB specification does not include a sufficient penalty for leaving the search space. As a result, an algorithm which quickly leaves the search space will reach the required objective value very quickly. Thus, when boundary correction methods are added, random configurations are not able to abuse this loophole, leading to a worse initial performance. While for F5 the function is simple enough that the good configurations can still be found (and the inclusion of the default CMA-ES settings in the initial population means that there is always at least one good configuration present), the same issue exists to a lesser extent in other functions.

Figure 11: Distribution of the continuous hyperparameters from the elite configurations found in all three experiments.

Figure 12: Distribution of the relative AOC values found in the initial race of irace (relative to the default CMA-ES configuration; positive values equate to lower AOC).

Figure 12 also shows that the "tunability" of modules on different functions varies widely. For instance, on functions F16–F18 the spread of AOC values is significantly larger than on functions F19–F21, suggesting that it is relatively more difficult to tune the modules in the latter, since the tuner will very likely require a considerably larger budget to identify optimal configurations. Also, while on some functions it is trivial to improve over the default CMA-ES (e.g., F7), it is a lot more challenging on others, for example on functions F16–F18.
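One simple numeric proxy for the "tunability" contrast visible in Figure 12 is the spread (e.g., the interquartile range) of the relative-improvement values in the initial race. A sketch with hypothetical samples, purely to illustrate the summary statistics:

```python
# Hypothetical relative-AOC-improvement samples from an initial irace race
# (positive = better than the default CMA-ES). Not the paper's data.
import statistics

initial_race = {
    "F7":  [0.30, 0.42, 0.18, 0.35, 0.25, 0.40],    # easy to improve on
    "F16": [-0.50, 0.05, -0.20, 0.10, -0.65, 0.02],  # wide spread, harder
}

for fname, vals in initial_race.items():
    q1, _, q3 = statistics.quantiles(vals, n=4)  # quartiles
    print(f"{fname}: median={statistics.median(vals):+.2f}, IQR={q3 - q1:.2f}")
```

A larger IQR indicates that random configurations behave very differently on that function, so the tuner has more to gain but also more to search through.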
We discuss three key challenges for the module assessment procedure based on hyperparameter optimization that we have identified in this work.
Influence and stochasticity of the hyperparameter tuning:
While we showed that assessing the impact of an algorithmic component via hyperparameter tuning provides useful insights, several factors can complicate this approach. Since hyperparameter tuning is itself a very challenging problem, with many different approaches to solving it, the kind of tuner used will have a large impact on the resulting assessment. In this paper we used irace, which tends to focus on converging to a single configuration instead of covering a large set of different solutions. This necessitates running multiple repetitions of the irace procedure itself, as the initialization might otherwise have too much impact on the final configurations. This can quickly become computationally expensive.
Algorithm-inherent stochasticity:
As we discussed in the results, we need to take care when drawing conclusions from the performance of the different CMA-ES configurations. Since CMA-ES is inherently stochastic, the amount of variance of the configurations on a certain function has a large impact on the search procedure of irace. Since we end up selecting elites based on average performance, we are inherently underestimating the AOC of the final configuration. Even though irace largely mitigates this by using statistical testing in the races to decide when to discard configurations, there will always be some degree of underestimation of the performance (the median performance in the verification runs is . worse than predicted from the irace runs).

Limits of the per-instance analysis:
In the current setup, the performance measurements are only done on a per-instance basis. While this is often preferred over tuning for large sets of functions/instances, it does have some drawbacks. Specifically, if a module is designed to have good performance over a wide set of functions, but for each individual function other settings exist which outperform it, this new module would not be seen as beneficial. Because of this, we argue that module assessment by hyperparameter tuning should not replace the traditional assessments, but rather complement them with a more in-depth, per-instance analysis. We can identify this for the step-size adaptation module by looking at the ECDF curves of the single-module variants, as previously shown in Figure 1. To assess the impact of a new algorithmic component in a robust manner, tuning across a whole benchmark set of possibly diverse problems can be performed and compared to a tuned variant of the same modular framework without this module.
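The selection bias mentioned above (elites chosen on observed average performance look better than their true expected AOC) can be illustrated with a toy simulation, using made-up noise levels rather than actual CMA-ES runs:

```python
# Toy illustration of selection bias: when the elite is chosen by its
# *observed* mean over a few noisy runs, the winner's observed mean is
# optimistically low compared with its true expected AOC.
import random

random.seed(0)
true_aoc = [100.0] * 50        # 50 configurations with identical true AOC
noise_sd, runs = 20.0, 5       # hypothetical run-to-run noise and budget

# Observed mean AOC per configuration over `runs` noisy evaluations.
observed = [sum(t + random.gauss(0.0, noise_sd) for _ in range(runs)) / runs
            for t in true_aoc]

best_observed = min(observed)  # the "elite" selected on average performance
print(f"observed elite mean: {best_observed:.1f} (true mean: 100.0)")
```

Since the minimum of many noisy means is taken, the printed value falls below the true mean of 100, which is exactly why verification runs of the final elites report worse performance than the tuning run predicts.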
We introduced a roadmap for assessing the performance of individual algorithmic ideas which takes into account the interplay with other existing settings by comparing the results of hyperparameter tuning. Since this approach requires a modular design to function as intended, we use the Modular CMA-ES framework, which we have extended with new modules. Our analysis showed that the newly added step-size adaptation mechanisms are not always useful, but do provide clear benefits on several functions. The results also showed that step-size adaptation is most useful when combined with a different weights option.

The current version of the Modular CMA-ES framework is a good step in the direction of complete modularization of the CMA-ES algorithm, but some further enhancements can still be made. This would allow for even more precise control over each of the individual components, leading to an ideal testbed for new algorithmic ideas, which can then be evaluated using the approach outlined in this paper. However, since this can be computationally intensive, we should aim to share and reuse data as much as possible, by developing and maintaining a well-organized repository for this type of benchmark data. This does not only reduce the amount of computation needed to test new modules, but also gives rise to the possibility of testing methods to re-use data from other experiments, since the search spaces have large overlap. Ideally, this would allow methods from transfer learning to significantly shorten the time needed to assess a module's performance, even within a large modular search space.

Additionally, we note that while the proposed module assessment is inherently dependent on the hyperparameter tuning method used, the overall procedure remains the same no matter which tuner is used.
As a result, the analysis of the results should take into account the particularities of the tuner, such as the way configurations are generated. Further research should still be done into different hyperparameter optimization methods (e.g., SMAC [21], MIP-EGO [36], SPOT [8], GGA [2], hyperband [24], etc.) to determine exactly how they differ in this modular-algorithm context. Additionally, an analysis pipeline for this type of benchmarking could be designed within existing tools like IOHanalyzer [37], which would greatly reduce the amount of effort needed to assess new algorithmic ideas.

Acknowledgements
This work was supported by the Paris Ile-de-France Region.
REFERENCES
[1] Ouassim Ait Elhara, Anne Auger, and Nikolaus Hansen. 2013. A Median Success Rule for Non-Elitist Evolution Strategies: Study of Feasibility. In GECCO 2013 - Proceedings of the 2013 Genetic and Evolutionary Computation Conference. https://doi.org/10.1145/2463372.2463429
[2] Carlos Ansótegui, Yuri Malitsky, Horst Samulowitz, Meinolf Sellmann, and Kevin Tierney. 2015. Model-based Genetic Algorithms for Algorithm Configuration. In Proc. of International Conference on Artificial Intelligence (IJCAI'15). AAAI Press, 733–739.
[3] Anne Auger, Dimo Brockhoff, and Nikolaus Hansen. 2011. Mirrored Sampling in Evolution Strategies with Weighted Recombination. In GECCO. ACM, 861–868. https://doi.org/10.1145/2001576.2001694
[4] Anne Auger, Dimo Brockhoff, Nikolaus Hansen, Tea Tušar, and Konstantinos Varelas. 2020. Data from BBOB-workshops and competitions on 24 noiseless functions. https://coco.gforge.inria.fr/doku.php?id=algorithms-bbob
[5] Anne Auger and Nikolaus Hansen. 2005. A restart CMA evolution strategy with increasing population size. In Proc. of Congress on Evolutionary Computation (CEC'05). 1769–1776. https://doi.org/10.1109/CEC.2005.1554902
[6] Anne Auger and Nikolaus Hansen. 2005. A restart CMA evolution strategy with increasing population size. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005, 2-4 September 2005, Edinburgh, UK. IEEE, 1769–1776. https://doi.org/10.1109/CEC.2005.1554902
[7] Anne Auger, Mohammed Jebalia, and Olivier Teytaud. 2005. Algorithms (X, sigma, eta): Quasi-random Mutations for Evolution Strategies. In Artificial Evolution. Springer, 296–307. https://doi.org/10.1007/11740698_26
[8] Thomas Bartz-Beielstein. 2010. SPOT: An R Package For Automatic and Interactive Tuning of Optimization Algorithms by Sequential Parameter Optimization. CoRR abs/1006.4645 (2010). arXiv:1006.4645 http://arxiv.org/abs/1006.4645
[9] Nacim Belkhir, Johann Dréo, Pierre Savéant, and Marc Schoenauer. 2017. Per instance algorithm configuration of CMA-ES with limited budget. In Proc. of Genetic and Evolutionary Computation (GECCO'17).
PPSN. Springer, 11–21. https://doi.org/10.1007/978-3-642-15844-5_2
[12] Fabio Caraffini, Anna V. Kononova, and David Corne. 2019. Infeasibility and structural bias in differential evolution. Inf. Sci. 496 (2019), 161–179. https://doi.org/10.1016/j.ins.2019.05.019
[13] Jacob de Nobel, Diederick Vermetten, Hao Wang, Carola Doerr, and Thomas Bäck. 2021. Data and Code from: Tuning as a means of assessing the benefits of new ideas in interplay with existing algorithmic modules. https://doi.org/10.5281/zenodo.4524959
[14] Carola Doerr, Hao Wang, Furong Ye, Sander van Rijn, and Thomas Bäck. 2018. IOHprofiler: A Benchmarking and Profiling Tool for Iterative Optimization Heuristics. arXiv e-prints:1810.05281 (Oct. 2018). https://arxiv.org/abs/1810.05281 The BBOB datasets from [4] are available in the web-based interface of IOHanalyzer at http://iohprofiler.liacs.nl/
[15] Nikolaus Hansen. 2008. CMA-ES with Two-Point Step-Size Adaptation. arXiv:0805.0231 [cs] (May 2008). http://arxiv.org/abs/0805.0231
[16] Nikolaus Hansen. 2009. Benchmarking a BI-Population CMA-ES on the BBOB-2009 Function Testbed. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers (GECCO '09). Association for Computing Machinery, New York, NY, USA, 2389–2396. https://doi.org/10.1145/1570256.1570333
[17] Nikolaus Hansen. 2016. The CMA Evolution Strategy: A Tutorial. CoRR abs/1604.00772 (2016). arXiv:1604.00772 http://arxiv.org/abs/1604.00772
[18] Nikolaus Hansen, Anne Auger, Raymond Ros, Olaf Mersmann, Tea Tušar, and Dimo Brockhoff. 2020. COCO: A platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software (2020), 1–31.
[19] Nikolaus Hansen, Steffen Finck, Raymond Ros, and Anne Auger. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Technical Report RR-6829. INRIA. https://hal.inria.fr/inria-00362633/document
[20] Nikolaus Hansen and Andreas Ostermeier. 2001. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation 9, 2 (2001), 159–195. https://doi.org/10.1162/106365601750190398
[21] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In LION. Springer, 507–523.
[22] Grahame A. Jastrebski and Dirk V. Arnold. 2006. Improving Evolution Strategies through Active Covariance Matrix Adaptation. In CEC. 2814–2821. https://doi.org/10.1109/CEC.2006.1688662
[23] Oswin Krause, Tobias Glasmachers, and Christian Igel. 2017. Qualitative and Quantitative Assessment of Step Size Adaptation Rules. In Proceedings of the 14th ACM/SIGEVO Conference on Foundations of Genetic Algorithms (FOGA '17). Association for Computing Machinery, New York, NY, USA, 139–148. https://doi.org/10.1145/3040718.3040725
[24] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560 (2016).
[25] Rui Li. 2009. Mixed-Integer Evolution Strategies for Parameter Optimization and Their Applications to Medical Image Analysis. Thesis. Leiden University.
[26] Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. 2016. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives.
The irace package, Iterated Race for Automatic Algorithm Configuration. Technical Report TR/IRIDIA/2011-004. IRIDIA, Université Libre de Bruxelles, Belgium. http://iridia.ulb.ac.be/IridiaTrSeries/IridiaTr2011-004.pdf
[28] Manuel López-Ibáñez and Leslie Pérez Cáceres. [n.d.]. The irace Package: Iterated Race for Automatic Algorithm Configuration. http://iridia.ulb.ac.be/irace/
[29] Nuno Lourenço, Francisco Pereira, and Ernesto Costa. 2012. Evolving Evolutionary Algorithms. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation (GECCO '12). ACM, New York, NY, USA, 51–58. https://doi.org/10.1145/2330784.2330794
[30] Alejandro Piad-Morffis, Suilan Estévez-Velarde, Antonio Bolufé-Röhler, James Montgomery, and Stephen Chen. 2015. Evolution strategies with thresheld convergence. In CEC. 2097–2104. https://doi.org/10.1109/CEC.2015.7257143
[31] Jorge Tavares, Penousal Machado, Amílcar Cardoso, Francisco B. Pereira, and Ernesto Costa. 2004. On the Evolution of Evolutionary Algorithms. In Genetic Programming (Lecture Notes in Computer Science), Maarten Keijzer, Una-May O'Reilly, Simon Lucas, Ernesto Costa, and Terence Soule (Eds.). Springer, 389–398. https://doi.org/10.1007/978-3-540-24650-3_37
[32] Sander van Rijn, Hao Wang, Matthijs van Leeuwen, and Thomas Bäck. 2016. Evolving the structure of Evolution Strategies. In SSCI. 1–8. https://doi.org/10.1109/SSCI.2016.7850138
[33] Sander van Rijn, Hao Wang, Bas van Stein, and Thomas Bäck. 2017. Algorithm Configuration Data Mining for CMA Evolution Strategies. In GECCO. ACM, 737–744. https://doi.org/10.1145/3071178.3071205
[34] Diederick Vermetten, Hao Wang, Carola Doerr, and Thomas Bäck. 2020. Integrated vs. sequential approaches for selecting and tuning CMA-ES variants. In GECCO '20: Genetic and Evolutionary Computation Conference, Cancún, Mexico, July 8-12, 2020, Carlos Artemio Coello Coello (Ed.). ACM, 903–912. https://doi.org/10.1145/3377930.3389831
[35] Hao Wang, Michael Emmerich, and Thomas Bäck. 2014. Mirrored Orthogonal Sampling with Pairwise Selection in Evolution Strategies. In SAC. ACM, 154–156. https://doi.org/10.1145/2554850.2555089
[36] Hao Wang, Michael Emmerich, and Thomas Bäck. 2018. Cooling Strategies for the Moment-Generating Function in Bayesian Global Optimization. In CEC. 1–8. https://doi.org/10.1109/CEC.2018.8477956
[37] Hao Wang, Diederick Vermetten, Furong Ye, Carola Doerr, and Thomas Bäck. 2020. IOHanalyzer: Performance Analysis for Iterative Optimization Heuristic. CoRR abs/2007.03953 (2020). https://arxiv.org/abs/2007.03953 IOHanalyzer is available at CRAN, on GitHub, and as a web-based GUI; see https://iohprofiler.github.io/IOHanalyzer/ for links.
[38] Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2012. Evaluating Component Solver Contributions to Portfolio-Based Algorithm Selectors. In