Multi-objective Analysis of MAP-Elites Performance
Eivind Samuelsen
Department of Informatics, University of Oslo
Oslo, Norway
Kyrre Glette
Department of Informatics, University of Oslo
Oslo, Norway
kyrrehg@ifi.uio.no
Abstract—In certain complex optimization tasks, it becomes necessary to use multiple measures to characterize the performance of different algorithms. This paper presents a method that combines ordinal effect sizes with Pareto dominance to analyze such cases. Since the method is ordinal, it can also generalize across different optimization tasks even when the performance measurements are differently scaled. Through a case study, we show that this method can discover and quantify relations that would be difficult to deduce using a conventional measure-by-measure analysis. This case study applies the method to the evolution of robot controller repertoires using the MAP-Elites algorithm. Here, we analyze the search performance across a large set of parametrizations, varying mutation size and operator type, as well as map resolution, across four different robot morphologies. We show that the average magnitude of mutations has a bigger effect on outcomes than their precise distributions.
I. INTRODUCTION
Search algorithms can be applied to a multitude of tasks in engineering, science and other fields, from scheduling and optimization tasks to design and creativity [1], [2]. While in some straightforward cases it is sufficient for the search to end up with the single best solution, or a set of nondominated solutions in the case of a multi-objective optimization task, in other cases it would be an advantage to be able to investigate a range of different, high-performing solutions from the same search process.

In this context, algorithms like Novelty Search with Local Competition [3] and MAP-Elites [4] have been proposed, which explore and keep track of solutions that are high-performing but different according to a behavior criterion. These algorithms, named Quality Diversity (QD) or Illumination algorithms, have been successfully applied to, e.g., avoiding deception and promoting diversity in hard search domains [5], generating a large set of qualitatively different artworks [6], and exploring the design space of airfoils [7]. It should be noted that the exploratory nature of these algorithms can serve a dual purpose: either solely as a diversity-enhancing mechanism, providing stepping stones for the algorithm to overcome deceptive search spaces [5], or also for inspecting or exploiting the generated repertoire of solutions [4].

QD algorithms have been particularly successful in the Evolutionary Robotics (ER) field [8], with applications such as soft robot evolution [4], damage recovery [9], and locomotion repertoire generation [10], [11]. Recently, [12] compared the performances of a range of QD variants, and [13] proposed a unifying framework for QD.

MAP-Elites [4] searches a space of user-defined features, or behaviors, by discretizing it into a grid.
As the algorithm progresses, the cells in this grid are progressively filled with solutions according to their position in the behavior space, replacing any solution already associated with a cell only if the new solution is better according to some user-defined quality measure. This property stands out as particularly attractive for some applications: since the result of the search is a regular grid, one can easily locate a cell with a desired behavior, e.g. for use in locomotion repertoire generation tasks [11].

QD algorithms such as MAP-Elites naturally give rise to multiple ways of measuring their performance: one would like to know how diverse the solutions are, in terms of how much of the behavior space is covered, as well as the quality of the solutions. These aspects are covered in [4] by the coverage and precision measures, respectively, along with a third measure called global reliability. [12] reduces these quality and diversity criteria down to a single measure, called QD-score, that increases with any improvement in the two areas.

In multi-objective optimization (MOO), it is known that any way of combining multiple objective functions into one brings challenges related to scaling and potential loss of solutions, compared to a Pareto-based treatment [1]. We argue that the same concern also applies when comparing algorithms on multiple performance measures, and we therefore propose a Pareto dominance-based analysis method that takes this explicitly into consideration.

Usually MOO deals with deterministic objective function evaluations, where one can say with absolute certainty whether one solution dominates another. Stochastic algorithms such as MAP-Elites, however, can produce highly variable results, so instead of a single value for each performance measure, multiple runs of the algorithms will result in a set of different values.
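The grid-and-replacement loop described above can be sketched in a few lines. This is a minimal sketch only; the function names (`init_solution`, `mutate`, `behavior_to_cell`, `fitness`) are placeholders for problem-specific components and are not part of the original algorithm description.

```python
import random

def map_elites(init_solution, mutate, behavior_to_cell, fitness,
               n_init=100, n_evals=1000):
    """Minimal MAP-Elites sketch: the archive maps grid cells to elites."""
    archive = {}  # cell (tuple of indices) -> (solution, fitness)

    def try_insert(x):
        cell = behavior_to_cell(x)  # discretized behavior descriptor
        f = fitness(x)
        # Replace the cell's occupant only if the new solution is better.
        if cell not in archive or f > archive[cell][1]:
            archive[cell] = (x, f)

    for _ in range(n_init):            # random initial population
        try_insert(init_solution())
    for _ in range(n_evals - n_init):  # mutate a randomly chosen elite
        parent, _ = random.choice(list(archive.values()))
        try_insert(mutate(parent))
    return archive
```

A run of this loop returns the whole map rather than a single best solution, which is what makes the repertoire-style analyses in this paper possible.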
Using Cliff’s delta [14], an ordinal statistic closely related to the Vargha-Delaney effect size [15] and suggested for this use by [16], we are able to describe to what degree the outcomes of one algorithm dominate those of another.

We have earlier co-evolved robot morphologies and controllers, and produced real-world instances of these [17]. To fully make use of a new morphology, it would be relevant to learn a repertoire of walking behaviors, such as being able to move forwards and backwards and turn with different speeds, much in the same vein as [10] and [11]. Therefore, as a case study we employ MAP-Elites to generate controller repertoires. We illustrate that the proposed effect size, being an ordinal statistic, is able to generalize the results across four distinct robot morphologies.

One important parameter in MAP-Elites is the map resolution, which decides how many cells the map is divided into along each behavior axis. A lower resolution would perhaps result in more difficulty reaching new cells, but it might also reduce the number of evaluations needed to fill the map, simply because there are fewer cells to fill.

Another aspect that could have a significant impact on the performance of the evolutionary search is the choice of mutation operator. Some studies in ER always apply random perturbations to every parameter, while others apply a perturbation to each parameter only with a certain probability. Both approaches have their merits, and we seek to understand how the two affect the performance of our search. We also test a range of values for the mutation size, i.e. the σ of the Gaussian distribution.
This is partly to control for effects that do not generalize across mutation sizes, but also to shed light on possible performance trade-offs.

The results from our experimental runs are reported in effect size tables, allowing for statistical comparisons, and are also visualized in parametric plots, giving a good overview of the multi-objective performance aspects. We additionally visualize an excerpt of the results in a measure-by-measure fashion, to contrast with the conventional analysis.

To summarize, this paper makes two major contributions: First, we demonstrate how a Pareto dominance-based Cliff’s delta calculation can be applied to compare the performance of MAP-Elites searches. Secondly, we demonstrate the usefulness of these methods with a case study in evolutionary robotics, with a particular focus on different mutation schemes and how these affect performance.

II. METHODS
We implemented the MAP-Elites algorithm as described in [4] to find repertoires of robot gaits that effect useful behaviors. We then analyze the effect of different map resolutions, mutation operators and mutation sizes on the quality of the repertoires for four different robots. The robots and the simulator are described in section II-A. The gait behavior and performance measures are defined in section II-C. The mutation operators are described in section II-D. The measures used to judge repertoire quality are defined in section II-E. Finally, the methods used to analyze the results are presented in sections II-F and II-G. A summary of key details is shown in Table I. The experiment data and R source code for the analysis, along with videos of evolved repertoires, are available online.

TABLE I
EXPERIMENT SETUP

PhysX version: 3.4
Ground-robot friction: 0.3 / 0.3
Timestep: … s
Control system period: 1 s
Pre-evaluation periods: 1
Evaluation periods: 4
Samples per period: 4
Behavior features (map extents): Turn rate (± … /s); Adjusted forward speed (±0.75 m/s)
Performance measure: weighed penalties for large body pitch, low body height, and sideways movement
Initial population: 100
Initial mutation: all-hard
Initial σ: …
Map resolutions: 5×5, 7×7, 9×9
σ values: 0.05, 0.1, 0.2, 0.4, 0.8
Mutation types: all, some
Robots: robot2 (4 legs, 6 joints); robot3 (4 legs, 9 joints); robot4 (4 legs, 10 joints); robot5 (6 legs, 14 joints)
Total combinations: 240
Runs per combination: 12

Fig. 1. The four robot morphologies.

A. Simulated Robots
The algorithm was run on four robot morphologies selected from [17], illustrated in Figure 1. The morphologies have retained their original numbering. Using the PhysX simulator, the subject robot is simulated on an infinite flat plane. When given a new gait to evaluate by MAP-Elites, it first runs the gait for one period before starting to sample position and orientation. Sampling continues until a given number of periods have elapsed. Based on the samples, behavior and performance (section II-C) are given to MAP-Elites. This repeats without restarting the simulator, until the run ends when a certain number of evaluations has been reached or the simulator detects that the robot has been in a tilted or flipped state for 20 consecutive evaluations. After each evaluation the current coverage and precision (section II-E) are logged for analysis.

Temporary address for review: https://folk.uio.no/eivinsam/data/ALIFE18

Fig. 2. Joint set point (SP) as a function of t.

Fig. 3. Wrapping d onto [0, 1].
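The evaluation protocol just described (one warm-up period, then sampled evaluation periods) can be sketched as below. The simulator interface (`apply_gait`, `step_to`, `read_pose`) is a hypothetical stand-in introduced for illustration; the actual experiments use PhysX and keep the simulator running between evaluations.

```python
def evaluate_gait(sim, gait, pre_periods=1, eval_periods=4,
                  samples_per_period=4, period=1.0):
    """Sketch of the evaluation protocol: warm up for `pre_periods`,
    then sample position and orientation over the evaluation periods.
    Defaults follow Table I (1 pre-evaluation period, 4 evaluation
    periods, 4 samples per period, 1 s control period)."""
    sim.apply_gait(gait)
    t0 = pre_periods * period
    sim.step_to(t0)  # pre-evaluation period: no samples taken
    samples = []
    for i in range(1, eval_periods * samples_per_period + 1):
        sim.step_to(t0 + i * period / samples_per_period)
        samples.append(sim.read_pose())  # (position, orientation)
    return samples  # behavior and performance are computed from these
```

With the Table I defaults this yields 16 pose samples per evaluation, from which the behavior descriptor and fitness of section II-C are computed.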
B. Control System
Each joint is controlled by an open-loop controller with four parameters: a phase offset φ, a duty cycle D, and two extreme values, as shown in Figure 2. The duty cycle is encoded as a continuous parameter d such that

    D = d − ⌊d⌋   if ⌊d⌋ is even
    D = ⌈d⌉ − d   otherwise

as illustrated in Figure 3. This lets d be mutated freely as a continuous variable while mapping it to the correct range without discontinuities and without getting stuck at the extreme values. The parameters are encoded symmetrically, so that each joint on the left side of the body shares duty cycle and amplitude parameters with the corresponding joint on the right side. The phase offset is coded differentially, i.e. it is φ on the left side and φ + Δφ on the right.

C. Behavior and Fitness
The behavior descriptor is composed of two features that are designed to work well as control parameters for a higher-level control system: average turn rate and adjusted forward speed. Average turn rate is measured simply as

    b_T = (Σ_{i=0}^{N} Δψ_i) / Δt,

where N is the number of sample periods, Δψ_i is the change in orientation at sample i, and Δt is the time elapsed.

To provide a robust measure of the forward speed that accounts for turn radius and filters out sideways locomotion, more complex calculations are carried out for the second behavior feature: First the average position and orientation of the robot during the first and last periods are estimated. Lines perpendicular to the orientations and intersecting the positions are constructed, illustrated as n_a and n_b in Figure 4. The point where these two lines cross is used as a center of curvature, C; using it we can estimate the turn radius and its standard error

    R = 1/(N+1) · Σ_{i=0}^{N} |x_i − C|
    SE_R = √( 1/N · Σ_{i=0}^{N} (|x_i − C| − R)² )

and the change in orientation relative to C,

    Θ = Σ_{i=1}^{N} ∠x_i C x_{i−1},

where the angle ∠x_i C x_{i−1} between x_i C and C x_{i−1} is signed. The adjusted forward speed is then defined as b_F = RΘ/Δt.

Fig. 4. Forward movement estimation. The length of the circular segment is RΘ, which gives an estimate of the amount of forward movement. SE_R approximately corresponds to the shaded area, and measures the amount of sideways motion.

As the fitness measure, we compute a weighted sum of several penalty measures

    Penalty = 10·SE_R + |θ| + 10^(−h/…)

where |θ| is the average absolute pitch of the body, and h is the average height of the body reference point. This penalizes large sideways motion, up/down tilt and having the body close to the ground. The penalty score is then inverted into

    f = max{0, … − Penalty}

in order to produce a value increasing with solution quality.

D. Mutation Operators and Other Variables
We test two different commonly used mutation operators: mutating all parameters, and mutating each parameter with probability 1/k, where k is the number of parameters. In both cases Gaussian mutation with standard deviation σ is used on the mutated parameters. We run the experiments with five different values of σ. We also attempt to control for effects caused by map granularity by running with three different map resolutions; the extents the map cells cover in behavior space are kept constant. All combinations of mutation type, mutation magnitude and map resolution are run multiple times on all robots; the combinations, as well as the map extents, are summarized in Table I.

E. Performance Measures
Performance measures similar to those in [4] and [12] are used:

Precision is defined as the average score of the filled cells:

    P(m) = QD-score(m) / n(m) = 1/n(m) · Σ_{x∈M} m(x)

where m(x) is the score in cell x of map m, or zero if the cell is empty, M is the set of all cells, and n(m) is the number of filled cells in m.

Coverage is defined as the fraction of cells filled:

    C(m) = n(m) / N(m)
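With an archive represented as a mapping from filled cells to elite scores (the dictionary representation is our own assumption, not from the paper), these measures reduce to a few lines:

```python
def qd_score(archive):
    """Sum of elite scores over filled cells (empty cells contribute zero)."""
    return sum(archive.values())

def precision(archive):
    """Average score of the filled cells: P(m) = QD-score(m) / n(m)."""
    return qd_score(archive) / len(archive)

def coverage(archive, total_cells):
    """Fraction of cells filled: C(m) = n(m) / N(m)."""
    return len(archive) / total_cells
```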
Here N(m) denotes the total number of cells in the map. Finally, global reliability is defined as the average score across all cells, which can be expressed as

    G(m) = QD-score(m) / N(m) = P(m) · C(m)

meaning that reliability can be seen either as the QD-score normalized for map resolution, or as the product of precision and coverage. These definitions differ from those in [4] in that they do not scale the cell scores against the best observed score for each cell.

TABLE II
EXAMPLE VALUES FOR δ_ab
(columns: δ_ab, and the corresponding P_A, P_B and draw probability P_D in the no-draws and maximal-draws cases; numeric entries …)

F. Cliff’s Delta
Cliff’s delta [14] is an ordinal effect size that estimates how separated two distributions are. Given the probability P_A = P(a > b) of a random value from group a being greater than a random value from group b, and P_B = P(a < b), it can be defined as

    δ_ab = P(a > b) − P(a < b)    (1)

Table II shows some example values of Cliff’s delta, along with an interpretation in terms of the probabilities P_A = P(a > b) and P_B = P(a < b), together with the probability of a draw, P_D = 1 − (P_A + P_B). The sign of δ_ab says whether a (positive) or b (negative) is more likely to perform best, and the magnitude indicates the degree of difference in probabilities.

Cliff’s delta can be calculated exactly by comparing all possible pairs of observations from the samples, approximated by comparing a large number of random pairs, or obtained through its relation

    δ_ab = 2U / (|a||b|) − 1

to the Wilcoxon-Mann-Whitney (WMW) U statistic. Because of this linear relation, testing whether δ_ab is likely to be zero corresponds to doing a WMW U test. δ_ab is also related to Vargha and Delaney’s effect size Â [15] by δ_ab = 2Â_ab − 1.

Since it is ordinal, we can use it to measure differences across incomparable groups, for example to compare the performance of two algorithms across different benchmarks, or, in this case, different parametrizations of an algorithm across different robot morphologies. Assuming the different groups are of equal importance, this can be done by ensuring that comparisons are always made between observations in the same group, and weighting the groups equally. This can be calculated as

    δ_ab = 1/|G| · Σ_{g∈G} δ(a|g, b|g)

G. Pareto Domination
Considering the performance measures defined in section II-E from the perspective of multi-objective optimization, we have three objectives we want to optimize. Rather than analyzing algorithm performance with regard to the objectives individually, we are interested in performance across all objectives. Pareto domination provides a way to test this. An objective vector x is said to dominate another vector y if all individual objectives are at least as good, and at least one objective is better:

    x ≻_P y ≡ ∀i: x_i ≥ y_i ∧ ∃i: x_i > y_i

As suggested by [16], we can replace the comparison operators in the equation defining Cliff’s delta (1) with Pareto domination, resulting in a measure of the degree to which one sample dominates the other:

    d_ab = P(a ≻_P b) − P(a ≺_P b)    (2)

Like the ordinary Cliff’s delta, δ_ab, d_ab can be used to compare distributions to decide which is better. Note, however, that the WMW U statistic and test cannot be generalized in the same manner, so the U statistic cannot be used to calculate the Pareto d statistic. This is because Pareto dominance does not produce a simple ranking of a sample.

Note that, given the way precision, coverage and reliability are defined here, reliability is redundant when considering Pareto domination: suppose that, for some maps m and n, G(m) < G(n), implying that m does not dominate n. Expanding this as products of P and C we get

    P(m)·C(m) < P(n)·C(n)

However, since both P and C are always non-negative, this implies P(m) < P(n) ∨ C(m) < C(n), and then we would already have m ⊁_P n based on precision and coverage alone.

III. RESULTS
Figure 5 shows the conventional box plots for coverage, reliability and precision after the last evaluation for robot3. The plots indicate that mutation magnitude has the largest effect on all three measures. Coverage and reliability increase with mutation magnitude, while high mutation magnitudes seem to have a negative effect on precision. There also seems to be a small negative effect on all three measures from increasing map resolution. There is little or no difference between the mutation types on any of the measures. Corresponding plots for robot2, robot4, and robot5 are omitted, but mostly show the same main trends, with only robot2 deviating notably from the patterns, and those differences are mainly visible as a change of scales.
Coverage  Reliability  Precision
Fig. 5. Box plots of the three objectives for robot3, grouped by parameter. The vertical axes measure the performance measure given by the subtitle. The horizontal ticks name the parameter values: the σ, map resolution and mutation type parameters are separated by vertical lines.

Figure 7 shows coverage-precision plots of the results by robot and parameter. In the plots, mutation magnitude has the most visible effect. The traces spread out from a common origin into different niches, in order from least magnitude, least coverage, to largest magnitude, most coverage. Magnitudes 0.2, 0.4 and 0.8 form a clear Pareto front, while 0.1 and 0.05 are behind 0.2 on average on most robots. There is considerable overlap in the half-dominated areas, but a clear distinction between the areas of the extreme parameter values for all robots but robot2. In the plots for the other two parameters there are no such clear patterns. The shaded areas for mutation type almost completely overlap, with what is perhaps a slight advantage to some-soft on average. Of the three map resolutions, 5×5 is considerably better on average, but has especially poor coverage on robot2.

Of the four morphologies, robot2 stands out by having much higher precision and lower coverage than the others, along with correspondingly higher precision variance and lower coverage variance. The others generally have about the same shape and average development; however, robot3 appears to achieve better coverage with higher mutation sizes, while robot4 has bigger variation in precision.

Table III and Figure 6 show the Pareto-based Cliff’s deltas for all pairs of values of each parameter. The confidence intervals are 99% confidence intervals computed by bootstrapping. Because checking whether a 1 − α confidence interval overlaps with some x is equivalent to a hypothesis test with threshold α on whether the true value is x [18], we test for statistically significant differences (α = 0.01
) between groups by checking whether the reported confidence intervals overlap with zero.

For the map resolutions, the effect sizes imply a strong ordering preferring smaller map resolutions: 5×5 is likely to produce a better result than both 7×7 and 9×9, and 7×7 is likely to produce a better result than 9×9.

For the mutation magnitudes, the effect sizes form a partial, but almost complete, ordering: 0.05 is worse than all other values. 0.1 is worse than all values but 0.05. 0.2 and 0.4 are equally good, both better than 0.05 and 0.1 and worse than 0.8.
TABLE III
PARETO-BASED CLIFF’S DELTA EFFECT SIZES FOR EACH PAIR OF PARAMETER VALUES

Map resolutions^m: d_ab ± CI for each pair of 5×5, 7×7, 9×9 (entries …); implied ordering: 5×5 ≻ 7×7 ≻ 9×9
Mutation types^mr: all vs. some: d_ab ± CI (entry …)
Mutation magnitudes^r: d_ab ± CI for each pair of 0.05, 0.1, 0.2, 0.4, 0.8 (entries …); implied ordering: 0.8 ≻ {0.2, 0.4} ≻ 0.1 ≻ 0.05

Effect sizes statistically significantly different from zero are in bold. When the effect sizes imply a (partial) ordering, it is shown beneath the table.
^m Grouped by mutation magnitude. ^r Grouped by map resolution.
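The Pareto-based deltas and bootstrap intervals of the kind reported in Table III can be computed along these lines. This is a sketch under our own naming and representation assumptions (observations as tuples of maximized objectives; a percentile bootstrap), not the authors' R code:

```python
import random
from itertools import product

def dominates(x, y):
    """Pareto dominance for maximized objective vectors (eq. for x >_P y)."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def pareto_delta(a, b):
    """d_ab = P(a >_P b) - P(a <_P b), computed exactly over all pairs."""
    wins = sum(dominates(x, y) for x, y in product(a, b))
    losses = sum(dominates(y, x) for x, y in product(a, b))
    return (wins - losses) / (len(a) * len(b))

def grouped_pareto_delta(group_pairs):
    """Equal-weight average of per-group deltas (e.g. one group per robot)."""
    return sum(pareto_delta(a, b) for a, b in group_pairs) / len(group_pairs)

def bootstrap_ci(a, b, n_boot=2000, alpha=0.01, rng=random):
    """Percentile bootstrap confidence interval for pareto_delta(a, b)."""
    stats = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]  # resample each group with replacement
        rb = [rng.choice(b) for _ in b]
        stats.append(pareto_delta(ra, rb))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2))]
```

Note that, unlike the ordinary Cliff's delta, there is no U-statistic shortcut here, so the exact computation compares all |a|·|b| pairs.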
For the mutation types, the effect size is quite small, and we cannot rule out that there is no difference at all between the two methods.

IV. DISCUSSION
From the conventional plots in Figure 5 one can identify that, for robot3, lower map resolutions lead to higher precision while coverage remains about the same. The different σ values lead to different trade-offs in terms of precision and coverage, while the two mutation types are hard to differentiate. These observations are all confirmed by Figure 7 to hold for robot4 and robot5 as well, and to a certain degree also for robot2. These plots also indicate that σ = 0.8, which for most robots has the largest half-dominating area and the largest part not covered by other areas, might be the best overall choice.

Fig. 6. Graphical representation of the effect sizes. Each row in the figure corresponds to a column in Table III. The colored lines indicate the confidence interval for d_ab, with a given by the color and b given by the row. The thin lines show the 99% confidence interval, while the thick lines represent the central 50%. If the thin line does not cross the zero line then one of the groups is statistically significantly better.

The areas for mutation type also indicate that mutating all parameters may have a slight advantage over mutating only some. The reported effect sizes in Table III and Figure 6 capture all these observations, quantified and generalized across the four robots.

Because of the large number of different parameter values in the experiment, showing the box plots for all four robots would require the same amount of space as Figure 7 does, but to us it is clear that Figure 7 gives a significantly clearer view of the performance trade-offs of a set of parametrizations this large. In addition, plotting along the coverage-precision dimensions allows us to trace the search progress with systematic marks. As an observation from this, there is a tendency for the exponentially placed marks to be evenly spaced in the plot, indicating a trend of exponential convergence.

From the mutation magnitude plot for robot3 in Figure 7, we can see that 0.4 and 0.8 are on the same isocurve, which results in similar reliability values. Still, the coverage-precision plot highlights the difference between these parametrizations, and makes it easier to select between them based on the application. To achieve the same insights with the box plots, one would need to cross-reference at least two of them.

V. CONCLUSION
We have presented a method for multi-objective performance analysis of the MAP-Elites algorithm, based on ordinal effect sizes and Pareto dominance. Since the method uses an ordinal effect size, it allows us to draw general conclusions on the performance of various parameter values across groups of different scaling, such as varying robot morphologies. Through a thorough case study we demonstrated that this approach allowed us to better discern performance in trade-off scenarios, as seen when varying the σ parameter. At the same time, it reproduced the conclusions where the traditional analysis has proven robust.

We expect the method to be useful for other MAP-Elites practitioners, both for analyzing new algorithmic features and for tuning performance for specific applications. We also expect that the method could find application beyond MAP-Elites; in particular, it should be applicable to other algorithms within the Quality Diversity domain.

REFERENCES
[1] A. Konak, D. W. Coit, and A. E. Smith, “Multi-objective optimization using genetic algorithms: A tutorial,” Reliability Engineering & System Safety, vol. 91, no. 9, pp. 992–1007, 2006.
[2] Computer-Aided Design.
[3] J. Lehman and K. O. Stanley, “Evolving a diversity of virtual creatures through novelty search and local competition,” in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation. ACM, 2011, pp. 211–218.
[4] J.-B. Mouret and J. Clune, “Illuminating search spaces by mapping elites,” arXiv preprint arXiv:1504.04909, 2015.
[5] J. Lehman, K. O. Stanley, and R. Miikkulainen, “Effective Diversity Maintenance in Deceptive Domains,” in GECCO 2013: Proceedings of the 2013 Genetic and Evolutionary Computation Conference, 2013, pp. 215–222.
[6] A. Nguyen, J. Yosinski, and J. Clune, “Innovation Engines: Automated Creativity and Improved Stochastic Optimization via Deep Learning,” in Proceedings of the 2015 Genetic and Evolutionary Computation Conference, 2015, pp. 959–966.
[7] A. Gaier, A. Asteroth, and J.-B. Mouret, “Data-efficient exploration, optimization, and modeling of diverse designs through surrogate-assisted illumination,” in GECCO 2017: Proceedings of the 2017 Genetic and Evolutionary Computation Conference, 2017, pp. 99–106.
[8] S. Doncieux, N. Bredeche, J.-B. Mouret, and A. E. G. Eiben, “Evolutionary robotics: what, why, and where to,” Frontiers in Robotics and AI, vol. 2, p. 4, 2015.
[9] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret, “Robots that can adapt like animals,” Nature, vol. 521, no. 7553, p. 503, 2015.
[10] A. Cully and J.-B. Mouret, “Evolving a Behavioral Repertoire for a Walking Robot,” Evolutionary Computation, vol. 24, no. 1, pp. 59–88, 2016.
[11] M. Duarte, J. Gomes, S. M. Oliveira, and A. L. Christensen, “Evolution of Repertoire-based Control for Robots with Complex Locomotor Systems,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 314–328, 2018.
[12] J. K. Pugh, L. B. Soros, and K. O. Stanley, “Quality Diversity: A New Frontier for Evolutionary Computation,” Frontiers in Robotics and AI, vol. 3, pp. 1–17, 2016.
[13] A. Cully and Y. Demiris, “Quality and Diversity Optimization: A Unifying Modular Framework,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 245–259, 2018.
[14] N. Cliff, “Dominance statistics: Ordinal analyses to answer ordinal questions,” Psychological Bulletin, vol. 114, no. 3, p. 494, 1993.
[15] A. Vargha and H. D. Delaney, “A critique and improvement of the CL common language effect size statistics of McGraw and Wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
[16] G. Neumann, M. Harman, and S. Poulding, “Transformed Vargha-Delaney effect size,” in Search-Based Software Engineering. Springer International Publishing, 2015, pp. 318–324.
Fig. 7. Precision and coverage grouped by the different variables. The curves trace the mean of each group as the evaluation count increases, with points marking the 350th, 1250th and 5000th evaluation. The shaded areas are the areas dominated by at least half of the runs after the last evaluation. The light gray curves are isocurves for reliability.

[17] E. Samuelsen and K. Glette, “Real-world reproduction of evolved robot morphologies: Automated categorization and evaluation,” in Applications of Evolutionary Computation: 18th European Conference. Springer, 2015, pp. 771–782.
[18] S. Greenland, S. J. Senn, K. J. Rothman, J. B. Carlin, C. Poole, S. N. Goodman, and D. G. Altman, “Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations,” European Journal of Epidemiology, vol. 31, no. 4, pp. 337–350, 2016.