Competitiveness of MAP-Elites against Proximal Policy Optimization on locomotion tasks in deterministic simulations
Szymon Brych
Department of Computing, Imperial College London, London, SW7 2AZ, UK
[email protected]
Antoine Cully
Department of Computing, Imperial College London, London, SW7 2AZ, UK
[email protected]

ABSTRACT
The increasing importance of robots and automation creates a demand for learnable controllers, which can be obtained through various approaches such as Evolutionary Algorithms (EAs) or Reinforcement Learning (RL). Unfortunately, these two families of algorithms have mainly developed independently and there are only a few works comparing modern EAs with deep RL algorithms. We show that the Multidimensional Archive of Phenotypic Elites (MAP-Elites), a modern EA, can deliver better-performing solutions than one of the state-of-the-art RL methods, Proximal Policy Optimization (PPO), in the generation of locomotion controllers for a simulated hexapod robot. Additionally, extensive hyper-parameter tuning shows that MAP-Elites displays greater robustness across seeds and hyper-parameter sets. Generally, this paper demonstrates that EAs combined with modern computational resources display promising characteristics and have the potential to contribute to the state-of-the-art in controller learning.
Keywords
Quality-Diversity optimization, Reinforcement Learning, Proximal Policy Optimization, MAP-Elites
Increased demand for robots and various forms of automatic control creates a demand for automated ways to generate controllers. Such controllers were traditionally designed by experts; however, the associated complexities make this process slow and therefore expensive. For this reason, it is particularly desirable to automatically learn controllers as opposed to manually designing them.

In this space, learning approaches seem particularly promising. Existing work has shown that Evolutionary Algorithms (EAs) can be successfully applied to control problems in common simulated environments [1, 2], as well as in real robots [3]. Particularly interesting among EAs is the family of Quality-Diversity algorithms (QDs; [4]). These take inspiration from the diversity of species produced by natural evolution. Instead of generating a single solution, like most learning algorithms, QD algorithms produce a collection of diverse and high-performing solutions. This diversity allows robustness to changing conditions as well as specialization to various use cases, resulting in performance gains [5].

At the same time, Reinforcement Learning (RL) has gained huge popularity due to breakthroughs [6, 7] fueled by the advent of wide-spread applications of deep neural networks [8]. Over a relatively short course of time, Deep RL has also seen numerous very impressive contributions to the problems of continuous control in simulation [9, 10, 11]. These examples show that both EAs and RL methods can be applied to controller learning and, despite slightly different terminologies, they share the same high-level learning principles. Reinforcement Learning focuses on learning from experience through a reward signal that is feedback on the desirability of actions taken with respect to the environment. In the context of EAs, the reward signal is often referred to as the fitness, which in evolutionary terms quantifies how well an individual is adapted to the environment in which it acts.

We attempt to make further contributions to the knowledge base of Evolutionary Algorithms and their competitiveness against Deep Reinforcement Learning approaches. To this end, we discuss the most relevant Policy Gradient methods [12, 13] and choose Proximal Policy Optimization (PPO; [10]) for deeper investigation. Analogously, we further proceed with a discussion of some recent EAs and select a QD algorithm, MAP-Elites [14], for the comparison. We then specify a continuous gait control problem for an 18-degrees-of-freedom hexapod robot in a deterministic simulation. Evaluation on a physical robot is left for future work. Subsequently, through a series of experiments for two different setups of an open- and closed-loop controller, we perform a systematic comparison of the selected algorithms and eventually discuss the characteristics of both approaches.

The main contributions of this work are: 1. showing that MAP-Elites is a competitive alternative to PPO, achieving better results in terms of episode reward in all the evaluated scenarios; 2. demonstrating greater robustness of MAP-Elites against both seeds and hyper-parameters, which is crucial when considering applying algorithms on physical robots, as this results in better reproducibility and a shorter tuning procedure.

Learning locomotion controllers is a problem that has attracted the attention of multiple research domains.
This is illustrated by the diversity of approaches that have been explored in the literature, for instance: Policy Gradient Methods [15, 16, 17, 10, 18], Evolutionary Algorithms [3, 1, 14], Particle Swarm Optimization [19] or Bayesian Optimization [20]. Additionally, [21] and [20] give a great high-level comparison of the aforementioned categories augmented with Grid Search and Random Search, which are often used as baseline benchmarks for algorithms [20, 3, 1]. In this work, we focus exclusively on the first two categories: Policy Gradient Methods and Evolutionary Algorithms.
All Reinforcement Learning methods seek to optimize the objective J(π), which is defined as the expected sum of future rewards r_t discounted with a factor γ under a trajectory distribution determined by following the policy π over a certain horizon T:

    J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} r_{t} \right]    (1)

Policy Gradient methods [12, 13] leverage a parametrized policy function π_θ, such as a neural network, that is trained by optimizing an objective function with respect to the policy parameters θ. Additionally, it is quite common to approximate a state-value function V(s):

    V(s) = \mathbb{E}_{\pi}\left[ \sum_{k=t}^{T} \gamma^{k-t} r_{k} \,\middle|\, s_{t} = s \right]    (2)

The state-value function serves as a baseline when judging the utility of taking a certain action in a given state and helps in stabilizing learning [13]. Using this baseline, the objective J(π) can be optimized by a direct gradient estimation:

    \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T} \gamma^{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \left( \sum_{k=t}^{T} \gamma^{k-t} r_{k} - V(s_{t}) \right) \right]    (3)

One of the more recent improvements to PG methods is Trust Region Policy Optimization (TRPO) [17], which introduced a new surrogate objective function used to limit the size of the policy's update (hence "trust region"). This is designed to ensure that the policy does not diverge to unknown regions of the search space. The introduction of TRPO has been followed by multiple influential works, such as the Asynchronous Advantage Actor-Critic (A3C) algorithm [16] and Proximal Policy Optimization (PPO). A3C combines the Actor-Critic PG approach [22] with parallel and asynchronous execution, which added the power of modern computing to variants of existing algorithms.

On the other hand, PPO [10] avoids the challenge of using a constrained optimisation by replacing TRPO's core objective component with a clipping objective:

    L^{clip}(\theta) = \hat{\mathbb{E}}_{t}\left[ \min\left( r_{t}(\theta)\hat{A}_{t},\; \mathrm{clip}(r_{t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_{t} \right) \right]    (4)

where Â_t is an advantage estimator at timestep t (more details in [23]), r_t(θ) is the policy ratio π_θ(a_t|s_t) / π_θold(a_t|s_t), and the expectation stands for "the empirical average over a finite batch of samples" [10]. This component caps the incentives for altering the parameters of the current policy if taking an action according to the new policy π_θ differs too much in probability from the π_θold used for experience collection (for more details please refer to Alg. 1). This greatly simplifies the implementation, which, combined with PPO's capabilities for parallelization, makes it a particularly attractive tool. Additionally, its applicability to learning locomotion has already been reported in several publications [10, 11].
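To make Eq. (4) concrete, the following minimal NumPy sketch evaluates the clipped surrogate objective over a batch of transitions. The array names and the use of log-probabilities are our own illustrative assumptions, not part of the original implementation; in practice, a stochastic optimizer would minimize the negative of this quantity.

    import numpy as np

    def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
        """Eq. (4): mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
        ratio = np.exp(logp_new - logp_old)             # r_t(theta) = pi_theta / pi_theta_old
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        return np.mean(np.minimum(unclipped, clipped))  # empirical average over the batch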
Another family of algorithms suitable for use in learning locomotion controllers is the family of Evolutionary Algorithms (EAs). Since we are interested in comparisons to Deep Reinforcement Learning, we focus on evolving controllers represented as neural networks. The approach of evolving neural networks is sometimes termed Neuroevolution (e.g., in [1]). The idea of using evolutionary approaches to learn controllers is not new and has been discussed in numerous publications across the years [24, 25, 3, 2, 1, 26].

Some of the more recent discussions of EAs include the work presented in [2], which uses a gradient-based algorithm called Evolution Strategy. This algorithm defines a population of parameters as a Gaussian distribution, which is sampled to approximate the gradient of the fitness function. Then, stochastic gradient descent is used to update the parameters of the Gaussian distribution, which moves the population to regions of the search space with high fitness. The authors highlight strong parallelization capabilities and successfully train their controllers in both continuous and discrete domains.

Contrary to the aforementioned, the authors of [1] offer an implementation of an EA simply referred to as the Genetic Algorithm, which does not rely on gradients. It maintains a population of individuals that are subject to the typical evolutionary operations of fitness evaluation, selection (with elitism), and mutation. No crossover is applied. One of the most distinguishing aspects of this work is its scale, as individuals take the form of networks having up to 4M parameters.

Quality-Diversity optimisation (QD) algorithms are a recently introduced subgroup of EAs [4] that aim at producing large collections of solutions which are both diverse and high-performing. One of the most well-known QD algorithms is the Multidimensional Archive of Phenotypic Elites (MAP-Elites; [14, 3]). It defines an archive with a fixed number of cells that is used to store the collection of solutions. Each cell corresponds to a different type of solution that is characterised by a different behavioral descriptor. The behavioural descriptor is a numerical vector describing some task-dependent behavioral properties of an individual (e.g., the average body oscillations, the end location of the robot or the leg-ground contact patterns). The objective of this paper is to evaluate how MAP-Elites performs compared to PPO on the generation of locomotion controllers for a hexapod robot.
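As a hedged illustration of the Evolution Strategy update described above (following the general scheme of [2], but with our own variable names, a single worker, and simple fitness normalization), one iteration could look like this:

    import numpy as np

    def es_step(theta, fitness_fn, rng, sigma=0.02, alpha=0.01, population=50):
        """One Evolution Strategy update: sample Gaussian perturbations around the
        mean parameters and move the mean along a fitness-weighted average direction."""
        eps = rng.standard_normal((population, theta.size))            # perturbation directions
        fitness = np.array([fitness_fn(theta + sigma * e) for e in eps])
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)  # normalize fitness scores
        grad_estimate = (eps.T @ fitness) / (population * sigma)       # estimated fitness gradient
        return theta + alpha * grad_estimate                           # plain gradient-ascent step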
In this work, we compare PPO and MAP-Elites on locomotion tasks. In order to ensure a fair comparison, a common simulation environment setup is used across experiments. It consists of a 3D model of a 6-legged, hexapod robot with 18 actuated degrees of freedom (3 per leg). This model is designed to resemble the physical robot used in [3].

The locomotion task is the same throughout this paper: to walk as far as possible along the X-axis within an episode lasting 5 seconds (corresponding to 333 time steps). This walked distance is referred to as the fitness or episode reward. The environment is deterministic (i.e., the same action always leads to the same result) and simulated with the open-source physics simulator DART [27].

The pseudo-code of PPO [10] is presented in Alg. 1. During each iteration, several actors generate batches of experience by following the policy π_θold. This is then followed by the computation of the advantage estimators Â. Subsequently, the objective L, which includes terms for clipping L^c, squared loss L^v and entropy S (with hyper-parameters ε, c1, c2), is optimized with respect to the current policy's parameters θ over K epochs. Each epoch utilizes experience split into several mini-batches. After this is done, the old policy's parameters get overwritten with the current ones, and the procedure repeats or the learned policy is returned.
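Both algorithms consume experience from rollouts in this shared deterministic simulation. A minimal sketch of evaluating one episode reward (the distance walked along the X-axis over 333 steps) is given below; the environment interface (reset/step/x_position) and the policy signature are hypothetical stand-ins for the actual DART-based setup, not the paper's code.

    def episode_reward(policy, env, horizon=333):
        """Roll out one deterministic episode and return the distance walked along X."""
        state = env.reset()
        start_x = env.x_position()
        for t in range(horizon):                  # 5 s episode split into 333 control steps
            action = policy(state, t)             # 18 target joint positions
            state = env.step(action)
        return env.x_position() - start_x         # fitness / episode reward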
MAP-Elites. Alg. 2 shows the pseudo-code of MAP-Elites [14, 3], where (P, C) stands for the behavior-performance archive resulting from the execution. During each generation, we sample and mutate several solutions from the behavior-performance archive. The algorithm then evaluates each of these mutated solutions and records their behavioral descriptors and performance based on the results from a single episode (rollout) per individual. These new solutions are then potentially added to the archive if the cell indicated by the behavioral descriptor is empty or occupied by a lower-performing solution.
Algorithm 1 PPO Algorithm [10]

for iteration = 1, 2, ... do
    for actor = 1 → N do
        Run policy π_θold in env for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize L with respect to θ, with K epochs and mini-batch size M ≤ NT, where:
        L_t(θ) = Ê_t[ L^c_t(θ) - c1 L^v_t(θ) + c2 S[π_θ](s_t) ]
        L^c_t(θ) = min( r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t )
        L^v_t(θ) = ( V_θ(s_t) - V_targ )^2
        r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)
    θ_old ← θ
end for
return policy π_θold
Algorithm 2 MAP-Elites Algorithm [14]

(P ← ∅, C ← ∅)
for iter = 1 → I do
    if iter < N then
        c′ ← random_controller()
    else
        c ← random_selection(C)
        c′ ← random_variation(c)
    end if
    x′ ← behavioral_descriptor(simu(c′))
    p′ ← performance(simu(c′))
    if P(x′) = ∅ or P(x′) < p′ then
        P(x′) ← p′
        C(x′) ← c′
    end if
end for
return behavior-performance map (P and C)

All of the MAP-Elites experiments in this work use a discrete, 6-dimensional behavior descriptor, recording the fraction of the total time that a given leg of the hexapod was in contact with the ground [3]. Each dimension of the descriptor has four or five buckets (the descriptor base). The source code of both PPO and MAP-Elites is available under the respective repositories.
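A minimal Python sketch of the archive-insertion step of Alg. 2, assuming the 6-dimensional leg-contact descriptor described above (the function names and the dictionary-based archive are our own illustrative choices):

    import numpy as np

    def descriptor_cell(leg_contact_fractions, base=4):
        """Bucket the 6-D descriptor (per-leg ground-contact fraction in [0, 1])
        into a discrete archive cell index."""
        buckets = np.minimum((np.asarray(leg_contact_fractions) * base).astype(int), base - 1)
        return tuple(buckets)

    def try_insert(archive, controller, descriptor, fitness, base=4):
        """Add the solution if its cell is empty or holds a lower-performing elite."""
        cell = descriptor_cell(descriptor, base)
        if cell not in archive or archive[cell][1] < fitness:
            archive[cell] = (controller, fitness)

With a base-4 descriptor the archive has at most 4^6 = 4096 cells, and 5^6 = 15625 cells for the base-5 variant.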
Due to the different nature of PPO and MAP-Elites, there are some difficulties associated with their comparison. To name some of the high-level differences:

• PPO, like other Policy Gradient methods, optimizes the expected cumulative reward from an episode for a single solution, whereas MAP-Elites has an additional goal of producing a diverse set of solutions according to a behavioral descriptor.
• PPO outputs stochastic policies that drive its exploration. On the other hand, MAP-Elites, as used in this work, produces deterministic policies that are decoupled from the exploration mechanics, as exploration is driven by mutation.
• PPO gets transition and reward information at frame-level granularity, whereas MAP-Elites only receives cumulative rewards (i.e., fitness) from episodes. However, MAP-Elites additionally uses behavioral descriptors, which quantify high-level behavioral features, and PPO has no access to such information.

Nevertheless, there are still some aspects of both algorithms that make quantitative comparisons possible:

• Their total computational effort can be measured in the number of simulation frames.
• Solutions yielded by PPO can be compared with the best-fit individual of MAP-Elites in terms of cumulative episodic rewards.

To make comparisons between MAP-Elites and PPO as fair as possible, the following measures are taken:

• MAP-Elites is set up to evolve the weights of neural policies of the same size as the network architectures used by PPO (the value network is excluded, as this was treated as PPO's implementation detail irrelevant to MAP-Elites).
• The main policy network architecture is selected to be a fully-connected Multi-Layer Perceptron, although this is a limiting choice for MAP-Elites, which is capable of simultaneously evolving the weights and topologies of neural networks.
• To minimize differences in the setup, the original Python implementation of PPO is ported to C++, where it can directly use the same instance of the simulation environment and robot model as MAP-Elites does.
Two sets of experiments are presented: 1. Open-loop gait training, a scenario where the policy takes only a function of time as an input; 2. Closed-loop gait training, a more complex scenario where the state of the robot's body is the policy's input.

The comparisons of the different evaluations are based on statistics over the episode reward (i.e., the cumulative episode reward for a fixed amount of simulation frames). We perform multiple replication runs to examine seed sensitivity due to stochastic components in the tested algorithms. The statistical significance of the performance differences is verified through the Wilcoxon Signed-Rank Test, and p-values are reported.

Figure 1: Comparison of the quartile analysis for the open-loop (left) and closed-loop (right) experiments featuring PPO and MAP-Elites. A colorful shading around a line represents the variability between the first and the third quartile, while the thicker line is the median. For the closed-loop, apart from episode rewards, a distribution over the number of individuals in MAP-Elites archives which outperform the median of PPO is shown in green.
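As a hedged illustration of the significance test used, the final episode rewards of the 20 replications per algorithm can be compared with SciPy; the pairing of runs and the reward values below are purely illustrative placeholders, not the experimental data.

    from scipy.stats import wilcoxon

    # Placeholder final episode rewards for 20 replications of each algorithm (illustrative only).
    ppo_rewards = [1.4, 1.6, 1.5, 1.7, 1.3, 1.6, 1.5, 1.4, 1.8, 1.5,
                   1.6, 1.4, 1.5, 1.7, 1.6, 1.5, 1.4, 1.6, 1.5, 1.7]
    map_elites_rewards = [2.9, 3.1, 3.0, 3.2, 2.8, 3.0, 3.1, 2.9, 3.3, 3.0,
                          3.1, 2.9, 3.0, 3.2, 3.1, 3.0, 2.9, 3.1, 3.0, 3.2]
    statistic, p_value = wilcoxon(ppo_rewards, map_elites_rewards)
    print(f"Wilcoxon signed-rank statistic = {statistic}, p-value = {p_value:.2e}")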
The open-loop controllers studied in this work take a scalar time-modulo-period as an input and output 18 target angular positions of the joints. This setup is quite popular due to its high utility and simplicity [3]. The period is always set arbitrarily to one second.

All of the used open-loop architectures have two layers of hidden neurons, with two to six neurons in each layer. These counts are treated as a hyper-parameter. Biases are used, which results in a total policy parameter count ranging from 64 to 130.
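For illustration, the parameter count of such a fully-connected policy (1 input, two hidden layers, 18 outputs, biases included) can be computed as follows; the specific hidden sizes shown are our own examples, chosen because they match the stated 64 and 130 bounds.

    def mlp_param_count(layer_sizes):
        """Weights plus biases of a fully-connected MLP, e.g. [1, h1, h2, 18]."""
        return sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

    print(mlp_param_count([1, 2, 2, 18]))  # 64  (matches the stated lower bound)
    print(mlp_param_count([1, 6, 4, 18]))  # 130 (matches the stated upper bound)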
The hyper-parameter selection is done in two phases. First, 370 unique PPO hyper-parameter configurations are tested for a short horizon (75M frames) with four replications each. Then, the four best configurations (according to the median episode reward) are executed for a longer horizon (255M frames) with 20 replications each. We report the performance of the single best-performing configuration on the long horizon in Fig. 1 (left). The details of the hyper-parameter values are given in Table 1. The reasoning behind this is to initially evaluate hyper-parameter configurations until the first horizon and to continue only with the most promising ones. This approach allows us to explore a larger number of hyper-parameter configurations at a lower computational cost.

The hyper-parameter configurations sampled for the initial assessment consist of the learning rate: [5e-5, 1e-2] and the clipping range: [5e-2, 4e-1], both sampled log-uniformly. An entropy term is sampled log-uniformly from the range [1e-4, 1e-2] in 25% of the cases or set to zero in the remaining 75%. Additionally, among the uniformly sampled hyper-parameters are the mini-batch size, selected from the range of [2, 32] Ki, and the policy architecture, sampled from 8 predefined options of [64, 130] parameters each. Both the policy and value networks have the same architecture, but weights are not shared.

The hyper-parameter tuning procedure for MAP-Elites is the same as the one used with PPO. However, due to MAP-Elites being far more stable across parameters and substantially dominating in performance (Fig. 1), we test 220 hyper-parameter configurations with three replications for the initial horizon and, as before, 20 replications for the best ones. The performance of a MAP-Elites algorithm is defined as the episode reward of the best solution contained in the behavior-performance archive. The hyper-parameter ranges for MAP-Elites include: the mutation rate, sampled uniformly from the range [0, 0.5], the same set of neural architectures as for PPO, and a selection of either a base-4 or base-5 behavioral descriptor. The details of the hyper-parameter values for the selected configuration are given in Table 1.

We report medians as they seem to be a better-suited statistic than, for instance, the average, due to higher robustness against various distributions. An analogous argument applies to the preference of the inter-quartile range over the standard deviation.
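A hedged sketch of the PPO sampling distributions described above; the function and key names are ours, not from the original tuning code, and we interpret "Ki" as multiples of 1024 transitions.

    import numpy as np

    def sample_ppo_config(rng):
        """Draw one PPO hyper-parameter configuration following the ranges in the text."""
        log_uniform = lambda lo, hi: float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
        return {
            "learning_rate": log_uniform(5e-5, 1e-2),
            "clip_range": log_uniform(5e-2, 4e-1),
            "entropy_coef": log_uniform(1e-4, 1e-2) if rng.random() < 0.25 else 0.0,
            "minibatch_size": int(rng.integers(2, 33)) * 1024,   # [2, 32] Ki
            "architecture": int(rng.choice(8)),                  # index into the 8 predefined MLPs
        }

    config = sample_ppo_config(np.random.default_rng(42))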
Fig. 1 (left) shows the experimental results with the median, first and third quartiles displayed. We can see that MAP-Elites with MLP individuals significantly outperforms PPO (medians of 3 vs 1.5 meters traveled by the hexapod). This difference is statistically significant, with a p-value of 0.00014. Additionally, we can also see that MAP-Elites tends to increase its performance over longer horizons than PPO. The best-fit individual after the first generation performs above 1.5 meters. That is much better than many parametrizations of PPO were able to reach throughout the whole of their run, showing the importance of a relatively simple random search, which was also appreciated in [1].

Qualitatively, gaits learned through the open-loop PPO training quite often take the form of the hexapod flipping over onto its back and then walking upside-down (e.g., the "tip-over" gait), which can be thought of as a local optimum. The viability of this gait relates to the specifics of the hexapod model, which has lower leg extents above the knee joints. On the other hand, the gait referred to as "aspiring bipedal" is an example of a gait produced by MAP-Elites.
In the closed-loop setting, both the input and output consist of 18 angular positions of the joints. The input is the current state of the robot, while the output refers to the desired state for the next time step. A similar setup can be found in [11]. The increased number of inputs results in an increased number of parameters overall. The predefined set of architectures ranges from 98 to 282 in the number of weights (including biases).
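Using the same parameter-counting sketch as in the open-loop section, the closed-loop bounds are consistent with, for example, hidden layer sizes of (2, 2) and (6, 6) fed by 18 inputs; which of the predefined architectures actually sit at these extremes is our own assumption.

    print(mlp_param_count([18, 2, 2, 18]))  # 98  (matches the stated lower bound)
    print(mlp_param_count([18, 6, 6, 18]))  # 282 (matches the stated upper bound)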
In the closed-loop case, we follow the same evaluation scheme as before, except that for both MAP-Elites and PPO we sample only 50 hyper-parameter configurations with three replications each over the short horizon. Then, we select the best-performing configuration and perform 20 replications over the long horizon. In this experiment, the short horizon is 533M frames and the long one 1.6B frames.

The sampling distributions of the PPO hyper-parameters are as before, with the exception of the batch sizes, selected from just 3 options of 16, 32, or 64 Ki, as inspired by the similar settings of [11]. In the case of MAP-Elites, the hyper-parameters are sampled just as in the open-loop controller setting. The details of the final hyper-parameter values are given in Table 1.
The statistics over the single best configuration's runs of each algorithm are presented in Fig. 1 (right). In the closed-loop setting, PPO is much more competitive than in the open-loop setting. As Fig. 1 (right) shows, PPO quickly increases its performance in terms of simulation frames; however, it eventually flattens out at a median performance of around 6.6 meters. The characteristics of the gaits produced by PPO changed dramatically in this scenario, delivering very fast gaits (e.g., the "gallop" gait).

In contrast, MAP-Elites' performance ascends over a much longer horizon and converges at a higher median performance value of around 6.9 meters, without the risk of divergence. The p-value yielded by the Wilcoxon Signed-Rank Test is 6e-4, therefore confirming the statistical difference between the algorithms' results at the end of the horizon. It is important to recall that MAP-Elites does not provide just one policy, but instead a large collection of diverse and high-performing solutions. Fig. 1 (right) also shows the number of controllers in the MAP-Elites archives that outperform the median performance of PPO at each frame. We can see that MAP-Elites produces between 3 and 25 such controllers for the same computational budget as PPO. The gait named "tiptoe" is an example of a closed-loop controller evolved with MAP-Elites.

Additionally, in Fig. 2 (left) we analyze the sensitivity of both algorithms to the hyper-parameter selection. We can see that, for the relatively short horizon of 75M frames, the median of the median performances across configurations is higher for PPO than for MAP-Elites, and the inter-quartile range is similar. After 533M frames, the PPO results across configurations display an increased inter-quartile range and a similar median as in the shorter runs. This is not the case for the MAP-Elites configurations, which reduce their variability and substantially increase their median, to the extent that makes them dominant over the corresponding PPO configurations. Fig. 2 (right) shows an example archive for the best-performing MAP-Elites configuration and a randomly selected run after a total of 1.6B time steps, with the best individuals in red.

PPO                 Open-loop    Closed-loop
clip_range (ε)      0.08458      0.27059
learning_rate       2.7175e-3    1.52e-5
layer_0_size        3            5
layer_1_size        4            5
entropy (c2)        1.4938e-4    0

MAP-Elites          Open-loop    Closed-loop
nb_gen              3825         24001
mutation_rate       0.0891977    0.188637
layer_0_size        4            4
layer_1_size        4            5
descriptor_base     4            4

Table 1: PPO (top) and MAP-Elites (bottom) final hyper-parameter configurations selected after tuning. Additionally, both PPO experiment setups used 10 epochs, 32 mini-batches, a batch size of 32 Ki, a discount factor γ = 0. and a Generalized Advantage Estimator [23] factor λ = 0.

The behavior of PPO in terms of the open-loop gait controller learning seems quite contrary to MAP-Elites. PPO was subject to oscillations and drops during the training and showed less robustness to hyper-parameters and seeds (especially for small batch sizes). The aforementioned "tip-over" gait, found in the course of the open-loop PPO experimentation, can be thought of as an easy-to-find local optimum. On the other hand, MAP-Elites demonstrated great exploration capabilities, without a need for additional stimulation like the entropy term of PPO [11]. MAP-Elites was able not only to double the median episode return value after a certain computational horizon but also found such an original gait as the "aspiring bipedal".
These results are rather surprising, as MAP-Elites, optimizing for two goals, diversity and performance, seems disadvantaged by design when compared with a method that solely focuses on performance.

Despite nearly doubling the search effort in configuration count to find a well-performing open-loop configuration of PPO, none of them were statistically competitive against MAP-Elites. However, despite more than 1000 runs, PPO's hyper-parameter space is of very high dimensionality, which makes it still very possible that more impressive configurations do exist. One must also remember that the discussed open-loop setting is deterministic (stochasticity injected into MAP-Elites proves to be troublesome [28, 29]).

In the case of the closed-loop setting, the performance gap between the algorithms proved to be less obvious. PPO seems to prioritize exploitation, which allows for a quicker ascent, whereas MAP-Elites needs a longer horizon to dominate over its contender. This behavior is a well-known property of MAP-Elites, as its exploration capability may result in a long time to find a high-performing solution, which in our case is also the better-performing solution.

Generally, due to PPO's greater number of hyper-parameters and the sensitivity to their actual values (exemplified in Fig. 2), it tends to be more difficult to find stable configurations, even despite the insights of similar applications in the literature [11] and available software [30]. For this reason, throughout the whole experimentation, more effort was spent on the examination of PPO (PPO: 620 vs MAP-Elites: 500 billion frames in total). This seemed a reasonable course of action given our objective of providing a fair comparison and the fact that PPO tended to systematically show lower performance than MAP-Elites.

MAP-Elites by design has the characteristic of non-decreasing performance for the best-fit individual. Thanks to this feature, learning is stable, without the oscillations or drops that are typical of RL methods like PPO. Since PPO maintains only one solution that is constantly altered, it may also have a problem with reverting undesirable actions, whereas MAP-Elites would simply discard an inefficient individual within the context of a certain behavioral bin.
This work presents a thorough comparison of the Evolutionary Algorithm MAP-Elites and the Policy Gradient method PPO in the context of robotic gait learning in a deterministic simulation. To achieve statistically significant results, we perform numerous replications, which, combined with the hyper-parameter search, require a computational effort of around one trillion simulation frames.

Our work tends to suggest that MAP-Elites, if applicable, might be easier to implement in a parallel setting and tends to deliver slightly better results over the long horizon, even when compared with a top Policy Gradient approach like PPO. At the same time, its robustness to seed and hyper-parameter selection proves very convenient when working with MAP-Elites, saving compute that would otherwise be invested in these selection activities, and encouraging reproducibility.

Although, due to the fairness of the comparison, we constrain MAP-Elites to evolve neural networks with a fixed topology, this is unnecessary in the general case. Throughout the experimentation, we have also obtained some preliminary results on evolving the neural network topology together with the weights. These results hinted at further improvements in performance, producing another original gait such as the "bunny hop" gait. This idea could be further investigated, as well as considering hybrid approaches or addressing the problem of MAP-Elites with noisy observations.
We would like to thank Marek Barwiński, Ben Cataldo, Luca Grillotti, and Joe Phillips for their efforts in reading the paper draft and their valuable comments.
References

[1] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. CoRR, abs/1712.06567, 2017. URL http://arxiv.org/abs/1712.06567.
[2] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[3] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret. Robots that can adapt like animals. Nature, 521(7553):503, 2015.
[4] A. Cully and Y. Demiris. Quality and diversity optimization: A unifying modular framework. IEEE Transactions on Evolutionary Computation, 22(2):245–259, 2017.
[5] R. Kaushik, P. Desreumaux, and J.-B. Mouret. Adaptive prior selection for repertoire-based online adaptation in robotics. Frontiers in Robotics and AI, 6:151, 2020.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.
[7] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[8] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[11] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, M. Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
[12] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[13] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[14] J.-B. Mouret and J. Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
[15] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
[16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[17] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[18] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. 2014.
[19] M. Zambrano-Bigiarini, M. Clerc, and R. Rojas. Standard particle swarm optimisation 2011 at CEC-2013: A baseline for future PSO improvements. In 2013 IEEE Congress on Evolutionary Computation, pages 2337–2344. IEEE, 2013.
[20] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2):5–23, 2016.
[21] M. P. Deisenroth, G. Neumann, J. Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
[22] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[23] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[24] R. Reeve and J. Hallam. An analysis of neural models for walking control. IEEE Transactions on Neural Networks, 16(3):733–742, 2005.
[25] S. Lee, J. Yosinski, K. Glette, H. Lipson, and J. Clune. Evolving gaits for physical robots with the HyperNEAT generative encoding: The benefits of simulation. In European Conference on the Applications of Evolutionary Computation, pages 540–549. Springer, 2013.
[26] T. T. Huan, C. Van Kien, and H. P. H. Anh. Adaptive evolutionary neural network gait generation for humanoid robot optimized with modified differential evolution algorithm. In , pages 621–626. IEEE, 2018.
[27] J. Lee, M. Grey, S. Ha, T. Kunz, S. Jain, Y. Ye, S. Srinivasa, M. Stilman, and C. Karen Liu. DART: Dynamic animation and robotics toolkit. The Journal of Open Source Software, 3:500, 02 2018. doi: 10.21105/joss.00500.
[28] N. Justesen, S. Risi, and J.-B. Mouret. MAP-Elites for noisy domains by adaptive sampling. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '19, pages 121–122, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6748-6. doi: 10.1145/3319619.3321904. URL http://doi.acm.org/10.1145/3319619.3321904.
[29] M. Flageat and A. Cully. Fast and stable MAP-Elites in noisy domains using deep grids. In Artificial Life Conference Proceedings, pages 273–282. MIT Press, 2020.
[30] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.