Sparse Reward Exploration via Novelty Search and Emitters
Giuseppe Paolo, Alexandre Coninx, Stephane Doncieux, Alban Laflaquière
Giuseppe Paolo
AI Lab, SoftBank Robotics Europe; Sorbonne Université, CNRS, Institut des Systèmes Intelligents et de Robotique, ISIR, Paris, France

Alexandre Coninx
Sorbonne Université, CNRS, Institut des Systèmes Intelligents et de Robotique, ISIR, Paris, France

Stephane Doncieux
Sorbonne Université, CNRS, Institut des Systèmes Intelligents et de Robotique, ISIR, Paris, France

Alban Laflaquière
AI Lab, SoftBank Robotics Europe, Paris, France
[email protected]

ABSTRACT
Reward-based optimization algorithms require both exploration, to find rewards, and exploitation, to maximize performance. The need for efficient exploration is even more significant in sparse reward settings, in which performance feedback is given sparingly, thus rendering it unsuitable for guiding the search process. In this work, we introduce the SparsE Reward Exploration via Novelty and Emitters (SERENE) algorithm, capable of efficiently exploring a search space, as well as optimizing rewards found in potentially disparate areas. Contrary to existing emitters-based approaches, SERENE separates the search space exploration and reward exploitation into two alternating processes. The first process performs exploration through Novelty Search, a divergent search algorithm. The second one exploits discovered reward areas through emitters, i.e. local instances of population-based optimization algorithms. A meta-scheduler allocates a global computational budget by alternating between the two processes, ensuring the discovery and efficient exploitation of disjoint reward areas. SERENE returns both a collection of diverse solutions covering the search space and a collection of high-performing solutions for each distinct reward area. We evaluate SERENE on various sparse reward environments and show it compares favorably to existing baselines.

KEYWORDS
Novelty search, sparse rewards, emitters, evolutionary algorithm, quality diversity
ACM Reference Format:
Giuseppe Paolo, Alexandre Coninx, Stephane Doncieux, and Alban Laflaquière. 2021. Sparse Reward Exploration via Novelty Search and Emitters. In Proceedings of the Genetic and Evolutionary Computation Conference 2021 (GECCO '21).
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Figure 1: SERENE consists of two exploration and exploitation processes, controlled by a scheduler. The exploration process searches for novel solutions through Novelty Search. The exploitation process uses emitters to optimize the rewards discovered during exploration. The scheduler alternates between the two processes by splitting the total evaluation budget into chunks of size K to assign to either of them.
Embodied agents solve tasks by learning a policy dictating how to act in different situations. This is done by evaluating the agent's performance on the task through a reward function. Learning strategies for such agents can be divided in two groups: step-based and episode-based [34]. The former expects a reward after each step. On the contrary, episode-based ones need rewards only at the end of each training episode. So much reliance on the reward forces some constraints: the reward function must be well designed and provide feedback as frequently as possible. In many complex scenarios where the reward is given only if specific conditions are met, such constraints are impossible to respect. These are known as sparse reward situations and can prove very difficult to tackle. In this work, we consider sparse reward settings in which the reward is obtained only in small disjoint areas of the whole search space. One example would be a robotic arm trying to push an object to one of a few given positions. The search space consists of all the positions the object can achieve, while the reward is given only if the object reaches one of the goals. In such situations, a standard Reinforcement Learning (RL) [36] agent explores by trying random actions. The probability of finding a reward this way is almost zero, rendering learning impractical. Therefore, the way exploration is performed is fundamental when dealing with sparse reward settings.

In recent years, many algorithms have been proposed to solve this problem [5, 7, 16, 25, 37]. Among them, Novelty Search (NS) is an evolutionary algorithm that focuses only on exploration, while ignoring any possible reward [25]. By doing so, NS tends towards a uniform exploration of the search space [15], avoiding the need for a well-defined reward function. At the same time, its strength is also its limitation: considering all the non-rewarding areas as valuable as the rewarding ones prevents the algorithm from finding the best possible solutions. Augmenting NS with the ability to shift its focus from pure exploration to reward exploitation could help address this issue. One possible way of doing so is by using multi-objective optimization methods like NSGA-II [13]. However, merging exploration and exploitation through a Pareto front can degrade the exploring power of the algorithm. A different approach is taken by Quality-Diversity (QD) algorithms, a family of methods that build a set of both diverse and high-quality solutions [33].

In this work, we introduce SparsE Reward Exploration via Novelty search and Emitters (SERENE), a QD algorithm addressing sparse reward problems. SERENE augments NS with emitters [18] to perform reward maximization while keeping its exploration ability, thanks to a clear separation between exploration and exploitation. Introduced as a way to improve the efficiency of MAP-Elites (ME) [28] in the CMA-ME method [18], emitters are instances of reward-based evolutionary algorithms scheduled to perform a local search in the search space. In the original formulation, ME acts as a scheduler by initializing emitters in different areas of the search space. The emitters then perform both local exploration and exploitation of the reward, leading to degraded performance in settings with very sparse rewards, where not all policies can obtain a reward. Conversely, SERENE decouples exploration from exploitation to better deal with such situations.
The former is performed through NS, completely ignoring the reward. Once a reward area is found, SERENE spawns emitters focusing solely on its maximization, without interfering with the exploration process. This allows our algorithm to shift its focus between exploration and exploitation at any moment. Persisting in exploring even after some reward areas have been found is essential, since other reward areas could be present in the search space.

In the following, we discuss other works tackling the sparse reward problem in Section 2. In Section 3 we analyze the methods SERENE draws from and explain in detail the concept of emitter. The method itself is introduced in Section 4, tested in Section 5, and the results discussed in Section 6. We conclude with Section 7 by pointing at possible extensions and improvements.

Step-based algorithms expect a reward at every step, making dealing with sparse rewards particularly difficult; this is the case for many RL algorithms. Following the recently increased interest in the problem, many new approaches have been proposed to deal with this sparsity. Some methods work on improving the data efficiency of the search [1, 29]. Others introduce some artificial curiosity by counting the number of times a state is visited, and push exploration by making less-visited states more rewarding [2, 37]. Another strategy uses additional shaped rewards to aid in approaching the task [38]. In [14, 23, 31], a population of RL agents is used to increase exploration while learning a policy. However, none of these methods explicitly separates exploration and exploitation.

Episode-based methods, and more specifically evolutionary algorithms [39], are better suited for dealing with sparse reward settings, given their more relaxed dependency on the reward. For this reason, many works combine evolutionary algorithms with RL. In [32] and [24], the authors use Evolutionary Strategy (ES) to collect the data over which an RL agent is then trained. These approaches take advantage of the exploration of evolution-based methods and the data efficiency of RL.

Separating exploration from exploitation has been proven useful for overcoming deceptive gradients in sparse reward settings [6, 7, 16]. In [7], a reward-agnostic exploration phase is first performed through Goal Exploration Processes [19]; then an RL-based policy is learned on the collected data. A similar two-step process is used in [16] to solve ATARI games. Conversely, QD-RL [6] separates exploration and exploitation by taking advantage of a QD population trained through an actor-critic approach. Half of the population is optimized for quality, while the other half is optimized for diversity.

Divergent search methods, as the one used in [6], generate solutions by looking for a set of diverse policies. This prevents getting stuck in local optima that could limit the performance of the solutions. One of the first algorithms developed in this direction is NS [25]. Since then, many divergent search algorithms have been developed, using different mechanisms to drive the search: curiosity [35], empowerment [5], surprise [21], diversity [11, 12, 17, 33], and novelty [26]. QD [12, 33] is a family of divergent search algorithms that searches for a set of diverse solutions while also improving on their quality.
A well-known QD algorithm is ME [28], a method that drives the search for novel policies by discretizing the search space into a grid and filling its cells with high-performing solutions. QD algorithms have been extended by combining them with ES [3] to increase their efficiency and speed of convergence [9, 10, 18]. In [9], an ES uses NS's novelty objective to look for novel solutions while improving their performance. At the same time, the approach followed in [18], and then extended in [10], uses ME as a scheduler for modified instances of CMA-ES [22], named emitters. Exploration of the search space and reward exploitation are both performed through emitters. Regardless of the method's power, fusing the two aspects can limit performance in sparse reward settings where reward-based algorithms struggle to explore.

In this work, we take inspiration from [18] by combining emitters with NS to keep the two aspects, i.e. exploration and exploitation, separated. This allows our method to avoid the shortcomings of exploring through emitters. In the next section, we describe in detail how both NS and emitters work before detailing the functioning of SERENE.

The notation used in this work is based on the one introduced in [15] and is directly inspired by the RL literature.
NS is an evolutionary algorithm that replaces the usual fitness metrics used by evolutionary algorithms with a novelty metric. This metric pushes the search towards novel areas of the search space. The novelty is calculated in a hand-defined behavior space B in which the behavior of each policy θ_i ∈ Θ is represented. When a policy is evaluated, it traverses a sequence of states τ = [s_0, ..., s_T]. Traversed states are observed through some sensors generating a sequence of observations τ_O = [o_0, ..., o_T], with o_t ∈ O. From the sequence of observations it is possible to extract a representation b_i ∈ B of the policy's behavior by using an observer function O_B : O → B. This whole process can be summarized by introducing a behavior function directly mapping a policy θ_i to its behavior descriptor b_i:

\phi(\theta_i) = b_i.    (1)

Once computed, the behavior descriptors are used to calculate the policies' novelty as:

\eta(\theta_i) = \frac{1}{|J|} \sum_{j \in J} \mathrm{dist}(b_i, b_j) = \frac{1}{|J|} \sum_{j \in J} \mathrm{dist}\big(\phi(\theta_i), \phi(\theta_j)\big),    (2)

where J is the set of indexes of the k policies closest to θ_i in the behavior space.

The novelty of the policies is calculated at each generation and used to choose the policies for the next generation. Moreover, Q policies are sampled to be stored into an archive, returned as outcome of the algorithm. The archive is also used to keep track of the already explored areas of the space B. This is done by choosing the |J| closest neighbors used in equation (2) not only from the current population and offspring but also from the archive. By choosing the most novel policies from the previous generation to compose the population, the search is always pushed towards less explored areas of B. Notwithstanding its capacity for exploration, NS cannot exploit the rewards potentially found during the search. This leads to low-rewarding solutions.

An emitter [10, 18] is an instance of an estimation-of-distribution reward optimization algorithm, such as CMA-ES [22]. Its objective is to rapidly examine a small area of the search space while optimizing on the reward. In [18] and [10] the CMA-ME algorithm combines emitters with ME [28], by using the latter as a scheduler for the emitters' evaluation. It works by initializing a population of policies θ by sampling their parameters from a distribution N(μ, Σ) and adding them to the ME archive. The algorithm then samples one of these policies and uses it to initialize the population of the emitter E_i. At this point, E_i is evaluated until a termination criterion is met, e.g. a lack of increase of the reward found. Moreover, the policies found during the evaluation of the emitter are added to the ME archive according to the ME addition strategy. After the termination of E_i, a new emitter is initialized by sampling another policy from the archive. This is repeated until the whole evaluation budget is depleted.

Different types of algorithms can be used as emitters, changing how the search is performed and how the policies are selected. This shows the flexibility of the approach. At the same time, both [18] and [10] perform exploration through reward-following emitters. This reduces performance in situations where the reward is very sparse and many of the policies do not get any reward. Decoupling the exploitation of the reward from the exploration allows sparse reward settings to be dealt with more efficiently [8].
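To make equation (2) concrete, the novelty score can be computed with a plain k-nearest-neighbour query over behavior descriptors. The following NumPy sketch uses our own names (compute_novelty, k) and is only an illustration, not code from the SERENE repository.

```python
import numpy as np

def compute_novelty(descriptors, reference, k=15):
    """Novelty of each behavior descriptor: mean distance to its k nearest
    neighbours in the reference set (population + offspring + archive),
    following equation (2). Names and layout are illustrative assumptions.

    descriptors: (N, d) array of behavior descriptors to score.
    reference:   (M, d) array of descriptors used as neighbours.
    """
    novelty = np.empty(len(descriptors))
    for i, b in enumerate(descriptors):
        dists = np.sort(np.linalg.norm(reference - b, axis=1))
        # Skip the zero distance to itself when b is also part of the reference set.
        if np.isclose(dists[0], 0.0):
            dists = dists[1:]
        novelty[i] = dists[:k].mean()
    return novelty
```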
SERENE disentangles the exploration of the behavior space B from the exploitation of the reward through a two-step process. In the first phase, called exploration phase, B is explored by performing NS. As per equation (1), the policies θ_i found during exploration are assigned a behavior descriptor φ(θ_i). A policy obtaining a reward means that its φ(θ_i) belongs to the subspace of rewarding behaviors B_R ⊆ B. It is in this subspace that the exploitation of the reward happens. This is done in the second phase, called exploitation phase, in which the emitters are initialized using the rewarding policies found in B_R during exploration. During the exploitation phase the most rewarding policies are stored to be returned as result of the algorithm. Moreover, particularly novel policies found by the emitters are also stored. By launching emitters only in the neighborhoods of the reward areas, SERENE keeps the exploitation of the reward separated from the exploration of the search space. This results in taking the best of both worlds: the exploration power of NS and the focused exploitation of reward-based algorithms.

The exploitation and exploration phases are alternated repeatedly through a meta-scheduler. This scheduler divides a total evaluation budget B in smaller chunks of size K and assigns them to either one of the two phases. The whole process is illustrated in Figure 1 and described in Algorithm 1.
Algorithm 1: SERENE
INPUT: evaluation budget B, budget chunk size K, population size M, emitter population size M_E, offspring per policy m, mutation parameter σ, number of policies added to novelty archive Q;
RESULT: novelty archive A_N, rewarding archive A_R;
A_N = ∅; A_R = ∅; Q_E = ∅; Q_NC = ∅; Q_CE = ∅;
Sample population Γ_0;
Split B in chunks of size K;
while B not depleted do
    if Γ_0 then
        Evaluate θ_i, ∀ θ_i ∈ Γ_0;
        Calculate b_i = φ(θ_i) ∈ B, ∀ θ_i ∈ Γ_0;
    Exploration Phase(K, m, σ, A_N, Q_CE, Γ_g, Q);
    if Q_CE ≠ ∅ or Q_E ≠ ∅ then
        Exploitation Phase(K, Q_CE, λ, m, Q_E, A_N, A_R, M_E);

To keep track of policies generated during the different phases, SERENE uses the following buffers and containers:
• novelty archive A_N: a repertoire of the novel policies found during the exploration phase, and returned as first output of SERENE;
• reward archive A_R: a repertoire of rewarding policies found during the exploitation phase, returned as second output of SERENE;
• candidate emitters buffer Q_CE: a buffer containing the rewarding policies φ(θ_i) ∈ B_R found during the exploration phase and used in the exploitation phase to initialize emitters;
• emitter buffer Q_E: a buffer containing all the initialized emitters to be evaluated during the exploitation phase;
• novelty candidates buffer Q_NC: a buffer containing the most novel policies found by the emitter. Each emitter has its own instance of this buffer, and the policies in it are sampled for addition to the novelty archive once the emitter is terminated.
A high-level overview of how these sets interact during the two phases is given in Figure 2, and a more detailed description is proposed in the two following subsections.
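To make the alternation of Algorithm 1 concrete, the scheduler can be written as a loop over budget chunks. The sketch below is illustrative only: the callback names and signatures are our assumptions and do not reflect the released implementation.

```python
def serene_scheduler(total_budget, chunk_size, exploration_phase, exploitation_phase):
    """Alternate exploration and exploitation over budget chunks of size K.

    Both callbacks take (chunk_size, candidate_emitters, emitter_buffer) and
    return the number of evaluations they consumed.
    """
    candidate_emitters = []   # Q_CE: rewarding policies found during exploration
    emitter_buffer = []       # Q_E: bootstrapped emitters awaiting further evaluation
    spent = 0
    while spent < total_budget:
        spent += exploration_phase(chunk_size, candidate_emitters, emitter_buffer)
        # The next chunk goes to exploitation only if there is something to exploit.
        if candidate_emitters or emitter_buffer:
            spent += exploitation_phase(chunk_size, candidate_emitters, emitter_buffer)
    return spent
```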
Exploration phase

SERENE starts by generating an initial population Γ_0 of size M. This is done by sampling the parameters of the population's policies θ_j from a normal distribution N(0, I). The population is used to explore the behavior space B through NS. At each generation g, a mutation operator generates m new policies θ_j^i (offspring) from each of the policies θ_j ∈ Γ_g:

\theta_j^i = \theta_j + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma I), \quad \forall j \in \{1, \dots, M\}, \ \forall i \in \{1, \dots, m\}.    (3)

The resulting offspring population Γ_g^m, of size m × M, is then evaluated to obtain the behavior descriptors φ(θ_j^i) = b_j^i ∈ B. The novelty of Γ_g and Γ_g^m is then calculated using equation (2) and is used to generate the next generation population Γ_{g+1} by taking the most novel policies from Γ_g ∪ Γ_g^m. At the same time, Q policies θ_j^i ∈ Γ_g^m are uniformly sampled to be added to the novelty archive A_N. Finally, all the rewarding policies found are stored in the candidate emitters buffer Q_CE. The process just described is detailed in Algorithm 2.

The exploration phase is executed for the K evaluation steps in the given budget chunk, where each evaluation step corresponds to one policy evaluation. Once the chunk is depleted, the scheduler assigns the next chunk to the exploitation phase only if Q_CE ≠ ∅. Otherwise, another exploration phase is performed. This means that if no reward can be discovered, i.e. B_R = ∅, SERENE performs exactly like NS.
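A single exploration generation, combining the mutation of equation (3) with novelty-based survivor selection, could look like the sketch below. It reuses the compute_novelty helper from the earlier sketch; the behavior_of and reward_of callbacks and the array layout are assumptions made for illustration (in practice both quantities come from the same rollout).

```python
import numpy as np

def exploration_generation(population, archive_bds, behavior_of, reward_of,
                           m=5, sigma=0.1, Q=5, rng=None):
    """One NS generation (a sketch under assumed interfaces, not the reference code).

    population:  (M, n) array of policy parameters.
    archive_bds: (A, d) array of behavior descriptors already in A_N.
    """
    rng = rng or np.random.default_rng()
    M, n = population.shape

    # Equation (3): each parent generates m offspring by Gaussian mutation.
    offspring = np.repeat(population, m, axis=0) + rng.normal(0.0, sigma, (M * m, n))

    pool = np.vstack([population, offspring])                         # Γ_g ∪ Γ_g^m
    bds = np.array([behavior_of(theta) for theta in pool])
    reference = np.vstack([bds, archive_bds]) if len(archive_bds) else bds
    novelty = compute_novelty(bds, reference)                         # equation (2)

    # Q offspring are uniformly sampled into the novelty archive A_N.
    sampled = rng.choice(np.arange(M, len(pool)), size=Q, replace=False)
    archive_bds = np.vstack([archive_bds, bds[sampled]]) if len(archive_bds) else bds[sampled]

    # Rewarding policies fill the candidate emitters buffer Q_CE.
    q_ce = [theta for theta in pool if reward_of(theta) > 0.0]

    # The M most novel policies of parents + offspring form Γ_{g+1}.
    next_population = pool[np.argsort(novelty)[-M:]]
    return next_population, archive_bds, q_ce
```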
Exploitation phase

The exploitation phase consists of two sub-steps: the bootstrap step, in which the policies in Q_CE are used to initialize and bootstrap emitters, and the emitter step, in which the initialized emitters are evaluated.
Bootstrap step. During this step, emitters are initialized from the rewarding policies θ_i ∈ Q_CE, and their potential for reward improvement is evaluated. This ensures that only emitters capable of improving the rewards are considered for full evaluation, reducing wasted evaluation budget. The policies used to initialize the emitters are selected according to their novelty with respect to A_R. This enables SERENE to focus on less explored areas of the rewarding behavior space B_R. The whole bootstrap step lasts K/2 evaluations.

Algorithm 2: Exploration Phase
INPUT: budget chunk K, number of offspring per parent m, mutation parameter σ, novelty archive A_N, candidate emitters buffer Q_CE, population Γ_g, number of policies Q;
while K not depleted do
    Generate offspring Γ_g^m from population Γ_g;
    Evaluate θ_i, ∀ θ_i ∈ Γ_g^m;
    Calculate b_i = φ(θ_i) ∈ B, ∀ θ_i ∈ Γ_g^m;
    Calculate η(θ_i) = (1/|J|) Σ_{j∈J} dist(b_i, b_j), ∀ θ_i ∈ Γ_g^m ∪ Γ_g;
    A_N ← θ_i with θ_i ∈ Γ_g^m, ∀ i ∈ {1, ..., Q};
    if φ(θ_i) ∈ B_R then Q_CE ← θ_i;
    Generate Γ_{g+1} from most novel θ_i ∈ Γ_g^m ∪ Γ_g;

As discussed in Section 3.2, an emitter is an instance of a reward-following population-based optimization algorithm. In this work, we do not use estimation-of-distribution algorithms like CMA-ES [22], because the estimation of the covariance matrix C is unreliable when the population size is smaller than the cardinality of the parameter space Θ. CMA-ES circumvents the issue by using information from previous generations to calculate C. While stabilizing C, this also leads to a less efficient use of the evaluation budget. Hence, in this work we use as emitter an elitist genetic algorithm that does not require any estimation of distribution. Instead, it composes its population with the most rewarding policies from the previous generation's population and offspring, while the offspring are generated according to equation (3).

An emitter E_i based on this algorithm consists of: a population P containing M_E policies θ̃ ∈ Θ; a population of offspring P^m of size m × M_E; a generation counter γ; a tracker R_γ for the maximum reward found up until generation γ; an improvement measure I(·); a novelty measure η_i equal to the novelty of the policy used to initialize the emitter; and a novelty candidates buffer Q_NC. E_i is initialized from a policy θ_i ∈ Q_CE by sampling its initial population P_0 from the distribution N(θ_i, σ_i I). To keep the emitter's exploration local and prevent overlapping with the search space of possible nearby emitters, we initialize σ_i as:

\sigma_i = \frac{1}{3} \min \big\{ \mathrm{dist}(\theta_i, \theta_j), \ \forall \theta_j \in \Gamma_g^m \cup \Gamma_g \big\}.    (4)

This shapes N(θ_i, σ_i I) such that all other θ_j are at least 3 standard deviations away from its center. Once E_i has been initialized, its potential is evaluated by running it for λ generations and calculating its emitter improvement I(E_i). This improvement is defined as the difference of the average rewards obtained during the initial and the most recent generations of the emitter:

I(\mathcal{E}_i) = \frac{2}{\lambda M_E} \left( \sum_{\gamma = T - \lambda/2}^{T} \sum_{j=1}^{M_E} r(\gamma, j) \; - \; \sum_{\gamma = \gamma_0}^{\gamma_0 + \lambda/2} \sum_{j=1}^{M_E} r(\gamma, j) \right).    (5)

Here T is the last evaluated generation, r(γ, j) is the reward of policy θ̃_j ∈ P_γ, and γ_0 is the generation at which the emitter was at the beginning of the exploitation phase; it is always γ_0 = 0 for an emitter in the bootstrap step.
If I(E_i) ≤ 0, the chances for the emitter to find better solutions than the initial ones are low, so it is not worth allotting more budget to its evaluation. On the contrary, I(E_i) > 0 means that the emitter has high potential for improvement. Thus, all the initialized emitters for which I(E_i) > 0 are added to the emitter buffer Q_E for further evaluation.

Figure 2: Overview of the sets used by SERENE to keep track of the explored areas and the initialized emitters. Highlighted in red are the two archives returned as final result of the algorithm execution.
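For concreteness, the two bootstrap quantities can be sketched as below, following the reconstructed forms of equations (4) and (5); the function names and data layout are our assumptions, not the reference implementation.

```python
import numpy as np

def emitter_sigma(theta_i, other_policies):
    """Equation (4): one third of the distance to the closest other policy, so
    that neighbouring policies lie at least 3 standard deviations away from the
    center of N(theta_i, sigma_i * I)."""
    dists = np.linalg.norm(np.asarray(other_policies) - theta_i, axis=1)
    return dists.min() / 3.0

def emitter_improvement(reward_history, lam):
    """Equation (5): average reward of the emitter's most recent lam/2
    generations minus that of its first lam/2 generations.

    reward_history: list with one array of per-policy rewards per generation.
    """
    half = max(1, lam // 2)
    recent = np.mean(np.concatenate(reward_history[-half:]))
    initial = np.mean(np.concatenate(reward_history[:half]))
    return recent - initial
```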
Emitter step. The initialized emitters in the emitter buffer Q_E are run during this step. It starts by calculating the Pareto fronts between the improvement I(E_i) and the novelty η(E_i) of each of the emitters E_i ∈ Q_E. The emitter to run is then sampled from the front of non-dominated emitters. Using both the novelty and the fitness to select which emitter to run allows SERENE to focus both on the less explored and on the most promising areas of B_R.

The policies θ̃_j ∈ Θ generated by an emitter can be stored either for the reward they achieve or for their novelty. At every generation γ, all the policies θ̃_j ∈ P_γ with a reward r(θ̃_j) > R_{γ−1} are added to A_R. Additionally, the policies θ̃_j for which η(θ̃_j) > η_i are stored into the emitter's novelty candidates buffer Q_NC.

The emitter E_i is run until either the given budget chunk is depleted or a termination condition is met. In the first case, SERENE recalculates I(E_i) from the beginning of the emitter phase and assigns the next budget chunk to the exploration phase. On the contrary, if a termination condition is met, E_i is discarded and another emitter to evaluate is sampled from the Pareto front. There can be multiple termination conditions. The one used in this work is inspired by the stagnation criterion defined in [22], stopping the emitter when there is no more improvement on the reward. A detailed definition of the termination condition is presented in Appendix B. Before starting the new emitter evaluation, Q policies from the terminated emitter's novelty candidates buffer Q_NC are uniformly sampled to be added to A_N. In addition to saving particularly novel solutions as part of the final result, this prevents the exploration phase from re-exploring areas covered by emitters during the exploitation phase. The whole exploitation phase is detailed in Algorithm 3.
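The selection of the next emitter to run can be sketched as follows; the EmitterStats container and the uniform sampling over the front are illustrative assumptions, since the paper only specifies that the emitter is sampled from the non-dominated front.

```python
import random
from dataclasses import dataclass

@dataclass
class EmitterStats:
    improvement: float  # I(E_i), updated as the emitter runs
    novelty: float      # eta_i of the policy that initialized the emitter

def non_dominated(emitters):
    """Emitters not dominated on (improvement, novelty): dominated means another
    emitter is at least as good on both objectives and strictly better on one."""
    front = []
    for a in emitters:
        dominated = any(
            b.improvement >= a.improvement and b.novelty >= a.novelty
            and (b.improvement > a.improvement or b.novelty > a.novelty)
            for b in emitters if b is not a
        )
        if not dominated:
            front.append(a)
    return front

def sample_next_emitter(emitter_buffer):
    """Uniformly sample the next emitter to evaluate from the non-dominated front."""
    return random.choice(non_dominated(emitter_buffer))
```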
Algorithm 3: Exploitation Phase

INPUT: budget chunk K, candidate emitters buffer Q_CE, number of bootstrap generations λ, emitter population size M_E, number of offspring per policy m, emitters buffer Q_E, rewarding archive A_R, novelty archive A_N;
/* Bootstrap step */
while K/2 not depleted do
    Select most novel policy θ_i from Q_CE;
    Calculate σ_i;
    Initialize E_i, Q_NC^i = ∅, and P_0;
    for γ ∈ {0, ..., λ} do
        if P_0 then Evaluate θ̃_j, ∀ θ̃_j ∈ P_0;
        Generate offspring population P_γ^m from P_γ;
        Evaluate θ̃_j, ∀ θ̃_j ∈ P_γ^m;
        Generate P_{γ+1} from best θ̃_j ∈ P_γ^m ∪ P_γ;
    Calculate I(E_i);
    if I(E_i) > 0 then Q_E ← E_i;
/* Emitter step */
Calculate Pareto fronts in Q_E;
while K/2 not depleted do
    Sample E_i from non-dominated emitters in Q_E;
    while not terminate(E_i) do
        Generate offspring population P_γ^m from P_γ;
        Evaluate θ̃_j, ∀ θ̃_j ∈ P_γ^m;
        A_R ← θ̃_j, ∀ θ̃_j ∈ P_γ^m | r(θ̃_j) > R_γ;
        Q_NC^i ← θ̃_j, ∀ θ̃_j ∈ P_γ^m | η(θ̃_j) > η_i;
        Generate P_{γ+1} from best θ̃_j ∈ P_γ^m ∪ P_γ;
        Update I(E_i) and R_γ;
    if terminate(E_i) then
        A_N ← θ̃_j with θ̃_j ∈ Q_NC^i, ∀ j ∈ {1, ..., Q};
        Discard emitter E_i;

The code repository is available at:

For the evaluation, we consider the four sparse reward environments illustrated in Figure 3:

Curling: A two Degrees of Freedom (DoF) robotic arm controlled by a 3-layer Neural Network (NN) with each layer of size . The arm has to push the blue ball into one of the two goal areas shown in orange and green. A reward is provided only if the ball stops in one of the two areas. Moreover, the closer the ball is to the center of the reward area, the higher the reward. The controller takes as input a 6-dimensional vector containing the ball pose (x, y) and the two joint angles and velocities. The output of the controller is the speed of each joint at the next timestep. The size of the parameter space Θ is 94, and each policy is run in the environment for timesteps.

Hard maze: Introduced in [25], it consists of a two-wheeled robot, in blue, whose task is to navigate the maze and reach either one of the green and orange areas. The reward is only given if the robot stops in one of the two areas, and is higher the closer the robot is to its center. The robot is controlled by a 2-layer NN with each layer of size . The controller takes as input the readings of the 5 distance sensors mounted on the robot, shown in red in Figure 3. Its output is the 2-dimensional vector containing the speed of the 2 wheels at the next timestep. The size of the parameter space Θ is 63, and each policy is run in the environment for timesteps.

Redundant arm: A 20-DoF robotic arm, introduced in [27], in which the arm's end-effector has to reach one of the 3 colored goal areas. The reward is maximal in the center of the areas, and the arm is controlled by an NN with 2 layers of size 5. The controller takes as input the 20-dimensional vector of each joint's position, and outputs the 20-dimensional joint torque vector. The size of the parameter space Θ is 228, and each policy is run in the environment for timesteps.

Robotic ant maze: Introduced in [6], it consists of a 4-legged robotic ant in a maze. There are two goal areas and the task is for the ant to navigate the maze and reach the center of one of them. The robot is controlled by a 3-layer NN, with each layer of size 10.
The input of the controller is the 29-dimensional observation returned by the environment at each step, while its output is the 8-dimensional joint torque control. The size of the parameter space Θ is 574, and each policy is run in the environment for timesteps.

Baselines
We compare SERENE against 5 different baselines:
• NS [25]: vanilla NS, which performs pure exploration and does not attempt to improve on the reward;
• NSGA-II [13]: a multi-objective evolutionary algorithm optimizing both the novelty and the reward;
• CMA-ME [18]: the original algorithm introducing emitters, which combines ME with emitters over a × grid covering the behavior space of all environments. Among the various emitters proposed in [18], we selected the "optimizing" emitter;
• ME [28]: vanilla MAP-Elites, which uses a × grid to cover the behavior space of every environment;
• RND: pure random search in which no selection happens, and every policy is sampled from a normal distribution N(0, I).
Each algorithm has been given a budget of B = evaluations, with the chunk size set to K = . The population size is M = , and for each policy we generate m = offspring. As mutation parameter we used σ = ., while the number of policies uniformly sampled to be added to the novelty archive is Q = .
Figure 3: Testing environments: Curling, HardMaze, Redun-dant arm, Robotic ant maze.
SERENE uses an emitter population size of M_E = , with a bootstrap phase for each emitter of λ = generations. For CMA-ME we used the same parameters as in [18]: 15 emitters, each one with a population size of 37. In every experiment, the policy parameters are bounded in the [−, ] range. Finally, the statistical results are computed over 15 runs for each experiment.

This section discusses the results obtained during the experiments.
Balancing the exploration of the search space and the exploitation of the reward is of paramount importance for reward-based algorithms, even more so in sparse reward environments. This balance can be studied by analyzing the amount of evaluation budget dedicated to either of the two aspects. The exploration budget consists of all the evaluated policies that did not get any reward. On the contrary, the exploitation budget is obtained by counting all the evaluated policies that collected some reward from one of the reward areas.

Figure 4: Budget percentage between the exploration of the search space (in blue) and the exploitation of each reward area (other colors).

As Figure 4 shows, SERENE has a more balanced budget split between exploration (in blue) and exploitation (other colors) compared to the other baselines. In situations in which exploration is harder, a bigger part of the budget is assigned to exploration rather than exploitation of the reward. This is the case for the robotic ant maze environment. Additionally, due to the way emitters are selected, the algorithm can shift its exploitation focus among the different reward areas. Figure 4 shows that most of SERENE's exploitation budget is assigned to the green reward area in the Curling, Hard maze and Robotic ant maze environments. As can be seen in Figure 3, this area is more difficult to discover and to reach than the orange area. This makes the exploitation of the orange reward area faster, with both its novelty and its improvement going to zero rapidly. On the contrary, since the green area is harder to reach, its novelty remains higher for longer, making SERENE select more emitters focused on it. The effect can also be seen in Figure 6, where the reward for area 1 quickly reaches higher values compared to that of reward area 2. At the same time, in the Redundant arm environment, where the 3 reward areas are equally easy to discover and to reach, this effect is less present and the exploitation budget is more evenly split between them. The ability to switch its focus is similar to intrinsic-motivation-based methods [4, 20] and allows SERENE to reach high rewards in all reward areas. Other baselines exhibit a less balanced distribution of the evaluation budget, as they do not explicitly separate exploration from exploitation.
Performing good exploration in situations of sparse rewards is fundamental in order to discover all the possible rewarding areas of the search space. In our experiments, we measured the exploration capacity of each of the tested algorithms through the coverage metric [28, 30]. It is evaluated by discretizing the search space in a × grid and calculating the percentage of cells occupied by the policies added to the archive. This metric does not include any measure of the performance of the solutions in the cells. Moreover, for NS-based methods, it does not consider the amount of space reached by the population, but only the one covered by the final archive.

The plots in Figure 5 show that SERENE can perform exploration with an efficiency comparable to NS, notwithstanding the lower budget assigned to exploring the search space. Nevertheless, Figure 5 shows that the coverage obtained by ME quickly reaches high values. This is due to two main factors. First, ME does not have a distinct population and archive, thus the initial population is completely added to the archive. This gives ME an initially higher coverage compared to NS-based methods, which only add Q policies from the initial population to the archive. The second factor is the discretization of the search space already induced by ME. Calculating the coverage on a grid of the same resolution as the one used by ME introduces a strong bias in favor of ME-based methods. However, this comes at the cost of having to choose the size of the grid's cells beforehand.

On the contrary, although based on ME, CMA-ME results are more variable across all environments, and exhibit lower exploration compared to ME. This effect is likely due to the reliance on emitters for exploration, leading to more local exploration in the parameter space Θ. It can prove useful in environments like Curling or Redundant arm, where a small change in parameters leads to big behavioral changes, increasing the probability of finding a reward. On the contrary, environments like Hard Maze or Robotic ant maze, in which this does not happen, can prove more challenging to explore.

At the same time, the exploration performance of NSGA-II is poor. In the Redundant arm environment, exploration is even lower than the random search baseline. This result is likely due to the multi-objective approach of optimizing both novelty and reward through Pareto fronts: as soon as a reward area is discovered, the best strategy to improve the front is to focus on the reward, because this scales better than the novelty.
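As a reference for how the coverage metric above is computed, a simple binning of the archived behavior descriptors onto a regular grid might look like the sketch below; the default grid resolution and the box-shaped bounds of the behavior space are assumptions made for illustration.

```python
import numpy as np

def coverage(descriptors, lower, upper, cells_per_dim=50):
    """Fraction of grid cells occupied by at least one archived policy.

    descriptors: (N, d) behavior descriptors of the final archive.
    lower, upper: per-dimension bounds of the behavior space.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Map each descriptor to its cell index along every dimension.
    cells = np.floor((descriptors - lower) / (upper - lower) * cells_per_dim).astype(int)
    cells = np.clip(cells, 0, cells_per_dim - 1)
    occupied = {tuple(c) for c in cells}
    return len(occupied) / cells_per_dim ** descriptors.shape[1]
```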
Figure 6 shows the average maximum reward achieved by the algorithms in the reward areas of all environments. Emitters solely focused on exploiting the reward allow SERENE to reach almost the maximum reward on the easiest-to-reach reward areas in less than evaluations. High rewards are also achieved on the harder-to-reach areas, even if the required time is higher. On the contrary, ME improves on the reward at a much slower pace. This is likely due to the random selection of policies from the archive to generate new policies. In a sparse reward environment, in fact, the probability of selecting a rewarding policy is proportional to the ratio between the rewarding and non-rewarding areas. The sparser the reward, i.e. the smaller the reward area, the lower the probability of selecting a rewarding policy from the archive, and the slower the exploitation. A similar trend is exhibited by CMA-ME: even if able to reach high rewards on the discovered reward areas, it is slow in its optimization. At the same time, NS also reached high rewards on almost all environments, but without any explicit reward optimization it did not exploit the reward areas to the maximum. The multi-objective approach of NSGA-II can always find at least one of the multiple reward areas, but then tends to focus extensively on it, instead of also exploring other areas. For this reason, only the easiest reward area is exploited to high values in all environments, while the harder reward area is seldom exploited.

In this work we introduced SERENE, a method that efficiently deals with sparse reward environments by augmenting NS with emitters. Contrary to similar methods using emitters, SERENE keeps exploration and exploitation of the reward as two distinct processes. Exploration is carried out by taking advantage of NS to discover all the reachable reward areas. These areas are then exploited by using local instances of population-based optimization algorithms called emitters. By using a meta-scheduler, SERENE can automatically assign the evaluation budget to either exploration or exploitation. This is advantageous also in situations in which no reward is present: in the absence of a reward to exploit, SERENE performs exactly like NS.

SERENE has been tested on four different sparse reward environments, reaching high performance on all of them. Notwithstanding these encouraging results, the method still suffers from the same limitations as other QD methods, first and foremost the prior hand-design of the behavior space B. In the future we will work on addressing this limitation by learning a behavior descriptor that could foster exploration towards rewarding solutions. At the same time, it has been highlighted in [18] and [10] that many kinds of emitters can be used to address different kinds of problems. The evaluation of different types of emitters and their combination is also an exciting line of work to extend the current method.
Figure 5: Coverage in the four different environments with respect to the given evaluation budget.
Figure 6: Maximum reward reached in all the reward areas of each environment.

REFERENCES
[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Advances in Neural Information Processing Systems. 5048–5058.
[2] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. 2016. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems 29 (2016), 1471–1479.
[3] Hans-Georg Beyer and Hans-Paul Schwefel. 2002. Evolution strategies – A comprehensive introduction. Natural Computing 1, 1 (2002), 3–52.
[4] Sebastian Blaes, Marin Vlastelica, Jia-Jie Zhu, and Georg Martius. 2019. Control What You Can: Intrinsically Motivated Task-Planning Agent. In Advances in Neural Information Processing (NeurIPS'19). Curran Associates, Inc., 12520–12531.
[5] Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i-Nieto, and Jordi Torres. 2020. Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills. arXiv preprint arXiv:2002.03647 (2020).
[6] Geoffrey Cideron, Thomas Pierrot, Nicolas Perrin, Karim Beguir, and Olivier Sigaud. 2020. QD-RL: Efficient Mixing of Quality and Diversity in Reinforcement Learning. arXiv preprint arXiv:2006.08505 (2020).
[7] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. 2018. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054 (2018).
[8] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. 2018. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning. PMLR, 1039–1048.
[9] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth Stanley, and Jeff Clune. 2018. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems. 5027–5038.
[10] Antoine Cully. 2020. Multi-Emitter MAP-Elites: Improving quality, diversity and convergence speed with heterogeneous sets of emitters. arXiv preprint arXiv:2007.05352 (2020).
[11] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. 2015. Robots that can adapt like animals. Nature.
[12] IEEE Transactions on Evolutionary Computation 22, 2 (2017), 245–259.
[13] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 2 (2002), 182–197.
[14] Thang Doan, Bogdan Mazoure, Moloud Abdar, Audrey Durand, Joelle Pineau, and R Devon Hjelm. 2019. Attraction-repulsion actor-critic for continuous control reinforcement learning. arXiv preprint arXiv:1909.07543 (2019).
[15] Stephane Doncieux, Alban Laflaquière, and Alexandre Coninx. 2019. Novelty search: a theoretical perspective. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 99–106.
[16] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2019. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995 (2019).
[17] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2018. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070 (2018).
[18] Matthew C Fontaine, Julian Togelius, Stefanos Nikolaidis, and Amy K Hoover. 2020. Covariance matrix adaptation for the rapid illumination of behavior space. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 94–102.
[19] Sébastien Forestier, Rémy Portelas, Yoan Mollard, and Pierre-Yves Oudeyer. 2017. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190 (2017).
[20] Jacqueline Gottlieb, Pierre-Yves Oudeyer, Manuel Lopes, and Adrien Baranes. 2013. Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends in Cognitive Sciences 17, 11 (2013), 585–593.
[21] Daniele Gravina, Antonios Liapis, and Georgios Yannakakis. 2016. Surprise search: Beyond objectives and novelty. In Proceedings of the Genetic and Evolutionary Computation Conference 2016. ACM, 677–684.
[22] Nikolaus Hansen. 2016. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772 (2016).
[23] Whiyoung Jung, Giseung Park, and Youngchul Sung. 2020. Population-guided parallel policy search for reinforcement learning. arXiv preprint arXiv:2001.02907 (2020).
[24] Shauharda Khadka and Kagan Tumer. 2018. Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems. 1188–1200.
[25] Joel Lehman and Kenneth O Stanley. 2008. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE. 329–336.
[26] Joel Lehman and Kenneth O Stanley. 2011. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation. ACM, 211–218.
[27] Pontus Loviken and Nikolas Hemion. 2017. Online-learning and planning in high dimensions with finite element goal babbling. In . IEEE, 247–254.
[28] Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909 (2015).
[29] Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. 2018. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems. 9191–9200.
[30] Giuseppe Paolo, Alban Laflaquiere, Alexandre Coninx, and Stephane Doncieux. 2019. Unsupervised Learning and Exploration of Reachable Outcome Space. algorithms 24 (2019), 25.
[31] Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, and Stephen Roberts. 2020. Effective diversity in population-based reinforcement learning. arXiv preprint arXiv:2002.00632 (2020).
[32] Aloïs Pourchot and Olivier Sigaud. 2018. CEM-RL: Combining evolutionary and gradient-based methods for policy search. arXiv preprint arXiv:1810.01222 (2018).
[33] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. 2016. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI.
[34] Neural Networks 113 (2019), 28–40.
[35] Christopher Stanton and Jeff Clune. 2016. Curiosity search: producing generalists by encouraging individuals to continually explore and acquire skills throughout their lifetime. PloS One 11, 9 (2016), e0162235.
[36] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[37] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. 2017. Advances in Neural Information Processing Systems. 2753–2762.
[38] Alexander Trott, Stephan Zheng, Caiming Xiong, and Richard Socher. 2019. Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards. In Advances in Neural Information Processing Systems. 10376–10386.
[39] Pradnya A Vikhar. 2016. Evolutionary algorithms: A critical review and its future prospects. In . IEEE, 261–265.
A FINAL ARCHIVE DISTRIBUTION

In Figure 7 we show the distribution of the behaviors of the policies in the final archive. Each point represents a different policy. In blue are the policies that do not get any reward, thus considered exploratory, while in orange are rewarding policies, considered exploitative. For SERENE, the exploratory policies are the ones in the novelty archive A_N, while the exploitative policies are the ones in the rewarding archive A_R. We can see that even if the coverage metric values for SERENE are lower with respect to ME, the search space is well covered. Moreover, the reward areas are densely explored.

B TERMINATION CRITERION
The termination condition used for our emitters is inspired by the stagnation criterion introduced in [22]. We track the history of the rewards obtained over the last + ·n/λ emitter generations, where n is the size of the parameter space Θ and λ is the emitter's population size. The emitter is terminated if either the maximum or the median of the last 20 rewards is not better than the maximum or the median of the first 20 rewards.
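A sketch of this check is given below; the length of the tracked history is left as a parameter, since the exact expression above depends on n and λ, and the function name is ours.

```python
import numpy as np

def should_terminate(best_reward_per_generation, history_length, window=20):
    """Stagnation check: terminate when either the maximum or the median of the
    last `window` rewards fails to improve over the first `window` rewards of
    the tracked history."""
    history = best_reward_per_generation[-history_length:]
    if len(history) < 2 * window:
        return False  # not enough generations tracked yet
    first = np.asarray(history[:window])
    last = np.asarray(history[-window:])
    return last.max() <= first.max() or np.median(last) <= np.median(first)
```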
Figure 7: Distribution of the behavior descriptors of the archived policies. Each column shows the results for an environment, while each row shows the distribution for one experiment. The archives plotted are from the runs achieving the highest coverage. In blue are the policies with no reward, in orange the policies with a reward. For SERENE, the policies in blue are those in the novelty archive and the policies in orange are those in the reward archive.