Teacher algorithms for curriculum learning of Deep RL in continuously parameterized environments
Rémy Portelas, Cédric Colas, Katja Hofmann, Pierre-Yves Oudeyer
Rémy Portelas
Inria (FR)
Cédric Colas
Inria (FR)
Katja Hofmann
Microsoft Research (UK)
Pierre-Yves Oudeyer
Inria (FR)
Abstract:
We consider the problem of how a teacher algorithm can enable an unknown Deep Reinforcement Learning (DRL) student to become good at a skill over a wide range of diverse environments. To do so, we study how a teacher algorithm can learn to generate a learning curriculum, whereby it sequentially samples parameters controlling a stochastic procedural generation of environments. Because it does not initially know the capacities of its student, a key challenge for the teacher is to discover which environments are easy, difficult or unlearnable, and in what order to propose them to maximize the efficiency of learning over the learnable ones. To achieve this, this problem is transformed into a surrogate continuous bandit problem where the teacher samples environments in order to maximize absolute learning progress of its student. We present a new algorithm modeling absolute learning progress with Gaussian mixture models (ALP-GMM). We also adapt existing algorithms and provide a complete study in the context of DRL. Using parameterized variants of the BipedalWalker environment, we study their efficiency to personalize a learning curriculum for different learners (embodiments), their robustness to the ratio of learnable/unlearnable environments, and their scalability to non-linear and high-dimensional parameter spaces. Videos and code are available at https://github.com/flowersteam/teachDeepRL.

Keywords: Deep Reinforcement Learning, Teacher-Student Learning, Curriculum Learning, Learning Progress, Curiosity, Parameterized Procedural Environments
We address the strategic student problem. This problem is well known in the developmental robotics community [1], and formalizes a setting where an agent has to sequentially select tasks to train on to maximize its average competence over the whole set of tasks after a given number of interactions. To address this problem, several works [2, 3, 4] proposed to use automated Curriculum Learning (CL) strategies based on Learning Progress (LP) [5], and showed that population-based algorithms can benefit from such techniques. Inspired by these initial results, similar approaches [6] were then successfully applied to DRL agents in continuous control scenarios with discrete sets of goals, here defined as tasks varying only by their reward functions (e.g. reaching various target positions in a maze). Promising results were also observed when learning to navigate in discrete sets of environments, defined as tasks differing by their state space (e.g. escaping from a set of mazes) [7, 8].

In this paper, we study for the first time whether LP-based curriculum learning methods are able to scaffold generalist DRL agents in continuously parameterized environments. We compare the reuse of Robust Intelligent Adaptive Curiosity (RIAC) [9] in this new context to Absolute Learning Progress - Gaussian Mixture Model (ALP-GMM), a new GMM-based approach inspired by earlier work on developmental robotics [4] that is well suited for DRL agents. Both these methods rely on Absolute Learning Progress (ALP) as a surrogate objective to optimize, with the aim of maximizing average competence over a given parameter space. Importantly, our approaches do not assume a direct mapping from parameters to environments, meaning that a given parameter vector encodes a distribution of environments with similar properties, which is closer to real-world scenarios where stochasticity is an issue.

Recent work [10] already showed impressive results in continuously parameterized environments. The POET approach proved itself capable of generating and mastering a large set of diverse BipedalWalker environments. However, their work differs from ours as they evolve a population of agents where each individual agent is specialized for a single specific deterministic environment, whereas we seek to scaffold the learning of a single generalist agent in a training regime where it never sees the same exact environment twice.

As our approaches make few assumptions, they can deal with ill-defined parameter spaces that include unfeasible subspaces and irrelevant parameter dimensions. This makes them particularly well suited to complex continuous parameter spaces in which expert knowledge is difficult to acquire. We formulate the Continuous Teacher-Student (CTS) framework to cover this scope of challenges, opening the range of potential applications.

Main contributions:
• A Continuous Teacher-Student setup enabling to frame Teacher-Student interactions for ill-defined continuous parameter spaces encoding distributions of tasks. See Sec. 3.
• Design of two parameterized BipedalWalker environments, well suited to benchmark CL approaches on continuous parameter spaces encoding distributions of environments with procedural generation. See Sec. 4.3.
• ALP-GMM, a CL approach based on Gaussian Mixture Models and absolute LP that is well suited for DRL agents learning continuously parameterized tasks. See Sec. 4.1.
• First study of ALP-based teacher algorithms leveraged to scaffold the learning of generalist DRL agents in continuously parameterized environments. See Sec. 5.
Curriculum learning, as formulated in the supervised machine learning community, initially refers to techniques aimed at organizing labeled data to optimize the training of neural networks [11, 12, 13]. Concurrently, the RL community has been experimenting with transfer learning methods, providing ways to improve an agent on a target task by pre-training on an easier source task [14]. These two lines of work were combined and gave birth to curriculum learning for RL, that is, methods organizing the order in which tasks are presented to a learning agent so as to maximize its performance on one or several target tasks.

Learning progress has often been used as an intrinsically motivated objective to automate curriculum learning in developmental robotics [5], leading to successful applications in population-based robotic control in simulated [4, 15] and real-world environments [2, 3]. LP was also used to accelerate the training of LSTMs and neural Turing machines [16], and to personalize sequences of exercises for children in educational technologies [17].

A similar Teacher-Student framework was proposed in [7], which compared teacher approaches on a set of navigation tasks in Minecraft [18]. While their work focuses on discrete sets of environments, we tackle the broader challenge of dealing with continuous parameter spaces that map to distributions over environments, and in which large parts of the parameter space may be unlearnable.

Another form of CL has already been studied for continuous sets of tasks [19]; however, they considered goals (varying by their reward function), whereas we tackle the more complex setting of learning to behave in a continuous set of environments (varying by their state space). The GOAL-GAN algorithm also requires setting a reward range of "intermediate difficulty" to be able to label each goal in order to train the GAN, which is highly dependent on both the learner's skills and the considered continuous set of goals. Besides, as the notion of intermediate difficulty provides no guarantee of progress, this approach is susceptible to focusing on unlearnable goals for which the learner's competence stagnates in the intermediate difficulty range.
In this section, we formalize our Continuous Teacher-Student framework. It is inspired from earlier work in developmental robotics [9, 1] and intelligent tutoring systems [17]. The CTS framework is also close to earlier work on Teacher-Student approaches for discrete sets of tasks [7]. In CTS however, teachers sample parameters mapping to distributions of tasks from a continuous parameter space. In the remainder of this paper, we will refer to parameter sampling and task distribution sampling interchangeably, as one parameter directly maps to a task distribution.
Student
In CTS, learning agents, called students, are confronted with episodic Partially Observable Markov Decision Process (POMDP) tasks $\tau$. For each interaction step in $\tau$, a student collects an observation $o \in O_\tau$, performs an action $a \in A_\tau$, and receives a corresponding reward $r \in R_\tau$. Upon task termination, an episodic reward $r_e = \sum_{t=0}^{T} r(t)$ is computed, with $T$ the length of the episode.

Teacher
The teacher interacts with its student by selecting a new parameter $p \in \mathcal{P}$, mapped to a task distribution $T(p) \in \mathcal{T}$, proposing $m$ tasks $\tau \sim T(p)$ to its student and observing $r_p$, the average of the $m$ episodic rewards $r_e$. The new parameter-reward tuple is then added to a history database of interactions $\mathcal{H}$ that the teacher leverages to influence the parameter selection in order to maximize the student's final competence return $c_p = f(r_p)$ across the parameter space. Formally, the objective is

$$\max \int_{\mathcal{P}} w_p \cdot c_p^{t=K} \, dp, \qquad (1)$$

with $K$ the predefined maximal number of teacher-student interactions and $w_p$ a factor weighting the relative importance of each task distribution in the optimization process, enabling to specify whether to focus on specific subregions of the parameter space (i.e. harder target tasks). As students are considered as black-box learners, the teacher solely relies on its history database $\mathcal{H}$ for parameter sampling and does not have access to information about its student's internal state, algorithm, or perceptual and motor capacities.

Parameter space assumptions
The teacher does not know the evolution of difficulty across the parameter space and therefore assumes a non-linear, piecewise-smooth function. The parameter space may also be ill-defined. For example, there might be subregions $U \subset \mathcal{P}$ of the parameter space in which competence improvement on any parameter $u \in U$ is not possible given the state transition functions $F_\tau$ of tasks sampled in any $T(u) \in \mathcal{T}$ (i.e. tasks are either trivial or unfeasible). Additionally, given a parameter space $\mathcal{P} \subseteq \mathbb{R}^d$, there might exist an equivalent parameter space $\mathcal{P}' \subseteq \mathbb{R}^k$ with $k < d$, constructed with a subset of the $d$ dimensions of $\mathcal{P}$, meaning that there might be irrelevant or redundant dimensions in $\mathcal{P}$.

In the following sections, we restrict our study to CTS setups in which teachers sample only one task per selected parameter vector (i.e. $m = 1$ and $r_p = r_e$) and do not prioritize the learning of specific subspaces (i.e. $w_p = 1, \forall p \in \mathcal{P}$).
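To make this interaction protocol concrete, here is a minimal Python sketch of the CTS loop under the simplifications above ($m = 1$, uniform $w_p$). The class and function names (RandomTeacher, make_env, train_episode) are illustrative placeholders rather than parts of the released codebase; LP-based teachers simply override sample() and update().

```python
import numpy as np

class RandomTeacher:
    """Minimal CTS teacher: samples parameters uniformly and records history.
    LP-based teachers (RIAC, ALP-GMM) replace sample() and update()."""
    def __init__(self, mins, maxs):
        self.mins, self.maxs = np.array(mins, float), np.array(maxs, float)
        self.history = []  # history database H of (parameter, episodic reward) tuples

    def sample(self):
        # Select a new parameter p in the bounded parameter space P
        return np.random.uniform(self.mins, self.maxs)

    def update(self, p, episodic_reward):
        # Store the parameter-reward tuple in the history database H
        self.history.append((p, episodic_reward))

def cts_loop(teacher, make_env, train_episode, n_interactions):
    """One teacher-student run: K teacher-student interactions, one task per parameter."""
    for _ in range(n_interactions):
        p = teacher.sample()              # parameter p mapped to task distribution T(p)
        env = make_env(p)                 # procedural generation of one task tau ~ T(p)
        r_p = train_episode(env)          # student trains for one episode, returns r_e
        teacher.update(p, r_p)            # teacher leverages H to bias future sampling
```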
In this section, we describe our absolute LP-based teacher algorithms and our reference teachers, and present the continuously parameterized BipedalWalker environments used to evaluate them.

Of central importance to this paper is the concept of learning progress, formulated as a theoretical hypothesis to account for intrinsically motivated learning in humans [20], and applied for efficient robot learning [5, 2]. Inspired by some of this work [3, 21], we frame our two teacher approaches as a Multi-Armed Bandit setup in which arms are dynamically mapped to subspaces of the parameter space, and whose values are defined by an absolute average LP utility function. The objective is then to select subspaces on which to sample a distribution of tasks in order to maximize ALP. ALP gives a richer signal than (positive) LP as it enables the teacher to detect when a student is losing competence on a previously mastered parameter subspace (thus preventing catastrophic forgetting).
Robust Intelligent Adaptive Curiosity (RIAC)
RIAC [9] is a task sampling approach whose core idea is to split a given parameter space into hyperboxes (called regions) according to their absolute LP, defined as the difference of cumulative episodic reward between the newest and oldest tasks sampled in the region. Tasks are then sampled within regions selected proportionally to their ALP score. This approach can easily be translated to the problem of sampling distributions of tasks, as is the case in this work. To avoid a known tendency of RIAC to oversplit the space [19], we added a few minor modifications to the original architecture to constrain the splitting process. Details can be found in Appendix B.
Absolute Learning Progress Gaussian Mixture Model (ALP-GMM)
Another, more principled way of sampling tasks according to LP measures is to rely on the well-known Gaussian Mixture Model [22] and Expectation-Maximization [23] algorithms. This concept has already been successfully applied in the cognitive science field as a way to model intrinsic motivation in early vocal development of infants [4]. In addition to testing their approach (referred to as Covar-GMM) for the first time on DRL students, we propose a variant based on an ALP measure capturing long-term progress variations that is well suited for RL setups. See Appendix B for a description of their method.

The key concept of ALP-GMM is to fit a GMM on a dataset of previously sampled parameters concatenated to their respective ALP measure. Then, the Gaussian from which to sample a new parameter is chosen using an EXP4 bandit scheme [24] where each Gaussian is viewed as an arm, and ALP is its utility. This enables the teacher to bias the parameter sampling towards high-ALP subspaces. To get this per-parameter ALP value, we take inspiration from earlier work on developmental robotics [21]: for each newly sampled parameter $p_{new}$ and associated episodic reward $r_{new}$, the closest (Euclidean distance) previously sampled parameter $p_{old}$ (with associated episodic reward $r_{old}$) is retrieved using a nearest neighbor algorithm (implemented with a KD-Tree [25]). We then have

$$alp_{new} = |r_{new} - r_{old}| \qquad (2)$$

The GMM is fit periodically on a window $\mathcal{W}$ containing only the most recent parameter-ALP pairs, to bound its time complexity and make it more sensitive to recent high-ALP subspaces. The number of Gaussians is adapted online by fitting multiple GMMs (having from 2 to $k_{max}$ Gaussians) and keeping the best one based on Akaike's Information Criterion [26]. Note that the nearest neighbor computation of per-parameter ALP uses a database that contains all previously sampled parameters and associated episodic rewards, which prevents any forgetting of long-term progress. In addition to its main task sampling strategy, ALP-GMM also samples random parameters with probability $p_{rnd}$ to enable exploration. See Algorithm 1 for pseudo-code and Appendix B for a schematic view of ALP-GMM.

Algorithm 1
Absolute Learning Progress Gaussian Mixture Model (ALP-GMM)
Require:
Student $S$, parametric procedural environment generator $E$, bounded parameter space $\mathcal{P}$, probability of random sampling $p_{rnd}$, fitting rate $N$, max number of Gaussians $k_{max}$
Initialize parameter-ALP First-in-First-Out window $\mathcal{W}$, set max size to $N$
Initialize parameter-reward history database $\mathcal{H}$
loop $N$ times  ▷ Bootstrap phase
    Sample random $p \in \mathcal{P}$, send $E(\tau \sim T(p))$ to $S$, observe episodic reward $r_p$
    Compute ALP of $p$ based on $r_p$ and $\mathcal{H}$ (see Equation 2)
    Store $(p, r_p)$ pair in $\mathcal{H}$, store $(p, ALP_p)$ pair in $\mathcal{W}$
loop  ▷ Stop after $K$ inner loops
    Fit a set of GMMs having 2 to $k_{max}$ kernels on $\mathcal{W}$
    Select the GMM with best Akaike Information Criterion
    loop $N$ times
        $p_{rnd}$% of the time, sample a random parameter $p \in \mathcal{P}$
        Else, sample $p$ from a Gaussian chosen proportionally to its mean ALP value
        Send $E(\tau \sim T(p))$ to student $S$ and observe episodic reward $r_p$
        Compute ALP of $p$ based on $r_p$ and $\mathcal{H}$
        Store $(p, r_p)$ pair in $\mathcal{H}$, store $(p, ALP_p)$ pair in $\mathcal{W}$
Return $S$
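As a rough illustration of Algorithm 1, the sketch below implements its two core ingredients with SciPy and scikit-learn: the nearest-neighbor ALP computation of Equation (2) and the periodic GMM fit with AIC-based model selection. It replaces the EXP4 bandit scheme with simple sampling proportional to mean ALP, and the hyperparameter values (p_rnd, fit_rate, k_max) are illustrative assumptions, not the settings of the released implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.mixture import GaussianMixture

class ALPGMMSketch:
    def __init__(self, mins, maxs, p_rnd=0.2, fit_rate=250, k_max=10):
        # p_rnd, fit_rate and k_max are illustrative values, not the paper's settings
        self.mins, self.maxs = np.array(mins, float), np.array(maxs, float)
        self.p_rnd, self.fit_rate, self.k_max = p_rnd, fit_rate, k_max
        self.history_p, self.history_r = [], []   # full history H (never forgotten)
        self.window = []                           # FIFO window W of [p, ALP] rows
        self.gmm, self.steps = None, 0

    def _alp(self, p, r):
        # Equation (2): ALP = |r_new - r_old|, with r_old taken from the closest
        # previously sampled parameter (nearest neighbor over the full history H)
        if not self.history_p:
            return 0.0
        _, idx = cKDTree(np.array(self.history_p)).query(np.asarray(p))
        return abs(r - self.history_r[idx])

    def update(self, p, r):
        alp = self._alp(p, r)
        self.history_p.append(np.asarray(p, float))
        self.history_r.append(r)
        self.window.append(np.concatenate([np.asarray(p, float), [alp]]))
        self.window = self.window[-self.fit_rate:]    # bound the window size
        self.steps += 1
        if self.steps % self.fit_rate == 0:           # periodic refit after bootstrap
            X = np.array(self.window)
            candidates = [GaussianMixture(k).fit(X) for k in range(2, self.k_max + 1)]
            self.gmm = min(candidates, key=lambda g: g.aic(X))  # AIC model selection

    def sample(self):
        if self.gmm is None or np.random.rand() < self.p_rnd:
            return np.random.uniform(self.mins, self.maxs)      # random exploration
        # Choose a Gaussian proportionally to its mean ALP (last coordinate of the means)
        alp = np.maximum(self.gmm.means_[:, -1], 0.0) + 1e-6
        k = np.random.choice(len(alp), p=alp / alp.sum())
        p = np.random.multivariate_normal(self.gmm.means_[k][:-1],
                                          self.gmm.covariances_[k][:-1, :-1])
        return np.clip(p, self.mins, self.maxs)
```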
Random

In Random, parameters are sampled randomly in the parameter space for each new episode. Although simplistic, similar approaches in previous work [3] proved to be competitive against more elaborate forms of CL.

Oracle

A hand-constructed approach, sampling random task distributions in a fixed-size sliding window on the parameter space. This window is initially set to the easiest area of the parameter space and is then slowly moved towards complex ones, with difficulty increments only happening if a minimum average performance is reached. Expert knowledge is used to find the dimensions of the window, the amplitude and direction of increments, and the average performance threshold. Pseudo-code is available in Appendix B.
The BipedalWalker environment [27] offers a convenient test-bed for continuous control, allowing to easily build parametric variations of the original version [28, 10]. The learning agent, embodied in a bipedal walker, receives positive rewards for moving forward and penalties for torque usage and angular head movements. Agents are allowed a fixed number of steps to reach the other side of the map. Episodes are aborted with a negative reward penalty if the walker's head touches an obstacle.

To study the ability of our teachers to guide DRL students, we design two continuously parameterized BipedalWalker environments enabling the procedural generation of walking tracks:

• Stump Tracks: a 2D parametric environment producing tracks paved with stumps varying by their height and spacing. Given a parameter vector $[\mu_h, \Delta_s]$, a track is constructed by generating stumps spaced by $\Delta_s$ and whose heights are defined by independent samples from a normal distribution centered on $\mu_h$ (a minimal generation sketch is given after Figure 1).

• Hexagon Tracks:
A more challenging 12D parametric BipedalWalker environment. Given offset values $\mu_o$, each track is constructed by generating hexagons having their default vertices' positions perturbed by strictly positive independent samples from normal distributions centered on these offsets. The remaining parameters are distractors defining the color of each hexagon. This environment is challenging as there are no subspaces generating trivial tracks with 0-height obstacles (as offsets to the default hexagon shape are positive). This parameter space also has non-linear difficulty gradients, as each vertex has a different impact on difficulty when modified.

All of the experiments done in these environments were performed using OpenAI's implementation of Soft Actor-Critic [29] as the single student algorithm. To test our teachers' robustness to students with varying abilities, we use different walker morphologies (see Figure 1). Additional details on these two environments, along with track examples, are available in Appendix E.

(a) Stump Tracks walkers. (b) Hexagon Tracks walker.

Figure 1:
Multiple students and environments to benchmark teachers. (a):
In addition to the default bipedal walker morphology (middle agent), we designed a bipedal walker with 50% shorter legs (left) and a bigger quadrupedal walker (right). (b):
The quadrupedal walker is also used in Hexagon Tracks.
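As referenced in the Stump Tracks description above, the following sketch illustrates how one track could be procedurally generated from a parameter vector [µ_h, ∆_s]. The number of stumps, the noise scale and the function name are illustrative assumptions rather than the exact environment code.

```python
import numpy as np

def generate_stump_track(mu_h, delta_s, n_stumps=20, height_noise=0.1, seed=None):
    """Return (positions, heights) of the stumps for one track sampled from T([mu_h, delta_s])."""
    rng = np.random.default_rng(seed)
    positions = np.arange(n_stumps) * delta_s              # stumps spaced by delta_s
    # stump heights drawn independently around mu_h (clipped to stay non-negative)
    heights = np.clip(rng.normal(mu_h, height_noise, size=n_stumps), 0.0, None)
    return positions, heights

# Example: one track drawn from the distribution encoded by a teacher-selected parameter
positions, heights = generate_stump_track(mu_h=1.5, delta_s=3.0, seed=0)
```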
Performance metric
To assess the performance of all of our approaches on our BipedalWalker environments, we define a binary competence return measure stating whether a given track distribution is mastered or not, depending on the student's episodic reward $r_p$. We set the reward threshold to the value used in [10] to ensure "reasonably efficient" walking gaits for default bipedal walkers trained on environments similar to ours. Note that this reward threshold is only used for evaluation purposes and in the Oracle condition. Performance is then evaluated periodically by sampling a single track in each track distribution of a fixed evaluation set of distributions sampled uniformly in the parameter space. We then simply measure the percentage of mastered tracks. During evaluation, learning in DRL agents is turned off.
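A minimal sketch of this evaluation procedure, assuming a fixed list of test parameters and a hypothetical evaluate_episode function that runs one episode with learning turned off and returns the episodic reward:

```python
import numpy as np

def percent_mastered(test_parameters, evaluate_episode, reward_threshold):
    """Binary competence: a test track distribution is mastered if r_p > reward_threshold."""
    mastered = [evaluate_episode(p) > reward_threshold for p in test_parameters]
    return 100.0 * np.mean(mastered)
```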
Through our experiments we answer three questions about ALP-GMM, Covar-GMM and RIAC:

• Are ALP-GMM, Covar-GMM and RIAC able to optimize their students' performance better than random approaches and teachers exploiting environment knowledge?
• How does their performance scale when the proportion of unfeasible tasks increases?
• Are they able to scale to high-dimensional sampling spaces with irrelevant dimensions?

Figure 2 provides a visualization of the sampling trajectory observed in a representative ALP-GMM run for a default walker. Each plot shows the location of each Gaussian of the current mixture along with the track distributions subsequently sampled. At first (a), the walker does not manage to make any significant progress. After 1500 episodes (b), the student starts making progress on the leftmost part of the parameter space, especially for track distributions with a large spacing, which leads ALP-GMM to focus its sampling in that direction. After 15k episodes (c), ALP-GMM has shifted its sampling strategy to more complex regions. The analysis of a typical RIAC run is detailed in Appendix C (Fig. 8).

(a) After 500 episodes. (b) 1500 eps. (c) 15000 eps. (d) Mastered tracks (20M steps).

Figure 2:
Example of an ALP-GMM teacher paired with a Soft Actor-Critic student on Stump Tracks.
Figures (a)-(c) show the evolution of ALP-GMM parameter sampling in a representative run. Each dot represents a sampled track distribution and is colored according to its Absolute Learning Progress value. After initial progress on the leftmost part of the space, as in (b), most ALP-GMM runs end up improving on track distributions with substantial stump heights, with the highest ones usually paired with either large or very low spacing, indicating that tracks with large or very low spacing are easier than those with intermediate spacing. Figure (d) shows, for the same run, which track distributions of the test set are mastered (i.e. $r_t$ above the reward threshold, shown by green dots) at the end of training.
Figure 3 shows learning curves for each condition paired with short, default and quadrupedal walkers. First of all, for short agents (a), one can see that Oracle is the best performing algorithm. This is an expected result, as Oracle knows where to sample simple track distributions, which is crucial when most of the parameter space is unfeasible, as is the case with short agents. ALP-GMM is the LP-based teacher with the highest final mean performance, ahead of Covar-GMM and RIAC. This performance advantage for ALP-GMM is statistically significant when compared to RIAC (Welch's t-test at 20M steps), although there is no statistically significant difference with Covar-GMM. All LP-based teachers are significantly superior to Random.

Regarding default bipedal walkers (b), our hand-made curriculum (Oracle) performs better than other approaches early in training and then rapidly decreases, ending up with a performance comparable to RIAC and Covar-GMM. All LP-based conditions end up with a final mean performance statistically superior to Random. ALP-GMM is the highest performing algorithm, significantly superior to Oracle, RIAC and Covar-GMM.

For quadrupedal walkers (c), Random, ALP-GMM, Covar-GMM and RIAC agents quickly learn to master nearly the whole test set, without significant differences apart from Covar-GMM being superior to RIAC. This indicates that, for this agent type, the parameter space of Stump Tracks is simple enough that trying random tracks for each new episode is a sufficient curriculum learning strategy. Oracle teachers perform significantly worse than any other method.

(a) Short agents. (b) Default agents. (c) Quadrupedal agents.
Figure 3:
Evolution of mastered track distributions for Teacher-Student approaches in Stump Tracks.
The mean performance (32 seeded runs) is plotted with shaded areas representing the standard error of the mean.
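The significance statements reported in this section rely on Welch's t-test on per-seed final performances (32 seeded runs per condition). A minimal SciPy sketch, where the array names are illustrative:

```python
from scipy import stats

# final_perf_a, final_perf_b: per-seed percentages of mastered test tracks at 20M steps
def welch_test(final_perf_a, final_perf_b):
    # equal_var=False selects Welch's t-test (unequal variances) rather than Student's t-test
    t_stat, p_value = stats.ttest_ind(final_perf_a, final_perf_b, equal_var=False)
    return t_stat, p_value
```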
Through this analysis we answered our first experimental question by showing how ALP-GMM, Covar-GMM and RIAC, without strong assumptions on the environment, managed to scaffold the learning of multiple students better than Random. Interestingly, ALP-GMM outperformed Oracle with default agents, and RIAC, Covar-GMM and ALP-GMM surpassed Oracle with the quadrupedal agent, despite its advantageous use of domain knowledge. This indicates that training only on track distributions sampled from a sliding window that ends up on the most difficult parameter subspace leads to forgetting of simpler task distributions. Our approaches avoid this issue through efficient tracking of their students' learning progress.
A crucial requirement when designing all-purpose teacher algorithms is to ensure their ability to deal with parameter spaces that are ill-defined w.r.t. the considered student. To study this property we performed additional experiments on Stump Tracks where we gradually increased the stump height dimension range, which increases the amount of unfeasible tracks.

Results are summarized in Table 1, which reports the final mean performance (with standard deviation) of ALP-GMM, Covar-GMM, RIAC and Random on the original parameter space and on two variations with a wider stump height range. To assess whether a condition is robust to increasing unfeasibility, one can look at the p-value of the Welch's t-test performed on the final performance measure between the condition run on the original parameter space and the same condition run on a wider space. A high p-value indicates that there is not enough evidence to reject the null hypothesis of no difference, which can be interpreted as being robust to parameter spaces containing more unfeasible tasks. Using this metric, ALP-GMM is clearly the most robust condition among the presented LP-based teachers when moving to the first wider stump height range, and it is the only LP-based teacher able to maintain most of its performance on the widest range. Although Random also seems to show robustness to increasingly unfeasible parameter spaces, this is most likely due to its stagnation at low performance. Compared to all other approaches, ALP-GMM remains the highest performing condition in both parameter space variations.

Table 1: Impact of increasing the proportion of unfeasible tasks.
The average performance with standard deviation (after 20 Million steps) on the original Stump Tracks test set is reported (32 seeds per condition). Additional p-values inform whether conditions run in the original Stump Tracks are significantly better than when run on variations with higher maximal stump height.

To assess whether ALP-GMM, Covar-GMM and RIAC are able to scale to parameter spaces of higher dimensionality containing irrelevant dimensions, and whose difficulty gradients are non-linear, we performed experiments with quadrupedal walkers on Hexagon Tracks, our 12-dimensional parametric BipedalWalker environment. Results are shown in Figure 4. In the first millions of steps, one can see that Oracle has a large performance advantage compared to LP-based teachers, which is mainly due to its knowledge of initial progress niches. However, by the end of training, ALP-GMM significantly outperforms Oracle, reaching a substantially higher average final performance. Compared to Covar-GMM and RIAC, the final performance of ALP-GMM is also significantly superior, while being more robust and having less variance (see Appendix D). All LP-based approaches are significantly better than Random.
Figure 4:
Teacher-Student approaches in Hexagon Tracks. Left:
Evolution of mastered tracks for Teacher-Student approaches in Hexagon Tracks. 32 seeded runs (25 for Random) of 80 million steps were performed for each condition. The mean performance is plotted with shaded areas representing the standard error of the mean.
Right:
A visualization of which track distributions of the test set are mastered (i.e. $r_t$ above the reward threshold, shown by green dots) by an ALP-GMM run at the end of training.

Experiments on Hexagon Tracks showed that ALP-GMM is the most suitable condition for complex high-dimensional environments containing irrelevant dimensions, non-linear parameter spaces and large proportions of initially unfeasible tasks.
Complementary experiments
To better grasp the general properties of our teacher algorithms,additional abstract experiments without DRL students were also performed for parameter spaces withincreasing number of dimensions (relevant and irrelevant) and increasing ratio of initially unfeasiblesubspaces, showing that GMM-based approaches performed best (see Appendix A).
This work demonstrated that LP-based teacher algorithms can successfully guide DRL agents to learn in difficult continuously parameterized environments with irrelevant dimensions and large proportions of unfeasible tasks. With no prior knowledge of its student's abilities and only loose boundaries on the task space, ALP-GMM, our proposed teacher, consistently outperformed random heuristics and occasionally even expert-designed curricula.

ALP-GMM, which is conceptually simple and has very few crucial hyperparameters, opens up exciting perspectives inside and outside DRL for curriculum learning problems. Within DRL, it could be applied to previous work on autonomous goal exploration through incremental building of goal spaces [30]. In this case several ALP-GMM instances could scaffold the learning agent in each of its autonomously discovered goal spaces. Another domain of applicability is assisted education, for which the current state of the art relies heavily on expert knowledge [17] and is mostly applied to discrete task sets.
References

[1] M. Lopes and P.-Y. Oudeyer. The Strategic Student Approach for Life-Long Exploration and Learning. In IEEE Conference on Development and Learning / EpiRob 2012, San Diego, United States, Nov. 2012. doi:10.1109/DevLrn.2012.6400807.
[2] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007. doi:10.1109/TEVC.2006.890271.
[3] A. Baranes and P.-Y. Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013. doi:10.1016/j.robot.2012.05.008.
[4] C. Moulin-Frier, S. M. Nguyen, and P.-Y. Oudeyer. Self-organization of early vocal development in infants and machines: The role of intrinsic motivation. Frontiers in Psychology (Cognitive Science), 4(1006), 2014. ISSN 1664-1078. doi:10.3389/fpsyg.2013.01006.
[5] D. Blank, D. Kumar, L. Meeden, and J. Marshall. Bringing up robot: Fundamental mechanisms for creating a self-motivated, self-organizing architecture. Cybernetics & Systems, 12 2003. doi:10.1080/01969720590897107.
[6] C. Colas, P. Oudeyer, O. Sigaud, P. Fournier, and M. Chetouani. CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 1331–1340, 2019.
[7] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman. Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 2019. doi:10.1109/TNNLS.2019.2934906.
[8] S. Mysore et al. Reward-guided curriculum for robust reinforcement learning. 2018.
[9] A. Baranes and P. Oudeyer. R-IAC: Robust intrinsically motivated exploration and active learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, 2009. doi:10.1109/TAMD.2009.2037513.
[10] R. Wang, J. Lehman, J. Clune, and K. O. Stanley. Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR, abs/1901.01753, 2019.
[11] J. L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993. ISSN 0010-0277. doi:10.1016/0010-0277(93)90058-4.
[12] K. A. Krueger and P. Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394, 2009. ISSN 0010-0277. doi:10.1016/j.cognition.2008.11.014.
[13] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 41–48, 2009. doi:10.1145/1553374.1553380.
[14] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633–1685, Dec. 2009. ISSN 1532-4435.
[15] S. Forestier and P. Oudeyer. Modular active curiosity-driven discovery of tool use. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3965–3972, Oct 2016. doi:10.1109/IROS.2016.7759584.
[16] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1311–1320. JMLR.org, 2017.
[17] B. Clément, D. Roy, P.-Y. Oudeyer, and M. Lopes. Multi-Armed Bandits for Intelligent Tutoring Systems. Journal of Educational Data Mining (JEDM), 7(2):20–48, June 2015.
[18] M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The Malmo platform for artificial intelligence experimentation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 4246–4247, 2016.
[19] C. Florensa, D. Held, X. Geng, and P. Abbeel. Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1514–1523, 2018.
[20] F. Kaplan and P.-Y. Oudeyer. In search of the neural circuits of intrinsic motivation. Frontiers in Neuroscience, 1:17, 2007. doi:10.3389/neuro.01.1.1.017.2007.
[21] S. Forestier, Y. Mollard, and P. Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. CoRR, abs/1708.02190, 2017.
[22] C. E. Rasmussen. The infinite Gaussian mixture model. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 554–560. MIT Press, 2000.
[23] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[24] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[25] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, 1975. doi:10.1145/361002.361007.
[26] H. Bozdogan. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370, Sep 1987. ISSN 1860-0980. doi:10.1007/BF02294361.
[27] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
[28] D. Ha. Reinforcement learning for improving agent design. CoRR, abs/1810.03779, 2018.
[29] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018.
[30] A. Laversanne-Finot, A. Pere, and P.-Y. Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 487–504. PMLR, 29–31 Oct 2018.
Experiments on an n-dimensional toy parameter space
The n-dimensional toy space  An n-dimensional toy parameter space was implemented to simulate a student learning process, enabling the study of our teachers in a controlled deterministic environment without DRL agents. A parameter $p$ directly maps to an episodic reward $r_p$ depending on the history of previously sampled parameters. The parameter space is divided into hypercubes and enforces the following rules (a minimal sketch of these rules is given after this list):

• Sampling a parameter in an "unlocked" hypercube results in a positive reward that grows with the number of parameters already sampled in that hypercube, up to a maximal value. Sampling a parameter located in a "locked" hypercube does not yield any reward.
• At first, all hypercubes are "locked" except for one, located in a corner.
• Sampling parameters in an unlocked hypercube unlocks its neighboring hypercubes.
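A minimal sketch of these rules, assuming a parameter space normalized to [0, 1]^n; the grid size and the reward scaling below are illustrative values, not the ones used in the paper:

```python
import numpy as np

class ToyParameterSpace:
    """n-dimensional toy space divided into hypercubes; simulates a student's episodic reward."""
    def __init__(self, n_dims=2, cubes_per_dim=4, max_reward_count=100):
        self.n_dims, self.cubes_per_dim = n_dims, cubes_per_dim
        self.max_reward_count = max_reward_count        # samples needed to reach maximal reward
        self.counts = {}                                # number of samples collected per hypercube
        self.unlocked = {tuple([0] * n_dims)}           # single unlocked corner hypercube

    def _cube(self, p):
        # Map a parameter in [0, 1]^n to the index of its hypercube
        return tuple(np.minimum((np.asarray(p) * self.cubes_per_dim).astype(int),
                                self.cubes_per_dim - 1))

    def episode(self, p):
        cube = self._cube(p)
        if cube not in self.unlocked:
            return 0.0                                  # locked hypercube: no reward
        self.counts[cube] = self.counts.get(cube, 0) + 1
        # Sampling in an unlocked hypercube unlocks its neighbors along each dimension
        for d in range(self.n_dims):
            for step in (-1, 1):
                nb = list(cube)
                nb[d] += step
                if 0 <= nb[d] < self.cubes_per_dim:
                    self.unlocked.add(tuple(nb))
        # Reward grows with the number of parameters already sampled in the hypercube
        return min(self.counts[cube] / self.max_reward_count, 1.0)
```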
Results on n-dimensional toy spaces

Results are displayed in Figure 5. We use the median percentage of unlocked hypercubes as a performance metric. A first experiment was performed on a reference toy space. In this experiment one can see that all LP-based approaches outperform Random by a significant margin, and Covar-GMM is the highest performing algorithm. This reference toy space will be used as a point of comparison for our following analysis, for which all conditions were tested on a panel of toy spaces with varying numbers of meaningful dimensions (first row of Figure 5), irrelevant dimensions (second row) and numbers of hypercubes (third row).

By looking at the first row of Figure 5, one can see that increasing the number of dimensions has a greater negative impact on RIAC than on GMM-based approaches: RIAC, which was between ALP-GMM and Covar-GMM in terms of median performance in our reference experiment, is now clearly under-performing them on all toy spaces. In the higher-dimensional cases RIAC is even outperformed by the Random condition after enough episodes, whereas in the lower-dimensional case RIAC consistently outperforms Random. ALP-GMM and Covar-GMM both reach high median performance, with Covar-GMM the highest performing condition in each toy space, closely followed by ALP-GMM.

The second row of Figure 5 shows how the performance of our approaches varies when adding irrelevant dimensions to the reference toy space. As expected, Random is not affected by them. With additional useless dimensions, RIAC is consistently inferior to GMM-based conditions in terms of median performance, and is only above Random early in training. ALP-GMM is the highest performing algorithm throughout training for the toy spaces with fewer irrelevant dimensions, closely followed by Covar-GMM; with the largest number of irrelevant dimensions, ALP-GMM outperforms Covar-GMM early on but takes longer to reach the maximal median performance.

The last row shows how performance changes according to the number of hypercubes. Given our toy space rules, increasing the number of hypercubes reduces the initial area where reward is obtainable in the parameter space, and therefore allows us to study the sensitivity of our approaches in detecting learnable subspaces. Random struggles in all of these toy spaces compared to other conditions and to its performance in the reference experiment. Covar-GMM and RIAC are the best performing conditions for the smallest number of hypercubes per dimension. However, when increasing the number of hypercubes per dimension further, Covar-GMM remains the best performing condition but RIAC now under-performs compared to ALP-GMM.

Overall these experiments showed that GMM-based approaches scaled better than RIAC on parameter spaces with large numbers of relevant or irrelevant dimensions, and large proportions of (initially) unfeasible subspaces. Among these GMM-based approaches, and contrary to the experiments with DRL students on BipedalWalker environments, Covar-GMM proved to be better than ALP-GMM on these toy spaces.
(Each panel of Figure 5 plots the percentage of mastered hypercubes against the number of episodes, in thousands, for ALP-GMM, RIAC, Covar-GMM and Random.)
Figure 5:
Evolution of performance on n-dimensional toy-spaces.
The impact of 3 aspects of the parameter space is tested: growing number of meaningful dimensions (top row), growing number of irrelevant dimensions (middle row) and increasing number of hypercubes (bottom row). The median performance (percentage of unlocked hypercubes) is plotted with shaded curves representing the performance of each run. 20 repeats were performed for each condition (for each toy space).

Implementation details
Soft Actor-Critic
All of our experiments were performed with OpenAI's implementation of SAC (https://github.com/openai/spinningup) as our DRL student. We used the same 2-layered (400, 300) network architecture with ReLU for the Q, V and policy networks. The policy network's output uses tanh activations. The entropy coefficient and learning rate were kept fixed. Gradient steps are performed at a regular interval of environment steps, using samples from a replay buffer of fixed size (several million transitions).
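For reference, the stated layer sizes correspond roughly to the following PyTorch sketch. This is only the architectural skeleton (a full SAC policy also outputs a log-standard-deviation head and squashes sampled actions); it is not the Spinning Up implementation itself.

```python
import torch
import torch.nn as nn

class PolicyNetSketch(nn.Module):
    """2-layer (400, 300) MLP with ReLU hidden activations and tanh-squashed output."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
        )
        self.head = nn.Linear(300, act_dim)

    def forward(self, obs):
        # tanh keeps actions within the bounded action space
        return torch.tanh(self.head(self.body(obs)))
```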
RIAC

To avoid a known tendency of RIAC to oversplit the parameter space [19], we added a few modifications to the original architecture. The essence of our modified RIAC can be summarized as follows:

1. When collecting a new parameter-reward pair, it is added to its respective region. If this region reaches its maximal capacity $max_s$, a split attempt is performed.
2. When trying to split a parent region into two children regions, $n$ candidate splits on random dimensions and thresholds are generated. Splits resulting in one of the child regions, $c_1$ or $c_2$, having fewer than $min_s$ individuals are rejected. Likewise, to avoid having extremely small regions, a minimum size $min_d$ is enforced for each region's dimensions (set to a fixed fraction of the initial range of each dimension of the parameter space). The split with the highest score, defined as $card(c_1) \cdot card(c_2) \cdot |alp(c_1) - alp(c_2)|$, is kept. If no valid split was found, the region flushes its oldest points (the oldest quarter of pairs sampled in the region are removed).
3. At sampling time, several strategies are combined:
   • a random parameter is chosen in the entire space;
   • a region is selected proportionally to its ALP and a random parameter is sampled within the region;
   • a region is selected proportionally to its ALP and the worst parameter (lowest associated episodic reward) is slightly mutated by adding Gaussian noise.

We refer the reader to the original papers on RIAC [9, 3] for detailed motivations and pseudo-code descriptions.
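The split-scoring rule of step 2 can be sketched as follows; the region ALP estimator below is one possible reading of the "newest versus oldest" definition given earlier for RIAC, and the data structures are hypothetical:

```python
import numpy as np

def region_alp(rewards):
    # Region ALP: absolute difference of cumulative episodic reward between the
    # newest and oldest halves of the pairs collected in the region (one possible reading)
    half = len(rewards) // 2
    return abs(np.sum(rewards[half:]) - np.sum(rewards[:half]))

def split_score(child_1_rewards, child_2_rewards):
    # score = card(c1) * card(c2) * |alp(c1) - alp(c2)|
    return (len(child_1_rewards) * len(child_2_rewards)
            * abs(region_alp(child_1_rewards) - region_alp(child_2_rewards)))
```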
Covar-GMM

Originating from the developmental robotics field [4], this approach inspired the design of ALP-GMM. In Covar-GMM, instead of fitting a GMM on the parameter space concatenated with ALP as in ALP-GMM, each parameter is concatenated with its associated episodic return and time (relative to the current window of considered parameters). New parameters are then chosen by sampling from a Gaussian selected proportionally to its positive covariance between time and episodic reward, which emulates positive LP. Contrary to ALP-GMM, Covar-GMM ignores negative learning progress and has no way to detect long-term LP (i.e. LP is only measured for the currently fitted datapoints). Although not initially present in Covar-GMM, we compute the number of Gaussians online as in ALP-GMM to compare the two approaches solely on their LP measure. Likewise, Covar-GMM is given the same hyperparameters as ALP-GMM (see Section 4).
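The covariance-based selection rule can be sketched as follows, assuming a scikit-learn GMM fitted on rows of the form [parameter, episodic reward, relative time]; the column indices are illustrative assumptions:

```python
import numpy as np

def select_gaussian_by_time_reward_cov(gmm, reward_idx=-2, time_idx=-1):
    """Pick a Gaussian proportionally to its positive covariance between time and episodic reward."""
    covs = np.array([max(c[time_idx, reward_idx], 0.0) for c in gmm.covariances_])
    probs = (covs + 1e-6) / (covs + 1e-6).sum()
    return np.random.choice(len(probs), p=probs)
```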
Oracle
Oracle has been manually crafted based on knowledge acquired over multiple runs of the algorithm. It uses a step size $\sigma_W$ defined as a fixed fraction of $R$, with $R$ a vector containing the maximal distance along each dimension of the parameter space. Before each new episode, the window (whose size $W_{size}$ is also a fixed fraction of $R$) is slid toward more complex task distributions by $\sigma_W$ only if the average episodic reward over the last proposed tasks exceeds the threshold $r_{thr}$. See Algorithm 2 for pseudo-code.

Figure 7 provides a visualization of the evolution of Oracle's parameter sampling for a typical run in Stump Tracks. One can see that the final position of the parameter sampling window (corresponding to a subspace that cannot be mastered by the student) is reached after 10500 episodes (c) and remains the same up to the end of the run. This end-of-training focus on a specific part of the parameter space is the cause of the forgetting issues of Oracle (see Section 5.1).

Schematic view of an ALP-GMM teacher's workflow
Algorithm 2
Oracle
Require:
Student $S$, parametric environment $E$, bounded parameter space $\mathcal{P}$, initial sampling window position $W_{pos}$, window step size $\sigma_W$, memory size $m_{size}$, reward threshold $r_{thr}$, window size $W_{size}$.
Set sampling window $\mathcal{W} \subset \mathcal{P}$ to $W_{pos}$
loop
    Sample random parameter $p \in \mathcal{W}$, send $E(\tau \sim T(p))$ to $S$, observe episodic reward $r_p$
    If the mean competence over the last $m_{size}$ episodes exceeds $r_{thr}$ then $W_{pos} = W_{pos} + \sigma_W$
Return $S$
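A minimal Python rendering of Algorithm 2; window position, step size, threshold and memory size are left as arguments since their exact values are environment-specific expert knowledge:

```python
import numpy as np
from collections import deque

class OracleTeacherSketch:
    """Sliding-window curriculum: the window moves toward harder task distributions
    once the mean of the last m_size episodic rewards exceeds r_thr."""
    def __init__(self, window_pos, window_size, step, r_thr, m_size):
        self.window_pos = np.array(window_pos, float)   # lower corner of the sampling window
        self.window_size = np.array(window_size, float)
        self.step = np.array(step, float)               # sigma_W: per-dimension increment
        self.r_thr = r_thr
        self.rewards = deque(maxlen=m_size)

    def sample(self):
        # Random parameter inside the current window W
        return self.window_pos + np.random.uniform(0, 1, len(self.window_pos)) * self.window_size

    def update(self, p, episodic_reward):
        self.rewards.append(episodic_reward)
        if len(self.rewards) == self.rewards.maxlen and np.mean(self.rewards) > self.r_thr:
            self.window_pos += self.step                # slide toward more complex tasks
            self.rewards.clear()
```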
Additional visualizations for Stump Tracks experiments

(a) 1000 episodes. (b) 2000 episodes. (c) 10500 episodes. (d) 15000 episodes.

Figure 7:
Evolution of Oracle parameter sampling for a default bipedal walker on Stump Tracks.
Blue dots represent the last sampled parameters, red dots represent all other previously sampled parameters. At first (a), Oracle starts by sampling parameters in the easiest subspace (i.e. large stump spacing and low stump height). After 2000 episodes (b), Oracle has slid its sampling window towards stump tracks with higher stump heights and lower spacing. After 10500 episodes (c), this Oracle run has reached a challenging subspace that its student will not be able to master. By 15000 episodes (d), the sampling window has not moved further, as the mean reward threshold was never crossed.

(a) 500 episodes. (b) 1500 episodes. (c) 15000 episodes. (d) 20000 episodes.

Figure 8:
Evolution of RIAC parameter sampling for a default bipedal walker on Stump Tracks.
At first (a), RIAC does not find any learning progress signal in the space, resulting in random splits. After 1500 episodes (b), RIAC focuses its sampling on the leftmost part of the space, corresponding to low stump heights, for which the SAC student manages to progress. After 15k episodes (c), RIAC has spread its sampling to parameters corresponding to track distributions with higher stump heights, with the highest stumps paired with high spacing. By the end of the training (d), the student has converged to a final skill level, and thus LP is no longer detected by RIAC, except for simple track distributions in the leftmost part of the space in which occasional forgetting of walking gaits leads to an ALP signal.

(a) Short agents. (b) Default agents. (c) Quadrupedal agents.

Figure 9:
Box plot of the final performance of each condition on Stump Tracks after 20M steps.
Gold lines are medians, surrounded by a box showing the first and third quartiles, which are then followed by whiskers extending to the last datapoint or 1.5 times the inter-quartile range. Beyond the whiskers are outlier datapoints.

(a): For short agents, Random always ends up mastering 0% of the track distributions of the test set, except for a single run that is able to master 3 track distributions (6%). LP-based teachers obtained superior performances to Random, while still failing to reach non-zero performance by the end of training in a fraction of the runs of ALP-GMM, Covar-GMM and RIAC.

(b): For default walkers, LP-based approaches have less variance than Oracle (visible in the difference in inter-quartile range), whose window-sliding strategy led to catastrophic forgetting in a majority of runs. Random remains the least performing algorithm.

(c): For quadrupedal walkers, Oracle performs significantly worse than any other condition. Additional investigations of the data revealed that, by sliding its sampling window towards track distributions with higher stump heights and lower stump spacing, Oracle's runs mostly failed to master track distributions that were both hard and distant from its sampling window within the parameter space: that is, tracks with both high stump heights and high spacing.

(a)-(c): increasing maximal stump height.

Figure 10:
Evolution of mean performance of Teacher-Student approaches when increasing the amountof unfeasible tracks in Stump Tracks with default bipedal walkers.
32 seeded runs were performed for each condition. The mean performance is plotted with shaded areas representing the standard error of the mean. ALP-GMM is the most robust LP-based teacher and maintains a statistically significant performance advantage over all other conditions in all 3 settings. Random performance is most impacted when increasing the number of unfeasible tracks. ALP-GMM is more robust than RIAC for both increases of the maximal stump height. Note that for all 3 experiments, for comparison purposes, the same test set was used and contained only track distributions within the original maximal stump height.
To better understand the properties of all of the tested conditions in Hexagon Tracks, we analyzed the distributions of the percentage of mastered environments of the test set after training for 80 million (environment) steps. Using Figure 11, one can see that ALP-GMM has both the highest median performance and the narrowest distribution. Out of all the repeats, only Oracle and ALP-GMM always end up with positive final performance scores, whereas Covar-GMM, RIAC and Random end up with zero performance in a fraction of their runs. Interestingly, in all repeats of any condition, the student manages to master part of the test set at some point (i.e. non-zero performance), meaning that runs that end up with zero final test performance actually experienced catastrophic forgetting. This showcases the ability of ALP-GMM to avoid this forgetting issue through efficient tracking of its student's absolute learning progress.

Figure 11:
Box plot of the final performance of each condition run on Hexagon Tracks after 80M steps.
Gold lines are medians, surrounded by a box showing the first and third quartiles, which are then followed by whiskers extending to the last datapoint or 1.5 times the inter-quartile range. Beyond the whiskers are outlier datapoints.

Parameterized BipedalWalker Environments
In BipedalWalker environments, observation vectors provided to walkers are composed of 10 lidar sensors (providing distance measures), the hull angle and velocities (linear and angular), the angle and speed of each hip and knee joint, along with a binary vector indicating whether each leg is touching the ground or not. The resulting observation vector is therefore larger for the quadrupedal version than for our two bipedal walkers. To account for its increased weight and additional legs, we increased the maximal torque usage and reduced the torque penalty for quadrupedal agents.
Parameter bounds of Stump Tracks  In Stump Tracks, the mean stump height $\mu_h$ and the spacing $\Delta_s$ each vary within a fixed bounded range. Examples of randomly generated tracks are shown in Figure 12.
In Hexagon Tracks, each dimension of the parameter space varies within the same fixed bounded range. Figure 13 provides a visual explanation of how hexagons are generated. Figure 14 shows examples of randomly generated tracks.

Figure 12: Example of tracks generated in Stump Tracks.
Figure 13: Generation of obstacles in Hexagon Tracks. Given a default hexagonal obstacle, the first values of the 12-D parameter are used as positive offsets to the x and y positions of its vertices (except for the y position of the first and last ones, in order to ensure the obstacle has at least one edge in contact with the ground).

Figure 14: Example of tracks generated in Hexagon Tracks.