Curriculum in Gradient-Based Meta-Reinforcement Learning
Bhairav Mehta, Tristan Deleu, Sharath Chandra Raparthy, Chris J. Pal, Liam Paull
Bhairav Mehta ∗ Université de Montréal, Mila
Tristan Deleu † Université de Montréal, Mila
Sharath Chandra Raparthy † Mila
Chris J. Pal
Polytechnique Montréal, Mila, CIFAR AI Chair
Liam Paull
Université de Montréal, Mila, CIFAR AI Chair
Abstract
Gradient-based meta-learners such as Model-Agnostic Meta-Learning (MAML) have shown strong few-shot performance in supervised and reinforcement learning settings. However, specifically in the case of meta-reinforcement learning (meta-RL), we can show that gradient-based meta-learners are sensitive to task distributions. With the wrong curriculum, agents suffer the effects of meta-overfitting, shallow adaptation, and adaptation instability. In this work, we begin by highlighting intriguing failure cases of gradient-based meta-RL and show that task distributions can wildly affect algorithmic outputs, stability, and performance. To address this problem, we leverage insights from recent literature on domain randomization and propose meta Active Domain Randomization (meta-ADR), which learns a curriculum of tasks for gradient-based meta-RL in a similar manner as ADR does for sim2real transfer. We show that this approach induces more stable policies on a variety of simulated locomotion and navigation tasks. We assess in- and out-of-distribution generalization and find that the learned task distributions, even in an unstructured task space, greatly improve the adaptation performance of MAML. Finally, we motivate the need for better benchmarking in meta-RL that prioritizes generalization over single-task adaptation performance.

∗ Correspondence to [email protected]
† Denotes equal contribution
Figure 1: Meta-ADR proposes tasks to a meta-RL agent, learning a curriculum of tasks rather than sampling them uniformly from a fixed distribution. A discriminator learns a reward as a proxy for task difficulty, using pre- and post-adaptation rollouts as input. The reward is used to train SVPG particles, which propose a diverse set of tasks, seeking those that currently cause the meta-learner the most difficulty after adaptation.
1 Introduction

Meta-learning concerns building models or agents that can learn how to adapt quickly to new tasks from datasets which are orders of magnitude smaller than their standard supervised learning counterparts. Put differently, meta-learning concerns learning how to learn, rather than simply maximizing performance on a single task or dataset. Gradient-based meta-learning has seen a surge of interest, with the foremost algorithm being Model-Agnostic Meta-Learning (MAML) [Finn et al., 2017a]. Gradient-based meta-learners are fully trainable via gradient descent, and have shown strong performance on various supervised and reinforcement learning tasks [Finn et al., 2017a, Rakelly et al., 2019].

The focus of this work is on an understudied hyperparameter within the gradient-based meta-learning framework: the distribution of tasks. In MAML, this distribution is assumed given and is used to sample tasks for meta-training of the MAML agent. In supervised learning, this quantity is relatively well-defined, as we often have a large dataset for a task such as image classification.¹ As a result, the distribution of tasks, p(τ), is built from random minibatches sampled from this dataset.

However, in the meta-reinforcement learning setting, this task distribution is poorly defined, and is often handcrafted by a human experimenter with the target task in mind. While the task samples themselves are pulled randomly from a range or distribution (e.g., a locomotor asked to achieve a target velocity), the distribution itself needs to be specified. In practice, the distribution p(τ) turns out to be an extremely sensitive hyperparameter in meta-RL: too "wide" a distribution (i.e., the variety of tasks is too large) leads to underfitting, with agents unable to specialize to the given target task even with larger numbers of gradient steps; too "narrow", and we see poor generalization and adaptation to even slightly out-of-distribution environments. Even worse, randomly sampling from p(τ), as is often the case, can select tasks that cause interference and optimization difficulties, especially when tasks are qualitatively different (because difficulty or task definitions are changed too much by the physical parameters that are varied).

This phenomenon, called meta-overfitting (or meta-underfitting, in the former, "wide" case), is not new to recent deep reinforcement learning problem settings. Domain randomization [Tobin et al., 2017], a popular sim2real transfer method, faces many of the same issues when learning robotic policies purely in simulation. Here, we will show that meta-reinforcement learning has analogous issues regarding generalization, which we can attribute to the random sampling of tasks. We will then describe the repurposing of a recent algorithm called Active Domain Randomization [Mehta et al., 2019], which aims to learn a curriculum of tasks in unstructured task spaces. In this work, we address the problem of meta-overfitting by explicitly optimizing for the task distribution represented by p(τ). The incorporation of a learned curriculum leads to stronger generalization performance and more robust optimization. Our results highlight the need for continued analysis of the effect of task distributions on meta-RL performance, and underscore the potential of curriculum learning techniques.

¹ Even in regression, a function such as a sinusoid is often provided by the experimenter as the task distribution.
2 Background

In this section, we briefly cover the reinforcement learning, meta-learning, and curriculum learning ideas touched upon in later parts of the paper.
We consider a reinforcement learning setting where a task τ is defined as a Markov Decision Process (MDP), a tuple (S, A, T, R, γ), where S is the state space, A is the action space, T : S × A → S is the transition function, R is a reward function, and γ ∈ (0, 1) is a discount factor. The goal of reinforcement learning is to learn a policy π parameterized by θ that maximizes the expected total discounted reward [Sutton and Barto, 2018].

Most deep learning models are built to solve only one task, and often lack the ability to generalize and quickly adapt to a new set of tasks. Meta-learning involves learning a learning algorithm which can adapt quickly, rather than learning each task from scratch. Several methods have been proposed, treating the learning algorithm as a recurrent model capable of remembering past experience [Santoro et al., 2016, Munkhdalai and Yu, 2017, Mishra et al., 2018], as a non-parametric model [Koch et al., 2015, Vinyals et al., 2016, Snell et al., 2017], or as an optimization problem [Ravi and Larochelle, 2017, Finn et al., 2017a]. In this paper, we focus on a popular gradient-based meta-learning algorithm called Model-Agnostic Meta-Learning (MAML; [Finn et al., 2017a]).
The main idea in MAML is to find a good parameter initialization such that the model can adapt to a new task τ quickly. Formally, given a distribution of tasks p(τ) and a loss function L_τ corresponding to each task, the aim is to find parameters θ such that the model f_θ can adapt to new tasks with one or a few gradient steps. For example, in the case of a single gradient step, the parameters θ′_τ adapted to the task τ are

    θ′_τ = θ − α ∇_θ L_τ(D_train, f_θ),        (1)

with step size α, where the loss is evaluated on a (typically small) dataset D_train of training examples from task τ. In order to find a good initial value of the parameters θ, the objective function being optimized in MAML is

    min_θ Σ_{τ_i} L_{τ_i}(D_test, f_{θ′_{τ_i}}),        (2)

which evaluates the generalization performance on held-out test examples D_test for each task τ_i. The meta-objective is optimized by gradient descent, where the parameters are updated according to

    θ ← θ − β ∇_θ Σ_{τ_i} L_{τ_i}(D_test, f_{θ′_{τ_i}}),        (3)

where β is the outer step size.

In addition to few-shot supervised learning problems, where the number of training examples is small, meta-learning has also been successfully applied to reinforcement learning problems. In meta-reinforcement learning, the goal is to find a policy that can quickly adapt to new environments, generally from only a few trajectories. Rakelly et al. [2019] treat this problem by conditioning the policy on a latent representation of the task, and Duan et al. [2016], Wang et al. [2016] represent the reinforcement learning algorithm as a recurrent network, inspired by the "black-box" meta-learning methods mentioned above. Some meta-learning algorithms can even be adapted to reinforcement learning with minimal changes [Mishra et al., 2018]. In particular, MAML has also shown some success on robotics applications [Finn et al., 2017b]. In the context of reinforcement learning, D_train and D_test are datasets of trajectories sampled by the policies before and after adaptation (i.e., rollouts in D_train are sampled before the gradient step in Equation 1, whereas those in D_test are sampled after). The loss function used for the adaptation is REINFORCE [Williams, 1992], and the outer meta-objective in Equation 3 is optimized using TRPO [Schulman et al., 2015].
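Equations (1)–(3) translate almost directly into code. The following is a minimal sketch of the one-step MAML update on a toy 1-D regression family; the task family, model, and step sizes are illustrative placeholders rather than the setup used in this paper.

```python
# Minimal sketch of the one-step MAML update in Eqs. (1)-(3) on a toy
# 1-D regression family. All names below are illustrative.
import torch

torch.manual_seed(0)
# Model f_theta: a linear model with parameters theta = (w, b).
theta = [torch.randn(1, requires_grad=True), torch.zeros(1, requires_grad=True)]
alpha, beta = 1e-2, 1e-3  # inner (adaptation) and outer (meta) step sizes

def loss(params, x, y):
    w, b = params
    return ((w * x + b - y) ** 2).mean()

def sample_task():
    # Hypothetical task family: regress y = a * x with a task-specific slope.
    a = 0.5 + torch.rand(1)
    def draw(n=20):
        x = torch.randn(n)
        return x, a * x
    return draw

for epoch in range(100):
    meta_loss = 0.0
    for _ in range(5):                                 # tasks tau_i ~ p(tau)
        draw = sample_task()
        (x_tr, y_tr), (x_te, y_te) = draw(), draw()    # D_train and D_test
        # Eq. (1): theta'_tau = theta - alpha * grad_theta L_tau(D_train, f_theta)
        grads = torch.autograd.grad(loss(theta, x_tr, y_tr), theta,
                                    create_graph=True)
        theta_prime = [p - alpha * g for p, g in zip(theta, grads)]
        # Eq. (2): evaluate the adapted parameters on the held-out D_test
        meta_loss = meta_loss + loss(theta_prime, x_te, y_te)
    # Eq. (3): outer step, differentiating through the inner adaptation
    meta_grads = torch.autograd.grad(meta_loss, theta)
    with torch.no_grad():
        for p, g in zip(theta, meta_grads):
            p -= beta * g
```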
Active Domain Randomization (ADR) [Mehta et al., 2019] builds upon the framework of domain randomization [Tobin et al., 2017]. Domain randomization, a useful zero-shot technique for transferring robot policies from simulation to real hardware, uniformly samples randomized environments (effectively, tasks) that an agent must solve. ADR improves upon DR by learning an active policy that proposes a curriculum of tasks to train an inner-loop, black-box agent. ADR, mostly used in the zero-shot learning scenario of simulation-to-real transfer, uses Stein Variational Policy Gradient (SVPG) [Liu et al., 2017] to learn a set of parameterized particles {µ_φi}, i = 1, ..., N, that propose the randomized environments subsequently used to train the agent. SVPG benefits from both the maximum-entropy RL framework [Ziebart, 2010] and a kernel term that repulses similar policies to encourage particle, and therefore task, diversity. This allows SVPG to hone in on regions of high reward while maintaining variety, which allows ADR to outperform many existing methods in terms of performance and generalization on zero-shot learning and robotic benchmarks.

To train the particles, ADR uses a discriminator to distinguish between trajectories generated in a proposed randomized environment and those generated by the same policy in a default, reference environment. Intuitively, ADR is optimized to find environments where the same policy produces different behavior in the two types of environments, signalling a probable weakness in the policy when evaluated on those types of randomized environments.

In meta-reinforcement learning, to the best of our knowledge, the task distribution p(τ) has never been studied or ablated upon. As most benchmark environments and tasks in meta-RL stem from two papers ([Finn et al., 2017a, Rothfuss et al., 2018], with the task distributions prescribed alongside the environments), the discussion in meta-RL papers has almost always centered around the computation of the updates [Rothfuss et al., 2018], practical improvements and approximations made to improve efficiency, or learning exploration policies with meta-learning [Stadie et al., 2018, Gurumurthy et al., 2019, Gupta et al., 2018]. In this section, we briefly discuss prior work in curriculum learning that bears the most similarity to the analyses we conduct here.

Starting with the seminal curriculum learning paper [Bengio et al., 2009], many different proposals to learn an optimal ordering over tasks have been studied. Curriculum learning has been tackled with Bayesian optimization [Tsvetkov et al., 2016], multi-armed bandits [Graves et al., 2017], and evolutionary strategies [Wang et al., 2019] in supervised and reinforcement learning settings; here, we focus on the latter. In most work, however, the task space is discrete, with a teacher agent looking to choose the best next task from a set of N pre-made tasks. The notion of best has also been explored in depth, with metrics ranging from ground-truth accuracy or reward to adversarial gains between a teacher and student agent [Pinto et al., 2017].

Until recently, however, continuously-parameterized curriculum learning had been studied less often. Continuous-task curriculum learning often exploits a notion of difficulty in the task itself. In order to get agents to hop over large gaps, it has been empirically easier to get them to jump over smaller ones first [Heess et al., 2017]; likewise, in navigation domains, it has been easier to show easier goals and grow a goal space [Pong et al., 2019], or even to work backwards towards the start state in a reverse-curriculum manner [Florensa et al., 2017].

While deep reinforcement learning, particularly in robotics, has seen a large number of curriculum learning papers in recent times [Mehta et al., 2019, OpenAI et al., 2019], curriculum learning has not been extensively researched in meta-RL. This may be partly due to the youth of the field; only recently was a large-scale, multi-task benchmark for meta-RL released [Yu et al., 2019]. As we hope to show in this work, the notions of tasks, task distributions, and curricula in meta-learning are fruitful avenues of study, and can make (or break) many of the meta-learning algorithms in use today. We begin with a simple question:
Does the meta-training task distribution in meta-RL really matter?
To answer this question, we run a standard meta-reinforcement learning benchmark, 2D-Navigation-Dense. In this environment, a point mass must navigate to a goal, with rewards given at each timestep proportional to the Euclidean distance between the goal and the current position. We take the hyperparameters and experimental setup from the original MAML work and simply change the task distribution from which the 2D goal is uniformly sampled. We then show generalization results of the final, meta-learned initial policy after a single gradient step, tracking the generalization of the one-step adaptation performance across a wide range of target goals.

In 2D-Navigation-Dense, the training distribution prescribes goals where each coordinate is traditionally sampled between [−0.5, 0.5] (the second plot in Figure 2), with the agent always beginning at [0, 0]. We then evaluate each goal in the grid between [−1, 1] at 0.5 intervals, allowing us to test both in- and out-of-distribution generalization.

Figure 2: Various agents' final adaptation to a range of target tasks. The agents vary only in training task distributions, shown as red overlaid boxes. Redder is higher reward.

We see from Figure 2 an interesting phenomenon, particularly as the training environment shifts away from the one which samples goal coordinates g_t ∼ [−0.5, 0.5]. While the standard environment from [Finn et al., 2017a] generalizes reasonably well, shifting the training distribution even slightly ruins generalization of the adapted policy. What is more, when shown the entire test distribution, MAML fails to generalize to it. We see that on even the simplest environments, the meta-training task distribution has a profound effect, motivating the need to dedicate more attention towards selecting the task distribution p(τ).

Upon further inspection, we find that shifting the meta-training distribution destabilizes MAML, leading to poor performance when averaged. The first shifted environment has three out of five random seeds that converge, while the two further-shifted variants have two and one converging seeds, respectively. The original task distribution sees convergence in all five random seeds tested, hinting at a difference in stability due to the goals, and therefore the task distribution, that each agent sees. This hints at a hidden importance of the task distribution p(τ), a hypothesis we explore in greater detail in the next section.

As we saw in the previous section, uniformly sampling tasks from a set task distribution highly affects the generalization performance of the resulting meta-learning agent. Consequently, in this work, we optimize for a curriculum over the task distribution p(τ):

    arg min_{τ_i ∼ p(τ)} min_θ Σ_{τ_i} L_{τ_i}(f_{θ′_i}),        (4)

where θ′_i are the updated parameters after a single meta-gradient update.

While curriculum learning has had success in scenarios where task spaces are structured, learning curricula in unstructured task spaces, where an intuitive scale of difficulty might be lacking, is an understudied topic. However, learning such curricula has seen a surge of interest in the problem of simulation transfer in robotics, where policies trained in simulation are transferred zero-shot (no fine-tuning) for use on real hardware.
Using a method called domain randomization [Tobin et al., 2017], several recent methods [OpenAI et al., 2019, Mozifian et al., 2019] propose how to learn a curriculum of randomizations: which randomized environments would be most useful to show the learning agent in order to make progress on the held-out target task, the real robot.

In the meta-RL setting, the learned curriculum would be over the space of tasks. For example, in 2D-Navigation-Dense, this would be where goals are sampled; in HalfCheetahVelocity, another popular meta-RL benchmark, it would be the goal velocity the locomotor must achieve. As learning the curriculum is often treated as a reinforcement learning problem, it requires a reward in order to calculate policy gradients. While many of the methods from the domain randomization literature use proxies such as completion rates or average reward, their optimization schemes depend on the reward function of the task. In meta-learning, optimization and reward maximization on a single task is not the goal, and such an approach may lead to counter-intuitive results.

A more natural fit in the meta-learning scenario is to use the qualitative difference between the pre- and post-adaptation trajectories. Like a good teacher with a struggling student, the curriculum could shift towards where the meta-learner needs help. For example, tasks on which negative adaptation [Deleu and Bengio, 2018] occurs, where the return of the pre-adaptation agent is higher than that of the post-adaptation agent, would be prime tasks to focus on during training.

To this end, we modify
Active Domain Randomization (ADR) to calculate such a score between the two types of trajectories. Rather than using a reference environment as in ADR, we ask a discriminator to differentiate between the pre- and post-adaptation trajectories. If a particular task generates trajectories that can be distinguished by the discriminator after adaptation, we focus more heavily on that task by providing a higher reward to the high-level optimizer, parameterized by Stein Variational Policy Gradient. Concretely, we provide the particles the reward

    r_i = log f_ψ(y | D_i),        (5)

where the discriminator f_ψ produces a boolean prediction of whether the trajectory D_i is a pre-adaptation (y = 0) or post-adaptation (y = 1) trajectory. We present the algorithm, which we term Meta-ADR, in Algorithm 1.
Algorithm 1: Meta-ADR
Input: task distribution p(τ)
Initialize π_θ: agent policy; µ_φ: SVPG particles; f_ψ: discriminator
while not max epochs do
    for each particle µ_φ do
        sample tasks τ_i ∼ µ_φ(·), bounded by the support of p(τ)
    end for
    for each τ_i do
        D_pre, D_post = MAML_RL(π_θ, τ_i)
        calculate r_i for τ_i using D_post (Eq. (5))
    end for
    // Gradient updates
    update particles using the SVPG update rule and r_i
    update f_ψ with D_pre and D_post using SGD
end while

Meta-ADR learns a curriculum in this unstructured task space without relying on the notion of task performance or reward functions. Note that Meta-ADR runs the original MAML algorithm as a subroutine, but it can in fact run any meta-learning subroutine (e.g., Reptile [Nichol et al., 2018], PEARL [Rakelly et al., 2019], or First-Order MAML). In this work, we abstract away the meta-learning subroutine, focusing instead on the effect of task distributions on the learner's generalization capabilities. An advantage of Meta-ADR over the original ADR formulation is that Meta-ADR requires no additional rollouts, using the rollouts already required by gradient-based meta-reinforcement learners to optimize the curriculum.
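To make the loop in Algorithm 1 concrete, the sketch below implements the discriminator reward of Equation (5) in PyTorch, with the MAML-RL subroutine stubbed out and the SVPG particle update omitted; the shapes, module sizes, and rollout summaries are illustrative assumptions rather than the configuration used in our experiments.

```python
# Sketch of one Meta-ADR iteration (Algorithm 1). Only the discriminator
# reward of Eq. (5) is implemented concretely; MAML-RL is stubbed out and
# the SVPG update is omitted. All shapes are illustrative.
import torch
import torch.nn as nn

FEAT = 8  # size of a (summarized) rollout, e.g. mean state features

# Discriminator f_psi: logit of P(y = post-adaptation | rollout summary).
f_psi = nn.Sequential(nn.Linear(FEAT, 32), nn.ReLU(), nn.Linear(32, 1))
opt_psi = torch.optim.SGD(f_psi.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()

def maml_rl_stub(task):
    """Placeholder for the MAML-RL subroutine: returns pre- and
    post-adaptation rollout summaries (D_pre, D_post) for one task."""
    return torch.randn(FEAT), torch.randn(FEAT) + task

def adr_reward(d_post):
    # Eq. (5): r_i = log f_psi(y = 1 | D_i). The reward is high when the
    # discriminator confidently recognizes the post-adaptation rollout,
    # i.e. when adaptation visibly changed behavior on this task.
    with torch.no_grad():
        return torch.log(torch.sigmoid(f_psi(d_post))).item()

tasks = torch.rand(4, FEAT)  # stand-in for tasks proposed by SVPG particles
rewards, pre, post = [], [], []
for task in tasks:
    d_pre, d_post = maml_rl_stub(task)
    rewards.append(adr_reward(d_post))
    pre.append(d_pre)
    post.append(d_post)
# `rewards` would now drive the SVPG particle update (omitted here).

# Discriminator update: label pre-adaptation rollouts 0, post-adaptation 1.
x = torch.stack(pre + post)
y = torch.cat([torch.zeros(len(pre)), torch.ones(len(post))]).unsqueeze(1)
opt_psi.zero_grad()
bce(f_psi(x), y).backward()
opt_psi.step()
```

Because the reward reuses rollouts that the meta-learner already generates, the discriminator and particle updates add no extra sample cost, which is the advantage over the original ADR formulation noted above.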
In this section, we show the results of uniform task sampling with the standard MAML agent as the task distribution p(τ) changes, while also benchmarking against a MAML agent trained with a task distribution learned using Meta-ADR. All hyperparameters for each task are taken from [Finn et al., 2017a, Rothfuss et al., 2018], with the exception that we take the final policy at the end of 200 meta-training epochs instead of the best-performing policy over 500 meta-training epochs. We use the code from [Deleu and Guiroy, 2018] to run all of our experiments. Unless otherwise noted, all experiments are run and averaged across five random seeds. All results are shown after a single gradient step at meta-test time. For each task, we artificially create a generalization range, potentially disjoint from the training distribution of target goals, velocities, headings, etc., and we evaluate each agent both in- and out-of-distribution.

Figure 3: When a curriculum of tasks is learned with Meta-ADR, the stability of MAML improves. Redder is higher reward.
Importantly, since our focus is on generalization, we evaluate the final policy, rather than the standard, best-performing policy. As MAML produces a final initial policy, when evaluating for generalization in meta-learning, we adapt that initial policy to each target task and report the adaptation results. In addition, in certain sections, we discuss negative adaptation, which is simply the performance difference between the final, adapted policy and the final, initial policy. When this quantity is negative, as noted in [Deleu and Bengio, 2018], we say that the policy has negatively adapted to the task. We present results from standard meta-RL benchmarks in Sections 6.1 and 6.2, and in general find that Meta-ADR stabilizes the adaptation procedure. However, this finding is not universal: in Section 6.3, we highlight a need for better benchmarking and describe failure cases (overfitting and biased, non-uniform generalization) from which both Meta-ADR and uniform-sampling methods seem to suffer.
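A minimal sketch of this evaluation protocol follows; `adapt_one_step` and `evaluate_return` are hypothetical stand-ins for the MAML inner update (Equation 1) and for averaging returns over rollouts, and the goal grid matches the 2D navigation evaluation described in Section 4.

```python
# Sketch of the evaluation protocol: adapt the final meta-learned policy
# for one gradient step per target task, then record post-adaptation
# return and negative adaptation. Helper functions are placeholders.
import numpy as np

def evaluate_generalization(policy, task_grid, adapt_one_step, evaluate_return):
    results = {}
    for task in task_grid:
        pre_return = evaluate_return(policy, task)   # final, initial policy
        adapted = adapt_one_step(policy, task)       # one inner step (Eq. 1)
        post_return = evaluate_return(adapted, task)
        results[task] = {
            "post_return": post_return,
            # Negative adaptation [Deleu and Bengio, 2018]: the meta-test
            # gradient step hurt performance on this task.
            "negative_adaptation": post_return < pre_return,
        }
    return results

# Goal grid for 2D navigation: training box [-0.5, 0.5]^2, evaluation over
# [-1, 1]^2 at 0.5 intervals, covering in- and out-of-distribution goals.
coords = np.arange(-1.0, 1.01, 0.5)
task_grid = [(x, y) for x in coords for y in coords]
in_distribution = [t for t in task_grid if max(abs(t[0]), abs(t[1])) <= 0.5]
```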
6.1 Navigation

In this section, we evaluate Meta-ADR on two navigation tasks: 2D-Navigation-Dense and Ant-Navigation.
We train the same meta-learning agent from Section 4 on 2D-Navigation-Dense, except this time we use the tasks proposed by Meta-ADR, using a learned curriculum to propose the next best task for the agent.² We evaluate generalization across the same scaled-up square spanning [−1, 1] in both dimensions. From Figure 3, we see that, with a learned curriculum, the agent generalizes much better, especially within the training distribution. A MAML agent trained with a Meta-ADR curriculum also generalizes out-of-distribution with much stronger performance. These results, especially when compared to those in Figure 2, hint at the strong dependence of MAML performance on the task distribution p(τ). Learning such a task curriculum with a method such as Meta-ADR helps alleviate some instability.

² In navigation environments, tasks are parameterized by the x, y location of the goal.
Interestingly, on a more complex variant of the same task, Ant-Navigation, the benefits of such a learned curriculum are minimized. In this task, an eight-legged locomotor must reach goal positions sampled from a predetermined range; the standard environment samples the goal positions from a box centered at (0, 0). We systematically evaluate each agent on a larger grid of goals at a fixed step interval.

In Figure 4, we qualitatively see the same generalization across all training task distributions when comparing a randomly sampled task curriculum and a learned one. We hypothesize that this stability comes mainly from the control components of the reward, leading to smoother, more stable performance across all training distributions. In addition, generalization is unaffected by the choice of distribution, pointing to differences between this task and the simpler version.

Figure 4: In the Ant-Navigation task, both uniformly sampled goals (top) and a learned curriculum of goals with Meta-ADR (bottom) are stable in performance. We attribute this to the extra components in the reward function. Redder is higher reward.

Compared to 2D-Navigation-Dense, Ant-Navigation receives a dense reward that includes, beyond the distance to the target, "a control cost, a contact cost, a survival reward, and a penalty equal to its L1 distance to the target position." In comparison, the 2D-Navigation-Dense task, while a simpler control problem, receives reward information related only to the Euclidean distance to the goal. Counter-intuitively, this simplicity results in less stable performance when uniformly sampling tasks, an ablation we hope to study in future work.

6.2 Locomotion
We now consider locomotion, another standard meta-RL benchmark, where we are tasked with training an agent to quickly move at a particular target velocity (Section 6.2.1) or in a particular direction (Section 6.2.2). In this section, we focus on two high-dimensional continuous control problems. In AntVelocity, an eight-legged locomotor must run at a specific speed, with the task space (both for learned and random curricula) being the target velocity. In Humanoid-Direc-2D, a benchmark introduced by [Rothfuss et al., 2018], an agent must learn to run in a target direction θ in a 2D plane. Both tasks are extremely high-dimensional in both observation and action space. The ant has a 111-dimensional observation space, with each step requiring an action vector of length eight. The humanoid, which takes in a 376-element state, requires an action vector of length 17.

When dealing with the target velocity task, we train an ant locomotor to attain target speeds sampled from the standard range of AntVelocity (Figure 5, left) and from a wider, mis-calibrated range (Figure 5, right).

Figure 5: Ant-Velocity sees less of a benefit from curriculum, but performance is greatly affected by a correctly-calibrated task distribution (left). In a mis-calibrated one (right), we see that performance from a learned curriculum is slightly more stable.

While learned curricula provide insignificant performance improvements over random sampling when shown the same task distribution, we see large differences in performance between task distributions, supporting our hypothesis that p(τ) is a crucial hyperparameter for successful meta-RL. In addition, we notice that the highest scores are attained on the velocities closest to the easiest variant of the task, v_t = 0, which requires the locomotor to stand completely still. We expand on this oddity in Section 6.3.3.

Figure 6: In the high-dimensional Humanoid Directional task, we evaluate many different training distributions to understand the effect of p(τ) on generalization in difficult continuous control environments. In particular, we focus on symmetric variants of tasks: task distributions that mirror each other, such as those shown in the right panel. Intuitively, when averaged over many trials, such mirrored distributions should produce similar trends of in- and out-of-distribution generalization.

In the standard variant of Humanoid-Direc-2D, a locomotor is tasked with running in a particular direction, sampled from [0, 2π]. This task makes no distinction regarding target velocity, but rather calculates the reward based on the agent's heading and other control costs. In this task, we shift the distribution from [0, 2π] to subsets of this range, subsequently training and evaluating MAML agents across the entire range of tasks in [0, 2π], as seen in the first two panels of Figure 6. Again, we compare agents trained with the standard uniformly sampled task distribution against those trained with a curriculum learned using Meta-ADR.

When studying the generalization capabilities on this difficult continuous control task, we are particularly interested in symmetric versions of the task: for example, tasks that sample the right and left semicircles of the task space. We repeat this experiment with many variants of this symmetric task description and, due to space constraints, report representative results in Figure 7. When testing various training distributions, we find that, in general, learned curricula stabilize the algorithm. We see more consistent performance increases, with smaller losses in performance in the directions where UniformSampling-MAML outperforms the learned curriculum.
However, as noted in Tables 1 and 2, we see that, again, the task distribution p(τ) is an extremely sensitive hyperparameter, causing large shifts in performance when uniformly sampling from those ranges. Worse, this hyperparameter seems to cause counter-intuitive gains and drops in performance, both on in- and out-of-distribution tasks. While learned curricula seem to help in such a task, a more important consideration from many of these experiments is the variance in performance between tasks.

Figure 7: In complex, high-dimensional environments, training task distributions can wildly vary performance. Even in the Humanoid Directional task, Meta-ADR allows MAML to generalize across the range, although it too is affected in terms of total return when compared to the same algorithm trained with "good" task distributions.

Table 1: We compare agents trained with random curricula on different but symmetric task distributions p(τ). Changing the distribution leads to counter-intuitive drops in performance on tasks both in- and out-of-distribution. (Columns: p(τ); average return at θ = 0, θ = 30, θ = 60.)

Table 2: Evaluating tasks that are qualitatively similar, for example running at a heading offset from the starting heading by 30 degrees to the left or right, leads to different performances from the same algorithm. (Columns: p(τ); average return at θ = 180, θ = 330, θ = 300.)

As generalization across evaluation tasks is a difficult metric to characterize, due to the inherent issues in comparing methods, it is tempting to take the best-performing tasks, or to average across the whole range. However, as we show in the remaining sections, closer inspection of each of the above experiments sheds light on major issues with the evaluation approaches standard in the meta-RL community today.
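The symmetry argument behind Tables 1 and 2 can be stated as a simple check: an agent trained on a distribution and one trained on its mirror image should, in expectation, earn similar returns on mirrored headings. Below is a small sketch of such a check, with made-up return values purely to exercise the function.

```python
# Sketch of the symmetry probe behind Tables 1 and 2. Return values below
# are placeholders, not measured results.
import numpy as np

def symmetry_gap(returns_a, returns_b):
    """Mean absolute return difference between mirrored headings.

    returns_a: {heading_degrees: return} for an agent trained on p(tau);
    returns_b: the same for an agent trained on the mirrored distribution.
    """
    gaps = [abs(r - returns_b[(360 - t) % 360])
            for t, r in returns_a.items() if (360 - t) % 360 in returns_b]
    return float(np.mean(gaps)) if gaps else float("nan")

# A small gap would indicate the symmetric generalization we intuitively
# expect; Tables 1 and 2 suggest the gap can in fact be large.
a = {0: 150.0, 30: 140.0, 60: 120.0}
b = {0: 148.0, 330: 95.0, 300: 118.0}
print(symmetry_gap(a, b))  # mean over the mirrored headings 0, 30, 60
```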
Figure 8: Uniform sampling causes MAML to show bias towards certain tasks, with the effect compounded by instability when using "bad" task distributions, shown here for the shifted ranges in the 2D-Navigation-Dense environment.

In this section, we discuss intermediate and auxiliary results from each of our previous experiments, highlighting uninterpretable algorithm bias, meta-overfitting, and performance benchmarking in meta-RL.

To readers surprised by the poor generalization capabilities of MAML on the simple task seen in Figure 2, we offer Figure 8, an unfiltered look at each seed used to compute each image in Figure 2. What we immediately notice is the high variance in all but the standard variant of the task, an agent trained on goals with coordinates sampled from g_t ∼ [−0.5, 0.5]. We even see a recurring bias towards certain tasks (visualized as the top-left of the grid). Interestingly, when changing the uniform sampling to a learned curriculum, we no longer see such high variance in convergence across tasks. While our results seem in opposition to many works in the meta-reinforcement learning area, we restate that in our setting, we can only evaluate the final policy, as the notion of best-performing loses much of its meaning when evaluating for generalization.

Many works in the meta-reinforcement learning setting focus on final adaptation performance, but few focus on the loss of performance after the adaptation step. Coined negative adaptation by [Deleu and Bengio, 2018], the definition is simple: the loss in performance after a gradient step at meta-test time. Negative adaptation occurs when a pre-adaptation policy has overfit to a particular task during meta-training; at meta-test time, an additional gradient step degrades performance.

Figure 9: When we correlate the final performance with the quality of adaptation, we see a troubling trend. MAML seems to overfit to certain tasks, with many tasks that were already neglected during training showing worse post-adaptation returns.

We extensively evaluate negative adaptation in the
Humanoid-Direc-2D benchmark described in Section 6.2.2, providing correlation results between performance and the difference between the post- and pre-adaptation performance. When we systematically evaluate negative adaptation across all tested
Humanoid-Direc-2D training distributions, we notice an interesting correlation between performance and the amount of negative adaptation. Both methods produce near-linear relationships between the two quantities, but when evaluating generalization, we need to focus on the left-hand side of the x-axis, where policies are already performing poorly, and on the qualitative effects of extra gradient steps. We notice a characteristic sign of meta-overfitting: strongly performing policies continue to perform well, but poorly performing ones stagnate or, more often, degrade in performance. When tested, Meta-ADR does not help in this regard, despite having slightly stronger final performance on tasks.
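As a concrete illustration of the probe behind Figure 9, one can pair each evaluation task's post-adaptation return with its adaptation delta and measure the relationship; the arrays below are made-up placeholders, not our measured returns.

```python
# Sketch of the correlation probe in Figure 9: per-task post-adaptation
# return vs. adaptation delta (post - pre). Values are placeholders.
import numpy as np

post = np.array([90.0, 120.0, 60.0, 150.0, 40.0])  # post-adaptation returns
pre = np.array([85.0, 100.0, 70.0, 130.0, 55.0])   # pre-adaptation returns
delta = post - pre                                  # < 0 => negative adaptation

print("tasks with negative adaptation:", int(np.sum(delta < 0)))
print("corr(final performance, adaptation delta):",
      np.corrcoef(post, delta)[0, 1])
```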
Most alarmingly, we present results on the evaluation of tasks, particularly in the locomotion velocity tasks. As noted in [Deleu and Bengio, 2018], we see a characteristic pyramid when evaluating the generalization of locomotion agents trained to achieve a target velocity. However, in many papers concerning meta-RL, we see monotonic growth curves in performance on such environments. In Figure 10, we show the issues with reporting such curves.

Figure 10: When individually plotting each target velocity, we see a strong bias towards the easier variants of the task. Only the easier variants of the task produce monotonically increasing learning curves.

When we sample uniformly from the same distribution that we train on, the evaluation sampling can highly affect our results. We evaluate the training curves of the MAML agent and see that only those closest to the easiest variant of the task (v_t = 0, where an agent must stand still) produce monotonically increasing learning curves. More interestingly, target velocities closer to zero but out-of-distribution (OOD) show better performance than larger target velocities that are in-distribution. As the deviation away from a target velocity of 0 becomes larger, the learning curves stagnate or even start to degrade.

Unlike the previous two sections, which suggest interesting research directions, such as why MAML shows bias towards certain tasks and how we can fix negative adaptation, our analysis here points towards a fundamental flaw in the task design of the target velocity locomotion tasks commonly used in meta-RL benchmarking.
We present Meta-ADR, a curriculum learning algorithm suitable for helping gradient-based meta-learners generalize better in a meta-reinforcement learning setting. We show a strong dependence between the performance of MAML and the correct task distribution. When switching out only the random sampling of tasks for such a learned curriculum, we show strong performance across a variety of meta-RL benchmarks. From our experiments, we highlight issues with current meta-RL benchmarking, focusing on the need for generalization evaluation, a proper, exclusive train-test task separation, and better evaluation tasks in general.
Acknowledgements
The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Québec (FQRNT), Calcul Québec, Compute Canada, the Canada Research Chairs, the Canadian Institute for Advanced Research (CIFAR), and Nvidia for donating a DGX-1 for computation. BM would like to thank Glen Berseth for helpful discussions on early drafts of this work, and IVADO for financial support.
References
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA, 2009. ACM.

T. Deleu and Y. Bengio. The effects of negative adaptation in Model-Agnostic Meta-Learning. CoRR, abs/1812.02159, 2018. URL http://arxiv.org/abs/1812.02159.

T. Deleu and S. Guiroy. Reinforcement learning with Model-Agnostic Meta-Learning in PyTorch, 2018. URL https://github.com/tristandeleu/pytorch-maml-rl/.

Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. 2016. URL http://arxiv.org/abs/1611.02779.

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning, 2017a. URL http://arxiv.org/abs/1703.03400.

C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. Conference on Robot Learning, 2017b. URL https://arxiv.org/abs/1709.04905.

C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel. Reverse curriculum generation for reinforcement learning, 2017.

A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks, 2017.

A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine. Meta-reinforcement learning of structured exploration strategies, 2018.

S. Gurumurthy, S. Kumar, and K. Sycara. MAME: Model-agnostic meta-exploration, 2019.

N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver. Emergence of locomotion behaviours in rich environments, 2017.

G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. International Conference on Machine Learning, 2015.

Y. Liu, P. Ramachandran, Q. Liu, and J. Peng. Stein variational policy gradient, 2017.

B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull. Active domain randomization. CoRR, abs/1904.04762, 2019. URL http://arxiv.org/abs/1904.04762.

N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. International Conference on Learning Representations, 2018. URL https://arxiv.org/pdf/1707.03141.pdf.

M. Mozifian, J. C. G. Higuera, D. Meger, and G. Dudek. Learning domain randomization distributions for transfer of locomotion policies. CoRR, abs/1906.00410, 2019. URL http://arxiv.org/abs/1906.00410.

T. Munkhdalai and H. Yu. Meta networks. International Conference on Machine Learning, 2017. URL https://arxiv.org/abs/1703.00837.

A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms, 2018.

OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang. Solving Rubik's cube with a robot hand. ArXiv, abs/1910.07113, 2019.

L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning, 2017.

V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-Fit: State-covering self-supervised reinforcement learning, 2019.

K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. International Conference on Machine Learning, 2019. URL https://arxiv.org/abs/1903.08254.

S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. International Conference on Learning Representations, 2017.

J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel. ProMP: Proximal meta-policy search, 2018.

A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. International Conference on Machine Learning, 2016. URL https://arxiv.org/abs/1605.06065.

J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization, 2015. URL https://arxiv.org/abs/1502.05477.

J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. Conference on Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1703.05175.

B. C. Stadie, G. Yang, R. Houthooft, X. Chen, Y. Duan, Y. Wu, P. Abbeel, and I. Sutskever. Some considerations on learning to explore via meta-reinforcement learning, 2018.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017.

Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, and C. Dyer. Learning the curriculum with Bayesian optimization for task-specific word representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 130–139, Berlin, Germany, 2016. Association for Computational Linguistics.

O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. Conference on Neural Information Processing Systems, 2016. URL http://arxiv.org/abs/1606.04080.

J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. 2016.

R. Wang, J. Lehman, J. Clune, and K. O. Stanley. Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions, 2019.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, 1992.

T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019.

B. D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.