Model Primitive Hierarchical Lifelong Reinforcement Learning
Bohan Wu
Columbia University
Jayesh K. Gupta
Stanford University
Mykel J. Kochenderfer
Stanford University
ABSTRACT
Learning interpretable and transferable subpolicies and performing task decomposition from a single, complex task is difficult. Some traditional hierarchical reinforcement learning techniques enforce this decomposition in a top-down manner, while meta-learning techniques require a task distribution at hand to learn such decompositions. This paper presents a framework for using diverse suboptimal world models to decompose complex task solutions into simpler modular subpolicies. This framework performs automatic decomposition of a single source task in a bottom-up manner, concurrently learning the required modular subpolicies as well as a controller to coordinate them. We perform a series of experiments on high-dimensional continuous action control tasks to demonstrate the effectiveness of this approach at both complex single-task learning and lifelong learning. Finally, we perform ablation studies to understand the importance and robustness of different elements in the framework and limitations to this approach.
KEYWORDS
Reinforcement learning; Task decomposition; Transfer; Lifelong learning
ACM Reference Format:
Bohan Wu, Jayesh K. Gupta, and Mykel J. Kochenderfer. 2019. Model Primitive Hierarchical Lifelong Reinforcement Learning. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 9 pages.
INTRODUCTION
In the lifelong learning setting, we want our agent to solve a series of related tasks drawn from some task distribution rather than a single, isolated task. Agents must be able to transfer knowledge gained in previous tasks to improve performance on future tasks. This setting is different from multi-task reinforcement learning [25, 27, 31] and various meta-reinforcement learning settings [7, 8], where the agent jointly trains on multiple task environments. Not only do such non-incremental settings make the problem of discovering common structures between tasks easier, they also allow the methods to ignore the problem of catastrophic forgetting [16], which is the inability to solve previous tasks after learning to solve new tasks in a sequential learning setting.

Our work takes a step towards solutions for such incremental settings. We draw on the idea of modularity [17]. While learning to perform a complex task, we force the agent to break its solution down into simpler subpolicies instead of learning a single monolithic policy. This decomposition allows our agent to rapidly learn another related task by transferring these subpolicies.
We hypothesize that many complex tasks are heavily structured and hierarchical in nature. The likelihood of transfer of an agent's solution increases if it can capture such shared structure.

A key ingredient of our proposal is the idea of world models [10, 12, 14]: transition models that can predict future sensory data given the agent's current actions. The world, however, is complex, and learning models that are consistent enough to plan with is not only hard [24], but planning with such one-step models is suboptimal [11]. We posit that the requirement that these world models be good predictors of the world state is unnecessary, provided we have a multiplicity of such models. We use the term model primitives to refer to these suboptimal world models. Since each model primitive is only relatively better at predicting the next states within a certain region of the environment space, we call this area the model primitive's region of specialization.

Model primitives allow the agent to decompose the task being performed into subtasks according to their regions of specialization and learn a specialized subpolicy for each subtask. The same model primitives are used to learn a gating controller that selects, improves, adapts, and sequences the various subpolicies to solve a given task, in a manner very similar to a mixture-of-experts framework [15].

Our framework assumes that at least a subset of model primitives are useful across a range of tasks and environments. This assumption is less restrictive than that of successor representations [3, 5]. Even though successor representations decouple the state transitions from the rewards (representing the task or goals), the transitions learned are policy dependent and can only transfer across tasks with the same environment dynamics.

There are alternative approaches to learning hierarchical spatio-temporal decompositions from the rewards seen while interacting with the environment. These approaches include meta-learning algorithms like Meta-learning Shared Hierarchies (MLSH) [8], which require a multiplicity of pretrained subpolicies and joint training on related tasks. Other approaches include the option-critic architecture [1], which allows learning such decompositions in a single task environment. However, this method requires regularization hyperparameters that are tricky to set, and as observed by Vezhnevets et al. [30], its learning often collapses to a single subpolicy. Moreover, we posit that capturing the shared structure across task-environments can be more useful in the context of transfer for lifelong learning than reward-based, task-specific structures.

To summarize our contributions:
• Given diverse suboptimal world models, we propose a method to leverage them for task decomposition.
• We propose an architecture to jointly train decomposed subpolicies and a gating controller to solve a given task.
• We demonstrate the effectiveness of this approach at both single-task and lifelong learning in complex domains with high-dimensional observations and continuous actions.

PRELIMINARIES
We assume the standard reinforcement learning (RL) formulation: an agent interacts with an environment to maximize the expected reward [23]. The environment is modeled as a Markov decision process (MDP), defined by ⟨S, A, R, T, γ⟩ with a state space S, an action space A, a reward function R : S × A → ℝ, a dynamics model T : S × A → Π(S), and a discount factor γ ∈ [0, 1). Here, Π(·) defines a probability distribution over a set. The agent acts according to stationary stochastic policies π : S → Π(A), which specify action choice probabilities for each state. Each policy π has a corresponding function Q^π : S × A → ℝ that defines the expected discounted cumulative reward for taking an action a from state s and following the policy π from that point onward.

Lifelong Reinforcement Learning: In a lifelong learning setting, the agent must interact with multiple tasks and successfully solve each of them. Adopting the framework from Brunskill and Li [4], in lifelong RL the agent receives S, A, an initial state distribution ρ ∈ Π(S), a horizon H, a discount factor γ, and an unknown distribution D over reward-transition function pairs. The agent samples (R_i, T_i) ∼ D and interacts with the MDP ⟨S, A, R_i, T_i, γ⟩ for a maximum of H timesteps, starting according to the initial state distribution ρ. After solving the given MDP or after H timesteps, whichever occurs first, the agent resamples from D and repeats.

The fundamental question in lifelong learning is to determine what knowledge should be captured by the agent from the tasks it has already solved so that it can improve its performance on future tasks. When learning with function approximation, this translates to learning the right representation: the one with the right inductive bias for the tasks in the distribution. Given the assumption that the set of related tasks for lifelong learning share a lot of structure, the ideal representation should be able to capture this shared structure.

Thrun and Pratt [28] summarized various representation decomposition methods into two major categories. Modern approaches to avoiding catastrophic forgetting during transfer tend to fall into either category. The first category partitions the parameter space into task-specific parameters and general parameters [19]. The second category learns constraints that can be superimposed when learning a new function [13].

A popular approach within the first category is to use what Thrun and Pratt [28] term recursive functional decomposition. This approach assumes that the solution to a task can be decomposed into a function of the form f_i = h_i ∘ g, where h_i is task-specific whereas g is the same for all f_i. This scheme has been particularly effective in computer vision, where early convolutional layers in deep convolutional networks trained on ImageNet [6, 22] become a very effective g for a variety of tasks. However, this approach to decomposition often fails in deep RL for two main reasons. First, the gradients used to train such networks are noisier as a result of Monte Carlo sampling. Second, the i.i.d. assumption for training data often fails.

We instead focus on devising an effective piecewise functional decomposition of the parameter space, as defined by Thrun and Pratt [28]. The assumption behind this decomposition is that each function f_i can be represented by a collection of functions h_1, . . . , h_m, where m ≪ N, and N is the number of tasks to learn. Our hypothesis is that this decomposition is much more effective and easier to learn in RL.

Figure 1: Diagram of the MPHRL architecture. Solid arrows are active during both learning and execution. Dotted arrows are active only during learning.
MODEL PRIMITIVE HIERARCHICAL REINFORCEMENT LEARNING
This section outlines the Model Primitive Hierarchical Reinforcement Learning (MPHRL) framework (Figure 1) to address the problem of effective piecewise functional decomposition for transfer across a distribution of tasks.
The key assumption in MPHRL is access to several diverse world models of the environment dynamics. These models can be seen as instances of learned approximations to the true environment dynamics T. In reality, these dynamics can even be non-stationary. Therefore, the task of learning a complete model of the environment dynamics might be too difficult. Instead, it can be much easier to train multiple approximate models that specialize in different parts of the environment. We use the term model primitives to refer to these approximate world models.

Suppose we have access to K model primitives T̂_k : S × A → Π(S). For simplicity, we assign a label M_k to each T̂_k, such that their predictions of the environment's transition probabilities can be denoted by T̂(s_{t+1} | s_t, a_t, M_k). The goal of the MPHRL framework is to use these suboptimal predictions from different model primitives to decompose the task space into their regions of specialization, and to learn different subpolicies π_k : S → Π(A) that can focus on these regions. In the function approximation regime, each subpolicy π_k belongs to a fixed class of smoothly parameterized stochastic policies {π_{θ_k} | θ_k ∈ Θ}, where Θ is a set of valid parameter vectors.

Model primitives are suboptimal and make incorrect predictions about the next state. Therefore, we do not use them for planning or model-based learning of subpolicies directly. Instead, model primitives give rise to useful functional decompositions and allow subpolicies to be learned in a model-free way.

Taking inspiration from the mixture-of-experts literature [15], where the output from multiple experts can be combined using probabilistic gating functions, MPHRL decomposes the solution for a given task into multiple "expert" subpolicies and a gating controller that can compose them to solve the task. We want this switching behavior to be probabilistic and continuous to avoid abrupt transitions. During learning, we want this controller to help assign the reward signal to the correct blend of subpolicies to ensure effective learning as well as decomposition.

Since the gating controller's goal is to choose the subpolicy whose corresponding model primitive makes the best prediction for a given transition, using Bayes' rule we can write:

P(M_k | s_t, a_t, s_{t+1}) ∝ P(M_k | s_t) π_k(a_t | s_t) T̂(s_{t+1} | s_t, a_t, M_k)    (1)

because π_k(a_t | s_t) = π(a_t | s_t, M_k).

The agent only has access to the current state s_t during execution. Therefore, the agent needs to marginalize out s_{t+1} and a_t such that the model choice only depends on the current state s_t:

P(M_k | s_t) = ∫_{s_{t+1} ∈ S} ∫_{a_t ∈ A} P(M_k | s_t, a_t, s_{t+1}) P(s_{t+1}, a_t) da_t ds_{t+1}    (2)

This is equivalent to:

P(M_k | s_t) = E_{s_{t+1}, a_t ∼ P(s_{t+1}, a_t)} [ P(M_k | s_t, a_t, s_{t+1}) ]    (3)

Unfortunately, computing these integrals requires expensive Monte Carlo methods.
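For intuition, the following is a minimal sketch of the per-transition posterior in Eq. 1, assuming Gaussian model primitives of the kind used later in our experiments; the function and variable names are illustrative and not taken from our released code. The marginal in Eq. 2 would additionally require averaging this quantity over sampled (a_t, s_{t+1}) pairs.

```python
import numpy as np
from scipy.stats import multivariate_normal

def transition_posterior(prior, action_probs, primitive_dists, s_next):
    """Per-transition posterior P(M_k | s_t, a_t, s_{t+1}) from Eq. 1.

    prior           -- P(M_k | s_t), shape (K,)
    action_probs    -- pi_k(a_t | s_t) evaluated at the action taken, shape (K,)
    primitive_dists -- list of K (mean, cov) pairs parameterizing the Gaussian
                       T_hat(s_{t+1} | s_t, a_t, M_k) of each model primitive
    s_next          -- observed next state s_{t+1}
    """
    likelihoods = np.array([multivariate_normal.pdf(s_next, mean=m, cov=c)
                            for m, c in primitive_dists])
    unnormalized = prior * action_probs * likelihoods   # Bayes' rule, Eq. 1
    return unnormalized / unnormalized.sum()            # normalize over the K primitives
```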
However, we can use an approximate method to achieve the same objective with discriminative learning [18]. We parameterize the gating controller (GC) as a categorical distribution P_ϕ(M_k | s_t) = P(M_k | s_t; ϕ) and minimize the conditional cross-entropy loss between E_{s_{t+1}, a_t ∼ P(s_{t+1}, a_t)}[P(M_k | s_t, a_t, s_{t+1})] and P_ϕ(M_k | s_t) for all sampled transitions (s_t, a_t, s_{t+1}) in a rollout:

minimize_ϕ L_GC    (4)

where

L_GC = Σ_{s_t} Σ_k −( Σ_{s_{t+1}} Σ_{a_t} P(M_k | s_t, a_t, s_{t+1}) ) × log P(M_k | s_t; ϕ)    (5)

This is equivalent to an implicit Monte Carlo integration to compute the marginal if s_{t+1}, a_t ∼ P(s_{t+1}, a_t). Although we cannot query or sample from P(s_{t+1}, a_t) directly, s_t, a_t, and s_{t+1} can be sampled according to their respective distributions while we perform rollouts in the environment. Despite the introduced bias in our estimates, we find Eq. 4 sufficient for achieving task decomposition.

Taking inspiration from mixture-of-experts, the gating controller composes the subpolicies into a mixture policy:

π(a_t | s_t) = Σ_{k=1}^{K} P_ϕ(M_k | s_t) π_k(a_t | s_t)    (6)

During a rollout, the agent samples as follows:

a_t ∼ π(a_t | s_t)    (7)
s_{t+1} ∼ T(s_{t+1} | s_t, a_t)    (8)

The π_k from Eq. 1 gets coupled with this sampling distribution, making the target distribution in Eq. 5 no longer stationary and the approximation process difficult. We alleviate this issue by ignoring π_k, effectively treating it as a distribution independent of k. This transforms Eq. 1 into:

P̂(M_k | s_t, a_t, s_{t+1}) ∝ P(M_k | s_t) T̂(s_{t+1} | s_t, a_t, M_k)    (9)

Since the focus of this work is on difficult continuous action problems, we mostly concentrate on the issue of policy optimization and how it integrates with the gating controller. The standard policy (SP) optimization objective is:

maximize_θ L_SP = E_{ρ, π_θ}[ π_θ(a_t | s_t) Q^{π_θ}(s_t, a_t) ]    (10)

With baseline subtraction for variance reduction, this becomes [20]:

maximize_θ L_PG = E_{ρ, π_θ}[ π_θ(a_t | s_t) Â_t ]    (11)

where Â_t is an estimator of the advantage function [2].

In MPHRL, we directly use the mixture policy as defined by Eq. 6. The standard policy gradients (PG) get weighted by the probability outputs of the gating controller, enforcing the required specialization by factorizing into:

ĝ_k = E_{ρ, π_{θ_k}}[ P_ϕ(M_k | s_t) ∇_{θ_k} log π_{θ_k}(a_t | s_t) Â_t ]    (12)

In practice, we use the Clipped PPO objective [21] instead to perform stable updates by limiting the step size. This includes adding a baseline estimator (BL) parameterized by ψ for value prediction and variance reduction. We optimize ψ according to the following loss:

L_BL = E[ ‖V_ψ − V^{π_θ}‖² ]    (13)

We summarize this single-task learning algorithm in Algorithm 1, which results in a set of decomposed subpolicies, π_{θ_1}, . . . , π_{θ_K}, and a gating controller P_ϕ that can modulate between them to solve the task under consideration.
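To make these quantities concrete, the numpy sketch below computes the decoupled gating targets of Eq. 9 for a batch of transitions, the cross-entropy loss of Eq. 5, the mixture policy of Eq. 6, and the gating-weighted gradient terms of Eq. 12. It is illustrative only: the names are ours, and the released implementation uses neural networks together with the Clipped PPO objective rather than the vanilla estimators shown here.

```python
import numpy as np

def gating_targets(gate_probs, model_likelihoods):
    """Decoupled targets P_hat(M_k | s_t, a_t, s_{t+1}) from Eq. 9.

    gate_probs        -- P_phi(M_k | s_t) for each transition, shape (T, K)
    model_likelihoods -- T_hat(s_{t+1} | s_t, a_t, M_k), shape (T, K)
    """
    unnorm = gate_probs * model_likelihoods
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def gating_cross_entropy(gate_probs, targets):
    """Conditional cross-entropy loss L_GC from Eq. 5 (averaged over the batch)."""
    return -(targets * np.log(gate_probs + 1e-8)).sum(axis=1).mean()

def mixture_action_probs(gate_probs, subpolicy_probs):
    """Mixture policy pi(a_t | s_t) from Eq. 6, evaluated at the sampled actions."""
    return (gate_probs * subpolicy_probs).sum(axis=1)

def weighted_pg_terms(gate_probs, log_prob_grads, advantages):
    """Per-subpolicy gradient estimate g_hat_k from Eq. 12.

    log_prob_grads -- grad_theta_k log pi_theta_k(a_t | s_t), shape (T, K, P)
    advantages     -- advantage estimates A_hat_t, shape (T,)
    """
    weights = gate_probs * advantages[:, None]                    # (T, K)
    return (weights[:, :, None] * log_prob_grads).mean(axis=0)    # (K, P)
```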
Algorithm 1 MPHRL: single-task learning
  Initialize P_ϕ, π_θ = {π_{θ_1}, . . . , π_{θ_K}}, V_ψ
  while not converged do
    Rollout trajectories τ ∼ π_θ,ϕ
    Compute advantage estimates Â_τ
    Optimize L_PG wrt θ_1, . . . , θ_K with expectations taken over τ
    Optimize L_BL wrt ψ with expectations taken over τ
    Optimize L_GC wrt ϕ with expectations taken over τ

Lifelong learning: We have shown how MPHRL can decompose a single complex task solution into different functional components. Complex tasks often share structure and can be decomposed into similar sets of subtasks. Different tasks, however, require different recompositions of similar subtasks. Therefore, we transfer the subpolicies to learn target tasks, but not the gating controller or the baseline estimator. We summarize the lifelong learning algorithm in Algorithm 2, with the global variable
RESET set to true.
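Schematically, the transfer pattern of Algorithm 2 can be written as below. This is a hedged sketch with placeholder initializers and a stub for the inner single-task loop; none of the names come from our released code. Subpolicy parameters persist across tasks, while the gating controller and the baseline are re-initialized whenever RESET is true.

```python
import numpy as np

def train_single_task(task, subpolicies, gate, baseline):
    """Stand-in for Algorithm 1: rollouts, then PG, baseline, and gating updates."""
    pass  # see the single-task sketch above

def lifelong_learning(tasks, num_subpolicies=4, obs_dim=8, act_dim=2,
                      reset=True, seed=0):
    rng = np.random.default_rng(seed)
    # Subpolicy parameters theta_1, ..., theta_K persist across every task.
    subpolicies = [0.01 * rng.standard_normal((obs_dim, act_dim))
                   for _ in range(num_subpolicies)]
    gate = 0.01 * rng.standard_normal((obs_dim, num_subpolicies))
    baseline = 0.01 * rng.standard_normal((obs_dim, 1))
    for task in tasks:                      # (R_i, T_i) ~ D, presented sequentially
        if reset:
            # The gating controller (phi) and baseline (psi) are task-specific.
            gate = 0.01 * rng.standard_normal((obs_dim, num_subpolicies))
            baseline = 0.01 * rng.standard_normal((obs_dim, 1))
        train_single_task(task, subpolicies, gate, baseline)
    return subpolicies
```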
Algorithm 2 MPHRL: lifelong learning
  Initialize P_ϕ, π_θ = {π_{θ_1}, . . . , π_{θ_K}}, V_ψ
  for tasks (R_i, T_i) ∼ D do
    if RESET then
      Initialize P_ϕ, V_ψ
    while not converged do
      Rollout trajectories τ ∼ π_θ,ϕ
      Compute advantage estimates Â_τ
      Optimize L_PG wrt θ_1, . . . , θ_K with expectations taken over τ
      Optimize L_BL wrt ψ with expectations taken over τ
      Optimize L_GC wrt ϕ with expectations taken over τ

EXPERIMENTS
Our experiments aim to answer two questions: (a) can model primitives ensure task decomposition? (b) does such decomposition improve transfer for lifelong learning?

We evaluate our approach in two challenging domains: a MuJoCo [29] ant navigating different mazes and a Stacker [26] arm picking up and placing different boxes. In our experiments, we use subpolicies that have Gaussian action distributions, with the mean given by a multi-layer perceptron taking observations as input and the standard deviations given by a different set of parameters. MPHRL's gating controller outputs a categorical distribution and is parameterized by another multi-layer perceptron. We also use a separate multi-layer perceptron for the baseline estimator. We use the standard PPO algorithm as a baseline to compare against MPHRL. Transferring network weights empirically led to worse performance for standard PPO; hence, we re-initialize its weights for every task. For fair comparison, we also shrink the hidden layer size of MPHRL's subpolicy networks from 64 to 16. We conduct each experiment across 5 different seeds. Error bars represent the standard deviation from the mean.

The focus of this work is on understanding the usefulness of model primitives for task decomposition and the resulting improvement in sample efficiency from transfer. To conduct controlled experiments with interpretable results, we hand-designed model primitives using the true next state provided by the environment simulator. Concretely, we apply distinct multivariate Gaussian noise models with covariance σ_k Σ to the true next state. We then sample from this distribution to obtain the mean of the probability distribution of a model primitive's next state prediction, using Σ as its covariance. Here, σ_k is the noise scaling factor that distinguishes model primitives, while Σ refers to the empirical covariance of the sampled next states:

μ ∼ N(s_{t+1}, σ_k Σ)    (14)
T̂(s_{t+1} | s_t, a_t, M_k) = N(μ, σ_k Σ)    (15)

Using Σ as opposed to a constant covariance is essential for controlled experiments because different elements of the observation space have different orders of magnitude. Sampling μ from a distribution effectively adds random bias to the model primitive's next-state probability distribution.

Hyperparameter details are in Table 1, and our code is freely available at http://github.com/sisl/MPHRL.
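Returning to the hand-designed model primitives, here is a minimal sketch of Eqs. 14 and 15 (with illustrative names, assuming access to the simulator's true next state; the σ_k = 0 branch corresponds to a primitive inside its region of specialization, where the prediction collapses to the true next state):

```python
import numpy as np

def hand_designed_primitive(true_next_state, sigma_k, emp_cov, rng):
    """Next-state distribution of a hand-designed model primitive (Eqs. 14-15).

    true_next_state -- s_{t+1} taken from the environment simulator
    sigma_k         -- noise scaling factor (0 inside the primitive's region of
                       specialization, e.g. 0.5 outside)
    emp_cov         -- empirical covariance Sigma of the sampled next states
    Returns the (mean, covariance) of T_hat(s_{t+1} | s_t, a_t, M_k).
    """
    cov = sigma_k * emp_cov
    if sigma_k == 0:
        # Degenerate case: the primitive predicts the true next state exactly.
        return true_next_state, cov
    mu = rng.multivariate_normal(true_next_state, cov)   # Eq. 14: adds random bias
    return mu, cov                                        # Eq. 15
```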
Table 1: Hyperparameters for MPHRL and baseline PPO

Category                        Hyperparameter                  Value
Num. model primitives (Maze)    L-Maze                          2
                                D-Maze                          4
                                Standard 10-Maze                4
                                H-V Corridors                   2
                                Velocity                        2
                                Extra                           5
Num. model primitives (8-P&P)   Standard 8-P&P                  12
                                Box-only                        2
                                Action-only                     6
Gating controller network       Hidden layers                   2
                                Hidden dimension                64
Gating controller               Single / Source (Maze)          1 × 10⁻
base learning rate              Single / Source (8-P&P)         3 × 10⁻
                                Target tasks                    3 × 10⁻
Gating controller               Single / Source                 1
num. epochs / batch             Target tasks                    10
Baseline and model              Hidden layers                   2
primitive networks              Hidden dimension                64
                                Base learning rate              3 × 10⁻
Subpolicy networks              Hidden layers                   2
                                Hidden dimension (MPHRL)        16
                                Hidden dimension (PPO)          64
                                Base learning rate              3 × 10⁻
Optimization                    Num. actors (Maze)              16
                                Num. actors (8-P&P)             24
                                Batch size / actor (Maze)       2048
                                Batch size / actor (8-P&P)      1536
                                Max. timesteps / task           3 ×
                                Minibatch size / actor          256
                                Num. epochs / batch
                                Discount factor (γ)             0.99
                                GAE parameter (λ)               0.95
                                PPO clipping coeff. (ε)         0.2
                                Gradient clipping               None
                                VF coeff. (c₁)                  1.0
                                Entropy coeff. (c₂)             0
                                Optimizer                       Adam

Notes: Single task refers to L-Maze and D-Maze; source and target tasks refer to the first task and all subsequent tasks in a lifelong learning taskset, respectively. Baseline network hyperparameters apply to both MPHRL and baseline PPO; model primitive networks are for experiments with learned model primitives only. The baseline PPO has no subpolicies, so the subpolicy network is the policy network. Baseline and subpolicy networks only.

First, we focus on two single-task learning experiments where MPHRL learns a number of interpretable subpolicies to solve a single task. Both the L-Maze and D-Maze (Figure 2a) tasks require the ant to learn to walk and reach the green goal within a finite horizon.
Figure 2: Single-task learning. (a) L-Maze (top) and D-Maze (bottom). (b) Performance of MPHRL vs. baseline PPO, measured in timesteps to solve each task.
For both tasks, both the goal and the initial ant locations are fixed. For the L-Maze, the agent has access to two model primitives, one specializing in the horizontal (E, W) corridor and the other specializing in the vertical (N, S) corridor of the maze. Similarly, for the D-Maze, the agent has access to four model primitives, one specializing in each of the N, S, E, and W corridors of the maze. In their specialized corridors, the noise scaling factor is σ = 0; outside of their regions of specialization, σ = 0.5. The observation space includes the standard joint angles and positions, lidar information tracking distances from walls on each side, and the Manhattan distance to the goal. Figure 2b shows the experimental results on these environments. Notice that using model primitives can make the learning problem more difficult and increase the sample complexity on a single task. This is expected, since we are forcing the agent to decompose the solution, which could be unnecessary for easy tasks. However, we will observe in the following section that this decomposition can lead to remarkable improvements in transfer performance during lifelong learning.
To evaluate our framework's performance at lifelong learning, we introduce two tasksets.
To evaluate MPHRL's performance in lifelong learning, we generate a family of 10 random mazes for the MuJoCo Ant environment, referred to as the 10-Maze taskset (Figure 4) hereafter. The goal, the observation space, the Gaussian noise models, and the model primitives remain the same as in D-Maze. The agent has a fixed maximum number of timesteps per task (Table 1) to reach an 80% success rate in each of the 10 tasks. As shown in Figure 3a, MPHRL requires nearly double the number of timesteps to learn the decomposed subpolicies in the first task. However, this cost gets heavily amortized over the entire taskset, with MPHRL taking half the total number of timesteps of the baseline PPO, exhibiting strong subpolicy transfer.

Figure 3: MPHRL vs. baseline PPO for lifelong learning, measured in timesteps per task and in total. (a) 10-Maze. (b) 8-Pickup&Place.

We modify the Stacker task [26] to create the 8-Pickup&Place taskset. As shown in Figure 5, a robotic arm is tasked to bring 2 boxes to their respective goal locations in a certain order.
Marked by the colors red, green, and blue, the goal locations reside within two short walls forming a "stack". Each of the 8 tasks has a maximum of 3 goal locations. The observation space of the agent includes joint angles and positions, box and goal locations, their relative distances to each other, and the current stage of the task encoded as one-hot vectors. The agent has access to six model primitives for each box that specialize in reaching above, lowering to, grasping, picking up, carrying, and dropping a certain box. Similar to 10-Maze, model primitives have σ of 0 within their specialized stages and σ of 0.5 otherwise. Figure 3b shows MPHRL's experimental performance, learning twelve useful subpolicies for this taskset. We notice again the strong transfer performance due to the decomposition forced by the model primitives. Note that this taskset is much more complex than 10-Maze, such that MPHRL even accelerates the learning of the first task.

We conduct ablation experiments to answer the following questions:
(1) How much gain in sample efficiency is achieved by transferring subpolicies?
(2) Can MPHRL learn the task decomposition even when the model primitives are quite noisy or when the source task does not cover all "cases"?
(3) When does MPHRL fail to decompose the solution?
(4) What kind of diversity in the model primitives is essential for performance?
(5) When does MPHRL lead to negative transfer?
(6) Is MPHRL's gain in sample efficiency a result of hand-crafted model primitives, and how does it perform with actual learned model primitives?
MPHRL has the ability to decompose the solution even given bad model primitives. Since the learning is done model-free, these suboptimal model primitives should not strongly affect the learning performance so long as they remain sufficiently distinct. To investigate the limitations of this claim, we conduct five experiments using various sets of noisy model primitives. Below, the first value corresponds to the noise scaling factor σ within their individual regions of specialization, while the second value corresponds to σ outside of their regions of specialization.
(a) 0.4 and 0.5: good models with limited distinction
(b) 0.5 and 1.0: good models with reasonable distinction

Figure 4: 10-Maze lifelong learning taskset (panels (a)–(j) show the ten mazes).

Figure 5: 8-Pickup&Place lifelong learning taskset. B1 and B2 refer to Box 1 (black) and Box 2 (white); T1, T2, and T3 refer to Target 1 (red), Target 2 (green), and Target 3 (blue). The eight tasks are: (a) B1→T1, B2→T2; (b) B2→T1, B1→T2, B1→T3; (c) B2→T1, B2→T2; (d) B1→T1, B1→T2, B2→T3; (e) B1→T1, B2→T2, B2→T3; (f) B1→T1, B1→T2, B1→T3; (g) B2→T1, B1→T2; (h) B2→T1, B1→T2, B1→T3.

Figure 6: 10-Maze MPHRL ablations. (a) Effect of noisy model primitives. (b) Effect of model primitive confusion at the corners (no confusion vs. confusion). (c) Effect of gating controller and subpolicy transfer (target tasks only; gating controller and subpolicy transfer vs. subpolicy transfer only vs. no gating controller or subpolicy transfer).
Table 2: Effect of suboptimal model primitive types (N/A indicates failure to solve the task within the maximum number of timesteps). Rows report, for each model primitive set (10-Maze: Extra, H-V Corridors, Velocity; 8-Pickup&Place: Box-only, Action-only), the timesteps needed to reach a target average success rate of 80% (10-Maze) or 75% (8-Pickup&Place) on tasks 1–10 and in total.

Table 3: 10-Maze: effect of experience.
(Timesteps to reach 80% average success rate on each previously seen task; rows correspond to subpolicies trained sequentially on 10 tasks and on 6 tasks.)
Even model primitives with σ as large as 20 outside their regions of specialization show little deterioration in performance. We next test the condition where there is substantial overlap in regions of specialization between different model primitives. For the 10-Maze taskset, the most plausible region for this confusion is at the corners. In this experiment, within each corner, the two model primitives whose specialized corridors share the corner both have σ = 0.5. Figure 6b shows the performance for model primitive confusion against the standard set of model primitives with no confusion. We observe that despite some performance degradation, MPHRL continues to outperform the PPO baseline.
Having tested MPHRL against noisy model primitives, we also experimented with undesirable model primitives. For 10-Maze:
(a) Extra: a fifth model primitive that specializes in states where the ant is moving horizontally;
(b) H-V corridors: 2 model primitives specializing in horizontal (E, W) and vertical (N, S) corridors, respectively;
(c) Velocity: 2 model primitives specializing in states where the ant is moving horizontally or vertically;
and for 8-Pickup&Place:
(a) Box-only: 2 model primitives for all actions on the 2 boxes;
(b) Action-only: 6 model primitives for the 6 actions performed on boxes: reach above, lower to, grasp, pick up, carry, and drop.
Table 2 shows that MPHRL is susceptible to performance degradation given undesirable sets of model primitives. However, MPHRL still outperforms baseline PPO when given an extra, undesirable model primitive. This indicates that for best transfer, the model primitives need to approximately capture the structure present in the taskset.
Lifelong learning agents with neural network function approximators face the problem of negative transfer and catastrophic forgetting. Ideally, they should find the solution quickly if the task has already been seen. More generally, given two sets of tasks T and T′ such that T ⊂ T′, after being exposed to T′ the agent should perform no worse, and preferably better, than had it been exposed to T only.

In this experiment, we restore the subpolicy checkpoints after solving the 10 tasks and evaluate MPHRL's learning performance on the first 9 tasks. Similarly, we restore the subpolicy checkpoints after solving 6 tasks and evaluate MPHRL's performance on the first 5 tasks.

Figure 7: Average rewards of MPHRL when using an oracle gating controller. The reward threshold for reaching 80% success rate on the first task is approximately 800.
Figure 8: 10-Maze-v2: partial decomposition and learned model primitives (learned model primitives vs. hand-designed model primitives vs. baseline PPO).

The gating controller is reset for each task as in earlier experiments. We summarize the results in Table 3. Subpolicies trained sequentially on 6 or 10 tasks quickly relearn the required behavior for all previously seen tasks, implying no catastrophic forgetting. Moreover, if we compare the 10-task result to the 6-task result, we see remarkable improvements at transfer. This implies negative transfer is limited with this approach.

One might suspect that all gains in sample efficiency come from hand-crafted model primitives because they allow the agent to learn a perfect gating controller. However, Figure 7 shows the reward curves for an experiment where the gating controller is already perfectly known. This setup is unable to learn any 10-Maze task. Since the 10-Maze taskset is composed of sequential subtasks, only one subpolicy will be learned in the first corridor when the gating controller is perfect. When transitioning to the second corridor, the second subpolicy needs to be learned from scratch, making the ant's survival rate very low. This discourages the first subpolicy from entering the second corridor and activating the second subpolicy. Eventually, the ant stops moving forward close to the intersection between the first two corridors. In contrast, MPHRL's natural curriculum for gradual specialization allows multiple subpolicies to learn the basic skills for survival initially.

Figure 9: 10-Maze-v2 lifelong learning taskset.

Figure 10: Three corridor environments for learning the "N" model primitive.
To confirm that the ordering of tasks does not significantly affect MPHRL's performance, we modified 10-Maze to create the 10-Maze-v2 taskset (Figure 9), in which the source task does not allow for complete decomposition into all useful subpolicies for the subsequent tasks. Again, we observe a large improvement in sample efficiency over standard PPO (Figure 8).
This paper focuses on evaluating suboptimal models for task decomposition in controlled experiments using hand-designed model primitives. Here, we show one way to obtain each model primitive for 10-Maze-v2 using the three corridor environments shown in Figure 10. Concretely, we parameterize each model primitive using a multivariate Gaussian distribution. We learn the mean of this distribution via a multi-layer perceptron, using a weighted mean squared error in dynamics prediction as the loss. The standard deviation is still derived from the empirical covariance Σ as described earlier. Even though the diversity in these learned model primitives is much more difficult to quantify and control, their sample efficiency substantially outperforms standard PPO and only slightly underperforms hand-designed model primitives with 0 and 0.5 model noise (Figure 8).
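A minimal sketch of how such a model primitive could be fit is shown below. It is illustrative only: the network sizes, learning rate, and per-dimension weights are placeholders rather than our exact settings. A small MLP regresses the next state under a weighted mean squared error, and the fixed empirical covariance Σ supplies the spread of the predicted Gaussian.

```python
import torch
import torch.nn as nn

def fit_model_primitive(states, actions, next_states, weights, epochs=200):
    """Fit the mean of one learned model primitive on corridor rollout data.

    states, actions, next_states -- float tensors collected in a corridor
                                    environment such as those in Figure 10
    weights -- per-dimension weights for the weighted mean squared error
    Returns the mean network and the empirical covariance Sigma used as the
    covariance of T_hat(s_{t+1} | s_t, a_t, M_k).
    """
    inputs = torch.cat([states, actions], dim=-1)
    mlp = nn.Sequential(
        nn.Linear(inputs.shape[-1], 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, next_states.shape[-1]),
    )
    optimizer = torch.optim.Adam(mlp.parameters(), lr=3e-4)
    for _ in range(epochs):
        loss = (weights * (mlp(inputs) - next_states) ** 2).mean()  # weighted MSE
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    sigma = torch.cov(next_states.T)   # empirical covariance of the next states
    return mlp, sigma
```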
To explore factors that lead to negative transfer, we tested MPHRL without re-initializing the gating controller in target tasks, as shown in Figure 6c. Although the mean sample efficiency remains stable, its standard deviation increases dramatically, indicating volatility due to negative transfer.

To measure how much of MPHRL's gain in sample efficiency comes from transferring subpolicies alone, we conducted a 10-Maze experiment in which all network weights are re-initialized for every new task. As shown in Figure 6c, sample complexity more than quintuples when subpolicies are re-initialized (in green).
To validate using P̂(M_k | s_t, a_t, s_{t+1}) from Eq. 9 as opposed to P(M_k | s_t, a_t, s_{t+1}) from Eq. 1, we tested MPHRL with Eq. 1 on 10-Maze. All runs with different seeds failed to solve the first 5 tasks (Table 4). As the gating controller is re-initialized during transfer, most actions were chosen incorrectly. The gating controller is thus presented with an incorrect cross-entropy target, which worsens the action distribution. The resulting vicious cycle forces the gating controller to converge to a suboptimal equilibrium against the incorrect target.

Table 4: 10-Maze: effect of coupling between the cross-entropy target and the action distribution.
(Timesteps to reach 80% average success rate on tasks 1–5.)

CONCLUSION
We showed how imperfect world models can be used to decompose a complex task into simpler ones. We introduced a framework that uses these model primitives to learn piecewise functional decompositions of solutions to complex tasks. The learned decomposed subpolicies can then be used to transfer to a variety of related tasks, reducing the overall sample complexity required to learn complex behaviors. Our experiments showed that such structured decomposition avoids negative transfer and catastrophic interference, a major concern for lifelong learning systems.

Our approach does not require access to accurate world models. Neither does it need a well-designed task distribution or the incremental introduction of individual tasks. So long as the set of model primitives is useful across the task distribution, MPHRL is robust to other imperfections.

Nevertheless, learning useful and diverse model primitives, subpolicies, and the task decomposition simultaneously is left for future work. The recently introduced Neural Processes [9] can potentially be an efficient approach to build upon.
ACKNOWLEDGMENTS
We are thankful to Kunal Menda and everyone at SISL for useful comments and suggestions. This work is supported in part by DARPA under agreement number D17AP00032. The content is solely the responsibility of the authors and does not necessarily represent the official views of DARPA. We are also grateful for the support from Google Cloud in scaling our experiments.
REFERENCES
[1] Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture. In AAAI Conference on Artificial Intelligence (AAAI). 1726–1734.
[2] L. C. Baird. 1994. Reinforcement Learning in Continuous Time: Advantage Updating. In IEEE International Conference on Neural Networks (ICNN), Vol. 4. 2448–2453.
[3] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. 2017. Successor Features for Transfer in Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS). 4055–4065.
[4] Emma Brunskill and Lihong Li. 2014. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. In International Conference on Machine Learning (ICML). 316–324.
[5] Peter Dayan. 1993. Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation 5, 4 (July 1993), 613–624.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-scale Hierarchical Image Database. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). 248–255.
[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning (ICML). 1126–1135.
[8] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. 2018. Meta Learning Shared Hierarchies. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=SyX0IeWAW
[9] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. 2018. Neural Processes. CoRR abs/1807.01622 (July 2018). arXiv:cs.LG/1807.01622
[10] D. Ha and J. Schmidhuber. 2018. World Models. CoRR abs/1803.10122 (2018). arXiv:cs.AI/1803.10122 https://worldmodels.github.io
[11] G. Zacharias Holland, Erik Talvitie, and Michael Bowling. 2018. The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces. CoRR abs/1806.01825 (June 2018). arXiv:cs.AI/1806.01825
[12] Georg B. Keller, Tobias Bonhoeffer, and Mark Hübener. 2012. Sensorimotor Mismatch Signals in Primary Visual Cortex of the Behaving Mouse. Neuron 74, 5 (2012), 809–815. https://doi.org/10.1016/j.neuron.2012.03.040
[13] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences.
[14] Neuron 95, 6 (2017), 1420–1432. https://doi.org/10.1016/j.neuron.2017.08.036
[15] Saeed Masoudnia and Reza Ebrahimpour. 2014. Mixture of Experts: A Literature Survey. Artificial Intelligence Review 42, 2 (Aug. 2014), 275–293.
[16] Michael McCloskey and Neal J. Cohen. 1989. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation, Gordon H. Bower (Ed.). Vol. 24. Academic Press, 109–165.
[17] Gerhard Neumann, Christian Daniel, Alexandros Paraschos, Andras Kupcsik, and Jan Peters. 2014. Learning Modular Policies for Robotics. Frontiers of Computational Neuroscience 8, 62 (June 2014), 1–32.
[18] Dan Rosenbaum and Yair Weiss. 2015. The Return of the Gating Network: Combining Generative Models and Discriminative Training in Natural Image Priors. In Advances in Neural Information Processing Systems (NIPS). 2683–2691.
[19] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive Neural Networks. CoRR abs/1606.04671 (June 2016). arXiv:cs.LG/1606.04671
[20] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015. High-dimensional Continuous Control Using Generalized Advantage Estimation. CoRR abs/1506.02438 (June 2015). arXiv:cs.LG/1506.02438
[21] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347
[22] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In IEEE International Conference on Computer Vision (ICCV). 843–852.
[23] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
[24] Eric Talvitie. 2017. Self-Correcting Models for Model-Based Reinforcement Learning. In AAAI Conference on Artificial Intelligence (AAAI).
[25] Fumihide Tanaka and Masayuki Yamamura. 2003. Multitask Reinforcement Learning on the Distribution of MDPs. In IEEE International Symposium on Computational Intelligence in Robotics and Automation, Vol. 3. 1108–1113.
[26] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. 2018. DeepMind Control Suite. CoRR abs/1801.00690 (2018). arXiv:1801.00690
[27] Yee Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. 2017. Distral: Robust Multitask Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS). 4496–4506.
[28] Sebastian Thrun and Lorien Pratt. 1998. Learning to Learn: Introduction and Overview. In Learning to Learn, Sebastian Thrun and Lorien Pratt (Eds.). Springer, Boston, MA, 3–17. https://doi.org/10.1007/978-1-4615-5529-2_1
[29] Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A Physics Engine for Model-based Control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 5026–5033. https://doi.org/10.1109/IROS.2012.6386109
[30] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. 2017. FeUdal Networks for Hierarchical Reinforcement Learning. In International Conference on Machine Learning (ICML). 3540–3549.
[31] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. 2007. Multi-task Reinforcement Learning: A Hierarchical Bayesian Approach. In International Conference on Machine Learning (ICML).