Learning Compositional Neural Programs for Continuous Control
Thomas Pierrot
InstaDeep [email protected]
Nicolas Perrin
CNRS, Sorbonne Université [email protected]
Feryal Behbahani
DeepMind [email protected]
Alexandre Laterre
InstaDeep [email protected]
Olivier Sigaud
Sorbonne Université [email protected]
Karim Beguir
InstaDeep [email protected]
Nando de Freitas
DeepMind [email protected]
Abstract
We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.
Introduction

Deep reinforcement learning (RL) has advanced many control domains, including dexterous object manipulation [1, 4, 30], agile locomotion [45] and navigation [12, 17]. Despite these successes, several key challenges remain. Stuart Russell phrases one of these challenges eloquently in [37]: "Intelligent behavior over long time scales requires the ability to plan and manage activity hierarchically, at multiple levels of abstraction" and "the main missing piece of the puzzle is a method for constructing the hierarchy of abstract actions in the first place". If achieved, this capability would be "the most important step needed to reach human-level AI" [37]. This challenge is particularly daunting when temporally extended tasks are combined with sparse binary rewards. In this case the agent does not receive any feedback from the environment while having to decide on a complex course of action, and receives a non-zero reward only after having fully solved the task.

Low sample efficiency is another challenge. In the absence of demonstrations, model-free RL agents require many interactions with the environment to converge to a satisfactory policy [1]. We argue in this paper that both challenges can be addressed by learning models and compositional neural
programs. In particular, planning with a learned internal model of the world reduces the amount of necessary interactions with the environment. By imagining likely future scenarios, an agent can avoid making mistakes in the real environment and instead find a sound plan before acting on the environment.

Many real-world tasks are naturally decomposed into hierarchical structures. We hypothesize that learning a variety of skills which can be reused and composed to learn more complex skills is key to tackling long-horizon sparse reward tasks in a sample efficient manner. Such compositionality, formalised by hierarchical RL (HRL), enables agents to explore in a temporally correlated manner, improving sample efficiency by reusing previously trained lower-level skills. Unfortunately, prior studies in HRL typically assume that the hierarchy is given, or learn very simple forms of hierarchy in a model-free manner.

We propose a novel method, AlphaNPI-X, to learn programmatic policies which can perform hierarchical planning at multiple levels of abstraction in sparse reward continuous control problems. We first train low-level atomic policies that can be recomposed and re-purposed, represented by a single goal-conditioned neural network. We leverage off-policy reinforcement learning with hindsight experience replay [3] to train these efficiently. Next, we learn a transition model over the effects of these atomic policies, to imagine likely future scenarios, removing the need to interact with the real environment. Lastly, we learn recursive compositional programs, which combine low-level atomic policies at multiple levels of hierarchy, by planning over the learnt transition models, alleviating the need to interact with the environment. This is made possible by extending the AlphaNPI algorithm [33], which applies AlphaZero-style planning [43] in a recursive manner to learn recombinable libraries of symbolic programs.

We show that our agent can learn to successfully combine skills hierarchically to solve challenging robotic manipulation tasks through look-ahead planning, even in the absence of any further interactions with the environment and where powerful model-free baselines struggle to get off the ground. Videos of agent behaviour are available at: https://sites.google.com/view/alphanpix

Related Work

Our work is motivated by the central concern of constructing a hierarchy of abstract actions to deal with multiple challenging continuous control problems with sparse rewards. In addition, we want these actions to be reused and recomposed across different levels of hierarchy through planning, with a focus on sample efficiency. This brings several research areas together, namely multitask learning, hierarchical reinforcement learning (HRL) and model-based reinforcement learning (MBRL).

As we have shown, learning continuous control from sparse binary rewards is difficult because it requires the agent to find long sequences of continuous actions from very little information. Other works tackling similar block stacking problems [21, 26] have either used a very precisely tuned curriculum, auxiliary tasks or reward engineering to succeed. In contrast, we show that by relying on planning, even in the absence of interactions with the environment, we can successfully learn from raw reward signals.

Multitask learning and the resulting transfer learning challenge have been extensively studied across a large variety of domains, ranging from supervised problems to goal-reaching in robotics or multi-agent settings [46, 49].
A common requirement is to modify an agent's behaviour depending on the currently inferred task. Universal value functions [38] have been identified as a general factorized representation which is amenable to goal-conditioning agents and can be trained effectively [6], when one has access to appropriately varied tasks to train from. Hindsight Experience Replay (HER) [3] helps improve training efficiency by allowing past experience to be relabelled retroactively and has shown great promise in a variety of domains. However, using a single task conditioning vector has its limitations when addressing long-horizon problems. Other works explore different ways to break long-horizon tasks into sequences of sub-goals [31, 7], or leverage HER and curiosity signals to tackle similar robotic manipulation tasks [21]. In our work, we leverage HER to train low-level goal-conditioned policies, but additionally learn to combine them sequentially using a program-guided meta-controller.
Since our meta-controller performs MCTS with a learned model, this work also relates to advances in MBRL [29, 32, 39, 40]. Our work extends this to the use of hierarchical programmatic skills. This closely connects to the model-based HRL research area, which has seen only limited attention so far [41, 32, 18]. Finally, our work capitalizes on recent advances in Neural Programmer-Interpreters (NPI) [35], in particular AlphaNPI [33]. NPI learns a library of program embeddings that can be recombined using a core recurrent neural network that learns to interpret arbitrary programs. AlphaNPI augments NPI with AlphaZero [42, 43] style MCTS search and RL, applied to symbolic tasks such as Tower of Hanoi and sorting. We extend AlphaNPI to challenging continuous control domains and learn a transition model over programs instead of using the environment. Our work shares similar motivation with [48], where NPI is applied to solve different robotic tasks, but we do not require execution traces or pre-trained low-level policies.

    > CLEAN_AND_STACK
        > MOVE_ALL_TO_ZONE_ORANGE
            > MOVE_TO_ZONE_0_ORANGE
            > MOVE_TO_ZONE_1_ORANGE
            > STOP
        > STACK_ALL_TO_ZONE_BLUE
            > MOVE_TO_ZONE_2_BLUE
            > STACK_3_2
            > STOP
        > STACK_ALL_TO_ZONE_ORANGE
            > STACK_1_0
            > STOP
        > STOP
Figure 1: Illustrative example of an execution trace for the CLEAN_AND_STACK program. This trace is not optimal, as the program may be realised in fewer moves, but it corresponds to one of the solutions found by AlphaNPI-X during training. Atomic program calls are shown in green and non-atomic program calls in blue.
In this paper, we aim to learn libraries of skills to solve a variety of tasks in continuous action domains with sparse rewards. Consider the task shown in Figure 1, where the agent's goal is to take the environment from its initial state, where four blocks are randomly placed, to the desired final state, where blocks are in their corresponding coloured zones and stacked on top of each other. We formalize skills and their combinations as programs. An example programmatic trace for solving this task is shown, where a sequence of programs is called to take the environment from the initial state to the final rewarding state. We specify two distinct types of programs: atomic programs (shown in green) are low-level goal-conditioned policies which take actions in the environment for a fixed number of steps T; non-atomic programs (shown in blue) are a combination of atomic and/or other non-atomic programs, allowing multiple possible levels of hierarchy in behaviour.

We base our experiments on a set of robotic tasks with a continuous action space. Due to the lack of any long-horizon hierarchical multi-task benchmarks, we extended the OpenAI Gym Fetch environment [8] with tasks exhibiting such requirements. We consider a target set of tasks represented by a hierarchical library of programs, see Table 1. These tasks involve controlling a 7-DOF robotic arm to manipulate 4 coloured blocks in the environment. Tasks vary from simple block stacking to arranging all blocks into different areas depending on their colour. Initial block positions on the table and arm positions are randomized in all tasks. We consider 20 atomic programs that correspond to operating on one block at a time, as well as 7 non-atomic programs that require interacting with 2 to 4 blocks. We give the full specifications of these tasks in Supp. Table 3.

PROGRAM             ARGUMENTS           DESCRIPTION
STACK               TWO BLOCK IDS       Stack a block on another block.
MOVE_TO_ZONE        BLOCK ID & COLOUR   Move a block to a colour zone.
STACK_ALL_BLOCKS    NO ARGUMENTS        Stack all blocks together in any order.
STACK_ALL_TO_ZONE   COLOUR              Stack blocks of the same colour in a zone.
MOVE_ALL_TO_ZONE    COLOUR              Move blocks of the same colour to a zone.
CLEAN_TABLE         NO ARGUMENTS        Move all blocks to their colour zone.
CLEAN_AND_STACK     NO ARGUMENTS        Stack blocks of the same colour in zones.

Table 1: Program library for the fetch arm environment. We obtain 20 atomic and 7 non-atomic programs when considering all possible combinations obtained by expanding program arguments. Please see Supp. Section A for a detailed explanation of all programs and Table 3 for all combinations when expanding program arguments.

We consider a continuous action space A, a continuous state space S, an initial state distribution ρ and a transition function T : A × S → S. The state vector contains the positions, rotations, linear and angular velocities of the gripper and all blocks. More precisely, a state s_t has the form s_t = [X_t^0, X_t^1, X_t^2, X_t^3, Y], where X_t^i is the position (x, y, z) of block i at time step t and Y contains additional information about the gripper and velocities.

More formally, we aim to learn a set of n programs p_i, i ∈ {1, ..., n}. A program p_i is defined by its pre-condition φ_i : S → {0, 1}, which assesses whether the program can start, and its post-condition ψ_i : S → {0, 1}, which corresponds to the reward function here. Each program is associated to an MDP (S, A, T, R_i) which can start only in states such that the pre-condition φ_i is satisfied, and where R_i is a reward function that outputs 1 when the post-condition ψ_i is satisfied and 0 otherwise. Atomic programs are represented by a goal-conditioned neural network with a continuous action space. Non-atomic programs use a modified action space: we replace the original continuous action space A by a discrete action space A' = {a'_1, ..., a'_n, STOP}, where action a'_i calls program p_i and the STOP action enables the current program to terminate and return the execution to its calling program. Atomic programs do not have a STOP action; they terminate after T time steps.
AlphaNPI-X learns to solve multiple tasks by composing programs at different levels of hierarchy. Given a one-hot encoding of a program and states from the environment, the meta-controller calls either an atomic program, a non-atomic program or the STOP action. In the next sections, we describe how learning and inference are performed.
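To make the program abstraction above concrete, the following is a minimal Python sketch of how a program with its pre- and post-conditions could be represented. The names (Program, block_pos) and the numeric constants are illustrative assumptions, not taken from the released implementation; the example post-condition mirrors the stacking test described in Supp. Section C.

    from dataclasses import dataclass
    from typing import Callable

    import numpy as np

    # A program is defined by a pre-condition (can it start in state s?) and a
    # post-condition (is it satisfied in state s?); the post-condition doubles
    # as the sparse reward of the associated MDP.
    @dataclass
    class Program:
        name: str
        pre_condition: Callable[[np.ndarray], bool]
        post_condition: Callable[[np.ndarray], bool]
        atomic: bool  # atomic programs roll the goal-conditioned policy for T steps

        def reward(self, state: np.ndarray) -> float:
            # Sparse reward: 1 when the post-condition holds, 0 otherwise.
            return float(self.post_condition(state))

    # Illustrative constants (assumed values, not from the paper).
    EPS, BLOCK_HEIGHT = 0.05, 0.05

    def block_pos(state: np.ndarray, i: int) -> np.ndarray:
        # Assumes block positions are stored first in the state vector.
        return state[3 * i: 3 * i + 3]

    # Example: a hypothetical STACK_1_2 program, whose post-condition checks
    # that block 1 sits within EPS of the position directly above block 2.
    stack_1_2 = Program(
        name="STACK_1_2",
        pre_condition=lambda s: True,  # placeholder; the real check verifies block 2 is free
        post_condition=lambda s: np.linalg.norm(
            block_pos(s, 1) - (block_pos(s, 2) + np.array([0.0, 0.0, BLOCK_HEIGHT]))
        ) <= EPS,
        atomic=True,
    )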
Learning in our system operates in three stages: first we learn the atomic programs, then we learn a transition model over their effects, and finally we train the meta-controller to combine them. We provide a detailed explanation of how these modules are learned below.
Learning Atomic Programs
Our atomic programs consist of two components: (a) a goal-setter that, given the atomic program encoding and the current environment state, generates a goal vector representing the desired position of the blocks in the environment, and (b) a goal-conditioned policy that receives this goal as a conditioning vector, produces continuous actions and terminates after T time steps.
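As a rough sketch of how these two components interact (the value T = 50 is the one reported in Supp. Section C; the classic Gym step API and the function names are assumptions for illustration only):

    T = 50  # number of low-level steps per atomic program (see Supp. Section C)

    def run_atomic_program(env, state, program_index, goal_setter, policy):
        """Execute one atomic program: compute its goal, then roll the
        goal-conditioned policy for T time steps in the environment."""
        goal = goal_setter(state, program_index)   # g = Theta(s0, i)
        for _ in range(T):
            action = policy(state, goal)           # pi_theta^atomic(s, g), continuous command
            state, _, _, _ = env.step(action)      # assumes a classic Gym-style step
        return state                               # state observed after T steps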
Figure 2: AlphaNPI-X consists of three modules: a goal setter transforms an atomic program index into a goal that is executed by a goal-conditioned policy. A self-behavioural model transforms an atomic program index and an environment state into a prediction of the environment state once the goal-conditioned policy has been rolled for T time steps to execute this program. A meta-controller plans through the behavioural model to execute non-atomic programs.

To execute an atomic program p_i from an initial state s_0 satisfying the pre-condition φ_i, the goal-setter Θ computes the goal g = Θ(s_0, i). We then roll the goal-conditioned policy π_θ^atomic(·, g) for T time steps to achieve the goal. In this work, the goal setter module Θ is provided; it translates the atomic program specification into the corresponding goal vector indicating the desired position of the blocks, which is also used to compute rewards. We parametrize our shared goal-conditioned policy using Universal Value Function Approximators (UVFAs) [38].
UVFAs estimate a value function that generalises not only over states but also over goals. To accelerate training of this goal-conditioned UVFA, we leverage the "final" goal relabelling strategy introduced in HER [3]. Past episodes of experience are relabelled retroactively with goals that are different from the goal aimed for during data collection and instead correspond to the goal achieved in the final state of the episode. The mapping from state vector to goals is simply done by extracting the block positions directly from the state vector s. To deal with continuous control, we arbitrarily use DDPG [27] to train the goal-conditioned policy.

More formally, we define a goal space G which is a subspace of the state space S. The goal-conditioned policy π_θ^atomic : S × G → A takes as inputs a state s ∈ S as well as a goal g ∈ G. We define the function h : S → G that extracts the goal g from a state s. This policy is trained with the reward function f_uvfa : S × G → R defined as f_uvfa(s, g) = 1(‖h(s) − g‖ ≤ ε), where ε > 0. The goal setter Θ : S × {1, ..., k} → G takes an initial state and an atomic program index and returns a goal such that f_uvfa(s, Θ(s_0, i)) = ψ_i(s). We can thus express any atomic program policy π_i as

    π_i(·) = π_θ^atomic(·, Θ(·, i)), ∀ i ≤ k.

We first train the policy with HER to achieve goals sampled uniformly in G from any state sampled from ρ. However, the distribution of initial states encountered by each atomic program may be very different when executing programs sequentially (as will happen when non-atomic programs are introduced). Thus, after the initial training, we continue training the policy, but with probability p we do not reset the environment between episodes, to approximate the real distribution of initial states that arises when atomic programs are chained together. Hence, the initial state s_0 is either sampled randomly or kept as the last state observed in the previous episode.
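For concreteness, a minimal sketch of the sparse reward f_uvfa and the "final" relabelling used above. The buffer layout (a list of transition dictionaries), the tolerance value and the assumption that block positions come first in the state vector are illustrative, not taken from the released code.

    import numpy as np

    EPS = 0.05  # illustrative tolerance; the paper uses the standard HER value

    def h(state: np.ndarray, n_blocks: int = 4) -> np.ndarray:
        # Achieved goal: the (x, y, z) positions of the blocks, assumed to be
        # stored first in the state vector.
        return state[: 3 * n_blocks]

    def f_uvfa(state: np.ndarray, goal: np.ndarray) -> float:
        # f_uvfa(s, g) = 1(||h(s) - g|| <= eps)
        return float(np.linalg.norm(h(state) - goal) <= EPS)

    def her_final_relabel(episode):
        """'final' strategy: replay the episode as if the goal had been the
        goal actually achieved in its last state."""
        achieved = h(episode[-1]["next_state"])
        relabelled = []
        for tr in episode:
            relabelled.append({
                **tr,
                "goal": achieved,
                "reward": f_uvfa(tr["next_state"], achieved),
            })
        return relabelled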
Figure 3: AlphaNPI-X combines a goal setter and a goal-conditioned policy to execute atomic programs. Here we see an atomic program execution for stacking block 2 on top of block 3. The goal proposed by the goal setter is shown in red and indicates the desired position for block 2. The goal-conditioned policy takes this goal as an input and receives a positive reward only if it successfully reaches this goal.

We later present empirical results and analysis regarding both training phases.
Learning Self-Behavioural Model
After learning a set of atomic programs, we learn a transition model over their effects: Ω_w : S × {1, ..., k} → S, parameterized with a neural network. As Figure 2-B shows, this module takes as input an initial state and an atomic program index, and outputs a prediction of the environment state obtained when rolling the policy associated to this program for T time steps from the initial state. We use a fully connected MLP and train it by minimizing the mean-squared error to the ground-truth final states. This enables our model to make jumpy predictions over the effect of executing an atomic program during search, hence removing the need for any interactions with the environment during planning. Learning this model enables us to imagine the state of the environment following an atomic program execution and hence to avoid any further calls to atomic programs, each of which would have to perform many actions in the environment.
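A rough PyTorch sketch of this model and its training step is given below. The hidden sizes follow Supp. Table 5 (512/512); the module and function names are illustrative and everything else (e.g. the one-hot program encoding) is an assumption rather than the released implementation.

    import torch
    import torch.nn as nn

    class SelfBehaviouralModel(nn.Module):
        """Predicts the state reached after rolling an atomic program for T steps."""
        def __init__(self, state_dim: int, n_atomic: int, hidden: int = 512):
            super().__init__()
            self.n_atomic = n_atomic
            self.net = nn.Sequential(
                nn.Linear(state_dim + n_atomic, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim),
            )

        def forward(self, state, program_index):
            one_hot = nn.functional.one_hot(program_index, self.n_atomic).float()
            return self.net(torch.cat([state, one_hot], dim=-1))

    def train_step(model, optimizer, s0, idx, sf):
        # Minimise the mean-squared error to the ground-truth final states.
        loss = ((model(s0, idx) - sf) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()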
Learning the Meta-Controller

In order to compose atomic programs together into hierarchical stacks of non-atomic programs, we use a meta-controller inspired by AlphaNPI [33]. The meta-controller interprets and selects the next program to execute using neural-network guided Monte Carlo Tree Search (MCTS) [43, 33], conditioned on the current program index and states from the environment (see Figure 2 for an overview of a node expansion).

We train the meta-controller using the recursive MCTS strategy introduced in AlphaNPI [33]: during search, if the selected action is non-atomic, we recursively build a new Monte Carlo tree for that program, using the same state s_t. See Supp. Section B for a detailed description of the search process and pseudo-code.

In AlphaNPI [33], similar to AlphaZero, future scenarios were evaluated during the tree search by leveraging the ground-truth environment, without any temporal abstraction. In this work, we instead do not use the environment directly during planning, but replace it by our learnt transition model over the effects of the atomic programs, the self-behavioural model described in Section 4.1, resulting in a far more sample efficient algorithm. Besides, AlphaNPI used a hand-coded curriculum where the agent gradually learned from easy to hard programs. Here we instead randomly sample at each iteration a program to learn from, removing the need for any supervision and hand-crafting of a curriculum. Inspired by recent work [9], we also implemented an automated curriculum strategy based on learning progress, but found that it does not outperform random sampling (see more detailed explanations and results in Supp. Section B).
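The recursive aspect can be sketched as follows: whenever a non-atomic program is selected, a fresh tree search is started for that program from the same imagined state, while atomic programs are advanced with the self-behavioural model. This is a deliberately simplified sketch of the control flow, not the full P-UCT search of AlphaNPI, and the function names are assumptions.

    def execute_program(program_index, state, library, self_model, search):
        """Recursively expand a program into atomic calls over imagined states.

        `search(program_index, state)` is assumed to return the next program
        index (or "STOP") chosen by the tree policy; `self_model(state, i)`
        predicts the state after atomic program i.
        """
        while True:
            next_index = search(program_index, state)
            if next_index == "STOP":
                return state
            if library[next_index].atomic:
                # Jumpy prediction: imagine the effect of T low-level steps.
                state = self_model(state, next_index)
            else:
                # Recursion: build a new tree for the sub-program, same state.
                state = execute_program(next_index, state, library, self_model, search)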
Figure 4: The goal policy π_θ^atomic is trained in two phases. During the first phase, goals and initial states are sampled uniformly. During the second phase, we do not reset the environment with probability p between two consecutive episodes, to approximate the real distribution of initial states. Once the goal policy is trained, we learn a model of its behaviour. Left: performance after phase 2 for different probabilities p. Middle: performance after phase 1 compared to phase 2. Right: prediction performance of the self-behavioural model when trained on different numbers of episodes generated with the goal policy.
At inference time, we compare three different strategies with AlphaNPI-X: 1) rely only on the policy network, without planning; 2) plan a whole trajectory using the learned transition model (i.e. the self-behavioural model) and execute it fully (open-loop planning); 3) use a receding planning horizon, inspired by Model Predictive Control (MPC) [13]. During execution, observed states can diverge from the predictions made by the self-behavioural model due to distribution shift, especially for long planning horizons, which deteriorates performance. To counter this, we do not commit to a plan for the full episode, but instead re-plan after executing any atomic program (non-atomic programs do not trigger re-planning). The comparison between these inference methods can be found in Table 2. We also provide a detailed example in Supp. Section C.
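The third strategy can be sketched as a receding-horizon loop: plan with the self-behavioural model, execute only the next atomic program in the real environment, then re-plan from the newly observed state. The helper names below (`plan_with_model`, `run_atomic`) are illustrative placeholders.

    def mpc_execute(env, state, task_index, plan_with_model, run_atomic, library):
        """Receding-horizon execution of a non-atomic program.

        `plan_with_model(task, state)` is assumed to return a flat sequence of
        atomic program indices found by searching through the self-behavioural
        model; `run_atomic(env, state, i)` rolls the goal-conditioned policy for
        T real steps and returns the observed state.
        """
        while not library[task_index].post_condition(state):
            plan = plan_with_model(task_index, state)
            if not plan:
                break  # planning failed from this state
            # Commit only to the first atomic program, then re-plan from the
            # state actually observed in the environment.
            state = run_atomic(env, state, plan[0])
        return state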
Experiments

We first train the atomic policies using DDPG with the HER relabelling described in Section 4.1. Initial block positions on the table as well as the gripper position are randomized. Additional details regarding the experimental setup are provided in Supp. Section D. In the first phase of training, we train the agent for 100 epochs with uniform goal sampling. In the second phase, we continue training for 150 epochs, changing the sampling distribution to allow no resets between episodes, to mimic the desired setting where skills are executed sequentially.

We evaluate the agent performance on goals corresponding to single atomic programs as well as on two multi-step sequential atomic program executions: MultiStep-Moving, which requires 4 consecutive calls to MOVE_TO_ZONE, and MultiStep-Stacking, which requires 3 consecutive calls to STACK (see Figure 4). We observe that, while the agent obtains decent performance on executing atomic programs in the first phase, the second phase is indeed crucial to ensure success of the sequential execution of programs. We also observe that the reset probability affects the performance, and that an intermediate value of p provides the best trade-off between the asymptotic training performance and the agent's ability to execute programs sequentially. More details are included in Supp. Section E.

We then train the self-behavioural model to predict the effect of atomic programs, using three datasets respectively made of 10k, 50k and 100k episodes played with the goal-conditioned policy. As in the second phase of training, we did not reset the environment between two episodes with probability p. We trained the self-behavioural model for 500 epochs on each dataset. Figure 4 shows the performance aggregated across different variations of the STACK and MOVE_TO_ZONE behaviours.

To train the meta-controller, we randomly sample from the set of non-atomic programs during training. In Figure 5, we report the performance on all non-atomic tasks over the course of training. We also investigated the use of an automated curriculum via learning progress but observed that it does not outperform random sampling (please see Supp. Section B.3 for details). After training for 700 epochs (see Figure 5), we evaluate the AlphaNPI-X agent on each non-atomic program in the library, as shown in Table 2. Our results indicate that removing planning significantly reduces performance, especially for stacking 4 blocks, where removing planning results in complete failure.
Figure 5: AlphaNPI-X performance over the course of training on non-atomic programs. The performance shown is the one predicted by the behavioural model and can thus differ from the one measured in the environment.

PROGRAM             NO PLAN   PLANNING   PLANNING + RE-PLANNING   MULTITASK DDPG   MULTITASK DDPG + HER
CLEAN_TABLE
CLEAN_AND_STACK
STACK_ALL_BLOCKS
STACK_ALL_TO_ZONE
MOVE_ALL_TO_ZONE
Table 2: We compare the performance of AlphaNPI-X in the different inference settings described in Section 4.2, as well as against 2 model-free baselines. Each program is executed 100 times with a randomized environment configuration. For programs with arguments, the performance is averaged over all possible argument combinations. The full list of programs is provided in Supp. Section A.

We observe that re-planning during inference significantly helps improve the agent's performance overall.

We compare our method against two baselines to illustrate the difficulty for standard RL methods of solving tasks with sparse reward signals. First, we implemented a multitask DDPG (M-DDPG) that takes as input the environment state and a one-hot encoding of a non-atomic program. At each iteration, M-DDPG randomly selects a program index and plays one episode in the environment. Second, we implemented an M-DDPG+HER agent which leverages a goal setter for non-atomic tasks, as described in Section 4.1. This non-atomic program goal setter is not available to AlphaNPI-X, so this agent has more information than its M-DDPG counterpart: instead of receiving a program index, it receives a goal vector representing the desired end state of the blocks. As in HER, goals are relabelled during training. We observe that M-DDPG is unable to learn any non-atomic program. The same holds for M-DDPG+HER, despite its access to the additional goal representations. This shows that the standard exploration mechanism of model-free agents such as DDPG, where Gaussian noise is added to the actions, is very unlikely to lead to rewarding sequences, and hence learning is hindered.
Conclusion

In this paper, we proposed AlphaNPI-X, a novel method for constructing a hierarchy of abstract actions in a rich object manipulation domain with sparse rewards and long horizons. Several ingredients proved critical in the design of AlphaNPI-X. First, by learning a self-behavioural model, we can leverage the power of recursive AlphaZero-style look-ahead planning across multiple levels of hierarchy, without ever interacting with the real environment. Second, planning with a receding horizon at inference time resulted in more robust performance. We also observed that our approach does not require the carefully designed curriculum commonly used in NPI, so random task sampling can simply be used instead. Experimental results demonstrated that AlphaNPI-X, using abstract imagination-based reasoning, can simultaneously solve multiple complex tasks involving dexterous object manipulation beyond the reach of model-free methods.

A limitation of our work is that, similar to AlphaNPI, we pre-specified the hierarchical library of programs to be learned. While it enables human interpretability, this requirement is still quite strong. Thus, a natural next step would be to extend our algorithm to discover these programmatic abstractions during training, so that our agent can specify and learn new skills hierarchically in a fully unsupervised manner.
Broader Impact
Our paper presents a sample efficient technique for learning in sparse-reward, temporally extended settings. It does not require any human demonstrations and can learn purely from sparse reward signals. Moreover, after learning low-level skills, it can learn to combine them to solve challenging new tasks without any requirement to interact with the environment. We believe this has a positive impact in making reinforcement learning techniques more accessible and applicable in settings where interacting with the environment is costly or even dangerous, such as robotics. Due to the compositional nature of our method, the interpretability of the agent's policy is improved, as high-level programs explicitly indicate the intention of the agent over multiple steps of behaviour. This contributes to the broader effort of building more interpretable and explainable reinforcement learning agents. Furthermore, by releasing our code and environments we believe that we help efforts in reproducible science and allow the wider community to build upon and extend our work in the future.
References

[1] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[2] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 166–175. JMLR. org, 2017.[3] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder,Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay.
CoRR , abs/1707.01495, 2017.[4] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, JakubPachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterousin-hand manipulation. arXiv preprint arXiv:1808.00177 , 2018.[5] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture.
CoRR ,abs/1609.05140, 2016.[6] André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel,Daniel Toyama, Shibl Mourad, David Silver, Doina Precup, et al. The option keyboard:Combining skills in reinforcement learning. In
Advances in Neural Information ProcessingSystems , pages 13031–13041, 2019.[7] Homanga Bharadhwaj, Animesh Garg, and Florian Shkurti. Dynamics-aware latent spacereachability for exploration in temporally-extended tasks. arXiv preprint arXiv:2005.10934 ,2020.[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,and Wojciech Zaremba. Openai gym.
CoRR , abs/1606.01540, 2016.[9] Cédric Colas, Pierre Fournier, Olivier Sigaud, and Pierre-Yves Oudeyer. CURIOUS: intrinsicallymotivated multi-task, multi-goal reinforcement learning.
CoRR , abs/1810.06284, 2018.[10] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value functiondecomposition.
Journal of Artificial Intelligence Research, 13:227–303, 2000.
[11] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[12] Aleksandra Faust, Kenneth Oslund, Oscar Ramirez, Anthony Francis, Lydia Tapia, Marek Fiser, and James Davidson. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In , pages 5113–5120. IEEE, 2018.
[13] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: theory and practice—a survey.
Automatica , 25(3):335–348, 1989.[14] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXivpreprint arXiv:1611.07507 , 2016.[15] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solutionalgorithms for factored mdps.
Journal of Artificial Intelligence Research , 19:399–468, 2003.[16] Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relaypolicy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXivpreprint arXiv:1910.11956 , 2019.[17] David Held, Xinyang Geng, Carlos Florensa, and Pieter Abbeel. Automatic goal generation forreinforcement learning agents. arXiv preprint arXiv:1705.06366 , 2018.[18] León Illanes, Xi Yan, Rodrigo Toro Icarte, and Sheila A McIlraith. Symbolic plans as high-levelinstructions for reinforcement learning. In
Proceedings of the International Conference onAutomated Planning and Scheduling , volume 30, pages 540–550, 2020.[19] Riashat Islam, Zafarali Ahmed, and Doina Precup. Marginalized state distribution entropyregularization in policy optimization. arXiv preprint arXiv:1912.05128 , 2019.[20] Yiding Jiang, Shixiang Gu, Kevin Murphy, and Chelsea Finn. Language as an abstraction forhierarchical deep reinforcement learning.
CoRR , abs/1906.07343, 2019.[21] John B Lanier, Stephen McAleer, and Pierre Baldi. Curiosity-driven multi-criteria hindsightexperience replay. arXiv preprint arXiv:1906.03710 , 2019.[22] Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Rus-lan Salakhutdinov. Efficient exploration via state marginal matching. arXiv preprintarXiv:1906.05274 , 2019.[23] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical actor-critic. arXiv preprintarXiv:1712.00948 , 2017.[24] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight. arXiv preprint arXiv:1805.08180 , 2018.[25] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight.In
International Conference on Learning Representations , 2019.[26] Richard Li, Allan Jabri, Trevor Darrell, and Pulkit Agrawal. Towards practical multi-objectmanipulation using relational reinforcement learning. arXiv preprint arXiv:1912.11032 , 2019.[27] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXivpreprint arXiv:1509.02971 , 2015.[28] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchicalreinforcement learning.
CoRR, abs/1805.08296, 2018.
[29] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In , pages 7559–7566. IEEE, 2018.
[30] Anusha Nagabandi, Kurt Konoglie, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. arXiv preprint arXiv:1909.11652, 2019.
[31] Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. arXiv preprint arXiv:1909.05829, 2019.
[32] Soroush Nasiriany, Vitchyr Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In
Advances in Neural Information Processing Systems , pages 14814–14825, 2019.[33] Thomas Pierrot, Guillaume Ligner, Scott E Reed, Olivier Sigaud, Nicolas Perrin, AlexandreLaterre, David Kas, Karim Beguir, and Nando de Freitas. Learning compositional neuralprograms with recursive tree search and planning. In
Advances in Neural Information ProcessingSystems 32 , pages 14646–14656, 2019.[34] Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and SergeyLevine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprintarXiv:1903.03698 , 2019.[35] Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv preprintarXiv:1511.06279 , 2015.[36] Martin A. Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave,Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learningby playing - solving sparse reward tasks from scratch.
CoRR , abs/1802.10567, 2018.[37] S. Russell.
Human Compatible: Artificial Intelligence and the Problem of Control . PenguinPublishing Group, 2019. ISBN 9780525558620. URL https://books.google.fr/books?id=M1eFDwAAQBAJ .[38] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function ap-proximators. In Francis Bach and David Blei, editors,
Proceedings of the 32nd InternationalConference on Machine Learning , volume 37 of
Proceedings of Machine Learning Research ,pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR.[39] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Si-mon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Masteringatari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265 ,2019.[40] Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and DeepakPathak. Planning to explore via self-supervised world models. arXiv preprint arXiv:2005.05960 ,2020.[41] David Silver and Kamil Ciosek. Compositional planning using optimal option models. arXivpreprint arXiv:1206.6473 , 2012.[42] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.Mastering the game of go with deep neural networks and tree search. nature , 529(7587):484,2016.[43] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, ArthurGuez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Masteringchess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprintarXiv:1712.01815 , 2017.[44] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: Aframework for temporal abstraction in reinforcement learning.
Artificial intelligence , 112(1-2):181–211, 1999.[45] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez,and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXivpreprint arXiv:1804.10332 , 2018.[46] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: Asurvey.
Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
[47] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
[48] Danfei Xu, Suraj Nair, Yuke Zhu, Julian Gao, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Neural Task Programming: Learning to generalize across hierarchical tasks. arXiv e-prints, art. arXiv:1710.01813, October 2017.
[49] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.

Supplementary material

A Programs library
Our program library includes 20 atomic programs and 7 non-atomic ones. The full list of these programs can be viewed in Table 3. In the AlphaNPI paper, the program library contained only five atomic programs and 3 non-atomic programs. Thus, the branching factor of the tree search in AlphaNPI-X is on average much greater than in the AlphaNPI paper. Furthermore, in this work these atomic programs are learned and therefore might not always execute as expected, while in AlphaNPI the five atomic programs are hard-coded in the environment and thus execute successfully any time they are called.

While in this study we removed the need for a hierarchy of programs during learning, we still defined program levels for two purposes: (i) to control the dependencies between programs, as programs of lower levels cannot call programs of higher levels; and (ii) to facilitate the tree search by relying on the level balancing term L introduced in the original AlphaNPI P-UCT criterion. In this context, we defined CLEAN_TABLE and CLEAN_AND_STACK as level 2 programs, and STACK_ALL_BLOCKS, STACK_ALL_TO_ZONE and MOVE_ALL_TO_ZONE as level 1 programs. The atomic programs are defined as level 0 programs. Interestingly, the natural order learned during training (when sampling random program indices to learn from) matches this hierarchy, see Figure 11.
PROGRAM                     DESCRIPTION
STACK_0_1                   Stack block 0 on block 1.
STACK_0_2                   Stack block 0 on block 2.
STACK_0_3                   Stack block 0 on block 3.
STACK_1_0                   Stack block 1 on block 0.
STACK_1_2                   Stack block 1 on block 2.
STACK_1_3                   Stack block 1 on block 3.
STACK_2_0                   Stack block 2 on block 0.
STACK_2_1                   Stack block 2 on block 1.
STACK_2_3                   Stack block 2 on block 3.
STACK_3_0                   Stack block 3 on block 0.
STACK_3_1                   Stack block 3 on block 1.
STACK_3_2                   Stack block 3 on block 2.
MOVE_TO_ZONE_0_ORANGE       Move block 0 to the orange zone.
MOVE_TO_ZONE_1_ORANGE       Move block 1 to the orange zone.
MOVE_TO_ZONE_2_ORANGE       Move block 2 to the orange zone.
MOVE_TO_ZONE_3_ORANGE       Move block 3 to the orange zone.
MOVE_TO_ZONE_0_BLUE         Move block 0 to the blue zone.
MOVE_TO_ZONE_1_BLUE         Move block 1 to the blue zone.
MOVE_TO_ZONE_2_BLUE         Move block 2 to the blue zone.
MOVE_TO_ZONE_3_BLUE         Move block 3 to the blue zone.
STACK_ALL_TO_ZONE_ORANGE    Stack the orange blocks in the orange zone.
STACK_ALL_TO_ZONE_BLUE      Stack the blue blocks in the blue zone.
MOVE_ALL_TO_ZONE_ORANGE     Move the orange blocks to the orange zone.
MOVE_ALL_TO_ZONE_BLUE       Move the blue blocks to the blue zone.
STACK_ALL_BLOCKS            Stack all blocks together in any order.
CLEAN_TABLE                 Move all blocks to their colour zone.
CLEAN_AND_STACK             Stack blocks of the same colour in zones.

Table 3: Program library for the fetch arm environment. We show the flat list of all program possibilities obtained by expanding their arguments.
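For reference, the flat list in Table 3 can be generated by expanding the arguments of the parametric programs of Table 1; a small illustrative Python snippet (names are ours, not from the released code):

    from itertools import permutations

    blocks, colours = range(4), ["ORANGE", "BLUE"]

    atomic = (
        [f"STACK_{i}_{j}" for i, j in permutations(blocks, 2)]         # 12 programs
        + [f"MOVE_TO_ZONE_{i}_{c}" for c in colours for i in blocks]   # 8 programs
    )
    non_atomic = (
        [f"STACK_ALL_TO_ZONE_{c}" for c in colours]
        + [f"MOVE_ALL_TO_ZONE_{c}" for c in colours]
        + ["STACK_ALL_BLOCKS", "CLEAN_TABLE", "CLEAN_AND_STACK"]
    )
    assert len(atomic) == 20 and len(non_atomic) == 7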
B Details of the AlphaNPI-X method
B.1 Table of symbols
Here we provide the list of symbols used in our method section:
Name                              Symbol                        Notes
Action space                      A                             A is continuous
State space                       S                             S is continuous
Goal space                        G                             G is a sub-space of S
Initial state distribution        ρ
Reward obtained at time t         r_t
Discount factor                   γ
Program                           p_i, i ∈ {1, ..., n}          If i ≤ k then p_i is atomic
Program index selected at time t  i_t
Environment state                 s_t
Program p_i pre-condition         φ_i : S → {0, 1}
Program p_i post-condition        ψ_i : S → {0, 1}
AlphaNPI-X policy                 π : S × {1, ..., n} → A
Program p_i policy                π_i                           ∀ s ∈ S such that φ_i(s) = 1, π_i(s) = π(s, i)
Goal setter                       Θ : S × {1, ..., k} → G
HER mapping function              h : S → G                     Extracts a goal from a state vector
Goal-conditioned policy           π_θ^atomic : S × G → A        π_i(·) = π_θ^atomic(·, Θ(·, i))
Self-behavioural model            Ω_w : S × {1, ..., k} → S

Table 4: AlphaNPI-X table of symbols.
B.2 AlphaNPI
Here we provide some further details on the AlphaNPI method, which we use and extend for our meta-controller. The AlphaNPI agent uses a recursion-augmented Monte Carlo Tree Search (MCTS) algorithm to learn libraries of hierarchical programs from sparse reward signals. The tree search is guided by an actor-critic network inspired by the Neural Programmer-Interpreter (NPI) architecture. A stack is used to handle programs as in a standard program execution: when a non-atomic program calls another program, the current NPI network's hidden state is saved on a stack and the next program execution starts. When it terminates, the execution goes back to the previous program and the network gets back its previous hidden state. The same mechanism is applied to MCTS itself: when the current search decides to execute another program, the current tree is saved along with the network's hidden state on the stack and a new search starts to execute the desired program.

The neural network takes as input an environment state s_t and a program index i_t and returns a vector of probabilities over the next programs to call, π_t, as well as a prediction of the value V_t. The network is composed of five modules. An encoder f_enc encodes the environment state into a vector o_t = f_enc(s_t), and a program embedding matrix M_prog contains a learnable embedding for each non-atomic program. The i-th row of the matrix contains the embedding p of the program referred to by the index i, such that p_t = M_prog(i_t). An LSTM core f_lstm takes the encoded state and the program embedding as input and returns its hidden state h_t. Finally, a policy head and a value head take the hidden state as input and return respectively π_t and V_t, see Figure 6.

The guided MCTS process is used to generate data to train the AlphaNPI neural network. A search returns a sequence of transitions (s_t, i_t, π_t^mcts, r), where π_t^mcts corresponds to the tree policy at time t and r is the final episode reward. The tree policy is computed from the visit counts of the tree nodes. These transitions are stored in a replay buffer. At training time, batches of trajectories are sampled and the network is trained to minimize the loss

    ℓ = Σ_batch [ −(π^mcts)ᵀ log π + (V − r)² ],    (1)

where the first term is the policy loss ℓ_policy and the second term is the value loss ℓ_value.
Figure 6: The AlphaNPI architecture, slightly modified from [33].

Note that in AlphaNPI-X, in contrast with the original AlphaNPI work, proper back-propagation through time (BPTT) over batches of trajectories is performed instead of gradient descent over batches of transitions, resulting in more stable updates.

AlphaNPI can explore and exploit. In exploration mode, many simulations per node are performed, Dirichlet noise is added to the priors to help exploration, and actions are chosen by sampling the tree policy vectors. In exploitation mode, significantly fewer simulations are used, no noise is added, and actions are chosen by taking the argmax of the tree policy vectors. The tree is used in exploration mode to generate data for training and in exploitation mode at inference time.
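A compact PyTorch sketch of this actor-critic network and of the loss in Eq. (1) is given below. Module sizes follow Supp. Table 5; treating STOP as an extra policy output, the squashing of the value head, and all names are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class AlphaNPINet(nn.Module):
        def __init__(self, obs_dim, n_programs, enc=128, emb=256, hidden=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, enc), nn.ReLU())  # f_enc
            self.programs = nn.Embedding(n_programs, emb)                     # M_prog
            self.core = nn.LSTMCell(enc + emb, hidden)                        # f_lstm
            self.policy_head = nn.Linear(hidden, n_programs + 1)              # programs + STOP (assumed)
            self.value_head = nn.Linear(hidden, 1)

        def forward(self, state, program_index, hc=None):
            o = self.encoder(state)
            p = self.programs(program_index)
            h, c = self.core(torch.cat([o, p], dim=-1), hc)
            pi = torch.softmax(self.policy_head(h), dim=-1)
            v = torch.sigmoid(self.value_head(h)).squeeze(-1)  # rewards are in [0, 1]
            return pi, v, (h, c)

    def loss_fn(pi, v, pi_mcts, ret):
        # Eq. (1): cross-entropy to the tree policy plus squared value error.
        policy_loss = -(pi_mcts * torch.log(pi + 1e-8)).sum(dim=-1)
        value_loss = (v - ret) ** 2
        return (policy_loss + value_loss).mean()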
B.3 Improvements over AlphaNPI

Distributed training and BPTT
We improved AlphaNPI training speed and stability by distributing the algorithm. We use 10 actors in parallel. Each actor has a copy of the neural network weights, which it uses to guide its tree searches and collect experience. In each epoch, the experience collected by all the actors is sent to a centralized replay buffer. A learner, which also has a copy of the network weights, samples batches from the buffer, computes the losses and performs gradient descent to update the network weights. When the learner is done, it sends the new weights to all the actors. We use the MPI (Message Passing Interface) paradigm, through its Python package mpi4py, to implement the parallel processes. Each actor uses 1 CPU.

Another improvement over the standard AlphaNPI is leveraging back-propagation through time (BPTT). The AlphaNPI agent uses an LSTM. During tree search, the LSTM states are stored inside the tree nodes. However, in the original AlphaNPI, transitions are stored independently and gradient descent is performed on batches of uncorrelated transitions. While this worked on the examples presented in the original paper, we found that implementing BPTT improves training stability.
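The actor/learner layout can be sketched with mpi4py as below. This is only a schematic of the communication pattern (gather experience, train on rank 0, broadcast weights); `collect_episodes` and `train_on` are placeholders, and the exact process layout in our implementation may differ.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # rank 0 acts as the learner, all ranks act as actors

    def run(weights, collect_episodes, train_on, n_epochs):
        for _ in range(n_epochs):
            # Every process collects experience with the current weights.
            episodes = collect_episodes(weights)
            # Gather all experience on the learner.
            all_episodes = comm.gather(episodes, root=0)
            if rank == 0:
                weights = train_on(weights, all_episodes)
            # Broadcast the updated weights back to every actor.
            weights = comm.bcast(weights, root=0)
        return weights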
Curriculum Learning
At each training iteration, the agent selects the programs to train on. In the original AlphaNPI paper, the agent selects programs following a hard-coded curriculum based on program complexity. In this work, we select programs randomly instead and thus remove the need for extra supervision. We also implemented an automated curriculum learning paradigm based on a learning progress signal [9] and observed that it does not improve over random sampling. We compare both strategies in Figure 5. We detail below the learning-progress-based curriculum we used as a comparison.

The curriculum based on the program levels from the original AlphaNPI might be replaced by a curriculum based on learning progress. The agent focuses on programs for which its learning progress is maximal. This strategy requires fewer hyper-parameters and enables the agent to discover the same hierarchy implicitly. The learning progress for a program i is defined as the derivative over time of the agent's performance on this program. At the beginning of each training iteration t, the agent attempts every non-atomic program l times in exploitation mode. We denote by C_t^i the average performance, computed as the mean reward over the l episodes, on program i at iteration t. Rewards are still assumed to be binary: 0 or 1. We compute the learning progress for program i as

    LP_t^i = | C_t^i − C_{t−1}^i |,    (2)

where the absolute derivative over time of the agent's performance is approximated at first order. Finally, we compute the probability p_t^i of choosing to train on program i at iteration t as

    p_t^i = ε / M + (1 − ε) · LP_t^i / Σ_j LP_t^j,    (3)

where ε ∈ [0, 1] is a hyperparameter and M is the total number of non-atomic programs. The term ε / M is used to balance exploration and exploitation; it ensures that the agent tries all programs with a non-zero probability.

B.4 Pseudo-codes

Goal policy training

Algorithm 1: Goal policy training, first phase
for num_epoch do
    for num_cycle do
        for num_ep_per_cycle do
            Reset the environment in a state s_0 ~ ρ
            Sample uniformly in the goal space a goal g ~ G
            Play one episode with the goal policy π_θ^atomic(·|g) from s_0
            Store the trajectory transitions in a replay buffer
        end
        for num_sgd_per_cycle do
            Sample a batch of transitions in the replay buffer
            Resample according to the HER strategy
            Compute the gradient on this batch
            Update weights θ with the gradient averaged over all actors
        end
    end
end
Algorithm 2: Goal policy training, second phase
for num_epoch do
    for num_cycle do
        for num_ep_per_cycle do
            Choose uniformly a motion program p_i
            With probability 0.5, reset the environment
            Otherwise, sample a state s_0 such that φ_i(s_0) = 1 in the state buffer and reset to this state
            Compute with the goal setter a goal g = Θ(s_0, i) corresponding to this program
            Play one episode with the goal policy π_θ^atomic(·|g) from s_0
            Store the trajectory in a replay buffer
            Store the final state in the state buffer
        end
        for num_sgd_per_cycle do
            Sample a batch of transitions in the replay buffer
            Resample according to the HER strategy
            Compute the gradient on this batch
            Update weights θ with the gradient averaged over all actors
        end
    end
end

Self-behavioural model training
Algorithm 3: Learning the self-behavioural model
Data generation
for num_episode do
    Choose uniformly a motion program p_i
    With probability 0.5, reset the environment
    Otherwise, sample a state s_0, such that φ_i(s_0) = 1, in the state buffer and reset to this state
    Compute with the goal setter a goal g = Θ(s_0, i)
    Play one episode with the goal policy π_θ^atomic(·|g) from s_0
    Store the initial state s_0, the final state s_f and the program index i in a dataset
end
Training
for num_epochs do
    for num_sgd_epoch do
        Sample a batch of (s_0, s_f, i) tuples
        Update w to minimize l = Σ_batch (Ω_w(s_0, i) − s_f)²
    end
end

AlphaNPI training
Algorithm 4: AlphaNPI training
for num_epoch do
    for num_task_per_epoch do
        Sample a program p_i according to the curriculum strategy
        for num_ep_per_task do
            Sample an initial state s_0 ∈ S until φ_i(s_0) = 1
            Run an AlphaNPI search using the self-behavioural model to obtain a final state s_f
            Compute the reward r = ψ_i(s_f)
            Store the trajectory in a replay buffer
        end
    end
    for num_sgd_per_epoch do
        Sample a minibatch of transitions in the replay buffer
        Train the AlphaNPI neural net with it
    end
    for every program do
        Play n_val episodes with the tree in exploitation mode
        Record the averaged score
    end
end

AlphaNPI inference
Algorithm 5: AlphaNPI execution with MPC during inference
Input: program index i and an initial state s_0
while True do
    Run n_simu simulations with AlphaNPI, relying on the self-behavioural model, from the current state s_t
    Compute the tree policy π^mcts from the visit counts
    Select the next program p_i to call
    if p_i is atomic (i ≤ k) then
        Roll π_θ^atomic(·|Θ(s_t, i)) for T time steps to get the new state s_{t+T}
    else if p_i is STOP then
        Stop the search
    else
        Run a new search recursively to execute the program
    end
end

C Example execution of the programs
Let us imagine we want to run the atomic program STACK_1_2. The state space S and the goal space G are real-valued vector spaces. A state s_t contains all positions, angles, velocities and angular velocities of the blocks and of the robot articulations at time t. A goal g is the concatenation of the desired final positions (x, y, z) for each block. Let us consider that the atomic program STACK_1_2 has index 0. In this case, we compute the goal corresponding to this program as g = Θ(s_0, 0), the goal vector that places block 1 directly above block 2, at a height offset of 2d where d is the block radius, and that keeps the desired positions of all other blocks at their current positions. This goal corresponds to positions where block 1 is stacked on block 2 and where the other blocks have not been moved. The pre-condition of this program verifies in state s_0 that no other block is already on top of block 2, so that stacking is possible. If the pre-condition is verified, the goal policy is rolled for T = 50 time steps to execute this goal. When the goal policy terminates, the program post-condition is called on the final state s_T and verifies, for all blocks, that their final position lies in a sphere of radius ε around their expected final position described by the goal g. The value we use for ε is the standard value used in HER [3].

We train the goal-conditioned policy π_θ^atomic so that it can reach the goals g that correspond to the atomic programs from any initial position, as long as the initial position verifies the atomic program pre-condition. Once it is trained, we learn its self-behavioural model Ω_w. This model takes an environment state and an atomic program index and predicts the environment state obtained when the goal policy has been rolled for T time steps, conditioned on the goal and the initial environment state. Thus, this model performs jumpy predictions: it does not capture instantaneous effects but rather global effects such as this block has been moved or this block has remained at its initial position.

When the model has been trained, we use it to learn non-atomic programs with our extended AlphaNPI. Let us imagine that we want to execute the CLEAN_AND_STACK program. This program is expected to stack, in any order, both orange blocks in the orange zone and both blue blocks in the blue zone. Its pre-condition is always true since it can be called from any initial position. Its post-condition looks at a state s and returns 1 if it finds an orange block with its centre of gravity in the orange zone and with an orange block stacked on top of it, and if the same holds for the blue blocks. The post-condition returns 0 otherwise. The stacking test is performed as explained above, looking at a sphere of radius ε around an ideal position.

A possible execution trace for this program is:

    > CLEAN_AND_STACK
        > MOVE_ALL_TO_ZONE_ORANGE
            > MOVE_TO_ZONE_0_ORANGE
            > MOVE_TO_ZONE_1_ORANGE
            > STOP
        > STACK_ALL_TO_ZONE_BLUE
            > MOVE_TO_ZONE_2_BLUE
            > STACK_3_2
            > STOP
        > STACK_ALL_TO_ZONE_ORANGE
            > STACK_1_0
            > STOP
        > STOP
Figure 7: Illustrative example of an execution trace for the CLEAN_AND_STACK program. This trace is not optimal, as the program may be realised in fewer moves, but it corresponds to one of the solutions found by AlphaNPI-X during training. Atomic program calls are shown in green and non-atomic program calls in blue.

In this trace, the non-atomic programs are shown in blue and the atomic programs are shown in green. Non-atomic program execution happens as in the original AlphaNPI. The execution of an atomic program happens as described before: when the atomic program is called, the current environment state is transformed into a goal vector that is realised by the goal-conditioned policy. In the trace above, the goal policy has been called five times with different goal vectors.

D Experimental setting
D.1 Experiment setup
Figure 8: Our multi-task fetch arm environment. The agent first learns to manipulate all blocks on the table. Then it learns to perform a large set of manipulation tasks. These tasks require abstraction as well as precision. The agent learns to move blocks to zones depending on the block colour, to stack all blocks together, and to stack blocks of the same colour in the corresponding zone.

We base our experiments on a set of robotic tasks with a continuous action space. Due to the lack of any long-horizon hierarchical multi-task benchmarks, we extended the OpenAI Gym Fetch environment [8] with tasks exhibiting such requirements. We reused the core of the
PICK_AND_PLACE environment and added two colored zones as well as four indexed colored blocks. We did not modify the environment physics. The action space is A = [-1, 1]^4; it corresponds to a continuous command (dx, dy, dz, dα) for the arm gripper. In this setting, the observation space is the same as the state space and contains all block and robot joint positions together with their linear and angular velocities. Blocks are ordered in the observation, which removes the need to provide their color or index. More precisely, a state s_t has the form s_t = [X_t^1, X_t^2, X_t^3, X_t^4, Y], where X_t^i is the position (x, y, z) of block i at time step t and Y contains additional information about the gripper and the velocities. When executing atomic programs, we assume that we always aim to move the block which is at the first position in the observation vector. Then, to move any block j, we apply a circular permutation to the observation vector so that block j arrives at the first position in the observation.

We consider two different initial state distributions ρ in the environment. During goal-conditioned policy training, the block at the first position in the observation vector starts in the gripper with a fixed probability, under the gripper with another fixed probability, and otherwise elsewhere, uniformly on the table. All other blocks may start anywhere on the table. Once the first block is positioned, we place the other blocks in a random order: each newly placed block is either put on top of another block, with a fixed probability, or put on the table, in which case its position is sampled uniformly on the table until there is no collision. During training, we anneal these probabilities, so that by the end of training all blocks are always initialised at random positions on the table.
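The circular permutation of the observation can be implemented with a simple roll of the block-wise part of the state. The sketch below is our own illustration (block positions stored as an (n_blocks, 3) array, remaining information in a separate vector), not the paper's code.

import numpy as np

def permute_observation(block_positions, extra, j):
    """Circularly permute the block part of the observation so that block j
    comes first; the remaining information (gripper, velocities) is unchanged."""
    permuted_blocks = np.roll(block_positions, shift=-j, axis=0)
    return np.concatenate([permuted_blocks.flatten(), extra])

# Example: with 4 blocks, asking to move block 2 reorders the blocks as (2, 3, 0, 1).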
We ran all experiments on a machine with 30 CPU cores. We used 28 CPU cores to train the goal-conditioned policy and 10 CPU cores to train the meta-controller. We trained the goal-conditioned policy for 2 days (counting both phases) and the meta-controller for 2 days as well.

D.3 Hyper-parameters
We provide below all the hyper-parameters used in our experiments, for our method as well as for the baselines.
Notation | Description | Value

Atomic Programs
 | DDPG actor hidden layers size | 256/256
 | DDPG critic hidden layers size | 256/256
γ | discount factor | 0.99
batch_size | batch size (number of transitions) | 256
lr_actor | actor learning rate |
lr_critic | critic learning rate |
replay_k | ratio of resampled transitions in HER | 4
n_actors | number of actors used in parallel | 28
n_cycles_per_epoch | number of cycles per epoch | 40
n_batches | number of updates per cycle | 40
buffer_size | replay buffer size (number of transitions) |
n_epochs_phase_1 | number of epochs for phase 1 | 100
n_epochs_phase_2 | number of epochs for phase 2 | 150

Behavioural Model
 | model hidden layers size | 512/512
dataset_size | number of episodes collected to create the dataset | 50000
n_epochs | number of epochs | 500

AlphaNPI
 | observation module hidden layer size | 128
P | program embedding dimension | 256
H | LSTM hidden state dimension | 128
S | observation encoding dimension | 128
γ | discount factor to penalize long-trace rewards | 0.97
n_simu | number of simulations in the tree in exploration mode | 100
n_simu_exploit | number of simulations in the tree in exploitation mode | 5
n_updates_per_episode | number of gradient descent updates per episode played | 2
n_ep | number of episodes at each iteration | 20
n_val | number of episodes for validation | 10
n_iteration | number of iterations | 700
c_level | coefficient to encourage the choice of higher-level programs | 3.0
n_actors | number of actors used in parallel | 10
c_puct | coefficient to balance exploration/exploitation in MCTS |
batch_size | batch size (number of trajectories) | 16
n_buf | maximum size of the memory buffer (number of full trajectories) | 100
p_buf | probability to draw a positive-reward experience from the buffer | 0.5
lr | learning rate |
τ | tree policy temperature coefficient | 1.3
ε_d | AlphaZero Dirichlet noise fraction | 0.25
α_d | AlphaZero Dirichlet distribution parameter | 0.03

Table 5: Hyperparameters.
E Additional Results
E.1 Results of goal policy training
We train the goal-conditioned policy π_θ^atomic in two consecutive phases. In the first phase, we sample the initial states and goal vectors randomly. During the second phase, we change the initial and goal state distributions to ensure that the goal-conditioned policy still performs well when called sequentially by the meta-controller. For initial states, with probability p we do not reset the initial state of the environment between episodes, meaning that the initial state of some episodes is the final state reached in the previous episode. To simplify this process in a distributed setting, we store the final states of all episodes in a buffer and, in each episode, with probability p we sample a state from this buffer to serve as the initial state. Additionally, instead of sampling random goal vectors, we randomly select an atomic program p_i and, given the initial state, compute the goal vector using the goal-setter. We study the impact of the reset probability p as well as the usefulness of the second phase in Figure 9.
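The following sketch illustrates, under our own naming (ResetBuffer, sample_initial_state), how the second-phase initial-state distribution described above could be implemented in a distributed setting; it is an illustration, not the authors' code.

import random

class ResetBuffer:
    """Stores final states of past episodes so that, with probability p,
    a new episode can start from where a previous one ended."""
    def __init__(self, capacity=10000):
        self.states = []
        self.capacity = capacity

    def add(self, final_state):
        self.states.append(final_state)
        if len(self.states) > self.capacity:
            self.states.pop(0)

    def sample_initial_state(self, env, p):
        # With probability p, reuse a stored final state; otherwise reset the
        # environment to draw a fresh initial state from its own distribution.
        if self.states and random.random() < p:
            return random.choice(self.states)
        return env.reset()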
Figure 9: Left: We compare the goal-conditioned policy's performance on executing the atomic programs STACK and MOVE_TO_ZONE, as well as on two multi-step sequential atomic program executions, after completion of the second phase with different reset probabilities p. Right: We compare the performance of the goal-conditioned policy after the first phase with its performance after the second phase. While the agent already achieves good performance on atomic programs after the first phase, the second phase is required to be able to sequence them (as observed in the performance on multi-step tasks). The reset probability also affects performance on multi-step tasks: it trades off asymptotic training performance against the agent's ability to execute programs sequentially.
We represent the self-behavioural model Ω_w with a 2-layer MLP that takes as input the initial environment state and the atomic program and predicts the environment state after T = 50 steps. Learning this model enables us to imagine the state of the environment following an atomic program execution, and hence to avoid any further calls to atomic programs, each of which would have to perform many actions in the environment. To train the model, we play N episodes with the goal-conditioned policy in the environment on uniformly sampled atomic programs and record the initial and final environment states in a dataset. Then, we train the model to minimize its prediction error, computed as a mean squared error over this dataset. We study the impact of N in Figure 10.
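For concreteness, a minimal PyTorch-style sketch of such a jumpy self-model and its MSE training objective is given below; the hidden layer sizes follow Table 5 (512/512), but the module and variable names (SelfModel, state_dim, n_programs) and the program embedding are our own assumptions.

import torch
import torch.nn as nn

class SelfModel(nn.Module):
    """Jumpy self-behavioural model: (state, atomic program index) -> predicted state
    after the goal-conditioned policy has been rolled out for T steps."""
    def __init__(self, state_dim, n_programs, hidden=512, emb_dim=16):
        super().__init__()
        self.program_embedding = nn.Embedding(n_programs, emb_dim)  # embedding size is an assumption
        self.net = nn.Sequential(
            nn.Linear(state_dim + emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, program_idx):
        program_emb = self.program_embedding(program_idx)
        return self.net(torch.cat([state, program_emb], dim=-1))

# Training objective: mean squared error between predicted and recorded final states, e.g.
#   loss = nn.functional.mse_loss(model(initial_states, program_indices), final_states)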
We compare training the meta-controller with random program sampling against training with an automated curriculum based on a learning progress signal, as detailed in Section B.3.
Figure 10: To train the self-behavioural model Ω_w, we construct a dataset by recording the initial and final states of N episodes of the trained goal-conditioned policy's behaviour. At the beginning of each episode, an atomic program p_i is chosen at random. In this figure, we study the impact of the number of episodes N.

We observe in Figure 11 that both random sampling and the automated curriculum give rise to a natural hierarchy over programs. The MOVE_ALL_TO_ZONE and STACK_ALL_TO_ZONE programs are learned first, and the traces generated by AlphaNPI-X after training confirm that these programs are used to execute the more complex programs CLEAN_AND_STACK and CLEAN_TABLE. Given the results shown in Figure 11, we cannot conclude that the learning-progress-based curriculum outperforms random sampling.
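The exact learning progress signal is specified in Appendix B.3; the snippet below is only a generic sketch of such a curriculum, in which programs are sampled in proportion to the absolute change of their recent success rate (the names and the smoothing scheme are our own assumptions).

import random

def sample_program(programs, success_history, eps=0.1):
    """Curriculum sketch: favour programs whose success rate changed recently
    (positive or negative learning progress); eps keeps some exploration."""
    progress = []
    for p in programs:
        old, new = success_history[p]          # success rates over two recent windows
        progress.append(abs(new - old) + eps)  # absolute learning progress
    total = sum(progress)
    weights = [lp / total for lp in progress]
    return random.choices(programs, weights=weights, k=1)[0]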
Figure 11: Evolution of AlphaNPI-X performance (mean reward against number of training epochs) for the programs STACK_ALL_BLOCKS, CLEAN_AND_STACK, CLEAN_TABLE, MOVE_ALL_TO_ZONE and STACK_ALL_TO_ZONE, comparing two curriculum strategies. Solid lines correspond to an agent trained with randomly sampled programs, while dotted lines correspond to programs selected according to a learning-progress-based automatic curriculum.
E.4 Sample efficiency
We compare our method's sample efficiency with that of other works on the Fetch environment. For training the goal-conditioned policy, 28 DDPG actors sample 100 episodes per epoch in parallel, so 2800 episodes are sampled per epoch. In total, we train the agent for 250 epochs, resulting in 7e5 episodes. To train the self-behavioural model, we show that sampling 5e4 episodes is enough to perform well. Thus, in total, we only use 7.5e5 episodes to master all tasks. Note that training the meta-controller does not require any interaction with the environment. In comparison, the original HER paper [3] reports that a large number of episodes is required to master each task alone, and training the agent in CURIOUS [9] likewise requires many episodes. Finally, in [21], where the agent is trained only to perform the task STACK_ALL_BLOCKS, on the order of 1e6 episodes are required.
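The episode counts above follow directly from the hyper-parameters in Table 5; the short computation below simply makes the arithmetic explicit.

# Goal-conditioned policy training: 28 actors, 100 episodes per actor per epoch,
# 100 + 150 = 250 epochs over the two phases (Table 5).
policy_episodes = 28 * 100 * (100 + 150)      # 700_000, i.e. 7e5

# Self-behavioural model dataset: 50 000 recorded episodes (Table 5).
model_episodes = 50_000                       # 5e4

total_episodes = policy_episodes + model_episodes   # 750_000, i.e. 7.5e5
print(policy_episodes, model_episodes, total_episodes)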