Towards Hierarchical Task Decomposition using Deep Reinforcement Learning for Pick and Place Subtasks
Luca Marzari, Ameya Pore, Diego Dall'Alba, Gerardo Aragon-Camarasa, Alessandro Farinelli, Paolo Fiorini
Abstract— Deep Reinforcement Learning (DRL) is emerging as a promising approach to generate adaptive behaviors for robotic platforms. However, a major drawback of using DRL is the data-hungry training regime that requires millions of trial and error attempts, which is impractical when running experiments on robotic systems. To address this issue, we propose a multi-subtask reinforcement learning method where complex tasks are decomposed manually into low-level subtasks by leveraging human domain knowledge. These subtasks can be parametrized as expert networks and learned via existing DRL methods. Trained subtasks can then be composed with a high-level choreographer. As a testbed, we use a pick and place robotic simulator to demonstrate our methodology, and show that our method outperforms an imitation learning-based method and reaches a high success rate compared to an end-to-end learning approach. Moreover, we transfer the learned behavior to a different robotic environment, which allows us to exploit sim-to-real transfer and to demonstrate the trajectories on a real robotic system. Our training regime is carried out on a central processing unit (CPU)-based system, which demonstrates the data-efficient properties of our approach.

Department of Computer Science, University of Verona, Verona, Italy. Automatic Control Department, Universitat Politècnica de Catalunya, Barcelona, Spain. Computer Vision and Autonomous group, School of Computing Science, University of Glasgow, UK. + Corresponding author: [email protected]
I. INTRODUCTION
Robot learning has been an emerging paradigm since the advent of Deep Reinforcement Learning (DRL), with breakthroughs in dexterous manipulation [1], grasping [2], and navigation and locomotion tasks [3]. However, a major barrier to the universal adoption of DRL in robotics is the data-hungry training regime that requires millions of trial and error attempts, which is impractical on real robotic hardware. Existing DRL methods learn complex tasks end-to-end, leading to overfitting of training idiosyncrasies, which makes them sample inefficient and less adaptable to other tasks [4].

Imitation learning based methods, such as Behavior Cloning (BC), have proven to be efficient with respect to model-free DRL methods since they allow robots to learn from reference trajectories [5]. However, collecting demonstrations requires low-level action inputs from humans and specialized hardware and instrumentation, such as virtual reality or teleoperation units. Moreover, robots learn from trajectories that carry human constraints and biases. In this paper, our underlying hypothesis is that we can leverage human domain knowledge to segment complex tasks into high-level abstract subtasks, and that these subtasks can be learned using a DRL approach. Human bias in the high-level segmentation makes learning more tractable. Moreover, the low-level trajectories that the robot learns take into account the environment and the mechanical constraints of the robot. This hypothesis is closely related to Hierarchical Reinforcement Learning (HRL) approaches that operate multiple policies at different temporal scales. However, a major drawback of these approaches is that learning end-to-end continuously changes the high-level policy, which in turn leads to an unstable lower-level policy [6]. Therefore, we train the low-level policies independently from the high-level policy.

Fig. 1. Summary diagram of the hierarchical architecture proposed in this paper. The pick and place task is divided into Low-level Subtask Experts (LSE), namely approach, manipulate and retract. These subtasks are coordinated using a High-Level Choreographer (HLC).

In this work, we consider a robotic pick and place task and segment the task into simpler subtasks, namely approach, manipulate and retract. These subtasks are trained independently using a DRL policy: Deep Deterministic Policy Gradient (DDPG) [7] combined with Hindsight Experience Replay (HER) [8]. These subtasks are coordinated by a choreographer network that learns to sequence complex behaviors from subtasks (see Fig. 1). We motivate our approach by taking inspiration from how we, humans, learn new skills and behaviors. That is, we learn complex tasks by segmenting them into simpler behaviors and learning each of them separately. As an example of the complex motor skills involved in the game of basketball, during training the athlete learns to execute dribbling, passing and shooting.
These low-level skills can then be combined to produce complex skills such as assisting, attacking and freeball, to name but a few [9].

Furthermore, we show the successful transfer of the learned policies to a different robotic simulation scene. Finally, we demonstrate the translation of the policies learned in simulation to a real robotic system and measure its success in grasping different objects. Our contributions are therefore twofold: first, we propose a training methodology that uses a DRL approach for learning subtasks; and second, a real robot demonstration of the proposed framework.

The outline of this paper is as follows. In Sec. II, we summarize studies in the literature that are close to this work. The proposed training methodology is explained in Sec. III. In Sec. IV, we describe the experiments performed to validate our approach, and the results obtained are presented in Sec. V. Finally, we conclude and provide directions for future work in Sec. VI.

II. RELATED WORK
Learning from demonstrations: An alternative way to make end-to-end DRL algorithms efficient and learn human-like behavior is to Learn from Demonstrations (LfD). Vecerik et al. [10] used demonstrations to fill a replay buffer and provide the agent with prior knowledge for a DDPG policy. Nair et al. [11] showed that task demonstrations can be used to provide reference trajectories for DRL exploration. They introduced a Behavior Cloning (BC) loss in the DRL optimization function and showed that the resulting agent is more efficient than a baseline DDPG method. Similarly, Goecks et al. [12] proposed a two-phase combination of BC and DRL, where demonstrations are used to pretrain the network, followed by training a DRL agent to produce an adaptable behavior. In this paper, we experiment with the latter for training expert subtasks.
Task decomposition: The idea of splitting a task into subtasks has been reported in the literature [13], [14], in which these subtasks are then choreographed to output a complex behavior. Recently, Hierarchical Reinforcement Learning (HRL) has emerged as a reinforcement learning setting where multiple agents can be trained at various levels of temporal abstraction [15] and learn different subtasks following an end-to-end training paradigm. HRL consists of training agents such that low-level agents encode primitive motor skills while a higher-level policy selects which low-level agents are to be used to complete a task [6], [16]. Similarly, Beyret et al. [17] studied an explainable HRL method for a robotic manipulation task that uses Hindsight Experience Replay (HER) in a high-level agent that decides the goals given as input to the low-level policy. In these works, hierarchical policies are learned end-to-end and thus exhibit instability leading to sample inefficiency, i.e., the lower-level policy changes under a non-stationary high-level policy.
Multi-subtask approaches: Yang et al. [18] proposed to use sets of pretrained motor skills parametrized by deep neural networks. From these pretrained motor skills, a gating network learns to fuse the networks' weights to generate various multi-legged locomotion tasks. In this work, we do not use network fusion; rather, we combine the subtasks using a choreographer. Our method closely resembles [19], which develops a reactive reinforcement learning architecture and demonstrates that each subtask can be learned independently and combined using a choreographer. Pore et al. [19] use a BC approach to train expert subtask networks, which requires collecting demonstrations or hand-engineering solutions. In contrast, we use a DRL-based approach to learn the subtasks and deviate from using a common backbone for feature extraction of the input values.

III. METHODS
We consider the pick and place robotic task, where the goal of the robot is to grasp a block randomly sampled in the environment within the reach of the robot and to place it at a target location. Our underlying hypothesis is that the high-level domain knowledge of a human operator can be leveraged to provide a prior understanding to the robotic agent. Hence, based on human knowledge, the task is decomposed into three abstract subtasks: approaching the current object position, manipulating the object to grasp it, and retracting the object to a target position. For learning the complete pick and place task, we consider two Markov Decision Processes (MDPs) at different levels of temporal hierarchy. The higher-level agent, called High-Level Choreographer (HLC), acts at the level of subtasks and learns a policy to choreograph the subtasks, whereas a low-level agent, called Low-level Subtask Expert (LSE), learns the policy for low-level actions inside the subtask. Since a hierarchical task decomposition is used, there are two different goals during each episode: a subtask goal that is used to train the LSE, and the final task goal that is used to train the HLC. In Sec. III-A, we describe the training strategy for the LSE agent and in Sec. III-B, we describe the HLC training strategy.

When we consider a complete task such as pick and place, it is obvious to a human how to split it into the individual subtasks that make it up. That is, when we are faced with the decomposition of this task, we use what is called prior knowledge [20]. Since a robot does not have this prior knowledge, Brooks's subsumption architecture [14] has been adopted and the complex task has been manually divided into its elementary components. Source code for this paper is available at https://github.com/LM095/Multi-Subtask-DRL-for-Pick-and-Place-Task.

A. Training the Low-level Subtask Expert (LSE)

The aim of an LSE is to learn an optimal policy and task representations to accomplish its specific subtask. For this, we define an MDP formulation for the LSE as follows. Inside each subtask u_i (where i indexes the subtasks), at every timestep t the agent receives a state input S_t from the environment E, executes an action a_t and moves to the next state S_{t+1}. We use a DDPG+HER training paradigm to learn the LSE policy π_{θ_i}, since off-policy learning methods are sample-efficient in continuous action spaces [8], [21]. The state inputs to the agent are vector observations that provide the robot's and object's kinematic details, such as the position, velocity, and orientation of the block and the gripper. The action output of the LSE network consists of x, y and z positions. Each LSE is parametrized by a neural network that consists of three fully connected layers with ReLU activation functions and one final linear output layer, with a Tanh activation function in the case of the actor and without an activation function in the case of the critic. A detailed picture of the LSE architecture is depicted in Fig. 2.

This method ensures that each subtask network is an expert in that specific behavior. Each of these networks is trained independently. Once the success of picking up the block reaches 100%, we stop the training. We use a dense reward function that returns the negative distance between the achieved goal and the subtask goal at each step. That is, approach's goal is the position above the block, manipulate's goal is the base of the object to grasp, and retract's goal is the target point where we want to place the object.
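To make the learning signal concrete, the sketch below shows the dense subtask reward just described together with an illustrative HER-style goal relabelling step. This is a hedged sketch rather than the authors' code: the transition tuple layout, the number of relabelled goals k, and the helper names subtask_reward and relabel_with_hindsight are assumptions for illustration; only the negative-distance reward follows directly from the text.

```python
import numpy as np

def subtask_reward(achieved_goal, subtask_goal):
    """Dense reward: negative Euclidean distance between the achieved goal
    (e.g. the gripper or object position) and the current subtask goal."""
    return -float(np.linalg.norm(np.asarray(achieved_goal) - np.asarray(subtask_goal)))

def relabel_with_hindsight(episode, k=4, rng=None):
    """Illustrative HER 'future' relabelling: every transition is also stored k
    times with its goal replaced by a goal actually achieved later in the
    episode, so failed rollouts still provide useful learning signal for DDPG."""
    rng = rng or np.random.default_rng()
    relabelled = []
    for t, (obs, action, achieved, goal, next_achieved) in enumerate(episode):
        # original transition with the true subtask goal
        relabelled.append((obs, action, goal, subtask_reward(next_achieved, goal)))
        # k additional copies with hindsight goals sampled from the future
        for f in rng.integers(t, len(episode), size=k):
            hindsight_goal = episode[f][2]
            relabelled.append((obs, action, hindsight_goal,
                               subtask_reward(next_achieved, hindsight_goal)))
    return relabelled
```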
Fig. 2. Network architecture of the LSE (right) and the HLC (left). The HLC actor network output activates one of the LSE networks. Each LSE network outputs action movements that are applied to the environment.
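A minimal PyTorch rendering of the LSE architecture just described might look as follows; the hidden-layer width, the observation and action dimensions, and the state-action concatenation in the critic are assumptions, since the paper only specifies three fully connected ReLU layers and the output activations.

```python
import torch
import torch.nn as nn

class LSEActor(nn.Module):
    """Three fully connected ReLU layers, Tanh on the output (actions in [-1, 1])."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class LSECritic(nn.Module):
    """Same backbone, linear output without activation (Q-value of a state-action pair)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))
```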
B. High Level Choreographer (HLC)
After the subtask networks are trained, we establish an HLC that learns a policy to temporally choreograph the subtasks to complete the task. The HLC agent, in a state s_t', activates a subtask u_i and receives a reward r_t'. As a consequence, the agent moves to a state s_{t'+1} that corresponds to the state after completing the activated subtask. Note that the notations t and t' indicate the temporal hierarchy, i.e., t refers to a timestep of the LSE networks, while t' refers to a timestep of the HLC. We use the network architecture introduced in [19], which consists of a recurrent network followed by two independent fully connected layers that serve as the actor and the critic. A schematic representation is depicted in Fig. 2.

Since the output of the HLC network consists of discrete action values, we use an A3C training strategy to learn the HLC policy. We use generalized advantage estimation [22] to improve data-efficiency and reduce the variance in the trajectories. We define a reward function r_t' that provides a positive reward when the final task goal is achieved and -1 in all other cases.
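For reference, the generalized advantage estimate mentioned above can be computed as in the sketch below; the discount factor gamma and the GAE parameter lam are placeholder values, not numbers taken from the paper.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2015).
    rewards: r'_1 ... r'_T received by the HLC at its (coarse) timesteps.
    values:  V(s'_1) ... V(s'_{T+1}) from the critic head (one extra bootstrap value).
    Returns per-step advantages A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```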
IV. EXPERIMENTS
To validate our hypothesis, we perform two sets of experiments. First, we carry out a comparative study between our LSE approach, LSEs trained with a BC method, and an end-to-end approach. Second, we deploy our method in a different robotic environment and show the successful translation of the learned policy to a real robot without fine-tuning the policies. Note that all training is carried out on an Intel Core i7 9th Gen CPU-based system. Our motivation is to deviate from the GPU-based training resources commonly used in DRL research and to show that a simple CPU-based computing system can be employed to learn analogous robotic behaviors quickly.
Fig. 3. Different environments used for the experiments: (a) FetchPickAndPlace-v1, (b) PandaPickAndPlace-v0, (c) Franka Emika robot used for the real robot demonstrations.
A. Simulation experiments: FetchPickAndPlace-v1
In the first experiment, we use the MuJoCo simulation engine environment FetchPickAndPlace-v1, which comprises the Fetch robot (see Fig. 3a). Each LSE is trained independently instead of training the LSEs concurrently as described in [19]. In order to use a dense reward for each LSE, we modified the original step function [23]. Our new env.step() function requires two parameters: the action to be taken in the environment (as before) and the goal that the agent is trying to achieve (a sketch of such a wrapper is given at the end of this subsection).

To evaluate LSE actions, we add hand-engineered solutions for the other subtasks. That is, we start by training the approach subtask and use hand-engineered solutions for the manipulate and retract subtasks, respectively. Once approach reaches a high success rate, its weights are saved and we use a similar strategy to train the manipulate and retract subtasks. This strategy allows us to estimate the overall pick and place success of the robot while it is learning one of the subtasks. Note that, contrary to [19], where hand-engineered solutions are used during training, in our methodology these solutions are used only for evaluation and are not required in the actual training. Finally, after each of the subtasks is trained, we load the weights of the networks and train the HLC to temporally choreograph the subtasks. For each LSE, we show the training performance of two methods, trained via DDPG+HER and BC. For the end-to-end strategy, we use the DDPG+HER method.

Fig. 4. Pick and place task: (a) accomplished with the end-to-end learning strategy with DDPG+HER and with our LSE DDPG+HER strategy; (b) failure with a thin cylindrical object for the end-to-end strategy; (c) success with a thin cylindrical object for the agent trained with our LSE strategy; (d) failure with a small box object for the agent trained with the end-to-end strategy; (e) success with a small box object for the agent trained with our LSE strategy.
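The modified step function referenced above can be approximated by a thin wrapper such as the one below. This is a hedged sketch: the exact signature in the released code may differ, the success threshold is assumed, and the way the achieved goal is read from the observation dictionary follows the standard goal-based Gym interface rather than the authors' implementation.

```python
import numpy as np
import gym

class SubtaskGoalWrapper(gym.Wrapper):
    """step(action, goal): execute the action as usual, but return a dense reward
    equal to the negative distance between the achieved goal and the subtask goal
    supplied by the caller (approach, manipulate or retract)."""

    def step(self, action, goal):
        obs, _, done, info = self.env.step(action)   # the original sparse reward is discarded
        achieved = np.asarray(obs["achieved_goal"])  # goal-based Gym observation dictionary
        distance = float(np.linalg.norm(achieved - np.asarray(goal)))
        info["is_success"] = float(distance < 0.05)  # assumed success threshold
        return obs, -distance, done, info

# Illustrative usage:
# env = SubtaskGoalWrapper(gym.make("FetchPickAndPlace-v1"))
# obs, reward, done, info = env.step(action, subtask_goal)
```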
B. Real robot experiments

For the second part of our experiments, we transfer the whole methodology presented in Sec. III to another simulation environment, called PandaPickAndPlace-v0 (https://github.com/qgallouedec/panda-gym), which comprises the Franka Emika robot. We use a different environment to show the ease of transfer to a different robotic system and to demonstrate a real robotic testbed. Note that the observation space used in this environment is similar to that of FetchPickAndPlace-v1; however, we cannot measure all the parameters considered in this observation space on the real robotic system, such as the object position throughout the episode or the distance between the gripper and the object. Therefore, we simplify the observation in PandaPickAndPlace-v0 and consider only the variables that can be measured on the real robotic system. Hence, for both the simulation (PandaPickAndPlace-v0) and the real robot, we consider the current pose of the gripper, the initial pose of the object, and the state of the joints of the gripper.

We establish the communication pipeline between the simulation environment and the real robot using a Robot Operating System (ROS) node that is interfaced with the MoveIt framework (https://moveit.ros.org/). The poses generated by the actions in the PandaPickAndPlace-v0 environment are processed by MoveIt to generate the complete trajectory while observing the physical constraints of the real robot. We apply a homogeneous transformation to change the reference frame, which lies at the gripper center in the simulation scene, to the Panda base frame of the real robot (a sketch of this frame change is given at the end of this subsection).

Our hypothesis is that, using the LSE approach, we can reuse the subtasks and fine-tune the LSE gripper closure in order to grasp different types of objects, contrary to an end-to-end approach that would require complete retraining. Therefore, in the real robotic setup, we compare the success in grasping different objects using our method and an end-to-end learning method. Our objects include two geometrically shaped objects whose dimensions differ from the block used in the training procedure (see Fig. 4). The LSE approach provides the possibility to change one of the subtasks without affecting the other trained subtasks; using a subset of behaviors is not possible with an end-to-end learned behavior. Hence, in the LSE method, we use the LSEs trained on the block pick and place task and fine-tune the grasping for the retract subtask, whereas we directly deploy the behaviors learned end-to-end.
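The change of reference frame mentioned above amounts to left-multiplying poses by a fixed 4x4 homogeneous transform. The sketch below is illustrative only: the actual transform between the simulated gripper-centred frame and the Panda base frame is calibration-dependent, and the numbers shown are placeholders, not measured values.

```python
import numpy as np

def make_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and a translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def to_panda_base(p_sim: np.ndarray, T_base_sim: np.ndarray) -> np.ndarray:
    """Express a 3-D point given in the simulated gripper-centred frame in the robot base frame."""
    p_h = np.append(p_sim, 1.0)        # homogeneous coordinates
    return (T_base_sim @ p_h)[:3]

# Placeholder calibration: identity rotation and an assumed offset of the
# simulation origin with respect to the Panda base frame.
T_base_sim = make_transform(np.eye(3), np.array([0.30, 0.0, 0.10]))
target_in_base = to_panda_base(np.array([0.05, -0.02, 0.15]), T_base_sim)
```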
V. RESULTS

Fig. 5 depicts the sample-efficiency of the LSE strategy trained via DDPG+HER, Behavior Cloning (BC), and an end-to-end learning method. Each peak represents the maximum success reached by each method for each subtask, i.e., the first peak denotes the completion of training of the approach subtask, the second peak denotes the completion of training of the manipulate subtask, and the third peak denotes the training of the retract subtask. DDPG+HER outperforms BC and reaches 100% success in 255k steps, while BC takes 372k steps. Moreover, DDPG+HER shows a smooth, monotonous learning curve compared to BC, which does not stabilize immediately after reaching high success values. Overall, DDPG+HER shows less variance than BC. There is a significant difference between the learning curves for the retract behavior: retract is a temporally elongated subtask compared to the other subtasks, i.e., out of the 50 timesteps in an episode, the retract subtask takes 25 timesteps. BC takes more time to learn because it suffers from the compounding error caused by covariate shift in temporally elongated tasks [24]. Hence, DDPG+HER learns the retract behavior faster, while BC takes more training steps. Both BC and DDPG+HER significantly outperform the end-to-end learner, which reached approximately 25% success in 400k steps. This verifies our hypothesis that learning subtasks individually is highly sample and data-efficient compared to an end-to-end learning setting.

Fig. 5. Performance comparison of our training strategy using DDPG+HER, BC, and end-to-end (e2e) learning. Each experiment is executed independently three times with different seeds. Success is quantified as the percentage of successful grasps as a function of training steps.
TABLE I
PERFORMANCE OF METHODS FOR THE SAME LEVEL OF SUCCESS RATE

                          Number of steps                          Total time
Method                    LSE1    LSE2    LSE3    HLC     Total
DDPG+HER end-to-end       -       -       -       -       1.4M    ~25 min
DDPG+HER LSE              150k    30k     75k     98k     353k    ~18 min
Table I shows a comparison of the training performance of the methods presented in this work. In particular, we analyze three possible strategies: the first is an end-to-end approach using DDPG+HER with a sparse reward, the second is a subtask approach using BC, and the last is the new methodology proposed in this paper. For the strategies that use subtasks, we define LSE1 as approach, LSE2 as manipulate, and LSE3 as retract. DDPG+HER using subtask decomposition is the best performing approach, and our results suggest that, when following the subtask approach, training can be more effective with a DRL algorithm than with supervised BC imitation learning. The behavior learned by DDPG+HER is more robust and does not require the collection of expert demonstrations, which can be time-consuming and often reflects less variability.

We analyze the actions learned by the LSE policies and by an end-to-end policy in Fig. 6. For this, we take the trained LSEs (i.e., with a 100% success rate for each subtask) and the end-to-end model, respectively, and analyze their activation patterns in Cartesian space for 10 episodes. Note that the initial environment conditions are the same for both policies. Fig. 6a shows the specialized subtask activation patterns of ground-truth hand-engineered solutions. Figs. 6b and 6c depict the actions learned using our proposed approach and the end-to-end learning strategy, respectively. The actions generated by the LSE networks are in the vicinity of the hand-engineered actions (see Fig. 6a), indicating that the learned behavior is specialized to the particular subtask. There is a slight deviation between the manipulate activations of the hand-engineered and learned behaviors. This can be attributed to the fact that the manipulate activations are near-zero values, and predicting values correct to several decimal places would indicate overfitting. The network activations for the end-to-end approach do not show any particular pattern. The plot verifies our hypothesis that the LSE approach makes the task tractable compared to an end-to-end approach.

For the real robot experiments, using the subtask approach the robot is able to pick up different objects, whereas using the end-to-end training method the robot can only complete the block pickup it has been trained on and fails in grasping all other objects (Fig. 4). The LSE approach allows us to fine-tune the LSE gripper closure for a particular subtask (in this case retract) in order to grasp different types of objects, which is not possible with an end-to-end policy.
This verifies that the subtask approach can generate robust behavior by fine-tuning a subset of the subtasks. We refer the reader to the attached supplementary video.

VI. CONCLUSIONS

In this work, we show that a high-level task representation of human knowledge can be leveraged to decompose a pick and place task into multiple subtasks. These subtasks can be learned independently via specialized expert networks using a DRL-based policy. We present a training strategy that does not require the use of demonstrations and is sample-efficient compared to a previously studied imitation learning-based method. This training strategy allows us to train the pick and place behavior on a simple CPU-based system in approximately 20 minutes. Furthermore, we transfer the learned policies to a different robotic environment and to a real robotic system, and demonstrate their successful translation from the simulated scene to the real robotic system. Using our approach, the real robotic system is able to grasp different geometric shapes. Future work will focus on decomposing the task in a self-supervised fashion. Further, we will expand the repertoire of subtasks, which can be fused differentially to adapt to new tasks.

Fig. 6. LSE specialization analysis using different training strategies. Samples represent activation patterns using (a) hand-engineered solutions, (b) our subtask approach, and (c) an end-to-end strategy, for 10 episodes.
ACKNOWLEDGMENT

The authors would like to acknowledge the support from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 813782 (ATLAS) and under grant agreement No. 742671 (ARS).
REFERENCES

[1] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421-436, 2018.
[2] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3389-3396.
[3] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine, "Learning to walk via deep reinforcement learning," arXiv preprint arXiv:1812.11103, 2018.
[4] A. Zhang, N. Ballas, and J. Pineau, "A dissection of overfitting and generalization in continuous reinforcement learning," arXiv preprint arXiv:1806.07937, 2018.
[5] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, "Deep imitation learning for complex manipulation tasks from virtual reality teleoperation," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 5628-5635.
[6] O. Nachum, S. Gu, H. Lee, and S. Levine, "Data-efficient hierarchical reinforcement learning," arXiv preprint arXiv:1805.08296, 2018.
[7] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in International Conference on Machine Learning. PMLR, 2014, pp. 387-395.
[8] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, "Hindsight experience replay," arXiv preprint arXiv:1707.01495, 2017.
[9] H. Jia, C. Ren, Y. Hu, Y. Chen, T. Lv, C. Fan, H. Tang, and J. Hao, "Mastering basketball with deep reinforcement learning: An integrated curriculum training approach," in Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020, pp. 1872-1874.
[10] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," arXiv preprint arXiv:1707.08817, 2017.
[11] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Overcoming exploration in reinforcement learning with demonstrations," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 6292-6299.
[12] V. G. Goecks, G. M. Gremillion, V. J. Lawhern, J. Valasek, and N. R. Waytowich, "Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments," arXiv preprint arXiv:1910.04281, 2019.
[13] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.
[14] R. A. Brooks, "Intelligence without representation," Artificial Intelligence, vol. 47, no. 1-3, pp. 139-159, 1991.
[15] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems, vol. 13, no. 1, pp. 41-77, 2003.
[16] A. Levy, R. Platt, and K. Saenko, "Hierarchical actor-critic," arXiv preprint arXiv:1712.00948, vol. 12, 2017.
[17] B. Beyret, A. Shafti, and A. A. Faisal, "Dot-to-dot: Explainable hierarchical reinforcement learning for robotic manipulation," arXiv preprint arXiv:1904.06703, 2019.
[18] C. Yang, K. Yuan, Q. Zhu, W. Yu, and Z. Li, "Multi-expert learning of adaptive legged locomotion," Science Robotics, vol. 5, no. 49, 2020.
[19] A. Pore and G. Aragon-Camarasa, "On simple reactive neural networks for behaviour-based reinforcement learning," in IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 7477-7483.
[20] R. Dubey, P. Agrawal, D. Pathak, T. L. Griffiths, and A. A. Efros, "Investigating human priors for playing video games," arXiv preprint arXiv:1802.10217, 2018.
[21] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, "Multi-goal reinforcement learning: Challenging robotics environments and request for research," arXiv preprint arXiv:1802.09464, 2018.
[22] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[23] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[24] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 627-635.