Hierarchical Reinforcement Learning for Quadruped Locomotion
Deepali Jain, Atil Iscen, and Ken Caluwaerts
Abstract — Legged locomotion is a challenging task for learning algorithms, especially when the task requires a diverse set of primitive behaviors. To solve these problems, we introduce a hierarchical framework to automatically decompose complex locomotion tasks. A high-level policy issues commands in a latent space and also selects for how long the low-level policy will execute the latent command. Concurrently, the low-level policy uses the latent command and only the robot's on-board sensors to control the robot's actuators. Our approach allows the high-level policy to run at a lower frequency than the low-level one. We test our framework on a path-following task for a dynamic quadruped robot and show that steering behaviors automatically emerge in the latent command space as low-level skills are needed for this task. We then show efficient adaptation of the trained policy to a different task by transferring the trained low-level policy. Finally, we validate the policies on a real quadruped robot. To the best of our knowledge, this is the first application of end-to-end hierarchical learning to a real robotic locomotion task.
I. INTRODUCTION

Locomotion for legged robots is a challenging control problem that requires high-speed control of actuators as well as precise coordination between multiple legs based on various types of sensor data. In addition to basic locomotion, different terrains, tasks, or environmental conditions might require specific primitive behaviors.

Recent research shows promising results for learning-based systems on locomotion tasks in simulation and on real hardware [1], [2], [3]. Various techniques can be used to discover policies for such tasks. In this work, we focus on Reinforcement Learning (RL) to obtain robust policies.

Robot locomotion is an excellent match for hierarchical control architectures. Indeed, separating low-level control of the legs from high-level decision making based on the environment and the task at hand provides multiple advantages, such as reuse of the learned low-level skills across tasks and interpretability of the high-level decisions.

Given a complex task, manually defining a suitable hierarchy is typically a tedious process that requires engineering of the state and action spaces as well as reward functions for each primitive. To overcome this, we introduce a hierarchical framework to automatically decompose complex locomotion tasks. A high-level policy issues commands to a low-level policy and decides for how long to execute the low-level policy at a time. The low-level policy acts according to commands from the high-level policy and on-board sensors. Our approach allows separation of the state variables that are used for low-level control from state variables only required for higher-level control. Our architecture naturally allows the high level to operate at a slower timescale than the low level.
Robotics at Google, 10011 New York, USA. {jaindeepali, atil, kencaluwaerts}@google.com

Fig. 1: Simulated task on the left and the robot performing a hierarchical policy learned in simulation. During execution, the high-level policy executes intermittently to update the latent command for the low-level policy.

We test our framework on a path-following task for a dynamic quadruped robot. The task requires walking in different directions to complete the track while keeping balance. Using our architecture, we train both levels of the hierarchical policy end-to-end. We show that steering behavior automatically emerges in the latent command space between the high-level and low-level policies, which allows reuse of the learned low-level behaviors. We show transfer of the low-level policy to a different track to achieve fast adaptation to a new task. Lastly, we deploy our policies to hardware to validate the learned behaviors on a real robot.

II. RELATED WORK

Hierarchical Reinforcement Learning (HRL) methods focus on decomposing complex tasks into simpler sub-tasks. Not only does this help simplify a single difficult problem, it can also help in adapting the solution faster to a new problem if the sub-tasks are general enough. The framework based on pre-defined options [4], or temporally extended actions, is one of the first popular methods in this direction. More recently, considerable research attention has been given to the problem of automatically discovering options through experience.

In methods like HRL with hindsight [5] and data-efficient HRL [6], hierarchy is introduced using universal value functions (value functions that are parameterized by a 'goal'). Actions of a higher-level policy, running at a fixed slower timescale, act as goals for a lower level. A goal is explicitly defined as a point in observation space and the low level is rewarded for reaching that point. This allows both levels to be trained through their respective reward signals. However, this goal specification is not suitable in all situations. If the observation space is high dimensional, then the high-level task of selecting a goal becomes very difficult. Also, determining when the goal is achieved requires task-specific domain knowledge.

Latent space policies for HRL [7] use a different approach to parameterize the low level. The high level outputs a set of latent variables as a goal for the lower level, learned through maximum entropy reinforcement learning. Both levels are then trained to maximize the main task reward. This, however, prevents the low level from being reused for any other task.

Along similar lines, Osa et al. [8] recently proposed a method based on information maximization to learn latent variables of a hierarchical policy.

In their paper on meta learning shared hierarchies [9], Frans et al. propose an HRL framework that is learned on multiple related tasks. The low-level skills are reused across tasks while the meta-controller is task-specific. Instead of parameterizing a single low-level policy, the meta-controller selects a different low-level policy from a set for each sub-task. In order for general low-level policies to emerge, the framework needs to be trained on a number of related tasks.

In our method, we use a latent goal representation to remove the need to hand-design low-level rewards or to decide on the number of low-level policies.
We also use different state representations for both levels to ensure that reusable low-level skills are learned even when trained on a single task. Moreover, in our method, the high-level policy runs at a variable timescale, easing processing requirements for higher-level state information.

The task of robot navigation lends itself to a hierarchical solution with path-planning at the high level and point-to-point locomotion at the low level. In this context, many methods [10], [11], [12] have been tried to solve these two tasks separately. Heess et al. [11] propose a hierarchical framework for locomotion based on modulated locomotor controllers. A low-level spinal network learns primitive locomotion by training on simple tasks. A high-level cortical network drives behavior by modulating the inputs to the pre-trained spinal network. HRL with pre-trained primitives has also been applied to the task of robot locomotion on rough terrains [13], [14]. In the DeepLoco paper [13], low-level controllers achieve robust walking gaits that satisfy a stepping target. High-level controllers then invoke desired step targets for the low-level controller.

We apply our hierarchical learning method to the robot locomotion task of following a path in 2D. Our method needs neither a specification of timescales for the two levels nor a low-level reward signal. Our end-to-end hierarchical learning framework automatically discovers steering behaviors at the low level which can transfer to a real quadruped robot.

III. METHOD

A. Hierarchical Policy Structure and Execution
Our hierarchical policy is structured as shown in Fig. 2. The high-level policy (HL) receives higher-level observations from the environment and issues commands in a latent space to a low-level policy. The high level also decides the duration for which the low level is executed before the next high-level evaluation. The low-level policy (LL) receives observations from on-board sensors (low-level) and the current latent command from the high level. It outputs actions to execute on the hardware. At the end of the duration set by the high level, the high level is invoked again and the process repeats (Fig. 3). Both high-level and low-level policies in this architecture are neural networks. Algorithm 1 shows how an episode is executed using a hierarchical policy in which the high level and low level have weights φ_h and φ_l respectively.
Fig. 2: Hierarchical policy. The high-level policy with parameters φ_h receives high-level observations o_h and outputs a latent command vector l and a duration d. The low-level policy (parameters φ_l) computes motor commands a based on l and low-level observations o_l. The high-level policy is only evaluated every d steps. The architecture is trained end-to-end.

B. Learning Parameters of a Hierarchical Policy
To jointly learn the parameters of the high-level and low-level neural networks, we optimize a standard reinforcement learning objective. Consider a state space S and an action space A. A sequential decision making or control problem can be modeled as a Markov Decision Process (MDP). An MDP is defined by a transition function P(s_{t+1} | s_t, a_t) and a reward function r(s_t, a_t). A policy π_θ(s), parameterized by a weight vector θ, maps states s to actions a. For a hierarchical policy, θ is the collection of parameters from all levels (θ = {φ_h, φ_l}), and the subsets of state variables observable by the high level and the low level are denoted o_h and o_l respectively. The policy interacts with the MDP for an episode of T timesteps at a time.

Fig. 3: Hierarchical policy evaluation timeline. The high-level policy computes a latent command for the low-level policy and a duration for which to execute the low-level policy. The low-level policy interacts with the hardware at a constant frequency. At the end of the high-level period, the high-level policy receives updated high-level observations and computes a new latent command and duration.
Algorithm 1 Executing a Hierarchical Policy

procedure RunHRL(θ)                    ▷ HRL policy weights
    {φ_h, φ_l} = θ
    o_h ← initial HL observation
    R ← 0                              ▷ Episode reward
    d ← 0                              ▷ LL duration
    while not end of episode do
        if d = 0 then
            o_h ← HL observation
            {d, l} ← f_{φ_h}(o_h)      ▷ Duration, latent command
        a ← f_{φ_l}(o_l, l)            ▷ LL action (motor commands)
        o_l, r ← StepInEnvironment(a)
        d ← d − 1
        R ← R + r
    return R                           ▷ Total reward for the episode
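To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of the execution loop. The `high_level`, `low_level`, and `env` objects and their method names are hypothetical stand-ins, not the actual implementation.

```python
def run_hrl_episode(high_level, low_level, env):
    """Roll out one episode with a two-level hierarchical policy (sketch)."""
    o_l = env.low_level_observation()        # e.g. IMU and trajectory-generator state
    total_reward, d, latent = 0.0, 0, None

    while not env.episode_done():
        if d == 0:                           # high level fires only when the
            o_h = env.high_level_observation()   # previous duration has elapsed
            d, latent = high_level.act(o_h)      # duration (in steps) and latent command
        action = low_level.act(o_l, latent)      # motor commands from on-board sensors
        o_l, reward = env.step(action)
        d -= 1
        total_reward += reward
    return total_reward
```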
The reinforcement learning objective is to maximize the expected total reward at the end of the episode:

$\arg\max_{\theta} \; \mathbb{E}\left[\sum_{t=1}^{T} r\big(s_t, \pi_{\theta}(s_t)\big)\right].$ (1)

We use a simple derivative-free optimization algorithm called Augmented Random Search (ARS) [15] to maximize R. The algorithm proceeds by choosing a number of directions uniformly at random on a sphere in policy parameter space, then evaluating the policy along these directions, and finally updating the parameters along the top-performing directions.
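As a rough illustration of the ARS update (a sketch of the basic method in [15], not the exact variant or hyper-parameters used here), assuming a user-supplied `rollout(params)` function that returns the episode reward R:

```python
import numpy as np

def ars_step(params, rollout, n_dirs=16, n_top=8, step_size=0.02, noise=0.03):
    """One Augmented Random Search update (basic sketch).

    params : 1-D array of policy weights (here: concatenated phi_h and phi_l)
    rollout: function mapping a parameter vector to an episode return R
    """
    deltas = np.random.randn(n_dirs, params.size)            # random search directions
    r_plus = np.array([rollout(params + noise * d) for d in deltas])
    r_minus = np.array([rollout(params - noise * d) for d in deltas])

    # keep only the best-performing directions
    scores = np.maximum(r_plus, r_minus)
    top = np.argsort(-scores)[:n_top]

    grad = np.sum((r_plus[top] - r_minus[top])[:, None] * deltas[top], axis=0)
    sigma = np.std(np.concatenate([r_plus[top], r_minus[top]])) + 1e-8
    return params + step_size / (n_top * sigma) * grad
```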
C. Transferring Low-Level Policies

An interesting aspect of our hierarchical method is that, after learning a policy on one task, the low-level policy can be transferred to a new task from a similar domain. This allows sharing of primitive skills across related problems and is faster than learning from scratch on each task. The low-level policy can be transferred by keeping φ_l fixed after learning on the original task and re-initializing φ_h. Then, during training, only φ_h is updated by ARS.
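In code, this transfer amounts to freezing the low-level parameters and re-optimizing only the high-level ones. Below is a minimal sketch reusing the hypothetical `ars_step` and `run_hrl_episode` helpers from the earlier sketches; `make_policies` and the high-level parameter dimensionality are placeholders, not details from this work.

```python
import numpy as np

def transfer_high_level(phi_l_trained, make_policies, env, n_iters=300, dim_h=12):
    """Re-learn only the high-level policy on a new task (sketch)."""
    phi_h = np.zeros(dim_h)                     # re-initialized high-level weights

    def rollout(phi_h_candidate):
        high_level, low_level = make_policies(phi_h_candidate, phi_l_trained)
        return run_hrl_episode(high_level, low_level, env)

    for _ in range(n_iters):
        phi_h = ars_step(phi_h, rollout)        # phi_l stays fixed throughout
    return phi_h
```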
IV. EXPERIMENTS

A. Task Details

We apply our method to a path-following task for a quadruped robot. For this, we use the Minitaur quadruped robot from Ghost Robotics (ghostrobotics.io). The Minitaur robot has 8 degrees of freedom (2 per leg). The swing and extension of each of the legs is controlled using a PD position controller provided with the robot. We train our policies in simulation using pyBullet [16], [17].

For the locomotion task, we tackle the problem of following a curved path in 2D while staying within the allowed region. The robot is rewarded for moving towards the end of the path. The task requires the robot to steer left and right at different angles. The optimal trajectory of the robot's center of mass is not defined and depends on the robot's anatomy and learned low-level behaviors. Steering poses additional challenges because the legs of the robot can only move in the sagittal plane. The reward function is given by:

$r(t) = d\big(x(t-1), x_{\text{goal}}\big) - d\big(x(t), x_{\text{goal}}\big),$ (2)
$R = \sum_{t} r(t),$ (3)

where $d(\cdot,\cdot)$ is the Euclidean distance, $x$ is the position of the robot, and $x_{\text{goal}}$ is the final position of the path. We terminate an episode as soon as the robot moves out of the path.

To learn locomotion, we use the recent Policies Modulating Trajectory Generators (PMTG) architecture, which has shown success at learning forward locomotion on quadruped robots [2]. The PMTG architecture takes advantage of the cyclic characteristic of locomotion and of leg movement primitives by using trajectory generators. Trajectory generators serve as parameterized functions that provide circular leg positions. The policy is responsible for modulating the generator and adjusting the leg trajectories with a residual as needed. A more detailed explanation of the architecture can be found in the original paper [2]. Our hierarchical policy controls the PMTG architecture, which issues motor position commands.

B. Hierarchical Architecture

As demonstrated in previous work [2], a well-trained linear neural network policy in combination with PMTG can produce locomotion. Therefore, we use linear neural networks for the high-level and the low-level policies. However, we clip the latent command space to $[-1, 1]^{\dim(l)}$, which allows us to more easily study the latent space. The number of dimensions of the latent command, $\dim(l)$, is a hyper-parameter. Note that while the policy networks are linear, PMTG introduces recurrency and non-linearities [2].

We separate the state information into two parts. We only feed the robot's position x and the robot's orientation (yaw direction) into the high-level policy (4-dimensional). The high-level policy outputs the latent command l and a duration d. The low-level policy network observes the 8-dimensional PMTG state (we use 4 trajectory generators, one per leg), 4-dimensional IMU sensor data (roll, pitch, roll rate, pitch rate), and the latent command l from the high-level policy. The outputs of the low-level network are motor positions and PMTG parameters.

The low-level output is updated at every low-level control timestep, and the high level is executed every d low-level steps (where d was computed during the previous high-level cycle). In practice, d is rescaled from the clipped [−1, 1] value to a range of low-level steps starting at 100, so consecutive high-level evaluations are separated by many low-level control cycles. This highly simplifies the process of estimating the position and direction of the robot.
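For concreteness, the progress reward of Eqs. (2)-(3) above is straightforward to implement. This is a minimal sketch; the variable names (`x_prev`, `x_t`, `x_goal` as planar position vectors) are illustrative, and out-of-path termination is assumed to be handled by the environment.

```python
import numpy as np

def path_progress_reward(x_prev, x_t, x_goal):
    """Reward = reduction in Euclidean distance to the end of the path (Eq. 2)."""
    return np.linalg.norm(x_prev - x_goal) - np.linalg.norm(x_t - x_goal)

# Episode return (Eq. 3) is the sum of per-step progress rewards:
# R = sum(path_progress_reward(x[t - 1], x[t], x_goal) for t in range(1, T))
```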
C. Transfer of Low-Level Policies to New Tasks

We show that our architecture can adapt to the different paths shown in Figure 4. We first train the architecture on path 1, shown on the left side of Figure 4. The low-level policy only has access to proprioceptive sensor data, and this forces it to learn generic steering primitives that can be reused across different paths. We test this property of our hierarchical architecture by reusing the trained low-level policy from path 1 when training on path 2.

D. Baselines
For comparison, we train flat policies on these tasks. The input to the flat policies is the same as the high-level's observations concatenated with the low-level's in the hierarchical setup (except, trivially, for the latent commands), and the output is the same as the low-level actions. The flat policy also uses the same PMTG architecture for a fair comparison.

Secondly, we implement an expert hierarchical policy for additional comparison. We pre-train the low-level policy for this baseline using a carefully designed and tuned reward function to follow a target steering angle. The high-level policy computes the running duration d for the pre-trained low-level policy and also outputs a steering angle (a scalar in the range −1 (far left) to 1 (far right), instead of the latent command l). The input for the expert policy's high level and low level is exactly the same as in the HRL case.

(a) Robot path tracking in simulation. If the robot's center of mass exits the black area, the episode is terminated.

(b) Trajectories on the paths with a shared low-level policy (trained on the path on the left). Dots indicate when the high-level policy takes a new decision.

Fig. 4: Sample rollouts in simulation of the path-tracking task with a 4D latent command space.

As in the HRL case, the baseline policies are trained by directly optimizing R using Augmented Random Search (ARS) [15]. We perform evaluation across different search directions in parallel. We train each method with a set of hyper-parameters (the number of directions to search in ARS, the number of top directions for updating parameters, and the number of latent command dimensions in the case of our hierarchical method). Finally, we pick the best hyper-parameters for each method and compare the average performance across random training runs with those hyper-parameter settings.

In Fig. 5 we show learning curves for three policies: a flat policy, a hierarchical policy with an expert-designed, pre-trained low level, and a hierarchical policy with a latent command space (our method). The policies are trained on different paths. All three methods succeed in solving the task of following the first path (Fig. 5a). For the second path, our method is able to solve the task significantly faster than the other policies (Fig. 5b). On the second path, the flat policy has to learn its parameters from scratch. The expert policy's high level learns to use the same low-level policy used on the first path. This low-level policy was pre-trained (see Appendix). Therefore, the expert policy needs extra training time to learn both levels separately. On the other hand, both levels of our latent command based hierarchical policy are trained from scratch on the first path. The best performing policy uses a 4-dimensional latent space. We can see that this policy can still reuse the same low level and 4D latent commands to adapt quickly to a new task.

Fig. 4 shows how the robot trained with a hierarchical policy behaves in simulation. It successfully follows the path using steering behaviors. Complete trajectories can be seen in Fig. 4b. Markers along the trajectory show points at which the high level becomes active and computes the next latent command and duration. The low-level policy was only trained on the first path and is reused for the second path.
(a) Learning curves for path 1. All policies are trained from scratch.
(b) Learning curves for path 2. Our method (hierarchical latent) reuses the low-level policy learned for path 1.
Fig. 5: Learning curves of a flat policy, a hierarchical policy with latent commands, and an expert hierarchical policy. We plot the average of 5 statistical runs, with the shaded area representing the standard error.

To simplify the analysis, we study a 2-dimensional latent command space learned by our method in Fig. 7. We evaluated the low level for different points in the latent space. In Fig. 7a we show the movement direction of the robot when giving different points in latent space as commands to the low level and executing the low level for a fixed number of steps. The length of each arrow is proportional to the distance covered. Corresponding color-coded robot trajectories are shown in Fig. 7b. We can observe that, for the path-following task, robot steering behaviors of varying velocities emerge automatically as low-level behaviors. The high level uses these steering behaviors to navigate different parts of the path, as shown in Fig. 7b. Moreover, the high level also decides a variable duration for each latent command (see Fig. 7b). We can observe that for straighter parts of the path the high level selects a longer duration to go forward, while for curved parts it switches latent commands more frequently.
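The latent-space analysis of Fig. 7a can be reproduced with a simple sweep over latent commands. Below is a minimal sketch reusing the hypothetical `low_level` policy and `env` interfaces from the earlier sketches; the grid resolution and number of steps are illustrative, not the values used for the figure.

```python
import numpy as np

def probe_latent_space(low_level, env, n_grid=5, n_steps=300):
    """Execute the low-level policy for a grid of fixed 2D latent commands
    and record the resulting displacement of the robot (cf. Fig. 7a)."""
    displacements = {}
    for lx in np.linspace(-1.0, 1.0, n_grid):
        for ly in np.linspace(-1.0, 1.0, n_grid):
            env.reset()
            start = env.robot_position()
            o_l = env.low_level_observation()
            for _ in range(n_steps):
                action = low_level.act(o_l, np.array([lx, ly]))
                o_l, _ = env.step(action)
            displacements[(lx, ly)] = env.robot_position() - start
    return displacements
```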
E. Hardware Validation

Finally, we validate our results by transferring an HRL policy to a real robot and recording the resulting trajectories.
Fig. 6: The trajectories of the real robot measured with motion capture while using a trained HRL policy at different segments of the path.

We use a motion capture system (PhaseSpace Impulse X2E) to estimate the robot's current position and heading, which is then fed into the high-level policy. Since our architecture allows execution of the different levels at different frequencies, it is sufficient to transmit motion capture data to the high-level policy at a much lower rate compared to low-level sensor data such as IMU readings.

Because of the limited capture volume in our lab setting, we were only able to track the robot's trajectory along part of the task (see Fig. 1 and 6). To overcome this limitation, we recorded shorter robot trajectories starting at the origin. We then virtually moved the robot down the path by adding an offset to the motion capture's position estimate and recorded another set of trajectories. Note the significant variance of the real trajectories at the start of the path due to slippage of the legs during dynamic turning gaits.

V. CONCLUSION

We presented a hierarchical control approach particularly suited for legged robots. By separating the architecture into two parts, a high-level and a low-level policy network, and jointly training them, we obtained a number of advantages over previous algorithms.

First, the architecture is agnostic to the task: we do not need to manually pick or pre-train the behaviors (primitives) of the low-level policy. As a consequence, we also remove the need to design individual reward functions for each behavior. In fact, our algorithm outperforms a similar setup in which the low-level behaviors are predefined.

Secondly, our method can be used to bootstrap training on a new task by transferring the trained low-level policy.

Finally, the high-level and low-level policies operate at different timescales and can use different state representations. This is of particular practical importance, since motor commands should be able to be calculated in mere milliseconds by a low-level policy for safety and stability reasons. High-level signals such as rewards or position estimates are often updated at much lower frequencies and might have to be transmitted via a wireless connection. Our approach provides a natural way to decouple these timescales.

The task at hand allowed us to study the results in detail in both simulation and hardware to validate our approach and implementation.

(a) Low-level behaviors sampled from a 2D latent command space. Vector directions correspond to the movement direction of the robot. Vector length is proportional to the distance covered.

(b) Low-level behaviors for different latent commands (colors correspond to Fig. 7a). Notice that while diverse, the low-level behaviors are biased towards left turns because of the task at hand.

(c) Sample trajectory of the HRL policy with a 2D latent command space. Dots indicate new high-level commands. The timeline shows the high-level activations.
Fig. 7: Analysis of latent command space l and low-level duration d.

We show that, given the path-following task, steering behaviors automatically emerge in a latent space, and the robot can easily adapt to a new path with low-level transfer. We also deployed these policies to hardware to validate the learned hierarchical policy.

In future work, we plan to apply this algorithm to tasks requiring a high level of agility in more complex environments. As an example, if the robot has to jump over an obstacle or climb stairs, manually defining a set of low-level behaviors will become even more cumbersome. We believe that the latent command space will allow us to tackle these challenges through automatic discovery of the complex primitives required to solve the task. In addition, we are planning to incorporate more complex sensors such as camera images, which naturally operate at different timescales and require significant computational power. In this case our approach would allow for distributed processing, without compromising performance.

APPENDIX

As part of the baselines, a low-level expert steering policy is trained separately. This policy is controlled by a scalar input l from the high level, which determines the target direction. We train the policy using the ARS algorithm by rewarding the magnitude of the average steering angle over the preceding timesteps. The reward is capped by the input l. Another component (weighted by α) is added to the reward for moving forward, which is capped by a fixed value r_fwcap:

$r_{\text{steer}}(t) = \min\big(l, \bar{\theta}_{\text{steer}}(t)\big)$ (4)
$r_{\text{fw}}(t) = \min\big(r_{\text{fwcap}},\, x(t) - x(t-1)\big)$ (5)
$r(t) = r_{\text{steer}}(t) + \alpha\, r_{\text{fw}}(t)$ (6)
$R = \sum_{t} r(t),$ (7)

where $\bar{\theta}_{\text{steer}}(t)$ denotes the average steering angle.
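A rough sketch of this expert-baseline reward (Eqs. 4-6) is given below; the steering-angle estimate and the constants `alpha` and `fw_cap` are illustrative placeholders, not the tuned values used for the baseline.

```python
import numpy as np

def expert_steering_reward(l, avg_steer_angle, x_t, x_prev, alpha=0.5, fw_cap=0.05):
    """Sketch of the expert low-level steering reward (Eqs. 4-6).

    l               : commanded steering input from the high level
    avg_steer_angle : average steering angle over the preceding timesteps
    alpha, fw_cap   : illustrative constants, not the tuned values
    """
    r_steer = min(l, avg_steer_angle)                         # Eq. (4): capped by the input l
    r_fw = min(fw_cap, float(np.linalg.norm(x_t - x_prev)))   # Eq. (5): capped forward progress
    return r_steer + alpha * r_fw                             # Eq. (6)
```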
Fig. 8: Learning curve for the pre-training phase of the expert low-level policy.
Fig. 9: Expert low-level policy with different inputs (axes in m).

For training, we randomly sample an input l from a uniform distribution for each episode. The learning curve for training this policy is shown in Fig. 8. Sample trajectories after training are shown in Fig. 9.

ACKNOWLEDGMENT

We would like to thank Jie Tan, Tingnan Zhang, Erwin Coumans, Sehoon Ha (Robotics at Google), Honglak Lee, Ofir Nachum (Google Brain), and Arun Ahuja (DeepMind) for insightful discussions.

REFERENCES
[1] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
[2] Atil Iscen, Ken Caluwaerts, Jie Tan, Tingnan Zhang, Erwin Coumans, Vikas Sindhwani, and Vincent Vanhoucke. Policies modulating trajectory generators. In Conference on Robot Learning, pages 916-926, 2018.
[3] Wenhao Yu, C. Karen Liu, and Greg Turk. Policy transfer with strategy optimization. arXiv preprint arXiv:1810.05751, 2018.
[4] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
[5] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight. arXiv preprint arXiv:1805.08180, 2018.
[6] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3307-3317, 2018.
[7] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
[8] Takayuki Osa, Voot Tangkaratt, and Masashi Sugiyama. Hierarchical reinforcement learning via advantage-weighted information maximization. arXiv preprint arXiv:1901.01365, 2019.
[9] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
[10] Bastian Bischoff, Duy Nguyen-Tuong, I. H. Lee, Felix Streichert, Alois Knoll, et al. Hierarchical reinforcement learning for robot navigation. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013), 2013.
[11] Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
[12] Aleksandra Faust, Kenneth Oslund, Oscar Ramirez, Anthony Francis, Lydia Tapia, Marek Fiser, and James Davidson. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 5113-5120. IEEE, 2018.
[13] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):41, 2017.
[14] Xue Bin Peng, Glen Berseth, and Michiel Van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (TOG), 35(4):81, 2016.
[15] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
[16] Erwin Coumans. Bullet Physics SDK.
[17] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018.