Adaptive Curriculum Generation from Demonstrations for Sim-to-Real Visuomotor Control
Lukas Hermann∗, Max Argus∗, Andreas Eitel, Artemij Amiranashvili, Wolfram Burgard, Thomas Brox
Abstract — We propose Adaptive Curriculum Generation from Demonstrations (ACGD) for reinforcement learning in the presence of sparse rewards. Rather than designing shaped reward functions, ACGD adaptively sets the appropriate task difficulty for the learner by controlling where to sample from the demonstration trajectories and which set of simulation parameters to use. We show that training vision-based control policies in simulation while gradually increasing the difficulty of the task via ACGD improves the policy transfer to the real world. The degree of domain randomization is also gradually increased through the task difficulty. We demonstrate zero-shot transfer for two real-world manipulation tasks: pick-and-stow and block stacking. A video showing the results can be found at https://lmb.informatik.uni-freiburg.de/projects/curriculum/

∗ First two authors contributed equally. All authors are with the University of Freiburg, Germany. Wolfram Burgard is also with the Toyota Research Institute, USA. This work has been supported partly by the BMBF grant No. 01IS18040B-OML and the DFG grant No. BR 3815/10-1. {hermannl, argusm, eitel}@cs.uni-freiburg.de
I. INTRODUCTION
Reinforcement Learning (RL) holds the promise of solving a large variety of manipulation tasks with less engineering and integration effort. Learning continuous visuomotor controllers from raw images circumvents the need for manually designing multi-stage manipulation and computer vision pipelines [1]. Vision-based closed-loop control can increase the robustness of robots performing real-world fine manipulation tasks. Prior experience has shown that learning-based methods can cope better with complex semi-structured environments that are hard to model in a precise manner due to contacts, non-rigid objects or cluttered scenes [2], [3], [4].
A major challenge preventing deep reinforcement learning methods from being used more widely on physical robots is exploration. RL algorithms often rely on random exploration to search for rewards. This hinders the application to real-world robotic tasks in which the reward is too sparse to ever be encountered through random actions. Additionally, in the real world random exploration can also be dangerous for the robot or its environment.
A common strategy to address the exploration problem is reward shaping, in which distance measures and intermediate rewards continuously provide hints on how to reach the goal [5]. Reward shaping is typically task specific and requires manual setup as well as careful optimization. This contradicts the purpose of using RL as a general approach and can bias policies to satisfy shaped rewards.
Poor exploration can also be compensated by providing demonstration trajectories and performing Behavior Cloning (BC) [6]. Pre-training with behavior cloning can guide the RL algorithm to initial rewards, from where it can be optimized further [7]. Behavior cloning usually requires a substantial amount of expert demonstrations.

Fig. 1: Adaptive Curriculum Generation from Demonstrations utilizes only 10 demonstration trajectories to enable learning visuomotor policies for two fine manipulation tasks: block stacking (top) and pick-and-stow (bottom). The visuomotor policies are trained in simulation and transferred without further training to a physical robot. The image-in-image shows the egocentric view of the arm-mounted RGB camera. The third-person view is not provided to the robot.

This paper tackles the exploration problem with curriculum learning based on a few manual demonstrations. We present an adaptive curriculum generation algorithm that controls difficulty by controlling how initial states are sampled from demonstration trajectories as well as controlling the degree of domain randomization that is applied during training. The algorithm continually adapts difficulty parameters during training to keep the rewards within a predetermined, desired success rate interval. Consequently, the robot can always learn at the appropriate difficulty level, which speeds up the convergence of the training process and improves the final generalization performance. The method is simple and applicable to a variety of robotic tasks. We demonstrate its performance on real-world pick-and-stow and block stacking tasks. We apply the curriculum learning in conjunction with policy learning in a physics simulator. Afterwards, the learned policies are transferred without further re-training to the physical robot. The advantages of training in simulation include drastically increased learning speed through distributed training, no human involvement during training, improved safety, and added algorithmic possibilities due to access to simulator state information.
We present a novel perspective on domain randomization inside our curriculum learning framework. Our algorithm automatically learns when to increase the amount of visual domain randomization and dynamics randomization during the training process, until the policy exhibits the desired degree of domain invariance required for a successful simulation to reality transfer.
The main contributions of this paper are: 1) a curriculum generation method for learning with sparse rewards that only requires a dozen demonstrations, 2) an algorithm for automatic and controlled scaling of task difficulty, 3) a unified treatment of demonstration sampling and domain randomization as task difficulty parameters of the curriculum, 4) zero-shot transfer from simulation to the real world for two robot manipulation tasks.

II. RELATED WORK
A. Training Curricula
The seminal paper by Bengio et al. [8] shows that curriculum learning provides benefits for classical supervised learning problems such as shape classification. Florensa et al. [9] propose a reverse curriculum for reinforcement learning that requires only the final state in which the task is achieved. Their curriculum is generated gradually by sampling random actions to move further away from the given goal and thus reversely expanding the start state distribution with increasingly more difficult states. The key difference of our method is that we extend the idea of curriculum learning to task difficulty parameters, which goes beyond sampling start states near the goal. Further, we sample backwards from the demonstration trajectory, as opposed to randomly sampling backwards in action space, which is likely to fail for difficult states that are unlikely to be encountered randomly. Another approach is to generate a linear reverse curriculum, a baseline that we compare against in our experiments [10]. A more elaborate approach generates a curriculum of goals using a Generative Adversarial Network (GAN) but cannot handle visual goals [11]. Learning by playing also generates a form of curriculum, where auxiliary tasks are leveraged to learn increasingly more complicated tasks [12]. Nevertheless, it is not straightforward to design auxiliary tasks that benefit the final goal task.
B. Learning from Demonstrations and Imitation
The most common approach for learning from demonstrations is supervised learning, also known as behavior cloning. Behavior cloning has been shown to enable training of successful manipulation policies [6], but usually has to cope with the compounding-error problem and requires hundreds or even thousands of expert demonstrations. Several methods have combined behavior cloning with RL, either in a hierarchical manner [13] or as a pre-training step to initialize the RL agent [7]. Other methods leverage demonstration data for training off-policy algorithms by adding demonstration transitions to the replay buffer [14]. In comparison to those works, our method does not try to copy the demonstrated policy. Only the visited states are used as initial conditions and the actions taken during demonstrations are not required. Therefore, our method is more robust against suboptimal demonstrations.
Demonstration data can also be provided in the form of raw video data [15], [16], [17], [18]. These approaches can work if sufficient demonstration data is provided, while our method requires only around a dozen demonstrations. Nair et al. [19] combine off-policy learning with demonstrations to solve block stacking tasks. They use Hindsight Experience Replay [20] for multi-goal tasks, but this method is not applicable to visual domains where the desired goal configuration is not given, as in our tasks. Zhu et al. [21] combine imitation learning and RL for training manipulation policies in simulation. They report preliminary success for transferring the learned policies into the real world, but also challenges due to differing physical properties of the simulation and the real world. We address this problem by leveraging our curriculum generation approach for gradually training with more realistic simulation parameters.
C. Sim-to-Real and Domain Randomization
For sim-to-real transfer of our policy we build upon an efficient technique called domain randomization [22]. It has been successfully applied to transfer imitation learning policies for a pick-and-place task on a real robot [23]. Since our visuomotor controller operates in a first-person view with sensor data from a camera-in-hand robot setup, there is significantly less background clutter, so we generally use a smaller degree of domain randomization. Nevertheless, as we show in our experiments, domain randomization can harm convergence during RL training; we circumvent this problem by incorporating domain randomization into our adaptive curriculum. Domain randomization has also been extended to dynamics randomization [24], which we also incorporate into our approach. A recent method proposes to close the sim-to-real gap by adapting domain randomization with experience collected in the real world [25]. Experiments with a real robot show impressive results for a drawer-opening and a peg-in-hole task, but the learned policies are not based on visual input and it is unclear if the method also works with sparse rewards. In concurrent work, OpenAI et al. [26] have also considered adaptive domain randomization for sim-to-real dexterous manipulation.
Sim-to-real transfer can also be viewed from a transfer learning or domain adaptation perspective [27]. Here, methods that use GANs have been successfully applied for instance grasping on a real robot [28]. Nevertheless, training the GAN requires a great amount of real-world training data (on the order of 100,000 grasps). Recent meta-learning methods seem suitable to reduce the amount of real-world data needed for efficient domain adaptation [29]. Those transfer learning methods are outside of the scope of our work because we focus on zero-shot sim-to-real transfer.
D. Motor Control for Manipulation
Model-based reinforcement learning techniques with probabilistic dynamics models have been proposed to learn control policies for block stacking [30] and multi-phase manipulation tasks [31]. Guided Policy Search has been applied to visuomotor control tasks [32]. The mentioned approaches work well in the local range of trained trajectories, but generalize less to larger variations in goal and robot state positions, which is required for our tasks.

III. ADAPTIVE CURRICULUM GENERATION FROM DEMONSTRATIONS
In this section we present Adaptive Curriculum Generation from Demonstrations (ACGD), a method to overcome the exploration difficulties of reinforcement learning from sparse rewards in simulated environments. Depending on the current success rate, ACGD automatically schedules increasingly difficult subtasks by shaping the initial state distribution and scaling a set of parameters that control the difficulty of the environment, such as the degree of domain randomization.
A. Reinforcement Learning
In reinforcement learning an agent makes an observation o_t of an underlying environment state, which is used by a policy to compute an action a_t = π(o_t). This produces transitions (o_t, a_t, r_t, o_{t+1}) for discrete timesteps t. In a sparse reward setting, only a correct sequence of actions produces rewards r_t. The policy is optimized to maximize the discounted future rewards R_t = Σ_{i=t}^{T} γ^(i−t) r_i, called the return.
A number of different RL algorithms exist; they can be categorized as either off-policy or on-policy algorithms, depending on whether they make use of transitions that are not generated by the current policy being optimized. In our work we use the Proximal Policy Optimization (PPO) [33] algorithm, as on-policy algorithms are more suited to gradually changing environment parameters.
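For concreteness, the sketch below computes the return of a single finite episode as defined above. The discount value γ = 0.99 and the episode length are illustrative assumptions, not values taken from this section.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99, t=0):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for a finite episode."""
    rewards = np.asarray(rewards, dtype=np.float64)[t:]
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# In the sparse-reward setting, the reward is zero everywhere except when the
# task is solved, e.g. an episode that succeeds at its final step:
episode_rewards = [0.0] * 69 + [1.0]
print(discounted_return(episode_rewards))  # ~0.5 for gamma = 0.99
```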
B. Reverse-Trajectory Sampling
During training in simulation it is possible to initialize episodes from arbitrary states of demonstration trajectories. We bootstrap exploration by using reverse-trajectory sampling, meaning we initially sample states from the end of the demonstration trajectories. These states are close to achieving sparse rewards, making these tasks much easier and solvable using RL. As the training progresses we successively sample states closer to the beginning of the demonstration trajectories. This creates a curriculum in which the agent learns to solve a growing fraction of the task.
However, sampling states from the demonstration trajectories restricts the policy to observing initial states present in the recorded demonstrations. Especially if we want our algorithm to work with few demonstrations, this presents a potential source of bias. To overcome this problem we mix resets from demonstrations with regular resets of the environments, which have automatic randomization of initial states. The choice between a demonstration reset and a regular reset is made by a mixing function depending linearly on the success rate sr_r of regular-reset episodes and the overall progress of the training. As opposed to [10], who claim that it is important for reverse curriculum learning to include previously sampled sections throughout the training in order to prevent catastrophic forgetting, our experiments suggest that this is not the case.

Algorithm 1: Adaptive Curriculum Generation from Demonstrations
Input: iterations N, initial policy π_0, increment ε, task params H_j = {µ_j^init, σ_j^init, µ_j^end, σ_j^end}, reward interval [α, β]
Output: final policy π_N
  sr_d, sr_r, δ_d, δ_r ← 0
  for i ← 1 to N do
      with probability p = ½ (sr_r + i/N) do
          sample regular restart(H_r, δ_r);  sr, δ ← sr_r, δ_r
      otherwise
          sample demonstration restart(H_d, δ_d);  sr, δ ← sr_d, δ_d
      rollouts ← generate_rollouts(π)
      π ← update_policy(rollouts, π)
      sr ← update_success_rates(rollouts)
      δ ← δ + ε·(sr > β) − ε·(sr < α)
  end
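The following is a minimal Python sketch of the adaptive difficulty update in Algorithm 1, given for illustration. The ½ coefficient in the mixing probability and the way the success rates are estimated from recent rollouts are assumptions on our part; the interval [α, β] and increment ε correspond to the hyperparameters introduced in the algorithm's input.

```python
import random

class AdaptiveDifficulty:
    """Adjust a difficulty coefficient delta in [0, 1] so that the measured
    success rate stays inside the interval [alpha, beta] (cf. Algorithm 1)."""

    def __init__(self, eps=0.002, alpha=0.4, beta=0.6):
        self.delta = 0.0   # 0 = easiest task setting, 1 = hardest
        self.eps = eps
        self.alpha = alpha
        self.beta = beta

    def update(self, success_rate):
        # delta <- delta + eps * (sr > beta) - eps * (sr < alpha)
        if success_rate > self.beta:
            self.delta += self.eps
        elif success_rate < self.alpha:
            self.delta -= self.eps
        self.delta = min(max(self.delta, 0.0), 1.0)
        return self.delta


def choose_reset(sr_regular, progress):
    """Pick the reset type for the next episode: the probability of a regular
    reset grows linearly with the regular-reset success rate and the overall
    training progress (the 0.5 coefficient is an assumed value)."""
    p_regular = 0.5 * (sr_regular + progress)
    return "regular" if random.random() < p_regular else "demonstration"


# Usage: two independent coefficients, one per reset type, updated from the
# success rate measured on the most recent rollouts.
difficulty = {"regular": AdaptiveDifficulty(), "demonstration": AdaptiveDifficulty()}
reset_type = choose_reset(sr_regular=0.2, progress=0.1)
difficulty[reset_type].update(success_rate=0.7)  # too easy, so difficulty goes up
```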
C. Task and Environment Parameter Adaptation
Apart from the distance between start states and goal states, the difficulty of a task also depends on a set of factors, such as the degree of domain randomization, intrinsics of the physics simulator, or the criteria that define task completion. As an example, the complexity of block stacking depends significantly on the bounciness of the blocks. Therefore, we design our tasks such that their difficulty can be controlled by a set of parameters H. Examples can be found in Table II; most of these are not task specific. At the beginning of the training all parameters are set to the intuitively easiest configuration (e.g., less bouncy objects or a smaller initial distance between objects). The parameters that determine the degree of appearance or dynamics randomization are initially set to a minimum. During training we scale the variance of the sampling distributions and thus increase the difficulty by linearly interpolating between 0 and a maximal value chosen based on what is realistic. Our experiments clearly suggest that it is beneficial to gradually increase the degree of randomization over time, since too much domain randomization from the beginning slows down training or might even prevent the policy from learning the task at all. To our knowledge, we are the first to apply curriculum learning to sim-to-real transfer. When sampling initial states from demonstrations it is not possible to randomize all parameters, because some configurations are pre-determined by the demonstration. This results in a different set of parameters being randomized for each type of reset, see Table II.
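As an illustration of how a single entry of H might be scaled, the sketch below linearly interpolates the mean and standard deviation of a parameter's sampling distribution between its easiest (init) and hardest (end) settings. The Gaussian form and the example values are assumptions for illustration only.

```python
import numpy as np

def sample_task_parameter(mu_init, sigma_init, mu_end, sigma_end, delta,
                          rng=np.random.default_rng()):
    """Sample one difficulty parameter H_j from a Gaussian whose mean and
    standard deviation are linearly interpolated between the easiest (init)
    and hardest (end) settings by the difficulty coefficient delta in [0, 1]."""
    mu = (1.0 - delta) * mu_init + delta * mu_end
    sigma = (1.0 - delta) * sigma_init + delta * sigma_end
    return rng.normal(mu, sigma)

# Example (hypothetical values): a camera-position offset in meters that is not
# randomized at all at delta = 0 and fully randomized at delta = 1.
offset = sample_task_parameter(mu_init=0.0, sigma_init=0.0,
                               mu_end=0.0, sigma_end=0.01, delta=0.3)
```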
D. Adaptive Curriculum Generation
The challenge of curriculum learning is to decide on a good strategy for choosing the appropriate difficulty of start states and task parameters over the course of the training. Previous approaches sampled initial states uniformly [34], [19], [21] or linearly backwards [10].

Fig. 2: Architecture of the policy network. The value function has the same architecture apart from having a single output for the value. Policy and value functions share the weights of the CNN (dotted box).

However, sampling states from the end of the demonstration trajectories for too long unnecessarily slows down the training, because the policy is trained on subtasks that it has already learned to master. On the other hand, sampling more difficult task configurations too fast may prevent the policy from experiencing any reward at all. Especially tasks with long episode lengths often do not have a constant difficulty at every stage. Consider for instance a stacking task: it consists of both easier parts that only require straight-line motion and more difficult bottleneck moments like grasping and placing the object. Our method adaptively generates a curriculum, such that the learning algorithm automatically dedicates more time to hard parts of the task, while not wasting time on straightforward sections.
Intuitively, we want the probability of experiencing a reward to be neither too high nor too low. Instead, our goal is to confine the probability within a desired reward region α ≤ P(R_t > 0 | π) ≤ β, where R_t denotes the return of a rollout started from a state sampled from the demonstration data at timestep t, and the bounds [α, β] are hyperparameters which we set to [0.4, 0.6] after empirical testing. This is inspired by the Goals of Intermediate Difficulty of [11]. For sparse bi-modal tasks, the probability P(R_t > 0 | π) corresponds to the expected success rate of the task.
We control all difficulty parameters with two coefficients δ_d and δ_r ∈ [0, 1], which regulate the difficulty of resets from demonstrations and regular resets, respectively, by scaling the variances of H linearly w.r.t. δ. A δ close to 0 corresponds to the easiest and a δ close to 1 to the most difficult task setting (i.e., initializing episodes further away from the goal and sampling task parameters with higher variance). Our method tunes the difficulty during training to ensure that the success rates of regular and demonstration resets (sr_r and sr_d) stay in the desired region, as shown in the pseudo-code in Algorithm 1. An example of this optimization can be seen in Fig. 3.

IV. EXPERIMENTS
We demonstrate the effectiveness of the proposed curriculum generation via two manipulation tasks: pick-and-stow and stacking small blocks; see Fig. 1. We posed the following questions: 1) Can curriculum learning with demonstrations enable learning tasks with sparse rewards that do not succeed without curriculum learning? 2) What is the generalization performance compared to existing behavior cloning and reinforcement learning with shaped rewards? 3) Does our adaptive curriculum outperform other curriculum baselines? 4) Can visuomotor policies trained in simulation generalize to a physical robot?

A. Experimental Setup
For training and evaluation of the policies we re-created the two tasks and the robot setup in a physics simulator. We aligned simulation and real world as closely as possible; for a side-by-side comparison see Fig. 1.
Our KUKA iiwa manipulator is equipped with a WSG-50 two-finger gripper and an Intel SR300 camera mounted on the gripper for an eye-in-hand view. Control of the end effector orientation is limited to rotation around the vertical z-axis, such that the gripper always faces downwards. The action is parameterized as a reduced 5-DoF continuous command a = [Δx, Δy, Δz, Δθ, a_gripper] in the end effector frame. Here, Δx, Δy, Δz specify a Cartesian offset for the desired end effector position, Δθ defines the yaw rotation of the end effector, and a_gripper is the gripper action that is mapped to the binary command open/close fingers.
The observations that the policy receives are a combination of the 84 × 84 pixel RGB camera image and a proprioceptive state vector consisting of the gripper height above the table, the angle that specifies the rotation of the gripper, the opening width of the gripper fingers, and the remaining timesteps of the episode normalized to the range [0, 1]; see Fig. 2. Consistent with the results of [35], preliminary experiments showed that it is beneficial to include the proprioceptive features. The policy network is trained using PPO, but our approach can be used with any on-policy RL algorithm. For all experiments, training runs 8 environments in parallel for a total of 10^7 timesteps; this takes approximately 11 hours on a system with one Titan X and 16 CPU cores. During training the performance of the policy is always evaluated on the maximum task difficulty in order to obtain comparable results between different methods. The policy output is the mean and standard deviation of a diagonal Gaussian distribution over the 5-dimensional continuous actions. The value function yields a scalar value as output. We use the Adam [36] optimizer with an initial learning rate of 0.00025.
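A minimal PyTorch-style sketch of a network along the lines of Fig. 2 is given below. The convolutional layer sizes and the state-independent log standard deviation are assumptions (they are not spelled out in the text); what the sketch reproduces from the text is the CNN shared by policy and value, the concatenated 4-dimensional proprioceptive state, the diagonal Gaussian over the 5 continuous actions, and the scalar value output.

```python
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    """Sketch of a policy/value network in the spirit of Fig. 2."""

    def __init__(self, state_dim=4, action_dim=5):
        super().__init__()
        self.cnn = nn.Sequential(                        # shared by policy and value
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 64 * 7 * 7 + state_dim               # for 84x84 RGB input
        self.policy_head = nn.Linear(feat_dim, action_dim)    # Gaussian mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # Gaussian std
        self.value_head = nn.Linear(feat_dim, 1)

    def forward(self, image, state):
        # image: (B, 3, 84, 84), state: (B, 4)
        feat = torch.cat([self.cnn(image), state], dim=-1)
        return self.policy_head(feat), self.log_std.exp(), self.value_head(feat)
```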
B. Experiments in Simulation
For experiments in simulation we use the PyBullet physics simulator [37]. We compare our approach against several baselines: 1) training PPO with sparse and shaped rewards, 2) behavior cloning, 3) PPO with behavior cloning initialization, 4) several standard non-adaptive curriculum learning methods that use demonstrations. For both tasks, we recorded a set of 10 manual demonstrations for curriculum learning and an additional set of 100 demonstrations for the BC baseline, both with a 3D mouse.

Fig. 3: Example training run showing the evolution of success rates and difficulty coefficients δ over the course of training. First the success rate and thus the difficulty increase for resets from demonstrations (blue curves), followed by increases for regular resets (red). The plot shows that the success rate is kept in the desired interval (grey area) by the difficulty coefficients until the highest difficulty is reached.

Fig. 4: ACGD vs. baseline methods for the pick-and-stow task, evaluated in simulation.
Pick-and-stow. The robot has to pick up a small block and place it within a box. The initial position of the robot and the block randomly change every episode. A sparse reward of 1 − φ is received only after reaching a goal state, where φ denotes a penalty for collisions of the gripper and block with the edges of the box. The dense reward function for the RL baseline is composed of the Euclidean distance between gripper and block as well as the distance between block and a location above the box, plus additional sparse rewards for successfully grasping and placing the block.
Fig. 4 shows the task success rates during training. The results show averages over five different random seeds for every experiment. Our method successfully solves the task with an average final success rate of 94%. The policy trained with BC achieves a notable success rate of 23%, but overall lacks consistency, having overfitted to the small set of demonstrations. Interestingly, the policy that used BC as initialization for RL performed poorly and experienced catastrophic forgetting after a few timesteps. Neither RL with sparse rewards nor RL with shaped rewards is able to completely finish the task. While the former is unable to learn any meaningful behavior, the latter learns to grasp the block but fails to place it inside the box.

Fig. 5: ACGD vs. baseline methods for the block stacking task, evaluated in simulation.
Fig. 6: Here we investigate the different methods for choosing the task difficulty for all parameters H besides the location of demonstration resets. We compare adaptively increasing the difficulty of those parameters during training against always training on the highest difficulty.

Block stacking. To solve the task, the agent has to stack one block on top of the other one, which involves several important subproblems such as reaching, grasping and placing objects precisely. The task features complex, contact-rich interactions that require substantial precision. We further increased the difficulty of the task compared to prior work by using smaller blocks. Since it is a task of long episode length that requires at least around 70 steps to be solved with our choice of per-step action ranges, it is particularly well suited for curriculum learning.
We define a successful stack as one block being on top of the other one for at least 10 timesteps, which indicates stable stacking. A sparse reward of 1 − φ is given after a successful stack, where φ denotes a penalty for the movement of the bottom block during the execution of the task. The shaped reward function consists of a mixture of distance functions and sparse rewards similar to the pick-and-stow task.
The training progress and the evolution of the difficulty coefficients are shown in Fig. 3. We see that the success rate is kept in the desired interval through the adaptation of the difficulty coefficients until the highest difficulty is reached. The success rate of evaluation runs (green) is higher due to the execution of a deterministic policy. Fig. 5 shows the results compared to baselines. As it is considerably harder than the previous task, none of the curriculum-free baselines are able to solve the stacking task. In comparison to the uniform and linear reverse curriculum learning variants, our method learns faster and achieves a better final performance, with the final success rate being more than 20% higher. For the linear curriculum variant, start states are sampled linearly further away from the goal during the course of the training. Our method shows less variance across the seeds, which indicates that the adaptive curriculum improves the stability of learning. The experiment also demonstrates the importance of not learning exclusively from demonstrations, especially if the amount of demonstration data is limited. Linear curricula with regular resets, similar to [10], clearly outperform uniform and linear curriculum learning that are trained only on initial states sampled from demonstrations.

Fig. 7: Each image shows the progress of the robot attempting to stack a small blue block on top of a red block.

Task            Train       Test        Trials   Success Rate
Block stacking  Simulation  Simulation  91/100   91%
Pick-and-stow   Simulation  Simulation  94/100   94%
Block stacking  Simulation  Real        12/20    60%
Pick-and-stow   Simulation  Real        17/20    85%
TABLE I: Success rates in simulation and the real world.

Another advantage of our method is that it can learn substantially more efficient solutions than those provided by the demonstrations. This results from the use of demonstration states, but not the actions taken. For the stacking task, manual demonstration episodes had a mean duration of 164 transitions; in simulation, our trained policy solves the task in roughly half as many steps, i.e., twice as fast as a human expert operating the robot with a 3D mouse.
We further evaluated how adaptively changing the task parameters H for the task difficulty and domain randomization improves the training speed and performance. In Fig. 6 we compare our full model with the following ablations: 1) shared δ for demonstration resets and regular resets, i.e., δ_r = δ_d, 2) constant task difficulty, i.e., adaptive curriculum from demonstrations, but without changing the difficulty of the task parameters H, which were set to the maximum difficulty for the complete training.
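To make the stacking success criterion defined earlier in this subsection concrete, the sketch below checks that one block has rested on top of the other for at least 10 consecutive timesteps and then returns the sparse reward 1 − φ. The exact form of the penalty φ for bottom-block movement is not given in the text, so a clipped, scaled displacement is assumed here.

```python
def stacking_reward(on_top_history, bottom_block_displacement,
                    hold_steps=10, penalty_scale=1.0):
    """Sparse stacking reward sketch: 1 - phi once the stack has been stable
    for `hold_steps` timesteps, where phi penalizes bottom-block movement."""
    stacked = (len(on_top_history) >= hold_steps
               and all(on_top_history[-hold_steps:]))
    if not stacked:
        return 0.0
    phi = min(penalty_scale * bottom_block_displacement, 1.0)  # assumed penalty form
    return 1.0 - phi
```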
C. Experiments with Real Robot
We applied the trained policies without any additional fine-tuning on the real robot. Our results are shown in Table I. We see that despite incurring a performance penalty by evaluating on the real robot, the policies transfer with a good success rate of 85% for pick-and-stow and 60% for block stacking. Both tasks were evaluated with a constant episode length of 300 timesteps (15 seconds). Within the given time frame, the policy can attempt a second trial after a failed first execution of the task. This shows the advantages of a policy learned in closed loop, as it implicitly aims to re-grasp the block in case of a failed stacking or stowing attempt. Using small wooden blocks with an edge length of only 2.5 cm requires highly precise actuation, as the block tends to fall over if not placed precisely or dropped from a too large height. In contrast, policies learned with non-adaptive curriculum learning baselines were unable to achieve the sim-to-real transfer. It is difficult to compare performance with previous approaches due to the lack of clear benchmark setups. In a related approach, Zhu et al. [21] performed zero-shot sim-to-real transfer of a block stacking task; however, they use large deformable foam blocks for stacking. These are easier to grasp because they are made of foam and easier to stack because they are larger and have more friction. They report a success rate of 35% for stacking on the real robot, which is lower than ours.

Difficulty Parameters H_d*: bounciness of objects, num. steps stacked for task success, gripper speed, position offset of relative Cartesian position control, camera field of view, camera position and orientation, block color, table color, camera image brightness, camera image contrast, camera image hue, camera image saturation, camera image sharpness, camera image blur, light direction
Additional Regular Reset Parameters H_a: initial gripper height, lateral gripper offset, distance between blocks, min. final block vel. for task success, table height, height of the robot base, initial gripper rotation, block size
* not including the trajectory sampling position
TABLE II: Task parameters used to adapt the difficulty of the stacking task. When initializing from demonstration states only a subset H_d can be randomized; regular resets randomize H_r = H_d ∪ H_a.

V. CONCLUSIONS
The proposed Adaptive Curriculum Generation from Demonstrations (ACGD) method enables robust training of policies for difficult multi-step tasks. It does this by adaptively setting the appropriate task difficulty for the learner by controlling where to sample from the demonstration trajectories and which set of simulation parameters to use. This unified treatment of demonstration sampling and domain randomization as task difficulty improves training. In combination with domain randomization the method can train policies in simulation that achieve good success rates when evaluated on a real robot.

REFERENCES
[1] F. Wirnshofer, P. S. Schmitt, W. Feiten, G. v. Wichert, and W. Burgard, "Robust, compliant assembly via optimal belief space planning," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), May 2018, pp. 1–5.
[2] A. Zeng, S. Song, K. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, N. C. Dafle, R. Holladay, I. Morena, P. Qu Nair, D. Green, I. Taylor, W. Liu, T. Funkhouser, and A. Rodriguez, "Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), May 2018, pp. 1–8.
[3] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, "Learning synergies between pushing and grasping with self-supervised deep reinforcement learning," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 4238–4245.
[4] A. Eitel, N. Hauff, and W. Burgard, "Learning to singulate objects using a push proposal network," in Proc. of the International Symposium on Robotics Research (ISRR), 2017.
[5] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in Proc. of the International Conference on Machine Learning (ICML), 1999, pp. 278–287.
[6] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, "Deep imitation learning for complex manipulation tasks from virtual reality teleoperation," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2018, pp. 1–8.
[7] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," in Proc. of Robotics: Science and Systems (RSS), Pittsburgh, Pennsylvania, June 2018.
[8] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. of the International Conference on Machine Learning (ICML). New York, NY, USA: ACM, 2009, pp. 41–48. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553380
[9] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel, "Reverse curriculum generation for reinforcement learning," in Conference on Robot Learning (CoRL), 2017, pp. 482–495.
[10] L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, "Surreal: Open-source reinforcement learning framework and robot manipulation benchmark," in Conference on Robot Learning (CoRL), 2018, pp. 767–782.
[11] C. Florensa, D. Held, X. Geng, and P. Abbeel, "Automatic goal generation for reinforcement learning agents," in Proc. of the 35th International Conference on Machine Learning (ICML), vol. 80. PMLR, 2018, pp. 1515–1528. [Online]. Available: http://proceedings.mlr.press/v80/florensa18a.html
[12] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, "Learning by playing - solving sparse reward tasks from scratch," CoRR, vol. abs/1802.10567, 2018.
[13] R. Strudel, A. Pashevich, I. Kalevatykh, I. Laptev, J. Sivic, and C. Schmid, "Combining learned skills and reinforcement learning for robotic manipulations," arXiv preprint arXiv:1908.00722, 2019.
[14] M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," arXiv preprint arXiv:1707.08817, 2017.
[15] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain, "Time-contrastive networks: Self-supervised learning from video," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2018, pp. 1134–1141.
[16] Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang, and N. de Freitas, "Playing hard exploration games by watching YouTube," in Advances in Neural Information Processing Systems, 2018, pp. 2930–2941.
[17] L. Tai, J. Zhang, M. Liu, and W. Burgard, "Socially compliant navigation through raw depth inputs with generative adversarial imitation learning," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), May 2018, pp. 1111–1117.
[18] O. Mees, M. Merklinger, G. Kalweit, and W. Burgard, "Adversarial skill networks: Unsupervised robot skill learning from video," 2019.
[19] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Overcoming exploration in reinforcement learning with demonstrations," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2018, pp. 6292–6299.
[20] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
[21] Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess, "Reinforcement and imitation learning for diverse visuomotor skills," in Proc. of Robotics: Science and Systems (RSS), Pittsburgh, Pennsylvania, June 2018.
[22] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 23–30.
[23] S. James, A. J. Davison, and E. Johns, "Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task," in Conference on Robot Learning (CoRL), vol. 78. PMLR, 2017, pp. 334–343. [Online]. Available: http://proceedings.mlr.press/v78/james17a.html
[24] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2018, pp. 1–8.
[25] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, "Closing the sim-to-real loop: Adapting simulation randomization with real world experience," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2019, pp. 8973–8979.
[26] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, "Solving Rubik's cube with a robot hand," arXiv preprint, 2019.
[27] A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, "Sim-to-real robot learning from pixels with progressive nets," in Conference on Robot Learning (CoRL), vol. 78. PMLR, 2017, pp. 262–270. [Online]. Available: http://proceedings.mlr.press/v78/rusu17a.html
[28] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al., "Using simulation and domain adaptation to improve efficiency of deep robotic grasping," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2018, pp. 4243–4250.
[29] T. Yu, C. Finn, S. Dasari, A. Xie, T. Zhang, P. Abbeel, and S. Levine, "One-shot imitation from observing humans via domain-adaptive meta-learning," in Proc. of Robotics: Science and Systems (RSS), Pittsburgh, Pennsylvania, June 2018.
[30] M. Deisenroth, C. Rasmussen, and D. Fox, "Learning to control a low-cost manipulator using data-efficient reinforcement learning," in Proc. of Robotics: Science and Systems (RSS), June 2011.
[31] O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters, "Towards learning hierarchical skills for multi-phase manipulation tasks," in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2015, pp. 1503–1510.
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016. [Online]. Available: http://jmlr.org/papers/v17/15-522.html
[33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017.
[34] I. Popov, N. Heess, T. P. Lillicrap, R. Hafner, G. Barth-Maron, M. Večerík, T. Lampe, Y. Tassa, T. Erez, and M. A. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," CoRR, vol. abs/1704.03073, 2017.
[35] D. Kalashnikov, A. Irpan, P. P. Sampedro, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, "QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation," 2018. [Online]. Available: https://arxiv.org/pdf/1806.10293
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[37] E. Coumans, "Bullet physics simulation," in ACM SIGGRAPH 2015 Courses, ser. SIGGRAPH '15. New York, NY, USA: ACM, 2015. [Online]. Available: http://doi.acm.org/10.1145/2776880.2792704
APPENDIX
A. Hyperparameters
Table III shows a list of hyperparameters that were used in the experiments.
Hyperparameter        Value
Adam learning rate    2.5 × 10⁻⁴
Adam ε                1 × 10⁻⁵
Discount γ            0.99
τ (GAE)               0.95
Interval [α, β]       [0.4, 0.6]
Increment ε           0.002
TABLE III: Default hyperparameters that were used in the experiments unless stated otherwise.
B. Additional Experiments
Additional experiments were conducted for the block stacking task described in Section IV-B. We investigate the choice of input modalities (Fig. 8) and the impact of different values for the hyperparameters reward interval [α, β] (Fig. 9) and difficulty increment ε (Fig. 10).
Fig. 8: Observation experiment. We evaluate how the performance depends on the input modality. Img only learns purely from the camera images; for this we reduced the network architecture to the CNN part of the default network. The full model (Img + state) clearly outperforms the Img only network. We assume that this is because the state vector contains valuable information that is not or only insufficiently contained in the images. The plot shows averages and standard deviations of the evaluation success rate over five runs with different random seeds.

Fig. 9: Reward Interval Experiment. In this experiment, we evaluate the influence of different values for [α, β] (compared intervals: [0.4, 0.6], [0.3, 0.5], [0.5, 0.7], [0.45, 0.55], [0.3, 0.7]). The interval [0.4, 0.6] shows the best performance and is used in the paper, but the other runs also achieve good scores. The plot shows averages and standard deviations of the evaluation success rate over five runs with different random seeds.

Fig. 10: Difficulty Increment Experiment. Here, we compare how different values for ε (0.001, 0.002, 0.003) influence the training speed and final performance. All three values show similar training curves; ε = 0.002 is our default value.