How to Train Your Robot with Deep Reinforcement Learning: Lessons We've Learned
Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, Sergey Levine
Abstract
Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which does not connect with the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as an embodied agent in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource both for roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.
Keywords
Robotics, reinforcement learning, deep learning
Robotic learning lies at the intersection of machine learning and robotics. From the perspective of a machine learning researcher interested in studying intelligence, robotics is an appealing medium to study as it provides a lens into the constraints that humans and animals encounter when learning, uncovering aspects of intelligence that might not otherwise be apparent when we restrict ourselves to simulated environments. For example, robots receive streams of raw sensory observations as a consequence of their actions, and cannot practically obtain large amounts of detailed supervision beyond observing these sensor readings. This makes for a challenging but highly realistic learning problem. Further, unlike agents in video games, robots do not readily receive a score or reward function that is shaped for their needs, and instead need to develop their own internal representation of progress towards goals. From the perspective of robotics research, using learning-based techniques is appealing because it can enable robots to move towards less structured environments, to handle unknown objects, and to learn a state representation suitable for multiple tasks.

Despite being an interesting medium, there is a significant barrier for a machine learning researcher to enter robotics, and vice versa. Beyond the cost of a robot, there are many design choices in setting up the algorithm and the robot. For example, reinforcement learning (RL) algorithms require learning from experience that the robot autonomously collects itself, opening up many choices in how the learning is initialized, how to prevent unsafe behavior, and how to define the goal or reward. Likewise, machine learning and RL algorithms also provide a number of important design choices and hyperparameters that can be tricky to select.

Motivated by these challenges for the researchers in the respective fields, our goal in this article is to provide a high-level overview of how deep RL can be approached in a robotics context, summarize the ways in which key challenges in RL have been addressed in some of our own previous work, and provide a perspective on major challenges that remain to be solved, many of which are not yet the subject of active research in the RL community.

There have been high-quality survey articles about applying machine learning to robotics. Deisenroth et al. (2013) focused on policy search techniques for robotics, whereas Kober et al. (2013) focused on RL. More recently, Kroemer et al. (2019) reviewed the learning algorithms for manipulation tasks. Sünderhauf et al. (2018) identified current areas of research in deep learning that were relevant to robotics, and described a few challenges in applying deep learning techniques to robotics.

Instead of writing another comprehensive literature review, we first center our discussion around three case studies from our own prior work. We then provide an in-depth discussion of a few topics that we consider especially important given our experience. This article naturally includes numerous opinions. When sharing our opinions, we do our best to ground our recommendations in empirical evidence, while also discussing alternative options. We hope that, by documenting these experiences and our practices, we can provide a useful resource both for roboticists interested in using deep RL and for machine learning researchers interested in working with robots.

Affiliations: Robotics at Google; Everyday Robots, X, The Moonshot Factory; Stanford University; University of California, Berkeley

Corresponding author:
Julian Ibarz. Email: [email protected]
In this section, we provide a brief, informal introduction to RL, by contrasting it with classical techniques of programming robot behavior. A robotics problem is formalized by defining a state and action space, and the dynamics which describe how actions influence the state of the system. The state space includes internal states of the robot as well as the state of the world that is intended to be controlled. Quite often, the state is not directly observable; instead, the robot is equipped with sensors, which provide observations that can be used to infer the state. The goal may be defined either as a target state to be achieved, or as a reward function to be maximized. We want to find a controller (known as a policy in RL parlance) that maps states to actions in a way that maximizes the reward when executed.

If the states can be directly or indirectly observed, and a model of the system dynamics is known, the problem can be solved with classical methods such as planning or optimal control. These methods use the knowledge of the dynamics model to search for sequences of actions that, when applied from the start state, take the system to the desired goal state or maximize the achieved reward. However, if the dynamics model is unknown, the problem falls into the realm of RL (Sutton and Barto 2018). In the paradigm of RL, samples of state-action sequences (trajectories) are required in order to learn how to control the robot and maximize the reward. In model-based RL, the samples are used to learn a dynamics model of the environment, which in turn is used in a planning or optimal control algorithm to produce a policy or the sequence of controls. In model-free RL, the dynamics are not explicitly modeled; instead, the optimal policy or value function is learned directly by interaction with the environment. Both model-based and model-free RL have their own strengths and weaknesses, and the choice of algorithm depends heavily on the properties required. These considerations are discussed further in Sections 3 and 4.
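To make the distinction above concrete, the sketch below shows the basic interaction loop that underlies both model-based and model-free RL: a policy is rolled out in an environment to collect trajectories of observations, actions, and rewards. The environment interface (a reset() that returns an observation and a step() that returns an observation, reward, and termination flag) is a simplifying assumption for illustration, not a specific library API.

def collect_trajectory(env, policy, max_steps=200):
    """Roll out `policy` in `env` to gather one trajectory of (obs, action, reward)
    samples, the basic unit of experience used by both model-free and model-based RL."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)                       # the policy maps observations to actions
        next_obs, reward, done = env.step(action)  # assumed minimal environment interface
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory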
In this section, we present a few case studies of applications of deep RL to various robotic tasks that we have studied. The applications span manipulation, grasping, and legged locomotion. The sensory inputs used range from low-dimensional proprioceptive state information to high-dimensional camera pixels, and the action spaces include both continuous and discrete actions. By consolidating our experiences from these case studies, we seek to derive a common understanding of the kinds of robotic tasks that are tractable to solve with deep RL today. Using these case studies as a backdrop, we point readers to outstanding challenges that remain to be solved and are commonly encountered in Section 4.
Reinforcement learning of individual robotic skills has a long history (Peters et al. 2010; Ijspeert et al. 2002; Peters and Schaal 2008; Konidaris et al. 2012; Daniel et al. 2013; Kober et al. 2013; Manschitz et al. 2014). Deep RL provides some appealing capabilities in this regard: deep neural network policies can alleviate the need to manually design policy classes, provide a moderate amount of generalization to variable initial conditions and, perhaps most importantly, allow for end-to-end joint training for both perception and control, learning to directly map high-dimensional sensory inputs, such as images, to control outputs. Of course, such end-to-end training itself presents a number of challenges, which we will also discuss. We discuss a few case studies on single-task deep robotic learning with a variety of different methods, including model-based and model-free algorithms, and with different starting assumptions.
Figure 1.
A PR2 learns to place a red trapezoid block into a shape-sorting cube. With Levine et al. (2016), it learns local policies for each initial position of the cube, which can be reset automatically using the robot's left arm. The local policies are distilled into a global policy that takes images as input, rather than the cube's location.
Guided policy search methods (Levine et al. 2016) were among the first deep RL methods that could be tractably applied to learn individual neural network skills for image-based manipulation tasks. The basic principle behind these methods is that the neural network policy is "guided" by another RL method, typically a model-based RL algorithm. The neural network policy is referred to as a global policy, and is trained to perform the task successfully from raw sensory observations and under moderate variability in the initial conditions. For example, as shown in Figure 1, the global policy might be required to put the red shape into the shape sorting cube at different positions. This requires the policy to implicitly determine the
position of the hole. However, this is not supervised directly; instead, the perception mechanism is learned end-to-end together with control. Supervision is provided from multiple individual model-based learners that learn separate local policies to insert the shape into the hole at a specific position. In the case of the experiment illustrated in Figure 1, nine local policies were trained for nine different cube positions, and a single global policy was then trained to perform the task from images. Typically, the local policies do not use deep RL, and do not use image inputs. They instead use observations that reflect the low-dimensional, "true" state of the system, such as the position of the shape-sorting cube in the previous example, in order to learn more efficiently. Local policies can be trained with model-based methods such as LQR-FLM (Levine and Abbeel 2014; Levine et al. 2016), which uses LQR with fitted time-varying linear models, or model-free techniques such as PI2 (Chebotar et al. 2017b,a). A full theoretical treatment of the guided policy search algorithm is outside the scope of this article, and we refer the reader to prior work on this topic (Levine and Koltun 2013; Levine and Abbeel 2014; Levine et al. 2016).

An important point of discussion for this article, however, is the set of assumptions underlying guided policy search methods. Typically, such methods assume that the local policies can be optimized with simple, "shallow" RL methods, such as LQR-FLM or PI2. This assumption is reasonable for robotic manipulation tasks trained in laboratory settings, but can prove difficult in (1) open-world environments where the low-level state of the system cannot be effectively measured and (2) settings where resetting the environment poses a challenge. For example, in the experiment in Figure 1, the robot is holding the cube in its left arm during training, so that the position of the cube can be provided to the low-level policies and so that the robot can automatically reposition the cube into different positions deterministically. We discuss these challenges in more detail in Sections 4.12 and 4.2.3.

Nonetheless, for learning individual robotic skills, guided policy search methods have been applied widely and to a broad range of behaviors, including inserting objects into containers and putting caps on bottles (Levine et al. 2016), opening doors (Chebotar et al. 2017b), and shooting hockey pucks (Chebotar et al. 2017a). In most cases, guided policy search methods are very efficient in terms of the number of samples, particularly as compared to model-free RL algorithms, since the model-based local policy learners can acquire the local solutions quickly and efficiently. Image-based tasks can typically be learned in a few hundred trials, corresponding to 2-3 hours of real-world training, including all resets and network training time (Levine et al. 2016; Chebotar et al. 2017a).

Figure 2.
Examples of model-free algorithms learning skills in a few hours from low-dimensional state observations. (a) Learning to stack Lego blocks with Haarnoja et al. (2018a). (b) Learning door opening with Gu et al. (2017).

Model-free RL algorithms lift some of the limitations of guided policy search, such as the need to decompose a task into multiple distinct and repeatable initial states or the need for a model-based optimizer that typically operates on a low-dimensional state representation, but at the cost of a substantial increase in the required number of samples. For example, the Lego block stacking experiment reported by Haarnoja et al. (2018a) required a little over two hours of interaction, whereas comparable Lego block stacking experiments reported by Levine et al. (2015) required about 10 minutes of training. The gap in training time tends to close a bit when we consider tasks with more variability: guided policy search generally requires a linear increase in the number of samples with more initial states, whereas model-free algorithms can better integrate experience from multiple initial states and goals, typically with a sub-linear increase in sample requirements. As model-free methods generally do not require a lower-dimensional state for model-based trajectory optimization, they can also be applied to tasks that can only be defined on images, without an explicit representation learning phase.

Although there is a long history of model-free RL in robotics (Peters et al. 2010; Ijspeert et al. 2002; Peters and Schaal 2008; Konidaris et al. 2012; Daniel et al. 2013; Kober et al. 2013; Manschitz et al. 2014), modern model-free deep RL algorithms have been used more recently for tasks such as door opening (Gu et al. 2017) and assembly and stacking of objects (Haarnoja et al. 2018a) with low-dimensional state observations. These methods were generally based on off-policy actor-critic designs, such as DDPG or NAF (Lillicrap et al. 2015; Gu et al. 2016), soft Q-learning (Haarnoja et al. 2018b,a), and soft actor-critic (Haarnoja et al. 2019). An illustration of some of these tasks is shown in Figure 2. From our experience, we generally found that simple manipulation tasks, such as opening doors and stacking Lego blocks, either with a single position or some variation in position, can be learned in 2-4 hours of interaction, with either torque control or end-effector position control. Incorporating demonstration data and other sources of supervision can further accelerate some of these methods (Večerík et al. 2017; Riedmiller et al. 2018). Section 4.2 describes other techniques to make these approaches more sample efficient.

Although most model-free deep RL algorithms that have been applied to learn manipulation skills directly from real-world data have used off-policy algorithms based on Q-learning (Haarnoja et al. 2018a; Gu et al. 2017) or actor-critic designs (Haarnoja et al. 2018b), on-policy policy gradient algorithms have also been used. Although standard configurations of these methods can require around 10 times the number of samples as off-policy algorithms, on-policy methods such as TRPO (Schulman et al. 2015), NPG (Kakade 2002), and PPO (Schulman et al. 2017) can be tuned to be only 2-3 times less efficient than off-policy algorithms in some tasks (Peng et al. 2019). In some cases, this increased sample requirement may be justified by ease of use, better stability, and better robustness to suboptimal hyperparameter settings. On-policy policy
gradient algorithms have been used to learn tasks such as peg insertion (Lee et al. 2019), targeted throwing (Ghadirzadeh et al. 2017), and dexterous manipulation (Zhu et al. 2019) directly on real-world hardware, and can be further accelerated with example demonstrations (Zhu et al. 2019).

While in principle model-free deep RL algorithms should excel at learning directly from raw image observations, in practice this is a particularly difficult training regime, and good real-world results with model-free deep RL learning directly from raw image observations have only been obtained recently, with accompanying improvements in the efficiency and stability of off-policy model-free RL methods (Haarnoja et al. 2019, 2018b; Fujimoto et al. 2018). The SAC algorithm can learn tasks in the real world directly from images (Haarnoja et al. 2019; Singh et al. 2019), and several other recent works have studied real-world learning from images (Schoettler et al. 2019; Schwab et al. 2019). All of these experiments were conducted in relatively constrained laboratory environments, and although the learned skills use raw image observations, they generally have limited robustness to realistic visual perturbations and can only handle the specific objects on which they are trained. We discuss in Section 3.2 how image-based deep RL can be scaled up to enable meaningful generalization.

Furthermore, a major challenge in learning from raw image observations in the real world is the problem of reward specification: if the robot needs to learn from raw image observations, it also needs to evaluate the reward function from raw image observations, which itself can require a hand-designed perception system, partly defeating the purpose of learning from images in the first place, or otherwise require extensive instrumentation of the environment (Zhu et al. 2019). We discuss this challenge further in Section 4.9.
Although there are situations where a single skill is all a robot will need to perform, it is not sufficient for general-purpose robots, for which learning each skill from scratch is impractical. In such cases, there is a great deal of knowledge that can be shared across tasks to speed up learning. In this section, we discuss one particular case study of scalable multi-task learning of vision-based manipulation skills, with a focus on tasks that require pushing or picking and placing objects. Unlike in the previous section, if our goal is to learn many tasks with many objects, a challenge discussed in detail in Section 4.5, it will be most practical to learn from data that can be collected at scale, without human supervision or even a human attending the robot. As a result, it becomes imperative to remove assumptions such as regular resets of the environment or a carefully instrumented environment for measuring reward.

Motivated by these challenges, the visual foresight approach (Finn and Levine 2017; Ebert et al. 2018) leverages large batches of off-policy, autonomously collected experience to train an action-conditioned video prediction model, and then uses this model to plan to accomplish tasks. The key intuition of this approach is that knowledge learned about physics and dynamics can be shared across tasks and largely decoupled from goal-centric knowledge. These models are trained using streams of robot experience, consisting of the observed camera images and actions taken, without assumptions about reward information. After training, a human provides a goal, by providing an image of the goal or by indicating that an object corresponding to a specified pixel should be moved to a desired position. Then, the robot performs an optimization over action sequences in an effort to minimize the distance between the predicted future and the desired goal.

This algorithm has been used to complete object rearrangement tasks such as grasping an apple and putting it on a plate, re-orienting a stapler, and pushing other objects into configurations (Finn and Levine 2017; Ebert et al. 2018). Further, it has been used for visual reaching tasks (Byravan et al. 2018), object pushing and trajectory following tasks (Yen-Chen et al. 2020), for satisfying relative object positioning tasks (Xie et al. 2018), and for cloth manipulation tasks such as folding shorts, covering an object with a towel, and re-arranging a sleeve of a shirt (Ebert et al. 2018). Importantly, each collection of tasks can be performed using a single learned model and planning approach, rather than having to re-train a policy for each individual task or object. This generalization results precisely from the algorithm's ability to leverage broad, autonomously collected datasets with hundreds of objects, and the ability to train reusable, task-agnostic models from this data.

Despite these successes, there are a number of limitations and challenges that we highlight here. First, although the data collection process does not require human involvement, it uses a specialized set-up with the robot in front of a bin with tilted edges that ensure that objects do not fall out, along with an action space that is constrained within the bin. This allows continuous, unattended data collection, discussed further in Section 4.7. Outside of laboratory settings, however, collecting data in unconstrained, open-world environments introduces a number of important challenges, which we discuss in Section 4.12.
Second, inaccuracies in the model and reward function can be exploited by the planner, leading to inconsistencies in performance. We discuss these challenges in Sections 4.6 and 4.9. Finally, finding plans for complex tasks poses a challenging optimization problem for the planner, which can be addressed to some degree using demonstrations (for details, see Section 4.4). This has enabled the models to be used for tool use tasks such as sweeping trash into a dustpan, wiping objects off a plate with a sponge, and hooking out-of-reach objects with a hook (Xie et al. 2019).
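As an illustration of the planning step described above, the following sketch shows a simple random-shooting optimizer over action sequences, in the spirit of visual foresight. The predict_model and goal_cost callables are assumed interfaces standing in for the learned video-prediction model and the pixel-distance goal cost; the systems cited above use more sophisticated sampling-based optimizers such as the cross-entropy method.

import numpy as np

def plan_actions(predict_model, current_image, goal_cost, horizon=10,
                 num_samples=200, action_dim=4):
    """Random-shooting planner sketch: sample candidate action sequences, predict
    their outcomes with the learned video-prediction model, and keep the sequence
    with the lowest goal cost. `predict_model(image, actions)` and
    `goal_cost(predicted_frames)` are assumed interfaces, not the real system."""
    best_cost, best_actions = np.inf, None
    for _ in range(num_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        predicted_frames = predict_model(current_image, actions)
        cost = goal_cost(predicted_frames)   # e.g. distance of a tracked pixel to its goal
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions                      # execute the first action, then re-plan

In practice only the first action of the best sequence is executed and the optimization is repeated at the next step, in the style of model-predictive control.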
Learning to grasp remains one of the most significant open problems in robotics, requiring complex interaction with previously unseen objects, closed-loop vision-based control to react to unforeseen dynamics or situations, and in some cases pre-manipulation to isolate the object to be grasped. Indeed, most object interaction behaviors require grasping the object as the first step. Prior work typically tackles grasping as the problem of identifying suitable grasp locations (Zeng et al. 2018; Morrison et al. 2018; Mahler et al. 2018; ten Pas et al. 2017), rather than as an explicit control problem. The motivation for this problem definition is to allow the visual problem to be completely separated from the control problem, which becomes an open-loop control problem. This separation significantly simplifies the problem. The drawback is that this approach cannot adapt to dynamic environments or refine its strategy while executing the grasp. Can deep RL provide us with a mechanism to learn to grasp directly from experience, and as a dynamical and interactive process?

Figure 3.
Close-up of our robot grasping setup (left) and about 1,000 visually and physically diverse training objects (right). Each robot consists of a KUKA LBR IIWA arm with a two-finger gripper and an over-the-shoulder RGB camera.

A number of works have studied closed-loop grasping (Yu and Rodriguez 2018; Viereck et al. 2017; Hausman et al. 2017; Levine et al. 2018). In contrast to these methods, which frame closed-loop grasping as a servoing problem, QT-Opt (Kalashnikov et al. 2018) uses a general-purpose RL algorithm to solve the grasping task, which enables multi-step reasoning; in other words, the policy can be optimized across the entire trajectory. In practice, this enables the method to autonomously acquire complex grasping strategies, some of which we illustrate in Figure 4. This method is also entirely self-supervised, using only grasp outcome labels that are obtained automatically by the robot. Several works have proposed self-supervised grasping systems (Pinto and Gupta 2016; Levine et al. 2018), but to the best of the authors' knowledge, this method is the first to incorporate multi-step optimization via RL into a generalizable vision-based system trained on self-supervised real-world data.

Related to this work, Zeng et al. (2018) recently proposed a Q-learning framework for combining grasping and pushing. QT-Opt utilizes a much more flexible action space, directly commanding gripper motion in all degrees of freedom in 3 dimensions, and exhibits substantially better performance and generalization. Finally, in contrast to many current grasping systems that utilize depth sensing (Mahler et al. 2018; Morrison et al. 2018) or wrist-mounted cameras (Viereck et al. 2017; Morrison et al. 2018), QT-Opt operates on raw monocular RGB observations from an over-the-shoulder camera that does not need to be calibrated. The performance of QT-Opt indicates that effective learning can achieve excellent grasp success rates even with this rudimentary sensing set-up.

In this work, we focus on evaluating the success rate of the policy when grasping objects never seen during training from a bin, using top-down grasps (4 degrees of freedom). This task definition simplifies some robot safety challenges, which are discussed more in Section 4.11. However, this problem still retains the aspects that have been hard to deal with: unknown object dynamics and geometry, vision-based closed-loop control, a self-supervised approach, as well as hand-eye coordination without calibrating the entire system (camera and gripper locations as well as workspace bounds are not given to the policy).

For this specific task, QT-Opt can reach 86% grasp success when learning completely from data collected in previous experiments, which we refer to as offline data, and can quickly reach 96% success with additional online data of 28,000 grasps collected during a joint finetuning training phase. These results show that RL can be scalable and practical in a real robotic application, by allowing reuse of previously collected experience (offline data), potentially training purely offline (no additional robot interaction required), or combining offline and online data (which we call joint finetuning).
Leveraging offline data makes deep RL a practical approach for robotics, as it allows scaling the training dataset to a size large enough for generalization to emerge, either with a small robotic fleet of 7 robots over a period of a few months, or, by leveraging simulation, with a collection effort of just a few days (James et al. 2019; Rao et al. 2020) (see Section 4.3 for more examples of sim-to-real techniques).

Because the policy is learned by optimizing the reward across the entire trajectory (optimizing for long-term reward using Bellman backups), and is constantly re-planning its next move with vision as an input, the policy can learn complex behaviors in a self-supervised manner that would have been hard to program, such as singulation, pregrasp manipulation, dealing with a cluttered scene, and learning retrial behaviors, as well as handling environment disturbances and dynamic objects (Figure 4). Retrial behaviors can be learned because the policy reacts quickly to the visual input at every step: the images may reveal that the object dropped after the gripper lifted it from the bin, prompting the policy to re-attempt a grasp at the new location where the object fell.

Section 4.2 describes some of the design principles we used to get good data efficiency. Section 4.5 discusses strategies that allowed us to generalize properly to unseen objects. Section 4.7 describes ways we managed to scale to 7 robots with one human operator as well as enable 24/7 operation. Section 4.4 discusses how we side-stepped exploration challenges by leveraging scripted policies.

The lessons from this work have been that (1) a lot of varied data was required to learn generalizable grasping, which means that we need unattended data collection and a scalable RL pipeline; (2) the need for large and varied data means that we need to leverage all of the previously collected data (offline data), so a framework that makes this easy is crucial; and (3) to achieve maximal performance, combining offline data with a small amount of online data allows us to go from 86% to 96% grasp success.
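A central ingredient of QT-Opt-style Q-learning with continuous actions is selecting actions by stochastically optimizing the learned Q-function rather than using an explicit actor network. The sketch below illustrates this idea with a small cross-entropy method (CEM) loop; q_function is an assumed interface, and the population sizes and iteration counts are illustrative rather than the values used in the system described above.

import numpy as np

def cem_maximize_q(q_function, observation, action_dim=4, iterations=3,
                   population=64, elites=6):
    """Cross-entropy method over continuous actions, in the spirit of action
    selection for QT-Opt-style Q-learning (a sketch; `q_function(obs, action)`
    is an assumed callable returning a scalar value estimate)."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iterations):
        candidates = np.random.randn(population, action_dim) * std + mean
        scores = np.array([q_function(observation, a) for a in candidates])
        elite = candidates[np.argsort(scores)[-elites:]]   # keep best-scoring actions
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean                                            # approximate argmax_a Q(obs, a)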
Figure 4.
Eight grasps from the QT-Opt policy, illustrating some of the strategies discovered by our method: pregrasp manipulation (a, b), grasp readjustment (c, d), grasping dynamic objects and recovery from perturbations (e, f), and grasping in clutter (g, h).
Figure 5.
The Minitaur robot learns to walk from scratch using deep RL.
Although walking and running seem like effortless activities for us, designing locomotion controllers for legged robots is a long-standing challenge (Raibert 1986). RL holds the promise to automatically design high-performance locomotion controllers (Kohl and Stone 2004; Tedrake et al. 2015; Ha et al. 2018; Hwangbo et al. 2019; Lee et al. 2020a). In this case study, we apply deep RL techniques on the Minitaur robot (Figure 5), a mechanically simple and low-cost quadruped platform (De 2017). We have overcome significant challenges and developed various learning-based approaches, with which agile and stable locomotion gaits emerge automatically.

Simulation is an important prototyping tool for robotics, which can help to bypass many challenges of learning on real systems, such as data efficiency and safety. In fact, most of the prior work used simulation (Brockman et al. 2016; Coumans and Bai 2016) to evaluate and benchmark the learning algorithms (Hämäläinen et al. 2015; Heess et al. 2017; Yu et al. 2018; Peng et al. 2018a). Using general-purpose RL algorithms and a simple reward for walking fast and efficiently, we can train the quadruped robot to walk in simulation within 2-3 hours. However, a policy learned in simulation usually does not work well on the real robot. This performance gap is known as the reality gap. Our research has identified the key causes of this gap and developed various solutions. Please refer to Section 4.3 for more details. With these sim-to-real transfer techniques, we can successfully deploy the controllers learned in simulation on the robots with zero or only a handful of real-world experiments (Tan et al. 2018; Yu et al. 2019). Without much prior knowledge and manual tuning, the learning algorithm automatically finds policies that are more agile and energy efficient than the controllers developed with traditional approaches.

Given the initial policies learned in simulation, it is important that the robots can continue their learning process in the real world in a life-long fashion, to adapt their policies to changing dynamics and operating conditions. There are three main challenges for real-world learning of locomotion skills. The first is sample efficiency. Deep RL often needs tens of millions of data samples to learn meaningful locomotion gaits, which can take months of data collection on the robot. This is further exacerbated by the need for extensive hyperparameter tuning. We have developed novel solutions that have significantly reduced
the sample complexity (Section 4.2) and the need for hyperparameter tuning (Section 4.1).

Robot safety is another bottleneck for real-world training. During the exploration stage of learning, the robot often tries noisy actuation patterns that cause jerky motions and severe wear-and-tear on the motors. In addition, because the robot has yet to master balancing skills, the repeated falling quickly damages the hardware. We discuss in Section 4.11 several techniques that we employ to mitigate the safety concerns of learning locomotion with real robots.

The last challenge is asynchronous control. On a physical robot, sensor measurements, neural network inference and action execution usually happen simultaneously and asynchronously. The observation that the agent receives may not be the latest owing to computation and communication delays. However, this asynchrony breaks the fundamental assumption of the Markov decision process (MDP). Consequently, the performance of many deep RL algorithms drops dramatically in the presence of asynchronous control. In locomotion tasks, asynchronous control is essential to achieve high control frequency. In other words, to learn to walk, the robot has to think and act at the same time. We discuss our solutions to this challenge in Section 4.8, for both model-free and model-based learning algorithms.

With the progress made in overcoming these challenges, we have developed an efficient and autonomous on-robot training system (Haarnoja et al. 2019), in which the robot can learn walking and turning, from scratch in the real world, with only five minutes of data (Yang et al. 2020) and little human supervision.

In the previous section, we showed a few examples of applications of deep RL to robotic tasks that enabled progress over previous approaches in terms of generalization to a large variety of environments and objects, or to more complex behaviors. Those applications required solving, or at least mitigating, a few challenges specific to applying deep RL on real robots that have been identified over the years. In this section, we describe those challenges and provide, whenever available, our current best mitigation strategies that enabled us to apply deep RL to the applications we discussed in Section 3.
Deep RL algorithms are notoriously difficult to use in practice (Irpan 2018). The performance of commonly used RL methods depends on careful settings of the hyperparameters, and often varies substantially between runs (i.e., for different "random seeds" in simulation). Off-policy algorithms, which are particularly desirable in robotics owing to their improved sample efficiency, can suffer even more from these issues than on-policy policy gradient methods. We can broadly classify the challenges of reliable and stable learning into two groups: (1) reducing sensitivity to hyperparameters, and (2) reducing issues owing to local optima and delayed rewards.

One approach to reducing the burden of tuning hyperparameters is to use automated hyperparameter tuning methods (Chiang et al. 2019). However, such methods typically require running RL algorithms many times, which is impractical outside of simulated domains. A potentially promising alternative available for off-policy RL methods is to run multiple learning processes with different hyperparameters on the same off-policy data buffer, effectively using one run's worth of data for multiple independent learning processes. Recent work has explored this idea in simple simulated domains (Khadka et al. 2019), though it remains to be seen if such an approach can be scaled up to real-world robotic learning settings. Another approach is to develop algorithms that automatically tune their own hyperparameters, as in the case of SAC with automated temperature tuning, which has been demonstrated to greatly reduce the need for hyperparameter tuning across domains, thus enabling much easier deployment on real-world robotic systems (Haarnoja et al. 2019). Lastly, we can aim to develop methods that are, through their design, more robust to hyperparameter settings. This option, although the most desirable, is also the toughest, because it likely requires an in-depth understanding of the real reasons behind the sensitivity of current RL algorithms, which has so far proven elusive.

The second challenge to reliable and stable learning is local optima and delayed rewards. In contrast to supervised learning problems, which put a convex loss function on top of a nonlinear neural network function approximator, the RL objective itself can present a challenging optimization landscape independently of the policy or value function parameterization, which means that the usual benefits of over-parameterized networks do not fully resolve issues relating to local optima. This is indeed part of the reason why different runs of the same algorithm can produce drastically different solutions, and it presents a major challenge for real-world deployment, where even a single run can be exceptionally time-consuming. Some methods might provide better resilience to local optima by preferring stochastic policies that can explore multiple strategies simultaneously (Ziebart et al. 2008; Toussaint 2009; Rawlik et al. 2013; Fox et al. 2016; Haarnoja et al. 2017, 2018c). More sophisticated exploration strategies might further alleviate these issues (Fu et al. 2017; Pathak et al. 2017), and parameter-space exploration strategies might offer a particularly promising approach to combating this issue (Burda et al. 2019). Indeed, we have observed in some of our own experiments that, when collecting large amounts of on-policy data is not an issue, direct parameter search methods such as augmented random search (Mania et al.
2018) can often be substantially easier to deploy than more classic RL methods, likely owing to their ability to avoid local optima by exploring directly in the parameter space. It may therefore prove fruitful to investigate methods that combine entropy maximization and parameter-space exploration as a way to avoid the local optima and delayed reward issues that make real-world deployment challenging.
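For concreteness, the sketch below shows a basic random-search update in parameter space, in the spirit of augmented random search: policy parameters are perturbed directly, returns of the perturbed policies are compared, and the parameters move along the better directions. evaluate_return is an assumed rollout interface, and this omits refinements of the published algorithm such as observation normalization and scaling by the standard deviation of returns.

import numpy as np

def ars_update(theta, evaluate_return, step_size=0.02, noise_std=0.03,
               num_directions=8):
    """One random-search-style parameter-space update (a simplified sketch):
    evaluate symmetric perturbations of the policy parameters and move along
    the reward-weighted perturbation directions. `evaluate_return(theta)` runs
    one rollout with the given parameters and returns its total reward."""
    deltas = np.random.randn(num_directions, theta.size) * noise_std
    update = np.zeros_like(theta)
    for delta in deltas:
        delta = delta.reshape(theta.shape)
        r_plus = evaluate_return(theta + delta)
        r_minus = evaluate_return(theta - delta)
        update += (r_plus - r_minus) * delta   # favor directions that increase return
    return theta + step_size * update / num_directions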
Many popular RL algorithms require millions of stochastic gradient descent (SGD) steps to train policies that can accomplish complex tasks (Schulman et al. 2017; Mnih et al. 2013). This often means that millions of interactions with the
real world will be required for robotic tasks, which is quite prohibitive in practice. Without any improvement in the sample efficiency of those algorithms, the number of training steps will only increase as model sizes grow to tackle more and more complex robotic tasks.

We have found that some classes of RL algorithms are much more sample efficient than others. RL algorithms can be categorized into model-based versus model-free methods. Model-free methods are, in turn, often categorized into on-policy and off-policy methods. Generally speaking, among model-free techniques, off-policy methods are about an order of magnitude more data efficient than on-policy methods. Model-based methods can be another order of magnitude more data efficient than their model-free counterparts. In the following sections we discuss our experiences with these methods.
On-policy algorithms such as policy gradient methods (Peters and Schaal 2006; Schulman et al. 2015) have recently become popular owing to their stability and their ability to learn policies for a wide variety of tasks. Unfortunately, on-policy algorithms are constrained to only use samples coming from the latest policy being trained. This has the unfortunate consequence that the number of required data samples is equal to, and often larger than, the number of training steps needed to train a model, which in practice can be millions of steps. Training an on-policy model may thus require several million, and sometimes billions, of action executions in the real world, which is often prohibitive.

Off-policy methods do not assume that the samples come from the current trained policy. In practice, this means the samples can be reused multiple times across back-propagations, potentially hundreds or thousands of times, without any over-fitting in complex visual tasks. In Kalashnikov et al. (2018), up to 15 training steps with batch size 32 were performed per collection step on real robots during a finetuning phase, which amounts to consuming 480 training examples per collected step. Recently, SAC (Haarnoja et al. 2019), an off-policy method, was able to learn to walk on a quadruped robot, from scratch, with just 2 hours of real robot data coming from a single robot. Note that further increasing the ratio between the number of training steps and the number of collected samples may decrease the training performance due to overfitting. The optimal ratio is often task, policy, or algorithm dependent, and is an important hyperparameter to tune.
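The off-policy pattern described above can be summarized in a few lines: each environment step is stored in a replay buffer and reused for many gradient updates. The sketch below is illustrative; env_step_fn and update_fn are assumed stand-ins for data collection and for one Q-learning or actor-critic update, and the default of 15 updates per step merely echoes the ratio mentioned above as one example of the tunable hyperparameter.

import random

def off_policy_training_step(replay_buffer, update_fn, env_step_fn,
                             grad_steps_per_env_step=15, batch_size=32):
    """One round of off-policy training: collect a single transition on the robot,
    store it in `replay_buffer` (a list of transitions), and reuse the buffer for
    several gradient updates sampled uniformly at random."""
    transition = env_step_fn()            # collect one transition on the robot
    replay_buffer.append(transition)
    for _ in range(grad_steps_per_env_step):
        batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        update_fn(batch)                  # e.g. one Q-learning / actor-critic update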
Model-based algorithms such as Draeger et al. (1995) choose the optimal action by leveraging a model of the environment. The agent may learn from experience generated using this model instead of experience collected in the real environment. Thus the amount of data required by model-based methods is usually much less than for their model-free counterparts. For example, we leveraged such a technique to effectively learn to walk, from scratch, with only a few minutes of real robot data (Nagabandi et al. 2018; Yang et al. 2020). The downside is that these methods require access to such a model, which is often challenging to acquire in practice. We cover model-based techniques in more detail in Section 4.6.
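The following sketch shows the core of a simple model-based pipeline: fit a one-step dynamics model to the transitions collected on the robot, then hand the model to a planner (for example, a random-shooting optimizer like the one sketched earlier) to select actions. The network architecture and the tuple layout of transitions are assumptions for illustration, not the exact setup of the cited works.

import torch
import torch.nn as nn

def fit_dynamics_model(model, transitions, epochs=100, lr=1e-3):
    """Regress a dynamics network f(s, a) -> s' on observed transitions; the
    trained model can then be used by a planner in place of the real robot.
    `transitions` is assumed to be a tuple of (states, actions, next_states) tensors."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    states, actions, next_states = transitions
    for _ in range(epochs):
        pred_next = model(torch.cat([states, actions], dim=-1))
        loss = nn.functional.mse_loss(pred_next, next_states)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model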
When learning from high-dimensional observations, e.g. image observations, learning visual representations can consume a substantial amount of training time and sample complexity. One trick for addressing this challenge is via input remapping. In particular, when policies are trained in a laboratory environment, the true underlying state of the system may be observable during training, even when the policy to be learned must use vision. In these settings, one policy or multiple local policies can be efficiently learned without vision using privileged state information, and these policies can be distilled into a final policy that takes raw observations as input and is trained to produce the output of the non-vision policies. This trick has been successful in a number of settings, including robotic manipulation from image pixels (Levine et al. 2016; Pinto et al. 2018), autonomous driving (Chen et al. 2020), and robotic locomotion from a history of proprioceptive sensor measurements (Lee et al. 2020b).
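A minimal sketch of this input-remapping (distillation) step is shown below: a policy with access to privileged low-dimensional state provides target actions, and an image-conditioned policy is regressed onto those targets so that only raw observations are needed at test time. All of the names and the mean-squared-error objective are illustrative; the cited works differ in how the teacher is trained and how the targets are represented.

import torch
import torch.nn as nn

def distill_privileged_policy(vision_policy, privileged_policy, dataset,
                              epochs=20, lr=1e-3):
    """Supervised distillation of a privileged-state policy into a vision policy.
    `dataset` is assumed to yield paired (images, privileged_state) batches
    collected from the same trajectories."""
    optimizer = torch.optim.Adam(vision_policy.parameters(), lr=lr)
    for _ in range(epochs):
        for images, privileged_state in dataset:
            with torch.no_grad():
                target_actions = privileged_policy(privileged_state)  # teacher actions
            loss = nn.functional.mse_loss(vision_policy(images), target_actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return vision_policy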
Figure 6.
PR2 learning to scoop a bag of rice into a bowl with a spatula (left) using a learned visual state representation (right), using Finn et al. (2016b). The feature points visualized on the right images were learned without supervision with an autoencoder.
When the true states of objects cannot be measured and the local policies must themselves handle image observations, these observations can first be encoded into a lower-dimensional state space via an autoencoder, such as a spatial autoencoder that summarizes the image with a set of feature points in image space (Finn et al. 2016b); an example of such features is illustrated in Figure 6. Unsupervised feature learning methods such as autoencoders (Finn et al. 2016b; Ghadirzadeh et al. 2017), contrastive losses (Sermanet et al. 2018), and correspondence learning (Florence et al. 2018, 2019) provide a reasonable solution in cases where the inductive biases of the unsupervised algorithm effectively match the needs of the state representation.
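The feature-point representation mentioned above can be computed with a spatial soft-argmax over convolutional feature maps, as in the sketch below: each channel is converted into the expected image coordinates of its activation, yielding a compact state for the RL algorithm. This is a generic sketch of the operation, not the exact architecture of the cited spatial autoencoder.

import torch
import torch.nn.functional as F

def spatial_soft_argmax(feature_maps):
    """Convert each convolutional feature map into the expected (x, y) location of
    its activation. `feature_maps` has shape (batch, channels, height, width);
    the result has shape (batch, channels, 2) and serves as a set of feature points."""
    b, c, h, w = feature_maps.shape
    softmax = F.softmax(feature_maps.view(b, c, h * w), dim=-1).view(b, c, h, w)
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)
    expected_x = (softmax * xs).sum(dim=(2, 3))   # per-channel expected x coordinate
    expected_y = (softmax * ys).sum(dim=(2, 3))   # per-channel expected y coordinate
    return torch.stack([expected_x, expected_y], dim=-1)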
Image classifiers used by companies such as Facebook or Google are trained on tens of millions of labeled images (Kuznetsova et al. 2020), or pre-trained on billions of images (Mahajan et al. 2018; Xie et al. 2020), to reach the level of quality required by certain products. Natural language processing (NLP) systems for machine translation, or speech recognition systems such as BERT (Devlin et al. 2019), also require billions of samples to generalize and reach decent performance for real applications. In a way,
supervised learning systems are also inefficient, but in many applications, the gains in generalization and performance that deep learning provides compensate for the cost of collecting such large amounts of data. Similarly, a general-purpose robot may also require a large volume of data to train on, unless significant improvements are made in our learning algorithms. Offline training enables us to use all the data collected so far to train our policies, and thus potentially scale to billions of real-world samples.

Off-policy methods can leverage all the data collected in the past, across many experiments. In most RL benchmarks, off-policy methods still collect new data as training happens. However, off-policy methods can also be trained without collecting any new data during the training phase, similar to supervised learning problems. We call this offline training, whereas other work may call it batch RL. In Section 3.2, we showed that this mode of training allowed us to generalize grasping policies to unseen objects with just 500,000 trials. If we compare this dataset to the ImageNet dataset, which has about 1 million images, we can see that the amount of data needed to learn this complex robotic task from a vision sensor, using RL, is of the same order of magnitude as learning to classify 1,000 types of objects. In both cases, the learned models have shown the ability to generalize to a wide variety of unseen object instances. There are challenges in stabilizing offline training: it can become unstable if the state-action distribution of the latest policy differs too much from the one used to collect the training data. Recent work has just started to identify and, to some extent, address this specific problem (Kumar et al. 2019; Agarwal et al. 2020; Fujimoto et al. 2019).

An important technique to bypass the sample efficiency problem is to use simulators, which can generate realistic experience much faster than real time. Combined with sim-to-real transfer techniques, simulators allow us to learn policies that can be deployed in the real world with a minimal amount of real-world interaction. In the next section, we discuss the use of simulation.

Simulation is becoming increasingly accurate over the years, which makes it a good proxy for real robots. One bottleneck of robotic learning is collecting a large amount of data autonomously and safely. While collecting enough real data on the physical system is slow and expensive, simulation can run orders of magnitude faster than real time, and can start many instances simultaneously. In addition, data can be collected continuously without human intervention. On the real robot, human supervision is always needed for resetting experiments, monitoring hardware status and ensuring safety. In contrast, experiments can be reset automatically, and safety is not a problem in simulation. Thus, prototyping in simulation is faster, cheaper and safer than experimenting on the real robot. This enables fast iteration when developing and tuning learning algorithms. The fast pace of experiments allows us to efficiently shape the reward function, sweep the hyperparameters, fine-tune the algorithm, and test whether a given task falls within the robot's hardware capability. From our own experience, we have benefited tremendously from prototyping in simulation (Tan et al. 2018).

In addition to prototyping, can we directly use the policies trained in simulation on real robots?
Unfortunately, deploying these policies can fail catastrophically due to the reality gap. Modeling errors cause a mismatch in robot dynamics, and rendered images often do not look like their real-world counterparts. The reality gap is a major obstacle that prevents the application of learning to robotics. In simulation, robots can learn to backflip (Peng et al. 2018a), perform bicycle stunts (Tan et al. 2014), and even put on clothes (Clegg et al. 2018). In contrast, it is still very challenging to teach robots to perform basic tasks such as walking in the real world. Bridging the reality gap will allow robotics to fully tap into the power of learning. More importantly, bridging the reality gap is important to push the advancement of machine learning for robotics in the right direction. In the last few years, the OpenAI Gym benchmark (Brockman et al. 2016) has been the key driving force behind the development of deep RL and its application to robotics. However, these simulation benchmarks are considerably easier than their real-world equivalents. They do not take into consideration the detailed dynamics, partial observability, latency, and safety aspects of robotics. Thus, the scores for which researchers optimize their algorithms can be misleading: learning algorithms that perform well in the Gym environments may not work well on real robots. If we can bridge this reality gap, we would have a far better simulation benchmark for robotics, which could focus the research effort on the most pressing challenges in robot learning, such as the violated Markov assumption (asynchronous control), partial observability, and safe exploration and actuation. In the following section, we outline a few methods that have been employed successfully for sim-to-real transfer.
In simulation, we can access the ground-truth state of the robot, which can significantly simplify the learning of tasks. In contrast, in the real world, we are restricted to partial observations that are usually noisy and delayed, due to the limitations of onboard sensors. For example, it is difficult to precisely measure the root translation of a legged robot. To eliminate this difference, we can remove the inaccessible states during training (Tan et al. 2018), apply state estimation, add more sensors (e.g. motion capture) (Haarnoja et al. 2019) or learn to infer the missing information (e.g. reward) (Yang et al. 2019). On the other hand, if used properly, the ground-truth states in simulation can significantly speed up learning. "Learning by cheating" (Chen et al. 2020) first leveraged the ground-truth states to learn a privileged agent, and in a second stage imitated this agent to remove the reliance on the privileged information.
The reality gap is caused by the discrepancy between simulated and real-world physics. This error has many sources, including incorrect physical parameters, un-modeled dynamics, and the stochasticity of the real environment. However, there is no general consensus about which of these sources plays a more important role. After a large number of experiments with legged robots, both in simulation and on real robots, we found that the actuator
Prepared using sagej.cls dynamics and the lack of latency modeling are the maincauses of the model error. Developing accurate models forthe actuator and latency significantly narrow the reality gap(Tan et al. 2018). We successfully deployed agile locomotiongaits that are learned in simulation to the real robot withoutthe need for any data collected on the robot. The idea behind domainrandomization is to randomly sample different simulationparameters while training the RL policy. This can includevarious dynamics parameters (Peng et al. 2018b; Tanet al. 2018) of the robot and the environment, as wellas visual and rendering parameters such as textures andlighting (Sadeghi and Levine 2017; Tobin et al. 2017).Similar to data augmentation methods in supervised learning,policies trained under such diverse conditions tend to bemore robust to such variations, and can thus perform betterin the real-world.
The success of adversarial training methods such as generative adversarial networks (GANs) (Goodfellow et al. 2014) has resulted in their application to several other problems, including sim-to-real transfer. Adapter networks have been trained that convert simulated images to look like their real-world counterparts, which can then be used to train policies in simulation (James et al. 2017; Shrivastava et al. 2017; Bousmalis et al. 2017, 2018; Rao et al. 2020). An alternative approach is that of James et al. (2019), which trains an adapter network to convert real-world images to canonical simulation images, allowing a policy trained only in simulation to be applied in the real world. Training of the real-to-sim adapter was achieved by using domain-randomized simulation images as a proxy for real-world images, removing the need for real-world data altogether. The resulting policy achieved 70% grasp success in the real world with the QT-Opt algorithm, with no real-world data, and reaches a success rate of 91% after fine-tuning on just 5,000 real-world grasps: a result which previously took over 500,000 grasps to achieve.
In RL, "exploration" refers most generally to the problem of choosing a policy that allows an agent to discover high-reward regions of the state space. Such a policy may not itself have very high average reward; typically, good exploration strategies are risk-seeking (Bellemare et al. 2016), highly stochastic (Ziebart et al. 2008; Toussaint 2009; Rawlik et al. 2013; Fox et al. 2016; Osband et al. 2016; Haarnoja et al. 2017), and prioritize novelty over exploitation (Bellemare et al. 2016; Fu et al. 2017; Pathak et al. 2017).

In practice, effective exploration is particularly challenging in tasks with sparse reward. In the most extreme version of this problem, the agent must essentially find a (high reward) needle in a (zero reward) haystack. Unfortunately, the most natural formulation of many practical robotics tasks has this property. For many tasks, it is most natural to formulate them as binary reward tasks (Irpan 2018): a grasping robot can either succeed or fail at grasping an object, a pouring robot can pour water into a glass or not, and a mobile robot can reach the destination or not. One can reasonably regard these as the most basic task specification, with any more informative reward (e.g., distance to the goal) as additional engineer-provided shaping.

For this reason, a number of prior works have focused on studying exploration for sparse-reward robotic tasks (Andrychowicz et al. 2017; Schoettler et al. 2019). Numerous methods for improving exploration have been proposed in the literature (Ziebart et al. 2008; Toussaint 2009; Rawlik et al. 2013; Fox et al. 2016; Osband et al. 2016; Pathak et al. 2017; Haarnoja et al. 2017), and many of these can be applied directly to real-world robotic RL. However, for certain real-world robotic tasks, this problem can often be side-stepped using a combination of relatively simple manual engineering and demonstration data, and this provides a very powerful mechanism for avoiding a major challenge and instead focusing on other issues, such as efficiency and generalization. The use of demonstrations to mitigate exploration challenges has a long history in robotics (Ijspeert et al. 2002; Peters and Schaal 2008; Daniel et al. 2013; Manschitz et al. 2014), and has been used in a number of recent works (Jain et al. 2019; Nair et al. 2018). There are various ways to incorporate the demonstrations into the learning process, which are discussed in the following section.
A simple way to incorporate demonstrations to mitigate the exploration challenge is to pre-train a policy network with demonstrations via imitation learning (also called behavioral cloning) (Bojarski et al. 2016). This approach has been used in a variety of prior robotic learning works (Ijspeert et al. 2002; Peters and Schaal 2008; Daniel et al. 2013; Manschitz et al. 2014).

Although this approach is simple and often effective, it suffers from two major challenges. First, imitation learning lacks effective guarantees on performance both in theory and in practice (Ross et al. 2011), and the resulting policies can suffer from "compounding errors," where a small mistake throws the policy into an unexpected state, where it makes a bigger mistake. Second, the learned initialization can be easily forgotten during RL. As it is common practice to begin RL with a high random exploration factor, RL can quickly decimate the pre-trained policy, and end up in a state that is no better than random initialization. Note that some algorithms and policy representations are particularly amenable to initialization from demonstrations. For example, dynamic movement primitives (DMPs) can be initialized from demonstrations in a way that does not suffer from compounding errors (Schaal 2006), whereas guided policy search can be initialized from demonstrations by pre-training the local policies, which in practice tends to be a lot more stable than demonstration pre-training for standard policy gradient or actor-critic methods (Levine et al. 2015).
Another technique for incorpo-rating demonstrations in off-policy model-free RL is to adddemonstration data to the data buffer for the off-policyalgorithm. This method is often used with Q-learning oractor-critic style algorithms (Veˇcer´ık et al. 2017; Wu et al.2019). This can in principle mitigate the exploration chal-lenge, because the algorithm is exposed to high-rewardbehavior, but tends to be problematic in practice, becausecommonly used approximate dynamic programming meth-ods (i.e., value function estimation) need to see both good
and bad experience to learn which actions are desirable. Therefore, when the demonstrations are much better than the agent’s own experience, the value function will typically learn that the demonstrated states are better, but might fail to learn which actions must be taken to reach those states. This approach therefore tends to be much more effective when combined with the next method.

Figure 7. Using both unsupervised interaction and teleoperated demonstration data, the robot learns a visual dynamics model and an action proposal model that enable it to perform new tasks with novel, previously unseen tools (Xie et al. 2019). The task specification is shown on the left and the robot performing the task on the right.
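Returning to the replay-buffer approach described above, a minimal sketch of seeding an off-policy buffer with demonstration transitions might look as follows; the transition format is an assumption made only for illustration.

```python
import collections
import random

Transition = collections.namedtuple(
    "Transition", ["obs", "action", "reward", "next_obs", "done"])

class ReplayBuffer:
    def __init__(self, capacity: int = 1_000_000):
        self.storage = collections.deque(maxlen=capacity)

    def add(self, *transition):
        self.storage.append(Transition(*transition))

    def sample(self, batch_size: int):
        # Convert to a list so random.sample sees a plain sequence.
        return random.sample(list(self.storage), batch_size)

def seed_with_demonstrations(buffer: ReplayBuffer, demo_transitions):
    """Insert demonstration transitions before (and alongside) the robot's
    own experience, so the off-policy learner sees high-reward behavior."""
    for t in demo_transitions:
        buffer.add(*t)
```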
Instead of simply pre-training the policy with supervised learning, we can train it jointly, adding the loss from the policy gradient objective to a behavioral cloning loss (Hester et al. 2018; Wu et al. 2019; Johannink et al. 2019). This simple approach provides a much stronger signal to the learner, generally succeeding in staying close to the demonstrations, but at the cost of biasing policy learning: if the demonstrations are suboptimal, the behavioral cloning loss may prevent the RL algorithm from discovering a better policy.
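A minimal sketch of such a joint objective, assuming a hypothetical actor-critic setup in which a mask marks which transitions in a batch came from demonstrations:

```python
import torch

def joint_actor_loss(policy_actions: torch.Tensor,   # pi(s) for the batch
                     q_values: torch.Tensor,         # Q(s, pi(s)) from the critic
                     demo_actions: torch.Tensor,     # demonstrated actions (zeros elsewhere)
                     demo_mask: torch.Tensor,        # 1.0 where the transition is a demo
                     bc_weight: float = 0.1) -> torch.Tensor:
    """Actor objective augmented with a behavioral-cloning term.
    The BC term only applies to demonstration transitions; bc_weight trades
    off imitation against return maximization."""
    rl_loss = -q_values.mean()  # maximize the critic's estimate of pi(s)
    bc_error = ((policy_actions - demo_actions) ** 2).sum(dim=-1)
    bc_loss = (demo_mask * bc_error).mean()
    return rl_loss + bc_weight * bc_loss
```

In practice the imitation weight is often annealed over training so that the demonstrations dominate early on and the RL objective dominates later, which partially mitigates the bias discussed above.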
In model-based RL, demonstration data can also be aggregated with the agent’s experience to produce better models. However, in contrast to the model-free setting, this approach can be quite effective for model-based RL, because it enables the learned model to capture the correct dynamics in important parts of the state space. When combined with a good planning method, which can also use the demonstrations (e.g., as a proposal distribution), including demonstrations in the model training dataset can enable a robot to perform complex behaviors, such as using tools (Figure 7), which would be extremely difficult to discover automatically (Xie et al. 2019).
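The sketch below illustrates the idea of fitting a one-step dynamics model on the union of autonomous experience and demonstration data. The dataset format (each item a state, action, next-state triple of tensors) and the model interface are assumptions for illustration, not the specific pipeline of Xie et al. (2019).

```python
import torch
import torch.nn as nn

def fit_dynamics_model(model: nn.Module,
                       robot_data: torch.utils.data.Dataset,
                       demo_data: torch.utils.data.Dataset,
                       epochs: int = 10,
                       lr: float = 1e-3) -> nn.Module:
    """Fit s_{t+1} ~ f(s_t, a_t) on the union of autonomous robot experience
    and demonstration trajectories, so the model also covers the task-relevant
    regions of the state space that random interaction rarely visits."""
    data = torch.utils.data.ConcatDataset([robot_data, demo_data])
    loader = torch.utils.data.DataLoader(data, batch_size=256, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action, next_state in loader:
            pred = model(torch.cat([state, action], dim=-1))
            loss = nn.functional.mse_loss(pred, next_state)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```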
In addition to demonstrations, we can also overcome the exploration challenge with a moderate amount of manual engineering, by designing “scripted” policies that can serve as initialization. Scripted policies can be incorporated into the learning process in much the same way as demonstrations, and can provide considerable benefit. In the QT-Opt grasping system (Figure 3), scripted policies are used to pre-populate the data buffer with a higher proportion of successful grasps than would be obtained with purely random actions. Although aggregating such data from a small number of demonstrations would have limited effectiveness, the advantage of a scripted policy is that it can be used to collect very large datasets. In the final QT-Opt experiment, the scripted policy was used to collect 200,000 grasp attempts, with a success rate around 15-30%. Although this success rate is much lower than that of the final policy, which succeeds 96% of the time, it was sufficient to bootstrap an effective vision-based grasping skill. Another reason why we pre-populate the replay buffer only with a scripted policy is to help keep the proportions of successful and unsuccessful episodes close to 50% each. This is motivated by techniques that re-balance classes when training a multi-class classifier, as in Chawla et al. (2002). A poorly performing policy does not generate good data for training a Q-function, since learning a good ranking of actions requires both successful and unsuccessful attempts. At the beginning of training, the policy is poor because the Q-function being learned has not yet converged; such a policy generates only unsuccessful episodes, which cannot be used to train a good Q-function. This is why the learned policy is only used to generate data once it reaches a certain level of performance. Our rule of thumb is to only start using the trained policy for data collection once it has reached at least 20% success. Scripted policies can also be used in a “residual” RL framework, which serves a similar purpose as joint training with demonstrations. In residual RL (Silver et al. 2018; Tan et al. 2018; Johannink et al. 2019), the reinforcement learner learns a policy that is additively combined with the scripted policy, i.e., π_final(s) = π_scripted(s) + π_learned(s). The motivation is similar: unlike pure initialization, the residual approach always retains the scripted component. However, unlike joint training with demonstrations, residual RL can overcome the bias in the scripted policy by learning to “undo” π_scripted(s), and therefore can in principle still converge to the optimal policy. Shaping the reward function can also side-step exploration challenges by providing the RL algorithm with additional guidance for exploration. For example, for a reaching task, one can use the distance of the agent to the goal as a negative reward, which significantly speeds up exploration. We have used this approach in several works for learning manipulation skills, such as door opening and peg insertion, where object location information is available during training (Gu et al. 2017; Levine et al. 2016). This approach is very effective for tasks where the agent has to reach a specific known location, such as in navigation tasks (Francis et al. 2020). However, we have found in practice that such an approach can be difficult to scale to many diverse manipulation tasks. This is due to two factors. First, it can be very difficult to weight the shaping terms properly to avoid greedy and unintentionally suboptimal behaviors.
For example, to open a door, one may want to get close to the handle, but may then need to back away from it in order to approach with a different gripper orientation if the handle cannot be moved with the current one. Such behavior works against the shaping term, and if its weight is too high, the reward shaping may make it impossible to discover this behavior. Second, and perhaps more importantly in real-world environments, such shaped reward functions require knowledge of the precise state of the environment, such as object locations relative to the robot.
This is feasible in simulation but can be very challenging on real robots, where the only input may be an image. Once one wants to tackle multiple manipulation tasks, dealing with those variations may be difficult to program even in simulation, since the number of state configurations one has to handle can grow exponentially. While we have discussed how the challenge of exploration can be side-stepped by employing demonstrations, scripted policies, and reward shaping, the study of exploration and curiosity in robotic learning still plays an important role. Indeed, we can regard those approaches as a means to parallelize research on robotic learning: if we aim to study perception, generalization, and complex tasks, we can avoid needing to solve exploration as a prerequisite.
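As a concrete illustration of the residual formulation π_final(s) = π_scripted(s) + π_learned(s) discussed above, the following sketch combines a hypothetical scripted controller with a learned correction; both callables are placeholders.

```python
import numpy as np

def scripted_policy(obs: np.ndarray) -> np.ndarray:
    """Hand-engineered baseline controller (hypothetical placeholder)."""
    return np.zeros(7)  # e.g., a 7-DoF arm command

def residual_policy(obs: np.ndarray, learned_policy) -> np.ndarray:
    """Residual RL: the executed action is the scripted action plus a learned
    correction, so the scripted behavior is always retained, but the learner
    can in principle undo it wherever it is suboptimal."""
    return scripted_policy(obs) + learned_policy(obs)
```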
Generalization to arbitrary new skills, environments, or tasks remains an unsolved problem. Solving it is required to allow robots to operate in a wide variety of real-world scenarios. However, there are a few restricted settings in which we have seen good generalization. In the following, we cover two important aspects: (1) data diversity that covers the space over which we want to generalize, and (2) a correct train and test evaluation protocol that allows us to optimize the system for better generalization.
Data diversity that covers the space of generalization we care about is critical for good performance with deep learning, and deep RL is no exception. In QT-Opt, we cared about generalization to objects that were never seen during training. Thus, we made sure that during data collection the agent would see more than 1,000 different object types. If we had only collected data with a small set of objects, we would likely not achieve the generalization capability that we need. By analogy, we cannot expect a model trained on CIFAR-100 (with 100 classes) to generalize as well as a model trained on ImageNet (with 1,000 classes). The same holds for robotics: if we want to generalize to arbitrary objects, we may need to collect data with thousands of them; if we want the policy to be agnostic to the robot arm geometry, we may need to train with thousands of arm variations, and so on. A lot of recent work has leveraged domain randomization in simulation to obtain good sim-to-real transfer precisely because generalization to a new environment was required. There is a tradeoff here, as more environment diversity may lower the performance of the policies. Often this can be alleviated with larger and better neural network architectures. As an example, a larger and deeper than usual neural network was required in Kalashnikov et al. (2018) for the Q-function to cope with the large variety of objects and to achieve good performance on test objects.
To get good generalization, the entire system, including its hyperparameters, has to be tuned to optimize for it. This means that when we define the evaluation protocol, we have to be careful to define two MDPs: one for training and a separate one for evaluation. This separation has to be made based on what we want to generalize over: if we want a policy that can grasp new objects, the training MDP should use a different set of objects than the testing MDP, both in simulation and in the real setup; if we care about generalizing to new robot dynamics, we should define the training MDP with different dynamics than the testing MDP.
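A minimal sketch of such a protocol for object-level generalization, in which a held-out set of object identities defines the evaluation MDP; the object-ID list and split fraction are placeholders.

```python
import random

def split_objects_for_generalization(object_ids, test_fraction=0.2, seed=0):
    """Hold out a disjoint set of objects: the training MDP only ever sees
    train_objects, and all reported success rates are measured on test_objects."""
    rng = random.Random(seed)
    ids = list(object_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    test_objects, train_objects = ids[:n_test], ids[n_test:]
    return train_objects, test_objects
```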
There have been notable success stories in robotics withmodel-based RL approaches that learn a model of thedynamics and use that model to choose actions (Deisenrothet al. 2013; Levine et al. 2016; Lenz et al. 2015; Finnand Levine 2017; Nagabandi et al. 2018; Xie et al. 2019;Kurutach et al. 2018; Yang et al. 2020). Here, we usethe term ‘model-based’ to describe algorithms that learn amodel of the dynamics from data, not to refer to the settingwhere a model is known a priori. Empirically, these methodshave enjoyed superior sample complexity in comparison tomodel-free approaches (Deisenroth et al. 2013; Nagabandiet al. 2018; Yang et al. 2020), have scaled to vision-basedtasks (Levine et al. 2016; Finn et al. 2016b; Finn and Levine2017), and demonstrated generalization capabilities to manyobjects and tasks when the model is trained on large, diversedatasets (Finn and Levine 2017; Yang et al. 2020). Thesegeneralization capabilities are a natural byproduct of beingable to train on off-policy datasets.Despite the benefits of model-based RL methods, aprimary, well-known challenge faced by such model-basedRL approaches is model exploitation, i.e. when the modelis imperfect in some parts of the state space, and theoptimization over actions finds parts of the state space wherethe model is erroneously optimistic. This can result in pooraction selection. Although this challenge is real, we havefound that, in practice, we have multiple tools for mitigatingit. First, we have found that optimization under the model issuccessful when the data distribution consists of particularlybroad distributions over actions and states (Finn and Levine2017). In problem domains where this is not possible, oneeffective tool is data aggregation, which interleaves the datacollection and model learning, similar to DAGGER (Rosset al. 2011). Whenever the model is inaccurate and getsexploited, more data in the real world is collected to re-train the model. Another tool is to represent and accountfor model uncertainty (Deisenroth and Rasmussen 2011).Acquiring accurate uncertainty estimates when using neuralnetwork models is particularly challenging, though there hasbeen some success on physical robots (Nagabandi et al.2020). If we cannot obtain uncertainty estimates, then we canalternatively model the data distribution that the model wasfit, and constrain the optimization to that distribution. Wehave found this approach to be particularly effective whenusing models fit locally around a relatively small numberof trajectories (Levine et al. 2016; Chebotar et al. 2017b).We can achieve a similar effect, but without having to refitmodels from scratch, by learning to adapt models to localcontexts from a few transitions (Clavera et al. 2019): thisapproach allows us to automatically construct local modelsfrom short windows of experience. These local models havebeen demonstrated on a variety of robotic manipulation andlocomotion problems.Even if the learned model is accurate for a single-stepprediction, error can accumulate over the a long-horizon
plan. For example, the predicted and real trajectories can quickly diverge after a contact event, even if the single-step model error is small. We found that using multi-step losses (Finn and Levine 2017; Yang et al. 2020), shorter horizons (when applicable) (Nagabandi et al. 2018), and replanning (Finn and Levine 2017; Nagabandi et al. 2018) are effective strategies for limiting the error accumulation and recovering from model exploitation. Recent advances in deep learning have also been driven by faster compute architectures and the availability of ever-growing (labeled) datasets (Garofolo et al. 1993; Deng et al. 2009). In addition, various open-source efforts, such as those of Paszke et al. (2017) and Abadi et al. (2015), have helped minimize the cost of entry. Importantly, progress was also enabled because the time it took to train deep models and iterate on them became shorter and shorter. This holds true for robotic learning as well: the faster training data can be collected and a hypothesis can be tested, the faster progress will be made. Despite advances in data efficiency (Section 4.2), deep RL still requires a fair amount of data, especially if visual information (images) is part of the observation. The majority of robot learning experiments to date were conducted on a single robot closely monitored by a single human operator. This one-to-one relation between robot and operator has been a tedious but effective way to ensure continuous and safe operation. The human can reset the scene, stop the robot in unsafe situations, and simply restart and reset the robot on failures. However, to scale up data collection efforts and increase the throughput of evaluation runs, robots need to run without human supervision. It is impractical to allocate more operators to a setup with multiple robots, or whenever a single robot is meant to run around the clock, and especially both. In the following we discuss the particular challenges that arise in those settings, namely (1) designing the experimental setup to maximize throughput, i.e., the number of episodes/trials per hour, (2) facilitating continuous operation of the robots, and (3) dealing with non-stationarity due to environment changes.
The experimental setup itself,i.e. how a particular robot is set up to tackle a specifictask, is an important and often overlooked aspect ofa successful experiment. Oftentimes the setup has beencarefully engineered or the task has been chosen such thatthe robot can reset the scene to facilitate unattended andpotentially round-the-clock operation. For example, in (Pintoand Gupta 2016; Levine et al. 2018; Kalashnikov et al. 2018;Zeng et al. 2018; Cabi et al. 2019; Dasari et al. 2019), theworkspaces are convex, the objects involved allow for safeinteraction, and action-spaces are mostly restricted to top-down combined with either intrinsic compliance of the robotitself and/or a wrist mounted force-torque sensor to detectand stop unsafe motions. Ideally, the experimental setup isas unconstrained as possible, but in practice is restricted tocreate a safe action space for the robot (see Section 4.11.1).
Round-the-clock operation will stress the robot itself as well as the experimental setup. Repeated, potentially unintended contact of the robot with objects and the environment will eventually wear out any experimental setup and needs to be considered upfront. The challenge for long-running experiments is to increase the mean time between failures while ensuring that the data being collected is indeed useful for training. The former requires understanding the root cause of each intervention and developing fail-safe redundancies; we discuss this challenge further in Section 4.12. Similarly, to ensure that the collected data is not compromised, adding sanity checks is recommended, along with actually using the data early to train and re-train models. Beyond simply acquiring more data faster, running experiments around the clock also exposes robots to varying lighting conditions, allowing us to train more robust policies. However, spot-checking the collected data is important: we noticed, for example, that the ceiling lights automatically turned off for part of the night, resulting in very dark images that compromised the data.
A learned policy will fail if aspects of the environment have changed significantly since training. For example, the lighting conditions may shift substantially at night if windows are present in the room, and evaluations done at night may give very different results if no data was collected at that time. The underlying dynamics may also have shifted since training due to hardware degradation: changes in battery level, wear and tear, and hardware failures are major causes of such dynamics changes. Traditional learning-based approaches, which have distinct training and testing phases and assume a stationary distribution between them, suffer from hardware degradation or environment changes not captured during data collection. In extreme cases in locomotion, a learned policy can stop working after merely a few weeks due to significant changes in the robot dynamics. To address these challenges, learning algorithms need to adjust online (Yu et al. 2017), optimize for quick adaptation (Finn et al. 2017a; Yang et al. 2019), or learn in a lifelong fashion. This also has consequences for evaluation protocols that compare two learned policies, or even the same policy at different times. We recently found that the best policy learned in Levine et al. (2018) was sensitive to hardware degradation of the fingers, which caused a consistent performance drop of 5% in as little as 800 grasps executed on a single robot. One way to mitigate this is to use proper A/B testing protocols, as described in Tang et al. (2010).
The MDP formulation assumes synchronous execution: the observed state remains unchanged until the action is applied. However, on real robotic systems, execution is asynchronous: the state of the robot continues to evolve while the state is measured and transmitted and the action is calculated and applied.
Latency measures the delay from when theobservation is measured at the sensor, to when the actionis actually executed at the actuator. This delay is usuallyon the order of milliseconds to seconds, depending on thehardware and the complexity of the policy. The existence
of latency means that the next state of the system does not directly depend on the measured state, but instead on the state a latency interval after the measurement, which is not observable. Latency violates the most fundamental assumption of the MDP (Xiao et al. 2020), and thus can cause some RL algorithms to fail. For example, we tested soft actor-critic (SAC) (Haarnoja et al. 2018c) and QT-Opt (Xiao et al. 2020), two state-of-the-art off-policy algorithms, on learning to walk with a simulated quadruped robot and on grasping objects with an arm, under different latencies. Although both QT-Opt and SAC can learn efficiently when the latency is zero, they failed as we increased the latency. Clearly, we need special treatment to combat the non-Markovian behavior introduced by latency. For model-based methods, the planning component is often computationally expensive and incurs additional latency. For example, the popular sampling-based planner, the cross-entropy method (CEM) (De Boer et al. 2005), needs to roll out many trajectories and update the underlying distribution over optimal action sequences. Even if CEM is parallelized on the latest GPU, planning alone can still take tens of milliseconds. To accommodate such latency, in Yang et al. (2020) we plan the optimal action sequence based on a future state, predicted using the learned dynamics model, to compensate for the latency caused by the planning algorithm. For model-free methods, one approach is to add recurrence to the policy network and, in particular, to include the previous actions taken by the policy as part of the state definition. The recurrent neural network can learn to extrapolate the observation to the time at which the action is applied, based on the memorized previous observations. Another approach along the same lines, which avoids the additional cost of training recurrent neural networks, is to augment the observation space with a window of previous observations and actions. In practice, we find that the latter is simpler and equally effective (Haarnoja et al. 2018c; Xiao et al. 2020). A critical component required for any application of RL is the reward function. In simulation or video game environments, the reward function is typically easy to specify, because one has full access to the simulator or game state, and can determine whether the task was successfully completed or access the score of the game. In the real world, however, assigning a score to quantify how well a task was completed can be a challenging perceptual problem in its own right. In most of our case studies, we side-step this difficulty in one of the following ways. (1) Instrumenting the environment with additional sensors that provide reward information. For example, an inertial measurement unit was used to measure the angle of the door and the handle to learn a door-opening task in Chebotar et al. (2017b), and a motion capture device was used to measure how fast the quadruped robot walks (Haarnoja et al. 2019). (2) Simple heuristics such as image subtraction or target joint encoder values can be valuable in some cases. For example, Kalashnikov et al. (2018) used the gripper encoder values and a comparison of images with and without the grasped object to determine whether an object was successfully grasped. (3) Learning a visual prediction model as in Finn and Levine (2017) avoids the need to define reward functions at training time: instead, the reward is specified at evaluation time based on a goal image or equivalent representation.
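Returning to the latency discussion above, the following sketch illustrates the simple observation-augmentation strategy: the current observation is concatenated with a window of recent observations and actions so that a feed-forward policy can compensate for actuation delay without a recurrent network. The dimensions and window size are placeholders.

```python
import collections
import numpy as np

class HistoryAugmentedObservation:
    """Concatenate the current observation with the k most recent
    observations and actions."""
    def __init__(self, obs_dim: int, act_dim: int, history: int = 3):
        self.obs_hist = collections.deque([np.zeros(obs_dim)] * history, maxlen=history)
        self.act_hist = collections.deque([np.zeros(act_dim)] * history, maxlen=history)

    def augment(self, obs: np.ndarray) -> np.ndarray:
        """Build the policy input from the current observation plus history."""
        return np.concatenate([obs, *self.obs_hist, *self.act_hist])

    def record(self, obs: np.ndarray, action: np.ndarray) -> None:
        """Store what was observed and commanded at this control step."""
        self.obs_hist.append(obs)
        self.act_hist.append(action)
```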
However, none of these reward-specification strategies necessarily generalizes to every robot task one might wish to solve using RL. Learning the reward function itself is a promising avenue for addressing this problem. It can be learned explicitly from demonstrations (Finn et al. 2016a), from human annotation (Cabi et al. 2019), from human preferences (Sadigh et al. 2017; Christiano et al. 2017), or from multiple sources of human feedback (Bıyık et al. 2020). In these examples, reward function learning is typically done in parallel with the RL process, because new experience data helps train a better reward function approximation. However, large amounts of demonstrations or annotations may be required. The process of learning reward functions from demonstrations, called inverse RL, is an underspecified problem (Ziebart et al. 2008), making it difficult to scale to image observations, and exploitation of the reward can happen even with in-the-loop reward learning. There are promising techniques that try to address some of these problems, including meta-learned priors (Xie et al. 2018) and active queries (Singh et al. 2019), but learning rewards with minimal human supervision in the general case remains an unsolved problem.
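As a simple instance of specifying the reward from a goal rather than a hand-coded function (option (3) above), one could score progress by distance to a goal image. The pixel-space distance below is only a stand-in for the learned visual models used in practice.

```python
import numpy as np

def goal_image_reward(current_image: np.ndarray,
                      goal_image: np.ndarray,
                      scale: float = 1e-3) -> float:
    """Reward specified at evaluation time from a goal image: negative mean
    pixel error (a learned embedding distance could be substituted)."""
    assert current_image.shape == goal_image.shape
    diff = current_image.astype(np.float32) - goal_image.astype(np.float32)
    return -scale * float(np.mean(diff ** 2))
```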
One promising approach towards enabling robots to learntasks efficiently is to leverage previous experience fromother tasks rather than training for a task completely fromscratch. Multi-task learning approaches aim to do exactlythis by learning multiple tasks at once, rather than trainingfor a single task. Similarly, meta-learning algorithms trainacross multiple tasks such that learning a new future taskcan be done very efficiently. Although these approaches haveshown considerable promise in enabling robots to quicklyadapt to new object configurations (Duan et al. 2017), newobjects (Finn et al. 2017b; James et al. 2018), and newterrains or environment conditions (Clavera et al. 2019; Yuet al. 2019), a number of challenges remain in order to makethem practical for learning across many different roboticcontrol tasks in the real world.The first challenge is to specify the task collection. Thesealgorithms assume a collection of training tasks that arerepresentative of the kinds of tasks that the robot mustgeneralize or adapt to at test time. However, specifying areward function for a single task already presents a majorchallenge (Section 4.9), let alone for tens or hundreds oftasks. Some prior works have proposed solutions to thisproblem by deriving goals or skills in an unsupervisedmanner (Gregor et al. 2017; Jabri et al. 2019). However, wehave yet to see these approaches show significant success inreal world settings.Another significant challenge lies in the optimizationlandscape of multiple tasks. Learning multiple tasks atonce can present a challenge even for supervised learningproblems due to different tasks being learned at differentrates (Chen et al. 2018; Schaul et al. 2019) or the challengesin determining how to resolve conflicting gradient signalsbetween tasks (Sener and Koltun 2018). These optimizationchallenges can be exacerbated in RL settings, where theyare confounded with challenges in trading off explorationand exploitation. These challenges are less severe for
similar tasks (Duan et al. 2016; Finn et al. 2017a; Rakelly et al. 2019), but pose a major challenge for more distinct tasks (Parisotto et al. 2016; Rusu et al. 2015). Finally, as we scale learning algorithms towards many different tasks, all of the existing challenges discussed above remain and can become even more difficult, including the need for resetting the environment to states that are relevant for the current task (Section 4.12), operating robots at scale (Section 4.7), and handling non-stationarity (Section 4.7.3). Safety is critical when we apply RL on real robots. Although sufficient exploration leads to more efficient learning, exploring directly in the real world is not always safe. Repeated falling, self-collisions, jerky actuation, and collisions with obstacles may damage the robot and its surroundings, which requires costly repairs and manual interventions (Section 4.12).
One simple wayto avoid unsafe behaviors is to restrict the action spacesuch that any action that a learned policy can take is safe.This is usually very restrictive and cannot be applied to allapplications. However, there are many cases, particularly insemi-static environments and tasks, such as grasping andmanipulation, where this is the right approach. Graspingobjects in a bin is a very common task in logistics. Inthese settings, safety can typically be enforced by restrictingthe work space. For example, in Levine et al. (2018) andKalashnikov et al. (2018), all actions are selected throughsampling, and unsafe samples are rejected. This allows us toperform safety checks or add constraints to the action space.By using a geometric model of the robot and the world, wecan reject actions that are outside the 3D volume above thebin, and reject actions that violate kinematic or geometricconstraints. We can also enforce constraints on the velocityof the arm.Although this allows us to handle safety for parts of therobot and environment that can be modeled, it does notdeal with anything that is unmodeled, such as objects inthe scene that we might want to grasp or push aside beforegrasping. We can mitigate this issue by using a force-torquesensor at the end-effector to detect and stop motion when animpact occurs. From the point of view of the RL agent, thisaction appears to have a truncated effect. This combinationof strategies can provide for a workable level of safety ina simple and effective way for tasks that are quasi-static innature.
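A minimal sketch of this kind of sample-and-reject safety layer, with placeholder workspace bounds and velocity limits:

```python
import numpy as np

def is_action_safe(target_pos: np.ndarray,
                   workspace_min: np.ndarray,
                   workspace_max: np.ndarray,
                   velocity: np.ndarray,
                   max_speed: float = 0.2) -> bool:
    """Reject sampled actions whose commanded end-effector target leaves the
    3D volume above the bin or exceeds a velocity limit."""
    in_workspace = bool(np.all(target_pos >= workspace_min) and
                        np.all(target_pos <= workspace_max))
    slow_enough = float(np.linalg.norm(velocity)) <= max_speed
    return in_workspace and slow_enough

def sample_safe_action(sample_action, is_safe, max_tries: int = 100):
    """Draw candidate actions until one passes the safety checks.
    `sample_action` and `is_safe` are arbitrary callables supplied by the user."""
    for _ in range(max_tries):
        action = sample_action()
        if is_safe(action):
            return action
    raise RuntimeError("No safe action found; stopping the robot.")
```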
Typically, exploration strategies are realized by adding random noise to the actions. Uncorrelated random noise injected in the action space for exploration can cause jerky motions, which may damage the gearbox and the actuators, and is thus unsafe to execute on the robot. Options for smoothing out jerky actions during exploration include: reward shaping that penalizes jerkiness of the motion, mimicking smooth reference trajectories (Peng et al. 2018a), learning additive feedback on top of a trajectory generator (Iscen et al. 2018), sampling temporally coherent noise (Haarnoja et al. 2019; Yang et al. 2020), or smoothing the action sequence with low-pass filters; a minimal sketch of the last option is given below. All these techniques work well, although additional manual tuning or user-specified data may be required. In the locomotion case study (Section 3.3), because the learning algorithm can freely explore the policy space, the converged gait may not be periodic, may be jerky, or may use too much energy, which can damage the robot and its surroundings. Such gaits usually do not resemble the animal gaits that we are familiar with in nature. Although it is possible to mitigate these problems by shaping the reward function, we find that a better alternative, which requires less tuning, is to incorporate a periodic and smooth trajectory generator into the learning process. We developed a novel neural network architecture, policies modulating trajectory generators (PMTG) (Iscen et al. 2018), which can effectively incorporate prior knowledge of locomotion and regularize the learned gait. PMTG subdivides the controller into an open-loop and a feedback component. The open-loop trajectory generator creates smooth and periodic leg motion, whereas the feedback policy, represented by a neural network, is learned to modulate this trajectory generator to change walking speed, direction, and style. As a result, PMTG policies are safe to deploy or to learn directly on the real robot.
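As noted above, a minimal sketch of temporally coherent exploration noise obtained by low-pass filtering white Gaussian noise; the filter coefficient and noise scale are placeholders to be tuned per robot.

```python
import numpy as np

class LowPassExplorationNoise:
    """Temporally coherent exploration noise: instead of injecting independent
    Gaussian noise at every step, low-pass filter it so the exploratory motion
    stays smooth and does not stress the gearbox and actuators."""
    def __init__(self, act_dim: int, sigma: float = 0.1, alpha: float = 0.8):
        self.sigma = sigma
        self.alpha = alpha          # closer to 1.0 means smoother, slower-varying noise
        self.state = np.zeros(act_dim)

    def sample(self) -> np.ndarray:
        white = np.random.normal(0.0, self.sigma, size=self.state.shape)
        self.state = self.alpha * self.state + (1.0 - self.alpha) * white
        return self.state
```

The filter coefficient trades off smoothness against how quickly the exploration signal can change direction, so in practice it is chosen jointly with the control frequency.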
It is crucial to recognize when an unsafe situation is about to happen, so that a recovery policy can be deployed to keep the robot safe, or the robot can be shut down completely. Heuristic approaches can be designed to recognize these unsafe states or actions by checking whether an action will cause a collision, or whether the power or torque exceeds a limit. Performing these rule-based safety checks often requires careful tuning and a rich set of onboard sensors. Furthermore, we can also employ learning to recognize unsafe situations. Such approaches can use ensemble models to estimate the uncertainty of certain predictions (Deisenroth and Rasmussen 2011; Eysenbach et al. 2018), which can be a good indicator of whether unsafe behavior may happen, or they can directly learn the probability of future unsafe behaviors from experience (Gandhi et al. 2017; Srinivasan et al. 2020). Once a precarious situation is recognized, a recovery policy can be deployed to move the robot back to a safe state. The task policy, the recovery policy, and the safety classifier can all be learned simultaneously (Eysenbach et al. 2018; Thananjeyan et al. 2020). For example, in a locomotion task, while the robot is balanced, the task policy (walking) is executed and updated. When the robot is about to fall, as predicted by the learned Q-function, the recovery policy (standing up) takes over, and the data collected in this mode is used to update the recovery policy. We showed that learning them simultaneously can dramatically reduce the number of falls during training.
One obvious way to avoid unsafe behaviors is to penalize unsafe actions each time they are taken. However, this can be hard in practice, as the weight of this penalty term needs careful tuning. A more effective alternative is to formulate safe RL as a constrained Markov decision process (C-MDP) (Altman 1999). For example, TRPO (Schulman et al. 2015) ensures stable learning using a KL divergence constraint. More
recently, Achiam et al. (2017) and Bohez et al. (2019) have also applied constraint-based optimization to model safety as a set of hard constraints. In our locomotion projects, we formulated a C-MDP with inequality constraints on the roll and pitch of the robot base, which constitute a rough measure of balance. If the state of the robot stays within the constraints throughout the entire training process, the robot is guaranteed to stay upright, which minimizes the chance of falling while the robot is learning to walk. The constrained formulation usually performs better because, as long as the constraints are met, no gradient is generated from them, and thus no interference can happen between the safety constraints and the reward objective. However, overly stringent constraints limit exploration and can lead to slow learning. Last but not least, even if the training process is safe, the final learned policy can execute unexpected, and potentially unsafe, actions when encountering unseen observations. To improve the generalization of the policy to unseen situations, we adopted a robust control approach: we use domain randomization, which samples different physical parameters, or add perturbation forces to the robot during training, either randomly or adversarially (Pinto et al. 2017), to force it to learn to react under a wide variety of observations. Before deploying the policy on the robots, we also perform extensive evaluations in simulation of the safety and performance of the controller in untrained scenarios. Occasionally, a robot that was trained to be robust and passed all the safety checks in simulation can still misbehave in the real world. In these rare situations, model-based or heuristic safety checks, such as self-collision detection, power/torque limits, and acceleration thresholds, will trigger and shut down the robot.
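The sketch below illustrates this kind of heuristic, last-line-of-defense check; the tilt and torque thresholds and the robot interface are hypothetical placeholders rather than values or APIs used in our experiments.

```python
import numpy as np

def safety_check(roll: float, pitch: float, joint_torques: np.ndarray,
                 max_tilt_rad: float = 0.4, max_torque: float = 30.0) -> bool:
    """Heuristic safety check: the robot must stay roughly upright and
    every joint torque must stay below its limit."""
    upright = abs(roll) < max_tilt_rad and abs(pitch) < max_tilt_rad
    within_torque = bool(np.all(np.abs(joint_torques) < max_torque))
    return upright and within_torque

def step_with_shutdown(robot, policy, obs):
    """Execute the policy only while the heuristic checks pass.
    The `robot` object and its methods are hypothetical placeholders."""
    roll, pitch = robot.base_orientation()
    if not safety_check(roll, pitch, robot.joint_torques()):
        robot.emergency_stop()
        return None
    return robot.apply_action(policy(obs))
```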
We use the term robot persistence to refer to the capabilityof the robot to persist in collecting data and training withminimal human intervention. Persistence is crucial for larger-scale robotic learning, because the effectiveness of modernmachine learning models (i.e., deep neural networks) iscritically dependent on the quantity and diversity of trainingdata, and persistence is required to collect large training sets.We can divide the problem of robot persistence into twomain categories: (1) self persistence – the robot must avoiddamaging itself during training; (2) task persistence – therobot must act so that it can continue to perform the task.Robot persistence is critical for enabling autonomous datacollection safely and at scale.
We define self-persistence as the ability of the robot to keep its full range of motion while performing a task. If the robot collides with itself or the environment and damages itself, it may lose certain abilities and require human intervention. In Section 4.11, we provide a few strategies to improve self-persistence.
Task persistence is the capabilityof the robotic setup to accomplish a range of tasks,repeatedly, in the case of grasping, hundreds of thousandsof times to learn the task. Being able to retry a task is tightly coupled with the environment itself and is to this day, still anunsolved problem for a large range of tasks.Challenges can occur where the robot work-space islimited, and thus objects required to accomplish a task mayaccidentally be thrown out of reach. In this case, we need tofind exploration strategies that avoid ending up in such states,in a very limited data regime, to avoid human interventionsthat are needed every time such an unrecoverable state isencountered. In high dimensional states, such as images, thisbecomes a challenging problem as even defining those statesbecomes a challenge on its own: how do we know from animage that an object has fallen off the bin?Another class of challenges that we also put in thiscategory is what is often called “environment reset”. Inmany cases, once the task is accomplished, changes in theenvironment may need to happen before another trial canbe done. This is easy to do in simulation: just reset thestate of the environment. In the real world, this can oftenbe much harder to accomplish, as resetting the environmentis a sequence of robotic tasks, which may be as hard orharder than the task we are trying to learn itself. An exampleis learning to screw the cap of a bottle again, we mayhave to unscrew it to be able to try to screw it again.Pouring or assembly tasks are also examples where resettingthe environment may be as challenging or may requiremany steps to accomplish. Automating the whole process ofenvironment reset is required if we want the robot to persistto learn the task. It becomes a challenge of identifying theright set of sub-tasks whose reset action we already learnedhow to do with a robot.On occasion, some tasks are physically irreversible, suchas welding two pieces of metal, cutting food with a knife,cutting paper with scissors, or writing with a marker. In thosecases, other robots may have to bring new objects to the robottrying to learn those tasks, which may be much harder thantrying to accomplish the task itself.Solving task persistence remains mostly an open problem.Although guided policy search methods that can handlerandom initial states have been developed (Montgomery andLevine 2016; Montgomery et al. 2017), they still rely onclustering the initial states into a discrete set of “similar”states, which may be impractical in some cases, such as thediverse grasping task discussed in Section 3.2 and the diversepushing task in Section 3.1.3, where the “state” includes thepositions and identities of all objects in the scene. Previouswork such as Pinto and Gupta (2016); Finn and Levine(2017) limited the task and action space to be within a bin,which helped keep objects in it by having raised side wallsas well as tackled tasks that required a simple reset: just openthe gripper above the bin and bring it back to a home positionwhich can easily be scripted. Because task persistence wasresolved to some extent, some of those work managed tocollect millions of trials (Levine et al. 2018; Kalashnikovet al. 2018). Unfortunately, many tasks do not have these niceproperties. For example, Chebotar et al. (2017a) leveraged ahuman to perform the reset by bringing the puck back to aposition where the hockey stick could hit it again. Haarnojaet al. 
(2019) had to bring the legged robot back to its initialstarting position every time the robot reached the end ofthe limited 5m workspace. In both cases, task persistencewas not achieved and humans were performing the reset
procedure. This makes data collection hard to scale because (1) it was very time consuming for a human and (2) in both cases, the experiments were stopped because the operators started to feel back pain while performing the environment resets. As a result, only a few hours of data and fewer than 1,000 trials were collected. More recently, work such as that of Eysenbach et al. (2018) has tried to tackle this issue of task persistence by integrating environment resets into the learning procedure in a task-agnostic way. However, this work only explored tasks that have a unique starting point, reachable from most states. This strategy is not always possible, for example for self-driving cars, where driving backward to return to the starting point is generally not safe. In this article, we discussed how deep RL algorithms can be approached in a robotics context. We provided a brief review of recent work on this topic, a more in-depth discussion focusing on a set of case studies, and a discussion of the major challenges in deep RL as it pertains to real-world robotic control. Our aim was to present the reader with a high-level summary of the capabilities of current deep RL methods in the robotics domain, discuss which issues make deployment of deep RL methods difficult, and provide a perspective on how some of those difficulties can be mitigated or avoided. Although deep RL is often regarded as too inefficient for real-world learning scenarios, as described in Section 4.2, we discussed how deep RL methods have in fact been applied successfully to tasks ranging from quadrupedal walking, to grasping novel objects, to learning varied and complex manipulation skills. These case studies illustrate that deep RL can indeed be used to learn directly in the real world, can learn to utilize raw sensory modalities such as camera images, and can learn tasks that present a substantial physical challenge, such as walking and dexterous manipulation. Most importantly, these case studies illustrate that policies trained with deep RL can generalize effectively, such as in the case of the robotic grasping experiments discussed in Section 3.2. However, utilizing deep RL does present a number of significant challenges, and though these challenges do not preclude current applications of deep RL in robotics, they do limit its impact. Some of these challenges have partial or complete solutions today, whereas others do not. Although current deep RL methods are not as inefficient as often believed, provided that an appropriate algorithm is used and the hyperparameters are chosen correctly, efficiency and stability remain major challenges, and additional research on RL algorithm design should focus on further improving both. The use of simulation can further reduce challenges due to sample efficiency, though simulation alone does not solve all issues with robotic learning. Exploration can pose a major challenge in robotic RL, but we outlined a variety of ways in which exploration challenges can be side-stepped in practical robotic control problems, from utilizing demonstrations to baseline hand-engineered controllers. Of course, not all exploration challenges can be overcome in this way, but “solving” the difficult RL exploration problem should not be a prerequisite for effective application of deep RL in robotics.
Generalization presents a challenge for deep RL, but in contrast to arguments made in many prior works, we do not believe that this issue is any more pronounced than in other machine learning fields, and the availability of large and diverse data can enable RL policies to generalize in the same way that it enables generalization for supervised models. Indeed, deep RL is likely to have an advantage here: if generalization is limited primarily by data quantity and diversity, automatically labeled robotic experience can likely be collected in much larger amounts than hand-labeled data. Beyond the algorithmic challenges in deep RL, robotic deep RL also presents a number of challenges that are unique to the robotics setting: learning complex skills requires considerable data collection by the robots, which in turn requires the ability to keep the robots operational with minimal human intervention. Conducting training without persistent human oversight is itself a significant engineering challenge and requires certain best practices, as we discuss in Section 4.7. This challenge is tightly connected to designing persistent robots: because we want the robot to be an autonomous agent in the real world, there are many challenges that are often overlooked in simulated environments, which we discuss in Section 4.12. As robots exist in the real world, they must also obey real-time constraints, which means that policies must be evaluated in parallel and within a limited time budget alongside the motion of the robot; this presents challenges for the classically synchronous MDP model (Section 4.8). Finally, and importantly, real-world RL requires defining a reward function. Although it is common in RL research to assume that the reward function or reward signal is an external signal provided by the environment, in robotic learning this function must itself be programmed, or otherwise learned by the robot. As we expand the number of tasks we want our robots to accomplish, via techniques such as the multi-task or meta-learning discussed in Section 4.10, the effort of defining those reward functions will continue to increase. This can serve as a major barrier to the deployment of RL algorithms in practice, though it can be mitigated with a variety of automatic and semi-automatic reward acquisition methods, as discussed in Section 4.9. We believe that these challenges, though addressed in part over the past few years, offer a fruitful range of topics for future research. Addressing them will bring us closer to a future where RL can enable any robot to learn any task. This would lead to an explosive growth in the capabilities of autonomous robots: when the capabilities of robots are limited primarily by the amount of robot time available to learn skills, rather than the amount of engineering time necessary to program them, robots will be able to acquire large skill repertoires. A suitable goal for robotic deep RL research would be to make robotic RL as natural and scalable as the learning performed by humans and animals, where any behavior can be acquired without manual scaffolding or instrumentation, provided that the task is specified precisely, is physically possible, and does not pose an unreasonable exploration challenge.
References
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C,Corrado GS, Davis A, Dean J, Devin M, Ghemawat S,Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R,Kaiser L, Kudlur M, Levenberg J, Man´e D, Monga R, Moore S,Murray D, Olah C, Schuster M, Shlens J, Steiner B, SutskeverI, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Vi´egas F,Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y andZheng X (2015) TensorFlow: Large-scale machine learningon heterogeneous systems. URL http://tensorflow.org/ . Software available from tensorflow.org.Achiam J, Held D, Tamar A and Abbeel P (2017) ConstrainedPolicy Optimization. In:
International Conference on MachineLearning .Agarwal R, Schuurmans D and Norouzi M (2020) An optimisticperspective on offline reinforcement learning. In:
InternationalConference on Machine Learning .Altman E (1999)
Constrained Markov decision processes ,volume 7. CRC Press.Andrychowicz M, Wolski F, Ray A, Schneider J, Fong R,Welinder P, McGrew B, Tobin J, Abbeel OP and Zaremba W(2017) Hindsight experience replay. In:
Advances in NeuralInformation Processing Systems . pp. 5048–5058.Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton Dand Munos R (2016) Unifying count-based exploration andintrinsic motivation. In:
Advances in Neural InformationProcessing Systems . pp. 1471–1479.Bıyık E, Losey DP, Palan M, Landolfi NC, Shevchuk G and SadighD (2020) Learning reward functions from diverse sources ofhuman feedback: Optimally integrating demonstrations andpreferences. arXiv preprint arXiv:2006.14091 .Bohez S, Abdolmaleki A, Neunert M, Buchli J, Heess N andHadsell R (2019) Value constrained model-free continuouscontrol. arXiv preprint arXiv:1902.04623 .Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B,Goyal P, Jackel LD, Monfort M, Muller U, Zhang J et al.(2016) End to end learning for self-driving cars. arXiv preprintarXiv:1604.07316 .Bousmalis K, Irpan A, Wohlhart P, Bai Y, Kelcey M, KalakrishnanM, Downs L, Ibarz J, Pastor P, Konolige K et al. (2018) Usingsimulation and domain adaptation to improve efficiency of deeprobotic grasping. In:
International Conference on Robotics andAutomation . IEEE, pp. 4243–4250.Bousmalis K, Silberman N, Dohan D, Erhan D and KrishnanD (2017) Unsupervised pixel-level domain adaptation withgenerative adversarial networks. In:
Conference on ComputerVision and Pattern Recognition .Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J,Tang J and Zaremba W (2016) OpenAI Gym. arXiv preprintarXiv:1606.01540 .Burda Y, Edwards H, Storkey A and Klimov O (2019) Explorationby random network distillation. In:
International Conferenceon Learning Representations .Byravan A, Leeb F, Meier F and Fox D (2018) Se3-pose-nets:Structured deep dynamics models for visuomotor control.
International Conference on Robotics and Automation .Cabi S, Colmenarejo SG, Novikov A, Konyushkova K, Reed S,Jeong R, Zolna K, Aytar Y, Budden D, Vecerik M, Sushkov O,Barker D, Scholz J, Denil M, de Freitas N and Wang Z (2019) A Framework for Data-Driven Robotics.Chawla NV, Bowyer KW, Hall LO and Kegelmeyer WP (2002)SMOTE: Synthetic Minority Over-Sampling Technique.
Journal of Artificial Intelligence Research
16: 321–357.Chebotar Y, Hausman K, Zhang M, Sukhatme G, Schaal S andLevine S (2017a) Combining model-based and model-freeupdates for trajectory-centric reinforcement learning. In:
International Conference on Machine Learning . JMLR. org,pp. 703–711.Chebotar Y, Kalakrishnan M, Yahya A, Li A, Schaal S and LevineS (2017b) Path integral guided policy search. In:
InternationalConference on Robotics and Automation . IEEE, pp. 3381–3388.Chen D, Zhou B, Koltun V and Kr¨ahenb¨uhl P (2020) Learning bycheating. In:
Conference on Robot Learning . PMLR.Chen Z, Badrinarayanan V, Lee CY and Rabinovich A(2018) GradNorm: Gradient Normalization for Adaptive LossBalancing in Deep Multitask Networks. In:
InternationalConference on Machine Learning .Chiang HTL, Faust A, Fiser M and Francis A (2019) Learningnavigation behaviors end-to-end with autorl.
IEEE Roboticsand Automation Letters
Advances in Neural Information Processing Systems . pp.4299–4307.Clavera I, Nagabandi A, Liu S, Fearing RS, Abbeel P, LevineS and Finn C (2019) Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning.In:
International Conference on Learning Representations .Clegg A, Yu W, Tan J, Liu CK and Turk G (2018) Learning to dress:synthesizing human dressing motion via deep reinforcementlearning. In:
SIGGRAPH Asia 2018 Technical Papers . ACM.Coumans E and Bai Y (2016) pybullet, a python module for physicssimulation, games, robotics and machine learning. http://pybullet.org/ .Daniel C, Neumann G, Kroemer O and Peters J (2013) Learningsequential motor tasks. In:
International Conference onRobotics and Automation . IEEE.Dasari S, Ebert F, Tian S, Nair S, Bucher B, Schmeckpeper K, SinghS, Levine S and Finn C (2019) RoboNet: Large-Scale Multi-Robot Learning. In:
Conference on Robot Learning .De A (2017)
Modular Hopping and Running via ParallelComposition . PhD Thesis.De Boer PT, Kroese DP, Mannor S and Rubinstein RY (2005) Atutorial on the cross-entropy method.
Annals of operationsresearch
InternationalConference on Machine Learning . Omnipress, pp. 465–472.Deisenroth MP, Neumann G and Peters J (2013) A survey on policysearch for robotics. In:
Foundations and Trends in Robotics ,volume 2. Now Publishers, Inc., pp. 1–142.Deng J, Dong W, Socher R, Li LJ, Li K and Fei-Fei L (2009)Imagenet: A large-scale hierarchical image database. In:
Conference on Computer Vision and Pattern Recognition . Ieee,pp. 248–255.Devlin J, Chang MW, Lee K and Toutanova K (2019) BERT:Pre-training of Deep Bidirectional Transformers for Language
Understanding. In:
NAACL-HLT .Draeger A, Engell S and Ranke H (1995) Model predictive controlusing neural networks.
Control Systems Magazine
Advances in Neural Information ProcessingSystems . pp. 1087–1098.Duan Y, Schulman J, Chen X, Bartlett PL, Sutskever I and Abbeel P(2016) Rl : Fast reinforcement learning via slow reinforcementlearning. arXiv preprint arXiv:1611.02779 .Ebert F, Finn C, Dasari S, Xie A, Lee A and Levine S (2018) Visualforesight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568 .Eysenbach B, Gu S, Ibarz J and Levine S (2018) Leaveno Trace: Learning to Reset for Safe and AutonomousReinforcement Learning. In: International Conference onLearning Representations .Finn C, Abbeel P and Levine S (2017a) Model-agnostic meta-learning for fast adaptation of deep networks. In:
InternationalConference on Machine Learning . JMLR. org.Finn C and Levine S (2017) Deep visual foresight for planningrobot motion. In:
International Conference on Robotics andAutomation . IEEE.Finn C, Levine S and Abbeel P (2016a) Guided cost learning:Deep inverse optimal control via policy optimization. In:
International Conference on Machine Learning . pp. 49–58.Finn C, Tan XY, Duan Y, Darrell T, Levine S and Abbeel P (2016b)Deep Spatial Autoencoders for Visuomotor Learning. In:
International Conference on Robotics and Automation . IEEE,pp. 512–519.Finn C, Yu T, Zhang T, Abbeel P and Levine S (2017b) One-Shot Visual Imitation Learning via Meta-Learning. In:
Conference on Robot Learning , Proceedings of MachineLearning Research , volume 78.Florence P, Manuelli L and Tedrake R (2019) Self-supervisedcorrespondence in visuomotor policy learning.
IEEE Roboticsand Automation Letters
Conference on Robot Learning . pp. 373–385.Fox R, Pakman A and Tishby N (2016) Taming the Noise inReinforcement Learning via Soft Updates. In:
Conference onUncertainty in Artificial Intelligence . AUAI Press.Francis A, Faust A, Chiang HTL, Hsu J, Kew JC, Fiser M andLee TWE (2020) Long-range indoor navigation with prm-rl.
IEEE Transactions on Robotics
Advances in Neural Information Processing Systems . pp. 2577–2587.Fujimoto S, Meger D and Precup D (2019) Off-Policy DeepReinforcement Learning without Exploration. In:
InternationalConference on Machine Learning .Fujimoto S, van Hoof H and Meger D (2018) AddressingFunction Approximation Error in Actor-Critic Methods. In:
International Conference on Machine Learning .Gandhi D, Pinto L and Gupta A (2017) Learning to fly bycrashing. In:
International Conference on Intelligent Robots and Systems . IEEE, pp. 3948–3955.Garofolo JS, Lamel LF, Fisher WM, Fiscus JG and Pallett DS(1993) Darpa timit acoustic-phonetic continous speech corpuscd-rom. nist speech disc 1-1.1.
NASA STI/Recon technicalreport n
International Conference on Intelligent Robots and Systems .IEEE, pp. 2351–2358.Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-FarleyD, Ozair S, Courville A and Bengio Y (2014) Generativeadversarial nets. In:
Advances in Neural InformationProcessing Systems . pp. 2672–2680.Gregor K, Rezende DJ and Wierstra D (2017) VariationalIntrinsic Control. In:
International Conference on LearningRepresentations, Workshop Track Proceedings .Gu S, Holly E, Lillicrap T and Levine S (2017) Deep reinforcementlearning for robotic manipulation with asynchronous off-policy updates. In:
International Conference on Robotics andAutomation . IEEE, pp. 3389–3396.Gu S, Lillicrap T, Sutskever I and Levine S (2016) Continuous deepq-learning with model-based acceleration. In:
InternationalConference on Machine Learning . pp. 2829–2838.Ha S, Kim J and Yamane K (2018) Automated deep reinforcementlearning environment for hardware of a modular legged robot.In:
International Conference on Ubiquitous Robots . IEEE.Haarnoja T, Ha S, Zhou A, Tan J, Tucker G and Levine S(2019) Learning to walk via deep reinforcement learning. In:
Robotics: Science and Systems .Haarnoja T, Pong V, Zhou A, Dalal M, Abbeel P and Levine S(2018a) Composable deep reinforcement learning for roboticmanipulation. In:
International Conference on Robotics andAutomation . IEEE.Haarnoja T, Tang H, Abbeel P and Levine S (2017) Reinforcementlearning with deep energy-based policies. In:
InternationalConference on Machine Learning . JMLR. org, pp. 1352–1361.Haarnoja T, Zhou A, Abbeel P and Levine S (2018b) Soft Actor-Critic: Off-Policy Maximum Entropy Deep ReinforcementLearning with a Stochastic Actor. In:
International Conferenceon Machine Learning .Haarnoja T, Zhou A, Hartikainen K, Tucker G, Ha S, Tan J,Kumar V, Zhu H, Gupta A, Abbeel P et al. (2018c) SoftActor-Critic Algorithms and Applications. arXiv preprintarXiv:1812.05905 .H¨am¨al¨ainen P, Rajam¨aki J and Liu CK (2015) Online control ofsimulated humanoids using particle belief propagation.
ACMTransactions on Graphics (TOG)
AAAI Symposium on Interactive Multi-Sensory Object Perception for Embodied Agents .Heess N, Sriram S, Lemmon J, Merel J, Wayne G, Tassa Y, ErezT, Wang Z, Eslami S, Riedmiller M et al. (2017) Emergenceof locomotion behaviours in rich environments. arXiv preprintarXiv:1707.02286 .Hester T, Vecer´ık M, Pietquin O, Lanctot M, Schaul T, Piot B,Horgan D, Quan J, Sendonaris A, Osband I, Dulac-ArnoldG, Agapiou J, Leibo JZ and Gruslys A (2018) Deep Q-learning From Demonstrations. In:
Conference on Artificial Intelligence. Hwangbo J, Lee J, Dosovitskiy A, Bellicoso D, Tsounis V, Koltun V and Hutter M (2019) Learning agile and dynamic motor skills for legged robots.
Science Robotics
International Conference on Robotics and Automation . IEEE.Irpan A (2018) Deep reinforcement learning doesn’t workyet. .Iscen A, Caluwaerts K, Tan J, Zhang T, Coumans E, SindhwaniV and Vanhoucke V (2018) Policies Modulating TrajectoryGenerators. In:
Jabri A, Hsu K, Gupta A, Eysenbach B, Levine S and Finn C (2019) Unsupervised curricula for visual meta-reinforcement learning. In: Advances in Neural Information Processing Systems. pp. 10519–10530.
Jain D, Li A, Singhal S, Rajeswaran A, Kumar V and Todorov E (2019) Learning deep visuomotor policies for dexterous hand manipulation. In: International Conference on Robotics and Automation. IEEE, pp. 3636–3643.
James S, Bloesch M and Davison AJ (2018) Task-Embedded Control Networks for Few-Shot Imitation Learning. In: Conference on Robot Learning.
James S, Davison AJ and Johns E (2017) Transferring End-to-End Visuomotor Control from Simulation to Real World for a Multi-Stage Task. In: Conference on Robot Learning.
James S, Wohlhart P, Kalakrishnan M, Kalashnikov D, Irpan A, Ibarz J, Levine S, Hadsell R and Bousmalis K (2019) Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In: Conference on Computer Vision and Pattern Recognition.
Johannink T, Bahl S, Nair A, Luo J, Kumar A, Loskyll M, Ojea JA, Solowjow E and Levine S (2019) Residual Reinforcement Learning for Robot Control. In: International Conference on Robotics and Automation. IEEE.
Kakade SM (2002) A natural policy gradient. In: Advances in Neural Information Processing Systems. pp. 1531–1538.
Kalashnikov D, Irpan A, Pastor P, Ibarz J, Herzog A, Jang E, Quillen D, Holly E, Kalakrishnan M, Vanhoucke V and Levine S (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, Proceedings of Machine Learning Research. PMLR.
Khadka S, Majumdar S, Nassar T, Dwiel Z, Tumer E, Miret S, Liu Y and Tumer K (2019) Collaborative Evolutionary Reinforcement Learning. In: International Conference on Machine Learning.
Kober J, Bagnell JA and Peters J (2013) Reinforcement learning in robotics: A survey. The International Journal of Robotics Research.
International Conference on Robotics and Automation. IEEE.
Konidaris G, Kuindersma S, Grupen R and Barto A (2012) Robot learning from demonstration by constructing skill trees. International Journal of Robotics Research.
Kroemer O, Niekum S and Konidaris G (2019) A review of robot learning for manipulation: Challenges, representations, and algorithms. CoRR abs/1907.03146. URL http://arxiv.org/abs/1907.03146.
Kumar A, Fu J, Soh M, Tucker G and Levine S (2019) Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In: Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Kurutach T, Clavera I, Duan Y, Tamar A and Abbeel P (2018) Model-Ensemble Trust-Region Policy Optimization. In: International Conference on Learning Representations.
Kuznetsova A, Rom H, Alldrin N, Uijlings JRR, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Kolesnikov A, Duerig T and Ferrari V (2020) The open images dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision.
Science Robotics 5. URL http://dx.doi.org/10.1126/scirobotics.abc5986.
Lee J, Hwangbo J, Wellhausen L, Koltun V and Hutter M (2020b) Learning quadrupedal locomotion over challenging terrain. Science Robotics.
International Conference on Robotics and Automation. IEEE.
Lenz I, Knepper RA and Saxena A (2015) DeepMPC: Learning deep latent features for model predictive control. In: Robotics: Science and Systems. Rome, Italy.
Levine S and Abbeel P (2014) Learning neural network policies with guided policy search under unknown dynamics. In: Advances in Neural Information Processing Systems. pp. 1071–1079.
Levine S, Finn C, Darrell T and Abbeel P (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research.
International Conference on Machine Learning. pp. 1–9.
Levine S, Pastor P, Krizhevsky A, Ibarz J and Quillen D (2018) Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. The International Journal of Robotics Research.
International Conference on Robotics and Automation.
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D and Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Mahajan D, Girshick R, Ramanathan V, He K, Paluri M, Li Y, Bharambe A and van der Maaten L (2018) Exploring the limits of weakly supervised pretraining. In: European Conference on Computer Vision.
Mahler J, Matl M, Liu X, Li A, Gealy D and Goldberg K (2018) Dex-Net 3.0: Computing Robust Vacuum Suction Grasp Targets in Point Clouds Using a New Analytic Model and Deep Learning. International Conference on Robotics and Automation.
Mania H, Guy A and Recht B (2018) Simple random search of static linear policies is competitive for reinforcement learning. In: Advances in Neural Information Processing Systems.
Manschitz S, Kober J, Gienger M and Peters J (2014) Learning to sequence movement primitives from demonstrations. In: International Conference on Intelligent Robots and Systems.
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D and Riedmiller M (2013) Playing Atari with Deep Reinforcement Learning. In: Advances in Neural Information Processing Systems, Deep Learning Workshop.
Montgomery W, Ajay A, Finn C, Abbeel P and Levine S (2017) Reset-free guided policy search: Efficient deep reinforcement learning with stochastic initial states. In: International Conference on Robotics and Automation. IEEE, pp. 3373–3380.
Montgomery WH and Levine S (2016) Guided policy search via approximate mirror descent. In: Advances in Neural Information Processing Systems. pp. 4008–4016.
Morrison D, Corke P and Leitner J (2018) Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach. In: Robotics: Science and Systems.
Morrison D et al. (2018) Cartman: The low-cost Cartesian Manipulator that won the Amazon Robotics Challenge. In: International Conference on Robotics and Automation. IEEE.
Nagabandi A, Konolige K, Levine S and Kumar V (2020) Deep Dynamics Models for Learning Dexterous Manipulation. In: Conference on Robot Learning.
Nagabandi A, Yang G, Asmar T, Pandya R, Kahn G, Levine S and Fearing RS (2018) Learning image-conditioned dynamics models for control of underactuated legged millirobots. In: International Conference on Intelligent Robots and Systems. IEEE, pp. 4606–4613.
Nair A, McGrew B, Andrychowicz M, Zaremba W and Abbeel P (2018) Overcoming exploration in reinforcement learning with demonstrations. In: International Conference on Robotics and Automation. IEEE, pp. 6292–6299.
Osband I, Blundell C, Pritzel A and Van Roy B (2016) Deep exploration via bootstrapped DQN. In: Advances in Neural Information Processing Systems. pp. 4026–4034.
Parisotto E, Ba J and Salakhutdinov R (2016) Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. CoRR.
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L and Lerer A (2017) Automatic differentiation in PyTorch. In: Advances in Neural Information Processing Systems Workshop on Autodiff.
Pathak D, Agrawal P, Efros AA and Darrell T (2017) Curiosity-driven exploration by self-supervised prediction. In: Conference on Computer Vision and Pattern Recognition Workshops. IEEE, pp. 16–17.
Peng XB, Abbeel P, Levine S and van de Panne M (2018a) DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG).
International Conference on Robotics and Automation. IEEE.
Peng XB, Kumar A, Zhang G and Levine S (2019) Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.
Peters J, Mülling K and Altün Y (2010) Relative entropy policy search. In: AAAI Conference on Artificial Intelligence.
Peters J and Schaal S (2006) Policy gradient methods for robotics. In: International Conference on Intelligent Robots and Systems. IEEE, pp. 2219–2225.
Peters J and Schaal S (2008) Reinforcement learning of motor skills with policy gradients. Neural Networks.
Proceedings of Robotics: Science and Systems. Pittsburgh, Pennsylvania. DOI: 10.15607/RSS.2018.XIV.008.
Pinto L, Davidson J, Sukthankar R and Gupta A (2017) Robust adversarial reinforcement learning. In: International Conference on Machine Learning. JMLR.org.
Pinto L and Gupta A (2016) Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours. In: International Conference on Robotics and Automation. IEEE.
Raibert MH (1986) Legged Robots That Balance. Cambridge, MA, USA: Massachusetts Institute of Technology. ISBN 0-262-18117-7.
Rakelly K, Zhou A, Finn C, Levine S and Quillen D (2019) Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. In: International Conference on Machine Learning.
Rao K, Harris C, Irpan A, Levine S, Ibarz J and Khansari M (2020) RL-CycleGAN: Reinforcement learning aware simulation-to-real. In: Conference on Computer Vision and Pattern Recognition.
Rawlik K, Toussaint M and Vijayakumar S (2013) On stochastic optimal control and reinforcement learning by approximate inference. In: International Joint Conference on Artificial Intelligence.
Riedmiller M, Hafner R, Lampe T, Neunert M, Degrave J, van de Wiele T, Mnih V, Heess N and Springenberg JT (2018) Learning by Playing - Solving Sparse Reward Tasks from Scratch. In: International Conference on Machine Learning.
Ross S, Gordon G and Bagnell D (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In: International Conference on Artificial Intelligence and Statistics.
Rusu AA, Colmenarejo SG, Gulcehre C, Desjardins G, Kirkpatrick J, Pascanu R, Mnih V, Kavukcuoglu K and Hadsell R (2015) Policy distillation. arXiv preprint arXiv:1511.06295.
Sadeghi F and Levine S (2017) CAD2RL: Real Single-Image Flight Without a Single Real Image. Robotics: Science and Systems.
Sadigh D, Dragan AD, Sastry S and Seshia SA (2017) Active preference-based learning of reward functions. In: Robotics: Science and Systems.
Schaal S (2006) Dynamic movement primitives - a framework for motor control in humans and humanoid robotics. In: Adaptive motion of animals and machines. Springer, pp. 261–280.
Schaul T, Borsa D, Modayil J and Pascanu R (2019) Ray Interference: a Source of Plateaus in Deep Reinforcement Learning. In: Multidisciplinary Conference on Reinforcement Learning and Decision Making.
Schoettler G, Nair A, Luo J, Bahl S, Ojea JA, Solowjow E and Levine S (2019) Deep Reinforcement Learning for Industrial Insertion Tasks with Visual Inputs and Natural Rewards. In: International Conference on Intelligent Robots and Systems.
Schulman J, Levine S, Abbeel P, Jordan M and Moritz P (2015) Trust Region Policy Optimization. In: International Conference on Machine Learning.
Schulman J, Wolski F, Dhariwal P, Radford A and Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Schwab D, Springenberg TJ, Martins FM, Neunert M, Abdolmaleki A, Hertweck T, Hafner R, Nori F and Riedmiller AM (2019) Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup. Robotics: Science and Systems.
Sener O and Koltun V (2018) Multi-task learning as multi-objective optimization. In: Advances in Neural Information Processing Systems. pp. 527–538.
Sermanet P, Lynch C, Chebotar Y, Hsu J, Jang E, Schaal S, Levine S and Brain G (2018) Time-contrastive networks: Self-supervised learning from video. In: International Conference on Robotics and Automation. IEEE.
Shrivastava A, Pfister T, Tuzel O, Susskind J, Wang W and Webb R (2017) Learning from simulated and unsupervised images through adversarial training. In: Conference on Computer Vision and Pattern Recognition.
Silver T, Allen K, Tenenbaum J and Kaelbling L (2018) Residual policy learning. arXiv preprint arXiv:1812.06298.
Singh A, Yang L, Finn C and Levine S (2019) End-To-End Robotic Reinforcement Learning without Reward Engineering. In: Robotics: Science and Systems.
Srinivasan K, Eysenbach B, Ha S, Tan J and Finn C (2020) Learning to be Safe: Deep RL with a Safety Critic.
Sünderhauf N, Brock O, Scheirer WJ, Hadsell R, Fox D, Leitner J, Upcroft B, Abbeel P, Burgard W, Milford M and Corke P (2018) The limits and potentials of deep learning for robotics. The International Journal of Robotics Research.
Sutton RS and Barto AG. Reinforcement learning: An introduction. MIT Press.
Tan J, Gu Y, Liu CK and Turk G (2014) Learning bicycle stunts. ACM Transactions on Graphics (TOG).
Robotics: Science and Systems.
Tang D, Agarwal A, O'Brien D and Meyer M (2010) Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. In: International Conference on Knowledge Discovery and Data Mining. ACM.
Tedrake R, Zhang TW and Seung HS (2015) Learning to Walk in 20 minutes. In: Workshop on Adaptive and Learning Systems.
ten Pas A, Gualtieri M, Saenko K and Platt R (2017) Grasp Pose Detection in Point Clouds. The International Journal of Robotics Research.
International Conference on Intelligent Robots and Systems. IEEE.
Toussaint M (2009) Robot trajectory optimization using approximate inference. In: International Conference on Machine Learning. ACM, pp. 1049–1056.
Večerík M, Hester T, Scholz J, Wang F, Pietquin O, Piot B, Heess N, Rothörl T, Lampe T and Riedmiller M (2017) Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817.
Viereck U, ten Pas A, Saenko K and Platt R (2017) Learning a visuomotor controller for real world robotic grasping using simulated depth images. In: Conference on Robot Learning.
Wu YH, Charoenphakdee N, Bao H, Tangkaratt V and Sugiyama M (2019) Imitation Learning from Imperfect Demonstration. In: International Conference on Machine Learning, Proceedings of Machine Learning Research.
Xiao T, Jang E, Kalashnikov D, Levine S, Ibarz J, Hausman K and Herzog A (2020) Thinking while moving: Deep reinforcement learning with concurrent control. In: International Conference on Learning Representations.
Xie A, Ebert F, Levine S and Finn C (2019) Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight.
Xie A, Singh A, Levine S and Finn C (2018) Few-Shot Goal Inference for Visuomotor Learning and Planning. In: Conference on Robot Learning, Proceedings of Machine Learning Research, volume 87. pp. 40–52.
Xie Q, Luong MT, Hovy E and Le QV (2020) Self-Training With Noisy Student Improves ImageNet Classification. In: Conference on Computer Vision and Pattern Recognition.
Yang Y, Caluwaerts K, Iscen A, Tan J and Finn C (2019) NoRML: No-reward meta learning. In: International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
Yang Y, Caluwaerts K, Iscen A, Zhang T, Tan J and Sindhwani V (2020) Data Efficient Reinforcement Learning for Legged Robots. In: Conference on Robot Learning.
Yen-Chen L, Bauza M and Isola P (2020) Experience-Embedded Visual Foresight. In: Conference on Robot Learning.
Yu K and Rodriguez A (2018) Realtime State Estimation with Tactile and Visual Sensing. Application to Planar Manipulation. In: International Conference on Robotics and Automation. IEEE.
Yu W, Tan J, Bai Y, Coumans E and Ha S (2019) Learning fast adaptation with meta strategy optimization.
Yu W, Tan J, Liu CK and Turk G (2017) Preparing for the Unknown: Learning a Universal Policy with Online System Identification. In: Robotics: Science and Systems.
Yu W, Turk G and Liu CK (2018) Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (TOG).
International Conference on Intelligent Robots and Systems. pp. 4238–4245.
Zhu H, Gupta A, Rajeswaran A, Levine S and Kumar V (2019) Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. In: International Conference on Robotics and Automation. IEEE.
Ziebart BD, Maas A, Bagnell JA and Dey AK (2008) Maximum entropy inverse reinforcement learning. In: