Reinforcement Learning for Robotic Manipulation using Simulated Locomotion Demonstrations
Ozsel Kilinc, Yang Hu, and Giovanni Montana
Abstract—Learning robotic manipulation through reinforcement learning (RL) using only sparse reward signals is still considered a largely unsolved problem. Leveraging human demonstrations can make the learning process more sample efficient, but obtaining high-quality demonstrations can be costly or unfeasible. In this paper we propose a novel approach that introduces object-level demonstrations, i.e. examples of where the objects should be at any state. These demonstrations are generated automatically through RL, hence require no expert knowledge. We observe that, during a manipulation task, an object is moved from an initial to a final position. When seen from the point of view of the object being manipulated, this induces a locomotion task that can be decoupled from the manipulation task and learnt through a physically-realistic simulator. The resulting object-level trajectories, called simulated locomotion demonstrations (SLDs), are then leveraged to define auxiliary rewards that are used to learn the manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity, and has been demonstrated to achieve higher success rates and faster learning rates compared to alternative algorithms. SLDs are especially beneficial for tasks like multi-object stacking and non-rigid object manipulation.
Index Terms—reinforcement learning, sparse reward, robotic manipulation, dexterous hand, stacking, non-rigid object
I. INTRODUCTION
Reinforcement Learning (RL) solves sequential decision-making problems by learning a policy that maximises expected rewards. Recently, with the aid of deep artificial neural networks as function approximators, RL-trained agents have been able to autonomously master a number of complex tasks, most notably playing video games [1] and board games [2]. Robot manipulation has been extensively studied in RL, but is particularly challenging to master because it may involve multiple stages (e.g. stacking multiple blocks), high-dimensional state spaces (e.g. dexterous hand manipulation [3], [4]) and complex dynamics (e.g. manipulating non-rigid objects). Although promising performance has been reported on a wide range of tasks like grasping [5], [6], stacking [7] and dexterous hand manipulation [3], [4], the learning algorithms usually require carefully-designed reward signals to learn good policies. For example, [6] propose a carefully weighted 5-term reward formula for learning to stack Lego blocks, and [8] use a 3-term shaped reward to perform door-opening tasks with a robot arm.
The authors are with Warwick Manufacturing Group (WMG), University of Warwick, United Kingdom. E-mail: {ozsel.kilinc, yang.hu.1, g.montana}@warwick.ac.uk.

The requirement of hand-engineered, dense reward functions limits the applicability of RL in real-world robot manipulation to cases where task-specific knowledge can be captured. The alternative to designing shaped rewards consists of learning with only sparse feedback signals, i.e. a non-zero reward indicating the completion of a task. Using sparse rewards is more desirable in practice: it generalises to many tasks without the need for hand-engineering [2], [9], [10]. On the other hand, learning with only sparse rewards is significantly more challenging, since associating sequences of actions to positive rewards received only when a task has been successfully completed becomes more difficult. A number of approaches that address this problem have been proposed lately [9]–[16]; some of them report some success in completing manipulation tasks like object pushing [9], [16], pick-and-place [9], stacking two blocks [11], [16], and target finding in a scene [14], [15]. Nevertheless, for more complex tasks such as stacking multiple blocks and manipulating non-rigid objects, there is scope for further improvement.

A particularly promising approach to facilitate learning has been to leverage human expertise through a number of manually generated examples demonstrating how a given task can be completed. When these demonstrations are available, they can be used by an agent in various ways, e.g. by attempting to generate a policy that mimics them [17]–[19], pre-learning a policy from them for further RL [2], [20], as a mechanism to guide exploration [7], as data from which to infer a reward function [21]–[24], and in combination with trajectories generated during RL [25]–[27]. Practically, however, human demonstrations are expensive to obtain, and their effectiveness ultimately depends on the competence of the demonstrators. Demonstrators with insufficient task-specific expertise could generate low-quality demonstrations, resulting in sub-optimal policies. Although there is an existing body of work focusing on learning with imperfect demonstrations [28]–[33], these methods usually assume that either qualitative evaluation metrics are available [28], [30], [32] or that a substantial volume of (optimal) demonstrations can be collected [29], [31], [33].

In this paper, we propose a novel approach that allows complex robot manipulation tasks to be learnt with only sparse rewards and without the need for manually-produced demonstrations. In the tasks we consider, an object is manipulated by a robot so that, starting from a (random) initial position, it eventually reaches a goal position through a sequence of states in which its location and pose vary. For example, Fig. 1 (top row) represents a pick-and-place manipulation task in which the object is being picked up by the two-finger gripper and moved from its initial state to a pre-defined target location (red sphere): robot manipulation implies object locomotion. Our key observation is that the manipulation task can be decoupled from object locomotion, and the latter can be explicitly modelled as an independent task for the object itself to learn. Fig. 1 (middle row) illustrates this idea for the pick-and-place task: the object, on its own, must learn to navigate from any given initial position until it reaches its target position. More complex manipulation tasks involving non-rigid objects can also be thought of as inducing such object locomotion tasks, e.g. see Fig. 1 (bottom row).

Fig. 1. An illustration of the proposed approach. Top row: a general robot manipulation task of pick-and-place, which requires the robot to pick up an object (green cube) and place it at a specified location (red sphere). Middle row: the corresponding auxiliary locomotion task requires the object to move to the target location. Bottom row: the auxiliary locomotion task corresponding to a pick-and-place task with a non-rigid object (not shown). Note that the auxiliary locomotion tasks usually have significantly simpler dynamics compared to the corresponding robot manipulation task, hence can be learnt efficiently through standard RL, even for very complex tasks. The learnt locomotion policy is used to inform the robot manipulation policy.

Although in the real world it is impossible for objects to move on their own, learning such object locomotion policies can be achieved in a virtual environment through a realistic physics engine such as MuJoCo [34], Gazebo [35] or Pybullet [36]. In our experience, such policies are relatively straightforward to learn using only sparse rewards, since the objects usually operate in simple state/action spaces and/or have simple dynamics, which are not constrained in any way. Once a locomotion policy has been learnt, it produces object trajectories called simulated locomotion demonstrations (SLDs). We utilise these SLDs in the form of auxiliary rewards guiding the main manipulation policy. These rewards encourage the robot, during training, to induce object trajectories that are similar to the SLDs. Although the SLDs are learnt through a realistic simulator, this requirement does not restrict their applicability to real-world problems: the manipulation policies can still be deployed on physical systems. To the best of our knowledge, this is the first time that object-level policies are trained in a physics simulator to enable robot manipulation learning with sparse rewards.

In our implementation, all the policies are learnt using deep deterministic policy gradient (DDPG) [37], which has been chosen due to its widely reported effectiveness in continuous control; however, most RL algorithms compatible with continuous actions could have been used within the proposed SLD framework. Our experimental results involve continuous control environments using the MuJoCo physics engine [34] within the OpenAI Gym framework [38]. These environments cover a variety of robot manipulation tasks with increasing levels of complexity, e.g. pushing, sliding and pick-and-place tasks with a Fetch robotic arm, in-hand object manipulation with a Shadow's dexterous hand, multi-object stacking, and non-rigid object manipulation. Overall, across all environments, we have found that our approach can achieve faster learning rates and higher success rates compared to existing methods, especially in more challenging tasks such as stacking objects and manipulating non-rigid objects.

The remainder of the paper is organised as follows. In Section II we review the most related work, and in Section III we provide some introductory background material regarding the RL modelling framework and algorithms we use. In Section IV, we develop the proposed methodology.
In Section V we describe all the environments used for our experiments, and the experimental results are reported in Section VI. Finally, we conclude with a discussion and suggestions for further extensions in Section VII.

II. RELATED WORK
A. Learning from Demonstration
A substantial body of work exists on reinforcement learning with demonstrations. Behaviour cloning (BC) methods approach sequential decision-making as a supervised learning problem [18], [19], [39], [40]. Some BC methods include an expert demonstrator in the training loop to handle the mismatch between the demonstration data and the data encountered during training [17], [41]. Recent BC methods have also considered adversarial frameworks to improve policy learning [24], [42]. A different approach consists of inverse reinforcement learning, which seeks to infer a reward/cost function to guide policy learning [21]–[23]. Several methods have been developed to leverage demonstrations for robotic manipulation tasks with sparse rewards. For instance, [7], [25] jointly use demonstrations with trajectories collected during the RL process to guide exploration, and [20] use the demonstrations to pre-learn a policy, which is further fine-tuned in a subsequent RL stage. All the above methods require human demonstrations on the full state and action spaces of the task. These approaches may also need specialised data capture setups such as teleoperation interfaces to obtain the demonstrations. In general, obtaining demonstrations is an expensive process in terms of both human effort and hardware requirements. In contrast, the proposed method generates object-level demonstrations autonomously, and could potentially be used jointly with task demonstrations when these are available.
B. Goal-Conditioned Policies
Goal-conditioned policies [43] that can effectively generalise over multiple goals have been shown to be promising for robotic problems. For manipulation tasks with sparse rewards, several approaches have recently been proposed to automatically generate a curriculum of goals to facilitate policy learning. For instance, [44] used a self-play approach on reversible or resettable environments, [45] employed adversarial training for robotic locomotion tasks, [46] proposed variational autoencoders for visual robotics tasks, and [9] introduced Hindsight Experience Replay (HER), which randomly draws synthetic goals from previously encountered states. HER in particular has proved particularly effective, although the automatic goal generation can still be problematic on complex tasks involving multiple stages, e.g. stacking multiple objects, when used without demonstrations [46]. Some attempts have been made to form an explicit curriculum for such complex tasks; e.g. [11] manually define several semantically grounded sub-tasks, each having its own individual reward. Methods such as this one require significant human effort, hence they cannot be readily applied across different tasks.
C. Auxiliary Reward in RL
Lately, increasing efforts have been made to design general auxiliary reward functions aimed at facilitating learning in environments with only sparse rewards. Many of these strategies involve a notion of curiosity [47], which encourages agents to visit novel states that have not been seen in previous experience; for instance, [14] formulate the auxiliary reward using the error in predicting the RL agent's actions by an inverse dynamics model, [12] encourage the agent to visit the states that result in the largest information gain in system dynamics, [10] construct the auxiliary reward based on the error in predicting the output of a fixed randomly initialised neural network, and [15] introduce the notion of state reachability. Despite the benefits introduced by these approaches, visiting unseen states may be less beneficial in robot manipulation tasks, as exploring complex state spaces to find rewards is rather impractical [9].

III. BACKGROUND
A. Reinforcement Learning
Reinforcement learning tackles sequential decision-making problems with the aim of maximising a reward signal: an agent learns the optimal actions to be taken under different environmental situations or states. The stochastic process that arises as the agent interacts with the environment is modelled as a Markov Decision Process (MDP). The MDP is defined by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the environment's state space, $\mathcal{A}$ is the agent's action space, $\mathcal{T}$ is the environment dynamics or transition function, $\mathcal{R}$ is the reward function defining how the agent is rewarded for its actions, and $\gamma \in (0, 1]$ is a factor used to discount rewards over time. At any given time step, the agent observes the state of the environment, $s \in \mathcal{S}$, and chooses an action $a \in \mathcal{A}$ based on a policy $\mu(s \,|\, \theta^{\mu}): \mathcal{S} \rightarrow \mathcal{A}$, where $\theta^{\mu}$ contains the parameters of the policy. Once the agent receives a reward, $r = \mathcal{R}(s, a): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, for its action, the environment moves on to the next state $s' \in \mathcal{S}$ according to its transition function, i.e. $s' = \mathcal{T}(s, a): \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$. The aim is for the agent to maximise the expected return, $J(\theta^{\mu}) = \mathbb{E}_{s_t, a_t \sim \mu} \big[ \sum_t \gamma^{t-1} \mathcal{R}(s_t, a_t) \big]$, where the subscript $t$ indicates the time step.
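As a concrete illustration of the return being maximised, the minimal Python sketch below computes the discounted return of a finite reward sequence under the convention above; the function name, discount value and example rewards are illustrative only.

```python
def discounted_return(rewards, gamma=0.98):
    """Compute sum_t gamma^(t-1) * r_t for a finite episode (t starting at 1)."""
    total = 0.0
    for t, r in enumerate(rewards):  # enumerate index equals t - 1
        total += (gamma ** t) * r
    return total

# Example: a sparse-reward episode of length 5 where only the last step succeeds,
# following the 0 / -1 reward convention used later in the paper.
episode_rewards = [-1, -1, -1, -1, 0]
print(discounted_return(episode_rewards))  # returns closer to 0 are better
```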
B. Deep Deterministic Policy Gradient Algorithm

Policy gradient (PG) algorithms maximise the expected return $J(\theta^{\mu})$ by updating $\theta^{\mu}$ in the direction of $\nabla_{\theta^{\mu}} J(\theta^{\mu})$. In particular, deep deterministic policy gradient (DDPG) learns deterministic policies [48] modelled by deep artificial neural networks to leverage their state-of-the-art representation capacity. DDPG includes a policy (actor) network $a = \mu(s \,|\, \theta^{\mu})$ and an action-value (critic) network $Q^{\mu}(s, a \,|\, \theta^{Q^{\mu}})$, where $\theta^{Q^{\mu}}$ contains the critic's parameters. The actor maps states to deterministic actions, and the critic estimates the total expected return obtained by starting from the state $s$, taking action $a$ and then following $\mu$. The algorithm alternates between two stages. First, it collects experience using the current policy with additional noise sampled from a random process $\mathcal{N}$ for exploration, i.e. $a = \mu(s \,|\, \theta^{\mu}) + \mathcal{N}$. The experienced transitions, i.e. $\langle s, a, r, s' \rangle$, are stored in a replay buffer, $\mathcal{D}$, which is then used to learn the actor and critic networks. The critic is learnt by minimising the following loss to satisfy the Bellman equation, similarly to Q-learning [49]:
$$L(\theta^{Q^{\mu}}) = \mathbb{E}_{s, a, r, s' \sim \mathcal{D}} \Big[ \big( Q^{\mu}(s, a \,|\, \theta^{Q^{\mu}}) - y \big)^2 \Big] \quad (1)$$
where $y = r + \gamma Q^{\mu}(s', \mu(s') \,|\, \theta^{Q^{\mu}})$ is the target value. In practice, directly minimising Eq. 1 is known to be unstable, since the function $Q^{\mu}$ used in the target value estimation is also updated in each iteration. Similarly to DQN [1], DDPG stabilises the learning by obtaining smoother target values $y = r + \gamma Q^{\mu'}(s', \mu'(s') \,|\, \theta^{Q^{\mu'}})$, where $\mu'$ and $Q^{\mu'}$ are target networks. The weights of $\mu'$ and $Q^{\mu'}$ are the exponential moving averages of the weights of $\mu$ and $Q^{\mu}$ over iterations, respectively. The actor, on the other hand, is updated using the following policy gradient:
$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \nabla_a Q^{\mu}(s, a \,|\, \theta^{Q^{\mu}}) \big|_{a = \mu(s)} \, \nabla_{\theta^{\mu}} \mu(s \,|\, \theta^{\mu}) \Big]. \quad (2)$$
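The following PyTorch sketch illustrates one DDPG update step in the sense of Eq. 1 and Eq. 2. It assumes pre-built `actor`, `critic`, target copies and an optimiser for each; network architectures, the batch layout and the hyperparameter values are placeholders rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.98, tau=0.05):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic update: regress Q(s, a) towards the smoothed target value (Eq. 1).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the deterministic policy gradient (Eq. 2),
    # implemented by minimising -Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks track the online networks with an exponential moving average.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```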
C. Hindsight Experience Replay

Hindsight Experience Replay (HER) [9] has been introduced to learn policies from sparse rewards, especially for robot manipulation tasks. The idea is to view the states achieved in an episode as pseudo goals (i.e. achieved goals) to facilitate learning even when the desired goal has not been achieved during the episode. Specifically, let $\langle s \,\|\, g, a, r, s' \,\|\, g \rangle$ be the original transition obtained in an episode, where $\|$ denotes the concatenation operation and $g$ is the desired goal of the task. Normally, the agent would be rewarded only when $g$ is achieved, which may occur very rarely during learning, especially when the policy is far from being optimal for the task at hand. In HER, $g$ is replaced by an achieved goal denoted by $g'$, which is randomly sampled from the states reached in an episode. This generates a new transition $\langle s \,\|\, g', a, r, s' \,\|\, g' \rangle$ which is more likely to be rewarded. The generated transitions are saved into an experience replay buffer and can be used by off-policy RL algorithms like DQN [1] and DDPG [37]. HER allows initial learning to materialise using goals that are more easily achieved, and ultimately allows the agent to learn the desired goal as $g' \rightarrow g$.
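A minimal sketch of HER-style goal relabelling is given below. It assumes dictionary observations with `observation`/`achieved_goal`/`desired_goal` fields and a `compute_reward` function, in the style of the OpenAI Gym robotics interface; the "future" relabelling strategy and the transition layout are illustrative choices, not necessarily those used by the authors.

```python
import numpy as np

def her_relabel(episode, compute_reward, k=4, rng=np.random):
    """Generate hindsight transitions by replacing the desired goal with a goal
    achieved later in the same episode ("future" strategy)."""
    relabelled = []
    T = len(episode)
    for t, (obs, action, obs_next) in enumerate(episode):
        for _ in range(k):
            # Pick a future time step and use its achieved goal as the new goal.
            future = rng.randint(t, T)
            new_goal = episode[future][2]["achieved_goal"]
            reward = compute_reward(obs_next["achieved_goal"], new_goal)
            relabelled.append((
                np.concatenate([obs["observation"], new_goal]),
                action,
                reward,
                np.concatenate([obs_next["observation"], new_goal]),
            ))
    return relabelled
```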
IV. METHODOLOGY

A. Problem Formulation
Our problem setup is as follows. Given a manipulation task, we initially model the object to be manipulated as an independent agent that must learn how to reach its final position, starting from a given position, through a sequence of actions. Using the notation of Section III-A, this locomotion task $G_{obj}$ is assumed to be an MDP, $\langle \mathcal{S}_{obj}, \mathcal{A}_{obj}, \mathcal{T}_{obj}, \mathcal{R}_{obj}, \gamma_{obj} \rangle$. The agent maximises the expected return of the sparse feedback given by the environment, which rewards the object only when it reaches its final position. When interacting with the environment, the agent receives a state $s_{obj} \in \mathcal{S}_{obj}$ and decides what action to take, $a_{obj} \in \mathcal{A}_{obj}$, according to its policy $\mu_{obj}$. This policy is generally easy to learn in a simulator.

Once this object-level policy is trained, we turn to our main objective, which is to learn robot manipulation. This task $G$ is modelled as an independent MDP, $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$. The robot maximises the expected return $J(\theta^{\mu})$ of sparse rewards given by the environment only upon successful completion of the manipulation task, i.e. when the object has reached its goal state. When learning the manipulation policy, $\mu$, we wish to leverage the pre-trained $\mu_{obj}$; see Section IV-B.

We note here that the state $s$ in the manipulation MDP includes the object, whereas $s_{obj}$ is a subset of $s$ that only contains object-related features; in practice, any dimension unrelated to the object is removed from $s$ using a known mapping $\psi$, i.e. $s_{obj} = \psi(s)$ (see Section V for the list of features included in $s_{obj}$ vs. $s$).
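As an illustration of the mapping $\psi$, the sketch below extracts object-related features from a full manipulation state represented as a flat vector, assuming the object-related dimensions are known by index; the index range is a hypothetical example, not the layout of any specific environment in the paper.

```python
import numpy as np

# Hypothetical layout: indices of the object-related features inside the full
# manipulation state s (e.g. object position, rotation, velocities, target).
OBJECT_FEATURE_IDX = np.arange(10, 35)

def psi(s):
    """Map the full manipulation state s to the object state s_obj."""
    return np.asarray(s)[OBJECT_FEATURE_IDX]
```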
B. Simulated Locomotion Demonstrations (SLD)

When the robot follows a policy $\mu$, the object is being moved from one state to another, hence the robot's action can be thought of as inducing an object locomotion action. We use $a'_{obj}$ to denote the object locomotion caused by $\mu$, which therefore depends on the robot's action, i.e. $a'_{obj} = f(a)$. Since we have initially learnt the object locomotion policy, $\mu_{obj}(s_{obj}) = \mu_{obj}(\psi(s))$, we also have access to $a_{obj}$, which we use as demonstrations of the desired object actions when learning $G$. Two mild assumptions are made:

Assumption 1. $a'_{obj} \in \mathcal{A}_{obj}$.

Assumption 2. $P(s_{obj} \,|\, \mu) \stackrel{d}{=} P(s_{obj} \,|\, \mu_{obj})$ for an optimal $\mu$ and $\mu_{obj}$, where $P(s_{obj} \,|\, \cdot)$ is the distribution of $s_{obj}$ under the given policy.

Assumption 1 ensures that $a'_{obj} = f(\mu(s \,|\, \theta^{\mu}))$ is in the same action space as $a_{obj} = \mu_{obj}(s_{obj})$, so that $a_{obj}$ can be used as a demonstration of what $a'_{obj}$ is expected to be. Assumption 2 allows us to utilise $\mu_{obj}$ on $G$. Specifically, $\mu_{obj}$ is a function approximator that has been learnt on $G_{obj}$ with $s_{obj} \sim P(s_{obj} \,|\, \mu_{obj})$. To utilise it on $G$, where $s_{obj} \sim P(s_{obj} \,|\, \mu)$, we need Assumption 2 to eliminate the mismatch in state distributions and guarantee the optimality of $\mu_{obj}$ in $G$. The approach we describe in the following paragraphs rewards robot actions $a$ that induce $f(a) = a'_{obj} \approx a_{obj}$. As we update $\theta^{\mu}$ to maximise these rewards, the resulting robot policy automatically satisfies these two assumptions. (In the following paragraphs, we introduce two more approximators, $Q^{\mu_{obj}}$ and $I$, that have also been learnt on $G_{obj}$; similarly, this assumption enables their utilisation on $G$.)

Considering that $a_{obj}$ is meant to be a demonstration of $a'_{obj}$, the learning objective could be written in terms of the Euclidean distance between them, as this would encourage the object locomotion generated by the robot to be sufficiently close to the desired one, i.e.
$$\arg\min_{\theta^{\mu}} \mathbb{E}_{s \sim P(s | \mu)} \big\| a'_{obj} - a_{obj} \big\| = \arg\min_{\theta^{\mu}} \mathbb{E}_{s \sim P(s | \mu)} \big\| f(\mu(s \,|\, \theta^{\mu})) - \mu_{obj}(\psi(s)) \big\| \quad (3)$$
where $P(s \,|\, \mu)$ is the state distribution under policy $\mu$. Practically, however, we do not have a closed-form expression for the required mapping, $f$. This is because the robot could have a complex architecture, and the objects could have arbitrary shapes, resulting in very complex dynamics when the robot interacts with the objects.

Rather than assuming knowledge of $f$ to calculate $a'_{obj}$, we estimate an inverse dynamics model that predicts $a'_{obj}$ given the states occurring before and after an action. Specifically, a robot state $s$ corresponds to an object state $s_{obj}$. Upon taking an action $a = \mu(s \,|\, \theta^{\mu})$, a new robot state $s'$ is observed, which corresponds to a new object state $s'_{obj}$. We learn a function approximator through an artificial neural network, $I$, to estimate $a'_{obj}$ from $s_{obj}$ and $s'_{obj}$, i.e. $a'_{obj} \approx I(s_{obj}, s'_{obj} \,|\, \phi)$, where $\phi$ contains the parameters of $I$. Since the objects usually have much simpler dynamics compared to the robot, $I$ is typically much easier to learn than $f$. We learn $I$ together with the locomotion policy (see Section IV-C and Alg. 1). Substituting $I$ into Eq. 3 leads to the following optimisation problem:
$$\arg\min_{\theta^{\mu}} \mathbb{E}_{s \sim P(s | \mu)} \big\| I(s_{obj}, s'_{obj} \,|\, \phi) - \mu_{obj}(s_{obj}) \big\| \quad (4)$$
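The inverse dynamics model $I$ can be realised as a small regression network trained with the mean-squared objective implied by Eq. 7 in Section IV-C. Below is a minimal PyTorch sketch under that reading; the layer sizes and the loss helper name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predict the object action a_obj from consecutive object states (s_obj, s_obj')."""
    def __init__(self, obj_state_dim, obj_action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obj_state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obj_action_dim),
        )

    def forward(self, s_obj, s_obj_next):
        return self.net(torch.cat([s_obj, s_obj_next], dim=-1))

def inverse_dynamics_loss(model, s_obj, a_obj, s_obj_next):
    # Regress I(s_obj, s_obj') onto the executed object action (cf. Eq. 7).
    return ((model(s_obj, s_obj_next) - a_obj) ** 2).mean()
```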
Algorithm 1: Learning locomotion policy and inverse dynamics

Given: Locomotion MDP $G_{obj} = \langle \mathcal{S}_{obj}, \mathcal{A}_{obj}, \mathcal{T}_{obj}, \mathcal{R}_{obj}, \gamma_{obj} \rangle$; neural networks $\mu_{obj}(\cdot \,|\, \theta^{\mu_{obj}})$, $Q^{\mu_{obj}}(\cdot \,|\, \theta^{Q^{\mu_{obj}}})$ and $I(\cdot \,|\, \phi)$; a random process $\mathcal{N}_{obj}$ for exploration
Initialise: Parameters $\theta^{\mu_{obj}}$, $\theta^{Q^{\mu_{obj}}}$ and $\phi$; experience replay buffer $\mathcal{D}_{obj}$
for $i_{episode} = 1$ to $N_{episode}$ do
    for $i_{rollout} = 1$ to $N_{rollout}$ do
        Sample an initial state $s$
        for $t = 0$ to $T - 1$ do
            Sample an object action: $a_{obj} = \mu_{obj}(s_{obj}) + \mathcal{N}_{obj}$
            Execute the action: $s'_{obj} = \mathcal{T}_{obj}(s_{obj}, a_{obj})$, $r_{obj} = \mathcal{R}_{obj}(s_{obj}, a_{obj})$
            Store $(s_{obj}, a_{obj}, r_{obj}, s'_{obj})$ in $\mathcal{D}_{obj}$
            $s_{obj} \leftarrow s'_{obj}$
        Generate HER samples and store in $\mathcal{D}_{obj}$
    for $i_{update} = 1$ to $N_{update}$ do
        Get a random mini-batch of samples from $\mathcal{D}_{obj}$
        Update $Q^{\mu_{obj}}$ minimising the loss in Eq. (1)
        Update $\mu_{obj}$ using the gradient in Eq. (2)
        Update $I$ minimising the loss in Eq. (7)
Return: $\mu_{obj}$, $Q^{\mu_{obj}}$ and $I$

Note that $s'_{obj}$ is a function of $\mu$: $s'_{obj} = \psi(s')$, where $s' = \mathcal{T}(s, a) = \mathcal{T}(s, \mu(s \,|\, \theta^{\mu}))$. Minimising Eq. 4 through gradient-based methods is not possible, as it requires differentiation through $\mathcal{T}$, which is unknown. Nevertheless, since both $I$ and $\mu_{obj}$ are known, one option to minimise Eq. 4 would be to learn a manipulation policy $\mu$ through standard model-free RL using the following reward:
$$q = - \big\| I(s_{obj}, s'_{obj}) - \mu_{obj}(s_{obj}) \big\| \quad (5)$$
Practically, however, the above reward is sensitive to the scale of object actions: as the object moves close to the target, the scale of $\mu_{obj}(s_{obj})$ decreases significantly, which makes the reward too weak for learning. In order to deal with this issue, we introduce $Q^{\mu_{obj}}(s_{obj}, a_{obj})$, the expected return when the action $a_{obj}$ is taken at the state $s_{obj}$ (i.e. the RL action-value function), and use it to build a reward analogous to Eq. 5. Note that $Q^{\mu_{obj}}$ is well-bounded [50] and its scale increases as the object moves nearer to the target in the sparse reward setting. $Q^{\mu_{obj}}$ can be obtained when learning the object policy (see Section IV-C). The new reward is written as follows:
$$q = Q^{\mu_{obj}}\big(s_{obj}, I(s_{obj}, s'_{obj})\big) - Q^{\mu_{obj}}\big(s_{obj}, \mu_{obj}(s_{obj})\big) \quad (6)$$
We refer to Eq. 6 as the SLD reward. Since $\mu_{obj}$ is learnt through standard RL, $a_{obj} = \mu_{obj}(s_{obj})$ can be viewed as the desired action maximising the expected return when the object is at state $s_{obj}$. Accordingly, Eq. 6 can be viewed as the advantage of the object action induced by the manipulation policy, $a'_{obj} \approx I(s_{obj}, s'_{obj})$, with respect to the desired object action $a_{obj} = \mu_{obj}(s_{obj})$. Maximising this term encourages the robot manipulation to produce object actions similar to the desired ones.
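A minimal sketch of the SLD reward in Eq. 6 is shown below, assuming the learnt locomotion critic `q_obj`, locomotion policy `mu_obj` and inverse dynamics model `inv_dyn` are PyTorch modules with the interfaces used earlier; the clipping range follows the [-1, 0] convention mentioned in Section VI-A.

```python
import torch

@torch.no_grad()
def sld_reward(q_obj, mu_obj, inv_dyn, s_obj, s_obj_next, clip=(-1.0, 0.0)):
    """Advantage of the object action induced by the robot (via the inverse
    dynamics model) over the desired object action suggested by mu_obj (Eq. 6)."""
    induced_action = inv_dyn(s_obj, s_obj_next)   # a'_obj ~ I(s_obj, s'_obj)
    desired_action = mu_obj(s_obj)                # a_obj = mu_obj(s_obj)
    q = q_obj(s_obj, induced_action) - q_obj(s_obj, desired_action)
    return torch.clamp(q, min=clip[0], max=clip[1])
```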
C. Learning Algorithm

In this subsection, we detail the algorithm used to learn the locomotion policy and the algorithm used to learn the manipulation policy with the SLD reward.

Locomotion policy. Alg. 1 describes the learning of the locomotion policy. In order to learn using only sparse rewards, we adopt HER to generate transition samples as in Section III-C. We use DDPG to learn the policy $\mu_{obj}$ and the action-value function $Q^{\mu_{obj}}$, i.e. using the gradient in Eq. 2 to update $\mu_{obj}$ and minimising Eq. 1 to update $Q^{\mu_{obj}}$. Concurrently, we learn the inverse dynamics model $I$ using the trajectories generated during the policy learning process by minimising the following objective function:
$$\arg\min_{\phi} \mathbb{E}_{s_{obj}, a_{obj}, s'_{obj} \sim \mathcal{D}_{obj}} \big\| I(s_{obj}, s'_{obj} \,|\, \phi) - a_{obj} \big\| \quad (7)$$
where $\mathcal{D}_{obj}$ is an experience replay buffer as in Alg. 1. Recall that $s'_{obj} = \mathcal{T}_{obj}(s_{obj}, a_{obj}) = \mathcal{T}_{obj}(s_{obj}, \mu_{obj}(s_{obj}))$.

Manipulation policy. Alg. 2 describes the learning of the manipulation policy $\mu$. In addition to the action-value function for the robot, $Q^{\mu}_r(s, a)$, we learn another action-value function, $Q^{\mu}_q(s, a)$, for the SLD reward in Eq. 6. The policy parameters $\theta^{\mu}$ are updated with the following gradient:
$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \nabla_a \big( Q^{\mu}_r(s, a) + Q^{\mu}_q(s, a) \big) \big|_{a = \mu(s)} \, \nabla_{\theta^{\mu}} \mu(s \,|\, \theta^{\mu}) \Big] \quad (8)$$
Some tasks may include $N > 1$ objects, e.g. stacking. The proposed method is able to handle these tasks by learning individual policies and SLD reward action-value functions $Q^{\mu}_{q_i}(s, a)$ for each object, $i = 1, \ldots, N$. The gradient used to update $\theta^{\mu}$ is then given by
$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \nabla_a \Big( Q^{\mu}_r(s, a) + \sum_{i=1}^{N} Q^{\mu}_{q_i}(s, a) \Big) \Big|_{a = \mu(s)} \, \nabla_{\theta^{\mu}} \mu(s \,|\, \theta^{\mu}) \Big] \quad (9)$$
Fig. 2(b) shows the block diagram of the manipulation policy learning procedure.
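The actor update in Eq. 8 (and its multi-object extension in Eq. 9) can be implemented by summing the critic outputs before back-propagating into the policy. A hedged PyTorch sketch follows, assuming an `actor`, an environment-reward critic `critic_r` and a list of SLD-reward critics `critics_q` with the same (state, action) interface used earlier.

```python
import torch

def sld_actor_update(actor, critic_r, critics_q, actor_opt, states):
    """One actor step maximising Q_r(s, mu(s)) + sum_i Q_q_i(s, mu(s)) (Eq. 8 / Eq. 9)."""
    actions = actor(states)
    total_q = critic_r(states, actions)
    for critic_q in critics_q:           # one SLD critic per manipulated object
        total_q = total_q + critic_q(states, actions)
    actor_loss = -total_q.mean()         # gradient ascent on the summed values
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()
```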
V. ENVIRONMENTS

We have evaluated the SLD method on 13 simulated MuJoCo [34] environments using two different robot configurations: a 7-DoF Fetch robotic arm with a two-finger parallel gripper and a 24-DoF Shadow's Dexterous Hand. The tasks chosen for the evaluation include single rigid object manipulation, multiple rigid object stacking and non-rigid object manipulation. Overall, we have used 9 MuJoCo environments (3 with the Fetch robot arm and 6 with the Shadow's hand) for single rigid object tasks. Furthermore, we have included additional environments for stacking multiple objects and for non-rigid object manipulation using the Fetch robot arm. In all environments the rewards are sparse.
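All environments follow the sparse reward convention described below: 0 when the achieved goal is within a task-specific distance threshold of the desired goal and -1 otherwise. The following sketch mirrors that convention; the 5 cm threshold is the value quoted for the Fetch tasks, and the function name is illustrative.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Return 0 if the achieved goal is within `threshold` (metres) of the
    desired goal, and -1 otherwise."""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if distance < threshold else -1.0
```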
Fetch Arm Single Object Environments. These are the same Push, Slide and PickAndPlace tasks introduced in [51]. In each episode, a desired 3D position (i.e. the target) of the object is randomly generated. The reward is zero if the object is within a 5 cm range of the target; otherwise the reward is assigned a value of -1. The robot actions are 4-dimensional: 3D for the desired arm movement in Cartesian coordinates and 1D to control the opening of the gripper. In pushing and sliding, the gripper is locked to prevent grasping. The observations include the positions and linear velocities of the robot arm and the gripper, the object's position, rotation and angular velocity, the object's relative position and linear velocity with respect to the gripper, and the target coordinates. An episode terminates after a fixed number of time-steps.

Fig. 2. Block diagram for the proposed SLD algorithm: (a) learning the locomotion policy $\mu_{obj}$ and the inverse dynamics model $I$; (b) learning the manipulation policy $\mu$, including the SLD reward module. The solid lines represent the forward pass of the model; the dashed lines represent the rewards/losses used as feedback signals for model learning.

Shadow's Hand Single Object Environments. These include the tasks first introduced in [51], i.e. Egg, Block and Pen manipulation. In these tasks, the object (a block, an egg-shaped object, or a pen) is placed on the palm of the robot hand.
Algorithm 2: Learning manipulation policy

Given: Manipulation MDP $G = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$; learnt object-related components $\mu_{obj}$, $Q^{\mu_{obj}}$ and $I(\cdot \,|\, \phi)$; neural networks $\mu(\cdot \,|\, \theta^{\mu})$, $Q^{\mu}_r(\cdot \,|\, \theta^{Q^{\mu}_r})$ and $Q^{\mu}_q(\cdot \,|\, \theta^{Q^{\mu}_q})$; a random process $\mathcal{N}$ for exploration; a function $\psi$ extracting object-related states from $s$
Initialise: Parameters $\theta^{\mu}$, $\theta^{Q^{\mu}_r}$ and $\theta^{Q^{\mu}_q}$; experience replay buffer $\mathcal{D}$
for $i_{episode} = 1$ to $N_{episode}$ do
    for $i_{rollout} = 1$ to $N_{rollout}$ do
        Sample an initial state $s$
        for $t = 0$ to $T - 1$ do
            Sample a robot action: $a = \mu(s) + \mathcal{N}$
            Execute the action: $s' = \mathcal{T}(s, a)$, $r = \mathcal{R}(s, a)$
            Obtain the SLD reward: $s_{obj} = \psi(s)$, $s'_{obj} = \psi(s')$, $q = Q^{\mu_{obj}}(s_{obj}, I_{\phi}(s_{obj}, s'_{obj})) - Q^{\mu_{obj}}(s_{obj}, \mu_{obj}(s_{obj}))$
            Store $(s, a, r, q, s')$ in $\mathcal{D}$
            $s \leftarrow s'$
        Generate HER samples and store in $\mathcal{D}$
    for $i_{update} = 1$ to $N_{update}$ do
        Get a random mini-batch of samples from $\mathcal{D}$
        Update $Q^{\mu}_r$ minimising the loss in Eq. (1) for $r$
        Update $Q^{\mu}_q$ minimising the loss in Eq. (1) for $q$
        Update $\mu$ using the gradient in Eq. (8)
Return: $\mu$

The robot hand is required to manipulate the object to reach a target pose. The target pose is 7-dimensional, describing the 3D position together with the 4D quaternion orientation, and is randomly generated in each episode. The reward is 0 if the object is within some task-specific range of the target; otherwise the reward is -1. As in [51], each task has two variants: Full and Rotate. In the Full variant, the object's whole 7D pose is required to meet the given target pose. In the Rotate variants, the 3D object position is ignored and only the 4D object rotation is expected to satisfy the desired target. Robot actions are 20-dimensional, controlling the absolute positions of all non-coupled joints of the hand. The observations include the positions and velocities of all 24 joints of the robot hand, the object's position and rotation, the object's linear and angular velocities, and the target pose. An episode terminates after a fixed number of time-steps.
Fetch Arm Multiple Object Stacking Environments. The task of stacking multiple objects is built upon the PickAndPlace task, taken from the Fetch arm single object environments. We consider stacking both two and three objects. For the N-object stacking task, the target has 3N dimensions describing the desired 3D positions of all N objects. Following [7], we start these tasks with the first object placed at its desired target. The robot needs to perform N - 1 pick-and-place actions without displacing the first object. The reward is zero if all objects are within a fixed threshold distance of their designated targets; otherwise the reward is assigned a value of -1. The robot actions and observations are similar to those in the PickAndPlace task, except that the observations include the position, rotation, angular velocity, and the relative position and linear velocity with respect to the gripper, for each object. The episode length is 50 time-steps for 2 objects and 100 for 3 objects.
Fetch Arm Non-rigid Object Environments. We build non-rigid object manipulation tasks based on the PickAndPlace task from the Fetch arm single object environments. Instead of using the original rigid block, we have created a non-rigid object by hinging several blocks side-by-side along their edges, as shown in Fig. 3. A hinge joint is placed between two neighbouring blocks, allowing one rotational degree of freedom (DoF) along their coincident edges up to a maximum angle. We introduce two different variants: 3-tuple and 5-tuple. For the N-tuple task, N cubical blocks are connected with N - 1 hinge joints, creating N - 1 internal DoF. The target pose has 3N dimensions describing the desired 3D positions of all N blocks, and is selected uniformly in each episode from a set of predefined target poses (see Fig. 3). The robot is required to manipulate the object to match the target pose. The reward is zero when all the N blocks are within a fixed threshold distance of their corresponding targets; otherwise the reward is given a value of -1. Robot actions and observations are similar to those in the PickAndPlace task, except that the observations include the position, rotation, angular velocity, and the relative position and linear velocity with respect to the gripper, for each block. The episode length is the same for both the 3-tuple and 5-tuple variants.

Fig. 3. An illustration of the non-rigid objects used in the experiments. Top row: a hinge joint (shown as grey circles) between two neighbouring blocks allows one rotational DoF along their coincident edges. Bottom row: each variant has some predefined target poses (2 options for 3-tuple and 4 for 5-tuple).

Object locomotion MDP. The proposed method requires us to define an object locomotion MDP corresponding to each robot manipulation task. For the environments with the Fetch arm, the object's observations include the object's position, rotation and angular velocity, the object's relative position and linear velocity with respect to the target, and the target location. For the environments with the Shadow's hand, the object observations include the object's position and rotation, the object's linear and angular velocities, and the target pose. The rewards are the same as those in each robot manipulation task. For rigid objects, we define the object action as the desired relative change in the 7D object pose (3D position and 4D quaternion orientation) between two consecutive time-steps. This leads to a 7D action space. For non-rigid objects, the action includes the desired changes in the 7D poses of the blocks at the two ends. This leads to a 14D action space.

It is worth noting that, in the
Full variants of the Shadow's hand environments, we consider object translation and rotation as two individual locomotion tasks, and we learn separate locomotion policies and Q-functions for each task. We find that this strategy encourages the manipulation policy to perform translation and rotation simultaneously. Although object translation and rotation could be executed within a single task, we have empirically found that the resulting manipulation policies tend to prioritise one action over the other (e.g. they tend to rotate the object first, then translate it) and generally achieve a lower performance.
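For the rigid-object locomotion MDP described above, the 7D object action is a desired relative change in pose. The sketch below shows one way such a delta-pose action could be applied to an object state (position offset plus quaternion composition); the (w, x, y, z) quaternion convention and the normalisation step are assumptions, and the actual environments apply the action through the MuJoCo mocap mechanism mentioned in Section VI-A.

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def apply_object_action(position, orientation, action):
    """Apply a 7D delta-pose action: 3D position offset + 4D quaternion rotation."""
    delta_pos, delta_quat = action[:3], action[3:7]
    new_position = position + delta_pos
    new_orientation = quat_mul(delta_quat, orientation)
    new_orientation /= np.linalg.norm(new_orientation)  # keep it a unit quaternion
    return new_position, new_orientation
```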
VI. EXPERIMENTS
A. Implementation and Training Process
Three-layer neural networks with ReLU activations were used to approximate all policies, action-value functions and inverse dynamics models. The Adam optimiser [52] was employed to train all the neural networks. During the training of locomotion policies, the robot was considered a non-learning component in the scene and its actions were not restricted to prevent any potential collision with the objects. Different choices could be made for the robot's actions during this stage: for example, we could let the robot move randomly or perform an arbitrary fixed action (e.g. a robot arm moving upwards with constant velocity until it reaches the maximum height and then staying there). In preliminary experiments, we assessed whether this choice bears any effect on final performance, and concluded that no particular setting had a clear advantage. For learning locomotion and manipulation policies, most of the hyperparameters suggested in the original HER implementation [51] were retained, with only a couple of exceptions for the locomotion policies: to facilitate exploration, the probability of drawing a random action from a uniform distribution, and the standard deviation of the additive zero-mean Gaussian noise, were adjusted relative to the values used in [51]. For locomotion policies, in all Shadow's hand environments and the 5-tuple non-rigid object environment, we train the objects over 50 epochs; in the remaining environments, we stop the training after 20 epochs. When training the main manipulation policies, the number of epochs varies across tasks. For both locomotion and manipulation policies, each epoch includes 50 cycles, and each cycle includes 38 rollouts generated in parallel through 38 MPI workers using 38 CPU cores. This leads to 38 × 50 = 1900 full episodes per epoch. For each epoch, the parameters are updated a fixed number of times using mini-batches on a GPU core. We normalise the observations to have zero mean and unit standard deviation as input to the neural networks, updating the means and standard deviations with a running estimate over the data in each rollout. We clip the SLD reward to the same range as the environmental sparse rewards, i.e. [-1, 0].

Our algorithm has been implemented in PyTorch (https://pytorch.org/). All the environments are based on OpenAI Gym, and are provided with support for locomotion policy learning using the mocap entity in the MuJoCo library. The corresponding source code (https://github.com/WMGDataScience/sldr.git), the environments (https://github.com/WMGDataScience/gym_wmgds.git), and illustrative videos for selected tasks (https://youtu.be/jubZ0dPVl2M) have been made publicly available.
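The observation normalisation and SLD-reward clipping described above can be implemented with a simple running estimator, as in the sketch below; the incremental statistics and the epsilon values are standard choices and not necessarily the exact estimator used in the released code.

```python
import numpy as np

class RunningNormalizer:
    """Track running mean/std of observations and normalise new inputs."""
    def __init__(self, dim, eps=1e-2):
        self.count = eps
        self.total = np.zeros(dim)
        self.total_sq = np.full(dim, eps)

    def update(self, batch):
        batch = np.atleast_2d(batch)
        self.count += batch.shape[0]
        self.total += batch.sum(axis=0)
        self.total_sq += (batch ** 2).sum(axis=0)

    def normalize(self, x):
        mean = self.total / self.count
        var = np.maximum(self.total_sq / self.count - mean ** 2, 1e-8)
        return (x - mean) / np.sqrt(var)

def clip_sld_reward(q, low=-1.0, high=0.0):
    # Keep the auxiliary reward in the same range as the sparse environment reward.
    return float(np.clip(q, low, high))
```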
B. Comparison and Performance Evaluation

We include the following methods for comparison:
• DDPG-Sparse: Refers to DDPG [37] using sparse rewards.
• HER-Sparse: Refers to DDPG with HER [9] using sparse rewards.
• HER-Dense: Refers to DDPG with HER, using dense distance-based rewards.
• DDPG-Sparse+SLDR: Refers to DDPG using sparse environmental rewards and the SLD reward proposed in this paper.
• HER-Sparse+RNDR: Refers to DDPG with HER, using sparse environmental rewards and random network distillation-based auxiliary rewards (RNDR) [10].
• HER-Sparse+SLDR: Refers to DDPG with HER, using sparse environmental rewards and the SLD reward.

We use DDPG-Sparse, HER-Sparse and HER-Dense as baselines. HER-Sparse+RNDR is a representative method constructing auxiliary rewards to facilitate policy learning. DDPG-Sparse+SLDR and HER-Sparse+SLDR represent the proposed approach using the SLD reward with different methods for policy learning.

Following [51], we evaluate the performance after each training epoch by performing 10 deterministic test rollouts for each one of the 38 MPI workers. Then we compute the test success rate by averaging across the 380 test rollouts. For all comparison methods, we evaluate the performance with 5 different random seeds and report the median test success rate with the interquartile range. In all environments, we also keep the models with the highest test success rate for the different methods and compare their performance.
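The evaluation protocol above (10 test rollouts per MPI worker, median and interquartile range across 5 seeds) can be summarised by the small helper below; the array shape and names are illustrative.

```python
import numpy as np

def summarize_success(success, axis_seed=0):
    """success: array of shape (n_seeds, n_epochs) holding per-epoch test success
    rates (each already averaged over the 380 test rollouts of that seed)."""
    median = np.median(success, axis=axis_seed)
    q25, q75 = np.percentile(success, [25, 75], axis=axis_seed)
    return median, q25, q75

# Example with 5 seeds and 3 epochs of dummy success rates.
rates = np.array([
    [0.10, 0.50, 0.90],
    [0.00, 0.40, 0.80],
    [0.20, 0.60, 1.00],
    [0.10, 0.50, 0.95],
    [0.05, 0.45, 0.85],
])
med, lo, hi = summarize_success(rates)
```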
C. Single Rigid Object Environments
The learning curves for the Fetch environments and for the Rotate and Full variants of the Shadow's hand environments are reported in Fig. 4(a), Fig. 4(b) and Fig. 4(c), respectively. We find that HER-Sparse+SLDR features a faster learning rate and the best performance on all the tasks. This evidence demonstrates that SLD, coupled with DDPG and HER, can facilitate policy learning with sparse rewards. The benefits introduced by HER-Sparse+SLDR are particularly evident in the hand manipulation tasks (Fig. 4(b) and Fig. 4(c)) compared to the Fetch robot tasks (Fig. 4(a)), as the former are notoriously more complex to solve. Additionally, we find that HER-Sparse+SLDR outperforms HER-Sparse+RNDR in most tasks. A possible reason for this result is that most methods using auxiliary rewards are based on the notion of curiosity, whereby reaching unseen states is a preferable strategy, which is less suitable for manipulation tasks [9]. In contrast, the proposed method exploits a notion of desired object locomotion to guide the main policy during training. We also observe that DDPG-Sparse+SLDR fails in most tasks. A possible reason for this is that, despite its effectiveness, the proposed approach still requires a suitable RL algorithm to learn from SLD rewards together with sparse environmental rewards, and DDPG on its own is less effective for this purpose. Finally, we find that HER-Dense performs worse than HER-Sparse. This result supports previous observations that sparse rewards may be more beneficial for complex robot manipulation tasks compared to dense rewards [9], [51].
D. Fetch Arm Multiple Object Environments
For environments with N objects, we reuse the locomotion policies trained on the PickAndPlace task with a single object, and obtain an individual SLD reward for each of the N objects. We train N + 1 action-value functions in total, i.e. one for each SLD reward and one for the environmental sparse rewards. The manipulation policy is trained using the gradient in Eq. 9. Inspired by [51], we randomly select between two initialisation settings for the training: (1) the targets are distributed on the table (i.e. an auxiliary task) and (2) the targets are stacked on top of each other (i.e. the original stacking task). Each initialisation setting is selected at random in every episode. We have observed that this initialisation strategy helps HER-based methods complete the stacking tasks. From Fig. 4(d), we find that HER-Sparse+SLDR achieves better performance compared to HER-Sparse, HER-Sparse+RNDR and HER-Dense in the 2-object stacking task (Stack2), while the other methods fail. On the more complex 3-object stacking task (Stack3), HER-Sparse+SLDR is the only algorithm to succeed. HER-Sparse+RNDR occasionally solves the Stack3 task with fixed random seeds, but the performance is unstable across different random seeds and multiple runs.

E. Fetch Arm Non-Rigid Object Environments
The learning curves for the 3-tuple and 5-tuple non-rigid object tasks are reported in Fig. 4(e). Similarly to the multiple object environments, HER-Sparse+SLDR achieves better performance on the 3-tuple task compared to HER-Sparse and HER-Sparse+RNDR, while the other methods fail to complete the task. On the more complex 5-tuple task, only HER-Sparse+SLDR is able to succeed. Among the 4 pre-defined targets depicted in Fig. 3, HER-Sparse+SLDR can achieve 3 targets on average, and can accomplish all 4 targets in one instance, out of 5 runs with different random seeds.
F. Comparison Across the Best Models
Fig. 4. Learning curves of the comparison algorithms on all environments: (a) Fetch arm single object (Push, PickAndPlace, Slide); (b) Shadow's hand single object, Rotate variants (EggRotate, BlockRotate, PenRotate); (c) Shadow's hand single object, Full variants (EggFull, BlockFull, PenFull); (d) Fetch arm multi-object stacking (Stack2, Stack3); (e) Fetch arm non-rigid object (3-tuple, 5-tuple). Each panel plots the median test success rate against the training epoch for HER-Sparse+SLDR, HER-Sparse, HER-Sparse+RNDR, HER-Dense, DDPG-Sparse+SLDR and DDPG-Sparse.

Fig. 5 summarises the performance of the models with the best test success rates for each one of the competing methods. We can see that the proposed HER-Sparse+SLDR achieves top performance compared to all other methods. Specifically, HER-Sparse+SLDR is the only algorithm that is able to steadily solve 3-object stacking (Stack3) and 5-tuple non-rigid object manipulation. Remarkably, these two tasks have the highest complexity among all the 13 tasks. The Stack3 task includes multiple stages that require the robot to pick and place multiple objects with different source and target locations in a fixed order; in the 5-tuple task the object has the most complex dynamics. For these complex tasks, the proposed SLD reward seems to be particularly beneficial. A possible reason is that, although the task is very complex, the objects are still able to learn good locomotion policies (see Fig. 6(a)). The SLDs from the learnt locomotion policies provide critical information on how the object should be manipulated to complete the task, and this information is not utilised by other methods like HER and HER+RNDR. Our approach outperforms the runner-up by a large margin in the
Full variants of the Shadow's hand manipulation tasks (EggFull, BlockFull and PenFull), which feature complex state/action spaces and system dynamics. Finally, the proposed method consistently achieves better or similar performance than the runner-up on the other, simpler tasks.

VII. CONCLUSION AND DISCUSSION
In this paper, we address the problem of mastering robot manipulation through deep reinforcement learning using only sparse rewards. The rationale for the proposed methodology is that robot manipulation tasks can be seen as inducing object locomotion. Based on this observation, we propose to first model the objects as independent entities that learn an optimal locomotion policy through interactions with a realistically simulated environment; these policies are then leveraged to improve the manipulation learning phase.

We believe that using SLDs introduces three significant advantages over the use of robot-level human demonstrations. First, SLDs are generated artificially through an RL policy, hence require no human effort. Second, producing human demonstrations for complex tasks may prove difficult and/or costly to achieve without significant investments in human resources. For instance, it may be particularly difficult for a human to generate good demonstrations for tasks such as manipulating non-rigid objects with a single hand or with a robotic gripper. On the other hand, we have demonstrated that the locomotion policies can be easily learnt, even for complex tasks, purely in a virtual environment; e.g., in our studies, these policies have achieved a 100% success rate on all tasks (e.g. see Fig. 6(a) and Fig. 6(b)). Third, since the locomotion policy is learnt through RL, SLD does not require task-specific domain knowledge, which is a necessary prerequisite for human demonstrations, and can be designed using only sparse rewards. Moreover, the SLD approach is orthogonal to existing methods that use expert demonstrations, and combining them would be an interesting direction for future work.

The performance of the proposed SLD framework has been thoroughly examined on 13 robot manipulation environments of increasing complexity. These studies demonstrate that faster learning and higher success rates can be achieved through SLDs compared to existing methods. In our experiments, SLDs have been able to solve complex tasks, such as stacking 3 objects and manipulating a 5-tuple non-rigid object, whereas competing methods have failed. Remarkably, we have been able to outperform runner-up methods by a significant margin on the complex Shadow's hand manipulation tasks. Although SLDs are obtained using a physics engine, this requirement does not restrict the applicability of the proposed approach to situations where the manipulation is learnt using a real robot, as long as the locomotion policy can be pre-learnt realistically.

Fig. 5. Comparison of the models with the best test success rate for all methods on all the environments (median test success rate for Push, PnP, Slide, EggRotate, BlockRotate, PenRotate, EggFull, BlockFull, PenFull, Stack2, Stack3, 3-tuple and 5-tuple, for HER-Sparse+SLDR, HER-Sparse, HER-Sparse+RNDR, HER-Dense, DDPG-Sparse+SLDR and DDPG-Sparse).

Several aspects will be investigated in follow-up work. We have noticed that when the interaction between the manipulating robot and the objects is very complex, the manipulation policy may be difficult to learn despite the fact that the locomotion policy is successfully learnt. For instance, in the case of the 5-tuple task with the Fetch arm, although the locomotion policy achieves a 100% success rate (as shown in Fig. 6(a)), the manipulation policy does not always complete the task (as shown in Fig. 4(e) and Fig. 5). In such cases, when the ideal object locomotion depends heavily on the robot, the benefit of the SLDs is reduced. Another limitation is given by our Assumption 2 (Section IV-B), which may not hold for some tasks. For example, in the pen manipulation tasks with the Shadow's hand, although the pen can rotate and translate itself to complete the locomotion tasks (as shown in Fig. 6(b)), it is difficult for the robot to reproduce the same locomotion without dropping the pen. This issue can degrade the performance of the manipulation policy despite having obtained an optimal locomotion policy (see Fig. 4(b), Fig. 4(c) and Fig. 5). A possible solution would be to train the manipulation policy and locomotion policy jointly, and check whether the robot can reproduce the object locomotion suggested by the locomotion policy; a notion of "reachability" of object locomotion could be used to regularise the locomotion policy and enforce $P(s_{obj} \,|\, \mu) \stackrel{d}{=} P(s_{obj} \,|\, \mu_{obj})$.

Finally, in this paper we have adopted DDPG as the main training algorithm due to its widely reported effectiveness in robot control tasks. However, other algorithms suitable for continuous action domains, such as trust region policy optimisation (TRPO) [53], proximal policy optimisation (PPO) [54] and soft actor-critic [55], could also be used; analogously, model-based methods [16], [56] could provide feasible alternatives to be explored in future work.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
Fig. 6. Learning curves of the locomotion policies (median test success rate against the training epoch): (a) Fetch arm, single object and non-rigid object (panels Fetch - Rigid: Push, PickAndPlace, Slide; and Fetch - Flexible); (b) Shadow's hand single object (panels Hand Translate and Hand Rotate: Egg, Block, Pen). Recall that we train individual locomotion policies for translation and rotation in the Shadow's hand environments (see Section V). We do not include the Fetch arm multi-object stacking environments, as those environments reuse the policies learnt in the Fetch arm single object environments (see Section VI-D).

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[3] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar, "Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost," arXiv preprint arXiv:1810.06045, 2018.
[4] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," arXiv preprint arXiv:1808.00177, 2018.
[5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[6] I. Popov, N. Heess, T. P. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. A. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," CoRR, vol. abs/1704.03073, 2017.
[7] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Overcoming exploration in reinforcement learning with demonstrations," in International Conference on Robotics and Automation, 2018, pp. 6292–6299.
[8] S. Gu, E. Holly, T. P. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in International Conference on Robotics and Automation, 2017, pp. 3389–3396.
[9] M. Andrychowicz, D. Crow, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5055–5065.
[10] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, "Exploration by random network distillation," arXiv preprint arXiv:1810.12894, 2018.
[11] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, "Learning by playing - solving sparse reward tasks from scratch," in International Conference on Machine Learning, 2018, pp. 4341–4350.
[12] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel, "VIME: variational information maximizing exploration," in Advances in Neural Information Processing Systems, 2016.
[13] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, "Large-scale study of curiosity-driven learning," arXiv preprint arXiv:1808.04355, 2018.
[14] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 16–17.
[15] N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, T. Lillicrap, and S. Gelly, "Episodic curiosity through reachability," arXiv preprint arXiv:1810.02274, 2018.
[16] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine, "SOLAR: deep structured representations for model-based reinforcement learning," in International Conference on Machine Learning, 2019.
[17] S. Ross, G. J. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
[18] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, and J. Zhao, "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[19] H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-end learning of driving models from large-scale video datasets," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. Agapiou, J. Z. Leibo, and A. Gruslys, "Deep q-learning from demonstrations," in AAAI Conference on Artificial Intelligence, 2018, pp. 3223–3230.
[21] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in International Conference on Machine Learning, 2000, pp. 663–670.
[22] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in International Conference on Machine Learning, 2004.
[23] C. Finn, S. Levine, and P. Abbeel, "Guided cost learning: Deep inverse optimal control via policy optimization," in International Conference on Machine Learning, 2016, pp. 49–58.
[24] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, 2016, pp. 4565–4573.
[25] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. A. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," CoRR, vol. abs/1707.08817, 2017.
[26] F. Sasaki, T. Yohira, and A. Kawaguchi, "Sample efficient imitation learning for continuous control," in International Conference on Learning Representations, 2019.
[27] S. Reddy, A. D. Dragan, and S. Levine, "SQIL: Imitation learning via regularized behavioral cloning," arXiv preprint arXiv:1905.11108, 2019.
[28] D. H. Grollman and A. Billard, "Donut as I do: Learning from failed demonstrations," in IEEE International Conference on Robotics and Automation, 2011.
[29] J. Zheng, S. Liu, and L. M. Ni, "Robust bayesian inverse reinforcement learning with sparse behavior noise," in AAAI Conference on Artificial Intelligence, 2014.
[30] K. Shiarlis, J. Messias, and S. Whiteson, "Inverse reinforcement learning from failure," in International Conference on Autonomous Agents & Multiagent Systems, 2016.
[31] Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell, "Reinforcement learning from imperfect demonstrations," in International Conference on Learning Representations, 2018.
[32] D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum, "Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations," in International Conference on Learning Representations, 2019.
[33] S. Choi, K. Lee, and S. Oh, "Robust learning from demonstrations with mixed qualities using leveraged gaussian processes," IEEE Transactions on Robotics, vol. 35, pp. 564–576, 2019.
[34] E. Todorov, T. Erez, and Y. Tassa, "Mujoco: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
[35] N. Koenig and A. Howard, "Design and use paradigms for gazebo, an open-source multi-robot simulator," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004.
[36] E. Coumans and Y. Bai, "Pybullet, a python module for physics simulation in robotics, games and machine learning," https://pybullet.org, 2017.
[37] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," CoRR, vol. abs/1509.02971, 2015.
[38] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "Openai gym," arXiv preprint arXiv:1606.01540, 2016.
[39] D. Pomerleau, "ALVINN: an autonomous land vehicle in a neural network," in Advances in Neural Information Processing Systems, 1988, pp. 305–313.
[40] Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, "One-shot imitation learning," in Advances in Neural Information Processing Systems, 2017, pp. 1087–1098.
[41] N. D. Ratliff, J. A. Bagnell, and S. S. Srinivasa, "Imitation learning for locomotion and manipulation," in International Conference on Humanoid Robots, 2007, pp. 392–397.
[42] Z. Wang, J. Merel, S. Reed, G. Wayne, N. de Freitas, and N. Heess, "Robust imitation of diverse behaviors," arXiv preprint arXiv:1707.02747, 2017.
[43] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal value function approximators," in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 1312–1320.
[44] S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus, "Intrinsic motivation and automatic curricula via asymmetric self-play," in International Conference on Learning Representations, 2018.
[45] C. Florensa, D. Held, X. Geng, and P. Abbeel, "Automatic goal generation for reinforcement learning agents," in International Conference on Machine Learning, 2018, pp. 1514–1523.
[46] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, "Visual reinforcement learning with imagined goals," in Advances in Neural Information Processing Systems, 2018, pp. 9209–9220.
[47] J. Schmidhuber, "Curious model-building control systems," in IEEE International Joint Conference on Neural Networks. IEEE, 1991, pp. 1458–1463.
[48] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A. Riedmiller, "Deterministic policy gradient algorithms," in International Conference on Machine Learning, 2014, pp. 387–395.
[49] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[50] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.
[51] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, "Multi-goal reinforcement learning: Challenging robotics environments and request for research," CoRR, vol. abs/1802.09464, 2018.
[52] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[53] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[54] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[55] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning, 2018.
[56] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," in Advances in Neural Information Processing Systems, 2018.