Improved Learning of Robot Manipulation Tasks via Tactile Intrinsic Motivation
Nikola Vulin, Sammy Christen, Stefan Stevšić, Otmar Hilliges
AIT Lab, Department of Computer Science, ETH Zurich, Switzerland
[email protected]
Abstract—In this paper we address the challenge of exploration in deep reinforcement learning for robotic manipulation tasks. In sparse goal settings, an agent does not receive any positive feedback until randomly achieving the goal, which becomes infeasible for longer control sequences. Inspired by touch-based exploration observed in children, we formulate an intrinsic reward based on the sum of forces between a robot's force sensors and manipulation objects that encourages physical interaction. Furthermore, we introduce contact-prioritized experience replay, a sampling scheme that prioritizes contact-rich episodes and transitions. We show that our solution accelerates the exploration and outperforms state-of-the-art methods on three fundamental robot manipulation benchmarks.
Index Terms—Reinforcement Learning, Deep Learning in Grasping and Manipulation, Tactile Feedback, Intrinsic Motivation
I. INTRODUCTION

Model-free deep reinforcement learning (DRL) algorithms have demonstrated great potential in solving sequential decision making problems, such as learning to play video games in Atari [1], defeating the world champion in the game of Go [2], or controlling robotic systems for locomotion [3], navigation [4], and manipulation [5]. For complex tasks and real-world scenarios, in particular in robotics, formulating a feasible reward function is inherently difficult and tedious. Hence, reward functions are most easily formulated as a sparse signal received upon reaching the final goal. This leads to inefficient exploration, because an agent receives crucial feedback rarely and in a delayed manner. Reward functions are typically based on reaching extrinsic states. Thus, the reward does not leverage internal robot states or other internal representations. To address inefficient exploration, an intrinsic reward inspired by intrinsic motivation observed in humans [6] can be added to the reward function.

Humans strongly rely on the sense of touch for exploration and use it as guidance when interacting with their surroundings. For example, humans can manipulate objects even without visual feedback, and removing tactile cues significantly reduces manipulation capabilities [7]. Furthermore, it has been shown that force feedback is used by infants to explore the physical properties of objects [8].

Fig. 1. Inspiration for our work comes from the intrinsic motivation of infants exploring the world through physical interactions. We model tactile sensing with force sensors (red) and leverage tactile information to accelerate learning.

Our insight is that force feedback can be leveraged in a general manner for learning a diverse set of manipulation tasks. Inspired by the touch-based exploration observed in small children, we propose an intrinsic reward based on tactile signals to enhance exploration in robotic manipulation environments. Furthermore, we introduce a sampling scheme that encourages robot-object interaction via prioritization of meaningful trajectories.

Intrinsic rewards have been introduced for discrete domains based on state novelty [9], uncertainty [10], [11], or information gain [12]. Using methods based on intrinsic motivation accelerates the training of an agent, but requires an additional model to extract a reward from state information. In DRL approaches, tactile robot-object interaction information is often ignored, although this information might be valuable during exploration. Some prior works extend the state space with tactile information to provide explicit interaction cues. This results in improvements in terms of sample efficiency and robustness for grasping [13], in-hand manipulation [14], and human-robot interaction [15].
However, these methods modify only the state space [13], [14], or focus on a set of specific tasks [13], [15].

In our method (see Figure 1), we add force sensors to the end-effector and extend the state space with force measurements, similarly to other work [13], [14], [15]. We then go beyond prior work and introduce an intrinsic reward to guide the exploration. Our intrinsic reward acts as an intermediate, simpler-to-achieve, directive base. Having an intermediate base can improve exploration, since the agent returns to this good position and does not lose track due to probabilistic exploration [16]. The reward motivates the agent to touch the object while allowing it to explore different interactions, i.e., when and how long to touch an object. We experimentally show that our reward leads to significantly better performance on benchmark manipulation tasks, especially when the task difficulty is increased.

Another way of addressing exploration is to use data augmentation techniques such as Hindsight Experience Replay (HER) [17], which replaces a task-specific goal with a state that was achieved in hindsight of a sampled transition. This induces intermediate rewards for failed trials. However, HER randomly samples states from the replay buffer, including trajectories where the manipulation object does not move, which may slow down learning. We therefore introduce a replay buffer sampling scheme that allows an agent to prioritize meaningful trajectories. More specifically, we first modify the sampling probability of entire episodes depending on whether contacts have happened. From a batch of selected episodes, we sample virtual goals, with an increased weight on transitions that occur after initial contact. We then go back in time to find training transitions, which narrows down the search space to meaningful transitions for manipulation tasks. We show empirically that this leads to faster convergence compared to HER.

The main contributions of this paper can be summarized as follows: i) an intrinsic force-based reward that accelerates the exploration and improves the performance on several benchmark robotic manipulation tasks; ii) a novel sampling scheme for off-policy algorithms in contact-based tasks to increase the sample efficiency; iii) a detailed evaluation of our approach and comparison with the baselines. We analyze the contributions of the method components in an ablation study and show that our method performs significantly better, especially with increased task difficulty.

II. RELATED WORK

A. Tactile Feedback
For humans, tactile sensing is an important sensory modality, besides visual perception, when manipulating objects. It is rich in information and free of external disturbances, compared to visual sensing, which is influenced by occlusions and poor lighting conditions. While the majority of DRL works do not consider tactile feedback for learning robotic manipulation tasks, a few have shown the benefits and potential of considering this additional sensory modality [13], [14], [15], [18], [19].

In [13], the authors show that tactile sensing can improve the robustness of grasping under noisy position sensing, thus compensating for inexact position estimation. [14] demonstrates that tactile sensing provides the agent with essential information and makes learning more sample efficient for in-hand manipulation tasks. Unlike our method, these approaches only add the force measurement to the state space.

More similar to our method, [15] and [18] extend their state space with tactile sensing and use it in the reward function. To learn human-robot interactions, [15] proposes a specific reward to ensure contact between the human and the robot. [18] uses the feedback of tactile sensors as a penalty to avoid high impacts and hence to learn gentle manipulation. However, in both methods the reward is used to trigger a task-specific behavior. In contrast, we formulate an intrinsic reward to encourage general, task-independent exploration.
B. Intrinsic Motivation
Humans are efficient learners, showing intrinsically motivated behaviors when exploring the world around them [6]. Inspired by human exploration strategies, intrinsic motivation in reinforcement learning (RL) has been used to improve exploration in sparse reward settings [20]. The research mostly relies on intrinsic rewards that an agent collects for visiting novel states. This type of reward can be implemented by computing pseudo-counts through density models [9] or by estimating state novelty with neural networks [11]. Furthermore, one can model curiosity to guide the agent towards areas where the uncertainty of the agent's prediction of the following states is high [10]. These methods need to estimate additional, computationally expensive quantities, which slows down training. In contrast, our intrinsic reward relies only on tactile feedback present in the state space.
C. Experience Replay
Experience replay [21] is applied in off-policy DRL algorithms to store experience transitions in a buffer and reuse them during training. Popularized in DQN [1], it has become a standard procedure to improve sample efficiency and stabilize training in most state-of-the-art off-policy algorithms [22]. An extension is Prioritized Experience Replay (PER) [23], where transitions are prioritized based on how valuable they are to the learning process via the temporal difference error. For the case of learning from sparse rewards in continuous control problems, Hindsight Experience Replay (HER) [17] proposed a data augmentation technique that achieves large performance improvements over standard approaches. The transitions stored in the buffer are used for training, with the difference that rewards are computed with respect to virtual goals. Even when the agent fails to complete the task, virtual goals provide a way to obtain intermediate rewards. HER was extended in [24] by prioritizing episodes based on the sum of kinetic and potential energy of the manipulation object. This can lead to prioritization of undesired trajectories, because the faster the object moves, the higher the priority, which may not be what is beneficial for the task at hand. In contrast, our prioritization scheme treats episodes with contact more equally and also modifies the sampling probability of training transitions within an episode.

III. PRELIMINARIES
A. Problem Definition
In this paper, we focus on using reinforcement learning to solve robotic manipulation tasks, where a robot needs to interact with an object to complete the task. For example, the robot may need to pick objects and place them in a container, build a structure, or use tools. A straightforward way to learn these tasks is to reward the agent once the task is completed. In contrast, shaping the reward might be very difficult because there are multiple intermediate steps that the robot needs to achieve before it can reach the final goal. When sparse rewards are used, no signal directs the agent towards the goal, and hence the agent reaches it only by chance via random exploration. Thus, algorithm convergence is usually very slow, or the algorithm does not converge at all if the task is too complex. We aim to improve exploration in robotic manipulation tasks and hence learn the required skills faster. In manipulation tasks, we can often resort to the goal-conditioned formulation of the reinforcement learning problem. This can help to mitigate the exploration problem by applying methods such as HER [17].
B. Goal-conditioned Deep Reinforcement Learning
To model our control problem, we use the standard formulation of a Markov Decision Process (MDP) and extend it with a set of goals G. An MDP is defined by a tuple M = {S, A, G, R, T, ρ_0, γ}, where S is the state set, A the action set, R_t = r(s_t, a_t, g_t) the reward function, T = p(s_{t+1} | s_t, a_t) the transition dynamics of the environment with s_t ∈ S and a_t ∈ A, ρ_0 = p(s_0) the initial state distribution, and γ a discount factor. Our aim is to maximize the expected sum of discounted future rewards with a policy π: S × G → A, which is a mapping from states and goals to actions. We adopt the actor-critic framework and use the Q-function to describe the expected future reward:

Q(s_t, g_t, a_t) = \mathbb{E}_{a_t \sim \pi, s_{t+1} \sim \mathcal{T}} \left[ \sum_{i=t}^{T} \gamma^{i-t} R_i \right]   (1)

We approximate both the value function (critic) and the policy (actor) with neural networks and use the DDPG algorithm [22] for training. Thus, the critic network is parameterized by θ^Q and minimizes the following loss:

L(\theta^Q) = \mathbb{E}_{\mathcal{M}} \left[ \left( Q(s_t, g_t, a_t; \theta^Q) - y_t \right)^2 \right],
y_t = r(s_t, g_t, a_t) + \gamma Q(s_{t+1}, g_{t+1}, a_{t+1}; \theta^Q)   (2)

The policy network uses parameters θ^π and is trained to maximize the Q-value:

L(\theta^\pi) = \mathbb{E}_{\pi} \left[ Q(s_t, g_t, a_t; \theta^Q) \mid s_t, g_t, a_t = \pi(s_t, g_t; \theta^\pi) \right]   (3)

The algorithm is off-policy and therefore uses an experience replay buffer for training. The data is collected in episode rollouts (we use the word trajectory interchangeably in this paper) and stored as transitions (s_t, g_t, a_t, r_t, s_{t+1}) in the buffer. These transitions are then sampled from the replay buffer to train the neural networks that approximate the Q-function and control policy π.
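For concreteness, the following is a minimal PyTorch sketch of one goal-conditioned DDPG update implementing Eqs. (2) and (3). Network sizes, the use of target networks for the TD target, and all names are standard DDPG assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class MLP(nn.Module):
    """Simple fully connected network used for both actor and critic."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.98):
    """One goal-conditioned DDPG step on a mini-batch from the replay buffer."""
    s, g, a, r, s_next = batch  # tensors of shape (batch, dim)
    with torch.no_grad():       # TD target y_t, cf. Eq. (2)
        a_next = target_actor(torch.cat([s_next, g], dim=-1))
        y = r + gamma * target_critic(torch.cat([s_next, g, a_next], dim=-1))
    # Critic: minimize the squared TD error, Eq. (2)
    q = critic(torch.cat([s, g, a], dim=-1))
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: maximize Q under the current policy, Eq. (3)
    a_pi = actor(torch.cat([s, g], dim=-1))
    actor_loss = -critic(torch.cat([s, g, a_pi], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()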
C. Hindsight Experience Replay (HER)

Hindsight Experience Replay (HER) [17] is a data augmentation technique that significantly improves the exploration of goal-conditioned reinforcement learning algorithms. In HER, the key idea is to learn from failed trials. Therefore, uniformly sampled transitions from the replay buffer are modified by replacing the goal g_t of a transition with a state achieved in a subsequent time step of the same trajectory. Accordingly, the reward r_t must be recalculated, which will consequently induce rewards for trajectories that failed to reach the original goal. From these unsuccessful trials, the goal-conditioned policy should learn to reach the original goal more efficiently via extrapolation. However, the goals used in HER are usually object-centric. Hence, it can take a significant number of trials until the agent starts to move the object and virtual goals start to affect the learning algorithm. This is a substantial downside of HER, which we address with our method.

Fig. 2. Overview of the benchmark tasks. We enlarge the benchmark's [25] original goal space (green) to a larger area (blue) to increase the task difficulty. For the Pick-And-Place task, we sample goals uniformly, contrary to the original benchmark task, which uses 50% of targets on the table.
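To make the relabeling mechanism concrete, here is a minimal sketch of the "future" goal-replacement strategy described above; the episode layout, the helper name, and the per-step "ag_next" (achieved goal) field are our illustrative assumptions, not the original implementation.

import random

def her_relabel(episode, reward_fn, k_future=4):
    """Minimal sketch of HER 'future' relabeling.

    episode:   list of dicts with keys 's', 'g', 'a', 's_next', 'ag_next',
               where 'ag_next' is the (object-centric) goal achieved after
               the transition.
    reward_fn: sparse reward r(achieved_goal, goal).
    """
    relabeled = []
    T = len(episode)
    for t, tr in enumerate(episode):
        for _ in range(k_future):
            # Pick a state achieved later in the same trajectory as virtual goal.
            g_virtual = episode[random.randint(t, T - 1)]["ag_next"]
            # Recompute the reward with respect to the virtual goal.
            r_virtual = reward_fn(tr["ag_next"], g_virtual)
            relabeled.append((tr["s"], g_virtual, tr["a"], r_virtual, tr["s_next"]))
    return relabeled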
D. Simulation Environments

We evaluate our method on three different benchmark tasks from the robotics collection [25] of OpenAI Gym [26]. The environments are based on the physics engine MuJoCo [27]. MuJoCo allows designing tailored physical environments and computes the dynamics and contact interactions between rigid bodies. We use a 7-DoF robot manipulator (see Figure 1). Marked in red are the locations of the force sensors, which are placed on the end-effector. The robot is controlled via torque actions that are sent to its motor joints.

Figure 2 shows the three tasks considered: Pick-And-Place, Push, and Slide. These tasks represent a diverse set of manipulation tasks. In Pick-And-Place, the goal is to pick up an object and move it to a target. In Push, the task consists of pushing an object on a surface to a target position. Contrarily, in Slide the robot has to kick a puck in order to reach a distant goal that is out of reach of the robot's workspace. The red sphere in Figure 2 marks the target position an object needs to reach. The original goal space [25] is illustrated in green. To increase the difficulty of the problem, we enlarge the spread of the goal space from the green to the blue area. For Pick-And-Place, the original problem [25] uses the simplification that 50% of goals are on the table and not in the air. This can lead to instances where the robot does not need to learn to grasp the object in order to complete the task. Hence, we modify the task such that all goals are sampled uniformly in the marked green area.
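The tasks can be instantiated directly from the Gym robotics collection; a minimal sketch follows. The env IDs use the standard Fetch naming of the benchmark, but version suffixes vary across gym releases, and the enlarged goal spaces described above require modifying the environments' goal sampling, which is not shown here.

import gym

# The three benchmark tasks from the Gym robotics collection [25], [26].
# Observations are dicts with 'observation', 'achieved_goal', 'desired_goal'.
for task in ["FetchPickAndPlace-v1", "FetchPush-v1", "FetchSlide-v1"]:
    env = gym.make(task)
    obs = env.reset()
    print(task, obs["observation"].shape, obs["desired_goal"])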
IV. METHOD

Fig. 3. The standard reinforcement learning loop and our two method extensions in red. The reward function is extended by an intrinsic reward, and a sampling scheme that prioritizes contact-rich transitions is introduced.

When training robots to execute manipulation tasks via reinforcement learning, the reward function is most often defined as sparse. Therefore, it is essential to have an efficient exploration approach. We propose a method that improves exploration efficiency by leveraging force sensor information in the goal-conditioned reinforcement learning setting (see Figure 3).

Our main insight is that the sense of touch can guide robots to take actions that result in manipulation of an object. To implement the sense of touch, we extend our state space with force measurements from tactile sensors positioned at the end-effector [13], [14], [15]. We introduce a touch-based intrinsic reward function to direct exploration towards states where the robot touches the object. The trajectory in the state space that leads to the task solution passes through states where the robot touches the object. Hence, incentivizing the agent to reach such states will guide the exploration towards the final task goal. Using the intrinsic reward in combination with HER utilizes virtual goals much earlier, because the robot starts to move the object significantly faster. We further exploit the touch information by introducing Contact-Prioritized Experience Replay (CPER). CPER uses contact cues to sample more informative transitions and virtual goals for training the Q-function and control policy π, which further speeds up learning.

A. State Space
We extend our state space with force measurements from tactile sensors positioned at the end-effector [13], [14], [15]. The original state space consists of a mixture of proprioceptive state information, i.e., joint positions and joint velocities, and relevant information about the manipulation object, such as the object's velocity and its position x_obj, which we will use in the reward function. We further extend the state space with the measured force f_t at the end-effector and the sum of all measured forces up to the current step, \sum_{i=0}^{t} f_i. The accumulated force is a good measure to track how much contact occurred during an episode, and it is what our intrinsic reward is based on.
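A minimal sketch of this state extension is given below; the observation layout and the way per-sensor forces are aggregated into a scalar (absolute sum) are our assumptions, not the paper's specification.

import numpy as np

def augment_state(obs, f_t, force_sum):
    """Sketch of the state extension from Sec. IV-A (names are illustrative).

    obs:       original benchmark observation (proprioception + object info)
    f_t:       current force sensor readings at the end-effector
    force_sum: running sum of forces up to the previous step
    """
    force_sum = force_sum + np.sum(np.abs(f_t))   # accumulate measured forces
    return np.concatenate([obs, f_t, [force_sum]]), force_sum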
B. Intrinsic Force Reward (IR)

We extend an extrinsic reward function r_ext(s, g) with our intrinsic force reward r_int(s), which is independent of the goal:

r(s, g) = \omega_{ext} \cdot r_{ext}(s, g) + \omega_{int} \cdot r_{int}(s),   (4)

where ω_ext and ω_int are the weights on the extrinsic and intrinsic reward function, respectively. We empirically found weights that provide a good balance, with the intrinsic reward acting as an auxiliary reward while more emphasis is put on the extrinsic reward function (i.e., ω_ext > ω_int). The extrinsic reward function is formulated sparsely, where the indicator function \mathbb{I} returns a positive signal of magnitude 1 if the object's position x_obj is within the range ε_pos around the goal g:

r_{ext}(s, g) = \mathbb{I}\left[ \left\| g - x_{obj} \right\| < \varepsilon_{pos} \right]   (5)

In all other cases, the reward is 0. The intrinsic reward depends only on the state and returns a reward if the current sum of forces \sum_{i=0}^{t} f_i is higher than a minimal amount of desired contact interaction ε_force:

r_{int}(s_t) = \mathbb{I}\left[ \sum_{i=0}^{t} f_i > \varepsilon_{force} \right]   (6)

The intrinsic force reward allows the agent a certain freedom about when and how long to touch the object. As soon as a minimal amount of contact has occurred, i.e., when the sum of forces is greater than a minimal threshold, the agent receives the intrinsic reward until the end of the episode and can concentrate on solving the extrinsic task. The reward encourages the agent to manipulate the object and to bring it into motion. Therefore, the value of the threshold ε_force should be chosen large enough to avoid falsely providing the intrinsic reward due to sensor noise. We empirically set a fixed threshold (in Newtons) for the tasks Push and Pick-And-Place and a smaller one for the Slide task, since this environment requires only a short interaction with the object. However, we found that the threshold is robust across a range of different values.

We formulate the intrinsic reward such that it has a minimal influence on the final task. After the force threshold is reached, the agent can freely explore interaction with an object. Hence, we direct the agent to a crucial intermediate base from where the goal can be reached more quickly. Importantly, we use the same reward function across multiple tasks, demonstrating that our formulation can work in the general case. We also address the issues of HER in early exploration. Specifically, when an object is not moved during an episode, HER does not have an effect on training (see Section III-C for details). Our intrinsic reward circumvents the early exploration issue by guiding the agent to interact with the object, and it significantly speeds up convergence, as we show in Section V-A.
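The combined reward of Eqs. (4)-(6) can be sketched as follows; the argument names and the choice to pass the accumulated force as a scalar are illustrative, and the threshold and weight values are tuning parameters left open here.

import numpy as np

def reward(goal, x_obj, force_sum, w_ext, w_int, eps_pos, eps_force):
    """Sketch of the combined reward, Eqs. (4)-(6)."""
    r_ext = float(np.linalg.norm(goal - x_obj) < eps_pos)  # sparse goal reward, Eq. (5)
    r_int = float(force_sum > eps_force)                   # intrinsic force reward, Eq. (6)
    return w_ext * r_ext + w_int * r_int                   # weighted sum, Eq. (4)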
C. Contact-Prioritized Experience Replay (CPER)
When training the value and policy networks, transitions are sampled from a replay buffer. The standard procedure is to sample the transitions uniformly. However, not all transitions are of the same value to the learning algorithm. Therefore, prioritizing more informative transitions helps to speed up learning [23]. Furthermore, the selection of virtual goals is important as well. We leverage tactile information to alter the sampling of transitions and virtual goals.

We therefore introduce Contact-Prioritized Experience Replay (CPER), a sampling prioritization scheme built on the assumption that contact-rich trajectories contain more meaningful information for the learning process. Our method relies on two important features to accelerate learning. First, we prioritize trajectories where the agent touches the object. These trajectories contain important information about the relevant motions of the agent and the object. Other trajectories include robot motions where the end-effector may be far away from the object and are therefore less relevant for training. Second, we sample virtual goals first and then find training transitions by sampling from previous time steps. We sample virtual goals from states occurring after the contact to induce meaningful intermediate rewards. By sampling transitions from previous time steps, we increase the chance of finding relevant transitions. This reversed scheme helps in tasks like Slide, where the most important transitions occur prior to the contact phase. We experimentally show that using CPER leads to significantly faster training compared to the case when only our intrinsic reward is used. We now explain our sampling scheme in more detail.

First, we compute the probability distribution p_transition using p'_transition (see Figure 4) and the following equation:

p'_{transition}(e, t) = \begin{cases} 1 & \text{if } \sum_{i=0}^{t} f_i < \varepsilon_{force} \\ \lambda & \text{otherwise,} \end{cases}
\qquad
p_{transition}(e, t) = \frac{p'_{transition}(e, t)}{\sum_{e=0}^{E} \sum_{t=0}^{T} p'_{transition}(e, t)}   (7)

Fig. 4. Computation of p'_transition. Once the force threshold t_force is surpassed, the probability is multiplied by a factor λ.

Once the sum of forces reaches the threshold ε_force at time step t_force within a trajectory, we increase the sampling probability of transitions for all subsequent time steps until the end of the episode T by a factor λ. We empirically found that a factor of 10 leads to the best performance.

Algorithm 1 Contact-Prioritized Experience Replay (CPER)
Given: transitions (s_t, g_t, a_t, r_t, s_{t+1}) of episode e ∈ E and step t ∈ T, stored in replay buffer R
1:  Compute probability distribution p_transition(e, t)   ▷ Fig. 4, Eq. 7
2:  Compute marginal probability distribution p_episode = \sum_{t=0}^{T} p_transition(e, t)
3:  Sample mini-batch B of episodes from R according to p_episode
4:  for episode e_b in B do
5:      Sample state s_{t'} from p_transition(e_b, ·)
6:      Sample transition (s_t, g_t, a_t, r_t, s_{t+1}) back in the past, where 0 ≤ t ≤ t'
7:      if hindsight transition then   ▷ see HER [17]
8:          Replace original goal with sampled state: g_t ← s_{t'}
9:          Recompute r_t := r(s_t, a_t, g_t)
10:     end if
11: end for
12: Perform one step of optimization using algorithm A and mini-batch B

Algorithm 1 describes our method in detail and consists of two parts. First, we compute the marginal probability distribution p_episode from the sum of p_transition per episode in the buffer, which overweights trajectories where contact occurred and further prioritizes based on the time step of the first contact (lines 2-3). This induces a prioritization of episodes with contact-rich information. We then sample a batch of trajectories B according to p_episode (line 3).

In the second part (lines 4-11), we sample a virtual transition at time t' within an episode from batch B according to our modified distribution p_transition (line 5). We then reverse back in time to find a training transition (s_t, g_t, a_t, r_t, s_{t+1}) and replace the original goal g_t with the achieved state s_{t'} from the virtual transition (lines 6-9). Our sampling scheme allows us to find more meaningful virtual goals, narrowing down the search space to transitions that are relevant for robot manipulation tasks. In contrast to HER [17], where transitions are first sampled uniformly and virtual goals are then selected from achieved states in the near future, we search for relevant goals first and then sample back in time to find training transitions.
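A minimal sketch of this sampling scheme is given below. The buffer layout (per-step dicts with an "ag_next" achieved-goal field) and the cumulative-force arrays are our illustrative assumptions, not the authors' data structures.

import numpy as np

def cper_sample(episodes, force_sums, reward_fn, eps_force,
                lam=10.0, batch_size=64, her_ratio=0.8, seed=None):
    """Sketch of CPER (Algorithm 1, Eq. 7).

    episodes:   episodes[e] is a list of dicts with keys 's', 'g', 'a',
                'r', 's_next', 'ag_next'.
    force_sums: force_sums[e][t] is the cumulative force of episode e
                up to step t.
    """
    rng = np.random.default_rng(seed)
    # p'_transition: 1 before the force threshold is reached, lambda after (Eq. 7).
    p = [np.where(np.asarray(fs) < eps_force, 1.0, lam) for fs in force_sums]
    z = sum(pe.sum() for pe in p)
    p_episode = np.array([pe.sum() / z for pe in p])       # marginal over episodes
    batch = []
    for e in rng.choice(len(episodes), size=batch_size, p=p_episode):
        pe = p[e] / p[e].sum()
        t_goal = rng.choice(len(pe), p=pe)                 # sample virtual-goal step t'
        t = rng.integers(0, t_goal + 1)                    # go back in time: t <= t'
        tr = dict(episodes[e][t])
        if rng.random() < her_ratio:                       # hindsight relabel, cf. HER [17]
            tr["g"] = episodes[e][t_goal]["ag_next"]       # achieved state as virtual goal
            tr["r"] = reward_fn(tr["ag_next"], tr["g"])    # recompute reward
        batch.append(tr)
    return batch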
Fig. 5. Learning curves for all three tasks. We show the average success rates over 5 random seeds with the corresponding confidence interval. 1 epoch corresponds to 40 training episodes. We show our full method with contact-prioritized experience replay and the intrinsic reward (CPER+IR) (red) and two ablations: CPER (orange) and IR (brown). We compare them with two baselines: HER with tactile sensing (light blue) and the original HER method (blue).
Since we first sample goal transitions with a higher sampling probability for transitions after the first contact (see Equation 7), it is ensured that states in the trajectory before the contact are not neglected. We only replace the goals for a subset of transitions (80%, the same as in [17], line 7), keeping the original goal in the other subset as the task's final goal.

V. RESULTS
A. Robot Manipulation Experiments
To evaluate the proposed approach, we choose three standard manipulation tasks from MuJoCo benchmarks: Pick-And-Place, Push, and Slide (see Section III-D). These three environments require significantly different interactions between the robot and the object, allowing us to evaluate the generalization abilities of our approach. The Pick-And-Place task requires the robot to grasp the object, the Push task requires touching the object from one side, while Slide requires a short kick interaction. In this experiment, we use the hardest version of each task as described in Sec. III-D. Our goal is to evaluate our contributions in the most demanding settings. We use DDPG [22] with HER [17], which we refer to simply as HER going forward, as a baseline because it is the state-of-the-art method for training goal-conditioned control policies. In the ablation study, we test how each of the components of our method performs. First, we extend the state space with force and cumulative force sensor readings, as explained in Sec. IV-A, and use HER to train the agent (HER (Tactile)). Next, we separately add our intrinsic reward (IR) and our contact-prioritized experience replay (CPER), where we use our prioritization scheme with only the extrinsic reward. Finally, our full method uses all the components, combining the intrinsic reward and CPER (CPER+IR (Ours)).

Figure 5 shows the results of the experiments. Our approach learns to complete the task significantly faster than HER. The difference is more prominent in Pick-And-Place and Push, since HER has problems solving these tasks due to the increased task difficulty. In the original settings [25], HER manages to learn these tasks. However, our method leads to faster training in the original settings as well (see Figure 6). In all tasks, our approach converges after 20 to 30 epochs. In particular, at the beginning of training, our method finds successful actions faster. Hence, it yields a steeper learning curve, which indicates that our method motivates the agent to manipulate the object from the beginning.

Our ablation study shows that simply adding tactile information to the state space does not automatically lead to faster training. HER usually performs the same with and without the tactile sensor readings. In Pick-And-Place, the tactile feedback improves the performance, but it is unstable. The faster convergence may imply that the tactile feedback gives an important signal that the object is grasped. When the intrinsic reward is used (IR), the performance is significantly improved compared to both HER and HER (Tactile). This confirms that the proposed intrinsic reward function indeed speeds up training. In Slide, the intrinsic reward has the smallest effect because this task demands a precise movement of the robot arm prior to the short kick of the object.

Finally, we analyze the influence of our sampling scheme CPER. While there is no significant difference between IR and CPER+IR for Pick-And-Place, we observe that it is crucial to choose transitions with valuable information in Push and Slide to achieve faster convergence. In Pick-And-Place, the agent starts to grasp the object relatively fast due to the intrinsic reward, and hence most of the trajectories are prioritized for sampling. Therefore, our sampling scheme does not have a significant effect there. Furthermore, applying CPER without IR shows similar results as the full method in Pick-And-Place, implying that the intrinsic reward can potentially be avoided in some tasks.
B. Task Difficulty
To study the effects of increasing the task difficulty, we conducted experiments using our method and the HER baseline. The original environments use a much smaller goal space than the robot can potentially reach in Push and Slide, or use the simplification that 50% of the goals in Pick-And-Place are on the table, which is a curriculum that simplifies the exploration. We test both approaches in three environments where we gradually increase the difficulty. In the Simple environment, we use the original settings. For Intermediate, we use settings that are half-way between the Simple and Hard environments, i.e., we increase the range of the goal space for Push and Slide, while we decrease the percentage of goals on the table for Pick-And-Place. Hard is our final environment described in Sec. III-D.

Fig. 6. Learning curves for all three tasks and difficulty levels. We show the average success rate over 5 seeds with the corresponding confidence interval. 1 epoch corresponds to 40 training episodes. We compare our full method CPER+IR (red) and the original HER method (blue).

The results are illustrated in Figure 6. When the task difficulty is kept the same as in [25] (Simple, top row), both methods reach similar success rates. However, we can see that our method converges faster. Similar results are observed when the task difficulty is slightly increased (Intermediate, middle row), although the convergence of the baseline slows even more compared to our method. Lastly, if the difficulty is increased as described in Section III-D, our method shows significantly better performance both in terms of convergence and success rate. These results are in accordance with insights from [28], which demonstrates that methods often perform well due to constrained goal spaces. Because of the enhanced exploration, our method manipulates the object once it is discovered and can therefore generalize more efficiently to a larger goal space.
C. Sampling Ablation Study
In this experiment, we analyze our prioritization scheme CPER in more detail. CPER uses two main components: increasing the probability of sampling transitions from episodes with contact, and sampling transitions and virtual goals inside the trajectory based on the contact occurrence. We compare full CPER to a sampling scheme that uses just the first component, i.e., we prioritize episodes based on contact occurrences. Hence, we sample the virtual goal state s_{t'} uniformly, instead of sampling it from p_transition (compare line 5 in Algorithm 1). We name this sampling scheme episode sampling. We also compare against a more conventional baseline, where we prioritize episodes based on the achieved reward and sample the transitions uniformly, which we call reward prioritization. In all three methods, we use our IR and the tactile feedback for an appropriate comparison.

Fig. 7. Ablation study of our sampling method. We show the average success rate over 5 random seeds with the corresponding confidence interval. 1 epoch corresponds to 40 training episodes. We compare our full method CPER+IR (red) with an ablation, episode sampling (purple), and a classical reward prioritization scheme (grey). In our ablation, we find the training episodes according to CPER, but sample the state and virtual goal uniformly (as in HER).

Figure 7 illustrates the results. We can see that our full method generally explores faster. In particular, for Slide, where the physical interactions with the object are limited to a few steps, we can see that episode sampling shows slower convergence. This indicates that the full method finds more informative transitions, which it achieves by sampling virtual goals from p_transition and then reverting back to find a meaningful training transition. In this task, it is important to learn the swing trajectory prior to the kick and to choose virtual goals after the kick. If we choose transitions randomly, we more often sample non-informative transitions that occur after the kick or goals where the object does not move, which slows down training. The worse performance of the reward prioritization baseline further indicates the benefit of our prioritization scheme.

VI. DISCUSSION

Here we address the current limitations of our method and propose areas for future work. This paper focuses on three commonly used benchmarks in RL research for robotic manipulation tasks [25]. These fundamental problems have significantly different contact patterns and a single object in the scene. This allows us to demonstrate the benefits of an intrinsic reward based on tactile information and our proposed sampling scheme. However, applying our method to more complex tasks, such as the manipulation of multiple objects or a door opening task, would require adding further modules. To solve such tasks, our method could be combined with other solutions, such as using a task-specific curriculum [29] for multiple objects or learning from expert demonstrations [30] for door opening.

To study the properties of the method, we conducted experiments in simulation. The deployment of simulator policies on a real robot often fails due to the discrepancy of sensor measurements and system dynamics between the real and simulated systems [3]. To overcome this, either the accuracy of simulators needs to be improved or the policies have to be robust against imperfections in real-world settings. Approaches that combine improving the accuracy of simulators with domain randomization have been proposed [3] and could be combined with our method. The contribution of our approach will most likely be robust for transfer to a real robot, since sensor noise will not trigger the threshold for receiving the intrinsic reward. One more aspect to consider is sensor placement, which we found crucial for avoiding exploitation of the intrinsic reward, e.g., by starting to push on the table. Here we use force sensors to get information about the contact with the object. In future work, one could infer tactile information from other sensor modalities, such as ultrasonic proximity sensors or vision [31], which would further alleviate this issue and yield a simpler solution for a real system.

While we primarily focus on manipulation tasks, where using the sense of touch as intrinsic reward is intuitive, our method could be extended to other domains. In a more general sense, if a task can be decomposed into distinct phases, demarcated by a measurable event or physical property, the intrinsic reward and the sampling scheme could be applied. For example, reaching a certain amount of vertical thrust might be useful for learning flying skills in drones.

VII. CONCLUSION

We study the challenges of exploration and efficient learning in deep reinforcement learning for robot manipulation tasks. We show in our experiments that an intrinsic force reward overcomes the difficulties of initial exploration and results in learning the intended behavior faster.
We have discovered that transitions where the agent manipulates the object and brings it into motion contain valuable information for the learning process. We therefore introduce an up-sampling scheme that prioritizes exactly these transitions. We show that our prioritization scheme accelerates the learning progress even more. We find that our solution improves the performance and enhances the exploration on three fundamental manipulation tasks. Thus, we conclude that tactile feedback has the potential to advance reinforcement learning a step further.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354-359, 2017.
[3] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, vol. 4, no. 26, 2019.
[4] S. Christen, L. Jendele, E. Aksan, and O. Hilliges, "Learning functionally decomposed hierarchies for continuous control tasks with path planning," IEEE Robotics and Automation Letters, 2021.
[5] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., "Solving Rubik's cube with a robot hand," arXiv preprint arXiv:1910.07113, 2019.
[6] R. M. Ryan and E. L. Deci, "Intrinsic and extrinsic motivations: Classic definitions and new directions," Contemporary Educational Psychology, vol. 25, no. 1, pp. 54-67, 2000.
[7] L. A. Jones and S. J. Lederman, Human Hand Function. Oxford University Press, 2006.
[8] E. Kosoy, J. Collins, D. M. Chan, J. B. Hamrick, S. Huang, A. Gopnik, and J. Canny, "Exploring exploration: Comparing children with RL agents in unified environments," arXiv preprint arXiv:2005.02880, 2020.
[9] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, "Unifying count-based exploration and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 1471-1479.
[10] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 16-17.
[11] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, "Exploration by random network distillation," in International Conference on Learning Representations, 2018.
[12] A. Baranes and P.-Y. Oudeyer, "Active learning of inverse models with intrinsically motivated goal exploration in robots," Robotics and Autonomous Systems, vol. 61, no. 1, pp. 49-73, 2013.
[13] H. Merzić, M. Bogdanović, D. Kappler, L. Righetti, and J. Bohg, "Leveraging contact forces for learning to grasp," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3615-3621.
[14] A. Melnik, L. Lach, M. Plappert, T. Korthals, R. Haschke, and H. Ritter, "Tactile sensing and deep reinforcement learning for in-hand manipulation tasks," IROS Workshop on Autonomous Object Manipulation, 2019.
[15] S. Christen, S. Stevšić, and O. Hilliges, "Guided deep reinforcement learning of control policies for dexterous human-robot interaction," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2161-2167.
[16] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, "Go-explore: a new approach for hard-exploration problems," arXiv preprint arXiv:1901.10995, 2019.
[17] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5048-5058.
[18] S. H. Huang, M. Zambelli, J. Kay, M. F. Martins, Y. Tassa, P. M. Pilarski, and R. Hadsell, "Learning gentle object manipulation with curiosity-driven deep reinforcement learning," CoRR, vol. abs/1903.08542, 2019.
[19] B. Wu, I. Akinola, J. Varley, and P. K. Allen, "MAT: Multi-fingered adaptive tactile grasping via deep reinforcement learning," in Conference on Robot Learning, 2020, pp. 142-161.
[20] A. Aubret, L. Matignon, and S. Hassas, "A survey on intrinsic motivation in reinforcement learning," arXiv preprint arXiv:1908.06976, 2019.
[21] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3-4, pp. 293-321, 1992.
[22] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in International Conference on Learning Representations, 2016.
[23] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in International Conference on Learning Representations, 2016.
[24] R. Zhao and V. Tresp, "Energy-based hindsight experience prioritization," Conference on Robot Learning, pp. 113-122, 2018.
[25] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, "Multi-goal reinforcement learning: Challenging robotics environments and request for research," CoRR, vol. abs/1802.09464, 2018.
[26] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[27] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 5026-5033.
[28] W. C. Lewis-II, M. Moll, and L. E. Kavraki, "How much do unstated problem constraints limit deep robotic reinforcement learning?" CoRR, vol. abs/1909.09282, 2019.
[29] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone, "Curriculum learning for reinforcement learning domains: A framework and survey," Journal of Machine Learning Research, vol. 21, pp. 181:1-181:50, 2020.
[30] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," in Proceedings of Robotics: Science and Systems (RSS), 2018.
[31] T. Pham, A. Kheddar, A. Qammaz, and A. A. Argyros, "Towards force sensing from vision: Observing hand-object interactions to infer manipulation forces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.