DROID: Minimizing the Reality Gap using Single-Shot Human Demonstration
Ya-Yen Tsai, Hui Xu, Zihan Ding, Chong Zhang, Edward Johns, and Bidan Huang†

† denotes the corresponding author. Y.-Y. Tsai is with the Hamlyn Centre for Robotic Surgery and E. Johns is with the Robot Learning Lab, Department of Computing, Imperial College London, SW7 2AZ, London, UK ({y.tsai17, e.johns}@imperial.ac.uk). H. Xu is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China ([email protected]). Z. Ding, C. Zhang, and B. Huang are with Tencent Robotics X, China ({zihan.ding18, chongzzhang, bidanhuang}@tencent.com).
Abstract—Reinforcement learning (RL) has demonstrated great success in the past several years. However, most of the scenarios focus on simulated environments. One of the main challenges of transferring the policy learned in a simulated environment to the real world is the discrepancy between the dynamics of the two environments. In prior works, Domain Randomization (DR) has been used to address the reality gap for both robotic locomotion and manipulation tasks. In this paper, we propose Domain Randomization Optimization IDentification (DROID), a novel framework to exploit single-shot human demonstration for identifying the simulator's distribution of dynamics parameters, and apply it to training a policy on a door opening task. Our results show that the proposed framework can identify the difference in dynamics between the simulated and the real worlds, and thus improve policy transfer by optimizing the simulator's randomization ranges. We further illustrate that, based on these same identified parameters, our method can generalize the learned policy to different but related tasks.
Index Terms—Transfer Learning, Learning from Demonstration, Manipulation Planning
I. INTRODUCTION

Reinforcement Learning (RL) has been widely applied to decision making, control, and planning. In the field of robot learning, many works have adopted RL as the robot control policy to improve learning efficiency and performance [1]–[3]. Recent works have demonstrated that RL can be used to control a dexterous robotic hand or a robotic arm to solve tasks that require complicated manipulation skills, such as solving a Rubik's cube [4], [5] or opening a door [6].

RL uses self-exploration to find the optimal policy. Typically, this requires a very large amount of trial-and-error, which is time-consuming and can easily result in hardware damage if executed on the physical robot. A less costly and safer approach is learning the policy in simulation. However, the discrepancy between the real world and the simulation models can hinder the policy from being deployed directly in the real world, especially when the task is contact-rich. In the literature, this sim-to-real problem is referred to as the reality gap, which remains an open issue to date.
System Identification (SI) and Domain Randomization (DR) are the two common approaches to cross the dynamics reality gap. While SI tries to reduce the reality gap by identifying the real-world parameters, DR tries to increase robustness to the reality gap by training on randomized simulated environments. However, these methods still often struggle to accurately obtain real-world parameters or to choose randomization ranges without any real data [7]. This can result in learning a biased policy, or a policy which fails to converge.

To address this, we propose a novel framework, Domain Randomization Optimization IDentification (DROID), to automatically optimize the environment parameter distribution with a combination of DR and SI approaches. Rather than determining the DR range using intuition or tedious tuning, we exploit human demonstrations and attempt to align simulated trajectories with these real-world trajectories. Through this, we identify the simulator's optimal parameter range as a statistical model, which can then be sampled from during training with RL. An overview of this method is shown in Fig. 1.

Fig. 1: Overview of the proposed framework, DROID, used to minimize the reality gap.

With DROID, the learned policy can be transferred to the real world directly; thus, learning efficiency is significantly improved and unsafe interactions with the environment are avoided. Our experimental results show that after parameter optimization and identification, a much higher success rate can be achieved with DROID on a real-world door opening task, compared with standard DR or SI. We also show how the learned dynamics can be directly used to train policies for different but related tasks.

II. RELATED WORK

RL's applications in robotics suffer from expensive training data collection on real robots [8]. By offering a cheaper and safer data collection environment, simulation has gained huge popularity for RL training [4], [9]–[11]. However, the reality gap remains one of the main problems hindering a policy learned in simulation from transferring well to the real world. To bridge the reality gap, two common strategies have been studied in prior works: SI and DR [7].

SI has been a popular approach for sim-to-real transfer [12]–[17]. It focuses on finding the exact model of the real world, so that the physical behaviors match between the real and simulated systems. This can be done by constructing the simulation through direct measurement of the environment parameters in the real world [18], or by collecting real-world data for optimizing the simulated model parameters [19]–[21]. However, correctly identifying the system's parameters is challenging. Many parameters cannot be explicitly measured or can involve the presence of noise, especially dynamics-related ones like friction, stiffness, and damping. In addition, the system parameters can be high-dimensional and entangled. This further increases the difficulty of achieving accurate and precise SI results [7], [22], [23].
As a consequence, SI usually requires expertise of the system to handcraft the model.

Rather than identifying the environment parameters, DR creates multiple simulated environments by randomizing the system parameters within given ranges during policy training. This improves the policy's generalizability and robustness against the reality gap. Recently, it has achieved significant progress for sim-to-real transfer in robotics [4]–[7], [21], [24]–[28]. Unlike SI, DR achieves better sim-to-real performance by covering, during RL training, a range of the parameter distribution in simulation that contains the real values. Prior works optimize system parameters with simulated and real trajectories collected using hand-designed policies [4], or automatically adjust the boundaries of uniform randomization distributions according to model performance [5]. While effective, these works suffer from a common drawback: as demonstrated in [7], adjusting the randomization ranges demands considerable engineering effort and is difficult and unintuitive. Hand-tuning the parameter ranges can easily produce overestimated values, leading to training RL in an invalid environment and learning a suboptimal policy. How to quickly choose the randomization ranges for different parameters and achieve effective policy generalization remains a challenge.

Besides the two main approaches, a variant of DR, Adaptive DR, has also been studied. It was proposed to optimize the parameter distributions and minimize the chance of training RL in invalid environments. Prior works use techniques such as approximate Bayesian computation [29], Bayesian Optimization, or relative entropy policy search [30] to estimate or optimize distributions of system parameters [26], [27], [31]. The proposed framework draws some similarities to these works in avoiding overestimation of the parameter distribution, but focuses on a more contact-rich task, which involves determining complicated and entangled dynamics-related parameters, and optimizes the distribution through a more efficient and safer approach, i.e., human demonstration. In addition, we optimize the randomization distributions with respect to trajectories containing not only the observed positions and/or velocities, but also the proprioceptive torques on the joints of the robot arm. Experiments were conducted with DROID on a contact-rich task, door opening, chosen for the complex dynamics of the robot joints, the door hinge, and the contacts between the gripper and the handle.

The rest of the paper is organized as follows: the methodology is presented in Section III, followed by the experimental setup, results, and discussion in Section IV. Finally, the conclusions and future work are presented in Section V.

III. METHODOLOGY

Learning in a simulated environment is convenient, but transferring the learning results to the real environment requires an accurate model of that environment. Simulation typically uses mathematical models to compute the interaction force and torque between objects. These models rely on predefined dynamics-related parameters such as friction, stiffness, and damping. Unlike kinematics-related parameters, many of them are not easily accessible and hence are often difficult to measure and identify. Therefore, tasks that involve these parameters experience difficulty in sim-to-real transfer. To this end, we propose a framework, DROID, that evaluates these parameters through human demonstration.
Building on the concept of interaction force, we implicitly perceive dynamics information of the real-world system from the feedback of the robot and use this information to determine the distribution of the parameters in the simulation. This gives us a reasonable set of parameters for the domain randomization in RL and hence results in a successful policy transfer.

DROID is composed of three phases: the human demonstration (Section III-A), the parameter identification and optimization (Section III-B), and the policy learning with optimized DR (Section III-C). In the first phase, a human demonstrates a contact task in the real world. The robot clones the human behaviors multiple times and records the data. In the second phase, the robot in the simulator repeats the same behaviors and records the same set of data. The data from the real system and the simulator is then used for identifying the distribution of the task-relevant parameters. Through an iterative approach, we gradually update the parameter distribution to minimize the differences between the two perceived feedback signals, until the obtained simulated environments better reflect the real world. In the final phase, a policy is trained with DR based on the optimized parameter distribution. The resulting policy can be transferred to the real world with good performance, thanks to the previous optimization steps. Note that this study focuses on minimizing the reality gap, and the human demonstration only serves the purpose of identifying the task-relevant parameter distribution. Learning the task from human demonstration is out of the scope of this paper; in the third phase, we learn the policy from scratch without human demonstrations. In the following sections, we will go into more detail on how each part is implemented; a high-level sketch of the pipeline is given below.
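To make the three phases concrete, the following is a minimal Python sketch of the overall pipeline. The helper callables are hypothetical stand-ins for the components detailed in Sections III-A to III-C, not code from the paper.

```python
def droid(q_d, replay_on_robot, optimize_distribution, make_randomized_envs,
          train_ppo, n_repeats=10):
    """End-to-end sketch of DROID's three phases.

    All helpers are hypothetical: replay_on_robot(q_d) returns one torque
    trajectory; optimize_distribution implements Sec. III-B; make_randomized_envs
    and train_ppo implement Sec. III-C.
    """
    # Phase 1 (Sec. III-A): replay the single human demonstration q_d on the
    # real robot several times and record the joint torque feedback tau_r.
    real_torques = [replay_on_robot(q_d) for _ in range(n_repeats)]
    # Phase 2 (Sec. III-B): optimize the simulator's dynamics-parameter
    # distribution Phi = N(mu, Sigma) so simulated torques match tau_r.
    mu_opt, sigma_opt = optimize_distribution(q_d, real_torques)
    # Phase 3 (Sec. III-C): train a policy with domain randomization over
    # environments sampled from the optimized distribution, then deploy.
    return train_ppo(make_randomized_envs(mu_opt, sigma_opt))
```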
A. Single-Shot Human Demonstration

In this first phase, our aim is to collect data reflecting the dynamics relevant to the task. To this end, a human demonstrates the task once in the real world through kinesthetic guidance, providing a robot motion trajectory q_d. This demonstration is safe for the robot to execute and allows the robot to repeat it multiple times automatically. By repeating q_d to interact with the environment, the dynamics information can be perceived through the torque sensor feedback τ_r of the robot. This feedback is later used as the reference to identify and update the parameter distribution in the simulation. Different from approaches that make the robot interact randomly with the environment, in DROID we rely on the human to provide a trajectory that is safe for the robot to identify the system dynamics. This requires only a single-shot demonstration. We focus on the task-relevant parameter distributions and limit the random exploration of the robot in the real world. This minimizes risk and saves time in the real-robot experiment. A sketch of this data-collection step follows.
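The following is a minimal sketch of the replay-and-record loop. The `robot` object and its `move_to`/`step_to` methods are illustrative assumptions, not a real driver API; `step_to` is assumed to command a joint position and return the 7-DoF joint torque reading.

```python
import numpy as np

def collect_real_torques(robot, q_d, n_repeats=10):
    """Replay the demonstrated joint trajectory q_d on the real robot
    n_repeats times and log the joint torque feedback tau_r."""
    torque_trajectories = []
    for _ in range(n_repeats):
        robot.move_to(q_d[0])                  # hypothetical: return to start pose
        tau = [robot.step_to(q) for q in q_d]  # hypothetical: command q, read torques
        torque_trajectories.append(np.asarray(tau))
    return torque_trajectories                 # N trajectories of tau_r
```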
B. Parameter Optimization and Identification

The goal of this phase is to correctly identify the parameter distribution that minimizes the discrepancy between the simulated and the real system. Rather than finding a specific value for each parameter, we determine distributions over them to account for the presence of noise and uncertainty in the real world. We can then train RL to find a policy that works within this distribution, which is the key problem solved in DROID. A falsely defined distribution will lead to failure in sim-to-real transfer, and a policy trained under an unreasonable DR can suffer bad performance. We model this distribution as a multivariate normal distribution Φ(µ, Σ), where µ and Σ are the mean and covariance, and system parameters φ are randomly sampled from Φ.

During the human demonstration phase, data has been collected from the real system. Taking the identical steps, we can program the robot to repeat the same task in the simulation and hence obtain the torque sensor feedback τ_s. As the real-world parameter values φ′ are not easily accessible, we align the simulation and the real environment by aligning the robot behaviors in them. Here, we define the "behavior" as the robot action and perception pairs, i.e., the motion trajectory q_d and the torque sensing τ_s. Changing φ changes the dynamics of the simulation and hence changes the robot behavior. Sampling different φ from the distribution Φ, we observe the behaviors under different environments in the simulator and identify the ones that are most similar to the real-world behavior. We hence update Φ based on the robot behaviors.

We formulate this as a distribution optimization problem and update Φ iteratively based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [32], with the following objective function:

J(\phi) = \frac{1}{N} \sum_{n=1}^{N} \left( \lVert \tau_s(\phi) - \tau_{r_n}(\phi') \rVert + c\beta \right), \quad \phi \sim \Phi \qquad (1)

where N, β, and c are the total number of trajectories from the real robot, a penalty for failure of the task, and the factor for the penalty, respectively.

The process is iterated from the initial guess Φ_init = N(µ_init, Σ_init) until convergence. Note that we append cβ to the fitness function to penalize situations where the robot fails to grip the door knob during the door opening process in simulation. This occasionally happens, for example, when the friction coefficients of the fingers become too low or when the joint damping is set to an invalid value. The described procedure is summarized in Algorithm 1. M is the total number of samples from the current distribution Φ. In each iteration, the robot performs the task in the simulator under environment parameters φ sampled from the distribution Φ. The cost J(φ) is then evaluated with the resulting τ_s via Eqn. 1. Note that in total N real-robot trajectories are used to evaluate the cost, and J(φ) is the average value. After evaluating all M samples, CMA-ES updates Φ from the x best candidates associated with the lowest costs J(φ). The update of Φ (i.e., µ and Σ) and of the CMA-ES hyper-parameters follows the standard procedure. This repeats until Φ converges and we obtain the optimized Φ*. A code sketch of this loop follows Algorithm 1.
Algorithm 1: Optimizing parameter distribution
  Initialize hyper-parameters of CMA-ES
  Initialize Φ with N(µ_init, Σ_init)
  while not converged do
    for m = 1 : M do
      Sample φ_m from Φ
      Robot performs task in simulation with φ_m
      Collect τ_s(φ_m)
      Calculate J(φ_m) via Eqn. 1 with N real trajectories {τ_{r_1}(φ′), ..., τ_{r_N}(φ′)}
    end for
    Select x best φ from {φ_1, ..., φ_M} by min J(φ)
    Update Φ = N(µ, Σ) with the selected φ set
    Update hyper-parameters of CMA-ES
  end while
  return optimized Φ*
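As a concrete illustration of Algorithm 1, below is a minimal sketch using the pycma package, assuming its documented ask/tell interface and the `popsize`/`CMA_mu` options. `simulate_torques` is a hypothetical stand-in for replaying q_d in the simulator under parameters φ; the population size of 30, elite count of 5, and failure penalty of 10 follow the values reported in Section IV-B.

```python
import numpy as np
import cma  # pycma package

def make_cost(simulate_torques, real_torques, c=10.0):
    """Eqn. (1): mean torque discrepancy over the N real trajectories, plus a
    penalty c*beta when the simulated rollout fails (e.g. the grasp slips)."""
    def cost(phi):
        tau_s, failed = simulate_torques(np.abs(phi))  # negative draws made valid
        beta = 1.0 if failed else 0.0
        return float(np.mean([np.linalg.norm(tau_s - tau_r) + c * beta
                              for tau_r in real_torques]))
    return cost

def optimize_distribution(mu_init, sigma_init, simulate_torques, real_torques):
    cost = make_cost(simulate_torques, real_torques)
    es = cma.CMAEvolutionStrategy(mu_init, sigma_init,
                                  {'popsize': 30,   # M = 30 samples per iteration
                                   'CMA_mu': 5})    # keep the x = 5 best candidates
    while not es.stop():
        candidates = es.ask()                       # sample phi_1..phi_M ~ Phi
        es.tell(candidates, [cost(phi) for phi in candidates])  # update mu, Sigma
    return es.result.xfavorite, es.result.stds      # optimized mean and stds
```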
C. Policy Learning

In this paper, we adopt a reinforcement learning framework to learn an optimal control policy π_θ, parameterized by θ, through policy gradient. This is formulated as a Markov decision process (MDP) defined by the tuple (S, A, P, r, ρ_0, γ), where S is the state space, A is the action space, P: p(s_{t+1} | s_t, a_t) is the state transition distribution, r: S × A → ℝ is the reward function, ρ_0 is the initial state distribution, and γ ∈ (0, 1] is the discount factor. The optimal policy π*_θ aims to maximize the cumulative reward over episodes:

\pi^*_\theta = \arg\max_{\pi_\theta} \mathbb{E}_{\pi_\theta} \left[ \sum_t r(s_t) \right]. \qquad (2)

Proximal Policy Optimization (PPO) [33] is deployed for the robot learning purpose. It updates the policy by using the surrogate objective:

L(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right], \qquad (3)

where \hat{A}_t is the estimate of the advantage function at timestep t, and r_t(θ) denotes the ratio between the current policy and the previous policy. The clipped term keeps the ratio inside the interval [1 − ε, 1 + ε]; the minimum of the clipped and unclipped terms is used in the expectation. This provides a lower bound, or a pessimistic bound, on the unclipped term. A short sketch of this loss is given at the end of this section.

Fig. 2: (a) and (b) show the hardware setup in the real world and in simulation. The ArUco markers attached to the table and the cabinet are used for tracking. (c) shows four different locations of the door knob, each separated by 5 cm along the lever arm of the door. (d) shows three variants of the door; from left to right are the original door, the door with one spring, and the door with two springs attached to the hinge. The springs were used to emulate doors with different dynamics.

As the model of the real-world environment is described by Φ*, sufficient simulation environments need to be sampled from the distribution in order to reflect reality. Therefore, the RL agent is trained on multiple simulated environments sampled from Φ*. By doing so, we aim to maximize the similarity in state transitions between the simulated and the real environments and enhance the transferability of the learned policy to real-world application.

The structure of the reward function follows closely the one defined in [6]:

r = \begin{cases} \omega_1 r_{\mathrm{door}} + \omega_2 r_{\mathrm{ori}} + \omega_3 r_{\mathrm{dist}} + \omega_4 r_{\mathrm{log\_dist}} + \omega_5 r_{\mathrm{slip}}, & \text{if } \lambda < {}^{\circ} \\ \omega_1 r_{\mathrm{door}} + \omega_2 r_{\mathrm{ori}} + \omega_5 r_{\mathrm{slip}}, & \text{otherwise} \end{cases} \qquad (4)

where λ is the hinge angle of the door, ranging from 0° (closed door) to 90° (completely opened door). The reward function has five terms in total. r_door rewards actions that increase the door hinge angle. r_ori rewards the relative orientation between the door knob and the gripper; higher reward is given when the relative orientation is closer to orthogonal. r_dist and r_log_dist are associated with the relative displacement between the door knob and the gripper; higher reward is given for short displacement. One major difference between our door opening strategy and DoorGym's lies in that they use hooks to open the door, while our robot is trained to firmly grip the knob handle during the entire task. This significantly increases the complexity of the dynamics involved. To this end, we added a penalty term r_slip to the reward function to prevent the fingers from slipping. ω_1, ω_2, ω_3, ω_4, and ω_5 are five coefficients for normalizing the reward terms.
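For concreteness, here is a minimal PyTorch sketch of the clipped surrogate objective in Eqn. (3); this is an illustrative implementation, not the exact training code used in the paper.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective of Eqn. (3), negated for gradient descent."""
    ratio = torch.exp(log_probs_new - log_probs_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # pessimistic bound
```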
IV. EXPERIMENTS AND RESULTS

We evaluate our work on a door opening task, chosen for the complexity of the dynamics involved. The main focus of our experiments is to assess how well the RL policy learned with DROID can be transferred to the robot in the real world. This evaluation consists of three experiments.

The first experiment validates that DROID can identify the parameter distributions for environments with different dynamics. In the second part, we apply one of the optimized distributions to DR and train a policy for door opening. The performance of this policy is evaluated by comparing its robustness and effectiveness in sim-to-real transfer with other approaches (standard DR and without DR). Finally, we test the generalizability of the RL policy learned from the optimized parameter distributions.

A. Experimental Setup
The experiment consisted of a real and a simulated system. In the real system, a cabinet door, a camera, and a 7-DoF Franka Emika robot arm were used, as illustrated in Fig. 2(a)(b). The Franka robot was equipped with joint torque sensors on all 7 DoF, allowing it to record torque feedback while interacting with the environment. A two-finger gripper was mounted at the end effector for grasping and manipulation. The cabinet was the target of interest, for which we attempted to identify the dynamics parameter distributions. The fixed camera provided the relative pose between the robot and the door, as well as the door angle, by tracking the visual markers attached to the door and the optical table during the experiment.

For the simulation part, we deployed the MuJoCo platform [34] to perform the RL training. The simulated environment was defined by a set of parameters including the kinematic tree and many dynamics parameters such as mass, damping, friction, etc. Many kinematic quantities, such as relative poses and geometric dimensions, and robot-related dynamics such as mass and inertia, were either provided officially or could be directly measured, and hence were not within our focus. The inertia of the door was also not considered, as it can be estimated once we obtain the CAD model. Our main interest was to identify the parameter distributions of the dynamics of the robot and the door that were not provided and were difficult to measure in the real world. These parameters included the masses, the joint friction loss, the joint damping, and the sliding and torsional frictions. A sketch of how sampled parameters map onto the simulator model is shown below.
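As an illustration of how a sample φ ∼ N(µ, Σ) could be written into the simulator, the following sketch uses the official MuJoCo Python bindings. The index constants and the parameter layout are hypothetical placeholders for illustration; they are not taken from the paper's model.

```python
import numpy as np
import mujoco  # official MuJoCo Python bindings

# Hypothetical addresses in the MJCF model (placeholders, not from the paper).
DOOR_BODY_ID, HINGE_DOF_ID = 1, 7
FINGER_GEOM_IDS = [10, 11]                            # left and right finger pads

def apply_sampled_dynamics(model, mu, Sigma, rng):
    """Sample phi ~ N(mu, Sigma) and write it into the MjModel fields that
    DROID randomizes; negative draws are resampled, as in the paper."""
    phi = rng.multivariate_normal(mu, Sigma)
    while np.any(phi < 0):
        phi = rng.multivariate_normal(mu, Sigma)
    model.body_mass[DOOR_BODY_ID] = phi[0]            # door mass
    model.dof_damping[:7] = phi[1:8]                  # robot joint damping (7 DoF)
    model.dof_frictionloss[HINGE_DOF_ID] = phi[8]     # door hinge friction loss
    model.geom_friction[FINGER_GEOM_IDS, 0] = phi[9:11]   # sliding friction
    model.geom_friction[FINGER_GEOM_IDS, 1] = phi[11:13]  # torsional friction
    return phi
```

Usage would be, e.g., `apply_sampled_dynamics(model, mu_opt, Sigma_opt, np.random.default_rng(0))` before each training episode, so every episode runs in a freshly sampled environment.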
B. Parameter Distribution Identification
In the first experiment, we investigate the feasibility of DROID for distribution identification. This verifies whether DROID can find a distribution of parameters that correctly reflects the interaction behavior encountered in the real-world door opening task.

For the given cabinet door (Fig. 2), the real robot followed the human demonstration to obtain ten sets of τ_r that reflected the dynamics of the interactions with the door. We first made an initial guess Φ_init for the parameters of interest. The estimate was made by referencing Franka's officially provided values and DoorGym's parameters [6], with the exceptions being the door and knob masses, which were directly measured. At the first iteration, 30 simulated environments were sampled from the initial distributions. Parameters sampled with negative values were omitted and resampled. In each environment, the robot cloned the human demonstration by following q_d to obtain τ_s. The associated cost for each trial was calculated using Eqn. 1: the higher the discrepancy between the torques, the higher the cost. For simulations that failed to open the door due to factors like grasp slipping, an extra penalty of 10 was added to the cost. The CMA-ES algorithm took the top five candidates of φ to update the means and the covariances, and hence Φ. In the next iteration, the updated Φ was used to generate another 30 new simulations, and this process was iterated until convergence.

With the above steps, we estimated the parameter distribution for our cabinet door (Tab. I). These parameters reflect the dynamics of the robot arm, of the door hinge, and of the contacts. Fig. 5 illustrates the optimization process and results. Fig. 5(a) shows the joint torque trajectories of the robot obtained in simulation before and after the optimization, together with the joint torque trajectory obtained in the real world; only one of the joints is displayed for illustration. As can be seen, the differences between the red (after optimization) and the black (real robot) lines are much smaller than the differences between the blue (before optimization) and the black. This suggests that the proposed approach can indeed minimize the reality gap. The same conclusion can be drawn from Fig. 5(c), which plots the cost against the iteration: over the 30 simulations, the average cost gradually converged to a lower value. This indicates that simulations sampled from the optimized distribution lead to a smaller reality gap than those sampled from the initial distributions. The means and variances of three parameters at each iteration are shown in Fig. 5(b), and these show that the distributions of these parameters converged to fixed ranges. The quantitative results comparing the unoptimized and optimized distributions are summarized in Tab. I. We applied this result to the sim-to-real transfer experiment, which is detailed in the next section.

Furthermore, a validation was carried out to determine possible variations in the estimated parameter distribution when using a different single human demonstration. An alternative demonstration was provided through a different initial robot pose, as illustrated in Fig. 3. Note that due to robot workspace limitations, only the two presented poses could be applied to interact with and open the door. The parameter distribution identification process was repeated using this pose.

Fig. 3: Two different initial robot poses used in the human demonstrations. (a) is the initial robot pose used to estimate the parameter distribution in Tab. I. (b) is another initial robot pose used to validate the possibility of a bias in parameter estimation.

Fig. 4: Comparison of the parameter distributions estimated using two different initial poses and human demonstrations. Red and blue curves represent the parameter distributions estimated using robot poses (a) and (b) shown in Fig. 3.

The experimental result comparing the parameter distributions estimated with the two human demonstrations is shown in Fig. 4. We verified that, despite minor variations, the majority of the parameter distributions estimated using the alternative pose still converged to distributions similar to those in Tab. I, and a single human demonstration was adequate for parameter estimation.

To further verify the approach, we conducted two sets of controlled experiments by changing the dynamics of the real door. The original door was modified to include additional springs. Fig. 2(d) illustrates the three variants of the door: without any spring, with one spring, and with two springs attached to the hinge. These springs were identical and share similar dynamics.

The optimization results for the different door dynamics are also summarized in Tab. I. It can be seen from the parameter values that the robot dynamics is not much affected, but the environment dynamics varies a lot. This is expected, since we only changed the door dynamics, and the algorithm is able to identify this change correctly. Doors equipped with different numbers of springs are expected to have different distributions for the door joint friction loss, stiffness, and damping. For the friction loss, the door with two springs seems to have a higher friction loss than the other two scenarios. The joint stiffness increases as the number of attached springs increases, which is reasonable. Doors with springs tend to have higher joint damping values than the door without a spring, as expected. The joint damping values associated with the robot fall into similar ranges, showing that the proposed framework produces consistent results. The masses and frictions vary across the different scenarios; this may be attributed to the robot needing higher friction in the gripper to sustain sufficient force to grip the door knob in simulation when more springs are loaded onto the door.

Based on the above evaluation results, the proposed DROID framework has been demonstrated to be feasible for determining the parameter distribution of the real-world dynamics; in theory, the reality gap is smaller after optimization. The framework has also been shown to identify tasks with different dynamics. Most outcomes appear reasonable, with a few exceptions that may be attributed to factors such as an imperfect demonstration resulting in loss of grip when the door becomes stiffer, or noise present in the joint torque sensors in the real world. We have used the optimized result in sim-to-real practice, as detailed in Section IV-C.
C. Sim-to-Real Transfer with Optimized DR
While domain randomization methods have been shown in many prior works to be effective in addressing the reality gap, selecting the randomization ranges remains challenging, and training the RL agent on overestimated ranges may be inevitable and can lead to poor learning performance. In the previous experiment, we successfully identified the parameter distributions, but we had not yet demonstrated that they are effective for learning an adequate policy and transferring it to the real world. Therefore, in this experiment, we compare the learning performance of policies learned with three methods: normal DR, DROID without DR, and DROID with DR. These policies were trained to open the door without springs in simulation and then directly transferred to the real world to open the same door. The normal DR approach was trained on the initially estimated parameter distributions, corresponding to the leftmost columns in Tab. I. DROID without DR, similar to SI, was trained using only fixed parameters; in this case, the parameters used were the optimized µ listed in column four of Tab. I. Finally, DROID with DR was trained on the optimized distributions, i.e., columns four and five of Tab. I.

For fairness, all of them were trained using the same model-free on-policy approach, i.e., PPO, and shared identical RL hyperparameters, network architectures, observations, and reward functions, with the parameter ranges being the only differences. In the reinforcement learning setting, the observation included the joint positions and velocities of the arm and the relative positions between the robot gripper and the knob, combining to form a 23-dimensional state space. The action space consisted of 9 DoF, specifying the 7 robot joint positions and 2 gripper widths. The critic and actor models were represented by neural networks with two hidden layers of size 64, followed by tanh activation functions. We performed our simulations on the MuJoCo physics engine with a timestep of 0.001 s. During training, each episode took a maximum of 512 steps. We employed the ADAM optimizer with a stepsize of 0.001 and mini-batches of 64 episodes to update the policy and value networks. A sketch of this architecture is given below.
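The following is a minimal PyTorch sketch of the actor and critic networks as described (two hidden layers of 64 units with tanh); treating the actor output as the mean of a Gaussian policy is our assumption, as the paper does not specify the action distribution.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    """Two hidden layers of size 64 with tanh activations, as described above."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 23, 9        # 23-dim observation; 7 joint positions + 2 gripper widths
actor = mlp(obs_dim, act_dim)   # assumed: mean of a Gaussian policy
critic = mlp(obs_dim, 1)        # state-value estimate used for the PPO advantage
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
```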
Fig. 5: (a) shows the comparison between a robot joint torque trajectory in the real world and the joint torque trajectories before and after the optimization and identification of the parameter distribution in simulation. (b) shows the changes in the means (solid lines) and variances (shaded areas) of three parameters over iterations during optimization. (c) shows the associated cost, computed using Eqn. 1, over iterations.

We trained three policies with each approach and tested each of them in the real world 10 times. The results of the sim-to-real comparisons are displayed in Tab. II. The door opening angles were recorded each time, and the histogram is shown in Fig. 6. We define a successful door opening case as one with the hinge angle of the door larger than ° in the real-world trials. From these results, we can see that the latter two approaches (DROID with DR and without DR) outperform the standard DR. In the DROID with DR setting, the optimized means and variances achieve significant advantages in real-world tests over the normal DR with initialized means and variances (success rates are reported in Tab. II). We also notice that the policies trained without DR, using the optimized means µ_opt as system parameters in simulation, achieve the highest success rate in reality, even higher than the DROID with DR method.
On one hand, this implies that the optimized µ_opt indeed achieves a good sim-to-real transfer. On the other hand, the fact that DROID without DR produces a higher success rate but lower door opening angles than DROID with DR is interesting. We attribute this to the relatively weak dependency on system dynamics when the door is opened by small angles (e.g., °); as the hinge angle increases, the system dynamics becomes more important for the door opening process, due to the larger contact forces between the gripper and the door knob. Hence, the policies cannot open the door further without a proper DR in the simulated training process, which is also testified in Fig. 6. From the distributions of maximal door-opening angle, we show that DROID with DR achieves door opening with more concentration on larger maximal hinge angles, e.g., around °, compared with normal DR and DROID without DR. Tab. II also demonstrates that DROID with DR achieves the highest average door-opening angle among all three methods, even though it does not perform as well as the normal DR method in simulation. As for the number of steps taken to open the door, our method performs advantageously and consistently in both simulation and reality, indicating a faster door-opening process.

TABLE I: Comparison of the initial and final parameter distributions after optimization and identification. Each distribution is defined by µ and diag(Σ). The table also compares the optimized distributions for doors equipped with one or two springs.

Door Properties
  Door Mass (kg): .

Robot Properties
  Joint Damping (7 DoF):
    DoorGym:               µ_init = [100, 100, 100, 100, 100, 10, 0.4],           diag(Σ_init) = [2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 0.2]
    Door without spring:   µ_opt = [101.35, 100.06, 100.08, 99.61, 99.14, 10.05, 0.83], diag(Σ_opt) = [1.06, 1.83, 0.67, 0.41, 0.68, 0.64, 0.57]
    Door with one spring:  µ_opt = [98.11, 98.69, 100.38, 99.64, 101.01, 9.72, 1.17],   diag(Σ_opt) = [0.41, 0.63, 0.65, 0.15, 0.96, 0.04, 1.19]
    Door with two springs: µ_opt = [98.63, 99.28, 95.83, 100.12, 95.71, 11.55, 1.54],   diag(Σ_opt) = [1.23, 0.71, 1.42, 0.81, 0.77, 0.46, 0.67]

Gripper Properties (Left, Right)
  Sliding Friction:
    DoorGym:               µ_init = [0.5, 0.5],   diag(Σ_init) = [0.25, 0.25]
    Door without spring:   µ_opt = [0.47, 1.78],  diag(Σ_opt) = [0.70, 0.86]
    Door with one spring:  µ_opt = [1.37, 0.76],  diag(Σ_opt) = [0.35, 0.39]
    Door with two springs: µ_opt = [3.04, 2.68],  diag(Σ_opt) = [0.46, 0.54]
  Torsional Friction:
    DoorGym:               µ_init = [0.5, 0.5],   diag(Σ_init) = [0. , . ]
    Door without spring:   µ_opt = [1.38, 1.84],  diag(Σ_opt) = [0.66, 0.92]
    Door with one spring:  µ_opt = [0.79, 2.29],  diag(Σ_opt) = [0.42, 0.80]
    Door with two springs: µ_opt = [1.04, 2.0],   diag(Σ_opt) = [0.72, 0.83]

Fig. 6: Comparison of the door opening performance in reality for three methods (from top to bottom): normal DR, DROID with DR, and DROID without DR. The horizontal axis is the maximal opening angle of the door in degrees; the vertical axis is the percentile value for each bin. It compares the distributions of maximal angles over 30 runs for each method.

TABLE II: Comparisons of sim-to-real results

                                    DR(µ_init, Σ_init)   µ_opt    DR(µ_opt, Σ_opt)
Success Rate (angle > °)     sim
                             real                        . %      80%
Open Angle (mean ± std)      sim    . ± .                . ± .    . ± .
                             real   . ± .                . ± .    . ± .
Open Steps (mean ± std)      sim    . ± .                . ± .    . ± .
                             real   . ± .                . ± .    . ± .

D. Generalization of the Learned Policy
As mentioned in the previous section, we only used one human demonstration to determine the parameter distribution. This demonstration was, however, not used during RL, because it can only serve a door with the same dimensions. In this experiment, we tested the policy with three additional door knob locations, in simulation and on the real system, as shown in Fig. 2(c), to emulate doors with different dimensions. These locations were set 5 cm apart along the lever arm of the door. With DROID, the trained policy was able to pull the door knob at the different locations and open the door successfully. As such, we have verified that once the parameter distribution is identified using a human demonstration, different RL policies can be trained within the optimized distribution to accomplish tasks other than the one demonstrated.

V. CONCLUSION

In this paper, we proposed a novel and generic framework, Domain Randomization Optimization IDentification (DROID), to minimize the reality gap between simulated and real environments. The approach is designed to be applicable to a range of contact-rich manipulation tasks, such as door opening. By executing a human demonstration trajectory in both simulation and reality, the differences in their dynamics can be minimized by iteratively updating and identifying the distributions of the real dynamics parameters. Using a door opening task as an example, we have verified its capability to identify reasonable parameter distributions and thus reduce the reality gap. A successful RL policy can then be obtained by training on this distribution and directly transferring to the real world. The sim-to-real performance has been shown to be superior to training with typical DR and SI approaches. Finally, we also demonstrated that a generalized RL policy can be trained to accomplish different tasks, given that the system dynamics remains unchanged.

REFERENCES

[1] A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, "Sim-to-real robot learning from pixels with progressive nets," in Conference on Robot Learning. PMLR, 2017, pp. 262–270.
[2] L. J. Lin, "Scaling up reinforcement learning for robot control," in ICML, 1993.
[3] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[4] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
[5] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al., "Solving Rubik's cube with a robot hand," arXiv preprint arXiv:1910.07113, 2019.
[6] Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and P. Abbeel, "DoorGym: A scalable door opening environment and baseline agent," arXiv preprint arXiv:1908.01887, 2019.
[7] E. Valassakis, Z. Ding, and E. Johns, "Crossing the gap: A deep dive into zero-shot sim-to-real transfer for dynamics," arXiv preprint arXiv:2008.06686, 2020.
[8] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[10] Y.-Y. Tsai, B. Xiao, E. Johns, and G.-Z. Yang, "Constrained space optimization and reinforcement learning for complex tasks," IEEE Robotics and Automation Letters (RA-L), vol. 5, no. 2, 2020.
[11] G. Garcia-Hernando, E. Johns, and T.-K. Kim, "Physics-based dexterous manipulations with estimated hand poses and residual reinforcement learning," in International Conference on Intelligent Robots and Systems (IROS), 2020.
[12] W. Yu, J. Tan, C. K. Liu, and G. Turk, "Preparing for the unknown: Learning a universal policy with online system identification," arXiv preprint arXiv:1702.02453, 2017.
[13] A. Allevato, E. S. Short, M. Pryor, and A. Thomaz, "TuneNet: One-shot residual tuning for system identification and sim-to-real robot task transfer," in Conference on Robot Learning, 2020, pp. 445–455.
[14] R. Jeong, J. Kay, F. Romano, T. Lampe, T. Rothorl, A. Abdolmaleki, T. Erez, Y. Tassa, and F. Nori, "Modelling generalized forces with reinforcement learning for sim-to-real transfer," arXiv preprint arXiv:1910.09471, 2019.
[15] J. Liang, S. Saxena, and O. Kroemer, "Learning active task-oriented exploration policies for bridging the sim-to-real gap," arXiv preprint arXiv:2006.01952, 2020.
[16] M. Kaspar, J. D. M. Osorio, and J. Bock, "Sim2real transfer for reinforcement learning without dynamics randomization," arXiv preprint arXiv:2002.11635, 2020.
[17] D. Schafroth, C. Bermes, S. Bouabdallah, and R. Siegwart, "Modeling, system identification and robust control of a coaxial micro helicopter," Control Engineering Practice, vol. 18, no. 7, pp. 700–711, 2010.
[18] J. Tan, Z. Xie, B. Boots, and C. K. Liu, "Simulation-based design of dynamic controllers for humanoid balancing," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 2729–2736.
[19] A. Farchy, S. Barrett, P. MacAlpine, and P. Stone, "Humanoid robots learning to walk faster: From the real world to simulation and back," in Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, 2013, pp. 39–46.
[20] W. Yu, V. C. Kumar, G. Turk, and C. K. Liu, "Sim-to-real transfer for biped locomotion," arXiv preprint arXiv:1903.01390, 2019.
[21] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[22] S. Zhu, D. Surovik, K. Bekris, and A. Boularias, "Efficient model identification for tensegrity locomotion," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 2985–2990.
[23] N. Fazeli, R. Tedrake, and A. Rodriguez, "Identifiability analysis of planar rigid-body frictional contact," in Robotics Research. Springer, 2018, pp. 665–682.
[24] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[25] S. James, A. J. Davison, and E. Johns, "Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task," arXiv preprint arXiv:1707.02267, 2017.
[26] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, "Closing the sim-to-real loop: Adapting simulation randomization with real world experience," in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8973–8979.
[27] F. Ramos, R. C. Possas, and D. Fox, "BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators," arXiv preprint arXiv:1906.01728, 2019.
[28] R. Alghonaim and E. Johns, "Benchmarking domain randomisation for visual sim-to-real transfer," arXiv preprint arXiv:2011.07112, 2020.
[29] M. A. Beaumont, W. Zhang, and D. J. Balding, "Approximate Bayesian computation in population genetics," Genetics, vol. 162, no. 4, pp. 2025–2035, 2002.
[30] J. Peters, K. Mülling, and Y. Altun, "Relative entropy policy search," in AAAI, vol. 10. Atlanta, 2010, pp. 1607–1612.
[31] F. Muratore, C. Eilers, M. Gienger, and J. Peters, "Bayesian domain randomization for sim-to-real transfer," arXiv preprint arXiv:2003.02471, 2020.
[32] N. Hansen, "The CMA evolution strategy: a comparing review," in Towards a New Evolutionary Computation. Springer, 2006, pp. 75–102.
[33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[34] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 5026–5033.