Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system
Kendall Lowrey, Svetoslav Kolev, Jeremy Dao, Aravind Rajeswaran, Emanuel Todorov
Abstract — Reinforcement learning has emerged as a promising methodology for training robot controllers. However, most results have been limited to simulation due to the need for a large number of samples and the lack of automated-yet-safe data collection methods. Model-based reinforcement learning methods provide an avenue to circumvent these challenges, but the traditional concern has been the mismatch between the simulator and the real world. Here, we show that control policies learned in simulation can successfully transfer to a physical system, composed of three Phantom robots pushing an object to various desired target positions. We use a modified form of the natural policy gradient algorithm for learning, applied to a carefully identified simulation model. The resulting policies, trained entirely in simulation, work well on the physical system without additional training. In addition, we show that training with an ensemble of models makes the learned policies more robust to modeling errors, thus compensating for difficulties in system identification. The results are illustrated in the accompanying video.
I. INTRODUCTION

Non-prehensile object manipulation remains a challenging control problem in robotics. In this work, we focus on a particularly challenging system using three Phantom robots as fingers. These are haptic robots that are torque-controlled and have higher bandwidth than the fingers of existing robotic hands. In terms of speed and compliance (but not strength), they are close to the capabilities of the human hand. This makes them harder to control, especially in non-prehensile manipulation tasks where the specifics of each contact event and the balance of contact forces exerted on the object are very important and need to be considered by the controller in some form.

Here we develop a solution using Reinforcement Learning (RL). We use the MuJoCo physics simulator [1] as the modeling platform and fit the parameters of the simulation model to match the real system as closely as possible. For policy learning with the model, we use a normalized natural policy gradient (NPG) method [2], [3], [4]. While RL methods such as NPG are in principle model-free, in practice they require large amounts of data. In the absence of an automatic way to generate safe exploration controllers, learning is largely possible only in simulation. Indeed, the vast majority of recent results in continuous RL have been obtained in simulation. These studies often propose to extend the corresponding methods to physical systems in future work, but the scarcity of such results indicates that 'sim-to-real' transfer is harder than it seems. The few successful applications to real robots have been in tasks involving position or velocity control that avoid some difficulties of torque-controlled platforms.

To obtain an accurate simulation model, we fit the parameters of the simulation model using a physically consistent system identification procedure [5] developed specifically for identification in contact-rich settings such as the one we study. While true model-free RL may one day become feasible, we believe leveraging the capabilities of a physics simulator will always help speed up the learning process.

As with any controller developed in simulation, performance on the real system is likely to degrade due to modeling errors. To assess, as well as mitigate, the effects of these errors, we compare learning with respect to three different models: (i) the nominal model obtained from system identification; (ii) a modified model where we intentionally mis-specify some model parameters; (iii) an ensemble of models where the mean is wrong but the variance is large enough to include the nominal model. The purpose of (ii) is to simulate a scenario where system identification has not been performed well, and we wish to study the performance degradation. The goal with ensemble approaches [6], [7], [8] is to study whether including a distribution over models can compensate for inaccuracies in system identification during control tasks. We find that (i) achieves the best performance as expected, and (iii) is robust but suffers a degradation in performance. Videos of the trained policies are available at: https://sites.google.com/view/phantomsim2real

*This work was supported by the NSF. University of Washington, Roboti LLC. Correspond to: [email protected]
A. Related Work
There are many methods for developing safe and robust robot controllers. Robot actions that involve dynamic motions require not only precise control execution, but also robust compensation when the action inevitably does not go according to plan; the physics of the real world are notoriously uncooperative. Control methods that depend on physical models, whether full or reduced/simplified, are able to produce dynamic actions [9], [10], [11], [12], [13], [14]. They frequently rely on physics simulations for testing purposes before usage on real hardware. This step is critical, as any modeling errors can significantly contribute to poor performance or even hardware damage [15]. Including uncertainty in the planning stage is one way to avoid this problem, and may also enable model learning simultaneously [16], [17], [18], [19], [20]. These model-centric approaches offer strong performance expectations but, unless uncertainty or robustness is explicitly taken into account, may be brittle to external unknowns [21].

On the other hand, Reinforcement Learning offers a means to directly learn from the robot's experience [22], [23]. The difficulty, of course, is where the robot's experiences come from: as RL algorithms may need significant amounts of data, collecting it on hardware may be infeasible [24]. Directly training on hardware has been feasible in some cases [25], [26], but domains with many dimensions or highly nonlinear dynamics will always require more data to sufficiently explore, human demonstrations for imitation learning, and/or parameterized explorations [27], [28], [29], [30]. Another common issue with learning in the real world is how to reset the state of the system, with some work being done in this direction [31]. For sensitive and delicate systems, the only safe place to perform learning is in simulation. Transferring to real hardware can take many approaches as well, either through adaptation [32], [33], or by incorporating uncertainty [34], [35].

This work focuses on using a physics simulator to train policies for manipulation using reinforcement learning. As the manipulation is non-prehensile, we do not use any demonstrations or guide the policy search. To facilitate transfer to hardware, we also avoid the use of an estimator (i.e., the use of a model to predict state, like a Kalman filter) by learning a function that directly converts sensor values to motor torques. The policy is then transferred to the hardware for evaluation, and we show that even for incorrect models used during training, useful policies are obtained by using an ensemble of models. Sections II and III detail the RL problem formulation and solution. Section IV explains the hardware platform, and details of the manipulation task are in Section V. Finally, Section VI contains the results and Section VII the discussion.

II. PROBLEM FORMULATION

We model the control problem as a Markov decision process (MDP) in the episodic average reward setting, which is defined using the tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \rho_0, T\}$. Here $\mathcal{S} \subseteq \mathbb{R}^n$, $\mathcal{A} \subseteq \mathbb{R}^m$, and $\mathcal{R} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ are the (continuous) set of states, set of actions, and the reward function respectively; $\mathcal{T} : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is the stochastic transition function; $\rho_0$ is the probability distribution over initial states; and $T$ is the maximum episode length. We wish to solve for a stochastic policy of the form $\pi : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, which optimizes the average reward accumulated over the episode. Formally, the performance of a policy is evaluated according to:

$$\eta(\pi) = \frac{1}{T} \, \mathbb{E}_{\pi, \mathcal{M}} \left[ \sum_{t=1}^{T} r_t \right]. \quad (1)$$
In this finite-horizon rollout setting, we define the value, Q, and advantage functions as follows:

$$V^{\pi}(s, t) = \mathbb{E}_{\pi, \mathcal{M}} \left[ \sum_{t' = t}^{T} r_{t'} \right]$$

$$Q^{\pi}(s, a, t) = \mathbb{E}_{\mathcal{M}} \left[ \mathcal{R}(s, a) \right] + \mathbb{E}_{s' \sim \mathcal{T}(s, a)} \left[ V^{\pi}(s', t+1) \right]$$

$$A^{\pi}(s, a, t) = Q^{\pi}(s, a, t) - V^{\pi}(s, t)$$

We consider parametrized policies $\pi_{\theta}$, and hence wish to optimize for the parameters $\theta$. In this work, we represent $\pi_{\theta}$ as a multivariate Gaussian with diagonal covariance. In our experiments, we use an affine policy as our function approximator, visualized in figure 6.
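To make the policy parametrization concrete, the sketch below implements an affine Gaussian policy with diagonal covariance and its score function $\nabla_\theta \log \pi_\theta(a|s)$, the quantity needed by the policy gradient in eq. (2). This is an illustrative sketch rather than the authors' Julia implementation; the struct and function names are assumptions, and only the general form (affine mean, diagonal covariance) comes from the text.

```julia
using LinearAlgebra, Random

# Affine Gaussian policy: a ~ N(W*s + b, diag(exp.(2 .* log_std)))
struct AffineGaussianPolicy
    W::Matrix{Float64}        # act_dim x obs_dim gain matrix
    b::Vector{Float64}        # act_dim bias
    log_std::Vector{Float64}  # per-dimension log standard deviation
end

# Sample an action from the policy at state s
function sample_action(p::AffineGaussianPolicy, s::Vector{Float64})
    mu = p.W * s + p.b
    return mu .+ exp.(p.log_std) .* randn(length(mu))
end

# Score function: gradient of log pi(a|s) w.r.t. the mean parameters (W, b).
# For a diagonal Gaussian, d(log pi)/d(mu) = (a - mu) ./ sigma^2, and the
# chain rule gives the gradients w.r.t. W and b below.
function grad_log_prob(p::AffineGaussianPolicy, s, a)
    mu = p.W * s + p.b
    var = exp.(2 .* p.log_std)
    dmu = (a .- mu) ./ var
    return (dW = dmu * s', db = dmu)   # same shapes as W and b
end
```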
Algorithm 1 Distributed Natural Policy Gradient
  Initialize policy parameters to θ_0
  for k = 1 to K do
    Distribute policy and value function parameters.
    for w = 1 to N workers do
      Collect trajectories {τ^(1), ..., τ^(N)} by rolling out the stochastic policy π(·; θ_k).
      Compute ∇_θ log π(a_t | s_t; θ_k) for each (s, a) pair along trajectories sampled in iteration k.
      Compute advantages A^π_k based on trajectories in iteration k and the approximate value function V^π_{k-1}.
      Compute the policy gradient according to eq. (2).
      Compute the Fisher matrix.
      Return the Fisher matrix, gradient, and value function parameters to the central server.
    end for
    Average the Fisher matrices and gradients, and perform normalized gradient ascent (eq. 4).
    Update parameters of the value function.
  end for
III. METHOD
A. Natural Policy Gradient
Policy gradient algorithms are a class of RL methods where the parameters of the policy are directly optimized, typically using gradient-based methods. Using the score function gradient estimator, the sample-based estimate of the policy gradient can be derived to be [36]:

$$\hat{g} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) \, \hat{A}^\pi(s_t^i, a_t^i, t) \quad (2)$$

A straightforward gradient ascent using the above gradient estimate is the REINFORCE algorithm [36]. Gradient ascent in this direction is sub-optimal, since it is not the steepest ascent direction in the metric of the parameter space [37]. Consequently, a local search approach that moves along the steepest ascent direction was proposed by Kakade [2], called the natural policy gradient. This has been expanded upon in subsequent works [3], [4], [38], [39], [40], and forms a critical component in state-of-the-art RL algorithms. The natural policy gradient is obtained by solving the following local optimization problem around iterate $\theta_k$:

$$\text{maximize}_{\theta} \;\; g^T (\theta - \theta_k) \quad \text{subject to} \quad (\theta - \theta_k)^T F_{\theta_k} (\theta - \theta_k) \leq \delta, \quad (3)$$

where $F_{\theta_k}$ is the Fisher information metric at the current iterate $\theta_k$. We apply a normalized gradient ascent procedure, which has been shown to further stabilize the training process [38], [39], [4]. This results in the following update rule:

$$\theta_{k+1} = \theta_k + \sqrt{\frac{\delta}{g^T F_{\theta_k}^{-1} g}} \; F_{\theta_k}^{-1} g. \quad (4)$$

The version of the natural policy gradient outlined above was chosen for simplicity and ease of implementation. The natural gradient performs covariant updates by rescaling the parameter updates according to curvature information present in the Fisher matrix, thus behaving almost like a second-order optimization method. Furthermore, due to the normalized gradient procedure, the gradient information is insensitive to linear rescaling of the reward function, improving training stability. For estimating the advantage function, we use the GAE procedure [41] and use a quadratic function approximator with all the terms in $s$ for the baseline.
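As a concrete reference for eq. (4), the sketch below computes the normalized natural gradient step from a gradient estimate g and a Fisher matrix F. It is an illustrative sketch rather than the authors' code; the step size δ and the small damping term added before the solve are placeholder values, not quantities specified in the paper.

```julia
using LinearAlgebra

# Normalized natural policy gradient step, following eq. (4):
# theta_new = theta + sqrt(delta / (g' F^-1 g)) * F^-1 g
function npg_update(theta::Vector{Float64}, g::Vector{Float64},
                    F::Matrix{Float64}; delta = 0.05, damping = 1e-6)
    # Damped linear solve instead of an explicit inverse, for numerical stability
    Fg = (F + damping * I) \ g                       # F^-1 g
    step_size = sqrt(delta / max(dot(g, Fg), eps())) # normalized step length
    return theta .+ step_size .* Fg
end
```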
B. Distributed Processing

As the natural policy gradient algorithm is an on-policy method, all data is collected from the current policy. However, the NPG algorithm allows the rollouts and most computation to be performed independently, as only the gradient and the Fisher matrix need to be synced. Independent processes can compute the gradient and Fisher matrix, with a centralized server averaging these values and performing the matrix inversion and gradient step as in equation (4). The new policy is then shared with each worker. The total size of messages passed is proportional to the size of the Fisher matrix used for the policy, and linear in the number of worker nodes. Policies with many parameters may experience message passing overhead, but the trade-off is that each worker can perform as many rollouts as needed during sample collection without changing the message size, encouraging more data gathering (which large policies require). A summary of the distributed algorithm we used is shown in Algorithm 1.
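A minimal sketch of the central-server step described above: each worker returns its local gradient and Fisher matrix, and the server averages them before applying the normalized update of eq. (4) (the npg_update function sketched earlier). How the messages are actually passed between nodes is omitted, since it depends on the cluster setup.

```julia
# Average per-worker gradients and Fisher matrices, then take one NPG step.
# `worker_grads` and `worker_fishers` are the quantities each worker returns
# in Algorithm 1; this sketch assumes they have already been gathered.
function central_update(theta, worker_grads::Vector{Vector{Float64}},
                        worker_fishers::Vector{Matrix{Float64}}; delta = 0.05)
    g = sum(worker_grads) / length(worker_grads)      # averaged gradient
    F = sum(worker_fishers) / length(worker_fishers)  # averaged Fisher matrix
    return npg_update(theta, g, F; delta = delta)     # eq. (4) step
end
```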
Fig. 1. Learning curve of our linear/affine policy (average policy training reward versus training iteration, 3 seeds). We show here the curve for the policy trained with the correct mass as a representative curve.
We implemented our RL code and interfaced with the MuJoCo simulator using the Julia programming language [42]. The built-in multi-processing and multi-node capabilities of Julia facilitated this distributed algorithm's performance; we are able to train a linear policy on this task in less than 3 minutes on a 4-node cluster with Intel i7-3930K processors.

IV. HARDWARE AND PHYSICS SIMULATION
Fig. 2. Phantom Manipulation Platform.
A. System Overview
We use our Phantom Manipulation Platform as our hardware testbed. It consists of three Phantom Haptic Devices, each acting as a robotic finger. Each haptic device is a 3-DOF cable-driven system, shown in figure 2. The actuation is done with Maxon motors (Model RE 25).

B. Sensing
As we wish to learn control policies that map from observations to controls, the choice of observations is critical to successful learning. Each Phantom is equipped with 3 optical encoders at a resolution of 5K steps per radian. We use a low-pass filter to compute the joint velocities. We also rely on a Vicon motion capture system, which gives us position data at 240Hz for the object we are manipulating; we assume the object remains upright and do not include orientation. While being quite precise ( . mm error), the overall accuracy is significantly worse ( < mm) due, in part, to imperfect object models and camera calibrations. While the Phantoms' joint position sensors are noiseless, they often have small biases due to imperfect calibration. One Phantom robot is equipped with an ATI Nano17 3-axis force/torque sensor. This data is not used during training or in any learned controller, but is used as a means of hardware/simulation comparison described in a later section. The entire system is simulated for policy training in the MuJoCo physics engine [1].

In total, our control policy has an observational input of 36 dimensions and 9 actuator outputs: the 9 positions and 9 velocities of the Phantoms are converted to 15 positions and 15 velocities for modeling purposes due to the parallel linkages. We additionally use 3 positions for both the manipulated object and the tracked goal, with 9 outputs for the 3 actuators per Phantom. Velocity observations are not used for the object, as this would require the state estimation that we have deliberately avoided.
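To make the observation layout concrete, the sketch below assembles the 36-dimensional policy input described above (15 joint positions, 15 joint velocities, 3 object coordinates, 3 goal coordinates) and evaluates the affine policy mean to obtain the 9 motor torques. Only the dimensions come from the text; the ordering within the vector and the function names are illustrative assumptions, and the AffineGaussianPolicy struct is reused from the earlier sketch.

```julia
# Assemble the 36-dimensional observation described in the text:
# 15 joint positions, 15 joint velocities, 3 object xyz, 3 goal xyz.
# The ordering and names here are assumptions for illustration.
function build_observation(qpos::Vector{Float64},      # 15 modeled joint positions
                           qvel::Vector{Float64},      # 15 modeled joint velocities
                           obj_pos::Vector{Float64},   # 3: object position (Vicon)
                           goal_pos::Vector{Float64})  # 3: target position
    @assert length(qpos) == 15 && length(qvel) == 15
    @assert length(obj_pos) == 3 && length(goal_pos) == 3
    return vcat(qpos, qvel, obj_pos, goal_pos)          # 36-dimensional input
end

# The affine policy maps the observation directly to 9 motor torques
# (3 actuators per Phantom), with no intermediate state estimator.
policy_torques(p::AffineGaussianPolicy, obs) = p.W * obs + p.b  # length-9 output
```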
C. System Identification

System identification of model parameters was performed in our prior work [5], but modeling errors are difficult to eliminate completely. For system ID we collected various behaviors with the robots, ranging from effector motion in free space, to infer intrinsic robot parameters, to manipulation examples such as touching, pushing and sliding between the end effector and the object, to infer contact parameters. The resulting data is fed into the joint system ID and state estimation optimization procedure. As explained in [5], state estimation is needed when doing system ID in order to eliminate the small amounts of noise and biases that our sensors produce.

The recorded behaviors are represented as a list of sensor readings $S = \{s_1, s_2, \ldots, s_n\}$ and motor torques $U = \{u_1, u_2, \ldots, u_n\}$. State estimation means finding a trajectory of states $Q = \{q_1, q_2, \ldots, q_n\}$. Each state is a vector $q_i = (\theta_1, \ldots, \theta_{k'}, x, y, z, q_w, q_x, q_y, q_z)$, representing joint angles and object position. We also perform system ID, which is finding the set of parameters $P$, which include coefficients of friction, contact softness, damping coefficients, link inertias and others. We then pose the joint system ID and estimation problem as the following optimization:

$$\min_{P, Q} \; \sum_{i=1 \ldots n} \| \hat{\tau}_i - u_i \|_* + \sum_{i=1 \ldots n} \| \hat{s}_i - s_i \|_*$$

where $\hat{\tau}_i$ (predicted control signal) and $\hat{s}_i$ (predicted sensor outputs) are computed by the inverse dynamics generative model of MuJoCo: $(\hat{\tau}_i, \hat{s}_i) = \texttt{mj\_inverse}(q_{i-1}, q_i, q_{i+1})$. The optimization problem is solved via Gauss-Newton [5].
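For illustration, the objective above can be assembled as in the sketch below, treating MuJoCo's inverse dynamics as a black-box function with the signature given in the text (the model parameters are passed explicitly here for clarity). Squared-error norms are assumed for the ‖·‖_* terms; this is a sketch of the objective structure only, not the estimation code from [5].

```julia
# Sketch of the joint system-ID / state-estimation objective.
# `mj_inverse` stands in for MuJoCo's inverse dynamics generative model,
# called with (q_{i-1}, q_i, q_{i+1}) as in the text; `params` are the model
# parameters P (friction, contact softness, damping, inertias, ...).
function sysid_objective(params, Q::Vector{Vector{Float64}},
                         S::Vector{Vector{Float64}}, U::Vector{Vector{Float64}},
                         mj_inverse::Function)
    total = 0.0
    for i in 2:(length(Q) - 1)
        tau_hat, s_hat = mj_inverse(params, Q[i-1], Q[i], Q[i+1])
        total += sum(abs2, tau_hat - U[i])   # predicted vs. applied torques
        total += sum(abs2, s_hat - S[i])     # predicted vs. measured sensors
    end
    return total
end
```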
V. TASK & EXPERIMENTS

In this section we first describe the manipulation task used to evaluate learned policy performance, then describe the practical considerations involved in using the NPG algorithm in this work. Finally, we describe two experiments evaluating learned policy performance in both simulation and on hardware.

A. Task Description

We use the NPG algorithm to learn a pushing task. The goal is to reduce the distance of the object, in this case a cylinder, to a target position as much as possible. This manipulation task requires that contacts be made and broken multiple times during a pushing movement. As there are no state constraints involved in this RL algorithm, we cannot guarantee that the object will reach the target location (the object can be pushed into an unreachable location). We feel that this is an acceptable trade-off if we can achieve more robust control over a wider state space.

For these tasks we model the bases of each Phantom as fixed, arrayed roughly equilaterally around the object being manipulated; this is to achieve closure around the object. We do not enforce a precise location for the bases, both to make the manipulation tasks more challenging and because we expect them to shift during operation regardless.
B. Training Considerations
Policy training is the process of discovering which actions the controller should take from the current state to achieve high reward. As such, the training setup has implications for how well the final policy performs. Training structure informs the policy of good behavior, but must be weighed against the time required to craft the reward function. For this task, we use a very simple reward structure. In addition to the primary reward of reducing the distance between the object and the goal location, we provide the reward function with terms that reduce the distance between each fingertip and the object. This kind of hint term is common in both reinforcement learning and trajectory optimization. There is also a small control cost on the action that penalizes using too much torque. The entire reward function at time $t$ is as follows:

$$R_t(s, a) = 1 - \|O_{xy} - G_{xy}\| - \sum_{i=1}^{3} \|f_i - O_{xy}\| - c\,\|a\|^2 \quad (5)$$

where $O_{xy}$ is the current position of the object on the $xy$-plane, $G_{xy}$ is the goal position, $f_i$ is the position of each Phantom end effector, and $c$ is a small control-cost weight. The state $s$ consists of joint positions, velocities, and goal position. The actions $a$ are torques.

Fig. 3. In these plots, we seed our simulator with the state measured from the hardware system. We use the simulator to calculate the instantaneous forces measured by a simulated force sensor, per time-step of real-world data, and compare it to the data collected from hardware. These plots are in the frame of the contact, with the force along the normal axis being greatest. Note that the Y-axis of the Normal Axis Torque plot is different from the other torque plots.

The initial state of each trajectory rollout has the Phantom robots at randomized joint positions, deliberately not contacting the cylinder object. The cylinder location is kept at the origin of the $xy$-plane, but the desired goal location is set to a uniformly random point within a 12cm diameter circle around the origin. To have more diverse initializations and to encourage robust policies, new initial states have a chance of starting at some state from one of the previous iteration's trajectories, provided the previous trajectory had a high reward. If the initial state was a continuation of the previous trajectory, the target location was again randomized: performing well previously only gives an initial state, and the policy needs to learn to push the object to a new goal. This is similar to a procedure outlined in [43] for training interactive and robust policies.

Finally, to further encourage robust behavior, we vary the location of the base of the Phantom robots by adding Gaussian noise before each rollout. The standard deviation of this noise is 0.5cm. In this way an ensemble of models is used in discovering robust behavior. As discussed above, we expect to have imprecisely measured the base locations and expect each base to potentially shift during operation. We examine this effect more closely as one of our experiments.
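A direct translation of the reward in eq. (5) is sketched below for reference. The control-cost weight c and the exact form of the action penalty are assumptions (the paper only states that a small control cost is used); positions are 2-D xy coordinates.

```julia
using LinearAlgebra

# Reward from eq. (5): distance of object to goal, fingertip-to-object
# "hint" terms, and a small control cost. The weight `c` on the action
# penalty is a placeholder value, not taken from the paper.
function pushing_reward(obj_xy::Vector{Float64}, goal_xy::Vector{Float64},
                        fingertips_xy::Vector{Vector{Float64}},  # 3 end effectors
                        a::Vector{Float64}; c = 0.1)
    r = 1.0 - norm(obj_xy - goal_xy)
    for f_xy in fingertips_xy
        r -= norm(f_xy - obj_xy)          # hint terms: keep fingers near object
    end
    r -= c * sum(abs2, a)                 # control cost on the torques
    return r
end
```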
C. Experiments

We devised two experiments to explore the validity of the NPG reinforcement learning algorithm for discovering robust policies for difficult robotic manipulation tasks.

First, we collect runtime data of positions, velocities, and force-torque measurements from a sensor-equipped Phantom using the best performing controller we have learned. We use this hardware data (positions and calculated velocities of the system) to seed our simulator, where instantaneous forces are calculated using a simulated force-torque sensor. This data is compared to force-torque data collected from the hardware. Instantaneous force differences highlight the inaccuracies between a model in simulation and data in the real world that eventually lead to divergence.

We also compare short trajectory rollouts in simulation that have been seeded with data collected on hardware. This compares the policy's behavior, not the system's, as we wish to examine the performance of the policy in both hardware and simulation. Ideally, if the simulated environment matches the hardware, it can be taken as an indication that the system identification has been performed correctly. Secondly, we would like to compare the behavior of the learned control policy. From the perspective of task completion, the similarity or divergence of sim and real is less important as long as the robot completes the task satisfactorily. Said another way, poor system identification or sim/real divergence matters less if the robot gets the job done. For these experiments, the target location is set by the user by moving a second tracked object above the cylinder. This data was recorded and used to collect the above datasets for analysis.

Our second experiment shows how the use of model ensembles during training affects robustness and feasibility. To do this, we explicitly vary the mass of the object being manipulated. The object (cylinder) was measured to be 0.34kg in mass; therefore, we train a policy with the mass set to this value. The object's mass was chosen to be modified due to the very visible effects an incorrect mass would have on performance. We train two additional policies, both with a mass of 0.4kg (approximately 20% more mass). One of the additional policies is trained with an approximated ensemble: we add Gaussian noise to the object mass parameter with a standard deviation of 0.03kg (30 grams). All three policies are evaluated in simulation with a correctly measured object mass, and in the real world with our 0.34kg cylinder.

To evaluate the policies, we calculate a path for the target to follow. The path is a spiral from the origin moving outward until it reaches a radius of 4cm, at which point it changes to a circular path and makes a full rotation, still at a radius of 4cm. This takes 4 seconds to complete. This path was programmatically set in both the simulator and on the real hardware to be consistent. The object ideally follows this trajectory, as it presents a very visible means to explore policy performance.
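The ensembles used here can be viewed as a per-rollout perturbation of the nominal model, sketched below; the same pattern covers both the mass ensemble of this experiment and the base-position noise used during training. The nominal base positions are placeholders, and the step that pushes the perturbed values into the simulator is left abstract since it depends on the model interface.

```julia
using Random

# Per-rollout model perturbation used to approximate an ensemble:
# the object mass is drawn around a (possibly incorrect) nominal value,
# and each Phantom base is jittered, as described in the training setup.
function sample_model_parameters(; nominal_mass = 0.4,      # kg (mis-specified)
                                   mass_std = 0.03,          # kg (30 g)
                                   base_positions = [zeros(3) for _ in 1:3],
                                   base_std = 0.005)         # m (0.5 cm)
    mass = nominal_mass + mass_std * randn()
    bases = [p .+ base_std .* randn(3) for p in base_positions]
    return (mass = mass, base_positions = bases)
end

# Before each rollout: params = sample_model_parameters(); apply `params`
# to the simulation model, then roll out the current policy.
```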
VI. RESULTS
The results for the two experiments are presented as follows, with additional discussion in the next section.

Fig. 4. 10 rollouts are performed where the target position of the object is the path that spirals from the center outward (in black) and then performs a full circular sweep (the plots represent a top-down view). We compare three differently trained policies: one where the mass of the object cylinder is 0.34kg, one where the mass is increased by 20 percent (to 0.4kg), and finally a policy trained with the incorrect mass but with model noise added during training (at standard deviation 0.03) to create an ensemble. We evaluate these policies on the correct (0.34kg) mass in both simulation and on hardware. In both, the policy trained with the incorrect mass does not track the goal path, sometimes even losing the object. We also calculate the per time-step error from the goal path, averaged over all 10 rollouts (right-most plots); there is usually a non-zero error between the object and the reference due to the feedback policy having to 'catch up' to the reference.
TABLE I
AVERAGE DISTANCE FROM TARGET, 10 ROLLOUTS

                               Sim       Hardware
  0.34kg Policy (correct)      2.1cm     2.33cm
  0.4kg Policy (incorrect)     2.65cm    3.4cm
  0.4kg Policy (ensemble)      2.15cm    2.52cm
A. Simulation vs Hardware
We show comparisons between calculated forces and torques in simulation and hardware in figure 3. Our simulated values closely match the sensed hardware values. However, the discretization of the hardware sensors for the joint positions and velocities is not as precise as in simulation, which may result in a different calculation of instantaneous forces. While MuJoCo can represent soft contacts, the parameters defining them were not identified accurately. Critically, we can see that when contact is not being made, the simulated and hardware sensors are in agreement.

We find that our learned controllers are still able to perform the task well, despite differences between simulation and hardware. We can see in figure 5 that for the correct policy (learned with a correct model), when we perform a rollout in simulation seeded with hardware data, the simulated rollout is close to the data collected from hardware. The policy performance in simulation is close to the policy performance in hardware. This is not the case for the policy trained using the incorrect mass and the ensemble policy, where the simulated rollout differs from the hardware data. We hypothesize that policies trained on specific models over-fit to these models, and take advantage of the specifics of the model to complete the task. Even though the correct simulation is similar to hardware, the controller's behavior could cause divergence due to whatever small parameter differences remain. The ensemble policy, as expected, lies somewhere between the correct policy and the incorrect policy (trained with incorrect parameters).
B. Training with Ensembles
We find training policies with model ensembles to be particularly helpful. Despite being given a very incorrect mass for the object, the policy trained with the ensemble performed very well (figure 4). In addition to performing well in simulation, we found it to perform nearly as well as the correctly trained policy on hardware. Given that these are learned feedback controllers, there is a distinct lag of the object behind the desired reference trajectory. Table I quantifies the error from the reference trajectory across the whole length of the action. This mirrors our comparison in the previous section, where the correct policy performs comparably in both hardware and simulation, with the other policies less so. This is important to note given the poor performance of the incorrect policy: this task's training is indeed sensitive to this model parameter. The implication of the ensemble approach is not just that it can overcome poor or incorrect modeling, but that it can provide a safe initial policy to collect valuable data to improve the model.

Fig. 5. We show the effects of different controllers. We seed our simulator with the hardware's state and, in simulation, perform a rollout of 200 time-steps (about 0.1 seconds). This is rendered as the black lines above. The correct policy (trained with the measured mass) has rollouts that closely match the measured hardware state data. The incorrect policy (trained with an incorrect mass) performs differently in simulation. The remaining ensemble policy performs better than the incorrect one; this demonstrates that a 'safe' policy can be learned to at least begin an iterative data collection process. While it could be expected that the policies perform similarly in both simulation and hardware, we see that this is not the case here. A policy trained on an incorrect model would over-fit to that model, and changing to one of two different models (i.e. simulation or hardware) can have unintuitive effects.

VII. DISCUSSION

Our results suggest two interesting observations. Simulation can provide a safe backdrop to test and develop non-intuitive controllers (see figure 6). This controller was developed for a robotic system without an intermediate controller such as PID, and without human demonstrations to shape the behavior. We also eschewed the use of a state estimator during training and run-time, as this would add additional modeling reliance and complexity. Our simulation-based policy learning approach also conveniently allows for building robust controllers by creating ensembles of models through varying physical parameters.

Fig. 6. We render the policy parameters. A distinct pattern can be seen in the three negative weights in the top left of the pos and vel blocks. These correspond to the control outputs for each of the three Phantoms' turret joints; as each robot is sensitive, the policy outputs a negative gain to avoid large control forces. Additionally, we can see that the first and second Phantoms contribute primarily to the object's X location, while the third Phantom handles the Y location. This linear policy can be likened to a hand-crafted feedback policy, but the numerous non-zero values would be unintuitive for a human to select. Presumably they contribute to the controller's robustness as learned through this method.

We show how simulated ensemble methods provide two major benefits. Firstly, they can partially make up for incorrectly measured or identified model parameters. This benefit should be obvious: it can be difficult to measure model parameters affecting nonlinear physical phenomena. Additionally, training with an ensemble has the added benefit of allowing for more conservative policies that enable appropriate data collection for actual model improvement. A natural extension of this observation would be full model adaptation using a technique such as EPOpt [7]. Model adaptation provides a bridge between model-based and model-free methods. Leveraging a model in simulation can provide a useful policy to begin robot operation, which can subsequently be fine-tuned on hardware in a model-free mode. Very dynamic behaviors may not be suited to direct hardware training without significant human-imposed safety constraints, which may take significant time to develop and may not account for all use cases. Given that most robots are manufactured using modern techniques, a model to be used in simulation is very likely to exist, and this model should be leveraged to obtain better policies.
ACKNOWLEDGEMENTS
This work was supported in part by the NSF. The authors would like to thank Vikash Kumar and the anonymous reviewers for valuable feedback.
REFERENCES
[1] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 5026–5033.
[2] S. Kakade, "A natural policy gradient," in NIPS, 2001.
[3] J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, pp. 1180–1190, 2007.
[4] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade, "Towards generalization and simplicity in continuous control," in NIPS, 2017.
[5] S. Kolev and E. Todorov, "Physically consistent state estimation and system identification for contacts," in IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), 2015, pp. 1036–1043.
[6] I. Mordatch, K. Lowrey, and E. Todorov, "Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 5307–5314.
[7] A. Rajeswaran, S. Ghotra, S. Levine, and B. Ravindran, "EPOpt: Learning robust neural network policies using model ensembles," arXiv preprint arXiv:1610.01283, 2016.
[8] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," CoRR, vol. abs/1710.06537, 2017.
[9] Y. Tassa, T. Erez, and E. Todorov, "Synthesis and stabilization of complex behaviors through online trajectory optimization," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 4906–4913.
[10] I. Mordatch, E. Todorov, and Z. Popović, "Discovery of complex behaviors through contact-invariant optimization," ACM Transactions on Graphics (TOG), vol. 31, no. 4, p. 43, 2012.
[11] Y. Tassa, N. Mansard, and E. Todorov, "Control-limited differential dynamic programming," in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 1168–1175.
[12] S. Feng, E. Whitman, X. Xinjilefu, and C. G. Atkeson, "Optimization-based full body control for the DARPA robotics challenge," Journal of Field Robotics, vol. 32, no. 2, pp. 293–312, 2015.
[13] J. Koenemann, A. Del Prete, Y. Tassa, E. Todorov, O. Stasse, M. Bennewitz, and N. Mansard, "Whole-body model-predictive control applied to the HRP-2 humanoid," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 3346–3351.
[14] M. Raibert, K. Blankespoor, G. Nelson, and R. Playter, "BigDog, the rough-terrain quadruped robot," IFAC Proceedings Volumes, vol. 41, no. 2, pp. 10822–10825, 2008.
[15] D. Nguyen-Tuong and J. Peters, "Model learning for robot control: a survey," Cognitive Processing, vol. 12, no. 4, pp. 319–340, 2011.
[16] I. Mordatch, N. Mishra, C. Eppner, and P. Abbeel, "Combining model-based policy search with online model learning for control of physical humanoids," in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 242–248.
[17] Y. Pan and E. A. Theodorou, "Data-driven differential dynamic programming using Gaussian processes," in American Control Conference (ACC), 2015, pp. 4467–4472.
[18] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Optimizing walking controllers for uncertain inputs and environments," ACM Transactions on Graphics (TOG), vol. 29, no. 4, p. 73, 2010.
[19] G. Lee, S. S. Srinivasa, and M. T. Mason, "GP-ILQG: Data-driven robust optimal control for uncertain nonlinear dynamical systems," arXiv preprint arXiv:1705.05344, 2017.
[20] S. Ross and J. A. Bagnell, "Agnostic system identification for model-based reinforcement learning," arXiv preprint arXiv:1203.1007, 2012.
[21] M. Johnson, B. Shrewsbury, S. Bertrand, T. Wu, D. Duran, M. Floyd, P. Abeles, D. Stephen, N. Mertins, A. Lesman, et al., "Team IHMC's lessons learned from the DARPA robotics challenge trials," Journal of Field Robotics, vol. 32, no. 2, pp. 192–208, 2015.
[22] J. Peters, S. Vijayakumar, and S. Schaal, "Reinforcement learning for humanoid robotics," in Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots, 2003, pp. 1–20.
[23] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[24] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[25] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3389–3396.
[26] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
[27] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, "Path integral guided policy search," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3381–3388.
[28] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine, "Collective robot reinforcement learning with distributed asynchronous guided policy search," arXiv preprint arXiv:1610.00673, 2016.
[29] J. Kober and J. R. Peters, "Policy search for motor primitives in robotics," in Advances in Neural Information Processing Systems, 2009, pp. 849–856.
[30] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," CoRR, vol. abs/1709.10087, 2017.
[31] W. Montgomery, A. Ajay, C. Finn, P. Abbeel, and S. Levine, "Reset-free guided policy search: Efficient deep reinforcement learning with stochastic initial states," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3373–3380.
[32] I. Menache, S. Mannor, and N. Shimkin, "Basis function adaptation in temporal difference reinforcement learning," Annals of Operations Research, vol. 134, no. 1, pp. 215–238, 2005.
[33] S. Barrett, M. E. Taylor, and P. Stone, "Transfer learning for reinforcement learning on a physical robot," in Ninth International Conference on Autonomous Agents and Multiagent Systems - Adaptive Learning Agents Workshop (AAMAS-ALA), 2010.
[34] M. Cutler, T. J. Walsh, and J. P. How, "Reinforcement learning with multi-fidelity simulators," in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 3888–3895.
[35] P. Abbeel, M. Quigley, and A. Y. Ng, "Using inaccurate models in reinforcement learning," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 1–8.
[36] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[37] S. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, pp. 251–276, 1998.
[38] J. Peters, "Machine learning of motor skills for robotics," PhD dissertation, University of Southern California, 2007.
[39] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, "Trust region policy optimization," in ICML, 2015.
[40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[41] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in ICLR, 2016.
[42] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, "Julia: A fast dynamic language for technical computing," arXiv preprint arXiv:1209.5145, 2012.
[43] I. Mordatch, K. Lowrey, G. Andrew, Z. Popovic, and E. V. Todorov, "Interactive control of diverse complex characters with neural networks," in NIPS, 2015.