Keep it Simple: Data-efficient Learning for Controlling Complex Systems with Simple Models
IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED JANUARY, 2021
Thomas Power and Dmitry Berenson

Abstract—When manipulating a novel object with complex dynamics, a state representation is not always available, for example for deformable objects. Learning both a representation and dynamics from observations requires large amounts of data. We propose Learned Visual Similarity Predictive Control (LVSPC), a novel method for data-efficient learning to control systems with complex dynamics and high-dimensional state spaces from images. LVSPC leverages a given simple model approximation from which image observations can be generated. We use these images to train a perception model that estimates the simple model state from observations of the complex system online. We then use data from the complex system to fit the parameters of the simple model and learn where this model is inaccurate, also online. Finally, we use Model Predictive Control and bias the controller away from regions where the simple model is inaccurate and thus where the controller is less reliable. We evaluate LVSPC on two tasks: manipulating a tethered mass and a rope. We find that our method performs comparably to state-of-the-art reinforcement learning methods with an order of magnitude less data. LVSPC also completes the rope manipulation task on a real robot with an 80% success rate after only 10 trials, despite using a perception system trained only on images from simulation.
Index Terms—Machine Learning for Robot Control; Motion and Path Planning
I. INTRODUCTION

WHILE recent machine learning methods have been effective for many manipulation tasks, they rely on access to large datasets of the system being manipulated [1], [2], [3]. Yet in many scenarios we do not have time to gather extensive training data with an object before performing a task. Sim-to-real transfer has been used to fine-tune parameters on limited real-world data when the real object is similar to those used in simulation [4], [5], but these methods struggle if the objects are significantly different. We would like to use prior knowledge about the object to reduce the data required for learning, but the question of how to effectively use prior knowledge when encountering a novel object remains open. This paper addresses how to leverage dynamics models of simple systems when learning to control much more complex, but related, systems online. While it is possible to learn dynamics using only online data (e.g. [6]), we wish to use our
Manuscript received: October 16, 2020; revised: December 21, 2020; accepted: January 15, 2021. This paper was recommended for publication by Editor Dana Kulic upon evaluation of the Associate Editor and Reviewers' comments. This work was supported in part by NSF Grant IIS-1750489 and ONR grant N00014-21-1-2118. The authors are with the University of Michigan, Ann Arbor, MI, USA. {tpower, dmitryb}@umich.edu. Digital Object Identifier (DOI): see top of this page.

knowledge of a simple model to make the learning much more data-efficient, and thus practical for real-world application. For example, consider a tethered mass being swung by a gripper (Figure 1). The dynamics of the system are complex and require a great deal of data to learn. However, if we treat the system as a cart with a rigid pendulum, we can predict the dynamics fairly accurately for some subset of the state-action space. We can exploit this subset to perform tasks such as bringing the mass to a target, even without a globally-accurate dynamics model. Simple models are often used in this way, for example in deformable object manipulation [7], [8] and control for humanoids [9].

To use knowledge of the dynamics of the simple model to control the more complex true system, we must know which states of the complex system correspond to which states of the simple system. What makes this problem especially difficult is that, while we can design a useful state representation for the simple system offline, we do not know what state representation to use for the complex system, so we cannot explicitly define a correspondence between states. Our key insight for overcoming this problem is that the simple system (and its state representation) is a good approximation of the complex system when it gives rise to image observations similar to those of the complex system.
By using a metric for observation similarity that reasons about uncertainty we can build a controller for the complex system and also learn where our approximation is inaccurate (to avoid visiting those parts of the state space). By utilizing domain randomization during training, we enable a single simple-system state to elicit a wide variety of image observations; i.e. shapes, colors, and obstacles can vary while still producing an image we consider to be visually similar. We use online system identification to estimate the parameters of the simple model; however, deciding which class of simple model to use for a given task is not within the scope of this paper. Here we made this decision manually, but we seek to automate selecting the class of simple model in future work.

This paper makes the following contributions: 1) Learned Visual Similarity Predictive Control (LVSPC), a novel framework for learning how to perform manipulation tasks with a complex system given only a simple model and images from a small number of trials online; 2) Evaluation of LVSPC on manipulating a tethered mass (using a cart-pole as a simple model) and a rope (using a rigid body as a simple model) (see Fig. 1) in simulation, showing large improvements in data-efficiency over baselines (PlaNet [6] and CURL [10]). LVSPC also completes the rope manipulation task on a real robot with an 80% success rate after only 10 trials.
Fig. 1: (a-c): LVSPC controlling a tethered mass to a desired position (blue) from images by treating it as a cart-pole; (d-g): LVSPC brings a rope to a target location in a narrow passage between two obstacles while avoiding protrusions by treating the rope as a rigid object. The robot starts with the rope slack but pulls it taut to keep the approximation more accurate, allowing it to complete the task.
LVSPC consists of two phases: 1) Offline, we train an ensemble Convolutional Neural Network (CNN) perception system on image observations of the simple system, outputting an estimate of the simple system's state. 2) Online, given image observations of the complex system, we do system identification to estimate parameters of the simple system dynamics and learn a Gaussian Process (GP) that predicts where the simple model is accurate. We use the simple model and the GP to track the object via a Gaussian Process Unscented Kalman Filter (GPUKF) [11] and perform control via Model Predictive Path Integral Control (MPPI) [12], biasing the system away from inaccurate transitions.

II. RELATED WORK
Dynamics from Images: Learning-based approaches using dynamics models for control with image observations have included learning dynamics models directly in image space [1], [3], [13]. Dynamics in image space are highly complex, and these methods require large amounts of data. Other methods learn dynamics in a lower-dimensional latent space [14], [2], [6], [15]. None of these methods incorporate prior knowledge. SE3-PoseNets [16] learn dynamics in pose-space from point cloud data. [17] uses the positions of a set of ordered points as the representation of a rope and pre-trains a state estimator on ground truth in a simulator. Unlike LVSPC, neither of these methods uses a given model approximation, nor do they reason about model uncertainty.
Using simplified models: Simplified models have been widely explored in the legged robotics literature, in particular using spring-mass-damper models [18], [9]. Simplified models have been used to generate trajectories for a lower-level controller to track with guarantees [19]. However, these guarantees require access to a high-fidelity model. Other work [20] has used a set of simple models and a selection mechanism to choose between them. [8] uses a given simplified dynamics model and learns a classifier on whether a given transition is reliable. We use GP uncertainty to model transition reliability rather than a classifier. We also use image observations and perform tracking concurrently.
Incorporating model uncertainty: Previous work has shown that reasoning about model uncertainty can improve data efficiency [21], [22]. PILCO [21] uses a Gaussian Process dynamics model for model uncertainty and achieves high data efficiency in learning control policies. Gaussian Process dynamics have also been used for the purpose of both avoiding uncertainty [23] and explicitly seeking it [24]. PETS [22] uses a probabilistic ensemble of neural networks to model uncertainty and is able to outperform PILCO on control tasks with high state dimension. These methods have only been demonstrated on tasks for which the state is available, and not on image domains where parameterizing uncertainty can be difficult. LVSPC aims to combine modeling of uncertainty in the dynamics with strong priors to maintain high data efficiency when learning from images.

III. PROBLEM STATEMENT
We consider a nonlinear discrete-time system with state x ∈ X and controls u ∈ U. The system has unknown true dynamics given by x_{t+1} = f(x_t, u_t). We assume X may be arbitrarily high-dimensional and unobserved. Instead we may only have access to observations o ∈ O via an observation function at the current state, o_t = g(x_t). We define a trial as a time-limited attempt to find a sequence of controls {u_1, ..., u_T} such that the final state x_T ∈ X_goal, where X_goal is the trial's goal region. We assume that we can fully observe when the system has reached the goal, i.e. o ∈ O_goal ⟺ x ∈ X_goal. The goal in observation space is defined as O_goal = {g(x) : x ∈ X_goal}. We assume that data collection on the true system is expensive. The unknown dynamics and high-dimensional state make this problem intractable to solve with a small dataset. Instead we seek to model the system in a latent state of lower dimensionality z ∈ Z with simple dynamics f̂_ρ parameterized by ρ, with input-dependent noise. The transition distribution, which we denote p_z for shorthand, is given by

p(z_{t+1} | z_t, u_t) = N(f̂_ρ(z_t, u_t), Q(z_t, u_t))    (1)

We assume that f̂_ρ is given and is differentiable with respect to (z, u, ρ). Q is an input-dependent uncertainty term. We also assume that the simple dynamics are Markovian. The simple system has the same observation space O and has a given observation function o_t = ĝ(z_t). We assume that we can a priori specify some subset of the goal region in Z as Z_goal, i.e. that {ĝ(z) : z ∈ Z_goal} ⊂ O_goal. This could also be done by specifying O_goal directly (as is common in learning to control from images, e.g. [25]) and using this to infer z_goal. We then seek to design a feedback policy u_t = π(z_t) such that z_T ∈ Z_goal for some time T. Our goal is to design π using f̂_ρ so that it achieves a high success rate after a small number of trials.

IV. METHODS
Our approach to this problem requires input in the form of a simple model approximation that is believed to accurately represent the dynamics over some subset of the complex
Algorithm 1 LVSPC
Inputs: Simple model dynamics f̂_ρ; simple model cost c; simple model renderer ĝ; initial data size N; episodes K
  // Offline training with simple-system data
  {y_i, o_i}_{i=1}^N ← CollectData(f̂_ρ, ĝ, N)
  φ ← TrainStateEstimator({y_i, o_i}_{i=1}^N)
  // Online training with complex-system data
  D ← ∅; ρ, Q ← Initialize
  for k ∈ {1, ..., K} do
    p_z ← N(f̂_ρ(z_t, u_t), Q(z_t, u_t))
    D ← D ∪ Rollout(p_z, c, φ)
    ρ ← FitSimpleSystem(D, f̂_ρ)
    Q ← FitGP(D, f̂_ρ, Q, ρ)

system (X, U). By using this simple model in simulation we can generate large amounts of data. The key to our approach is to leverage this data and our knowledge of the simple system. We then reduce the problem of unsupervised representation and dynamics learning to that of supervised learning of a perception system for the simple model representation (offline), and then learning when this representation and the dynamics are accurate (online). Our full method is shown in Algorithm 1 and Figure 2. The overall procedure is to first generate a dataset of images with corresponding simple model configurations and then to train a perception system to estimate these configurations from images. Once this perception system is trained offline, we move to the online execution/learning phase, where we must manipulate the never-before-seen complex system.

The goal of the online execution is to reach a given goal region. However, because the perception system and the simple model dynamics can only account for some complex model states, we must try to avoid states where the perception/dynamics are inaccurate. To this end, we collect data as we attempt the task and use that data to train a GP that captures the error in the simple model predictions. This error distribution is input into a Kalman Filter variant to better estimate the state and into a trajectory optimizer, which attempts to avoid regions of state space where the simple model predictions are inaccurate.
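The control flow of Algorithm 1 can be sketched as follows. Every function here is a hypothetical stand-in (a stub with placeholder return values), not the authors' released code; the sketch only shows how the offline and online phases interleave:

```python
import numpy as np

def collect_data(n):
    """Simulate the simple model and render observations (stub)."""
    return [(np.zeros(3), np.zeros((64, 64))) for _ in range(n)]

def train_state_estimator(data):
    """Train the CNN ensemble phi offline (stub: returns a callable
    mapping an observation to (mu_y, sigma_y))."""
    return lambda obs: (np.zeros(3), np.ones(3))

def rollout(phi):
    """One online episode: filter, plan, act (Alg. 2).
    Stub: returns a list of (mu_y, sigma_y, u) transitions."""
    return [(np.zeros(3), np.ones(3), np.zeros(2))]

def fit_simple_system(D):
    """Maximum-likelihood fit of the simple-model parameters rho (stub)."""
    return {"mass_cart": 1.0}

def fit_gp(D, rho):
    """Fit the transition-uncertainty GP Q(z, u) (stub)."""
    return lambda z, u: np.ones(3)

def lvspc(n_offline=1000, episodes=10):
    # Offline: supervised training on simple-system images.
    phi = train_state_estimator(collect_data(n_offline))
    # Online: alternate execution, system identification, and GP fitting.
    D, rho, Q = [], None, None
    for _ in range(episodes):
        D += rollout(phi)
        rho = fit_simple_system(D)
        Q = fit_gp(D, rho)
    return rho, Q

rho, Q = lvspc()
```

Note that the dataset D accumulates across episodes, so the simple-model fit and the GP are refined with all online data seen so far.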
The process of planning trajectories, executing one action, estimating the resulting state, and replanning a trajectory (Alg. 2) repeats until the goal (or a timeout) is reached.

A. Simple Model
The simple system state may contain elements which cannot be estimated from a single image, e.g. velocities. Thus we define the components of the simple state that can be noisily observed from a single image as latent observations y. We then have the non-linear discrete-time state space model with dynamics described in Eq. (1). In general there will be a non-linear mapping from z to y. In this paper we consider only a linear mapping, which is sufficient for our models:

y_t = C z_t + ε,    (2)

For an n-dimensional simple model system (z ∈ R^n) with m-dimensional (m ≤ n) observations (y ∈ R^m), C = [I_{m×m}, 0_{m×(n−m)}] selects the latent observations from z. For example, if z is the position and velocity of a point, then y is only the position, which is all that can be observed from a single image. In the case where ε ∼ N(0, R) for positive-definite R, we can use noisy measurements y to estimate z by filtering using non-linear techniques such as the Unscented Kalman Filter (UKF) [26]. We will show how to use a GP to learn Q(z_t, u_t) in Eq. (1) from data in Sec. IV-D.

Algorithm 2 Rollout
Inputs: Transition distribution p_z; simple model cost c; CNN ensemble φ
  D ← ∅; µ_z0, Σ_z0 ← Initialize
  for t ∈ {1, ..., T} do
    µ_yt, Σ_yt ← φ(o_t)
    y_t ∼ N(µ_yt, Σ_yt)
    µ_zt, Σ_zt ← GPUKF(µ_zt−1, Σ_zt−1, u_t−1, p_z, y_t)
    u_t ← MPPI(µ_zt, c, p_z)
    D ← D ∪ (µ_yt, Σ_yt, u_t)
    ExecuteAction(u_t)
    if AtGoal then break
  return D

B. Probabilistic CNN Ensemble for Perception
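As a concrete illustration of the observation model y_t = C z_t + ε in Eq. (2), the sketch below builds the selection matrix for the cart-pole state used in the experiments (the noise level R here is illustrative, not a value from the paper):

```python
import numpy as np

def selection_matrix(n, m):
    """C = [I_{m x m}, 0_{m x (n - m)}] selects the m observable
    components of the n-dimensional simple-model state z (Eq. 2)."""
    return np.hstack([np.eye(m), np.zeros((m, n - m))])

# Cart-pole example: z = [cart x, mass x, mass y, cart velocity, theta_dot],
# of which only the three positions are visible in a single image.
C = selection_matrix(5, 3)
z = np.array([0.1, 0.4, 0.9, -0.2, 1.5])
R = 0.01 * np.eye(3)  # illustrative observation-noise covariance
y = C @ z + np.random.multivariate_normal(np.zeros(3), R)
```

With ε ∼ N(0, R), the noisy y can be fed to a UKF (or the GPUKF of Alg. 2) to recover a distribution over the full state z, including the unobserved velocities.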
In order to use the simple model for the complex system, we need a perception system φ that maps images to simple model states (even if the image is generated from the complex system). We would also like a way to estimate how well a simple model state approximates the complex system at a given state, as this gives us an estimate of confidence in the simple system dynamics at this state. We use the uncertainty in the perception estimate as a proxy for correspondence between the simple state and the unknown complex state. The perception output is

µ_yt, Σ_yt = φ(o_t)    (3)
y_t ∼ p(y_t | o_t) = N(µ_yt, Σ_yt),    (4)

where the variance Σ_yt estimates the uncertainty, and φ is the perception system. We assume an isotropic Gaussian in Eq. (3); thus Σ_yt can be described by a vector σ_yt ∈ R^m. Ensembles have been empirically shown to give useful estimates of prediction uncertainty, which can be used to evaluate if a given input is out-of-distribution w.r.t. the training data [27]. Thus using ensembles avoids manually defining a similarity between the complex system observations and observations generated from the simple system. Instead we can input an observation o_t from the complex system into our perception system, and if it produces a high-certainty estimate of y_t (i.e. where ||σ_yt|| is small), this implies that y_t is a good approximation for the complex system at time t. We parameterize φ as a CNN ensemble which is trained with data generated from the simple system. Each CNN in the ensemble is a probabilistic CNN which outputs the parameters of a Gaussian; these are then combined into one Gaussian estimate. We train the CNNs via supervised learning on observations of the simple system which we collect from simulation, along with the corresponding simple system states. Importantly, we assume that we can generate observations from
Fig. 2: Method overview. Left: Training the CNN ensemble on image observations generated from the simple system offline. φ is a CNN ensemble with variance used as a measure of uncertainty; Center: Online execution using the simple model CNN with GPUKF filtering and MPPI for control; Right: Procedure for fitting the parameterized simple model and GP from observations of the complex system. The transition probability (red) is trained to predict the future uncertainty of φ, allowing us to avoid areas where φ is not confident.

the simple system which are similar to the complex system observations. To avoid requiring precise knowledge of the complex system before generating the simple model data, we generate a diverse training set of observations from the simple model. For example, we generate cart-poles with varying pendulum length for the tethered mass scenario. By generating diverse observations via domain randomization, our notion of visual similarity means that there is a simple system with some appearance and system parameters that looks similar to the complex system. See Fig. 3 for examples.

Given an o_t of the complex system online, we sample y_t from the output of φ and use this along with the learned GP transition distribution (Sec. IV-D) to track a Gaussian distribution over the simple model state (p(z_t | u_{t−1}, y_t) = N(µ_zt, σ_zt)) with a GPUKF [11], an extension of the UKF for GP dynamics. When predicting p(z_{t+1} | u_t, y_t) in the GPUKF we use the posterior mean of the GP (Sec. IV-D) to perform the unscented transform, while the process noise is the posterior covariance of the GP, Q(z_t, u_t), evaluated at (µ_zt, u_t).

C. System Identification
The simple model dynamics may be parameterized by ρ (for example mass, length, etc.), and in order to use the model we must estimate the ρ which best approximates the complex system. One approach is to use the Kalman filter to jointly estimate ρ and the latent state z, but we found that this was not numerically stable. Instead we use maximum-likelihood estimation on observed trajectories from the complex system. Given an observed trajectory of the complex system consisting of {o_t, u_t}_{t=1}^T, we encode the observations into {µ_yt, σ_yt, u_t}_{t=1}^T. Since our trajectory may contain transitions which the simple model cannot accurately predict, we split the trajectory into N trajectories of length K < T, and discard trajectories with average uncertainties above a threshold α, so we are left with high-certainty sub-trajectories. For each sub-trajectory we roll out the actions u_{1:K} using Eq. (1) and (2) to get estimated observations ŷ_{1:K} and perform gradient ascent on the parameters ρ and the trajectory initial states {z_i}_{i=1}^N by maximizing the log likelihood of ŷ under the distribution output by the CNN ensemble, N(µ_y, σ_y). The CNN weights are held constant. This process optimizes ρ to match the observed dynamics for high-certainty transitions in (Z, U).

Fig. 3: Examples of data generated from the simple system for training the CNN ensemble. (a) Tethered mass experiment, showing different geometries of the cart-pole. (b) Simulated rope manipulation experiment, showing different geometries of the rigid link, and differing number and geometries of objects. (c) Real robot rope manipulation experiment. We randomize textures, lighting, obstacle configuration, camera pose, and rigid link geometry and add noise.

D. Predicting Future Uncertainty with GP Regression
From φ we have a confidence in our simple model approximation at a given y (the uncertainty σ_y). To keep the system in regimes where the approximation is accurate we also need to predict the future uncertainty conditioned on actions. Our uncertainty expresses uncertainty over the validity of the state as a description of the complex system, rather than the value of the state. Since we are using state uncertainty as a measure of confidence in the simple model approximation, we model this uncertainty as state- and action-dependent and use a GP with mean function f̂_ρ and kernel function K to model the transition distribution. The GP posterior is

p(z_{t+1} | z_t, u_t) = N(f̂_ρ(z_t, u_t) + µ_f(z_t, u_t), Q(z_t, u_t)),    (5)

where µ_f and Q are typically found via conditioning on some training set. However, in our case this is a Gaussian Process State Space Model (GPSSM) [28] with the transition probability above and the emission probability defined in Eq. (2). Training this GP is non-trivial as we do not have access to z directly. Instead we must jointly infer both the transition probability and z during training. We use a Parametric Predictive GP (PPGP) [29] in order to train a GP with state-dependent aleatoric uncertainty via stochastic gradient descent. The uncertainty of the GP, σ_z, is used to predict the uncertainty of the CNN ensemble, σ_y, via Eq. (2). The PPGP is a sparse GP method which fits pseudo-inputs (ζ) and pseudo-outputs (γ ∼ N(m, S)) such that conditioning the GP on (γ, ζ) approximates the true GP posterior. The GP parameters are thus (m, S, ζ), as well as the kernel hyper-parameters. The GP posterior contains an additional µ_f term compared with Eq. (1). This allows the GP posterior mean to deviate from that of the simple model, attempting to fit transitions which do not conform to the simple model dynamics.
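For intuition, the posterior of Eq. (5) can be sketched with an exact GP on the one-step residuals of the simple model for a single output dimension. This is a simplification: the paper trains a sparse PPGP [29] rather than an exact GP, and the inputs, residuals, and hyper-parameters below are illustrative toy values:

```python
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    """RBF kernel matrix K(A, B) for row-stacked (z, u) inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, residuals, Xq, noise=1e-2):
    """Exact GP conditioned on residuals z_{t+1} - f_rho(z_t, u_t).
    Returns (mu_f, Q) of Eq. (5). LVSPC's conservative variant pins
    mu_f = 0 and keeps only the variance Q; the unconstrained
    variant uses mu_f as well."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(Xq, X), rbf(Xq, Xq)
    mu_f = Ks @ np.linalg.solve(K, residuals)
    Q = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mu_f, Q

# Toy (z, u) training inputs and observed one-step residuals.
X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, -0.5]])
res = np.array([0.0, 0.1, 0.4])
mu_f, Q = gp_posterior(X, res, np.array([[1.5, 0.0]]))
```

Far from the data, Q reverts to the kernel's prior variance, signalling low confidence in the simple model there; near observed high-certainty transitions, Q shrinks toward the noise floor.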
Since our representation is known to be insufficient to model the true dynamics of the system, we are conservative and do not allow the GP to fit such transitions, constraining m = 0 and thus µ_f = 0. We compare to a variant of our method where we do not enforce µ_f = 0 in our experiments.

We now describe how to train this GP using trajectories from the complex system of the form {µ_yt, σ_yt, u_t}_{t=1}^T. We would like Eq. (2), Eq. (5), and an initial p(z_1) to be able to reproduce the trajectory and uncertainties from the CNN. The learning objective to be minimized is then

L = KL(p(y_{1:T} | o_{1:T}) || p(y_{1:T} | u_{1:T})),    (6)

where KL is the Kullback–Leibler divergence, p(y_{1:T}) represents the joint distribution p(y_1, ..., y_T), p(y_{1:T} | o_{1:T}) is the output of the CNN, and p(y_{1:T} | u_{1:T}) is the prediction from the dynamics and Eq. (2). The GP predicted uncertainty σ_zt is used with Eq. (2) to predict a latent observation uncertainty σ̂_yt. This objective aims to make the predicted uncertainties σ̂_yt and the observed uncertainties σ_yt consistent, i.e. the GP will predict the future uncertainty. p(y_{1:T} | o_{1:T}) is fixed (i.e. we are not retraining the CNN online). Given this, we can rewrite the objective in terms of expectations over p(y_{1:T} | o_{1:T}):

L = −E_{p(y_{1:T} | o_{1:T})}[log p(y_{1:T} | u_{1:T})] + H[p(y_{1:T} | o_{1:T})],    (7)

where H is the entropy; this entropy term can be dropped as it only depends on the pre-trained CNN. We can then optimize by maximizing the conditional expectation in Eq. (7). To do this we construct a variational lower bound on p(y_{1:T} | u_{1:T}), given by

ELBO = Σ_{t=1}^T E_{q(z_t)}[log p(y_t | z_t)] − KL(q(z_1) || p(z_1)) − Σ_{t=2}^T E_{q(z_{t−1})}[KL(q(z_t) || p(z_t | z_{t−1}, u_{t−1}))],    (8)

where the prior on the initial state is p(z_1) = N(0, I) and q(z_t) = p(z_t | y_{1:t}, u_{1:t−1}) is the GPUKF filtering distribution [11].
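The KL terms in Eq. (6) and Eq. (8) have a closed form for the Gaussians used here. A minimal sketch for diagonal-covariance Gaussians, with toy numbers standing in for a filtering distribution q(z_t) and a GP one-step prediction p(z_t | z_{t−1}, u_{t−1}):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between diagonal Gaussians, the building
    block of the KL terms in Eq. (6) and Eq. (8)."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Toy 2-D state: q is a sharp filtering estimate, p a broader prediction.
kl = kl_diag_gauss(
    mu_q=np.array([0.1, 0.0]), var_q=np.array([0.05, 0.05]),
    mu_p=np.array([0.0, 0.0]), var_p=np.array([0.10, 0.10]),
)
```

The KL is zero only when the two distributions coincide, so minimizing these terms drives the GP's predicted uncertainties toward the uncertainties actually observed from the CNN ensemble.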
The final objective to minimize is then

L̂ = −E_{p(y_{1:T} | o_{1:T})}[ELBO] ≥ L    (9)

To evaluate this objective we use the reparameterization trick to sample from the CNN and estimate gradients for L̂. After performing this training procedure we obtain the transition distribution p_z, which is used by the GPUKF to perform filtering and by the MPC to predict future uncertainty.

E. Model Predictive Control
For MPC we use MPPI [12] with a cost c for the given task. To encourage the controller to keep the system in the domain of the simple model we add a cost to penalize the predicted uncertainty. Thus the cost function has the form c(z, σ_z, u) (examples are shown in the experiments). Note that typically in this setting the expected cost is computed, but as mentioned in the previous section, our uncertainty does not express uncertainty over the value of the state. When rolling out a predicted trajectory with the model we propagate the expectation through the dynamics and record the one-step uncertainty for each step, resulting in a trajectory (µ_zt, σ_zt, u_t)_{t=1}^T with which to calculate the cost. If we do not penalize this uncertainty, it will be ignored, which is equivalent to assuming the simple model is always accurate (we compare to this method in our experiments). Also, because we manually design the simple model state representation, we can incorporate additional information, such as avoiding collision, into the cost, which would have to be learned for an unsupervised learned representation.

V. EXPERIMENTS
We evaluate LVSPC on 1) manipulating a tethered mass, and 2) placing a rope in a narrow opening, against baselines in the low-data regime. An episode is a time-limited attempt to reach the goal (terminating early when the goal is reached). See the accompanying video for example task executions.
A. Environments

a) Tethered Mass: This task involves controlling a tethered mass by applying force to the base of the tether. The goal is to bring the mass to a target without the tether contacting the target (tether contact results in failure). We implement this system in MuJoCo [30]. There is a single actuated horizontal joint at the top of the tether (see Figure 4). Goals are randomly assigned at the start of each trial. This example demonstrates the applicability of LVSPC to highly-dynamic systems where velocity must be considered.
The simple system we choose here is the pendulum on a cart (i.e. a cart-pole); we choose this because we observed that when the tether is taut the system will behave like a pendulum. We use an analytical dynamics function for f̂_ρ. We define z = [p_x_cart, p_x_mass, p_y_mass, ṗ_x_cart, θ̇], where θ is the angle of the pendulum. We define the latent observations as y = [p_x_cart, p_x_mass, p_y_mass] and thus C = [I_{3×3}, 0_{3×2}]. The parameters ρ are [mass_cart, mass_pole, angular_damping].

b) Rope Manipulation: This task consists of two KUKA iiwa 7-DOF arms holding the ends of a rope. The goal is to bring the center of the rope to the center of a narrow gap between two obstacles. These obstacles have small protrusions on which the rope can become caught. We implement this environment in Gazebo with the ode45 back-end (Figure 4). The action space of the robot is [Δp_L, Δp_R] ∈ R^6, where p_L, p_R are the left and right end-effector positions, respectively. We use a Jacobian-based method for inverse kinematics so that transitions in the robot's configuration space are smooth. The observations consist of RGBD data from an overhead Kinect. The goal and obstacle configuration for the task remain fixed across trials, but the starting locations of the end-effectors vary. We choose this example because it mimics cable installation, which is necessary for manufacturing and repair applications, where there are often narrow gaps and protrusions.

The simple system we choose here treats the rope as if it were a rigid link. The simple dynamics are then specified by adding a constraint that the gripper distances remain fixed. This approximation will be accurate so long as the rope is kept taut for the duration of the task. We define z = y = [p_L, p_R] and C = I_{6×6}. Since this model does not require dynamic parameters, we forego the sysid step of our method.

B. Baselines
We compare LVSPC to two recent methods from the literature. The first is PlaNet [6], a model-based reinforcement learning algorithm that learns a low-dimensional state representation along with dynamics and cost functions. The second is CURL [10], which uses a contrastive loss to learn a representation in which to learn a policy, and has shown state-of-the-art sample efficiency. We test each baseline by training it directly on the task with the complex system. We also show results for when the baselines are pre-trained on the simple system and fine-tuned on the complex system, to investigate if these methods can take advantage of the data from the simple system. Both baselines were originally proposed with RGB observations, and we extend them to use RGBD for the rope experiment.

We also test three variants of LVSPC: 1) the full method, which does both system identification and GP learning; 2) LVSPC without the GP, which is equivalent to only using the simple model for control and assuming it will be sufficiently accurate for all transitions; we choose this variant to investigate whether learning and avoiding inaccurate areas of the simple model state space is helpful for task performance; and 3) LVSPC without constraining the GP posterior to be zero-mean, hence attempting to learn a better approximation of the dynamics in the simple system state space, rather than only where the simple model is accurate.
C. Simple Model Data

a) Tethered Mass: For pre-training the state estimator we generate 5000 trajectories of 20 time-steps from the cart-pole using random actions and render the cart-pole configurations to produce images. This corresponds to 100,000 64×64 grayscale frames. For domain randomization, we vary the dimensions and parameters of the system (see Figure 3(a)).

b) Rope Manipulation: For pre-training the state estimator we generate 800 trajectories of 50 time-steps using random actions from the rigid body system and render the configuration. This corresponds to 80,000 RGBD frames. For domain randomization, we vary the dimensions of the rigid link and the obstacles, as well as the obstacle locations (examples shown in Figure 3(b)).
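The domain randomization above amounts to sampling a rendering configuration per trajectory before generating images. The sketch below shows the pattern for the cart-pole case; the parameter names and ranges are illustrative, not the paper's actual sampling bounds:

```python
import random

def sample_cartpole_appearance(rng):
    """Sample one randomized simple-model rendering configuration
    (cf. Fig. 3a). All ranges here are hypothetical."""
    return {
        "pole_length": rng.uniform(0.3, 1.0),   # pendulum length
        "cart_width": rng.uniform(0.1, 0.4),    # cart geometry
        "mass_radius": rng.uniform(0.02, 0.08), # tip-mass size
        "gray_level": rng.uniform(0.2, 0.9),    # appearance variation
    }

rng = random.Random(0)  # fixed seed for reproducibility
dataset_configs = [sample_cartpole_appearance(rng) for _ in range(5000)]
```

Each sampled configuration would then be rendered for a whole 20-step trajectory, so that the estimator sees many appearances of the same underlying simple-model states.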
D. Cost Functions
For both LVSPC and PlaNet we use an MPC horizonof 40 and sample 1000 trajectories per timestep. We donot have a cost on control. CURL and PlaNet use the trueenvironmental cost i.e. c env ( x t ) , whereas LVSPC and variantsuse an equivalent cost based on the simple model state withan uncertainty penalty c ( z t , σ zt ) . The environmental costs usethe true state from the simulator to calculate the cost (becauseCURL and PlaNet have no knowledge of the simple model),whereas LVSPC uses the simple model state to approximatethis cost, effectively giving CURL and PlaNet an advantage. a) Tethered Mass: The environmental cost consistsof three parts; a euclidean distance to goal, a colli-sion penalty for the tether and mass, and a penaltywhen the system goes out of view of the camera.The cost functions are c ( z t , σ zt ) = δ g distToGoal Z + OffScreen ( z t ) + 10 checkCollision ( z t ) + βσ zt and c env ( x t ) = δ g distToGoal X + OffScreen ( x t ) +10 checkCollision ( x t ) , where β is a parameter on howheavily to weigh uncertainty, and δ g is if the goal is reachedbefore time t and otherwise. To balance exploiting vs.exploring we increase β from to . in the first 10 episodes.This cost is not memoryless; δ g depends on the state for times t ′ < t . This is because we only wish to hit the target, we donot have to reach the target and stay there. b) Rope Manipulation: The environmental cost is thedistance to the goal, computed by considering the centre of therope to be a floating point, discretizing the 3D environmentinto a 8-connected graph and solve for the shortest path tothe goal for every point in the graph. We do not penalizecontact for the baselines, as we found that they could exploitcontact to help complete the task. The cost for LVSPCpenalizes contact (because the simple model is rigid), wherewe do a collision-check for the rigid-link approximation. 
The cost functions are

c(z_t, σ_{z_t}) = distToGoal_Z + β σ_{z_t} + 100 checkCollision(z_t)

and

c_env(x_t) = distToGoal_X.

To balance exploiting vs. exploring we increase β over the first 10 episodes.
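The shortest-path distance-to-goal above can be sketched as a breadth-first distance field over the discretized grid. This is a 2D sketch (the paper discretizes the 3D environment), with every diagonal or axis-aligned move costing one step and obstacle cells unreachable:

```python
from collections import deque

def grid_distance_field(occupancy, goal):
    """BFS distance-to-goal over an 8-connected grid.
    occupancy[i][j] is truthy for obstacle cells; unreachable cells keep inf."""
    H, W = len(occupancy), len(occupancy[0])
    INF = float("inf")
    dist = [[INF] * W for _ in range(H)]
    gi, gj = goal
    dist[gi][gj] = 0
    q = deque([goal])
    while q:
        i, j = q.popleft()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if (di or dj) and 0 <= ni < H and 0 <= nj < W \
                        and not occupancy[ni][nj] and dist[ni][nj] == INF:
                    dist[ni][nj] = dist[i][j] + 1
                    q.append((ni, nj))
    return dist
```

Computed once from the goal, the field gives distToGoal for the rope centre at any cell by a single lookup during planning.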
Fig. 4: (a) Tethered mass input image (64x64 grayscale) with the target (left) and the single prismatic joint (blue); (b) output from CNN ensemble and GPUKF estimation (red); (c) planned trajectory from MPPI (green); only the first action from this trajectory is executed before replanning; (d) the rope manipulation environment; the goal is to bring the centre of the rope to the centre of the narrow gap, whose sides have protrusions which can catch the rope; (e, f) example RGB and depth observations from the overhead Kinect.
E. Network Architectures
Networks are implemented in PyTorch [31], and the GPs are implemented in GPyTorch [32], which allows us to exploit parallelism on the GPU for GP inference when performing MPPI. Thus, for the rope manipulation experiment, an iteration of MPPI takes only a fraction of a second on average using an Intel i7-8700K CPU and an Nvidia 1080Ti GPU. For both experiments we use a CNN ensemble consisting of 10 networks. All convolutional layers use stride 2 for downsampling, and all layers other than the output layers use ReLU activations. We use the Adam optimizer, with a lower learning rate when fine-tuning the pretrained CURL and PlaNet models. For the GP dynamics model, we use 200 inducing points. We train an independent GP for each output dimension using the RBF kernel with automatic relevance determination [33]; we use a separate learning rate to train the GP and perform sysid. For each experiment CURL and PlaNet use encoders with the same architecture as our CNN. The transition and reward models for PlaNet use the same architecture as [6]. The actor-critic architecture for CURL is the same as in [10]. Both CURL and PlaNet are trained end-to-end.

a) Tethered Mass: Each CNN consists first of 4 convolutional layers, followed by a fully connected layer with 2048 hidden units and an output layer.

b) Rope Manipulation:
Each CNN processes depth and RGB separately, consisting of an RGB module and a depth module which are combined downstream. Each module consists first of 4 convolutional layers, followed by a fully-connected layer with 512 hidden units. After passing the RGB image through the RGB module and the depth image through the depth module, the outputs of the two modules are concatenated and passed through a final hidden layer of 1024 units, followed by an output layer.
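A sketch of this two-stream architecture in PyTorch. The channel widths, kernel size, input resolution, and simple-state dimension are illustrative assumptions; the text above specifies only the stride-2 convolutions, the per-stream 512-unit layers, and the 1024-unit fusion layer:

```python
import torch
import torch.nn as nn

def conv_stack(in_ch):
    """Four stride-2 convolutional layers with ReLU, as described above.
    Kernel size 3 and the channel widths are assumptions."""
    chans = [in_ch, 32, 64, 128, 256]
    layers = []
    for cin, cout in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                   nn.ReLU()]
    layers.append(nn.Flatten())
    return nn.Sequential(*layers)

class RGBDStateEstimator(nn.Module):
    """Two-stream CNN: RGB and depth are processed separately, concatenated,
    and passed through a 1024-unit hidden layer before the output layer."""
    def __init__(self, state_dim=6, img_size=64):
        super().__init__()
        self.rgb = conv_stack(3)
        self.depth = conv_stack(1)
        feat = 256 * (img_size // 16) ** 2   # spatial size after four stride-2 convs
        self.rgb_fc = nn.Sequential(nn.Linear(feat, 512), nn.ReLU())
        self.depth_fc = nn.Sequential(nn.Linear(feat, 512), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                  nn.Linear(1024, state_dim))

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_fc(self.rgb(rgb)),
                           self.depth_fc(self.depth(depth))], dim=-1)
        return self.head(fused)
```

In LVSPC, ten such networks form the ensemble, whose disagreement supplies the state-estimate uncertainty used by the controller.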
F. Results

a) Tethered Mass:
An example of the system tracking and MPC is shown in Figure 4. Our statistical results are shown in Figure 5(a, b). PlaNet achieves its maximum performance at 200-300 episodes, with large variation in success rate across runs. CURL shows the highest asymptotic performance, reached after 400 episodes; higher asymptotic performance is typical of model-free learning methods. Pre-training both PlaNet and CURL on data from the simple system results in improved initial performance, but lower final performance. In contrast, LVSPC reaches its plateau after 20 episodes, outperforming PlaNet and matching CURL's performance after 200 episodes, demonstrating 10x improved data efficiency. We also see that seeking to learn the dynamics in the simple state space with the GP results in substantially worse performance. This is likely because the simple state representation is insufficient to model the full complex dynamics.

b) Simulated Rope Manipulation:
Our statistical results are shown in Figure 5(c, d). PlaNet's performance plateaus after 500 episodes, while CURL solves the task with a high success rate after 250 episodes. Pre-training CURL on data from the simple system results in improved initial performance but lower final performance; pre-training PlaNet, however, led to poor performance from which it could not recover, getting caught on the obstacles in every episode. Our full method reaches a high success rate after 20 episodes, again equivalent to CURL's performance after 200 episodes (thus we have 10x better data-efficiency) and outperforming PlaNet's final performance. We see that naively treating the rope as a rigid object results in markedly lower success, with almost all failures resulting from the rope snagging on the protrusions on the side of the gap. As in the tethered mass experiment, attempting to fit the complex dynamics in the simple model space is ineffective, causing frequent snagging on obstacles.
G. Rope Manipulation on a Real Robot
Our simulation experiments show that LVSPC is effective at transferring within the same simulation environment. To validate that we can still use LVSPC when the simple model and complex environments are very different, we perform the rope manipulation experiment described above on a real robot using a perception system trained only in simulation. We use domain randomization to improve the transfer of the CNN ensemble to real data [34] (see Figure 3(c)). We observed better generalization when we randomized the pose of the camera and trained the CNN ensemble to produce an estimate in the camera frame instead of the world frame. We perform the experiment on the real robot over 5 random seeds. For each seed, after every 5 episodes we record the success rate on 10 test episodes. The results are shown in Table I. Using LVSPC we can complete this task with 80% success using only 10 episodes of data collected on the real robot. This experiment demonstrates that using LVSPC is promising for real-world tasks, as we only need data from simulation to train an effective perception system.
Fig. 5: Average success over 10 test tasks vs. number of episodes for both experiments. Shaded region shows minimum and maximum success rate over 5 runs for LVSPC and ablations and 3 runs for the baselines, for a total of 50 and 30 test tasks for LVSPC and the baselines, respectively. (a) LVSPC and ablations for tethered mass; dotted lines show baseline performance after 500 episodes. (b) Baselines for tethered mass. (c) LVSPC and ablations for rope; dotted lines show baseline performance after 500 episodes. (d) Baselines for rope.
Episode        0     5     10    15    20
Success rate   0.3   0.7   0.8   0.78  0.82

TABLE I: Results over 5 random seeds for the real robot experiment.
VI. CONCLUSION
We have presented LVSPC, a method for leveraging a given simple model approximation to improve data efficiency for control tasks on systems with complex dynamics from image observations. We demonstrated this method on two tasks, showing substantially improved performance in the low-data regime over recent reinforcement learning methods. We have also demonstrated that we can apply our framework to a real robot while only using simulated data for pre-training. We assumed that the user specifies a type of simple model, but choosing a simple model which can approximate the complex system is an open problem, made difficult by the requirement that it must be possible to complete the task while operating only in the regime where the simple model is accurate. In future work we intend to incorporate multiple simple models and create a way to decide which is most appropriate.

REFERENCES

[1] A. Xie, F. Ebert, S. Levine, and C. Finn, "Improvisation through physical understanding: Using novel objects as tools with visual foresight," in RSS, 2019.
[2] A. Wang, T. Kurutach, P. Abbeel, and A. Tamar, "Learning robotic manipulation through visual planning and acting," in RSS, 2019.
[3] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, "Learning to poke by poking: Experiential learning of intuitive physics," in NeurIPS, 2016.
[4] S. James, A. J. Davison, and E. Johns, "Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task," in CoRL, 2017.
[5] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, "Closing the sim-to-real loop: Adapting simulation randomization with real world experience," in ICRA, 2019.
[6] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, "Learning latent dynamics for planning from pixels," in ICML, 2019.
[7] S. Miller, J. van den Berg, M. Fritz, T. Darrell, K. Goldberg, and P. Abbeel, "A geometric approach to robotic laundry folding," IJRR, vol. 31, no. 2, pp. 249–267, 2012.
[8] D. McConachie, T. Power, P. Mitrano, and D. Berenson, "Learning when to trust a dynamics model for planning in reduced state spaces," RA-L, vol. 5, no. 2, pp. 3540–3547, April 2020.
[9] J. Pratt, C.-M. Chew, A. Torres, P. Dilworth, and G. Pratt, "Virtual model control: An intuitive approach for bipedal locomotion," IJRR, vol. 20, no. 2, pp. 129–143, 2001.
[10] M. Laskin, A. Srinivas, and P. Abbeel, "CURL: Contrastive unsupervised representations for reinforcement learning," in ICML, 2020.
[11] J. Ko and D. Fox, "GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models," AuRo, vol. 27, pp. 75–90, May 2009.
[12] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, "Information theoretic MPC for model-based reinforcement learning," in ICRA, 2017.
[13] C. Finn and S. Levine, "Deep visual foresight for planning robot motion," in ICRA, 2016.
[14] E. Banijamali, R. Shu, M. Ghavamzadeh, H. H. Bui, and A. Ghodsi, "Robust locally-linear controllable embedding," in AISTATS, 2018.
[15] D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi, "Dream to control: Learning behaviors by latent imagination," in ICLR, 2020.
[16] A. Byravan, F. Leeb, F. Meier, and D. Fox, "SE3-Pose-Nets: Structured deep dynamics models for visuomotor control," in ICRA, 2018.
[17] M. Yan, Y. Zhu, N. Jin, and J. Bohg, "Self-supervised learning of state estimation for manipulating deformable linear objects," RA-L, vol. 5, no. 2, pp. 2372–2379, 2020.
[18] S. Feng, E. Whitman, X. Xinjilefu, and C. G. Atkeson, "Optimization based full body control for the Atlas robot," in Humanoids, 2014.
[19] S. Kousik, P. Holmes, and R. Vasudevan, "Safe, aggressive quadrotor flight via reachability-based trajectory design," in DSCC, 2019.
[20] D. McConachie and D. Berenson, "Estimating model utility for deformable object manipulation using multiarmed bandit methods," T-ASE, vol. 15, no. 3, pp. 967–979, July 2018.
[21] M. Deisenroth and C. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in ICML, 2011.
[22] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," in NeurIPS, 2018.
[23] F. Farshidian and J. Buchli, "Risk sensitive, nonlinear optimal control: Iterative linear exponential-quadratic optimal control with Gaussian noise," arXiv preprint arXiv:1512.07173, 2015.
[24] S. Bechtle, Y. Lin, A. Rai, L. Righetti, and F. Meier, "Curious iLQR: Resolving uncertainty in model-based RL," in CoRL, 2019.
[25] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, "Visual foresight: Model-based deep reinforcement learning for vision-based robotic control," arXiv preprint arXiv:1812.00568, 2018.
[26] E. A. Wan and R. Van Der Merwe, "The unscented Kalman filter for nonlinear estimation," in AS-SPCC, 2000, pp. 153–158.
[27] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in NeurIPS, 2017.
[28] R. Frigola, Y. Chen, and C. E. Rasmussen, "Variational Gaussian process state-space models," in NeurIPS, 2014.
[29] M. Jankowiak, G. Pleiss, and J. R. Gardner, "Parametric Gaussian process regressors," in ICML, 2020.
[30] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IROS, 2012.
[31] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in NeurIPS, 2019.
[32] J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson, "GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration," in NeurIPS, 2018.
[33] R. M. Neal, Bayesian Learning for Neural Networks. Springer-Verlag, 1996.
[34] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in IROS, 2017.