Data-Efficient Learning of Feedback Policies from Image Pixels using Deep Dynamical Models
John-Alexander M. Assael∗
Department of Computing, Imperial College London, UK
[email protected]

Niklas Wahlström∗
Division of Automatic Control, Linköping University, Sweden
[email protected]

Thomas B. Schön
Department of Information Technology, Uppsala University, Sweden
[email protected]

Marc Peter Deisenroth
Department of Computing, Imperial College London, UK
[email protected]

∗These authors contributed equally to this work.
Abstract
Data-efficient reinforcement learning (RL) in continuous state-action spaces using very high-dimensional observations remains a key challenge in developing fully autonomous systems. We consider a particularly important instance of this challenge, the pixels-to-torques problem, where an RL agent learns a closed-loop control policy ("torques") from pixel information only. We introduce a data-efficient, model-based reinforcement learning algorithm that learns such a closed-loop policy directly from pixel information. The key ingredient is a deep dynamical model for learning a low-dimensional feature embedding of images jointly with a predictive model in this low-dimensional feature space. Joint learning is crucial for long-term predictions, which lie at the core of the adaptive nonlinear model predictive control strategy that we use for closed-loop control. Compared to state-of-the-art RL methods for continuous states and actions, our approach learns quickly, scales to high-dimensional state spaces, is lightweight, and is an important step toward fully autonomous end-to-end learning from pixels to torques.
Introduction

The vision of fully autonomous and intelligent systems that learn by themselves has inspired artificial intelligence (AI) and robotics research for many decades. The pixels-to-torques problem identifies key aspects of such an autonomous system: autonomous thinking and decision making using (general-purpose) sensor measurements only, intelligent exploration, and learning from mistakes. We consider the problem of efficiently learning closed-loop policies ("torques") from pixel information end-to-end. Although this problem falls into the general class of reinforcement learning (RL) [1], it is challenging for the following reasons: (1) The state space (here defined by pixel values) is enormous; even a small gray-scale image amounts to thousands of continuous-valued dimensions. (2) In many practical applications, we need to find solutions data efficiently: When working with real systems, e.g., robots, we cannot perform millions of experiments because of time and hardware constraints.

One way of using data efficiently, and, therefore, reducing the number of experiments, is to learn predictive forward models of the underlying dynamical system, which are then used for internal simulations and policy learning. These ideas have been successfully applied to RL, control, and robotics [2–7], for instance. However, they often rely on heuristics, demonstrations, or engineered low-dimensional features, and do not easily scale to data-efficient RL using pixel information only. A common way of dealing with high-dimensional data is to learn low-dimensional feature representations. Deep learning architectures, such as deep neural networks [8], stacked auto-encoders [9, 10], and convolutional neural networks [11], are the current state of the art in learning parsimonious representations of high-dimensional data. Since 2006, deep learning has produced outstanding empirical results in image, text, and audio tasks [12].
Related Work

Recently, there has been significant progress in the context of the pixels-to-torques problem. A first working solution was presented in 2015 [13], where an RL agent automatically learned to play Atari games purely based on pixel information. The key idea was to embed the high-dimensional pixel space into a lower-dimensional space using deep neural networks and apply Q-learning in this compact feature space. A potential issue with this approach is that it is not a data-efficient way of learning policies (weeks of training data are required), i.e., it is impractical to apply to a robotic scenario. This data inefficiency is not specific to Q-learning, but a general problem of model-free RL methods [3, 14].

To increase data efficiency, model-based RL methods aim to learn a model of the transition dynamics of the system/robot and subsequently use this model as a surrogate simulator. Recently, the idea of learning predictive models from raw images where only pixel information is available was exploited [15, 16]. The approach taken here follows the idea of Deep Dynamical Models (DDMs) [17]: Instead of learning predictive models for images directly, a detour via a low-dimensional feature space is taken by embedding images into a lower-dimensional feature space, e.g., with a deep auto-encoder. This detour is promising since direct mappings between high-dimensional spaces require large data sets. Whereas Wahlström et al. [15] consider deterministic systems and nonlinear model predictive control (NMPC) techniques for online control, Watter et al. [16] use variational auto-encoders [18], local linearization, and locally linear control methods (iLQR [19] and AICO [20]). To model the dynamical behavior of the system, the pixels of both the current and previous frame are used. Watter et al. [16] concatenate the input pixels to discover such features, whereas Wahlström et al. [15] concatenate the processed low-dimensional embeddings of the two states. The latter approach requires substantially fewer parameters, which makes it a promising candidate for more data-efficient learning. Nevertheless, properties such as local linearization [16] can be attractive. However, the complex architecture proposed by Watter et al. [16] is based on very large neural network models with millions of parameters for learning to swing up a single pendulum. A vast number of training parameters results in higher model complexity and, thus, decreased statistical efficiency. Hence, an excessive number of training samples, which might not be available, is required to learn the underlying system, taking several days of training. These properties make data-efficient learning complicated. Therefore, we propose a relatively lightweight architecture to address the pixels-to-torques problem in a data-efficient manner.
Contribution

We propose a data-efficient model-based RL algorithm that addresses the pixels-to-torques problem. (1) We devise a data-efficient policy learning framework based on the DDM approach for learning predictive models for images. We use state-of-the-art optimization techniques for training the DDM. (2) Our model profits from a concatenation of low-dimensional features (instead of high-dimensional images) to model dynamical behavior, yielding many times fewer model parameters and faster training than the complex E2C architecture [16]. In practice, our model requires only a few hours of training, while E2C [16] requires days. (3) We introduce a novel training objective that encourages consistency in the latent space, paving the way toward more accurate long-term predictions. Overall, we use an efficient model architecture that can learn tasks with complex nonlinear dynamics.

We consider an $N$-step finite-horizon RL setting in which an agent attempts to solve a particular task by trial and error. In particular, our objective is to find a closed-loop policy $\pi^*$ that minimizes the long-term cost $V^\pi = \sum_{t=0}^{N-1} c(s_t, u_t)$, where $c$ denotes an immediate cost, $s_t \in \mathbb{R}^{N_s}$ is the continuous-valued system state, and $u_t \in \mathbb{R}^{N_u}$ are continuous control signals.

The learning agent faces the following two additional challenges: (a) The agent has no access to the true state $s_t$, but perceives the environment only through high-dimensional pixel information $x_t \in \mathbb{R}^{N_x}$ (images); (b) A good control policy is required after only a few trials. This setting is practically relevant, e.g., when the agent is a robot that is monitored by a video camera based on which the robot has to learn to solve tasks fully autonomously. Therefore, this setting is an instance of the pixels-to-torques problem.

We solve this problem in three key steps, which are detailed in the following sections: (a) Using a deep auto-encoder architecture, we map the high-dimensional pixel information x_t to a low-dimensional embedding/feature z_t. (b) We combine the features z_t and z_{t-1} with the control signal u_t to learn a predictive model of the system dynamics for predicting future features z_{t+1}. Steps (a) and (b) form a Deep Dynamical Model (DDM) [17]. (c) We apply an adaptive nonlinear model predictive control strategy for optimal closed-loop control and end-to-end learning from pixels to torques.

Deep Dynamical Model

Figure 1: Graphical model of the DDM process for predicting future states.

Our approach to solving the pixels-to-torques problem is based on a deep dynamical model (DDM), see Figure 1, which jointly (a) embeds high-dimensional images in a low-dimensional feature space via deep auto-encoders and (b) learns a predictive forward model in this feature space, based on the work by Wahlström et al. [17]. In particular, we consider a DDM with control signals u_t and high-dimensional observations x_t at time step t. We assume that the relevant properties of x_t can be compactly represented by a feature variable z_t. Furthermore, x̃_t is the reconstructed high-dimensional measurement. The two components of the DDM, i.e., the low-dimensional feature and the predictive model, which predicts future features ẑ_{t+1} and observations x̂_{t+1} based on past observations and control signals, are detailed in the following sections.
Inspired by the concept of static auto-encoders [21, 22], we turn them into a dynamical model that can predict future features ẑ_{t+1} and images x̂_{t+1}. Our DDM consists of the following elements:

1. An encoder f_enc mapping high-dimensional observations x_t onto low-dimensional features z_t,
2. A decoder f_dec mapping low-dimensional features z_t back to high-dimensional observations x̃_t, and
3. A predictive model f_pred, which takes z_t, z_{t-1}, u_t as input and predicts the next latent feature ẑ_{t+1}.

The f_enc, f_dec and f_pred functions of our DDM are neural network models performing the following transformations:

$z_t = f_{\text{enc}}(x_t)$,  (1a)
$\tilde{x}_t = f_{\text{dec}}(z_t)$,  (1b)
$\hat{z}_{t+1} = f_{\text{pred}}(z_{t-1}, z_t, u_t)$,  (1c)
$\hat{x}_{t+1} = f_{\text{dec}}(\hat{z}_{t+1})$.  (1d)

We now put these elements together to construct the DDM. The DDM architecture takes the raw images x_{t-1} and x_t as input and maps them to their low-dimensional features z_{t-1} and z_t, respectively, using f_enc in (1a). These latent features are then concatenated and, together with the control signal u_t, used to predict ẑ_{t+1} with f_pred in (1c). Finally, the predicted feature ẑ_{t+1} is passed through the decoder network f_dec to compute the predicted image x̂_{t+1}. The overall architecture is depicted in Figure 2.

Figure 2: The architecture of the proposed DDM, for extracting the underlying properties of the dynamical system and predicting the next image frame from raw pixel information.

The neural networks f_enc, f_dec and f_pred that compose the DDM are parameterized by θ_enc, θ_dec and θ_pred, respectively. These parameters consist of the weights that perform linear transformations of the input data in each neuron.

For training the DDM in (1), we introduce a novel training objective that encourages consistency in the latent space, paving the way toward accurate long-term predictions. More specifically, for our training objective we define the following cost functions

$L_R(x_t) = \|\tilde{x}_t - x_t\|^2$,  (2a)
$L_P(x_{t-1}, x_t, u_t, x_{t+1}) = \|\hat{x}_{t+1} - x_{t+1}\|^2$,  (2b)
$L_L(z_{t-1}, z_t, u_t, z_{t+1}) = \|\hat{z}_{t+1} - z_{t+1}\|^2$,  (2c)

where L_R is the squared deep auto-encoder reconstruction error and L_P is the squared prediction error, both operating in image space. Note that $\hat{x}_{t+1} = f_{\text{dec}}(f_{\text{pred}}(f_{\text{enc}}(x_{t-1}), f_{\text{enc}}(x_t), u_t))$ depends on the parameters of the decoder, the predictive model and the encoder. Additionally, we introduce L_L, which enforces consistency between the latent spaces of the encoder f_enc and the prediction model f_pred. In the big-data regime, this additional penalty in latent space is not necessary, but if not much data is available, this additional term increases the data efficiency, as the prediction model is forced to make predictions ẑ_{t+1} = f_pred(z_{t-1}, z_t, u_t) close to the next embedded feature z_{t+1} = f_enc(x_{t+1}). The overall training objective on the current dataset $\mathcal{D} = (x_{0:N}, u_{0:N})$ is given by

$L(\mathcal{D}) = \sum_{t=0}^{N-1} L_R(x_t) + L_P(x_{t-1}, x_t, u_t, x_{t+1}) + \alpha L_L(z_{t-1}, z_t, u_t, z_{t+1}) = \sum_{t=0}^{N-1} \|\tilde{x}_t - x_t\|^2 + \|\hat{x}_{t+1} - x_{t+1}\|^2 + \alpha \|\hat{z}_{t+1} - z_{t+1}\|^2$,  (3)

where α is a parameter that controls the influence of L_L.
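To make the structure of (1)–(3) concrete, the following is a minimal sketch of the DDM and its training objective, written in PyTorch for concreteness (the paper does not specify a framework). The hidden-layer widths and default dimensions are illustrative assumptions; the paper only fixes the PCA input dimension and the feature dimension for each experiment.

```python
import torch
import torch.nn as nn

class DDM(nn.Module):
    """Minimal sketch of the deep dynamical model (DDM), Eq. (1a)-(1d)."""
    def __init__(self, x_dim=100, z_dim=2, u_dim=1, hidden=50):
        super().__init__()
        # f_enc: (PCA-reduced) image x_t -> low-dimensional feature z_t
        self.f_enc = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim))
        # f_dec: feature z_t -> reconstructed image
        self.f_dec = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim))
        # f_pred: (z_{t-1}, z_t, u_t) -> predicted next feature
        self.f_pred = nn.Sequential(
            nn.Linear(2 * z_dim + u_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim))

    def loss(self, x_prev, x_t, u_t, x_next, alpha=1.0):
        """Training objective, Eq. (3): reconstruction + prediction
        in image space + latent-space consistency."""
        z_prev, z_t, z_next = self.f_enc(x_prev), self.f_enc(x_t), self.f_enc(x_next)
        x_rec = self.f_dec(z_t)                                  # Eq. (1b)
        z_pred = self.f_pred(torch.cat([z_prev, z_t, u_t], -1))  # Eq. (1c)
        x_pred = self.f_dec(z_pred)                              # Eq. (1d)
        L_R = ((x_rec - x_t) ** 2).sum(-1)      # Eq. (2a)
        L_P = ((x_pred - x_next) ** 2).sum(-1)  # Eq. (2b)
        L_L = ((z_pred - z_next) ** 2).sum(-1)  # Eq. (2c)
        return (L_R + L_P + alpha * L_L).mean()
```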
Finally, we train the DDM parameters (θ_enc, θ_dec, θ_pred) by jointly minimizing the overall cost,

$(\hat{\theta}_{\text{enc}}, \hat{\theta}_{\text{dec}}, \hat{\theta}_{\text{pred}}) = \arg\min_{\theta_{\text{enc}}, \theta_{\text{dec}}, \theta_{\text{pred}}} L(\mathcal{D})$.  (4)

Training jointly leads to good predictions, as it facilitates the extraction and separation of the features describing the underlying dynamical system, and not only of features for creating good reconstructions [17]. The neural networks f_enc, f_dec and f_pred are composed of fully connected linear layers, where each layer except the last is followed by a Rectified Linear Unit (ReLU) activation function [23]. As has been demonstrated [24], ReLU nonlinearities allow a network to train several times faster than conventional tanh units, as evaluated on the CIFAR-10 dataset [25]. Furthermore, similar to Watter et al. [16], we use Adam [26] to train the DDM, which is considered the state of the art among recent methods for stochastic gradient optimization. Finally, after evaluating different weight initialization methods, such as uniform and random Gaussian initialization [24, 27], the weights of the DDM were initialized using orthogonal weight initialization [28], which demonstrated the most efficient training performance, leading to decoupled weights that evolve independently of each other.
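A sketch of the corresponding joint training step is given below, under the assumption that `data` yields mini-batches of (x_{t-1}, x_t, u_t, x_{t+1}) tuples. The epoch count and learning rate are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

def init_orthogonal(module):
    # Orthogonal weight initialization [28]; biases start at zero.
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight)
        nn.init.zeros_(module.bias)

def train_ddm(ddm, data, epochs=200, lr=1e-3, alpha=1.0):
    """Jointly minimize Eq. (4) over (theta_enc, theta_dec, theta_pred)."""
    ddm.apply(init_orthogonal)
    opt = torch.optim.Adam(ddm.parameters(), lr=lr)  # Adam [26]
    for _ in range(epochs):
        for x_prev, x_t, u_t, x_next in data:
            opt.zero_grad()
            loss = ddm.loss(x_prev, x_t, u_t, x_next, alpha)  # Eq. (3)
            loss.backward()
            opt.step()
    return ddm
```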
Policy Learning

Our objective is to control the system toward a state that corresponds to a given target frame x_ref, without any prior knowledge of the system at hand. To accomplish this, we use the DDM for learning a closed-loop policy by means of nonlinear model predictive control (NMPC). NMPC finds an optimal sequence of control signals that minimizes a K-step loss function, where K is typically smaller than the full horizon. We choose to perform the control in the low-dimensional embedded space to reduce the complexity of the control problem.

Our NMPC formulation relies on (a) a target feature z_ref and (b) the DDM that allows us to predict future features. The target feature is computed by encoding the target frame, z_ref = f_enc(x_ref). Further, with the DDM, future features ẑ_1, ..., ẑ_K can be predicted based on a sequence of future (and yet unknown) controls u_0, ..., u_{K-1} and two initial encoded features z_{-1}, z_0, assuming that the current feature is denoted by z_0.

Using the dynamical model, NMPC determines an optimal (open-loop) control sequence u*_0, ..., u*_{K-1}, such that the predicted features ẑ_1, ..., ẑ_K get as close to the target feature z_ref as possible, which results in the objective

$u_0^*, \ldots, u_{K-1}^* \in \arg\min_{u_0, \ldots, u_{K-1}} \sum_{t=0}^{K-1} \|\hat{z}_t - z_{\text{ref}}\|^2 + \lambda \|u_t\|^2$,  (5)

where $\|\hat{z}_t - z_{\text{ref}}\|^2$ is a cost associated with the deviation of the predicted features from the reference feature z_ref, and $\|u_t\|^2$ penalizes the amplitude of the control signals. Here, λ is a tuning parameter adjusting the importance of these two objectives. When the control sequence u*_0, ..., u*_{K-1} is determined, the first control u*_0 is applied to the system. After observing the next feature, NMPC repeats the entire optimization and turns the overall policy into a closed-loop (feedback) control strategy.

Overall, we now have an online NMPC algorithm that, given a trained DDM, works indirectly on images by exploiting their feature representation. We now turn to describing how adaptive NMPC can be used together with our DDM to address the pixels-to-torques problem and to learn from scratch. At the core of our NMPC formulation lies the DDM, which is used to predict future features (and images) from a sequence of control signals. The quality of the NMPC controller is inherently bound to the prediction quality of the dynamical model, which is typical in model-based RL [5, 14, 29].
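One way to realize the receding-horizon optimization in (5) is gradient descent on the control sequence through the learned model, sketched below using the DDM from the earlier sketch. The optimizer choice, iteration count, and the value of `lam` are illustrative assumptions; K = 15 matches the planning horizon used in the experiments reported later.

```python
import torch

def nmpc_action(ddm, z_prev, z_t, z_ref, K=15, lam=0.01, iters=50, u_dim=1):
    """Receding-horizon control, Eq. (5): optimize the open-loop sequence
    u_0..u_{K-1} by gradient descent through f_pred, then return only u_0."""
    z_prev, z_t = z_prev.detach(), z_t.detach()
    u = torch.zeros(K, u_dim, requires_grad=True)
    opt = torch.optim.Adam([u], lr=0.1)
    for _ in range(iters):
        opt.zero_grad()
        zp, zt, cost = z_prev, z_t, 0.0
        for t in range(K):
            zt_next = ddm.f_pred(torch.cat([zp, zt, u[t]], -1))  # Eq. (1c)
            cost = cost + ((zt_next - z_ref) ** 2).sum() + lam * (u[t] ** 2).sum()
            zp, zt = zt, zt_next   # shift the feature window forward
        cost.backward()
        opt.step()
    return u[0].detach()  # apply the first control, then re-plan (closed loop)
```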
Algorithm 1: Adaptive online NMPC in feature space

    Follow a random control strategy and record data
    loop
        Update DDM with all data collected so far
        for t = 0 to N − 1 do
            Get current feature z_t via the encoder
            u*_t ← ε-greedy NMPC policy using DDM predictions
            Apply u*_t and record data
        end for
    end loop
To learn models and controllers from scratch, we apply a control scheme that allows us to update the DDM as new data arrives. In particular, we use the NMPC controller in an adaptive fashion to gradually improve the model with data collected in the feedback loop, without any specific prior knowledge of the system at hand. Data collection is performed in closed loop (online NMPC), and it is divided into multiple sequential trials. After each trial, we add the data of the most recent trajectory to the data set, and the model is re-trained using all data that has been collected so far.

Simply applying the NMPC controller based on a randomly initialized model would make the closed-loop system very likely to converge to a point far away from the desired reference value, because the poor model cannot extrapolate well to unseen states. This would in turn imply that no data is collected in unexplored regions, including the region that we are interested in. There are two solutions to this problem: either we use a probabilistic dynamical model [5, 14] to explicitly account for model uncertainty and the implied natural exploration, or we follow an explicit exploration strategy to ensure proper excitation of the system. In this paper, we follow the latter approach. In particular, we choose an ε-greedy exploration strategy where the optimal feedback u*_t at each time step is selected with probability 1 − ε, and a random action is selected with probability ε.

Algorithm 1 summarizes our adaptive online NMPC scheme. We initialize the DDM with a random trial. We use the learned DDM to find an ε-greedy policy using predicted features within NMPC. This happens online, while the collected data is added to the data set and the DDM is updated after each trial.
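The sketch below puts Algorithm 1 together with the earlier training and NMPC sketches. Here `env` is a hypothetical simulator exposing `reset`, `step`, and `random_action`; `minibatches` is a small helper; and the values of `trials`, `steps`, and `epsilon` are illustrative assumptions rather than reported settings.

```python
import random
import torch

def minibatches(data, size=64):
    """Helper: shuffle recorded (x_prev, x_t, u, x_next) tensor tuples and
    stack them into mini-batches for train_ddm (see training sketch above)."""
    random.shuffle(data)
    for i in range(0, len(data), size):
        chunk = data[i:i + size]
        yield tuple(torch.stack(col) for col in zip(*chunk))

def run_trial(env, data, policy, steps):
    # Roll the (hypothetical) simulator for one trial and record transitions.
    x_prev, x_t = env.reset()
    for _ in range(steps):
        u = policy(x_prev, x_t)
        x_next = env.step(u)
        data.append((x_prev, x_t, u, x_next))
        x_prev, x_t = x_t, x_next

def adaptive_nmpc(ddm, env, z_ref, trials=10, steps=1000, epsilon=0.2):
    """Sketch of Algorithm 1: an initial random trial, then alternate between
    re-training the DDM on all recorded data and epsilon-greedy NMPC rollouts."""
    data = []
    run_trial(env, data, lambda *_: env.random_action(), steps)  # random trial
    for _ in range(trials):
        train_ddm(ddm, list(minibatches(data)))   # update DDM with all data
        def policy(x_prev, x_t):
            if random.random() < epsilon:         # explore with probability eps
                return env.random_action()
            z_prev, z_t = ddm.f_enc(x_prev), ddm.f_enc(x_t)
            return nmpc_action(ddm, z_prev, z_t, z_ref)  # greedy NMPC, Eq. (5)
        run_trial(env, data, policy, steps)       # closed-loop NMPC trial
```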
Experiments

In this section, we empirically assess the components of our proposed methodology for autonomously learning from high-dimensional synthetic image data, on learning the underlying dynamics of a single and a planar double pendulum. The main lines of the evaluation are (a) the quality of the learned DDM and (b) the overall learning framework.

In both experiments, we consider the following setting: We take screenshots of a simulated pendulum system at a fixed sampling rate. Each pixel x_t^(i) is a component of the measurement x_t ∈ R^{N_x} and takes a continuous gray-value in the interval [0, 1]. The control signals u_t are the torques applied to the system. No access to the underlying dynamics or the state (angles and angular velocities) was available, i.e., we are dealing with a high-dimensional continuous time series. The challenge was to data-efficiently learn (a) a good dynamical model and (b) a good controller from pixel information only.

To speed up the training process, we applied PCA prior to model learning as a pre-processing step to reduce the dimensionality of the original problem. With these inputs, a multi-layer auto-encoder was employed, such that the dimensionality of the features is optimal for modeling the periodic angles of the pendulums. The features z_{t-1}, z_t and u_t were then passed to a multi-layer predictive feedforward neural network generating ẑ_{t+1}. During training, the parameter α, which encourages consistent latent-space predictions, and the NMPC tuning parameter λ, which penalizes the amplitude of the control signals, were held fixed across both experiments.
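As a concrete illustration of this pre-processing step, the sketch below PCA-reduces flattened screenshots before they are fed to f_enc. The frame count is a stand-in, and the 40 × 40 → 100 dimensions follow the single-pendulum experiment described next.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on flattened training frames (40x40 = 1600 pixels -> 100 dims)
# to speed up DDM training; the frame array here is a random stand-in.
frames = np.random.rand(5000, 1600)
pca = PCA(n_components=100).fit(frames)
x_t = pca.transform(frames)              # low-dimensional inputs to f_enc
recon = pca.inverse_transform(x_t)       # map back to pixel space for display
```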
Planar Single Pendulum

The first experiment evaluates the performance of the DDM on a planar pendulum: a 1-link robot arm with friction at the joint. The screenshots consist of 40 × 40 = 1600 pixels, and the input dimension was reduced to dim(x_t) = 100 using PCA. These inputs are processed by an encoder f_enc consisting of three fully connected layers, where the first two are followed by ReLU activations.

Figure 3: Long-term (up to eight steps) predictive performance of the DDM controlling a planar pendulum (a) and a planar double pendulum (b): true (upper plot) and predicted (lower plot) video frames on test data.

The low-dimensional features have dim(z_t) = 2, in order to model the periodic angle of the pendulum. To capture dynamic properties, such as the angular velocity, we concatenate two consecutive features z_{t-1}, z_t with the control signal u_t ∈ R and pass them through the predictive model f_pred, again a three-layer ReLU network. Note that the dimensionality of its first layer is given by dim(z_{t-1}) + dim(z_t) + dim(u_t) = 5. Finally, the predicted feature ẑ_{t+1} can be mapped back to x̂_{t+1} using our decoder, which mirrors the encoder architecture.

Figure 4: The feature space z ∈ R² of the single-pendulum experiment for different pendulum angles between 0° and 360°, generated by the DDM.

The performance of the DDM is illustrated in Figure 3 on a test data set. The top row shows the true images and the bottom row the DDM's long-term predictions. The model yields good predictive performance for both one-step-ahead and multiple-step-ahead prediction, a consequence of (a) jointly learning the predictor and the auto-encoder, and (b) concatenating features instead of images to model the dynamic behavior.

In Figure 4, we show the learned feature space z ∈ R² for different pendulum angles between 0° and 360°. The DDM has learned to generate features that represent the angle of the pendulum, as they are mapped onto a circle-like shape accounting for the wrap-around property of an angle.

Finally, in Figure 5, we report results on learning a policy that moves the pendulum from a start position ϕ = 0° to the upright target position ϕ = ±180°. The reference signal was the screenshot of the pendulum in the target position. For the NMPC controller, we used a planning horizon of K = 15 steps, together with the control penalty λ and the ε-greedy exploration strategy with the settings described above.

Figure 5 shows the learning stages of the system, i.e., the different trials of the NMPC controller. Starting with a randomly initialized model, the images of each trial (1000 frames) were appended to the dataset. As can be seen, already from the first controlled trial, the system managed to control the pendulum successfully and bring it to a position less than 10° from the target position. This means the solution is found very data-efficiently, especially considering that the problem is learned from pixel information without access to the "true" state.
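The long-term predictions in Figure 3 correspond to iterating f_pred in feature space and decoding each step back to pixels. A minimal sketch, reusing the DDM from the earlier sketches:

```python
import torch

def rollout(ddm, x_prev, x_t, controls):
    """Long-term prediction as in Figure 3: encode two seed frames, iterate
    f_pred in feature space, and decode each predicted feature to an image."""
    z_prev, z_t = ddm.f_enc(x_prev), ddm.f_enc(x_t)
    frames = []
    for u in controls:                    # e.g., eight future control inputs
        z_next = ddm.f_pred(torch.cat([z_prev, z_t, u], -1))
        frames.append(ddm.f_dec(z_next))  # predicted image
        z_prev, z_t = z_t, z_next         # feed predictions back in
    return frames
```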
Planar Double Pendulum

In this experiment, a planar double pendulum is considered: a 2-link robot arm with link lengths of 1 m, link weights of 1 kg, and friction at the joints. Torques can be applied at both joints. The screenshots consist of 48 × 48 = 2304 pixels, and the input dimension was reduced to dim(x_t) = 512 prior to model learning, using PCA to speed up the training process. The encoder is again a three-layer ReLU network, and the decoder mirrors it. The low-dimensional embeddings have dim(z_t) = 4, and the predictive model follows the same three-layer structure, with its first layer now of dimension dim(z_{t-1}) + dim(z_t) + dim(u_t) = 10.

Figure 5: Results on learning a policy that moves the single (a) and the double (b) pendulum systems from ϕ = 0° to ϕ = ±180°. The horizontal axis shows the learning stages and the corresponding image frames available to the learner. The vertical axis shows the absolute error from the target state, averaged over the last 10 time steps of each test trajectory. The dashed line shows a 10° error, which indicates a "good" solution.

The predictive performance of the DDM is shown in Figure 3 on a test data set. The performance of the controller is depicted in Figure 5. We ran a series of trials with the downward initial position ϕ = 0° and upward target ϕ = ±180° for the angles of both the inner and outer pendulum. The figure shows the error after each trial (1000 frames) and clearly indicates that after three controlled trials a good solution is found, which brings both pendulums within a 10° range of the target angles. Despite the high complexity of the dynamical system, our learning framework manages to successfully control both pendulums after the third trial in nearly all cases.

The same experiments were executed with PILCO [30], a state-of-the-art policy search method, under the following settings: (a) PILCO has access to the true state, i.e., the angle ϕ and angular velocity ϕ̇; (b) a deep auto-encoder is used to learn two-dimensional features z_t from images, which are used by PILCO for policy learning. In the first setting (a), PILCO managed to successfully reach the target after the second and the third trial in the two experiments, respectively. However, in setting (b), PILCO did not manage to learn anything meaningful at all. The reason why PILCO could not learn on auto-encoder features is that these features were only trained to minimize the reconstruction error. The auto-encoder did not attempt to map similar images to similar features, which led to zig-zagging around in feature space (instead of following a smooth manifold as in Figure 4), making the model-learning part in feature space incredibly hard [17].

We modeled and controlled equally complex systems with E2C [16], but our DDM requires several times fewer neural network parameters when the same PCA pre-processing step is used within E2C; the gap widens further when E2C is used without the PCA pre-processing step. The reason lies in our efficient processing of the dynamics of the model in the feature space instead of the image space. The number of parameters translates directly into reduced training time and increased data efficiency. Employing adaptive model predictive control, our proposed DDM requires significantly fewer data samples, as it efficiently focuses on learning the latent space toward the reference target state. Furthermore, the control performance of our model improves gradually with the number of trials.
As our experimental evaluation shows, we can successfully control a complex dynamical system, such as the planar double pendulum, with only a few trials' worth of samples. This adaptive learning approach can be essential in problems with time and hardware constraints.

Conclusion

We proposed a data-efficient model-based RL algorithm that learns closed-loop policies in continuous state and action spaces directly from pixel information. The key components of our solution are (a) a deep dynamical model (DDM) that is used for long-term predictions via a compact feature space, (b) a novel training objective that encourages consistency in the latent space, paving the way toward more accurate long-term predictions, and (c) an NMPC controller that uses the predictions of the DDM to determine optimal actions on the fly without the need for value function estimation. For the success of this RL algorithm, it is crucial that the DDM learns the feature mapping and the predictive model in feature space jointly, to capture dynamical behavior for high-quality long-term predictions. Compared to state-of-the-art RL, our algorithm learns fairly quickly, scales to high-dimensional state spaces, and facilitates learning from pixels to torques.
Acknowledgements
We thank Roberto Calandra for valuable discussions in the early stages of the project. The Tesla K40 used for this research was donated by the NVIDIA Corporation.

References

[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, 1998.
[2] J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In International Joint Conference on Neural Networks (IJCNN), pages 253–258. IEEE, 1990.
[3] C. G. Atkeson and S. Schaal. Learning tasks from a single demonstration. In Proceedings of the International Conference on Robotics and Automation (ICRA). IEEE, 1997.
[4] J. A. Bagnell and J. G. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 2, pages 1615–1620, 2001.
[5] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(2):408–423, 2015.
[6] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS), pages 1907–1915, 2014.
[7] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
[8] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[9] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 153–160. MIT Press, 2007.
[10] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML), pages 1096–1103. ACM, 2008.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] J. Schmidhuber. Deep learning in neural networks: An overview. Technical Report IDSIA-03-14 / arXiv:1404.7828v1 [cs.NE], The Swiss AI Lab IDSIA, 2014.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[14] J. G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems (NIPS), 1997.
[15] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251, 2015.
[16] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. arXiv preprint arXiv:1506.07365, 2015.
[17] N. Wahlström, T. B. Schön, and M. P. Deisenroth. Learning deep dynamical models from image pixels. In IFAC Symposium on System Identification (SYSID), 2015.
[18] D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning (ICML), June 2014.
[19] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference (ACC), pages 300–306. IEEE, 2005.
[20] M. Toussaint. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
[22] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[25] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Computer Science Department, University of Toronto, 2009.
[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.
[28] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[29] S. Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems (NIPS), pages 1040–1046, 1997.
[30] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.