Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future
Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, Dhruv Batra
Affiliations: Mila, Université de Montréal; Facebook AI Research; Polytechnique Montréal; Georgia Institute of Technology; CIFAR Senior Fellow. Corresponding authors: [email protected]

Abstract
In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.
1 Introduction
Reinforcement Learning (RL) is an agent-oriented learning paradigm concerned with learning by interacting with an uncertain environment. Combined with deep neural networks as function approximators, deep reinforcement learning (deep RL) algorithms have recently allowed us to tackle highly complex tasks. Despite recent success in a variety of challenging environments such as Atari games (Bellemare et al., 2013) and the game of Go (Silver et al., 2016), it is still difficult to apply RL approaches in domains with high-dimensional observation-action spaces and complex dynamics. Furthermore, most popular RL algorithms are model-free: they directly learn a value function (Mnih et al., 2015) or policy (Schulman et al., 2015; 2017) without trying to model or predict the environment's dynamics. Model-free RL techniques often require large amounts of training data and can be expensive, dangerous or impossibly slow, especially for agents and robots acting in the real world. On the other hand, model-based RL (Sutton, 1991; Deisenroth & Rasmussen, 2011; Chiappa et al., 2017) provides an alternative approach by learning an explicit representation of the underlying environment dynamics. The principal idea of model-based methods is to use the estimated model as an internal simulator for planning, hence limiting the need for interaction with the environment. Unfortunately, when the dynamics are complex, it is not trivial to learn models that are accurate enough to later ensure stable and fast learning of a good policy.

The most widely used techniques for model learning are based on one-step prediction. Specifically, given an observation $o_t$ and an action $a_t$ at time $t$, a model is trained to predict the conditional distribution over the immediate next observation $o_{t+1}$, i.e. $p(o_{t+1} \mid o_t, a_t)$. Although computationally easy, the one-step prediction error is an inadequate proxy for the downstream performance of model-based methods, as it does not account for how the model behaves when composed with itself. In fact, one-step modelling errors can compound after multiple steps and degrade the policy learning. This is referred to as the compounding error phenomenon (Talvitie, 2014; Asadi et al., 2018; Weber et al., 2017). Other examples of models are autoregressive models such as recurrent neural networks (Mikolov et al., 2010) that factorize naturally as $\log p_\theta(o_{t+1}, a_{t+1}, o_{t+2}, a_{t+2}, \dots \mid o_t, a_t) = \sum_t \log p_\theta(o_{t+1}, a_{t+1} \mid o_1, a_1, \dots, o_t, a_t)$. Training autoregressive models using maximum likelihood results in 'teacher forcing' that breaks the training over one-step decisions. Such sequential models are known to suffer from accumulating errors, as observed in (Lamb et al., 2016; Bengio et al., 2015).

Our key motivation is the following: a model of the environment should reason about (i.e. be trained to predict) long-term transition dynamics $p_\theta(o_{t+1}, a_{t+1}, o_{t+2}, a_{t+2}, \dots \mid o_t, a_t)$ and not just single-step transitions $p_\theta(o_{t+1} \mid o_t, a_t)$. That is, the model should predict what will happen in the long-term future, and not just the immediate future.
We hypothesize (and test) that such a model would exhibit less cascading of errors and would learn better feature embeddings for improved performance.

One way to capture long-term transition dynamics is to use latent-variable recurrent networks. Ideally, latent variables could capture higher-level structure in the data and help to reason about long-term transition dynamics. However, in practice it is difficult for latent variables to capture higher-level representations in the presence of a strong autoregressive model, as shown in Gulrajani et al. (2016); Goyal et al. (2017); Guu et al. (2018). To overcome this difficulty, we leverage recent advances in variational inference. In particular, we make use of the recently proposed Z-forcing idea (Goyal et al., 2017), which uses an auxiliary cost on the latent variable to predict the long-term future. Keeping in mind that more accurate long-term prediction is better for planning, we use two ways to inject future information into latent variables. Firstly, we augment the dynamics model with a backward recurrent network (RNN) such that the approximate posterior of latent variables depends on a summary of future information. Secondly, we force latent variables to predict a summary of the future using an auxiliary cost that acts as a regularizer. Unlike one-step prediction, our approach encourages the predicted future observations to remain grounded in the real observations.

Injecting information about the future can also help in planning, as it can be seen as injecting a plan for the future. Under stochastic environment dynamics, unfolding the dynamics model may lead to unlikely trajectories due to errors compounding at each step during rollouts.

In this work, we make the following key contributions:

1. We demonstrate that having an auxiliary loss to predict the longer-term future helps in faster imitation learning.
2. We demonstrate that incorporating the latent plan into the dynamics model can be used for efficient planning (for example, Model Predictive Control). We show the performance of the proposed method as compared to existing state-of-the-art RL methods.
3. We empirically observe that the proposed auxiliary loss could help in finding subgoals in a partially observable 2D environment.

2 Proposed Model
We consider an agent in an environment that observes at each time step $t$ an observation $o_t$. The execution of a given action $a_t$ causes the environment to transition to a new unobserved state, return a reward and emit an observation at the next time step sampled from $p^\star(o_{t+1} \mid o_{\leq t}, a_{\leq t})$, where $o_{\leq t}$ and $a_{\leq t}$ are the observation and action sequences up to time step $t$. In many domains of interest, the underlying transition dynamics $p^\star$ are not known and the observations are very high-dimensional raw pixel observations. In the following, we explain our proposed approach to learn an accurate environment model that can be used as an internal simulator for planning.

We focus on the task of predicting a future observation-action sequence $(o_{1:T}, a_{1:T})$ given an initial observation $o_0$. We frame this problem as estimating the conditional probability distribution $p(o_{1:T}, a_{1:T} \mid o_0)$. This distribution is modeled by a recurrent neural network with stochastic latent variables $z_{1:T}$. We train the model using variational inference: we introduce an approximate posterior over the latent variables and maximize a regularized form of the Evidence Lower Bound (ELBO). The regularization comes from an auxiliary task we assign to the latent variables.

Figure 1: Left: the graphical model representing the generative model $p_\theta$. Right: the architecture of the inference model. The inference network $q_\phi$ uses a backward recurrent state $b_t$ (in red) to approximate the dependence of $z_t$ on future observations. It shares the forward recurrent state $h_{t-1}$ with the generative model to approximate the dependence of $z_t$ on past observations and latent variables. Boxes are deterministic hidden states, circles are random variables, and filled circles represent variables observed during training.

2.1 Generative Process
The graphical model in Fig. 1 illustrates the dependencies in our generative model. Observations and latent variables are coupled by using an autoregressive model, the Long Short-Term Memory (LSTM) architecture (Hochreiter & Schmidhuber, 1997), which runs through the sequence:

$$h_t = f(o_t, h_{t-1}, z_t) \quad (1)$$

where $f$ is a deterministic non-linear transition function and $h_t$ is the LSTM hidden state at time $t$. According to the graphical model in Fig. 1, the predictive distribution factorizes as follows:

$$p_\theta(o_{1:T}, a_{1:T} \mid o_0, h_0) = \int \prod_{t=1}^{T} \underbrace{p_\theta(o_t \mid a_{t-1}, h_{t-1}, z_t)}_{\text{observation decoder}} \; \underbrace{p_\theta(a_{t-1} \mid h_{t-1}, z_t)}_{\text{action decoder}} \; \underbrace{p_\theta(z_t \mid h_{t-1})}_{\text{latent prior}} \, dz_{1:T} \quad (2)$$

where

1. $p_\theta(o_t \mid a_{t-1}, h_{t-1}, z_t)$ is the observation decoder distribution conditioned on the last action $a_{t-1}$, the hidden state $h_{t-1}$ and the latent variable $z_t$.
2. $p_\theta(a_{t-1} \mid h_{t-1}, z_t)$ is the action decoder distribution conditioned on the hidden state $h_{t-1}$ and the latent variable $z_t$.
3. $p_\theta(z_t \mid h_{t-1})$ is the prior over the latent variable $z_t$ conditioned on the hidden state $h_{t-1}$.

All the conditional distributions listed above are represented by simple distributions such as Gaussians. Their means and standard deviations are computed by multi-layered feed-forward networks. Although each individual distribution is unimodal, the marginalization over the sequence of latent variables makes $p_\theta(o_{1:T}, a_{1:T} \mid o_0)$ highly multimodal. Note that the prior distribution of the latent random variable at time step $t$ depends on all the preceding inputs via the hidden state $h_{t-1}$. This temporal structure of the prior has been shown to improve the representational power of the latent variable (Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017).
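To make the generative process concrete, the following PyTorch sketch shows one possible parameterization of the transition $f$, the latent prior, and the two decoders. The layer sizes, module names, and the use of single linear layers for the Gaussian parameters are illustrative assumptions; the text above only specifies that means and standard deviations come from feed-forward networks.

```python
# Minimal sketch of the generative model in Eqs. (1)-(2); sizes and module
# choices are assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GenerativeModel(nn.Module):
    def __init__(self, obs_dim, act_dim, z_dim, h_dim):
        super().__init__()
        self.rnn = nn.LSTMCell(obs_dim + z_dim, h_dim)                   # h_t = f(o_t, h_{t-1}, z_t)
        self.prior = nn.Linear(h_dim, 2 * z_dim)                         # p(z_t | h_{t-1})
        self.obs_dec = nn.Linear(h_dim + act_dim + z_dim, 2 * obs_dim)   # p(o_t | a_{t-1}, h_{t-1}, z_t)
        self.act_dec = nn.Linear(h_dim + z_dim, 2 * act_dim)             # p(a_{t-1} | h_{t-1}, z_t)

    @staticmethod
    def gaussian(params):
        # split a parameter vector into mean and log-std and build a diagonal Gaussian
        mu, log_std = params.chunk(2, dim=-1)
        return Normal(mu, log_std.exp())

    def prior_dist(self, h_prev):
        return self.gaussian(self.prior(h_prev))

    def step(self, o_t, z_t, state):
        # advance the deterministic LSTM state given the current observation and latent
        return self.rnn(torch.cat([o_t, z_t], dim=-1), state)
```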
2.2 Inference Model

In order to overcome the intractability of posterior inference of latent variables given an observation-action sequence, we make use of amortized variational inference ideas (Kingma & Welling, 2013). We introduce a recognition (or inference) network, a neural network which approximates the intractable posterior. The true posterior of a given latent variable $z_t$ is $p(z_t \mid h_{t-1}, a_{t-1:T}, o_{t:T}, z_{t+1:T})$. For the sake of an efficient posterior approximation, we make the following design choices:

1. We drop the dependence of the posterior on the actions $a_{t-1:T}$ and the future latent variables $z_{t+1:T}$.
2. To take into account the dependence on $h_{t-1}$, we share parameters between the generative model and the recognition model by making the approximate posterior a function of the hidden state $h_{t-1}$ computed by the LSTM transition module $f$ of the generative model.
3. To take into account the dependence on future observations $o_{t:T}$, we use an LSTM that processes the observation sequence backward as $b_t = g(o_t, b_{t+1})$, where $g$ is a deterministic transition function and $b_t$ is the backward LSTM hidden state at time $t$.
4. Finally, a feed-forward network takes as inputs $h_{t-1}$ and $b_t$ and outputs the mean and the standard deviation of the approximate posterior $q_\phi(z_t \mid h_{t-1}, b_t)$.

In principle, the posterior should also depend on future actions. To take into account the dependence on future actions as well as future observations, we could use an LSTM that processes the observation-action sequence backward. In pilot trials, we conducted experiments with and without the dependence on actions for the backward LSTM and did not notice a noticeable difference in performance. Therefore, we chose to drop the dependence on actions in the backward LSTM to simplify the code.

Now, using the approximate posterior, the Evidence Lower Bound (ELBO) is derived as follows:

$$\log p_\theta(o_{1:T}, a_{1:T} \mid o_0, h_0) \geq \mathbb{E}_{q_\phi(z_{1:T} \mid o_{1:T}, a_{1:T})}\Big[\log \frac{p_\theta(o_{1:T}, a_{1:T}, z_{1:T} \mid o_0, h_0)}{q_\phi(z_{1:T} \mid o_{1:T}, a_{1:T})}\Big] \quad (3)$$

$$= \mathbb{E}_{q_\phi(z_{1:T} \mid o_{1:T}, a_{1:T})}\big[\log p_\theta(o_{1:T}, a_{1:T} \mid o_0, h_0, z_{1:T})\big] - \mathrm{KL}\big(q_\phi(z_{1:T} \mid o_{1:T}, a_{1:T}) \,\|\, p_\theta(z_{1:T} \mid o_0, h_0)\big) \quad (4)$$

Leveraging the temporal structure of the generative and inference networks, the ELBO breaks down as:

$$\mathcal{L}(o_{1:T}, a_{1:T}; \theta, \phi) = \sum_t \mathbb{E}_{q_\phi(z_t \mid h_{t-1}, b_t)}\big[\log p_\theta(o_t \mid a_{t-1}, h_{t-1}, z_t) + \log p_\theta(a_{t-1} \mid h_{t-1}, z_t)\big] - \mathrm{KL}\big(q_\phi(z_t \mid h_{t-1}, b_t) \,\|\, p_\theta(z_t \mid h_{t-1})\big) \quad (5)$$
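A corresponding sketch of the inference network is shown below: a backward LSTM over the observation sequence yields $b_t$, and a feed-forward layer maps $(h_{t-1}, b_t)$ to the parameters of the Gaussian posterior $q_\phi(z_t \mid h_{t-1}, b_t)$. Again, the module names and sizes are assumptions made for illustration.

```python
# Minimal sketch of the inference network q_phi(z_t | h_{t-1}, b_t); names and
# sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributions import Normal

class InferenceNetwork(nn.Module):
    def __init__(self, obs_dim, h_dim, b_dim, z_dim):
        super().__init__()
        self.backward_rnn = nn.LSTM(obs_dim, b_dim)        # b_t = g(o_t, b_{t+1})
        self.posterior = nn.Linear(h_dim + b_dim, 2 * z_dim)

    def backward_states(self, obs_seq):
        # obs_seq: (T, batch, obs_dim); run over the reversed sequence so that
        # b_t summarizes the future observations o_{t:T}
        rev, _ = self.backward_rnn(obs_seq.flip(0))
        return rev.flip(0)

    def posterior_dist(self, h_prev, b_t):
        mu, log_std = self.posterior(torch.cat([h_prev, b_t], dim=-1)).chunk(2, dim=-1)
        return Normal(mu, log_std.exp())
```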
2.3 Auxiliary Cost

The main difficulty in latent-variable models is how to learn meaningful latent variables that capture high-level abstractions in the underlying observed data. It has proven challenging to combine a powerful autoregressive observation decoder with latent variables in a way that makes the latter carry useful information (Chen et al., 2016; Bowman et al., 2015). Consider the task of learning to navigate a building from raw images. We try to build an internal model of the world from observation-action trajectories. This is a very high-dimensional and highly redundant observation space. Intuitively, we would like our latent variables to capture an abstract representation describing the essential aspects of the building's topology needed for navigation, such as object locations and distances between rooms. The decoder would then encode high-frequency sources of variation such as object texture and other visual details. Training the model with a maximum likelihood objective is not sensitive to how different levels of information are encoded. This can lead to two bad scenarios: either the latent variables are unused and all the information is captured by the observation decoder, or the model learns a stationary auto-encoder that focuses on compressing a single observation (Karl et al., 2016).

The shortcomings described above are generally due to two main reasons: the approximate posterior provides a weak signal, or the model focuses on short-term reconstruction. In order to address the latter issue, we enforce our latent variables to carry useful information about future observations in the sequence. In particular, we make use of the so-called "Z-forcing" idea (Goyal et al., 2017): we consider training a conditional generative model $p_\zeta(b \mid z)$ of backward states $b$ given the inferred latent variables $z \sim q_\phi(z \mid h, b)$. This model is trained by log-likelihood maximization:

$$\max_\zeta \; \mathbb{E}_{q_\phi(z \mid b, h)}[\log p_\zeta(b \mid z)] \quad (6)$$

The loss above acts as a training regularizer that enforces latent variables $z_t$ to encode future information.

2.4 Model Training
The training objective is a regularized version of the ELBO. The regularization is imposed by the auxiliary cost, defined as the reconstruction term of the additional backward generative model. We bring together the ELBO in (5) and the reconstruction term in (6), weighted by the trade-off parameter $\beta$, to define our final objective:

$$\mathcal{L}(o_{1:T}, a_{1:T}; \theta, \phi, \zeta) = \sum_t \mathbb{E}_{q_\phi(z_t \mid h_{t-1}, b_t)}\big[\log p_\theta(o_t \mid a_{t-1}, h_{t-1}, z_t) + \log p_\theta(a_{t-1} \mid h_{t-1}, z_t) + \beta \log p_\zeta(b_t \mid z_t)\big] - \mathrm{KL}\big(q_\phi(z_t \mid h_{t-1}, b_t) \,\|\, p_\theta(z_t \mid h_{t-1})\big) \quad (7)$$

We use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) and a single posterior sample to obtain unbiased gradient estimators of the objective in (7). As the approximate posterior should be agnostic to the auxiliary task assigned to the latent variable, we do not propagate the gradients of the auxiliary cost into the backward network during optimization of (7).
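The sketch below assembles one training step for the objective in (7), reusing the hypothetical GenerativeModel and InferenceNetwork sketches above; the auxiliary decoder aux_dec plays the role of $p_\zeta(b_t \mid z_t)$, and the detach on $b_t$ reflects the stop-gradient into the backward network. It is an illustrative reading of the objective, not the authors' implementation.

```python
# Hedged sketch of the regularized ELBO in Eq. (7); gen, inf and aux_dec are the
# hypothetical modules from the previous sketches (aux_dec: nn.Linear(z_dim, 2 * b_dim)).
import torch
from torch.distributions import kl_divergence

def training_loss(gen, inf, aux_dec, obs, act, beta):
    # obs: (T+1, B, obs_dim) with obs[0] = o_0; act: (T, B, act_dim) with act[t] = a_t
    T, B = act.shape[0], act.shape[1]
    b = inf.backward_states(obs[1:])                   # b_t summarizes o_{t:T}
    h = obs.new_zeros(B, gen.rnn.hidden_size)
    c = torch.zeros_like(h)
    loss = 0.0
    for t in range(T):
        q = inf.posterior_dist(h, b[t])                # q(z_t | h_{t-1}, b_t)
        p = gen.prior_dist(h)                          # p(z_t | h_{t-1})
        z = q.rsample()                                # reparameterization trick, single sample
        obs_dist = gen.gaussian(gen.obs_dec(torch.cat([h, act[t], z], -1)))
        act_dist = gen.gaussian(gen.act_dec(torch.cat([h, z], -1)))
        aux_dist = gen.gaussian(aux_dec(z))            # p_zeta(b_t | z_t)
        loss = loss - obs_dist.log_prob(obs[t + 1]).sum(-1).mean()
        loss = loss - act_dist.log_prob(act[t]).sum(-1).mean()
        loss = loss - beta * aux_dist.log_prob(b[t].detach()).sum(-1).mean()  # no gradient into backward net
        loss = loss + kl_divergence(q, p).sum(-1).mean()
        h, c = gen.step(obs[t + 1], z, (h, c))         # h_t = f(o_t, h_{t-1}, z_t)
    return loss
```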
3 Using the Model for Sequential Tasks
Here we explain how our dynamics model can be used to help solve sequential RL tasks. We consider two settings: imitation learning, where a learner is asked to mimic an expert, and reinforcement learning, where an agent aims at maximizing its long-term performance.
3.1 Using the Model for Imitation Learning
We consider a passive approach to imitation learning, also known as behavioral cloning (Pomerleau, 1991). We have a set of training trajectories produced by an expert policy. Each trajectory consists of a sequence of observations $o_{1:T}$ and a sequence of actions $a_{1:T}$ executed by the expert. The goal is to train a learner to produce, given an observation, an action as similar as possible to the expert's. This is typically accomplished via supervised learning over observation-action pairs from expert trajectories. However, this assumes that training observation-action pairs are i.i.d. This critical assumption implies that the learner's action does not influence the distribution of future observations upon which it acts. Moreover, this kind of approach does not make use of the full trajectories we have at our disposal and chooses to break correlations between observation-action pairs.

In contrast, we propose to leverage the temporal coherence present in our training data by training our dynamics model on full trajectories. The advantage of our method is that the model captures the training distribution of sequences. Therefore, it is more robust to compounding errors, a common problem in methods that fit one-step decisions.

3.2 Using the Model for Reinforcement Learning
Model-based RL approaches can be understood as consisting of two main components: (i) model learning from observations and (ii) planning (obtaining a policy from the learned model). Here, we present how our dynamics model can be used to help solve RL problems. In particular, we explain how to perform planning under our model and how to gather the data that we later feed to the model for training.
3.2.1 Planning
Given a reward function $r$, we can evaluate each transition made by our dynamics model. A planner aims at finding the optimal action sequence that maximizes the long-term return, defined as the expected cumulative reward. This can be summarized by the following optimization problem: $\max_{a_{1:T}} \mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$, where the expectation is over trajectories sampled under the model.

If we optimize directly over actions, the planner may output a sequence of actions that induces a different observation-action distribution than the one seen during training, and end up in regions where the model captures the environment's dynamics poorly and makes prediction errors. This training/test distribution mismatch could result in 'catastrophic failure', e.g. the planner may output actions that perform well under the model but poorly when executed in the real environment.

To ensure that the planner's solution is grounded in the training manifold, we propose to perform planning over latent variables instead of over actions: $\max_{z_{1:T}} \mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$. In particular, we use model predictive control (MPC) (Mayne et al., 2000) as the planner in latent space, as shown in Alg. 1. Given an episode of length $T$, we generate a set of sequences starting from the initial observation, evaluate each sequence by its cumulative reward, and take the best sequence. We then pick the first $k$ latent variables $z_{1:k}$ of the best sequence and execute $k$ actions $a_{1:k}$ in the real environment conditioned on the picked latent variables. We then re-plan by following the same steps described above, starting from the last observation of the generated segment. Note that for an episode of length $T$, we re-plan only $T/k$ times because we generate a sequence of $k$ actions after each plan.

3.2.2 Data Gathering Process
We now turn to our approach for collecting data useful for model training. So far, we assumed that our training trajectories are given and fixed. As a consequence, the learned model captures only the training distribution, and relying on this model for planning will produce poor actions. Therefore, we need to consider an exploration strategy for data generation. A naive approach would be to collect data under a random policy that picks uniformly random actions. This random exploration is often inefficient in terms of sample complexity: it usually wastes a lot of time in already well-understood regions of the environment while other regions may remain poorly explored. A more directed exploration strategy consists in collecting trajectories that are not likely under the model distribution. For this purpose, we consider a policy $\pi_\omega$ parameterized by $\omega$ and train it to maximize the negative regularized ELBO $\mathcal{L}$ in (7). Specifically, if $p_{\pi_\omega}(o_{1:T}, a_{1:T})$ denotes the distribution of trajectories $(o_{1:T}, a_{1:T})$ induced by $\pi_\omega$, we consider the following optimization problem:

$$\max_\omega \; \mathbb{E}_{p_{\pi_\omega}(o_{1:T}, a_{1:T})}\big[-\mathcal{L}(o_{1:T}, a_{1:T}; \theta, \phi, \zeta)\big] \quad (8)$$

The above problem can be solved using any policy gradient method, such as proximal policy optimization (PPO) (Schulman et al., 2017), with the negative regularized ELBO as a per-trajectory reward. The overall procedure is described in Alg. 2. We obtain a high-rewarding trajectory by performing Model Predictive Control (MPC) every $k$ steps. We then use the exploration policy $\pi_\omega$ to sample trajectories that are adjacent to the high-rewarding one obtained by MPC. The algorithm then uses the sampled trajectories for training the model.
Algorithm 1: Model Predictive Control (MPC)
Given: trained model M, reward function R
for t in {1, ..., T/k} do
  1. Generate m observation sequences of length T_MPC.
  2. Evaluate the cumulative reward of each sequence and take the best sequence.
  3. Save the first k latent variables z_{1:k} of the best sequence (one latent per observation).
  4. Execute the actions conditioned on z_{1:k} and observations o_{1:k} for k steps, starting from the last observation of the last segment.
Algorithm 2: Overall Algorithm
Initialize the replay buffer and the model with data from a randomly initialized π_ω
for iteration i in {1, ..., N} do
  1. Execute MPC as in Algorithm 1.
  2. Run the exploration policy starting from a random point on the trajectory visited by MPC.
  3. Update the replay buffer with the gathered data.
  4. Update the exploration policy π_ω using PPO with the negative regularized ELBO as reward.
  5. Train the model on a mixture of newly generated data from π_ω and data from the replay buffer.
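The latent-space re-planning loop of Algorithm 1 can be written compactly as below. This is a minimal sketch under assumed interfaces: model.rollout (imagines m latent/observation/action sequences from the current observation), model.decode_action, a gym-style env, and a reward_fn over imagined trajectories are all hypothetical names, not the authors' code.

```python
# Hedged sketch of latent-space MPC (Algorithm 1); all interfaces are assumptions.
import torch

def latent_mpc_episode(model, env, reward_fn, T, k, m):
    obs = env.reset()
    total_reward = 0.0
    for _ in range(T // k):                                   # re-plan every k steps
        # imagine m sequences (observations, actions, latents) starting from obs
        im_obs, im_act, im_z = model.rollout(obs, num_sequences=m)
        returns = reward_fn(im_obs, im_act).sum(dim=1)        # cumulative reward per sequence, shape (m,)
        best = torch.argmax(returns)
        for step in range(k):                                 # execute the first k actions of the best plan,
            act = model.decode_action(obs, im_z[best, step])  # conditioned on its first k latent variables
            obs, rew, done, _ = env.step(act)
            total_reward += rew
            if done:
                return total_reward
    return total_reward
```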
4 Related Work

Generative Sequence Models.
There is a rich literature combining recurrent neural networks with stochastic dynamics (Chung et al., 2015; Chen et al., 2016; Krishnan et al., 2015; Fraccaro et al., 2016; Gulrajani et al., 2016; Goyal et al., 2017; Guu et al., 2018). These works propose variants of RNNs with stochastic dynamics or state space models, but do not investigate their applicability to model-based reinforcement learning. Previous work on learning dynamics models for Atari games has considered either deterministic models (Oh et al., 2015; Chiappa et al., 2017) or state space models (Buesing et al., 2018). These models are usually trained with a one-step-ahead prediction loss or a fixed k-step-ahead prediction loss. Our work is related in the sense that we use stochastic RNNs where the dynamics are conditioned on latent variables, but we propose to incorporate the long-term future, which, as we demonstrate empirically, improves over these models. In our model, the approximate posterior is conditioned on the state of the backward-running RNN, which helps to escape local minima, as pointed out by Karl et al. (2016). The idea of using a bidirectional posterior goes back to at least Bayer & Osendorfer (2014) and has been successfully used by Karl et al. (2016) and Goyal et al. (2017). The application to learning models for reinforcement learning is novel.
Model-based RL.
Many prior methods aim to learn a dynamics model of the environment, which can then be used for planning, generating synthetic experience, or policy search (Atkeson & Schaal, 1997; Peters et al., 2010; Sutton, 1991). Improving representations within the context of model-based RL has been studied for value prediction (Oh et al., 2017), dimensionality reduction (Nouri & Littman, 2010), self-organizing maps (Smith, 2002), and incentivizing exploration (Stadie et al., 2015). Weber et al. (2017) introduce the Imagination-Augmented Agent, which uses rollouts imagined by the dynamics model as inputs to the policy function by summarizing the outputs of the imagined rollouts with a recurrent neural network. Buesing et al. (2018) compare several methods of dynamics modeling and show that state-space models can learn good state representations that can be encoded and fed to the Imagination-Augmented Agent. Karl et al. (2017) provide a computationally efficient way to estimate a variational lower bound to empowerment. As their formulation assumes the availability of a differentiable model to propagate through the transitions, they train a dynamics model using the Deep Variational Bayes Filter (Karl et al., 2016). Holland et al. (2018) point out that incorporating the long-term future by doing Dyna-style planning could be useful for model-based RL. Here we are interested in learning better representations for the dynamics model using auxiliary losses that predict the hidden state of the backward-running RNN.
Auxiliary Losses.
Several works have incorporated auxiliary losses that yield representations which generalize. Pathak et al. (2017) considered using inverse models and using the prediction error as a proxy for curiosity. Other works have considered using a loss as a reward which acts as supervision for reinforcement learning problems (Shelhamer et al., 2016). Jaderberg et al. (2016) considered pseudo-reward functions which help to generalize effectively across different Atari games. In this work, we propose to use an auxiliary loss for improving the dynamics model in the context of reinforcement learning.
Incorporating the Future.
Recent works have considered incorporating the future by dynamically computing rollouts across many rollout lengths and using these to improve the policy (Buckman et al., 2018). Sutton et al. (1998) introduced TD(λ), a temporal difference method in which targets from multiple time steps are merged via exponential decay. To the best of our knowledge, no prior work has considered incorporating the long-term future in stochastic dynamics models for building better models. Many of the model-based methods mentioned above learn global models of the system that are then used for planning, generating synthetic experience, or policy search. These methods require a reliable model and typically suffer from modeling bias; hence, these models are still limited to short-horizon prediction in more complex domains (Mishra et al., 2017).

5 Experiments
As discussed in Section 3, we study our proposed model under imitation learning and model-based RL. We perform experiments to answer the following questions:

1. In the imitation learning setting, how does having access to the future during training help with policy learning?
2. Does our model help to learn a better predictive model of the world?
3. Can our model help in predicting subgoals?
4. In the model-based reinforcement learning setting, how does having a better predictive model of the world help for planning and control?
5.1 Imitation Learning
First, we consider the imitation learning setting, where we have training trajectories generated by an expert at our disposal. Our model is trained as described in Section 2.4. We evaluate our model on continuous control tasks in the Mujoco and CarRacing environments, as well as a partially observable 2D grid-world environment with subgoals called BabyAI (Chevalier-Boisvert & Willems, 2018). We compare our model to two baselines for all imitation learning tasks: a recurrent policy, an LSTM that predicts only the action $a_t$ given an observation $o_t$, and a recurrent decoder, an LSTM that predicts both the action and the next observation given an observation. We compare to the recurrent policy to demonstrate the value of modeling the future at all, and we compare to the recurrent decoder to demonstrate the value of modeling long-term future trajectories (as opposed to single-step observation prediction). For all tasks, we take high-dimensional rendered images as input (as opposed to low-dimensional state vectors). All models are trained on 10k expert trajectories, and the hyperparameters used are described in Section 8.1 of the appendix.

Figure 2: Imitation Learning. (a) HalfCheetah, (b) Reacher, (c) CarRacing. We show a comparison of our method with the baseline methods for the HalfCheetah, Reacher and Car Racing tasks. We find that our method is able to achieve higher reward faster than the baseline methods and is more stable.

Mujoco tasks.
We evaluate the models on Reacher and HalfCheetah. We take rendered images as inputs for both tasks and compare to the recurrent policy and recurrent decoder baselines. The performance in terms of test reward is shown in Fig. 2. Our model significantly and consistently outperforms both baselines on both HalfCheetah and Reacher.
Car Racing task.
The Car Racing task (Klimov, 2016) is a continuous control task; details of the experimental setup can be found in the appendix. The expert is trained using the methods in Ha & Schmidhuber (2018). The model's performance compared to the baselines is shown in Fig. 2. Our model both achieves a higher reward and is more stable in terms of test performance compared to both the recurrent policy and the recurrent decoder.
BabyAI PickUnlock task.
We evaluate on the PickUnlock task on the BabyAI platform (Chevalier-Boisvert & Willems, 2018). The BabyAI platform is a partially observable (POMDP) 2D GridWorld with subgoals and language instructions for each task. We remove the language instructions since language understanding is not the focus of this paper. The PickUnlock task consists of two rooms separated by a wall with a door; there is a key in the left room and a target in the right room. The agent always starts in the left room and needs to first find the key, then use the key to unlock the door and enter the next room to reach the goal. The agent receives a reward of 1 for completing the task within a fixed number of steps and receives a small penalty for taking too many steps. Our model consistently achieves higher rewards compared to the recurrent policy baseline, as shown in Fig. 3.
5.2 Long Horizon Video Prediction
One way to check whether the model learns a better generative model of the world is to evaluate it on long-horizon video prediction. We evaluate the model in the CarRacing environment (Klimov, 2016). We evaluate the likelihood of observations under the models trained in Section 5.1 on 1000 test trajectories generated by the expert trained using Ha & Schmidhuber (2018). Our method significantly outperforms the recurrent decoder in terms of negative log-likelihood (NLL). We also generate videos from the model by performing 15-step rollouts. The videos can be found at the anonymous links for our method and the recurrent decoder. Note that the samples are random and not cherry-picked. Visually, our method seems to generate more coherent and complex scenes: the entire road, including curves (not just a straight line), is generated. In comparison, the recurrent decoder tends to generate incomplete roads (with parts missing), and the generated road is often straight with no curves or complications.
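The qualitative comparison relies on open-loop rollouts in which the model consumes its own predictions. A minimal sketch of such a rollout, reusing the hypothetical GenerativeModel from Section 2 (latents drawn from the sequential prior, observations decoded from their Gaussian mean), is given below; it is an illustration, not the exact generation procedure.

```python
# Hedged sketch of a 15-step open-loop rollout from the learned model; `gen` is
# the hypothetical GenerativeModel sketch from Section 2.
import torch

@torch.no_grad()
def rollout_frames(gen, o0, act_seq, horizon=15):
    # o0: (B, obs_dim) initial frame; act_seq: (horizon, B, act_dim) conditioning actions
    h = o0.new_zeros(o0.shape[0], gen.rnn.hidden_size)
    c = torch.zeros_like(h)
    frames, o = [], o0
    for t in range(horizon):
        z = gen.prior_dist(h).sample()                                     # z_t ~ p(z_t | h_{t-1})
        o = gen.gaussian(gen.obs_dec(torch.cat([h, act_seq[t], z], -1))).mean
        frames.append(o)
        h, c = gen.step(o, z, (h, c))                                      # feed the prediction back in
    return torch.stack(frames)                                             # (horizon, B, obs_dim)
```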
Figure 3: Model-Based RL. We show a comparison of our method with baseline methods, including SeCTAR, for the BabyAI PickUnlock task and the Wheeled locomotion task with sparse rewards. We observe that our method achieves higher rewards than the corresponding baselines.
5.3 Subgoal Detection
Intuitively, a model should become sharply better at predicting the future (corresponding to a steep reduction in prediction loss) when it observes, and could easily reach, a 'marker' corresponding to a subgoal on the way to the final goal. We study this in the BabyAI task, which contains natural subgoals such as locating the key, picking up the key, opening the door, and finding the target in the next room. Experimentally, we do indeed observe a sharp decrease in prediction error as the agent locates a subgoal. We also observe an increase in prediction cost when the agent has difficulty locating the next subgoal (no key or goal in sight). Qualitative examples of this behavior are shown in Appendix Section 8.2.
5.4 Model-based Planning
We evaluate our model on the wheeled locomotion task as in Co-Reyes et al. (2018) with sparse rewards. The agent is given a reward for every third goal it reaches. We compare our model to the recently proposed SeCTAR model (Co-Reyes et al., 2018). We outperform the SeCTAR model, which itself outperforms many other baselines such as A3C (Mnih et al., 2016), TRPO (Schulman et al., 2015), Option-Critic (Bacon et al., 2017), FeUdal (Vezhnevets et al., 2017), and VIME (Houthooft et al., 2016). We use the same sets of hyperparameters as in Co-Reyes et al. (2018).
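For reference, with the planning settings reported in the appendix (2048 explored sequences, re-planning every 19 steps), the hypothetical latent_mpc_episode sketch from Section 3.2 would be invoked as follows; model, env, reward_fn and the episode length are placeholders.

```python
# Hypothetical call into the latent_mpc_episode sketch with the wheeled-locomotion
# settings from the appendix; all objects are placeholders.
episode_return = latent_mpc_episode(model, env, reward_fn, T=episode_length, k=19, m=2048)
```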
6 Conclusion
In this work we considered the challenge of model learning in model-based RL. We showed how to train, from raw high-dimensional observations, a latent-variable model that is robust to compounding error. The key insight in our approach is to force the latent variables to account for long-term future information. We explained how we use the model for efficient planning and exploration. Through experiments on various tasks, we demonstrated the benefits of such a model in providing sensible long-term predictions, thereby outperforming baseline methods.
Acknowledgements
The authors acknowledge the important role played by their colleagues at Facebook AI Research throughout the duration of this work. We are also grateful to the reviewers for their constructive feedback, which helped to improve the clarity of the paper. Rosemary is thankful to Nikita Kitaev and Hugo Larochelle for useful discussions. Anirudh is thankful to Alessandro Sordoni and Sergey Levine for useful discussions. Anirudh Goyal is grateful to NSERC, CIFAR, Google, Samsung, Nuance, IBM, Canada Research Chairs, the Canada Graduate Scholarship Program, and Nvidia for funding, and Compute Canada for computing resources.
We build on top of the authors' open-sourced code at https://github.com/wyndwarrior/Sectar. We were not able to reproduce the reported results for SeCTAR and hence we report the numbers we achieved.

References
Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.

Christopher G Atkeson and Stefan Schaal. Robot learning from demonstration. In ICML, volume 97, pp. 12–20. Citeseer, 1997.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pp. 1726–1734, 2017.

Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. arXiv preprint arXiv:1807.01675, 2018.

Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.

Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

Maxime Chevalier-Boisvert and Lucas Willems. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988, 2015.

John D Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.

Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2016.

Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6713–6723, 2017.

Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.

Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. arXiv preprint arXiv:1809.01999, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

G Zacharias Holland, Erik Talvitie, and Michael Bowling. The effect of planning shape on dyna-style planning in high-dimensional state spaces. arXiv preprint arXiv:1806.01825, 2018.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

Maximilian Karl, Maximilian Soelch, Philip Becker-Ehmck, Djalel Benbouzid, Patrick van der Smagt, and Justin Bayer. Unsupervised real-time control through variational empowerment. arXiv preprint arXiv:1710.05101, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Oleg Klimov. CarRacing for OpenAI Gym. https://gym.openai.com/envs/CarRacing-v0/, 2016.

Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.

Alex M Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609, 2016.

David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Nikhil Mishra, Pieter Abbeel, and Igor Mordatch. Prediction and control with temporal segment models. arXiv preprint arXiv:1703.04070, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Ali Nouri and Michael L Littman. Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning, 81(1):85–98, 2010.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pp. 1607–1612. Atlanta, 2010.

Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Andrew James Smith. Applications of the self-organising map to reinforcement learning. Neural Networks, 15(8-9):1107–1124, 2002.

Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015. URL http://arxiv.org/abs/1507.00814.

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.

Richard S Sutton, Andrew G Barto, et al. Reinforcement Learning: An Introduction. MIT Press, 1998.

Erik Talvitie. Model regularization for stable sample rollouts. In UAI, 2014.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.

Théophane Weber, Sébastien Racanière, David P Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
8 Appendix
8.1 Experimental Setup
We perform the same hyperparameter search for the baselines as well as our methods. We use the Adam optimizer (Kingma & Ba, 2014) and tune learning rates over [1e−…, 1e−…, 1e−…, 1e−…]. For the hyperparameters specific to our model, we tune the KL starting weight over [0.…, 0.…, 0.…]; the KL weight increase per iteration is fixed at 0.… and the weight of the auxiliary cost for predicting the backward hidden state $b_t$ is kept at 0.… for all experiments. We list the details for each experiment and task below.

Mujoco Tasks
We evaluate on two Mujoco tasks (Todorov et al., 2012): Reacher and HalfCheetah. The Reacher task is an object manipulation task that consists of manipulating a 7-DoF robotic arm to reach a goal; the agent is rewarded for the number of objects it reaches within a fixed number of steps. The HalfCheetah task is a continuous control task where the agent is rewarded for the distance the robot moves. For both tasks, the experts are trained using Trust Region Policy Optimization (TRPO) (Schulman et al., 2015). We generate 10k expert trajectories for training the student model, and all models are trained for 50 epochs. For the HalfCheetah task, we chunk each trajectory (1000 timesteps) into 4 chunks of length 250 to save computation time.
Car Racing task
The Car Racing task (Klimov, 2016) is a continuous control task where each episode contains randomly generated trials. The agent (car) is rewarded for visiting as many tiles as possible in the least amount of time. The expert is trained using the methods in Ha & Schmidhuber (2018). We generate 10k trajectories from the expert. For trajectories of length over 1000, we take the first 1000 steps. Similarly to Section 5.1, we chunk each 1000-step trajectory into 4 chunks of 250 for computational reasons.
BabyAI
The BabyAI environment is a POMDP 2D MiniGrid environment (Chevalier-Boisvert & Willems, 2018) with multiple tasks. For our experiments, we use the PickupUnlock task, consisting of two rooms, a key, an object to pick up, and a door between the rooms. The agent starts in the left room, where it needs to find the key; it then needs to take the key to the door to unlock the next room, after which it moves into the next room and finds the object that it needs to pick up. The rooms can be of different sizes, and the difficulty increases with the size of the room. We train all our models on rooms of size 15. It is not trivial to train a reinforcement learning expert on the PickupUnlock task with a room size of 15. We use curriculum learning with PPO (Schulman et al., 2017) for training our experts: we start with a room size of 6 and increase the room size by 2 at each level of the curriculum.

We train the LSTM baseline and our model using imitation learning. The training data are 10k trajectories generated from the expert model. We evaluate both the baseline and our model every 100 iterations in the real test environment (the BabyAI environment) and report the reward per episode. Experiments are run 5 times with different random seeds and we report the average of the 5 runs.
Wheeled locomotion
We use the Wheeled locomotion with sparse rewards environment from Co-Reyes et al. (2018). The robot is presented with multiple goals and must move sequentially in order to reach each reward. The agent obtains a reward for every third goal it reaches, and hence this is a task with sparse rewards. We follow a setup similar to Co-Reyes et al. (2018): the number of explored trajectories for MPC is 2048, and MPC re-plans every 19 steps. However, different from Co-Reyes et al. (2018), we sample latent variables from our sequential prior, which depends on the summary of past events $h_t$. This is in contrast to Co-Reyes et al. (2018), where the prior of the latent variables is fixed. Experiments are run 3 times and the average of the 3 runs is reported.

8.2 Correlation between Subgoals and Prediction Loss
Our model has an auxiliary cost associated with predicting the long-term future. Intuitively, the model is better at predicting the long-term future when there is more certainty about the future. Consider a setting where the task is in a POMDP environment that has multiple subgoals.

Figure 4: The first 18 panels show how the agent evolves in the BabyAI environment over 18 steps; the last plot shows the corresponding auxiliary cost as a function of steps. The agent is in red. The gray regions in the images are the agent's observational space. The keys and doors can be an arbitrary color; in this example, both the key and the door are blue. The auxiliary cost generally decreases over time.