Learning to Predict Without Looking Ahead: World Models Without Forward Prediction
C. Daniel Freeman, Luke Metz, David Ha
Google Brain {cdfreeman, lmetz, hadavid}@google.com
Abstract
Much of model-based reinforcement learning involves learning a model of an agent's world, and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware—e.g., a brain—arose as the byproduct of competing evolutionary pressures for survival, not minimization of a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can arise as a side-effect of optimization under the right circumstances. Crucially, this optimization process need not explicitly be a forward-predictive loss. In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agent's ability to observe the real environment at each timestep. In doing so, we can coerce an agent into learning a world model to fill in the observation gaps during reinforcement learning. We show that the emerged world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment. Videos of our results are available at https://learningtopredict.github.io/
1 Introduction

Much of the motivation of model-based reinforcement learning (RL) derives from the potential utility of learned models for downstream tasks, like prediction [13, 15], planning [1, 36, 41, 42, 44, 65], and counterfactual reasoning [9, 29]. Whether such models are learned from data, or created from domain knowledge, there's an implicit assumption that an agent's world model [22, 53, 67] is a forward model for predicting future states. While a perfect forward model would undoubtedly deliver great utility, such models are difficult to create, so much of the research has focused either on dealing with the uncertainties of forward models [11, 17, 22], or on improving their prediction accuracy [23, 29]. While progress has been made with current approaches, it is not clear that models trained explicitly to perform forward prediction are the only possible, or even desirable, solution.
Figure 1: Our agent is given only infrequent observations of its environment (e.g., frames 1, 8), and must learn a world model to fill in the observation gaps. The colorless cart-pole represents the predicted observations seen by the policy. Under such constraints, we show that world models can emerge so that the policy can still perform well on a swing-up cart-pole environment.

We hypothesize that explicit forward prediction is not required to learn useful models of the world, and that prediction may arise as an emergent property if it is useful for an agent to perform its task. To encourage prediction to emerge, we introduce a constraint on our agent: at each timestep, the agent is only allowed to observe its environment with some probability p. To cope with this constraint, we give our agent an internal model that takes as input both the previous observation and action, and generates a new observation as an output. Crucially, the input observation to the model will be the ground truth only with probability p, while with probability 1 − p the input observation will be the model's previously generated one. The agent's policy acts on this internal observation without knowing whether it is real or generated by its internal model. In this work, we investigate to what extent world models trained with policy gradients behave like forward predictive models, by restricting the agent's ability to observe its environment.

By jointly learning both the policy and the model to perform well on the given task, we can directly optimize the model without ever explicitly optimizing for forward prediction. This allows the model to focus on generating any "predictions" that are useful for the policy to perform well on the task, even if they are not realistic. The models that emerge under our constraints capture the essence of what the agent needs to see from the world. We conduct various experiments to show, under certain conditions, that the models learn to behave like imperfect forward predictors. We demonstrate that these models can be used to generate environments that do not follow the rules that govern the actual environment, but nonetheless can be used to teach the agent important skills needed in the actual environment. We also examine the role of inductive biases in the world model, and show that the architecture of the model plays a role not only in performance, but also in interpretability.

2 Related Work

One promising reason to learn models of the world is to accelerate learning of policies by training against these models. Such works obtain experience from the real environment, and fit a model directly to this data. Some of the earliest work leverages simple model parameterizations, e.g., learnable parameters for system identification [47]. Recently, there has been large interest in using more flexible parameterizations in the form of function approximators. The earliest work we are aware of that uses feed forward neural networks as predictive models for tasks is Werbos [67]. To model time dependence, recurrent neural networks were introduced in [53]. More recently, as modeling abilities have increased, there has been renewed interest in directly modeling pixels [23, 30, 46, 60]. Mathieu et al. [38] modify the loss function used to generate more realistic predictions. Denton and Fergus [12] propose a stochastic model which learns to predict the next frame in a sequence, whereas Finn et al. [15] employ a different parameterization involving predicting pixel movement as opposed to directly predicting pixels. Kumar et al.
[33] employ flow-based tractable density models to learn models, and Ha and Schmidhuber [22] leverage a VAE-RNN architecture to learn an embedding of pixel data across time. Hafner et al. [23] propose to learn a latent space, and learn forward dynamics in this latent space. Other methods utilize probabilistic dynamics models which allow for better planning in the face of uncertainty [11, 17]. Presaging much of this work is [58], which learns a model that can predict environment state over multiple timescales via imagined rollouts.

As both predictive modeling and control improve, there have been a large number of successes leveraging learned predictive models in Atari [8, 29] and robotics [14]. Unlike our work, all of these methods leverage transitions to learn an explicit dynamics model. Despite advances in forward predictive modeling, the application of such models is limited to relatively simple domains where models perform well. Errors in the world model compound, and cause issues when used for control [3, 63]. Amos et al. [2], similar to our work, directly optimize the dynamics model against task loss by differentiating through a planning procedure, and Schmidhuber [52] proposes a similar idea of improving the internal model using an RNN, although the RNN world model is initially trained to perform forward prediction.

In this work we structure our learning problem so that a model of the world emerges as a result of solving a given task. This notion of emergent behavior has been explored in a number of different areas and broadly falls under "representation learning" [6]. Early work on autoencoders leverages reconstruction-based losses to learn meaningful features [27, 34]. Follow-up work focuses on learning "disentangled" representations by enforcing more structure in the learning procedure [25, 26]. Self-supervised approaches construct other learning problems, e.g., solving a jigsaw puzzle [43], or leveraging temporal structure [45, 57]. Alternative setups, closer to our own, specify a particular learning problem and observe that solving it leads to interesting learned behavior (e.g., grid cells) [4, 10]. In the context of learning models, Watter et al. [66] construct a locally linear latent space where planning can then be performed.

The force driving model improvement in our work is black box optimization. In an effort to emulate nature, evolutionary algorithms were proposed [18, 24, 28, 61, 68]. These algorithms are robust, and will adapt to constraints such as ours while still solving the given task [7, 35]. Recently, reinforcement learning has emerged as a promising framework for such optimization, leveraging the sequential nature of the world for increased efficiency [39, 40, 54, 55, 62]. The exact type of optimization is of less importance to us in this work, and thus we choose to use a simple population-based optimization algorithm [69] with connections to evolution strategies [48, 51, 56].

The boundary between what is considered model-free and model-based reinforcement learning is blurred when one considers both the model network and controller network together as one giant policy that can be trained end-to-end with model-free methods. [50] demonstrates this by training both world model and policy via evolution. [37] explores modifying sensor information similarly to our observational dropout; rather than raw performance, however, that work focuses on understanding what these models learn and shows their usefulness, e.g., by training a policy inside the learned models.
3 Random world models for balance cart-pole

A common goal when learning a world model is to learn a perfect forward predictor. In this section, we provide intuitions for why this is not always necessary, and demonstrate how learning on random "world models" can lead to performant policies when transferred to the real world. For simplicity, we consider the classical control task of balance cart-pole [5]. While there are many ways of constructing world models for cart-pole, an optimal forward predictive model will have to generate trajectories of solutions to the simple linear differential equation describing the pole's dynamics near the unstable equilibrium point. One particular coefficient matrix fully describes these dynamics; thus, for this example, we identify this coefficient matrix as the free parameters of the world model, M.

While this unique M perfectly describes the dynamics of the pole, if our objective is only to stabilize the system—not to achieve perfect forward prediction—it stands to reason that we may not necessarily need to know these exact dynamics. In fact, if one solves for the linear feedback parameters that stabilize a cart-pole system with coefficient matrix M′ (not necessarily equal to M), for a wide variety of M′, those same linear feedback parameters will also stabilize the "true" dynamics M. Thus one successful, albeit silly, strategy for solving balance cart-pole is choosing a random M′, finding linear feedback parameters that stabilize this M′, and then deploying those same feedback controls to the "real" model M. We provide the details of this procedure in Appendix A.

Note that the "world model" learned in this way is almost arbitrarily wrong. It does not produce useful forward predictions, nor does it accurately estimate any of the parameters of the "real" world, like the length of the pole or the mass of the cart. Nonetheless, it can be used to produce a successful stabilizing policy. In sum, this toy problem exhibits three interesting qualities: (1) a world model can be learned that produces a valid policy without needing a forward predictive loss; (2) a world model need not itself be forward predictive (at all) to facilitate finding a valid policy; and (3) the inductive bias intrinsic to one's world model almost entirely controls the ease of optimization of the final policy. Unfortunately, most real world environments are not this simple, and will not lead to performant policies without ever observing the real world. Nonetheless, the underlying lesson—that a world model can be quite wrong, so long as it is wrong in the "right" way—will be a recurring theme throughout.
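As a concrete illustration, the sketch below follows this recipe numerically: sample a random coefficient matrix M′, random-search for feedback gains that stabilize it, and test those same gains on the true linearized dynamics of Eq. 7 (Appendix A). This is our own illustrative reconstruction, not the paper's released code; the constants and helper names are assumptions.

```python
# A sketch (not the released code) of the random world model recipe:
# sample a wrong model M', find feedback gains that stabilize it, and
# test whether those gains also stabilize the true linearized pole (Eq. 7).
import numpy as np

rng = np.random.default_rng(0)

def stable(A):
    """True if every eigenvalue of A has negative real part."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))

def closed_loop(a, b, c, d, u0, u1):
    # Eq. 7: the feedback parameters enter the second row of the matrix.
    return np.array([[a, b], [c + u0, d + u1]])

g, L, M = 9.8, 1.0, 1.0            # "true" world: theta_dd = (g/L) theta + u/(M L)
true_abcd = (0.0, 1.0, g / L, 0.0)

hits, trials = 0, 100
for _ in range(trials):
    model_abcd = rng.normal(size=4)  # random, arbitrarily wrong M'
    # Random-search feedback gains until the *wrong* model is stabilized.
    gains = next((u for u in rng.normal(scale=10.0, size=(5000, 2))
                  if stable(closed_loop(*model_abcd, *u))), None)
    # Deploy the same gains on the true dynamics.
    if gains is not None and stable(closed_loop(*true_abcd, *gains)):
        hits += 1
print(f"gains found on random models stabilized the true pole in {hits}/{trials} trials")
```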
4 Observational dropout

In the previous section, we outlined a strategy for finding policies without even "seeing" the real world. In this section, we relax this constraint and allow the agent to periodically switch between real observations and simulated observations generated by a world model. We call this method observational dropout, inspired by [59]. (In general, the full dynamics describing cart-pole are non-linear; however, in the limit of a heavy cart and small perturbations about the vertical at low speeds, they reduce to a linear system. See Appendix A for details.)

Formally, given an MDP with states s ∈ S, transition distribution s_{t+1} ∼ P(s_t, a_t), and reward distribution R(s_t, a_t, s_{t+1}), we can create a new partially observed MDP with two states, s′ = (s_orig, s_model) ∈ (S, S), consisting of both the original state and the internal state produced by the world model. The transition function then switches between the real and world model states with some probability p:

$$P'\left((s')^t, a^t\right) = \begin{cases} (s^{t+1}_{\mathrm{orig}},\; s^{t+1}_{\mathrm{orig}}) & \text{if } r < p \\ (s^{t+1}_{\mathrm{orig}},\; s^{t+1}_{\mathrm{model}}) & \text{if } r \geq p \end{cases} \tag{1}$$

where r ∼ Uniform(0, 1), s^{t+1}_orig ∼ P(s^t_orig, a^t) is the real environment transition, s^{t+1}_model ∼ M(s^t_model, a^t; φ) is the next world model transition, and p is the peek probability.

The observation space of this new partially observed MDP is always the second entry of the state tuple, s′. As before, we care about performing well on the real environment, thus the reward function is the same as the original environment's: R′(s^t, a^t, s^{t+1}) = R(s^t_orig, a^t, s^{t+1}_orig). Our learning task consists of training an agent, π(s; θ), and the world model, M(s, a; φ), to maximize reward in this augmented MDP. In our work, we parameterize both our world model M and our policy π as neural networks, with parameters φ and θ respectively. While it is possible to optimize this objective with any reinforcement learning method [39, 40, 54, 55], we choose to use population-based REINFORCE [69] due to its simplicity and effectiveness at achieving high scores on various tasks [20, 21, 51]. By restricting the observations, we make optimization harder and thus expect worse performance on the underlying task. We can use this optimization procedure, however, to drive learning of the world model, much in the same way evolution drove our internal world models.

One might worry that a policy with sufficient capacity could extract useful data from a world model even if that world model's features weren't easily interpretable. In this limit, our procedure starts looking like a strange sort of recurrent network, where the world model "learns" to extract difficult-to-interpret features (like, e.g., the hidden state of an RNN) from the world state, and the policy is powerful enough to learn to use these features to make decisions about how to act. While this is indeed a possibility, in practice we usually constrain the capacity of the policies we study to be small enough that this does not occur. For a counter-example, see the fully connected world model for the grid world tasks in Section 4.2.
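The augmented MDP above can be viewed as a thin wrapper around the real environment. Below is a minimal sketch of observational dropout in that form, assuming a Gym-style `env` and a world model `M(s, a)` that returns the next predicted observation; the class and method names are ours, not the released code.

```python
# A minimal sketch of observational dropout (Eq. 1) as an environment wrapper.
import numpy as np

class ObservationalDropout:
    def __init__(self, env, world_model, peek_prob, seed=0):
        self.env, self.M, self.p = env, world_model, peek_prob
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s_model = self.env.reset()  # the first observation is real
        return self.s_model

    def step(self, action):
        s_orig, reward, done, info = self.env.step(action)  # real transition always runs
        if self.rng.random() < self.p:
            self.s_model = s_orig                        # peek: resync with reality
        else:
            self.s_model = self.M(self.s_model, action)  # hallucinated observation
        # The reward always comes from the real environment (R' = R).
        return self.s_model, reward, done, info
```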
4.1 Swing up cart-pole

As the balance cart-pole task discussed earlier can be trivially solved by a wide range of parameters for a simple linear policy, we conduct experiments where we apply observational dropout to the more difficult swing-up cart-pole—a task that cannot be solved with a linear policy, as it requires the agent to learn two distinct subtasks: (1) to add energy to the system when it needs to swing up the pole, and (2) to remove energy to balance the pole once the pole is close to the unstable, upright equilibrium [64]. Our setup is closely based on the environment described in [17, 70], where the ground truth dynamics of the environment are described as $[\ddot{x}, \ddot{\theta}] = F(x, \theta, \dot{x}, \dot{\theta})$. F is a system of non-linear equations, and the agent is rewarded for keeping x close to zero and cos(θ) close to one. For more details, see Appendix B.

The setup of the cart-pole experiment augmented with observational dropout is visualized in Figure 1. We report the performance of our agent trained in environments with various peek probabilities, p, in Figure 2 (left).
A result higher than ~500 means that the agent is able to swing up and balance the cart-pole most of the time. Interestingly, the agent is still able to solve the task even when looking at only a tenth of the frames (p = 10%), and even at a lower p = 5%, it solves the task half of the time.

To understand the extent to which the policy π relies on the learned world model M, and to probe the dynamics the world model has learned, we trained a new policy entirely within the learned world model and then deployed this policy back to the original environment. (Code to facilitate reproduction of our experiments is released at https://learningtopredict.github.io/.) Results are shown in Figure 2 (right).

Figure 2: Left: Performance of cart-pole swing up under various observational dropout probabilities, p. Here, both the policy and world model are learned. Right: Performance, in the actual environment, of policies trained from scratch inside the environment generated by the world model. For each p, the experiment is run 10 times independently (orange). Performance is measured by averaging cumulative scores over 100 rollouts. Model-based baselines learned via a forward-predictive loss are indicated in red (learned model, 1200 hidden units: 430 ± 15) and blue (learned model, 120 hidden units: 274 ± 122); the champion solution in the population scores 593 ± 24. Note how world models learned under approximately 3-5% observational dropout can be used to train performant policies.

Qualitatively, the agent learns to swing up the pole and balance it for a short period of time when it achieves a sufficiently high mean reward.

Figure 3: (a) Policy learned in an environment generated using the world model. (b) Deploying the policy learned in (a) into the real environment. In the generated environment (a), the cart-pole stabilizes at an angle that is not perfectly perpendicular, due to the model's imperfect nature. The policy (b) is still able to swing up the cart-pole in the actual environment, although the pole remains balanced only for some time before falling down. The world model is jointly trained with an observational dropout probability of p = 5%.

Figure 3 depicts a trajectory of a policy trained entirely within a learned world model deployed on the actual environment. It is interesting to note that the dynamics in the world model M are not perfect—for instance, the optimal policy inside the world model can only swing up and balance the pole at an angle that is not perpendicular to the ground. We notice that in other world models, the optimal policy learns to swing up the pole and balance it for only a short period of time, even within the self-contained world model. It should not surprise us, then, that the most successful policies, when deployed back to the actual environment, can swing up the pole but only balance it for a short while before it falls down.

As noted earlier, the task of stabilizing the pole once it is near its target state (when x, θ, ẋ, θ̇ are near zero) is trivial; hence a policy π jointly trained with world model M will not require accurate predictions to keep the pole balanced. For this subtask, π needs only to occasionally observe the actual world and realign its internal observation with reality. Conversely, the subtask of swinging the pole upwards and then lowering the velocities is much more challenging, hence π relies on the world model to capture the essence of the dynamics in order to accomplish it. The world model M only learns the difficult part of the real world, as that is all that is required of it to facilitate the policy performing well on the task.

4.2 Examining world models' inductive biases in a grid world

To illustrate the generality of our method in more varied domains, and to further emphasize the role played by inductive bias in our models, we consider an additional problem: a classic search/avoidance task in a grid world. In this problem, an agent navigates a grid environment with randomly placed apples and fires. Apples provide reward, and fires provide negative reward. The agent is allowed to move in the four cardinal directions, or to perform a no-op. For more details see Appendices B and C.

Figure 4: A cartoon demonstrating the shift of the receptive field of the world model as the agent moves to the right. The greyed-out column indicates the column of forgotten data, and the light blue column indicates the "new" information gleaned from moving to the right. An optimal predictor would learn the distribution of this newly visible column and sample from it to populate the rightmost column, and would match the ground truth everywhere else. The rightmost heatmap illustrates how predictions of a convolutional model correlate with the ground truth (more orange = more predictive) when moving to the right, averaged over 1000 randomized right-moving steps. See Appendix D for more details. Crucially, this heat map is most predictive for the cells the agent can actually see, and is less predictive for the cells right outside its field of view (the rightmost column), as expected.

For simplicity, we considered only stateless policies and world models. While this necessarily limits the expressive capacity of our world models, the optimal forward predictive model within this class of networks is straightforward to describe: movement of the agent essentially corresponds to a bit-shift map on the world model's observation vectors. For example, for an optimal forward predictor, if an agent moves rightwards, every apple and fire within its receptive field should shift to the left. The leftmost column of observations shifts out of sight and is forgotten—as the model is stateless—and the rightmost column of observations should be populated according to some distribution which depends on the locations of apples and fires visible to the agent, as well as the particular scheme used to populate the world with apples and fires. Figure 4 illustrates the receptive field of the world model.
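For concreteness, a minimal sketch of this bit-shift transition map for a rightward move is shown below; the array shapes and the `sample_new_column` helper are illustrative assumptions, not the paper's implementation.

```python
# A sketch of the "optimal" stateless grid-world predictor: moving right
# shifts the egocentric observation one cell to the left, and the vacated
# rightmost column must be sampled from some learned distribution.
import numpy as np

def shift_predict(obs, action, sample_new_column):
    """obs: (d, d, 2) binary apples/fires grid, egocentric view of the agent."""
    if action != "right":
        raise NotImplementedError("other directions are analogous shifts")
    pred = np.roll(obs, shift=-1, axis=1)    # contents move one column left
    pred[:, -1, :] = sample_new_column(obs)  # newly visible cells are sampled
    return pred
```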
Figure 5: Performance, R, of the two architectures, empirically averaged over a hundred policies and a thousand rollouts each, as a function of peek probability, p. The convolutional architecture reliably outperforms the fully connected architecture. Error bars indicate standard error. Intuitively, a score at the bottom of this range amounts to random motion on the lattice—encountering apples as often as fires—while scores near the top approximately correspond to encountering apples two to three times more often than fires. A baseline that is trained on a version of the environment without any fires—i.e., a proxy baseline for an agent that can perfectly avoid fires—reliably achieves a still higher score. Agents were trained for 4000 generations.

This partial observability of the world immediately handicaps the ability of the world model to perform long imagined trajectories, in comparison with the previous continuous, fully observed cart-pole tasks. Nonetheless, there remains sufficient information in the world to train world models via observational dropout that are predictive.

For our numerical experiments we compared two different world model architectures: a fully connected model and a convolutional model. See Appendix B for details. Naively, these models are listed in increasing order of inductive bias, but decreasing order of overall capacity (the fully connected model has far more learnable parameters than the convolutional model)—i.e., the fully connected architecture has the highest capacity and the least bias, whereas the convolutional model has the most bias but the least capacity. The performance of these models on the task as a function of peek probability is provided in Figure 5. As in the cart-pole tasks, we trained the agent's policy and world model jointly, where with some probability p the agent sees the ground truth observation instead of predictions from its world model.

Curiously, even though the fully connected architecture has the highest overall capacity, and is capable of learning a transition map closer to the "optimal" forward predictive function for this task if taught to do so via supervised learning of a forward-predictive loss, it reliably performs worse than the convolutional architecture on the search and avoidance task. This is not entirely surprising: the convolutional architecture induces a considerably better prior over the space of world models than the fully connected architecture via its translational invariance. It is comparatively much easier for the convolutional architecture to randomly discover the right sort of transition maps.

Figure 6: Empirically averaged correlation matrices between a world model's output and the ground truth, with one panel per action (↓, ↑, →, ←, no-op). Averages were calculated using random transitions for each direction of a typical convolutional p = 75% world model. Higher correlation (yellow-white) translates to a world model that is closer to a next frame predictor. Note that a predictive map is not learned for every direction. The row and column of dark pixels for ↓ and →, respectively, correspond exactly to the newly-seen pixels for those directions, which are indicated in light blue in Figure 4.

Because the world model is not being explicitly optimized to achieve forward prediction, it often does not learn a predictive function for every direction. We selected a typical convolutional world model and plot its empirically averaged correlation with the ground truth next-frames in Figure 6. Here, the world model clearly only learns reliable transition maps for moving down and to the right, which is sufficient.
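The correlation heatmaps referenced here (Figures 4, 6, and 9) can be computed as a per-cell correlation between predicted and ground-truth next frames over many random transitions. A minimal sketch of that evaluation, with assumed array shapes, is below.

```python
# A sketch of the correlation-map evaluation: per-cell Pearson correlation
# between the model's predicted next frame and the ground-truth next frame,
# averaged over many random transitions for a fixed action.
import numpy as np

def correlation_map(preds, truths):
    """preds, truths: (n_transitions, d, d) binary arrays -> (d, d) correlations."""
    p = preds - preds.mean(axis=0)
    t = truths - truths.mean(axis=0)
    cov = (p * t).mean(axis=0)
    # Small epsilon guards cells whose value never changes across transitions.
    return cov / (preds.std(axis=0) * truths.std(axis=0) + 1e-8)
```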
Qualitatively, we found that the convolutional world models learned with peek probability close to p = 50% were "best", in that they were most likely to result in accurate transition maps—similar to the cart-pole results indicated in Figure 2 (right). Fully connected world models reliably learned completely uninterpretable transition maps (e.g., see the additional correlation plots in Appendix D). That policies could almost achieve the same performance with fully connected world models as with convolutional world models is reminiscent of a recurrent architecture that uses the (generally not-easily-interpretable) hidden state as a feature.

4.3 Keeping your eyes off the road

In more challenging environments, observations are often expressed as high dimensional pixel images rather than state vectors. In this experiment, we apply observational dropout to learn a world model of a car racing game from pixel observations. We would like to know to what extent the world model can facilitate the policy at driving if the agent is allowed to see the road only a fraction of the time. We are also interested in the representations the model learns to facilitate driving, and in measuring the usefulness of its internal representation for this task.

In Car Racing [32], the agent's goal is to drive around tracks that are randomly generated for each trial, driving over as many tiles as possible in the shortest time. At each timestep, the environment provides the agent with a high dimensional pixel image observation, and the agent outputs 3 continuous action parameters that control the car's steering, acceleration, and brakes.

To reduce the dimensionality of the pixel observations, we follow the procedure in [22] and train a Variational Autoencoder (VAE) [31, 49] on rollouts collected from a random policy, to compress a pixel observation into a low-dimensional latent vector z. Our agent uses z instead as its observation. Examples of pixel observations, and reconstructions from their compressed representations, are shown in the first two rows of Figure 7.
Figure 7: Two examples of action-conditioned predictions from a world model trained at p = 10%. For each rollout, the top row shows actual frames, the middle row shows the VAE's reconstructions of those frames, and the bottom row shows VAE-decoded images of the world model's predicted latent vectors. Red boxes indicate actual observations from the environment that the agent is allowed to see. While the agent is devoid of sight, the world model predicts (1) small movements of the car relative to the track and (2) upcoming turns. Without access to actual observations for many timesteps, it incorrectly predicts a turn in (3), until an actual observation realigns the world model with reality.

Our policy, a feed forward network, acts on actual observations with probability p, and otherwise on observations produced by the world model. Our world model M, a small feed forward network with one hidden layer, outputs the change in the mean latent vector z, conditioned on the previous observation (actual or predicted) and the action taken (i.e., Δz = M(z, a)). We can use the VAE's decoder to visualize the latent vectors produced by M, and compare them with the actual observations that the agent is not able to see (Figure 7). We observe that our world model, while not explicitly trained to predict future frames, is still able to make meaningful action-conditioned predictions. The model learns to predict local changes in the car's position relative to the road given the action taken, and also attempts to predict upcoming curves.
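A minimal sketch of this latent world model, in our own notation, is shown below; the layer sizes follow Appendix B.3, and the weights and training loop are omitted.

```python
# A sketch of the latent world model: a one-hidden-layer network predicting
# the change in the VAE mean latent, dz = M(z, a). Sizes follow Appendix B.3.
import numpy as np

def world_model_step(z, a, W1, b1, W2, b2):
    """z: (16,) VAE mean latent, a: (3,) action -> (next latent, hidden features)."""
    h = np.tanh(np.concatenate([z, a]) @ W1 + b1)  # (19,) -> (10,) tanh hidden units
    dz = h @ W2 + b2                               # (10,) -> (16,) predicted change
    return z + dz, h  # h is the feature vector reused by the linear policy (Fig. 8, right)
```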
Figure 8: Left: Mean performance of Car Racing under various p over 100 trials. Right: Mean performance achieved by training a linear policy using only the outputs of the hidden layer of a world model learned at peek probability p. We run 5 independent seeds for each p (orange). Model-based baselines learned via a forward-predictive loss are indicated in red and blue (Ha and Schmidhuber (2018): 906 ± 21; Risi and Stanley (2019): 903 ± 72); our champion solution scores 873 ± 71. We note that in this constrained linear policy setup, our best solution out of a population of trials achieves performance slightly below reported state-of-the-art results (i.e., [22, 50]). As in the swing-up cart-pole experiments, the best world models for training policies occur at a characteristic peek probability that roughly coincides with the peek probability at which performance begins to degrade for jointly trained models (i.e., the bend in the left pane occurs near the peak of the right pane).

Our policy π is jointly trained with the world model M in the car racing environment augmented with a peek probability p. The agent's performance is reported in Figure 8 (left). Qualitatively, a score above
~800 means that the agent can navigate around the track, making only the occasional driving error. We see that the agent is still able to perform the task when 70% of the actual observation frames are dropped out, and the world model is relied upon to fill in the observation gaps for the policy.

If the world model produces useful predictions for the policy, then the hidden representation it uses to produce those predictions should also be a useful feature for the task at hand. We can test whether the hidden units of the world model are directly useful for the task by first freezing the weights of the world model, and then training from scratch a linear policy that takes only the outputs of the intermediate hidden layer of the world model as inputs. This feature vector, extracted from the hidden layer, is mapped directly to the 3 outputs controlling the car, and we measure the performance of this linear policy using features of world models trained at various peek probabilities.

The results reported in Figure 8 (right) show that world models trained at lower peek probabilities have a higher chance of learning features that are useful enough for a linear controller to achieve an average score of 800. The average performance of the linear controller peaks when using models trained with p around 40%. This suggests that a world model learns more useful representations as the policy needs to rely more on its predictions, i.e., as the agent's ability to observe the environment decreases. However, a peek probability too close to zero hinders the agent's ability to perform its task, especially in non-deterministic environments such as this one, and thus also limits the usefulness of its world model for the real world, as the agent is almost completely disconnected from reality.

5 Discussion

In this work, we explore world models that emerge when training with observational dropout on several reinforcement learning tasks. In particular, we have demonstrated how effective world models can emerge from the optimization of total reward. Even on these simple environments, the emerged world models do not perfectly model the world, but they facilitate policy learning well enough to solve the studied tasks.

The deficiencies of the world models learned in this way have a consistency: the cart-pole world models learned to swing up the pole, but did not have a perfect notion of equilibrium; the grid world world models could perform reliable bit-shift maps, but only in certain directions; the car racing world model tended to ignore the forward motion of the car, unless a turn was visible to the agent (or imagined). Crucially, none of these deficiencies were catastrophic enough to cripple the agent's performance. In fact, these deficiencies were, in some cases, irrelevant to the performance of the policy. We speculate that the complexity of world models could be greatly reduced if they could fully leverage this idea: a complete model of the world is actually unnecessary for most tasks, and by identifying the important part of the world, policies could be trained significantly more quickly, or more sample efficiently.

We hope this work stimulates further exploration of both model-based and model-free reinforcement learning, particularly in areas where learning a perfect world model is intractable.
Acknowledgments
We would like to thank our three reviewers for their helpful comments. Additionally, we would like to thank Alex Alemi, Tom Brown, Douglas Eck, Jaehoon Lee, Błażej Osiński, Ben Poole, Jascha Sohl-Dickstein, Mark Woodward, Andrea Benucci, Julian Togelius, Sebastian Risi, Hugo Ponte, and Brian Cheung for helpful comments, discussions, and advice on early versions of this work. Experiments in this work were conducted with the support of Google Cloud Platform.

References

[1] James F Allen and Johannes A Koomen. Planning using a temporal world model. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence - Volume 2, pages 741-747. Morgan Kaufmann Publishers Inc., 1983.
[2] Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J Zico Kolter. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8289-8300, 2018.
[3] Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.
[4] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429, 2018.
[5] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pages 834-846, 1983.
[6] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[7] Josh Bongard, Victor Zykov, and Hod Lipson. Resilient machines through continuous self-modeling. Science, 314(5802):1118-1121, 2006.
[8] Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
[9] Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272, 2018.
[10] Christopher J Cueva and Xue-Xin Wei. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. arXiv preprint arXiv:1803.07770, 2018.
[11] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465-472, 2011.
[12] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
[13] Bradley B Doll, Dylan A Simon, and Nathaniel D Daw. The ubiquity of model-based reinforcement learning. Current Opinion in Neurobiology, 22(6):1075-1081, 2012.
[14] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
[15] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64-72, 2016.
[16] Adam Gaier and David Ha. Weight agnostic neural networks. arXiv preprint arXiv:1906.04358, 2019.
[17] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, volume 4, 2016.
[18] David E Goldberg and John H Holland. Genetic algorithms and machine learning. Machine Learning, 3(2):95-99, 1988.
[19] Roderic A. Grupen. CMPSCI Embedded Systems 503. Online, 2018.
[20] D. Ha. Evolving stable strategies. http://blog.otoro.net/, 2017. URL http://blog.otoro.net/2017/11/12/evolving-stable-strategies/.
[21] David Ha. Reinforcement learning for improving agent design. arXiv preprint arXiv:1810.03779, 2018. URL https://designrl.github.io.
[22] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451-2463. Curran Associates, Inc., 2018.
[23] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
[24] Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1-18, 2003.
[25] Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.
[26] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
[27] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[28] John Henry Holland et al. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press, 1975.
[29] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.
[30] Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1771-1779. JMLR.org, 2017.
[31] D. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. URL https://arxiv.org/abs/1312.6114.
[32] Oleg Klimov. CarRacing-v0. https://gym.openai.com/envs/CarRacing-v0/, 2016.
[33] M Kumar, M Babaeizadeh, D Erhan, C Finn, S Levine, L Dinh, and D Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
[34] Quoc V Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. arXiv preprint arXiv:1112.6209, 2011.
[35] Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.
[36] Ian Lenz, Ross A Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, Rome, Italy, 2015.
[37] Hugo Marques, Julian Togelius, Magdalena Kogutowska, Owen Holland, and Simon M Lucas. Sensorless but not senseless: Prediction in evolutionary car racing. Pages 370-377. IEEE, 2007.
[38] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[39] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[40] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.
[41] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. Pages 7559-7566. IEEE, 2018.
[42] Anusha Nagabandi, Guangzhao Yang, Thomas Asmar, Ravi Pandya, Gregory Kahn, Sergey Levine, and Ronald S Fearing. Learning image-conditioned dynamics models for control of underactuated legged millirobots. Pages 4606-4613. IEEE, 2018.
[43] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69-84. Springer, 2016.
[44] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863-2871, 2015.
[45] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[46] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015.
[47] Gianluigi Pillonetto, Francesco Dinuzzo, Tianshi Chen, Giuseppe De Nicolao, and Lennart Ljung. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3):657-682, 2014.
[48] Ingo Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, 1973.
[49] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014. URL https://arxiv.org/abs/1401.4082.
[50] Sebastian Risi and Kenneth O. Stanley. Deep neuroevolution of recurrent and discrete world models. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '19, pages 456-462, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6111-8. doi: 10.1145/3321707.3321817. URL http://doi.acm.org/10.1145/3321707.3321817.
[51] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[52] J. Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.
[53] Jürgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical report, 1990.
[54] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.
[55] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[56] H-P Schwefel. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie (Teil 1, Kap. 1-5). Birkhäuser, 1977.
[57] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. Pages 1134-1141. IEEE, 2018.
[58] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The Predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3191-3199. JMLR.org, 2017.
[59] N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929-1958, 2014.
[60] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.
[61] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
[62] Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[63] Erik Talvitie. Model regularization for stable sample rollouts. In UAI, pages 780-789, 2014.
[64] Russ Tedrake. Underactuated Robotics: Learning, Planning, and Control for Efficient and Agile Machines: Course Notes for MIT 6.832. Working draft edition, 3, 2009.
[65] Sebastian Thrun, Knut Möller, and Alexander Linden. Planning with an adaptive world model. In Advances in Neural Information Processing Systems, pages 450-456, 1991.
[66] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746-2754, 2015.
[67] Paul J Werbos. Learning how the world works: Specifications for predictive networks in robots and brains. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, NY, 1987.
[68] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. Pages 3381-3387. IEEE, 2008.
[69] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
[70] Xingdong Zuo. PyTorch implementation of Improving PILCO with Bayesian neural network dynamics models, 2018. https://github.com/zuoxingdong/DeepPILCO.

A Random world models for balance cart-pole
Consider the classical control task of balance cart-pole, where the cart is initialized "close" to the unstable, fully upright equilibrium, and the cart's (not the pole's) acceleration is the only directly controllable parameter. Following [19], the Lagrangian for the cart-pole system takes the form:

$$\mathcal{L} = \tfrac{1}{2}(M + m)\dot{x}^2 + \tfrac{1}{2} m L^2 \dot{\theta}^2 - m L \cos(\theta)\, \dot{\theta} \dot{x} - m g L \cos(\theta) \tag{2}$$

In the presence of a control force, u(t), the solution to the Euler-Lagrange equation $\frac{\partial \mathcal{L}}{\partial q} - \frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{q}} = 0$ for $q \in \{\theta, x\}$ yields the full equations of motion:

$$(M + m)\ddot{x} + m L \sin(\theta)\dot{\theta}^2 - m L \cos(\theta)\ddot{\theta} = u(t) \tag{3}$$

$$m L^2 \ddot{\theta} - m L \cos(\theta)\ddot{x} - m g L \sin(\theta) = 0 \tag{4}$$

Taking the limits $m \ll M$, $\dot{\theta} \ll 1$, and $\theta \ll 1$—that the pole is light compared to the cart, that the pole is not moving very fast, and that the pole is near the vertical, respectively—the x and θ components of the differential equation decouple, and the pole dynamics can be rearranged into the matrix equation:

$$\begin{pmatrix} \dot{\theta} \\ \ddot{\theta} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ \frac{g}{L} & 0 \end{pmatrix} \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix} + \begin{pmatrix} 0 \\ \frac{1}{M L} \end{pmatrix} u(t) \tag{5}$$

Finally, the linear feedback ansatz for u(t) is simply:

$$u(t) = \begin{pmatrix} u_0 & u_1 \end{pmatrix} \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix} \tag{6}$$

Combining Eq. 5 and Eq. 6 results in Eq. 7, as desired. Linearizing around this equilibrium, taking the linear feedback ansatz for the form of the controller, and considering only the angular degrees of freedom, the equations of motion for this system are

$$\begin{pmatrix} \dot{\theta} \\ \ddot{\theta} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ \frac{g}{L} + \frac{u_0}{M L} & \frac{u_1}{M L} \end{pmatrix} \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix} \sim \begin{pmatrix} a & b \\ c + u_0 & d + u_1 \end{pmatrix} \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix} \tag{7}$$

where θ is the angle of the pole with the vertical, g is the gravitational constant, L is the length of the pole, M is the mass of the cart (considered much larger than the mass of the pole), and u_0, u_1 are the free parameters of the controller. Reinterpreted as a model-based reinforcement learning problem, u_0, u_1 are the free parameters of a policy π, and the exact model of the dynamics of the "world" consists of the solutions to this differential equation. Finding a policy which "solves" Eq. 7—i.e., that drives the pole towards the θ = 0 equilibrium state—is possible via random search: almost any pair of negative entries stabilizes the pole, where "stability" is meant in the formal control theoretic sense, i.e., that the coefficient matrix has only negative eigenvalues.

We don't often have access to the exact equations of motion for a problem of interest; thus, one possible analogous "world-model" version of this task is to find both a policy, u_0, u_1, and matrix entries a, b, c, d that, when solved, also solve the original task (i.e., the right-hand side of Eq. 7). This task is equivalent to learning a world model for a problem, training a policy entirely within the learned model, and then measuring the transfer of the policy to the real world. Surprisingly, this task is also efficiently solvable via random search with high probability. Specifically, starting with Gaussian distributed a, b, c, d and then solving for a u_0*, u_1* that stabilizes Eq. 7, with high probability those same u_0*, u_1* will also stabilize a balance cart-pole problem with L, g, M ∼ O(1).

While this cartoon does rely on the simplicity of the solution space for balance cart-pole, it hints at a more general property of learned models for RL tasks: models can be wrong so long as they are wrong in the right way. "Solving" the balance cart-pole task fundamentally amounts to finding u_0, u_1 that cause the coefficient matrix to have negative eigenvalues, and the class of matrices that become stable when the feedback parameters u_0, u_1 are added to their second row—while those same parameters also stabilize the true coefficient matrix of Eq. 7—is large.
Thus, looking for u_0, u_1 that stabilize random matrices in the neighborhood of the coefficient matrix is a sensible, albeit highly inductively biased, strategy. Of course, most problems do not afford such a dramatic freedom in the dimensionality of the solution manifold.

B Experimental Details
Please visit the web version of this paper at https://learningtopredict.github.io/ for information about the released code for reproducing experiments.

Experiments were performed using multi-core machines on Google Cloud Platform, for various peek probability settings, and also for multiple independent runs with different initial random seeds. Cart-pole swing up experiments were performed on multiple 96-core machines, while car racing and grid world experiments were performed on 64-core machines. Below we describe the architecture setup and experimental details for each experiment.
B.1 Swing up cart-pole
In our experiments, we fine-tuned individual weight parameters of the champion networks found, to measure the performance impact of further training. For this, we used population-based REINFORCE, as in Section 6 of [69]. Our specific approach is based on the open source estool [20] implementation of population-based REINFORCE with default parameter settings, where we use a population size of 384, and had each agent perform the task 16 times with different initial random seeds for swing up cart-pole. The agent's reward signal used by the policy gradient method is the average reward of the 16 rollouts.

In this task, our policy network is a feed forward network with 5 inputs, 1 hidden layer of 10 tanh units, and 1 output for the action. The world model is another feed forward network with 5 inputs, 30 hidden tanh units, and 5 outputs. We experimented with a larger hidden size and extra hidden layers, but did not see meaningful differences in performance. All models were trained for 10,000 generations.
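For reference, a compact sketch of this optimization loop is given below, assuming a `rollout(params)` function that returns a cumulative reward; hyperparameters mirror the swing-up settings above, and the update is the standard population-based REINFORCE/evolution-strategies estimator rather than a verbatim copy of estool.

```python
# A minimal sketch of population-based REINFORCE (Section 6 of [69], as
# implemented in estool [20]): perturb parameters, average each worker's
# reward over several rollouts, and step toward high-scoring perturbations.
import numpy as np

def train(theta, rollout, generations=10000, pop=384, n_eval=16,
          sigma=0.1, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(generations):
        eps = rng.normal(size=(pop, theta.size))        # one perturbation per worker
        fitness = np.array([
            np.mean([rollout(theta + sigma * e) for _ in range(n_eval)])
            for e in eps])                              # average reward of 16 rollouts
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        theta = theta + lr / (pop * sigma) * eps.T @ fitness  # REINFORCE gradient step
    return theta
```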
B.2 Grid Worlds
For our grid world experiments, we used the same open source population-based optimizer implementation with default parameters, a population size of 8, and a cumulative reward signal averaged over 4 rollouts.

For the fully connected network experiments, the input to the world model was a flattened list of the 5 × 5 × 2 binary observation variables concatenated with the length-5 one-hot action vector. This was passed into a one layer network with 100 hidden units in the hidden layer, and 50 units in the output layer. Predictions were calculated via thresholding: if an output was greater than 0.5, it was rounded to 1; otherwise it was rounded to 0. All apple and fire locations were predicted simultaneously.

For the convolutional network experiments, we used a convolutional architecture with shared weights and a 3 × 3 kernel where the corner entries were forced to be zero (i.e., only the center pixel and the pixels in the 4 cardinal directions around it were inputs for each receptive field—that is, 5 of the 9 pixels in the 3 × 3 receptive field were active), and 100 channels. For each 3 × 3 receptive field, the one-hot action vector was concatenated to the flattened field, and then processed by the network. The output of each 3 × 3 receptive field was 1-dimensional, and we used the same thresholding scheme as for the fully connected networks—i.e., more than 0.5 was rounded to 1, and less than 0.5 was rounded to 0. Apple and fire observations were predicted with the same network.

For both world model architecture experiments, the same policy architecture was used: a simple two-layer fully connected network with a tanh activation after the first layer, 100 units in the first hidden layer, 32 units in the second hidden layer, and 5 units in the output layer.

All models were trained for 4000 generations, and all models took between 10 and 100 random steps in a square environment with randomly placed apples and fires.
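A sketch of this masked-kernel world model is given below; shapes are illustrative assumptions, and we simplify to a single output channel per cell.

```python
# A sketch of the masked convolutional world model: a 3x3 kernel with zeroed
# corners, so each cell's prediction sees only itself and its 4 cardinal
# neighbors, plus the one-hot action vector. Weights are shared across cells.
import numpy as np

CROSS_MASK = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]], dtype=bool)   # 5 of the 9 pixels are active

def conv_world_model(obs, action_onehot, kernels, readout):
    """obs: (d, d, 2) apples/fires; kernels: (100, 15); readout: (100,)."""
    d = obs.shape[0]
    padded = np.pad(obs, ((1, 1), (1, 1), (0, 0)))
    pred = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            visible = padded[i:i + 3, j:j + 3][CROSS_MASK].ravel()  # (10,) cells
            x = np.concatenate([visible, action_onehot])            # + (5,) action
            pred[i, j] = np.tanh(x @ kernels.T) @ readout           # shared weights
    return (pred > 0.5).astype(int)                                 # thresholding
```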
B.3 Car Racing

As in the swing up cart-pole experiment, we used the same open source population-based optimizer implementation with default parameters, but due to the extra computation time required, we instead use a population size of 64 and the average cumulative reward of 4 rollouts as the reward signal.

The code and setup of the VAE for the Car Racing task is taken from [22]. We used the pre-trained VAE made available in [16], with a latent size of 16, which was trained following the same procedure as [22]. For simplicity, we do not use the RNN world model described in [22] for achieving state-of-the-art results. Instead, we found that removing the VAE noise from the latent vector for the observation improves results; hence in our experiment, from the point of view of the policy, the observed latent vector z is set to the predicted mean of the encoder, μ, from the pre-trained VAE.

In this task, our policy network is a feed forward network with 16 inputs (the latent vector), 1 hidden layer of 10 tanh units, and 3 outputs for the action space. The world model is another feed forward network with 16 inputs, 10 hidden tanh units, and 16 outputs. The 10 hidden units of this world model were used as the input to a simple linear policy in the experiments. All models were trained for 1,000 generations.
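A sketch of the linear-readout experiment mentioned above follows: the world model's weights are frozen, and only the linear map from its 10 hidden units to the 3 actions is trained (here `world_model_step` is the illustrative helper sketched in Section 4.3, and `W_lin`, `b_lin` are the only trainable parameters).

```python
# A sketch of the linear policy trained on frozen world-model features.
import numpy as np

def linear_policy(z, a_prev, frozen_wm_params, W_lin, b_lin):
    _, h = world_model_step(z, a_prev, *frozen_wm_params)  # (10,) frozen features
    return np.tanh(h @ W_lin + b_lin)                      # (3,) steering/gas/brake
```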
C The Grid World Environment

The environments are all square grids with impassable walls on the boundary. Apples and fires are placed randomly, but so that no tile contains more than one apple or fire. When the agent encounters an apple, it does not consume it until it takes an additional step—i.e., the agent sees the apple on the turn that the agent encounters it. Consumed apples are removed from the environment. The agent receives 1 point of reward for every step in the environment, 6 reward for every apple it encounters, and -8 reward for every fire it encounters.

Apples and fires are represented as binary variables in a d × d × 2 matrix, for grid width d. The agent can perform one of 5 actions: movement in the 4 cardinal directions, or a no-op.

D Correlations of predictions
Figure 9: Correlation matrices as in Figure 6 for several sampled convolutional architectures, with one panel per action (↓, ↑, →, ←, no-op). The dark pixel immediately adjacent to the agent in many of the correlation plots is a result of the agent failing to predict its own consumption of an apple, because the model used was translationally invariant.