How do Mixture Density RNNs Predict the Future?
Kai Olav Ellefsen Charles Patrick Martin
Jim Torresen
Abstract
Gaining a better understanding of how and what machine learning systems learn is important to increase confidence in their decisions and catalyze further research. In this paper, we analyze the predictions made by a specific type of recurrent neural network, mixture density RNNs (MD-RNNs). These networks learn to model predictions as a combination of multiple Gaussian distributions, making them particularly interesting for problems where a sequence of inputs may lead to several distinct future possibilities. An example is learning internal models of an environment, where different events may or may not occur, but where the average over different events is not meaningful. By analyzing the predictions made by trained MD-RNNs, we find that their different Gaussian components have two complementary roles: 1) Separately modeling different stochastic events and 2) Separately modeling scenarios governed by different rules. These findings increase our understanding of what is learned by predictive MD-RNNs, and open up new research directions for further understanding how we can benefit from their self-organizing model decomposition.
1. Introduction
Deep learning has greatly increased the ability of computers to perform complex tasks from a wide range of domains, including image recognition, language modeling, game playing and predicting the future (Mnih et al., 2015; LeCun et al., 2015; Wichers et al., 2018). However, we have an incomplete understanding of exactly how deep learning models learn to perform these tasks. Gaining a better understanding of these models is considered to be one of the most important current challenges in artificial intelligence (Samek et al., 2017; Garcia et al., 2018). There are many reasons why a better understanding is important, ranging from increasing the ability to trust machine learning systems, to the benefits such understanding would have for continued research and development of algorithms. There are therefore many recent studies aiming to make deep learning models more understandable and explainable, for convolutional neural networks (CNNs) (Yosinski et al., 2015), recurrent neural networks (RNNs) (Karpathy et al., 2015) and other architectures (Smilkov et al., 2017).

In this paper, we aim to gain a better understanding of one specific neural network architecture, mixture density RNNs (MD-RNNs). MD-RNNs are recurrent neural networks combined with a mixture density network (Bishop, 1994; Bishop & Others, 1995), such that the output parametrizes a mixture-of-Gaussians distribution (Figure 1).

Affiliation: Department of Informatics and RITMO, University of Oslo, Norway. Correspondence to: Kai Olav Ellefsen <uio>.
MD-RNNs are particularly interesting for tasks involving creative prediction, since the recurrent part allows the modeling and forecasting of sequences, and the Gaussian mixture part allows predictions to be creative, modeling different types of scenarios in a single neural network (Ha & Eck, 2017). MD-RNNs recently gained significant attention (Pearson, 2018; Yao, 2018) through Ha and Schmidhuber's paper on "World Models" (Ha & Schmidhuber, 2018), which demonstrated that MD-RNNs can learn to predict the future from a large number of observations of a simulated world. The internal models learned in Ha and Schmidhuber's work represent the agent's world so well that the authors were able to 1) train an agent inside its own internal model (or, said differently, inside its own "dream") and 2) produce the first agent to solve the Car Racing environment in OpenAI Gym. Despite the impressive results and the wide attention this work has gotten, we do not have a good understanding of how predictive MD-RNNs model the world.

MD-RNNs make predictions by sampling from a probability distribution with multiple different sub-distributions. We investigate two hypotheses about the role of these sub-distributions (mixture components) when MD-RNNs predict the future. The hypotheses are: 1) Different mixture components model different stochastic events and 2) Different mixture components model different situations with different "rules" (that is, different internal models; Figure 1). We train world models in a Doom game environment, similarly to (Ha & Schmidhuber, 2018), and let them hallucinate imagined scenarios. From these scenarios, we extract events

Figure 1.
Left: MD-RNNs model data with probability distributions composed of several components, parametrized by π, µ and σ. Right: We investigate the roles of individual components to gain a better understanding of how MD-RNNs make predictions. One possibility (illustrated here) is that different mixture components represent situations governed by different rules.

and situations according to our two hypotheses, and then measure to which degree different events are produced by different components of the mixture model.

The main contributions of this paper are 1) A framework for automatically measuring the tendency for different components of a Gaussian mixture model to generate particular types of prediction, and 2) New insights into the roles of the Gaussian components of trained MD-RNNs. In particular, we find evidence for both our hypotheses, including a very clear demonstration of different mixture components self-organizing to serve as internal models for scenarios with different rules.
2. Background
Generative machine learning models for content such as text, images or sound typically model the generated content with a probability distribution (Goodfellow et al., 2016). Mixture density networks (MDNs) are neural networks that represent mixture density models (McLachlan & Basford, 1988), that is, probability distributions which are composed of several sub-distributions (several Gaussian distributions in the models applied here; see Figure 1). MDNs can in principle represent any conditional probability distribution, and are useful when the modeled phenomenon is not well represented by a simpler distribution. An example, which we study here, is learning internal models of an environment, where different events may or may not occur, but where the average over different events is not meaningful. In this case, the multiple sub-distributions of a mixture density model can help model the fact that the world has multiple possible states which should not be mixed together or averaged.

In practice, a mixture density network (MDN) operates by transforming the outputs of a neural network to form the parameters of a mixture distribution (Bishop, 1994), generally with Gaussian models for each mixture component. These parameters are the centres (µ) and scales (σ) for each Gaussian component, as well as a weight (π) for each component (see Figure 1). The MDN usually uses an exponential activation function to transform the scale parameters to be positive and non-zero. For training, the probability density function of the mixture model is used to compute the negative log likelihood for the loss function. This involves constructing probability density functions (PDFs) for each Gaussian component and a categorical distribution from the mixture weights (see Appendix Section 1.4 for details).
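As a concrete illustration, the mixture negative log-likelihood described above can be sketched in a few lines of numpy. This is a minimal, hypothetical implementation (a single data point, diagonal covariance), not the paper's actual training code:

```python
import numpy as np

def mdn_neg_log_likelihood(pi_logits, mu, log_sigma, y):
    """Negative log-likelihood of y under a diagonal-covariance
    Gaussian mixture parametrized by raw network outputs.

    pi_logits: (K,)   unnormalized mixture weights
    mu:        (K, D) component means
    log_sigma: (K, D) log of component scales (exp keeps sigma > 0)
    y:         (D,)   observed target vector
    """
    sigma = np.exp(log_sigma)                      # positive, non-zero scales
    shifted = pi_logits - pi_logits.max()          # stable log-softmax
    log_pi = shifted - np.log(np.sum(np.exp(shifted)))
    # log N(y | mu_k, diag(sigma_k^2)), summed over dimensions
    log_norm = -0.5 * np.log(2 * np.pi) - np.log(sigma)
    log_prob = np.sum(log_norm - 0.5 * ((y - mu) / sigma) ** 2, axis=1)
    # log-sum-exp over components for numerical stability
    joint = log_pi + log_prob
    m = np.max(joint)
    return -(m + np.log(np.sum(np.exp(joint - m))))
```

With a single component centered on the target and unit scale, this reduces to the familiar Gaussian constant 0.5·log(2π); splitting the weight over a second, distant component increases the loss, as expected.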
One advantage of an MDN is that various component distributions can be used so long as the PDF is tractable, for instance, 1D (Bishop, 1994) or 2D (Graves, 2013) Gaussian distributions, or, as in our case, a multivariate Gaussian with a diagonal covariance matrix.

For inference, results are sampled from the mixture distribution. First, the πs are used to form a categorical distribution by applying the softmax function. A sample is drawn from this distribution to determine which Gaussian component will provide the output. The index i of the sampled π is used to select a Gaussian distribution, N(µ_i, σ_i), from which a sample is drawn to provide the outcome. In some cases, it is advantageous to adjust the diversity of sampling (for instance, to favour unlikely predictions), in which case the temperature of the categorical distribution can be adjusted in the typical way, and the covariance matrices of the Gaussian components may be scaled. We refer to these operations as adjusting the π- or σ-temperature, respectively.

An MDN can be applied to the outputs of an RNN, forming an MD-RNN. This approach has been applied to model 2D pen data, such as for handwriting (Graves, 2013) and sketches (Ha & Eck, 2017), as well as musical performance (Martin & Torresen, 2018). Other applications include parametric speech synthesis (Wang et al., 2017), and identifying salient locations in video data (Bazzani et al., 2017). Ha and Schmidhuber applied an MD-RNN to model the future state of a video game screen image and assist an RL agent (Ha & Schmidhuber, 2018). In the present research, we delve into this application to understand what such a model learns about the virtual worlds and how this information is represented.

Progress in deep learning has recently made it possible to learn to predict future frames of video from observing sequences of video frames (Finn et al., 2016; Mathieu et al., 2016).
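Returning to the inference procedure described above, the two-stage sampling with π- and σ-temperature can be sketched as follows. This is a minimal numpy version under one convention for the σ-temperature (scaling σ directly; some implementations instead scale the variance):

```python
import numpy as np

def sample_mdn(pi_logits, mu, sigma, pi_temp=1.0, sigma_temp=1.0, rng=None):
    """Draw one sample from a diagonal-covariance Gaussian mixture.

    pi_temp rescales the categorical distribution over components;
    sigma_temp rescales the component scales. Higher temperatures
    give more diverse (less likely) samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = pi_logits / pi_temp
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                         # softmax over component weights
    k = rng.choice(len(pi), p=pi)          # pick a component index i = k ...
    # ... then sample from N(mu_k, (sigma_temp * sigma_k)^2), per dimension
    return mu[k] + sigma_temp * sigma[k] * rng.standard_normal(mu.shape[1])
```

With the σs set to zero and one dominant π-logit, the sample collapses onto the mean of the dominant component, which makes the two-stage structure easy to verify.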
However, most approaches for predicting future visual input from pixels have typically only had the ability to predict a few frames into the future before predicted images get blurry or static. Recently, techniques have been developed that attempt to mitigate this limitation by first encoding frames into a compact, high-level representation, then predicting how this compact representation develops over time. Finally, decoding the predicted compact representation produces a predicted future image. Compared to predictions made directly in pixel space, such high-level predictions degrade less quickly, demonstrating good prediction performance many seconds into the future (Villegas et al., 2017; Wichers et al., 2018; Ha & Schmidhuber, 2018).

In Ha and Schmidhuber's recurrent world model (Ha & Schmidhuber, 2018), the predictive model consists of two components: 1) A visual component (V), which learns an encoding/decoding between a visual scene and a compact representation and 2) A memory component (M), which learns how the compact representation develops over time (Figure 2). The first component is learned by a variational autoencoder (VAE (Kingma & Welling, 2013)), by presenting it with a large collection of pictures from the visual scene. The second is learned by an MD-RNN. This world model was demonstrated to be able to predict many frames into the future, and in fact to "dream" whole episodes of agent experience. It is, however, not clear what role the different mixture components in the MD-RNN play in predicting the future.
One of our hypotheses suggests that the different Gaussian components learn to model different situations with different "rules", that is, situations where predictions need to be so different that they are best modeled separately. Humans show a remarkable ability to learn internal models (mental simulations) of a wide range of different situations, objects and people, without a high degree of conflict or interference between them. One theory suggests that this ability is facilitated by the modular organization of our central nervous system. Neural modularity may be a key to allowing multiple internal models to coexist, enabling the selection of the appropriate actions for the current context (Ghahramani & Wolpert, 1997; Wolpert
Figure 2.
World model predicting future frames by combining a variational autoencoder and an MD-RNN. We follow the architecture suggested in (Ha & Schmidhuber, 2018).

et al., 2003). Computational models built around this idea have indeed demonstrated the ability to learn and maintain multiple internal models, and to select the appropriate model for a given context (Wolpert & Kawato, 1998; Haruno et al., 2001; Demiris & Khadhouri, 2006). These models work by dividing learning experiences into multiple modules, where different modules compete to represent different situations. After a number of learning episodes, this causes different modules to specialize at representing different internal models, allowing the system to model situations with different rules, with minimal interference. Our hypothesis suggests that the Gaussian components of the MDN self-organize to perform a similar task, allowing scenarios with different rules to be modeled with little interference.
3. Methods
Ha and Schmidhuber's world model (Ha & Schmidhuber, 2018) combines an MD-RNN and a VAE to predict future states of a video game screen (Figure 2). By training the VAE to compress representations of visual scenes, the MD-RNN has a more manageable job of predicting how scenes unfold in the future.
Training the model happens in two steps: First, the VAE is trained on examples of images from the environment in which we wish to learn to make predictions. The VAE compresses each image (64x64 pixels, with 3 color channels in our setup) into a latent vector, z (64 floating-point numbers in our case). It then attempts to reconstruct the same image from the latent vector. The VAE is trained both to reconstruct images as well as possible, and to keep the representations of similar inputs close together in latent space (details are found in Appendix Section 1.3). This allows small changes to the latent vector to give meaningful changes in the compressed images.

After the VAE has learned to compress images of the world into latent vectors, the MD-RNN can be trained on sequences of latent vectors. We follow (Ha & Schmidhuber, 2018) in applying a single-layer LSTM (Hochreiter & Schmidhuber, 1997), trained by seeing examples of sequences of images as input, and the same sequence, shifted by one time-step, as output. Thereby, the LSTM learns to predict the next latent vector from a sequence of previous observations. More details on the World Model MD-RNN are found in Appendix Section 1.4.

3.1. Data Collection and Training
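For concreteness, the mapping from the LSTM output to the mixture parameters can be sketched as below. The sketch assumes K = 5 mixture components and D = 64 latent dimensions (the values used in this paper); the exact layer shapes of the original implementation may differ:

```python
import numpy as np

K, D = 5, 64   # mixture components and latent dimensionality (as in the paper)

def split_mdn_params(head_output):
    """Split the raw MD-RNN head output into mixture parameters.

    head_output: (K + 2*K*D,) vector produced by a linear layer
    on top of the LSTM state at one time-step.
    """
    pi_logits = head_output[:K]                       # one weight per component
    mu = head_output[K:K + K * D].reshape(K, D)       # component means
    log_sigma = head_output[K + K * D:].reshape(K, D)
    sigma = np.exp(log_sigma)                         # exponential keeps sigma > 0
    return pi_logits, mu, sigma

params = split_mdn_params(np.zeros(K + 2 * K * D))
```

An all-zero head output yields uniform component logits and unit scales, which is a convenient sanity check on the shapes.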
The data collection and training process follows (Ha & Schmidhuber, 2018), except we do not train a controller, since we are here analyzing predictions, not using them for agent control. The process can be summarized in the following steps (more details in Appendix Section 1):

1. Simulate 2,000 episodes with a random policy. Store all actions taken and frames observed.
2. Train a VAE to encode each frame into a length-64 latent vector z, and to decode z back to the same image.
3. Generate latent vectors z for each frame from the simulated episodes. Further training can now be done without the actual images.
4. Train an MD-RNN to model P(z_{t+1} | a_t, z_t, h_t), that is, the probability distribution for the next latent vector, given the current latent vector and action, as well as the RNN's hidden state.

We follow (Ha & Schmidhuber, 2018) in training the predictive MD-RNNs on the VizDoom (Kempka et al., 2017) Take Cover scenario. (Experiment code is available at http://doi.org/10.5281/zenodo.2539145; the scenario is described at https://gym.openai.com/envs/DoomTakeCover-v0/.) This scenario takes place in a rectangular room, where a player is facing monsters on the opposite wall. Monsters fire exploding fireballs at the player, and the player attempts to survive as long as possible by moving left and right, dodging the incoming projectiles. Agents receive 3D images of the scene ahead of them as input, and make only one decision at each timestep: move to the left, move to the right or stay in the same place.

This scenario serves as a useful test for our hypotheses, since it has both stochastic events (e.g., monsters may or may not launch fireballs) and different situations governed by different rules (an exploding fireball behaves very differently from an incoming fireball mid-air).

After training MD-RNNs, we analyze the predictions they make when "dreaming" about the future.
We insert an initial latent vector (representing the real initial state from the game) into the MD-RNN, and then repeat the steps below for as long as we want to predict (Figure 3 illustrates the steps, following the same numbering as the list):

1. Produce a probability distribution over the next latent vector, P(ẑ_{t+1} | a_t, ẑ_t, h_t), parametrized by the MDN parameters π, µ and σ. Store π (the vector indicating the weight of each Gaussian component).
2. Sample a latent vector ẑ_{t+1} from the probability distribution, and decode it into a predicted frame with the VAE.
3. Analyze the predicted frame to measure which events are depicted (see below). Store the list of events for the current frame together with π. Together, these can tell us whether different mixture components generate different events.
4. Repeat the process, starting from point 1, with the sampled ẑ_{t+1} as the RNN input, to predict the next latent vector P(ẑ_{t+2} | a_{t+1}, ẑ_{t+1}, h_{t+1}).

Every round through this process generates a new predicted latent vector, which is next used as input to predict the latent vector following it. Storing latent vectors and MD-RNN parameters allows us to subsequently analyze the way the MD-RNN has learned to represent the world and make predictions about it.
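The closed-loop structure of this procedure can be sketched in code. In the sketch below the trained MD-RNN is replaced by a stand-in function with the same interface (latent vector and action in; mixture parameters and hidden state out), so the example only illustrates the loop, not the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def mdrnn_step(z, a, h):
    """Stand-in for the trained MD-RNN: returns mixture parameters
    (pi, mu, sigma) for the next latent vector, plus a new hidden state."""
    h = np.tanh(0.5 * h + 0.1 * z.mean() + 0.1 * a)   # toy recurrence
    K, D = 5, len(z)
    pi = np.full(K, 1.0 / K)                          # uniform toy weights
    mu = np.tile(z, (K, 1)) + 0.01 * h.mean()
    sigma = np.full((K, D), 0.05)
    return pi, mu, sigma, h

def dream(z0, policy, steps):
    """Closed-loop prediction: each sampled z_{t+1} is fed back as input."""
    z, h, pis = z0, np.zeros(8), []
    for _ in range(steps):
        a = policy(z)
        pi, mu, sigma, h = mdrnn_step(z, a, h)
        pis.append(pi)                      # store mixture weights for analysis
        k = rng.choice(len(pi), p=pi)       # sample a component ...
        z = mu[k] + sigma[k] * rng.standard_normal(len(z))  # ... then a latent
        # (a VAE decoder would turn z into a predicted frame here)
    return z, np.array(pis)

z_final, pi_log = dream(np.zeros(64), policy=lambda z: rng.integers(3), steps=10)
```

The stored π-vectors (one per time-step) are exactly what the analyses in the next sections operate on.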
3.2. Measuring Events in Predicted Frames
Our hypotheses suggest that different mixture components represent either different stochastic events, or different situations where different rules apply. In the world we make predictions about, we identify two stochastic events: 1) Monsters may appear, and 2) they may launch a fireball towards the player. Note that monsters never disappear in the modeled world, and fireballs disappearing is not a stochastic event, since once a fireball has been fired, it will deterministically disappear after reaching the other end of the room.

For our second hypothesis, we identify three situations where the rules for how frames evolve in a time sequence are very different: 1) The normal situation (the player is facing monsters, who sometimes launch fireballs), 2) an explosion takes place in front of the player and 3) the player is next to a wall. Situations 2 and 3 are so different from the normal situation that internal models of the three different situations could benefit from some separation. Explosions cover a large portion of the screen, and unfold according to a specific sequence, which has little to do with the way a normal scene unfolds (see Figure 5). Walls next to the agent result in unique dynamics, since they require a large portion of the screen to move sideways (in the opposite direction) as the player moves.

Since we are dealing with a quite simple and limited world, we can measure events from frames with straightforward image processing methods from the Python package scikit-image (https://scikit-image.org/). The methods we apply to measure the presence of monsters, fireballs, walls and explosions are documented in Appendix Section 2, and also made available online (http://doi.org/10.5281/zenodo.2539145).

Figure 3. Our proposed framework for analyzing how MD-RNNs make predictions.
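As an illustration of this kind of frame analysis, a detector in the spirit of those described above might flag explosion frames by the fraction of bright orange pixels. This is a hypothetical sketch in plain numpy; the color rule and threshold are illustrative assumptions, not the paper's actual appendix methods:

```python
import numpy as np

def explosion_present(frame, threshold=0.25):
    """Hypothetical explosion detector: flags a frame if a large
    fraction of its pixels are bright and red/orange-dominant.
    The threshold and color rule are illustrative assumptions.

    frame: (64, 64, 3) array of RGB values in [0, 1].
    """
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    fiery = (r > 0.6) & (g < 0.9 * r) & (b < 0.5 * r)  # orange-ish pixels
    return fiery.mean() > threshold
```

A per-frame boolean like this, paired with the stored π-vector, is all the subsequent component analysis needs.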
4. Results
As previously discussed, we have two main hypotheses about the roles of different mixture components in the MD-RNN: 1) Different components learn to model different possible futures, allowing them to creatively sample what will happen next, and 2) Different components learn to form different internal models of the environment, that is, they specialize to model situations governed by a specific set of rules. Below, we analyze MD-RNN predictions along with the weights of mixture components to shed light on these hypotheses.
In our main experiments, we test 5 independently trained MD-RNNs, all with the same architecture (see Appendix Section 1), to reduce the chance that results are specific to one trained model. In practice, we found results to be very similar when training the same model multiple times with shuffled data. For all five, we generate multiple "dreams" by predicting future latent vectors, and feeding each prediction in as the input vector to the RNN for the next time-step, along with a randomly sampled action. This allows us to dream up long prediction sequences which, although not always realistic, illuminate how mixture components relate to predicted events. In our main experiments, we generate 10 dreams for each of our 5 models, each dream 1000 time-steps long. Tests of statistical significance apply the Mann-Whitney U test.
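The per-dream measurement used in the following subsections boils down to comparing two proportions: the share of event frames generated by the event's "main component", and that component's share of all frames. A minimal sketch, with a hypothetical per-frame data structure:

```python
from collections import Counter

def main_component_stats(frames):
    """Per-dream measure of how concentrated an event is in one component.

    frames: list of (event_present: bool, active_component: int) pairs,
    where active_component is the highest-weight Gaussian for that frame.
    Returns (share of event frames from the main component,
             share of ALL frames from that same component),
    or None if the event never occurred in this dream.
    """
    event_counts = Counter(c for present, c in frames if present)
    if not event_counts:
        return None
    main, n_main = event_counts.most_common(1)[0]   # the "main component"
    event_share = n_main / sum(event_counts.values())
    overall_share = sum(1 for _, c in frames if c == main) / len(frames)
    return event_share, overall_share
```

Collecting these pairs across dreams and comparing the two proportions (e.g., with a Mann-Whitney U test) corresponds to the paired boxes reported in Figures 4 and 6.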
4.1. Stochastic Events
Our first hypothesis suggests that different mixture components represent different stochastic events, allowing creative predictions about the future by sampling from different Gaussian components. We test this hypothesis by dreaming up many different futures as described above, and measuring 1) different stochastic events in the dreams and 2) the weight assigned to each component in the mixture model. As mentioned above, there are two different stochastic events in this scenario: fireballs appearing and monsters appearing.

To confidently say that a specific mixture component is particularly responsible for producing one event, we need to measure whether that component has produced the event more frequently than one would expect if events were evenly distributed among components. For instance, if we find that one component is responsible for 80% of the fireball appearances, but that component is also responsible for 80% of all generated frames, then we do not have any clear evidence. We therefore measure the relationship between components and events as follows:

1. Produce 10 different dreams with each of the 5 trained MD-RNNs, resulting in a total of 50 dreams.
2. For each time-step of a dream, measure a) the presence of the events described above, and b) which component is currently the most active (the one with the highest π-value output by the MD-RNN).
3. Within one dream, the component that produced an event most frequently is denoted the "main component" for that event. This is the Gaussian that is most likely responsible for generating the given event.
4. Measuring the proportion of the event produced by the "main component" versus the other components across all N dreams yields the leftmost boxes in the pairs in Figure 4.
5. To be sure the "main component" is specifically responsible for the specific event/situation, we also measure the proportion of all frames produced by that component. This yields the rightmost boxes.
6. A significantly higher value in the leftmost than the rightmost box thus indicates that one component is producing the relevant event/situation more frequently than one would expect by looking at the proportion of all events generated by that component.

As we can see in Figure 4, there is a strong tendency for fireball appearances to be produced more by one specific
Figure 4.
The tendency for different stochastic events to be produced by one specific Gaussian component (blue) vs the tendency for that component to be responsible for events overall (orange). ∗∗∗ indicates significant differences with p < 0.001.

Figure 5.
Top: A monster launching a fireball at the player. Bottom: An explosion unfolding in front of the player. The two situations are governed by very different rules. Images are down-sampled to the same resolution (64x64) used during training.

component. There is no similar tendency for monster appearances.

4.2. Different Internal Models
Our second hypothesis is that different mixture components represent different internal models, that is, models of scenarios where the rules are different. To study this, we repeated the calculations outlined above, measuring the presence of such scenarios rather than stochastic events. As discussed above, we identify 3 scenarios in this game where the rules of how to generate the next frame are very different from the normal situation (facing monsters and any fireballs): 1) having a wall on the left, 2) having a wall on the right and 3) getting hit by an exploding fireball.
The result of this calculation is shown in Figure 6. There are statistically significant differences (p < 0.001) between the main component's tendency to generate the specific situations and its tendency to generate frames overall, for explosions and walls on either side. We also show that the same effect is not generally present for situations containing fireballs. We hypothesize that this is because fireballs are very common, and do not drastically change the way the world changes from one frame to the next. There should therefore be less need for modeling them in a separate mixture component.

Figure 6. The tendency for different scenarios to be produced by one specific Gaussian component (blue) vs the tendency for that component to be responsible for events overall (orange). ∗∗∗ indicates significant differences with p < 0.001.

To further illuminate the role of different mixture components, we let a single trained MD-RNN dream up 100-timestep predictions, while plotting the weights of all components, to see which one is currently most responsible for making predictions. An example of such a plot is shown in Figure 7. Notice that one specific mixture component dominates from around timestep 75, the same time that an explosion is present in the frame. In other repetitions of the experiment, we found a different component tends to dominate when the agent is near a wall. In "normal" situations (no nearby walls or explosions), we tend to see different components being active together, without any clear dominance. This supports our second hypothesis, that different components specialize to model scenarios where the rules for generating future frames are different.
As a final test of the role of different mixture components in making predictions, we generated dreams while committing to a single component during an entire dream. We conditioned the MD-RNN with a random start image, and made a dream predicting 1000 steps into the future, sampling only from the first mixture component. We then repeated this for each of the five mixture components in the MD-RNN. Since different trained models may not represent the same events in the exact same mixture components, we base this analysis on ten 1000-timestep dreams for each component, from a single trained model.

Figure 7. The weights over time of each of the 5 mixture components output by the MD-RNN, and the corresponding dreams produced by sampling according to those weights. (a) The weights of the 5 different mixture components (π_i for i ∈ {1, ..., 5}) during a 100-timestep dream. (b) The resulting dream generated by sampling according to the mixture weights in (a); numbers indicate the current timestep. Until timestep 75, several components are similarly weighted, and responsible for making predictions together. After timestep 75, one component dominates, and takes over in generating predictions. The resulting prediction is an exploding fireball.

The results are shown in Figure 8. There is a clear tendency for this model to generate explosions with the second mixture component, and walls (both right and left) with the fifth component. Visualizing the conditional dreams, we observe something interesting: the components that do not produce explosions result in dreams where fireballs approach the player, but stop and hover mid-air, or even reverse and return to the monsters. Presumably, these components have never learned to model explosions, and can therefore not produce them when being solely responsible for generating dreams.
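The "committed" sampling used in this experiment simply bypasses the categorical draw over π and always reads from one fixed Gaussian. A minimal sketch (component index k is chosen by the experimenter, not sampled):

```python
import numpy as np

def sample_committed(mu, sigma, k, rng):
    """Sample the next latent from one fixed Gaussian component k,
    ignoring the learned mixture weights entirely.

    mu, sigma: (K, D) component means and scales for the current step.
    """
    return mu[k] + sigma[k] * rng.standard_normal(mu.shape[1])
```

Running the dream loop with this sampler, once per component, yields the per-component event counts summarized in Figure 8.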
Figure 8. The events generated in dreams when committing to a single Gaussian component during the entire prediction sequence.
5. Discussion
The results give support for both our initial hypotheses: the different Gaussian components in the MD-RNN specialize to model both different stochastic events (Figure 4) and different internal models (Figure 6). For the stochastic events, we saw a strong tendency for fireball appearances to be generated more frequently by a specific mixture component, but the same was not true for monster appearances. Detecting monsters in the generated predictions is more difficult than detecting the other elements (see Appendix Section 2), and we cannot rule out that a relationship exists here that we could have measured if monster appearances were less ambiguous.

For the second hypothesis, we observed very clear evidence that situations where the rules governing predictions are different were produced by separate Gaussian components. This was seen clearly both when measuring which components were most active in the different situations (Figures 6 and 7), and when sampling from a single component during an entire prediction sequence (Figure 8).
6. Conclusion
Through automatic classification of predicted frames, we have shed light on the way mixture density RNNs predict the future. We started out with two hypotheses for the role of the different components in the Gaussian mixture models, and found evidence in support of both. First, we found evidence that different components produce different stochastic events more frequently, supporting the hypothesis that different components of the mixture models represent different potential directions for the predicted future. This is a valuable property for systems modeling creative predictions (such as in generation of artistic text, music and images), since it allows them to strike a balance between modeling an observed phenomenon and improvising by choosing between several possible predicted futures.

We found even more solid evidence for our second hypothesis: there is a very strong tendency for different components of the mixture model to be responsible for producing events that are governed by different "rules", that is, events that require different internal models. Building machine learning systems that can represent different internal models is a long-standing challenge, since learning very different skills tends to cause interference or forgetting (Ellefsen et al., 2015). One way this challenge has been handled in the past is by building modular systems, where different modules compete to represent different internal models (Demiris & Khadhouri, 2006; Haruno et al., 2001). Our results suggest that mixture density RNNs self-organize to separate different internal models into different components.

This ability of MD-RNNs opens up a further hypothesis: since these networks can automatically self-organize multiple internal models, they should be well equipped to model different scenarios with a low degree of interference.
In future studies, we plan to examine this further by training MD-RNNs on multiple different environments, studying the effect of the number of components in the MDN on the observed interference.
7. Acknowledgments
This work is partially supported by The Research Council of Norway as a part of the Engineering Predictability with Embodied Cognition (EPEC) project, under grant agreement 240862.
References
Bazzani, L., Larochelle, H., and Torresani, L. Recurrent Mixture Density Network for Spatiotemporal Visual Attention. In International Conference on Learning Representations, arXiv:1603.08199, 2017.

Bishop, C. M. Mixture density networks. Technical Report NCRG/97/004, Neural Computing Research Group, Aston University, 1994.

Bishop, C. M. and Others. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

Demiris, Y. and Khadhouri, B. Hierarchical attentive multiple models for execution and recognition of actions. Robotics and Autonomous Systems, 54(5):361–369, 2006. doi: 10.1016/j.robot.2006.02.003.

Ellefsen, K. O., Mouret, J.-B., and Clune, J. Neural Modularity Helps Organisms Evolve to Learn New Skills without Forgetting Old Skills. PLOS Computational Biology, 11(4):e1004128, 2015. doi: 10.1371/journal.pcbi.1004128.

Finn, C., Goodfellow, I., and Levine, S. Unsupervised Learning for Physical Interaction through Video Prediction. ArXiv e-prints, arXiv:1605.07157, 2016.

Garcia, R., Telea, A. C., da Silva, B. C., Tørresen, J., and Comba, J. L. D. A task-and-technique centered survey on visual analytics for deep learning model engineering. Computers & Graphics, 77:30–49, 2018.

Ghahramani, Z. and Wolpert, D. M. Modular decomposition in visuomotor learning. Nature, 386(6623):392–395, 1997. doi: 10.1038/386392a0.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep generative models. In Deep Learning, chapter 20, pp. 651–716. MIT Press, 2016.

Graves, A. Generating Sequences With Recurrent Neural Networks. ArXiv e-prints, arXiv:1308.0850, 2013.

Ha, D. and Eck, D. A Neural Representation of Sketch Drawings. ArXiv e-prints, arXiv:1704.03477, 2017.

Ha, D. and Schmidhuber, J. Recurrent World Models Facilitate Policy Evolution. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 2451–2463. Curran Associates, Inc., 2018.

Haruno, M., Wolpert, D. M., and Kawato, M. MOSAIC Model for Sensorimotor Learning and Control. Neural Computation, 13(10):2201–2220, 2001.

Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
Neural Computation , 1997. ISSN 08997667. doi:10.1162/neco.1997.9.8.1735.Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizingand understanding recurrent networks. arXiv preprintarXiv:1506.02078 , 2015.Kempka, M., Wydmuch, M., Runc, G., Toczek, J., andJaskowski, W. ViZDoom: A Doom-based AI researchplatform for visual reinforcement learning. In
IEEE Con-ference on Computatonal Intelligence and Games, CIG ,2017. ISBN 9781509018833. doi: 10.1109/CIG.2016.7860433.Kingma, D. P. and Welling, M. Auto-encoding variationalbayes. arXiv preprint arXiv:1312.6114 , 2013.LeCun, Y., Bengio, Y., and Hinton, G. Deep learn-ing.
Nature , 521(7553):436–444, 2015. doi: 10.1038/nature14539.Martin, C. P. and Torresen, J. RoboJam: A musical mix-ture density network for collaborative touchscreen in-teraction. In
Lecture Notes in Computer Science (in-cluding subseries Lecture Notes in Artificial Intelligenceand Lecture Notes in Bioinformatics) , volume 10783LNCS, pp. 161–176. Springer, Cham, 2018. ISBN9783319775821. doi: 10.1007/978-3-319-77583-8 11.URL http://link.springer.com/10.1007/978-3-319-77583-8{_}11 .Mathieu, M., Couprie, C., and LeCun, Y. Deep multi-scalevideo prediction beyond mean square error. In
ICLR ,2016.McLachlan, G. J. and Basford, K. E.
Mixture models: Infer-ence and applications to clustering , volume 84. MarcelDekker, 1988.Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. a., Ve-ness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C.,Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wier-stra, D., Legg, S., and Hassabis, D. Human-level controlthrough deep reinforcement learning.
Nature , 518(7540):529–533, 2015. doi: 10.1038/nature14236.Pearson, J. Watch a Computer Learn to Play Doom’Inside a Dream [online article].
Motherboard ,2018. Retrieved from https://motherboard.vice.com/en{_}us/article/43bxjj/watch-deep-learning-ai-computer-play-doom-dream . ow do MD-RNNs predict the future? Samek, W., Wiegand, T., and M¨uller, K.-R. Explain-able artificial intelligence: Understanding, visualizingand interpreting deep learning models. arXiv preprintarXiv:1708.08296 , 2017.Smilkov, D., Carter, S., Sculley, D., Vi´egas, F. B., andWattenberg, M. Direct-manipulation visualization of deepnetworks. arXiv preprint arXiv:1708.03788 , 2017.Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., and Lee, H.Learning to Generate Long-term Future via HierarchicalPrediction.
ICML , apr 2017.Wang, X., Takaki, S., and Yamagishi, J. An autore-gressive recurrent mixture density network for para-metric speech synthesis. In . IEEE, mar 2017. doi: 10.1109/icassp.2017.7953087. URL https://doi.org/10.1109/icassp.2017.7953087 .Wichers, N., Villegas, R., Erhan, D., and Lee,H. Hierarchical Long-term Video Prediction with-out Supervision, jul 2018. ISSN 1938-7228.URL http://proceedings.mlr.press/v80/wichers18a.html .Wolpert, D. M. and Kawato, M. Multiple paired forwardand inverse models for motor control.
Neural Networks ,11(7):1317–1329, 1998. ISSN 08936080. doi: 10.1016/S0893-6080(98)00066-5.Wolpert, D. M., Doya, K., and Kawato, M. A uni-fying computational framework for motor controland social interaction.
Philosophical Transactionsof the Royal Society B: Biological Sciences , 358(1431):593–602, 2003. ISSN 09628436. doi:10.1098/rstb.2002.1238. URL .Yao, M. No Time to Read AI Research? We Sum-marized Top 2018 Papers for You [Blog Post],2018. URL .Retrieved from .Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., and Lipson, H.Understanding neural networks through deep visualiza-tion.