Neural Physicist: Learning Physical Dynamics from Image Sequences
Baocheng Zhu
Ant Financial Services Group, Shanghai, China
[email protected]
Shijun Wang
Ant Financial Services Group, Seattle, USA
[email protected]
James Zhang
Ant Financial Services Group, New York, USA
[email protected]
Abstract
We present a novel architecture named Neural Physicist (NeurPhy) to learn physical dynamics directly from image sequences using deep neural networks. For any physical system, given the global system parameters, the time evolution of states is governed by the underlying physical laws. How to learn meaningful system representations in an end-to-end way, and how to estimate accurate state transition dynamics that facilitate long-term prediction, have been long-standing challenges. In this paper, by leveraging recent progress in representation learning and state space models (SSMs), we propose NeurPhy, which uses a variational auto-encoder (VAE) to extract the underlying Markovian dynamic state at each time step, a neural process (NP) to extract the global system parameters, and a non-linear, non-recurrent stochastic state space model to learn the physical dynamic transitions. We apply NeurPhy to two physical experimental environments, i.e., damped pendulum and planetary orbit motion, and achieve promising results. Our model can not only extract physically meaningful state representations, but also learn the state transition dynamics enabling long-term predictions for unseen image sequences. Furthermore, from the manifold dimension of the latent state space, we can easily identify the degree of freedom (DoF) of the underlying physical systems.
Introduction

Discovering physical laws by doing experiments has long been the exclusive domain of human expertise. Through quantitative measurements and inductive reasoning, researchers propose hypotheses to explain the observed data and to predict future observations. In recent years, deep learning (DL) [15] has shown extraordinary power in information extraction and pattern recognition, fuelling major breakthroughs in areas such as image recognition [27], natural language processing [8] and reinforcement learning [31]. It would greatly help basic science if we could use DL to extract the underlying physical laws directly from experimental data or observations. However, applying DL to facilitate the discovery of physical laws is still rarely explored [6, 21, 4].

In this paper, we take one step towards this goal. In our setting, for a physical system, we only have image sequences of object movement, each with a different global parameter setting. Our goal is to build a deep learning model that infers the underlying physical state transition dynamics and enables long-term movement predictions, especially for unseen global parameter settings.
Preprint. Under review.

To solve the problem above, there are two key tasks: state identification and state transition learning. Taking advantage of recent progress in representation learning, we propose a novel neural architecture named Neural Physicist (NeurPhy), which uses a variational auto-encoder (VAE) [25] to extract the underlying dynamic state at each time step, and a neural process (NP) [14, 11] to extract the global system representations. For state transition learning, NeurPhy uses a stochastic state space model (SSM) [26, 22, 9, 17, 18] to learn the physical dynamic process, which can naturally incorporate uncertainty estimation. To the best of our knowledge, this is the first work that learns physical dynamics in an end-to-end way from raw image sequences by performing system identification and state extraction. The main contributions of this paper are: 1) We split state learning into two parts, global and local (dynamic) representations, and use an NP to extract the global representations of the image sequences. The learned representations are physically meaningful and match the underlying system parameters. 2) We use a stochastic SSM to learn the system's physical transition dynamics together with the learned global state representations. The dynamic states extracted at each time step are Markovian and match the ground-truth states. Notably, our proposed architecture does not require a recurrent structure to infer the dynamic states, and the state transition dynamics are not limited to the linear models used in many previous works [22, 29, 12]; it is thus more computationally efficient and more broadly applicable. 3) NeurPhy can extrapolate to image sequences whose global system parameters are not seen in the training phase. Regarding each image sequence as a task, this means our model naturally has the ability for meta-learning [3, 5, 34].

Figure 1: (a) Model architecture of NeurPhy.
Circles are random variables and filled circles represent variables observed during training. Solid lines denote the generative process; dashed lines denote the inference model; dotted lines denote multi-step transition dynamics. (b) Model architecture for extracting the global representation from context samples. A context sample consists of two consecutive image frames encoded through a CNN; all samples are then aggregated into a global representation r_c.

For any physical system, though governed by the same physical laws, different system parameters usually result in different observed behaviours. Taking the damped pendulum as an example, if we change its length or mass, the motion period and decay time change too. In our setting, for each physical environment, given one set of system parameters, we have a time-discrete sequence of length T denoted by x_{1:T} = (x_1, x_2, ..., x_T). Note that each frame x_t is a high-dimensional raw image and each sequence defines a task. Then, for N different system parameter settings, we have N different tasks denoted by (x^1_{1:T}, ..., x^N_{1:T}). Given an arbitrary image sequence x_{1:T}, our problem is to infer the underlying system parameters and state transition dynamics, and to make long-term predictions of the system's evolution. Note that we are dealing with a meta-learning problem here, i.e., we have to make correct predictions with global parameters not seen in the training phase.

We tackle the problem as follows, shown graphically in Figure 1(a). For any image sequence, we first split the whole sequence into two parts: contexts and targets. The context samples x_c are used to extract the global representation r_c of the sequence (see the next subsection for details).
The intrinsic Markovian state z_{t−d} is inferred from historical images x_{≤t−d} using a recognition network; then z_{t−d} transits to z_t step by step, aided by the global information r_c, using a stochastic state space model. Note that d is the number of steps over which the dynamic state transits and is called the overshooting length [18, 16]. Finally, an observation model is used to generate image x_t from the dynamic state z_t.

Note that our architecture is quite general. If the input sequences are low-dimensional variables instead of images, all convolutional layers in NeurPhy can be replaced by multilayer perceptrons with all other parts intact, which makes the learning tasks simpler. We show the corresponding experimental results in the supplementary materials.

In order to infer the global representation of the image sequence, we leverage recent progress in NPs [14, 11] and conditional VAEs [33]. The detailed structure is shown in Figure 1(b). We randomly select n_c context samples from the whole image sequence; each context sample contains two images from consecutive time steps t_i and t_i + 1. We use a convolutional neural network (CNN) to extract the information r_c^{t_i} from each context sample, and aggregate all of them into a global representation r_c. This procedure resembles the NP proposed by Garnelo et al. [14]. An NP is a kind of stochastic process which uses neural networks to learn distributions over functions. A stochastic process needs to satisfy two conditions: exchangeability and consistency. Reflected in the NP, the parameters of the inference network should be invariant to permutations of the observations x and time t, so the aggregation operator should be invariant to the exchange of the r_c^{t_i}. Here, for simplicity, we use the mean operator. Note that using two consecutive frames as a context sample and aggregating all context samples into r_c forces the network to extract global information about state transitions across different time steps.
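The aggregation step above can be sketched as follows. This is a minimal illustration, assuming a toy linear "encoder" in place of the paper's CNN; the frame size and the r_c dimension are illustrative, not the paper's settings.

```python
import numpy as np

def encode_pair(frame_pair, W):
    # Encode one context sample (two stacked consecutive frames).
    return np.tanh(W @ frame_pair.reshape(-1))

def aggregate_context(frame_pairs, W):
    # Mean aggregation keeps r_c invariant to the order of the context
    # samples, the exchangeability condition required of a stochastic process.
    return np.mean([encode_pair(p, W) for p in frame_pairs], axis=0)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2 * 16))                 # r_c dimension 3, 16-pixel toy frames
pairs = [rng.standard_normal((2, 16)) for _ in range(5)]

r_c = aggregate_context(pairs, W)
assert np.allclose(r_c, aggregate_context(pairs[::-1], W))   # permutation invariance
```

Any other symmetric operator (e.g., max or sum) would also satisfy exchangeability; the mean is simply the easiest to keep well-scaled as the number of context samples varies.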
As shown in Figure 1(a), our state space model differs from previous works [26, 22, 12, 7, 10, 13, 9] mainly in the following aspects. First, we only have access to image sequences; since individual image observations generally do not reveal the full state of the system, we consider a partially observable Markov decision process (POMDP). Here, we use a convolutional recognition network which stacks two consecutive image frames as inputs to extract the underlying Markovian dynamic state z_t, i.e., we assume q(z_t | x_{≤t}) = q(z_t | x_{t−1:t}). In the experimental section, we will show that our recognition model can extract the correct position and velocity. The use of only two consecutive frames rather than the whole history also enables our model to be non-recurrent, which significantly simplifies the model structure and speeds up inference. Second, our dynamic transition function takes the form p(z_t | z_{t−1}, r_c), which uses the global information r_c of the sequence extracted from context samples. Splitting the state information into global and local parts (r_c and z_t) is quite intuitive, as the global information represents the time-invariant features of the system while the local information represents only what changes from step to step. Third, we use multi-layer neural networks to model the state transition function, which makes our transition model non-linear and more general. Moreover, the inferred dynamic states are stochastic, which means our transition model naturally obtains uncertainty estimates and is thus more robust. Last but not least, we use the overshooting technique [18, 16] to train the model with a multi-step prediction loss in the latent state space.
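The two-frame recognition model q(z_t | x_{t−1:t}) can be sketched as below; linear maps stand in for the convolutional encoder, and the weight names are assumptions made for illustration. Stacking two consecutive frames makes both position and velocity information available, and the state is sampled via the reparameterization trick.

```python
import numpy as np

def recognition(x_prev, x_curr, W_mu, W_logvar, rng):
    # Stack the two consecutive frames into one input vector.
    h = np.concatenate([x_prev.ravel(), x_curr.ravel()])
    # Produce a diagonal-Gaussian posterior over the latent state z_t.
    mu, logvar = W_mu @ h, W_logvar @ h
    # Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return z, mu, logvar
```

Because only the last two frames enter the encoder, no recurrent state has to be carried across time, which is what makes the model non-recurrent.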
In the experimental section, we will see that the overshooting technique can greatly reduce the compounding error of multi-step predictions.

It is well known that a powerful generative model such as a VAE usually ignores the conditioning information [2, 19, 16], which makes learning latent Markovian states impossible. We tackle this problem mainly with two techniques. First, we split the state information into global and local parts, which makes learning the Markovian dynamic states easier. Second, we employ the overshooting technique, which enforces consistency in the multi-step state transition process.

Combining the discussions of the previous sections, our model consists of the following components:

    Global representation model:  r_c ~ p(r_c | x_c)
    State space model:            z_t ~ p(z_t | z_{t−1}, r_c)
    Recognition model:            z_t ~ q(z_t | x_{≤t}) := q(z_t | x_{t−1:t})
    Observation model:            x_t ~ p(x_t | z_t).                          (1)

In order to overcome the intractability of the posterior distribution of the latent state z, we use the standard variational inference technique [25] and derive the evidence lower bound on the data log-likelihood conditioned on the context points:

    ln p_d(x_{1:T} | x_c)
      = ln ∫ ∏_{t=1}^T p(x_t | z_t) p(z_t | z_{t−d}, r_c) p(r_c | x_c) dz_{1:T} dr_c
      ≥ Σ_{t=1}^T { E_{q(z_t | x_{≤t})} [ln p(x_t | z_t)]
          − E_{z_{t−1}, r_c} [KL(q(z_t | x_{≤t}) || p(z_t | z_{t−1}, r_c))] }.  (2)

Here, the subscript d denotes the overshooting length. For a more detailed derivation, see the supplementary materials.

The first term is the VAE reconstruction loss; for images, we calculate it using the cross-entropy of each pixel. The recognition model infers the approximate state posterior z_t from the past observations x_{≤t} (in our setting, from the two consecutive observations x_{t−1:t}). Then the latent dynamic state z_t is decoded to x_t by the observation model, whose output we want to be as similar as possible to the ground-truth image.
The second term is the dynamic state transition loss. Here, the discrepancy between the transition distribution p(z_t | z_{t−1}, r_c) and the approximate posterior q(z_t | x_{≤t}) is measured by the Kullback–Leibler (KL) divergence. The expectation of the KL term is taken over z_{t−1} and r_c, which can be written explicitly as z_{t−1} ~ ∫ p(z_{t−1} | z_{t−d}, r_c) q(z_{t−d} | x_{≤t−d}) dz_{t−d} and r_c ~ p(r_c | x_c) for overshooting length d. The state transition process is as follows. First, the recognition model infers the approximate posterior state z_{t−d} from the past observations x_{≤t−d}. Then z_{t−d} conducts a multi-step transition to z_t in the state space. Finally, z_t is compared against the state extracted by the recognition model from x_{≤t}. Note that all comparisons are done in the latent state space, not in the observation space, which significantly reduces the computational cost.

The previous derivation assumes a fixed overshooting length d. To better model both short- and long-term dynamics, we can combine loss terms from different overshooting lengths up to some dynamic transition horizon D, i.e., we combine {ln p_d(x_{1:T} | x_c)}_{d=1}^D, to arrive at our final objective function:

    ln p(x_{1:T} | x_c) = (1/D) Σ_{d=1}^D ln p_d(x_{1:T} | x_c)
      ≥ Σ_{t=1}^T { E_{q(z_t | x_{≤t})} [ln p(x_t | z_t)]
          − (1/D) Σ_{d=1}^D β_d E_{p(z_{t−1} | z_{t−d}, r_c) q(z_{t−d} | x_{≤t−d}) p(r_c | x_c)}
              [KL(q(z_t | x_{≤t}) || p(z_t | z_{t−1}, r_c))] }.                 (3)

Here, we include weighting factors {β_d}_{d=1}^D analogously to the β-VAE [20], which tune the relative strength of the competing reconstruction and dynamic transition loss terms.
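For diagonal-Gaussian states, the KL terms in the objective have a closed form, and the combination over overshooting lengths is a weighted average. A minimal sketch, with illustrative names (the paper's networks would supply the mean/log-variance pairs):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL(q || p) between diagonal Gaussians, summed over dimensions.
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def overshoot_objective(recon_nll, kl_per_d, betas):
    # kl_per_d[d-1] holds the KL term for overshooting length d (summed over t);
    # the beta_d weights trade reconstruction off against transition consistency.
    D = len(kl_per_d)
    return recon_nll + sum(b * kl for b, kl in zip(betas, kl_per_d)) / D
```

Setting all β_d = 1 recovers a plain average over overshooting lengths; raising β_d for large d pushes the model toward better long-horizon consistency at some cost in reconstruction fidelity.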
Experiments

We apply NeurPhy to two experimental physical systems, the damped pendulum and planetary orbit motion, and verify whether our model can correctly discover the underlying global physical parameters and the associated dynamics. Detailed parameter settings of the network structures can be found in the supplementary materials.

In all experiments, each image sequence forms a task associated with one set of global system parameters. Each image sequence contains a total of 101 time steps over a fixed time interval, and each image consists of binary pixels. We take 90% of the tasks as meta-training tasks and the other 10% as meta-test tasks. For each meta-training task, we use 20 context samples randomly selected from the image sequence to extract the global representation. We take 90% of the frames as training (target) samples and the other 10% as test ones. For the meta-test tasks, we also use 20 context samples, but they come from the first 21 frames of the image sequence to avoid using any future information; the rest of the samples are used for meta-testing. We set the maximum overshooting length D = 5, batch size B = 50 and β_d = 1 (d ∈ [1, D]), and train using the Adam optimizer [24] with a fixed learning rate.

The first problem we consider is a pendulum swinging back and forth, subject to gravity and air friction, with an initial angular velocity ω_0, as shown in Figure 2(a).

Figure 2: (a) The damped pendulum system with initial angular velocity ω_0 and gravity mg. (b) The motion of the pendulum angle θ for l = 2 and several values of m. (c) The motion of the pendulum angle θ for m = 1 and several values of l. (d) The state space of θ vs. ω for all sequences.

From the second law of motion, the pendulum's angle θ follows the equation:

    mg sin θ + μl θ̇ + ml θ̈ = 0,                                               (4)

where m is the pendulum mass, g is the gravitational constant, μ is the damping coefficient generated by air friction and l is the pendulum length.
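Equation (4) can be integrated numerically with a simple Euler scheme, writing ω = θ̇ and updating θ ← θ + Δt·ω, ω ← ω − Δt·((μ/m)ω + (g/l) sin θ). A sketch of such a ground-truth simulator, assuming μ = 0.1 and Δt = 0.005 (the exact experimental values are not stated here); g = 10 and the initial state match the text:

```python
import numpy as np

def pendulum_step(theta, omega, dt, m, l, g=10.0, mu=0.1):
    # One Euler step of the damped pendulum dynamics derived from Equation (4).
    theta_next = theta + dt * omega
    omega_next = omega - dt * ((mu / m) * omega + (g / l) * np.sin(theta))
    return theta_next, omega_next

theta, omega = -np.pi, 4.0            # initial angle and angular velocity
for _ in range(2000):                 # integrate for t in [0, 10] with dt = 0.005
    theta, omega = pendulum_step(theta, omega, dt=0.005, m=1.0, l=2.0)
```

Since dE/dt = −μ l² ω² for the mechanical energy E = ½ m l² ω² − m g l cos θ, the simulated trajectory spirals inward in (θ, ω) space, exactly the damped spirals described below.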
Here, ˙ denotes the time derivative, ¨ the second time derivative, and we write the angular velocity as ω = θ̇. If the angle θ is small enough, the equation of motion has an analytical solution, but in general it can only be solved numerically. From the state (θ, ω) at time step t, we can get the state at t + 1 by the following dynamic transition equations:

    θ_{t+1} = θ_t + Δt · ω_t,
    ω_{t+1} = ω_t − Δt · ((μ/m) ω_t + (g/l) sin θ_t).                           (5)

Here, we assume the time interval Δt is small enough.

In our experiments, we fix the gravitational constant g = 10 and the damping coefficient μ, and set the initial angle and angular velocity to θ_0 = −π and ω_0 = 4 respectively. We vary the pendulum length and mass over fixed intervals and generate a grid of different combinations of l and m. For each (l, m) combination, we generate the states over the time steps and render them into images. From Equation (5), we can see that the global parameters of the pendulum system are l and m, and we plot the typical evolution of θ for different l and m in Figure 2(b)(c). We assume that the different masses are caused by different pendulum densities and cannot be directly detected from single images. In this case, in order to discover the correct dynamics and to make good predictions, the model has to learn the global system parameter m from the changes in the image sequences. The state space (θ, ω) for all sequences and all time steps of each sequence is shown in Figure 2(d). The state-space curve of each sequence is a spiral caused by the damping. If we wait long enough, they will all sink into the rest point (0, 0), and the central hole will be fully filled.

Figure 3: (a-b) The latent space of the global representations r_c, coloured according to the ground-truth global parameters l and m. (c-h) The learned dynamic state space z of the pendulum.
The first row shows the state space of all samples, coloured according to the ground-truth parameters l, θ and ω. The second row shows the state space of two slices with l = 1 and l = 3.

In the experiments, we set both the dimension of the global representation r_c and that of the dynamic state z to 3. As shown in Figure 3(a)(b), the learned r_c lies on a two-dimensional manifold.

Figure 4: Two samples of image sequences of the pendulum's motion in the meta-test set for 50-step predictions. The first row shows the ground-truth images, the second row the predicted ones.

Table 1: Fitting scores (R²) for global parameters

    Experiment                 l (Linear)     l (Quadratic)     m (Linear)    m (Quadratic)
    Damped pendulum            0.999          1.000             0.795         0.949
                               r_n (Linear)   r_n (Quadratic)   e (Linear)    e (Quadratic)
    Planetary orbits motion    0.959          0.977             0.681         0.935

From the colormap plot, we see a one-to-one correspondence with the underlying ground-truth parameters l and m, which means NeurPhy has extracted the correct global representations. To show this more quantitatively, we fit the learned representations r_c to the global parameters l and m using linear and quadratic regressions respectively. The fitted R² scores are shown in Table 1. The fit is quite good up to quadratic terms (especially for parameter l), which means the learned representation is indeed a simple mapping of the de-facto physical system variables. Note that learning a manifold that uses one dimension to encode m is non-trivial, as the pendulum's mass m can only be extracted from the time evolution of the image sequence.

In Figure 3(c)(d), we plot the learned latent Markovian dynamic state space z together with the ground-truth state parameters l, θ and ω (shown by colormap). Note that the state parameter l is unchanged during the motion, but it is necessary for image reconstruction, so the state space z encodes the parameters (l, θ, ω) using all three dimensions.
More precisely, we plot the state space for two particular parameter settings, l = 1 and l = 3, in Figure 3(d)(f)(h); we can see that each l lies on a two-dimensional Möbius-strip sub-manifold that encodes θ and ω.

In Figure 4, starting from the leftmost state images, we plot two sets of predicted images for the next 50 time steps on the meta-test set. The images in the first row are the ground truth, while the images in the second row are the predicted ones. We can see that the learned dynamic state transition model makes good long-term predictions even though the maximum overshooting length is only 5.

More quantitatively, we show the model performance for training, test and meta-test samples in Table 2. Recalling Equation (3), we list the values of the cross-entropy terms together with the Kullback–Leibler (KL) divergence terms for different overshooting lengths (KL1 to KL5 for d = 1 to 5). Lower cross-entropy means better image reconstruction, and a smaller KL term means better long-term prediction. As an ablation study, we also experiment with maximum overshooting length D = 1 to study whether a larger overshooting length really helps in learning the dynamic transitions. As indicated in Table 2, while a larger maximum overshooting length D = 5 can slightly hurt the reconstruction fidelity, it greatly reduces the KL terms, facilitating better long-term predictions.

Next, we apply NeurPhy to the prediction problem of planetary orbit motion. Planetary orbits generally form ellipses, as shown in Figure 5. Specifically, the radius has the following expression:

    r = r_n (1 + e) / (1 + e cos(θ − θ_n)),                                     (6)

where r_n denotes the perihelion distance, e the eccentricity and θ_n the angle of the major axis. The effects of the different parameters on planetary orbits are shown in Figure 5(a)-(c). We also plot the dynamic state space (r, θ) in Figure 5(d). A more detailed derivation of the orbit dynamics can be found in the supplementary materials.
Table 2: Model performance

    Experiment                 D    Stage       Cross Entropy    KL1    KL2    KL3    KL4    KL5
    Damped pendulum            1    Training
    Planetary orbits motion    1    Training
Figure 5: Schematic diagram of planetary orbit motion. The effects of the different global parameters on the orbits are shown in (a) r_n, (b) e and (c) θ_n. (d) The state space of r vs. θ for all sequences.

Without loss of generality, we set the initial angle θ_0 = 0, and vary the initial radius and velocity (r_0, v_r, v_θ) to generate different motion sequences. Here, v_r and v_θ denote the velocity along the radial and angular directions respectively. A set (r_0, v_r, v_θ) has a one-to-one correspondence with a set (r_n, e, θ_n), which defines a unique orbit. In other words, given the initial state, the system parameters (r_n, e, θ_n) characterizing the planetary orbit are also determined. At each moment, the dynamic state can be represented solely by the two parameters (r, θ), and the corresponding velocity (v_r, v_θ) can be calculated from (r, θ) together with the global parameters (r_n, e, θ_n), as shown in the supplementary materials.

The time evolution of (r, θ) is governed by the dynamic equations:

    θ_{t+1} = θ_t + Δt · h / r_t²,
    r_{t+1} = r_n (1 + e) / (1 + e cos(θ_{t+1} − θ_n)).                         (7)

Note that h = √(GM(1 + e) r_n) is proportional to the angular momentum of the orbit (G and M denote the gravitational constant and the mass of the central sun, respectively; see the supplementary materials for the derivation). From Equation (7), we can see that there are three global parameters (r_n, e, θ_n) defining the orbit motion, but given (r_t, θ_t), we can first obtain θ_n from Equation (6) and then substitute it into the second formula to obtain r_{t+1}. This means that in the dynamic evolution of planetary orbits, θ_n is a pseudo-degree of freedom, and the dynamics can be defined with just two global parameters (r_n, e), which also reflects a hidden symmetry in the orbit motion.

We generate image sequences by varying r_0, v_r and v_θ over fixed ranges, and for each sequence we generate the image frames.
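The transition equations (7) can serve directly as a ground-truth simulator. A minimal sketch, assuming GM = 1 as the unit system (the paper's value is not stated here):

```python
import numpy as np

def orbit_step(theta, r, dt, r_n, e, theta_n, GM=1.0):
    # One step of the orbit transition (Equation 7).
    h = np.sqrt(GM * (1.0 + e) * r_n)      # conserved specific angular momentum
    theta_next = theta + dt * h / r**2
    r_next = r_n * (1.0 + e) / (1.0 + e * np.cos(theta_next - theta_n))
    return theta_next, r_next
```

Because r is always recomputed from the closed-form orbit equation rather than integrated, the radius stays exactly within [r_n, r_n(1+e)/(1−e)], i.e., between perihelion and aphelion, regardless of the step size.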
If our model learns the correct dynamics, we expect the dynamic latent states z to lie on a two-dimensional manifold correlated with the ground-truth states (r, θ), while the global representation r_c of each sequence should also lie on a two-dimensional manifold corresponding to the ground-truth system parameters (r_n, e).

Figure 6: (a-b) The global representation manifold r_c for planetary orbits, coloured according to r_n and e. (c-d) The state space of planetary orbit motion, coloured according to r and θ.

In Figure 6(a)(b), we can clearly see that the learned global latent representation indeed lies on a two-dimensional manifold. This is a striking result, as it took considerable analytical effort in the previous paragraphs to identify that the number of independent ground-truth system parameters affecting the dynamic evolution is two (i.e., r_n and e) rather than three (i.e., r_n, e and θ_n). We again fit the latent representations r_c to the global parameters r_n and e using linear and quadratic regressions, shown in Table 1. The fitted R² scores are very high, which indicates a good match with the ground-truth global parameters. The learned dynamic state space in Figure 6(c)(d) also shows that our model can infer physically meaningful state representations. In Figure 7, we also show two sets of multi-step predictions (50 time steps) on the meta-test set, with both ground-truth (first row) and predicted (second row) images. It can be seen that the learned dynamic model makes correct long-term predictions of the orbit motion. We also show the model performance together with D = 1 in Table 2; the lower KL term values for D = 5 again indicate that the overshooting technique greatly helps long-term prediction.

Figure 7: Two samples of image sequences of orbit motions in the meta-test set for 50-step predictions. The first row shows the ground-truth images, while the second row shows the predicted ones.
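The linear and quadratic R² fits reported in Table 1 can be reproduced with plain least squares; the sketch below is illustrative and stands in for whatever fitting library the authors used. It regresses a ground-truth parameter on the learned representation and reports the coefficient of determination.

```python
import numpy as np

def r2_score(y, y_pred):
    # Coefficient of determination: 1 - residual SS / total SS.
    return 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

def fit_r2(features, target, degree=1):
    # Design matrix: intercept + linear terms, plus all quadratic cross terms
    # when degree = 2, mirroring the linear/quadratic fits of Table 1.
    cols = [np.ones(len(target))] + [features[:, j] for j in range(features.shape[1])]
    if degree == 2:
        k = features.shape[1]
        cols += [features[:, i] * features[:, j] for i in range(k) for j in range(i, k)]
    X = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return r2_score(target, X @ coef)
```

An R² near 1 for a low-degree fit is what justifies the claim that the learned representation is a simple (near-polynomial) mapping of the true system parameters.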
Related Work

Our work was inspired by the work of Garnelo et al. [14], who applied NPs to function regression and optimization. Kim et al. [23] addressed the underfitting issue by incorporating an attention mechanism. Eslami et al. [11] and Kumar et al. [28] applied NPs to scene representation, and Singh et al. [32] extended them to sequential settings.

For state transition modelling, Karl et al. [22] proposed Deep Variational Bayes Filters (DVBF), a linear Gaussian state space model (LGSSM) using linear Gaussian models to learn and identify a Markovian state space. Deisenroth et al. [7] introduced PILCO, a data-efficient policy search method in model-based reinforcement learning, which uses Gaussian processes to model state transition dynamics, explicitly incorporating model uncertainty into long-term planning and achieving unprecedented learning efficiency on high-dimensional control tasks. Watter et al. [35] proposed Embed to Control (E2C), which learns a locally linear dynamical transition model from raw pixel images and achieves promising results on a variety of complex control problems. Recently, Hafner et al. [18] proposed the Deep Planning Network (PlaNet), a model-based reinforcement learning method that learns the state dynamics from images, plans in latent space, and achieves performance comparable to model-free algorithms on continuous control tasks.

For physical dynamics learning, Iten et al. [21] used NPs to discover physical concepts, but they did not learn the underlying Markovian state and dynamics, and the inputs to their model are low-dimensional states. Breen et al. [4] applied deep neural networks to solve the motion of the chaotic three-body problem and achieved accurate solutions several orders of magnitude faster than a state-of-the-art solver.
Conclusion

In this work, we propose a novel network architecture, NeurPhy, that can learn physical dynamics directly from image sequences. NeurPhy not only correctly extracts the global system parameters from the sequence and the dynamic state from each frame, but also succeeds in predicting the dynamic evolution in unseen cases. We apply our model to the characterization and prediction of the damped pendulum and planetary orbit motion, and achieve promising results. Our architecture is quite general and can be easily extended to model-based reinforcement learning, which we will explore in the future.
Supplementary Materials

Derivation of the lower bound

Now we derive the lower bound objective function of NeurPhy. For overshooting length d, conditioned on context points x_c, the log-likelihood of the data can be written as:

    ln p_d(x_{1:T} | x_c)
      = ln ∫ ∏_{t=1}^T p(x_t | z_t) p(z_t | z_{t−d}, r_c) p(r_c | x_c) dz_{1:T} dr_c
      = ln ∫ ∏_{t=1}^T [q(z_t | x_{≤t}) / q(z_t | x_{≤t})]
            p(x_t | z_t) p(z_t | z_{t−d}, r_c) p(r_c | x_c) dz_{1:T} dr_c.      (8)

Here, we use the Markovian state space assumption [9], i.e., the dynamic state z_t contains all the information needed to reconstruct image x_t. Also, we assume the approximate posterior q(z_t) does not depend on the context r_c explicitly given x_{≤t}. From Jensen's inequality, we get:

    ln ∫ ∏_{t=1}^T [q(z_t | x_{≤t}) / q(z_t | x_{≤t})]
          p(x_t | z_t) p(z_t | z_{t−d}, r_c) p(r_c | x_c) dz_{1:T} dr_c
      ≥ E_{q(z_{1:T} | x_{≤T}) p(r_c | x_c)} Σ_{t=1}^T
          [ln p(x_t | z_t) + ln p(z_t | z_{t−d}, r_c) − ln q(z_t | x_{≤t})]
      = E_{q(z_{1:T} | x_{≤T}) p(r_c | x_c)} Σ_{t=1}^T
          [ln p(x_t | z_t) + ln E_{p(z_{t−1} | z_{t−d}, r_c)} p(z_t | z_{t−1}, r_c)
           − ln q(z_t | x_{≤t})]
      ≥ E_{q(z_{1:T} | x_{≤T}) p(r_c | x_c)} Σ_{t=1}^T
          [ln p(x_t | z_t) + E_{p(z_{t−1} | z_{t−d}, r_c)} ln p(z_t | z_{t−1}, r_c)
           − ln q(z_t | x_{≤t})]
      = Σ_{t=1}^T E_{q(z_t | x_{≤t})} ln p(x_t | z_t)
        − E_{p(z_{t−1} | z_{t−d}, r_c) q(z_{t−d} | x_{≤t−d}) p(r_c | x_c)}
            KL(q(z_t | x_{≤t}) || p(z_t | z_{t−1}, r_c)),                       (9)

which is Equation (2) in the main paper.

Derivation of the planetary orbit motion

In this subsection, we derive the elliptical orbit motion obeyed by planets circling a central sun, following the derivation in references [30, 1]. Assuming the mass M of the central sun is several orders of magnitude larger than the mass m of the planet, the position of the central sun is approximately fixed and we only need to consider the motion of the planet.
Using polar coordinates (r, θ), the position vector r can be written as (see Figure 8):

    r = r r̂.                                                                   (10)

Here, symbols in bold denote vectors and symbols with a hat denote unit vectors along the corresponding coordinate directions. Applying the time derivative to r, we get:

    ṙ = ṙ r̂ + r (dr̂/dt) = ṙ r̂ + rω θ̂,                                      (11)
    r̈ = (r̈ − rω²) r̂ + (2ṙω + rω̇) θ̂.                                        (12)

Figure 8: Schematic diagram of the planetary orbit motion, with r denoting the radius from the planet to the central sun and θ the angle between r and the reference axis. r̂ and θ̂ denote unit vectors along the radial and angular directions respectively.

Here, we write the angular velocity θ̇ as ω and have used the identities dr̂/dt = ω θ̂ and dθ̂/dt = −ω r̂ in the derivation. Now, from the second law of motion F = m a = m r̈ and the law of universal gravitation:

    F = −(GMm / r²) r̂,                                                         (13)

we have:

    r̈ − rω² = −GM / r²,                                                        (14)
    2ṙω + rω̇ = 0.                                                             (15)

Here, G denotes the gravitational constant and ¨ the second time derivative. Equation (15) can be rewritten in total-differential form:

    2ṙω + rω̇ = (1/r) d(r²ω)/dt = 0  ⇒  r²ω = h,                               (16)

where h is a constant proportional to the angular momentum l = m r v_θ = m r² ω = m h. Equation (16) is also equivalent to Kepler's second law of planetary motion.

Let u = 1/r; then:

    ṙ = −u̇ / u² = −(θ̇ / u²) du/dθ = −h du/dθ,
    r̈ = −h θ̇ d²u/dθ² = −h² u² d²u/dθ².                                        (17)

Together with Equation (14), we get:

    d²u/dθ² + u = GM / h².                                                      (18)

Solving the above equation, we have the general solution:

    u = A cos(θ − θ_n) + GM / h².                                               (19)

So the planetary orbit r(θ) is:

    r(θ) = 1 / (A cos(θ − θ_n) + GM / h²).                                      (20)

Here, θ_n denotes the angle of the major axis and A is a coefficient. The nearest radius, called the perihelion distance r_n, is obtained by setting θ = θ_n, i.e.,

    r_n = 1 / (A + GM / h²).                                                    (21)

Defining the eccentricity e = A h² / GM, we have:

    r_n = h² / (GM (1 + e)).                                                    (22)

Then we can rewrite r using r_n as:

    r = r_n (1 + e) / (1 + e cos(θ − θ_n)).                                     (23)

This is Equation (6) used in the main paper.
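Equations (14)-(16) can be checked numerically: integrating the polar equations of motion with a small Euler step keeps r²ω (Kepler's second law) essentially constant. A sketch, with GM = 1 and an illustrative initial state:

```python
import numpy as np

def polar_step(r, rdot, theta, omega, dt, GM=1.0):
    # Euler step of the polar equations of motion.
    r_acc = r * omega**2 - GM / r**2           # radial equation (14)
    omega_dot = -2.0 * rdot * omega / r        # angular equation (15)
    return (r + dt * rdot, rdot + dt * r_acc,
            theta + dt * omega, omega + dt * omega_dot)

r, rdot, theta, omega = 1.0, 0.1, 0.0, 1.0
h0 = r**2 * omega                              # initial specific angular momentum
for _ in range(5000):
    r, rdot, theta, omega = polar_step(r, rdot, theta, omega, dt=1e-4)
assert abs(r**2 * omega - h0) < 1e-3           # r^2 * omega is (nearly) conserved
```

In continuous time d(r²ω)/dt = 0 exactly; the Euler scheme introduces only an O(Δt²) drift per step, which is why a small step size suffices for the check.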
Now we derive the global parameters $(h, e, \theta_n, r_n)$ from the initial condition $(r_0, v_r^0, v_\theta^0)$. Since $h$ is a constant, we have:
$$h = r_0 v_\theta^0. \tag{24}$$
From the law of energy conservation, we have:
$$\frac{1}{2}m\left((v_r^0)^2 + (v_\theta^0)^2\right) - \frac{GMm}{r_0} = \frac{1}{2}m(v_\theta^n)^2 - \frac{GMm}{r_n}. \tag{25}$$
Here, the left-hand side is the energy at the initial moment, while the right-hand side is the energy at the perihelion $r_n$. Together with $h = r_n v_\theta^n$ and eq. (22), we have:
$$\frac{G^2M^2}{2h^2}\left(e^2 - 1\right) = \frac{1}{2}\left((v_r^0)^2 + (v_\theta^0)^2\right) - \frac{GM}{r_0}. \tag{26}$$
From the above equation we can get the eccentricity $e$, and from eq. (23) we have:
$$r_0 = \frac{r_n(1+e)}{1 + e\cos(\theta_0 - \theta_n)} = \frac{r_n(1+e)}{1 + e\cos(\theta_n)}, \tag{27}$$
from which we can get $\theta_n$. Here, we assume the initial angle $\theta_0 = 0$. So, from eqs. (22), (24), (25) and (27), we can see that the set $(r_0, v_r^0, v_\theta^0)$ has a one-to-one correspondence with $(r_n, e, \theta_n)$, as claimed in the main paper.

From eq. (16), we have:
$$\dot{\theta} = \frac{h}{r(\theta)^2}. \tag{28}$$
Integrating the above equation over time $t$, we can derive the time evolution of $\theta$, and from the relation between $\theta$ and $r$ (eq. (23)) we can also get the time evolution of $r$. The differential equation (28) is transcendental and has no analytical solution. In our setting, we only need the state transition equations from $(r_t, \theta_t)$ to $(r_{t+1}, \theta_{t+1})$. From eq. (28), we have:
$$\theta_{t+1} = \theta_t + \Delta t \cdot \frac{h}{r_t^2}, \tag{29}$$
and from eq. (23):
$$r_t = \frac{r_n(1+e)}{1 + e\cos(\theta_t - \theta_n)}, \tag{30}$$
$$r_{t+1} = \frac{r_n(1+e)}{1 + e\cos(\theta_{t+1} - \theta_n)}. \tag{31}$$
These transition equations correspond to eq. (7) in the main paper.

Experimental results for low-dimensional inputs using NeurPhy

Our model is quite general: if the inputs are not raw images but low-dimensional variables (such as positions of pendulum end-points), it can still be applied by changing all convolutional layers in NeurPhy to multilayer perceptrons. As we show below, the learning tasks actually become easier.
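As a concrete companion to the orbit derivation above, the mapping (24)-(27) from initial conditions to orbit parameters and the discrete transition (29)-(31) can be sketched numerically; GM and the initial condition below are illustrative values, not taken from the paper:

```python
import math

GM = 1.0                        # illustrative gravitational parameter (assumed value)
r0, vr0, vth0 = 1.0, 0.2, 1.1   # illustrative initial condition (r_0, v_r^0, v_theta^0)

# Eq. (24): h is conserved along the orbit.
h = r0 * vth0

# Eq. (26): solve for the eccentricity e.
energy = 0.5 * (vr0**2 + vth0**2) - GM / r0
e = math.sqrt(1.0 + 2.0 * h**2 * energy / (GM * GM))

# Eq. (22): perihelion distance.
r_n = h**2 / (GM * (1.0 + e))

# Eq. (27) with theta_0 = 0: cos(theta_n) = (r_n (1 + e) / r_0 - 1) / e.
# acos fixes theta_n only up to sign; moving outward (v_r^0 > 0) means the
# perihelion lies behind us, so we take theta_n with the opposite sign of v_r^0.
theta_n = -math.copysign(math.acos((r_n * (1.0 + e) / r0 - 1.0) / e), vr0)

def step(r_t, theta_t, dt):
    """Discrete state transition, eqs. (29)-(31)."""
    theta_next = theta_t + dt * h / r_t**2
    r_next = r_n * (1.0 + e) / (1.0 + e * math.cos(theta_next - theta_n))
    return r_next, theta_next

# Eq. (30) reproduces r_0 at theta_0 = 0 by construction; then roll the state forward.
r, theta = r_n * (1.0 + e) / (1.0 + e * math.cos(0.0 - theta_n)), 0.0
assert abs(r - r0) < 1e-12
for _ in range(1000):
    r, theta = step(r, theta, 1e-3)
assert theta > 0.0 and r > 0.0
```

Because each $r_{t+1}$ is recomputed from the conic (23) rather than integrated, the trajectory stays exactly on the orbit; only the angle accumulates the first-order time-stepping error of eq. (29).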
Figure 9: The global representation manifold for the damped pendulum with colormaps denoting the underlying (a) l and (b) m. The holes marked by dashed circles correspond to samples in the meta-test set.

Figure 10: The dynamic state space for the damped pendulum with colormaps denoting the underlying (a) l, (b) θ and (c) ω.

We use the position coordinates (x, y) of the pendulum end-points as inputs (i.e., the raw images used in the main paper are replaced by 2-dimensional positions). Following the same routine as in the main paper, we run experiments with various global parameters (l, m). The learned global representations are shown in Figure 9, where the colormaps in (a) and (b) denote the ground-truth l and m respectively. We can see a clear correspondence between the learned manifold and the true global parameters. Furthermore, there are some holes between the data points in the figure (such as the places marked by dashed circles), corresponding to the meta-test samples. In Figure 10, we plot the dynamic state space of the damped pendulum with colormaps denoting the ground-truth dynamic state (l, θ, ω). The learned state space lies on a 3-dimensional manifold, which shows good correspondence with the underlying dynamic states.

To show the performance of our model on low-dimensional inputs, we plot the x coordinates of the pendulum end-points for the training set in Figure 11, with the x axes denoting the true values and the y axes denoting the predicted ones. We can see an almost perfect match between the true and the predicted pendulum positions; as the overshooting length d gets longer, the predictions get worse. In Table 3, we report the mean square errors (MSEs) between the true and the predicted pendulum positions for the training, test and meta-test (20 and 2 context samples) sets, where T + d denotes the result with overshooting length d and d ranges from 0 to 5.
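For reference, low-dimensional (x, y) inputs of this kind can be generated from a simple damped-pendulum simulator such as the sketch below; the dynamics model (viscous damping coefficient b, gravity g, time step dt, integrator) is our own illustrative choice and not necessarily the one used to render the paper's sequences:

```python
import math

def pendulum_endpoint_sequence(l, m, theta0=1.0, omega0=0.0,
                               g=9.8, b=0.5, dt=0.05, steps=50):
    """Generate (x, y) coordinates of a damped pendulum's end-point.

    One plausible damping model: theta'' = -(g/l) sin(theta) - b/(m l^2) * theta',
    integrated with semi-implicit Euler; end-point (x, y) = (l sin(theta), -l cos(theta)).
    """
    theta, omega = theta0, omega0
    xs, ys = [], []
    for _ in range(steps):
        omega += dt * (-(g / l) * math.sin(theta) - b / (m * l**2) * omega)
        theta += dt * omega
        xs.append(l * math.sin(theta))
        ys.append(-l * math.cos(theta))
    return xs, ys

xs, ys = pendulum_endpoint_sequence(l=2.0, m=2.0)
# Every end-point lies on the circle of radius l around the pivot.
assert all(abs(x * x + y * y - 4.0) < 1e-9 for x, y in zip(xs, ys))
```

Note how the global parameters (l, m) enter only through the transition dynamics, while each observation is just a 2-dimensional point; this is exactly the setting in which NeurPhy's convolutional layers can be swapped for multilayer perceptrons.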
We can see that on both the training and test sets the MSEs are quite small; they become somewhat larger on the meta-test set.

Figure 11: The true and predicted x coordinates of pendulum end-points with different overshooting lengths d: (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

Table 3: MSEs of the true and predicted pendulum positions, where T + d denotes the result with overshooting length d.

MSE                      T+0   T+1   T+2   T+3   T+4   T+5
Training
Test
Meta-test (20 contexts)
Meta-test (2 contexts)

In Figures 13 and 14, we show predictions for a meta-test sequence whose (l, m) is unseen in the training set. In Figure 13, as in the training setting, we give the model 20 context samples, while in Figure 14 we give only 2 context samples to see whether our meta-learning algorithm can deal with fewer contexts. Our model still makes quite good predictions even when only 2 context samples are given; this can also be seen in the last row of Table 3, where the MSEs for different overshooting lengths are only slightly larger than those with 20 context samples.

We also apply the low-dimensional-input model to the prediction of planetary orbit motion. All settings are the same as in the main paper, except that the inputs are changed to the planet's position coordinates (x, y). In Figures 15 and 16, we plot the learned global representation and dynamic state space. In Figure 17, we plot the true and predicted x coordinates of the planet's positions, and in Table 4 we give the MSEs of the predictions. In Figures 18, 19 and 20, we plot the performance for a training sequence and for meta-test sequences (with 20 and 2 context samples respectively). Our model indeed learns the underlying global and local representations together with the dynamic transition process. Note also that the planetary orbit motion is easier for our model to learn than the damped pendulum.

Table 4: MSEs of the true and predicted planet positions, where T + d denotes the result with overshooting length d.

MSE                      T+0   T+1   T+2   T+3   T+4   T+5
Training
Test
Meta-test (20 contexts)
Meta-test (2 contexts)

Figure 12: The true and predicted x coordinates with different overshooting lengths: (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

Figure 13: The true and predicted x coordinates of the damped pendulum for a meta-test sequence with 20 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

For
Global representation model and
Recognition model, we use convolutional neural networks, each with four layers, filter sizes of [32,64,64,128], square kernels, strides of 2 and ReLU activations. For Observation model, we use deconvolutional neural networks with filter sizes of [64,64,32,1], kernel sizes of [5,5,6,6] and strides of 2; the first three layers use ReLU activations and the last layer uses Sigmoid. For State space model, we use a 4-layer feedforward neural network with cell units [128,128,64,16] and ReLU activations. The last layer is further connected to a dense layer with 2 × dim_z units, where dim_z is the dimension of the latent dynamic state, which is set to 3 in our experiments: the first dim_z units generate the mean of z and the second dim_z units produce the standard deviation.

Figure 14: The true and predicted x coordinates of the damped pendulum for a meta-test sequence with 2 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

Figure 15: The global representation manifold for planetary orbits with colormaps denoting (a) r_n and (b) e.

The network architecture for low-dimensional inputs is almost the same as in the image case, except that the convolutional layers are replaced by multilayer feedforward networks. For
Global representation model, it is replaced by a 4-layer feedforward neural network with cell units of [128,128,64,16] and ReLU activations. For
Recognition model, we only use a 2-layer feedforward neural network with cell units of [32,16] and ReLU activations. For the damped pendulum system, we set β_d = 1 for all d ∈ [1, D], batch size B = 2 and run 1000 epochs. For the planetary orbit motion, we set β_d = 1 for all d ∈ [1, D], batch size B = 5 and run 500 epochs.

Figure 16: The dynamic state space for planetary orbits with colormaps denoting underlying (a) r and (b) θ.

Figure 17: The true and predicted x coordinates of the planet's positions with different overshooting lengths d: (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

Figure 18: The true and predicted x coordinates of the planet's positions for a training sequence. The solid lines denote true x coordinates, the blue stars are context points, the red circles are target points and the green triangles are test points. The x-axis denotes time step and we plot results for different overshooting lengths in (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

Figure 19: The true and predicted x coordinates of the planet's positions for a meta-test sequence with 20 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

Figure 20: The true and predicted x coordinates of the planet's positions for a meta-test sequence with 2 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) d = 0, (b) d = 1, (c) d = 2, (d) d = 3, (e) d = 4 and (f) d = 5.

References

[1] http://farside.ph.utexas.edu/teaching/301/lectures/node155.html.
[2] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a Broken ELBO. In International Conference on Machine Learning, pages 159-168, 2018.
[3] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981-3989, 2016.
[4] Philip G. Breen, Christopher N. Foley, Tjarda Boekholt, and Simon Portegies Zwart. Newton vs the machine: solving the chaotic three-body problem using deep neural networks. arXiv:1910.07291, Oct. 2019.
[5] Yutian Chen, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning, pages 748-756. JMLR.org, 2017.
[6] Emmanuel de Bézenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. In International Conference on Learning Representations, 2018.
[7] Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, pages 465-472. Omnipress, 2011.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, 2019.
[9] Andreas Doerr, Christian Daniel, Martin Schiegg, Nguyen-Tuong Duy, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Probabilistic Recurrent State-Space Models. In International Conference on Machine Learning, pages 1279-1288, 2018.
[10] Stefanos Eleftheriadis, Tom Nicholson, Marc Deisenroth, and James Hensman. Identification of Gaussian process state space models. In Advances in Neural Information Processing Systems, pages 5309-5319, 2017.
[11] S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, and Demis Hassabis. Neural scene representation and rendering. Science, 360(6394):1204-1210, June 2018.
[12] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pages 3601-3610, 2017.
[13] Roger Frigola, Yutian Chen, and Carl Edward Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pages 3680-3688, 2014.
[14] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural Processes. arXiv:1807.01622, July 2018.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, Massachusetts, 2016.
[16] Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, and Aaron van den Oord. Shaping Belief States with Generative Environment Models for RL. arXiv:1906.09237, June 2019.
[17] David Ha and Jürgen Schmidhuber. World Models. arXiv:1803.10122, Mar. 2018.
[18] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning, pages 2555-2565, 2019.
[19] Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. arXiv:1901.05534, 2019.
[20] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR, 2(5):6, 2017.
[21] Raban Iten, Tony Metger, Henrik Wilming, Lidia del Rio, and Renato Renner. Discovering physical concepts with neural networks. arXiv:1807.10300, July 2018.
[22] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. arXiv:1605.06432, May 2016.
[23] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive Neural Processes. arXiv:1901.05761, Jan. 2019.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, Dec. 2014.
[25] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114, Dec. 2013.
[26] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman Filters. arXiv:1511.05121, Nov. 2015.
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84-90, May 2017.
[28] Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, and Murray Shanahan. Consistent Generative Query Networks. arXiv:1807.02033, July 2018.
[29] Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7785-7794, 2018.
[30] Archie E. Roy. Orbital Motion. CRC Press, 2004.
[31] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140-1144, Dec. 2018.
[32] Gautam Singh, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. Sequential neural processes. In Advances in Neural Information Processing Systems, pages 10254-10264, 2019.
[33] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483-3491, 2015.
[34] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, Nov. 2016.
[35] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, NIPS, pages 2746-2754, 2015.