Comparing recurrent and convolutional neural networks for predicting wave propagation
Stathi Fotiadis, Eduardo Pignatelli, Mario Lino Valencia, Chris Cantwell, Amos Storkey, Anil A. Bharath
Stathi Fotiadis, Eduardo Pignatelli, Anil A. Bharath
Department of Bioengineering, Imperial College London

Mario Lino Valencia, Chris D. Cantwell
Department of Aeronautics, Imperial College London

Amos Storkey
Institute for Adaptive and Neural Computation, The University of Edinburgh

ABSTRACT
Dynamical systems can be modelled by partial differential equations, and a need for their numerical solution appears in many areas of science and engineering. In this work, we investigate the performance of recurrent and convolutional deep neural network architectures to predict the propagation of surface waves governed by the Saint-Venant equations. We improve on the long-term prediction over previous methods while keeping the inference time at a fraction of that of numerical simulations. We also show that convolutional networks perform at least as well as recurrent networks in this task. Finally, we assess the generalisation capability of each network by extrapolating for longer times and in different physical settings.

1 INTRODUCTION
Many physical systems in science and engineering are described by partial differential equations (PDEs). This study investigates the performance of recurrent and convolutional deep neural networks to model such phenomena. Accurately predicting the evolution of such systems is usually done through numerical simulations, a task that requires significant computational resources. Simulations usually need extensive tuning and need to be re-run from scratch even for small variations in the parameters. With their potential to learn hierarchical representations, deep learning techniques have emerged as an alternative to numerical solvers, offering a desirable balance between accuracy and computational cost (Carleo et al., 2019).

Here, we focus on the modelling of surface wave propagation governed by the Saint-Venant (SV) equations. This phenomenon offers a good test-bed for controlled analyses of two-dimensional sequence prediction of PDEs for several reasons. First, in contrast to some physical systems, such as fluid flow, the evolution of the real system is unlikely to enter chaotic regimes. From a representation learning point of view, this makes model training and assessment relatively straightforward. Despite this, the SV equations are strongly related to the Navier-Stokes equations, widely used in computational fluid dynamics. Further, computational modelling of surface waves is used in seismology, computer animation, in predictions of surface runoff from rainfall – a critical aspect of the water cycle (Moussa & Bocquillon, 2000) – and in flood modelling (Ersoy et al., 2017).

This study provides three contributions. First, we identify three relevant architectures for spatiotemporal prediction. Two of these architectures lead to improved accuracy in long-term prediction over previous attempts (Sorteberg et al., 2019) while keeping the inference time orders of magnitude smaller than typical solvers. Secondly, our comparison between recurrent and purely convolutional models indicates that both can be equally effective in the spatiotemporal prediction of SV PDEs. This is in alignment with the findings of Bai et al. (2018), who demonstrate that convolutional models are as effective as recurrent models in one-dimensional sequence modelling. Finally, we evaluate the generalisation of the models in situations not seen during training and indicate their shortcomings.

Code and data available at github.com/stathius/wave_propagation
Figure 1: Comparing the long-term prediction of the four models (baseline LSTM, ConvLSTM, PredRNN++, U-Net) on the test set. The Causal-LSTM (PredRNN++) and the U-Net significantly outperform the baseline LSTM model. The vertical line indicates the training horizon.
Time-step ahead     20      80
LSTM (baseline)     0.08    0.186
ConvLSTM                    0.150
PredRNN++                   0.091
U-Net                       0.071

Table 1: Root Mean Square Error (RMSE) comparison of model predictions at specific time-steps ahead. Best accuracy in bold. The standard errors are across 4 runs with different initialisations.
Figure 2: Cumulative reconstruction of the output from the feature maps of the penultimate layer of the U-Net (top) and PredRNN++ (bottom). The prediction corresponds to the 80th time-step ahead. We ordered the feature maps by the absolute value of their weight, from the most important to the least. The PredRNN++ gradually builds up its prediction. The U-Net works differently: the first few feature maps put emphasis on the boundary conditions; then some of the feature maps focus on the peaks (white colour) and others on the troughs. Combined, they build the final prediction.
2 RELATED WORK
Deep learning methods have been proposed for spatiotemporal forecasting in various fields, including the solution of PDEs. Recurrent neural networks have proven a good fit for the task, due to their innate ability to capture temporal correlations. Srivastava et al. (2015) use a convolutional encoder-decoder architecture where an LSTM module is used to propagate the latent space into the future. Variations of this technique have been successfully applied to the long-term prediction of physical systems, such as sliding objects (Ehrhardt et al., 2017) and wave propagation (Sorteberg et al., 2019). Convolutional LSTMs (ConvLSTM) use convolutions inside the LSTM cell to complement the temporal state with spatial information. Whilst initially proposed for precipitation nowcasting, ConvLSTMs were also found successful for video prediction (Shi et al., 2015). Wang et al. (2018a) proposed the PredRNN++, featuring a spatial memory that traverses the stacked cells in the network and improves the accuracy of short-term prediction over ConvLSTMs.

Feed-forward models have also been used in spatiotemporal forecasting. Mathieu et al. (2015) used a CNN to encode video frames in a latent space and extrapolated the latent vectors into the future. Tompson et al. (2017) employed CNNs to speed up the projection step in fluid flow simulations. U-Net has been used for optical flow estimation in videos (Dosovitskiy et al.) as well as in physical systems, such as sea temperature prediction (de Bezenac et al., 2017) and accelerating the simulation of the Navier-Stokes equations (Thuerey et al., 2018). While both recurrent and convolutional models have been successfully applied to the prediction of PDEs, there is a paucity of studies comparing the two categories from a representation learning point of view.

Other architectures for spatiotemporal prediction include Generative Adversarial Networks for fluid simulations (Kim et al., 2018) and Graph Networks for wind-farm power estimation (Park & Park, 2019). There is also a growing body of research on physics-inspired networks for solving PDEs (Raissi et al., 2017; Perdikaris & Yang, 2019).
Figure 3: Generalisation to different physical settings for the U-Net. The network copes well with changes to illumination and even with two drops, but cannot predict linear waves or different tank sizes well.

Time-step ahead     20      40      60      80
Test set
Opposite Illum.
Random Illum.       0.04    0.06    0.08    0.10
Double Drop         0.04    0.07    0.10    0.13
Lines               0.11    0.16    0.18    0.19
Shallow Depth       0.04    0.09    0.13    0.16
Big Tank            0.08    0.14    0.16    0.17
Small Tank          0.19    0.22    0.23    0.23

Table 2: RMSE of the U-Net across datasets at specific points in time. Performance varies across the different physical settings. The model is invariant to an orthogonal phase shift in illumination.
3 EVALUATED MODELS
Four different models are assessed in this work. Three of them are recurrent (LSTM, ConvLSTM, PredRNN++) and one is feed-forward (U-Net). A detailed description of all the implementations can be found in Section B of the Appendix. The LSTM model was specifically developed for wave propagation prediction (Sorteberg et al., 2019) and serves as a baseline on which we sought improvement. It is composed of a convolutional encoder and decoder with three LSTMs in the middle. The LSTM modules use the vector output of the encoder as an inner representation and propagate it forward in time. Each LSTM propagates a different part of the sequence (see Appendix).

The other models were selected on the basis of their applicability to relevant tasks. ConvLSTM and PredRNN++ have been empirically shown to perform well at short-term spatiotemporal prediction. The rationale for using them in long-term prediction is that the underlying physics of wave propagation does not change: if a model learns a good representation of the short-term dynamics, then the error accumulation should remain low in the long term. Both models use convolutions inside the recurrent cell to create a synergy between spatial and temporal modelling. Additionally, PredRNN++ employs a spatial memory that traverses the vertical stack to increase short-term accuracy.

The feed-forward model is based on the U-Net architecture, which has been used in spatiotemporal prediction, for example to infer optical flow (Fischer et al., 2015), motion fields (de Bezenac et al., 2017) and velocity fields (Thuerey et al., 2018). In contrast to those works, we train the network end-to-end and conditioned on its own predictions; the latter shifts the focus from short-term to long-term accuracy.
4 RESULTS

4.1 LONG TERM PREDICTION: EXTRAPOLATION IN TIME
We evaluated how well the models extrapolate in time. Given ground-truth simulations of 100 frames in length, we tested the model predictions up to 80 steps ahead, far beyond the maximum of 20-frame sequences that the models are trained upon. The RMSE at each time step is calculated as an average over all the test sequences. The results show that the baseline LSTM gives the worst performance. Its RMSE reaches 0.10 after only 21 frames, and the error rises sharply after frame 10 (Figure 1). A probable cause is the use of three distinct LSTMs, which require more data to train. The ConvLSTM offers an improvement: it reaches 0.1 RMSE only after 53 frames, and its error trend is gradual, almost linear. An even greater improvement comes from the PredRNN++, which provides a very low error over the whole prediction range. Its maximum error at frame 80 is 0.091, substantially lower than the LSTM (0.186) and the ConvLSTM (0.150) (Table 1). This confirms the findings of Wang et al. (2018b) that the PredRNN++ is more effective than the ConvLSTM. The U-Net is on par with the PredRNN++ until frame 34 but has better long-term prediction, reaching 0.071 RMSE at frame 80 versus 0.091 for the PredRNN++. The U-Net thus reduces the long-term RMSE from the baseline's 0.186 to 0.071. It is also the fastest model, providing a 241x speed-up over the numerical solver that we used (Table 6 in the Appendix).

Figure 4: Left: Qualitative comparison between the U-Net and the PredRNN++ on the test set. The intensity profile corresponds to the yellow line (the line with the highest variation). In this particular case, the PredRNN++ has missed the time constant. Right: Predictions (at time t = 80) of the U-Net on the various datasets that have not been seen during training. In the double-drop case, the model fails to accurately predict the double wave-front. For the bigger and smaller tubs, it misses the time constant.
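For concreteness, the per-time-step RMSE curves of Figure 1 can be obtained with an autoregressive rollout of the kind sketched below. This is an illustrative sketch rather than the exact evaluation code; it assumes a generic model that maps the last N frames to a chunk of future frames, with all tensors shaped (batch, time, height, width).

```python
import torch


@torch.no_grad()
def rollout_rmse(model, sequences, n_input=5, n_future=80):
    """Autoregressive rollout: feed the model its own predictions and
    report the RMSE at every time step, averaged over test sequences.

    sequences: tensor of shape (B, T, H, W) with T >= n_input + n_future.
    model:     maps (B, n_input, H, W) -> (B, m_output, H, W), where the
               output length m_output is whatever the network was trained with.
    """
    frames = sequences[:, :n_input]                  # ground-truth conditioning frames
    preds = []
    while sum(p.shape[1] for p in preds) < n_future:
        out = model(frames[:, -n_input:])            # predict the next chunk of frames
        preds.append(out)
        frames = torch.cat([frames, out], dim=1)     # refeed predictions as new input
    preds = torch.cat(preds, dim=1)[:, :n_future]
    target = sequences[:, n_input:n_input + n_future]
    se = (preds - target) ** 2
    rmse_per_step = se.mean(dim=(0, 2, 3)).sqrt()    # average over sequences and pixels
    return rmse_per_step.cpu().numpy()
```

Each model is conditioned on 5 ground-truth frames and then fed its own predictions up to 80 steps ahead, as described above.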
Qualitatively, it appears that the PredRNN++ propagates its internal representation one step at a time, while the U-Net predicts multiple frames in one pass. How the output is reconstructed in the last layer is indicative of these differences (Figure 2).

4.2 GENERALISATION: EXTRAPOLATION IN OTHER PHYSICAL SETTINGS
Here, we evaluate the capabilities and limitations of our models by testing under different initial conditions, illumination models and tank dimensions (Table 3 in the Appendix). For conciseness, we only present the results of the U-Net, but the same conclusions hold for all the models.

The U-Net is quite robust to changes in illumination. The RMSE for the opposite illumination angle is indistinguishable from that on the original test set (Figure 3 and Table 2). This indicates that the learned representation is invariant to a perpendicular phase shift in lighting conditions. Propagation of linear waves appears to be more challenging: the RMSE exceeds 0.10 after just 12 frames. The visualisation shows how the morphology of the prediction is qualitatively different, containing circular artefacts reminiscent of the training data (Figure 4). When two drops are used, the RMSE is fairly low, but the two wave-fronts of the predictions are sometimes blurred. We also varied the tank size to study the effect of wave speed. Both cases are challenging, with the smaller tank size, or equivalently faster waves, exceeding 0.10 RMSE after just 5 frames. The predictions in Figure 4 demonstrate how the network miscalculates the wave speed: its predictions are either faster or slower than the ground truth. Note that direct comparisons between datasets based on the RMSE are not without shortcomings. Each dataset has its own inherent "variation" which affects the RMSE, e.g. waves move faster in a small tank (see Figure 11 in the Appendix for a discussion).

5 CONCLUSIONS AND FUTURE WORK
In this work we investigated the use of deep networks for approximating wave propagation. Using a U-Net architecture, we reduced the long-term approximation RMSE to 0.071 against the previous baseline of 0.186. At the same time, the U-Net is 241x faster than the simulation. Our results suggest that the U-Net outperforms state-of-the-art recurrent models. It is unclear why U-Net models perform so well in this task. It has been demonstrated that convolutional networks are effective at modelling one-dimensional temporal sequences (Bai et al., 2018); the same might be true for higher-dimensional data. Furthermore, the simulated data are based on few-step solvers; in such a case, memory modules may not offer a significant advantage. Lastly, we extensively assessed how the networks generalise to unseen physical settings and pointed out current limitations.

In the future, we aim to introduce noise into the simulation so that the system becomes stochastic. It would be interesting to see whether, in this case, the recurrent models learn the dynamics better than the U-Net. A major shortcoming of the current models is generalisation to other physical settings. We plan to address this with a physics-inspired latent-space factorisation and meta-learning.

REFERENCES
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. 2019. URL http://arxiv.org/abs/1903.10563.

Nicolas Cellier. Scikit-fdiff / skfdiff. https://gitlab.com/celliern/scikit-fdiff/, 2019. [Online; accessed 11-8-2019].

Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. arXiv preprint arXiv:1711.07970, 2017.

Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philipp Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. Technical report.

Sebastien Ehrhardt, Aron Monszpart, Niloy J. Mitra, and Andrea Vedaldi. Learning a physical long-term predictor. 2017. URL http://arxiv.org/abs/1703.00247.

Mehmet Ersoy, Omar Lakkis, and Philip Townsend. A Saint-Venant shallow water model for overland flows with precipitation and recharge, 2017.

Philipp Fischer et al. FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.

Byungsoo Kim, Vinicius C. Azevedo, Nils Thuerey, Theodore Kim, Markus Gross, and Barbara Solenthaler. Deep Fluids: A generative network for parameterized fluid simulations. 2018. doi: 10.1111/cgf.13619. URL http://arxiv.org/abs/1806.02071.

Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.

Roger Moussa and Claude Bocquillon. Approximation zones of the Saint-Venant equations for flood routing with overbank flow. Hydrology and Earth System Sciences Discussions, 4(2):251–260, 2000.

Junyoung Park and Jinkyoo Park. Physics-induced graph neural network: An application to wind-farm power estimation. Energy, 187:115883, 2019. ISSN 03605442. doi: 10.1016/j.energy.2019.115883.

Paris Perdikaris and Yibo Yang. Modeling stochastic systems using physics-informed deep generative models. 2019.

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations. 2017. URL http://arxiv.org/abs/1711.10561.

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. 2015. URL http://arxiv.org/abs/1506.04214.

Wilhelm E. Sorteberg, Stef Garasto, Chris C. Cantwell, and Anil A. Bharath. Approximating the solution of surface wave propagation using deep neural networks. In INNS Big Data and Deep Learning, 2019. doi: 10.1007/978-3-030-16841-4_26.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852, 2015.

Nils Thuerey, Konstantin Weissenow, Lukas Prantl, and Xiangyu Hu. Deep learning methods for Reynolds-averaged Navier-Stokes simulations of airfoil flows. 2018. URL http://arxiv.org/abs/1810.08217.

Jonathan Tompson, Kristofer Schlachter, Pablo Sprechmann, and Ken Perlin. Accelerating Eulerian fluid simulation with convolutional networks. In ICML, pp. 3424–3433, 2017.

Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. 2018a. URL http://arxiv.org/abs/1804.06300.

Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv preprint arXiv:1804.06300, 2018b.

APPENDIX
A DATASETS
The datasets were created by simulating the Saint-Venant equations:

h_t + ((H + h)u)_x + ((H + h)v)_y = 0,
u_t + u u_x + v u_y + g h_x - \nu (u_{xx} + u_{yy}) = 0,        (1)
v_t + u v_x + v v_y + g h_y - \nu (v_{xx} + v_{yy}) = 0.

The package triflow (Cellier, 2019) was used for the simulation. The Coriolis force term was neglected, the kinematic viscosity was set to 10^-6 m^2/s, which is close to the viscosity of water at 20 °C, the height H is set to 10 m, and the size of the tank is randomly selected in each simulation between 10 and 20 m. The initial wave excitation is in the form of a Gaussian droplet at a random location. For rendering, we used a fixed lighting azimuth and altitude. Each sequence is 100 steps long and the time step is 0.01 s. In total, 3,000 sequences were rendered. The frames were subsequently re-sampled down to 128 x 128 pixels. The generalisation datasets were created with the same method by varying the physical properties of the simulation (Table 3).

We also used image normalisation, which is known to improve performance on image prediction tasks. Normalising the pixel values to zero mean and unit standard deviation worked best for us. Note that the normalising values are computed from the training set alone and applied to the validation and test sets. Data augmentation techniques such as horizontal and vertical flips were employed on a per-sequence basis. From the 3,000 sequences of the original dataset, 70% were used for training, 15% for validation and 15% for testing.

Table 3: The original dataset was used for training, model selection and evaluation. Models were also trained with the fixed tank dataset to study the effect of tank size. All the other datasets were used to evaluate the generalisation capabilities of the models. (Columns: Dataset Name, Initial Condition, Height (m), Tank Size (m), Illumination Azimuth, Sequences; the training/validation/test dataset uses a droplet initial condition, height 10 m and tank sizes in [10, 20] m.)
B MODELS
B.1 LSTM

The encoder consists of 4 convolutional layers with 60, 120, 240 and 480 feature maps, kernel sizes 7, 3, 3, 3 and padding of 2, 1, 1, 1 pixels. Dimensionality reduction is achieved by using a kernel stride of 2 in all layers. After each convolutional layer there is a batch normalisation layer, a tanh non-linearity and dropout of a fraction of the units, chosen randomly in each pass. The final part of the encoder is a fully connected layer of width L = 1000. This is the latent vector input to the three LSTMs. One LSTM is used for the first input, the second LSTM for predicting the 10th frame (midway) and the third LSTM for all the other frames. The decoder is based on deconvolutions that double the spatial dimensions of the feature maps in each layer until the original 128 x 128 size is reached. It mirrors the encoder in terms of feature-map size, while the kernel size is 3, the padding 1 and the stride 2 for all layers. Figure 6 depicts the architecture.
Figure 5: Schematic of the encoder and decoder used in the LSTM model (Sorteberg et al., 2019). Dimensions of the layers are shown on the left and top, and the number of channels at the bottom.
Figure 6: The LSTM model (Sorteberg et al., 2019) unrolled through time. Three LSTMs share the convolutional encoder and decoder, and the output is re-fed as the next input.
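For illustration, a condensed PyTorch sketch of the encoder-LSTM-decoder pattern is given below. The channel counts and latent width follow the description above, but the kernel sizes are simplified and a single LSTM stands in for the three used in the actual model, so this is a rough sketch rather than the exact baseline.

```python
import torch
import torch.nn as nn


class EncoderLSTMDecoder(nn.Module):
    """Simplified baseline: a convolutional encoder compresses each frame to a
    latent vector, an LSTM propagates it in time, and a deconvolutional
    decoder maps the last latent state back to a 128x128 frame."""
    def __init__(self, latent=1000):
        super().__init__()
        chans = [1, 60, 120, 240, 480]              # feature maps from the text
        enc = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                    nn.BatchNorm2d(cout), nn.Tanh()]
        self.encoder = nn.Sequential(*enc, nn.Flatten(),
                                     nn.Linear(480 * 8 * 8, latent))
        self.lstm = nn.LSTM(latent, latent, batch_first=True)
        self.fc = nn.Linear(latent, 480 * 8 * 8)
        dec = []
        for cin, cout in zip(chans[::-1][:-1], chans[::-1][1:]):
            dec += [nn.ConvTranspose2d(cin, cout, kernel_size=3, stride=2,
                                       padding=1, output_padding=1), nn.Tanh()]
        self.decoder = nn.Sequential(*dec)

    def forward(self, frames):                      # frames: (B, N, 1, 128, 128)
        b, n = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(b, n, -1)
        z, _ = self.lstm(z)                         # propagate the latent in time
        out = self.fc(z[:, -1]).view(b, 480, 8, 8)
        return self.decoder(out)                    # next-frame prediction
```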
B.2 CONVLSTM

Our architecture uses a stack of 3 ConvLSTM cells. Initially, a convolutional encoder with 8, 64 and 192 feature maps respectively reduces the spatial dimensions to 16 x 16. All layers have kernels of size 3, zero padding of width 1 and Leaky ReLU non-linearities with slope 0.2. A stride of 2 pixels is used to reduce the dimensionality. At the final layer, the input is represented by a 16 x 16 x 192 tensor. Inside the ConvLSTMs we use kernels of size 3 and zero padding of 1 pixel to avoid dimensionality reduction. The decoder uses deconvolutions with stride 2 to double the pixel dimensions in each layer.
Figure 7: ConvLSTM model. The encoder processes the N=5 input frames one at a time to create an internal representation. The representation is copied over to the forecaster, which uses it to generate M=10 future frames. The feature-map dimensions are shown next to each layer.
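A minimal sketch of a single ConvLSTM cell is shown below; the actual model stacks three such cells between the convolutional encoder and the deconvolutional forecaster of Figure 7. The interface and default sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: the four LSTM gates are computed with
    convolutions so the hidden and cell states keep their spatial layout."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden and cell state maps
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # convolutional memory update
        h = o * torch.tanh(c)
        return h, (h, c)

    def init_state(self, batch, height, width, device):
        z = torch.zeros(batch, self.hid_ch, height, width, device=device)
        return (z, z.clone())
```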
B.3 PREDRNN++

The unfolding of the model through time is presented in Figure 8. The vertical stack is comprised of one convolutional, one max-pooling and four PredRNN++ layers. The convolutional layer has a kernel of size 3, no padding and outputs 8 feature maps. The original paper does not use any dimensionality reduction because its per-frame input dimensions are small. Our input dimensions (128 x 128) are too big to fit in the available GPU memory, so we used max-pooling with stride 4 to reduce them to 32 x 32 pixels. Following the original paper, we used 4 PredRNN++ layers but reduced all of them to 64 channels each to meet hardware memory constraints. We used convolutional kernels of size 3. The forecaster uses a deconvolutional layer with kernel size 7 and stride 4 to restore the internal state to the original dimensions.
Figure 8: PredRNN++ model. The encoder processes the N=5 input frames one at a time and an internal representation is created in each PredRNN++ cell. These representations are copied over to the forecaster, which uses them to generate M=20 future frames. The feature-map dimensions are shown next to each layer.
B.4 U-NET

The encoder is composed of four blocks, each containing two convolutional layers with kernel size 3 and padding 1, followed by ReLU non-linearities. The first three blocks include a max-pooling layer of stride 2 that halves the spatial dimensions. The number of feature maps doubles in each layer. For the expanding part, we use bilinear interpolation with scale factor 2 instead of deconvolutions to keep the number of parameters low. Skip connections are also employed to copy feature maps from earlier layers but, contrary to the original paper, we do not reduce the dimensions of the copied feature maps. This way, high-level, coarser feature maps are combined with the fine-grained local information of lower layers over the whole domain. The network architecture can be seen in Figure 9.
Figure 9: Schematic of the U-Net model. The number of channels is shown below each layer and its dimensions on the side. Our model has N=5 input and M=20 output frames.
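A compact sketch of this U-Net variant is given below. The number of blocks, the bilinear upsampling and the full-resolution skip connections follow the description above; the base channel count (64) and the 1x1 output convolution are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def block(cin, cout):
    """Two 3x3 convolutions with ReLU, as used in every U-Net block."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))


class UNet(nn.Module):
    """Simplified U-Net: N input frames stacked as channels, M output frames.
    Upsampling uses bilinear interpolation and the skip connections
    concatenate the full encoder feature maps."""
    def __init__(self, n_in=5, m_out=20, base=64):   # base=64 is an assumption
        super().__init__()
        self.enc1 = block(n_in, base)
        self.enc2 = block(base, base * 2)
        self.enc3 = block(base * 2, base * 4)
        self.bottom = block(base * 4, base * 8)
        self.dec3 = block(base * 8 + base * 4, base * 4)
        self.dec2 = block(base * 4 + base * 2, base * 2)
        self.dec1 = block(base * 2 + base, base)
        self.out = nn.Conv2d(base, m_out, kernel_size=1)

    def forward(self, x):                             # x: (B, N, 128, 128)
        e1 = self.enc1(x)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        e3 = self.enc3(F.max_pool2d(e2, 2))
        b = self.bottom(F.max_pool2d(e3, 2))
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        d3 = self.dec3(torch.cat([up(b), e3], dim=1))
        d2 = self.dec2(torch.cat([up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([up(d2), e1], dim=1))
        return self.out(d1)                           # (B, M, 128, 128)
```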
C HYPERPARAMETERS
Let N be the number of input frames and M the number of output frames of the model. For each training iteration, we randomly select K sub-sequences of length N + M from each simulated sequence. The models were trained to minimise the MSE over their respective output length M. In each iteration, the weights are updated using an Adam optimiser, while a scheduling scheme reduces the learning rate (LR) by a constant scaling factor if there is no improvement in validation error after a given number of epochs (patience). The hyperparameters of interest are the input length N, the training output length M, the number of samples per sequence between weight updates K, the batch size b, the LR and the patience p. Grid search was used to find the best set of hyperparameters for each model. The training budget was 24 hours. To obtain an arbitrarily long prediction, we feed the output back as the next input. The goal is to obtain networks with a low error in long-term prediction, so during model selection we chose the hyperparameters that gave the lowest validation error over 50 frames regardless of the output size M of the model. The final hyperparameters and model sizes can be found in Tables 4 and 5.
Model     Input Length   Output Length   Samples per Sequence   Batch Size   Learning Rate   Patience
LSTM      5              20              10                     16

Table 4: Hyper-parameters of the best performing model for each architecture.
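The training procedure can be summarised with the following sketch. It assumes generic train_loader/val_loader objects that yield (input, target) sub-sequences of lengths N and M, and uses placeholder values for the learning-rate scaling factor and patience; it is an outline of the procedure rather than the exact training script.

```python
import torch
import torch.nn as nn


def train(model, train_loader, val_loader, epochs=100, lr=1e-4, patience=10):
    """Minimise the MSE over the model's M output frames; reduce the learning
    rate when the validation error plateaus (placeholder factor of 0.1)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1,
                                                       patience=patience)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        for inputs, targets in train_loader:          # K random sub-sequences
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        sched.step(val)                               # LR scheduling on validation error
```

Model selection over 50-frame rollouts would then reuse an autoregressive rollout of the kind sketched in Section 4.1, feeding the output back as the next input.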
D MODEL SIZE AND SPEED
Models were implemented in PyTorch and the code is publicly available on GitHub. Models were trained on a GTX 1060 GPU with 6 GB of memory. The total training time includes the evaluation overhead.
Model        Trainable Parameters   Epoch Time   Num. Epochs   Best Epoch   Total Training Time
LSTM         88.2M                  12m          75            71           24h
ConvLSTM     12.3M                  36m          24            18           24h
PredRNN++    2.5M                   33m          43            36           24h
U-Net        7.8M                   8m           171           166          24h

Table 5: Model size and training times.
Method                 Time per frame (ms)   Speed-up
Numerical simulator    630.7                 -
LSTM                   15.0                  40x
ConvLSTM               4.5                   141x
PredRNN++              9.2                   68x
U-Net                  2.6                   241x

Table 6: Time taken to compute one frame. Deep learning approximations offer a significant speed-up over numerical simulations.
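The per-frame timings in Table 6 could be measured along the lines of the sketch below, which rolls the model out autoregressively and averages the wall-clock time per predicted frame. The model interface and warm-up count are assumptions.

```python
import time
import torch


@torch.no_grad()
def time_per_frame(model, frames, n_frames=80, warmup=5):
    """Rough wall-clock time per predicted frame (milliseconds).
    frames: (B, N, H, W) conditioning input; model returns (B, M, H, W)."""
    for _ in range(warmup):                   # warm-up to exclude start-up costs
        model(frames)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    produced = 0
    while produced < n_frames:
        out = model(frames)
        produced += out.shape[1]
        frames = torch.cat([frames, out], dim=1)[:, -frames.shape[1]:]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - start) / n_frames
```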
E RESULTS ADDENDUM

Figure 10: Prediction roll-out from the U-Net and PredRNN++ at t = 10, 20, 30, 40, 60 and 80, compared with the ground truth. Both sequences are from the test set.
Figure 11: Generalisation of the U-Net and PredRNN++ to different physical settings. On the left, linear waves: the networks introduce circular patterns where they do not exist. On the right, the smaller tank, where waves are faster: both models miss the time constant and are slower than the ground truth.
Figure 12: Flat Image (left) and Previous Frame (right) baselines compared against the ground truth across the different datasets. The Flat Image baseline measures how far each frame is from the reference height. The Previous Frame baseline indicates how much consecutive frames differ. Datasets are not equally hard to predict, and this should be taken into account when assessing the generalisation capacity of a model.
E.1 PREDICTING THE TANK SIZE FROM THE LATENT SPACE
Here we check whether the trained U-Net acquired any understanding of the physical properties of the system. We focus on the tank size, or inversely the speed of the wave, for two reasons. First, the U-Net failed to extrapolate to different tank sizes, and this experiment could provide some insight into why this failure happens. Secondly, tank-size information is readily available: each dataset sequence corresponds to a different tank size, and the tank is always square. In the training and testing dataset the tank size is s_i in [10, 20] metres. For the smaller tank we used s_i in [5, 10] and for the bigger s_i in [20, 40] metres.

The question we try to answer is: does the latent representation of the U-Net capture the tank-size information s_i? We take the pre-trained encoder from the U-Net and add some additional layers so that the output is a single number (Figure 13). The system is trained to predict the tank size when given 5 consecutive frames. Only the additional part is updated during training; the weights of the encoder are kept frozen. We compare the pre-trained encoder against a randomly initialised encoder. We also compare the models to a dummy regressor that always predicts the mean tank size for each dataset, i.e. 15 for the test set, 7.5 for the small tank and 30 for the big tank. The results in Table 7 indicate that the pre-trained encoder can be used to extract the tank size with relatively low error (0.14), while the random encoder gives a much higher error of 2.27, only slightly lower than the dummy regressor (2.45). This indicates that the pre-trained encoder encapsulates physically relevant information relating to the tank size. When it comes to the bigger and smaller tanks, both the pre-trained and the random encoders fail to extrapolate and give errors higher than the dummy regressor.

Figure 13: Schematic of the model used for tank size prediction.
                      Test set   Bigger Tank   Smaller Tank
Pre-trained encoder   0.14
Random encoder        2.27
Dummy regressor       2.45

Table 7: Predicting the tank size from the U-Net's latent space.
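A minimal sketch of this probing setup is shown below. It assumes the pre-trained U-Net encoder is available as a module returning its final feature maps; the regression head (global average pooling followed by a small MLP) is an illustrative assumption, not the exact additional layers of Figure 13.

```python
import torch
import torch.nn as nn


class TankSizeProbe(nn.Module):
    """Frozen U-Net encoder followed by a small trainable regression head
    that predicts the tank size from 5 consecutive input frames."""
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                  # keep the encoder weights frozen
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.LazyLinear(64), nn.ReLU(),
                                  nn.Linear(64, 1))  # single output: tank size (m)

    def forward(self, frames):                       # frames: (B, 5, 128, 128)
        with torch.no_grad():
            feats = self.encoder(frames)             # latent feature maps
        return self.head(feats).squeeze(1)
```

Only the head is optimised (e.g. with MSE against the known tank size of each sequence), which mirrors the frozen-encoder comparison described above.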