Comparing recurrent and convolutional neural networks for predicting wave propagation
Stathi Fotiadis, Eduardo Pignatelli, Mario Lino Valencia, Chris Cantwell, Amos Storkey, Anil A. Bharath
Stathi Fotiadis, Eduardo Pignatelli, Anil A. Bharath
Department of Bioengineering, Imperial College London

Mario Lino Valencia, Chris D. Cantwell
Department of Aeronautics, Imperial College London

Amos Storkey
Institute for Adaptive and Neural Computation, The University of Edinburgh

ABSTRACT
Dynamical systems can be modelled by partial differential equations, and a need for their numerical solution appears in many areas of science and engineering. In this work, we investigate the performance of recurrent and convolutional deep neural network architectures to predict the propagation of surface waves governed by the Saint-Venant equations. We improve on the long-term prediction over previous methods while keeping the inference time at a fraction of that of numerical simulations. We also show that convolutional networks perform at least as well as recurrent networks in this task. Finally, we assess the generalisation capability of each network by extrapolating for longer times and in different physical settings.

1 INTRODUCTION
Many physical systems in science and engineering are described by partial differential equations (PDEs). This study investigates the performance of recurrent and convolutional deep neural networks to model such phenomena. Accurately predicting the evolution of such systems is usually done through numerical simulations, a task that requires significant computational resources. Simulations usually need extensive tuning and need to be re-run from scratch even for small variations in the parameters. With their potential to learn hierarchical representations, deep learning techniques have emerged as an alternative to numerical solvers, offering a desirable balance between accuracy and computational cost (Carleo et al., 2019).

Here, we focus on the modelling of surface wave propagation governed by the Saint-Venant (SV) equations. This phenomenon offers a good test-bed for controlled analyses of two-dimensional sequence prediction of PDEs for several reasons. First, in contrast to some physical systems, such as fluid flow, the evolution of the real system is unlikely to enter chaotic regimes. From a representation learning point of view, this makes model training and assessment relatively straightforward. Despite this, the SV equations are strongly related to the Navier-Stokes equations, widely used in computational fluid dynamics. Further, computational modelling of surface waves is used in seismology, computer animation, in predictions of surface runoff from rainfall – a critical aspect of the water cycle (Moussa & Bocquillon, 2000) – and in flood modelling (Ersoy et al., 2017).

This study provides three contributions. First, we identify three relevant architectures for spatiotemporal prediction. Two of these architectures lead to improved accuracy in long-term prediction over previous attempts (Sorteberg et al., 2019) while keeping the inference time orders of magnitude smaller than typical solvers. Secondly, our comparison between recurrent and purely convolutional models indicates that both can be equally effective in the spatiotemporal prediction of SV PDEs. This is in alignment with the findings of Bai et al. (2018), who demonstrate that convolutional models are as effective as recurrent models in one-dimensional sequence modelling. Finally, we evaluate the generalisation of the models in situations not seen during training and indicate their shortcomings.

Code and data available at github.com/stathius/wave_propagation
Figure 1: Comparing the long-term prediction of the four models (baseline LSTM, ConvLSTM, PredRNN++, U-Net) on the test set. The Causal-LSTM (PredRNN++) and the U-Net significantly outperform the baseline LSTM model. The vertical line indicates the training horizon.
Time-step ahead     20      80
LSTM (baseline)     0.08    0.186
ConvLSTM                    0.150
PredRNN++                   0.091
U-Net                       0.071

Table 1: Root Mean Square Error (RMSE) comparison of model predictions at specific time-steps ahead. Best accuracy in bold. The standard errors are across 4 runs with different initialisations.
Figure 2: Cumulative reconstruction of the output from the feature maps of the penultimate layer of the U-Net (top) and PredRNN++ (bottom). The prediction corresponds to the 80th time-step ahead. We ordered the feature maps by the absolute value of their weight, from the most important to the least. The PredRNN++ gradually builds up its prediction. The U-Net works differently: the first few feature maps put emphasis on the boundary conditions; then some of the feature maps focus on the peaks (white colour) and others on the troughs. Combined, they build the final prediction.
2 RELATED WORK
Deep learning methods have been proposed for spatiotemporal forecasting in various fields, including the solution of PDEs. Recurrent neural networks have proven a good fit for the task, due to their innate ability to capture temporal correlations. Srivastava et al. (2015) use a convolutional encoder-decoder architecture where an LSTM module is used to propagate the latent space into the future. Variations of this technique have been successfully applied to the long-term prediction of physical systems, such as sliding objects (Ehrhardt et al., 2017) and wave propagation (Sorteberg et al., 2019). Convolutional LSTMs (ConvLSTM) use convolutions inside the LSTM cell to complement the temporal state with spatial information. Whilst initially proposed for precipitation nowcasting, ConvLSTMs were also found successful for video prediction (Shi et al., 2015). Wang et al. (2018a) proposed the PredRNN++, featuring a spatial memory that traverses the stacked cells in the network and improves the accuracy of short-term prediction over ConvLSTMs.

Feed-forward models have also been used in spatiotemporal forecasting. Mathieu et al. (2015) used a CNN to encode video frames in a latent space and extrapolated the latent vectors into the future. Tompson et al. (2017) employed CNNs to speed up the projection step in fluid flow simulations. U-Net has been used for optical flow estimation in videos (Dosovitskiy et al.) as well as in physical systems, such as sea temperature prediction (de Bezenac et al., 2017) and accelerating the simulation of the Navier-Stokes equations (Thuerey et al., 2018). While both recurrent and convolutional models have been successfully applied to the prediction of PDEs, there is a paucity of studies comparing the two categories from a representation learning point of view.

Other architectures for spatiotemporal prediction include Generative Adversarial Networks for fluid simulations (Kim et al., 2018) and Graph Networks for wind-farm power estimation (Park & Park, 2019). There is also a growing body of research on physics-inspired networks for solving PDEs (Raissi et al., 2017; Perdikaris & Yang, 2019).
Figure 3: Generalisation to different physical settings for the U-Net. The network copes well with changes to illumination and even with two drops, but cannot predict linear waves or different tank sizes well.

Time-step ahead     20      40      60      80
Test set
Opposite Illum.
Random Illum.       0.04    0.06    0.08    0.10
Double Drop         0.04    0.07    0.10    0.13
Lines               0.11    0.16    0.18    0.19
Shallow Depth       0.04    0.09    0.13    0.16
Big Tank            0.08    0.14    0.16    0.17
Small Tank          0.19    0.22    0.23    0.23

Table 2: RMSE of the U-Net across datasets at specific points in time. Performance varies across the different physical settings. The model is invariant to an orthogonal phase shift in illumination.
3 EVALUATED MODELS
Four different models are assessed in this work. Three of them are recurrent (LSTM, ConvLSTM, PredRNN++) and one is feed-forward (U-Net). A detailed description of all the implementations can be found in Section B of the Appendix. The LSTM model was specifically developed for wave propagation prediction (Sorteberg et al., 2019) and serves as a baseline on which we sought improvement. It is composed of a convolutional encoder and decoder with three LSTMs in the middle. The LSTM modules use the vector output of the encoder as an inner representation and propagate it forward in time. Each LSTM propagates a different part of the sequence (see Appendix).

The other models were selected on the basis of their applicability to relevant tasks. ConvLSTM and PredRNN++ have been empirically shown to perform well at short-term spatiotemporal prediction. The rationale for using them in long-term prediction is that the underlying physics of wave propagation does not change: if a model learns a good representation of the short-term dynamics, then the error accumulation should remain low in the long term. Both models use convolutions inside the recurrent cell to create a synergy between spatial and temporal modelling. Additionally, PredRNN++ employs a spatial memory that traverses the vertical stack to increase short-term accuracy.

The feed-forward model is based on the U-Net architecture, which has been used in spatiotemporal prediction, for example to infer optical flow (Fischer et al., 2015), motion fields (de Bezenac et al., 2017) and velocity fields (Thuerey et al., 2018). In contrast to those works, we train the network end-to-end and conditioned on its own predictions; the latter shifts the focus from short-term to long-term accuracy.
4 RESULTS

4.1 LONG TERM PREDICTION: EXTRAPOLATION IN TIME
We evaluated how well the models extrapolate in time. Given ground-truth simulations of 100 frames in length, we tested the model predictions up to 80 steps ahead, far beyond the maximum of 20-frame sequences that the models are trained upon. The RMSE at each time step is calculated as an average over all the test sequences. The results show that the baseline LSTM gives the worst performance. Its RMSE reaches 0.10 after only 21 frames, and the error rises sharply after frame 10 (Figure 1). A probable cause is the use of three distinct LSTMs, which require more data to train. The ConvLSTM offers an improvement: it reaches 0.1 RMSE only after 53 frames, and its error trend is gradual, almost linear. An even greater improvement comes from the PredRNN++, which provides a very low error over the whole prediction range. Its maximum error at frame 80 is 0.091, substantially lower than the LSTM (0.186) and the ConvLSTM (0.150) (Table 1). This confirms the findings of Wang et al. (2018b) that the PredRNN++ is more effective than the ConvLSTM. The U-Net is on par with the PredRNN++ until frame 34 but has better long-term prediction, reaching 0.071 RMSE at frame 80 versus 0.091 for the PredRNN++. The U-Net thus reduces the long-term RMSE from the baseline's 0.186 to 0.071. It is also the fastest model, providing a 241x speed-up over the numerical solver that we used (Table 6 in the Appendix).

Figure 4: Left: Qualitative comparison between the U-Net and the PredRNN++ on the test set. The intensity profile corresponds to the yellow line (the line with the highest variation). In this particular case, the PredRNN++ has missed the time constant. Right: Predictions (at time t = 80) of the U-Net on the various datasets that have not been seen during training. In the double-drop case, the model fails to accurately predict the double wave-front. For the bigger and smaller tubs, it misses the time constant.
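For concreteness, the per-time-step RMSE curves of Figure 1 can be obtained with an autoregressive rollout of the kind sketched below. This is an illustrative sketch rather than the exact evaluation code; it assumes a generic model that maps the last N frames to a chunk of future frames, with all tensors shaped (batch, time, height, width).

```python
import torch


@torch.no_grad()
def rollout_rmse(model, sequences, n_input=5, n_future=80):
    """Autoregressive rollout: feed the model its own predictions and
    report the RMSE at every time step, averaged over test sequences.

    sequences: tensor of shape (B, T, H, W) with T >= n_input + n_future.
    model:     maps (B, n_input, H, W) -> (B, m_output, H, W), where the
               output length m_output is whatever the network was trained with.
    """
    frames = sequences[:, :n_input]                  # ground-truth conditioning frames
    preds = []
    while sum(p.shape[1] for p in preds) < n_future:
        out = model(frames[:, -n_input:])            # predict the next chunk of frames
        preds.append(out)
        frames = torch.cat([frames, out], dim=1)     # refeed predictions as new input
    preds = torch.cat(preds, dim=1)[:, :n_future]
    target = sequences[:, n_input:n_input + n_future]
    se = (preds - target) ** 2
    rmse_per_step = se.mean(dim=(0, 2, 3)).sqrt()    # average over sequences and pixels
    return rmse_per_step.cpu().numpy()
```

Each model is conditioned on 5 ground-truth frames and then fed its own predictions up to 80 steps ahead, as described above.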
Qualitatively, it appears that the PredRNN++ propagates its internal representation one step at a time, while the U-Net predicts multiple frames in one pass. How the output is reconstructed in the last layer is indicative of these differences (Figure 2).

4.2 GENERALISATION: EXTRAPOLATION IN OTHER PHYSICAL SETTINGS
Here, we evaluate the capabilities and limitations of our models by testing under different initial conditions, illumination models and tank dimensions (Table 3 in the Appendix). For conciseness, we only present the results of the U-Net, but the same conclusions hold for all the models.

The U-Net is quite robust to changes in illumination. The RMSE for the opposite illumination angle is indistinguishable from that on the original test set (Figure 3 and Table 2). This indicates that the learned representation is invariant to a perpendicular phase shift in lighting conditions. Propagation of linear waves appears to be more challenging: the RMSE exceeds 0.10 after just 12 frames. The visualisation shows how the morphology of the prediction is qualitatively different, containing circular artefacts reminiscent of the training data (Figure 4). When two drops are used, the RMSE is fairly low, but the two wave-fronts of the predictions are sometimes blurred. We also varied the tank size to study the effect of wave speed. Both cases are challenging, with the smaller tank size, or equivalently faster waves, exceeding 0.10 RMSE after just 5 frames. The predictions in Figure 4 demonstrate how the network miscalculates the wave speed: its predictions are either faster or slower than the ground truth. Note that direct comparisons between datasets based on the RMSE are not without shortcomings. Each dataset has its own inherent "variation" which affects the RMSE, e.g. waves move faster in a small tank (see Figure 11 in the Appendix for a discussion).

5 CONCLUSIONS AND FUTURE WORK
In this work we investigated the use of deep networks for approximating wave propagation. Using a U-Net architecture, we reduced the long-term approximation RMSE to 0.071 against the previous baseline of 0.186. At the same time, the U-Net is 241x faster than the simulation. Our results suggest that the U-Net outperforms state-of-the-art recurrent models. It is unclear why U-Net models perform so well in this task. It has been demonstrated that convolutional networks are effective at modelling one-dimensional temporal sequences (Bai et al., 2018); the same might be true for higher-dimensional data. Furthermore, the simulated data are based on few-step solvers; in such a case, memory modules may not offer a significant advantage. Lastly, we extensively assessed how the networks generalise to unseen physical settings and pointed out current limitations.

In the future, we aim to introduce noise into the simulation so that the system becomes stochastic. It would be interesting to see whether, in this case, the recurrent models learn the dynamics better than the U-Net. A major shortcoming of the current models is generalisation to other physical settings. We plan to address this with a physics-inspired latent-space factorisation and meta-learning.

REFERENCES
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. 2019. URL http://arxiv.org/abs/1903.10563.

Nicolas Cellier. Scikit-fdiff / skfdiff. https://gitlab.com/celliern/scikit-fdiff/, 2019. [Online; accessed 11-8-2019].

Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. arXiv preprint arXiv:1711.07970, 2017.

Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philipp Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. Technical report.

Sebastien Ehrhardt, Aron Monszpart, Niloy J. Mitra, and Andrea Vedaldi. Learning a physical long-term predictor. 2017. URL http://arxiv.org/abs/1703.00247.

Mehmet Ersoy, Omar Lakkis, and Philip Townsend. A Saint-Venant shallow water model for overland flows with precipitation and recharge, 2017.

Philipp Fischer et al. FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.

Byungsoo Kim, Vinicius C. Azevedo, Nils Thuerey, Theodore Kim, Markus Gross, and Barbara Solenthaler. Deep Fluids: A generative network for parameterized fluid simulations. 2018. doi: 10.1111/cgf.13619. URL http://arxiv.org/abs/1806.02071.

Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.

Roger Moussa and Claude Bocquillon. Approximation zones of the Saint-Venant equations for flood routing with overbank flow. Hydrology and Earth System Sciences Discussions, 4(2):251–260, 2000.

Junyoung Park and Jinkyoo Park. Physics-induced graph neural network: An application to wind-farm power estimation. Energy, 187:115883, 2019. ISSN 03605442. doi: 10.1016/j.energy.2019.115883.

Paris Perdikaris and Yibo Yang. Modeling stochastic systems using physics-informed deep generative models. 2019.

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations. 2017. URL http://arxiv.org/abs/1711.10561.

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. 2015. URL http://arxiv.org/abs/1506.04214.

Wilhelm E. Sorteberg, Stef Garasto, Chris C. Cantwell, and Anil A. Bharath. Approximating the solution of surface wave propagation using deep neural networks. In INNS Big Data and Deep Learning, 2019. doi: 10.1007/978-3-030-16841-4_26.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852, 2015.

Nils Thuerey, Konstantin Weissenow, Lukas Prantl, and Xiangyu Hu. Deep learning methods for Reynolds-averaged Navier-Stokes simulations of airfoil flows. 2018. URL http://arxiv.org/abs/1810.08217.

Jonathan Tompson, Kristofer Schlachter, Pablo Sprechmann, and Ken Perlin. Accelerating Eulerian fluid simulation with convolutional networks. In ICML, pp. 3424–3433, 2017.

Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. 2018a. URL http://arxiv.org/abs/1804.06300.

Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv preprint arXiv:1804.06300, 2018b.

APPENDIX
A DATASETS
The datasets were created by simulating the Saint-Venant equations:

h_t + ((H + h)u)_x + ((H + h)v)_y = 0,
u_t + u u_x + v u_y + g h_x - \nu (u_{xx} + u_{yy}) = 0,        (1)
v_t + u v_x + v v_y + g h_y - \nu (v_{xx} + v_{yy}) = 0.

The package triflow (Cellier, 2019) was used for the simulation. The Coriolis force term was neglected, the kinematic viscosity was set to 10^-6 m^2/s, which is close to the viscosity of water at 20 °C, the height H is set to 10 m, and the size of the tank is randomly selected in each simulation between 10 and 20 m. The initial wave excitation is in the form of a Gaussian droplet at a random location. For rendering, we used a fixed lighting azimuth and altitude. Each sequence is 100 steps long and the time step is 0.01 s. In total, 3,000 sequences were rendered. The frames were subsequently re-sampled down to 128 x 128 pixels. The generalisation datasets were created with the same method by varying the physical properties of the simulation (Table 3).

We also used image normalisation, which is known to improve performance on image prediction tasks. Normalising the pixel values to zero mean and unit standard deviation worked best for us. Note that the normalising values are computed from the training set alone and applied to the validation and test sets. Data augmentation techniques such as horizontal and vertical flips were employed on a per-sequence basis. From the 3,000 sequences of the original dataset, 70% were used for training, 15% for validation and 15% for testing.

Table 3: The original dataset was used for training, model selection and evaluation. Models were also trained with the fixed tank dataset to study the effect of tank size. All the other datasets were used to evaluate the generalisation capabilities of the models. (Columns: Dataset Name, Initial Condition, Height (m), Tank Size (m), Illumination Azimuth, Sequences; the training/validation/test dataset uses a droplet initial condition, height 10 m and tank sizes in [10, 20] m.)
B MODELS
B.1 LSTM

The encoder consists of 4 convolutional layers with 60, 120, 240 and 480 feature maps, kernel sizes 7, 3, 3, 3 and padding of 2, 1, 1, 1 pixels. Dimensionality reduction is achieved by using a kernel stride of 2 in all layers. After each convolutional layer there is a batch normalisation layer, a tanh non-linearity and dropout of a fraction of the units, chosen randomly in each pass. The final part of the encoder is a fully connected layer of width L = 1000. This is the latent vector input to the three LSTMs. One LSTM is used for the first input, the second LSTM for predicting the 10th frame (midway) and the third LSTM for all the other frames. The decoder is based on deconvolutions that double the spatial dimensions of the feature maps in each layer until the original 128 x 128 size is reached. It mirrors the encoder in terms of feature-map size, while the kernel size is 3, the padding 1 and the stride 2 for all layers. Figure 6 depicts the architecture.
Figure 5: Schematic of the encoder and decoder used in the LSTM model (Sorteberg et al., 2019). Dimensions of the layers are shown on the left and top, and the number of channels at the bottom.
Figure 6: The LSTM model (Sorteberg et al., 2019) unrolled through time. Three LSTMs share the convolutional encoder and decoder, and the output is re-fed as the next input.
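For illustration, a condensed PyTorch sketch of the encoder-LSTM-decoder pattern is given below. The channel counts and latent width follow the description above, but the kernel sizes are simplified and a single LSTM stands in for the three used in the actual model, so this is a rough sketch rather than the exact baseline.

```python
import torch
import torch.nn as nn


class EncoderLSTMDecoder(nn.Module):
    """Simplified baseline: a convolutional encoder compresses each frame to a
    latent vector, an LSTM propagates it in time, and a deconvolutional
    decoder maps the last latent state back to a 128x128 frame."""
    def __init__(self, latent=1000):
        super().__init__()
        chans = [1, 60, 120, 240, 480]              # feature maps from the text
        enc = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                    nn.BatchNorm2d(cout), nn.Tanh()]
        self.encoder = nn.Sequential(*enc, nn.Flatten(),
                                     nn.Linear(480 * 8 * 8, latent))
        self.lstm = nn.LSTM(latent, latent, batch_first=True)
        self.fc = nn.Linear(latent, 480 * 8 * 8)
        dec = []
        for cin, cout in zip(chans[::-1][:-1], chans[::-1][1:]):
            dec += [nn.ConvTranspose2d(cin, cout, kernel_size=3, stride=2,
                                       padding=1, output_padding=1), nn.Tanh()]
        self.decoder = nn.Sequential(*dec)

    def forward(self, frames):                      # frames: (B, N, 1, 128, 128)
        b, n = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(b, n, -1)
        z, _ = self.lstm(z)                         # propagate the latent in time
        out = self.fc(z[:, -1]).view(b, 480, 8, 8)
        return self.decoder(out)                    # next-frame prediction
```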
B.2 CONVLSTM

Our architecture uses a stack of 3 ConvLSTM cells. Initially, a convolutional encoder with 8, 64 and 192 feature maps respectively reduces the spatial dimensions to 16 x 16. All layers have kernels of size 3, zero padding of width 1 and Leaky ReLU non-linearities with slope 0.2. A stride of 2 pixels is used to reduce the dimensionality. At the final layer, the input is represented by a 16 x 16 x 192 tensor. Inside the ConvLSTMs we use kernels of size 3 and zero padding of 1 pixel to avoid dimensionality reduction. The decoder uses deconvolutions with stride 2 to double the pixel dimensions in each layer.
Figure 7: ConvLSTM model. The encoder processes the N=5 input frames one at a time to create an internal representation. The representation is copied over to the forecaster, which uses it to generate M=10 future frames. The feature-map dimensions are shown next to each layer.
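A minimal sketch of a single ConvLSTM cell is shown below; the actual model stacks three such cells between the convolutional encoder and the deconvolutional forecaster of Figure 7. The interface and default sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: the four LSTM gates are computed with
    convolutions so the hidden and cell states keep their spatial layout."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden and cell state maps
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # convolutional memory update
        h = o * torch.tanh(c)
        return h, (h, c)

    def init_state(self, batch, height, width, device):
        z = torch.zeros(batch, self.hid_ch, height, width, device=device)
        return (z, z.clone())
```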
B.3 PREDRNN++

The unfolding of the model through time is presented in Figure 8. The vertical stack is comprised of one convolutional, one max-pooling and four PredRNN++ layers. The convolutional layer has a kernel of size 3, no padding and outputs 8 feature maps. The original paper does not use any dimensionality reduction because its per-frame input dimensions are small. Our input dimensions (128 x 128) are too big to fit in the available GPU memory, so we used max-pooling with stride 4 to reduce them to 32 x 32 pixels. Following the original paper, we used 4 PredRNN++ layers but reduced all of them to 64 channels each to meet hardware memory constraints. We used convolutional kernels of size 3. The forecaster uses a deconvolutional layer with kernel size 7 and stride 4 to restore the internal state to the original dimensions.
Figure 8: PredRNN++ model. The encoder processes the N=5 input frames one at a time and an internal representation is created in each PredRNN++ cell. These representations are copied over to the forecaster, which uses them to generate M=20 future frames. The feature-map dimensions are shown next to each layer.
B.4 U-NET

The encoder is composed of four blocks, each containing two convolutional layers with kernel size 3 and padding 1, followed by ReLU non-linearities. The first three blocks include a max-pooling layer of stride 2 that halves the spatial dimensions. The number of feature maps doubles in each layer. For the expanding part, we use bilinear interpolation with scale factor 2 instead of deconvolutions to keep the number of parameters low. Skip connections are also employed to copy feature maps from earlier layers but, contrary to the original paper, we do not reduce the dimensions of the copied feature maps. This way, high-level, coarser feature maps are combined with the fine-grained local information of lower layers over the whole domain. The network architecture can be seen in Figure 9.
Figure 9: Schematic of the U-Net model. The number of channels is shown below each layer and its dimensions on the side. Our model has N=5 input and M=20 output frames.
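A compact sketch of this U-Net variant is given below. The number of blocks, the bilinear upsampling and the full-resolution skip connections follow the description above; the base channel count (64) and the 1x1 output convolution are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def block(cin, cout):
    """Two 3x3 convolutions with ReLU, as used in every U-Net block."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))


class UNet(nn.Module):
    """Simplified U-Net: N input frames stacked as channels, M output frames.
    Upsampling uses bilinear interpolation and the skip connections
    concatenate the full encoder feature maps."""
    def __init__(self, n_in=5, m_out=20, base=64):   # base=64 is an assumption
        super().__init__()
        self.enc1 = block(n_in, base)
        self.enc2 = block(base, base * 2)
        self.enc3 = block(base * 2, base * 4)
        self.bottom = block(base * 4, base * 8)
        self.dec3 = block(base * 8 + base * 4, base * 4)
        self.dec2 = block(base * 4 + base * 2, base * 2)
        self.dec1 = block(base * 2 + base, base)
        self.out = nn.Conv2d(base, m_out, kernel_size=1)

    def forward(self, x):                             # x: (B, N, 128, 128)
        e1 = self.enc1(x)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        e3 = self.enc3(F.max_pool2d(e2, 2))
        b = self.bottom(F.max_pool2d(e3, 2))
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        d3 = self.dec3(torch.cat([up(b), e3], dim=1))
        d2 = self.dec2(torch.cat([up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([up(d2), e1], dim=1))
        return self.out(d1)                           # (B, M, 128, 128)
```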
C HYPERPARAMETERS
Let N be the number of input frames and M the number of output frames of the model. For each training iteration, we randomly select K sub-sequences of length N + M from each simulated sequence. The models were trained to minimise the MSE over their respective output length M. In each iteration, the weights are updated using an Adam optimiser, while a scheduling scheme reduces the learning rate (LR) by a constant scaling factor if there is no improvement in validation error after a given number of epochs (patience). The hyperparameters of interest are the input length N, the training output length M, the number of samples per sequence between weight updates K, the batch size b, the LR and the patience p. Grid search was used to find the best set of hyperparameters for each model. The training budget was 24 hours. To obtain an arbitrarily long prediction, we feed the output back as the next input. The goal is to obtain networks with a low error in long-term prediction, so during model selection we chose the hyperparameters that gave the lowest validation error over 50 frames regardless of the output size M of the model. The final hyperparameters and model sizes can be found in Tables 4 and 5.
Model     Input Length   Output Length   Samples per Sequence   Batch Size   Learning Rate   Patience
LSTM      5              20              10                     16

Table 4: Hyper-parameters of the best performing model for each architecture.
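The training procedure can be summarised with the following sketch. It assumes generic train_loader/val_loader objects that yield (input, target) sub-sequences of lengths N and M, and uses placeholder values for the learning-rate scaling factor and patience; it is an outline of the procedure rather than the exact training script.

```python
import torch
import torch.nn as nn


def train(model, train_loader, val_loader, epochs=100, lr=1e-4, patience=10):
    """Minimise the MSE over the model's M output frames; reduce the learning
    rate when the validation error plateaus (placeholder factor of 0.1)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1,
                                                       patience=patience)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        for inputs, targets in train_loader:          # K random sub-sequences
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        sched.step(val)                               # LR scheduling on validation error
```

Model selection over 50-frame rollouts would then reuse an autoregressive rollout of the kind sketched in Section 4.1, feeding the output back as the next input.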
D MODEL SIZE AND SPEED
Models were implemented in PyTorch and the code is publicly available on GitHub. Models were trained on a GTX 1060 GPU with 6 GB of memory. The total training time includes the evaluation overhead.
Model        Trainable Parameters   Epoch Time   Num. Epochs   Best Epoch   Total Training Time
LSTM         88.2M                  12m          75            71           24h
ConvLSTM     12.3M                  36m          24            18           24h
PredRNN++    2.5M                   33m          43            36           24h
U-Net        7.8M                   8m           171           166          24h

Table 5: Model size and training times.
Method                 Time per frame (ms)   Speed-up
Numerical simulator    630.7                 -
LSTM                   15.0                  40x
ConvLSTM               4.5                   141x
PredRNN++              9.2                   68x
U-Net                  2.6                   241x

Table 6: Time taken to compute one frame. Deep learning approximations offer a significant speed-up over numerical simulations.
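The per-frame timings in Table 6 could be measured along the lines of the sketch below, which rolls the model out autoregressively and averages the wall-clock time per predicted frame. The model interface and warm-up count are assumptions.

```python
import time
import torch


@torch.no_grad()
def time_per_frame(model, frames, n_frames=80, warmup=5):
    """Rough wall-clock time per predicted frame (milliseconds).
    frames: (B, N, H, W) conditioning input; model returns (B, M, H, W)."""
    for _ in range(warmup):                   # warm-up to exclude start-up costs
        model(frames)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    produced = 0
    while produced < n_frames:
        out = model(frames)
        produced += out.shape[1]
        frames = torch.cat([frames, out], dim=1)[:, -frames.shape[1]:]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - start) / n_frames
```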
E RESULTS ADDENDUM

Figure 10: Prediction roll-out from the U-Net and PredRNN++ at t = 10, 20, 30, 40, 60 and 80, compared with the ground truth. Both sequences are from the test set.
Figure 11: Generalisation of the U-Net and PredRNN++ to different physical settings. On the left, linear waves: the networks introduce circular patterns where they do not exist. On the right, the smaller tank, where waves are faster: both models miss the time constant and are slower than the ground truth.
Figure 12: Flat Image (left) and Previous Frame (right) baselines compared against the ground truth across the different datasets. The Flat Image baseline measures how far each frame is from the reference height. The Previous Frame baseline indicates how much consecutive frames differ. Datasets are not equally hard to predict, and this should be taken into account when assessing the generalisation capacity of a model.
E.1 PREDICTING THE TANK SIZE FROM THE LATENT SPACE
Here we check whether the trained U-Net acquired any understanding of the physical properties of the system. We focus on the tank size, or inversely the speed of the wave, for two reasons. First, the U-Net failed to extrapolate to different tank sizes, and this experiment could provide some insight into why this failure happens. Secondly, tank-size information is readily available: each dataset sequence corresponds to a different tank size, and the tank is always square. In the training and testing dataset the tank size is s_i in [10, 20] metres. For the smaller tank we used s_i in [5, 10] and for the bigger s_i in [20, 40] metres.

The question we try to answer is: does the latent representation of the U-Net capture the tank-size information s_i? We take the pre-trained encoder from the U-Net and add some additional layers so that the output is a single number (Figure 13). The system is trained to predict the tank size when given 5 consecutive frames. Only the additional part is updated during training; the weights of the encoder are kept frozen. We compare the pre-trained encoder against a randomly initialised encoder. We also compare the models to a dummy regressor that always predicts the mean tank size for each dataset, i.e. 15 for the test set, 7.5 for the small tank and 30 for the big tank. The results in Table 7 indicate that the pre-trained encoder can be used to extract the tank size with relatively low error (0.14), while the random encoder gives a much higher error of 2.27, only slightly lower than the dummy regressor (2.45). This indicates that the pre-trained encoder encapsulates physically relevant information relating to the tank size. When it comes to the bigger and smaller tanks, both the pre-trained and the random encoders fail to extrapolate and give errors higher than the dummy regressor.

Figure 13: Schematic of the model used for tank size prediction.
                      Test set   Bigger Tank   Smaller Tank
Pre-trained encoder   0.14
Random encoder        2.27
Dummy regressor       2.45

Table 7: Predicting the tank size from the U-Net's latent space.
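A minimal sketch of this probing setup is shown below. It assumes the pre-trained U-Net encoder is available as a module returning its final feature maps; the regression head (global average pooling followed by a small MLP) is an illustrative assumption, not the exact additional layers of Figure 13.

```python
import torch
import torch.nn as nn


class TankSizeProbe(nn.Module):
    """Frozen U-Net encoder followed by a small trainable regression head
    that predicts the tank size from 5 consecutive input frames."""
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                  # keep the encoder weights frozen
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.LazyLinear(64), nn.ReLU(),
                                  nn.Linear(64, 1))  # single output: tank size (m)

    def forward(self, frames):                       # frames: (B, 5, 128, 128)
        with torch.no_grad():
            feats = self.encoder(frames)             # latent feature maps
        return self.head(feats).squeeze(1)
```

Only the head is optimised (e.g. with MSE against the known tank size of each sequence), which mirrors the frozen-encoder comparison described above.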