Deep Orthogonal Decompositions for Convective Nowcasting
Daniel J. Tait
University of Warwick, Alan Turing Institute, [email protected]
Abstract
Near-term prediction of the structured spatio-temporal processes driving our climate is of profound importance to the safety and well-being of millions, but the pronounced nonlinear convection of these processes makes a complete mechanistic description even of the short-term dynamics challenging. However, convective transport provides not only a principled physical description of the problem, but is also indicative of the transport in time of informative features, which has led to the recent successful development of "physics free" approaches to the nowcasting problem. In this work we demonstrate that there remains an important role to be played by physically informed models, which can successfully leverage deep learning (DL) to project the process onto a lower dimensional space on which a minimal dynamical description holds. Our approach synthesises the feature extraction capabilities of DL with physically motivated dynamics to outperform existing model free approaches, as well as state of the art hybrid approaches, on complex real world datasets including sea surface temperature and precipitation.
Nowcasting refers to the process of producing near-term predictions of the complex dynamical systems driving our natural world. Such systems have a significant impact on the daily lives and safety of our planet's population, with important examples including precipitation and sea surface temperature; variables crucial for flood warning and prevention, prediction of cyclogenesis, and maritime safety. A common theme is the dominance of short-term dynamics by convection, the transport of material along the streamlines of a vector field. Transport, and the reduced role of diffusion at these time-scales, leads to temporal feature preservation, and consequently these systems have received increasing interest from the machine learning (ML) community [23, 24, 16, 26, 2]. The classical approach to solving these problems in the natural sciences is to specify a model of the underlying physical properties. A properly calibrated mechanistic model described by parameterised differential operators is able to provide future prediction and generalise to new scenarios, provided this change of physics can be encoded by some parameter. In the lingua franca of ML we would comment on the remarkable generalisation and transferability of fundamental physical theories specified by mathematical models, a demonstration of "the unreasonable effectiveness of mathematics" [29]. Nevertheless, a successful physical theory often requires multiple simplifying assumptions, removing many of the complexities of real-world phenomena; done correctly, this reduction allows the theory to focus on the most salient qualities of the system. However, this reduction can be insufficient for detailed prediction in the most complex open world systems we experience in nature, or else requires expensive numerical models [15].
This complexity, combined with the existence of the persistent, informative features that characterise convection, has led to the development of deep learning (DL) approaches, notably the dynamics based approaches of [23, 24, 16, 26], or the images-to-images type
approaches adopted by [3, 12, 2]. As DL methods continue to advance, it becomes an open question as to whether one should still attempt to embed mechanistic structure into our models, or if one ought to go "physics free" [2]. We suggest this stance is information inefficient, ignoring the vast prior scientific knowledge acquired by human endeavour. This viewpoint has led to the emergence of physics informed ML [19, 8], which aims to combine the expressive power of modern ML methods with the representative power of fundamental physical models. Typically these employ DL surrogates of the input-output map, then ensure this map is either constructed to obey some physical principle [7, 28], or else regularised by a guiding mechanistic model [19, 25, 5, 18, 31, 33]. In this work we add to the argument that physical principles still have an important role to play in modelling by introducing a new physically informed method motivated by two of the guiding principles behind a successful mathematical analysis of a complex dynamical problem: (i) the projection of a high dimensional model onto a lower dimensional representation, and (ii) the existence of a transformation that brings the dynamics closer to linear.

Preprint. Under review.
We use the feature extraction prowess of DL to perform our reduction, while using physical principles to guide the evolution of these features, constructing a flexible system that is well suited to modelling convection phenomena by ensuring that deep nonlinear components are strongly linked with the minimal physical dynamics.

In summary, we use the powerful feature extraction capabilities of modern ML to perform localised model order reduction by projection onto a subspace spanned by deeply-extracted orthogonal features. We then show how one can combine flexible methods with minimal physical representations to advance the dynamics in this reduced space; crucially, we construct our method so as to enable a tighter link between the DL-based encoding and the dynamics than previous approaches. Finally, we demonstrate that this synthesis of DL feature extraction and minimal physics outperforms both existing model free approaches and state of the art hybrid approaches on complex real world datasets.
The classical approach to model based forecasting predicts the evolution of a variable using past records to calibrate a mechanistic model. These models are typically complex systems of partial differential equations (PDEs), which must be sufficiently rich to embed all relevant physical knowledge. These systems can be used to describe the change in intensity of some target variable, for example sea surface temperature or precipitation rate. Two of the most important physical processes describing these changes are diffusion, a smoothing transformation, and transport, volume preserving translations; we now provide a brief review of the mathematical models encoding these effects.

Our goal is to introduce a flexible family of models for the evolution of a field variable $u(x, t)$ indexed by spatial coordinates $x$ in a region $\Omega \subset \mathbb{R}^D$, and temporal coordinate $t \in [t_0, T]$. A starting point for describing diffusion and transport is the pseudo-linear transport PDE

$$\frac{\partial u}{\partial t} = \nabla \cdot (\alpha \nabla u) - w \cdot \nabla u, \tag{1}$$

which is parameterised by a diffusion coefficient $\alpha(x, u)$, governing smoothing over time, and a transport vector field $w(x, u)$, describing the direction of bulk motion. This is a (pseudo-)linear PDE and so in principle easy to solve. However, this only defers modelling complexity, because one must still motivate an appropriate transport vector field. Mathematically, the most well studied models jointly describing convective transport are the system of Rayleigh-Bénard equations [22, 10],

$$\frac{\partial u}{\partial t} + w \cdot \nabla u = \nabla \cdot (\alpha \nabla u), \tag{2a}$$
$$\frac{\partial w}{\partial t} + w \cdot \nabla w = \nu \Delta w + f, \tag{2b}$$
$$\nabla \cdot w = 0. \tag{2c}$$

However, solving this system across multiple spatial and temporal scales, even for known parameters, presents a formidable challenge. Nevertheless, this motivating idea of forming a conditionally linear PDE has been successfully applied by [8].
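To make the pseudo-linear transport model (1) concrete, the following is a minimal finite-difference sketch of one explicit time step on a periodic grid; the grid size, time step, and the constant values of $\alpha$ and $w$ are illustrative choices, not the solver or parameters used in the paper.

```python
import numpy as np

def advect_diffuse_step(u, w, alpha, dx, dt):
    """One explicit Euler step of du/dt = alpha * Laplacian(u) - w . grad(u)
    on a periodic 2-D grid, using central differences."""
    ux = (np.roll(u, -1, axis=1) - np.roll(u, 1, axis=1)) / (2 * dx)
    uy = (np.roll(u, -1, axis=0) - np.roll(u, 1, axis=0)) / (2 * dx)
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
           + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u) / dx ** 2
    wx, wy = w
    return u + dt * (alpha * lap - (wx * ux + wy * uy))

# Toy run: a Gaussian blob transported by a uniform horizontal field.
n = 64
dx, dt = 1.0 / n, 1e-4
xs = np.arange(n) * dx
X, Y = np.meshgrid(xs, xs)
u = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.01)
total0 = u.sum()  # on a periodic grid this scheme conserves total mass
for _ in range(100):
    u = advect_diffuse_step(u, w=(1.0, 0.0), alpha=1e-4, dx=dx, dt=dt)
```

The smoothing term acts isotropically, while the transport term translates features along $w$; this is the qualitative behaviour the latent dynamics later exploit.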
Their approach uses data to inform the vector field $w$, and then uses the Gaussian integral kernel obtained by solving (1) to motivate an integral-convolution update of the current image of the field variable. They propose to discard the nonlinear components (2b)-(2c) of the system (2) when advancing the solution, then reincorporate them in the supervised per-time-step loss function

$$\mathcal{L}_t = \sum_{x \in \Omega} \rho\big(\hat{u}_{t+1}(x) - u_{t+1}(x)\big) + \lambda_{\mathrm{div}} \|\nabla \cdot w(x)\|^2 + \lambda_{\mathrm{magn}} \|w(x)\|^2 + \lambda_{\mathrm{grad}} \|\nabla w(x)\|^2, \tag{3}$$

where $\rho$ is a loss function on the space of pixel-images, $\hat{u}_{t+1}$ is the predicted surface and $u_{t+1}$ is the actual surface. Ultimately, this assumes the linear PDE (2a) is sufficient to capture the dynamics in the original space; instead we shall attempt to project the dynamics onto a space where this assumption becomes more plausible, but first we briefly review how to numerically solve a PDE like (1) or (2).

It is typically necessary to solve a PDE problem, like (1), numerically via discretisation. One powerful method of doing so is to first multiply the problem by a test function $v \in \hat{V}$, and then integrate to form a variational problem. It is typical to choose test functions which vanish on the boundary, and so one arrives at the classical variational form of the PDE problem: find $u \in V$ such that

$$\int_\Omega \frac{\partial u(t, x)}{\partial t} v(x) \, dx = -\int_\Omega \alpha(x, u) \nabla u(t, x) \cdot \nabla v(x) \, dx - \int_\Omega w(x) \cdot \nabla u(t, x) \, v(x) \, dx \tag{4}$$

for any test function $v \in \hat{V}$; this is the weak form of the classical problem (1). In order to numerically implement this idea one must also replace the trial and test spaces by finite dimensional subspaces, and we take $V = \mathrm{span}\big(\{\varphi_m\}_{m=1}^M\big) = \hat{V}$. While there are multiple choices of basis functions $\{\varphi_m\}_{m=1}^M$, for example one can use the nodal basis functions of the FEM [20], in this work we shall consider sets of orthonormal basis functions.
That is, we specify a collection of $M$ basis functions such that $\langle \varphi_i, \varphi_j \rangle_{L^2(\Omega)} = \delta_{ij}$, where $\langle \cdot, \cdot \rangle_{L^2(\Omega)}$ is the $L^2$-inner product. We then search for solutions to (4) with representation $u(t, x) = \sum_{m=1}^M (z_t)_m \varphi_m(x)$, where $z_t \in \mathbb{R}^M$ is a vector of unknown coefficients to be determined. Inserting this representation into (4) we achieve a finite-dimensional projection of the dynamical problem as an ordinary differential equation (ODE)

$$\frac{d}{dt} z = L_\varphi z, \qquad (L_\varphi)_{ij} = -\int_\Omega \alpha(x) \nabla \varphi_j(x) \cdot \nabla \varphi_i(x) \, dx - \int_\Omega w(x) \cdot \nabla \varphi_j(x) \, \varphi_i(x) \, dx. \tag{5}$$

This is the Ritz-Galerkin projection of the dynamics onto the subspace $V$. We use the notation $L_\varphi$ to make it clear that whereas the classical operator on the RHS of (1) depended only upon the coefficient functions parameterising it, the projected problem further depends on the ability of the basis functions to capture information about the problem, entwining the encoding and the dynamics. The less informative the basis functions, the higher the dimension $M$ needed to faithfully reproduce the dynamics in the function space. In what follows we refer to the process of forming the state dynamics matrix $L_\varphi$ in (5) as the Assembly operation, which involves performing quadratures to compress the parameter fields $\alpha, w$ and basis functions into an $M \times M$ matrix.

Given an ensemble $\{u_t\}_{t=1}^N$ of field variables over a domain $\Omega$, the proper orthogonal decomposition (POD) is a technique for extracting an informative modal basis, that is, a low-dimensional basis which captures most of the information or "energy" of the ensemble.
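The Assembly operation can be sketched as pixel-level quadrature over discretised basis images. The helper names and the hypothetical Fourier basis below are illustrative; the signs follow the projected operator (5) with a standard diffusion-transport convention.

```python
import numpy as np

def grad(phi, dx):
    """Central-difference gradient of a 2-D field (periodic boundaries)."""
    gx = (np.roll(phi, -1, axis=1) - np.roll(phi, 1, axis=1)) / (2 * dx)
    gy = (np.roll(phi, -1, axis=0) - np.roll(phi, 1, axis=0)) / (2 * dx)
    return gx, gy

def assemble_L(phis, alpha, w, dx):
    """Pixel-quadrature sketch of Assembly:
    (L)_ij = -int alpha grad(phi_j).grad(phi_i) - int (w.grad(phi_j)) phi_i."""
    M = len(phis)
    wx, wy = w
    quad = dx * dx                      # quadrature weight of one pixel
    grads = [grad(p, dx) for p in phis]
    L = np.zeros((M, M))
    for j in range(M):
        gjx, gjy = grads[j]
        conv_j = wx * gjx + wy * gjy    # w . grad(phi_j)
        for i in range(M):
            gix, giy = grads[i]
            L[i, j] = -quad * np.sum(alpha * (gjx * gix + gjy * giy)
                                     + conv_j * phis[i])
    return L

# Toy orthonormal basis: a few discrete Fourier modes on the unit square.
n = 32
dx = 1.0 / n
xs = np.arange(n) * dx
X, Y = np.meshgrid(xs, xs)
phis = [np.ones((n, n)),
        np.sqrt(2) * np.sin(2 * np.pi * X),
        np.sqrt(2) * np.cos(2 * np.pi * X),
        np.sqrt(2) * np.sin(2 * np.pi * Y)]
# Pure diffusion (w = 0): the assembled operator should be symmetric and
# negative semidefinite, with the constant mode left unchanged.
L = assemble_L(phis, alpha=0.01,
               w=(np.zeros((n, n)), np.zeros((n, n))), dx=dx)
```

With a transporting $w$ switched on, the second integral adds a non-symmetric part, which is exactly how the latent dynamics acquire their convective character.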
The decomposition in POD is more familiar to the machine learning community under the name principal components analysis (PCA), or alternatively the Karhunen-Loève decomposition [11, 13], and involves reconstructing elements of the ensemble as $\hat{u} = \sum_m z_m \varphi_m(x)$, where $\{\varphi_m(x)\}_{m=1}^M$ are the first $M$ eigenfunctions of the empirical covariance matrix of the ensemble, ordered by largest eigenvalue. The idea of using the POD eigenfunctions as an "empirical basis" onto which to perform the above Ritz-Galerkin projection for modelling turbulent flows was introduced in [14]. However, the optimality results concern POD as a linear re-constructor, and do not transfer to any optimality on the dynamic prediction problem [6]. Therefore, in this work we shall instead use deep networks to extract our orthogonal subspace; however, motivated by the POD idea, we shall attempt to still include some version of this optimal reconstruction to motivate the regulariser in our supervised loss.

Figure 2: (a) A local input sequence $\{u_k\}_{k=t-\ell}^t$ is passed through the GalerkinBasis feature extraction to produce a collection of $M$ orthogonal images $\{\varphi_m\}_{m=1}^M$. (b) An image can be projected onto the subspace $\mathrm{span}\{\varphi_m\}$ and then reconstructed as the linear combination $\hat{u}_t = \sum_{m=1}^M \langle u_t, \varphi_m \rangle \varphi_m$.

Our objective will be to estimate a future length-$T$ sequence of realisations of the process starting from time $t$, using only information coming from the length-$\ell$ history process $\{u_k(x)\}_{k=t-\ell}^t$. In fact, we shall also discretise the domain as an $n_x \times n_y$ pixelated grid, and instead aim to estimate the vectorised field variable $u \in \mathbb{R}^{n_x \times n_y}$. That is, we aim to learn an images-to-images map $\{\hat{u}_k\}_{k=t+1}^{t+T} = f(u_{t-\ell}, \dots, u_t)$, which also embodies some minimal dynamical representation.
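As a concrete reference point for the POD/PCA construction reviewed above, the modal basis can be computed from an SVD of the centred snapshot matrix; the synthetic ensemble here is purely illustrative.

```python
import numpy as np

def pod_basis(snapshots, M):
    """Leading M POD modes of an ensemble of flattened fields.
    Rows of `snapshots` are samples; returned modes are orthonormal."""
    U = snapshots - snapshots.mean(axis=0)   # centre the ensemble
    # Right singular vectors are the eigenvectors of the empirical
    # covariance matrix, ordered by captured "energy".
    _, s, Vt = np.linalg.svd(U, full_matrices=False)
    return Vt[:M], s[:M] ** 2 / len(U)       # modes and their energies

rng = np.random.default_rng(0)
# Synthetic ensemble: two fixed spatial patterns with random amplitudes,
# one pattern carrying far more variance than the other.
x = np.linspace(0, 1, 128)
ensemble = (rng.normal(size=(200, 1)) * np.sin(2 * np.pi * x)
            + 0.1 * rng.normal(size=(200, 1)) * np.cos(2 * np.pi * x))
modes, energy = pod_basis(ensemble, M=2)
```

The leading mode recovers the dominant pattern, which is precisely the reconstruction-optimality property the supervised loss later tries to preserve in the learned basis.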
To do so we introduce a DL approach to building a localised version of the POD basis discussed above.

Deep Galerkin Features
In order to project the inputs onto a reduced space it is required to construct a set of functions $\{\varphi_j\}_{j=1}^M$ such that $\langle \varphi_i, \varphi_j \rangle_{L^2(\Omega)} = \delta_{ij}$; again we shall work with the vectorised images of these functions and replace the $L^2(\Omega)$ orthogonality condition with a quadrature approximation. Furthermore, and without loss of generality, we can re-scale the domain to have unit volume so that we seek a collection of vectors such that $\langle \varphi_i, \varphi_j \rangle = \delta_{ij}$ for all $i, j = 1, \dots, M$ under the Euclidean inner product on $\mathbb{R}^{n_x \times n_y}$. Any given input $u_t$ can then be projected onto an $M$-dimensional space by the Galerkin projection $\Pi_{\mathrm{Galerkin}} : \mathbb{R}^{n_x \times n_y} \to \mathbb{R}^M$, where $z_t = \Pi_{\mathrm{Galerkin}} u_t$ is the $M$-vector with coefficients $(z)_j = \langle u, \varphi_j \rangle$ for $j = 1, \dots, M$. While the POD method discussed in the previous section constructs this basis using the complete set of inputs, we wish to perform our projection using only the most recent $\ell$ inputs, relying on the power of DL approaches to extract sufficient information from this reduced set. Our chosen feature extractor will be the ubiquitous U-Net [21], owing to its demonstrated successes in performing feature extraction from image data. Since we seek $M$ basis functions, we shall consider architectures which take as input an image sequence of shape $(n_x, n_y, \ell)$, and output an image sequence of shape $(n_x, n_y, M)$. Our GalerkinBasis model is then a composition of this map with an orthogonalisation of the outputs, and we write

$$\{\varphi_j^{(t)}\}_{j=1}^M = \mathrm{GalerkinBasis}(u_{t-\ell}, \dots, u_t) = \mathrm{Orthogonalise} \circ \mathrm{Unet}(u_{t-\ell}, \dots, u_t). \tag{6}$$

Functionally, we have constructed the basis of vectors $\varphi_j^{(t)} = \varphi_j^{(t)}(u_{t-\ell}, \dots, u_t)$, which depend only on temporally local values; an overview of the composed process is represented in Fig. 2.

Linear latent convection model
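The Orthogonalise and projection steps can be sketched as follows. The excerpt does not specify how the orthogonalisation is realised, so QR factorisation of the flattened feature images is an assumed, convenient choice here, and random arrays stand in for the U-Net outputs.

```python
import numpy as np

def orthogonalise(features):
    """Map M raw feature images of shape (M, nx, ny) to orthonormal basis
    images. QR on the flattened features is one way to realise the
    Orthogonalise step; the paper's exact choice is not given here."""
    M, nx, ny = features.shape
    Q, _ = np.linalg.qr(features.reshape(M, nx * ny).T)  # orthonormal columns
    return Q.T.reshape(M, nx, ny)

def galerkin_project(u, phis):
    """z_j = <u, phi_j> under the Euclidean inner product on pixels."""
    return np.tensordot(phis, u, axes=([1, 2], [0, 1]))

def reconstruct(z, phis):
    """u_hat = sum_j z_j phi_j."""
    return np.tensordot(z, phis, axes=(0, 0))

rng = np.random.default_rng(1)
raw = rng.normal(size=(8, 16, 16))   # stand-in for U-Net feature outputs
phis = orthogonalise(raw)
u = raw[0]                           # a field lying in the feature span
z = galerkin_project(u, phis)        # encode: 256 pixels -> 8 coefficients
u_hat = reconstruct(z, phis)         # decode: exact, since u is in the span
```

Any field inside the feature span is recovered exactly; fields outside it are mapped to their closest point in the span, which is what the reconstruction regulariser later penalises.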
The GalerkinBasis provides our desired nonlinear transformation of the input sequence onto a linear space parameterised by $z \in \mathbb{R}^M$, using only information up to time $t$. For the remainder of the prediction horizon we wish to forward solve the problem using the physically informed convection dynamics on a linear space determined by the extracted features. We shall assume that in this space the dynamics are reasonably well described by the linear PDE (1), and use our learned features to carry out a projection of the dynamics onto a local process obtained via (5).

Figure 3: (a) The method of [8] uses DL to estimate a motion field and then transport the solution in the original data-space; no further use is made of the latent representations. (b) Our method uses the GalerkinBasis network to encode the inputs to a latent space, and then advances the solution in the latent space using only information from the encoded variables to parameterise the Convection model, before finally reconstructing in the data-space.
It remains to parameterise the dynamics with a diffusion coefficient and a transport vector field. We shall refer to the model component performing this parameterisation as the Convection model, and we shall also allow it to be informed by the previous $\ell$ observations. Once this component has been specified, we shall advance the latent state according to an updating scheme of the form

$$\alpha^{(k)}, w^{(k)} = \mathrm{Convection}(z_{k-\ell}, \dots, z_k), \tag{7a}$$
$$L_\varphi^{(k)} = \mathrm{Assembly}\big(\alpha^{(k)}, w^{(k)}, \{\varphi_m(u_{t-\ell}, \dots, u_t)\}_{m=1}^M\big), \tag{7b}$$
$$z_{k+1} = z_k + \int_{t_k}^{t_{k+1}} L_\varphi^{(k)} z_\tau \, d\tau, \tag{7c}$$

for $k = 0, \dots, T-1$. First, the previous values of the latent process are used to build new parameters of the transport model (7a); these are then combined with the features to assemble the state dynamics matrix (7b); and finally the linear ODE (5) is solved to give the new state (7c). This process is repeated until a complete set of $T$ latent states is obtained. Our general approach to parameterisation of the transporting physical model is therefore similar to that in [8], but crucially our approach propagates the dynamics in the latent space by way of the Galerkin projection, rather than applied to the complete surface. This leads to an important difference: since any surface now has a finite dimensional representation by $z_t$, our convection step (7a) may be taken as a function of the low dimensional latent states; that is, we can consider a decoder which takes the $\ell \times M$ lagged latent states and outputs the diffusion and transport field over $\Omega$; this is the top-right block of Fig. 3b.

Conversely, the approach taken in [8] requires one to first encode the surfaces, and then decode them again to estimate the motion field, see Fig. 3a. While this step necessarily creates an implicit latent representation as a byproduct of this encode/decode step, this representation has no further role to play in the model dynamics.
In contrast, since we have already performed encoding through the Galerkin projection, we require no further compression, allowing the latent representations produced by our approach to feature more directly in the dynamical description, see Fig. 3b.

In summary, our model first uses the output of the GalerkinBasis to construct a set of orthogonal feature images and then uses the Ritz-Galerkin method to project the dynamics onto this space. The encoded variables are then passed to the Convection model to determine the dynamics; notably, while the convection model in Fig. 3a must account for all of the transformation applied to an input image, our model is able to share the dynamical description with the $\varphi$-features through the projection (5). This allows for a more parsimonious parameterisation of the Convection model, and since only this smaller model is called repeatedly during the loop procedure (7) we achieve a more memory efficient implementation. Recognising the partition of our method into an orthogonal decomposition, and then a convection informed forward integration, we shall refer to it as a Deep Orthogonal Decomposition for Convection model, or simply
DOD-Convection.

Reconstruction and supervised loss

Finally, having created a full set of latent states $\{z_k\}_{k=t-\ell}^{t+T}$, we may use our Galerkin features to reconstruct the process. Note that we have both the historical states $\{z_k\}_{k=t-\ell}^t$ and the future states $\{z_k\}_{k=t+1}^{t+T}$; we further recall from our discussion of the POD method in Section 2.3 that, when combined with the Galerkin features, the historical states should do a good job of reconstructing the inputs, but that the POD method produces the optimal such reconstruction. Together, this partition into historical and future states suggests the following decomposition of our supervised loss function:

$$\mathcal{L}_t = \sum_{k=1}^T \rho\Big(u_{t+k} - \sum_m (z_{t+k})_m \varphi_m(x)\Big) + \lambda_{\mathrm{div}} \|\nabla \cdot w\|^2 \tag{8a}$$
$$\qquad + \lambda_{\mathrm{recon}} \sum_{k=1}^\ell \rho\Big(u_{t-k} - \sum_m (z_{t-k})_m \varphi_m(x)\Big). \tag{8b}$$

The first two terms, (8a), are analogous to those in (3), playing a similar role with regards to measuring the accuracy of the prediction through a metric $\rho$, while regularising the divergence of the output $w$ of the Convection model to ensure a valid divergence free vector field. The final term, (8b), is specific to our method, and addresses the optimality condition of the POD basis as discussed in Section 2.3; in particular, we commented that when used to reconstruct an ensemble the POD basis gives the minimal reconstruction error for a particular dimension of the subspace. The coefficient $\lambda_{\mathrm{recon}}$ therefore controls the extent to which our local basis decomposition attempts to recover a version of this optimality condition. This trade-off helps ensure that the GalerkinBasis continues to have an identifiable role as an encoder/decoder of the surfaces, given its additional role in assembling the discretised differential operator $L_\varphi$.
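A numpy sketch of the loss (8) with a squared-error choice of $\rho$ and a finite-difference divergence penalty; the weights and the sanity-check data are illustrative, not trained quantities.

```python
import numpy as np

def divergence(wx, wy, dx):
    """Central-difference divergence of a 2-D vector field (periodic)."""
    dwx = (np.roll(wx, -1, axis=1) - np.roll(wx, 1, axis=1)) / (2 * dx)
    dwy = (np.roll(wy, -1, axis=0) - np.roll(wy, 1, axis=0)) / (2 * dx)
    return dwx + dwy

def dod_loss(u_future, u_past, z_future, z_past, phis, w, dx,
             lam_div=0.1, lam_recon=1.0):
    """Squared-error instance of (8): prediction error on the future window
    (8a), a divergence penalty on w, and the reconstruction term (8b) that
    pushes the learned basis towards POD-style recovery of the inputs."""
    def recon(z):
        return np.tensordot(z, phis, axes=(0, 0))
    pred = sum(np.sum((u - recon(z)) ** 2)
               for u, z in zip(u_future, z_future))
    div_pen = lam_div * np.sum(divergence(*w, dx) ** 2)
    recon_pen = lam_recon * sum(np.sum((u - recon(z)) ** 2)
                                for u, z in zip(u_past, z_past))
    return pred + div_pen + recon_pen

# Sanity check: a basis that reconstructs its inputs exactly, plus a constant
# (hence divergence-free) w, should give zero loss.
rng = np.random.default_rng(3)
n, M = 8, 3
Q, _ = np.linalg.qr(rng.normal(size=(n * n, M)))
phis = Q.T.reshape(M, n, n)
z_future = [rng.normal(size=M) for _ in range(2)]
z_past = [rng.normal(size=M)]
u_future = [np.tensordot(z, phis, axes=(0, 0)) for z in z_future]
u_past = [np.tensordot(z, phis, axes=(0, 0)) for z in z_past]
w = (np.ones((n, n)), np.ones((n, n)))
loss = dod_loss(u_future, u_past, z_future, z_past, phis, w, dx=1.0 / n)
```

Raising `lam_recon` trades predictive freedom of the basis for fidelity as a reconstructor, which is exactly the tension described above.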
This dual role of the $\varphi$-features is one of the principal strengths of our method, allowing dynamical components lost by the reduction to the overly simplistic PDE (1) to be recaptured by the DL-features, at the expense of the optimality of these same features as re-constructors of the surface.

We now apply our method to the large-scale convective systems encountered in the climate sciences in order to assess the accuracy and computational efficiency of our method compared to alternatives. All experiments were conducted on a single Nvidia GeForce RTX 2060 SUPER GPU with 8GB of memory. For additional experiments and visualisations see the supplementary material.
Figure 4: Regions used for the sea surface temperature experiment.
As in [8] we demonstrate our model on the SST data obtained from the NEMO (Nucleus for European Modelling of the Ocean) engine [15]. We extract the 64 by 64 pixel sub-regions shown in Fig. 4, and use data from 2016-2018 to train the model, giving a total of 11935 acquisitions. We then test on data from the year 2019 using the same regions, giving 3894 test instances. The regions were selected to capture the complicated nonlinear mixing that occurs as warm water transported along the gulf stream meets the colder water of the arctic regions. We assume that on the short-term horizons we are interested in, the information within a region is sufficient for prediction, and do not incorporate information from adjacent regions. Data is normalised to remove trends on a yearly timescale by taking the mean and standard deviations over the day of the year.

We compare the method we have introduced to the physics informed model of [8], as well as a convolutional LSTM (Conv-LSTM) introduced by [23, 24], which enhances LSTM networks with a spatial convolution to better model spatio-temporal processes; finally, to compare with "physics-free" approaches we use a U-Net as proposed in [3, 12, 2], which treats prediction as an images-to-images problem and makes no attempt to embed recurrent dynamical or physical structure. All models were implemented in Tensorflow [1], apart from [8], for which we use publicly available code.

Table 1: Comparison of methods on the SST data. Average score is the mean squared error (MSE), No. of parameters is the total number of trainable parameters in each model, and run-time is the mean time per-batch with the maximum batch size that can fit in memory. We fit our model with M = 16 $\varphi$-features.

Method                    Average score (MSE)    No. of parameters    Run-time [s]
Conv-LSTM [24]            0.2179                 1,779,073            0.43
U-Net [12, 2]             0.1452                 31,032,446           0.79
Flow [8]                  0.1344                 22,197,736           0.60
DOD-Convection (ours)
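The day-of-year normalisation mentioned above can be sketched as a per-calendar-day climatology removal; the array shapes, function name, and toy data below are illustrative, not the NEMO preprocessing pipeline.

```python
import numpy as np

def climatology_normalise(fields, day_of_year, eps=1e-6):
    """Remove a per-day-of-year climatology: for each calendar day, subtract
    the mean field over all years and divide by the standard deviation.
    `fields` has shape (num_acquisitions, nx, ny)."""
    fields = np.asarray(fields, dtype=float)
    out = np.empty_like(fields)
    stats = {}
    for d in np.unique(day_of_year):
        idx = day_of_year == d
        mu = fields[idx].mean(axis=0)
        sd = fields[idx].std(axis=0) + eps   # eps guards constant pixels
        out[idx] = (fields[idx] - mu) / sd
        stats[int(d)] = (mu, sd)             # kept to undo the normalisation
    return out, stats

rng = np.random.default_rng(2)
days = np.tile(np.arange(4), 3)              # 4 "days of year" over 3 "years"
data = rng.normal(size=(12, 8, 8)) + days[:, None, None]  # seasonal trend
norm, stats = climatology_normalise(data, days)
```

The stored per-day statistics allow predictions made in the normalised space to be mapped back to physical units.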
Results are displayed in Table 1. Examining the test error for each method, we see that our method demonstrates superior performance, with a lower test error than both the purely data-driven approaches and the more dynamic alternatives. As discussed in Sec. 3, our approach leads to a more parsimonious parameterisation of the convection field than [8]. With regards to run-time, the U-Net model is faster and more memory efficient than the alternatives, including our own, which make use of a recurrent structure to advance the predictions. In spite of this our method still demonstrates competitive run time, with the increased accuracy a more than acceptable trade-off.

In Fig. 5 we plot a representative sequence from the test set, in which we observe that our method, row three, seems to be better at identifying the "whorl" pattern formations that characterise turbulence. The U-Net feature extraction also does a good job of identifying these features, but only on a limited time-horizon, with the structure degrading as the sequence continues; this is most noticeable in the loss of structure of the in-flowing cold flux in the top-right of the images in row four. On the other hand, the method of [8] does a better job capturing and preserving linear features; this is likely because that method is ultimately solving a linear PDE (2a), and identifying a convection model that will do the "violence" [6] of a nonlinear model from data alone is hard. By projecting to a latent space, and allowing this linear space to be adaptively determined by nonlinear transformations of the local inputs, we are better able to recover nonlinear features with a simpler convection model; and while we do pay a visible price in the smoothness of our reconstructions, overall we achieve better performance as presented in Tab. 1.
Spatio-temporal dynamical modelling
Traditional dynamical modelling approaches have been based on either optical flows [9, 30], which estimate the transport vector field by examining the differenced input sequence, or else numerical models [4], where the transporting field emerges as the output of a calibrated physical model. More principled physical models have the ability to display complex long-range spatio-temporal dependencies, but deriving and then calibrating such models is significantly harder than the DL and hybrid methods we consider.
Physics free deep learning
As noted by [2], "physics free" deep learning approaches can be broadly divided into attempts to model the time evolution as a temporal sequence, or else those which treat the problem as learning an unknown "image(s)-to-images" map. The former include the convolutional LSTM model of [23, 24], as well as video generation methods [16, 26]. For the latter option, U-Net based approaches have been popular [3, 12], and [2] argue that the overly simplistic assumptions in optical flow or poorly specified numerical models are often unviable, so that free-form DL methods display superior performance to methods tied to a misspecified physical model. While our work does use an overly simplistic model of the dynamics, by projecting onto the latent space we are able to ensure this simplification does not overly constrain the resulting predictions.
Physics informed deep learning
Aside from the work of [8], which we have discussed in Section 2, we also mention the more complicated U-Net architecture inspired by multiscale approximations to the Navier-Stokes equations in [27]; since this model uses physical information to design a specific U-Net architecture, there is some overlap with the already mentioned images-to-images methods. Finally, we note that we have advanced the solution to the problem by solving something which closely approximates the solution of a PDE. Alternative approaches [19, 5, 33] have avoided this explicit solve step and instead use flexible DL surrogates of the output map, with the PDE operators re-appearing as regularisers. This is a much more dramatic regularisation than the $\lambda_{\mathrm{div}}$ loss we have considered, and we are unaware of this complete operator regularisation scheme having been applied to convective non-linear systems at the scale considered in the experiments.

A PyTorch [17] implementation for the model of [8] is available at https://github.com/emited/flow

Figure 5: Predicted surfaces from an input sequence of length four, top row, compared to a target output sequence of length six, second row, from a test set sequence in Region 2 of Fig. 4.

In this work we have combined the powerful feature extraction capabilities of DL with minimal physical representations of the dynamics, to introduce a physically informed model demonstrating superior predictive performance on the convective physical processes that dominate the atmospheric and fluid transport problems pervading our natural world. To the extent that this model is "physically informed", we have a priori specified only a minimal representation of the dynamics; we justify this by our desire to avoid overly strong, and so impossible to justify, assumptions on the complex generating mechanism, and therefore maintain as much as possible the ability of the model to learn the underlying dynamics from data.
Crucial to this have been our efforts to ensure that the latent representations formed by DL-encoding are more strongly entwined with the dynamics of the process than previous approaches, leading to a model that demonstrates superior predictive performance and improved recovery of temporally persistent nonlinear features.

Broader Impact
Climate change manifests not only as long-term trends, but also as increased short-term volatility of these processes, and as such the development of near-term prediction methods which can be rapidly evaluated and re-calibrated is only increasing in importance. Accurate near-term prediction of rapidly changing events is vital for public safety in situations including extreme flooding, cyclone formation, and forest fires; all of these systems possess the informative transport features which would allow our method to be successfully applied. These applications will continue to add to the evidence we have already provided that physical principles still have an important role to play in designing models, and argue against the "physics free" philosophy. We demonstrate that there is much to be gained from strongly connecting learned latent representations with global model dynamics, achieving not only interpretable latent spaces, but also interpretable dynamics on these spaces. The resulting models demonstrate better performance and have a clearer physical picture, and we envision our work continuing to invite methodological study of the generalisation benefits to be obtained by this stronger entwining between encoded representations and the original data-space dynamics.

Despite the progress we have made, predicting these systems remains a hard problem, and it is unlikely that any one method will work best in all scenarios, so appropriate aggregation strategies should be adopted. However, our discussion of the sea surface predictions in Sec. 4 demonstrates that our model does a better job of predicting certain nonlinear patterns in convective flows, and so will be an effective complement to those existing methods which perform well on more linear flows. Finally, while we have presented an effective supervised learning method, it will be important to extend the work we have done to a generative probabilistic framework to provide further uncertainty quantification.
References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Shreya Agrawal, Luke Barrington, Carla Bromberg, John Burge, Cenk Gazen, and Jason Hickey. Machine learning for precipitation nowcasting from radar images, 2019.

[3] G. Ayzel, M. Heistermann, A. Sorokin, O. Nikitin, and O. Lukyanova. All convolutional neural networks for radar-based precipitation nowcasting.
Procedia Computer Science , 150:186 – 192, 2019. ISSN 1877-0509.doi: https://doi.org/10.1016/j.procs.2019.02.036. URL . Proceedings of the 13th International Symposium “IntelligentSystems 2018” (INTELS’18), 22-24 October, 2018, St. Petersburg, Russia.[4] Dominique Béréziat and Isabelle Herlin. Coupling dynamic equations and satellite images for modellingocean surface circulation.
Communications in Computer and Information Science , 550:191–205, 2015.doi: 10.1007/978-3-319-25117-2\_12. URL https://hal.inria.fr/hal-01245369 .[5] Jens Berg and Kaj Nyström. A unified deep artificial neural network approach to partial differentialequations in complex geometries, 2017.[6] G Berkooz, P Holmes, and J L Lumley. The proper orthogonal decomposition in the analysis of turbulentflows.
Annual Review of Fluid Mechanics , 25(1):539–575, 1993. doi: 10.1146/annurev.fl.25.010193.002543. URL https://doi.org/10.1146/annurev.fl.25.010193.002543 .[7] Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutionalnetworks and the icosahedral CNN. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Proceedingsof the 36th International Conference on Machine Learning , volume 97 of
Proceedings of MachineLearning Research , pages 1321–1330, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/cohen19d.html .[8] Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorpo-rating prior scientific knowledge. In
International Conference on Learning Representations , 2018. URL https://openreview.net/forum?id=By4HsfWAZ .[9] David J. Fleet and Y. Weiss. Optical flow estimation. In Nikos Paragios, Yunmei Chen, and Olivier D.Faugeras, editors,
Handbook of Mathematical Models in Computer Vision . Springer, 2006.
10] A. V. Getling and Edward A. Spiegel.
Rayleigh-Bénard Convection: Structures and Dynamics . ColumbiaUniversity, New York, 1998.[11] K. Karhunen. Zur spektraltheorie stochastischer prozesse.
Annales Academiæ Scientiarum Fennicæ, SeriesA , 1946.[12] Vadim Lebedev, Vladimir Ivashkin, Irina Rudenko, Alexander Ganshin, Alexander Molchanov, SergeyOvcharenko, Ruslan Grokhovetskiy, Ivan Bushmarinov, and Dmitry Solomentsev. Precipitation nowcastingwith satellite imagery. In
Proceedings of the 25th ACM SIGKDD International Conference on KnowledgeDiscovery & Data Mining , KDD ’19, page 2680–2688, New York, NY, USA, 2019. Association forComputing Machinery. ISBN 9781450362016. doi: 10.1145/3292500.3330762. URL https://doi.org/10.1145/3292500.3330762 .[13] M. Loeve. Fonctions aleatoires de second ordre.
Comptes Rendus De L’Académie Des Sciences , 220, 1945.[14] J.L. Lumley. Coherent structures in turbulence. In RICHARD E. MEYER, editor,
Transition andTurbulence , pages 215 – 242. Academic Press, 1981. ISBN 978-0-12-493240-1. doi: https://doi.org/10.1016/B978-0-12-493240-1.50017-X. URL .[15] G. Madec. NEMO ocean engine. Technical report, Institut Pierre-Simon Laplace IPSL, 2008.[16] Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond meansquare error. In Yoshua Bengio and Yann LeCun, editors, , 2016.URL http://arxiv.org/abs/1511.05440 .[17] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, BenoitSteiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems32 , pages 8026–8037. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf .[18] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning frame-work for solving forward and inverse problems involving nonlinear partial differential equations.
Journalof Computational Physics , 378:686 – 707, 2019. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.10.045. URL .[19] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part i):Data-driven solutions of nonlinear partial differential equations, 2017.[20] J Reddy.
An Introduction to the Finite Element Method . McGraw-Hill Education, New York, 2005.[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedicalimage segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi,editors,
Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 , pages 234–241,Cham, 2015. Springer International Publishing. ISBN 978-3-319-24574-4.[22] Barry Saltzman.
Selected Papers on the THEORY OF THERMAL CONVECTION with special applicationto the earth’s planetary atmosphere . Dover, 1962.[23] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang chun Woo. Convolu-tional lstm network: A machine learning approach for precipitation nowcasting.
CoRR , 2015.[24] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong,and Wang-chun WOO. Deep learning for precipitation nowcasting: A benchmark and anew model. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-wanathan, and R. Garnett, editors,
Advances in Neural Information Processing Systems 30 ,pages 5617–5627. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7145-deep-learning-for-precipitation-nowcasting-a-benchmark-and-a-new-model.pdf .[25] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partialdifferential equations.
Journal of Computational Physics , 375:1339 – 1364, 2018. ISSN 0021-9991.doi: https://doi.org/10.1016/j.jcp.2018.08.029. URL .[26] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In D. D.Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural InformationProcessing Systems 29 , pages 613–621. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6194-generating-videos-with-scene-dynamics.pdf .
27] Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-informed deep learning for turbulent flow prediction, 2020. URL https://openreview.net/forum?id=Hkg5lAEtvS .[28] Maurice Weiler and Gabriele Cesa. General e(2)-equivariant steerable cnns. In H. Wallach, H. Larochelle,A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural InformationProcessing Systems 32 , pages 14334–14345. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9580-general-e2-equivariant-steerable-cnns.pdf .[29] Eugene P. Wigner. The unreasonable effectiveness of mathematics in the natural sciences. richard courantlecture in mathematical sciences delivered at new york university, may 11, 1959.
Communicationson Pure and Applied Mathematics , 13(1):1–14, 1960. doi: 10.1002/cpa.3160130102. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160130102 .[30] W.-C. Woo and W.-K. Wong. Operational application of optical flow techniques to radar-based rainfallnowcasting.
Atmosphere , 8(5):48 – 68, 2017.[31] Alireza Yazdani, Maziar Raissi, and George Em Karniadakis. Systems biology informed deep learningfor inferring parameters and hidden dynamics. bioRxiv , 2019. doi: 10.1101/865063. URL .[32] Jian Zhang, Kenneth Howard, Carrie Langston, Brian Kaney, Youcun Qi, Lin Tang, Heather Grams, YadongWang, Stephen Cocks, Steven Martinaitis, Ami Arthur, Karen Cooper, Jeff Brogden, and David Kitzmiller.Multi-radar multi-sensor (mrms) quantitative precipitation estimation: Initial operating capabilities.
Bulletinof the American Meteorological Society , 97(4):621–638, 2016. doi: 10.1175/BAMS-D-14-00174.1. URL https://doi.org/10.1175/BAMS-D-14-00174.1 .[33] Yinhao Zhu, Nicholas Zabaras, P. Koutsourelakis, and Paris Perdikaris. Physics-constrained deep learningfor high-dimensional surrogate modeling and uncertainty quantification without labeled data.
Journal ofComputational Physics , 394, 05 2019. doi: 10.1016/j.jcp.2019.05.024.
Appendices

A Ritz-Galerkin method for PDEs

Since the forward map is unavailable in closed form it becomes necessary to solve the PDE numerically. To accompany the necessarily brief review in the main paper we provide an extended presentation of how to discretise a PDE on the model Poisson problem; see also [20] for more information:

\[ -\Delta u(\mathbf{x}) = f(\mathbf{x}) \quad \text{for } \mathbf{x} \in \Omega, \tag{9a} \]
\[ u(\mathbf{x}) = g(\mathbf{x}) \quad \text{on } \partial\Omega. \tag{9b} \]

To discretise, one first multiplies (9a) by some test function $v \in \hat{V}$ and then performs an integration by parts; it is standard to further assume that the test functions vanish on the boundary, so one arrives at the classical form of the variational problem: find $u \in V$ such that

\[ \int_{\Omega} \nabla u(\mathbf{x}) \cdot \nabla v(\mathbf{x}) \, d\mathbf{x} = \int_{\Omega} f(\mathbf{x}) v(\mathbf{x}) \, d\mathbf{x} - \int_{\partial\Omega} g(\mathbf{s}) v(\mathbf{s}) \, d\mathbf{s} \tag{10} \]

for any test function $v \in \hat{V}$; this is the weak form of the classical problem (9).

To implement this idea numerically one replaces the trial and test spaces by finite-dimensional subspaces $V_h$ and $\hat{V}_h$ respectively, with corresponding basis functions denoted by $\phi_i$ and $\hat{\phi}_i$. Once we have specified this basis we can then search for solutions with a representation of the form $u(\mathbf{x}) = \sum_{j=1}^{M} (\xi_u)_j \phi_j(\mathbf{x})$, where $\xi_u \in \mathbb{R}^M$ is the vector of coefficients parameterising the finite-dimensional representation of $u$. Inserting this representation into (10) we arrive at the finite-dimensional version of the weak form, given by the system of equations

\[ \sum_{j=1}^{M} (\xi_u)_j \int_{\Omega} \nabla \phi_j(\mathbf{x}) \cdot \nabla \hat{\phi}_i(\mathbf{x}) \, d\mathbf{x} = \int_{\Omega} f(\mathbf{x}) \hat{\phi}_i(\mathbf{x}) \, d\mathbf{x} - \int_{\partial\Omega} g(\mathbf{s}) \hat{\phi}_i(\mathbf{s}) \, d\mathbf{s}, \]

which must hold for each basis function $\hat{\phi}_i$, $i = 1, \ldots, M$. We may represent this in equivalent matrix-vector notation as $A \xi_u = \mathbf{f}$, where the stiffness matrix and load vector are given by

\[ (A)_{ij} = \int_{\Omega} \nabla \phi_j(\mathbf{x}) \cdot \nabla \hat{\phi}_i(\mathbf{x}) \, d\mathbf{x}, \tag{11a} \]
\[ (\mathbf{f})_i = \int_{\Omega} f(\mathbf{x}) \hat{\phi}_i(\mathbf{x}) \, d\mathbf{x} - \int_{\partial\Omega} g(\mathbf{s}) \hat{\phi}_i(\mathbf{s}) \, d\mathbf{s}. \tag{11b} \]
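To make the assembly of the stiffness matrix and load vector concrete, the following is a minimal numpy sketch for the one-dimensional analogue $-u'' = f$ on $(0, 1)$ with homogeneous Dirichlet data and piecewise-linear (hat) basis functions. The uniform mesh, the lumped quadrature for the load, and the function name are illustrative assumptions, not the discretisation used in our experiments.

```python
import numpy as np

def solve_poisson_1d(f, n):
    """Ritz-Galerkin solution of -u'' = f on (0, 1) with u(0) = u(1) = 0,
    using n interior piecewise-linear (hat) basis functions on a uniform mesh."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)              # interior mesh nodes
    # Stiffness matrix: (A)_ij = integral of phi_j' phi_i' -> tridiag(-1, 2, -1)/h
    A = (np.diag(2.0 * np.ones(n))
         - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h
    # Load vector with a lumped quadrature: (f)_i ~ h * f(x_i)
    b = h * f(x)
    xi = np.linalg.solve(A, b)                # coefficient vector xi_u
    return x, xi

# For f = 1 the exact solution is u(x) = x(1 - x)/2; linear elements
# reproduce it at the nodes, so the nodal error is essentially roundoff.
x, u = solve_poisson_1d(lambda x: np.ones_like(x), 199)
exact = x * (1 - x) / 2
print(np.max(np.abs(u - exact)))
```

In higher dimensions only the assembly loops change; the resulting linear system $A \xi_u = \mathbf{f}$ has exactly the structure of (11).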
Following the same procedure we can arrive at a weak-form discretisation of the transport equation (1). In this case the discrete representation analogous to (11a) is given by $L[z]$, where

\[ (L[z])_{ij} = \int_{\Omega} a(\mathbf{x}) \nabla \phi_j(\mathbf{x}) \cdot \nabla \hat{\phi}_i(\mathbf{x}) \, d\mathbf{x} + \int_{\Omega} \mathbf{w}(\mathbf{x}) \cdot \nabla \phi_j(\mathbf{x}) \, \hat{\phi}_i(\mathbf{x}) \, d\mathbf{x}. \tag{12} \]

B Network architectures
U-Net architectures
We use a U-Net [21] both for our basis feature extractor and, in its own right, as a benchmark DL method. These take the standard U-Net form, being divided into an encoder and a decoder: the encoder is a sequence of computational units called Downsample Blocks and the decoder a sequence of Upsample Blocks; long-range skip connections then connect each downsample block with the corresponding upsample block in the decoding phase. The basic form of these blocks is given by

• Downsample Block: MaxPooling → Conv2D → ReLU → Conv2D → ReLU
• Upsample Block: Upsample → Conv2D → ReLU → Conv2D → ReLU → Conv2D

where the upsampling operation doubles the pixel resolution in both directions using nearest-neighbour interpolation. The U-Net used in the experiments has four downsample and upsample blocks; the smaller U-Net used as our feature extractor has three.
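The upsampling operation used in these blocks can be sketched in a few lines of numpy; this standalone version is illustrative only, standing in for the framework layer used in practice:

```python
import numpy as np

def upsample_nearest(x):
    """Double the pixel resolution of an (H, W, C) array in both spatial
    directions using nearest-neighbour interpolation: each pixel becomes
    a 2x2 block of identical values."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

patch = np.arange(4, dtype=float).reshape(2, 2, 1)
up = upsample_nearest(patch)
print(up.shape)   # (4, 4, 1)
```

Because the operation has no learnable parameters, the subsequent Conv2D layers are responsible for smoothing the blocky interpolation.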
Convolution decoder
As discussed in Sec. 3 our method estimates the transport vector field and diffusion coefficient from the latent variables z ∈ R^(ℓ×M); this is done in two phases, first using a sequence of dense networks, and then convolutions. For the 64 × 64 image patches used in the sea surface temperature experiment this network takes the form

• Dense phase: Flatten → Dense(64, 'relu') → Dense(256, 'relu') → Dense(1024, 'relu') → Dense(1024, 'relu') → Dense(1024, 'relu')
• Convolution phase: Reshape((4, 4, 64)) → Upsample → Conv2D(32, k, 'relu') → Upsample → Conv2D(16, k, 'relu') → Upsample → Conv2D(8, k, 'relu') → Upsample → Conv2D(3, k)

where Dense(n, 'relu') is a densely connected layer with n features and ReLU activation function. Similarly Conv2D(n, k, 'relu') is a 2D convolution layer with n features and a k × k kernel. The Upsample operation is the same as for the U-Net architecture just described. The Flatten operation takes the sequence of latent states and flattens them into a single ℓ × M vector.

The output of this network is a (64, 64, 3) image, which we finally split on the last dimension to produce the (64, 64, 1) diffusion field and the (64, 64, 2) transport vector field; the diffusion field is passed through an additional softplus layer to ensure positivity.

C Further experimental details
C.1 NEMO sea surface temperature data
The NEMO (Nucleus for European Modelling of the Ocean) engine [15] is a state-of-the-art framework for modelling the convective processes that drive the world's oceans, including temperature, salinity, sea ice area and thickness, etc., and is provided by the Copernicus Marine Service portal with product identifier GLOBAL_ANALYSIS_FORECAST_PHY_001_024. While it is a simulation engine, it accumulates historical data to produce a synthesised estimate of the state of the system using data assimilation methods; as such the data does follow the true temperature, making this an ideal dataset to apply our method to.

[Figure 6: Estimated convection field for our model.]

Our dataset is formed by first decomposing a larger study region into 64 by 64 pixel image patches; these are displayed in Figure 4. Our regions differ from those used in [8] in that we have selected our grouping to more closely follow the movement of the Gulf Stream up the Eastern seaboard of North America and then across the Atlantic, mixing with the colder water from the Arctic regions. This choice allows us to model the most interesting regions, with the complex nonlinear mixing dynamics we are most interested in.

Due to a re-calibration of the GLOBAL_ANALYSIS_FORECAST_PHY_001_024 product we only have access to data from January 1st, 2016 onwards and so cannot use the same temporal range as considered in [8]. Instead we use data from 2016–2018 as training data, and the data from 2019 to test on. We also partitioned our data into length-4 input sequences, and then predicted on the next 8 frames, leading to 11935 training examples and 3984 test examples; we also test on the same set of regions as used for training.
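The partitioning of each region's frame stack into length-4 input sequences and 8-frame prediction targets can be sketched as a sliding window over the time axis; the helper below is a hypothetical illustration, not our actual data pipeline:

```python
import numpy as np

def make_sequences(frames, n_in=4, n_out=8):
    """Split a (T, H, W) stack of frames into (input, target) pairs:
    n_in consecutive frames form an input, the following n_out frames
    form the corresponding prediction target."""
    T = frames.shape[0]
    inputs, targets = [], []
    for t in range(T - n_in - n_out + 1):
        inputs.append(frames[t : t + n_in])
        targets.append(frames[t + n_in : t + n_in + n_out])
    return np.stack(inputs), np.stack(targets)

frames = np.zeros((20, 64, 64))    # e.g. 20 daily frames of one 64x64 region
X, Y = make_sequences(frames)
print(X.shape, Y.shape)            # (9, 4, 64, 64) (9, 8, 64, 64)
```

Applying such a window over every region and every year yields example counts of the order reported above.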
Additional figures
In Fig. 6 we display the predictions from our model first presented in Fig. 5, but now also including the convection field used to advance our model. This vector field displays both long-range stream lines, as well as localised regions in which a particle rotates around a centre; furthermore the richness of all of these fields was obtained by decoding an ℓ × M-dimensional latent vector, and not as an image-to-image transform. This parsimonious deconvolution from the latent space, combined with the ability of our method to learn complex vector fields, is the principal component of its demonstrated success on convective fields such as the sea surface temperature data.

C.2 Precipitation nowcasting
For this experiment we consider the problem of providing short-term precipitation forecasts from Doppler radar images, as considered in [2]. The data comes from the multi-radar multi-sensor (MRMS) system developed by the NOAA National Severe Storms Laboratory, and provides precipitation rate updates at 2 minute intervals at a spatial resolution of 1 km × 1 km. We restrict our attention to regions in the Pacific Northwest, and then further restrict to the two wettest regions in the training set. We train on data from January 2018 and then test on the first week of November 2017, giving 8904 training instances and 2568 test instances. Following [2] we frame the problem as a classification problem and discretise the precipitation data using three thresholds in units of millimetres of rain per hour; these ranges are given by [0, 0.1), [0.1, 1.0), [1.0, 2.5) and [2.5, ∞), and the model is then trained using a cross-entropy loss targeting these pixel values.

The resulting accuracy on the classification problem is displayed in Table 2, which demonstrates that there is little to choose between the two approaches in this instance. It would appear that once the process has been coarsened in this way there is no significant difference between the physics free method and our physics informed method on the precipitation data; the implications of this are the subject of further study. However, it is ultimately reassuring that our method proves no worse on this problem, and indeed significantly better on the "full-image" temperature data reported in Sec. 4.

Table 2: One hour ahead precipitation nowcasting accuracy.

                Training accuracy   Test accuracy
U-Net [12, 2]        0.9353            0.8047
DOD-Convec           0.8445            0.8052
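The discretisation of the precipitation rate into four classes can be sketched with numpy's digitize; the threshold values here follow the bin edges quoted above and are a reconstruction, not code from our implementation:

```python
import numpy as np

# Rate thresholds in mm/hr separating the four precipitation classes
# (reconstructed from the bin edges quoted in the text).
THRESHOLDS = np.array([0.1, 1.0, 2.5])

def discretise_rain(rate_mm_hr):
    """Map a precipitation-rate array to class labels 0..3, where label i
    means the rate falls in the i-th half-open interval defined by the
    thresholds: [0, 0.1), [0.1, 1.0), [1.0, 2.5), [2.5, inf)."""
    return np.digitize(rate_mm_hr, THRESHOLDS)

rates = np.array([0.0, 0.5, 1.7, 3.0])
print(discretise_rain(rates))   # [0 1 2 3]
```

A per-pixel cross-entropy loss over these labels then gives the classification training objective described above.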