Learning Variational Data Assimilation Models and Solvers
Ronan Fablet, Bertrand Chapron, Lucas Drumetz, Etienne Memin, Olivier Pannekoucke, Francois Rousseau
R. Fablet (corresponding author), IMT Atlantique, UMR CNRS Lab-STICC, Brest, FR ([email protected])
B. Chapron, Ifremer, UMR CNRS LOPS, Brest, FR ([email protected])
L. Drumetz, IMT Atlantique, UMR CNRS Lab-STICC, Brest, FR ([email protected])
E. Memin, INRIA Rennes, UMR CNRS IRMAR, Rennes, FR ([email protected])
O. Pannekoucke, INPT-ENM, UMR CNRS CNRM, CERFACS, Toulouse, FR ([email protected])
F. Rousseau, IMT Atlantique, UMR INSERM Latim, Brest, FR ([email protected])
July 28, 2020

Abstract
This paper addresses variational data assimilation from a learning point of view. Data assimilation aims to reconstruct the time evolution of some state given a series of observations, possibly noisy and irregularly-sampled. Using automatic differentiation tools embedded in deep learning frameworks, we introduce end-to-end neural network architectures for data assimilation. They combine two key components: a variational model and a gradient-based solver, both implemented as neural networks. A key feature of the proposed end-to-end learning architecture is that we may train the NN models using both supervised and unsupervised strategies. Our numerical experiments on Lorenz-63 and Lorenz-96 systems report significant gains w.r.t. a classic gradient-based minimization of the variational cost, both in terms of reconstruction performance and optimization complexity. Intriguingly, we also show that the variational models derived from the true Lorenz-63 and Lorenz-96 ODE representations may not lead to the best reconstruction performance. We believe these results may open new research avenues for the specification of assimilation models in geoscience.

Keywords: variational data assimilation, geophysical dynamics, partial observations, dynamical model, end-to-end learning, meta-learning, neural networks

1 Introduction

In geoscience, the reconstruction of the dynamics of a given state or process from a sequence of partial and noisy observations of the system is referred to as a data assimilation issue. Data assimilation is at the core of a wide range of applications, including operational ones, with the aim to make the most of available observation datasets, including for instance both in situ and satellite-derived data, and state-of-the-art numerical models [13]. Broadly speaking, a vast family of data assimilation methods comes down to the minimization of some energy or functional which involves two terms, a dynamical prior and an observation term. We may distinguish two main categories of data assimilation approaches [13]: variational data assimilation and statistical data assimilation. Variational data assimilation relies on a continuous state-space formulation and results in a gradient-based minimization of the defined variational cost. By contrast, statistical data assimilation schemes generally rely on iterative formulations of stochastic filtering techniques, such as Kalman and particle filters.

Whereas data assimilation naturally applies to model-driven settings, where the formulation of the dynamical and observation models follows from some physical expertise, data-driven strategies have emerged relatively recently [19, 5] as possible alternatives. The focus has mainly been given to the identification of data-driven representations of the dynamical model, which may be directly plugged into state-of-the-art assimilation frameworks, especially statistical ones [19, 25, 5]. Recent advances have also been reported for the data-driven identification of the dynamical model from partial, irregularly-sampled and/or noisy observation data [24, 5, 30].

Here, we aim to further explore how data-driven strategies and associated machine learning schemes may be of interest for data assimilation issues. Especially, end-to-end learning strategies aim to build so-called end-to-end neural network architectures, so that one can learn all the components of the architecture w.r.t. some target to be predicted from input data, whereas the traditional approach usually relies on designing each component relatively independently. Many fruitful applications of end-to-end learning strategies have been recently reported in the deep learning literature [32, 7, 11], including for solving inverse problems in image and signal processing. In this work, we introduce a novel end-to-end learning framework for data assimilation, which uses as inputs a sequence of observations and delivers as outputs a reconstructed state sequence. Following a variational assimilation formulation, our key contributions are as follows:

• the proposed end-to-end learning architecture combines two main components: a first neural-network architecture dedicated to the definition and evaluation of the variational assimilation cost, and a second neural-network architecture corresponding to a gradient-based solver of some target loss function. The latter exploits ideas similar to optimizer learning [1, 34, 20] and directly benefits from automatic differentiation tools embedded in deep learning frameworks, so that we do not need to derive explicitly the adjoint operator of the considered dynamical model;

• given some training criterion, this end-to-end learning architecture provides new means to train all the unknown parameters of the considered neural network architectures. Interestingly, we may consider as training criterion the classic observation-driven data assimilation cost, as well as a classic reconstruction error assuming we are provided with groundtruthed data;

• we report numerical experiments on Lorenz-63 and Lorenz-96 dynamics, which support the relevance of the proposed framework w.r.t. classic variational data assimilation schemes. Trained NN architectures may both speed up the assimilation process, as NN solvers may greatly reduce the number of gradient-based iterations, and lead to a very significant improvement of the reconstruction performance if groundtruthed data are available;

• intriguingly, our experiments question the way dynamical priors are defined and calibrated in data assimilation schemes, and suggest that approaches based on the sole forecasting performance of the dynamical model may not be optimal for assimilation purposes.

This paper is organized as follows. Section 2 introduces the considered variational formulation within a deep learning paradigm. We detail the proposed end-to-end learning framework in Section 3. Numerical experiments on Lorenz-63 and Lorenz-96 dynamics are reported in Section 4. We further discuss our key contributions in Section 6.
2 Variational data assimilation formulation

We introduce in this section the considered variational formulation, which provides the basis for the proposed end-to-end learning framework. Let us consider the following continuous state-space formulation:

$$\frac{\partial x(t)}{\partial t} = \mathcal{M}\left(x(t)\right) + \eta(t), \qquad y(t) = x(t) + \epsilon(t), \quad \forall t \in \{t_0, t_0 + \Delta t, \dots, t_0 + N \Delta t\} \qquad (1)$$

with x the considered time-dependent state, M the dynamical model and {t₀, t₀ + Δt, ..., t₀ + NΔt} the observation times. Observations {y(tᵢ)} may only be partial, meaning that some components of y(tᵢ) are not observed. Let Ω_{tᵢ} denote the domain corresponding to the observed components of y(tᵢ) at time tᵢ. Ω_{tᵢ} may refer to a spatial domain for spatio-temporal dynamics, or to a list of indices for a multivariate state sequence. Processes η and ε represent respectively model errors and observation errors.

The data assimilation issue, i.e. the reconstruction of the hidden state sequence x given a series of observations {y(tᵢ)}, may be stated as the minimization of the following cost:

$$U_\Phi(x, y, \Omega) = \lambda_1 \sum_i \left\| x(t_i) - y(t_i) \right\|^2_{\Omega_{t_i}} + \lambda_2 \sum_i \left\| x(t_i) - \Phi(x)(t_i) \right\|^2 \qquad (2)$$

where ‖·‖²_Ω stands for the evaluation of the quadratic norm restricted to domain Ω, e.g. ‖u‖²_Ω = ∫_Ω u(p)² dp for a scalar 2D state u. This accounts both for an irregular space-time sampling of the observations and for an observation sampling coarser than the assimilation time step. Φ is the flow operator:

$$\Phi(x)(t) = x(t - \Delta) + \int_{t - \Delta}^{t} \mathcal{M}\left(x(u)\right)\, du \qquad (3)$$

In the cost function (2), the first term is a weighted measure of the distance from x to the data, while the second term measures the distance between the empirical and the theoretical dynamics, considering the forecast model as imperfect. This cost function is known as the weak-constraint 4D-Var (see e.g. [33]). Note that, for the sake of simplicity, we do not consider a background term, often used to measure the distance from x to a given background state and corresponding to a Tikhonov regularization. Using a vectorial formulation, we may rewrite variational cost U_Φ(x, y, Ω) as:

$$U_\Phi(x, y, \Omega) = \lambda_1 \left\| x - y \right\|^2_\Omega + \lambda_2 \left\| x - \Phi(x) \right\|^2 \qquad (4)$$

with ‖x − y‖²_Ω the observation term and ‖x − Φ(x)‖² the dynamical prior. Within a variational assimilation setting, the minimization of this variational cost typically exploits an iterative gradient-based scheme given some initial estimate x^(0), the simplest one being a fixed-step gradient descent:

$$x^{(k+1)} = x^{(k)} - \alpha \nabla_x U_\Phi\left(x^{(k)}, y, \Omega\right) \qquad (5)$$

This requires the computation of the gradient operator ∇_x U_Φ using the adjoint method [3, 4]. The computational implementation of the adjoint method for operator Id − Φ may derive from the continuous formulation of Φ, as well as from the discretized version of dynamical model M, to ensure full consistency between the forward integration of model M and the adjoint formulation [2].

Within a deep learning framework, given a neural-network (NN) formulation for operator Φ, one may straightforwardly use automatic differentiation tools, typically referred to as the backward step in deep learning, to compute ∇_x U_Φ. Such tools are embedded in frameworks such as pytorch, tensorflow and jax (https://jax.readthedocs.io/) in Python; we may also cite automatic differentiation tools in Julia. Here, we use pytorch and provide our associated code at https://github.com/CIA-Oceanix/DinAE_VarNN_torchDev. In this paper, we investigate such NN formulations for operator Φ and also design NN architectures for the associated solver, i.e. an iterative gradient-based inversion algorithm based on variational cost (4). Interestingly, the resulting end-to-end NN architecture provides means both to learn a solver for a predefined variational setting such as (4) and to jointly learn operator Φ and the solver w.r.t. some reconstruction criterion, which may be different from variational cost (4).
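To make this explicit, the following minimal pytorch sketch implements variational cost (4) and its gradient w.r.t. the state using autograd, in place of a hand-derived adjoint. It is an illustration rather than the released code: the function names, the (batch, dim, time) tensor layout and the masked implementation of the Ω-restricted norm are our assumptions, while the default weights λ₁ = 0.1 and λ₂ = 1.0 follow the values used in Section 4.

```python
import torch

def var_cost(x, y, mask, phi, lambda1=0.1, lambda2=1.0):
    """Weak-constraint 4D-Var cost U_Phi of Eq. (4).

    x, y, mask: (batch, dim, time) tensors; mask is 1 on the observed
    domain Omega and 0 elsewhere. phi is a NN flow operator such that
    phi(x) is a one-step-ahead prediction of the state sequence x.
    """
    obs_term = torch.sum(mask * (x - y) ** 2)  # ||x - y||^2 restricted to Omega
    dyn_term = torch.sum((x - phi(x)) ** 2)    # dynamical prior ||x - Phi(x)||^2
    return lambda1 * obs_term + lambda2 * dyn_term

def var_cost_grad(x, y, mask, phi):
    """Gradient of U_Phi w.r.t. the state x, via automatic differentiation.

    create_graph=True keeps the gradient itself differentiable, so that a
    trainable solver built on top of it can be trained end-to-end.
    """
    if not x.requires_grad:  # leaf input at the first solver iteration
        x.requires_grad_(True)
    u = var_cost(x, y, mask, phi)
    return torch.autograd.grad(u, x, create_graph=True)[0]
```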
3 End-to-end learning framework

In this section, we detail the proposed end-to-end learning framework based on the NN implementation of variational formulation (4). We first introduce two different types of parameterizations for operator Φ in (4), which refer to explicit ODE-based and PDE-based representations (Section 3.1) and constrained CNN representations (Section 3.2). We then detail the NN-based solver for the targeted minimization issue.

3.1 ODE-based and PDE-based NN representations for operator Φ

Here, we assume that we know the dynamical operator M in (1), or at least the parametric family it belongs to, M ∈ F = {M_θ with θ ∈ A ⊂ ℝ^N}, with N the number of parameters of the model:

$$\frac{\partial x(t)}{\partial t} = \mathcal{M}_\theta\left(x(t)\right) \qquad (6)$$
Considering an explicit Euler integration scheme, operator Φ is defined as:

$$\Phi(x)(t) = x(t - \Delta) + \Delta \cdot \mathcal{M}_\theta\left(x(t - \Delta)\right) \qquad (7)$$

with Δ the integration time step. This resorts to a residual block [16] given the NN-based implementation of model M_θ. We refer the reader to Section 4 for such examples with Lorenz-63 and Lorenz-96 dynamics. We may also consider higher-order explicit schemes, similarly to [12]. For example, using a fourth-order Runge-Kutta scheme, operator Φ is given by:

$$\Phi(x)(t) = x(t - \Delta) + \Delta \cdot \sum_{i=1}^{4} \alpha_i\, k_i\left(x(t - \Delta), \mathcal{M}_\theta\right) \qquad (8)$$

where α₁ = α₄ = 1/6, α₂ = α₃ = 2/6, k_i(x(t), M_θ) = M_θ(x(t) + β_i · Δ · k_{i−1}), with k₀ = 0, β₁ = β₂ = β₃ = 1/2 and β₄ = 1. In such formulations, operator Φ computes a one-step-ahead prediction of the input state sequence. We may emphasize that the output of the NN-based implementation of operator Φ has the same size as the input sequence. In a similar fashion, this also applies to PDE formulations and could be combined with the automatic generation of NN architectures from symbolic PDEs as proposed in [26].

3.2 Constrained CNN representations for operator Φ

We also consider CNN architectures for operator Φ. CNN architectures whose possible parameterization includes the identity operator Φ(x) = x shall be excluded, as they would result in meaningless priors in the minimization of energy (4). Among classic architectures satisfying this constraint, we may cite auto-encoders such that

$$\Phi(x) = \phi\left(\psi(x)\right) \qquad (9)$$

where operator ψ maps the input variable x to a lower-dimensional space. When considering dynamical systems, auto-encoders may not seem highly relevant to deal with fine-scale processes. Here, following [14], we focus on another class of CNN architectures, where ψ is a one-layer CNN in which the central values of all convolution kernels are set to zero, such that ψ(x)(s) at position s does not depend on variable x(s). Here, s typically refers to a tuple (t, k) of a time index t and a feature index k; this would also apply to multivariate space-time tensors. φ is a CNN composing a number of convolution and activation layers, where the kernel size of all convolution layers is 1 along the time and/or space dimensions of process x. We refer to these architectures as GENNs (Gibbs Energy Neural Networks), as they can be regarded as a NN implementation of Gibbs energies [28].

Interestingly, within this category of CNN representations for operator Φ, we may consider multi-scale representations to capture patterns of interest at different scales. Here, we consider two-scale representations

$$\Phi(x) = Up\left(\Phi_1\left(Dw(x)\right)\right) + \Phi_2(x) \qquad (10)$$

where operators Φ₁,₂ follow the constrained parameterization introduced above, Dw is a downsampling operator implemented as an average pooling layer, and Up an upsampling operator implemented as a ConvTranspose layer.
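As an illustration of these two families of parameterizations, the sketch below implements the RK4-based flow operator (8) and a two-scale constrained architecture in the spirit of (10). It is a minimal sketch under our own assumptions (module names, a 1D time layout, the 30-channel width borrowed from the experiments of Section 4), not the exact released implementation.

```python
import torch
import torch.nn as nn

class RK4Flow(nn.Module):
    """Flow operator Phi of Eq. (8): a fourth-order Runge-Kutta step of
    dx/dt = M_theta(x), applied as a residual block."""
    def __init__(self, m_theta, dt):
        super().__init__()
        self.m, self.dt = m_theta, dt

    def forward(self, x):
        k1 = self.m(x)
        k2 = self.m(x + 0.5 * self.dt * k1)
        k3 = self.m(x + 0.5 * self.dt * k2)
        k4 = self.m(x + self.dt * k3)
        return x + (self.dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

class GENN(nn.Module):
    """Constrained CNN of Section 3.2: psi is a one-layer convolution whose
    central kernel values are zeroed, so that psi(x)(s) does not depend on
    x(s); phi stacks kernel-size-1 convolutions and activations."""
    def __init__(self, dim, channels=30, ksize=3):
        super().__init__()
        self.psi = nn.Conv1d(dim, channels, ksize, padding=ksize // 2, bias=False)
        self.phi = nn.Sequential(
            nn.Conv1d(channels, channels, 1), nn.ReLU(),
            nn.Conv1d(channels, dim, 1))

    def forward(self, x):  # x: (batch, dim, time)
        with torch.no_grad():  # re-impose the zero-center constraint
            self.psi.weight[:, :, self.psi.weight.size(-1) // 2] = 0.0
        return self.phi(self.psi(x))

class TwoScaleGENN(nn.Module):
    """Two-scale representation of Eq. (10): Phi(x) = Up(Phi1(Dw(x))) + Phi2(x)."""
    def __init__(self, dim):
        super().__init__()
        self.phi1, self.phi2 = GENN(dim), GENN(dim)
        self.dw = nn.AvgPool1d(kernel_size=2)                 # downsampling Dw
        self.up = nn.ConvTranspose1d(dim, dim, 2, stride=2)   # upsampling Up

    def forward(self, x):
        return self.up(self.phi1(self.dw(x))) + self.phi2(x)
```

Note that the zero-center constraint on ψ is re-imposed at each forward pass, since gradient updates would otherwise reintroduce non-zero central kernel values; the two-scale module assumes an even sequence length, so that the pooling and transposed convolution match in size.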
3.3 Gradient-based NN solvers

Given a NN formulation for operator Φ, we design a NN solver for the targeted minimization based on criterion (4). Using automatic differentiation tools embedded in deep learning frameworks, we can evaluate the gradient of variational cost (4) w.r.t. variable x, denoted as ∇_x U_Φ (we refer the reader to the online code available at https://github.com/CIA-Oceanix for details on the use of autograd functions in pytorch). Using similar ideas as meta-learning schemes [1, 17], we can design recurrent neural networks to implement gradient-based solvers for the targeted data assimilation issue. Here, we investigate two types of solvers, corresponding to the following iterative updates:

• an LSTM-based solver, which updates the state at the k-th iteration as

$$g^{(k+1)} = LSTM\left[\alpha \cdot \nabla_x U_\Phi\left(x^{(k)}, y, \Omega\right),\, h^{(k)}, c^{(k)}\right], \qquad x^{(k+1)} = x^{(k)} - \mathcal{L}\left(g^{(k+1)}\right) \qquad (11)$$

where α is a scalar parameter, {h^(k), c^(k)} are the internal states of the LSTM model and L is a linear layer mapping the LSTM output to the space spanned by state x;
• a CNN-based solver, which updates the state at the k-th iteration as

$$g^{(k+1)} = \mathcal{C}\left[\alpha \cdot \nabla_x U_\Phi\left(x^{(k)}, y, \Omega\right),\, g^{(k)}\right], \qquad x^{(k+1)} = x^{(k)} - \mathcal{A}\left(g^{(k+1)}\right) \qquad (12)$$

where C is a CNN with a sequence of convolutional layers with ReLU activations, which uses as inputs the concatenation of gradient ∇_x U_Φ(x^(k), y, Ω) and of the previous update g^(k), and A refers to an atan activation to avoid exploding updates, especially in the early steps of the learning process.

These two types of updates are used as residual blocks to design a residual network (ResNet) with a predefined number of iterations (typically, from 5 to 20 in our experiments). Overall, as sketched in Fig. 1, the resulting end-to-end architecture uses as inputs an initial state x^(0), an observation series y and the associated observation domain Ω to account for missing values, and aims to reconstruct the hidden state x. In terms of neural network models, it combines a NN model for variational cost U_Φ and a ResNet solver, denoted as Γ. Let us denote by Ψ_{Φ,Γ}(x^(0), y, Ω) the resulting end-to-end model.

Figure 1: Sketch of the proposed end-to-end architecture. Given a partial observation y with missing-data mask Ω and an initialization x^(0) for the unknown state x to be reconstructed, the proposed neural network architecture relies on an iterative gradient-based solver. It exploits a residual architecture with a predefined number of residual blocks, where the k-th residual unit uses as input the gradient ∇_x U_Φ(x^(k−1), y, Ω) of variational cost U_Φ w.r.t. state x, evaluated at the output of the previous residual step. This gradient derives from automatic differentiation tools embedded in deep learning frameworks, e.g. the autograd functions in pytorch.

3.4 Learning strategies

Given the proposed end-to-end architecture, we may consider different learning strategies. The first strategy assumes that NN operator Φ is known: we may be provided with an ODE or PDE representation for dynamical model M, or we may perform as a preliminary step the supervised identification of operator Φ using some representative dataset for state sequence x [15, 31, 26]. In such a situation, given some observation dataset comprising a number of observation series {y₁, ..., y_N} with associated missing-data masks {Ω₁, ..., Ω_N}, we can address the learning of NN solver Γ as the minimization of variational cost (4), which leads to the following learning loss:

$$\mathcal{L} = \sum_n U_\Phi\left(\Psi_{\Phi,\Gamma}\left(x^{(0)}_n, y_n, \Omega_n\right), y_n, \Omega_n\right) \qquad (13)$$

where parameters λ₁,₂ in the definition of energy U_Φ in (4) are set a priori. This learning strategy is the standard criterion considered in variational data assimilation. It may be regarded as an unsupervised setting, in the sense that no groundtruthed data are available for the reconstruction of state x from observation data (y, Ω).

Interestingly, we may also consider a supervised setting, where the training dataset comprises observation series {y₁, ..., y_N}, associated missing-data masks {Ω₁, ..., Ω_N} and true states {x₁, ..., x_N}. Here, we can consider as learning loss the reconstruction error:

$$\mathcal{L} = \sum_n \left\| x_n - \Psi_{\Phi,\Gamma}\left(x^{(0)}_n, y_n, \Omega_n\right)\right\|^2 \qquad (14)$$

This is a classic supervised strategy, where we aim to train the end-to-end architecture so that the reconstruction error of the true state given the observation sequence is minimized. We may point out that, in this supervised setting, the gradient used as input in the trainable solver is not the gradient of the training loss but the gradient of the variational cost, whose computation only involves observation data and not the true states.
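The sketch below assembles these components: a CNN-based solver implementing update (12) as a predefined number of residual iterations, together with the unsupervised loss (13) and the supervised loss (14). It reuses the hypothetical var_cost and var_cost_grad helpers sketched in Section 2; the layer widths and the zero initialization of g are illustrative choices of ours.

```python
import torch
import torch.nn as nn

class ConvSolver(nn.Module):
    """End-to-end model Psi_{Phi,Gamma} with the CNN update of Eq. (12):
    g_{k+1} = C[alpha * grad_x U_Phi, g_k],  x_{k+1} = x_k - atan(g_{k+1})."""
    def __init__(self, phi, dim, hidden=20, n_iter=20, alpha=0.1):
        super().__init__()
        self.phi, self.hidden = phi, hidden
        self.n_iter, self.alpha = n_iter, alpha
        self.c = nn.Sequential(
            nn.Conv1d(dim + hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1))
        self.out = nn.Conv1d(hidden, dim, 1)  # maps g back to the state space

    def forward(self, x0, y, mask):
        x = x0
        g = torch.zeros(x0.size(0), self.hidden, x0.size(-1), device=x0.device)
        for _ in range(self.n_iter):  # residual blocks of the solver
            grad = var_cost_grad(x, y, mask, self.phi)  # autograd, Section 2
            g = self.c(torch.cat((self.alpha * grad, g), dim=1))
            x = x - torch.atan(self.out(g))             # bounded update
        return x

def unsupervised_loss(model, x0, y, mask):
    """Training loss (13): the variational cost itself (no ground truth)."""
    return var_cost(model(x0, y, mask), y, mask, model.phi)

def supervised_loss(model, x0, y, mask, x_true):
    """Training loss (14): reconstruction error w.r.t. the true states."""
    return torch.sum((x_true - model(x0, y, mask)) ** 2)
```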
| Learning scheme | Model | Optim | R-score | 4DVar-score | ODE-score |
|---|---|---|---|---|---|
| No learning | L63-RK4 | FSGD | 3.55 | 7.70e-3 | < 1e-4 |
| Unsupervised learning | L63-RK4 | aG-Conv | 3.52 | 1.05e-2 | < 1e-4 |
| | | aG-LSTM | - | - | < 1e-4 |
| | L63-GENN | aG-Conv | 29.85 | - | 2.0e-4 |
| | | aG-LSTM | 7.05 | xxxx | 2.0e-4 |
| Supervised learning | L63-RK4 | aG-Conv | 2.82 | 9.24e-2 | < 1e-4 |
| | | aG-LSTM | 4.00 | 9.36e-2 | < 1e-4 |
| | L63-GENN | aG-Conv | 1.34 | - | 0.16 |
Table 1: Lorenz-63 experiments. We report the reconstruction performance of the proposed framework using unsupervised and supervised learning schemes. We consider two types of representations of the dynamics through operator Φ (see the main text for details): a NN derived from the known ODE using an RK4 integration scheme (L63-RK4) and a two-scale GENN representation (L63-GENN). Regarding NN solver Γ, we consider two types of architectures using automatic differentiation to compute the gradient of assimilation cost (4) w.r.t. state x: LSTM-based solvers (aG-LSTM) and CNN-based ones (aG-Conv). As baseline, referred to as FSGD, we report the performance of a fixed-step gradient descent for assimilation cost (4) with the known ODE model and a fourth-order Runge-Kutta (RK4) integration scheme. We evaluate three performance metrics: the mean square error of the reconstruction of the true state (R-score), the value of the assimilation cost (4) for the reconstructed state x (4DVar-score) and the mean square error of the one-step-ahead prediction of the considered dynamical prior Φ (ODE-score).

Given these learning losses, we implement all models using the pytorch framework and the Adam optimizer. We typically increase incrementally the number of iterations of the gradient-based NN solver (typically from 5 iterations to 20). We refer the reader to the code available online for additional details on the implementation.
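A schematic training loop matching this description, with Adam and a stage-wise increase of the number of solver iterations, could read as follows; the stage schedule, learning rate and the observation-based initialization x^(0) = Ω ⊙ y are our assumptions, not the exact published settings.

```python
import torch

def train(model, loader, supervised=True, stages=(5, 10, 20), epochs_per_stage=20):
    """Train the end-to-end model Psi_{Phi,Gamma} with Adam, incrementally
    increasing the number of gradient-based solver iterations."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for n_iter in stages:       # from 5 up to 20 solver iterations
        model.n_iter = n_iter
        for _ in range(epochs_per_stage):
            for y, mask, x_true in loader:  # batches of (obs, mask, truth)
                x0 = mask * y               # simple observation-based init
                loss = (supervised_loss(model, x0, y, mask, x_true) if supervised
                        else unsupervised_loss(model, x0, y, mask))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return model
```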
4 Numerical experiments

This section reports numerical experiments with the proposed framework for Lorenz-63 and Lorenz-96 systems, which are widely considered for demonstration and evaluation purposes in data assimilation and geoscience, and are among the typical case-studies considered in recent data-driven and learning-based studies [19, 30, 5].

4.1 Lorenz-63 experiments

We first perform numerical experiments for the assimilation of Lorenz-63 dynamics from partial and noisy observations. The Lorenz-63 system is governed by the following three-dimensional ODE:

$$\frac{dX_{t,1}}{dt} = \sigma\left(X_{t,2} - X_{t,1}\right), \qquad \frac{dX_{t,2}}{dt} = \rho X_{t,1} - X_{t,2} - X_{t,1} X_{t,3}, \qquad \frac{dX_{t,3}}{dt} = X_{t,1} X_{t,2} - \beta X_{t,3} \qquad (15)$$

with the following parameterization: σ = 10, ρ = 28 and β = 8/3. It generates chaotic patterns [21]. In our experiments, we simulate Lorenz-63 time series with a 0.01 time step using an RK45 integration scheme [12]. We consider time series with 200 time steps. The observation data are sampled every 8 time steps for the first component of the Lorenz-63 state only, and involve a Gaussian additive noise with a variance of 2. The test dataset comprises 2000 sequences and the training data with groundtruthed states contain 10000 sequences.
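For illustration, this simulation setup may be reproduced along the following lines; the initial condition is arbitrary and the sketch relies on scipy's RK45 integrator.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz63(t, x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz-63 ODE (15)."""
    return [sigma * (x[1] - x[0]),
            rho * x[0] - x[1] - x[0] * x[2],
            x[0] * x[1] - beta * x[2]]

dt, n_steps = 0.01, 200
t_eval = np.arange(n_steps) * dt
sol = solve_ivp(lorenz63, (0.0, n_steps * dt), [1.0, 1.0, 1.0],
                t_eval=t_eval, method="RK45")
x_true = sol.y.T  # (200, 3) state sequence

# Partial, noisy observations: first component only, every 8 time steps,
# Gaussian noise with variance 2 (mask encodes the observed domain Omega).
mask = np.zeros_like(x_true)
mask[::8, 0] = 1.0
y_obs = mask * (x_true + np.sqrt(2.0) * np.random.randn(*x_true.shape))
```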
Regarding NN operator Φ, we consider two architectures as discussed in Section 3:

• an ODE-based NN representation based on a fourth-order Runge-Kutta integration scheme of the ODE, where the ODE operator is implemented as bilinear convolutional layers with a total of 9 parameters, as proposed in [15];
• a constrained CNN representation: a two-scale NN representation (10), where operators Φ₁,₂ involve bilinear convolutional blocks with 30 channels.

Regarding NN solver Γ, we consider the two types of architectures introduced in Section 3.3, referred to as aG-LSTM and aG-Conv. (A notebook for the Lorenz-63 system in repository https://github.com/CIA-Oceanix/DinAE_4DVarNN_torch provides the implementation of the reported experiments.)

We report the performance metrics in Tab. 1 for the different parameterizations of operator Φ and the associated solvers, using either a supervised or an unsupervised learning strategy (see Section 3). As baseline, we consider a fixed-step gradient descent (FSGD) of assimilation cost (4) using a fourth-order Runge-Kutta integration scheme for the known ODE model; this baseline is also implemented in pytorch. All trained solvers involve 20 gradient-based iterations. As evaluation metrics, we consider the reconstruction error of the true state (mean square error, R-score), the assimilation cost (4DVar-score) and the mean square error of the one-step-ahead dynamical prior ‖x − Φ(x)‖² (ODE-score). To compute comparable assimilation costs across methods, we report the assimilation cost for all schemes with λ₁ = 0.1 and λ₂ = 1.0, although these parameters may differ from the values learned during the training phase for supervised settings.

Figure 2: Reconstruction of Lorenz-63 dynamics. Upper panel: solvers' energy pathways using different parameterizations for operator Φ in (4); we depict, along the sequence of iterations of the different gradient-based solvers, the evolution of the reconstruction error and of the variational cost. For the known Lorenz-63 ODE model, we report these energy pathways for the fixed-step gradient descent of variational cost (4) and for learned NN solvers using unsupervised and supervised learning schemes. We also report the energy pathway for a joint supervised learning of a GENN-based parameterization of operator Φ and of the associated NN solver. Lower panel: associated pdfs of the reconstruction error for each component of the Lorenz-63 state.

From Tab. 1, we may first notice that all schemes which aim at minimizing assimilation cost (4), i.e. the FSGD baseline and the unsupervised solvers, lead to worse reconstruction performance (R-score about 3.5) compared with supervised schemes (R-score down to 1.34). Conversely, the former lead to significantly better 4DVar-scores. These results emphasize that the assimilation cost may not be a very relevant indicator of the reconstruction performance. For instance, the trained aG-Conv solver leads to a slightly better performance than the FSGD (R-score of 3.52 vs. 3.55) but with a greater assimilation cost (4DVar-score of 1.05e-2 vs. 7.70e-3). In the unsupervised setting, the use of a GENN prior results in very poor reconstruction performance, which may relate to a worse prediction performance (ODE-score of 2.0e-4 vs. 3.7e-5 for a fourth-order Runge-Kutta scheme with the true ODE model).
Figure 3: Reconstruction examples for Lorenz-63 dynamics. From left to right, we report four examples of 200-step Lorenz-63 sequences with true states (black, solid), observed values for the first component (blue dots), assimilated states with different schemes using the true ODE model (magenta) and a jointly learned GENN-based operator Φ and solver (cyan). We refer the reader to Fig. 2 for the details of the different ODE-based schemes. Each row depicts one of the three components of the Lorenz-63 states.

Importantly, the supervised schemes greatly lower the reconstruction error compared with the baseline (R-score of 1.34 for the best supervised scheme vs. 3.55 for the FSGD scheme). The best performance is indeed reported for a joint learning of operator Φ and of the associated solver, using a GENN parameterization for operator Φ. The improvement is also significant compared to the supervised learning of a solver using the discretized version of the true ODE model for operator Φ. Intriguingly, the learnt GENN-based operator Φ is characterized by a relatively poor one-step-ahead prediction error (ODE-score of 0.16 vs. 3.7e-5 for the ODE-based schemes). These results suggest that the best dynamical prior for assimilation purposes might not be the discretization of the true ODE model. The detailed analysis of the values of the assimilation cost further points out that its direct minimization may not be the best strategy to optimize the reconstruction performance.

We further illustrate these results in Fig. 2 and Fig. 3. For a randomly-sampled subset of the test dataset, we report in Fig. 2 the energy pathways of different solvers: namely the baseline (FSGD), the best unsupervised and supervised solvers using the true ODE model for operator Φ, and the best supervised solver for a GENN-based operator Φ. Here, the FSGD scheme involves more than 150000 steps, whereas all the learned solvers involve only 20 gradient-based iterations. In these experiments, the CNN-based solver always reaches a better performance than the LSTM-based one. Interestingly, the joint learning of operator Φ and of the associated solver leads to a smooth descent, which may indicate that the training phase converges towards a new variational cost with a better agreement between the minimization of this cost and the reconstruction performance. The distributions of the reconstruction error for the three components of the Lorenz-63 state in Fig. 2, along with the different examples in Fig. 3, further emphasize the better reconstruction performance of the joint learning of a GENN-based operator Φ and an aG-Conv solver, which leads to reconstruction patterns very similar to the true ones.

4.2 Lorenz-96 experiments

The second experiment involves Lorenz-96 dynamics, which are governed by the following ODE:

$$\frac{dx_{t,i}}{dt} = \left(x_{t,i+1} - x_{t,i-2}\right) x_{t,i-1} - x_{t,i} + F \qquad (16)$$

with i the index from 1 to 40 and F a scalar parameter set to 8. The above equation involves a periodic boundary constraint along the multivariate dimension of state x(t). In our experiments, we simulate Lorenz-96 time series with a 0.05 time step using an RK45 integration scheme [12]. We consider time series with 200 time steps. The observation data are sampled every 4 time steps and involve a Gaussian additive noise with a variance of 2. Only 20 of the 40 components of the states are observed, according to a random sampling. The test dataset comprises 256 sequences and the training data with groundtruthed states involve 2000 sequences.
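The periodic boundary in (16) can be handled compactly with np.roll; below is a minimal sketch of the corresponding vector field and simulation, under the same conventions as the Lorenz-63 sketch above (the initial condition, a random perturbation of the equilibrium at F, is illustrative).

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96(t, x, forcing=8.0):
    """Right-hand side of the Lorenz-96 ODE (16); np.roll implements the
    periodic boundary: roll(x, -1)[i] = x[i+1], roll(x, 2)[i] = x[i-2]."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

sol = solve_ivp(lorenz96, (0.0, 10.0), 8.0 + np.random.randn(40),
                t_eval=np.arange(200) * 0.05, method="RK45")
x_true = sol.y.T  # (200, 40) state sequence
```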
Regarding NN operator Φ, we consider two architectures as discussed in Section 3:
• an ODE-based NN representation, referred to as L96-RK4, based on a fourth-order Runge-Kutta integration scheme of the ODE; our implementation involves bilinear convolutional layers with periodic boundary conditions and comprises a total of 9 parameters;

• a constrained CNN representation, referred to as L96-GENN: a two-scale NN representation (10), where operators Φ₁,₂ involve bilinear convolutional architectures.

(A colab notebook for the Lorenz-96 system in repository https://github.com/CIA-Oceanix/DinAE_4DVarNN_torch describes the implementation of the reported experiments.)

| Learning scheme | Model | Optim | R-score | 4DVar-score | ODE-score |
|---|---|---|---|---|---|
| No learning | L96-RK4 | FSGD | 1.06 | 1.47e-2 | < 1e-4 |
| Unsupervised learning | L96-RK4 | aG-Conv | - | - | < 1e-4 |
| | | aG-LSTM | - | - | < 1e-4 |
| Supervised learning | L96-RK4 | aG-Conv | - | - | < 1e-4 |
| | | aG-LSTM | 0.97 | 3.15e-2 | < 1e-4 |
| | L96-GENN | aG-Conv | 0.49 | 7.47e-2 | 12.20e-2 |
| | | aG-LSTM | - | - | - |

Table 2: Lorenz-96 experiments. We report the reconstruction performance of the proposed framework using unsupervised and supervised learning schemes. We refer the reader to Tab. 1 and the main text for details on the considered schemes and evaluation criteria.

We first report in Tab. 2 a quantitative comparison of the different parameterizations of operator Φ and solver Γ. Similarly to the Lorenz-63 experiments, the best unsupervised solver for an ODE-based cost reaches the same reconstruction performance as the baseline using only 20 gradient-based iterations, to be compared with several thousands for the FSGD solver. Again, the CNN-based solver outperforms the LSTM-based one in the unsupervised setting. We do not report the performance for the GENN-based representation in the unsupervised setting, as it behaves very poorly, similarly to the Lorenz-63 experiments. Besides, we report a very significant improvement in the supervised setting, especially when we jointly learn operator Φ and solver Γ, with a relative gain greater than 50% compared with the baseline. In this case, the best performance is achieved with an LSTM-based solver. Again, the best reconstruction performance does not come with a good one-step-ahead prediction score for the learned GENN-based representation.

Figure 4: Reconstruction of Lorenz-96 dynamics. Upper left panel: example of a true state sequence and associated observations. Upper center panel: solvers' energy pathways using the Lorenz-96 ODE cost or the learnable variational cost. Upper right panel: associated pdfs of the reconstruction errors. Lower panel: we depict the first half of an example of a 200-time-step series of 40-dimensional Lorenz-96 states, the x-axis being the time axis; first row, reconstructed state sequences for the gradient descent, unsupervised and supervised solvers for the ODE cost (4), as well as for the jointly learned cost and solver; second row, absolute error maps for each of the four reconstructions. All states and errors are displayed with the same colormap.

We further illustrate these experiments in Fig. 4. Here, all solvers' energy pathways are consistent with minimization patterns both for the reconstruction error and the assimilation cost. These pathways are very similar for the three solvers using an ODE-based cost. Visually, they lead to similar error distributions. For the example reported in the bottom panel of Fig. 4, they involve similar error patterns, though the FSGD solver involves greater errors. In this respect, the reconstruction issued from the joint learning of a GENN-based operator Φ and an LSTM-based solver is associated with lower error values.

5 Related work

In this section, we further discuss how the proposed framework relates to previous works according to different aspects, including end-to-end learning frameworks for the resolution of inverse problems, learning schemes for gradient-based solvers and representation learning for geophysical dynamics.
End-to-end learning for inverse problems:
A variety of end-to-end architectures have been proposed for solving inverse problems in signal and image processing, for instance for denoising, deconvolution or super-resolution issues [10, 22, 23, 35]. In these studies, the proposed schemes generally rely on a global convolutional architecture, but do not explicitly state, on the one hand, a neural-network representation of an energy-based setting and, on the other hand, the gradient-based solver of the considered variational formulation. For instance, in [10], the end-to-end architecture is inspired by reaction-diffusion PDEs derived from the Euler-Lagrange equation of a variational formulation, but does not explicitly embed such a variational formulation. Conversely, deep learning frameworks and embedded automatic differentiation tools have been considered to solve inverse problems through the minimization of variational models, where the prior can be given by a pre-trained NN representation [23]. Another strategy explored in the literature relies on the definition of two independent networks, one corresponding to the generative model and the other one to the inverse model. Raissi et al. [30] were among the first to explore such end-to-end architectures for the data-driven identification of ODE/PDE representations. This is similar in spirit to auto-encoder architectures: the decoder aims to solve the inverse of the encoder, but it does not explicitly depend on the representation chosen for the encoder. Here, the considered end-to-end architecture exploits automatic differentiation to explicitly relate the gradient-based inversion model to the neural-network representation of the dynamics. This is regarded as of great interest to improve the overall interpretability of the architecture. Besides, our experiments support that jointly learning the representation of the dynamics and the associated solver can lead to a very significant improvement compared with pre-training the dynamical prior and solving the resulting variational minimization.
Learning gradient-based solvers for inverse problems:
A key component of the proposed end-to-end architecture is the gradient-based NN solver for the targeted minimization. This is similar to meta-learning and learning-to-learn strategies [1, 17], where one can jointly learn some targeted representation and the optimizer of the training loss. The LSTM-based architecture of the considered gradient-based NN optimizer is similar to the one considered in learning-to-learn schemes. We may notice that we apply here automatic differentiation to compute the gradient w.r.t. the hidden state, and not the parameters of the NN representations. Besides, in the supervised learning case, the gradient computed using automatic differentiation is not the gradient of the loss to be minimized during the training process, but the gradient of assimilation criterion (4). We indeed expect the latter to be a good proxy to minimize the reconstruction error of the true states. The reported numerical experiments support this assumption and suggest extensions to other training losses, which could benefit from additional data not provided as inputs for the computation of variational cost (4). Note that, in general, the 4DVar cost function can present multiple local and global minima, except in the particular case where the dynamics is linear (as well as the observation operator that maps the state space to the observation space). In order to avoid falling into local minima, or to select the absolute minimum, a quasi-static approach can be considered [29], based on successive small increments of the assimilation period. While this issue is not addressed in the present work, a similar quasi-static approach could be considered in the design of the learned gradient-based solver.
Representation learning for geophysical dynamics:
As stated in the introduction, the data-driven identification of representations of geophysical dynamics is a very active research area. Numerous recent studies have investigated different machine learning schemes for the identification of governing equations of geophysical processes, including sparse regression techniques [6], neural ODE and PDE schemes [9, 31, 26] as well as analog methods [19]. Interestingly, advances have been proposed for the data-driven identification of such representations when the processes of interest are not fully observed, which covers noisy and irregularly-sampled observations and partially-observed systems. Reduced-order modeling, which aims to identify lower-dimensional governing equations, also involves very similar studies [8]. These previous works mainly rely on the identification of an ordinary or partial differential equation, as this mathematical representation is a very generic one for modeling geophysical dynamics. The proposed end-to-end framework explicitly embeds a representation of the considered geophysical dynamics. In line with the previous works cited above, we may consider neural ODE/PDE representations [26, 9, 15]. As we rely on an energy-based setting, we can explore other types of representations. In the reported experiments, we have shown that a multi-scale energy-based representation for the dynamical prior leads to significantly better reconstruction performance than an ODE-based representation, at the expense of a relatively coarse approximation of the dynamics. These experiments suggest that, for a given process, different representations might be relevant depending on the targeted application, especially simulation vs. assimilation. It also raises the question as to whether the representation should be adapted to the observation operator.
Data assimilation for geophysical flow forecasting:
This work can be adapted to the different classical formulations used for data assimilation issues in meteorology or oceanography. Compared with (4), the data assimilation cost usually involves a discrepancy term on the initial condition, called the background term. This term is usually expressed as a Mahalanobis distance and involves the inverse of the background covariance matrix. This matrix plays the role of a smoothing prior on the solution, and its specification is essential, yet difficult to fix or parameterize on physical grounds. In the proposed framework, this would correspond to parameterizing and learning this discrepancy term with respect to the initial condition, while the dynamics is fixed and given. While the considered formulation (4) relates to weak-constraint data assimilation, which accounts for modelling errors, we could also explore the proposed framework for so-called strong-constraint data assimilation, which amounts to stating operator Φ as the forecasting of the state sequence {x(tᵢ)}ᵢ over the entire time interval from the initial condition x(t₀). Within a weak-constraint data assimilation with a known physical model, the proposed framework may provide new means to learn a correction error term in the dynamics from a parameterization of operator Φ as Φ* + δΦ, with Φ* defined as in the strong-constraint case and δΦ as in the reported experiments.

Learning for computational fluid dynamics:
In the present study, one of the considered NN architectures for operator Φ derives from a general time evolution model interpreted in terms of iterative integration schemes (e.g. Runge-Kutta, Euler). In computational fluid dynamics, splitting methods are usually introduced to deal with incompressibility and the computation of the pressure forcing term [18, 27]. The introduction of such a splitting in the NN architecture should enable us to enforce the incompressibility of the solution, and thus to strengthen the physical relevance of the solutions. The synergy between NN architectures and numerical schemes is a natural way to enforce some numerical constraints imposed by the deep physical features of the dynamics at stake. For instance, splitting in terms of slow 3D internal baroclinic and fast 2D (depth-averaged) barotropic motions, as is usually done in numerical ocean models, should strongly enforce a natural physical relevance for the learning of ocean dynamics.

6 Conclusion

We have introduced an end-to-end learning framework for variational assimilation problems. Assuming the observation operator is known, it combines a neural-network representation of the dynamics and a neural-network solver for the considered variational cost. The former may be derived from the known differential operator for the considered dynamics; here, for Lorenz-63 and Lorenz-96 dynamics, we considered NN representations which implement fourth-order Runge-Kutta integration schemes. A similar approach can be considered for PDEs, including with automatic NN generation tools from symbolic PDE expressions as proposed in [26]. Depending on the learning setting, we may learn jointly the two NN representations, or only the NN solver, assuming the NN representation of the dynamics has been calibrated previously. We provide a pytorch implementation of the proposed framework.

We report numerical experiments which support the relevance of the proposed framework. For Lorenz-63 and Lorenz-96 dynamics, we have shown that we may learn fast gradient-based solvers that can reach state-of-the-art performance with only a few gradient iterations (20 in the reported experiments). Interestingly, our findings suggest that, when groundtruthed datasets are available, the joint learning of the NN representation of the dynamics and of the associated NN solver may lead to a very significant improvement of the reconstruction performance. Illustrated here for Lorenz systems, future work shall further explore whether these findings generalize to other systems, especially higher-dimensional ones.

Our findings also question the design and selection of the dynamical prior in variational assimilation systems. They suggest that numerical ODE-based representations optimal in terms of forecasting performance may not be optimal for reconstruction purposes. Future work shall further explore this synergy between the representation of the dynamics and the gradient-based solver to retrieve the best possible state estimate. The extension of the proposed framework to stochastic representations is also of great interest, and may for instance benefit from the definition of variational functionals from log-likelihood functions, where the different terms or parameters may be represented as neural networks.
Acknowledgements
This work was supported by the LEFE program (LEFE MANU project IA-OAC), CNES (grant OSTST-MANATEE) and ANR projects Melody and OceaniX. It benefited from HPC and GPU resources from Azure (Microsoft EU Ocean awards) and from GENCI-IDRIS (grant 2020-101030).
References

[1] M. Andrychowicz, M. Denil, S. Gomez, M.W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, pages 3981–3989, 2016.
[2] R. N. Bannister. A review of operational methods of variational and ensemble-variational data assimilation. QJRMS, 143(703):607–633, 2017.
[3] E.L. Blayo. Reduced order approaches for variational data assimilation in oceanography. In Proc. 2nd Int. Workshop on Industrial Applications of Low Order Models Based on POD, 2008.
[4] J. Blum, F.-X. Le Dimet, and I.M. Navon. Data assimilation for geophysical fluids. In Handbook of Numerical Analysis, volume 14, pages 385–441. Elsevier, 2009.
[5] M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino. Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. Foundations of Data Science, 2(1):55–80, 2020.
[6] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. PNAS, 113(15):3932–3937, 2016.
[7] M. Busta, L. Neumann, and J. Matas. Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework. In Proc. IEEE CVPR, pages 2204–2212, 2017.
[8] K. Champion, B. Lusch, J.N. Kutz, and S.L. Brunton. Data-driven discovery of coordinates and governing equations. PNAS, 116(45):22445–22451, 2019.
[9] T.Q. Chen, Y. Rubanova, J. Bettencourt, and D.K. Duvenaud. Neural Ordinary Differential Equations. In NIPS, pages 6571–6583, 2018.
[10] Y. Chen, W. Yu, and T. Pock. On Learning Optimized Reaction Diffusion Processes for Effective Image Restoration. In Proc. IEEE CVPR, pages 5261–5269, 2015.
[11] S. Dieleman and B. Schrauwen. End-to-end learning for music audio. In IEEE ICASSP, pages 6964–6968, 2014.
[12] J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19–26, 1980.
[13] G. Evensen. Data Assimilation. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[14] R. Fablet, L. Drumetz, and F. Rousseau. End-to-end learning of energy-based representations for irregularly-sampled signals and images. arXiv preprint arXiv:1910.00556, 2019.
[15] R. Fablet, S. Ouala, and C. Herzet. Bilinear residual Neural Network for the identification and forecasting of dynamical systems. In EUSIPCO, pages 1–5, Rome, Italy, 2018.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE CVPR, pages 770–778, 2016.
[17] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
[18] R.I. Issa, A.D. Gosman, and A.P. Watkins. The computation of compressible and incompressible recirculating flows by a non-iterative implicit scheme. Journal of Computational Physics, 62(1):66–82, 1986.
[19] R. Lguensat, P. Tandeo, P. Aillot, and R. Fablet. The Analog Data Assimilation. Monthly Weather Review, 2017.
[20] D. Li, Y. Yang, Y.-Z. Song, and T.M. Hospedales. Learning to Generalize: Meta-Learning for Domain Generalization. In AAAI, 2018.
[21] E.N. Lorenz. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.
[22] A. Lucas, M. Iliadis, R. Molina, and A.K. Katsaggelos. Using Deep Neural Networks for Inverse Problems in Imaging: Beyond Analytical Methods. IEEE SPM, 35(1):20–36, 2018.
[23] M.T. McCann, K.H. Jin, and M. Unser. Convolutional Neural Networks for Inverse Problems in Imaging: A Review. IEEE SPM, 34(6):85–95, 2017.
[24] D. Nguyen, S. Ouala, L. Drumetz, and R. Fablet. Assimilation-based Learning of Chaotic Dynamical Systems from Noisy and Partial Data. In IEEE ICASSP, pages 3862–3866, 2020.
[25] S. Ouala, R. Fablet, C. Herzet, B. Chapron, A. Pascual, F. Collard, and L. Gaultier. Neural-Network-based Kalman Filters for the Spatio-Temporal Interpolation of Satellite-derived Sea Surface Temperature. Remote Sensing, 2018.
[26] O. Pannekoucke and R. Fablet. PDE-NetGen 1.0: from symbolic PDE representations of physical processes to trainable neural network representations. Geoscientific Model Development, pages 1–14, 2020.
[27] S. Patankar. Numerical Heat Transfer and Fluid Flow. Taylor & Francis, 2018.
[28] P. Perez. Markov random fields and images. CWI Quarterly, pages 413–437, 1998.
[29] C. Pires, R. Vautard, and O. Talagrand. On extending the limits of variational assimilation in nonlinear chaotic systems. Tellus A, 48(1):96–121, 1996.
[30] M. Raissi. Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations. JMLR, 19:1–24, 2018.
[31] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[32] E. Schwartz, R. Giryes, and A.M. Bronstein. DeepISP: Toward Learning an End-to-End Image Processing Pipeline. IEEE TIP, 28(2):912–923, 2019.
[33] Y. Trémolet. Model error estimation in 4D-Var. QJRMS, 133(626):1267–1280, 2007.
[34] R. Vilalta and Y. Drissi. A Perspective View and Survey of Meta-Learning. Artificial Intelligence Review, 18(2):77–95, 2002.
[35] J. Xie, L. Xu, and E. Chen. Image Denoising and Inpainting with Deep Neural Networks. In NIPS, 2012.