Combining data assimilation and machine learning to infer unresolved scale parametrisation
Julien Brajard, Alberto Carrassi, Marc Bocquet, Laurent Bertino
rsta.royalsocietypublishing.org
Research
Article submitted to journal
Subject Areas:
Computer modelling and simulation; artificial intelligence; climatology.
Keywords:
Numerical modelling; data assimilation; machine learning.
Author for correspondence:
Julien Brajard, e-mail: [email protected]
Nansen Center (NERSC), 5006 Bergen, Norway; Sorbonne University, Paris, France; Department of Meteorology, University of Reading and NCEO, United Kingdom; Mathematical Institute, University of Utrecht, The Netherlands; CEREA, joint laboratory École des Ponts ParisTech and EDF R&D, Université Paris-Est, France
In recent years, machine learning (ML) has been proposed to devise data-driven parametrisations of unresolved processes in dynamical numerical models. In most cases, the ML training leverages high-resolution simulations to provide a dense, noiseless target state. Our goal is to go beyond the use of high-resolution simulations and train ML-based parametrisations using direct data, in the realistic scenario of noisy and sparse observations. The algorithm proposed in this work is a two-step process. First, data assimilation (DA) techniques are applied to estimate the full state of the system from a truncated model. The unresolved part of the truncated model is viewed as a model error in the DA system. In a second step, ML is used to emulate the unresolved part, a predictor of model error given the state of the system. Finally, the ML-based parametrisation model is added to the physical core truncated model to produce a hybrid model. The DA component of the proposed method relies on an ensemble Kalman filter while the ML parametrisation is represented by a neural network. The approach is applied to the two-scale Lorenz model and to MAOOAM, a reduced-order coupled ocean-atmosphere model. We show that in both cases the hybrid model yields forecasts with better skill than the truncated model. Moreover, the attractor of the system is significantly better represented by the hybrid model than by the truncated model.

© The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
1. Introduction
The Earth climate system is one example of a natural system that is reasonably well represented through known physical laws and that has been intensively observed for decades (see, e.g., [1]). Physical laws, in the form of ordinary (ODEs) or partial differential equations (PDEs), are implemented through numerical models providing the time evolution of the system's state. Although weather and climate predictions have constantly improved, and will likely continue to do so, uncertainties will ineluctably remain. Those usually fall into two major classes: (i) the internal variability driven by the sensitivity to the initial conditions, and (ii) the model errors. The former has to do with the amplification of the initial condition error and arises even in perfect models; it is mitigated by using data assimilation (DA) [2]. The latter is present even if one knew the initial conditions perfectly and has to do with the incorrect and/or incomplete representation of the laws governing the system. The two sources of errors are inevitably entangled and it is difficult to separate them in practice.

Machine learning (ML) was recently shown to be effective in reducing model error, in particular that originating from unresolved scales. This has been achieved using two approaches. The first consists in learning a subgrid parameterisation of a model from existing physics-based expensive parametrisation schemes [3,4], or from the differences between high- and low-resolution simulations [5–8]. In those approaches, the unresolved part of the model is represented by an ML process while the core of the model is derived from ODEs. The second approach is to emulate the entire model using observations. With spatially dense and noise-free data, this approach has been based on sparse regression [9], echo state networks [10,11], recurrent neural networks [12], residual neural networks [13] or convolutional neural networks [14,15].
The challenging problem of partial and/or noisy observations has been addressed using dedicated NN architectures [16] or in combination with data assimilation methods [17–21]. This work presents a new method to obtain a data-driven parameterisation of a model's unresolved scales. In particular, we aim at producing a hybrid model combining the physics-based core (encoding the best of our knowledge of the resolved-scale physics) with the data-driven parameterisation. By leveraging the use of DA, our method efficiently handles noisy and sparse observations.
2. Objectives and definitions

(a) Statement of the problem
We consider an autonomous chaotic dynamical system, seen as our reference "truth", represented by the ODE

$$\frac{\mathrm{d}z(t)}{\mathrm{d}t} = f(z(t)), \qquad (2.1)$$

with $z(t) \in \mathbb{R}^{N_z}$ being the system's state at time $t$. From an arbitrary state $z(t)$ on the system's attractor the model can be integrated forward for one time step $\delta t$ to get

$$z_{\delta t} = \mathcal{M}(z, \delta t), \qquad (2.2)$$

where $z_{\delta t} = z(t + \delta t)$. Let us formally define a projection operator $\Pi : \mathbb{R}^{N_z} \mapsto \mathbb{R}^{N_x}$ such that $x = \Pi(z)$, with $x$ being the projection of the full state into a reduced-dimension state: $N_x < N_z$. For example, $\Pi$ can be a subsampling operator retaining only a subset of $z$, or a downscaling operator from a high-resolution state to a lower resolution. From Eq. (2.2), we also define $x_{\delta t} = \Pi(z_{\delta t})$.

Consider now a scale-truncated model, $\mathcal{M}^r$, that provides an imperfect description of Eq. (2.2) in the reduced space $\mathbb{R}^{N_x}$,

$$x^r_{\delta t} = \mathcal{M}^r(x^r, \delta t), \qquad (2.3)$$

where the superscript $r$ stands for "resolved". When initialised from the projection of the truth into the reduced space (i.e. no initial condition error: $x^r = x$), the difference between the one-time-step predictions from Eqs. (2.2) and (2.3) defines the model error due to the scales neglected in the resolved model:

$$\epsilon^m(z, \delta t) = x_{\delta t} - x^r_{\delta t} = \Pi \circ \mathcal{M}(z, \delta t) - \mathcal{M}^r(\Pi(z), \delta t). \qquad (2.4)$$

The objective of this work is to complement the resolved model (2.3) by an empirical representation of the unresolved scales based on a neural network (NN) trained on incomplete and noisy data.
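As a toy illustration of Eq. (2.4), the sketch below builds a trivial linear "true" model, a projection $\Pi$ that subsamples the state, and the corresponding truncated model. All three are illustrative stand-ins chosen for brevity, not the models used in this paper:

```python
import numpy as np

def full_model(z, dt):
    """Toy 'true' model M: one explicit-Euler step of a coupled linear
    system. Purely illustrative stand-in for Eq. (2.2)."""
    A = np.array([[-0.5, 0.1, 0.0],
                  [0.1, -0.5, 0.1],
                  [0.0, 0.1, -0.5]])
    return z + dt * (A @ z)

def proj(z):
    """Projection Pi: here a subsampling keeping the first two components."""
    return z[:2]

def truncated_model(x, dt):
    """Truncated model M^r in the reduced space (third-variable coupling dropped)."""
    A_r = np.array([[-0.5, 0.1],
                    [0.1, -0.5]])
    return x + dt * (A_r @ x)

def model_error(z, dt):
    """Eq. (2.4): eps^m(z, dt) = Pi(M(z, dt)) - M^r(Pi(z), dt)."""
    return proj(full_model(z, dt)) - truncated_model(proj(z), dt)

z = np.array([1.0, 2.0, 3.0])
print(model_error(z, 0.01))  # only the component coupled to z[2] has an error
```

Here the model error is exactly the neglected coupling term, $\delta t \cdot 0.1\, z_2$, acting on the second resolved component.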
We will hereafter denote the NN-based unresolved-scale representation by $g(x, \theta)$. The vector $\theta$ is the set of trainable parameters of the NN and is determined by minimising the loss function

$$\mathcal{L}(\theta) = \mathbb{E}_{z \sim p_Z}\left[\left\| g(\Pi(z), \theta) - \epsilon^m(z, \delta t) \right\|^2\right], \qquad (2.5)$$

where $p_Z$ is the invariant probability density function (assumed to exist) on the attractor. We construct the hybrid model $\mathcal{M}^h$, parametrised by $\theta$, such that

$$x^h_{\delta t} = \mathcal{M}^r(x, \delta t) + g(x, \theta) = \mathcal{M}^h(x, \delta t). \qquad (2.6)$$

Our objective is to optimally estimate the parameters $\theta$ so that the hybrid model is the most accurate representation of the true underlying dynamics.

Apart from trivial cases, the loss function in Eq. (2.5) cannot be computed: $x_{\delta t}$ is unknown and $p_Z$ is generally intractable. Assuming ergodicity of the true dynamics, we can however estimate it by a Monte-Carlo approach such that

$$\mathcal{L}(\theta) \approx \frac{1}{K}\sum_{k=0}^{K-1}\left\| g(x_k, \theta) - \epsilon^m(z_k, \delta t) \right\|^2, \qquad (2.7)$$

where $z_{0:K-1} = \{z_0, z_1, \cdots, z_{K-1}\}$, $z_k = z(t_k)$, is a set of samples of $p_Z$, typically extracted from a time series of modelled state variables, and $x_k = \Pi(z_k)$. Such samples are usually not independent (due to the underlying dynamics being deterministic), and only provide a biased approximation of $p_Z$. Furthermore, the need to sample the whole attractor implies treating time series significantly longer than the decorrelation time (i.e. $K$ very large in general).

(b) Framework of the study

The loss function, Eq. (2.7), cannot be minimised directly because some of its key entries are unavailable, in particular, obviously, the true process Eq. (2.2) and the time series $z_{0:K-1}$. The available terms in the loss function are listed below.

The truncated model.
The truncated model, our best available knowledge about the true process, is usually very complex (high-dimensional, nonlinear and with diagnostic variables). Hence, we shall assume that its gradient, $\nabla_x \mathcal{M}^r$, cannot be computed analytically. We will thus focus on developing a (model) adjoint-free approach that is more flexible and suitable to high-dimensional nonlinear scenarios, where deriving and maintaining an adjoint model is a difficult and costly task [22]. When the gradient can be computed, other efficient methods exist [21]. Observations.
Observations are incomplete and noisy and are obtained through

$$y_k = H_k x_k + \epsilon^o_k, \qquad (2.8)$$

where $x_k$ is the true state in the reduced space, $y_k \in \mathbb{R}^{N_y}$ and $H_k \in \mathbb{R}^{N_y \times N_x}$ are the observation vector and operator respectively at $t_k$, while $\epsilon^o_k$ is the observation error, assumed to be uncorrelated in time and normally distributed with zero mean and variance-covariance matrix $(\sigma^o)^2 I_{N_y}$, where $I_{N_y}$ is the identity matrix of size $N_y \times N_y$ and $\sigma^o$ is the standard deviation. For simplicity, the observation error standard deviation is taken constant and the observation operator linear. Both assumptions can be relaxed without major drawbacks, even if doing so can induce practical difficulties. The ideal, most favourable situation in which the full system's state is observed in the reduced space with no error is referred to as the "perfect observation" case: $y_k = x_k$. For convenience, we further assume that observations are available regularly at multiples of the model time step, such that $\Delta t = t_{k+1} - t_k = N_c \delta t$, $N_c \in \mathbb{N}^*$. This also accounts for the fact that, in general, the observation sampling period is longer than the integration time step of the numerical model.
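The observation model of Eq. (2.8), with $H_k$ a random subsampling operator (as used later in the L2S experiments), can be sketched as follows. The sizes and the $\sigma^o$ value are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_y = 36, 9                # reduced-space size, number of observations

x_k = rng.standard_normal(N_x)  # stand-in for the true state in reduced space

# H_k: subsampling operator drawing N_y of the N_x components at random,
# redrawn at each observation time in the actual experiments
idx = rng.choice(N_x, size=N_y, replace=False)
H_k = np.zeros((N_y, N_x))
H_k[np.arange(N_y), idx] = 1.0

sigma_o = 0.1                   # observation error standard deviation (illustrative)
y_k = H_k @ x_k + sigma_o * rng.standard_normal(N_y)   # Eq. (2.8)
print(y_k.shape)  # (9,)
```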
3. Method

(a) Loss function approximation
Let us assume that an estimate of $x_k$ ($k \in \{0, \cdots, K\}$) is available at observation times, every $\Delta t = N_c \delta t$, so that

$$x_{k+1} = \Pi \circ \mathcal{M}^{(N_c)}(z_k, \delta t), \qquad (3.1)$$

where $\mathcal{M}^{(N_c)}$ is the composition of the model $N_c$ times, and similarly for the truncated model, $x^r_{k+1} = \mathcal{M}^{r(N_c)}(x_k, \delta t)$. Since observations are not available at each time step ($N_c > 1$), the model error $\epsilon^m$ is not known at each time step either, and the loss function Eq. (2.7) cannot be computed exactly. In the following, we present two key simplifying assumptions that lead to a tractable approximation of the loss function.

The first consists in assuming $\Delta t$ to be short enough that the state evolved by the truncated model is independent of the model error due to the unresolved scales, so that the model error is an additive term to the truncated model forecast after a $\Delta t$ time integration. The second is that the variability of $\epsilon^m$ is small within $\Delta t$. The combination of these two hypotheses is known as the linear superposition assumption, and can be formalised as

$$x_{k+1} \approx x^r_{k+1} + N_c \, \epsilon^m(z_k, \delta t). \qquad (3.2)$$

Note that the same approximation was made in a similar setting by [8]. The optimisation can now be performed using the approximate loss function

$$\mathcal{L}(\theta) \approx \mathcal{L}_a(\theta) = \frac{1}{K N_c^2}\sum_{k=0}^{K-1}\left\| N_c \, g(x_k, \theta) - (x_{k+1} - x^r_{k+1}) \right\|^2. \qquad (3.3)$$

The modified loss function, Eq. (3.3), can be computed without knowing the full state $z_k$ but only its projection $x_k$. $\mathcal{L}_a$ can be minimised using a gradient-descent algorithm as long as the gradient of $g(x_k, \theta)$ can be computed, which is standard in any neural network library.

(b) Description of the algorithm

In order to minimise the loss function defined in Eq. (3.3), a sequence $x_{0:K}$ has to be available. Two cases are considered: the first is the aforementioned "perfect observations" case, in which we have a complete and noise-free sequence of the state variable $x_k$ in the reduced space.
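The approximate loss of Eq. (3.3) is straightforward to evaluate once the analysed series and the truncated-model forecasts are stored. A minimal numpy sketch, with a linear stand-in for $g$ (the actual $g$ is a neural network):

```python
import numpy as np

def approx_loss(theta, g, x, x_r_next, N_c):
    """Approximate loss L_a of Eq. (3.3).
    x        : analysed states, shape (K+1, N_x)
    x_r_next : truncated-model forecasts, x_r_next[k] = M^r started from x[k]
               and integrated over N_c steps, shape (K, N_x)
    """
    K = x.shape[0] - 1
    resid = N_c * g(x[:-1], theta) - (x[1:] - x_r_next)
    return np.sum(resid ** 2) / (K * N_c ** 2)

# Sanity check with g(x, theta) = theta * x: if the missing tendency really is
# theta_true * x, the loss vanishes exactly at theta_true.
g = lambda x, theta: theta * x
rng = np.random.default_rng(0)
x = rng.standard_normal((11, 4))
N_c, theta_true = 5, 0.3
x_r_next = x[1:] - N_c * theta_true * x[:-1]
print(approx_loss(theta_true, g, x, x_r_next, N_c))  # 0.0
```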
This ideal situation will set the upper-bound performance in the algorithm evaluation that follows. The second case is the more realistic one of noisy (but unbiased) and possibly incomplete observations. Here, a complete sequence $x_{0:K}$ is obtained by processing the incomplete and noisy observations using DA [2]: the observations $y_k$ are combined with the forecast from the truncated model $x^r_k$ in order to provide the analysed state vector $x^a_k$. The DA method used in this work is the Finite-Size Ensemble Kalman Filter (EnKF-N) [23], implemented in the DAPPER framework [24]. Even though the proposed algorithm is general and suitable for any DA algorithm, the EnKF-N has been chosen because of its efficiency. In particular, the inflation factor, a needed fix to mitigate the impact of sampling errors in ensemble-based DA methods, is automatically estimated (thus avoiding long tuning). This inflation factor also implicitly and partially accounts for the effect of the model error.

The correction $x^a_k - x^r_k$ made by DA is called the analysis increment and was used to estimate model error due to unresolved scales in sequential DA in [25,26]. An analysis increment is composed of a correction both for the model error (which is what the hybrid model aims at estimating) and for the initial condition error (which cannot be represented by the hybrid model). There is also an additional uncertainty due to the observation errors. The initial condition and observation errors are two sources of uncertainty called the data uncertainties. If they are too large relative to the model error, they could bias the estimation of the loss function given in Eq. (3.3), causing the hybrid model to overfit these data and to lack generalisation to other initial conditions.
To mitigate this problem, the time series $x^a_{0:K}$ estimated by DA is filtered using a simple low-pass filter (a rolling mean), producing a smoothed time series $x^s_{0:K}$. This filter is expected to correct for data uncertainty provided that the observation error is uncorrelated in time, and thus contains high temporal frequencies. On the other hand, this filtering could remove the fastest scales of the unresolved part of the model. Although the linear superposition assumption presumes that these high frequencies can be neglected, we do not know a priori to which extent, which could hamper the forecast skill. The scale separation between the model error acting on long time scales and the initial errors acting on faster scales has been used in previous studies either to estimate the model error in DA [18] or to improve the forecast of a NN model [11]. Finally, note that the filtering can be adapted separately for the fast and the slow variables contained in $x_k$. This is, for instance, the case in coupled atmosphere-ocean models, and is addressed in section 5.

Algorithm 1
Summary of the algorithm used to determine the hybrid model
Input: observations $y_{0:K}$, truncated model $\mathcal{M}^r$, NN architecture $g(x, \theta)$.
Output: state vector estimation $x^s_{0:K}$, optimal value of $\theta$.
1: if $y_{0:K}$ is perfect then
2:   $x^s_{0:K} = y_{0:K}$
3: else
4:   use a DA algorithm (e.g. EnKF-N) to estimate the state vector series $x^a_{0:K}$
5:   filter the components (or a subset of components) of $x^a_{0:K}$ using a low-pass filter to produce the smoothed field $x^s_{0:K}$
6: end if
7: compute the target for the NN: $\epsilon^m_k = (1/N_c)\,(x^s_{k+1} - \mathcal{M}^r(x^s_k, N_c \delta t))$
8: determine $\theta$ (training of the NN) using the dataset $(x^s_{0:K-1}; \epsilon^m_{0:K-1})$
9: return $(x^s_{0:K}, \theta)$

In both the perfect and imperfect observation cases, the loss function $\mathcal{L}_a(\theta)$ is minimised using a standard NN training procedure: with $g(x_k, \theta)$ represented by a NN, $x_k$ as the inputs and $\theta$ as the weights, the problem is equivalent to a supervised regression problem in which the targeted output of the neural network is $\epsilon^m_k = (1/N_c)(x_{k+1} - x^r_{k+1})$. The algorithm is summarised in Algorithm 1.

(c) Numerical experiment protocol

The performance of the algorithm is evaluated on twin experiments: a full model $\mathcal{M}$ is used to produce true states, synthetic observations, and to assess the forecast skill of the hybrid model. The truncated model $\mathcal{M}^r$ is obtained by neglecting some components of the true model.

A series of true states $x_{0:K}$ is produced using the true model after projection into the reduced space. This series is used as perfect observations to obtain the so-called "perfect observation-derived hybrid model". Then, synthetic observations are generated using the definition in Eq. (2.8). Here, it is assumed that the observation operator and the observation error statistics are perfectly known. This assumption is not needed for the functioning of the proposed algorithm and could be relaxed at a future stage. Nonetheless, it is made here to simplify the interpretation of the results, so that the focus is on the truncated model error. Furthermore, the observation error statistics can also be estimated within the DA process itself (see [40] for a review of such methods), and thus be integrated into our combined ML-DA method. The observations generated using Eq. (2.8) are used in Algorithm 1 to produce the so-called "DA-derived hybrid model".
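Steps 7 and 8 of Algorithm 1 can be sketched as follows, assuming the smoothed series $x^s_{0:K}$ has already been produced by the DA and filtering stages (here it is synthetic), and replacing the NN by a linear map so that the training reduces to least squares. All models and sizes are illustrative stand-ins:

```python
import numpy as np

def truncated_model(x, dt):
    # toy resolved core M^r: explicit-Euler linear decay (illustrative only)
    return x + dt * (-0.5 * x)

def run_truncated(x0, N_c, dt):
    # M^r composed N_c times, i.e. one observation interval Delta_t = N_c * dt
    x = x0
    for _ in range(N_c):
        x = truncated_model(x, dt)
    return x

# Steps 1-6: assume the analysed-and-smoothed series x^s_{0:K} is already
# available (from the EnKF-N and low-pass filtering); here it is synthetic.
rng = np.random.default_rng(1)
K, N_x, N_c, dt = 200, 8, 5, 0.01
x_s = rng.standard_normal((K + 1, N_x))

# Step 7: targets eps^m_k = (1/N_c) * (x^s_{k+1} - M^r(x^s_k, N_c * dt))
targets = np.stack([(x_s[k + 1] - run_truncated(x_s[k], N_c, dt)) / N_c
                    for k in range(K)])

# Step 8: with g(x, theta) = x @ theta (linear stand-in for the NN),
# minimising L_a reduces to an ordinary least-squares problem.
theta, *_ = np.linalg.lstsq(x_s[:-1], targets, rcond=None)
print(theta.shape)  # (8, 8)
```

A real run would replace the least-squares step by the stochastic training of the NN described in section 4(b).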
Note that the perfect observation-derived hybrid model benefits from optimal information at each time step (the state is completely observed without noise). Given the conditions stated in section 2(b), it is expected to be the best possible model and represents the benchmark against which we will test the DA-derived hybrid model.

(d) Evaluation metrics

The skill of each model is evaluated by running an ensemble of $N_f = 20$ forecasts from $N_f$ different initial conditions and for a time period $\tau$: $x^{f(l)}(\tau)$, $l = 1, \cdots, N_f$. The $N_f$ different initial conditions are chosen on the attractor of the true model (by running the model long enough before setting the initial conditions), independent of the values of the time series $x_{0:K}$ used to perform the DA and the NN training. This ensemble of forecasts constitutes the so-called test dataset, as it is used neither to optimise nor to tune the algorithm. If tuning the algorithm is needed, the original time series $x_{0:K}$ can be split into a training part (used to optimise the NN) and a validation part (used to tune the NN algorithm).

As evaluation metric we use the relative root mean square error (R-RMSE),

$$\text{R-RMSE}(\tau, n) = \sqrt{\frac{1}{N_f}\sum_{l=1}^{N_f} \frac{1}{V_n(x^t)}\left(x^{f(l)}_n(\tau) - x^{t(l)}_n(\tau)\right)^2}, \qquad (3.4)$$

where $x^{f(l)}_n(\tau)$ (resp. $x^{t(l)}_n(\tau)$) is the forecast from the hybrid or the truncated model (resp. the true model) at time $\tau$ for the $n$-th component of $x$ and for the sample $l$ corresponding to a simulation from one particular initial condition. $V_n(x^t)$ is the $n$-th component of the variance over the time dimension of the true model forecast time series $x^t$. Note that if the truncated and/or the hybrid models have the same variability as the true one (i.e. the same variance in time), the R-RMSE saturates at $\sqrt{2}$ once the forecast has decorrelated from the truth.
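A sketch of Eq. (3.4) at a single lead time; the final check illustrates the saturation level for two decorrelated unit-variance series:

```python
import numpy as np

def r_rmse(x_f, x_t, V):
    """Relative RMSE of Eq. (3.4) at a single lead time.
    x_f, x_t : forecasts and true states, shape (N_f, N_x)
    V        : climatological variance of the true series, shape (N_x,)
    Returns one score per state component n."""
    return np.sqrt(np.mean((x_f - x_t) ** 2 / V, axis=0))

# For two independent unit-variance series, E[(x_f - x_t)^2] = 2V, so the
# score saturates near sqrt(2) once forecast and truth have decorrelated.
rng = np.random.default_rng(0)
x_f = rng.standard_normal((10000, 3))
x_t = rng.standard_normal((10000, 3))
print(r_rmse(x_f, x_t, np.ones(3)))  # all three values close to 1.414
```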
4. Application to the two-scale Lorenz model

(a) Description of the model
The two-scale Lorenz model [27], hereafter L2S, is given by the following set of ODEs:

$$\frac{\mathrm{d}x_n}{\mathrm{d}t} = \psi^+_n(x) + F - \frac{hc}{b}\sum_{m=0}^{9} u_{m+10n},$$
$$\frac{\mathrm{d}u_m}{\mathrm{d}t} = \frac{c}{b}\,\psi^-_m(bu) + \frac{hc}{b}\,x_{\lfloor m/10 \rfloor},$$
$$\psi^\pm_n(x) = x_{n\mp1}\,(x_{n\pm1} - x_{n\mp2}) - x_n, \qquad (4.1)$$

where $n = 0, \cdots, N_x - 1$ ($N_x = 36$) and $m = 0, \cdots, N_u - 1$ ($N_u = 360$). The indices are periodic, e.g. $x_{N_x} = x_0$. The values chosen for the parameters are the same as in [28]: the time-scale ratio $c$ is set to 10, the space-scale ratio $b = 10$, the coupling $h = 1$ and the forcing $F = 10$. Time $t$ is expressed in model time units, denoted MTU hereafter.

This set of two-scale ODEs, considered as the true model, is integrated using a fourth-order Runge-Kutta scheme with a time step of 0.005 MTU. The ODEs describing the evolution of $x$ only represent the truncated model and are obtained by setting the coupling $h$ to 0. It is integrated using a fourth-order Runge-Kutta scheme with a time step of 0.01 MTU.

(b) Setup of the reference experiment

A so-called "reference experiment" is defined in this section. The true model is integrated over approximately 1500 MTU after a spinup of 3 MTU to produce the true state, on which observational noise is added. The EnKF-N is used to assimilate these observations using a large number $N = 50 > N_x$ of ensemble members to reduce sampling errors. A noise is added to the state vector after each forecast to approximately account for the model error arising from the model being truncated. It helps to avoid filter divergence and can be seen as additive inflation. This step is necessary given that, due to model error, the forecast ensemble would otherwise be under-dispersive. The noise is assumed Gaussian with zero mean and a standard deviation $\sigma^m$ optimised after tuning experiments (not shown here).
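The L2S dynamics of Eq. (4.1) and their fourth-order Runge-Kutta integration can be sketched as a minimal numpy transcription (parameter values as in the text; `np.roll` implements the cyclic boundary conditions):

```python
import numpy as np

def psi_plus(v):
    # psi^+_n(v) = v_{n-1} (v_{n+1} - v_{n-2}) - v_n, periodic indices
    return np.roll(v, 1) * (np.roll(v, -1) - np.roll(v, 2)) - v

def psi_minus(v):
    # psi^-_m(v) = v_{m+1} (v_{m-1} - v_{m+2}) - v_m, periodic indices
    return np.roll(v, -1) * (np.roll(v, 1) - np.roll(v, -2)) - v

def l2s_tendency(state, F=10.0, h=1.0, c=10.0, b=10.0, N_x=36):
    """Tendencies of the two-scale Lorenz model, Eq. (4.1); state = (x, u)."""
    x, u = state
    dx = psi_plus(x) + F - (h * c / b) * u.reshape(N_x, -1).sum(axis=1)
    # np.repeat(x, 10)[m] = x_{floor(m/10)}: each x drives its 10 fast modes
    du = (c / b) * psi_minus(b * u) + (h * c / b) * np.repeat(x, u.size // N_x)
    return dx, du

def rk4_step(state, dt, **kw):
    """One fourth-order Runge-Kutta step, as used to integrate the model."""
    def add(s, k, a):
        return (s[0] + a * k[0], s[1] + a * k[1])
    k1 = l2s_tendency(state, **kw)
    k2 = l2s_tendency(add(state, k1, dt / 2), **kw)
    k3 = l2s_tendency(add(state, k2, dt / 2), **kw)
    k4 = l2s_tendency(add(state, k3, dt), **kw)
    return (state[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            state[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

x0 = np.full(36, 8.0); x0[0] += 0.01   # near-rest state plus a perturbation
u0 = np.full(360, 0.01)
x1, u1 = rk4_step((x0, u0), 0.005)      # one true-model step of 0.005 MTU
print(x1.shape, u1.shape)  # (36,) (360,)
```

The truncated model corresponds to keeping only the `dx` line with the coupling switched off and stepping with dt = 0.01 MTU.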
In this reference experiment, the analysis obtained from the DA is not filtered; the sensitivity to the filtering of the DA analysis is studied in section 4(d). The last step of the algorithm is to train a neural network to emulate the unresolved part of the model on the 1500 MTU time series produced by DA. The NN architecture is composed of convolutional layers (denoted "conv." in Table 1) where the nonlinear activation function is a hyperbolic tangent (denoted "tanh"). Some additional components have been added, mainly to regularise the training: a batchnorm layer at the input, which normalises the training batches, and an L2-regularisation term on the parameters of the last layer. The parameters of the NN are optimised using the RMSprop optimiser [29] over 100 epochs. For each epoch, batches of 33 training examples are used to optimise the weights, until all the examples are consumed: this is a standard stochastic minimisation procedure [30]. Full details on the reference experiment are given in Table 1, in the column labelled L2S.

(c) Results

The terminology of the experiments described in the following is recalled in Table 2. In Figure 1, both the true model and the DA-derived hybrid model (based on the reference experiment described in section 4(b)) are initialised from an initial condition on the attractor, chosen to be independent of the training set $x_{0:K}$. The true and the hybrid models are run over 5 MTU, and their difference is displayed. It can be noticed that both runs are very close until 2 MTU and that the hybrid model has predictive skill until 3-4 MTU for this particular set of initial conditions. Note that the Lyapunov time of the truncated model is 0.72 MTU [33], meaning that the hybrid model provides accurate forecasts until 3 Lyapunov times.

In Figure 2, the R-RMSE is averaged over 20 members corresponding to 20 initial conditions and also across all 36 components of $x$, and is displayed as a function of time.
Several densities of observations have been considered: if $N_y = 36$, the full state is observed in the reduced space at each observation time. If $N_y < 36$, $H_k$ is a sub-sampling operator that randomly draws $N_y$ values from the state following a uniform distribution, changing the observation locations at each time step.

The results shown in Figure 2 (left panel) confirm that the DA-derived hybrid model has predictive skill, significantly better than the truncated model until 4 MTU. The effect of reduced observation density is minor: the skill of the various hybrid models' forecasts is very similar. This shows the algorithm's efficiency in handling sparse data to accurately train a NN model. This is a key strength of our method; most of the other approaches that parametrise a part of the model using ML assume dense observations, e.g. [3,4,7] (similarly to our perfect observation case).

Table 1.
Settings of the numerical experiments with the L2S “reference experiment” and with MAOOAM.
| Parameter | Symbol | L2S value | MAOOAM value | Note |
|---|---|---|---|---|
| Climatological std | σ_hf | NA | | Calculation in 5(b). |
| **Model parameters** | | | | |
| Size of the state | N_x | 36 | 36 | |
| Integration time step | δt | | | |
| **Imperfect observation setting** | | | | |
| Standard deviation | σ^o | | σ_hf | Ocean filtered. |
| Observation operator | H | I_{N_x} | I_{N_x} | Identity matrix |
| Time sampling | ∆t | | 27 hours | |
| **Data assimilation** | | | | |
| DA algorithm | | EnKF-N | EnKF-N | |
| Ensemble size | N | 50 | 50 | |
| Model additive noise | σ^m | | scaled on σ_hf,1:20 | Atmosphere only |
| Low-pass filtering size | | No | 55 days | Ocean only. |
| **Neural network** | | | | |
| Type of layer 1 | | Batchnorm | Batchnorm | [31] |
| Type of layer 2 | | conv. | dense | |
| Size of layer 2 | | 43 | 100 | |
| Activation of layer 2 | | tanh | ReLU | [32] |
| Filter size of layer 2 | | 5 | - | |
| Type of layer 3 | | conv. | dense | |
| Size of layer 3 | | 28 | 50 | |
| Filter size of layer 3 | | 1 | - | |
| Activation of layer 3 | | tanh | ReLU | |
| Type of layer 4 | | conv. | dense | |
| Size of layer 4 | | 36 | 36 | |
| Activation of layer 4 | | Linear | Linear | |
| L2 regularisation | | 0.07 | | |
| Optimiser | | RMSprop | RMSprop | [29] |
| Number of epochs | | 100 | 100 | |
| Batch size | | 33 | 128 | |
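For concreteness, the L2S column of Table 1 can be sketched as a periodic 1-D convolutional forward pass in numpy. The weights below are random (an untrained stand-in), the batchnorm layer and regularisation are omitted, and the channel counts only loosely follow the table:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_periodic(x, w, b):
    """1-D convolution with periodic (cyclic) padding, matching the cyclic
    topology of the L2S state. x: (L, C_in); w: (k, C_in, C_out); b: (C_out,)."""
    k = w.shape[0]
    h = k // 2
    xp = np.concatenate([x[-h:], x, x[:h]]) if h > 0 else x
    out = np.stack([np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0])])
    return out + b

def l2s_net(x, params):
    """Forward pass loosely mimicking the Table 1 L2S column:
    conv(43 channels, k=5, tanh) -> conv(28, k=1, tanh) -> conv(k=1, linear),
    the final layer producing one correction per grid point."""
    h1 = np.tanh(conv1d_periodic(x, *params[0]))
    h2 = np.tanh(conv1d_periodic(h1, *params[1]))
    return conv1d_periodic(h2, *params[2])

params = [
    (0.1 * rng.standard_normal((5, 1, 43)), np.zeros(43)),
    (0.1 * rng.standard_normal((1, 43, 28)), np.zeros(28)),
    (0.1 * rng.standard_normal((1, 28, 1)), np.zeros(1)),
]
x = rng.standard_normal((36, 1))   # one L2S state: 36 grid points, 1 channel
print(l2s_net(x, params).shape)    # (36, 1)
```

Because all convolutions use cyclic padding, the network is shift-equivariant, consistent with the periodic geometry of the L2S grid.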
Table 2.
Numerical experiments terminology
| Term | Meaning |
|---|---|
| Reference experiment | Setup defined in Table 1 (column L2S). |
| DA-derived hybrid model | NN trained with the data assimilation reanalysis obtained from noisy and sparse observations. |
| Perfect observation-derived hybrid model | NN trained with perfect observations. |

(d) Sensitivity studies
In Figure 2 (middle and right panels), the forecast sensitivity to different parameters is studied using the R-RMSE at a lead time $\tau = 2$ MTU, averaged over 20 simulations and over all components of $x$. The middle panel shows the sensitivity to the observation sampling frequency. For both the perfect observation-derived and DA-derived hybrid models, the forecast skill is sensitive to the value of $\Delta t$, and is significantly degraded for higher values of $\Delta t$. This is

Figure 1.
Hovmøller plot of the true model (upper panel) and of the DA-derived hybrid model (middle panel) for the same particular initial conditions over 5 MTU. The bottom panel shows the difference between the two simulations. The setup of the experiment is detailed in Table 1.
Figure 2.
Left: R-RMSE versus time for the perfect observation-derived hybrid model (green), the truncated model (red) and the DA-derived hybrid model (other colours) for 3 different densities of observation. The observation standard deviation $\sigma^o$ and the time interval $\Delta t$ are as in the reference experiment; thus for $N_y = 36$ we retrieve the reference experiment. Middle: R-RMSE at a lead time $\tau = 2$ MTU for the DA-derived hybrid models (orange) and the perfect observation-derived hybrid models (green) for different observation sampling times $\Delta t$, as well as the truncated model (red). Right: R-RMSE for the hybrid models trained with no filtering of the DA analysis (orange) and with a 0.05 MTU window filter (purple) for different observation error standard deviations $\sigma^o$. Black contours indicate the reference experiment conditions described in Table 1.

probably due to the violation of the linear superposition assumption for high values of $\Delta t$, so that the coupling between the resolved and unresolved parts of the model, as well as the effect of the nonlinearity of the unresolved part, are no longer negligible.

The right panel of Figure 2 examines the impact of the observational noise. The result of the reference setting (where the analysis is not filtered before training the NN) is compared with the case of filtering with a rolling mean of 0.05 MTU (i.e. five time steps of the truncated model). The fast unresolved signal is damped by the filter, but also by the increase of the observational noise that tends to add randomness on all scales, including the small ones. In this case, except for very strong noise, filtering does not seem to drastically improve the forecast skill.
Notably, even strong noise in the data has only a very small influence on the forecast skill of the hybrid model: when the noise on the observations is multiplied by a factor of 20, the error in the forecast at a 2 MTU lead time is only multiplied by a factor of 1.3.

From the results on the two-scale Lorenz model, we conclude that the algorithm is robust against varying data spatial density, but is sensitive to their temporal distribution. Also, filtering the analyses obtained from DA may be useful for slow processes but can deteriorate the results by filtering out significant fast processes.
5. Application to a low-order coupled ocean-atmosphere model

(a) Description of the model
We consider here the Modular Arbitrary-Order Ocean-Atmosphere Model (MAOOAM) introduced by [34]. MAOOAM has 3 layers (2 for the atmosphere and 1 for the ocean) and is a reduced-order quasi-geostrophic model resolved in spectral space. Its state is composed of $n_a$ modes of the atmospheric barotropic streamfunction $\psi_{a,i}$ and of the atmospheric temperature anomaly $\theta_{a,i}$, plus $n_o$ modes of the oceanic streamfunction $\psi_{o,j}$ and of the oceanic temperature anomaly $\theta_{o,j}$. The total number of variables is $N_x = 2n_a + 2n_o$. We consider two versions of MAOOAM: the true model with dimension $N_z = 56$ ($n_a = 20$, $n_o = 8$) and the truncated model with $N_x = 36$ ($n_a = 10$, $n_o = 8$), both corresponding to configurations defined in [34]. The truncated model is missing 20 high-order atmospheric variables (10 for the streamfunction and 10 for the temperature anomaly). Thus the truncated model does not resolve the atmosphere-ocean coupling related to these high-order atmospheric modes.

The true model is used to generate synthetic observations. The forecast skill and the long-term properties of the truncated and hybrid models will be evaluated by inspecting 3 crucial model variables, called key variables: $\psi_{o,2}$, $\theta_{o,2}$ and $\psi_{a,1}$ (i.e. the second components of the ocean streamfunction and temperature and the first component of the atmospheric streamfunction). They account for 42%, 51%, and 18% respectively of the variability of a reanalysis of 2-dimensional fields [35], and have already been used in previous studies (e.g. [36]). MAOOAM has also recently been used to study coupled data assimilation methods [37,38]. Unsurprisingly, in MAOOAM the ocean variables are considered the slow ones while the atmospheric variables are the fast ones.

(b) Experimental setup

We will express time in real time units (minutes, hours, days, ...) but, in practice, the model time is non-dimensional.
Consequently, the dimensional time values presented hereafter are not round numbers. Given the diverse time scales and amplitudes of the MAOOAM variables, the noise parameters are all scaled on a climatological standard deviation of high frequencies $\sigma_{hf} \in \mathbb{R}^{N_x}$, defined as the temporal standard deviation of the state vector after filtering out slow variations with a period longer than 1 month. This high-pass filter is carried out by subtracting the 1-month running average.

The parameters of the experiments are presented in Table 1, in the column labelled MAOOAM. The true model is integrated after a long spinup, in the same configuration as in [34]. In all experiments with MAOOAM, the state is fully observed every $\Delta t = 27$ hours. A small modification was made to the observations from Eq. (2.8) to account for the fact that observations of the ocean are not at the same scale as those of the atmosphere: before being assimilated, instantaneous ocean observations are averaged over a 55-day rolling period centred at the analysis times. The EnKF-N is used as the DA algorithm. The noise on the model forecast is added only to the atmospheric variables, with a small standard deviation scaled on $\sigma_{hf,1:20}$. As mentioned in section 3(b), the analysis obtained from the DA is filtered. The slow processes are expected to occur mainly in the ocean, so only the ocean components of the state vector $x^a_{0:K}$ are filtered to produce $x^s_{0:K}$. Differently from the L2S model experiments, filtering the analysis has proven necessary to train the hybrid model using MAOOAM.

The NN architecture is a simple three-layer multi-layer perceptron; see Table 1 for full details.
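The climatological standard deviation of high frequencies $\sigma_{hf}$ defined above can be sketched as follows, with the 1-month running average replaced by a generic centred window of `window` samples:

```python
import numpy as np

def sigma_hf(x, window):
    """Climatological std of high frequencies: subtract a centred running
    mean of length `window` (standing in for the 1-month average), then take
    the std over time. x: (K, N_x); returns one value per component."""
    K = x.shape[0]
    h = window // 2
    slow = np.stack([x[max(0, k - h):min(K, k + h + 1)].mean(axis=0)
                     for k in range(K)])
    return (x - slow).std(axis=0)
```

For a purely slow (constant) series the result is zero; for white noise most of the variance survives the high-pass, as expected.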
As opposed to the L2S model, the state vector has no locality property (because it is defined in spectral space), so convolutional layers are not applicable (see the discussion about locality in [20]). The training of the NN is performed in the same way as for the L2S experiments.

(c) Results

The forecast skill metrics are presented in Table 3 for the truncated model as well as for the perfect-observation-derived and the DA-derived hybrid models. Given the different time scales involved, the forecast lead time of the key atmospheric variable ψ_{a,1} is 1 day, whereas the forecast lead time of the two key oceanic variables ψ_{o,2} and θ_{o,2} is 2 years. It can be seen that both the perfect-observation-derived and the DA-derived hybrid models have superior skill to the truncated model. The improvement is larger for the ocean, by a factor of 2 to 3, and is similar for both hybrid models. Recall that the true model has the same oceanic variables as the truncated model, so there is no difference in the representation of the purely oceanic processes. The improvement is thus entirely due to an enhanced representation of the atmosphere-ocean coupling processes, the hybrid model better representing the interplay between the unresolved fast atmospheric variables and the slow oceanic variables.

The atmospheric key variable is improved to a lesser extent by the two hybrid models, and the perfect-observation-derived model is significantly better than the DA-derived model. This shows the limited capability of the hybrid model to represent a fast process, a situation further exacerbated for the DA-derived hybrid model, when only noisy and partial data are at disposal. This result was expected given the assumptions made on the unresolved term of the model in section 3(a), where a slow variation of the unresolved term was assumed.
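Assuming the R-RMSE reported in Table 3 denotes a root-mean-square error normalised by a climatological standard deviation (the metric is defined earlier in the paper), the skill score can be sketched as:

```python
import numpy as np

def r_rmse(forecast, truth, clim_std):
    """Relative RMSE: RMSE over a set of forecasts at a fixed lead
    time, normalised by a climatological standard deviation.

    forecast, truth : arrays of shape (n_cases,) for one variable.
    clim_std        : scalar climatological standard deviation.
    """
    rmse = np.sqrt(np.mean((forecast - truth) ** 2))
    return rmse / clim_std
```

With this normalisation, a value of 1 means the forecast error is as large as the climatological variability, so the oceanic scores of 0.06-0.10 for the hybrid models indicate errors well below climatology.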
The fast processes are also less accurately represented because the sampling rate of the observations (27 hours) is well beyond the atmospheric time scale, and because of the presence of observation errors (when applied).

In Figure 3, the attractors of the different models are displayed in the phase space defined by two key variables: ψ_{o,2} and ψ_{a,1}. A significant difference can immediately be seen between the attractors of the truncated and the true models: the truncated model visits areas of the phase space that are not admitted by the true dynamics. Remarkably, these discrepancies are reduced by the hybrid models derived both from perfect observations and from DA. Some states still lie outside the true model attractor, but much fewer.

A quantitative characterisation of the attractors is presented in Table 4, which provides the 3 quartiles (including the median) of each key variable. For the truncated and hybrid models, the difference of the quartiles is given relative to the true model. For all the oceanic variables, the distribution of values of both hybrid models is significantly closer to the true distribution than that of the truncated model. For the key atmospheric variable ψ_{a,1}, only the hybrid model derived from perfect observations shows an improvement. This confirms the conclusion drawn from the forecast skill: the hybrid model represents the slow processes well, in particular the oceanic variables, whereas the fast processes are not fully retrieved, in particular in the case of the DA-derived hybrid model.
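The quartile-based comparison of the attractors can be sketched as follows. The helper below is illustrative: it assumes the relative differences are taken with respect to the absolute value of the true quartiles, which is one natural reading of Table 4:

```python
import numpy as np

def quartile_error(model_run, true_run):
    """Relative difference (in %) of the three quartiles (Q1, median,
    Q3) of a long model trajectory with respect to the true trajectory,
    for a single key variable.

    model_run, true_run : 1-D arrays of values along a trajectory.
    """
    q_model = np.percentile(model_run, [25, 50, 75])
    q_true = np.percentile(true_run, [25, 50, 75])
    # assumption: normalise by the magnitude of the true quartile
    return 100.0 * (q_model - q_true) / np.abs(q_true)
```

A model whose long-term distribution matches the truth would return values near zero for all three quartiles.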
6. Conclusion
We have developed a novel method to build a hybrid model consisting of a physics-based truncated model and a data-driven model of the unresolved processes. The approach is based

Table 3.
Forecast R-RMSE of hybrid and truncated MAOOAM models
R-RMSE (lead time τ)   ψ_{o,2} (2 years)   θ_{o,2} (2 years)   ψ_{a,1} (1 day)
Truncated              0.23                0.21                0.36
Perfect obs. hybrid    0.07                0.07                0.23
DA hybrid              0.10                0.06                0.28

Figure 3.
Cross-section of the attractor for two key variables ψ_{a,1} and ψ_{o,2} in the true model (upper left), the truncated model (upper right), the perfect-observation-derived hybrid model (lower left) and the DA-derived hybrid model (lower right).

Table 4.
Quartiles of the key variables for the MAOOAM model, relative to the true model.

                      ψ_{o,2}                  θ_{o,2}
                      Q1      M      Q3        Q1     M      Q3
True model            …       …      …         …      …      …
Truncated             -229%   -80%   -26%      -22%   -10%   -6%
Perfect obs. hybrid   -55%    -26%   0.6%      7%     -2%    -4%
DA hybrid             -14%    9%     8%        8%     -5%    -0.2%

                      ψ_{a,1}
                      Q1      M      Q3
True model            …       …      …
Truncated             -12%    -11%   -11%
Perfect obs. hybrid   -0.6%   -2%    0.02%
DA hybrid             -15%    -14%   -11%

on realistic assumptions that only noisy and incomplete observations are available, at a lower frequency than the model integration time step.

With a two-scale low-order chaotic system [27], we showed that the hybrid model forecast skill is sensitive to the observation frequency but very robust against high observational noise and sparse spatial distribution. This is probably because increased observation frequencies challenge the validity of the linear superposition assumption more than large observational noise does (see the discussion in the Appendix of [39]). We then applied the method to the low-order coupled ocean-atmosphere model MAOOAM [34], which contains multiple temporal scales. Forecast skill and global statistics were significantly improved by the hybrid model compared with the truncated model, encouraging further studies in high-dimensional and more realistic scenarios. Notably, the hybrid model derived from noisy observations has forecast skill on the oceanic variables comparable to that of the hybrid model derived from perfect observations.

In view of operational systems, it should be noted that the proposed algorithm relies on two existing techniques, data assimilation and neural network training, that both scale well to high dimensions (see, e.g., [41] and [42]). In principle, the present algorithm can be applied to larger and more realistic problems.
In particular, the fact that the method does not rely on the adjoint of the truncated model is an advantage in terms of code maintenance. However, we foresee some practical challenges: for instance, the computational architectures and data types used for physics-based numerical models and for machine learning algorithms can be very different (e.g. multi-core supercomputers versus graphics processing units). Training and running hybrid models efficiently imposes heavy requirements on both hardware and software, and may come with an overhead even if some tools are very promising [43].

The approach presented here can also accommodate an additional representation of the remaining model error (i.e. the model error of the hybrid model): this could be done either within the numerical model, by parameterising the model error [21], or by training stochastic neural networks [44].

Data Accessibility.
Data used in all the experiments are available at ftp://ftp.nersc.no/reddaml/. The instructions to download the data, to run the Lorenz 2-scale application and to reproduce the figures and tables of the article can be found at https://github.com/brajard/reddaml/releases/tag/v1.0.

Authors’ Contributions.
JB first proposed the theory, and implemented and conducted the numerical experiments. All authors contributed to the interpretation of the theory and the results, as well as to the editing of the manuscript. All authors approved the manuscript.
Competing Interests.
The authors declare that they have no competing interests.
Funding.
JB and LB have been funded by the project REDDA.
Acknowledgements.
The authors are thankful to Patrick Raanes (NORCE, NO) for his support on the use of the data assimilation Python platform DAPPER, and to Jonathan Demaeyer (RMI, BE) for the insightful discussions about MAOOAM. Thanks to Joséphine Schmutz (IMT) for her review of the GitHub repository. Many thanks also to the two anonymous reviewers for their valuable comments.
References
1. Merchant CJ, Embury O, Bulgin CE, Block T, Corlett GK, Fiedler E, Good SA, Mittaz J, Rayner NA, Berry D, et al.
Scientific Data, 1–18.
2. Carrassi A, Bocquet M, Bertino L, Evensen G. 2018 Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change, e535.
3. O’Gorman PA, Dwyer JG. 2018 Using machine learning to parameterize moist convection: Potential for modeling of climate, climate change, and extreme events. Journal of Advances in Modeling Earth Systems, 2548–2563.
4. Rasp S, Pritchard MS, Gentine P. 2018 Deep learning to represent subgrid processes in climate models.
Proceedings of the National Academy of Sciences, 9684–9689.
5. Krasnopolsky VM, Fox-Rabinovitz MS, Belochitski AA. 2013 Using ensemble of neural networks to learn stochastic convection parameterizations for climate and numerical weather prediction models from data simulated by a cloud resolving model.
Advances in Artificial Neural Systems.
6. Bolton T, Zanna L. 2019 Applications of deep learning to ocean data inference and subgrid parameterization.
Journal of Advances in Modeling Earth Systems, 376–399.
7. Brenowitz ND, Bretherton CS. 2018 Prognostic validation of a neural network unified physics parameterization. Geophysical Research Letters, 6289–6298.
8. Rasp S. 2020 Coupled online learning as a way to tackle instabilities and biases in neural network parameterizations: general algorithms and Lorenz 96 case study (v1.0). Geoscientific Model Development, 2185–2185.
9. Brunton SL, Proctor JL, Kutz JN. 2016 Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 3932–3937.
10. Pathak J, Hunt B, Girvan M, Lu Z, Ott E. 2018 Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach.
Physical Review Letters, 024102.
11. Faranda D, Vrac M, Yiou P, Pons FME, Hamid A, Carella G, Ngoungue Langue CG, Thao S, Gautard V. 2020 Boosting performance in machine learning of geophysical flows via scale separation. Working paper or preprint.
12. Park DC. 2010 A time series data prediction scheme using bilinear recurrent neural network. In, pp. 1–7. IEEE.
13. Fablet R, Ouala S, Herzet C. 2018 Bilinear residual neural network for the identification and forecasting of geophysical dynamics. In, pp. 1477–1481. IEEE.
14. Scher S. 2018 Toward data-driven weather and climate forecasting: Approximating a simple general circulation model with deep learning.
Geophysical Research Letters, 12–616.
15. Dueben PD, Bauer P. 2018 Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 3999–4009.
16. de Bezenac E, Pajot A, Gallinari P. 2019 Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment, 124009.
17. Nguyen D, Ouala S, Drumetz L, Fablet R. 2019 EM-like learning chaotic dynamics from noisy and partial observations. arXiv preprint arXiv:1903.10335.
18. Laloyaux P, Bonavita M, Dahoui M, Farnan J, Healy S, Hólm E, Lang S. 2020 Towards an unbiased stratospheric analysis.
Quarterly Journal of the Royal Meteorological Society.
19. Bonavita M, Laloyaux P. 2020 Machine learning for model error inference and correction.
Earth and Space Science Open Archive, p. 36.
20. Brajard J, Carrassi A, Bocquet M, Bertino L. 2020 Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: a case study with the Lorenz 96 model.
Journal of Computational Science, 101171.
21. Bocquet M, Brajard J, Carrassi A, Bertino L. 2020 Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. Foundations of Data Science, 55–80.
22. Tian X, Zhang H, Feng X, Xie Y. 2018 Nonlinear least squares En4DVar to 4DEnVar methods for data assimilation: Formulation, analysis, and preliminary evaluation. Monthly Weather Review, 77–93.
23. Bocquet M, Raanes PN, Hannart A. 2015 Expanding the validity of the ensemble Kalman filter without the intrinsic need for inflation.
Nonlinear Processes in Geophysics, 645.
24. Raanes PN, et al. International Journal of Bifurcation and Chaos, 3619–3626.
26. Mitchell L, Carrassi A. 2015 Accounting for model error due to unresolved scales within ensemble Kalman filtering. Quarterly Journal of the Royal Meteorological Society, 1417–1428.
27. Lorenz EN. 2005 Designing chaotic models.
Journal of the Atmospheric Sciences, 1574–1587.
28. Bocquet M, Brajard J, Carrassi A, Bertino L. 2019 Data assimilation as a learning tool to infer ordinary differential equation representations of dynamical models. Nonlinear Processes in Geophysics, 143–162.
29. Hinton G, Srivastava N, Swersky K. 2012 Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent.
30. Bottou L. 2010 Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Springer.
31. Ioffe S. 2017 Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In
Advances in Neural Information Processing Systems, pp. 1945–1953.
32. Glorot X, Bordes A, Bengio Y. 2011 Deep sparse rectifier neural networks. In
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
33. Carlu M, Ginelli F, Lucarini V, Politi A. 2018 Lyapunov analysis of multiscale dynamics: the slow bundle of the two-scale Lorenz 96 model. arXiv preprint arXiv:1809.05065.
34. De Cruz L, Demaeyer J, Vannitsem S. 2016 The Modular Arbitrary-Order Ocean-Atmosphere Model:
MAOOAM v1.0.
Geoscientific Model Development, 2793–2808.
35. Vannitsem S, Ghil M. 2017 Evidence of coupling in ocean-atmosphere dynamics over the North Atlantic. Geophysical Research Letters, 2016–2026.
36. Demaeyer J, Vannitsem S. 2017 Stochastic parametrization of subgrid-scale processes in coupled ocean–atmosphere systems: benefits and limitations of response theory. Quarterly Journal of the Royal Meteorological Society, 881–896.
37. Penny S, Bach E, Bhargava K, Chang CC, Da C, Sun L, Yoshida T. 2019 Strongly coupled data assimilation in multiscale media: Experiments using a quasi-geostrophic coupled model.
Journal of Advances in Modeling Earth Systems, 1803–1829.
38. Tondeur M, Carrassi A, Vannitsem S, Bocquet M. 2020 On temporal scale separation in coupled data assimilation with the ensemble Kalman filter. Journal of Statistical Physics, 1161–1185.
39. Bocquet M, Carrassi A. 2017 Four-dimensional ensemble variational data assimilation and the unstable subspace.
Tellus A: Dynamic Meteorology and Oceanography, 1304504.
40. Tandeo P, Ailliot P, Bocquet M, Carrassi A, Miyoshi T, Pulido M, Zhen Y. 2020 A review of innovation-based methods to jointly estimate model and observation error covariance matrices in ensemble data assimilation. Monthly Weather Review.
41. Sakov P, Counillon F, Bertino L, Lisæter K, Oke P, Korablev A. 2012 TOPAZ4: an ocean-sea ice data assimilation system for the North Atlantic and Arctic.
Ocean Science Discussions.
42. LeCun Y, Bengio Y, Hinton G. 2015 Deep learning. Nature, 436–444.
43. Ott J, Pritchard M, Best N, Linstead E, Curcic M, Baldi P. 2020 A Fortran-Keras deep learning bridge for scientific computing. arXiv preprint arXiv:2004.10652.
44. Gal Y, Ghahramani Z. 2016 Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning.