Data assimilation empowered neural network parameterizations for subgrid processes in geophysical flows
Suraj Pawar
School of Mechanical & Aerospace Engineering, Oklahoma State University, Stillwater, Oklahoma 74078, USA. [email protected]
Omer San
School of Mechanical & Aerospace Engineering, Oklahoma State University, Stillwater, Oklahoma 74078, USA. [email protected]

Abstract
In the past couple of years, there has been a proliferation in the use of machine learning approaches to represent subgrid scale processes in geophysical flows, with an aim to improve the forecasting capability and to accelerate numerical simulations of these flows. Despite its success for different types of flow, the online deployment of a data-driven closure model can cause instabilities and biases in modeling the overall effect of subgrid scale processes, which in turn leads to inaccurate prediction. To tackle this issue, we exploit the data assimilation technique to correct the physics-based model coupled with the neural network as a surrogate for unresolved flow dynamics in multiscale systems. In particular, we use a set of neural network architectures to learn the correlation between resolved flow variables and the parameterizations of unresolved flow dynamics, and we formulate a data assimilation approach to correct the hybrid model during its online deployment. We illustrate our framework in an application to the multiscale Lorenz 96 system, for which the parameterization model for unresolved scales is exactly known. Our analysis, therefore, comprises a predictive dynamical core empowered by (i) a data-driven closure model for subgrid scale processes, (ii) a data assimilation approach for forecast error correction, and (iii) both data-driven closure and data assimilation procedures. We show significant improvement in the long-term prediction of the underlying chaotic dynamics with our framework compared to using only neural network parameterizations for future prediction. Moreover, we demonstrate that these data-driven parameterization models can handle the non-Gaussian statistics of subgrid scale processes and effectively improve the accuracy of outer data assimilation workflow loops in a modular non-intrusive way.

Keywords: Neural network, subgrid scale processes, data assimilation, ensemble Kalman filter, chaotic system, multiscale Lorenz 96 model
1 Introduction

Geophysical flows are characterized by their multiscale nature: there is a massive difference between the largest and smallest scales, and these scales interact with each other to exchange heat, momentum, and moisture. This makes numerical simulations of geophysical flows in which every flow feature is resolved computationally unmanageable, even though the physical laws governing these processes are well known. Therefore, atmosphere and ocean models compute an approximate numerical solution on computational grids with a spacing of O(10 km) to O(100 km). The effect of unresolved scales is taken into account by using several parameterization schemes, which represent the dynamics of subgrid scale processes as a function of the resolved dynamics [1–3]. However, weather projection is marred by large uncertainties in the parameters of these parameterization schemes, and also by the incorrect structure of the parameterization equations themselves [4–6]. Typically, the parameters of these parameterization schemes are estimated through a model tuning process based on observations from experimental and field measurements, or on data generated from high-resolution numerical simulations [7, 8]. The nonlinear and multiscale nature of geophysical flows makes this tuning procedure cumbersome and can impede accurate climate prediction [9]. Recent developments in machine learning (ML), particularly deep learning, offer an alternative, data-driven route to building such closures.

Several studies have examined the a priori and a posteriori performance of data-driven Reynolds stress closure models, since the RANS equations equipped with such models can be ill-conditioned. Therefore, even though data-driven turbulence closure models predicted better closure terms, their online deployment does not lead to significant improvement in the mean velocity field prediction [22, 25]. Wu et al. [61] proposed a metric to evaluate the conditioning of RANS equations in the a priori setting and showed that the implicit treatment of Reynolds stresses leads to reduced error in mean velocity prediction.

Data assimilation (DA) is a well-established discipline in which observations are blended with the model to take uncertainties into account for improving the numerical prediction of the system [62–67], and it can be applied to achieve accurate prediction in hybrid models that employ a data-driven model as a submodel for some processes (for example, subgrid scale processes). DA tools are extensively utilized in geoscience and at numerical weather forecast centers to correct background predictions based on a combination of heterogeneous measurement data coming from ground observations and satellite remote sensing. These techniques have also been investigated recently for integrating experimental data into large-eddy simulations of engineering flows [68]. In a DA workflow, we merge forward model predictions with observational data. However, it has often been remarked that no model is correct, but some of them are useful. In typical DA studies and twin experiments, therefore, the subgrid scale processes have been modeled as Gaussian noise due to a lack of structural information on their mechanisms. If we knew their dynamics, either structurally or functionally, it would surely be wise to include them in the model before a DA analysis is executed. However, the subgrid scale processes in turbulent flows often cannot be accurately modeled by Gaussian noise, and ML methodologies can be adopted to get a grip on subgrid scale processes.
Hence, we put forth a neural network based statistical learning approach to model this uncertainty and incorporate the resulting information as a data-driven closure term in the forward model. We examine how the forecast error is reduced by including an ML based closure term in the underlying forward model. Indeed, the integration of DA with ML methodologies holds immense potential in various fields of physical science [69–73], and we demonstrate this through our study.

In this work, we propose a neural network closure framework for developing hybrid physics-ML models through DA for multiscale systems. In particular, we advocate the use of sequential DA techniques to tackle the closure modeling problem by incorporating real-time observations into a model equipped with neural network parameterization schemes for unresolved physics. To this end, we use real-time observations to regularize ML empowered predictive tools through an ensemble Kalman filter based approach. We focus on a two-level Lorenz 96 model [74] for our numerical experiments, since it provides a controllable test case for advancing turbulence parameterization theories, especially in the age of data-driven models. The Lorenz 96 system is an idealized model of atmospheric circulation and is used widely to test research ideas [75–77]. Even though the dynamics of both large and small scales are known exactly for a two-level Lorenz 96 model, it is very difficult to predict because of the strong interplay between the fast and slow subsystems. Therefore, we select this multiscale model for the assessment of data-driven closures for capturing the physics of the subgrid scales. Since we use an "explicit" evolution equation for the closure parameterizations, we can easily assess the data-driven models in a posteriori simulations. This often comprises a challenging task in LES computations, since the low-pass filtering operation is "implicitly" applied to the governing equations. Our approach is multifaceted in at least two ways. We first show that the infusion of DA approaches improves the forecasting quality of predictive models equipped with data-driven parameterizations. Second, we also demonstrate that the data-driven parameterizations help significantly to reduce forecast errors in DA workflows. Therefore, our modular framework can be considered as a way to incorporate real-time observations, which are prevalent in today's weather forecast stations, into hybrid models constituted from a physics-based model as the dynamical core of the system and a data-driven model to describe the unresolved physics.

The paper is structured as follows. In Section 2, we discuss the problem of parameterizations using a two-level Lorenz 96 model as a prototypical example. Section 3 details the two types of neural networks utilized in this study for learning the mapping between resolved variables and the parameterizations of unresolved scales. We explain the methodology of data assimilation and the deterministic ensemble Kalman filter algorithm in Section 4. In Section 5, we discuss the findings of our numerical experiments with a two-level Lorenz 96 model. Finally, we conclude with a summary and directions for future work in Section 6.

2 Multiscale Lorenz 96 model

In this section, we describe the two-level variant of the Lorenz 96 model proposed by Lorenz [74]. This model has been extensively investigated to study stochastic parameterization schemes [78–80], scale-adaptive parameterizations [81], and neural network parameterizations [58].
The two-level Lorenz 96 model can be written as

\frac{dX_i}{dt} = -X_{i-1}\left(X_{i-2} - X_{i+1}\right) - X_i - \frac{hc}{b}\sum_{j=1}^{J} Y_{j,i} + F,   (1)

\frac{dY_{j,i}}{dt} = -cb\, Y_{j+1,i}\left(Y_{j+2,i} - Y_{j-1,i}\right) - c\, Y_{j,i} + \frac{hc}{b} X_i,   (2)

where Equation 1 represents the evolution of the slow, high-amplitude variables X_i (i = 1, \dots, n), and Equation 2 provides the evolution of the coupled fast, low-amplitude variables Y_{j,i} (j = 1, \dots, J). We use n = 36 and J = 10 in our computational experiments. We utilize c = 10 and b = 10, which implies that the small scales fluctuate 10 times faster than the large scales. Also, the coupling coefficient h between the two scales is equal to 1, and the forcing is set at F = 10 to make both variables exhibit chaotic behavior.

In parameterization research, the small scale variables are not resolved, and their effect is typically parameterized as a function of the resolved large scale variables. A forecast model for the resolved variables given in Equation 1 can be constructed with the parameterization for unresolved variables as follows,

\frac{d\widetilde{X}_i}{dt} = -\widetilde{X}_{i-1}\left(\widetilde{X}_{i-2} - \widetilde{X}_{i+1}\right) - \widetilde{X}_i - \frac{hc}{b} G_i + F,   (3)

where the tilde denotes the fact that the parameterization G_i is used to represent the effect of the unresolved variables. Typically, the parameterization is a function of the resolved variables and can be written mathematically as

\sum_{j=1}^{J} Y_{j,i} \approx G_i = \mathcal{N}(\widetilde{X}),   (4)

where \mathcal{N}(\cdot) is the nonlinear mapping of the resolved variables to the parameterization at the ith grid point. This mapping can be based on certain physical arguments, or it can be learned with data-driven methods. Therefore, in parameterization research for multiscale systems, the underlying physical laws governing the dynamics of the resolved variables are assumed to be known exactly, and the effect of the unresolved variables is considered through the parameterization G_i. If we use data-driven methods to represent the parameterization G_i, then the forecast model given in Equation 3 can be considered a hybrid model. Our main objective in this work is to improve the forecasting capability of multiscale systems represented by a hybrid model embedded with data-driven parameterizations, and we achieve this through data assimilation techniques.
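For concreteness, the coupled system in Equations 1 and 2 can be integrated with a few lines of NumPy. The following is a minimal sketch, not the authors' implementation; it assumes the periodic index conventions stated in Section 5, and all names are illustrative.

```python
import numpy as np

def l96_two_level_rhs(X, Y, h=1.0, c=10.0, b=10.0, F=10.0):
    """Tendencies of Equations 1-2; X has shape (n,), Y has shape (J, n)."""
    J = Y.shape[0]
    coupling = h * c / b
    # Equation 1: slow variables with periodic indexing via np.roll
    dX = (-np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X
          - coupling * Y.sum(axis=0) + F)
    # Equation 2: fast variables treated as a single periodic chain of length n*J
    Yf = Y.flatten(order="F")
    dYf = (-c * b * np.roll(Yf, -1) * (np.roll(Yf, -2) - np.roll(Yf, 1))
           - c * Yf + coupling * np.repeat(X, J))
    return dX, dYf.reshape(Y.shape, order="F")

def rk4_step(X, Y, dt):
    """One classical fourth-order Runge-Kutta step for the coupled system."""
    k1x, k1y = l96_two_level_rhs(X, Y)
    k2x, k2y = l96_two_level_rhs(X + 0.5 * dt * k1x, Y + 0.5 * dt * k1y)
    k3x, k3y = l96_two_level_rhs(X + 0.5 * dt * k2x, Y + 0.5 * dt * k2y)
    k4x, k4y = l96_two_level_rhs(X + dt * k3x, Y + dt * k3y)
    return (X + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6,
            Y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6)
```

Column-major flattening of Y turns the extension rules for the fast variables into ordinary periodic indexing on a single ring, which keeps the tendency computation fully vectorized.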
3 Neural network parameterizations

The parameterization problem in multiscale flows can be posed as a regression problem in which the mapping between resolved scales and unresolved scales has to be determined. We consider the supervised class of machine learning algorithms, where an optimal map between inputs and outputs is learned. In this section, we describe an artificial neural network (ANN), also called a multilayer perceptron, and a convolutional neural network (CNN) for building data-driven parameterization models.

3.1 Artificial neural network

An artificial neural network is made up of several layers consisting of a predefined number of neurons. Each neuron is associated with certain coefficients called weights and a bias. The weights determine how significant a certain input feature is to the output. The input from the previous layer is multiplied by a weight matrix as shown below,

S^l = W^l X^{l-1},   (5)

where X^{l-1} is the output of the (l-1)th layer and W^l is the matrix of weights for the lth layer. The summation of the above input-weight product and the bias is then passed through a node's activation function, which is usually some nonlinear function. The introduction of nonlinearity through the activation function allows the neural network to learn highly complex relations between the input and output. The output of the lth layer can be written as

X^l = \zeta(S^l + B^l),   (6)

where B^l is the vector of biasing parameters for the lth layer and \zeta is the activation function. If there are L layers between the input and the output in a neural network, then the output of the neural network can be represented mathematically as follows,

\widetilde{Y} = \zeta_L(W_L, B_L, \dots, \zeta_2(W_2, B_2, \zeta_1(W_1, B_1, X))),   (7)

where X and \widetilde{Y} are the input and output of the ANN, respectively. There are several activation functions that provide different nonlinearities. Some of the widely used activation functions are the sigmoid \zeta(\phi) = 1/(1 + e^{-\phi}), the hyperbolic tangent (tanh) \zeta(\phi) = (e^{\phi} - e^{-\phi})/(e^{\phi} + e^{-\phi}), and the rectified linear unit (ReLU) \zeta(\phi) = \max[0, \phi].

The weight matrices W^l and biases B^l are determined through the minimization of a loss function (for example, the mean squared error between true and predicted labels). The gradient of the objective function with respect to the weights and biases is calculated with the backpropagation algorithm. Optimization algorithms like the stochastic gradient descent method [82] provide a rapid way to learn the optimal weights. The training procedure for an ANN can be summarized as follows:

• The input and output of the neural network are specified, along with some initial weight initialization for the neurons.
• The training data is run through the network to produce an output \widetilde{Y} whose true label is Y.
• The derivative of the objective function with respect to each of the trainable weights is computed using the chain rule.
• The weights are then updated based on the learning rate and the optimization algorithm.

We continue to iterate through this procedure until convergence or until the maximum number of iterations is reached. There are different ways in which the relationship between resolved and unresolved variables in multiscale systems can be learned with an ANN. The most common method is to employ a point-to-point mapping, where the input features at a single grid point are utilized to learn the output labels at that point [22, 29, 83]. Another method is to include the information at neighboring grid points to determine the output label at a single point [26, 84]. We train our ANN by including information at different numbers of neighboring grid points and assess how this additional information affects the learning of the correlation between resolved and unresolved variables. We investigate three types of ANN models, which can be written as

ANN-3: \{X_{i-1}, X_i, X_{i+1}\} \in \mathbb{R}^3 \rightarrow \{G_i\} \in \mathbb{R},   (8)
ANN-5: \{X_{i-2}, \dots, X_{i+2}\} \in \mathbb{R}^5 \rightarrow \{G_i\} \in \mathbb{R},   (9)
ANN-7: \{X_{i-3}, \dots, X_{i+3}\} \in \mathbb{R}^7 \rightarrow \{G_i\} \in \mathbb{R},   (10)

where G_i is the parameterization at the ith grid point and X_i is the resolved variable. For the training, we assume that the resolved variables and the parameterizations are known exactly and are computed by solving Equation 1 and Equation 2 in a coupled manner. For all ANN architectures used in this study, we apply two hidden layers with 40 neurons each and the ReLU activation function. The ANN is trained using an Adam optimizer for 300 iterations.
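The following Keras sketch illustrates one way to assemble the stencil inputs and build, say, the ANN-5 architecture of Equation 9. The layer sizes, activation, and optimizer follow the text; the function names and training call are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from tensorflow import keras

def stencil_features(X_snapshots, half_width=2):
    """Assemble {X_{i-p}, ..., X_{i+p}} inputs for every grid point (periodic wrap).
    X_snapshots has shape (n_t, n); the result has shape (n_t * n, 2*half_width + 1)."""
    cols = [np.roll(X_snapshots, -s, axis=1) for s in range(-half_width, half_width + 1)]
    return np.stack(cols, axis=-1).reshape(-1, 2 * half_width + 1)

def build_ann(stencil_width=5):
    """Two hidden layers of 40 ReLU neurons mapping a stencil to G_i."""
    model = keras.Sequential([
        keras.Input(shape=(stencil_width,)),
        keras.layers.Dense(40, activation="relu"),
        keras.layers.Dense(40, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training would then look like (80/20 split as in Section 5):
# model = build_ann()
# model.fit(stencil_features(X_train), G_train.reshape(-1, 1),
#           epochs=300, validation_split=0.2)
```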
3.2 Convolutional neural network

The convolutional neural network (CNN) is particularly attractive when the data is in the form of two-dimensional images [85]. Here, we present the CNN architecture assuming that the input and output of the neural network have the structure of two-dimensional images. This formulation can easily be applied to one-dimensional data when the dimension in one direction is collapsed to one. The Conv layers are the fundamental building blocks of the CNN. Each Conv layer has a predefined number of filters (also called kernels) whose weights have to be learned using the backpropagation algorithm. The shape of a filter is usually smaller than the actual image, and it extends through the full depth of the input volume from the previous layer. For example, if the input to the CNN has dimensions N_x \times N_y \times 1, where 1 is the number of input features, the kernels of the first Conv layer can have a \Delta i \times \Delta j \times 1 shape. During the forward propagation, the filter is convolved across the width and height of the input volume to produce a two-dimensional map. The two-dimensional map is constructed by computing the dot product between the weights of the filter and the input volume at any position and then sliding the filter over the whole volume. Mathematically, the convolution operation corresponding to one filter can be written as

S^l_{ijk} = \sum_{p=-\Delta i/2}^{\Delta i/2} \sum_{q=-\Delta j/2}^{\Delta j/2} \sum_{r=-\Delta k/2}^{\Delta k/2} W^l_{pqr} X^{l-1}_{i+p,\,j+q,\,k+r} + B_{pqr},   (11)

where \Delta i, \Delta j, \Delta k are the sizes of the filter in each direction, W^l_{pqr} are the entries of the filter for the lth Conv layer, B_{pqr} is the biasing parameter, and X^{l-1}_{ijk} is the input from the (l-1)th layer. Each Conv layer has a set of predefined filters, and the two-dimensional maps produced by the filters are stacked in the depth dimension to produce a three-dimensional output volume. This output volume is passed through an activation function to produce a nonlinear map between inputs and outputs. The output of the lth layer is given by

X^l_{ijk} = \zeta(S^l_{ijk}),   (12)

where \zeta is the activation function. It should be noted that, as we convolve the filter across the input volume, the size of the output shrinks in the height and width dimensions. Therefore, it is common practice to pad the input volume with zeros, called zero-padding. Zero-padding permits us to control the shape of the output volume and is used in our neural network parameterization framework to preserve the shape, so that the input and output width and height are the same. The main advantage of the CNN is its weight sharing property: a filter of small size is shared across the whole image, which is larger in size. This allows the CNN to handle large data without significant computational overhead. The CNN mapping in our work can be presented mathematically as

CNN: \{X_1, \dots, X_n\} \in \mathbb{R}^n \rightarrow \{G_1, \dots, G_n\} \in \mathbb{R}^n,   (13)

where X_i is the resolved variable and G_i is the parameterization. Therefore, the solution at a single time step corresponds to one training example for the CNN. In our CNN architecture, we use only one hidden layer between the input and output. This hidden layer has 128 filters. We apply the ReLU activation function and use zero-padding to keep the input and output shapes the same. The CNN is trained with an Adam optimizer for 400 iterations.
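As a companion to the ANN sketch, the whole-state mapping of Equation 13 could be realized with a one-hidden-layer Conv1D network in Keras. The filter count, activation, and same-shape zero-padding follow the text; the kernel width is an illustrative assumption, since it is not recoverable here.

```python
from tensorflow import keras

def build_cnn(n=36, kernel_width=5):
    """One hidden Conv layer (128 filters, ReLU) mapping X_1..X_n to G_1..G_n.
    The slow variables enter as a one-dimensional, single-channel 'image'."""
    model = keras.Sequential([
        keras.Input(shape=(n, 1)),
        keras.layers.Conv1D(128, kernel_width, padding="same", activation="relu"),
        keras.layers.Conv1D(1, kernel_width, padding="same"),  # linear output layer
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Each temporal snapshot is one training example:
# model = build_cnn()
# model.fit(X_train[..., None], G_train[..., None], epochs=400, validation_split=0.2)
```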
4 Data assimilation

As highlighted in many studies, neural network parameterizations suffer from instabilities and biases once the trained model is deployed in a forward solver [16, 59–61]. From our numerical experiments, we observe that the forward model with only neural network parameterizations delivers accurate prediction only up to some time, after which the model starts deviating from the true trajectory. In order to address this issue and improve the long-term forecast with hybrid models, we utilize data assimilation (DA) to incorporate noisy measurements into the future state prediction. The main theme of DA is to extract the information from observational data to correct dynamical models and improve their prediction. There is a rich literature on DA [62–64, 86, 87]; here we discuss only the sequential data assimilation problem and then outline the algorithmic procedure for the deterministic ensemble Kalman filter (DEnKF).

We consider a dynamical system whose evolution can be represented as

x_{k+1} = \mathcal{M}(x_k) + w_{k+1},   (14)

where x_k \in \mathbb{R}^n is the state of the dynamical system at discrete time t_k, and \mathcal{M}: \mathbb{R}^n \rightarrow \mathbb{R}^n is the nonlinear model operator that defines the temporal evolution of the system. In this work, the dynamical system is the two-level Lorenz 96 model with neural network parameterizations, whose evolution is governed by Equation 3. The term w_{k+1} denotes the model noise that takes into account any type of uncertainty in the model, which can be attributed to boundary conditions, imperfect models, etc. Let z_k \in \mathbb{R}^m be observations of the state vector obtained through a noisy measurement procedure as given below,

z_k = h(x_k) + v_k,   (15)

where h(\cdot) is a nonlinear function that maps \mathbb{R}^n \rightarrow \mathbb{R}^m, and v_k \in \mathbb{R}^m is the measurement noise. We assume that the measurement noise is white Gaussian noise with zero mean and covariance matrix R_k, i.e., v_k \sim \mathcal{N}(0, R_k). Additionally, the noise vectors w_k and v_k are assumed to be uncorrelated at two different time steps. Sequential data assimilation can be considered as the problem of estimating the state x_k of the system given the observations up to time t_k, i.e., z_1, \dots, z_k. When we utilize observations to estimate the state of the system, we say that the data are assimilated into the model. We use the notation \widehat{x}_k to denote the analyzed state of the system at time t_k, when all of the observations up to and including time t_k are used in determining the state of the system. When all the observations before (but not including) time t_k are utilized for estimating the state of the system, we call it the forecast estimate and denote it x^f_k.

We use the DEnKF algorithm proposed by Sakov and Oke [88] for the data assimilation; its procedure is summarized in Algorithm 1. We start the DEnKF algorithm by initializing the state estimate for all ensemble members using Equation 16. The anomalies between the forecast estimates of all ensemble members and their sample mean are computed using Equation 19. Once the observations are available at time t_{k+1}, the forecast state estimate is assimilated as given in Equation 20, where the Kalman gain K is computed using its square root version. The anomalies for all ensemble members are updated separately with half the Kalman gain, as shown in Equation 23. The analyzed state estimates for all ensemble members are obtained by offsetting the analyzed anomalies with the analyzed state estimate, as given in Equation 24.

Algorithm 1: Deterministic ensemble Kalman filter

Initialize the state of the system for the different ensemble members,

\widehat{X}_0(i) = m_0 + y(i),   (16)

where y(i) \sim \mathcal{N}(0, P_0). For k = 0, 1, \dots, proceed with the forecast and data assimilation steps as follows.

• Forecast step:
– Integrate the state estimate of each ensemble member from time t_k to t_{k+1},

X^f_{k+1}(i) = \mathcal{M}(\widehat{X}_k(i)).   (17)

– Compute the sample mean and the ensemble anomalies,

x^f_{k+1} = \frac{1}{N} \sum_{i=1}^{N} X^f_{k+1}(i),   (18)

A^f_{k+1}(i) = X^f_{k+1}(i) - x^f_{k+1}.   (19)

• Data assimilation step:
– Once the observations are available at time t_{k+1}, the forecast state estimate is assimilated with the observations,

\widehat{x}_{k+1} = x^f_{k+1} + K\left[z_{k+1} - h(x^f_{k+1})\right].   (20)

Here, the Kalman gain is given by

K = \frac{A^f (HA^f)^T}{N-1} \left[\frac{(HA^f)(HA^f)^T}{N-1} + R\right]^{-1},   (21)

where H \in \mathbb{R}^{m \times n} is the Jacobian of the observation operator (i.e., H_{kl} = \partial h_k / \partial x_l), and the \mathbb{R}^{n \times N} anomaly matrix is concatenated as

A^f = [A^f_{k+1}(1), A^f_{k+1}(2), \dots, A^f_{k+1}(N)].   (22)

– Compute the analyzed anomalies with half the Kalman gain,

\widehat{A}_{k+1}(i) = A^f_{k+1}(i) - \frac{1}{2} KHA^f_{k+1}(i).   (23)

– Calculate the analyzed ensemble using the analyzed state estimate and the analyzed anomalies,

\widehat{X}_{k+1}(i) = \widehat{x}_{k+1} + \widehat{A}_{k+1}(i).   (24)
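For a linear observation operator z = Hx + v, the analysis step of Algorithm 1 reduces to a few lines of NumPy. The sketch below illustrates Equations 18-24 under that assumption; it is not the authors' implementation.

```python
import numpy as np

def denkf_analysis(Xf, z, H, R):
    """DEnKF analysis step. Xf: n x N forecast ensemble; z: m observations;
    H: m x n linear observation operator; R: m x m observation covariance."""
    N = Xf.shape[1]
    xf = Xf.mean(axis=1)                         # Equation 18: forecast mean
    Af = Xf - xf[:, None]                        # Equation 19: anomalies
    S = H @ Af                                   # observed anomalies
    C = S @ S.T / (N - 1) + R                    # bracketed term in Equation 21
    K = Af @ S.T / (N - 1) @ np.linalg.inv(C)    # Equation 21: Kalman gain
    xa = xf + K @ (z - H @ xf)                   # Equation 20: analysis mean
    Aa = Af - 0.5 * K @ S                        # Equation 23: half-gain update
    return xa[:, None] + Aa                      # Equation 24: analysis ensemble
```

In the forecast step, each column of the ensemble is simply propagated through the hybrid model \mathcal{M} of Equation 3 before the next analysis.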
We adopt the twin experiment setting [89] to test the DA algorithm for a two-level Lorenz 96 model with neural network parameterizations. We also validate our implementation of the DEnKF algorithm using the one-level Lorenz 96 model, as discussed in detail in Appendix A.

5 Numerical results

In this section, we discuss the results of numerical experiments with a two-level variant of the Lorenz 96 system embedded with neural network parameterizations for the unresolved variables. We utilize the fourth-order Runge-Kutta numerical scheme with a time step of \Delta t = 0.001 for the temporal integration of the Lorenz 96 model. We apply the periodic boundary condition for the slow variables, i.e., X_{i-n} = X_{i+n} = X_i. The fast variables are extended by letting Y_{j,i-n} = Y_{j,i+n} = Y_{j,i}, Y_{j-J,i} = Y_{j,i-1}, and Y_{j+J,i} = Y_{j,i+1}. The physical initial condition is computed by starting from an equilibrium condition for the slow variables at an earlier time t < 0. The equilibrium condition for the slow variables is X_i = F for i \in \{1, 2, \dots, n\}. We perturb this equilibrium solution by adding a small perturbation to one of the state variables, and the fast variables are assigned random values of small amplitude at that time. We integrate the two-level Lorenz 96 model by solving both Equation 1 and Equation 2 in a coupled manner up to time t = 0. From this initial condition (i.e., at t = 0), we generate the training data for the neural networks by integrating the two-level Lorenz 96 model from t = 0 to t = 10. Therefore, we gather 10,000 temporal snapshots to generate the training data. For all our numerical experiments, we use 80% of the data to train the neural network and 20% of the data to validate the training. We assess the performance of a trained neural network by deploying it in a forecast model for temporal integration between time t = 10 and t = 20. Therefore, there is no overlap between the data used for training and testing. Since the neural network has not seen the testing data during the training, the performance of the neural network parameterizations in this temporal region gives us an insight into their generalizability to unseen data.
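The training-data generation described above amounts to integrating the coupled system and recording the resolved state together with the exact coupling term of Equation 4 as the label. The following sketch reuses the l96_two_level_rhs and rk4_step helpers from the sketch in Section 2; the spin-up details are omitted, and all names are illustrative.

```python
import numpy as np

def generate_training_data(X0, Y0, dt=1e-3, n_steps=10_000):
    """Integrate from t = 0 to t = 10 and record, at every step, the resolved
    state (inputs) and the exact parameterization G_i = sum_j Y_{j,i} (labels)."""
    X, Y = X0.copy(), Y0.copy()
    inputs, labels = [], []
    for _ in range(n_steps):
        inputs.append(X.copy())
        labels.append(Y.sum(axis=0))
        X, Y = rk4_step(X, Y, dt)
    return np.array(inputs), np.array(labels)  # both of shape (n_steps, n)
```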
First, we present the results for the ANN based parameterizations trained using the neighboring stencil mapping discussed in Section 3.1. Figure 1 displays the full state trajectory of the Lorenz 96 model from time t = 10 to t = 20, computed by solving both the evolution of the slow and fast variables (i.e., True) and with the ANN based parameterizations for the fast variables (i.e., ANN-3, ANN-5, ANN-7). The difference between the true solution field and the predicted solution field is also depicted in Figure 1. It can be observed that the predicted solution field starts deviating from the true solution field after some time for all ANN-based parameterizations.

Figure 1: Full state trajectory of the multiscale Lorenz 96 model with the closure term computed using the different neighboring stencil mapping feedforward ANN architectures. Panels: (a) True; (b) ANN-3; (c) ANN-5; (d) ANN-7; (e)-(g) the corresponding errors.
Next, we illustrate how the prediction of a two-level Lorenz 96 model with neural network parameterizations can be improved using data assimilation, by incorporating noisy observations into the future state prediction. For our twin experiment, we obtain observations by adding noise drawn from the Gaussian distribution with zero mean and covariance matrix R_k, i.e., v_k \sim \mathcal{N}(0, R_k). We use R_k = \sigma^2 I, where \sigma is the standard deviation of the measurement noise and is set at \sigma = 1. We assume that the observations are sparse in both space and time, being collected only at regular time-step intervals. We present two levels of observation density in space for the DA. For the first case, we employ observations of one quarter of the state, i.e., m = 9 observed components, for the assimilation. The second set of observations consists of 50% of the full state of the system, i.e., m = 18. In Figure 2, we provide the full state trajectory prediction for the ANN-5 parameterization without any DA and with DA for the two sets of observations. We can observe that there is a substantial improvement in the long-term prediction even with only 25% of the observations incorporated through the DEnKF algorithm. The results in Figure 2 provide evidence for the good performance of the present framework in achieving accurate long-term prediction for hybrid models embedded with data-driven parameterizations. Therefore, the present framework can lead to accurate forecasting by exploiting online measurements coming from various types of sensor networks, and it can find applications in different fields like climate modeling and turbulence closure modeling, where subgrid scale parameterizations are unavoidable.

Figure 2: Full state trajectory of the multiscale Lorenz 96 model with the closure term computed using the five-point neighboring stencil mapping feedforward ANN architecture and the DEnKF used for data assimilation. Panels: (a) True; (b) ANN-5; (c) DEnKF (m = 9); (d) DEnKF (m = 18); (e)-(g) the corresponding errors.
Figure 3 illustrates the time evolution of the full state trajectory of a two-level Lorenz 96 model with the CNN based parameterizations for the unresolved scales. The CNN is fed the entire state of the slow variables as an input, and it calculates the parameterizations of the fast variables at all grid points. From Figure 3, we can deduce that the predicted state trajectory starts deviating from the true state after some time when only the CNN based parameterizations are employed in the forward model of the slow variables. When we incorporate observations through DA, we observe considerable improvement in the state prediction over a longer period.

Based on the results presented in Figure 2 and Figure 3, we notice that the error is slightly higher between time t = 18 and t = 20 for the CNN based parameterizations empowered with DA. One potential reason for this discrepancy can be the stochastic nature of the parameterization model. The true parameterization model is in itself stochastic and might not follow a Gaussian distribution. Another reason for the inaccurate forecast can be attributed to the uncertainty in the prediction of the parameterizations by the CNN. To isolate the source of error, we integrate the forecast model for a two-level Lorenz 96 model without any parameterizations. The results of this numerical experiment are discussed in Appendix B. In this numerical experiment, the observations include the effect of the unresolved scales, which can be considered as an added noise. The sequential DA methods based on Kalman filters deliver a considerably accurate solution when the model and observation noise is drawn from a Gaussian distribution and enough observations are provided. If the parameterization of the unresolved scales followed a Gaussian distribution, we should be able to recover the accurate state of the system as the density of observations is increased. However, as reported in Figure 5, there is a high level of inaccuracy even when 100% of the state is observable. Therefore, we can conclude that there is a considerable benefit to including neural network parameterizations compared to using no parameterization in the forecast model. The results provided in Figure 2 and Figure 3 also show that the neural network parameterizations can capture the non-Gaussian statistics of the subgrid scale processes, and this leads to accurate forecasting over a longer period.

Figure 3: Full state trajectory of the multiscale Lorenz 96 model with the closure term computed using the CNN architecture and the DEnKF used for data assimilation. Panels: (a) True; (b) CNN; (c) DEnKF (m = 9); (d) DEnKF (m = 18); (e)-(g) the corresponding errors.
There are other DA approaches that deal with non-Gaussian distributions for the noise vectors [90–95]. We restrict ourselves to the DEnKF algorithm for DA in this study and plan to explore other DA algorithms in our future work.

We assess the quantitative performance of the different numerical experiments performed in this study using the root mean squared error (RMSE) between the true and predicted states of the slow variables in a two-level Lorenz 96 model. The RMSE is computed as shown below,

\text{RMSE} = \sqrt{\frac{1}{n\, n_t} \sum_{i=1}^{n} \sum_{k=1}^{n_t} \left(X^T_i(t_k) - X^P_i(t_k)\right)^2},   (25)

where X^T_i is the true state of the system and X^P_i is the predicted state of the system.
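As a small illustrative helper (not from the paper), Equation 25 is simply the square root of the mean squared deviation over all grid points and stored snapshots:

```python
import numpy as np

def total_rmse(X_true, X_pred):
    """Equation 25 for state arrays of shape (n_t, n)."""
    return np.sqrt(np.mean((X_true - X_pred) ** 2))
```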
Table 1 reports the RMSE for the two-level Lorenz 96 model for all cases investigated in this work. We can see that the RMSE is very high when we do not use any parameterizations for the unresolved scales, even when measurements of the entire state of the system are incorporated through DA. Data assimilation alone cannot account for the effect of the unresolved scales, even though their effect is present in the observation data. Therefore, it is imperative to include the parameterizations of the fast variables in the forecast model of the slow variables. We observe that the ANN architectures provide slightly more accurate results than the CNN based parameterizations for the fast variables. Also, the RMSE is minimum for the ANN-3 parameterization, and we observe a slight increase in the RMSE when more neighboring information is included. One potential reason for this observation can be the use of the same hyperparameters for all ANN architectures. However, this change is very small, and the RMSE is of the same order of magnitude for all types of neural network parameterizations. The RMSE is almost the same whether 25% or 50% of the full state of the system is observed in the data assimilation framework.

Table 1: Quantitative assessment of the different neural network parameterizations for subgrid scale processes using the total root mean squared error given by Equation 25.

Framework                                                   RMSE
Only neural network parameterizations
    ANN-3                                                    .
    ANN-5                                                    .
    ANN-7                                                    .
    CNN                                                      .
Only data assimilation
    No parameterizations (m = 9)                             .
    No parameterizations (m = 18)                            .
    No parameterizations (m = 36)                            .
Neural network parameterizations with data assimilation
    ANN-5 (m = 9)                                            .
    ANN-5 (m = 18)                                           .
    CNN (m = 9)                                              .
    CNN (m = 18)                                             .
6 Concluding remarks

In the present study, we introduce a framework that applies data assimilation methods to a physics-based model embedded with data-driven parameterizations in order to achieve an accurate long-term forecast in multiscale systems. We demonstrate that the forecasting capability of hybrid models can be significantly improved by exploiting online measurements from various types of sensor networks. Specifically, we use neural networks to learn the relation between the resolved scales and the effect of the unresolved scales (i.e., the parameterizations). The deployment of the trained neural network in the forward simulation provides accurate prediction only up to a short period, after which there is a large discrepancy between the true and predicted states of the system. To address this issue, we exploit sparse observational data through data assimilation to improve the accuracy of the forecast over a longer period. We illustrate this framework for a two-level variant of the Lorenz 96 model, which consists of fast and slow variables whose dynamics are exactly known. We obtain a considerable improvement in the prediction by combining neural network parameterizations and data assimilation, compared to employing only neural network parameterizations. We also find that including an ML based closure term captures non-Gaussian statistics and significantly improves the forecast error. Based on our numerical experiments with data assimilation empowered neural network parameterizations, we can conclude that improving machine learning-based model prediction with data assimilation methods offers a promising research direction.

Our future work aims at leveraging the underlying physical conservation laws in neural network training to produce physically consistent parameterizations. As the deep learning field is evolving rapidly, we can integrate modern neural network architectures and training methodologies into our framework to attain higher accuracy. In the present framework, we employ the deterministic ensemble Kalman filter (DEnKF) algorithm for data assimilation. This algorithm gives accurate prediction when the uncertainty in the model and observations follows a Gaussian distribution. We plan to investigate other data assimilation approaches, like maximum likelihood ensemble filter methods, that can handle the non-Gaussian nature of the uncertainty in the mathematical model to obtain further improvement in the prediction accuracy. We will also test the present framework on more complex turbulent flows as a part of our future effort. Finally, we conclude by reemphasizing that the integration of data assimilation with hybrid physics-ML models can be effectively used for the modeling of multiscale systems.
Acknowledgement
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number DE-SC0019290. O.S. gratefully acknowledges their support.

Disclaimer. This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
A Validation of the Deterministic Ensemble Kalman Filter
In this Appendix, we provide the results of data assimilation with the DEnKF algorithm for the one-level Lorenz 96 model. The one-level Lorenz 96 model is given as

\frac{dX_i}{dt} = -X_{i-1}\left(X_{i-2} - X_{i+1}\right) - X_i + F,   (26)

for i \in \{1, 2, \dots, n\} and F = 10. The above model is completely deterministic, as there is no parameterization of unresolved scales. We use settings similar to the two-level variant of the Lorenz 96 model for the temporal integration using the fourth-order Runge-Kutta numerical scheme. The true initial condition is generated by integrating the solution from an equilibrium condition at an earlier time up to t = 0. For all ensemble members, we start with an initial condition obtained by perturbing the true initial condition with noise drawn from a Gaussian distribution with zero mean and a small variance. The observations for data assimilation are generated by adding measurement noise drawn from a Gaussian distribution with zero mean and variance \sigma^2 = 1 (i.e., R_k = I) to the true state of the system. The observations are assumed to be available only at regular time-step intervals, similar to the two-level variant of the Lorenz 96 model.

As depicted in Figure 4, we can conclude that the DEnKF can correct the erroneous trajectory even when only 9 observations are employed for data assimilation. As the number of observations is increased to 18, we observe a reduction in the error. We reiterate here that we have complete control over the model (since it is deterministic) in the numerical experiments with the one-level Lorenz 96 model. As we introduce the fast scale variables, the evolution of the slow variables in a two-level Lorenz 96 model is no longer deterministic, and simple Kalman filter based algorithms might not be enough to give accurate prediction over a longer period.

Figure 4: Full state trajectory of the Lorenz 96 model with the DEnKF algorithm. Panels: (a) True; (b) Erroneous; (c) DEnKF (m = 9); (d) DEnKF (m = 18); (e)-(g) the corresponding errors.
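A compact twin-experiment driver in the spirit of this Appendix might look as follows. It reuses the denkf_analysis sketch from Section 4; the ensemble size, assimilation interval, perturbation amplitude, and observed indices are illustrative assumptions, since they are not all recoverable here, and the forecast integrator is simplified to forward Euler for brevity (the paper uses RK4).

```python
import numpy as np

def l96_one_level_rhs(X, F=10.0):
    """Equation 26; works for a single state (n,) or an ensemble (n, N)."""
    return (-np.roll(X, 1, axis=0) * (np.roll(X, 2, axis=0) - np.roll(X, -1, axis=0))
            - X + F)

def twin_experiment(x0, m=9, N=40, dt=1e-3, obs_every=100, cycles=50, sigma=1.0):
    rng = np.random.default_rng(0)
    n = x0.size
    H = np.eye(n)[:: n // m]                    # observe every (n/m)-th component
    R = sigma**2 * np.eye(m)
    x_true = x0.copy()
    ens = x0[:, None] + 0.1 * rng.standard_normal((n, N))   # perturbed ensemble
    for _ in range(cycles):
        for _ in range(obs_every):              # forecast the truth and the ensemble
            x_true = x_true + dt * l96_one_level_rhs(x_true)
            ens = ens + dt * l96_one_level_rhs(ens)
        z = H @ x_true + sigma * rng.standard_normal(m)      # noisy observations
        ens = denkf_analysis(ens, z, H, R)                   # assimilate
    return ens.mean(axis=1), x_true
```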
B Data assimilation with no parameterization

In this Appendix, we report the performance of the DEnKF algorithm for the data assimilation of a two-level variant of the Lorenz 96 model with no parameterizations employed for the unresolved scales. The two-level Lorenz 96 model with no parameterizations reduces to the one-level Lorenz 96 model presented in Equation 26. We note here that the observations used for data assimilation are the same as in the numerical experiments with a two-level Lorenz 96 model. Therefore, the effect of the unresolved scales is embedded in the observations. The parameterization of the fast variables (i.e., the (hc/b) \sum_{j=1}^{J} Y_{j,i} term in Equation 1) can then be considered as an added noise on the true state of the system for the one-level Lorenz 96 model presented in Equation 26.

In Figure 5, we report the true state of a two-level Lorenz 96 model along with the predicted state trajectory using the DA framework with no parameterization. We provide the results for the three sets of observations utilized in the DA. The observations are incorporated through the assimilation stage at regular time-step intervals of the model. We can observe that, even when 100% of the full state is observable, we do not recover the true state trajectory of the two-level Lorenz 96 model. With this observation, we can conclude that it is essential to incorporate a parameterization of the unresolved scales into the forward model of the DA procedure to recover an accurate state trajectory. The root mean squared error between the assimilated states and the true states for the three sets of observations is provided in Table 1.

Figure 5: Full state trajectory of the multiscale Lorenz 96 model with no closure for subgrid processes. The observation data for the DEnKF algorithm is obtained by adding measurement noise to the exact solution of the multiscale Lorenz 96 system. Panels: (a) True; (b) DEnKF (m = 9); (c) DEnKF (m = 18); (d) DEnKF (m = 36); (e)-(g) the corresponding errors.

References
[1] David J Stensrud. Parameterization schemes: keys to understanding numerical weather prediction models. Cambridge University Press, 2009.
[2] Jinqiao Duan and Balasubramanya Nadiga. Stochastic parameterization for large eddy simulation of geophysical flows. Proceedings of the American Mathematical Society, 135(4):1187–1196, 2007.
[3] David A Randall. Cloud parameterization for climate modeling: Status and prospects. Atmospheric Research, 23(3-4):345–361, 1989.
[4] Tapio Schneider, Shiwei Lan, Andrew Stuart, and Joao Teixeira. Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44(24):12–396, 2017.
[5] David Draper. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):45–70, 1995.
[6] Christopher E Holloway and J David Neelin. Moisture vertical structure, column water vapor, and tropical deep convection. Journal of the Atmospheric Sciences, 66(6):1665–1683, 2009.
[7] Christian Jakob. An improved strategy for the evaluation of cloud parameterizations in GCMs. Bulletin of the American Meteorological Society, 84(10):1387–1402, 2003.
[8] Christian Jakob. Accelerating progress in global atmospheric model development through improved parameterizations: Challenges, opportunities, and strategies. Bulletin of the American Meteorological Society, 91(7):869–876, 2010.
[9] Ming Zhao, J-C Golaz, Isaac M Held, Venkatachalam Ramaswamy, S-J Lin, Y Ming, P Ginoux, B Wyman, LJ Donner, D Paynter, et al. Uncertainty in model climate sensitivity traced to representations of cumulus precipitation microphysics. Journal of Climate, 29(2):543–560, 2016.
[10] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[11] Philipp Neumann, Peter Düben, Panagiotis Adamidis, Peter Bauer, Matthias Brück, Luis Kornblueh, Daniel Klocke, Bjorn Stevens, Nils Wedi, and Joachim Biercamp. Assessing the scales in numerical weather and climate predictions: will exascale be the rescue? Philosophical Transactions of the Royal Society A, 377(2142):20180148, 2019.
[12] Claudia Kuenzer, Marco Ottinger, Martin Wegmann, Huadong Guo, Changlin Wang, Jianzhong Zhang, Stefan Dech, and Martin Wikelski. Earth observation satellite sensors for biodiversity monitoring: potentials and bottlenecks. International Journal of Remote Sensing, 35(18):6599–6647, 2014.
[13] Yunjie Liu, Evan Racah, Joaquin Correa, Amir Khosrowshahi, David Lavers, Kenneth Kunkel, Michael Wehner, William Collins, et al. Application of deep convolutional neural networks for detecting extreme weather in climate datasets. arXiv preprint arXiv:1605.01156, 2016.
[14] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In Advances in Neural Information Processing Systems, pages 5617–5627, 2017.
[15] Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124009, 2019.
[16] Stephan Rasp, Michael S Pritchard, and Pierre Gentine. Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, 115(39):9684–9689, 2018.
[17] Pierre Gentine, Mike Pritchard, Stephan Rasp, Gael Reinaudi, and Galen Yacalis. Could machine learning break the convection parameterization deadlock? Geophysical Research Letters, 45(11):5742–5751, 2018.
[18] Steven L Brunton, Bernd R Noack, and Petros Koumoutsakos. Machine learning for fluid mechanics. Annual Review of Fluid Mechanics, 52, 2019.
[19] MP Brenner, JD Eldredge, and JB Freund. Perspective on machine learning for advancing fluid mechanics. Physical Review Fluids, 4(10):100501, 2019.
[20] Karthik Duraisamy, Gianluca Iaccarino, and Heng Xiao. Turbulence modeling in the age of data. Annual Review of Fluid Mechanics, 51:357–377, 2019.
[21] F Sarghini, G De Felice, and S Santini. Neural networks based subgrid scale modeling in large eddy simulations. Computers & Fluids, 32(1):97–108, 2003.
[22] Masataka Gamahara and Yuji Hattori. Searching for turbulence models by artificial neural network. Physical Review Fluids, 2(5):054604, 2017.
[23] Romit Maulik, Himanshu Sharma, Saumil Patel, Bethany Lusch, and Elise Jennings. Accelerating RANS turbulence modeling using potential flow and machine learning. arXiv preprint arXiv:1910.10878, 2019.
[24] Julia Ling, Andrew Kurzawski, and Jeremy Templeton. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.
[25] Jian-Xun Wang, Jin-Long Wu, and Heng Xiao. Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Physical Review Fluids, 2(3):034603, 2017.
[26] Romit Maulik, Omer San, Adil Rasheed, and Prakash Vedula. Subgrid modelling for two-dimensional turbulence using neural networks. Journal of Fluid Mechanics, 858:122–144, 2019.
[27] Andrea Beck, David Flad, and Claus-Dieter Munz. Deep neural networks for data-driven LES closure models. Journal of Computational Physics, page 108910, 2019.
[28] Chenyue Xie, Jianchun Wang, and E Weinan. Modeling subgrid-scale forces by spatial artificial neural networks in large eddy simulation of turbulence. Physical Review Fluids, 5(5):054606, 2020.
[29] Chenyue Xie, Jianchun Wang, Ke Li, and Chao Ma. Artificial neural network approach to large-eddy simulation of compressible isotropic turbulence. Physical Review E, 99(5):053113, 2019.
[30] PA Srinivasan, L Guastoni, Hossein Azizpour, Philipp Schlatter, and Ricardo Vinuesa. Predictions of turbulent shear flows using deep neural networks. Physical Review Fluids, 4(5), 2019.
[31] Junhyuk Kim and Changhoon Lee. Prediction of turbulent heat transfer using convolutional neural networks. Journal of Fluid Mechanics, 882:A18, 2020.
[32] RA Heinonen and PH Diamond. Turbulence model reduction by deep learning. Physical Review E, 101(6):061201, 2020.
[33] Guido Novati, Hugues Lascombes de Laroussilhe, and Petros Koumoutsakos. Automating turbulence modeling by multi-agent reinforcement learning. arXiv preprint arXiv:2005.09023, 2020.
[34] Kai Fukami, Koji Fukagata, and Kunihiko Taira. Super-resolution reconstruction of turbulent flows with machine learning. Journal of Fluid Mechanics, 870:106–120, 2019.
[35] Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A Tchelepi, Philip Marcus, Anima Anandkumar, et al. MeshfreeFlowNet: A physics-constrained deep continuous space-time super-resolution framework. arXiv preprint arXiv:2005.01463, 2020.
[36] Akshay Subramaniam, Man Long Wong, Raunak D Borker, Sravya Nimmagadda, and Sanjiva K Lele. Turbulence enrichment using physics-informed generative adversarial networks. arXiv preprint, 2020.
[37] Jaideep Pathak, Brian Hunt, Michelle Girvan, Zhixin Lu, and Edward Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Physical Review Letters, 120(2):024102, 2018.
[38] PR Vlachas, J Pathak, BR Hunt, TP Sapsis, M Girvan, E Ott, and P Koumoutsakos. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 2020.
[39] Zhong Yi Wan, Pantelis Vlachas, Petros Koumoutsakos, and Themistoklis Sapsis. Data-assisted reduced-order modeling of extreme events in complex dynamical systems. PLoS ONE, 13(5), 2018.
[40] Kookjin Lee and Kevin Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics, 404:108973, 2020.
[41] Arvind Mohan, Don Daniel, Michael Chertkov, and Daniel Livescu. Compressed convolutional LSTM: An efficient deep learning framework to model high fidelity 3D turbulence. arXiv preprint arXiv:1903.00033, 2019.
[42] Elizabeth Qian, Boris Kramer, Benjamin Peherstorfer, and Karen Willcox. Lift & learn: Physics-informed machine learning for large-scale nonlinear dynamical systems. Physica D: Nonlinear Phenomena, 406:132401, 2020.
[43] Sk Mashfiqur Rahman, Suraj Pawar, Omer San, Adil Rasheed, and Traian Iliescu. Nonintrusive reduced order modeling framework for quasigeostrophic turbulence. Physical Review E, 100:053306, 2019.
[44] Romit Maulik, Romain Egele, Bethany Lusch, and Prasanna Balaprakash. Recurrent neural network architecture search for geophysical emulation. arXiv preprint arXiv:2004.10928, 2020.
[45] Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-informed deep learning for turbulent flow prediction. arXiv preprint arXiv:1911.08655, 2019.
[46] Istvan Szunyogh, Troy Arcomano, Jaideep Pathak, Alexander Wikner, Brian Hunt, and Edward Ott. A machine-learning-based global atmospheric forecast model. 2020.
[47] M Cheng, F Fang, C C Pain, and I M Navon. Data-driven modelling of nonlinear spatio-temporal fluid flows using a deep convolutional generative adversarial network. Computer Methods in Applied Mechanics and Engineering, 365:113000, 2020.
[48] James H Faghmous, Arindam Banerjee, Shashi Shekhar, Michael Steinbach, Vipin Kumar, Auroop R Ganguly, and Nagiza Samatova. Theory-guided data science for climate change. Computer, 47(11):74–78, 2014.
[49] Nicholas Wagner and James M Rondinelli. Theory-guided machine learning in materials science. Frontiers in Materials, 3:28, 2016.
[50] Markus Reichstein, Gustau Camps-Valls, Bjorn Stevens, Martin Jung, Joachim Denzler, Nuno Carvalhais, et al. Deep learning and process understanding for data-driven Earth system science. Nature, 566(7743):195–204, 2019.
[51] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[52] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018.
[53] Jin-Long Wu, Karthik Kashinath, Adrian Albert, Dragos Chirila, Heng Xiao, et al. Enforcing statistical constraints in generative adversarial networks for modeling chaotic dynamical systems. Journal of Computational Physics, 406:109209, 2020.
[54] N Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L Brunton, Michael W Mahoney, and J Nathan Kutz. Shallow neural networks for fluid flow reconstruction with limited sensors. Proceedings of the Royal Society A, 476(2238):20200097, 2020.
[55] Arvind T Mohan, Nicholas Lubbers, Daniel Livescu, and Michael Chertkov. Embedding hard physical constraints in neural network coarse-graining of 3D turbulence. arXiv preprint arXiv:2002.00021, 2020.
[56] Pablo Márquez-Neila, Mathieu Salzmann, and Pascal Fua. Imposing hard constraints on deep networks: Promises and limitations. arXiv preprint arXiv:1706.02025, 2017.
[57] Anuj Karpatne, Gowtham Atluri, James H Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017.
[58] S. Rasp. Coupled online learning as a way to tackle instabilities and biases in neural network parameterizations: general algorithms and Lorenz 96 case study (v1.0). Geoscientific Model Development, 13(5):2185–2196, 2020.
[59] Noah D Brenowitz and Christopher S Bretherton. Spatially extended tests of a neural network parametrization trained by coarse-graining. Journal of Advances in Modeling Earth Systems, 11(8):2728–2744, 2019.
[60] Noah D Brenowitz and Christopher S Bretherton. Prognostic validation of a neural network unified physics parameterization. Geophysical Research Letters, 45(12):6289–6298, 2018.
[61] Jinlong Wu, Heng Xiao, Rui Sun, and Qiqi Wang. Reynolds-averaged Navier–Stokes equations with explicit data-driven Reynolds stress closure can be ill-conditioned. Journal of Fluid Mechanics, 869:553–586, 2019.
[62] John M Lewis, Sivaramakrishnan Lakshmivarahan, and Sudarshan Dhall. Dynamic data assimilation: a least squares approach, volume 104. Cambridge University Press, Cambridge, 2006.
[63] Dan Simon. Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John Wiley & Sons, 2006.
[64] Geir Evensen. Data assimilation: the ensemble Kalman filter. Springer Science & Business Media, 2009.
[65] D Xiao, J Du, F Fang, CC Pain, and J Li. Parameterised non-intrusive reduced order methods for ensemble Kalman filter data assimilation. Computers & Fluids, 177:69–77, 2018.
[66] Camille Zerfas, Leo G Rebholz, Michael Schneier, and Traian Iliescu. Continuous data assimilation reduced order models of fluid flow. Computer Methods in Applied Mechanics and Engineering, 357:112596, 2019.
[67] Rossella Arcucci, Laetitia Mottet, Christopher Pain, and Yi-Ke Guo. Optimal reduced space for variational data assimilation. Journal of Computational Physics, 379:51–69, 2019.
[68] Jeffrey W Labahn, Hao Wu, Shaun R Harris, Bruno Coriton, Jonathan H Frank, and Matthias Ihme. Ensemble Kalman filter for assimilating experimental data into large-eddy simulations of turbulent flows. Flow, Turbulence and Combustion, 104:861–893, 2020.
[69] Xin Li, Feng Liu, and Miao Fang. Harmonizing models and observations: data assimilation for earth system science. Science China Earth Sciences, 2020.
[70] Redouane Lguensat, Pierre Tandeo, Pierre Ailliot, Manuel Pulido, and Ronan Fablet. The analog data assimilation. Monthly Weather Review, 145(10):4093–4107, 2017.
[71] Meng Tang, Yimin Liu, and Louis J Durlofsky. A deep-learning-based surrogate model for data assimilation in dynamic subsurface flow problems. arXiv preprint arXiv:1908.05823, 2019.
[72] Marc Bocquet, Julien Brajard, Alberto Carrassi, and Laurent Bertino. Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. Foundations of Data Science, 2(1):55, 2020.
[73] Julien Brajard, Alberto Carassi, Marc Bocquet, and Laurent Bertino. Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: a case study with the Lorenz 96 model. arXiv preprint arXiv:2001.01520, 2020.
[74] Edward N Lorenz. Predictability: A problem partly solved. In Proc. Seminar on Predictability, volume 1, 1996.
[75] KJH Law, D Sanz-Alonso, Abhishek Shukla, and AM Stuart. Filter accuracy for the Lorenz 96 model: Fixed versus adaptive observation operators. Physica D: Nonlinear Phenomena, 325:1–13, 2016.
[76] Alireza Karimi and Mark R Paul. Extensive chaos in the Lorenz-96 model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4):043105, 2010.
[77] S Herrera, Diego Pazó, J Fernández, and Miguel A Rodríguez. The role of large-scale spatial patterns in the chaotic amplification of perturbations in a Lorenz'96 model. Tellus A: Dynamic Meteorology and Oceanography, 63(5):978–990, 2011.
[78] Tim N Palmer. A nonlinear dynamical perspective on model error: A proposal for non-local stochastic-dynamic parametrization in weather and climate prediction models. Quarterly Journal of the Royal Meteorological Society, 127(572):279–304, 2001.
[79] Daniel S Wilks. Effects of stochastic parametrizations in the Lorenz'96 system. Quarterly Journal of the Royal Meteorological Society, 131(606):389–407, 2005.
[80] Daan Crommelin and Eric Vanden-Eijnden. Subgrid-scale parameterization with conditional Markov chains. Journal of the Atmospheric Sciences, 65(8):2661–2675, 2008.
[81] Gabriele Vissio and Valerio Lucarini. A proof of concept for scale-adaptive parametrizations: the case of the Lorenz'96 model. Quarterly Journal of the Royal Meteorological Society, 144(710):63–75, 2018.
[82] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[83] XIA Yang, S Zafar, J-X Wang, and H Xiao. Predictive large-eddy-simulation wall modeling via physics-informed neural networks. Physical Review Fluids, 4(3):034602, 2019.
[84] S Pawar, O San, A Rasheed, and P Vedula. A priori analysis on deep learning of subgrid-scale parameterizations for Kraichnan turbulence. Theoretical and Computational Fluid Dynamics, pages 387–401, 2020.
[85] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[86] Arthur Gelb. Applied optimal estimation. MIT Press, 1974.
[87] Greg Welch and Gary Bishop. An introduction to the Kalman filter. 1995.
[88] Pavel Sakov and Peter R Oke. A deterministic formulation of the ensemble Kalman filter: an alternative to ensemble square root filters. Tellus A: Dynamic Meteorology and Oceanography, 60(2):361–371, 2008.
[89] Henry Abarbanel. Predicting the future: completing models of observed complex systems. Springer, 2013.
[90] Weixuan Li, W Steven Rosenthal, and Guang Lin. Trimmed ensemble Kalman filter for nonlinear and non-Gaussian data assimilation problems. arXiv preprint arXiv:1808.05465, 2018.
[91] Jeffrey L Anderson. A non-Gaussian ensemble filter update for data assimilation. Monthly Weather Review, 138(11):4186–4198, 2010.
[92] Amit Apte, Martin Hairer, A M Stuart, and Jochen Voss. Sampling the posterior: An approach to non-Gaussian data assimilation. Physica D: Nonlinear Phenomena, 230(1-2):50–64, 2007.
[93] Milija Zupanski. Maximum likelihood ensemble filter: Theoretical aspects. Monthly Weather Review, 133(6):1710–1726, 2005.
[94] Alberto Carrassi, Stephane Vannitsem, Dusanka Zupanski, and Milija Zupanski. The maximum likelihood ensemble filter performances in chaotic systems. Tellus A: Dynamic Meteorology and Oceanography, 61(5):587–600, 2008.
[95] Elias David Nino-Ruiz, Alfonso Mancilla-Herrera, Santiago Lopez-Restrepo, and Olga Quintero-Montoya. A maximum likelihood ensemble filter via a modified Cholesky decomposition for non-Gaussian data assimilation.