Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation
GEORG A. GOTTWALD
School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
SEBASTIAN REICH
Institute of Mathematics, University of Potsdam, Germany
Abstract.
Data-driven prediction and physics-agnostic machine-learning methods have attracted increased interest in recent years, achieving forecast horizons going well beyond those to be expected for chaotic dynamical systems. In a separate strand of research, data assimilation has been successfully used to optimally combine forecast models and their inherent uncertainty with incoming noisy observations. The key idea in our work here is to achieve increased forecast capabilities by judiciously combining machine-learning algorithms and data assimilation. We combine the physics-agnostic data-driven approach of random feature maps as a forecast model within an ensemble Kalman filter data assimilation procedure. The machine-learning model is learned sequentially by incorporating incoming noisy observations. We show that the obtained forecast model has remarkably good forecast skill while being computationally cheap once trained. Going beyond the task of forecasting, we show that our method can be used to generate reliable ensembles for probabilistic forecasting as well as to learn effective model closure in multi-scale systems.

1. Introduction
Data-driven physics-agnostic modelling has attracted a great deal of interest in recent years. Such models have become particularly attractive for modelling high-dimensional complex systems where the detailed evolutionary laws are either unknown
or are too complex to be resolved numerically, and they provide a desirable cost-effective alternative to the numerical simulation of high-dimensional, possibly stiff dynamical systems. Designing such surrogate dynamical models can be formulated abstractly as a problem of function approximation under supervision: mapping a current state to a new state at a later point in time, using information from a given training data set. Being able to approximate this function with satisfactory accuracy then allows for forecasting by mapping unseen data. The concept of supervised learning can further be exploited to find computationally tractable reduced models for a subset of resolved variables. This problem arises in the context of multi-scale systems where one is typically interested in the dynamics of the slow resolved variables. Reducing a stiff, potentially high-dimensional multi-scale system to an effective evolution equation for the slow variables only has obvious computational advantages, reducing the dimensionality and allowing for a larger time step in the numerical discretisation. In this context the aim of machine learning is to learn the so-called closure term which parametrizes the effect of the fast unresolved degrees of freedom on the resolved slow dynamics. Ideally one aims at determining the closure term when only observations of the resolved variables are available.

A particularly simple and computationally cheap machine-learning technique involves random feature maps (Rahimi and Recht, 2008). Akin to expressing functions as linear combinations of basis functions in some Hilbert space, such as for example Fourier or Chebyshev basis functions, here functions are expressed as linear combinations of functions φ(w · u + b), where w and b are randomly drawn. It was shown that linear combinations of random feature maps are able to approximate continuous functions arbitrarily closely, a property known as the universal approximation property (Park and Sandberg, 1991; Cybenko, 1989; Barron, 1993). The task of training consists of learning the coefficients of the linear combination, which can be achieved efficiently by linear ridge regression. See Rahimi and Recht (2008); Bach (2017a,b); Sun et al. (2019) for recent accounts. In the context of learning models of dynamical systems, the framework of random feature maps was extended to include internal degrees of freedom with their own dynamics in so-called echo-state networks (Maass et al., 2002; Jaeger, 2002; Jaeger and Haas, 2004; Pathak et al., 2018). Random feature maps and their extensions have been successfully used when the training data set is noise-free, e.g. when the data originate from a known underlying model. In the case when the model is unknown and the training data set instead consists of noisy observations, the architecture of random feature maps proves to be too sensitive to noise and is not able to provide a mapping to be used for forecasting unseen data.

To incorporate noisy observations to estimate the state of a dynamical system is a problem addressed by data assimilation (DA) or filtering (Majda and Harlim, 2012; Law et al., 2015; Reich and Cotter, 2015). In DA incoming noisy observations are optimally incorporated to increase the predictability of a given forecast model. DA can be used to estimate the state variables as well as unknown parameters of the model.
DA now constitutes a standard tool in science and engineering and is used, for example, in numerical weather forecasting, where it is the main driver for the increased predictability enjoyed over the past decades. DA has already been used successfully in data-driven modelling where the forecast model is replaced by delay coordinates using Takens embedding and phase-space reconstruction (Hamilton et al., 2016). Here we shall combine the random feature map architecture with a data assimilation procedure. The idea is to learn the coefficients of the linear combination of the random feature map approximation sequentially, by updating them in time with incoming observations, rather than performing linear regression on the entire training data set. The forecast model within the data assimilation procedure is given by the random feature map model itself. It should be noted, however, that the resulting combined state-parameter estimation problem is no longer linear. Within such a setting, an attractive DA framework is provided by ensemble Kalman filters (EnKFs), where the statistical information needed for the Kalman filter is provided through a Monte Carlo approximation of an ensemble of model forecasts (Majda and Harlim, 2012; Law et al., 2015; Reich and Cotter, 2015). Such ensemble filters were shown to perform very well for nonlinear dynamical systems, unlike the classical Kalman filter. We coin our methodology with the acronym RAFDA, standing for RAndom Feature maps and Data Assimilation. We shall consider three prototypical dynamical systems to numerically investigate RAFDA and to illustrate how RAFDA is able to extend the good forecast skill of classical random feature maps to the case of noisy training data. We consider the Lorenz-63 model, the Kuramoto-Sivashinsky equation and the multi-scale Lorenz-96 system to show how RAFDA achieves the goals we set out above. Besides an improved forecast skill for single trajectories compared to classical random feature maps, RAFDA naturally extends to probabilistic forecasting as it automatically generates an ensemble of initial conditions. We will see that our method generates reliable ensembles for which each ensemble member has equal probability of being closest to the truth. The former application concerns the case when we are seeking a cheap surrogate model, learned from noisy training data without any information about the underlying dynamical system. In the context of multi-scale systems, scientists often know the form of the physical model for the resolved slow scales but lack detailed information about the closure term driving the physical model. For the Lorenz-96 multi-scale system we show that, given noisy observations of the slow variables, RAFDA can determine the closure term and can be used to perform subgrid-scale parametrization.

The paper is organised as follows. In Section 2 we develop our RAFDA methodology. In Section 3 we present applications to the Lorenz-63 model, where we test in particular for the dependency of the forecast skill on the length of the training data set, the noise level and the reservoir size. We further show that ensembles generated through the combined random feature map and data assimilation procedure provide reliable ensembles to be used in probabilistic ensemble forecasting. We further consider the Kuramoto-Sivashinsky equation as a paradigmatic model for spatio-temporal chaos in partial differential equations in Section 4.
In Section 5 we consider the multi-scale Lorenz-96 model and show that RAFDA can be used to learn closure models for the effective slow dynamics. We conclude in Section 6 with a discussion and an outlook.

2. Computational methods
Consider a D-dimensional dynamical system

u̇ = F(u), (1)

which is observed at discrete times t_n = n∆t of interval length ∆t > 0, n = 0, . . . , N, to yield a noisy time series of vector-valued observations u^o_n ∈ R^D,

u^o_n = u_n + Γ^{1/2} η_n, (2)

with measurement error covariance matrix Γ ∈ R^{D×D} and independent and normally distributed noise η_n ∈ R^D, that is, η_n ∼ N(0, I). We will typically work with a measurement error covariance matrix

Γ = η I (3)

with scalar variance parameter η ≥ 0.
The special case η = 0 corresponds to exact state observations. In this paper, we are primarily interested in noisy state observations, that is, in η > 0.

We formally express the evolution of the state u(t) over the time interval ∆t as a propagator map

u_{n+1} = Ψ_{∆t}(u_n). (4)

The aim of data-driven modelling is then to find an approximation of this map and construct a surrogate model

û_{n+1} = Ψ_S(û_n) (5)

with û_0 = u_0, which is to be learned from the observational data (2). The observational data may come as outputs from a known model or may be given by actual observations without any knowledge of the underlying model. In the case when the model is known, a surrogate model may still have computational advantages, in particular for high-dimensional models and/or for stiff multi-scale models, where solving the full model may require too small step-sizes δt ≪ ∆t. One can also encounter the case when the underlying dynamics (1) is only partially known. This situation arises typically in multi-scale systems, where u(t) characterises the resolved (slow) degrees of freedom and the analytic form of F is known, but the effect of the unresolved dynamics on the resolved one has to be inferred from data.

Independently of whether the underlying model (1) is known or not, one ends up with a combined problem of having to estimate the states u_n ≈ u(t_n) as well as the functional form of the propagator Ψ_S from the noisy data u^o_n, n = 1, . . . , N. This problem has been well studied in the literature, both within parametric and non-parametric settings, using different approximation tools such as radial basis functions (RBF) and reproducing kernel Hilbert spaces (RKHS). In this paper, we focus on a particular class of shallow neural networks and the RKHS induced by their random feature maps. See, for example, Bach (2017a,b); Sun et al. (2019) for recent theoretical results on such approximations. In particular, the attractiveness of this RKHS, as compared to more complex machine-learning architectures such as those considered for example in Qi and Majda (2020), is its easy embedding into sequential data assimilation via the ensemble Kalman filter (Evensen, 2006), as we will demonstrate in this paper.

The random feature approximation to (5) proceeds as follows. We define D_r feature maps, which we write in vector form as

φ(u) = tanh(W_in u + b_in), (6)

using a weight matrix W_in ∈ R^{D_r×D} and a bias b_in ∈ R^{D_r}. The weight matrix and the bias are chosen randomly and independently of the observed u^o_n, n = 0, . . . , N. The surrogate propagator in (5) is then defined by

Ψ_S(u) = W φ(u), (7)

where W ∈ R^{D×D_r} is a matrix of coefficients which will be learned from the data u^o_n, n = 0, . . . , N. It was shown that random feature maps and their associated RKHS enjoy the universal approximation property (Park and Sandberg, 1991; Cybenko, 1989; Barron, 1993), stating that their associated RKHS are dense in the space of continuous functions. Their success in practical applications, however, depends crucially on an appropriate choice of the two random sets of parameters W_in and b_in. We will comment on this aspect in more detail later.

Remark: Recently there has been a lot of interest in echo-state networks and reservoir computing (Jaeger, 2002; Jaeger and Haas, 2004; Pathak et al., 2018), which involve internal reservoir dynamics. Here we find that this additional complexity is not necessary to achieve good forecasting skill. Random feature maps have recently also been extended to solution maps of partial differential equations.
See, for example, Nelsen and Stuart (2020). All these contributions either assume exact state observations or ignore measurement errors.
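For concreteness, the following minimal Python/NumPy sketch implements the feature map (6) and the surrogate propagator (7). The dimensions and the uniform sampling intervals for W_in and b_in are illustrative placeholders (they anticipate the draws (24) and the Lorenz-63 values used in Section 3.1), not prescriptions of the method.

```python
import numpy as np

D, D_r = 3, 300                      # state dimension and number of features (illustrative)
rng = np.random.default_rng(seed=1)

# Draw the internal parameters once; they remain fixed during training, cf. (24).
W_in = rng.uniform(-0.005, 0.005, size=(D_r, D))
b_in = rng.uniform(-4.0, 4.0, size=D_r)

def phi(u):
    """Random feature map (6): phi(u) = tanh(W_in u + b_in)."""
    return np.tanh(W_in @ u + b_in)

def surrogate(u, W):
    """Surrogate propagator (7): u_{n+1} = W phi(u_n)."""
    return W @ phi(u)
```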
2.1. Zero measurement noise: linear regression.
If the measurement noise in (2) is zero, that is u_n = u^o_n, the external weight matrix W ∈ R^{D×D_r}, which maps the random features to the state variable at the next time step, can be determined via linear ridge regression (LR) over a training data set of length N. More precisely, we seek the minimiser of

L(W) = ½ ‖W Φ − U^o‖²_F + β ‖W‖²_F, (8)

where ‖A‖_F denotes the Frobenius norm of a matrix A, U^o ∈ R^{D×N} is the matrix with columns u^o_n, n = 1, . . . , N, and Φ ∈ R^{D_r×N} consists of columns

φ_n = φ(u^o_{n−1}), (9)

n = 1, . . . , N. The parameter β ≥ 0 is a regularization parameter. The minimiser, denoted by W_LR, is explicitly given by

W_LR = U^o Φ^T (ΦΦ^T + β I)^{−1}. (10)

We shall explore in Sections 3-5 in how far linear ridge regression can be used to construct a forecast model trained on noisy observations.
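The explicit solution (10) translates directly into code. The sketch below solves the regularized normal equations instead of forming the matrix inverse; this is a standard numerical choice of ours and not part of the description above.

```python
import numpy as np

def ridge_regression(U_obs, W_in, b_in, beta):
    """Return W_LR = U^o Phi^T (Phi Phi^T + beta I)^{-1}, cf. (10).

    U_obs : array of shape (D, N+1) whose columns are u^o_0, ..., u^o_N.
    """
    Phi = np.tanh(W_in @ U_obs[:, :-1] + b_in[:, None])   # columns phi(u^o_{n-1}), cf. (9)
    U = U_obs[:, 1:]                                      # target columns u^o_n, n = 1, ..., N
    A = Phi @ Phi.T + beta * np.eye(Phi.shape[0])
    return np.linalg.solve(A, Phi @ U.T).T                # uses symmetry of A
```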
2.2. Non-zero measurement noise: RAFDA.

If the observations are contaminated by noise, we propose here to estimate the weight matrix W recursively using sequential DA (Majda and Harlim, 2012; Law et al., 2015; Reich and Cotter, 2015). Whereas the non-recursive estimation described in the previous section using linear ridge regression only utilizes the information contained in the observations, we here aim at optimally estimating the weight matrix using both the observations as well as the underlying dynamical surrogate model (5). In particular, we employ a combined state and parameter estimation via state augmentation (Majda and Harlim, 2012; Reich and Cotter, 2015). More precisely, we formulate the forecast model for constant parameters W as

u^f_{n+1} = W^a_n φ(u^a_n), (11a)
W^f_{n+1} = W^a_n, (11b)

where the superscript f denotes the forecast and the superscript a denotes the analysis defined below. Furthermore, the model states u^a_n, W^a_n and u^f_{n+1}, W^f_{n+1}, respectively, are now treated as random variables. To formulate the Kalman filter analysis step, we unravel the weight matrix W ∈ R^{D×D_r} into a parameter vector w ∈ R^{DD_r} with w_{1:D_r} = (W_{11}, . . . , W_{1D_r}), w_{D_r+1:2D_r} = (W_{21}, . . . , W_{2D_r}), and so forth. Concatenating further, we introduce z = (u^T, w^T)^T ∈ R^{D_z} with D_z = D + DD_r. While the forecast step (11) leads to an update of the random variable z^a_n into z^f_{n+1}, the analysis step for the mean is provided by

z̄^a_{n+1} = z̄^f_{n+1} − K_{n+1} (H z̄^f_{n+1} − u^o_{n+1}) (12)

with the observation matrix H ∈ R^{D×D_z} defined by Hz = u, i.e. we assume that we observe all state variables u. The Kalman gain matrix K is given by

K_{n+1} = P^f_{n+1} H^T (H P^f_{n+1} H^T + Γ)^{−1}, (13)

where the forecast covariance matrix is given by the block structure

P^f_{n+1} = ⟨ẑ^f_{n+1} ⊗ ẑ^f_{n+1}⟩ = ( ⟨û^f_{n+1} ⊗ û^f_{n+1}⟩  ⟨û^f_{n+1} ⊗ ŵ^f_{n+1}⟩ ; ⟨ŵ^f_{n+1} ⊗ û^f_{n+1}⟩  ⟨ŵ^f_{n+1} ⊗ ŵ^f_{n+1}⟩ ). (14)

The angular brackets ⟨f(z)⟩ denote the expectation value of a function f(z), and the hat denotes the perturbation of z^f_{n+1} from its mean z̄^f_{n+1} = ⟨z^f_{n+1}⟩, that is,

û^f_{n+1} = u^f_{n+1} − ū^f_{n+1}. (15)

Since we are only observing the state variables u, the Kalman update (12) can be written to explicitly separate the state and parameter variables as

ū^a_{n+1} = ū^f_{n+1} − P^f_{uu} (P^f_{uu} + Γ)^{−1} (ū^f_{n+1} − u^o_{n+1}), (16)
w̄^a_{n+1} = w̄^f_{n+1} − P^f_{wu} (P^f_{uu} + Γ)^{−1} (ū^f_{n+1} − u^o_{n+1}), (17)

where P^f_{uu} = ⟨û^f_{n+1} ⊗ û^f_{n+1}⟩ and P^f_{wu} = ⟨ŵ^f_{n+1} ⊗ û^f_{n+1}⟩.

The Kalman filter is optimal in the sense that z̄^a maximizes the likelihood of the observations, provided that both the observations and the state variables are Gaussian distributed random variables. Since the forward model (11) is nonlinear in the augmented state variable and the involved random variables cannot be assumed to be Gaussian distributed, the combined forecast and analysis cycle is not well defined, and we employ the stochastic EnKF (Burgers et al., 1998; Evensen, 2006) to define a computationally robust Monte Carlo closure. In particular, we define an ensemble of states Z ∈ R^{D_z×M} consisting of M members z^{(i)} ∈ R^{D_z}, i = 1, . . . , M, that is,

Z = [z^{(1)}, z^{(2)}, . . . , z^{(M)}], (18)

their empirical mean

z̄ = (1/M) Σ_{i=1}^{M} z^{(i)}, (19)

and the associated matrix of ensemble deviations

Ẑ = [z^{(1)} − z̄, z^{(2)} − z̄, . . . , z^{(M)} − z̄] ∈ R^{D_z×M}. (20)
We define these objects for the forecast and analysis ensembles Z^f_n and Z^a_n, respectively, for all n ≥ 0.
Each ensemble member is propagated individually under the forecast step (11). This defines the update from the last analysis ensemble Z^a_n to the next forecast ensemble Z^f_{n+1}. The EnKF analysis step is now defined as follows. The ensemble deviation matrix Ẑ^f_{n+1} can be used to approximate the forecast covariance matrix (14) as

P^f_{n+1} = (1/(M−1)) Ẑ^f_{n+1} (Ẑ^f_{n+1})^T ∈ R^{D_z×D_z}. (21)

Upon introducing the matrix U^o_{n+1} ∈ R^{D×M} of perturbed observations,

U^o_{n+1} = [u^o_{n+1} − Γ^{1/2} η^{(1)}_{n+1}, u^o_{n+1} − Γ^{1/2} η^{(2)}_{n+1}, . . . , u^o_{n+1} − Γ^{1/2} η^{(M)}_{n+1}], (22)

where the η^{(i)}_{n+1} ∈ R^D, i = 1, . . . , M, are realisations of independent and normally distributed random variables (compare (2)), we obtain the following compact representation of the EnKF update step:

Z^a_{n+1} = Z^f_{n+1} − P^f_{n+1} H^T (H P^f_{n+1} H^T + Γ)^{−1} (H Z^f_{n+1} − U^o_{n+1}). (23)

The ensemble forecast step defined by (11), together with the EnKF analysis step (23), constitutes our combined RAndom Feature maps and Data Assimilation (RAFDA) method. The resulting weight matrix W^a_N at final training time N is denoted by W_RAFDA.
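One complete RAFDA cycle — the member-wise forecast step (11) followed by the stochastic EnKF analysis (21)-(23) — can be sketched as follows; the array shapes and the handling of the random perturbations are illustrative assumptions.

```python
import numpy as np

def rafda_forecast(Z_a, phi, D, D_r):
    """Forecast step (11): each member propagates its state with its own weights."""
    Z_f = Z_a.copy()
    for i in range(Z_a.shape[1]):
        W_i = Z_a[D:, i].reshape(D, D_r)      # unravel parameter vector w into W
        Z_f[:D, i] = W_i @ phi(Z_a[:D, i])    # (11a); the parameters persist, (11b)
    return Z_f

def enkf_analysis(Z_f, u_obs, Gamma, rng):
    """Stochastic EnKF analysis (23) with H selecting the first D components."""
    D = u_obs.size
    M = Z_f.shape[1]
    Z_hat = Z_f - Z_f.mean(axis=1, keepdims=True)            # ensemble deviations (20)
    P_zu = Z_hat @ Z_hat[:D].T / (M - 1)                     # P^f H^T, cf. (21)
    # Perturbed observations, cf. (22).
    U_obs = u_obs[:, None] - np.linalg.cholesky(Gamma) @ rng.standard_normal((D, M))
    K = P_zu @ np.linalg.inv(P_zu[:D] + Gamma)               # Kalman gain (13)
    return Z_f + K @ (U_obs - Z_f[:D])
```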
In the subsequent two subsections we discuss further algorithmic details of our method.

2.3. Choice of random feature maps.
The choice of the random coefficients b_in and W_in in the feature maps (6) and their dimension D_r is crucial for the success of our method. In this study, we choose these entries to be independent and uniformly distributed with

(W_in)_{ij} ∼ U[−w, w] and (b_in)_i ∼ U[−b, b]. (24)

The hyper-parameters w > 0 and b > 0 are chosen such that the features fill their range [−1, 1]^{D_r} as uniformly as possible, in particular sampling the nonlinear domain of the tanh function. See the numerical experiment sections for more details.

We mention that one could also dynamically adapt those parameters. This would extend our method from random feature maps to two-layer networks, going from RKHS to Barron spaces (Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020; E et al., 2019). We leave this option in combination with data assimilation as a topic for further research.
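As a simple diagnostic connected to the considerations above, the sketch below draws the inner parameters according to (24) and measures how strongly the resulting features saturate on a training set; the saturation threshold of 0.99 is an arbitrary choice of ours and not part of the method (cf. the histogram discussion around Figure 2 in Section 3.1).

```python
import numpy as np

def draw_inner_parameters(D, D_r, w, b, rng):
    """Sample (W_in)_ij ~ U[-w, w] and (b_in)_i ~ U[-b, b], cf. (24)."""
    return rng.uniform(-w, w, size=(D_r, D)), rng.uniform(-b, b, size=D_r)

def saturation_fraction(U_train, W_in, b_in, threshold=0.99):
    """Fraction of feature evaluations with |tanh| beyond the threshold.
    Values near 1 indicate histograms peaked at +-1, i.e. poorly scaled features."""
    Phi = np.tanh(W_in @ U_train + b_in[:, None])
    return float(np.mean(np.abs(Phi) > threshold))
```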
2.4. Further algorithmic details of RAFDA.

The required ensemble size M of the standard implementation (23) of an EnKF has to be of the order O(D_z), which can become prohibitive for typical reservoir sizes of D_r ∼ O(10³) and potentially high-dimensional dynamical systems. Within the EnKF community, further approximations such as localisation and inflation have been developed to deal with finite-size effects (Evensen, 2006; Reich and Cotter, 2015; Houtekamer and Zhang, 2016). Here we follow the concept of B-localisation (Houtekamer and Mitchell, 1998, 2001; Hamill et al., 2001), where the empirical covariance matrix P^f_{n+1} is tempered by a symmetric positive definite localisation matrix B via the Schur (element-wise) product, that is,

P̃^f_{n+1} = B ∘ P^f_{n+1}. (25)

The EnKF update step (23) is then replaced by

Z^a_{n+1} = Z^f_{n+1} − P̃^f_{n+1} H^T (H P̃^f_{n+1} H^T + Γ)^{−1} (H Z^f_{n+1} − U^o_{n+1}). (26)

Localisation implies that certain correlations between variables get reduced or even set to zero if the corresponding entry in B is less than one or zero, respectively. This allows one to control spurious correlations of size O(1/√M) between uncorrelated variables, caused by employing a finite ensemble size. In our numerical experiments we employ the following form of localisation. The j-th row of the weight matrix W is responsible for updating the j-th component of the dynamic state variable u under the forward model (11a). We reflect this property in the approximate covariance matrix P̃^f_{n+1} by ignoring all correlations between the j-th row of W and all components of the observed u^o_{n+1} except for its j-th entry.

We also employ multiplicative inflation, in which the members z^{(i)}_{n+1} of the forecast ensemble Z^f_{n+1} are replaced by

z^{(i)}_{n+1} → z̄_{n+1} + α (z^{(i)}_{n+1} − z̄_{n+1}), (27)

i = 1, . . . , M, prior to the EnKF data assimilation step (23); a factor α > 1 increases the spread and hence the forecast covariance P^f_{n+1} of the ensemble.

We finally need to specify the distribution of the extended state variable z at initial time t = 0. We treat the two components u and w of z as independent and set

u_0 ∼ N(u^o_0, Γ) (28)

as well as

w_0 ∼ N(w_LR, γ I), (29)

where w_LR is the vectorial form of the solution W_LR of the ridge regression formulation (8) and γ > 0 controls the initial parameter spread. Alternatively, one can choose w_0 ∼ N(0, γ I) if w_LR is unavailable.
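The two finite-ensemble-size remedies can be sketched as follows; the index bookkeeping assumes the row-wise unravelling of W into w introduced in Section 2.2.

```python
import numpy as np

def multiplicative_inflation(Z_f, alpha):
    """Multiplicative inflation (27): rescale deviations from the ensemble mean."""
    z_bar = Z_f.mean(axis=1, keepdims=True)
    return z_bar + alpha * (Z_f - z_bar)

def rafda_localisation_mask(D, D_r):
    """Localisation matrix B for (25): the j-th row of W keeps correlations
    only with the j-th observed component of u."""
    D_z = D + D * D_r
    B = np.ones((D_z, D_z))
    for j in range(D):
        rows = D + j * D_r + np.arange(D_r)   # augmented indices of row j of W
        B[np.ix_(rows, np.arange(D))] = 0.0   # remove correlations with u ...
        B[np.ix_(np.arange(D), rows)] = 0.0
        B[rows, j] = 1.0                      # ... except with the j-th entry
        B[j, rows] = 1.0
    return B

# The tempered covariance (25) is the element-wise (Schur) product B * P_f.
```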
In the subsequent sections we illustrate the proposed RAFDA methodology for several dynamical systems, demonstrate its improved forecast skill compared to the linear regression approach (10), and investigate the dependency of RAFDA on the noise strength and the available training size N, as well as the impact on the approximation properties of RAFDA of the various choices a modeller needs to make, such as the reservoir dimension D_r and the choice of the random elements W_in and b_in.

3. Ordinary differential equations: Lorenz-63 equation

We consider as a first test bed the Lorenz-63 system (Lorenz, 1963),

ẋ = 10 (y − x),
ẏ = 28 x − y − x z, (30)
ż = −(8/3) z + x y,

with u = (x, y, z)^T ∈ R³. Observations are taken every ∆t =
0.02 time units, and u_n corresponds to the solution at t_n = n∆t. In all simulations a transient of 40 time units is discarded to ensure that the dynamics has settled on the attractor.

To assess the propensity of random feature maps with data assimilation and of classical random feature maps with linear regression to be used as forecast models, we test the forecast model of LR,

u_{n+1} = W_LR φ(u_n), (31)

with W_LR given by (10), and that of RAFDA,

u_{n+1} = W_RAFDA φ(u_n), (32)

where W_RAFDA is determined from the training data as the final outcome of the iterative DA procedure given by (11) and (23). In both cases the reservoir variables are given by (6). For LR we employ the regularization parameter β identified as optimal in Section 3.2; for RAFDA no inflation is employed (α = 1). Unless stated otherwise, we set D_r = M = 300 and N = 4,000 and use a noise strength of η = 0.2.

The forecast skill is assessed against unseen validation data u_valid(t_n), n ≥ 0. As a quantitative diagnostic for the forecast skill we record the forecast time τ_f, defined as the largest time such that the relative forecast error satisfies

E(t_n) = ‖u_valid(t_n) − u_n‖ / ‖u_valid(t_n)‖ ≤ θ.

We choose here θ = 0.05. We measure time in units of the Lyapunov time, tλ_max, with the maximal Lyapunov exponent λ_max = 0.91. In the following we discuss the effect of the various parameters, as well as of the length of the available data and of the noise levels, on the forecast capabilities of the two models.
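The validation protocol for the forecast time τ_f can be sketched as follows; the text does not specify the integrator used to generate the reference trajectories, so the classical fourth-order Runge-Kutta step below is our assumption.

```python
import numpy as np

def lorenz63(u):
    """Right-hand side of the Lorenz-63 system (30)."""
    x, y, z = u
    return np.array([10.0 * (y - x), 28.0 * x - y - x * z, x * y - 8.0 / 3.0 * z])

def rk4_step(f, u, dt):
    k1 = f(u); k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2); k4 = f(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def forecast_time(U_valid, U_model, dt=0.02, lam_max=0.91, theta=0.05):
    """Forecast time tau_f: the largest time with E(t_n) <= theta,
    returned in units of the Lyapunov time (i.e. multiplied by lam_max)."""
    err = np.linalg.norm(U_valid - U_model, axis=0) / np.linalg.norm(U_valid, axis=0)
    exceeded = np.nonzero(err > theta)[0]
    n_f = exceeded[0] if exceeded.size else err.shape[-1]
    return n_f * dt * lam_max
```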
3.1. Dependency on the internal parameters.
We first address the issue of the choice of the randomly chosen internal parameters W_in and b_in. The entries of the internal weight matrix and bias are chosen uniformly over intervals [−w, w] in the case of W_in and [−b, b] in the case of b_in (cf. (24)). Figure 1 shows the forecast times τ_f as a function of the two parameters w and b for single realisations of (24). The forecast times are estimated using the ridge linear regression matrix W_LR for a single time series of noiseless observations with N = 200,000 training data points.
Results are shown for training on noiseless observations and on noisy observations with η = 0.2, and exhibit a strong dependence on w and b. In particular, parameter choices leading to excellent forecast times in the noiseless case may not lead to good forecast times in the case of non-zero measurement noise.

To explore what constitutes a good set of inner parameters corresponding to long forecast times, we show in Figure 2 normalized histograms of one of the components of the random feature map, φ_j (cf. (6)). For inner parameter choices corresponding to poor forecast times (τ_f < 0.5), the histograms are peaked at
±1, caused by the tanh random feature map (6) having scale parameters w such that the tanh function cannot resolve the whole range of dynamical states of the underlying observations. Inner parameters corresponding to long forecast times, on the other hand, resolve the whole dynamical range within the nonlinear range of the tanh function. In the following, the internal weight matrix and bias are chosen randomly with w
= 0.005 and b = 4, which yields large forecast times τ_f for both η = 0 and η = 0.2.

Figure 1. Contour plot of the forecast time τ_f for different choices of the internal parameters and the measurement noise variance η. Left: η = 0. Right: η = 0.2.
Figure 2. Normalized empirical histogram of a component φ_j of the random feature map (6) for four choices of inner parameters w and b. Left: inner parameters corresponding to forecast times
τ_f < 0.5. Right: inner parameters corresponding to longer forecast times.

3.2. Dependency of LR on the regularization parameter β.

Figure 3 shows the mean of the forecast time τ_f obtained from LR for reservoir size D_r = 300 and different choices of the regularization parameter β, averaged over 500 realisations, each trained on
N = 4,000 observations. We observe that when noiseless data are used to train LR, the forecast time attains a maximum at a small but finite optimal value of β. For noisy observations with η = 0.2, the forecast time τ_f is relatively robust to changes in the regularization parameter over a wide range of β.

Figure 3. Mean of the forecast time τ_f in units of the Lyapunov time for varying regularization parameter β for LR, for noiseless (η = 0) and for noise-contaminated observations (η = 0.2).

3.3. Dependency of RAFDA on the initial ensemble.
Turning to RAFDA, we now investigate the influence of the choice of the initial ensemble for the parameters. We consider initial ensembles drawn in an unbiased fashion, w_0 ∼ N(0, γ I), as well as
ensembles centred around the prior provided by LR, w_0 ∼ N(w_LR, γ I), as in (29), where we employ the optimal value of β identified in Figure 3 (the value which gave rise to the peak in the forecast time for noiseless data) to obtain w_LR. The model is trained on
N = 4,000 noisy observations with noise level η = 0.2, and a reservoir dimension of D_r = 300 is used. We chose M = 300 ensemble members for RAFDA and averaged over 500 realisations. Figure 4 shows the dependency of the mean forecast time on the initial spread γ of the ensemble. For both types of initial ensembles there is a wide range of initial ensemble spreads γ for which good forecast times are obtained. As is to be expected, the same forecast times τ_f are produced for large values of γ, independently of whether the initial ensemble was chosen centred around the prior w_LR or unbiased. Centering the initial ensemble around the prior provided by LR produces larger forecast times than LR for noisy data for sufficiently large γ, and generally leads to forecast times which are more robust with respect to the initial spread γ than for the unbiased initial ensemble. We remark that the differences between the two initialisation choices depend on the noise level of the training data set; for small noise levels, centering the initial ensemble around the prior provided by LR leads to better forecast times than the unbiased initial ensemble for a wider range of values of γ. In the following we use initial ensembles centred around the prior provided by LR with γ = 1,000.

Figure 4. Mean of the forecast time τ_f in units of the Lyapunov time for varying γ for RAFDA, for noise-contaminated observations with η = 0.2. Shown are results when the initial ensemble is drawn around the prior provided by LR, w_0 ∼ N(w_LR, γ I) (cf. (29)), and for the unbiased choice w_0 ∼ N(0, γ I). For each realisation the prior is calculated anew, leading to a slightly jagged curve for LR.

3.4. Dependency on the reservoir dimension D_r.

Figure 5 shows the dependency of the forecast time τ_f on the reservoir dimension D_r for fixed training length N = 4,000 and M = D_r ensemble members. The forecast performance of both RAFDA and LR initially increases with the reservoir size D_r and then saturates for sufficiently large D_r. In the saturated range the forecast times of RAFDA are almost twice as large as those of classical LR.

Figure 5. Mean of the forecast time τ_f in units of the Lyapunov time for varying reservoir size D_r for RAFDA and for LR.

Figure 6 shows the empirical histogram of the forecast time τ_f for RAFDA and for LR over 500 independent realisations. RAFDA clearly exhibits improved forecast
times, with a mean of about 3 Lyapunov times, roughly twice that of LR. Here a reservoir of dimension D_r = 1,600 with M = 1,600 ensemble members is used. The models are trained with N = 4,000 noisy observations.

3.5. Dependency on the training data size N.
The size of the training data set needs to be sufficiently large for two separate reasons. First of all, the training data set needs to sample sufficient parts of phase space to allow for forecasting of unseen data which may explore different areas of phase space. Secondly, the size of the training data set needs to be sufficiently large to allow for good statistical estimation of the weight matrix.

Figure 6. Empirical histogram of forecast times τ_f in units of the Lyapunov time for D_r = 1,600.

Figure 7 shows the dependency of the forecast time τ_f on the size N of a noisy training data set over 500 independent validation data sets. We checked that the analysis fields for the Lorenz-63 system variables were close to the truth and no ensemble inflation was required. As for the reservoir dimension, the forecast times τ_f increase with increasing length of the training data set N and saturate for sufficiently large training data sets, exhibiting the same twofold improvement of RAFDA over LR as observed in Section 3.4.
Figure 7. Mean of the forecast time τ_f in units of the Lyapunov time for varying lengths N of the training data set for RAFDA and for LR.

3.6. Dependency on the measurement noise level η.

We now test how the forecast capabilities depend on the observational noise level η. Figure 8 shows the
8, albeit fora longer range in noise levels. The range of constant forecast times is followed by anexponentially decaying range, before the models loose any forecast skill with τ f → τ f with respect to the observationalnoise strength η is consistent with the sensitivity of chaotic dynamical systems withrespect to their initial condition and the exponential separation of nearby trajectories.To corroborate this interpretation we have confirmed that the same forecast times areachieved when simultaneously doubling the data size to N = 8 ,
000 and the noise level η.
Figure 8. Mean of the forecast time τ_f in units of the Lyapunov time for varying noise levels η for RAFDA and for LR.

3.7. Application to ensemble forecasting.
In chaotic dynamical systems a single forecast is to a certain degree meaningless, as small uncertainties in the initial condition can lead to widely differing forecasts at a later time. Probabilistic forecasts provide a more appropriate framework for forecasting. In particular, ensemble forecasting, whereby a Monte Carlo estimate of the underlying probability density function is provided by running an ensemble of initial conditions, is now widely used in numerical weather forecasting, issuing both the most probable forecast and its associated uncertainty (Epstein, 1969; Leith, 1974; Leutbecher and Palmer, 2008). The quality of such ensemble forecasts crucially depends on how the ensembles are generated. There exist several strategies, using singular vectors (Lorenz, 1965; Palmer, 1993), bred vectors (Toth and Kalnay, 1993, 1997; Giggins and Gottwald, 2019), analysis ensembles from ensemble Kalman filters (Wang and Bishop, 2003; Buizza et al., 2005), and, more recently, analogs (Atencia and Zawadzki, 2017). Here we show that ensembles obtained from RAFDA provide reliable forecast ensembles to be used in probabilistic forecasts. A good probabilistic forecast is not necessarily one with the smallest root-mean-square (RMS) error (consider a probability density function with disjoint support; the mean may then not even be an actual physical state), but rather one where each ensemble member has equal probability of being closest to the truth. Such ensembles are called reliable. We probe reliability here with the continuous ranked probability score (CRPS) (Hersbach, 2000; Wilks, 2006). The CRPS is defined for a lead time τ as

CRPS(τ) = (1/D) Σ_{k=1}^{D} ∫_{−∞}^{∞} [F(u; τ, k) − F_truth(u; τ, k)]² du, (33)

where F and F_truth are the cumulative probability distributions of the forecast ensemble and of the truth, respectively, which are estimated as

F(u; τ, k) = (1/M) Σ_{m=1}^{M} Θ(u − u^{(m,k)}(τ)), (34)

F_truth(u; τ, k) = Θ(u − u^{(k)}_truth(τ)), (35)

where Θ(x) is the Heaviside function with Θ(x) = 0 for x < 0 and Θ(x) = 1 otherwise, and u^{(m,k)}(τ) denotes the k-th component of the m-th ensemble member u^{(m)}(τ) at forecast time τ > 0.
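The score (33)-(35) can be estimated directly from an ensemble; the sketch below evaluates the integral via the algebraically equivalent closed form E|X − y| − ½ E|X − X′| for the empirical distributions (34)-(35).

```python
import numpy as np

def crps(ensemble, truth):
    """CRPS (33) for one lead time, averaged over the D components.

    ensemble : array of shape (D, M); truth : array of shape (D,).
    """
    D, M = ensemble.shape
    score = 0.0
    for k in range(D):
        x = ensemble[k]
        term1 = np.mean(np.abs(x - truth[k]))             # E|X - y|
        term2 = np.mean(np.abs(x[:, None] - x[None, :]))  # E|X - X'|
        score += term1 - 0.5 * term2
    return score / D
```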
Here we use a training data set of length N = 250 and noise level η = 0.2. We further compare with results obtained from a traditional EnKF ensemble where the full Lorenz-63 model (30) is employed as the forecast model. We find that RAFDA provides the smallest analysis RMS error, slightly below the 0.17 achieved by the traditional EnKF; LR has a markedly larger analysis RMS error of 7.8. Figure 9 shows CRPS(τ), averaged over 500 realisations, for the three ensembles. For small lead times the EnKF and RAFDA perform similarly. However, RAFDA clearly generates a more reliable ensemble for larger lead times. LR ensembles cannot be classified as reliable.

Figure 9. CRPS as a function of the lead time τ for RAFDA, LR and EnKF ensembles.

4. Partial differential equations: Kuramoto-Sivashinsky equation
We now consider artificially generated time series obtained from a numerical simulation of the Kuramoto-Sivashinsky equation (Kuramoto and Tsuzuki, 1975, 1976; Sivashinsky, 1977; Sivashinsky and Michelson, 1980),

u_t + u u_x + α u_xx + u_xxxx = 0, (36)

with periodic boundary conditions. For system length L = 22 the Kuramoto-Sivashinsky equation is chaotic with a maximal Lyapunov exponent of λ_max ≈ 0.043 (Edson et al., 2019).
Although a partial differential equation, its dynamics evolves on a finite-dimensional manifold (Temam, 1997).

We generate observations u^o_n by integrating (36) using a pseudo-spectral Crank-Nicolson scheme where the nonlinearity is treated with a second-order Adams-Bashforth scheme. We employ a temporal discretization step of δt =
0.001, use 64 spatial grid points, and sample every ∆t =
0.25 time units to obtain observations u^o_n ∈ R^{64}. An initial transient period is discarded to ensure that the dynamics has settled onto the attractor.
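The data generation described above can be sketched as follows, treating the linear terms (with α = 1) by Crank-Nicolson and the nonlinearity by a second-order Adams-Bashforth step in Fourier space; the initial condition and the start-up of the multistep scheme are illustrative choices.

```python
import numpy as np

def ks_trajectory(n_steps, L=22.0, J=64, dt=0.001, rng=None):
    """Pseudo-spectral integration of u_t + u u_x + u_xx + u_xxxx = 0."""
    rng = rng or np.random.default_rng(0)
    k = 2.0 * np.pi * np.fft.fftfreq(J, d=L / J)     # spectral wavenumbers
    lam = k**2 - k**4                                # symbol of -(d_xx + d_xxxx)
    v = np.fft.fft(0.01 * rng.standard_normal(J))    # small random initial condition
    nonlin = lambda v: -0.5j * k * np.fft.fft(np.real(np.fft.ifft(v))**2)  # -(u u_x)^
    N_old = nonlin(v)                                # first step degenerates to Euler
    out = np.empty((n_steps, J))
    for n in range(n_steps):
        N_new = nonlin(v)
        v = ((1.0 + 0.5 * dt * lam) * v + dt * (1.5 * N_new - 0.5 * N_old)) \
            / (1.0 - 0.5 * dt * lam)                 # Crank-Nicolson / AB2 update
        N_old = N_new
        out[n] = np.real(np.fft.ifft(v))
    return out
```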
The internal weight matrix and bias are chosen randomly according to (24) with b = 1 and a small scale parameter w, using a reservoir of size D_r = 2,000; RAFDA is trained on N = 70,000 observations with noise level η = 0.01. The data assimilation component of RAFDA was executed with M = 1,000 ensemble members and without any inflation, with an initial ensemble chosen according to (29) with a small initial spread γ. Figure 10 depicts results of the RAFDA forecast, showing that reasonable forecasts can be made up to times close to four Lyapunov times. Classical random feature maps with linear ridge regression trained on the same data set only yield forecast horizons of one Lyapunov time. If LR is trained on noiseless data, it achieves forecast skill comparable to RAFDA. We have refrained here from optimizing the reservoir weights w and b and the reservoir dimension D_r to achieve larger forecast horizons.

Figure 10. Hovmöller diagram of the forecast field u(x, t). Left: RAFDA forecast. Right: Root-mean-square error between the RAFDA forecast and the truth. Times are in units of 1/λ_max.

5. Closure models: Multi-scale Lorenz-96 system
We consider now the problem of model closure in multi-scale systems. Typicallyone is only interested in the dynamics of the large-scale slow variables. Moreover,one typically only has access to observations of the resolved slow dynamics. Theobjective of model closure, or also known as the parametrization problem, is to finda self-consistent dynamical model for the resolved slow large-scale variables. Theeffective model for the slow large-scale dynamics allows for a larger time step in the igure 10. Hovm¨oller diagram of the forecast field u ( x, t ). Left:RAFDA forecast. Right: Root-mean-square error of the RAFDA fore-cast and the truth. Times are in units of 1 /λ max .numerical simulation compared to the prohibitively small time steps required in thefull stiff multi-scale system.We consider here the multi-scale Lorenz-96 system for K slow variables X ( k ) whichare each coupled to J fast variables Y ( j,k ) , given by ddt X ( k ) = − X ( k − ( X ( k − − X ( k +1) ) − X ( k ) + F − hcd J X j =1 Y ( j,k ) , (37) ddt Y ( j,k ) = − cbY ( j +1 ,k ) ( Y ( j +2 ,k ) − Y ( j − ,k ) ) − cY ( j,k ) + hcd X ( k ) , (38)with cyclic boundary conditions X ( k + K ) = X ( k ) , Y ( j,k + K ) = Y ( j,k ) and Y ( j + J,k ) = Y ( j,k +1) , giving a total of D = K ( J +1) variables. This set of equations was introducedas a caricature for the midlatitude atmosphere (Lorenz, 1996). The variables X ( k ) model large scale atmospheric fields arranged on a latitudinal circle, such as synopticweather systems. Each of the X ( k ) variables is connected to J small-scale variables ( j,k ) . The time scale separation is parametrized by the coefficient c , and the ratioof the amplitudes of the large-scale and the small-scale variables is controlled by b .The coupling between the slow and fast variables is controlled by the parameter h .Both large-scale and small-scale dynamics variables evolve when uncoupled accordingto nonlinear transport and linear damping; the large-scales are subjected to externalforcing F . We choose here as parameters K = 8 and J = 32, leading to a total of256 variables, and F = 20, c = d = 10 and h = 1 as in Wilks (2005); Arnold et al.(2013). The choice c = d = 10 implies that the variables Y ( j,k ) fluctuate with a 10times higher frequency and with an approximately 10 times smaller amplitude whencompared to the X ( k ) . Setting the coupling constant h = 1, corresponds to strongcoupling where the dynamics is driven by the fast sub-system (Herrera et al., 2010).Our aim is to to provide a reduced model of the form ddt X ( k ) = G ( k ) ( X ) + ψ ( X ( k ) ) , (39)with X = ( X (1) , . . . , X ( K ) ) T , G ( k ) ( X ) = − X ( k − ( X ( k − − X ( k +1) ) − X ( k ) + F , andwhere the closure term ψ ( x ) parametrizes the effect of the fast unresolved dynamicsand is learned by RAFDA. Note that one and the same closure map can be applied atall grid points due to the spatial homogeneity of the problem. The application we havein mind is that a scientist has available a physics-based model for the resolved slowlarge-scale dynamics, but doe snot have access to a the unresolved fast small-scalemodel, and the closure model needs to inferred from noisy observations.In the case when the full multi-scale model (37)–(38) is available and noiselessdata of both the slow and fast variables are available, the closure problem has beenattempted outside the scope of machine learning in Wilks (2005) and Arnold et al.(2013). 
In these works, deterministic parametrizations were proposed by polynomial data-fitting of the closure term against the resolved variables, with
ψ_Wilks(x) = 0.262 + 1.45 x − 0.0121 x² − 0.00713 x³ + 0.000296 x⁴, (40)

ψ_AMP(x) = 0.341 + 1.30 x − 0.0136 x² − 0.00235 x³. (41)
Here we assume we are only given observations of the slow variables X_n at discrete times t_n = n∆t. In LR, we set u_n = X_n, n = 0, . . . , N, and the closure mapping ψ(x) is learned directly from the data via finite differencing. More specifically, the closure is performed by minimising the difference between the random feature map approximation of the closure term, ψ(x) = W φ(x), over all grid points x = X^(k), and the approximation of the closure term provided by the observations,

(X^(k)_n − X^(k)_{n−1}) / ∆t − G^(k)(X_{n−1}).

This finite differencing, however, constitutes a bad estimator of the closure term ψ(x) for noisy observations of the slow variables. In the following we therefore only present results for LR in the case of zero measurement error (i.e. the same setting in which the polynomial fits (40) and (41) were obtained).
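The finite-difference construction of the LR training pairs can be sketched as follows; pooling over all grid points exploits the spatial homogeneity noted above, and the helper assumes noise-free observations, matching the setting in which the LR results are reported.

```python
import numpy as np

def closure_training_pairs(X_series, dt, F=20.0):
    """Targets psi(X^(k)_{n-1}) ~ (X^(k)_n - X^(k)_{n-1})/dt - G^(k)(X_{n-1}).

    X_series : array of shape (K, N+1) of slow observations.
    Returns (inputs, targets), both flattened over grid points and times."""
    Xm = X_series[:, :-1]
    G = (-np.roll(Xm, 1, axis=0) * (np.roll(Xm, 2, axis=0) - np.roll(Xm, -1, axis=0))
         - Xm + F)
    targets = (X_series[:, 1:] - Xm) / dt - G
    return Xm.ravel(), targets.ravel()
```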
000 noisy observations of all components X ( k ) of X with variance η = 0 .
02 sampled at ∆ t = 0 . γ = 0 .
01. The validation is performed by propagatingthe reduced model (39) with the learned mapping ψ ( x ) added at each grid point inan Euler time-stepping method with step-size δt = 0 .
Figure 11 shows the closure term ψ(x) estimated from RAFDA, from LR and from the polynomial fits (40)-(41). We show in Figure 12 the empirical probability density function of the validation time series and list in Table 1 the associated relative errors of the mean and the variance of X for the various parametrization schemes used. The closure model provided by RAFDA outperforms both the closure of LR and those given by (40)-(41). It is seen that the reference empirical probability density function of the full multi-scale L96 system assigns more probability to higher values of X. We conjecture that this underdispersiveness of the closure models stems from their assumed deterministic nature. The deterministic closure schemes considered here ignore the effective diffusive effect of the fast variables, which is present for finite time-scale and spatial-scale separation. As a first step towards a full stochastic parametrization one could use RAFDA to learn the propagator of the mean E^(k) = Σ_{j=1}^{J} Y^(j,k) alongside the mapping ψ(x).
Table 1. Relative error of the mean µ and the variance σ² of the slow variables X for the various parametrization schemes.

                  µ         σ²
  RAFDA         0.0…      0.029
  LR            0.0…      0.033
  Wilks (40)    0.0…      0.016
  AMP (41)      0.0…       …

6. Discussion and outlook
We have proposed a new data-driven physics-agnostic forecast model, coined RAFDA, which combines random feature maps with DA. The method is designed to provide cheap surrogate mappings to be used in forecasting, or to construct closure models, when no a priori knowledge of the evolution equations is available and the underlying system is only accessible via noisy observations. The linear coefficients of the random feature maps are sequentially updated with incoming observations using an EnKF, which itself uses the random feature map as its forecast model. We have tested the forecast capabilities in several situations for ordinary and partial differential equations. For the Lorenz-63 model we were able to achieve a forecast horizon of up to four Lyapunov times, and similarly for the Kuramoto-Sivashinsky equation. We investigated the influence of the data length, the number of feature maps and the noise strength on the forecast horizon.
Figure 11. Closure terms ψ = ψ(x) for the multi-scale L96 system (37)-(38), estimated using the various parametrizations.
Figure 12. Empirical probability density function of the slow variables X^(k) of the full multi-scale L96 system (37)-(38) and of the RAFDA closure model.

Depending on the dimension of the underlying dynamical system, a larger number of random feature maps and larger training data sets are required. Once trained, however, the computational cost to run the model can be significantly cheaper than the numerical integration of the original full model. For example, for the Kuramoto-Sivashinsky model we used a discretization time step of δt = 10⁻³, whereas the surrogate RAFDA model was trained on a time series with ∆t =
0.25, which implies a computational gain in running time of ∆t/δt = 250. However, we stress that the real advantage of RAFDA becomes apparent when the model is unknown. Besides providing remarkable forecast skill for individual trajectories, we showed that ensembles generated in data assimilation cycles which use a pre-trained RAFDA forecast model are reliable, which makes them attractive candidates for probabilistic forecasting, where one would like to issue an uncertainty quantification of a forecast. Furthermore, we showed that for the multi-scale Lorenz-96 model RAFDA allowed us to find a Markovian closure model for the resolved variables, having only observed the slow variables. We note that the EnKF in augmented state space could be replaced by other data assimilation schemes such as those proposed, for example, in Chen and Majda (2019).

An open question is how to choose the random parameters of the feature map. The universal approximation property ensures that any continuous function can be approximated arbitrarily well by a linear combination of the features, but it does not prescribe how to choose the parameters or how many features are needed. We have shown that the forecast capabilities depend sensitively on the random parameters, and that parameters which lead to good approximations in the noiseless case may fail to approximate the forecast map when trained on noisy data. A natural extension is to determine the optimal parameters W_in and b_in simultaneously with W, possibly using again an ensemble Kalman filter. Rather than random features, which are related to RKHS, the machine-learning framework would then consist of a two-layer network, and the corresponding spaces are Barron spaces (Chizat and Bach, 2018; Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Sirignano and Spiliopoulos, 2020; E et al., 2019). This extension is planned for further research.

Acknowledgments
The work of SR has been partially funded by Deutsche Forschungsgemeinschaft (DFG, German Science Foundation) - SFB 1294/1 - 318763901.
References
J. L. Anderson and S. L. Anderson. A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Monthly Weather Review, 127(12):2741–2758, 1999.

H. M. Arnold, I. M. Moroz, and T. N. Palmer. Stochastic parametrizations and model uncertainty in the Lorenz '96 system. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1991):20110479, 2013. doi:10.1098/rsta.2011.0479.

A. Atencia and I. Zawadzki. Analogs on the Lorenz attractor and ensemble spread.
Monthly Weather Review, 145(4):1381–1400, 2017. doi:10.1175/MWR-D-16-0123.1.

F. Bach. Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Research, 18(19):1–53, 2017a. URL http://jmlr.org/papers/v18/14-546.html.

F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Research, 18(21):1–38, 2017b.

A. R. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. on Inform. Theory, 39:930–945, 1993. doi:10.1109/18.256500.

R. Buizza, P. L. Houtekamer, G. Pellerin, Z. Toth, Y. Zhu, and M. Wei. A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems.
Monthly Weather Review, 133(5):1076–1097, 2005. doi:10.1175/MWR2905.1.

G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6):1719–1724, 1998.

N. Chen and A. J. Majda. A new efficient parameter estimation algorithm for high-dimensional complex nonlinear turbulent dynamical systems with partial observations. Journal of Computational Physics, 397:108836, 2019. doi:10.1016/j.jcp.2019.07.035.

L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.
Advances in Neural Information Processing Systems, pages 3036–3046, 2018.

G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Contr., Sign., and Syst., 2:303–314, 1989. doi:10.1007/BF02551274.

W. E, C. Ma, and L. Wu. A priori estimates of the population risk for two-layer neural networks. Commun. Math. Sci., 17(5):1407–1425, 2019. doi:10.4310/CMS.2019.v17.n5.a11.

R. Edson, J. Bunder, T. Mattner, and A. Roberts. Lyapunov exponents of the Kuramoto–Sivashinsky PDE. ANZIAM Journal, 61:270–285, 2019. doi:10.21914/anziamj.v61i0.13939.

E. S. Epstein. Stochastic dynamic prediction. Tellus, 21(6):739–759, 1969. doi:10.1111/j.2153-3490.1969.tb00483.x.

G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, New York, 2006.

B. Giggins and G. A. Gottwald. Stochastically perturbed bred vectors in multi-scale systems. Quarterly Journal of the Royal Meteorological Society, 145:642–658, 2019. doi:10.1002/qj.3457.

T. M. Hamill, J. S. Whitaker, and C. Snyder. Distance-dependent filtering of background covariance estimates in an ensemble Kalman filter. Monthly Weather Review, 129(11):2776–2790, 2001.

F. Hamilton, T. Berry, and T. Sauer. Ensemble Kalman filtering without a model. Phys. Rev. X, 6:011021, 2016. doi:10.1103/PhysRevX.6.011021.

S. Herrera, J. Fernández, M. Rodriguez, and J. Gutiérrez. Spatio-temporal error growth in the multi-scale Lorenz-96 model. Nonlinear Processes in Geophysics, 17(4):329, 2010. doi:10.5194/npg-17-329-2010.

H. Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559–570, 2000.

P. L. Houtekamer and H. L. Mitchell. Data assimilation using an ensemble Kalman filter technique. Monthly Weather Review, 126(3):796–811, 1998.

P. L. Houtekamer and H. L. Mitchell. A sequential ensemble Kalman filter for atmospheric data assimilation. Monthly Weather Review, 129(1):123–136, 2001.

P. L. Houtekamer and F. Zhang. Review of the ensemble Kalman filter for atmospheric data assimilation. Monthly Weather Review, 144(12):4489–4532, 2016. doi:10.1175/MWR-D-15-0440.1.

H. Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, 2002.

H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication.
Science, 304(5667):78–80, 2004. doi:10.1126/science.1091277.

Y. Kuramoto and T. Tsuzuki. On the formation of dissipative structures in reaction-diffusion systems: Reductive perturbation approach. Progress of Theoretical Physics, 54(3):687–699, 1975. doi:10.1143/PTP.54.687.

Y. Kuramoto and T. Tsuzuki. Persistent propagation of concentration waves in dissipative media far from thermal equilibrium. Progress of Theoretical Physics, 55(2):356–369, 1976. doi:10.1143/PTP.55.356.

K. Law, A. Stuart, and K. Zygalakis. Data Assimilation: A Mathematical Introduction. Springer-Verlag, New York, 2015.

C. E. Leith. Theoretical skill of Monte Carlo forecasts. Monthly Weather Review, 102(6):409–418, 1974.

M. Leutbecher and T. N. Palmer. Ensemble forecasting. Journal of Computational Physics, 227(7):3515–3539, 2008. doi:10.1016/j.jcp.2007.02.014.

E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.

E. N. Lorenz. A study of the predictability of a 28-variable atmospheric model. Tellus, 17(3):321–333, 1965. doi:10.3402/tellusa.v17i3.9076.

E. N. Lorenz. Predictability: A problem partly solved. In T. Palmer, editor, Proc. Seminar on Predictability, Vol. 1, pages 1–18, Reading, UK, 1996. ECMWF.

W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi:10.1162/089976602760407955.

A. Majda and J. Harlim. Filtering Complex Turbulent Systems. Cambridge University Press, Cambridge, 2012.

S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. USA, 115(33):E7665–E7671, 2018. doi:10.1073/pnas.1806579115.

N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between Banach spaces, 2020. URL https://arxiv.org/pdf/2005.10224.

T. N. Palmer. Extended-range atmospheric prediction and the Lorenz model. Bulletin of the American Meteorological Society, 74(1):49–66, 1993.

J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257, 1991. doi:10.1162/neco.1991.3.2.246.

J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach.
Phys. Rev. Lett., 120:024102, 2018. doi:10.1103/PhysRevLett.120.024102.

D. Qi and A. J. Majda. Using machine learning to predict extreme events in complex systems. Proceedings of the National Academy of Sciences, 117:52–59, 2020. doi:10.1073/pnas.1917285117.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.

A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, 2008.

S. Reich and C. Cotter. Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press, New York, 2015.

G. M. Rotskoff and E. Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, 2018. URL https://arxiv.org/pdf/1805.00915.

J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM J. Appl. Math., 80(2):725–752, 2020. doi:10.1137/18M1192184.

G. I. Sivashinsky. Nonlinear analysis of hydrodynamic instability in laminar flames - I: Derivation of basic equations. Acta Astronautica, 4(11):1177–1206, 1977. doi:10.1016/0094-5765(77)90096-0.

G. I. Sivashinsky and D. M. Michelson. On irregular wavy flow of a liquid film down a vertical plane. Progress of Theoretical Physics, 63(6):2112–2114, 1980. doi:10.1143/PTP.63.2112.

Y. Sun, A. Gilbert, and A. Tewari. On the approximation capabilities of ReLU neural networks and random ReLU features, 2019. URL https://arxiv.org/pdf/1801.04374.

R. Temam. Infinite-Dimensional Dynamical Systems in Mechanics and Physics, volume 68 of Applied Mathematical Sciences. Springer-Verlag, New York, second edition, 1997. doi:10.1007/978-1-4612-0645-3.

Z. Toth and E. Kalnay. Ensemble forecasting at NMC: The generation of perturbations. Bulletin of the American Meteorological Society, 74(12):2317–2330, 1993.

Z. Toth and E. Kalnay. Ensemble forecasting at NCEP and the breeding method. Monthly Weather Review, 125(12):3297–3319, 1997.

X. Wang and C. H. Bishop. A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. Journal of the Atmospheric Sciences, 60(9):1140–1158, 2003.

D. S. Wilks. Effects of stochastic parametrizations in the Lorenz '96 system. Quarterly Journal of the Royal Meteorological Society, 131(606):389–407, 2005. doi:10.1256/qj.04.03.

D. S. Wilks. Statistical Methods in the Atmospheric Sciences. Elsevier, Oxford, 2006.