Effective models and predictability of chaotic multiscale systems via machine learning
Francesco Borra,∗ Angelo Vulpiani, and Massimo Cencini†
Dipartimento di Fisica, Università "Sapienza", Piazzale A. Moro 5, I-00185 Rome, Italy
Istituto dei Sistemi Complessi, CNR, via dei Taurini 19, I-00185 Rome, Italy

∗ Corresponding author; [email protected]
† Corresponding author; [email protected]
We scrutinize the use of machine learning, based on reservoir computing, to build data-driven effective models of multiscale chaotic systems. We show that, for a wide scale separation, machine learning generates effective models akin to those obtained using multiscale asymptotic techniques and, remarkably, remains effective in predictability also when the scale separation is reduced. We also show that predictability can be improved by hybridizing the reservoir with an imperfect model.
I. INTRODUCTION
Machine learning techniques are impacting science at an impressive pace, from robotics [1] to genetics [2], medicine [3], and physics [4]. In physics, reservoir computing [5, 6], based on echo-state neural networks [7-9], is gathering much attention for model-free, data-driven predictions of chaotic evolutions [10-14]. Here, we scrutinize the use of reservoir computing to build effective models for predicting the slow degrees of freedom of multiscale chaotic systems. We also consider hybrid reservoirs, blending data with predictions based on an imperfect model [15] (see also [16]).

Multiscale chaotic systems represent a challenge to both theory and applications. For instance, turbulence can easily span 4/6 decades in temporal/spatial scales [17], while climate time scales range from the hours of atmospheric variability to the thousands of years of the deep ocean [18, 19]. These huge ranges of scales stymie direct numerical approaches, making the modeling of the fast degrees of freedom mandatory, while the slow ones are usually the most interesting to predict. In principle, the latter are easier to predict: the maximal Lyapunov exponent (of the order of the inverse of the fastest time scale) controls the early dynamics of very small perturbations, appertaining to the fast degrees of freedom, which saturate with time, letting the perturbations on the slow degrees of freedom grow at a slower rate controlled by the typically weaker nonlinear instabilities [20-22]. However, owing to nonlinearity, the fast degrees of freedom depend on, and in turn impact on, the slower ones. Consequently, improper modeling of the former severely hampers the predictability of the latter [23].

We focus here on a simplified setting with only two time scales, i.e. on systems of the form:

  Ẋ = τ_s⁻¹ F_s(X, x),   ẋ = τ_f⁻¹ F_f(x, X),   (1)

where X and x represent the slow and fast degrees of freedom, respectively. The time scale separation between them is controlled by c = τ_s/τ_f. The goal is to build an effective model for the slow variables, Ẋ = F_eff(X), to predict their evolution. When the fast variables are much faster than the slow ones (c ≫ 1), such effective models can be derived with asymptotic multiscale techniques, e.g. averaging [24, 25].

II. RESERVOIR COMPUTING FOR CHAOTIC SYSTEMS AND ITS IMPLEMENTATION
Reservoir computing [5, 6] is a brain-inspired approach based on a recurrent neural network (RNN), the reservoir (R), i.e. an auxiliary high-dimensional nonlinear dynamical system naturally suited to deal with time sequences, which is (usually) linearly coupled to a time-dependent, lower-dimensional input (I) to produce an output (O). To make O optimized for approximating some desired dynamical observable, the network must be trained. The reservoir computing implementation avoids backpropagation [26] by training only the output layer, while the R-to-R and I-to-R connections are quenched random variables. Remarkably, the reservoir computing approach allows for fast hardware implementations with a variety of nonlinear systems [27, 28]. Choosing the output as a linear projection of functions of the R-state, the optimization can be rapidly achieved via linear regression. The method works provided the R-to-R connections are designed to force the R-state to depend only on the recent past history of the input signal, fading the memory of the initial state.

A. Predicting chaotic systems with reservoir computing
When considering a chaotic dynamical system with state s(t) = (X(t), x(t)), with reference to Eq. (1), the input signal u(t) ∈ ℝ^{D_I} is typically a subset of the state observables, u(t) = h(s(t)); in the following, for instance, we consider functions of the slow variables X only. When the dimensionality D_R of the reservoir is large enough and the R-to-R connections are suitably chosen, the reservoir state r(t) ∈ ℝ^{D_R} becomes a representation, an echo, of the input state s(t) [6, 9, 12], via a mechanism akin to generalized synchronization [12, 29]. In this configuration, dubbed open loop [30] (Fig. 1a), the RNN is driven by the input and, loosely speaking, synchronizes with it. When this is achieved, the output v(t) ∈ ℝ^{D_O} can be trained (optimized) to fit a desired function of s(t), for instance, to predict the next outcome of the observable, i.e. v(t + Δt) = u(t + Δt). After training, we can close the loop by feeding the output back as a new input to R (Fig. 1b), thus obtaining an effective model for predicting the time sequence. For the closed loop mode to constitute an effective (neural) model of the dynamics of interest, we ask the network to work for arbitrary initial conditions, i.e. not only right after the training: a property dubbed reusability in [12]. For this purpose, when starting from a random reservoir state, a short synchronization period in open loop is needed before closing the loop. For the method to work, some stability property is required, which cannot, in general, be granted in the closed loop configuration [30].
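To make the two modes of operation concrete, the following minimal Python sketch (our illustration, not the authors' code) contrasts the driven (open loop) and autonomous (closed loop) reservoir updates described above; the update rule anticipates Eq. (2) below, and the matrices W_R, W_I, W_O and the readout map r → r* are assumed given.

```python
import numpy as np

def open_loop_step(r, u, W_R, W_I):
    """Driven mode: update the reservoir state with a measured input u(t)."""
    return np.tanh(W_R @ r + W_I @ u)

def closed_loop_step(r, W_R, W_I, W_O, rstar):
    """Autonomous mode: feed the output v(t) = W_O r*(t) back as the input."""
    v = W_O @ rstar(r)                      # one-step forecast of the observable
    return np.tanh(W_R @ r + W_I @ v), v    # same update as above, with u(t) <- v(t)
```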
B. Implementation

Reservoir neurons can be implemented in different ways [5]; here we use an echo state neural network [9], mostly following [10-12]. We assume D_R ≫ D_I = D_O and the input to be sampled at discrete time intervals Δt. Neither assumption is restrictive: for instance, in the hybrid implementation below we will use D_O ≠ D_I, and the extension to continuous time is straightforward [5]. The reservoir is built via a sparse (low degree d), random D_R × D_R connectivity matrix W_R, with uniformly distributed entries, rescaled so that its spectral radius ρ (the largest eigenvalue in absolute value) matches a prescribed value. The request ρ < 1 favors the echo state property, i.e. the synchronization of r(t) with s(t). We distinguish training and prediction. Training is done in open loop mode using an input trajectory u(t) with t ∈ [−T_s : T_t], discretized in steps of size Δt. T_s defines the length of the initial transient that lets r(t), randomly initialized at t = −T_s, synchronize with the system dynamics, while T_t is the length of the training input sequence. After being scaled to zero mean and unit standard deviation, the input is linearly coupled to the reservoir nodes via a D_R × D_I matrix W_I, with random entries uniform in [−σ : σ]. In open loop, r(t) evolves as

  r(t + Δt) = tanh[W_R r(t) + W_I u(t)],   (2)

where tanh is applied element-wise and can be replaced with other nonlinearities. The output is computed as v(t + Δt) = W_O r*(t + Δt), with the D_R × D_O matrix W_O obtained via linear regression by imposing W_O = argmin_W {Σ_{0≤t≤T_t} ||v(t) − u(t)||² + α Tr[W Wᵀ]}, which ensures the output to be the best predictor of the next input observable. The term proportional to α is a regularization, while r* is a function of the reservoir state; here we take r*_i(t) = r_i(t) if i is odd and r*_i(t) = r_i²(t) otherwise [32]. Once W_O is determined, we switch to prediction mode. Given a short sequence of measurements, in open loop, we can synchronize the reservoir with the dynamics (2), and then close the loop by letting u(t) ← v(t) = W_O r*(t) in Eq. (2). This way Eq. (2) becomes a fully data-driven effective model for the time signal to be predicted. The resulting model, and thus its performance, will implicitly depend on the hyperparameters (d, ρ and σ) defining the RNN structure and the I-to-R connections, and on the length of the training trajectory; their choice is discussed in Appendix A, where further details on the implementation can be found.

FIG. 1. (Color online) Sketch of reservoir computing: (a) the components and their connections; (b) the two modes of operation: open loop for synchronizing the reservoir to the input and for training, closed loop for prediction.
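A compact sketch of the training stage just described, under the same hedged assumptions (it reuses open_loop_step from the previous sketch; array shapes are illustrative): reservoir states are collected in open loop after discarding the synchronization transient, and W_O is then fitted by ridge regression.

```python
import numpy as np

def rstar(r):
    """Readout nonlinearity used in the text: square the even (1-based) components."""
    rs = r.copy()
    rs[1::2] = rs[1::2] ** 2
    return rs

def train_W_O(u_traj, r0, W_R, W_I, alpha, n_sync):
    """Open-loop pass over u_traj; fit W_O so that W_O r*(t) predicts u(t + dt)."""
    r, states, targets = r0, [], []
    for t in range(len(u_traj) - 1):
        r = open_loop_step(r, u_traj[t], W_R, W_I)
        if t >= n_sync:                     # discard the synchronization transient
            states.append(rstar(r))
            targets.append(u_traj[t + 1])   # target: the next observable
    R, U = np.array(states), np.array(targets)
    T_t = len(R)
    # Ridge regression, cf. Appendix A3: W_O = <u (x) r*^T>(alpha I + <r* (x) r*^T>)^(-1)
    A = alpha * np.eye(R.shape[1]) + (R.T @ R) / T_t
    return np.linalg.solve(A, (R.T @ U) / T_t).T
```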
C. Hybrid implementation

So far we assumed no prior knowledge of the dynamical system that generated the input. If we have an imperfect model for approximately predicting the next outcome of the observables u(t), we can include such information in a hybrid scheme by slightly changing the input and/or output scheme to exploit this extra knowledge [15, 16]. The idea of blending machine learning algorithms with physics-informed models is quite general, and it has been exploited also with methods different from reservoir computing, see e.g. [33-35].

Let ℘[u(t)] = û(t + Δt) be the estimated next outcome of the observable u(t) according to our imperfect model.

FIG. 2. (Color online) Prediction error growth (a-b) and network effective model (c), for a single realization of a network of D_R = 500 neurons. (a) Average (over 10 initial conditions) (log)error ⟨E(t)⟩ vs time during synchronization (open loop, gray region) and prediction (closed loop) for c = 10 and Δt = 0.1: the yellow shaded area circumscribes the twin and random twin model predictions (see text); reservoir computer prediction (solid, black curve) compared with that of the truncated model (purple, dotted curve), of the model with the fast variables replaced by their average (blue, dash-dotted curve), and of model (5) (red, dashed curve). The inset shows the same (closed loop only) for Δt = 0.01. (b) Same as (a) for c = 3. (c) Residual derivatives (6) vs Y for c = 10 and Δt = 0.01, computed with the network (black dots), the multiscale model (5) (yellow, dashed line), and the full dynamics (gray dots). For details on hyperparameters see Appendix A 6.
The idea is to supply such information in the input by replacing u(t) with the column vector (u(t), ℘[u(t)])ᵀ, thus doubling the dimensionality of the input matrix. For the output we proceed as before. The whole scheme is thus the same as the one above, with the only difference that W_O is now a D_R × D_I/2 matrix, D_I being the dimensionality of the doubled input. The switch to the prediction mode is then obtained by using as input in Eq. (2) the vector (W_O r*(t), ℘[W_O r*(t)])ᵀ; a minimal sketch of this input augmentation is given below. It is worth noticing that other hybrid schemes are possible; e.g., in Ref. [15] the output was of the form v(t + Δt) = W_O (r*(t), ℘[u(t)])ᵀ, i.e. a combination of the prediction based on the network and of that based on the physical model. In Appendix C we comment further on our choice, and we compare it with the scheme proposed in Ref. [15].
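A minimal sketch of the hybrid input augmentation described above, assuming a user-supplied one-step predictor imperfect_model playing the role of ℘ (names are ours):

```python
import numpy as np

def hybrid_open_loop_step(r, u, imperfect_model, W_R, W_I):
    """Input is the pair (u(t), p[u(t)]): the input dimension is doubled."""
    u_aug = np.concatenate([u, imperfect_model(u)])
    return np.tanh(W_R @ r + W_I @ u_aug)

def hybrid_closed_loop_step(r, W_R, W_I, W_O, rstar, imperfect_model):
    """Prediction mode: feed back (W_O r*(t), p[W_O r*(t)]) as the next input."""
    v = W_O @ rstar(r)                      # network forecast of the observable
    return hybrid_open_loop_step(r, v, imperfect_model, W_R, W_I), v
```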
III. RESULTS FOR A TWO TIME SCALES SYSTEM

We now consider the model introduced in [36] as a caricature of the interaction between the (fast) atmosphere and the (slow) ocean. It consists of two Lorenz systems coupled as follows:

  Ẋ = a(Y − X)
  Ẏ = R_s X − ZX − Y − ε_s xy    (3)
  Ż = XY − bZ

  ẋ = c a(y − x)
  ẏ = c(R_f x − zx − y) + ε_f Y x    (4)
  ż = c(xy − bz),

where (3) and (4) describe the evolution of the slow and fast variables, respectively. We fix the parameters as in [36]: a = 10, b = 8/3, R_s = 28, R_f = 45, ε_s = 10⁻² and ε_f = 10, while for the time scale separation parameter, c, we use c = 10 (as in [36]) and c = 3. The former corresponds to a scale separation such that the adiabatic approximation already provides good results (see below). Moreover, for c = 10, the error growth on the slow variables is characterized by two exponential regimes [36]: the former with rate given by the Lyapunov exponent of the full system, λ_f, which is dominated by the fast dynamics, and the latter by the much smaller rate λ_s of the slow nonlinear instabilities.
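For concreteness, here is a self-contained sketch of Eqs. (3)-(4) integrated with the 4th-order Runge-Kutta scheme mentioned in Appendix A5 (parameter values as in the text; the code layout is ours, not the authors'):

```python
import numpy as np

a, b, R_s, R_f, eps_s, eps_f, c = 10.0, 8.0 / 3.0, 28.0, 45.0, 1e-2, 10.0, 10.0

def rhs(s):
    """Right-hand side of the coupled slow (X,Y,Z) and fast (x,y,z) Lorenz systems."""
    X, Y, Z, x, y, z = s
    return np.array([
        a * (Y - X),
        R_s * X - Z * X - Y - eps_s * x * y,   # slow Y is forced by the fast product xy
        X * Y - b * Z,
        c * a * (y - x),
        c * (R_f * x - z * x - y) + eps_f * Y * x,
        c * (x * y - b * z),
    ])

def rk4_step(s, dt):
    """One 4th-order Runge-Kutta step of size dt."""
    k1 = rhs(s); k2 = rhs(s + 0.5 * dt * k1)
    k3 = rhs(s + 0.5 * dt * k2); k4 = rhs(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```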
As input for the network we use the slow variables, u(t) = (X(t), Y(t), Z(t)). In open loop, we let the reservoir synchronize with the input; subsequently, we perform the training and optimize W_O as explained earlier. Then, to test the prediction performance, we consider 10 initial conditions, for each of which we feed the slow variables to the network in open loop and record, from t = −T_s to t = 0, the one-step (log)error E(t) = log₁₀ ||v(t) − u(t)||, v(t) being the one-step network forecast (output). The average (log)error ⟨E(t)⟩ decreases linearly (a visual proof of the echo state property) and then reaches a plateau (see the gray regions in Fig. 2a-b), the synchronization error E_S, which can be interpreted as the average smallest (log)error on the initial condition and on the one-step prediction.

At the end of the open loop, after synchronization, we switch to the prediction (closed loop) configuration and compute the (log)error growth between the evolution predicted by the network and the reference one. Moreover, we take the output variables at the end of the open loop and use them to initialize other models to be compared with the network, data-driven model. First, we consider the perfect model with an error on the initial condition, i.e. Eqs. (3)-(4) with the slow variables set equal to the values obtained by the network at t = 0, i.e. at the end of the open loop. By construction, the network does not forecast the fast variables, which are thus initialized either using their exact values from the reference trajectory (twin model), which is quite 'unfair', or using random values (random twin) drawn from the stationary measure on the fast attractor with fixed slow variables. Then we consider increasingly refined effective models for the slow degrees of freedom only: a "truncated" model, Ẋ = F_T(X), obtained from Eqs. (3) by setting ε_s = 0; a model in which we replace the fast variables in Eqs. (3) with their global average; and the adiabatic approximation, in which they are replaced by their average with fixed slow variables, which amounts to replacing ε_s xy in the equation for Ẏ with (see Appendix B for details on the derivation):

  ε_s ⟨xy⟩_X = (1.075 + 0.260 Y/c) Θ(1.075 + 0.260 Y/c),    (5)

Θ being the Heaviside step function.

In Fig. 2a and b we show the results of these comparisons for c = 10 and 3, respectively, with sampling time Δt = 0.1 (Δt = 0.01 in the insets). When c = 10, the network forecast is comparable to that of the twin models and of the adiabatic model (5), clearly outperforming the truncated model, both for Δt = 0.1 and for Δt = 0.01 (see Fig. 2a inset). For c = 3, the poor scale separation spoils the effectiveness of model (5), see Fig. 2b. Moreover, the discarded variables are not fast enough to average themselves out, making the learning task harder. Nevertheless, the network remains predictive.
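The (log)error diagnostics of Fig. 2a-b can be sketched as follows, reusing the step functions and rstar of Sec. II (the index bookkeeping is schematic):

```python
import numpy as np

def error_curve(u_traj, n_sync, n_pred, r0, W_R, W_I, W_O):
    """(log10) one-step error in open loop, then forecast error in closed loop."""
    r, E = r0, []
    for t in range(n_sync):                          # open loop: synchronize
        r = open_loop_step(r, u_traj[t], W_R, W_I)
        E.append(np.log10(np.linalg.norm(W_O @ rstar(r) - u_traj[t + 1])))
    for t in range(n_sync, n_sync + n_pred):         # closed loop: predict
        r, v = closed_loop_step(r, W_R, W_I, W_O, rstar)
        E.append(np.log10(np.linalg.norm(v - u_traj[t])))
    return np.array(E)                               # average this over initial conditions
```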
A. Which effective model has the network built?

We now focus on the case c = 10 and Δt = 0.01, for which we can gain some insight into how the network works by comparing it to model (5). The sampling time is indeed small enough for time differences to approximate derivatives. In Fig. 2c we demonstrate that the network generates an effective model akin to the adiabatic one (5), by showing a surrogate of the residual time derivative of Y, i.e. the time derivative with the truncated-model contribution removed, as a function of Y:

  ΔẎ = [Y(t + Δt) − Y(t)]/Δt − [Y_T(t + Δt) − Y(t)]/Δt,    (6)

which is a proxy of how the network has modeled the term −ε_s xy in Eq. (3). The underlying idea is as follows. We let the network evolve in closed loop; at time t it takes as input the forecasted slow variables v(t) = (X̂(t), Ŷ(t), Ẑ(t)) and outputs the next-step forecast v(t + Δt) = (X̂(t + Δt), Ŷ(t + Δt), Ẑ(t + Δt)). We then use v(t) as input to the truncated model and evolve it for a time step Δt to obtain (X_T(t + Δt), Y_T(t + Δt), Z_T(t + Δt)). Equation (6) is then used to infer how the network models the coupling with the fast variables. Evolving v(t) by one time step using Eq. (5) and then again employing (6), we obviously obtain the line −(1.075 + 0.026 Y) (dashed in Fig. 2c). The network residual derivative (black dots in Fig. 2c) forms a narrow stripe around that line, meaning that the network, for wide scale separation, performs a conditional average, which for c = 10 is equivalent to the adiabatic approximation (5). For comparison, we also show the residual derivative (6) for the full model (3)-(4) (gray dots), which displays large/fast oscillations that are indeed best fitted by (5). For c = 3, while the adiabatic approximation is too rough, remarkably the network still performs well, even though it is more difficult to identify the model it has built, which will depend on the whole set of slow variables (see discussion in App. B).
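A sketch of the residual-derivative diagnostic, Eq. (6): the network's one-step increment of Y is compared with that of the truncated model started from the same network state (the parameters a, R_s, b are those defined above; the helper names are ours):

```python
import numpy as np

def truncated_rhs(s):
    """Truncated slow model: Eqs. (3) with eps_s = 0."""
    X, Y, Z = s
    return np.array([a * (Y - X), R_s * X - Z * X - Y, X * Y - b * Z])

def rk4_truncated(s, dt):
    k1 = truncated_rhs(s); k2 = truncated_rhs(s + 0.5 * dt * k1)
    k3 = truncated_rhs(s + 0.5 * dt * k2); k4 = truncated_rhs(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def residual_Ydot(v_now, v_next, dt):
    """Eq. (6): v_now, v_next are the network states (X,Y,Z) at t and t + dt."""
    s_T = rk4_truncated(v_now, dt)          # truncated-model step from the same point
    return (v_next[1] - v_now[1]) / dt - (s_T[1] - v_now[1]) / dt
```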
B. Predictability time and hybrid scheme

So far we focused on the predictability of a quite large network (D_R = 500, as compared to the low dimensionality of Eqs. (3)-(4)). How does the network performance depend on the reservoir size D_R?

In Fig. 3 we show the D_R-dependence of the average (over reservoir realizations and initial conditions) predictability time, T_p, defined as the first time at which the error between the network-predicted and the reference trajectory reaches a threshold value Δ*, fixed as a fraction of the typical norm ⟨||X||⟩ of the slow variables. For D_R ≳ 300 the two schemes perform similarly, while, for smaller reservoirs, the predictability time of the hybrid scheme is longer. Moreover, the hybrid scheme is less prone to failures even for small D_R, hence fluctuations are smaller (see Fig. 4). Note that the chosen hybrid design ensures that the improvement is only due to the reservoir's capability of building a better effective model, reducing the average synchronization (log)error ⟨E_S⟩ (see the insets of Figs. 3 and 4, and the discussion in App. C) and thus the error on the initial condition of the slow variables. Indeed, in the inset of Fig. 3 we also show the slope predicted on the basis of the slow perturbation growth rate, λ_s [36].

Summarizing, the difference between the hybrid and reservoir-only approaches disappears at increasing D_R, as the same plateau values of both synchronization error and predictability time are reached. In other terms, if the reservoir is large enough, adding the extra information from the imperfect model does not improve the model produced by the network. Or, cast as a positive message, using a physically informed model allows for reducing the reservoir size needed to achieve a reasonable predictability, and thus an effective model of the dynamics.

We remark that in Fig. 3 the predictability time T_p was made nondimensional either using the growth rate of the slow dynamics, λ_s (left scale of Fig. 3), or using the Lyapunov exponent of the full system, λ_f (right scale), which is dominated by the fast dynamics. For large networks the predictability time is as large as 5 (finite size) Lyapunov times of the slow dynamics, which corresponds to about 70 Lyapunov times with respect to the full dynamics. This remarkably long predictability with respect to the fastest time scale is typical of multiscale systems, where the maximal Lyapunov exponent does not say much about the predictability of the slow degrees of freedom [20-22].

Figures 2a-b and 3 (inset) (see also the inset of Fig. 4) show that it is hard to bring the synchronization error below about 10⁻². Even when this happens, it does not improve the predictability, as the error quickly (even faster than the Lyapunov time) rises back to that value. Indeed, such an error threshold corresponds to the crossover scale between the fast- and slow-controlled regimes of the perturbation growth (see Fig. 2 in Ref. [36]). In other terms, pushing the error below this value requires the reservoir to (partially) reconstruct also the dynamics of the fast component.

FIG. 3. (Color online) Average predictability time T_p, normalized with the slow finite-size Lyapunov exponent λ_s (left scale) and with the (fast) maximal Lyapunov exponent λ_f (right scale), versus reservoir size D_R, for the reservoir-only (purple circles) and hybrid (green squares) schemes; system parameters c = 10 and Δt = 0.1 (hyperparameters for the hybrid implementation are the same as for the reservoir-only approach, discussed in App. A 6). Error bars denote the statistical standard deviation over 20 independent network realizations, each sampling 10 initial conditions. Inset: T_p λ_s vs the average synchronization (log)error ⟨E_S⟩; the black line has the slope predicted on the basis of the slow perturbation growth rate λ_s (cf. Eq. (7)).

FIG. 4. (Color online) Nondimensional predictability time T_p λ_s of 20 network realizations (each averaged over 10 initial conditions), in the reservoir-only (purple circles) and hybrid (green squares) schemes, as a function of the reservoir size D_R. The solid curves display the average over all realizations, already presented in Fig. 3. Notice that, in the reservoir-only scheme, a number of outliers are present at small D_R. Inset: ⟨E_S⟩, i.e. the average (over network realizations and 10 initial conditions for each realization) (log)error at the end of the open loop, versus the reservoir size D_R for the reservoir-only (purple curve) and the hybrid (green curve) scheme, respectively. Symbols display the synchronization error of each network realization. Notice that there are no realizations with a strong departure from the average, as observed in the main panel for the predictability time: this shows that the predictability performance is not always linked to the synchronization error (see text for further discussion). Data refer to the case c = 10 and Δt = 0.1. For hyperparameters see App. A 6.
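The predictability-time measurement can be sketched as a first-crossing computation (the threshold Δ* is supplied by the caller; time is counted from the closing of the loop):

```python
import numpy as np

def predictability_time(forecast, reference, dt, threshold):
    """First time at which the forecast error exceeds the threshold Delta*."""
    err = np.linalg.norm(forecast - reference, axis=1)   # error at each time step
    crossed = np.nonzero(err > threshold)[0]
    return crossed[0] * dt if crossed.size else len(err) * dt
```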
C. The role of the synchronization error
In the previous section, we have used as a performance metric the average predictability time, T_p, namely the average time it takes an error to reach a given tolerance Δ* (defined above as a fraction of ⟨||X||⟩) after the synchronization period. Clearly, the system being chaotic, and interpreting the synchronization error as an error on the initial conditions, one can naively think that reducing such error always enhances the predictability and, consequently, that the smallness of such error can be used as another performance metric for the network. In the following we show that this is only partially true.

Clearly, to achieve long-term predictability, the smallness of the synchronization error is a necessary condition. Indeed, the (log)error at the end of the open loop cycle, E_S, puts an upper limit on the predictability time:

  T_p ≲ λ_s⁻¹ [log(Δ*) − E_S],    (7)

as confirmed by the solid line in the inset of Fig. 3. However, it is not otherwise very informative about the overall performance. The reason is that the value of E_S, which can also be seen as the average error on one-step predictions, does not provide information on the structural stability of the dynamics. Indeed, for a variety of hyperparameter values, we have observed low E_S resulting in failed predictions: in other terms, the model built by the network is not effective in forecasting and in reproducing the climate. In these cases the network was unable to generate a good effective model; as shown in Fig. 4, this typically happens for relatively small D_R in the reservoir-only implementation.

In a less extreme scenario, the error E_S can be deceptively lower than the scale at which the dynamics has been properly learned. This latter case is relevant to the multiscale setting since, as outlined at the end of the previous section, reconstructing the fast variables is necessary to push the initial error below a certain threshold. In some cases, we did observe the synchronization error falling below the typical value 10⁻² but immediately jumping back to it, implying an unstable fast-scale reconstruction (for instance, see c = 3, Δt = 0.01 in Fig. 2b).

As a consequence of the two observations above, E_S is an unreliable metric for hyperparameter landscape exploration as well. We also remark that, even if the fast scales were modeled with a proper architecture and training time, and E_S could be pushed below the crossover with an actual boost in performance, such improvements would not dramatically increase the predictability time of the slow variables, since they are suppressed by the global (and greater, as dominated by the fast degrees of freedom) Lyapunov exponent. This situation, as discussed above, is typical of multiscale systems.
We have shown that reservoir computing is a promising machine learning tool for building effective, data-driven models of multiscale chaotic systems, able to provide, in some cases, forecasts as good as those that can be obtained with a perfect model affected by an error on the initial conditions. Moreover, the simplicity of the system allowed us to gain insight into the inner workings of the reservoir computing approach which, at least for large scale separation, builds an effective model akin to that obtained by asymptotic multiscale techniques. Finally, the reservoir computing approach can be reinforced by blending it with an imperfect predictor, making it perform well also with smaller reservoirs. While we have obtained these results with a relatively simple two-time-scale model, given the success of previous applications to spatially extended systems [11], we think the method should work also for more complex, high-dimensional multiscale systems. For the latter, it may be necessary to consider multi-reservoir architectures [38] operating in parallel [11]. Moreover, reservoir computing can be used to directly predict unobserved degrees of freedom [39]. Using this scheme and the ideas developed in this work, it would be interesting to explore the possibility of building novel subgrid schemes for turbulent flows [40, 41] (see also [16] for a very recent attempt in this direction based on reservoir computing with a hybrid implementation); preliminary tests could be performed on shell models of turbulence, for which physics-only-informed approaches have been only partially successful [42].
ACKNOWLEDGMENTS
We thank L. Biferale for his early interest in the project and for useful discussions. AV and FB acknowledge MIUR-PRIN2017 "Coarse-grained description for non-equilibrium systems and transport phenomena" (CO-NEST). We acknowledge the CINECA award for the availability of high performance computing resources and support from the GPU-AI expert group at CINECA.
Appendix A: Details on the implementation

1. Intra-reservoir (R-to-R) connectivity matrix W_R

The intra-reservoir connectivity matrix, W_R, is generated by drawing each entry from the same distribution. Each element is the product of two random variables, W_ij = a·b: a being a real random number uniformly distributed in [−1 : 1], and b taking values 1 or 0 with probability P_d = d/D_R and 1 − P_d, respectively. Consequently, each row has, on average, d nonzero elements. Since D_R ≫ d, the number of non-null entries per row is essentially distributed according to a Poisson distribution. As a last step, the maximal eigenvalue (in absolute value), ρ_max(W), of the resulting matrix W is computed, and the matrix is rescaled element-wise so that its new spectral radius matches the target value ρ, i.e.:

  W_R = W ρ/ρ_max(W).
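A sketch of this construction in numpy (the seed is illustrative; the uniform range [−1 : 1] is immaterial after the final rescaling):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_W_R(D_R, d, rho):
    """Sparse random matrix with ~d nonzero entries per row and spectral radius rho."""
    b = rng.random((D_R, D_R)) < d / D_R             # Bernoulli mask, P_d = d / D_R
    W = rng.uniform(-1.0, 1.0, (D_R, D_R)) * b
    rho_max = np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius of W
    return W * rho / rho_max
```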
2. Input-to-reservoir (I-to-R) connectivity matrix W_I

The input-to-reservoir matrix W_I is generated in such a way that each reservoir node is connected to a single input. For this purpose, for each row j, a single element n_j, uniformly chosen between 1 and the input dimension D_I, is different from zero. This means that reservoir node j is only connected to the n_j-th input node. The connection strength is randomly chosen in [−σ, σ] with uniform distribution.
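The same construction in numpy (reusing the rng of the previous sketch):

```python
def build_W_I(D_R, D_I, sigma):
    """Each reservoir node is coupled to exactly one randomly chosen input component."""
    W_I = np.zeros((D_R, D_I))
    n_j = rng.integers(0, D_I, size=D_R)             # input node for each row j
    W_I[np.arange(D_R), n_j] = rng.uniform(-sigma, sigma, size=D_R)
    return W_I
```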
3. Optimization of the output matrix W_O

The output matrix W_O is obtained via optimization. As explained in Sec. II B, W_O should be chosen so that the output v(t) = W_O r*(t) is, on average, as similar as possible to the input signal u(t). Incidentally, we remark that the use of r* instead of simply r is relevant to achieve accurate forecasting and is heuristically motivated by the need to add some nonlinearity to the network [7]. The particular choice we adopted, r*_i = r_i or r_i² for i odd or even, respectively, was suggested in [10, 12] in view of the symmetries of the Lorenz model.

As for the optimization of W_O, we require that it minimize the cost function

  L = (1/T_t) Σ_{0≤t≤T_t} ||W_O r*(t) − u(t)||² + α tr[W_O W_Oᵀ],    (A1)

where ᵀ denotes the transpose and T_t is the length of the training input, whose choice is discussed below. We point out that the sum appearing in Eq. (A1) is a delicate quantity: we have observed that moderate errors compromise the final performance. For this reason, Kahan summation has been employed in order to boost the numerical accuracy. The minimizer of Eq. (A1), W_O^opt, satisfying

  dL/dW_Oᵀ |_{W_O^opt} = 0,

is

  W_O^opt = ⟨u ⊗ r*ᵀ⟩ (α I_{D_R} + ⟨r* ⊗ r*ᵀ⟩)⁻¹,

where ⟨·⟩ denotes the empirical average (1/T_t)Σ_t, ⊗ denotes the outer product, and I_{D_R} the D_R × D_R identity matrix. The term proportional to α in Eq. (A1) is the Tikhonov term, an L2 regularization on W_O. The Tikhonov regularization improves the numerical stability of the inversion, which could be critical if the ratio between the largest and the smallest eigenvalues, ρ_max and ρ_min, is too large, so that the latter would behave as a (numerical) null eigenvalue [43], as is the case for the dynamics we are studying. Here we have used a small fixed value of α, which empirically was found to keep log(ρ_max/ρ_min) moderate.
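Since the text stresses that the empirical averages entering Eq. (A1) are numerically delicate, a standard Kahan-compensated accumulation, sketched below, is one way to build them (our sketch; the ridge-regression solve itself was shown in Sec. II B):

```python
import numpy as np

def kahan_accumulate(matrices):
    """Kahan-compensated sum, e.g. of the r*(t) (x) r*(t)^T outer products."""
    total = np.zeros_like(matrices[0])
    comp = np.zeros_like(matrices[0])   # running compensation for lost low-order bits
    for m in matrices:
        y = m - comp
        t = total + y
        comp = (t - total) - y          # recover what was rounded away
        total = t
    return total
```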
4. Synchronization time and length of the training input trajectory
All results presented in this article have been obtained using training trajectories of length T_t = 500. We remark that values down to T_t = 100 have been tested and found to be in the range that guarantees long-term reconstruction of the attractor with proper hyperparameters. As for the synchronization length, we have chosen T_s = 4. This value is about four times larger than the time actually needed to achieve the best possible synchronization: indeed, as shown in the gray shaded areas of Figs. 2a-b, the error E saturates to E_S in about one time unit.
5. Numerical details
The whole code has been implemented in python3, with linear algebra performed via numpy. The numerical integration of the coupled Lorenz model was performed via a 4th-order Runge-Kutta scheme.
6. Fixing the hyperparameters
The architecture of a generic network is described by a number of parameters, often dubbed hyperparameters, e.g. the number of layers, the activation functions, etc. While a proper design is always crucial, in the reservoir computing paradigm this issue is especially critical due to the absence of global optimization via backpropagation. The reservoir-to-reservoir and input-to-reservoir connectivity matrices, as discussed above, are quenched stochastic variables, whose distribution depends on four hyperparameters:

  Net ∼ P(σ, ρ, d, D_R),

namely, the strength σ of the I-to-R connection matrix, the spectral radius ρ and the degree d of the R-to-R connection matrix, and the reservoir size D_R. Once the distribution is chosen, there are two separate issues.

The first is that, for a given choice of (σ, ρ, d), the network should be self-averaging if its size D_R is large enough. Indeed, we see from Fig. 4 that the variability between realizations decreases with D_R, as expected.

The second issue is the choice of the triple (σ, ρ, d). In general, the existence of a large and nearly flat (for any reasonable performance metric) region of suitable hyperparameters implies the robustness of the method. For the problem we have presented, such a region exists, even though, in the case (Δt = 0.1, c = 10), moderate fine tuning of the hyperparameters did improve the final result, allowing to even (moderately) outperform the fully informed twin model, as shown in Fig. 2a.

It is important to remark that the characteristics of the regions of suitable hyperparameters depend on the metric used. Here, we have focused on medium-term predictability, i.e. we evaluate the error between the forecasted and reference slow variables at a time (after synchronization) that is much larger than one step Δt but before error saturation (corresponding to completely uncorrelated trajectories). Requiring a too-short predictability time, as discussed in Ref. [12], is typically not enough for reproducing the long-time statistical properties of the target system (the so-called climate), as the learned attractor may be unstable even if the dynamical equations are locally well approximated. If both short-term predictability and climate reproduction are required, the suitable hyperparameter region typically shrinks. The metric we used typically led to both predictability and climate reproduction, at least for large enough reservoir sizes.

In order to fix the parameters, two techniques have been employed. The first is the standard search on a grid (for a representative example see Fig. 5): a lattice is generated in the space of parameters, then each node is evaluated according to some cost-function metric. If such function is regular enough, it should be possible to detect a suitable region of parameters. While this default method is sound, it may require training many independent networks, even in poorly performing regions. Each network cannot be too small, for two reasons: the first is that small networks suffer from higher inter-realization fluctuations; the second is that we cannot exclude that the optimal (σ, ρ, d) have a loose dependence on the reservoir size D_R. As further discussed below, we found a mild dependence on the network degree d, provided it is not too large; thus in Fig. 5 we focused on the dependence on ρ and σ.

The second technique is the gradient-free optimization method known as particle swarm optimization (PSO) [44].
PSO consists in generating n tuples of candidate parameters, the particles, say p_i = (ρ_i, σ_i, d_i), i = 1, ..., n (we used n = 10). At each step, each candidate is tested with a given metric f. Here, we used the average (over 50-100 initial conditions) error on the slow variables at time t = 2 after closing the loop. At each step k of the algorithm, each candidate is accelerated towards a stochastic mixture of its own best-performing past position,

  p*_i(k) = argmin_{p_i(s)} { f(p_i(s)) | s < k },

and the overall best past performer,

  p*(k) = argmin_{p*_i(k)} { f(p*_i(k)) | i = 1, ..., n }.

Particles are evolved with the following second-order, time-discrete dynamics:

  p_i(k + 1) = p_i(k) + v_i(k)
  v_i(k + 1) = w v_i(k) + φ¹_i(k) (p*_i(k) − p_i(k)) + φ²_i(k) (p*(k) − p_i(k)),
where φ¹_i(k), φ²_i(k) ∈ [0, 1] are random variables and w ∈ [0, 1] is an inertia weight; in practice, we used the python package pyswarms. After a suitable number of iterations, p* should be a valid candidate. The advantage of PSO is that, after a transient, most candidate evaluations (each of which requires initializing, training and testing at least one network) should happen in the good regions. It is worth pointing out that, unless self-averaging is achieved thanks to large enough reservoir sizes, inter-network variability adds noise to the limited attractor sampling when evaluating f; therefore, fluctuations may appear and trap the algorithm in suboptimal regions for some time. Moreover, the algorithm itself depends on some hyperparameters that may have to be optimized by hand.

In our study, PSO has been mainly useful in fixing the parameters in the (Δt = 0.1, c = 10) case and in observing that d is the parameter which affects the performance the least. Some gridding (especially in ρ and σ) around the optimal solution is useful, in general, as a cross check and to highlight the robustness (or lack thereof) of the solution. A minimal sketch of the PSO update rule is given after Table I.

FIG. 5. (Color online) Performance grid for c = 3, with D_R = 350 and d = 5. Colors code the error between the forecasted and reference trajectories at time t = 5 after closing the loop, i.e. the metric used here, f = ||X_forecast(t = 5) − X_true(t = 5)|| (averaged over 100 points of the attractor), for a single network realization for each value of the parameters (ρ, σ). To highlight the suitable parameter region, a cutoff has been put at f = 1.

In Table I we summarize the hyperparameters used in our study.

TABLE I. Hyperparameters used in our study, one set for each combination of the sampling time Δt ∈ {0.1, 0.01} and of the time scale separation c ∈ {10, 3} of the multiscale Lorenz model, Eqs. (3)-(4). In all four cases the degree is d = 5; the input-to-reservoir coupling strength σ (of order one) and the spectral radius ρ < 1 of the reservoir-to-reservoir connectivity matrix were tuned for each case. For the hybrid implementation, discussed in Sec. II C and App. C, we used the same hyperparameters.
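As a concrete illustration of the PSO update rule written above, here is a bare-bones numpy sketch (not the pyswarms implementation actually used; names and default values are ours):

```python
import numpy as np

def pso(f, bounds, n=10, iters=50, w=0.7, seed=1):
    """Minimize f over a box; bounds is a list of (low, high) per parameter."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    p = rng.uniform(lo, hi, (n, len(lo)))            # particle positions p_i
    v = np.zeros_like(p)                             # particle velocities v_i
    best_p = p.copy()                                # per-particle best p*_i
    best_f = np.array([f(q) for q in p])
    for _ in range(iters):
        g = best_p[np.argmin(best_f)]                # overall best past performer p*
        phi1, phi2 = rng.random((2, n, 1))
        v = w * v + phi1 * (best_p - p) + phi2 * (g - p)
        p = np.clip(p + v, lo, hi)
        fp = np.array([f(q) for q in p])
        improved = fp < best_f
        best_p[improved], best_f[improved] = p[improved], fp[improved]
    return best_p[np.argmin(best_f)]
```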
Appendix B: Multiscale model for the two-time-scale coupled Lorenz systems

In this Appendix we show how Eq. (5) was derived. Following the notation of Eq. (1), we denote by X and x the slow (X, Y, Z) and fast (x, y, z) variables, respectively. Our aim is to provide a model of the fast variables in Eq. (3) in terms of the slow ones. When the former are much faster than the latter, we can assume that x equilibrates, i.e. distributes according to a stationary measure, for each value of the slow variables X (adiabatic principle), and we denote the expected values with respect to such measure by ⟨·⟩_X. The adiabatic approach can work only for a wide scale separation (c ≫ 1). Moreover, since only Y enters the dynamics of the fast variables, only the value of Y will matter in building the adiabatic approximation, i.e. ⟨·⟩_X ≡ ⟨·⟩_Y. In general, for moderate scale separation, this is not the case, and a "closure" of the fast variables depending on the whole set of slow variables would be required, a much harder task.

In order to model ⟨xy⟩_X, we first impose stationarity, i.e. ⟨ẋ⟩_X = 0, which applied to the third line of (4) yields

  ⟨xy⟩_X = b ⟨z⟩_X.    (B1)

Inserting (B1) into the equation for Y in (3), we obtain Ẏ = R_s X − ZX − Y − ε_s b ⟨z⟩_X. Now we need to determine ⟨z⟩_X. For this purpose, we notice that the equation for ẏ in (4) can be rewritten as

  ẏ = c [R x − zx − y]   with   R = R_f + (ε_f/c) Y.    (B2)

Exploiting the adiabatic principle, we regard Y (the slow variable) as fixed, so that Eq. (4), with the second equation substituted by Eq. (B2), becomes the standard Lorenz model, apart from an inessential change of time scale. Thus, we can evolve the standard Lorenz model and compute ⟨z⟩_R (where the subscript R is just a reminder of the R dependence, which will be reflected in a Y dependence via (B2)), which is shown in Fig. 6. As one can see, ⟨z⟩_R depends on R approximately as follows:

  ⟨z⟩_R ≈ 0                  for R < 1,
  ⟨z⟩_R ≈ R − 1              for 1 ≤ R ≲ 24.74,    (B3)
  ⟨z⟩_R ≈ −3.614 + 0.976 R   for R ≳ 24.74.

We recall that R_c = 24.74 is the critical value at which the fixed point

  z* = (R − 1) Θ(R − 1),    (B4)

where Θ is the Heaviside step function, loses its stability. Remarkably, ⟨z⟩_R remains close to z* also for R > R_c. The second expression in Eq. (B3), or equivalently Eq. (B4), together with (B1) and (B2), yields

  ⟨xy⟩_Y = b (R_f − 1 + (ε_f/c) Y) Θ(R_f − 1 + (ε_f/c) Y).    (B5)

Using the numerical values of the constants (b = 8/3, R_f = 45 and ε_f = 10), the above expression provides the estimate ⟨xy⟩_Y ≈ (117.33 + 26.67 Y/c) Θ(117.33 + 26.67 Y/c), while using the third expression of Eq. (B3) yields the model (5) that we used for the comparison with the network, i.e.

  ⟨xy⟩_Y = (107.48 + 26.03 Y/c) Θ(107.48 + 26.03 Y/c).    (B6)

For c = 10, the latter expression is very close to the (numerically obtained) conditional average ⟨xy|Y⟩ (Fig. 7a), confirming that the scale separation is wide enough for the adiabatic approximation to work almost perfectly. We notice that for c = 10 the typical range of variation of Y is such that R mostly lies in the region of the third branch of (B3), explaining the validity of the approximation.

Conversely, for the case c = 3, as shown in Fig. 7b, the approximation is much cruder, and important deviations are present, especially for large positive values of Y. Indeed, in general, ⟨xy⟩_X ≠ ⟨xy|X⟩, i.e. the multiscale average obtained via the adiabatic principle and the conditional average are not equivalent since, in general, ⟨ẋ|X⟩ ≠ 0. In this case the values of all slow variables will matter in building a proper effective model, a hard task even for the simple Lorenz model considered here. However, as shown in Fig. 2b, even in this case the reservoir computing approach performs quite well, even though it is not straightforward to decipher the model it was able to devise.

FIG. 6. (Color online) ⟨z⟩_R vs R, numerically computed in the standard Lorenz system (symbols). Notice that the curve is well approximated, in the range of interest, by the piecewise linear function (B3), in particular by the branch −3.614 + 0.976 R.

FIG. 7. (Color online) Numerically computed conditional averages ⟨xy|Y⟩ (symbols) for (a) c = 10 and (b) c = 3, and the corresponding multiscale (adiabatic) averages ⟨xy⟩_Y given by Eq. (B6) (solid curve). For c = 10 the two curves overlap.
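The computation behind Fig. 6 can be sketched as follows: a long-time average of z in the standard Lorenz system at fixed R (a = 10 and b = 8/3 as in the text; integration parameters are illustrative):

```python
import numpy as np

def z_average(R, dt=1e-3, n_steps=500_000, n_skip=50_000):
    """Long-time average <z>_R of the standard Lorenz system at parameter R."""
    def f(s):
        x, y, z = s
        return np.array([10.0 * (y - x), R * x - z * x - y, x * y - (8.0 / 3.0) * z])
    s, acc = np.array([1.0, 1.0, 1.0]), 0.0
    for i in range(n_steps):
        k1 = f(s); k2 = f(s + 0.5 * dt * k1)     # 4th-order Runge-Kutta step
        k3 = f(s + 0.5 * dt * k2); k4 = f(s + dt * k3)
        s = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        if i >= n_skip:                          # discard the transient
            acc += s[2]
    return acc / (n_steps - n_skip)
```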
Appendix C: Discussion on various hybrid scheme implementations

The hybrid scheme discussed in Sec. II C allows for highlighting the properties of the reservoir, but it is just one possible choice. Here, we briefly discuss three general schemes.

Let us assume that our dynamical system, with state variables s = (s₁, ..., s_n), is described by the equation s(t + 1) = f(s(t)), which is unknown. Here, for simplicity of presentation and without loss of generality, we use discrete-time dynamics and assume that we want to forecast the whole set of state variables. Provided we have an imperfect model, s(t + 1) ≈ f_m(s(t)), for its evolution, we have basically three options for building a hybrid scheme.

A first possibility is to approximate via machine learning only the part of the signal that is not captured by the model f_m(s(t)). In other terms, one writes a forecast as

  ŝ(t + 1) = f_m(s(t)) + δ_n(s(t)),    (C1)

where the residual δ_n is given by the network and can be learned from a set of input-output pairs {s(t), s(t + 1) − f_m(s(t))} according to some supervised learning algorithm. In our framework, the hybrid network would be trained with the usual input but with target output given by the difference between the true value of s(t + 1) and the model forecast f_m(s(t)).

A second possibility is to add the available model prediction f_m(s(t)) to the input s(t), obtaining an augmented input (s(t), f_m(s(t))) for the network. In this case, the forecast reads

  ŝ(t + 1) = f_n(s(t), f_m(s(t))).    (C2)

Clearly, if the model-based prediction is very accurate, the network will just try to approximate the identity function. The network is trained with a set of input-output pairs {(s(t), f_m(s(t))), s(t + 1)}. This is the approach we have implemented in this article, in order to evaluate the performance of the reservoir.

A third possibility, which should be at least as good as the best of the previous two (unless very poor design choices are made), mixes the above two by computing a residual with a model-informed network; this is the approach described, e.g., in [15]. In this case the forecast is obtained as

  ŝ(t + 1) = A f_m(s(t)) + B δ_n(s(t), f_m(s(t))),    (C3)

where the matrices A and B should be optimized along with δ_n [15]. This last option is a special case of the second scheme, describing a residual multilayered neural network with a linear output layer.

For the sake of completeness, in Fig. 8 we show how this last architecture compares, in terms of predictability, with the one we used in Figs. 3 and 4. It consists in taking the optimized linear combination of the predictions from the hybrid net and the imperfect model: namely, one augments the r* array as r̃* = (r*, f_m) and then optimizes W_O to achieve v(t + 1) = ŝ(t + 1) ≈ W_O r̃*(t). As one can see, the main effect is only a slight shift of the curves.

FIG. 8. (Color online) Average (over 10 initial conditions) predictability times for the reservoir-only and two hybrid implementations (Δt = 0.1, c = 10). The green line corresponds to the hybrid scheme (C2), blue lines to the hybrid scheme (C3), and purple lines to the reservoir-only baseline.
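To make the difference between the schemes explicit, here is a sketch of how the training pairs are assembled in schemes (C1) and (C2) (f_m denotes the imperfect model; names are ours):

```python
import numpy as np

def training_pairs_C1(s_traj, f_m):
    """Scheme (C1): the network learns the residual s(t+1) - f_m(s(t))."""
    return [(s, s_next - f_m(s)) for s, s_next in zip(s_traj[:-1], s_traj[1:])]

def training_pairs_C2(s_traj, f_m):
    """Scheme (C2), used in this work: the model prediction is appended to the input."""
    return [(np.concatenate([s, f_m(s)]), s_next)
            for s, s_next in zip(s_traj[:-1], s_traj[1:])]
```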
[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, Robot. Auton. Syst., 469 (2009).
[2] M. W. Libbrecht and W. S. Noble, Nature Rev. Genet., 321 (2015).
[3] J. He, S. L. Baxter, J. Xu, J. Xu, X. Zhou, and K. Zhang, Nature Med., 30 (2019).
[4] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Rev. Mod. Phys., 045002 (2019).
[5] D. Verstraeten, B. Schrauwen, M. D'Haene, and D. Stroobandt, Neural Netw., 391 (2007).
[6] B. Schrauwen, D. Verstraeten, and J. Van Campenhout, in Proc. 15th Europ. Symp. Artif. Neural Netw. (2007), pp. 471-482.
[7] H. Jaeger, German Nat. Res. Center Inform. Tech. GMD Technical Report, 13 (2001).
[8] M. Lukoševičius and H. Jaeger, Comp. Sci. Rev., 127 (2009).
[9] H. Jaeger and H. Haas, Science, 78 (2004).
[10] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, Chaos, 121102 (2017).
[11] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Phys. Rev. Lett., 024102 (2018).
[12] Z. Lu, B. R. Hunt, and E. Ott, Chaos, 061104 (2018).
[13] P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos, Proc. Royal Soc. A, 20170844 (2018).
[14] K. Nakai and Y. Saiki, Phys. Rev. E, 023111 (2018).
[15] J. Pathak, A. Wikner, R. Fussell, S. Chandra, B. R. Hunt, M. Girvan, and E. Ott, Chaos, 041101 (2018).
[16] A. Wikner, J. Pathak, B. Hunt, M. Girvan, T. Arcomano, I. Szunyogh, A. Pomerance, and E. Ott, Chaos, 053111 (2020).
[17] Z. Warhaft, Proc. Nat. Acad. Sci., 2481 (2002).
[18] J. P. Peixoto and A. H. Oort, Physics of Climate (American Institute of Physics, New York, NY, 1992).
[19] J. Pedlosky, Geophysical Fluid Dynamics (Springer, 2013).
[20] E. N. Lorenz, in ECMWF Seminar Proceedings on Predictability, Vol. 1 (ECMWF, Reading, UK, 1995).
[21] E. Aurell, G. Boffetta, A. Crisanti, G. Paladin, and A. Vulpiani, Phys. Rev. Lett., 1262 (1996).
[22] M. Cencini and A. Vulpiani, J. Phys. A, 254019 (2013).
[23] G. Boffetta, A. Celani, M. Cencini, G. Lacorata, and A. Vulpiani, J. Phys. A, 1313 (2000).
[24] J. A. Sanders, F. Verhulst, and J. Murdock, Averaging Methods in Nonlinear Dynamical Systems, Vol. 59 (Springer, 2007).
[25] G. Pavliotis and A. Stuart,