Predicting high-dimensional heterogeneous time series employing generalized local states
Sebastian Baur and Christoph Räth
Institut für Materialphysik im Weltraum, Deutsches Zentrum für Luft- und Raumfahrt, Münchner Straße 20, 82234 Weßling, Germany
and Fakultät für Physik, Ludwig-Maximilians-Universität München, Schellingstraße 4, 80779 Munich, Germany
(Dated: January 4, 2021)

We generalize the concept of local states (LS) for the prediction of high-dimensional, potentially mixed chaotic systems. The construction of generalized local states (GLS) relies on defining distances between time series on the basis of their (non-)linear correlations. We demonstrate the prediction capabilities of our approach based on the reservoir computing (RC) paradigm using the Kuramoto-Sivashinsky (KS), the Lorenz-96 (L96) and a combination of both systems. In the mixed system a separation of the time series belonging to the two different systems is made possible with GLS. More importantly, prediction remains possible with GLS, where the LS approach must naturally fail. Applications for the prediction of very heterogeneous time series with GLSs are briefly outlined.
Introduction.—
Tremendous advances in predicting the short and long term behavior of complex systems have been made in recent years by applying machine learning (ML) [1-7]. RC turned out to be a very promising ML approach, as it combines outstanding performance with conceptual advantages like very fast and comparably transparent learning as well as appealing hardware realizations of RC [8, 9]. For high-dimensional systems, RC suffers like other ML methods from the "curse of dimensionality", meaning that the number of nodes of the network representing the reservoir has to be considerably larger than the dimensionality of the input data, rendering the training unfeasible with a naive RC approach. With a parallel prediction scheme based on LS [10], however, the forecasting of high-dimensional chaotic spatiotemporal systems of arbitrarily large extent becomes possible [11, 12]. The definition of LS relies on defining spatial local neighborhoods for each time series to be predicted. Thus, knowledge of the position of the time series in space is a necessary prerequisite for defining LS. The similarity of time series can, however, be defined in a much more general way by deducing a distance measure, and thus a local neighborhood, from the correlations among the time series [13]. Such generalized similarities led, among others, to a reasonable, fully data-driven taxonomy of the stock market [14], while crucial differences between the linear and nonlinear correlation structure of the stock market, especially during crises, were reported later [15].
In this Letter, we employ this approach to define GLS for the prediction of high-dimensional systems, with which some of the shortcomings of the LS approach can be overcome.
Systems and Simulation Details.—
For modeling high-dimensional, spatiotemporal, chaotic systems, the L96 and KS systems have become widely used in the RC community [3, 16-18]. The L96 system [19, 20] is defined as
$$\frac{dx_j}{dt} = (x_{j+1} - x_{j-2})\, x_{j-1} - x_j + F,$$
where $x(t)$ is the $D$-dimensional state vector of the system and $F$ the forcing parameter. In this study, we set $D = 40$ and $F = 5$, resulting in a chaotic system for which we calculate a maximal Lyapunov exponent (MLE) of $\Lambda_{\max} = 0.45$ using Sprott's method of orbit separation (OS) [21]. We use the fourth-order Runge-Kutta method [22] for the simulation, with a time step of $\Delta t = 0.05$. The KS equation [23-25] is given by
$$\partial_t u + \partial_x^2 u + \partial_x^4 u + u\, \partial_x u = 0,$$
where the field $u(x, t)$ is defined on some domain of size $L$. In this study, we use a domain size of $L = 22$ with periodic boundary conditions $u(x + L, t) = u(x, t)$ for all $0 \le x \le L$. For the numerical treatment, the equations are discretized on a grid of $D = 40$ points, the same size as the L96 system, and numerically integrated, with a time step of $\Delta t = 0.5$, using the fourth-order time-stepping method ETDRK4 [26]. Using OS we find an MLE of $\Lambda_{\max} = 0.049$.
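As an illustration, a minimal Python sketch of the L96 integration described above; the function names, the initial condition, and the number of steps are our own choices, and the KS/ETDRK4 integration is omitted for brevity:

```python
import numpy as np

def l96_rhs(x, forcing=5.0):
    """Right-hand side of the Lorenz-96 system,
    dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F (periodic indices)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(f, x, dt):
    """One fourth-order Runge-Kutta step for an autonomous system."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# D = 40 variables, F = 5, dt = 0.05 as in the text; 10^4 steps as an example.
rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal(40)
trajectory = np.empty((10_000, 40))
for i in range(len(trajectory)):
    x = rk4_step(l96_rhs, x, 0.05)
    trajectory[i] = x
```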
Reservoir Computing.—
We use a standard setup for reservoir computing. The core of RC is a network, in the following called the reservoir $A$, created as a sparse random $D_r \times D_r$ network with average node degree $\kappa$. After network generation, all its random, uniformly distributed connection strengths are scaled to have a predetermined fixed spectral radius $\rho$. The $D_{in}$-dimensional input $x(t)$ interacts with the reservoir state $r(t)$ through an input coupler, which in our case is a sparse $D_r \times D_{in}$ matrix $W_{in}$. Following [1], $W_{in}$ is created such that one element in each row is chosen uniformly from $[-\omega, \omega]$, where $\omega$ is the input coupler scaling parameter. All other elements of $W_{in}$ are zero. The input data then connects with the reservoir state $r(t)$ of the previous time step via the activation function $\tanh(\cdot)$ to advance the reservoir state by one step in time, $r(t + \Delta t) = \tanh(A\, r(t) + W_{in}\, x(t))$. We choose a simple matrix as output coupler $W_{out}$, whose elements are determined in the training phase via ridge regression,
$$W_{out} = \underset{W_{out}}{\arg\min} \left( \| W_{out}\, \tilde{r}(t) - y_T(t) \|^2 + \beta\, \| W_{out} \|^2 \right),$$
where $y_T(t)$ is the $D_{out}$-dimensional target output. $\tilde{r}$ is a nonlinear transformation of the reservoir state $r$, here chosen to be $\tilde{r} = [r, r^2]^T = [r_1, r_2, \ldots, r_{D_r}, r_1^2, r_2^2, \ldots, r_{D_r}^2]^T$. This blow-up of the reservoir states, first introduced by Lu et al. [16], serves to break the symmetries in the reservoir equations [27]. To avoid that the arbitrary initial state of the reservoir influences the regression results, training only starts after a washout phase of 20000 time steps. Once trained, the output $y(t)$ can be calculated from the reservoir states $r(t)$ as $y(t) = W_{out}\, \tilde{r}(t)$. When using RC for prediction, it can then be run autonomously by using the prediction of the previous time step $y(t)$ as the input $x_{pred}(t)$ to calculate the next predicted time step $y(t + \Delta t)$. In this case one finds $r(t + \Delta t) = \tanh(A\, r(t) + W_{in}\, x_{pred}(t))$.
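The following sketch condenses these equations into Python with dense matrices; the density and the example values a caller would pass for $\rho$, $\omega$ and $\beta$ are placeholders of ours, not the hyperparameters of Table S2:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_reservoir(d_r, d_in, rho, omega, density=0.02):
    """Sparse random reservoir A rescaled to spectral radius rho, and an
    input coupler W_in with one nonzero element per row drawn uniformly
    from [-omega, omega]."""
    A = rng.uniform(-1.0, 1.0, (d_r, d_r)) * (rng.random((d_r, d_r)) < density)
    A *= rho / np.abs(np.linalg.eigvals(A)).max()
    W_in = np.zeros((d_r, d_in))
    W_in[np.arange(d_r), rng.integers(0, d_in, d_r)] = rng.uniform(-omega, omega, d_r)
    return A, W_in

def advance(A, W_in, r, x):
    """r(t + dt) = tanh(A r(t) + W_in x(t))."""
    return np.tanh(A @ r + W_in @ x)

def train_readout(R, Y, beta):
    """Ridge regression for W_out on the blown-up states r~ = [r, r^2];
    R holds the post-washout states (T, d_r), Y the targets (T, d_out)."""
    Rt = np.hstack([R, R**2])
    G = Rt.T @ Rt + beta * np.eye(Rt.shape[1])
    return np.linalg.solve(G, Rt.T @ Y).T        # shape (d_out, 2 d_r)

def predict(A, W_in, W_out, r, n_steps):
    """Closed-loop prediction: the output y(t) is fed back as the input."""
    ys = []
    for _ in range(n_steps):
        y = W_out @ np.concatenate([r, r**2])    # y(t) = W_out r~(t)
        ys.append(y)
        r = advance(A, W_in, r, y)               # feed y back as x_pred(t)
    return np.array(ys)
```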
Generalized Local States.—
GLS is based on the LS approach proposed by Parlitz et al. [10] and used for RC prediction by Pathak et al. [3]. While non-local implementations of RC algorithms use just one network to process all input data, LS and GLS partition the input data into multiple subsets of smaller dimension, each with their own reservoir. The reservoirs are then trained on and predict these subsets only. This provides an effective workaround for the curse of dimensionality by essentially parallelizing the prediction of one high-dimensional dataset using many lower-dimensional subsets. These subsets are in the following called neighborhoods. Each neighborhood itself consists of a number of core variables and neighbor variables and is assigned its own reservoir. Each reservoir in question then uses only the variables of the input making up its neighborhood to predict its core variables as accurately as possible. As such, the input time series for the reservoir assigned to the $i$-th neighborhood is not the full $D_{in}$-dimensional input time series $x$ anymore, but instead a slice $x_i$ unique to this neighborhood, of dimension $D_{in}^i \le D_{in}$. Similarly, the trained output of the $i$-th neighborhood $y_i$ is given by just its core variables and hence is an even smaller subset of the neighborhood, of dimension $D_{out}^i \le D_{in}^i$. Between each prediction step the neighborhoods need a new input which, as typically $D_{out}^i < D_{in}^i$, can only come from the other neighborhoods' predictions. Hence, between each prediction step, a new input vector $x_{pred}$ is formed by the combined core variable predictions of all neighborhoods. As a consequence, each variable of the original input must be in one and only one neighborhood as a core variable. Pathak et al. originally introduced their LS approach in the context of the KS system, a purely locally interacting system. As such, their neighborhoods were chosen to have only spatially adjacent cores surrounded by contiguous buffer regions of neighbors (see Figure 1a). For systems where such a local interaction is not present, this scheme cannot be used. Nonetheless, the idea of locality, in the sense of importance to a prediction, is generalizable to something which we will call similarity. As the choice of neighborhoods is essentially arbitrary, this allows the creation of neighborhoods not by which variables are locally closest to the core variables, but by instead including the variables most similar to the cores as neighbors. Such a non-local GLS neighborhood is shown in Figure 1b.
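A sketch of this parallel prediction loop; the per-neighborhood container layout is our own illustration, assuming each reservoir was trained as in the previous snippet:

```python
import numpy as np

def parallel_prediction_step(reservoirs, x_pred):
    """One global step of the (G)LS scheme. Each element of `reservoirs`
    carries the input indices of its neighborhood, the output indices of
    its core variables, its state r, and its matrices A, W_in, W_out."""
    x_next = np.empty_like(x_pred)
    for res in reservoirs:
        x_i = x_pred[res["neighborhood"]]   # slice of the global input unique to this reservoir
        res["r"] = np.tanh(res["A"] @ res["r"] + res["W_in"] @ x_i)
        r_tilde = np.concatenate([res["r"], res["r"] ** 2])
        x_next[res["core"]] = res["W_out"] @ r_tilde  # cores cover every variable exactly once
    return x_next
```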
FIG. 1. Schematic depicting different neighborhoods and associated data flow. Yellow circles mark the core variables, green the neighbors and purple all other variables of the full input. (a) LS RC. The highlighted neighborhood of the reservoir has one core variable $x_i$ and four locally adjacent neighbors. These five variables are used as input for the reservoir $R_i$ to predict the core variable's future state $y_i$. Many more reservoirs exist, each with their own neighborhood. (b) GLS RC. As in (a), the neighborhood has one core variable and four neighbors, with the crucial difference being that the neighbors do not have to be locally adjacent to the core.

The effectiveness of this procedure of course depends on an appropriate choice of neighborhoods, such that the information necessary to predict their core variables is present in their (non-local) neighbors. Our first measure to estimate the similarity of the time series is the (linear) cross-correlation (CC) coefficient $C_{X_i, X_j}$, where $X_i$ is the $i$-th variable of the time series $x$. To transform the CC into a similarity measure (SM) we take its absolute value, $\mathrm{SM}(X_i, X_j) = |C_{X_i, X_j}| \in [0, 1]$. As a second, nonlinear measure we use the mutual information (MI). For our long ($10^5$ time steps), well behaved time series, we implement the widely used binning method [15, 28, 29] as estimator for the MI. Akin to [15], we find empirically that for our time series of length $T = 10^5$ a choice of $\lfloor \sqrt{T/4} \rfloor = 158$ bins of equal size works well. We normalize the MI as described by Strehl et al. [30], leading to our MI SM
$$\mathrm{SM}(X_i, X_j) = \frac{I(X_i, X_j)}{\sqrt{H(X_i)\, H(X_j)}} \in [0, 1],$$
where $I(X_i, X_j)$ is the MI, while $H(X_i)$ and $H(X_j)$ are the Shannon entropies of $X_i$ and $X_j$, respectively. Once a SM has been chosen and calculated for all variables of the full input data, one needs to use it to create the neighborhoods. As the prediction output of each neighborhood is only given by its core variables, predicting these core variables as accurately as possible is most important. While a variety of choices are possible, we restrict ourselves in this Letter to associating each neighborhood with exactly one core variable. In the LS case, the neighbors of the core are simply the variables spatially closest to that core. In the following, we will call this a spatial neighborhood (SN). In the GLS case, the exact analog is possible, where the neighbors of the core are the variables most similar to it as defined by the SM. We call the corresponding neighborhoods CC or MI neighborhoods respectively, depending on the similarity measure used. Lastly, it should be emphasized that the GLS method is in principle independent of not only the specific SM used, but also of the chosen prediction method. Many other choices for the SM, e.g. the transfer entropy, or for the network, e.g. LSTMs, are conceivable.
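Both similarity measures can be sketched as follows; the bin count mirrors the text, while the function names and the greedy top-k neighbor selection are our own illustration:

```python
import numpy as np

def cc_similarity(data):
    """|cross-correlation| similarity matrix; data has shape (T, D)."""
    return np.abs(np.corrcoef(data, rowvar=False))

def mi_similarity(data):
    """Normalized binned-MI similarity SM = I(X_i, X_j) / sqrt(H(X_i) H(X_j)),
    with floor(sqrt(T / 4)) equal-size bins (158 for T = 1e5)."""
    T, D = data.shape
    n_bins = int(np.sqrt(T / 4))

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    H = np.array([entropy(np.histogram(data[:, i], bins=n_bins)[0] / T)
                  for i in range(D)])
    sm = np.eye(D)
    for i in range(D):
        for j in range(i + 1, D):
            p_joint = np.histogram2d(data[:, i], data[:, j], bins=n_bins)[0] / T
            mi = H[i] + H[j] - entropy(p_joint.ravel())  # I = H_i + H_j - H_ij
            sm[i, j] = sm[j, i] = mi / np.sqrt(H[i] * H[j])
    return sm

def neighborhood(sm, core, n_neighbors):
    """A neighborhood: the core plus the n_neighbors most similar variables."""
    order = np.argsort(sm[core])[::-1]
    return np.concatenate([[core], order[order != core][:n_neighbors]])
```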
Experiments.—
To enable a fair comparison of the different neighborhood generation methods, the RC hyperparameters are optimized for the LS neighborhoods and then copied for the GLS neighborhoods without further adjustments. Furthermore, we added noise to all training data following Vlachas et al. [2]. Without this noise we found the short term prediction accuracy to often be higher, but at the cost of an increased rate of failed realizations and lower quality long term predictions in general. The added noise proved decisive in minimizing variance between network realizations and reducing the number of failed realizations, especially for the L96 system. Heuristically, we found normally distributed noise with standard deviation $\sigma_{noise} = 1\%\, \sigma_{data}$, where $\sigma_{data}$ is the standard deviation of the training data, to be a sweet spot. The hyperparameters used for all RC are given in table S2 [31]. Transient effects of the simulated data were discarded before any synchronization, training or prediction took place. Similarly, each reservoir training and prediction is preceded by 2000 synchronization steps. All reservoirs were trained for a total of $10^5$ time steps using the noisy training data. To calculate the SMs we use the same noisy data used to train the reservoirs. As a result, all neighborhoods shown here are representative examples, but not uniform for all realizations. The neighborhoods are calculated for a single core and 18 neighbors, in the case of SN and MI neighborhoods, and 28 neighbors, in the case of a CC neighborhood. The neighborhood sizes for SN and MI were chosen to be similar to Pathak's original paper [3], which had a total neighborhood size of 20. While for them this results in good predictions, for the CC SM this leads to essentially all predictions diverging. This is likely the result of the linear CC SM not recognizing the importance of the core's nearest neighbors in these systems. As such, the CC neighborhood size was increased to a total size of 29, the minimum where no predictions diverged. CC neighborhoods of different sizes for the L96 system are depicted in Figure S1 [31]. Example neighborhoods for the L96 system are shown in Figure 2.
FIG. 2. Example neighborhoods for the L96 system. Each row depicts one variable of the time series, while each column represents a neighborhood. Each neighborhood is defined by the core variable (yellow) and its neighbors (green). (a) SN, (b) CC, (c) MI neighborhoods.

While the SN neighborhoods are simply defined as a single core and its nearest neighbors, the CC and MI neighborhoods warrant a closer look. First and foremost, even though the MI neighborhoods were calculated dynamically, without directly using the knowledge of L96 being a locally interacting system, the resulting MI neighborhoods closely resemble the SN neighborhoods. The CC neighborhoods in contrast include many variables spatially far away from the core. 100 distinct random network realizations are generated for each system-SM combination. They are then trained and used to predict the same 300 sections of 10 Lyapunov times length on the chaotic attractor of the L96 and KS systems. To quantify the short term prediction accuracy, we define the normalized root mean square error (NRMSE) as
$$\mathrm{NRMSE}(\hat{y}) = \frac{\sqrt{\langle (\hat{y} - y)^2 \rangle}}{y_{\max} - y_{\min}},$$
where $\hat{y} \in \mathbb{R}^D$ is the prediction at a single time step, $y \in \mathbb{R}^D$ the true signal, and $y_{\max}$ ($y_{\min}$) the largest (smallest) value taken by any variable in the simulated data set. Short term prediction results are shown in Figure 3.

FIG. 3. Short term prediction of the L96 system. NRMSE of SN, CC, and MI prediction data averaged over first the 300 predicted sections and then the 100 network realizations. The error bands correspond to the 3σ standard deviation of the random network realizations.

Looking at the short term predictions of the L96 system, it is very striking that the averaged NRMSE of the SN and MI neighborhoods coincide more or less exactly, while the CC prediction is significantly worse. This performance drop is likely the result of the CC neighborhood's inclusion of many variables which are far from the, ostensibly most important, close region around the core (see Figure 2b). It should be noted that the statistical stability that has been assessed here for the first time is remarkable. Often RC algorithms exhibit a much larger spread of prediction qualities between different random networks [32]. This stability is partly attributable to finding the correct hyperparameters for the systems at hand, as suboptimally chosen hyperparameters often leave the best predictions intact while drastically increasing the variance towards completely failed predictions [17, 32]. However, empirically we found that the role of the noise added to the training data in achieving this stability, especially for the L96 system, cannot be understated either. Additionally, it should be noted that, as is typical for RC, the variance in short term NRMSE between different starting positions on the attractor is much larger than the variance for different random network realizations. Using the OS method we can calculate the MLEs of the predictions as another characteristic to quantify long term prediction accuracy. For this purpose we let each of our reservoir realizations predict three distinct sections of the system's attractor, each 1000 Lyapunov times in length, which are then used to calculate the MLE. The details of the OS procedure used are described in the supplemental material [31]. The resulting MLEs are listed in Table I.
TABLE I. MLEs of the L96 system calculated from the simulation (SIM) and from the long term predictions using the SN, CC, and MI neighborhoods. The errors give the standard deviation between network realizations.

System  SIM    SN            CC            MI
L96     0.45   0.… ± 0.01    0.… ± 0.02    0.… ± 0.…

The MLEs of all three neighborhoods agree well with the MLEs calculated directly from the simulations. The neighborhood creation, prediction and analysis done here for the L96 system are repeated for the KS system in the supplemental material [31], showing largely the same quality of results, comparable with the LS results achieved by Pathak et al. [3]. To test the usefulness of GLS for non-locally interacting systems, we use the KS and L96 systems to artificially create such a non-locally interacting test system. As depicted in Figure S7 [31], we do this by concatenating both systems and then randomly shuffling the 80 variables of the combined system. As this combined system is now a composite of two systems with different time steps, it does not have a well defined Lyapunov exponent any more. For the sake of consistency we nonetheless continue the time axis rescaling in terms of Lyapunov exponents. To do so we use the larger of the two systems' time steps per Lyapunov time as calculated in table S1 [31], hence at worst slightly underestimating our short term prediction results. As before, we can also calculate the SN, CC and MI neighborhoods for this new concatenated system. The CC and MI neighborhoods are shown in Figure 4.
FIG. 4. Neighborhoods of the concatenated L96 and KS systems. (a) shows the CC and (b) the MI neighborhoods for the concatenated, un-shuffled system. Variables 1-40 of the data used to compute these neighborhoods come from the L96 system, with variables 41-80 originating from the KS simulation. In (c) the CC and in (d) the MI neighborhoods for the shuffled system are shown.
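A sketch of how such a concatenated-and-shuffled test system can be generated, together with the NRMSE of the main text used to score its predictions; array shapes and the seed are our assumptions:

```python
import numpy as np

def build_shuffled_system(l96_data, ks_data, seed=0):
    """Concatenate two (T, 40) simulations along the variable axis and
    apply one fixed random permutation to the 80 combined variables."""
    combined = np.concatenate([l96_data, ks_data], axis=1)  # shape (T, 80)
    perm = np.random.default_rng(seed).permutation(combined.shape[1])
    return combined[:, perm], perm

def nrmse(y_hat, y, y_max, y_min):
    """NRMSE at a single time step: root mean square error over the
    variables, normalized by the simulated data's value range."""
    return np.sqrt(np.mean((y_hat - y) ** 2)) / (y_max - y_min)
```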
The SN neighborhoods have been omitted from this figure as they are, by definition, always the same. Fascinatingly, both the CC and MI neighborhoods in the combined but not shuffled system look like they are composed of the individual systems' neighborhoods. In fact, exactly this is the case, as both the CC and the MI SMs are able to completely separate the KS and L96 systems. As before, we quantify the short term prediction accuracy using the NRMSE. The results are shown in Figure 5. Immediately noticeable is the almost instant divergence of all SN predictions. This is of course expected, considering the SN neighborhoods assume a locally interacting system, which the shuffled system is not. Furthermore, the NRMSE of the CC and MI neighborhoods is the combination of the NRMSE of the individual systems. This, again, makes sense due to the perfect separation between the L96 and the KS system shown in Figure 4. As for the KS system, this results in the MI SM delivering significantly better results than the CC SM. For the long term statistical analysis we cannot use the MLE, as it is not well defined due to the different time step sizes of the sub-systems. Therefore we look at the probability distribution functions (PDFs), which we estimate as simple histograms using the same data otherwise used for the OS. The details of the histogram generation are described in the supplementary material [31]. The CC and MI PDFs are depicted in Figure 6.
FIG. 5. Shuffled system short term prediction comparison. (a) Simulated data of the combined shuffled KS-L96 system. (b-d) Exemplary difference of the RC prediction to the simulated data when using the (b) SN, (c) CC and (d) MI neighborhoods. The color scale of the diverging prediction was cut to ensure legibility of the other plots. (e) NRMSE of SN, CC, and MI prediction data averaged over first the 300 predicted sections and then the 100 network realizations. The error bands correspond to the 3σ standard deviation of the random network realizations.

The predicted PDFs show excellent agreement with the simulated data. Similarly to the MLE results, the worse short term predictive performance of the CC SM compared to the MI SM is not reflected in its climate reproduction quality. As the complete separation of the two subsystems shown in Figure 4 means that the corresponding RCs do not interact, the computed histograms for the combined system are, literally, the combined histograms of the independent sub-system predictions. The PDFs for the individually predicted L96 and KS systems are shown in Figures S2 and S5 [31].

FIG. 6. PDFs for the long term predictions of the combined and shuffled KS-L96 system. The PDFs are estimated via two superimposed 100 bin histograms of the simulated (gray) and the predicted (orange) data of the (a) CC or (b) MI neighborhood. Each entry in the histogram is generated by the prediction of a single variable value, with the dataset coming from the prediction of three 1000 Lyapunov time long sequences. The error bars represent the 1σ standard deviation of the predicted histogram bins.

Conclusions.—
We have proposed and discussed a generalization of the concept of LS in the sense of using SMs derived from correlations among (instead of spatial distances between) time series. This offers a much more versatile approach for the prediction of high-dimensional complex systems. First, GLS can still make excellent predictions in the case of mixed systems, where LS is doomed to fail. Here it is worth noticing that the perfect neighborhood separation we found in the combined KS-L96 system (see Figure 4) suggests that GLS can be used to separate mixed data sets and to thus infer different origins of a set of heterogeneous time series for which the generating processes are unknown. Second, prediction of high-dimensional systems remains feasible when no spatial information is available for some or all of the time series. This is more and more the typical case in real world applications, when analyzing heterogeneous data sets ranging from remote sensing data, financial data, and social media data (e.g. Instagram, Twitter) to nowadays also infection rates during the COVID-19 pandemic, as well as combinations of the aforementioned. Current research explores applications of GLS-based predictions in those cases.

Acknowledgments.—
We wish to acknowledge valuable discussions with D. Mohr, P. Huber, J. Aumeier, Y. Mabrouk, and J. Herteux.

[1] Z. Lu, B. R. Hunt, and E. Ott, Attractor reconstruction by machine learning, Chaos, 061104 (2018), arXiv:1805.03362.
[2] P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos, Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks, Proc. R. Soc. A Math. Phys. Eng. Sci., 20170844 (2018).
[3] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach, Phys. Rev. Lett., 024102 (2018).
[4] A. Chattopadhyay, A. Subel, and P. Hassanzadeh, Data-driven super-parameterization using deep learning: Experimentation with multiscale Lorenz 96 systems and transfer learning, J. Adv. Model. Earth Syst., 10.1029/2020MS002084 (2020), arXiv:2002.11167.
[5] Y. A. Mabrouk and C. Räth, Calibrated reservoir computers, Chaos, 113134 (2020).
[6] A. Haluszczynski, J. Aumeier, J. Herteux, and C. Räth, Reducing network size and improving prediction stability of reservoir computing, Chaos, 063136 (2020), arXiv:2003.03178.
[7] T. L. Carroll, Do reservoir computers work best at the edge of chaos?, Chaos An Interdiscip. J. Nonlinear Sci., 121109 (2020), arXiv:2012.01409.
[8] S. Bompas, B. Georgeot, and D. Guéry-Odelin, Accuracy of neural networks for the simulation of chaotic dynamics: Precision of training data vs precision of the algorithm, Chaos, 113118 (2020), arXiv:2008.04222.
[9] G. Marcucci, D. Pierangeli, and C. Conti, Theory of neuromorphic computing by waves: Machine learning by rogue waves, dispersive shocks, and solitons, Phys. Rev. Lett., 093901 (2020), arXiv:1912.07044.
[10] U. Parlitz and C. Merkwirth, Prediction of spatiotemporal time series based on reconstructed local states, Phys. Rev. Lett., 1890 (2000).
[11] J. Pathak, Model-free prediction of large spatiotemporally chaotic systems from data - supplemental material, Phys. Rev. Lett., 381 (2018).
[12] R. S. Zimmermann and U. Parlitz, Observing spatio-temporal dynamics of excitable media using reservoir computing, Chaos, 10.1063/1.5022276 (2018).
[13] R. N. Mantegna, Hierarchical structure in financial markets, Eur. Phys. J. B, 193 (1999), arXiv:cond-mat/9802256.
[14] J. P. Onnela, A. Chakraborti, K. Kaski, J. Kertész, and A. Kanto, Dynamics of market correlations: Taxonomy and portfolio analysis, Phys. Rev. E, 056110 (2003), arXiv:cond-mat/0302546.
[15] A. Haluszczynski, I. Laut, H. Modest, and C. Räth, Linear and nonlinear market correlations: Characterizing financial crises and portfolio optimization, Phys. Rev. E, 1 (2017), arXiv:1712.02661.
[16] Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott, Reservoir observers: Model-free inference of unmeasured variables in chaotic systems, Chaos, 10.1063/1.4979665 (2017).
[17] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics, Neural Networks, 191 (2020), arXiv:1910.05266.
[18] Y. Huang, Z. Fu, and C. L. E. Franzke, Detecting causality from time series in a machine learning framework, Chaos An Interdiscip. J. Nonlinear Sci., 063116 (2020).
[19] E. N. Lorenz, Predictability: A problem partly solved (1996).
[20] D. S. Wilks, Effects of stochastic parametrizations in the Lorenz '96 system, Q. J. R. Meteorol. Soc., 389 (2005).
[21] J. C. Sprott, Chaos and Time-Series Analysis (Oxford University Press, 2003).
[22] W. H. Press, S. A. Teukolsky, B. P. Flannery, and W. T. Vetterling, Numerical Recipes in Fortran 77: The Art of Scientific Computing (Cambridge University Press, 1992).
[23] Y. Kuramoto and T. Tsuzuki, Persistent propagation of concentration waves in dissipative media far from thermal equilibrium, Progress of Theoretical Physics, 356 (1976).
[24] Y. Pomeau and S. Zaleski, The Kuramoto-Sivashinsky equation: A caricature of hydrodynamic turbulence? (Springer, 1985) pp. 296-303.
[25] Y. Kuramoto and T. Tsuzuki, Persistent propagation of concentration waves in dissipative media far from thermal equilibrium, Prog. Theor. Phys., 356 (1976).
[26] A. K. Kassam and L. N. Trefethen, Fourth-order time-stepping for stiff PDEs, SIAM J. Sci. Comput., 1214 (2005).
[27] J. Herteux and C. Räth, Breaking symmetries of the reservoir equations in echo state networks, Chaos An Interdiscip. J. Nonlinear Sci., 123142 (2020), arXiv:2010.07103.
[28] C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., 623 (1948).
[29] A. Kraskov, H. Stögbauer, and P. Grassberger, Estimating mutual information, Phys. Rev. E, 16 (2004), arXiv:cond-mat/0305641.
[30] A. Strehl and J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res., 583 (2003).
[31] See Supplemental Material at [URL will be inserted by publisher] for further details on the reservoir computing parameters, methods and additional analysis.
[32] A. Haluszczynski and C. Räth, Good and bad predictions: Assessing and improving the replication of chaotic attractors by means of reservoir computing, Chaos, 10.1063/1.5118725 (2019), arXiv:1907.05639.

Supplementary Material

Sebastian Baur and Christoph Räth
Institut für Materialphysik im Weltraum, Deutsches Zentrum für Luft- und Raumfahrt, Münchner Straße 20, 82234 Weßling, Germany
and Fakultät für Physik, Ludwig-Maximilians-Universität München, Schellingstraße 4, 80779 Munich, Germany

ORBIT SEPARATION
Typically, the important Lyapunov exponent is considered to be the maximal Lyapunov exponent (MLE), defined as the largest Lyapunov exponent of any given chaotic system. Its importance stems from the fact that it is intimately tied to the predictability of the system as, given a MLE $\Lambda_{\max}$, two infinitesimally close trajectories in phase space, initially separated by the vector $\delta x(t = 0)$, diverge as [1]
$$|\delta x(t)| \approx e^{\Lambda_{\max} t}\, |\delta x(t = 0)|, \qquad (1)$$
where $t$ is the time since separation. To calculate the MLE we use Sprott's method of orbit separation (OS) [2]. By taking the logarithm and the average $\langle \cdot \rangle$ over many trajectory divergences in different parts of the chaotic attractor we find
$$\Lambda_{\max} = \frac{1}{t_1 - t_0} \left\langle \log |\delta x(t_1)| - \log |\delta x(t_0)| \right\rangle. \qquad (2)$$
Note that for this equation to hold, we are only using the divergence data after transient effects have subsided but before the divergence size saturates due to it reaching the size of the chaotic attractor. From this, we can use a simple linear least squares fit to calculate the MLE. As described in the main text, we base the OS calculation on three 1000 Lyapunov time long term prediction datasets. For each of the 100 realizations we choose 50 trajectory positions, uniformly distributed in the first of the three long term datasets, as starting points for the orbit separation, at which we add normally distributed noise with standard deviation $\sigma_{noise} = 10^{-\ldots}\, \sigma_r$ to the internal reservoir states. From this perturbed internal state, we let the reservoir predict 1500 time steps and compute the separation magnitude to the unperturbed predicted time series, using the least squares fit of equation (2) as described above. The MLEs of the Lorenz-96 (L96) and Kuramoto-Sivashinsky (KS) simulations are computed analogously, using one $10^{\ldots}$ time steps long dataset from which $10^{\ldots}$ uniformly distributed starting positions for the trajectory divergence are chosen. The resulting MLEs are given in table S1.
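A sketch of one such orbit separation, assuming an `advance(state)` callable that performs one time step of either the simulation or the autonomously running reservoir; the fit window bounds are placeholders for the transient/saturation cut-offs described above:

```python
import numpy as np

def mle_orbit_separation(advance, state, n_steps, dt, eps=1e-10, seed=0):
    """Estimate the MLE from a single orbit separation: perturb `state`,
    evolve both copies with `advance`, and fit the slope of
    log|dx(t)| against t in the linear-growth regime (Eq. (2))."""
    rng = np.random.default_rng(seed)
    a = state.copy()
    b = state + eps * rng.standard_normal(state.shape)
    log_sep = np.empty(n_steps)
    for k in range(n_steps):
        a, b = advance(a), advance(b)
        log_sep[k] = np.log(np.linalg.norm(b - a))
    t = dt * np.arange(n_steps)
    # fit only after transients and before the separation saturates
    lo, hi = n_steps // 10, n_steps // 2   # placeholder cut-offs
    return np.polyfit(t[lo:hi], log_sep[lo:hi], 1)[0]
```

The final MLE would then be the average of this slope over many starting positions on the attractor, as in the text.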
TABLE S1. MLE $\Lambda_{\max}$, time step size $\Delta t$, Lyapunov time $T_L$, and the number of time steps per $T_L$ for the KS and L96 systems.

System  Λ_max   Δt     T_L    time steps per T_L
KS      0.049   0.5    20.4   40.8
L96     0.45    0.05   2.2    44.4
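The last two columns of Table S1 follow by simple arithmetic, assuming the usual definition of the Lyapunov time as the inverse MLE:

$$T_L = \frac{1}{\Lambda_{\max}}: \qquad T_L^{\mathrm{KS}} = \frac{1}{0.049} \approx 20.4, \quad \frac{20.4}{0.5} \approx 40.8 \text{ steps}; \qquad T_L^{\mathrm{L96}} = \frac{1}{0.45} \approx 2.2, \quad \frac{2.2}{0.05} \approx 44.4 \text{ steps}.$$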
RESERVOIR COMPUTING PARAMETERS

Table S2 specifies the hyperparameters used for all reservoir computing (RC) computations in the paper.

TABLE S2. RC hyperparameters: reservoir dimension $D_r$, average node degree $\kappa$, spectral radius $\rho$, input coupler scaling $\omega$, regularization parameter $\beta$, and noise level $\alpha$.

D_r   κ    ρ     ω     β       α
…     …    0.…   0.…   10^-…   …

CC NEIGHBORHOOD SIZE
As discussed in the text, the cross-correlation (CC) neighborhood sizes were necessarily chosen to be significantly larger than the spatial neighborhood (SN) and mutual information (MI) neighborhood sizes. This necessity is likely the result of the CC similarity measure (SM) not recognizing the importance of the core's nearest neighbors in the L96 and KS systems. As such, the CC neighborhood size was increased to a total size of 29, the minimum where no predictions diverged and where the core's nearest neighbors were consistently included in its neighborhood. CC neighborhoods of size 19, 27 and 29 for the L96 system are depicted in Figure S1.
FIG. S1. CC neighborhoods of the L96 system. (a) Total neighborhood size 19; the nearest neighbors of the core variables are not included in the neighborhood. (b) Neighborhood size 27. Only a couple of nearest neighbors of the cores are missing. Nevertheless, trying to predict the system using these neighborhoods fails every time. (c) Neighborhood size 29. All nearest neighbors of the cores are included in the neighborhood. This is the smallest neighborhood size for which a prediction consistently succeeds.
While the true importance of each variable regarding the prediction of another is of course unknown, at least for the locally interacting L96 and KS systems studied here the spatial nearest neighbors of the cores are likely to be critical for achieving a sensible prediction.
PDF ESTIMATION
Estimating the probability distribution functions (PDFs) was done with the same data sets used to calculate the MLE. The three resulting predictions and the reference data sets from the simulation are each treated as one single data set of 3000 Lyapunov times length. The resulting 40 (80) dimensional datasets are flattened and then used to build the histogram. The histograms themselves consist of 100 equally sized bins. To make comparison easy, they are chosen such that bin position and size for the simulated and predicted time series histograms are the same. In addition to the PDF of the prediction of the combined and shuffled systems in Figure 6, the PDFs of the predictions of the individual systems are shown in Figure S2 and Figure S5. The L96 PDFs are in excellent agreement with the simulated data for all three neighborhood types, up to and including the tails of the distributions. This aligns with the MLE calculation results, in which the comparatively worse L96 CC normalized root mean square error (NRMSE) also did not seem to affect the long term statistics.
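The histogram construction just described can be sketched as follows (function name is ours; `density=True` provides the PDF normalization):

```python
import numpy as np

def pdf_histograms(sim_data, pred_data, n_bins=100):
    """Superimposable PDFs: flatten both data sets and bin them with
    identical equal-size bins spanning the combined value range."""
    lo = min(sim_data.min(), pred_data.min())
    hi = max(sim_data.max(), pred_data.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    pdf_sim = np.histogram(sim_data.ravel(), bins=edges, density=True)[0]
    pdf_pred = np.histogram(pred_data.ravel(), bins=edges, density=True)[0]
    return edges, pdf_sim, pdf_pred
```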
FIG. S2. PDFs for the long term predictions of the L96 system. The PDFs are estimated via two superimposed 100 bin histograms of the simulated (gray) and the predicted (orange) data of the (a) SN, (b) CC and (c) MI neighborhoods. Each entry in the histogram is generated by the prediction/simulation of a single variable, with the dataset coming from three 1000 Lyapunov time long sequences. The error bars represent the 1σ standard deviation of the predicted histogram bins.

KS PREDICTION ANALYSIS
In this section the prediction results of the KS system are presented. They are achieved with exactly the same methods and parameters as used for the L96 system. Example neighborhoods and short term prediction results for the KS system are shown in Figure S3 and Figure S4, respectively.
FIG. S3. Example neighborhoods for the KS system. Each row depicts one variable of the time series, while each column represents a neighborhood. Each neighborhood is defined by the core variable (yellow) and its neighbors (green). (a) SN neighborhoods. The SN neighborhood consists only of a core and its 18 nearest neighbors. (b) CC neighborhoods. Each CC neighborhood consists of a total of 29 variables. (c) MI neighborhoods.
Looking at the short term predictions of the KS system shown in Figure S4, it is striking that the averaged NRMSE of all three neighborhoods coincide more or less exactly, compared to the L96 system, in which the CC neighborhoods performed significantly worse. The PDFs calculated for the KS system are depicted in Figure S5. It is clear that the CC neighborhood is able to reproduce the underlying simulated data the best, with excellent agreement even at the tails of the distribution. Next best is the MI which, while still showing close agreement at the core of the distribution, exhibits a divergence at the tails, mainly by assuming values of greater magnitude than are ever reached in the simulated dataset. This pattern continues with the SN, which shows the same divergence behavior as the MI RC, only more pronounced. The explanation for this is that both the MI RC as well as the SN RC seem to "get stuck" in different parts of the attractor when predicting independently for a long time. Notably, this is not a full breakdown of the trajectory dynamics, as can be seen when looking at exemplary trajectory slices, individual realization PDFs and the long term NRMSE depicted in Figure S6.
FIG. S4. Short term prediction of the KS system. (a) Actual KS simulation data. (b-d) Exemplary error in the RC prediction when using the (b) SN, (c) CC and (d) MI neighborhoods. (e) NRMSE of SN, CC, and MI prediction data averaged over first the 300 predicted sections and then the 30 network realizations. The error bands correspond to the 3σ standard deviation of the random network realizations. We multiply t by the MLE Λ_max of the model, so that each unit on the horizontal axis represents one Lyapunov time.
FIG. S5. PDFs for the long term predictions of the KS system. The PDFs are estimated via two superimposed 100 bin histograms of the simulated (gray) and the predicted (orange) data of the (a) SN, (b) CC and (c) MI neighborhoods. Each entry in the histogram is generated by the prediction/simulation of a single variable, with the dataset coming from three 1000 Lyapunov time long sequences. The error bars represent the 1σ standard deviation of the predicted histogram bins.

Looking at the predicted data more closely, one finds the characteristic shapes and length scales of the KS still intact, but with the normally almost horizontal lines showing a clear drift to lower or higher values, respectively. While this behavior is also present in the simulated KS data, it occurs there only for much shorter durations of at most a couple of Lyapunov times, while the predicted data manifests this shift for hundreds of Lyapunov times or more. Notably, no clear separation between the start and end of such a shift period is visible in the NRMSEs, even though the effect on the PDFs is immediately noticeable.
FIG. S6. SN example realizations exhibiting the PDF shift. (a)-(b) Predictions of the KS system, qualitatively different from the simulated data. (c)-(d) NRMSEs over the full 1000 Lyapunov times corresponding to the predictions shown in (a) and (b), respectively. (e)-(f) PDFs corresponding to the NRMSEs depicted in (c) and (d). A clear shift of the entire distribution to lower (higher) values is visible.

This is especially interesting as the SN and MI neighborhoods for the KS system are almost the same (see Figure S3), meaning that the resulting difference in climate reproduction ability must come from the few outliers, far removed from the ostensibly most important closest neighbors. These results indicate that even for purely locally interacting systems there might be something to gain from using the generalized local states (GLS) approach over the conventional local states (LS) one, or they at least further reinforce the importance of small amounts of randomness for getting high quality long term predictions from RC. Using OS to calculate the MLEs, we find them again to agree excellently with the simulated MLE, as depicted in table S3.
TABLE S3. MLEs of the KS system calculated from the simulation (SIM) and from the long term predictions using the SN, CC, and MI neighborhoods. The errors give the standard deviation between network realizations.

System  SIM     SN             CC             MI
KS      0.049   0.… ± 0.004    0.… ± 0.001    0.… ± 0.…

Notably, this is true even for the SN and MI neighborhoods in the KS system, with the only noticeable difference being a slightly larger variance for the SN neighborhood. Therefore, due to the relatively low impact on the MLE, the previously mentioned behavior of the prediction getting "stuck" in a part of the attractor might have never been discovered without calculating the PDFs. This highlights the need for multiple analysis methods when quantifying the climate reproduction of a prediction.
CONCATENATED AND SHUFFLED SIMULATION DATA
To test the usefulness of GLS for non-locally interacting systems, we use the KS and L96 systems to artificially create a non-locally interacting test system. As depicted in Figure S7, we do this by concatenating both systems and then randomly shuffling the 80 variables of the combined system.

[1] M. T. Rosenstein, J. J. Collins, and C. J. De Luca, A practical method for calculating largest Lyapunov exponents from small data sets, Phys. D Nonlinear Phenom., 117 (1993).
[2] J. C. Sprott, Chaos and Time-Series Analysis (Oxford University Press, 2003).
FIG. S7. Concatenated system. (a)
Simulated data combined from the L96 (variables 1-40) and KS (variables 41-80) systems. (b)
Simulated data as in (a) with the variables shuffled. The time axis is scaled by the MLE Λ_max of the KS model.