Accuracy of neural networks for the simulation of chaotic dynamics: precision of training data vs precision of the algorithm
S. Bompas,1,2 B. Georgeot, and D. Guéry-Odelin
1 Laboratoire Collisions, Agrégats, Réactivité, IRSAMC, Université de Toulouse, CNRS, UPS, France
2 Laboratoire de Physique Théorique, IRSAMC, Université de Toulouse, CNRS, UPS, France
(Dated: July 8, 2020)
We explore the influence of the precision of the data and of the algorithm for the simulation of chaotic dynamics by neural network techniques. For this purpose, we simulate the Lorenz system with different precisions using three different neural network techniques adapted to time series, namely reservoir computing (using ESN), LSTM and TCN, for both short and long time predictions, and assess their efficiency and accuracy. Our results show that the precision of the algorithm is more important than the precision of the training data for the accuracy of the predictions. This result gives support to the idea that neural networks can perform time-series predictions in many practical applications for which data are necessarily of limited precision, in line with recent results. It also suggests that for a given set of data the reliability of the predictions can be significantly improved by using a network with higher precision than the one of the data.
I. INTRODUCTION
Techniques of machine learning have been shown lately to be efficient in a huge variety of tasks, from playing the game of Go to speech recognition or automatic translation. In many cases, such breakthroughs correspond to complicated tasks with complex decision-making processes. However, it was highlighted recently that such tools can also be useful in tasks which are much more adapted to standard algorithms, such as the simulation of physical systems. Indeed, it was shown that a certain type of machine learning algorithm called reservoir computing was able to forecast the evolution of complex physical systems, namely a fully chaotic model (see also references therein). Remarkably enough, the simulation is made from the time series of the previous states of the system, without solving explicitly the equations defining the model. It was also shown that other types of neural networks may be efficient as well in predicting the behaviour of such systems.

So far, the results have shown that different machine learning techniques can simulate chaotic dynamics, both for short and long times. However, it is important for future applications to assess the accuracy of these techniques in a precise way. In this paper, we explore the role of the precision of the data used for the training of the network, and of the algorithm itself, on the accuracy of the simulation. We do so on a specific case of reservoir computing (Echo State Network, ESN) as well as on two other standard machine learning techniques used in this context, commonly called LSTM and TCN. We compare the accuracy of these methods to the explicit integration of the equations of motion, both for short time and long time predictions of a well known chaotic system originating from meteorology, the Lorenz system. Our results show that the precision of the algorithm is more important than the precision of the training data for the accuracy of the simulation. This has interesting consequences for applications, since the precision of the algorithm is by far easier to control than the one of the training data. We also discuss the training by considering trajectories of different sizes and by computing the time required to train the networks.

II. SYSTEM STUDIED
The Lorenz system was introduced in 1963 by Edward Lorenz as an extremely simplified model of meteorology. It corresponds to a set of three nonlinear coupled equations for the variables x, y and z as a function of time:

ẋ = σ(y − x),   (1)
ẏ = x(ρ − z) − y,   (2)
ż = xy − βz.   (3)

Throughout the paper we choose the standard set of parameters: σ = 10, ρ = 28 and β = 8/3.
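For concreteness, here is a minimal Python sketch (our illustration, not the authors' code) of a fixed-step RK4 integration of Eqs. (1)-(3); the step dt = 0.02 matches the one used throughout the paper.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # standard Lorenz parameters

def lorenz(state):
    """Right-hand side of Eqs. (1)-(3)."""
    x, y, z = state
    return np.array([SIGMA * (y - x),
                     x * (RHO - z) - y,
                     x * y - BETA * z])

def rk4_step(state, dt):
    """One fixed-step fourth-order Runge-Kutta update."""
    k1 = lorenz(state)
    k2 = lorenz(state + 0.5 * dt * k1)
    k3 = lorenz(state + 0.5 * dt * k2)
    k4 = lorenz(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def trajectory(state0, n_steps, dt=0.02):
    """Integrate n_steps points of the Lorenz system from state0."""
    traj = np.empty((n_steps, 3))
    traj[0] = state0
    for i in range(1, n_steps):
        traj[i] = rk4_step(traj[i - 1], dt)
    return traj
```

Changing the floating-point dtype of the arrays (float32 versus float64) is enough to reproduce the precision comparisons discussed below.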
FIG. 1. (a),(b),(c): example of a trajectory of the Lorenz system, showing x(t), y(t) and z(t), with a time step dt = 0.02; (d): long-time convergence towards the so-called Lorenz attractor. Parameters: σ = 10, ρ = 28 and β = 8/3.

Figure 2 shows the Euclidean distance between a reference trajectory, computed with the Runge-Kutta integration method of order 4 (RK4) in quadruple precision, and trajectories computed using RK4 with lower precision (i.e. separated initially by 10^{-…} or 10^{-…}). They strongly depart after a certain time from the high precision trajectory. The separation time clearly increases only logarithmically with the precision.

FIG. 2. Euclidean distance between the reference trajectory of the Lorenz system obtained with quadruple precision and the double precision trajectory (dotted line) and the simple precision trajectory (solid line), with a time step dt = 0.02.

This property makes the numerical simulation of specific trajectories of chaotic systems very difficult: increasing the precision by exponentially large factors only increases the prediction time linearly.

However, one may ask a different type of question. Even if the short term behavior of a specific trajectory is hard to obtain numerically in a reliable manner, is it still possible to get accurate results on the statistical properties of the system at long times? To answer this question, we calculate the first return map. This map, introduced by Lorenz, consists in plotting the successive maxima of z(t) over a long period of time. For that, it is enough to locate the maxima Z_i of the curve and plot the position of a given maximum Z_{i+1} as a function of the preceding one, Z_i. These data are related to the structure of the Lorenz attractor, to which trajectories converge at long times.
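The first return map is straightforward to extract from a sampled trajectory. The sketch below (ours, not the authors' code) locates the successive maxima Z_i of z(t) by comparing each interior point with its neighbours and pairs each maximum with the following one.

```python
import numpy as np

def return_map(z):
    """Successive maxima Z_i of a sampled z(t) series, paired as
    (Z_i, Z_{i+1}); interior points larger than both neighbours
    are taken as local maxima."""
    z = np.asarray(z)
    peaks = (z[1:-1] > z[:-2]) & (z[1:-1] > z[2:])
    maxima = z[1:-1][peaks]
    return maxima[:-1], maxima[1:]

# Usage with the RK4 sketch above (the initial condition is arbitrary):
# Z_i, Z_next = return_map(trajectory(np.array([1.0, 1.0, 1.0]), 100_000)[:, 2])
```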
FIG. 3. Comparison of the return map of the Lorenz system (long term behavior) with quadruple precision (blue dots), double precision (red dots) and simple precision (green dots). The points are nearly superimposed, revealing that the long term prediction is almost the same independently of the precision.

Figure 3 compares such long term predictions using the RK4 algorithm to integrate Eqs. (1)-(3) with different precisions. We observe that the statistical properties at long times are not dramatically sensitive to the precision at which the calculation is performed. Even if individual trajectories are not accurately described, their global properties are correctly described. This is similar to what distinguishes climate simulations from meteorological simulations: even if individual trajectories cannot be simulated beyond a few weeks to predict the weather, long term global properties of the system (climate characteristics) can be obtained for much longer periods (years or decades).

To evaluate quantitatively the accuracy of long term simulations, we made a polynomial fit of the return map obtained with quadruple precision, on each side of the peak, in the window delimited by the blue dashed lines on the left side and by the red dashed lines on the right side (see Fig. 4). We then computed the relative error ξ between the fit and the data. The mean percentage error remains below 0.2% in the zones delimited by the dashed lines.

FIG. 4. The return map is calculated using a RK4 integration algorithm in quadruple precision (upper panel). We fit the data in between the blue (red) dashed lines with a polynomial of degree 10. We plot the relative difference ξ between the fit and the return map in the lower panel.
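The fit-based error measure can be sketched as follows; the window bounds lo and hi stand in for the dashed lines of Fig. 4 and are placeholders, their exact values not being quoted in the text.

```python
import numpy as np

def branch_error(Z_i, Z_next, lo, hi, deg=10):
    """Fit one branch of the return map with a degree-`deg` polynomial
    over the window lo <= Z_i <= hi, and return the mean relative
    error xi between the fit and the data, in percent."""
    sel = (Z_i >= lo) & (Z_i <= hi)
    coeffs = np.polyfit(Z_i[sel], Z_next[sel], deg)
    fit = np.polyval(coeffs, Z_i[sel])
    xi = np.abs(fit - Z_next[sel]) / np.abs(Z_next[sel])
    return 100.0 * xi.mean()
```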
FIG. 5. Mean relative error in percentage in the distance of the return map points (calculated from the RK4 algorithm) from the polynomial fit, with time step dt = 0.02 (triangles), for simple precision (large red symbols) and double precision (small blue symbols).

We then compute, for simple and double precision, the distance to the fit as a function of the number of iteration points considered (see Fig. 5). The results show that the mean relative error converges to less than 0.2% for sufficiently large databases, in both simple and double precision. The large relative error for a small number of iteration steps is due to the fact that the system has not yet reached the asymptotic behavior of the return map. The data shown in Fig. 5 indicate that the long term prediction, characterized here by the return map, is almost insensitive to the precision with which the trajectory is computed.
III. RESULTS: ACCURACY OF PREDICTIONS FOR THE LORENZ MODEL
In order to evaluate the accuracy of the machine learning approaches to predict the behavior of the Lorenz system, we use three different methods: a reservoir computing model as pioneered in this context, called Echo State Network (ESN), and two other approaches based on Recurrent Neural Networks (RNN), called LSTM and TCN. The characteristics of the networks we used are detailed in the Appendix.

In this Section, we compare the predictions and performances of each network, focusing especially on the effects of the precision of both data and algorithm. Networks are trained on trajectories generated by the RK4 integration method, having thousands of points separated by a time step dt = 0.02.

A. Resources needed for the simulation by the three neural networks
FIG. 6. Upper panel: comparison of the training time for different neural networks: an ESN whose reservoir contains 200 neurons, an ESN with 300 neurons, a LSTM network (with a single hidden layer having 64 neurons) and a TCN network. The red (blue) color is used for a computation of the network parameters in simple (double) precision. Lower panel: figure of merit of each neural network, representing the mean relative error in the estimate of the training trajectories.
Figure 6 gives an overview of the different resources consumed during the training phase by the three networks for achieving a similar converged simulation on the same computer, once the network has been set up. It is worth noticing that these performances are for a standard processor; we have not used GPU cards. For the training time, we use the same set of training trajectories (100 trajectories, each containing 50 000 points separated by a time interval dt = 0.02) for an ESN with a reservoir of 200 neurons, an ESN with 300 neurons, a LSTM network (with a single hidden layer having 64 neurons) and a TCN network (with a structure similar to the LSTM network). Note that the LSTM and TCN are trained 10 times on the training data set while the ESN scans the training data set only once.

FIG. 7. Comparison of the number of parameters for the different neural networks considered in Fig. 6. Red is the training size, blue the total size.

In addition, the number of parameters that are updated differs significantly depending on the reservoir type, as illustrated in Fig. 7. The LSTM and TCN networks adapt themselves by modifying all the network parameters. This is to be contrasted with the ESN, which updates only the connections towards the output, as discussed in the Appendix, making the training size much smaller than the total size.

The figure of merit of each neural network is represented in the lower panel of Fig. 6, where we show the mean relative distance between the trajectories provided by the network and the training ones. This quantity is averaged over all the training trajectories. When this relative error is equal to 0.01, it means that the average relative error is of the order of 1%. As expected, for each neural network the computation of the parameters in double precision yields better results. We also see that the ESN network seems more accurate at reproducing the training trajectory. We conclude that the ESN turns out to be significantly more efficient than the LSTM and TCN networks with respect to the training time, and moreover seems to better reproduce the training trajectory.
B. Short term predictions
We now turn to the accuracy of the predictions of the different networks as compared to a quadruple precision simulation by integration of Eqs. (1)-(3).

We first look at short term predictions, i.e. the accurate description of a single specific trajectory. That is the type of predictions for which chaotic systems are the most difficult to handle. It is similar to meteorological predictions in weather models, since one wants a precise state of the system starting from a specific initial state. We recall that the data are generated via the RK4 method, with a time step of 0.02 and a sampling of thousands of points. Our reference trajectory is calculated in quadruple precision for the same time step and sampling. A parameter set specific to each network architecture has been established, allowing each network to converge.
FIG. 8. Comparison between the quadruple precision RK4 simulation (red line) and the prediction of the ESN in double precision with a reservoir of size N = 300 (blue line). The initial conditions are x(0) = …, y(0) = …45 and z(0) = …41, and the time step is dt = 0.02. The ESN has been trained over 50 000 time step iterations before the prediction for the subsequent iterations represented in this figure.

The networks can then be used to predict future points beyond the training set. As said before, the protocol is the same for the three types of networks. The output associated to the input vector at time t = T defines the next point of the trajectory at time T + dt. This procedure is iterated to get the prediction over large amounts of time. We provide an example in Fig. 8 for an ESN neural network, which turns out to be able to provide an accurate prediction of the trajectory over the short term for a relatively long time.

To be more quantitative, we evaluate for each simulation a limit time, τ_lim, defined as the time when the simulation departs from the correct trajectory by at least 5%. This quantity is plotted in Fig. 9 for the three networks considered, as a function of the size of the training data (number of points of the exact trajectory which are used to train the network). In all cases, one sees an increase of the limit time with increasing dataset, until it reaches a plateau where increasing the dataset does not help any more. This defines a sort of ultimate limit time for this kind of simulation. All three networks are effective at predicting the dynamics, giving accurate results for hundreds of time steps. The LSTM and TCN networks give very similar results, and are significantly and systematically less effective than the ESN network used in the seminal paper, with prediction times 20% smaller. We recall (see the preceding subsection) that the LSTM and TCN networks are not only significantly less effective at predicting the dynamics than the ESN, they are also more costly in resources. The main difficulty for an ESN network is in the search for a viable parameter set.

We note that although these neural network methods are effective, they are less efficient than standard classical simulations like RK4 with lower precision (see Fig. 2).
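The closed-loop protocol just described (the output at time T is fed back as the input at time T + dt) is network-agnostic. In the sketch below (ours), predict_next is a hypothetical stand-in for any of the trained one-step predictors (ESN, LSTM or TCN).

```python
import numpy as np

def closed_loop(predict_next, u0, n_steps):
    """Iterate a trained one-step predictor: each output is fed back
    as the next input, producing a self-generated trajectory."""
    states = [np.asarray(u0)]
    for _ in range(n_steps):
        states.append(predict_next(states[-1]))
    return np.stack(states)
```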
FIG. 9. Impact of the precision of the training data and of the neural network on the short term, quantified by the time τ_lim above which the prediction departs by more than 5% from the quadruple precision trajectory: (a) ESN, (b) LSTM and (c) TCN. Data and network in double precision (filled squares), data in simple precision and network in double precision (filled disks), data in double precision and network in simple precision (empty squares), data and network in simple precision (circles). Time step is dt = 0.02.

We should note however that neural network techniques are still new and far from optimized compared to integration methods. In addition, the neural network techniques do not need the equations and do not depend on approximations which may have been used to construct them.

Figure 9 also enables us to assess the impact of precision on the predictive abilities of the neural networks. We have changed independently the precision of the datasets used to train the network, and the precision of the network algorithm itself. We see that in all cases the precision of the network impacts the accuracy of the prediction. Indeed, for these short term predictions, a double precision network always gives better results than a single precision network. Interestingly enough, with a single precision network, increasing the precision of the training data does not help. On the other hand, using a double precision network even on single precision data is more advantageous than a single precision network on any type of data. These results are valid for the three types of networks over the full range of training sets used. It therefore seems that the precision of the network is crucial for the accuracy of the prediction, and more so than the precision of the data. This is especially important in view of the fact that the precision of the data can be less easily controlled than the precision of the network.
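A sketch of the limit time τ_lim used in Fig. 9 follows. The paper does not spell out how the 5% departure is normalized, so the relative Euclidean distance used here is our assumption.

```python
import numpy as np

def tau_lim(pred, ref, dt=0.02, threshold=0.05):
    """First time at which the predicted trajectory departs from the
    reference one by more than `threshold` (5%), both given as
    (n_steps, 3) arrays; the departure is measured as the Euclidean
    distance normalized by the reference norm (an assumption)."""
    err = np.linalg.norm(pred - ref, axis=1) / np.linalg.norm(ref, axis=1)
    over = np.nonzero(err > threshold)[0]
    return over[0] * dt if over.size else len(ref) * dt
```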
C. Long term predictions

We now turn to long term predictions, i.e. the accurate description of statistical quantities corresponding to many trajectories, as opposed to a single specific trajectory.
FIG. 10. Return map of the Lorenz system obtained by an ESN network simulation.
FIG. 11. Impact of the precision of the training data and of the network type on the precision of the simulation for long times: (a) ESN, (b) LSTM and (c) TCN. Same notations as for Fig. 9. Time step is dt = 0.02.

Figure 10 displays an example of the return map constructed from ESN predictions, showing that, despite the fact that individual trajectories are not accurately simulated for long times, the long time dynamics is correctly described, giving the general shape of the return map.

To be more quantitative, Fig. 11 uses the measure defined in Section II (see Fig. 4) to assess the efficiency of the neural network methods for long term dynamics. Despite the fact that the LSTM and TCN networks are more cumbersome to implement and take more running time, the results are clearly better for the ESN network, which can achieve an accuracy similar to the one of the RK4 simulations (see Fig. 11a). For the LSTM and TCN networks, the results presented in Figs. 11b and 11c show that these networks are able to reproduce the long term dynamics, but with an accuracy below that of ESN networks or RK4, even for large sizes of the training dataset.

As in the case of short term predictions, the results presented in Fig. 11 allow us to estimate the effects of the precision on long term predictions. For the ESN network, it is clear that the precision of the results is entirely controlled by the precision of the network, independently of the precision of the training data. For the LSTM and TCN networks, we can see an effect of the precision of both the network and the data, but in all cases the precision of the network is the dominant factor: even with low precision data, the high precision network fares better than a low precision network with high precision data.
IV. CONCLUSION
The results presented in this paper confirm previous works, showing that neural networks are able to simulate chaotic systems, both for short term and long term predictions. We also show that the ESN network (reservoir computing) seems globally more efficient in this task than LSTM or TCN networks, in line with recent work. Our investigations allow us to assess the effect of the precision of the training data and of the precision of the network on the accuracy of the results. They show in a very consistent manner that the precision of the network matters more than the precision of the data on which it is trained. In view of the exponential instability of chaotic dynamics, where small errors are exponentially amplified by the dynamics, this may seem surprising. However, this is good news for practical applications, such as meteorology or climate simulations: the precision of the dataset is in many instances given by the precision of observations, which may be hard to ameliorate, while the precision of the network is controlled at the level of the algorithm used and may be increased at the cost of more computing time.

ACKNOWLEDGMENTS
We thank Gael Reinaudi for discussions. We thank CalMiP for access to its supercomputer. This study has been supported through the EUR grant NanoX ANR-17-EURE-0009 in the framework of the "Programme des Investissements d'Avenir".
Appendix A: The three machine learning approaches used
In this Appendix, we give an overview of the main features of the three neural networks that have been used in the article, namely the ESN, LSTM and TCN networks.
1. Reservoir computing: ESN network
FIG. 12. Schematic representation of an Echo State Network (ESN).
The first network we use corresponds to reservoir computing. We focus on a specific model called the Echo State Network (ESN). Reservoir computing methods were developed in the context of computational neuroscience to mimic the processes of the brain. Their success in machine learning comes from the fact that they are relatively cheap in computing time and have a simple structure. Their complexity lies in their training and in the adjustment of parameters to obtain the desired results. The structure of ESN networks is schematized in Fig. 12.

To train our ESN on a time-dependent signal u_n with n = 1, ..., T, where T is the duration of the sequence in discretized time, we must minimize a cost function between y_n^ref and y_n. Here y_n^ref is the output that we want to obtain with u_n, and y_n is the output of the network when we give it u_n as input. For the Lorenz problem, u_n, y_n^ref and y_n are 3D vectors. Generally, the cost function that one seeks to minimize is the error between the output of the network and the reference signal. This function is often in the form of a mean square error or, in our case, of the mean standard deviation.

The output of the network is calculated as follows:

y_n = W_out [u_n; x_n],   (A1)

where W_out is the output weight matrix that we are trying to train, [.;.] represents the concatenation, u_n is our vectorial input signal and x_n is the vector corresponding to the reservoir neuron activations. It has the dimension N of the reservoir and is calculated as follows:

x_n = (1 − α) x_{n−1} + α x̃_n,   (A2)

with x̃_n corresponding to the new value of x_n:

x̃_n = tanh(W_in [1; u_n] + W x_{n−1} + ε + µ),   (A3)

where α is the leaking rate, ε = −0.154 is an offset optimized on our set of data, µ is a random Gaussian variable of standard deviation equal to 2.… × 10^{−…}, W is the system reservoir and W_in is the input weight matrix of the reservoir. The dimension of W_in is N × (1 + 3), the +1 accounting for a constant bias input.
There are several important parameters that must be adjusted, depending on the problem we are studying, if we want our ESN to be able to predict our system. The first parameter we can play on is the size of the reservoir itself. The more complex the problem we want to deal with, the more the size of the reservoir will have an impact on the capacities of the network. A large reservoir will generally give better results than a small reservoir. Once the size of our reservoir has been chosen, we can play on the central parameter of an ESN: the spectral radius of the reservoir. Often denoted by ρ(W), this is the maximal absolute value of the eigenvalues of the matrix W. The spectral radius determines how quickly the influence of an input data point dissipates in the reservoir over time. If the problem being treated has no long-term dependency, there is no need to keep data sent far in advance; we can therefore ensure that the spectral radius is unitary. In some cases, if our problem has long-term dependencies, it is possible to have ρ(W) > 1. The last important parameter is the leaking rate α. It characterizes the speed at which we come to update our reservoir with the new data that we provide over time.

The matrices W and W_in are initialized at the start but are not modified during training. Only the output matrix W_out is trained:

W_out = Y_ref X^T (X X^T + β I)^{−1},   (A4)

where, for our Lorenz problem, X = (x_1, ..., x_T) (dimension N × T), Y_ref = (y_1^ref, ..., y_T^ref) (dimension 3 × T) and I is the identity matrix. As a result, the dimension of W_out is 3 × (N + 4). The training can depend strongly on the fluctuations of the data entering the W_out calculation. The parameter β makes it possible to limit this dependence by penalizing too large values of W_out. This is all the more true for a single precision network, which is more sensitive to these fluctuations and whose β must vary by several orders of magnitude depending on the size of the training data. In double precision (float64), β varies from 10^{−…} for 5000 training points to 10^{−…} for 5 × 10^{…} training points, against 10^{−…} to 10^{−…} in simple precision (float32). As the reservoir is not changed during training, one must choose the initialization hyperparameters to ensure a consistent output with the expected results. This requires adjusting the values of the leaking rate, spectral radius and input scaling as a priority. The optimization of these parameters has been done via a grid search where we decrease the search step as we find good parameters.

The initialization parameters are, for W_in, a density equal to d = 0.464,
with values randomly distributed from a Gaussian function of standard deviation σ = …. For W, we have chosen d_W = …, σ_W = … and ρ(W) = ….
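Putting Eqs. (A1)-(A4) together, a minimal NumPy sketch of the ESN reads as follows. The density d = 0.464, the offset ε = −0.154 and the reservoir size N = 300 follow the text; the leaking rate, input scaling, noise level and spectral radius below are placeholders, the exact values being lost in the source. The readout acts on the extended state [1; u_n; x_n], consistent with the 3 × (N + 4) dimension of W_out quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_esn(n_in=3, N=300, density=0.464, sigma_in=1.0, rho_w=0.9):
    """Random input and reservoir matrices; sigma_in and rho_w are
    placeholder values (the paper's exact ones are not available)."""
    W_in = rng.normal(0.0, sigma_in, (N, 1 + n_in))
    W_in *= rng.random((N, 1 + n_in)) < density          # sparsify
    W = rng.normal(0.0, 1.0, (N, N))
    W *= rng.random((N, N)) < density
    W *= rho_w / np.max(np.abs(np.linalg.eigvals(W)))    # set spectral radius
    return W_in, W

def run_reservoir(U, W_in, W, alpha=0.3, eps=-0.154, noise_std=0.0):
    """Collect reservoir states following Eqs. (A2)-(A3); alpha and
    noise_std are placeholders."""
    T, N = U.shape[0], W.shape[0]
    X = np.zeros((T, N))
    x = np.zeros(N)
    for n in range(T):
        u = np.concatenate(([1.0], U[n]))                # [1; u_n]
        x_tilde = np.tanh(W_in @ u + W @ x + eps
                          + noise_std * rng.normal(size=N))   # Eq. (A3)
        x = (1 - alpha) * x + alpha * x_tilde            # Eq. (A2)
        X[n] = x
    return X

def train_readout(U, X, Y_ref, beta=1e-8):
    """Ridge regression of Eq. (A4) on the extended states [1; u; x]."""
    Xe = np.hstack([np.ones((len(U), 1)), U, X]).T       # (N + 4, T)
    return Y_ref.T @ Xe.T @ np.linalg.inv(Xe @ Xe.T + beta * np.eye(len(Xe)))
```

With W_out in hand, the one-step prediction is y_n = W_out @ [1; u_n; x_n], which can be iterated in closed loop as in the sketch of Sec. III B.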
2. LSTM and TCN networks
FIG. 13. General structure of Recurrent Neural Networks (RNN).
The two other networks we use are based on Recurrent Neural Network (RNN) architectures. RNNs can be represented as a single module chain (see Fig. 13). The length of this chain depends on the length of the sequence that is sent to the input. The output of the previous module serves as input for the next module, in addition to the data on which we train our network. This allows the network to keep in memory what has been sent previously.

The major problem in this type of network is the exponential decrease of the gradient during the training of the network. This is due to the nature of the back-propagation of the error in the network. The gradient is the value calculated to adjust the weights in the network, allowing the network to learn. The larger the gradient, the greater the adjustments in the weights, and vice versa. When applying back-propagation to the network, each layer calculates its gradient from the effect of the gradient in the previous layer. If the gradient in the previous layer is small, then the gradient in the current layer will be even weaker. For a very small gradient, the first layers in the network therefore receive almost no adjustment of their weight matrices, as the short numerical illustration below shows.
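A toy numerical illustration (ours) of this attenuation: with sigmoid activations each layer contributes a Jacobian factor bounded by 0.25, so the back-propagated gradient shrinks exponentially with depth.

```python
# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) <= 0.25, so an upper bound
# on the gradient reaching the first layer of a depth-d chain is 0.25**d
factor_per_layer = 0.25
for depth in (5, 10, 20, 50):
    print(depth, factor_per_layer ** depth)
# depth 5 -> ~1e-3, 10 -> ~1e-6, 20 -> ~1e-12, 50 -> ~8e-31:
# the early layers receive essentially no weight update.
```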
FIG. 14. Schematic representation of a Long Short Term Memory (LSTM) network: structure of one elementary cell.
To solve this problem of attenuation of the corrections, LSTM (Long Short Term Memory) networks have been explicitly developed. They can also be represented as a module chain but, unlike conventional RNNs, they have a more complex internal structure, composed of four layers which interact with each other (see Fig. 14).

The first layer is called the "input gate" and is represented by a horizontal line that runs through the entire cell. It allows data to easily browse the entire network. This structure represents the state of the cell over time. On this line there are other structures which are used to modify the data going through the cell.

The next step in our network is the forget gate structure. It consists of a neural network with an activation function of the sigmoid type and makes it possible to decide which part will be kept in the cell:

f_t = σ(W_f [h_{t−1}, u_t] + b_f),   (A5)

where W_f and b_f are the weights and bias of the network for the forget gate layer, u_t is the input data at time t and h_{t−1} is the hidden state output by the previous cell.

The second step is to decide what to store. This structure consists of two parts. The first part is a neural network with an activation function of the sigmoid type, which allows us to choose which value will be updated:

i_t = σ(W_i [h_{t−1}, u_t] + b_i),   (A6)

where W_i and b_i are the weights and bias of the sigmoid network for the update gate layer. The second part is another neural network, this time with an activation function of the hyperbolic tangent type, which creates the new candidate state C̃_t:

C̃_t = tanh(W_c [h_{t−1}, u_t] + b_c),   (A7)

where W_c and b_c are the weights and bias of the hyperbolic tangent network. The new cell state C_t is then computed by combining the outputs of the forget gate and of the update structure:

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t,   (A8)

where f_t is the output of the forget gate layer, i_t is the input layer choosing which values are going to be updated, C̃_t is the new cell state candidate, C_{t−1} is the cell state of the previous cell and ∗ denotes the elementwise product. The structure described above is then repeated from cell to cell.

A final structure makes it possible to determine the output of the network. The output is based on the state of the cell, to which we apply a network with a sigmoid activation function to choose which part will be returned. Then we apply a hyperbolic tangent function to the cell state and multiply it with the previous value to get the new hidden state:

o_t = σ(W_o [h_{t−1}, u_t] + b_o),   (A9)
h_t = o_t ∗ tanh(C_t).   (A10)

h_t is then sent into a linear layer for the prediction of y_t.
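Eqs. (A5)-(A10) translate directly into a one-step cell update. The sketch below (ours, in plain NumPy with hypothetical parameter names, not the Tensorflow implementation used in the paper) makes the data flow explicit.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(u_t, h_prev, C_prev, p):
    """One LSTM step following Eqs. (A5)-(A10); p holds the weight
    matrices W_f, W_i, W_c, W_o (acting on [h_{t-1}, u_t]) and biases."""
    hu = np.concatenate([h_prev, u_t])                   # [h_{t-1}, u_t]
    f_t = sigmoid(p["W_f"] @ hu + p["b_f"])              # forget gate, (A5)
    i_t = sigmoid(p["W_i"] @ hu + p["b_i"])              # input gate, (A6)
    C_cand = np.tanh(p["W_c"] @ hu + p["b_c"])           # candidate, (A7)
    C_t = f_t * C_prev + i_t * C_cand                    # cell state, (A8)
    o_t = sigmoid(p["W_o"] @ hu + p["b_o"])              # output gate, (A9)
    h_t = o_t * np.tanh(C_t)                             # hidden state, (A10)
    return h_t, C_t
```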
FIG. 15. Schematic representation of a Temporal Convolutional Network (TCN).

The third architecture we use consists of TCN networks, which use causal convolutions, meaning that at time step t the network can only access data points at time step t and earlier, ensuring no information leakage from the future to the past (see Fig. 15). The ability of causal convolutions to keep previous data points in memory depends linearly on the depth of the network. This is why we use dilated convolutions to model the long term dependencies of our system, as they enable an exponentially large receptive field depending on the depth of the network. This enables TCNs to function in a way similar to RNNs. For an input sequence U of size T (with elements u_n), the dilated causal convolution H we use is defined as

H(u)_n = (U ∗_d h)(u)_n = Σ_{i=0}^{k−1} h(i) u_{n−d·i},   (A11)

where d is the dilation factor and h ∈ ℝ^k is a filter of size k; the indices n − d·i point towards the past. Using a larger dilation factor enables an output at the top level to represent a wider range of inputs, thus effectively expanding the receptive field of the network.
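Eq. (A11) can be sketched directly (our illustration, written for a scalar sequence; the networks of the paper apply such filters channel-wise inside Tensorflow).

```python
import numpy as np

def dilated_causal_conv(u, h, d):
    """Eq. (A11): H(u)_n = sum_{i=0}^{k-1} h(i) * u[n - d*i]; only the
    present and past samples (indices n - d*i >= 0) contribute."""
    u = np.asarray(u, dtype=float)
    out = np.zeros(len(u))
    k = len(h)
    for n in range(len(u)):
        for i in range(k):
            j = n - d * i
            if j >= 0:                                   # causal
                out[n] += h[i] * u[j]
    return out
```

Doubling d from one layer to the next makes the receptive field grow exponentially with depth, which is the property exploited here.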
The LSTM and TCN networks are more complex and more demanding in computation than the ESN. We have set up these networks with the Tensorflow library. For a trajectory made of N_i time step iterations, we use 35 successive points of the trajectory to predict the next step. In this way, we build a predicting vector of dimension N_i − 35. We use batches of 32 successive values of this vector to update the network parameters with the gradient back-propagation algorithm (using the Adam optimizer with an exponential learning rate decay). This process is performed over all the values of the predicting vector, and repeated 10 times (number of epochs equal to 10). One has indeed to make several passes over the training data to get good results. On average, an epoch takes 30 seconds. Testing the different possible architectures therefore takes more time than for the ESN.
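A minimal Tensorflow sketch of this training protocol (ours; the window of 35 points, the batch size of 32, Adam and the exponential decay follow the text, while the layer choice and schedule values are placeholders):

```python
import numpy as np
import tensorflow as tf

WINDOW = 35   # successive points used to predict the next one

def make_dataset(traj):
    """Slice an (N_i, 3) trajectory into (window, next point) pairs,
    batched by 32 as described above."""
    u = np.stack([traj[i:i + WINDOW] for i in range(len(traj) - WINDOW)])
    y = traj[WINDOW:]
    return tf.data.Dataset.from_tensor_slices((u, y)).batch(32)

# Hypothetical stand-in for the LSTM variant described in the text.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(WINDOW, 3)),
    tf.keras.layers.Dense(3),
])
lr = tf.keras.optimizers.schedules.ExponentialDecay(1e-3, 10_000, 0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
# model.fit(make_dataset(traj), epochs=10)
```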
[1] D. Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature 529, 484 (2016).
[2] G. Hinton et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Process. Mag. 29, 82 (2012).
[3] Y. Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, preprint arXiv:1609.08144 (2016).
[4] J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data, Chaos 27, 121102 (2017).
[5] J. Pathak, B. R. Hunt, M. Girvan, Z. Lu, and E. Ott, Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach, Phys. Rev. Lett. 120, 024102 (2018).
[6] Z. Lu, B. R. Hunt, and E. Ott, Attractor reconstruction by machine learning, Chaos 28, 061104 (2018).
[7] M. Lukosevicius and H. Jaeger, Reservoir computing approaches to recurrent neural network training, Comput. Sci. Rev. 3, 127 (2009).
[8] Y. Tang, J. Kurths, W. Lin, E. Ott, and L. Kocarev, Introduction to Focus Issue: When machine learning meets complex systems: Networks, chaos, and nonlinear dynamics, Chaos 30, 063151 (2020).
[9] P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos, Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks, Proc. R. Soc. A 474, 20170844 (2018).
[10] P. G. Breen, C. N. Foley, T. Boekholt, and S. P. Zwart, Newton versus the machine: solving the chaotic three-body problem using deep neural networks, Mon. Not. R. Astron. Soc. 494, 2465 (2020).
[11] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, Backpropagation Algorithms and Reservoir Computing in Recurrent Neural Networks for the Forecasting of Complex Spatiotemporal Dynamics, Neural Networks 126, 191 (2020).
[12] A. Chattopadhyay, P. Hassanzadeh, and D. Subramanian, Data-driven prediction of a multi-scale Lorenz 96 chaotic system using deep learning methods: Reservoir computing, ANN, and RNN-LSTM.
[13] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, LSTM: A Search Space Odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28, 2222 (2017).
[14] S. Bai, J. Z. Kolter, and V. Koltun, An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, preprint arXiv:1803.01271 (2018).
[15] E. N. Lorenz, Deterministic Nonperiodic Flow, J. Atmos. Sci. 20, 130 (1963).
[16] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, preprint arXiv:1609.03499 (2016).