On Explaining the Surprising Success of Reservoir Computing Forecaster of Chaos? The Universal Machine Learning Dynamical System with Contrasts to VAR and DMD
Erik Bollt
Department of Electrical and Computer Engineering, Clarkson University, Potsdam, NY 13699, USA
Clarkson Center for Complex Systems Science (C³S²), Potsdam, NY 13699, USA

Abstract
Machine learning has become a widely popular and successful paradigm, including in data-driven science and engineering. A major application problem is forecasting the future states of a complex dynamical system, given a cache of data representing time-ordered observations of states from that system. Artificial neural networks (ANN) have evolved as a clear leader amongst many machine learning approaches, and recurrent neural networks (RNN) are especially well suited because of an aspect of memory associated with the concept, even if the major step of training the RNN to data, which typically involves backpropagation and optimization, becomes computationally especially intensive. In this setting, the echo state network (ESN), or reservoir computer (RC), has emerged for its simplicity, since it is a special case of an RNN that is not fully trained to the data. Instead, only the readout weights are trained, while the read-in weights and internal weights are simply selected randomly. This represents a major computational savings, because training them is inherently a nonlinear optimization problem, and so very expensive, as there are typically a massive number of weights to train. The readout weights, however, can be trained by a simple least squares step, so the RC is simple and efficient to train. What is perhaps quite surprising is that an RC nonetheless succeeds as a methodology for making high quality forecasts, competitively with more intensively trained methods, even if it is not the leader. The RC is a clear leader in efficiency and simplicity, and in many settings it has sufficient quality. There remains an unanswered question as to why and how an RC works at all, with randomly selected weights. To this end, this work analyzes a further simplified RC, where the internal activation function is an identity function instead of a sigmoid function, so that in this setting we can more fully understand the fitting process that occurs. Specifically, we are then able to connect the RC to the well developed time-series literature on vector autoregression (VAR), which includes theorems on representability through the Wold theorem. Further, we can associate this paradigm with the now widely popular dynamic mode decomposition (DMD), and thus these three are, in a sense, different faces of the same thing. Our simplification is not presented for the sake of tuning or improving an RC, but rather for the sake of analysis, the surprise being not that it does not work better, but that such random methods work at all. We illustrate our observations in terms of two popular benchmark examples, the Mackey-Glass differential delay equation and the Lorenz63 system.
Key Words: linear reservoir computing, RC, neural network, recurrent neural network, RNN, machine learning, vector autoregression, VAR, Wold theorem, dynamic mode decomposition, DMD.
The power and success of artificial neural networks has been profound across many disciplines, including in dynamical systems. A leader amongst methodologies for forecasting has been the recurrent neural network (RNN), for its aspects of memory. However, because of the large number of parameters to train to data observations, and likewise the nonlinear nature of the associated optimization process, the training phase can be computationally extremely intensive. The echo-state, reservoir computing (RC) concept is a significant simplification where only the output weights are trained, and in a manner that allows for a straightforward and cheap least squares method; otherwise, the rest of the weights, those of the input layer and those of the inner layers, are simply selected randomly. It is clear that this would be cheaper to train, but what is not clear, and perhaps a surprise, is that it would work at all; but work it does. With a simplification of the concept to allow for a linear activation function, while the performance is not quite as good, it does still work, and now we are able to analyze in detail the role of the randomly selected parameters and how there is still freedom in fitting a well defined time-series forecasting model, which in fact is equivalent to the well developed theory of vector autoregression (VAR). Within the VAR and related VMA theory we recall the Wold theorem that allows us to discuss representation, and, as we show, it is now relevant to the RC for machine learning. Also, with this description, we are able to connect to the recently highly popular DMD concept.

1 Introduction

Artificial neural networks (ANN) have emerged as a core and powerful technology in machine learning [22, 46, 56, 57, 59] that is well suited for supervised learning in data-driven science and engineering, specifically including forecasting problems in complex dynamical systems [25, 42, 36, 47, 13, 44, 34]. However, even the most straightforward feedforward ANN with backpropagation for training can be extremely expensive to optimize to the data, even considering important recent innovations such as stochastic gradient descent or hardware breakthroughs such as GPU-based processing. Recurrent neural network (RNN) concepts are especially suitable for temporal data from a dynamical system [25, 41, 5, 6, 20, 68], as they naturally embed temporal information, and especially the long short-term memory (LSTM) approach demonstrates excellent fidelity [82, 35, 81, 20, 18, 86], but these are especially expensive to fully train [63]. Reservoir computing (RC) [38, 52, 80] and the closely related echo state network (ESN) [37, 51] and liquid state machine (LSM) [54, 31] have emerged as a special variant of an RNN, where only the output layer is trained rather than the entire network of weights. As such, this requires only a simple and efficient least squares estimation, rather than the more expensive full nonlinear optimization associated with fully training an RNN. Nonetheless, and perhaps a most surprising outcome, is that despite this gross simplification, the forecasting capability can still be competitive even for chaotic or spatiotemporally complex problems [82, 64, 88, 50, 16, 17, 27]. Specifically, an RC thrives when a full state observation is available, while fuller and more expensive variants of RNN, especially the LSTM, would be considered higher performing, especially when only a reduced variable set is available [81, 82].
Still, the RC is popular, surely because of its simplicity to train, and perhaps in part because of its undeniable, even if surprising, fidelity. The purpose of this work is to bridge the theoretical gap, to offer at least a partial explanation as to how an RC can be such a successful and general universal dynamical system for forecasting such a wide array of systems despite randomly "trained" read-in and inner weights. The purpose of this paper is not specifically to build a new method, or to improve the current method, but to explain what is perhaps surprising, that the RC method works at all. In so doing, we challenge the concept with a simplified, linear activation function version, because this allows our simplified analysis throughout, and even if this version has reduced fidelity, we show that there are theoretical reasons it still works. There have been few significant explanations as to why the RC works so well, but notably [29, 15, 21]. Usually instead we find in the literature a collection of descriptions as to how to choose the random networks of the inner layers, regarding sparsity [75, 30], or regarding design of the spectral radius for linear stability [17, 39, 30] and the echo property [26]. In the spirit of a still incomplete theoretical basis for the underlying success of RC, we allow a simplified version of RC with linear activation functions, for which we are able to more fully identify the inner workings of how the RC can be a universal forecasting machine, even for time-series from complex and chaotic dynamical systems. We show that with this simplification the RC still works, albeit with reduced performance quality, but nonetheless the purpose here is theoretical explanation of how such a simple system of training only the readout is possible. We offer this variant as a theoretical construction. By this interpretation, we will also be able to connect the RC concept to other, theoretically more mature, theories. Specifically, the theory of autoregression (AR) from time-series analysis and moving averages (MA), together called ARMA [62, 19, 12, 66, 78, 74], is founded on the Wold theorem [84], which we show is directly related to the RC concept. The vector formulations of these [67, 53], called vector autoregression (VAR) and vector moving averages (VMA), are also connected by a corresponding Wold theorem. Further, we describe a relationship to the recently highly popular dynamic mode decomposition (DMD) [72, 83, 43, 47, 8], which is an empirical formulation of Koopman spectral theory [8, 3, 10]. So while we do not offer this simplified RC for performance over other approaches, we hope that this work will serve to shed light on how the simplified RC approach is capable of providing useful time-series forecasts, and likewise as a suggestion as to how the general RC is successful.

This paper is arranged as follows. In Sec. 2 we describe the nature of the data as derived from a stochastic process. In Sec. 3, we review the standard RC concept, and we demonstrate it with time-series data from the Mackey-Glass differential delay equation. Sec. 4 is the heart of this paper, where we first present that a linear activation function allows the RC to be stated as a linear recursion, and therefore fitting just the readout weights can proceed in a manner such that, despite random read-in and inner weights, there is a well posed problem. Then, in this form we are able to directly relate the linear RC solution to the classical VAR(k) solution.
As such, we are then able to enlist statistical time-series theory for forecasting stochastic processes, so that the Wold theorem, which guarantees a VMA, can then be translated to a VAR. Furthermore, the associated companion form of a VAR(1) usefully states the full vectorized problem. In Sec. 6, we note that the companion form of the VAR(1) is reminiscent of prior work on another famous concept in data-driven dynamical systems, which is the time-delay formulation of a DMD-Koopman analysis. Finally, in the examples Sec. 7, we present two classical examples, the Mackey-Glass differential delay equation and the Lorenz63 ordinary differential equations, with examples comparing aspects of a full nonlinear RC and the linear variant of an RC.
Figure 1: Time-series acquired from the Mackey-Glass differential delay equation, Eq. (56), has become a standard example for time-series forecasting and for benchmarking data-driven methods, since it is dynamically rich and high-dimensional and therefore challenging. (Top) The time-series versus time index. (Bottom) Three-dimensional projection in delay coordinates, (x(t), x(t − τ), x(t − 2τ)), τ = 20. A sample of N = 10,000 data points is chosen as the training data set.

2 The Data: Time-Series From a Stochastic Process

For data-driven forecasting problems, we require data from a process, whether from a deterministic or otherwise from a stochastic dynamical system [11]. A process, stated,
{X_t : t ∈ T},   (1)
is defined in terms of a collection of random variables X_t on a common probability space (Ω, B, P), where Ω is the sample space, B the σ-algebra, and P a corresponding probability measure. T is a "time" index set, and commonly it is chosen as either R, or Z, or subsets thereof. For the sake of discussing finite samples of data, we emphasize maps, which may well arise from discretely sampling a flow. A data set from such a process samples x_{t_i} of X_{t_i}, stated as a time-sorted sample {x_i}_{i=1}^N, t_1 < t_2 < ⋯ < t_N, using the indexing notation x_i := x_{t_i}. Uniform timing is also a simplifying assumption, h = t_{i+1} − t_i, for all t_i ∈ T. We assume a vector real valued time-series of dimension d_x, {x_i}_{i=1}^N ⊂ R^{d_x}. Data derived from a flow, say,
\dot{x} = f(x),   (2)
may be collected by a stroboscopic map,
x_{i+1} = F(x_i) = x(t + τ) = x(t) + \int_t^{t+τ} f(x(s)) \, ds.   (3)
Suppressing the stroboscopic time t, this is a discrete time map F, and likewise other Poincaré maps may be useful for flight between surfaces of section, and random dynamical systems may also be relevant [73, 11]. An underlying principle here is that the data should be "long enough," and likewise, any data-driven machine learning method for forecasting a stochastic process will tend to do much better in terms of interpolation than extrapolation. Generalizing to allow for out-of-sample forecasts will tend to fare much better when the point to be forecast is close to other observed inputs. Said another way, the quality of results can be brittle, depending as much upon curating a representative data set as upon the details of the method used, to avoid the struggle between fitting between observations, overfitting, and forecasting too far out of sample.

As a matter of presenting examples, we will highlight two classic problems that remain popular in benchmarking for machine learning in recent literature. These will be,
• The Mackey-Glass differential delay equation, Eq. (56), and
• The Lorenz63 system, Eq. (57),
both of which will be presented in fuller detail in Sec. 7. In Fig. 1 we show, early in this presentation for the sake of context, a time-series data set of the Mackey-Glass system, from Eq. (56), to stand in as a typical data set. This problem is a useful benchmark, and it is often used as such [55, 32, 2, 60, 9, 24, 85], perhaps because it is a well known chaotic process, but also for the sake of dimensional complexities that we recall in Sec. 7.1.
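To make the stroboscopic sampling of Eq. (3) concrete, the following is a minimal sketch in Python (assuming only NumPy; the vector field, step sizes, sample count, and initial condition are illustrative choices, here borrowing the Lorenz63 equations of Eq. (57) as a stand-in flow):

```python
import numpy as np

def f(x):
    # Lorenz63 vector field, Eq. (57), used here only as a stand-in flow x' = f(x).
    return np.array([10.0 * (x[1] - x[0]),
                     x[0] * (28.0 - x[2]) - x[1],
                     x[0] * x[1] - (8.0 / 3.0) * x[2]])

def rk4_step(x, dt):
    # One fourth-order Runge-Kutta step.
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def stroboscopic_samples(x0, h, dt, N):
    # Sample the flow every h time units, the stroboscopic map of Eq. (3),
    # integrating with substeps of size dt <= h.
    x, samples = np.array(x0, dtype=float), []
    for _ in range(N):
        for _ in range(int(round(h / dt))):
            x = rk4_step(x, dt)
        samples.append(x.copy())
    return np.array(samples)                  # shape (N, d_x): the time-sorted data {x_i}

data = stroboscopic_samples([1.0, 1.0, 1.0], h=0.02, dt=0.002, N=10000)
```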
3 Review of the Standard Nonlinear RC Method

In this section we review the standard and fully nonlinear RC method, by which we mean, including the use of a nonlinear activation function q(s). In this context, q(s) is usually taken to be a sigmoidal function such as the hyperbolic tangent function. However, in the next section we will challenge these steps, including simplifying to the identity function, q(s) = s.

Assuming the training data {x_i}_{i=1}^N ⊂ R^{d_x}, the reservoir computing RNN is stated,
r_{i+1} = (1 − α) r_i + α q(A r_i + u_i + b),  y_{i+1} = W^{out} r_{i+1}.   (4)
The hidden variable r_i ∈ R^{d_r} is generally taken to be of a much higher dimension, d_r > d_x, by a linear lifting transformation,
u_i = W^{in} x_i,   (5)
where W^{in} is a randomly selected d_r × d_x matrix of weights. See Fig. 2.

Figure 2: Reservoir computing (RC) as defined by Eq. (4), including a randomly selected d_r × d_x read-in matrix W^{in}, a randomly selected d_r × d_r inner-layer recurrence matrix A acting on the inner states r, and the d_x × d_r trained read-out matrix W^{out}.

A is also a linear transformation, a randomly chosen square d_r × d_r matrix of weights, which should be designed with certain properties, such as its spectral radius for convergence [17, 39], or sparsity [50, 64, 82], or otherwise with consideration of the "echo-state" property [15]. Likewise, the readout is by a linear transformation, using a d_x × d_r matrix of weights W^{out}. However, W^{out}, and only W^{out}, is trained to the data, allowing for forecasts y_i given data x_i ∈ R^{d_x}, which is the major simplifying aspect of RC, since it can be done by a simple and cheap least squares computation. Finally, q: R → R is an "activation" function, using the phrasing from the neural network community in machine learning, to mimic the concept of a biological neuron that fires when a voltage has reached a threshold. Popular choices include q(s) = tanh(s), meaning a componentwise application of the scalar hyperbolic tangent function when s is multivariate. Other activations are popular in general neural network theory, including other sigmoidal functions, and also the ReLU function in certain contexts, but not so commonly in RC [27]. The leak rate 0 ≤ α ≤ 1 moderates how strongly the hidden state is updated; we take α = 1 in this paper, since a general leak term is outside the purpose of explaining how the RC may work in the special case of identity q, in which case the leak can be considered as absorbed into the random A: (1 − α) r + α A r = ((1 − α) I + α A) r, and since A is chosen randomly, (1 − α) I + α A may be taken as an alternative random selection. Finally, b serves as an offset for activation, which is useful in some contexts, but it is also not relevant for our needs, for the same reason, and we choose b = 0.

What is remarkable about RC is that the usual hard work of optimally developing a full RNN is almost entirely skipped. Instead of learning W^{in} and A optimally fitted to the data, these seemingly very important matrices are simply picked randomly. This is an enormous savings over what would usually be inherently hard to handle, since the parameters are composed within the nonlinear activation q and require at least a gradient descent optimization by back-propagation in a high dimensional and likely multi-peaked optimization space. Almost any matrix distribution may plausibly do, but several different recipes are suggested. We say "recipe" rather than algorithm, since these are descriptions of successful observations in practice, rather than a product of mathematical theory, which is still not complete.
Here, we choose the entries of A uniformly, A_{i,j} ∼ U(−β, β), with β used to scale the spectral radius, but other choices are common, notably regarding sparsity. The read-in matrix is also chosen uniformly randomly, W^{in}_{i,j} ∼ U(0, γ), with γ > 0 a scale parameter moderating the typical magnitude of the internal states r.

Figure 3: Standard nonlinear RC, one-time-step forecasts from the Mackey-Glass differential delay equation, Eq. (56), using a training data set of N = 10,000 samples as shown in Fig. 1. (Top) Time-series data, of which 5,000 samples are shown for clearer illustration. (Middle) Reservoir trained across the data set, with 500 samples shown for clarity, where we see the error is sufficiently small that the one-time-step forecasts and the true data are almost the same, so that the curves are indistinguishable (both shown, but the curves overlay). The regularization is chosen to be λ = 1.0 × 10^{-6}. (Bottom) A randomly selected 7 of the (usually hidden) d_r = 500 activation functions illustrate the general appearance. Contrast to forecasting into the future as shown in Fig. 4, and to the linear method in Fig. 5.

The crucial aspect of the simplification that makes reservoir computing so easy and computationally efficient is that training of the output becomes just a linear process. The cheap and simple least squares solution is easily handled directly by matrix computations. Let,
W^{out} = \arg\min_{V ∈ R^{d_x × d_r}} \| X − V R \|_F = \arg\min_{V ∈ R^{d_x × d_r}} \sum_{i=k+1}^{N} \| x_i − V r_i \|_2^2, k ≥ 1.   (6)
Notation here is standard, that \| · \|_F denotes the Frobenius norm of a matrix, which is the matrix equivalent of the least squares parameter estimation problem. The data {x_i}_{i=1}^N is stated as a d_x × (N − k) array,
X = [ x_{k+1} | x_{k+2} | ⋯ | x_N ] = [ V r_{k+1} | V r_{k+2} | ⋯ | V r_N ] = V R, k ≥ 1,   (7)
to be optimized in least squares by W^{out}, processed through the RC,
R = [ r_{k+1} | r_{k+2} | ⋯ | r_N ], k ≥ 1.   (8)
While k = 1 is allowable, here, for theoretical development in subsequent sections, we allow for larger k ≥ 1. The solution of this least squares problem, in ridge-regularized form, is
W^{out} := X R^T ( R R^T + λ I )^{-1}.   (9)
Notation includes that ·^T is the matrix transpose, I is the identity matrix, and the choice of regularity parameter is λ ≥ 0. We will write a regularized pseudo-inverse with the notation,
R^†_λ := R^T ( R R^T + λ I )^{-1}.   (10)
In Appendix 11 we review the matrix theory of how to form regularized pseudo-inverses such as R^†_λ by a regularized singular value decomposition (SVD), in terms of regularized singular values such as σ_i / (σ_i^2 + λ) obtained from the singular values σ_i of the SVD of R.
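To make the training step of Eqs. (4)-(10) concrete, the following is a minimal sketch in Python (assuming only NumPy; the reservoir size, uniform distributions, scalings, and the toy signal are illustrative choices in the spirit of the recipes above, not the settings used to produce the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

def drive_reservoir(X, d_r, gamma=0.1, rho=1.0, q=np.tanh):
    # X is the d_x x N data matrix; returns the d_r x N reservoir states,
    # r_{i+1} = q(A r_i + W_in x_i), i.e. Eq. (4) with alpha = 1 and b = 0.
    d_x, N = X.shape
    W_in = rng.uniform(0.0, gamma, size=(d_r, d_x))
    A = rng.uniform(-1.0, 1.0, size=(d_r, d_r))
    A *= rho / max(abs(np.linalg.eigvals(A)))        # scale the spectral radius to rho
    R, r = np.zeros((d_r, N)), np.zeros(d_r)
    for i in range(N - 1):
        r = q(A @ r + W_in @ X[:, i])
        R[:, i + 1] = r
    return R, W_in, A

def train_readout(X, R, k=1, lam=1e-6):
    # W_out = X R^T (R R^T + lam I)^{-1}, the ridge-regularized solve of Eqs. (6)-(9).
    Xt, Rt = X[:, k:], R[:, k:]
    return Xt @ Rt.T @ np.linalg.inv(Rt @ Rt.T + lam * np.eye(R.shape[0]))

X = np.sin(0.1 * np.arange(2000))[None, :]           # toy d_x = 1 signal
R, W_in, A = drive_reservoir(X, d_r=100)
W_out = train_readout(X, R)
print(abs(W_out @ R[:, -1] - X[:, -1]))              # in-sample one-step fit error
```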
In Fig. 3, we show an example of an RC machine obtained from data of the Mackey-Glass differential delay equation, Eq. (56). We see fitting for N = 10,000 data points x(t), with d_x = 1, regularizing parameter λ = 1.0 × 10^{-6}, and fitting for a constant time offset. Fit and true data are shown to be so close that in fact the fit curve hides the true data curve. Also shown are several (7) of the d_r = 500 hidden variables r(t). The fit matrix A is randomly chosen with entries from a uniform distribution, and then scaled so that the spectral radius ρ(A) = 1. The random matrix W^{in} is also chosen uniformly, scaled so that the x values lead to small internal values r. The reservoir dimension d_r > d_x must be "large enough," but how big is not well understood. Furthermore, the nature of the underlying distribution of the matrices W^{in} and A is not fully understood. We hope to contribute some general perspective as to why an RC may work at all, even though our goal here will not be to handle aspects of improving performance.

4 Linear Activation, q(s) = s, Yields a VAR(k)

Now we attempt to challenge a central typical assumption of the RC method. Instead of choosing the activation function to be a sigmoid function, we use the identity function, q(s) = s.

Figure 4: Standard nonlinear RC, forecasts into the future, from the Mackey-Glass differential delay equation, Eq. (56), using a training data set of N = 10,000 samples as shown in Figs. 1, 3. (Top) Time-series data, zoomed to 0 ≤ t ≤ 500 for clearer illustration; forecasts into the future (red) diverge from the true data (blue), and (Bottom) the error is shown.

With this assumption, we can show that the resulting linear RC machine is equivalent to a vector autoregressive process (VAR) [67, 33], which is extremely popular and successful in the time-series forecasting field, particularly in econometrics [1]. With this simplification, we find that not only can the linear RC still make useful forecasts, but we are able to connect the RC concept to the well established theory associated with VAR time-series analysis, notably the Wold existence and representation theorem [84, 62]. However, while this gives some explanation as to why a standard nonlinear RC may work despite the seeming oversimplification of a full RNN, we show that the linear RC does still perform, and furthermore now with theoretical underpinnings, even if the full nonlinear RC may still perform better. So it is for the theoretical connections that we make this simplification, rather than as a suggestion that it may be a new or simpler method.

Before proceeding with a discussion of q(s) = s, notice that the scale of r is related to the scale of the read-in matrix W^{in}. Proceed by initializing the process, by Eq. (5), u_1 = W^{in} x_1, but also we choose,
r_1 = 0.   (11)
Consider that since W^{in} is randomly chosen, and we choose uniformly W^{in}_{i,j} ∼ U(0, γ), then the parameter γ > 0 moderates the typical magnitude of u_i and then of r_i. See for example Fig. 3, where the native data x from the Mackey-Glass system is translated to scaled internal variables. Recall the power series of the nonlinear activation function,
q(s) = \tanh(s) ≈ s − s^3/3 + 2 s^5/15 − ⋯.   (12)
Clearly, for s ≪ 1, q(s) ∼ s even if q is chosen as a sigmoid, and the choice of read-in scale could be designed to put us in this regime, as long as A is designed to keep us in this regime. However, in the following we proceed to study the consequences of stating the activation exactly as the identity,
q(s) = s.   (13)
With this assumption, the first several iterations follow from Eq. (4) and Eq. (11) as a forward propagation, for which we explicitly observe the following recursion,
r_2 = A r_1 + u_1 = u_1 = W^{in} x_1,   (14)
r_3 = A r_2 + u_2 = A W^{in} x_1 + W^{in} x_2,   (15)
r_4 = A r_3 + u_3 = A (A r_2 + u_2) + u_3 = A^2 W^{in} x_1 + A W^{in} x_2 + W^{in} x_3,   (16)
⋮
r_{k+1} = A r_k + u_k = A (A r_{k-1} + u_{k-1}) + u_k = ⋯ = A^{k-1} W^{in} x_1 + A^{k-2} W^{in} x_2 + ⋯ + A W^{in} x_{k-1} + W^{in} x_k   (17)
= \sum_{j=1}^{k} A^{j-1} u_{k-j+1} = \sum_{j=1}^{k} A^{j-1} W^{in} x_{k-j+1},   (18)
using the notation A^0 = I, the identity matrix. Since the readout of this process is by Eq. (4), y_i = W^{out} r_i, we may rewrite the final equation, Eq. (18), by left multiplying by W^{out},
y_{k+1} = W^{out} r_{k+1} = \sum_{j=1}^{k} W^{out} A^{j-1} W^{in} x_{k-j+1} = W^{out} A^{k-1} W^{in} x_1 + ⋯ + W^{out} A W^{in} x_{k-1} + W^{out} W^{in} x_k = a_k x_1 + a_{k-1} x_2 + ⋯ + a_2 x_{k-1} + a_1 x_k,   (19)
with the notation,
a_j = W^{out} A^{j-1} W^{in}, j = 1, 2, ..., k.   (20)
Each of these coefficients a_j is a d_x × d_x matrix.
This follows simply by Eq. (20), collecting products of d_x × d_r, d_r × d_r, and d_r × d_x matrices, with the notation A^l = \prod_{i=1}^{l} A = A · A ⋯ A, l times if l > 0, or the identity matrix when l = 0.

By Eq. (19), a linear RC yields a classical VAR(k) (a vector autoregression model of k delays), which in a general form is [67],
y_{k+1} = c + a_k x_1 + a_{k-1} x_2 + ⋯ + a_2 x_{k-1} + a_1 x_k + ξ_{k+1}.   (21)
In this writing, c allows for a general offset term, a d_x × 1 vector, and ξ_{k+1} is the underlying "noise" of the stochastic process, which, as part of the stability theory we review in the next section, must be assumed to come from a covariance stationary process. This relationship between an RC and a VAR(k) allows us to relate to the corresponding theoretical discussions of relevant alternative forms, stability, and convergence from the stochastic process time-series literature, which we will also expand upon in the next section.

Considering the complete data set of the vector time-series {x_i}_{i=1}^N yields,
[ y_{k+1} | y_{k+2} | ⋯ | y_N ] = [ a_1 | a_2 | ⋯ | a_k ] \begin{bmatrix} x_k & x_{k+1} & ⋯ & x_{N-1} \\ x_{k-1} & x_k & ⋯ & x_{N-2} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_1 & x_2 & ⋯ & x_{N-k} \end{bmatrix}.   (22)
Restating this as a single linear equation,
Y = a X.   (23)
Again, remembering that the x_i are d_x × 1 vectors and the a_i are d_x × d_x matrices, a = [ a_1 | a_2 | ⋯ | a_k ] is a d_x × (k d_x) matrix, Y = [ y_{k+1} | y_{k+2} | ⋯ | y_N ] is a d_x × (N − k) matrix, and X is a (k d_x) × (N − k) matrix. Notice that we have stated the target values, which in practice are chosen from the data, y_{k+1} = x_{k+1}. Formally, minimizing in least squares, with regularization,
J(a) = \| Y − a X \|_F^2 + λ \| a \|_F^2,   (24)
with Y being the target output in Eq. (22), fitted by the best matrix a^*. The solution of this regularized least squares problem may be written in its matrix form,
a^* = Y X^T ( X X^T + λ I )^{-1} := Y X^†_λ,   (25)
where the symbol † refers to the Penrose pseudo-inverse, with notation described in detail in Eqs. (62)-(63), when formulating the "ridge," Tikhonov-regularized, pseudo-inverse X^†_λ.
Figure 5: The fully linear RC, q(s) = s, forecasts from the Mackey-Glass differential delay equation, Eq. (56), using the same training data set of N = 10,000 samples as shown in Fig. 1. Contrasting to the forecasts into the future shown in Fig. 4, we see that clearly the nonlinear RC outperforms the linear RC, and by a wide margin. But that is not the message here, which is rather one of explaining the relationships and the fitting of the parameters; that fitting just the readout matrix W^{out} is relevant is established by Eq. (34).

Now, we will further decompose the derived VAR(k) coefficients found in Eq. (25), to emphasize the training of just the output matrix W^{out} of an associated RC, in terms of randomly pre-choosing A and W^{in}. Referring to Eqs. (19)-(20), we can rewrite Eqs. (22)-(23) as,
Y = a X = v \mathcal{A} X,   (26)
with the matrix defined,
\mathcal{A} = [ W^{in} | A W^{in} | ⋯ | A^{k-2} W^{in} | A^{k-1} W^{in} ].   (27)
This is a combination of powers of the random d_r × d_r matrix A and the random d_r × d_x matrix W^{in}, and so it is itself a d_r × (k d_x) random matrix. Interestingly, considering just one column of W^{in} at a time, W^{in}_l, l = 1, ..., d_x, the successive powers can be understood as a collection of columns from a Krylov space, and this entire process can be discussed as an Arnoldi iteration, which is something we will explore further in Section 6. Consider that the least squares objective Eq. (29) can be expanded to split,
a = v \mathcal{A},   (28)
to emphasize that, since we pre-choose A and W^{in}, only the readout matrix v is a free parameter,
J(v) = \| Y − a X \|_F^2 = \| Y − v \mathcal{A} X \|_F^2.   (29)
Optimizing for v yields,
W^{out} := v^* = Y ( \mathcal{A} X )^† = ( Y X^† ) \mathcal{A}^†.   (30)
Comparing this equation with Eq. (25), defining a, we see that (Y X^†) formally appears in both expressions. Only the associative property of matrix multiplication is needed to emphasize the role of \mathcal{A}. More importantly, the expression Eq. (30) for W^{out} is written so as to emphasize that the reservoir computing process is designed with \mathcal{A} and X combined through the iteration, as \mathcal{A} X is the data that results from Eq. (17),
r_{k+1} = A^{k-1} W^{in} x_1 + A^{k-2} W^{in} x_2 + ⋯ + A W^{in} x_{k-1} + W^{in} x_k.   (31)
This is written naturally,
R = \mathcal{A} X,   (32)
by the simple way Eq. (30) uses a matrix identity of pseudo-inverses [28],
( \mathcal{A} X )^† = X^† \mathcal{A}^†.   (33)
Associativity emphasizes that, since \mathcal{A} is deterministically defined once A and W^{in} are chosen, and separately from the data X, the fitting of only the parameters of W^{out} is sufficient. If we want the VAR(k) parameters, we could either ignore the prior knowledge of the choice of A and W^{in} and compute a directly from Eq. (29), or from Eq. (30), defining,
W^{out} := v^* = a^* \mathcal{A}^†_λ = Y X^†_λ \mathcal{A}^†_λ.   (34)
We summarize that these manipulations, concluding with Eq. (34), serve directly as the connection between the RC fitted readout and the coefficient matrices of a VAR(k). The roles of the pre-chosen A and W^{in} relate directly to the W^{out} coefficients, or indirectly to the fitted data. Concluding this section with an example, we simplify the nonlinear RC of the Mackey-Glass data from Figs. 1, 3, to a purely linear RC fit shown in Fig. 5, which clearly does not perform as well, but does still make some forecast into the future. Further discussion of this example, and likewise a Lorenz63 example, is in Sec. 7.
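The identity between the trained linear-RC readout and the VAR(k) coefficients, Eqs. (19)-(20) and (34), can be checked numerically. The following sketch (illustrative sizes and data; the helper names are hypothetical) composes a_j = W^{out} A^{j-1} W^{in} from a trained linear RC and compares its one-step prediction against the truncated VAR(k) it encodes:

```python
import numpy as np

rng = np.random.default_rng(1)

def linear_reservoir(X, W_in, A):
    # Identity-activation RC, Eqs. (11)-(18): r_{i+1} = A r_i + W_in x_i, with r_1 = 0.
    d_r, N = W_in.shape[0], X.shape[1]
    R = np.zeros((d_r, N))
    for i in range(N - 1):
        R[:, i + 1] = A @ R[:, i] + W_in @ X[:, i]
    return R

def ridge(Y, Z, lam):
    # min ||Y - W Z||_F^2 + lam ||W||_F^2  ->  W = Y Z^T (Z Z^T + lam I)^{-1}
    return Y @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(Z.shape[0]))

d_x, d_r, k, N, lam = 3, 100, 10, 2000, 1e-8
X = rng.standard_normal((d_x, N)).cumsum(axis=1)      # any test signal will do for the algebra
W_in = rng.uniform(0.0, 0.1, size=(d_r, d_x))
A = rng.uniform(-1.0, 1.0, size=(d_r, d_r))
A *= 0.5 / max(abs(np.linalg.eigvals(A)))             # rho(A) = 0.5, so memory decays quickly

R = linear_reservoir(X, W_in, A)
W_out = ridge(X[:, k:], R[:, k:], lam)                 # trained linear-RC readout, Eq. (9)

# VAR(k) coefficients implied by the RC, a_j = W_out A^{j-1} W_in, Eq. (20)
a = [W_out @ np.linalg.matrix_power(A, j) @ W_in for j in range(k)]

# one-step prediction: RC readout vs. the truncated VAR(k) it encodes, Eq. (19)
i = N - 1
y_rc = W_out @ R[:, i]
y_var = sum(a[j] @ X[:, i - 1 - j] for j in range(k))
print(np.linalg.norm(y_rc - y_var) / np.linalg.norm(y_rc))   # small; truncation error ~ rho(A)^k
```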
5 The Wold Theorem and Representation of a Stochastic Process by a VAR(k)

Since the VAR(k) model of vector autoregression appears naturally in our discussion from the simplified activation function q(s) = s, as summarized by Eqs. (19) and (21), we now recall some of the classical underlying theory from the statistical time-series analysis literature [84, 67] that describes sufficient conditions under which we expect existence of a VAR(k) representation.

The Wold theorem plays a central role in time-series analysis, as it describes existence of a vector moving average (VMA) model representation, which then, under further assumptions of invertibility, is equivalent to a VAR. The assumptions require a stationary process stated as a sum of two components: 1) a stochastic component consisting of "linear" combinations of lags of a white noise process, and 2) a deterministic component that is uncorrelated with the stochastic component. First we recall definitions. A d-dimensional stochastic process ξ_t of zero mean, E(ξ_t) = 0, written ξ_t = [ξ_{1,t}, ξ_{2,t}, ..., ξ_{d,t}] ∼ WN(0, Ω), is a white noise process if E(ξ_t) = 0 and E(ξ_t ξ_{t'}^T) = 0 for t ≠ t', but E(ξ_t ξ_t^T) = Ω is symmetric positive semi-definite. A stochastic process is covariance stationary if all terms of the sequence have the same mean, and the covariance of any two terms depends only on their relative positions. That is, E(ξ_{t'}) = E(ξ_t) for all t', and for all t' ≥ 0 there exists γ_{t'} ∈ R such that Cov(ξ_t, ξ_{t−t'}) = γ_{t'} for all t > t', meaning dependence on t − t' rather than on t or t'. With these definitions, we can state the central theorem of this section that we recall:

Theorem 1 (Wold Decomposition Theorem, [84, 67]) A zero mean covariance stationary vector process {X_t} admits a representation,
X_t = C(L) ξ_t + μ_t,   (35)
where C(L) = \sum_{i=0}^{∞} C_i L^i is a delay operator polynomial, the C_i are the moving average matrices, and L^i(ξ_t) = ξ_{t−i}. The term C(L) ξ_t is the stochastic part of the decomposition. The μ_t term is the deterministic (perfectly predictable) part, as a linear combination of the past values of X_t. Furthermore,
• μ_t is a d-dimensional linearly deterministic process.
• ξ_t ∼ WN(0, Ω) is white noise, and the coefficient matrices are square summable,
\sum_{i=0}^{∞} \| C_i \|^2 < ∞.   (36)
• C_0 = I, the identity matrix.
• For each t, ξ_t is the innovation for X_t, in that ξ_t = X_t − P(X_t | X_{t−1}, X_{t−2}, ...). The so-called shock is fundamental.

Clarifying the notation of the delay operator polynomial with a simple example, say,
C(L) = \begin{bmatrix} 1 + L & 1 − L \\ L & 1 − L \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 1 & −1 \\ 1 & −1 \end{bmatrix} L = C_0 + C_1 L, and C_i = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} if i > 1,   (37)
so that, for example, for x_t ∈ R^2,
C(L) x_t = \begin{bmatrix} 1 + L & 1 − L \\ L & 1 − L \end{bmatrix} \begin{bmatrix} x_{1,t} \\ x_{2,t} \end{bmatrix} = \begin{bmatrix} x_{1,t} + x_{2,t} + x_{1,(t−1)} − x_{2,(t−1)} \\ x_{1,(t−1)} + x_{2,t} − x_{2,(t−1)} \end{bmatrix}.   (38)
For interpretation and definition, consider:
• If μ_t = 0, then this is called a "regular" process, and therefore there is a purely vector moving average (VMA) representation. If C_i = 0 for i > p for some finite p, this is a finite VMA(p); otherwise it is a VMA(∞) representation.
• If X_t is regular, then the representation is unique.

Now to our point, to relate a Wold VMA representation to our discussion following the linear RC, where we saw that a VAR(k) results, Eqs. (19), when the activation is linear, q(s) = s. If the delay polynomial C(L) is invertible, with C(L)^{-1} C(L) = I, and we denote C(L)^{-1} = B(L) = B_0 − B_1 L − B_2 L^2 − ⋯ in terms of matrices B_i, then writing explicitly,
( B_0 − B_1 L − B_2 L^2 − ⋯ )( I + C_1 L + C_2 L^2 + ⋯ ) = I.   (39)
Existence of this inverse implies that the Wold-implied VMA process has a VAR representation,
X_t = C(L) ξ_t ⟹ B(L) X_t = ξ_t.   (40)
In practice, when an infinite order vector moving average process, VMA(∞), corresponds to an infinite order vector autoregressive process, VAR(∞), then recursively expanding Eq. (39) and matching term by term yields,
B_0 = I, B_1 = C_1, ..., B_k = C_k − B_1 C_{k−1} − ⋯ − B_{k−1} C_1, ...   (41)
Though, a VAR representation may be found from a VMA through several methods, including a method of moments leading to the Yule-Walker equations [65], or a least squares method in the case of finite presentations. Often, for parsimonious efficiency of presentation, a mixed form of a p-step AR and a q-step MA model may make a suitable approximation, for what is called an ARMA(p,q) model.
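For completeness, the term-matching recursion of Eqs. (39)-(41) is easily mechanized. The following small sketch is purely illustrative; the sign convention follows the expansion B(L) = B_0 − B_1 L − B_2 L^2 − ⋯ stated in Eq. (39), and the input VMA coefficients are arbitrary test values:

```python
import numpy as np

def vma_to_var(C, k_max):
    # Given VMA coefficient matrices C_1, C_2, ... (C_0 = I implied), match terms in
    # (B_0 - B_1 L - B_2 L^2 - ...)(I + C_1 L + C_2 L^2 + ...) = I, Eq. (39), to obtain
    # the VAR coefficients B_1, ..., B_{k_max} of Eq. (41).
    d = C[0].shape[0]
    zero = np.zeros((d, d))
    B = [np.eye(d)]                                   # B_0 = I
    for m in range(1, k_max + 1):
        Bm = (C[m - 1] if m - 1 < len(C) else zero).copy()
        for j in range(1, m):
            Bm -= B[j] @ (C[m - j - 1] if m - j - 1 < len(C) else zero)
        B.append(Bm)
    return B[1:]

# example: a VMA(1), X_t = xi_t + C_1 xi_{t-1}, inverts to B_j = (-1)^{j+1} C_1^j
C1 = np.array([[0.5, 0.1], [0.0, 0.3]])
print(vma_to_var([C1], k_max=3))
```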
While not allowing ourselves to be drawn entirely into the detailed theory of econometrics and statistical time-series analysis, pursuing stronger necessary conditions, we wish to point out some already apparent relevant points from the stated special sufficient conditions.

Remark 1 Summary statements. If a vector stochastic process satisfies the hypotheses of the Wold theorem, then:
• It can be written either as a VMA or, when C(L) is invertible as in Eq. (40), as a VAR, Eq. (41). In practice, a finite-k VAR(k) estimates a VAR(∞) as k ↑ ∞, since the sequence of coefficient matrices {C_i} is square summable, Eq. (36), and considering Eq. (41).
• Furthermore, in practice a least squares estimate of a VAR(k) may be used for finite k, which relates to an RC by the least squares fit, Eqs. (30), (34).
Finally, we separate from the above technical points the following fundamental remark, to distinguish existence versus uniqueness of a representation.
Remark 2
While a stochastic process may have a VMA representation, and, through invertibility, a corresponding VAR, which are linear descriptions of the process, these may not be taken to be "the" unique physical underlying description, since nonlinear descriptions certainly may exist.
Remark 3
The processes that we may be interested in, such as those derived from Eq. (1), may describe the evolution of a (chaotic) dynamical system, and these may allow a representation, Eq. (3), [11, 76, 45]. However, in many of these natural examples, the "color" or even the nature of the noise may well not conform to the white noise assumption of the Wold theorem 1. Certainly, contrasting samples from an invariant measure of a chaotic dynamical system to a white noise process is a well studied [69, 40], but still undecided, topic. While existence of the VMA and the corresponding VAR representation by the Wold theorem does depend on that hypothesis, nonetheless, successful constructive fitting of a VAR(k) by regression, even if implicitly through an RC, seems to proceed successfully in practice in a wide array of examples.
With this last remark, we admit that while the details of the rigor guaranteeing existence may in practice break down, due to the inability to check all hypotheses, as such gaps often occur between mathematics, applied mathematics, and practice with real world data, we feel that the concept is still highly instructive as an underlying explanation, despite the strong sufficient assumptions used in the rigorous theory.
To discuss stability, we recall [67] the fact that a VAR(k), Eqs. (19), (21),
x_{k+1} = c + a_k x_1 + a_{k-1} x_2 + ⋯ + a_2 x_{k-1} + a_1 x_k + ξ_{k+1},   (42)
can be stated as a VAR(1) in terms of "stacked" (delayed) variables, called the companion system. This idea is familiar in dynamical systems, as we see it is related to stating time-delay variables and the Takens embedding theorem [77, 61, 71, 58, 9, 87]. Define,
\mathbf{X}_{k+1} = \mathbf{A} \mathbf{X}_k + \mathbf{C} + e_k, where \mathbf{X}_k = \begin{bmatrix} x_k \\ x_{k-1} \\ ⋮ \\ x_1 \end{bmatrix}, e_k = \begin{bmatrix} ξ_{k+1} \\ 0 \\ ⋮ \\ 0 \end{bmatrix}, \mathbf{C} = \begin{bmatrix} c \\ 0 \\ ⋮ \\ 0 \end{bmatrix}, and \mathbf{A} = \begin{bmatrix} a_1 & a_2 & ⋯ & a_{k-1} & a_k \\ I & 0 & ⋯ & 0 & 0 \\ 0 & I & ⋯ & 0 & 0 \\ ⋮ & & ⋱ & & ⋮ \\ 0 & 0 & ⋯ & I & 0 \end{bmatrix},   (43)
where, since the a_i are each d_x × d_x matrices, \mathbf{A} is k d_x × k d_x, and \mathbf{X}_k is k d_x × 1.
For discussion in the next section, it will be convenient to consider, for contrast to Eq. (50), a matrix of all of the data,
\mathbf{X} = [ \mathbf{X}_k | \mathbf{X}_{k+1} | ⋯ | \mathbf{X}_{N-1} ], and likewise, \mathbf{X}' = [ \mathbf{X}_{k+1} | \mathbf{X}_{k+2} | ⋯ | \mathbf{X}_N ],   (44)
which are k d_x × (N − k) matrices built from the same data as X in Eq. (22).

It follows that analysis of stability of a VAR(1) sufficiently describes the stability of a VAR(k). If there is even a small offset c, whether by a bias or an imperfection of fit, then there follows the recursion,
\mathbf{X}_k = \mathbf{C} + \mathbf{A} \mathbf{X}_{k-1} + e_{k-1} ⟹ \mathbf{X}_k = ( I + \mathbf{A} + ⋯ + \mathbf{A}^{l-1} ) \mathbf{C} + \mathbf{A}^l \mathbf{X}_{k-l} + \sum_{j=0}^{l-1} \mathbf{A}^j e_{k-1-j}.   (45)
This relates the VAR(1) back to a VMA(l) form. Clearly, even a small constant disturbance \mathbf{C} is successively influenced by the (delay) matrix \mathbf{A}. In the limit l → ∞, recall that the geometric series of matrices,
( I − \mathbf{A} )^{-1} = \lim_{l→∞} ( I + \mathbf{A} + ⋯ + \mathbf{A}^{l-1} ),   (46)
converges if the spectral radius is strictly contained in the complex unit disc,
ρ(\mathbf{A}) = \max_{λ : \det(\mathbf{A} − λ I) = 0} |λ| < 1.   (47)
Equivalently, a general VAR(k), Eq. (42), is stable if and only if the characteristic polynomial,
\det( I − a_1 z − a_2 z^2 − ⋯ − a_k z^k ) = 0,   (48)
has all of its roots outside the unit disc. Under this stability assumption we conclude that,
\mathbf{X}_k = \mathbf{C} + \mathbf{A} \mathbf{X}_{k-1} + e_{k-1} = ( I − \mathbf{A} )^{-1} \mathbf{C} + \sum_{j=0}^{∞} \mathbf{A}^j e_{k-1-j},   (49)
which relates a VAR(1) form to a Wold form through a VMA(∞). Since, by Eq. (20), each matrix a_j = W^{out} A^{j-1} W^{in}, the magnitudes of the entries of the matrix A and of the read-in matrix W^{in} each moderate the magnitudes of the entries of a_j. So considerations by the Gershgorin disc theorem [28] relate these magnitudes to the magnitudes of the roots z. Generally, the sparsity of A, the magnitude of the spectrum of A, and the magnitudes of W^{in} can be reduced for stability, and to moderate the "memory" associated with convergence in k; these magnitudes were already discussed for the sake of a regime where the usual sigmoidal q would be close to the identity.
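A brief numerical illustration of the companion form and the stability criterion of Eqs. (43)-(47) follows (a sketch; the VAR(2) coefficient matrices are arbitrary illustrative values):

```python
import numpy as np

def companion(a_list):
    # Stack VAR(k) coefficients a_1, ..., a_k (each d_x x d_x) into the
    # (k d_x) x (k d_x) companion matrix of Eq. (43).
    d, k = a_list[0].shape[0], len(a_list)
    A = np.zeros((k * d, k * d))
    A[:d, :] = np.hstack(a_list)              # first block row [a_1 a_2 ... a_k]
    A[d:, :-d] = np.eye((k - 1) * d)          # shifted identity blocks below
    return A

a1 = np.array([[0.5, 0.1], [0.0, 0.4]])       # illustrative VAR(2) coefficients
a2 = np.array([[0.2, 0.0], [0.1, 0.1]])
A_comp = companion([a1, a2])
rho = max(abs(np.linalg.eigvals(A_comp)))
print(rho, rho < 1.0)                         # spectral radius < 1 <=> stability, Eq. (47)
```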
6 Is There a Connection Between VAR, DMD, and the RC?

To briefly answer the question titling this section, the answer is yes, there is a connection between VAR and DMD, and so to RC. The more nuanced answer is that the connection is not complete. Throughout the discussion so far, a specialized version of an RC using an identity activation function yields a linear process that is shown to relate to a VAR, which is also a linear process. In this section we ask if it also relates to the dynamic mode decomposition, DMD [72, 70, 43, 83], a concept that is also premised on a linear process model, as a finite estimation of the infinite dimensional linear action of the Koopman operator on a function space of observables [3]. In Koopman theory, instead of describing the evolution and geometry of orbits in the phase space, the transfer operator methods generally describe the evolution of functions whose domain is the phase space [45, 11]. Recently this approach has excited a huge trend in applied dynamical systems, with many excellent research papers [83, 43, 47, 8], review papers [3, 14], and books [43] toward theory, numerical implementation, and scientific application practice. Our focus here will remain narrow, the goal being simply to identify a connection to the RC and its related VAR, as discussed above. A primary purpose of DMD methods is modal analysis of the system, to describe coherent and typical behaviors, but DMD can also be used for forecasting, and it is for this sake that the analogy is drawn here.

For direct comparison, first allow some minor manipulations to relate the VAR(k), Eq. (42) and Eq. (22), to a typical DMD form. A time-delay version of a linear evolution is a special case of an exact DMD, written as follows, with notation as above,
\begin{bmatrix} x_{k+1} & x_{k+2} & ⋯ & x_N \\ x_k & x_{k+1} & ⋯ & x_{N-1} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_2 & x_3 & ⋯ & x_{N-k+1} \end{bmatrix} = K \begin{bmatrix} x_k & x_{k+1} & ⋯ & x_{N-1} \\ x_{k-1} & x_k & ⋯ & x_{N-2} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_1 & x_2 & ⋯ & x_{N-k} \end{bmatrix},   (50)
or simply,
\mathbf{X}' = K \mathbf{X},   (51)
where \mathbf{X} and \mathbf{X}' are the k d_x × (N − k) data matrices of Eq. (44), and K is a k d_x × k d_x DMD matrix approximating the action of the infinite-dimensional Koopman operator. Abusing notation slightly, the least squares problem,
K = \arg\min_K \| \mathbf{X}' − K \mathbf{X} \|_F,   (52)
has the solution,
K = \mathbf{X}' \mathbf{X}^†,   (53)
which is called the "exact DMD" solution. While there are many variants of DMD, this one, called exact DMD, is popular for its simplicity of implementation while still being useful for interpreting the system in terms of modal behaviors.

Contrasting K derived by exact DMD, Eq. (50), versus \mathbf{A} of the VAR(1) form described in Eqs. (43)-(44), reveals clear similarities, since each states a linear relationship between the same data, \mathbf{X}' = K \mathbf{X} versus \mathbf{X}' = \mathbf{A} \mathbf{X}; but these are ill-posed equations, and \mathbf{A} need not be the same as K. Closer inspection reveals that Eq. (53) allows freedom for a best least squares fit over the entire matrix K, and so there are differences relative to Eqs. (22), (25). Whereas, only the first block row of \mathbf{A} consists of free parameters in the regression; the subsequent rows of \mathbf{A} are sparsely patterned with either zeros or the identity matrix, Eq. (43).

A similar, but not identical, structural difference appears when contrasting the SVD-based exact DMD to the original DMD method of Schmid [72], and also of Rowley and Mezic [70], which is an Arnoldi-like version of DMD in terms of iterations in a Krylov space [4, 79]. Reviewing that Arnoldi version of DMD, using the notation of [70], observations x_i ∈ R^d are assumed (fitted) to come from a linear process, and, considering the iterations, are fitted in the Krylov space, assuming that x_m ∈ Kry_m(x_0) = span{x_0, A x_0, ..., A^{m-1} x_0} for the data K = [ x_0 | x_1 | ⋯ | x_{m-1} ] = [ x_0 | A x_0 | ⋯ | A^{m-1} x_0 ]. Stating the linear combination, x_m = A x_{m-1} = c_0 x_0 + c_1 x_1 + ⋯ + c_{m-1} x_{m-1} = K c, where c = [c_0; c_1; ...; c_{m-1}] is the vector of coefficients, a key and clever observation was to rewrite this in terms of a companion matrix,
C = \begin{bmatrix} 0 & 0 & ⋯ & 0 & c_0 \\ 1 & 0 & ⋯ & 0 & c_1 \\ 0 & 1 & ⋯ & 0 & c_2 \\ ⋮ & & ⋱ & & ⋮ \\ 0 & 0 & ⋯ & 1 & c_{m-1} \end{bmatrix},   (54)
so that there results,
A K = K C.   (55)
From there, exploiting the theme of Arnoldi methods, the eigenvalues of C are related as a subset of the eigenvalues of A, with a direct linear relationship between the eigenvectors of C and of A, the Ritz vectors, and the unknown coefficients c of C can be computed by a least squares procedure. Keeping in mind that power iterations, as one does in Krylov spaces, emphasize just the dominant direction, the Arnoldi methods take care to orthogonalize at each step of the algorithm, to stabilize an otherwise unstable search for large sparse matrices, and these make deliberate use of QR decompositions. Our interest here is only to point out analogies between \mathbf{A} from reservoir computing and VAR(1) and K, rather than to continue toward a discussion of modal analysis as one does in DMD analysis. Summarizing the analogy, we see that the companion matrix C in Eq. (54) reminds us of the companion matrix \mathbf{A} in Eq. (43). However, the most significant difference is that while the c_i are scalars, the a_i are d_x × d_x matrices.
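A minimal sketch of the time-delay exact DMD fit of Eqs. (50)-(53) follows (assuming only NumPy; the delay-stacking helper mirrors the data matrices of Eq. (44), and the function names are illustrative):

```python
import numpy as np

def delay_stack(X, k):
    # Columns [x_i; x_{i-1}; ...; x_{i-k+1}], mirroring the stacked data of Eq. (44).
    N = X.shape[1]
    return np.vstack([X[:, k - 1 - j:N - j] for j in range(k)])

def exact_dmd(X, k):
    # Fit bold-X' = K bold-X in least squares: K = X' X^dagger, Eqs. (51)-(53).
    Z = delay_stack(X, k)
    Z0, Z1 = Z[:, :-1], Z[:, 1:]
    return Z1 @ np.linalg.pinv(Z0)

# Forecasting then amounts to iterating z <- K z and reading the leading d_x entries of z;
# e.g. K = exact_dmd(data.T, k=10), with `data` as in the sampling sketch of Sec. 2.
```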
7 Examples

The examples of Figs. 1, 3-5, already threaded into the presentation of the methods above, were in terms of the Mackey-Glass differential delay equation system, which we now recall. Then, in the subsequent subsection, we show similar figures highlighting the concepts in a different system, the famous Lorenz63 ODEs.

7.1 The Mackey-Glass Differential Delay Equation

The Mackey-Glass differential delay equation [55],
x'(t) = \frac{a \, x(t − t_d)}{1 + [x(t − t_d)]^c} − b \, x(t),   (56)
has become a now classic standard example in time-series analysis [23, 48] of a high (infinite) dimensional dynamical system with a low-dimensional attractor, which we have used as a benchmark in our own previous work [9], including recently in presentations for machine learning [9]. The problem is physiologically relevant for describing dynamic diseases. A differential delay equation can be described as an infinite dimensional dynamical system, a concept that is more easily understood in terms of the notion that an initial condition advances not just a finite dimensional vector, but rather an entire segment of initial values of x(t) on an interval of length t_d. However, the MG equation has a nice property for its practical use as a benchmark problem, which is that there is essentially an attractor whose fractal dimension varies with respect to the parameters chosen, allowing for a complexity-tunable test problem. We have chosen the standard parameter values t_d = 17, a = 0.2, b = 0.1, c = 10. We use integration steps of Δt = t_d/100 throughout. We show the time-series in Fig. 1, standard nonlinear RC forecasts of the system in Figs. 3-4, and the linear RC/VAR forecast of the system in Fig. 5.
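For the reader who wishes to reproduce a data set in the spirit of Fig. 1, the following is a crude sketch of integrating Eq. (56) (a simple fixed-step Euler scheme with a buffer of past values supplying the delayed term; the constant initial history and the step size are illustrative assumptions, not a statement of how the figures were produced):

```python
import numpy as np

def mackey_glass(n_samples, t_d=17.0, a=0.2, b=0.1, c=10.0, dt=0.17, x0=1.2):
    # Forward-Euler integration of x'(t) = a x(t - t_d)/(1 + x(t - t_d)^c) - b x(t),
    # with a buffer of past values supplying the delayed term x(t - t_d).
    lag = int(round(t_d / dt))              # delay measured in integration steps
    history = [x0] * (lag + 1)              # constant initial history, an assumption
    out, x = [], x0
    for _ in range(n_samples):
        x_del = history[-(lag + 1)]         # x(t - t_d)
        x = x + dt * (a * x_del / (1.0 + x_del ** c) - b * x)
        history.append(x)
        out.append(x)
    return np.array(out)

x_series = mackey_glass(10000)              # a training series in the spirit of Fig. 1
```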
7.2 The Lorenz63 System

The Lorenz63 system [49] is the set of three coupled ordinary differential equations,
\dot{x} = 10 (y − x), \dot{y} = x (28 − z) − y, \dot{z} = x y − (8/3) z.   (57)
While these Lorenz equations were originally posed as time-varying Fourier coefficients describing a partial differential equation system for convection rolls of heated fluid in an atmospheric system, they have become a popular paradigm in the study of chaotic systems, for foundational principles of chaos, historically and ongoing, as a simple and familiar benchmark problem, and also in the pedagogy of dynamical systems. The chaotic attractor in the phase space (x(t), y(t), z(t)) illustrates the familiar butterfly, but we show a segment of the x(t) time series that will be used as our data set in Fig. 6 and Fig. 7. Also shown in Fig. 6 are nonlinear RC, q(x) = tanh(x), activation forecasts using the usual nonlinear reservoir computing, with excellent success. In Fig. 7, we show forecasting results using a linear, q(x) = x, activation RC, with still good results, associated with the theory of this work connecting to VAR(k).
Figure 6: Lorenz63 time-series, with the nonlinear activation function q(x) = tanh(x). The full state variable x = (x, y, z) is used, thus d_x = 3. The reservoir is of size d_r = 25; however, the augmented state [r; r.^2] (the reservoir states together with their element-wise squares) is used at readout, which is therefore a d_x × 2 d_r = 3 × 50 matrix. Blue is the data and red is the forecast into the future. The error and the log of the error in time are each shown, growing from the initial seed.

Figure 7: Lorenz63 time-series, with the linear activation function q(x) = x. The full state variable x = (x, y, z) is used, thus d_x = 3. The reservoir is of size d_r = 25. Blue is the data and red is the forecast into the future. The error and the log of the error in time are each shown, growing from the initial seed. Contrast to the full nonlinear version in Fig. 6. The quality is not as good, but it still works reasonably, at least for many randomly chosen A.

8 Conclusion

The success of machine learning and artificial neural networks has led to a clear and overwhelmingly widely adopted wave across so many areas where data is relevant and patterns are of interest. Dynamical systems is no exception, and forecasting a dynamical system is a specific application that is broadly relevant and of interest to us here. The RNN framework is particularly relevant for dynamical systems, since the memory aspect of the concept allows for a good framework and some aspects of delay embedding. However, while the RNN tends to have very many parameters to fit, with the danger of overfitting always present, and in any case the large cost of optimization in the training phase, there is a surprising shortcut. The echo-state/reservoir computing concept presumes to choose the weights of the input layer and internal layer entirely randomly. Then only the output layer is trained. Furthermore, that output layer training is a linear least squares solve, rather than the usual nonlinear optimization necessary for the many parameters of the full nonlinear RNN. That this allows a huge computational savings is clear. What is not clear, and perhaps even a surprise, is how this gross simplification can still lead to useful results. While there have been a number of studies describing experimentally how to choose better random processes to define the random parameters, e.g., to emphasize sparsity, or to control the spectral radius, and other properties, in this work we have taken a different approach, which is not specifically to improve the performance of the concept, but instead to give a partial explanation as to how the concept can work at all. After all, at first glance it may seem impossible that such a simplification could work. In this work, we have simplified the RC concept, allowing the activation function to be an identity function instead of the more typical sigmoidal function. In this case, the process is entirely linear and so easier to study. As it turns out, the RC still performs well: the nonlinear activation may be important for better performance, but it is not necessary for some good performance, and it certainly leads to easier analysis. Herein we have been able to prove that the linear RC is in fact directly related to the more mature topic of VAR, vector autoregressive time-series forecasting, with all of its related theory, such as the Wold theorem, through which we recall that the representation theorem now also applies to the RC. Also, we are able to make a direct connection to the increasingly popular DMD theory.
So now, with a linear RC analyzed from the start, including the randomly selected parameters, we offer a few explicit examples that also demonstrate that the concept works. Therefore, while our simplification to a linear RC is not proved to also explain the full nonlinear RC, we feel it does give some insight as to how a random partial selection can still be compensated, and also, we argue that a scaling argument further connects the linear RC to a linearization of the sigmoidal activation. We hope in the future to relate this work more explicitly to explain the nonlinear RC success, beyond the series explanation here, and also to explain more closely the role of memory beyond the spectral radius, perhaps relating to embedding theory, and also perhaps to sufficient dimensionality of observations for an approximate isometric embedding in terms of the Johnson-Lindenstrauss theorem.
9 Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.
10 Acknowledgments
The author received funding from the Army Research Office (N68164-EG) and also DARPA.
11 Appendix: The Regularized Pseudo-Inverse by SVD

We review how to numerically and stably compute the pseudo-inverse by the singular value decomposition (SVD), with regularized singular values. Reviewing the matrix theory of regularized pseudo-inverses for general matrices, if,
X b = z,   (58)
with X an n × p matrix, b a p × 1 vector, and z an n × 1 vector, then if the SVD is X = U Σ V^T, with orthogonal matrices U, n × n, satisfying U U^T = U^T U = I, and V, p × p, satisfying V V^T = V^T V = I, then Σ is an n × p "diagonal" matrix of singular values σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_p ≥ 0,
Σ = \mathrm{diag}(σ_1, ..., σ_p) if n = p; Σ = \begin{bmatrix} \mathrm{diag}(σ_1, ..., σ_p) \\ 0 \end{bmatrix} if n > p; Σ = \begin{bmatrix} \mathrm{diag}(σ_1, ..., σ_n) & 0 \end{bmatrix} if n < p.   (59)
Then,
X^† := ( X^T X )^{-1} X^T = V Σ^† U^T,   (60)
where, in the n > p case, Σ^† carries the reciprocals 1/σ_1, ..., 1/σ_p on its diagonal. The least squares estimator of X b = z is,
b^* = ( X^T X )^{-1} X^T z := X^† z,   (61)
and we write the ridge regression, Tikhonov-regularized, solution,
b^*_λ = ( X^T X + λ I )^{-1} X^T z = V ( Σ^T Σ + λ I )^{-1} Σ^T U^T z := X^†_λ z.   (62)
The regularized pseudo-inverse X^†_λ is better stated in terms of the regularized singular values, by,
Σ^†_λ := ( Σ^T Σ + λ I )^{-1} Σ^T = \mathrm{diag}\!\left( \frac{σ_1}{σ_1^2 + λ}, ..., \frac{σ_p}{σ_p^2 + λ} \right), in the n > p case,   (63)
and then,
b^*_λ = X^†_λ z = V Σ^†_λ U^T z.   (64)
Throughout, since we will always refer to regularized pseudo-inverses, we will not emphasize this, by abusing notation and allowing that b^* denotes b^*_λ, even if only a very small λ > 0 is used; in our computations, λ = 1.0 × 10^{-6}. This mitigates the tendency toward overfitting, or, likewise stated, the effect of zero or almost zero singular values that would otherwise appear in the denominators of Σ^†. The theory is similar for n = p and n < p, as well as for the scenario where z is not just a vector but a matrix, and likewise, as in Eq. (9), where we refer to the transpose scenario.
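The regularized pseudo-inverse of Eqs. (62)-(64) is computed directly from the SVD. A minimal sketch (assuming only NumPy; the regularization value shown is illustrative):

```python
import numpy as np

def ridge_pinv(X, lam):
    # X^dagger_lambda = V (Sigma^T Sigma + lam I)^{-1} Sigma^T U^T, Eqs. (62)-(63):
    # each singular value sigma_i is replaced by sigma_i / (sigma_i^2 + lam).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ np.diag(s / (s ** 2 + lam)) @ U.T

# regularized least squares solution of X b = z, Eq. (64):
# b = ridge_pinv(X, 1e-6) @ z
```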
References

[1] Gianni Amisano and Carlo Giannini. Topics in Structural VAR Econometrics. Springer Science & Business Media, 2012.
[2] Piotr Antonik, Marvyn Gulina, Jaël Pauwels, and Serge Massar. Using a reservoir computer to learn chaotic attractors, with applications to chaos synchronization and cryptography. Physical Review E, 98(1):012215, 2018.
[3] Hassan Arbabi and Igor Mezic. Ergodic theory, dynamic mode decomposition, and computation of spectral properties of the Koopman operator. SIAM Journal on Applied Dynamical Systems, 16(4):2096–2126, 2017.
[4] Walter Edwin Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quarterly of Applied Mathematics, 9(1):17–29, 1951.
[5] Coryn A. L. Bailer-Jones, David J. C. MacKay, and Philip J. Withers. A recurrent neural network for modelling dynamical systems. Network: Computation in Neural Systems, 9(4):531–547, 1998.
[6] Thanasis G. Barbounis, John B. Theocharis, Minas C. Alexiadis, and Petros S. Dokopoulos. Long-term wind speed and power forecasting using local recurrent neural network models. IEEE Transactions on Energy Conversion, 21(1):273–284, 2006.
[7] Erik Bollt. Regularized kernel machine learning for data driven forecasting of chaos.
[8] Erik Bollt. Geometric considerations of a good dictionary for Koopman analysis of dynamical systems. arXiv preprint arXiv:1912.09570, 2019.
[9] Erik M. Bollt. Model selection, confidence and scaling in predicting chaotic time-series. International Journal of Bifurcation and Chaos, 10(06):1407–1422, 2000.
[10] Erik M. Bollt, Qianxiao Li, Felix Dietrich, and Ioannis Kevrekidis. On matching, and even rectifying, dynamical systems through Koopman operator eigenfunctions. SIAM Journal on Applied Dynamical Systems, 17(2):1925–1960, 2018.
[11] Erik M. Bollt and Naratip Santitissadeekorn. Applied and Computational Measurable Dynamics. SIAM, 2013.
[12] George E. P. Box, Gwilym M. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, 1970.
[13] Steven L. Brunton and J. Nathan Kutz. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press, 2019.
[14] Marko Budišić, Ryan Mohr, and Igor Mezić. Applied Koopmanism. Chaos: An Interdisciplinary Journal of Nonlinear Science, 22(4):047510, 2012.
[15] Michael Buehner and Peter Young. A tighter bound for the echo state property. IEEE Transactions on Neural Networks, 17(3):820–824, 2006.
[16] Daniel Canaday, Aaron Griffith, and Daniel J. Gauthier. Rapid time series prediction with a hardware-based reservoir computer. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(12):123119, 2018.
[17] Thomas L. Carroll and Louis M. Pecora. Network structure effects in reservoir computers. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(8):083130, 2019.
[18] Ashesh Chattopadhyay, Pedram Hassanzadeh, Devika Subramanian, and Krishna Palem. Data-driven prediction of a multi-scale Lorenz 96 chaotic system using a hierarchy of deep learning methods: Reservoir computing, ANN, and RNN-LSTM. 2019.
[19] Jiann-Fuh Chen, Wei-Ming Wang, and Chao-Ming Huang. Analysis of an adaptive time-series autoregressive moving-average (ARMA) model for short-term load forecasting. Electric Power Systems Research, 34(3):187–196, 1995.
[20] Edward Choi, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association, 24(2):361–370, 2017.
[21] David Darmon, Christopher J. Cellucci, and Paul E. Rapp. Information dynamics with confidence: Using reservoir computing to construct confidence intervals for information-dynamic measures. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(8):083113, 2019.
[22] Philippe De Wilde. Neural Network Models: An Analysis. Springer, 1996.
[23] J. Doyne Farmer. Chaotic attractors of an infinite-dimensional dynamical system. Physica D: Nonlinear Phenomena, 4(3):366–393, 1982.
[24] Mohammad Farzad, Hanif Tahersima, and Hamid Khaloozadeh. Predicting the Mackey Glass chaotic time series using genetic algorithm. In , pages 5460–5463. IEEE, 2006.
[25] Ken-ichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993.
[26] Claudio Gallicchio. Chasing the echo state property. arXiv preprint arXiv:1811.10892, 2018.
[27] Daniel J. Gauthier. Reservoir computing: Harnessing a universal dynamical system. Phys. Rev. Lett., 120(2018):024102, 2018.
[28] Gene H. Golub and Charles F. Van Loan. Matrix Computations, 4th ed. Johns Hopkins, 2013.
[29] Lukas Gonon and Juan-Pablo Ortega. Reservoir computing universality with stochastic inputs. IEEE Transactions on Neural Networks and Learning Systems, 31(1):100–112, 2019.
[30] Aaron Griffith, Andrew Pomerance, and Daniel J. Gauthier. Forecasting chaotic systems with very low connectivity reservoir computers. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(12):123108, 2019.
[31] Beata J. Grzyb, Eris Chinellato, Grzegorz M. Wojcik, and Wieslaw A. Kaminski. Which model to use for the liquid state machine? In , pages 1018–1024. IEEE, 2009.
[32] Min Han, Zhi-Wei Shi, and Wei Guo. Reservoir neural state reconstruction and chaotic time series prediction. Acta Physica Sinica, 56(1):43–50, 2007.
[33] L. Harrison, William D. Penny, and Karl Friston. Multivariate autoregressive modeling of fMRI time series. NeuroImage, 19(4):1477–1491, 2003.
[34] David Hartman and Lalit K. Mestha. A deep learning framework for model reduction of dynamical systems. In , pages 1917–1922. IEEE, 2017.
[35] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[36] Jin-Quan Huang and Frank L. Lewis. Neural-network predictive control for nonlinear dynamic systems with time-delay. IEEE Transactions on Neural Networks, 14(2):377–389, 2003.
[37] Herbert Jaeger. The echo state approach to analysing and training recurrent neural networks—with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
[38] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
[39] Junjie Jiang and Ying-Cheng Lai. Model-free prediction of spatiotemporal dynamical systems with recurrent neural networks: Role of network spectral radius.
Physical Review Research , 1(3):033056,2019.[40] Matthew B Kennel and Steven Isabelle. Method to distinguish possible chaos from colored noise andto determine embedding parameters.
Physical Review A , 46(6):3111, 1992.[41] Masahiro Kimura and Ryohei Nakano. Learning dynamical systems by recurrent neural networks fromorbits.
Neural Networks , 11(9):1589–1599, 1998.[42] S Narendra Kumpati, Parthasarathy Kannan, et al. Identification and control of dynamical systemsusing neural networks.
IEEE Transactions on neural networks , 1(1):4–27, 1990.[43] J Nathan Kutz, Steven L Brunton, Bingni W Brunton, and Joshua L Proctor.
Dynamic mode decom-position: data-driven modeling of complex systems . SIAM, 2016.[44] Martin L¨angkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deeplearning for time-series modeling.
Pattern Recognition Letters , 42:11–24, 2014.[45] Andrzej Lasota and Michael C Mackey.
Chaos, fractals, and noise: stochastic aspects of dynamics ,volume 97. Springer Science & Business Media, 2013.[46] Daniel S Levine.
Introduction to neural and cognitive modeling . Routledge, 2018.[47] Qianxiao Li, Felix Dietrich, Erik M Bollt, and Ioannis G Kevrekidis. Extended dynamic mode de-composition with dictionary learning: A data-driven adaptive spectral decomposition of the koopmanoperator.
Chaos: An Interdisciplinary Journal of Nonlinear Science , 27(10):103111, 2017.[48] A Lichtenberg. J and lieberman ma 1983 regular and stochastic motion.
Applied Mathematical Sci-ences ,38, 85.[49] Edward N Lorenz. Deterministic nonperiodic flow.
Journal of the atmospheric sciences , 20(2):130–141,1963.[50] Zhixin Lu, Brian R Hunt, and Edward Ott. Attractor reconstruction by machine learning.
Chaos: AnInterdisciplinary Journal of Nonlinear Science , 28(6):061104, 2018.[51] Mantas Lukoˇseviˇcius. A practical guide to applying echo state networks. In
Neural networks: Tricks ofthe trade , pages 659–686. Springer, 2012.[52] Mantas Lukoˇseviˇcius and Herbert Jaeger. Reservoir computing approaches to recurrent neural networktraining.
Computer Science Review , 3(3):127–149, 2009.[53] Helmut L¨utkepohl.
New introduction to multiple time series analysis . Springer Science & BusinessMedia, 2005.[54] Wolfgang Maass, Thomas Natschl¨ager, and Henry Markram. Real-time computing without stable states:A new framework for neural computation based on perturbations.
Neural computation , 14(11):2531–2560, 2002. 2455] Michael C Mackey and Leon Glass. Oscillation and chaos in physiological control systems.
Science ,197(4300):287–289, 1977.[56] Stephen Marsland.
Machine learning: an algorithmic perspective . CRC press, 2015.[57] Donald Michie, David J Spiegelhalter, CC Taylor, et al. Machine learning.
Neural and StatisticalClassification , 13(1994):1–298, 1994.[58] Mark R Muldoon, David S Broomhead, Jeremy P Huke, and Rainer Hegger. Delay embedding in thepresence of dynamical noise.
Dynamics and Stability of Systems , 13(2):175–186, 1998.[59] Marilyn M Nelson and William T Illingworth. A practical guide to neural nets. 1991.[60] Silvia Ort´ın Gonz´alez, Miguel C Soriano, Luis Pesquera Gonz´alez, Daniel Brunner, Daniel SanMart´ın Segura, Ingo Fischer, Claudio Mirasso, Jos´e Manuel Guti´errez Llorente, et al. A unified frame-work for reservoir computing and extreme learning machines based on a single time-delayed neuron.2015.[61] Norman H Packard, James P Crutchfield, J Doyne Farmer, and Robert S Shaw. Geometry from a timeseries.
Physical review letters , 45(9):712, 1980.[62] Sudhakar Madhavrao Pandit, Shien-Ming Wu, et al.
Time series and system analysis with applications ,volume 3. Wiley New York, 1983.[63] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neuralnetworks. In
International conference on machine learning , pages 1310–1318, 2013.[64] Jaideep Pathak, Brian Hunt, Michelle Girvan, Zhixin Lu, and Edward Ott. Model-free prediction oflarge spatiotemporally chaotic systems from data: A reservoir computing approach.
Physical reviewletters , 120(2):024102, 2018.[65] Jostein Paulsen and Dag Tjøstheim. On the estimation of residual variance and order in autoregressivetime series.
Journal of the Royal Statistical Society: Series B (Methodological) , 47(2):216–228, 1985.[66] Daniel Pena, George C Tiao, and Ruey S Tsay.
A course in time series analysis , volume 322. JohnWiley & Sons, 2011.[67] Duo Qin. Rise of var modelling approach.
Journal of Economic Surveys , 25(1):156–174, 2011.[68] Akhter Mohiuddin Rather, Arun Agarwal, and VN Sastry. Recurrent neural network and a hybridmodel for prediction of stock returns.
Expert Systems with Applications , 42(6):3234–3241, 2015.[69] OA Rosso, HA Larrondo, MT Martin, A Plastino, and MA Fuentes. Distinguishing noise from chaos.
Physical review letters , 99(15):154102, 2007.[70] Clarence W Rowley, IGOR MEZI?, Shervin Bagheri, Philipp Schlatter, DANS HENNINGSON, et al.Spectral analysis of nonlinear flows.
Journal of fluid mechanics , 641(1):115–127, 2009.[71] Tim Sauer, James A Yorke, and Martin Casdagli. Embedology.
Journal of statistical Physics , 65(3-4):579–616, 1991.[72] Peter J Schmid. Dynamic mode decomposition of numerical and experimental data.
Journal of fluidmechanics , 656:5–28, 2010.[73] Alwyn Scott.
Encyclopedia of nonlinear science . Routledge, 2006.[74] C Serio. Autoregressive representation of time series as a tool to diagnose the presence of chaos.
EPL(Europhysics Letters) , 27(2):103, 1994. 2575] Qingsong Song and Zuren Feng. Effects of connectivity structure of complex echo state network on itsprediction performance for nonlinear time series.
Neurocomputing , 73(10-12):2177–2185, 2010.[76] Jie Sun, Dane Taylor, and Erik M Bollt. Causal network inference by optimal causation entropy.
SIAMJournal on Applied Dynamical Systems , 14(1):73–106, 2015.[77] Floris Takens. Detecting strange attractors in turbulence. In
Dynamical systems and turbulence, War-wick 1980 , pages 366–381. Springer, 1981.[78] George C Tiao and Ruey S Tsay. Consistency properties of least squares estimates of autoregressiveparameters in arma models.
The Annals of Statistics , pages 856–871, 1983.[79] Henk A Van der Vorst.
Iterative Krylov methods for large linear systems , volume 13. CambridgeUniversity Press, 2003.[80] David Verstraeten, Benjamin Schrauwen, Michiel dHaene, and Dirk Stroobandt. An experimentalunification of reservoir computing methods.
Neural networks , 20(3):391–403, 2007.[81] Pantelis R Vlachas, Wonmin Byeon, Zhong Y Wan, Themistoklis P Sapsis, and Petros Koumout-sakos. Data-driven forecasting of high-dimensional chaotic systems with long short-term memorynetworks.
Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences ,474(2213):20170844, 2018.[82] Pantelis R Vlachas, Jaideep Pathak, Brian R Hunt, Themistoklis P Sapsis, Michelle Girvan, EdwardOtt, and Petros Koumoutsakos. Forecasting of spatio-temporal chaotic dynamics with recurrent neuralnetworks: A comparative study of reservoir computing and backpropagation algorithms. arXiv preprintarXiv:1910.05266 , 2019.[83] Matthew O Williams, Ioannis G Kevrekidis, and Clarence W Rowley. A data–driven approximationof the koopman operator: Extending dynamic mode decomposition.
Journal of Nonlinear Science ,25(6):1307–1346, 2015.[84] Herman Ole Andreas Wold.
A Study in the Analysis of Stationary Time Series: With an Appendix .Almqvist & Wiksell, 1954.[85] Kyongmin Yeo. Model-free prediction of noisy chaotic time series by deep learning. arXiv preprintarXiv:1710.01693 , 2017.[86] Kyongmin Yeo and Igor Melnyk. Deep learning algorithm for data-driven simulation of noisy dynamicalsystem.
Journal of Computational Physics , 376:1212–1231, 2019.[87] Koji Yonemoto and Takashi Yanagawa. Estimating the embedding dimension and delay time of chaotictime series by an autoregressive model.
Bulletin of informatics and cybernetics , 33(1-2):53–62, 2001.[88] Roland S Zimmermann and Ulrich Parlitz. Observing spatio-temporal dynamics of excitable mediausing reservoir computing.