Breaking Symmetries of the Reservoir Equations in Echo State Networks
Joschka Herteux a) and Christoph Räth b)
Institut für Materialphysik im Weltraum, Deutsches Zentrum für Luft- und Raumfahrt, Münchner Str. 20, 82234 Wessling, Germany
(Dated: 7 December 2020)
a) Electronic mail: [email protected]
b) Electronic mail: [email protected]
Reservoir computing has repeatedly been shown to be extremely successful in the prediction of nonlinear time-series. However, there is no complete understanding of the proper design of a reservoir yet. We find that the simplest popular setup has a harmful symmetry, which leads to the prediction of what we call mirror-attractor. We prove this analytically. Similar problems can arise in a general context, and we use them to explain the success or failure of some designs. The symmetry is a direct consequence of the hyperbolic tangent activation function. Further, four ways to break the symmetry are compared numerically: a bias in the output, a shift in the input, a quadratic term in the readout, and a mixture of even and odd activation functions. Firstly, we test their susceptibility to the mirror-attractor. Secondly, we evaluate their performance on the task of predicting Lorenz data with the mean shifted to zero. The short-time prediction is measured with the forecast horizon, while the largest Lyapunov exponent and the correlation dimension are used to represent the climate. Finally, the same analysis is repeated on a combined dataset of the Lorenz attractor and the Halvorsen attractor, which we designed to reveal potential problems with symmetry. We find that all methods except the output bias are able to fully break the symmetry, with input shift and quadratic readout performing the best overall.
Reservoir computing describes a kind of recurrent neural network, which has been very successful in the prediction of chaotic systems. However, the details of its inner workings have yet to be fully understood. One important aspect of any neural network is the activation function. Even though its effects have been extensively studied in other Machine Learning techniques, there are still open questions in the context of reservoir computing. Our research aims to fill this gap. We prove analytically that an antisymmetric activation function like the hyperbolic tangent leads to a disastrous symmetry in a popular setup we call simple ESN. This leads the reservoir to learn an inverted version of the training data we call mirror-attractor, which we demonstrate numerically. This heavily perturbs any prediction, especially if the mirror-attractor overlaps with the real attractor. Further, we compare four different ways to break the symmetry. We test numerically if they tend to learn the mirror-attractor and test their performance on two tasks where the simple ESN fails. We find that three of them are able to fully break the symmetry.
I. INTRODUCTION
Machine Learning (ML) has shown to be tremendously successful in categorization and recognition tasks, and the use of ML algorithms has become common in technical devices of daily living. But the application of ML also pervades more and more areas of science, including research on complex systems. For a very recent collection see and references therein.
In nonlinear dynamics, ML-based methods have recently attracted a lot of attention, because it was demonstrated that the exact short term prediction of nonlinear systems can be significantly improved. Furthermore, it was shown that ML techniques also allow for a very accurate reproduction of the long term properties ("the climate") of complex systems. Several ML methods like deep feed-forward artificial neural networks (ANN), recurrent neural networks (RNN) with long short-term memory (LSTM) and reservoir computing (RC) fulfill the prediction tasks.
RC has attracted most attention. It is a machine learning method that has been independently discovered as Liquid State Machines (LSM) by Maass and as echo state networks (ESN) by Jaeger. We focus here on the ESN approach, which falls under the category of Recurrent Neural Networks (RNN). The main difference to other RNNs such as LSTMs is that in RC only the last layer is explicitly trained via linear regression. Instead of hidden layers it uses a so-called reservoir, which in the case of the ESN is typically a network with recurrent connections.
The popularity of RC has several reasons. First, RC often shows superior performance. Second, ESNs offer conceptual advantages. As only the output layer is explicitly trained, the number of weights to be adjusted is very small. Thus, the training of ESNs is comparably transparent, extremely CPU-efficient (orders of magnitude faster than for ANNs) and the vanishing-gradient problem is circumvented. Furthermore, small, smart and energy-efficient hardware implementations using photonic systems, spintronic systems and many more are conceivable and being developed (and references therein).
Ongoing research is focused on identifying the necessary conditions for a good reservoir. Recent studies focused mainly on the influence of the size and topology of the reservoir on the prediction capabilities. Less attention has so far been paid to the role of the activation function and its effect on the overall performance of RC.
In this paper we study in detail the sensitivity of RC to symmetries in the activation function. We reveal that previously reported shortcomings for simple ESNs can unambiguously be attributed to symmetry properties of the activation function (and not of the input signal). We propose and assess four different methods to break the symmetries that were developed for obtaining more reliable prediction results.
The paper is organized as follows: Sec. II first discusses the different measures used and the two test systems: the Lorenz and the Halvorsen equations. Afterwards the different ESN designs used in this study are introduced and the symmetry of the simple ESN is proven. In Sec. III the three numerical experiments we conducted and their results are presented. Finally, we discuss our findings in Sec. IV.
II. METHODS

A. Measures and System Characteristics

1. Forecast Horizon
As in previous work, we use the forecast horizon to measure the quality of short-time predictions. It is defined as the time between the start of a prediction and the point where it deviates from the test data by more than a fixed threshold. The exact condition reads
$$ |v(t) - v_R(t)| > \delta , \qquad (1) $$
where the norm is taken elementwise. Due to the chaotic nature of our training data, any small perturbation will usually grow exponentially with time. Thus, this indicates the end of a reliable prediction of the actual trajectory. The measure is not very sensitive to the exact value of the threshold for the same reason.
The threshold generally depends on the direction, and we use $\delta = (5.\ldots,\ \ldots,\ \ldots)^T$ for the Lorenz system without preprocessing. In general the values of $\delta$ are chosen to be approximately 15% of the spatial extent of the respective attractor in the given direction. This is useful if the dynamics of a system takes place on different lengthscales.
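As an illustration, the forecast horizon of Eq. (1) can be computed directly from a predicted and a true trajectory. The following is a minimal sketch; the function and variable names are ours and not taken from any released code:

```python
import numpy as np

def forecast_horizon(v_true, v_pred, delta, dt=0.02):
    """Time until any component of the prediction deviates from the
    test data by more than its direction-dependent threshold, cf. Eq. (1)."""
    # v_true, v_pred: arrays of shape (T, d); delta: threshold per dimension
    exceeded = np.abs(v_true - v_pred) > delta   # elementwise comparison
    failed = np.any(exceeded, axis=1)            # condition (1) violated at this step
    if not failed.any():
        return len(v_true) * dt                  # prediction never left the tolerance tube
    return np.argmax(failed) * dt                # time of the first violation
```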
2. Correlation Dimension
To evaluate the climate of a prediction we use two measures. To understand the structural complexity of the attractor it is interesting to look at the correlation dimension. This is a way to quantify the dimensionality of the space populated by the trajectory. The correlation dimension is based on the discrete form of the correlation integral
$$ C(r) = \lim_{N \to \infty} \frac{1}{N^2} \sum_{i,j=1}^{N} \theta\left( r - |x_i - x_j| \right) , \qquad (2) $$
which returns the fraction of pairs of points that are closer than the threshold distance $r$. $\theta$ represents the Heaviside function. The correlation dimension is then defined by the relation
$$ C(r) \propto r^{\nu} \qquad (3) $$
as the scaling exponent $\nu$. For a self-similar strange attractor this relation holds in some range of $r$, which needs to be properly calibrated. Here we adjusted it for every given problem beforehand on simulated data, which is not used for training or testing. We note that precision is not essential here, since we are only interested in comparisons and not in absolute values.
To get the correlation dimension for a given dataset we use the Grassberger-Procaccia algorithm.
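A straightforward (not optimized) sketch of the Grassberger-Procaccia estimate described above; the scaling region $[r_{min}, r_{max}]$ has to be calibrated per system as noted in the text, and subsampling of long trajectories is an assumption we add to keep the pairwise-distance computation tractable:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(x, r_min, r_max, n_r=20, max_points=5000):
    """Estimate the correlation dimension nu of a trajectory x with shape (N, d)."""
    if len(x) > max_points:                         # subsample long trajectories
        x = x[:: len(x) // max_points]
    dists = pdist(x)                                # all pairwise distances |x_i - x_j|
    rs = np.logspace(np.log10(r_min), np.log10(r_max), n_r)
    C = np.array([np.mean(dists < r) for r in rs])  # correlation integral C(r), Eq. (2)
    nu, _ = np.polyfit(np.log(rs), np.log(C), 1)    # slope in the scaling region, Eq. (3)
    return nu
```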
3. Largest Lyapunov Exponent
The second measure we use to evaluate the climate is the largest Lyapunov exponent. In contrast to the correlation dimension it is indicative of the development of the system in time. A $d$-dimensional chaotic system is characterized by $d$ Lyapunov exponents, of which at least one is positive. They describe the average rate of exponential growth of a small perturbation in each direction in phase space. The largest Lyapunov exponent $\lambda$ is the one associated with the direction of the fastest divergence,
$$ d(t) = C e^{\lambda t} . \qquad (4) $$
Since it dominates the dynamics it has a special significance. It can be calculated from data with relative ease by using the Rosenstein algorithm. It is also possible to determine the complete Lyapunov spectrum from the equations, which we have access to for our test data as well as for our ESNs.
However, we found that the comparison is clearer in our case with the data-driven approach, because it is completely independent of details of the system, e.g. the question if it is discrete or continuous. This method is also computationally less costly.
We can further define the Lyapunov time $\tau_\lambda = 1/\lambda$ as the characteristic timescale of a system. We use $\tau_\lambda \approx 1.1$ for the Lorenz system and $\tau_\lambda \approx 1.35$ for the Halvorsen system, based on our measurements in table II.
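The Rosenstein estimate can be sketched as follows: for every point, find its nearest neighbour outside a Theiler window, follow the divergence of the two trajectory segments, average the log-distances and fit the slope, cf. Eq. (4). This is a simplified illustration under our own parameter choices, not the exact implementation used for the results:

```python
import numpy as np
from scipy.spatial import cKDTree

def largest_lyapunov(x, dt=0.02, theiler=50, n_follow=300):
    """Rosenstein-style estimate of the largest Lyapunov exponent from a trajectory x (T, d)."""
    T = len(x)
    tree = cKDTree(x)
    log_div = np.zeros(n_follow)
    counts = np.zeros(n_follow)
    for i in range(T - n_follow):
        # nearest neighbour of x[i] that is not a temporal neighbour
        _, idx = tree.query(x[i], k=2 * theiler + 5)
        j = next((k for k in idx if abs(k - i) > theiler and k < T - n_follow), None)
        if j is None:
            continue
        sep = np.linalg.norm(x[i:i + n_follow] - x[j:j + n_follow], axis=1)
        valid = sep > 0
        log_div[valid] += np.log(sep[valid])
        counts[valid] += 1
    mean_log = log_div / np.maximum(counts, 1)
    lam, _ = np.polyfit(np.arange(n_follow) * dt, mean_log, 1)   # slope gives lambda
    return lam
```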
B. Lorenz and Halvorsen system
A standard example of a chaotic attractor is provided by the Lorenz system. It is widely used as a test case for prediction of such systems with RC. It is defined by the equations
$$ \dot x = \sigma (y - x), \qquad \dot y = x (\rho - z) - y, \qquad \dot z = x y - \beta z , \qquad (5) $$
where we use the standard parameters $\sigma = 10$, $\beta = 8/3$, $\rho = 28$. We simulate the dynamics by integrating these equations using the Runge-Kutta method with timesteps of $\Delta t = 0.02$. We use varying starting points on the attractor.
The equations are symmetric under the transformation $(x, y, z) \to (-x, -y, z)$. Thus the mirror-attractor differs only in the z-coordinate. Furthermore, the mean of the attractor in z-direction is far away from the origin at $\bar z \approx 23.5$. This makes it an especially useful example for breaking the symmetry in the reservoir.
As a secondary test case we use the Halvorsen equations
$$ \dot x = -\sigma x - 4y - 4z - y^2, \qquad \dot y = -\sigma y - 4z - 4x - z^2, \qquad \dot z = -\sigma z - 4x - 4y - x^2 , \qquad (6) $$
with $\sigma = 1.3$. We simulate the dynamics in the same way as for the Lorenz equations. This system also exhibits chaotic behavior but does not have any symmetries under inversion of its coordinates. It has a cyclic symmetry, which should however not be relevant in this context. Both the Lorenz and the Halvorsen system are 3-dimensional autonomous dissipative flows.
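Both systems can be integrated with a classical fourth-order Runge-Kutta scheme and $\Delta t = 0.02$. The sketch below uses the parameter values of Eqs. (5) and (6) in the standard Halvorsen form; the initial condition is an arbitrary placeholder and a transient should be discarded before using the data:

```python
import numpy as np

def lorenz(v, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = v
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def halvorsen(v, sigma=1.3):
    x, y, z = v
    return np.array([-sigma * x - 4 * y - 4 * z - y**2,
                     -sigma * y - 4 * z - 4 * x - z**2,
                     -sigma * z - 4 * x - 4 * y - x**2])

def rk4_trajectory(f, v0, n_steps, dt=0.02):
    """Integrate dv/dt = f(v) with the classical Runge-Kutta method."""
    traj = np.empty((n_steps, len(v0)))
    v = np.asarray(v0, dtype=float)
    for i in range(n_steps):
        k1 = f(v)
        k2 = f(v + 0.5 * dt * k1)
        k3 = f(v + 0.5 * dt * k2)
        k4 = f(v + dt * k3)
        v = v + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[i] = v
    return traj

# example: 11000 steps of Lorenz data from an arbitrary starting point
lorenz_data = rk4_trajectory(lorenz, [1.0, 1.0, 1.0], 11000)
```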
C. Reservoir Computing and Simple ESN
There is a multitude of ways to design an ESN. In this paper we use several different variants, which will be introduced in the following sections. In general the input is fed into the reservoir and influences its dynamics. A usually linear readout is trained to translate the state of the reservoir into the desired output. The reservoir state is then a random, high-dimensional, nonlinear transformation of all previous input data. This naturally gives it a kind of memory.
In an ESN the dynamics of the reservoir are generally governed by an update equation for the reservoir state $r_t \in \mathbb{R}^N$ of the form
$$ r_{t+1} = f(A r_t, W_{in} x_t) . \qquad (7) $$
Here $f$ is called the activation function, the adjacency matrix $A \in \mathbb{R}^{N \times N}$ represents the network, $x_t \in \mathbb{R}^{d_x}$ is the input fed into the reservoir and $W_{in} \in \mathbb{R}^{N \times d_x}$ is the input matrix. There are many possible choices for $f$ and ways to construct $A$ and $W_{in}$. Here we always create $A$ as an Erdős-Renyi random network. Sparse networks have been found to be advantageous. The weights of the network are then drawn uniformly from
$[-1, 1]$ and afterwards rescaled to fix the spectral radius $\rho$ to some fixed value. $\rho$ is a free hyperparameter.
We chose $W_{in}$ to be also sparse, in the sense that every row has only one nonzero element. This means every reservoir node is only connected to one degree of freedom of the input. We fixed the number of nodes per dimension to be the same plus or minus one. The nonzero elements are drawn uniformly from the interval $[-s_{input}, s_{input}]$, where $s_{input}$ is another free hyperparameter.
From this we can then compute the output $y_t \in \mathbb{R}^{d_y}$. Here we are interested in the prediction case, where we train the ESN to approximate $y_t \approx x_{t+1}$. Thus, the dimensions of input and output are the same, so we use $d_x = d_y := d$. The readout is characterized by
$$ y_t = W_{out} \tilde r_t , \qquad (8) $$
where typically $\tilde r_t = r_t$, but $\tilde r_t \in \mathbb{R}^{\tilde N}$ can also be some nonlinear transformation or extension of $r_t$. The readout matrix $W_{out} \in \mathbb{R}^{d \times \tilde N}$ is the only part of the ESN that is trained. This is typically done via simple Ridge Regression. To train the reservoir the training data $x_{train} = \{ x_0, ..., x_{T_{train}} \}$ is fed into the reservoir to get the sequence $r_{train} = \{ r_1, ..., r_{T_{train}+1} \}$. The first $T_{sync}$ timesteps of $r$ are then discarded. This transient period is only used to synchronize the reservoir with the training data. This frees the ESN of any influence of the reservoir's initial condition thanks to the fading memory property. The state of a properly designed reservoir continuously loses its dependence on past states over time. For a detailed description see e.g.
Now the readout matrix can be calculated by minimizing
$$ \sum_{T_{sync} \le t \le T_{train}} \| W_{out} \tilde r_t - v_t \|^2 + \beta \| W_{out} \|^2 , \qquad (9) $$
where we get another hyperparameter $\beta$ from the regularization. The target output $v_t$ is in the case of prediction just $x_{t+1}$. We thus get
$$ W_{out} = \left( \tilde r^T \tilde r + \beta \mathbb{1} \right)^{-1} \tilde r^T v , \qquad (10) $$
where $\tilde r$ is $r_{train}$ in matrix form after discarding the synchronization steps and $v$ is analogous.
In the following we always use a network with $N = 200$, $T_{train} = 10500$, $T_{sync} = 500$ and average degree $k = 4$ unless otherwise stated. The spectral radius $\rho$, the regularization parameter $\beta$ and the input scaling $s_{input}$ are optimized for specific problems.
Our basic setup is close to what Jaeger originally proposed and it is one of the most widely used variants.
FIG. 1. Failed prediction of the Lorenz system with a simple ESN. The trajectory jumps down to the mirror-attractor.
FIG. 2. Failed prediction of zero-mean, normalized Lorenz data. The trajectory jumps frequently between the original and the mirror-attractor.
We call it simple ESN because all other designs we use are extensions of it. It is defined by the following equations:
$$ r_{t+1} = \tanh( A r_t + W_{in} x_t ) \qquad (11) $$
$$ y_t = W_{out} r_t \qquad (12) $$
The activation function is a sigmoidal function, specifically a hyperbolic tangent, which is the typical choice. The reservoir states are not transformed before the readout. With this setup successful predictions of different datasets have been made in many cases. However, we can sometimes see very specific ways in which they fail, as illustrated in figure 1 and figure 2. When predicting the Lorenz attractor (see Sec. II B), the prediction sometimes jumps to an inverted version of the training dataset, which we call mirror-attractor. To investigate the severity of the problem, we created 1000 realizations with predictions of 500000 timesteps.
We found that in 98.5% of cases where the prediction crossed zero in the $z$-direction, this led to a jump to the other attractor. 18% of the predictions did not exhibit a jump at any point. Among those that did, the average time of the first jump was about 31000 timesteps (539 Lyapunov times). Typically, the trajectory changed from one attractor to the other multiple times.
FIG. 3. Prediction of the z-coordinate after synchronization with original data (upper) and inverted data (lower) for the simple ESN. The t-axis is given in units of Lyapunov times.
While this is concerning, it still allows for decent short-time predictions. We were able to reach average forecast horizons of about 400 timesteps (7 Lyapunov times) after hyperparameter optimization. However, when the data is brought to zero mean the ability to make accurate predictions largely breaks down. After hyperparameter optimization we get a forecast horizon of about 90 timesteps (1.6 Lyapunov times). Since the two attractors overlap, the prediction jumps between them very frequently, as we can see in figure 2. Sometimes it even travels outside of both for a short time. Since this kind of preprocessing is considered good practice in machine learning and usually leads to better results, this shows a severe failure of the method.
Difficulties in predicting the Lorenz system, which is a widespread test case for ESNs, when using this kind of setup have been noted by previous studies. They were linked to the symmetry of the Lorenz equations under the transformation $(x, y, z) \to (-x, -y, z)$. However, we observe these problems with other datasets as well. We can largely explain this phenomenon by mathematical analysis independent of the input data.
To prove this we do the following: Assume $r_0 = 0$ w.l.o.g. because of the fading memory property. Let us now analyze what happens when, instead of the original training sequence $x_{train} = \{ x_0, ..., x_{T_{train}} \}$, we use its inverted version $-x_{train} = \{ -x_0, ..., -x_{T_{train}} \}$ to train the readout matrix:
$$ r_0(-x_{train}) = 0 = -r_0(x_{train}) \qquad (13) $$
$$ r_1(-x_{train}) = \tanh( -W_{in} x_0 ) \qquad (14) $$
$$ = -\tanh( W_{in} x_0 ) \qquad (15) $$
$$ = -r_1(x_{train}) \qquad (16) $$
This serves as the base case for our mathematical induction. We follow up with the induction step. Assume
$$ r_t(-x_{train}) = -r_t(x_{train}) . \qquad (17) $$
Then
$$ r_{t+1}(-x_{train}) = \tanh( A r_t(-x_{train}) - W_{in} x_t ) \qquad (18) $$
$$ = -\tanh( A r_t(x_{train}) + W_{in} x_t ) \qquad (19) $$
$$ = -r_{t+1}(x_{train}) \qquad (20) $$
Overall we get $r_{train}(-x_{train}) = -r_{train}(x_{train})$. So the dynamics of the reservoir only change sign and are otherwise unaffected. This is a consequence of the antisymmetry of the hyperbolic tangent.
Obviously, because of the linearity of the readout, we also get
$$ y_t(-x_{train}) = -W_{out} r_t(x_{train}) = -y_t(x_{train}) . \qquad (21) $$
And finally
$$ W_{out}(-x) = \left( r^T(-x)\, r(-x) + \beta \mathbb{1} \right)^{-1} r^T(-x)\, (-x) \qquad (22) $$
$$ = \left( r^T(x)\, r(x) + \beta \mathbb{1} \right)^{-1} r^T(x)\, x \qquad (23) $$
$$ = W_{out}(x) \qquad (24) $$
So training the simple ESN with inverted data is equivalent to training on the original data and both lead to learning the exact same parameters. Thus, it can never map these sequences to either the same output or any output that differs by anything other than the sign. It is therefore not universal. It can only fully learn the dynamics of systems that are themselves point symmetric at the origin.
Furthermore, when we use a simple ESN for prediction it is now obvious that it learns to replicate the inverted mirror-attractor as well as the real attractor. In cases where they overlap they are however incompatible.
When they do not overlap, but are close enough to each other, this makes jumps possible.
In figure 3 this is demonstrated by comparing the predictions of an already trained simple ESN after being synchronized with additional Lorenz data, either unchanged or inverted. We can see that the prediction of inverted data is simply the inversion of the prediction of the original data, just as expected from theory.
For a jump to happen, the reservoir has to arrive at a state that matches better with the mirror-attractor than with the real one. Since the reservoir has memory it is not obvious how fast input data from the phase-space region of the mirror-attractor can actually make that happen. Empirically we found that crossing zero in the z-direction leads to a jump in 98.5% of cases. We further observed that inverting the input in a single timestep was reliably enough to push the prediction onto the mirror-attractor. This implies a strong sensitivity to the input data with regards to inducing jumps.
It is clear that this kind of symmetry creates a significant limitation for the simple ESN. We can therefore easily explain the previous problems with this kind of approach as well as the so far mostly empirical success of some methods combating them. In many recent publications the readout was extended with some kind of nonlinear transformation. The empirical advantage of this has been explored without theoretical explanation by Chattopadhyay et al. Typically quadratic terms are included in the readout (see Sec. II D 2). This was to our knowledge originally introduced by Lu et al. in order to specifically solve a problem relating to the symmetry of the Lorenz equations, when using the ESN as an observer. It has since been used successfully in many more general cases without theoretical explanation. From our analysis it is now clear that this readout breaks the antisymmetry of the ESN as a whole, which is completely independent of any symmetries of the input data.
A similar analysis can be fruitful for many different designs of ESN. For example, in a recent paper by Carroll and Pecora the following update equation was used:
$$ r_i(t+1) = \alpha r_i(t) + (1 - \alpha) \tanh\left( \sum_{j=1}^{M} A_{ij} r_j(t) + w_i x(t) + 1 \right) , \qquad (25) $$
where the $w_i$ are the elements of what they call the input vector. Empirically they found that the performance suffered for $w_i = 1\ \forall i$ compared to $w_i \in \{ +1, -1 \}$. We can explain this with a symmetry under the transformation $x(t) \to -x(t) - 2/w$ for any constant $w_i = w\ \forall i$, similar to what we found for the simple ESN. As soon as $w_i$ takes on different values for different $i$ this symmetry is broken.
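The antisymmetry derived above is easy to check numerically: driving the same reservoir with a sequence and its inverted version yields reservoir states that differ only in sign. A minimal sketch, with the network construction simplified compared to the paper (a dense random matrix rescaled to a given spectral radius instead of a sparse Erdős-Renyi network, and illustrative hyperparameter values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
rho, s_input = 0.5, 0.1                            # illustrative values

A = rng.uniform(-1, 1, (N, N))
A *= rho / np.max(np.abs(np.linalg.eigvals(A)))    # rescale to spectral radius rho
W_in = rng.uniform(-s_input, s_input, (N, d))

def run_reservoir(x_seq):
    """Simple ESN update r_{t+1} = tanh(A r_t + W_in x_t), started from r_0 = 0."""
    r = np.zeros(N)
    states = []
    for x in x_seq:
        r = np.tanh(A @ r + W_in @ x)
        states.append(r)
    return np.array(states)

x_train = rng.normal(size=(1000, d))               # stands in for any input sequence
r_plus = run_reservoir(x_train)
r_minus = run_reservoir(-x_train)
print(np.max(np.abs(r_plus + r_minus)))            # ~1e-16: r(-x_train) = -r(x_train)
```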
D. Breaking Symmetry

To better understand what the best way is to break the harmful antisymmetry in the simple ESN, we test four different designs. There are two main ways of approaching the problem. We can either break the symmetry in the reservoir, e.g. by changing the activation function, or in the readout. When choosing the latter option, equation (17) still holds. The dynamics of the reservoir still do not change meaningfully with the sign of the training data. Only during prediction does the influence of the readout actually come into play. We use two designs following each approach.
1. Output Bias
FIG. 4. Prediction of the z-coordinate after synchronization with original data (upper) and inverted data (lower) for the ESN with output bias. The t-axis is given in units of Lyapunov times. The prediction of inverted data is perturbed but still in the mirror-attractor.
This is one of the simplest ways to break the symmetry. The readout is changed by using $\tilde r = \{ r_1, r_2, ..., r_N, 1 \}$. Effectively this leads to
$$ y_t = W_{out} \tilde r_t = \tilde W_{out} r_t + b , \qquad (27) $$
where $b \in \mathbb{R}^d$ is called the bias term and is determined in the linear regression. This very basic extension of linear regression is often already seen as good practice. Formally this breaks the symmetry:
$$ y_t(-r_t) = -\tilde W_{out} r_t + b = -y_t(r_t) + 2b . \qquad (28) $$
We note, however, that this way there are only $d$ parameters to represent the difference under sign change.
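In practice the bias simply amounts to appending a constant node with value 1 to the reservoir state before the regression, so that the fitted $W_{out}$ contains $b$ as its last column. A minimal sketch with our own naming:

```python
import numpy as np

def readout_with_bias(r_states):
    """Append a constant node so that y_t = W_out_tilde r_t + b, cf. Eq. (27)."""
    ones = np.ones((len(r_states), 1))
    return np.hstack([r_states, ones])    # shape (T, N+1), rows are r-tilde_t
```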
2. Lu Readout
As previously mentioned, a quadratic extension of the readout has recently become popular after its introduction by Lu et al. We use two different variants of this approach. In its most powerful version it consists of using $\tilde r = \{ r_1, r_2, ..., r_N, r_1^2, r_2^2, ..., r_N^2 \}$ in the readout. Effectively we get
$$ y_t = W_{out} \tilde r_t = W_{out}^{(1)} r_t + W_{out}^{(2)} r_t^2 , \qquad (29) $$
where $W_{out} \in \mathbb{R}^{d \times 2N}$ can be divided into $W_{out}^{(1)}$ and $W_{out}^{(2)} \in \mathbb{R}^{d \times N}$, and
$$ y_t(-r_t) = -W_{out}^{(1)} r_t + W_{out}^{(2)} r_t^2 \qquad (30) $$
$$ = -y_t(r_t) + 2 W_{out}^{(2)} r_t^2 . \qquad (31) $$
FIG. 5. Prediction after synchronizing with inverted data for the ESN with Lu readout.
The number of parameters that represent the difference under sign change is $d \times N$. In the following we will call this extended Lu readout. We use this in order to test the full potential of this approach. However, the higher number of parameters in the output matrix makes a quantitative comparison to other approaches unfair. For this purpose we also test a second version.
In the original work by Lu et al. only half of the nodes are squared in the readout and each is only used either in its linear or in its quadratic form. This can be achieved with a transformation of the reservoir state like $\tilde r = \{ r_1, r_2^2, r_3, r_4^2, ..., r_{N-1}, r_N^2 \}$, where we assumed $N$ to be even. We call this Lu readout or regular Lu readout. This way the number of parameters is unaffected, which makes a fair comparison with the other methods possible.
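Both variants are simple transformations of the recorded reservoir states before the ridge regression. A sketch under the same assumptions as above (states stored row-wise, $N$ even):

```python
import numpy as np

def extended_lu(r_states):
    """Extended Lu readout: every node appears linearly and squared, r-tilde in R^{2N}."""
    return np.hstack([r_states, r_states**2])

def regular_lu(r_states):
    """Regular Lu readout: every second node is squared, r-tilde stays N-dimensional."""
    r_tilde = r_states.copy()
    r_tilde[:, 1::2] = r_tilde[:, 1::2]**2    # square every other node (assumes N even)
    return r_tilde
```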
3. Input Shift
This design for an ESN has been proven to be universal by Grigoryeva and Ortega. Specifically, this means it can approximate any causal and time-invariant filter with the fading-memory property, which naturally excludes any problems with symmetry. This is also recommended in "A Practical Guide to Applying Echo State Networks" by Lukoševičius.
For this design the activation function of the simple ESN is extended by including a random bias term in every node of the reservoir. It can be written as a random vector $\gamma \in \mathbb{R}^N$ and gives the following new update equation:
$$ r_{t+1} = \tanh( A r_t + W_{in} x_t + \gamma ) . \qquad (32) $$
The readout is unchanged from the simple ESN. Unlike the first two methods this breaks the symmetry in the reservoir itself.
We draw the elements of $\gamma$ uniformly from $[-s_\gamma, s_\gamma]$, where $s_\gamma$ is a new hyperparameter to be optimized. This was a somewhat arbitrary choice for simplicity. In principle we could instead use a normal distribution, the distribution of the training data, etc.
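The modification is a single additional term in the update equation; $\gamma$ is drawn once when the reservoir is constructed and then kept fixed. A sketch (the value of $s_\gamma$ is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
s_gamma = 1.0                                  # illustrative value of the new hyperparameter
gamma = rng.uniform(-s_gamma, s_gamma, N)      # one random bias per node, fixed after creation

def update_with_input_shift(r, x, A, W_in):
    """Input-shift update r_{t+1} = tanh(A r_t + W_in x_t + gamma), cf. Eq. (32)."""
    return np.tanh(A @ r + W_in @ x + gamma)
```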
4. Mixed Activation Functions
Another way of breaking the symmetry directly in the reservoir is to replace some of the odd tanh activation functions with even functions. This was inspired by a different framing of the problem: Every node in the network can be understood as a function of the concatenation of input datum and reservoir state $\tilde x = \{ r_1, ..., r_N, x_1, ..., x_d \}$. The readout is then simply a linear combination of these functions, where the weights are optimized to approximate the output. The nodes differ by the random parameters introduced through the elements of $W_{in}$, $A$ and $\gamma$, if input shift is included. In general we want this set of functions to approximate a basis in the corresponding function space to be as powerful as possible. In the case of the simple ESN these functions are all odd and any linear combination of them will still be odd. Mixing in even functions gives us access to the whole function space, as any function can be divided into an even and an odd part.
As even function we simply used $\tanh^2$. We assigned half of the nodes connected to each input dimension to be even nodes, where this activation function is used.
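A sketch of the mixed update; here the even nodes are chosen as a random half of the reservoir for simplicity, whereas the paper assigns half of the nodes connected to each input dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
even_mask = np.zeros(N, dtype=bool)
even_mask[rng.choice(N, N // 2, replace=False)] = True   # these nodes get the even function

def update_mixed(r, x, A, W_in):
    """Odd nodes keep tanh, even nodes use tanh^2, which breaks the antisymmetry."""
    activation = np.tanh(A @ r + W_in @ x)
    activation[even_mask] = activation[even_mask]**2
    return activation
```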
TABLE I. Forecast horizon (F.H.) in $\Delta t$ ($\tau_\lambda$), largest Lyapunov exponent $\lambda \pm \sigma$ and correlation dimension $\nu \pm \sigma$ for all ESN designs on the zero-mean Lorenz task; the simple ESN reaches a forecast horizon of 90.9 $\Delta t$ (1.6 $\tau_\lambda$), and the test data itself corresponds to an infinite forecast horizon.

III. RESULTS

A. Predicting The Mirror-Attractor
To test the ability of the four methods to break the symmetry of the simple ESN, we tried to force them to predict the mirror-attractor of the Lorenz equations after being trained with regular data. If the symmetry is truly broken, we expect this prediction to fail completely, indicating that the ESN did not learn anything about the mirror-attractor.
FIG. 6. Normalized histogram of the forecast horizon (in units of Lyapunov times) with respect to the inverted test data after synchronizing with inverted training data for the four different symmetry-breaking designs.
To accomplish this we trained our ESNs with regular Lorenz data and then synchronized them with the inverted next 500 timesteps of the simulation. We then measured the forecast horizon of the prediction in regards to the (also inverted) test data. For comparison, we also looked at the prediction after synchronization with the same data without inversion.
In figure 4 we see the behavior of the ESN with output bias. It differs from before in that the prediction of the mirror-attractor is not simply the inversion of the regular prediction, as for the simple ESN in figure 3. However, even though it is generally a worse prediction, it clearly follows the inverted trajectory. The ESN has still learned a slightly perturbed version of the mirror-attractor.
When using input shift, Lu readout or mixed activations, we never observed a prediction staying in the vicinity of the mirror-attractor. Most of them instead leave it immediately and quickly converge to the real Lorenz attractor as in figure 5. In some cases the trajectory finds some other fixed point instead, but it never stays in the mirror-attractor. Qualitatively we get the same behavior when using mixed activations or input shift instead.
Furthermore, we made 1000 predictions of the mirror-attractor with all four designs while varying the network and the starting point of the training data. The distribution of forecast horizons is shown in figure 6. The output bias clearly sticks out as the only method showing the ability to predict the mirror-attractor. Some realizations reach forecast horizons up to one Lyapunov time, while the other methods never go beyond 0.1 Lyapunov times, corresponding to only O(1) timesteps. An exception is the regular Lu readout, where the prediction tends to stay on the mirror-attractor slightly longer than for input shift, extended Lu readout and mixed activation functions. However, all of these trajectories converge to the original attractor afterwards.
We also tested the rate of jumps between the attractors for the ESN with output bias analogously to the simple ESN in Sec. II C by making 1000 predictions with 500000 timesteps. Network and training data were varied for each realization. This time 30.5% of them did not show any jump. For the others the average time of the first jump was after about 33000 timesteps (574 Lyapunov times). 95.2% of the times the z-coordinate crossed zero it led to a jump. This might indicate a small improvement, but the fundamental problem has not been solved by the output bias. We did not observe any jumps when we used the other methods.

B. Zero-mean Lorenz
FIG. 7. Normalized histogram of forecast horizons (in units of Lyapunov times) for normalized, zero-mean Lorenz data with all five variants of ESN. Based on 1000 realizations (varying training data, network and starting point of prediction) for each.
To further compare the performance of the different ESNs, we test the ability to learn and predict Lorenz data where the mean has been shifted to the origin. As already discussed, this leads to overlap between the real and the mirror-attractor. So even though this preprocessing is usually preferred, it makes the simple ESN's problems with antisymmetry especially severe. We also rescaled all data to have a standard deviation of 1.
To get reliable quantitative results, we first carried out a hyperparameter optimization on this task for every design used. We used a grid search with 100 realizations for each point in parameter space. The best parameters are chosen on the basis of the highest average forecast horizon. Further details and results can be found in the appendix.
With the optimized hyperparameters we created 1000 realizations for each design and measured forecast horizon, largest Lyapunov exponent and correlation dimension. To accurately represent the climate we used predictions with a length of 20000 timesteps. The results are compiled in table I and figures 7 and 8.
The simple ESN's forecast horizon consistently lies in a region below 3.5 Lyapunov times with a mean of 1.6, and similarly to the first task the output bias offers a noticeable but small improvement with a mean of 2.6 Lyapunov times. The other four methods to break the symmetry all seem to work in principle. Their forecast horizons mostly lie between 7 and 14 Lyapunov times. Extended Lu readout and input shift show no significant difference with an average forecast horizon of about 11 Lyapunov times, while mixed activation functions and regular Lu readout show a lower average forecast horizon of 9.4 and 9.6, respectively.
In agreement with the results for the forecast horizon, the simple ESN's climate produces values of largest Lyapunov exponent and correlation dimension far away from the desired region. Again the output bias is only a small improvement. All results for extended Lu readout and input shift lie in the direct vicinity of the target, and the mean values and standard deviations match those of the test data within uncertainty. Regular Lu readout and mixed activation functions also reproduce the correct mean values, but with much higher variance. This can be attributed to the clear outliers visible in the plots. We note that the results for the climate are less reliable than those for the forecast horizon, since we did not optimize the hyperparameters for this task.
In some cases the predicted trajectory diverged completely or got stuck in a fixed point. This made the proper calculation of the largest Lyapunov exponent impossible and thus these values are excluded from that statistic. This happened 6 times with the simple ESN setup, 30 times with mixed activation functions and 5 times when using a regular Lu readout. Input shift, extended Lu readout and even output bias did not show this behavior in any prediction in this experiment.
Overall, using a regular Lu readout or mixed activation functions did not perform quite as well on this standard task as input shift and extended Lu readout.
C. Halvorsen and Lorenz
Finally, we compare the five different ESNs on a task that we specifically designed to test their symmetry-breaking abilities. For this goal we create a dataset by simulating both the Lorenz and the Halvorsen system. The mean of the Lorenz data is shifted to $(1\ldots,\ \ldots,\ \ldots 1)$ and the mean of the Halvorsen data is shifted to $(-\ldots,\ -\ldots,\ -\ldots)$. The data is scaled such that no point lies higher than 1 from the mean in any dimension. This ensures that the two attractors do not overlap while lying completely in the region of each other's mirror-attractor.
FIG. 8. Scatterplot of the largest Lyapunov exponent against the correlation dimension when predicting zero-mean Lorenz data. The red ellipse corresponds to five times the standard deviation of the test data. Based on 1000 realizations for each setup as in figure 7.
FIG. 9. Example of predictions after training the same network on Lorenz and Halvorsen data simultaneously as described in III C. Here we use the simple ESN setup.
FIG. 10. Example prediction of the Lorenz and Halvorsen attractor with a single ESN with extended Lu readout.
To train the ESN to simultaneously be able to predict both systems we use the following trick. Firstly, we synchronize it with the Lorenz data and record $r_{train}^{Lorenz}$ after the initial transient period. We do however not calculate $W_{out}$ yet. Instead we repeat the process with the Halvorsen data. Now the transient period has the additional use of letting the reservoir forget about the Lorenz system. This way we get $r_{train}^{Lorenz}$ and $r_{train}^{Halvorsen}$. We simply concatenate them to get a single dataset $r_{train}$, from which we finally compute the readout matrix. As desired output we use an analogous concatenation of the Lorenz data and the Halvorsen data. We note that, since the linear readout is in no way sensitive to the causal relationship between the reservoir states, and the transient period at the second training stage was discarded, the transition between the two systems in itself does not influence training.
This way the ESN has to learn dynamics that are governed by a completely different set of equations instead of the mirror-attractor. In the end it should be able to predict both attractors depending on the starting point of the reservoir states. Since this is a more difficult task, and to make sure that possible failures are not just due to a lack of nodes, we use $N = 500$ in this experiment. However, we made similar qualitative observations for smaller and larger networks.
To be able to do a quantitative analysis, we first performed a hyperparameter optimization as described in the appendix. We used the product of forecast horizons on both systems as a measure of performance. Afterwards we carried out the same experiment as in Sec. III B with this combined dataset. We always made a prediction on both attractors with the same network. The results are compiled in table II.
Unsurprisingly we observe that the simple ESN is not able to master this task (see figure 9). Most predictions completely diverge from both attractors. In the handful of cases where one of the predictions actually reproduced the climate of one attractor, the other one was always a complete failure. Qualitatively we see the same results when including an output bias.
Since all predictions with the simple ESN and almost all with the ESN with output bias either diverged or converged to some fixed point, we could not provide meaningful results for the climate. We note however that the short-term prediction of the Lorenz system was actually significantly better for both than in Sec. III B. We attribute this to the higher number of nodes. Still the inability to reproduce long-term behavior indicates that this kind of problem can only be solved with a properly broken symmetry, as we assumed.
Again the other methods to break the symmetry are all successful (see as an example figure 10) and predict both attractors with the same training quite well. The results in the forecast horizon for input shift, mixed activation functions and the regular Lu readout are very similar with an average of about 9.8 Lyapunov times (Lorenz) and 10.7 Lyapunov times (Halvorsen). However, only the input shift was able to reproduce the climate with the same accuracy as the test data. It is notable that the extended Lu readout performed significantly better than the others in terms of the forecast horizon on both systems. The averages were 10.6 Lyapunov times (Lorenz) and 11.6 Lyapunov times (Halvorsen).
In contrast, there was also a small number of predictions that completely diverged or got stuck in a fixed point with this design. This occurred 26 times with the extended Lu readout and 8 times with the regular Lu readout. As in Sec. III B these predictions are not included in the Lyapunov exponent statistic. The same did not happen when using input shift or mixed activation functions. This might be due to the fact that the Lu readout does not break the symmetry in the reservoir itself. For this more complicated task the additional parameters in the readout might not always be sufficient to encode the difference in the dynamics for a sign change. It could be related to the fact that those dynamics are completely independent of each other.
TABLE II. Forecast horizon (F.H.) in $\Delta t$ ($\tau_\lambda$), largest Lyapunov exponent $\lambda \pm \sigma$ and correlation dimension $\nu \pm \sigma$ for the combined Lorenz and Halvorsen task, with one row per system for each design. The simple ESN reaches forecast horizons of 183.2 $\Delta t$ (3.2 $\tau_\lambda$) and 54.2 $\Delta t$ (0.8 $\tau_\lambda$) and the output bias 226.3 $\Delta t$ (3.9 $\tau_\lambda$) and 57.7 $\Delta t$ (0.9 $\tau_\lambda$), without meaningful climate values; the mixed activation functions reach 563.4 $\Delta t$ (9.8 $\tau_\lambda$) on the Lorenz system. The test data corresponds to an infinite forecast horizon.

IV. DISCUSSION
In the present work we showed a mathematical proof for the antisymmetry of the simple ESN with regards to changing the sign of the input. This is a consequence of the antisymmetry of the activation function. It makes it impossible to fully learn the dynamics of any attractor that is not point symmetric around the origin. In practice we observed that the prediction jumps to an inverted version of the real attractor we call mirror-attractor. This is especially disastrous if the two overlap. From this we conclude that this setup is not suitable for general tasks and should not be used.
Furthermore, we note that the sensitivity to this kind of symmetries with regards to the input is a universal property of ESNs and Reservoir Computers in general. This is in no way limited to the specifics of the simple ESN. It must be kept in mind in every reservoir design and can explain the empirical success or failure of some of them.
In our experiments with the output bias we found that formally breaking the symmetry alone is not enough to solve the problems associated with it. It was only able to improve the performance marginally and we still observed the appearance of an only slightly perturbed mirror-attractor. This might be due to the fact that the number of parameters representing the symmetry break in this approach is too low to accurately model the difference.
We were however able to successfully break the symmetry and solve the problem with three other approaches: introducing an input shift in the activation function, using a mixture of even and odd activation functions and including squared nodes in the readout. All of them were able to eliminate the mirror-attractor and make qualitatively good predictions even for zero-mean Lorenz data, where the overlap with the mirror-attractor is a severe problem for the simple ESN. They were further all able to master the task of predicting a dataset made of Lorenz and Halvorsen data, where it was necessary to learn completely different dynamics in the regime of the mirror-attractor.
The input shift proved in our tests to be a very useful and reliable tool to break the symmetry. The performance in all tasks was consistently good in short-time prediction. It successfully reproduced the climate statistics of the test data for the zero-mean Lorenz dataset and even for the combined Lorenz and Halvorsen data. This method was the only one to never produce outliers of completely failed predictions. It is also worth stressing that to our knowledge universality is only proven for ESNs with input shift. One disadvantage of this approach is the additional hyperparameter, which has to be optimized.
Even though the Lu readout in its regular form also seems to have broken the symmetry successfully, the results were generally not quite on par with the input shift. However, this gap was bridged by the extended Lu readout, which poses a convenient way to add more parameters to the model without increasing the size of the reservoir. For the zero-mean Lorenz data this led to a performance essentially identical with that of the input shift. In the case of the combined dataset the short-time prediction surpassed the input shift, while the climate was not reproduced with the same accuracy. This improvement is likely due to the higher number of parameters and thus higher complexity in the readout. While this is of course accompanied by an increase in computational time, it demonstrates the general possibility to enhance the prediction abilities of a given reservoir by extending the readout.
Even though the regular Lu readout seems to be enough to break the symmetry, using even higher order nonlinear transformations of nodes and an even bigger output matrix could further increase the performance. An application of this could be found in physical RC, where the dynamics of the reservoir might be inaccessible. One might however consider making the dynamics of the reservoir more complex, while keeping the simple linear readout, to be more in line with the philosophy of RC.
The ESN with mixed activation functions performed similarly well to the ESN with regular Lu readout. This is interesting, because the former can be understood as using squared nodes not only in the readout, but also in the dynamics of the reservoir. The results imply that this does not lead to a meaningful improvement. At least for our implementation we also found it to have a higher time cost. Thus, we do not recommend its use in the given form. However, the usage of different functions, different ratios, etc. could lead to better performance. Further research in this direction is needed.
In light of these results we recommend both the input shift and the Lu readout as methods to break the symmetry.
ACKNOWLEDGEMENTS
We wish to acknowledge useful discussions and comments from Jonas Aumeier, Sebastian Baur, Youssef Mabrouk, Alexander Haluszczynski and Hubertus Thomas. We also want to thank our anonymous reviewers for their helpful suggestions.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Appendix A: Hyperparameter Optimization
The hyperparameter optimization was carried out as a simple grid search with the aim to maximize the forecast horizon. For the simple ESN we searched over $a$ and $\epsilon$, with
$$ s_{input} = a (1 - \epsilon) \qquad (A1) $$
$$ \rho = a \epsilon . \qquad (A2) $$
The same was done for the ESN with output bias and the ESN with Lu readout. For the ESN with input shift we additionally varied the scale $s_\gamma$, and for the mixed activations we replaced $a$ with $a_1$ and $a_2$, which were optimized for the tanh-nodes and the $\tanh^2$-nodes separately.
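The $(a, \epsilon)$ parametrization translates into grid points of $(s_{input}, \rho)$ as follows; the ranges in this sketch are placeholders, the ranges actually used are listed in the appendix tables:

```python
import numpy as np

def grid_points(a_values, eps_values):
    """Grid over (a, eps) with s_input = a (1 - eps) and rho = a * eps, cf. (A1)-(A2)."""
    return [{"a": a, "eps": e, "s_input": a * (1.0 - e), "rho": a * e}
            for a in a_values for e in eps_values]

# placeholder ranges for illustration only
search_space = grid_points(np.linspace(0.1, 2.0, 10), np.linspace(0.05, 0.95, 10))
```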
1. Predicting the mirror-attractor
Since we were less interested in quantitative results in this case we did not perform a real hyperparameter optimization procedure. Instead we used the parameters from our previous work for the simple ESN and manually searched the parameters for the others to reproduce the Lorenz attractor reasonably well. This left us with the parameters in table III.
TABLE III. Hyperparameter choices for the prediction of the mirror-attractor: values of $a$, $\epsilon$, $\beta$, $s_\gamma$, $a_1$ and $a_2$ for the simple ESN, output bias, Lu readout, mixed activations and input shift ($a_1 = a_2 = 0.32$ for the mixed activations).
2. Zero-mean Lorenz
In the case of the zero-mean Lorenz data we simulated 100 trajectories of training data and 100 trajectories of test data. At every point in hyperparameter space we generate a new network and a new $W_{in}$ for each trajectory. We then choose the hyperparameters with the highest average forecast horizon.
Since it did not seem to depend strongly on the other hyperparameters, the problem or the specific design, we simply set $\beta = 1.\ldots \times 10^{-\ldots}$ as in the first task to save time with the already very costly grid search.
The results are compiled in tables IV, V, VI, VII, VIII and IX, which list the searched ranges (min, max, step size) and the optimal values of $a$, $\epsilon$ and $s_\gamma$ for each design.
3. Halvorsen and Lorenz
For the combined dataset of Halvorsen and Lorenz attractor we simulate 100 training and test trajectories of each system and use them as described in Sec. III C. As before we train a completely new reservoir on each trajectory for every point in parameter space. For every realization the product of the two forecast horizons is calculated. The optimal hyperparameters are chosen to maximize this product averaged over the trajectories.
For this task we did include the regularization parameter $\beta$ in the search, with a logarithmic scale from $10^{-\ldots}$ to 0.001 in 11 steps. We consistently found $\beta = 10^{-\ldots}$ to be the best choice for all designs.
The results are compiled in tables X, XI, XII, XIII, XIV and XV, which list the searched ranges (min, max, step size) and the optimal values of $a$, $\epsilon$ and $s_\gamma$ for each design.

REFERENCES

Y. Tang, J. Kurths, W. Lin, E. Ott, and L. Kocarev, "Introduction to focus issue: When machine learning meets complex systems: Networks, chaos, and nonlinear dynamics," Chaos: An Interdisciplinary Journal of Nonlinear Science, 063151 (2020).
Z. Lu, B. R. Hunt, and E. Ott, "Attractor reconstruction by machine learning," Chaos: An Interdisciplinary Journal of Nonlinear Science, 061104 (2018).
J. Pathak, Z. Lu, B. R. Hunt, M. Girvan, and E. Ott, "Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data," Chaos: An Interdisciplinary Journal of Nonlinear Science, 121102 (2017).
P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos, "Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics," (2019), arXiv:1910.05266 [eess.SP].
A. Chattopadhyay, P. Hassanzadeh, D. Subramanian, and K. Palem, "Data-driven prediction of a multi-scale Lorenz 96 chaotic system using a hierarchy of deep learning methods: Reservoir computing, ANN, and RNN-LSTM," (2019).
W. Maass, T. Natschlaeger, and H. Markram, "Real-time computing without stable states: A new framework for neural computation based on perturbations," Neural Computation, 2531–2560 (2002).
H. Jaeger, "The 'echo state' approach to analysing and training recurrent neural networks - with an erratum note," Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 13 (2001).
G. Van der Sande, D. Brunner, and M. C. Soriano, "Advances in photonic reservoir computing," Nanophotonics, 561–576 (2017).
D. Prychynenko, M. Sitte, K. Litzius, B. Krüger, G. Bourianoff, M. Kläui, J. Sinova, and K. Everschor-Sitte, "Magnetic skyrmion as a nonlinear resistive element: A potential building block for reservoir computing," Physical Review Applied, 014034 (2018).
G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, "Recent advances in physical reservoir computing: A review," Neural Networks, 100–123 (2019).
A. Griffith, A. Pomerance, and D. J. Gauthier, "Forecasting chaotic systems with very low connectivity reservoir computers," Chaos: An Interdisciplinary Journal of Nonlinear Science, 123108 (2019).
T. L. Carroll and L. M. Pecora, "Network structure effects in reservoir computers," arXiv preprint arXiv:1903.12487 (2019).
A. Haluszczynski and C. Räth, "Good and bad predictions: Assessing and improving the replication of chaotic attractors by means of reservoir computing," Chaos: An Interdisciplinary Journal of Nonlinear Science, 103143 (2019).
A. Haluszczynski, J. Aumeier, J. Herteux, and C.
Räth, "Reducing network size and improving prediction stability of reservoir computing," Chaos: An Interdisciplinary Journal of Nonlinear Science, 063136 (2020).
T. Carroll, "Path length statistics in reservoir computers," Chaos: An Interdisciplinary Journal of Nonlinear Science, 083130 (2020).
P. Grassberger and I. Procaccia, "Measuring the strangeness of strange attractors," Physica D: Nonlinear Phenomena, 189–208 (1983).
P. Grassberger, "Generalized dimensions of strange attractors," Physics Letters A, 227–230 (1983).
M. T. Rosenstein, J. J. Collins, and C. J. De Luca, "A practical method for calculating largest Lyapunov exponents from small data sets," Physica D: Nonlinear Phenomena, 117–134 (1993).
M. Sandri, "Numerical calculation of Lyapunov exponents," Mathematica Journal, 78–84 (1996).
E. N. Lorenz, "Deterministic nonperiodic flow," Journal of the Atmospheric Sciences, 130–141 (1963).
J. C. Sprott,
Chaos and Time-Series Analysis, Vol. 69 (Citeseer, 2003).
A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, 55–67 (1970).
L. Grigoryeva and J.-P. Ortega, "Echo state networks are universal," Neural Networks, 495–508 (2018).
S. Boyd and L. Chua, "Fading memory and the problem of approximating nonlinear operators with Volterra series," IEEE Transactions on Circuits and Systems, 1150–1161 (1985).
Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott, "Reservoir observers: Model-free inference of unmeasured variables in chaotic systems," Chaos: An Interdisciplinary Journal of Nonlinear Science, 041102 (2017).
M. Lukoševičius, "A practical guide to applying echo state networks," in