Sparsity in Reservoir Computing Neural Networks
Claudio Gallicchio
Department of Computer Science, University of Pisa
Pisa, [email protected]
Abstract—Reservoir Computing (RC) is a well-known strategy for designing Recurrent Neural Networks characterized by striking efficiency of training. The crucial aspect of RC is to properly instantiate the hidden recurrent layer that serves as dynamical memory for the system. In this respect, the common recipe is to create a pool of randomly and sparsely connected recurrent neurons. While the role of sparsity in the design of RC systems has been debated in the literature, it is nowadays understood mainly as a way to enhance the efficiency of computation, exploiting sparse matrix operations. In this paper, we empirically investigate the role of sparsity in RC network design from the perspective of the richness of the developed temporal representations. We analyze sparsity both in the recurrent connections and in the connections from the input to the reservoir. Our results point out that sparsity, in particular in the input-reservoir connections, plays a major role in developing internal temporal representations that have a longer short-term memory of past inputs and a higher dimension.
Index Terms—Reservoir Computing, Echo State Networks, Short-term Memory, Sparse Recurrent Neural Networks
I. INTRODUCTION
Recurrent Neural Networks (RNNs) [1] are a fundamental tool for the adaptive processing of dynamically evolving information, with excellent performance in fields such as time-series forecasting [2], machine translation [3], and speech and text processing [4], [5], just to mention a few. An increasing number of works are analyzing the role of sparsity in the design of trained (dynamical) neural network systems, for example through pruning [6] or re-wiring [7] connections. The characterization emerging from these studies is that having sparse connections between neurons is not only advantageous in computational terms, as it enables fast sparse matrix computations, but can also be beneficial for obtaining better performance in practice. Moreover, in the context of neurobiologically-inspired information processing systems, a sparse degree of connectivity between neurons has been shown to improve the quality of the developed internal representations [8]. Interestingly, the optimal amount of sparsity in the numerical simulations matched observed properties of cerebellum-like circuits.

Reservoir Computing (RC) neural networks [9]–[11] represent an intriguing development in the field of RNNs. In RC, the recurrent hidden layer of an RNN is left untrained after initialization, subject to asymptotic stability conditions of the corresponding dynamical system. As a result, learning is applied only to a simple readout component, with striking advantages in terms of required training times compared to fully trained RNNs. Pushing the involved algorithms towards extreme simplicity and efficiency makes the RC approach very well suited for real-world application scenarios characterized by (possibly severe) resource constraints, such as neuromorphic hardware implementations [12] or cyber-physical systems where the learning modules are embedded at the edge [13].

A typical strategy in the design of RC networks is to set up the recurrent layer in a sparse way. The initial intuition was that sparsity in the recurrent untrained layer could enable a decoupling of state variables and hence richer representations [14]. Subsequently, several authors pointed out empirical evidence contrary to this initial intuition (see, e.g., [10], [15], [16]). Currently, the sparse design of reservoirs is commonly understood mainly as a way to speed up state computations, without a practical effect on the resulting performance. However, the impact of sparsity on the performance of RC neural networks has typically been studied limited to the recurrent connections only. In this paper, we intend to shed more light on the role of sparsity in RC by extending the analysis to both recurrent and input connections. Specifically, we empirically show the effect of recurrent and input sparsity in reservoirs, evaluated by means of the short-term memory capacity and the effective dimension of the resulting state trajectories.

The rest of this paper is structured as follows. We introduce the basics of the RC methodology in Section II, discussing initialization and sparsity of reservoirs. Then, in Section III we present the concepts of short-term memory capacity and effective reservoir dimension. Our experimental analysis is described in Section IV. Finally, in Section V we draw our conclusions and sketch possible developments.

II. RESERVOIR COMPUTING NEURAL NETWORKS
Here we give a brief description of the RC design methodology for RNNs, focusing on the Echo State Network (ESN) [11], [14] model.

An RC network is a neural information processing system that treats data in the form of (temporal) sequences. Architecturally, the neural network is composed of a hidden recurrent layer called the reservoir, and an output layer called the readout. Fig. 1 illustrates the building blocks of a typical RC network.

Fig. 1: Architecture of an RC neural network. The dotted arrow indicates trained connections.

In what follows, we denote the number of reservoir neurons, i.e., the reservoir dimension, by N, and the state of the reservoir system at time t by h(t) ∈ R^N. This state is evolved by following a state update equation:

h(t) = tanh(U x(t) + W h(t−1)),   (1)

where x(t) ∈ R^M is the M-dimensional input at time t, U ∈ R^{N×M} is the input weight matrix, modulating the influence of the external input on the current state, and W ∈ R^{N×N} is the recurrent weight matrix, which controls the impact of the previous state on the current state. The state is typically set to a zero vector as initial condition, i.e., h(0) = 0 ∈ R^N. Note that here we dropped from (1) the reference to bias terms, to focus the analysis on the external stimulating input signal alone. Both weight matrices U and W remain untrained after initialization (see Section II-A).

The reservoir system is coupled with a linear readout layer that computes an L-dimensional output at each time step, i.e., y(t) ∈ R^L, as an affine transformation of the reservoir state:

y(t) = V h(t) + b,   (2)

where V ∈ R^{L×N} is a readout weight matrix and b ∈ R^L is a bias vector (which assumes a constant unitary input bias for the readout). The readout parameters are the only ones that undergo a training process, typically in closed-form fashion by using pseudo-inversion [9].
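As a minimal illustrative sketch, (1) and (2) can be implemented in a few lines of NumPy. The function and variable names below are our own, and dense matrices are used purely for readability:

```python
import numpy as np

def esn_states(U, W, x):
    """Run the state update h(t) = tanh(U x(t) + W h(t-1)) of Eq. (1)
    over an input sequence x of shape (T, M); returns the states (T, N)."""
    N = W.shape[0]
    h = np.zeros(N)                       # zero initial condition h(0) = 0
    states = np.empty((len(x), N))
    for t, x_t in enumerate(x):
        h = np.tanh(U @ x_t + W @ h)      # Eq. (1), bias terms omitted
        states[t] = h
    return states

def train_readout(states, targets):
    """Closed-form readout training by pseudo-inversion: returns V and b
    of Eq. (2) minimizing the squared error over the training steps."""
    H = np.hstack([states, np.ones((len(states), 1))])  # constant unitary bias input
    Vb = np.linalg.pinv(H) @ targets                    # least-squares solution
    return Vb[:-1].T, Vb[-1]                            # V in R^{L x N}, b in R^L
```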
A. Initialization of Reservoirs

The fundamental characterization of RC neural networks is that all the reservoir parameters remain untrained after initialization. Such initialization is performed in agreement with asymptotic stability conditions expressed by the Echo State Property (ESP) [14], [17], [18], which essentially requires controlling the magnitude of the weights in U and W. Usually, both the input weights in U and the recurrent weights in W are randomly drawn from a uniform distribution in [−1, 1]. After that, the elements in U are re-scaled by a factor ω_in, which takes the role of input scaling. The weights in W are re-scaled to control the largest absolute eigenvalue, i.e., the spectral radius ρ, typically to a value smaller than 1 [14].

The design strategy of the reservoir topology (i.e., the way in which the reservoir neurons are connected to each other) has been the subject of several studies in the literature (see, e.g., [19], [20]). While some of the proposed reservoir organizations can be beneficial in specific application circumstances, a random and sparse topological organization of the reservoir is the architecture of choice in general cases. This is the focus of our analysis in this paper.

Making the connections among reservoir neurons sparse has the fundamental practical advantage of reducing the cost of the state update operations in (1). Indeed, for densely connected reservoirs (and assuming N >> M), the cost of state updating scales as O(N²), i.e., quadratically with the reservoir size. A first approach to making the reservoir sparsely connected would be to impose a (small) fixed percentage, say C, of non-zero weights in the involved weight matrices. Although this reduces the running times in practice for smaller reservoirs, such an approach would still asymptotically scale as O(N² C/100), hence quadratically with the reservoir size. A more effective approach, which is adopted in this paper, is to fix the number, say χ_R, of incoming recurrent connections for each reservoir unit. This makes the state update cost as small as O(N χ_R), i.e., scaling only linearly with the number of neurons in the reservoir. A similar strategy can be adopted for the setup of the input connections. In this case, to ensure that each input dimension is actually forwarded to the reservoir, we fix the number of outgoing connections from each input unit, denoted as χ_I. The sparse architectural reservoir setup used in this paper is exemplified in Fig. 2. Notice that in this case, every row of W has exactly χ_R non-zero values, and every column of U has exactly χ_I non-zero elements, with both χ_R and χ_I being not greater than N.
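A minimal sketch of this sparse initialization scheme follows, in the same NumPy setting as above; the default values of ρ and ω_in in the signature are illustrative placeholders, not prescriptions:

```python
import numpy as np

def init_reservoir(N, M, chi_R, chi_I, rho=0.9, omega_in=1.0, rng=None):
    """Sparse reservoir initialization: every row of W receives exactly
    chi_R non-zeros and every column of U exactly chi_I non-zeros, drawn
    uniformly in [-1, 1]; W is then rescaled to spectral radius rho and
    U by the input scaling omega_in."""
    rng = np.random.default_rng(rng)
    W = np.zeros((N, N))
    for i in range(N):                    # chi_R incoming connections per reservoir unit
        cols = rng.choice(N, size=chi_R, replace=False)
        W[i, cols] = rng.uniform(-1.0, 1.0, size=chi_R)
    U = np.zeros((N, M))
    for j in range(M):                    # chi_I outgoing connections per input unit
        rows = rng.choice(N, size=chi_I, replace=False)
        U[rows, j] = rng.uniform(-1.0, 1.0, size=chi_I)
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # control the spectral radius
    U *= omega_in                                     # input scaling
    return U, W
```

In practice, W would be stored in a sparse format (e.g., via scipy.sparse) so that the O(N χ_R) update cost is actually realized; the dense arrays above are for readability only.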
III. SHORT-TERM MEMORY AND EFFECTIVE RESERVOIR SPACE DIMENSION

The role of the recurrent reservoir system is to embed the input time-series into an internal “state” representation, given by the activation of the reservoir neurons over time. Here we analyze the quality of such an internal reservoir representation by quantifying its short-term memory and effective dimension.
Short-term Memory Capacity (MC) [21] tests the ability of a recurrent neural system to reconstruct its driving input time-series from the transient state dynamics. In more detail, the reservoir is driven by a uni-dimensional time-series x(t), t = 1, 2, . . ., and different readout units are trained to recall progressively delayed versions of the input, i.e., the i-th readout unit y_i(t) should approximate x(t − i). The MC of an RC network is then quantified as follows:

MC = Σ_{i=1}^{∞} cov²(x(t − i), y_i(t)) / (σ²(x(t − i)) σ²(y_i(t))),   (3)

i.e., as the sum of the squared correlation coefficients between the delayed input and the reconstructed signals.

Fig. 2: Illustration of sparsity in input-to-reservoir and recurrent reservoir connections. χ_R indicates the number of incoming recurrent connections for each reservoir unit. χ_I indicates the number of outgoing connections from each input unit.

Effective Dimension (N_eff) [8], [22] is a measure of the number of orthogonal directions in the neuronal system's state trajectory over time. While the evolution of the reservoir system in (1) is described by an N-dimensional state vector h(t), the actual reservoir trajectory lies in a lower-dimensional manifold, whose dimension can be quantified as follows:

N_eff = (Σ_{i=1}^{N} λ_i)² / (Σ_{i=1}^{N} λ_i²),   (4)

where λ_i, i = 1, 2, . . . , N, denote the eigenvalues of the covariance matrix of the reservoir state activations over time. When measured for a reservoir under the driving influence of an external time-series, (4) gives an estimate of the number of directions of reservoir state variability that are (linearly) uncorrelated along the observed trajectory.
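Under the same illustrative NumPy setting, MC and N_eff can be estimated roughly as follows; this is a sketch under our own assumptions about data splits and naming, not a reference implementation:

```python
import numpy as np

def memory_capacity(S_train, S_test, x_train, x_test, max_delay=200):
    """Estimate MC as in Eq. (3): for each delay i, train a linear readout
    (by pseudo-inversion) to reconstruct x(t - i), then accumulate the
    squared correlation between targets and reconstructions on test data."""
    H = np.hstack([S_train, np.ones((len(S_train), 1))])
    Ht = np.hstack([S_test, np.ones((len(S_test), 1))])
    mc = 0.0
    for i in range(1, max_delay + 1):
        w = np.linalg.pinv(H[i:]) @ x_train[:-i]   # pair h(t) with x(t - i)
        pred = Ht[i:] @ w
        r = np.corrcoef(x_test[:-i], pred)[0, 1]
        mc += r ** 2
    return mc

def effective_dimension(S):
    """Estimate N_eff as in Eq. (4) from the eigenvalues of the
    covariance matrix of the states S (shape (T, N))."""
    lam = np.linalg.eigvalsh(np.cov(S.T))
    return lam.sum() ** 2 / (lam ** 2).sum()
```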
IV. EXPERIMENTAL ANALYSIS

We measured the short-term memory capacity (MC) and the effective reservoir dimension (N_eff) introduced in Section III for RC networks, varying the number of recurrent and input connections. Our experimental settings are described in Section IV-A, while the results are reported in Section IV-B.

A. Settings
We used a uni-dimensional signal as the driving input for the reservoir (i.e., M = 1). To maximally test the intrinsic quality of the reservoir representations, we used i.i.d. inputs x(t) randomly sampled from a uniform distribution (in [−…, …)). The length of the generated input time-series was …, and the number of reservoir neurons was fixed to N = 100. To compute the MC, we used the first … time-steps as training set, using the remaining … time-steps to assess the MC score. We used pseudo-inversion to train the readout, discarding the first 1000 time-steps as an initial transient. The total number of delays used for the computation of (3) was 200, which is in practice sufficient to account for all the non-negligible contributions for 100-dimensional reservoirs. The last … time-steps of the dataset were also used to compute the effective reservoir dimension N_eff (see (4)).

In our experiments, we used RC networks with spectral radius ρ = 0.… and input scaling ω_in = 1. While this setup is of common use in RC practice, we also ran preliminary experiments with other choices of these hyper-parameters, finding that the outcomes are not qualitatively different. We varied both the number of recurrent connections (χ_R) and of input connections (χ_I) from 1 to 100 (with a step of 1). For each configuration, we averaged the results over 50 reservoir realizations.
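Putting the pieces together, the experimental grid can be sketched as below. The series length, train/test split, and input range used here are placeholder assumptions, and the full 100 × 100 × 50 sweep is computationally demanding as written:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 50
T, T_train = 6000, 5000                    # placeholder series length and split
x = rng.uniform(-0.5, 0.5, size=T)         # placeholder i.i.d. input range

mc_grid = np.zeros((N, N))
neff_grid = np.zeros((N, N))
for chi_R in range(1, N + 1):
    for chi_I in range(1, N + 1):
        mc, neff = [], []
        for _ in range(trials):            # average over reservoir realizations
            U, W = init_reservoir(N, 1, chi_R, chi_I, rng=rng)
            S = esn_states(U, W, x[:, None])
            mc.append(memory_capacity(S[:T_train], S[T_train:],
                                      x[:T_train], x[T_train:]))
            neff.append(effective_dimension(S[T_train:]))
        mc_grid[chi_R - 1, chi_I - 1] = np.mean(mc)
        neff_grid[chi_R - 1, chi_I - 1] = np.mean(neff)
```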
B. Results

The achieved values of MC and N_eff for the possible sparsity settings (values of χ_R and χ_I) are shown in Fig. 3. We can draw two major observations from the results. First, the number of input connections has a decisive impact on both the short-term memory and the effective reservoir dimension of the networks. Indeed, maximally sparse input connections, with χ_I = 1, achieved the highest performance. Interestingly, simply propagating the input to all the reservoir neurons degrades the performance appreciably. Second, the role of sparsity in the recurrent connections appears to be much less important. In fact, the trend in Fig. 3 indicates that, for a given input connectivity, the achieved results are not very sensitive to the exact number of recurrent connections (after a minimum number has been exceeded).

The results are further detailed in Fig. 4, which shows the best result for each choice of input (resp. recurrent) connectivity in Fig. 4(a) (resp. Fig. 4(b)), as well as the results achieved for maximally sparse input connectivity, i.e., for χ_I = 1, in Fig. 4(c). Figs. 4(a)-(b) confirm the already observed trends. On the one hand, the performance of the RC networks tends to deteriorate for less sparse input weight matrices. On the other hand, a modest number of recurrent connections is already sufficient to achieve a performance not far from the highest possible one. For RC networks with χ_I = 1 (Fig. 4(c)), both MC and N_eff saturate for fairly small values of χ_R, without appreciable differences for settings with more than … recurrent connections per reservoir neuron.

Fig. 3: Short-term Memory Capacity (MC) and effective dimension (N_eff) of RC networks. Results correspond to N = 100 reservoir neurons, spectral radius ρ = 0.…, and input scaling ω_in = 1. Recurrent (χ_R) and input (χ_I) connectivity varied from 1 to 100 with step 1. For each of the 10000 configurations, the results are averaged over 50 reservoir realizations.

Fig. 4: Short-term Memory Capacity (MC) and effective dimension (N_eff) of 100-unit RC networks, detailed for: (a) best results for increasing input connectivity; (b) best results for increasing recurrent connectivity; (c) results for maximally sparse input connections (χ_I = 1) and increasing recurrent connectivity. Results are re-scaled to [0, 1].

V. CONCLUSIONS
We have empirically analyzed the performance of RC neural networks in relation to the sparsity of input and recurrent connections. Our results indicate that, under commonly used reservoir configurations, the number of non-zero connections can play a decisive role in determining the richness of the developed representations. In particular, while a modest number of recurrent connections is already sufficient to achieve good performance, we found that maximally sparse input-to-reservoir connections lead to the best results, both in terms of short-term memory and in terms of effective dimension of the state manifold. Overall, our analysis points out a simple rule of thumb for shaping reservoir weight matrices in the case of uni-dimensional driving time-series: (i) connect the input to just one reservoir neuron, and (ii) set a small number of incoming recurrent connections (≈ …) for each reservoir neuron.

The study presented in this paper can be seen as preparatory to opening further and deeper lines of research. First of all, the role of sparsity can be investigated in synergy with structured (rather than random) recurrent reservoir topologies, such as those based on cyclic [20] or small-world [23] connections. Similarly, the study can be extended towards deep RC neural networks [24], [25], where multiple reservoir layers are connected in a pipeline. In this case, the sparsity of input connections for higher layers has the even more intriguing role of modulating the extent of signal propagation between consecutive internal representations. Neuromorphic hardware implementations [12], [26]–[28] of deep recurrent neural systems are an important example of a domain where such insights can be capitalized on in practice. From a broader perspective, and outside the RC world, the analysis presented here points out that a sparse setting of RNN connections brings advantages even before learning of the non-zero connections. How these architectural advantages can be further exploited by (supervised or unsupervised) training is another exciting open research question.
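In code, and with the initialization helper sketched in Section II-A, this rule of thumb amounts to a configuration like the following; the recurrent fan-in value is an illustrative placeholder:

```python
# Rule-of-thumb setup for a uni-dimensional input (M = 1): maximally
# sparse input connections and a small recurrent fan-in per neuron
# (chi_R = 10 is a placeholder choice, not a value from the text).
U, W = init_reservoir(N=100, M=1, chi_R=10, chi_I=1)
```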
REFERENCES

[1] J. F. Kolen and S. C. Kremer, A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons, 2001.
[2] N. Laptev, J. Yosinski, L. E. Li, and S. Smyl, "Time-series extreme event forecasting with neural networks at Uber," in International Conference on Machine Learning, vol. 34, 2017, pp. 1–5.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[4] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
[5] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang et al., "Abstractive text summarization using sequence-to-sequence RNNs and beyond," arXiv preprint arXiv:1602.06023, 2016.
[6] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring sparsity in recurrent neural networks," ICLR 2017; arXiv preprint arXiv:1704.05119, 2017.
[7] G. Bellec, D. Kappel, W. Maass, and R. Legenstein, "Deep rewiring: Training very sparse deep networks," ICLR 2018; arXiv preprint arXiv:1711.05136, 2018.
[8] A. Litwin-Kumar, K. D. Harris, R. Axel, H. Sompolinsky, and L. Abbott, "Optimal degrees of synaptic connectivity," Neuron, vol. 93, no. 5, pp. 1153–1164, 2017.
[9] M. Lukoševičius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Computer Science Review, vol. 3, no. 3, pp. 127–149, 2009.
[10] B. Schrauwen, D. Verstraeten, and J. Van Campenhout, "An overview of reservoir computing: theory, applications and implementations," in Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN), 2007, pp. 471–482.
[11] H. Jaeger and H. Haas, "Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, no. 5667, pp. 78–80, 2004.
[12] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutiérrez, L. Pesquera, C. R. Mirasso, and I. Fischer, "Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing," Optics Express, vol. 20, no. 3, pp. 3241–3249, 2012.
[13] D. Bacciu, P. Barsocchi, S. Chessa, C. Gallicchio, and A. Micheli, "An experimental characterization of reservoir computing in ambient assisted living applications," Neural Computing and Applications, vol. 24, no. 6, pp. 1451–1464, 2014.
[14] H. Jaeger, "The echo state approach to analysing and training recurrent neural networks—with an erratum note," Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, vol. 148, no. 34, p. 13, 2001.
[15] Y. Xue, L. Yang, and S. Haykin, "Decoupled echo state networks with lateral inhibition," Neural Networks, vol. 20, no. 3, pp. 365–376, 2007.
[16] C. Gallicchio and A. Micheli, "Architectural and Markovian factors of echo state networks," Neural Networks, vol. 24, no. 5, pp. 440–456, 2011.
[17] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, "Re-visiting the echo state property," Neural Networks, vol. 35, pp. 1–9, 2012.
[18] C. Gallicchio, "Chasing the echo state property," in Proceedings of ESANN, 2019, pp. 667–672.
[19] T. Strauss, W. Wustlich, and R. Labahn, "Design strategies for weight matrices of echo state networks," Neural Computation, vol. 24, no. 12, pp. 3246–3276, 2012.
[20] A. Rodan and P. Tino, "Minimum complexity echo state network," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 131–144, 2010.
[21] H. Jaeger, Short Term Memory in Echo State Networks. GMD-Forschungszentrum Informationstechnik, 2001, vol. 5.
[22] L. F. Abbott, K. Rajan, and H. Sompolinsky, "Interactions between intrinsic and stimulus-evoked activity in recurrent neural networks," The Dynamic Brain: An Exploration of Neuronal Variability and Its Functional Significance, pp. 1–16, 2011.
[23] Y. Kawai, J. Park, and M. Asada, "A small-world topology enhances the echo state property and signal propagation in reservoir computing," Neural Networks, vol. 112, pp. 15–23, 2019.
[24] C. Gallicchio, A. Micheli, and L. Pedrelli, "Deep reservoir computing: A critical experimental analysis," Neurocomputing, vol. 268, pp. 87–99, 2017.
[25] ——, "Design of deep echo state networks," Neural Networks, vol. 108, pp. 33–47, 2018.
[26] J. Moughames, X. Porte, M. Thiel, G. Ulliac, M. Jacquot, L. Larger, M. Kadic, and D. Brunner, "Three dimensional waveguide-interconnects for scalable integration of photonic neural networks," arXiv preprint arXiv:1912.08203, 2019.
[27] M. Freiberger, P. Bienstman, and J. Dambre, "Towards deep physical reservoir computing through automatic task decomposition and mapping," arXiv preprint arXiv:1910.13332, 2019.
[28] J. Partzsch and R. Schuffny, "Analyzing the scaling of connectivity in neuromorphic hardware and in models of neural networks."