Ensemble perspective for understanding temporal credit assignment
Wenxuan Zou,∗ Chan Li,∗ and Haiping Huang†
PMI Lab, School of Physics, Sun Yat-sen University, Guangzhou 510275, People's Republic of China
(Dated: February 9, 2021)

Recurrent neural networks are widely used for modeling spatio-temporal sequences in both natural language processing and neural population dynamics. However, understanding the temporal credit assignment is hard. Here, we propose that each individual connection in the recurrent computation is modeled by a spike and slab distribution, rather than a precise weight value. We then derive the mean-field algorithm to train the network at the ensemble level. The method is applied to classify handwritten digits when pixels are read in sequence, and to the multisensory integration task that is a fundamental cognitive function of animals. Our model reveals important connections that determine the overall performance of the network. It also shows how spatio-temporal information is processed through the hyperparameters of the distribution, and moreover reveals distinct types of emergent neural selectivity. It is thus promising to study the temporal credit assignment in recurrent neural networks from the ensemble perspective.
I. INTRODUCTION
Recurrence is ubiquitous in the brain. Neural networks with reciprocally connected recurrent units are called recurrent neural networks (RNNs). Because of the feedback provided by these recurrent units, this type of neural network is able to maintain information about sensory inputs across the temporal domain, and thus plays an important role in processing time-dependent sequences; it is therefore widely used as a basic computational block in natural language processing [1–6] and even in modeling brain dynamics in various kinds of neural circuits [7–10].

Training RNNs is in general very challenging, because of the intrinsic difficulty of capturing long-term dependences in the sequences. Advanced architectures commonly introduce gating mechanisms, e.g., the long short-term memory network (LSTM) with multiplicative gates controlling the information flow across time steps [1], or a simplified variant, the gated recurrent unit network (GRU) [6]. All these recurrent neural networks are commonly trained by backpropagation through time (BPTT) [11, 12], which sums up all gradient contributions (of the loss function) over all time steps of a trial to update the recurrent network parameters. The training is terminated once a specific network yields a satisfactory generalization accuracy on unseen trials (time-dependent sequences). This specific network is clearly a point estimate of the candidate architecture realizing the desired computational task. A recent study of learning (spatial) credit assignment suggests that an ensemble of candidate networks, instead of the traditional point estimate, can be successfully learned at a cheap computational cost, and in particular yields statistical properties of the trained network [13]. The ensemble training is achieved by defining only a spike and slab (SaS) probability distribution for each network connection, which offers an opportunity to look at the relevance of each connection to the macroscopic behavioral output of the network.
Therefore, we expect that a similar perspective applies to RNNs, and that an ensemble of candidate networks can also emerge during training of the SaS probability distributions for RNNs.

Weight uncertainty is a key concept in studying neural circuits [14]. The stochastic nature of computation appears not only in the earlier sensory layers but also in deep layers of internal dynamics. Revealing the underlying mechanism of how this uncertainty is combined with the recurrent dynamics becomes essential to interpret the behavior of RNNs. In particular, addressing how an RNN learns a probability distribution over weights could provide potential insights towards understanding learning in the brain. There may exist long-term dependences in the recurrent dynamics, and the two directions of one connection may carry distinct information in the spatio-temporal domain. How training at the ensemble level combines these intrinsic properties of RNNs thus becomes intriguing. In this work, we derive a mean-field method to train RNNs, considering a weight distribution for each direction of connection. We then test our method on both the MNIST dataset in a sequence-learning setting [15] and multisensory integration tasks [16–18]. The multisensory integration tasks are relevant to computational modeling of cognitive experiments on rodents and primates.

∗ W.Z. and C.L. contributed equally to this work.
† Electronic address: [email protected]

By analyzing the distribution of the model parameters, its relationship with the computational task, and
moreover the selectivity of each recurrent unit, we are able to provide a mechanistic understanding of the recurrent dynamics in both engineering applications and computational-neuroscience modeling. Our method learns the statistics of the RNN ensemble, producing a dynamic architecture that adapts to temporally varying inputs, and thereby goes beyond traditional training protocols that focus on a stationary network topology. The hyperparameters governing the weight distribution reveal which credit assignments are critical to the network behavior, which further explains the distinct functions of each computational layer and the emergent neural selectivity. Therefore, our ensemble theory can be used as a promising tool to explore the internal dynamics of widely used RNNs.

FIG. 1: Illustration of an RNN learning the temporal credit assignment. Each connection is described by the spike and slab distribution; the spike indicates that the corresponding synaptic weight is absent for a task, while the slab represents a Gaussian distribution responsible for the weight uncertainty. The Gaussian slab is displayed in the plot, and the arrows in the reservoir show different directions of information flow. Different sizes of connection show how probable it is that these connections are present during recurrent computation.
II. MODEL AND ENSEMBLE TRAINING
In this study, we consider a recurrent neural network processing a time-dependent input x(t) of time length T. The input signal is sent to the recurrent reservoir via the input weight matrix W^in. The number of input units is determined by the design details of the task, and w^in_ij denotes the weight value for the connection from input unit j to recurrent node i. The neural responses of the recurrent units are represented by an activity vector r(t) at time step t. We define w_ij as the connection from node j to node i in the reservoir. In general, w_ij ≠ w_ji, indicating different weights for the two directions of information flow. In addition, we do not preclude the self-interaction w_ii, which can be used to maintain the representation encoded in the history of the internal dynamics [19]. The specific statistics of this self-interaction are determined by the learning shown below. The internal activity r(t) is read out through the output matrix W^out in the form of the time-dependent output y(t).

We first define h_i(t) as the time-dependent synaptic current of neuron i. The dynamics of the recurrent network can thus be summarized as follows,

h_i(t+1) = (1 − α) h_i(t) + α u_i(t+1) + √α σ n_i,  (1a)
u_i(t+1) = Σ_{j=1}^N w_ij r_j(t) + Σ_{j=1}^N w^in_ij x_j(t+1),  (1b)
r_i(t) = φ(h_i(t)),  (1c)
z_k(t) = Σ_{i=1}^N w^out_ki r_i(t),  (1d)
y_k(t) = f(z_k(t)),  (1e)

where u(·) is the pre-activation, and φ(·) denotes the nonlinear transfer function, for which we use the rectified linear unit (ReLU) in all tasks. α = Δt/τ, where Δt is a small time interval (e.g., emerging from the discretization of a continuous dynamics [20]), and τ denotes the time constant of the dynamics. n_i ∼ N(0, 1) indicates a normally distributed random number with zero mean and unit variance, sampled independently at every time step, and σ controls the strength of the recurrent noise intrinsic to the network. This intrinsic noise is present when modeling cognitive tasks, but absent in engineering applications. Considering a Gaussian white noise and a baseline x₀, we write the input vector x(t) as

x(t) = x₀ + x_t^task + √(σ_in²/α) ξ,  (2)

where the input has been written in a discrete-time form, x_t^task denotes the sequence of the task, and ξ_i ∼ N(0, 1), with σ_in setting the input-noise strength. This input noise is commonly observed in brain circuits [21, 22], but is absent in engineering applications.

For the MNIST classification task, f(·) is chosen to be the softmax function, specifying the probability over all the classes at the last time step T, i.e., y_k(T) = e^{z_k(T)} / Σ_j e^{z_j(T)}. We define ŷ_k as the target label (in one-hot form), and use the cross entropy L = −Σ_i ŷ_i ln y_i(T) as the loss function to be minimized. For the multisensory integration task, f(·) is an identity function, and the mean squared error is chosen as the objective function.

To search for the optimal random network ensemble for time-dependent tasks in recurrent neural networks, we model the statistics of the weight matrices by the SaS probability distribution [13] as follows,

P(w^in_ij) = π^in_ij δ(w^in_ij) + (1 − π^in_ij) N(w^in_ij | m^in_ij, Ξ^in_ij),  (3a)
P(w_ij) = π_ij δ(w_ij) + (1 − π_ij) N(w_ij | m_ij, Ξ_ij),  (3b)
P(w^out_ki) = π^out_ki δ(w^out_ki) + (1 − π^out_ki) N(w^out_ki | m^out_ki, Ξ^out_ki).  (3c)

The spike mass at δ(·) is related to network compression, indicating the necessary weight resources required for a specific task. The continuous slab, N(w_ij | m_ij, Ξ_ij), denotes the Gaussian distribution with mean m_ij and variance Ξ_ij, characterizing the weight uncertainty when the corresponding connection cannot be absent (see Fig.
1). Next, we derive the learning equations for how the SaS parameters are updated, based on a mean-field approximation. More precisely, we consider the average over the statistics of the network ensemble during training. Notice that the first and second moments of the weight w_ij, for all three sets of weights, can be written in the common form μ_ij = (1 − π_ij) m_ij and ϱ_ij = (1 − π_ij)(m_ij² + Ξ_ij). Given a large fan-in, the pre-activation u_i(t) and the output z_i(t) can be re-parametrized using standard Gaussian random variables; i.e., according to the central limit theorem it is reasonable to assume that they follow N(u_i(t) | G^in_i(t) + G^rec_i(t−1), √((Δ^in_i(t))² + (Δ^rec_i(t−1))²)) and N(z_i(t) | G^out_i(t), Δ^out_i(t)), respectively. Note that when the number of fan-in connections to a recurrent unit is small, the central limit theorem may break down; in this situation we take the deterministic limit. Therefore, the mean-field dynamics of the model becomes

h_i(t+1) = (1 − α) h_i(t) + α u_i(t+1) + √α σ n_i,  (4a)
u_i(t+1) = G^rec_i(t) + G^in_i(t+1) + ε^u_i(t+1) √((Δ^in_i(t+1))² + (Δ^rec_i(t))²),  (4b)
r_i(t) = φ(h_i(t)),  (4c)
z_k(t) = G^out_k(t) + ε^out_k(t) Δ^out_k(t),  (4d)
y_k(t) = f(z_k(t)),  (4e)

where

G^in_i(t+1) = Σ_j μ^in_ij x_j(t+1),  (5a)
G^rec_i(t+1) = Σ_j μ_ij r_j(t+1),  (5b)
G^out_k(t+1) = Σ_i μ^out_ki r_i(t+1),  (5c)
(Δ^in_i(t+1))² = Σ_j (ϱ^in_ij − (μ^in_ij)²)(x_j(t+1))²,  (5d)
(Δ^rec_i(t+1))² = Σ_j (ϱ_ij − (μ_ij)²)(r_j(t+1))²,  (5e)
(Δ^out_k(t+1))² = Σ_i (ϱ^out_ki − (μ^out_ki)²)(r_i(t+1))².  (5f)

Note that {ε^u(t)} and {ε^out(t)} are both independent random variables sampled from the standard Gaussian distribution with zero mean and unit variance; they are quenched for every single training mini-epoch and are time-step dependent, maintaining the same sequence of values in both the feedforward and backward computations.

Updating the network parameters (θ^in_ij, θ_ij, θ^out_ki) can be achieved by gradient descent on the objective function L. First, we update θ^out_ki ≡ (m^out_ki, π^out_ki, Ξ^out_ki):

∂L/∂m^out_ki = Σ_{t=0}^T (∂L/∂z_k(t)) (∂z_k(t)/∂m^out_ki),  (6a)
∂L/∂π^out_ki = Σ_{t=0}^T (∂L/∂z_k(t)) (∂z_k(t)/∂π^out_ki),  (6b)
∂L/∂Ξ^out_ki = Σ_{t=0}^T (∂L/∂z_k(t)) (∂z_k(t)/∂Ξ^out_ki),  (6c)

where ∂L/∂z_k(t) is related to the form of the loss function. For categorization tasks, f(·) is chosen to be the softmax function and we use the cross entropy as the objective function, for which ∂L/∂z_k(t) = (y_k(T) − ŷ_k) δ_{t,T}. It is worth noticing that for multisensory integration tasks this derivative does not vanish at intermediate time steps and thus becomes time-dependent. The other term, ∂z_k(t)/∂θ^out_ki, can be computed directly, showing how sensitively the read-out of the network activity changes with the SaS parameters of the output layer:

∂z_k(t)/∂m^out_ki = (1 − π^out_ki) r_i(t) + ε^out_k(t) μ^out_ki π^out_ki (r_i(t))² / Δ^out_k(t),  (7a)
∂z_k(t)/∂π^out_ki = −m^out_ki r_i(t) + ε^out_k(t) ((m^out_ki)²(1 − 2π^out_ki) − Ξ^out_ki)(r_i(t))² / (2Δ^out_k(t)),  (7b)
∂z_k(t)/∂Ξ^out_ki = ε^out_k(t) (1 − π^out_ki)(r_i(t))² / (2Δ^out_k(t)).  (7c)

Next, we derive the learning equations for the hyperparameters of the reservoir and the input layer, i.e., θ_ij ≡ (m_ij, π_ij, Ξ_ij) and θ^in_ij ≡ (m^in_ij, π^in_ij, Ξ^in_ij).
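The mean-field forward pass of Eqs. (4)–(5) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's released code: all variable names, shapes, and the random hyper-parameter initialization are our own assumptions, and the intrinsic noise σ is set to zero as in the engineering setting.

```python
import numpy as np

# Minimal sketch of one mean-field step, Eqs. (4)-(5). Each weight matrix is
# represented only by its SaS hyper-parameters (pi, m, Xi) through the moments
# mu = (1 - pi) * m and rho = (1 - pi) * (m**2 + Xi).
rng = np.random.default_rng(1)
N, N_in, N_out, alpha = 100, 1, 10, 0.2

def moments(pi, m, Xi):
    mu = (1 - pi) * m
    var = (1 - pi) * (m ** 2 + Xi) - mu ** 2   # rho - mu^2, per connection
    return mu, var

# Random stand-in hyper-parameters (pi, m, Xi all uniform in [0, 1)).
mu_in, var_in = moments(*[rng.random((N, N_in)) for _ in range(3)])
mu_rec, var_rec = moments(*[rng.random((N, N)) for _ in range(3)])
mu_out, var_out = moments(*[rng.random((N_out, N)) for _ in range(3)])

def step(h, r_prev, x_next):
    """One step of Eq. (4) with the reparametrized pre-activation (sigma = 0)."""
    G_rec, D2_rec = mu_rec @ r_prev, var_rec @ r_prev ** 2       # Eqs. (5b), (5e)
    G_in, D2_in = mu_in @ x_next, var_in @ x_next ** 2           # Eqs. (5a), (5d)
    eps_u = rng.standard_normal(N)         # quenched per mini-epoch in training
    u = G_rec + G_in + eps_u * np.sqrt(D2_in + D2_rec)           # Eq. (4b)
    h = (1 - alpha) * h + alpha * u                              # Eq. (4a)
    r = np.maximum(h, 0.0)                                       # ReLU, Eq. (4c)
    eps_out = rng.standard_normal(N_out)
    z = mu_out @ r + eps_out * np.sqrt(var_out @ r ** 2)         # Eq. (4d)
    return h, r, z

h, r = np.zeros(N), np.zeros(N)
h, r, z = step(h, r, x_next=np.ones(N_in))
```

In a full trial the same quenched ε sequence would be reused in the backward pass, as stated below Eq. (5).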
To get a general form, we set m̃_ij ≡ (m_ij, m^in_ij), π̃_ij ≡ (π_ij, π^in_ij), and Ξ̃_ij ≡ (Ξ_ij, Ξ^in_ij). Based on the chain rule, we then arrive at the following equations,

∂L/∂m̃_ij = Σ_{t=0}^T (∂L/∂h_i(t))(∂h_i(t)/∂u_i(t))(∂u_i(t)/∂m̃_ij) = Σ_{t=0}^T α δ_i(t) ∂u_i(t)/∂m̃_ij,  (8a)
∂L/∂π̃_ij = Σ_{t=0}^T (∂L/∂h_i(t))(∂h_i(t)/∂u_i(t))(∂u_i(t)/∂π̃_ij) = Σ_{t=0}^T α δ_i(t) ∂u_i(t)/∂π̃_ij,  (8b)
∂L/∂Ξ̃_ij = Σ_{t=0}^T (∂L/∂h_i(t))(∂h_i(t)/∂u_i(t))(∂u_i(t)/∂Ξ̃_ij) = Σ_{t=0}^T α δ_i(t) ∂u_i(t)/∂Ξ̃_ij,  (8c)

where we have defined δ_i(t) ≡ ∂L/∂h_i(t). The auxiliary variable δ_i(t) can be computed by the chain rule once again, resulting in a BPTT equation for the error signal, starting from the last time step T:

δ_i(t) = Σ_j (∂L/∂h_j(t+1)) ∂h_j(t+1)/∂h_i(t) + Σ_k (∂L/∂z_k(t)) (∂z_k(t)/∂r_i(t)) φ′(h_i(t)),  (9)

where t = 0, 1, 2, ..., T − 1, and φ′(·) denotes the derivative of the transfer function. The first summation in δ_i(t) can be directly expanded as

Σ_j (∂L/∂h_j(t+1)) ∂h_j(t+1)/∂h_i(t) = (1 − α) δ_i(t+1) + Σ_j α δ_j(t+1) ∂u_j(t+1)/∂h_i(t),  (10a)
∂u_j(t+1)/∂h_i(t) = (∂u_j(t+1)/∂r_i(t))(∂r_i(t)/∂h_i(t)) = (1 − π_ji) m_ji φ′(h_i(t)) + ε^u_j(t+1) (ϱ_ji − (μ_ji)²) r_i(t) φ′(h_i(t)) / √((Δ^in_j(t+1))² + (Δ^rec_j(t))²).  (10b)

The second summation in δ_i(t) is given by

Σ_k (∂L/∂z_k(t)) ∂z_k(t)/∂h_i(t) = Σ_k (∂L/∂z_k(t)) [μ^out_ki + ε^out_k(t) (ϱ^out_ki − (μ^out_ki)²) r_i(t) / Δ^out_k(t)] φ′(h_i(t)).  (11)

It is worth noting that at the last time step T, the error signal δ_i(T) is written as

δ_i(T) = Σ_k (∂L/∂z_k(T)) [μ^out_ki + ε^out_k(T) (ϱ^out_ki − (μ^out_ki)²) r_i(T) / Δ^out_k(T)] φ′(h_i(T)).  (12)

To compute Eq. (8), we need to work out the following derivatives, which characterize the sensitivity of the pre-activation to changes in the hyperparameters θ_ij and θ^in_ij. Writing Δ_i(t) ≡ √((Δ^in_i(t))² + (Δ^rec_i(t−1))²), we summarize the results as follows:

∂u_i(t)/∂m^in_ij = (1 − π^in_ij) x_j(t) + ε^u_i(t) μ^in_ij π^in_ij (x_j(t))² / Δ_i(t),  (13a)
∂u_i(t)/∂π^in_ij = −m^in_ij x_j(t) + ε^u_i(t) ((m^in_ij)²(1 − 2π^in_ij) − Ξ^in_ij)(x_j(t))² / (2Δ_i(t)),  (13b)
∂u_i(t)/∂Ξ^in_ij = ε^u_i(t) (1 − π^in_ij)(x_j(t))² / (2Δ_i(t)),  (13c)
∂u_i(t)/∂m_ij = (1 − π_ij) r_j(t−1) + ε^u_i(t) μ_ij π_ij (r_j(t−1))² / Δ_i(t),  (13d)
∂u_i(t)/∂π_ij = −m_ij r_j(t−1) + ε^u_i(t) ((m_ij)²(1 − 2π_ij) − Ξ_ij)(r_j(t−1))² / (2Δ_i(t)),  (13e)
∂u_i(t)/∂Ξ_ij = ε^u_i(t) (1 − π_ij)(r_j(t−1))² / (2Δ_i(t)).  (13f)

In this learning process, our model learns an RNN ensemble to realize the time-dependent computation, in contrast to the standard BPTT algorithm, which gives only a point estimate of the RNN weights. In particular, if we set π = 0 and Ξ = 0, our learning equations reduce to standard BPTT. Hence, our model can be thought of as a generalized version of BPTT (gBPTT), as shown in Fig. 2, where each weight parameter should be understood as the set of SaS hyperparameters, corresponding to our ensemble setting.

The size of the candidate network space can be captured by the network entropy S = −∫_{ℝ^D} P(w) ln P(w) dw, where D is the number of weight parameters in the network. If we assume the joint distribution of weights to be factorized across individual connections, the overall entropy S can be obtained by summing up the entropies of individual weights, S = Σ_ℓ S_ℓ. The entropy of each directed connection ℓ is derived as follows [13]:

S_ℓ = −π_ℓ ln[π_ℓ δ(0) + (1 − π_ℓ) N(0 | m_ℓ, Ξ_ℓ)] − ((1 − π_ℓ)/B) Σ_s Γ(ε_s),  (14)

where Γ(ε_s) = ln[π_ℓ δ(m_ℓ + √Ξ_ℓ ε_s) + ((1 − π_ℓ)/√Ξ_ℓ) N(ε_s | 0, 1)], and B denotes the number of standard Gaussian samples ε_s. This model entropy is only an approximate estimate of the true value, whose exact computation is intractable. If π_ℓ = 0, the probability distribution of w_ℓ can be written as P(w_ℓ) = N(w_ℓ | m_ℓ, Ξ_ℓ), and the entropy S_ℓ can be analytically computed as (1/2) ln(2πe Ξ_ℓ).
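The Monte Carlo estimator of Eq. (14), together with the π_ℓ = 0 limit just mentioned, can be checked numerically. The sketch below is our own illustration: the width `a` of the Gaussian approximating the delta peak and all variable names are assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the per-connection entropy estimate, Eq. (14), with the delta peak
# approximated by a narrow Gaussian of width a (as described in the text).
rng = np.random.default_rng(3)

def delta_approx(x, a=1e-3):
    """Nascent delta function: (1 / (sqrt(pi) * a)) * exp(-x^2 / a^2)."""
    return np.exp(-x ** 2 / a ** 2) / (np.sqrt(np.pi) * a)

def gauss(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def entropy_sas(pi, m, Xi, B=200_000, a=1e-3):
    """Monte Carlo estimate of S_l in Eq. (14) with B Gaussian samples."""
    eps = rng.standard_normal(B)
    # Gamma(eps_s) = ln[pi * delta(m + sqrt(Xi) eps) + (1-pi)/sqrt(Xi) * N(eps|0,1)]
    slab_term = np.log(pi * delta_approx(m + np.sqrt(Xi) * eps, a)
                       + (1 - pi) / np.sqrt(Xi) * gauss(eps, 0.0, 1.0))
    spike_term = -pi * np.log(pi * delta_approx(0.0, a)
                              + (1 - pi) * gauss(0.0, m, Xi))
    return spike_term - (1 - pi) * slab_term.mean()

# For pi = 0 the estimate should recover the Gaussian differential entropy.
S_gauss = entropy_sas(0.0, 0.5, 0.3)
S_exact = 0.5 * np.log(2 * np.pi * np.e * 0.3)
```

With B = 2 × 10⁵ samples the Monte Carlo estimate agrees with the analytic value to within a few times 10⁻³.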
If Ξ_ℓ = 0, the Gaussian slab reduces to a Dirac delta function, and the entropy becomes the entropy of a discrete random variable, S_ℓ = −π_ℓ ln π_ℓ − (1 − π_ℓ) ln(1 − π_ℓ). In general, however, there is a mixture of discrete and continuous (differential) contributions to the entropy; the entropy value can therefore be negative. In practice, we use δ(x) = lim_{a→0⁺} (1/(√π a)) e^{−x²/a²} to approximate the delta peak with a small value of a. The stochasticity of the SaS distribution can thus be decomposed into two levels: one in the choice of which connections are absent, and the other in the Gaussian slab itself. The former is a discrete entropy characterized by the spike probability, while the latter is a continuous entropy characterized by the variance of the Gaussian slab.

III. RESULTS
In this section, we show applications of our ensemble theory of temporal credit assignment to both engineering and computational cognitive tasks, i.e., pixel-by-pixel MNIST digit classification, and multisensory integration, which demonstrates the benefit of multiple sources of information for decision making. We explore in detail the rich properties of trained RNN models accomplishing these computational tasks of different nature.

FIG. 2: Illustration of the generalized backpropagation through time. X_t is a time-dependent input into the network, resulting in a sequence of hidden representations r(t) (i.e., φ(h_t)) and a time-dependent loss L_t. The error signal ∂L_t/∂h_t propagates back from the last time step T to the first time step, yielding the temporally accumulated gradients for the hyperparameters to be updated. The contribution of ∂L_t/∂z_t is not shown in this plot. X_t, z_t and h_t indicate x(t), z(t) and h(t) in the main text, respectively.

FIG. 3: Properties of trained RNN models for the pixel-by-pixel MNIST classification. (a) Test accuracy versus training epoch. The network of N = 100 neurons is trained on the full MNIST dataset (60 000 images) and tested on the unseen data of 10 000 images. Five independent runs of the algorithm are considered. There are no baseline inputs. Other parameters for training are as follows: α = 0. , the ℓ regularization strength is 10⁻ , and the initial learning rate lr = 0. . (b) Recurrent I denotes the case of w_ij (i < j), along which the recurrent feedback passes from neuron j to neuron i, while recurrent II denotes the case of w_ji (j > i), along which the recurrent feedback passes from neuron i to neuron j. The fluctuation is estimated from five independent runs [the same training conditions as in (a)].

A. Pixel-by-Pixel MNIST digit classification
Training RNNs is hard because of long-term dependencies in the sequence of inputs. A challenging task of this kind is to train an RNN to classify MNIST images when the 784 pixels are fed into the network one by one (note that x₀ = 0), and the network is required to predict the category of the image after seeing all 784 pixels. This task exhibits a long range of dependency: the network reads one pixel per time step, in scan-line order from the top-left pixel to the bottom-right pixel (Fig. 5), and information spanning as many as 784 time steps must be maintained before the final decision. We apply a vanilla RNN with N = 100 recurrent units to achieve this challenging goal. Because the input size is one in this task, the hyperparameters (π^in, Ξ^in) are set to zero (i.e., the deterministic limit). The entire MNIST dataset is divided into mini-batches for stochastic gradient descent, and we use the cross entropy as the objective function to be minimized. In this task, the error signal appears only after the network has read all 784 pixels.

FIG. 4: Evolution of the entropy per connection for the pixel-by-pixel MNIST classification. The training conditions are the same as in Fig. 3. Five independent runs are considered. The input weight is deterministic without any stochasticity. In addition, to avoid divergence, we clip all values of Ξ smaller than a to a in the computation of the entropy. Our results are not sensitive to the choice of a, as long as its value is small. (a,b) We decompose the variability of the SaS distribution into two levels: the variability of selecting zero synapses (π-entropy) and the variability inside the Gaussian slab (Ξ-entropy). (c,d) Evolution of the entropy per connection (B = 100). We use a Gaussian distribution of small variance (indicated by a) to approximate the delta peak. The value of a does not affect the qualitative behavior of the entropy profile.
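The pixel-by-pixel input protocol described above amounts to a simple reshaping of the image. The following sketch (variable names are our own; a synthetic stand-in image is used instead of an actual MNIST sample) shows the scan-line ordering explicitly:

```python
import numpy as np

# A 28 x 28 image is flattened in scan-line order, from the top-left to the
# bottom-right pixel, and fed to the network one pixel per time step, so the
# input dimension is 1 and one trial lasts T = 784 steps.
image = np.arange(28 * 28, dtype=float).reshape(28, 28) / 784.0  # stand-in image

sequence = image.reshape(-1, 1)   # row-major flattening = scan-line order
T = sequence.shape[0]             # 784 time steps, one scalar pixel each
```

Row-major (`C`-order) reshaping guarantees that `sequence[0]` is the top-left pixel and `sequence[-1]` the bottom-right one.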
In the current setting, we ignore the noise terms in the dynamics equations [Eqs. (1) and (2)]. Surprisingly, although working at the ensemble level, our model achieves performance comparable to, or even better than, the traditional BPTT method, as shown in Fig. 3(a).

Our simulation reveals that the sparsity per connection, Σ_ℓ π_ℓ / M (with a total of M directed connections in the network), reaches a larger value in the recurrent layer than in the output layer; this behavior does not depend on the specific direction of the coupling [Fig. 3(b)]. During training, the sparsity level grows until saturation, suggesting that the training may remove or minimize the impact of irrelevant information across time steps. Interestingly, the entropy per connection in the recurrent layer also increases at the early stage of training, but decreases at the late stage [Fig. 4], showing that the training is able to reorganize the information landscape shaped by the recurrent feedback. The network at the end of training becomes more deterministic (e.g., Ξ gets close to zero, or the spike probability moves away from one half). The discrete π-entropy always increases until it saturates at a certain level. The entropy profile in the output layer shows a similar behavior, yet with a lower entropy value. This is consistent with the distinct roles of the recurrent and readout layers: the recurrent layer (or computing reservoir) searches for an optimal way of processing temporally complex sensory inputs, while the output layer implements the decoding of the signals hidden in the reservoir dynamics.
Note that two types of deterministic weights yield vanishing individual entropy values: (i) π = 1 (unimportant (UIP) weight); (ii) π = 0, Ξ = 0 and m ≠ 0 (very important (VIP) weight).

To study the network behavior from the perspective of hyperparameter distributions, we plot the distribution of the three sets of parameters (m, Ξ, π) for the output layer and for the two different directions of connections in the recurrent layer, together with m of the input layer ((π^in, Ξ^in) are set to zero, as explained before), as shown in Fig. 5. The distribution of the spike probability π is L-shaped for all layers. The peak at π = 0 indicates that the corresponding synaptic weights carry important feature-selective information and are thus critical to the network performance. The shape of the π-distribution profile reflects the task difficulty for the considered network (given the same initialization condition): we observe that in a simpler task, where the network reads 28 pixels at each time step (rather than one pixel by one pixel), the π-distribution develops a U shape, with a second peak at π = 1, implying the emergence of a significant fraction of UIP weights. In addition, Ξ has an L-shaped distribution peaked at zero, suggesting that the corresponding weight distribution takes the deterministic value m, which becomes the weight value of that connection.

As for the readout layer, the mean of the continuous slab is much more dispersed than in the recurrent layer, while the statistics of π and Ξ are more concentrated, in excellent agreement with the entropy profile shown in Fig. 4. We thus conclude that the recurrent layer has a greater variability than the output layer. This large variability allows the recurrent layer to transform complex sensory inputs with hierarchical spatio-temporal structures in a flexible way, whereas the minimal variability of the readout makes its behavior more robust. It is therefore interesting for future work to address the precise mechanism by which the statistics of the connections in an RNN facilitate the formation of latent dynamics for computation.

We next look at the specific profile of individual hyperparameters in matrix form (Fig. 6).

FIG. 5: Distributions of hyperparameters (π, m, Ξ) in the trained network (pixel-by-pixel MNIST classification). The training conditions are the same as in Fig. 3. In (a,b), i < j is assumed.

FIG. 6: Heterogeneous structures of trained RNN networks. Training conditions are the same as in Fig. 3. Hyperparameters (π, m, Ξ) are plotted in matrix form with dimension N × N, where N indicates the number of recurrent neurons. An element of these matrices, say m_ij, denotes the hyperparameter value for the directed link from neuron j to neuron i.

FIG. 7: Targeted weight perturbation in the recurrent layer of RNNs. All training conditions are kept the same as in Fig. 3, and the results are averaged over five independent runs. (a) VIP weights are stochastically turned off, and the same number of randomly selected weights from the entire weight population are also deleted, for comparison of the resultant performance. (b) The fraction of VIP weights during the training process.
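The VIP/UIP bookkeeping behind the targeted perturbation of Fig. 7 can be sketched from the trained hyper-parameters alone. The thresholds, planted examples, and names below are our own assumptions for illustration:

```python
import numpy as np

# Classify each directed connection from its SaS hyper-parameters, following
# the definitions in the text: UIP (unimportant): pi ~ 1; VIP (very
# important): pi ~ 0, Xi ~ 0 and m != 0. The tolerance is a hypothetical choice.
def classify_weights(pi, m, Xi, tol=1e-2):
    vip = (pi < tol) & (Xi < tol) & (np.abs(m) > tol)
    uip = pi > 1 - tol
    return vip, uip

rng = np.random.default_rng(4)
pi = rng.random((5, 5))
Xi = rng.random((5, 5))
m = rng.standard_normal((5, 5))
pi[0, 0], Xi[0, 0], m[0, 0] = 0.0, 0.0, 1.3   # plant one VIP connection
pi[1, 1] = 1.0                                 # plant one UIP connection
vip, uip = classify_weights(pi, m, Xi)
```

A pruning experiment like Fig. 7(a) would then zero the slab of the connections flagged `vip` and compare against zeroing an equally sized random subset.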
We find that the diagonal (self-interaction) part of m emerges from gBPTT (note that the diagonals of π and Ξ nearly vanish), demonstrating the significant role of self-interactions in maintaining long-term memory and thereby the learning performance, in accord with the heuristic strategy used in a recent study [19]. Heterogeneity also appears in the hyperparameter matrices, i.e., some directed connections play a more important role than others. In particular, some connections can be eliminated to save computational resources.

Our method can identify the exact nature of each directed connection in the network. Targeted weight perturbation can thus be performed in the recurrent layer (Fig. 7). In the course of training, the fraction of VIP connections slowly decreases and remains small (around 1%) [Fig. 7(b)]. Surprisingly, pruning these VIP weights strongly deteriorates the test performance of the network (down to the chance level), whereas pruning the same number of randomly selected connections from the entire weight population yields a negligible drop in test accuracy. We thus conclude that in an RNN there exists a minority of VIP weights carrying key spatio-temporal information essential for the decision making of the network.

We then explore whether selectivity of neurons for the sensory inputs emerges from our training. The degree of selectivity of an individual unit, say neuron i, can be characterized by an index, the SLI [23], as follows:

SLI(i) = (1/(1 − 1/N_s)) [1 − ((1/N_s) Σ_{s=1}^{N_s} r_{i,s})² / ((1/N_s) Σ_{s=1}^{N_s} (r_{i,s})²)],  (15)

where r_{i,s} indicates the response of neuron i to input stimulus s, with N_s the total number of stimulus classes (N_s = 10 in the MNIST experiment). Note that r_{i,s} for the RNN depends on the time step as well.
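Eq. (15) translates directly into a few lines of numpy. This sketch uses hand-built toy responses (the array names and example values are ours, not recorded data) to show the two limiting cases:

```python
import numpy as np

# Selectivity index of Eq. (15). r[i, s] is the response of neuron i to
# stimulus class s; each neuron is assumed to respond to at least one class
# (an all-zero row would give 0/0).
def sli(r):
    Ns = r.shape[1]
    a = (r.mean(axis=1) ** 2) / (r ** 2).mean(axis=1)   # in (0, 1]
    return (1.0 - a) / (1.0 - 1.0 / Ns)

r = np.zeros((3, 10))
r[0] = 1.0        # responds identically to all 10 classes -> SLI = 0
r[1, 4] = 1.0     # responds to a single class             -> SLI = 1
r[2] = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], float)  # mixed selectivity
s = sli(r)
```

The uniform responder gives SLI = 0, the single-class responder gives SLI = 1, and the mixed case falls strictly in between, matching the interpretation given below.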
The value of the SLI ranges from 0 (when the neuron responds identically to all stimuli) to 1 (when the neuron responds to only a single stimulus); a higher SLI thus indicates a higher degree of selectivity. Interestingly, we find that most neurons show mixed selectivity [Fig. 8 (right)], with an SLI lying between 0 and 1; such neurons respond strongly to a few types of images. Mixed selectivity has also been discovered in complex cognitive tasks [24]. We also observe that some neurons are selective to only one type of image, with a high value of the SLI [Fig. 8 (middle)], while the remaining neurons keep silent, with an SLI near zero [Fig. 8 (left)]. As expected, the selectivity property emerges slightly before the decision making of the network.

FIG. 8: Selectivity of representative neurons in RNNs learning MNIST digit classification. (Left) the non-selectivity case [SLI(i = 60) = 0.0]. (Middle) the uni-selectivity case [SLI(i = 94) = 0.98]. (Right) the mixed-selectivity case [SLI(i = 14) = 0.57]. Time steps indicate the dynamic steps during the test phase, and the labels in the left panel are shared by the other two panels.

FIG. 9: The statistics of the time-dependent networks does not change significantly. (a) The fraction of zero elements of the sampled weights across the 784 time steps. The fluctuation is computed from five independent runs. (b) The asymmetry measure η, defined from the reciprocal weights w_ij and w_ji, versus time step; the over-bar means the average over all reciprocal connections. The result is averaged over five independent runs.

We finally remark that gBPTT produces a network ensemble, characterized by the hyperparameter set of the SaS distribution. A concrete network can thus be sampled from this ensemble. During training, we use independently sampled time-dependent Gaussian noise ε^u(t) to approximate the statistics of the ensemble. Accordingly, we find that a time-dependent concrete network yields test performance identical to that of the mean-field propagation [i.e., using the parametrized pre-activation, see Eq. (4)]. The overall statistics of the sampled networks does not vary significantly across dynamic steps (Fig. 9). This observation is in stark contrast to traditional RNN training, where a deterministic weight matrix is used at all time steps. In fact, in a biological neural circuit, changes in synaptic connections are essential for the development and function of the nervous system [25, 26]. In other words, the specific details of the connection pattern of a circuit may not be critical to the behavior; rather, the parameters underlying the statistics of the weight distributions become the dominant factors for the behavioral output of the network (see also a recent paper demonstrating that innate face selectivity can emerge from the statistical variation of feedforward projections in hierarchical neural networks [27]).
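Sampling a concrete, time-dependent network from the trained ensemble is straightforward. The sketch below (hyper-parameter values and names are our own stand-ins, not trained values) draws an independent weight matrix at every time step and tracks the fraction of exactly-zero weights, the quantity plotted in Fig. 9(a); it fluctuates around the mean spike probability:

```python
import numpy as np

# Draw one concrete recurrent weight matrix per time step from the SaS
# ensemble: each weight is zero with probability pi, otherwise N(m, Xi).
rng = np.random.default_rng(5)
N, T = 50, 784
pi = np.full((N, N), 0.4)              # stand-in spike probabilities
m = rng.standard_normal((N, N))        # stand-in slab means
Xi = 0.1                               # stand-in slab variance

zero_fraction = []
for t in range(T):
    spike = rng.random((N, N)) < pi
    w_t = np.where(spike, 0.0, m + np.sqrt(Xi) * rng.standard_normal((N, N)))
    zero_fraction.append((w_t == 0.0).mean())
zero_fraction = np.array(zero_fraction)
```

With N² = 2500 connections per step, the per-step zero fraction stays within about ±0.01 of π = 0.4, mirroring the small fluctuations seen in Fig. 9(a).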
Therefore, our current ensemble perspective on training RNNs offers a promising framework for achieving neural computation with a dynamic network, which is likely to be consistent with biologically plausible computations with fluctuating dendritic spines; e.g., dendritic spines can undergo morphological remodeling in adaptation to sensory stimuli or during learning [26].

B. Multisensory integration task
Multisensory integration is a fundamental ability of the brain to combine cues from multiple senses into a robust percept of signals in the noisy world [16, 17]. In typical cognitive experiments, two separate sources of information are provided to animals, e.g., rats, before the animals make a decision. The source of information can be either auditory clicks or visual flashes, or both [28]. When the task becomes difficult, multisensory training is more effective than unisensory training. Akin to animals trained to perform a behavioral task, RNNs can also learn the input-output mapping through an optimization procedure, which provides a quantitative way to understand the computational principles underlying multisensory integration.

FIG. 10: Schematic illustration of the RNN performing the multisensory integration task. In one trial, the task consists of three consecutive stages (of lengths 300, 900, and 400 ms, respectively): a fixation period, a stimulus period, and a decision period. Note that in a test trial, the lengths of the three stages are 500, 1000, and 300 ms, respectively. During the fixation period, the network must maintain a low output value, ensuring that the network will be sensitive to the stimulus signals. During the stimulus period, either one or both (multisensory) sources of signals with the same frequency f, including the visual signal and the auditory signal, act as the input to the network. The frequency f ranges from 9 to 16. Each type of sensory input contains positively tuned and negatively tuned signals (only the positively tuned one is shown in the plot). During the decision period, there is no modality input signal, and the network must learn to discriminate the event rate of the stimulus period, i.e., the output must hold a high or low value correctly. The output at the last time step of the decision period is assigned the final decision of one trial. Other parameters for training are as follows: x = 0., τ = 100 ms, α = 0., σ = 0., σ_in = 0., the ℓ regularization strength is 10^-, and the initial learning rate lr = 0., t = 0.
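The trial structure just described can be sketched as follows. The period lengths, the rate range [9, 16], and the boundary 12.5 come from the text; the channel layout, tuning curves, noise level, and time step are our own placeholders:

```python
import numpy as np

def make_trial(rate, modality, rng, dt=20):
    """Sketch of one MSI training trial. Periods follow Fig. 10: fixation
    300 ms, stimulus 900 ms, decision 400 ms. Input channels are
    [visual+, visual-, auditory+, auditory-]; '+' channels increase with
    the event rate and '-' channels decrease, as described in the text.
    All scalings are illustrative."""
    t_fix, t_stim, t_dec = 300 // dt, 900 // dt, 400 // dt
    u = np.zeros((t_fix + t_stim + t_dec, 4))
    # Linearly tuned mean signals over the rate range [9, 16] (illustrative).
    pos = (rate - 9) / (16 - 9)
    neg = 1.0 - pos
    stim = slice(t_fix, t_fix + t_stim)
    if modality in ("visual", "multisensory"):
        u[stim, 0] = pos
        u[stim, 1] = neg
    if modality in ("auditory", "multisensory"):
        u[stim, 2] = pos
        u[stim, 3] = neg
    u += 0.1 * rng.standard_normal(u.shape)  # time-dependent input noise
    target = 1.0 if rate > 12.5 else 0.0     # high/low choice at trial end
    return u, target

rng = np.random.default_rng(1)
u, target = make_trial(rate=14, modality="multisensory", rng=rng)
```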
In particular, by increasing the biologically plausible levels of the network architecture and even the learning dynamics [8, 10], one may generate quantitative hypotheses about the dynamical mechanisms that real neural circuits may implement to solve the same behavioral task. In this section, we restrict the network setting to the simplest case, i.e., a group of neurons reciprocally connected with continuous firing-rate dynamics. Considering more biological details, e.g., Dale's principle [21], is straightforward. The goal of this section is to show that our ensemble perspective also works in modeling cognitive computational tasks.

In the multisensory integration (MSI) experiment, either or both of the auditory and visual signals are fed into the recurrent network via deterministic weights (for the same reason as in the pixel-by-pixel MNIST task); after a stimulus period, the network is required to report whether the frequency of the presented stimulus is above a decision boundary (12.5). In our model, r(t) is an activity vector indicating the firing rates of the neurons, obtained through a non-linear transfer function (ReLU here) of the synaptic current input h(t). The current includes both the external input and the recurrent feedback. The output z(t) is a weighted readout of the neural responses in the reservoir (a binary choice for the MSI task). The RNN is used here to model the multisensory event-rate discrimination task for rats [21, 28], and is trained by our gBPTT to solve the same audiovisual integration task. The visual and auditory inputs can be either positively tuned (an increasing function of the event rate) or negatively tuned (a decreasing function). Showing the network both types of tuned inputs could improve the training [29]. The RNN is composed of 150 neurons, whose recurrent dynamics is required to hold a high output value if the input event rate (represented by time-dependent noisy inputs) is above the decision boundary, and a low output value otherwise. The neurons are reciprocally connected, and the global statistics of the topology is learned from the training trials.

The benefits of multisensory inputs, as known in cognitive science [16], are reproduced by our RNN model trained by gBPTT (Fig. 11). Integrating multiple sources of information, rather than a unisensory input, helps decision making, particularly when the task becomes hard (i.e., around the decision boundary). As the training proceeds, the sparsity of the recurrent layer grows rapidly until saturation, while the sparsity of the output layer grows in the same manner but finally reaches a lower value, with a small fluctuation (Fig. 12). The recurrent layer becomes sparser with training, suggesting that the latent dynamics is likely low dimensional, because some degrees of freedom in the recurrent feedback are unnecessary. In contrast, the goal of the output layer is to decode the recurrent dynamics, and the output layer should therefore keep all relevant dimensions of the integrated information, which requires a densely connected output layer.

FIG. 11: Psychometric function for the MSI task. This function shows the percentage of high choices made by the network as a function of the event rate for both unisensory and multisensory trials. Each marker is the mean result over 100 independent test trials. Other training conditions are the same as in Fig. 10.

FIG. 12: Evolution of the sparsity densities with training epochs for the MSI task. The lines are the mean results of ten independent training trials, and the shadow indicates the fluctuation. Recurrent I and II are the lower and upper triangles of the π matrix, respectively. Other training conditions are the same as in Fig. 10.
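The continuous firing-rate dynamics described above, with r(t) = ReLU(h(t)) and a weighted readout z(t), can be sketched as a discretized leaky integration. This is an illustrative sketch, not the paper's exact update equation, and all parameter values and weight scalings are placeholders:

```python
import numpy as np

def run_rnn(u, w_in, w_rec, w_out, dt=20.0, tau=100.0, sigma=0.05, rng=None):
    """Leaky firing-rate dynamics: r(t) = ReLU(h(t)), and z(t) is a weighted
    readout of the rates. The synaptic current combines external input and
    recurrent feedback, as described in the text."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = dt / tau          # integration step relative to the time constant
    n_steps = u.shape[0]
    n = w_rec.shape[0]
    h = np.zeros(n)           # synaptic current
    z = np.zeros(n_steps)
    for t in range(n_steps):
        r = np.maximum(h, 0.0)  # firing rates via the ReLU transfer function
        current = w_rec @ r + w_in @ u[t] + sigma * rng.standard_normal(n)
        h = (1.0 - alpha) * h + alpha * current
        z[t] = w_out @ np.maximum(h, 0.0)  # readout; z[-1] carries the choice
    return z

rng = np.random.default_rng(2)
n, n_steps = 150, 80
w_in = 0.1 * rng.standard_normal((n, 4))
w_rec = rng.standard_normal((n, n)) / np.sqrt(n)
w_out = rng.standard_normal(n) / np.sqrt(n)
u = rng.standard_normal((n_steps, 4))
z = run_rnn(u, w_in, w_rec, w_out, rng=rng)
```

In the ensemble setting, `w_rec` would be resampled from the SaS hyper-parameters (or replaced by the mean-field pre-activation) at each dynamic step.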
This behavior is consistent with the evolution of the entropy profile. The recurrent layer maintains a relatively higher level of variability, compared with that of the output layer, at the end of training (see the continuous entropy computed according to Eq. (14), or the Ξ-entropy, in Fig. 13). In addition, the discrete π-entropy decreases with training in the output layer, in contrast to the increasing behavior of the same type of entropy in the recurrent layer (Fig. 13).

FIG. 13: Evolution of entropy densities with training epochs for the MSI task. The lines are the mean results of ten independent training trials, and the shadow indicates the fluctuation. The definitions of entropies are the same as in the MNIST experiment (see Fig. 4). The value of a does not affect the qualitative behavior of the entropy profile. Other training conditions are the same as in Fig. 10.

Let us then look at the distribution profile of the hyperparameters (Fig. 14). In the output layer, the distribution of π is U-shaped, and Ξ shows a sharp single peak at zero, which demonstrates that a dominant part of the weight distribution reduces to a Bernoulli distribution with two peaks, at 0 and at m ≠ 0. The reduced variability of the weight values in the output layer is consistent with decoding stability. In the recurrent layer, the π-distribution is U-shaped, while the Ξ-distribution resembles an L shape. This implies that both VIP and UIP connections emerge in the network. Moreover, a certain level of variability is allowed in the weight values, making flexible computation possible in the internal dynamics.

Next, we ask whether our training leads to the emergence of neural selectivity, which indeed exists in the prefrontal cortex of behaving animals [24]. The selectivity properties of neurons play a critical role in complex cognitive abilities. In our trained networks, we also find that neurons in the recurrent layer display different types of selectivity (Fig. 15).
In other words, neurons become highly active for either modality (visual or auditory) or choice (high or low), or for both (mixed selectivity).

We then explore the detailed patterns of the hyper-parameter matrices, which convey the individual contribution of each connection to the behavioral performance. By inspecting the sparsity matrix [Fig. 16 (b)], one can identify both unnecessary (π = 1) and important (π = 0) connections. We also find that some neurons prefer receiving or sending information during the recurrent computation. One interpretation is that the spatio-temporal information is divided into relevant and irrelevant parts; the relevant parts are maintained through a sending preference, while the irrelevant parts are blocked through a receiving preference.

Targeted weight perturbation experiments (Fig. 17) show that the VIP weights play a fundamental role in supporting the task accuracy reached by the recurrent computation. Our method can thus provide a precise temporal credit assignment for the MSI task, which the standard BPTT cannot.

Finally, we remark that a dynamic network with time-dependent specified weight values is able to reach an accuracy equivalent to that of the network using the mean-field propagation. Note that the overall statistics of the network does not change significantly (Fig. 18), suggesting that the hyperparameters of the weight statistics are more important than the precise weight values. In fact, dendritic spines in neural circuits, the biological substrates of synaptic contacts, are also subject to fluctuation, i.e., they form a highly dynamic structure [25, 26]. Future exploration of this interesting connection would be fruitful, as the ensemble perspective is much more abstract than a concrete topology, while a specified stationary topology is still widely used in modern machine learning. Therefore, the ensemble perspective, yielding a dynamic network in adaptation to external stimuli, could shed light on our understanding of adaptive RNNs.

FIG. 14: Distributions of the hyper-parameters (π, m, Ξ) in the input, recurrent, and output layers of the RNN model for the MSI task. Training conditions are the same as in Fig. 10. In (a, b), i < j is assumed.

FIG. 15: Selectivity of constituent neurons for the MSI task. Training conditions are the same as in Fig. 10. The neurons show selectivity with respect to choice (a, b), modality (e, f), or both (c, d).

FIG. 16: Hyper-parameter matrices (π, m, Ξ) of the recurrent layer in a trained network for the MSI task, plotted in matrix form with dimension N × N, where N indicates the number of neurons in the reservoir. Training conditions are the same as in Fig. 10.

FIG. 17: Targeted weight perturbation in the recurrent layer for the MSI task. (a) VIP weights (in a relaxed sense, i.e., π < 0., |m| > 0.05; note that Ξ is very small) are stochastically turned off, in comparison with the same number of randomly selected weights being pruned. (b) The fraction of VIP weights during the training process. Training conditions are the same as in Fig. 10.
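The targeted-weight perturbation protocol admits a simple sketch: select VIP weights by a relaxed criterion (small spike probability, large slab mean), ablate them, and compare against ablating the same number of randomly chosen weights. The π threshold below is a placeholder (the printed value is not recoverable from this excerpt), and all hyper-parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150  # reservoir size, as in the text

# Hypothetical trained SaS hyper-parameters (illustrative values).
pi = rng.uniform(0.0, 1.0, size=(n, n))  # spike (zero-weight) probability
m = rng.normal(0.0, 0.1, size=(n, n))    # slab mean

def vip_mask(pi, m, pi_max=0.1, m_min=0.05):
    """Relaxed VIP criterion of Fig. 17: small spike probability and large
    slab mean; Xi is assumed negligible. pi_max=0.1 is a placeholder."""
    return (pi < pi_max) & (np.abs(m) > m_min)

def ablate(w, mask):
    """Turn off (zero out) the masked weights."""
    w = w.copy()
    w[mask] = 0.0
    return w

def random_mask_like(mask, rng):
    """Random control: prune the same number of randomly chosen weights."""
    flat = np.zeros(mask.size, dtype=bool)
    flat[rng.choice(mask.size, size=int(mask.sum()), replace=False)] = True
    return flat.reshape(mask.shape)

w = np.where(rng.random((n, n)) < pi, 0.0, m)  # one sampled network (Xi ~ 0)
vip = vip_mask(pi, m)
w_vip_off = ablate(w, vip)
w_rand_off = ablate(w, random_mask_like(vip, rng))
```

Running the trained dynamics with `w_vip_off` versus `w_rand_off` and comparing test accuracies would reproduce the comparison of Fig. 17(a).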
IV. CONCLUSION
In this study, we propose an ensemble perspective for understanding temporal credit assignment, which yields a generalized BPTT algorithm for training RNNs in practice. Our training protocol produces the hyperparameters underlying the weight distribution, i.e., the statistics of the RNN ensemble. In contrast to the standard BPTT, gBPTT highlights the importance of the network statistics; in particular, it makes a dynamic network, one that changes its specified weights over time, a potential substrate for recurrent computation in adaptation to sensory inputs. It is thus interesting for future work to explore the biological counterpart of neural computation in brain circuits, e.g., in terms of dendritic spine dynamics [25].

Our SaS model has three types of hyperparameters with distinct roles. m gives the mean of the continuous Gaussian slab, while Ξ controls the variance (fluctuation) of the slab. In both computational tasks considered here, we find that the Ξ-distribution profile is L-shaped. In other words, the peak at zero turns the SaS distribution into a Bernoulli distribution, while the tail at finite values of the variance endows each connection with computational flexibility, which may be critical for recoding the high-dimensional sensory inputs into a low-dimensional latent dynamics. It is thus interesting to test this hypothesis by addressing precisely how the ensemble perspective helps to clarify the mechanism by which low-dimensional latent dynamics encodes the relevant spatio-temporal features of the inputs.

The spike probability π sets the sparsity level of the network. We find that a sparse RNN emerges after training. In particular, the recurrent layer is sparser than the output layer, reflecting the different roles of the two layers. The sparseness allows the recurrent layer to remove unnecessary degrees of freedom from the recurrent dynamics, so that the recurrent feedback maintains only the information relevant for decision making.
In contrast, the output layer is much denser, and is thus able to read out the relevant information both completely and robustly (note that most of the Ξ values are zero). It is worth noticing that the π-distribution profile implies the existence of VIP weights, which we show to have a critical contribution to the overall performance of the network. Our method thus provides a practical way to identify the critical elements of the network topology, which could contribute to understanding the temporal credit assignment underlying the behavior of the RNN.

Working at the ensemble level, the constituent neurons of our model also display neural selectivity of different natures in response to the task parameters. Some neurons show uni-selectivity, while others show mixed selectivity. The selectivity property of neurons, shaped by the recurrent connections, is also a key factor impacting the behavioral output of the network, and thus deserves future studies on the structural basis of neural selectivity.

Taken together, our model and the associated training method provide insights toward a precise understanding of temporal credit assignment, not only in engineering applications (e.g., MNIST digit classification), but also in modeling brain dynamics (e.g., the multisensory integration task).

FIG. 18: The statistics of the dynamic network implementing the MSI task could be preserved. (a) The fraction of zero elements of sampled weights across one test trial. The fluctuation is computed from ten independent runs. (b) The asymmetry measure $\eta=\overline{w_{ij}w_{ji}}/\overline{w_{ij}^{2}}$ versus time step. The over-bar means the average over all reciprocal connections. The result is averaged over ten independent runs.

Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grant No. 11805284 and the start-up budget 74130-18831109 of the 100-talent program of Sun Yat-sen University.

[1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[2] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104-3112, 2014.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR 2015: International Conference on Learning Representations, 2015.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[7] Saurabh Vyas, Matthew D. Golub, David Sussillo, and Krishna V. Shenoy. Computation through neural population dynamics. Annual Review of Neuroscience, 43(1):249-275, 2020.
[8] James M. Murray. Local online learning in recurrent networks with random feedback. eLife, 8:e43299, 2019.
[9] Dean V. Buonomano and Wolfgang Maass. State-dependent computations: spatiotemporal processing in cortical networks. Nature Reviews Neuroscience, 10(2):113-125, 2009.
[10] Thomas Miconi. Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife, 6:e20899, 2017.
[11] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.
[12] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.
[13] Chan Li and Haiping Huang. Learning credit assignment. Phys. Rev. Lett., 125:178301, 2020.
[14] Alexandre Pouget, Jeffrey M. Beck, Wei Ji Ma, and Peter E. Latham. Probabilistic brains: knowns and unknowns. Nature Neuroscience, 16(9):1170-1178, 2013.
[15] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941, 2015.
[16] Ladan Shams and Aaron R. Seitz. Benefits of multisensory learning. Trends in Cognitive Sciences, 12(11):411-417, 2008.
[17] Dora E. Angelaki, Yong Gu, and Gregory C. DeAngelis. Multisensory integration: psychophysics, neurophysiology, and computation. Current Opinion in Neurobiology, 19(4):452-458, 2009.
[18] Chandramouli Chandrasekaran. Computational principles and models of multisensory integration. Current Opinion in Neurobiology, 43:25-34, 2017.
[19] Yuhuang Hu, Adrian E. G. Huber, Jithendar Anumula, and Shih-Chii Liu. Overcoming the vanishing gradient problem in plain recurrent networks. arXiv:1801.06105, 2018.
[20] Daniel T. Gillespie. Exact numerical simulation of the Ornstein-Uhlenbeck process and its integral. Physical Review E, 54(2):2084-2091, 1996.
[21] H. Francis Song, Guangyu R. Yang, and Xiao-Jing Wang. Training excitatory-inhibitory recurrent neural networks for cognitive tasks: A simple and flexible framework. PLOS Computational Biology, 12:e1004792, 2016.
[22] David Raposo, John P. Sheppard, Paul R. Schrater, and Anne K. Churchland. Multisensory decision-making in rats and humans. The Journal of Neuroscience, 32(11):3726-3735, 2012.
[23] Cengiz Pehlevan and Haim Sompolinsky. Selectivity and sparseness in randomly connected balanced networks. PLOS ONE, 9(2):e89992, 2014.
[24] Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K. Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451):585-590, 2013.
[25] Nobuaki Yasumatsu, Masanori Matsuzaki, Takashi Miyazaki, Jun Noguchi, and Haruo Kasai. Principles of long-term dynamics of dendritic spines. The Journal of Neuroscience, 28(50):13592-13608, 2008.
[26] D. Harshad Bhatt, Shengxiang Zhang, and Wen-Biao Gan. Dendritic spine dynamics. Annual Review of Physiology, 71(1):261-282, 2009.
[27] Seungdae Baek, Min Song, Jaeson Jang, Gwangsu Kim, and Se-Bum Paik. Spontaneous generation of face recognition in untrained deep neural networks. bioRxiv, 2019.
[28] David Raposo, Matthew T. Kaufman, and Anne K. Churchland. A category-free neural population supports evolving demands during decision-making. Nature Neuroscience, 17(12):1784-1792, 2014.
[29] Paul Miller, Carlos D. Brody, Ranulfo Romo, and Xiao-Jing Wang. A recurrent network model of somatosensory parametric working memory in the prefrontal cortex.