Slow manifolds in recurrent networks encode working memory efficiently and robustly
Elham Ghazizadeh* & ShiNung Ching

Electrical and Systems Engineering, Washington University in St. Louis, 1 Brookings Dr, St. Louis, MO 63130, USA
Biomedical Engineering, Washington University in St. Louis, 1 Brookings Dr, St. Louis, MO 63130, USA
Division of Biology and Biomedical Sciences, Washington University in St. Louis, 1 Brookings Dr, St. Louis, MO 63130, USA
Working memory is a cognitive function involving the storage and manipulation of latent information over brief intervals of time, thus making it crucial for context-dependent computation. Here, we use a top-down modeling approach to examine network-level mechanisms of working memory, an enigmatic issue and central topic of study in neuroscience and machine intelligence. We train thousands of recurrent neural networks on a working memory task and then perform dynamical systems analysis on the ensuing optimized networks, wherein we find that four distinct dynamical mechanisms can emerge. In particular, we show the prevalence of a mechanism in which memories are encoded along slow stable manifolds in the network state space, leading to a phasic neuronal activation profile during memory periods. In contrast to mechanisms in which memories are directly encoded at stable attractors, these networks naturally forget stimuli over time. Despite this seeming functional disadvantage, they are more efficient in terms of how they leverage their attractor landscape and, paradoxically, are considerably more robust to noise. Our results provide new dynamical hypotheses regarding how working memory function is encoded in both natural and artificial neural networks.

Working memory (WM) is a temporary store that allows for active manipulation of information in the absence of external stimuli. Critical cognitive functions such as reasoning, planning and problem solving rely on working memory, and thus its mechanistic basis is a key question in neuroscience and machine intelligence. Presumably, memory retention relies on an invariant latent neural representation of past stimuli, but the precise nature of these representations and the dynamical mechanisms by which they are created in neural circuits remain enigmatic. Experimental and theoretical characterizations of working memory typically center on a delay period that occurs after stimulus presentation and before onset of a behavioral response or action. Characterizations of neural activity during delay periods dichotomize into two broad categories: (i) persistent, tonic activity and (ii) time-varying, phasic activity. In the former, neurons are tuned to relevant features of a stimulus and produce elevated and relatively constant activity throughout the delay. In the latter, neuronal activity fluctuates, ramping up and down during delay periods [5, 10, 11]. Tonic and phasic paradigms have been observed in working memory tasks in animals [10, 11] and artificial networks [12, 13]. However, the mechanisms underlying these descriptions, and the reasons why one may manifest over the other in certain circumstances, are far from clear.

Understanding the network mechanism of working memory often revolves around the role of self-sustaining attractors, including discrete fixed points, which correspond to neuronal activity patterns that are maintained indefinitely in the absence of exogenous stimuli or perturbation. Tonic delay activity is thought to coincide with such attractors, thus allowing for stable maintenance of memory representations for potentially arbitrary lengths of time.

On the other hand, in the phasic hypothesis memory representations do not coincide with self-sustaining attractors. Instead, high-dimensional neuronal activity fluctuations may project onto a lower-dimensional latent space upon which an invariant representation is held during delay intervals [17, 18]. For example, if the activity of one neuron gradually drops, the activity of another neuron increases to compensate for that drop. Thus, during delay, neural activity may traverse a low-dimensional manifold corresponding to this invariant representation [13, 19].

Disambiguating the above mechanisms requires deriving an understanding of the generative processes that give rise to time-varying, task-evoked neural activity. Ideally, we would be able to analytically characterize these mechanisms in a dynamical systems framework that could reveal the details of the attractor landscape embedded within neuronal networks.

However, ascertaining dynamical systems models of biological networks is not straightforward, especially at a level of scale commensurate with networks thought to be relevant to WM, such as prefrontal cortex [20, 21]. Here, artificial recurrent neural networks (RNNs) can form an interesting and potentially useful surrogate from which to derive mechanistic hypotheses. Such networks can be optimized in a top-down fashion to engage high-level cognitive tasks that include WM requirements. Then, the emergent dynamics of the synthesized model can be analyzed and used to make arguments for or against different mechanisms, based on the predictive validity of the model outputs relative to actual brain activity.

In this spirit, recent works have tried to reconcile the aforementioned hypotheses regarding persistent vs. transient delay activity in the context of WM. Orhan and colleagues optimized RNNs to perform several short-term memory tasks and observed that both tonic and phasic delay activity could arise, depending on specific task details and optimization/learning parameters. Similarly, Nachstedt and colleagues showed that the simultaneous existence of both mechanisms can mediate reliable task performance in the face of uncertain stimulus timing. However, it remains unclear what factors sway RNNs to manifest one mechanism over another and, relatedly, whether they carry different functional advantages.

With regards to the last point, the ability of optimized RNNs to predict actual brain activity may depend crucially on certain restrictions regarding the optimization method that is used, e.g., by encouraging solutions that manifest connection motifs that are more biologically realistic [13, 31]. Thus, using RNNs to build potentially explanatory hypotheses regarding neural circuit mechanisms likely requires careful consideration of the numerical optimization strategy, including hyperparameters and initialization schemes, as well as prior constraints on network architecture [25, 32].

Expanding on this idea, in this work we pursue the top-down RNN approach to study potential mechanisms underlying WM function, training thousands of networks to perform an analytically tractable sequential, memory-dependent pattern matching task. To train our networks, we modify the First-Order Reduced and Controlled Error (FORCE) method by using a temporally restricted error kernel to confine error regression to brief intervals within each trial. The proposed framework blends trial-based reinforcement learning with first-order regression, thus obviating the need for a continual external supervisory error signal.

Our premise is that this revised optimization framework, by leaving long epochs unconstrained, may allow for a wider range of possible emergent dynamics. Indeed, by optimizing RNNs across different hyperparameters and initialization schemes within this framework, we identify a diversity of network mechanisms, each achieving the desired function, but varying in their key dynamical properties. We find that networks can embed predominantly asymptotically stable fixed points, stable limit cycle attractors, or a combination thereof. Most interestingly, we show here that there are two distinct mechanisms by which stable fixed points can be used to serve memory encoding, one leading to tonic activation and the other leading to phasic activation. We show that the latter, while unable to sustain memories over arbitrary lengths of time (i.e., wherein the model 'forgets'), nonetheless constitutes a more efficient and robust mechanism by which memories can be encoded.

Results

Working memory can be encoded via distinct dynamical mechanisms associated with tonic and phasic neural activation.
We enacted a trial-based WM task involving sequential pattern matching (SPM) that exhibits working memory requirements (Fig. 1a). In our design, high-dimensional stimuli are encoded as bivariate random processes, such that the network is required to temporally integrate each stimulus and then store a latent representation of said stimulus for later processing. We optimized RNNs to perform this task by using a modified FORCE method that included a temporally restricted error kernel. Here, regression occurs at two phases during each trial: (i) during memory/delay periods, wherein we promote the formation of an invariant latent linear projection from neural units nominally associated with maintenance of a memory representation; and (ii) at the conclusion of each trial, wherein we promote a linearly decoded output response signal (Fig. 1a,b). All other temporal epochs are unconstrained, thus obviating the need to generate an error signal continuously throughout trials, which may overly constrain the dynamics (see also Methods for additional details and Supplementary Fig. 1).

We found that optimized networks could produce both tonic and phasic activity patterns during delay periods, as exemplified for two different networks of 1000 neurons in Fig. 1c. In order to study the dynamical mechanisms underlying these overt patterns, we first used a numerical criterion on neuronal activity at the end of the delay period, T_d. Specifically, we arrested trials at T_d and forward simulated the networks autonomously to ascertain whether the activity was sustained at a fixed point (see Methods). We identified four distinct dynamical mechanisms that could mediate working memory. In the case of tonic activation, network activity would indeed remain persistent, i.e., x(t), the state vector of neuronal activity, would remain near x(T_d) with ‖ẋ‖ ≈ 0, indicative of a fixed point attractor (Fig. 2a). We refer to this mechanism as direct fixed point encoding (DFP). In the case of phasic patterns, x(t) in the forward simulation would deviate from x(T_d). In some cases, the network would always settle at a fixed point different from the memory representation (Fig. 2b, termed indirect fixed point encoding, IFP), independent of the stimulus or network initial condition. In other cases, the network would always asymptotically approach a stable limit cycle attractor (Fig. 2c, limit cycle encoding, LC). In a fourth case (not depicted), the network could asymptotically approach either a disparate fixed point or a limit cycle, depending on the stimulus realization (termed mixed encoding, see Supplementary Fig. 2). In total, we optimized 1524 network models, of which 703 were identified as the direct fixed point (DFP) mechanism, 534 as the indirect fixed point (IFP) mechanism, 182 as the limit cycle (LC) mechanism, and 105 as the mixed (Mix) mechanism. Given their dominance in the emergent solutions, our subsequent attention will be on understanding the workings of the DFP and IFP mechanisms, though we will later also untangle the factors that cause each mechanism to arise over the others.
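The forward-simulation criterion used for this categorization can be summarized in a short sketch. The following is illustrative Python/numpy, not the authors' released code; the `step` function (one autonomous Euler step of the trained network), the iteration count and the tolerances are assumptions.

```python
import numpy as np

def classify_delay_dynamics(step, x_Td, n_steps=20000, tol=1e-3):
    """Classify the mechanism from the state x(T_d) at the end of the delay.

    `step(x)` advances the trained network one Euler step with zero input
    (autonomous dynamics); it must be supplied by the user.
    """
    x = x_Td.copy()
    for _ in range(n_steps):
        x_prev, x = x, step(x)
    speed = np.linalg.norm(x - x_prev)              # proxy for ||dx/dt|| at the end
    if speed < tol and np.linalg.norm(x - x_Td) < tol:
        return "DFP"        # activity stays near x(T_d): memory is itself a fixed point
    if speed < tol:
        return "IFP"        # settles at a fixed point away from the memory state
    return "LC or Mix"      # sustained motion: limit cycle (or stimulus-dependent mix)
```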
Indirect encoding efficiently uses the network attractor landscape.

The above findings suggest that key invariant structures in the network attractor landscape (stable fixed points and attractive limit cycles) determine whether and how delay activity takes on a tonic or phasic character. To delve further into these mechanisms, we analyzed how networks in each of the four categories leverage their respective attractor landscapes during the task.

We began by linearizing the dynamics at the origin and using mean-field results to establish lower bounds on the number of fixed point attractors manifest in the network attractor landscape. Fig. 3a,b show how our four mechanistic categories break down along three key properties of the spectra of the ensuing Jacobian matrix, where distinctions are readily observed. Most notably, the landscapes associated with direct fixed point encoding involve a greater number of fixed point attractors relative to indirect encoding. In support of this point, Fig. 3c illustrates representative low-dimensional projections of network activity in each of the four mechanisms with stable fixed points overlaid (here, we restrict attention to the positive quadrant; see also Discussion). In DFP encoding (Fig. 3c), the sequential stimuli move the trajectory between different fixed points (associated with memory representations), culminating in an output that is itself associated with a different fixed point (i.e., here a total of four fixed points are used in the service of the task). In contrast, the landscape for IFP encoding (Fig. 3c) involves a single fixed point that does not encode memories, nor does it encode the nominal output (though it is approached asymptotically if networks are forward simulated autonomously after trial cessation). Thus, IFP encoding is able to maintain invariant representations during the relevant memory periods without relying directly on the presence of multiple fixed point attractors (see also Supplementary Fig. 3).
Indirect encoding uses slow manifolds to sustain memory representations.

Following from the above, IFP encoding appears to use the geometry of the stable manifolds of the single fixed point to maintain memory representations. Fig. 4a illustrates the spectrum of the linearized dynamics about the fixed point in the IFP encoding model from the previous example, where we see many eigenvalues near the imaginary axis, indicating the presence of slow, stable manifolds along which activity flows in a relatively invariant fashion. These manifolds provide the opportunity for a low-dimensional latent representation of memory to be maintained, despite phasic activity (along the manifold). Indeed, because we encourage linearly mapped latent representations via our optimization method (see Methods), we know these manifolds have a planar geometry in the firing rate activity variables. In contrast, Fig. 4b illustrates the spectra resulting from linearization about two memory fixed points in a DFP model. Here we note that eigenvalues are relatively offset from the imaginary axis, indicating rapid convergence to the fixed point. This conclusion is supported in Fig. 4c, which shows the relative proportion of delay periods in which neurons are in the saturated (nonlinear) vs. linear range of the activation function, i.e., tanh(·), for each model we trained. The much larger proportion of saturated neurons in DFP encoding indicates that the mass of eigenvalues for these models is relatively contracted and offset from the imaginary axis (see Methods and equation (10)) and thus associated with fast decay to the fixed points.

To further understand the circuit-level details mediating the DFP and IFP mechanisms, we characterized the connectivity between neurons. We first noted that DFP encoding leads to an overall much broader distribution of connectivity weights between neurons relative to IFP (Fig. 4d). Next, we sorted neurons according to their peak activation (as in Fig. 1c) and examined their average pre-synaptic activity throughout the course of trials. We found that neurons in DFP encoding exhibited highly structured synaptic tuning to different stimuli and memory periods, in contrast to IFP encoding (Fig. 4e). Finally, we examined the bidirectional synaptic weight between 'adjacent' neurons (ones with temporally sequential maximal activation). Here, DFP exhibits no systematic connectivity structure, while IFP shows that neurons with similar peak activation times are more tightly coupled (Fig. 4f). This latter point suggests that traversal along the slow manifolds is mediated by an internal sequential, 'daisy chain' type of structure embedded within the trained IFP encoding network.
Stable manifold encoding is forgetful, but robust.

We sought to better understand the functional advantages of the different mechanism types. In this regard, we interrogated networks by extending delay periods beyond the nominal training requirements, a form of increased memory demand. The main question here is how increasing the memory demand in this way would affect activity and consequently degrade task performance. Fig. 5a compares neural activity patterns for the DFP and IFP encoding categories (with an extended delay equal to five times the nominal delay interval). For DFP encoding, regardless of the length of the extended delay, the neural activity is unaffected, since the network uses fixed points as the invariant structure to encode memory traces. Consequently, after the extended delay ends and the network receives the second stimulus, task computations can be executed correctly. However, for IFP encoding, neural activity gradually drops away during the extended delay interval, which results in loss of function due to deviation from the 'correct' activity pattern upon receiving the second stimulus. Fig. 5b summarizes the deviation from the nominally 'correct' post-delay neural activity as a function of delay extension for our two fixed point mechanisms, as well as LC and Mix. As expected, for DFP encoding this deviation is near zero. In contrast, IFP networks can tolerate an extended delay up to 100% of the nominal delay, after which point performance gradually drops, i.e., the correct representation is 'forgotten'.

To assay other functional aspects of these mechanisms, we examined how the performance of our networks would tolerate distracting noise added to the actual stimulus. Here, we found a counterintuitive functional advantage of 'forgetting', relative to the DFP mechanism. We specifically added uncorrelated noise of differing variance to the first of the two sequential stimuli and examined deterioration from the nominal 'correct' neural representation at trial conclusion. For values of noise variance that are less than the stimulus variance (vertical dotted line, Fig. 5c), IFP (and indeed LC) encoding is highly robust to perturbations, and indeed variances in excess of an order of magnitude greater than the stimulus can be tolerated. In stark contrast, DFP encoding is highly fragile with respect to distracting noise, with rapid and near-complete breakdown of the correct neural representation after modest perturbation (Fig. 5c). To understand this result, we carefully studied the trajectories in low-dimensional space in the presence of distracting noise (Supplementary Fig. 4), from which we ascertained that the distracting noise was essentially placing the trajectory in an erroneous basin of attraction, i.e., causing an incorrect memory fixed point to be induced. This result runs counter to classical Hopfield-type associative memory theory, which presumes that basins are useful for rejecting noise and uncertainty. Our finding, in essence, indicates that the high reliance on many fixed points for stable memory representations in DFP encoding makes this mechanism more susceptible to the temporal integration of noise (see also Discussion).
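The distractor protocol described above can be expressed as a small sketch. This is an illustrative numpy version under stated assumptions: `simulate_trial` stands in for a forward pass of a trained network that returns the latent readout at trial end, and is not provided here; the noise model simply adds zero-mean Gaussian samples to the first stimulus epoch.

```python
import numpy as np

def distractor_robustness(simulate_trial, u_clean, stim1_slice, variances, seed=0):
    """Squared deviation of the trial-end representation as noise variance grows.

    u_clean: (T, 2) clean input trial; stim1_slice: time indices of stimulus 1.
    """
    rng = np.random.default_rng(seed)
    z_ref = simulate_trial(u_clean)                        # noiseless reference
    errors = []
    for v in variances:
        u = u_clean.copy()
        u[stim1_slice] += rng.normal(0.0, np.sqrt(v), u[stim1_slice].shape)
        errors.append(float(np.sum((simulate_trial(u) - z_ref) ** 2)))
    return np.array(errors)                                # cf. Fig. 5c
```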
Initial network properties dictate the emergence of different solution dynamics.

Finally, we sought to understand the factors, prior to optimization, that bias the emergent dynamics towards one type of mechanism versus another. We considered three main network properties: (i) the strength of connectivity, g, (ii) the variance of feedback, σ_f, and (iii) the sparsity of the initial connectivity matrix. We varied these parameters over their possible ranges. Fig. 6 illustrates the effect of different parameterizations: for small values of g the trainability of networks is poor, but it improves significantly as g increases. In other words, large random initial connectivity facilitates training, consistent with known results [33, 35]. For g < 1 the untrained network has one stable fixed point at the origin, and the emergent trained dynamics tend to be of DFP or IFP encoding (Fig. 6a). Interestingly, the variance of the feedback weights, σ_f, has a notable effect on the emergent dynamics; for large values of σ_f the networks tend to form DFP models, and as σ_f decreases only IFP and LC models arise (Fig. 6b). The sparsity of the initial connectivity matrix has no significant effect on the trainability of networks nor on the emergent dynamics (Fig. 6c).

Discussion

Learning a diversity of dynamics for working memory function
In this work we used a top-down, optimization-based approach to investigate potential dynamical mechanisms mediating WM function. By training/optimizing RNNs using a modification of the FORCE regression method, we found four qualitatively different types of network dynamics that can mediate function. At a mechanistic level, these solutions are differentiated on the basis of the number of asymptotically stable fixed points manifest in the network vector field and, crucially, how those fixed points are leveraged in the service of the task. We note especially two solution types: one in which neural memory representations are highly persistent, corresponding to direct encoding at fixed points (i.e., DFP), versus another in which neural representations are transient and correspond to traversal along slow manifolds in the network state space (i.e., IFP). At the level of neural activity, DFP produces tonic, sustained activity during delay periods, while IFP produces phasic, transient activity.

Our results are related to prior work showing that persistent versus transient encoding of memories can manifest in neural networks trained on different WM tasks and under different optimization/learning schemes. Here, we chose to focus on a single, structured task in an effort to go beyond overt activity characterizations and carefully dissect the underlying dynamical mechanisms, namely the attractor landscape in the neural state space. Doing so provides not only insight into potential generative circuit processes but also allows us to perform sensitivity analyses to ascertain nuanced functional advantages associated with the different mechanisms.

Tradeoff between efficiency, memory persistence and robustness
In particular, our results suggest an interesting balance between persistence and robustness of memory representations. Specifically, the DFP mechanism resembles in many ways traditional associative memory attractor dynamics, in the sense of Hopfield networks. Here, each memorandum is associated with a distinct, asymptotically stable fixed point. On the one hand, such a mechanism is able to retain memories for arbitrary lengths of time. Further, the dynamics within the attractor basins can nominally correct for small perturbations to neural trajectories at the onset of memory periods. However, our results suggest that this latter classical interpretation breaks down when considering sequential, time-varying stimuli. In this case, perturbations to stimuli can accrue over time, causing neural representations to stray into errant basins of attraction, ultimately leading to failure of performance.

In contrast, in the IFP encoding mechanism, the network vector field exhibits a smaller number of fixed points that do not encode memoranda directly. Rather, memory representations are formed from the projection of neural activity along slow manifolds that are ostensibly shaped through optimization of the network vector field. The fixed points here are, in essence, 'shared' between memoranda. This mechanism turns out to be far more robust to time-varying stimulus perturbations. There are likely two factors related to this robustness. First, noisy perturbations may not be able to easily move trajectories off of the 'correct' slow manifold. Second, there are no competing attractors to absorb errant trajectories, as would be the case in the DFP mechanism. In total, IFP encoding can be viewed as an overall more efficient use of neural dynamics, wherein the lack of a persistent representation (i.e., 'forgetfulness') is offset by a lighter-weight coding scheme in terms of the number of attractors deployed in the state space, leading, perhaps paradoxically, to more robust performance.

Shaping a landscape with few attractors
Expanding on the above point of efficiency, it is of note that the limit cycle and mixed mechanisms are most comparable to IFP in terms of the way in which the attractor landscape is used in the service of the task. In the LC mechanism in particular, the oscillatory cycle is not itself used to encode or sustain the memory, but rather shapes the landscape to create slow manifolds for encoding, similar to IFP. Thus, while the oscillation is not directly functional, it is nonetheless critical in establishing the 'right' landscape for task completion. From an energetic standpoint, the indirect mechanisms are potentially less expensive, since most neurons are inactive at any moment in time, in contrast to DFP encoding.
Temporally restricted optimization promotes solutions that are compatible with observed dynamics in vivo
Our results shed light on the different means by which recurrent networks can embed memory functions within their dynamics. Such a question is highly relevant to understanding how key machine learning and artificial intelligence constructs such as RNNs encode complex, context-dependent functions. However, our results also suggest mechanistic interpretations for actual WM circuits in the brain, given the seeming prevalence of phasic activity patterns during delay intervals observed in vivo. Indeed, it has been observed that neurons in memory-relevant regions such as prefrontal cortex do not necessarily maintain persistent activity throughout long delay periods, but rather may 'ramp' on and off at systematic time points, as is compatible with our IFP mechanism. Further, in the IFP mechanism, most neurons are lightly saturated (Fig. 4c), meaning that most neurons are within a linear regime, as is thought to occur in actual neural circuits [34, 38].

Notably, the IFP dynamical mechanism only arises after using the proposed temporally restricted error kernel. Indeed, we found that using the native FORCE method without such a kernel leads to poor trainability; further, those networks that do manage to be trained are highly fragile to the extended delay and noise perturbations we considered. This fragility ostensibly arises because the latent outputs are overly constrained in this situation. Indeed, the choice of how to constrain these outputs throughout the task is somewhat arbitrary in the first place. Hence, the temporally restricted error kernel may allow for the emergence of more naturalistic dynamics in our RNNs.
Potential for enhanced fast learning and generalization
An important technical caveat is that we have set up our RNNs to produce activity in the positive quadrant. Hence, our analysis focuses on characterization of the attractor landscape in that region of the state space. However, because we consider an odd activation function, we know analytically that the fixed points analyzed in our networks have 'mirror' negative fixed points that are not directly used in the service of the task, which means that these dynamical features are essentially 'wasted' by construction and network design. A speculative hypothesis is that these fixed points may allow the network to more quickly learn a related task with minimal synaptic modification, i.e., by leveraging the mirror dynamics that are already embedded in the network. Such a concept is related to the idea of meta-learning and may be an interesting line of future study.

Methods

Working memory task details.
In this study, we considered a sequential pattern-matching task that takes into account key aspects of working memory tasks: stimulus processing, memory encoding and response execution. Our goal was to use a task of sufficiently low dimension so as to be able to perform tractable and potentially illuminating post-hoc analysis on the emergent dynamics of RNNs. In the proposed task, each trial consists of two random process stimuli that are sequentially presented and interleaved with delay intervals, followed by a brief response interval (Fig. 1). Each stimulus is a two-dimensional Gaussian process obtained in the latent space of a Variational Auto Encoder (VAE) trained on the MNIST dataset of hand-written digits. We designed the task pattern association rule to emulate summation, which differs from simple match or non-match tasks. Specifically, to keep the dimensionality of the task low, we use two different stimuli, resulting in 3 potential task outcomes (for summation).
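As an illustration of the trial structure, the following is a minimal numpy sketch, not the authors' pipeline: the digit means `MU` are placeholders standing in for the VAE latent means, the within-stimulus noise level is hypothetical, and the interval lengths follow the values given later under Simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T_STIM, T_DELAY, T_RESP = 100, 50, 50                      # interval lengths (steps)
MU = {0: np.array([-1.0, 0.5]), 1: np.array([0.8, -0.6])}  # placeholder latent means
F_O = {0: 0.5, 1: 1.0, 2: 1.5}                             # encoded summation outcomes

def make_trial(d1, d2, noise_sd=0.1):
    """Return the 2-D input u(t) and the delay/response targets for one trial."""
    stim = lambda d: MU[d] + noise_sd * rng.standard_normal((T_STIM, 2))
    delay, resp = np.zeros((T_DELAY, 2)), np.zeros((T_RESP, 2))
    u = np.vstack([stim(d1), delay, stim(d2), delay, resp])
    targets = {"f_d1": MU[d1],         # latent target during delay 1
               "f_d2": MU[d2],         # latent target during delay 2
               "f_o": F_O[d1 + d2]}    # scalar output target during response
    return u, targets

u, targets = make_trial(1, 0)          # u.shape == (350, 2), targets["f_o"] == 1.0
```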
Recurrent network model.

We considered recurrent networks composed of N nonlinear firing-rate units specified by

τ ẋ(t) = −x(t) + J r(t)    (1)

where x ∈ R^N is the state vector and r(t) = tanh(x(t)) denotes the activity obtained by applying the hyperbolic tangent nonlinearity to the neuronal state variables. We set the network time constant τ = 1 for simplicity. Here, J is the (untrained) synaptic connectivity matrix with elements drawn randomly from a Gaussian distribution, i.e., J_ij ∼ N(0, σ_J²). Specifically, we parameterize σ_J² = g²/N, so that g controls the strength of synaptic interactions.
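For concreteness, a minimal numpy sketch of equation (1), assuming Euler integration with an illustrative step size (the paper's own implementation uses TensorFlow; the g and dt values here are examples, not the trained settings):

```python
import numpy as np

rng = np.random.default_rng(1)
N, g, tau, dt = 1000, 1.4, 1.0, 0.1           # dt is an assumed value
J = rng.normal(0.0, g / np.sqrt(N), (N, N))   # J_ij ~ N(0, g^2 / N)

x = 0.1 * rng.standard_normal(N)              # initial state
for _ in range(500):
    r = np.tanh(x)                            # firing rates r(t)
    x = x + (dt / tau) * (-x + J @ r)         # Euler step of tau*dx/dt = -x + J r
```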
Optimization method.

Recurrent networks with fully random connectivity as in equation (1) have a rich dynamical repertoire and thus are capable of generating complex temporal patterns that are commensurate with spontaneous cortical activity [12, 41]. To make these networks learn the function of interest and thus perform the task, we first define two variables decoded from the network activity:

z_o(t) = W_o^T r(t),  z_d(t) = W_d^T r(t)    (2)

where z_o(t) is a network output for generating responses, while z_d(t) is a low-dimensional latent variable that is linearly decoded from neural firing rate activity, i.e., z_d(t) = (z_d1(t), z_d2(t)). In our network, invariant memory representations will be formed in this latent space. Optimization/learning proceeds by modifying the projection vectors W_o ∈ R^{N×1} and W_d ∈ R^{N×2}. The network output z_o(t) and dummy output z_d(t) are fed back to the network via feedback weights W_f ∈ R^{N×1} and W_fd ∈ R^{N×2}, respectively. This results in modified synaptic connectivity:

ẋ(t) = −x(t) + (J + W_f W_o^T + W_fd W_d^T) r(t) + W_i u(t)    (3)

The elements of W_f and W_fd are drawn independently from Gaussian distributions with zero mean and variance σ_f. The network receives the exogenous input (i.e., stimulus) u(t) ∈ R^2 via input weights W_i ∈ R^{N×2} (see Fig. 1b). This strategy effectively modifies the initial connectivity by addition of a low-rank component, allowing for more interpretable relations between the overall network connectivity and function [35, 42]. Note that a minimal, i.e., rank 1, perturbation could be used, but it is known to induce high correlations between emergent fixed points, thus restricting the potential range of emergent dynamics [35, 43]. Hence, to allow for a potentially wide range of solutions, we used a random connectivity plus rank 3 structure for the SPM task.

In our framework, optimization occurs only during the relevant temporal intervals in which target signals are defined (Fig. 1a), which we term a temporally restricted error kernel. When applying this kernel, the total error derived for a given trial is

E(t) = (1/2) ∫_{T_d} e_d(t) dt + (1/2) ∫_{T_r} e_o(t) dt    (4)

where T_d and T_r are the temporal epochs associated with the two delay periods and the response period, respectively (Fig. 1a). Here,

e_d(t) = ‖z_d(t) − f_d‖,    (5)

where z_d = (z_d1, z_d2) and f_d = (f_d1, f_d2), and

e_o(t) = ‖z_o(t) − f_o‖.    (6)

Here, f_d1, f_d2 and f_o are scalar real numbers chosen prior to optimization to represent the 2-dimensional stimulus and the trial outcome. During the delay intervals in particular, a low error thus implies that the neural activity linearly maps to a constant, invariant representation (i.e., z_d ∈ R^2). Activity during temporal epochs outside of these periods does not impact the error. Optimization proceeds by modifying the readout weights W_o and W_d to minimize these errors. Within the temporal error kernel, we deploy the FORCE method for parametric regression in RNNs. Here, W_o and W_d are updated using recursive least squares. Briefly, to reduce e_o(t), we obtain

W_o(t) = W_o(t − Δt) − e_o(t) P(t) r(t)
P(t) = P(t − Δt) − [P(t − Δt) r(t) r^T(t) P(t − Δt)] / [1 + r^T(t) P(t − Δt) r(t)]    (7)

where P(t) denotes the approximate estimate of the inverse of the correlation matrix of network activities with a regularization term:

P(t) = [ ∫ r(t) r^T(t) dt + α I_N ]^{−1}    (8)

where α is the regularization parameter and I_N the identity matrix. In the same manner, to reduce e_d(t) we have

W_d(t) = W_d(t − Δt) − e_d(t) P(t) r(t).    (9)

Note that we update the associated inverse correlation matrices during the training intervals T_d and T_r (shown in Fig. 1a). In total, our training paradigm is a temporally regularized FORCE method that mitigates overfitting and in turn allows a potentially broader range of dynamical solutions to manifest. Indeed, it is known that optimizing RNNs using FORCE for a sequential trial-based task (here, a pattern association task with memory requirements) prevents the emergence of multiple fixed points in optimized networks, and thus can overly constrain the range of possible solution dynamics.
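The recursive-least-squares updates in equations (7)-(9) can be sketched as follows. This is an illustrative numpy version, not the authors' released implementation: it is written with a vector error z − f rather than the scalar norms above, and it is assumed to be called only at update times inside T_d (for W_d) or T_r (for W_o).

```python
import numpy as np

def rls_step(W, P, r_t, f_t):
    """One FORCE/RLS update of a readout W (N x k) with inverse-correlation P (N x N)."""
    e = W.T @ r_t - f_t                          # error of the decoded output, eq. (2)
    Pr = P @ r_t
    P -= np.outer(Pr, Pr) / (1.0 + r_t @ Pr)     # update of P(t), eq. (7)
    W -= np.outer(P @ r_t, e)                    # update of W, eqs. (7)/(9)
    return W, P

# initialisation as in Methods: alpha = 1, readouts start at zero
N, alpha = 1000, 1.0
W_d, P_d = np.zeros((N, 2)), np.eye(N) / alpha   # latent readout, updated during T_d
W_o, P_o = np.zeros((N, 1)), np.eye(N) / alpha   # response readout, updated during T_r
```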
Dynamical systems analysis.

The central theoretical question in our study pertains to analyzing the dynamics of our optimized networks. A first-order question in this regard is to elucidate the landscape of attractors manifest in the network's vector field. In equation (1), the origin is always a fixed point, associated with the Jacobian matrix J (shifted by −I). To study the stability of the origin, we can thus consider the eigenvalues of the connectivity matrix. It is well known that the eigenvalues of a random connectivity matrix J are distributed over a disk of radius g as N → ∞ [41, 44]. Thus, the stability of the origin varies with these parameters. For g < 1 the origin is asymptotically stable, while for g > 1 the origin is unstable, suggestive of potentially chaotic dynamics in the overall network. For the optimized networks with the rank 3 structure, the Jacobian matrix at the origin is J_T = J + W_f W_o^T + W_fd W_d^T.

Understanding the location and stability of fixed points away from the origin is harder to ascertain analytically. Hence, we rely on a number of numerical procedures to identify these points. To locate stable fixed points used for task computations, we arrest trials at relevant time moments, then forward simulate to ascertain the asymptotic behavior of the network. In one set of simulations, this forward simulation is carried out for trials arrested at the end of the first delay period. In a second set of simulations, it is carried out after trial conclusion. The forward simulation is carried out for ten times the nominal trial length, at which time we assume the network state is in a stationary regime, i.e., within an ε distance of either a stable fixed point or a limit cycle. We can perform additional linearization about stable fixed points that are discovered numerically in this way. Here, the eigenvalue spectrum of the Jacobian matrix, Q, at these non-zero fixed points, denoted x*, is obtained from

Q = (J + W_f W_o^T + W_fd W_d^T) R    (10)

where R is a diagonal matrix with elements R_ij = δ_ij r_i and

r_i = 1 − tanh²(x*_i).    (11)

Note that if the states are largely saturated at a fixed point (as in the case of DFP encoding, Fig. 4c), then the entries of r are very small, which contracts the spectrum of Q.
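A short sketch of this linearization follows (illustrative numpy; the random matrix below merely stands in for a trained J_T, and x* for a numerically located fixed point):

```python
import numpy as np

def fixed_point_spectrum(J_T, x_star):
    """Eigenvalues of Q = J_T R at x*, with R = diag(1 - tanh(x*)^2), eqs. (10)-(11).

    The linearization of tau*dx/dt = -x + J_T*tanh(x) has eigenvalues eig(Q) - 1,
    so modes with Re(eig(Q)) near 1 decay slowly (slow manifolds).
    """
    R = np.diag(1.0 - np.tanh(x_star) ** 2)
    return np.linalg.eigvals(J_T @ R)

rng = np.random.default_rng(2)
N = 200
J_T = rng.normal(0.0, 1.2 / np.sqrt(N), (N, N))   # stand-in for a trained connectivity
evals = fixed_point_spectrum(J_T, x_star=0.5 * rng.standard_normal(N))
```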
Network connectivity analysis.

To link network connectivity properties with the identified mechanisms, we first sorted neurons based on the time at which they reach their peak activity in each task trial. Then, we sorted the elements of the optimized connectivity matrix, i.e., J_T, using the same ordered sequence of neurons. We calculated the average pre-synaptic (incoming) connection strength, i.e., J̄_i = (1/N) Σ_j J_T,ij (Fig. 4e). Moreover, we performed an analysis suggested in prior work: we computed the average of the diagonal and successive off-diagonal elements of the sorted connectivity matrix, i.e., elements with |i − j| = c, c ∈ {0, ..., N}, which indicates the strength of reciprocal coupling between neurons as a function of the temporal distance between their peak activations (Fig. 4f).
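The sorting and averaging steps above can be written compactly. This is an illustrative numpy sketch, not the released analysis code; `rates` is assumed to be a (T, N) array of unit activity over one trial and `J_T` the optimized connectivity matrix.

```python
import numpy as np

def connectivity_by_peak_order(J_T, rates):
    """Average incoming strength (Fig. 4e) and coupling vs. peak-time distance (Fig. 4f)."""
    order = np.argsort(rates.argmax(axis=0))          # neurons sorted by time of peak
    Js = J_T[np.ix_(order, order)]                    # connectivity in sorted order
    presyn = Js.mean(axis=1)                          # mean incoming weight per neuron
    N = Js.shape[0]
    coupling = [Js.diagonal().mean()]                 # |i - j| = 0
    for c in range(1, N):                             # |i - j| = c, both directions
        coupling.append(np.concatenate([np.diag(Js, c), np.diag(Js, -c)]).mean())
    return presyn, np.array(coupling)
```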
Simulation parameters.

In the task, we set the stimulus interval to 100 time steps, the delay intervals to 50 time steps (except for the extended delay experiments) and the response interval to 50 time steps. To generate the sequential bivariate random process stimuli, we trained a variational autoencoder on the MNIST digits 1 and 0. We encoded the summation rule outcomes (i.e., 0, 1, 2) as 0.5, 1 and 1.5 (i.e., different values of f_o in equation (6)), respectively, for training the networks. We encoded the latent outputs f_d1 and f_d2 as the two-dimensional mean vector of each digit representation, whenever that digit appeared prior to the delay period being optimized.

For all simulations the value of α was initialized to 1 and P was initialized to the identity matrix. The number of neurons was set to N = 1000 and the elements of W_o and W_d were initialized to zero. Input weights were drawn randomly from a zero-mean Gaussian distribution with variance 0.02. We used a fixed Euler integration time step, dt. During the training intervals, shown in Fig. 1a, we updated the weights every 2 time steps. For four different initialization seeds we considered all possible combinations of feasible values for g, σ_f and sparsity (Fig. 6). The value of σ_f was chosen proportional to the size of the network, i.e., σ_f took one of eight values scaled by 1/N. Training was terminated if the average root mean squared error between target and output was less than 0.01 for all trials.

For the exemplar networks used in Figures 3a,c and 4a,b,d-f: for type DFP, σ_f = 1 and sparsity 0.2; for IFP, LC and Mix, sparsity 0.1. The initialization seed is the same for these four exemplar networks.
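A hedged sketch of this initialization follows (numpy for brevity; the specific g, σ_f and dt values shown are assumptions standing in for entries of the swept ranges, and the multiplicative mask is one plausible reading of "sparsity of J"):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
g, sigma_f, sparsity = 1.5, 1.0 / N, 0.1              # assumed example values
J = rng.normal(0.0, g / np.sqrt(N), (N, N))           # J_ij ~ N(0, g^2 / N)
J *= rng.random((N, N)) < (1.0 - sparsity)            # zero out a 'sparsity' fraction
W_i = rng.normal(0.0, np.sqrt(0.02), (N, 2))          # input weights, variance 0.02
W_f = rng.normal(0.0, np.sqrt(sigma_f), (N, 1))       # feedback of z_o
W_fd = rng.normal(0.0, np.sqrt(sigma_f), (N, 2))      # feedback of z_d
W_o, W_d = np.zeros((N, 1)), np.zeros((N, 2))         # readouts initialized to zero
```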
Code availability.

We performed simulations using Python and TensorFlow. All training and analysis code is available on GitHub (https://github.com/Lham71/Working-Memory-Modeling).

Acknowledgements
This work was partially supported by grants R01EB028154 and 1653589 from the National Institutes of Health and the National Science Foundation, respectively. We gratefully acknowledge Dr. Larry Snyder and Charles Holmes for participating in discussions related to this work and for reviewing a preliminary version of this manuscript.

Figure 1:
Optimizing RNN using modified FORCE to perform a sequential pattern matching (SPM) task. a,
A single trial of a SPM task, wherein low-dimensional random process representations of handwritten digit stimuli are followed by short delay intervals. The network is optimized to generate the correct 'summation' output during a prescribed response interval. b, Schematic diagram of the RNN architecture and the low-rank structure added to the initial connectivity J. After the network receives input trials sequentially via input weights, it encodes the memory representation z_d(t) and generates the task output z_o(t). We use a rank 2 structure for encoding memory and a rank 1 structure for generating the response. c, Tonic and phasic activity for two different networks. Activity patterns (normalized) of neurons are sorted by the time of their peak value.
Figure 2:
Forward simulation of the network after the delay to identify distinct dynamical mechanisms underlying WM. a, Direct Fixed Point encoding (DFP), where the network uses fixed points to encode memory representations of each stimulus. b, Indirect Fixed Point encoding (IFP), where the network asymptotically settles at a fixed point, but this fixed point does not correspond to a memory representation. c, Limit Cycle (LC), where the network asymptotically approaches a stable limit cycle attractor.
Figure 3:
Attractor landscape for optimized networks. a,
Eigenvalue spectrum (for an exemplar DFP network). The gray circle shows the radius of the theoretical eigenvalue spectrum of the initial connectivity matrix J. After optimization, a set of outliers emerges in the eigenvalue spectrum. Here, the initial connectivity matrix is the Jacobian at the origin (shifted by −I). b, Categorization of the four distinct mechanisms along key properties of the network Jacobian evaluated at the origin. c, Attractor landscape and trajectories of exemplar task trials. Three-dimensional neural trajectories are obtained by applying Principal Component Analysis (PCA) to the 1000-dimensional neural activity from networks of each dynamical mechanism. In DFP, the network creates 4 stable fixed points to solve the SPM task. For the displayed trajectory, the network uses two fixed points (shown in yellow) to directly encode the memory and trial output (the inset shows the area inside the circle). In IFP, the memory representation and trial output are encoded along the slow manifold of the single fixed point in the state space. In LC, the trajectories approach a stable limit cycle. For the mixed mechanism, both a stable fixed point and a limit cycle are observed.

Figure 4:
Eigenvalue spectrum at task fixed points and connectivity characterization. a,
Eigenvalue spectrum of the Jacobian matrix at the single non-zero stable fixed point of IFP (shown in Fig. 2b). b, Eigenvalue spectra of the Jacobian matrix at the memory fixed points of DFP (shown in Fig. 2a). c, Saturation ratio (the ratio of neurons with activity in the saturated range of the activation function during the memory interval, averaged over all trials) for all networks simulated (across all four mechanisms). Standard error of the mean is depicted. d, Distribution of connectivity matrix entries (i.e., weights) before and after training for DFP (top panel) and IFP (bottom panel). e, Average pre-synaptic (incoming connection) strength sorted by peak activation of neurons (as in Fig. 1c) for DFP and IFP, respectively. f, Comparison of mean and variance of elements of the task connectivity matrix based on the temporal distance between neurons. For IFP (bottom panel) temporally adjacent neurons are more tightly coupled and a peak can be observed (the inset shows this peak; i, j denote neuron indices).
Figure 5:
Functional advantages/disadvantages of each mechanism type. a,
Comparison of activity patterns before and after increasing memory demand for DFP (top panel) and IFP (bottom panel). b, Summary of deviation from the correct pattern of activity across different values of extended delay for all optimized networks. The squared error shows the difference between correct and deviated trial outputs, averaged over all trials and associated networks. c, Summary of deviation from the correct pattern of activity across different values of noise variance for all optimized networks. b and c show that the IFP, LC and mixed mechanisms are forgetful, but robust to sizable perturbations.
Figure 6:
The effect of parameters prior to optimization on the diversity of the emergent solutions.
Relative count shows the number of trainable networks divided by the total number of networks for each specified value of the parameters. Transparent bars show the relative count of trainable networks and the inner bars show the corresponding emergent types for each specified value of the parameters. a, The strength of connections within the initial connectivity matrix J. b, The variance of the feedback weights. c, The sparsity of J.
References

1. N. Cowan, What are the differences between long-term, short-term, and working memory?, Progress in brain research 169 (2008) 323–338.
2. R. Chaudhuri, I. Fiete, Computational principles of memory, Nature neuroscience 19 (3) (2016) 394.
3. R. Romo, C. D. Brody, A. Hernández, L. Lemus, Neuronal correlates of parametric working memory in the prefrontal cortex, Nature 399 (6735) (1999) 470–473.
4. S. Funahashi, C. J. Bruce, P. S. Goldman-Rakic, Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex, Journal of neurophysiology 61 (2) (1989) 331–349.
5. C. D. Harvey, P. Coen, D. W. Tank, Choice-specific sequences in parietal cortex during a virtual-navigation decision task, Nature 484 (7392) (2012) 62–68.
6. E. K. Miller, M. Lundqvist, A. M. Bastos, Working memory 2.0, Neuron 100 (2) (2018) 463–475.
7. M. R. Riley, C. Constantinidis, Role of prefrontal persistent activity in working memory, Frontiers in systems neuroscience 9 (2016) 181.
8. X.-J. Wang, Synaptic reverberation underlying mnemonic persistent activity, Trends in neurosciences 24 (8) (2001) 455–463.
9. A. Compte, N. Brunel, P. S. Goldman-Rakic, X.-J. Wang, Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model, Cerebral cortex 10 (9) (2000) 910–923.
10. X. Zhang, H. Yi, W. Bai, X. Tian, Dynamic trajectory of multiple single-unit activity during working memory task in rats, Frontiers in computational neuroscience 9 (2015) 117.
11. J. D. Murray, A. Bernacchia, N. A. Roy, C. Constantinidis, R. Romo, X.-J. Wang, Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex, Proceedings of the National Academy of Sciences 114 (2) (2017) 394–399.
12. O. Barak, D. Sussillo, R. Romo, M. Tsodyks, L. Abbott, From fixed points to chaos: three models of delayed discrimination, Progress in neurobiology 103 (2013) 214–222.
13. W. Chaisangmongkon, S. K. Swaminathan, D. J. Freedman, X.-J. Wang, Computing by robust transience: how the fronto-parietal network performs sequential, category-based decisions, Neuron 93 (6) (2017) 1504–1517.
14. J. Zylberberg, B. W. Strowbridge, Mechanisms of persistent activity in cortical circuits: possible neural substrates for working memory, Annual review of neuroscience 40.
15. O. Barak, M. Tsodyks, Working models of working memory, Current opinion in neurobiology 25 (2014) 20–24.
16. G. Deco, E. T. Rolls, Attention and working memory: a dynamical model of neuronal activity in the prefrontal cortex, European Journal of Neuroscience 18 (8) (2003) 2374–2390.
17. S. Druckmann, D. B. Chklovskii, Neuronal circuits underlying persistent representations despite time varying activity, Current Biology 22 (22) (2012) 2095–2103.
18. E. Spaak, K. Watanabe, S. Funahashi, M. G. Stokes, Stable and dynamic coding for working memory in primate prefrontal cortex, Journal of Neuroscience 37 (27) (2017) 6503–6516.
19. G. Bondanelli, S. Ostojic, Coding with transient trajectories in recurrent neural networks, PLoS computational biology 16 (2) (2020) e1007655.
20. J. C. Park, J. W. Bae, J. Kim, M. W. Jung, Dynamically changing neuronal activity supporting working memory for predictable and unpredictable durations, Scientific reports 9 (1) (2019) 1–10.
21. R. H. Bauer, J. M. Fuster, Delayed-matching and delayed-response deficit from cooling dorsolateral prefrontal cortex in monkeys, Journal of comparative and physiological psychology 90 (3) (1976) 293.
22. G. M. Hoerzer, R. Legenstein, W. Maass, Emergence of complex computational structures from chaotic neural networks through reward-modulated Hebbian learning, Cerebral cortex 24 (3) (2014) 677–690.
23. P. Enel, E. Procyk, R. Quilodran, P. F. Dominey, Reservoir computing properties of neural dynamics in prefrontal cortex, PLoS computational biology 12 (6) (2016) e1004967.
24. R. Pascanu, H. Jaeger, A neurodynamical model for working memory, Neural networks 24 (2) (2011) 199–207.
25. N. Maheswaranathan, A. Williams, M. Golub, S. Ganguli, D. Sussillo, Universality and individuality in neural dynamics across large populations of recurrent networks, in: Advances in neural information processing systems, 2019, pp. 15629–15641.
26. H. F. Song, G. R. Yang, X.-J. Wang, Training excitatory-inhibitory recurrent neural networks for cognitive tasks: a simple and flexible framework, PLoS computational biology 12 (2) (2016) e1004792.
27. T. Nachstedt, C. Tetzlaff, Working memory requires a combination of transient and attractor-dominated dynamics to process unreliably timed inputs, Scientific reports 7 (1) (2017) 1–14.
28. P. Enel, J. D. Wallis, E. L. Rich, Stable and dynamic representations of value in the prefrontal cortex, eLife 9 (2020) e54313.
29. S. E. Cavanagh, J. P. Towers, J. D. Wallis, L. T. Hunt, S. W. Kennerley, Reconciling persistent and dynamic hypotheses of working memory coding in prefrontal cortex, Nature communications 9 (1) (2018) 1–16.
30. A. E. Orhan, W. J. Ma, A diverse range of factors affect the nature of neural representations underlying short-term memory, Nature neuroscience 22 (2) (2019) 275–283.
31. D. Sussillo, M. M. Churchland, M. T. Kaufman, K. V. Shenoy, A neural network that finds a naturalistic solution for the production of muscle activity, Nature neuroscience 18 (7) (2015) 1025–1033.
32. G. R. Yang, M. R. Joglekar, H. F. Song, W. T. Newsome, X.-J. Wang, Task representations in neural networks trained to perform many cognitive tasks, Nature neuroscience 22 (2) (2019) 297–306.
33. D. Sussillo, L. F. Abbott, Generating coherent patterns of activity from chaotic neural networks, Neuron 63 (4) (2009) 544–557.
34. G. R. Yang, X.-J. Wang, Artificial neural networks for neuroscientists: A primer, arXiv preprint arXiv:2006.01001.
35. F. Schuessler, A. Dubreuil, F. Mastrogiuseppe, S. Ostojic, O. Barak, Dynamics of random recurrent networks with correlated low-rank structure, Physical Review Research 2 (1) (2020) 013111.
36. J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the national academy of sciences 79 (8) (1982) 2554–2558.
37. N. Maheswaranathan, A. Williams, M. Golub, S. Ganguli, D. Sussillo, Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics, in: Advances in neural information processing systems, 2019, pp. 15696–15705.
38. D. B. Rubin, S. D. Van Hooser, K. D. Miller, The stabilized supralinear network: a unifying circuit motif underlying multi-input integration in sensory cortex, Neuron 85 (2) (2015) 402–417.
39. J. X. Wang, Z. Kurth-Nelson, D. Kumaran, D. Tirumala, H. Soyer, J. Z. Leibo, D. Hassabis, M. Botvinick, Prefrontal cortex as a meta-reinforcement learning system, Nature neuroscience 21 (6) (2018) 860–868.
40. K. Sakai, Task set and prefrontal cortex, Annu. Rev. Neurosci. 31 (2008) 219–245.
41. H. Sompolinsky, A. Crisanti, H.-J. Sommers, Chaos in random neural networks, Physical review letters 61 (3) (1988) 259.
42. F. Mastrogiuseppe, S. Ostojic, Linking connectivity, dynamics, and computations in low-rank recurrent neural networks, Neuron 99 (3) (2018) 609–623.
43. C. Beer, O. Barak, One step back, two steps forward: interference and learning in recurrent neural networks, Neural Computation 31 (10) (2019) 1985–2003.
44. K. Rajan, L. F. Abbott, Eigenvalue spectra of random matrices for neural networks, Physical review letters 97 (18) (2006) 188104.
45. K. Rajan, C. D. Harvey, D. W. Tank, Recurrent network models of sequence generation and memory, Neuron 90 (1) (2016) 128–142.

Supplementary Information
Figure 1:
RNN outputs after optimization for performing SPM task
Several concatenated trials of the task, with the outputs z_d = (z_d1, z_d2) and z_o shown. Using the modified FORCE method, the network generates memory encodings during T_d and the trial output during the response intervals, T_r (shaded area).

Figure 2:
Flow chart
Categorization of dynamical mechanisms based on (i) the type of asymptotic attractor (either fixed point or limit cycle) and (ii) whether delay periods correspond to a fixed point.

Figure 3:
Neural activities and associated dynamical landscape for a single trial
For the same exemplar networks as in Fig. 3c, but a different set of stimuli (i.e., here, two realizations of the same digit are presented), neural activity and associated low-dimensional (PCA) trajectories are plotted. Note that the PCA components are obtained for each exemplar network individually. The trajectories are color coded using the same scheme as the color bar at the top. In DFP, the network creates four stable fixed points to solve the SPM task (the inset shows the area inside the circle). For the displayed trajectory, the network uses two fixed points (shown in yellow) to represent memory and trial output. In IFP, the memory representation and trial output are encoded along the slow manifold of the single fixed point in the state space. In LC, the trajectories approach a stable limit cycle. For the mixed mechanism, both a stable fixed point and a limit cycle can be seen.

Figure 4:
Neural trajectories of perturbed and salient trials