Implementing Inductive bias for different navigation tasks through diverse RNN attractors
Tie Xu, Omri Barak
Rappaport Faculty of Medicine and Network Biology Research Laboratory
Technion, Israel Institute of Technology
Haifa, 320003, Israel
[email protected], [email protected]

Abstract
Navigation is crucial for animal behavior and is assumed to require an internal representation of the external environment, termed a cognitive map. The precise form of this representation is often considered to be a metric representation of space. An internal representation, however, is judged by its contribution to performance on a given task, and may thus vary between different types of navigation tasks. Here we train a recurrent neural network that controls an agent performing several navigation tasks in a simple environment. To focus on internal representations, we split learning into a task-agnostic pre-training stage that modifies internal connectivity and a task-specific Q-learning stage that controls the network's output. We show that pre-training shapes the attractor landscape of the networks, leading to either a continuous attractor, discrete attractors, or a disordered state. These structures induce bias onto the Q-learning phase, leading to a performance pattern across the tasks corresponding to metric and topological regularities. By combining two types of networks in a modular structure, we could get better performance for both regularities. Our results show that, in recurrent networks, inductive bias takes the form of attractor landscapes, which can be shaped by pre-training and analyzed using dynamical systems methods. Furthermore, we demonstrate that non-metric representations are useful for navigation tasks, and their combination with metric representations leads to flexible multiple-task learning.
1 Introduction
Spatial navigation is an important task that requires a correct internal representation of the world, and thus its mechanistic underpinnings have attracted the attention of scientists for a long time (O'Keefe & Nadel, 1978). A standard tool for navigation is a Euclidean map, and this naturally leads to the hypothesis that our internal model is such a map. Artificial navigation also relies on SLAM (simultaneous localization and mapping), which is based on maps (Kanitscheider & Fiete, 2017a). On the other hand, both from an ecological view and from a pure machine learning perspective, navigation is firstly about reward acquisition, while exploiting the statistical regularities of the environment. Different tasks and environments lead to different statistical regularities. Thus it is unclear which internal representations are optimal for reward acquisition.

We take a functional approach to this question by training recurrent neural networks for navigation tasks with various types of statistical regularities. Because we are interested in internal representations, we opt for a two-phase learning scheme instead of end-to-end learning. Inspired by the biological phenomena of evolution and development, we first pre-train the networks to emphasize several aspects of their internal representation. Following pre-training, we use Q-learning to modify the network's readout weights for specific tasks while maintaining its internal connectivity.

We evaluate the performance of different networks on a battery of simple navigation tasks with different statistical regularities and show that the internal representations of the networks manifest in differential performance according to the nature of the tasks. The link between task performance and network structure is understood by probing the networks' dynamics, exposing a low-dimensional manifold of slow dynamics in phase space, which is clustered into three major categories: continuous attractor, discrete attractors, and unstructured chaotic dynamics. The different network attractors encode different priors, or inductive biases, for specific tasks, corresponding to metric or topological invariances in the tasks. By combining networks with different inductive biases we could build a modular system with improved multiple-task learning.

Overall, we offer a paradigm which shows how the dynamics of recurrent networks implement different priors for environments. Pre-training, which is agnostic to specific tasks, can lead to dramatic differences in the network's dynamical landscape and affect reinforcement learning of different navigation tasks.

2 Related Work
Several recent papers used a functional approach for navigation (Cueva & Wei, 2018; Kanitscheider & Fiete, 2017b; Banino et al., 2018). These works, however, consider the position as the desired output, by assuming that it is the relevant representation for navigation. These works successfully show that the recurrent network agent can solve the neural SLAM problem, and that this can result in units of the network exhibiting similar response profiles to those found in neurophysiological experiments (place and grid cells). In our case, the desired behavior was to obtain the reward, and not to report the current position.

Another recent approach did define reward acquisition as the goal, by applying deep RL directly to navigation problems in an end-to-end manner (Mirowski et al., 2016). The navigation tasks relied on rich visual cues, which allowed evaluation in a state-of-the-art setting. This richness, however, can hinder the greater mechanistic insights that can be obtained from the systematic analysis of toy problems; accordingly, the focus of these works is on performance.

Our work is also related to recent works in neuroscience that highlight the richness of neural representations for navigation, beyond Euclidean spatial maps (Hardcastle et al., 2017; Wirth et al., 2017).

Our pre-training is similar to unsupervised training followed by supervised training (Erhan et al., 2010). In the past few years, end-to-end learning has become the more dominant approach (Graves et al., 2014; Mnih et al., 2013). We highlight the ability of a pre-training framework to manipulate network dynamics and the resulting internal representations, and study their effect as inductive bias.
3 Results
3.1 Task Definition
Navigation can be described as taking advantage of spatial regularities of the environment to achieve goals. This view naturally leads to considering a cognitive map as an internal model of the environment, but leaves open the question of precisely which type of map is to be expected. To answer this question, we systematically study both a space of networks, emphasizing different internal models, and a space of tasks, emphasizing different spatial regularities. To allow a systematic approach, we design a toy navigation problem, inspired by the Morris water maze (Morris, 1981). An agent is placed in a random position in a discretized square arena (size 15), and has to locate the reward location (yellow square, Fig. 1A), while only receiving input (empty/wall/reward) from the 8 neighboring positions. The reward is placed in one of two possible locations in the room according to an external context signal, and the agent can move in one of the four cardinal directions. At every trial, the agent is placed in a random position in the arena, and the network's internal state is randomly initialized as well. The platform location is constant across trials for each context (see Methods).

The agent is controlled by an RNN that receives the proximal sensory input, as well as a feedback of its own chosen action (Fig. 1B). The network's output is a value for each of the 4 possible actions, the maximum of which is chosen to update the agent's position. We use a vanilla RNN (see Appendix for LSTM units) described by:

h_{t+1} = (1 - 1/τ) h_t + (1/τ) tanh(W h_t + W_i f(z_t) + W_a A_t + W_c C_t)    (1)

Q(h_t) = W_o h_t + b_o    (2)

where h_t is the activity of neurons in the network (512 neurons as default), W is the connectivity matrix, and τ is the timescale of the update. The sensory input f(z_t) is fed through the connection matrix W_i, and action feedback is fed through W_a. The context signal C_t is fed through the matrix W_c. The network outputs a Q function, which is computed by a linear transformation of its hidden state.

Beyond the basic setting (Fig. 1A), we design several variants of the task to emphasize different statistical regularities (Fig. 1C). In all cases, the agent begins from a random position and has to reach the context-dependent reward location in the shortest time using only proximal input. The "Hole" variant introduces a random placement of obstacles (different numbers and positions) in each trial. The "Bar" variant introduces a horizontal or vertical bar in random positions in each trial. The various "Scale" tasks stretch the arena in the horizontal or vertical direction while maintaining the relative position of the rewards. The "Implicit context" task is similar to the basic setting, but the external context input is eliminated; instead, the color of the walls indicates the reward position. For all these tasks, the agent needs to find a strategy that tackles the uncertain elements to achieve the goals. Despite the simple setting of the game, the tasks are not trivial, due to identical visual inputs in most of the locations and various uncertain elements adding to the task difficulty.

Figure 1: Navigation task and network architecture. (A)
Basic task setting. The agent begins in a random position and has to locate the reward, which is in a context-dependent location. Input is only provided from the 8 neighboring cells. (B)
The agent is controlled by an RNN that receives input from proximal visual stimuli and its action feedback. An external context is provided to indicate which of two possible reward locations is active. (C)
Tasks used: Basic; random obstacles placed in each trial (either holes or bars); scaling the arena in either direction or both; implicit context signal (wall color) instead of external context.
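For concreteness, the update rule of Eqs. (1)-(2) can be written in a few lines. The following is a minimal sketch, not the authors' exact implementation; the input encodings and the 1/√N initialization scale (following the Table 2 convention) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the agent's update rule, Eqs. (1)-(2).
N, N_IN, N_ACT, N_CTX = 512, 8, 4, 2   # hidden units, sensory, actions, contexts
TAU = 2.0                               # time constant (Appendix, Table 1 text)

rng = np.random.default_rng(0)
W   = rng.standard_normal((N, N)) / np.sqrt(N)   # internal connectivity
W_i = rng.standard_normal((N, N_IN)) / np.sqrt(N)  # sensory input weights
W_a = rng.standard_normal((N, N_ACT)) / np.sqrt(N) # action feedback weights
W_c = rng.standard_normal((N, N_CTX)) / np.sqrt(N) # context weights
W_o = np.zeros((N_ACT, N))                         # Q readout, trained later
b_o = np.zeros(N_ACT)

def rnn_step(h, sensory, action_onehot, context):
    """One step of Eq. (1): leaky tanh update of the hidden state."""
    drive = W @ h + W_i @ sensory + W_a @ action_onehot + W_c @ context
    return (1 - 1 / TAU) * h + np.tanh(drive) / TAU

def q_values(h):
    """Eq. (2): linear readout of action values from the hidden state."""
    return W_o @ h + b_o

# One control step: greedy action from the current Q values.
h = rnn_step(np.zeros(N), np.zeros(N_IN), np.zeros(N_ACT), np.array([1.0, 0.0]))
action = int(np.argmax(q_values(h)))
```

Note that only W_o and b_o are touched by the task-specific Q-learning stage described next; pre-training modifies W, W_i and W_a.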
3.2 Training Framework
We aim to understand the interaction between internal representation and the statistical regularities of the various tasks. In principle, this could be accomplished by end-to-end reinforcement learning of many tasks, using various hyper-parameters to allow different solutions to the same task. We opted for a different approach, both due to computational efficiency (see Appendix 5.2) and due to biological motivations. A biological agent acquires navigation ability during evolution and development, which shapes its elementary cognitive abilities such as spatial or object memory. This shaping provides a scaffold upon which the animal can adapt and learn quickly to perform diverse tasks during life. Similarly, we divide learning into two phases: a pre-training phase that is task agnostic, and a Q-learning phase that is task-specific (Fig. 2A). During pre-training we modify the network's internal and input connectivity, while Q-learning only modifies the output.

Pre-training is implemented in an environment similar to the basic task, with an arena size chosen randomly between 10 and 20. The agent's actions are externally determined as a correlated random walk, instead of being internally generated by the agent. Inspired by neurophysiological findings, we emphasize two different aspects of internal representation: landmark memory (identity of the last encountered wall) and position encoding (O'Keefe & Nadel, 1978). We thus pre-train the internal connectivity to generate an ensemble of networks with various hyperparameters that control the relative importance of these two aspects, as well as which parts of the connectivity W, W_a, W_i are modified. We term networks emphasizing the two aspects respectively MemNet and PosNet, and call the naive random network RandNet (Fig. 2A). This is done by stochastic gradient descent on the following objective function:

S = -α Σ_{t=1}^n P̂(z_t) log P(z_t) - β Σ_{t=1}^n Î_t log P(I_t) - γ Σ_{t=1}^n Â_t log P(A_t)    (3)

with z = (x, y) for position, I for landmark memory (identity of the last wall encountered), and A for action. The term on action serves as a regularizer. The three probability distributions are estimated from the hidden states of the RNN, given by:

P(I | h_t) = exp(W_m h_t + b_m) / Σ_m exp(W_m h_t + b_m)    (4)

P(A | h_{t-1}, h_t) = exp(W_a [h_{t-1}, h_t] + b_a) / Σ_a exp(W_a [h_{t-1}, h_t] + b_a)    (5)

P(z | h_t) = exp(-(z - (W_p h_t + b_p))² / σ) / Σ_z exp(-(z - (W_p h_t + b_p))² / σ)    (6)

where W_m, W_p, W_a are readout matrices from hidden states and [h_{t-1}, h_t] denotes the concatenation of the last and current hidden states. Tables 1, 2, 3 in the Appendix show the hyperparameter choices for all networks. The ratio between α and β controls the tradeoff between position and memory. The exact values of the hyperparameters were found through trial and error.
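A minimal PyTorch sketch of this objective follows. The decoder shapes, the discretized position grid, and the squared-distance form inside Eq. (6) are assumptions for illustration; during pre-training, the gradients of this loss would be propagated through the decoders into the connectivity matrices selected in Table 1 (W, W_a, W_i), not only into the decoders themselves.

```python
import torch
import torch.nn.functional as F

# Sketch of the pre-training objective, Eqs. (3)-(6): cross-entropy losses on
# position, landmark memory and action, decoded linearly from hidden states.
N, N_POS, N_MEM, N_ACT = 512, 15 * 15, 4, 4

W_m = torch.nn.Linear(N, N_MEM)      # landmark-memory decoder, Eq. (4)
W_a = torch.nn.Linear(2 * N, N_ACT)  # action decoder on [h_{t-1}, h_t], Eq. (5)
W_p = torch.nn.Linear(N, 2)          # position decoder, Eq. (6)
SIGMA = 1.0

def pretrain_loss(h_prev, h, pos_target, mem_target, act_target, grid,
                  alpha=1.0, beta=0.0, gamma=0.1):
    """Eq. (3); alpha/beta/gamma weight position/memory/action terms.
    grid: (n_positions, 2) tensor of candidate (x, y) coordinates."""
    # Eq. (6): softmax over squared distance to each candidate position.
    pred_xy = W_p(h)                                        # (batch, 2)
    dist2 = ((grid[None] - pred_xy[:, None]) ** 2).sum(-1)  # (batch, n_positions)
    loss_pos = F.cross_entropy(-dist2 / SIGMA, pos_target)
    loss_mem = F.cross_entropy(W_m(h), mem_target)          # Eq. (4)
    loss_act = F.cross_entropy(W_a(torch.cat([h_prev, h], -1)), act_target)  # Eq. (5)
    return alpha * loss_pos + beta * loss_mem + gamma * loss_act
```

With alpha dominant this produces a PosNet-style protocol, and with beta dominant a MemNet-style one; the defaults above are illustrative.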
Having obtained this ensemble of networks, we use a Q-learning algorithm with TD-lambda updates for the network's outputs, which are Q values. We utilize the fact that only the readout matrix W_o is trained to use a recursive least squares method, which allows a fast update of weights for different tasks (Sussillo & Abbott, 2009). This choice leads to a much better convergence speed when compared to stochastic gradient descent. The update rule used is:

W_o(n+1) = W_o(n) - e(n) P(n) H(n)^T    (7)

P(n+1) = (C(n+1) + αI)^{-1}    (8)

C(n+1) = λ C(n) + H(n)^T H(n)    (9)

e(n) = W_o H(n) - Y(n)    (10)

where H is a matrix of hidden states over 120 time steps, αI is a regularizer, and λ controls the forgetting rate of past data.
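In matrix form, with H holding one hidden state per row, Eqs. (7)-(10) amount to a regularized least-squares solve updated recursively between batches of experience. A sketch under that convention (the (T, N) orientation of H is our reading of Eq. (9)):

```python
import numpy as np

# Sketch of the recursive-least-squares readout update, Eqs. (7)-(10).
# H: (T, N) hidden states (T = 120 steps), Y: (T, n_actions) TD-lambda targets.
class RLSReadout:
    def __init__(self, n_hidden, n_out, alpha=1.0, lam=0.99):
        self.W_o = np.zeros((n_out, n_hidden))
        self.C = np.zeros((n_hidden, n_hidden))  # running correlation, Eq. (9)
        self.alpha, self.lam = alpha, lam

    def update(self, H, Y):
        self.C = self.lam * self.C + H.T @ H                          # Eq. (9)
        P = np.linalg.inv(self.C + self.alpha * np.eye(H.shape[1]))   # Eq. (8)
        E = H @ self.W_o.T - Y                                        # Eq. (10)
        self.W_o = self.W_o - (E.T @ H) @ P                           # Eq. (7)
```

Because only W_o changes, the same pre-trained network can be re-fit quickly for each task in the battery.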
We then analyze the test performance of all networks on all tasks (Figure 2B and Table 3 in the Appendix). Figure 2B,C show that there are correlations between different tasks and between different networks. We quantify this correlation structure by performing principal component analysis of the performance matrix. We find that the first two PCs in task space explain most of the variance. The first component corresponds to the difficulty (average performance) of each task, while the coefficients of the second component are informative regarding the nature of the tasks (Fig. 2B, right): Bar (-0.49), Hole (-0.25), Basic (-0.21), Implicit context (-0.12), ScaleX (0.04), ScaleY (0.31), Scale (0.74). We speculate these numbers characterize the importance of two different invariances inherent in the tasks. Negative coefficients correspond to metric invariance. For example, when overcoming dynamic obstacles, the position remains invariant. This type of task was fundamental to establishing metric cognitive maps in neuroscience (O'Keefe & Nadel, 1978). Positive coefficients correspond to topological invariance, defined as the relation between landmarks unaffected by the metric information.

Observing the behavior of networks for the extreme tasks of this axis indeed confirms the speculation. Fig. 3A shows that the successful agent overcomes the scaling task by finding a set of actions that captures the relations between landmarks and reward, thus generalizing to larger arenas. Fig. 3B shows that the successful agent in the bar task uses a very different strategy. An agent that captures the metric invariance can adjust trajectories and reach the reward each time the obstacle is changed. This ability is often related to the ability to use shortcuts (O'Keefe & Nadel, 1978). The other tasks interpolate between the two extremes, due to the presence of both elements in the tasks. For instance, the implicit context task requires the agent to combine landmark memory (color of the wall) with position to locate the reward.

We thus define metric and topological scores by using a weighted average of task performance with the negative and positive coefficients respectively. Fig. 3C shows the various networks measured by the two scores. We see that random networks (blue) can achieve reasonable performance with some hyperparameter choices, but they are balanced with respect to the metric-topological score. On the other hand, PosNet networks are pushed to the metric side and MemNet networks to the topological side. This result indicates that the inductive bias achieved via task-agnostic pre-training is manifested in the performance of networks on various navigation tasks.

Figure 2: Training scheme and performance analysis. (A) Two-stage learning framework. Task-agnostic pre-training of the internal connectivity is done while emphasizing either position decoding (PosNet) or the identity of the last wall (landmark memory, MemNet). Following pre-training, Q-learning of the output is performed for each task. This is also done on networks that were not pre-trained (RandNet). (B)
Example trajectories of PosNet on the basic task, starting from 9 different initial conditions. The numbers are the scores for each trial (see Appendix). (C)
Task performance for all networks on all tasks. The score is an average of trials from all starting positions, where each trial is scored by the time relative to the shortest path, or -1 if the agent fails to reach the reward after 120 steps. Bars on the left are coefficients of the second principal component, corresponding to metric vs. topological tasks. The columns show different realizations of PosNet, RandNet and MemNet. The last column is a modular network introduced in the last section of Results. (D) Correlation between all tasks, showing a clustering into two main groups (metric and topological). Parameters for all networks are in Appendix Tables 1, 2, 3.
Figure 3: Different strategies for different regularities. (A)
A MemNet network solving the scaling task. The agent uses a sequence of landmark-conditioned actions, and thus generalizes to larger arenas. (B)
A PosNet network solving the bar task. The agent appears to understand its metric position, and uses it to move in novel paths towards the reward. (C)
Performance of all networks in all tasks, projected onto the metric and topological scores.

3.3 Linking Representation to Dynamics
What are the underlying structures of different networks that encode the bias for different tasks? We approach this question by noting that RNNs are nonlinear dynamical systems. As such, it is informative to detect fixed points and other special areas of phase space to better understand their dynamics. For instance, a network that memorizes a binary value might be expected to contain two discrete fixed points (Fig. 4A). A network that integrates a continuous value might contain a line attractor (Kim et al., 2017; Kakaria & de Bivort, 2017), and a network that integrates position might contain a plane attractor (a 2D manifold of fixed points), because this would enable updating the x and y coordinates with actions, and maintaining the current position in the absence of action (Burak & Fiete, 2009). Trained networks, however, often converge to approximate fixed points (slow points) (Sussillo & Barak; Mante et al., 2013; Maheswaranathan et al., 2019), as they are not required to maintain the same position for an infinite time. We thus expect the relevant slow points to be somewhere between the actual trajectories and true fixed points. We detect these areas of phase space using adaptations of existing techniques (Appendix 5.3; Sussillo & Barak). Briefly, we drive the agent to move in the environment, while recording its position and last seen landmark (wall). This results in a collection of hidden states. Then, for each point in this collection, we relax the dynamics towards approximate fixed points. This procedure results in points with different hidden state velocities for the three networks (Fig. 4B): RandNet does not seem to have any slow points, while PosNet and MemNet do, with MemNet's points being slower. The resulting manifold of slow points for a typical PosNet is depicted in Figure 4C, along with the labels of position and stimulus from which relaxation began. It is apparent that pre-training has created in PosNet a smooth representation of position along this manifold. The MemNet manifold represents landmark memory as 4 distinct fixed points without a spatial representation. Note that despite the dominance of position representation in PosNet, landmark memory still modulates this representation (Fig. 3A, M), showing that pre-training did not result in a perfect plane attractor, but rather in an approximate collection of 4 plane attractors (Fig. 3D, MP). This conjunctive representation can also be appreciated by considering the decoding accuracy of trajectories conditioned on the number of wall encounters. As the agent encounters the wall, the decoding of position from the manifold improves, implying the ability to integrate path integration and landmark memory (Appendix Fig. 8).

We thus see that the pre-training biases are implemented by distinct attractor landscapes, from which we could see both qualitative differences between networks and a trade-off between landmark memory and position encoding. The continuous attractors of PosNet correspond to a metric representation of space, albeit modulated by landmark memory. The discrete attractors of MemNet encode the landmark memory in a robust manner, while sacrificing position encoding. The untrained RandNet, on the other hand, has no clear structure, and relies on a short transient memory of the last landmark.

Figure 4: Diverse attractor landscapes underlie diverse agent priors. (A) Velocity of points before and after the relaxation procedure.
PosNet and MemNet converged to approximate fixed points (slow points), while RandNet did not. (B)
Illustration of possible slow point structures (purple), along with trajectories around them (red). A pair of points can encode a discrete memory, a line attractor can integrate a single variable, and a plane attractor can integrate two variables (e.g. x, y). (C) Attractor landscape for PosNet and MemNet projected into the first 3 PCs of the hidden state. Coloring is according to either the X, Y coordinates or the identity of the last wall encountered (landmark memory, M). Note how the position is smoothly encoded on the manifold for PosNet, and memory is encoded by four discrete points for MemNet. The memory panels show a fit of a plane to the X, Y coordinates, conditioned upon a given landmark memory, showing that PosNet also has memory information, and not just position. The networks used are 1, 13, 20 from Table 3.

The above analysis was performed on three typical networks and is somewhat time-consuming. In order to get a broader view of internal representations in all networks, we use a simple measure of the components of the representation. Specifically, we drove the agent to move in an arena of infinite size that was empty except for a single wall (of a different identity in each trial). We then used a GLM (generalized linear model) to determine the variance explained by both position and the identity of the wall encountered from the network's hidden state. Figure 5A shows these two measures for all the networks. The results echo those measured with the battery of 7 tasks (Fig. 3C), but are orders of magnitude faster to compute. Indeed, if we correlate these measures with performance on the different tasks, we see that they correspond to the metric-topological axis as defined by PCA (Fig. 5B, compare with Fig. 2B, right).
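This measure is fast because it only requires logging hidden states during a random walk and fitting a linear model. A sketch using ordinary linear regression as the GLM instance (the exact GLM family is not specified here, so this choice is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of the representation measure: how much hidden-state variance is
# (linearly) explained by position vs. by the identity of the last wall.
def variance_explained(H, xy, wall_id):
    """H: (T, N) hidden states; xy: (T, 2) positions; wall_id: (T,) in {0..3}."""
    walls_onehot = np.eye(4)[wall_id]          # categorical regressor
    scores = {}
    for name, X in [("position", xy), ("memory", walls_onehot)]:
        model = LinearRegression().fit(X, H)
        resid = H - model.predict(X)
        scores[name] = 1.0 - resid.var() / H.var()  # fraction of variance explained
    return scores
```

Plotting the two returned scores against each other for every network yields a scatter like Fig. 5A, with PosNet-like networks high on position and MemNet-like networks high on memory.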
Figure 5: Components of the internal representation. (A) Variance of the hidden state explained by position and memory for all networks. Note the clear separation between the different pre-training regimes. The crosses denote networks used in Fig. 4. (B)
Correlation of the two components with performance on all tasks. The strength of the relevant components in the internal representation is predictive of task performance following Q-learning. Note the similarity with the PCA coefficients in Fig. 2B, right.
3.4 A Modular System that Combines Advantages of Both Dynamics
Altogether, we showed a differential ability of networks to cope with different environmental regularities via inductive bias encoded in their dynamics. Considering that this tradeoff is a fundamental property of a single-module RNN, it is natural to ask if we could combine advantages from both dynamics into a modular system. Inspired by Park et al. (2001), we design a hierarchical system composed of a representation layer on the bottom and a selector layer on top. The representation layer concatenates PosNet and MemNet modules together, each evaluating action values according to its own dynamics. The second layer selects the more reliable module based on the combined representation by assigning a value (reliability) to each module. The module with maximal reliability makes the final decision. Thus, the control of the agent shifts between different dynamics according to the current input or history. The modular system significantly shifts the metric-topological balance (Fig. 2C, Fig. 3B). The reliability V is learned similarly to Q (Appendix 5.7).

h^1_{t+1} = (1 - 1/τ) h^1_t + (1/τ) tanh(W_pos h^1_t + W_i f(z_t) + W_a A_t + W_c C_t)    (11)

h^2_{t+1} = (1 - 1/τ) h^2_t + (1/τ) tanh(W_mem h^2_t + W_i f(z_t) + W_a A_t + W_c C_t)    (12)

Q^1(h^1_t) = W^1_o h^1_t + b^1_o    (13)

Q^2(h^2_t) = W^2_o h^2_t + b^2_o    (14)

V(h^1_t, h^2_t) = W_sel [h^1_t, h^2_t] + b_sel    (15)
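At decision time, the system runs both modules in parallel and lets the reliability readout of Eq. (15) arbitrate between them. A minimal sketch of this selection step, reusing the rnn_step convention of the earlier sketch (the function names and greedy arbitration details are ours):

```python
import numpy as np

# Sketch of the modular controller, Eqs. (11)-(15): two pre-trained modules
# (PosNet-like h1, MemNet-like h2) each propose Q values; a learned linear
# readout V picks which module controls the agent at this step.
def modular_action(h1, h2, q1_readout, q2_readout, W_sel, b_sel):
    """h1, h2: hidden states of the two modules, updated per Eqs. (11)-(12)."""
    q1 = q1_readout(h1)                              # Eq. (13)
    q2 = q2_readout(h2)                              # Eq. (14)
    v = W_sel @ np.concatenate([h1, h2]) + b_sel     # Eq. (15): one value per module
    q = q1 if v[0] >= v[1] else q2                   # the more reliable module decides
    return int(np.argmax(q))
```

Note that both modules keep updating their hidden states at every step, so control can shift back and forth as the input and history favor one dynamical regime or the other.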
4 Discussion

Our work explores how internal representations for navigation tasks are implemented by the dynamics of recurrent neural networks. We show that pre-training networks in a task-agnostic manner can shape their dynamics into discrete fixed points or into a low-dimensional manifold of slow points. These distinct dynamical objects correspond to landmark memory and spatial memory respectively. When performing Q-learning for specific tasks, these dynamical objects serve as priors for the network's representations and shift its performance on the various navigation tasks. Here we show that both plane attractors and discrete attractors are useful. It would be interesting to see whether and how other dynamical objects can serve as inductive biases for other domains. In tasks outside of reinforcement learning, for instance, line attractors were shown to underlie network computations (Mante et al., 2013; Maheswaranathan et al., 2019).

An agent that has to perform several navigation tasks will require both types of representations. A single recurrent network, however, has a trade-off between adapting to one type of task or to another. The attractor landscape picture provides a possible dynamical reason for the tradeoff. Position requires a continuous attractor, whereas stimulus memory requires discrete attractors. While it is possible to have four separated plane attractors, it is perhaps easier for learning to converge to one or the other. A different solution for learning multiple tasks is to consider multiple modules, each optimized for a different dynamical regime. We showed that such a modular system is able to learn multiple tasks, in a manner that is more flexible than any single-module network we could train.

Pre-training alters network connectivity. The resulting connectivity is expected to be between random networks (Lukoševičius & Jaeger, 2009) and designed ones (Burak & Fiete, 2009). It is perhaps surprising that even the untrained RandNet can perform some of the navigation tasks using only Q-learning of the readout (with appropriate hyperparameters; see Tables 2, 3 and section 5.7, "Linking Dynamics to Connectivity", in the Appendix). This is consistent with recent work showing that some architectures can perform various tasks without learning (Gaier & Ha, 2019). Studying the connectivity changes due to pre-training may help understand the statistics from which to draw better random networks (Appendix 5.7).

Apart from improving the understanding of representation and dynamics, it is interesting to consider the efficiency of our two-stage learning compared to standard approaches. We found that end-to-end training is much slower, cannot learn topological tasks, and has weaker transfer between tasks (see Appendix 5.2). Thus it is interesting to explore whether this approach could be used to accelerate learning in other domains, similar to curriculum learning (Bengio et al., 2009).
Acknowledgments
OB is supported by the Israeli Science Foundation (346/16) and by a Rappaport Institute Thematic grant. TX is supported by the Key scientific technological innovation research project by the Chinese Ministry of Education, and by the Tsinghua University Initiative Scientific Research Program for computational resources.

References
Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph Modayil, Greg Wayne, Hubert Soyer, Fabio Viola, Brian Zhang, Ross Goroshin, Neil Rabinowitz, Razvan Pascanu, Charlie Beattie, Stig Petersen, Amir Sadik, Stephen Gaffney, Helen King, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, and Dharshan Kumaran. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429-433, 2018. doi: 10.1038/s41586-018-0102-6.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.

Yoram Burak and Ila R. Fiete. Accurate path integration in continuous attractor network models of grid cells. PLoS Computational Biology, 5(2):e1000291, 2009. doi: 10.1371/journal.pcbi.1000291. URL https://dx.plos.org/10.1371/journal.pcbi.1000291.

Christopher J. Cueva and Xue-Xin Wei. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. 2018. URL http://arxiv.org/abs/1803.07770.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625-660, 2010.

Adam Gaier and David Ha. Weight agnostic neural networks. arXiv preprint arXiv:1906.04358, 2019.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. 2014. URL http://arxiv.org/abs/1410.5401.

Kiah Hardcastle, Niru Maheswaranathan, Surya Ganguli, and Lisa M. Giocomo. A multiplexed, heterogeneous, and adaptive code for navigation in medial entorhinal cortex. Neuron, 94(2):375-387, 2017. doi: 10.1016/j.neuron.2017.03.025. URL http://dx.doi.org/10.1016/j.neuron.2017.03.025.

Nicolas Heess, Jonathan J. Hunt, Timothy P. Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.

Herbert Jaeger. The "echo state" approach to analysing and training recurrent neural networks - with an erratum note. Technical report, 2010. URL https://pdfs.semanticscholar.org/8430/c0b9afa478ae660398704b11dca1221ccf22.pdf.

Kyobi S. Kakaria and Benjamin L. de Bivort. Ring attractor dynamics emerge from a spiking model of the entire protocerebral bridge. Frontiers in Behavioral Neuroscience, 11:8, 2017.

Ingmar Kanitscheider and Ila Fiete. Making our way through the world: Towards a functional understanding of the brain's spatial circuits. Current Opinion in Systems Biology, 3:186-194, 2017a. doi: 10.1016/j.coisb.2017.04.008. URL https://linkinghub.elsevier.com/retrieve/pii/S2452310017300549.

Ingmar Kanitscheider and Ila Fiete. Training recurrent networks to generate hypotheses about how the brain solves hard navigation problems, 2017b. URL http://papers.nips.cc/paper/7039-training-recurrent-networks-to-generate-hypotheses-about-how-the-brain-solves-hard-navigation-problems.

Garrett E. Katz and James A. Reggia. Using directional fibers to locate fixed points of recurrent neural networks. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3636-3646, 2017.

Sung Soo Kim, Hervé Rouault, Shaul Druckmann, and Vivek Jayaraman. Ring attractor dynamics in the Drosophila central brain. Science, 356(6340):849-853, 2017. doi: 10.1126/science.aal4835.

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Technical report. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.470.843&rep=rep1&type=pdf.

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009. doi: 10.1016/j.cosrev.2009.03.005.

Niru Maheswaranathan, Alex H. Williams, Matthew D. Golub, Surya Ganguli, and David Sussillo. Line attractor dynamics in recurrent networks for sentiment classification. 2019.

Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78-84, 2013. doi: 10.1038/nature12742.

Francesca Mastrogiuseppe and Srdjan Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3):609-623, 2018. doi: 10.1016/j.neuron.2018.07.003. URL https://doi.org/10.1016/j.neuron.2018.07.003.

Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. 2016. URL http://arxiv.org/abs/1611.03673.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Richard G. M. Morris. Spatial localization does not require the presence of local cues. Learning and Motivation, 12(2):239-260, 1981. doi: 10.1016/0023-9690(81)90020-5.

John O'Keefe and Lynn Nadel. The Hippocampus as a Cognitive Map. Clarendon Press, 1978. ISBN 0198572069. URL https://repository.arizona.edu/handle/10150/620894.

Kui-Hong Park, Yong-Jae Kim, and Jong-Hwan Kim. Modular Q-learning based multi-agent cooperation for robot soccer. Robotics and Autonomous Systems, 35(2):109-122, 2001.

David Sussillo and L. F. Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544-557, 2009. doi: 10.1016/j.neuron.2009.07.018. URL http://dx.doi.org/10.1016/j.neuron.2009.07.018.

David Sussillo and Omri Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Technical report. URL https://barak.net.technion.ac.il/files/2012/11/sussillo_barak-neco.pdf.

Sylvia Wirth, Pierre Baraduc, Aurélie Planté, Serge Pinède, and Jean-René Duhamel. Gaze-informed, task-situated representation of space in primate hippocampus during virtual navigation. PLoS Biology, 15(2):1-28, 2017. doi: 10.1371/journal.pbio.2001045.
5 Appendix
5.1 Performance Measure for Each Task
When testing the agent on a task, we perform a trial for each possible initial position of the agent. Note that the hidden state is randomly initialized in each trial, so this is not an exhaustive search of all possible trial types. We then measure the time T it takes the agent to reach the target. This time is normalized by an approximate optimal strategy: moving from the initial position to a corner of the arena (providing x and y information), and then heading straight to the reward. If the agent fails to reach the target after 120 steps (T_max), the trial score is -1:

Score = { T_opt / T   if T < T_max ;   -1   if T > T_max }    (16)
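As a sketch, the score of Eq. (16) for a single trial:

```python
# Per-trial score, Eq. (16). T_opt is the length of the approximate optimal
# path described above; trials longer than T_MAX = 120 steps count as failures.
T_MAX = 120

def trial_score(T, T_opt):
    """Time-normalized score; failed trials score -1."""
    return T_opt / T if T < T_MAX else -1.0
```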
5.2 Two-Stage versus End-to-End Learning

To check the effectiveness of our two-stage learning (pre-training followed by Q-learning), we contrast it with an end-to-end approach. We considered two approaches to training: a classic deep Q-learning version (Mnih et al., 2013) and a method adapted from RDPG (Heess et al., 2015). The naive Q-learning is separated into a play phase and a training phase. During the play phase, the agent collects experience tuples (s_t, a_t, h_t, r_t) into a replay buffer. The action value Q(h, a), computed through the TD-lambda method, serves as the target function. During the training phase, the agent samples past experience from the replay buffer and performs gradient descent to minimize the difference between the expected Q(h, a) and the target Q. We found that for most tasks the deep Q-learning method was better than the adapted RDPG, and thus used it for our benchmarks. We used both LSTM and vanilla RNN for these tests.

We found that, for the basic task, all networks achieve optimal performance, but our approach is significantly more data-efficient even with random networks (Fig. 6A). For all topological tasks, the end-to-end approach fails with both vanilla RNN and LSTM (Fig. 6C, Table 3). The end-to-end approach performs relatively well in metric tasks, except for the implicit context task (Fig. 6B, Table 3), where it converges to a similar performance as PosNet but with a much slower convergence speed.

For the end-to-end approaches, a critical question is whether an internal representation emerges which enables better performance in similar tasks. For instance, do networks that were successfully end-to-end trained in the basic task develop a representation that facilitates learning the bar task? To answer this question, we use networks that were end-to-end trained on one task and use them as a basis for RLS Q-learning of a different task. This allows comparison with the pre-trained networks. Figure 7 shows that pre-training provides a better substrate for subsequent Q-learning, even when considering generalization within metric tasks. For the implicit context task, the difference is even greater.

5.3 Exploring the Low-D Network Dynamics
Recurrent neural networks are nonlinear dynamical systems. As such, they behave differently in different areas of phase space. It is often informative to locate fixed points of the dynamics, and use their local dynamics as anchors to understand global dynamics. When considering trained RNNs, it is reasonable to expect approximate fixed points rather than exact ones. This is because a fixed point corresponds to maintaining the same hidden state for infinite time, whereas a trained network is only exposed to a finite time. These slow points (Sussillo & Barak; Mante et al., 2013) can be detected in several manners (Sussillo & Barak; Katz & Reggia, 2017). For the case of stable fixed points (attractors), it is also possible to simulate the dynamics until convergence. In our setting, we opt for the latter option. Because the agent never stays in the same place, we relax the dynamics towards attractors by providing as action feedback the average of all 4 actions. The relevant manifold (e.g. a plane attractor) might contain areas that are more stable than others (for instance a few true fixed points), but we want to avoid detecting only these areas. We thus search for the relevant manifold in the following manner. We drive the agent to move in the environment, while recording its position and last seen stimulus (wall). This results in a collection of hidden states, labelled by position and stimulus, that we term the m = 0 manifold. For each point on the manifold, we continue simulating the dynamics for m extra steps while providing as input the average of all 4 actions, resulting in further m ≠ 0 manifolds. If these states are the underlying scaffold for the dynamics, they should encode the position (or memory). We therefore choose m by a cross-validation method: decoding new trajectories obtained in the basic task by using the k = 15 nearest neighbors in each m-manifold. The red curve in Figure 8A shows the resulting decoding accuracy for position using PosNet, where the accuracy starts to fall around m = 25, indicating that further relaxation leads to irrelevant fixed points.

Figure 6: Learning curves for end-to-end training, compared to the Q-learning phase of the two-stage learning. Each curve represents a single network. Curves shown for the Basic, Bar and Scaling tasks.
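A sketch of this relaxation procedure follows, with the step function injected so it matches the rnn_step convention of the earlier sketch. Feeding the average of all 4 actions is the procedure described above; holding the sensory and context inputs fixed during relaxation is our simplifying assumption.

```python
import numpy as np

# Sketch of the slow-point relaxation: from a behaviorally visited state h0
# (the m = 0 manifold), iterate the dynamics m steps with the mean action as
# feedback, producing the m-manifolds described above.
def relax(step_fn, h0, sensory, context, m, n_actions=4):
    """Return the hidden state after m relaxation steps from h0."""
    avg_action = np.ones(n_actions) / n_actions   # mean over all actions
    h = h0.copy()
    for _ in range(m):
        h = step_fn(h, sensory, avg_action, context)
    return h

def speed(step_fn, h, sensory, context, n_actions=4):
    """Hidden-state velocity at h; slow points have values near zero."""
    avg_action = np.ones(n_actions) / n_actions
    return np.linalg.norm(step_fn(h, sensory, avg_action, context) - h)
```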
5.4 Attractor Landscapes for Different Recurrent Architectures

We tested the qualitative shape of the slow manifolds that emerge from pre-training other unit types. Specifically, we pre-trained an LSTM network using the same parameters as PosNet1 and MemNet1 (Table 1). Figure 9 shows a qualitatively similar behavior to that described in the main text. Note that MemNet has slow regions instead of discrete points, and we suspect discrete attractors might appear with longer relaxation times. The differences between the columns also demonstrate that slow points, revealed by relaxation, are generally helpful to analyze the dynamics of different types of recurrent networks.

5.5 Pre-Training Protocols and Performance of Networks
As explained in the main text, pre-training emphasizes decoding of either landmark memory or the position of the agent. We used several variants of hyperparameters to pre-train the networks. Equation 17, which is written again for convenience, defines the relevant parameters:

S = -α Σ_{t=1}^n P̂(z_t) log P(z_t) - β Σ_{t=1}^n Î_t log P(I_t) - γ Σ_{t=1}^n Â_t log P(A_t)    (17)

The agent was driven to explore an empty arena (with walls) using random actions, with a probability p of changing action (direction) at any step. Table 1 shows the protocols (hyperparameters), Table 2 shows the random networks' hyperparameters, and Table 3 shows the performance of the resulting networks on all tasks. For all pre-training protocols, an l2 regularizer on the internal weights and a small learning rate were used. All PosNet and MemNet training started from RandNet1 (detailed below).

Table 1: Pretraining protocols

protocol name    loss (α, β, γ)    weights adjusted    p
PosNet1          1, 0, 0           W, W_a, W_i

h_{t+1} = (1 - 1/τ) h_t + (1/τ) tanh(W h_t + W_i f(z_t) + W_a A_t + W_c C_t)    (18)

Q(h_t) = W_o h_t + b_o    (19)

The number of neurons used is 512 and the time constant τ is taken to be 2; the choice of hyperparameters is according to the standard reservoir computing literature (Jaeger, 2010; Lukoševičius & Jaeger). The weights are taken from a standard Normal distribution. It is crucial to choose an appropriate standard deviation for the success of training (Appendix 5.7), which is summarized in Table 2; each unit represents 1/√N.

Table 2: Random networks hyperparameters

name        W     W_a    W_i
RandNet1    1     1      10
RandNet2    1     5      10
RandNet3    0.5   1      10
RandNet4    0.5   5      10
RandNet5    1.2   1      10
RandNet6    1.2   5      10

Figure 7: Generalization or transfer of end-to-end networks. Q-learning of various tasks from a starting point of either pre-trained networks, or from end-to-end trained networks. The FromPos and FromMem curves denote PosNet and MemNet respectively. Each panel shows a learning curve for a different task.

Figure 8: The accuracy of decoding position (A) or landmark memory (B) from the attractor manifold as a function of the number of relaxation steps. Panel A shows a drop in accuracy around m = 25, indicating that at this stage the process converges to irrelevant fixed points. (C) Conjunctive coding of memory and position: the accuracy of decoding position from the attractor manifold as a function of the number of wall encounters. Only PosNet shows an improvement in decoding with the added information. This is consistent with the joint representation of position and memory in the attractors.

Figure 9: Slow points analysis for LSTM PosNet and MemNet (pretrained for landmark). Similar to Figure 4 in the main text. The three columns show points in phase space during a random walk and after 10 or 20 relaxation steps. The different rows are colored according to X, Y and the identity of the last wall. The PosNet is on top and MemNet is at the bottom.
5.6 Modular Network Protocol
The results of the modular network are obtained by combining PosNet 1 and MemNet 25 from Table 3. Both the Q function and the V function are learned in the same way as in the main results, Eqs. (7)-(10).
5.7 Linking Dynamics to Connectivity
Pretraining modified the internal connectivity of the networks. Here, we explore the link between connectivity and dynamics. We draw inspiration from two observations in the field of reservoir computing (Lukoševičius & Jaeger, 2009). On the one hand, the norm of the internal connectivity has a large effect on network dynamics and performance, with an advantage to residing on the edge of chaos (Jaeger, 2010). On the other hand, restricting learning to the readout weights (which are then fed back to the network, Sussillo & Abbott (2009)) results in a low-rank perturbation to the connectivity, the possible contributions of which were recently explored (Mastrogiuseppe & Ostojic, 2018).

We thus analyzed both aspects. Fig. 10A shows the norms of several matrices as they evolve through pre-training, showing an opposite trend for PosNet and MemNet with respect to the internal connectivity W. To estimate the low-rank component, we performed singular value decomposition on the change to the internal connectivity induced by pre-training (Fig. 10B):

W = W_0 + U S V^T    (20)

where W_0 is the connectivity before pre-training and U S V^T is the SVD of the change ΔW. The singular values of the actual change were compared to a shuffled version, revealing their low-rank structure (Fig. 10C,D). Note that pretraining was not constrained to generate such a low-rank perturbation. Furthermore, we show that the low-rank structure is partially correlated with the network's inputs, possibly contributing to their effective amplification through the dynamics (Fig. 10E-H). Because we detected both types of connectivity changes (norm and low-rank), we next sought to characterize their relative contributions to network dynamics, representation and behavior. In order to assess the effect of matrix norms, we generated a large number of scaled random matrices and used the GLM analysis of Figure 5 to assess their influence on dynamics. We see that the trade-off between landmark memory and path integration is affected by the norm (Fig. 10E).
The actual numbers, however, are much lower for the scaled random matrices compared to the pre-trained ones, indicating the importance of the low-rank component (Fig. 10F). Indeed, when removing even only the leading 5 ranks from ΔW, network encoding and performance on all tasks approach those of RandNet.

Table 3: Performance of all networks on all tasks

protocol       BSC    HO     BAR    SC     SX     SY     IM
1. PosNet1     0.91   0.79   0.73   -0.36  0.90   0.19   0.78
2. PosNet1     0.97   0.78   0.79   -0.47  0.89   0.53   0.11
3. PosNet1     0.95   0.74   -0.04  -0.16  0.70   0.70   0.66
4. PosNet1     0.97   0.82   0.68   0.03   0.89   0.21   0.69
5. PosNet1     0.88   0.74   0.73   -0.36  0.88   0.20   0.78
6. PosNet1     0.91   0.63   0.48   -0.41  0.62   0.04   -0.21
7. PosNet1     0.94   0.64   0.51   0.26   0.64   0.15   -0.31
8. PosNet2     0.89   0.62   0.14   -0.25  0.76   0.24   0.84
9. PosNet3     0.93   0.48   0.41   0.04   0.22   -0.24  0.73
10. PosNet4    0.83   0.58   0.72   -0.07  0.16   0.58   -0.22
11. PosNet5    0.86   0.69   0.48   -0.19  0.21   0.65   0.01
12. PosNet6    0.80   0.17   0.11   -0.33  0.07   0.36   -0.10
13. RandNet1   0.89   0.52   -0.15  0.26   0.79   -0.17  -0.14
14. RandNet2   0.84   0.25   -0.89  -0.03  0.62   -0.29  -0.10
15. RandNet3   0.51   0.05   -0.93  -0.39  0.30   -0.35  -0.09
16. RandNet4   0.65   -0.31  -0.94  -0.37  -0.23  -0.04  -0.17
17. RandNet5   0.91   0.22   0.15   -0.16  0.62   0.06   -0.47
18. RandNet6   0.88   0.70   -0.36  -0.44  0.63   0.11   -0.33
19. MemNet1    0.65   0.27   -0.84  0.62   0.92   0.62   -0.20
20. MemNet1    0.79   0.15   -0.94  0.48   0.64   0.43   -0.14
21. MemNet1    0.82   0.28   -0.30  0.37   0.69   0.45   -0.15
22. MemNet1    0.73   -0.09  -0.51  0.65   0.84   0.58   -0.29
23. MemNet1    0.84   0.54   0.39   -0.07  0.85   0.26   -0.41
24. MemNet2    0.76   0.52   -0.87  0.58   0.47   0.90   -0.29
25. MemNet3    0.76   -0.11  -0.46  0.65   0.83   0.61   -0.28
26. MemNet4    0.73   0.35   -0.64  0.43   0.91   0.50   -0.10
27. MemNet5    0.75   0.08   -0.86  -0.37  0.12   -0.12  -0.04
28. End2End1   0.89   0.67   0.70   -0.16  -0.09  0.18   0.14
29. End2End2   0.67   0.8    0.73   -0.62  0.51   -0.61  -0.36
30. Modular    0.96   0.76   0.76   0.62   0.92   0.59   0.86
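A sketch of the decomposition of Eq. (20) and the rank-removal control described above (the entry-shuffling scheme for the control is our assumption):

```python
import numpy as np

# Sketch of the low-rank analysis: SVD of the connectivity change induced by
# pre-training, a shuffled control, and removal of the leading ranks.
def lowrank_analysis(W_pre, W_post, k=5, seed=0):
    dW = W_post - W_pre
    s = np.linalg.svd(dW, compute_uv=False)        # spectrum of the change

    # Shuffled control: same entries, structure destroyed.
    rng = np.random.default_rng(seed)
    dW_shuf = rng.permutation(dW.ravel()).reshape(dW.shape)
    s_shuf = np.linalg.svd(dW_shuf, compute_uv=False)

    # Remove the leading k ranks from dW (cf. the RandNet comparison above).
    U, S, Vt = np.linalg.svd(dW)
    dW_trunc = U[:, k:] @ np.diag(S[k:]) @ Vt[k:, :]
    return s, s_shuf, W_pre + dW_trunc
```

Comparing s against s_shuf reveals the concentration of the learned structure in the first few singular values, and re-running the tasks with the truncated connectivity tests how much of the behavior those ranks carry.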
5.8 Behavior of Different Networks

Different networks develop diverse strategies for metric and topological tasks. In this section, we give examples of typical trajectories of PosNet, MemNet and RandNet in the basic, bar and scaling tasks.
Figure 10: Connectivity changes during pre-training. (A) Top: evolution of norms during pre-training for both PosNet (red) and MemNet (green). Bottom: the low-rank effect. The network can be decomposed into two parts, a random part and a learned low-rank structure, through SVD. (B)
SVD of ΔW compared to a shuffled version of ΔW, showing that most of the learned structure is concentrated in the first few ranks for PosNet (top). The same low-rank effect is observed for MemNet (bottom). (C-D) Measuring the overlap between the action feedback matrix or input matrix and the output vectors v of ΔW for PosNet (C) and MemNet (D). (E) The variance of hidden states explained by a GLM model containing position and landmark memory. Each pixel represents a scaled random matrix, with the colored circles showing the norms of the pre-trained networks. The red, blue and green dots correspond to the norms of the selected PosNet, RandNet and MemNet used for the dynamics analysis. (F)