Hippocampal representations emerge when training recurrent neural networks on a memory dependent maze navigation task
Justin Jude*, Matthias H. Hennig**

Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, United Kingdom
* [email protected]
** [email protected]

Abstract
Can neural networks learn goal-directed behaviour using similar strategies to the brain, by combining the relationships between the current state of the organism and the consequences of future actions? Recent work has shown that recurrent neural networks trained on goal based tasks can develop representations resembling those found in the brain, such as entorhinal cortex grid cells. Here we explore the evolution of the dynamics of their internal representations and compare this with experimental data. We observe that once a recurrent network is trained to learn the structure of its environment solely based on sensory prediction, an attractor based landscape forms in the network's representation, which parallels hippocampal place cells in structure and function. Next, we extend the predictive objective to include Q-learning for a reward task, where rewarding actions are dependent on delayed cue modulation. Mirroring experimental findings in hippocampus recordings in rodents performing the same task, this training paradigm causes nonlocal neural activity to sweep forward in space at decision points, anticipating the future path to a rewarded location. Moreover, prevalent choice- and cue-selective neurons form in this network, again recapitulating experimental findings. Together, these results indicate that combining predictive, unsupervised learning of the structure of an environment with reinforcement learning can help understand the formation of hippocampus-like representations containing both spatial and task-relevant information.
Recurrent neural networks have been used to perform spatial navigation tasks, and the subsequent study of their internal representations has yielded dynamics and structures that are strikingly biological. Metric (Cueva and Wei, 2018; Banino et al., 2018) and non-metric (Recanatesi et al., 2019) representations mimicking grid (Fyhn et al., 2004) and place cells (O'Keefe and Nadel, 1978) respectively form once the recurrent network has learned a predictive task in the context of a complex environment. Cueva et al. (2020) demonstrate not only the emergence of characteristic neural representations, but also hallmarks of head direction system cells, when training a recurrent network on a simple angular velocity integration task. Biologically, non-metric representations are associated with landmark spatial memory, in which place cells within the mammalian hippocampus fire when the associated organism is present in a corresponding place field. Extrafield firing of place cells occurs when these neurons spike outside of these contiguous place field regions. Here we show that recurrent neural networks (RNNs) not only form corresponding attractor landscapes, but also produce representations with internal dynamics that closely resemble those found experimentally in the hippocampus when performing goal-directed behaviour.

Research in neuroscience such as that of Johnson and Redish (2007) shows that spatial representations in mice in the CA3 region of the hippocampus frequently fire nonlocally. Griffin et al. (2007) show that a far higher proportion of hippocampal neurons in the CA1 region in rats performing an episodic task in a T-shaped maze encode the phase of the task rather than spatial information (in this case trajectory direction). Ainge et al. (2007) show CA1 place cells encode destination location at the start position of a maze. Lee et al. (2006) demonstrate that place fields of CA1 neurons gradually drift toward reward locations throughout reward training on a T-shaped maze.

In this work we show that a recurrent neural network learning a choice-reward based task using reinforcement learning, in conjunction with predictive sensory learning in a T-shaped maze, produces an internal representation with consistent extrafield firing associated with consequential decision points. In addition we find that the network's representation, once trained, follows a forward sweeping pattern as identified by Johnson and Redish (2007). We then show that a higher proportion of units in the trained network show strong selectivity for the encoding or choice phase of the task than the proportion showing selectivity for spatial topology. Importantly, these properties only emerge during predictive learning, where task learning is much faster compared to traditional deep Q-learning.
Figure 1: Left, the wall observation and cue received by the network at each timestep. Right, the entangled predictive task the LSTM network is pre-trained on in order to generate a non-metric map of the maze environment.

We use a form of the cued-choice maze used by Johnson and Redish (2007), which has a central T structure with returning arms, shown in Figure 1. All walls of the maze are tiled with distinct RGB colours which are generated at random and remain fixed throughout. An agent initially learns to predict the next sensory stimulus given its movement. This combination of unsupervised learning and exploration has been shown previously to produce place cell-like encoding of the agent's position (Recanatesi et al., 2019). Next, rewards at four possible locations are introduced and the agent is tasked with associating a cue with the rewarding trajectory. The agent has four vision sensors, one in each cardinal direction, reading the wall RGB colours they intersect. The cue tone is played to the agent as it passes the halfway point of the central maze stem. A low frequency cue indicates that the agent will turn left at the top of the maze stem and a high frequency cue indicates a right turn. These cue tones take the form of a high or low valued scalar perturbed with normally distributed noise if at a cue point, with a zero value given at all other locations. These four RGB colours as well as the cue frequency at the current location make up the total input received by the agent.

The agent is controlled by a recurrent neural network comprised of a 380-unit long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) with a single layered readout for the prediction of RGB values. We first pre-train the network by tasking it with predicting the subsequent observation of wall colours from the currently observable wall colours given its trajectory through the maze. The agent's starting location is at the bottom of the central stem of the T maze, and a trajectory of left or right at the top of the central stem is chosen pseudorandomly, depicted with red and blue arrows respectively in Figure 1 and corresponding to the low (red trajectory) or high (blue trajectory) cue tone value given halfway up the stem. As in the experiments by Johnson and Redish (2007), during pre-training the agent does not choose any of its actions and is only learning to predict the sequence of wall colours it encounters. In a given pre-training iteration, we collect all observations as the agent traverses the maze until it returns to the start location at the bottom of the central stem, and finally train the LSTM on the entire collected trajectory. The network is trained with a mean-squared error loss between predicted and target wall colours (Eq. 1), with model parameters optimised using Adam (Kingma and Ba, 2015) and a learning rate of 0.001.

$$\mathit{loss}_{rgb} = \frac{1}{n} \sum_{i=1}^{n} \left( y_{rgb} - (W_{rgb} h_t + b_{rgb}) \right)^2 \qquad (1)$$

To solve this task, the network has to maintain the cue tone played in its internal memory for several time steps in order to predict subsequent wall colours from the top of the central stem. In our model, this is achieved through the network forming a non-metric representation (attractor landscape) of the maze environment, as also demonstrated by Xu and Barak (2020). Similarly, behavioural experiments typically have a comparable familiarisation phase with the environment before reward-based tasks are introduced (Johnson and Redish, 2007; Griffin et al., 2007).
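As a concrete illustration, a minimal PyTorch sketch of this pre-training step is given below. The observation dimensionality (4 walls × 3 RGB channels plus 1 cue scalar) and all variable names are our assumptions; only the 380-unit LSTM, the linear RGB readout, the mean-squared error of Eq. 1 and the Adam optimiser with learning rate 0.001 come from the text.

```python
# Minimal sketch of the predictive pre-training step (Eq. 1), assuming PyTorch.
# obs_dim = 13 is an assumption: 4 walls x 3 RGB channels + 1 cue scalar.
import torch
import torch.nn as nn

class PredictiveAgent(nn.Module):
    def __init__(self, obs_dim=13, hidden=380):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.rgb_readout = nn.Linear(hidden, 12)  # predicted RGB of the 4 walls

    def forward(self, obs_seq, state=None):
        h, state = self.lstm(obs_seq, state)      # h: (batch, T, hidden)
        return self.rgb_readout(h), state

model = PredictiveAgent()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate from the text

def pretrain_step(obs_seq, next_rgb_seq):
    """One iteration: train on a full maze traversal collected beforehand."""
    pred, _ = model(obs_seq)                      # (1, T, 12)
    loss = ((next_rgb_seq - pred) ** 2).mean()    # mean-squared error, Eq. 1
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```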
Figure 2: For the joint task to be learned by the LSTM network, we introduce secondary cue points, where the same cue tone as that played at the primary cue point is repeated if and only if the agent has proceeded in turning in the direction corresponding to the cue tone frequency given at the primary cue location. The agent is free to choose the next action to be taken when traversing the maze at either the choice point at the top of the stem of the maze or at the secondary cue locations. There are two potential reward sites on both returning arms, with the reward sites being active if the agent is on the returning arm corresponding to the cue tone frequency.

Once the LSTM has formed an internal representation of the maze, the agent is tasked with navigating towards potential reward sites whose location is indicated by the cue signal: a low frequency cue indicates active reward sites on the left return arm and a high frequency cue indicates active reward sites on the right return arm. The cue tone and corresponding side of active reward sites are together chosen randomly at each iteration, with a secondary cue given if the agent has turned correctly. In this phase there are three choice points at which the agent is able to choose its next action, and it is constrained to follow the forward maze direction elsewhere: the top of the maze stem and the two secondary choice points (Figure 2), with initially random movement at these points during reward training. There are 5 steps between the cue and choice points and 7 steps from the choice point to the first reward site on either return arm. The inclusion of the secondary cues as additional choice points was motivated by the experimental setup used by Johnson and Redish (2007), to compare the network activity at these points to experimental data. These secondary points also give the agent the opportunity to backtrack on the decision made at the primary choice point in light of further environmental observation (the presentation or lack thereof of the secondary cue), and make learning more efficient in our model. This may explain why it speeds up training in animals performing the same task.

We additionally introduce a new single layered readout for the LSTM network which predicts state-action values associated with the four cardinal directions in relation to the agent's current position and direction. At each timestep, this ensemble receives the agent's environment observation, and the agent follows an epsilon-greedy policy (starting with fully random movement at choice points and a decaying epsilon thereafter) for choosing optimal actions of those available at each of the three choice points.

The recurrent network controlling the agent is trained on a weighted combined loss of a reinforcement learning (RL) task loss and the previously described predictive wall colour loss:

$$\mathit{loss}_{combined} = \left| Q(s, a) - \left( r + \gamma \cdot Q'(s', \arg\max_{a'} Q(s', a')) \right) \right| + \lambda \cdot \mathit{loss}_{rgb} \qquad (2)$$

The first component of this loss is the difference between predicted and observed state-action values, which are represented by Q-values (Watkins and Dayan, 1992), a prediction of future global reward:

$$Q(s, a) = W_Q h_t + b_Q \qquad (3)$$

We use double Q-learning (Van Hasselt et al., 2016) to train the agent on the task, updating the target Q-value predictor ($Q'$, an LSTM with the same number of units) every 15 training iterations.
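The update described above might be implemented as in the following sketch. The `q_readout` attribute and the batching of transitions are illustrative assumptions, while the discount factor (reported below), the absolute-error form of Eq. 2, and the online-selects/target-evaluates split of double Q-learning follow the text.

```python
# Hedged sketch of one combined-loss update (Eq. 2) with double Q-learning.
import copy
import torch
import torch.nn as nn

GAMMA = 0.8    # discount factor reported below in the text
LAMBDA = 1.0   # weighting of the predictive loss; the value is an assumption

model.q_readout = nn.Linear(380, 4)  # new single layered Q readout (Eq. 3)
target_model = copy.deepcopy(model)  # refreshed every 15 iterations per the text

def combined_loss(h, h_next, action, reward, rgb_loss):
    """h, h_next: LSTM hidden states at s and s'; action, reward: 1D tensors."""
    q = model.q_readout(h).gather(-1, action.unsqueeze(-1)).squeeze(-1)  # Q(s, a)
    with torch.no_grad():
        a_star = model.q_readout(h_next).argmax(-1)        # online net selects a'
        q_next = target_model.q_readout(h_next)            # target net evaluates
        target = reward + GAMMA * q_next.gather(-1, a_star.unsqueeze(-1)).squeeze(-1)
    return (q - target).abs().mean() + LAMBDA * rgb_loss   # Eq. 2
```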
Double Q-learning allows for optimal performance on the reward task in drastically fewer agent maze traversals and network training iterations than with standard DQN (Mnih et al., 2013) based Q-learning, which suffers from overestimation of Q-values. We settle on a discount factor ($\gamma$) of 0.8, as values higher than this regularly cause the network to converge on solutions wherein the agent does not take the most direct path to reward locations, with backtracking at secondary choice points. The second loss component is the sensory prediction task which we used to pre-train the network, weighted by $\lambda$.

The agent learns the sensory prediction task to a high degree of recall, and after around a thousand training iterations (combined loss with pre-training in Figure 3) the agent was able to achieve perfect performance on the reward task when the LSTM network had 380 or more units (Fig. 3, right). We trained the reinforcement learning (Eq. 2) portion of the task in an epsilon-greedy manner, with a steadily decaying epsilon to ensure that the agent would choose the rewarding path consistently once actions were chosen at choice points completely by the network. Notably, the agent did not turn at either of the secondary choice points once training had completed, only at the primary choice point.

We attempted to run the reinforcement learning task alone in a maze with no sensory input except the reward cue. In this scenario the network is not able to learn the task due to a lack of self-localisation, and is unable to perform the task based on step counting between the cue and choice point. In addition, the reward based reinforcement learning task was attempted using Q-learning alone with a loss function that did not include the wall colour prediction error, both with and without pre-training (shown in Fig. 3, left). In both cases we find that the reward task is not learnable with the same higher rate of epsilon decay we use for the combined loss function with pre-training, as the network quickly forgets the attractor landscape of the maze formed during pre-training, which we maintain through the combined loss (Eq. 2). We also find the network can solve the reward task using the combined loss without pre-training, albeit in around 3 times the number of maze traversals as with the use of the spatial map formed in the pre-trained case (Fig. 3, left).
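For completeness, a minimal sketch of epsilon-greedy action selection with a steadily decaying epsilon follows; the multiplicative decay schedule and its rate are assumptions, as the text only specifies fully random initial choices and a steady decay.

```python
# Sketch of epsilon-greedy action selection with a steadily decaying epsilon.
# The multiplicative schedule and its rate are assumptions.
import random

class EpsilonGreedy:
    def __init__(self, start=1.0, end=0.0, decay=0.995):
        self.eps, self.end, self.decay = start, end, decay

    def act(self, q_values):
        """q_values: list of Q estimates for the actions available here."""
        if random.random() < self.eps:
            a = random.randrange(len(q_values))                       # explore
        else:
            a = max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
        self.eps = max(self.end, self.eps * self.decay)  # decay per decision
        return a
```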
Figure 3: Left: Success rate (proportion of direct traversals to reward locations) of each set of training paradigms on the reward task, averaged over 10 initial conditions and random wall colours using the optimal rate of epsilon decay for each paradigm, each shown with a 95% confidence interval. The attractor landscape formed during pre-training, alongside the combined loss, allows the network to achieve perfect performance on the reward task in relatively few maze traversals. Q-learning alone without pre-training also achieves perfect performance, in more than twice the number of maze traversals. Q-learning alone with pre-training takes far more maze traversals to converge (and is less likely to be optimal) due to the non-random initial state of the network and an inability to utilise the spatial map formed. Combined training without pre-training also takes relatively many maze traversals to converge due to a relatively difficult joint task with no biased initial state. Right: The pre-trained network optimised with the combined loss converges at similar rates with different network sizes above 380 units.
First, we investigate the representation learned by the network during these two stages of training. Pre-training causes the formation of discrete attractors that resemble place cells in the hippocampus. Individual units in the network generally have well isolated place fields, which together cover the whole maze and therefore allow reliable decoding of agent location. In addition to an increase in activity in a particular unit when the agent moves across its respective place field, we also observe substantial extrafield firing of these units. This activity occurs mainly at the primary cue location and at the first choice point after pre-training. After training on the reward task, in addition to the place fields, the network also has units with extrafield activity at the secondary choice points (Fig. 4E).

In the top row of Figure 4(A-D) we show activity in 4 reward-trained LSTM units, obtained by collecting unit activity from a full left-sided trajectory from the maze start point returning to the start point with cues presented, together with a full right-sided trajectory. We show all activity from this collection in the top row of Figure 4(A-D) and proceed to outline the maze areas for each unit with activity higher than 30% of the peak activity of that particular unit (mirroring the experimental threshold used by Johnson and Redish (2007)), denoting them as place fields corresponding to these LSTM units. In experiments, rodents seem to pause at high consequence decision points (Johnson and Redish, 2007) with alternating head movement behaviour signifying vicarious trial and error (VTE) (Muenzinger, 1938; Hu and Amsel, 1995). In the activity plots in the bottom row of Figure 4(A-D), we simulate this using our reward-trained model by running the agent from the start position at the bottom of the maze stem, then pausing it at the top of the stem, with a left cue presented halfway up. We show activity above 60% of unit peak activity (identified with the previously collected aggregated activity), in addition to the previously identified place fields.

The network representation seems to sample both return arms, with surprisingly high extrafield activity in the shown LSTM units when the agent is paused at the maze choice point, a location for which these units do not usually have corresponding activity (Fig. 4A-D). We define nonlocal firing as unit activity above 60% of peak averaged unit activity when running the agent along the central stem (bottom row, Fig. 4A-D) to observe only the most salient extrafield behaviour.

Figure 4: A-D) Top row: Activity maps showing well isolated place fields of four LSTM units (acting as place cells), indicated in dotted regions, after the reward task. Place fields determined by contiguous locality with average activity exceeding 30% of peak unit activity during a single left trajectory followed by a right trajectory.
Bottom row: LSTM unit activity exceeding 60% of the previously averaged peak unit activity for the given unit when the agent is run from the bottom of the maze stem to the top of the stem and given a low frequency (left) cue tone halfway up the stem, then held stationary at the choice point with the LSTM network repeatedly receiving the observation from the choice point for the timesteps thereafter (shown in addition to the previously determined unit place fields in dotted regions). A, B) Strong extrafield firing contiguously from cue to choice point. C, D) High extrafield firing at the choice point while the agent is paused at the top of the stem. E) Place fields (determined from average activity on both trajectories) of four LSTM units outlined in dotted areas after the reward based task. High levels of consistent extrafield firing at primary and secondary cue points in 56% of LSTM units.

The internal dynamics of the LSTM network has an inherently forward looking representation of the maze once pre-trained in a predictive manner. As depicted in Fig. 5, whilst the agent is stationary, the dynamics of the LSTM network move forward through the maze, incorporating the trajectory modulation of the cue played halfway up the maze stem. The forward movement of the representation is also notable for having an inconsistent velocity, where the LSTM inferred agent location jumps (Hasselmo, 2009) from the top of the maze to lower down the arm (timestep 16 in Fig. 5).
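The place-field and extrafield criteria described above (30% and 60% of peak unit activity) could be implemented as in the following sketch; the use of connected-component labelling to operationalise "contiguous locality" is our assumption.

```python
# Possible implementation of the place-field criteria: threshold a unit's
# spatial activity map at 30% of peak and keep contiguous regions; extrafield
# activity is above 60% of peak outside all fields. Connected-component
# labelling as the notion of contiguity is our assumption.
import numpy as np
from scipy import ndimage

def place_fields(activity_map, frac=0.3):
    """activity_map: 2D array of a unit's mean activity per maze location."""
    mask = activity_map > frac * activity_map.max()
    labels, n = ndimage.label(mask)        # contiguous supra-threshold regions
    return [labels == k for k in range(1, n + 1)]

def extrafield_mask(activity_map, fields, frac=0.6):
    high = activity_map > frac * activity_map.max()
    infield = np.any(fields, axis=0) if fields else np.zeros_like(high)
    return high & ~infield                 # nonlocal (extrafield) locations
```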
Figure 5: LSTM inferred agent position after pre-training on the maze. The agent is run from the start at timestep 1 to timestep 4, where it receives a low frequency cue (indicating a left turn). At timestep 9 the agent is stopped at the top of the maze stem and the LSTM is given the environment observation from this location for the remainder of the shown timesteps. The inferred position then moves left according to the cue, with the position seeming to jump abruptly between timesteps 15 and 16. The inferred position then moves back to the starting position at timestep 26. We observe an analogous inferred forward moving representation on the right side of the maze with a high frequency cue.
Figure 6: LSTM representation after reward training. As previously, we run the agent from the start position to the top of the stem of the maze at timestep 9, with a low frequency (left) cue tone at timestep 4. Again, the agent is stopped at this position with the LSTM network receiving the environment observation from this position for the remainder of the shown timesteps. As with the network purely trained on the predictive task, the representation moves in the direction corresponding to the frequency of the given cue tone. Then between timesteps 14 and 15, the inferred position jumps from the return arm with active reward sites to the alternate arm, with the inferred position moving from this position to the start location fairly consistently. Then the inferred position jumps again at timestep 32 to the rewarding return arm and moves steadily to the start position.

In stark contrast to the dynamics of the LSTM network after predictive pre-training, following training on the reward task the forward representation of the LSTM is still looking ahead of the agent but is now displaying sweeping behaviour (Fig. 6), as identified experimentally in rats by Johnson and Redish (2007) when performing cue based tasks. When the agent is stationary at the choice point, we observe the representation moving ahead of the agent: first in the direction corresponding to the cue given at the first cue point, and then abruptly down the opposing arm of the maze towards the starting location; thereafter the representation moves down the correct arm (corresponding to the cue) and becomes stationary at the maze start location. This path switching behaviour is reliably observed in networks trained on the combined loss (Eq. 2) with and without pre-training, with differing numbers of units and initial conditions, as long as the reward task is solved without backtracking at secondary cue locations. The network lacks a sweeping or forward moving representation when trained on the reward task with Q-learning alone, regardless of pre-training. Thus pre-training does not itself contribute to sweeping or path switching behaviour.
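The paper does not state how the "LSTM inferred agent position" shown in Figures 5 and 6 is decoded from network activity; one plausible reconstruction, sketched here under that caveat, is a nearest-centroid decoder over hidden states collected during normal traversals.

```python
# Hypothetical nearest-centroid position decoder: store the mean hidden state
# observed at each maze location during traversals, then map any hidden state
# to the location with the closest template. The paper does not describe its
# decoder, so this is a reconstruction, not the authors' method.
import numpy as np

def fit_templates(hidden_states, locations):
    """hidden_states: (T, H) array; locations: length-T list of location ids."""
    locs = sorted(set(locations))
    templates = np.stack([
        hidden_states[[i for i, l in enumerate(locations) if l == loc]].mean(0)
        for loc in locs
    ])
    return locs, templates

def decode_position(h, locs, templates):
    dists = np.linalg.norm(templates - h, axis=1)
    return locs[int(dists.argmin())]   # nearest template = inferred location
```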
Figure 7: UMAP manifold of LSTM network dynamics of a complete left trajectory (dark blue) and a complete right trajectory (red), shown along with the manifold of dynamics when the agent is run from the start location to the choice point with a left cue (light blue) or right cue (pink) given at the cue point and the agent paused in place at the top of the maze stem. A few timesteps after the agent is paused, the dynamics of the left cue paused agent (light blue) switches manifold path abruptly from running alongside the complete left trajectory path (dark blue) and joins the right trajectory path (red), following this for many timesteps before ultimately arriving at the same manifold end position as the complete left trajectory path (dark blue). This is analogous for the right cue paths (red and pink).

We further investigate the network representation using Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018). Figure 7 shows generally connected manifolds, with closer inspection revealing the dynamics which lead to the sweeping arm behaviour in Figure 6 when the agent is stationary at the primary choice point. Zeroing visual input while the agent is paused at the choice point gives comparable representation dynamics to that observed in Figures 6 and 7.

Separately, we find that place fields of particular LSTM units drift forwards from their original firing positions after pre-training, towards the reward locations on the return arms, throughout reward training, as shown experimentally in CA1 neurons by Lee et al. (2006). We observe this behaviour in 50 out of 380 network units (13%), with final resting locations of place fields at reward locations (seen in Appendix Figure 10). This is possibly explained by the gradient of Q-values (predictions of future reward) spreading backwards from reward locations (Hasselmo, 2005) and becoming stronger throughout training.
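A sketch of how the manifold in Figure 7 could be produced with the umap-learn package; embedding all four conditions jointly so their paths are comparable is our reading of the figure, and the UMAP hyperparameters are not given in the paper.

```python
# Sketch of the UMAP embedding behind Figure 7, assuming the umap-learn
# package with default hyperparameters (the paper does not report them).
import numpy as np
import umap

def embed_dynamics(runs):
    """runs: dict mapping condition name -> (T, H) array of hidden states,
    e.g. 'left_full', 'right_full', 'left_paused', 'right_paused'."""
    X = np.concatenate(list(runs.values()))
    emb = umap.UMAP(n_components=2).fit_transform(X)  # joint 2D embedding
    out, i = {}, 0
    for name, h in runs.items():
        out[name] = emb[i:i + len(h)]
        i += len(h)
    return out
```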
In addition to a forward sweeping representation, this trained network also exhibits neural selectivity that closely matches hippocampal circuits. Griffin et al. (2007) reported that after reward learning, hippocampal neurons were more strongly selective for the encoding or choice phase of a task than for the direction of the organism's trajectory. We measure the selectivity preference of each unit in our network using the discrimination indices used by Griffin et al. (2007) for turn direction selectivity ($DI_{turn}$) and phase selectivity ($DI_{phase}$):

$$DI_{turn} = \frac{FR_{right} - FR_{left}}{FR_{right} + FR_{left}} \qquad DI_{phase} = \frac{FR_{cue} - FR_{choice}}{FR_{cue} + FR_{choice}} \qquad (4)$$

where $FR_{right}$ for a particular LSTM unit is the mean firing rate from the cue point on the central stem to the choice point at the top of the stem on trajectories where the agent turns right at the choice point. Similarly, $FR_{left}$ is the mean stem firing rate when the agent turns left. $FR_{cue}$ is the firing rate at the cue (encoding) point averaged over both left and right trajectories, and similarly $FR_{choice}$ is the averaged firing rate at the choice (sampling) point.
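Eq. 4 translates directly into code; treating mean unit activations as firing rates is the only assumption here.

```python
# Direct implementation of the discrimination indices in Eq. 4, per unit.
# fr_* are (n_units,) arrays of mean activity over the regions defined above.
import numpy as np

def discrimination_indices(fr_right, fr_left, fr_cue, fr_choice):
    di_turn = (fr_right - fr_left) / (fr_right + fr_left)    # turn selectivity
    di_phase = (fr_cue - fr_choice) / (fr_cue + fr_choice)   # phase selectivity
    return di_turn, di_phase
```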
Figure 8: Histograms showing LSTM unit discrimination indices for turn direction selectivity ($DI_{turn}$) vs task phase selectivity ($DI_{phase}$). A highly negative selectivity index for turn direction indicates a unit which exhibits high selectivity (uniquely high network activity) for a leftward trajectory, and a highly positive index indicates selectivity for a rightward trajectory. A negative selectivity index for task phase indicates a unit which is highly selective for the choice (retrieval) phase of the goal based task, whereas a positive index indicates a unit which is highly selective for the cue (encoding) phase of the task.

The firing areas used for selectivity measurement are insets in Figure 8. We use the stem above the cue point to assess turn direction selectivity, and the cue/choice points to assess encoding and sampling ($DI_{phase}$). Figure 8 shows that a higher proportion of LSTM units are strongly task selective rather than turn selective, with significantly more units having large absolute $DI_{phase}$ indices than $DI_{turn}$ indices.

In addition, the reward trained network is found to have a disproportionately high number of units (163 out of 380 LSTM units) with place fields at the start location of the maze. Moreover, we find evidence of conditional destination encoding in these units, which were heavily differentiated in their firing with respect to particular rewarding locations, as shown experimentally in CA1 hippocampal place cells (Ainge et al., 2007; Wood et al., 2000; Ferbinteanu and Shapiro, 2003). 59.5% of units with a place field at the maze start location fired uniquely at this point for rewarding locations on a particular return arm.

In this work we show that networks trained with a combined predictive and goal-based objective exhibit functional dynamics and selectivity behaviour coinciding with that of hippocampal neurons. We demonstrate that extrafield firing activity of network units emerges when a simulated agent, trained on a goal based reward task in a T-shaped maze, pauses at decision points, suggesting intrinsic dynamics are encoding the future trajectory of the agent. This mirrors experimental results in hippocampal place cells in rats (Johnson and Redish, 2007; Frank et al., 2000). At the same time, we find that networks using this combined objective, following pre-training only on a sensory prediction task, can learn the correct goal-directed behaviour much faster than an equivalent network with only a Q-learning objective.

Previous work shows that metric neural representations of environments form when an RNN is optimised to predict agent position from agent velocity (Cueva and Wei, 2018; Banino et al., 2018), and non-metric representations form when an RNN is trained to predict future sensory events given direction of movement (Recanatesi et al., 2019). When training our model we do not provide the LSTM network with any explicit information about location or direction; it only receives sensory information. This is similar to the purely contextual input received by the model pre-trained by Xu and Barak (2020), where no velocity input is given; however, the network used by these authors is still trained on position and landmark prediction in a supervised way.

Instead, our training paradigm forces the LSTM to maintain an implicit notion of movement within its internal state in relation to environmental observations.
This, in conjunction with the consideration that model-free RL methods such as Q-learning perform poorly on tasks in dynamic environments such as ours (Dolan and Dayan, 2013), and the long term dependency on the delayed cue with respect to the choice location, makes the task outlined in Figure 2 particularly challenging.

Training on a sensory predictive task causes the formation of a non-metric place cell-like representation in the activations of network units, similarly to Recanatesi et al. (2019). These units demonstrate nonlocal extrafield firing after pre-training (Appendix Figure 9) and after reward training (Figure 4). Johnson and Redish (2007) find that this extrafield firing is particularly striking at consequential decision points, where rats usually pause in order to sample previously seen trajectories. We observe that cue or choice point extrafield activity is evident in most LSTM units after training on the reward task. This is likely due to the increased precedence these points have in the agent reaching reward locations. Together, the trained LSTM network units form a representation which sweeps along the paths available to the agent, first down the reward path and then the other, as shown in Figure 6 and demonstrated in rats by Johnson and Redish (2007).

Although hippocampal place cells are critical for spatial memory (Nakazawa et al., 2002; Florian and Roullet, 2004; Sandi et al., 2003; Redish and Touretzky, 1998; Miller et al., 2020), it is currently unclear by what mechanism an ensemble of place cells contributes to a representation of goal-directed behaviour (Morris, 1990). Our model and training paradigm are in keeping with the hypothesis that the hippocampus is involved in maintaining a conjunctive representation of cognitive maps and sensory information (Whittington et al., 2019). We show that this paradigm can be extended with predictive learning of Q-values of anticipated future reward, and that the resulting representation is well suited for learning actions leading from a cue to a reward. Importantly, this representation emerges solely from sampling sensory inputs and predicted rewards, while reinforcement learning itself remains model-free and is initially random. The surprising similarity of the task-dependent activity in our simulations and experimentally recorded neural activity in similar tasks suggests that the model may replicate central aspects of learning and planning in the hippocampus. Our trained model could improve understanding of hippocampal function by testing hypotheses regarding previously unobserved dynamics inexpensively. This could be performed on maze environments such as that in this work, or in more open arena settings once the model is retrained.

References
Ainge, J. A., Tamosiunaite, M., Woergoetter, F., and Dudchenko, P. A. (2007). Hippocampal CA1 place cells encode intended destination on a maze with multiple choice points. Journal of Neuroscience, 27(36).

Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Pritzel, A., Chadwick, M. J., Degris, T., Modayil, J., Wayne, G., Soyer, H., Viola, F., Zhang, B., Goroshin, R., Rabinowitz, N., Pascanu, R., Beattie, C., Petersen, S., Sadik, A., Gaffney, S., King, H., Kavukcuoglu, K., Hassabis, D., Hadsell, R., and Kumaran, D. (2018). Vector-based navigation using grid-like representations in artificial agents. Nature.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference.

Cueva, C. J., Wang, P. Y., Chin, M., and Wei, X.-X. (2020). Emergence of functional and structural properties of the head direction system by optimization of recurrent neural networks. In International Conference on Learning Representations.

Cueva, C. J. and Wei, X. X. (2018). Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. In International Conference on Learning Representations.

Dolan, R. J. and Dayan, P. (2013). Goals and habits in the brain. Neuron, 80(2).

Ferbinteanu, J. and Shapiro, M. L. (2003). Prospective and retrospective memory coding in the hippocampus. Neuron, 40(6).

Florian, C. and Roullet, P. (2004). Hippocampal CA3-region is crucial for acquisition and memory consolidation in Morris water maze task in mice. Behavioural Brain Research, 154(2).

Frank, L. M., Brown, E. N., and Wilson, M. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27(1).

Fyhn, M., Molden, S., Witter, M. P., Moser, E. I., and Moser, M. B. (2004). Spatial representation in the entorhinal cortex. Science.

Griffin, A. L., Eichenbaum, H., and Hasselmo, M. E. (2007). Spatial representations of hippocampal CA1 neurons are modulated by behavioral context in a hippocampus-dependent memory task. Journal of Neuroscience, 27(9).

Hasselmo, M. E. (2005). A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 17(7).

Hasselmo, M. E. (2009). A model of episodic memory: Mental time travel along encoded trajectories using grid cells. Neurobiology of Learning and Memory, 92(4).

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation.

Hu, D. and Amsel, A. (1995). A simple test of the vicarious trial-and-error hypothesis of hippocampal function. Proceedings of the National Academy of Sciences of the United States of America, 92(12).

Johnson, A. and Redish, A. D. (2007). Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point. Journal of Neuroscience, 27(45).

Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Lee, I., Griffin, A. L., Zilli, E. A., Eichenbaum, H., and Hasselmo, M. E. (2006). Gradual translocation of spatial correlates of neuronal firing in the hippocampus toward prospective reward locations. Neuron, 51(5):639-650.

McInnes, L., Healy, J., Saul, N., and Großberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29).

Miller, T. D., Chong, T. T., Davies, A. M., Johnson, M. R., Irani, S. R., Husain, M., Ng, T. W., Jacob, S., Maddison, P., Kennard, C., Gowland, P. A., and Rosenthal, C. R. (2020). Human hippocampal CA3 damage disrupts both recent and remote episodic memories. eLife, 9.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Morris, R. (1990). Does the hippocampus play a disproportionate role in spatial memory? Discussions in Neuroscience, 6:39-45.

Muenzinger, K. F. (1938). Vicarious trial and error at a point of choice: I. A general survey of its relation to learning efficiency. Pedagogical Seminary and Journal of Genetic Psychology, 53(1).

Nakazawa, K., Quirk, M. C., Chitwood, R. A., Watanabe, M., Yeckel, M. F., Sun, L. D., Kato, A., Carr, C. A., Johnston, D., Wilson, M. A., and Tonegawa, S. (2002). Requirement for hippocampal CA3 NMDA receptors in associative memory recall. Science, 297(5579).

O'Keefe, J. and Nadel, L. (1978). The Hippocampus as a Cognitive Map. Clarendon Press, Oxford, United Kingdom.

Recanatesi, S., Farrell, M., Lajoie, G., Deneve, S., Rigotti, M., and Shea-Brown, E. (2019). Signatures of low-dimensional neural predictive manifolds. Cosyne Abstracts 2019, Lisbon, PT.

Redish, A. D. and Touretzky, D. S. (1998). The role of the hippocampus in solving the Morris water maze. Neural Computation, 10(1).

Sandi, C., Davies, H. A., Cordero, M. I., Rodriguez, J. J., Popov, V. I., and Stewart, M. G. (2003). Rapid reversal of stress induced loss of synapses in CA3 of rat hippocampus following water maze training. European Journal of Neuroscience, 17(11).

Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4).

Whittington, J. C., Muller, T. H., Mark, S., Chen, G., Barry, C., Burgess, N., and Behrens, T. E. (2019). The Tolman-Eichenbaum Machine: Unifying space and relational memory through generalisation in the hippocampal formation. bioRxiv, page 770495.

Wood, E. R., Dudchenko, P. A., Robitsek, R. J., and Eichenbaum, H. (2000). Hippocampal neurons encode information about different types of memory episodes occurring in the same location. Neuron, 27(3).

Xu, T. and Barak, O. (2020). Implementing inductive bias for different navigation tasks through diverse RNN attractors. In International Conference on Learning Representations.

A Appendix
A.1 Extrafield place cell firing after sensory prediction task
After pre-training on the sensory prediction task outlined in Figure 1, we observe that when the agent is paused at the top of the stem of the maze, the network representation moves far ahead of the agent, caused by extrafield activity of many units in the network.

Figure 9: A, B, C) Top row: well isolated place fields of three LSTM units indicated in dotted regions after pre-training. Place fields determined by contiguous locality with average activity exceeding 30% of peak field activity during a single left trajectory followed by a right trajectory. Bottom row: agent run from the bottom of the maze stem to the top of the stem (and given a low frequency cue tone halfway up the stem) and paused at the choice point, with the LSTM network repeatedly receiving observations from the choice point for the timesteps thereafter. A) Strong extrafield firing at the choice point with some activity at the cue point. B) Extrafield activity at the choice point and at the position below. C) High extrafield firing at the cue point before the agent pauses at the top of the stem.
Figure 10: Place fields of four LSTM units, starting from i = 0, where the network has been pre-trained on the sensory prediction task, drifting forwards towards reward locations throughout reward training (where i is the number of training iterations). The place fields ultimately rest at maze reward locations at the end of reward training.