A Robotic Model of Hippocampal Reverse Replay for Reinforcement Learning
Matthew T. Whelan, Tony J. Prescott, Eleni Vasilaki∗

Department of Computer Science, The University of Sheffield, Sheffield, UK
Sheffield Robotics, Sheffield, UK
∗Corresponding author: e.vasilaki@sheffield.ac.uk

Keywords: hippocampal replay, reinforcement learning, robotics, computational neuroscience
Abstract
Hippocampal reverse replay is thought to contribute to learning, and particularly reinforcement learning, in animals. We present a computational model of learning in the hippocampus that builds on a previous model of the hippocampal-striatal network viewed as implementing a three-factor reinforcement learning rule. To augment this model with hippocampal reverse replay, a novel policy gradient learning rule is derived that associates place cell activity with responses in cells representing actions. This new model is evaluated using a simulated robot spatial navigation task inspired by the Morris water maze. Results show that reverse replay can accelerate learning from reinforcement, whilst improving stability and robustness over multiple trials. Consistent with the neurobiological data, our study implies that reverse replay can make a significant positive contribution to reinforcement learning, although learning that is less efficient and less stable is possible in its absence. We conclude that reverse replay may enhance reinforcement learning in the mammalian hippocampal-striatal system rather than provide its core mechanism.

Introduction
Many of the challenges in the development of effective and adaptable robots can be posed as reinforcement learning (RL) problems; consequently, there has been no shortage of attempts to apply RL methods to robotics (26, 45). However, robotics also poses significant challenges for RL systems. These include factors such as continuous state and action spaces, real-time and end-to-end learning, reward signalling, behavioural traps, computational efficiency, limited training examples, non-episodic resetting, and lack of convergence due to non-stationary environments (26, 27, 53).

Much of RL theory has been inspired by early behavioural studies in animals ( ), and for good reason, since biology has found many of the solutions to the control problems we are searching for in our engineered systems. As such, with continued developments in biology, and particularly in neuroscience, it would be wise to continue transferring insights from biology into robotics ( ). Yet equally important is the inverse: the use of our computational and robotic models to inform our understanding of the biology (30, 49). Robots offer a valuable real-world testing opportunity for validating computational neuroscience models (24, 36, 43).

Though the neurobiology of RL has largely centred on the role of dopamine as a reward-prediction error signal (39, 42), there are still questions surrounding how brain regions might coordinate with dopamine release for effective learning. Behavioural timescales evolve over seconds, perhaps longer, whilst the timescales for synaptic plasticity in mechanisms such as spike-timing dependent plasticity (STDP) evolve over milliseconds ( ). How does the nervous system bridge these time differentials so that rewarded behaviour is reinforced at the level of synaptic plasticity?

One recent hypothesis offering an explanation to this problem is that of three-factor learning rules (9, 11, 40, 47). In the three-factor learning rule hypothesis, learning at synapses occurs only in the presence of a third factor, with the first and second factors being the typical pre- and post-synaptic activities. This can be stated in a general form as follows,

$$\frac{dw_{ij}}{dt} = \eta\, f(x_j)\, g(y_i)\, M(t) \quad (1)$$

where η is the learning rate, x_j represents a pre-synaptic neuron with index j, y_i a post-synaptic neuron with index i, and f(·) and g(·) are functions mapping respectively the pre- and post-synaptic neuron activities. M(t) represents the third factor, which here is not specific to the neuron indices i and j and is therefore a global term. This third factor is speculated to represent a neuromodulatory signal, in this case best thought of as dopamine, or more generally as a reward signal. Equation 1 appears to possess the problem stated above, of how learning can occur for neurons that were co-active prior to the introduction of the third factor. To solve this, a synapse-specific eligibility trace is introduced, which is a time-decaying form of the pre- and post-synaptic activities ( ),

$$\frac{de_{ij}}{dt} = -\frac{e_{ij}}{\tau_e} + \eta\, f(x_j)\, g(y_i)$$
$$\frac{dw_{ij}}{dt} = e_{ij}\, M(t) \quad (2)$$

The eligibility trace time constant, τ_e, modulates how far back in time two neurons must have been co-active for learning to occur: the larger τ_e is, the more of the behavioural time history will be learned and therefore reinforced. To effectively learn behavioural sequences over the time course of seconds, τ_e is set to be in the range of a few seconds ( ). Work conducted by Vasilaki et al. (47) successfully applied such a learning mechanism in a spiking network model of a simulated agent learning to navigate in a Morris water maze task ( ), in which they used a value of 5 s for τ_e, a value that was optimised for that specific setting.

Hippocampal replay, however, suggests an alternative approach, building on the three-factor learning rule but avoiding the need for synapse-specific eligibility traces. Hippocampal replay was originally shown in rodents as the reactivation during sleep states of hippocampal place cells that were active during a prior awake behavioural episode
(44, 52). During replay events, the place cells retain the temporal ordering experienced during the awake behavioural state, but do so on a compressed timescale: replays typically run through the cell activities over a few tenths of a second, as opposed to the few seconds they took during awake behaviour. Furthermore, experimental results presented after these original findings showed that replays can also occur in the reverse direction, and that these reverse replays occur when the rodent has just reached a reward location (5, 8). Interestingly, these replays repeat the rodent's immediate behavioural sequence leading up to the reward, which led Foster and Wilson (8) to speculate that hippocampal reverse replays, coupled with phasic dopamine release, might be such a mechanism for reinforcing behavioural sequences that lead to rewards.

Whilst it has been well established that hippocampal neurons project to the nucleus accumbens ( ), the proposal that reverse replays may play an important role in RL has since received further support. For instance, there are experimental results showing that reverse replays often co-occur with replays in the ventral striatum ( ), as well as increased activity during awake replays in the ventral tegmental area ( ), an important region for dopamine release. Furthermore, rewards have been shown to modulate the frequency with which reverse replays occur, such that increased rewards promote more reverse replays, whilst decreased rewards suppress them ( ).

To help better understand the role of hippocampal reverse replay in the RL process, we present here a neural RL network model augmented with a hippocampal CA3-inspired network capable of producing reverse replays. The network has been implemented on a simulation of the biomimetic robot MiRo-e ( ) to show its effectiveness in a robotic setting. The RL model is an adaptation of the hippocampal-striatal inspired spiking network of Vasilaki et al. (47), derived using a policy gradient method but modified here for continuous-rate valued neurons. This modification leads to a novel learning rule for this particular network architecture, though one similar to previous learning rules ( ). The hippocampal reverse replay network, meanwhile, is taken from Whelan et al. (50), who implemented the network on the same MiRo robot, itself based on earlier work by Haga and Fukai (18) and Pang and Fairhall (34). We demonstrate in robotic simulations that activity replay improves the stability and robustness of the RL algorithm by providing an additional signal to the eligibility trace.

Materials and Methods

The MiRo Robot and Simulator

We implemented the model using a simulation of the biomimetic robot MiRo-e. The MiRo robot is a commercially available biomimetic robot developed by Consequential Robotics Ltd in partnership with the University of Sheffield. MiRo's physical design and control system architecture are inspired by biology, psychology and neuroscience ( ), making it a useful platform for embedded testing of brain-inspired models of perception, memory and learning ( ). For mobility it is differentially driven, whilst for sensing we make use of its front-facing sonar for the detection of approaching walls and objects. The Gazebo physics engine is used to perform simulations, where we take advantage of the readily available open arena (Figure 1C). The simulator uses the Kinetic Kame distribution of the Robot Operating System (ROS). Full specifications for the MiRo robot, including instructions for simulator setup, can be found on the MiRo documentation web page (4).

Network Overview

The network is composed of a layer of 100 bidirectionally connected place cells, which connects feedforwardly to a layer of 72 action cells via a weight matrix of size
100 × 72 (Figure 1B). In this model, activity in each of the place cells is set to encode a specific location in the environment (32, 33). Place cell activities are generated heuristically using two-dimensional normal distributions of activity inputs, determined as a function of MiRo's position relative to each place field's centre point (Figure 1A). This is similar to other approaches to place cell activity generation (18, 47). The action cells are driven by the place cells, with each action cell encoding a specific heading direction in 5 degree increments; 72 action cells thus encode 360 degrees of possible heading directions. By computing a population vector of the action cell activities, these discrete heading directions can be transformed into continuous headings. For simplicity, MiRo's forward velocity is kept constant at 0.2 m/s. We now describe the details of the network in full.
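To make this concrete, here is a minimal Python sketch of the heuristic place cell input generation just described (formalised in Equation 5 below). The arena size, grid spacing, field width and peak input are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

N_PLACE = 100          # 10 x 10 grid of place cells
I_PMAX = 1.0           # peak place-specific input (assumed value)
D_WIDTH = 0.3          # place-field width constant d, in metres (assumed)

# Place-field centres spread evenly over an assumed 2 m x 2 m arena
xs = np.linspace(-0.9, 0.9, 10)
centres = np.array([(x, y) for y in xs for x in xs])   # shape (100, 2)

def place_input(robot_xy):
    """Gaussian place-specific input for all place cells (cf. Equation 5)."""
    sq_dist = np.sum((centres - np.asarray(robot_xy)) ** 2, axis=1)
    return I_PMAX * np.exp(-sq_dist / (2.0 * D_WIDTH ** 2))

# Cells whose fields lie under the robot receive the strongest input
I_place = place_input((0.1, -0.2))
print(I_place.argmax(), round(I_place.max(), 3))
```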
Hippocampal Place Cell Network

The network model of place cells represents a simplified hippocampal CA3 network that is capable of generating reverse replays of recent place cell sequence trajectories. This model of reverse replays was first presented in (50), but with one minor modification. Whereas the reverse replay model in (50) has a global inhibitory term acting on all place cells, here the place cells have those inhibitory inputs removed from their dynamics. This modification does not affect the ability of the network to produce reverse replays, as shown in the Supplementary Material, where reverse replays both with and without global inhibition are compared.

In more detail, the place cells consist of a network of 100 neurons, each of which is bidirectionally connected to its eight nearest neighbours as determined by the positioning of their place fields. Hence, place cells with neighbouring place fields are bidirectionally connected to one another (Figure 1B), whereas place cells whose place fields are further than one place field apart are not. In this manner, the connectivity of the network represents a map of the environment. This approach is similar to that taken by Haga and Fukai (18) in their model of reverse replay, except that their weights are plastic whilst we keep ours static. The static weights, represented by w^place_jk, indicating the weight projecting from neuron k onto neuron j, are all set to 1, with no cells projecting to themselves. Figure 1B displays the full connectivity schema for the bidirectionally connected place cell network.

Figure 1: The testing environment, showing the simulated MiRo robot in a circular arena. A) Place fields are spread evenly across the environment, with some overlap, and place cell rates are determined by the normally distributed input computed as a function of MiRo's distance from the place field's centre. B) Place cells (blue, bottom set of neurons) are bidirectionally connected to their eight nearest neighbours. These synapses have no long-term plasticity, but do have short-term plasticity. Each place cell connects feedforwardly via long-term plastic synapses to a network of action cells (red, top set of neurons). In total there are 100 place cells and 72 action cells.

The rate of each place cell, represented by x_j, is given as a linearly rectified rate with upper and lower bounds,

$$x_j = \begin{cases} 0 & \text{if } x'_j < 0 \\ 1 & \text{if } x'_j > 1 \\ x'_j & \text{otherwise} \end{cases} \quad (3)$$

The variable x'_j is defined as x'_j = α(I_j − ε), where α and ε are constants and I_j is the cell's activity, which evolves according to time-decaying first-order dynamics,

$$\tau_I \frac{dI_j}{dt} = -I_j + \psi_j I^{syn}_j + I^{place}_j \quad (4)$$

where τ_I is the time constant, I^syn_j is the synaptic input from the cell's neighbouring neurons, and I^place_j is the place-specific input calculated as per a normal distribution of MiRo's position relative to the place field's centre point. ψ_j represents the place cell's intrinsic plasticity, detailed further below.

Each place cell has associated with it a place field in the environment defined by its centre point and width, with place fields distributed evenly across the environment (100 in total). As stated, the place-specific input, I^place_j, is computed from a two-dimensional normal distribution determined by MiRo's distance from the place field's centre point,

$$I^{place}_j = I_{pmax} \exp\left[-\frac{(x^c_{MiRo} - x^c_j)^2 + (y^c_{MiRo} - y^c_j)^2}{2d^2}\right] \quad (5)$$

where I_pmax determines the maximum value of the place cell input, (x^c_MiRo, y^c_MiRo) represents MiRo's (x, y) coordinate position in the environment, whilst (x^c_j, y^c_j) is the location of the place field's centre point. The term d in the denominator is a constant that determines the width of the place field.

The synaptic input, I^syn_j, is computed as a sum over neighbouring synaptic inputs modulated by the effects of short-term depression and facilitation, D_k and F_k respectively,

$$I^{syn}_j = \lambda \sum_k w^{place}_{jk}\, x_k\, D_k\, F_k \quad (6)$$

where w^place_jk is the weight projecting from place cell k onto place cell j. In this model, all these weights are fixed at a value of 1. λ takes on a value of 0 or 1 depending on whether MiRo is exploring (λ = 0) or is at the reward (λ = 1). This prevents any synaptic transmission during exploration, but not whilst MiRo is at the reward (the point at which reverse replays occur). This two-stage approach can be found in similar models as a means to separate an encoding stage during exploration from a retrieval stage ( ), and was a key feature of some of the early associative memory models ( ). Experimental evidence also supports this two-stage process due to the effects of acetylcholine. Acetylcholine levels have been shown to be high during exploration but drop during rest ( ), whilst acetylcholine itself has the effect of suppressing the recurrent synaptic transmissions in the hippocampal CA3 region ( ). It is for this reason that the global inhibitory inputs found in (50) are not necessary: the λ term here performs functionally the same operation as the inhibitory inputs (inhibition is decreased during reverse replays, thus increasing synaptic transmission), yet is simpler to implement.

D_k and F_k in Equation 6 are respectively the short-term depression and short-term facilitation terms, and for each place cell these are computed as (as in ( ), but see (7, 46, 48)),

$$\frac{dD_k}{dt} = \frac{1 - D_k}{\tau_{STD}} - x_k D_k F_k \quad (7)$$

$$\frac{dF_k}{dt} = \frac{U - F_k}{\tau_{STF}} + U(1 - F_k)\, x_k \quad (8)$$

where τ_STD and τ_STF are the time constants, and U is a constant representing the steady-state value of short-term facilitation when there is no neuron activity (x_k = 0). D_k and F_k each take on values in the range [0, 1]. Notice that when x_k > 0, short-term depression is driven steadily towards 0, whereas short-term facilitation is driven steadily upwards towards 1. Modification of the time constants allows either short-term depression or short-term facilitation effects to dominate. In this model, the time constants are chosen so that depression is the primary short-term effect. This ensures that during reverse replay events, activity propagating from one neuron to the next quickly dissipates, allowing for stable replays without activity exploding in the network.

We turn finally to the intrinsic plasticity term in Equation 4, represented by ψ_j. Its behaviour, as observed in Equation 4, is to scale all incoming synaptic inputs. In (34), Pang and Fairhall used a heuristically developed sigmoid whose output was determined as a function of the neuron's rate. Intrinsic plasticity in their model did not decay once it had been activated. Since our robot often travels across most of the environment, we needed a time-decaying form of intrinsic plasticity to avoid potentiating all cells in the network. The simplest form of such time-decaying intrinsic plasticity is therefore

$$\frac{d\psi_j}{dt} = \frac{\psi_{ss} - \psi_j}{\tau_\psi} + \frac{\psi_{max} - 1}{1 + \exp[-\beta(x_j - x_\psi)]} \quad (9)$$

where τ_ψ is its time constant, and ψ_ss is a constant that determines the steady-state value for when the sigmoidal term on the right is 0. ψ_max, β and x_ψ are all constants that determine the shape of the sigmoid. Since ψ_j could potentially grow beyond the value of ψ_max, we restrict ψ_j so that if ψ_j > ψ_max, then ψ_j is set to ψ_max.
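For illustration, the place cell dynamics of Equations 3–9 can be integrated with a simple forward Euler step, as in the sketch below. All parameter values are placeholders rather than the calibrated values of Table 1, and W is a random stand-in for the fixed eight-nearest-neighbour weight matrix.

```python
import numpy as np

def place_cell_step(I, D, F, psi, I_place, W, lam, dt=0.001,
                    alpha=1.0, eps=0.0, tau_I=0.05,
                    tau_STD=0.3, tau_STF=2.0, U=0.4,
                    tau_psi=5.0, psi_ss=1.0, psi_max=8.0,
                    beta=5.0, x_psi=0.5):
    """One forward-Euler step of the place cell dynamics (Equations 3-9).
    All parameter values here are illustrative placeholders."""
    x = np.clip(alpha * (I - eps), 0.0, 1.0)               # rectified rate (Eq. 3)
    I_syn = W @ (x * D * F)                                # neighbour input
    I = I + dt / tau_I * (-I + psi * lam * I_syn + I_place)  # Eq. 4, gated by lambda (Eq. 6)
    D = D + dt * ((1.0 - D) / tau_STD - x * D * F)         # short-term depression (Eq. 7)
    F = F + dt * ((U - F) / tau_STF + U * (1.0 - F) * x)   # short-term facilitation (Eq. 8)
    sigmoid = 1.0 / (1.0 + np.exp(-beta * (x - x_psi)))
    psi = psi + dt * ((psi_ss - psi) / tau_psi + (psi_max - 1.0) * sigmoid)  # Eq. 9
    psi = np.minimum(psi, psi_max)                         # hard ceiling on psi
    return I, D, F, psi, x

# Example usage with a random stand-in for the 8-neighbour weight matrix
n = 100
rng = np.random.default_rng(0)
W = (rng.random((n, n)) < 0.08).astype(float)
I, D, F, psi = np.zeros(n), np.ones(n), np.full(n, 0.4), np.ones(n)
I, D, F, psi, x = place_cell_step(I, D, F, psi, np.zeros(n), W, lam=0.0)
```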
In order to initiate a replay event, place cell inputs, computed using Equation 5 with MiRo's current location at the reward, are input into the place cell dynamics (Equation 4) one second after MiRo reaches the reward, for a duration of 100 ms. Intrinsic plasticity for those cells that were most recently active during the trajectory is increased, whilst synaptic conductance in the place cell network is turned on by setting λ = 1. This causes the place cell input to activate only those adjacent cells that were recently active. This effect continues throughout all recently active cells, thus resulting in a reverse replay. Short-term depression ensures that the activity dissipates quickly as it propagates from one neuron to the next.

Striatal Action Cells

The action cell values determine how MiRo moves in the environment. All place cells project feedforwardly through a set of plastic synapses to all action cells, as shown in Figure 1B. There are 72 action cells, the value of each drawn from a Gaussian distribution with mean ỹ_i and variance σ²,

$$y_i \sim \mathcal{N}\left(\tilde{y}_i, \sigma^2\right) \quad (10)$$

The mean value ỹ_i is calculated as follows,

$$\tilde{y}_i = \frac{1}{1 + \exp\left[-c_1 \sum_{j=1}^{100} w^{PC\text{-}AC}_{ij} x_j - c_2\right]} \quad (11)$$

with c_1 and c_2 determining the shape of the sigmoid. w^{PC-AC}_ij represents the weight projecting from place cell j onto action cell i. The sigmoidal function is one possible choice, which results in saturating terms in the RL learning rule (see The Learning Rule below); an alternative option, for instance, could have been a linear function. The action cells are restricted to take on values between 0 and 1, i.e. y_i ∈ [0, 1], and can be interpreted as normalised firing rates.

MiRo moves at a constant forward velocity, whilst the output of the action cells sets a target heading for MiRo to move in. This target heading is allocentric, in that the heading is relative to the arena. The activity of each action cell is denoted y_i and the target heading θ_target. To find the heading from the action cells, the population vector of the action cell values is computed as follows,

$$\theta_{target} = \arctan\left(\frac{\sum_i y_i \sin\theta_i}{\sum_i y_i \cos\theta_i}\right) \quad (12)$$

where θ_i is the angle coded for by action cell i. It is also possible to compute the magnitude of the population vector, which denotes how strongly the action cell activities are promoting a particular heading,

$$m_{target} = \sqrt{\left(\sum_i y_i \sin\theta_i\right)^2 + \left(\sum_i y_i \cos\theta_i\right)^2} \quad (13)$$

For practical reasons, the action cells are computed not only from place cell inputs, but also by a separate module, termed a semi-random walk module. This is because the network, particularly in the early stages of exploration when the weights are randomised, is often unable to make usable directional decisions. A simple implementation of a semi-random walk module therefore allows MiRo to explore the environment sensibly, as opposed to erratically, when the randomised network weights are used. The details of the semi-random walk implementation are given below.
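A minimal sketch of the population-vector readout of Equations 12 and 13 follows. The two-argument arctangent is used here to keep the decoded heading in the correct quadrant, an implementation choice assumed for this sketch rather than stated in the text.

```python
import numpy as np

N_ACTION = 72
theta = np.deg2rad(np.arange(N_ACTION) * 5.0)   # preferred heading of each cell

def decode_heading(y):
    """Population-vector heading and magnitude (Equations 12-13)."""
    s, c = np.sum(y * np.sin(theta)), np.sum(y * np.cos(theta))
    return np.arctan2(s, c), np.hypot(s, c)

# Example: a bump of activity centred on cell 18 (i.e. 90 degrees)
y = np.exp(-0.5 * ((np.arange(N_ACTION) - 18) / 3.0) ** 2)
heading, magnitude = decode_heading(y)
print(np.rad2deg(heading), magnitude)
```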
Semi-Random Walk Module – In cases where the signal provided by the action cells, as computed by Equation 13, is not strong enough (i.e. less than 1), MiRo takes a random walk rather than following the direction selected by the action cells. To compute the heading, a small but random value, θ_noise, is added to MiRo's current heading,

$$\theta_{random\ walk} = \theta_{current} + \theta_{noise} \quad (14)$$

where θ_noise is a random variable drawn from a uniform distribution over a small symmetric range of angles, θ_noise ∼ unif(−θ_n, θ_n). This ensures that MiRo generally keeps moving in its current direction, but is capable of veering slightly to the left or right, though by no more than θ_n degrees.

To convert this into action cell values, each action cell is computed as a function of its angular distance from θ_random walk, in a similar manner to how the place cell activities were computed from the Cartesian distance of MiRo from the place field centres,

$$y^{random\ walk}_i = y^{max}_i \exp\left[-\frac{(\theta_{random\ walk} - \theta_i)^2}{2\theta_d^2}\right] \quad (15)$$

where y^max_i determines the maximum value of y_i, in this case 1, θ_d determines the distribution width, and θ_i is the angle corresponding to action cell i.

To state this more formally, let the magnitude of the place cell network proposal be (see Equation 13)

$$m_{PC\ proposal} = \sqrt{\left(\sum_i \tilde{y}_i \sin\theta_i\right)^2 + \left(\sum_i \tilde{y}_i \cos\theta_i\right)^2} \quad (16)$$

Then the final action cell values are changed to y_i = y^{random walk}_i only if m_PC proposal < 1; otherwise they remain as given by Equation 10.
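The selection between the network proposal and the semi-random walk (Equations 14–16) can be sketched as below. The noise bound theta_n and tuning width theta_d are assumed values; the text specifies the logic but those numeric bounds are not reproduced here.

```python
import numpy as np

N_ACTION = 72
theta = np.deg2rad(np.arange(N_ACTION) * 5.0)   # preferred heading per cell
rng = np.random.default_rng(0)

def random_walk_cells(theta_current, theta_n=np.deg2rad(20.0),
                      theta_d=np.deg2rad(15.0)):
    """Action-cell values for a semi-random walk (Equations 14-15).
    theta_n and theta_d are illustrative assumptions."""
    target = theta_current + rng.uniform(-theta_n, theta_n)     # Eq. 14
    diff = np.angle(np.exp(1j * (theta - target)))              # wrapped angular distance
    return np.exp(-diff ** 2 / (2.0 * theta_d ** 2))            # Eq. 15, with y_max = 1

def select_actions(y_tilde, y_sampled, theta_current):
    """Keep the network's sampled output only if its population vector is
    strong enough (Equation 16, threshold 1); otherwise random-walk."""
    m = np.hypot(np.sum(y_tilde * np.sin(theta)),
                 np.sum(y_tilde * np.cos(theta)))
    return y_sampled if m >= 1.0 else random_walk_cells(theta_current)
```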
Computing Action Cells During Reverse Replays – The computation of y_i in Equation 10 is suitable for the exploration stage, but requires a minor modification in order for the action cells to properly replay during reverse replay events. Thus far, y_i is computed either by taking the network's output as determined by the place cell inputs or, if this output is weak, by using a semi-random walk. In order for the y_i term to compute properly in the reverse replay case, we perform the following,

$$y^{replay}_i = \frac{1}{1 + \exp\left[-c_1 \sum_{j=1}^{100} \left(w^{PC\text{-}AC}_{ij} + 0.1\,\mathrm{sgn}(e^r_{ij})\right) x_j - c_2\right]} \quad (17)$$

which is the same computation as Equation 11, with the only difference being that the place cell to action cell weights have added to them the value 0.1 multiplied by the sign of the eligibility trace for that synapse. The term e^r_ij represents the value of e_ij, i.e. a trace of the potential synaptic changes, at the moment of reward retrieval. This effectively stores the history of synaptic activity and adds a transient weight increase to synapses that were recently active. How this eligibility trace is computed is described in The Learning Rule below.

Modifying the action cells during replays is necessary so that a reverse replay of the place cells can appropriately reinstate the activity in the action cells ( ). Without this change the reverse replays would offer no additional benefits. This modification acts like a synaptic tag that activates at reward retrieval only and provides temporary synaptic modifications, according to the sign of the eligibility trace, during the reverse replay stage. Despite this assumption, the temporary change in synaptic strengths is similar in nature to acetylcholine levels modifying synaptic conductances during replay events in the hippocampus ( ). In other words, synaptic weights (and their modifications) are suppressed during exploration, but are manifest during the replay stage.

We also tested the rule using a weaker assumption, adding the value 0.1 only for synapses in which e_ij > 0, whilst adding nothing for synapses where e_ij < 0. However, this performed worse than even the non-replay case. On closer inspection, the reason proved to be a loss of causality during replay events. Since replays activate multiple cells simultaneously, synapses with e_ij < 0 are influenced by neighbouring place and action cell activities without having the negative eligibility to counter those effects. This influence caused their weights to increase, as opposed to decreasing, which would be the proper direction given a negative value of e_ij.

The Learning Rule

The learning rule has been derived using a policy gradient reinforcement learning method ( ). Its form is that of a three-factor learning rule with an eligibility trace ( ). The full derivation of the learning rule is presented in the Appendix.

When MiRo is exploring, a learning rule of the following form is implemented,

$$\frac{dw^{PC\text{-}AC}_{ij}}{dt} = \frac{\eta}{\sigma^2} R\, e_{ij} \quad (18)$$

where R is a reward value, whilst the term e_ij represents the eligibility trace and is a time-decaying function of the potential weight changes, determined by

$$\frac{de_{ij}}{dt} = -\frac{e_{ij}}{\tau_e} + (y_i - \tilde{y}_i)(1 - \tilde{y}_i)\,\tilde{y}_i\, x_j \quad (19)$$

During reverse replays, however, the activity of the action cells is given by y^replay_i. In order to retain mathematical accuracy in the derivation (see Appendix), we derived a similar learning rule from a supervised learning approach, having the following form,

$$\frac{dw^{PC\text{-}AC}_{ij}}{dt} = \eta' e_{ij} \quad (20)$$

where the eligibility trace is determined by

$$\frac{de_{ij}}{dt} = -\frac{e_{ij}}{\tau_e} + \left(y^{replay}_i - \tilde{y}_i\right)(1 - \tilde{y}_i)\,\tilde{y}_i\, x_j \quad (21)$$

We have set η' = η/σ² and let R = 1 at the reward location in our simulations, which renders the RL rule and the supervised learning rule equivalent.
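As a concrete sketch, both updates can be integrated with forward Euler as follows. Shapes follow the 100-place-cell, 72-action-cell network, while η, σ, τ_e and the time step are illustrative values only (the paper varies η and τ_e across experiments).

```python
import numpy as np

def update_exploring(w, e, x, y, y_tilde, R, dt=0.01,
                     eta=0.5, sigma=0.2, tau_e=1.0):
    """Three-factor RL update during exploration (Equations 18-19).
    w, e: (72, 100); x: (100,); y, y_tilde: (72,); R: scalar reward."""
    post = (y - y_tilde) * (1.0 - y_tilde) * y_tilde      # post-synaptic factor
    e += dt * (-e / tau_e + np.outer(post, x))            # eligibility trace (Eq. 19)
    w += dt * (eta / sigma ** 2) * R * e                  # weight change (Eq. 18)
    return w, e

def update_replaying(w, e, x, y_replay, y_tilde, dt=0.01,
                     eta_prime=12.5, tau_e=1.0):
    """Supervised update during reverse replays (Equations 20-21).
    eta_prime = eta / sigma**2, matching the equivalence noted above."""
    post = (y_replay - y_tilde) * (1.0 - y_tilde) * y_tilde
    e += dt * (-e / tau_e + np.outer(post, x))            # Eq. 21
    w += dt * eta_prime * e                               # Eq. 20
    return w, e
```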
Population Weight Vectors

The population weight vector for a single place cell is computed as

$$\left(w^x_j, w^y_j\right) = \left(\sum_{i=1}^{72} w^{PC\text{-}AC}_{ij} \cos\theta_i,\ \sum_{i=1}^{72} w^{PC\text{-}AC}_{ij} \sin\theta_i\right) \quad (22)$$

where (w^x_j, w^y_j) represents the x and y components of the weight population vector of the j-th place cell, w^{PC-AC}_ij is the value of the weight from place cell j onto action cell i, and θ_i is the heading direction that action cell i codes for. The magnitude of the population weight vector can then be computed as

$$M^w_j = \sqrt{\left(w^x_j\right)^2 + \left(w^y_j\right)^2} \quad (23)$$

The population weight vector depicts the preferred direction of MiRo when placed at the centre of the place cell's field.

Implementation

A description of the full implementation process is provided here, with an overview of the algorithmic implementation presented in the Supplementary Material.

Initialisation – At the start of a new experiment, the weights connecting the place cells to the action cells are randomised and then normalised,

$$w^{PC\text{-}AC}_{ij} \leftarrow \frac{w^{PC\text{-}AC}_{ij}}{\sum_i w^{PC\text{-}AC}_{ij}} \quad (24)$$

All the variables for the place cells are set to their steady-state conditions for when no place-specific inputs are present, and the action cells are all set to zero. MiRo is then placed at a random location in the arena.

Taking Actions – There are three main actions MiRo can take, depending on whether the reward it receives is positive (R = 1), in which case it is at the goal; negative (R = −1), meaning MiRo has reached a wall; or zero (R = 0) in all other cases. If the reward is 0, the action cell values y_i are computed according to Equation 10, or y^{random walk}_i is computed from Equation 15 if m_PC proposal < 1, setting then y_i = y^{random walk}_i. From this, a heading is computed using Equation 12. MiRo moves at a constant forward velocity along this heading, with a new heading being computed every 0.5 s. If MiRo reaches a wall, a wall avoidance procedure is used, turning MiRo around 180°. Finally, if MiRo reaches the goal, it pauses there for 2 s, after which it heads to a new random starting location.

Determining Reward Values – As described above, there are three reward values that MiRo can collect. If MiRo has reached a wall, a reward of R = −1 is presented to MiRo for a period of 0.5 s, which tends to occur during MiRo's wall avoidance procedure. If MiRo has found the goal, a reward of R = +1 is presented to it for a period of 2 s. If neither of these conditions holds, MiRo receives no reward, i.e. R = 0.

Initiating Reverse Replays – Reverse replays are initiated only when MiRo reaches the goal location, not when MiRo is avoiding a wall. When reverse replays are initiated, λ is set to 1 to allow hippocampal synaptic conductance, and the place-specific input for MiRo's position at the goal, I^place_j, is injected 1 s after MiRo first reaches the goal, for a total time of 100 ms. With synaptic conductance enabled, and due to intrinsic plasticity, this initiates reverse replay events starting at the goal location and travelling back through the recent trajectory in the place cell network. An example of a reverse replay can be found in the Supplementary Material. Whilst learning proceeds as standard in the non-replay stage, using Equations 18 and 19 when MiRo first reaches the goal, once the replays start learning is done using the supervised learning rule of Equations 20 and 21.

Updating Network Variables – Regardless of whether MiRo is exploring, avoiding a wall, or is at the goal initiating replays, all the network variable updates, including the weight updates, occur at every time step of the simulation. Only when MiRo has reached the goal, gone through the 2 s of reward collection, and is making its way to a new random start location are all the variables reset, as in the Initialisation step above (though excluding the randomisation of the weights). This then begins a new trial of the experiment.
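Putting the steps above together, one trial can be summarised by the following high-level sketch. Every name here (robot, net and their methods) is a hypothetical stand-in for the corresponding step described above, not an API from the paper's code or from MiRo's ROS interface.

```python
def run_trial(robot, net):
    """One trial of the water-maze-like task, paraphrasing the steps above."""
    net.reset_dynamics()                      # steady-state init; weights kept
    while True:
        R = robot.reward()                    # +1 at goal, -1 at wall, else 0
        if R == -1:
            robot.turn_around()               # 180-degree wall avoidance
            net.learn(R)                      # negative reward applied for 0.5 s
        elif R == +1:
            net.learn(R)                      # standard rule at the reward
            net.wait(1.0)                     # replay starts 1 s after arrival
            net.reverse_replay(duration=0.1)  # supervised rule during the replay
            robot.goto_random_start()
            return                            # next trial begins
        else:
            heading = net.propose_heading()   # population vector or random walk
            robot.drive(heading, speed=0.2)   # constant 0.2 m/s, new heading each 0.5 s
        net.step()                            # integrate all dynamics each tick
```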
Model Parameters

All parameter values used in the hippocampal network are summarised in Table 1, and those for the striatal network in Table 2. Values for η and τ_e are specified in the Results, since they are modified across the various experiments.

Parameter   Value
α           … C⁻¹
ε           … A
τ_I         … s
I_pmax      … A
d           … m
λ           0 or 1, see text
τ_STD       … s
τ_STF       … s
U           …
ψ_ss        …
ψ_max       …
τ_ψ         … s
β           …
x_ψ         … Hz

Table 1: Summarising the model parameter values for the hippocampal network used in the experiments. All these parameters are kept constant across all experiments.

Parameter   Value
c_1         …
c_2         …
σ           …
θ_d         …
τ_e         see text
η           see text
Table 2: Summarising the model parameter values for the striatal network used in the experiments. Except for the learning rate, η, and the eligibility trace time constant, τ_e, all other parameters are kept constant for all experiments.

Results

This results section is divided into two subsections. Presented first are the results of running the model without reverse replays, to demonstrate the functionality of the network and the learning rule. Following this, the model is run with reverse replays, and these results are compared with the non-replay case. All model parameters and the learning rule are kept equal between the two cases to facilitate the comparison; however, when we compare the two models in terms of performance we optimise the key parameters differentially for each model, comparing best with best performance.
Learning Without Reverse Replays

We first demonstrate the functionality of the learning rule (Equations 18 and 19) without reverse replays. Figure 2A shows the time taken to reach the hidden goal as a function of trial number, averaged across 20 independent experiments. The time to reach the goal approaches asymptotic performance after around 5 trials. Note, however, that there appears to be larger variance towards the final two trials. Further trials were later run to test whether this increased variability in performance was significant or not (see Comparison of Best Cases below). Figure 2B displays the weight vectors for the weights projecting from the place cells to the action cells. We note that after 20 trials the arrows in general point towards the direction of the goal.

Figure 2: Results for the non-replay case, testing that the derived learning rule performs well. Parameters used were η = 0.… and τ_e = 1 s. A) Plot showing the average time to reach the goal (red line) and standard deviations (shaded area) over 20 trials. Averages and standard deviations are computed from 20 independent experiments. B) Weight population vectors at the start of trial 1 versus at the end of trial 20 in an example experiment. Magnitudes of the vectors are represented as a shade of colour; the darker the shade, the larger the magnitude. Red dots indicate the goal location.

Learning With Reverse Replays

We then ran the model with reverse replays, implementing the learning rule of Equations 20 and 21, at first using the same learning rate and eligibility trace time constant as in the non-replay case above. The performance averages showed no significant difference (p > 0.…, Wilcoxon signed-rank test across 18 trials). The average time to reach the goal over the last 10 trials is 6.21 s in the non-replay case and 6.92 s in the replay case (data not shown, see Supplementary Material). This suggests in the first instance that replays are at least as good as the best-case non-replay, which was also confirmed when comparing individually optimised parameters (learning rate and eligibility trace time constant) for each network. Further results on the performance effects of varying the learning rate and eligibility trace time constant are presented next.

Reducing the Eligibility Trace Time Constant

Given that the standard, non-replay model requires the recent history to be stored in the eligibility trace, it follows that having too small an eligibility trace time constant might negatively impact the performance of the model. This is because the time constant reflects how far back in time the information about the reward will be "transmitted". Reverse replays, however, have the potential to compensate for this, since the recent history is also stored, and then replayed, in the place cell network. Figure 3 shows the effects on performance of significantly reducing the eligibility trace time constant (to τ_e = 0.… s). Both cases, with and without reverse replays, are compared. If the learning rate is too small (η = 0.…) then neither case shows any learning. But as the learning rate is increased, having reverse replays significantly improves performance. Similar results are found for a larger, but still small, eligibility trace time constant of τ_e = 0.… s (see Supplementary Material).

Weight Changes After Reward Retrieval

An interesting comparison can be made between the magnitudes of weight changes in the replay and non-replay cases. Figure 4 shows the population vectors of the weights after reward retrieval. Population vectors for the weights are computed according to Equations 22 and 23. There are two observations to be made here.
The first is that the weight magnitudes are greater when reverse replays are implemented, which is expected since activity replay offers additional information to the synaptic changes. The second is that the directions of the population weight vectors themselves are slightly different, particularly at the start of the trajectory. In particular, the weight vectors point more towards the goal location in the replay case, whereas the non-replay case has weight vectors pointing along the direction of the path taken by the robot. Whilst only one case is depicted here, it is representative of a number of cases across various parameter values.

Figure 3: Comparing the effects of a small eligibility trace time constant with and without reverse replays. τ_e = 0.… s across all figures. Thick lines are averages across 40 independent experiments, with shaded areas representing one standard deviation. Moving averages, averaged across 3 trials, are plotted here for smoothness.

Figure 4: Population weight vectors after reward retrieval in the non-replay and replay cases. Top figure shows the path taken by MiRo, where S represents the starting location and G the goal location. Plots show weight population vectors for the non-replay case (A) and standard replay case (B) with τ_e = 1 s; η = 0.…. The colour scale represents the magnitude of each weight vector.

Robustness Across Parameter Values

We investigated the robustness of the performance across various values of τ_e and η. Figure 5 displays the average performance over the last 10 trials, comparing again with replays versus without replays. There are two noticeable observations to make here. Firstly, when the eligibility trace time constant is small, employing reverse replays shows considerable improvements in performance over the non-replay case across the various learning rates. Learning still occurs in the non-replay case; however, it is noticeably diminished compared with the replay case. Secondly, although this marked improvement in performance vanishes for larger eligibility trace time constants, reverse replays do not at the very least hinder performance.

Figure 5: Comparing average performance across a range of values for τ_e and η. Bars show the average time taken to reach the goal. These plots are found by first averaging across 40 independent experiments (as shown in Figure 3, for instance), and then averaging over the last 10 trials. Error bars indicate the 10-trial average of standard deviations.

Comparison of Best Cases

Figure 6 compares the results for the best cases with and without reverse replays. To achieve these results we optimised τ_e and η independently for each case and ran a total of 30 trials. The reason for this was a suspected instability in the non-replay case as the number of trials increased, as indicated in Figure 2. A Wilcoxon signed-rank test was run on all trials for the two cases, and for 8 of the last 12 trials (trials 19-30) there were significant differences between the two (p < 0.…; the full table of results can be found in the Supplementary Material), despite there being no significant differences in trials 0-18 (see Learning With Reverse Replays above). This instability in the non-replay case is not observed in the case with replays.

Figure 6: Comparing the best cases with and without reverse replays. With reverse replays the parameters are τ_e = 0.… s, η = 1. Without reverse replays the parameters are τ_e = 1 s, η = 0.…. The means (solid lines) and standard deviations (shaded regions) are computed across 40 independent experiments.

We also
note the difference in parameters for the best cases. With reverse replays the parameters are τ_e = 0.… s, η = 1, whereas without reverse replays they are τ_e = 1 s, η = 0.…. We speculate that the necessary choice of eligibility trace time constant in the non-replay case (i.e. that it necessarily needs to be large enough to store the trajectory history) is the cause of this instability.

Discussion

Hippocampal reverse replay has long been implicated in reinforcement learning ( ), but how the dynamics of hippocampal replay produce behavioural changes, and why hippocampal replay could be important in learning, are ongoing questions. By first embodying a hippocampal-striatal inspired model (47) in a simulated MiRo robot, and then augmenting it with a model of hippocampal reverse replay (50), we have been able to examine the link between hippocampal replay and behavioural changes in a spatial navigation task. We have shown that reverse replays can enable quicker reinforced learning, as well as generating more robust behavioural trajectories over repeated trials.

In the three-factor, synaptic eligibility trace hypothesis, the time constants for the traces have been argued to be on the order of a few seconds, as necessary for learning over behavioural time scales ( ). However, our results indicate that, due to reverse replays, it is not necessary for synaptic eligibility trace time constants to be on the order of seconds – a few milliseconds is sufficient. The synaptic eligibility trace is still required here for storing the history; it just does not matter how much of the eligibility trace is stored, only that enough is stored for effective reinstatement during a reverse replay. It has also been argued that neuronal, as opposed to synaptic, eligibility traces could be sufficient for storing a memory trace, as in the two-compartmental neuron model of ( ). Intrinsic plasticity in this model is not unlike a neuronal eligibility trace, storing the memory trace within the place cells for reinstatement at the end of a rewarding episode.

It could be the case that reverse replays speed up learning by introducing an additional source of information regarding past states, and the results shown here provide some support for this. Experimental evidence shows, for instance, that disruption of hippocampal ripples during awake states, when reverse replays occur, disrupts but does not completely abolish spatial learning in rats ( ). Whilst the longer eligibility trace time constants in this model (τ_e = 1 s, … s) do not show diminished performance without reverse replays, the smaller time constants (τ_e = 0.… s, 0.… s) do. Hence, these results support the view that reverse replays enhance, rather than provide entirely, the mechanism for learning. Beyond reverse replays, however, forward replays have been known to occur on multiple occasions for up to 10 hours post-exploration ( ), and these could be more important for memory consolidation than awake reverse replays
6, 12 ).In the best case comparison (Figure 6), it is clear why a sufficiently large, yet not overlylarge, eligibility trace time constant for the non-replay case gives best performance – it muststore a suitable amount of the trajectory history for learning. If the eligibility trace time con-stant were too small, it would not store enough of the history, whereas too large and it storessub-optimal or unnecessary trajectories that go too far back in time. Yet the non-replay modelbecame more unstable as the number of trials increased, as shown in Figure 6. One expla-nation for this is that the eligibility trace time constant necessary for learning in non-replayhad to be large enough to store trajectory histories, but doing this increases the probability thatsub-optimal paths may be learned. For the replay case however, since the trajectory was re-played during learning, it was not necessary to have such a large eligibility trace time constant.Sub-optimal paths going further back in time are therefore no longer as strongly learned. Fur-thermore, replays are able to modify slightly the behavioural trajectories. By looking at theeffects in the weight vectors of Figure 4, it is apparent that the weight vectors closer to the startlocation are shifted to point more towards the goal in the replay case. Reverse replays could26elp in solving the exploration-exploitation problem in RL ( ), since they could simulate moreoptimal trajectories that were not explicitly taken during behaviour.In this model, there are two sets of competing behaviours during the exploratory stage – thememory guided behaviour of the hippocampus and the semi-random walk behaviour – whichare heuristically selected for based on the signal strength of the hippocampal output: If thehippocampal output does not express strongly for a particular action, the semi-random walkbehaviour is implemented instead. An interesting comparison with the basal ganglia, and itsinput structure the striatum, could be made here, since these structures are thought to play a rolein action selection (
15, 29, 37, 39 ). A basic interpretation of this action selection mechanismis that the basal ganglia receives a variety of candidate motor behaviours, each of which areperhaps mutually incompatible, but from which the basal ganglia must select one (or more)of these behaviours for expressing (
16, 17 ). Since the selection of an action in our model isdetermined from the striatal action cell outputs, it appears likely that this selection would occurwithin the basal ganglia.But perhaps more interesting is that in the synaptic learning rule presented here, the differ-ence between the action selected, y i , and the hippocampal output, ˜ y i , is used to update synapticstrengths. One interpretation for this could be that this difference behaves as an error signal,signalling to the hippocampal-striatal synapses how “good”, or how “close”, their predictionswere in generating behaviours that led towards rewards. But how might this be implementedin the basal ganglia? Whilst the striatum acts as the input structure to the basal ganglia, neu-roanatomical evidence shows that the basal ganglia sub-regions loop back on one another ( ),and that in particular the striatum sends inhibitory signals to the substantia nigra (SN), whichin turn projects back both excitatory and inhibitory signals via dopamine (D1 and D2 receptorsrespectively) to the striatum (
10, 19 ). There is therefore a potential mechanism for appropriatefeedback to the hippocampal-striatal synapses in order to provide this error signalling, and an27xploration of this error signal hypothesis could be a potentially interesting research endeavour.
This work has explored the role that reverse replays may play in biological reinforcement learning. As a baseline, we derived a policy gradient reinforcement learning rule, which we employed to associate actions with place cell activities. This is a three-factor learning rule with an eligibility trace, where the eligibility trace stores the pairwise co-activities of place and action cells. The learning rule was shown to perform successfully when applied to a simulated MiRo robot in a Morris water-maze-like task. We further augmented the network and learning rule with reverse replays, which acted to reinstate recent place and action cell activities. The effect of these replays was that learning improved significantly in circumstances where the eligibility traces did not store sufficient activity history. In addition, replays had the effect of generating more stable performance as the number of trials increased. Learning with reverse replays improved upon the case without replays, yet learning was still achievable without them. Reverse replay may therefore enhance reinforcement learning in the hippocampal-striatal network whilst not necessarily providing its core mechanism.
Funding
This work has been funded in part by the EU Horizon 2020 programme through the FET Flagship Human Brain Project (HBP-SGA2: 785907; HBP-SGA3: 945539).
Acknowledgments
The authors would like to thank Andy Philippides and Michael Mangan for their valuable input and useful discussions.

References
1. R Ellen Ambrose, Brad E Pfeiffer, and David J Foster. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron, 91(5):1124–1136, 2016.
2. Guo-qiang Bi and Mu-ming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18(24):10464–10472, 1998.
3. Johanni Brea, Alexisz Tamás Gaál, Robert Urbanczik, and Walter Senn. Prospective coding by spiking neurons. PLoS Computational Biology, 12(6), 2016.
4. Consequential Robotics. Documentation for the MiRo-e robot: http://labs.consequentialrobotics.com/miro-e/docs/, 2019.
5. Kamran Diba and György Buzsáki. Forward and reverse hippocampal place-cell sequences during ripples. Nature Neuroscience, 10(10):1241–1242, 2007.
6. Valérie Ego-Stengel and Matthew A Wilson. Disruption of ripple-associated hippocampal activity during rest impairs spatial learning in the rat. Hippocampus, 20(1):1–10, 2010.
7. Umberto Esposito, Michele Giugliano, and Eleni Vasilaki. Adaptation of short-term plasticity parameters via error-driven learning may explain the correlation between activity-dependent synaptic properties, connectivity motifs and target specificity. Frontiers in Computational Neuroscience, 8:175, 2015.
8. David J Foster and Matthew A Wilson. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680–683, 2006.
9. Nicolas Frémaux and Wulfram Gerstner. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Frontiers in Neural Circuits, 9:85, 2016.
10. Charles R Gerfen, Thomas M Engber, Lawrence C Mahan, Zvi Susel, Thomas N Chase, Frederick J Monsma, and David R Sibley. D1 and D2 dopamine receptor-regulated gene expression of striatonigral and striatopallidal neurons. Science, 250(4986):1429–1432, 1990.
11. Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea. Eligibility traces and plasticity on behavioral time scales: experimental support of neoHebbian three-factor learning rules. Frontiers in Neural Circuits, 12:53, 2018.
12. Gabrielle Girardeau, Karim Benchenane, Sidney I Wiener, György Buzsáki, and Michaël B Zugaro. Selective suppression of hippocampal ripples impairs spatial memory. Nature Neuroscience, 12(10):1222, 2009.
13. Bapun Giri, Hiroyuki Miyawaki, Kenji Mizuseki, Sen Cheng, and Kamran Diba. Hippocampal reactivation extends for several hours following novel experience. Journal of Neuroscience, 39(5):866–875, 2019.
14. Stephen N Gomperts, Fabian Kloosterman, and Matthew A Wilson. VTA neurons coordinate with the hippocampal reactivation of spatial experience. eLife, 4:e05360, 2015.
15. Sten Grillner, Jeanette Hellgren, Ariane Menard, Kazuya Saitoh, and Martin A Wikström. Mechanisms for selection of basic motor programs–roles for the striatum and pallidum. Trends in Neurosciences, 28(7):364–370, 2005.
16. Kevin Gurney, Tony J Prescott, and Peter Redgrave. A computational model of action selection in the basal ganglia. I. A new functional anatomy. Biological Cybernetics, 84(6):401–410, 2001.
17. Kevin Gurney, Tony J Prescott, and Peter Redgrave. A computational model of action selection in the basal ganglia. II. Analysis and simulation of behaviour. Biological Cybernetics, 84(6):411–423, 2001.
18. Tatsuya Haga and Tomoki Fukai. Recurrent network model for learning goal-directed sequences through reverse replay. eLife, 7:e34171, 2018.
19. LG Harsing Jr and MJ Zigmond. Influence of dopamine on GABA release in striatum: evidence for D1–D2 interactions and non-synaptic influences. Neuroscience, 77(2):419–429, 1997.
20. Michael E Hasselmo, Eric Schnell, and Edi Barkai. Dynamics of learning and recall at excitatory recurrent synapses and cholinergic modulation in rat hippocampal region CA3. Journal of Neuroscience, 15(7):5249–5262, 1995.
21. John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
22. Mark D Humphries and Tony J Prescott. The ventral basal ganglia, a selection mechanism at the crossroads of space, strategy, and reward. Progress in Neurobiology, 90(4):385–417, 2010.
23. Shantanu P Jadhav, Caleb Kemere, P Walter German, and Loren M Frank. Awake hippocampal sharp-wave ripples support spatial memory. Science, 336(6087):1454–1458, 2012.
24. Adrien Jauffret, Nicolas Cuperlier, and Philippe Gaussier. From grid cells and visual place cells to multimodal place cell: a new robotic architecture. Frontiers in Neurorobotics, 9:1, 2015.
25. Hideki Kametani and Hiroshi Kawamura. Alterations in acetylcholine release in the rat hippocampus during sleep-wakefulness detected by intracerebral dialysis. Life Sciences, 47(5):421–426, 1990.
26. Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
27. Sampo Kuutti, Richard Bowden, Yaochu Jin, Phil Barber, and Saber Fallah. A survey of deep learning applications to autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems, 2020.
28. Fuhai Ling, Alejandro Jimenez-Rodriguez, and Tony J Prescott. Obstacle avoidance using stereo vision and deep reinforcement learning in an animal-like robot. In , pages 71–76. IEEE, 2019.
29. Jonathan W Mink. The basal ganglia: focused selection and inhibition of competing motor programs. Progress in Neurobiology, 50(4):381–425, 1996.
30. Ben Mitchinson, M Pearson, T Pipe, and Tony J Prescott. Biomimetic robots as scientific models: a view from the whisker tip. Neuromorphic and Brain-Based Robots, pages 23–57, 2011.
31. Ben Mitchinson and Tony J Prescott. MiRo: a robot "mammal" with a biomimetic brain-based control system. In Conference on Biomimetic and Biohybrid Systems, pages 179–191. Springer, 2016.
32. John O'Keefe. Place units in the hippocampus of the freely moving rat. Experimental Neurology, 51(1):78–109, 1976.
33. John O'Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain Research, 1971.
34. Rich Pang and Adrienne L Fairhall. Fast and flexible sequence induction in spiking neural networks via rapid excitability changes. eLife, 8:e44324, 2019.
35. CMA Pennartz, E Lee, J Verheul, P Lipa, Carol A Barnes, and BL McNaughton. The ventral striatum in off-line processing: ensemble reactivation during sleep and modulation by hippocampal ripples. Journal of Neuroscience, 24(29):6446–6456, 2004.
36. Tony J Prescott, Daniel Camilleri, Uriel Martinez-Hernandez, Andreas Damianou, and Neil D Lawrence. Memory and mental time travel in humans and social robots. Philosophical Transactions of the Royal Society B, 374(1771):20180025, 2019.
37. Tony J Prescott, Fernando M Montes González, Kevin Gurney, Mark D Humphries, and Peter Redgrave. A robot model of the basal ganglia: behavior and intrinsic processing. Neural Networks, 19(1):31–61, 2006.
38. Tony J Prescott, Nathan Lepora, and Paul FMJ Verschure. Living Machines: A Handbook of Research in Biomimetics and Biohybrid Systems. Oxford University Press, 2018.
39. P Redgrave, N Vautrelle, PG Overton, and J Reynolds. Phasic dopamine signaling in action selection and reinforcement learning. In Handbook of Behavioral Neuroscience, volume 24, pages 707–723. Elsevier, 2017.
40. Paul Richmond, Lars Buesing, Michele Giugliano, and Eleni Vasilaki. Democratic population decisions result in robust policy-gradient learning: a parametric study with GPU simulations. PLoS ONE, 6(5), 2011.
41. Varun Saravanan, Danial Arabali, Arthur Jochems, Anja-Xiaoxing Cui, Luise Gootjes-Dreesbach, Vassilis Cutsuridis, and Motoharu Yoshida. Transition between encoding and consolidation/replay dynamics via cholinergic modulation of CAN current: a modeling study. Hippocampus, 25(9):1052–1070, 2015.
42. Wolfram Schultz. Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1):1–27, 1998.
43. Denis Sheynikhovich, Ricardo Chavarriaga, Thomas Strösslin, Angelo Arleo, and Wulfram Gerstner. Is there a geometric module for spatial orientation? Insights from a rodent navigation model. Psychological Review, 116(3):540, 2009.
44. William E Skaggs and Bruce L McNaughton. Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271(5257):1870–1873, 1996.
45. Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
46. Misha Tsodyks, Klaus Pawelzik, and Henry Markram. Neural networks with dynamic synapses. Neural Computation, 10(4):821–835, 1998.
47. Eleni Vasilaki, Nicolas Frémaux, Robert Urbanczik, Walter Senn, and Wulfram Gerstner. Spike-based reinforcement learning in continuous state and action space: when policy gradient methods fail. PLoS Computational Biology, 5(12), 2009.
48. Eleni Vasilaki and Michele Giugliano. Emergence of connectivity motifs in networks of model neurons with short- and long-term plastic synapses. PLoS ONE, 9(1), 2014.
49. Barbara Webb. Can robots make good models of biological behaviour? Behavioral and Brain Sciences, 24(6):1033–1050, 2001.
50. Matthew T. Whelan, Eleni Vasilaki, and Tony J. Prescott. Fast reverse replays of recent spatiotemporal trajectories in a robotic hippocampal model. In Biomimetic and Biohybrid Systems. Springer International Publishing, Cham, 2020.
51. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
52. Matthew A Wilson and Bruce L McNaughton. Reactivation of hippocampal ensemble memories during sleep. Science, 265(5172):676–679, 1994.
53. Henry Zhu, Justin Yu, Abhishek Gupta, Dhruv Shah, Kristian Hartikainen, Avi Singh, Vikash Kumar, and Sergey Levine. The ingredients of real-world robotic reinforcement learning. arXiv preprint arXiv:2004.12570, 2020.

Appendix - Mathematical Derivation of the Place-Action Cell Synaptic Learning Rule
Derivation of the reinforcement learning rule – We derive a policy gradient rule ( ) following ( ), but here we use continuous-valued neurons instead of spiking neurons. The expectation of the reward earned in an episode of duration T is given by

$$\langle R \rangle_T = \int_X \int_Y R(\mathbf{x}, \mathbf{y})\, P_w(\mathbf{x}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{x} \quad (25)$$

where X is the space of inputs and Y the space of outputs of the network, and P_w(x, y) is the probability that the network has input x and output y, parametrised by the weights. We can decompose the probability P_w(x, y) (see the decomposition of the probability in ( )) as

$$P_w(\mathbf{x}, \mathbf{y}) = \prod_j g_j(\mathbf{x}, \mathbf{y}) \prod_i h_i(\mathbf{x}, \mathbf{y}) \quad (26)$$

where h_i is the probability that the i-th action cell generates output y_i contained in y when the network receives input x. Similarly, g_j is the probability of the activity produced by the j-th place cell given its input. We then wish to calculate the partial derivative of the expected reward with respect to a weight w_kl,

$$\frac{\partial \langle R_T \rangle}{\partial w_{kl}} = \int_X \int_Y R(\mathbf{x}, \mathbf{y}) \frac{\partial P_w(\mathbf{x}, \mathbf{y})}{\partial w_{kl}}\, d\mathbf{y}\, d\mathbf{x} \quad (27)$$

To do so, we take into account that P_w(x, y) = [P_w(x, y)/h_k(x, y)]\, h_k(x, y), where the term in square brackets does not depend on w_kl, since we remove its contribution from P_w(x, y) by dividing by h_k(x, y). We can then write

$$\frac{\partial P_w(\mathbf{x}, \mathbf{y})}{\partial w_{kl}} = P_w(\mathbf{x}, \mathbf{y}) \frac{\partial \log h_k(\mathbf{x}, \mathbf{y})}{\partial w_{kl}} \quad (28)$$

This leads to

$$\frac{\partial \langle R_T \rangle}{\partial w_{kl}} = \int_X \int_Y R(\mathbf{x}, \mathbf{y})\, P_w(\mathbf{x}, \mathbf{y}) \frac{\partial \log h_k(\mathbf{x}, \mathbf{y})}{\partial w_{kl}}\, d\mathbf{y}\, d\mathbf{x} \quad (29)$$

To proceed, we need to consider the distribution of the activities of the action cells, h_k. This we choose to be a Gaussian function with mean ỹ_k and variance σ² (see also section "Striatal Action Cells"),

$$h_k(\mathbf{x}, \mathbf{y}) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_k - \tilde{y}_k)^2}{2\sigma^2}\right) \quad (30)$$

The mean of the distribution is calculated as ỹ_k = f_s(c_1 Σ_j w_kj x_j + c_2), see also Equation 11, where f_s is a sigmoidal function. We note that a different choice of function would have resulted in a variant of this rule. Therefore,

$$\frac{\partial \log h_k(\mathbf{x}, \mathbf{y})}{\partial w_{kl}} = c_1 \frac{y_k - \tilde{y}_k}{\sigma^2} (1 - \tilde{y}_k)\, \tilde{y}_k\, x_l \quad (31)$$

Substituting Equation 31 into Equation 29 we end up with

$$\frac{\partial \langle R_T \rangle}{\partial w_{kl}} = \int_X \int_Y c_1 R(\mathbf{x}, \mathbf{y})\, P_w(\mathbf{x}, \mathbf{y}) \frac{y_k - \tilde{y}_k}{\sigma^2} (1 - \tilde{y}_k)\, \tilde{y}_k\, x_l\, d\mathbf{y}\, d\mathbf{x} \quad (32)$$

The batch update rule is then given by

$$\frac{dw_{kl}}{dt} = \eta \int_X \int_Y R(\mathbf{x}, \mathbf{y})\, P_w(\mathbf{x}, \mathbf{y}) \frac{y_k - \tilde{y}_k}{\sigma^2} (1 - \tilde{y}_k)\, \tilde{y}_k\, x_l\, d\mathbf{y}\, d\mathbf{x} \quad (33)$$

The batch rule indicates that we need to average the term R(x, y) (y_k − ỹ_k)/σ² (1 − ỹ_k) ỹ_k x_l across many trials. When an on-line setting is considered, the average arises naturally from sampling throughout the episodes. Hence the on-line version of this rule is given by

$$\frac{dw_{kl}}{dt} = \eta\, R(\mathbf{x}, \mathbf{y}) \frac{y_k - \tilde{y}_k}{\sigma^2} (1 - \tilde{y}_k)\, \tilde{y}_k\, x_l \quad (34)$$

with the factor c_1 absorbed into the learning rate. We note, however, that this rule is appropriate for scenarios where reward is immediate. To deal with cases of distant rewards, such as ours, where reward comes at the end of a sequence of actions, we need to resort to eligibility traces. Our rule is similar to REINFORCE with a multiparameter distribution (51); we differ by having a continuous-time formulation and a different parametrisation of the neuronal probability density function.
Further, in our case we do not learn the variance of the probability density function.

We introduce an eligibility trace by updating the weights connecting the place cells to the action cells, W^{PC-AC}, by

$$\frac{dw^{PC\text{-}AC}_{ij}}{dt} = \frac{\eta}{\sigma^2} R(\mathbf{x}, \mathbf{y})\, e_{ij} \quad (35)$$

The term e_ij represents the eligibility trace, see also ( ), and is a time-decaying function of the potential weight changes, determined by

$$\frac{de_{ij}}{dt} = -\frac{e_{ij}}{\tau_e} + (y_i - \tilde{y}_i)(1 - \tilde{y}_i)\,\tilde{y}_i\, x_j \quad (36)$$

Derivation of the supervised learning rule – During replays, we assume that synapses between place and action cells change to minimise the function

$$E = \frac{1}{2} \sum_i \left(y^{replay}_i - \tilde{y}_i\right)^2 \quad (37)$$

In other words, we assume that during the replay, Equation 17 provides a fixed target value for the mean of the Gaussian distribution of the action cells at time t. In what follows we consider the target constant, for the sake of the derivation and for consistency with the form of the reinforcement learning rule. In fact this target changes as time, and consequently the weights from place to action cells, change, making the rule unstable; it stabilises, however, under a short, fixed length of replay time. Taking the gradient of the error function with respect to the weight w_kl, whilst considering the "target" activity for the action cells fixed, leads to the backpropagation update rule for a single-layer network,

$$\frac{dw_{kl}}{dt} = \eta' \left(y^{replay}_k - \tilde{y}_k\right) \tilde{y}_k (1 - \tilde{y}_k)\, x_l \quad (38)$$

where η' is the learning rate; in our simulations η' = η/σ², similar to the reinforcement learning rule. Also for consistency with the reinforcement learning rule formulation, we introduce an eligibility trace by updating the weights connecting the place cells to the action cells, W^{PC-AC}, by

$$\frac{dw^{PC\text{-}AC}_{ij}}{dt} = \eta' e_{ij} \quad (39)$$

where the eligibility trace is determined by

$$\frac{de_{ij}}{dt} = -\frac{e_{ij}}{\tau_e} + \left(y^{replay}_i - \tilde{y}_i\right)(1 - \tilde{y}_i)\,\tilde{y}_i\, x_j \quad (40)$$

where again the time constant τ_e is the same as in the reinforcement learning rule.

In the case of replays then, when the robot has reached its target, it first learns using the standard learning rule as in Equations 35 and 36. After 1 s, a replay event is initiated, and learning is then done using the supervised learning rule, Equations 39 and 40. By setting the reward value to R = 1 and η' = η/σ², the two rules become equivalent at the reward location.
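As a closing sanity check, the analytic score-function derivative of Equation 31 can be verified numerically against a finite difference, as in the sketch below; all sizes and constants are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, sigma, c1, c2 = 10, 0.2, 1.0, -0.5
w = rng.normal(size=n_in)       # weights onto one action cell
x = rng.random(n_in)            # place cell rates
y_k = 0.4                       # sampled action cell output

def log_h(w_vec):
    """log of the Gaussian action-cell density (Equation 30)."""
    y_t = 1.0 / (1.0 + np.exp(-(c1 * w_vec @ x + c2)))
    return (-np.log(sigma * np.sqrt(2.0 * np.pi))
            - (y_k - y_t) ** 2 / (2.0 * sigma ** 2))

y_t = 1.0 / (1.0 + np.exp(-(c1 * w @ x + c2)))
analytic = c1 * (y_k - y_t) / sigma ** 2 * (1.0 - y_t) * y_t * x   # Equation 31

h = 1e-6
numeric = np.array([
    (log_h(w + h * np.eye(n_in)[l]) - log_h(w - h * np.eye(n_in)[l])) / (2 * h)
    for l in range(n_in)
])
print(np.max(np.abs(analytic - numeric)))   # agreement to roughly 1e-9
```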