Predicting the future with a scale-invariant temporal memory for the past
Wei Zhong Goh, Varun Ursekar, Marc W. Howard
Graduate Program in Neuroscience, Department of Physics, Department of Psychological and Brain Sciences, Center for Systems Neuroscience, 610 Commonwealth Avenue, Boston University
Keywords: Reinforcement learning, prediction, scale invariance, long memory
Abstract
In recent years it has become clear that the brain maintains a temporal memory of recent events stretching far into the past. This paper presents a neurally inspired algorithm that uses a scale-invariant temporal representation of the past to predict a scale-invariant future. The result is a scale-invariant estimate of future events as a function of the time at which they are expected to occur. The algorithm is time-local, with credit assigned to the present event by observing how it affects the prediction of the future. To illustrate the potential utility of this approach, we test the model on simultaneous renewal processes with different time scales. The algorithm scales well on these problems despite the fact that the number of states needed to describe them as a Markov process grows exponentially.
Reinforcement learning (RL) models that are designed for Markov processes (e.g., Watkins and Dayan, 1992; Sutton, 1988) have been extraordinarily successful in accounting for reward systems in the brain (e.g., Schultz et al., 1997; Waelti et al., 2001) and have led to remarkable achievements in artificial intelligence (e.g., Mnih et al., 2015; Silver et al., 2018). For instance, in the successor representation, each relevant configuration of the world is defined as a state and the goal is to estimate the Markov transition probabilities between states (Dayan, 1993). Despite the success of RL, its affinity for Markov statistics may be a serious limitation. The real world contains many distinct causes that predict their effects at a range of time scales, presenting a challenge for learners optimized for Markov statistics. Of course, random processes with memory can be turned into Markov processes at the cost of defining additional states. However, the cost in terms of memory, and the time required to learn transition probabilities among an exponentially growing number of states, may be prohibitive in some settings.

It has been proposed that a primary function of the mammalian brain is to predict future events to enable adaptive behavior (Clark, 2013; Friston, 2010). Evidence from neuroscience has made clear that the brain contains robust memory for the identity and time of recent events. For instance, sequentially activated time cells in the hippocampus, prefrontal cortex, and striatum (e.g., MacDonald et al., 2011; Tiganj et al., 2018; Mello et al., 2015) maintain information about the time at which recent events were experienced over at least tens of seconds, and perhaps much longer. Experimental presentation of distinct stimuli triggers different sequences of time cells (e.g., Tiganj et al., 2018; Taxidis et al., 2020; Cruzado et al., in press), so that these populations maintain information about what happened when. In addition to sequentially activated time cells, neurons in the entorhinal cortex (Tsao et al., 2018; Bright et al., 2020) and other cortical regions (Bernacchia et al., 2011; Murray et al., 2017) carry temporal information via populations of neurons that respond with a spectrum of characteristic time scales up to at least tens of minutes. This paper, inspired by work arguing that conditioning in the brain results from an attempt to learn temporal contingencies between stimuli (Balsam and Gallistel, 2009; Gallistel et al., 2019), presents a formal model that learns to predict the future given a temporal record of the past. The proposed mechanism is computable given a temporal history that can be translated in time, and it offers a solution for how to estimate the future from a past that includes information about many events.

This paper proceeds as follows. In the rest of this section, we review a model for retaining a record of past events, and for forming associations between event pairs. In Section 2, we present the model for predicting the future given a temporal record of the past. In Section 3, we discuss its computational complexity, time-scale invariance, and several other properties. In Section 4, we present a numerical demonstration of the efficacy of this algorithm. Finally, in Section 5, we compare this algorithm to traditional RL algorithms and point out its connections to neuroscience.
We start with an agent which is capable of observing and remembering several types of events, such as the onset of a 440 Hz tone or the appearance of an image of an apple. In this section, we will describe a model for its capabilities. We will see that the agent maintains a fuzzy timeline of past events, which it uses to make pairwise associations between events. Neurobiological justification for this model is outlined in Section A.1 of the Appendix.
We assume that the world provides a series of discrete events that occur in continuous time. At each moment, at most one event can occur. For simplicity, without loss of generality, suppose there are three types of events, which we call X, Y and Z respectively. Whenever we need to avoid confusion, we will use event type to refer to a type of event, and event episode to refer to an individual occurrence of an event. We encode the occurrence of the event type X as a signal $f_X(t)$, which is the sum of Dirac delta functions centered at the occurrence times of episodes of X (Fig. 1a). (We will discuss quantities in relation to X; such statements hold analogously for Y and Z.) We call $t$, the argument of the signal $f_X(t)$, real time or external time, to emphasize that this time axis is a feature of the world rather than a construction of the observer. We denote the collection of all three signals as $f(t)$, and analogously for the quantities to follow. At every instant in (external) time $t$, the agent has direct access to $f(t)$ (which is zero unless an event of interest occurs precisely at $t$), but not to $f$ at any other time value. Signals are shown in Fig. 1a for the case where X, Y and Z occur at times 0, 1 and 2 respectively.

Figure 1: Memory is a fuzzy representation of the signal up to the present. (a) Signal as a function of external time, for three event types, X, Y and Z. This is the scenario considered in Figs. 2 through 6. (b) Memory for a recent event as a function of internal past time, at varying (external) times since the event occurred. As a function of internal past time, peaks in the memory are present at approximately the time interval since the event.

At every instant in time $t$, the agent's memory for X, denoted $\tilde{f}_X(\overset{*}{\tau}; t)$, is a fuzzy representation of the signal up to the present, $f_X(t - \overset{*}{\tau})$. From the agent's perspective, the internal past time, $\overset{*}{\tau} > 0$, indexes how long ago events in memory might have occurred. The degree of fuzziness of the memory varies inversely with a sharpness parameter $k$, which is typically a small even integer; throughout this paper, it is fixed at 8.

At time $\tau + t$, the memory element for an event that occurred at time $\tau$ is given by $\tilde{f}(\overset{*}{\tau}; \tau + t) = \Phi_k(t/\overset{*}{\tau})/\overset{*}{\tau}$, where the fuzziness, $\Phi_k(\cdot)$, is given by the dimensionless equation
$$\Phi_k(x) = u(x)\, \kappa\, x^k e^{-kx}, \qquad (1)$$
where $\kappa = k^{k+1}/k!$ is a normalizing constant and $u$ is the unit step function. Memories for a recent event are shown in Fig. 1b for various values of $t$. For an arbitrary signal $f$, the associated memory up to time $t$ is
$$\tilde{f}(\overset{*}{\tau}; t) = \frac{1}{\overset{*}{\tau}} \int_{-\infty}^{t} f(\tau)\, \Phi_k\!\left(\frac{t - \tau}{\overset{*}{\tau}}\right) d\tau. \qquad (2)$$
In other words, the memory for an event type is the sum of the memory elements associated with each episode of that event type. On its face, Eq. 2 appears to assume that the agent has access to the infinite past of $f(t)$. However, previous work has shown that $\tilde{f}(\overset{*}{\tau}; t)$ can be efficiently and time-locally constructed from a set of leaky integrators with a spectrum of time constants (see Section A.1 in the Appendix; Shankar and Howard, 2013). Using this approach, the number of leaky integrators necessary to remember the past up to some bound $T$ grows like $\log T$.

The signal $f$ up to any given external time $t$ fixes the event occurrence history. However, because the agent's memory is fuzzy, the agent can only form a fuzzy subjective belief distribution about the event occurrence history leading up to the present. We may interpret the memory for X as the agent's subjective estimate of the instantaneous rate of occurrence of X at time $t - \overset{*}{\tau}$.
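To make the construction concrete, the following is a minimal numerical sketch in Python (assuming NumPy; the helper names phi_k, memory, and tau_grid are ours for illustration, not from the paper). It evaluates the kernel of Eq. 1 directly and sums one memory element per event episode; for delta-function signals the integral in Eq. 2 collapses to exactly this sum. Note that this direct evaluation sidesteps the time-local leaky-integrator construction of Appendix A.1.

```python
import math
import numpy as np

K = 8  # sharpness parameter k; fixed at 8 throughout the paper

def phi_k(x, k=K):
    """Fuzziness kernel Phi_k(x) = u(x) * kappa * x^k * exp(-k x)  (Eq. 1)."""
    x = np.asarray(x, dtype=float)
    kappa = k ** (k + 1) / math.factorial(k)  # normalizing constant
    xp = np.clip(x, 0.0, None)                # u(x): kernel vanishes for x <= 0
    return kappa * xp ** k * np.exp(-k * xp) * (x > 0)

def memory(event_times, tau_star, t, k=K):
    """Fuzzy memory f~(tau*; t) for a sum-of-deltas signal  (Eq. 2).

    Each episode at time t_i contributes one memory element,
    Phi_k((t - t_i)/tau*) / tau*, summed over episodes.
    """
    tau_star = np.asarray(tau_star, dtype=float)
    out = np.zeros_like(tau_star)
    for t_i in event_times:
        out += phi_k((t - t_i) / tau_star, k) / tau_star
    return out

# Reproduce the qualitative behavior of Fig. 1b: an episode of X at t = 0,
# probed at several later times. Each curve peaks at approximately the
# elapsed time since the event (exactly at k t / (k + 1)).
tau_grid = np.linspace(0.01, 15.0, 500)
for t_now in (2.0, 5.0, 10.0):
    m = memory([0.0], tau_grid, t_now)
    print(f"t = {t_now:4.1f}: memory peaks at tau* ~ {tau_grid[np.argmax(m)]:.2f}")
```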
In other words, we have, for an infinitesimal time element $d\overset{*}{\tau}$,
$$\tilde{f}_X(\overset{*}{\tau}; t)\, d\overset{*}{\tau} \approx P\!\left(X \,@\, t - \overset{*}{\tau}\, (d\overset{*}{\tau})\right), \qquad (3)$$
where $P(\cdot)$, the probability of an event, is used in the subjective Bayesian sense to describe the agent's belief, and "$X \,@\, t - \overset{*}{\tau}\,(d\overset{*}{\tau})$" stands for "an episode of event X occurred within the infinitesimal time interval between $t - \overset{*}{\tau}$ and $t - \overset{*}{\tau} + d\overset{*}{\tau}$." Since $\tilde{f}$ gives the agent access to the identity of, and approximate time at which, past events might have happened, we describe $\tilde{f}(\overset{*}{\tau})$ as a timeline of the past.

At each instant in time $t$, the agent is also able to compute the state of the memory a time interval $\delta$ into the future, assuming that no events of interest occur during that interval. We call this the projected memory, which is given by, for an arbitrary signal $f$,
$$\tilde{f}_\delta(\overset{*}{\tau}; t) = \frac{1}{\overset{*}{\tau}} \int_{-\infty}^{t} f(\tau)\, \Phi_k\!\left(\frac{t + \delta - \tau}{\overset{*}{\tau}}\right) d\tau. \qquad (4)$$
Translation can be efficiently implemented based on the set of leaky integrators. Prior work has shown that this can be done in a neurobiologically reasonable way (see Section A.2 in the Appendix; Shankar et al., 2016).

Many models of memory make use of associations between the temporal context describing the recent past and the currently available stimulus. The agent described here builds pairwise associations from X (the cue) to Y (the outcome) as the average state of the memory for X whenever Y occurs, and analogously for other event pairs:
$$\Delta M_{YX}(\overset{*}{\tau}) \propto \tilde{f}_X(\overset{*}{\tau}; t)\, f_Y(t). \qquad (5)$$

[Figure 2: pairwise associations $M_{YX}$; association strength as a function of internal future time.]
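For delta-function signals, the projected memory of Eq. 4 follows immediately from the sketch above: because the integral in Eq. 4 runs only over events up to $t$, it equals the memory the agent would hold at $t + \delta$ given no further events. A one-line continuation of the earlier sketch (reusing its memory and K; the function name is again ours):

```python
def projected_memory(event_times, tau_star, t, delta, k=K):
    """Projected memory f~_delta(tau*; t)  (Eq. 4).

    For episodes occurring at or before t, this is the memory the agent
    would have at t + delta if no further events occur in the interim.
    """
    return memory(event_times, tau_star, t + delta, k)
```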
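The associations of Eq. 5 can likewise be accumulated event by event: since $f_Y(t)$ is a sum of delta functions, updates happen only at occurrences of the outcome Y, at which point the current memory state for the cue X is folded into $M_{YX}$. Below is one simple way to realize the "average state of memory" reading of Eq. 5, continuing the sketches above (the delta-rule running average and the learning rate are our illustrative assumptions, not the paper's implementation):

```python
class PairwiseAssociations:
    """Accumulates M_YX(tau*): the average memory for the cue X at the
    moments the outcome Y occurs  (Eq. 5, as a discrete running average)."""

    def __init__(self, tau_star_grid, learning_rate=0.1):
        self.tau_star = np.asarray(tau_star_grid, dtype=float)
        self.lr = learning_rate
        self.M = {}  # (outcome, cue) -> association profile over tau*

    def observe(self, outcome, cue_memories):
        """Update on an episode of `outcome`; `cue_memories` maps each cue
        to its memory state f~_cue(tau*; t) at this moment."""
        for cue, mem in cue_memories.items():
            key = (outcome, cue)
            old = self.M.get(key, np.zeros_like(self.tau_star))
            self.M[key] = old + self.lr * (mem - old)  # delta-rule average

# Example: X occurs at t = 0 and Y at t = 1, as in Fig. 1a. The resulting
# M_YX profile peaks near tau* ~ 0.9, i.e., approximately the one-time-unit
# lag from X to Y.
assoc = PairwiseAssociations(tau_grid)
assoc.observe("Y", {"X": memory([0.0], tau_grid, 1.0)})
m_yx = assoc.M[("Y", "X")]
print(f"M_YX peaks at tau* ~ {tau_grid[np.argmax(m_yx)]:.2f}")
```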