Deep Predictive Learning in Neocortex and Pulvinar
Randall C. O'Reilly, Jacob L. Russin, Maryam Zolfaghar, John Rohrlich
Department of Psychology, Computer Science, and Center for Neuroscience
University of California Davis
1544 Newton Ct, Davis, CA 95618
June 29, 2020

We thank Dean Wyatte, Tom Hazy, Seth Herd, Kai Krueger, Tim Curran, David Sheinberg, Lew Harvey, Jessica Mollick, Will Chapman, Helene Devillez, and the rest of the CCN Lab for many helpful comments and suggestions. Supported by: ONR grants ONR N00014-19-1-2684 / N00014-18-1-2116, N00014-14-1-0670 / N00014-16-1-2128, N00014-18-C-2067, N00014-13-1-0067, D00014-12-C-0638. This work utilized the Janus supercomputer, which is supported by the National Science Foundation (award number CNS-0821794) and the University of Colorado Boulder. The Janus supercomputer is a joint effort of the University of Colorado Boulder, the University of Colorado Denver and the National Center for Atmospheric Research. All data and materials will be available at https://github.com/ccnlab/deep-obj-cat upon publication.

Abstract

How does the human brain learn new concepts from raw sensory experience, without explicit instruction? We still do not have a widely-accepted answer to this central question. Here, we propose a detailed biological mechanism for the widely-embraced idea that learning is based on the differences between predictions and actual outcomes (i.e., predictive error-driven learning). Specifically, numerous weak projections into the pulvinar nucleus of the thalamus generate top-down predictions, and sparse, focal driver inputs from lower areas supply the actual outcome, originating in layer 5 intrinsic bursting (5IB) neurons. Thus, the outcome is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha), resulting in a temporal difference error signal, which drives local synaptic changes throughout the neocortex, resulting in a biologically-plausible form of error backpropagation learning. We implemented these mechanisms in a large-scale model of the visual system, and found that the simulated inferotemporal (IT) pathway learns to systematically categorize 3D objects according to invariant shape properties, based solely on predictive learning from raw visual inputs. These categories match human judgments on the same stimuli, and are consistent with neural representations in IT cortex in primates.
Deep Predictive Learning
The fundamental epistemological conundrum of how knowledge emerges from raw experience has challenged philosophers and scientists for centuries. There have been significant advances in cognitive and computational models of learning (Ashby & Maddox, 2011; LeCun, Bengio, & Hinton, 2015; Watanabe & Sasaki, 2015) and in our understanding of the detailed biochemical basis of synaptic plasticity (Cooper & Bear, 2012; Lüscher & Malenka, 2012; Shouval, Bear, & Cooper, 2002; Urakubo, Honda, Froemke, & Kuroda, 2008). However, there is still no widely-accepted answer to this puzzle that is clearly supported by known biological mechanisms and also produces effective learning at the computational and cognitive levels.

At these functional levels, the idea that we learn via an active predictive process goes back to Helmholtz’s recognition by synthesis proposal (von Helmholtz, 2013), and has been embraced in a wide range of different frameworks (Clark, 2013; Dayan, Hinton, Neal, & Zemel, 1995; de Lange, Heilbron, & Kok, 2018; J. Elman et al., 1996; J. L. Elman, 1990; Friston, 2005; George & Hawkins, 2009; Hawkins & Blakeslee, 2004; Kawato, Hayakawa, & Inui, 1993; Mumford, 1992; Rao & Ballard, 1999; Summerfield & de Lange, 2014). Here, we propose a detailed biological mechanism for a specific form of predictive error-driven learning based on distinctive patterns of connectivity between the neocortex and the pulvinar nucleus of the thalamus (S. M. Sherman & Guillery, 2006; Usrey & Sherman, 2018).

Specifically, we hypothesize that learning is based on the difference between top-down predictions, generated by numerous weak projections into the thalamic relay cells (TRCs) in the pulvinar, and the actual outcomes supplied by sparse, focal, strong driver inputs from lower areas. Because these driver inputs originate in layer 5 intrinsic bursting (5IB) neurons, the outcome is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha).
Thus, the prediction error is a temporal difference in activation states over the pulvinar, from an earlier prediction to a subsequent burst of outcome. This temporal difference can drive local synaptic changes throughout the neocortex, supporting a biologically-plausible form of error backpropagation to improve the predictions over time (Ackley, Hinton, & Sejnowski, 1985; Bengio, Mesnard, Fischer, Zhang, & Wu, 2017; Hinton & McClelland, 1988; Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020; O’Reilly, 1996; Whittington & Bogacz, 2019). The temporal-difference form of error-driven learning contrasts with prevalent alternative hypotheses that require a separate population of neurons to compute a prediction error “explicitly” and transmit it directly through neural firing (Friston, 2005, 2010; Kawato et al., 1993; Lotter, Kreiman, & Cox, 2016; Ouden, Kok, & Lange, 2012; Rao & Ballard, 1999).

In the following, our primary objective is to describe the hypothesized biologically-based mechanism for predictive error-driven learning, contrast it with other existing proposals regarding the functions of this thalamocortical circuitry and other ways that the brain might support predictive learning, and evaluate it relative to a wide range of existing anatomical and electrophysiological data. We provide a number of specific empirical predictions that follow from this functional view of the thalamocortical circuit, which could potentially be tested by current neuroscientific methods. Thus, this work proposes a clear functional interpretation of this distinctive thalamocortical circuitry that contrasts with existing ideas in testable ways.

A second major objective is to implement this predictive error-driven learning mechanism in a large-scale computational model that faithfully captures its essential biological features, to test whether the proposed learning mechanism can drive the formation of cognitively-useful representations.
In particular, we ask a critical question for any purely predictive-learning model: can it develop high-level, abstract representations while learning from nothing but predicting low-level visual inputs? For example, most visual object recognition models that provide a reasonable fit to neurophysiological data rely on large human-labeled image datasets to explicitly train abstract category information via error-backpropagation (Cadieu et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Rajalingham et al., 2018). Through large-scale simulations based on the known structure of the visual system, we found that our biologically based predictive learning mechanism developed high-level abstract representations that significantly diverge from the similarity structure present in the lower layers of the network, and systematically categorize 3D objects according to invariant shape properties. Furthermore, we found in a similarity judgment experiment that these categories match [...] not to produce a better machine-learning (ML) algorithm per se, but rather to test the computational properties of our biologically-based, scientific theory for how the mammalian brain might learn. Thus, we explicitly dissuade readers from the inevitable desire to evaluate the importance of our model based on differences in narrow, performance-based ML metrics: it should instead be evaluated on its ability to explain a wide range of data across multiple levels of analysis, just as every other scientific theory is evaluated.

The remainder of the paper is organized as follows. First, we provide a concise overview of the biologically based predictive error-driven learning framework. Next, we discuss the relevant biological data in detail, along with testable predictions that can differentiate this account from other existing ideas. Then, we present the large-scale model of the visual system, which learns by predicting over brief visual movies of 3D objects rotating and translating in space. We evaluate this model and compare it to two other prediction-error learning models that use pure error-backpropagation, based on current deep-convolutional neural network (DCNN) principles. Finally, we conclude with a discussion of related models and outstanding issues.
Predictive Error-driven Learning in the Neocortex and Pulvinar
Figure 1a shows the thalamocortical circuits characterized by S. M. Sherman and Guillery (2006) (see also S. M. Sherman & Guillery, 2013; Usrey & Sherman, 2018), which have two distinct projections converging on the principal thalamic relay cells (TRCs) of the pulvinar, the primary thalamic nucleus that is interconnected with higher-level posterior cortical visual areas (Arcaro, Pinsk, & Kastner, 2015; Halassa & Kastner, 2017; Shipp, 2003). One projection consists of numerous, weaker connections originating in deep layer VI of the neocortex (the 6CT corticothalamic projecting cells), which we hypothesize generate a top-down prediction on the pulvinar. The other is a sparse, focal (Rockland, 1996, 1998) and strong driver pathway that originates from lower-level layer 5 intrinsic bursting cells (5IB), which we hypothesize provide the outcome. These 5IB neurons fire discrete bursts with intrinsic dynamics having a period of roughly 100 msec between bursts (Connors, Gutnick, & Prince, 1982; Franceschetti et al., 1995; Larkum, Zhu, & Sakmann, 1999; Saalmann, Pinsk, Wang, Li, & Kastner, 2012; Silva, Amitai, & Connors, 1991), which corresponds to the widely-studied alpha frequency of 10 Hz that originates in cortical deep layers and has important effects on a wide range of perceptual and attentional tasks (Buffalo, Fries, Landman, Buschman, & Desimone, 2011; Clayton, Yeung, & Kadosh, 2018; Jensen, Bonnefond, & VanRullen, 2012; K. Mathewson, Gratton, Fabiani, Beck, & Ro, 2009; VanRullen & Koch, 2003).

The existing literature generally characterizes the 6CT projection as modulatory (S. M.
Sherman & Guillery, 2013; Usrey & Sherman, 2018), but a number of electrophysiological recordings from awake, behaving animals clearly show sustained, continuous patterns of neural firing in pulvinar TRC neurons, which is not consistent with the idea that they are only being driven by their 5IB inputs (Bender, 1982; Bender & Youakim, 2001; Komura, Nikkuni, Hirashima, Uetake, & Miyamoto, 2013; Petersen, Robinson, & Keys, 1985; Robinson, 1993; Saalmann et al., 2012; Zhou, Schafer, & Desimone, 2016). Indeed, these recordings show that pulvinar neural firing generally resembles that of the visual areas with which they interconnect. This is important because our predictive learning framework requires that these 6CT top-down projections be capable of driving TRC activity directly. Specifically, in contrast to the standard view, the core idea behind our theory is that the top-down 6CT projections drive a prediction across the extent of the pulvinar, which precedes the subsequent outcome state resulting from the strong 5IB driver inputs, as illustrated in Figure 1b (Kachergis, Wyatte, O’Reilly, de Kleijn, & Hommel, 2014; O’Reilly, Wyatte, & Rohrlich, 2014, 2017).

[Figure 1, panels a) and b). Labels recovered for panel b): V1; V2 Super (current t); V2 Deep (t-1 context); Pulvinar (prediction); bidirectional connections to higher layers (V4...); context update; time axis t, t+100 msec; temporal difference error = actual (plus) vs. prediction (minus).]

Figure 1: a) Summary figure from Sherman & Guillery (2006) showing the strong feedforward driver projection emanating from layer 5IB cells in lower layers (e.g., V1), and the much more numerous feedback “modulatory” projection from layer 6CT cells. We interpret these same connections as providing a prediction (6CT) vs. outcome (5IB) activity pattern over the pulvinar. b) Temporal evolution of information flow under our prediction error hypothesis, operating on visual sequences, over two idealized alpha cycles of 100 msec each. Superficial layers (Super, lamina 2/3) always encode the current state, integrating bottom-up and top-down inputs. In each alpha cycle, the V2 Deep layer (lamina 5, 6) uses the prior alpha cycle of context to generate a prediction (minus phase) on the pulvinar thalamic relay cells (TRC). The bottom-up outcome is driven by lower-level (V1) 5IB strong driver inputs (plus phase); error-driven learning occurs as a function of the temporal difference between these phases, sent via broad pulvinar projections, in both superficial and deep layers, and in the 6CT projections into the pulvinar (only 5IB drivers are non-learning). 5IB bursting in V2 drives updating of temporal context in V2 Deep layers (this phasic updating prevents current outcome activation in superficial layers from informing the prediction), while also driving the plus phase in higher areas of pulvinar that learn to predict V2 activation states, and so on.

Assuming a 100 msec alpha cycle for the purposes of illustration (the actual timing is likely to be more dynamic as discussed next), the activity state in pulvinar TRC neurons, representing a prediction, should develop during the first ∼75 msec, while the final ∼25 msec largely reflects the strong 5IB bottom-up ground-truth driver inputs. Thus, the prediction error signal is reflected in the temporal difference of these activation states over time. In other words, our hypothesis is that the pulvinar is directly representing either the top-down prediction or the bottom-up outcome at any given time, and the temporal difference between [...] adaptation, which is generally thought to increase neural activity and learning associated with novel inputs relative to recently familiar ones (Abbott, Varela, Sen, & Nelson, 1997; Brette & Gerstner, 2005; Grill-Spector, Henson, & Martin, 2006; Hennig, 2013; Müller, Metha, Krauskopf, & Lennie, 1999). In the case where outcomes are consistent with prior predictions (i.e., the predictions are accurate), the same population of neurons across pulvinar and cortex should be active over time, whereas unpredicted outcomes will generally activate new subsets of neurons in superficial cortical layers representing the current state. Thus, due to adaptation, there should be a phasic increase in activity in these superficial neurons at the onset of unpredicted stimuli relative to predicted ones. Furthermore, the 5IB neurons downstream of these superficial neurons may be particularly responsive to these phasic activity increases, causing their bursting to coincide preferentially with unexpected outcomes, thereby driving the phase resetting of the alpha cycle to such events. Thus, during a sequence of predicted states, the pulvinar may experience relatively weaker or even absent 5IB driving inputs, until an unpredicted stimulus arises. At this point, error-driven learning would be more strongly engaged as a function of the phasic release from adaptation and 5IB burst activation.
We discuss these dynamics further below in the context of the expectation suppression phenomena (Bastos et al., 2012; Meyer & Olson, 2011; Summerfield, Trittschuh, Monti, Mesulam, & Egner, 2008; Todorovic, van Ede, Maris, & de Lange, 2011).

We also hypothesize that 5IB bursting preferentially drives learning, due to the strong driving nature of the outputs from these neurons. In computational terms, this anchors the target or plus phase to be at this point of 5IB bursting. Furthermore, this means that the prediction is essentially defined as the state prior to 5IB bursting, and the learning rule automatically causes that prior state to better anticipate the subsequent state. This means that even if no prediction was initially generated, learning over multiple iterations will work to create one, to the extent that a reliable prediction can be generated based on internal states and environmental inputs. It also means that although the alpha rhythm defines a baseline minimum prediction window, predictive learning could still happen at longer delays (again assuming relevant predictive state information is available to bridge the delay). In short, learning happens whenever something unexpected occurs, at any point, and drives the development of predictions immediately prior, to the extent such predictions are possible to generate.
In the typical lab experiment where phasic stimuli are presented without any predictable temporal sequence (which is likely uncharacteristic of the natural world), there may often be no significant prediction prior to stimulus onset, and we would expect such stimuli to reliably drive 5IB bursting, which is consistent with available electrophysiological data (Bender, 1982; Bender & Youakim, 2001; Komura et al., 2013; Petersen et al., 1985; Robinson, 1993; Saalmann et al., 2012; Zhou et al., 2016).

As may be evident by this point, we are mainly focused on prediction in the sense of the humorous quote: “prediction is very difficult, especially about the future” (attributable to Danish author Robert Storm Petersen), whereas this term is potentially confusingly used in a much broader sense in most Bayesian-inspired predictive coding frameworks (de Lange et al., 2018; Friston, 2005; Rao & Ballard, 1999). These frameworks use “prediction” to encompass everything from genetic biases to the results of learning in the feedforward synaptic pathways to top-down filling-in or biasing of the current stimulus properties, and fairly rarely for the “about the future” meaning. We think these different phenomena are each associated with different neural mechanisms at different time scales (O’Reilly, Munakata, Frank, Hazy, & Contributors, 2012; O’Reilly, Wyatte, Herd, Mingus, & Jilk, 2013), and thus prefer to treat them separately, while also recognizing that they clearly interact, e.g., with predictive learning hypothesized to be the primary driver of learning of all pathways in the cortex.

Thus, our use of the term prediction here refers specifically to anticipatory neural firing that predicts subsequent stimuli. We use the term postdiction to refer to the operation of this predictive mechanism after a stimulus has been initially processed (to consolidate and more deeply encode, as in an auto-encoder model), and distinguish both from top-down excitatory biasing, which directly influences the online superficial layer neural representations of the current stimulus (Desimone & Duncan, 1995; Miller & Cohen, 2001; O’Reilly et al., 2013; Reynolds, Chelazzi, & Desimone, 1999). Finally, many discussions of prediction error in the literature include late, frontally-associated processes such as those associated with the P300 ERP component (Holroyd & Coles, 2002). We specifically exclude these from the scope of the mechanisms described here, which are anticipatory, fast, and low-level, as is appropriate for the posterior cortical sensory processing areas that interconnect with the pulvinar.
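The alpha-cycle scheme described in this section (a prediction state followed by a driver-imposed outcome state, with learning driven by their temporal difference) can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the model reported in this paper: the units are linear, a plain delta rule stands in for the biologically detailed learning mechanism, and the names `run_alpha_cycles`, `w`, `context`, and `lrate` are hypothetical.

```python
import numpy as np

def run_alpha_cycles(sequence, n_epochs=20, lrate=0.5):
    """Train a linear predictor on a repeating sequence, one item per alpha cycle.

    Hypothetical simplification: `context` stands in for the deep-layer (6CT)
    state, `w` for the learned 6CT-to-pulvinar weights, and each row of
    `sequence` for the pattern imposed by the 5IB driver burst.
    """
    n_units = sequence.shape[1]
    w = np.zeros((n_units, n_units))   # no prediction exists before learning
    context = np.zeros(n_units)        # deep-layer context from the prior cycle
    epoch_errors = []
    for _ in range(n_epochs):
        total = 0.0
        for outcome in sequence:
            minus = w @ context        # minus phase: top-down prediction
            plus = outcome             # plus phase: driver-imposed outcome
            err = plus - minus         # temporal difference between the phases
            w += lrate * np.outer(err, context)  # error times sender activity
            context = outcome.copy()   # context updates only at the burst
            total += float(np.sum(err ** 2))
        epoch_errors.append(total)
    return w, epoch_errors

# Four one-hot patterns presented in a fixed, repeating order.
sequence = np.eye(4)
w, epoch_errors = run_alpha_cycles(sequence)
```

Because the context is copied only after the prediction has been scored, the predictor never sees the outcome it is trying to anticipate. Starting from zero weights, the summed squared error shrinks across epochs as each pattern comes to predict its successor, which mirrors the point that a prediction develops through learning even if none was initially generated.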
Computational Properties of Predictive Learning in the Thalamocortical Circuits
We next elaborate the connections between the computational properties required for predictive learning, and the properties of the thalamocortical circuits in the pulvinar, which appear to be notably well suited for the hypothesized predictive learning role, in the following ways:

• Assuming the process of generating a prediction involves the integration of multiple converging inputs from a range of higher-level cortical areas, each encoding different dimensions of relevance (e.g., location, motion, color, texture, shape, etc.), sufficient time must be available to perform this integration, along with some kind of dedicated neural substrate upon which it can be performed. This neural substrate must be distinct from those encoding the continuously evolving representations of the incoming sensory state, assuming that it is not possible to suspend that process during the time it takes to develop the prediction (and thereby re-use the same substrate). Furthermore, it is likely that prediction generation requires a broader convergence of top-down inputs than is required for sensory state encoding, and any prediction error signal should also be widely broadcast back out to these same areas, to provide the training signal that improves their predictions. All of these considerations are nicely satisfied by having a separate, compact, broadly integrative, bidirectionally connected nucleus in the form of the pulvinar and its 6CT inputs and reciprocal efferents back out to the neocortex (Shipp, 2003). Furthermore, the TRC neurons are distinctive in having no significant lateral interconnectivity (S. M. Sherman & Guillery, 2006), enabling them to faithfully represent their inputs. These properties led Mumford (1991) to characterize the pulvinar as a blackboard, and we further suggest the metaphor of a projection screen upon which the predictions are projected.
• The obvious locus for ongoing sensory integration and the online “current state” representation is in the superficial lamina of each cortical area. The pyramidal neurons here are densely and bidirectionally interconnected with other cortical areas, and update rapidly to new stimulus inputs, with continuous, relatively rapid firing (up to about 100 Hz) for preferred stimuli. These neurons integrate higher-level top-down information with bottom-up sensory information, to fill in missing information, resolve ambiguities, focus attention, and generally enhance the consistency and quality of the online representations (Desimone & Duncan, 1995; Miller & Cohen, 2001; O’Reilly, Hazy, & Herd, 2016; O’Reilly et al., 2012, 2013; Reynolds et al., 1999). As noted above, we distinguish this form of top-down processing, which is often most evident during the period after stimulus onset (Lee & Mumford, 2003), from the specifically predictive sort. However, when the deep layers are predictively anticipating the onset of upcoming stimuli, these top-down deep layer projections will result in pre-activation [...]

• Each cortical area requires a distinct population of neurons to generate its contribution to the overall prediction, for reasons that will become clear in a moment. With the superficial neurons occupied by the current state, this naturally leaves the deep lamina neurons as the logical substrate for this job, particularly the 6CT population that projects directly and exclusively to the pulvinar. Thus, this framework also provides a clear functional division of labor for the superficial and deep neocortical lamina (with layer 4 stellates providing a localized input processing function).

• A true prediction (about the future) must be prevented from cheating and relying on direct information about that which is being predicted. Thus there must be a mechanism preventing the new sensory outcome information continuously encoded in the superficial layers from “contaminating” the prediction-generation components in the deep layers and pulvinar. The phasic, bursting nature of the 5IB driver inputs provides this essential feature, creating a window where no outcome signals are impinging on the pulvinar, when the prediction can be represented. The prediction and current state synchronize at the moment of the 5IB driver bursting. Furthermore, the activity in the 6CT deep layers that generates the top-down predictions over the pulvinar is itself driven by 5IB neuron bursting within the local columnar circuits of these higher level areas, such that these prediction-generating neurons are also kept isolated from current superficial-layer activation (Figure 1b). Computationally, this functions much like the simple recurrent network (SRN) context layer updating (J. L. Elman, 1990; Jordan, 1989) which reflects the prior trial’s state, as shown in the supplemental information. Interestingly, by these principles, the lack of bursting in the driver inputs to first-order sensory thalamus areas (S. M. Sherman & Guillery, 2006) means that these areas should not be directly capable of error-driven predictive learning, but they do receive “collateral” error signals from the pulvinar (Shipp, 2003), which could provide some useful indirect error-driven learning signals.

• The outcome signal should be as veridical as possible (i.e., directly reflecting the bottom-up outcome), and should arise from lower areas in the hierarchy relative to the corresponding 6CT inputs: the bottom-up, sparse, focal, strongly driving nature of the 5IB projections can directly convey such veridical outcome signals, and ensure that they dominate the activation of their TRC targets. Based on indirect available data, it is likely that each pulvinar TRC neuron receives only roughly 1-6 driver inputs (S. M. Sherman & Guillery, 2006, 2011). Furthermore, these inputs are likely not plastic (Usrey & Sherman, 2018): they drive plasticity, and are thus not subject to it.

• The integration required to generate the prediction should take more time than the outcome phase, which is consistent with a relatively brief period of 5IB bursting (roughly 25 msec or less; Connors et al., 1982), leaving approximately 75 msec of a nominal 100 msec alpha cycle for this integration. The overall duration of the alpha cycle itself may represent a reasonable compromise between this integration time and the need to keep up with predictions tracking changes in the world.

• For cortical neurons receiving projections from the pulvinar, there must be some way in which the difference between prediction and outcome (i.e., the error itself) can drive learning. Here we hypothesize that this difference remains as a temporal difference error signal, i.e., the difference over time in pulvinar activation states, arising naturally as a prediction state followed by the outcome state. This contrasts with prevalent alternative hypotheses that require a separate population of neurons to compute a prediction error “explicitly” and transmit it directly through neural firing, as we discuss later. We argue that a temporal-difference prediction error signal is more natural, efficient, and consistent with bidirectional excitatory connectivity between cortical areas (Desimone & Duncan, 1995; Hopfield, 1984; Miller & Cohen, 2001; O’Reilly et al., 2013; Reynolds et al., 1999; Rumelhart & McClelland, 1982). Note that this form of temporal-difference learning signal is distinct from the widely-used TD model in reinforcement learning (Sutton & Barto, 1998), which is scalar, and applies to reward expectations, not sensory predictions (although see Gardner, Schoenbaum, & Gershman, 2018 and Dayan, 1993 for potential connections between these two forms of prediction error).

• There is a long history of computational models of error-driven learning based on temporal-difference signals (Ackley et al., 1985; Bengio et al., 2017; Lillicrap et al., 2020; O’Reilly, 1996; Whittington & Bogacz, 2019), and we have recently provided a direct biological mechanism for this form of learning (O’Reilly et al., 2012), based on a biologically-detailed model of spike timing dependent plasticity (STDP) (Urakubo et al., 2008). We showed that when activated by realistic Poisson spike trains, this STDP model produces a non-monotonic learning curve similar to that of the BCM model (Bienenstock, Cooper, & Munro, 1982), which results from competing calcium-driven postsynaptic plasticity pathways (Cooper & Bear, 2012; Shouval et al., 2002). As in the BCM framework, we hypothesized that the threshold crossover point in this nonmonotonic curve moves dynamically; if this happens on the alpha timescale (Lim et al., 2015), then it can reflect the prediction phase of activity, producing a net error-driven learning rule based on a subsequent calcium signal reflecting the outcome state. This form of error-driven learning mathematically approximates backpropagation gradient descent to minimize overall prediction errors (O’Reilly, 1996).

Thus, remarkably, the pulvinar and associated thalamocortical circuitry appears to provide precisely the necessary ingredients to support predictive error-driven learning. Interestingly, although S. M. Sherman and Guillery (2006) did not propose a predictive learning mechanism as just described, they did speculate about a potential role for this circuit in motor forward-model learning and the predictive remapping phenomenon (S. M. Sherman & Guillery, 2011; Usrey & Sherman, 2018). In addition, Pennartz, Dora, Muckli, and Lorteije (2019) also suggested that the pulvinar may be involved in predictive learning, but within the explicit error-coding framework and not involving the detailed aspects of the above-described circuitry.

As we discuss later, this proposed predictive role for the pulvinar is not incompatible with the more widely-discussed role it may play in attention (Bender & Youakim, 2001; Fiebelkorn & Kastner, 2019; LaBerge & Buchsbaum, 1990; Saalmann & Kastner, 2011; Snow, Allen, Rafal, & Humphreys, 2009; Zhou et al., 2016). Indeed, we think these two functions are synergistic (i.e., you predict what you attend, and vice-versa; Richter & de Lange, 2019), and have initial computational results consistent with this idea.

In the following sections, we discuss some of the most important neural data of relevance to our hypotheses (beyond that summarized above), including contrasts with a widely-discussed alternative framework for predictive coding, and some of the extensive data on alpha frequency effects, followed by a discussion of predictions that would clearly test the validity of this framework.
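The floating-threshold idea in the final bullet of the list above can be made concrete with a toy plasticity curve. The sketch below is an illustration only: the piecewise-linear shape, the reversal parameter `d_rev`, its value of 0.1, and the function names `checkmark` and `weight_change` are all assumptions introduced here, not details given in this text. The threshold is set by sender-receiver co-activity in the prediction (minus) phase, and the drive by co-activity in the outcome (plus) phase.

```python
def checkmark(drive, threshold, d_rev=0.1):
    """Piecewise-linear, BCM-like plasticity curve (illustrative form).

    drive:     outcome-phase co-activity, a stand-in for the calcium signal
    threshold: floating threshold set by prediction-phase co-activity
    d_rev:     hypothetical reversal point, as a fraction of the threshold
    """
    if drive > threshold * d_rev:
        # Above the reversal point: depression below threshold, potentiation above it.
        return drive - threshold
    # Near zero drive the curve returns linearly to zero change.
    return -drive * (1.0 - d_rev) / d_rev

def weight_change(send_minus, recv_minus, send_plus, recv_plus, lrate=0.1):
    """Weight change when the prediction phase sets the floating threshold."""
    threshold = send_minus * recv_minus   # prediction-phase co-activity
    drive = send_plus * recv_plus         # outcome-phase co-activity
    return lrate * checkmark(drive, threshold)

under_predicted = weight_change(1.0, 0.2, 1.0, 0.8)  # outcome exceeds prediction
over_predicted = weight_change(1.0, 0.8, 1.0, 0.2)   # prediction exceeds outcome
accurate = weight_change(1.0, 0.5, 1.0, 0.5)         # prediction matches outcome
```

When the sender is equally active in both phases and the drive stays above the reversal point, the change reduces to the learning rate times sender activity times the receiver's plus-minus difference, which is the same form as the temporal-difference error described in the list. Under these assumptions an accurate prediction produces no weight change, an under-predicted outcome strengthens the synapse, and an over-predicted one weakens it.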
Additional Neuroscience Data
We begin with data relevant to the basic neural-level properties of the framework. First, direct electrophysiological recording of deep layer neurons shows periodic alpha-scale bursting for continuous tones (Luczak, Bartho, & Harris, 2013), and there are a variety of potential mechanisms behind the generation and synchronization of the 5IB bursts driving this alpha cycle (Connors et al., 1982; Franceschetti et al., 1995; Silva et al., 1991). Furthermore, the pulvinar has been shown to drive synchronization of cortical activity across areas in the alpha band (Saalmann et al., 2012). We review the larger alpha frequency literature in more detail below.

The 6CT neurons exhibit regular spiking behavior, in contrast to the 5IB bursting (Thomson, 2010; Thomson & Lamy, 2007). Also, they do not have axonal branches that project to other cortical areas: the subpopulation that projects to the pulvinar only projects there and not to other cortical areas (Petrof, Viaene, & Sherman, 2012), whereas there are other layer 6 neurons that do project to other cortical areas. This distinct connectivity is consistent with a specific role of this neuron type in generating predictions in [...] glomeruli structures at their synapses onto pulvinar neurons, which contain a complete feedforward inhibition circuit involving a local inhibitory interneuron, in addition to the direct strong excitatory driver input (Wilson, Bose, Sherman, & Guillery, 1984). Computationally, this can provide a balanced level of excitatory and inhibitory drive so as to not overly excite the receiving neuron, while still dominating its firing behavior.

Although there are well-documented and widely-discussed burst vs. tonic firing modes in pulvinar neurons (S. M. Sherman & Guillery, 2006), there is not much evidence of these playing a clear role in the awake, behaving state, and as noted above the growing electrophysiological evidence shows a remarkable correspondence between cortical and pulvinar response properties across multiple different pulvinar areas in this awake state. Nevertheless, there may be important dynamics arising from these firing modes that are more subtle or emerge in particular types of state transitions that may have yet to be identified.
Contrast with Explicit Error (EE) Frameworks
To further clarify the nature of the present theory, and introduce a body of relevant data, it is important to contrast it with the widely-discussed explicit error (EE) framework for predictive coding (Bastos et al., 2012; Friston, 2005, 2010; Kawato et al., 1993; Lotter et al., 2016; Ouden et al., 2012; Rao & Ballard, 1999) (Figure 2). Despite many attempts to identify such explicit error-coding neurons in the cortex, no substantial body of unambiguous evidence has been discovered (Kok & de Lange, 2015; Kok, Jehee, & de Lange, 2012; Lee & Mumford, 2003; Summerfield & Egner, 2009; Walsh, McGovern, Clark, & O’Connell, 2020). Furthermore, due to the positive-only firing rate nature of neural coding, two separate populations would be required to convey both signs of prediction error signals, or the error would have to be encoded as a variation from tonic firing levels, which are generally low in the neocortex. By contrast, the use of temporal-difference error signals enables all connections between cortical layers to be excitatory, and allows each layer to represent the positive encoding of either the prediction or outcome state, at different levels of abstraction. These properties are overwhelmingly supported by extensive electrophysiological data about the hierarchical organization of representations, e.g., in the visual object recognition pathway (Cadieu et al., 2014; Kobatake & Tanaka, 1994; VanRullen & Thorpe, 2002), and are consistent with the widely-supported biased competition model for excitatory top-down attentional effects (Desimone & Duncan, 1995; Miller & Cohen, 2001; O’Reilly et al., 2013; Reynolds et al., 1999).
By contrast, the EE approach requires net inhibitory top-down predictions, and it sends error signals forward, not positive representations of the actual state at a given level of abstraction.
Thus a literal interpretation (and at least one existing implementation; Lotter et al., 2016) has only error signals represented at all levels above the lowest level, which is inconsistent with the positive encoding of stimuli at various levels of abstraction across the visual hierarchy. For example, although Issa, Cadieu, and DiCarlo (2018) observed an error-signal-like increase in activation for atypical faces in some pIT neurons, these neurons overall had a positive stimulus encoding, with only a relatively small, later, error-like modulation. Furthermore, as discussed below, anticipatory predictions typically closely resemble the subsequent stimulus-driven activity, suggesting a positive, not inhibitory, effect (Cavanagh, Hunt, Afraz, & Rolfs, 2010; Duhamel, Colby, & Goldberg, 1992; Lee & Mumford, 2003; Walsh et al., 2020). There are various ways of reformulating the neural implementation of EE that can avoid some of these issues (Bastos et al., 2012; Spratling, 2008), but perhaps this flexibility renders the framework difficult to falsify (Kogo & Trengove, 2015). In any case, an extensive treatment of the issues with EE is beyond the scope of this paper and has already been aptly covered by Walsh et al. (2020); our goal here is to highlight some of the core differences as a way to clarify the framework by way of contrast, and in relation to available data.
[Figure 2 diagram. Panel a) Temporal Difference Error: Pulvinar, 6CT, 5IB, and Super layers, with prediction (minus) and actual (plus) states; all connections are excitatory, red = plus phase, blue = minus. Panel b) Explicit Error: V1, V2, and V4 with Super and Deep layers; purple = inhibitory, green = excitatory, black = unspecified.]
Figure 2: Comparison between: a) The proposed thalamocortical temporal-difference predictive learning model (from Figure 1), versus b) The Bayesian-style explicit error (EE) coding model (Rao & Ballard, 1999; Friston, 2010; Bastos et al., 2012), in a situation where the prediction is clearly erroneous (ball predicted to emerge on right, actually emerges on left). The EE model holds that superficial (2/3) error-coding neurons receive the prediction via a net inhibitory top-down projection from higher-level deep layer neurons, and an excitatory bottom-up projection representing the outcome, such that their activation represents the difference. To encode both signs of the error (omissions, false alarms) with positive-only spike rates, two separate populations of EE neurons would be required, or a more complicated deviation from tonic firing level scheme. Unambiguous evidence of such EE coding neurons has not been found (Walsh et al., 2020). In contrast, error signals in our proposed framework remain as a temporal difference between the two states of prediction vs. outcome, which enables all connectivity between cortical areas to be excitatory and always represent a positive encoding of either the prediction or outcome. Under EE, after one error subtraction at the lowest level, only error signals are hypothesized to flow forward to higher layers, meaning that the representations at higher layers are about increasingly higher-order errors, not positive encodings of the environmental state at increasing levels of abstraction. This is inconsistent with extensive available data. For this illustration, V1 is assumed to be like a clamped input layer, not subject to predictive learning itself.
First, there are many examples of anticipatory predictive neural firing in the brain. Of perhaps greatest relevance, Barczak et al. (2018) recently showed that the auditory pulvinar in monkeys exhibits predictive firing using a carefully controlled auditory sequence that had no first-order acoustic differences from a background noise signal. The pulvinar predictive activation preceded that of A1, suggesting a strong predictive role for pulvinar. Unfortunately, the deep layers of higher auditory areas that should contribute to the formation of the pulvinar prediction were not recorded in this study, so their role in generating the prediction could not be determined. Nevertheless, there is extensive additional evidence for top-down anticipatory activation of predicted stimuli, with activity patterns closely resembling the subsequent stimulus-driven ones (Walsh et al., 2020). For example, the widely replicated predictive remapping effect is of this nature (Cavanagh et al., 2010; Duhamel et al., 1992; Wurtz, 2008); see below for a simulation and further discussion of these data. The fact that these anticipatory activations are of a positive nature, consistent with the stimulus-driven [...]
[...] expectation suppression (Bastos et al., 2012; Meyer & Olson, 2011; Summerfield et al., 2008; Todorovic et al., 2011). This phenomenon is widely cited as evidence in favor of the EE predictive coding framework, consistent with an inhibitory effect of the expectation.
Nevertheless, despite various conflicting results and many complications of interpretation, multiple comprehensive reviews conclude that it is difficult to distinguish expectation suppression from the neural adaptation effects that underlie the well-documented repetition suppression effect (Kok & de Lange, 2015; Kok et al., 2012; Lee & Mumford, 2003; Summerfield & Egner, 2009; Vinken & Vogels, 2017; Walsh et al., 2020). Furthermore, detailed single-neuron level recordings are the least likely to show these effects; instead, they are most evident in aggregate signals such as the BOLD response in fMRI, suggesting that they may more strongly reflect population-level differences in activity, rather than individual explicit error coding neurons.
As noted earlier, in our framework accurately predicted outcomes would result in a continued adaptation of the neural response carrying over from the prediction to the outcome state, whereas unexpected outcomes would be associated with two distinct patterns of activity over a given area: first the prediction and then the outcome. Thus, the unexpected outcome state would not be subject to the prior neural adaptation effects, and furthermore the time-integrated aggregate activity over these two patterns would be greater compared to the single activity state associated with an accurately predicted outcome. Thus, our model explains expectation suppression without invoking EE neurons, meaning that considerably more detailed and replicable experimental paradigms using single-neuron resolution techniques are needed to distinguish EE from our framework.
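The adaptation-based account above can be illustrated with a toy calculation. The sketch below is not taken from the model: the function name `aggregate_response`, the `adapt_rate` parameter, and the four-unit patterns are all hypothetical, and adaptation is assumed to simply scale down the outcome-phase response of units that were active during the prediction.

```python
import numpy as np

def aggregate_response(prediction, outcome, adapt_rate=0.5):
    """Summed activity over a prediction phase followed by an outcome phase.

    Toy assumption: each unit's outcome response is reduced in proportion
    to how active that same unit was during the prediction phase.
    """
    pred_resp = prediction
    adaptation = adapt_rate * pred_resp          # builds up only on active units
    out_resp = outcome * (1.0 - adaptation)      # adapted units respond less
    return pred_resp.sum() + out_resp.sum()

pattern_a = np.array([1.0, 1.0, 0.0, 0.0])  # hypothetical predicted pattern
pattern_b = np.array([0.0, 0.0, 1.0, 1.0])  # hypothetical different outcome

expected = aggregate_response(pattern_a, pattern_a)    # outcome matches prediction
unexpected = aggregate_response(pattern_a, pattern_b)  # outcome differs
```

Under these assumptions the aggregate signal is larger for the unexpected outcome, because its units carry no adaptation from the prediction phase, even though no unit explicitly codes an error.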
Alpha Frequency Effects
The alpha frequency bursting of 5IB neurons acting as drivers into the pulvinar naturally entrains the predictive learning process in our model to this fundamental rhythm, which has long been recognized as an important signature of posterior cortical function (Berger, 1929; Nunn & Osselton, 1974; VanRullen & Koch, 2003; Varela, Toro, John, & Schwartz, 1981; Walter, 1953). A number of different functional associations with alpha have been established, and this literature is large and growing rapidly. Thus, we refer the reader to recent reviews (Clayton et al., 2018; Foster & Awh, 2019; Jensen, Bonnefond, Marshall, & Tiesinga, 2015; VanRullen, 2016) while highlighting the data most relevant to our specific framework here, organized according to a set of key points.
• Alpha is specifically associated with deep neocortical layers and the pulvinar, and with feedback pathways in the cortex. This has been established using direct laminar-specific electrophysiological single-neuron and local field potential (LFP) recordings (Buffalo et al., 2011; Luczak et al., 2013; Maier, Adams, Aura, & Leopold, 2010; Maier, Aura, & Leopold, 2011; Spaak, Bonnefond, Maier, Leopold, & Jensen, 2012; Xing, Yeh, Burns, & Shapley, 2012), and feedforward vs. feedback manipulations (Bastos et al., 2015; Jensen et al., 2015; Michalareas et al., 2016; van Kerkoerle et al., 2014; von Stein, Chiang, & König, 2000). These data are consistent with the 5IB alpha bursting and the major role of cortical deep layers in driving top-down corticocortical projections (in addition to the 6CT pathway which is specific to the pulvinar). By contrast, these same papers show that superficial cortical layers are associated with gamma frequency (40 Hz) dynamics. Overall, these data suggest that noninvasive EEG methods could provide a direct window onto the predictive learning process. However, the next point raises some important interpretational difficulties.
• Increases in cortical activity levels, e.g., due to attention, produce a corresponding decrease in alpha power, while decreased activity increases alpha power (Foster & Awh, 2019; Fries, Womelsdorf, Oostenveld, & Desimone, 2008; Jensen & Mazaheri, 2010; Kelly, Lalor, Reilly, & Foxe, 2006; Klimesch, Sauseng, & Hanslmayr, 2007; Worden, Foxe, Wang, & Simpson, 2000). This pattern is not exactly what one might expect if alpha were a signature of predictive learning. Moreover, given that these same pulvinar and thalamocortical pathways are also widely regarded as important for attention (Bender & Youakim, 2001; Fiebelkorn & Kastner, 2019; LaBerge & Buchsbaum, 1990; Saalmann & Kastner, 2011; Snow et al., 2009; Zhou et al., 2016), this pattern presents a challenge for many theorists. However, it is possible to explain this pattern as arising directly from the desynchronizing effects of cortical activity on alpha power. Specifically, neural spiking is associated with broadband noise, due to the highly random, Poisson nature of spike firing, which can desynchronize the entrainment of lower-frequency oscillations including alpha (Privman, Malach, & Yeshurun, 2013; Ray & Maunsell, 2011; Solomon et al., 2017; Waldert, Lemon, & Kraskov, 2013). In other words, because cortical activity is inherently noisy, it tends to interfere with the coherent activity across populations of neurons needed to produce a strong alpha frequency power signal. This explanation is directly supported by studies manipulating and measuring cortical activity (Fries et al., 2008; Zhou et al., 2016), and is consistent with alpha power changes being a result of attentional modulation, but not their cause (Antonov, Chakravarthi, & Andersen, 2020). Thus, while attention and predictive learning can both affect overall activity levels in cortex, and thus drive changes in alpha power, alpha power itself is not a transparent measure of the underlying mechanisms supporting these functions, which may help to explain some contradictory patterns of results (Foster & Awh, 2019; Gundlach, Moratti, Forschack, & Müller, 2020; Keitel et al., 2019).
• Alpha phase effects provide a more direct measure of thalamocortical function than alpha power, and have been more consistently related to perception, attention, and prediction (Busch, Dubois, & VanRullen, 2009; Jaegle & Ro, 2013; K. E. Mathewson, Fabiani, Gratton, Beck, & Lleras, 2010; Neupane, Guitton, & Pack, 2017; Nunn & Osselton, 1974; Palva & Palva, 2011; Solís-Vivanco, Jensen, & Bonnefond, 2018; VanRullen & Koch, 2003; Varela et al., 1981). For example, weak, near-threshold stimuli are more reliably detected and processed when presented in the trough of the individual’s ongoing alpha cycle. Of greatest relevance to the present paper are studies showing effects of prediction on alpha phase (Mayer, Schwiedrzik, Wibral, Singer, & Melloni, 2016; Samaha, Bauer, Cimaroli, & Postle, 2015; M. T. Sherman, Kanai, Seth, & VanRullen, 2016). For example, Mayer et al. (2016) showed that prestimulus alpha phase directly correlated with the predictability of the upcoming stimulus, and the pattern of this prestimulus activation was indistinguishable from the subsequent stimulus activation pattern. This is consistent with our model, and less consistent with the EE framework, as discussed previously. Neupane et al. (2017) found strong alpha coherence effects in LFP recordings distributed across V4, associated with the predictive remapping of receptive fields (Duhamel et al., 1992).
• Discrete, salient, or oscillatory stimuli entrain the alpha cycle in the brain (K. E. Mathewson et al., 2012; Spaak, de Lange, & Jensen, 2014). Furthermore, the massive literature on event related potentials (ERPs) may represent a significant contribution from alpha-level entrainment (Gruber, Klimesch, Sauseng, & Doppelmayr, 2005; Klimesch, 2011; Makeig et al., 2002). These entrainment effects are consistent with the 5IB entrainment mechanisms in our framework, as described earlier, and entrainment is functionally important for aligning predictive learning with relevant salient or unexpected outcomes.
• The pulvinar contributes to synchronizing alpha phase relationships across different brain areas (Fiebelkorn, Pinsk, & Kastner, 2018; Saalmann et al., 2012). This is consistent with the broad, convergent pattern of projections into the pulvinar from many different cortical areas, and the corresponding broad projections back out to these same areas (Arcaro et al., 2015; Shipp, 2003). Functionally, this [...]
• The theta cycle, composed of a pair of alpha cycles, organizes saccades and attentional, motor, and mnemonic processes (Fiebelkorn & Kastner, 2019). The theta rhythm is dominant in the medial temporal lobe and hippocampus, and has been extensively studied there (Buzsáki, 2005; Kahana, Seelig, & Madsen, 2001). Furthermore, there is a sharp peak of saccade fixation durations at 200 msec, which suggests that two alpha cycles are typically required for complete processing of a given fixation. On the first cycle, the predictions from before the eye moved may be fairly vague depending on factors such as the size of the saccade and familiarity with the environment. But after the first alpha cycle of a fixation, a subsequent postdiction phase can provide an important additional learning opportunity, to consolidate and more deeply encode the current fixation (computationally equivalent to an auto-encoder). Also, a mix of smaller saccades (including microsaccades) and larger saccades enables a range of more and less predictable outcomes on the first alpha cycle after the saccade, and matches human behavior (Martinez-Conde, Macknik, & Hubel, 2004; Martinez-Conde, Otero-Millan, & Macknik, 2013).
Putting all of these points together, a particularly effective way of testing the predictions of our framework would be measuring alpha phase changes emerging in the prestimulus period as a function of predictive learning in predictable sequential stimulus streams. In addition, it would also be important to examine theta and alpha-cycle dynamics in relation to predictive learning in the context of attention, motor control, and memory processes, to better understand the larger systems-level temporal organization of learning and processing in the brain (Fiebelkorn & Kastner, 2019).
Predictions for Predictive Learning
In this section, we enumerate a set of direct, testable predictions from our framework. Before doing so, there are several important considerations for any experimental test of the theory. First, the nature of what is to be learned must be matched to the pulvinar area in question. For example, learning a new variation of basic physics in movies at the alpha time scale (e.g., altering properties such as gravity, inertia, or elasticity), would be appropriate for the lower level visual pathways. At higher visual levels (e.g., IT cortex), it might be possible to use simple sequences of different objects, although it is not clear to what extent the hippocampus or prefrontal cortex might also contribute in this case (Fiser et al., 2016; Gavornik & Bear, 2014). To distinguish pulvinar learning effects from pervasive motor learning supported by other brain areas, it would be most effective to directly measure activity in the pulvinar and / or associated perceptual neocortical areas, instead of involving overt behavioral performance.
Much of the learning in posterior sensory cortex should take place early in development, requiring very early developmental interventions or genetic knockouts that are expressed from the start (which can also have other interpretational issues if not highly selective). In our models described below, the bulk of the basic sensory predictive learning happens very quickly, because the basic first-level regularities are quite strong and relatively easily learned. While there are longer-term changes in the higher-level pathways in our models, more fine-grained measurements would likely be required to see these changes. Once this learning has taken place, the remaining contributions of the thalamocortical circuit are likely more strongly weighted toward its role in attention, as we discuss below.
Finally, directly lesioning or inactivating the pulvinar is not likely to be very informative, because existing work has shown dramatic effects on cortical activity (Purushothaman, Marion, Li, & Casagrande, 2012; Zhou et al., 2016), and also any effects could be attributed to the attentional contributions of the pulvinar.
With these considerations in mind, here is a set of strong predictions from our model that should be testable using existing techniques. Failure to obtain the predicted result, while adhering to all the relevant constraints, would constitute a falsification of our model.
• Blocking 5IB bursting mechanisms early in developmental learning should disrupt learning. It should be possible to selectively knock out or modify the channels that cause this specific population of neurons to burst fire, and doing so should have a significant effect on learning in associated neocortical and pulvinar areas, given the critical role that this burst firing plays in the predictive learning process as elaborated above.
• Blocking synaptic plasticity in pulvinar (specifically the 6CT inputs) very early in developmental learning should impair learning. While most of the learning overall should occur in the neocortex as a result of the temporal difference error signal broadcast by the pulvinar (which should remain generally intact), learning in the 6CT projections is important, especially right at the start, to map the emerging neocortical representations into the space defined by the 5IB projections.
• Temporal differences on an alpha cycle timescale actually drive synaptic plasticity in an error-driven learning manner, in neocortical pyramidal neurons and in 6CT inputs to pulvinar. That is, if a pre / post pair of neurons across a synapse is more active in the prediction than the subsequent outcome, the synapse should experience LTD (long term depression), and vice-versa if the activity pattern is reversed (long term potentiation, LTP, for more activity in outcome than prediction). Furthermore, if activity is essentially stable across both prediction and outcome phases, then weights should not change (modulo a small level of Hebbian learning; O’Reilly & Munakata, 2000; O’Reilly et al., 2012).
This should be directly testable using current experimental methods, and is perhaps the single most important empirical test of this entire framework, and it also underlies many other current approaches to error-driven learning in the brain (Bengio et al., 2017; Lillicrap et al., 2020; Whittington & Bogacz, 2019). One general consideration is the extent to which an awake in vivo preparation would be required to capture all the neuromodulatory and other factors present when this learning normally takes place. Some suggestive evidence in such a preparation is generally consistent with a sensitivity to relatively short-term temporal dynamics (Lim et al., 2015), although these results lacked the direct measurement of individual neural activity across a synapse.
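The third prediction above (LTD when a pre / post pair is more active in the prediction, LTP when it is more active in the outcome, no change when activity is stable) can be sketched with a simple contrastive product rule. This is an illustrative stand-in under stated assumptions, not necessarily the learning function used in the model: the name `td_weight_change`, the `lrate` value, and the activity values are hypothetical, and the small Hebbian component is omitted.

```python
import numpy as np

def td_weight_change(pre_minus, post_minus, pre_plus, post_plus, lrate=0.1):
    """Weight change per synapse from prediction (minus) vs. outcome (plus) activity.

    Assumed form: dW = lrate * (pre_plus * post_plus - pre_minus * post_minus),
    computed for every pre/post pair.
    """
    minus_coproduct = np.outer(pre_minus, post_minus)  # prediction phase
    plus_coproduct = np.outer(pre_plus, post_plus)     # outcome phase
    return lrate * (plus_coproduct - minus_coproduct)

high = np.array([0.9])  # hypothetical strong activity
low = np.array([0.2])   # hypothetical weak activity

ltd = td_weight_change(high, high, low, low)       # more active in prediction
ltp = td_weight_change(low, low, high, high)       # more active in outcome
stable = td_weight_change(high, high, high, high)  # same activity in both phases
```

With these values `ltd` is negative, `ltp` is positive, and `stable` is zero, matching the sign pattern that the proposed experiment would look for.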
Predictive Learning of Object Categories in IT Cortex
Now we turn to our implementation of the proposed thalamocortical predictive error-driven learning framework, in a large-scale model of visual predictive learning (Figure 3). Our second major objective, and a critical question for predictive learning, is whether the model can develop high-level, abstract ways of representing the raw sensory inputs, while learning from nothing but predicting these low-level visual inputs. We showed the model brief movies of 156 3D object exemplars drawn from 20 different basic-level categories (e.g., car, stapler, table lamp, traffic cone) selected for their overall shape diversity from the CU3D-100 dataset (O’Reilly et al., 2013). The objects moved and rotated in 3D space over 8 movie frames, where each frame was sampled at the alpha frequency (Figure 3b). There were also saccadic eye movements every other frame, introducing an additional predictive-learning challenge. An efferent copy signal enabled full prediction of the effects of the eye movement, and allowed the model to capture the signature predictive remapping phenomenon (Cavanagh et al., 2010; Duhamel et al., 1992; Neupane et al., 2017). The only learning signal available to the model was the prediction error generated by the temporal difference between what it predicted to see in the V1 input in the next frame and what was actually seen.
[Figure 3 diagram. Panel a): network with Dorsal Where, Ventral What, and What * Where pathways, Deep 6CT (predict) and 5IB (actual) inputs, and a prediction error at each pulvinar layer; additional labels: BidirErr, Backprop. Panel b): rows labeled Image, V1h recon, Pulvinar prediction, pred err, and saccade, at Low Res and High Res, with r: .33 .53 .59 .62 .57 .52 .48 .58.]
Figure 3: a) The What-Where-Integration (WWI) deep predictive learning model. The dorsal Where pathway learns first, using easily-abstracted spatial blobs, to predict object location based on prior motion, visual motion, and saccade efferent copy signals. This drives strong top-down inputs to lower areas with accurate spatial predictions, leaving the residual error concentrated on What and What * Where integration. The V3 and DP (dorsal prelunate) constitute the What * Where integration pathway, binding features and locations. V4, TEO, and TE are the What pathway, learning abstracted object category representations, which also drive strong top-down inputs to lower areas. Suffixes: s = superficial, d = deep, p = pulvinar. b) Example sequence of 8 alpha cycles that the model learned to predict, with the reconstruction of each image based on the V1 Gabor filters (V1h recon), and model-generated prediction (correlation r prediction error shown). The low resolution and reconstruction distortion impair visual assessment, but r values are well above the r’s for each V1 state compared to the previous time step (mean = .38, min of .16 on frame 4; see SI for more analysis). Eye icons indicate when a saccade occurred.
As described in detail in the supporting information, our model was constructed to capture critical features of the visual system, including the major division between a dorsal Where and ventral What pathway (Ungerleider & Mishkin, 1982), and the overall hierarchical organization of these pathways derived from detailed connectivity analyses (Felleman & Van Essen, 1991; Markov, Ercsey-Ravasz, et al., 2014; Markov, Vezoli, et al., 2014; Rockland & Pandya, 1979). In addition to these biological constraints, we conducted extensive exploration of the connectivity and architecture space, and found a remarkable convergence between what worked functionally and the known properties of these pathways (O’Reilly et al., 2017). For example, the feedforward pathway has projections from lower-level superficial layers to superficial layers of
higher levels, while feedback originated in both the superficial and deep layers and projected back to both (Felleman & Van Essen, 1991; Rockland & Pandya, 1979). Also, consistent with the core features of the pulvinar pathways discussed above, deep layer predictive (6CT) inputs originated in higher levels, while driver (5IB) inputs originated in lower levels. For simplicity we organized the model layers in terms of these driver inputs, whereas the topographic organization of pulvinar in the brain is organized more according to the 6CT projection loops (Shipp, 2003).
[Figure 4 matrices. Panel a): Biological model. Panel b): Human ratings. Category labels on both: Pyramid, Vertical, Round, Boxy, Horizontal.]
Figure 4: a) Category similarity structure that developed in the highest layer, TE, of the biologically based predictive learning model, showing correlation distance (1-correlation) similarity of the TE representation for each 3D object against every other 3D object (156 total objects). Blue cells have high similarity, and the model has learned block-diagonal clusters or categories of high-similarity groupings, contrasted against dissimilar off-diagonal other categories. Clustering maximized average within - between correlation distance (see SI), and clearly corresponded to the shown shape-based categories, with exemplars from each category shown. Also, all items from the same basic-level object categories (N=20) are reliably subsumed within learned categories. b) Human similarity ratings for the same 3D objects, presented with the V1 reconstruction (see Fig 1b) to capture coarse perception in the model, aggregated by 20 basic-level categories (the 156 x 156 matrix was too large to sample densely experimentally). Each cell is 1 - the proportion of times a given object pair was rated more similar than another pair (see SI). The human matrix shares the same centroid categorical structure as the model (confirmed by permutation testing and agglomerative cluster analysis, see SI), indicating that human raters used the same shape-based category structure.
Another important set of parameters is the strengths of deep-layer recurrent projections, which influence the timescale of temporal integration, producing a simple biologically-based version of slow feature analysis (Foldiak, 1991; Wiskott & Sejnowski, 2002). We followed the biological data suggesting that recurrence increases progressively up the visual hierarchy (Chaudhuri, Knoblauch, Gariel, Kennedy, & Wang, 2015). It was essential that the Where pathway learn first, consistent with extant data (Bourne & Rosa, 2006; Kiorpes, Price, Hall-Haro, & Anthony Movshon, 2012), including early pathways interconnecting LIP and pulvinar (Bridge, Leopold, & Bourne, 2016), and a rare asymmetric pathway, from V1 to LIP (Markov, Ercsey-Ravasz, et al., 2014), providing a direct short-cut for high-level spatial representations in LIP. Results from various informative model architecture and parameter manipulations are discussed below after the primary results from the standard intact model. Learning curves and other model details are shown in the supporting information.
[...] What pathway of the model emerge through the deep hierarchy of layers progressing upward from V1. This has been investigated in recent comparisons between monkey electrophysiological recordings and deep convolutional neural networks (DCNNs), which provide a reasonably good fit to the overall progressive pattern of increasingly categorical organization (Cadieu et al., 2014). However, these DCNNs were trained on large datasets of human-labeled object categories, and it is perhaps not too surprising that the higher layers closer to these category output labels exhibited a greater degree of categorical organization; this is an intrinsic property of the error backpropagation gradients. In contrast, because the only source of learning in our model comes from prediction errors over the V1 input layers, the graded emergence of an object hierarchy here reflects a truly self-organizing learning process. Figure 5 compares the similarity structures in layers V4 and IT in macaque monkeys (Cadieu et al., 2014) with those in corresponding layers in our model. In both the monkeys and our model, the higher IT layer builds upon and clarifies the noisier structure that is emerging in the earlier V4 layer, showing that our model replicates the essential qualitative hierarchical progression in the brain. After presenting a few more analyses, we explore the critical factors that lead to this result.
We can more precisely quantify the emergence of categorical representations in our model by computing the second-order similarity across the similarity matrices computed at each layer in the network (Figure 6). This shows the extent to which the similarity matrix across objects in one layer is itself similar to the object similarity matrix in another layer, in terms of a correlation measure across these similarity matrices.
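The second-order similarity measure just described can be sketched as follows. This is a minimal illustration under assumptions, not the analysis code used for Figure 6: it assumes Pearson correlation at both levels, compares only the off-diagonal upper triangles, and uses random arrays as hypothetical stand-ins for layer activity; the function names are invented for this example.

```python
import numpy as np

def correlation_distance_matrix(acts):
    """acts: (n_objects, n_units) activity -> (n_objects, n_objects) of 1 - r."""
    return 1.0 - np.corrcoef(acts)  # rows are treated as the variables

def second_order_similarity(dist_a, dist_b):
    """Correlation between the off-diagonal upper triangles of two matrices."""
    upper = np.triu_indices_from(dist_a, k=1)
    return np.corrcoef(dist_a[upper], dist_b[upper])[0, 1]

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(6, 40))  # hypothetical: 6 objects x 40 units
layer_b = rng.normal(size=(6, 25))  # hypothetical: same 6 objects, other layer

dist_a = correlation_distance_matrix(layer_a)
dist_b = correlation_distance_matrix(layer_b)

same = second_order_similarity(dist_a, dist_a)   # a layer compared with itself
cross = second_order_similarity(dist_a, dist_b)  # two different layers
```

Because only the object-by-object matrices are compared, layers with different numbers of units can be related directly, which is what allows V1 and TE to be placed on the same curve.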
Starting from either V1 compared to all higher layers, or TE compared to all lower layers, we found a consistent pattern of progressive emergence of the object categorization structure in the upper IT pathway (TEO, TE). Critically, this analysis shows that the IT category structure is significantly different from that present at the level of the V1 primary visual input. Thus the model, despite being trained only to generate accurate visual input-level predictions, has learned to represent these objects in an abstract way that goes beyond the raw input-level information. We further verified that at the highest IT levels in the model, a consistent, spatially-invariant representation is present across different views of the same object (e.g., the average correlation across frames within an object was .901). This is also evident in Figure 4a by virtue of the close similarity [...]
[Figure 5 matrices: V4 and IT similarity structure for monkey (top row) and model; model category labels: pyramid, vertical, round, boxy, horiz.]
Figure 5: Comparison of progression from V4 to IT in macaque monkey visual cortex (top row, from Cadieu et al., 2014) versus same progression in model (replotted using comparable color scale). Although the underlying categories are different, and the monkeys have a much richer multi-modal experience of the world to reinforce categories such as foods and faces, the model nevertheless shows a similar qualitative progression of stronger categorical structure in IT, where the block-diagonal highly similar representations are more consistent across categories, and the off-diagonal differences are stronger and more consistent as well (i.e., categories are also more clearly differentiated). Note that the critical difference in our model versus those compared in Cadieu et al. (2014) and related papers is that they explicitly trained their models on category labels, whereas our model is entirely self-organizing and has no external categorical training signal.
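The block-diagonal versus off-diagonal contrast referred to in these captions can be made concrete with a small sketch. The sign convention here (mean between-category distance minus mean within-category distance, so larger values mean cleaner categories) is an assumption, as are the function name `within_between_contrast` and the toy 4 x 4 distance matrix; the SI should be consulted for the measure actually used.

```python
import numpy as np

def within_between_contrast(dist, labels):
    """Mean between-category distance minus mean within-category distance."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # True where categories match
    off_diag = ~np.eye(len(labels), dtype=bool)    # exclude self-comparisons
    within = dist[same & off_diag].mean()
    between = dist[~same].mean()
    return between - within

# Hypothetical correlation distances for 4 objects.
dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

good = within_between_contrast(dist, [0, 0, 1, 1])  # grouping matches structure
bad = within_between_contrast(dist, [0, 1, 0, 1])   # grouping cuts across it
```

A grouping that matches the block-diagonal structure gives a positive contrast, while one that cuts across it gives a negative contrast; this is the sense in which one candidate category assignment fits a similarity matrix better than another.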
[Figure 6 plot: x-axis Layer (V1, V1h, V2, V3, DP, V4, TEO, TE); y-axis Correlation; series: V1 correl, TEs correl]
Figure 6: Emergence of abstract category structure over the hierarchy of layers. Red line = correlation similarity between the TE similarity matrix (shown in Figure 2a) and all layers; black line shows correlation similarity of V1 against all layers (1 = identical; 0 = orthogonal). Both show that IT layers (TEO, TE) progressively differentiate from raw input similarity structure present in V1, and, critically, that the model has learned structure beyond that present in the input.
[Figure 7 panels: a) Bp, max = 0.3; b) PredNet, max = 0.75; c) Bp, Correlation by Layer (V1, V1h, V2, V3, DP, V4, TEO, TE), series V1 correl and TEs correl; d) V1, max = 1; e) Bp w V1, max = 0.3; f) Leabra w V1, max = 1.5]
Figure 7: a) Best-fitting category similarity for TE layer of the backpropagation (Bp) model with the same What / Where structure as the biological model. Only two broad categories are evident, and the lower max distance (0.3 vs. 1.5 in biological model) means that the patterns are highly similar overall. b) Best-fitting similarity structure for the PredNet model, in the highest of its layers (layer 6), which is more differentiated than Bp (max = 0.75) but also less cleanly similar within categories (i.e., less solidly blue along the block diagonal), and overall follows a broad category structure similar to V1. c) Comparison of similarity structures across layers in the Bp model (compare to Figure 2c): unlike in the biological model, the V1 structure is largely preserved across layers, and is little different from the structure that best fits the TE layer shown in panel a, indicating that the model has not developed abstractions beyond the structure present in the visual input. Layer V3 is most directly influenced by spatial prediction errors, so it differs from both in strongly encoding position information. d) The best fitting V1 structure, which has 2 broad categories and banana is in a third category by itself. The lack of dark blue on the block diagonal indicates that these categories are relatively weak, and every item is fairly dissimilar from every other. e) The same similarities shown in panel a for Bp TE also fit reasonably well sorted according to the V1 structure (and they have similar average within - between contrast differences, of 0.0838 and 0.0513; see SI for details). f) The similarity structure from the biological model resorted in the V1 structure does not fit well: the blue is not aligned along the block diagonal, and the yellow is not strictly off-diagonal. This is consistent with the large difference in average contrast distance: 0.5071 for the best categories vs. 0.3070 for the V1 categories.
across multiple objects within the same category.

In summary, the model learned an abstract category organization that reflects the overall visual shapes of the objects as judged by human participants, in a way that is invariant to the differences in motion, rotation, and scaling that are present in the V1 visual inputs. We are not aware of any other model that has accomplished this signature computation of the ventral
What pathway in a purely self-organizing manner operating on realistic 3D visual objects, without any explicit supervised category labels, much less using a learning algorithm directly based on detailed properties of the underlying biological circuits in this pathway.
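The second-order similarity analysis summarized above (Figure 6) can be illustrated with a minimal sketch. It assumes Pearson correlation both for the item-by-item similarity matrices and for the comparison between matrices, restricted to the off-diagonal upper triangle; the exact measure used for the reported results is not restated here, and the data below are random stand-ins, not model activations. The function names are hypothetical.

```python
import numpy as np

def similarity_matrix(patterns):
    """Pairwise Pearson correlation between item patterns (rows = items)."""
    return np.corrcoef(patterns)

def second_order_similarity(sim_a, sim_b):
    """Correlate the off-diagonal upper triangles of two similarity matrices."""
    iu = np.triu_indices_from(sim_a, k=1)
    return float(np.corrcoef(sim_a[iu], sim_b[iu])[0, 1])

# Hypothetical data: 6 items in 2 categories (not taken from the model).
rng = np.random.default_rng(0)
n_items, n_units = 6, 50
labels = np.array([0, 0, 0, 1, 1, 1])
prototypes = rng.normal(size=(2, n_units))

# "Low" layer: no category structure; "high" layer: items cluster by category.
low_layer = rng.normal(size=(n_items, n_units))
high_layer = prototypes[labels] + 0.3 * rng.normal(size=(n_items, n_units))

sim_low = similarity_matrix(low_layer)
sim_high = similarity_matrix(high_layer)

self_score = second_order_similarity(sim_high, sim_high)
cross_score = second_order_similarity(sim_low, sim_high)
print(f"high vs. high: {self_score:.3f}, low vs. high: {cross_score:.3f}")
```

Applying `second_order_similarity` between one reference layer (V1 or TE) and every other layer would give one curve of the kind plotted in Figure 6.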
Backpropagation Comparison Models
To help discern some of the factors that contribute to the categorical learning in our model, and provide a comparison with more widely-used error backpropagation models, we tested a backpropagation-based (Bp) version of the same What vs. Where architecture as our biologically-based predictive error model, and we also tested a standard PredNet model (Lotter et al., 2016) with extensive hyperparameter optimization (see SI). Due to the constraints of backpropagation, we had to eliminate any bidirectional connectivity loops in the Bp version, but we were able to retain a form of predictive learning by configuring the V1p pulvinar layer as the final target output layer, with the target being the next visual input relative to the V1 inputs. As shown in Figure 7, the highest layers of the Bp model form a simple binary category structure overall, and the detailed item-level similarity structure does not diverge significantly from that present at the lowest V1 inputs, indicating that it has not formed novel systematic structured representations, in contrast to those formed in the biologically based model. Similar results were found in the PredNet model, where the highest layer representations remained very close to the V1 input structure. Because existing work with these models has typically relied on additional supervised learning and decoder-based analyses (which are essentially equivalent to an additional layer of supervised learning), these RSA-based analyses provide an important, more sensitive way of determining what they learn.

These results show that the additional biologically derived properties in our model are playing a critical role in the development of abstract categorical representations that go beyond the raw visual inputs. These properties include: excitatory bidirectional connections, inhibitory competition, and an additional Hebbian form of learning that serves as a regularizer (similar to weight decay) on top of predictive error-driven learning (O’Reilly, 1998; O’Reilly & Munakata, 2000). Each of these properties could promote the formation of categorical representations. Bidirectional connections enable top-down signals to consistently shape lower-level representations, creating significant attractor dynamics that cause the entire network to settle into discrete categorical attractor states.
Another indication of the importance of bidirectional connections is that a greedy layer-wise pretraining scheme, consistent with a putative developmental cascade of learning from the sensory periphery on up (Bengio, Yao, Alain, & Vincent, 2013; Hinton & Salakhutdinov, 2006; Shrager & Johnson, 1996; Valpola, 2014), did not work in our model. Instead, we found it essential that higher layers, with their ability to form more abstract, invariant representations, interact and shape learning in lower layers right from the beginning.

Furthermore, the recurrent connections within the TEO and TE layers likely play an important role by biasing the temporal dynamics toward longer persistence (Chaudhuri et al., 2015). By contrast, backpropagation networks typically lack these kinds of attractor dynamics, and this could contribute significantly to their relative lack of categorical learning. Hebbian learning drives the formation of representations that encode the principal components of activity correlations over time, which can help more categorical representations coalesce (and results below already indicate its importance). Inhibition, especially in combination with Hebbian learning, drives representations to specialize on more specific subsets of the space.

Ongoing work is attempting to determine which of these is essential in this case (perhaps all of them) by systematically introducing some of these properties into the backpropagation model, though this is difficult because full bidirectional recurrent activity propagation, which is essential for conveying error signals top-down in the biological network, is incompatible with the standard efficient form of error backpropagation, and requires significantly more computationally intensive and unstable forms of fully recurrent backpropagation (Pineda, 1987; Williams & Zipser, 1992). Furthermore, Hebbian learning requires dynamic inhibitory competition, which is difficult to incorporate within the backpropagation framework.
Architecture and Parameter Manipulations
[Figure 8 plots: Correlation by Layer. a) Where pathway lesioned (layers V1, V1h, V2, V4, TEO, TE; series Intact, No Where); b) No prediction (auto-encoder) (layers V1, V1h, V2, V3, DP, V4, TEO, TE; series Intact, No Predict); c) Hebbian reduced (layers V1, V1h, V2, V3, DP, V4, TEO, TE; series Intact, Hebb 2)]
Figure 8: Effects of various manipulations on the extent to which TE representations differentiate from V1. For all plots, Intact is the same result shown in Figure 5 from the intact model for ease of comparison. All of the following manipulations significantly impair the development of abstract TE categorical representations (i.e., TE is more similar to V1 and the other layers). a) Dorsal Where pathway lesions, including lateral inferior parietal sulcus (LIP), V3, and dorsal prelunate (DP). This pathway is essential for regressing out location-based prediction errors, so that the residual errors concentrate feature-encoding errors that train the What pathway. b) Allowing the deep layers full access to current-time information, thus effectively eliminating the prediction demand and turning the network into an auto-encoder, which significantly impairs representation development, and supports the importance of the challenge of predictive learning for developing deeper, more abstract representations. c) Reducing the strength of Hebbian learning by 20% (from 2.5 to 2), demonstrating the essential role played by this form of learning on shaping categorical representations. Eliminating Hebbian learning entirely (not shown) prevented the model from learning anything at all, as it also plays a critical regularization and shaping role on learning.

Figure 8 shows just a few of the large number of parameter manipulations that have been conducted to develop and test the final architecture. For example, we hypothesized that separating the overall prediction problem between a spatial Where vs. non-spatial What pathway (Goodale & Milner, 1992; Ungerleider & Mishkin, 1982) would strongly benefit the formation of more abstract, categorical object representations in the What pathway. Specifically, the Where pathway can learn relatively quickly to predict the overall spatial trajectory of the object (and anticipate the effects of saccades), and thus effectively regress out that component of the overall prediction error, leaving the residual error concentrated in object feature information, which can train the ventral What pathway to develop abstract visual categories. Figure 8a shows that, indeed, when the Where pathway is lesioned, the formation of abstract categorical representations in the intact What pathway is significantly impaired. Figure 8b shows that full predictive learning, as compared to just encoding and decoding the current state (i.e., an auto-encoder, which is much easier computationally, and leads to much better overall accuracy), is also critical for the formation of abstract categorical representations: prediction is a “desirable difficulty” (Bjork, 1994). Finally, Figure 8c shows the impact of reducing Hebbian learning, which impairs category learning as expected.
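The contrast between predictive learning and the auto-encoder control comes down to which frame serves as the training target. The sketch below illustrates only that difference in targets, using a hypothetical helper `make_training_pairs` and placeholder frame names. It is not how the biological model implements the manipulation, which works by giving the deep layers access to current-time information.

```python
def make_training_pairs(frames, predictive=True):
    """Pair each input frame with its training target.

    predictive=True:  the target is the next frame (input at t-1, target at t).
    predictive=False: the target is the same frame (auto-encoder).
    """
    if predictive:
        return list(zip(frames[:-1], frames[1:]))
    return [(frame, frame) for frame in frames]

# Placeholder frame identifiers standing in for successive visual inputs.
frames = ["f0", "f1", "f2", "f3"]

predictive_pairs = make_training_pairs(frames, predictive=True)
autoencoder_pairs = make_training_pairs(frames, predictive=False)

print(predictive_pairs)
print(autoencoder_pairs)
```

In the predictive case the network never sees its target as input, which is what makes the task harder than reconstruction.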
Predictive Behavior
A signature example of predictive behavior at the neural level in the brain is the predictive remapping of visual space in anticipation of a saccadic eye movement (Colby, Duhamel, & Goldberg, 1997; Duhamel et al., 1992; Gottlieb, Kusunoki, & Goldberg, 1998; Marino & Mazer, 2016; Nakamura & Colby, 2002) (Figure 9a). Here, parietal neurons start to fire at the future receptive field location where a currently-visible stimulus will appear after a planned saccade is actually executed. Remapping has also been shown for border ownership neurons in V2 (O’Herron & von der Heydt, 2013) and in area V4 (Neupane, Guitton, & Pack, 2016, 2020). These are examples, we believe, of a predictive process operating throughout the neocortex to predict what will be experienced next. A major consequence of this predictive process is the perception of a stable, coherent visual world despite constant saccades and other sources of visual change.

[Figure 9 image: top, data from Duhamel et al., 1992; bottom, model plot of Activation (mean firing rate) by Cycles (msec) for LIPd (old pos), LIPd (new pos), V2d, and V2s, with Saccade, Fixation, and Plus phase periods marked]
Figure 9: Predictive Remapping. top: Original remapping data in LIP from Duhamel et al. (1992). A) shows stimulus (star) response within receptive field (dashed circle) relative to fixation dot (upper right of fixation). B) Just prior to monkey making a saccade to new fixation (moving left), stimulus is turned on in receptive field location that will be upper right of the new fixation point, and the LIP neuron responds to that stimulus in advance of the saccade completing. The neuron does not respond to the stimulus in that location if it is not about to make a saccade that puts it within its receptive field (not shown). This is predictive remapping. C) response to the old stimulus location goes away as saccade is initiated. bottom: Data from our model, from individual units in LIPd, V2d, and V2s, showing that the LIP deep neurons respond to the saccade first, activating in the new location and deactivating in the old, and this LIP activation goes top-down to V3 and V2 to drive updating there, generally at a longer latency and with less activation especially in the superficial layers. When the new stimulus appears at the point of fixation (after a 50 msec saccade here), the primed V2s units get fully activated by the incoming stimulus. But the deep neurons are insulated from this superficial input until the plus phase, when the cascade of 5IB firing drives activation of the actual stimulus location into the pulvinar, which then reflects up into all the other layers.

Figure 9b shows that our model exhibits this predictive remapping phenomenon. Specifically, LIP, which is most directly interconnected with the saccade efferent copy signals, is the first to predict the new location, and it then drives top-down activation of lower layers. This top-down dynamic is consistent with the account of predictive remapping given by Wurtz (2008) and Cavanagh et al. (2010), who argue that the key remapping takes place at the high levels of the dorsal stream, which then drive top-down activation of the predicted location in lower areas, instead of the alternative where lower levels remap themselves based on saccade-related signals. The lower-level visual layers are simply too large and distributed to be able to remap across the relevant degrees of visual angle: the extensive lateral connectivity needed to communicate across these
Discussion
We have hypothesized a novel computational function for the distinctive features of thalamocortical circuits (S. M. Sherman & Guillery, 2006; Usrey & Sherman, 2018), as supporting a specific form of prediction-error driven learning, where predictions arise from the numerous top-down layer 6CT projections into the pulvinar, and the strong, sparse, focal driving 5IB inputs supply the bottom-up sensory-driven outcome. The phasic bursting nature of the 5IB inputs results in a natural temporal-difference error signal of prediction followed by outcome, consistent with extensive neural recording data. This temporal dynamic is also essential for enabling predictions to be generated without contamination from current sensory inputs, and predicts a characteristic alpha frequency prediction cycle based on the 10 Hz bursting cycle of the 5IB inputs, consistent with the pervasive influence of alpha on perception and neural dynamics (Clayton et al., 2018; Foster & Awh, 2019; Jensen et al., 2015; VanRullen, 2016). In short, the hypothesized predictive learning function fits remarkably well with a number of well-established properties of these thalamocortical circuits, and we also provided a set of additional predictions that could be tested to further evaluate this theory, especially in contrast to the widely-discussed alternative of explicit error coding neurons, which have not been unambiguously supported across a range of empirical studies (Walsh et al., 2020).

Furthermore, we implemented this theory in a large scale model of the visual system, and demonstrated that learning based strictly on predicting what will be seen next is, in conjunction with a number of critical biologically motivated network properties and mechanisms, capable of generating abstract, invariant categorical representations of the overall shapes of objects. The nature of these shape representations closely matches human shape similarity judgments on the same objects.
Thus, predictive learning has the potential to go beyond the surface structure of its inputs, and develop systematic, abstract encodings of the environment. We found that comparison models based on standard error backpropagation learning did not learn a categorical structure that went beyond the surface similarity present in the visual input layers, and future work is focused on narrowing down the specific mechanisms required to drive this learning.

In addition to the predictive learning functions of the deep / thalamic layers, these same circuits are also likely critical for supporting powerful top-down attentional mechanisms that have a net multiplicative effect on superficial-layer activations (Bortone, Olsen, & Scanziani, 2014; Olsen, Bortone, Adesnik, & Scanziani, 2012). The importance of the pulvinar for attentional processing has been widely documented (e.g., Bender & Youakim, 2001; LaBerge & Buchsbaum, 1990; Saalmann et al., 2012), and there is likely an additional important role of the thalamic reticular nucleus (TRN), which can contribute a surround-inhibition contrast-enhancing effect on top of the incoming attentional signal from the cortex (Crick, 1984; Jaramillo, Mejias, & Wang, 2019; Pinault, 2004; Wimmer et al., 2015).
In other work in progress, we have shown that the deep / thalamic circuits in our model produce attentional effects consistent with the abstract Reynolds and Heeger (2009) model, while the contributions of the deep layer networks to this function are broadly consistent with the folded-feedback model (Grossberg, 1999). These attentional modulation signals cause the bidirectional constraint satisfaction process in the superficial network to focus on task-relevant information while down-regulating responses to irrelevant information: in the real world, there are typically too many objects to track at any given time, so predictive learning must be directed toward the most important objects (Cavanagh et al., 2010; Pylyshyn, 1989; Richter & de Lange, 2019).

There is also data suggesting that the pulvinar is important for supporting confidence judgments, driven by relative ambiguity in a random dot motion categorization task (Komura et al., 2013). Critically for the present framework, this confidence modulation only emerged in the period after the first 100 msec of processing, and manifested as a positive correlation with confidence (i.e., more unambiguous stimuli resulted in higher firing rates). We can interpret this as reflecting an ongoing generative postdiction of the stimulus signal, with stronger firing associated with more unambiguous top-down activation based on the current internal representation. Note that this directionality is the opposite of explicit error-coding neurons, which would presumably increase with increasing error / ambiguity in the prediction. Interestingly, inactivation of these pulvinar neurons resulted in a substantial (200%) increase in opt-out choices on the most ambiguous stimuli, suggesting a level of metacognitive awareness of the pulvinar signal (or at least a direct effect of pulvinar on relevant metacognitive processes). Predictive accuracy would be an ideal source of metacognitive confidence signals across a wide range of domains, suggesting another important contribution of pulvinar even after initial learning. Jaramillo et al. (2019) present a comprehensive model of attentional, decision-making, and working memory contributions of the pulvinar, including this confidence data, which is generally compatible with our framework, although it does not address any learning phenomena.

Considerable further work remains to be done to more precisely characterize the essential properties of our biologically motivated model necessary to produce this abstract form of learning, and to further explore the full scope of predictive learning across different domains. We strongly suspect that extensive cross-modal predictive learning in real-world environments, including between sensory and motor systems, is a significant factor in infant development and could greatly multiply the opportunities for the formation of higher-order abstract representations that more compactly and systematically capture the structure of the world (Yu & Smith, 2012).
Future versions of these models could thus potentially provide novel insights into the fundamental question of how deep an understanding a pre-verbal human, or a non-verbal primate, can develop (J. Elman et al., 1996; Spelke, Breinlinger, Macomber, & Jacobson, 1992), based on predictive learning mechanisms. This would then represent the foundation upon which language and cultural learning builds, to shape the full extent of human intelligence.

Supplemental Information
All of the materials described here, including the experimental study, the computational models, and the code to perform the representational similarity analysis, are available on our github account at: https://github.com/ccnlab/deep-obj-cat
For the computational models in particular, the most complete understanding can only be had by directly examining the code for the models, as there are a number of details that are not efficiently captured in this supplementary materials text.
Representational Similarity Analysis Methods
The different representations being compared here are:

Leabra: The DeepLeabra (biological model) TE layer representations (specifically TEs = superficial; results are very similar for deep as well).
Bp: The TEs layer representations from the backpropagation version of the biological model, including What, Where and What * Where integration layers, trained with the V1p and V1hp (low and high resolution pulvinar) layers as final output layers, using the time t target pattern from the t − 1 input (i.e., as a predictive network).
V1: The gabor-filtered representation of the visual input to both of the above models, which was identical across them.
PredNet: Highest layer (6th layer) of the PredNet architecture.
Expt: Similarity matrix constructed from human pairwise similarity judgments (see Behavioral Experiment Methods).

An optimal category cluster can be defined as one that has high within-cluster similarity and low between-cluster similarity. This can be operationalized by the contrast distance metric, based on a 1-correlation (correlation distance) measure, as the difference between the average within-cluster similarity and the average between-cluster similarity:

$$ cd = \langle 1 - r_{in} \rangle - \langle 1 - r_{out} \rangle \qquad (1) $$

With distance-like 1-correlation values, this contrast distance should be minimized (it is typically negative), or equivalently the contrast on raw correlation values can be maximized (it is typically a positive number, just the sign flip of the distance value). We refer to the positive numbers and maximization here as that is more natural.

Starting with an initial set of clusters, a permutation-based hill-climbing strategy was used to determine a local minimum in this measure: each item was tested in each of the other possible categories, and if that configuration reduced the overall average contrast distance metric across all items, then it was adopted and the process iterated until no such permutation improved the metric. This algorithm can only decrease the number of clusters (by moving all items out of a given cluster), so different numbers of initial clusters can be used to search the overall space.

Figure 10 shows the resulting categories. The Bp model converged on the same cluster state from all starting configurations tested, varying from 5 to 2 initial categories. This is the cluster set shown in Figure 5a of the main paper, and has an average contrast distance (acd) of 0.0838 (this is relatively low because the patterns were overall quite similar). Likewise, the V1 patterns (which were the same across Leabra and Bp models) reliably converged on the same pattern (shown in Figure 5d), with acd = 0.2448.
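The contrast distance of Equation 1 and the permutation-based hill-climbing search can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code from the repository: the averages are taken over all distinct item pairs (the text describes an average across items, which may differ in detail), configurations with no within-cluster or no between-cluster pairs are treated as undefined and never adopted, and the names `contrast_distance` and `hill_climb` and the 4-item correlation matrix are hypothetical.

```python
import numpy as np

def contrast_distance(r, labels):
    """cd = <1 - r_in> - <1 - r_out>, averaged over distinct item pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones(same.shape, dtype=bool), k=1)
    dist = 1.0 - r
    within, between = dist[same & upper], dist[~same & upper]
    if within.size == 0 or between.size == 0:
        return np.inf  # contrast undefined; such a configuration is never adopted
    return float(within.mean() - between.mean())

def hill_climb(r, labels):
    """Move single items between existing clusters while cd keeps decreasing."""
    labels = list(labels)
    best = contrast_distance(r, labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            for c in sorted(set(labels)):
                if c == labels[i]:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                cd = contrast_distance(r, trial)
                if cd < best - 1e-12:
                    labels, best, improved = trial, cd, True
    return labels, best

# Hypothetical correlations: items 0,1 are similar, items 2,3 are similar.
r = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])

final_labels, final_cd = hill_climb(r, [0, 0, 0, 1])
print(final_labels, "positive contrast:", round(-final_cd, 3))
```

Because items only move between labels that already exist, the search can empty a cluster but never create one, matching the observation that the number of clusters can only decrease.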
Centroid
1. pyramid: banana, layercake, trafficcone, sailboat, trex
2. vertical: person, guitar, tablelamp
3. round: doorknob, donut, handgun, chair
4. box: slrcamera, elephant, piano, fish
5. horiz: car, heavycannon, stapler, motorcycle

Bp
1. cat1: banana, layercake, trafficcone, sailboat, trex, person, guitar, tablelamp, doorknob, donut, handgun, chair, slrcamera, elephant, piano, fish, car
2. cat2: heavycannon, stapler, motorcycle

V1
1. cat1: trafficcone, sailboat, person, guitar, tablelamp, chair
2. cat2: layercake, trex, doorknob, donut, handgun, slrcamera, elephant, piano, fish, car, heavycannon, stapler, motorcycle
3. cat3: banana

PredNet
1. cat1: trafficcone, sailboat, person, guitar, tablelamp, layercake
2. cat2: trex, donut, banana, handgun, slrcamera, elephant, fish, car, heavycannon, stapler, motorcycle
3. cat3: chair, doorknob, piano

Figure 10: Shape categories used for similarity matrix plots in main paper. Centroid shape categories are near-best for both the Leabra model and the Expt results, and fit our visual intuitions about overall shape. Bp are reliably optimal for Bp model from all starting points. V1 are reliably optimal for V1 inputs, and also were close to the best for the Bp and PredNet layer 6 representations. PredNet are best stable solution for PredNet layer 6.

For the PredNet layer 6 representations, starting from the V1 categories gave the best results of any other set (acd = 0.1967), and a few permutations resulted in a reliable solution that was arrived at from all other 3 category starting points tested, shown in Figure 10 (acd = 0.2820). This indicates that PredNet did not go much beyond the structure present in the input, even though it did not use the V1 gabor filtering used in the Leabra and Bp models (i.e., this V1-level encoding well-captures the structure of the visual inputs in general). The PredNet pixel and layer 1 representations both converged on essentially a single monolithic category with very low acd (0.0018, 0.0013).

For the Leabra TE representations, we found a set of centroid shape categories that are near-best when considering both the Leabra model and the results from the human behavioral experiment (Expt). Starting from these categories, the permutation analysis converged on reducing the size of the vertical and round
Expt), it was clear that it strongly coincided with our original shape intuitions, and with the finer-grained 5 category centroid structure. Starting from the centroid categories, the maximal permutation made only 3 changes, moving trex (T-rex) and handgun into the horizontal category, and chair into the pyramid, going from a distance score of 0.3083 to 0.3225, which is a relatively small improvement. However, using the maximal Expt clusters directly on the Leabra model gives a lower acd measure of 0.3745 (compared to 0.5071 for centroid), so the centroid categories represent a good middle-ground between experiment and the model, and this strong shared similarity structure with near-optimal cluster structures confirms that the model and people are encoding largely the same information.

In contrast, if we organize the experiment similarity matrix using the Bp categories, it produces a very poor average contrast distance measure of 0.0643 (compared to 0.3083 for the centroid categories), strongly suggesting that people’s shape representations are not compatible with that simple structure.

Another approach to determining clusters from similarity matrices, agglomerative clustering, starts with all items as singletons, and iteratively combines the closest two into a new cluster. The results for the Leabra and Expt similarity matrices are shown in Figure 11, which has also color-coded the items in terms of their category status according to the centroid structure. Due to a strong history dependency in the clustering process, and the indeterminacy of reducing a high-dimensional similarity structure down to two dimensions, structure beyond the leaf level is not very reliable (ties are also broken by a random number generator), but nevertheless you can clearly see that in both cases items from the same cluster are almost always together as leaves in the plots. This then provides additional converging support for the idea that the model is learning the same kind of shape categories as people have.
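The agglomerative procedure described above can be sketched in a few lines. The linkage rule is not specified in the text, so this sketch assumes average linkage; it also breaks ties deterministically by index order rather than with a random number generator. The function name `agglomerate`, the four item names, and the distances are hypothetical.

```python
def agglomerate(dist, names):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(names))]
    history = []

    def avg_dist(a, b):
        # Average linkage (an assumption): mean distance over all cross pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(avg_dist(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        d, x, y = min(pairs)  # ties fall back to index order
        history.append(([names[i] for i in clusters[x]],
                        [names[i] for i in clusters[y]], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return history

# Hypothetical correlation distances among four items.
names = ["car", "stapler", "person", "guitar"]
dist = [[0.0, 0.2, 0.9, 0.9],
        [0.2, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.3],
        [0.9, 0.9, 0.3, 0.0]]

history = agglomerate(dist, names)
for left, right, d in history:
    print(left, "+", right, f"at distance {d:.2f}")
```

The earliest merges correspond to the leaf-level groupings, which the text identifies as the most reliable part of the resulting tree.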
Behavioral Experiment Methods
The behavioral experiment was conducted on Amazon.com’s MTurk web platform under University of Colorado IRB approval (19-0176), using 30 participants each categorizing up to 800 image pairs as shown in Figure 12, using the standard simple image categorization framework with a lightly customized script. Objects were drawn from the 156 3D object set, but data was aggregated in terms of the 20 basic-level categories (car, stapler, etc.) because we could not sample all 156 x 156 object pairs. Thus, the resulting data was aggregated for each category pair in terms of the proportion of times when that pair was selected when presented.

The individual images were produced by reconstructing from the V1 transform that the computational model used in its high resolution V1 input layer, to give human participants as similar an experience as possible to how the model “saw” the objects, and to reduce the influence of existing semantic knowledge which was entirely missing in our model (Figure 12).
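The aggregation step described above, the proportion of presentations on which each category pair was selected, can be sketched as follows. The trial format (a left pair, a right pair, and an "L" or "R" choice), the function name `proportion_selected`, and the example trials are assumptions made for illustration; the actual data format of the experiment script is not described here.

```python
from collections import defaultdict

def proportion_selected(trials):
    """Return {category pair: times chosen / times shown}.

    trials: iterable of (left_pair, right_pair, choice), where each pair is a
    tuple of two category names and choice is "L" or "R" (assumed format).
    """
    shown = defaultdict(int)
    chosen = defaultdict(int)
    for left, right, choice in trials:
        left_key, right_key = tuple(sorted(left)), tuple(sorted(right))
        shown[left_key] += 1
        shown[right_key] += 1
        chosen[left_key if choice == "L" else right_key] += 1
    return {pair: chosen[pair] / shown[pair] for pair in shown}

# Hypothetical trials, not experimental data.
trials = [
    (("car", "stapler"), ("car", "person"), "L"),
    (("car", "stapler"), ("guitar", "person"), "R"),
    (("person", "car"), ("guitar", "person"), "R"),
]

proportions = proportion_selected(trials)
for pair, p in sorted(proportions.items()):
    print(pair, p)
```

Sorting the names within each pair makes the key order-independent, so ("person", "car") and ("car", "person") count as the same category pair.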
[Figure 11 plots: a) Leabra TEs Clusters, leaf order: motorcycle, car, heavycannon, stapler, slrcamera, donut, fish, elephant, handgun, doorknob, piano, chair, trafficcone, layercake, banana, trex, guitar, person, sailboat, tablelamp. b) Experiment Clusters, leaf order: chair, handgun, stapler, trex, heavycannon, car, layercake, piano, slrcamera, fish, motorcycle, elephant, donut, doorknob, sailboat, trafficcone, banana, tablelamp, person, guitar]
Figure 11: Agglomerative clustering on the Leabra and Expt representations, with the centroid categories color coded. The most reliable information from this is the leaf-level groupings, as the rest of the structure is indeterminate and history dependent in reducing higher-dimensional structure down to a 2D plot. Both cluster plots show a strong tendency to group leaf items together in the same centroid categories, with a few exceptions in each case. Also, the Leabra plot nicely captures the broader 3-category structure evident in the similarity matrix plots, within which the 5 finer-grained centroid categories are organized. Overall, this provides further confirmation that the model and the human subjects are organizing the shapes in largely the same way.

Figure 12: Example stimulus from the behavioral experiment, using the V1 reconstruction of the actual input images presented to the model, to better capture the coarse-grained perception of the model. Subjects were requested to choose which of the two pairs, Left or Right, was most similar in terms of overall shape.

Biological Model Methods
This section provides more information about the DeepLeabra What-Where Integration (WWI) model. The purpose of this information is to give more detailed insight into the model’s function beyond the level provided in the main text, but with a model of this complexity, the only way to really understand it is to explore the model itself. It is available for download at: https://github.com/ccnlab/deep-obj-cat/sims/C++
Furthermore, the best way to understand this model is to understand the framework in which it is implemented, which is explained in great detail, with many running simulations explaining specific elements of functionality, at http://ccnbook.colorado.edu

Area | Name | Units X | Units Y | Pools X | Pools Y | Receiving Projections
V1 | V1s | 4 | 5 | 8 | 8 |
V1 | V1p | 4 | 5 | 8 | 8 | V1s V2d V3d V4d TEOd
V1h | V1hs | 4 | 5 | 16 | 16 |
V1h | V1hp | 4 | 5 | 16 | 16 | V1s V2d V3d V4d TEOd
Eyes | EyePos | 21 | 21 | | |
Eyes | SaccadePlan | 11 | 11 | | |
Eyes | Saccade | 11 | 11 | | |
Obj | ObjVel | 11 | 11 | | |
V2 | V2s | 10 | 10 | 8 | 8 | V1s LIPs V3s V4s TEOd V1p V1hp
V2 | V2d | 10 | 10 | 8 | 8 | V2s* V1p V1hp LIPd LIPp V3d V4d V3s TEOs
LIP | MtPos | 1 | 1 | 8 | 8 | V1s
LIP | LIPs | 4 | 4 | 8 | 8 | MtPos ObjVel SaccadePlan EyePos LIPp
LIP | LIPd | 4 | 4 | 8 | 8 | LIPs* LIPp ObjVel Saccade EyePos
LIP | LIPp | 1 | 1 | 8 | 8 | MtPos* V1s LIPd
V3 | V3s | 10 | 10 | 4 | 4 | V2s V4s TEOs DPs LIPs V1p V1hp DPp TEOd
V3 | V3d | 10 | 10 | 4 | 4 | V3s* V1p V1hp DPp LIPd DPd V4d V4s DPs TEOs
V3 | V3p | 10 | 10 | 4 | 4 | V3s* V2d DPd TEOd
DP | DPs | 10 | 10 | | | V2s V3s TEOs V1p V1hp V3p TEOp
DP | DPd | 10 | 10 | | | DPs* V1p V1hp DPp TEOd
DP | DPp | 10 | 10 | | | DPs* V2d V3d DPd TEOd
V4 | V4s | 10 | 10 | 4 | 4 | V2s TEOs V1p V1hp
V4 | V4d | 10 | 10 | 4 | 4 | V4s* V1p V1hp V4p TEOd TEOs
V4 | V4p | 10 | 10 | 4 | 4 | V4s* V2d V3d V4d TEOd
TEO | TEOs | 10 | 10 | 4 | 4 | V4s V1p V1hp TEs
TEO | TEOd | 10 | 10 | 4 | 4 | TEOs* TEOd* V1p V1hp V4p TEOp TEp TEd
TEO | TEOp | 10 | 10 | 4 | 4 | TEOs* V3d V4d TEOd TEd
TE | TEs | 10 | 10 | 4 | 4 | TEOs V1p V1hp
TE | TEd | 10 | 10 | 4 | 4 | TEs* TEd* V1p V1hp V4p TEOp TEp TEOd
TE | TEp | 10 | 10 | 4 | 4 | TEs* V3d V4d TEOd

Table 1: Layer sizes, showing numbers of units in one pool (or entire layer if Pool is missing), and the number of Pools of such units, along X,Y axes. Each area has three associated layers: s = superficial layer, d = deep layer (context updated by 5IB neurons in same area, marked with *), p = pulvinar layer (driven by 5IB neurons from associated area, marked with *).

Layer Sizes and Structure
Figure 2 in the main text shows the general configuration of the model, and Table 1 shows the specific sizes of each of the layers, and where they receive inputs from.

All the activation and general learning parameters in the model are at their standard Leabra defaults.
Projections
The general principles and patterns of connectivity are shown in Figure 13 (and Figure 1 in the main text). As noted in the main text, the connectivity and overall structure obeys the established principles identified in neocortical anatomy Felleman and Van Essen (1991); Markov, Ercsey-Ravasz, et al. (2014);
Figure 13: Principles of connectivity in DeepLeabra. a) Markov et al (2014) data showing density of retrograde labeling from a given injection in a middle-level area (d): most feedforward projections originate from superficial layers of lower areas (a,b,c) and deep layers predominantly contribute to feedback (and more strongly for longer-range feedback). b) Summary diagram showing most feedforward connections originating in superficial layers of lower area, and terminating in layer 4 of higher area, while feedback connections can originate in either superficial or deep layers, and in both cases terminate in both superficial and deep layers of the lower area (adapted from Felleman & Van Essen, 1991). c) Anatomical hierarchy as determined by percentage of superficial layer source labeling (SLN) by Markov et al (2014); the hierarchical levels are well matched for our model, but we functionally divide the dorsal pathway (shown in green background) into the two separable components of a Where and a What * Where integration pathway. d) Superficial and deep-layer connectivity in the model. Note the repeating motif between hierarchically-adjacent areas, with bidirectional connectivity between superficial layers, and feedback into deep layers from both higher-level superficial and deep layers, according to canonical pattern shown in panels a and b. Special patterns of connectivity from TEO to V3 and V2, involving crossed super-to-deep and deep-to-super pathways, provide top-down support for predictions based on high-level object representations. e) Connectivity for deep layers and pulvinar in the model, which generally mirror the corticocortical pathways (in d). Each pulvinar layer (p) receives 5IB driving inputs from the labeled layer (e.g., V1p receives 5IB drivers from V1). In reality these neurons are more distributed throughout the pulvinar, but it is computationally convenient to organize them together as shown. Deep layers (d) provide predictive input into pulvinar, and pulvinar projections send error signals (via temporal differences between predictions and actual state) to both deep and superficial layers of given areas (only d shown). Most areas send deep-layer prediction inputs into the main V1p prediction layer, and receive reciprocal error signals therefrom. The strongest constraint we found was that pulvinar outputs (colored green) must generally project only to higher areas, not to lower areas, with the exceptions of DPp → V3 and LIPp → V2. V2p was omitted because it is largely redundant with V1p in this simple model.
Where pathway pre-training compared to purely random initial weights. In addition to exploring different patterns of overall connectivity, we also explored differences in the relative strengths of receiving projections, which can be set with a wt_scale.rel parameter in the simulator. All feedforward pathways have a default strength of 1. For the feedback projections, which are typically weaker (consistent with the biology), we explored a discrete range of strengths, typically .5, .2, .1, and .05. The strongest top-down projections were into V2s from LIP and V3, while most others were .2 or .1. Likewise projections from the pulvinar were weaker, typically .1. These differences in strength sometimes had large effects on performance during the initial bootstrapping of the overall model structure, but in the final model they are typically not very consequential for any individual projection.
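To make the role of relative projection strengths concrete, the following is a minimal sketch rather than the simulator's actual code. It assumes that relative strengths are normalized by their sum across all projections into a receiving layer, so that changing one strength shifts the balance among projections without changing the overall input scale. The function names, projection labels, and input values are hypothetical.

```python
def effective_scales(rel_strengths):
    """Normalize relative projection strengths so that they sum to 1.

    Assumption of this sketch: the effective scale of a projection is its
    relative strength divided by the sum over all projections into the layer.
    """
    total = sum(rel_strengths.values())
    if total <= 0:
        raise ValueError("relative strengths must sum to a positive value")
    return {name: rel / total for name, rel in rel_strengths.items()}


def combined_net_input(raw_inputs, rel_strengths):
    """Weight each projection's raw net input by its normalized strength."""
    scales = effective_scales(rel_strengths)
    return sum(scales[name] * raw_inputs[name] for name in raw_inputs)


# Hypothetical projections into one receiving layer: a feedforward pathway at
# the default strength of 1, a weaker feedback pathway, and a pulvinar pathway.
rel = {"feedforward": 1.0, "feedback": 0.2, "pulvinar": 0.1}
raw = {"feedforward": 0.6, "feedback": 0.9, "pulvinar": 0.3}

scales = effective_scales(rel)
net = combined_net_input(raw, rel)
print(scales)
print(net)
```

Under this assumption, a feedback strength of .2 next to a feedforward strength of 1 contributes roughly 15% of the total drive, which is one way to see why individual feedback strengths matter less once the overall structure is in place.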
Training Parameters
Training typically consisted of 512 alpha trials per epoch (51.2 seconds of real time equivalent), for 1,000 such epochs. Each trial was generated from a virtual reality environment in the emergent simulator, that rendered first-person views with moving eye position onto the object tumbling through space with fixed motion and rotation parameters over the sequence of 8 frames (see Figure 2c in main text for representative example). Because the start of each sequence of 8 frames is unpredictable, we turned off learning for that trial, which improves learning overall. We have recently developed an automatic version of this mechanism based on the running average (and running variance) of the prediction error, where we turn off learning whenever the current prediction error z-normalized by these running average values is below 1.5 standard deviations, which works well, and will be incorporated into future models. Biologically, this could correspond to a connection between pulvinar and neuromodulatory areas that could regulate the effective learning rate in this way.

Figure 14a shows the learning trajectory of the model, indicating that it learns quite rapidly. This rapid initial learning is likely facilitated by the extensive use of shortcut connections converging from all over the simulated visual system onto the V1 pulvinar layers, and direct projections back from these pulvinar layers. Thus, error signals are directly communicated and can drive learning quickly and efficiently. However, there are also extensive indirect, bidirectional connections among the superficial layers, which can drive indirect error backpropagation learning as well.
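The automatic learning gate described above can be sketched as follows. This is an illustration under stated assumptions, not the simulator's implementation: the class name, the time constant, and the sign convention are hypothetical. In particular, the sketch assumes the monitored statistic is a prediction-match score (higher is better, such as the correlation between prediction and outcome), so that unpredictable trials show up as scores far below the running mean, and it reads the 1.5 standard deviation criterion as z < -1.5.

```python
import math


class LearningGate:
    """Turn learning off on trials whose match score is unusually low.

    Hypothetical sketch: keeps exponential running estimates of the mean and
    variance of a per-trial prediction-match score, and blocks learning when
    the z-normalized score falls below z_threshold.
    """

    def __init__(self, tau=100.0, z_threshold=-1.5):
        self.rate = 1.0 / tau          # integration rate of the running stats
        self.z_threshold = z_threshold
        self.mean = None               # running mean, set on the first trial
        self.var = 0.0                 # running variance

    def allow_learning(self, score):
        """Return True if learning should proceed on this trial."""
        if self.mean is None:
            self.mean = score
            return True
        std = math.sqrt(self.var)
        z = (score - self.mean) / std if std > 0 else 0.0
        learn = z >= self.z_threshold
        # Update the running statistics after the decision has been made.
        delta = score - self.mean
        self.mean += self.rate * delta
        self.var += self.rate * (delta * delta - self.var)
        return learn


gate = LearningGate(tau=10.0)
for k in range(50):                    # warm up on typical trials
    gate.allow_learning(0.7 if k % 2 == 0 else 0.9)
print(gate.allow_learning(0.2))        # an unpredictable, sequence-start trial
print(gate.allow_learning(0.8))        # a typical trial
```

The running variance needs a number of trials before it is meaningful, so in practice some warm-up period (or a floor on the standard deviation) would be needed before the gate's decisions are trusted.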
Figure 14: a) Predictive learning curve for DeepLeabra, showing the correlation between prediction and actual over the two different V1 layers (V1p, low res; V1hp, high res). Initial learning is quite rapid, followed by a slower but progressive learning process that reflects development of the IT representations (e.g., manipulations that interfere with those areas selectively impair this part of the learning curve). Overall prediction accuracy remains far from perfect, as shown in Figure 2c in main text, and significantly worse than the backpropagation-based models. This is a typical finding from Leabra models, which are significantly more constrained as a result of bidirectional attractor dynamics, Hebbian learning, and inhibitory competition, i.e., the very things that are likely important for forming abstract categorical representations. b) Similarity matrix over TEs layer at 200 epochs, which has less contrast and definition compared to the 1,000 epoch result (c, also shown in Figure 3a in main text).

Testing Parameters
To be able to monitor similarity metrics as the model trained, we used a running-average integration of neural activity across trials to accumulate the patterns used for the RSA analysis described above. Specifically, for each object, and each frame, the current activation pattern across each layer was recorded and averaged unit-by-unit with a time constant of τ = 10. Critically, by integrating separately for each frame, this running-average computation did not introduce any bias for temporally-adjacent frames to be more similar. Nevertheless, when we computed the frame-to-frame similarities for TE, they were quite high (.901 correlation on average across all objects).

Model Algorithms
The biologically-based model was implemented using the Leabra framework, which is described in detail in previous publications O’Reilly (1996, 1998); O’Reilly and Munakata (2000); O’Reilly et al. (2012), and summarized here. There are two main implementations of Leabra, one in the C++ emergent software, and a new one using the Go and Python languages at: https://github.com/emer/leabra . There are also other simpler implementations in Python and MATLAB, see https://grey.colorado.edu/emergent/index.php/Leabra . Both of the preceding links contain a full detailed description of the algorithm. These same equations and standard parameters have been used to simulate over 40 different models in O’Reilly and Munakata (2000); O’Reilly et al. (2012), and a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms, instead of constructing new mechanisms for each model. Here, we only detail properties of the predictive learning algorithm that go beyond the basic Leabra model.

Deep Context
At the end of every plus phase, a new deep-layer context net input is computed from the dot product of the context weights times the sending activations, just as in the standard net input:

$$\eta = \langle x_i w_{ij} \rangle = \frac{1}{n} \sum_i x_i w_{ij} \tag{2}$$

This net input is then added in with the standard net input at each cycle of processing.

The relative strength of these context layer inputs was set progressively larger for higher layers in the network, with a maximum of 4 in V4, TEO, and TE. In addition, TEO and TE received self context projections which provide an extended window of temporal context into the prior 200 msec interval, consistent with multiple sources of neural data Chaudhuri et al. (2015). These self projections were connected only within the narrower Pool level of units, enabling these neurons to develop mutually-excitatory loops to sustain activations over the multiple trials when the same object was present. We hypothesize that these modifications correspond to biological adaptations in IT cortex that likewise support greater sustained activation of object-level representations.

Learning of the context weights occurs as normal, but using the sending activation states from the prior time step’s activation.

Computational and Biological Details of SRN-like Functionality
Predictive auto-encoder learning has been explored in various frameworks, but the most relevant to our model comes from the application of the SRN (simple recurrent network) to a range of predictive learning domains J. Elman et al. (1996); J. L. Elman (1990). One of the most powerful features of the SRN is that it enables error-driven learning, instead of arbitrary parameter settings, to determine how prior information is integrated with new information. Thus, SRNs can learn to hold onto some important information for a relatively long interval, while rapidly updating other information that is only relevant for a shorter duration. This same flexibility is present in our DeepLeabra model. Furthermore, because this temporal context information is hypothesized to be present in the deep layers throughout the entire neocortex (in every microcolumn of tissue), the DeepLeabra model provides a more pervasive and interconnected form of temporal integration compared to the SRN, which typically just has a single temporal context layer associated with the internal “hidden” layer of processing units.

An extensive computational analysis of what makes the SRN work as well as it does, and explorations of a range of possible alternative frameworks, has led us to an important general principle: subsequent outcomes determine what is relevant from the past. At some level, this may seem obvious, but it has significant implications for predictive learning mechanisms based on temporal context. It means that the information encoded in a temporal context representation cannot be learned at the time when that information is presently active. Instead, the relevant contextual information is learned on the basis of what happens next. This explains the peculiar power of the otherwise strange property of the SRN: the temporal context information is preserved as a direct copy of the state of the hidden layer units on the previous time step (Figure 15), and then learned synaptic weights integrate that copied context information into the next hidden state (which is then copied to the context again, and so on). This enables the error-driven learning taking place in the current time step to determine how context information from the previous time step is integrated. And the simple direct copy operation eschews any attempt to shape this temporal context itself, instead relying on the learning pressure that shapes the hidden layer representations to also shape the context representations. In other words, this copy operation is essential, because there is no other viable source of learning signals to shape the nature of the context representation itself (because these learning signals require future outcomes, which are by definition only available later).

Figure 15: How the DeepLeabra temporal context computation compares to the SRN mathematically. a) In a standard SRN, the context (deep layer biologically) is a copy of the hidden activations from the prior time step, and these are held constant while the hidden layer (superficial) units integrate the context through learned synaptic weights. b) In DeepLeabra, the deep layer performs the weighted integration of the soon-to-be context information from the superficial layer, and then holds this integrated value, and feeds it back as an additive net-input like signal to the superficial layer. The context net input is pre-computed, instead of having to compute this same value over and over again. This is more efficient, and more compatible with the diffuse interconnections among the deep layer neurons. Layer 6 projections to the thalamus and back recirculate this pre-computed net input value into the superficial layers (via layer 4), and back into itself to support maintenance of the held value.

The direct copy operation of the SRN is however seemingly problematic from a biological perspective: how could neurons copy activations from another set of neurons at some discrete point in time, and then hold onto those copied values for a duration of 100 msec, which is a reasonably long period of time in neural terms (e.g., a rapidly firing cortical neuron fires at around 100 Hz, meaning that it will fire 10 times within that context frame)? However, there is an important transformation of the SRN context computation, which is more biologically plausible, and compatible with the structure of the deep network (Figure 15). Specifically, instead of copying an entire set of activation states, the context activations (generated by the phasic 5IB burst) are immediately sent through the adaptive synaptic weights that integrate this information, which we think occurs in the 6CC (corticocortical) and other lateral integrative connections from 5IB neurons into the rest of the deep network. The result is a pre-computed net input from the context onto a given hidden unit (in the original SRN terminology), not the raw context information itself. Computationally, and metabolically, this is a much more efficient mechanism, because the context is, by definition, unchanging over the 100 msec alpha cycle, and thus it makes more sense to pre-compute the synaptic integration, rather than repeatedly re-computing this same synaptic integration over and over again (in the original feedforward backpropagation-based SRN model, this issue did not arise because a single step of activation updating took place for each context update, whereas in our bidirectional model many activation update steps must take place per context update).

Figure 16: Learning curves for the backpropagation version of the WWI model. Although it achieves better predictive accuracy than the DeepLeabra version, it fails to acquire abstract object category structure, indicating a potential tradeoff between simplifying and categorizing inputs, versus predicting precisely where the low-level visual features will move.

There are a couple of remaining challenges for this transformation of the SRN. First, the pre-computed net input from the context must somehow persist over the subsequent 100 msec period of the alpha cycle. We hypothesize that this can occur via NMDA and mGluR channels that can easily produce sustained excitatory currents over this time frame. Furthermore, the reciprocal excitatory connectivity from 6CT to TRC and back to 6CT could help to sustain the initial temporal context signal. Second, these contextual integration synapses require a different form of learning algorithm that uses the sending activation from the prior 100 msec, which is well within the time constants in the relevant calcium and second messenger pathways involved in synaptic plasticity.
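The pre-computed context net input discussed in this section can be illustrated with a small sketch. It follows the averaged dot product of Equation 2 and then adds the held value to the standard net input on every processing cycle. The function names, the `context_strength` parameter, and all numeric values are hypothetical, and the sketch leaves out everything else in the Leabra activation dynamics.

```python
def context_net_input(prev_acts, context_weights):
    """Equation 2: eta_j = (1/n) * sum_i x_i * w_ij.

    prev_acts holds the sending activations x_i from the prior alpha trial;
    context_weights[i][j] is the weight from sender i to receiver j.
    """
    n = len(prev_acts)
    n_recv = len(context_weights[0])
    return [
        sum(x * row[j] for x, row in zip(prev_acts, context_weights)) / n
        for j in range(n_recv)
    ]


def run_cycles(standard_net_per_cycle, context_net, context_strength=1.0):
    """Add the held context term to the standard net input on every cycle.

    The context term is computed once and reused, rather than being
    recomputed from the sending activations on each cycle.
    """
    return [
        [std + context_strength * ctx for std, ctx in zip(cycle, context_net)]
        for cycle in standard_net_per_cycle
    ]


# Three sending units from the prior trial projecting to two receiving units.
prev_acts = [1.0, 0.0, 0.5]
context_weights = [[0.2, 0.4], [0.9, 0.1], [0.6, 0.8]]

eta = context_net_input(prev_acts, context_weights)        # computed once
totals = run_cycles([[0.1, 0.1], [0.3, 0.2]], eta, context_strength=2.0)
print(eta)
print(totals)
```

Because `eta` is fixed for the whole trial, the per-cycle cost of the context is a single addition per receiving unit, which is the efficiency argument made in the text.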
Backpropagation Model Methods
The backpropagation version of the WWI model has exactly the same layer sizes and feedforward patterns of connectivity as the DeepLeabra version. Topographically, the V1p and V1hp pulvinar layers serve as output layers at the highest level of the network, receiving all the various connections from deep layers as shown in Table 1. Likewise, the LIPp served as a target output layer for the Where pathway. To achieve predictive learning, the V1 pulvinar targets were from the scene at time t, while the V1s inputs were from the scene at time t − 1. We also ran a comparison auto-encoder model that had inputs and target outputs from the same time step, and it showed even less systematic organization of its higher-level representations, further supporting the notion that predictive learning is important, across all frameworks. The learning curve for the predictive version is shown in Figure 16, which shows better overall prediction accuracy compared to the DeepLeabra model. However, as the RSA showed, this backpropagation model failed to learn object categories that go beyond the input similarity structure, indicating that perhaps it was paying too much “attention” in learning to this low-level structure, and lacked the necessary mechanisms to enable it to impose a simplifying higher-level structure on top of these inputs.

PredNet Model Methods
The PredNet architecture was designed to incorporate principles from predictive coding theory into a neural network model for predicting the next frame in a video sequence. Details of the model can be found in the original paper Lotter et al. (2016), but here we provide a brief overview of the architecture.
Architecture
PredNet is a deep convolutional neural network that is composed of layers containing discrete modules. The lowest layer generates a prediction of incoming inputs (i.e. the pixels in the next frame), while each of the higher layers attempts to predict the errors made by the previous layer. Each layer contains an input convolutional module ($A_l$), a recurrent representational module ($R_l$), a prediction module ($\hat{A}_l$), and a representation of its own errors ($E_l$). The input convolutional module ($A_l$) transforms its input with a set of standard convolutional filters, a rectified linear activation function, and a max-pooling operation. The recurrent representation module ($R_l$) is a convolutional LSTM, which is a recurrent convolutional network that replaces the matrix multiplications in the standard LSTM equations with convolutions, allowing it to maintain a spatially organized representation of its inputs over time. The prediction module ($\hat{A}_l$) consists of another standard convolutional layer and rectified linear activation that is used to generate predictions from the output of $R_l$. These predictions are then compared against the output of the input convolutional module ($A_l$). The errors generated in this comparison are represented explicitly in $E_l$, which applies a rectified linear activation to a concatenation of the positive ($A_l - \hat{A}_l$) and negative ($\hat{A}_l - A_l$) prediction errors. These errors then become the inputs to the next layer.

$$A_l^t = \begin{cases} x_t, & \text{if } l = 0 \\ \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}(E_{l-1}^t))), & \text{if } l > 0 \end{cases} \tag{3}$$

$$\hat{A}_l^t = \mathrm{ReLU}(\mathrm{Conv}(R_l^t)) \tag{4}$$

$$E_l^t = [\mathrm{ReLU}(A_l^t - \hat{A}_l^t);\ \mathrm{ReLU}(\hat{A}_l^t - A_l^t)] \tag{5}$$

$$R_l^t = \mathrm{ConvLSTM}(E_l^{t-1}, R_l^{t-1}, \mathrm{UpSample}(R_{l+1}^t)) \tag{6}$$

At each time step in the video sequence, PredNet generates a prediction of the next frame. This is done as follows: first, the $R_l^t$ is computed for each layer starting from the top of the hierarchy (because each $R_l^t$ depends on input from $R_{l+1}^t$), and then the $A_l^t$, $\hat{A}_l^t$ and $E_l^t$ are computed in a feed-forward fashion (because each $A_l^t$ depends on input from the layer below, $E_{l-1}^t$).

All analyses in the RSA were conducted using the representations from the $R_l$ layers.

Figure 17: Learning curves for the PredNet model. This model achieves the best overall prediction performance but also has the least well differentiated, categorical representations.

Implementation details
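As a concrete illustration of the error representation in Equation 5, the following sketch computes the rectified positive and negative prediction errors for a flat list of values and concatenates them. It is written for exposition only and is not the PyTorch implementation used for the experiments. The function names are hypothetical, and real PredNet layers operate on multi-channel feature maps rather than flat lists.

```python
def relu(value):
    """Rectified linear activation for a single number."""
    return max(0.0, value)


def prediction_error(actual, predicted):
    """Equation 5: E = [ReLU(A - A_hat); ReLU(A_hat - A)].

    Returns the positive errors followed by the negative errors, so the
    output is twice as long as the input.
    """
    positive = [relu(a - p) for a, p in zip(actual, predicted)]
    negative = [relu(p - a) for a, p in zip(actual, predicted)]
    return positive + negative


actual = [1.0, 0.25, 0.5]       # output of the input module, A
predicted = [0.5, 0.75, 0.5]    # output of the prediction module, A_hat
errors = prediction_error(actual, predicted)
print(errors)
```

Splitting the error into two rectified halves keeps every error unit non-negative while still preserving the sign of the mismatch, since under-prediction and over-prediction land in different halves.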
All experiments with the PredNet architecture were performed using PyTorch. An informal hyperparameter search was conducted to find the settings that maximized representational similarity to the human judgments. This was done by conducting RSA on each layer for each hyperparameter setting, and computing, according to the Centroid categories derived from the human data, the difference between the average within-category similarity and the average between-category similarity. Our final architecture had 6 layers with 3, 16, 32, 64, 128, and 256 filters in the $A_l$ and $R_l$ modules, and 3x3 kernels throughout the whole network. We also found that using sigmoid and tanh activation functions in fully-connected convolutional LSTMs slightly improved performance, so these were used for all experiments.

The weights in the PredNet model are trained using error backpropagation. Predictions are generated and errors are computed at all levels of the hierarchy, but the model performs better when only the lowest layer’s errors are backpropagated Lotter et al. (2016). We confirmed these results with experiments that backpropagated the errors in higher layers, in which performance (in terms of mean squared error) was marginally reduced but the RSA results were similar. For this reason, all reported experiments used a PredNet that was trained by only backpropagating the lowest level error.

The model was trained using a batch size of 8 and an Adam optimizer with a learning rate of 0.0001, with no scheduler, for 150,000 batches. A training curve is shown in Figure 17, showing that it achieves the best overall prediction accuracy of any model we tested, and yet does not have representations that are as differentiated or categorical as our biologically based model, as shown in the main paper.

[Figure 18 plot: "Effect of dropout on RSA"; x-axis: pixels, layer1 to layer6; y-axis: within-between correlation; series: Dropout = 0.0, 0.1, 0.5]

Figure 18: Effect of dropout in PredNet on RSA, as measured by the difference between the average within-category correlation and the average between-category correlation (using the Centroid categories derived from human data). Dropout marginally improves the category structure learned in PredNet.

Regularization experiments
As discussed in the main paper, our biologically based model includes a number of important biologically motivated properties that may be contributing to the development of its categorical representations. These properties, including excitatory bidirectional connections, inhibitory competition, and an additional form of Hebbian learning, may be acting as regularizers that encourage categorical learning. We therefore tested whether standard regularization methods used in deep learning would have similar effects on the representations developed in the PredNet architecture. We tested 1) batch normalization, 2) dropout (0.1, 0.3, and 0.5), and 3) weight decay (0.01, 0.001, 0.0001, 0.00001). All experiments with batch normalization and weight decay showed reduced performance (in terms of both prediction error on the test set and within-category correlation). As shown in Figure 18, dropout marginally improved the within-category correlation while also slightly improving prediction accuracy, so a dropout rate of 0.1 was used for the comparison to our biologically based model in the main paper.
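The within-minus-between measure used in the hyperparameter search and in Figure 18 can be illustrated with a short sketch. It assumes Pearson correlation as the similarity measure and averages over all distinct pairs of items. The function names and the toy patterns are hypothetical, and the actual analysis pipeline may differ in details such as how patterns are averaged over frames.

```python
from itertools import combinations
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length activation patterns."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def within_minus_between(patterns, categories):
    """Mean within-category correlation minus mean between-category correlation."""
    within, between = [], []
    for i, j in combinations(range(len(patterns)), 2):
        r = pearson(patterns[i], patterns[j])
        if categories[i] == categories[j]:
            within.append(r)
        else:
            between.append(r)
    return mean(within) - mean(between)


# Toy example: two items per category, with opposite patterns across categories.
patterns = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 1],
    [8, 6, 4, 2],
]
categories = ["A", "A", "B", "B"]
score = within_minus_between(patterns, categories)
print(score)
```

A score near zero means the layer's representations do not respect the category structure at all, while larger positive values indicate more categorical organization.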
Deep Predictive Learning * ReferencesAbbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997, December). Synaptic depression and corticalgain control.
Science , , 220.Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985, December). A learning algorithm for Boltzmannmachines. Cognitive Science , (1), 147–169.Antonov, P. A., Chakravarthi, R., & Andersen, S. K. (2020, October). Too little, too late, and in the wrongplace: Alpha band activity does not reflect an active mechanism of selective attention. NeuroImage , , 117006. doi: 10.1016/j.neuroimage.2020.117006Arcaro, M. J., Pinsk, M. A., & Kastner, S. (2015, July). The anatomical and functional organization ofthe human visual pulvinar. Journal of Neuroscience , (27), 9848–9871. doi: 10.1523/JNEUROSCI.1575-14.2015Ashby, F. G., & Maddox, W. T. (2011, April). Human Category Learning 2.0. Annals of the New YorkAcademy of Sciences , , 147–161. doi: 10.1111/j.1749-6632.2010.05874.xBarczak, A., O’Connell, M. N., McGinnis, T., Ross, D., Mowery, T., Falchier, A., & Lakatos, P. (2018,August). Top-down, contextual entrainment of neuronal oscillations in the auditory thalamocorticalcircuit. Proceedings of the National Academy of Sciences , (32), E7605-E7614. doi: 10.1073/pnas.1714684115Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012, November).Canonical microcircuits for predictive coding. Neuron , (4), 695–711. Retrieved from Bastos, A. M., Vezoli, J., Bosman, C. A., Schoffelen, J.-M., Oostenveld, R., Dowdall, J. R., . . . Fries,P. (2015, January). Visual Areas Exert Feedforward and Feedback Influences through Distinct Fre-quency Channels.
Neuron , (2), 390–401. doi: 10.1016/j.neuron.2014.12.018Bender, D. B. (1982, July). Receptive-field properties of neurons in the macaque inferior pulvinar. Journal ofneurophysiology , . Retrieved from Bender, D. B., & Youakim, M. (2001, January). Effect of attentive fixation in macaque thalamus and cortex.
Journal of neurophysiology , , 219–234. Retrieved from Bengio, Y., Mesnard, T., Fischer, A., Zhang, S., & Wu, Y. (2017, January). STDP-compatible approximationof backpropagation in an energy-based model.
Neural Computation , (3), 555–577. doi: 10.1162/NECO a 00934Bengio, Y., Yao, L., Alain, G., & Vincent, P. (2013). Generalized Denoising Auto-Encoders as Gen-erative Models. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger(Eds.), Advances in Neural Information Processing Systems 26 (pp. 899–907). Curran Associates,Inc. Retrieved 2017-05-15, from http://papers.nips.cc/paper/5023-generalized-denoising-auto-encoders-as-generative-models.pdf
Berger, H. (1929, December). ¨Uber das Elektrenkephalogramm des Menschen.
Archiv f¨ur Psychiatrie undNervenkrankheiten , (1), 527–570. doi: 10.1007/BF01797193Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982, March). Theory for the development of neuronselectivity: Orientation specificity and binocular interaction in visual cortex. The Journal of Neuro-science , (2), 32–48. Retrieved from Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In
Metacog-nition: Knowing about knowing (pp. 185–205). Cambridge, MA, US: The MIT Press.Bortone, D. S., Olsen, S. R., & Scanziani, M. (2014, April). Translaminar inhibitory cells recruited by layer6 corticothalamic neurons suppress visual cortex.
Neuron , . Retrieved from ’Reilly et al. .nlm.nih.gov/pubmed/24656931 Bourne, J. A., & Rosa, M. G. P. (2006, March). Hierarchical development of the primate visual cortex,as revealed by neurofilament immunoreactivity: Early maturation of the middle temporal area (MT).
Cerebral Cortex , (3), 405–414. doi: 10.1093/cercor/bhi119
Brette, R., & Gerstner, W. (2005, November). Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. Journal of Neurophysiology , (5), 3637–3642. doi: 10.1152/jn.00686.2005
Bridge, H., Leopold, D. A., & Bourne, J. A. (2016, February). Adaptive Pulvinar Circuitry Supports Visual Cognition. Trends in Cognitive Sciences , (2), 146–157. doi: 10.1016/j.tics.2015.10.003
Buffalo, E. A., Fries, P., Landman, R., Buschman, T. J., & Desimone, R. (2011, July). Laminar differences in gamma and alpha coherence in the ventral stream. Proceedings of the National Academy of Sciences of the United States of America , (27), 11262–11267. Retrieved from
Busch, N. A., Dubois, J., & VanRullen, R. (2009, June). The phase of ongoing EEG oscillations predicts visual perception. The Journal of Neuroscience , (24), 7869–7876. Retrieved from
Buzsáki, G. (2005). Theta rhythm of navigation: Link between path integration and landmark navigation, episodic and semantic memory. Hippocampus , (7), 827–840. doi: 10.1002/hipo.20113
Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., . . . DiCarlo, J. J. (2014, December). Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLoS Computational Biology , (12), e1003963. doi: 10.1371/journal.pcbi.1003963
Cavanagh, P., Hunt, A. R., Afraz, A., & Rolfs, M. (2010, April). Visual stability based on remapping of attention pointers. Trends in Cognitive Sciences , (4), 147–153. doi: 10.1016/j.tics.2010.01.007
Chaudhuri, R., Knoblauch, K., Gariel, M.-A., Kennedy, H., & Wang, X.-J. (2015, October). A Large-Scale Circuit Mechanism for Hierarchical Dynamical Processing in the Primate Cortex. Neuron , (2), 419–431. doi: 10.1016/j.neuron.2015.09.008
Clark, A. (2013, June). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences , (3), 181–204. Retrieved from
Clayton, M. S., Yeung, N., & Kadosh, R. C. (2018). The many characters of visual alpha oscillations. European Journal of Neuroscience , (7), 2498–2508. doi: 10.1111/ejn.13747
Colby, C. L., Duhamel, J. R., & Goldberg, M. E. (1997, March). Visual, presaccadic, and cognitive activation of single neurons in monkey lateral intraparietal area. Journal of neurophysiology , , 2841. Retrieved from
Connors, B. W., Gutnick, M. J., & Prince, D. A. (1982, December). Electrophysiological properties of neocortical neurons in vitro. Journal of Neurophysiology , (6), 1302–1320. Retrieved from
Cooper, L. N., & Bear, M. F. (2012, November). The BCM theory of synapse modification at 30: Interaction of theory with experiment. Nature Reviews Neuroscience , (11), 798–810. doi: 10.1038/nrn3353
Crick, F. (1984, July). Function of the thalamic reticular complex: The searchlight hypothesis. Proceedings of the National Academy of Sciences of the United States of America , , 4586–4590. Retrieved from
Cricri, F., Ni, X., Honkala, M., Aksu, E., & Gabbouj, M. (2016, December). Video Ladder Networks. arXiv:1612.01756 [cs, stat] . Retrieved 2020-06-25, from http://arxiv.org/abs/1612.01756
Dayan, P. (1993, January). Improving generalization for temporal difference learning: The successor representation. Neural Computation , (4), 613–624. Retrieved from http://cognet.mit.edu/journal/10.1162/neco.1993.5.4.613
Dayan, P., Hinton, G. E., Neal, R. N., & Zemel, R. S. (1995, January). The Helmholtz machine. Neural Computation , (5), 889-904.
de Lange, F. P., Heilbron, M., & Kok, P. (2018, September). How do expectations shape perception? Trends in Cognitive Sciences , (9), 764–779. doi: 10.1016/j.tics.2018.06.002
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience , (1), 193–222. doi: 10.1146/annurev.ne.18.030195.001205
Duhamel, J. R., Colby, C. L., & Goldberg, M. E. (1992, April). The updating of the representation of visual space in parietal cortex by intended eye movements. Science , (5040), 90–92. Retrieved from
Elman, J., Bates, E., Karmiloff-Smith, A., Johnson, M., Parisi, D., & Plunkett, K. (1996). Rethinking Innateness: A Connectionist Perspective on Development . Cambridge, MA: MIT Press.
Elman, J. L. (1990, January). Finding structure in time. Cognitive Science , (2), 179–211.
Felleman, D. J., & Van Essen, D. C. (1991, January). Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cerebral Cortex , (1), 1–47. Retrieved from
Fiebelkorn, I. C., & Kastner, S. (2019, February). A rhythmic theory of attention.
Trends in Cognitive Sciences , (2), 87–101. doi: 10.1016/j.tics.2018.11.009
Fiebelkorn, I. C., Pinsk, M. A., & Kastner, S. (2018, August). A dynamic interplay within the frontoparietal network underlies rhythmic spatial attention. Neuron , (4), 842-853.e8. doi: 10.1016/j.neuron.2018.07.038
Fiser, A., Mahringer, D., Oyibo, H. K., Petersen, A. V., Leinweber, M., & Keller, G. B. (2016, December). Experience-dependent spatial expectations in mouse visual cortex. Nature Neuroscience , (12), 1658–1664. doi: 10.1038/nn.4385
Foldiak, P. (1991, January). Learning Invariance from Transformation Sequences. Neural Computation , (2), 194–200.
Foster, J. J., & Awh, E. (2019, October). The role of alpha oscillations in spatial attention: Limited evidence for a suppression account. Current Opinion in Psychology , , 34–40. doi: 10.1016/j.copsyc.2018.11.001
Franceschetti, S., Guatteo, E., Panzica, F., Sancini, G., Wanke, E., & Avanzini, G. (1995, October). Ionic mechanisms underlying burst firing in pyramidal neurons: Intracellular study in rat sensorimotor cortex. Brain Research , (1–2), 127–139. Retrieved from
Fries, P., Womelsdorf, T., Oostenveld, R., & Desimone, R. (2008, April). The Effects of Visual Stimulation and Selective Visual Attention on Rhythmic Neuronal Synchronization in Macaque Area V4. Journal of Neuroscience , (18), 4823–4835. doi: 10.1523/JNEUROSCI.4499-07.2008
Friston, K. (2005, April). A theory of cortical responses. Philosophical Transactions of the Royal Society B , (1456), 815–836. Retrieved from
Friston, K. (2010, February). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience , (2), 127–138. Retrieved from
Fusi, S., Miller, E. K., & Rigotti, M. (2016, April). Why neurons mix: High dimensionality for higher cognition. Current Opinion in Neurobiology , , 66–74. doi: 10.1016/j.conb.2016.01.010
Gardner, M. P. H., Schoenbaum, G., & Gershman, S. J. (2018, November). Rethinking dopamine as generalized prediction error. Proceedings of the Royal Society B: Biological Sciences , (1891), 20181645. doi: 10.1098/rspb.2018.1645
Gavornik, J. P., & Bear, M. F. (2014, May). Learned spatiotemporal sequence recognition and prediction in primary visual cortex. Nature Neuroscience , (5), 732–737. doi: 10.1038/nn.3683
George, D., & Hawkins, J. (2009, October). Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology , (10). Retrieved from
Goodale, M. A., & Milner, A. D. (1992, January). Separate visual pathways for perception and action.
Trends in Neurosciences , (1), 20–25.
Gottlieb, J. P., Kusunoki, M., & Goldberg, M. E. (1998, February). The representation of visual salience in monkey parietal cortex. Nature , , 481. Retrieved from
Grill-Spector, K., Henson, R., & Martin, A. (2006, January). Repetition and the brain: Neural models of stimulus-specific effects. Trends in Cognitive Sciences , (1), 14–23. doi: 10.1016/j.tics.2005.11.006
Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention, and grouping by the laminar circuits of visual cortex. Spatial vision , . Retrieved from
Gruber, W. R., Klimesch, W., Sauseng, P., & Doppelmayr, M. (2005, April). Alpha Phase Synchronization Predicts P1 and N1 Latency and Amplitude Size. Cerebral Cortex , (4), 371–377. doi: 10.1093/cercor/bhh139
Gundlach, C., Moratti, S., Forschack, N., & Müller, M. M. (2020, May). Spatial Attentional Selection Modulates Early Visual Stimulus Processing Independently of Visual Alpha Modulations. Cerebral Cortex , (6), 3686–3703. doi: 10.1093/cercor/bhz335
Halassa, M. M., & Kastner, S. (2017, December). Thalamic functions in distributed cognitive control. Nature Neuroscience , (12), 1669. doi: 10.1038/s41593-017-0020-1
Hawkins, J., & Blakeslee, S. (2004). On Intelligence . New York, NY: Times Books.
Hennig, M. H. (2013). Theoretical models of synaptic short term plasticity. Frontiers in Computational Neuroscience , . Retrieved from
Hinton, G. E., & McClelland, J. L. (1988, January). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural Information Processing Systems (NIPS 1987) (Vol. 0, pp. 358–366). New York: American Institute of Physics. Retrieved from http://papers.nips.cc/paper/78-learning-representations-by-recirculation.pdf
Hinton, G. E., & Salakhutdinov, R. R. (2006, July). Reducing the dimensionality of data with neural networks. Science , (5786), 504–507. Retrieved from
Holroyd, C. B., & Coles, M. G. H. (2002, October). The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review , (4), 679–709. Retrieved from
Hopfield, J. J. (1984, July). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA , , 3088–3092. Retrieved from
Issa, E. B., Cadieu, C. F., & DiCarlo, J. J. (2018, November). Neural dynamics at successive stages of the ventral visual stream are consistent with hierarchical error signals. eLife , , e42870. doi: 10.7554/eLife.42870
Jaegle, A., & Ro, T. (2013, October). Direct Control of Visual Perception with Phase-specific Modulation of Posterior Parietal Cortex. Journal of Cognitive Neuroscience , (2), 422–432. doi: 10.1162/jocn_a_00494
Jaramillo, J., Mejias, J. F., & Wang, X.-J. (2019, January). Engagement of Pulvino-cortical Feedforward and Feedback Pathways in Cognitive Computations. Neuron , (2), 321-336.e9. doi: 10.1016/j.neuron.2018.11.023
Jensen, O., Bonnefond, M., Marshall, T. R., & Tiesinga, P. (2015, April). Oscillatory mechanisms of feedforward and feedback visual processing.
Trends in Neurosciences , (4), 192–194. doi: 10.1016/j.tins.2015.02.006
Jensen, O., Bonnefond, M., & VanRullen, R. (2012, April). An oscillatory mechanism for prioritizing salient unattended stimuli. Trends in Cognitive Sciences , (4), 200–206. Retrieved from
Jensen, O., & Mazaheri, A. (2010). Shaping functional architecture by oscillatory alpha activity: Gating by inhibition. Frontiers in Human Neuroscience , (186). doi: 10.3389/fnhum.2010.00186
Jordan, M. I. (1989, January). Serial Order: A Parallel, Distributed Processing Approach. In J. L. Elman & D. E. Rumelhart (Eds.), Advances in Connectionist Theory: Speech. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kachergis, G., Wyatte, D., O’Reilly, R. C., de Kleijn, R., & Hommel, B. (2014, November). A continuous-time neural model for sequential action. Philosophical Transactions of the Royal Society B: Biological Sciences , (1655), 20130623. doi: 10.1098/rstb.2013.0623
Kahana, M. J., Seelig, D., & Madsen, J. R. (2001, December). Theta returns. Current Opinion in Neurobiology , (6), 739–744. doi: 10.1016/s0959-4388(01)00278-1
Kawato, M., Hayakawa, H., & Inui, T. (1993, January). A forward-inverse optics model of reciprocal connections between visual cortical areas. Network: Computation in Neural Systems , (4), 415–422. doi: 10.1088/0954-898X_4_4_001
Keitel, C., Keitel, A., Benwell, C. S. Y., Daube, C., Thut, G., & Gross, J. (2019, April). Stimulus-Driven Brain Rhythms within the Alpha Band: The Attentional-Modulation Conundrum. Journal of Neuroscience , (16), 3119–3129. doi: 10.1523/JNEUROSCI.1633-18.2019
Kelly, S. P., Lalor, E. C., Reilly, R. B., & Foxe, J. J. (2006, June). Increases in Alpha Oscillatory Power Reflect an Active Retinotopic Mechanism for Distracter Suppression During Sustained Visuospatial Attention. Journal of Neurophysiology , (6), 3844–3851. doi: 10.1152/jn.01234.2005
Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014, November). Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLOS Computational Biology , (11), e1003915. doi: 10.1371/journal.pcbi.1003915
Kiorpes, L., Price, T., Hall-Haro, C., & Anthony Movshon, J. (2012, June). Development of sensitivity to global form and motion in macaque monkeys (Macaca nemestrina). Vision Research , , 34–42. doi: 10.1016/j.visres.2012.04.018
Klimesch, W. (2011, August). Evoked alpha and early access to the knowledge system: The P1 inhibition timing hypothesis. Brain Research , , 52–71. doi: 10.1016/j.brainres.2011.06.003
Klimesch, W., Sauseng, P., & Hanslmayr, S. (2007, January). EEG alpha oscillations: The inhibition-timing hypothesis. Brain Research Reviews , (1), 63–88. doi: 10.1016/j.brainresrev.2006.06.003
Kobatake, E., & Tanaka, K. (1994, January). Neuronal selectivities to complex object features in the ventral visual pathway.
Journal of Neurophysiology , (3), 856–867.
Kogo, N., & Trengove, C. (2015). Is predictive coding theory articulated enough to be testable? Frontiers in Computational Neuroscience , . doi: 10.3389/fncom.2015.00111
Kok, P., & de Lange, F. P. (2015). Predictive Coding in Sensory Cortex. In An Introduction to Model-Based Cognitive Neuroscience (pp. 221–244). Springer, New York, NY. doi: 10.1007/978-1-4939-2236-9_11
Kok, P., Jehee, J. F. M., & de Lange, F. P. (2012, July). Less Is More: Expectation Sharpens Representations in the Primary Visual Cortex. Neuron , (2), 265–270. doi: 10.1016/j.neuron.2012.04.034
Komura, Y., Nikkuni, A., Hirashima, N., Uetake, T., & Miyamoto, A. (2013, June). Responses of pulvinar neurons reflect a subject’s confidence in visual categorization. Nature Neuroscience , (6), 749–755. doi: 10.1038/nn.3393
Frontiers in Systems Neuroscience , (4). Retrieved from
LaBerge, D., & Buchsbaum, M. S. (1990, March). Positron emission tomographic measurements of pulvinar activity during an attention task.
The Journal of neuroscience : the official journal of the Society for Neuroscience , , 613–9. Retrieved from
Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999, March). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature , (6725), 338–341. doi: 10.1038/18686
LeCun, Y., Bengio, Y., & Hinton, G. (2015, May). Deep learning. Nature , (7553), 436–444. doi: 10.1038/nature14539
Lee, T. S., & Mumford, D. (2003, July). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America , (7), 1434–1448. Retrieved from
Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020, June). Backpropagation and the brain. Nature Reviews Neuroscience , (6), 335–346. doi: 10.1038/s41583-020-0277-3
Lim, S., McKee, J. L., Woloszyn, L., Amit, Y., Freedman, D. J., Sheinberg, D. L., & Brunel, N. (2015, December). Inferring learning rules from distributions of firing rates in cortical neurons. Nature Neuroscience , (12), 1804–1810. doi: 10.1038/nn.4158
Lotter, W., Kreiman, G., & Cox, D. (2016, May). Deep predictive coding networks for video prediction and unsupervised learning. arXiv:1605.08104 [cs, q-bio] . Retrieved 2017-08-11, from http://arxiv.org/abs/1605.08104
Luczak, A., Bartho, P., & Harris, K. D. (2013, January). Gating of sensory input by spontaneous cortical activity.
The Journal of Neuroscience , (4), 1684–1695. Retrieved from
Lüscher, C., & Malenka, R. C. (2012, June). NMDA receptor-dependent long-term potentiation and long-term depression (LTP/LTD). Cold Spring Harbor Perspectives in Biology , (6), a005710. doi: 10.1101/cshperspect.a005710
Maier, A., Adams, G. K., Aura, C., & Leopold, D. A. (2010). Distinct Superficial and Deep Laminar Domains of Activity in the Visual Cortex during Rest and Stimulation. Frontiers in Systems Neuroscience , (31). doi: 10.3389/fnsys.2010.00031
Maier, A., Aura, C. J., & Leopold, D. A. (2011, February). Infragranular sources of sustained local field potential responses in macaque primary visual cortex. The Journal of Neuroscience , (6), 1971–1980. Retrieved from
Makeig, S., Westerfield, M., Jung, T. P., Enghoff, S., Townsend, J., Courchesne, E., & Sejnowski, T. J. (2002, January). Dynamic Brain Sources of Visual Evoked Responses. Science , , 690–693.
Marino, A. C., & Mazer, J. A. (2016). Perisaccadic Updating of Visual Representations and Attentional States: Linking Behavior and Neurophysiology. Frontiers in Systems Neuroscience , . doi: 10.3389/fnsys.2016.00003
Markov, N. T., Ercsey-Ravasz, M. M., Gomes, R., R, A., Lamy, C., Magrou, L., . . . Kennedy, H. (2014, January). A Weighted and Directed Interareal Connectivity Matrix for Macaque Cerebral Cortex. Cerebral Cortex , (1), 17–36. doi: 10.1093/cercor/bhs270
Markov, N. T., Vezoli, J., Chameau, P., Falchier, A., Quilodran, R., Huissoud, C., . . . Kennedy, H. (2014, January). Anatomy of hierarchy: Feedforward and feedback pathways in macaque visual cortex: Cortical counterstreams. Journal of Comparative Neurology , (1), 225–259. doi: 10.1002/cne.23458
Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2004, March). The role of fixational eye movements in visual perception. Nature Reviews Neuroscience , (3), 229–240. doi: 10.1038/nrn1348
Martinez-Conde, S., Otero-Millan, J., & Macknik, S. L. (2013, February). The impact of microsaccades on vision: Towards a unified theory of saccadic function. Nature Reviews Neuroscience , (2), 83–96. doi: 10.1038/nrn3405
Mathewson, K., Gratton, G., Fabiani, M., Beck, D., & Ro, T. (2009). To see or not to see: Prestimulus alpha phase predicts visual awareness. The Journal of Neuroscience , (9), 2725–2732.
Mathewson, K. E., Fabiani, M., Gratton, G., Beck, D. M., & Lleras, A. (2010, April). Rescuing stimuli from invisibility: Inducing a momentary release from visual masking with pre-target entrainment. Cognition , (1), 186–191. Retrieved from
Mathewson, K. E., Prudhomme, C., Fabiani, M., Beck, D. M., Lleras, A., & Gratton, G. (2012, August). Making waves in the stream of consciousness: Entraining oscillations in EEG alpha and fluctuations in visual awareness with rhythmic visual stimulation. Journal of Cognitive Neuroscience , (12), 2321–2333. doi: 10.1162/jocn_a_00288
Mayer, A., Schwiedrzik, C. M., Wibral, M., Singer, W., & Melloni, L. (2016, July). Expecting to See a Letter: Alpha Oscillations as Carriers of Top-Down Sensory Predictions. Cerebral Cortex , (7), 3146–3160. doi: 10.1093/cercor/bhv146
Meyer, T., & Olson, C. R. (2011, November). Statistical learning of visual transitions in monkey inferotemporal cortex. Proceedings of the National Academy of Sciences of the United States of America , (48), 19401–19406. Retrieved from
Michalareas, G., Vezoli, J., van Pelt, S., Schoffelen, J.-M., Kennedy, H., & Fries, P. (2016, January). Alpha-Beta and Gamma Rhythms Subserve Feedback and Feedforward Influences among Human Visual Cortical Areas.
Neuron , (2), 384–397. doi: 10.1016/j.neuron.2015.12.018
Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience , , 167–202. Retrieved from
Müller, J. R., Metha, A. B., Krauskopf, J., & Lennie, P. (1999, September). Rapid adaptation in visual cortex to the structure of images. Science (New York, N.Y.) , , 1405. Retrieved from
Mumford, D. (1991, June). On the computational architecture of the neocortex. Biological Cybernetics , (2), 135–145. doi: 10.1007/BF00202389
Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biological Cybernetics , (3), 241–251. Retrieved from
Nakamura, K., & Colby, C. L. (2002, March). Updating of the visual representation in monkey striate and extrastriate cortex during saccades. Proceedings of the National Academy of Sciences of the United States of America , (6), 4026–4031. Retrieved from
Neupane, S., Guitton, D., & Pack, C. C. (2016, February). Two distinct types of remapping in primate cortical area V4. Nature Communications , , 10402. doi: 10.1038/ncomms10402
Neupane, S., Guitton, D., & Pack, C. C. (2017, July). Coherent alpha oscillations link current and future receptive fields during saccades. Proceedings of the National Academy of Sciences , 201701672. doi: 10.1073/pnas.1701672114
Neupane, S., Guitton, D., & Pack, C. C. (2020, April). Perisaccadic remapping: What? How? Why? Reviews in the Neurosciences . doi: 10.1515/revneuro-2019-0097
Nunn, C. M. H., & Osselton, J. W. (1974, May). The Influence of the EEG Alpha Rhythm on the Perception of Visual Stimuli.
Psychophysiology , (3), 294–303. doi: 10.1111/j.1469-8986.1974.tb00547.x
O’Herron, P., & von der Heydt, R. (2013, January). Remapping of border ownership in the visual cortex. Journal of Neuroscience , (5), 1964–1974. doi: 10.1523/JNEUROSCI.2797-12.2013
Olsen, S., Bortone, D., Adesnik, H., & Scanziani, M. (2012, February). Gain control by layer six in cortical circuits of vision. Nature , (7387), 47–52.
O’Reilly, R. C. (1996, January). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation , (5), 895–938. doi: 10.1162/neco.1996.8.5.895
O’Reilly, R. C. (1998, January). Six Principles for Biologically-Based Computational Models of Cortical Cognition. Trends in Cognitive Sciences , (11), 455–462. Retrieved from
O’Reilly, R. C., Hazy, T. E., & Herd, S. A. (2016). The Leabra cognitive architecture: How to play 20 principles with nature and win! In S. Chipman (Ed.), Oxford handbook of cognitive science. Oxford University Press. Retrieved 2015-05-15, from
O’Reilly, R. C., & Munakata, Y. (2000). Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain . Cambridge, MA: MIT Press.
O’Reilly, R. C., Munakata, Y., Frank, M. J., Hazy, T. E., & Contributors. (2012). Computational Cognitive Neuroscience . Wiki Book, 1st Edition, URL: http://ccnbook.colorado.edu. Retrieved from http://ccnbook.colorado.edu
O’Reilly, R. C., Wyatte, D., Herd, S., Mingus, B., & Jilk, D. J. (2013). Recurrent Processing during Object Recognition. Frontiers in Psychology , (124). Retrieved from
O’Reilly, R. C., Wyatte, D., & Rohrlich, J. (2014, July). Learning Through Time in the Thalamocortical Loops. arXiv:1407.3432 [q-bio] . Retrieved 2015-05-15, from http://arxiv.org/abs/1407.3432
O’Reilly, R. C., Wyatte, D. R., & Rohrlich, J. (2017, September). Deep predictive learning: A comprehensive model of three visual streams. arXiv:1709.04654 [q-bio] . Retrieved 2017-09-15, from http://arxiv.org/abs/1709.04654
Ouden, H. E. M., Kok, P., & Lange, F. P. (2012). How prediction errors shape perception, attention, and motivation. Frontiers in Psychology , (548). Retrieved from
Palva, S., & Palva, J. M. (2011). Functional roles of alpha-band phase synchronization in local and large-scale cortical networks. Frontiers in Psychology , (204), ePub only. Retrieved from
Pennartz, C. M., Dora, S., Muckli, L., & Lorteije, J. A. (2019). Towards a Unified View on Pathways and Functions of Neural Recurrent Processing. Trends in Neurosciences .
Petersen, S. E., Robinson, D. L., & Keys, W. (1985, October). Pulvinar nuclei of the behaving rhesus monkey: Visual responses and their modulation. Journal of neurophysiology , . Retrieved from
Petrof, I., Viaene, A. N., & Sherman, S. M. (2012, June). Two populations of corticothalamic and interareal corticocortical cells in the subgranular layers of the mouse primary sensory cortices. Journal of Comparative Neurology , (8), 1678–1686. doi: 10.1002/cne.23006
Pinault, D. (2004, August). The thalamic reticular nucleus: Structure, function and concept. Brain research , . Retrieved from
Pineda, F. J. (1987, January). Generalization of Backpropagation to Recurrent Neural Networks. Physical Review Letters , , 2229–2232.
Privman, E., Malach, R., & Yeshurun, Y. (2013, April). Modeling the electrical field created by mass neural activity. Neural Networks , , 44–51. doi: 10.1016/j.neunet.2013.01.004
Purushothaman, G., Marion, R., Li, K., & Casagrande, V. A. (2012, June). Gating and control of primary visual cortex by pulvinar.
Nature Neuroscience , (6), 905–912. doi: 10.1038/nn.3106
Pylyshyn, Z. (1989, June). The role of location indexes in spatial perception: A sketch of the FINST spatial-index model. Cognition , (1), 65–97. doi: 10.1016/0010-0277(89)90014-0
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018, February). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. bioRxiv , 240614. doi: 10.1101/240614
Rao, R. P., & Ballard, D. H. (1999, January). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience , (1), 79–87. doi: 10.1038/4580
Ray, S., & Maunsell, J. H. R. (2011, April). Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLoS biology , (4), e1000610. doi: 10.1371/journal.pbio.1000610
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999, April). Competitive mechanisms subserve attention in macaque areas V2 and V4. The Journal of neuroscience : the official journal of the Society for Neuroscience , , 1736–1753. Retrieved from
Reynolds, J. H., & Heeger, D. J. (2009, January). The normalization model of attention. Neuron , (2), 168–185. Retrieved from
Richter, D., & de Lange, F. P. (2019, August). Statistical learning attenuates visual activity only for attended stimuli. eLife , , e47869. doi: 10.7554/eLife.47869
Robinson, D. L. (1993). Functional contributions of the primate pulvinar. Progress in brain research , . Retrieved from
Rockland, K. S. (1996, October). Two types of corticopulvinar terminations: Round (type 2) and elongate (type 1). The Journal of comparative neurology , , 57–87. Retrieved from
Rockland, K. S. (1998, January). Convergence and branching patterns of round, type 2 corticopulvinar axons. The Journal of Comparative Neurology , (4), 515–536. doi: 10.1002/(SICI)1096-9861(19980126)390:4
Brain Research , (1), 3–20. Retrieved from
Rumelhart, D. E., & McClelland, J. L. (1982, April). An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model.
Psychological review , , 60–94. Retrieved from
Saalmann, Y. B., & Kastner, S. (2011, July). Cognitive and perceptual functions of the visual thalamus. Neuron , (2), 209–223. Retrieved from
Saalmann, Y. B., Pinsk, M. A., Wang, L., Li, X., & Kastner, S. (2012, August). The pulvinar regulates information transmission between cortical areas based on attention demands. Science , (6095), 753–756. doi: 10.1126/science.1223082
Samaha, J., Bauer, P., Cimaroli, S., & Postle, B. R. (2015, July). Top-down control of the phase of alpha-band oscillations as a mechanism for temporal prediction. Proceedings of the National Academy of Sciences USA , (27), 8439–8444. doi: 10.1073/pnas.1503686112
Sherman, M. T., Kanai, R., Seth, A. K., & VanRullen, R. (2016, April). Rhythmic influence of top–down perceptual priors in the phase of prestimulus occipital alpha oscillations. Journal of Cognitive Neuroscience , (9), 1318–1330. doi: 10.1162/jocn_a_00973
Sherman, S. M. (2014, May). The function of metabotropic glutamate receptors in thalamus and cortex. The Neuroscientist , (2), 146–149.
Exploring the Thalamus and Its Role in Cortical Function . Cambridge, MA: MIT Press. Retrieved from
Sherman, S. M., & Guillery, R. W. (2011, September). Distinct functions for direct and transthalamic corticocortical connections. Journal of Neurophysiology , (3), 1068–1077. doi: 10.1152/jn.00429.2011
Sherman, S. M., & Guillery, R. W. (2013). Functional Connections of Cortical Areas: A New View From the Thalamus . Cambridge, MA: MIT Press.
Shipp, S. (2003, October). The functional logic of cortico-pulvinar connections. Philosophical Transactions of the Royal Society of London B , (1438), 1605–1624. Retrieved from
Shouval, H. Z. S., Bear, M. F., & Cooper, L. N. (2002, August). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proceedings of the National Academy of Sciences USA , (16), 10831–10836. Retrieved from
Shrager, J., & Johnson, M. H. (1996, October). Dynamic Plasticity Influences the Emergence of Function in a Simple Cortical Array. Neural Networks , (7), 1119–1129. doi: 10.1016/0893-6080(96)00033-0
Silva, L. R., Amitai, Y., & Connors, B. W. (1991, January). Intrinsic oscillations of neocortex generated by layer 5 pyramidal neurons. Science , (4992), 432–435. Retrieved from
Snow, J. C., Allen, H. A., Rafal, R. D., & Humphreys, G. W. (2009, March). Impaired attentional selection following lesions to human pulvinar: Evidence for homology between human and monkey. Proceedings of the National Academy of Sciences , (10), 4054–4059. doi: 10.1073/pnas.0810086106
Solís-Vivanco, R., Jensen, O., & Bonnefond, M. (2018, August). Top-Down Control of Alpha Phase Adjustment in Anticipation of Temporally Predictable Visual Stimuli. Journal of Cognitive Neuroscience , (8), 1157–1169. doi: 10.1162/jocn_a_01280
Solomon, E. A., Kragel, J. E., Sperling, M. R., Sharan, A., Worrell, G., Kucewicz, M., . . . Kahana, M. J. (2017, November). Widespread theta synchrony and high-frequency desynchronization underlies enhanced cognition. Nature Communications , (1), 1704. doi: 10.1038/s41467-017-01763-2
Spaak, E., Bonnefond, M., Maier, A., Leopold, D. A., & Jensen, O. (2012, December). Layer-specific entrainment of gamma-band neural activity by the alpha rhythm in monkey visual cortex. Current Biology , (24), 2313–2318. Retrieved from
Spaak, E., de Lange, F. P., & Jensen, O. (2014, March). Local Entrainment of Alpha Oscillations by Visual Stimuli Causes Cyclic Modulation of Perception.
Journal of Neuroscience , (10), 3536–3544. doi: 10.1523/JNEUROSCI.4385-13.2014
Spelke, E., Breinlinger, K., Macomber, J., & Jacobson, K. (1992, January). Origins of Knowledge. Psychological Review , (4), 605–632.
Spratling, M. W. (2008). Reconciling predictive coding and biased competition models of cortical function. Frontiers in Computational Neuroscience , (4), 1-8 (online). Retrieved from
Summerfield, C., & de Lange, F. P. (2014, November). Expectation in perceptual decision making: Neural and computational mechanisms. Nature Reviews Neuroscience , (11), 745–756. doi: 10.1038/nrn3838
Summerfield, C., & Egner, T. (2009, September). Expectation (and attention) in visual cognition. Trends in Cognitive Sciences , (9), 403–409. doi: 10.1016/j.tics.2009.06.003
Summerfield, C., Trittschuh, E. H., Monti, J. M., Mesulam, M. M., & Egner, T. (2008, September). Neural repetition suppression reflects fulfilled perceptual expectations. Nature Neuroscience , (9), 1004–1006.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press. Retrieved from
Thomson, A. M. (2010). Neocortical layer 6, a review. Frontiers in Neuroanatomy , (13). Retrieved from
Thomson, A. M., & Lamy, C. (2007, November). Functional maps of neocortical local circuitry. Frontiers in Neuroscience , (1), 19–42. Retrieved from
Todorovic, A., van Ede, F., Maris, E., & de Lange, F. P. (2011, June). Prior Expectation Mediates Neural Adaptation to Repeated Sounds in the Auditory Cortex: An MEG Study. Journal of Neuroscience , (25), 9118–9123. doi: 10.1523/JNEUROSCI.1425-11.2011
Ungerleider, L. G., & Mishkin, M. (1982, January). Two Cortical Visual Systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), The Analysis of Visual Behavior (pp. 549–586). Cambridge, MA: MIT Press.
Urakubo, H., Honda, M., Froemke, R. C., & Kuroda, S. (2008, March). Requirement of an allosteric kinetics of NMDA receptors for spike timing-dependent plasticity. The Journal of Neuroscience , (13), 3310–3323. Retrieved from
Usrey, W. M., & Sherman, S. M. (2018). Corticofugal circuits: Communication lines from the cortex to the rest of the brain. Journal of Comparative Neurology , (0). doi: 10.1002/cne.24423
Valpola, H. (2014, November). From neural PCA to deep unsupervised learning. arXiv:1411.7783 [cs, stat] . Retrieved 2017-05-15, from http://arxiv.org/abs/1411.7783
van Kerkoerle, T., Self, M. W., Dagnino, B., Gariel-Mathis, M.-A., Poort, J., van der Togt, C., & Roelfsema, P. R. (2014, October). Alpha and gamma oscillations characterize feedback and feedforward processing in monkey visual cortex. Proceedings of the National Academy of Sciences U.S.A. , (40), 14332–14341. Retrieved from
VanRullen, R. (2016, October). Perceptual cycles.
Trends in Cognitive Sciences , (10), 723–735. doi: 10.1016/j.tics.2016.07.006
VanRullen, R., & Koch, C. (2003, May). Is perception discrete or continuous? Trends in Cognitive Sciences , (5), 207–213. Retrieved from
VanRullen, R., & Thorpe, S. J. (2002, November). Surfing a spike wave down the ventral stream. Vision research , , 2593–2615. Retrieved from
Varela, F. J., Toro, A., John, E. R., & Schwartz, E. L. (1981). Perceptual framing and cortical alpha rhythm. Neuropsychologia , (5), 675–686. Retrieved from
Vinken, K., & Vogels, R. (2017, November). Adaptation can explain evidence for encoding of probabilistic information in macaque inferior temporal cortex. Current Biology , (22), R1210-R1212. doi: 10.1016/j.cub.2017.09.018
von Stein, A., Chiang, C., & König, P. (2000, December). Top-down processing mediated by interareal synchronization. Proceedings of the National Academy of Sciences of the United States of America , (26), 14748–14753. doi: 10.1073/pnas.97.26.14748
von Helmholtz, H. (2013). Treatise on Physiological Optics, Vol III . Courier Corporation.
Waldert, S., Lemon, R. N., & Kraskov, A. (2013). Influence of spiking activity on cortical local field potentials. The Journal of Physiology , (21), 5291–5303. doi: 10.1113/jphysiol.2013.258228
Walsh, K. S., McGovern, D. P., Clark, A., & O’Connell, R. G. (2020, March). Evaluating the neurophysiological evidence for predictive processing as a model of perception. Annals of the New York Academy of Sciences , (1), 242–268. doi: 10.1111/nyas.14321
The living brain . Oxford, England: W. W. Norton.
Watanabe, T., & Sasaki, Y. (2015, January). Perceptual learning: Toward a comprehensive theory.
Annual review of psychology , , 197–221. doi: 10.1146/annurev-psych-010814-015214
Whittington, J. C. R., & Bogacz, R. (2019, March). Theories of error back-propagation in the brain. Trends in Cognitive Sciences , (3), 235–250. doi: 10.1016/j.tics.2018.12.005
Williams, R. J., & Zipser, D. (1992, January). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
Wilson, J. R., Bose, N., Sherman, S. M., & Guillery, R. W. (1984, June). Fine structural morphology of identified X- and Y-cells in the cat’s lateral geniculate nucleus. Proceedings of the Royal Society of London. Series B. Biological Sciences , (1225), 411–436. doi: 10.1098/rspb.1984.0042
Wimmer, R. D., Schmitt, L. I., Davidson, T. J., Nakajima, M., Deisseroth, K., & Halassa, M. M. (2015, October). Thalamic control of sensory selection in divided attention. Nature , (7575), 705–709. doi: 10.1038/nature15398
Wiskott, L., & Sejnowski, T. J. (2002, April). Slow feature analysis: Unsupervised learning of invariances. Neural Computation , , 715–770. Retrieved from
Worden, M. S., Foxe, J. J., Wang, N., & Simpson, G. V. (2000, March). Anticipatory biasing of visuospatial attention indexed by retinotopically specific alpha-band electroencephalography increases over occipital cortex. The Journal of neuroscience , . Retrieved from
Wurtz, R. H. (2008, September). Neuronal mechanisms of visual stability. Vision Research , (20), 2070–2089. doi: 10.1016/j.visres.2008.03.021
Xing, D., Yeh, C.-I., Burns, S., & Shapley, R. M. (2012, August). Laminar analysis of visually evoked activity in the primary visual cortex. Proceedings of the National Academy of Sciences , (34), 13871–13876. doi: 10.1073/pnas.1201478109
Yu, C., & Smith, L. B. (2012, November). Embodied attention and word learning by toddlers. Cognition , (2), 244–262. doi: 10.1016/j.cognition.2012.06.016
Zhou, H., Schafer, R. J., & Desimone, R. (2016). Pulvinar-cortex interactions in vision and attention. Neuron , 89