Deep Predictive Learning in Neocortex and Pulvinar

Abstract

How do humans learn from raw sensory experience? Throughout life, but most obviously in infancy, we learn without explicit instruction. We propose a detailed biological mechanism for the widely-embraced idea that learning is based on the differences between predictions and actual outcomes (i.e., predictive error-driven learning). Specifically, numerous weak projections into the pulvinar nucleus of the thalamus generate top-down predictions, and sparse, focal driver inputs from lower areas supply the actual outcome, originating in layer 5 intrinsic bursting (5IB) neurons. Thus, the outcome is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha), resulting in a temporal difference error signal, which drives local synaptic changes throughout the neocortex, resulting in a biologically-plausible form of error backpropagation learning. We implemented these mechanisms in a large-scale model of the visual system, and found that the simulated inferotemporal (IT) pathway learns to systematically categorize 3D objects according to invariant shape properties, based solely on predictive learning from raw visual inputs. These categories match human judgments on the same stimuli, and are consistent with neural representations in IT cortex in primates.



Randall C. O’Reilly, Jacob L. Russin, Maryam Zolfaghar, and John Rohrlich
Department of Psychology, Computer Science, and Center for Neuroscience
University of California Davis
1544 Newton Ct
Davis, CA 95618
[email protected]

June 29, 2020

We thank Dean Wyatte, Tom Hazy, Seth Herd, Kai Krueger, Tim Curran, David Sheinberg, Lew Harvey, Jessica Mollick, Will Chapman, Helene Devillez, and the rest of the CCN Lab for many helpful comments and suggestions. Supported by: ONR grants ONR N00014-19-1-2684 / N00014-18-1-2116, N00014-14-1-0670 / N00014-16-1-2128, N00014-18-C-2067, N00014-13-1-0067, D00014-12-C-0638. This work utilized the Janus supercomputer, which is supported by the National Science Foundation (award number CNS-0821794) and the University of Colorado Boulder. The Janus supercomputer is a joint effort of the University of Colorado Boulder, the University of Colorado Denver and the National Center for Atmospheric Research. All data and materials will be available at https://github.com/ccnlab/deep-obj-cat upon publication.

Abstract

How does the human brain learn new concepts from raw sensory experience, without explicit instruction? We still do not have a widely-accepted answer to this central question. Here, we propose a detailed biological mechanism for the widely-embraced idea that learning is based on the differences between predictions and actual outcomes (i.e., predictive error-driven learning). Specifically, numerous weak projections into the pulvinar nucleus of the thalamus generate top-down predictions, and sparse, focal driver inputs from lower areas supply the actual outcome, originating in layer 5 intrinsic bursting (5IB) neurons. Thus, the outcome is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha), resulting in a temporal difference error signal, which drives local synaptic changes throughout the neocortex, resulting in a biologically-plausible form of error backpropagation learning. We implemented these mechanisms in a large-scale model of the visual system, and found that the simulated inferotemporal (IT) pathway learns to systematically categorize 3D objects according to invariant shape properties, based solely on predictive learning from raw visual inputs. These categories match human judgments on the same stimuli, and are consistent with neural representations in IT cortex in primates.

Deep Predictive Learning

The fundamental epistemological conundrum of how knowledge emerges from raw experience has challenged philosophers and scientists for centuries. There have been significant advances in cognitive and computational models of learning (Ashby & Maddox, 2011; LeCun, Bengio, & Hinton, 2015; Watanabe & Sasaki, 2015) and in our understanding of the detailed biochemical basis of synaptic plasticity (Cooper & Bear, 2012; Lüscher & Malenka, 2012; Shouval, Bear, & Cooper, 2002; Urakubo, Honda, Froemke, & Kuroda, 2008). However, there is still no widely-accepted answer to this puzzle that is clearly supported by known biological mechanisms and also produces effective learning at the computational and cognitive levels.

At these functional levels, the idea that we learn via an active predictive process goes back to Helmholtz’s recognition by synthesis proposal (von Helmholtz, 2013), and has been embraced across a wide range of different frameworks (Clark, 2013; Dayan, Hinton, Neal, & Zemel, 1995; de Lange, Heilbron, & Kok, 2018; J. Elman et al., 1996; J. L. Elman, 1990; Friston, 2005; George & Hawkins, 2009; Hawkins & Blakeslee, 2004; Kawato, Hayakawa, & Inui, 1993; Mumford, 1992; Rao & Ballard, 1999; Summerfield & de Lange, 2014). Here, we propose a detailed biological mechanism for a specific form of predictive error-driven learning based on distinctive patterns of connectivity between the neocortex and the pulvinar nucleus of the thalamus (S. M. Sherman & Guillery, 2006; Usrey & Sherman, 2018).

Specifically, we hypothesize that learning is based on the difference between top-down predictions, generated by numerous weak projections into the thalamic relay cells (TRCs) in the pulvinar, and the actual outcomes supplied by sparse, focal, strong driver inputs from lower areas. Because these driver inputs originate in layer 5 intrinsic bursting (5IB) neurons, the outcome is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha). Thus, the prediction error is a temporal difference in activation states over the pulvinar, from an earlier prediction to a subsequent burst of outcome. This temporal difference can drive local synaptic changes throughout the neocortex, supporting a biologically-plausible form of error backpropagation to improve the predictions over time (Ackley, Hinton, & Sejnowski, 1985; Bengio, Mesnard, Fischer, Zhang, & Wu, 2017; Hinton & McClelland, 1988; Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020; O’Reilly, 1996; Whittington & Bogacz, 2019). The temporal-difference form of error-driven learning contrasts with prevalent alternative hypotheses that require a separate population of neurons to compute a prediction error “explicitly” and transmit it directly through neural firing (Friston, 2005, 2010; Kawato et al., 1993; Lotter, Kreiman, & Cox, 2016; Ouden, Kok, & Lange, 2012; Rao & Ballard, 1999).

In the following, our primary objective is to describe the hypothesized biologically-based mechanism for predictive error-driven learning, contrast it with other existing proposals regarding the functions of this thalamocortical circuitry and other ways that the brain might support predictive learning, and evaluate it relative to a wide range of existing anatomical and electrophysiological data. We provide a number of specific empirical predictions that follow from this functional view of the thalamocortical circuit, which could potentially be tested by current neuroscientific methods. Thus, this work proposes a clear functional interpretation of this distinctive thalamocortical circuitry that contrasts with existing ideas in testable ways.

A second major objective is to implement this predictive error-driven learning mechanism in a large-scale computational model that faithfully captures its essential biological features, to test whether the proposed learning mechanism can drive the formation of cognitively-useful representations. In particular, we ask a critical question for any purely predictive-learning model: can it develop high-level, abstract representations while learning from nothing but predicting low-level visual inputs? For example, most visual object recognition models that provide a reasonable fit to neurophysiological data rely on large human-labeled image datasets to explicitly train abstract category information via error-backpropagation (Cadieu et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Rajalingham et al., 2018). Through large-scale simulations based on the known structure of the visual system, we found that our biologically based predictive learning mechanism developed high-level abstract representations that significantly diverge from the similarity structure present in the lower layers of the network, and systematically categorize 3D objects according to invariant shape properties. Furthermore, we found in a similarity judgment experiment that these categories match [...] not to produce a better machine-learning (ML) algorithm per se, but rather to test the computational properties of our biologically-based, scientific theory for how the mammalian brain might learn. Thus, we explicitly dissuade readers from the inevitable desire to evaluate the importance of our model based on differences in narrow, performance-based ML metrics: it should instead be evaluated on its ability to explain a wide range of data across multiple levels of analysis, just as every other scientific theory is evaluated.

The remainder of the paper is organized as follows. First, we provide a concise overview of the biologically based predictive error-driven learning framework. Next, we discuss the relevant biological data in detail, along with testable predictions that can differentiate this account from other existing ideas. Then, we present the large-scale model of the visual system, which learns by predicting over brief visual movies of 3D objects rotating and translating in space. We evaluate this model and compare it to two other prediction-error learning models that use pure error-backpropagation, based on current deep-convolutional neural network (DCNN) principles. Finally, we conclude with a discussion of related models and outstanding issues.

Predictive Error-driven Learning in the Neocortex and Pulvinar

Figure 1a shows the thalamocortical circuits characterized by S. M. Sherman and Guillery (2006) (see also S. M. Sherman & Guillery, 2013; Usrey & Sherman, 2018), which have two distinct projections converging on the principal thalamic relay cells (TRCs) of the pulvinar, the primary thalamic nucleus that is interconnected with higher-level posterior cortical visual areas (Arcaro, Pinsk, & Kastner, 2015; Halassa & Kastner, 2017; Shipp, 2003). One projection consists of numerous, weaker connections originating in deep layer VI of the neocortex (the 6CT corticothalamic projecting cells), which we hypothesize generate a top-down prediction on the pulvinar. The other is a sparse, focal (Rockland, 1996, 1998) and strong driver pathway that originates from lower-level layer 5 intrinsic bursting cells (5IB), which we hypothesize provide the outcome. These 5IB neurons fire discrete bursts with intrinsic dynamics having a period of roughly 100 msec between bursts (Connors, Gutnick, & Prince, 1982; Franceschetti et al., 1995; Larkum, Zhu, & Sakmann, 1999; Saalmann, Pinsk, Wang, Li, & Kastner, 2012; Silva, Amitai, & Connors, 1991), which corresponds to the widely-studied alpha frequency of 10 Hz that originates in cortical deep layers and has important effects on a wide range of perceptual and attentional tasks (Buffalo, Fries, Landman, Buschman, & Desimone, 2011; Clayton, Yeung, & Kadosh, 2018; Jensen, Bonnefond, & VanRullen, 2012; K. Mathewson, Gratton, Fabiani, Beck, & Ro, 2009; VanRullen & Koch, 2003).

The existing literature generally characterizes the 6CT projection as modulatory (S. M. Sherman & Guillery, 2013; Usrey & Sherman, 2018), but a number of electrophysiological recordings from awake, behaving animals clearly show sustained, continuous patterns of neural firing in pulvinar TRC neurons, which is not consistent with the idea that they are only being driven by their 5IB inputs (Bender, 1982; Bender & Youakim, 2001; Komura, Nikkuni, Hirashima, Uetake, & Miyamoto, 2013; Petersen, Robinson, & Keys, 1985; Robinson, 1993; Saalmann et al., 2012; Zhou, Schafer, & Desimone, 2016). Indeed, these recordings show that pulvinar neural firing generally resembles that of the visual areas with which they interconnect. This is important because our predictive learning framework requires that these 6CT top-down projections be capable of driving TRC activity directly. Specifically, in contrast to the standard view, the core idea behind our theory is that the top-down 6CT projections drive a prediction across the extent of the pulvinar, which precedes the subsequent outcome state resulting from the strong 5IB driver inputs, as illustrated in Figure 1b (Kachergis, Wyatte, O’Reilly, de Kleijn, & Hommel, 2014; O’Reilly, Wyatte, & Rohrlich, 2014, 2017).

[Figure 1, panels a and b. Labels in panel b: V1 actual (plus); Pulvinar prediction (minus); V2 Super (current t); V2 Deep (t-1 context); context update; bidirectional connections to higher layers (V4...); temporal difference error; time axis from t to t+100 msec.]

Figure 1: a) Summary figure from Sherman & Guillery (2006) showing the strong feedforward driver projection emanating from layer 5IB cells in lower layers (e.g., V1), and the much more numerous feedback “modulatory” projection from layer 6CT cells. We interpret these same connections as providing a prediction (6CT) vs. outcome (5IB) activity pattern over the pulvinar. b) Temporal evolution of information flow under our prediction error hypothesis, operating on visual sequences, over two idealized alpha cycles of 100 msec each. Superficial layers (Super, lamina 2/3) always encode the current state, integrating bottom-up and top-down inputs. In each alpha cycle, the V2 Deep layer (lamina 5, 6) uses the prior alpha cycle of context to generate a prediction (minus phase) on the pulvinar thalamic relay cells (TRC). The bottom-up outcome is driven by lower-level (V1) 5IB strong driver inputs (plus phase); error-driven learning occurs as a function of the temporal difference between these phases, sent via broad pulvinar projections, in both superficial and deep layers, and in the 6CT projections into the pulvinar (only 5IB drivers are non-learning). 5IB bursting in V2 drives updating of temporal context in V2 Deep layers (this phasic updating prevents current outcome activation in superficial layers from informing the prediction), while also driving the plus phase in higher areas of pulvinar that learn to predict V2 activation states, and so on.

Assuming a 100 msec alpha cycle for the purposes of illustration (the actual timing is likely to be more dynamic as discussed next), the activity state in pulvinar TRC neurons, representing a prediction, should develop during the first ~75 msec, while the final ~25 msec largely reflects the strong 5IB bottom-up ground-truth driver inputs. Thus, the prediction error signal is reflected in the temporal difference of these activation states over time. In other words, our hypothesis is that the pulvinar is directly representing either the top-down prediction or the bottom-up outcome at any given time, and the temporal difference between [...]

[...] adaptation, which is generally thought to increase neural activity and learning associated with novel inputs relative to recently familiar ones (Abbott, Varela, Sen, & Nelson, 1997; Brette & Gerstner, 2005; Grill-Spector, Henson, & Martin, 2006; Hennig, 2013; Müller, Metha, Krauskopf, & Lennie, 1999). In the case where outcomes are consistent with prior predictions (i.e., the predictions are accurate), the same population of neurons across pulvinar and cortex should be active over time, whereas unpredicted outcomes will generally activate new subsets of neurons in superficial cortical layers representing the current state. Thus, due to adaptation, there should be a phasic increase in activity in these superficial neurons at the onset of unpredicted stimuli relative to predicted ones. Furthermore, the 5IB neurons downstream of these superficial neurons may be particularly responsive to these phasic activity increases, causing their bursting to coincide preferentially with unexpected outcomes, thereby driving the phase resetting of the alpha cycle to such events. Thus, during a sequence of predicted states, the pulvinar may experience relatively weaker or even absent 5IB driving inputs, until an unpredicted stimulus arises. At this point, error-driven learning would be more strongly engaged as a function of the phasic release from adaptation and 5IB burst activation.
We discuss these dynamics more later in the context of the expectation suppression phenomena (Bastos et al., 2012; Meyer & Olson, 2011; Summerfield, Trittschuh, Monti, Mesulam, & Egner, 2008; Todorovic, van Ede, Maris, & de Lange, 2011).

We also hypothesize that 5IB bursting preferentially drives learning, due to the strong driving nature of the outputs from these neurons. In computational terms, this anchors the target or plus phase to be at this point of 5IB bursting. Furthermore, this means that the prediction is essentially defined as the state prior to 5IB bursting, and the learning rule automatically causes that prior state to better anticipate the subsequent state. This means that even if no prediction was initially generated, learning over multiple iterations will work to create one, to the extent that a reliable prediction can be generated based on internal states and environmental inputs. It also means that although the alpha rhythm defines a baseline minimum prediction window, predictive learning could still happen at longer delays (again assuming relevant predictive state information is available to bridge the delay). In short, learning happens whenever something unexpected occurs, at any point, and drives the development of predictions immediately prior, to the extent such predictions are possible to generate. In the typical lab experiment where phasic stimuli are presented without any predictable temporal sequence (which is likely uncharacteristic of the natural world), there may often be no significant prediction prior to stimulus onset, and we would expect such stimuli to reliably drive 5IB bursting, which is consistent with available electrophysiological data (Bender, 1982; Bender & Youakim, 2001; Komura et al., 2013; Petersen et al., 1985; Robinson, 1993; Saalmann et al., 2012; Zhou et al., 2016).

As may be evident by this point, we are mainly focused on prediction in the sense of the humorous quote: “prediction is very difficult, especially about the future” (attributable to Danish author Robert Storm Petersen), whereas the term is used, potentially confusingly, in a much broader sense in most Bayesian-inspired predictive coding frameworks (de Lange et al., 2018; Friston, 2005; Rao & Ballard, 1999). These frameworks use “prediction” to encompass everything from genetic biases to the results of learning in the feedforward synaptic pathways to top-down filling-in or biasing of the current stimulus properties, and fairly rarely for the “about the future” meaning. We think these different phenomena are each associated with different neural mechanisms at different time scales (O’Reilly, Munakata, Frank, Hazy, & Contributors, 2012; O’Reilly, Wyatte, Herd, Mingus, & Jilk, 2013), and thus prefer to treat them separately, while also recognizing that they clearly interact, e.g., with predictive learning hypothesized to be the primary driver of learning of all pathways in the cortex.

Thus, our use of the term prediction here refers specifically to anticipatory neural firing that predicts subsequent stimuli. We use the term postdiction to refer to the operation of this predictive mechanism after a stimulus has been initially processed (to consolidate and more deeply encode, as in an auto-encoder model), and distinguish both from top-down excitatory biasing, which directly influences the online superficial layer neural representations of the current stimulus (Desimone & Duncan, 1995; Miller & Cohen, 2001; O’Reilly et al., 2013; Reynolds, Chelazzi, & Desimone, 1999). Finally, many discussions of prediction error in the literature include late, frontally-associated processes such as those associated with the P300 ERP component (Holroyd & Coles, 2002). We specifically exclude these from the scope of the mechanisms described here, which are anticipatory, fast, and low-level, as is appropriate for the posterior cortical sensory processing areas that interconnect with the pulvinar.
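The alpha-cycle scheme described in this section can be illustrated with a small numerical sketch. This is not the large-scale model itself: it assumes a single sigmoidal layer standing in for the pulvinar, a simple delta-rule update, and one-hot input patterns, and the names `run_alpha_cycles`, `weights`, and `context` are hypothetical. In each idealized cycle, a prediction is generated from the prior context only (minus phase), the outcome then arrives (plus phase), the weights change in proportion to the temporal difference between the two, and only then is the context updated, mirroring the burst-gated context update in Figure 1b.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_alpha_cycles(sequence, n_epochs=200, lrate=0.5):
    """Learn to predict the next pattern of a repeating sequence.

    Illustrative assumptions: one weight matrix stands in for the learned
    6CT -> pulvinar projections, and the update is a plain delta rule.
    """
    n_units = sequence.shape[1]
    weights = np.zeros((n_units, n_units))
    # Deep-layer context holds the state from the previous alpha cycle.
    context = sequence[-1].copy()
    mean_abs_errors = []
    for _ in range(n_epochs):
        total = 0.0
        for outcome in sequence:
            # Minus phase: prediction driven by prior context only.
            prediction = sigmoid(weights @ context)
            # Plus phase: the outcome replaces the prediction; the error is
            # the temporal difference between the two activation states.
            td_error = outcome - prediction
            weights += lrate * np.outer(td_error, context)
            total += np.abs(td_error).mean()
            # Burst-gated context update: the current state enters the
            # context only after the outcome has been presented.
            context = outcome.copy()
        mean_abs_errors.append(total / len(sequence))
    return weights, mean_abs_errors

# Three one-hot patterns presented in a fixed, repeating order.
weights, errors = run_alpha_cycles(np.eye(3))
print(f"first epoch error: {errors[0]:.3f}, last epoch error: {errors[-1]:.3f}")
```

Because the context is updated only after each outcome, the prediction can never rely on the state it is trying to predict; the per-epoch error nonetheless falls as the prior state comes to anticipate the subsequent one.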

Computational Properties of Predictive Learning in the Thalamocortical Circuits

We next elaborate the connections between the computational properties required for predictive learning, and the properties of the thalamocortical circuits in the pulvinar, which appear to be notably well suited for the hypothesized predictive learning role, in the following ways:

• Assuming the process of generating a prediction involves the integration of multiple converging inputs from a range of higher-level cortical areas, each encoding different dimensions of relevance (e.g., location, motion, color, texture, shape, etc.), sufficient time must be available to perform this integration, along with some kind of dedicated neural substrate upon which it can be performed. This neural substrate must be distinct from those encoding the continuously evolving representations of the incoming sensory state, assuming that it is not possible to suspend that process during the time it takes to develop the prediction (and thereby re-use the same substrate). Furthermore, it is likely that prediction generation requires a broader convergence of top-down inputs than is required for sensory state encoding, and any prediction error signal should also be widely broadcast back out to these same areas, to provide the training signal that improves their predictions. All of these considerations are nicely satisfied by having a separate, compact, broadly integrative, bidirectionally connected nucleus in the form of the pulvinar and its 6CT inputs and reciprocal efferents back out to the neocortex (Shipp, 2003). Furthermore, the TRC neurons are distinctive in having no significant lateral interconnectivity (S. M. Sherman & Guillery, 2006), enabling them to faithfully represent their inputs. These properties led Mumford (1991) to characterize the pulvinar as a blackboard, and we further suggest the metaphor of a projection screen upon which the predictions are projected.

• The obvious locus for ongoing sensory integration and the online “current state” representation is in the superficial lamina of each cortical area. The pyramidal neurons here are densely and bidirectionally interconnected with other cortical areas, and update rapidly to new stimulus inputs, with continuous, relatively rapid firing (up to about 100 Hz) for preferred stimuli. These neurons integrate higher-level top-down information with bottom-up sensory information, to fill in missing information, resolve ambiguities, focus attention, and generally enhance the consistency and quality of the online representations (Desimone & Duncan, 1995; Miller & Cohen, 2001; O’Reilly, Hazy, & Herd, 2016; O’Reilly et al., 2012, 2013; Reynolds et al., 1999). As noted above, we distinguish this form of top-down processing, which is often most evident during the period after stimulus onset (Lee & Mumford, 2003), from the specifically predictive sort. However, when the deep layers are predictively anticipating the onset of upcoming stimuli, these top-down deep layer projections will result in pre-activation [...]

• Each cortical area requires a distinct population of neurons to generate its contribution to the overall prediction, for reasons that will become clear in a moment. With the superficial neurons occupied by the current state, this naturally leaves the deep lamina neurons as the logical substrate for this job, particularly the 6CT population that projects directly and exclusively to the pulvinar. Thus, this framework also provides a clear functional division of labor for the superficial and deep neocortical lamina (with layer 4 stellates providing a localized input processing function).

• A true prediction (about the future) must be prevented from cheating and relying on direct information about that which is being predicted. Thus there must be a mechanism preventing the new sensory outcome information continuously encoded in the superficial layers from “contaminating” the prediction-generation components in the deep layers and pulvinar. The phasic, bursting nature of the 5IB driver inputs provides this essential feature, creating a window where no outcome signals are impinging on the pulvinar, when the prediction can be represented. The prediction and current state synchronize at the moment of the 5IB driver bursting. Furthermore, the activity in the 6CT deep layers that generates the top-down predictions over the pulvinar is itself driven by 5IB neuron bursting within the local columnar circuits of these higher level areas, such that these prediction-generating neurons are also kept isolated from current superficial-layer activation (Figure 1b). Computationally, this functions much like the simple recurrent network (SRN) context layer updating (J. L. Elman, 1990; Jordan, 1989) which reflects the prior trial’s state, as shown in the supplemental information. Interestingly, by these principles, the lack of bursting in the driver inputs to first-order sensory thalamus areas (S. M. Sherman & Guillery, 2006) means that these areas should not be directly capable of error-driven predictive learning, but they do receive “collateral” error signals from the pulvinar (Shipp, 2003), which could provide some useful indirect error-driven learning signals.

• The outcome signal should be as veridical as possible (i.e., directly reflecting the bottom-up outcome), and should arise from lower areas in the hierarchy relative to the corresponding 6CT inputs: the bottom-up, sparse, focal, strongly driving nature of the 5IB projections can directly convey such veridical outcome signals, and ensure that they dominate the activation of their TRC targets. Based on indirect available data, it is likely that each pulvinar TRC neuron receives only roughly 1-6 driver inputs (S. M. Sherman & Guillery, 2006, 2011). Furthermore, these inputs are likely not plastic (Usrey & Sherman, 2018): they drive plasticity, and are thus not subject to it.

• The integration required to generate the prediction should take more time than the outcome phase, which is consistent with a relatively brief period of 5IB bursting (roughly 25 msec or less; Connors et al., 1982), leaving approximately 75 msec of a nominal 100 msec alpha cycle for this integration. The overall duration of the alpha cycle itself may represent a reasonable compromise between this integration time and the need to keep up with predictions tracking changes in the world.

• For cortical neurons receiving projections from the pulvinar, there must be some way in which the difference between prediction and outcome (i.e., the error itself) can drive learning. Here we hypothesize that this difference remains as a temporal difference error signal, i.e., the difference over time in pulvinar activation states, arising naturally as a prediction state followed by the outcome state. This contrasts with prevalent alternative hypotheses that require a separate population of neurons to compute a prediction error “explicitly” and transmit it directly through neural firing, as we discuss later. We argue that a temporal-difference prediction error signal is more natural, efficient, and consistent with bidirectional excitatory connectivity between cortical areas (Desimone & Duncan, 1995; Hopfield, 1984; Miller & Cohen, 2001; O’Reilly et al., 2013; Reynolds et al., 1999; Rumelhart & McClelland, 1982). Note that this form of temporal-difference learning signal is distinct from the widely-used TD model in reinforcement learning (Sutton & Barto, 1998), which is scalar, and applies to reward expectations, not sensory predictions (although see Gardner, Schoenbaum, & Gershman, 2018 and Dayan, 1993 for potential connections between these two forms of prediction error).

• There is a long history of computational models of error-driven learning based on temporal-difference signals (Ackley et al., 1985; Bengio et al., 2017; Lillicrap et al., 2020; O’Reilly, 1996; Whittington & Bogacz, 2019), and we have recently provided a direct biological mechanism for this form of learning (O’Reilly et al., 2012), based on a biologically-detailed model of spike timing dependent plasticity (STDP) (Urakubo et al., 2008). We showed that when activated by realistic Poisson spike trains, this STDP model produces a non-monotonic learning curve similar to that of the BCM model (Bienenstock, Cooper, & Munro, 1982), which results from competing calcium-driven postsynaptic plasticity pathways (Cooper & Bear, 2012; Shouval et al., 2002). As in the BCM framework, we hypothesized that the threshold crossover point in this nonmonotonic curve moves dynamically; if this happens on the alpha timescale (Lim et al., 2015), then it can reflect the prediction phase of activity, producing a net error-driven learning rule based on a subsequent calcium signal reflecting the outcome state. This form of error-driven learning mathematically approximates backpropagation gradient descent to minimize overall prediction errors (O’Reilly, 1996).

Thus, remarkably, the pulvinar and associated thalamocortical circuitry appears to provide precisely the necessary ingredients to support predictive error-driven learning. Interestingly, although S. M. Sherman and Guillery (2006) did not propose a predictive learning mechanism as just described, they did speculate about a potential role for this circuit in motor forward-model learning and the predictive remapping phenomenon (S. M. Sherman & Guillery, 2011; Usrey & Sherman, 2018). In addition, Pennartz, Dora, Muckli, and Lorteije (2019) also suggested that the pulvinar may be involved in predictive learning, but within the explicit error-coding framework and not involving the detailed aspects of the above-described circuitry.

As we discuss later, this proposed predictive role for the pulvinar is not incompatible with the more widely-discussed role it may play in attention (Bender & Youakim, 2001; Fiebelkorn & Kastner, 2019; LaBerge & Buchsbaum, 1990; Saalmann & Kastner, 2011; Snow, Allen, Rafal, & Humphreys, 2009; Zhou et al., 2016). Indeed, we think these two functions are synergistic (i.e., you predict what you attend, and vice-versa; Richter & de Lange, 2019), and have initial computational results consistent with this idea.

In the following sections, we discuss some of the most important neural data of relevance to our hypotheses (beyond that summarized above), including contrasts with a widely-discussed alternative framework for predictive coding, and some of the extensive data on alpha frequency effects, followed by a discussion of predictions that would clearly test the validity of this framework.
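The floating-threshold idea in the final bullet above can be made concrete with a small sketch. The piecewise-linear shape, the function name `floating_threshold_dwt`, and the reversal parameter `d_rev` are illustrative assumptions rather than the equations of the model. The sketch treats the sender-receiver coproduct during the prediction (minus) phase as the threshold, and the coproduct during the outcome (plus) phase as the driving signal. Above the reversal point the weight change is simply the plus-phase coproduct minus the minus-phase coproduct, which is the contrastive, temporal-difference form of error-driven learning; below it, the curve returns to zero so that inactive synapses do not change.

```python
import numpy as np

def floating_threshold_dwt(plus_coproduct, minus_coproduct, d_rev=0.1):
    """Non-monotonic, BCM-like weight change with a floating threshold.

    plus_coproduct:  sender * receiver activity in the outcome phase.
    minus_coproduct: sender * receiver activity in the prediction phase,
                     used here as the dynamic threshold (an assumption).
    d_rev:           hypothetical fraction of the threshold at which the
                     curve reverses direction.
    """
    xy = np.asarray(plus_coproduct, dtype=float)
    theta = np.asarray(minus_coproduct, dtype=float)
    linear = xy - theta                      # error-driven region
    low = -xy * (1.0 - d_rev) / d_rev        # return-to-zero region
    return np.where(xy > theta * d_rev, linear, low)

# Outcome stronger than predicted: potentiation.
print(floating_threshold_dwt(0.8, 0.5))
# Outcome matches the prediction: no change.
print(floating_threshold_dwt(0.5, 0.5))
# Outcome weaker than predicted: depression.
print(floating_threshold_dwt(0.2, 0.5))
```

The two linear segments meet at `xy = theta * d_rev`, so the curve is continuous, dips below zero for outcome activity under the threshold, and crosses zero exactly where the outcome matches the prediction.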

Additional Neuroscience Data

We begin with data relevant to the basic neural-level properties of the framework. First, direct electrophysiological recording of deep layer neurons shows periodic alpha-scale bursting for continuous tones (Luczak, Bartho, & Harris, 2013), and there are a variety of potential mechanisms behind the generation and synchronization of the 5IB bursts driving this alpha cycle (Connors et al., 1982; Franceschetti et al., 1995; Silva et al., 1991). Furthermore, the pulvinar has been shown to drive alpha-band synchronization of cortical activity across areas (Saalmann et al., 2012). We review the larger alpha frequency literature in more detail below.

The 6CT neurons exhibit regular spiking behavior, in contrast to the 5IB bursting (Thomson, 2010; Thomson & Lamy, 2007). Also, they do not have axonal branches that project to other cortical areas: the subpopulation that projects to the pulvinar only projects there and not to other cortical areas (Petrof, Viaene, & Sherman, 2012), whereas there are other layer 6 neurons that do project to other cortical areas. This distinct connectivity is consistent with a specific role of this neuron type in generating predictions in [...]

[...] glomeruli structures at their synapses onto pulvinar neurons, which contain a complete feedforward inhibition circuit involving a local inhibitory interneuron, in addition to the direct strong excitatory driver input (Wilson, Bose, Sherman, & Guillery, 1984). Computationally, this can provide a balanced level of excitatory and inhibitory drive so as to not overly excite the receiving neuron, while still dominating its firing behavior.

Although there are well-documented and widely-discussed burst vs. tonic firing modes in pulvinar neurons (S. M. Sherman & Guillery, 2006), there is not much evidence of these playing a clear role in the awake, behaving state, and as noted above the growing electrophysiological evidence shows a remarkable correspondence between cortical and pulvinar response properties across multiple different pulvinar areas in this awake state. Nevertheless, there may be important dynamics arising from these firing modes that are more subtle or emerge in particular types of state transitions that may have yet to be identified.

Contrast with Explicit Error (EE) Frameworks

To further clarify the nature of the present theory, and introduce a body of relevant data, it is important to contrast it with the widely-discussed explicit error (EE) framework for predictive coding (Bastos et al., 2012; Friston, 2005, 2010; Kawato et al., 1993; Lotter et al., 2016; Ouden et al., 2012; Rao & Ballard, 1999) (Figure 2). Despite many attempts to identify such explicit error-coding neurons in the cortex, no substantial body of unambiguous evidence has been discovered (Kok & de Lange, 2015; Kok, Jehee, & de Lange, 2012; Lee & Mumford, 2003; Summerfield & Egner, 2009; Walsh, McGovern, Clark, & O’Connell, 2020). Furthermore, due to the positive-only firing rate nature of neural coding, two separate populations would be required to convey both signs of prediction error signals, or it would have to be encoded as a variation from tonic firing levels, which are generally low in the neocortex. By contrast, the use of temporal-difference error signals enables all connections between cortical layers to be excitatory and each layer can represent the positive encoding of either the prediction or outcome state, at different levels of abstraction. These properties are overwhelmingly supported by extensive electrophysiological data about the hierarchical organization of representations, e.g., in the visual object recognition pathway (Cadieu et al., 2014; Kobatake & Tanaka, 1994; VanRullen & Thorpe, 2002), and are consistent with the widely-supported biased competition model for excitatory top-down attentional effects (Desimone & Duncan, 1995; Miller & Cohen, 2001; O’Reilly et al., 2013; Reynolds et al., 1999).

By contrast, the EE approach requires net inhibitory top-down predictions, and it sends error signals forward, not positive representations of the actual state at a given level of abstraction.
Thus a literal interpretation (and at least one existing implementation; Lotter et al., 2016) has only error signals represented at all levels above the lowest level, which is inconsistent with the positive encoding of stimuli at various levels of abstraction across the visual hierarchy. For example, although Issa, Cadieu, and DiCarlo (2018) observed an error-signal-like increase in activation for atypical faces in some pIT neurons, these neurons overall had a positive stimulus encoding, with only a relatively small, later, error-like modulation. Furthermore, as discussed below, anticipatory predictions typically closely resemble the subsequent stimulus-driven activity, suggesting a positive, not inhibitory, effect (Cavanagh, Hunt, Afraz, & Rolfs, 2010; Duhamel, Colby, & Goldberg, 1992; Lee & Mumford, 2003; Walsh et al., 2020). There are various ways of reformulating the neural implementation of EE that can avoid some of these issues (Bastos et al., 2012; Spratling, 2008), but perhaps this flexibility renders the framework difficult to falsify (Kogo & Trengove, 2015). In any case, an extensive treatment of the issues with EE is beyond the scope of this paper and has already been aptly covered by Walsh et al. (2020); our goal here is to highlight some of the core differences as a way to clarify the framework by way of contrast, and in relation to available data.

[Figure 2 panels: a) Temporal Difference Error (all connections are excitatory; red = plus phase, blue = minus phase). b) Explicit Error (purple = inhibitory, green = excitatory, black = unspecified).]

Figure 2: Comparison between: a) The proposed thalamocortical temporal-difference predictive learning model (from Figure 1), versus b) The Bayesian-style explicit error (EE) coding model (Rao & Ballard, 1999; Friston, 2010; Bastos et al., 2012), in a situation where the prediction is clearly erroneous (ball predicted to emerge on right, actually emerges on left). The EE model holds that superficial (2/3) error-coding neurons receive the prediction via a net inhibitory top-down projection from higher-level deep layer neurons, and an excitatory bottom-up projection representing the outcome, such that their activation represents the difference. To encode both signs of the error (omissions, false alarms) with positive-only spike rates, two separate populations of EE neurons would be required, or a more complicated deviation from tonic firing level scheme. Unambiguous evidence of such EE coding neurons has not been found (Walsh et al., 2020). In contrast, error signals in our proposed framework remain as a temporal difference between the two states of prediction vs. outcome, which enables all connectivity between cortical areas to be excitatory and always represent a positive encoding of either the prediction or outcome. In contrast, under EE, after one error subtraction at the lowest level, only error signals are hypothesized to flow forward to higher layers, meaning that the representations at higher layers are about increasingly higher-order errors, not positive encodings of the environmental state at increasing levels of abstraction. This is inconsistent with extensive available data. For this illustration, V1 is assumed to be like a clamped input layer, not subject to predictive learning itself.

First, there are many examples of anticipatory predictive neural firing in the brain. Of perhaps greatest relevance, Barczak et al. (2018) recently showed that the auditory pulvinar in monkeys exhibits predictive firing using a carefully controlled auditory sequence that had no first-order acoustic differences from a background noise signal. The pulvinar predictive activation preceded that of A1, suggesting a strong predictive role for pulvinar. Unfortunately, the deep layers of higher auditory areas that should contribute to the formation of the pulvinar prediction were not recorded in this study, so their role in generating the prediction could not be determined. Nevertheless, there is extensive additional evidence for top-down anticipatory activation of predicted stimuli, with activity patterns closely resembling the subsequent stimulus-driven ones (Walsh et al., 2020). For example, the widely replicated predictive remapping effect is of this nature (Cavanagh et al., 2010; Duhamel et al., 1992; Wurtz, 2008); see below for a simulation and further discussion of these data. The fact that these anticipatory activations are of a positive nature, consistent with the stimulus-driven [...] expectation suppression (Bastos et al., 2012; Meyer & Olson, 2011; Summerfield et al., 2008; Todorovic et al., 2011). This phenomenon is widely cited as evidence in favor of the EE predictive coding framework, consistent with an inhibitory effect of the expectation.
Nevertheless, despite various conflicting results and many complications of interpretation, multiple comprehensive reviews conclude that it is difficult to distinguish expectation suppression from the neural adaptation effects that underlie the well-documented repetition suppression effect (Kok & de Lange, 2015; Kok et al., 2012; Lee & Mumford, 2003; Summerfield & Egner, 2009; Vinken & Vogels, 2017; Walsh et al., 2020). Furthermore, detailed single-neuron level recordings are the least likely to show these effects; instead, they are most evident in aggregate signals such as the BOLD response in fMRI, suggesting that they may more strongly reflect population-level differences in activity, rather than individual explicit error coding neurons.

As noted earlier, in our framework accurately predicted outcomes would result in a continued adaptation of the neural response carrying over from the prediction to the outcome state, whereas unexpected outcomes would be associated with two distinct patterns of activity over a given area: first the prediction and then the outcome. Thus, the unexpected outcome state would not be subject to the prior neural adaptation effects, and furthermore the time-integrated aggregate activity over these two patterns would be greater compared to the single activity state associated with an accurately predicted outcome. Thus, our model explains expectation suppression without invoking EE neurons, meaning that considerably more detailed and replicable experimental paradigms using single-neuron resolution techniques are needed to distinguish EE from our framework.
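The aggregate-activity argument above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the model's implementation: the function name `aggregate_activity`, the adaptation factor of 0.7, and the use of non-overlapping patterns labeled "A" and "B" are all hypothetical choices made for illustration.

```python
def aggregate_activity(pattern_sequence, adapt=0.7):
    """Sum a toy population response over successive phases.

    Assumption (illustrative only): a pattern responds at 1.0 when it first
    becomes active, and its response is scaled by `adapt` for every
    consecutive phase in which the same pattern stays active.
    """
    total, prev, gain = 0.0, None, 1.0
    for pattern in pattern_sequence:
        # Continued activity of the same pattern adapts; a new pattern is fresh.
        gain = gain * adapt if pattern == prev else 1.0
        total += gain
        prev = pattern
    return total


# Prediction phase followed by outcome phase.
predicted = aggregate_activity(["A", "A"])   # outcome matches the prediction
unexpected = aggregate_activity(["A", "B"])  # outcome differs from the prediction
print(predicted, unexpected)
```

Under these assumptions the unexpected sequence yields the larger time-integrated response, because the outcome pattern has not yet adapted. No explicit error-coding units are involved in the calculation.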

Alpha Frequency Effects

The alpha frequency bursting of 5IB neurons acting as drivers into the pulvinar naturally entrains the predictive learning process in our model to this fundamental rhythm, which has long been recognized as an important signature of posterior cortical function (Berger, 1929; Nunn & Osselton, 1974; VanRullen & Koch, 2003; Varela, Toro, John, & Schwartz, 1981; Walter, 1953). A number of different functional associations with alpha have been established, and this literature is large and growing rapidly. Thus, we refer the reader to recent reviews (Clayton et al., 2018; Foster & Awh, 2019; Jensen, Bonnefond, Marshall, & Tiesinga, 2015; VanRullen, 2016) while highlighting the data most relevant to our specific framework here, organized according to a set of key points.

• Alpha is specifically associated with deep neocortical layers and the pulvinar, and with feedback pathways in the cortex. This has been established using direct laminar-specific electrophysiological single-neuron and local field potential (LFP) recordings (Buffalo et al., 2011; Luczak et al., 2013; Maier, Adams, Aura, & Leopold, 2010; Maier, Aura, & Leopold, 2011; Spaak, Bonnefond, Maier, Leopold, & Jensen, 2012; Xing, Yeh, Burns, & Shapley, 2012), and feedforward vs. feedback manipulations (Bastos et al., 2015; Jensen et al., 2015; Michalareas et al., 2016; van Kerkoerle et al., 2014; von Stein, Chiang, & König, 2000). These data are consistent with the 5IB alpha bursting and the major role of cortical deep layers in driving top-down corticocortical projections (in addition to the 6CT pathway which is specific to the pulvinar). By contrast, these same papers show that superficial cortical layers are associated with gamma frequency (40 Hz) dynamics. Overall, these data suggest that noninvasive EEG methods could provide a direct window onto the predictive learning process. However, the next point raises some important interpretational difficulties.

• Increases in cortical activity levels, e.g., due to attention, produce a corresponding decrease in alpha power, while decreased activity increases alpha power (Foster & Awh, 2019; Fries, Womelsdorf, Oostenveld, & Desimone, 2008; Jensen & Mazaheri, 2010; Kelly, Lalor, Reilly, & Foxe, 2006; Klimesch, Sauseng, & Hanslmayr, 2007; Worden, Foxe, Wang, & Simpson, 2000). This pattern is not exactly what you might expect if alpha was a signature of predictive learning. Given that these same pulvinar and thalamocortical pathways are also widely regarded as important for attention (Bender & Youakim, 2001; Fiebelkorn & Kastner, 2019; LaBerge & Buchsbaum, 1990; Saalmann & Kastner, 2011; Snow et al., 2009; Zhou et al., 2016), this pattern presents a challenge for many theorists. However, it is possible to explain this pattern as arising directly from the desynchronizing effects of cortical activity on alpha power. Specifically, neural spiking is associated with broadband noise, due to the highly random, Poisson nature of spike firing, which can desynchronize the entrainment of lower-frequency oscillations including alpha (Privman, Malach, & Yeshurun, 2013; Ray & Maunsell, 2011; Solomon et al., 2017; Waldert, Lemon, & Kraskov, 2013). In other words, because cortical activity is inherently noisy, it tends to interfere with the coherent activity across populations of neurons needed to produce a strong alpha frequency power signal. This explanation is directly supported by studies manipulating and measuring cortical activity (Fries et al., 2008; Zhou et al., 2016), and is consistent with alpha power changes being a result of attentional modulation, but not their cause (Antonov, Chakravarthi, & Andersen, 2020). Thus, while attention and predictive learning can both affect overall activity levels in cortex, and thus drive changes in alpha power, alpha power itself is not a transparent measure of the underlying mechanisms supporting these functions, which may help to explain some contradictory patterns of results (Foster & Awh, 2019; Gundlach, Moratti, Forschack, & Müller, 2020; Keitel et al., 2019).

• Alpha phase effects provide a more direct measure of thalamocortical function than alpha power, and have been more consistently related to perception, attention, and prediction (Busch, Dubois, & VanRullen, 2009; Jaegle & Ro, 2013; K. E. Mathewson, Fabiani, Gratton, Beck, & Lleras, 2010; Neupane, Guitton, & Pack, 2017; Nunn & Osselton, 1974; Palva & Palva, 2011; Solís-Vivanco, Jensen, & Bonnefond, 2018; VanRullen & Koch, 2003; Varela et al., 1981). For example, weak, near-threshold stimuli are more reliably detected and processed when presented in the trough of the individual’s ongoing alpha cycle. Of greatest relevance to the present paper are studies showing effects of prediction on alpha phase (Mayer, Schwiedrzik, Wibral, Singer, & Melloni, 2016; Samaha, Bauer, Cimaroli, & Postle, 2015; M. T. Sherman, Kanai, Seth, & VanRullen, 2016). For example, Mayer et al. (2016) showed that prestimulus alpha phase directly correlated with the predictability of the upcoming stimulus, and the pattern of this prestimulus activation was indistinguishable from the subsequent stimulus activation pattern. This is consistent with our model, and less consistent with the EE framework, as discussed previously. Neupane et al. (2017) found strong alpha coherence effects in LFP recordings distributed across V4, associated with the predictive remapping of receptive fields (Duhamel et al., 1992).

• Discrete, salient, or oscillatory stimuli entrain the alpha cycle in the brain (K. E. Mathewson et al., 2012; Spaak, de Lange, & Jensen, 2014). Furthermore, the massive literature on event related potentials (ERPs) may represent a significant contribution from alpha-level entrainment (Gruber, Klimesch, Sauseng, & Doppelmayr, 2005; Klimesch, 2011; Makeig et al., 2002). These entrainment effects are consistent with the 5IB entrainment mechanisms in our framework, as described earlier, and entrainment is functionally important for aligning predictive learning with relevant salient or unexpected outcomes.

• The pulvinar contributes to synchronizing alpha phase relationships across different brain areas (Fiebelkorn, Pinsk, & Kastner, 2018; Saalmann et al., 2012). This is consistent with the broad, convergent pattern of projections into the pulvinar from many different cortical areas, and the corresponding broad projections back out to these same areas (Arcaro et al., 2015; Shipp, 2003). Functionally, this [...]

• The theta cycle, comprised of a pair of alpha cycles, organizes saccades, and attentional, motor, and mnemonic processes (Fiebelkorn & Kastner, 2019). The theta rhythm is dominant in the medial temporal lobe and hippocampus, and has been extensively studied there (Buzsáki, 2005; Kahana, Seelig, & Madsen, 2001). Furthermore, there is a sharp peak of saccade fixation durations at 200 msec, which suggests that two alpha cycles are typically required for complete processing of a given fixation. On the first cycle, the predictions from before the eye moved may be fairly vague depending on factors such as the size of the saccade and familiarity with the environment. But after the first alpha cycle of a fixation, a subsequent postdiction phase can provide an important additional learning opportunity, to consolidate and more deeply encode the current fixation (computationally equivalent to an auto-encoder). Also, a mix of smaller saccades (including microsaccades) and larger saccades enables a range of more and less predictable outcomes on the first alpha cycle after the saccade, and matches human behavior (Martinez-Conde, Macknik, & Hubel, 2004; Martinez-Conde, Otero-Millan, & Macknik, 2013).

Putting all of these points together, a particularly effective way of testing the predictions of our framework would be measuring alpha phase changes emerging in the prestimulus period as a function of predictive learning in predictable sequential stimulus streams.
In addition, it would also be important to examine theta and alpha-cycle dynamics in relation to predictive learning in the context of attention, motor control, and memory processes, to better understand the larger systems-level temporal organization of learning and processing in the brain (Fiebelkorn & Kastner, 2019).
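The desynchronization account of alpha power given in the second point above can be illustrated numerically. The following is a minimal sketch, not an analysis from the paper: it assumes that the effect of noisy spiking can be approximated as random phase jitter across units oscillating at 10 Hz, and the function name `population_alpha_power` and all parameter values are hypothetical.

```python
import numpy as np


def population_alpha_power(phase_jitter, n_units=200, fs=1000, dur=1.0,
                           f_alpha=10.0, seed=0):
    """Power at the alpha frequency in the mean signal of a toy population.

    Each unit is a unit-amplitude sinusoid at `f_alpha`; `phase_jitter` is the
    standard deviation (radians) of the per-unit phase offsets. Phase jitter is
    an assumed stand-in for the desynchronizing effect of noisy spiking.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * dur)) / fs
    phases = rng.normal(0.0, phase_jitter, size=(n_units, 1))
    # Aggregate (e.g., LFP/EEG-like) signal: average over units.
    signal = np.sin(2 * np.pi * f_alpha * t + phases).mean(axis=0)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
    return float(spectrum[np.argmin(np.abs(freqs - f_alpha))])


synchronized = population_alpha_power(phase_jitter=0.0)
desynchronized = population_alpha_power(phase_jitter=2.0)
print(synchronized > desynchronized)
```

In this sketch every unit oscillates at the same amplitude in both conditions. Only the coherence across units differs, which is enough to lower the aggregate alpha power, in line with the point that alpha power is not a transparent measure of the underlying mechanism.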

Predictions for Predictive Learning

In this section, we enumerate a set of direct, testable predictions from our framework. Before doing so, there are several important considerations for any experimental test of the theory. First, the nature of what is to be learned must be matched to the pulvinar area in question. For example, learning a new variation of basic physics in movies at the alpha time scale (e.g., altering properties such as gravity, inertia, or elasticity) would be appropriate for the lower level visual pathways. At higher visual levels (e.g., IT cortex), it might be possible to use simple sequences of different objects, although it is not clear to what extent the hippocampus or prefrontal cortex might also contribute in this case (Fiser et al., 2016; Gavornik & Bear, 2014). To distinguish pulvinar learning effects from pervasive motor learning supported by other brain areas, it would be most effective to directly measure activity in the pulvinar and / or associated perceptual neocortical areas, instead of involving overt behavioral performance.

Much of the learning in posterior sensory cortex should take place early in development, requiring very early developmental interventions or genetic knockouts that are expressed from the start (which can also have other interpretational issues if not highly selective). In our models described below, the bulk of the basic sensory predictive learning happens very quickly, because the basic first-level regularities are quite strong and relatively easily learned. While there are longer-term changes in the higher-level pathways in our models, more fine-grained measurements would likely be required to see these changes. Once this learning has taken place, the remaining contributions of the thalamocortical circuit are likely more strongly weighted toward its role in attention, as we discuss below.
Finally, directly lesioning or inactivating the pulvinar is not likely to be very informative, because existing work has shown dramatic effects on cortical activity (Purushothaman, Marion, Li, & Casagrande, 2012; Zhou et al., 2016), and also any effects could be attributed to the attentional contributions of the pulvinar.

With these considerations in mind, here is a set of strong predictions from our model that should be testable using existing techniques. Failure to obtain the predicted result, while adhering to all the relevant constraints, would constitute a falsification of our model.

• Blocking 5IB bursting mechanisms early in developmental learning should disrupt learning. It should be possible to selectively knock out or modify the channels that cause this specific population of neurons to burst fire, and doing so should have a significant effect on learning in associated neocortical and pulvinar areas, given the critical role that this burst firing plays in the predictive learning process as elaborated above.

• Blocking synaptic plasticity in pulvinar (specifically the 6CT inputs) very early in developmental learning should impair learning. While most of the learning overall should occur in the neocortex as a result of the temporal difference error signal broadcast by the pulvinar (which should remain generally intact), learning in the 6CT projections is important, especially right at the start, to map the emerging neocortical representations into the space defined by the 5IB projections.

• Temporal differences on an alpha cycle timescale actually drive synaptic plasticity in an error-driven learning manner, in neocortical pyramidal neurons and in 6CT inputs to pulvinar. That is, if a pre / post pair of neurons across a synapse is more active in the prediction than the subsequent outcome, the synapse should experience LTD (long term depression), and vice-versa if the activity pattern is reversed (long term potentiation, LTP, for more activity in outcome than prediction). Furthermore, if activity is essentially stable across both prediction and outcome phases, then weights should not change (modulo a small level of Hebbian learning; O’Reilly & Munakata, 2000; O’Reilly et al., 2012).
This should be directly testable using current experimental methods, and is perhaps the single most important empirical test of this entire framework, and it also underlies many other current approaches to error-driven learning in the brain (Bengio et al., 2017; Lillicrap et al., 2020; Whittington & Bogacz, 2019). One general consideration is the extent to which an awake in vivo preparation would be required to capture all the neuromodulatory and other factors present when this learning normally takes place. Some suggestive evidence in such a preparation is generally consistent with a sensitivity to relatively short-term temporal dynamics (Lim et al., 2015), although these results lacked the direct measurement of individual neural activity across a synapse.
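The sign logic of the temporal-difference plasticity prediction can be sketched in a few lines. This is a minimal illustration that assumes a simple contrastive product rule over pre- and postsynaptic activity. The function name `td_weight_change`, the learning rate, and the activity values are hypothetical, and the learning equations actually used in the model may differ.

```python
def td_weight_change(pre_minus, post_minus, pre_plus, post_plus, lrate=0.1):
    """Toy temporal-difference weight change for a single synapse.

    minus = prediction phase, plus = outcome phase. The change is the
    difference in pre * post co-activity between the two phases.
    This product form is an illustrative assumption.
    """
    return lrate * (pre_plus * post_plus - pre_minus * post_minus)


# More co-activity in the prediction than in the outcome: negative change (LTD).
ltd = td_weight_change(0.9, 0.8, 0.3, 0.2)
# More co-activity in the outcome than in the prediction: positive change (LTP).
ltp = td_weight_change(0.3, 0.2, 0.9, 0.8)
# Activity stable across both phases: no change.
stable = td_weight_change(0.6, 0.5, 0.6, 0.5)
print(ltd, ltp, stable)
```

The small Hebbian component mentioned in the text is deliberately left out of this sketch, so the stable case gives exactly zero.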

Predictive Learning of Object Categories in IT Cortex

Now we turn to our implementation of the proposed thalamocortical predictive error-driven learning framework, in a large-scale model of visual predictive learning (Figure 3). Our second major objective, and a critical question for predictive learning, is whether the model can develop high-level, abstract ways of representing the raw sensory inputs, while learning from nothing but predicting these low-level visual inputs. We showed the model brief movies of 156 3D object exemplars drawn from 20 different basic-level categories (e.g., car, stapler, table lamp, traffic cone, etc.) selected for their overall shape diversity from the CU3D-100 dataset (O’Reilly et al., 2013). The objects moved and rotated in 3D space over 8 movie frames, where each frame was sampled at the alpha frequency (Figure 3b). There were also saccadic eye movements every other frame, introducing an additional predictive-learning challenge. An efferent copy signal enabled full prediction of the effects of the eye movement, and allowed the model to capture the signature predictive remapping phenomenon (Cavanagh et al., 2010; Duhamel et al., 1992; Neupane et al., 2017). The only learning signal available to the model was the prediction error generated by the temporal difference between what it predicted to see in the V1 input in the next frame and what was actually seen.

[Figure 3 panels: a) network diagram with Dorsal Where, Ventral What, and What * Where pathways, Deep 6CT (predict) and 5IB (actual) inputs, and prediction error signals. b) rows labeled Image, V1h recon, Pulvinar prediction, pred err, and saccade; r: .33 .53 .59 .62 .57 .52 .48 .58.]

Figure 3: a) The What-Where-Integration, WWI deep predictive learning model. The dorsal Where pathway learns first, using easily-abstracted spatial blobs, to predict object location based on prior motion, visual motion, and saccade efferent copy signals. This drives strong top-down inputs to lower areas with accurate spatial predictions, leaving the residual error concentrated on What and What * Where integration. The V3 and DP (dorsal prelunate) constitute the What * Where integration pathway, binding features and locations. V4, TEO, and TE are the What pathway, learning abstracted object category representations, which also drive strong top-down inputs to lower areas. Suffixes: s = superficial, d = deep, p = pulvinar. b) Example sequence of 8 alpha cycles that the model learned to predict, with the reconstruction of each image based on the V1 Gabor filters (V1h recon), and model-generated prediction (correlation r prediction error shown). The low resolution and reconstruction distortion impair visual assessment, but r values are well above the r’s for each V1 state compared to the previous time step (mean = .38, min of .16 on frame 4; see SI for more analysis). Eye icons indicate when a saccade occurred.

[Figure 4 panels: a) Biological model. b) Human ratings. Category labels in both panels: Pyramid, Vertical, Round, Boxy, Horizontal.]

Figure 4: a) Category similarity structure that developed in the highest layer, TE, of the biologically based predictive learning model, showing correlation distance (1-correlation) similarity of the TE representation for each 3D object against every other 3D object (156 total objects). Blue cells have high similarity, and the model has learned block-diagonal clusters or categories of high-similarity groupings, contrasted against dissimilar off-diagonal other categories. Clustering maximized average within - between correlation distance (see SI), and clearly corresponded to the shown shape-based categories, with exemplars from each category shown. Also, all items from the same basic-level object categories (N=20) are reliably subsumed within learned categories. b) Human similarity ratings for the same 3D objects, presented with the V1 reconstruction (see Fig 1b) to capture coarse perception in the model, aggregated by 20 basic-level categories (156 x 156 matrix was too large to sample densely experimentally). Each cell is 1 - proportion of time given object pair was rated more similar than another pair (see SI). The human matrix shares the same centroid categorical structure as the model (confirmed by permutation testing and agglomerative cluster analysis, see SI), indicating that human raters used the same shape-based category structure.

As described in detail in the supporting information, our model was constructed to capture critical features of the visual system, including the major division between a dorsal Where and ventral What pathway (Ungerleider & Mishkin, 1982), and the overall hierarchical organization of these pathways derived from detailed connectivity analyses (Felleman & Van Essen, 1991; Markov, Ercsey-Ravasz, et al., 2014; Markov, Vezoli, et al., 2014; Rockland & Pandya, 1979). In addition to these biological constraints, we conducted extensive exploration of the connectivity and architecture space, and found a remarkable convergence between what worked functionally and the known properties of these pathways (O’Reilly et al., 2017). For example, the feedforward pathway has projections from lower-level superficial layers to superficial layers of higher levels, while feedback originated in both the superficial and deep layers and projected back to both (Felleman & Van Essen, 1991; Rockland & Pandya, 1979). Also, consistent with the core features of the pulvinar pathways discussed above, deep layer predictive (6CT) inputs originated in higher levels, while driver (5IB) inputs originated in lower levels.
For simplicity we organized the model layers in terms of these driver inputs, whereas the topographic organization of pulvinar in the brain is organized more according to the 6CT projection loops (Shipp, 2003).

Another important set of parameters is the strengths of the deep-layer recurrent projections, which influence the timescale of temporal integration, producing a simple biologically-based version of slow feature analysis (Foldiak, 1991; Wiskott & Sejnowski, 2002). We followed the biological data suggesting that recurrence increases progressively up the visual hierarchy (Chaudhuri, Knoblauch, Gariel, Kennedy, & Wang, 2015). It was essential that the Where pathway learn first, consistent with extant data (Bourne & Rosa, 2006; Kiorpes, Price, Hall-Haro, & Anthony Movshon, 2012), including early pathways interconnecting LIP and pulvinar (Bridge, Leopold, & Bourne, 2016), and a rare asymmetric pathway, from V1 to LIP (Markov, Ercsey-Ravasz, et al., 2014), providing a direct short-cut for high-level spatial representations in LIP. Results from various informative model architecture and parameter manipulations are discussed below after the primary results from the standard intact model. Learning curves and other model details are shown in the supporting information.
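The link between recurrent strength and integration timescale mentioned above can be made concrete with a toy leaky integrator. This is a sketch under assumptions, not the model's deep-layer equations: the update rule, the function names, and the recurrence values of 0.2 and 0.8 are hypothetical.

```python
import math


def integrate(inputs, recurrence):
    """Toy leaky integration: state = recurrence * state + (1 - recurrence) * input."""
    state, trace = 0.0, []
    for x in inputs:
        state = recurrence * state + (1.0 - recurrence) * x
        trace.append(state)
    return trace


def time_constant(recurrence):
    """Number of steps for the influence of a past input to decay by a factor of e."""
    return -1.0 / math.log(recurrence)


step_input = [1.0] * 10
low = integrate(step_input, recurrence=0.2)   # hypothetical lower area, weak recurrence
high = integrate(step_input, recurrence=0.8)  # hypothetical higher area, strong recurrence
print(time_constant(0.2), time_constant(0.8))
```

In this toy setting, stronger recurrence makes the state change more slowly from one step to the next, so a layer with stronger recurrence integrates over more alpha cycles. This is the sense in which increasing recurrence up the hierarchy favors slowly varying features.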

[...] What pathway of the model emerge through the deep hierarchy of layers progressing upward from V1. This has been investigated in recent comparisons between monkey electrophysiological recordings and deep convolutional neural networks (DCNNs), which provide a reasonably good fit to the overall progressive pattern of increasingly categorical organization (Cadieu et al., 2014). However, these DCNNs were trained on large datasets of human-labeled object categories, and it is perhaps not too surprising that the higher layers closer to these category output labels exhibited a greater degree of categorical organization; this is an intrinsic property of the error backpropagation gradients. In contrast, because the only source of learning in our model comes from prediction errors over the V1 input layers, the graded emergence of an object hierarchy here reflects a truly self-organizing learning process. Figure 5 compares the similarity structures in layers V4 and IT in macaque monkeys (Cadieu et al., 2014) with those in corresponding layers in our model. In both the monkeys and our model, the higher IT layer builds upon and clarifies the noisier structure that is emerging in the earlier V4 layer, showing that our model replicates the essential qualitative hierarchical progression in the brain. After presenting a few more analyses, we explore the critical factors that lead to this result.

We can more precisely quantify the emergence of categorical representations in our model by computing the second-order similarity across the similarity matrices computed at each layer in the network (Figure 6). This shows the extent to which the similarity matrix across objects in one layer is itself similar to the object similarity matrix in another layer, in terms of a correlation measure across these similarity matrices.
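The second-order similarity measure just described can be sketched as follows. This is a minimal illustration, assuming Pearson correlation at both levels and comparison over the upper triangle of each matrix only. The function names and the random stand-in activations are hypothetical, and the exact procedure used for Figure 6 is described in the SI.

```python
import numpy as np


def correlation_distance_matrix(acts):
    """acts: (n_objects, n_units) activity -> (n_objects, n_objects) of 1 - Pearson r."""
    return 1.0 - np.corrcoef(acts)


def second_order_similarity(acts_a, acts_b):
    """Correlation between two layers' object-by-object distance matrices."""
    da = correlation_distance_matrix(acts_a)
    db = correlation_distance_matrix(acts_b)
    # Use each object pair once and skip the all-zero diagonal.
    iu = np.triu_indices(da.shape[0], k=1)
    return float(np.corrcoef(da[iu], db[iu])[0, 1])


# Random stand-in activations for 12 objects in three hypothetical layers.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(12, 40))
layer_b = layer_a + 0.1 * rng.normal(size=(12, 40))  # nearly the same structure
layer_c = rng.normal(size=(12, 40))                  # unrelated structure

sim_self = second_order_similarity(layer_a, layer_a)
sim_ab = second_order_similarity(layer_a, layer_b)
sim_ac = second_order_similarity(layer_a, layer_c)
print(sim_self, sim_ab, sim_ac)
```

Because the comparison is made on distance matrices rather than raw activations, the two layers may have different numbers of units; only the number of objects must match.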
Starting from either V1 compared to all higher layers, or TE compared to all lower layers, we found a consistent pattern of progressive emergence of the object categorization structure in the upper IT pathway (TEO, TE). Critically, this analysis shows that the IT category structure is significantly different from that present at the level of the V1 primary visual input. Thus the model, despite being trained only to generate accurate visual input-level predictions, has learned to represent these objects in an abstract way that goes beyond the raw input-level information. We further verified that at the highest IT levels in the model, a consistent, spatially-invariant representation is present across different views of the same object (e.g., the average correlation across frames within an object was .901). This is also evident in Figure 4a by virtue of the close similarity

[Figure 5 category labels: pyramid, vertical, round, boxy, horiz.]

Figure 5: Comparison of progression from V4 to IT in macaque monkey visual cortex (top row, from Cadieu et al., 2014) versus same progression in model (replotted using comparable color scale). Although the underlying categories are different, and the monkeys have a much richer multi-modal experience of the world to reinforce categories such as foods and faces, the model nevertheless shows a similar qualitative progression of stronger categorical structure in IT, where the block-diagonal highly similar representations are more consistent across categories, and the off-diagonal differences are stronger and more consistent as well (i.e., categories are also more clearly differentiated). Note that the critical difference in our model versus those compared in Cadieu et al. (2014) and related papers is that they explicitly trained their models on category labels, whereas our model is entirely self-organizing and has no external categorical training signal.
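The strength of block-diagonal category structure in such matrices can be summarized with a within versus between contrast. The sketch below assumes the contrast is defined as mean between-category distance minus mean within-category distance. That sign convention, the function name, and the 4-object toy matrix are illustrative assumptions; the definition used for the reported values is given in the SI.

```python
import numpy as np


def within_between_contrast(dist, labels):
    """Mean between-category distance minus mean within-category distance.

    dist: (n, n) symmetric distance matrix (e.g., 1 - correlation).
    labels: length-n category assignment. Larger values indicate cleaner
    block-diagonal structure under this assumed sign convention.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = dist[same & off_diag].mean()   # same category, excluding self-pairs
    between = dist[~same].mean()            # different categories
    return float(between - within)


# Toy distance matrix for 4 objects forming two tight pairs.
dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])
good = within_between_contrast(dist, [0, 0, 1, 1])  # labels match the structure
bad = within_between_contrast(dist, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

A labeling that follows the block structure gives a large positive contrast, while a labeling that cuts across it gives a low or negative one. A clustering procedure can use this quantity as its objective.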

[Figure 6 plot: x-axis = Layer (V1, V1h, V2, V3, DP, V4, TEO, TE); y-axis = Correlation; lines = V1 correl, TEs correl]

Figure 6: Emergence of abstract category structure over the hierarchy of layers. Red line = correlation similarity between the TE similarity matrix (shown in Figure 2a) and all layers; black line shows correlation similarity between V1 and all layers (1 = identical; 0 = orthogonal). Both show that IT layers (TEO, TE) progressively differentiate from raw input similarity structure present in V1, and, critically, that the model has learned structure beyond that present in the input.

[Figure 7 panels: a) Bp, max = 0.3; b) PredNet, max = 0.75; c) Bp, correlation by layer (V1, V1h, V2, V3, DP, V4, TEO, TE; lines = V1 correl, TEs correl); d) V1, max = 1; e) Bp w V1, max = 0.3; f) Leabra w V1, max = 1.5]

Figure 7: a) Best-fitting category similarity for TE layer of the backpropagation (Bp) model with the same What / Where structure as the biological model. Only two broad categories are evident, and the lower max distance (0.3 vs. 1.5 in biological model) means that the patterns are highly similar overall. b) Best-fitting similarity structure for the PredNet model, in the highest of its layers (layer 6), which is more differentiated than Bp (max = 0.75) but also less cleanly similar within categories (i.e., less solidly blue along the block diagonal), and overall follows a broad category structure similar to V1. c) Comparison of similarity structures across layers in the Bp model (compare to Figure 2c): unlike in the biological model, the V1 structure is largely preserved across layers, and is little different from the structure that best fits the TE layer shown in panel a, indicating that the model has not developed abstractions beyond the structure present in the visual input. Layer V3 is most directly influenced by spatial prediction errors, so it differs from both in strongly encoding position information. d) The best fitting V1 structure, which has 2 broad categories and banana is in a third category by itself. The lack of dark blue on the block diagonal indicates that these categories are relatively weak, and every item is fairly dissimilar from every other. e) The same similarities shown in panel a for Bp TE also fit reasonably well sorted according to the V1 structure (and they have similar average within - between contrast differences, of 0.0838 and 0.0513; see SI for details). f) The similarity structure from the biological model resorted in the V1 structure does not fit well: the blue is not aligned along the block diagonal, and the yellow is not strictly off-diagonal. This is consistent with the large difference in average contrast distance: 0.5071 for the best categories vs. 0.3070 for the V1 categories.

across multiple objects within the same category.

In summary, the model learned an abstract category organization that reflects the overall visual shapes of the objects as judged by human participants, in a way that is invariant to the differences in motion, rotation, and scaling that are present in the V1 visual inputs. We are not aware of any other model that has accomplished this signature computation of the ventral What pathway in a purely self-organizing manner operating on realistic 3D visual objects, without any explicit supervised category labels, much less using a learning algorithm directly based on detailed properties of the underlying biological circuits in this pathway.
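The second-order similarity analysis summarized in Figure 6 can be illustrated with a minimal sketch. This is not the code used for the paper (see the project repository for that); the function names are hypothetical, and it assumes each layer is summarized by one activation vector per object and that similarity matrices hold 1 - correlation distances.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def similarity_matrix(patterns):
    """1 - correlation distance between every pair of object patterns."""
    n = len(patterns)
    return [[1.0 - pearson(patterns[i], patterns[j]) for j in range(n)]
            for i in range(n)]

def upper_triangle(matrix):
    """Off-diagonal entries above the diagonal, flattened to a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def second_order_similarity(matrix_a, matrix_b):
    """Correlation between two layers' similarity matrices (assumed measure)."""
    return pearson(upper_triangle(matrix_a), upper_triangle(matrix_b))

# Hypothetical toy data: 4 objects, 4 units per layer.
layer_a = [[1.0, 0.0, 0.2, 0.9], [0.9, 0.1, 0.0, 1.0],
           [0.0, 1.0, 0.8, 0.1], [0.1, 0.9, 1.0, 0.0]]
# A second layer whose patterns are a linear rescaling of the first.
layer_b = [[2.0 * x + 1.0 for x in p] for p in layer_a]

sim_a = similarity_matrix(layer_a)
sim_b = similarity_matrix(layer_b)
print(second_order_similarity(sim_a, sim_b))
```

Because correlation is invariant to linear rescaling, the two toy layers share the same similarity structure, so their second-order similarity is at ceiling. A layer that reorganizes the objects into different groupings would score lower against V1.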

Backpropagation Comparison Models

To help discern some of the factors that contribute to the categorical learning in our model, and provide a comparison with more widely-used error backpropagation models, we tested a backpropagation-based (Bp) version of the same What vs. Where architecture as our biologically-based predictive error model, and we also tested a standard PredNet model (Lotter et al., 2016) with extensive hyperparameter optimization (see SI). Due to the constraints of backpropagation, we had to eliminate any bidirectional connectivity loops in the Bp version, but we were able to retain a form of predictive learning by configuring the V1p pulvinar layer as the final target output layer, with the target being the next visual input relative to the V1 inputs. As shown in Figure 7, the highest layers of the Bp model form a simple binary category structure overall, and the detailed item-level similarity structure does not diverge significantly from that present at the lowest V1 inputs, indicating that it has not formed novel systematic structured representations, in contrast to those formed in the biologically based model. Similar results were found in the PredNet model, where the highest layer representations remained very close to the V1 input structure. Because existing work with these models has typically relied on additional supervised learning and decoder-based analyses (which are essentially equivalent to an additional layer of supervised learning), these RSA-based analyses provide an important, more sensitive way of determining what they learn.

These results show that the additional biologically derived properties in our model are playing a critical role in the development of abstract categorical representations that go beyond the raw visual inputs. These properties include: excitatory bidirectional connections, inhibitory competition, and an additional Hebbian form of learning that serves as a regularizer (similar to weight decay) on top of predictive error-driven learning (O’Reilly, 1998; O’Reilly & Munakata, 2000). Each of these properties could promote the formation of categorical representations. Bidirectional connections enable top-down signals to consistently shape lower-level representations, creating significant attractor dynamics that cause the entire network to settle into discrete categorical attractor states.
Another indication of the importance of bidirectional connections is that a greedy layer-wise pretraining scheme, consistent with a putative developmental cascade of learning from the sensory periphery on up (Bengio, Yao, Alain, & Vincent, 2013; Hinton & Salakhutdinov, 2006; Shrager & Johnson, 1996; Valpola, 2014), did not work in our model. Instead, we found it essential that higher layers, with their ability to form more abstract, invariant representations, interact and shape learning in lower layers right from the beginning.

Furthermore, the recurrent connections within the TEO and TE layers likely play an important role by biasing the temporal dynamics toward longer persistence (Chaudhuri et al., 2015). By contrast, backpropagation networks typically lack these kinds of attractor dynamics, and this could contribute significantly to their relative lack of categorical learning. Hebbian learning drives the formation of representations that encode the principal components of activity correlations over time, which can help more categorical representations coalesce (and results below already indicate its importance). Inhibition, especially in combination with Hebbian learning, drives representations to specialize on more specific subsets of the space.

Ongoing work is attempting to determine which of these is essential in this case (perhaps all of them) by systematically introducing some of these properties into the backpropagation model, though this is difficult because full bidirectional recurrent activity propagation, which is essential for conveying error signals top-down in the biological network, is incompatible with the standard efficient form of error backpropagation, and requires significantly more computationally intensive and unstable forms of fully recurrent backpropagation (Pineda, 1987; Williams & Zipser, 1992). Furthermore, Hebbian learning requires dynamic inhibitory competition, which is difficult to incorporate within the backpropagation framework.

Architecture and Parameter Manipulations

Figure 8 shows just a few of the large number of parameter manipulations that have been conducted to develop and test the final architecture. For example, we hypothesized that separating the overall prediction problem between a spatial Where vs. non-spatial What pathway (Goodale & Milner, 1992; Ungerleider & Mishkin, 1982) would strongly benefit the formation of more abstract, categorical object representations in the What pathway. Specifically, the Where pathway can learn relatively quickly to predict the overall spatial trajectory of the object (and anticipate the effects of saccades), and thus effectively regress out that component of the overall prediction error, leaving the residual error concentrated in object feature information, which can train the ventral What pathway to develop abstract visual categories. Figure 8a shows that, indeed, when the Where pathway is lesioned, the formation of abstract categorical representations in the intact What pathway is significantly impaired. Figure 8b shows that full predictive learning, as compared to just encoding and decoding the current state (i.e., an auto-encoder, which is much easier computationally, and leads to much better overall accuracy), is also critical for the formation of abstract categorical representations: prediction is a “desirable difficulty” (Bjork, 1994). Finally, Figure 8c shows the impact of reducing Hebbian learning, which impairs category learning as expected.

[Figure 8 plots: x-axis = Layer (V1, V1h, V2, V3, DP, V4, TEO, TE; panel a omits V3 and DP); y-axis = Correlation. a) Where pathway lesioned (Intact vs. No Where); b) No prediction (auto-encoder) (Intact vs. No Predict); c) Hebbian reduced (Intact vs. Hebb 2)]

Figure 8: Effects of various manipulations on the extent to which TE representations differentiate from V1. For all plots, Intact is the same result shown in Figure 5 from the intact model for ease of comparison. All of the following manipulations significantly impair the development of abstract TE categorical representations (i.e., TE is more similar to V1 and the other layers). a) Dorsal Where pathway lesions, including lateral intraparietal area (LIP), V3, and dorsal prelunate (DP). This pathway is essential for regressing out location-based prediction errors, so that the residual errors concentrate feature-encoding errors that train the What pathway. b) Allowing the deep layers full access to current-time information, thus effectively eliminating the prediction demand and turning the network into an auto-encoder, which significantly impairs representation development, and supports the importance of the challenge of predictive learning for developing deeper, more abstract representations. c) Reducing the strength of Hebbian learning by 20% (from 2.5 to 2), demonstrating the essential role played by this form of learning on shaping categorical representations. Eliminating Hebbian learning entirely (not shown) prevented the model from learning anything at all, as it also plays a critical regularization and shaping role on learning.
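The contrast in Figure 8b between predictive learning and an auto-encoder can be stated abstractly in terms of what each input frame is trained against. The sketch below is only an illustration of that distinction: `make_training_pairs` is a hypothetical helper, and in the actual model the auto-encoder condition was produced by giving the deep layers access to current-time information rather than by re-pairing inputs and targets.

```python
def make_training_pairs(frames, predictive=True):
    """Pair each input frame with its training target.

    predictive=True  -> the target is the NEXT frame (prediction demand).
    predictive=False -> the target is the SAME frame (auto-encoder).
    """
    if predictive:
        # The last frame has no successor, so it is used only as a target.
        return list(zip(frames[:-1], frames[1:]))
    return [(frame, frame) for frame in frames]

# Hypothetical frame identifiers standing in for V1 input patterns.
frames = ["frame0", "frame1", "frame2"]
predictive_pairs = make_training_pairs(frames, predictive=True)
autoencoder_pairs = make_training_pairs(frames, predictive=False)
```

In the predictive case the network never sees its target at the time it generates its output, which is the property the 5IB burst timing is proposed to enforce biologically.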

Predictive Behavior

A signature example of predictive behavior at the neural level in the brain is the predictive remapping of visual space in anticipation of a saccadic eye movement (Colby, Duhamel, & Goldberg, 1997; Duhamel et al., 1992; Gottlieb, Kusunoki, & Goldberg, 1998; Marino & Mazer, 2016; Nakamura & Colby, 2002) (Figure 9a). Here, parietal neurons start to fire at the future receptive field location where a currently-visible stimulus will appear after a planned saccade is actually executed. Remapping has also been shown for border ownership neurons in V2 (O’Herron & von der Heydt, 2013) and in area V4 (Neupane, Guitton, & Pack, 2016, 2020). These are examples, we believe, of a predictive process operating throughout the neocortex to predict what will be experienced next. A major consequence of this predictive process is the perception of a stable, coherent visual world despite constant saccades and other sources of visual change.

Figure 9b shows that our model exhibits this predictive remapping phenomenon. Specifically, LIP, which is most directly interconnected with the saccade efferent copy signals, is the first to predict the new location, and it then drives top-down activation of lower layers. This top-down dynamic is consistent with the account of predictive remapping given by Wurtz (2008) and Cavanagh et al. (2010), who argue that the key remapping takes place at the high levels of the dorsal stream, which then drive top-down activation of the predicted location in lower areas, instead of the alternative where lower-levels remap themselves based on saccade-related signals. The lower-level visual layers are simply too large and distributed to be able to remap across the relevant degrees of visual angle, the extensive lateral connectivity needed to communicate across these

[Figure 9 panels: top = Duhamel et al., 1992; bottom = Model. Model plot: x-axis = Cycles (msec); y-axis = Activation (mean firing rate); traces = LIPd (old pos), LIPd (new pos), V2d, V2s; marked events = Saccade, Fixation, Plus phase]

Figure 9: Predictive Remapping. top: Original remapping data in LIP from Duhamel et al. (1992). A) shows stimulus (star) response within receptive field (dashed circle) relative to fixation dot (upper right of fixation). B) Just prior to monkey making a saccade to new fixation (moving left), stimulus is turned on in receptive field location that will be upper right of the new fixation point, and the LIP neuron responds to that stimulus in advance of the saccade completing. The neuron does not respond to the stimulus in that location if it is not about to make a saccade that puts it within its receptive field (not shown). This is predictive remapping. C) Response to the old stimulus location goes away as saccade is initiated. bottom: Data from our model, from individual units in LIPd, V2d, and V2s, showing that the LIP deep neurons respond to the saccade first, activating in the new location and deactivating in the old, and this LIP activation goes top-down to V3 and V2 to drive updating there, generally at a longer latency and with less activation, especially in the superficial layers. When the new stimulus appears at the point of fixation (after a 50 msec saccade here), the primed V2s units get fully activated by the incoming stimulus. But the deep neurons are insulated from this superficial input until the plus phase, when the cascade of 5IB firing drives activation of the actual stimulus location into the pulvinar, which then reflects up into all the other layers.

Discussion

We have hypothesized a novel computational function for the distinctive features of thalamocortical circuits (S. M. Sherman & Guillery, 2006; Usrey & Sherman, 2018), as supporting a specific form of prediction-error driven learning, where predictions arise from the numerous top-down layer 6CT projections into the pulvinar, and the strong, sparse, focal driving 5IB inputs supply the bottom-up sensory-driven outcome. The phasic bursting nature of the 5IB inputs results in a natural temporal-difference error signal of prediction followed by outcome, consistent with extensive neural recording data. This temporal dynamic is also essential for enabling predictions to be generated without contamination from current sensory inputs, and predicts a characteristic alpha frequency prediction cycle based on the 10 Hz bursting cycle of the 5IB inputs, consistent with the pervasive influence of alpha on perception and neural dynamics (Clayton et al., 2018; Foster & Awh, 2019; Jensen et al., 2015; VanRullen, 2016). In short, the hypothesized predictive learning function fits remarkably well with a number of well-established properties of these thalamocortical circuits, and we also provided a set of additional predictions that could be tested to further evaluate this theory, especially in contrast to the widely-discussed alternative of explicit error coding neurons, which have not been unambiguously supported across a range of empirical studies (Walsh et al., 2020).

Furthermore, we implemented this theory in a large scale model of the visual system, and demonstrated that learning based strictly on predicting what will be seen next is, in conjunction with a number of critical biologically motivated network properties and mechanisms, capable of generating abstract, invariant categorical representations of the overall shapes of objects. The nature of these shape representations closely matches human shape similarity judgments on the same objects.
Thus, predictive learning has the potential to go beyond the surface structure of its inputs, and develop systematic, abstract encodings of the environment. We found that comparison models based on standard error backpropagation learning did not learn a categorical structure that went beyond the surface similarity present in the visual input layers, and future work is focused on narrowing down the specific mechanisms required to drive this learning.

In addition to the predictive learning functions of the deep / thalamic layers, these same circuits are also likely critical for supporting powerful top-down attentional mechanisms that have a net multiplicative effect on superficial-layer activations (Bortone, Olsen, & Scanziani, 2014; Olsen, Bortone, Adesnik, & Scanziani, 2012). The importance of the pulvinar for attentional processing has been widely documented (e.g., Bender & Youakim, 2001; LaBerge & Buchsbaum, 1990; Saalmann et al., 2012), and there is likely an additional important role of the thalamic reticular nucleus (TRN), which can contribute a surround-inhibition contrast-enhancing effect on top of the incoming attentional signal from the cortex (Crick, 1984; Jaramillo, Mejias, & Wang, 2019; Pinault, 2004; Wimmer et al., 2015).
In other work in progress, we have shown that the deep / thalamic circuits in our model produce attentional effects consistent with the abstract Reynolds and Heeger (2009) model, while the contributions of the deep layer networks to this function are broadly consistent with the folded-feedback model (Grossberg, 1999). These attentional modulation signals cause the bidirectional constraint satisfaction process in the superficial network to focus on task-relevant information while down-regulating responses to irrelevant information; in the real world, there are typically too many objects to track at any given time, so predictive learning must be directed toward the most important objects (Cavanagh et al., 2010; Pylyshyn, 1989; Richter & de Lange, 2019).

There is also data suggesting that the pulvinar is important for supporting confidence judgments, driven by relative ambiguity in a random dot motion categorization task (Komura et al., 2013). Critically for the present framework, this confidence modulation only emerged in the period after the first 100 msec of processing, and manifested as a positive correlation with confidence (i.e., more unambiguous stimuli resulted in higher firing rates). We can interpret this as reflecting an ongoing generative postdiction of the stimulus signal, with stronger firing associated with more unambiguous top-down activation based on the current internal representation. Note that this directionality is the opposite of explicit error-coding neurons, which would presumably increase with increasing error / ambiguity in the prediction. Interestingly, inactivation of these pulvinar neurons resulted in a substantial (200%) increase in opt-out choices on the most ambiguous stimuli, suggesting a level of metacognitive awareness of the pulvinar signal (or at least a direct effect of pulvinar on relevant metacognitive processes). Predictive accuracy would be an ideal source of metacognitive confidence signals across a wide range of domains, suggesting another important contribution of pulvinar even after initial learning. Jaramillo et al. (2019) present a comprehensive model of attentional, decision-making, and working memory contributions of the pulvinar, including this confidence data, which is generally compatible with our framework, although it does not address any learning phenomena.

Considerable further work remains to be done to more precisely characterize the essential properties of our biologically motivated model necessary to produce this abstract form of learning, and to further explore the full scope of predictive learning across different domains. We strongly suspect that extensive cross-modal predictive learning in real-world environments, including between sensory and motor systems, is a significant factor in infant development and could greatly multiply the opportunities for the formation of higher-order abstract representations that more compactly and systematically capture the structure of the world (Yu & Smith, 2012).
Future versions of these models could thus potentially provide novel insights into the fundamental question of how deep an understanding a pre-verbal human, or a non-verbal primate, can develop (J. Elman et al., 1996; Spelke, Breinlinger, Macomber, & Jacobson, 1992), based on predictive learning mechanisms. This would then represent the foundation upon which language and cultural learning builds, to shape the full extent of human intelligence.

Supplemental Information

All of the materials described here, including the experimental study, the computational models, and the code to perform the representational similarity analysis, are available on our GitHub account at: https://github.com/ccnlab/deep-obj-cat

For the computational models in particular, the most complete understanding can only be had by directly examining the code for the models, as there are a number of details that are not efficiently captured in this supplementary materials text.

Representational Similarity Analysis Methods

The different representations being compared here are:

Leabra: The DeepLeabra (biological model) TE layer representations (specifically TEs = superficial; results are very similar for deep as well).

Bp: The TEs layer representations from the backpropagation version of the biological model, including What, Where and What * Where integration layers, trained with the V1p and V1hp (low and high resolution pulvinar) layers as final output layers, using the time t target pattern from the t − 1 input (i.e., as a predictive network).

V1: The gabor-filtered representation of the visual input to both of the above models, which was identical across them.

PredNet: Highest layer (6th Layer) of the PredNet architecture.

Expt: Similarity matrix constructed from human pairwise similarity judgments (see Behavioral Experiment Methods).

An optimal category cluster can be defined as one that has high within-cluster similarity and low between-cluster similarity. This can be operationalized by the contrast distance metric, based on a 1-correlation (correlation distance) measure, as the difference between the average within-cluster similarity and the average between-cluster similarity:

$$cd = \langle 1 - r_{in} \rangle - \langle 1 - r_{out} \rangle \qquad (1)$$

With distance-like 1-correlation values, this contrast distance should be minimized (it is typically negative), or equivalently the contrast on raw correlation values can be maximized (it is typically a positive number, just the sign flip of the distance value). We refer to the positive numbers and maximization here as that is more natural.

Starting with an initial set of clusters, a permutation-based hill-climbing strategy was used to determine a local minimum in this measure: each item was tested in each of the other possible categories, and if that configuration reduced the overall average contrast distance metric across all items, then it was adopted and the process iterated until no such permutation improved the metric. This algorithm can only decrease the number of clusters (by moving all items out of a given cluster), so different numbers of initial clusters can be used to search the overall space.

Figure 10 shows the resulting categories. The Bp model converged on the same cluster state from all starting configurations tested, varying from 5 to 2 initial categories. This is the cluster set shown in Figure 5a of the main paper, and has an average contrast distance (acd) of 0.0838 (this is relatively low because the patterns were overall quite similar). Likewise, the V1 patterns (which were the same across Leabra and Bp models) reliably converged on the same pattern (shown in Figure 5d), with acd = 0.2448.
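The contrast distance of Equation 1 and the permutation-based hill-climbing search can be sketched as follows. This is a simplified illustration rather than the analysis code in the repository: the function names are hypothetical, the per-item averaging is one reading of "average contrast distance across all items", and items that lack either a within-cluster or a between-cluster comparison (e.g., singletons) are simply skipped, which is an assumption.

```python
def average_contrast_distance(dist, labels):
    """Mean over items of (avg within-cluster dist - avg between-cluster dist).

    dist is an n x n matrix of 1 - correlation distances; labels[i] is the
    cluster of item i. Lower (more negative) means better-separated clusters.
    """
    n = len(labels)
    per_item = []
    for i in range(n):
        within = [dist[i][j] for j in range(n)
                  if j != i and labels[j] == labels[i]]
        between = [dist[i][j] for j in range(n) if labels[j] != labels[i]]
        if within and between:  # assumption: skip items with no comparison
            per_item.append(sum(within) / len(within)
                            - sum(between) / len(between))
    return sum(per_item) / len(per_item) if per_item else 0.0

def hill_climb(dist, initial_labels):
    """Move single items between the existing clusters while the metric drops."""
    labels = list(initial_labels)
    clusters = sorted(set(labels))  # no new clusters are ever created
    best = average_contrast_distance(dist, labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            for cluster in clusters:
                if cluster == labels[i]:
                    continue
                previous = labels[i]
                labels[i] = cluster
                score = average_contrast_distance(dist, labels)
                if score < best - 1e-12:
                    best = score       # keep the move
                    improved = True
                else:
                    labels[i] = previous  # undo the move
    return labels, best

# Hypothetical toy distances: items 0-1 are similar, items 2-3 are similar.
toy_dist = [[0.0, 0.1, 0.9, 0.9],
            [0.1, 0.0, 0.9, 0.9],
            [0.9, 0.9, 0.0, 0.1],
            [0.9, 0.9, 0.1, 0.0]]
final_labels, final_score = hill_climb(toy_dist, [0, 0, 0, 1])
print(final_labels, final_score)
```

Starting with item 2 misassigned, the search moves it into the second cluster and then stops, since no further single-item move lowers the metric. As in the text, the search can empty a cluster but never add one, so the number of initial clusters bounds the solution.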

Centroid
1. pyramid: banana, layercake, trafficcone, sailboat, trex
2. vertical: person, guitar, tablelamp
3. round: doorknob, donut, handgun, chair
4. box: slrcamera, elephant, piano, fish
5. horiz: car, heavycannon, stapler, motorcycle

Bp
1. cat1: banana, layercake, trafficcone, sailboat, trex, person, guitar, tablelamp, doorknob, donut, handgun, chair, slrcamera, elephant, piano, fish, car
2. cat2: heavycannon, stapler, motorcycle

V1
1. cat1: trafficcone, sailboat, person, guitar, tablelamp, chair
2. cat2: layercake, trex, doorknob, donut, handgun, slrcamera, elephant, piano, fish, car, heavycannon, stapler, motorcycle
3. cat3: banana

PredNet
1. cat1: trafficcone, sailboat, person, guitar, tablelamp, layercake
2. cat2: trex, donut, banana, handgun, slrcamera, elephant, fish, car, heavycannon, stapler, motorcycle
3. cat3: chair, doorknob, piano

Figure 10: Shape categories used for similarity matrix plots in main paper. Centroid shape categories are near-best for both the Leabra model and the Expt results, and fit our visual intuitions about overall shape. Bp are reliably optimal for Bp model from all starting points. V1 are reliably optimal for V1 inputs, and also were close to the best for the Bp and PredNet layer 6 representations. PredNet are best stable solution for PredNet layer 6.

For the PredNet layer 6 representations, starting from the V1 categories gave the best results of any other set (acd = 0.1967), and a few permutations resulted in a reliable solution that was arrived at from all other 3 category starting points tested, shown in Figure 10 (acd = 0.2820). This indicates that PredNet did not go much beyond the structure present in the input, even though it did not use the V1 gabor filtering used in the Leabra and Bp models (i.e., this V1-level encoding well-captures the structure of the visual inputs in general). The PredNet pixel and layer 1 representations both converged on essentially a single monolithic category with very low acd (0.0018, 0.0013).

For the Leabra TE representations, we found a set of centroid shape categories that are near-best when considering both the Leabra model and the results from the human behavioral experiment (Expt). Starting from these categories, the permutation analysis converged on reducing the size of the vertical and round

Expt), it was clear that it strongly coincided with our original shape intuitions, and with the finer-grained 5 category centroid structure. Starting from the centroid categories, the maximal permutation made only 3 changes, moving trex (T-rex) and handgun into the horizontal category, and chair into the pyramid, going from a distance score of 0.3083 to 0.3225, which is a relatively small improvement. However, using the maximal Expt clusters directly on the Leabra model gives a lower acd measure of 0.3745 (compared to 0.5071 for centroid), so the centroid categories represent a good middle-ground between experiment and the model, and this strong shared similarity structure with near-optimal cluster structures confirms that the model and people are encoding largely the same information.

In contrast, if we organize the experiment similarity matrix using the Bp categories, it produces a very poor average contrast distance measure of 0.0643 (compared to 0.3083 for the centroid categories), strongly suggesting that people’s shape representations are not compatible with that simple structure.

Another approach to determining clusters from similarity matrices, agglomerative clustering, starts with all items as singletons, and iteratively combines the closest two into a new cluster. The results for the Leabra and Expt similarity matrices are shown in Figure 11, which has also color-coded the items in terms of their category status according to the centroid structure. Due to a strong history dependency in the clustering process, and the indeterminacy of reducing a high-dimensional similarity structure down to two dimensions, structure beyond the leaf level is not very reliable (ties are also broken by a random number generator), but nevertheless you can clearly see that in both cases items from the same cluster are almost always together as leaves in the plots. This then provides additional converging support for the idea that the model is learning the same kind of shape categories as people have.
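The agglomerative procedure described above can be sketched in a few lines. The text does not state how the distance between two multi-item clusters was computed, so average linkage is assumed here; ties are resolved by taking the first closest pair found rather than randomly, and the function and item names are for illustration only.

```python
def agglomerate(dist, names):
    """Repeatedly merge the two closest clusters; return the merge history.

    dist is an n x n distance matrix (e.g., 1 - correlation) and names gives
    a label for each row. Cluster-to-cluster distance is the mean pairwise
    item distance (average linkage, an assumption).
    """
    clusters = [[i] for i in range(len(names))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j]
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append((sorted(names[i] for i in merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical toy distances among four of the object categories.
toy_names = ["car", "stapler", "person", "guitar"]
toy_dist = [[0.0, 0.1, 0.9, 0.9],
            [0.1, 0.0, 0.9, 0.9],
            [0.9, 0.9, 0.0, 0.2],
            [0.9, 0.9, 0.2, 0.0]]
history = agglomerate(toy_dist, toy_names)
for members, distance in history:
    print(members, distance)
```

The first merges correspond to the leaf-level groupings that the text treats as the reliable part of the dendrogram; later merges depend on the linkage choice and on merge order.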

Behavioral Experiment Methods

The behavioral experiment was conducted on Amazon.com’s MTurk web platform under University of Colorado IRB approval (19-0176), using 30 participants each categorizing up to 800 image pairs as shown in Figure 12, using the standard simple image categorization framework with a lightly customized script. Objects were drawn from the 156 3D object set, but data was aggregated in terms of the 20 basic-level categories (car, stapler, etc.) because we could not sample all 156 x 156 object pairs. Thus, the resulting data was aggregated for each category pair in terms of the proportion of times when that pair was selected when presented.

The individual images were produced by reconstructing from the V1 transform that the computational model used in its high resolution V1 input layer, to give human participants as similar an experience as possible to how the model “saw” the objects, and to reduce the influence of existing semantic knowledge which was entirely missing in our model (Figure 12).
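The aggregation step described above, the proportion of presentations on which a category pair was chosen, might look like the following sketch. The trial format (a left pair, a right pair, and the chosen side) is a hypothetical stand-in for the actual MTurk data format, which is not described here.

```python
from collections import defaultdict

def aggregate_choices(trials):
    """Proportion of times each category pair was selected when presented.

    Each trial is (left_pair, right_pair, choice), where each pair is a
    2-tuple of category names and choice is "L" or "R" (assumed format).
    Pairs are treated as unordered.
    """
    presented = defaultdict(int)
    selected = defaultdict(int)
    for left_pair, right_pair, choice in trials:
        for side, pair in (("L", left_pair), ("R", right_pair)):
            key = tuple(sorted(pair))
            presented[key] += 1
            if choice == side:
                selected[key] += 1
    return {key: selected[key] / presented[key] for key in presented}

# Hypothetical trials over basic-level categories.
toy_trials = [
    (("car", "stapler"), ("car", "person"), "L"),
    (("stapler", "car"), ("guitar", "person"), "L"),
    (("car", "person"), ("guitar", "person"), "R"),
]
similarity = aggregate_choices(toy_trials)
print(similarity)
```

The resulting proportions can be arranged into a category-by-category similarity matrix for comparison with the model's layer representations.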

[Figure 11 cluster plots, leaf order as extracted. a) Leabra TEs Clusters: motorcycle, car, heavycannon, stapler, slrcamera, donut, fish, elephant, handgun, doorknob, piano, chair, trafficcone, layercake, banana, trex, guitar, person, sailboat, tablelamp. b) Experiment Clusters: chair, handgun, stapler, trex, heavycannon, car, layercake, piano, slrcamera, fish, motorcycle, elephant, donut, doorknob, sailboat, trafficcone, banana, tablelamp, person, guitar]

Figure 11: Agglomerative clustering on the Leabra and Expt representations, with the centroid categories color coded. The most reliable information from this is the leaf-level groupings, as the rest of the structure is indeterminate and history dependent in reducing higher-dimensional structure down to a 2D plot. Both cluster plots show a strong tendency to group leaf items together in the same centroid categories, with a few exceptions in each case. Also, the Leabra plot nicely captures the broader 3-category structure evident in the similarity matrix plots, within which the 5 finer-grained centroid categories are organized. Overall, this provides further confirmation that the model and the human subjects are organizing the shapes in largely the same way.

Figure 12: Example stimulus from the behavioral experiment, using the V1 reconstruction of the actual input images presented to the model, to better capture the coarse-grained perception of the model. Subjects were requested to choose which of the two pairs, Left or Right, was most similar in terms of overall shape.

Biological Model Methods

This section provides more information about the DeepLeabra What-Where Integration (WWI) model. The purpose of this information is to give more detailed insight into the model’s function beyond the level provided in the main text, but with a model of this complexity, the only way to really understand it is to explore the model itself. It is available for download at: https://github.com/ccnlab/deep-obj-cat/sims/C++

Furthermore, the best way to understand this model is to understand the framework in which it is implemented, which is explained in great detail, with many running simulations explaining specific elements of functionality, at http://ccnbook.colorado.edu

| Area | Name | Units X | Units Y | Pools X | Pools Y | Receiving Projections |
|---|---|---|---|---|---|---|
| V1 | V1s | 4 | 5 | 8 | 8 | |
| | V1p | 4 | 5 | 8 | 8 | V1s V2d V3d V4d TEOd |
| V1h | V1hs | 4 | 5 | 16 | 16 | |
| | V1hp | 4 | 5 | 16 | 16 | V1s V2d V3d V4d TEOd |
| Eyes | EyePos | 21 | 21 | | | |
| | SaccadePlan | 11 | 11 | | | |
| | Saccade | 11 | 11 | | | |
| Obj | ObjVel | 11 | 11 | | | |
| V2 | V2s | 10 | 10 | 8 | 8 | V1s LIPs V3s V4s TEOd V1p V1hp |
| | V2d | 10 | 10 | 8 | 8 | V2s* V1p V1hp LIPd LIPp V3d V4d V3s TEOs |
| LIP | MtPos | 1 | 1 | 8 | 8 | V1s |
| | LIPs | 4 | 4 | 8 | 8 | MtPos ObjVel SaccadePlan EyePos LIPp |
| | LIPd | 4 | 4 | 8 | 8 | LIPs* LIPp ObjVel Saccade EyePos |
| | LIPp | 1 | 1 | 8 | 8 | MtPos* V1s LIPd |
| V3 | V3s | 10 | 10 | 4 | 4 | V2s V4s TEOs DPs LIPs V1p V1hp DPp TEOd |
| | V3d | 10 | 10 | 4 | 4 | V3s* V1p V1hp DPp LIPd DPd V4d V4s DPs TEOs |
| | V3p | 10 | 10 | 4 | 4 | V3s* V2d DPd TEOd |
| DP | DPs | 10 | 10 | | | V2s V3s TEOs V1p V1hp V3p TEOp |
| | DPd | 10 | 10 | | | DPs* V1p V1hp DPp TEOd |
| | DPp | 10 | 10 | | | DPs* V2d V3d DPd TEOd |
| V4 | V4s | 10 | 10 | 4 | 4 | V2s TEOs V1p V1hp |
| | V4d | 10 | 10 | 4 | 4 | V4s* V1p V1hp V4p TEOd TEOs |
| | V4p | 10 | 10 | 4 | 4 | V4s* V2d V3d V4d TEOd |
| TEO | TEOs | 10 | 10 | 4 | 4 | V4s V1p V1hp TEs |
| | TEOd | 10 | 10 | 4 | 4 | TEOs* TEOd* V1p V1hp V4p TEOp TEp TEd |
| | TEOp | 10 | 10 | 4 | 4 | TEOs* V3d V4d TEOd TEd |
| TE | TEs | 10 | 10 | 4 | 4 | TEOs V1p V1hp |
| | TEd | 10 | 10 | 4 | 4 | TEs* TEd* V1p V1hp V4p TEOp TEp TEOd |
| | TEp | 10 | 10 | 4 | 4 | TEs* V3d V4d TEOd |

Table 1: Layer sizes, showing numbers of units in one pool (or entire layer if Pool is missing), and the number of Pools of such units, along X,Y axes. Each area has three associated layers: s = superficial layer, d = deep layer (context updated by 5IB neurons in same area, marked with *), p = pulvinar layer (driven by 5IB neurons from associated area, marked with *).

Layer Sizes and Structure

Figure 2 in the main text shows the general configuration of the model, and Table 1 shows the specific sizes of each of the layers, and where they receive inputs from.

All the activation and general learning parameters in the model are at their standard Leabra defaults.

Projections

The general principles and patterns of connectivity are shown in Figure 13 (and Figure 1 in the main text). As noted in the main text, the connectivity and overall structure obey the established principles identified in neocortical anatomy Felleman and Van Essen (1991); Markov, Ercsey-Ravasz, et al. (2014);

[Figure 13 image, panels a to e. Labels: Feedforward, Feedback; Super (2,3), Layer 4, Deep (5,6); Axon Terminals, Neurons; Activation Flow; areas V1, V2, V3, V4, TEO, MT, LIP, FEF with s, d, and p layers; saccade plan; Where, What * Where, and What pathways.]

Figure 13: Principles of connectivity in DeepLeabra. a) Markov et al (2014) data showing density of retrograde labeling from a given injection in a middle-level area (d): most feedforward projections originate from superficial layers of lower areas (a,b,c) and deep layers predominantly contribute to feedback (and more strongly for longer-range feedback). b) Summary diagram showing most feedforward connections originating in superficial layers of lower area, and terminating in layer 4 of higher area, while feedback connections can originate in either superficial or deep layers, and in both cases terminate in both superficial and deep layers of the lower area (adapted from Felleman & Van Essen, 1991). c) Anatomical hierarchy as determined by percentage of superficial layer source labeling (SLN) by Markov et al (2014); the hierarchical levels are well matched for our model, but we functionally divide the dorsal pathway (shown in green background) into the two separable components of a Where and a What * Where integration pathway. d) Superficial and deep-layer connectivity in the model. Note the repeating motif between hierarchically-adjacent areas, with bidirectional connectivity between superficial layers, and feedback into deep layers from both higher-level superficial and deep layers, according to the canonical pattern shown in panels a and b. Special patterns of connectivity from TEO to V3 and V2, involving crossed super-to-deep and deep-to-super pathways, provide top-down support for predictions based on high-level object representations. e) Connectivity for deep layers and pulvinar in the model, which generally mirror the corticocortical pathways (in d). Each pulvinar layer (p) receives 5IB driving inputs from the labeled layer (e.g., V1p receives 5IB drivers from V1). In reality these neurons are more distributed throughout the pulvinar, but it is computationally convenient to organize them together as shown. Deep layers (d) provide predictive input into pulvinar, and pulvinar projections send error signals (via temporal differences between predictions and actual state) to both deep and superficial layers of given areas (only d shown). Most areas send deep-layer prediction inputs into the main V1p prediction layer, and receive reciprocal error signals therefrom. The strongest constraint we found was that pulvinar outputs (colored green) must generally project only to higher areas, not to lower areas, with the exceptions of DPp → V3 and LIPp → V2. V2p was omitted because it is largely redundant with V1p in this simple model.

Where pathway pre-training compared to purely random initial weights. In addition to exploring different patterns of overall connectivity, we also explored differences in the relative strengths of receiving projections, which can be set with a wt_scale.rel parameter in the simulator. All feedforward pathways have a default strength of 1. For the feedback projections, which are typically weaker (consistent with the biology), we explored a discrete range of strengths, typically .5, .2, .1, and .05. The strongest top-down projections were into V2s from LIP and V3, while most others were .2 or .1. Likewise projections from the pulvinar were weaker, typically .1. These differences in strength sometimes had large effects on performance during the initial bootstrapping of the overall model structure, but in the final model they are typically not very consequential for any individual projection.
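The effect of these relative strengths can be illustrated with a small sketch. It assumes, as in standard Leabra, that the relative strengths of all projections into a layer are normalized by their sum, so that each value sets a projection's share of the total net input. The function name and the example projection labels are hypothetical and are not taken from the model code.

```python
def effective_shares(rel_strengths):
    """Return each projection's share of net input, assuming sum normalization."""
    total = sum(rel_strengths.values())
    if total <= 0:
        raise ValueError("relative strengths must sum to a positive value")
    return {name: rel / total for name, rel in rel_strengths.items()}


# Hypothetical receiving layer: one feedforward projection at the default
# strength of 1, two feedback projections, and one pulvinar projection.
example = {"feedforward": 1.0, "feedback_a": 0.2, "feedback_b": 0.1, "pulvinar": 0.1}
shares = effective_shares(example)
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

Under this assumption, a single feedback projection at .1 alongside one feedforward projection at 1 would contribute roughly 9% of the net input (0.1 / 1.1), which is one way to read why individual weak projections matter little in the final model.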

Training Parameters

Training typically consisted of 512 alpha trials per epoch (51.2 seconds of real time equivalent), for 1,000 such epochs. Each trial was generated from a virtual reality environment in the emergent simulator, that rendered first-person views with moving eye position onto the object tumbling through space with fixed motion and rotation parameters over the sequence of 8 frames (see Figure 2c in main text for representative example). Because the start of each sequence of 8 frames is unpredictable, we turned off learning for that trial, which improves learning overall. We have recently developed an automatic version of this mechanism based on the running average (and running variance) of the prediction error, where we turn off learning whenever the current prediction error z-normalized by these running average values is below 1.5 standard deviations, which works well, and will be incorporated into future models. Biologically, this could correspond to a connection between pulvinar and neuromodulatory areas that could regulate the effective learning rate in this way.

Figure 14a shows the learning trajectory of the model, indicating that it learns quite rapidly. This rapid initial learning is likely facilitated by the extensive use of shortcut connections converging from all over the simulated visual system onto the V1 pulvinar layers, and direct projections back from these pulvinar layers. Thus, error signals are directly communicated and can drive learning quickly and efficiently. However, there are also extensive indirect, bidirectional connections among the superficial layers, which can drive indirect error backpropagation learning as well.

[Figure 14 image. Panel a: x-axis epoch (100, 500, 1000); y-axis Prediction Accuracy (Correlation), 0 to 1; curves V1p (low res) and V1hp (high res). Panel b: 200 Epochs. Panel c: 1,000 Epochs.]

Figure 14: a) Predictive learning curve for DeepLeabra, showing the correlation between prediction and actual over the two different V1 layers. Initial learning is quite rapid, followed by a slower but progressive learning process that reflects development of the IT representations (e.g., manipulations that interfere with those areas selectively impair this part of the learning curve). Overall prediction accuracy remains far from perfect, as shown in Figure 2c in main text, and significantly worse than the backpropagation-based models. This is a typical finding from Leabra models, which are significantly more constrained as a result of bidirectional attractor dynamics, Hebbian learning, and inhibitory competition, i.e., the very things that are likely important for forming abstract categorical representations. b) Similarity matrix over TEs layer at 200 epochs, which has less contrast and definition compared to the 1,000 epoch result (c, also shown in Figure 3a in main text).

Testing Parameters

To be able to monitor similarity metrics as the model trained, we used a running-average integration of neural activity across trials to accumulate the patterns used for the RSA analysis described above. Specifically, for each object, and each frame, the current activation pattern across each layer was recorded and averaged unit-by-unit with a time constant of τ = 10. Critically, by integrating separately for each frame, this running-average computation did not introduce any bias for temporally-adjacent frames to be more similar. Nevertheless, when we computed the frame-to-frame similarities for TE, they were quite high (.901 correlation on average across all objects).

Model Algorithms

The biologically-based model was implemented using the Leabra framework, which is described in detail in previous publications O’Reilly (1996, 1998); O’Reilly and Munakata (2000); O’Reilly et al. (2012), and summarized here. There are two main implementations of Leabra, one in the C++ emergent software, and a new one using the Go and Python languages at: https://github.com/emer/leabra . There are also other simpler implementations in Python and MATLAB, see https://grey.colorado.edu/emergent/index.php/Leabra . Both of the preceding links contain a full detailed description of the algorithm. These same equations and standard parameters have been used to simulate over 40 different models in O’Reilly and Munakata (2000); O’Reilly et al. (2012), and a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms, instead of constructing new mechanisms for each model. Here, we only detail properties of the predictive learning algorithm that go beyond the basic Leabra model.

Deep Context

At the end of every plus phase, a new deep-layer context net input is computed from the dot product of the context weights times the sending activations, just as in the standard net input:

$$\eta = \langle x_i w_{ij} \rangle = \frac{1}{n} \sum_i x_i w_{ij} \qquad (2)$$

This net input is then added in with the standard net input at each cycle of processing.

The relative strength of these context layer inputs was set progressively larger for higher layers in the network, with a maximum of 4 in V4, TEO, and TE. In addition, TEO and TE received self context projections which provide an extended window of temporal context into the prior 200 msec interval, consistent with multiple sources of neural data Chaudhuri et al. (2015). These self projections were connected only within the narrower Pool level of units, enabling these neurons to develop mutually-excitatory loops to sustain activations over the multiple trials when the same object was present. We hypothesize that these modifications correspond to biological adaptations in IT cortex that likewise support greater sustained activation of object-level representations.

Learning of the context weights occurs as normal, but using the sending activation states from the prior time step’s activation.

Computational and Biological Details of SRN-like Functionality

Predictive auto-encoder learning has been explored in various frameworks, but the most relevant to our model comes from the application of the SRN to a range of predictive learning domains J. Elman et al. (1996); J. L. Elman (1990). One of the most powerful features of the SRN is that it enables error-driven learning, instead of arbitrary parameter settings, to determine how prior information is integrated with new information. Thus, SRNs can learn to hold onto some important information for a relatively long interval, while rapidly updating other information that is only relevant for a shorter duration. This same flexibility is present in our DeepLeabra model. Furthermore, because this temporal context information is hypothesized to be present in the deep layers throughout the entire neocortex (in every microcolumn of tissue), the DeepLeabra model provides a more pervasive and interconnected form of temporal integration compared to the SRN, which typically just has a single temporal context layer associated with the internal “hidden” layer of processing units.

An extensive computational analysis of what makes the SRN work as well as it does, and explorations of a range of possible alternative frameworks, has led us to an important general principle: subsequent outcomes determine what is relevant from the past. At some level, this may seem obvious, but it has significant implications for predictive learning mechanisms based on temporal context. It means that the information encoded in a temporal context representation cannot be learned at the time when that information is presently active. Instead, the relevant contextual information is learned on the basis of what happens next.
This explains the peculiar power of the otherwise strange property of the SRN: the temporal context information is preserved as a direct copy of the state of the hidden layer units on the previous time step (Figure 15), and then learned synaptic weights integrate that copied context information into the next hidden state (which is then copied to the context again, and so on). This enables the error-driven learning taking place in the current time step to determine how context information from the previous time step is integrated. And the simple direct copy operation eschews any attempt to shape this temporal context itself, instead relying on the learning pressure that shapes the hidden layer representations to also shape the context representations. In other words, this copy operation is essential, because there is no other viable source of learning signals to shape the nature of the context representation itself (because these learning signals require future outcomes, which are by definition only available later).

[Figure 15 image. Panel a) SRN: deep copies & holds; super integrates. Panel b) DeepLeabra: deep integs & holds; super adds. Labels: Input t, Super, Deep t-1 (Context), (predict next), (copy), (add), (integ).]

Figure 15: How the DeepLeabra temporal context computation compares to the SRN mathematically. a) In a standard SRN, the context (deep layer biologically) is a copy of the hidden activations from the prior time step, and these are held constant while the hidden layer (superficial) units integrate the context through learned synaptic weights. b) In DeepLeabra, the deep layer performs the weighted integration of the soon-to-be context information from the superficial layer, and then holds this integrated value, and feeds it back as an additive net-input like signal to the superficial layer. The context net input is pre-computed, instead of having to compute this same value over and over again. This is more efficient, and more compatible with the diffuse interconnections among the deep layer neurons. Layer 6 projections to the thalamus and back recirculate this pre-computed net input value into the superficial layers (via layer 4), and back into itself to support maintenance of the held value.

The direct copy operation of the SRN is however seemingly problematic from a biological perspective: how could neurons copy activations from another set of neurons at some discrete point in time, and then hold onto those copied values for a duration of 100 msec, which is a reasonably long period of time in neural terms (e.g., a rapidly firing cortical neuron fires at around 100 Hz, meaning that it will fire 10 times within that context frame)? However, there is an important transformation of the SRN context computation, which is more biologically plausible, and compatible with the structure of the deep network (Figure 15). Specifically, instead of copying an entire set of activation states, the context activations (generated by the phasic 5IB burst) are immediately sent through the adaptive synaptic weights that integrate this information, which we think occurs in the 6CC (corticocortical) and other lateral integrative connections from 5IB neurons into the rest of the deep network.
The result is a pre-computed net input from the context onto a given hidden unit (in the original SRN terminology), not the raw context information itself. Computationally, and metabolically, this is a much more efficient mechanism, because the context is, by definition, unchanging over the 100 msec alpha cycle, and thus it makes more sense to pre-compute the synaptic integration, rather than re-computing this same synaptic integration over and over again (in the original feedforward backpropagation-based SRN model, this issue did not arise because a single step of activation updating took place for each context update, whereas in our bidirectional model many activation update steps must take place per context update).

There are a couple of remaining challenges for this transformation of the SRN. First, the pre-computed net input from the context must somehow persist over the subsequent 100 msec period of the alpha cycle. We hypothesize that this can occur via NMDA and mGluR channels that can easily produce sustained excitatory currents over this time frame. Furthermore, the reciprocal excitatory connectivity from 6CT to TRC and back to 6CT could help to sustain the initial temporal context signal. Second, these contextual integration synapses require a different form of learning algorithm that uses the sending activation from the prior 100 msec, which is well within the time constants in the relevant calcium and second messenger pathways involved in synaptic plasticity.

[Figure 16 image. x-axis: epoch (100, 500, 1000); y-axis: Prediction Accuracy (Correlation), 0 to 1; curves: V1p (low res), V1hp (high res).]

Figure 16: Learning curves for the backpropagation version of the WWI model. Although it achieves better predictive accuracy than the DeepLeabra version, it fails to acquire abstract object category structure, indicating a potential tradeoff between simplifying and categorizing inputs, versus predicting precisely where the low-level visual features will move.
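The relationship between the SRN copy operation and the pre-computed context net input can be sketched in a few lines of plain Python. This is an illustration only, not the model's implementation: the logistic activation, the function names, and the small example values are assumptions made for the sketch, and the context net input follows the averaged dot product of Equation 2.

```python
import math


def logistic(v):
    """Stand-in activation function (an assumption; not the Leabra activation)."""
    return 1.0 / (1.0 + math.exp(-v))


def context_net_input(sending, weights):
    """Averaged dot product: eta_j = (1/n) * sum_i x_i * w_ij."""
    n = len(sending)
    n_recv = len(weights[0])
    return [sum(x * row[j] for x, row in zip(sending, weights)) / n
            for j in range(n_recv)]


def srn_style(prev_state, weights, ff_by_cycle):
    """Hold a raw copy of the previous state; re-integrate it on every cycle."""
    context_copy = list(prev_state)
    acts = []
    for ff in ff_by_cycle:
        ctx = context_net_input(context_copy, weights)  # recomputed each cycle
        acts.append([logistic(f + c) for f, c in zip(ff, ctx)])
    return acts


def deep_style(prev_state, weights, ff_by_cycle):
    """Integrate the previous state once, then add the held net input each cycle."""
    held_ctx = context_net_input(prev_state, weights)  # computed once per alpha cycle
    return [[logistic(f + c) for f, c in zip(ff, held_ctx)] for ff in ff_by_cycle]


prev_state = [0.2, 0.9, 0.5]                        # activations from the prior time step
weights = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.6]]    # weights[i][j]: sender i to receiver j
ff_by_cycle = [[0.0, 0.1], [0.3, 0.2], [0.5, 0.4]]  # other net input on three cycles

same = srn_style(prev_state, weights, ff_by_cycle) == deep_style(prev_state, weights, ff_by_cycle)
print(same)
```

Both functions yield the same activations on every cycle; the difference is that the second performs the weighted integration once and holds the result, which is the efficiency argument made above for a bidirectional model with many update cycles per context update.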

Backpropagation Model Methods

The backpropagation version of the WWI model has exactly the same layer sizes and feedforward patterns of connectivity as the DeepLeabra version. Topographically, the V1p and V1hp pulvinar layers serve as output layers at the highest level of the network, receiving all the various connections from deep layers as shown in Table 1. Likewise, the LIPp served as a target output layer for the Where pathway. To achieve predictive learning, the V1 pulvinar targets were from the scene at time t, while the V1s inputs were from the scene at time t − 1. We also ran a comparison auto-encoder model that had inputs and target outputs from the same time step, and it showed even less systematic organization of its higher-level representations, further supporting the notion that predictive learning is important, across all frameworks. The learning curve for the predictive version is shown in Figure 16, which shows better overall prediction accuracy compared to the DeepLeabra model. However, as the RSA showed, this backpropagation model failed to learn object categories that go beyond the input similarity structure, indicating that perhaps it was paying too much “attention” in learning to this low-level structure, and lacked the necessary mechanisms to enable it to impose a simplifying higher-level structure on top of these inputs.

PredNet Model Methods

The PredNet architecture was designed to incorporate principles from predictive coding theory into a neural network model for predicting the next frame in a video sequence. Details of the model can be found in the original paper Lotter et al. (2016), but here we provide a brief overview of the architecture.

Architecture

PredNet is a deep convolutional neural network that is composed of layers containing discrete modules. The lowest layer generates a prediction of incoming inputs (i.e. the pixels in the next frame), while each of the higher layers attempts to predict the errors made by the previous layer. Each layer contains an input convolutional module ($A_l$), a recurrent representational module ($R_l$), a prediction module ($\hat{A}_l$), and a representation of its own errors ($E_l$). The input convolutional module ($A_l$) transforms its input with a set of standard convolutional filters, a rectified linear activation function, and a max-pooling operation. The recurrent representation module ($R_l$) is a convolutional LSTM, which is a recurrent convolutional network that replaces the matrix multiplications in the standard LSTM equations with convolutions, allowing it to maintain a spatially organized representation of its inputs over time. The prediction module ($\hat{A}_l$) consists of another standard convolutional layer and rectified linear activation that is used to generate predictions from the output of $R_l$. These predictions are then compared against the output of the input convolutional module ($A_l$). The errors generated in this comparison are represented explicitly in $E_l$, which applies a rectified linear activation to a concatenation of the positive ($A_l - \hat{A}_l$) and negative ($\hat{A}_l - A_l$) prediction errors. These errors then become the inputs to the next layer.

$$A_l^t = \begin{cases} x_t, & \text{if } l = 0 \\ \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}(E_{l-1}^t))), & \text{if } l > 0 \end{cases} \qquad (3)$$

$$\hat{A}_l^t = \mathrm{ReLU}(\mathrm{Conv}(R_l^t)) \qquad (4)$$

$$E_l^t = [\mathrm{ReLU}(A_l^t - \hat{A}_l^t);\ \mathrm{ReLU}(\hat{A}_l^t - A_l^t)] \qquad (5)$$

$$R_l^t = \mathrm{ConvLSTM}(E_l^{t-1}, R_l^{t-1}, \mathrm{UpSample}(R_{l+1}^t)) \qquad (6)$$

At each time step in the video sequence, PredNet generates a prediction of the next frame. This is done as follows: first, the $R_l$ is computed for each layer starting from the top of the hierarchy (because each $R_l^t$ depends on input from $R_{l+1}^t$), and then the $A_l^t$, $\hat{A}_l^t$ and $E_l^t$ are computed in a feed-forward fashion (because each $A_l^t$ depends on input from the layer below, $E_{l-1}^t$).

All analyses in the RSA were conducted using the representations from the $R_l$ layers.

[Figure 17 image. x-axis: epoch (100, 500, 1000); y-axis: Prediction Accuracy (Correlation), 0 to 1.]

Figure 17: Learning curves for the PredNet model. This model achieves the best overall prediction performance but also has the least well differentiated, categorical representations.

Implementation details
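To make the error computation of Equation 5 and the two-pass update order concrete, the following is a minimal plain-Python sketch. It is not the PyTorch code used for the experiments: flat lists stand in for feature maps, the convolution, pooling, and ConvLSTM operations are omitted, and the function names are hypothetical.

```python
def relu(values):
    """Rectified linear activation applied elementwise."""
    return [max(0.0, v) for v in values]


def error_units(target, prediction):
    """Equation 5: concatenate rectified positive and negative prediction errors."""
    positive = relu([a - p for a, p in zip(target, prediction)])
    negative = relu([p - a for a, p in zip(target, prediction)])
    return positive + negative


def update_order(num_layers):
    """Module update order within one time step.

    R is updated from the top layer down, then A, A_hat and E are
    computed from the bottom layer up.
    """
    top_down = [("R", layer) for layer in reversed(range(num_layers))]
    bottom_up = [(name, layer)
                 for layer in range(num_layers)
                 for name in ("A", "A_hat", "E")]
    return top_down + bottom_up


errors = error_units([1.0, 0.0], [0.25, 0.5])
print(errors)
print(update_order(2))
```

Splitting the error into separate positive and negative rectified parts doubles the number of error channels, which is why the concatenation in Equation 5 matters for the input size of the next layer.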

All experiments with the PredNet architecture were performed using PyTorch. An informal hyperparameter search was conducted to find the settings that maximized representational similarity to the human judgments. This was done by conducting RSA on each layer for each hyperparameter setting, and computing, according to the Centroid categories derived from the human data, the difference between the average within-category similarity and the average between-category similarity. Our final architecture had 6 layers with 3, 16, 32, 64, 128, and 256 filters in the $A_l$ and $R_l$ modules, and 3x3 kernels throughout the whole network. We also found that using sigmoid and tanh activation functions in fully-connected convolutional LSTMs slightly improved performance, so these were used for all experiments.

The weights in the PredNet model are trained using error backpropagation. Predictions are generated and errors are computed at all levels of the hierarchy, but the model performs better when only the lowest layer’s errors are backpropagated Lotter et al. (2016). We confirmed these results with experiments that backpropagated the errors in higher layers, in which performance (in terms of mean squared error) was marginally reduced but the RSA results were similar. For this reason, all reported experiments used a PredNet that was trained by only backpropagating the lowest level error.

The model was trained using a batch size of 8 and an Adam optimizer with a learning rate of 0.0001, with no scheduler, for 150,000 batches. A training curve is shown in Figure 17, showing that it achieves the best overall prediction accuracy of any model we tested, and yet does not have representations that are as differentiated or categorical as our biologically based model, as shown in the main paper.

[Figure 18 image. Title: Effect of dropout on RSA. x-axis: pixels, layer1 to layer6; y-axis: within-between correlation (0.000 to 0.175); series: Dropout = 0.0, Dropout = 0.1, Dropout = 0.5.]

Figure 18: Effect of dropout in PredNet on RSA, as measured by the difference between the average within-category correlation and the average between-category correlation (using the Centroid categories derived from human data). Dropout marginally improves the category structure learned in PredNet.

Regularization experiments

As discussed in the main paper, our biologically based model includes a number of important biologically motivated properties that may be contributing to the development of its categorical representations. These properties, including excitatory bidirectional connections, inhibitory competition, and an additional form of Hebbian learning, may be acting as regularizers that encourage categorical learning. We therefore tested whether standard regularization methods used in deep learning would have similar effects on the representations developed in the PredNet architecture. We tested 1) batch normalization, 2) dropout (0.1, 0.3, and 0.5), and 3) weight decay (0.01, 0.001, 0.0001, 0.00001). All experiments with batch normalization and weight decay showed reduced performance (in terms of both prediction error on the test set and within-category correlation). As shown in Figure 18, dropout marginally improved the within-category correlation while also slightly improving prediction accuracy, so a dropout rate of 0.1 was used for the comparison to our biologically based model in the main paper.
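The within-category minus between-category score used in the hyperparameter search and plotted in Figure 18 can be sketched as follows. This is a minimal illustration under stated assumptions: Pearson correlation is assumed as the similarity measure, the function names are hypothetical, and the toy patterns and category labels are invented for the example rather than taken from the experiments.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation between two equal-length activation patterns.

    Assumes neither pattern is constant (otherwise the denominator is zero).
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def within_minus_between(patterns, categories):
    """Mean within-category correlation minus mean between-category correlation."""
    within, between = [], []
    for i, j in combinations(range(len(patterns)), 2):
        r = pearson(patterns[i], patterns[j])
        if categories[i] == categories[j]:
            within.append(r)
        else:
            between.append(r)
    return sum(within) / len(within) - sum(between) / len(between)


# Toy example: two objects per category, with opposite patterns across categories.
patterns = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0], [6.0, 4.0, 2.0]]
categories = ["a", "a", "b", "b"]
score = within_minus_between(patterns, categories)
print(round(score, 6))
```

A larger score indicates that a layer's representations group objects by the given categories more strongly than by chance similarity; in this toy case the within-category pairs are perfectly correlated and the between-category pairs perfectly anti-correlated.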

Deep Predictive Learning * ReferencesAbbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997, December). Synaptic depression and corticalgain control.

Science , , 220.Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985, December). A learning algorithm for Boltzmannmachines. Cognitive Science , (1), 147–169.Antonov, P. A., Chakravarthi, R., & Andersen, S. K. (2020, October). Too little, too late, and in the wrongplace: Alpha band activity does not reﬂect an active mechanism of selective attention. NeuroImage , , 117006. doi: 10.1016/j.neuroimage.2020.117006Arcaro, M. J., Pinsk, M. A., & Kastner, S. (2015, July). The anatomical and functional organization ofthe human visual pulvinar. Journal of Neuroscience , (27), 9848–9871. doi: 10.1523/JNEUROSCI.1575-14.2015Ashby, F. G., & Maddox, W. T. (2011, April). Human Category Learning 2.0. Annals of the New YorkAcademy of Sciences , , 147–161. doi: 10.1111/j.1749-6632.2010.05874.xBarczak, A., O’Connell, M. N., McGinnis, T., Ross, D., Mowery, T., Falchier, A., & Lakatos, P. (2018,August). Top-down, contextual entrainment of neuronal oscillations in the auditory thalamocorticalcircuit. Proceedings of the National Academy of Sciences , (32), E7605-E7614. doi: 10.1073/pnas.1714684115Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012, November).Canonical microcircuits for predictive coding. Neuron , (4), 695–711. Retrieved from Bastos, A. M., Vezoli, J., Bosman, C. A., Schoffelen, J.-M., Oostenveld, R., Dowdall, J. R., . . . Fries,P. (2015, January). Visual Areas Exert Feedforward and Feedback Inﬂuences through Distinct Fre-quency Channels.

Neuron , (2), 390–401. doi: 10.1016/j.neuron.2014.12.018Bender, D. B. (1982, July). Receptive-ﬁeld properties of neurons in the macaque inferior pulvinar. Journal ofneurophysiology , . Retrieved from Bender, D. B., & Youakim, M. (2001, January). Effect of attentive ﬁxation in macaque thalamus and cortex.

Journal of neurophysiology , , 219–234. Retrieved from Bengio, Y., Mesnard, T., Fischer, A., Zhang, S., & Wu, Y. (2017, January). STDP-compatible approximationof backpropagation in an energy-based model.

Neural Computation , (3), 555–577. doi: 10.1162/NECO a 00934Bengio, Y., Yao, L., Alain, G., & Vincent, P. (2013). Generalized Denoising Auto-Encoders as Gen-erative Models. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger(Eds.), Advances in Neural Information Processing Systems 26 (pp. 899–907). Curran Associates,Inc. Retrieved 2017-05-15, from http://papers.nips.cc/paper/5023-generalized-denoising-auto-encoders-as-generative-models.pdf

Berger, H. (1929, December). ¨Uber das Elektrenkephalogramm des Menschen.

Archiv f¨ur Psychiatrie undNervenkrankheiten , (1), 527–570. doi: 10.1007/BF01797193Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982, March). Theory for the development of neuronselectivity: Orientation speciﬁcity and binocular interaction in visual cortex. The Journal of Neuro-science , (2), 32–48. Retrieved from Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In

Metacog-nition: Knowing about knowing (pp. 185–205). Cambridge, MA, US: The MIT Press.Bortone, D. S., Olsen, S. R., & Scanziani, M. (2014, April). Translaminar inhibitory cells recruited by layer6 corticothalamic neurons suppress visual cortex.

Neuron , . Retrieved from ’Reilly et al. .nlm.nih.gov/pubmed/24656931 Bourne, J. A., & Rosa, M. G. P. (2006, March). Hierarchical development of the primate visual cortex,as revealed by neuroﬁlament immunoreactivity: Early maturation of the middle temporal area (MT).

Cerebral Cortex , (3), 405–414. doi: 10.1093/cercor/bhi119Brette, R., & Gerstner, W. (2005, November). Adaptive exponential integrate-and-ﬁre model as an effectivedescription of neuronal activity. Journal of Neurophysiology , (5), 3637–3642. doi: 10.1152/jn.00686.2005Bridge, H., Leopold, D. A., & Bourne, J. A. (2016, February). Adaptive Pulvinar Circuitry Supports VisualCognition. Trends in Cognitive Sciences , (2), 146–157. doi: 10.1016/j.tics.2015.10.003Buffalo, E. A., Fries, P., Landman, R., Buschman, T. J., & Desimone, R. (2011, July). Laminar differencesin gamma and alpha coherence in the ventral stream. Proceedings of the National Academy of Sciencesof the United States of America , (27), 11262–11267. Retrieved from Busch, N. A., Dubois, J., & VanRullen, R. (2009, June). The phase of ongoing EEG oscillations predictsvisual perception.

The Journal of Neuroscience , (24), 7869–7876. Retrieved from Buzs´aki, G. (2005). Theta rhythm of navigation: Link between path integration and landmark navigation,episodic and semantic memory.

Hippocampus , (7), 827–840. doi: 10.1002/hipo.20113Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., . . . DiCarlo, J. J. (2014,December). Deep Neural Networks Rival the Representation of Primate IT Cortex for Core VisualObject Recognition. PLoS Computational Biology , (12), e1003963. doi: 10.1371/journal.pcbi.1003963Cavanagh, P., Hunt, A. R., Afraz, A., & Rolfs, M. (2010, April). Visual stability based on remapping ofattention pointers. Trends in Cognitive Sciences , (4), 147–153. doi: 10.1016/j.tics.2010.01.007Chaudhuri, R., Knoblauch, K., Gariel, M.-A., Kennedy, H., & Wang, X.-J. (2015, October). A Large-ScaleCircuit Mechanism for Hierarchical Dynamical Processing in the Primate Cortex. Neuron , (2),419–431. doi: 10.1016/j.neuron.2015.09.008Clark, A. (2013, June). Whatever next? Predictive brains, situated agents, and the future of cognitivescience. Behavioral and Brain Sciences , (3), 181–204. Retrieved from Clayton, M. S., Yeung, N., & Kadosh, R. C. (2018). The many characters of visual alpha oscillations.

European Journal of Neuroscience , (7), 2498–2508. doi: 10.1111/ejn.13747Colby, C. L., Duhamel, J. R., & Goldberg, M. E. (1997, March). Visual, presaccadic, and cognitiveactivation of single neurons in monkey lateral intraparietal area. Journal of neurophysiology , ,2841. Retrieved from Connors, B. W., Gutnick, M. J., & Prince, D. A. (1982, December). Electrophysiological propertiesof neocortical neurons in vitro.

Journal of Neurophysiology , (6), 1302–1320. Retrieved from Cooper, L. N., & Bear, M. F. (2012, November). The BCM theory of synapse modiﬁcation at 30: Interactionof theory with experiment.

Nature Reviews Neuroscience , (11), 798–810. doi: 10.1038/nrn3353Crick, F. (1984, July). Function of the thalamic reticular complex: The searchlight hypothesis. Proceedingsof the National Academy of Sciences of the United States of America , , 4586–4590. Retrieved from Cricri, F., Ni, X., Honkala, M., Aksu, E., & Gabbouj, M. (2016, December). Video Ladder Net-works. arXiv:1612.01756 [cs, stat] . Retrieved 2020-06-25, from http://arxiv.org/abs/1612.01756

Dayan, P. (1993, January). Improving generalization for temporal difference learning: The successorrepresentation.

Neural Computation , (4), 613–624. Retrieved from http://cognet.mit.edu/journal/10.1162/neco.1993.5.4.613

Dayan, P., Hinton, G. E., Neal, R. N., & Zemel, R. S. (1995, January). The Helmholtz machine.

NeuralComputation , (5), 889-904.de Lange, F. P., Heilbron, M., & Kok, P. (2018, September). How do expectations shape perception? Trendsin Cognitive Sciences , (9), 764–779. doi: 10.1016/j.tics.2018.06.002Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review ofNeuroscience , (1), 193–222. doi: 10.1146/annurev.ne.18.030195.001205Duhamel, J. R., Colby, C. L., & Goldberg, M. E. (1992, April). The updating of the representation of visualspace in parietal cortex by intended eye movements. Science , (5040), 90–92. Retrieved from Elman, J., Bates, E., Karmiloff-Smith, A., Johnson, M., Parisi, D., & Plunkett, K. (1996).

Rethinking Innateness: A Connectionist Perspective on Development. Cambridge, MA: MIT Press.
Elman, J. L. (1990, January). Finding structure in time.

Cognitive Science , (2), 179–211.Felleman, D. J., & Van Essen, D. C. (1991, January). Distributed Hierarchical Processing in the PrimateCerebral Cortex. Cerebral Cortex , (1), 1–47. Retrieved from Fiebelkorn, I. C., & Kastner, S. (2019, February). A rhythmic theory of attention.

Trends in CognitiveSciences , (2), 87–101. doi: 10.1016/j.tics.2018.11.009Fiebelkorn, I. C., Pinsk, M. A., & Kastner, S. (2018, August). A dynamic interplay within the frontoparietalnetwork underlies rhythmic spatial attention. Neuron , (4), 842-853.e8. doi: 10.1016/j.neuron.2018.07.038Fiser, A., Mahringer, D., Oyibo, H. K., Petersen, A. V., Leinweber, M., & Keller, G. B. (2016, December).Experience-dependent spatial expectations in mouse visual cortex. Nature Neuroscience , (12),1658–1664. doi: 10.1038/nn.4385Foldiak, P. (1991, January). Learning Invariance from Transformation Sequences. Neural Computation , (2), 194–200.Foster, J. J., & Awh, E. (2019, October). The role of alpha oscillations in spatial attention: Limited evidencefor a suppression account. Current Opinion in Psychology , , 34–40. doi: 10.1016/j.copsyc.2018.11.001Franceschetti, S., Guatteo, E., Panzica, F., Sancini, G., Wanke, E., & Avanzini, G. (1995, October). Ionicmechanisms underlying burst ﬁring in pyramidal neurons: Intracellular study in rat sensorimotor cor-tex. Brain Research , (1–2), 127–139. Retrieved from Fries, P., Womelsdorf, T., Oostenveld, R., & Desimone, R. (2008, April). The Effects of Visual Stimulationand Selective Visual Attention on Rhythmic Neuronal Synchronization in Macaque Area V4.

Journalof Neuroscience , (18), 4823–4835. doi: 10.1523/JNEUROSCI.4499-07.2008Friston, K. (2005, April). A theory of cortical responses. Philosophical Transactions of the Royal So-ciety B , (1456), 815–836. Retrieved from Friston, K. (2010, February). The free-energy principle: A uniﬁed brain theory?

Nature ReviewsNeuroscience , (2), 127–138. Retrieved from Fusi, S., Miller, E. K., & Rigotti, M. (2016, April). Why neurons mix: High dimensionality for highercognition.

Current Opinion in Neurobiology , , 66–74. doi: 10.1016/j.conb.2016.01.010
Gardner, M. P. H., Schoenbaum, G., & Gershman, S. J. (2018, November). Rethinking dopamine as generalized prediction error. Proceedings of the Royal Society B: Biological Sciences , (1891), 20181645. doi: 10.1098/rspb.2018.1645
Gavornik, J. P., & Bear, M. F. (2014, May). Learned spatiotemporal sequence recognition and prediction in primary visual cortex. Nature Neuroscience , (5), 732–737. doi: 10.1038/nn.3683
George, D., & Hawkins, J. (2009, October). Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology , (10). Retrieved from
Goodale, M. A., & Milner, A. D. (1992, January). Separate visual pathways for perception and action.

Trends in Neurosciences , (1), 20–25.Gottlieb, J. P., Kusunoki, M., & Goldberg, M. E. (1998, February). The representation of visual salience inmonkey parietal cortex. Nature , , 481. Retrieved from Grill-Spector, K., Henson, R., & Martin, A. (2006, January). Repetition and the brain: Neural models ofstimulus-speciﬁc effects.

Trends in Cognitive Sciences , (1), 14–23. doi: 10.1016/j.tics.2005.11.006Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention, and grouping by the laminarcircuits of visual cortex. Spatial vision , . Retrieved from Gruber, W. R., Klimesch, W., Sauseng, P., & Doppelmayr, M. (2005, April). Alpha Phase SynchronizationPredicts P1 and N1 Latency and Amplitude Size.

Cerebral Cortex , (4), 371–377. doi: 10.1093/cercor/bhh139Gundlach, C., Moratti, S., Forschack, N., & M¨uller, M. M. (2020, May). Spatial Attentional SelectionModulates Early Visual Stimulus Processing Independently of Visual Alpha Modulations. CerebralCortex , (6), 3686–3703. doi: 10.1093/cercor/bhz335Halassa, M. M., & Kastner, S. (2017, December). Thalamic functions in distributed cognitive control. Nature Neuroscience , (12), 1669. doi: 10.1038/s41593-017-0020-1Hawkins, J., & Blakeslee, S. (2004). On Intelligence . New York, NY: Times Books.Hennig, M. H. (2013). Theoretical models of synaptic short term plasticity.

Frontiers in Computa-tional Neuroscience , . Retrieved from Hinton, G. E., & McClelland, J. L. (1988, January). Learning representations by recirculation. InD. Z. Anderson (Ed.),

Neural Information Processing Systems (NIPS 1987) (Vol. 0, pp. 358–366).New York: American Institute of Physics. Retrieved from http://papers.nips.cc/paper/78-learning-representations-by-recirculation.pdf

Hinton, G. E., & Salakhutdinov, R. R. (2006, July). Reducing the dimensionality of data with neuralnetworks.

Science , (5786), 504–507. Retrieved from Holroyd, C. B., & Coles, M. G. H. (2002, October). The neural basis of human error processing: Reinforce-ment learning, dopamine, and the error-related negativity.

Psychological Review , (4), 679–709.Retrieved from Hopﬁeld, J. J. (1984, July). Neurons with graded response have collective computational properties likethose of two-state neurons.

Proceedings of the National Academy of Sciences USA , , 3088–3092. Retrieved from
Issa, E. B., Cadieu, C. F., & DiCarlo, J. J. (2018, November). Neural dynamics at successive stages of the ventral visual stream are consistent with hierarchical error signals. eLife , , e42870. doi: 10.7554/eLife.42870
Jaegle, A., & Ro, T. (2013, October). Direct Control of Visual Perception with Phase-specific Modulation of Posterior Parietal Cortex. Journal of Cognitive Neuroscience , (2), 422–432. doi: 10.1162/jocn_a_00494
Jaramillo, J., Mejias, J. F., & Wang, X.-J. (2019, January). Engagement of Pulvino-cortical Feedforward and Feedback Pathways in Cognitive Computations. Neuron , (2), 321-336.e9. doi: 10.1016/j.neuron.2018.11.023
Jensen, O., Bonnefond, M., Marshall, T. R., & Tiesinga, P. (2015, April). Oscillatory mechanisms of feedforward and feedback visual processing.

Trends in Neurosciences , (4), 192–194. doi: 10.1016/j.tins.2015.02.006Jensen, O., Bonnefond, M., & VanRullen, R. (2012, April). An oscillatory mechanism for prioritizingsalient unattended stimuli. Trends in Cognitive Sciences , (4), 200–206. Retrieved from Jensen, O., & Mazaheri, A. (2010). Shaping functional architecture by oscillatory alpha activity: Gating byinhibition.

Frontiers in Human Neuroscience , (186). doi: 10.3389/fnhum.2010.00186Jordan, M. I. (1989, January). Serial Order: A Parallel, Distributed Processing Approach. In J. L. Elman &D. E. Rumelhart (Eds.), Advances in Connectionist Theory: Speech.

Hillsdale, NJ: Lawrence Erlbaum Associates.
Kachergis, G., Wyatte, D., O’Reilly, R. C., de Kleijn, R., & Hommel, B. (2014, November). A continuous-time neural model for sequential action.

Philosophical Transactions of the Royal Society B: BiologicalSciences , (1655), 20130623. doi: 10.1098/rstb.2013.0623Kahana, M. J., Seelig, D., & Madsen, J. R. (2001, December). Theta returns. Current Opinion in Neurobi-ology , (6), 739–744. doi: 10.1016/s0959-4388(01)00278-1Kawato, M., Hayakawa, H., & Inui, T. (1993, January). A forward-inverse optics model of reciprocalconnections between visual cortical areas. Network: Computation in Neural Systems , (4), 415–422.doi: 10.1088/0954-898X 4 4 001Keitel, C., Keitel, A., Benwell, C. S. Y., Daube, C., Thut, G., & Gross, J. (2019, April). Stimulus-DrivenBrain Rhythms within the Alpha Band: The Attentional-Modulation Conundrum. Journal of Neuro-science , (16), 3119–3129. doi: 10.1523/JNEUROSCI.1633-18.2019Kelly, S. P., Lalor, E. C., Reilly, R. B., & Foxe, J. J. (2006, June). Increases in Alpha Oscillatory PowerReﬂect an Active Retinotopic Mechanism for Distracter Suppression During Sustained VisuospatialAttention. Journal of Neurophysiology , (6), 3844–3851. doi: 10.1152/jn.01234.2005Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014, November). Deep Supervised, but Not Unsupervised,Models May Explain IT Cortical Representation. PLOS Computational Biology , (11), e1003915.doi: 10.1371/journal.pcbi.1003915Kiorpes, L., Price, T., Hall-Haro, C., & Anthony Movshon, J. (2012, June). Development of sensitivity toglobal form and motion in macaque monkeys (Macaca nemestrina). Vision Research , , 34–42. doi:10.1016/j.visres.2012.04.018Klimesch, W. (2011, August). Evoked alpha and early access to the knowledge system: The P1 inhibitiontiming hypothesis. Brain Research , , 52–71. doi: 10.1016/j.brainres.2011.06.003Klimesch, W., Sauseng, P., & Hanslmayr, S. (2007, January). EEG alpha oscillations: The inhibition-timinghypothesis. Brain Research Reviews , (1), 63–88. doi: 10.1016/j.brainresrev.2006.06.003Kobatake, E., & Tanaka, K. (1994, January). Neuronal selectivities to complex object features in the ventralvisual pathway. 
Journal of Neurophysiology , (3), 856–867.Kogo, N., & Trengove, C. (2015). Is predictive coding theory articulated enough to be testable? Frontiersin Computational Neuroscience , . doi: 10.3389/fncom.2015.00111Kok, P., & de Lange, F. P. (2015). Predictive Coding in Sensory Cortex. In An Introduction to Model-BasedCognitive Neuroscience (pp. 221–244). Springer, New York, NY. doi: 10.1007/978-1-4939-2236-911Kok, P., Jehee, J. F. M., & de Lange, F. P. (2012, July). Less Is More: Expectation Sharpens Representationsin the Primary Visual Cortex.

Neuron , (2), 265–270. doi: 10.1016/j.neuron.2012.04.034
Komura, Y., Nikkuni, A., Hirashima, N., Uetake, T., & Miyamoto, A. (2013, June). Responses of pulvinar neurons reflect a subject’s confidence in visual categorization. Nature Neuroscience , (6), 749–755. doi: 10.1038/nn.3393
Frontiers in Systems Neuroscience , (4). Retrieved from
LaBerge, D., & Buchsbaum, M. S. (1990, March). Positron emission tomographic measurements of pulvinar activity during an attention task.

The Journal of neuroscience : the ofﬁcial journal of the Societyfor Neuroscience , , 613–9. Retrieved from Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999, March). A new cellular mechanism for coupling inputsarriving at different cortical layers.

Nature , (6725), 338–341. doi: 10.1038/18686LeCun, Y., Bengio, Y., & Hinton, G. (2015, May). Deep learning. Nature , (7553), 436–444. doi:10.1038/nature14539Lee, T. S., & Mumford, D. (2003, July). Hierarchical Bayesian inference in the visual cortex. Journal ofthe Optical Society of America , (7), 1434–1448. Retrieved from Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020, June). Backpropagation andthe brain.

Nature Reviews Neuroscience , (6), 335–346. doi: 10.1038/s41583-020-0277-3Lim, S., McKee, J. L., Woloszyn, L., Amit, Y., Freedman, D. J., Sheinberg, D. L., & Brunel, N. (2015,December). Inferring learning rules from distributions of ﬁring rates in cortical neurons. NatureNeuroscience , (12), 1804–1810. doi: 10.1038/nn.4158Lotter, W., Kreiman, G., & Cox, D. (2016, May). Deep predictive coding networks for video predictionand unsupervised learning. arXiv:1605.08104 [cs, q-bio] . Retrieved 2017-08-11, from http://arxiv.org/abs/1605.08104 Luczak, A., Bartho, P., & Harris, K. D. (2013, January). Gating of sensory input by spontaneous corticalactivity.

The Journal of Neuroscience , (4), 1684–1695. Retrieved from L¨uscher, C., & Malenka, R. C. (2012, June). NMDA receptor-dependent long-term potentiation and long-term depression (LTP/LTD).

Cold Spring Harbor Perspectives in Biology , (6), a005710. doi: 10.1101/cshperspect.a005710Maier, A., Adams, G. K., Aura, C., & Leopold, D. A. (2010). Distinct Superﬁcial and Deep LaminarDomains of Activity in the Visual Cortex during Rest and Stimulation. Frontiers in Systems Neuro-science , (31). doi: 10.3389/fnsys.2010.00031Maier, A., Aura, C. J., & Leopold, D. A. (2011, February). Infragranular sources of sustained local ﬁeldpotential responses in macaque primary visual cortex. The Journal of Neuroscience , (6), 1971–1980. Retrieved from Makeig, S., Westerﬁeld, M., Jung, T. P., Enghoff, S., Townsend, J., Courchesne, E., & Sejnowski, T. J.(2002, January). Dynamic Brain Sources of Visual Evoked Responses.

Science , , 690–693.
Marino, A. C., & Mazer, J. A. (2016). Perisaccadic Updating of Visual Representations and Attentional States: Linking Behavior and Neurophysiology. Frontiers in Systems Neuroscience , . doi: 10.3389/fnsys.2016.00003
Markov, N. T., Ercsey-Ravasz, M. M., Gomes, R., R, A., Lamy, C., Magrou, L., . . . Kennedy, H. (2014, January). A Weighted and Directed Interareal Connectivity Matrix for Macaque Cerebral Cortex. Cerebral Cortex , (1), 17–36. doi: 10.1093/cercor/bhs270
Markov, N. T., Vezoli, J., Chameau, P., Falchier, A., Quilodran, R., Huissoud, C., . . . Kennedy, H. (2014, January). Anatomy of hierarchy: Feedforward and feedback pathways in macaque visual cortex: Cortical counterstreams. Journal of Comparative Neurology , (1), 225–259. doi: 10.1002/cne.23458
Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2004, March). The role of fixational eye movements in visual perception. Nature Reviews Neuroscience , (3), 229–240. doi: 10.1038/nrn1348

Martinez-Conde, S., Otero-Millan, J., & Macknik, S. L. (2013, February). The impact of microsaccades onvision: Towards a uniﬁed theory of saccadic function.

Nature Reviews Neuroscience , (2), 83–96.doi: 10.1038/nrn3405Mathewson, K., Gratton, G., Fabiani, M., Beck, D., & Ro, T. (2009). To see or not to see: Prestimulus alphaphase predicts visual awareness. The Journal of Neuroscience , (9), 2725–2732.Mathewson, K. E., Fabiani, M., Gratton, G., Beck, D. M., & Lleras, A. (2010, April). Rescuing stimulifrom invisibility: Inducing a momentary release from visual masking with pre-target entrainment. Cognition , (1), 186–191. Retrieved from Mathewson, K. E., Prudhomme, C., Fabiani, M., Beck, D. M., Lleras, A., & Gratton, G. (2012, August).Making waves in the stream of consciousness: Entraining oscillations in EEG alpha and ﬂuctuationsin visual awareness with rhythmic visual stimulation.

Journal of Cognitive Neuroscience , (12),2321–2333. doi: 10.1162/jocn a 00288Mayer, A., Schwiedrzik, C. M., Wibral, M., Singer, W., & Melloni, L. (2016, July). Expecting to Seea Letter: Alpha Oscillations as Carriers of Top-Down Sensory Predictions. Cerebral Cortex , (7),3146–3160. doi: 10.1093/cercor/bhv146Meyer, T., & Olson, C. R. (2011, November). Statistical learning of visual transitions in monkey infer-otemporal cortex. Proceedings of the National Academy of Sciences of the United States of Amer-ica , (48), 19401–19406. Retrieved from Michalareas, G., Vezoli, J., van Pelt, S., Schoffelen, J.-M., Kennedy, H., & Fries, P. (2016, January). Alpha-Beta and Gamma Rhythms Subserve Feedback and Feedforward Inﬂuences among Human VisualCortical Areas.

Neuron , (2), 384–397. doi: 10.1016/j.neuron.2015.12.018Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Reviewof Neuroscience , , 167–202. Retrieved from M¨uller, J. R., Metha, A. B., Krauskopf, J., & Lennie, P. (1999, September). Rapid adaptation in visualcortex to the structure of images.

Science (New York, N.Y.) , , 1405. Retrieved from Mumford, D. (1991, June). On the computational architecture of the neocortex.

Biological Cybernetics , (2), 135–145. doi: 10.1007/BF00202389Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of cortico-corticalloops. Biological Cybernetics , (3), 241–251. Retrieved from Nakamura, K., & Colby, C. L. (2002, March). Updating of the visual representation in monkey striate andextrastriate cortex during saccades.

Proceedings of the National Academy of Sciences of the UnitedStates of America , (6), 4026–4031. Retrieved from Neupane, S., Guitton, D., & Pack, C. C. (2016, February). Two distinct types of remapping in primatecortical area V4.

Nature Communications , , 10402. doi: 10.1038/ncomms10402Neupane, S., Guitton, D., & Pack, C. C. (2017, July). Coherent alpha oscillations link current and futurereceptive ﬁelds during saccades. Proceedings of the National Academy of Sciences , 201701672. doi:10.1073/pnas.1701672114Neupane, S., Guitton, D., & Pack, C. C. (2020, April). Perisaccadic remapping: What? How? Why?

Reviews in the Neurosciences . doi: 10.1515/revneuro-2019-0097Nunn, C. M. H., & Osselton, J. W. (1974, May). The Inﬂuence of the EEG Alpha Rhythm on the Perceptionof Visual Stimuli.

Psychophysiology , (3), 294–303. doi: 10.1111/j.1469-8986.1974.tb00547.x
O’Herron, P., & von der Heydt, R. (2013, January). Remapping of border ownership in the visual cortex. Journal of Neuroscience , (5), 1964–1974. doi: 10.1523/JNEUROSCI.2797-12.2013
Olsen, S., Bortone, D., Adesnik, H., & Scanziani, M. (2012, February). Gain control by layer six in cortical circuits of vision. Nature , (7387), 47–52.
O’Reilly, R. C. (1996, January). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation , (5), 895–938. doi: 10.1162/neco.1996.8.5.895
O’Reilly, R. C. (1998, January). Six Principles for Biologically-Based Computational Models of Cortical Cognition. Trends in Cognitive Sciences , (11), 455–462. Retrieved from
O’Reilly, R. C., Hazy, T. E., & Herd, S. A. (2016). The Leabra cognitive architecture: How to play 20 principles with nature and win! In S. Chipman (Ed.),

Oxford handbook of cognitive science.

Oxford University Press. Retrieved 2015-05-15, from

O’Reilly, R. C., & Munakata, Y. (2000).

Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. Cambridge, MA: MIT Press.
O’Reilly, R. C., Munakata, Y., Frank, M. J., Hazy, T. E., & Contributors. (2012).

Computational Cognitive Neuroscience. Wiki Book, 1st Edition. Retrieved from http://ccnbook.colorado.edu

O’Reilly, R. C., Wyatte, D., Herd, S., Mingus, B., & Jilk, D. J. (2013). Recurrent Processing during ObjectRecognition.

Frontiers in Psychology , (124). Retrieved from O’Reilly, R. C., Wyatte, D., & Rohrlich, J. (2014, July). Learning Through Time in the ThalamocorticalLoops. arXiv:1407.3432 [q-bio] . Retrieved 2015-05-15, from http://arxiv.org/abs/1407.3432

O’Reilly, R. C., Wyatte, D. R., & Rohrlich, J. (2017, September). Deep predictive learning: A com-prehensive model of three visual streams. arXiv:1709.04654 [q-bio] . Retrieved 2017-09-15, from http://arxiv.org/abs/1709.04654

Ouden, H. E. M., Kok, P., & Lange, F. P. (2012). How prediction errors shape perception, attention,and motivation.

Frontiers in Psychology , (548). Retrieved from Palva, S., & Palva, J. M. (2011). Functional roles of alpha-band phase synchronization in local and large-scale cortical networks.

Frontiers in Psychology , (204), ePub only. Retrieved from Pennartz, C. M., Dora, S., Muckli, L., & Lorteije, J. A. (2019). Towards a Uniﬁed View on Pathways andFunctions of Neural Recurrent Processing.

Trends in Neurosciences .Petersen, S. E., Robinson, D. L., & Keys, W. (1985, October). Pulvinar nuclei of the behaving rhesusmonkey: Visual responses and their modulation.

Journal of neurophysiology , . Retrieved from Petrof, I., Viaene, A. N., & Sherman, S. M. (2012, June). Two populations of corticothalamic and interarealcorticocortical cells in the subgranular layers of the mouse primary sensory cortices.

Journal ofComparative Neurology , (8), 1678–1686. doi: 10.1002/cne.23006Pinault, D. (2004, August). The thalamic reticular nucleus: Structure, function and concept. Brain research , . Retrieved from Pineda, F. J. (1987, January). Generalization of Backpropagation to Recurrent Neural Networks.

Physical Review Letters , , 2229–2232.
Privman, E., Malach, R., & Yeshurun, Y. (2013, April). Modeling the electrical field created by mass neural activity. Neural Networks , , 44–51. doi: 10.1016/j.neunet.2013.01.004
Purushothaman, G., Marion, R., Li, K., & Casagrande, V. A. (2012, June). Gating and control of primary visual cortex by pulvinar.

Nature Neuroscience , (6), 905–912. doi: 10.1038/nn.3106Pylyshyn, Z. (1989, June). The role of location indexes in spatial perception: A sketch of the FINSTspatial-index model. Cognition , (1), 65–97. doi: 10.1016/0010-0277(89)90014-0Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018, February). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys,and state-of-the-art deep artiﬁcial neural networks. bioRxiv , 240614. doi: 10.1101/240614Rao, R. P., & Ballard, D. H. (1999, January). Predictive coding in the visual cortex: A functional in-terpretation of some extra-classical receptive-ﬁeld effects. Nature Neuroscience , (1), 79–87. doi:10.1038/4580Ray, S., & Maunsell, J. H. R. (2011, April). Different origins of gamma rhythm and high-gamma activityin macaque visual cortex. PLoS biology , (4), e1000610. doi: 10.1371/journal.pbio.1000610Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999, April). Competitive mechanisms subserve attentionin macaque areas V2 and V4. The Journal of neuroscience : the ofﬁcial journal of the Society forNeuroscience , , 1736–1753. Retrieved from Reynolds, J. H., & Heeger, D. J. (2009, January). The normalization model of attention.

Neuron , (2),168–185. Retrieved from Richter, D., & de Lange, F. P. (2019, August). Statistical learning attenuates visual activity only for attendedstimuli. eLife , , e47869. doi: 10.7554/eLife.47869Robinson, D. L. (1993). Functional contributions of the primate pulvinar. Progress in brain research , .Retrieved from Rockland, K. S. (1996, October). Two types of corticopulvinar terminations: Round (type 2) and elongate(type 1).

The Journal of comparative neurology , , 57–87. Retrieved from Rockland, K. S. (1998, January). Convergence and branching patterns of round, type 2 corticopulv-inar axons.

The Journal of Comparative Neurology , (4), 515–536. doi: 10.1002/(SICI)1096-9861(19980126)390:4 (cid:104) (cid:105) Brain Research , (1), 3–20. Retrieved from Rumelhart, D. E., & McClelland, J. L. (1982, April). An interactive activation model of context effectsin letter perception: Part 2. The contextual enhancement effect and some tests and extensions of themodel.

Psychological review , , 60–94. Retrieved from Saalmann, Y. B., & Kastner, S. (2011, July). Cognitive and perceptual functions of the visual thala-mus.

Neuron , (2), 209–223. Retrieved from Saalmann, Y. B., Pinsk, M. A., Wang, L., Li, X., & Kastner, S. (2012, August). The pulvinar regulatesinformation transmission between cortical areas based on attention demands.

Science , (6095), 753–756. doi: 10.1126/science.1223082
Samaha, J., Bauer, P., Cimaroli, S., & Postle, B. R. (2015, July). Top-down control of the phase of alpha-band oscillations as a mechanism for temporal prediction. Proceedings of the National Academy of Sciences USA , (27), 8439–8444. doi: 10.1073/pnas.1503686112
Sherman, M. T., Kanai, R., Seth, A. K., & VanRullen, R. (2016, April). Rhythmic influence of top–down perceptual priors in the phase of prestimulus occipital alpha oscillations. Journal of Cognitive Neuroscience , (9), 1318–1330. doi: 10.1162/jocn_a_00973
Sherman, S. M. (2014, May). The function of metabotropic glutamate receptors in thalamus and cortex. The Neuroscientist , (2), 146–149.
Exploring the Thalamus and Its Role in Cortical Function. Cambridge, MA: MIT Press. Retrieved from

Sherman, S. M., & Guillery, R. W. (2011, September). Distinct functions for direct and transthalamiccorticocortical connections.

Journal of Neurophysiology , (3), 1068–1077. doi: 10.1152/jn.00429.2011Sherman, S. M., & Guillery, R. W. (2013). Functional Connections of Cortical Areas: A New View Fromthe Thalamus . Cambridge, MA: MIT Press.Shipp, S. (2003, October). The functional logic of cortico-pulvinar connections.

Philosophical Transactionsof the Royal Society of London B , (1438), 1605–1624. Retrieved from Shouval, H. Z. S., Bear, M. F., & Cooper, L. N. (2002, August). A uniﬁed model of NMDA receptor-dependent bidirectional synaptic plasticity.

Proceedings of the National Academy of SciencesUSA , (16), 10831–10836. Retrieved from Shrager, J., & Johnson, M. H. (1996, October). Dynamic Plasticity Inﬂuences the Emergence of Function ina Simple Cortical Array.

Neural Networks , (7), 1119–1129. doi: 10.1016/0893-6080(96)00033-0Silva, L. R., Amitai, Y., & Connors, B. W. (1991, January). Intrinsic oscillations of neocortex generatedby layer 5 pyramidal neurons. Science , (4992), 432–435. Retrieved from Snow, J. C., Allen, H. A., Rafal, R. D., & Humphreys, G. W. (2009, March). Impaired attentional selectionfollowing lesions to human pulvinar: Evidence for homology between human and monkey.

Proceed-ings of the National Academy of Sciences , (10), 4054–4059. doi: 10.1073/pnas.0810086106Sol´ıs-Vivanco, R., Jensen, O., & Bonnefond, M. (2018, August). Top-Down Control of Alpha Phase Adjust-ment in Anticipation of Temporally Predictable Visual Stimuli. Journal of Cognitive Neuroscience , (8), 1157–1169. doi: 10.1162/jocn a 01280Solomon, E. A., Kragel, J. E., Sperling, M. R., Sharan, A., Worrell, G., Kucewicz, M., . . . Kahana, M. J.(2017, November). Widespread theta synchrony and high-frequency desynchronization underliesenhanced cognition. Nature Communications , (1), 1704. doi: 10.1038/s41467-017-01763-2Spaak, E., Bonnefond, M., Maier, A., Leopold, D. A., & Jensen, O. (2012, December). Layer-speciﬁcentrainment of gamma-band neural activity by the alpha rhythm in monkey visual cortex. CurrentBiology , (24), 2313–2318. Retrieved from Spaak, E., de Lange, F. P., & Jensen, O. (2014, March). Local Entrainment of Alpha Oscillations by VisualStimuli Causes Cyclic Modulation of Perception.

Journal of Neuroscience , (10), 3536–3544. doi:10.1523/JNEUROSCI.4385-13.2014Spelke, E., Breinlinger, K., Macomber, J., & Jacobson, K. (1992, January). Origins of Knowledge. Psycho-logical Review , (4), 605–632.Spratling, M. W. (2008). Reconciling predictive coding and biased competition models of cortical function. Frontiers in Computational Neuroscience , (4), 1-8 (online). Retrieved from Summerﬁeld, C., & de Lange, F. P. (2014, November). Expectation in perceptual decision making: Neuraland computational mechanisms.

Nature Reviews Neuroscience , (11), 745–756. doi: 10.1038/nrn3838
Summerfield, C., & Egner, T. (2009, September). Expectation (and attention) in visual cognition. Trends in Cognitive Sciences , (9), 403–409. doi: 10.1016/j.tics.2009.06.003
Summerfield, C., Trittschuh, E. H., Monti, J. M., Mesulam, M. M., & Egner, T. (2008, September). Neural repetition suppression reflects fulfilled perceptual expectations. Nature Neuroscience , (9), 1004–

Sutton, R. S., & Barto, A. G. (1998).

Reinforcement Learning: An Introduction.

Cambridge, MA: MITPress. Retrieved from

Thomson, A. M. (2010). Neocortical layer 6, a review.

Frontiers in Neuroanatomy , (13). Retrieved from Thomson, A. M., & Lamy, C. (2007, November). Functional maps of neocortical local circuitry.

Frontiersin Neuroscience , (1), 19–42. Retrieved from Todorovic, A., van Ede, F., Maris, E., & de Lange, F. P. (2011, June). Prior Expectation Mediates NeuralAdaptation to Repeated Sounds in the Auditory Cortex: An MEG Study.

Journal of Neuroscience , (25), 9118–9123. doi: 10.1523/JNEUROSCI.1425-11.2011Ungerleider, L. G., & Mishkin, M. (1982, January). Two Cortical Visual Systems. In D. J. Ingle,M. A. Goodale, & R. J. W. Mansﬁeld (Eds.), The Analysis of Visual Behavior (pp. 549–586). Cam-bridge, MA: MIT Press.Urakubo, H., Honda, M., Froemke, R. C., & Kuroda, S. (2008, March). Requirement of an allosteric kineticsof NMDA receptors for spike timing-dependent plasticity.

The Journal of Neuroscience , (13),3310–3323. Retrieved from Usrey, W. M., & Sherman, S. M. (2018). Corticofugal circuits: Communication lines from the cortex to therest of the brain.

Journal of Comparative Neurology , (0). doi: 10.1002/cne.24423Valpola, H. (2014, November). From neural PCA to deep unsupervised learning. arXiv:1411.7783 [cs,stat] . Retrieved 2017-05-15, from http://arxiv.org/abs/1411.7783 van Kerkoerle, T., Self, M. W., Dagnino, B., Gariel-Mathis, M.-A., Poort, J., van der Togt, C., & Roelfsema,P. R. (2014, October). Alpha and gamma oscillations characterize feedback and feedforward pro-cessing in monkey visual cortex. Proceedings of the National Academy of Sciences U.S.A. , (40),14332–14341. Retrieved from VanRullen, R. (2016, October). Perceptual cycles.

Trends in Cognitive Sciences , (10), 723–735. doi:10.1016/j.tics.2016.07.006VanRullen, R., & Koch, C. (2003, May). Is perception discrete or continuous? Trends in Cognitive Sciences , (5), 207–213. Retrieved from VanRullen, R., & Thorpe, S. J. (2002, November). Surﬁng a spike wave down the ventral stream.

Vi-sion research , , 2593–2615. Retrieved from Varela, F. J., Toro, A., John, E. R., & Schwartz, E. L. (1981). Perceptual framing and cortical al-pha rhythm.

Neuropsychologia , (5), 675–686. Retrieved from Vinken, K., & Vogels, R. (2017, November). Adaptation can explain evidence for encoding of probabilisticinformation in macaque inferior temporal cortex.

Current Biology , (22), R1210-R1212. doi: 10.1016/j.cub.2017.09.018von Stein, A., Chiang, C., & K¨onig, P. (2000, December). Top-down processing mediated by interarealsynchronization. Proceedings of the National Academy of Sciences of the United States of America , (26), 14748–14753. doi: 10.1073/pnas.97.26.14748von Helmholtz, H. (2013). Treatise on Physiological Optics, Vol III . Courier Corporation.Waldert, S., Lemon, R. N., & Kraskov, A. (2013). Inﬂuence of spiking activity on cortical local ﬁeldpotentials.

The Journal of Physiology , (21), 5291–5303. doi: 10.1113/jphysiol.2013.258228
Walsh, K. S., McGovern, D. P., Clark, A., & O’Connell, R. G. (2020, March). Evaluating the neurophysiological evidence for predictive processing as a model of perception. Annals of the New York Academy of Sciences , (1), 242–268. doi: 10.1111/nyas.14321
The living brain. Oxford, England: W. W. Norton.
Watanabe, T., & Sasaki, Y. (2015, January). Perceptual learning: Toward a comprehensive theory.

Annualreview of psychology , , 197–221. doi: 10.1146/annurev-psych-010814-015214Whittington, J. C. R., & Bogacz, R. (2019, March). Theories of error back-propagation in the brain. Trendsin Cognitive Sciences , (3), 235–250. doi: 10.1016/j.tics.2018.12.005Williams, R. J., & Zipser, D. (1992, January). Gradient-based learning algorithms for recurrent networksand their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation:Theory, Architectures and Applications.

Hillsdale, NJ: Erlbaum.
Wilson, J. R., Bose, N., Sherman, S. M., & Guillery, R. W. (1984, June). Fine structural morphology of identified X- and Y-cells in the cat’s lateral geniculate nucleus.

Proceedings of the Royal Society ofLondon. Series B. Biological Sciences , (1225), 411–436. doi: 10.1098/rspb.1984.0042Wimmer, R. D., Schmitt, L. I., Davidson, T. J., Nakajima, M., Deisseroth, K., & Halassa, M. M. (2015,October). Thalamic control of sensory selection in divided attention. Nature , (7575), 705–709.doi: 10.1038/nature15398Wiskott, L., & Sejnowski, T. J. (2002, April). Slow feature analysis: Unsupervised learning of invari-ances. Neural Computation , , 715–770. Retrieved from Worden, M. S., Foxe, J. J., Wang, N., & Simpson, G. V. (2000, March). Anticipatory biasing of visuospa-tial attention indexed by retinotopically speciﬁc alpha-band electroencephalography increases overoccipital cortex.

The Journal of neuroscience , . Retrieved from Wurtz, R. H. (2008, September). Neuronal mechanisms of visual stability.

Vision Research , (20), 2070–2089. doi: 10.1016/j.visres.2008.03.021Xing, D., Yeh, C.-I., Burns, S., & Shapley, R. M. (2012, August). Laminar analysis of visually evokedactivity in the primary visual cortex. Proceedings of the National Academy of Sciences , (34),13871–13876. doi: 10.1073/pnas.1201478109Yu, C., & Smith, L. B. (2012, November). Embodied attention and word learning by toddlers. Cognition , (2), 244–262. doi: 10.1016/j.cognition.2012.06.016Zhou, H., Schafer, R. J., & Desimone, R. (2016). Pulvinar-cortex interactions in vision and attention. Neuron ,89