Grid Cell Path Integration For Movement-Based Visual Object Recognition
Niels Leadholm
Numenta and The University of Oxford [email protected]
Marcus Lewis
Numenta [email protected]
Subutai Ahmad
Numenta [email protected]
February 19, 2021

Abstract
Grid cells enable the brain to model the physical space of the world and navigate effectively via path integration, updating self-position using information from self-movement. Recent proposals suggest that the brain might use similar mechanisms to understand the structure of objects in diverse sensory modalities, including vision. In machine vision, object recognition given a sequence of sensory samples of an image, such as saccades, is a challenging problem when the sequence does not follow a consistent, fixed pattern - yet this is something humans do naturally and effortlessly. We explore how grid cell-based path integration in a cortical network can support reliable recognition of objects given an arbitrary sequence of inputs. Our network (GridCellNet) uses grid cell computations to integrate visual information and make predictions based on movements. We use local Hebbian plasticity rules to learn rapidly from a handful of examples (few-shot learning), and consider the task of recognizing MNIST digits given only a sequence of image feature patches. We compare GridCellNet to k-Nearest Neighbour (k-NN) classifiers as well as recurrent neural networks (RNNs), both of which lack explicit mechanisms for handling arbitrary sequences of input samples. We show that GridCellNet can reliably perform classification, generalizing to both unseen examples and completely novel sequence trajectories. We further show that inference is often successful after sampling a fraction of the input space, enabling the predictive GridCellNet to reconstruct the rest of the image given just a few movements. We propose that dynamically moving agents with active sensors can use grid cell representations not only for navigation, but also for efficient recognition and feature prediction of seen objects.
When exploring a visual scene, primates sample the world in a serial sequence by performing saccades [Yarbus, 1967]. For the purpose of recognising objects, it is non-trivial that this sampling can follow an arbitrary sequence order. For example, one might selectively attend to the most salient parts of a face rather than performing a raster scan across the image. While many previous efforts to model primate object recognition have focused on massively parallel processing of a single input, the challenge of dealing with the necessarily sequential nature of sensory input has received less attention [Bicanski and Burgess, 2019]. Recurrent neural networks can perform complex tasks with sequential inputs, and might seem like a natural candidate for such a challenge, yet they struggle to learn when provided with sequences that do not follow a fixed order during training and inference. In the natural world there are additional challenges that can present themselves. Often only a handful of object examples are available, and learning should be rapid (i.e. requiring limited training on the few examples given). These are all constraints that humans are able to handle effortlessly. Understanding how learning and inference under these conditions might be achieved has two appealing aspects. As well as potentially uncovering the basis for human performance in this domain, the flexibility to operate under such a regime could also enable artificial agents to explore the world in a more principled and adaptive manner. While recurrent neural networks do not have explicit mechanisms for dealing with this challenge, grid cells might provide a neurally plausible solution employed by the brain. Together with place cells in the hippocampus [O'Keefe and Conway, 1978], grid cells in the entorhinal cortex enable the brain to model space during navigation. In particular,
grid cells fire in repeating patterns as space is traversed [Hafting et al., 2005]. Using multiple grid cells of different scale and orientation, the location of an animal can be uniquely encoded [Fiete et al., 2008]. Importantly, this location representation can be updated to support path integration - that is, given information about self-movement, an agent can determine its new location by reading out from grid cell activity [Hafting et al., 2005, Moser et al., 2008]. The role of such cells in spatial navigation is widely established, but recent experimental evidence has also uncovered the presence of grid cell-like activity in visual space [Killian et al., 2012, Nau et al., 2018, Julian et al., 2018]. Theoreticians have argued that grid cell-like computations might be used to build object representations in diverse sensory modalities [Hawkins et al., 2019], including vision [Bicanski and Burgess, 2019]. This is an intriguing solution to our opening problem, but the demonstration of object recognition with such computations has so far been limited to either synthetic objects [Lewis et al., 2019], or visual tasks requiring the recall of a memorized example [Bicanski and Burgess, 2019], rather than generalization to unseen examples of an object class.

Neurally-motivated systems that can solve rapid object learning and recognition given saccade-like visual inputs are therefore lacking. We set out to address this by implementing a biologically plausible network, called GridCellNet, based on cortical columns and grid cell-like computations. The system uses rapid Hebbian-style learning to associate sensed features and their spatial location in the reference frame of an object, while dendritic segments enable the system to encode predictive states. Locations are encoded by activity in grid cell modules that are updated with self-movement. In addition to addressing the challenge of arbitrary sequence inputs, this system also includes the desirable properties of rapid learning (functioning with both few training examples and few weight updates), and predictive capabilities, enabling completion of an image given partial inputs. We evaluate the performance of GridCellNet in these task settings, and compare it to typical machine-learning approaches. In accordance with human capabilities, our system outperforms these other approaches in the challenging setting we explore. To summarise, our primary contributions are to:

• Implement a biologically-motivated neural architecture that uses arbitrary sequences of local visual features across space to learn objects and recognize them. This ability is dependent on grid cell-like computations.
• Demonstrate the ability of our network to successfully generalize to unseen objects in the challenging setting of arbitrary sequence inputs under few-shot learning.
• Compare the proposed system to more traditional machine learning systems, such as recurrent neural networks, demonstrating its superiority in settings where humans outperform current artificial systems.
Grid Cells
During spatial navigation in an animal such as a rat, grid cells are notable for firing at regular intervals as space is traversed. These points of activity correspond to a triangular lattice with a particular phase, orientation and scale [Hafting et al., 2005] (Figure 1a, top). Grid cells with the same orientation and scale, but different phases, form what are known as grid cell 'modules' [Stensola et al., 2012]. As a rodent moves, any individual grid cell's activity is ambiguous as a means of encoding the animal's position. The joint activity of multiple grid cell modules, however, can uniquely encode a position; importantly, this encoding scheme has a large representational capacity [Fiete et al., 2008] (Figure 1a, middle). Information about self-movement is used to update the current location representation by each grid cell's firing corresponding to the positional change (Figure 1a, bottom). This process, known as path integration, means that after returning to the same position, the same grid cells will be active regardless of the path taken [Hafting et al., 2005, Moser et al., 2008]. Note that our work does not deal with how grid cells might actually implement path integration - rather we explore the significance of path integration in a neural population for developing useful object representations.

The combined properties of a large capacity for unique spatial representations and path integration enable grid cells to act as a powerful substrate for encoding spatial information. In Hawkins et al. [2019], it was suggested that similar cells might exist outside of the entorhinal cortex in cortical columns throughout the brain, and could thereby support spatial encoding in sensory modalities such as touch and vision. Recent experimental work has supported this possibility [Long et al., 2021, Long and Zhang, 2021]. This work aims to demonstrate the utility of such grid cell representations in visual tasks. For a more in-depth discussion of grid cell computations as explored here, we direct readers to Lewis et al. [2019].
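To make the path-integration property concrete, the following minimal sketch tracks the phase of several grid cell modules as movements are applied; the module scales and orientations are illustrative values, not taken from the paper. Two different paths with the same net displacement end in the same joint phase code, which is the property exploited throughout this work.

```python
# Minimal sketch of path integration with multiple grid cell modules.
# Each module tracks a 2D phase (position modulo the module's period);
# the joint phases across modules uniquely encode location over a large
# range. Scales and orientations here are illustrative assumptions.
import numpy as np

class GridModule:
    def __init__(self, scale, orientation):
        self.scale = scale
        c, s = np.cos(orientation), np.sin(orientation)
        # Rotation that maps world movement into this module's frame.
        self.rotation = np.array([[c, -s], [s, c]])
        self.phase = np.zeros(2)  # current phase in [0, 1)^2

    def integrate(self, movement):
        # Project the movement into module coordinates and wrap: this is
        # path integration; only the net displacement matters.
        self.phase = (self.phase + (self.rotation @ movement) / self.scale) % 1.0

modules = [GridModule(scale=s, orientation=o)
           for s, o in [(1.0, 0.0), (1.4, 0.5), (2.1, 1.1)]]

# Two different paths with the same net displacement...
for path in ([np.array([1.0, 0.0]), np.array([0.0, 1.0])],
             [np.array([0.5, 0.5]), np.array([0.5, 0.5])]):
    for m in modules:
        m.phase = np.zeros(2)
    for step in path:
        for m in modules:
            m.integrate(step)
    # ...produce the same joint phase code, regardless of the route taken.
    print([np.round(m.phase, 3) for m in modules])
```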
Cortical Models
Our work builds on previous models of the cortical architecture of the mammalian brain. This prior work demonstrated that networks with a columnar architecture, where different layers correspond to sensory and location-based representations, can learn inputs such as objects composed of synthetic features. Neurons in these layers receive external sensory and self-movement information, while they share connections that enable learned associations between features and locations, as well as predictions during inference [Hawkins et al., 2017]. Neural activity (including input features) is represented in the distributed activity of sparse binary vectors - a form of encoding
where the dimensionality used is relatively high, but where only a small subset of the nodes are ever active at a time (taking values of 0 or 1). Such encoding has numerous appealing properties, including tolerance to noise [Ahmad and Scheinkman, 2019], a large representational capacity, and the ability to encode notions of similarity between objects [Hawkins and Ahmad, 2016].

Early models in this framework did not explicitly discuss how the brain might implement the encoding of location information. It was recently suggested that this might be provided by neurons with grid cell properties distributed throughout the cortical columns of the brain. In various sensory modalities, grid cells could then be used to encode feature locations in an object's own reference frame. The idea that each of the columns throughout the brain would be learning object representations in a massively parallel process was dubbed the Thousand Brains Theory. This nomenclature contrasts the theory with those that suggest a stricter hierarchy, with object-like representations only existing at certain levels of processing [Hawkins et al., 2019]. While models developed from this theory have been shown to be capable of rapidly learning objects and performing recognition, this was limited to synthetic data-sets [Lewis et al., 2019]. A similar concept was used to develop a model that was able to recognize objects from images, but this was limited to recall of memorized examples, rather than generalization to novel examples of an object class [Bicanski and Burgess, 2019].
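As a toy illustration of why such sparse binary codes are noise-tolerant and high-capacity, the following sketch uses the feature dimensions adopted later in the paper (128 units, 19 active), but is otherwise a hypothetical example: two random sparse vectors overlap in only a few bits, so distinct representations rarely collide by chance.

```python
# Tiny illustration of sparse binary encoding: high-dimensional vectors
# with few active bits rarely overlap by chance, which gives noise
# tolerance and a large representational capacity.
import numpy as np

rng = np.random.default_rng(0)

def sparse_vector(n=128, k=19):
    v = np.zeros(n, dtype=np.uint8)
    v[rng.choice(n, size=k, replace=False)] = 1
    return v

a, b = sparse_vector(), sparse_vector()
print((a & b).sum())  # expected overlap is only k*k/n, about 2.8 bits
```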
Machine Learning Approaches
Few-shot learning is a large field, and prior work has addressed learning hand-written characters with various techniques [Wong and Yuille, 2015, Lake et al., 2016, George et al., 2017], or demonstrated the benefits of memory-like mechanisms in the few-shot setting [Santoro et al., 2016]. Our intent is not to present the current work as a strong solution to the problem of few-shot learning. However, the few-shot setting captures our biological motivation of humans learning rapidly from arbitrary feature sequences. As such, we use the few-shot experimental setting to evaluate the performance of our system.
Figure 1: Using grid cell representations for object recognition. a) The combination of multiple grid cells of different scale and orientation (red and green) can uniquely encode the location of a sensor (e.g. retinal patch). Here we use multiple grid cell modules with sparse activity (bottom, each indicated by a rhombus) to encode and update the sensor's location with self-movement information. b) We hypothesize that this process can be used for object recognition with active sensors, and use sequences through a 5x5 grid of feature patches extracted from MNIST images to test this. c) GridCellNet takes in motor input when the sensor moves (1) and updates its location representations. The current location representation is used to predict incoming sensory information (2), before this is received (3). Correctly predicted sensory information is then used to update the location representation (4). Locations are initially ambiguous, represented using a union of locations, and disambiguated over time via sensory input. Yellow and blue dots in the location layer indicate two different objects which are compatible with the current sequence. Classification is successful once the representation is unambiguous, i.e. it is a subset of representations of that class. The two-layer network is based on cells with sparse binary activity, dendritic segments, and Hebbian-like learning, following Lewis et al. [2019], and with figures reproduced with permission from the authors.
Our work builds on the sensorimotor system implemented in Lewis et al. [2019], which in turn uses many of the algorithms employed in Hawkins et al. [2017]. In this paper we address two limitations of the network in Lewis et al.
[2019]. First, they used synthetic objects and features rather than those derived from natural data-sets. Second, their system was designed to only recall previously seen objects and did not generalize to other examples of an object class. In order to remove those limitations, we make two main changes:

• We implement a sparse convolutional feature detector, trained on images. When the sensor moves to sample an image patch, the corresponding subset of sparse feature outputs is sent to the sensorimotor network.
• We enhance the classifier by storing multiple location representations from the training examples of each class. Classification then operates on unions of grid cell locations.

We describe each component in more detail below. Except where noted, the sensorimotor network is mathematically as described in Lewis et al. [2019], and we advise readers interested in those details to refer to that work.
In order to handle realistic images, we use a trained convolutional neural network to generate sparse binary features at multiple image locations. Specifically, we trained a convolutional neural network (CNN) [LeCun et al., 1998] in a supervised paradigm on a subset of the MNIST training data-set of handwritten digits (54,000 images), tuning it using a hold-out cross-validation section (6,000 images). This encoder network had the architecture shown in Figure 2a, where the second max-pooling operation was followed by a k-Winner Take All (k-WTA) layer [Ahmad and Scheinkman, 2019] to enforce sparsity in the representation (Figure 2b). While the non-zero values in this layer took on real-number values (necessary for useful gradients during learning), we require binary feature vectors for input to the sensorimotor network. We therefore used the network after training to pass images through until the k-WTA layer, and then binarized this representation, providing us with a 5x5 grid of features. At each grid location we then have a vector of dimension 128 which contains regional information about the image representation in a sparse format. The feature vectors at each of these 25 locations form the input to all of our downstream classifiers. Note therefore that none of our classifiers (GridCellNet and the control comparisons) received direct pixel inputs as their features.
CNN Details
It is possible that binarization could lead to significant loss of information. We verified that a linear classifier can be trained to perform accurate classification with these feature inputs, with an accuracy of 99%+. We also verified that a decoder could accurately reconstruct the input image. In order to achieve optimal performance, it was useful for the feature vector to have a reasonably large dimension (128), number of non-zero elements (19), and high entropy (intuitively, how often each input feature contributed to a representation across all examples). To optimize the entropy, we made use of the k-WTA's duty cycle (which monitors how often a unit is contributing to representations), and boosting factor (which biases a unit's activity to target a given duty cycle) (see Ahmad and Scheinkman [2019] for details). This ensures that a greater number of neurons each contribute information at some point, and the representation becomes more distributed.

In order of layers, the CNN architecture was composed of a convolution (kernel size 5, channels 64), max-pooling, convolution (kernel size 5, channels 128), max-pooling, k-WTA, and three fully connected layers (dimensions of 256, 128, and 10). The k-WTA applied to the max-pooling layer was local (that is, the k-winners were determined across all channels at a given spatial location, rather than across the entire image space); this ensured that each extracted feature vector would have the same sparsity. We used stochastic gradient descent with a learning rate of 0.01, momentum 0.5, batch size of 128, and 10 epochs of training.
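A minimal sketch of this local k-WTA and binarization step is given below. The boosting term is a simplified stand-in for the duty-cycle mechanism of Ahmad and Scheinkman [2019], and all parameter values are illustrative rather than the exact ones used here.

```python
# Sketch of local k-WTA + binarization, assuming a feature map of shape
# (channels=128, 5, 5). At each of the 25 spatial locations the k largest
# channel activations are kept and set to 1; under-used units can be
# boosted so the representation spreads out (simplified duty-cycle model).
import numpy as np

def local_kwta_binary(feature_map, k=19, duty_cycle=None, boost_strength=0.0):
    c, h, w = feature_map.shape
    out = np.zeros_like(feature_map, dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            scores = feature_map[:, i, j].copy()
            if duty_cycle is not None:
                # Units firing below the target duty cycle get their
                # scores scaled up, biasing them to win in the future.
                target = k / c
                scores *= np.exp(boost_strength * (target - duty_cycle))
            winners = np.argsort(scores)[-k:]  # k largest channels win
            out[winners, i, j] = 1             # binary sparse feature
    return out

features = local_kwta_binary(np.random.rand(128, 5, 5))
print(features.sum(axis=0))  # exactly k=19 active channels per location
```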
The sensorimotor network consists of two layers, one representing the sensory input, the other representing the location of the sensor (Figure 1c). Cells in both layers can be either on or off, and activity in the network proceeds through a series of discrete time steps. The sensory layer receives the input features, as well as modulatory input from the location layer. The location layer receives movement information, as well as input from the sensory layer. The connections between the sensory and location layers are modelled as dendritic segments, the small branches on biological neurons that integrate multiple synapses. A dendritic segment is deemed active if there is a significant match between the sending layer's sparse activity and sparse, learned weights. This match must exceed a user-set threshold, and these dendritic segments enable a given layer's activity to predict representations in the other layer.

The basic intuition for inference is that at the outset, a sensory feature is likely to be ambiguous as to the nature of the object, and so the network should encode this ambiguity with a representation that corresponds to a union of all the objects compatible with this given feature. For example, a curved contour at the top of an image might represent a 9 or a 0 (Figure 3a). The object representation is encoded by the activity in the location layer (as each learned object uses a unique location space), and so this union of multiple objects will correspond to multiple cells being active in each
grid cell module. In our example, the active location representation will correspond to both where a curved contour was learned for a 9 and where it was learned for a 0. As additional features are sensed, the network will use its current representation of candidate objects to predict the next feature, with only those that are compatible with the subsequent sensation remaining part of the representation. Notably, these predictions rely on the presence of a feature at a given location, and not simply a bag-of-features detector. If the sensor was to move to the bottom-left of the image, the learned 9 representation and 0 representation will predict different features. As more sensations are experienced, the system should converge to a specific representation that is consistent with a learned object. The object is recognised when the representation in the location layer corresponds to only that given object, and not any others. We now describe the network architecture and stages of inference in more detail.

Figure 2: The pre-processing convolutional neural network, including its use in generating the input features for later classifiers. a) The encoder CNN is trained end-to-end to perform classification, with a k-WTA operation that constrains the mid-level representations to a specific level of sparsity. Numbers below operations show the channel dimension. b) An example of what the k-WTA and binarized representations might look like. Note that in all of our tasks, the classifiers are given a sparse feature vector (dimension 128) from each of the 5x5 locations in a sequence, represented with the eye and its movement. The order with which the features are sampled across this 5x5 space can either be fixed for all examples during both training and testing, or follow an arbitrary sequence. c) For optimal performance, the two parameters of k-WTA (target duty cycle and boost strength) are optimized so as to ensure most of the neurons achieve the target duty cycle. On the left is shown typical values for boosting factor used in past models, while on the right we show the result of using the larger boosting factor and target duty cycle that we arrived at through hyperparameter tuning.
Network Architecture
The location layer consists of 40 grid cell modules, each a lattice of 50 by 50 cells. A grid cell module has a particular scale and orientation, while the active location corresponds to the current phase of activity in the module. In our model, grid cells can be either active or inactive, and activation is determined by either the current representation in the sensory layer, or movement applied to the previous location representation; at time step $t$ and for grid cell module $i$, these are denoted by the binary arrays $A^{\text{loc},i}_{t,\text{sense}}$ and $A^{\text{loc},i}_{t,\text{move}}$ respectively. We model the location phase that determines $A^{\text{loc},i}_{t,\text{move}}$ using a square rather than the biologically motivated triangular lattice used in Lewis et al. [2019], although this has no major consequence for the system.

The sensory layer is identical to that used in Hawkins et al. [2017]. The input features are binary vectors of length 128 with 19 active values (i.e. approximately 85% sparsity). The sensory layer in turn consists of a corresponding 128 mini-columns, which receive the input features in a one-to-one fashion. Each mini-column in the sensory layer consists of multiple cells (here 32). This enables the mini-columns to use sparse activity to uniquely encode features associated
with particular objects (i.e. location representations). The active cells in mini-column $i$ at time-step $t$ are denoted by the binary array $A^{\text{in},i}_t$.

Stage 1: Using Movement to Update the Location Representation
If the location layer has active cells, then each module uses the current movement information to compute a new set of active cells. Each module will apply a translation to its 50 by 50 activation pattern, according to the movement information. The translation vector is different in each module, and is determined by applying the following dilative rotation to the movement vector:

$$M_i = \frac{1}{s_i} \begin{bmatrix} \cos(\theta_i) & -\sin(\theta_i) \\ \sin(\theta_i) & \cos(\theta_i) \end{bmatrix} \quad (1)$$

where $i$ denotes the particular grid cell module, $\theta_i$ its orientation, and $s_i$ its scale.

The translated 50 by 50 pattern will rarely align neatly with the original 50 by 50 cells, except in discrete environments. Typically each active cell in the pattern will land on the corner between four cells, so each active cell will activate up to four cells after the translation vector has been applied. During inference, this is indeed what happens, but during learning we allow the grid cell module to have more certainty about the current location. During learning, translating the active pattern will not increase the number of active cells; instead, the module's internal state includes a list of high-precision active phases, and the module applies the translation to those phases, rather than estimating those phases from the current set of active cells. This difference in the algorithm's behavior in inference and learning reflects the fact that binary representations will always lead to some spatial uncertainty during inference.

When inference begins for a new object, no location information is available - this will instead become available at stage 4, discussed below, and so inference proceeds to stage 2.
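As a concrete illustration of Stage 1, the sketch below applies the dilative rotation of Equation 1 to a movement vector and shifts a module's active phases on its lattice, wrapping toroidally. This is a simplified sketch with illustrative values, not the authors' exact implementation (in particular, the four-cell spreading during inference is omitted).

```python
# Sketch of the per-module movement transform in Stage 1 (Eq. 1): each
# grid module maps the sensor's movement into its own phase space via a
# dilative rotation, then translates its active phases on the lattice.
import numpy as np

def module_translation(movement, orientation, scale):
    c, s = np.cos(orientation), np.sin(orientation)
    M = (1.0 / scale) * np.array([[c, -s], [s, c]])  # Eq. 1
    return M @ movement

def path_integrate(active_phases, movement, orientation, scale, n=50):
    # active_phases: (num_active, 2) phases on the n-by-n module lattice.
    delta = module_translation(movement, orientation, scale)
    return (active_phases + delta) % n  # wrap around the module's lattice

phases = np.array([[10.0, 12.0], [30.5, 7.25]])
print(path_integrate(phases, movement=np.array([3.0, -1.0]),
                     orientation=0.4, scale=1.5))
```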
Stage 2: Predicting Sensory Input with the Location Representation

The cells in the mini-columns have dendritic segments which receive activity from the location layer. If a dendritic segment is active (that is, a cell in the sensory layer is predicted by the activity of the location layer), then it is in a predictive state. Let the binary vector $\pi^{\text{in}}_t$ denote the sensory cells that have at least one active dendritic segment, and $\theta^{\text{in}}$ a dendritic threshold; then:

$$\pi^{\text{in},c}_t = \begin{cases} 1, & \exists d \left[ D^{\text{in}}_{c,d} \cdot A^{\text{loc}}_{t,\text{move}} \geq \theta^{\text{in}} \right] \\ 0, & \text{otherwise} \end{cases} \quad (2)$$
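The predictive-state computation of Equation 2 can be sketched as follows, with each cell's dendritic segments stored as binary weight rows; the shapes and the toy example are illustrative assumptions.

```python
# Sketch of Eq. 2: a sensory cell enters the predictive state if any of
# its dendritic segments overlaps the active location cells by at least
# the threshold theta_in.
import numpy as np

def predictive_state(D_in, A_loc_move, theta_in):
    # D_in: (num_cells, num_segments, num_loc_cells) binary weights
    # A_loc_move: (num_loc_cells,) binary location activity
    overlaps = D_in @ A_loc_move              # (num_cells, num_segments)
    return (overlaps >= theta_in).any(axis=1).astype(np.uint8)

D_in = np.zeros((3, 1, 4), dtype=np.uint8)    # 3 cells, 1 segment, 4 loc cells
D_in[0, 0] = [1, 1, 0, 0]                     # cell 0 listens to loc cells 0, 1
print(predictive_state(D_in, np.array([1, 1, 0, 0]), theta_in=2))  # [1 0 0]
```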
Stage 3: Determining Activity in the Sensory Layer

If a given sensory layer cell that is predicted also receives activity from the input feature (that is, it is in a column receiving sensory input and was therefore correctly predicted), it will be active and inhibit any other cells in the mini-column that are not predicted. Note that multiple cells in any given mini-column can be active if they are predicted by the current location representation. If no cells in a mini-column are predicted but it receives sensory input, then all cells in the mini-column will become active.
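A sketch of this winner/bursting rule, assuming binary arrays of shape (columns, cells per column); the dimensions are illustrative.

```python
# Sketch of Stage 3: within each mini-column that receives input, the
# predicted cells win and inhibit the rest; if no cell was predicted,
# the whole column "bursts" (all cells become active).
import numpy as np

def sensory_activity(pi_in, input_columns):
    # pi_in: (num_columns, cells_per_column) predictive states (binary)
    # input_columns: (num_columns,) binary, which columns receive input
    A = np.zeros_like(pi_in)
    for col in np.nonzero(input_columns)[0]:
        predicted = np.nonzero(pi_in[col])[0]
        if predicted.size > 0:
            A[col, predicted] = 1   # correctly predicted cells win
        else:
            A[col, :] = 1           # no prediction: the column bursts
    return A

pi = np.zeros((2, 4), dtype=np.uint8); pi[0, 1] = 1
print(sensory_activity(pi, np.array([1, 1])))
# column 0: only the predicted cell fires; column 1: bursts
```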
Stage 4: Using the Sensory Representation to Update the Location
After the sensory representation has been determined, the location layer receives inputs from the sensory layer. In particular, the sensory features help to recall location information, and supplement the location representation arrived at by path integration. Similar to equation 2, this is determined by the overlap between the active cells in the sensory layer, and the learned weights. In this case however, an active dendritic segment is sufficient for a location cell to now be active, such that:

$$\pi^{\text{loc},i,c}_t = \begin{cases} 1, & \exists d \left[ D^{\text{loc},i}_{c,d} \cdot A^{\text{in}}_t \geq \theta^{\text{loc}} \right] \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

$$A^{\text{loc},i}_{t,\text{sense}} = \begin{cases} \pi^{\text{loc},i}_t, & \left\| \pi^{\text{loc},i}_t \right\| > 0 \\ A^{\text{loc},i}_{t,\text{move}}, & \text{otherwise} \end{cases} \quad (4)$$

At this stage, the next movement is received, and the four stages are repeated for the next time-step until inference is successful (discussed further below).
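Equations 3 and 4 can be sketched together; the sensory-driven recall replaces the path-integrated activity whenever any location cell's segment matches. This simplifies the per-module bookkeeping of the full model and uses assumed array shapes.

```python
# Sketch of Eqs. 3-4 for one module: location cells with any segment
# that sufficiently overlaps the sensory activity become candidates;
# if any exist, they replace (narrow) the path-integrated activity.
import numpy as np

def location_update(D_loc, A_in, A_loc_move, theta_loc):
    # D_loc: (num_loc_cells, num_segments, num_sensory_cells) binary
    # A_in:  (num_sensory_cells,) flattened binary sensory activity
    overlaps = D_loc @ A_in                  # (num_loc_cells, num_segments)
    pi_loc = (overlaps >= theta_loc).any(axis=1)
    # Eq. 4: fall back on the movement-based activity when nothing matched.
    return pi_loc.astype(np.uint8) if pi_loc.any() else A_loc_move
```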
Learning

Learning takes place via the reciprocal strengthening of connections between the active representation in the sensory layer, and the current location representation. The aim is to associate a given feature with a given location in that object's reference frame. When a new object is learned, the location representation at the first sensation is randomly initialized, such that each object operates in a different location space. This enables multiple objects to be jointly represented during inference, as the probability of overlap between different objects' location spaces is low. As
further sensations are performed during learning, the location representation is updated using movement information as described in Stage 1 above. For the sensory layer, as the feature-location association has yet to be learned, a random cell in each mini-column receiving an input will be selected to be active. Each active cell in the sensory and location layer will then form reciprocal connections on one of their dendritic segments ($d'$) according to the following:

$$D^{\text{loc}}_{c,d'} := D^{\text{loc}}_{c,d'} \,|\, A^{\text{in}}_{t,\text{learn}} \quad (5)$$

$$D^{\text{in}}_{c,d'} := D^{\text{in}}_{c,d'} \,|\, A^{\text{loc}}_{t,\text{sense}} \quad (6)$$

Here "$|$" is used to indicate the bitwise OR operator; that is, if a synapse already exists between two cells, then it is unaffected by the learning rule.

Following the above, the network can rapidly learn objects by visiting each feature once, performing a single set of weight updates for each feature. Note that due to the path integration performed by the grid cells, both learning and inference can take place using an arbitrary order through the features of the object - there need be no correspondence between the order taken at learning vs. that used at testing. Note also that the learning process could in principle be implemented on hardware in a parallelized form, although for biological plausibility, this is run as a serial process.

There are several hyperparameters of the model that might be tuned to optimize performance, such as increasing the grid cell module size to increase the capacity of the model (see Lewis et al. [2019] for a quantitative exploration of this). The main parameter we tuned was the dendritic threshold $\theta^{\text{loc}}$. If this was too high, grid cells were too stringent about which sensory features had to be present in order to become active; if it was too low, grid cells were too easily activated by spurious sensory features. Details on how we tuned hyper-parameters are provided further below.
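A minimal sketch of the one-shot association in Equations 5-6, reduced to a single dendritic segment per cell for brevity (the full model selects a segment $d'$ per cell); the array layout is an assumption.

```python
# Sketch of the one-shot Hebbian association (Eqs. 5-6): co-active
# sensory and location cells are reciprocally wired via a bitwise OR,
# so existing synapses are never weakened or removed.
import numpy as np

def learn_association(D_loc_seg, D_in_seg, A_in_learn, A_loc_sense):
    # D_loc_seg: (num_loc_cells, num_sensory_cells) binary weights
    # D_in_seg:  (num_sensory_cells, num_loc_cells) binary weights
    D_loc_seg |= np.outer(A_loc_sense, A_in_learn).astype(D_loc_seg.dtype)
    D_in_seg |= np.outer(A_in_learn, A_loc_sense).astype(D_in_seg.dtype)
    return D_loc_seg, D_in_seg
```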
In order to enable classification of unseen objects, we extended the classification algorithm in Lewis et al. [2019]. Our recognition algorithm is summarized in Figure 3. The original algorithm required the location representation of the object to be a subset of the target representation, where the target was a single learned example. In contrast, we designated inference as the location representation being a subset of the union of a particular class' possible location representations, but not any other's. That is, inference requires that the location representation is a subset of only one class. Additionally, the inference stage uses information about the current position of the sensor on the unknown object to constrain the location representations it compares to. Note that any location representation (either currently instantiated in the network, or previously learned) can be represented by the set of active grid cells. Formally, the classifier's probability for class $y$ during inference therefore corresponds to:

$$p(y) = \begin{cases} 1, & (A^{\text{loc}} \subseteq L^m_{i=y}) \text{ AND } (A^{\text{loc}} \not\subseteq L^m_{i \neq y}) \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

where $L^m_i$ is the joint location representation of all examples learned for class $i$ at the sensor position $m$. Here, the position information $m$ concerns the whereabouts of the sensor in the reference frame of the unknown object. Note that the network is able to develop a representation of the object and make predictions using the entirely internal reference frame provided by the grid cells. In order to perform the classification step however, learning and inference do require some notion of sensor position in the reference frame of the unknown object's shape ("Where on the digit that I'm trying to recognise is my sensor?"). Doing so ensures that the location representations being compared are sensibly constrained.

The above inference is successful if $p(y) = 1$ and $y$ is the target class. Note however that there are three main failure cases. One is that when inference has taken place, $y$ is some other class that is not the target. In this case, the representation has converged to a wrong digit (for example converging to a 4 when it was in fact a 9). The second failure case is that the representation never converges to a subset of a learned representation (for example if a particularly unusual 6 is encountered). Finally, it is possible that the representation simultaneously satisfies multiple classes in the same inference step. This might be handled by random selection of one of the winning representations, but in our simulations this never occurred.

In principle, a risk of this approach is that as the number of learned examples grows, the unions that form the target representations would become large and saturate the sparsity and capacity of the system; in practice we found the system was able to work well with few-shot learning (e.g. up to 100 examples per class), while we never explored training on the full data-set due to demanding wall-clock training times.
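The subset test of Equation 7 reduces to simple set operations. The sketch below uses hypothetical data structures (sets of grid cell ids keyed by class and sensor position), not the authors' implementation.

```python
# Sketch of the classification rule (Eq. 7): the object is assigned to
# class y iff the current active location cells are a subset of the
# union of learned representations for y at sensor position m, and of
# no other class's union.
def classify(active_loc, class_unions, m):
    # active_loc: set of active grid cell ids
    # class_unions: {class: {sensor position m: set of learned cell ids}}
    compatible = [y for y, unions in class_unions.items()
                  if active_loc <= unions.get(m, set())]
    # Inference succeeds only when exactly one class contains the current
    # representation as a subset; otherwise it remains ambiguous.
    return compatible[0] if len(compatible) == 1 else None

unions = {9: {(2, 1): {3, 7, 11}}, 4: {(2, 1): {3, 8}}}
print(classify({3, 7}, unions, m=(2, 1)))  # -> 9
```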
Figure 3: Overview of the GridCellNet classification algorithm. a) The algorithm's concept at a high level. Note that a single sensation ($f$) is likely to be ambiguous as to the nature of the object, and so correct inference requires integration over several features. b) When the sensor moves to its first location (<1>, under Movement 1), there is no location representation on which to base sensory predictions (<2>). As a result, the first sensation (<3>) activates all the cells in columns that receive input. This activity then activates location representations which are consistent with that feature input (<4>). This will be a union of multiple object representations, some of which are consistent with the target class (shades of yellow), and some of which are not (shades of blue) - thus at this point the class identity of the object is ambiguous. With the next sensor movement (<1>, under Movement 2), the location representations of the grid cells are updated using path integration. The active grid cells then provide a prediction (via the modulatory impact of dendritic segments) to the sensory layer (<2>). The next sensory input (<3>) is consistent with two previously learned examples of 9's, i.e. with that feature representation at that location given the previous feature representations at the previous locations. The remaining active location representations (<4>) are now a sub-set of the target class (the union of location representations of all learned 9's), and classification is successful. This step uses information about the relative position of the sensor, $m$, to constrain the comparison (e.g. the bottom-middle region of the unknown digit). The completion of inference is indicated with a star, although it may not be successful (discussed in main text). Note that Movement 1, Sensation 1, Movement 2, and Sensation 2 indicate the same two-layer network over the discrete time-steps of the algorithm. Figure adapted with permission from Lewis et al. [2019].

We compare our architecture to both an RNN and a k-NN classifier. For our RNN, we used a long short-term memory (LSTM) classifier [Hochreiter and Schmidhuber, 1997]. This network received an input sequence of length 25, corresponding to the 25 locations in the image feature space. Each feature vector in this sequence was a sparse feature vector extracted from our CNN, as for the GridCellNet. In addition, the indexed position of this feature in the input sequence (in the form of an integer from 0 to 24) was provided as an additional feature element (thus each input feature vector consisted of 129 elements). Whether the order with which the sequence was provided was fixed across objects was determined by the experimental condition, but the location information provided as the additional feature value always represented the ground truth index/location of the feature.

The RNN had a single hidden layer of dimension 128. Using additional layers did not appear helpful. We used weight decay [Ng, 2004] of 0.001 and optimized with Adam [Kingma and Ba, 2015]. Learning rates were selected via a grid-search for the best performance on each classification task independently, where each task is distinguished by both the number of training examples, and whether evaluation was using a fixed or arbitrary input sequence.
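The following is a minimal PyTorch sketch of a baseline with the stated dimensions (25 steps, 128 sparse features plus a position index, hidden size 128, 10 classes). It is an illustration consistent with this description, not the authors' code.

```python
# Sketch of the LSTM baseline: a 25-step sequence of 129-dim inputs
# (128 sparse features + the ground-truth location index), a single
# hidden layer of width 128, classifying into the 10 digit classes.
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    def __init__(self, input_dim=129, hidden_dim=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                # x: (batch, 25, 129)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])          # classify from the final state

model = LSTMBaseline()
features = torch.rand(4, 25, 128)        # stand-in for sparse CNN features
positions = torch.arange(25).float().expand(4, 25).unsqueeze(-1)
logits = model(torch.cat([features, positions], dim=-1))
print(logits.shape)                      # torch.Size([4, 10])
```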
We also compared our network to a k-NN classifier. This received the same input as the RNN, but as a single extended array, rather than sequentially, and without the additional location information. The number of neighbours for the k-NN classifier [Fix and Hodges, 1989] and the dendritic threshold $\theta^{\text{loc}}$ were similarly selected via a grid-search for each classification task. Additional details for how the data-set was divided and the selected hyper-parameters are provided in Appendix 6.1.

In order to enable visualization of the current feature representations in the GridCellNet (including its predictions), we trained a multi-layer perceptron (decoder network) on the sparse binary features from a subset of the MNIST training data-set (54,000 images). This decoder had a single hidden layer of dimension 512, with input 128x5x5 and output 28x28. For the decoder we used Adam with a learning rate of 0.001, a batch size of 64, and 10 epochs of training.

To assess the utility of our proposed network, we evaluate its accuracy in three main scenarios: a fixed input sequence, arbitrary input sequences, and partial input sequences. We also assess the ability of GridCellNet to make predictions of upcoming sensory inputs.
With a fixed input sequence, the sequence of samples across the space of 5x5 features follows the same order for all objects during both training and evaluation. An arbitrary input sequence more closely resembles active sensation and is far more challenging. In this scenario, the input order is randomly determined for every object, and is not fixed between training and testing. As the GridCellNet performs only a single weight update per feature of an object during learning, we compare to LSTMs with both 1 epoch of training as well as 50 epochs. Note therefore that our few-shot setting considers not only exposure to a limited number of training examples, but also limited opportunity for weight updates with each training example.

We begin by evaluating the classifiers on the standard task of learning and generalization given a fixed input sequence. As expected, all of the classifiers do reasonably well (Figure 4a), with the exception of the LSTM constrained to only one weight update per image. GridCellNet's learning takes place using rapid Hebbian-like weight updates, and so unlike the LSTM, it can form robust representations in spite of only having observed each training object once.

We also note that the k-NN actually performs better than the LSTM with 50 epochs of training, despite the LSTM being provided with location information. This appears to be due to the challenge the LSTM faces of learning longer-range dependencies given so few training examples, in spite of being provided with information about where the current feature is located in the sequence. The ability of the LSTM to solve this appears highly sensitive to adjustments in hyper-parameters, including the learning rate. While additional hyper-parameter tuning might ameliorate this, it is notable that GridCellNet does not suffer from the issue of long-range dependencies, and performs at its optimum with minimal changes to its key hyper-parameter (see Table 1 in Appendix 6.1).

We next assess performance in the setting of an arbitrary sequence input. As predicted, only GridCellNet maintains its performance (Figure 4b). In particular, the path integration properties of grid cells enable the network to represent the spatial location of features in a manner that can handle arbitrary and previously unseen movements through space. After perceiving a feature at a given location, the active grid cells perform path integration given the current movement to meaningfully update their representation of the new location. This in turn predicts the learned features at this new location. The particularities of the path that was taken are irrelevant to this process, and so the system is robust to arbitrary feature sequences.

It is worth noting that given sufficient training time and examples, the LSTM's performance steadily improves. Deep learning architectures are undeniably powerful (indeed we used them to perform the initial feature extraction step for all of our classifiers), and it is likely that given enough training examples, the LSTM's performance would match GridCellNet's. Note in particular that GridCellNet uses only simple, Hebbian-like learning rules (i.e. no back-propagation of error). The contrast in performance in the few-shot setting however supports the proposed architecture as a principled and biologically plausible mechanism by which humans might rapidly learn under the challenging settings explored here.
Figure 4: Performance of Classifiers Given Fixed or Arbitrary Sequences of Input. Classification accuracy on 1000 examples of the MNIST test set as a function of the number of training examples per-class. a) Accuracy when an identical sequence of passes over the input space is used for both training and inference. b) Performance when the sequence can be arbitrary and different between training and inference. Error bars show the 95% confidence interval of the mean across three random seeds.
We have noted that, in principle, GridCellNet may successfully classify an object before it has received a complete sequence of all 25 features. Recall that classification has occurred as soon as GridCellNet's representation has converged to a subset of the representations associated with a particular class.

To assess the performance of the GridCellNet as a function of the number of sensations, we simply determined the cumulative accuracy as a function of the number of sensations (total possible of 25). For k-NN, we iteratively fit k-NN classifiers with progressively more elements from the feature sequence available, and assessed their accuracy.

In Figure 5a, we show that GridCellNet classifies most of the examples given to it after observing only a fraction of the total input sequence; indeed, the majority of the successful classifications occur in under 10 sensations. For comparison we show that the k-NN with an arbitrary input sequence benefits little from additional examples - it is largely limited to keying off isolated features in order to perform at above chance levels.
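The cumulative-accuracy measure can be sketched as follows; the success steps below are made-up numbers for illustration only.

```python
# Sketch of the cumulative-accuracy measure in Figure 5a: for each test
# object, record the first sensation at which inference succeeded on the
# correct class; accuracy at step s counts all objects classified by then.
import numpy as np

def cumulative_accuracy(first_success_steps, num_objects, max_steps=25):
    acc = np.zeros(max_steps)
    for step in first_success_steps:   # 1-based step of first success
        acc[step - 1:] += 1            # stays classified from then on
    return acc / num_objects

print(cumulative_accuracy([3, 5, 5, 9], num_objects=5)[:10])
# -> [0.  0.  0.2 0.2 0.6 0.6 0.6 0.6 0.8 0.8]
```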
That GridCellNet can frequently perform inference after only a few iterations has an additional advantage beyond efficiency; due to the predictive nature of the network, it can represent features in unseen parts of the image.

To visualise the representation of the GridCellNet during inference, we extract the sequence of sensed and predicted features at each progressive sensation. As the sensor passes over the sequence of inputs, the network's representation accumulates the ground-truth sensations previously experienced, as well as the prediction for the next step. Importantly however, once inference is successful, all future accumulated representations are based solely on predictions from the inputs received up until the time of inference. The totality of these representations is then fed to the decoder network to visualise the output at different stages of inference. We show examples where the network converges to a single representation at inference time; the decoder in its current form is unable to make sense of the input if the representation at inference includes a union of object representations.

Figure 5b shows examples of GridCellNet restoring from memory an example that closely matches the input. In the early stages of inference, the internal representation consists of a small number of perceived features, and the next predicted feature (highlighted box). At inference time, enough features have been sensed that the decoded image
often appears recognizable to a human. After then evaluating predictions at every unseen location, we observe that GridCellNet can recall an example from memory that is similar to the input.

Note that at every time-step, the model can make a prediction about any arbitrary location. It would therefore be possible in principle to query every location before inference, but as the decoder we use has only been trained on single features (rather than a union of overlapping representations), these visualizations are not informative.
Figure 5: Performance of GridCellNet as a Function of Sensations, and Predictive Capabilities. a) Accuracy as a function of the number of sensations. Both classifiers were trained with 5 training examples per class, using arbitrary input sequences at training and test time. b) We use the sensed and predicted features of the GridCellNet and a decoder network to visualise the system's representations. We show two example prediction sequences (top). GridCellNet predicts its next sensation (highlighted box) based on prior input. After successful inference (red border), the system predicts the entire image, substituting unseen features with those from memory. A diagram of the correspondence between past sensations and upcoming predictions is given at bottom, matching the representations seen for the '8' in the middle row. At each sequence step, the entire 5x5 grid of feature representations (each a vector of length 128) is fed to the previously trained decoder, and the corresponding reconstructed image shown. Note "Sensation 1" is not shown, as at this time, there is no prediction for the network to make.
We have presented a novel approach to the challenge of active visual object recognition given an arbitrary sequence of feature inputs sampled across space. Robustness to this task is achieved through the use of grid cells to model the location of features in the reference frame of an object. In addition, our network takes advantage of rapid Hebbian-style weight updates to enable few-shot learning, and predictive components to enable the completion of partially observed images.

This work was partly motivated by the observation that humans solve this task effortlessly when performing saccades, and that grid cells might enable a biologically plausible solution. It has been proposed that humans might perform object recognition in a variety of sensory modalities by making use of grid cell computations [Hawkins et al., 2019], including in vision [Bicanski and Burgess, 2019]. In Bicanski and Burgess [2019], the authors used features extracted from multiple locations of an image to perform a task with some similarities to our own. These features were sequentially fed to a classifier that, similar to ours, integrated these features into a learned representation. When subsequently challenged to recall which of a handful of memorized images were presented, the system successfully did so, even under settings such as partial occlusion. Importantly however, their focus was on visual recognition memory, and images used during training were the same as those used at evaluation time. Thus there was no need for the system to generalize to unseen examples, recognising the commonality between different instances of a class. Our work therefore represents the first demonstration that grid cell-like computations can be leveraged to enable generalization on a visual task to unseen examples of an object class. While this supports the plausibility of the human brain using such a mechanism, applying
the proposed system to a more challenging data-set such as Omniglot [Lake et al., 2016] is an important next step for demonstrating the capability of the approach.

In addition to the above relevance to neuroscience, the approach has implications for machine learning. In particular, providing a system with the flexibility to perform well with novel sampling sequences has an obvious advantage; assuming classification is possible without traversing the entire sequence space (as we demonstrated), then an agent using such a mechanism to perform object recognition could sample the most informative regions in a principled manner, and thereby operate more efficiently. The opposite of this is an agent that is constrained to sample every point, always following the same sequence such as a raster scan across the image. We emphasize that while the accuracy achieved by the GridCellNet is not as high as some other approaches to few-shot learning [Wong and Yuille, 2015, Lake et al., 2016], our intention is not to propose the model as a strong solution to that general setting. Rather our purpose is to show that in the context of few-shot learning, the use of grid cell representations can provide robustness to unpredictable input sequences, which might have downstream benefits for embodied agents. A future area of investigation will be pairing GridCellNet with a reinforcement learning agent that can learn to optimally control the movement of its sensor. It is also worth noting that the LSTM scales better with larger data-sets; as such, a sensible approach would likely be to combine the flexible, rapid learning of GridCellNet with longer-term learning in deep-learning architectures.

Lastly, a natural benefit that the proposed system could bring is robustness to shifts of an object across an image (i.e. translation invariance). In spite of components intended to support translation invariance, CNNs still find this challenging [Mu and Gilmer, 2019]. By relying primarily on representing features in an object's reference frame, we predict that the system would generalize to a novel location without any further training. We have not explored this however due to the constraints imposed by our feature extraction method, which does not guarantee the necessary equivariant shifts of features, as well as the current requirement that the classification criterion has some notion of where the sensor is in the reference frame of the unknown object.

Despite the above promising results and avenues for future research, we must highlight several limitations of the current work. Although we attempted to ameliorate this somewhat by using separate data sub-sets, our method of feature extraction with a CNN is un-intuitive given our task of sequentially sampling feature patches. As noted above, it also constrains the application of GridCellNet to other interesting tasks such as image translation, and as such exploring alternative feature extraction methods such as a patch-based auto-encoder is an avenue for future research.

In order to support generalization (rather than simply recall of a memorized example), there is currently a requirement that learning and classification receive information about where the sensor is positioned on the object being sensed. It is worth noting that this information was provided to the LSTM, and so it did not offer an unfair advantage for GridCellNet.
Furthermore, all of the steps of developing an object representation (including prediction of unseen features) do not require this information, and the provided signal could be as simple as the sensor's position relative to the centre of mass of an object. The primary limitation with this requirement is that the network cannot make full use of its natural bias towards translation invariance. The requirement for positional information can be relaxed, in which case classification takes place by comparing the current location representation to any of those associated with a given class, but this can overwhelm the representational capacity of the current form of GridCellNet. Future modifications such as the use of multiple cortical columns may help to address this challenge.

While we compare our architecture to an LSTM, it is possible that transformer networks [Vaswani et al., 2017] would perform better in this setting. They currently represent the state of the art in many sequence-based tasks, including visual tasks [Dosovitskiy et al., 2020], and explicitly encode positional information. It is worth noting however that, while not explored here, grid cells can in principle encode 3D structure [Klukas et al., 2020]. Transformer networks already suffer from efficiency issues with long sequences, for which the introduction of a third dimension in the input representation would be problematic. The performance of transformer networks on our task and the generalization of GridCellNet to 3D objects will therefore be topics for future investigations.

In summary, we believe that this work supports the notion that the brain may use grid cell computations when performing visual object recognition, and that this might underlie some of the visual tasks in which humans still outperform engineered systems. Future studies will aim to demonstrate this principle on more complex data-sets than the simple MNIST task explored here.
References
Subutai Ahmad and Luiz Scheinkman. How Can We Be So Dense? The Robustness of Highly Sparse Representations.
ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning, 2019.
Andrej Bicanski and Neil Burgess. A Computational Model of Visual Recognition Memory via Grid Cells. Current Biology, 2019. ISSN 09609822. doi: 10.1016/j.cub.2019.01.077.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020.
Ila R. Fiete, Yoram Burak, and Ted Brookings. What grid cells convey about rat location. Journal of Neuroscience, 28(27), 2008. ISSN 02706474. doi: 10.1523/JNEUROSCI.5684-07.2008.
Evelyn Fix and J. L. Hodges. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. International Statistical Review / Revue Internationale de Statistique, 57(3), 1989. ISSN 03067734. doi: 10.2307/1403797.
Dileep George, Wolfgang Lehrach, Ken Kansky, Miguel Lázaro-Gredilla, Christopher Laan, Bhaskara Marthi, Xinghua Lou, Zhaoshi Meng, Yi Liu, Huayan Wang, Alex Lavin, and D. Scott Phoenix. A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science, 358(6368), 2017. ISSN 10959203. doi: 10.1126/science.aag2612.
Torkel Hafting, Marianne Fyhn, Sturla Molden, May Britt Moser, and Edvard I. Moser. Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052), 2005. ISSN 00280836. doi: 10.1038/nature03721.
Jeff Hawkins and Subutai Ahmad. Why Neurons Have Thousands of Synapses, a Theory of Sequence Memory in Neocortex. Frontiers in Neural Circuits, 10, 2016. ISSN 16625110. doi: 10.3389/fncir.2016.00023.
Jeff Hawkins, Subutai Ahmad, and Yuwei Cui. A Theory of How Columns in the Neocortex Enable Learning the Structure of the World. Frontiers in Neural Circuits, 11(October):1–18, 2017. ISSN 1662-5110. doi: 10.3389/fncir.2017.00081. URL http://journal.frontiersin.org/article/10.3389/fncir.2017.00081/full.
Jeff Hawkins, Marcus Lewis, Mirko Klukas, Scott Purdy, and Subutai Ahmad. A framework for intelligence and cortical function based on grid cells in the neocortex. Frontiers in Neural Circuits, 2019. ISSN 16625110. doi: 10.3389/fncir.2018.00121.
Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8), 1997. ISSN 08997667. doi: 10.1162/neco.1997.9.8.1735.
Joshua B. Julian, Alexandra T. Keinath, Giulia Frazzetta, and Russell A. Epstein. Human entorhinal cortex represents visual space using a boundary-anchored grid. Nature Neuroscience, 21(2), 2018. ISSN 15461726. doi: 10.1038/s41593-017-0049-1.
Nathaniel J. Killian, Michael J. Jutras, and Elizabeth A. Buffalo. A map of visual space in the primate entorhinal cortex. Nature, 491(7426), 2012. ISSN 00280836. doi: 10.1038/nature11587.
Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
Mirko Klukas, Marcus Lewis, and Ila Fiete. Efficient and flexible representation of higher-dimensional cognitive variables with grid cells. PLoS Computational Biology, 16(4), 2020. ISSN 15537358. doi: 10.1371/journal.pcbi.1007796.
Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building Machines That Learn and Think Like People. Behavioral and Brain Sciences, pages 1–101, 2016. ISSN 14691825. doi: 10.1017/S0140525X16001837.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998. ISSN 00189219. doi: 10.1109/5.726791.
Marcus Lewis, Scott Purdy, Subutai Ahmad, and Jeff Hawkins. Locations in the neocortex: A theory of sensorimotor object recognition using cortical grid cells. Frontiers in Neural Circuits, 2019. ISSN 16625110. doi: 10.3389/fncir.2019.00022.
Xiaoyang Long and Sheng Jia Zhang. A novel somatosensory spatial navigation system outside the hippocampal formation, 2021. ISSN 17487838.
Xiaoyang Long, Bin Deng, Jing Cai, Zhe Chen, and Sheng-Jia Zhang. A compact spatial map in V2 visual cortex. bioRxiv, 2021.
Edvard I. Moser, Emilio Kropff, and May Britt Moser. Place cells, grid cells, and the brain's spatial representation system, 2008. ISSN 0147006X.
Norman Mu and Justin Gilmer. MNIST-C: A Robustness Benchmark for Computer Vision. ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning, 2019.
Matthias Nau, Tobias Navarro Schröder, Jacob L.S. Bellmund, and Christian F. Doeller. Hexadirectional coding of visual space in human entorhinal cortex. Nature Neuroscience, 21(2), 2018. ISSN 15461726. doi: 10.1038/s41593-017-0050-8.
Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004, 2004. ISBN 1581138385. doi: 10.1145/1015330.1015435.
J. O'Keefe and D. H. Conway. Hippocampal place units in the freely moving rat: Why they fire where they fire. Experimental Brain Research, 31(4), 1978. ISSN 00144819. doi: 10.1007/BF00239813.
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot Learning with Memory-Augmented Neural Networks. NeurIPS 2016 Deep Learning Symposium, arXiv:1605.06065, 2016.
Hanne Stensola, Tor Stensola, Trygve Solstad, Kristian Frøland, May Britt Moser, and Edvard I. Moser. The entorhinal grid map is discretized, 2012. ISSN 00280836.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 2017-December, 2017.
Alex Wong and Alan Yuille. One shot learning via compositions of meaningful patches. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
Alfred L. Yarbus. Eye Movements During Perception of Complex Objects, 1967.
Appendix

In Figure 6, we provide a break-down of how different splits of the data are used for different steps in feature extraction, hyper-parameter tuning, and learning. In Table 1, we list the hyper-parameters arrived at for each classifier, specific to each learning task.
Figure 6: Division of the data-set for training and evaluation. As the down-stream classifiers use features derived from a network trained end-to-end, and we are interested in few-shot learning, we divide the data set such that the training of the various systems uses different sub-sets of the data. This also allows us to perform hyper-parameter tuning on hold-out data, with the exception of the key hyperparameters which are deliberately selected to enhance few-shot learning on the final evaluation data for all classifiers. Note that although images are therefore re-used during this step of hyper-parameter tuning and final evaluation, the feature vectors generated by the CNN encoder will vary with the random seed used, and therefore between the settings in which hyper-parameter tuning and evaluation are performed.
Table 1: Choice of Hyper-parameters for Classifiers Given Fixed or Arbitrary Input Sequences.