Lifelong Learning of Spatiotemporal Representations with Dual-Memory Recurrent Self-Organization
German I. Parisi, Jun Tani, Cornelius Weber, Stefan Wermter
Knowledge Technology, Department of Informatics, Universität Hamburg, Germany
Cognitive Neurorobotics Research Unit, Okinawa Institute of Science and Technology, Japan
Abstract
Artificial autonomous agents and robots interacting in complex environments are required to continually acquire and fine-tune knowledge over sustained periods of time. The ability to learn from continuous streams of information is referred to as lifelong learning and represents a long-standing challenge for neural network models due to catastrophic forgetting, in which novel sensory experience interferes with existing representations and leads to abrupt decreases in the performance on previously acquired knowledge. Computational models of lifelong learning typically alleviate catastrophic forgetting in experimental scenarios with given datasets of static images and limited complexity, thereby differing significantly from the conditions artificial agents are exposed to. In more natural settings, sequential information may become progressively available over time and access to previous experience may be restricted. Therefore, specialized neural network mechanisms are required that adapt to novel sequential experience while preventing disruptive interference with existing representations. In this paper, we propose a dual-memory self-organizing architecture for lifelong learning scenarios. The architecture comprises two growing recurrent networks with the complementary tasks of learning object instances (episodic memory) and categories (semantic memory). Both growing networks can expand in response to novel sensory experience: the episodic memory learns fine-grained spatiotemporal representations of object instances in an unsupervised fashion while the semantic memory uses task-relevant signals to regulate structural plasticity levels and develop more compact representations from episodic experience.
For the consolidation of knowledge in the absence of external sensory input, the episodic memory periodically replays trajectories of neural reactivations. We evaluate the proposed model on the CORe50 benchmark dataset for continuous object recognition, showing that we significantly outperform current methods of lifelong learning in three different incremental learning scenarios.
Artificial autonomous agents and robots interacting in dynamic environments are required to continually acquire and fine-tune their knowledge over time (Thrun and Mitchell, 1995; Parisi et al., 2018a). The ability to progressively learn over a sustained time span by accommodating novel knowledge while retaining previously learned experiences is referred to as continual or lifelong learning. In contrast to state-of-the-art deep learning models that typically rely on the full training set being available at once (see LeCun et al., 2015 for a review), lifelong learning systems must account for situations in which the training data become incrementally available over time. Effective models of lifelong learning are crucial in real-world conditions where an autonomous agent cannot be provided with all the necessary prior knowledge to interact with the environment and the direct access to previous experience is restricted (Thrun and Mitchell, 1995). Importantly, there may be no distinction between training and test phases, which requires the system to concurrently learn and timely trigger behavioral responses (Cangelosi and Schlesinger, 2015; Tani, 2016). Lifelong machine learning represents a long-standing challenge due to catastrophic forgetting or interference, i.e., training a model with a new task leads to an abrupt decrease in the performance on previously learned tasks (McCloskey and Cohen, 1989). To overcome catastrophic forgetting, computational
Preprint. Parisi et al. (2018) Front. Neurorobot. 12:78. doi: 10.3389/fnbot.2018.00078

models must adapt their existing representations on the basis of novel sensory experience while preventing disruptive interference with previously learned representations. The extent to which a system must be flexible for learning novel knowledge and stable for preventing the disruption of consolidated knowledge is known as the stability-plasticity dilemma, which has been extensively studied for both computational and biological systems (e.g., Grossberg, 1980, 2007; Mermillod et al., 2013; Ditzler et al., 2015). Neurophysiological evidence suggests distributed mechanisms of structural plasticity that promote lifelong memory formation, consolidation, and retrieval in multiple brain areas (Power and Schlaggar, 2016; Zenke et al., 2017a). Such mechanisms support the development of the human cognitive system on the basis of sensorimotor experiences over sustained time spans (Lewkowicz, 2014). Crucially, the brain must constantly perform two complementary tasks: (i) recollecting separate episodic events (specifics), and (ii) learning the statistical structure from the episodic events (generalities). The complementary learning systems (CLS) theory (McClelland et al., 1995; Kumaran et al., 2016) holds that these two interdependent operations are mediated by the interplay of the mammalian hippocampus and neocortex, providing the means for episodic memory (specific experience) and semantic memory (general structured knowledge). Accordingly, the hippocampal system exhibits quick learning of sparse representations from episodic experience which will, in turn, be transferred and integrated into the neocortical system, characterized by a slower learning rate with more compact representations of statistical regularities.

Re-training a (deep) neural architecture from scratch in response to novel sensory input can require extensive computational effort.
Furthermore, storing all the previously encountered data in lifelong learning scenarios has the general drawback of large memory requirements. Instead, Robins (1995) proposed pseudo-rehearsal (or intrinsic replay) in which previous memories are revisited without the need of explicitly storing data samples. Pseudo-samples are drawn from a probabilistic or generative model and replayed to the system for memory consolidation. From a biological perspective, the direct access to past experiences is limited or restricted. Therefore, the replay of hippocampal representations in the absence of external sensory input plays a crucial role in memory encoding (Carr et al., 2011; Kumaran et al., 2016). Memory replay is argued to occur through the reactivation of neural patterns during both sleep and awake states (e.g., free recall; Gelbard-Sagiv et al., 2008). Hippocampal replay provides the means for the gradual integration of knowledge into neocortical structures through the reactivation of recently acquired knowledge interleaved with the exposure to ongoing episodic experience (McClelland et al., 1995). Consequently, the periodic replay of previously encountered samples can alleviate catastrophic forgetting during incremental learning tasks, especially when the number of training samples for the different classes is unbalanced or when a sample is encountered only once (Robins, 1995).

A number of computational approaches have drawn inspiration from the learning principles observed in biological systems. Machine learning models addressing lifelong learning can be divided into approaches that regulate intrinsic levels of plasticity to protect consolidated knowledge, that dynamically allocate neural resources in response to novel experience, or that use complementary dual-memory systems with memory replay (see section 2).
However, most of these methods are designed to address supervised learning on image datasets of very limited complexity such as MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) while not scaling up to incremental learning tasks with larger-scale datasets of natural images and videos (Kemker et al., 2018; Parisi et al., 2018a). Crucially, such models do not take into account the temporal structure of the input, which plays an important role in more realistic learning conditions, e.g., an autonomous agent learning from the interaction with the environment. Therefore, in contrast to approaches in which static images are learned and recognized in isolation, we focus on lifelong learning tasks where sequential data with meaningful temporal relations become progressively available over time.

In this paper, we propose a growing dual-memory (GDM) architecture for the lifelong learning of spatiotemporal representations from videos, performing continuous object recognition at an instance level (episodic knowledge) and at a category level (semantic knowledge). The architecture comprises two recurrent self-organizing memories that dynamically adapt the number of neurons and synapses: the episodic memory learns representations of sensory experience in an unsupervised fashion through input-driven plasticity, whereas the semantic memory develops more compact representations of statistical regularities embedded in episodic experience. For this purpose, the semantic memory receives neural activation trajectories from the episodic memory and uses task-relevant signals (annotated labels) to modulate levels of neurogenesis and neural update. Internally generated neural activity patterns in the episodic memory are periodically replayed to both memories in the absence of sensory input, thereby mitigating catastrophic forgetting during incremental learning.
We conduct a series of experiments with the recently published Continuous Object Recognition (CORe50) benchmark dataset (Lomonaco and Maltoni, 2017). The dataset comprises 50 objects within 10 categories, with image sequences captured under different conditions and containing multiple views of the same objects (indoors and outdoors, varying background, object pose, and degree of occlusion). We show that our model scales up to learning novel object instances and categories and that it outperforms current lifelong learning approaches in three different incremental learning scenarios.

The CLS theory (McClelland et al., 1995) provides the basis for computational frameworks that aim to generalize across experiences while retaining specific memories in a lifelong fashion. Early computational attempts include French (1997), who developed a dual-memory framework using pseudo-rehearsal (Robins, 1995) to transfer memories, i.e., the training samples are not explicitly kept in memory but drawn from a probabilistic model. However, there is no empirical evidence showing that this or similar contemporaneous approaches (see O'Reilly and Norman, 2002 for a review) scale up to large-scale image and video benchmark datasets. More recently, Gepperth and Karaoguz (2015) proposed two approaches for incremental learning using a modified self-organizing map (SOM) and a SOM extended with a short-term memory (STM). We refer to these two approaches as GeppNet and GeppNet+STM, respectively. In GeppNet, task-relevant feedback from a regression layer is used to select whether learning in the self-organizing hidden layer takes place. In GeppNet+STM, the STM is used to store novel knowledge which is occasionally played back to the GeppNet layer during sleep phases interleaved with training phases. This latter approach yields better performance and faster convergence in incremental learning tasks with the MNIST dataset. However, the STM has a limited capacity, thus learning new knowledge can overwrite old knowledge.
In both cases, the learning process is divided into an initialization phase and the actual incremental learning phase. Furthermore, GeppNet and GeppNet+STM require storing the entire training dataset during training. Kemker and Kanan (2018) proposed the FearNet model for incremental class learning, inspired by studies of memory recall and consolidation in the mammalian brain during fear conditioning (Kitamura et al., 2017). FearNet uses a hippocampal network capable of immediately recalling new examples, a PFC network for long-term memories, and a third neural network inspired by the basolateral amygdala for determining whether the system should use the PFC or hippocampal network for a particular example. FearNet consolidates information from its hippocampal network to its PFC network during sleep phases. Kamra et al. (2018) presented a similar dual-memory framework for lifelong learning that uses a variational autoencoder as a generative model for pseudo-rehearsal. Their framework generates a short-term memory module for each new task. However, prior to consolidation, predictions are made using an oracle, i.e., they know which module contains the associated memory.

Different methods have been proposed that are based on regularization techniques to impose constraints on the update of the neural weights. This is inspired by neuroscience findings suggesting that consolidated knowledge can be protected from interference via changing levels of synaptic plasticity (Benna and Fusi, 2016) and is typically modeled in terms of adding regularization terms that penalize changes in the mapping function of a neural network. For instance, Li and Hoiem (2016) proposed a convolutional neural network (CNN) architecture in which the network that predicts the previously learned tasks is enforced to be similar to the network that also predicts the current task by using knowledge distillation, i.e., the transfer of knowledge from a large, highly regularized model to a smaller model.
This approach, known as learning without forgetting (LwF), has the drawbacks of depending highly on the relevance of the tasks and of a training time for one task that increases linearly with the number of old tasks. Kirkpatrick et al. (2017) proposed elastic weight consolidation (EWC), which adds a penalty term to the loss function and constrains the weight parameters that are relevant to retain previously learned tasks. However, this approach requires a diagonal weighting over the parameters of the learned tasks which is proportional to the diagonal of the Fisher information metric, with synaptic importance being computed offline, limiting its computational application to low-dimensional output spaces. Zenke et al. (2017b) proposed to alleviate catastrophic forgetting by allowing individual synapses to estimate their importance for solving a learned task. Similar to Kirkpatrick et al. (2017), this approach penalizes changes to the most relevant synapses so that new tasks can be learned with minimal interference. In this case, the synaptic importance is computed in an online fashion over the learning trajectory in the parameter space.

In general, regularization approaches comprise additional loss terms for protecting consolidated knowledge which, with a limited amount of neural resources, leads to a trade-off between the performance on old and novel tasks. Other approaches expand the neural architecture to accommodate novel knowledge. Rusu et al. (2016) proposed to block any changes to the network trained on previous knowledge and to expand the architecture by allocating novel sub-networks with a fixed capacity to be trained with the new information. This prevents catastrophic forgetting but leads the complexity of the architecture to grow with the number of learned tasks. Draelos et al. (2017) trained an autoencoder incrementally, using the reconstruction error to show whether the older digits were retained.
Their model added new neural units to the autoencoder to facilitate the addition of new MNIST digits. Rebuffi et al. (2017) proposed the iCaRL approach, which stores example data points that are used along with new data to dynamically adapt the weights of a feature extractor. By combining new and old data, they prevent catastrophic forgetting but at the expense of a higher memory footprint.

The approaches described above are designed for the classification of static images, often exposing the learning algorithm to training samples in a random order. Conversely, in more natural settings, we make use of the spatiotemporal structure of the input. In previous research (Parisi et al., 2017), we showed that the lifelong learning of action sequences can be achieved in terms of prediction-driven neural dynamics with internal representations emerging in a hierarchy of recurrent self-organizing networks. The networks can dynamically allocate neural resources and update connectivity patterns according to competitive Hebbian learning by computing the input based on its similarity with existing knowledge and minimizing interference by creating new neurons whenever they are required. This approach has shown competitive results with batch learning methods on action benchmark datasets. However, the neural growth and update are driven by the minimization of the bottom-up reconstruction error and, thus, without taking into account top-down, task-relevant signals that can regulate the plasticity-stability balance. Furthermore, the model cannot learn in the absence of external sensory input, which leads to a non-negligible degree of disruptive interference during incremental learning tasks.
The proposed architecture with growing dual-memory learning (GDM) comprises a deep convolutional feature extractor and two hierarchically arranged recurrent self-organizing networks (Figure 1). Both recurrent networks are extended versions of the Gamma-GWR model (Parisi et al., 2017) that dynamically create new neurons and connections in response to novel sequential input. The growing episodic memory (G-EM) learns from sensory experience in an unsupervised fashion, i.e., levels of structural plasticity are regulated by the ability of the network to predict the spatiotemporal patterns given as input. Instead, the growing semantic memory (G-SM) receives neural activation trajectories from G-EM and uses task-relevant signals (input annotations) to modulate levels of neurogenesis and neural update, thereby developing more compact representations of statistical regularities embedded in episodic experience. Therefore, G-EM and G-SM mitigate catastrophic forgetting through self-organizing learning dynamics with structural plasticity, increasing information storage capacity in response to novel input.

The architecture classifies image sequences at an instance level (episodic experience) and a category level (semantic knowledge). Thus, each input sample carries two labels which are used for the classification task at the different levels of the network hierarchy. For the consolidation of knowledge over time in the absence of sensory input, internally generated neural activity patterns in G-EM are periodically replayed to both memories, thereby mitigating catastrophic forgetting during incremental learning tasks. For this purpose, G-EM is equipped with synapses that learn statistically significant neural activity in the temporal domain. As a result, sequence-selective neural activation trajectories can be generated and replayed after each learning episode without explicitly storing sequential input.
The Gamma-GWR model (Parisi et al., 2017) is a recurrent extension of the Grow-When-Required (GWR) self-organizing network (Marsland et al., 2002) that embeds a Gamma memory (Principe et al., 1994) for representing short-term temporal relations. The Gamma-GWR can dynamically grow or shrink in response to the sensory input distribution. New neurons will be created to better represent the input, and connections (synapses) between neurons will develop according to competitive Hebbian learning.
Figure 1: (A) Illustration of our growing dual-memory (GDM) architecture for lifelong learning. Extracted features from image sequences are fed into a growing episodic memory (G-EM) consisting of an extended version of the recurrent Grow-When-Required network (section 3.2). Neural activation trajectories from G-EM are fed forward to the growing semantic memory (G-SM), which develops more compact representations of episodic experience (section 3.3). While the learning process of G-EM remains unsupervised, G-SM uses class labels as task-relevant signals to regulate levels of structural plasticity. After each learning episode, internally generated neural activation trajectories are replayed to both memories (green arrows; section 3.4); (B) The architecture classifies image sequences at instance level (episodic experience) and at category level (semantic knowledge). For the purpose of classification, neurons in G-EM and G-SM associatively learn histograms of class labels from the input (red dashed lines); (C) To enable memory replay in the absence of sensory input, G-EM is equipped with temporal synapses that are strengthened (thicker arrow) between consecutively activated best-matching units (BMUs).

Neurons that activate simultaneously will be connected to each other. The Gamma-GWR learns the spatiotemporal structure of the input through the integration of temporal context into the computation of the self-organizing network dynamics.

The network is composed of a dynamic set of neurons, $A$, with each neuron consisting of a weight vector $w_j$ and a number $K$ of context descriptors $c_{j,k}$ ($w_j, c_{j,k} \in \mathbb{R}^n$).
Given the input $x(t) \in \mathbb{R}^n$, the index of the best-matching unit (BMU), $b$, is computed as:

$$b = \arg\min_{j \in A} (d_j), \qquad (1)$$

$$d_j = \alpha_0 \, \lVert x(t) - w_j \rVert + \sum_{k=1}^{K} \alpha_k \, \lVert C_k(t) - c_{j,k} \rVert, \qquad (2)$$

$$C_k(t) = \beta \cdot w_b^{t-1} + (1 - \beta) \cdot c_{b,k-1}^{t-1}, \qquad (3)$$

where $\lVert \cdot \rVert$ denotes the Euclidean distance, $\alpha_i$ and $\beta$ are constant factors that regulate the influence of the temporal context, $w_b^{t-1}$ is the weight vector of the BMU at $t-1$, and $C_k \in \mathbb{R}^n$ is the global context of the network with $C_k(0) = 0$.

The activity of the network, $a(t)$, is defined in relation to the distance between the input and its BMU (Equation 2) as follows:

$$a(t) = \exp(-d_b), \qquad (4)$$

thus yielding the highest activation value of 1 when the network can perfectly predict the input sequence (i.e., $d_b = 0$). Furthermore, each neuron is equipped with a habituation counter $h_j \in [0, 1]$ expressing how frequently it has fired, based on a simplified model of how the efficacy of a habituating synapse reduces over time (Stanley, 1976). Newly created neurons start with $h_j = 1$. Then, the habituation counters of the BMU, $b$, and of its neighboring neurons, $n$, iteratively decrease towards 0. The habituation rule (Marsland et al., 2002) for a neuron $i$ is given by:

$$\Delta h_i = \tau_i \cdot \kappa \cdot (1 - h_i) - \tau_i, \qquad (5)$$

with $i \in \{b, n\}$ and where $\tau_i$ and $\kappa$ are constants that control the monotonically decreasing behavior. Typically, $h_b$ is decreased faster than $h_n$, with $\tau_b > \tau_n$.

The network is initialized with two neurons and, at each learning iteration, a new neuron is created whenever the activity of the network, $a(t)$, in response to the input $x(t)$ is smaller than a given insertion threshold $a_T$. Furthermore, $h_b$ must be smaller than a habituation threshold $h_T$ in order for the insertion condition to hold, thereby fostering the training of existing neurons before new ones are added. The new neuron is created halfway between the BMU and the input. The training of the
The training of theneurons is carried out by adapting the BMU b and the neurons n to which the b is connected: ∆ w i = (cid:15) i · h i · ( x ( t ) − w i ) , (6) ∆ c i,k = (cid:15) i · h i · ( C k ( t ) − c i,k ) , (7)with i ∈ { b, n } and where (cid:15) i is a constant learning rate ( (cid:15) n < (cid:15) b ). Furthermore, the habituationcounters of the BMU and the neighboring neurons are updated according to Equation 5. Connectionsbetween neurons are updated on the basis of neural co-activation, i.e. when two neurons fire together(BMU and second-BMU), a connection between them is created if it does not yet exist. Eachconnection has an age that increases at each learning iteration. The age of the connection betweenthe BMU and the second-BMU is reset to , whereas the other ages are increased by a value of .The connections with an age greater than a given threshold can be removed, and neurons withoutconnections can be deleted.For the purpose of classification, an associative matrix H ( j, l ) stores the frequency-based distributionof sample labels during the learning phase so that each neuron j stores the number of times that aninput with label l had j as its BMU. Thus, the predicted label ξ j for a neuron j can be computed as: ξ j = arg max l ∈ L H ( j, l ) , (8)where l is an arbitrary label. Therefore, the unsupervised Gamma-GWR can be used for classificationwithout requiring the number of label classes to be predefined. The learning process of growing episodic memory G-EM is unsupervised, thereby creating newneurons or updating existing ones to minimize the discrepancy between the sequential input andits neural representation. In this way, episodic memories can be acquired and fine-tuned iterativelythrough sensory experience. 
Such unsupervised episodic learning is functionally consistent with hippocampal representations, e.g., in the dentate gyrus, which are responsible for pattern separation through the orthogonalization of incoming inputs, supporting the auto-associative storage and retrieval of item-specific information from individual episodes (Yassa and Stark, 2011; Neunuebel and Knierim, 2014).

Given an input image frame, the extracted image feature vector (see section 4.1) is given as input to G-EM, which recursively integrates the temporal context into the self-organizing neural dynamics. The spatial resolution of G-EM neurons can be tuned through the insertion threshold, $a_T$, with a greater $a_T$ leading to more fine-grained representations since new neurons will be created whenever $a(t) < a_T$ (see Equation 4). The temporal depth is set by the number of context descriptors, $K$, with a greater $K$ yielding neurons that activate for larger temporal windows (longer sequences), whereas the temporal resolution is set by the hyperparameters $\alpha$ and $\beta$ (see Equations 2 and 3).

To enable memory replay in the absence of external sensory input, we extend the Gamma-GWR model by implementing temporal connections that learn trajectories of neural activity in the temporal domain. Such temporal connections are sequence-selective synaptic links which are incremented between two consecutively activated neurons (Parisi et al., 2016). Sequence selectivity driven by asymmetric connections has been argued to be a feature of the cortex (Mineiro and Zipser, 1998), where an active neuron pre-activates neurons encoding future patterns. Formally, when two neurons $i$ and $j$ are consecutively activated at time $t-1$ and $t$ respectively, their temporal synaptic link $P(i,j)$ is increased by $\Delta P(i,j) = 1$. For each neuron $i \in A$, we can retrieve the next neuron $v$ of a prototype trajectory by selecting

$$v = \arg\max_{j \in A \setminus \{i\}} P(i,j). \qquad (9)$$

Recursively generated neural activation trajectories can be used for memory replay (see section 3.4).
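The temporal synapses and the retrieval rule of Equation (9) amount to a simple counting matrix. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def update_temporal_synapses(P, prev_bmu, bmu):
    """Strengthen the link from the BMU at t-1 to the BMU at t."""
    P[prev_bmu, bmu] += 1.0
    return P

def next_in_trajectory(P, i):
    """Eq. (9): v = argmax_{j != i} P(i, j)."""
    scores = P[i].astype(float).copy()
    scores[i] = -np.inf          # exclude the neuron itself (A \ {i})
    return int(np.argmax(scores))

# Toy usage: a network of 4 neurons repeatedly observing the
# activation order 0 -> 2 -> 3.
P = np.zeros((4, 4))
for _ in range(5):
    for prev, cur in [(0, 2), (2, 3)]:
        update_temporal_synapses(P, prev, cur)
```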
During the learning phase, G-EM neurons will store instance-level label classes $\xi^I$ for the classification of the input (see Equation 8). Furthermore, since trajectories of G-EM neurons are replayed to G-SM in the absence of sensory input, G-EM neurons will also store labels at the category level, $l^C$. Therefore, the associative matrix for each neuron $j$ is of the form $H(j, l^I, l^C)$.

The growing semantic memory G-SM combines bottom-up drive from neural activity in G-EM and top-down signals (i.e., category-level labels from the input) to regulate structural plasticity levels. More specifically, the mechanisms of neurogenesis and neural weight update are regulated by the ability of G-SM to correctly classify its input. Therefore, while G-EM iteratively minimizes the discrepancy between the input sequences and their internal representations, G-SM will create new neurons only if the correct label of a training sample cannot be predicted by its BMU in G-SM. This is implemented as an additional constraint in the condition for neurogenesis so that new neurons are not created unless the predicted label of the BMU (Equation 8) does not match the input label.

G-SM receives as input activated neural weights from G-EM, i.e., the weight vector of a BMU in G-EM, $w^{EM}_b$, for a given input frame. As an additional mechanism to prevent novel sensory experience from interfering with consolidated representations, G-SM neurons are updated (Equations 6 and 7) only if the predicted label of the BMU in G-SM matches the input label, i.e., if the BMU codes for the same object category as the input. In this way, the representations of an object category cannot be updated in the direction of an input belonging to a different category, which would cause disruptive interference.

As a result of hierarchical processing, G-SM neurons code for information acquired over larger temporal windows than neurons in G-EM.
That is, one G-SM neuron will fire for a number $K_{SM} + 1$ of neurons fired in G-EM (where $K_{SM}$ is the temporal depth of G-SM neurons). Since G-EM neurons will fire for a number $K_{EM} + 1$ of input frames, G-SM neurons will code for a total of $K_{SM} + K_{EM} + 1$ input frames. This is consistent with established models of memory consolidation in which neocortical representations code for information acquired over more extended time periods than the hippocampus (e.g., Kumaran and McClelland, 2012; Kumaran et al., 2016), thereby yielding a higher degree of temporal slowness.

Temporal slowness results from the statistical learning of spatiotemporal regularities, with neurons coding for prototype sequences of sensory experience. By using category-level signals to regulate neural growth and update, G-SM will develop more compact representations from episodic experience, with neurons activating in correspondence of semantically related input, e.g., the same neuron may activate for different instances of the same category and, because of the processing of temporal context, for the same object seen from different angles. However, specialized mechanisms of slow feature analysis can be implemented that would yield invariance to complex input transformations such as view invariance (e.g., Berkes and Wiskott, 2005; Einhäuser et al., 2005). View invariance of objects is a prominent property of higher-level visual areas of the mammalian brain, with neurons coding for abstract representations of familiar objects rather than for individual views and visual features (Booth and Rolls, 1998; Karimi-Rouzbahani et al., 2017). Neurophysiological studies evidence that distributed representations in high-level visual regions of the neocortex (semantic) are less sparse than those of the hippocampus (episodic), and that related categories are represented by overlapping neural codes (Clarke and Tyler, 2014; Yamins et al., 2018).
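The task-driven gating of structural plasticity in G-SM described above can be summarized in a small decision function. This is an illustrative sketch (function and argument names are ours), not the full training loop:

```python
def gsm_plasticity_gate(predicted_label, input_label, activity, a_T, hab_bmu, h_T):
    """Sketch of G-SM's label-regulated structural plasticity.

    Returns 'insert', 'update', or 'skip'. Unlike G-EM, a new neuron is
    created only if the usual GWR insertion condition holds AND the BMU
    misclassifies the input; weight updates are applied only when the BMU
    already codes for the input's category, so representations of one
    category are never dragged towards samples of another."""
    mismatch = predicted_label != input_label
    if activity < a_T and hab_bmu < h_T and mismatch:
        return "insert"   # neurogenesis gated by the classification error
    if not mismatch:
        return "update"   # Eqs. (6)-(7) applied only on a label match
    return "skip"         # wrong category: leave the representation intact
```

For example, a well-habituated BMU with low activity that also mispredicts the category triggers neurogenesis, whereas a correct prediction only adapts the existing prototype.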
Hippocampal replay provides the means for the gradual integration of knowledge into neocortical structures and is thought to occur through the reactivation of recently acquired knowledge interleaved with the exposure to ongoing experiences (McClelland et al., 1995). Although the periodic replay of previous data samples can alleviate catastrophic forgetting, storing all previously encountered data samples has the general drawbacks of large memory requirements and long retraining times.

In pseudo-rehearsal (or intrinsic replay), memories are drawn from a probabilistic or generative model and replayed to the system for memory consolidation (Robins, 1995). In our case, however, we cannot simply draw or generate isolated and randomly selected pseudo-samples from a given distribution since we must account for preserving the temporal structure of the input. Therefore, we generate pseudo-patterns in terms of temporally ordered trajectories of neural activity. For this purpose, we propose to use the asymmetric temporal links of G-EM (section 3.2) to recursively reactivate sequence-selective neural activity trajectories (RNATs) embedded in the network. RNATs can be computed for each neuron in G-EM for a given temporal window and replayed to G-EM and G-SM after each learning episode triggered by external input stimulation.

For each neuron $j$ in G-EM, we generate a RNAT, $S_j$, of length $\lambda = K_{EM} + K_{SM} + 1$ as follows:

$$S_j = \langle w^{EM}_{s(0)}, w^{EM}_{s(1)}, \ldots, w^{EM}_{s(\lambda)} \rangle, \qquad (10)$$

$$s(i) = \arg\max_{n \in A \setminus \{j\}} P(n, s(i-1)), \quad i \in [1, \lambda], \qquad (11)$$

where $P(i,j)$ is the matrix of temporal synapses (as defined by Equation 9) and $s(0) = j$. The class labels of the pseudo-patterns in $S_j$ can be retrieved according to Equation 8.

The set of RNATs generated from all G-EM neurons is replayed to G-EM and G-SM after each learning episode, i.e., a learning epoch over a batch of sensory observations.
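Generating a RNAT requires only the matrix of temporal synapses and the neuron weights. A minimal sketch of Equations (10) and (11), with names of our choosing:

```python
import numpy as np

def generate_rnat(P, weights, j, k_em, k_sm):
    """Recursively follow Eq. (11) from neuron j to build the trajectory
    S_j of Eq. (10), with lambda = k_em + k_sm + 1."""
    lam = k_em + k_sm + 1
    s = [j]                                       # s(0) = j
    for _ in range(lam):
        scores = P[:, s[-1]].astype(float).copy() # P(n, s(i-1)) for all n
        scores[j] = -np.inf                       # n ranges over A \ {j}
        s.append(int(np.argmax(scores)))
    return [weights[idx] for idx in s]            # <w_s(0), ..., w_s(lambda)>
```

Replaying the returned list of weight vectors to G-EM and G-SM then stands in for a stored training sequence, without keeping any raw frames.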
As a result of computing RNATs, sequence-selective prototype sequences can be generated and periodically replayed without the need of explicitly storing the temporal relations and labels of previously seen training samples. This is conceptually consistent with neurophysiological studies evidencing that hippocampal replay consists of the reactivation of previously stored patterns of neural activity, occurring predominantly after an experience (Kudrimoti et al., 1999; Karlsson and Frank, 2009).

We perform a series of experiments evaluating the performance of the proposed GDM model in batch learning (section 4.2), incremental learning (section 4.3), and incremental learning with memory replay (section 4.4). We analyze and evaluate our model with the CORe50 dataset (Lomonaco and Maltoni, 2017; see section 4.1), a recently published benchmark for continuous object recognition from video sequences. We reproduce three experimental conditions defined by the CORe50 benchmark (section 4.5), showing that our model significantly outperforms state-of-the-art lifelong learning approaches. For the replication of these experiments, the source code of the GDM model is available as a repository (https://github.com/giparisi/GDM).

The CORe50 dataset comprises 50 objects within 10 categories, with image sequences captured under different conditions and multiple views of the same objects (varying background, object pose, and degree of occlusion; see Figure 2). Each object comprises a video sequence of approximately 15 seconds in which the object, held by a human operator, is shown to the vision sensor. The video sequences were collected in 11 sessions (8 indoors, 3 outdoors) with a Kinect 2.0 sensor delivering RGB and depth images at 20 frames per second (fps), for a total of 164,866 frames. For our experiments, we used the 128×128 RGB images provided by the dataset at a reduced frame rate of 5 fps. The movements performed by the human operator with the objects (e.g., rotation) are quite smooth, and reducing the number of frames per second has not shown a significant loss of information.

For a more direct comparison with the baseline results provided by Lomonaco and Maltoni (2017), who adopted the VGG model (Simonyan and Zisserman, 2014) pre-trained on the ILSVRC-2012 dataset (Russakovsky et al., 2014), our feature extraction module consists of the same pre-trained VGG model, to which we applied a convolutional operation with 256 1×1 kernels on the output of the fully-connected hidden layer to reduce its dimensionality from 2048 to 256. Therefore, G-EM receives a 256-dimensional feature vector per sequence frame. Such compression of the feature vectors is desirable since the Gamma-GWR uses the Euclidean distance as a metric to compute the BMUs, which becomes weakly discriminant when the data are very high-dimensional or sparse (Parisi et al., 2015).

Figure 2: The CORe50 dataset designed for continuous object recognition: (A) Example frames of the 10 categories (columns) comprising 5 object instances each; (B) Example frames for one object instance from the 11 acquisition sessions, showing different background, illumination, pose, and degree of occlusion. Adapted from Lomonaco and Maltoni (2017).

Furthermore, it is expected that different pre-trained models may exhibit a slightly better performance than VGG, e.g., ResNet-50 (He et al., 2016; see Lomonaco and Maltoni, 2018 for ResNet-50 performance on CORe50). However, here we focus on showing the contribution of context-aware growing networks rather than comparing deep feature extractors.
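Applied to a single flattened feature vector, a 1×1 convolution with 256 kernels is equivalent to a learned 256×2048 linear projection. A minimal sketch with random, untrained weights, purely to illustrate the shapes involved (the weight initialization here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical untrained projection: 256 kernels of size 1x1 applied to a
# 2048-d feature vector act as a 256x2048 matrix product.
W_1x1 = rng.standard_normal((256, 2048)) * 0.01
vgg_features = rng.standard_normal(2048)   # output of the fc hidden layer

frame_vector = W_1x1 @ vgg_features        # 256-d input for G-EM
print(frame_vector.shape)  # (256,)
```

In practice the kernel weights are learned end to end with the classifier; the point of the sketch is that each video frame reaches G-EM as a compact 256-dimensional vector.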
We trained the architecture on the entire training set at once and subsequently tested its classification performance at the instance and category level. Following the same evaluation scheme described by Lomonaco and Maltoni (2017), we used the samples from sessions #3, #7, and #10 for testing and the samples from the remaining 8 sessions for training. We compare our results to the baseline provided by Lomonaco and Maltoni (2017) using fine-tuning on a pre-trained VGG network (VGG+FT). To better assess the contribution of temporal context (TC) for the task of continuous object recognition, we performed batch learning experiments with 3 different model configurations:

• GDM: We trained the model using TC and tested it on novel sequences. For each input frame, an object instance and an object category are predicted.

• GDM (No TC): We trained and tested the model without TC by setting K = 0, i.e., the computation of the BMU is reduced to b = arg min_{j ∈ A} ||x(t) − w_j||.

• GDM (No TC during test): We trained the model with TC but tested it on single image frames by setting K = 0 during the test phase.

The training hyperparameters are listed in Table 1. Except for the insertion thresholds a_T^EM and a_T^SM, the remaining parameters were set similarly to Parisi et al. (2017) for the incremental learning of sequences. Larger insertion thresholds lead to a larger number of neurons. However, the best classification performance is not necessarily obtained with the largest number of neurons. In G-EM, the neural representation should have a sufficiently high spatiotemporal resolution for discriminating between similar object instances and for replaying episodic experience in the absence of sensory input. Conversely, regulated unsupervised learning in G-SM will lead to a more compact, overlapping neural representation with a smaller number of neurons while preserving the ability to correctly classify its input.
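The two matching schemes compared in these configurations can be sketched as follows. This assumes a Gamma-GWR-style activation in which each neuron j stores, besides its weight vector w_j, K context vectors c_{j,k}, and the matching distance is a weighted sum of the input term and the context terms; with K = 0 it reduces to the context-free b = arg min_j ||x(t) − w_j||. The array shapes and the toy data are illustrative assumptions:

```python
import numpy as np

def find_bmu(x, global_context, weights, ctx, alpha):
    """Sketch of context-aware BMU matching.
    weights: (N, D) neuron weight vectors; ctx: (N, K, D) context vectors;
    global_context: (K, D) context descriptors of the current input;
    alpha: (K + 1,) weighting factors. K = 0 yields the context-free BMU."""
    d = alpha[0] * np.sum((weights - x) ** 2, axis=1)
    for k in range(ctx.shape[1]):
        d += alpha[k + 1] * np.sum((ctx[:, k, :] - global_context[k]) ** 2, axis=1)
    return int(np.argmin(d))

rng = np.random.default_rng(0)
N, K, D = 10, 2, 256
weights = rng.standard_normal((N, D))
ctx = rng.standard_normal((N, K, D))
x = weights[3] + 0.01 * rng.standard_normal(D)   # input near neuron 3

# K = 0: plain Euclidean matching on the weight vectors only
b_no_tc = find_bmu(x, np.empty((0, D)), weights, ctx[:, :0, :], np.array([1.0]))
# K = 2: matching also considers the temporal context descriptors
b_tc = find_bmu(x, ctx[3], weights, ctx, np.array([0.5, 0.3, 0.2]))
```

Both calls recover neuron 3 here because the input and context match its stored vectors; with real sequences, the context terms disambiguate neurons whose weight vectors are similar but whose temporal histories differ.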
The number of context descriptors (K_EM, K_SM) is set to 2. This means that G-EM neurons will activate in correspondence of 3 image frames and G-SM neurons in correspondence of 3 G-EM neurons, i.e., a processing window of 5 frames (1 s of video at 5 fps). Additional experiments showed that increasing the number of context descriptors does not significantly improve the overall accuracy. This is because a small number of context descriptors will lead to learning short-term temporal relations, which are useful for temporal slowness, i.e., neurons that activate for multiple similar views of the same object (where different views of the object are induced by object motion). Neurons with a higher temporal depth will learn longer-term temporal relations and, depending on the difference between the training and test set, training with longer sequences may result in the specialization of neurons to the sequences in the training set while failing to generalize. Therefore, convenient values for K_EM and K_SM can be selected according to different criteria and properties of the input, e.g., the number of frames per second, the smoothness of object motion, and the desired degree of neural specialization.

Table 1: Training hyperparameters for the G-EM and G-SM networks (batch and incremental learning).

Hyperparameters          Value
Insertion thresholds     a_T^EM = 0. , a_T^SM = 0.
Habituation counters     h_T = 0. , τ_b = 0. , τ_n = 0. , κ = 1.
Temporal depth           K_EM = 2, K_SM = 2
Temporal context         α = [0. , 0. , 0. ], β = 0.
Learning rates           ε_b = 0. , ε_n = 0.

Table 2: Comparison of batch learning performance for instance-level and category-level classification. We show the accuracy for the pre-trained VGG with fine-tuning (VGG+FT) and the proposed GDM in three different configurations: (i) growing networks with temporal context (TC), (ii) without TC, and (iii) without TC during test. Best results in bold.

Approach                                    Accuracy (%)   Accuracy (%)
                                            (Instances)    (Categories)
VGG + FT (Lomonaco and Maltoni, 2017)       69.08          80.23
Proposed GDM (No TC)                        70.42          83.54
Proposed GDM (No TC during test)            72.56          87.32
Proposed GDM                                79.43          93.92

The classification performance for the 3 different configurations is summarized in Table 2, showing instance-level and category-level accuracy after 35 training epochs averaged across 5 learning trials in which we randomly shuffled the batches from different sessions. The best results were obtained by GDM using temporal context, with an average accuracy of 79.43% (instance level) and 93.92% (category level), showing an improvement of 10.35% and 13.69%, respectively, with respect to the baseline results (Lomonaco and Maltoni, 2017). Without the use of temporal context, the accuracy is comparable to the baseline, showing a marginal improvement of 1.34% (instance level) and 3.31% (category level). Our results demonstrate that learning the temporal relations of the input plays an important role for this dataset. Interestingly, dropping the temporal component during the test phase, i.e., using single image frames for testing on context-aware networks, shows a slightly better performance (2.14% and 3.78%, respectively) than training without temporal context. This is because the trained neural weights embed some temporal structure of the training sequences and, consequently, the context-free computation of a BMU from a single input frame will still be matched to context-aware neurons.

Figure 3 shows the number of neurons, update rate, and classification accuracy for G-EM and G-SM (with temporal context) through 35 training epochs averaged across 5 learning trials.
It can be seen that the average number of neurons created in G-EM is significantly higher than in G-SM (Figure 3.A). This is expected, since G-EM grows to minimize the discrepancy between the input and its internal representation, whereas neurogenesis and the neural update rate in G-SM are regulated by the ability of the network to predict the correct class labels of the input. The update rate (Figure 3.B) is given by multiplying the fixed learning rate by the habituation counter of the neurons (ε_i · h_i), and shows a monotonically decreasing behavior. This indicates that, after a number of epochs, the created neurons become habituated to the input. Such a habituation mechanism has the advantage of protecting consolidated knowledge from being disrupted or overwritten by the learning of novel sensory experience, i.e., well-trained neurons respond more slowly to changes in the distribution, and the network creates new neurons to compensate for the discrepancy between the input and its representation.

Figure 3: (A) Number of neurons for G-EM and G-SM, (B) update rate, and (C) classification accuracy over 35 training epochs.
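The habituation mechanism can be sketched with the standard GWR-style habituation rule, in which the effective update rate ε · h shrinks as a neuron's counter h decays from 1 toward its floor 1 − 1/κ. The constants below are illustrative assumptions, not the exact values used in the experiments:

```python
import numpy as np

# Illustrative constants in the spirit of Table 1 (not the exact values):
EPS_B, TAU_B, KAPPA = 0.3, 0.3, 1.05

def habituated_update(w, x, h):
    """One BMU update: the weight change is scaled by the habituation
    counter h, and h itself decays toward the equilibrium 1 - 1/kappa."""
    w = w + EPS_B * h * (x - w)                  # update rate = eps * h
    h = h + TAU_B * (KAPPA * (1.0 - h) - 1.0)    # habituation rule
    return w, h

w, h = np.zeros(4), 1.0   # fresh neuron: fully responsive (h = 1)
x = np.ones(4)            # repeatedly presented input
rates = []
for _ in range(30):
    rates.append(EPS_B * h)
    w, h = habituated_update(w, x, h)
# The effective update rate decreases monotonically: a well-trained neuron
# responds ever more slowly to changes in the input distribution.
```

After repeated presentations, h settles near 1 − 1/κ, so consolidated neurons barely move and novelty must instead be absorbed by newly created neurons.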