A biologically plausible audio-visual integration model for continual lifelong learning
Wenjie Chen, Fengtong Du, Ye Wang, Lihong Cao
Communication University of China, Beijing, China
Abstract
The problem of catastrophic forgetting can be traced back to the 1980s, but it has not been completely solved. Since human brains are good at continual lifelong learning, brain-inspired methods may provide solutions to this problem. The end result of learning different objects in different categories is the formation of concepts in the brain. Experiments showed that concepts are likely encoded by concept cells in the medial temporal lobe (MTL) of the human brain. Furthermore, concept cells encode concepts sparsely and are responsive to multi-modal stimuli. However, it is unknown how concepts are formed in the MTL. Here we assume that the integration of auditory and visual perceptual information in the MTL during learning is a crucial step in forming concepts and making continual learning possible, and we propose a biologically plausible audio-visual integration model (AVIM), a spiking neural network with a multi-compartmental neuron model and a calcium-based synaptic tagging and capture plasticity model, as a possible mechanism of concept formation. We build such a model and run it on different datasets to test its continual learning ability. Our simulation results show that the AVIM not only achieves state-of-the-art performance compared with other advanced methods, but also forms stable representations of each concept during the continual learning process. These results support our assumption that concept formation is essential for continual lifelong learning, suggest that the AVIM proposed here is a possible mechanism of concept formation, and hence offer a brain-like solution to the problem of catastrophic forgetting.
In the late 1980s, researchers discovered the sequential learning problem known as catastrophic forgetting in connectionist networks [1]. With the great success of deep learning in the fields of perceptual recognition [2] and board games [3–5], researchers once again started to pay attention to the problem of catastrophic forgetting, since existing AI technology faces the difficulty of continuous learning in a continuously changing environment [6]. Within the framework of artificial neural networks (ANNs) trained with the backpropagation (BP) algorithm [7], researchers have defined a series of continual learning tasks [8], such as incremental learning of new instances, new classes, and mixtures of new instances and classes, and have proposed many constructive solutions to these specific tasks. The mainstream methods can be divided into the following topics: regularization of the network [9–15], parameter isolation [16, 17], dynamic structure [18, 19], memory-based consolidation [20, 21], networks with attention mechanisms [22, 23], and dual systems inspired by the complementary learning systems theory of the brain [24–26]. Despite this blossoming of new methods, they are still far from the level of human continual learning. How to achieve human-level continual lifelong learning still needs more exploration and research.
Preprint. Under review.

Learning from the brain

The causes of the catastrophic forgetting problem can be manifold, so the solution to this problem needs to be considered in many ways. From the perspective of the neuron model, the McCulloch-Pitts neuron in the ANN is oversimplified compared with neurons in the brain [27]. A neuron in the brain is not just a point: it has a tree-like dendrite that performs nonlinear computation [28]. The dendrites of pyramidal cells in the human brain are more complex than those of other species [29], which suggests that the multi-compartmental neuron model might be an essential structure of higher intelligence. The multi-compartmental spiking neuron often plays a vital role in computational neuroscience [30]. From the perspective of the synaptic model, the synapse in the ANN is represented by a single parameter, which is also oversimplified compared to a real synapse in the brain. There are excitatory and inhibitory synapses in the brain, and their functions are not merely addition or subtraction. Neural signals at the synapse undergo two types of summation, spatial and temporal; notably, temporal summation is not available in the ANN. Last, but perhaps most important, is synaptic plasticity. Although BP in the ANN is very efficient, it is still unknown whether BP is biologically plausible in the brain [31, 32]. Synaptic plasticity in the brain involves protein synthesis in postsynaptic neurons and is closely related to changes in calcium concentration [33]. The theory of synaptic tagging and capture (STC) establishes the relationship between calcium concentration and plasticity-related proteins. It provides an elegant biological explanation of the consolidation of newly formed memories at the cellular and synaptic scale [34].
Computational models of STC [35, 36] can also explain some experimental phenomena well. In general, humans learn in a multi-modal environment in both supervised and unsupervised ways. For auditory signals, the brain has a better self-learning ability [37]. The result of continual learning in the brain is the formation of concepts. Experiments showed that concepts are likely encoded by concept cells in the MTL of the human brain. Furthermore, concept cells encode concepts sparsely and are responsive to multi-modal stimuli [38]. Considering that concept cells are multi-modal cells, it is reasonable to guess that these concept cells originate in the perirhinal cortex, since the cells in this region are multi-modal while the perceptual inputs to this region are single-modal. Besides, the connections between the perirhinal cortex and the entorhinal cortex are very strong. Although the perirhinal cortex has rich connections to many areas of the brain [39], the primary source of its visual inputs should be temporal area TE, which is the end of the ventral visual pathway, and the primary source of its auditory inputs should be the parahippocampal cortex, which receives input directly from the auditory cortex. Functionally, studies conclude that TE's role is mainly visual perception, while the role of the perirhinal cortex is mainly related to memory (including visual memory) [40], as demonstrated by biological lesion tests [41].
Method overview
Inspired by the multi-modal integration learning in the perirhinal cortex of the MTL, we propose a biologically plausible audio-visual integration model (AVIM) for continual learning and test it on different datasets. Figure 1 presents the overall schematic framework of the proposed method. We use a pre-trained convolutional neural network (CNN) to obtain high-level visual feature vectors (V-FV) for images. We assume that the auditory signals of objects at the same conceptual level, such as dog and cat, form in the brain a sparse spatial distribution with equal energy, represented by the number of firing neurons. The spatial sparseness leads to near-orthogonal relationships among auditory perceptual coding neurons, so we call such an auditory coding scheme a near-orthogonal sparse code (NOSC). Several facts support the NOSC assumption. First, human brains can self-learn from auditory signals very well [37]. Second, high-level representations of auditory objects in the brain are highly sparsely coded. Finally, the number of shared/overlapping neurons in the high-level representations of auditory objects is relatively small [42]. For audio-visual integration, we build an integration layer composed of multi-compartment Hodgkin-Huxley-type neurons whose dendrites receive high-level visual and auditory feature signals. We allow the connections from the visual signals to the integration layer to have plasticity. In addition, we have an inhibition layer connected to the integration layer. During training, we present the NOSC and V-FV from the same class to the integration layer simultaneously. During testing, we only present V-FV to the integration layer and take the spike trains of neurons in the integration layer as the output. Finally, we use a linear output classifier (LOC-ANN) to decode the output of the integration layer and obtain the final classification results. The organizational structure of the rest of the paper is as follows.
In Section 2, we explain the details of the methods and the experimental design. In Section 3, we compare the experimental results of the proposed method to the state of the art on different datasets. In Section 4, we discuss the experimental results and explain the role of each module in the proposed method during continual learning. Finally, we conclude in Section 5.

Figure 1: The overall framework of the present work for continual learning. (a) Encoding module. For visual signals, a pre-trained CNN is used to get the high-level visual feature code V-FV. For auditory signals, a NOSC generator is used to generate the auditory feature code. (b) Integration module. The AVIM consists of four layers: the VF layer, AF layer, AVI layer, and INB layer. Here, VF represents the visual feature layer, AF the auditory feature layer, AVI the audio-visual integration layer, and INB the inhibition layer. (c) Decoding module. The LOC-ANN decodes the spike trains in the AVI layer and obtains the final classification results.
The proposed AVIM consists of four layers: the VF layer, AF layer, AVI layer, and INB layer. Here, VF represents the visual feature layer, AF the auditory feature layer, AVI the audio-visual integration layer, and INB the inhibition layer. For simplicity, we denote the connection from VF to AVI as S1, the connection from AF to AVI as S2, the connection from AVI to INB as S3, and the connection from INB to AVI as S4. Figure 1-(b) visualizes the structure of the AVIM. In the AVIM, the numbers of neurons in the sensory feature layers, VF and AF, are the same. The ratio of neurons between a sensory feature layer and AVI is around 1:3, and the ratio of neurons between AVI and INB is around 4:1, which is similar to the ratio of excitatory to inhibitory neurons in the brain. As for the connections in the AVIM, the projecting ratios of S1, S2, S3, and S4 are around 1:4, 1:6, 1:1.5, and 1:10, respectively.
Neuron model
For neurons in the VF layer, we use the V-FV extracted from the encoding module and map the activation values (in the range 0 to 1) of the V-FV neurons linearly to a suitable range of firing rates (0 to 20 Hz) to produce spike trains that obey a Poisson distribution. For neurons in the AF layer, we use the NOSC generated in the encoding module and map the values of the NOSC to spike trains using the same method as for the VF neurons. For neurons in the AVI layer, we use a two-compartment pyramidal neuron model proposed in [43], based on the fact that the major neurons in the perirhinal cortex are pyramidal cells. Each AVI neuron consists of a soma and a dendrite, whose voltage dynamics are given by Eqs. (1) and (2), respectively. For neurons in the INB layer, we use a one-compartment inhibitory neuron model of the hippocampus proposed in [44], given by Eq. (3). See the supplementary material for the parameter values and further details of the neuron models, and for the algorithm generating the NOSC.

C_m dV_s/dt = −I_L − I_Na − I_K − I_Ca − I_ahp − g_ds (V_s − V_d) + I_synToSoma   (1)

C_m dV_d/dt = −I_L − I_Na − I_K − I_Ca − I_ahp − g_sd (V_d − V_s) + I_synToDend   (2)

C_m dV/dt = −I_L − I_Na − I_K + I_syn   (3)

Synaptic model
In the AVIM, synapse S1 is an excitatory connection with plasticity, including AMPA and NMDA receptors. We use a calcium-based STC plasticity model to learn the weight changes at S1. Synapses S2 and S3 are excitatory connections without plasticity, including only AMPA receptors. Synapse S4 is an inhibitory connection, including GABA receptors.
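The rate coding used for the VF and AF inputs above (a linear map from activations in [0, 1] to 0–20 Hz Poisson spike trains) can be sketched as follows; the function name and the per-bin Bernoulli approximation of the Poisson process are our own illustrative choices, not the paper's code:

```python
import numpy as np

def poisson_spike_train(activations, rate_max=20.0, duration=1.0, dt=1e-3, rng=None):
    """Map activations in [0, 1] linearly to firing rates in [0, rate_max] Hz
    and draw spike trains with time step dt (seconds).

    Returns a (n_neurons, n_steps) boolean spike matrix."""
    rng = np.random.default_rng() if rng is None else rng
    rates = np.clip(activations, 0.0, 1.0) * rate_max   # Hz, one rate per neuron
    n_steps = int(round(duration / dt))
    # P(spike in a bin) ~= rate * dt for small dt (Bernoulli approximation of Poisson)
    p = rates[:, None] * dt
    return rng.random((rates.size, n_steps)) < p

# e.g. three VF neurons with activations 0.0, 0.5, 1.0, presented for 2 s
spikes = poisson_spike_train(np.array([0.0, 0.5, 1.0]), duration=2.0)
```

A silent neuron (activation 0) emits no spikes, while a fully active one fires at roughly 20 Hz over the presentation window.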
To decode the spike trains in the AVI layer, we design an energy-normalized linear output classifier based on the ANN, called LOC-ANN for simplicity. We convert the spike-train pattern within one second in the AVI layer of each sample to a firing-rate pattern and feed it to the LOC-ANN. Figure 2 visualizes the proposed updating procedure of the LOC-ANN during continual learning. Let N be the number of neurons in the input layer, C the total number of categories, and W_ij (i = 1, ..., N; j = 1, ..., C) the connection from the i-th input neuron to the j-th output neuron. At first, all connections in the LOC-ANN are 0. While learning the j-th class, only the j-th output neuron receives the error signal from the mean-square-error loss function, and only the corresponding synapses W_ij (i = 1, ..., N) are updated, while the other synapses W_ik (i = 1, ..., N; k ≠ j) stay fixed. The label of the j-th output neuron is set to 1. After learning the j-th class, we normalize the synapses W_ij (i = 1, ..., N) connecting to the j-th output neuron by sqrt(Σ_{i=1}^{N} W_ij^2) in order to balance the energy of the classification weights of each class.

Figure 2: The flow chart of the updating method of the LOC-ANN in continual learning. Learning Step-j represents the process of learning the j-th class.

In order to test the performance of the AVIM, MNIST [45], EMNIST [46], and CIFAR100 [47] are used to construct image datasets of 10, 20, and 100 classes as experimental datasets. For the MNIST10 and EMNIST20 datasets, we randomly select 50 samples per class from the training set as training samples and 50 samples per class from the test set as test samples. For the CIFAR100 dataset, we randomly select ten samples per class from the training set as training samples and ten samples per class from the test set as test samples.
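The column-wise update and energy normalization of the LOC-ANN described above can be sketched as follows; the learning rate, epoch count, and toy firing-rate patterns are illustrative assumptions, not the paper's settings:

```python
import numpy as np

class LOCANN:
    """Sketch of the energy-normalized linear output classifier: when class j
    is learned, only column j of W is trained (MSE toward target 1), then that
    column is rescaled to unit L2 norm ("equal energy" across classes)."""

    def __init__(self, n_inputs, n_classes):
        self.W = np.zeros((n_inputs, n_classes))   # all connections start at 0

    def learn_class(self, rates, j, lr=0.1, epochs=50):
        """rates: (n_samples, n_inputs) firing-rate patterns of class j."""
        for _ in range(epochs):
            out = rates @ self.W[:, j]             # scalar output per sample
            err = 1.0 - out                        # target label is 1
            self.W[:, j] += lr * rates.T @ err / len(rates)
        norm = np.linalg.norm(self.W[:, j])
        if norm > 0:
            self.W[:, j] /= norm                   # energy normalization

    def predict(self, rates):
        return np.argmax(rates @ self.W, axis=1)
```

Because only one column is touched per class and the columns are rescaled to the same norm, previously learned classes are neither overwritten nor out-scaled by later ones.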
For each dataset, we use four CNNs of different quality to obtain four groups of V-FV with different levels of linear separability, denoted FV1 to FV4, respectively. The level of linear separability of the V-FV increases from FV1 to FV4. See the supplementary material for the details of the V-FV datasets.
Experimental design
In this paper, we focus on the incremental learning of new classes. In our experiment, there is only one class per task. For example, in the continual learning of 10 digits, the agent needs to learn the digit "0" first, then the digit "1", and so on. The whole procedure of continual learning of C classes, including training and testing, is shown in Figure 3. Let NN-i (1 ≤ i ≤ C) be the network composed of AVIM-i and LOC-ANN-i after training on the i-th class, a_{i,j} the test accuracy of the j-th class (1 ≤ j ≤ i) under NN-i, and S_j the number of test samples in the j-th class. The test accuracy A_i of NN-i is defined as

A_i = (Σ_{j=1}^{i} a_{i,j} S_j) / (Σ_{j=1}^{i} S_j).

Figure 3: (a) The whole procedure of continual learning of C classes. (b) The details of the learning process of NN-i. (c) The details of the testing process of NN-i.

Experimental settings
We define a pair of NOSC and V-FV from the same class as an A-V training sample. When learning the i-th class, each A-V training sample of the i-th class is presented to AVIM-i only once. The duration of the presentation of an A-V training sample is 2 seconds, with 4 seconds between presentations of two adjacent A-V training samples. When testing, only the V-FV of the learned classes are presented to the AVIM. The duration of the presentation of a test sample is 1 second, with 0.1 seconds between presentations of two adjacent test samples. See Table 1 for the experimental settings of the AVIM on each dataset.
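The cumulative test accuracy A_i defined earlier is simply a sample-size-weighted average of the per-class accuracies of NN-i; a minimal sketch:

```python
def cumulative_accuracy(per_class_acc, per_class_counts):
    """A_i = (sum_j a_{i,j} * S_j) / (sum_j S_j): test accuracy of NN-i over
    all classes learned so far, weighted by the test-set sizes S_j."""
    total = sum(per_class_counts)
    return sum(a * s for a, s in zip(per_class_acc, per_class_counts)) / total

# e.g. after learning two classes with 50 test samples each:
A = cumulative_accuracy([0.9, 0.7], [50, 50])   # ~= 0.8
```

With equal per-class test-set sizes, as in the MNIST10 and EMNIST20 setups, this reduces to the plain mean of the per-class accuracies.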
Other methods to be compared
We compared the AVIM with five ANN methods: ANN Base, ANN Offline, iCaRL [21], GEM [20], and EWC [10]. ANN Base represents the sequential learning method, and ANN Offline represents the normal shuffled learning method. All ANN methods use the same network structure, which consists of three fully connected layers: the input layer, the hidden layer, and the output layer. The number of neurons in the hidden layer equals the number of neurons in the AVI layer. For GEM, the memory sample size of each class is 1 in all experiments. For iCaRL, the maximum memory sample size in each experiment is the total number of classes, which means that iCaRL and GEM have the same number of memory samples at the last learning step. ANN Base, EWC, and the AVIM have no memory samples. ANN Offline uses all samples during learning. See the supplementary material for the details of the comparison experimental settings.

Table 1: Experimental settings of the AVIM on the different datasets. In the NOSC, N is the number of total neurons, n is the number of firing neurons for each class, and K is the maximum number of shared firing neurons between any two classes.
            CNN        NOSC              AVIM
DATASET     V-FV       N    n    K      VF    AF    AVI    INB
MNIST10     15         15   2    1      15    15    50     12
EMNIST20    20         20   3    1      20    20    67     16
CIFAR100    50         50   5    2      50    50    167    40

Table 2: Final test accuracy of the AVIM and the ANN methods on the 12 datasets.
Black bold in the table represents the best continual learning performance on each dataset; * represents the second-best performance on each dataset (excluding the offline method).

DATASET        Offline(%)   AVIM(%)   GEM(%)   iCaRL(%)   EWC(%)   Base(%)
MNIST10-FV1    72.0

Figure 5: (a) The change of the firing-rate patterns of the training samples in the AVI layer during the training process. The small matrix M_{i,j} (1 ≤ i ≤ C; i ≤ j ≤ C) on the left is composed of the AVI firing-rate patterns of all the i-th-class training samples in AVIM-j. In M_{i,j}, the column represents the sample index and the row represents the neuron index. The large matrix on the right is composed of all the M_{i,j} during the whole learning process; the column represents the learning-step index and the row represents the class index. (b) The change of the hidden-layer patterns of the training samples during the training process using iCaRL.

In Table 2, we can see the final test accuracy of all methods on each dataset. In Figure 4, we visualize the test accuracy curves of all methods during continual learning of the CIFAR100-FV datasets. From the results shown in Table 2 and Figure 4, we can conclude that: 1. Overall, the proposed AVIM achieves performance comparable to iCaRL, while being much better than GEM and EWC. 2. The AVIM has a significant advantage over iCaRL in continual learning of the CIFAR100-FV datasets, which is likely because the limited memory capacity of iCaRL affects its performance when learning more classes. 3. GEM is overall worse than the AVIM and iCaRL, likely because the limited memory capacity and the random sample selection of GEM affect its memory consolidation. 4. The performance of EWC is significantly worse than that of the AVIM, iCaRL, and GEM, and only slightly better than ANN Base, suggesting that EWC is not good at incremental learning of new classes.
In addition, we compared the change of the firing-rate patterns in the AVIM and the hidden-layer patterns in iCaRL during the learning process. As can be seen from Figure 5, the AVIM formed relatively more stable representations of the learned categories than iCaRL during continual learning.
The contributions of the present work can be summarized as follows: 1. We propose the AVIM as a brain-like solution to continual lifelong learning. The AVIM achieves state-of-the-art performance compared with methods such as iCaRL, GEM, and EWC on several public datasets. 2. We make an assumption about the auditory object feature code, the NOSC, to be integrated with visual object feature vectors during continual learning. 3. Finally, we use an energy-normalized linear classifier, the LOC-ANN, to decode spike trains from the integration layer's neurons and make the classification.
The role of AVIM
The success of the AVIM is likely due to the multi-compartmental neuron model and the STC plasticity. It is known that dendritic integration in the brain is complicated. It is worth noting that, under the same framework, if we use a point neuron model instead of the multi-compartment neuron model, the AVI layer cannot form stable representations of objects, and continual learning fails. In this work, only two compartments per neuron are used; in the future, more compartments and connection patterns should be considered. We use STC plasticity instead of spike-timing-dependent plasticity because STC is a calcium-based learning rule, allowing us to study and control more details of neuronal responses. It will be worthwhile to try different spiking neural network learning rules in the future.
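To illustrate what "calcium-based" means operationally (this is not the STC model used in the AVIM, but a minimal two-threshold rule in the spirit of calcium-based plasticity models; all thresholds and rates below are made-up placeholders):

```python
import numpy as np

def calcium_threshold_update(w, ca_trace, dt=1e-3,
                             theta_d=0.4, theta_p=0.8,
                             gamma_p=0.01, gamma_d=0.005,
                             w_min=0.0, w_max=1.0):
    """One illustrative plasticity step per synapse: potentiate while the
    postsynaptic calcium trace exceeds theta_p, depress while it sits
    between theta_d and theta_p, leave the weight alone below theta_d."""
    w = np.asarray(w, dtype=float).copy()
    ca = np.asarray(ca_trace, dtype=float)
    w += gamma_p * dt * (ca >= theta_p)                       # potentiation
    w -= gamma_d * dt * ((ca >= theta_d) & (ca < theta_p))    # depression
    return np.clip(w, w_min, w_max)
```

The direction of each weight change is read off the instantaneous calcium level rather than spike-pair timing, which is what makes such rules convenient for controlling detailed neuronal responses.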
The role of NOSC
In the present work, the NOSC as the auditory feature code is randomly generated under the constraints imposed by the NOSC requirements. The connection from the AF layer, which takes the NOSC as input, to the AVI layer is fixed and has no plasticity during continual learning in our experiments. The inspiration for this structural design came from the following experimental phenomena: 1. the brain has a better self-learning ability for hearing; 2. hearing has a significant impact on vision; 3. hearing develops earlier than vision. Although this design is reasonable to some extent, it is more biologically plausible that different perceptions interact with each other during the learning process. One line of future work is to study how to integrate multi-modal signals when both the visual and the auditory pathways are plastic.
The role of LOC-ANN
The LOC-ANN is used to decode the spike trains of the AVI layer and obtain the final classification results. We propose an energy-normalized linear classifier, the LOC-ANN, for continual learning. In the brain, classification decisions are made in the prefrontal cortex (PFC), which can make flexible decisions depending on the task and context. Since we only care about categorical classification in this work, we assume equal energy for the representations of different classes.
In this paper, we propose a novel biologically plausible audio-visual integration model (AVIM) with multi-compartment neuron models and STC plasticity for continual learning. Our simulation results show that the AVIM can achieve state-of-the-art performance compared with methods such as iCaRL, GEM, and EWC. It should be noted that the present work does not mainly aim at top performance on several datasets, but rather at exploring a possible mechanism of brain-like learning, and specifically of concept formation. The results suggest that the AVIM proposed here is a possible mechanism of concept formation in the brain, and hence provides a brain-like solution to the problem of catastrophic forgetting.
References

[1] M. McCloskey and N. Cohen, "Catastrophic interference in connectionist networks: The sequential learning problem," Psychology of Learning and Motivation, vol. 24, pp. 109–165, 1989.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[5] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[6] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, vol. 113, pp. 54–71, 2019.
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 696–699, 1988.
[8] V. Lomonaco and D. Maltoni, "CORe50: A new dataset and benchmark for continuous object recognition," arXiv preprint, 2017.
[9] I. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, "An empirical investigation of catastrophic forgetting in gradient-based neural networks," in ICLR, 2013.
[10] J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[11] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in ICML, pp. 3987–3995, 2017.
[12] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang, "Overcoming catastrophic forgetting by incremental moment matching," in NIPS, pp. 4652–4662, 2017.
[13] G. Zeng, Y. Chen, B. Cui, and S. Yu, "Continual learning of context-dependent processing in neural networks," Nature Machine Intelligence, vol. 1, pp. 364–372, 2019.
[14] Z. Li and D. Hoiem, "Learning without forgetting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 2935–2947, 2018.
[15] S. Golkar, M. Kagan, and K. Cho, "Continual learning via neural pruning," arXiv:1903.04476, 2019.
[16] A. Mallya and S. Lazebnik, "PackNet: Adding multiple tasks to a single network by iterative pruning," in CVPR, pp. 7765–7773, 2018.
[17] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, "PathNet: Evolution channels gradient descent in super neural networks," arXiv preprint, 2017.
[18] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint, 2017.
[19] R. Aljundi, P. Chakravarty, and T. Tuytelaars, "Expert gate: Lifelong learning with a network of experts," in CVPR, pp. 7120–7129, 2017.
[20] D. Lopez-Paz and M. Ranzato, "Gradient episodic memory for continual learning," in NIPS, pp. 6467–6476, 2017.
[21] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, "iCaRL: Incremental classifier and representation learning," in CVPR, pp. 5533–5542, 2017.
[22] N. Y. Masse, G. D. Grant, and D. J. Freedman, "Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization," Proceedings of the National Academy of Sciences, vol. 115, no. 44, 2018.
[23] J. Serra, D. Suris, M. Miron, and A. Karatzoglou, "Overcoming catastrophic forgetting with hard attention to the task," in ICML, pp. 4548–4557, 2018.
[24] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly, "Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory," Psychological Review, vol. 102, no. 3, pp. 419–457, 1995.
[25] J. L. McClelland, B. L. McNaughton, and A. K. Lampinen, "Integration of new information in memory: New insights from a complementary learning systems perspective," bioRxiv, 2020.
[26] R. Kemker and C. Kanan, "FearNet: Brain-inspired model for incremental learning," in ICLR, 2018.
[27] A. L. Hodgkin and A. F. Huxley, "A quantitative description of membrane current and its application to conduction and excitation in nerve," The Journal of Physiology, vol. 117, no. 4, pp. 500–544, 1952.
[28] G. Major, M. E. Larkum, and J. Schiller, "Active properties of neocortical pyramidal neuron dendrites," Annual Review of Neuroscience, vol. 36, no. 1, pp. 1–24, 2013.
[29] L. Beaulieu-Laroche, E. H. S. Toloza, M. S. van der Goes, M. Lafourcade, D. Barnagian, Z. Williams, E. N. Eskandar, M. P. Frosch, S. S. Cash, and M. T. Harnett, "Enhanced dendritic compartmentalization in human cortical neurons," Cell, vol. 175, no. 3, pp. 643–651, 2018.
[30] G. R. Yang, J. D. Murray, and X. Wang, "A dendritic disinhibitory circuit mechanism for pathway-specific gating," Nature Communications, vol. 7, p. 12815, 2016.
[31] J. C. R. Whittington and R. Bogacz, "Theories of error back-propagation in the brain," Trends in Cognitive Sciences, vol. 23, no. 3, pp. 235–250, 2019.
[32] T. P. Lillicrap, A. Santoro, L. Marris, C. J. Akerman, and G. Hinton, "Backpropagation and the brain," Nature Reviews Neuroscience, 2020.
[33] T. Rogerson, D. J. Cai, A. Frank, Y. Sano, J. L. Shobe, M. Lopez-Aranda, and A. J. Silva, "Synaptic tagging during memory allocation," Nature Reviews Neuroscience, vol. 15, no. 3, pp. 157–169, 2014.
[34] R. L. Redondo and R. G. M. Morris, "Making memories last: The synaptic tagging and capture hypothesis," Nature Reviews Neuroscience, vol. 12, no. 1, pp. 17–30, 2011.
[35] C. Clopath, L. Ziegler, E. Vasilaki, L. Busing, and W. Gerstner, "Tag-trigger-consolidation: A model of early and late long-term-potentiation and depression," PLOS Computational Biology, vol. 4, no. 12, 2008.
[36] P. Smolen, D. A. Baxter, and J. H. Byrne, "Molecular constraints on synaptic tagging and maintenance of long-term potentiation: A predictive model," PLOS Computational Biology, vol. 8, no. 8, 2012.
[37] G. K. Vallabha, J. L. McClelland, F. Pons, J. F. Werker, and S. Amano, "Unsupervised learning of vowel categories from infant-directed speech," Proceedings of the National Academy of Sciences, vol. 104, no. 33, pp. 13273–13278, 2007.
[38] R. Q. Quiroga, "Concept cells: The building blocks of declarative memory functions," Nature Reviews Neuroscience, vol. 13, no. 8, pp. 587–597, 2012.
[39] W. A. Suzuki and Y. Naya, "The perirhinal cortex," Annual Review of Neuroscience, vol. 37, no. 1, pp. 39–53, 2014.
[40] R. R. Hampton, "Monkey perirhinal cortex is critical for visual memory, but not for visual perception: Reexamination of the behavioural evidence from monkeys," Quarterly Journal of Experimental Psychology Section B, vol. 58, pp. 283–299, 2005.
[41] E. A. Buffalo, S. J. Ramus, R. E. Clark, E. Teng, L. R. Squire, and S. M. Zola, "Dissociation between the effects of damage to perirhinal cortex and area TE," Learn. Mem.
[42] The Journal of Neuroscience, vol. 26, no. 40, pp. 10232–10234, 2006.
[43] G. L. F. Yuen and D. Durand, "Reconstruction of hippocampal granule cell electrophysiology by computer simulation," Neuroscience, vol. 41, no. 2, pp. 411–423, 1991.
[44] X. Wang and G. Buzsaki, "Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model," The Journal of Neuroscience, vol. 16, no. 20, pp. 6402–6413, 1996.
[45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[46] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "EMNIST: An extension of MNIST to handwritten letters," arXiv preprint, 2017.
[47] A. Krizhevsky, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.

Supplementary material
Audio-visual integration model (AVIM)

Neuron model
For neurons in the AVI layer, we use the two-compartment pyramidal neuron model proposed in [43]. Each AVI neuron consists of a soma and a dendritic compartment, whose voltage dynamics are given by Eqs. (4) and (5), respectively; the membrane capacitance C_m and the soma–dendrite coupling conductances g_ds and g_sd take their values from [43]:

C_m dV_s/dt = −I_L − I_Na − I_K − I_Ca − I_ahp − g_ds (V_s − V_d) + I_synToSoma   (4)

C_m dV_d/dt = −I_L − I_Na − I_K − I_Ca − I_ahp − g_sd (V_d − V_s) + I_synToDend   (5)

The ionic currents are a leak current I_L = g_L (V − E_L); a sodium current I_Na = g_Na m^3 h (V − E_Na) with g_Na = 250 mS/cm^2 and E_Na = 115 mV; a potassium current I_K = g_K n^4 (V − E_K) with g_K = 40 mS/cm^2; a calcium current I_Ca = g_Ca s^2 w (V − E_Ca) with g_Ca = 1 mS/cm^2 and E_Ca = 140 mV; and a calcium-dependent after-hyperpolarization current I_ahp = g_ahp q (V − E_K). Each gating variable x ∈ {m, h, n, s, w, q} obeys first-order kinetics

dx/dt = α_x (1 − x) − β_x x,

with voltage-dependent (for q, calcium-dependent) rate functions α_x and β_x as given in [43]. The calcium concentration [Ca] is driven by the calcium current I_Ca and decays with a fixed time constant of 100 ms, following [43]; the two compartments share a common calcium pool, i.e., the diffusion of calcium ions between soma and dendrite is not modeled.

For neurons in the INB layer, we use a one-compartment inhibitory neuron model of the hippocampus proposed in [44], with C_m = 1 µF/cm^2:
$C_m \frac{dV}{dt} = -I_L - I_{Na} - I_K + I_{syn}$ (18)

The following equations (19 to 23) are the ion channel models; the sodium inactivation variable $h$ obeys the first-order kinetics given in [44]:

$I_L = g_L (V - E_L), \quad g_L = 0.1\ mS/cm^2, \ E_L = -65\ mV$ (19)

$I_{Na} = g_{Na}\, m_\infty^3\, h\, (V - E_{Na}), \quad g_{Na} = 35\ mS/cm^2, \ E_{Na} = 55\ mV$ (20)

$m_\infty = \frac{\alpha_m}{\alpha_m + \beta_m}, \quad \alpha_m = \frac{-0.1(V + 35)}{\exp(-(V + 35)/10) - 1}, \quad \beta_m = 4 \exp(-(V + 60)/18)$ (21)

$I_K = g_K\, n^4\, (V - E_K), \quad g_K = 9\ mS/cm^2, \ E_K = -90\ mV$ (22)

$\frac{dn}{dt} = \alpha_n (1 - n) - \beta_n n, \quad \alpha_n = \frac{-0.01(V + 34)}{\exp(-(V + 34)/10) - 1}, \quad \beta_n = 0.125 \exp(-(V + 44)/80)$ (23)

Synaptic model
In the AVIM, synapse S1 is an excitatory connection with plasticity, including AMPA and NMDA receptors. Synapses S2 and S3 are excitatory connections without plasticity, including only AMPA receptors. Synapse S4 is an inhibitory connection, including GABA receptors. AMPA and GABA receptors are ligand-gated ion channels, modeled by Eq. (24), where $g_{syn}$ is the receptor conductance and $E_{syn}$ is the reversal potential of the receptor:

$I_{syn}(t) = g_{syn}(t)\,(V_m(t) - E_{syn})$ (24)

The NMDA receptor channel is additionally voltage-dependent, owing to its magnesium block, and its model is Eq. (25). In the experiments, $[Mg]_o = 1$, $\beta = 0.08$, $\gamma = 9$.

$I_{syn}(t) = g_{syn}(t)\, s(V)\, (V_m(t) - E_{syn}), \quad s(V) = \frac{1}{1 + [Mg]_o \exp(-\beta V_m + \gamma)}$ (25)

The receptor conductance $g_{syn}$ follows the β-function model of Eq. (26), where $\bar{g}_{syn}$ is the maximum receptor conductance, $\tau_{rise}$ and $\tau_{decay}$ are the rise and decay time constants, and $x(t)$ is the spike train of the presynaptic neuron:

$\tau_{rise}\,\tau_{decay}\,\frac{d^2 g}{dt^2} + (\tau_{rise} + \tau_{decay})\,\frac{dg}{dt} + g = \bar{g}_{syn}\, x(t)$ (26)

The reversal potentials of the above receptors are $E_{AMPA} = 0$ mV and $E_{NMDA} = 0$ mV, and $E_{GABA}$ is negative. See Table 3 for the experimental parameter settings of the synaptic model.

Table 3: Experimental parameter settings of the synaptic model in AVIM (N: receptor not present).

Synapse | g_AMPA (mS/cm^2) | g_NMDA (mS/cm^2) | g_GABA (mS/cm^2) | tau_rise (ms) | tau_decay (ms)
S1      | 0.1              | 0.1              | N                | 5             | 100
S2      | 1.0              | N                | N                | 2             | 2
S3      | 0.01             | N                | N                | 2             | 2
S4      | N                | N                | 0.0002           | 5             | 100
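As a numerical illustration of Eq. (26), the sketch below integrates the receptor-conductance ODE for a single presynaptic spike, using the S1 values from Table 3 (AMPA maximum conductance 0.1 mS/cm², $\tau_{rise} = 5$ ms, $\tau_{decay} = 100$ ms). The forward-Euler scheme, the step size, and the delta-pulse approximation of $x(t)$ are our own choices for this sketch, not part of the model definition:

```python
import numpy as np

# Integrate Eq. (26): tau_r*tau_d*g'' + (tau_r + tau_d)*g' + g = g_max*x(t),
# for one presynaptic spike at t = 5 ms. The spike train x(t) is approximated
# by a discrete delta pulse of unit area (height 1/dt for a single step).
g_max, tau_r, tau_d = 0.1, 5.0, 100.0   # S1 AMPA values from Table 3
dt, T = 0.01, 500.0                     # ms
steps = int(T / dt)
spike_step = int(5.0 / dt)

g, dg = 0.0, 0.0                        # conductance and its time derivative
trace = np.empty(steps)
for i in range(steps):
    x = 1.0 / dt if i == spike_step else 0.0
    d2g = (g_max * x - g - (tau_r + tau_d) * dg) / (tau_r * tau_d)
    dg += dt * d2g
    g += dt * dg
    trace[i] = g
```

The trace rises on the $\tau_{rise}$ time scale, peaks near $t \approx 21$ ms, and decays with $\tau_{decay}$: the familiar double-exponential conductance waveform.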
NOSC algorithm
NOSC stands for near-orthogonal sparse coding. The number of distinct NOSC codes is limited by three parameters N, n, and K, where N is the total number of neurons, n is the number of firing neurons for each class, and K is the maximum number of firing neurons shared between any two classes. For a problem with C classes, we design the NOSC code-generating procedure shown in Figure 6. The first C NOSC codes can then be generated and are denoted NOSC(C,N,n,K); when C is clear from context, we simplify the notation to NOSC(N,n,K). For continual learning on the MNIST10-FV dataset, we use NOSC(15,2,1) as the auditory feature code (Figure 7); on the EMNIST20-FV dataset, NOSC(20,3,1) (Figure 8); and on the CIFAR100-FV dataset, NOSC(50,5,2) (Figure 9). We set a specific random seed in the NOSC generation program to ensure that the generating process of the NOSC is repeatable.

Figure 6: The NOSC(C,N,n,K) generating algorithm.
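The procedure we actually use is the one in Figure 6; purely as an illustration of the constraints, codes satisfying them can also be obtained by rejection sampling. The function below is a hypothetical sketch (its name and greedy strategy are our own), shown for NOSC(15,2,1):

```python
import random

def generate_nosc(C, N, n, K, seed=0):
    """Sample C codes of n firing neurons out of N such that any two codes
    share at most K firing neurons. A fixed seed makes generation repeatable.
    Naive rejection sampling: may loop forever for infeasible (C, N, n, K)."""
    rng = random.Random(seed)
    codes = []
    while len(codes) < C:
        cand = frozenset(rng.sample(range(N), n))
        # accept only if the overlap with every accepted code is at most K
        if all(len(cand & c) <= K for c in codes):
            codes.append(cand)
    return [sorted(c) for c in codes]

codes = generate_nosc(C=10, N=15, n=2, K=1)   # NOSC(15,2,1) for MNIST10-FV
```

For the parameter settings used here the sampler terminates quickly; a practical implementation would bound the number of attempts.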
Datasets
Original datasets
To test the performance of AVIM, we use MNIST, EMNIST, and CIFAR100 to construct image datasets of 10, 20, and 100 classes as experimental datasets.
MNIST10-(TR50-TE50)
The MNIST dataset consists of 10 classes. For each category, we randomly select 50 samples from the training set as training samples, another 50 samples from the training set as validation samples, and 50 samples from the test set as test samples. We call this dataset MNIST10-(TR50-TE50).
EMNIST20-(TR50-TE50)
The EMNIST47 dataset consists of 47 classes. We select 20 classes from the EMNIST47 dataset to construct the EMNIST20 dataset. The 20 categories comprise the 10 Arabic numerals 0 to 9 and the 10 English letters B, C, E, G, H, K, Q, R, W, X. For each category, we randomly select 50 samples from the training set as training samples, another 50 samples from the training set as validation samples, and 50 samples from the test set as test samples. We call this dataset EMNIST20-(TR50-TE50).

Figure 7: The NOSC(15,2,1) used in continual learning on the MNIST10-FV dataset.
Figure 8: The NOSC(20,3,1) used in continual learning on the EMNIST20-FV dataset.
Figure 9: The NOSC(50,5,2) used in continual learning on the CIFAR100-FV dataset.
CIFAR100-(TR10-TE10)
There are 100 classes in the CIFAR100 dataset, with 500 training samples and 100 test samples per class. For each category, we randomly select 450 samples from the training set as training samples, another 50 samples from the training set as validation samples, and 100 samples from the test set as test samples. We call this dataset CIFAR100-(TR450-TE100). From CIFAR100-(TR450-TE100), we then randomly select ten samples per class from the training set and ten samples per class from the test set to construct a new dataset, CIFAR100-(TR10-TE10).
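The per-class sampling used to build the (TRx-TEx) datasets can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name and the stand-in label array are our own:

```python
import numpy as np

def per_class_split(labels, n_a, n_b, rng):
    """Draw two disjoint index sets with n_a and n_b samples per class,
    sampled without replacement (e.g. 50 training + 50 validation per class)."""
    idx_a, idx_b = [], []
    for c in np.unique(labels):
        pool = rng.permutation(np.where(labels == c)[0])  # shuffle class c
        idx_a.extend(pool[:n_a])
        idx_b.extend(pool[n_a:n_a + n_b])
    return np.array(idx_a), np.array(idx_b)

rng = np.random.default_rng(0)             # fixed seed for repeatability
labels = np.repeat(np.arange(10), 200)     # stand-in for the MNIST train labels
tr_idx, val_idx = per_class_split(labels, 50, 50, rng)
```

The same helper covers all three datasets by changing the per-class counts (450/50 for CIFAR100-(TR450-TE100), and so on).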
V-FV datasets
We use the output feature vector of the last layer before the classification layer of a DNN as the V-FV. To compare different continual learning algorithms, we control the quality of the pre-trained DNN and obtain V-FV data of different qualities. We take the validation accuracy of the DNN on each dataset during training as the quality-control parameter and obtain four levels of V-FV data, with linear separability from low to high, denoted FV1 to FV4, respectively. The following are the details of the generation of the four levels of V-FV on each dataset.
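Concretely, the quality-control loop amounts to: train, check validation accuracy each epoch, and export the penultimate-layer features the first time accuracy enters each target band. The helper below is a hypothetical sketch of that logic; the callback names and the scripted accuracy curve are our own stand-ins for the actual CNN training:

```python
def collect_fv_levels(train_one_epoch, val_accuracy, extract_features,
                      bands, max_epochs=10000):
    """Snapshot features the first time validation accuracy enters each band.
    `bands` maps a quality-level name to an (lo, hi) accuracy range."""
    snapshots = {}
    for _ in range(max_epochs):
        train_one_epoch()
        acc = val_accuracy()
        for name, (lo, hi) in bands.items():
            if name not in snapshots and lo <= acc <= hi:
                snapshots[name] = extract_features()  # freeze this level's V-FV
        if len(snapshots) == len(bands):
            break
    return snapshots

# Toy run: a scripted accuracy curve standing in for real CNN training.
accs = iter([0.55, 0.61, 0.71, 0.81, 0.92])
levels = collect_fv_levels(
    train_one_epoch=lambda: None,
    val_accuracy=lambda: next(accs),
    extract_features=lambda: "features",          # placeholder for V-FV export
    bands={"FV1": (0.60, 0.62), "FV2": (0.70, 0.72),
           "FV3": (0.80, 0.82), "FV4": (0.90, 1.00)},
)
```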
MNIST10-FV
In order to get the MNIST10-FV dataset, we design an MNIST10-CNN and train it on the MNIST10-(TR50-TE50) dataset. The network structure of MNIST10-CNN is similar to LeNet-5, and the dimension of the V-FV in this network is 15. During learning, we record the highest validation accuracy that MNIST10-CNN can achieve (92.6%). Based on the following four ranges of the quality-control parameter: 60%-62%, 70%-72%, 80%-82%, and above 90%, we then extract four levels of V-FV of MNIST10-(TR50-TE50), named MNIST10-FV1, MNIST10-FV2, MNIST10-FV3, and MNIST10-FV4, respectively. We use the t-SNE method to visualize the distribution of these four levels of V-FV of MNIST10-(TR50-TE50); see Figure 10. The effect of quality control is evident: higher quality yields better separability.

EMNIST20-FV
In order to get the EMNIST20-FV dataset, we design an EMNIST20-CNN and train it on the EMNIST20-(TR50-TE50) dataset. The network structure of EMNIST20-CNN is similar to LeNet-5, and the dimension of the V-FV in this network is 20. During learning, we record the highest validation accuracy that EMNIST20-CNN can achieve (90.6%). Based on the following four ranges of the quality-control parameter: 60%-62%, 70%-72%, 80%-82%, and above 90%, we then extract four levels of V-FV of EMNIST20-(TR50-TE50), named EMNIST20-FV1, EMNIST20-FV2, EMNIST20-FV3, and EMNIST20-FV4, respectively. We use the t-SNE method to visualize the distribution of these four levels of V-FV of EMNIST20-(TR50-TE50); see Figure 11.

CIFAR100-FV
In order to get the CIFAR100-FV dataset, we design a CIFAR100-CNN and train it on the CIFAR100-(TR450-TE100) dataset. The network structure of CIFAR100-CNN is a VGG network with batch normalization, and the dimension of the V-FV in this network is 50. During learning, we record the highest validation accuracy that CIFAR100-CNN can achieve (74%). Based on the following four ranges of the quality-control parameter: 50%-52%, 58%-60%, 66%-68%, and above 72%, we then extract four levels of V-FV of CIFAR100-(TR10-TE10), named CIFAR100-FV1, CIFAR100-FV2, CIFAR100-FV3, and CIFAR100-FV4, respectively. We use the t-SNE method to visualize the distribution of these four levels of V-FV of CIFAR100-(TR10-TE10); see Figure 12.

ANN methods experimental parameter setting
We compare AVIM with five ANN methods: ANN(Base), ANN(Offline), iCaRL, GEM, and EWC. ANN(Base) denotes the sequential learning method, and ANN(Offline) denotes the normal shuffled learning method. All ANN methods use the same V-FV datasets in the task of incremental learning of new classes, with one class per task. The network structure is the same for all ANN methods: N input neurons, H hidden neurons, and C output neurons. For the MNIST10-FV dataset, (N, H, C) = (15, 50, 10); during training, the batch size of all ANN methods is 50 and the learning rate is 0.0001. For the EMNIST20-FV dataset, (N, H, C) = (20, 67, 20); during training, the batch size of all ANN methods is 50 and the learning rate is 0.0001. For the CIFAR100-FV dataset, (N, H, C) = (50, 167, 100); during training, the batch size of all ANN methods is 10; the learning rates of ANN(Offline), ANN(Base), and EWC are 0.0001, the learning rate of GEM is 0.01, and the learning rate of iCaRL is 0.05. All experiments set the random seed to 0.

Figure 10: t-SNE distribution of MNIST10-(TR50-TE50)-FV.
Weight initialization
In all experiments, the weights of the network output layer are initialized from a Gaussian with mean 0 and variance 0.001, while the weights of the other layers are initialized from a Gaussian with mean 0 and variance 0.005.
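Since the values above are variances, the corresponding standard deviations are √0.001 ≈ 0.032 and √0.005 ≈ 0.071. A minimal sketch (the layer shapes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
H, C = 50, 10                   # illustrative hidden / output sizes
# Generator.normal takes the standard deviation, hence the square roots.
W_out = rng.normal(0.0, np.sqrt(0.001), size=(H, C))   # output layer
W_hid = rng.normal(0.0, np.sqrt(0.005), size=(20, H))  # any other layer
```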
Stop training strategy
ANN(Offline), ANN(Base), EWC, and GEM use an early-stopping strategy to determine when to stop the training process. Specifically, we stop training on the current task when accuracy on the test set does not increase for five epochs. iCaRL uses a fixed number of epochs, training for 2500 epochs on each task.
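The early-stop rule can be sketched as follows; `train_epoch` and `evaluate` are placeholders for the actual per-task training and evaluation code:

```python
def train_with_early_stop(train_epoch, evaluate, patience=5, max_epochs=1000):
    """Stop when accuracy has not improved for `patience` consecutive epochs
    (the rule used here for ANN(Offline), ANN(Base), EWC, and GEM)."""
    best, waited = float("-inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        acc = evaluate()
        if acc > best:
            best, waited = acc, 0   # improvement: reset the patience counter
        else:
            waited += 1
            if waited >= patience:
                break               # no improvement for `patience` epochs
    return best

# Toy run: accuracy plateaus at 0.3, so training stops before the list ends.
accs = iter([0.1, 0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.9])
best = train_with_early_stop(lambda: None, lambda: next(accs))
```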