Adversarial Training of Variational Auto-encoders for Continual Zero-shot Learning (A-CZSL)
Subhankar Ghosh
Indian Institute of Science [email protected]
Abstract—Most existing artificial neural networks (ANNs) fail to learn continually due to catastrophic forgetting, whereas humans can do so while maintaining performance on previous tasks. Storing all previous data can alleviate the problem, but it requires a large memory, which is infeasible in real-world use. To address this issue, we propose a continual zero-shot learning model, better suited to real-world scenarios, that learns sequentially and distinguishes classes it has not seen during training. We present a hybrid network that consists of a shared VAE module holding information about all tasks and a task-specific private VAE module for each task. The model's size grows with each task to prevent catastrophic forgetting of task-specific skills, and it includes a replay approach to preserve shared skills. We demonstrate that our hybrid model is effective on several datasets: CUB, AWA1, AWA2, and aPY. We show that our method is superior for sequential class learning in both ZSL (zero-shot learning) and GZSL (generalized zero-shot learning). Our code is available at https://github.com/CZSLwithCVAE/CZSL_CVAE.
I. INTRODUCTION
Although ANNs achieve state-of-the-art performance on many machine learning problems, such as image classification, object detection, and natural language processing, these models forget previous knowledge due to catastrophic forgetting when trained on new tasks. We still need to improve existing algorithms to achieve human-level performance, that is, to learn sequentially the way humans do [1], without forgetting previous tasks throughout life. Can we build an ANN model that learns sequentially and simultaneously works for zero-shot learning (distinguishing classes it has not seen during training)? Such methods are called continual zero-shot learning. Existing algorithms are limited: they can either learn continually or work for ZSL (distinguishing data of unseen labels). In ZSL (zero-shot learning), a trained model sees data only from unseen classes during testing, whereas in GZSL (generalized zero-shot learning), data come from both seen and unseen classes at prediction time. Both tasks are challenging for a model to perform continually. Researchers have addressed ZSL [2, 3] and continual learning [4, 14, 19] separately. References [5, 6] have combined ZSL with continual learning before; our approach is entirely different from theirs, is more realistic, and beats most of their results on the same datasets.

One approach to continual learning is to learn a single representation with a fixed size. Such methods identify the weight parameters essential for each task and prevent their alteration when learning new tasks. In contrast, structure-based approaches grow in size with each task; however, these approaches are not feasible for a large number of tasks if each task needs a vast amount of memory. Another line of methods relies on experience replay, either by storing data [7, 8] from previous tasks or by synthesizing data [9, 10] for old classes using generative methods. In this paper, we propose a novel adversarial training of variational autoencoders for continual zero-shot learning.
Here, a disjoint space composes a task-specific latent space learned for each task and a task-invariant feature space learned across all tasks. Task-specific in the sense that the 1st private module learns from the 1st task's real data only, whereas the t-th private module is trained on real data from the t-th task and on replayed synthesized data from all previous (t-1) tasks. We tackle ZSL and continual learning together using CVAEs (conditional variational autoencoders) [11], which transfer knowledge from seen to unseen classes through class embeddings [12] to counter the ZSL problem. As visual data for unseen classes are not available at training time, knowledge transfer from seen to unseen classes is formed through side information that establishes semantic relationships between classes and class embeddings. Our approach is motivated by the fact that processing and synthesizing images is time-consuming for continual learning when the number of classes is high. Therefore, instead of images, we train and test our model on features of the same images extracted with a pre-trained model. We are also motivated by the observation that the human brain is complex and contains billions of neurons [13], so we may eventually need complicated networks containing huge numbers of neurons to learn sequentially.

The main contributions of this work are summarized as follows:
• We develop an experience-replay-based and structure-based continual zero-shot learning method using CVAE.
• The proposed method is developed for a single-head setting, which is more convenient for solving real-world scenarios.
• We present results on four ZSL benchmark datasets for continual zero-shot learning.
• We propose two types of latent space: one task-invariant space that holds information for all tasks, and one task-specific space. If there are T tasks, our proposed architecture consists of one task-invariant VAE and T task-specific VAEs.

II. RELATED WORK
A. Continual learning
There are three types of continual learning: regularization-based, memory-based, and structure-based methods.
Regularization methods
Here, the learning capacity is fixed [14, 15], and continual learning is performed by penalizing changes to a network's parameters; each method introduces its own regularizer. In [15], the importance of parameters is computed online: they track how the loss function changes due to a change in a specific parameter and accumulate this information during training. A weight-importance process is needed to select and prioritize parameters, the way elastic weight consolidation (EWC) [14] assigns importance to parameters based on the Fisher information matrix. The applicability of these methods is limited because they do not perform well over a large number of tasks.
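The EWC-style penalty described above can be written as a quadratic regularizer. The following is an illustrative NumPy sketch, not the original EWC implementation; `fisher` stands in for a precomputed diagonal Fisher information estimate from the previous task:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    # Quadratic regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.
    # Parameters important to the old task (large F_i) are pulled back
    # toward their old values theta*_i; unimportant ones may move freely.
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
```

This penalty is added to the new task's loss, so gradient descent trades off new-task accuracy against drift in old-task-critical weights.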
Memory-based methods
Methods in this category try to prevent forgetting either by storing [17, 18] data from previous classes or by synthesizing it. The former needs memory for rehearsal, whereas the latter uses a generative model such as a GAN [20] or a VAE [11], or both, to synthesize data of previous tasks and perform pseudo-rehearsal. If the memory budget is limited, the number of exemplars stored per class decreases as the number of classes increases. Researchers have recently proposed using a tiny memory [17] that stores a few examples per class from old tasks.
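The budget constraint above can be illustrated with a small sketch (a hypothetical helper, not taken from any of the cited methods): with a fixed total budget, the per-class share shrinks as classes accumulate, so the exemplar store must be trimmed whenever new classes arrive.

```python
def rebalance_memory(memory, budget):
    """Trim a per-class exemplar store so that each seen class keeps an
    equal share of a fixed budget (earliest-stored exemplars are kept)."""
    per_class = budget // max(len(memory), 1)
    return {c: xs[:per_class] for c, xs in memory.items()}
```

For example, a budget of 6 shared among two classes keeps 3 exemplars each; after a third class arrives, rebalancing leaves only 2 per class.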
Structure-based methods
The third approach to mitigating forgetting is structure-based methods [21]. The size of the network grows with each task to prevent catastrophic forgetting: previous tasks' performance is maintained by freezing the learned modules, while new tasks are accommodated by augmenting the network with new modules. The computational cost of this approach is unavoidable if the number of tasks is high.
B. Zero-shot learning and Generalized zero-shot learning
Zero-shot learning (ZSL) has recently attracted a lot of attention because a ZSL model can distinguish unseen classes at test time. ZSL models are able to do so by transferring knowledge from seen to unseen labels through a semantic relationship between classes and their attributes. We can transform a ZSL problem into a supervised machine learning problem using generative models such as GANs, VAEs, or both: once a generative model is trained, it can synthesize data for unseen classes, and that data can be used to train a classifier as in a conventional supervised problem. GZSL is a modification of ZSL, and a more practical setting, in which test data come from both seen and unseen classes.
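The reduction just described (synthesize features for unseen classes, then train an ordinary classifier) can be sketched as follows. This is an illustrative NumPy sketch: `generator` stands in for a hypothetical trained conditional decoder, and the nearest-class-mean rule is a stand-in for whatever supervised classifier is used in practice:

```python
import numpy as np

def synthesize_features(generator, class_embeddings, n_per_class, rng):
    """Build a labeled training set for unseen classes by sampling noise
    and conditioning the generator on each class embedding."""
    X, y = [], []
    for label, emb in class_embeddings.items():
        z = rng.standard_normal((n_per_class, 8))  # latent noise
        X.append(generator(z, emb))
        y.extend([label] * n_per_class)
    return np.vstack(X), np.array(y)

def nearest_mean_predict(X_train, y_train, X_test):
    """Classify test features by the nearest class mean of the synthetic set."""
    means = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    labels = sorted(means)
    d = np.stack([np.linalg.norm(X_test - means[c], axis=1) for c in labels])
    return np.array(labels)[d.argmin(axis=0)]
```

Once the generator is trained on seen classes, the same two steps turn ZSL into supervised learning over synthetic features.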
Fig. 1: Our model at training time. The shared module plays a minimax game with the discriminator to generate task-invariant features X̂_s, while the discriminator attempts to assign task labels to the synthesized features X̂_s. The architecture grows because of the task-specific modules, denoted P_t; the p_t are task-specific perceptron networks that assign class labels during each task's training. To prevent forgetting, the shared module is trained with replay examples of previously learned classes, generated from the shared module's decoder using z′ ∼ N(μ = 0, Σ = 1).

Fig. 2: Three parts: 1) a separate classifier is trained with generated data (of both seen and unseen classes) using the shared module's trained decoder; 2) the classifier is tested on both seen and unseen classes (ZSL); 3) the decoder is used to generate replay samples that are concatenated to the next task's data to train the model on new tasks.

C. Adversarial learning
Adversarial learning is used in many domains, such as generative models [20], object composition [22], representation learning [23], domain adaptation [24], and active learning [25]. In adversarial training, a model learns its parameters through a minimax game in which one module wants to maximize a cost function while another wants to minimize it. In this paper, the shared module plays the minimax game with the discriminator: the shared module tries to minimize the loss function, and the discriminator tries to maximize it.

III. ADVERSARIAL TRAINING OF CONDITIONAL VARIATIONAL AUTOENCODERS FOR CONTINUAL ZERO-SHOT LEARNING
We study the problem of learning a sequence of T data distributions denoted D_tr = {D¹_tr, D²_tr, ..., D^T_tr}, where D^t_tr = {(X^t_i, Y^t_i, tar^t_i, T^t_i)}_{i=1}^{n_t} is the data distribution for task t with n_t sample tuples of input (X^t ∈ X), target label (tar^t ∈ tar), class attributes (Y^t ∈ Y), and task label (T^t ∈ T). D^t_tr contains seen-class information. In addition, class embeddings of unseen classes, U_c = {(a_i)}_{i=1}^{n_uc}, are available, where n_uc is the number of unseen labels. The goal is to learn a sequential function f : (z ∼ N(0, 1), Y) → X̂_s for each task, where X̂_s is synthesized data generated from the shared module; the synthetic data can be used to train a supervised classifier. We aim to learn another function, f_θc : X̂ → tar, after training each task, that maps an input (from seen classes, unseen classes, or both) to its target output without affecting the model's performance on prior tasks. We achieve this by training two separate modules, shared and private, to enhance knowledge transfer from seen to unseen classes and to better avoid forgetting prior knowledge. The model prevents catastrophic forgetting in the shared and private spaces separately and learns f^t_θ, with θ ∈ (θ_S, θ_P), as a mapping from (X^t_tr, Y^t_tr) to tar^t. Prior to the t-th task, we synthesize n samples per class and accumulate the generated data into the current (t-th) task to train the model:

D^t_tr ← D^t_tr ∪ D^{t-1}_gen

The cross-entropy loss function for the mapping f^t_θ corresponds to

L_task(f^t_θ, D^t_tr) = −E_{(X^t, Y^t, tar^t) ∼ D^t} [ Σ_{c=1}^{C} 1(c = tar^t) · log(σ(f^t_θ(X^t, Y^t))) ]    (1)

where σ is the softmax function. In learning a sequence of tasks, an ideal f^t maps the input features X^t into two independent feature spaces: X^t_s, a feature space shared among all tasks, and X^t_p, which remains private to each task.
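Equation (1) is the standard softmax cross-entropy. A minimal NumPy sketch of the loss (illustrative only, operating on pre-computed logits f(X, Y)):

```python
import numpy as np

def task_loss(logits, targets):
    # Numerically stable log-softmax followed by the negative log-likelihood
    # of the target class, averaged over the batch (Eq. 1).
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

For a two-class problem with uniform logits, the loss is log 2, the entropy of a fair guess.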
Both X^t_s and X^t_p are concatenated and fed to a task-specific multi-layer perceptron network to obtain the desired output labels. We introduce a mapping called shared (S_θs : X → X̂_s) and train it to generate features, by feeding noise into the shared module's decoder, that fool a discriminator D. In contrast, the discriminator (D_θd : X̂_s → T) tries to assign the synthesized features to their corresponding task labels (T^t ∈ {0, 1, 2, ..., T}). The discriminator can do so when it is trained to maximize the probability of assigning correct task labels to the features generated from the shared module; simultaneously, the shared module tries to minimize the same probability. The corresponding cross-entropy adversarial loss for the minimax game is

L_adv(D, S, D^t_tr) = min_S max_D Σ_{t=0}^{T} 1(t = t′) [ log(D(S(X^t, Y^t))) ]    (2)

The extra label zero is reserved for fake data generated from the Gaussian distribution with mean 0 and standard deviation 1. Adversarial training is mostly used in generative adversarial networks, which try to learn the input data distribution in order to synthesize more data from the same distribution. Here we do the same with generative models, the task-invariant shared VAE and the task-specific private VAEs, both of which try to learn the input data distribution.

To facilitate adversarial training of S, we use the Gradient Reversal Layer [28], which directly maximizes the discriminator's loss with respect to the shared module. The layer acts as an identity function during forward propagation but multiplies the gradient by negative one during backpropagation. The adversarial training between the discriminator and the shared module is complete when the discriminator can no longer predict the correct task label for features generated from the shared module. The private module, however, does not learn any task-invariant features.

Variational autoencoders
Autoencoders can effectively learn a feature space and representation [15, 22]. A variational autoencoder (VAE) is a generative model that follows the encoder-latent vector-decoder architecture of the classical autoencoder but places a prior distribution on the feature space and uses an evidence lower bound to optimize the learned posterior. A conditional VAE is an extension of the VAE in which the network is additionally fed class properties such as labels or attributes. The VAE is a fundamental building block of our approach. The variational distribution aims to approximate the true conditional probability distribution over the latent variables z by minimizing their distance via the variational lower bound. The loss function for a VAE is

L_VAE = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p_θ(z))    (3)

where the first term is the reconstruction loss and the second is the KL divergence between q(z|x) and p(z). The encoder predicts μ and Σ such that q_φ(z|x) = N(μ, Σ), from which a latent vector is sampled via the reparametrization process.

The final objective function of the model for the t-th task is

L(t) = λ₁ L_adv + λ₂ L_task + λ₃ L_sVAE + λ₄ L_pVAE    (4)

where λ₁, λ₂, λ₃, and λ₄ are regularizers that control the effect of each component. The full algorithm of the model is given in Algorithm 1.

A. Avoiding forgetting

Catastrophic forgetting occurs because the imbalance between old and new classes biases the network towards the newest classes. One insight of our approach is to decouple the single representation learned continually for all tasks into two parts: private and shared. Knowledge for ZSL and GZSL is transferred from seen to unseen classes mostly through the shared module. The critical mechanism is experience replay: replayed data are concatenated to the current task's data while the model is trained on that task, to avoid forgetting sequentially.
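The KL term of Eq. (3) has a closed form when the posterior is a diagonal Gaussian and the prior is a standard normal, and the latent vector is drawn with the reparametrization trick. A NumPy sketch (illustrative; `log_var` denotes log σ²):

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) )
    #   = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); sampling stays differentiable
    # with respect to mu and log_var because the noise is external.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

When the posterior equals the prior (μ = 0, σ = 1) the KL term vanishes, which is the sanity check usually used for this formula.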
B. Datasets
We evaluate our model on four benchmark datasets used for ZSL: Attribute Pascal and Yahoo (aPY) [2], Animals with Attributes (AWA1, AWA2) [2], and Caltech-UCSD Birds 200-2011 (CUB) [26]. Statistics of the datasets are presented in Table I.
C. Continual Zero-shot learning(CZSL) setting
The dataset we use follows the setting of [5]: whether a class is seen or unseen is decided by the number of tasks the model has been trained on so far. If a model has been trained continually up to the t-th task, the classes up to the t-th task are considered seen, and the remaining classes of the whole dataset are considered unseen while training.

D. Evaluation metrics
After training on each new task, we evaluate the resulting model on all previous tasks, similarly to [16, 18]. We use ACC, the average test classification accuracy, computed across all classes for GZSL and across seen and unseen classes for ZSL, to measure our model's performance. To measure forgetting, we calculate the backward transfer, BWT, which indicates how much learning new tasks has influenced previous tasks' performance: BWT > 0 indicates catastrophic forgetting, while BWT < 0 indicates that learning new tasks has helped improve performance on previous tasks. We calculate the forgetting measure for seen classes only.

BWT = (1/(T−1)) Σ_{t=1}^{T−1} [ R^seen_{t,t} − R^seen_{T,t} ]    (5)

mSA = (1/T) Σ_{t=1}^{T} R^seen_{t,t}    (6)

mSA is the mean seen-class accuracy across all tasks.

mUA = (1/(T−1)) Σ_{t=1}^{T−1} R^unseen_{t,t}    (7)

mUA measures the zero-shot learning performance of the model.

mOA = (1/T) Σ_{t=1}^{T} R^overall_{t,t}    (8)

mOA measures the generalized zero-shot learning performance.

mH = (1/(T−1)) Σ_{t=1}^{T−1} [ 2 · R^seen_{t,t} · R^unseen_{t,t} / (R^seen_{t,t} + R^unseen_{t,t}) ]    (9)

mH is the harmonic-mean classification accuracy.

Algorithm 1
Continual Zero-shot Learning
Input: (X, Y, tar) ∼ D_all
Parameters: θ_S, θ_P, θ_D, θ_c
Output: X̂_S, X̂_P

D_gen ← {}
for t ← 1 to T do
    for e ← 1 to epochs do
        for k ← 1 to S_steps do
            Compute L_task using (X^t, Y^t, tar^t) ∈ D^t
            Compute L_adv using (X^t, Y^t, t) ∈ D^t
            Compute L_sVAE for the shared module using (X^t, Y^t) ∈ D^t
            Compute L_pVAE for the private module using (X^t, Y^t) ∈ D^t
            L(t) = λ₁ L_adv + λ₂ L_task + λ₃ L_sVAE + λ₄ L_pVAE
            θ′_S ← θ_S − α_S ∇L(t)
            θ′_P ← θ_P − α_P ∇L(t)
        end for
        for j ← 1 to D_steps do
            Compute L_adv for D using (S(x)^t, tar^t) and (z′ ∼ N(μ = 0, Σ = 1), tar = 0)
            θ′_D ← θ_D − α_D ∇L(t)
        end for
    end for
    Generate data from the shared module for seen and unseen classes to train a separate classifier.
    D ← D_seen ∪ D_unseen
    for C_e ← 1 to C_epochs do
        Compute L_class using (X, tar) ∈ D
        θ′_c ← θ_c − α_c ∇L_class
    end for
    Test the classifier on seen data.
    Test the classifier on unseen data (ZSL).
    Test the classifier on all seen and unseen data (GZSL).
    for c ← 1 to C do            (C is the number of replay classes)
        for i ← 1 to n do        (n is the number of samples generated per class for experience replay)
            (X_i, Y_i, tar_i) ∼ D_gen
        end for
    end for
    D^{t+1} ← D^{t+1} ∪ D_gen
end for

where R_{j,i} is the test classification accuracy on task i after sequentially finishing learning the j-th task.

TABLE I: Datasets and their statistics, where SC and UC are seen and unseen classes, respectively.
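With R_{j,i} defined as above, the metrics of Eqs. (5)-(9) reduce to bookkeeping over accuracy matrices. A minimal NumPy sketch (illustrative, not the paper's code; mOA is computed analogously from an overall-accuracy matrix):

```python
import numpy as np

def czsl_metrics(R_seen, R_unseen):
    # Diagonal entries R[t, t] are accuracies measured right after learning task t.
    T = R_seen.shape[0]
    ds, du = np.diag(R_seen), np.diag(R_unseen)
    mSA = ds.mean()                                             # Eq. (6)
    mUA = du[:-1].mean()                                        # Eq. (7)
    mH = (2 * ds[:-1] * du[:-1] / (ds[:-1] + du[:-1])).mean()   # Eq. (9)
    BWT = (ds[:-1] - R_seen[T - 1, :-1]).mean()                 # Eq. (5): > 0 means forgetting
    return mSA, mUA, mH, BWT
```

For instance, if seen accuracy on task 1 drops from 0.8 (after learning it) to 0.6 (after learning task 2), BWT is +0.2, signaling forgetting.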
CUB
Methods             mSA    mUA(ZSL)  mH     mOA(GZSL)
AGEM+CZSL           -      -         13.20  -
Seq-CVAE            24.66  8.57      12.18  -
CZSL-CV+mof         43.73  10.26     16.34  -
CZSL-CV+rb          42.97  13.07     19.53  -
CZSL-CV+res         -      -         -      -
ours (without adv)  34.25  12.42     17.41  -
ours (with adv)     34.47  12.00     17.15  21.72

TABLE II: Results for the CUB dataset, where mSA: Mean Seen Accuracy, mUA: Mean Unseen Accuracy, mH: Harmonic Mean Accuracy, mOA: Mean Overall Accuracy. The best results in the table are presented in boldface.
IV. EXPERIMENTS
In this section, we discuss baselines, training, and results.
Fig. 3: Results without adversarial training for the CUB dataset.
Fig. 4: Results with adversarial training for the CUB dataset.
A. Baselines
Research on continual zero-shot learning (CZSL) has been relatively unexplored. References [5, 6] have investigated this problem before in the single-head setting that we adopt in this paper. Reference [6] used the following baselines, so we do the same here.
• AGEM + CZSL [5]: an average gradient episodic memory-based continual zero-shot learning method. The authors of [5] report the harmonic mean for the CUB dataset only.
• SEQ + CVAE [6]: the authors train a CVAE sequentially without any continual learning strategy. After SEQ+CVAE is trained on the current task, synthetic features are generated using noise and class embeddings for all classes to train a separate classifier.

aPY
Methods             mSA    mUA(ZSL)  mH     mOA(GZSL)
AGEM+CZSL           -      -         -      -
Seq-CVAE            51.57  11.38     18.33  -
CZSL-CV+mof         64.91  10.79     18.27  -
CZSL-CV+rb          64.45  11.00     18.60  -
CZSL-CV+res         -      -         -      -
ours (without adv)  58.14  -         -      -
ours (with adv)     55.46  11.2      18.63  35.97

TABLE III: Results for the aPY dataset, where mSA: Mean Seen Accuracy, mUA: Mean Unseen Accuracy, mH: Harmonic Mean Accuracy, mOA: Mean Overall Accuracy. The best results in the table are presented in boldface.
B. Other Methods
We also compare our results with CZSL-CV+mof [6], CZSL-CV+rb [6], and CZSL-CV+res [6].
Fig. 5: Results without adversarial training for the aPY dataset.
Fig. 6: Results with adversarial training for the aPY dataset.
C. Training
We use PyTorch as our framework. We train the model for 100 epochs and the classifier for 30 epochs per task on all datasets except CUB. For the CUB dataset, we use the same numbers of epochs up to task fifteen, then reduce them to 50 for the model and 10 for the classifier until task eighteen, and then decrease them again to 20 for the model and 5 for the classifier. The Adam [27] optimizer is used in all experiments, with learning rates of 0.0001 for the classifier and 0.001 for everything else. We use a weight decay of 0.0001 as a regularizer for the classifier. Both the shared and private modules have 500 hidden units. The latent dimension is 50, and the batch size for both the model and the classifier is 61. We set three of the regularizers in Eq. (4) to 1 and the fourth to a value smaller than 1.

AWA1
Methods             mSA    mUA(ZSL)  mH     mOA(GZSL)
AGEM+CZSL           -      -         -      -
Seq-CVAE            59.27  18.24     27.14  -
CZSL-CV+mof         76.77  19.26     30.46  -
CZSL-CV+rb          77.85  21.90     33.64  -
CZSL-CV+res         -      -         -      -

TABLE IV: Results for the AWA1 dataset, where mSA: Mean Seen Accuracy, mUA: Mean Unseen Accuracy, mH: Harmonic Mean Accuracy, mOA: Mean Overall Accuracy. The best results in the table are presented in boldface.

Fig. 7: Results without adversarial training for the AWA1 dataset.
Fig. 8: Results with adversarial training for the AWA1 dataset.
D. Results
Results for the CUB, aPY, AWA1, and AWA2 datasets are presented in Tables II, III, IV, and V, respectively. Forgetting measures are shown in the graphs corresponding to each dataset, with and without adversarial training. Our model outperforms all baseline methods on all datasets. The models of [5, 6] use a memory-based approach; in contrast, we use an experience replay method that is more feasible in real applications. Figs. 3, 5, and 7 show that when we train our model without the adversarial component, performance grows with each task because the number of seen classes increases and the number of unseen classes decreases. When we train our model adversarially on AWA1 and AWA2, it outperforms all other methods, achieving the best seen, unseen, and harmonic-mean accuracies on both datasets (Tables IV and V). Although the method does not obtain the best results on the CUB dataset, it gives the best unseen accuracy on the aPY dataset. Our model obtains a balanced performance between previous tasks and the current task, which notably outperforms the baselines.
AWA2
Methods             mSA    mUA(ZSL)  mH     mOA(GZSL)
AGEM+CZSL           -      -         -      -
Seq-CVAE            61.42  19.34     28.67  -
CZSL-CV+mof         79.11  24.41     36.60  -
CZSL-CV+rb          80.92  24.82     37.32  -
CZSL-CV+res         -      -         -      -
ours (without adv)  70.05  22.85     32.98  44.97
ours (with adv)     70.16  -         -      -

TABLE V: Results for the AWA2 dataset, where mSA: Mean Seen Accuracy, mUA: Mean Unseen Accuracy, mH: Harmonic Mean Accuracy, mOA: Mean Overall Accuracy. The best results in the table are presented in boldface.

Fig. 9: Results without adversarial training for the AWA2 dataset.
Fig. 10: Results with adversarial training for the AWA2 dataset.
V. CONCLUSION

In this work, we proposed a novel hybrid algorithm whose novelty is its use of adversarial learning. The model relies on experience replay and grows with each task for task-incremental learning. The private module barely transfers knowledge from seen to unseen classes, so optimizing the private module for ZSL is a direction for future work. Other open questions include task-free continual zero-shot learning, continual zero-shot learning for object detection, and the choice of the optimal latent dimension and hidden-layer size.

REFERENCES

[1] W. C. Abraham and A. Robins. Memory retention: the synaptic stability versus plasticity dilemma. Trends in Neurosciences, 28(2):73-78, 2005.
[2] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778-1785. IEEE, 2009.
[3] Rafael Felix, Vijay B. G. Kumar, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, pages 21-37, 2018.
[4] Tyler L. Hayes, Nathan D. Cahill, and Christopher Kanan. Memory efficient experience replay for streaming learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 9769-9776. IEEE, 2019.
[5] Ivan Skorokhodov and Mohamed Elhoseiny. Normalization matters in zero-shot learning. arXiv preprint arXiv:2006.11328, 2020.
[6] C. Gautam, S. Parameswaran, A. Mishra, and S. Sundaram. Generalized continual zero-shot learning. arXiv preprint arXiv:2011.08508, 2020.
[7] Tyler L. Hayes, Nathan D. Cahill, and Christopher Kanan. Memory efficient experience replay for streaming learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 9769-9776. IEEE, 2019.
[8] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
[9] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990-2999, 2017.
[10] Xialei Liu, Chenshen Wu, Mikel Menta, Luis Herranz, Bogdan Raducanu, Andrew D. Bagdanov, Shangling Jui, and Joost van de Weijer. Generative feature replay for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 226-227, 2020.
[11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[12] Vinay Kumar Verma and Piyush Rai. A simple exponential family framework for zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 792-808. Springer, 2017.
[13] Suzana Herculano-Houzel. The human brain in numbers: a linearly scaled-up primate brain. Frontiers in Human Neuroscience, 3:31, 2009.
[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.