GAN Memory with No Forgetting
Yulai Cong ∗ Miaoyun Zhao ∗ Jianqiao Li Sijia Wang Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
Abstract
Seeking to address the fundamental issue of memory in lifelong learning, we propose a GAN memory that is capable of realistically remembering a stream of generative processes with no forgetting. Our GAN memory is based on recognizing that one can modulate the "style" of a GAN model to form perceptually-distant targeted generation. Accordingly, we propose to do sequential style modulations atop a well-behaved base GAN model, to form sequential targeted generative models, while simultaneously benefiting from the transferred base knowledge. Experiments demonstrate the superiority of our method over existing approaches and its effectiveness in alleviating catastrophic forgetting for lifelong classification problems.

1 Introduction

Concerning the ability to continually learn new knowledge without forgetting previously learned experiences, lifelong learning (or continual learning) is a long-standing challenge for machine learning and artificial intelligence systems [74, 28, 71, 11, 14, 58]. The main issue associated with lifelong learning is the notorious catastrophic forgetting of deep neural networks [48, 36, 85], i.e., training a model with new information severely interferes with previously learned knowledge.

To alleviate the catastrophic forgetting issue, many methods have been proposed, with most focusing on discriminative/classification tasks [36, 63, 92, 54, 91]. Reviewing existing methods, [75] revealed that generative replay (or pseudo-rehearsal) [70, 67, 84, 64, 86] is an effective and general strategy for lifelong learning, as further supported by [40, 76]. That revelation is anticipated, for if the characteristics of previous data are remembered perfectly (e.g., via realistic generative replay), no forgetting should be expected in lifelong learning. Compared with the coreset idea, which saves representative samples of previous data [54, 63, 11], generative replay has advantages in addressing privacy concerns and in remembering potentially more complete data information (via the generative process). However, most existing generative replay methods either deliver blurry generated samples [10, 40] or only work well on simple datasets [40, 76] like MNIST; besides, they often don't scale well to practical situations with high resolution [58] or a long task sequence [84], sometimes even showing negative backward transfer [93, 80]. Therefore, it's challenging to continually learn a well-behaved generative replay model [40], even for moderately complex datasets like CIFAR10.

We seek a realistic generative replay framework to alleviate catastrophic forgetting; going further, we consider developing a realistic generative memory with growing (expressive) power, believed to be a fundamental building block toward general lifelong learning systems. We leverage the popular GAN [25] setup as the key component of that generative memory, which we term GAN memory, because (i) GANs have shown remarkable power in synthesizing realistic high-dimensional samples [9, 51, 33, 34]; (ii) by modeling the generative process of training data, GANs summarize the data statistical information in the model parameters, consequently also protecting privacy (the original data need not be saved); and (iii) a GAN often generates realistic samples not observed in training data.

∗ Equal Contribution. Correspondence to: Yulai Cong
2 Related work

Lifelong learning
Existing lifelong learning methods can be roughly grouped into three categories, i.e., regularization-based [36, 67, 92, 54, 66], dynamic-model-based [47, 66], and generative-replay-based methods [70, 42, 84, 64, 86, 76]. Among these methods, generative replay is believed to be an effective and general strategy for general lifelong learning problems [64, 86, 40, 76], as discussed above. However, most existing methods of this type suffer from blurry generation (for images) or scalability issues [84, 58, 93, 80]. MeRGAN [84] leverages a copy of the current generator to replay previous data information, showing increasingly blurry historical generations with reinforced generation artifacts [93] as training proceeds. Lifelong GAN [93] employs cycle consistency (via an auxiliary encoder) and knowledge distillation (via copies of that encoder and the generator) to remember image-conditioned generation. However, it still shows decreased performance on historical tasks. By comparison, our GAN memory delivers realistic synthesis with no forgetting and scales well to high-dimensional situations with a long task sequence, capable of serving as a realistic generative memory for general lifelong learning systems.

Transfer learning
Being general and effective, transfer learning has attracted increasing attention recently in various research fields, with a focus on discriminative tasks like classification [6, 68, 88, 90, 16, 24, 44, 46, 72, 89] and those in natural language processing [3, 61, 53, 52, 2]. For generative tasks, however, only a few efforts have been made [83, 57, 95]. For example, a GAN pretrained on a large-scale source dataset is used in [83] to initialize the GAN model in a target domain, for efficient training or even better performance; alternatively, [57] freezes the source GAN generator and only modulates its hidden-layer statistics to "add" new generation power with an L1/perceptual loss; observing the general applicability of low-level filters in a GAN generator/discriminator, [95] transfers-then-freezes them to facilitate generation with limited data in a target domain. Those methods either only concern synthesis in the target domain (completely forgetting source generation) [83, 95] or deliver blurry target generation [57]. Our GAN memory provides both realistic target generation and no forgetting of source generation.

3 Preliminary
We briefly review the two building blocks on which our method is constructed: generative adversarial networks (GANs) [25, 33] and style-transfer techniques [62, 95].
Generative adversarial networks (GANs)
GANs continue to show increasing power in synthesizing highly realistic observations [32, 45, 51, 9, 33, 34], and have found wide applicability in various fields [39, 1, 19, 79, 81, 82, 12, 87, 37]. A GAN often consists of a generator $G$ and a discriminator $D$, both trained adversarially with objective
$$\min_G \max_D \; \mathbb{E}_{x \sim q_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log(1 - D(G(z)))\big], \qquad (1)$$
where $p(z)$ is a simple distribution (e.g., Gaussian) and $q_{\text{data}}(x)$ is the underlying (unknown) data distribution from which we observe samples.
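As a concrete reference, (1) is typically optimized by alternating gradient updates on $D$ and $G$. Below is a minimal PyTorch sketch under stated assumptions: `G` and `D` are pre-built modules, `D` outputs raw logits, and the generator uses the common non-saturating surrogate; the function name and batch handling are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, z_dim, opt_g, opt_d):
    # One alternating update of Eq. (1); D is assumed to output raw logits.
    b = x_real.size(0)
    ones = torch.ones(b, 1, device=x_real.device)
    zeros = torch.zeros(b, 1, device=x_real.device)

    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))].
    z = torch.randn(b, z_dim, device=x_real.device)
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), ones) \
           + F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: the standard non-saturating surrogate for min E[log(1 - D(G(z)))].
    z = torch.randn(b, z_dim, device=x_real.device)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```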
Style-transfer techniques

An extensive literature [23, 69, 77, 13, 60, 38] has explored how one can manipulate the style of an image (e.g., the texture [18, 30, 41] or attributes [33, 34]) by modulating the statistics of its latent features. These methods use style-transfer techniques like conditional instance normalization [18] or adaptive instance normalization [30], most of which are related to Feature-wise Linear Modulation (FiLM) [62]. FiLM imposes simple element-wise affine transformations on the latent features of a neural network, showing remarkable effectiveness in various domains [30, 17, 33, 57]. Given a $d$-dimensional feature $h \in \mathbb{R}^d$ from a layer of a neural network, FiLM yields
$$\hat{h} = \gamma \odot h + \beta, \qquad (2)$$
where $\hat{h}$ is forwarded to the next layer, $\odot$ denotes the Hadamard product, and the scale $\gamma \in \mathbb{R}^d$ and shift $\beta \in \mathbb{R}^d$ may be conditioned on other information [18, 62, 17]. Different from FiLM modulating latent features, another technique named adaptive filter modulation (AdaFM) modulates source convolutional (Conv) filters to manipulate their "style," delivering boosted transfer performance [95]. Specifically, given a Conv filter $W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K \times K}$, where $C_{\text{in}}/C_{\text{out}}$ denotes the number of input/output channels and $K \times K$ is the kernel size, AdaFM yields
$$\hat{W} = \Gamma \odot W + B, \qquad (3)$$
where the scale matrix $\Gamma \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$ and the shift matrix $B \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$ modulate $W$, and the modulated $\hat{W}$ is convolved with the input feature maps to produce the output ones. Note $\odot$ is implemented with broadcasting.

4 GAN memory

Targeting the fundamental memory issue of lifelong learning, we propose to exploit popular GANs to design a realistic generative memory (named GAN memory) to sequentially remember data-generating processes. Specifically, we consider a lifelong generation problem: the GAN memory sequentially accesses a stream of datasets/tasks $\{\mathcal{D}_1, \mathcal{D}_2, \cdots\}$ (during task $t$, only $\mathcal{D}_t$ is accessible); after task $t$, the GAN memory should be able to synthesize realistic samples resembling $\{\mathcal{D}_1, \cdots, \mathcal{D}_t\}$. Below we first reveal a surprising discovery that lays the foundation of the paper; we then build on top of it our GAN memory, followed by a detailed analysis; and finally we present compression techniques to facilitate our GAN memory for lifelong problems with a long task sequence.

4.1 Style modulation of a source GAN model

Moving well beyond the style-transfer literature modulating image features to manipulate their style [18, 30, 17], we discover that one can even modulate the "style" of a source generative/discriminative process (e.g., a GAN generator/discriminator trained on a source dataset) to form synthesis power for a perceptually-distant target domain (e.g., a generative/discriminative power on a target dataset), via manipulating its FC and Conv layers with style-transfer techniques (or their variants).

Before introducing the technical details, we emphasize our basic assumption of well-behaved source FC and Conv parameters; often parameters from a GAN model trained on large-scale datasets satisfy that assumption. To highlight our discovery, we choose a moderately sophisticated GAN model [49] trained on CelebA [43] (containing only faces) as the source, and select perceptually-distant target datasets including Flowers, Cathedrals, Cats, Brain-tumor images, Chest X-rays, and Anime images (see Figure 5 and Section 5.1).

For simplicity, we omit layer-index notation throughout the paper. This setup is not limiting, as it's often convenient to use a physical memory buffer to form the data stream.
Figure 1: (a) Style modulation of a source GAN model (demonstrated with the generator, but also applied to the discriminator). Source parameters (green) are frozen, with limited trainable style parameters (red; $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}$) introduced to form the augmentation to the target domain. (b) Comparing our style modulation to the strong fine-tuning baseline (see Appendix B for details). (c) The architecture of our GAN memory for a stream of target generation tasks.

With the style-transfer techniques detailed below, we observe realistic generations in all target domains (see Figure 5), even though the generation power is modulated from an entirely different source domain. Alternatively, given a specific target domain, that observation also implies flexibility in choosing a source model, i.e., the source (with well-behaved parameters) should produce the same data type as the target, but need not be related to the target domain. In the context of image-based data, this implies a certain universal structure to images that may be captured within a GAN by one (relatively large) image dataset. Via appropriate style transformation of this model, it can be adapted to new and very different image classes, using potentially limited observations from those target domains.

We next present the style-transfer techniques employed here, modified FiLM (mFiLM) and modified AdaFM (mAdaFM), for modulating FC and Conv layers, respectively.

FC layers
Given a source FC layer $h_{\text{source}} = Wz + b$ with weight $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, bias $b \in \mathbb{R}^{d_{\text{out}}}$, and input $z \in \mathbb{R}^{d_{\text{in}}}$, mFiLM modulates its parameters to form a target function $h_{\text{target}} = \hat{W}z + \hat{b}$ with
$$\hat{W} = \gamma \odot \frac{W - \mu}{\sigma} + \beta, \qquad \hat{b} = b + b_{\text{FC}}, \qquad (4)$$
where $\mu, \sigma \in \mathbb{R}^{d_{\text{out}}}$, with the elements $\mu_i, \sigma_i$ denoting the mean and standard deviation of the row vector $W_{i,:}$, respectively; $\gamma, \beta, b_{\text{FC}} \in \mathbb{R}^{d_{\text{out}}}$ are target-specific scale, shift, and bias style parameters trained with target data ($W$ and $b$ are frozen during target training, as learned on the original source domain). One may interpret mFiLM as applying FiLM [62] (or batch normalization [31]) to a source FC weight to modulate its (row) statistics/style (encoded in $\mu$ and $\sigma$) to adapt to a target domain.
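To make (4) concrete, here is a minimal PyTorch sketch of an mFiLM-modulated FC layer, assuming the frozen source weight `W` and bias `b` are given; the class name and the identity-style initialization ($\gamma = \sigma$, $\beta = \mu$, so training starts from the source function) are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class MFiLMLinear(nn.Module):
    """Eq. (4): a frozen source FC layer with trainable target style parameters."""
    def __init__(self, W, b):
        super().__init__()
        d_out = W.size(0)
        # Frozen source parameters (buffers, never updated).
        self.register_buffer("W", W)
        self.register_buffer("b", b)
        # Row-wise statistics of the source weight (mu_i, sigma_i of W[i, :]).
        self.register_buffer("mu", W.mean(dim=1, keepdim=True))
        self.register_buffer("sigma", W.std(dim=1, keepdim=True) + 1e-8)
        # Trainable target-specific style parameters, initialized to the identity.
        self.gamma = nn.Parameter(self.sigma.clone())
        self.beta = nn.Parameter(self.mu.clone())
        self.b_fc = nn.Parameter(torch.zeros(d_out))

    def forward(self, z):
        W_hat = self.gamma * (self.W - self.mu) / self.sigma + self.beta
        return z @ W_hat.t() + (self.b + self.b_fc)
```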
Conv layers

Given a source Conv layer $H_{\text{source}} = W * H' + b$ with input feature maps $H'$, Conv filters $W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K \times K}$, and bias $b \in \mathbb{R}^{C_{\text{out}}}$, we leverage mAdaFM to modulate its parameters to form a target Conv layer $H_{\text{target}} = \hat{W} * H' + \hat{b}$, where
$$\hat{W} = \Gamma \odot \frac{W - M}{S} + B, \qquad \hat{b} = b + b_{\text{Conv}}, \qquad (5)$$
where $M, S \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$, with the elements $M_{i,j}, S_{i,j}$ denoting the mean and standard deviation of $\mathrm{vec}(W_{i,j,:,:})$, respectively. The trainable target-specific style parameters are $\Gamma, B \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$ and $b_{\text{Conv}} \in \mathbb{R}^{C_{\text{out}}}$, with $W$ and $b$ frozen. Similar to mFiLM, mAdaFM first removes the source style (encoded in $M$ and $S$), followed by leveraging $\Gamma$ and $B$ to learn the target style.

The adopted style modulation process (shown with the generator) is illustrated in Figure 1(a). Given a source GAN model, we transfer and freeze all its parameters to a target domain, followed by using mFiLM/mAdaFM in (4)/(5) to modulate its FC/Conv layers (with style parameters $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}$) to yield the target generation model. With that style modulation, we transfer the source knowledge (within the frozen parameters) to the (potentially) perceptually-distant target domain to facilitate realistic generation therein (see Figure 5 for generated samples). For quantitative evaluation, comparisons are made with a strong baseline, i.e., fine-tuning the whole source model (including both generator and discriminator) on target data. The FID scores (lower is better) [29] from both methods are summarized in Figure 1(b), where our method consistently outperforms that strong baseline by a large margin, in both training efficiency and final performance, highlighting the valuable knowledge within the frozen source parameters.

The newly-introduced style parameters $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}$ of our method are only a small fraction of those of the fine-tuning baseline, yet they deliver better efficiency and performance. This is likely because the target data are not sufficient to train well-behaved parameters like the source ones when performing fine-tuning alone.
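Analogously, here is a hedged PyTorch sketch of an mAdaFM-modulated Conv layer implementing (5); again the module name, stride/padding defaults, and identity initialization ($\Gamma = S$, $B = M$) are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAdaFMConv(nn.Module):
    """Eq. (5): a frozen source Conv layer modulated by trainable matrices."""
    def __init__(self, W, b, stride=1, padding=1):
        super().__init__()
        c_out = W.size(0)
        self.stride, self.padding = stride, padding
        self.register_buffer("W", W)                            # (C_out, C_in, K, K), frozen
        self.register_buffer("b", b)                            # (C_out,), frozen
        # Per-(i, j) statistics of each K x K kernel vec(W[i, j]).
        self.register_buffer("M", W.mean(dim=(2, 3)))           # (C_out, C_in)
        self.register_buffer("S", W.flatten(2).std(dim=2) + 1e-8)
        # Trainable target-specific style parameters (identity init).
        self.Gamma = nn.Parameter(self.S.clone())
        self.B = nn.Parameter(self.M.clone())
        self.b_conv = nn.Parameter(torch.zeros(c_out))

    def forward(self, h):
        # Broadcast the (C_out, C_in) matrices over the K x K kernel dimensions.
        M, S = self.M[..., None, None], self.S[..., None, None]
        W_hat = self.Gamma[..., None, None] * (self.W - M) / S + self.B[..., None, None]
        return F.conv2d(h, W_hat, self.b + self.b_conv,
                        stride=self.stride, padding=self.padding)
```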
Figure 2: (a) Style parameters modulate different generation perspectives. (b) The biases model sparse objects. (c) Modulations in different blocks have different strength/focus over the generation. B$_m$ is the $m$th residual block; B$_0$/B$_6$ is closest to the noise/observation. See Appendix C for details.

The frozen source parameters apparently encode a type of universal information about images, suggesting a potentially better way to perform transfer learning. Complementing the common knowledge of transfer learning, i.e., that low-level filters (those close to observations) are generally applicable while high-level ones are task-specific [88, 90, 44, 4, 5, 20, 57, 95], the above discovery reveals an orthogonal dimension for transfer learning. Specifically, the shape of kernels (i.e., the relative relationship among kernel elements $\{W_{i,j,1,1}, W_{i,j,1,2}, \cdots\}$) may be generally transferable, whereas the statistics of kernels (the mean $M_{i,j}$ or standard deviation $S_{i,j}$) or among-kernel correlations (e.g., the relative relationship among kernel statistics) are task-specific. A similar conjecture on low-level Conv filters was discussed in [95]; we reveal such patterns even hold for the whole GAN model (for both low-level and high-level kernels of the generator/discriminator), which is unanticipated because common experience associates high-level kernels with task-specific information. This insight might reveal a new avenue for transfer learning.

4.2 GAN memory

Based on the above observations, we propose a GAN memory that has the power to realistically remember a stream of generative processes with no forgetting. The key observation is that when modulating a source GAN model for a target domain, the source model is frozen (thus no forgetting of the source), with limited target-specific style parameters (i.e., no influence among tasks) introduced to form realistic target generation. Accordingly, we can use a well-behaved source GAN model as a base, followed by sequentially modulating its "style" to deliver realistic generation power on a stream of target datasets, as illustrated in Figure 1(c). We use the same settings as in Section 4.1. As style parameters are often limited (and can be further compressed as in Section 4.3), one could expect from our GAN memory a substantial compression of streaming datasets, while not forgetting realistic generative replay (see Figure 5). To help better understand how/why our GAN memory works, we next reveal five of its properties.

Each group of style parameters modulates a different generation perspective. Style parameters consist of three groups, i.e., scales $\{\gamma, \Gamma\}$, shifts $\{\beta, B\}$, and biases $\{b_{\text{FC}}, b_{\text{Conv}}\}$. Taking as examples the style parameters trained on the Flowers, Cathedrals, and Brain-tumor images, Figure 2(a) demonstrates the generation perspective modulated by each group: (i) when none/all groups are applied, the GAN memory generates realistic source/target images (see the first/last column, respectively); (ii) when only modulating via the scales (denoted as $\gamma/\Gamma$ Only, the second column), the generated samples show textural/structural information from the target domains (like the textures on petals or the contours of buildings/skulls); (iii) as shown in the third column, the shifts, if solely applied, principally control the low-frequency color information from the target domains; (iv) finally, the biases (see the fourth column) control the illumination and sparse objects (not obvious here).
To clearly reveal the role played by the biases, we keep both scales and shifts fixed, and compare the generated samples with/without biases on the Brain-tumor dataset; Figure 2(b) shows the biases are important in modeling sparse objects like the tumor or tissue details, which may be valuable in pathological analysis. Figure 3(a) below also shows that the biases help with training efficiency.

Note that with the techniques from Section 4.3, one can use far fewer style parameters (a small fraction of those of the fine-tuning baseline) to yield comparable performance. Our GAN memory is amenable to streaming training, parallel training, and even their flexible combinations, thanks to the frozen base model and task-specific style parameters.
Figure 3: (a) Ablation study on Flowers. (b) Smooth interpolations between flower and cat generative processes. (c) Realistic replay from our conditional-GAN memory. See Appendix D for details and more demonstrations.
Style parameters within different blocks show different strength/focus over the generation. Figure 2(c) shows the generated samples on the Flowers dataset when gradually and cumulatively adding modulations to each block (from FC to B6). To begin with, the FC modulation changes the overall contrast and illumination of the generation; then the style parameters in B0-B3 make the most effort to modulate the face manifold into a flower, followed by modulations in B4-B6 refining the generation details. Such patterns are somewhat consistent with existing practice, in the sense that high-level/low-level kernel statistics are more task-specific/generally-applicable. It's therefore interesting to consider combining the two orthogonal dimensions, i.e., the existing low-layer/high-layer split and the revealed kernel-shape/kernel-statistics split, for potentially better transfer learning.
Normalization contributes significantly to better training efficiency and performance. To investigate how the weight normalization and the biases in (4)/(5) contribute, ablation studies are conducted, with the results shown in Figure 3(a). With the weight normalization removing the source "style," our method shows both improved training efficiency and boosted performance; the biases contribute to better efficiency but have minimal influence on the final performance.
GAN memory enables smooth interpolations among generative processes, by dynamically combining two (or more) sets of style parameters. Figure 3(b) demonstrates smooth interpolations between the flower and cat generative processes. Such a property can be used to deliver versatile data augmentation among different domains, which may, for example, benefit downstream robust classification [59, 7].
GAN memory readily generalizes to label-conditioned generation problems, where each task dataset $\mathcal{D}_t$ contains both observations $\{x\}$ and paired labels $\{y\}$ with $y \in \{1, \cdots, C_t\}$. Mimicking the conditional GAN [50], we specify one FC bias per class to model the label information; accordingly, the style parameters for the $t$th task are $\{\gamma, \beta, \{b_{\text{FC}}^i\}_{i=1}^{C_t}, \Gamma, B, b_{\text{Conv}}\}$. Figure 3(c) shows the realistic replay from our conditional-GAN memory after sequential training on five tasks (illustrated with bird, dog, and butterfly; see Section 5.2 for details).

4.3 GAN memory with compression

Delivering realistic sequentially-learned generations with no forgetting, the GAN memory presented above is expected to be sufficient for many practical applications with a moderate number of tasks. However, for challenging situations with many tasks, saving a set of style parameters for each task might become prohibitive. For that problem, we reveal below that (i) the style parameters (i.e., the expensive matrices $\Gamma$ and $B$) can be compressed to lower the memory cost of each task; and (ii) one can exploit parameter sharing among tasks to further enhance memory savings.

We first investigate the singular values of the $\Gamma$s and $B$s learned at different blocks/layers, and summarize them in Figure 4(a). It's clear that the $\Gamma$ and $B$ parameters are in general low-rank; moreover, the closer a $\Gamma$ or $B$ is to the noise (often with a larger matrix size and thus more expensive), the stronger its low-rank property (yielding better compressibility). Accordingly, we truncate/zero-out small singular values at each block/layer to test the corresponding performance decline. Figure 4(b) summarizes the results, where keeping most of the matrix energy [55] ($\approx$ the top singular values) of $\Gamma$ and $B$ in B0-B4 yields almost the same performance, verifying the compressibility of $\Gamma$ and $B$.

A compromise may save the limited task-specific style parameters to hard disk and only load them when used. The cheap vector parameters $\{\gamma, \beta, b_{\text{FC}}, b_{\text{Conv}}\}$ can be similarly processed in a dictionary-learning manner; as they are often quite inexpensive to retain, we consider them task-specific for simplicity.

Figure 4: (a) Normalized singular values of $\Gamma$ at all blocks/layers on Flowers (see Appendix Figure 14 for $B$). Maximum normalization is applied to both axes. B$_m$L$_n$ stands for the $n$th Conv layer of the $m$th block. (b) Influence of truncation (via preserving X% matrix energy [55]) on generation. (c) GAN memory with further compression and knowledge sharing. See Appendix G for details.

Figure 5: The task/data stream (left) and generated samples after training on each task/dataset (right).

Based on the compressibility of $\Gamma$ and $B$, we next reveal that parameter sharing among tasks can be exploited for further memory savings. Specifically, we propose to leverage matrix factorization and low-rank regularization to form a lifelong knowledge base, mimicking [66]. Taking the $t$th task as an example, instead of optimizing over a task-specific $\Gamma_t$ (similarly for $B_t$), we alternatively optimize over its parameterization $\Gamma_t \doteq L^{\text{KB}}_{t-1} \Lambda_t (R^{\text{KB}}_{t-1})^T + E_t$, where $L^{\text{KB}}_{t-1}$ and $R^{\text{KB}}_{t-1}$ are respectively the existing left/right knowledge bases, $\Lambda_t = \mathrm{Diag}(\lambda_t)$, and $\lambda_t$ and $E_t$ are task-specific trainable parameters. The nuclear norm $\|E_t\|_*$ is added to the loss to encourage a low-rank property.
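A minimal sketch of this parameterization, assuming (as in Appendix G) that $E_t$ is kept in factored form $U\,\mathrm{Diag}(s)\,V^T$ so the nuclear-norm penalty reduces to an $\ell_1$ norm on $s$; the class and attribute names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class FactorizedStyle(nn.Module):
    """Gamma_t = L_kb @ diag(lam) @ R_kb^T + E_t, with a penalty encouraging
    E_t to be low-rank; L_kb / R_kb are frozen bases from tasks 1..t-1."""
    def __init__(self, L_kb, R_kb, c_out, c_in, rank):
        super().__init__()
        self.register_buffer("L_kb", L_kb)   # (C_out, r_kb), frozen
        self.register_buffer("R_kb", R_kb)   # (C_in,  r_kb), frozen
        self.lam = nn.Parameter(torch.zeros(L_kb.size(1)))  # task-specific coefficients
        # E_t in factored form U diag(s) V^T; ||E_t||_* ~ ||s||_1 (exact when U, V are orthonormal).
        self.U = nn.Parameter(torch.randn(c_out, rank) * 0.01)
        self.s = nn.Parameter(torch.zeros(rank))
        self.V = nn.Parameter(torch.randn(c_in, rank) * 0.01)

    def forward(self):
        E = self.U @ torch.diag(self.s) @ self.V.t()
        return self.L_kb @ torch.diag(self.lam) @ self.R_kb.t() + E

    def nuclear_penalty(self):
        return self.s.abs().sum()  # l1 surrogate for the nuclear norm of E_t
```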
After training on the $t$th task, we apply a singular value decomposition to $E_t$, keep the top singular values that preserve X% of the matrix energy, and use the corresponding left/right singular vectors to update the left/right knowledge bases to $L^{\text{KB}}_t$ and $R^{\text{KB}}_t$. The overall procedure is demonstrated in Figure 4(c).

5 Experiments

Experiments on high-dimensional image datasets from diverse fields are conducted to demonstrate the effectiveness of the proposed techniques. Specifically, we first test our GAN memory on realistically remembering a stream of generative processes; we then show that our (conditional) GAN memory can be used to form realistic pseudo-rehearsal (synthesis of data from prior tasks) to alleviate catastrophic forgetting in challenging lifelong classification tasks; and finally, for long task sequences, we reveal that the techniques from Section 4.3 enable significant memory savings with comparable performance. Detailed experimental settings are given in Appendix A.

5.1 Lifelong generation
To demonstrate the superiority of our GAN memory over existing replay-based methods, we design a challenging lifelong generation problem consisting of 6 perceptually-distant tasks/datasets (see Figure 5): Flowers [56], Cathedrals [96], Cats [94], Brain-tumor images [15], Chest X-rays [35], and Anime faces. The GAN model [49] trained on CelebA [43] is selected as the base; other such GAN models may readily be considered. We compare our GAN memory with the memory replay GAN (MeRGAN) [84], which keeps another copy of the generator in memory to replay historical generations, to mitigate catastrophic forgetting. Qualitative comparisons between both methods along the sequential training are shown in Figure 5. It's clear our GAN memory delivers realistic generations with no forgetting on historical tasks, whereas MeRGAN shows increasingly blurry historical generations with reinforced generation artifacts [93] as training proceeds. For quantitative comparisons, the FID scores [29] along the training are summarized in Figure 6 (left), highlighting the advantages of our GAN memory, i.e., realistic generations with no forgetting. Also revealed is that our method even performs better, with better efficiency, on the current task, likely thanks to the transferred common knowledge within the frozen base parameters.

The Anime faces are from https://github.com/jayleicn/animeGAN.

Figure 6: (Left) FID curves on the lifelong generation problem of Section 5.1. (Right) Classification accuracy on the lifelong classification problem of Section 5.2.

5.2 Lifelong classification

Witnessing the success of our (conditional) GAN memory in continually remembering realistic generations, we next utilize it as a pseudo-rehearsal to assist downstream lifelong classification. Specifically, we design 6 streaming tasks by selecting fish, bird, snake, dog, butterfly, and insect images from ImageNet [65]; each task is then formalized as a 6-way classification problem (e.g., for the task bird, the goal is to classify 6 categories of birds). We employ the challenging class-incremental learning setup [76], i.e., the classifier (after task $t$) is expected to accurately classify all categories observed so far. For comparison, we employ the regularization-based EWC [36] and the generative-replay-based MeRGAN [84]. Detailed settings are given in Appendix F.

Testing classification accuracy on all 36 categories for the compared methods along training is summarized in Figure 6 (right). It's clear that EWC barely works in the class-incremental learning scenario [75, 40, 76]. MeRGAN doesn't work well when the task sequence is long, because of its increasingly blurry rehearsal as shown in Figure 5. By comparison, our conditional-GAN memory with no forgetting succeeds in stably maintaining an increasing accuracy as new tasks come, highlighting its practical value in alleviating catastrophic forgetting for general lifelong learning problems. See Appendix F for the evolution of the performance on each task along the training process.

Table 1: Comparison of GAN memory with (Compr) or without (Naive) compression techniques.
Task     D1    D2    D3    D4    D5    D6
Compr    –     –     –     –     –     –
Naive    –     –     –     –     –     –
5.3 Compression for long task sequences

To verify the effectiveness of the compression techniques presented in Section 4.3, which are believed valuable for lifelong learning with many tasks, we design another lifelong generation problem based on ImageNet for better demonstration. Specifically, we select 6 categories of butterfly images to form a stream of 6 generation tasks/datasets (one category per task), among which similarity/knowledge-sharability is expected. The procedure shown in Figure 4(c) is employed for our method; see Appendix G for details. We compare our GAN memory with compression techniques (denoted as Compr) to its naive implementation with task-specific style parameters (Naive), with the results summarized in Table 1. It's clear that (i) even for the first task (with an empty knowledge base), the low-rank property of $\Gamma$/$B$ enables significant parameter compression; (ii) based on the existing knowledge base, far fewer new parameters are necessary to form a new generation model for later tasks, confirming the reusability/sharability of existing knowledge; and (iii) despite its significant parameter compression, Compr delivers performance comparable to Naive, confirming the effectiveness of the presented compression techniques.

The datasets from Section 5.1 are too perceptually-distant to illustrate parameter sharability among tasks.

6 Conclusions

We reveal that one can modulate the "style" of a GAN model to accurately synthesize the statistics of data from perceptually-distant targets. Based on that recognition, we propose our GAN memory.
Broader impact
Capable of remembering a stream of data-generating processes with no forgetting, our GAN memory has the following potential positive impacts on society: (i) it may serve as a powerful generative replay for challenging lifelong applications such as self-driving; (ii) as no original data are saved, concerns about data privacy may be well addressed; (iii) GAN memory enables flexible control over the replayed contents, which is of great value to practical applications where training data are unbalanced, or where one needs to flexibly select which model capability to maintain/forget during training. Since our GAN memory is built on top of GANs, it may inherit their ethical and societal impact. Despite their versatility, GANs may be improperly used to synthesize fake images/news/videos, resulting in negative consequences. Furthermore, we should be cautious of failures of adversarial training due to mode collapse, which may compromise the generative capability on the current task. Note that a training failure, if it happens, will not hurt the performance on other tasks, showing a certain robustness of our GAN memory.

Acknowledgments and Disclosure of Funding
The research was supported in part by DARPA, DOE, NIH, NSF and ONR. The Titan Xp GPU used was donated by the NVIDIA Corporation.
References

[1] K. Ak, J. Lim, J. Tham, and A. Kassim. Attribute manipulation generative adversarial networks for fashion images. In ICCV, pages 10541–10550, 2019.
[2] Y. Arase and J. Tsujii. Transfer fine-tuning: A BERT case study. arXiv preprint arXiv:1909.00931, 2019.
[3] X. Bao and Q. Qiao. Transfer learning from pre-trained BERT for pronoun resolution. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 82–88, 2019.
[4] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, pages 6541–6549, 2017.
[5] D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. Tenenbaum, W. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[6] Y. Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 17–36, 2012.
[7] D. Bertsimas, J. Dunn, C. Pawlowski, and Y. Zhuo. Robust classification. INFORMS Journal on Optimization, 1(1):2–34, 2019.
[8] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. Dickie, M. Hernández, J. Wardlaw, and D. Rueckert. GAN augmentation: Augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863, 2018.
[9] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[10] H. Caselles-Dupré, M. Garcia-Ortiz, and D. Filliat. Continual state representation learning for reinforcement learning using generative replay. arXiv preprint arXiv:1810.03880, 2018.
[11] F. Castro, M. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari. End-to-end incremental learning. In ECCV, pages 233–248, 2018.
[12] C. Chan, S. Ginosar, T. Zhou, and A. Efros. Everybody dance now. In CVPR, pages 5933–5942, 2019.
[13] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. StyleBank: An explicit representation for neural image style transfer. In CVPR, pages 1897–1906, 2017.
[14] Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018.
[15] J. Cheng, W. Yang, M. Huang, W. Huang, J. Jiang, Y. Zhou, R. Yang, J. Zhao, Y. Feng, and Q. Feng. Retrieval of brain tumors by adaptive spatial pooling and Fisher vector representation. PLoS ONE, 11(6), 2016.
[16] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
[17] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. de Vries, A. Courville, and Y. Bengio. Feature-wise transformations. Distill, 3(7):e11, 2018.
[18] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. arXiv preprint arXiv:1610.07629, 2016.
[19] W. Fedus, I. Goodfellow, and A. Dai. MaskGAN: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736, 2018.
[20] Y. Frégier and J. Gouray. Mind2Mind: Transfer learning for GANs. arXiv preprint arXiv:1906.11613, 2019.
[21] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
[22] S. Gais, G. Albouy, M. Boly, T. Dang-Vu, A. Darsaud, M. Desseilles, G. Rauchs, M. Schabus, V. Sterpenich, G. Vandewalle, et al. Sleep transforms the cerebral trace of declarative memories. PNAS, 104(47):18778–18783, 2007.
[23] G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830, 2017.
[24] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[26] C. Han, K. Murao, T. Noguchi, Y. Kawata, F. Uchiyama, L. Rundo, H. Nakayama, and S. Satoh. Learning more with less: Conditional PGGAN-based data augmentation for brain metastases detection using highly-rough annotation on MR images. arXiv preprint arXiv:1902.09856, 2019.
[27] C. Han, L. Rundo, R. Araki, Y. Furukawa, G. Mauri, H. Nakayama, and H. Hayashi. Infinite brain MR images: PGGAN-based data augmentation for tumor detection. In Neural Approaches to Dynamics of Signal Exchanges, pages 291–303. Springer, 2020.
[28] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017.
[29] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626–6637, 2017.
[30] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1501–1510, 2017.
[31] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[32] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[33] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, June 2019.
[34] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958, 2019.
[35] D. S. Kermany, M. Goldbaum, W. Cai, C. C. S. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell, 172(5):1122–1131, 2018.
[36] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, page 201611835, 2017.
[37] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711, 2019.
[38] L. Kurzman, D. Vazquez, and I. Laradji. Class-based styling: Real-time localized style transfer with semantic segmentation. In ICCV, 2019.
[39] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
[40] T. Lesort, H. Caselles-Dupré, M. Garcia-Ortiz, A. Stoian, and D. Filliat. Generative models from the perspective of continual learning. In IJCNN, pages 1–8. IEEE, 2019.
[41] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang. Universal style transfer via feature transforms. In NIPS, pages 386–396, 2017.
[42] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
[43] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015.
[44] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
[45] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? A large-scale study. In NeurIPS, pages 700–709, 2018.
[46] Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In NIPS, pages 165–177, 2017.
[47] N. Masse, G. Grant, and D. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. PNAS, 115(44):E10467–E10475, 2018.
[48] M. McCloskey and N. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
[49] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, pages 3478–3487, 2018.
[50] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[51] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[52] M. Moradshahi, H. Palangi, M. Lam, P. Smolensky, and J. Gao. HUBERT untangles BERT to improve transfer across NLP tasks. arXiv preprint arXiv:1910.12647, 2019.
[53] M. Mozafari, R. Farahbakhsh, and N. Crespi. A BERT-based transfer learning approach for hate speech detection in online social media. arXiv preprint arXiv:1910.12574, 2019.
[54] C. Nguyen, Y. Li, T. Bui, and R. Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
[55] V. Nikiforov. The energy of graphs and matrices. JMAA, 326(2):1472–1475, 2007.
[56] M. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[57] A. Noguchi and T. Harada. Image generation from small datasets via batch statistics adaptation. arXiv preprint arXiv:1904.01774, 2019.
[58] G. Parisi, R. Kemker, J. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[59] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim. Data synthesis based on generative adversarial networks. Proceedings of the VLDB Endowment, 11(10):1071–1083, 2018.
[60] T. Park, M. Liu, T. Wang, and J. Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, pages 2337–2346, 2019.
[61] Y. Peng, S. Yan, and Z. Lu. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019.
[62] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
[63] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, pages 2001–2010, 2017.
[64] A. Rios and L. Itti. Closed-loop memory GAN for continual learning. arXiv preprint arXiv:1811.01146, 2018.
[65] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[66] J. Schwarz, J. Luketina, W. Czarnecki, A. Grabska-Barwinska, Y. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
[67] A. Seff, A. Beatson, D. Suo, and H. Liu. Continual learning in generative adversarial nets. NeurIPS, 2017.
[68] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[69] F. Shen, S. Yan, and G. Zeng. Neural style transfer via meta networks. In CVPR, pages 8061–8069, 2018.
[70] H. Shin, J. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In NIPS, pages 2990–2999, 2017.
[71] K. Shmelkov, C. Schmid, and K. Alahari. Incremental learning of object detectors without catastrophic forgetting. In ICCV, pages 3400–3409, 2017.
[72] Q. Sun, B. Schiele, and M. Fritz. A domain based approach to social relation recognition. In CVPR, pages 3481–3490, 2017.
[73] P. Taupin and F. Gage. Adult neurogenesis and neural stem cells of the central nervous system in mammals. Journal of Neuroscience Research, 69(6):745–749, 2002.
[74] S. Thrun and T. Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25–46, 1995.
[75] G. van de Ven and A. Tolias. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.
[76] G. van de Ven and A. Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
[77] H. Wang, X. Liang, H. Zhang, D. Yeung, and E. Xing. ZM-Net: Real-time zero-shot image manipulation network. arXiv preprint arXiv:1703.07255, 2017.
[78] J. Wang and L. Perez. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit, 2017.
[79] K. Wang and X. Wan. SentiGAN: Generating sentimental texts via mixture adversarial networks. In IJCAI, pages 4446–4452, 2018.
[80] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, C. Cao, D. Jiang, and M. Zhou. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808, 2020.
[81] R. Wang, D. Zhou, and Y. He. Open event extraction from online text using a generative adversarial network. arXiv preprint arXiv:1908.09246, 2019.
[82] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
[83] Y. Wang, C. Wu, L. Herranz, J. van de Weijer, A. Gonzalez-Garcia, and B. Raducanu. Transferring GANs: Generating images from limited data. In ECCV, pages 218–234, 2018.
[84] C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. Memory replay GANs: Learning to generate new categories without forgetting. In NeurIPS, pages 5962–5972, 2018.
[85] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu. Large scale incremental learning. In CVPR, pages 374–382, 2019.
[86] Y. Xiang, Y. Fu, P. Ji, and H. Huang. Incremental learning using conditional adversarial networks. In ICCV, pages 6619–6628, 2019.
[87] R. Yamamoto, E. Song, and J. Kim. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. arXiv preprint arXiv:1904.04472, 2019.
[88] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, pages 3320–3328, 2014.
[89] A. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, pages 3712–3722, 2018.
[90] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[91] G. Zeng, Y. Chen, B. Cui, and S. Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
[92] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In ICML, pages 3987–3995. JMLR.org, 2017.
[93] M. Zhai, L. Chen, F. Tung, J. He, M. Nawhal, and G. Mori. Lifelong GAN: Continual learning for conditional image generation. In ICCV, pages 2759–2768, 2019.
[94] W. Zhang, J. Sun, and X. Tang. Cat head detection - how to effectively exploit shape and texture features. In ECCV, pages 802–816. Springer, 2008.
[95] M. Zhao, Y. Cong, and L. Carin. On leveraging pretrained GANs for limited-data generation, 2020.
[96] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, pages 487–495, 2014.

Appendix of GAN Memory with No Forgetting

Yulai Cong∗, Miaoyun Zhao∗, Jianqiao Li, Sijia Wang, Lawrence Carin
Department of ECE, Duke University

A Experimental settings
For all experiments involving GAN memory and MeRGAN, we inherit the architecture and experimental settings from GP-GAN [49]. Note that for both GAN memory and MeRGAN, we use the GP-GAN model pretrained on CelebA. In the implementation of GAN memory, we apply the proposed style modulation on all layers of the generator except its last Conv layer, and on all layers of the discriminator except its last FC layer. Adam is used as the optimizer, with the learning rate and coefficients $(\beta_1, \beta_2)$ inherited from GP-GAN. For the discriminator, a gradient penalty on real samples (the $R_1$ regularizer) is applied with $\gamma = 10$. For MeRGAN, the replay-alignment parameter $\lambda_{\text{RA}}$ is set as in its original publication [84].

All images are resized to a common resolution for consistency, and the dimension of the input noise is the same for all tasks. The FID scores are calculated using $N$ real and $N$ generated images: for datasets with fewer than 10,000 images, we set $N$ to the number of data samples; for datasets with more than 10,000 samples, we set $N$ to 10,000.

Code will be available upon publication.
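For reference, the $R_1$ gradient penalty mentioned above can be sketched as follows in PyTorch; this is a generic implementation of the regularizer from [49], with an illustrative function name rather than the authors' released code.

```python
import torch

def r1_penalty(d_out, x_real, gamma=10.0):
    """R1 regularizer: (gamma / 2) * E[ ||grad_x D(x)||^2 ] on real samples.
    x_real must be created with requires_grad=True before calling D."""
    grad, = torch.autograd.grad(outputs=d_out.sum(), inputs=x_real, create_graph=True)
    return 0.5 * gamma * grad.flatten(1).pow(2).sum(1).mean()

# Hypothetical usage inside the discriminator step:
# x_real.requires_grad_(True)
# d_out = D(x_real)
# d_loss = adversarial_loss + r1_penalty(d_out, x_real, gamma=10.0)
```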
B On Figure 1(b)

To demonstrate that the limited trainable parameters introduced by our GAN memory are enough to deliver decent performance on a new task, we evaluate its FID and compare with the strong fine-tuning baseline.

On the fine-tuning method: it inherits the architecture from GP-GAN, which is the same as the green/frozen part of our GAN memory (see Figure 1(a)). Given a target task/dataset (e.g., Flowers, Cathedrals, or Cats), all the parameters are trainable and fine-tuned to fit the target data. We consider fine-tuning a strong baseline because it devotes the whole model capacity to the target data.
C On Figure 2
For all illustrations, we train the GAN memory on the target data and record the well-trained style parameters as $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=1}$. According to equations (4) and (5), the style parameters reproducing the source data (CelebA) are $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=0} = \{\sigma, \mu, \mathbf{0}, S, M, \mathbf{0}\}$, with which (4)/(5) reduce to the source layers ($\hat{W} = W$). We can then form generations by selectively replacing elements of $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=0}$ with those of $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=1}$. Note that for Figures 2(a)(b) the operations are explained within one specific FC/Conv layer; one needs to apply them to all layers in the real implementation. The detailed techniques are as follows.

On Figure 2(a). Take the target data Flowers as an example:
• "None" means no modulation from the target data is applied; we only use the modulation from the source data $\{\sigma, \mu, \mathbf{0}, S, M, \mathbf{0}\}$ and get face images;
• "$\gamma/\Gamma$ Only" means we replace the $\gamma/\Gamma$ parameters of the source data with those of the target data, namely using the modulation $\{\{\gamma\}_{t=1}, \mu, \mathbf{0}, \{\Gamma\}_{t=1}, M, \mathbf{0}\}$ for generation;
• "$\beta/B$ Only" means we replace the $\beta/B$ parameters of the source data with those of the target data, namely using the modulation $\{\sigma, \{\beta\}_{t=1}, \mathbf{0}, S, \{B\}_{t=1}, \mathbf{0}\}$ for generation;
• "$b_{\text{FC}}/b_{\text{Conv}}$ Only" means we replace the $b_{\text{FC}}/b_{\text{Conv}}$ parameters of the source data with those of the target data, namely using the modulation $\{\sigma, \mu, \{b_{\text{FC}}\}_{t=1}, S, M, \{b_{\text{Conv}}\}_{t=1}\}$ for generation;
• "All" means all style parameters of the target data $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=1}$ are applied and the model generates flowers.

On Figure 2(b).
• "All" is obtained in a similar way to that of Figure 2(a);
• "w/o $b_{\text{FC}}/b_{\text{Conv}}$" means using the style parameters of the target data without $b_{\text{FC}}/b_{\text{Conv}}$, namely the modulation $\{\{\gamma\}_{t=1}, \{\beta\}_{t=1}, \mathbf{0}, \{\Gamma\}_{t=1}, \{B\}_{t=1}, \mathbf{0}\}$.

On Figure 2(c).
• "None" and "All" are obtained in a similar way to that of Figure 2(a);
• "FC" is obtained by copying the style parameters of the source data and replacing the style parameters within the FC layer with those of the target data;
• "B0" is obtained by copying the style parameters under the "FC" setting and replacing the style parameters within block B0 with those from the target data;
• "B1" is obtained by copying the style parameters under the "B0" setting and replacing the style parameters within block B1 with those from the target data;
• and so forth for the later blocks.

D On Figure 3
D.1 On Figure 3(a)
The ablation study tests the effect of the normalization operation and the bias term in our GAN memory.
• "Our" is our GAN memory;
• "NoNorm" is a modified version of our GAN memory that removes the normalization of $W$ in equations (4)/(5), resulting in $\hat{W} = \gamma \odot W + \beta$ and $\hat{W} = \Gamma \odot W + B$;
• "NoBias" is a modified version of our GAN memory that removes the bias terms $b_{\text{FC}}/b_{\text{Conv}}$ in equations (4)/(5), resulting in $\hat{b} = b$.

D.2 On Figure 3(b)
Here we describe the detailed techniques for interpolation among different generative processes with our GAN memory, with more examples shown in Figures 7, 8, 9, and 10. Taking the smooth interpolation between the flower and cat generative processes as an example, we follow the procedure below.
• Train the GAN memory on Flowers and Cats independently, and record the well-trained style parameters for Flowers as $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=1}$ and for Cats as $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=2}$;
• Sample a batch of random noise vectors $z$ and fix them;
• Get the interpolated generation by dynamically combining the two sets of style parameters via $(1 - \lambda_{\text{Interp}}) \{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=1} + \lambda_{\text{Interp}} \{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=2}$, with $\lambda_{\text{Interp}}$ varying from 0 to 1.

Note that the task CelebA has been well pretrained, so we can obtain its style parameters directly as $\{\gamma, \beta, b_{\text{FC}}, \Gamma, B, b_{\text{Conv}}\}_{t=0} = \{\sigma, \mu, \mathbf{0}, S, M, \mathbf{0}\}$.
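A minimal sketch of this interpolation, assuming each style set is stored as a dict of tensors; the helper names (`interpolate_styles`, `generate_with_styles`) are hypothetical, not the authors' code.

```python
import torch

@torch.no_grad()
def interpolate_styles(styles_a, styles_b, lam):
    """Return (1 - lam) * styles_a + lam * styles_b, element-wise per parameter."""
    return {k: (1.0 - lam) * styles_a[k] + lam * styles_b[k] for k in styles_a}

# Fixed noise z, with lambda varying from 0 to 1 along the interpolation path:
# z = torch.randn(8, z_dim)
# for lam in torch.linspace(0.0, 1.0, steps=7):
#     styles = interpolate_styles(styles_flowers, styles_cats, lam.item())
#     x = generate_with_styles(G, z, styles)   # hypothetical helper
```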
Figure 7: Smooth interpolations between the face and flower generative processes via GAN memory.
Figure 8: Smooth interpolations between the flower and cat generative processes via GAN memory.
Figure 9: Smooth interpolations between the cat and cathedral generative processes via GAN memory.
Figure 10: Smooth interpolations between the cathedral and Brain-tumor-image generative processes via GAN memory.

E More results for Section 5.1

Given the same model pretrained on CelebA, we train GAN memory and MeRGAN on a sequence of tasks: Flowers (8,189 images), Cathedrals (7,350 images), Cats (9,993 images), Brain-tumor images (3,064 images), Chest X-rays (5,216 images), and Anime images (115,085 images). When task $t$ presents, only the data from $\mathcal{D}_t$ are available, and GAN memory/MeRGAN is expected to generate samples for all learned tasks $\mathcal{D}_1, \cdots, \mathcal{D}_t$. Due to limited space, we only show part of the results in Figure 5. To make it clearer, Figure 11 shows the complete results, with the generations for all tasks along the training process listed.

Figure 11: Comparing the generated samples from GAN memory (top) and MeRGAN (bottom) on lifelong learning of a sequence of generation tasks: Flowers, Cathedrals, Cats, Brain-tumor images, Chest X-rays, and Anime images. MeRGAN shows increasing blurriness along the task sequence, while GAN memory learns to perform the current task while remembering the old tasks realistically.
F Experimental settings for lifelong classification
We are given a sequence of 6 tasks: fish, bird, snake, dog, butterfly, and insect, selected from ImageNet. Each task is a classification problem over six categories/sub-classes. We employ the challenging class-incremental learning setup: when task $t$ presents, the classifier (having been trained on the previous $t-1$ tasks) is expected to accurately classify all observed categories (namely, the categories of the previous $t-1$ tasks plus the current 6 categories). There are $6 \times 6 = 36$ categories/sub-classes in total; for each category/sub-class, we randomly select a fixed set of training images and 100 test images.

As the classification model, we select the ResNet18 model pretrained on ImageNet, with Adam used as the optimizer for classification.

For EWC, we adopt the code from [76] and set $\lambda_{\text{EWC}}$ to a moderate value. We also tried the much larger setting used in [84]; however, the results show that it is too strong in our case, e.g., as learning proceeds to later tasks, the parameters are strongly pinned to the ones learned for previous tasks, which makes it difficult to learn the new/current task, and the accuracy on the current task remains almost zero. For GAN memory and MeRGAN, we adopt the two-step framework from [70]: every time a new task $t$ presents, we (i) train GAN memory/MeRGAN (which has already remembered all the previous tasks $1 \sim t-1$) to remember the generation for both the previous tasks and the current task; and (ii) at the same time, train the classifier/solver on a combined dataset: real samples of the current task with their labels provided, and generated samples of previous tasks replayed by GAN memory/MeRGAN. Since we apply label-conditioned generation, where the category is an input, the replay process can be readily controlled with reliable sampling of $(x, y)$ pairs (e.g., sample the label $y$ and noise $z$ first, and then sample data $x$ via the generator $x = G(z, y)$), which is considered effective in avoiding potential classification errors and biased sampling towards recent categories [84]. Note that for both GAN memory and MeRGAN, we used the GP-GAN model pretrained on CelebA.

The batch size for EWC is set as $n = 36$. For the replay-based methods, the batch size for classification is set as follows. When task $t = 1, \cdots, T$ presents, the batch size is $n \times t$: (i) $n$ samples from the current task; (ii) $n \times (t-1)$ generated samples replayed by GAN memory/MeRGAN for the previous $t-1$ tasks (each task with $n$ replayed samples).

The classification accuracy shown in Figure 6 (right) is obtained by testing the classifier on the whole test dataset over all tasks, with $36 \times 100 = 3600$ images. We also show in Figure 12 the evolution of the classification accuracy for each task during the whole training process. Taking Figure 12(a) as an example, the shown classification accuracy for task 1 on $\mathcal{D}_1$ is obtained by testing the classifier on the test images belonging to $\mathcal{D}_1$, with $6 \times 100 = 600$ images.

We observe from Figure 12 that (i) EWC forgets previous tasks much more quickly than the other methods, e.g., as learning proceeds to later tasks, the knowledge/capability learned from the earlier tasks is totally forgotten and the corresponding performance seriously decreases; (ii) MeRGAN shows clear performance decline on historical classifications due to its blurry rehearsal on complex datasets, which is especially obvious when the task sequence becomes long (see Figure 12(a)(b)(c)); and (iii) by comparison, our (conditional) GAN memory succeeds in stably maintaining historical accuracy even when the sequence becomes long, thanks to its realistic pseudo-rehearsal, highlighting its practical value in serving as a generative memory for general lifelong learning problems.

Figure 12: The evolution of the classification accuracy for each task during the whole training process: (a) Task $\mathcal{D}_1$; (b) Task $\mathcal{D}_2$; (c) Task $\mathcal{D}_3$; (d) Task $\mathcal{D}_4$; (e) Task $\mathcal{D}_5$; (f) Task $\mathcal{D}_6$.

Figure 13 shows the realistic replay from our conditional-GAN memory after sequential training on five tasks. The last task has its real data available, thus there is no need to learn to replay it.
Figure 13: Realistic replay from our conditional-GAN memory. (a) fish; (b) bird; (c) snake; (d) dog; (e) butterfly.
G GAN memory with further compression
G.1 The compression method
Our proposed GAN memory is practical when the sequence of tasks is of moderate length. When the number of tasks is extremely large, however, one can further compress the task-specific style parameters Γ and B by taking advantage of the (low-rank) redundancy within them, which has the potential to remember an extremely long sequence of tasks with a finite number of parameters.

Similar to Figure 4(a), the singular values of the scale Γ and bias B learned at different blocks/layers are summarized in Figure 14. Evidently, the bias B is also generally low-rank, and the closer a bias B is to the noise input, the stronger the low-rank property it shows. Maximum normalization is applied in both Figure 4(a) and Figure 14 to provide a clear illustration. Taking the curve "B0L0" as an example, given the vector of singular values $y$ and the index of each element $x$, we apply maximum normalization as $\hat{x} = x / \max(x)$ and $\hat{y} = y / \max(y)$, and then plot $\{\hat{x}, \hat{y}\}$.

Figure 14: The singular values of the scale Γ (left) and bias B (right) within each Conv layer on Flowers.

Based on the discovered low-rank property, we test the compressibility of the learned parameters by truncating/zeroing-out small singular values. The results in Figure 4(b) are obtained by keeping a specified percentage of the matrix energy [55] at each block alternately and evaluating the resulting performance (FID). Taking B0 as an example, we keep that percentage of matrix energy for all scale Γ and bias B matrices within block B0, leave the other blocks/layers unchanged, and evaluate the FID.

To give a straightforward understanding of the proposed compression method for our GAN memory (see Figure 4(c)), we summarize the specific steps in Algorithm 1, where a small scale $r$ works well in the experiments. Given $E_t = U_t S_t V_t^T$ with $S_t = \mathrm{Diag}(s_t)$, the low-rank regularization for the first task is simply implemented via $\min \|E_1\|_* = \min \|s_1\|_1$; the low-rank regularization for later tasks is implemented via $\min \|E_t\|_* = \min \|\eta \odot s_t\|_1$, where $\eta$ is a non-negative vector of the same size as $s_t$, designed as a scaled sigmoid of the normalized index, $\eta_j = c\,\sigma(a\,j/J)$ for constants $c, a > 0$, with $\sigma(\cdot)$ the sigmoid function, $J$ the length of $s_t$, and $j$ the index of $[s_t]_j$. This design forms a prior that imposes a weak sparsity constraint on part of the elements of $s_t$ and a strong one on the rest, and is found effective in maintaining good performance.
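Before turning to Algorithm 1, the weighted penalty can be sketched as follows. This is our reading of the text; the constants `scale` and `sharpness` are assumptions standing in for $c$ and $a$, whose exact values we do not have:

```python
import torch

# Sketch of the low-rank regularization on E_t. The nuclear norm ||E_t||_*
# becomes an L1 penalty on the singular-value vector s_t; for later tasks each
# [s_t]_j is weighted by eta_j = scale * sigmoid(sharpness * j / J), so leading
# singular values feel a weak sparsity push and trailing ones a strong one.
def low_rank_penalty(s, first_task=False, scale=0.5, sharpness=10.0):
    if first_task:
        return s.abs().sum()                          # min ||s_1||_1
    J = s.numel()
    j = torch.arange(J, dtype=s.dtype, device=s.device)
    eta = scale * torch.sigmoid(sharpness * j / J)    # non-negative weights
    return (eta * s).abs().sum()                      # min ||eta . s_t||_1
```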
Algorithm 1: GAN memory with further compression (exemplified by $\Gamma_t$ within a Conv layer)

Input: a sequence of $T$ tasks $D_1, D_2, \cdots, D_T$; the knowledge bases $L = \emptyset$ and $R = \emptyset$; a scale $r$ to balance the low-rank regularization against the generator loss.
Output: the knowledge bases $L$ and $R$; the task-specific coefficients $c_1, c_2, \cdots, c_T$.

for $t$ in task $1$ to task $T$ do
    Train GAN memory on the current task $t$ with the following settings:
        (i) $\Gamma_t \doteq L^{KB}_{t-1} \Lambda_t (R^{KB}_{t-1})^T + E_t$ with $\Lambda_t = \mathrm{Diag}(\lambda_t)$;
        (ii) the generator loss with the low-rank regularization $r\,\|E_t\|_*$ added.
    Do SVD on $E_t$ as $E_t = U_t S_t V_t^T$ with $S_t = \mathrm{Diag}(s_t)$.
    Zero-out the small singular values to keep $X\%$ of the matrix energy of $\Gamma_t$, obtaining $E_t \approx \hat{U}_t \hat{S}_t \hat{V}_t^T$.
    Collect the task-specific coefficients for task $t$ as $c_t = [\lambda_t, s_t]$.
    Collect the knowledge bases $L = [L, \hat{U}_t]$, $R = [R, \hat{V}_t]$.
end for

The index here begins from 0.

G.2 Experiments on the compressibility of the GAN memory

From Figure 4(b), we observe that Γ/B from different blocks/layers have different compressibility: the larger the matrix size of Γ/B (i.e., the closer to the noise input), the larger the compression ratio. It is therefore proper to keep a different percentage of matrix energy for different blocks/layers, to minimize the decline in final performance. The results shown in Table 1 are obtained by keeping block-specific percentages of matrix energy for Γ/B within B0, B1, B2, and B3, respectively. The compression techniques are not applied to the other blocks/layers (i.e., B4, B5, B6, and FC) because those layers have a relatively small number of parameters.
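The per-task truncation step of Algorithm 1 can be sketched as follows. This is our illustration; "keeping $X\%$ matrix energy" is read here as retaining the smallest number of singular values whose cumulative squared sum reaches $X\%$ of the total [55]:

```python
import torch

# Sketch of the energy-truncation step: SVD, keep the leading singular values
# covering the requested fraction of energy, and return the truncated bases.
def truncate_energy(E, energy=0.90):
    U, s, Vh = torch.linalg.svd(E, full_matrices=False)
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    k = int(torch.searchsorted(cum, energy)) + 1      # rank kept
    return U[:, :k], s[:k], Vh[:k, :].T               # U_hat, s_hat, V_hat

# The knowledge bases then grow by concatenating the new bases, e.g.:
# L = torch.cat([L, U_hat], dim=1);  R = torch.cat([R, V_hat], dim=1)
```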
Figure 15: The percentage of newly added bases for the scale Γ (left) and bias B (right) after sequential training on each of the butterfly tasks. Each bar value is the ratio between the number of newly added bases and the maximum rank of Γ/B. The percentages shown are the ratios between the amount of newly added parameters with and without compression for each task.

Compared to a naive implementation with task-specific style parameters (see Section 4.2), Figure 15 shows the overall (and block/layer-wise) compression ratios delivered by the compression techniques as training proceeds. Each bar shows the ratio between the number of newly added bases and the maximum rank of Γ/B; the percentage above each group of bars is the ratio between the amount of newly added parameters with and without compression for each task. It is clear that (i) even for task $D_1$ (with an empty knowledge base), the low-rank property of Γ/B enables significant parameter compression; (ii) given an existing knowledge base, far fewer new parameters are necessary to form a new generative power (e.g., for tasks $D_2$, $D_3$, $D_4$, and $D_5$), confirming the reusability/shareability of existing knowledge; and (iii) the lower the block (the closer to the noise input, with a larger matrix size for Γ/B), the larger the compression. These three factors together lead to the significant overall compression ratios.

Intuitively, once enough tasks have been learned and the knowledge base is rich enough, no new bases will be necessary for future tasks (although we still need to save the tiny task-specific coefficients). For the proposed compression method, the selection of a proper threshold for keeping matrix energy is important in balancing parameter savings against performance.
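The parameter-count arithmetic behind these percentages can be sketched as follows. This is our reading, not the authors' script: a naive implementation stores a full $m \times n$ style matrix per task, whereas the compressed one stores $r$ new bases $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ plus $r$ coefficients, i.e., $r(m + n + 1)$ numbers. The layer sizes in the example are hypothetical:

```python
# Back-of-envelope compression-ratio arithmetic for Figure 15-style numbers.
def added_param_ratio(layers):
    """layers: list of (m, n, r) tuples, r = number of newly added bases."""
    compressed = sum(r * (m + n + 1) for m, n, r in layers)
    naive = sum(m * n for m, n, _ in layers)
    return compressed / naive   # the percentage printed above each bar group

# Hypothetical example: a 512 x 256 scale matrix gaining 20 new bases costs
# only about 11.7% of the naive per-task storage.
print(added_param_ratio([(512, 256, 20)]))   # -> 0.117...
```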