Regularize, Expand and Compress: Multi-task based Lifelong Learning via NonExpansive AutoML
Jie Zhang, Junting Zhang, Shalini Ghosh, Dawei Li, Jingwen Zhu, Heming Zhang, Yalin Wang
Arizona State University, Tempe, AZ; University of Southern California, Los Angeles, CA; Samsung Research America, Mountain View, CA
Abstract
Lifelong learning, the problem of continual learning where tasks arrive in sequence, has lately been attracting more attention in the computer vision community. The aim of lifelong learning is to develop a system that can learn new tasks while maintaining performance on the previously learned tasks. However, there are two obstacles to lifelong learning of deep neural networks: catastrophic forgetting and capacity limitation. To solve these issues, inspired by the recent breakthroughs in automatically learning good neural network architectures, we develop a Multi-task based lifelong learning via nonexpansive AutoML framework termed Regularize, Expand and Compress (REC). REC is composed of three stages: 1) it continually learns the sequential tasks without the learned tasks' data via a newly proposed multi-task weight consolidation (MWC) algorithm; 2) it expands the network, with potentially improved model capability and performance, by network-transformation based AutoML; 3) it compresses the expanded model after learning every new task to maintain model efficiency and performance. The proposed MWC and REC algorithms achieve superior performance over other lifelong learning algorithms on four different datasets.
1. Introduction
In many real-world applications, batches of data arrive periodically (e.g., daily, weekly, or monthly) and the data distribution changes over time. This presents an opportunity for lifelong learning, or continual learning, an important developing topic of interest in artificial intelligence. The primary goal of lifelong learning is to learn consecutive tasks without forgetting the knowledge learned from previously trained tasks, and to leverage the previous knowledge to obtain better performance or faster convergence on each newly arriving task. One simple approach is to fine-tune the model for every new task; however, such retraining typically degrades the model's performance on both the new tasks and the old ones. If the new tasks are largely different from the old ones, the model might not be able to learn them optimally, while the retrained representations may adversely affect the old tasks, causing them to drift from their optimal solution. This can cause "catastrophic forgetting", a phenomenon where training a model on new tasks interferes with the previously learned old knowledge, leading to performance degradation or even overwriting of the old knowledge by the new.

Figure 1. (a) The state-of-the-art DEN method [30] selectively retrains the old network and dynamically expands the model capacity. (b) The proposed REC method expands the network through network-transformation based AutoML, and then subsequently compresses the model back to its original size.

To overcome the catastrophic forgetting problem, many approaches have been proposed [13, 17, 22]. Kirkpatrick et al. [13] propose a regularization term that prevents the new weights from deviating too much from the previously learned weights, based on their significance to old tasks. Their method uses a fixed neural network architecture, which does not scale once the network capacity gets saturated with more and more new tasks to learn. Dynamically expanding the network (DEN) [30] is one way to overcome the problem caused by a static architecture: it expands the network capacity whenever it detects that the loss for the new task will not reach a pre-defined threshold. However, DEN involves many hyperparameters, its final performance is highly sensitive to these parameters, and it relies on hand-crafted heuristics to explore the tuning space. The search space is considerably large, so human experts usually find only a sub-optimal solution, while the current parameter-tuning procedures are time-consuming.

To this end, we aim to automatically expand the network for lifelong learning, with higher performance and less parameter redundancy than human-designed architectures. To better facilitate (a) automatic knowledge transfer without human expert tuning and (b) model design with optimized model complexity, we are, to our knowledge, the first to apply AutoML [23] to lifelong learning while taking learning efficiency into consideration.

AutoML refers to automatically learning a suitable machine learning (ML) model for a given task. Neural Architecture Search (NAS) [32] is a subfield of AutoML for deep learning that searches for the optimal hyperparameters of a network architecture using reinforcement learning (RL). The RL framework has a main controller that observes the generated child networks' performance on the validation set as the reward signal; it then assigns higher probabilities to architectures with higher performance when updating the model. Using this approach directly in the lifelong learning setting would forget the old tasks' knowledge and be wasteful, since the network architecture for each new task would need to be searched from scratch by the controller, ignoring the correlations between previously learned tasks and the new task. We hereby propose a multi-task weight consolidation (MWC) approach to learn a discriminative weight subset by incorporating the inherent correlations between old tasks and the new task.
Furthermore, to narrow the architecture search space and save training time, we utilize network-transformation based AutoML [3] to accelerate the meta-learning of the new network. However, if we keep expanding the network for more and more new tasks, the model becomes much larger than the initial model and suffers from inefficiency (e.g., high memory footprint and power usage). Many network-expanding lifelong learning algorithms [24, 30] increase the model capability but decrease the learning efficiency in terms of memory cost and power usage. To address this issue, we conduct model compression after completing the learning of each new task: we compress the expanded model back to the initial model size, with negligible performance loss on both old and new tasks. Fig. 1 shows the main difference between our approach and network-expansion based lifelong learning algorithms.

In this paper, we propose a Multi-task based lifelong learning via nonexpansive AutoML framework termed Regularize, Expand and Compress (REC), to continually and automatically learn on such sequential datasets. We start with a given small network and learn an initial model on the first given task; for each new upcoming task, REC then searches for the best network architecture by network-transformation based AutoML, without access to the old tasks' data, using the newly proposed MWC algorithm, and compresses the expanded network back to the initial network size.

Our key contributions in this work can be summarized as follows:
• We propose Regularize, Expand and Compress (REC) for lifelong learning, which automatically expands the network capacity for learning a new task with higher performance and less parameter redundancy than human-designed architectures.
• To overcome catastrophic forgetting of the previously learned tasks, we propose a novel Multi-task Weight Consolidation (MWC) algorithm: it selects a discriminative weight subset by incorporating the inherent correlations between the old tasks and the new task, and learns each newly added layer as a task-specific layer for the new task.
• Furthermore, unlike previous network-expanding lifelong learning algorithms, REC compresses the model after learning every new task to guarantee model efficiency. The final model is nonexpansive, yet its performance is enhanced by the network expansion that precedes compression.
2. Related Work
Recently, many lifelong learning methods have been proposed to address the catastrophic forgetting problem. The first group of methods uses regularized learning. Elastic Weight Consolidation (EWC) [13] shows that task-specific synaptic consolidation may overcome catastrophic forgetting in neural networks: it identifies the weights that are important for the previous tasks and selectively adjusts their plasticity. Inspired by EWC, Schwarz et al. [26] propose online EWC, which improves EWC's scalability by limiting the computational cost of the regularization term as the number of tasks increases. Synaptic Intelligence [31] computes an online importance measure along the entire learning trajectory, which is similar in spirit to EWC. Rotated EWC (REWC) [19] is a modified version of EWC that computes a factorized rotation of the parameter space, used in conjunction with EWC, to approximately diagonalize the Fisher information matrix of the network parameters.

Table 1. Comparison of the lifelong learning approaches for overcoming catastrophic forgetting. EWC: Elastic Weight Consolidation [13]; DEN: Dynamically expandable network [30]; LwF: Learning without forgetting [17]; GEM: Gradient of Episodic Memory [20]; PGN: Progressive neural network [24]; REC: our algorithm.
                                            EWC   DEN   LwF   GEM   PGN   REC
No memory growth                             ✓           ✓     ✓           ✓
No exemplar                                  ✓     ✓     ✓           ✓     ✓
Expanding network capacity when necessary          ✓                 ✓     ✓
AutoML ability                                                             ✓

The second group of strategies is associated with learning task-specific parameters. Learning without Forgetting (LwF) [17] leverages distillation regularization on the new tasks: the soft labels of previously learned tasks are enforced to be similar to those of the network trained on the current task by using knowledge distillation [10]. Less-forgetful learning [12] regularizes the ℓ2 distance between the final hidden activations and the old tasks' parameters to preserve the old tasks' feature mappings.

The third group of methods expands the network capacity. Progressive Neural Network (PGN) [24] blocks any changes to the network models pre-trained on previously learned tasks and expands the network architecture by allocating sub-networks of fixed capacity to be trained with the new information. PathNet [7] embeds agents into a neural network to find which parts of the network can be reused for learning new tasks, and freezes task-relevant paths to avoid catastrophic forgetting. Dynamically Expandable Network (DEN) [30] increases the number of trainable parameters to continually learn new tasks, dynamically selecting neurons to retrain or expanding neuron capacity via group sparse regularization.

The last family of methods uses episodic memory, where samples of previously learned tasks are stored to effectively recall past experience. Gradient of Episodic Memory (GEM) [20] performs positive forward transfer, minimizes negative backward transfer to previously learned tasks, and learns the subset of correlations over a set of tasks without using task descriptors. Incremental Classifier and Representation Learning (iCaRL) [22] combines a classification loss on new tasks with a distillation loss on previously learned tasks, uses a K-nearest neighbor classifier, and selects the exemplars for each task whose embeddings lie closest to the center point of each class. Table 1 summarizes the merits of REC compared with previous research in this area.

There are many works on AutoML that improve the performance of deep neural networks [32, 21, 3]. Neural Architecture Search (NAS) [32] searches for transferable network blocks via reinforcement learning and outperforms many manually designed network architectures. ENAS [21] uses a controller to discover network architectures by searching for an optimal subgraph within a large computational graph, sharing parameters among child models to enable efficient NAS. EAS [3] efficiently explores the architecture space via network transformation [4], a function-preserving method that expands an architecture with a fixed number of units or filters.
Figure 2. Illustration of our lifelong learning framework. For a new task, REC first uses MWC to search for the best child network via the Net2Deeper and Net2Wider operators in the controller; it then compresses the expanded network to the same size as the initial model and continually learns the next new task.

Knowledge distillation (KD) [10] is also closely related to our work. KD is widely used to compress a network into a different architecture that approximates the original network, transferring knowledge from a large teacher network to a small student network. The student network is trained with the KD loss, a modified cross-entropy loss, which encourages the student network to match the teacher network. In our work, we adopt KD to compress the expanded network after learning each new task.
3. Method
Fig. 2 gives an overview of our AutoML framework REC for lifelong learning. It has three steps: Regularize (multi-task weight consolidation), Expand (the network, by AutoML), and Compress (the expanded model).
We define the lifelong learning problem as follows: an unknown number of tasks with unknown distributions arrive in sequence, and our goal is to learn a deep model in this scenario without catastrophic forgetting. For the evaluation protocol, we report the classification accuracy on each of the previous $T-1$ tasks and on the current task $T$ after training on the $T$-th task. Given a sequence of $T$ tasks, the task at time point $t = 1, 2, \dots, T$ arrives with $N_t$ images in dataset $D_t = \{x_i^t, y_i^t\}_{i=1}^{N_t}$. Specifically, for task $t$, $y_i^t \in \{1, \dots, K\}$ is the label of the $i$-th sample $x_i^t \in \mathbb{R}^{d_t}$ in task $t$. We denote the training data matrix of $D_t$ by $X^t$, i.e., $X^t = (x_1^t, \dots, x_{N_t}^t)$. When the dataset of task $t$ arrives, all previous training datasets $D_1, \dots, D_{t-1}$ are no longer available, but the deep model parameters $\theta^{t-1} = \{\theta_l^{t-1}\}_{l=1}^{L}$ can still be accessed. The lifelong learning problem at time point $t$, given data $D_t$, can then be defined as solving

$$\min_{\theta^t} F(\theta^t \mid \theta^{t-1}, D_t), \quad t = 1, \dots, T, \qquad (1)$$

where $F$ is the loss function for solving $\theta^t$, the parameters for task $t$. Note that the number of upcoming tasks can be finite or infinite; for simplicity, we consider the finite scenario here.

Kirkpatrick et al. [13] proposed EWC, which consists of a quadratic penalty on the difference between the parameters $\theta^t$ and $\theta^{t-1}$ to slow down catastrophic forgetting on previously learned tasks. The posterior distribution $p(\theta^t \mid D_t)$ is used to describe the problem via Bayes' rule:

$$\log p(\theta^t \mid D_t) = \log p(D_t \mid \theta^t) + \log p(\theta^t \mid D_{t-1}) - \log p(D_t), \qquad (2)$$

where the posterior probability $\log p(\theta^t \mid D_{t-1})$ embeds all the information from task $t-1$. However, problem (2) is intractable, so EWC approximates the posterior as a Gaussian distribution with mean given by the previous parameters $\bar{\theta}^{t-1}$ and a diagonal of the Fisher information matrix $F$. The matrix $F$ is computed as $F_i = I(\theta^t)_{ii} = \mathbb{E}_x\big[\big(\tfrac{\partial}{\partial \theta_i^t} \log p(D_t \mid \theta^t)\big)^2 \,\big|\, \theta^t\big]$. Therefore, the EWC problem on task $t$ can be written as

$$\min_{\theta^t} F_t(\theta^t) + \lambda \sum_i F_i \big(\theta_i^t - \bar{\theta}_i^{t-1}\big)^2, \qquad (3)$$

where $F_t$ is the loss function for task $t$, $\lambda$ denotes how important task $t-1$ is compared to task $t$, and $i$ labels each weight of the parameters $\theta$.
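To make the diagonal Fisher estimate and the quadratic penalty of Eq. 3 concrete, here is a minimal TensorFlow sketch. The function names (`fisher_diagonal`, `ewc_penalty`) and the Monte-Carlo sampling details are our own illustration, not code from the paper:

```python
import tensorflow as tf

def fisher_diagonal(model, dataset, num_samples=1000):
    """Monte-Carlo estimate of the diagonal Fisher information,
    F_i = E[(d log p(D|theta) / d theta_i)^2], at the current weights.
    `dataset` is assumed to yield single (image, label) examples."""
    fisher = [tf.zeros_like(v) for v in model.trainable_variables]
    n = 0
    for x, _ in dataset.take(num_samples):
        with tf.GradientTape() as tape:
            log_probs = tf.nn.log_softmax(model(x[None, ...]))
            # label sampled from the model's own predictive distribution,
            # as in the standard Fisher estimate
            y = tf.random.categorical(log_probs, num_samples=1)[0, 0]
            log_lik = tf.gather(log_probs[0], y)
        grads = tape.gradient(log_lik, model.trainable_variables)
        fisher = [f + tf.square(g) for f, g in zip(fisher, grads)]
        n += 1
    return [f / float(n) for f in fisher]

def ewc_penalty(model, old_params, fisher, lam):
    """Quadratic penalty of Eq. 3: lam * sum_i F_i (theta_i - bar_theta_i)^2."""
    return lam * tf.add_n([
        tf.reduce_sum(f * tf.square(v - v0))
        for f, v, v0 in zip(fisher, model.trainable_variables, old_params)])
```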
The main problem of EWC is that it only enforces $\theta^t$ to stay close to $\theta^{t-1}$. This ignores the inherent correlations between task $t-1$ and task $t$, and such relationships might help overcome catastrophic forgetting on the previously learned tasks. Learning multiple related tasks jointly can improve performance relative to learning each task separately when the tasks are related; this idea is the basis of Multi-Task Learning (MTL) [6], which is commonly used to obtain better generalization performance than learning each task individually.

Figure 3. MWC retrains the entire network learned on previous tasks while regularizing it to prevent forgetting of the original model. MWC (purple solid line) learns better parameter representations to overcome catastrophic forgetting by combining MTL with the sparsity-inducing norm (purple dashed line) and EWC (red line).

We redefine Eq. 3 using MTL and propose a new objective function, Eq. 4, to improve the ability to overcome catastrophic forgetting across multiple tasks simultaneously:

$$\min_{\theta^t} F_t(\theta^t) + \lambda \sum_i F_i \big(\theta_i^t - \bar{\theta}_i^{t-1}\big)^2 + \lambda_1 \big\|[\theta^t; \theta^{t-1}]\big\|_{2,1}, \qquad (4)$$

where $\lambda_1$ is a non-negative regularization parameter and $\|[\theta^t; \theta^{t-1}]\|_{2,1} = \sum_j \|[\theta^t_j; \theta^{t-1}_j]\|_2$ is the $\ell_{2,1}$-norm regularization used to learn the related representations. Here, we employ multi-task learning with the $\ell_{2,1}$-norm [18] to capture the common subset of relevant parameters in each layer for task $t-1$ and task $t$.

Specifically, we further consider important parameters that have better representation power for a subset of tasks. MTL with a sparsity-inducing norm [8] has been widely studied to select such a discriminative parameter subset by incorporating inherent correlations among multiple tasks. To this end, we additionally impose the $\ell_1$ sparsity norm to learn the new task-specific parameters while learning task relatedness among multiple tasks. The objective function for task $t$ therefore becomes

$$\min_{\theta^t} F_t(\theta^t) + \lambda \sum_i F_i \big(\theta_i^t - \bar{\theta}_i^{t-1}\big)^2 + \lambda_1 \big\|[\theta^t; \theta^{t-1}]\big\|_{2,1} + \lambda_2 \|\theta^t\|_1, \qquad (5)$$

where $\lambda_2$ is a non-negative regularization parameter. We call our algorithm Multi-task Weight Consolidation (MWC) because it learns a discriminative weight subset from the inherent correlations among multiple tasks. Fig. 3 gives a geometric illustration of MWC.
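The full MWC objective of Eq. 5 can then be assembled from the consolidation penalty plus the two structured-sparsity terms. The sketch below is one plausible reading of Eq. 5; the row-wise grouping for the $\ell_{2,1}$ term, the exclusion of bias vectors, and the function name `mwc_loss` are our assumptions:

```python
import tensorflow as tf

def mwc_loss(task_loss, model, old_params, fisher, lam, lam1, lam2):
    """One reading of Eq. 5: task loss + Fisher-weighted consolidation
    + l2,1-norm over the stacked old/new weights + l1 sparsity."""
    consolidation = tf.add_n([
        tf.reduce_sum(f * tf.square(v - v0))
        for f, v, v0 in zip(fisher, model.trainable_variables, old_params)])
    # l2,1-norm of [theta^t; theta^{t-1}]: concatenate the two weight
    # matrices per layer and sum the l2 norms of the rows, so each
    # input unit's outgoing weights are kept or shrunk as a group
    l21 = tf.add_n([
        tf.reduce_sum(tf.norm(tf.concat(
            [tf.reshape(v, [v.shape[0], -1]),
             tf.reshape(v0, [v0.shape[0], -1])], axis=1), axis=1))
        for v, v0 in zip(model.trainable_variables, old_params)
        if len(v.shape) > 1])  # weight matrices only, biases skipped
    l1 = tf.add_n([tf.reduce_sum(tf.abs(v))
                   for v in model.trainable_variables])
    return task_loss + lam * consolidation + lam1 * l21 + lam2 * l1
```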
MWC is a regularization-based lifelong learning algorithm; the network may still need to expand when a new task is very different from the existing ones, or when the network capacity becomes insufficient as more and more new tasks arrive. Moreover, human experts usually find only sub-optimal architectures by hand, which encourages us to propose an AutoML-based network expansion method for lifelong learning. We name it Regularize, Expand, Compress (REC) and summarize its steps in Algorithm 1. The details of the network-transformation based AutoML for REC are outlined in Algorithm 2.

Algorithm 1: REC
Input: Datasets D_1, ..., D_T; λ, λ_1, λ_2
Output: θ_c^T
begin
    for t = 1 → T do
        if t = 1 then
            Train an initial network with weights θ^1 using Eq. 1.
        else
            Search for the best child network θ^t with Alg. 2 and Eq. 8.
            Compress θ^t to the same model size as θ^1 using Eq. 10 and use θ_c^t for the next task.

We consider the net2wider and net2deeper operators [4] in our controller. The net2wider network transformation function is

$$\pi_{wider}(j) = \begin{cases} j & j \le O_l, \\ \text{random sample from } \{1, \dots, O_l\} & j > O_l, \end{cases} \qquad (6)$$

where $O_l$ represents the number of outputs of the original layer $l$. The net2deeper network transformation function is

$$\gamma\big(\pi_{deeper}(j)\big) = \gamma(j) \quad \forall j, \qquad (7)$$

where the constraint on $\gamma$ holds for the rectified linear activation. We learn a meta-controller that generates network transformation actions (Eq. 6 and Eq. 7) given the initial network architecture. Specifically, we use an encoder network [3], implemented with an input embedding layer and a bidirectional recurrent neural network [25], to learn a low-dimensional representation of the initial network, which is then fed into the different operators to generate different network transformation actions. Besides, we use a shared sigmoid classifier to make the Net2Wider decision according to the hidden state of each layer learned by the bidirectional encoder network [3], and the widened network can be further combined with a Net2Deeper operator.
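For intuition, the Net2Wider mapping of Eq. 6 can be realized as the standard function-preserving widening of Net2Net [4]. The following NumPy sketch (our illustration, for two adjacent dense layers) replicates units according to $\pi_{wider}$ and rescales the next layer so the network's output is unchanged:

```python
import numpy as np

def net2wider(w1, b1, w2, new_width, rng=np.random):
    """Function-preserving widening of a dense layer (Eq. 6, following
    Net2Net [4]). w1/b1 are the widened layer's weights and biases,
    w2 the next layer's weights; requires new_width > w1.shape[1]."""
    old_width = w1.shape[1]
    # pi_wider(j): identity for j <= O_l, a random copy of an existing
    # unit for the extra positions j > O_l
    mapping = np.concatenate([
        np.arange(old_width),
        rng.randint(0, old_width, new_width - old_width)])
    counts = np.bincount(mapping, minlength=old_width)
    w1_new = w1[:, mapping]            # replicate outgoing columns
    b1_new = b1[mapping]
    # divide the next layer's incoming rows by the replication count so
    # the widened network computes exactly the same function
    w2_new = w2[mapping, :] / counts[mapping][:, None]
    return w1_new, b1_new, w2_new
```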
We then integrate MWC (Eq. 5) into the above AutoML system for lifelong learning. After learning the network $\theta^{t-1}$ on the data $D_{t-1}$, we automatically search for the best child network $\theta^t$ with the Net2Wider and Net2Deeper operators whenever it is necessary to expand the network, while keeping the model performance on task $t-1$ via Eq. 5. If the controller decides to expand the network, the newly added layer will not have the previous tasks' Fisher information. We consider each newly added layer as a new task-specific layer and adopt $\ell_1$ regularization to promote sparsity in the new weights, so that each neuron connects to only a few neurons in the layer below; this efficiently learns the best representation for the new task while reducing the computational overhead. The modified MWC objective in the network-expansion scenario is

$$\min_{\theta^t} F_t(\theta^t) + \lambda \sum_{i \ne deeper,\, i \ne wider} F_i \big(\theta_i^t - \bar{\theta}_i^{t-1}\big)^2 + \lambda_1 \big\|[\theta^t; \theta^{t-1}]\big\|_{2,1} + \lambda_2 \big\|\theta^t_{i = deeper,\, i = wider}\big\|_1, \qquad (8)$$

where the subscripts $deeper$ and $wider$ refer to the newly added layers in task $t$.

After the controller generates a child network, the child network achieves an accuracy $A_{val}$ on the validation set of task $t$, which is used as the reward signal $R_t$ to update the controller. We maximize the expected reward to find the optimal child network. The empirical approximation of our AutoML REINFORCE rule [28] is

$$\frac{1}{m} \sum_{i=1}^{m} \sum_{s=1}^{S} \nabla_C \log P(a_s \mid a_1, \dots, a_{s-1}; C)\, R_i^t, \qquad (9)$$

where $m$ is the number of child networks that the controller $C$ samples in one batch, $a_s$ and $g_s$ represent the action and state when predicting the $s$-th hyperparameter of a child network architecture, and $\mathcal{T}$ is the transition function in Alg. 2. Since $R_t$ is non-differentiable, we use the policy gradient to update the controller. As done in [3], we apply a nonlinear transformation $\tan(A_{val} \times \pi/2)$ to the validation accuracy of task $t$ and use the transformed value as the reward. We also use an exponential moving average of previous rewards with a decay of 0.95 to reduce the variance. To balance the old-task and new-task knowledge, we set the maximum numbers of expanded layers to 2 and 3 for the net2wider and net2deeper operators, respectively.
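A minimal sketch of the reward shaping just described, i.e., the tan transformation from [3] combined with an exponential-moving-average baseline (decay 0.95); the function name and interface are ours:

```python
import math

def shaped_reward(val_acc, baseline, decay=0.95):
    """Reward shaping for the controller: tan(acc * pi/2) as in EAS [3],
    minus an exponential moving average of past rewards (decay 0.95)
    to reduce the variance of the policy gradient."""
    reward = math.tan(val_acc * math.pi / 2.0)
    baseline = decay * baseline + (1.0 - decay) * reward
    return reward - baseline, baseline
```

After each child network is evaluated, a call such as `r, b = shaped_reward(acc, b)` yields the baseline-subtracted reward used in Eq. 9.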
Algorithm 2: Automatic Network Transformation
Input: Dataset D_t, θ^{t-1}
Output: The best expanded model θ^t
begin
    for i = 1 → m do
        for s = 1 → S do
            a_s ← π_deeper(g_{s-1}; θ^{t-1}_{deeper}) or π_wider(g_{s-1}; θ^{t-1}_{wider})
            g_s ← T(g_{s-1}, a_s)
        θ^t ← θ^t_{newLayer}
        R_i ← tan(A_i^t(g_S) × π/2)
        θ_i^t ← ∇_{θ_{i-1}^t} J(θ_{i-1}^t)

If the network keeps expanding as more and more tasks arrive, the model suffers from inefficiency and extra memory cost. Thus, model compression is needed to reduce the memory cost and obtain a nonexpansive model. Here, we use the soft labels (the logits) for knowledge distillation (KD) [10], instead of the hard labels, to train the student model. Following Ba and Caruana [2], the student model is trained to minimize the mean $\ell_2$ loss on the training data $\{x_i^t, z_i^t\}_{i=1}^{N_t}$, where $z_i^t$ is the logits of the child model $\theta^t$ on the $i$-th training sample. We compress $\theta^t$ to a model of the same size as $\theta^1$ with the KD loss

$$\min_{\theta_c^t} F_{kd}\big(f(x^t; \theta_c^t), z^t\big) = \frac{1}{N_t} \sum_i \big\|f(x_i^t; \theta_c^t) - z_i^t\big\|_2^2, \qquad (10)$$

where $\theta_c^t$ denotes the weights of the student network and $f(x_i^t; \theta_c^t)$ is its prediction on the $i$-th training sample of task $t$. The final student network $\theta_c^t$ is trained to convergence with both hard and soft labels using the loss

$$\min_{\theta_c^t} F\big(f(x^t; \theta_c^t), y^t\big) + F_{kd}\big(f(x^t; \theta_c^t), z^t\big), \qquad (11)$$

where $F$ is the loss function (cross-entropy in this work) for training with the ground truth $y^t$ of task $t$.
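Eqs. 10 and 11 amount to a logit-regression distillation term added to the usual cross-entropy. A minimal TensorFlow sketch of the combined compression loss (our naming; the student and teacher are assumed to output raw logits):

```python
import tensorflow as tf

def compression_loss(student_logits, teacher_logits, labels):
    """Combined compression objective (Eq. 11): cross-entropy on the
    hard labels y^t plus the l2 logit-matching term of Eq. 10 between
    the student and the expanded (teacher) network."""
    ce = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=student_logits))
    kd = tf.reduce_mean(tf.reduce_sum(
        tf.square(student_logits - teacher_logits), axis=1))
    return ce + kd
```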
4. Experiments
Datasets.
We evaluate our algorithm on the most commonly used datasets for lifelong learning, listed as follows:
– MNIST-permutation: MNIST [16] is the most common dataset among all lifelong learning works; it consists of ten handwritten digit classes with 60,000/10,000 training and testing examples. One way to create the datasets for multiple tasks is to randomly permute the pixels with a fixed permutation per task [13], so that the input distributions of the tasks are unrelated.
– MNIST-variation: MNIST-variation [16] rotates the MNIST dataset by a fixed angle between 0 and 180 degrees for each different task. We use 180/T degrees as the fixed angle to create T tasks.
– CIFAR-100: CIFAR-100 [14] contains 60,000 32×32 color images in 100 object classes; each class has 500/100 images for training and testing. We consider each task as a disjoint set of classes, containing 100/T classes when there are T tasks. In contrast to MNIST-permutation, the input distributions are similar across tasks but the output distributions for each task are different.
– CUB-200: CUB-200 [29] is a fine-grained image classification benchmark; we use the CUB-200-2011 version in this work. It contains 11,788 images of 200 types of birds, with 5,994/5,794 images for training and testing. Each image has detailed annotations and a bounding box. We crop the bounding boxes from the original images and resize them to 224×224. We create multiple tasks in the same way as for CIFAR-100.
For the first three datasets, we choose T = 10 tasks. Since the fine-grained CUB-200 dataset is more challenging than the others, we set T = 4 tasks for clearer comparisons of lifelong learning. For all datasets, we hold out 10% of the training data as a validation set, and the model observes the tasks in sequence. We generate the multiple tasks for each dataset once; all comparison methods then use the same task order and the same categories within each task for a fair comparison.
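As an illustration of how the MNIST-permutation task sequence can be generated (our sketch; keeping the identity permutation for the first task is an assumption):

```python
import numpy as np
import tensorflow as tf

def make_permuted_mnist(num_tasks=10, seed=0):
    """Generate the MNIST-permutation task sequence: each task applies
    one fixed random pixel permutation to every image [13]."""
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
    x_tr = x_tr.reshape(-1, 784).astype("float32") / 255.0
    x_te = x_te.reshape(-1, 784).astype("float32") / 255.0
    rng = np.random.RandomState(seed)
    tasks = []
    for t in range(num_tasks):
        # keep the identity permutation for the first task
        perm = np.arange(784) if t == 0 else rng.permutation(784)
        tasks.append(((x_tr[:, perm], y_tr), (x_te[:, perm], y_te)))
    return tasks
```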
Base network settings. For the two MNIST datasets, we use a two-layer fully-connected neural network of 100-100 units with ReLU activations as our initial network. For the CIFAR-100 dataset, we use a modified version of AlexNet [15] with five convolutional layers (64-128-256-256-128 filters) and three fully-connected layers (384-192-100 neurons at each layer); standard data augmentation is used on this dataset. For the CUB-200 dataset, we use a VGG-16 [27] model pre-trained on ImageNet [5] and fine-tune it on the CUB-200 data for better initialization. We follow the setting of Liu et al. [19], which adds a global pooling layer after the final convolutional layer of the VGG-16; the fully-connected layers are changed to 512-512, and the size of the output layer is the number of classes in each task. All models and algorithms are implemented with the TensorFlow [1] library.
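For concreteness, the initial network for the MNIST experiments described above could be defined as follows (a minimal Keras sketch; the per-task logits head is our assumption):

```python
import tensorflow as tf

def initial_network(input_dim=784, num_classes=10):
    """The 100-100 ReLU initial model used for the MNIST experiments."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(100, activation="relu",
                              input_shape=(input_dim,)),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(num_classes),  # task-specific logits head
    ])
```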
Comparison methods. We compare our two algorithms against six other methods: 1) SN: a single network trained across all tasks. 2) Net2Net [4]: network expansion by Net2Net on each new task. 3) EWC [13]: a deep network trained with elastic weight consolidation. 4) Net2Net-EWC: network expansion by Net2Net [4] with elastic weight consolidation [13] when learning a new task. 5) DEN [30]: dynamically expandable network. 6) REWC [19]: rotated elastic weight consolidation. 7) MWC: a deep network trained with our multi-task weight consolidation. 8) REC: our full Regularize, Expand and Compress framework.
Hyperparameter settings.
All hyperparameters in MWC are optimized using a grid search, and the best results for each model are reported. For the two MNIST datasets, the SGD optimizer is used with a batch size of 256 for 8 epochs, with λ = 2 in all experiments. For the CIFAR-100 dataset, we use the SGD optimizer with momentum, a batch size of 128, and 20 epochs, with λ = 10. For the CUB dataset, the Adam optimizer is used with a batch size of 32 and 50 epochs, with λ = 100. For the experimental settings of the network-transformation based AutoML, we follow the training details of Cai et al. [3].
Figure 4. Experimental results of continual training on the MNIST-permutation, MNIST-variation and CIFAR-100 datasets. We report the average per-task accuracy of the models over T = 10 tasks; the number in parentheses after each method is its average per-task performance after the model has finished learning the final task. MNIST-permutation: SN (0.174), Net2Net (0.321), EWC (0.844), Net2Net-EWC (0.818), DEN (0.949), MWC (0.938), REC (0.957). MNIST-variation: SN (0.177), Net2Net (0.593), EWC (0.614), Net2Net-EWC (0.646), DEN (0.710), MWC (0.703), REC (0.715). CIFAR-100: SN (0.163), Net2Net (0.208), EWC (0.419), Net2Net-EWC (0.472), MWC (0.556), REC (0.597).
Figure 5. Forgetting experiment for task 1 on the MNIST-permutation, MNIST-variation and CIFAR-100 datasets. We report the accuracy of the different models on task t = 1 at each training stage to show how the model performance changes over time.

Table 2. Comparison of the model size and the average task accuracy after training 10 tasks for different approaches on MNIST-permutation. W(1): the number of parameters at task 1. W(10): the number of parameters after training task 10. ACC(10): average per-task accuracy after training task 10.

Methods        W(1)    W(10)   ACC(10)
SN             0.01M   0.01M   17.4%
Net2Net        0.01M   0.02M   32.1%
EWC            0.01M   0.01M   84.4%
Net2Net-EWC    0.01M   0.02M   81.8%
DEN            0.01M   0.14M   94.9%
MWC            0.01M   0.01M   93.8%
REC            0.01M   0.01M   95.7%
We evaluate our methods in terms of both model accuracy and model complexity, where we measure the model size at the end of the training process.
Comparisons of the model performance.
We report the average per-task accuracy on the MNIST-permutation, MNIST-variation and CIFAR-100 datasets for T = 10 in Fig. 4. Overall, REC outperforms all comparison methods and mitigates catastrophic forgetting, especially on the later tasks (after task 5). We observe that the regularization-based networks (EWC, MWC) perform worse than the expandable networks (DEN, REC), which shows that selectively expanding the network improves performance by a large margin. Specifically, REC performs better than DEN on the two MNIST datasets, and MWC performs similarly to DEN on MNIST-permutation while using fewer parameters. We also observe that directly applying Net2Net [4] to lifelong learning does not perform well, since it forgets the old tasks' knowledge just like fine-tuning (SN); adding EWC to the loss function helps improve the old tasks' performance of Net2Net. REC performs better than Net2Net-EWC because we consider the new task-specific parameters and the discriminative common subset shared between the old tasks and the new one.

We also evaluate catastrophic forgetting over time on the earliest task: Fig. 5 shows the test accuracy of the first task throughout the whole lifelong learning process on MNIST-permutation, MNIST-variation and CIFAR-100. Our methods (MWC and REC) resist forgetting on old tasks better than all other methods on MNIST-permutation and CIFAR-100. It is worth noting that DEN performs slightly better than our method on task 1 after learning the later tasks on MNIST-variation: because DEN selectively expands the network for each new task, it is biased towards the earliest task. REC is a nonexpansive network, and our overall average per-task performance is better than DEN's, which shows that our method performs better on the later learned tasks and achieves a more balanced performance across the temporal dimension when learning sequential tasks. Besides, we note an interesting finding on MNIST-variation: SN and Net2Net show irregular performance on task 1 after learning task 10. This is because task 10 is the upside-down flip of task 1, and such a flip benefits some digits such as '1', '0' and '8'; moreover, SN and Net2Net forget too much of task 1's knowledge after learning task 9 and can only retain the most recently learned task's knowledge when they learn task 10, unlike EWC, MWC and REC, which causes the irregular performance.

Table 3. Comparison of the model size and the average task accuracy after training 10 tasks for different approaches on the CIFAR-100 dataset. W(1): the number of parameters at task 1. W(10): the number of parameters after training task 10. ACC(10): average per-task accuracy after training task 10.

Methods        W(1)   W(10)   ACC(10)
SN             4M     4M      16.3%
Net2Net        4M     6.3M    20.8%
EWC            4M     4M      41.9%
Net2Net-EWC    4M     7.4M    47.2%
MWC            4M     4M      55.6%
REC            4M     4M      59.7%
Comparisons of the model complexity.
Table 2 and Table 3 report the model size and the average per-task performance after training T = 10 tasks for the different approaches on MNIST-permutation and CIFAR-100, respectively. Overall, REC performs similarly to or better than all other approaches with a smaller model size. We observe that DEN performs better than MWC and worse than REC on MNIST-permutation, but it requires a 1.4x network expansion compared with ours. For CIFAR-100, we compute the AUROC after learning T = 10 tasks: REC achieves 0.887 versus 0.923 for DEN, while our model size is 50% of DEN's. Besides, we notice that DEN involves 7 hyperparameters and is very sensitive to them: slightly changing one of them degrades its result on MNIST-permutation to 0.8907. Our method has only three hyperparameters and needs much less expert tuning than DEN. Training time is a limitation of the current version of REC: since REC is a reinforcement-learning-based algorithm, a varying number of trials is needed, which results in longer training time than the other methods; we will improve the training efficiency of our work in the future. We also did not consider more complex network structures (e.g., ResNet [9], DenseNet [11]) and will extend the current work to more network architectures in the future.
Comparison results on the CUB-200 dataset.
Fig. 6 shows the comparison with EWC [13] and REWC [19] on the CUB-200 dataset for T = 4. MWC achieves results comparable to REWC: it performs better on tasks 3 and 4 but worse on task 2. We also test REC with only the new task's validation set (REC-new), which yields results similar to MWC on the later tasks. This might be because using only the new task's validation set is not sufficient to compute reliable rewards on such a subtle, fine-grained dataset. We hypothesize that exemplars from the old tasks can help improve the nonexpansive AutoML system's performance. We therefore use the validation sets of all learned tasks to compute the rewards and report the results as REC-all in Fig. 6. The results show that exemplars from old tasks help improve the performance of the AutoML-based algorithm; we will investigate the relationship between the number of exemplars and the performance of REC in our future work.

Figure 6. Comparison results with EWC and REWC on the CUB-200 dataset for T = 4 (methods shown: EWC, REWC, MWC, REC-new, REC-all; x-axis: number of classes; y-axis: accuracy).

Table 4. Comparison of the average per-task accuracy after training task 10 on the MNIST-permutation dataset.

Method     EWC     EWC+ℓ1   EWC+ℓ2,1   MWC
ACC(10)    84.4%   87.7%    88.5%      94.0%
Ablation study on each component in MWC.
We study how the different components of MWC affect the final lifelong learning performance. Table 4 reports the average per-task accuracy after training task 10 on MNIST-permutation for four strategies: EWC, EWC with the ℓ1-norm only, EWC with the ℓ2,1-norm only, and MWC. The ℓ2,1-norm has a stronger effect on performance than the ℓ1-norm, while our method MWC outperforms both single-regularization strategies, which demonstrates the benefit of learning a common weight subset together with discriminative new-task parameters.
5. Conclusion and Future Works
In this work, we develop a multi-task based lifelong learning framework via nonexpansive AutoML (REC). REC proceeds in two stages, continual network expansion and model compression, and a novel multi-task weight consolidation algorithm is proposed to overcome catastrophic forgetting. We achieve better accuracy and smaller model size than other lifelong learning methods on four datasets. In the future, we plan to reduce the training time of the AutoML-based algorithm and to explore the need for exemplars when computing rewards, to further improve the current work.
References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
[3] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI, 2018.
[4] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.
[7] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[8] P. Gong, J. Ye, and C.-S. Zhang. Multi-stage multi-task feature learning. In Advances in Neural Information Processing Systems, pages 1988–1996, 2012.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[10] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[12] H. Jung, J. Ju, M. Jung, and J. Kim. Less-forgetful learning for domain expansion in deep neural networks. arXiv preprint arXiv:1711.05959, 2017.
[13] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
[14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[17] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[18] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 339–348. AUAI Press, 2009.
[19] X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, and A. D. Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting. arXiv preprint arXiv:1802.02950, 2018.
[20] D. Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[21] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[22] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.
[23] C. Robert. Machine learning, a probabilistic perspective. 2014.
[24] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[25] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[26] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[30] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In ICLR, 2018.
[31] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. arXiv preprint arXiv:1703.04200, 2017.
[32] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.