Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, Gerald Tesauro
Published as a conference paper at ICLR 2019
IBM Research, Yorktown Heights, NY; Linguistics and Computer Science Departments, Stanford NLP Group, Stanford University; MIT-IBM Watson AI Lab; Department of Brain and Cognitive Sciences, MIT

Abstract
Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller.
1 Solving the Continual Learning Problem
A long-held goal of AI is to build agents capable of operating autonomously for long periods. Such agents must incrementally learn and adapt to a changing environment while maintaining memories of what they have learned before, a setting known as lifelong learning (Thrun, 1994; 1996). In this paper we explore a variant called continual learning (Ring, 1994). In continual learning we assume that the learner is exposed to a sequence of tasks, where each task is a sequence of experiences from the same distribution (see Appendix A for details). We would like to develop a solution in this setting by discovering notions of tasks without supervision while learning incrementally after every experience. This is challenging because in standard offline single task and multi-task learning (Caruana, 1997) it is implicitly assumed that the data is drawn from an i.i.d. stationary distribution. Unfortunately, neural networks tend to struggle whenever this is not the case (Goodrich, 2015).

Over the years, solutions to the continual learning problem have been largely driven by prominent conceptualizations of the issues faced by neural networks. One popular view is catastrophic forgetting (interference) (McCloskey & Cohen, 1989), in which the primary concern is the lack of stability in neural networks, and the main solution is to limit the extent of weight sharing across experiences by focusing on preserving past knowledge (Kirkpatrick et al., 2017; Zenke et al., 2017; Lee et al., 2017). Another popular and more complex conceptualization is the stability-plasticity dilemma (Carpenter & Grossberg, 1987). In this view, the primary concern is the balance between network stability (to preserve past knowledge) and plasticity (to rapidly learn the current experience).

We consider task agnostic future gradients, referring to gradients of the model parameters with respect to unseen data points. These can be drawn from tasks that have already been partially learned or unseen tasks.
Figure 1: A) The stability-plasticity dilemma considers plasticity with respect to the current learning and how it degrades old learning. The transfer-interference trade-off considers the stability-plasticity dilemma and its dependence on weight sharing in both forward and backward directions. This symmetric view is crucial as solutions that purely focus on reducing the degree of weight-sharing are unlikely to produce transfer in the future. B) A depiction of transfer in weight space. C) A depiction of interference in weight space.

For example, these techniques focus on balancing limited weight sharing with some mechanism to ensure fast learning (Li & Hoiem, 2016; Riemer et al., 2016a; Lopez-Paz & Ranzato, 2017; Rosenbaum et al., 2018; Lee et al., 2018; Serrà et al., 2018). In this paper, we extend this view by noting that for continual learning over an unbounded number of distributions, we need to consider weight sharing and the stability-plasticity trade-off in both the forward and backward directions in time (Figure 1A).

The transfer-interference trade-off proposed in this paper (section 2) presents a novel perspective on the goal of gradient alignment for the continual learning problem. This is right at the heart of the problem as these gradients are the update steps for SGD based optimizers during learning and there is a clear connection between gradient angles and managing the extent of weight sharing. The key difference in perspective with past conceptualizations of continual learning is that we are not just concerned with current transfer and interference with respect to past examples, but also with the dynamics of transfer and interference moving forward as we learn. Other approaches have certainly explored operational notions of transfer and interference in forward and backward directions (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018), the link to weight sharing (French, 1991; Ajemian et al., 2013), and the idea of influencing gradient alignment for continual learning before (Lopez-Paz & Ranzato, 2017). However, in past work, ad hoc changes have been made to the dynamics of weight sharing based on current learning and past learning without formulating a consistent theory about the optimal weight sharing dynamics. This new view of the problem leads to a natural meta-learning (Schmidhuber, 1987) perspective on continual learning: we would like to learn to modify our learning to affect the dynamics of transfer and interference in a general sense. To the extent that our meta-learning into the future generalizes, this should make it easier for our model to perform continual learning in non-stationary settings. We achieve this by building off past work on experience replay (Murre, 1992; Lin, 1992; Robins, 1995) that has been a mainstay for solving non-stationary problems with neural networks. We propose a novel meta-experience replay (MER) algorithm that combines experience replay with optimization based meta-learning (section 3) as a first step towards modeling this perspective. Moreover, our experiments (sections 4, 5, and 6) confirm our theory. MER shows great promise across a variety of supervised continual learning and continual reinforcement learning settings. Critically, our approach is not reliant on any provided notion of tasks and in most of the settings we explore we must detect the concept of tasks without supervision. See Appendix B for a more detailed positioning with respect to related research.
2 The Transfer-Interference Trade-off for Continual Learning
At an instant in time with parameters θ and loss L, we can define operational measures of transfer and interference between two arbitrary distinct examples (x_i, y_i) and (x_j, y_j) while training with SGD. Transfer occurs when:

\frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} > 0,   (1)

where · is the dot product operator. This implies that learning example i will, without repetition, improve performance on example j and vice versa (Figure 1B). Interference occurs when:

\frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} < 0.   (2)

Here, in contrast, learning example i will lead to unlearning (i.e. forgetting) of example j and vice versa (Figure 1C). There is weight sharing between i and j when they are learned using an overlapping set of parameters. So, potential for transfer is maximized when weight sharing is maximized while potential for interference is minimized when weight sharing is minimized (Appendix C).

Past solutions for the stability-plasticity dilemma in continual learning operate in a simplified temporal context where learning is divided into two phases: all past experiences are lumped together as old memories and the data currently being learned qualifies as new learning. In this setting, the goal is to simply minimize the interference projecting backward in time, which is generally achieved by reducing the degree of weight sharing explicitly or implicitly. In Appendix D we explain how our baseline approaches (Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017) fit within this paradigm. The important issue with this perspective, however, is that the system still has learning to do and what the future may bring is largely unknown. This makes it incumbent upon us to do nothing to potentially undermine the network's ability to effectively learn in an uncertain future. This consideration makes us extend the temporal horizon of the stability-plasticity problem forward, turning it, more generally, into a continual learning problem that we label as solving the Transfer-Interference Trade-off (Figure 1A). Specifically, it is important not only to reduce backward interference from our current point in time, but we must do so in a manner that does not limit our ability to learn in the future. This more general perspective acknowledges a subtlety in the problem: the issue of gradient alignment and thus weight sharing across examples arises both backward and forward in time.

With this temporally symmetric perspective, the transfer-interference trade-off becomes clear. Here we propose a potential solution where we learn to learn in a way that promotes gradient alignment at each point in time. The weight sharing across examples that enables transfer to improve future performance must not disrupt performance on what has come previously. As such, our work adopts a meta-learning perspective on the continual learning problem. We would like to learn to learn each example in a way that generalizes to other examples from the overall distribution.

Throughout the paper we discuss ideas in terms of the supervised learning problem formulation. Extensions to the reinforcement learning formulation are straightforward. We provide more details in Appendix N.
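To make these operational measures concrete, the short sketch below (our own illustration in PyTorch, not code from the paper) computes the gradient dot product from equations 1 and 2 for a pair of examples: a positive value indicates transfer and a negative value indicates interference. The `model` and `loss_fn` arguments are assumed to be a standard differentiable network and loss.

```python
import torch

def example_gradient(model, loss_fn, x, y):
    """Flattened gradient of the loss at a single example (x, y)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def transfer_interference_score(model, loss_fn, xi, yi, xj, yj):
    """Dot product of per-example gradients: > 0 transfer (eq. 1), < 0 interference (eq. 2)."""
    gi = example_gradient(model, loss_fn, xi, yi)
    gj = example_gradient(model, loss_fn, xj, yj)
    return torch.dot(gi, gj).item()
```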
3 A System for Learning to Learn without Forgetting
In typical offline supervised learning, we can express our optimization objective over the stationary distribution of x, y pairs within the dataset D:

\theta = \arg\min_{\theta} \; \mathbb{E}_{(x,y) \sim D} [L(x, y)],   (3)

where L is the loss function, which can be selected to fit the problem. If we would like to maximize transfer and minimize interference, we can imagine it would be useful to add an auxiliary loss to the objective to bias the learning process in that direction. Considering equations 1 and 2, one obviously beneficial choice would be to also directly consider the gradients with respect to the loss function evaluated at randomly chosen datapoints. If we could maximize the dot products between gradients at these different points, it would directly encourage the network to share parameters where gradient directions align and keep parameters separate where interference is caused by gradients in opposite directions. So, ideally we would like to optimize for the following objective:

\theta = \arg\min_{\theta} \; \mathbb{E}_{[(x_i,y_i),(x_j,y_j)] \sim D} \left[ L(x_i, y_i) + L(x_j, y_j) - \alpha \frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} \right],   (4)

where (x_i, y_i) and (x_j, y_j) are randomly sampled unique data points.

We borrow our terminology from operational measures of forward transfer and backward transfer in Lopez-Paz & Ranzato (2017), but adopt a temporally symmetric view of the phenomenon by dropping the specification of direction. Interference commonly refers to negative transfer in either direction in the literature. The inclusion of L(x_j, y_j) is largely an arbitrary notation choice as the relative prioritization of the two types of terms can be absorbed in α. We use this notation as it is most consistent with our implementation.
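As an illustration of why optimizing equation 4 directly is expensive, the sketch below (our own PyTorch example, not the authors' code) builds the regularized objective with `create_graph=True` so it can be backpropagated; differentiating the gradient dot product requires second derivatives of the loss, which is exactly the cost MER avoids with a first-order approximation.

```python
import torch

def gradient_alignment_objective(model, loss_fn, xi, yi, xj, yj, alpha):
    """Equation 4: L_i + L_j - alpha * (dL_i/dtheta . dL_j/dtheta), as a differentiable scalar."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss_i = loss_fn(model(xi), yi)
    loss_j = loss_fn(model(xj), yj)
    grads_i = torch.autograd.grad(loss_i, params, create_graph=True)
    grads_j = torch.autograd.grad(loss_j, params, create_graph=True)
    dot = sum((gi * gj).sum() for gi, gj in zip(grads_i, grads_j))
    return loss_i + loss_j - alpha * dot  # backpropagating this term needs second derivatives
```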
We will attempt to design a continual learning system that optimizes for this objective. However, there are multiple problems that must be addressed to implement this kind of learning process in practice. The first problem is that continual learning deals with learning over a non-stationary stream of data. We address this by implementing an experience replay module that augments online learning so that we can approximately optimize over the stationary distribution of all examples seen so far. Another practical problem is that the gradients of this loss depend on the second derivative of the loss function, which is expensive to compute. We address this by indirectly approximating the objective to a first order Taylor expansion using a meta-learning algorithm with minimal computational overhead.

Algorithm 1: Meta-Experience Replay (MER)

procedure TRAIN(D, θ, α, β, γ, s, k)
    M ← {}
    for t = 1, ..., T do
        for (x, y) in D_t do
            // Draw batches from buffer:
            B_1, ..., B_s ← sample(x, y, s, k, M)
            θ^A_0 ← θ
            for i = 1, ..., s do
                θ^W_{i,0} ← θ
                for j = 1, ..., k do
                    x_c, y_c ← B_i[j]
                    θ^W_{i,j} ← SGD(x_c, y_c, θ^W_{i,j-1}, α)
                end for
                // Within batch Reptile meta-update:
                θ ← θ^W_{i,0} + β(θ^W_{i,k} − θ^W_{i,0})
                θ^A_i ← θ
            end for
            // Across batch Reptile meta-update:
            θ ← θ^A_0 + γ(θ^A_s − θ^A_0)
            // Reservoir sampling memory update:
            M ← M ∪ {(x, y)} (algorithm 3)
        end for
    end for
    return θ, M
end procedure

3.1 Experience Replay
Learning objective: The continual lifelong learning setting poses a challenge for the optimization of neural networks as examples come one by one in a non-stationary stream. Instead, we would like our network to optimize over the stationary distribution of all examples seen so far. Experience replay (Lin, 1992; Murre, 1992) is an old technique that remains a central component of deep learning systems attempting to learn in non-stationary settings, and we will adopt here conventions from recent work (Zhang & Sutton, 2017; Riemer et al., 2017b) leveraging this approach. The central feature of experience replay is keeping a memory of examples seen M that is interleaved with the training of the current example with the goal of making training more stable. As a result, experience replay approximates the objective in equation 3 to the extent that M approximates D:

\theta = \arg\min_{\theta} \; \mathbb{E}_{(x,y) \sim M} [L(x, y)].   (5)

M has a current size M_size and maximum size M_max. In our work, we update the buffer with reservoir sampling (Appendix F). This ensures that at every time-step the probability that any of the N examples seen has of being in the buffer is equal to M_size/N. The content of the buffer resembles a stationary distribution over all examples seen to the extent that the items stored capture the variation of past examples. Following the standard practice in offline learning, we train by randomly sampling a batch B from the distribution captured by M.

Prioritizing the current example: the variant of experience replay we explore differs from offline learning in that the current example has a special role ensuring that it is always interleaved with the examples sampled from the replay buffer. This is because before we proceed to the next example, we want to make sure our algorithm has the ability to optimize for the current example (particularly if it is not added to the memory). Over N examples seen, this still implies that we have trained with each example as the current example with probability per step of 1/N. We provide algorithms further detailing how experience replay is used in this work in Appendix G (algorithms 4 and 5).
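A minimal reservoir sampling update in Python, as an illustration of the buffer maintenance described above (our own sketch of the classic algorithm from Vitter (1985); the buffer is assumed to be a plain list and `n_seen` to count how many examples were observed before the current one):

```python
import random

def reservoir_update(buffer, max_size, n_seen, example):
    """Store `example` so that each of the N examples seen so far stays in the buffer
    with probability max_size / N once the buffer is full."""
    if len(buffer) < max_size:
        buffer.append(example)
    else:
        j = random.randint(0, n_seen)  # uniform index over the n_seen + 1 examples seen so far
        if j < max_size:
            buffer[j] = example
```

In Algorithm 1 this corresponds to the memory update step M ← M ∪ {(x, y)}.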
Concerns about storing examples: Obviously, it is not scalable to store every experience seen in memory. As such, in this work we focus on showing that we can achieve greater performance than baseline techniques when each approach is provided with only a small memory buffer.
3.2 Combining Experience Replay with Optimization Based Meta-Learning
First order meta-learning: One of the most popular meta-learning algorithms to date is Model Agnostic Meta-Learning (MAML) (Finn et al., 2017). MAML is an optimization based meta-learning algorithm with nice properties such as the ability to approximate any learning algorithm and the ability to generalize well to learning data outside of the previous distribution (Finn & Levine, 2017). One aspect of MAML that limits its scalability is the need to explicitly compute second derivatives. The authors proposed a variant called first-order MAML (FOMAML), which is defined by ignoring the second derivative terms to address this issue, and surprisingly found that it achieved very similar performance. Recently, this phenomenon was explained by Nichol & Schulman (2018), who noted through a Taylor expansion that the two algorithms were approximately optimizing for the same loss function. Nichol & Schulman (2018) also proposed an algorithm, Reptile, that efficiently optimizes for approximately the same objective while not requiring that the data be split into training and testing splits for each task learned as MAML does. Reptile is implemented by optimizing across s batches of data sequentially with an SGD based optimizer and learning rate α. After training on these batches, we take the initial parameters before training θ_0 and update them to θ_0 ← θ_0 + β(θ_k − θ_0), where β is the learning rate for the meta-learning update. The process repeats for each series of s batches (algorithm 2). Shown in terms of gradients in Nichol & Schulman (2018), Reptile approximately optimizes for the following objective over a set of s batches:

\theta = \arg\min_{\theta} \; \mathbb{E}_{B_1,...,B_s \sim D} \left[ 2 \sum_{i=1}^{s} \left[ L(B_i) - \sum_{j=1}^{i-1} \alpha \frac{\partial L(B_i)}{\partial \theta} \cdot \frac{\partial L(B_j)}{\partial \theta} \right] \right],   (6)

where B_1, ..., B_s are batches within D. This is similar to our motivation in equation 4 to the extent that gradients produced on these batches approximate samples from the stationary distribution.
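The sketch below (our own Python/PyTorch illustration, not the Reptile reference code) shows this meta-update: copy the parameters, run sequential SGD over the batches, and then move the original parameters a fraction β of the way toward the fine-tuned ones.

```python
import copy
import torch

def reptile_update(model, loss_fn, batches, alpha, beta):
    """One Reptile meta-step over `batches`, a list of (x, y) mini-batches."""
    theta_0 = copy.deepcopy(model.state_dict())        # parameters before inner-loop training
    inner_opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for x, y in batches:                               # sequential SGD over the batches
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()
    with torch.no_grad():                              # meta-update: theta_0 + beta * (theta_final - theta_0)
        meta_state = {name: theta_0[name] + beta * (p - theta_0[name])
                      for name, p in model.state_dict().items()}
    model.load_state_dict(meta_state)
```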
The MER learning objective: In this work, we modify the Reptile algorithm to properly integrate it with an experience replay module, facilitating continual learning while maximizing transfer and minimizing interference. As we describe in more detail during the derivation in Appendix I, achieving the Reptile objective in an online setting where examples are provided sequentially is non-trivial and is in part only achievable because of our sampling strategies for both the buffer and batch. Following our remarks about experience replay from the prior section, this allows us to optimize for the following objective in a continual learning setting using our proposed MER algorithm:

\theta = \arg\min_{\theta} \; \mathbb{E}_{[(x_{11},y_{11}),...,(x_{sk},y_{sk})] \sim M} \left[ 2 \sum_{i=1}^{s} \sum_{j=1}^{k} \left[ L(x_{ij}, y_{ij}) - \sum_{q=1}^{i-1} \sum_{r=1}^{j-1} \alpha \frac{\partial L(x_{ij}, y_{ij})}{\partial \theta} \cdot \frac{\partial L(x_{qr}, y_{qr})}{\partial \theta} \right] \right].   (7)
The MER algorithm: MER maintains an experience replay style memory M with reservoir sampling and at each time step draws s batches including k − 1 random samples from the buffer to be trained alongside the current example. Each of the k examples within each batch is treated as its own Reptile batch of size 1 with an inner loop Reptile meta-update after that batch is processed. We then apply the Reptile meta-update again in an outer loop across the s batches. We provide further details for MER in algorithm 1. This procedure approximates the objective of equation 7 when β = 1. The sample function produces s batches for updates. Each batch is created by first adding the current example and then interleaving k − 1 random examples from M.
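For concreteness, here is a compact Python/PyTorch sketch of the update in Algorithm 1 (our own illustration, not the authors' reference implementation; the current example is simply appended to each batch rather than interleaved at a random position, and the buffer is assumed to be a list of (x, y) pairs maintained with the reservoir sampling update shown earlier).

```python
import copy
import random
import torch

def mer_update(model, loss_fn, buffer, x, y, alpha, beta, gamma, s, k):
    """One MER step on the current example (x, y), following Algorithm 1."""
    inner_opt = torch.optim.SGD(model.parameters(), lr=alpha)
    theta_A0 = copy.deepcopy(model.state_dict())             # parameters before all s batches
    for _ in range(s):
        memories = random.sample(buffer, min(k - 1, len(buffer)))
        batch = memories + [(x, y)]                           # k - 1 memories plus the current example
        theta_W0 = copy.deepcopy(model.state_dict())          # parameters before this batch
        for xb, yb in batch:                                  # each example is its own Reptile batch of size 1
            inner_opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            inner_opt.step()
        with torch.no_grad():                                 # within-batch Reptile meta-update
            model.load_state_dict({n: theta_W0[n] + beta * (p - theta_W0[n])
                                   for n, p in model.state_dict().items()})
    with torch.no_grad():                                     # across-batch Reptile meta-update
        model.load_state_dict({n: theta_A0[n] + gamma * (p - theta_A0[n])
                               for n, p in model.state_dict().items()})
```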
Controlling the degree of regularization: In light of our ideal objective in equation 4, we can see that using an SGD batch size of 1 has an advantage over larger batches because it allows the second derivative information conveyed to the algorithm to be fine grained at the example level. Another reason to use sample level effective batches is that for a given number of samples drawn from the buffer, we maximize s from equation 6. In equation 6, the typical offline learning loss has a weighting proportional to s and the regularizer term to maximize transfer and minimize interference has a weighting proportional to αs(s − 1)/2. This implies that by maximizing the effective s we can put more weight on the regularization term. We found that for a fixed number of examples drawn from M, we consistently performed better converting to a long list of individual samples than we did using proper batches as in Nichol & Schulman (2018) for few shot learning.
Prioritizing current learning: To ensure strong regularization, we would like our number of batches processed in a Reptile update to be large enough that experience replay alone would start to overfit to M. As such, we also need to make sure we provide enough priority to learning the current example, particularly because we may not store it in M. To achieve this in algorithm 1, we sample s separate batches from M that are processed sequentially and each interleaved with the current example. In Appendix H we also outline two additional variants of MER with very similar properties in that they effectively approximate the same objective. In one we choose one big batch consisting of sk − s memories and s copies of the current example (algorithm 6). In the other, we choose one memory batch of size k − 1 with a special current item learning rate of sα (algorithm 7).
Unique properties: In the end, our approach amounts to a quite easy to implement and computationally efficient extension of SGD, which is applied to an experience replay buffer by leveraging the machinery of past work on optimization based meta-learning. However, the emergent regularization on learning is totally different from those previously considered. Past work on optimization based meta-learning has enabled fast learning on incoming data without considering past data. Meanwhile, past work on experience replay only focused on stabilizing learning by approximating stationary conditions without altering model parameters to change the dynamics of transfer and interference.
4 Evaluation for Supervised Continual Lifelong Learning
To test the efficacy of MER we compare it to relevant baselines for continual learning of many supervised tasks from Lopez-Paz & Ranzato (2017) (see Appendix D for in-depth descriptions):

• Online: represents online learning performance of a model trained straightforwardly one example at a time on the incoming non-stationary training data by simply applying SGD.
• Independent: an independent predictor per task with fewer hidden units proportional to the number of tasks. When useful, it can be initialized by cloning the last predictor.
• Task Input: has the same architecture as Online, but with a dedicated input layer per task.
• EWC: Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is an algorithm that modifies online learning where the loss is regularized to avoid catastrophic forgetting.
• GEM: Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) is an approach for making efficient use of episodic storage by following gradients on incoming examples to the maximum extent while altering them so that they do not interfere with past memories. An independent ad hoc analysis is performed to alter each incoming gradient. In contrast to MER, nothing generalizable is learned across examples about how to alter gradients.

We follow Lopez-Paz & Ranzato (2017) and consider final retained accuracy across all tasks after training sequentially on all tasks as our main metric for comparing approaches. Moving forward we will refer to this metric as retained accuracy (RA). In order to reveal more characteristics of the learning behavior, we also report the learning accuracy (LA), which is the average accuracy for each task directly after it is learned. Additionally, we report the backward transfer and interference (BTI) as the average change in accuracy from when a task is learned to the end of training. A highly negative BTI reflects catastrophic forgetting. Forward transfer and interference (Lopez-Paz & Ranzato, 2017) is only applicable for one task we explore, so we provide details in Appendix K.
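As a concrete reading of these metrics, the following sketch (our own illustration, not from the paper) computes RA, LA, and BTI from a matrix `acc` in which `acc[i, j]` is the test accuracy on task j measured right after training on task i.

```python
import numpy as np

def continual_learning_metrics(acc):
    """acc[i, j]: accuracy on task j after finishing task i. Returns (RA, LA, BTI)."""
    acc = np.asarray(acc, dtype=float)
    num_tasks = acc.shape[0]
    ra = acc[-1].mean()                                                # retained accuracy at the end of training
    la = np.mean([acc[t, t] for t in range(num_tasks)])                # accuracy on each task right after it is learned
    bti = np.mean([acc[-1, t] - acc[t, t] for t in range(num_tasks)])  # average change from learning time to the end
    return ra, la, bti
```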
Question 1: How does MER perform on supervised continual learning benchmarks?
To address this question we consider two continual learning benchmarks from Lopez-Paz & Ranzato (2017). MNIST Permutations is a variant of MNIST first proposed in Kirkpatrick et al. (2017) where each task is transformed by a fixed permutation of the MNIST pixels. As such, the input distribution of each task is unrelated. MNIST Rotations is another variant of MNIST proposed in Lopez-Paz & Ranzato (2017) where each task contains digits rotated by a fixed angle between 0 and 180 degrees. We follow the standard benchmark setting from Lopez-Paz & Ranzato (2017) using a modest memory buffer of size 5120 to learn 1000 sampled examples across each of 20 tasks. We provide detailed information about our architectures and hyperparameters in Appendix J.

In Table 1 we report results on these benchmarks in comparison to our baseline approaches. Clearly GEM outperforms our other baselines, but our approach adds significant value over GEM in terms of retained accuracy on both benchmarks. MER achieves this by striking a superior balance between transfer and interference with respect to the past and future data. MER displays the best adaptation to incoming tasks, while also providing very strong retention of knowledge when learning future tasks. EWC and using a task specific input layer both also lead to gains over standard online learning in terms of retained accuracy. However, they are quite far below the performance of approaches that make use of episodic storage. While EWC does not store examples, in storing the Fisher information for each task it accrues more incremental resources than the episodic storage approaches.
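To make the benchmark construction concrete, the sketch below (our own illustration, not the benchmark code) builds an MNIST Permutations style task stream; `x` is assumed to be flattened MNIST images of shape (N, 784) and `y` the corresponding labels.

```python
import numpy as np

def make_permuted_mnist_tasks(x, y, n_tasks=20, examples_per_task=1000, seed=0):
    """Each task applies its own fixed pixel permutation to a fresh sample of examples."""
    rng = np.random.RandomState(seed)
    tasks = []
    for _ in range(n_tasks):
        perm = rng.permutation(x.shape[1])                       # fixed permutation for this task
        idx = rng.choice(x.shape[0], examples_per_task, replace=False)
        tasks.append((x[idx][:, perm], y[idx]))
    return tasks
```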
Question 2: How do the performance gains from MER vary as a function of the buffer size?
To make progress towards the greater goals of lifelong learning, we would like our algorithm to make the most use of even a modest buffer. This is because in extremely large scale settings it is unrealistic to assume a system can store a large percentage of previous examples in memory. As such, we would like to compare MER to GEM, which is known to perform well with an extremely small memory buffer (Lopez-Paz & Ranzato, 2017). We consider a buffer size of 500, which is over 10 times smaller than the standard setting on these benchmarks. Additionally, we also consider a buffer size of 200, matching the smallest setting explored in Lopez-Paz & Ranzato (2017). This setting corresponds to an average storage of 1 example for each combination of task and class. We report our results in Table 2. The benefits of MER seem to grow as the buffer becomes smaller. In the smallest setting, MER provides more than a 10% boost in retained accuracy on both benchmarks.

Table 1: Performance on continual lifelong learning 20 tasks benchmarks from Lopez-Paz & Ranzato (2017).

Model       | MNIST Rotations RA | MNIST Rotations BTI | MNIST Permutations RA | MNIST Permutations BTI
Online      | 53.38              | -5.44               |                       | -13.76
Independent | 60.74              | -                   | 55.80                 | -
Task Input  | 79.98              | -1.06               |                       | -0.74
EWC         | 57.96              | -20.42              |                       | -13.32
GEM         | 87.58              | +1.12               |                       | +2.56
MER         |                    | +1.94               |                       | -0.86

Table 2: Performance varying the buffer size on continual learning benchmarks (Lopez-Paz & Ranzato, 2017).

Model | Buffer | MNIST Rotations RA | MNIST Rotations BTI | MNIST Permutations BTI
GEM   | 5120   | 87.58              | +1.12               | +2.56
GEM   | 500    | 74.88              | -11.02              | -11.02
GEM   | 200    | 67.38              | -18.02              | -24.42
MER   | 5120   |                    | +1.94               | -0.86
MER   | 500    |                    | -3.58               | -6.40
MER   | 200    |                    | -5.60               | -9.96
Question 3: How effective is MER at dealing with increasingly non-stationary settings?
Another larger goal of lifelong learning is to enable continual learning with only relatively few examples per task. This setting is particularly difficult because we have less data to characterize each class to learn from and our distribution is increasingly non-stationary over a fixed amount of training. We would like to explore how various models perform in this kind of setting. To do this we consider two new benchmarks. Many Permutations is a variant of MNIST Permutations that has 5 times more tasks (100 total) and 5 times fewer training examples per task (200 each). Meanwhile we also explore the Omniglot (Lake et al., 2011) benchmark, treating each of the 50 alphabets as a task (see Appendix J for experimental details). Following multi-task learning conventions, 90% of the data is used for training and 10% is used for testing (Yang & Hospedales, 2017). Overall there are 1623 characters. We learn each character and task sequentially with a task specific output layer.

We report continual learning results using these new datasets in Table 3. The effect on Many Permutations of efficiently using episodic storage becomes even more pronounced when the setting becomes more non-stationary. GEM and MER both achieve nearly double the performance of EWC and online learning. We also see that increasingly non-stationary settings lead to a larger performance gain for MER over GEM. Gains are quite significant for Many Permutations and remarkable for Omniglot. Omniglot is even more non-stationary, including slightly fewer examples per task, and MER nearly quadruples the performance of baseline techniques.
Table 3: Performance on many task non-stationary continual lifelong learning benchmarks.

Model  | Buffer | Many Permutations RA | Many Permutations BTI | Omniglot BTI
Online | 0      | 32.62                | -19.06                | -1.02
EWC    | 0      | 33.46                | -17.84                | -4.80
GEM    | 5120   | 56.76                | -2.92                 | +14.19
GEM    | 500    | 32.14                | -23.52                | -
MER    | 5120   |                      | +3.08                 | +6.11
MER    | 500    |                      | -17.78                | +3.27
Figure 2: Left: a sequence of frames from Catcher and Flappy Bird respectively. The goal in Catcher is to capture the falling pellet by moving the racket on the bottom of the screen. In Flappy Bird, the goal is to navigate the bird through as many pipes as possible by making it go up or letting it fall. Right: average score in Catcher (above) and Flappy Bird (below) for evaluation on the first task, which has slower falling pellets and a larger pipe gap.

Considering the poor performance of online learning and EWC, it is natural to question whether or not examples were learned in the first place. We experiment with using as many as 100 gradient descent steps per incoming example to ensure each example is learned when first seen. However, due to the extremely non-stationary setting, no run of any variant we tried surpassed 5.5% retained accuracy. GEM also has major deficits for learning on Omniglot that are resolved by MER, which achieves far better performance when it comes to quickly learning the current task. GEM maintains a buffer using a recent item based sampling strategy and thus cannot deal with non-stationarity within the task nearly as well as reservoir sampling. Additionally, we found that the optimization based on the buffer was significantly less effective and less reliable, as the quadratic program fails for many hyperparameter values that lead to non-positive definite matrices. Unfortunately, we could not get GEM to consistently converge on Omniglot for a memory size of 500 (significantly less than the number of classes), meanwhile MER handles it well. In fact, MER greatly outperforms GEM with an order of magnitude smaller buffer.
Figure 3: Further details on Omniglot performance characteristics for each model.
We provide additional details about our experiments on Omniglot in Figure 3. We plot retained training accuracy, retained testing accuracy, and computation time for the entire training period using one CPU. We find that MER strikes the best balance of computational efficiency and performance even when using algorithm 1 for MER, which performs more computation than algorithm 7. The computation involved in the GEM update does not scale well to large CNN models like those that are popular for Omniglot. MER is far better able to fit the training data than our baseline models while maintaining a computational efficiency closer to online update methods like EWC than GEM.
5 Evaluation for Continual Reinforcement Learning
Question 4: Can MER improve a DQN with ER in continual reinforcement learning settings?
We considered the evaluation of MER in a continual reinforcement learning setting where the environment is highly non-stationary. In order to produce these non-stationary environments in a controlled way suitable for our experimental purposes, we utilized arcade games provided by Tasfi (2016). Specifically, we used Catcher and Flappy Bird, two simple but interesting enough environments (see Appendix N.1 for details). For the purposes of our explanation, we will call each set of fixed game-dependent parameters a task. The multi-task setting is then built by introducing changes in these parameters, resulting in non-stationarity across tasks. Each agent is once again evaluated based on its performance over time on all tasks. Agents are not provided task information, forcing them to identify changes in game play on their own. Our model uses a standard DQN model, developed for Atari (Mnih et al., 2015). See Appendix N.2 for implementation details.

In Catcher, we then obtain different tasks by incrementally increasing the pellet velocity a total of 5 times during training. In Flappy Bird, the different tasks are obtained by incrementally reducing the separation between upper and lower pipes a total of 5 times during training. In Figure 4, we show the performance in Catcher when trained sequentially on 6 different tasks for 25k frames each to a maximum of 150k frames, evaluated at each point in time in all 6 tasks. Under these non-stationary conditions, a DQN using MER performs consistently better than the standard DQN with an experience replay buffer (see Appendix N.4 for further comments and ablation results). If we take as inspiration how humans perform, in the last stages of training we hope that a player that obtains good results in later tasks will also obtain good results in the first tasks, as the first tasks are subsumed in the latter ones. For example, in Catcher, the pellet moves faster in later tasks, and thus we expect to be able to do well on the first task. However, DQN forgets significantly how to get slowly moving pellets. In contrast, DQN-MER exhibits minimal or no forgetting after training on the rest of the tasks. This behavior is intuitive because we would expect transfer to happen naturally in this setting. We see similar behavior for Flappy Bird. DQN-MER becomes a Platinum player on the first task when it is learning the third task. This is a more difficult environment in which the pipe gap is noticeably smaller (see Appendix N.4). DQN-MER exhibits the kind of learning patterns expected from humans for these games, while a standard DQN struggles to generalize as the game changes and to retain knowledge over time.

Figure 4: Continual learning performance for a non-stationary version of Catcher. Graphs show averaged values over ten validation episodes across five different seeds. Vertical grid lines on the x-axis indicate a task switch.

6 Further Analysis of the Approach
In this section we would like to dive deeper into how MER works. To do so we run additional detailed experiments across our three MNIST based continual learning benchmarks.
Question 5: Does MER lead to a shift in the distribution of gradient dot products?
We would like to directly verify that MER achieves our motivation in equation 7 and results insignificant changes in the distribution of gradient dot products between new incoming examples andpast examples over the course of learning when compared to experience replay (ER) from algorithm Agents are not provided task information, forcing them to identify changes in game play on their own. ± -1.652 ± -1.280 ± MER +0.042 ± +0.017 ± +0.131 ± Table 4: Analysis of the mean dot product across the period of learning between gradients on incomingexamples and gradients on randomly sampled past examples across 5 runs on MNIST based benchmarks.
5. For these experiments, we maintain a history of all examples seen that is totally separate from ournotion of memory buffers that only include a partial history of examples. Every time we receive anew example we use the current model to extract a gradient direction and we also randomly samplefive examples from the previous history. We save the dot products of the incoming example gradientwith these five past example gradients and consider the mean of the distribution of dot productsseen over the course of learning for each model. We run this experiment on the best hyperparamatersetting for both ER and MER from algorithm 6 with one batch per example for fair comparison. Eachmodel is evaluated five times over the course of learning. We report mean and standard deviationsof the mean gradient dot product across runs in Table 4. We can thus verify that a very significantand reproducible difference in the mean gradient encountered is seen for MER in comparison to ERalone. This difference alters the learning process making incoming examples on average result inslight transfer rather than significant interference. This analysis confirms the desired effect of theobjective function in equation 7. For these tasks there are enough similarities that our meta-learninggeneralizes very well into the future. We should also expect it to perform well during drastic domainshifts like other meta-learning algorithms driven by SGD alone (Finn & Levine, 2017).
Question 6: What components of MER are most important?
We would like to further analyze our MER model to understand what components add the most value and when. We want to understand how powerful our proposed variants of ER are on their own and how much is added by adding meta-learning to ER. In Appendix L we provide detailed results considering ablated baselines for our experiments on the MNIST lifelong learning benchmarks. Our versions of ER consistently provide gains over GEM on their own, but the techniques perform very comparably when we also maintain GEM's buffer with reservoir sampling or use ER with a GEM style buffer. Additionally, we see that adding meta-learning to ER consistently results in performance gains. In fact, meta-learning appears to provide increasing value for smaller buffers. In Appendix M, we provide further validation that our results are reproducible across runs and seeds.

We would also like to compare the variants of MER proposed in algorithms 1, 6, and 7. Conceptually algorithms 1 and 7 represent different mechanisms of increasing the importance of the current example in algorithm 6. We find that all variants of MER result in significant improvements on ER. Meanwhile, the variants that increase the importance of the current example see a further improvement in performance, performing quite comparably to each other. Overall, in our MNIST experiments algorithm 7 displays the best tradeoff of computational efficiency and performance. Finally, we conducted experiments demonstrating that adaptive optimizers like Adam and RMSProp cannot account for the gap between ER and MER. Particularly for smaller buffer sizes, these approaches overfit more on the buffer and actually hurt generalization in comparison to SGD.
7 Conclusion
In this paper we have cast a new perspective on the problem of continual learning in terms of a fundamental trade-off between transfer and interference. Exploiting this perspective, we have in turn developed a new algorithm, Meta-Experience Replay (MER), that is well suited for application to general purpose continual learning problems. We have demonstrated that MER regularizes the objective of experience replay so that gradients on incoming examples are more likely to have transfer and less likely to have interference with respect to past examples. The result is a general purpose solution to continual learning problems that outperforms strong baselines for both supervised continual learning benchmarks and continual learning in non-stationary reinforcement learning environments.

Techniques for continual learning have been largely driven by different conceptualizations of the fundamental problem encountered by neural networks. We hope that the transfer-interference trade-off can be a useful problem view for future work to exploit, with MER as a first successful example. Code is available at https://github.com/mattriemer/mer.

Acknowledgments
We would like to thank Pouya Bashivan, Christopher Potts, Dan Jurafsky, and Joshua Greene for their input and support of this work. Additionally, we would like to thank Arslan Chaudhry and Marc'Aurelio Ranzato for their helpful comments and discussions. We also thank the three anonymous reviewers for their valuable suggestions. This research was supported by the MIT-IBM Watson AI Lab, and is based in part upon work supported by the Stanford Data Science Initiative and by the NSF under Grant No. BCS-1456077 and the NSF Award IIS-1514268.

References
Robert Ajemian, Alessandro D'Ausilio, Helene Moorman, and Emilio Bizzi. A theory for how sensorimotor skills are learned and retained in noisy and nonstationary neural circuits. Proceedings of the National Academy of Sciences, 110(52):E5078–E5087, 2013.
Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. ICLR, 2018.
Rahaf Aljundi, Jay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings CVPR 2017, pp. 3366–3375, 2017.
Bernard Ans and Stéphane Rousset. Avoiding catastrophic forgetting by coupling two reverberating neural networks. Comptes Rendus de l'Académie des Sciences-Series III-Sciences de la Vie, 320(12):989–997, 1997.
Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. 2017.
Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
Gail A Carpenter and Stephen Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37(1):54–115, 1987.
Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. doi: 10.1023/A:1007379606734. URL http://dx.doi.org/10.1023/A:1007379606734.
Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. arXiv preprint arXiv:1801.10112, 2018.
Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
Robert M. French. Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Proceedings of the 13th Annual Cognitive Science Society Conference, pp. 173–178. Erlbaum, 1991.
Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
Stefano Fusi, Patrick J Drew, and Larry F Abbott. Cascade models of synaptically stored memories. Neuron, 45(4):599–611, 2005.
Benjamin Frederick Goodrich. Neuron clustering for mitigating catastrophic forgetting in supervised and reinforcement learning. 2015.
Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. 2012.
Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. 1987.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.
Subhaneil Lahiri and Surya Ganguli. A memory frontier for complex synapses. In Advances in Neural Information Processing Systems, pp. 1034–1042, 2013.
Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Cognitive Science Society, volume 33, 2011.
Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang. Lifelong learning with dynamically expandable networks. ICLR, 2018.
Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652–4662, 2017.
Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pp. 614–629. Springer, 2016.
Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continuum learning. NIPS, 2017.
Daniel J Mankowitz, Timothy A Mann, Pierre-Luc Bacon, Doina Precup, and Shie Mannor. Learning robust options. arXiv preprint arXiv:1802.03236, 2018.
James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003, 2016.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Jacob MJ Murre. Learning and categorization in modular neural networks. Lawrence Erlbaum Associates, Inc, 1992.
Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H Lampert. icarl: Incremental classifier and representation learning. CVPR, 2017.
Matthew Riemer, Sophia Krasikov, and Harini Srinivasan. A deep learning and knowledge transfer based architecture for social media user characteristic determination. In Proceedings of the Third International Workshop on Natural Language Processing for Social Media, pp. 39–47, 2015.
Matthew Riemer, Elham Khabiri, and Richard Goodwin. Representation stability as a regularizer for improved text analytics transfer learning. arXiv preprint arXiv:1704.03617, 2016a.
Matthew Riemer, Aditya Vempaty, Flavio Calmon, Fenno Heath, Richard Hull, and Elham Khabiri. Correcting forecasts with multifactor neural attention. In International Conference on Machine Learning, pp. 3010–3019, 2016b.
Matthew Riemer, Michele Franceschini, Djallel Bouneffouf, and Tim Klinger. Generative knowledge distillation for general purpose function compression. NIPS 2017 Workshop on Teaching Machines, Robots, and Humans, 5:30, 2017a.
Matthew Riemer, Tim Klinger, Michele Franceschini, and Djallel Bouneffouf. Scalable recollections for continual lifelong learning. arXiv preprint arXiv:1711.06761, 2017b.
Matthew Riemer, Miao Liu, and Gerald Tesauro. Learning abstract options. NIPS, 2018.
Mark Bishop Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712, 1994.
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. ICLR, 2018.
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.
Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
Jürgen Schmidhuber. Optimal ordered problem solver. Machine Learning, 54(3):211–254, 2004.
Jürgen Schmidhuber. Powerplay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, 4:313, 2013.
Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Norman Tasfi. Pygame learning environment. https://github.com/ntasfi/PyGame-Learning-Environment, 2016.
Sebastian Thrun. Lifelong learning perspective for mobile robot control. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems, volume 1, pp. 23–30, 1994.
Sebastian Thrun. Is learning the n-th thing any easier than learning the first? Advances in Neural Information Processing Systems, pp. 640–646, 1996.
Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. ICLR, 2017.
Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995, 2017.
Shangtong Zhang and Richard S Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.
A Continual Learning Problem Formulation
In the classical offline supervised learning setting, a learning agent is given a fixed training dataset D = {(x_i, y_i)}_{i=1}^{n} of n samples, each containing an input feature vector x_i ∈ X associated with the corresponding output (target, or label) y_i ∈ Y; a common assumption is that the training samples are i.i.d. samples drawn from the same unknown joint probability distribution P(x, y). The learning task is often formulated as a function approximation problem, i.e. finding a function, or model, f_θ(x) : X → Y from a given class of models (e.g., neural networks, decision trees, linear functions, etc.) where θ are the parameters estimated from data. Given a loss function L(f_θ(x), y), the parameter estimation is formulated as an empirical risk minimization problem:

\min_{\theta} \; \frac{1}{|D|} \sum_{(x_i, y_i) \sim D} L(f_\theta(x_i), y_i).

On the contrary, the online learning setting does not assume a fixed training dataset but rather a stream of data samples, where unlabeled feature vectors arrive one at a time, or in small mini-batches, and the learner must assign labels to those inputs, receive the correct labels, and update the model accordingly, in iterative fashion. While classical online learning assumes i.i.d. samples, continual or lifelong learning does not make such an assumption, and requires a learning agent to handle non-stationary data streams.

In this work, we define continual learning as online learning from a non-stationary input data stream, with a specific type of non-stationarity as defined below. Namely, we follow a commonly used setting to define non-stationary conditions for continual learning, dubbed locally i.i.d. by Lopez-Paz & Ranzato (2017), where the agent learns over a sequence of separate stationary distributions one after another. We call the individual stationary distributions tasks, where each task t_k is an online supervised learning problem associated with its own data probability distribution P_k(x, y). Namely, we are given a (potentially infinite) sequence (x_1, y_1, t_1), ..., (x_i, y_i, t_i), ..., (x_{i+j}, y_{i+j}, t_{i+j}).

While many continual learning methods assume the task descriptors t_k are available to a learner, we are interested in developing approaches which do not have to rely on such information and can learn continuously without explicit announcement of the task change. Borrowing terminology from Chaudhry et al. (2018), we explore the single-headed setting in most of our experiments, which keeps learning a common function f_θ across changing tasks. In contrast, multi-headed learning, which we consider for our Omniglot experiments, involves a separate final classification layer for each task. This makes more sense in the case of the Omniglot dataset, where the number of classes varies considerably from task to task. We should also note that for Omniglot we consider a setting that is locally i.i.d. at the class level rather than the task level.
B Relation to Past Work
With regard to the continual learning setting specifically, other recent work has explored similar operational measures of transfer and interference. For example, the notions of Forward Transfer and Backward Transfer were explored in Lopez-Paz & Ranzato (2017). However, the approach of that work, GEM, was primarily concerned with solving the classic stability-plasticity dilemma (Carpenter & Grossberg, 1987) at a specific instance of time. Adjustments to gradients on the current data are made in an ad hoc manner, solving a quadratic program separately for each example. In our work we try to learn a generalizable theory about weight sharing that can learn to influence the distribution of gradients not just in the past and present, but in the future as well. Additionally, in Chaudhry et al. (2018) similar ideas were explored with operational measures of intransigence (the inability to learn new data) and forgetting (the loss of previous performance). These measures are also intimately related to the stability-plasticity dilemma as intransigence is high when plasticity is low and forgetting is high when stability is low. The major distinction in the transfer-interference trade-off proposed in this work is that we aim to learn the optimal weight sharing scheme to optimize for the stability-plasticity dilemma with the hope that our learning about weight sharing will improve the stability and efficacy of learning on unseen data as well.

With regard to the problem of weight-sharing in neural networks more generally, a host of different strategies have been proposed in the past to deal with the problems of catastrophic forgetting and/or the stability-plasticity dilemma (for review, see French (1999)). For example, one strategy for alleviating catastrophic forgetting is to make distributed representations less distributed, or semi-distributed (French, 1991), for the case of past learning. Activation sharpening as introduced by French (1991) is a prominent example. A second strategy known as dual network models (McClelland et al., 1995; Ans & Rousset, 1997) is based on the neurobiological finding that hippocampal and cortical circuits contribute differentially to memory. The cortical circuits are highly distributed with overlapping representations suitable for task generalization, while the more sparse hippocampal representations tend to be non-overlapping. The existence of dual circuits provides an extra degree of freedom for balancing the dual constraints of stability and plasticity. In a similar spirit, models have been proposed that have two classes of weights operating on two different timescales (Hinton & Plaut, 1987). A third strategy also motivated by neurobiological considerations is the use of latent synaptic dynamics (Fusi et al., 2005; Lahiri & Ganguli, 2013). Here the basic idea is that synaptic strength is determined by multiple variables, including latent ones not easily observed, operating at different timescales such that their net effect is to provide the system with additional degrees of freedom to store past experience without interfering with current learning. A fourth strategy is the use of feedback mechanisms to stabilize representations (Carpenter & Grossberg, 1987; Murre, 1992). In this class of models, a previously experienced memory will trigger top down feedback that prevents plasticity, while novel stimuli that experience no such feedback trigger plasticity.
All of these approaches have their own strengths and weaknesses with respect to the stability-plasticity dilemma and, by extension, the transfer-interference trade-off we propose.

Another relevant work is the POWERPLAY framework (Schmidhuber, 2004; 2013), which is a method for asymptotically optimal curriculum learning that by definition cannot forget previously learned skills. POWERPLAY also uses environment-independent replay of behavioral traces to avoid forgetting previous skills. However, POWERPLAY is orthogonal to our work, as we consider a different setting where the agent cannot directly control the new tasks that will be encountered in the environment and thus must instead learn to adapt and react to non-stationary conditions.

In contrast to past work on meta-learning for few shot learning (Santoro et al., 2016; Vinyals et al., 2016; Ravi & Larochelle, 2016; Finn et al., 2017) and reinforcement learning across successive tasks (Al-Shedivat et al., 2018), we are not only trying to improve the speed of learning on new data, but also trying to do it in a way that preserves knowledge of past data and generalizes to future data. While past work has considered learning to influence gradient angles so that there is more alignment and thus faster learning within a task, we focus on a setting where we would like to influence gradient angles from all tasks at all points in time.

As our model aims to influence the dynamics of weight sharing, it bears conceptual similarity to mixtures of experts (Jacobs et al., 1991) style models for lifelong and multi-task learning (Misra et al., 2016; Riemer et al., 2016b; Aljundi et al., 2017; Fernando et al., 2017; Shazeer et al., 2017; Rosenbaum et al., 2018). MER implicitly affects the dynamics of weight sharing, but it is possible that combining it with mixtures of experts models could further amplify the ability of the model to control these dynamics. This is potentially an interesting avenue for future work.

The options framework has also been considered as a solution to a continual RL setting similar to the one we explore (Mankowitz et al., 2018). Options formalize the notion of temporally abstract actions in RL. Interestingly, generic architectures designed for shallow (Bacon et al., 2017) or deep (Riemer et al., 2018) hierarchies of options in essence learn very complex patterns of weight sharing over time. The option hierarchies constitute an explicit mechanism of controlling the extent of weight sharing for continual learning, allowing for orthogonalization of weights relating to different skills. In contrast, our work explores a method of implicitly optimizing weight sharing for continual learning that improves the efficacy of experience replay. MER should be simple to implement in concert with options based methods, and combining the two is an interesting direction for future work.
C THE CONNECTION BETWEEN WEIGHT SHARING AND THE TRANSFER-INTERFERENCE TRADE-OFF
In this section we would like to generalize our interpretation of a large set of different weight sharing schemes, including (Riemer et al., 2015; Bengio et al., 2015; Rosenbaum et al., 2018; Serrà et al., 2018), and how the concept of weight sharing impacts the dynamics of transfer (equation 1) and interference (equation 2). We will assume that we have a total parameter space θ that can be used by our network at any point in time. However, it is not a requirement that all parameters are actually used at all points in time. So, we can consider two specific instances in time. At one, we receive data point (x_1, y_1) and leverage parameters θ_1. At the other, we receive data point (x_2, y_2) and leverage parameters θ_2. θ_1 and θ_2 are both subsets of θ, and critically the overlap between these subsets influences the possible extent of transfer and interference when training on either data point.

First let us consider two extremes. In the first extreme, imagine θ_1 and θ_2 are entirely non-overlapping. As such, ∂L(x_1, y_1)/∂θ · ∂L(x_2, y_2)/∂θ = 0. On the positive side, this means that our solution has no potential for interference between the examples. On the other hand, there is no potential for transfer either. At the other extreme, we can imagine that θ_1 = θ_2. In this case, the potential for both transfer and interference is maximized, as gradients with respect to every parameter have the possibility of a non-zero dot product with each other.

From this discussion it is clear that both the extreme of full weight sharing and the extreme of no weight sharing have value depending on the relationship between data points. What we would really like for continual learning is a system that learns when to share weights and when not to on its own. To the extent that our learning about weight sharing generalizes, this should allow us to find an optimal solution to the transfer-interference trade-off.
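The sign of this gradient dot product is what separates transfer from interference. The toy logistic-regression sketch below makes that operational; grad_logistic, the random data, and the model choice are illustrative only and are not the architecture used in our experiments.

import numpy as np

def grad_logistic(theta, x, y):
    """Gradient of the logistic loss -log sigmoid(y * theta.x) for y in {-1, +1}."""
    margin = y * (theta @ x)
    return -y * x / (1.0 + np.exp(margin))

rng = np.random.RandomState(0)
theta = rng.randn(5)
x1, y1 = rng.randn(5), 1.0
x2, y2 = rng.randn(5), -1.0

g1 = grad_logistic(theta, x1, y1)
g2 = grad_logistic(theta, x2, y2)

dot = g1 @ g2
if dot > 0:
    print("transfer: an SGD step on (x1, y1) also reduces the loss on (x2, y2)")
elif dot < 0:
    print("interference: an SGD step on (x1, y1) increases the loss on (x2, y2)")
else:
    print("no overlap: disjoint effective parameters, no transfer or interference")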
D FURTHER DESCRIPTIONS AND COMPARISONS WITH BASELINE ALGORITHMS
Independent: originally reported in Lopez-Paz & Ranzato (2017), this is the performance of an independent predictor per task which has the same architecture but with fewer hidden units, in proportion to the number of tasks. The independent predictor can be initialized randomly or clone the last trained predictor, depending on what leads to better performance.
EWC:
Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is an algorithm that modifies online learning by regularizing the loss to avoid catastrophic forgetting, weighting each parameter by its importance to the model as measured by its Fisher information. EWC follows the catastrophic forgetting view of the continual learning problem by promoting less sharing, for new learning, of parameters that were deemed to be important for performance on old memories. We utilize the code provided by Lopez-Paz & Ranzato (2017) in our experiments. The only difference in our setting is that we provide the model one example at a time to test true continual learning rather than providing a batch of 10 examples at a time.
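A minimal sketch of the diagonal-Fisher penalty described above is given below; ewc_penalty, fisher_diag, and reg are illustrative names, and this is not the implementation of Lopez-Paz & Ranzato (2017) that we actually run.

import numpy as np

def ewc_penalty(theta, theta_old, fisher_diag, reg):
    """Quadratic penalty keeping parameters deemed important (large Fisher values)
    close to their values after the previous task."""
    return 0.5 * reg * np.sum(fisher_diag * (theta - theta_old) ** 2)

def ewc_loss(task_loss, theta, theta_old, fisher_diag, reg):
    # Total objective: new-task loss plus the consolidation penalty.
    return task_loss + ewc_penalty(theta, theta_old, fisher_diag, reg)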
GEM:
Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) is an algorithm meant to enhance the effectiveness of episodic storage based continual learning techniques by allowing the model to adapt to incoming examples using SGD as long as the gradients do not interfere with examples from each task stored in a memory buffer. If gradients interfere, leading to a decrease in the performance of a past task, a quadratic program is used to solve for the closest gradient to the original that does not have negative gradient dot products with the aggregate memories from any previous tasks. GEM is known to achieve superior performance in comparison to other recently proposed techniques that use episodic storage like Rebuffi et al. (2017), making superior use of small memory buffer sizes. GEM follows similar motivation to our approach in that it also considers the intelligent use of gradient dot product information to improve the use case of supervised continual learning. As a result, it is a very strong and interesting baseline to compare with our approach. We modify the original code and benchmarks provided by Lopez-Paz & Ranzato (2017). Once again, the only difference in our setting is that we provide the model one example at a time to test true continual learning rather than providing a batch of 10 examples at a time.

We can consider the GEM algorithm as tailored to the stability-plasticity dilemma conceptualization of continual learning in that it looks to preserve performance on past tasks while allowing for maximal plasticity to the new task. To achieve this, GEM solves a quadratic program to find an approximate gradient g_new that closely matches ∂L(x_new, y_new)/∂θ while ensuring that the following constraint holds:

g_new · ∂L(x_old, y_old)/∂θ ≥ 0.   (8)
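For a single memory constraint, the projection implied by equation 8 has a closed form, sketched below; the actual GEM implementation instead solves a quadratic program over one constraint per past task, so this is only an illustrative simplification.

import numpy as np

def project_gem(g_new, g_old):
    """If the proposed update conflicts with the memory gradient (negative dot
    product), project it onto the constraint g . g_old >= 0; otherwise keep it."""
    dot = g_new @ g_old
    if dot >= 0:
        return g_new
    return g_new - (dot / (g_old @ g_old)) * g_old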
E REPTILE ALGORITHM
We detail the standard Reptile algorithm from Nichol & Schulman (2018) in algorithm 2. The sample function randomly samples s batches of size k from dataset D. The SGD function applies mini-batch stochastic gradient descent over a batch of data given a set of current parameters and a learning rate.
Algorithm 2
Reptile for Stationary Data
procedure TRAIN(D, θ, α, β, s, k)
    while not done do
        // Draw batches from data:
        B_1, ..., B_s ← sample(D, s, k)
        θ_0 ← θ
        for i = 1, ..., s do
            θ_i ← SGD(B_i, θ_{i−1}, α)
        end for
        // Reptile meta-update:
        θ ← θ_0 + β(θ_s − θ_0)
    end while
    return θ
end procedure
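The NumPy sketch below mirrors Algorithm 2 on a least-squares toy loss; sgd_on_batch, reptile, and the data format are illustrative assumptions rather than the code used in our experiments.

import numpy as np

def sgd_on_batch(theta, batch, lr):
    """Per-example SGD on the least-squares loss 0.5 * (theta.x - y)^2."""
    for x, y in batch:
        grad = (theta @ x - y) * x
        theta = theta - lr * grad
    return theta

def reptile(data, theta, alpha=0.01, beta=0.1, s=5, k=10, iters=100, seed=0):
    """Algorithm 2 in outline: adapt through s sampled batches, then move the
    initialization part of the way toward the adapted weights."""
    rng = np.random.RandomState(seed)
    for _ in range(iters):
        theta_start = theta.copy()
        for _ in range(s):
            idx = rng.choice(len(data), size=k)
            theta = sgd_on_batch(theta, [data[i] for i in idx], alpha)
        # Reptile meta-update:
        theta = theta_start + beta * (theta - theta_start)
    return theta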
F DETAILS ON RESERVOIR SAMPLING
Throughout this paper we refer to updates to our memory M as M ← M ∪ {(x, y)}. We would now like to provide details on how we update our memory buffer using reservoir sampling as outlined in Vitter (1985) (algorithm 3). Reservoir sampling solves the problem of keeping some limited number M of the N total items seen before, each with equal probability M/N, when you do not know what the number N will be in advance. The randomInteger function randomly draws an integer inclusively between the provided minimum and maximum values.

Algorithm 3
Reservoir Sampling with Algorithm R
procedure RESERVOIR(M, N, x, y)
    if M > N then
        M[N] ← (x, y)
    else
        j = randomInteger(min = 0, max = N)
        if j < M then
            M[j] ← (x, y)
        end if
    end if
    return M
end procedure
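A direct Python translation of Algorithm 3 is sketched below, assuming n_seen counts the examples observed before the current one; the function name and buffer representation are illustrative.

import random

def reservoir_update(memory, capacity, n_seen, example):
    """Keep each of the examples seen so far in the buffer with equal probability
    capacity / (number seen), as in Vitter's Algorithm R."""
    if len(memory) < capacity:
        memory.append(example)
    else:
        j = random.randint(0, n_seen)   # inclusive on both ends
        if j < capacity:
            memory[j] = example
    return memory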
G EXPERIENCE REPLAY ALGORITHMS
We detail our variant of experience replay in algorithm 4. This procedure closely follows recent enhancements discussed in Zhang & Sutton (2017) and Riemer et al. (2017b;a). The sample function randomly samples k − 1 examples from the memory buffer M and interleaves them with the current example to form a single batch of size k. The SGD function applies mini-batch stochastic gradient descent over a batch of data given a set of current parameters and a learning rate.
Algorithm 4
Experience Replay (ER) with Reservoir Sampling
procedure TRAIN(D, θ, α, k)
    M ← {}
    for t = 1, ..., T do
        for (x, y) in D_t do
            // Draw batch from buffer:
            B ← sample(x, y, k, M)
            // Update parameters with mini-batch SGD:
            θ ← SGD(B, θ, α)
            // Reservoir sampling memory update:
            M ← M ∪ {(x, y)} (algorithm 3)
        end for
    end for
    return θ, M
end procedure

Unfortunately, it is not straightforward to implement algorithm 4 in all circumstances. In particular, it depends whether the neural network architecture is single-headed (sharing an output layer and output space among all tasks) or multi-headed (where each task gets its own unique output space). In multi-headed settings, it is common to consider the tasks in separate batches and to equally weight the sampled tasks during each update. This results in training the parameters evenly for each task and is particularly important so we pay equal attention to each set of task specific parameters. We detail an approach that separates tasks into sub-batches for a balanced update in algorithm 5. Here L is the loss given a set of parameters over a batch of data, and SGD applies a mini-batch gradient descent update rule over a loss given a set of parameters and a learning rate.
Algorithm 5
Experience Replay (ER) with Tasks
procedure TRAIN(D, θ, α, k)
    M ← {}
    for t = 1, ..., T do
        for (x, y) in D_t do
            // Draw batch from buffer:
            B ← sample(x, y, k, M)
            // Compute balanced loss across tasks:
            loss = 0.0
            for task in B do
                loss = loss + L(B[task], θ)
            end for
            // Update parameters with mini-batch SGD:
            θ ← SGD(loss, θ, α)
            // Reservoir sampling memory update:
            M ← M ∪ {(x, y)} (algorithm 3)
        end for
    end for
    return θ, M
end procedure

Our experiments demonstrate that both variants of experience replay are very effective for continual learning. Meanwhile, each performs significantly better than the other on some datasets and settings.
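The single-headed variant (algorithm 4) is sketched below in Python, with per-example SGD steps on a least-squares loss standing in for mini-batch SGD; the stream format and loss are illustrative assumptions only.

import random
import numpy as np

def experience_replay(stream, theta, alpha=0.01, k=10, capacity=500, seed=0):
    """Algorithm 4 in outline: interleave each incoming example with up to k-1
    replayed ones, take gradient steps, then update the reservoir buffer."""
    rng = random.Random(seed)
    memory, n_seen = [], 0
    for x, y in stream:
        replay = rng.sample(memory, min(k - 1, len(memory)))
        for xb, yb in replay + [(x, y)]:
            grad = (theta @ xb - yb) * xb          # least-squares example loss
            theta = theta - alpha * grad
        # Reservoir update (Algorithm 3): keep each seen example w.p. capacity/N.
        if len(memory) < capacity:
            memory.append((x, y))
        else:
            j = rng.randint(0, n_seen)
            if j < capacity:
                memory[j] = (x, y)
        n_seen += 1
    return theta, memory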
H THE VARIANTS OF MER
We detail two additional variants of MER (algorithm 1) in algorithms 6 and 7. The sample function takes on a slightly different meaning in each variant of the algorithm. In algorithm 1, sample is used to produce s batches, each consisting of k − 1 random examples from the memory buffer and the current example. In algorithm 6, sample is used to produce one batch consisting of sk − s examples from the memory buffer and s copies of the current example. In algorithm 7, sample is used to produce one batch consisting of k − 1 examples from the memory buffer. In algorithm 6, sample places the current example at the end of the batch. Meanwhile, in algorithm 7, sample places the current example in a random location within the batch. In contrast, the SGD function carries a common meaning across algorithms, applying stochastic gradient descent over a particular input and output given a set of current parameters and a learning rate.
Algorithm 6
Meta-Experience Replay (MER) - One Big Batch
procedure TRAIN(D, θ, α, γ, s, k)
    M ← {}
    for t = 1, ..., T do
        for (x, y) in D_t do
            // Draw batch from buffer:
            B ← sample(x, y, s, k, M)
            θ_0 ← θ
            for i = 1, ..., sk do
                x_c, y_c ← B[i]
                θ_i ← SGD(x_c, y_c, θ_{i−1}, α)
            end for
            // Reptile meta-update:
            θ ← θ_0 + γ(θ_sk − θ_0)
            // Reservoir sampling memory update:
            M ← M ∪ {(x, y)} (algorithm 3)
        end for
    end for
    return θ, M
end procedure

Algorithm 7
Meta-Experience Replay (MER) - Current Example Learning Rate
procedure TRAIN(D, θ, α, γ, s, k)
    M ← {}
    for t = 1, ..., T do
        for (x, y) in D_t do
            // Draw batch from buffer:
            B, index ← sample(k − 1, M)
            θ_0 ← θ
            // SGD on individual samples from batch:
            for i = 1, ..., k do
                x_c, y_c ← B[i]
                if i = index then
                    // High learning rate SGD on current example:
                    θ_i ← SGD(x, y, θ_{i−1}, sα)
                else
                    θ_i ← SGD(x_c, y_c, θ_{i−1}, α)
                end if
            end for
            // Reptile meta-update:
            θ ← θ_0 + γ(θ_k − θ_0)
            // Reservoir sampling memory update:
            M ← M ∪ {(x, y)} (algorithm 3)
        end for
    end for
    return θ, M
end procedure
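The one-big-batch variant (algorithm 6) is sketched below with a least-squares toy loss; the stream format, the placement of the current-example copies, and the function name are illustrative assumptions rather than the exact experimental implementation.

import random
import numpy as np

def mer_one_big_batch(stream, theta, alpha=0.01, gamma=0.3, s=2, k=5,
                      capacity=500, seed=0):
    """Algorithm 6 in outline: draw one batch of s*k examples (the current example
    mixed in s times), take per-example SGD steps, then apply a single Reptile-style
    meta-update from the starting weights toward the adapted weights."""
    rng = random.Random(seed)
    memory, n_seen = [], 0
    for x, y in stream:
        draws = [memory[rng.randrange(len(memory))]
                 for _ in range(s * k - s)] if memory else []
        batch = draws + [(x, y)] * s
        theta_start = theta.copy()
        for xb, yb in batch:
            grad = (theta @ xb - yb) * xb              # least-squares example loss
            theta = theta - alpha * grad
        theta = theta_start + gamma * (theta - theta_start)   # Reptile meta-update
        # Reservoir update (Algorithm 3):
        if len(memory) < capacity:
            memory.append((x, y))
        else:
            j = rng.randint(0, n_seen)
            if j < capacity:
                memory[j] = (x, y)
        n_seen += 1
    return theta, memory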
I DERIVING THE EFFECTIVE OBJECTIVE OF MER
We would like to derive the objective that Meta-Experience Replay (algorithm 1) approximates and show that it is approximately the same objective as in algorithms 6 and 7. We follow conventions from Nichol & Schulman (2018) and first demonstrate what happens to the effective gradients computed by the algorithm in the most trivial case. As in Nichol & Schulman (2018), this allows us to extrapolate an effective gradient that is a function of the number of steps taken. We can then consider the effective loss function that results in this gradient. Before we begin, let us define the following terms from Nichol & Schulman (2018), where L_i denotes the loss on the example processed at step i and θ_1 denotes the initial point of a Reptile loop:

g_i = ∂L_i(θ_i)/∂θ_i   (gradient obtained during SGD)   (9)
θ_{i+1} = θ_i − α g_i   (sequence of parameter vectors)   (10)
ḡ_i = ∂L_i(θ_1)/∂θ_1   (gradient at the initial point)   (11)
g_i^j = ∂L_i(θ_j)/∂θ_j   (gradient of the loss at step i evaluated at point j)   (12)
H̄_i = ∂²L_i(θ_1)/∂θ_1²   (Hessian at the initial point)   (13)
H_i^j = ∂²L_i(θ_j)/∂θ_j²   (Hessian of the loss at step i evaluated at point j)   (14)

In Nichol & Schulman (2018) they consider the effective gradient across one loop of Reptile with size k = 2. As we have both an outer loop of Reptile applied across batches and an inner loop applied within each batch to consider, we start with a setting where the number of batches is s = 2 and the number of examples per batch is k = 2. Recall from the original paper that the gradient of Reptile with k = 2 is:

g_Reptile,k=2,s=1 = g_1 + g_2 = ḡ_1 + ḡ_2 − α H̄_2 ḡ_1 + O(α²)   (15)

So, we can also consider the gradient of Reptile if we had 4 examples in one big batch (algorithm 6) as opposed to 2 batches of 2 examples:

g_Reptile,k=4,s=1 = g_1 + g_2 + g_3 + g_4
= ḡ_1 + ḡ_2 + ḡ_3 + ḡ_4 − α H̄_2 ḡ_1 − α H̄_3 ḡ_1 − α H̄_3 ḡ_2 − α H̄_4 ḡ_1 − α H̄_4 ḡ_2 − α H̄_4 ḡ_3 + O(α²)   (16)

Now we can consider the case for MER, where we define the parameter values as follows, extending algorithm 1. A superscript A stands for across batches and W stands for within batches, so θ^A_i denotes the parameters after the i-th within-batch Reptile update and θ^W_{i,j} denotes the parameters after the j-th SGD step within batch i. The initial point of the loop is θ_0, which plays the role of θ_1 above:

θ_0 = θ^A_0 = θ^W_{1,0}   (17)
θ^W_{1,1} = θ^W_{1,0} − α g_1   (18)
θ^W_{1,2} = θ^W_{1,1} − α g_2   (19)
θ^A_1 = θ^A_0 + β(θ^W_{1,2} − θ^A_0) = θ_0 + β(θ^W_{1,2} − θ_0) = θ^W_{2,0}   (20)
θ^W_{2,1} = θ^W_{2,0} − α g_3   (21)
θ^W_{2,2} = θ^W_{2,1} − α g_4   (22)
θ^A_2 = θ^A_1 + β(θ^W_{2,2} − θ^A_1)   (23)
θ = θ^A_0 + γ(θ^A_2 − θ^A_0)   (24)

The gradient of Meta-Experience Replay, g_MER, can thus be defined analogously to the gradient of Reptile as:

g_MER = (θ^A_0 − θ^A_2) / (βα) = (θ_0 − θ^A_2) / (βα)   (25)

By simply applying Reptile from equation 15 within the first batch, we can derive the value of the parameters θ^A_1 after the first within-batch Reptile update in terms of the original parameters θ_0:

θ^A_1 = θ_0 − βα ḡ_1 − βα ḡ_2 + βα² H̄_2 ḡ_1 + O(α³)   (26)

By substituting equation 26 into equation 23 we can see that:

θ^A_2 = θ_0 − βα ḡ_1 − βα ḡ_2 + βα² H̄_2 ḡ_1 − βα g_3 − βα g_4 + O(α³)   (27)

We can express g_3 in terms of the initial point by considering a Taylor expansion following the Reptile paper:

g_3 = ḡ_3 + H̄_3 (θ^W_{2,0} − θ_0) + O(α²)   (28)

Then, substituting in for θ^W_{2,0}, we express g_3 in terms of θ_0:

g_3 = ḡ_3 − αβ H̄_3 ḡ_1 − αβ H̄_3 ḡ_2 + O(α²)   (29)

We can then rewrite g_4 by taking a Taylor expansion with respect to θ^W_{2,0}:

g_4 = g_4^{2,0} − α H_4^{2,0} g_3 + O(α²)   (30)

where g_4^{2,0} and H_4^{2,0} denote the gradient and Hessian of the loss at step 4 evaluated at θ^W_{2,0}. Taking another Taylor expansion, we find that we can transform our expression for the Hessian:

H_4^{2,0} = H̄_4 + O(α)   (31)

We can analogously also transform our expression for g_4^{2,0}:

g_4^{2,0} = ḡ_4 + H̄_4 (θ^W_{2,0} − θ_0) + O(α²)   (32)

Substituting for θ^W_{2,0} in terms of θ_0:

g_4^{2,0} = ḡ_4 − αβ H̄_4 ḡ_1 − αβ H̄_4 ḡ_2 + O(α²)   (33)

We then substitute equation 31, equation 33, and equation 29 into equation 30 to obtain equation 34:
g_4 = ḡ_4 − αβ H̄_4 ḡ_1 − αβ H̄_4 ḡ_2 − α H̄_4 ḡ_3 + O(α²)   (34)

Finally, we have all of the terms we need to express θ^A_2, and we can then derive an expression for the MER gradient g_MER:

g_MER = ḡ_1 + ḡ_2 + ḡ_3 + ḡ_4 − α H̄_2 ḡ_1 − α H̄_4 ḡ_3 − αβ H̄_3 ḡ_1 − αβ H̄_3 ḡ_2 − αβ H̄_4 ḡ_1 − αβ H̄_4 ḡ_2 + O(α²)   (35)

This equation is quite interesting and very similar to equation 16. As we would like to approximate the same objective, we can remove one hyperparameter from our model by setting β = 1. This yields:

g_MER = ḡ_1 + ḡ_2 + ḡ_3 + ḡ_4 − α H̄_2 ḡ_1 − α H̄_3 ḡ_1 − α H̄_3 ḡ_2 − α H̄_4 ḡ_1 − α H̄_4 ḡ_2 − α H̄_4 ḡ_3 + O(α²)   (36)

Indeed, with β set equal to 1, we have shown that the gradient of MER is the same as one loop of Reptile with a number of steps equal to the total number of examples in all batches of MER (algorithm 6), provided the current example is mixed in with the same proportion. If we include the current example as s of the sk examples in our meta-replay batch, it gets the same overall priority in both cases, which is s times larger than that of a random example drawn from the buffer. As such, we can also optimize an equivalent gradient using algorithm 7, because it uses a factor s to increase the priority of the gradient given to the current example.

While β = 1 is an interesting special case of MER in algorithm 1, in general we find it can be useful to set β to a value smaller than 1. In fact, in our experiments we consider the case when β is smaller than 1 and γ = 1. The success of this approach makes sense because the higher order terms in the Taylor expansion that reflect the mismatch between parameters across replay batches disturb the learning process. By setting β to a value below 1 we can reduce our comparative weighting on promoting inter-batch gradient similarities rather than intra-batch gradient similarities.

It was noted in Nichol & Schulman (2018) that the following equality holds if the examples and their order are random:

E[H̄_1 ḡ_2] = E[H̄_2 ḡ_1] = (1/2) E[∂(ḡ_1 · ḡ_2)/∂θ]   (37)

In our work, to make sure this equality holds in an online setting, we must take multiple precautions as noted in the main text. The issue is that examples are received in a non-stationary sequence, so when applied in a continual learning setting the order is not totally random or arbitrary as in the original Reptile work. We address this by maintaining our buffer using reservoir sampling, which ensures that any example seen before has an equal probability M/N of being in the buffer. We also randomly select over these elements to form a batch. As this makes the order largely arbitrary to the extent that our buffer includes all examples seen, we are approximating the random offline setting from the original Reptile paper. As such we can view the gradients in equation 16 and equation 36 as leading to approximately the following objective function:

θ = arg min_θ E_{(x_11, y_11), ..., (x_sk, y_sk) ∼ M} [ 2 Σ_{i=1}^{s} Σ_{j=1}^{k} [ L(x_ij, y_ij) − Σ_{q=1}^{i−1} Σ_{r=1}^{j−1} α (∂L(x_ij, y_ij)/∂θ) · (∂L(x_qr, y_qr)/∂θ) ] ]   (38)

This is precisely equation 7 in the main text.
J SUPERVISED CONTINUAL LIFELONG LEARNING
For the supervised continual learning benchmarks leveraging MNIST Rotations and MNIST Permutations, following conventions, we use a two layer MLP architecture for all models with 100 hidden units in each layer. We also model our hyperparameter search after Lopez-Paz & Ranzato (2017) while providing statistics for each model across 5 random seeds.

For Omniglot, following Vinyals et al. (2016), we scale the images to 28x28 and use an architecture that consists of a stack of 4 modules before a fully connected softmax layer. Each module includes a 3x3 convolution with 64 filters, a ReLU non-linearity and 2x2 max-pooling.
J.1 HYPERPARAMETER SEARCH
Here we report the hyperparameter grids that we searched over in our experiments. We note in parentheses the best values for MNIST Rotations (ROT) at each buffer size (ROT-5120, ROT-500, ROT-200), MNIST Permutations (PERM) at each buffer size (PERM-5120, PERM-500, PERM-200), Many Permutations (MANY) at each buffer size (MANY-5120, MANY-500), and Omniglot (OMNI) at each buffer size (OMNI-5120, OMNI-500).

• Online Learning
  – learning rate: [0.0001, 0.0003 (ROT), 0.001, 0.003 (PERM, MANY), 0.01, 0.03, 0.1 (OMNI)]
• Independent Model Per Task
  – learning rate: [0.0001, 0.0003, 0.001, 0.003, 0.01 (ROT, PERM, MANY), 0.03, 0.1]
• Task Specific Input Layer
  – learning rate: [0.0001, 0.0003, 0.001, 0.003, 0.01 (ROT, PERM), 0.03, 0.1]
• EWC
  – learning rate: [0.001 (ROT, OMNI), 0.003 (MANY), 0.01 (PERM), 0.03, 0.1, 0.3, 1.0]
  – regularization: [1 (MANY), 3, 10 (PERM, OMNI), 30, 100 (ROT), 300, 1000, 3000, 10000, 30000]
• GEM
  – learning rate: [0.001, 0.003 (MANY-500), 0.01 (ROT, PERM, OMNI, MANY-5120), 0.03, 0.1, 0.3, 1.0]
  – memory strength (γ): [0.0 (ROT-500, ROT-200, PERM-200, MANY-5120), 0.1 (MANY-500), 0.5 (OMNI), 1.0 (ROT-5120, PERM-5120, PERM-500)]
• Experience Replay (Algorithm 4)
  – learning rate: [0.00003, 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1 (ROT, PERM, MANY)]
  – batch size (k − 1): [5 (ROT-500), 10 (ROT-200, PERM-500, PERM-200), 25 (ROT-5120, PERM-5120, MANY), 50, 100, 250]
• Experience Replay (Algorithm 5)
  – learning rate: [0.00003, 0.0001, 0.0003, 0.001, 0.003 (MANY-5120), 0.01 (ROT-500, ROT-200, PERM, MANY-500), 0.03 (ROT-5120), 0.1]
  – batch size (k − 1): [5 (MANY-500), 10 (PERM-200, MANY-5120), 25 (PERM-5120, PERM-500), 50 (ROT-200), 100 (ROT-5120, ROT-500), 250]
• Meta-Experience Replay (Algorithm 1)
  – learning rate (α): [0.01 (OMNI-5120), 0.03 (ROT-5120, PERM, MANY-500), 0.1 (ROT-500, ROT-200, OMNI-500)]
  – across batch meta-learning rate (γ): 1.0
  – within batch meta-learning rate (β): [0.01 (ROT-500, ROT-200, MANY-5120), 0.03 (ROT-5120, PERM, MANY-500), 0.1, 0.3, 1.0 (OMNI)]
  – batch size (k − 1): [5 (MANY, OMNI-500), 10 (ROT-500, ROT-200, PERM-200), 25 (PERM-500, OMNI-5120), 50, 100 (ROT-5120, PERM-5120)]
  – number of batches per example: [1, 2 (OMNI-500), 5 (ROT-200, OMNI-5120), 10 (ROT-5120, ROT-500, PERM, MANY)]
• Meta-Experience Replay (Algorithm 6)
  – learning rate (α): [0.01, 0.03 (ROT-5120, PERM-5120, PERM-500, MANY-5120), 0.1 (ROT-500, ROT-200, PERM-200, MANY-500)]
  – meta-learning rate (γ): [0.03 (ROT-500, ROT-200, PERM-200, MANY-500), 0.1 (ROT-5120, PERM-5120, MANY-5120), 0.3 (PERM-500), 0.6, 1.0]
  – batch size (k − 1): [5 (PERM-200, MANY-500), 10 (ROT-500, PERM-500), 25 (ROT-200, MANY-5120), 50 (PERM-5120), 100 (ROT-5120), 250]
  – number of batches per example: 1
• Meta-Experience Replay (Algorithm 7)
  – learning rate (α): [0.01 (PERM-5120, PERM-500), 0.03 (ROT, PERM-200, MANY), 0.1]
  – within batch meta-learning rate (γ): [0.03 (ROT, MANY), 0.1 (PERM), 0.3, 1.0]
  – batch size (k − 1): [5 (PERM-200, MANY-500), 10, 25 (PERM-500), 50 (ROT-200, ROT-500, MANY-5120), 100 (ROT-5120, PERM-5120)]
  – current example learning rate multiplier (s): [1, 2 (PERM-200), 5 (ROT), 10 (PERM-5120, PERM-500, MANY)]

Model | FTI
Online | 58.22
Task Input | 1.62
EWC | 58.26
GEM |
MER |
Table 5: Forward transfer and interference (FTI) experiments on MNIST Rotations.
K FORWARD TRANSFER AND INTERFERENCE
Forward transfer was a metric defined in Lopez-Paz & Ranzato (2017) based on the average increase in performance on a task, relative to performance at random initialization, before training on that task. Unfortunately, this metric does not make much sense for tasks like MNIST Permutations, where inputs are totally uncorrelated across tasks, or Omniglot, where outputs are totally uncorrelated across tasks. As such, we only provide performance for this metric on MNIST Rotations in Table 5.
L ABLATION EXPERIMENTS
We report our detailed ablation results in Table 6. In order to consider a version of GEM that uses reservoir sampling, we maintain our buffer the same way that we do for experience replay and MER. We consider everything in the buffer to be old data and solve the GEM quadratic program so that the loss is not increased on this data. We found that considering the task level gradient directions did not lead to improvements.
M REPRODUCIBILITY OF RESULTS
While the results so far have provided substantial evidence of the benefits of MER for continual learning, one potential concern with our experimental protocol in Appendix J.1 is that the larger hyperparameter search space used for MER may artificially inflate improvements given typical run to run variation. To validate that this is not the case, we have run extensive additional experiments in this section to see how the model performs across different random seeds and machines. The codebase presents some inherent stochasticity across runs. As such, in Tables 7, 8, and 9 we report two levels of generalization for a set of hyperparameters beyond the configuration of an individual run. In the Same Seeds column, we report the results for the original 5 model seeds (0-4) deployed on different machines. In the Different Seeds column, we report the results for a different 25 model seeds (5-29), also deployed on different machines.

In all cases, we see that there are quantitative differences when generalizing across seeds and machines. However, new settings do not always result in lower performance. Additionally, the differences are not qualitative in nature. In fact, in every setting we come to approximately the same qualitative conclusions about how each model performs.

Model | Buffer Size | Rotations | Permutations | Many Permutations
ER with SGD (algorithm 4) | 5120 | 87.82 | |
 | 500 | 77.82 | |
 | 200 | 70.72 | | −
ER with Tasks and SGD (algorithm 5) | 5120 | 88.50 | |
 | 500 | 77.30 | |
 | 200 | 70.82 | | −
MER (algorithm 1) | 5120 | | |
 | 500 | | |
 | 200 | | | −
MER (algorithm 6) | 5120 | 88.94 | |
 | 500 | 79.38 | |
 | 200 | 73.74 | | −
MER (algorithm 7) | 5120 | 89.34 | |
 | 500 | | |
 | 200 | | | −
ER with Tasks and Adam (Kingma & Ba, 2014) | 5120 | 88.68 | |
 | 500 | 77.84 | |
 | 200 | 69.48 | | −
ER with Tasks and RMSProp (Hinton et al., 2012) | 5120 | 88.28 | |
 | 500 | 76.32 | |
 | 200 | 66.66 | | −
ER with Tasks and GEM Style Buffer | 5120 | 86.78 | |
 | 500 | 74.26 | |
 | 200 | 66.02 | | −
GEM (Lopez-Paz & Ranzato, 2017) | 5120 | 87.58 | |
 | 500 | 74.88 | |
 | 200 | 67.38 | | −
GEM with Reservoir Sampling | 5120 | 87.16 | |
 | 500 | 77.26 | |
 | 200 | 69.00 | | −
Table 6: Retained accuracy ablation experiments on MNIST based lifelong learning benchmarks.

Model | Buffer Size | Original | Same Seeds | Different Seeds
Online | N/A | | |
Independent | N/A | 60.74 | |
EwC | N/A | 57.96 | |
GEM | 5120 | 87.58 | |
 | 500 | 74.88 | |
 | 200 | 67.38 | |
ER with SGD (algorithm 4) | 5120 | 87.82 | |
 | 500 | 77.82 | |
 | 200 | 70.72 | |
ER with Tasks and SGD (algorithm 5) | 5120 | 88.50 | |
 | 500 | 77.30 | |
 | 200 | 70.82 | |
MER (algorithm 1) | 5120 | | |
 | 500 | | |
 | 200 | | |
MER (algorithm 6) | 5120 | 88.94 | |
 | 500 | 79.38 | |
 | 200 | 73.74 | |
MER (algorithm 7) | 5120 | 89.34 | |
 | 500 | | |
 | 200 | | |
Table 7: A reproducibility comparison of retained accuracy across machines and seeds for the best performing hyperparameters on MNIST Rotations.
Model | Buffer Size | Original | Same Seeds | Different Seeds
Online | N/A | 55.42 | |
Independent | N/A | 55.80 | |
EwC | N/A | 62.32 | |
GEM | 5120 | 83.02 | |
 | 500 | 69.26 | |
 | 200 | 55.42 | |
ER with SGD (algorithm 4) | 5120 | 84.30 | |
 | 500 | 75.80 | |
 | 200 | 69.52 | |
ER with Tasks and SGD (algorithm 5) | 5120 | 84.00 | |
 | 500 | 74.32 | |
 | 200 | 68.06 | |
MER (algorithm 1) | 5120 | | |
 | 500 | 77.50 | |
 | 200 | | |
MER (algorithm 6) | 5120 | 84.70 | |
 | 500 | 75.88 | |
 | 200 | 70.30 | |
MER (algorithm 7) | 5120 | | |
 | 500 | | |
 | 200 | | |
Table 8: A reproducibility comparison of retained accuracy across machines and seeds for the best performing hyperparameters on MNIST Permutations.

Model | Buffer Size | Original | Same Seeds | Different Seeds
Online | N/A | | |
Independent | N/A | 13.55 | |
EwC | N/A | 33.46 | |
GEM | 5120 | 56.76 | |
 | 500 | 32.14 | |
ER with SGD (algorithm 4) | 5120 | 60.67 | |
 | 500 | 44.08 | |
ER with Tasks and SGD (algorithm 5) | 5120 | 60.30 | |
 | 500 | 43.14 | |
MER (algorithm 1) | 5120 | 62.52 | |
 | 500 | | |
MER (algorithm 6) | 5120 | 60.98 | |
 | 500 | 44.48 | |
MER (algorithm 7) | 5120 | | |
 | 500 | 46.72 | |
Table 9: A reproducibility comparison of retained accuracy across machines and seeds for the best performing hyperparameters on MNIST Many Permutations.
N CONTINUAL REINFORCEMENT LEARNING
We detail the application of MER to deep Q-learning in algorithm 8, using notation from Mnih et al. (2015).
Algorithm 8
Deep Q-learning with Meta-Experience Replay (MER)
procedure DQN-MER(env, frameLimit, θ, α, β, γ, steps, k, E_Q)
    // Initialize action-value function Q with parameters θ:
    Q ← Q(θ)
    // Initialize target action-value function ˆQ with the same parameters ˆθ = θ:
    ˆQ ← ˆQ(ˆθ) = ˆQ(θ)
    // Initialize experience replay buffer:
    M ← {}
    M.age ← 0
    while M.age ≤ frameLimit do
        // Begin new episode:
        env.reset()
        // Initialize the state s with the initial observation:
        while episode not done do
            // Select with probability p an action a from the set of possible actions:
            a = random action â                  if p ≤ ε
                arg max_{a'} Q(s_t, a'; θ)        if p > ε
            // Perform the action a in the environment:
            s', r_t ← env.step(s, a)
            // Store current transition with reward r:
            M ← M ∪ {(s, a, r, s')} (algorithm 3)
            B_1, ..., B_steps ← sample(s, a, r, s', steps, k, M)
            // Store current weights:
            θ^A_0 ← θ
            for i = 1, ..., steps do
                θ^W_{i,0} ← θ
                for j = 1, ..., k do
                    // Sample one set of processed sequences, actions, and rewards from M:
                    s, a, r, s' = B_i[j]
                    y = r                                       if final frame in episode
                        r + Γ max_a ˆQ(s', a; ˆθ)               otherwise
                    // Optimize the Huber loss H(y, Q(s, a; θ^W_{i,j−1})):
                    L ← H(y, Q(s, a; θ^W_{i,j−1}))
                    θ^W_{i,j} ← θ^W_{i,j−1} − α ∂L/∂θ^W_{i,j−1}
                end for
                // Within batch Reptile meta-update:
                θ ← θ^W_{i,0} + β(θ^W_{i,k} − θ^W_{i,0})
                θ^A_i ← θ
            end for
            // Across batch Reptile meta-update:
            θ ← θ^A_0 + γ(θ^A_steps − θ^A_0)
            // Reset target action-value network ˆQ to Q every E_Q episodes:
            ˆQ = Q
        end while
    end while
    return θ, M
end procedure
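The skeleton below illustrates the nested Reptile structure of Algorithm 8 in Python with a linear Q-function and a stand-in environment; the squared TD error replaces the Huber loss, the separate target network is omitted, and all names (ToyEnv, dqn_mer, gamma_meta) are illustrative assumptions rather than the code used for Catcher and Flappy Bird.

import random
import numpy as np

class ToyEnv:
    """Stand-in environment with a minimal reset/step interface (hypothetical)."""
    def __init__(self, dim=4, n_actions=2, horizon=50, seed=0):
        self.rng = np.random.RandomState(seed)
        self.dim, self.n_actions, self.horizon, self.t = dim, n_actions, horizon, 0

    def reset(self):
        self.t = 0
        return self.rng.randn(self.dim)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0
        return self.rng.randn(self.dim), reward, self.t >= self.horizon

def dqn_mer(env, frames=2000, alpha=1e-3, beta=1.0, gamma_meta=0.3,
            steps=1, k=16, capacity=500, eps=0.1, discount=0.99, seed=0):
    """Skeleton of Algorithm 8: epsilon-greedy control, a reservoir buffer, and
    nested within-batch / across-batch Reptile meta-updates on linear Q weights."""
    rng = random.Random(seed)
    W = np.zeros((env.n_actions, env.dim))     # Q(s, a) = W[a] . s
    memory, n_seen = [], 0
    s, done = env.reset(), False
    for _ in range(frames):
        if done:
            s, done = env.reset(), False
        a = rng.randrange(env.n_actions) if rng.random() < eps else int(np.argmax(W @ s))
        s_next, r, done = env.step(a)
        # Reservoir buffer update (Algorithm 3):
        if len(memory) < capacity:
            memory.append((s, a, r, s_next, done))
        else:
            j = rng.randint(0, n_seen)
            if j < capacity:
                memory[j] = (s, a, r, s_next, done)
        n_seen += 1
        W_across = W.copy()                    # weights before the meta-replay update
        for _ in range(steps):                 # across-batch loop
            W_within = W.copy()
            for _ in range(k):                 # within-batch loop, one transition per step
                sb, ab, rb, sb2, db = memory[rng.randrange(len(memory))]
                target = rb if db else rb + discount * float(np.max(W @ sb2))
                td = float(W[ab] @ sb) - target              # squared TD error replaces Huber
                W[ab] = W[ab] - alpha * td * sb
            W = W_within + beta * (W - W_within)             # within-batch Reptile meta-update
        W = W_across + gamma_meta * (W - W_across)           # across-batch Reptile meta-update
        s = s_next
    return W, memory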
N.1 DESCRIPTION OF CATCHER AND FLAPPY BIRD
In Catcher, the agent controls a segment that lies horizontally at the bottom of the screen, i.e. a basket, and can move right or left, or stay still. The goal is to move the basket to catch as many pellets as possible. Missing a pellet results in losing one of the three available lives. Pellets emerge one by one from the top of the screen, and have a descending velocity that is fixed for each task. The reward, and thus the y axis in our Catcher experiments, refers to the number of fruits caught during the full game span.

In the case of the very popular game Flappy Bird, the agent has to navigate a bird in an environment full of pipes by deciding whether or not to flap its wings. The pipes always appear in pairs, one from the bottom of the screen and one from the top of the screen, and have a gap that allows the bird to pass through them. Flapping the wings results in the bird ascending; otherwise the bird descends to the ground naturally. Both ascending and descending velocities are preset by the physics engine of the game. The goal is to pass through as many pairs of pipes as possible without hitting a pipe, as this results in losing the game. The scoring scheme in this game awards a point each time a pipe is crossed. Despite very simple mechanics, Flappy Bird has proven to be challenging for many humans. According to the original game scoring scheme, players with a score of 10 receive a Bronze medal; with 20 points, a Silver medal; 30 results in a Gold medal, and any score better than 40 is rewarded with a Platinum medal.
N.2 DQN WITH META-EXPERIENCE REPLAY
The DQN used to train on both games follows the classic architecture from Mnih et al. (2015): it has a CNN consisting of 3 layers, the first with 32 filters and an 8x8 kernel, the second layer with 64 filters and a 4x4 kernel, and a final layer with 64 filters and a 3x3 kernel. The CNN is followed by two fully connected layers. A ReLU non-linearity was applied after each layer. We limited the memory buffer size for our models to 50k transitions, which is roughly the proportion of the total memories used in the benchmark setting for our supervised learning tasks.
N.3 PARAMETERS FOR CONTINUAL REINFORCEMENT LEARNING EXPERIMENTS
For the continual reinforcement learning setting we set the parameters using results from the experiments in the supervised setting as guidance. Both Catcher and Flappy Bird used the same hyperparameters, detailed below, with the obvious exception of the game-dependent parameter that defines each task. Models were trained to a maximum of 150k frames and 6 total tasks, switching every 25k frames. Runs used different random seeds for the initialization as stated in the figures.

• Game Parameters
  – Catcher: ∆ (vertical velocity of pellet increased from default 0.608).
  – Flappy Bird: ∆: −5 (pipe gap decreased by 5 from default 100).
• Experience Replay
  – learning rate: 0.0001
  – batch size (k − 1): 16
• Meta-Experience Replay
  – learning rate (α): 0.0001
  – within batch meta-learning rate (β): 1
  – across batch meta-learning rate (γ): 0.3
  – batch size (k − 1): 16
  – number of steps: 1
  – buffer size: 50000

N.4 FURTHER COMMENTS ON CONTINUAL REINFORCEMENT LEARNING EVALUATION
Performance during training in continual learning for a non-stationary version of Flappy Bird is shown in Figure 5. The graphs show averaged values over three validation episodes across three different seed initializations. Vertical grid lines on the frames axis indicate task switches.

We have also conducted experiments using a DQN with reservoir sampling, finding that it consistently underperforms a DQN with typical recency based sampling. A DQN with MER achieves approximately the asymptotic performance of the single task DQN by the end of training for most tasks. On the other hand, the DQN with reservoir sampling achieves worse performance than the standard DQN, so it is clear that, in this particular setting where a later task is subsumed in
Figure 5: Mean score as a function of training frames for DQN and DQN-MER, shown in six panels, one per task (Task 1 through Task 6).