Unifying Regularisation Methods for Continual Learning
Frederik Benzing
Department of Computer Science
Institute of Theoretical Computer Science
ETH Zurich, 8092 Zurich, Switzerland
[email protected]
Abstract
The problem of Catastrophic Forgetting has received a lot of attention in the past years. An important class of proposed solutions are so-called regularisation approaches, which protect weights from large changes according to their importance. Various ways to measure this importance have been put forward, all stemming from different theoretical or intuitive motivations. We present mathematical and empirical evidence that two of these methods – Synaptic Intelligence and Memory Aware Synapses – approximate a rescaled version of the Fisher Information, a theoretically justified importance measure also used in the literature. As part of our analysis, we show that the importance approximation of Synaptic Intelligence is biased and that, in fact, this bias explains its performance best. Altogether, our results offer a theoretical account for the effectiveness of different regularisation approaches and uncover similarities between the methods proposed so far.
1 Introduction

Despite much progress, there remain many gaps between artificial and biological intelligence. Among these, the ability of animals to acquire new knowledge throughout their life has inspired a growing body of literature. In continual learning, a number of tasks are presented sequentially and the algorithm should learn new tasks without overwriting previous knowledge and without revisiting old data. An important class of algorithms proposed to solve this problem are regularisation approaches [21, 44, 3, 36, 6, 17, 37, 29]. They identify important weights after each task and protect them from large changes by modifying the loss function. At the heart of these algorithms lies the question of how to estimate parameter importance, as this directly determines which parameters will be kept close to their old values and which parameters can be modified freely to solve new tasks. In this article, our aim is to understand why different importance measures proposed in the literature are effective. Notably, this question remains open for two well-known regularisation methods, Synaptic Intelligence (SI) [44] and Memory Aware Synapses (MAS) [3]. The latter approach has a purely heuristic motivation, while the former makes simplifying assumptions, violated in a deep learning setting, to theoretically justify its importance measure.

Our main contribution is to provide mathematical as well as empirical evidence that both these methods, SI and MAS, despite their different motivations, approximate a rescaled version of the Fisher Information. The Fisher Information has been used for continual learning [21] along with a clear theoretical, Bayesian justification [21, 17]. Thus, on top of unifying different regularisation approaches proposed so far, our results give a more sound theoretical explanation for the effectiveness of two more continual learning algorithms.
To establish our claims, we first investigate MAS and give a simple mathematical derivation to show that MAS approximates what we call the Absolute Fisher. We carefully validate the assumptions of our derivations as well as the resulting claims empirically on standard continual learning benchmarks. As a second – maybe surprising – step we show that the importance approximation of SI is biased and that this bias contributes most to the continual learning performance of SI. Finally, we show that through this bias SI is – like MAS – closely related to the Absolute Fisher.

2 Related Work

Before reviewing the continual learning literature we point out that one of our algorithms is linked to the recently proposed Loss-Change-Allocation (LCA) [23]. This method gains insights into the training process of neural networks [2, 32, 39] by assessing how much the changes of individual parameters contribute to learning. To this end, LCA approximates the same path integral as Synaptic Intelligence [44]. For the approximation, the gradient has to be evaluated on the entire training set for each parameter update. Our method 'Unbiased Synaptic Intelligence' gives an unbiased estimate of the same quantity, but only using mini-batches, thus being computationally cheaper.

The problem of catastrophic forgetting in neural networks has been known and studied for many decades [28, 34, 12]. In the context of deep learning, it recently started receiving more attention again [15, 41]. An attempt to overcome this problem was made with the algorithm Elastic Weight Consolidation (EWC) [21]. We will review EWC and two other regularisation approaches – Synaptic Intelligence [44] and Memory Aware Synapses [3] – as well as their relatives [36, 6, 37] more closely in the next section. The idea of regularisation was also applied in the context of Bayesian neural networks in [29, 42], who use the weights' posteriors after having learned a task as the prior for learning the next task.

On top of regularisation approaches, Parisi et al. [30] identify two more classes of continual learning approaches.

One of these classes are replay methods, which rely on data from old distributions to protect knowledge while learning new tasks. For example, iCaRL [35] uses stored datapoints from old datasets and computes nearest neighbours in feature space for classification. GEM [26] and follow-up work [7, 5] project gradients from new tasks onto directions that do not increase the loss on stored examples. However, empirical results from [8] suggest that one of the most effective ways to use small samples of old data is to simply mix them with the mini-batches of new data during training. To avoid storing old examples, which strictly speaking violates continual learning desiderata, generative replay methods train generative models to simulate old data distributions [38]. In this context, biologically inspired replay systems which mimic the hippocampus have also been put forward [19].

The third class of continual learning algorithms are architectural approaches, which use network expansion to learn new tasks without corrupting previous knowledge. PathNet [11] uses evolution to identify good subnetworks for each individual task, while [25] extends the network through neural architecture search. [37, 14] first train on a task, then compress the resulting net and iterate this process.
Mixture models with a growing number of components are investigated in [33, 24]. Finally, [43, 16, 10] point out that different continual learning scenarios and assumptions, with varying difficulty, were used across the literature. They also clearly define and distinguish these scenarios and test how well different algorithms perform in them.
3 Regularisation Approaches for Continual Learning

In continual learning we are given $K$ datasets $D_1, \ldots, D_K$ sequentially. When training a neural net with $N$ parameters $\theta \in \mathbb{R}^N$ on dataset $D_k$, we have no access to the previously seen datasets $D_1, \ldots, D_{k-1}$ but should retain some memory of them. Regularisation based approaches introduced in [21, 17] do so by modifying the loss function $L_k$ related to dataset $D_k$. Let us denote the parameters obtained after finishing training on task $k$ by $\theta^{(k)}$ and let $\Omega^{(k)}$ be some positive semi-definite $N \times N$ matrix. When training on task $k$, regularisation methods use the loss

$$\tilde{L}_k = L_k + c \cdot \big(\theta - \theta^{(k-1)}\big)^T \, \Omega^{(k-1)} \, \big(\theta - \theta^{(k-1)}\big), \qquad (1)$$

where $c > 0$ is a hyperparameter. The first term in the loss function is the standard loss on task $k$. The second term makes sure that the parameters do not move too far away from the previous parameters.

It can already be seen that the matrix $\Omega^{(k)}$ can quickly become prohibitively large, as it has $N^2$ entries, where $N$ is the number of parameters. In practice, $\Omega$ is therefore often approximated by a diagonal matrix $\mathrm{diag}(\omega_1, \ldots, \omega_N)$. The loss function defined above then simplifies to

$$\tilde{L}_k = L_k + c \cdot \sum_{i=1}^{N} \omega_i^{(k-1)} \big(\theta_i - \theta_i^{(k-1)}\big)^2,$$

and $\omega_i$ is often referred to as the importance of parameter $\theta_i$. Usually, we have $\omega_i^{(k)} = \omega_i^{(k-1)} + \omega_i$, where $\omega_i$ is the importance for the most recently learned task $k$.

How good a regularisation based approach is depends on how meaningful its importance estimates $\omega_i$ are. Different such estimates have been proposed, and the aim of this work is to understand why different approaches work and how similar they are. Broadly speaking (and to the best of our knowledge), three methods to measure parameter importance have been proposed; see also below for adaptations of these methods.
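As a concrete illustration of the diagonal penalty, here is a minimal PyTorch sketch; the function and argument names are ours (hypothetical), not taken from the paper's codebase:

```python
import torch

def regularised_loss(task_loss, params, old_params, importances, c):
    # eq. (1), diagonal case: L_k + c * sum_i omega_i * (theta_i - theta_i_old)^2
    penalty = sum((om * (p - p_old).pow(2)).sum()
                  for p, p_old, om in zip(params, old_params, importances))
    return task_loss + c * penalty
```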
Elastic Weight Consolidation [21] uses the diagonal of the Fisher Information as importance measure. It is evaluated at the end of training a given task. To define the Fisher Information Matrix $F$, let us assume we are given a dataset $\mathcal{X}$ and that, for a datapoint $X$, the network predicts a probability distribution (or equivalently, a vector of probabilities) $q_X$ with entries $q_X(y)$ for $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of potential labels. Let us further assume that we use a categorical cross entropy loss $\mathrm{CEL}(q_X, y)$ and write $g(X, y) = \partial\,\mathrm{CEL}(q_X, y) / \partial \theta$ for its gradient. Then the Fisher Information $F$ is given as

$$F = \mathbb{E}_{X \sim \mathcal{X}}\, \mathbb{E}_{y \sim q_X}\big[g(X,y)\, g(X,y)^T\big] \qquad (2)$$
$$\;\; = \mathbb{E}_{X \sim \mathcal{X}} \sum_{y \in \mathcal{Y}} q_X(y) \cdot g(X,y)\, g(X,y)^T. \qquad (3)$$

Concretely, taking only the diagonal of $F$ means $\omega_i(\mathrm{EWC}) = \mathbb{E}_{X \sim \mathcal{X}}\, \mathbb{E}_{y \sim q_X}\big[g_i(X,y)^2\big]$.
The Empirical Fisher Information matrix is an approximation of the Fisher Information which, rather than taking the expectation over $y \sim q_X$, takes the (deterministic) label $y$ given by the labels of the dataset. We will also define the Predicted Fisher Information as taking the maximum-likelihood predicted label, i.e. the argmax of $q_X(y)$. Our empirical findings suggest that these different versions of the Fisher Information can be used almost interchangeably, see appendix.

The Fisher Information is often used as a measure of parameter sensitivity, and under the assumption that the learned label distribution $q_X$ is the real label distribution it equals the Hessian of the loss (with respect to the real, not the empirical, label distribution) [27, 31]. Its effectiveness has a clean theoretical interpretation [21, 17].

Memory Aware Synapses [3] heuristically argues that the sensitivity of the final layer with respect to a parameter should be used as the parameter's importance. Similar to the Fisher Information, this sensitivity is evaluated after training a given task. Denoting, as before, the final layer of learned probabilities by $q_X$, this means

$$\omega_i(\mathrm{MAS}) = \mathbb{E}_{X \sim \mathcal{X}} \left[ \left| \frac{\partial \|q_X\|^2}{\partial \theta_i} \right| \right],$$

where the norm is Euclidean. The authors of MAS point out that their measure could also be applied in an unsupervised setting, where the last layer does not necessarily represent predicted labels.

Synaptic Intelligence [44] approximates the contribution of each parameter to the decrease in loss and uses this contribution as importance. To formalise the 'contribution of a parameter', let us denote the parameters at time $t$ by $\theta(t)$ and the loss by $L(t)$. If the parameters follow a smooth trajectory in parameter space, then we can write the decrease in loss between time $0$ and $T$ as

$$L(0) - L(T) = -\int_0^T \frac{\partial L(t)}{\partial \theta}\, \theta'(t)\, dt = -\sum_{i=1}^{N} \int_0^T \frac{\partial L(t)}{\partial \theta_i}\, \theta_i'(t)\, dt. \qquad (4)$$

The $i$-th summand in (4) can be interpreted as the contribution of parameter $i$ to the decrease in loss. While we cannot evaluate the integral precisely, we can use a first order approximation to obtain the importances. To do so, we write $\Delta_i(t) = \theta_i(t+1) - \theta_i(t)$ for an approximation of $\theta_i'(t)\, dt$ and get

$$\tilde{\omega}_i(\mathrm{SI}) = -\sum_{t=0}^{T-1} \frac{\partial L(t)}{\partial \theta_i} \cdot \Delta_i(t).$$

In addition, SI rescales its importances as follows:

$$\omega_i(\mathrm{SI}) = \frac{\max\{0, \tilde{\omega}_i(\mathrm{SI})\}}{(\theta_i(T) - \theta_i(0))^2 + \xi}. \qquad (5)$$

(The $\max\{0, \cdot\}$ is not part of the description in [44]; however, we needed to include it to reproduce their results. A similar mechanism can be found in the official SI code.)

The authors of SI provide a theoretical argument for why this importance measure is equal to the Hessian of the loss under some assumptions. Not all of these assumptions seem realistic, and we will show theoretically and empirically that the assumption of using full-batch gradient descent, which is violated in a deep-learning setting, has an important effect.
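To make the diagonal Fisher estimate of EWC concrete, here is a minimal PyTorch sketch in which the inner expectation over $y \sim q_X$ is replaced by a single sampled label per image (the Empirical and Predicted variants simply swap in the dataset label or the argmax); `model` and `inputs` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, inputs):
    # omega_i = E_X E_{y~q_X}[ g_i(X,y)^2 ], Monte-Carlo over one sampled label per image
    params = list(model.parameters())
    omega = [torch.zeros_like(p) for p in params]
    for x in inputs:
        logits = model(x.unsqueeze(0))
        y = torch.distributions.Categorical(logits=logits).sample()  # y ~ q_X
        grads = torch.autograd.grad(F.cross_entropy(logits, y), params)
        for om, g in zip(omega, grads):
            om += g.detach().pow(2)
    return [om / len(inputs) for om in omega]
```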
Related regularisation approaches. Strictly speaking, we presented a version of EWC described in [17] and tested in [6, 37]. Riemannian Walk [6] is a combination of EWC and a 'KL-rescaled' version of SI. Similar to EWC, [36] uses the Fisher Information for regularisation, but rather than a diagonal approximation finds a block diagonal approximation, allowing it to take layer-wise parameter interactions into account. While improving performance, this comes at the cost of a memory requirement of order $N \cdot K$ (where $N$ is the number of parameters and $K$ the number of tasks), which is enough memory to train and store separate networks for each task. (This is especially relevant since the task identity is usually known at test time for regularisation approaches.)

4 Relating SI and MAS to the Fisher Information

In this section we describe our theoretical motivations and design experiments to test our hypotheses, focussing on Memory Aware Synapses (MAS) in Section 4.1 and on Synaptic Intelligence (SI) in Section 4.2. The empirical results of our experiments will be presented in Section 5. An overview of the different algorithms described here can be found in Table 1.
4.1 Memory Aware Synapses and the Absolute Fisher

We start by taking a closer look at the definition of the importances of MAS. We apply linearity of derivatives and the chain rule, and write $y^* = \arg\max q_X$ to obtain

$$\frac{\partial \|q_X\|^2}{\partial \theta} = 2 \sum_{y \in \mathcal{Y}} q_X(y)\, \frac{\partial q_X(y)}{\partial \theta} \approx 2\, q_X(y^*)\, \frac{\partial q_X(y^*)}{\partial \theta} = 2\, q_X(y^*)^2\, \frac{\partial \log q_X(y^*)}{\partial \theta}. \qquad (6)$$

Here we made the assumption that the sum is dominated by its maximum-likelihood label $y^*$, which should be the case if the network classifies its images confidently, i.e. $q_X(y^*) \gg q_X(y)$ for $y \neq y^*$. Using the same heuristic for the Fisher Information we obtain

$$\mathbb{E}_{y \sim q_X}\big[g(X,y)^2\big] \approx q_X(y^*)\, g(X,y^*)^2.$$

Comparing this with (6) and recalling that $g(X,y)$ equals, up to sign, $\partial \log q_X(y) / \partial \theta$, the similarity between MAS and the Fisher Information becomes apparent. The only difference is that MAS has an additional factor of $q_X(y^*)$ and that $g(X,y^*)$ is not squared. We therefore hypothesise that MAS is almost linearly dependent on the Absolute Fisher (AF), which is analogous to the Fisher Information but takes absolute values of gradients rather than their squares, c.f. Table 1.

Table 1: Overview of different algorithms and their importance measures for one task. Algorithms on the left calculate importance 'online' along the parameter trajectory during training. Algorithms on the right calculate importance at the end of training a task by going through (part of) the training set again. Correspondingly, the sum is over timesteps $t$ (left) or datapoints $X$ (right); $N$ is the number of images over which the sum is taken. Note that all algorithms on the left rescale their final importances as in equation (5). In all algorithms, $\Delta(t) = \theta(t+1) - \theta(t)$ refers to the parameter update at time $t$, which depends on both the current task's loss and the regularisation loss. Moreover, $(g_t + \sigma_t)$ refers to the stochastic gradient estimate of the current task's loss (where $g_t$ is the full gradient and $\sigma_t$ the noise) given to the optimizer (together with the regularisation loss) to calculate the parameter update, while $(g_t + \sigma'_t)$ refers to an independent stochastic gradient estimate of the current task's loss. For a datapoint $X$, $q_X$ denotes the predicted label distribution and $g(X, y)$ refers to the gradient of datapoint $X$ if it had label $y$.

  Online (during training):
    SI:                   $\tilde{\omega} = -\sum_t (g_t + \sigma_t)\, \Delta(t)$
    SIU (SI-Unbiased):    $\tilde{\omega} = -\sum_t (g_t + \sigma'_t)\, \Delta(t)$
    SIB (SI Bias-only):   $\tilde{\omega} = -\sum_t (\sigma_t - \sigma'_t)\, \Delta(t)$
    OnAF (Online AF):     $\tilde{\omega} = \sum_t |g_t + \sigma_t|$

  At the end of training:
    Fisher:               $\omega = \frac{1}{N} \sum_X \mathbb{E}_{y \sim q_X}\big[g(X,y)^2\big]$
    AF (Absolute Fisher): $\omega = \frac{1}{N} \sum_X \mathbb{E}_{y \sim q_X}\big[|g(X,y)|\big]$
    MAS:                  $\omega = \frac{1}{N} \sum_X \big|\partial \|q_X\|^2 / \partial \theta\big|$
    MASX (MAS-Max):       $\omega = \frac{1}{N} \sum_X \big|\partial (\max q_X)^2 / \partial \theta\big|$

Our hypothesis is based on two assumptions: (1) The sum over the labels is dominated by its maximum-likelihood label $y^*$, cf. equation (6). (2) The factor $q_X(y^*)$ is roughly constant across most images $X$. To check (1), we explicitly evaluate the term of the sum based on $y^*$ (and call this $\omega(\mathrm{MASX})$, see Table 1), and compare it to the entire sum (MASX vs MAS). Intuitively, (2) is justified because at the end of training most images $X$ will have $q_X(y^*) \approx 1$. Note that even images with $q_X(y^*) = 0.5$ deviate only by a factor of 2, and still lead to a large correlation between MAS and AF. An empirical validation of (2) is checking whether MASX and AF are linearly dependent. Both assumptions (1) & (2) are applicable in a wide range of settings; see Figures 2 (A) and 1b for strong empirical confirmations of them on standard benchmarks.
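For completeness, the MAS importance from Table 1 can be estimated analogously to the Fisher sketch above; again a minimal PyTorch sketch with hypothetical names:

```python
import torch

def mas_importance(model, inputs):
    # omega_i(MAS) = E_X | d ||q_X||^2 / d theta_i |, squared l2 norm of the softmax output
    params = list(model.parameters())
    omega = [torch.zeros_like(p) for p in params]
    for x in inputs:
        q = torch.softmax(model(x.unsqueeze(0)), dim=1)
        grads = torch.autograd.grad(q.pow(2).sum(), params)  # gradient of ||q_X||^2
        for om, g in zip(omega, grads):
            om += g.detach().abs()
    return [om / len(inputs) for om in omega]
```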
4.2 Synaptic Intelligence

4.2.1 The Bias of Synaptic Intelligence

To calculate $\omega(\mathrm{SI})$, we need the product $p = -\frac{\partial L(t)}{\partial \theta} \cdot \Delta(t)$ at each step $t$. Since evaluating the full gradient $\partial L(t)/\partial \theta$ is prohibitively expensive, SI [44] uses a mini-batch. The resulting estimate is biased, since the same mini-batch is used for the parameter update $\Delta(t)$ and the estimate of $\partial L(t)/\partial \theta$.

We now give the calculations detailing the argument above. For ease of exposition, let us assume that the network is optimized using vanilla SGD with learning rate $1$. Given a mini-batch, denote its gradient estimate by $g + \sigma$, where $g = \partial L(t)/\partial \theta$ denotes the real gradient and $\sigma$ the mini-batch noise. With SGD the parameter update equals $\Delta(t) = -(g + \sigma)$. Thus, our product $p$ should be $p = g \cdot (g + \sigma)$. However, using $g + \sigma$, which was used for the parameter update, to estimate $\partial L(t)/\partial \theta$ results in $p_{\mathrm{biased}} = (g + \sigma)^2$. Thus, the gradient noise introduces a bias of $\mathbb{E}[\sigma^2 + \sigma g] = \mathbb{E}[\sigma^2]$.
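The size of this bias is easy to see numerically; a minimal sketch (all values are illustrative, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
g, noise_std, steps = 0.01, 1.0, 100_000   # true gradient, mini-batch noise scale

sigma1 = rng.normal(0, noise_std, steps)   # noise of the batch used for the update
sigma2 = rng.normal(0, noise_std, steps)   # noise of an independent batch

p_biased = ((g + sigma1) * (g + sigma1)).mean()    # reuses the update batch
p_unbiased = ((g + sigma2) * (g + sigma1)).mean()  # independent gradient estimate

print(p_biased)    # ~ g**2 + noise_std**2: the bias E[sigma^2] dominates
print(p_unbiased)  # ~ g**2
```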
Unbiased Synaptic Intelligence. It is easy to design an unbiased estimate of $p$ by using two independent mini-batches to calculate the parameter update and to estimate $g$. This way we get $\Delta(t) = -(g + \sigma)$ and an estimate $g + \sigma'$ for $g$ from an independent mini-batch with independent noise $\sigma'$. We obtain $p_{\mathrm{unbiased}} = (g + \sigma') \cdot (g + \sigma)$, which in expectation is equal to $p = g \cdot (g + \sigma)$. Based on this we define an unbiased importance measure

$$\tilde{\omega}_i(\mathrm{SIU}) = -\sum_{t=0}^{T-1} (g_t + \sigma'_t) \cdot \Delta_i(t).$$
Bias-Only version of SI. To isolate the bias, we can simply take the difference between the biased and the unbiased estimate. Concretely, this gives a measure which captures only the bias of SI and is independent of the path integral:

$$\tilde{\omega}_i(\mathrm{SIB}) = -\sum_{t=0}^{T-1} \big((g_t + \sigma_t) - (g_t + \sigma'_t)\big) \cdot \Delta_i(t) = -\sum_{t=0}^{T-1} (\sigma_t - \sigma'_t) \cdot \Delta_i(t).$$

Observe that this estimate multiplies the parameter update $\Delta(t)$ with nothing but stochastic gradient noise. From the perspective of SI, this should not be meaningful and should have poor continual learning performance.
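The three online measures differ only in which gradient estimate is multiplied with the parameter update. A minimal PyTorch sketch of one training step (names are ours; it assumes the two losses were computed by separate forward passes on two independent mini-batches, and omits the regularisation term for brevity):

```python
import torch

def si_variants_step(params, loss_update_batch, loss_indep_batch, optimizer, omega):
    g_upd = torch.autograd.grad(loss_update_batch, params)  # g_t + sigma_t
    g_ind = torch.autograd.grad(loss_indep_batch, params)   # g_t + sigma'_t
    old = [p.detach().clone() for p in params]
    for p, g in zip(params, g_upd):
        p.grad = g
    optimizer.step()
    for p, p_old, gu, gi, om in zip(params, old, g_upd, g_ind, omega):
        delta = p.detach() - p_old          # Delta(t) = theta(t+1) - theta(t)
        om["SI"] += -gu * delta             # biased: reuses the update batch
        om["SIU"] += -gi * delta            # unbiased: independent batch
        om["SIB"] += -(gu - gi) * delta     # isolates the bias
        om["OnAF"] += gu.abs()              # Online Absolute Fisher
```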
Which part dominates SI? The approximations presented in equation (4) imply that $L(0) - L(T) \approx \sum_i \tilde{\omega}_i(\mathrm{SI})$. Thus, comparing $\sum_i \tilde{\omega}_i$ for SI, SIU and SIB to $L(0) - L(T)$ measures how good the approximations are and shows whether SI is dominated by its bias, see Figure 1a. Further, comparing the continual learning performance of SI, SIU and SIB shows whether SI's continual learning performance relies on the approximation of the path integral or on its bias, see Table 2, Exp No. 2.

4.2.2 Relating SI to the Absolute Fisher

In this section we will indicate the relation between the importance measure $\omega(\mathrm{SI})$ and the Fisher Information. Recall that $\tilde{\omega}(\mathrm{SI})$ is a sum over terms $-\frac{\partial L(t)}{\partial \theta} \cdot \Delta(t)$, where $\Delta(t) = \theta(t+1) - \theta(t)$ is the parameter update at time $t$. Recall as well that both terms in this product, $\partial L(t)/\partial \theta$ as well as $\Delta(t)$, are computed/approximated using the same mini-batch and thus have the same noise.

For a precise understanding, we need to take into account how the optimizer uses stochastic gradients to update the parameters. In our experiments (as well as in the original SI experiments) we used the Adam optimizer [20]. Given a stochastic gradient $g_t + \sigma_t$, it keeps a running average of the gradient, $m_t = (1-\beta_1)(g_t + \sigma_t) + \beta_1 m_{t-1}$, as well as a running average of the squared gradients, $v_t = (1-\beta_2)(g_t + \sigma_t)^2 + \beta_2 v_{t-1}$, where $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Ignoring some normalisation terms and the learning rate, the Adam parameter update is given by $\Delta(t) = -m_t / (\sqrt{v_t} + \epsilon)$. Thus,

$$-\frac{\partial L(t)}{\partial \theta}\, \Delta(t) = \frac{(1-\beta_1)(g_t + \sigma_t)^2}{\sqrt{v_t} + \epsilon} + \frac{\beta_1 (g_t + \sigma_t)\, m_{t-1}}{\sqrt{v_t} + \epsilon}.$$

Assuming that the noise $\sigma_t$ is considerably bigger than the gradient $g_t$, we obtain

$$-\frac{\partial L(t)}{\partial \theta}\, \Delta(t) \approx \frac{(1-\beta_1)(g_t + \sigma_t)^2}{\sqrt{v_t} + \epsilon}. \qquad (7)$$

The precise assumption made here is $(1-\beta_1)\sigma_t^2 \gg \beta_1 m_{t-1} g_t$ (we ignore the term $\sigma_t m_{t-1}$ since $\mathbb{E}[\sigma_t m_{t-1}] = 0$ and since we average SI over many time steps). This assumption is supported strongly by Figure 1a, where the difference between the green (SI) and blue (SIU) lines is caused precisely by $(1-\beta_1)\sigma_t^2$, see B.1 for full details. We also refer to [23] and Figures D.1, D.2 and F.5, F.4 for more evidence for the assumption that the noise is considerably bigger than the gradient.

Next, we consider $v_t$. It is a slowly moving average of $(g_t + \sigma_t)^2$ and will therefore be approximately $\mathbb{E}[(g_t + \sigma_t)^2]$. It is thus reasonable to expect that $\sqrt{v_t}$ is typically roughly proportional to $|g_t + \sigma_t|$. This clearly does not hold for all possible distributions of $g_t + \sigma_t$, but is unlikely to be violated by non-adversarially chosen, realistic distributions. With (7) and $\sqrt{v_t}$ roughly proportional to $|g_t + \sigma_t|$ we arrive at the approximate proportionality

$$\tilde{\omega}(\mathrm{SI}) = -\sum_t \frac{\partial L(t)}{\partial \theta}\, \Delta(t) \approx (1-\beta_1) \sum_t \frac{(g_t + \sigma_t)^2}{\sqrt{v_t}} \;\propto\; \sum_t |g_t + \sigma_t|.$$

Figure 1: (a) Summed importances $\sum_i \tilde{\omega}_i$ for Permuted-MNIST. (b) Pearson correlation for pairs of importance measures for all tasks. Dashed lines represent CIFAR, solid lines MNIST. Error bars are stds across 4 runs. 'AF' is an estimate of the Absolute Fisher based on 30000 (MNIST) resp. 5000 (CIFAR) images. 'AF_small', 'MAS' and 'MASX' are based on 1000 resp. 500 images.

Our resulting approximation $\tilde{\omega}(\mathrm{SI}) \propto \sum_t |g_t + \sigma_t|$ is very similar to the (Empirical) Absolute Fisher. The only differences are that (1) the gradient in $\tilde{\omega}(\mathrm{SI})$ is evaluated in mini-batches rather than on single
images; and (2) the sum is taken 'online' along the parameter trajectory rather than at the final point of the trajectory. (1) does not have a large effect when $\sigma_t \gg g_t$ (as detailed in Appendix B.2), which is the case, c.f. our findings about gradient noise. The influence of (2) is hard to quantify theoretically and depends on how much the gradients change along the parameter trajectory. To capture this difference, we define the Online Absolute Fisher as $\tilde{\omega}(\mathrm{OnAF}) = \sum_t |g_t + \sigma_t|$ and empirically test how similar $\tilde{\omega}(\mathrm{OnAF})$ and $\tilde{\omega}(\mathrm{AF})$ are.

Our heuristics suggest that $\tilde{\omega}(\mathrm{SI})$ and $\tilde{\omega}(\mathrm{OnAF})$ are very similar. If, in addition, we empirically see that $\tilde{\omega}(\mathrm{OnAF})$ and $\tilde{\omega}(\mathrm{AF})$ are similar, this indicates that SI approximates AF, a rescaled version of the Fisher Information. Figures 2 and 1b empirically confirm both parts of this hypothesis.

We emphasise that our derivation above assumes that the parameter update $\Delta(t)$ is given by the gradients of the current task and ignores the regularisation term. The latter will change the update direction and momentum of the optimizer. Thus, strong regularisation will make the relation between SI and OnAF noisy, but not abolish it. See Figure 1b (middle row) and Section F.3 for confirmations hereof.

5 Experiments

In this section we empirically test our three main hypotheses:

1. Does Memory Aware Synapses approximate the Absolute Fisher, as derived in Section 4.1?
2. Which part of the importance $\tilde{\omega}(\mathrm{SI})$ of Synaptic Intelligence, the bias or the unbiased contribution, explains its effectiveness, see Section 4.2.1?
3. Does $\tilde{\omega}(\mathrm{SI})$ indeed correlate strongly with the Absolute Fisher, as indicated in Section 4.2.2?

Our experiments closely follow [44]. We use the
Permuted MNIST [15] benchmark with 10 tasks, a fully connected ReLU architecture and a single output head ('domain incremental' setting, c.f. [16, 43]), and the
Split CIFAR 10/100 benchmark with the Keras default CNN and a multi-head output ('task incremental' setting). Following [42], but unlike [44], we re-initialise model variables (fully connected layers, etc.) after each task and usually find better performance in our hyperparameter searches. We defer further experimental details to Appendix A. Code is available on github (https://github.com/freedbee/continual_regularisation).

Figure 2: Scatter- and intensity plots for pairs of importance measures. $r$ denotes the Pearson correlation. Scatter plots (blue): each point in the scatter plot corresponds to one weight of the net; a straight line through the origin corresponds to two importance measures being multiples of each other. Intensity plots (gray scale): show data analogous to the scatter plots, but are normalised for better visualisation; see Appendix A.1 for details and Figure F.6 for the same data with only scatter/intensity plots. (A) Comparison between MAS, MASX, AF. (B) Comparison between SI, OnAF, AF on the first task of CIFAR (top), the last task of CIFAR (middle) and a 'middle' task of MNIST (bottom).

Does MAS approximate the Absolute Fisher? Our first objective is to test whether MAS and the Absolute Fisher are related. To this end we calculate $\omega(\mathrm{MAS})$, $\omega(\mathrm{MASX})$ and $\omega(\mathrm{AF})$ and show scatter plots of these values in Figure 2 (A). We find that they are almost identical. Figure 1b shows equally strong correlations on the other tasks and datasets, presenting strong evidence for our heuristics. We also evaluated continual learning algorithms based on $\omega(\mathrm{MAS})$, $\omega(\mathrm{MASX})$, $\omega(\mathrm{AF})$ and found very similar results, see Table 2, Exp No. 1, providing further evidence that our mathematical derivations hold.
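The per-weight correlations reported in Figure 1b can be computed directly from two flattened importance measures; a minimal NumPy sketch (function name ours, with one array per parameter tensor):

```python
import numpy as np

def importance_correlation(omega_a, omega_b):
    # Pearson correlation between two importance measures, one value per weight
    a = np.concatenate([om.ravel() for om in omega_a])
    b = np.concatenate([om.ravel() for om in omega_b])
    return np.corrcoef(a, b)[0, 1]
```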
Why does SI work? To understand how large the bias of SI is, we track the sum $\sum_i \tilde{\omega}_i$ of importances for SI and its unbiased version SIU. As a control we include the decrease of the loss, $L(0) - L(T)$, which the path integral (c.f. equation (4)) approximates, as well as a first order approximation of the path integral which does not rely on stochastic gradient estimates (computing the full training set gradient at each step). The results are portrayed in Figure 1a. They indicate that (1) the bias is 5-6 times larger than the unbiased part; (2) using an unbiased gradient estimate and using the entire training set gradient gives almost identical values of $\sum_i \omega_i$, supporting the validity of the unbiased estimator; and (3) even the unbiased first order approximation of the path integral overestimates the decrease in loss. This is consistent with previous empirical studies showing that the loss has positive curvature [18, 23]. See Figures F.5, F.4 for similar results on CIFAR and other MNIST tasks.

We have seen that SI's bias is considerably larger than its unbiased component. This does not fully explain which of these parts is more essential for continual learning. We therefore run continual learning algorithms based on SI, its unbiased version (SIU) as well as a version which isolates the bias (SIB). The results are shown in Table 2, Exp No. 2. SIU consistently has worse performance than SI, and SIB has the same or slightly better performance than SI. From SI's perspective this is rather surprising and is convincing evidence that SI's continual learning capability is due to its bias.
Is SI related to Absolute Fisher? Our main assumption in relating SI to the Online Absolute Fisher (OnAF) was that the contribution of the noise (bias) to SI is bigger than its unbiased part, c.f. equation (7). We have already seen strong evidence that this is the case (Figure 1a). Therefore it may be expected that OnAF and SI have significant correlations: on all tasks of MNIST and the first task of CIFAR, SI and OnAF have strong (Pearson) correlations; on tasks 2-6 of CIFAR we find weaker, but significant, correlations around 0.5, see Figures 2, 1b. This weaker correlation on CIFAR tasks 2-6 is well explained by regularisation, see Section 4.2.2 and Appendix F.3 for two control experiments.
Table 2: Test accuracies on Permuted-MNIST and Split CIFAR 10/100. Algorithms are grouped by experiment as indicated in the main text. Reported results are mean and standard deviation over 3 resp. 10 runs for P-MNIST resp. CIFAR 10/100. [Numeric entries not recoverable. Exp 1 compares MAS, MASX, AF; Exp 2: SI, SIU, SIB; Exp 3: SI, OnAF, AF; Exp 4: EWC, MAS, AF.]

We also find that OnAF and AF are surprisingly similar, thus closely relating SI and AF (Figures 2, 1b) and supporting our hypotheses. We also evaluate the corresponding continual learning algorithms SI, OnAF and AF, see Table 2, Exp 3. The results are very similar, indicating that SI, like MAS, works because it approximates a rescaled version of the Fisher Information.
Is Absolute or Real Fisher better for CL?
We have seen evidence that both MAS and SI rely on estimating a rescaled Fisher Information. This raises the question whether a rescaled version of the Fisher or the Fisher itself is more effective for continual learning. To answer this, we compare MAS, AF and EWC, since these methods all evaluate the importance at the end of training. The results are shown in Table 2, Exp 4. All algorithms perform similarly; only the difference between MAS and EWC on CIFAR is significant (t-test, not corrected for multiple comparisons). Which algorithm performs best may well depend on hyperparameters and the precise training set-up.

6 Discussion

We have focused on regularisation approaches for continual learning and, more specifically, on how these methods assess parameter importance. We investigated the three different importance measures proposed in the literature, namely Elastic Weight Consolidation, Synaptic Intelligence and Memory Aware Synapses. For standard (non-Bayesian) neural networks these are (to the best of our knowledge) the only approaches to regularisation based continual learning. Other regularisation-related work directly relies on one of these algorithms or combinations of them [21, 44, 3, 4, 37, 36, 6, 17]. Originally, all three methods were motivated rather differently and suggested seemingly orthogonal importance measures. However, we gave heuristic mathematical derivations indicating that all methods nevertheless – and somewhat accidentally – are very closely related to the Fisher Information. We thoroughly checked the assumptions underlying our heuristics in realistic, standard continual learning set-ups and found convincing empirical support for our claims. (We check in F.1 that the similarity of AF and OnAF does not depend on how long the networks are trained, and show control correlations between SIU, SIB and (On)AF in F.2, further supporting our claim that continual learning performance is explained by similarity to AF. The complexity of these set-ups and our limited understanding of neural networks in general mean that precise mathematical theorems quantifying when which methods are how similar are currently out of reach. We therefore believe that our simple, clear heuristics are more meaningful and instructive than theorems relying on unrealistic assumptions or only treating toy examples.)

On the one hand, our work unifies the regularisation approaches proposed so far and highlights unexpected similarities. Our mathematical derivations do not only explain some surprising results, but also imply that our findings are more robust than merely observing empirical correlations.

On the other hand, our findings can have immediate practical consequences, as they help explain which algorithms are likely to work in which setting. There are several situations in which we can make grounded predictions on the relative performance of MAS, SI and EWC. To be concrete, we exhibit how our insights may explain empirical observations presented in [9], which is an extensive comparative study of continual learning algorithms. (1) [9] finds that SI is sensitive to task ordering when dataset sizes are imbalanced – namely, when small datasets are learned first, SI overfits more. When training for a fixed number of epochs, small datasets will have small SI importances (since its biased sum contains few terms, independent of the path integral), so that the net is subsequently less regularised and prone to overfit. (2) [9] notes that EWC suffers from having many weights with small importances.
Acknowledgements
I would like to thank Xun Zou, Florian Meier, Angelika Steger, Kalina Petrova and Asier Mujika for many helpful discussions and valuable feedback on this manuscript. I am also thankful for AM's technical advice and the Felix Weissenberger Foundation's ongoing support.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
[3] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
[4] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
[5] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pages 11816–11825, 2019.
[6] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
[7] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.
[8] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.
[9] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.
[10] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
[11] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[12] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[14] Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019.
[15] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[16] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.
[17] Ferenc Huszár. Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, page 201717042, 2018.
[18] Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. arXiv preprint arXiv:1807.05031, 2018.
[19] Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[22] Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximation. arXiv preprint arXiv:1905.12558, 2019.
[23] Janice Lan, Rosanne Liu, Hattie Zhou, and Jason Yosinski. LCA: Loss change allocation for neural network training. In Advances in Neural Information Processing Systems, pages 3614–3624, 2019.
[24] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A neural Dirichlet process mixture model for task-free continual learning. arXiv preprint arXiv:2001.00689, 2020.
[25] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. arXiv preprint arXiv:1904.00310, 2019.
[26] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[27] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
[28] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation - Advances in Research and Theory, 24:109–165, 1989.
[29] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
[30] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[31] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
[32] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
[33] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.
[34] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
[35] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
[36] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.
[37] Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
[38] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[40] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[41] Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems 26, pages 2310–2318. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5059-compete-to-compute.pdf.
[42] Siddharth Swaroop, Cuong V Nguyen, Thang D Bui, and Richard E Turner. Improving and understanding variational continual learning. arXiv preprint arXiv:1905.02099, 2019.
[43] Gido M van de Ven and Andreas S Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
[44] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, pages 3987–3995, 2017.
Appendix
This appendix has several components:

1. We give full experimental details, including hyperparameter searches and values, in Section A.
2. We give the details of two calculations omitted in our derivation that SI and AF are related, see Section B.
3. We describe some variants of SI and OnAF, which we tried with the aim of improving performance, in Section C.
4. We empirically investigate the gradient noise in Section D. This experiment is not specific to continual learning.
5. We provide an empirical comparison of different versions of the Fisher Information ('real', 'empirical' and 'predicted') in Section E.
6. We include additional experiments and some additional plots in Section F.
A Experimental details
We tried to include all details necessary to reproduce our results here. If we forgot something, don't hesitate to send us an email. Code will also be available on github.
A.1 Scatter Plot Data Collection and Intensity Plots
The scatter plots in Figure 2 are based on a run of SI where, in parallel to evaluating the SI importance measure, we also evaluated all other importance measures. AF is evaluated using 30000 samples on MNIST and 500 on CIFAR. Figure 1b shows data based on 4 repetitions of this experiment and confirms that the correlations observed in the scatter plots are representative; it also shows the effect of sample sizes on AF. In addition, we gathered data following the precise setup of [44], i.e. without re-initialising model variables after each task, and found even stronger support for our heuristics in this setting, see Section F.3.
Intensity plots show the same kind of data as the scatter plots, but each weight is binned into one of 80x50 equispaced bins. The number of weights per bin is divided by the maximum number of weights per bin in that column, and the resulting value is shown on a gray scale. Normalisation was performed per column because with global normalisation only the weights in the bottom left are visible; the number of bins was chosen to match the aspect ratio of the plots.
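This binning and per-column normalisation is simple to reproduce; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def intensity_grid(x, y, n_cols=80, n_rows=50):
    # 2D histogram of (importance_x, importance_y) pairs, one point per weight
    counts, _, _ = np.histogram2d(x, y, bins=[n_cols, n_rows])
    # normalise each column (x-bin) by its maximum so sparse columns stay visible
    col_max = counts.max(axis=1, keepdims=True)
    return counts / np.maximum(col_max, 1)
```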
A.2 Benchmarks

In Permuted MNIST [15] each task consists of predicting the label of a random (but fixed) pixel permutation of MNIST. We use a total of 10 tasks. As noted, we use a domain incremental setting, i.e. a single output head shared by all tasks.

In Split CIFAR10/100 the network first has to classify CIFAR10 and is then successively given groups of 10 (consecutive) classes of CIFAR100. As in [44], we use 6 tasks in total. Preliminary experiments on the maximum of 11 tasks showed very little difference. We use a task-incremental setting, i.e. each task has its own output head and the task identity is known during training and testing.
A.3 Pre-processing
Following [44], we normalise all pixel values to be in the interval $[0, 1]$ (i.e. we divide by 255) for the MNIST and CIFAR datasets and use no data augmentation. We point out that pre-processing can affect performance; e.g. differences in pre-processing are part of the reason why results for SI on Permuted-MNIST in [16] are considerably worse than reported in [44] (the other two reasons being an unusually small learning rate for Adam in [16] and a slightly smaller architecture).

A.4 Architectures and Initialization

For P-MNIST we use a fully connected net with ReLU activations and two hidden layers of 2000 units each. For Split CIFAR10/100 we use the default Keras CNN for CIFAR 10 with 4 convolutional layers and two fully connected layers, dropout [40] and maxpooling, see Table A.1. We use Glorot-uniform [13] initialization for both architectures.
A.5 Optimization
We use the Adam optimizer [20] with tensorflow [1] default settings ($lr = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) and batch size 256 for both benchmarks, like [44]. We reset the optimizer variables (momentum and such) after each task.

A.6 SI & OnAF details
Recall that we applied the operation $\max(0, \cdot)$ (i.e. a ReLU) to the importance measure of each individual task (equation (5) of the main paper) before adding it to the overall importance. In the original SI implementation, this seems to be replaced by applying the same operation to the overall importances (after adding potentially negative values from a given task). No description of either of these operations is given in the SI paper. In light of our findings, our version seems more justified.

Somewhat naturally, the gradient $\partial L / \partial \theta$ usually refers to the cross-entropy loss of the current task and not the total loss including regularisation. For CIFAR we evaluated this gradient without dropout (but the parameter update was of course evaluated with dropout). (We did not check the original SI code for what is done there, and the paper does not describe which version is used.)

For OnAF, similarly to SI, we used the gradient evaluated without dropout on the cross-entropy loss of the current task for our importance measure.
A.7 Estimating Fisher Information, MAS, etc.
When estimating the Fisher Information / MAS, we do not iterate over the whole training set, due to the considerable computational cost. We sample uniformly random subsets of size 1000 (P-MNIST) and 500 (CIFAR). In preliminary experiments with twice as large subsets we found no improvement in average accuracy. When comparing two different measures (e.g. MAS and AF) we use the same samples to calculate both measures to avoid unnecessary noise in our comparisons.
Table A.1: CIFAR 10/100 architecture. Following [44] we use the Keras default architecture for CIFAR 10. Below, 'Filt.' refers to the number of filters of a convolutional layer, or respectively the number of neurons in a fully connected layer. 'Drop.' refers to the dropout rate. 'Non-Lin.' refers to the type of non-linearity used. Table reproduced and slightly adapted from [44].

  Layer          Kernel  Stride  Filt.  Drop.  Non-Lin.
  3x32x32 input
  Convolution    3x3     1x1     32            ReLU
  Convolution    3x3     1x1     32            ReLU
  MaxPool        2x2     2x2            0.25
  Convolution    3x3     1x1     64            ReLU
  Convolution    3x3     1x1     64            ReLU
  MaxPool        2x2     2x2            0.25
  FC                             512    0.5    ReLU
  Task 1: FC                     10            softmax
  ...
  Task 6: FC                     10            softmax
Table A.2: Best hyperparameters per method: the regularisation strength $c$ and whether model variables were re-initialised after each task. Entries marked '–' were not recoverable; the CIFAR values of $c$ for MAS and MASX are those given in the text below.

  Algorithm  c (MNIST)  re-init (MNIST)  c (CIFAR)  re-init (CIFAR)
  SI         0.2        yes              –          yes
  SIU        2.0        yes              –          no
  SIB        0.5        yes              –          yes
  OnAF       5e-5       yes              –          yes
  MAS        500        no               200        yes
  MASX       200        no               1000       yes
  EWC        5e3        no               –          yes
  AF         200        no               –          yes

A.8 Hyperparameters
For all methods and benchmarks, we performed grid searches for the hyperparameter $c$ and over the choice whether or not to re-initialise model variables after each task. The grid of $c$ included values $a \cdot 10^i$ where $a \in \{1, 2, 5\}$ and $i$ was chosen in a suitable range (if a hyperparameter close to a boundary of our grid performed best, we extended the grid).

For CIFAR, we measured validation set performance based on at least three repetitions for good hyperparameters. We then picked the best HP and ran 10 repetitions on the test set. For MNIST, we measured HP quality on the test set based on at least 3 runs for good HPs.

Additionally, SI and consequently SIU, SIB as well as OnAF have rescaled importances (c.f. equation (5) from the main paper). The damping term $\xi$ in this rescaling was set following [44] without further HP search.

All results shown are based on the same hyperparameters, obtained – individually for each method – as described above. They can be found in Table A.2. We note that the difference between the HPs for MAS and MASX might seem to contradict our claim that the two measures are almost identical (they should require the same $c$ in this case), but this is most likely due to similar performance for different HPs and random fluctuations. For example, on CIFAR, MAS had almost identical validation performance for $c = 200$ (best for MAS) and $c = 1000$ (best for MASX). Also for the other methods, we observed that usually there were two or more HP configurations which performed very similarly. The precise 'best' values as found by the HP search and reported in Table A.2 are therefore subject to random fluctuations in the grid search.
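As an aside, a grid of this shape can be generated in one line (the range of exponents here is illustrative; the paper extends the grid when a boundary value wins):

```python
cs = sorted(a * 10.0 ** i for i in range(-5, 4) for a in (1, 2, 5))
```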
B Two Calculations

Here we give two calculations omitted in Section 4.2.2 of the main paper.
B.1 Bias of SI with Adam
We claimed that the difference between SI and SIU (green and blue lines) seen in Figure 1a (and also in Figures F.4, F.5) is due to the term $(1-\beta_1)\sigma_t^2$. To see this, recall that for SI we approximate $\partial L(t)/\partial \theta$ by $g_t + \sigma_t$, which is the same gradient estimate given to Adam. So we get

$$\text{SI:}\quad -\frac{\partial L(t)}{\partial \theta}\, \Delta(t) = \frac{(1-\beta_1)(g_t + \sigma_t)^2}{\sqrt{v_t} + \epsilon} + \frac{\beta_1 (g_t + \sigma_t)\, m_{t-1}}{\sqrt{v_t} + \epsilon}.$$

For SIU, we use an independent mini-batch estimate $g_t + \sigma'_t$ for $\partial L(t)/\partial \theta$ and therefore obtain

$$\text{SIU:}\quad -\frac{\partial L(t)}{\partial \theta}\, \Delta(t) = \frac{(1-\beta_1)(g_t + \sigma'_t)(g_t + \sigma_t)}{\sqrt{v_t} + \epsilon} + \frac{\beta_1 (g_t + \sigma'_t)\, m_{t-1}}{\sqrt{v_t} + \epsilon}.$$

Taking the difference between these two and ignoring all terms which have expectation zero (note that $\mathbb{E}[\sigma_t] = \mathbb{E}[\sigma'_t] = 0$ and that $\sigma_t, \sigma'_t$ are independent of $m_{t-1}$ and $g_t$) gives

$$\text{SI} - \text{SIU:}\quad \frac{(1-\beta_1)\sigma_t^2}{\sqrt{v_t} + \epsilon},$$

as claimed. Note also that in expectation SIU equals $\frac{(1-\beta_1)g_t^2}{\sqrt{v_t}+\epsilon} + \frac{\beta_1 g_t m_{t-1}}{\sqrt{v_t}+\epsilon}$, so that a large difference between SI and SIU really means $(1-\beta_1)\sigma_t^2 \gg (1-\beta_1)g_t^2 + \beta_1 m_{t-1} g_t \approx \beta_1 m_{t-1} g_t$. The last approximation here is valid because $\beta_1 m_{t-1} \gg (1-\beta_1) g_t$, which holds since (1) $\beta_1 \gg (1-\beta_1)$ and (2) we have $\mathbb{E}[m_{t-1}] \approx g_t$, and $m_{t-1}$ additionally has a noise component which $g_t$ does not have, so that $\mathbb{E}[|m_{t-1}|] > \mathbb{E}[|g_t|]$.

B.2 Effect of Calculating the Fisher in Batches
For this subsection, let us slightly change notation and denote the images by $X_1, \ldots, X_D$ and the gradients (with respect to their labels and the cross entropy loss) by $g + \sigma_1, \ldots, g + \sigma_D$. Here, again, $g$ is the overall training set gradient and $\sigma_i$ is the noise (i.e. $\sum_{i=1}^{D} \sigma_i = 0$). Then the Empirical Fisher is given by

$$\mathrm{EF} = \frac{1}{D} \sum_{i=1}^{D} (g + \sigma_i)^2.$$

We want to compare this to evaluating the squared gradient over a batch. Let $i_1, \ldots, i_b$ denote uniformly random, independent indices from $\{1, \ldots, D\}$, so that $X_{i_1}, \ldots, X_{i_b}$ is a random mini-batch of size $b$. Let $g + \sigma$ be the gradient on this mini-batch. We then have, taking expectations over the random indices,

$$\mathbb{E}\big[(g+\sigma)^2\big] = \mathbb{E}\Big[\frac{1}{b^2} \sum_{r,s=1}^{b} (g + \sigma_{i_r})(g + \sigma_{i_s})\Big] = \frac{b(b-1)}{b^2}\, \mathbb{E}\big[(g + \sigma_{i_1})(g + \sigma_{i_2})\big] + \frac{b}{b^2}\, \mathbb{E}\big[(g + \sigma_{i_1})^2\big] = \frac{b-1}{b}\, g^2 + \frac{1}{b}\, \mathrm{EF} \approx \frac{1}{b}\, \mathrm{EF}.$$

This conclusion may seem surprising, but please recall that it is only based on one assumption, namely that the gradient noise is considerably bigger than the gradient itself – an assumption for which we have collected ample evidence. Concretely, we assume $\mathbb{E}[(g + \sigma_{i_1})^2] \gg (b-1)\, g^2$. Note that, unlike in other sections of this manuscript, here $\sigma$ refers to the gradient noise of individual images (rather than mini-batch noise). We have seen that even with a batch size of 256 (i.e. when we reduce the noise by a factor of 256) the noise is still much bigger than the gradient (recall one more time Figures F.4, F.5, D.1, D.2), so that our assumption is true with room to spare.

We also mention preliminary, and rather unsurprising, experiments in which we evaluated the Empirical Fisher in mini-batches on P-MNIST (at the end of training a task) and observed the same performance as when evaluating the Empirical Fisher (and also the 'real' Fisher, for that matter) on single images.
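The relation $\mathbb{E}[(g+\sigma)^2] \approx \mathrm{EF}/b$ is easy to verify numerically; a minimal sketch with an illustrative scalar 'gradient' (all values are assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
D, b = 50_000, 256
g = 0.01                                  # full training set gradient (scalar sketch)
per_image = g + rng.normal(0, 1.0, D)     # per-image gradients g + sigma_i
per_image -= per_image.mean() - g         # enforce sum of noise = 0

ef = (per_image ** 2).mean()              # Empirical Fisher: (1/D) sum (g + sigma_i)^2
batches = rng.choice(per_image, size=(10_000, b)).mean(axis=1)
print((batches ** 2).mean(), ef / b)      # squared mini-batch gradient ~ EF / b
```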
C Variants of SI/OnAF

Based on our findings, we tried several adaptations of SI, as described below.

Firstly, rather than taking the running sum of the product of gradient and update for SI, we experimented with an exponential moving average (EMA). We tried this based on our evidence that SI approximates some form of the Fisher Information: the EMA puts more weight on recently observed samples of this value, which should be more related to the actual Fisher Information, and discards information which stems from too far in the past. We performed HP searches over different decay factors for the running average on P-MNIST and found small improvements. Additionally, we found that the EMA is less stable if training time is increased (to e.g. 100 or 200 epochs per task). In this case, the magnitude of the EMA after, for example, the first task varies greatly from run to run (since the parameters may or may not be in a very flat region of the loss landscape), while the running sum (as in SI) is more stable (as we observed empirically).

Secondly, observe that the magnitude of the per-task importance of SI depends largely on how long we train, rather than on the change in loss. This can be seen for example in Figure F.5, where the importances of the CIFAR 100 tasks are considerably smaller than the importance of the CIFAR 10 task. This is partly because each class in CIFAR 100 has 10 times fewer training samples, corresponding to 10 times fewer training iterations when keeping the number of epochs fixed (at 60 in our case). This suggests rescaling the per-task importance by the number of training iterations of that task. We simply divided the importance of each task by the number of training iterations of that task (and performed a new HP search). We found no change in performance with this rescaling.

Thirdly, recall that we argued that SI is closely related to the running sum of absolute gradient values, OnAF. We also implemented an importance measure based on the running sum of squared gradients (OnF – Online Fisher), as this variant more closely matches the real Fisher Information. We observed a slight decrease in performance.

We believe that our findings may indicate that regularisation based continual learning is fairly robust to rescalings of its importance measures, at least on the datasets/settings we tested.
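The EMA variant simply replaces the running sum with a decayed average; a one-line sketch (the decay value is one of the hyperparameters searched over; the default here is illustrative):

```python
def ema_update(omega_ema, step_contribution, decay=0.999):
    # exponential moving average of the per-step SI contribution;
    # recent steps get more weight than with SI's plain running sum
    return decay * omega_ema + (1 - decay) * step_contribution
```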
D Gradient Noise

Here, we quantitatively assess the noise magnitude outside the continual learning context. Recall that Figure 1a from the main paper, as well as Figures F.4 and F.5, already show that the noise dominates the SI importance measure, which indicates that the noise is considerably larger than the gradient itself.

To obtain an assessment independent of the SI continual learning importance measure, we trained our network on MNIST as described before, i.e. a ReLU network with 2 hidden layers of 2000 units each, trained for 20 epochs with batch size 256 and default Adam settings. At each training iteration, on top of calculating the stochastic mini-batch gradient used for optimization, we also computed the full gradient on the entire training set and computed the noise – which refers to the squared $\ell_2$ distance between the stochastic mini-batch gradient and the full gradient – as well as the ratio between noise and gradient, measured as the ratio of squared $\ell_2$ norms. The results are shown in Figure D.1. In addition, we computed the fraction of iterations in which the ratio between noise and squared gradient norm is above a certain threshold, see Figure D.2.

Figure D.1: Gradient noise, measured as squared $\ell_2$ distance between the full training set gradient and the stochastic mini-batch gradient with a batch size of 256. 'Full gradient' magnitude is also measured as squared $\ell_2$ norm. Data obtained by training a ReLU network with 2 hidden layers of 2000 hidden units for 20 epochs with default Adam settings. Only every 20th datapoint is shown for better visualisation.

Figure D.2: The $y$-value shows the fraction of training iterations in which the ratio between mini-batch noise and full training set gradient was at least the $x$-value. Data obtained as described in the main text / Figure D.1; in particular, the batch size was 256. 'Ratio' refers to the ratio of squared $\ell_2$ norms of the respective values.
E Comparison of different variants of Fisher Information

It is not fully clear which version of the Fisher Information is used for EWC and its variants, since the 'Empirical' and 'Real' Fisher Information are both often referred to simply as 'Fisher Information' in the literature. Neither [21] nor [37] make code available or provide details in the description of their algorithms.

We compare the different variants here and note that [43] also hints at such a comparison. We compare the 'empirical', 'predicted' as well as 'real' Fisher, as described in the main paper. Note that, at least with a naïve implementation, the real Fisher takes 10 times longer to compute than the empirical and predicted Fisher, where 10 is the number of potential output labels. More efficient ways to calculate the Fisher Information [22, 27] are not directly supported by standard deep learning libraries (at the time of writing, to the best of our knowledge).

We report continual learning performance based on the different Fisher Informations in Table E.1 and find very little difference.

Additionally, we trained SI on P-MNIST and Split CIFAR, evaluated the different kinds of Fisher at the end of each task using the same samples of the training set (10 tasks for P-MNIST and 6 tasks for CIFAR), and calculated the Pearson correlations.

For the real and empirical Fisher, the correlation was between 0.49 and 0.91 on P-MNIST and between 0.77 and 0.99 on Split CIFAR.

For the real and predicted Fisher, the correlation was between 0.75 and 0.92 on P-MNIST and between 0.77 and 1.00 on Split CIFAR.
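For concreteness, the following sketch illustrates how the three diagonal variants could be computed with a naive per-example loop. It is our illustration of the definitions above, not the code used for the experiments; note that the 'real' variant performs one backward pass per class, which explains the factor-of-10 slowdown mentioned above.

    import torch
    import torch.nn.functional as F

    def diag_fisher(model, xs, ys, variant="empirical"):
        # Diagonal Fisher estimate; naive per-example loop for clarity.
        fisher = [torch.zeros_like(p) for p in model.parameters()]
        for x, y in zip(xs, ys):
            with torch.no_grad():
                log_p = F.log_softmax(model(x.unsqueeze(0)), dim=1).squeeze(0)
            if variant == "empirical":
                labels = [(int(y), 1.0)]               # true label
            elif variant == "predicted":
                labels = [(int(log_p.argmax()), 1.0)]  # model's hard prediction
            else:
                # 'real': expectation over the model's own predictive
                # distribution, one backward pass per possible label.
                labels = [(c, float(log_p[c].exp())) for c in range(log_p.numel())]
            for c, w in labels:
                model.zero_grad()
                out = F.log_softmax(model(x.unsqueeze(0)), dim=1).squeeze(0)
                out[c].backward()                      # grad of the log-likelihood
                for f, p in zip(fisher, model.parameters()):
                    f.add_(w * p.grad.detach() ** 2)   # accumulate squared gradients
        return [f / len(xs) for f in fisher]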
Table E.1: Classification accuracies for different versions of the Fisher Information, evaluated on Permuted-MNIST and Split CIFAR 10/100.

ALGORITHM                 P-MNIST      CIFAR 10/100
REAL FISHER (EWC?)        97.2 ±       ±
EMPIRICAL FISHER          ±            ±
PREDICTED FISHER          ±            ±

F More Experiments and Plots
For SI, we note that the observation that most of the SI importance is due to its bias is consistent across datasets and tasks, see Figures F.4 and F.5. Intriguingly, on CIFAR we find that the unbiased approximation of SI slightly underestimates the decrease in loss on the last task, suggesting that strong regularisation pushes the parameters into regions where the cross-entropy of the current task has negative curvature.

We show analogues of Figure 2 consisting of only scatter / only intensity plots in Figure F.6. Note that for both architectures there are a few million weights. The scatter plots are overcrowded and show the whole range of dependencies between two measures that can possibly occur. Intensity plots show the dependencies that the majority of weights adheres to.

Figure F.1: Analogous to Figure 1b from the main paper, comparing SI, SIU, SIB. Left: CIFAR, Right: MNIST.
F.1 Influence of Training Duration on AF, OnAF
It seems plausible that the correlation between AF and OnAF we observed is due to training on tasks for a long time – the model might converge quickly, so that the sum of OnAF is dominated by summands which approximate AF close to the final point in training. To test this, we trained our networks for varying numbers of epochs on the first task of P-MNIST and Split CIFAR 10/100 and measured the correlation between OnAF and AF; a sketch of this measurement is given below. The results in Tables F.1 and F.2 show that the correlation between AF and OnAF does not rely on a long training duration. On MNIST, we even see decreasing correlation with longer training time; this may be due to an actual decrease in correlation, to higher variance when estimating AF using 10000 samples, or both.
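The following sketch illustrates the measurement, under the assumption that AF is estimated as the mean absolute per-sample gradient at the end of training; loader and function names are placeholders, not the paper's code.

    import torch

    def pearson(a, b):
        a, b = a - a.mean(), b - b.mean()
        return float((a * b).sum() / (a.norm() * b.norm()))

    def train_and_correlate(model, loss_fn, optimizer, train_loader, af_data, n_epochs):
        n_params = sum(p.numel() for p in model.parameters())
        onaf = torch.zeros(n_params)
        for _ in range(n_epochs):
            for x, y in train_loader:
                model.zero_grad()
                loss_fn(model(x), y).backward()
                g = torch.cat([p.grad.detach().flatten() for p in model.parameters()])
                onaf += g.abs()  # OnAF: running sum of absolute mini-batch gradients
                optimizer.step()
        # AF: mean absolute per-sample gradient, evaluated at the end of training.
        af = torch.zeros(n_params)
        for x, y in af_data:
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            af += torch.cat([p.grad.detach().abs().flatten() for p in model.parameters()])
        af /= len(af_data)
        return pearson(onaf, af)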
Table F.1: Correlation between AF and OnAF on CIFAR10, depending on training time. We show the mean and standard deviation of the Pearson correlation over 3 runs. AF is based on 500 samples. If the standard deviation is below 0.005, it is shown as 0.00.

NUMBER OF EPOCHS    CORRELATION
Table F.2: Correlation between AF and OnAF on MNIST, depending on training time. We show the mean and standard deviation of the Pearson correlation over 3 runs. AF is based on 10000 samples.

NUMBER OF EPOCHS    CORRELATION

F.2 Comparisons between SI, SIU and SIB
Here, we show the correlations of SIU and SIB with OnAF and AF as controls, see Figure F.1. We generally find that SIU is less correlated with AF/OnAF than SI and SIB, as predicted by our heuristic derivations. This is in line with our hypothesis that the performance of these algorithms is largely explained by their relation to (On)AF.

The only exception is the first task of CIFAR (i.e. CIFAR10), where the correlation with OnAF is . ± . for SI, . ± . for SIU and . ± . for SIB. The similarity between SIU and OnAF in this situation is explained by the weights with the largest OnAF importance: if we remove the 5% of weights which have the highest OnAF importance, the correlation between OnAF and SIU drops to . ± . , while for SI resp. SIB the same procedure yields . ± . resp. . ± . (a sketch of this control is given below). This indicates that for the first task of CIFAR there is a small fraction of weights with large OnAF importances, which are dominated by the gradient rather than the noise. The remainder of the weights is in accordance with our previous observations and dominated by noise, recall also Figure F.5.
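The following sketch illustrates this control, assuming the importance measures are given as flat per-parameter vectors; names are ours, not the paper's code.

    import torch

    def pearson(a, b):
        a, b = a - a.mean(), b - b.mean()
        return float((a * b).sum() / (a.norm() * b.norm()))

    def correlation_without_top(onaf, other, frac=0.05):
        # Remove the `frac` fraction of weights with largest OnAF importance
        # and recompute the correlation on the remaining weights.
        k = int(frac * onaf.numel())
        top = torch.topk(onaf, k).indices
        keep = torch.ones_like(onaf, dtype=torch.bool)
        keep[top] = False
        return pearson(onaf[keep], other[keep])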
F.3 Effect of regularisation on SI

Figure F.2: Note the y-scale. Analogous to Figure 1b from the main paper, showing data from CIFAR and confirming that with weaker regularisation, OnAF and SI are similar. Left: runs of SI with c = 0. Right: runs of SI without reinitialising network weights.

We perform two experiments to check whether the weaker correspondence between SI and OnAF on CIFAR tasks 2-6, as compared to CIFAR task 1, is really due to regularisation. Firstly, we set the hyperparameter c governing the strength of regularisation to 0. We show results in Figure F.2, finding that the correspondence is as strong on tasks 2-6 as it was on the first task. Secondly, we precisely follow the setting of the original SI method [44] and do not reinitialise the network weights after each task. Note that in this setting, the optimal hyperparameter as found by our HP search is c = 0. and validation performance is slightly worse ( . ± . without re-initialisation vs . ± . with reinitialisation and c = 0. ). In addition to a smaller c, the lack of reinitialisation means that the parameters never get too far away from the optimum of the old task during optimisation, so that the regularisation gradients are smaller. Indeed, we find that in this case there is a strong correspondence between OnAF and SI, as shown in Figure F.2.

Finally, we point out that our comparisons are usually obtained after applying a max(0, ·) operation (cf. equation (5) of the main paper) to the SI (and SIU, SIB) importances, as the algorithms diverge without this operation; a sketch of this step is given below. We present results without this max operation in Figure F.3, further showing that (1) the correlation between OnAF and SI is weakened by strong regularisation and (2) the correlation is due to the bias of SI. See also Figure F.7.
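A minimal sketch of this clamping step, assuming importances are stored per parameter name; this is our illustration of the operation, not the original implementation.

    import torch

    def consolidate_task_importances(total, per_task):
        # `total` and `per_task` map parameter names to importance tensors.
        for name, omega in per_task.items():
            # max(0, .) applied elementwise before accumulation; without this
            # clamping the algorithms diverge (cf. equation (5) of the main paper).
            total[name] = total.get(name, torch.zeros_like(omega)) + omega.clamp(min=0.0)
        return total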
Figure F.3: Analogous to Figure 1b from the main paper, comparing SI, SIU, SIB, but before the operation max(0, ·) is applied to the importances of SI, SIU, SIB. See full text.

Figure F.4: Summed importances for all P-MNIST tasks for SI and its unbiased version. Analogous to Figure 1a from the main paper.

Figure F.5: Summed importances for all Split CIFAR 10/100 tasks for SI and its unbiased version. Analogous to Figure 1a from the main paper.

Figure F.6: Same data as Figure 2 from the main paper.

Figure F.7: Scatter plots of a variant of SI which does not set negative per-task importances to 0 after each task. Refer to the main text for more details, see also Figure F.3.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
[3] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
[4] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
[5] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pages 11816–11825, 2019.
[6] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
[7] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
[8] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.
[9] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.
[10] Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
[11] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[12] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[14] Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019.
[15] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[16] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.
[17] Ferenc Huszár. Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, page 201717042, 2018.
[18] Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of dnn loss and the sgd step length. arXiv preprint arXiv:1807.05031, 2018.
[19] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[22] Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical fisher approximation. arXiv preprint arXiv:1905.12558, 2019.
[23] Janice Lan, Rosanne Liu, Hattie Zhou, and Jason Yosinski. LCA: Loss change allocation for neural network training. In Advances in Neural Information Processing Systems, pages 3614–3624, 2019.
[24] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A neural dirichlet process mixture model for task-free continual learning. arXiv preprint arXiv:2001.00689, 2020.
[25] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. arXiv preprint arXiv:1904.00310, 2019.
[26] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[27] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
[28] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation - Advances in Research and Theory, 24:109–165, 1989.
[29] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
[30] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[31] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
[32] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
[33] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.
[34] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
[35] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
[36] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.
[37] Jonathan Schwarz, Jelena Luketina, Wojciech M Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
[38] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[40] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[41] Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems 26, pages 2310–2318. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5059-compete-to-compute.pdf.
[42] Siddharth Swaroop, Cuong V Nguyen, Thang D Bui, and Richard E Turner. Improving and understanding variational continual learning. arXiv preprint arXiv:1905.02099, 2019.
[43] Gido M van de Ven and Andreas S Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
[44] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.