Generalized Variational Continual Learning
Noel Loo, Siddharth Swaroop & Richard E. Turner
University of Cambridge
{nl355,ss2163,ret26}@cam.ac.uk

ABSTRACT
Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. In larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, whilst also providing significantly better calibration.
1 INTRODUCTION
Continual learning methods enable learning when a set of tasks changes over time. This topic is of practical interest as many real-world applications require models to be regularly updated as new data is collected or new tasks arise. Standard machine learning models and training procedures fail in these settings (French, 1999), so bespoke architectures and fitting procedures are required.

This paper makes two main contributions to continual learning for neural networks. First, we develop a new regularization-based approach to continual learning. Regularization approaches adapt parameters to new tasks while keeping them close to settings that are appropriate for old tasks. Two popular approaches of this type are Variational Continual Learning (VCL) (Nguyen et al., 2018) and Online Elastic Weight Consolidation (Online EWC) (Kirkpatrick et al., 2017; Schwarz et al., 2018). The former is based on a variational approximation of a neural network's posterior distribution over weights, while the latter uses Laplace's approximation. In this paper, we propose Generalized Variational Continual Learning (GVCL), of which VCL and Online EWC are two special cases. Under this unified framework, we are able to combine the strengths of both approaches. GVCL is closely related to likelihood-tempered Variational Inference (VI), which has been found to improve performance in standard learning settings (Zhang et al., 2018; Osawa et al., 2019). We also see significant performance improvements in continual learning.

Our second contribution is to introduce an architectural modification to the neural network that combats the deleterious overpruning effect of VI (Trippe & Turner, 2018; Turner & Sahani, 2011). We analyze pruning in VCL and show how task-specific FiLM layers mitigate it. Combining this architectural change with GVCL further improves performance, which exceeds or is within statistical error of strong baselines such as HAT (Serra et al., 2018) and PathNet (Fernando et al., 2017).

The paper is organized as follows. Section 2 outlines the derivation of GVCL, shows how it unifies many continual learning algorithms, and describes why it might be expected to perform better than them. Section 3 introduces FiLM layers, first from the perspective of multi-task learning, and then through the lens of variational over-pruning, showing how FiLM layers mitigate this pathology of VCL. Finally, in Section 5 we test GVCL and GVCL with FiLM layers on many standard benchmarks, including ones with few samples, a regime that could benefit more from continual learning. We find that GVCL with FiLM layers outperforms existing baselines on a variety of metrics, including raw accuracy, forwards and backwards transfer, and calibration error. In Section 5.4 we show that FiLM layers provide a disproportionate improvement to variational methods, confirming our hypothesis in Section 3.

2 GENERALIZED VARIATIONAL CONTINUAL LEARNING
In this section, we introduce Generalized Variational Continual Learning (GVCL) as a likelihood-tempered version of VCL, with further details in Appendix C. We show how GVCL recovers Online EWC. We also discuss further links between GVCL and the Bayesian cold posterior in Appendix D.

2.1 LIKELIHOOD-TEMPERING IN VARIATIONAL CONTINUAL LEARNING
Variational Continual Learning (VCL).
Bayes' rule calculates a posterior distribution over model parameters θ based on a prior distribution p(θ) and some dataset D_T = {X_T, y_T}. Bayes' rule naturally supports online and continual learning by using the previous posterior p(θ|D_{T−1}) as a new prior when seeing new data (Nguyen et al., 2018). Due to the intractability of Bayes' rule in complicated models such as neural networks, approximations are employed, and VCL (Nguyen et al., 2018) uses one such approximation, Variational Inference (VI). This approximation is based on approximating the posterior p(θ|D_T) with a simpler distribution q_T(θ), such as a Gaussian. This is achieved by optimizing the ELBO for the optimal q_T(θ),

ELBO_VCL = E_{θ∼q_T(θ)}[log p(D_T|θ)] − D_KL(q_T(θ) ‖ q_{T−1}(θ)),   (1)

where q_{T−1}(θ) is the approximation to the previous task posterior. Intuitively, this refines a distribution over weight samples that balances good predictive performance (the first expected prediction accuracy term) while remaining close to the prior (the second KL-divergence regularization term).

Likelihood-tempered VCL.
Optimizing the ELBO will recover the true posterior if the approximating family is sufficiently rich. However, the simple families used in practice typically lead to poor test-set performance. Practitioners have found that performance can be improved by down-weighting the KL-divergence regularization term by a factor β, with 0 < β < 1. Examples of this are seen in Zhang et al. (2018) and Osawa et al. (2019), where the latter uses a "data augmentation factor" for down-weighting. In a similar vein, sampling from "cold posteriors" in SG-MCMC has also been shown to outperform the standard Bayes posterior, where the cold posterior is given by p_T(θ|D) ∝ p(θ|D)^{1/T}, T < 1 (Wenzel et al., 2020). Values of β > 1 have also been used to improve the disentanglement of representations learned by variational autoencoders (Higgins et al., 2017). We down-weight the KL-divergence term in VCL, optimizing the β-ELBO,

β-ELBO = E_{θ∼q_T(θ)}[log p(D_T|θ)] − β D_KL(q_T(θ) ‖ q_{T−1}(θ)).

VCL is trivially recovered when β = 1. We will now show that, surprisingly, as β → 0 we recover a special case of Online EWC. Then, by modifying the objective further as required to recover the full version of Online EWC, we will arrive at our algorithm, Generalized VCL.
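To make the objective concrete, the following is a minimal sketch of a likelihood-tempered VI loss for a single Bayesian layer, assuming a factorized Gaussian posterior with the previous task's posterior stored as the prior; the names (BayesianLinear, beta_elbo) are ours, not the authors' code.

```python
# Sketch of the beta-ELBO: NLL plus a down-weighted KL to the previous posterior.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over its weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(d_out, d_in))
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -3.0))  # sigma = softplus(rho)
        # Previous-task posterior q_{T-1}, acting as the prior (unit Gaussian for task 1).
        self.register_buffer("prior_mu", torch.zeros(d_out, d_in))
        self.register_buffer("prior_sigma", torch.ones(d_out, d_in))

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        w = self.w_mu + sigma * torch.randn_like(sigma)  # reparameterization trick
        return x @ w.t()

    def kl_to_prior(self):
        q = Normal(self.w_mu, F.softplus(self.w_rho))
        p = Normal(self.prior_mu, self.prior_sigma)
        return kl_divergence(q, p).sum()

def beta_elbo_loss(layer, x, y, beta, n_data):
    """Negative beta-ELBO per data point: NLL + (beta / N) * KL.
    beta = 1 recovers the VCL objective; beta -> 0 approaches the Online EWC limit."""
    nll = F.cross_entropy(layer(x), y, reduction="mean")
    return nll + beta * layer.kl_to_prior() / n_data
```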
2.2 ONLINE EWC IS A SPECIAL CASE OF GVCL

We analyze the effect of KL-reweighting on VCL in the case where the approximating family is restricted to Gaussian distributions over θ. We consider training all the tasks with a KL-reweighting factor of β, and then take the limit β → 0, recovering Online EWC. Let the approximate posteriors at the previous and current tasks be denoted as q_{T−1}(θ) = N(θ; μ_{T−1}, Σ_{T−1}) and q_T(θ) = N(θ; μ_T, Σ_T) respectively, where we are learning {μ_T, Σ_T}. The optimal Σ_T under the β-ELBO has the form (see Appendix C),

Σ_T^{-1} = (1/β) ∇_{μ_T}∇_{μ_T} E_{q_T(θ)}[−log p(D_T|θ)] + Σ_{T−1}^{-1}.   (2)

Now take the limit β → 0. From Equation 2, Σ_T → 0, so q_T(θ) becomes a delta function, and

Σ_T^{-1} = −(1/β) ∇_{μ_T}∇_{μ_T} log p(D_T|θ = μ_T) + Σ_{T−1}^{-1} = (1/β) H_T + Σ_{T−1}^{-1} = (1/β) Σ_{t=1}^{T} H_t + Σ_0^{-1},   (3)

where H_T is the T-th task Hessian.¹ ² Although the learnt distribution q_T(θ) becomes a delta function (and not a full Gaussian distribution as in Laplace's approximation), we will see that a cancellation of β factors in the β-ELBO leads to the eventual equivalence between GVCL and Online EWC. Consider the terms in the β-ELBO that only involve μ_T:

β-ELBO = E_{θ∼q_T(θ)}[log p(D_T|θ)] − (β/2)(μ_T − μ_{T−1})^⊤ Σ_{T−1}^{-1}(μ_T − μ_{T−1})
       = log p(D_T|θ = μ_T) − (1/2)(μ_T − μ_{T−1})^⊤ (Σ_{t=1}^{T−1} H_t + β Σ_0^{-1})(μ_T − μ_{T−1}),   (4)

where we have set the form of Σ_{T−1} to be as in Equation 3. Equation 4 is an instance of the objective function used by a number of continual learning methods, most notably Online EWC (Kirkpatrick et al., 2017; Schwarz et al., 2018), Online-Structured Laplace (Ritter et al., 2018), and SOLA (Yin et al., 2020). These algorithms can be recovered by restricting the approximate posterior class Q to Gaussians with diagonal, block-diagonal Kronecker-factored, and low-rank precision matrices, respectively (see Appendices C.4 and C.5).³

¹ We slightly abuse notation by writing the likelihood as p(D_T|θ) instead of p(y_T|θ, X_T).
² The actual Hessian may not be positive semidefinite while Σ is, so here we refer to a positive semidefinite approximation of the Hessian.
³ EWC uses the Fisher information, but our derivation results in the Hessian. The two matrices coincide when the model has near-zero training loss, as is often the case (Martens, 2020).

Based on this analysis, β can be seen as interpolating between VCL, with β = 1, and continual learning algorithms which use point-wise approximations of curvature as β → 0. In Appendix A we explore how β controls the scale of the quadratic curvature approximation, verifying with experiments on a toy dataset: small β values learn distributions with good local structure, while higher β values learn distributions with a more global structure. We explore this in more detail in Appendices A and B, where we show the convergence of GVCL to Online EWC on a toy experiment.
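As a rough illustration of this limiting case, the sketch below implements the quadratic objective of Equation 4 with a diagonal curvature approximation; the class name and the choice of a diagonal Hessian/Fisher are our own simplifications, not the paper's implementation.

```python
# Point-estimate training with an accumulated quadratic penalty, as in Eq. (3)-(4).
import torch

class QuadraticRegularizer:
    """Tracks a diagonal version of sum_t H_t + beta * Sigma_0^{-1} and the anchor mu_{T-1}."""
    def __init__(self, params, prior_precision=1.0):
        self.anchor = [p.detach().clone() for p in params]            # mu_{T-1}
        self.precision = [torch.full_like(p, prior_precision) for p in params]

    def penalty(self, params):
        # 0.5 * (mu_T - mu_{T-1})^T Lambda (mu_T - mu_{T-1}), Lambda diagonal.
        return 0.5 * sum(((p - a) ** 2 * lam).sum()
                         for p, a, lam in zip(params, self.anchor, self.precision))

    def finish_task(self, params, diag_curvature):
        # After task T: add the new task's approximate Hessian diagonal and
        # re-anchor at the new mode, mirroring the recursion in Eq. (3).
        for lam, h in zip(self.precision, diag_curvature):
            lam += h
        self.anchor = [p.detach().clone() for p in params]
```

The per-task training loss is then the negative log-likelihood plus `penalty(params)`, which is the form shared by Online EWC and its structured variants.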
Inference using GVCL. When performing inference with GVCL at test time, we use samples from the unmodified q(θ) distribution. This means that when β = 1 we recover the VCL predictive, and as β → 0 the posterior collapses as described earlier, meaning that the weight samples are effectively deterministic. This is in line with the inference procedure given by Online EWC and its variants. In practice, we use values of β = 0.05–0.2 in Section 5, meaning that some uncertainty is retained, but not all. We can increase the uncertainty at inference time by using an additional tempering step, which we describe, along with further generalizations, in Appendix D.

2.3 REINTERPRETING λ AS COLD POSTERIOR REGULARIZATION
As described above, the β-ELBO recovers instances of a number of existing second-order continual learning algorithms, including Online EWC, as special cases. However, the correspondence does not recover a key hyperparameter λ used by these methods that up-weights the quadratic regularization term. Instead, our derivation produces an implicit value of λ = 1, i.e. equal weight between tasks of equal sample count. In practice it is found that algorithms such as Online EWC perform best when λ > 1. In this section, we view this λ hyperparameter as a form of cold posterior regularization.

In the previous section, we showed that β controls the length-scale over which we approximate the curvature of the posterior. However, the magnitude of the quadratic regularizer stays the same, because the O(β^{-1}) precision matrix and the β coefficient in front of the KL-term cancel out. Taking inspiration from cold posteriors (Wenzel et al., 2020), which temper both the likelihood and the prior and improve accuracy with Bayesian neural networks, we suggest tempering the prior in GVCL. Therefore, rather than measuring the KL divergence between the posterior and prior, q_T and q_{T−1} respectively, we suggest regularizing towards a tempered version of the prior, q_{T−1}^λ. However, this form of regularization has a problem: in continual learning, over the course of many tasks, old tasks will be increasingly (exponentially) tempered. In order to combat this, we also use the tempered version of the posterior in the KL divergence, q_T^λ. This should allow us to gain the benefits of tempering the prior while being stable over multiple tasks in continual learning.
As we now show, tempering in this way recovers the λ hyperparameter from algorithms such as Online EWC. Note that raising the distributions to the power λ is equivalent to tempering by τ = λ^{-1}. For Gaussians, tempering a distribution by a temperature τ = λ^{-1} is the same as scaling the covariance by λ^{-1}. We can therefore expand our new KL divergence,

D_KL(q_T^λ ‖ q_{T−1}^λ) = (1/2)((μ_T − μ_{T−1})^⊤ λΣ_{T−1}^{-1}(μ_T − μ_{T−1}) + Tr(λΣ_{T−1}^{-1} λ^{-1}Σ_T) + log (|Σ_{T−1}|λ^{-d}) / (|Σ_T|λ^{-d}) − d)
= (1/2)((μ_T − μ_{T−1})^⊤ λΣ_{T−1}^{-1}(μ_T − μ_{T−1}) + Tr(Σ_{T−1}^{-1}Σ_T) + log |Σ_{T−1}| / |Σ_T| − d)
=: D_{KL_λ}(q_T ‖ q_{T−1}).

In the limit of β → 0, our λ coincides with Online EWC's λ, if the tasks have the same number of samples. However, this form of λ has a slight problem: it increases the regularization strength of the initial prior Σ_0 on the mean parameter update. We empirically found that this negatively affects performance. We therefore propose a different version of λ, which only up-weights the "data-dependent" parts of Σ_{T−1}^{-1}. This can be viewed as likelihood-tempering the previous task posterior, as opposed to tempering both the initial prior and likelihood components. The new version still converges to Online EWC as β → 0, since the O(1) prior becomes negligible compared to the O(β^{-1}) Hessian terms. We define

Σ̃_{T,λ}^{-1} := (λ/β) Σ_{t=1}^{T} H_t + Σ_0^{-1} = λ(Σ_T^{-1} − Σ_0^{-1}) + Σ_0^{-1}.

In practice, it is necessary to clip negative values of Σ_T^{-1} − Σ_0^{-1} to keep Σ̃_{T,λ}^{-1} positive definite; this is only required because of errors during optimization. We then use a modified KL divergence,

D_{KL_λ̃}(q_T ‖ q_{T−1}) = (1/2)((μ_T − μ_{T−1})^⊤ Σ̃_{T−1,λ}^{-1}(μ_T − μ_{T−1}) + Tr(Σ_{T−1}^{-1}Σ_T) + log |Σ_{T−1}| / |Σ_T| − d).

Note that in Online EWC there is another hyperparameter γ, which down-weights the previous Fisher matrices. As shown in Appendix C, we can introduce this hyperparameter by taking the KL divergence between the posterior and prior at different temperatures, q_T^λ and q_{T−1}^{γλ}. However, we do not find that this approach improves performance. Combining everything, we have our objective for GVCL,

E_{θ∼q_T(θ)}[log p(D_T|θ)] − β D_{KL_λ̃}(q_T(θ) ‖ q_{T−1}(θ)).
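For factorized Gaussians, the modified divergence D_{KL_λ̃} has a simple closed form; the sketch below is our own rendering of it, with λ applied only to the data-dependent part of the prior precision and the clipping mentioned above.

```python
# Modified KL divergence from Sec 2.3 for factorized Gaussians
# (all arguments are vectors of per-weight means/variances).
import torch

def gvcl_kl(mu, var, prior_mu, prior_var, prior0_var, lam):
    """D_KL_lambda~(q_T || q_{T-1}) with the data-dependent prior precision
    up-weighted: Sigma~^{-1} = lam * (Sigma^{-1} - Sigma_0^{-1}) + Sigma_0^{-1}."""
    prior_prec = 1.0 / prior_var
    prec0 = 1.0 / prior0_var
    # Up-weight only the data-dependent precision; clip to keep it positive.
    tilde_prec = lam * (prior_prec - prec0).clamp(min=0.0) + prec0
    quad = tilde_prec * (mu - prior_mu) ** 2
    trace = var / prior_var
    logdet = torch.log(prior_var) - torch.log(var)
    return 0.5 * (quad + trace + logdet - 1.0).sum()
```

The GVCL training loss for task T is then the negative log-likelihood plus β times this divergence.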
3 FILM LAYERS FOR CONTINUAL LEARNING

The Generalized VCL algorithm proposed in Section 2 is applicable to any model. Here we discuss a multi-task neural network architecture that is especially well-suited to GVCL when the task ID is known at both training and inference time: neural networks with task-specific FiLM layers.

3.1 BACKGROUND TO FILM LAYERS
The most common architecture for continual learning is the multi-headed neural network. A shared set of body parameters acts as the feature extractor. For every task, features are generated in the same way, before finally being passed to separate head networks for each task. This architecture does not allow for task-specific differentiation in the feature extractor, which is limiting (consider, for example, the different tasks of handwritten digit recognition and image recognition). FiLM layers (Perez et al., 2018) address this limitation by linearly modulating features for each specific task so that useful features can be amplified and inappropriate ones ignored. In fully-connected layers, the transformation is applied element-wise: for a hidden layer with width W and activation values h_i, 1 ≤ i ≤ W, FiLM layers perform the transformation h'_i = γ_i h_i + b_i before passing the result on to the remainder of the network. For convolutional layers, transformations are applied filter-wise. Consider a layer with N filters of size K × K, resulting in activations h_{i,j,k}, 1 ≤ i ≤ N, 1 ≤ j ≤ W, 1 ≤ k ≤ H, where W and H are the dimensions of the resulting feature map. The transformation has the form h'_{i,j,k} = γ_i h_{i,j,k} + b_i. The number of required parameters scales with the number of filters, as opposed to the full activation dimension, making FiLM layers computationally cheap and parameter-efficient. FiLM layers have previously been shown to help with fine-tuning for transfer learning (Rebuffi et al., 2017), multi-task meta-learning (Requeima et al., 2019), and few-shot learning (Perez et al., 2018). In Appendix F, we show how FiLM layer parameters are interpretable, with similarities between FiLM layer parameters for similar tasks in a multi-task setup.
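For concreteness, a task-conditional FiLM layer of the kind described above can be written in a few lines; this is an illustrative PyTorch module of ours, not the authors' implementation.

```python
# Per-task FiLM layer: scale gamma_i and shift b_i per feature (or per filter).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, num_tasks, num_features):
        super().__init__()
        # One (gamma, b) pair per task and per feature/filter.
        self.gamma = nn.Parameter(torch.ones(num_tasks, num_features))
        self.bias = nn.Parameter(torch.zeros(num_tasks, num_features))

    def forward(self, h, task_id):
        g, b = self.gamma[task_id], self.bias[task_id]
        if h.dim() == 4:               # conv features: (batch, filters, H, W)
            g, b = g.view(1, -1, 1, 1), b.view(1, -1, 1, 1)
        return g * h + b               # h'_i = gamma_i * h_i + b_i
```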
Figure 1: Visualizations of deviation from the prior distribution for filters in the first layer of a convolutional network trained on Hard-CHASY; lighter colours indicate an active filter for that task. Models are trained either (a) sequentially using GVCL, or (b) sequentially with GVCL + FiLM. FiLM layers increase the number of active units.

3.2 COMBINING GVCL AND FILM LAYERS
It is simple to apply GVCL to models which utilize FiLM layers. Since these layers are specific to each task, they do not need a distributional treatment or regularization as was necessary to support continual learning of the shared parameters. Instead, point estimates are found by optimizing the GVCL objective function. This has a well-defined optimum, unlike joint MAP training when FiLM layers are added (see Appendix E for a discussion). We might expect an improved performance for continual learning by introducing task-specific FiLM layers as this results in a more suitable multi-task model. However, when combined with GVCL, there is an additional benefit.

When applied to multi-head networks, VCL tends to prune out large parts of the network (Trippe & Turner, 2018; Turner & Sahani, 2011), and GVCL inherits this behaviour. This occurs in the following way. First, weights entering a node revert to their prior distribution due to the KL-regularization term in the ELBO. These weights then add noise to the network, affecting the likelihood term of the ELBO. To avoid this, the bias concentrates at a negative value so that the ReLU activation effectively shuts off the node. In the single-task setting, this is often relatively benign and can even facilitate compression (Louizos et al., 2017; Molchanov et al., 2017). However, in continual learning the effect is pathological: the bias remains negative due to its low variance, meaning that the node is effectively shut off from that point forward, preventing the node from re-activating. Ultimately, large sections of the network can be shut off after the first task and cannot be used for future tasks, which wastes network capacity (see Figure 1a).

In contrast, when using task-specific FiLM layers, pruning can be achieved by either setting the FiLM layer scale to 0 or the FiLM layer bias to be negative. Since there is no KL-penalty on these parameters, it is optimal to prune in this way. Critically, both the incoming weights and the bias of a pruned node can then return to the prior without adding noise to the network, meaning that the node can be re-activated in later tasks. The increase in the number of unpruned units can be seen in Figure 1b. In Appendix G we provide more evidence of this mechanism.
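A simple way to visualize this effect (as in Figure 1) is to score each unit by the KL divergence between the posterior over its incoming weights and the prior; the function below is our own sketch of such a diagnostic for factorized Gaussians, with an arbitrary threshold.

```python
# A unit counts as "active" if its incoming weights have moved away from the prior.
import torch

def active_units(w_mu, w_var, prior_mu, prior_var, threshold=1e-2):
    """Per-unit KL from the prior; rows of w_* correspond to units/filters."""
    kl = 0.5 * (w_var / prior_var + (w_mu - prior_mu) ** 2 / prior_var
                - 1.0 + torch.log(prior_var / w_var))
    return kl.sum(dim=1) > threshold   # boolean mask of unpruned units
```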
4 RELATED WORK
Regularization-based continual learning.
Many algorithms attempt to regularize network parameters based on a metric of importance. Section 2 shows how some methods can be seen as special cases of GVCL. We now focus on other related methods. Lee et al. (2017) proposed IMM, an extension to EWC which merges posteriors based on their Fisher information matrices. Ahn et al. (2019), like us, use regularizers based on the ELBO, but measure importance on a per-node basis rather than a per-weight one. SI (Zenke et al., 2017) measures importance using "Synaptic Saliency," as opposed to methods based on approximate curvature.
Architectural approaches to continual learning.
This family of methods modifies the standard neural architecture by adding components to the network. Progressive Neural Networks (Rusu et al., 2016) adds a parallel column network for every task, growing the model size over time. PathNet (Fernando et al., 2017) fixes the model size while optimizing the paths between layer columns. Architectural approaches are often used in tandem with regularization-based approaches, as in HAT (Serra et al., 2018), which uses per-task gating parameters alongside a compression-based regularizer. Adel et al. (2020) propose CLAW, which also uses variational inference alongside per-task parameters, but requires a more complex meta-learning based training procedure involving multiple splits of the dataset. GVCL with FiLM layers adds to this list of hybrid architectural-regularization based approaches. See Appendix H for a more comprehensive related works section.
5 EXPERIMENTS
We run experiments in the small-data regime (Easy-CHASY and Hard-CHASY) and on Split-MNIST (both in Section 5.1), on the larger Split-CIFAR benchmark (Section 5.2), and on a much larger Mixed Vision benchmark consisting of 8 different image classification datasets (Section 5.3). In order to compare continual learning performance, we compare final average accuracy, forward transfer (the improvement on the current task as the number of past tasks increases (Pan et al., 2020)) and backward transfer (the difference in accuracy between when a task is first trained and its accuracy after the final task (Lopez-Paz & Ranzato, 2017)). We compare to many baselines, but due to space constraints, only report the best-performing baselines in the main text. We also compare to two offline methods: an upper-bound "joint" version trained on all tasks jointly, and a lower-bound "separate" version with each task trained separately (no transfer). Further baseline results are in Appendix J. The combination of GVCL and task-specific FiLM layers (GVCL-F) outperforms baselines on the smaller-scale benchmarks and outperforms or performs within statistical error of baselines on the larger Mixed Vision benchmark. We also report calibration curves, showing that GVCL-F is well-calibrated. Full experimental protocol and hyperparameters are reported in Appendix I.
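For reference, the sketch below computes these three metrics from a matrix of per-task accuracies; the paper's precise definitions are in Appendix I, and this is one common formulation (our naming), where b is the accuracy of a separately trained model on each task.

```python
# R[i, j] is the accuracy on task j after training on task i (0-indexed).
import numpy as np

def cl_metrics(R, b):
    T = R.shape[0]
    acc = R[-1].mean()                                          # final average accuracy
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])   # backward transfer
    fwt = np.mean([R[j, j] - b[j] for j in range(1, T)])        # forward transfer
    return acc, bwt, fwt
```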
5.1 CHASY AND SPLIT-MNIST

Figure 2: Running average accuracy of (a) Easy-CHASY, (b) Hard-CHASY and (c) Split-MNIST trained continually. GVCL-F and GVCL are compared to the best performing baseline algorithm. GVCL-F and GVCL both significantly outperform HAT on Easy-CHASY. On Hard-CHASY, GVCL-F still manages to perform as well as joint MAP training, while GVCL performs as well as PathNet. On Split-MNIST, GVCL-F narrowly outperforms HAT, with both performing nearly as well as joint training.

The CHASY benchmark consists of a set of tasks specifically designed for multi-task and continual learning, with a detailed explanation in Appendix K. It is derived from the HASYv2 dataset (Thoma, 2017), which consists of 32x32 handwritten LaTeX characters. Easy-CHASY was designed to maximize transfer between tasks and consists of similar tasks, ranging from 20 classes for the first task to 11 classes for the last. Hard-CHASY represents scenarios where tasks are very distinct, ranging from 18 to 10 classes. Both versions have very few samples per class. Testing our algorithm on these datasets therefore probes two extremes of the continual learning spectrum. For these two datasets we use a small convolutional network comprising two convolutional layers and a fully connected layer.

For our Split-MNIST experiment, in addition to the standard 5 binary classification tasks for Split-MNIST, we add 5 more binary classification tasks by taking characters from the KMNIST dataset (Clanuwat et al., 2018). For these experiments we used a 2-layer fully-connected network, as is common in the continual learning literature (Nguyen et al., 2018; Zenke et al., 2017).

Figure 2 shows the raw accuracy results. As the CHASY datasets have very few samples per class (16 per class, resulting in the largest task having a training set of 320 samples), it is easy to overfit. This few-sample regime is a key practical use case for continual learning, as it is essential to transfer information between tasks. In this regime, continual learning algorithms based on MAP-inference overfit, resulting in poor performance. As GVCL-F is based on a Bayesian framework, it is not as adversely affected by the low sample count, achieving 90.9% accuracy on Easy-CHASY compared to 82.6% for the best performing MAP-based CL algorithm, HAT. Hard-CHASY tells a similar story: 69.1% compared to PathNet's 64.8%. Compared to the full joint training baselines, GVCL-F achieves nearly the same accuracy (Figure 3). The gap between GVCL-F and GVCL is larger for Easy-CHASY than for Hard-CHASY, as the task-specific adaptation that FiLM layers provide is more beneficial when tasks require contrasting features, as in Hard-CHASY. With Split-MNIST, GVCL-F also reaches the same performance as joint training; however, it is difficult to distinguish approaches on this benchmark as many achieve near-maximal accuracy.

Figure 3: Accuracy of Easy-CHASY and Hard-CHASY trained models at the end of learning all 10 tasks continually. Performance of GVCL-F, GVCL and the best performing baselines (HAT and PathNet) is compared to Joint and Separate training. GVCL-F again strongly outperforms the baselines and performs similarly to the upper-bound VI joint training.
Table 1: Performance metrics of GVCL-F and GVCL compared to baselines (more in Appendix J): final average accuracy (ACC), backward transfer (BWT) and forward transfer (FWT), all in % with standard errors, for GVCL-F, GVCL, HAT, PathNet, VCL and Online EWC on Easy-CHASY, Hard-CHASY, Split-MNIST (10 tasks), Split-CIFAR and the Mixed Vision tasks. GVCL-F obtains the best accuracy and backwards/forwards transfer on many datasets/architectures.
5.2 SPLIT-CIFAR

The popular Split-CIFAR dataset, introduced in Zenke et al. (2017), has CIFAR10 as the first task, and then 5 tasks as disjoint 10-way classifications from the first 50 classes of CIFAR100, giving a total of 6 tasks. We use the same architecture as in other papers (Zenke et al., 2017; Pan et al., 2020). As with Easy-CHASY, jointly learning these tasks significantly outperforms networks separately trained on the tasks, indicating potential for forward and backward transfer in a continual learning algorithm. Results are in Figure 4. GVCL-F is able to achieve the same final accuracy as joint training with FiLM layers, reaching 80.0% accuracy. Our best results use an intermediate value of β, far from the Online EWC limit; we believe this is because optimizing the GVCL cost for small β is more challenging (see Appendix B). However, since intermediate β settings result in more pruning, FiLM layers then bring significant improvement.

Figure 4: Running average accuracy of Split-CIFAR (a) and final accuracies on Split-CIFAR after continually training on 6 tasks (b), for GVCL-F, GVCL, and HAT. GVCL-F achieves the maximum amount of forwards transfer, and achieves close to the upper-bound joint performance.
5.3 MIXED VISION TASKS

Figure 5: (a) Average accuracy of the mixed vision tasks at the end of training for GVCL-F and HAT; both algorithms perform nearly equally well in this respect. (b) Relative accuracy after training on the i-th task: GVCL-F gracefully forgets, with higher intermediate accuracies, while HAT has a lower initial accuracy but does not forget.

We finally test on a set of mixed vision datasets, as in Serra et al. (2018). This benchmark consists of 8 image classification datasets with 10-100 classes and a range of dataset sizes, with the order of tasks randomly permuted between different runs. We use the same AlexNet architecture as in Serra et al. (2018). Average accuracies of the 8 tasks after continual training are shown in Figure 5a; GVCL-F's final accuracy matches that of HAT. Figure 5b shows relative accuracy over the course of training: a positive relative accuracy after t tasks means that the method performs better on the tasks seen so far than it does on the same tasks after seeing all 8 tasks (Appendix I contains a precise definition). HAT achieves its continual learning performance by compressing earlier tasks, hindering their performance in order to reserve capacity for later tasks. In contrast, GVCL-F attempts to maximize the performance for early tasks, but allows performance to gradually decay, as shown by the gradually decreasing relative accuracy in Figure 5b. While both strategies result in good final accuracy, one could argue that pre-compressing a network in anticipation of future tasks which may or may not arrive is an impractical real-world strategy, as the number of total tasks may be unknown a priori, and therefore one does not know how much to compress the network. The approach taken by GVCL-F is then more desirable, as it ensures good performance after any number of tasks, and frees capacity by "gracefully forgetting".

Figure 6: Calibration curves for CIFAR100 (a) and FaceScrub (b), and Expected Calibration Error on individual tasks (c), for GVCL-F and HAT trained on the Mixed Vision tasks benchmark. GVCL-F achieves much lower Expected Calibration Error, attaining a value averaged across all tasks of 0.3% compared to HAT's 1.7%.
Uncertainty calibration.
As GVCL-F is based on a probabilistic framework, we expect it to have good uncertainty calibration compared to other baselines. We show this for the Mixed Vision tasks in Figure 6. Overall, the average Expected Calibration Error for GVCL-F (averaged over tasks) is 0.32%, compared to HAT's 1.69%, with a better ECE on 7 of the 8 tasks. These results demonstrate that GVCL-F is generally significantly better calibrated than HAT, which can be extremely important in decision-critical problems where networks must know when they are likely to be uncertain.
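For completeness, a standard binned estimator of the Expected Calibration Error reported above looks as follows (our sketch; the paper's exact binning may differ).

```python
# ECE: bin predictions by confidence, average |accuracy - confidence| per bin,
# weighted by the fraction of points falling in each bin.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: (N, C) predicted probabilities; labels: (N,) true class indices."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```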
5.4 RELATIVE GAIN FROM ADDING FILM LAYERS
Table 2: Relative performance improvement from adding FiLM layers on Easy-CHASY, Hard-CHASY, Split-MNIST (10 tasks), Split-CIFAR and the Mixed Vision tasks (and the average across benchmarks), for GVCL, VCL and Online EWC. VI-based approaches see a significantly larger gain than EWC, suggesting that FiLM layers synergize very well with VI and address the pruning issue.

In Section 3, we suggested that adding FiLM layers to VCL in particular would result in the largest gains, since they address issues specific to VI, and that FiLM parameter values are automatically best allocated based on the prior. Here, we compare the relative gain of adding FiLM layers to VI-based approaches and to Online EWC (Table 2). We omitted HAT, since it already has per-task gating mechanisms, so FiLM layers would be redundant. We see that the gains from adding FiLM layers to Online EWC are limited, and much smaller than the average gains for both VCL and GVCL. This suggests that the strength of FiLM layers is primarily in how they interact with variational methods for continual learning. As described in Section 3, with VI we do not need any special algorithm to encourage pruning or to decide how to allocate resources, as this is done automatically by VI. This contrasts with HAT, where specific regularizers and gradient modifications are necessary to encourage the use of FiLM parameters.

6 CONCLUSIONS
We have developed a framework, GVCL, that generalizes Online EWC and VCL, and we combined it with task-specific FiLM layers to mitigate the effects of variational pruning. GVCL with FiLM layers outperforms strong baselines on a number of benchmarks, according to several metrics. Future research might combine GVCL with memory replay methods, or find ways to use FiLM layers when task ID information is unavailable.
REFERENCES

Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta-learning. arXiv:1902.03545, 2019. URL http://arxiv.org/abs/1902.03545.

Alessandro Achille, Giovanni Paolini, and Stefano Soatto. Where is the information in a deep neural network?, 2020.

Tameem Adel, Han Zhao, and Richard E. Turner. Continual learning with adaptive weights (CLAW). In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hklso24Kwr.

Hongjoon Ahn, Sungmin Cha, Donggyu Lee, and Taesup Moon. Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems 32, pp. 4392-4402. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8690-uncertainty-based-continual-learning-with-adaptive-regularization.pdf.

Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. In Proceedings of Machine Learning Research, volume 80, pp. 159-168. PMLR, 2018. URL http://proceedings.mlr.press/v80/alemi18a.html.

Tarin Clanuwat, Mikel Bober-Irizar, A. Kitamoto, A. Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. arXiv:1812.01718, 2018.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. 2017.

Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999. doi: 10.1016/S1364-6613(99)01294-2.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

J. Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, J. Veness, G. Desjardins, Andrei A. Rusu, K. Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, C. Clopath, D. Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521-3526, 2017.

Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems 30, pp. 4652-4662. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7051-overcoming-catastrophic-forgetting-by-incremental-moment-matching.pdf.

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30, pp. 6467-6476. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf.

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems 30, pp. 3288-3298. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6921-bayesian-compression-for-deep-learning.pdf.

James Martens. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(146):1-76, 2020. URL http://jmlr.org/papers/v21/17-678.html.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of Machine Learning Research, volume 70, pp. 2498-2507. PMLR, 2017. URL http://proceedings.mlr.press/v70/molchanov17a.html.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkQqq0gRb.

Manfred Opper and Cedric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21:786-792, 2008. doi: 10.1162/neco.2008.08-07-592.

Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz Khan, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, and Rio Yokota. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems 32, pp. 4287-4299. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8681-practical-deep-learning-with-bayesian-principles.pdf.

Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard E. Turner, and Mohammad Emtiyaz Khan. Continual deep learning by functional regularisation of memorable past. arXiv:2004.14070, 2020. URL http://arxiv.org/abs/2004.14070.

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems 30, pp. 506-516. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6654-learning-multiple-visual-domains-with-residual-adapters.pdf.

James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and Richard E. Turner. Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems 32, pp. 7959-7970. Curran Associates, Inc., 2019.

Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems 31, pp. 3738-3748. Curran Associates, Inc., 2018.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv:1606.04671, 2016. URL http://arxiv.org/abs/1606.04671.

Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Proceedings of Machine Learning Research, volume 80, pp. 4528-4537. PMLR, 2018. URL http://proceedings.mlr.press/v80/schwarz18a.html.

Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of Machine Learning Research, volume 80, pp. 4548-4557. PMLR, 2018. URL http://proceedings.mlr.press/v80/serra18a.html.

Alexander Smola, Vishy Vishwanathan, and Eleazar Eskin. Laplace propagation. 2003.

Siddharth Swaroop, Cuong V. Nguyen, Thang D. Bui, and Richard E. Turner. Improving and understanding variational continual learning. arXiv:1905.02099, 2019. URL http://arxiv.org/abs/1905.02099.

Martin Thoma. The HASYv2 dataset. arXiv:1701.08380, 2017. URL http://arxiv.org/abs/1701.08380.

Brian Trippe and Richard Turner. Overpruning in variational Bayesian neural networks. arXiv:1801.06230, 2018. URL http://arxiv.org/abs/1801.06230.

R. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. 2011.

Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? arXiv:2002.02405, 2020. URL https://arxiv.org/abs/2002.02405.

Dong Yin, Mehrdad Farajtabar, and Ang Li. SOLA: Continual learning with second-order loss approximation. arXiv:2006.10974, 2020. URL http://arxiv.org/abs/2006.10974.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of Machine Learning Research, volume 70, pp. 3987-3995. PMLR, 2017. URL http://proceedings.mlr.press/v70/zenke17a.html.

Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. In Proceedings of Machine Learning Research, volume 80, pp. 5852-5861. PMLR, 2018. URL http://proceedings.mlr.press/v80/zhang18l.html.

A LOCAL VS GLOBAL CURVATURE IN GVCL
In this section, we look at the effect of β on the approximation of local curvature found from optimizing the β-ELBO by analyzing its effect on a toy dataset. In doing so, we aim to provide intuition for why different values of β might outperform β = 1. We start by looking at the equation for the fixed point of Σ_T,

Σ_T^{-1} = (1/β) ∇_{μ_T}∇_{μ_T} E_{q_T(θ)}[−log p(D_T|θ)] + Σ_{T−1}^{-1}.   (5)

We consider the T = 1 case. We can interpret this as roughly measuring the curvature of log p(D_T|θ) at different samples of θ drawn from the distribution q_T(θ). Based on this equation, we know Σ_T^{-1} increases as β decreases, so samples from q_T(θ) are more localized, meaning that the curvature is measured closer to the mean, forming a local approximation of curvature. Conversely, if β is larger, Σ_T broadens and the approximation of curvature is on a more global scale. For simplicity, we write ∇_{μ_T}∇_{μ_T} E_{q_T(θ)}[−log p(D_T|θ)] as H̃_T.

To test this explanation of β, we performed β-VI on a simple toy dataset. We have a true data generative distribution X ∼ N(0, 1), and we sample 1000 points forming the dataset, D. Our model is a generative model with X ∼ N(f(θ), σ² = 30), with θ being the model's only parameter and f(θ) an arbitrary fixed function. With β-VI, we aim to approximate p(θ|D) with q(θ) = N(θ; μ, σ_q²), with a prior p(θ) = N(θ; 0, 1). We choose three different forms for f(θ):

1. f₁(θ) = |θ|^{1.5};
2. f₂(θ) = √|θ|;
3. f₃(θ), which combines the two at different scales.

We visualize log p(D|θ) for each of these three functions in Figure 7. Here, we see that the data likelihoods have very distinct shapes: f₁ results in a likelihood that is flat locally but curves further away from the origin; f₂ is the opposite, with a cusp at 0 that then flattens out; f₃ is a mix, where at a very small scale it has high curvature, then flattens, then curves again. Now, we perform β-VI to obtain μ and σ_q², for β ∈ {0.1, 1, 10}. We then have values for σ_q², which acts as Σ_T in Equation 5. We want to extract H̃_T from these values, so we perform the operation σ̃^{-2} = β σ_q^{-2}, which represents our estimate of the curvature of log p(D|θ) at the mean; this operation also "cancels" the scaling effect of β. We then plot the resulting approximate log-likelihood functions log p̃(D|θ) = log N(θ; μ, σ̃²) in Figure 8.

Figure 7: True data log-likelihoods of a generative model of the form p(x|θ) = N(x; f(θ), σ²), for (a) f₁(θ), (b) f₂(θ) and (c) f₃(θ). Curves are shifted so that they pass through the origin.

Figure 8: Approximate data log-likelihoods found using β-VI for various values of β for the three generative models (a) f₁(θ), (b) f₂(θ) and (c) f₃(θ). Small values of β cause local approximations of curvature and large values cause global ones.

From these figures, we see a clear trend: small values of β cause the approximate curvature to be measured locally, while larger values cause it to be measured globally, confirming our hypothesis. Most striking is Figure 8c, where the curvature is not strictly increasing or decreasing further from the origin. Here, we see that the curvature is first high for the smallest β, then flattens out for β = 1, then becomes high again for β = 10. Now imagine that in continual learning we have a parameter whose posterior looks like Figure 8a. Here, the parameter would be under-regularized with β = 1, so the parameter will drift far away, significantly affecting performance. Equally, if the posterior was like Figure 8b, then β = 1 would cause the parameter to be over-regularized, limiting model capacity that in practice could be freed. In practice we found that β values of 0.05-0.2 worked the best. We leave finding better ways of quantifying the posterior's variable curvature and ways of selecting appropriate values of β as future work.
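The experiment above can be reproduced in a few lines; the sketch below runs β-VI on the one-parameter model with the reparameterization trick. The optimizer settings are ours, and the curvature extraction follows the σ̃^{-2} = βσ_q^{-2} operation described above.

```python
# beta-VI on X ~ N(f(theta), sigma^2 = 30) with a unit Gaussian prior on theta.
import torch

torch.manual_seed(0)
data = torch.randn(1000)                        # X ~ N(0, 1)
f = lambda th: th.abs().sqrt()                  # e.g. f_2(theta) = sqrt(|theta|)

def fit_beta_vi(beta, steps=5000, lr=1e-2, n_mc=32):
    mu = torch.zeros((), requires_grad=True)
    rho = torch.tensor(-1.0, requires_grad=True)        # sigma_q = softplus(rho)
    opt = torch.optim.Adam([mu, rho], lr=lr)
    for _ in range(steps):
        sigma_q = torch.nn.functional.softplus(rho)
        theta = mu + sigma_q * torch.randn(n_mc)        # reparameterized samples
        ll = torch.distributions.Normal(f(theta)[:, None], 30 ** 0.5) \
                 .log_prob(data).sum(dim=1).mean()      # E_q[log p(D | theta)]
        kl = torch.distributions.kl_divergence(
            torch.distributions.Normal(mu, sigma_q),
            torch.distributions.Normal(0.0, 1.0))
        (-(ll - beta * kl)).backward()
        opt.step(); opt.zero_grad()
    sigma_q = torch.nn.functional.softplus(rho).detach()
    return mu.item(), (beta / sigma_q ** 2).item()      # mean, curvature estimate

for beta in (0.1, 1.0, 10.0):
    mu, curv = fit_beta_vi(beta)
    print(f"beta={beta}: mu={mu:.3f}, estimated curvature={curv:.3f}")
```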
B CONVERGENCE TO ONLINE-EWC ON A TOY EXAMPLE
Figure 9: Visualization of a simple 2d logistic regression clustering task. The first task is distinguishing blue and red, classes 1 and 2 respectively. The second task is distinguishing green (class 1) from yellow (class 2). The combined task is shown on the left.

Here, we demonstrate convergence of GVCL to Online-EWC for small β. In this problem, we deal with 2d logistic regression on a toy dataset consisting of separated clusters, shown in Figure 9. The first task is separating the red/blue clusters, and the second the yellow/green clusters. Blue and green are the first class, and red and yellow are the second. Our model is given by the equation

p(y_i = 1 | w, b, x_i) = σ(w^⊤ x_i + b),

where x_i are our datapoints and w and b are our parameters; y_i = 1 means class 2 (and y_i = 0 means class 1). x is 2-dimensional, so we have a total of 3 parameters.

Next, we ran GVCL with decreasing values of β and compared the resulting values of w and b after the second task to the solution generated by Online-EWC. For both cases, we set λ = 1. For our prior, we used the unit normal prior on both w and b, and our approximating distribution was a fully factorized Gaussian. We ran this experiment for 5 random seeds (of the parameters, not the clusters) and plotted the results.

Figure 10: Convergence of GVCL parameter values to Online-EWC parameter values for decreasing values of β for a toy 2d logistic regression problem.

Figure 10 shows the result. Evidently, the values of the parameters approach those of Online-EWC as we decrease β, in line with our theory. However, it is worth noting that to get this convergent behaviour, we had to run this experiment for very long: for the lowest β value, it took 17 minutes to converge, compared to 1.7 for β = 1. A small learning rate of 1e-4 with 100000 iteration steps was necessary for the smallest β; with larger learning rates, the smallest values of β would result in completely different values.

This shows that while in theory, for small β, GVCL should approach Online-EWC, it is extremely hard to achieve in practice. Given that it takes so long to achieve convergent behaviour on a model with 3 parameters, it is unsurprising that we were not able to achieve the same performance as Online-EWC for our neural networks, and explains why GVCL, despite in theory encompassing Online-EWC, can sometimes perform worse.
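A minimal version of this experiment is sketched below, assuming placeholder cluster locations (the paper's exact clusters differ); each task is trained with the GVCL objective, and the task-1 posterior becomes the task-2 prior.

```python
# GVCL on two sequential 2d logistic regression tasks with 3 parameters (w1, w2, b).
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def make_task(c0, c1, n=100):
    x = torch.cat([torch.randn(n, 2) * 0.3 + torch.tensor(c0),
                   torch.randn(n, 2) * 0.3 + torch.tensor(c1)])
    y = torch.cat([torch.zeros(n), torch.ones(n)])
    return x, y

def train_gvcl_task(x, y, prior_mu, prior_sigma, beta, steps=100000, lr=1e-4):
    # The defaults mirror the long schedule needed for very small beta (see text).
    mu = prior_mu.clone().requires_grad_(True)
    rho = torch.full((3,), -3.0, requires_grad=True)
    opt = torch.optim.Adam([mu, rho], lr=lr)
    for _ in range(steps):
        sigma = F.softplus(rho)
        theta = mu + sigma * torch.randn(3)             # sample (w1, w2, b)
        logits = x @ theta[:2] + theta[2]
        nll = F.binary_cross_entropy_with_logits(logits, y, reduction="sum")
        kl = kl_divergence(Normal(mu, sigma), Normal(prior_mu, prior_sigma)).sum()
        (nll + beta * kl).backward()
        opt.step(); opt.zero_grad()
    return mu.detach(), F.softplus(rho).detach()

x1, y1 = make_task([-1.0, 0.0], [1.0, 0.0])             # placeholder clusters
x2, y2 = make_task([0.0, -1.0], [0.0, 1.0])
mu0, s0 = torch.zeros(3), torch.ones(3)
for beta in (1.0, 0.1, 0.01):
    mu1, s1 = train_gvcl_task(x1, y1, mu0, s0, beta, steps=2000, lr=1e-2)
    mu2, _ = train_gvcl_task(x2, y2, mu1, s1, beta, steps=2000, lr=1e-2)
    print(beta, mu2)
```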
C FURTHER DETAILS ON RECOVERING ONLINE EWC
Here, we show the full derivation to recover Online EWC from GVCL as β → 0. First, we expand the β-ELBO, which for Gaussian priors and posteriors has the form

β-ELBO = E_{θ∼q_T(θ)}[log p(D_T|θ)] − β D_KL(q_T(θ) ‖ q_{T−1}(θ))
= E_{θ∼q_T(θ)}[log p(D_T|θ)] − (β/2)(log |Σ_{T−1}| − log |Σ_T| − d + Tr(Σ_{T−1}^{-1}Σ_T) + (μ_T − μ_{T−1})^⊤ Σ_{T−1}^{-1}(μ_T − μ_{T−1})),

where q_T(θ) is our approximate distribution with mean and covariance μ_T and Σ_T, our prior distribution q_{T−1}(θ) has mean and covariance μ_{T−1} and Σ_{T−1}, D_T refers to the T-th dataset, and d is the dimension of μ. Next, take derivatives with respect to Σ_T and set to 0:

∇_{Σ_T} β-ELBO = ∇_{Σ_T} E_{θ∼q_T(θ)}[log p(D_T|θ)] + (β/2)Σ_T^{-1} − (β/2)Σ_{T−1}^{-1} = 0   (6)
= (1/2)∇_μ∇_μ E_{q_T(θ)}[log p(D_T|θ)] + (β/2)Σ_T^{-1} − (β/2)Σ_{T−1}^{-1}   (7)
⇒ Σ_T^{-1} = (1/β)∇_{μ_T}∇_{μ_T} E_{q_T(θ)}[−log p(D_T|θ)] + Σ_{T−1}^{-1}.   (8)

We move from Equation 6 to Equation 7 using Equation 19 in Opper & Archambeau (2008). From Equation 8, we see that as β → 0, the precision grows indefinitely, so q_T(θ) approaches a delta function centered at its mean (we give a more precise version of this argument in Appendix C.1). We then have

Σ_T^{-1} = −(1/β)∇_{μ_T}∇_{μ_T} log p(D_T|θ = μ_T) + Σ_{T−1}^{-1} = (1/β)H_T + Σ_{T−1}^{-1},   (9)

where H_T is the Hessian of the T-th dataset log-likelihood. This recursion for Σ_T^{-1} gives

Σ_T^{-1} = (1/β) Σ_{t=1}^{T} H_t + Σ_0^{-1}.
Now, optimizing the β-ELBO for μ_T (ignoring terms that do not depend on μ_T):

β-ELBO = E_{θ∼q(θ)}[log p(D|θ)] − (β/2)(μ_T − μ_{T−1})^⊤ Σ_{T−1}^{-1}(μ_T − μ_{T−1})   (10)
= log p(D|θ = μ_T) − (1/2)(μ_T − μ_{T−1})^⊤ (Σ_{t=1}^{T−1} H_t + β Σ_0^{-1})(μ_T − μ_{T−1}),   (11)

which is exactly the optimization problem for Laplace Propagation (Smola et al., 2003). If we note that H_T ≈ N_T F_T (Martens, 2020), where N_T is the number of samples in the T-th dataset and F_T is the Fisher information matrix, we recover Online EWC with λ = 1 when N_1 = N_2 = ... = N_T (with γ = 1).
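For reference, the diagonal Fisher F_T entering this correspondence can be estimated by accumulating squared per-example gradients with labels sampled from the model's predictive distribution; the helper below is our generic sketch, not the authors' code.

```python
# Monte Carlo estimate of the diagonal Fisher information over a dataset.
import torch
import torch.nn.functional as F

def diagonal_fisher(model, dataset):
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for x, _ in dataset:                       # one example at a time
        logits = model(x.unsqueeze(0))
        # Sample a label from the model's own predictive distribution.
        y = torch.distributions.Categorical(logits=logits).sample()
        loss = F.cross_entropy(logits, y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        for f_acc, g in zip(fisher, grads):
            f_acc += g ** 2
    return [f_acc / len(dataset) for f_acc in fisher]
```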
C.1 CLARIFICATION OF THE DELTA-FUNCTION ARGUMENT

In Appendix C, we argued that

Σ_T^{-1} = (1/β)∇_{μ_T}∇_{μ_T} E_{q_T(θ)}[−log p(D_T|θ)] + Σ_{T−1}^{-1} ≈ (1/β)H_T + Σ_{T−1}^{-1}

for small β: q(θ) collapses to its mean, and it is safe to treat the expectation as sampling only from the mean. In this section, we show that this argument is justified.

Lemma 1. If q(θ) has mean and covariance parameters μ and Σ, and Σ^{-1} = (1/β)∇_μ∇_μ E_{θ∼q(θ)}[f(θ)] + C with C = O(1), then for small β, Σ^{-1} ≈ (1/β)H_μ + C, where H_μ is the Hessian of f(θ) evaluated at μ, assuming H_μ = O(1).

Proof. We first assume that f(θ) admits a Taylor expansion around μ. For notational purposes, we define

T_{k_1,...,k_n} |_{θ=μ} = ∂^n f / (∂θ^{(k_1)} ··· ∂θ^{(k_n)}) |_{θ=μ},

where upper indices in brackets indicate vector components (not powers) and lower indices indicate covector components; note that H_{μ,i,j} = T_{i,j} |_{θ=μ}. A Taylor expansion centered at μ then has the form

f(θ) = f(μ) + Σ_{n=1}^{∞} (1/n!) T_{k_1,...,k_n} |_{θ=μ} (θ − μ)^{(k_1)} ··· (θ − μ)^{(k_n)},

where repeated indices are summed over:

T_{k_1,...,k_n} |_{θ=μ} (θ − μ)^{(k_1)} ··· (θ − μ)^{(k_n)} = Σ_{k_1,...,k_n=1}^{D} T_{k_1,...,k_n} |_{θ=μ} (θ − μ)^{(k_1)} ··· (θ − μ)^{(k_n)},   (12)

with D the dimension of θ. To denote the central moments of q(θ), we define

μ̃^{(k_1,...,k_n)} := E_{θ∼q(θ)}[(θ − μ)^{(k_1)} ··· (θ − μ)^{(k_n)}].

These moments can be computed using Isserlis' theorem; notably, for a Gaussian, μ̃^{(k_1,...,k_n)} = 0 when n is odd. Now we can compute our expectation as an infinite sum:

∇_μ∇_μ E_{θ∼q(θ)}[f(θ)] = ∇_μ∇_μ [f(μ) + Σ_{n=1}^{∞} (1/n!) T_{k_1,...,k_n} |_{θ=μ} μ̃^{(k_1,...,k_n)}]
= ∇_μ∇_μ [f(μ) + Σ_{n=1}^{∞} (1/(2n)!) T_{k_1,...,k_{2n}} |_{θ=μ} μ̃^{(k_1,...,k_{2n})}]   (odd moments are 0)
=: A.

Looking at individual components of A,

A_{i,j} = (∂/∂μ^{(i)})(∂/∂μ^{(j)}) [f(μ) + Σ_{n=1}^{∞} (1/(2n)!) T_{k_1,...,k_{2n}} |_{θ=μ} μ̃^{(k_1,...,k_{2n})}]
= T_{i,j} |_{θ=μ} + Σ_{n=1}^{∞} (1/(2n)!) T_{i,j,k_1,...,k_{2n}} |_{θ=μ} μ̃^{(k_1,...,k_{2n})}.

Inserting this into the original equation and looking at individual indices,

Σ^{-1}_{i,j} = (1/β) A_{i,j} + C_{i,j} = (1/β)(T_{i,j} |_{θ=μ} + Σ_{n=1}^{∞} (1/(2n)!) T_{i,j,k_1,...,k_{2n}} |_{θ=μ} μ̃^{(k_1,...,k_{2n})}) + C_{i,j}.

We assumed that H_μ is O(1) (so T_{i,j} |_{θ=μ} is too), which means that Σ^{-1}_{i,j} must be at least O(β^{-1}). If Σ^{-1} = O(β^{-1}), then Σ = O(β). From Isserlis' theorem, we know that μ̃^{(k_1,...,k_{2n})} is composed of products of n elements of Σ, so μ̃^{(k_1,...,k_{2n})} = O(β^n). T_{i,j,k_1,...,k_{2n}} |_{θ=μ} is constant with respect to β, so it is O(1). Hence the summation is O(β), which for small β is negligible compared to the O(1) term T_{i,j} |_{θ=μ}, and can therefore be ignored. Keeping only the leading terms,

Σ^{-1}_{i,j} ≈ (1/β) T_{i,j} |_{θ=μ} + C_{i,j} = (1/β) H_{μ,i,j} + C_{i,j},

i.e. Σ^{-1} ≈ (1/β) H_μ + C. ∎
C.2 CORRESPONDENCE BETWEEN GVCL'S λ AND ONLINE EWC'S λ

We use D_{KL_λ̃} in place of D_KL, with D_{KL_λ̃} defined as

D_{KL_λ̃}(q_T ‖ q_{T−1}) = (1/2)((μ_T − μ_{T−1})^⊤ Σ̃_{T−1,λ}^{-1}(μ_T − μ_{T−1}) + Tr(Σ_{T−1}^{-1}Σ_T) + log |Σ_{T−1}| − log |Σ_T| − d),

with

Σ̃_{T,λ}^{-1} := (λ/β) Σ_{t=1}^{T} H_t + Σ_0^{-1} = λ(Σ_T^{-1} − Σ_0^{-1}) + Σ_0^{-1}.
Now, the fixed point for Σ_T is still given by Equation 9, but the β-ELBO for the terms involving μ_T has the form

β-ELBO = E_{θ∼q(θ)}[log p(D|θ)] − (β/2)(μ_T − μ_{T−1})^⊤ Σ̃_{T−1,λ}^{-1}(μ_T − μ_{T−1})
= log p(D|θ = μ_T) − (1/2)(μ_T − μ_{T−1})^⊤ (λ Σ_{t=1}^{T−1} H_t + β Σ_0^{-1})(μ_T − μ_{T−1}),

which up-weights the quadratic terms dependent on the data (and not the prior), similarly to λ in Online EWC.

C.3 RECOVERING γ FROM TEMPERING
In order to recover λ, we used the KL divergence between tempered priors and posteriors, q_{T−1}^λ and q_T^λ. Recovering γ can be done using the same trick, except we temper the prior to q_{T−1}^{γλ}:

D_KL(q_T^λ ‖ q_{T−1}^{γλ}) = (1/2)((μ_T − μ_{T−1})^⊤ λΣ_{T−1}^{-1}(μ_T − μ_{T−1}) + Tr(γλΣ_{T−1}^{-1} λ^{-1}Σ_T) + log |λ^{-1}Σ_{T−1}| / |(γλ)^{-1}Σ_T| − d)
= (1/2)((μ_T − μ_{T−1})^⊤ λΣ_{T−1}^{-1}(μ_T − μ_{T−1}) + γ Tr(Σ_{T−1}^{-1}Σ_T) − log |Σ_T|) + const.
=: D_{KL_{λ,γ}}(q_T ‖ q_{T−1}).
We can apply the same modification from λ to λ̃ as before to obtain D_{KL_{λ̃,γ}}(q_T ‖ q_{T−1}). Plugging this into the β-ELBO and solving yields the recursion for Σ_T,

Σ_T^{-1} = (1/β) H_T + γ Σ_{T−1}^{-1},

which is exactly that of Online EWC.

C.4 GVCL RECOVERS THE SAME APPROXIMATION OF F_T AS ONLINE EWC
EWCThe earlier analysis dealt with full rank Σ T . In practice, however, Σ T is rarely full rank and we dealwith approximations of Σ T . In this subsection, we consider diagonal Σ T , like Online EWC, which inpractice uses a diagonal approximation of F T . The way Online EWC approximates this diagonal isby matching diagonal entries of F T . There are many ways of producing a diagonal approximation ofa matrix, for example matching diagonals of the inverse matrix is also valid, depending on the metricwe use. Here, we aim to show that that the diagonal approximation of Σ T that is produced when Q is the family of diagonal covariance Gaussians is the same as the way Online EWC approximates F T , that is, diagonals of Σ − T, approx match diagonals of Σ − T, true , i.e. we match the diagonal precision entries, not the diagonal covariance entries.Let Σ T, approx = diag ( σ , σ , ..., σ d ) , with d the dimension of the matrix. Because we are performingVI, we are optimizing the forwards KL divergence, i.e. D KL ( q approx || q true ) . Therefore, ignoringterms that do not depend on Σ T, approx , D KL ( q approx || q true ) = 12 Tr (Σ T, approx Σ − T, true ) −
C.5 GVCL Recovers the Same Approximation of H_T as SOLA

SOLA approximates the Hessian with a rank-restricted matrix H̃ (Yin et al., 2020). We first consider a full-rank relaxation of this problem, then take the limit in which the relaxation is removed. Because we are concerned with the limit β → 0, it is sufficient to consider Σ^{-1}_true as H, the true Hessian. Because H is symmetric (and assuming it is positive semi-definite), we can write H = V D V^⊤ = Σ_{i=1}^p λ_i x_i x_i^⊤, with D and V the diagonal matrix of eigenvalues and a unitary matrix of eigenvectors, respectively; λ_i and x_i are the eigenvalues and eigenvectors, and p is the dimension of H.

For H̃, we first consider a full-rank matrix which becomes low-rank as δ → 0:

H̃ = Σ_{i=1}^k λ̃_i x̃_i x̃_i^⊤ + Σ_{j=k+1}^p δ x̃_j x̃_j^⊤,

with λ̃_i, 1 ≤ i ≤ k, its first k eigenvalues and δ the remaining ones. We also impose x̃_i^⊤ x̃_i = 1 and x̃_i^⊤ x̃_j = 0 for i ≠ j. With KL minimization, we aim to minimize (up to a constant and a scalar factor)

KL = Tr(Σ_approx Σ^{-1}_true) − log|Σ_approx|.

In our case this is Equation 13, which we can further expand as

KL = Tr(H̃^{-1} H) − log|H̃^{-1}|   (13)
= Tr[ ( Σ_{i=1}^k λ̃_i^{-1} x̃_i x̃_i^⊤ + Σ_{j=k+1}^p δ^{-1} x̃_j x̃_j^⊤ ) H ] + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ   (14)
= Tr( Σ_{i=1}^k λ̃_i^{-1} x̃_i x̃_i^⊤ H ) + Tr( Σ_{j=k+1}^p δ^{-1} x̃_j x̃_j^⊤ H ) + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ   (15)
= Σ_{i=1}^k (1/λ̃_i) x̃_i^⊤ H x̃_i + Σ_{j=k+1}^p (1/δ) x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ.   (16)

Taking derivatives with respect to λ̃_i, we have

∂KL/∂λ̃_i = 0 = −(1/λ̃_i²) x̃_i^⊤ H x̃_i + 1/λ̃_i   (18)
⇒ λ̃_i = x̃_i^⊤ H x̃_i,   (19)

which, when put into Equation 16, gives

KL = Σ_{i=1}^k (x̃_i^⊤ H x̃_i)/(x̃_i^⊤ H x̃_i) + Σ_{j=k+1}^p (1/δ) x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i + Σ_{j=k+1}^p log δ   (21)
= k + Σ_{j=k+1}^p (1/δ) x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i + (p − k) log δ   (22)
= (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log λ̃_i   (removing constants)   (23)
= (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log(x̃_i^⊤ H x̃_i).   (24)

Now we account for the constraints x̃_i^⊤ x̃_i = 1 and x̃_i^⊤ x̃_j = 0, i ≠ j, by adding Lagrange multipliers to the KL cost:

L = (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log(x̃_i^⊤ H x̃_i) − Σ_i φ_{i,i} (x̃_i^⊤ x̃_i − 1) − Σ_{i≠j} φ_{i,j} x̃_i^⊤ x̃_j.   (25)

Taking derivatives with respect to x̃_i (for 1 ≤ i ≤ k):

∂L/∂x̃_i = 0 = 2 H x̃_i / (x̃_i^⊤ H x̃_i) − 2 φ_{i,i} x̃_i − Σ_{j≠i} φ_{i,j} x̃_j   (26)
⇒ Σ_{j≠i} φ_{i,j} x̃_j = 2 ( H / (x̃_i^⊤ H x̃_i) − φ_{i,i} I_p ) x̃_i.   (27)

In Equation 27, a linear combination of the x̃_j, j ≠ i, is equated with a vector built from x̃_i alone; since x̃_i is orthogonal to every x̃_j, the only consistent solution is φ_{i,j} = 0 for i ≠ j, leaving

H x̃_i / (x̃_i^⊤ H x̃_i) = φ_{i,i} x̃_i,   (28)

meaning the x̃_i are eigenvectors of H for 1 ≤ i ≤ k. The same Lagrange-multiplier argument shows that the x̃_i for k + 1 ≤ i ≤ p are also eigenvectors of H. This means that our cost is

KL = (1/δ) Σ_{j=k+1}^p x̃_j^⊤ H x̃_j + Σ_{i=1}^k log(x̃_i^⊤ H x̃_i)   (29)
= (1/δ) Σ_{j=k+1}^p κ̃_j + Σ_{i=1}^k log κ̃_i,   (30)

where (κ̃_1, κ̃_2, ..., κ̃_p) is a permutation of (λ_1, λ_2, ..., λ_p) and κ̃_i = λ̃_i for 1 ≤ i ≤ k. That is, H̃ shares k eigenvalues with H, and the rest are δ. It now remains to determine which eigenvalues are shared and which are excluded.

Consider two eigenvalues λ_i > λ_j ≥ 0 and let r = λ_j / λ_i ∈ [0, 1). The relative cost of excluding λ_i from the kept set {κ̃_1, ..., κ̃_k}, compared to including it (and excluding λ_j instead), is

Relative Cost = (λ_i − λ_j)/δ − log(λ_i / λ_j) = λ_i (1 − r)/δ + log r.

If the relative cost is positive, then including λ_i as one of the eigenvalues of H̃ is the more optimal choice. Solving the inequality,

Relative Cost > 0 ⟺ λ_i (1 − r)/δ > − log r ⟺ λ_i > δ log(1/r) / (1 − r),

which, for sufficiently small δ, is always true because r < 1. Thus it is always better to swap an included/excluded pair of eigenvalues when the excluded one is larger. This means that H̃ has the k largest eigenvalues of H, and we already showed that it shares the corresponding eigenvectors. This maximum eigenvalue/eigenvector selection is exactly the procedure used by SOLA.
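The selected approximation is easy to construct explicitly; a minimal sketch (ours, not SOLA's implementation) that keeps the k largest eigenpairs and floors the remaining eigenvalues at δ:

```python
# Sketch: rank-k approximation selected by the KL criterion above; keep the
# k largest eigenpairs of a symmetric PSD matrix H, set the rest to delta.
import numpy as np

def rank_restricted_hessian(H, k, delta=1e-6):
    evals, evecs = np.linalg.eigh(H)          # eigenvalues in ascending order
    evals_tilde = np.full_like(evals, delta)
    evals_tilde[-k:] = evals[-k:]             # keep the k largest eigenvalues
    return (evecs * evals_tilde) @ evecs.T    # V diag(evals_tilde) V^T
```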
D Cold Posterior VCL and Further Generalizations

The use of KL-reweighting is closely related to the idea of "cold posteriors," in which p_τ(θ|D) ∝ p(θ|D)^{1/τ}. Finding this cold posterior is equivalent to finding the optimal q distribution maximizing the τ-ELBO:

τ-ELBO := E_{θ∼q(θ)}[ log p(D|θ) + log p(θ) − τ log q(θ) ]

when Q is the set of all possible distributions over θ. This objective is the same as the standard ELBO with only the entropy term reweighted, in contrast to the β-ELBO, where both the entropy and the prior likelihood are reweighted. Here, β acts similarly to T (the temperature, not to be confused with the task number). This relationship naturally leads to the transition diagram shown in Figure 11: we can transition between posteriors at different temperatures by optimizing either the β-ELBO or the τ-ELBO, or by tempering the posterior directly.

Figure 11: Transitions between posteriors at different temperatures using tempering and optimizing either the τ-ELBO or β-ELBO.

When Q contains all possible distributions, moving along any path results in exactly the same distribution; for example, optimizing the τ-ELBO and then tempering is the same as directly optimizing the ELBO. However, when Q is limited, this transition is not exact, and the resulting posterior is path-dependent. In fact, each possible path represents a different valid method for performing continual learning. Standard VCL works by traversing the horizontal arrows, directly optimizing the ELBO, while an alternative scheme would optimize the τ-ELBO to form cold posteriors, then heat the posterior before optimizing the τ-ELBO for the next task. Inference can be done at either the warm or the cold state. Note that for Gaussians, heating the posterior is just a matter of scaling the covariance matrix by the constant factor τ_after / τ_before.

While warm posteriors generated through this two-step procedure are not optimal under the ELBO when Q is limited, they may perform better for continual learning. Similar to Equation 2, the optimal Σ when optimizing the τ-ELBO is given by

Σ^{-1}_T = (1/τ) Σ_{t=1}^T H̃_t + (1/τ) Σ_0^{-1},

where H̃_t is the approximate curvature at a specific value of τ for task t, which coincides with the true Hessian as τ → 0, as with the β-ELBO. Here, both the prior and the data-dependent components are scaled by 1/τ, in contrast to Equation 2, where only the data-dependent component is reweighted. As discussed in Section 2.2 and further explored in Appendix A, this leads to a different scale of the quadratic approximation, which may lend itself better to continual learning. This also yields a second way to recover γ in Online EWC: first optimize the β-ELBO with β = γ, then temper by a factor of γ (i.e. increase the temperature when γ < 1).
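For Gaussian posteriors, the tempering steps in the diagram are a one-line operation; a minimal sketch (ours) under the convention used above, namely that temperature τ exponentiates the density by 1/τ:

```python
# Sketch: tempering a Gaussian. Since N(mu, Sigma)^(1/tau) is proportional to
# N(mu, tau * Sigma), moving between temperatures only rescales the covariance.
def temper_gaussian(mu, Sigma, tau_before, tau_after):
    return mu, Sigma * (tau_after / tau_before)
```

Heating (tau_after > tau_before) inflates the covariance, and cooling contracts it, matching the horizontal "tempering" arrows in Figure 11.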
E MAP Degeneracy with FiLM Layers

Here we describe how training FiLM layers with MAP training leads to degenerate values for the weights and scales, whereas with VI training no degeneracy occurs. For simplicity, consider only the weights leading into a single node, and let there be d of them, i.e. θ has dimension d. Because we only consider one node, the scale parameter γ is a single variable.

For MAP training, we have the loss function L = −log p(D|θ, γ) + λ‖θ‖², with D the dataset and λ the L2-regularization hyperparameter. Note that p(D|θ, γ) = p(D|cθ, γ/c), hence we can scale θ arbitrarily without affecting the likelihood, so long as γ is scaled inversely. If c < 1, then λ‖cθ‖² < λ‖θ‖², so scaling the weights down (and the FiLM scale up correspondingly) strictly decreases the L2 penalty without changing the likelihood. Therefore the optimal setting of the scale parameter γ is arbitrarily large, while θ shrinks to 0.

At a high level, VI training (with Gaussian posteriors and priors) does not have this issue because the KL-divergence penalizes the variances of the parameters for deviating from the prior, in addition to the mean parameters, whereas MAP training only penalizes the means. Unlike with MAP training, if we downscale the weights, we also downscale their variances, which increases the KL-divergence. Nor can the variances revert to the prior: when they are up-scaled by the FiLM scale parameter, the injected noise increases, hurting the log-likelihood component of the ELBO. Therefore, there exists an optimal amount of scaling which balances the mean-squared penalty component of the KL-divergence against the variance terms.

Mathematically, we can derive this optimal scale. Consider VI training with a Gaussian variational distribution and prior, where the approximate posterior q(θ) has mean µ and covariance Σ, and the prior p(θ) has parameters µ_0 and Σ_0. First consider the scenario without FiLM layers. Our loss function is L = −E_{θ∼q(θ)}[log p(D|θ)] + D_KL(q(θ) ‖ p(θ)). For multivariate Gaussians,

D_KL(q(θ) ‖ p(θ)) = (1/2) ( log|Σ_0| − log|Σ| − d + Tr(Σ_0^{-1}Σ) + (µ − µ_0)^⊤ Σ_0^{-1} (µ − µ_0) ).

Now consider another distribution q′(θ) with mean cµ and covariance c²Σ. If q′(θ) is paired with the FiLM scale parameter γ = 1/c, the log-likelihood component is unchanged:

E_{θ∼q(θ)}[log p(D|θ)] = E_{θ∼q′(θ)}[log p(D|θ, γ = 1/c)],

with γ our FiLM scale parameter and p(D|θ, γ) representing a model with FiLM scale layers. Now consider D_KL(q′(θ) ‖ p(θ)), and optimize c with µ and Σ fixed:

D_KL(q′(θ) ‖ p(θ)) = (1/2) ( log|Σ_0| − log|c²Σ| − d + Tr(Σ_0^{-1} c²Σ) + (cµ − µ_0)^⊤ Σ_0^{-1} (cµ − µ_0) )
= (1/2) ( log|Σ_0| − log|Σ| − 2d log c − d + c² Tr(Σ_0^{-1}Σ) + (cµ − µ_0)^⊤ Σ_0^{-1} (cµ − µ_0) ).

Setting ∂D_KL/∂c |_{c=c*} = 0:

0 = −d/c* + c* Tr(Σ_0^{-1}Σ) + (c*µ − µ_0)^⊤ Σ_0^{-1} µ
0 = −d + c*² Tr(Σ_0^{-1}Σ) + c*² µ^⊤ Σ_0^{-1} µ − c* µ_0^⊤ Σ_0^{-1} µ
0 = c*² ( Tr(Σ_0^{-1}Σ) + µ^⊤ Σ_0^{-1} µ ) − c* µ_0^⊤ Σ_0^{-1} µ − d
⇒ c* = ( µ_0^⊤ Σ_0^{-1} µ ± sqrt( (µ_0^⊤ Σ_0^{-1} µ)² + 4d ( Tr(Σ_0^{-1}Σ) + µ^⊤ Σ_0^{-1} µ ) ) ) / ( 2 ( Tr(Σ_0^{-1}Σ) + µ^⊤ Σ_0^{-1} µ ) ).

Also note that c = 0 results in an infinitely large KL-divergence, so there is a barrier at c = 0: if optimized through gradient descent, c should never change sign. Furthermore,

∂²D_KL/∂c² = d/c² + Tr(Σ_0^{-1}Σ) + µ^⊤ Σ_0^{-1} µ > 0,

so the KL-divergence is convex with respect to c (on either side of the barrier), c* is a minimizer of D_KL, and therefore D_KL(q(θ) ‖ p(θ)) ≥ D_KL(q′(θ) ‖ p(θ)) |_{c=c*}. This implies the optimal rescaling of the posterior is c*, with the FiLM scale parameter set correspondingly to γ = 1/c*. While no formal data was collected, we observed that the scale parameters do in fact come very close to this optimal value after training.
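The zero-mean-prior special case is easy to check numerically. In the sketch below (our own; we assume µ_0 = 0 and Σ_0 = I, in which case the quadratic formula reduces to c* = sqrt( d / (Tr(Σ) + µ^⊤µ) )):

```python
# Numerical check of the optimal rescaling c* (zero-mean, unit-variance prior
# assumed; mu and s are posterior means and variances for one node's weights).
import numpy as np

rng = np.random.default_rng(1)
d = 10
mu = rng.normal(size=d)
s = np.exp(rng.normal(size=d))              # posterior variances

def kl_c(c):
    # c-dependent terms of KL(N(c*mu, c^2 diag(s)) || N(0, I))
    return 0.5 * (c**2 * (s.sum() + mu @ mu) - 2 * d * np.log(c))

c_star = np.sqrt(d / (s.sum() + mu @ mu))   # closed form from the derivation
cs = np.linspace(0.05, 3.0, 2000)
print(c_star, cs[np.argmin(kl_c(cs))])      # the two should agree closely
```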
F Clustering of FiLM Parameters

Figure 12: t-SNE of FiLM layer parameters ((a) scales, (b) shifts, (c) shifts and scales) of 58 tasks coming from different domains. Shift and scale parameters from the same domain are more similar than those from different ones.

In this section, we test the interpretability of the learned FiLM parameters. Such clustering has been done in the past with FiLM parameters, as well as with node-wise uncertainty parameters. One would intuitively expect that tasks from similar domains find similar features salient, and thus share similar FiLM parameters. To test this hypothesis, we took the 8 mixed vision tasks from Section 5.3 and split each task into multiple 5-way classification tasks, so that there were many tasks from similar domains. For example, CIFAR100, which originally had 100 classes, became 20 5-way classification tasks; TrafficSigns became 8 tasks (7 5-way and 1 8-way); and MNIST became 2 (both 5-way). Next, we trained the same architecture used in Section 5.3, except on all 58 resulting tasks. Joint training was chosen over continual learning to avoid artifacts which would arise from task ordering. Figure 12 shows that the resulting scale and shift parameters can be clustered, and that FiLM parameters which arise from the same base task cluster together. As in Achille et al. (2019), this could likely be used as a means of deciding which tasks to learn continually and which tasks to separate (i.e. tasks from the same cluster would likely benefit from joint training, while tasks from different clusters should be trained separately); however, we did not explore this idea further.
G How FiLM Layers Interact with Pruning

Figure 13: Posterior distributions for incoming weights (left) or biases (right) of a node in the first layer; panels show (a) weights and (b) biases without FiLM layers, and (c) weights and (d) biases with FiLM layers. Nodes are either unpruned (left within a column) or pruned (right within a column). Without FiLM layers (top row), pruned nodes have their bias concentrated at a negative value, preventing future tasks from reactivating the node. With FiLM layers, a pruned node prunes using the FiLM parameters rather than the shared ones, allowing the posteriors to revert to the prior distribution and the node to be reactivated.

In Section 3, we discussed the problem of pruning in variational continual learning and how it prevents nodes from becoming reactivated. To reiterate, pruning broadly occurs in three steps:
1. Weights incoming to a node begin to revert to the prior distribution.
2. Noise from these high-variance weights affects the likelihood term in the ELBO.
3. To suppress this noise, the bias concentrates at a negative value so that the node's output is cut off by the ReLU activation.
Later tasks are then initialized with this negative, low-variance bias, meaning the node has a difficult time reactivating without incurring a high prior cost. This results in the effect shown in Figure 1, where after the first task effectively no more nodes are reactivated. The effect is further exacerbated with larger values of β, where the pruning effect is stronger. Increasing λ worsens this as well, as the increased quadratic cost further prevents the already low-variance negative biases from moving.

We verify that this mechanism is indeed the cause of the limited capacity use by visualizing the posteriors of the weights and biases entering a node in the first convolutional layer of a network trained on Easy-CHASY (Figure 13). Without FiLM layers, the biases of pruned nodes do indeed concentrate at negative values. In contrast, biases in models with FiLM layers are able to revert to their prior, because the FiLM parameters perform the pruning instead.
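One way to make this diagnosis quantitative (a sketch of our own, not a diagnostic from the paper) is to measure, per node, the KL between the posterior over its incoming weights and the prior; nodes whose weights have reverted to the prior, i.e. pruned nodes, sit near zero:

```python
# Sketch: flag "pruned" nodes by the KL between each node's incoming-weight
# posterior N(mu_i, var_i) and a N(0, prior_var) prior; near-zero KL means
# the weights have reverted to the prior.
import numpy as np

def node_kl(mu, var, prior_var=1.0):
    # sum of per-weight Gaussian KLs for one node's incoming weights
    return 0.5 * np.sum(var / prior_var + mu**2 / prior_var
                        - 1.0 - np.log(var / prior_var))
```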
H Related Work
Regularization-based continual learning.
Many algorithms attempt to regularize network parameters based on a metric of importance. The algorithms most directly comparable to GVCL are EWC (Kirkpatrick et al., 2017), Online EWC (Schwarz et al., 2018), and VCL (Nguyen et al., 2018). EWC measures importance based on the Fisher information matrix, while VCL uses an approximate posterior covariance matrix as an importance measure. Online EWC slightly modifies EWC so that there is only a single regularizer, based on the cumulative sum of Fisher information matrices. Lee et al. (2017) proposed IMM, an extension to EWC which merges posteriors based on their Fisher information matrices. Ritter et al. (2018) and Yin et al. (2020) both aim to approximate the Hessian using Kronecker-factored or low-rank forms, respectively, using the Laplace approximation to form approximate posteriors over parameters. These methods all use second-order approximations of the loss. Ahn et al. (2019), like us, use regularizers based on the ELBO, but measure importance on a per-node rather than a per-weight basis. SI (Zenke et al., 2017) measures importance using "synaptic saliency," as opposed to methods based on approximate curvature.
Architectural approaches to continual and meta-learning.
This family of methods modifies the standard neural architecture by adding either parallel or series components to the network. Progressive Neural Networks add a parallel column network for every task. PathNet (Fernando et al., 2017) can be interpreted as a parallel-network-based algorithm, but rather than growing the model over time, the model size remains fixed while paths between layer columns are optimized. FiLM parameters can be interpreted as adding series components to a network, and have been a mainstay in the multitask and meta-learning literature. Requeima et al. (2019) use hypernetworks to amortize FiLM parameter learning, an approach that has been shown to be capable of continual learning. Architectural approaches are often used in tandem with regularization-based approaches, such as in HAT (Serra et al., 2018), which uses per-task gating parameters alongside a compression-based regularizer. Adel et al. (2020) propose CLAW, which also uses variational inference alongside per-task parameters, but requires a more complex meta-learning-based training procedure involving multiple splits of the dataset. GVCL with FiLM layers adds to this list of hybrid architectural-regularization approaches.
Cold Posteriors and likelihood-tempering.
As mentioned in Section 2, likelihood-tempering (or KL-reweighting) has been found empirically to improve performance when using variational inference for Bayesian neural networks across a wide range of contexts (Osawa et al., 2019; Zhang et al., 2018). Cold posteriors are closely related to likelihood tempering, except that they temper the full posterior rather than only the likelihood term, and often empirically outperform Bayesian posteriors when using MCMC sampling (Wenzel et al., 2020). From an information-theoretic perspective, KL-reweighted ELBOs have also been studied in terms of compression (Achille et al., 2020). Achille et al. (2019), like us, consider a limiting case of β and use it to measure parameter saliency, but use this information to create a task embedding rather than for continual learning. Outside of the Bayesian neural network context, values of β > 1 have also been explored (Higgins et al., 2017), and more generally different values of β trace out different points on a rate-distortion curve for VAEs (Alemi et al., 2018).

I Experiment Details
I.1 Reported Metrics
All reported scores and figures present the mean and standard deviation across 5 runs of the algorithm with different network initializations. For Easy-CHASY and Hard-CHASY, train/test splits are also varied across runs. For the mixed vision tasks, the permutation of the 8 tasks is also randomized between runs.

Let the matrix R_{i,j} represent the performance on the j-th task after the model was trained on the i-th task. Furthermore, let R^ind_j be the mean performance on the j-th task for a network trained only on that task, and let the total number of tasks be T. Following Lopez-Paz & Ranzato (2017) and Pan et al. (2020), we define

Average Accuracy (ACC) = (1/T) Σ_{j=1}^T R_{T,j},
Forward Transfer (FWT) = (1/T) Σ_{j=1}^T ( R_{j,j} − R^ind_j ),
Backward Transfer (BWT) = (1/T) Σ_{j=1}^T ( R_{T,j} − R_{j,j} ).

Note that these metrics are not exactly the same as those presented in other works, as our FWT and BWT metrics are summed over the indices 1 ≤ j ≤ T, whereas Lopez-Paz & Ranzato (2017) and Pan et al. (2020) sum over 2 ≤ j ≤ T and 1 ≤ j ≤ T − 1 for FWT and BWT, respectively. For FWT, this definition does not assume that R_{1,1} = R^ind_1, which affects algorithms such as HAT and Progressive Neural Networks that either compress the model, resulting in lower accuracy, or use a smaller architecture for the first task. The modified BWT is equal to the usual BWT metric apart from a constant factor (T − 1)/T.

Intuitively, forward transfer measures how much continual learning has benefited a task at the time it is newly learned, while backward transfer is the accuracy drop as the network learns more tasks, compared to when a task was first learned. Furthermore, in the tables in Appendix J we also present the net performance gain (NET), which quantifies the total gain over separate training at the end of continual training:

NET = FWT + BWT = (1/T) Σ_{j=1}^T ( R_{T,j} − R^ind_j ).

Note that for the computation of R^ind, we compare to models trained under the same paradigm: MAP algorithms (all baselines except VCL) are compared to a MAP-trained model, and VI algorithms (GVCL-F, GVCL and VCL) are compared to KL-reweighted VI models. This makes no difference for most benchmarks, where R^ind_MAP ≈ R^ind_VI. However, for Easy- and Hard-CHASY, R^ind_MAP < R^ind_VI, so we compare VI to VI and MAP to MAP to obtain fair metrics.

In Figure 5b, we plot ∆ACC_i, which we define as

∆ACC_i = (1/i) Σ_{j=1}^i ( R_{i,j} − R_{T,j} ).

This metric is useful when the tasks have very different accuracies and their permutation is randomized, as is the case with the mixed vision tasks. Note that this means R_{i,j} refers to a different task for each permutation, but we average over the 5 permutations of the runs. If two algorithms have similar final accuracies, this metric measures how much the network forgets about the first i tasks between that point and the end of training, and equivalently how much higher the accuracy would have been had training terminated after i tasks. Plotting this also captures graceful vs. catastrophic forgetting: graceful forgetting shows up as a smooth downward curve, while catastrophic forgetting produces sudden drops.
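These definitions translate directly into code. The following sketch (ours, not the paper's evaluation code) computes ACC, FWT, BWT, NET and ∆ACC_i from the result matrix R:

```python
# Sketch: continual learning metrics from a T x T result matrix R, where
# R[i, j] is accuracy on task j after training on task i (0-indexed), and
# R_ind[j] is the accuracy of a separately trained model on task j.
import numpy as np

def continual_metrics(R, R_ind):
    acc = R[-1].mean()                     # ACC: final mean accuracy
    fwt = np.mean(np.diag(R) - R_ind)      # FWT: gain when each task is new
    bwt = np.mean(R[-1] - np.diag(R))      # BWT: drop from later training
    net = fwt + bwt                        # NET = mean(R[-1] - R_ind)
    return acc, fwt, bwt, net

def delta_acc(R, i):
    # Delta-ACC_i: mean forgetting on the first i tasks (1-indexed i)
    # between training step i and the end of training
    return np.mean(R[i - 1, :i] - R[-1, :i])
```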
I.2 Optimizer and Training Details

The implementation of all baseline methods was based on the GitHub repository for HAT (Serra et al., 2018; https://github.com/joansj/hat), except that the implementations of IMM-Mode and EWC were modified due to an error in the computation of the Fisher information matrix in the original implementation. Baseline MAP algorithms were trained with SGD with a decaying learning rate starting at 5e-2, with a maximum of 200 epochs per task for Split-MNIST, Split-CIFAR and the mixed vision benchmarks. The maximum number of epochs for Easy-CHASY and Hard-CHASY was 1000, due to the small dataset size. Early stopping based on the validation set was used: 10% of the training set served as validation for these methods, and for Easy- and Hard-CHASY, 8 samples per class form the validation set (disjoint from the training and test samples).

For VI models, we used the Adam optimizer with a learning rate of 1e-4 for Split-MNIST and Mixture, and 1e-3 for Easy-CHASY, Hard-CHASY and Split-CIFAR. We briefly tested running the baseline algorithms with Adam rather than SGD and performance did not change. Easy-CHASY and Hard-CHASY were run for 1500 epochs per task, Split-MNIST for 100, Split-CIFAR for 60, and Mixture for 180. The number of epochs was chosen so that the number of gradient steps for each task was roughly equal. For Easy-CHASY, Hard-CHASY and Split-CIFAR, this means that later tasks run for more epochs, since the largest training sets come first. For Mixture, we ran the equivalent of 180 epochs of Facescrub; for how many epochs this equates to in the other datasets, we refer the reader to Appendix A of Serra et al. (2018). We did not use early stopping for these VI results. While in some cases we trained for many more epochs than the baselines, the baselines used early stopping and all stopped long before the 200-epoch limit was reached, so allocating more time would not change their results. Swaroop et al. (2019) also find that allowing VI to converge is crucial for continual learning performance. We leave the discussion of improving this convergence time to future work.

All experiments (both the baselines and the VI methods) use a batch size of 64.
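For reference, the VI training setup above can be collected into a small configuration sketch (the dictionary layout and the name VI_TRAIN_CONFIG are ours; the values are those stated in the text):

```python
# Sketch: VI training configuration as described above (values from the text)
VI_TRAIN_CONFIG = {
    "optimizer": "Adam",
    "batch_size": 64,
    "early_stopping": False,
    "lr": {"Split-MNIST": 1e-4, "Mixture": 1e-4,
           "Easy-CHASY": 1e-3, "Hard-CHASY": 1e-3, "Split-CIFAR": 1e-3},
    "epochs_per_task": {"Easy-CHASY": 1500, "Hard-CHASY": 1500,
                        "Split-MNIST": 100, "Split-CIFAR": 60, "Mixture": 180},
}
```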
I.3 Architectural Details

Easy and Hard CHASY. We use a convolutional architecture with two convolutional layers:
1. 3x3 convolutional layer with 16 filters, padding of 1, ReLU activations
2. 2x2 max pooling with stride 2
3. 3x3 convolutional layer with 32 filters, padding of 1, ReLU activations
4. 2x2 max pooling with stride 2
5. Flattening layer
6. Fully connected layer with 100 units and ReLU activations
7. Task-specific head layers

Split-MNIST. We use a standard MLP:
1. Fully connected layer with 256 units and ReLU activations
2. Fully connected layer with 256 units and ReLU activations
3. Task-specific head layers

Split-CIFAR. We use the same architecture as Zenke et al. (2017):
1. 3x3 convolutional layer with 32 filters, padding of 1, ReLU activations
2. 3x3 convolutional layer with 32 filters, padding of 1, ReLU activations
3. 2x2 max pooling with stride 2
4. 3x3 convolutional layer with 64 filters, padding of 1, ReLU activations
5. 3x3 convolutional layer with 64 filters, padding of 1, ReLU activations
6. 2x2 max pooling with stride 2
7. Flattening layer
8. Fully connected layer with 512 units and ReLU activations
9. Task-specific head layers

Mixed vision tasks. We use the same AlexNet architecture as Serra et al. (2018):
1. 4x4 convolutional layer with 64 filters, padding of 0, ReLU activations
2. 2x2 max pooling with stride 2
3. 3x3 convolutional layer with 128 filters, padding of 0, ReLU activations
4. 2x2 max pooling with stride 2
5. 2x2 convolutional layer with 256 filters, padding of 0, ReLU activations
6. 2x2 max pooling with stride 2
7. Flattening layer
8. Fully connected layer with 2048 units and ReLU activations
9. Fully connected layer with 2048 units and ReLU activations
10. Task-specific head layers

For MAP models, dropout layers with probabilities of either 0.2 or 0.5 were added after convolutional or fully-connected layers. For GVCL-F, FiLM layers were inserted after convolutional/hidden layers, but before ReLU activations.
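As an illustration of this placement, here is a minimal PyTorch-style sketch (our own; the class name, layer sizes and initialization are illustrative, not the paper's code) of a convolutional block with task-specific FiLM parameters applied after the convolution and before the ReLU:

```python
# Sketch: task-specific FiLM scale/shift applied between a convolution and
# its ReLU, with one (gamma, beta) pair per task and per channel (assumed).
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, n_tasks):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gamma = nn.Parameter(torch.ones(n_tasks, out_ch))   # scales
        self.beta = nn.Parameter(torch.zeros(n_tasks, out_ch))   # shifts

    def forward(self, x, task_id):
        h = self.conv(x)
        g = self.gamma[task_id].view(1, -1, 1, 1)
        b = self.beta[task_id].view(1, -1, 1, 1)
        return torch.relu(g * h + b)
```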
I.4 Hyperparameter Selection

For all algorithms on Easy-CHASY, Hard-CHASY, Split-MNIST and Split-CIFAR, hyperparameter selection was done by choosing the combination which produced the best average accuracy on the first 3 tasks; the algorithms were then run on the full number of tasks. For the mixed vision tasks, the best hyperparameters for the baselines were taken from the HAT GitHub repository. For GVCL, we performed hyperparameter selection in the same way as Serra et al. (2018): we chose the hyperparameters with the best average performance on the first random permutation of tasks. Note that for the mixture tasks we randomly permute the task order for each run (with permutations kept consistent between algorithms), whereas for the other 4 benchmarks the task order is fixed. Hyperparameter searches were performed using a grid search. The best selected hyperparameters are shown in Table 3.
Algorithm     Hyperparameter   Easy-CHASY   Hard-CHASY   Split-MNIST   Split-CIFAR   Mixed Vision
GVCL-F        β
              λ                10           10           100           100           50
GVCL          β
              λ                100          100          1             1000          100
HAT           λ
              s_max            10           50           50            50            400*
PathNet       λ                100          500          10000         100           5
Progressive   None             -            -            -             -             -
IMM-Mean      λ

* Best hyperparameters taken from the HAT code.

Table 3: Best (selected) hyperparameters for the continual learning experiments for various algorithms. We fix Online EWC's γ = 1.

For the joint and separate VI baselines, we used the same β. For the mixed vision tasks, we had to use a prior variance of 0.01 (for VCL, GVCL and GVCL-F), but for all other tasks we did not need to tune this.
J Further Experimental Results

In the following section we present further quantitative results of the various baselines on our benchmarks. For brevity, the main text included only the best-performing baselines and those most comparable to GVCL: HAT, PathNet, Online EWC and VCL. Each table reports ACC, BWT, FWT and NET (mean ± standard deviation over runs) for GVCL-F, GVCL, HAT, PathNet, VCL, VCL-F, Online EWC, Online EWC-F, Progressive Networks, IMM-mean, IMM-mode, LWF, SGD and SGD-Frozen, together with separately and jointly trained MAP and β-VI reference models.
J.1 Easy-CHASY Additional Results

Table 4: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Easy-CHASY. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 14: Mean accuracy of individual tasks after training for all approaches on Easy-CHASY.

Figure 15: Mean accuracy of individual tasks after training for the top 5 performing approaches on Easy-CHASY.

Figure 16: Running average accuracy of individual tasks after training for all approaches on Easy-CHASY.

Figure 17: Running average accuracy of individual tasks after training for the top 5 approaches on Easy-CHASY.
J.2 Hard-CHASY Additional Results

Table 5: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Hard-CHASY. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 18: Mean accuracy of individual tasks after training for all approaches on Hard-CHASY.

Figure 19: Mean accuracy of individual tasks after training for the top 5 performing approaches on Hard-CHASY.

Figure 20: Running average accuracy of individual tasks after training for all approaches on Hard-CHASY.

Figure 21: Running average accuracy of individual tasks after training for the top 5 approaches on Hard-CHASY.
J.3 Split-MNIST Additional Results

Table 6: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Split-MNIST. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 22: Mean accuracy of individual tasks after training for all approaches on Split-MNIST.

Figure 23: Mean accuracy of individual tasks after training for the top 5 performing approaches on Split-MNIST.

Figure 24: Running average accuracy of individual tasks after training for all approaches on Split-MNIST.

Figure 25: Running average accuracy of individual tasks after training for the top 5 approaches on Split-MNIST.
J.4 Split-CIFAR Additional Results

Table 7: Performance metrics of GVCL-F, GVCL and various baseline algorithms on Split-CIFAR. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 26: Mean accuracy of individual tasks after training for all approaches on Split-CIFAR.

Figure 27: Mean accuracy of individual tasks after training for the top 5 performing approaches on Split-CIFAR.

Figure 28: Running average accuracy of individual tasks after training for all approaches on Split-CIFAR.

Figure 29: Running average accuracy of individual tasks after training for the top 5 approaches on Split-CIFAR.
J.5 Mixed Vision Tasks Additional Results

Table 8: Performance metrics of GVCL-F, GVCL and various baseline algorithms on the mixed vision tasks. Separate and joint training results for both MAP and β-VI models are also presented.

Figure 30: Mean accuracy of individual tasks after training for all approaches on the mixed vision tasks.

Figure 31: Mean accuracy of individual tasks after training for the top 5 performing approaches on the mixed vision tasks.

Table 9: ECE of all 8 mixed vision tasks (CIFAR10, CIFAR100, MNIST, SVHN, FashionMNIST, TrafficSigns, Facescrub and NotMNIST, plus the average) for a model trained continually using GVCL-F or HAT.

Figure 32: Clusters of symbols found by performing K-means clustering with K = 20 based on the embedding layer of a model trained with variational inference on a 200-way classification task over the 200 most common symbols in the HASYv2 dataset. Easy-CHASY is made by taking the first symbol from each cluster as the first task, then the second, and so on, up to 10 tasks. Hard-CHASY is made by taking the clusters with the most classes in order (clusters 1-10).
K Clustered HASYv2

The HASYv2 dataset consists of 32x32 black-and-white handwritten LaTeX characters, with a total of 369 classes and over 150,000 samples (Thoma, 2017).

We constructed 10 classification tasks, each with a varying number of classes ranging from 20 down to 11. To construct these tasks, we first trained a mean-field Bayesian neural network on a 200-way classification task over the 200 classes with the most samples. To get an embedding for each class, we used the activations of the second-last layer. We then performed K-means clustering with 20 clusters on the mean embedding of each class, obtained by passing that class's samples through the network. Doing this yielded the clusters shown in Figure 32; within each cluster are classes which the network deems "similar". To make the 10 classification tasks, we then took classes from each cluster sequentially (in order of the class whose mean was closest to the cluster's mean), so that each task contains at most one symbol from each cluster. Doing this ensures that the tasks are similar to one another, since each task consists of classes which differ in similar ways. With the classes selected, the training set is made by selecting 16 samples of each class, with the remainder used as the test set. This procedure generated the "easy" set of tasks, which should have the maximum amount of similarity between tasks. We also constructed a second set of tasks, the "hard" set, in which each task is individually difficult. This was done by defining each task as classification within a single cluster, selecting the clusters with the most symbols first; this corresponds to clusters 1-10 in Figure 32. With the classes for each task selected, 16 samples from each class are used for the training set, and the remainder form the test set; excess samples are discarded so that the test-set class distribution is uniform within each task.

This clustering procedure was necessary because we found it difficult to produce sizable transfer gains when tasks were constructed simply by taking the classes with the most samples. While we were able to obtain gains of up to 3% from joint training on 10 20-way classification tasks chosen by class sample count, these gains were significantly diminished when performing MAP estimation as opposed to MLE estimation, and reduced even further when performing VI. Because one of our benchmark continual learning methods is VCL, showing transfer when training with VI is necessary.
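A sketch of this task-construction step (our own reconstruction from the description above; the function name, arguments and use of scikit-learn are illustrative assumptions):

```python
# Sketch: cluster class-mean embeddings with K-means, then build the "easy"
# tasks by drawing, per cluster, the t-th closest class for task t.
import numpy as np
from sklearn.cluster import KMeans

def build_easy_tasks(embeddings, labels, n_clusters=20, n_tasks=10):
    classes = np.unique(labels)
    means = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(means)
    tasks = [[] for _ in range(n_tasks)]
    for k in range(n_clusters):
        members = classes[km.labels_ == k]
        # order this cluster's classes by distance to the cluster centre
        d = np.linalg.norm(means[km.labels_ == k] - km.cluster_centers_[k],
                           axis=1)
        for t, c in enumerate(members[np.argsort(d)][:n_tasks]):
            tasks[t].append(c)   # the t-th closest class goes to task t
    return tasks
```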
Figure 33: Relative test-set accuracy of models trained jointly on the easy set of tasks, relative to individual training, for MAP estimation. Figure 33a shows the means aggregated over all tasks, while Figure 33b shows the performance differences for individual tasks. Performance increases near-monotonically as more tasks are added, achieving an average gain of around 4.7% with 10 tasks.
Figure 34: Relative performance of models trained jointly on the easy set of tasks, relative to individual training, for variational inference with various KL-reweighting coefficients β.