Unified Probabilistic Deep Continual Learning through Generative Replay and Open Set Recognition
Martin Mundt, Sagnik Majumder, Iuliia Pliushch, Yong Won Hong, Visvanathan Ramesh
Center for Cognition and Computation, Goethe University, Frankfurt, Germany. Correspondence to: Martin Mundt <[email protected]>.

Abstract
We introduce a probabilistic approach to unify deep continual learning with open set recognition, based on variational Bayesian inference. Our single model combines a joint probabilistic encoder with a generative model and a linear classifier that get shared across sequentially arriving tasks. In order to successfully distinguish unseen unknown data from trained known tasks, we propose to bound the class specific approximate posterior by fitting regions of high density on the basis of correctly classified data points. These bounds are further used to significantly alleviate catastrophic forgetting by avoiding samples from low density areas in generative replay. Our approach requires no storing of old data or upfront knowledge of future data and is empirically validated on visual and audio tasks in class incremental, as well as cross-dataset scenarios across modalities.
1. Introduction
Modern machine learning systems are typically trained in a closed world setting according to an isolated learning paradigm. They take on the assumption that data is available at all times and that data inputs encountered during application of the learned model come from the same statistical population as the training data. However, the real world requires dealing with sequentially arriving tasks and data coming from potentially yet unknown sources. A neural network that is trained exclusively on such newly arriving data will overwrite its representations and thus forget knowledge of past tasks, an early identified phenomenon coined catastrophic forgetting (McCloskey & Cohen, 1989). Moreover, when confronting the learned model with unseen concepts, misclassification is bound to occur (Matan et al., 1990).

Existing continual learning literature predominantly concentrates its efforts on finding mechanisms to alleviate catastrophic forgetting (Parisi et al., 2019), and the term continual learning is not necessarily used in a wider sense. Specifically, the aforementioned crucial system component to distinguish seen from unseen unknown data, both as a guarantee for robust application and to avoid the requirement of explicit task labels for prediction, is generally missing. A naive conditioning on unseen unknown data through inclusion of a "background" class is infeasible, as by definition we do not have access to it a priori. Commonly applied thresholding of prediction values is veritably insufficient, as resulting large confidences cannot be prevented (Matan et al., 1990). Arguably this also includes variational methods (Kingma & Welling, 2013; Farquhar & Gal, 2018; Achille et al., 2018) to gauge neural network uncertainty, since the closed world assumption also holds true for Bayesian methods (Boult et al., 2019). Recently, Bendale & Boult (2016) have proposed extreme value theory (EVT) based meta-recognition to address open set detection on the basis of softmax predictions in conventional feed-forward deep neural networks. Inspired by this work, we propose a probabilistic approach to unify open set recognition with continual learning in a single deep model. Our specific contributions are:

• We introduce a single model for continual learning that combines a joint probabilistic encoder with a generative model and a linear classifier. This architecture enables a natural formulation to address open set recognition on the basis of EVT bounds to the class conditional approximate posterior in Bayesian inference.

• Apart from using EVT for detection of unseen unknown data, we show that generated samples from areas of low probability density under the aggregate posterior can be excluded in generative replay for continual learning. This leads to significantly reduced catastrophic forgetting without storing real data.

• Empirically, we show that our model can incrementally learn the classes of two image and one audio dataset, as well as cross-dataset scenarios across modalities, while being able to successfully distinguish various unseen datasets from data belonging to known tasks.
2. Background and Related Work
In isolated supervised machine learning the core assumption is the presence of i.i.d. data at all times, and training is conducted using a dataset $D \equiv \{ (\boldsymbol{x}^{(n)}, y^{(n)}) \}_{n=1}^{N}$, consisting of $N$ pairs of data instances $\boldsymbol{x}^{(n)}$ and their corresponding labels $y^{(n)} \in \{1, \ldots, C\}$ for $C$ classes. In contrast, in continual learning task data $D_t \equiv \{ (\boldsymbol{x}_t^{(n)}, y_t^{(n)}) \}_{n=1}^{N_t}$ with $t = 1, \ldots, T$ arrives sequentially for $T$ disjoint datasets, each with number of classes $C_t$. It is assumed that only the data of the current task is available.
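To make the protocol concrete, the following is a minimal sketch of such a class-incremental stream for MNIST with two classes per task, assuming torchvision; the helper function and its batch size are illustrative glue code, not taken from the paper's implementation.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())

def task_loaders(dataset, classes_per_task=2, batch_size=128):
    """Yield one DataLoader per task, each restricted to a disjoint class group."""
    targets = torch.as_tensor(dataset.targets)
    num_classes = int(targets.max()) + 1
    for start in range(0, num_classes, classes_per_task):
        task_classes = list(range(start, start + classes_per_task))
        mask = torch.zeros_like(targets, dtype=torch.bool)
        for c in task_classes:          # select only this task's classes
            mask |= targets == c
        idx = torch.nonzero(mask).squeeze(1)
        yield task_classes, DataLoader(Subset(dataset, idx.tolist()),
                                       batch_size=batch_size, shuffle=True)

for t, (task_classes, loader) in enumerate(task_loaders(train_set), start=1):
    print(f"task {t}: classes {task_classes}, {len(loader.dataset)} examples")
```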
Different methods in the literature have been identified to prevent a model from forgetting past knowledge, either explicitly, through regularization or freezing of weights, or implicitly, through rehearsal of data by sampling retained subsets or sampling from a generative memory. A recent review of many continual learning methods is provided by Parisi et al. (2019). Here, we present a brief summary of particular related works.

Regularization and weight freezing:
Regularization methods such as synaptic intelligence (SI) (Zenke et al., 2017) or elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) explicitly constrain the weights during continual learning to avoid drifting too far away from previous tasks' solutions. In a related picture, learning without forgetting (Li & Hoiem, 2016) uses knowledge distillation (Hinton et al., 2014) to regularize the end-to-end functional. Further methods employ dynamically expandable neural networks (Yoon et al., 2018) or progressive networks (Rusu et al., 2016) that expand the capacity of the neural network, while freezing or regularizing existing representations.
Rehearsal:
These methods store and rehearse data from distributions belonging to old tasks or generate samples in pseudo-rehearsal (Robins, 1995). The central component of the former is thus the selection of significant instances. For methods such as iCaRL (Rebuffi et al., 2017) it is therefore common to resort to auxiliary techniques such as a nearest-mean classifier (Mensink et al., 2012) or coresets (Bachem et al., 2015). Inspired by complementary learning systems theory (O'Reilly & Norman, 2003), dual-model approaches sample data from a separate generative memory. In GeppNet (Gepperth & Karaoguz, 2016) an additional long short-term memory (Hochreiter & Schmidhuber, 1997) is used for storage, whereas generative replay (Shin et al., 2017) samples from a separately trained generative adversarial network (GAN) (Goodfellow et al., 2014).
Bayesian methods:
As detailed by Farquhar & Gal (2018), Bayesian methods provide natural capability for continual learning by making use of the learned distribution in e.g. a variationally trained neural network (Kingma & Welling, 2013). Existing works nevertheless fall into the above two categories: a prior-based approach using the former task's approximate posterior as the new task's prior (Nguyen et al., 2018), or estimating the likelihood of former data through generative replay or other forms of rehearsal (Farquhar & Gal, 2018; Achille et al., 2018).
Evaluation assumptions and multiple model heads:
The success of many of these techniques can be attributed mainly to the considered evaluation scenario. With the exception of Farquhar & Gal (2018), all above techniques train a separate classifier per task and thus either require explicit storage of task labels, or assume the presence of a task oracle during evaluation. This multi-head classifier scenario prevents "cross-talk" between classifier units by not sharing them, which would otherwise rapidly decay the accuracy as newly introduced classes directly confuse existing concepts. While the latter is acceptable to gauge catastrophic forgetting, it also signifies a major limitation in practical application. Even though Farquhar & Gal (2018) use a single classifier, they train a separate generative model per task to avoid catastrophic forgetting of the generator. Our approach builds upon these previous works by proposing a single model with a single classifier head, with a natural mechanism for open set recognition and improved generative replay from a Bayesian perspective.
The above mentioned literature focuses its continual learning efforts predominantly on addressing catastrophic forgetting. Corresponding evaluation is thus conducted in a closed world setting, where instances that do not belong to the observed data distribution are not encountered. In reality, this is not guaranteed and our models need the ability to identify unknown examples in an open world. We provide a small overview of relevant approaches to address the latter. A more comprehensive review of recent methods is provided by Boult et al. (2019).
Bayesian uncertainty:
Bayesian deep neural network models (Kingma & Welling, 2013) could be argued to intrinsically be able to reject statistical outliers. Intuitively, one could estimate a model's uncertainty through Monte-Carlo dropout (Gal & Ghahramani, 2015) or other variational approximations (Farquhar & Gal, 2018). However, this is generally insufficient, as uncertain inputs are not necessarily unknown and, vice versa, unknowns do not necessarily have to appear as uncertain (Boult et al., 2019).
Calibration:
The aim of these works is to separate known and unknown inputs through prediction confidence, often by fine-tuning or re-training an already existing model. In ODIN (Liang et al., 2018) this is addressed through perturbations and temperature scaling, while Lee et al. (2018) use a separately trained GAN to generate out-of-distribution samples from low probability densities and explicitly reduce their confidence through inclusion of an additional loss term. Similarly, Dhamija et al. (2018) define a loss function that aims to maximize entropy for unknown inputs.

Figure 1. Joint continual learning model consisting of a shared probabilistic encoder $q_{\theta}(\boldsymbol{z}|\boldsymbol{x})$, probabilistic decoder $p_{\phi}(\boldsymbol{x}|\boldsymbol{z})$ and probabilistic classifier $p_{\xi}(y|\boldsymbol{z})$. For open set recognition and generative replay with outlier rejection, EVT based bounds on the basis of the approximate posterior are established.

Extreme value theory:
One approach to open set recognition in deep neural networks is through extreme value theory (EVT) based meta-recognition (Scheirer et al., 2014; Bendale & Boult, 2016), i.e. without re-training or modifying loss functions by assuming upfront presence of unknown data. The goal here is to bound the open space on the basis of already seen data instances. Bendale & Boult (2016) have proposed OpenMax to modify a neural network's softmax prediction values on the basis of extreme values of the penultimate layer's activation values.

Our work extends these approaches by moving away from predictive values and instead uses EVT to bound the approximate posterior. In contrast to predictive values such as reconstruction losses, where differences in reconstructed images do not necessarily have to reflect the outcome with respect to our task's target, we thus directly operate on the underlying (lower-bound to the) data distribution and the generative factors. This allows us to also constrain generative replay to distribution inliers, which further alleviates catastrophic forgetting in continual learning. While we can still leverage variational inference to gauge model uncertainty, the need to rely on classifier entropy or confidence, which are known to be overconfident and can never be calibrated for all unknown inputs, is circumvented.
3. Unifying Continual Learning with Open Set Recognition
We consider the continual learning scenario with awareness of an open world from a perspective of variational Bayesian inference in deep neural networks (Kingma & Welling, 2013). Our model consists of a shared encoder with variational parameters $\theta$, and a decoder and linear classifier with respective parameters $\phi$ and $\xi$. The joint probabilistic encoder learns an encoding to a latent variable $\boldsymbol{z}$, over which a unit Gaussian prior is placed. Using variational inference, the encoder's purpose is to approximate the true posterior to both $p_{\phi}(\boldsymbol{x}, \boldsymbol{z})$ and $p_{\xi}(y, \boldsymbol{z})$. The probabilistic decoder $p_{\phi}(\boldsymbol{x}|\boldsymbol{z})$ and probabilistic linear classifier $p_{\xi}(y|\boldsymbol{z})$ then return the conditional probability density of the input $\boldsymbol{x}$ and target $y$ under the respective generative model, given a sample $\boldsymbol{z}$ from the approximate posterior $q_{\theta}(\boldsymbol{z}|\boldsymbol{x})$. This yields a generative model $p(\boldsymbol{x}, y, \boldsymbol{z})$, for which we assume a factorization and generative process of the form $p(\boldsymbol{x}, y, \boldsymbol{z}) = p(\boldsymbol{x}|\boldsymbol{z}) \, p(y|\boldsymbol{z}) \, p(\boldsymbol{z})$. For variational inference with this model, the sum over all elements in the dataset $n \in D$ of the following loss thus needs to be optimized:

$$\mathcal{L}\left(\boldsymbol{x}^{(n)}, y^{(n)}; \theta, \phi, \xi\right) = \mathbb{E}_{q_{\theta}(\boldsymbol{z}|\boldsymbol{x}^{(n)})}\left[\log p_{\phi}(\boldsymbol{x}^{(n)}|\boldsymbol{z}) + \log p_{\xi}(y^{(n)}|\boldsymbol{z})\right] - \beta \, \mathrm{KL}\left(q_{\theta}(\boldsymbol{z}|\boldsymbol{x}^{(n)}) \,||\, p(\boldsymbol{z})\right) \quad (1)$$

This model can be seen as a variant of $\beta$-VAE (Higgins et al., 2017), where in addition to approximating the data distribution the model learns to incorporate the class structure into the latent space. It forms the basis for continual learning with open set recognition and respective improvements to generative replay, which will be discussed in subsequent sections. An illustration of the model is shown in figure 1.

Without further constraints, one could continually train the above model by sequentially accumulating and optimizing equation 1 over all currently present tasks $t = 1, \ldots, T$:

$$\mathcal{L}^{UB}_{t}(\boldsymbol{x}, \boldsymbol{y}; \theta, \phi, \xi) = \sum_{\tau=1}^{t} \frac{1}{N_{\tau}} \sum_{n=1}^{N_{\tau}} \mathcal{L}\left(\boldsymbol{x}^{(n)}_{\tau}, y^{(n)}_{\tau}; \theta, \phi, \xi\right) \quad (2)$$

Being based on the accumulation of real data, this equation provides an upper-bound to achievable performance in continual learning. However, this form of continued training is generally infeasible if only the most recent task's data is assumed to be available. Making use of the generative nature of our model, we follow previous works (Farquhar & Gal, 2018; Achille et al., 2018) and estimate the likelihood of former data through generative replay:

$$\mathcal{L}_{t}(\boldsymbol{x}, \boldsymbol{y}; \theta, \phi, \xi) = \frac{1}{N_t} \sum_{n=1}^{N_t} \mathcal{L}\left(\boldsymbol{x}^{(n)}_{t}, y^{(n)}_{t}; \theta, \phi, \xi\right) + \frac{1}{N'_t} \sum_{n=1}^{N'_t} \mathcal{L}\left(\boldsymbol{x}'^{(n)}_{t}, y'^{(n)}_{t}; \theta, \phi, \xi\right) \quad (3)$$

where

$$\boldsymbol{x}'_t \sim p_{\phi, t-1}(\boldsymbol{x}|\boldsymbol{z}); \quad y'_t \sim p_{\xi, t-1}(y|\boldsymbol{z}) \quad \text{and} \quad \boldsymbol{z} \sim p(\boldsymbol{z}) \quad (4)$$

Figure 2. 2-D latent space visualization for continually learned MNIST after inclusion of six classes (a) and at the end of training with all ten classes (b).
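The per-example loss of equation 1 translates directly into code. The sketch below assumes a model exposing encode(x) returning (mu, log_var), decode(z) and classify(z) methods, with a Bernoulli decoder likelihood; these names are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def elbo_loss(model, x, y, beta):
    """Negative of equation 1, averaged over the mini-batch."""
    mu, log_var = model.encode(x)                            # q_theta(z|x)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterization
    recon = F.binary_cross_entropy_with_logits(
        model.decode(z), x, reduction="sum")                 # -log p_phi(x|z)
    class_nll = F.cross_entropy(
        model.classify(z), y, reduction="sum")               # -log p_xi(y|z)
    # closed-form KL(q_theta(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return (recon + class_nll + beta * kl) / x.size(0)
```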
In equation 4, $\boldsymbol{x}'_t$ is a sample from the generative model with its corresponding label $y'_t$ obtained from the classifier. $N'_t$ is the number of total data instances of all previously seen tasks or, alternatively, a hyper-parameter. This way the expectation of the log-likelihood for all previously seen tasks is estimated, and the dataset at any point in time $\tilde{D}_t \equiv \{ (\tilde{\boldsymbol{x}}^{(n)}_t, \tilde{y}^{(n)}_t) \}_{n=1}^{\tilde{N}_t} = \{ (\boldsymbol{x}_t \cup \boldsymbol{x}'_t, \boldsymbol{y}_t \cup \boldsymbol{y}'_t) \}$ is a combination of generations from seen past data distributions and the current task's real data.

In contrast to prior works based on multiple models, our approach of using equation 3 to continually train a single model has two implications. With every encounter of an additional class: 1. a new classifier unit and corresponding weights need to be added; 2. the latent encoding needs to adjust to accommodate the additional class under the constraint of the classifier requirement of linear separability.

The first implication can be addressed by expanding the existing classifier weight tensor and only initializing the newly added weights. If the distribution from which the newly added weights are drawn is independent of the number of classes and only depends on the input dimensionality, such as the initialization scheme proposed by He et al. (2015), the initialization scheme remains constant throughout training. While the addition itself will temporarily confuse existing units, this should make sure that newly added parameters are on the same scale as existing weights and are thus trained in practice. Note that in principle, during the optimization of a task the weight distribution could shift significantly from its initial state. However, we do not encounter this potential issue in empirical experiments. Nevertheless, we point out that this currently under-explored topic requires separate future investigation in the context of model expansion.

For the second implication, the $\beta$ term of equation 1 is crucial. Here, the role of $\beta$ is to control the capacity of the information bottleneck and regulate the effective latent encoding overlap (Burgess et al., 2017), which can best be summarized with a direct quote from the recent work of Mathieu et al. (2019): "The overlap factor is perhaps best understood by considering extremes: too little, and the latents effectively become a lookup table; too much, and the data and latents do not convey information about each other. In either case, meaningfulness of the latent encodings is lost." (p. 4). This can be seen as under- or over-regularization by the prior of what is typically referred to as the aggregate posterior (Hoffman & Johnson, 2016):

$$q_{\theta, t}(\boldsymbol{z}) = \mathbb{E}_{p_{\tilde{D}_t}(\tilde{\boldsymbol{x}})}\left[q_{\theta, t}(\boldsymbol{z}|\tilde{\boldsymbol{x}})\right] \approx \frac{1}{\tilde{N}_t} \sum_{n=1}^{\tilde{N}_t} q_{\theta, t}(\boldsymbol{z}|\tilde{\boldsymbol{x}}^{(n)}) \quad (5)$$

As an extension of this argument to our model, the necessity of linear class separation given $\boldsymbol{z}$ requires a suitable level of encoding overlap. This forms the basis for our open set recognition and respective improved generative replay for continual learning, which will be discussed in the following paragraphs. Example two-dimensional latent encodings for a continually trained MNIST (LeCun et al., 1998) model with appropriate $\beta$ are shown in figure 2. Here, we can see that the classes are cleanly separated in latent space, as enforced by the linear classification objective, and new classes can be accommodated continually. Further discussion on the choice of $\beta$ can be found in the supplementary material.
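A minimal sketch of the single-head expansion step described above; the routine and its use of Kaiming initialization follow the fan-in argument, but the exact expansion code of the released implementation may differ.

```python
import torch
import torch.nn as nn

def expand_classifier(old_head: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Grow the linear classifier by num_new_classes output units."""
    new_head = nn.Linear(old_head.in_features,
                         old_head.out_features + num_new_classes)
    nn.init.kaiming_normal_(new_head.weight)  # scale depends on fan-in only
    nn.init.zeros_(new_head.bias)
    with torch.no_grad():                     # copy trained units unchanged
        new_head.weight[:old_head.out_features] = old_head.weight
        new_head.bias[:old_head.out_features] = old_head.bias
    return new_head
```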
Trained naively in the above fashion, our model would suffer from accumulated errors with each successive iteration of generative replay, similar to current literature approaches. The main challenge is that high density areas under the prior $p(\boldsymbol{z})$ are not necessarily reflected in the structure of the aggregate posterior $q_{\theta, t}(\boldsymbol{z})$ (Tomczak & Welling, 2018). Thus, generated data from low density regions of the latter does not generally correspond to encountered data instances. Vice-versa, data instances that fall into high density regions under the prior should not generally be considered as statistical inliers with respect to the observed data distribution. Ideally, this challenge would be solved by modifying equations 1 and 2, replacing the Gaussian prior in the KL-divergence with $q_{\theta, t}(\boldsymbol{z})$ and respectively sampling $\boldsymbol{z} \sim q_{\theta, t-1}(\boldsymbol{z})$ for generative replay in equations 3 and 4. Even though using the aggregate posterior as the prior is the objective in multiple recent works, it can be challenging in high dimensions, lead to over-fitting and often comes at the expense of additional hyper-parameters (Tomczak & Welling, 2018; Bauer & Mnih, 2019; Takahashi et al., 2019). To avoid finding an explicit representation for the multi-modal $q_{\theta, t}(\boldsymbol{z})$, we leverage our model's class disentanglement and draw inspiration from the EVT based OpenMax approach (Bendale & Boult, 2016). However, instead of using knowledge about extreme distance values in penultimate layer activations to modify a Softmax prediction's confidence, we propose to apply EVT on the basis of the class conditional aggregate posterior. In this view, any sample can be regarded as statistically outlying if its distance to the classes' latent means is extreme with respect to what has been observed for the majority of correctly predicted data instances, i.e. the sample falls into a region of low density under the aggregate posterior and is less likely to belong to $p_{\tilde{D}}(\tilde{\boldsymbol{x}})$.

For convenience, let us introduce the indices of all correctly classified data instances at the end of task $t$ as $m = 1, \ldots, \tilde{M}_t$. To construct a statistical meta-recognition model, we first obtain each class' mean latent vector for all correctly predicted seen data instances:

$$\bar{\boldsymbol{z}}_{c,t} = \frac{1}{|\tilde{M}_{c,t}|} \sum_{m \in \tilde{M}_{c,t}} \mathbb{E}_{q_{\theta,t}(\boldsymbol{z}|\tilde{\boldsymbol{x}}^{(m)}_{t})}\left[\boldsymbol{z}\right] \quad (6)$$

and define the respective set of latent distances as:

$$\Delta_{c,t} \equiv \left\{ f_d\left(\bar{\boldsymbol{z}}_{c,t}, \mathbb{E}_{q_{\theta,t}(\boldsymbol{z}|\tilde{\boldsymbol{x}}^{(m)}_{t})}\left[\boldsymbol{z}\right]\right) \right\}_{m \in \tilde{M}_{c,t}} \quad (7)$$

Here, $f_d$ signifies a choice of distance metric. We proceed to fit a per class heavy-tail Weibull distribution $\rho_{c,t} = (\tau_{c,t}, \kappa_{c,t}, \lambda_{c,t})$ on $\Delta_{c,t}$ for a given tail-size $\eta$. As the distances are based on the class conditional approximate posterior, we can thus bound the latent space regions of high density. The tightness of the bounds is characterized through $\eta$, which can be seen as a prior belief with respect to the outlier quantity assumed to be inherently present in the data distribution. The choice of $f_d$ determines the nature and dimensionality of the obtained distance distribution.
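A sketch of this meta-recognition fit, under the assumptions that cosine distance is used as $f_d$ (the choice adopted below) and that scipy's weibull_min can stand in for a dedicated EVT fitter such as libMR:

```python
import numpy as np
from scipy.stats import weibull_min

def fit_weibull_per_class(latents, labels, tail_fraction=0.05):
    """Equations 6 and 7: latents are the posterior means of correctly
    classified examples, shape [N, D]; labels are their classes."""
    models = {}
    for c in np.unique(labels):
        z_c = latents[labels == c]
        mean_c = z_c.mean(axis=0)                               # equation 6
        dists = 1.0 - (z_c @ mean_c) / (np.linalg.norm(z_c, axis=1)
                                        * np.linalg.norm(mean_c) + 1e-12)
        tail = np.sort(dists)[-max(1, int(tail_fraction * len(dists))):]
        kappa, tau, lam = weibull_min.fit(tail)  # fit on the distance tail
        models[int(c)] = (mean_c, kappa, tau, lam)
    return models
```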
For our experiments, we find that the cosine distance, and thus a univariate Weibull distance distribution per class, seems to be sufficient.

Using the cumulative distribution function of this Weibull model $\rho_t$ we can estimate any sample's outlier probability:

$$\omega_{\rho_t}(\boldsymbol{z}) = \min_{c} \left( 1 - \exp\left( -\left( \frac{\left| f_d(\bar{\boldsymbol{z}}_{c,t}, \boldsymbol{z}) - \tau_{c,t} \right|}{\lambda_{c,t}} \right)^{\kappa_{c,t}} \right) \right) \quad (8)$$

where the minimum returns the smallest outlier probability across all classes. If this outlier probability is larger than a prior rejection probability $\Omega_t$, the instance can be considered as unknown, as it is far away from all known classes. For a novel data instance, the outlier probability can be based on computation of the probabilistic encoder $\boldsymbol{z} \sim q_{\theta,t}(\boldsymbol{z}|\boldsymbol{x})$ and a false overconfident classifier prediction avoided. Analogously, for the generative model, equation 8 can be used with $\boldsymbol{z} \sim p(\boldsymbol{z})$ and the probabilistic decoder only calculated for samples that are considered to be statistically inlying. This way, we can constrain the naive generative replay of equation 4 to the aggregate posterior, while avoiding the need to sample $\boldsymbol{z} \sim q_{\theta,t}(\boldsymbol{z})$ directly. Although this may sound detrimental to our method, it comes with the advantage of scalability to high dimensions. We further argue that the computational overhead for generative replay, both from sampling from the prior $\boldsymbol{z} \sim p(\boldsymbol{z})$ in large parallelized batches and computation of equation 8, is negligible in contrast to the much more computationally heavy deep probabilistic decoder or even the linear classifier, as the latter only need to be calculated for accepted samples. To give a visual illustration, we show examples of generated MNIST images together with their outlier percentage in figure 3.

Figure 3. Generated MNIST images $\boldsymbol{x} \sim p_{\phi,t}(\boldsymbol{x}|\boldsymbol{z})$ with $\boldsymbol{z} \sim p(\boldsymbol{z})$ and their corresponding class $c$ obtained from the classifier $p_{\xi,t}(y|\boldsymbol{z})$, for $c = 0$ (top row), $c = 5$ (middle row) and $c = 9$ (bottom row), together with their open set outlier percentage.
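A sketch of equation 8 and of the resulting rejection loop for generative replay, reusing the per-class Weibull models fitted above; the sample-by-sample loop is a simplification of the batched variant described in the text.

```python
import numpy as np
from scipy.stats import weibull_min

def outlier_probability(z, models):
    """Equation 8: the smallest Weibull CDF value across all classes."""
    probs = []
    for mean_c, kappa, tau, lam in models.values():
        dist = 1.0 - float(z @ mean_c) / (np.linalg.norm(z)
                                          * np.linalg.norm(mean_c) + 1e-12)
        probs.append(weibull_min.cdf(dist, kappa, loc=tau, scale=lam))
    return min(probs)

def sample_replay_latents(models, latent_dim, n_wanted, omega):
    """Draw z ~ p(z) and keep only statistical inliers; only the accepted
    samples are later passed through the decoder and classifier."""
    accepted = []
    while len(accepted) < n_wanted:
        z = np.random.randn(latent_dim)          # z ~ N(0, I)
        if outlier_probability(z, models) <= omega:
            accepted.append(z)
    return np.stack(accepted)
```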
4. Experiments
Similar to recent literature (Zenke et al., 2017; Kirkpatrick et al., 2017; Farquhar & Gal, 2018; Shin et al., 2017; Parisi et al., 2019), we consider the incremental MNIST (LeCun et al., 1998) dataset, where classes arrive in groups of two, and corresponding versions of the FashionMNIST (Xiao et al., 2017) and AudioMNIST (Becker et al., 2018) datasets. For the latter we follow the authors' procedure of converting the audio recordings into spectrograms. In addition to this class incremental setting, we also evaluate cross-dataset scenarios, where datasets are sequentially added with all of their classes and the model has to learn across modalities.

For a common frame of reference, we base both encoder and decoder architectures on 14-layer wide residual networks with a latent dimensionality of 60 (He et al., 2016; Zagoruyko & Komodakis, 2016; Gulrajani et al., 2017; Chen et al., 2017). For the generative replay with statistical outlier rejection, we use an aggressive rejection rate $\Omega_t$ and dynamically set tail-sizes to 5% of seen examples per class. To avoid over-fitting, we add zero-mean Gaussian noise to each input. This is preferable to weight regularization, as it doesn't entail unrecoverable units that are needed to encode later tasks. We thus refer to our proposed model as Open-set Classifying Denoising Variational Auto-Encoder (OCDVAE), for which we have found a value of $\beta = 0.1$ to consistently work well; see the discussion in the appendix. We empirically compare the following methods:

Dual Model: separate generative and discriminative variational models in analogy to Shin et al. (2017).
EWC: elastic weight consolidation (Kirkpatrick et al., 2017) for a purely discriminative model.
OCDVAE (ours): our proposed joint model with posterior based open set recognition and resulting statistical outlier rejection in generative replay.
CDVAE: the naive approach of generating from the prior distribution in our joint model. We include these results to highlight the effect of aggregate posterior to prior mismatch.
ISO: isolated learning, where all data is always present.

UB: upper-bound on achievable model performance by sequentially accumulating all data, given by equation 2.

LB: lower-bound on model performance when only the current task's data is available. No additional mechanism is in place and full catastrophic forgetting occurs.

Our evaluation metrics are inspired by previously proposed continual learning measures (Lopez-Paz & Ranzato, 2017; Kemker et al., 2018). In addition to overall accuracy $\alpha_{t,all}$, these metrics monitor forgetting by computing a base accuracy $\alpha_{t,base}$ for the initial task at increment $t$, while also gauging the amount of new knowledge that can be encoded by monitoring the accuracy for the most recent increment $\alpha_{t,new}$. We evaluate the quality of the generative models through classification accuracy, as it depends on generated replay, and a direct evaluation of pixel-wise reconstruction losses is not necessarily coupled to classification accuracy or retention thereof. However, we provide a detailed analysis of reconstruction losses for all tasks, as well as KL divergences for all experiments, in the supplementary material.

To provide a fair comparison of achievable accuracy, all above approaches are trained to converge on each task using the Adam optimizer (Kingma & Ba, 2015). We repeat all experiments 5 times to gauge statistical consistency. The full hyper-parameter specification can be found in the supplementary material. All models were trained on a single GTX 1080 GPU and we make our code publicly available at https://github.com/MrtnMndt/OCDVAEContinualLearning
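A sketch of how these metrics can be computed after each increment; evaluate is a hypothetical helper returning accuracy on a test loader, and the unweighted average for the overall accuracy assumes tasks of roughly equal size.

```python
def continual_metrics(model, test_loaders, t, evaluate):
    """Base, new and overall accuracy after training increment t."""
    alpha_base = evaluate(model, test_loaders[0])     # forgetting of the initial task
    alpha_new = evaluate(model, test_loaders[t - 1])  # newly encoded knowledge
    seen = [evaluate(model, loader) for loader in test_loaders[:t]]
    alpha_all = sum(seen) / len(seen)                 # overall accuracy
    return alpha_base, alpha_new, alpha_all
```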
Table 1. Results for continual learning across datasets averaged over 5 runs, baselines and the reference isolated learning scenario for FashionMNIST (F) → MNIST (M) → AudioMNIST (A) and the reverse order. $\alpha_T$ indicates the respective accuracy at the end of the last increment $T = 3$. Rows per ordering (F-M-A and A-M-F): CDVAE ISO, CDVAE UB, CDVAE LB, EWC, Dual Model, CDVAE, OCDVAE; columns: base, new, all.
Achieved accuracies for continual learning across datasets are summarized in table 1. In general the upper-bound values are almost identical to isolated learning. Similarly, the new task's metrics are negligibly close, as the WRN architecture ensures enough capacity to encode new knowledge. In contrast to EWC, which is universally unable to maintain knowledge, as also previously observed by Kemker et al. (2018) and Parisi et al. (2019), approaches based on generative replay are able to partially retain information. Yet they accumulate errors due to samples generated from low density regions. This is noticeable for both the dual model approach with a separate VAE and discriminative model, and more heavily so for the naive CDVAE, where the structure of $q_{\theta,t}(\boldsymbol{z})$ is further affected by the discriminator. However, our proposed OCDVAE model overcomes this issue to a considerable degree, rivalling and improving upon the performance of training separate models.

Apart from these classification accuracies, we also quantitatively analyze the models' ability to distinguish unknown tasks' data from data belonging to known tasks. Here, the challenge is to consider all unseen test data of already trained tasks as inlying, while successfully identifying 100% of unknown datasets as outliers. For this purpose, we evaluate models after training on one dataset on its respective test set, the remaining tasks' datasets and additionally the KMNIST (Clanuwat et al., 2018), SVHN (Netzer et al., 2011) and CIFAR (Krizhevsky, 2009) datasets.

We compare and contrast three criteria that could be used for open set recognition: classifier predictive entropy, reconstruction loss and our proposed latent based EVT approach. Naively one might expect the Bayesian approach to handle unknown data through uncertainty. We thus approximate the expectation with 100 variational samples from the approximate posterior per data point. Figure 4 shows the three criteria and the respective percentage of the total dataset being considered as outlying for the OCDVAE model trained on FashionMNIST. Consistent with Nalisnick et al. (2019), we can observe that use of the reconstruction loss can only sometimes distinguish between the known tasks' test data and unknown datasets. In the case of classifier predictive entropy, depending on the exact choice of entropy threshold, generally only a partial separation can be achieved. Furthermore, both of these criteria pose the additional challenge of results being highly dependent on the choice of the precise cut-off value. In contrast, the test data from the known tasks is regarded as inlying across a wide range of rejection priors $\Omega_t$, and the majority of other datasets is consistently regarded as outlying by our proposed open set mechanism.
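For reference, the entropy criterion compared here can be sketched as follows, averaging the predictive distribution over samples from the approximate posterior; the model interface is the same illustrative one as in the earlier loss sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predictive_entropy(model, x, num_samples=100):
    """Entropy of the mean predictive distribution over z ~ q_theta(z|x)."""
    mu, log_var = model.encode(x)
    probs = 0.0
    for _ in range(num_samples):
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        probs = probs + F.softmax(model.classify(z), dim=1)
    probs = probs / num_samples
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
```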
Figure 4.
Trained FashionMNIST OCDVAE evaluated on unknown datasets. All metrics are averaged over 100 approximate posterior samples per data point. (Left) Classifier entropy values are insufficient to separate most of the unknown data from the known task's test data. (Center) Reconstruction loss allows for a partial distinction. (Right) Our posterior based open set recognition considers the large majority of unknown data as statistical outliers across a wide range of rejection priors $\Omega_t$.
Table 2. Test accuracies and outlier detection values of the joint OCDVAE and dual model (VAE and separate deep classifier) approaches when considering 95% of the known tasks' validation data as inlying. The percentage of detected outliers is reported based on classifier predictive entropy, reconstruction loss and our posterior based EVT approach, averaged over 100 $\boldsymbol{z} \sim q_{\theta}(\boldsymbol{z}|\boldsymbol{x})$ samples per data point respectively. Note that larger values are better, except for the test data of the trained dataset, where ideally 0% should be considered as outlying.
Outlier detection at 95% validation inliers (%)

Trained on FashionMNIST
Model (test acc.)        Criterion        MNIST   Fashion  Audio   KMNIST  CIFAR10  CIFAR100  SVHN
Dual, CL + VAE (90.48)   Class entropy    74.71   5.461    69.65   77.85   24.91    28.76     36.64
                         Reconstruction    5.535  5.340    64.10   31.33   99.50    98.41     97.24
                         Latent EVT       96.22   5.138    93.00   91.51   71.82    72.08     73.85
Joint, OCDVAE (90.92)    Class entropy    66.91   5.145    61.86   56.14   43.98    46.59     37.85
                         Reconstruction    0.601  5.483    63.00   28.69   99.67    98.91     98.56
                         Latent EVT       96.23   5.216    94.76   96.07   96.15    95.94     96.84

Trained on MNIST
Dual, CL + VAE (99.40)   Class entropy     4.160  90.43    97.53   95.29   98.54    98.63     95.51
                         Reconstruction    5.522  99.98    99.97   99.98   99.99    99.96     99.98
                         Latent EVT        4.362  99.41    99.80   99.86   99.95    99.97     99.52
Joint, OCDVAE (99.53)    Class entropy     3.948  95.15    98.55   95.49   99.47    99.34     97.98
                         Reconstruction    5.083  99.50    99.98   99.91   99.97    99.99     99.98
                         Latent EVT        4.361  99.78    99.67   99.73   99.96    99.93     99.70

Trained on AudioMNIST
Dual, CL + VAE (98.53)   Class entropy    97.63   57.64    5.066   95.53   66.49    65.25     54.91
                         Reconstruction    6.235  46.32    4.433   98.73   98.63    98.63     97.45
                         Latent EVT       99.82   78.74    5.038   99.47   93.44    92.76     88.73
Joint, OCDVAE (98.57)    Class entropy    99.23   89.33    5.731   99.15   92.31    91.06     85.77
                         Reconstruction    0.614  38.50    3.966   36.05   98.62    98.54     96.99
                         Latent EVT       99.91   99.53    5.089   99.81   100.0    99.99     99.98

We provide quantitative outlier detection accuracies in table 2. Here, a validation split is used to determine the respective value at which 95% of the validation data is considered as inlying, before using these priors to determine outlier counts for the known tasks' test set as well as other datasets. We provide this evaluation for both our joint model, as well as separate discriminative and generative models. While MNIST seems to be an easy to identify dataset for all approaches, we can make two major observations:

1. The latent based EVT approach generally outperforms the other criteria, particularly for the OCDVAE, where a near perfect open set detection can be achieved.

2. Even though we can apply EVT to a purely discriminative model, the joint OCDVAE model consistently exhibits more accurate outlier detection. We hypothesize that this is due to the joint model also optimizing a variational lower bound to the data distribution $p(\boldsymbol{x})$.

We provide figures similar to figure 4 for all models reported in table 2 in the supplementary material.
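The thresholding protocol underlying table 2 can be sketched in a few lines; scores are per-example criterion values (entropy, reconstruction loss, or the equation 8 outlier probability), and the 95th percentile choice mirrors the 95% validation-inlier convention above.

```python
import numpy as np

def outlier_rate_at_95(validation_scores, dataset_scores):
    """Pick the cut-off keeping 95% of validation data inlying, then
    report the fraction of another dataset flagged as outlying."""
    threshold = np.percentile(validation_scores, 95)
    return float(np.mean(dataset_scores > threshold))
```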
Table 3. Results for class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for the three datasets. $\alpha_T$ indicates the respective accuracy at the end of the last increment $T = 5$. Rows per dataset (FashionMNIST, MNIST, AudioMNIST): CDVAE ISO, CDVAE UB, CDVAE LB, EWC, Dual Model, CDVAE, OCDVAE; columns: base, new, all.

We show results in analogy to table 1 for the class incremental scenario in table 3. With the exception of MNIST, where the dual model approach fares well, a similar pattern as before can be observed in this more challenging scenario, and our proposed OCDVAE approach significantly outperforms all other methods. Interestingly, as a result of using a single model across tasks, we observe backward transfer in some experiments. This is particularly apparent for AudioMNIST, where addition of the second increment first decays and inclusion of later tasks improves the second task's accuracy. We provide a detailed account of all intermediate results and examples of generated images for all increments $t = 1, \ldots, T$ in the supplementary material.
Table 4. Results for PixelVAE based continual learning approaches averaged over 5 runs, in analogy to tables 1 and 3. Rows per dataset for the class-incremental setting ($T = 5$; FashionMNIST, MNIST, AudioMNIST) and per ordering for the cross-dataset setting ($T = 3$; F-M-A and A-M-F): Dual Pix Model, PixCDVAE, PixOCDVAE; columns: base, new, all.

In a final empirical evaluation, we investigate the choice of generative model and optionally enhance the probabilistic decoder with an autoregressive variant, where generation of a pixel's value is spatially conditioned on previous pixels (van den Oord et al., 2016; Gulrajani et al., 2017; Chen et al., 2017). We show results corresponding to tables 1 and 3 for pixel models in table 4. While we can observe that this generally further alleviates catastrophic forgetting induced by multiple successive iterations of generative replay, it does so significantly more for our proposed approach.
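The spatial conditioning of such autoregressive decoders is commonly realized with masked convolutions, as in the simplified PixelCNN-style layer below; this is a generic sketch of the technique, not the exact decoder used in the experiments.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution that only sees pixels above and to the left, so each
    output pixel is conditioned on previously generated ones."""
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        k_h, k_w = self.kernel_size
        mask = torch.zeros(1, 1, k_h, k_w)
        mask[:, :, :k_h // 2, :] = 1          # all rows strictly above
        mask[:, :, k_h // 2, :k_w // 2] = 1   # left of center in the same row
        if mask_type == "B":                  # later layers may use the center
            mask[:, :, k_h // 2, k_w // 2] = 1
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data = self.weight.data * self.mask
        return super().forward(x)
```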
5. Summary and outlook
We have proposed a probabilistic approach to unify continual deep learning with open set recognition based on Bayesian inference. Using a single model that combines a shared probabilistic encoder with a generative model and an expanding linear classifier, we have introduced EVT based bounds to the approximate posterior. The derived open set recognition and corresponding generative replay with statistical outlier rejection have been shown to achieve compelling results in both task incremental as well as cross-dataset continual learning across image and audio modalities, while being able to distinguish seen from unseen data. Our approach readily benefits from recent advances such as autoregressive models (Gulrajani et al., 2017; Chen et al., 2017) and we therefore expect future application to extend to more complicated data such as larger scale color images.
References
Achille, A., Eccles, T., Matthey, L., Burgess, C. P., Watters, N., Lerchner, A., and Higgins, I. Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies. Neural Information Processing Systems (NeurIPS), 2018.
Bachem, O., Lucic, M., and Krause, A. Coresets for Nonparametric Estimation - the Case of DP-Means. International Conference on Machine Learning (ICML), 37:209–217, 2015.
Bauer, M. and Mnih, A. Resampled Priors for Variational Autoencoders. International Conference on Artificial Intelligence and Statistics (AISTATS), 89, 2019.
Becker, S., Ackermann, M., Lapuschkin, S., Müller, K.-R., and Samek, W. Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals. arXiv preprint arXiv:1807.03418, 2018.
Bendale, A. and Boult, T. E. Towards Open Set Deep Networks. Computer Vision and Pattern Recognition (CVPR), 2016.
Boult, T. E., Cruz, S., Dhamija, A., Gunther, M., Henrydoss, J., and Scheirer, W. Learning and the Unknown: Surveying Steps Toward Open World Recognition. AAAI Conference on Artificial Intelligence (AAAI), 2019.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-VAE. Neural Information Processing Systems (NeurIPS), Workshop on Learning Disentangled Representations, 2017.
Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational Lossy Autoencoder. International Conference on Learning Representations (ICLR), 2017.
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep Learning for Classical Japanese Literature. Neural Information Processing Systems (NeurIPS), Workshop on Machine Learning for Creativity and Design, 2018.
Dhamija, A. R., Günther, M., and Boult, T. E. Reducing Network Agnostophobia. Neural Information Processing Systems (NeurIPS), 2018.
Farquhar, S. and Gal, Y. A Unifying Bayesian View of Continual Learning. Neural Information Processing Systems (NeurIPS), Bayesian Deep Learning Workshop, 2018.
Gal, Y. and Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. International Conference on Machine Learning (ICML), 48, 2015.
Gepperth, A. and Karaoguz, C. A Bio-Inspired Incremental Learning Architecture for Applied Perceptual Problems. Cognitive Computation, 8(5):924–934, 2016.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. Neural Information Processing Systems (NeurIPS), 2014.
Gulrajani, I., Kumar, K., Faruk, A., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. PixelVAE: a Latent Variable Model for Natural Images. International Conference on Learning Representations (ICLR), 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. International Conference on Computer Vision (ICCV), 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. Computer Vision and Pattern Recognition (CVPR), 2016.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations (ICLR), 2017.
Hinton, G. E., Vinyals, O., and Dean, J. Distilling the Knowledge in a Neural Network. NeurIPS Deep Learning Workshop, 2014.
Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
Hoffman, M. D. and Johnson, M. J. ELBO surgery: yet another way to carve up the variational evidence lower bound. Neural Information Processing Systems (NeurIPS), Advances in Approximate Bayesian Inference Workshop, 2016.
Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning (ICML), 2015.
Kemker, R., McClure, M., Abitino, A., Hayes, T., and Kanan, C. Measuring Catastrophic Forgetting in Neural Networks. AAAI Conference on Artificial Intelligence (AAAI), 2018.
Kingma, D. P. and Ba, J. L. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations (ICLR), 2015.
Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2013.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS), 114(13):3521–3526, 2017.
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Technical report, Toronto, 2009.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
Lee, K., Lee, H., Lee, K., and Shin, J. Training Confidence-Calibrated Classifiers for Detecting Out-of-Distribution Samples. International Conference on Learning Representations (ICLR), 2018.
Li, Z. and Hoiem, D. Learning without forgetting. European Conference on Computer Vision (ECCV), 2016.
Liang, S., Li, Y., and Srikant, R. Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks. International Conference on Learning Representations (ICLR), 2018.
Lopez-Paz, D. and Ranzato, M. A. Gradient Episodic Memory for Continual Learning. Neural Information Processing Systems (NeurIPS), 2017.
Matan, O., Kiang, R., Stenard, C. E., and Boser, B. E. Handwritten Character Recognition Using Neural Network Architectures. 2(5):1003–1011, 1990.
Mathieu, E., Rainforth, T., Siddharth, N., and Teh, Y. W. Disentangling disentanglement in variational autoencoders. International Conference on Machine Learning (ICML), pp. 7744–7754, 2019.
McCloskey, M. and Cohen, N. J. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation - Advances in Research and Theory, 24(C):109–165, 1989.
Mensink, T., Verbeek, J., Perronnin, F., and Csurka, G. Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost. European Conference on Computer Vision (ECCV), 2012.
Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do Deep Generative Models Know What They Don't Know? International Conference on Learning Representations (ICLR), 2019.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading Digits in Natural Images with Unsupervised Feature Learning. Neural Information Processing Systems (NeurIPS), Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational Continual Learning. International Conference on Learning Representations (ICLR), 2018.
O'Reilly, R. C. and Norman, K. A. Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences, 6(12):505–510, 2003.
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual Lifelong Learning with Neural Networks: A Review. Neural Networks, 113:54–71, 2019.
Rebuffi, S. A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. Computer Vision and Pattern Recognition (CVPR), 2017.
Robins, A. Catastrophic Forgetting, Rehearsal and Pseudorehearsal. Connection Science, 7(2):123–146, 1995.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive Neural Networks. arXiv preprint arXiv:1606.04671, 2016.
Scheirer, W. J., Jain, L. P., and Boult, T. E. Probability Models for Open Set Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
Shin, H., Lee, J. K., and Kim, J. J. Continual Learning with Deep Generative Replay. Neural Information Processing Systems (NeurIPS), 2017.
Takahashi, H., Iwata, T., Yamanaka, Y., Yamada, M., and Yagi, S. Variational Autoencoder with Implicit Optimal Priors. Proceedings of the AAAI Conference on Artificial Intelligence, 33:5066–5073, 2019.
Tomczak, J. M. and Welling, M. VAE with a VampPrior. International Conference on Artificial Intelligence and Statistics (AISTATS), 84, 2018.
van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. International Conference on Machine Learning (ICML), 48:1747–1756, 2016.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong Learning with Dynamically Expandable Networks. International Conference on Learning Representations (ICLR), 2018.
Zagoruyko, S. and Komodakis, N. Wide Residual Networks. British Machine Vision Conference (BMVC), 2016.
Zenke, F., Poole, B., and Ganguli, S. Continual Learning Through Synaptic Intelligence. International Conference on Machine Learning (ICML), 70:3987–3995, 2017.
Supplementary material
The supplementary material provides further details for the material presented in the main body. Specifically, the structure is as follows:

A. Full specification of training procedure and hyper-parameters, including exact architecture definitions.

B. Extended discussion, qualitative and quantitative examples for the role of β. Further visualization of continually learned 2-D latent encodings as an extension to figure 2 in the main body.

C. Additional visualization of open set detection for all quantitatively evaluated models considered in table 2 of the main body.

D. Full continual learning results for all task increments, including all reconstruction losses and KL divergences.

E. Visualization of generative replay examples for all models and task increments.
A. Training hyper-parameters and architecture definitions
We provide a full specification of hyper-parameters, model architectures and the training procedure in this section. We base our encoder and decoder architecture on 14-layer wide residual networks (He et al., 2016; Zagoruyko & Komodakis, 2016), as used in lossy auto-encoders (Gulrajani et al., 2017; Chen et al., 2017), with a latent dimensionality of 60 to demonstrate scalability to high dimensions and deep networks. These architectures are shown in detail in tables 5 and 6. Hidden layers include batch-normalization (Ioffe & Szegedy, 2015) and use ReLU activations. For a common frame of reference, all methods share the same underlying WRN architecture. For the autoregressive addition to our joint model, we set the number of output channels of the decoder to 60 and append 3 additional pixel decoder layers with 60 channels each. While we will report reconstruction log-likelihoods in nats, these models are practically formulated as a classification problem with a 256-way Softmax, whose corresponding loss is in bits per dimension. We have converted these values to allow a better comparison, but in order to do so we need to sample from the pixel decoder's multinomial distribution to calculate a binary cross-entropy on reconstructed images. We further note that all losses are normalized with respect to dimensions.

We use hyper-parameters consistent with the literature (Gulrajani et al., 2017; Chen et al., 2017). Accordingly, all models are optimized using mini-batch stochastic gradient descent with Adam (Kingma & Ba, 2015). As detailed in the main body, we add zero-mean Gaussian noise to the input to avoid over-fitting. Due to the inevitable data augmentation effect, we train all approaches in this denoising fashion. No further data augmentation or preprocessing is applied. We initialize all weights according to He et al. (2015).

Table 5. WRN encoder architecture, without the final layers for µ and σ that depend on the chosen latent space dimensionality and the data's spatial size.

Layer type: WRN encoder
Layer 1: conv × - 48, p = 1
Block 1: conv × - 160, p = 1; conv × - 160 (skip next layer)
         conv × - 160, p = 1
         conv × - 160, p = 1; shortcut (skip next layer)
         conv × - 160, p = 1
Block 2: conv × - 320, s = 2, p = 1; conv × - 320, s = 2 (skip next layer)
         conv × - 320, p = 1
         conv × - 320, p = 1; shortcut (skip next layer)
         conv × - 320, p = 1
Block 3: conv × - 640, s = 2, p = 1; conv × - 640, s = 2 (skip next layer)
         conv × - 640, p = 1
         conv × - 640, p = 1; shortcut (skip next layer)
         conv × - 640, p = 1

All class incremental models are trained for 120 epochs per task on MNIST and FashionMNIST and 150 epochs on AudioMNIST. Complementary incremental cross-dataset models are trained for 200 epochs per task on spatially resized data. While our proposed model exhibits forward transfer due to weight sharing and need not necessarily be trained for the entire amount of epochs for each subsequent task, this guarantees convergence and a fair comparison of results with respect to achievable accuracy of other methods. Isolated models are trained for 200 and 300 epochs until convergence respectively. For the generative replay with statistical outlier rejection, we use an aggressive rejection rate $\Omega_t$ (with analogous results for less aggressive rates) and dynamically set tail-sizes to 5% of seen examples per class. As mentioned in the main body, the used open set distance measure is the cosine distance.

For EWC, the number of Fisher samples is fixed to the total number of data points from all the previously seen tasks. A suitable Fisher multiplier value $\lambda$ has been determined by conducting a grid search over a set of five values: 50, 100, 500, 1000 and 5000 on held-out validation data for the first two tasks in sequence. We observe exploding gradients if $\lambda$ is too high. However, a very small $\lambda$ leads to excessive drift in the weight distribution across subsequent tasks, which further results in catastrophic interference. Empirically, $\lambda = 500$ in the class-incremental scenario and $\lambda = 1000$ in the cross-dataset setting seem to provide the best balance.

Table 6. WRN decoder architecture. $P_w$ and $P_h$ refer to the input's spatial dimension. Convolutional (conv) and transposed convolutional (conv_t) layers are parametrized by a quadratic filter size followed by the amount of filters. p and s represent zero padding and stride respectively. If no padding or stride is specified then p = 0 and s = 1. Skip connections are an additional operation at a layer, with the layer to be skipped specified in brackets. Every convolutional and fully-connected (FC) layer is followed by batch-normalization and a ReLU activation function. The model ends on a Sigmoid function.

Layer type: WRN decoder
Layer 1: FC × ⌊P_w/⌋ × ⌊P_h/⌋
Block 1: conv_t × - 320, p = 1; conv_t × - 320 (skip next layer)
         conv × - 320, p = 1
         conv × - 320, p = 1; shortcut (skip next layer)
         conv × - 320, p = 1
         upsample ×
Block 2: conv × - 160, p = 1; conv_t × - 160 (skip next layer)
         conv × - 160, p = 1
         conv × - 160, p = 1; shortcut (skip next layer)
         conv × - 160, p = 1
         upsample ×
Block 3: conv × - 48, p = 1; conv_t × - 48 (skip next layer)
         conv × - 48, p = 1
         conv × - 48, p = 1; shortcut (skip next layer)
         conv × - 48, p = 1
Layer 2: conv × - 3, p = 1

B. Further discussion on the role of β

In the main body the role of the β term (Higgins et al., 2017) in our model's loss function is pointed out. Here, we delve into further detail with qualitative and quantitative examples to support the arguments. To facilitate the discussion, we repeat equation 1 of the main body:

$$\mathcal{L}\left(\boldsymbol{x}^{(n)}, y^{(n)}; \theta, \phi, \xi\right) = \mathbb{E}_{q_{\theta}(\boldsymbol{z}|\boldsymbol{x}^{(n)})}\left[\log p_{\phi}(\boldsymbol{x}^{(n)}|\boldsymbol{z}) + \log p_{\xi}(y^{(n)}|\boldsymbol{z})\right] - \beta \, \mathrm{KL}\left(q_{\theta}(\boldsymbol{z}|\boldsymbol{x}^{(n)}) \,||\, p(\boldsymbol{z})\right)$$

The β term weights the strength of the regularization by the prior through the KL divergence. Selection of this strength is necessary to control the information bottleneck of the latent space and regulate the effective latent encoding overlap. To repeat the main body, and previous arguments by Hoffman & Johnson (2016) and Burgess et al. (2017): too large β values (typically >> 1) will result in a collapse of any structure present in the aggregate posterior; too small β values (typically << 1) lead to the latent space being a lookup table. In either case, there is no meaningful information between the latents. This is particularly relevant to our objective of linear class separability, which requires formation of an aggregate latent encoding that is disentangled with respect to the different classes. To visualize this, we have trained multiple models with different β values on the MNIST dataset, in an isolated fashion with all data present at all times to focus on the effect of β.
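These qualitative inspections rely on plotting the 2-D aggregate encoding; a minimal sketch, assuming the same illustrative encode interface as in the main body's loss sketch:

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_aggregate_encoding(model, loader):
    """Scatter the 2-D posterior means of a dataset, colored by class."""
    zs, ys = [], []
    for x, y in loader:
        mu, _ = model.encode(x)   # posterior means, latent dimensionality 2
        zs.append(mu)
        ys.append(y)
    z, y = torch.cat(zs).cpu(), torch.cat(ys).cpu()
    plt.scatter(z[:, 0], z[:, 1], c=y, s=2, cmap="tab10")
    plt.colorbar(label="class")
    plt.show()
```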
Table 7. Losses obtained for different β values for MNIST using the WRN architecture with 2-D latent space. Training conducted in isolated fashion to quantitatively showcase the role of β. Values are in nats per dimension; unnormalized values in nats are reported in brackets for reference purposes.

2-D latent   Beta   KLD            Recon loss     Class loss    Accuracy [%]
train        1.0    1.039 (2.078)  0.237 (185.8)  0.539 (5.39)  79.87
test                1.030 (2.060)  0.235 (184.3)  0.596 (5.96)  78.30
train        0.5    1.406 (2.812)  0.230 (180.4)  0.221 (2.21)  93.88
test                1.382 (2.764)  0.228 (178.8)  0.305 (3.05)  92.07
train        0.1    2.055 (4.110)  0.214 (167.8)  0.042 (0.42)  99.68
test                2.071 (4.142)  0.212 (166.3)  0.116 (1.16)  98.73
train        0.05   2.395 (4.790)  0.208 (163.1)  0.025 (0.25)  99.83
test                2.382 (4.764)  0.206 (161.6)  0.159 (1.59)  98.79
Table 8. Losses obtained for different β values for MNIST using the WRN architecture with 60-D latent space. Training conducted in isolated fashion to quantitatively showcase the role of β. Values are in nats per dimension; unnormalized values in nats are reported in brackets for reference purposes.

60-D latent  Beta   KLD            Recon loss     Class loss       Accuracy [%]
train        1.0    0.108 (6.480)  0.184 (144.3)  0.0110 (0.110)   99.71
test                0.110 (6.600)  0.181 (142.0)  0.0457 (0.457)   99.03
train        0.5    0.151 (9.060)  0.162 (127.1)  0.0052 (0.052)   99.87
test                0.156 (9.360)  0.159 (124.7)  0.0451 (0.451)   99.14
train        0.1    0.346 (20.76)  0.124 (97.22)  0.0022 (0.022)   99.95
test                0.342 (20.52)  0.126 (98.79)  0.0286 (0.286)   99.38
train        0.05   0.476 (28.56)  0.115 (90.16)  0.0018 (0.018)   99.95
test                0.471 (28.26)  0.118 (92.53)  0.0311 (0.311)   99.34

The corresponding aggregate encodings at the end of training are shown in figure 5. Here, we can empirically observe the above points. With a beta of one and larger, the aggregate posterior's structure starts to collapse and the aggregate encoding converges to a Normal distribution. While this minimizes the distributional mismatch with respect to the prior, the separability of classes is also lost and an accurate classification cannot be achieved. On the other hand, if the beta value gets ever smaller there is insufficient regularization present and the aggregate posterior no longer follows a Normal distribution. The latter does not only render sampling for generative replay difficult, it also challenges the assumption of distances to each class' latent mean being Weibull distributed, as the latter can essentially be seen as a skewed Normal.

Figure 5. Aggregate 2-D latent encodings for different β values for the used WRN architecture.

It is important to note that we normalize losses with respect to dimensions, and the value of β should thus also be seen as a normalized quantity. While the relative effect of increasing or decreasing beta stays the same, the absolute value of β can be subject to any normalization.

We provide corresponding quantitative examples for the models trained with different β with 2-D latent spaces and 60-D latent spaces in tables 7 and 8 respectively. In both cases we observe that decreasing the value of beta below one is necessary to improve classification accuracy, as well as the overall variational lower bound. Taking the 60 dimensional case as a specific example, we can also observe that reducing the beta value too far, e.g. decreasing it from 0.1 to 0.05, leads to deterioration of the variational lower bound, while the classification accuracy by itself does not improve further.

To conclude our discussion on the role of beta, we show the full set of time steps to complete figure 2 of the main body. Here, a model with 2-D latent space and β = 1 was trained continually on MNIST and the aggregate encoding is visualized for each additionally arriving set of two classes in figure 6. In the figure's last panel, we further show the aggregate latent encoding at the last time step for the model trained in an autoregressive fashion, i.e. a PixelVAE (van den Oord et al., 2016; Gulrajani et al., 2017). Also coined "lossy autoencoder" by Chen et al.
Figure 5. Aggregate latent encodings at the end of isolated training for different β values with the used WRN architecture.

We provide corresponding quantitative examples for the models trained with different β, with 2-D and 60-D latent spaces, in tables 7 and 8 respectively. In both cases we observe that decreasing the value of β below one is necessary to improve classification accuracy, as well as the overall variational lower bound. Taking the 60-dimensional case as a specific example, we can also observe that reducing the β value too far, e.g. decreasing it from 0.1 to 0.05, leads to a deterioration of the variational lower bound, while the classification accuracy by itself does not improve further.
To conclude our discussion on the role of β, we show the full set of time steps to complete figure 2 of the main body. Here, a model with 2-D latent space and β = 1 was trained continually on MNIST and the aggregate encoding is visualized for each additionally arriving set of two classes in figure 6. In the figure's last panel, we further show the aggregate latent encoding at the last time step for the model trained in an autoregressive fashion, i.e. a PixelVAE (van den Oord et al., 2016; Gulrajani et al., 2017). Also coined a "lossy autoencoder" by Chen et al. (2017), the authors argue that this model leaves the encoding of local structure to the autoregressive pixel decoder, with a focus on global structure in the latent encoding. In our visualization, this seems to be reflected in a change of the aggregate posterior's structure.

C. Additional open set recognition visualization
As we point out in section 4 of the main paper, our posterior based open set recognition considers almost all of the unknown datasets as statistical outliers, while at the same time regarding unseen test data from the originally trained tasks as distribution inliers across a wide range of rejection priors. In addition to the outlier rejection curves for FashionMNIST and the quantitative results presented in the main body, we also show the full outlier rejection curves for the remaining datasets, as well as all dual model approaches, in figures 7, 8 and 9. These figures visually support the quantitative findings described in the main body and the respective conclusions. In summary, the joint OCDVAE performs better at open set recognition in direct comparison to the dual model setting, particularly when using the EVT based criterion. Apart from the MNIST dataset, where reconstruction loss can be a sufficient metric for open set detection, the latent based approach also exhibits less dependency on the outlier rejection prior and consistently improves the ability to discern unknown data.
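As a rough illustration of how such an outlier rejection curve can be computed, consider the sketch below. It assumes that a Weibull model has already been fit per class on the latent distances of correctly classified training data, as described in the main body; the helper name, the use of scipy's weibull_min and the minimum-over-classes rejection rule are our own choices for illustration.

```python
import numpy as np
from scipy.stats import weibull_min

def outlier_percentage(latent_mu, class_means, weibull_params, omega_t):
    """Fraction of a dataset rejected as statistical outliers for rejection prior omega_t.

    latent_mu:      (N, latent_dim) approximate posterior means of the evaluated dataset
    class_means:    (C, latent_dim) latent means of the known classes
    weibull_params: C tuples (shape, loc, scale) fit on distances of correctly
                    classified training samples to their class' latent mean
    """
    rejected = 0
    for z in latent_mu:
        dists = np.linalg.norm(class_means - z, axis=1)
        cdfs = [weibull_min.cdf(d, c, loc=loc, scale=scale)
                for d, (c, loc, scale) in zip(dists, weibull_params)]
        # a sample counts as an outlier only if it lies in the statistical tail
        # of every known class' distance distribution
        if min(cdfs) > omega_t:
            rejected += 1
    return rejected / len(latent_mu)
```

Sweeping omega_t between zero and one then yields outlier rejection curves of the kind shown in figures 7 to 9.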
Figure 6. Aggregate latent encodings for continually learned MNIST after each task increment: (a) 2 classes, (b) 4 classes, (c) 6 classes, (d) 8 classes, (e) 10 classes, (f) 10 classes with PixelVAE.
Figure 7. Dual model and OCDVAE trained on FashionMNIST evaluated on unseen datasets (MNIST, AudioMNIST, KMNIST, CIFAR10, CIFAR100, SVHN). Each panel plots the percentage of dataset outliers against the respective criterion: classifier entropy (a, b), reconstruction loss in nats (c, d) and the Weibull CDF outlier rejection prior t (e, f). Pairs of panels show the contrast between the approaches; left panels correspond to the dual model, right panels show the joint OCDVAE model. (a+b) The classifier entropy values by themselves are insufficient to separate most of the unknown from the known task's test data. (c+d) Reconstruction loss allows for a partial distinction. (e+f) Our posterior-based open set recognition considers the large majority of unknown data as statistical outliers across a wide range of rejection priors Ω_t, significantly more so in the OCDVAE model. All metrics are reported as the mean over approximate posterior samples per data point.
Figure 8. Dual model and OCDVAE trained on AudioMNIST evaluated on unseen datasets (MNIST, FashionMNIST, KMNIST, CIFAR10, CIFAR100, SVHN). Panels are analogous to figure 7: percentage of dataset outliers against classifier entropy (a, b), reconstruction loss in nats (c, d) and the Weibull CDF outlier rejection prior t (e, f); left panels correspond to the dual model, right panels show the joint OCDVAE model. (a+b) The classifier entropy values by themselves are insufficient to separate most of the unknown from the known task's test data. (c+d) Reconstruction loss allows for a partial distinction. (e+f) Our posterior-based open set recognition considers the large majority of unknown data as statistical outliers across a wide range of rejection priors Ω_t, significantly more so in the OCDVAE model. All metrics are reported as the mean over approximate posterior samples per data point.
Figure 9. Dual model and OCDVAE trained on MNIST evaluated on unseen datasets (FashionMNIST, AudioMNIST, KMNIST, CIFAR10, CIFAR100, SVHN). Panels are analogous to figure 7: percentage of dataset outliers against classifier entropy (a, b), reconstruction loss in nats (c, d) and the Weibull CDF outlier rejection prior t (e, f); left panels correspond to the dual model, right panels show the joint OCDVAE model. (a+b) The classifier entropy values by themselves can achieve a partial separation between unknown and the known task's test data. (c+d) Reconstruction loss allows for a distinction if the cut-off is chosen correctly. (e+f) Our posterior-based open set recognition considers the large majority of unknown data as statistical outliers across a wide range of rejection priors Ω_t. All metrics are reported as the mean over approximate posterior samples per data point. While the OCDVAE shows improvement upon the dual model approach, particularly when using classifier entropies for OSR, both models trained on MNIST perform well in OSR. In direct contrast with models trained on Fashion- or AudioMNIST and the respective figures 7 and 8, this shows that evaluation on MNIST alone is generally insufficient.

D. Detailed results
In the main body we have reported three metrics for our continual learning experiments based on classification accuracy: the base task's accuracy over time α_{t,base}, the new task's accuracy α_{t,new} and the overall accuracy at any point in time α_{t,all}. Classification accuracy is an appropriate measure to evaluate the quality of the generative model over time, given that the employed mechanism to avoid catastrophic interference in continual learning is generative replay. On the one hand, if catastrophic interference occurs in the decoder, the sampled data will no longer resemble the instances of the observed data distribution. This will in turn degrade the encoder during continued training and thus the classification accuracy. On the other hand, this proxy measure for the generation quality avoids the common pitfalls of pixel-wise reconstruction metrics. The information necessary to maintain knowledge of the data distribution through the variational approximation in the probabilistic encoder does not necessarily rely on correctly reconstructing the data's local information. To take an example, if a model were to reconstruct all images perfectly but with some degree of spatial translation or rotation, then the negative log likelihood (NLL) would arguably be worse than that of a model which reconstructs local details correctly on a pixel level for only a fraction of the image. As these could be details in e.g. the background or other class-unspecific areas, training on corresponding generations does not necessarily prevent loss of encoder knowledge with respect to the classification task.
A similar argument can be made for the KL divergence. On the one hand, monitoring the KL divergence as a regularization term by itself over the course of continual learning is meaningless without also regarding the data's NLL. On the other hand, for our OCDVAE model the exact value of the KL divergence does not immediately reflect the quality of the generated data. This is because we do not sample merely from the prior but, as explained in the main body, employ a rejection mechanism to draw samples that belong to the aggregate posterior.
Nevertheless, for the purpose of completeness and in addition to the results provided in the experimental section of the main body, we provide the reconstruction losses and KL divergences for all applicable models in this supplementary material section. Analogous to the three metrics for the classification accuracy of base, new and all tasks, we define the respective reconstruction losses γ_{t,base}, γ_{t,new} and γ_{t,all}. The KL divergence KL_t always measures the deviation from the prior p(z) at any point in time, as the prior remains the same throughout continual training. Following the above discussion, we argue that these values should be regarded with caution and should not be interpreted separately.
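For reference, the three accuracy metrics can be computed along the following lines. This is a minimal sketch with hypothetical names; in particular we assume an evaluate routine returning accuracy on a given test split, and we take α_{t,all} as a size-weighted average over all test sets seen so far, which is one reasonable reading of the definition.

```python
import numpy as np

def continual_accuracies(evaluate, task_test_sets, t):
    """Accuracy metrics after training on task increment t (0-indexed).

    evaluate:       callable mapping a test set to classification accuracy in [0, 1]
    task_test_sets: per-task test sets, ordered by arrival
    """
    alpha_base = evaluate(task_test_sets[0])          # base task's accuracy over time
    alpha_new = evaluate(task_test_sets[t])           # most recently added task
    seen = task_test_sets[: t + 1]
    per_task = [evaluate(ts) for ts in seen]
    sizes = [len(ts) for ts in seen]
    # equivalent to accuracy over the concatenation of all test sets seen so far
    alpha_all = float(np.average(per_task, weights=sizes))
    return alpha_base, alpha_new, alpha_all
```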
D.1. Full cross dataset results

We show the full cross-dataset results in table 9, in extension to table 1 in the main body. An analogous table for the presented autoregressive models can be found in table 10. Similar to the accuracy values, we can observe that the mismatch between aggregate posterior and prior, as expressed through the KL divergence, is greater in a naive joint model (naive CDVAE) in comparison to a dual model approach with separate generative and discriminative models. Our proposed OCDVAE model, with its rejection sampling scheme that takes the structure of the aggregate posterior into account, alleviates this to a large degree. The reconstruction losses of both the dual model and the joint OCDVAE approach show only negligible deviation with respect to the achievable upper bound, and only limited catastrophic interference of the decoder occurs. However, we can also observe that these quantities by themselves are not indicative of maintaining encoder knowledge with respect to the representations required for classification. This is particularly visible in the tables' second experiment, where we first train on audio data and then proceed with the two image datasets. Here, the KL divergence and reconstruction loss are both better for the dual model, whereas a much higher accuracy over time is maintained by the OCDVAE model. Naturally, this is because a significant mismatch between aggregate posterior and prior is also present in a purely unsupervised generative model, and naively sampling from the prior will result in generated instances that do not resemble those present in the observed data distribution. While weaker in effect, this is similar to the naive CDVAE approach. Without the presence of the linear discriminator on the latents in the purely unsupervised generative model, there is however no straightforward mechanism to disentangle the latent space according to classes. Our proposed open set approach, and the resulting constraint to sample from the aggregate posterior as presented in the OCDVAE, is thus not trivially applicable.
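The constrained sampling that the OCDVAE uses for generative replay can be sketched as follows. This is an illustration under the same assumptions as the open set sketch above: decoder, classifier, class_means and weibull_params are hypothetical stand-ins, and the concrete rejection rule is our own simplified reading of the mechanism described in the main body.

```python
import numpy as np
from scipy.stats import weibull_min

def replay_with_rejection(decoder, classifier, class_means, weibull_params,
                          num_samples, omega_t, latent_dim, max_tries=100000):
    """Draw replay data only from high-density regions of the aggregate posterior.

    Latent vectors are sampled from the prior; a sample whose distance to every
    known class' latent mean falls into the statistical outlier region of the
    fitted Weibull models is rejected, all others are decoded and labeled with
    the classifier's prediction on the latent vector.
    """
    images, labels = [], []
    for _ in range(max_tries):
        if len(images) == num_samples:
            break
        z = np.random.randn(latent_dim)                    # sample from the prior
        dists = np.linalg.norm(class_means - z, axis=1)
        cdfs = [weibull_min.cdf(d, c, loc=loc, scale=scale)
                for d, (c, loc, scale) in zip(dists, weibull_params)]
        if min(cdfs) > omega_t:                            # low density area: reject
            continue
        images.append(decoder(z))                          # generate a replay datum
        labels.append(int(np.argmax(classifier(z))))       # label by predicted class
    return images, labels
```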
D.2. Full class incremental results
In addition to reconstruction losses and KL divergences, we also report the detailed full set of intermediate results for the five task steps of the class incremental scenario. We thus extend table 3 in the main body with results for all task increments t = 1, ..., 5 and a complete list of losses in tables 11, 12 and 13 for the three datasets respectively. The corresponding results for autoregressive models are presented in tables 14, 15 and 16.
Once more, we can observe the increased effect of error accumulation due to unconstrained generative sampling from the prior in comparison to the open set counterpart that limits sampling to the aggregate posterior.
Table 9. Results for incremental cross-dataset continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for FashionMNIST (F) → MNIST (M) → AudioMNIST (A) and the reverse order. Extension of table 1 in the main body. Here, in addition to the accuracy α_T, γ_T and KL_T also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of the last increment T = 3. Columns: α_T (%) base/new/all, γ_T (nats) base/new/all and KL_T (nats) all; rows: CDVAE ISO, CDVAE UB, CDVAE LB, EWC, Dual Model, CDVAE and OCDVAE for each of the two dataset orders.
Table 10. Results for PixelVAE based cross-dataset continual learning approaches averaged over 5 runs in analogy to table 9. Extension of table 4 in the main body. Here, in addition to the accuracy α_T, γ_T and KL_T also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of the last increment T = 3. Columns as in table 9; rows: Dual Pix Model, PixCDVAE and PixOCDVAE for each of the two dataset orders.

The statistical deviations across experiment repetitions in the base and the overall classification accuracies are higher and are generally decreased by the open set models. For example, in table 11 the MNIST base and overall accuracy deviations of a naive CDVAE are higher than the respective values for OCDVAE starting already from the second task increment. Correspondingly, the accuracy values themselves experience a larger decline for CDVAE than for OCDVAE with progressive increments. This difference is not as pronounced at the end of the first task increment because the models have not yet been trained on any of their own generated data. Successful literature approaches such as the variational generative replay proposed by Farquhar & Gal (2018) thus avoid repeated learning based on previously generated examples and simply store and retain a separate generative model for each task. The strength of our model is that, instead of storing a trained model for each task increment, we are able to continually keep training our joint model with data generated for all previously seen tasks by filtering out ambiguous samples from low density areas of the posterior. Similar trends can also be observed for the respective pixel models.
We also see that regularization approaches such as EWC already fail at the first increment. In contrast to the success reported in prior literature (Kirkpatrick et al., 2017; Kemker et al., 2018), this is due to the use of a single classification head. This is intuitive because the introduction of new units, as described in the main body, directly confuses the existing classification. Regularization approaches are by definition challenged in this scenario because the weights are not allowed to drift too far away from previous values. For emphasis we repeat that this scenario is nevertheless much more practical and realistic than a multi-head scenario with a separate classifier per task. While regularization approaches are largely successful in the latter setting, it is not only restricted to the closed world, but further requires an oracle at prediction stage to choose the correct classification head. In contrast, our proposed approach requires no knowledge of task labels for prediction and is robust in an open world.
With respect to KL divergences and reconstruction losses we can make two observations. First, the arguments of the previous section hold, and by themselves the small relative improvements between models should be interpreted with caution, as they do not directly translate to maintaining continual learning accuracy.
Second, we can also observe that the reconstruction losses at every increment for all tasks γ_{t,all}, and the respective negative log likelihoods for only the new task γ_{t,new}, are harder to interpret than their accuracy counterparts. While the latter are normalized between zero and unity, the reconstruction loss of different tasks is expected to fluctuate largely according to the reconstruction complexity of the task's images. To give a concrete example, it is rather straightforward to conclude that a model suffers from limited capacity or lack of complexity if a single newly arriving class cannot be classified well. In the case of reconstruction it is common to observe either a large decrease in negative log likelihood for the newly arriving class, or a big increase, depending on the specific introduced class. As such, these values are naturally comparable between models, but are challenging to interpret across time steps without also analyzing the underlying nature of the introduced class. The exception is formed by the base task's reconstruction loss γ_{t,base}. In analogy to the base classification accuracy, this quantity still measures the amount of catastrophic forgetting across time. However, in all tables we can observe that catastrophic forgetting of the decoder, as measured by the base reconstruction loss, is almost imperceptible. As this is not at all reflected in the respective accuracy over time, it further underlines our previous arguments that reconstruction loss is not necessarily the best metric to monitor in the presented continual learning scenario.
Table 11. Results for class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for MNIST at the end of every task increment. Extension of table 3 in the main body. Here, in addition to the accuracy α_t, γ_t and KL_t also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of every task increment t. Rows: α_{base,t}, α_{new,t}, α_{all,t} (%), γ_{base,t}, γ_{new,t}, γ_{all,t} (nats) and KL_{all,t} (nats) for each increment t; columns: CDVAE ISO, CDVAE UB, CDVAE LB, EWC, Dual Model, CDVAE and OCDVAE.
Table 12. Results for class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for FashionMNIST at the end of every task increment. Extension of table 3 in the main body. Here, in addition to the accuracy α_t, γ_t and KL_t also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of every task increment t. Rows and columns as in table 11.
Table 13. Results for class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for AudioMNIST at the end of every task increment. Extension of table 3 in the main body. Here, in addition to the accuracy α_t, γ_t and KL_t also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of every task increment t. Rows and columns as in table 11.
Table 14. Results for PixelVAE based class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for MNIST at the end of every task increment in analogy to table 11. Extension of table 4 in the main body. Here, in addition to the accuracy α_t, γ_t and KL_t also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of every task increment t. Rows as in table 11; columns: Dual Pix Model, PixCDVAE and PixOCDVAE.
Table 15. Results for PixelVAE based class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for FashionMNIST at the end of every task increment in analogy to table 12. Extension of table 4 in the main body. Here, in addition to the accuracy α_t, γ_t and KL_t also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of every task increment t. Rows as in table 11; columns: Dual Pix Model, PixCDVAE and PixOCDVAE.
Table 16. Results for PixelVAE based class incremental continual learning approaches averaged over 5 runs, baselines and the reference isolated learning scenario for AudioMNIST at the end of every task increment in analogy to table 13. Extension of table 4 in the main body. Here, in addition to the accuracy α_t, γ_t and KL_t also indicate the respective NLL reconstruction metrics and corresponding KL divergences at the end of every task increment t. Rows as in table 11; columns: Dual Pix Model, PixCDVAE and PixOCDVAE.

D.3. Backward transfer
As our model's weights are fully shared across all tasks, this opens up the scope for both forward and backward transfer of knowledge. For clarification, the former concept refers to existing tasks' representations aiding the acquisition of a new task's information, e.g. by speeding up training. The latter concept describes the reverse phenomenon, where the introduction of a new task leads to learning of representations that retrospectively improve former tasks, even if their real data is no longer present. Figure 10 highlights an interesting case of backward transfer for class-incremental learning with our OCDVAE model on the AudioMNIST dataset, as quantitatively presented in tables 13 and 16. The addition of two new classes (four and five) at the end of the second increment leads to an improvement in the classification performance on class two, as indicated by the confusion matrices. We point out that this is a desirable continual learning property that can only emerge from having a single model with a single classification head.
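To make such backward transfer visible in the spirit of figure 10, one can track per-class recall from confusion matrices computed after each increment. The following is a small, self-contained sketch with hypothetical prediction arrays, not code from this work.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_recall(cm):
    """Diagonal over row sums; backward transfer shows up as an earlier
    class' recall increasing from one task increment to the next."""
    return np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
```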
E. Generative replay examples with CDVAE and OCDVAE
In this section we provide visualizations of data instances that are produced during generative replay at the end of each task increment. In particular, we qualitatively illustrate the effect of constraining sampling to the aggregate posterior, in contrast to naively sampling from the prior without statistical outlier rejection for low density regions. Figures 11, 12 and 13 illustrate generated images for MNIST, FashionMNIST and AudioMNIST respectively. For both a naive CDVAE as well as the autoregressive PixCDVAE we observe significant confusion with respect to classes. As the generative model needs to learn how to replay old tasks' data based on its own former generations, ambiguity and blurry interpolations accumulate and are rapidly amplified. This is not the case for OCDVAE and PixOCDVAE, where the generative model is capable of maintaining higher visual fidelity throughout continual training and misclassification is scarce.
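For reference, the grids in figures 11 to 13 can be assembled by sorting generated samples according to the classifier's predicted label. The sketch below is purely illustrative; the generate_and_predict callable (returning images and their predicted labels) is a hypothetical stand-in, e.g. a wrapper around the replay sampler sketched in section D.1.

```python
import numpy as np

def replay_grid(generate_and_predict, num_samples, img_shape):
    """Arrange generated replay images row-wise, grouped by predicted class label."""
    images, labels = generate_and_predict(num_samples)
    order = np.argsort(labels)                       # group samples by predicted class
    sorted_images = [images[i] for i in order]
    side = int(np.ceil(np.sqrt(len(sorted_images))))
    while len(sorted_images) < side * side:          # pad with blank images if needed
        sorted_images.append(np.zeros(img_shape))
    rows = [np.concatenate(sorted_images[r * side:(r + 1) * side], axis=1)
            for r in range(side)]
    return np.concatenate(rows, axis=0)
```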
Figure 10.
AudioMNIST confusion matrices for incrementally learned classes with the OCDVAE model after training on (a) 2 classes, (b) 4 classes and (c) 6 classes. When adding classes two and three the model experiences difficulty in classification, but is able to overcome this challenge by exhibiting backward transfer when later learning classes four and five. It is also observable how forgetting of the initial classes is limited.
Figure 11.
Generated images for continually learned incremental MNIST at the end of task increments for CDVAE (a-d), OCDVAE (e-h), PixCDVAE (i-l) and PixOCDVAE (m-p), after 2, 4, 6 and 8 classes respectively. Each individual grid is sorted according to the class label that is predicted by the classifier.
Figure 12.
Generated images for continually learned incremental FashionMNIST at the end of task increments for CDVAE (a-d), OCDVAE (e-h), PixCDVAE (i-l) and PixOCDVAE (m-p), after 2, 4, 6 and 8 classes respectively. Each individual grid is sorted according to the class label that is predicted by the classifier.