On Class Orderings for Incremental Learning
Marc Masana * 1
Bartłomiej Twardowski * 1
Joost van de Weijer 1

Abstract
The influence of class orderings in the evaluation of incremental learning has received very little attention. In this paper, we investigate the impact of class orderings for incrementally learned classifiers. We propose a method to compute various orderings for a dataset. The orderings are derived by simulated annealing optimization from the confusion matrix and reflect different incremental learning scenarios, including maximally and minimally confusing tasks. We evaluate a wide range of state-of-the-art incremental learning methods on the proposed orderings. Results show that orderings can have a significant impact on performance and the ranking of the methods.
1. Introduction
Incremental learning (IL) has gained popularity over the last years as a way to continuously introduce new concepts to an existing model. Learning tasks incrementally relieves the cost of maintaining large datasets and retraining on them. However, retaining knowledge when retraining artificial neural networks on different data is not a trivial task. Whenever a network is trained only on new data, previously acquired knowledge is abruptly lost. This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989). In recent years, many methods have been proposed to alleviate this problem, which stands in the way of advanced life-long learning systems (Lesort et al., 2020; De Lange et al., 2019; Parisi et al., 2019).

The problem of continual learning is often simplified to incrementally learning new concepts (classes) in a well-defined, equally divided sequence of tasks. This may sound artificial, but it is a common choice in recent works (Aljundi et al., 2017; Li & Hoiem, 2017; Rebuffi et al., 2017; Chaudhry et al., 2018; Belouadah & Popescu, 2019).

* Equal contribution. 1 LAMP team, Computer Vision Center, UAB Barcelona, Spain. Correspondence to: Marc Masana <[email protected]>. Proceedings of the th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).
Figure 1.
Accuracies of CIFAR-100 classes after a single non-incremental training on ResNet-32, ordered by class accuracy (top), and grouped by the provided coarse-grained labels (bottom). Dashed lines represent task boundaries for an equally divided 10-task split. Colors denote coarse-grained group labels.

Figure 2.
CM from CIFAR-100 with the original class order (left) and the coarse-grained label order (right) after a joint training on ResNet-32. Diagonal values are skipped for better visualization.

In our research, we propose a method based on confusion matrix (CM) ordering (see Fig. 2) that helps explore the difficulty of the incrementally learned classifier even further. The main contributions of this paper are: 1) proposing a novel method for class ordering in IL scenarios based on confusion matrix values, 2) investigating the robustness of IL methods to class ordering, 3) analysing some commonly used split strategies in comparison to random ones.
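The confusion matrix that drives the orderings can be sketched as below: count how often each true class is predicted as each other class, then drop the diagonal as in Fig. 2 so only inter-class confusions remain. This is a minimal illustration (libraries such as scikit-learn offer an equivalent `confusion_matrix`); the function name and toy labels are ours, not from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count how often class i (row) is predicted as class j (column)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes; in the paper the CM comes from a joint
# (non-incremental) training of ResNet-32 on CIFAR-100.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# As in Fig. 2, the diagonal (correct predictions) is zeroed so that
# only the inter-class confusions remain.
np.fill_diagonal(cm, 0)
```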
2. Related work
Class-IL:
We chose some class-IL methods that are commonly used for comparison in the literature, and some that are current state-of-the-art. LwF (Li & Hoiem, 2017) is a regularization-based method which adds a constraint loss so that the outputs for older classes do not change too much when learning a new task. Similarly, EWC (Kirkpatrick et al., 2017) applies a regularization constraint on the weights to limit their shift. iCaRL (Rebuffi et al., 2017) proposed to extend LwF by keeping a small memory of data exemplars which is replayed during training. Following this idea, BiC (Wu et al., 2019) and LUCIR (Hou et al., 2019) extend the usage of distillation and exemplars and also apply a bias correction to the outputs of different tasks, compensating for the imbalance introduced by new tasks. Finally, IL2M (Belouadah & Popescu, 2019) also proposes a bias correction, although over Finetuning, since they show that it works better than applying it over LwF.
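The LwF-style constraint mentioned above can be sketched as a classification loss on the new task plus a distillation term that keeps the outputs for old classes close to those recorded before training on the new task. This is a minimal NumPy sketch under our own assumptions (function names, the temperature `T` and weight `lam` are illustrative), not the authors' implementation.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lwf_loss(new_logits, label, old_logits, old_targets, T=2.0, lam=1.0):
    """Cross-entropy on the new task plus a distillation term that keeps
    the outputs for old classes close to those recorded before the task."""
    ce = -np.log(softmax(new_logits)[label])
    # Distillation: cross-entropy between temperature-softened old outputs
    # recorded before training (teacher) and the current ones (student).
    p_teacher = softmax(old_targets, T)
    p_student = softmax(old_logits, T)
    kd = -(p_teacher * np.log(p_student)).sum()
    return ce + lam * kd
```

Drifting away from the recorded old outputs increases the loss, which is the mechanism that limits forgetting.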
Class ordering:
Most works on class-IL report results using a random order of classes on CIFAR-100, e.g. iCaRL, LUCIR, BiC. Furthermore, the popularity of iCaRL and the interest in comparing with it make its specific class ordering quite common (their code fixes the random seed to 1993). However, none of them looks deeper into the choice of that specific class order. In (Masana et al., 2020) and (De Lange et al., 2019), the authors touch on the subject of class ordering, showing that some methods report different results for different class orderings. Only a random ordering and a semantically split ordering were investigated, without any dedicated method to order classes into harder or easier sequences, as this was not their main focus.
Curriculum Learning and Classification Complexity:
In contrast to curriculum learning, our objective is not to obtain the best model performance (Pentina et al., 2015), but to propose an evaluation for class-IL methods under different scenarios. Another similar research direction is assessing how complex a classification task is. In our case, we could use a known measure (Lorena et al., 2019), or the one proposed in (Nguyen et al., 2019) for IL.
3. Class ordering
In multi-class-IL for image classification, a sequence of tasks, each consisting of m_t classes, is learned one at a time, extending the knowledge of the model in incremental steps. Each task t provides paired data x_i with their respective class labels y_i ∈ C^t, where C^t = {c^t_1, c^t_2, ..., c^t_{m_t}} denotes the set of m_t classes of task t. When training on task t, only data (x_i, y_i) ∼ D^t is available. We consider disjoint classes between all tasks, C^t ∩ C^s = ∅ for t ≠ s, as in (Aljundi et al., 2017; Chaudhry et al., 2018; Dhar et al., 2019; Hou et al., 2019; Liu et al., 2018; Rebuffi et al., 2017; Yu et al., 2020). After training each task, we evaluate the learned model on all classes seen so far, C = ∪_{i≤t} C^i.

Figure 3. Class ordering objectives (top) and resulting confusion matrices (bottom).

We define the following class orderings:

• random: takes a permutation of the class order. By default we take the original class ordering which the dataset provides (usually alphabetical or similar), or otherwise a permutation corresponding to a random seed. As explained in Sec. 2, in the case of CIFAR-100, some works fix the seed to the same one as iCaRL (Rebuffi et al., 2017).

If we train a model in a single training session (joint training) with all data for all classes, we can calculate the CM, as seen in Fig. 2. Based on that, we define two more class orders:

• max confusion (maxConf): highly misclassified classes are placed next to each other, so that the maximal confusion happens around the CM diagonal. This creates an IL split with more difficult intra-task classification.

• min confusion (minConf): enforces classes which are rarely misclassified between each other to be in the same task, so that the maximal confusion happens at the corners of the CM. Intra-task and adjacent-task classification becomes easier, but the most misclassified classes are pushed towards the first and last tasks.

Finding the above orderings based on the CM values is a non-trivial task, where a brute-force naive approach of O(n!)
cannot be applied even to moderate-size problems. To reduce the complexity, we re-formulate finding the class ordering as an optimization problem where the objective is to maximize the value of a fitness function for a desired confusion matrix M. We use an objective weight matrix W ∈ N^{|C|×|C|} (see Fig. 3) to calculate a score value:

score(M, W) = tr(W^T M),    (1)

which assesses the fit of a candidate ordering during the search for a solution. A global optimum is not necessary, since it is hard to establish; instead, we use the simulated annealing optimization algorithm to find an ordering within a constrained time, limited by the number of fitting iterations. The same approach has previously been used for CM ordering to better visualize large matrices (Thoma, 2017a) and for the HASYv2 dataset (Thoma, 2017b).

We can also define an objective CM and try to converge to a permutation that better accommodates incremental tasks by introducing their boundaries. If the splits C^1, ..., C^t are known upfront, which is usually the case, we can incorporate task boundaries in the weight matrix for both scenarios. Therefore, we can define three more orderings:

• increasing task confusion (incTaskConf): maximize confusion within all tasks around the diagonal of M, with increasing confusion between them. The objective matrix is presented in Fig. 3 (right).

• equal task confusion (eqTaskConf): similar to max confusion, but introducing task boundaries should cause less confusion between adjacent tasks.

• decreasing task confusion (decTaskConf): maximize confusion within all tasks around the diagonal while decreasing it between them. Similar to increasing, but with inverted diagonal weights from Fig. 3 (right).

In the specific case of CIFAR-100, since a coarse-grained hierarchy of the classes exists, we can also define an ordering based on the provided two-level taxonomy. In each task we can have classes related to the same group or similar groups in order to make the classification harder.
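The search described above can be sketched as follows: permute the rows and columns of the CM by a candidate order, score it with Eq. (1), and anneal over random pairwise swaps. This is our own sketch, not the authors' code; in particular, the weight matrix for the maxConf objective (large weights near the diagonal) is an assumption based on the description of Fig. 3, and the cooling schedule is illustrative.

```python
import numpy as np

def score(cm, w, order):
    """Eq. (1): tr(W^T M), where M is the confusion matrix with rows and
    columns permuted according to the candidate class order."""
    m = cm[np.ix_(order, order)]
    return np.trace(w.T @ m)

def max_conf_weights(n):
    """Illustrative weight matrix for the maxConf objective: weights grow
    towards the diagonal, rewarding orders that place mutually confused
    classes next to each other."""
    idx = np.arange(n)
    return (n - np.abs(idx[:, None] - idx[None, :])).astype(float)

def anneal_order(cm, w, iters=20000, t0=1.0, seed=0):
    """Simulated annealing over permutations: propose a random swap of two
    classes, always accept improvements, and accept worse orders with a
    probability that decays as the temperature cools."""
    rng = np.random.default_rng(seed)
    n = cm.shape[0]
    order = rng.permutation(n)
    best, best_s = order.copy(), score(cm, w, order)
    cur_s = best_s
    for i in range(iters):
        t = t0 * (1 - i / iters) + 1e-9   # linear cooling schedule
        cand = order.copy()
        a, b = rng.choice(n, size=2, replace=False)
        cand[a], cand[b] = cand[b], cand[a]
        s = score(cm, w, cand)
        if s >= cur_s or rng.random() < np.exp((s - cur_s) / t):
            order, cur_s = cand, s
            if s > best_s:
                best, best_s = order.copy(), s
    return best, best_s
```

The minConf and task-aware objectives only change the weight matrix `w`, not the search itself, which is what makes the single score of Eq. (1) convenient.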
• coarse grained: ordered by a provided grouping or taxonomy. For CIFAR-100, classes are divided into 20 groups, as shown with bar colors in Fig. 1 and labels in Fig. 2 (right).

4. Experimental results

We compare the class orderings on CIFAR-100 considering ten equal tasks trained on ResNet-32 from scratch. Training starts with a learning rate (LR) of 0.1, momentum of 0.9, weight decay of 2e-4, and an LR scheduler with ten epochs patience which reduces the LR by a fixed factor until the LR is lower than 1e-4 or 200 epochs have passed. Method hyperparameters are chosen following the framework from (De Lange et al., 2019) for the first 3 tasks, and fixed afterwards.

We compare all orderings proposed in Sec. 3 on Finetuning (FT) and LwF in Fig. 4. Both methods show very little difference in general behaviour across the different orderings. Random, iCaRL seed and minConf provide better performance after all tasks. The results point to minConf being the most stable, with Random generally being closest to it. After the first task, incTaskConf has the highest performance since it learns the least confusing group of classes first. However, after all tasks, it ends up with one of the lowest performances, together with maxConf.

Table 1. CIFAR-100 results for class-IL with growing memory of 20 exemplars per class (10-run average and standard deviation). The best score for each task of each method is in bold; underline marks the lowest score.

Figure 4. FT (left) and LwF (right) with different class orderings for CIFAR-100 on ResNet-32 from scratch without exemplar memory. Legend values per ordering — FT: random (11.1), iCaRL seed (11.2), maxConf (6.8), minConf (11.1), decTaskConf (6.2), eqTaskConf (4.6), incTaskConf (4.8), coarse (7.5); LwF: random (29.3), iCaRL seed (30.5), maxConf (15.0), minConf (29.8), decTaskConf (18.1), eqTaskConf (17.5), incTaskConf (19.6), coarse (18.1).

Results on the different methods are presented in Tab. 1, using 20 exemplars per class with herding selection; LwF is adapted to use exemplars. As expected, decTaskConf results in the most confusing class ordering for the first tasks, with the lowest performance. Analogously, incTaskConf achieves the best performance there. This is due to the model not having learned all classes at that point, but only the most or the least confusing ones, respectively. This behaviour differs from the setting without exemplars, where incTaskConf is only better until task 2. LwF and LUCIR have a similar overall performance at task 10 (avg. 28%), iCaRL follows (avg. 33%), while BiC and IL2M perform better (avg. 38%). In addition, the standard deviation across all orderings is low for BiC and IL2M.

5. Conclusions

Class orderings for class-IL influence the overall evaluation performance. For a single method, the spread between the most extreme orderings can be significant. Comparing the orderings, we found that the random ordering obtains among the highest performances when used by non-exemplar methods. The proposed class orderings based on the confusion matrix can be used as a tool for checking the robustness of class-IL approaches. A direct extension of this work would be to use different datasets and ordering methods. For a fairer comparison, we recommend evaluating methods on several class orderings.

Acknowledgements

We would like to thank Xialei Liu for his helpful discussions. Marc Masana acknowledges the 2019-FI B2-00189 grant from Generalitat de Catalunya.
References

Aljundi, R., Chakravarty, P., and Tuytelaars, T. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Belouadah, E. and Popescu, A. IL2M: Class incremental learning with dual memory. In Proceedings of the IEEE International Conference on Computer Vision, pp. 583–592, 2019.

Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547, 2018.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.

Dhar, P., Singh, R. V., Peng, K.-C., Wu, Z., and Chellappa, R. Learning without memorizing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5138–5146, 2019.

Hou, S., Pan, X., Loy, C. C., Wang, Z., and Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 831–839, 2019.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Lesort, T., Lomonaco, V., Stoian, A., Maltoni, D., Filliat, D., and Díaz-Rodríguez, N. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52–68, 2020.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

Liu, X., Masana, M., Herranz, L., Van de Weijer, J., Lopez, A. M., and Bagdanov, A. D. Rotate your networks: Better weight consolidation and less catastrophic forgetting. In International Conference on Pattern Recognition (ICPR), 2018.

Lorena, A., Garcia, L. P., Lehmann, J., de Souto, M., and Ho, T. How complex is your classification problem?: A survey on measuring classification complexity. ACM Computing Surveys, 52:1–34, 2019. doi: 10.1145/3347711.

Masana, M., Tuytelaars, T., and van de Weijer, J. Ternary feature masks: continual learning without any forgetting. arXiv preprint arXiv:2001.08714, 2020.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.

Nguyen, C. V., Achille, A., Lam, M., Hassner, T., Mahadevan, V., and Soatto, S. Toward understanding catastrophic forgetting in continual learning, 2019.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Pentina, A., Sharmanska, V., and Lampert, C. H. Curriculum learning of multiple tasks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.

Thoma, M. Analysis and optimization of convolutional neural network architectures, 2017a.

Thoma, M. The HASYv2 dataset, 2017b.

Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., and Fu, Y. Large scale incremental learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 374–382, 2019.

Yu, L., Twardowski, B., Liu, X., Herranz, L., Wang, K., Cheng, Y., Jui, S., and van de Weijer, J. Semantic drift compensation for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6982–6991, 2020.

Appendices

A. Plots for growing memory setting

In this appendix we present plots that reflect the evaluation of each method from Table 1 for different class orderings after learning each task.

Figure 5. LwF results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (24.8), iCaRL seed (27.3), maxConf (32.6), minConf (25.4), decTaskConf (29.9), eqTaskConf (31.6), incTaskConf (29.4), coarse (26.5).

Figure 6. iCaRL results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (32.2), iCaRL seed (34.2), maxConf (35.4), minConf (33.4), decTaskConf (28.0), eqTaskConf (35.4), incTaskConf (32.0), coarse (32.8).

Figure 7. BiC results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (39.3), iCaRL seed (40.1), maxConf (37.5), minConf (40.1), decTaskConf (37.2), eqTaskConf (38.6), incTaskConf (37.8), coarse (39.3).

Figure 8. LUCIR results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (27.2), iCaRL seed (29.6), maxConf (28.9), minConf (27.7), decTaskConf (29.1), eqTaskConf (31.9), incTaskConf (28.5), coarse (26.2).