On Class Orderings for Incremental Learning
Marc Masana * 1
Bartłomiej Twardowski * 1
Joost van de Weijer 1

Abstract
The influence of class orderings in the evaluation of incremental learning has received very little attention. In this paper, we investigate the impact of class orderings for incrementally learned classifiers. We propose a method to compute various orderings for a dataset. The orderings are derived by simulated annealing optimization from the confusion matrix and reflect different incremental learning scenarios, including maximally and minimally confusing tasks. We evaluate a wide range of state-of-the-art incremental learning methods on the proposed orderings. Results show that orderings can have a significant impact on performance and the ranking of the methods.
1. Introduction
Incremental learning (IL) has gained popularity over the last years as a way to continuously introduce new concepts to an existing model. Learning tasks incrementally relieves the cost of maintaining large datasets and retraining on them. However, retaining knowledge when retraining artificial neural networks on different data is not a trivial task. Whenever a network is trained only on new data, previously acquired knowledge is abruptly lost. This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989). In recent years, many methods have been proposed to alleviate this problem, which stands in the way of advanced life-long learning systems (Lesort et al., 2020; De Lange et al., 2019; Parisi et al., 2019).

The problem of continual learning is often simplified to incrementally learning new concepts (classes) in a well-defined, equally divided sequence of tasks. This may sound artificial, but it is a common choice in recent works (Aljundi et al., 2017; Li & Hoiem, 2017; Rebuffi et al., 2017; Chaudhry et al., 2018; Belouadah & Popescu, 2019).

* Equal contribution. 1 LAMP team, Computer Vision Center, UAB Barcelona, Spain. Correspondence to: Marc Masana <[email protected]>. Proceedings of the th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).
Figure 1.
Accuracies of CIFAR-100 classes after a single non-incremental training on ResNet-32, ordered by class accuracy (top), and grouped by the provided coarse-grained labels (bottom). Dashed lines represent task boundaries for an equally divided 10-task split. Colors denote coarse-grained group labels.

Figure 2.
CM from CIFAR-100 with the original class order (left) and the coarse-grained label order (right) after a joint training on ResNet-32. Diagonal values are skipped for better visualization.

In our research, we propose a method based on confusion matrix (CM) ordering (see Fig. 2) that helps explore the difficulty of the incrementally learned classifier even further. The main contributions of this paper are: 1) proposing a novel method for class ordering in IL scenarios based on confusion matrix values, 2) investigating the robustness of IL methods to class ordering, 3) analysing some commonly used split strategies in comparison to random ones.
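The confusion matrix that drives the orderings can be sketched as below: count how often each true class is predicted as each other class, then drop the diagonal as in Fig. 2 so only inter-class confusions remain. This is a minimal illustration (libraries such as scikit-learn offer an equivalent `confusion_matrix`); the function name and toy labels are ours, not from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count how often class i (row) is predicted as class j (column)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes; in the paper the CM comes from a joint
# (non-incremental) training of ResNet-32 on CIFAR-100.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# As in Fig. 2, the diagonal (correct predictions) is zeroed so that
# only the inter-class confusions remain.
np.fill_diagonal(cm, 0)
```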
2. Related work
Class-IL:
We chose some class-IL methods that are commonly used for comparison in the literature, and some that are current state-of-the-art. LwF (Li & Hoiem, 2017) is a regularization-based method which adds a constraint loss so that the outputs for older classes do not change too much when learning a new task. Similarly, EWC (Kirkpatrick et al., 2017) applies a regularization constraint on the weights to limit their shift. iCaRL (Rebuffi et al., 2017) proposed to extend LwF by keeping a small memory of data exemplars which is replayed during training. Following this idea, BiC (Wu et al., 2019) and LUCIR (Hou et al., 2019) extend the usage of distillation and exemplars and also apply a bias correction to the outputs of different tasks, compensating for the imbalance introduced by new tasks. Finally, IL2M (Belouadah & Popescu, 2019) also proposes a bias correction, although over Finetuning, since they show that it works better than applying it over LwF.
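The LwF-style constraint mentioned above can be sketched as a classification loss on the new task plus a distillation term that keeps the outputs for old classes close to those recorded before training on the new task. This is a minimal NumPy sketch under our own assumptions (function names, the temperature `T` and weight `lam` are illustrative), not the authors' implementation.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lwf_loss(new_logits, label, old_logits, old_targets, T=2.0, lam=1.0):
    """Cross-entropy on the new task plus a distillation term that keeps
    the outputs for old classes close to those recorded before the task."""
    ce = -np.log(softmax(new_logits)[label])
    # Distillation: cross-entropy between temperature-softened old outputs
    # recorded before training (teacher) and the current ones (student).
    p_teacher = softmax(old_targets, T)
    p_student = softmax(old_logits, T)
    kd = -(p_teacher * np.log(p_student)).sum()
    return ce + lam * kd
```

Drifting away from the recorded old outputs increases the loss, which is the mechanism that limits forgetting.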
Class ordering:
Most works on class-IL report results using a random order of classes on CIFAR-100, e.g. iCaRL, LUCIR, BiC. Furthermore, the popularity of iCaRL and the interest in comparing with it make its specific class ordering quite common (their code fixes the random seed to 1993). However, none of them looks deeper into the choice of that specific class order. In (Masana et al., 2020) and (De Lange et al., 2019), the authors touch on the subject of class ordering, showing that some methods report different results for different class orderings. Only a random ordering and a semantically split ordering were investigated, without any dedicated method to order classes into harder or easier sequences, as this was not their main focus.
Curriculum Learning and Classification Complexity:
In contrast to curriculum learning, our objective is not to obtain the best model performance (Pentina et al., 2015), but to propose an evaluation for class-IL methods under different scenarios. Another similar research direction is assessing how complex a classification task is. In our case, we could use a known measure (Lorena et al., 2019), or the one proposed in (Nguyen et al., 2019) for IL.
3. Class ordering
In multi-class-IL for image classification, a sequence of tasks, each consisting of m_t classes, is learned one at a time, extending the knowledge of the model in incremental steps. Each task t provides paired data x_i with their respective class labels y_i ∈ C^t, where C^t = {c^t_1, c^t_2, ..., c^t_{m_t}} denotes the set of m_t classes of task t. When training on task t, only data (x_i, y_i) ∼ D^t is available. We consider disjoint classes between all tasks, C^t ∩ C^s = ∅ for t ≠ s, as in (Aljundi et al., 2017; Chaudhry et al., 2018; Dhar et al., 2019; Hou et al., 2019; Liu et al., 2018; Rebuffi et al., 2017; Yu et al., 2020). After training each task, we evaluate the learned model on all classes seen so far, C = ∪_{i≤t} C^i.

Figure 3. Class ordering objectives (top) and resulting confusion matrices (bottom).

We define the following class orderings:

• random: takes a permutation of the class order. By default we take the original class ordering which the dataset provides (usually alphabetical or similar), or otherwise a permutation corresponding to a random seed. As explained in Sec. 2, in the case of CIFAR-100, some works fix the seed to the same one as iCaRL (Rebuffi et al., 2017).

If we train a model in a single training session (joint training) with all data for all classes, we can calculate the CM, as seen in Fig. 2. Based on that, we define two more class orders:

• max confusion (maxConf): highly misclassified classes are placed next to each other, so that the maximal confusion happens around the CM diagonal. This creates an IL split with more difficult intra-task classification.

• min confusion (minConf): enforces classes which are rarely misclassified between each other to be in the same task, so that the maximal confusion happens at the corners of the CM. Intra-task and adjacent-task classification becomes easier, but the most misclassified classes are pushed towards the first and last tasks.

Finding the above orderings based on the CM values is a non-trivial task, where a brute-force naive approach of O(n!)
cannot be applied even to moderate-size problems. To reduce the complexity, we re-formulate finding the class ordering as an optimization problem where the objective is to maximize the value of a fitness function for a desired confusion matrix M. We use an objective weight matrix W ∈ N^{|C|×|C|} (see Fig. 3) to calculate a score value:

score(M, W) = tr(W^T M),    (1)

which assesses the fit of a candidate ordering during the search for a solution. A global optimum is not necessary, since it is hard to establish; instead, we use the simulated annealing optimization algorithm to find an ordering within a constrained time, limited by the number of fitting iterations. The same approach has previously been used for CM ordering to better visualize large matrices (Thoma, 2017a) and for the HASYv2 dataset (Thoma, 2017b).

We can also define an objective CM and try to converge to a permutation that better accommodates incremental tasks by introducing their boundaries. If the splits C^1, ..., C^t are known upfront, which is usually the case, we can incorporate task boundaries in the weight matrix for both scenarios. Therefore, we can define three more orderings:

• increasing task confusion (incTaskConf): maximize confusion within all tasks around the diagonal of M, with increasing confusion between them. The objective matrix is presented in Fig. 3 (right).

• equal task confusion (eqTaskConf): similar to max confusion, but introducing task boundaries should cause less confusion between adjacent tasks.

• decreasing task confusion (decTaskConf): maximize confusion within all tasks around the diagonal while decreasing it between them. Similar to increasing, but with inverted diagonal weights from Fig. 3 (right).

In the specific case of CIFAR-100, since a coarse-grained hierarchy of the classes exists, we can also define an ordering based on the provided two-level taxonomy. In each task we can have classes related to the same group or similar groups in order to make the classification harder.
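The search described above can be sketched as follows: permute the rows and columns of the CM by a candidate order, score it with Eq. (1), and anneal over random pairwise swaps. This is our own sketch, not the authors' code; in particular, the weight matrix for the maxConf objective (large weights near the diagonal) is an assumption based on the description of Fig. 3, and the cooling schedule is illustrative.

```python
import numpy as np

def score(cm, w, order):
    """Eq. (1): tr(W^T M), where M is the confusion matrix with rows and
    columns permuted according to the candidate class order."""
    m = cm[np.ix_(order, order)]
    return np.trace(w.T @ m)

def max_conf_weights(n):
    """Illustrative weight matrix for the maxConf objective: weights grow
    towards the diagonal, rewarding orders that place mutually confused
    classes next to each other."""
    idx = np.arange(n)
    return (n - np.abs(idx[:, None] - idx[None, :])).astype(float)

def anneal_order(cm, w, iters=20000, t0=1.0, seed=0):
    """Simulated annealing over permutations: propose a random swap of two
    classes, always accept improvements, and accept worse orders with a
    probability that decays as the temperature cools."""
    rng = np.random.default_rng(seed)
    n = cm.shape[0]
    order = rng.permutation(n)
    best, best_s = order.copy(), score(cm, w, order)
    cur_s = best_s
    for i in range(iters):
        t = t0 * (1 - i / iters) + 1e-9   # linear cooling schedule
        cand = order.copy()
        a, b = rng.choice(n, size=2, replace=False)
        cand[a], cand[b] = cand[b], cand[a]
        s = score(cm, w, cand)
        if s >= cur_s or rng.random() < np.exp((s - cur_s) / t):
            order, cur_s = cand, s
            if s > best_s:
                best, best_s = order.copy(), s
    return best, best_s
```

The minConf and task-aware objectives only change the weight matrix `w`, not the search itself, which is what makes the single score of Eq. (1) convenient.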
• coarse grained: ordered by a provided grouping or taxonomy. For CIFAR-100, classes are divided into 20 groups, as shown with bar colors in Fig. 1 and labels in Fig. 2 (right).

4. Experimental results

We compare the class orderings on CIFAR-100 considering ten equal tasks trained on ResNet-32 from scratch. Training starts with a learning rate (LR) of 0.1, momentum of 0.9, weight decay of 2e-4, and an LR scheduler with ten epochs patience which reduces the LR by a fixed factor until the LR is lower than 1e-4 or 200 epochs have passed. Method hyperparameters are chosen following the framework from (De Lange et al., 2019) for the first 3 tasks, and fixed afterwards.

We compare all orderings proposed in Sec. 3 on Finetuning (FT) and LwF in Fig. 4. Both methods show very little difference in general behaviour across the different orderings. Random, iCaRL seed and minConf provide better performance after all tasks. The results point to minConf being the most stable, with Random generally being closest to it. After the first task, incTaskConf has the highest performance since it learns the least confusing group of classes first. However, after all tasks, it ends up with one of the lowest performances, together with maxConf.

Table 1. CIFAR-100 results for class-IL with growing memory of 20 exemplars per class (10-run average and standard deviation). The best score for each task of each method is in bold; underline marks the lowest score.

Figure 4. FT (left) and LwF (right) with different class orderings for CIFAR-100 on ResNet-32 from scratch without exemplar memory. Legend values per ordering — FT: random (11.1), iCaRL seed (11.2), maxConf (6.8), minConf (11.1), decTaskConf (6.2), eqTaskConf (4.6), incTaskConf (4.8), coarse (7.5); LwF: random (29.3), iCaRL seed (30.5), maxConf (15.0), minConf (29.8), decTaskConf (18.1), eqTaskConf (17.5), incTaskConf (19.6), coarse (18.1).

Results on the different methods are presented in Tab. 1, using 20 exemplars per class with herding selection; LwF is adapted to use exemplars. As expected, decTaskConf results in the most confusing class ordering for the first tasks, with the lowest performance. Analogously, incTaskConf achieves the best performance there. This is due to the model not having learned all classes at that point, but only the most or the least confusing ones, respectively. This behaviour differs from the setting without exemplars, where incTaskConf is only better until task 2. LwF and LUCIR have a similar overall performance at task 10 (avg. 28%), iCaRL follows (avg. 33%), while BiC and IL2M perform better (avg. 38%). In addition, the standard deviation across all orderings is low for BiC and IL2M.

5. Conclusions

Class orderings for class-IL influence the overall evaluation performance. For a single method, the spread between the most extreme orderings can be significant. Comparing the orderings, we found that the random ordering obtains among the highest performances when used by non-exemplar methods. The proposed class orderings based on the confusion matrix can be used as a tool for checking the robustness of class-IL approaches. A direct extension of this work would be to use different datasets and ordering methods. For a fairer comparison, we recommend evaluating methods on several class orderings.

Acknowledgements

We would like to thank Xialei Liu for his helpful discussions. Marc Masana acknowledges the 2019-FI B2-00189 grant from Generalitat de Catalunya.
References

Aljundi, R., Chakravarty, P., and Tuytelaars, T. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Belouadah, E. and Popescu, A. IL2M: Class incremental learning with dual memory. In Proceedings of the IEEE International Conference on Computer Vision, pp. 583–592, 2019.

Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547, 2018.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.

Dhar, P., Singh, R. V., Peng, K.-C., Wu, Z., and Chellappa, R. Learning without memorizing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5138–5146, 2019.

Hou, S., Pan, X., Loy, C. C., Wang, Z., and Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 831–839, 2019.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Lesort, T., Lomonaco, V., Stoian, A., Maltoni, D., Filliat, D., and Díaz-Rodríguez, N. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52–68, 2020.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

Liu, X., Masana, M., Herranz, L., Van de Weijer, J., Lopez, A. M., and Bagdanov, A. D. Rotate your networks: Better weight consolidation and less catastrophic forgetting. In International Conference on Pattern Recognition (ICPR), 2018.

Lorena, A., Garcia, L. P., Lehmann, J., de Souto, M., and Ho, T. How complex is your classification problem?: A survey on measuring classification complexity. ACM Computing Surveys, 52:1–34, 2019. doi: 10.1145/3347711.

Masana, M., Tuytelaars, T., and van de Weijer, J. Ternary feature masks: continual learning without any forgetting. arXiv preprint arXiv:2001.08714, 2020.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.

Nguyen, C. V., Achille, A., Lam, M., Hassner, T., Mahadevan, V., and Soatto, S. Toward understanding catastrophic forgetting in continual learning, 2019.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Pentina, A., Sharmanska, V., and Lampert, C. H. Curriculum learning of multiple tasks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.

Thoma, M. Analysis and optimization of convolutional neural network architectures, 2017a.

Thoma, M. The HASYv2 dataset, 2017b.

Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., and Fu, Y. Large scale incremental learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 374–382, 2019.

Yu, L., Twardowski, B., Liu, X., Herranz, L., Wang, K., Cheng, Y., Jui, S., and van de Weijer, J. Semantic drift compensation for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6982–6991, 2020.

Appendices

A. Plots for growing memory setting

In this appendix we present plots that reflect the evaluation of each method from Table 1 for different class orderings after learning each task.

Figure 5. LwF results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (24.8), iCaRL seed (27.3), maxConf (32.6), minConf (25.4), decTaskConf (29.9), eqTaskConf (31.6), incTaskConf (29.4), coarse (26.5).

Figure 6. iCaRL results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (32.2), iCaRL seed (34.2), maxConf (35.4), minConf (33.4), decTaskConf (28.0), eqTaskConf (35.4), incTaskConf (32.0), coarse (32.8).

Figure 7. BiC results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (39.3), iCaRL seed (40.1), maxConf (37.5), minConf (40.1), decTaskConf (37.2), eqTaskConf (38.6), incTaskConf (37.8), coarse (39.3).

Figure 8. LUCIR results for different class orderings for CIFAR-100 on ResNet-32 from scratch with 20 exemplars per class growing memory. Legend values per ordering: random (27.2), iCaRL seed (29.6), maxConf (28.9), minConf (27.7), decTaskConf (29.1), eqTaskConf (31.9), incTaskConf (28.5), coarse (26.2).