Rethinking Curriculum Learning with Incremental Labels and Adaptive Compensation
Madan Ravi Ganesh [email protected]
Jason J. Corso [email protected]
University of Michigan, EECS, Ann Arbor, Michigan, USA
Abstract
Like humans, deep networks have been shown to learn better when samples are organized and introduced in a meaningful order or curriculum [38]. Conventional curriculum learning schemes introduce samples in their order of difficulty. This forces models to begin learning from a subset of the available data while adding the external overhead of evaluating the difficulty of samples. In this work, we propose Learning with Incremental Labels and Adaptive Compensation (LILAC), a two-phase method that incrementally increases the number of unique output labels rather than the difficulty of samples while consistently using the entire dataset throughout training. In the first phase, Incremental Label Introduction, we partition data into mutually exclusive subsets, one that contains a subset of the ground-truth labels and another that contains the remaining data attached to a pseudo-label. Throughout the training process, we recursively reveal unseen ground-truth labels in fixed increments until all the labels are known to the model. In the second phase, Adaptive Compensation, we optimize the loss function using altered target vectors of previously misclassified samples. The target vectors of such samples are modified to a smoother distribution to help models learn better. On evaluating across three standard image benchmarks, CIFAR-10, CIFAR-100, and STL-10, we show that LILAC outperforms all comparable baselines. Further, we detail the importance of pacing the introduction of new labels to a model as well as the impact of using a smooth target vector.
Deep networks are a notoriously hard class of models to train effectively [10, 13, 22, 23]. A combination of high-dimensional problems, characterized by a large number of labels and a high volume of samples, a large number of free parameters, and extreme sensitivity to experimental setups are some of the main reasons for the difficulty in training deep networks. The go-to solution for deep network optimization is Stochastic Gradient Descent with mini-batches [31] (batch learning) or its derivatives. There are two alternative lines of work which offer strategies to guide deep networks to better solutions than batch learning: Curriculum Learning [3, 12, 15] and Label Smoothing [9, 39].

Curriculum learning helps deep networks learn better by gradually increasing the difficulty of samples used to train networks. This idea is inspired by methods used to teach humans and patterns in human cognition and behaviour [1, 35]. The "difficulty" of samples in the dataset, obtained using either external ranking methods or internal rewards [12, 16], introduces an extra computational overhead while the setup itself restricts the amount of data from which the model begins to learn.

Figure 1: Illustration of the components of LILAC for a four-label dataset case. The Incremental Label Introduction (IL) phase introduces new labels at regular intervals while using the data corresponding to unknown labels (pseudo-label) as negative samples. Once all the labels have been introduced, the Adaptive Compensation (AC) phase of training begins. Here, a prior copy of the network is used to classify training data. If a sample is misclassified, then a smoother distribution is used as its ground-truth vector in the current epoch.

Label smoothing techniques [28, 30, 39] regularize the outcomes of deep networks to prevent over-fitting while improving on existing solutions. They penalize network outputs based on criteria such as noisy labels, overconfident model outcomes, or robustness of a network around a data point in the feature space. Often, such methods penalize the entire dataset throughout the training phase with no regard to the prediction accuracy of each sample.

Inspired by an alternative outlook on Elman's [9] notion of "starting small", we propose LILAC,
Learning with Incremental Labels and Adaptive Compensation, a novel label-based algorithm that overcomes the issues of the previous methods and effectively combines them. LILAC works in two phases: 1) Incremental Label Introduction (IL), which emphasizes gradually learning labels, instead of samples, and 2) Adaptive Compensation (AC), which regularizes the outcomes of previously misclassified samples by modifying their target vectors to smoother distributions in the objective function (Fig. 1).

In the first phase, we partition data into two mutually exclusive sets: S, a subset of ground-truth (GT) labels and their corresponding data; and U, the remaining data associated with a pseudo-label (ρ) and used as negative samples. Once the network is trained using the current state of the data partition for a fixed interval, we reveal more GT labels and their corresponding data and repeat the training process. By contrasting data in S against the entire remaining dataset in U, we consistently use all the available data throughout training, thereby overcoming one of the key issues of curriculum learning. The setup of the IL phase, inspired by continual learning, allows us to flexibly space out the introduction of new labels and provide the network enough time to develop a strong understanding of each class.

Once all the GT labels are revealed, we initiate the AC phase of training. In this phase, we replace the target one-hot vector of misclassified samples, obtained from a previous version of the network being trained, with a smoother distribution. The smoother distribution provides an easier value for the network to learn while the use of a prior copy of the network helps avoid external computational overhead and limits the alteration to only necessary samples.

To summarize, our main contributions in LILAC are as follows:
• we introduce a novel method for curriculum learning that incrementally learns labels as opposed to samples,
• we formulate Adaptive Compensation as a method to regularize misclassified samples while removing external computational overhead,
• finally, we improve average recognition accuracy across all of the evaluated benchmarks compared to batch learning, a property that is not shared by the other tested curriculum learning and label smoothing methods.

Our code is available at https://github.com/MichiganCOG/LILAC_v2.
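To make the two-phase structure concrete, the following is a minimal sketch of the resulting training schedule; it is our own illustration rather than the released implementation, and the generator interface and names are assumptions. The constants in the example comment correspond to the CIFAR-10 settings reported later in the Experimental Setup and Table 4 (b set to half the labels, m = 2 new labels per interval, E = 7 epochs per interval, AC threshold T = 150, 300 epochs in total).

import itertools

def lilac_schedule(num_labels, b, m, E, T, total_epochs):
    # Yield (epoch, labels_revealed, phase) tuples describing LILAC's training schedule.
    revealed, epoch = b, 0
    # Incremental Label Introduction: reveal m more labels every E epochs.
    while revealed < num_labels:
        for _ in range(E):
            yield epoch, revealed, "IL"
            epoch += 1
        revealed = min(revealed + m, num_labels)
    # All labels known: plain batch learning, then Adaptive Compensation after epoch T.
    while epoch < total_epochs:
        yield epoch, revealed, "batch" if epoch < T else "AC"
        epoch += 1

# Example (CIFAR-10-like settings): inspect the first few steps of the schedule.
for step in itertools.islice(lilac_schedule(10, 5, 2, 7, 150, 300), 3):
    print(step)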
Curriculum Learning
Bengio et al. [3], Florensa et al. [12], and Graves et al. [15] are some important works that have redefined and applied curriculum learning in the context of deep networks. These ideas were expanded upon to show improvements in performance across corrupted [19] and small datasets [11]. More recently, Hacohen and Weinshall [16] explored the impact of varying the pace with which samples were introduced while Weinshall [38] used alternative deep networks to categorize difficult samples. To the best of our knowledge, most previous works have assumed that samples cover a broad spectrum of difficulty and hence need to be categorized and presented in an orderly fashion. The closest relevant work to ours, in terms of learning labels, gradually varies the GT vector from a multimodal distribution to a one-hot vector over the course of the training phase [8].
Label Smoothing
Label smoothing techniques regularize deep networks by penalizing the objective function based on a pre-defined criterion. Such criteria include using a mixture of true and noisy labels [39], penalizing highly confident outputs [28], and using an alternate deep network's outcomes as GT [30]. Bagherinezhad et al. [2] proposed the idea of using logits from trained models instead of just one-hot vectors as GT. Complementary work by Miyato et al. [27] used the local distributional smoothness, based on the robustness of a model's distribution around a data point, to smooth labels. The work closest to our method was proposed in Szegedy et al. [36], where an alternative target distribution was used across the entire dataset. Instead, we propose to alter the GT vector for only samples that are misclassified. They are identified using a prior copy of the current model, which helps avoid external computational overhead and only uses a small set of operations.
Incremental Learning and Negative Mining
Incremental and Continual learning are closely related fields that inspired the structure of our algorithm. Their primary concern is learning over evolving data distributions with the addition of constraints on the storage memory [5, 29], distillation of knowledge across different distributions [33, 34], assumption of a single pass over data [6, 26], etc. In our approach, we depart from the assumption of evolving data distributions. Instead, we adopt the experimental pipeline used in incremental learning to introduce new labels at regular intervals. At the same time, inspired by negative mining [4, 24, 37], we use the remaining training data, associated with a pseudo-label, as negative samples. Overall, our setup effectively uses the entire training dataset, thus maintaining the same data distribution.

In LILAC, our main objective is to improve upon batch learning. We do so by first gradually learning labels, in fixed increments, until all GT labels are known to the network (Section 3.1). This behaviour assumes that all samples are of equal difficulty and are available to the network throughout the training phase. Further, we focus on learning strong representations of each class over a dedicated period of time. Once all GT labels are known, we shift to regularizing previously misclassified samples by smoothing the distribution of their target vector while maintaining the peak at the same GT label (Section 3.2). Using a smoother distribution leads to an increase in the entropy of the target vector and helps the network learn better, as we demonstrate in Section 4.2.

In the IL phase, we partition data into two sets: S, a subset of GT labels and their corresponding data; and U, the remaining data marked as negative samples using a pseudo-label ρ. Over the course of multiple intervals of training, we reveal more GT labels to the network according to a predetermined schedule. Within a given interval of training, the data partition is held fixed and we uniformly sample mini-batches from the entire training set based on their GT label. However, for samples from U, we use ρ as their label. There is no additional change required in the objective function or the outputs of the model when we sample data from U. By the end of this phase, we reveal all GT labels to the network.

For a given dataset, we assume a total of L labels are provided in the ascending order of their value. Based on this ordering, we initialize the first b labels, and their corresponding data, as S, and the data corresponding to the remaining L − b labels as U. Over the course of multiple training intervals, we reveal GT labels in increments of m, a hyper-parameter that controls the schedule of new label introduction. Revealing a GT label involves moving the corresponding data from U to S and using their GT label instead of ρ.

Within a training interval, we train the network for E epochs using the current state of the data partition. First, we sample a mini-batch of data based on a uniform prior over their GT labels. Then, we modify their target vectors based on the partition to which a sample belongs. To ensure the balanced occurrence of samples from GT labels and ρ, we augment or reduce the number of samples from U to match those from S and use this curated mini-batch to train the network. After E epochs, we move m new GT labels and their corresponding data from U to S and repeat the entire process (Fig. 2).

Figure 2: Illustration of the steps in the IL phase when (Top) only one GT label is in S and (Bottom) when two GT labels are in S. The steps are 1) partition data, 2) sample a mini-batch of data and 3) balance the number of samples from U to match those from S in the mini-batch before training. Samples from U are assumed to have a uniform prior when being augmented/reduced to match the total number of samples from S. Values inside each pie represent the number of samples. Across both cases, the number of samples from S determines the final balanced mini-batch size.
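As an illustration of the balancing step in Figure 2, the sketch below builds one IL-phase mini-batch. It is a hedged reconstruction, not the authors' code: build_il_minibatch, PSEUDO_LABEL, and the list-of-pairs batch format are our own assumptions.

import random

PSEUDO_LABEL = -1  # stands in for the pseudo-label rho; any index unused by real classes works

def build_il_minibatch(batch, revealed_labels):
    # `batch` is a list of (x, y) pairs sampled with a uniform prior over GT labels.
    # Samples whose label is not yet revealed form U and receive the pseudo-label;
    # U is then augmented or reduced so it contributes as many samples as S.
    s = [(x, y) for x, y in batch if y in revealed_labels]
    u = [(x, PSEUDO_LABEL) for x, y in batch if y not in revealed_labels]
    if not u:                       # all labels revealed: nothing to balance
        return s
    if len(u) >= len(s):            # reduce U to the size of S
        u = random.sample(u, len(s))
    else:                           # augment U by resampling it (uniform prior over U)
        u = u + random.choices(u, k=len(s) - len(u))
    return s + u

Under this reading, the final mini-batch size fluctuates with the number of samples drawn from S, which is the behaviour the DBS baseline described later is designed to imitate.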
Once all the GT labels have been revealed and the network has trained sufficiently, we begin the AC phase. In the AC phase, we use a smoother distribution for the target vector of samples which the network is unable to correctly classify. Compared to one-hot vectors, optimizing over this smoother distribution, with an increased entropy, can bridge the gap between the unequal distances in the embedding space and overlaps in the label space [32]. This overlap can occur due to common image content or close proximity in the embedding space relative to other classes. Thus, improving the entropy of such target vectors can help modify the embedding space in the next epoch and compensate for the predictions of misclassified samples.

For a sample (x_i, y_i) in epoch e ≥ T, we use predictions from the model at epoch e − 1, denoted θ^{e−1}. Here, (x_i, y_i) denotes a training sample and its corresponding GT label for sample index i, and T represents a threshold epoch value until which the network is trained without adaptive compensation. We compute the final target vector for the i-th instance at epoch e, t_i^e, based on the model θ^{e−1} using the following equation,

t_i^e = \begin{cases} \dfrac{\varepsilon L - 1}{L - 1}\,\delta_{y_i} + \dfrac{1 - \varepsilon}{L - 1}\,\mathbf{1}, & \arg\max\big(f_{\theta^{e-1}}(x_i)\big) \neq y_i, \\ \delta_{y_i}, & \text{otherwise.} \end{cases} \qquad (1)

Here, δ_{y_i} represents the one-hot vector corresponding to GT label y_i, 1 is a vector of L dimensions with all entries as 1, and ε is a scaling hyper-parameter.
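The following is a small sketch of our reading of Eq. (1); adaptive_target and its argument names are illustrative, and the previous-epoch prediction is assumed to be available as a class index.

import numpy as np

def adaptive_target(y_true, prev_pred, num_labels, eps):
    # One-hot target by default; a smoothed target when the copy of the model from
    # the previous epoch misclassified the sample (our reading of Eq. 1).
    one_hot = np.zeros(num_labels)
    one_hot[y_true] = 1.0
    if prev_pred == y_true:
        return one_hot
    L = num_labels
    return ((eps * L - 1) / (L - 1)) * one_hot + ((1 - eps) / (L - 1)) * np.ones(L)

# e.g. adaptive_target(3, 7, 10, eps=0.5): 0.5 at index 3 and (1 - 0.5)/9 ≈ 0.056 elsewhere.

Under this reading, the smoothed vector peaks at ε on the GT label and spreads the remaining 1 − ε uniformly over the other L − 1 labels, so ε near 1 recovers a one-hot vector while ε near 1/L approaches a flat distribution, which matches the later observation that extreme ε values make the target either too sharp or too flat.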
Datasets and Metrics
We use three datasets, CIFAR-10, CIFAR-100 [20], and STL-10 [7], to evaluate our method and validate our claims. CIFAR-10 and CIFAR-100 are 10 and 100 class variants of the popular image benchmark CIFAR while STL-10 is a 10 class subset of ImageNet. Average Recognition Accuracy (%), combined with its Standard Deviation across 5 trials, is used to evaluate the performance of all the algorithms.
Experimental Setup
For CIFAR-10/100, we use ResNet18 [17] as the architectural backbone while for STL-10, we use ResNet34. We set ρ as the last label and b as half the total number of labels of a given dataset. In each interval of LILAC's IL phase, we train the model for 7, 3, and 10 epochs each, at a learning rate of 0.1, 0.01, and 0.1 for CIFAR-10, CIFAR-100, and STL-10, respectively. In the AC phase, epochs 150, 220, and 370 are used as thresholds (epoch T) for CIFAR-10, CIFAR-100, and STL-10, respectively. Detailed explanations of the experimental setups are provided in the supplementary materials.

Baselines
1. Stochastic Gradient Descent with mini-batches (Batch Learning).
2. Standard Baselines
• Fixed Curriculum: Following the methodology proposed in Bengio et al. [3], we create a "Simple" subset of the dataset using data that is within a value of 1.1 as predicted by a linear one-vs-all SVR model. The deep network is trained on the "Simple" dataset for a fixed period of time, which mirrors the total length of the IL phase, after which the entire dataset is used to train the network.
• Label Smoothing: We follow the method proposed in Szegedy et al. [36].
3. Custom Baselines
• Dynamic Batch Size (DBS): DBS randomly copies data available within a mini-batch to mimic variable batch sizes, similar to the IL phase. However, all GT labels are available to the model throughout the training process.
• Random Augmentation (RA): This baseline samples from a single randomly chosen class in U, available in the current mini-batch, to balance data between S and U in the current mini-batch. This is in contrast to LILAC, which uses samples from all classes in U that are available in the current mini-batch.
4. Ablative Baselines
• Only IL: This baseline quantifies the contribution of incrementally learning labels when combined with batch learning.
• Only AC: This baseline shows the impact of adaptive compensation, as a label smoothing technique, when combined with batch learning.
Table 1 illustrates the improvement offered by LILAC over Batch Learning, with comparable setups. Further, we break down the contributions of each phase of LILAC. Both Only IL and Only AC improve over batch learning, albeit to varying degrees, which highlights their individual strengths and importance. However, only when we combine both phases do we observe a consistently high performance across all benchmarks. This indicates that these two phases complement each other.

The Fixed Curriculum approach does not offer consistent improvements over the Batch Learning baseline across CIFAR-100 and STL-10 while the Label Smoothing approach does not outperform batch learning on the STL-10 dataset. While both of these standard baselines fall short, LILAC consistently outperforms Batch Learning across all evaluated benchmarks. Interestingly, Label Smoothing provides the highest performance on CIFAR-100. Since the original formulation of LILAC was based on Batch Learning, we assumed all GT vectors to be one-hot. This assumption is violated in Label Smoothing. When we tailor our GT vectors according to the Label Smoothing baseline, we outperform it with minimal hyper-parameter changes, a testament to LILAC's applicability on top of conventional label smoothing.
Types                CIFAR-10    CIFAR-100    STL-10
Batch Learning       95.19 ±     ±            ±
Only IL (ours)       95.38 ±     ±            ±
Only AC (ours)       95.38 ±     ±            ±
LILAC (ours)         ±           ±            ±
LS + LILAC (ours)    95.34 ±     ±            ±

Table 1: Training performance (%) across CIFAR-10, CIFAR-100, and STL-10.

Shake-Drop + LILAC (ours)    96.79

Table 2: LILAC easily outperforms the Shake-Drop network [40] as well as other top performing algorithms on CIFAR-10 with standard pre-processing (random crop + flip).

The RA baseline highlights the importance of using all of the data in U as negative samples in the IL phase as opposed to using data from individual classes. This is reflected in the boost in performance offered by LILAC. The DBS baseline is used to highlight the importance of fluctuating mini-batch sizes, which occur due to the balancing of data in the IL phase. Even with the availability of all labels and fluctuating batch sizes, the DBS baseline is easily outperformed by LILAC. This indicates the importance of the recursive structure used to introduce data in the IL phase as well as the use of data from U as negative samples. Overall, LILAC consistently outperforms Batch Learning across all benchmarks while existing comparable methods fail to do so. When we extend LILAC to the Shake-Drop [40] network architecture, with only standard pre-processing, we easily outperform other existing approaches with comparable setups, as shown in Table 2.

Smoothness of Target Vector (ε)
Throughout this work, we maintained the importance of using a smoother distribution as the alternate target vector during the AC phase. Table 3
(Top) illustrates the change in performance across varying degrees of smoothness in the alternate target vector. There is a clear increase in performance when ε values are between 0.7-0.4 (mid-range). On either side of this band of values, the GT vector is either too sharp or too flat, leading to a drop in performance.

Property     CIFAR-10    CIFAR-100    STL-10
m: 1         95.32 ±     ±            ±
m: 2 (4)     ±           ±            ±
m: 4 (8)     95.29 ±     ±            ±

Table 3: (Top) The mid-range of ε values, 0.7-0.4, shows an increase in performance while the edges, due to either too sharp or too flat a distribution, show decreased performance. (Bottom) Only IL model results illustrate the importance of introducing a small number of new labels in each interval of the IL phase. Values in brackets are for CIFAR-100.
Size of Label Groups (m)
LILAC is designed to introduce as many or as few new labels as desired in the IL phase. We hypothesized that developing stronger representations can be facilitated by introducing a small number of new labels while contrasting it against a large variety of negative samples. Table 3 (Bottom) supports our hypothesis by illustrating the decrease in performance with an increase in the number of new labels introduced in each interval of the IL phase. Thus, we introduce two labels each for CIFAR-10 and STL-10 and only one new label per interval for CIFAR-100 throughout the experiments in Table 1.
In this section, we take a closer look at the impact of each phase of LILAC and how they affect the quality of the learned representations. We extract features from the second to last layer of ResNet18/34 from 3 different baselines (Batch Learning, LILAC, and Only IL) and use these features to train a linear SVM model.

Fig. 3 highlights the two important phases in our algorithm. First, the plots on the left-hand side show a steady improvement in the performance of LILAC and the Only IL baseline once the IL phase is complete and all the labels have been introduced to the network. When we compare the plots of CIFAR-10 and STL-10 against CIFAR-100, we see that all baselines follow the learning trend shown by Batch Learning, with CIFAR-100 being slightly delayed. Since there are a large number of epochs required to introduce all the labels of CIFAR-100 to the network, the plots are significantly delayed compared to batch learning. Conversely, since there are very few epochs in the IL phase of CIFAR-10 and STL-10, we observe the performance trend of Only IL and LILAC quickly match that of Batch Learning. Overall, the final performances of both LILAC and the Only IL baseline are higher than Batch Learning, which supports the importance of the IL phase in learning strong representations.
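As a hedged sketch of the probing protocol described above, not the authors' evaluation code: the backbone is assumed to be a torchvision-style ResNet whose final fully connected layer has been replaced by an identity so that its forward pass returns penultimate-layer features, and the SVM hyper-parameters are arbitrary choices.

import numpy as np
import torch
from sklearn.svm import LinearSVC

def extract_features(backbone, loader, device="cpu"):
    # Collect penultimate-layer features; backbone.fc is assumed to be nn.Identity().
    backbone.eval()
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(backbone(x.to(device)).cpu().numpy())
            labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Train a linear SVM on training-set features and score it on test-set features:
# X_train, y_train = extract_features(backbone, train_loader)
# X_test, y_test = extract_features(backbone, test_loader)
# svm = LinearSVC(C=1.0, max_iter=5000).fit(X_train, y_train)
# print(svm.score(X_test, y_test))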
Figure 3: Plots on the (Left) show the common learning trend between all baselines, albeit slightly delayed for CIFAR-100, after the IL phase while those on the (Right) show steady improvement in performance after applying AC when compared to the Only IL baseline. Final supervised classification performances on representations collected from LILAC easily outperform those from Batch Learning and Only IL methods. (Panels plot supervised clustering performance, recognition accuracy (%) against epochs; final accuracies: CIFAR-10 Baseline 94.57%, Only IL 94.68%, LILAC 94.80%; STL-10 Baseline 72.79%, Only IL 73.24%, LILAC 73.60%; CIFAR-100 Baseline 76.95%, Only IL 77.32%, LILAC 77.62%.)

Figure 4: Illustration of 8 randomly chosen samples that were incorrectly labelled by the Only IL baseline and correctly labelled by LILAC, for (a) CIFAR-10, (b) CIFAR-100, and (c) STL-10. This highlights the importance of AC.
The plots on the right-hand side highlight the similarity in behaviour of Only IL and LILAC before AC. However, afterward, we observe that the performance of LILAC overtakes the Only IL baseline. This is a clear indicator of the improvement in representation quality when AC is applied. Additionally, from Fig. 3 we observe that the STL-10 dataset results inherently have a high standard deviation, which is reflected in the middle portion of the training phase, between the end of IL and the beginning of AC, and it is not a consequence of our approach. We provide examples in Fig. 4 of randomly sampled data from the testing set that were incorrectly classified by the Only IL baseline and were correctly classified by LILAC.
In this work, we proposed LILAC, which rethinks curriculum learning based on incrementally learning labels instead of samples. This approach helps kick-start the learning process from a substantially better starting point while making the learned embedding space amenable to adaptive compensation of target vectors. Both these techniques combine well in LILAC to show the highest performance on CIFAR-10 for simple data augmentations while easily outperforming batch and curriculum learning and label smoothing methods on comparable network architectures. The next step in unlocking the full potential of this setup is to include a confidence measure on the predictions of the network so that it can handle the effects of dropout or partial inputs. In further expanding LILAC's ability to handle partial inputs, we aim to explore its effect on standard incremental learning (memory-constrained) while also extending its applicability to more complex neural network architectures.
This work was in part supported by NSF NRI IIS 1522904 and NIST 60NANB17D191. The findings and views represent those of the authors alone and not the funding agencies. The authors would also like to thank members of the COG lab for their invaluable input in putting together and refining this work.
References
[1] Judith Avrahami, Yaakov Kareev, Yonatan Bogot, Ruth Caspi, Salomka Dunaevsky, and Sharon Lerner. Teaching by examples: Implications for the process of category acquisition. The Quarterly Journal of Experimental Psychology Section A, 50(3):586–606, 1997.
[2] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving ImageNet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
[4] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Hard negative mining for metric learning based zero-shot classification. In Gang Hua and Hervé Jégou, editors, Computer Vision – ECCV 2016 Workshops, pages 524–531, Cham, 2016. Springer International Publishing. ISBN 978-3-319-49409-8.
[5] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.
[6] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019.
[7] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[8] Urun Dogan, Aniket Anand Deshmukh, Marcin Machura, and Christian Igel. Label-similarity curriculum learning. arXiv preprint arXiv:1911.06902, 2019.
[9] Jeffrey L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.
[10] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pages 153–160, 2009.
[11] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In International Conference on Learning Representations, 2018.
[12] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 482–495. PMLR, 13–15 Nov 2017.
[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[14] Ben Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
[15] Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1311–1320. JMLR.org, 2017.
[16] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 2535–2544, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[19] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 2304–2313. PMLR, 10–15 Jul 2018.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[21] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[22] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473–480. ACM, 2007.
[23] Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10(Jan):1–40, 2009.
[24] Xirong Li, Cees G. M. Snoek, Marcel Worring, Dennis Koelma, and Arnold W. M. Smeulders. Bootstrapping visual categorization with relevant negatives. IEEE Transactions on Multimedia, 15(4):933–945, 2013.
[25] Senwei Liang, Yuehaw Kwoo, and Haizhao Yang. Drop-activation: Implicit parameter reduction and harmonic regularization. arXiv preprint arXiv:1811.05850, 2018.
[26] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[27] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. In International Conference on Learning Representations, 2016.
[28] Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. CoRR, 2017.
[29] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
[30] Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In International Conference on Learning Representations, 2015.
[31] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[32] Pau Rodríguez, Miguel A. Bautista, Jordi Gonzalez, and Sergio Escalera. Beyond one-hot encoding: Lower dimensional target embedding. Image and Vision Computing, 75:21–31, 2018.
[33] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pages 348–358, 2019.
[34] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4535–4544, 2018.
[35] Burrhus F. Skinner. Reinforcement today. American Psychologist, 13(4):94, 1958.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[37] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
[38] Daphna Weinshall, Gad Cohen, and Dan Amir. Curriculum learning by transfer learning: Theory and experiments with deep networks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 5238–5246, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.
[39] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
[40] Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. ShakeDrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
[41] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
[42] Ke Zhang, Miao Sun, Tony X. Han, Xingfang Yuan, Liru Guo, and Tao Liu. Residual networks of residual networks: Multilevel residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(6):1303–1314, 2017.
A Experimental Setup
In Table 4 we list the general hyper-parameters used to train the batch learning portion of every baseline. This setup covers the training beyond the IL phase for LILAC, DBS, RA, and Only IL as well as the Only AC baseline. Across all the methods we ensure that the total number of training epochs, when all the labels in the dataset are known, is held constant.

Parameters           CIFAR-10/100      STL-10
Epochs               300               450
Batch Size           128               128
Learning Rate        0.1               0.1
Lr Milestones        [90 180 260]      [300 400]
Weight Decay         0.0005            0.0005
Nesterov Momentum    Yes               Yes
Gamma                0.2               0.1

Table 4: List of hyper-parameters used in batch learning. Note: All experiments used the SGD optimizer.
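A hedged PyTorch sketch of the CIFAR-10/100 column of Table 4 follows; the momentum value of 0.9 is an assumption (Table 4 only states that Nesterov momentum is used), and torchvision's ResNet18 stands in for the backbone used in the main experiments.

import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)                     # ResNet18 backbone, as used for CIFAR-10
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                                          # Learning Rate
    momentum=0.9,                                    # assumed value; Table 4 only says Nesterov is used
    nesterov=True,                                   # Nesterov Momentum: Yes
    weight_decay=0.0005,                             # Weight Decay
)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[90, 180, 260], gamma=0.2  # Lr Milestones and Gamma
)
# 300 epochs with mini-batches of 128; scheduler.step() is called once per epoch.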
B Hyper-parameter Selection
Property             CIFAR-10    CIFAR-100    STL-10
E = 10               95.26 ±     ±            ±
Label Order: Rnd.    95.30 ±     ±            ±

Table 5: (Top) Varying E, the fixed training interval size in the IL phase, shows a dataset-specific behaviour, with the dataset with fewer labels preferring a larger number of epochs while the dataset with more labels prefers a smaller number of epochs. (Bottom) Comparing random label ordering and difficulty-based label ordering to the ascending order assumption used throughout our experiments, we observe no preference for any ordering pattern.
Epochs in Training Interval
When we vary E, the fixed training interval size in the IL phase, we observe a dataset-specific behaviour. For datasets with a smaller number of total labels, a larger number of epochs provides better performance while for datasets with more labels, a smaller number of epochs yields better performance. While the alternate learning rate can have a huge impact on this performance, pacing the introduction of new labels, according to the empirical results, can have a tremendous impact on subsequent hyper-parameters used in LILAC.
Figure 5: Unsupervised classification performance on representations collected from LILAC easily outperforms those collected from Batch Learning and Only IL methods. The plots on the left show the common learning trend between all baselines after IL while plots on the right show steady improvement in performance after applying AC when compared to the baselines. (Panels plot unsupervised clustering performance, recognition accuracy (%) against epochs; final accuracies: CIFAR-10 Baseline 95.22%, Only IL 95.38%, LILAC 95.46%; STL-10 Baseline 73.02%, Only IL 73.48%, LILAC 73.77%.)
Label Order
In Table 5, we compare three different orders of label introduction during the IL phase: 1) random label order, 2) difficulty-based label order, and 3) ascending label order. Here, difficulty-based label order is obtained from the overall classification scores per label, obtained from the features of a trained model. Although these three orders do not constitute the exhaustive set of possible label orderings, within these three possibilities there is no definitive order that boosts the performance of LILAC consistently. Thus, we employ ascending label order throughout our work.

NOTE: The Only IL baseline is used throughout Table 5.