Few-Shot Unsupervised Continual Learning through Meta-Examples
Alessia Bertugli, Stefano Vincenzi, Simone Calderara, Andrea Passerini
Alessia Bertugli ∗ University of Trento [email protected]
Stefano Vincenzi ∗ University of Modena and Reggio Emilia [email protected]
Simone Calderara
University of Modena and Reggio Emilia [email protected]
Andrea Passerini
University of Trento [email protected]
Abstract
In real-world applications, data do not reflect the ones commonly used for neural network training, since they are usually scarce, unlabeled and can arrive as a stream. Hence many existing deep learning solutions suffer from a limited range of applications, in particular in the case of online streaming data that evolve over time. To narrow this gap, in this work we introduce a novel and complex setting involving unsupervised meta-continual learning with unbalanced tasks. These tasks are built through a clustering procedure applied to a fitted embedding space. We exploit a meta-learning scheme that simultaneously alleviates catastrophic forgetting and favors the generalization to new tasks. Moreover, to encourage feature reuse during the meta-optimization, we exploit a single inner loop taking advantage of an aggregated representation achieved through a self-attention mechanism. Experimental results on few-shot learning benchmarks show competitive performance even compared to the supervised case. Additionally, we empirically observe that in an unsupervised scenario, the small tasks and the variability in the cluster pooling play a crucial role in the generalization capability of the network. Further, on complex datasets, exploiting more clusters than the true number of classes leads to better results, even compared to the ones obtained with full supervision, suggesting that a predefined partitioning into classes can miss relevant structural information. The code is available at https://github.com/alessiabertugli/FUSION-ME
Introduction

Continual learning has been widely studied in the last few years to solve the catastrophic forgetting problem that affects neural networks. Several methods [1–6] have been proposed to solve this problem, involving a replay buffer, network expansion, selective regularization and distillation. Some works [7–12] take advantage of the meta-learning abilities of generalization on different tasks and rapid learning on new ones to deal with continual learning problems. Few works on unsupervised meta-learning [13–15] and unsupervised continual learning [16] have been recently proposed, but the former deal with independent and identically distributed data, while the latter assumes the availability of a huge dataset. Moreover, the majority of continual learning and meta-learning works assume that data are perfectly balanced or equally distributed among classes. We propose a new, more realistic setting, namely FUSION (Few-shot UnSupervIsed cONtinual learning), dealing with unlabeled and unbalanced tasks in a meta-continual learning fashion, and a novel method, MEML (Meta-Example Meta-Learning), that is able to face this complex scenario.

∗ Equal contribution.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), 4th Workshop on Meta-Learning.
Figure 1: Supervised vs unsupervised task flow. In the supervised version, tasks are perfectly balanced and contain a fixed number of elements for the inner loop (10 samples) and the outer loop (15 samples, 5 from the current cluster and 10 randomly sampled from other clusters). In the unsupervised version, tasks are unbalanced and contain a portion of the cluster data for the inner loop and the remaining portion for the outer loop, in addition to a fixed number of random samples.

In the task construction phase, rather than directly exploiting high-dimensional raw data, an embedding learning network is used to learn a fitted embedding space that facilitates clustering. Precisely, the k-means algorithm is applied to build tasks composed of unbalanced data, each one with the assigned pseudo-label. Our meta-learning model relies on a double-loop procedure that receives data in an online incremental learning fashion. The classification layers are learned through a single inner loop update, adopting an attentive mechanism that extracts the most relevant features of the current unbalanced task (the meta-example); this considerably reduces training time and memory usage. In the outer loop, to avoid forgetting and improve generalization, we train all model layers exploiting, as input, an ensemble of data of the same class of the stream and data randomly sampled from the overall trajectory (see Figure 1). We test our model and setup on Omniglot [17] and Mini-ImageNet [18], achieving favorable results compared to baseline approaches. We show the importance of performing the single inner loop update on the meta-example with respect to both updating over a random sample and updating over multiple samples of the same task. We empirically verify that with tasks generated in an unsupervised manner, the need for balanced data is not crucial compared to the variability in the data and the exploitation of small clusters.

We propose a novel setting that deals with unsupervised meta-continual learning and study the effect of the unbalanced tasks derived by an unconstrained clustering approach. As done in [13], the task construction phase exploits the k-means algorithm over suitable embeddings obtained through an unsupervised pre-training. This simple but effective method assigns the same pseudo-label to all data points belonging to the same cluster. The first step employs two different models: DeepCluster [19] for Mini-ImageNet and ACAI [20] for Omniglot. Both methods are trained without supervision and produce an embedding vector set Z = {Z_1, Z_2, ..., Z_N}, where N is the number of data points in the training set. ACAI is based on an autoencoder, while DeepCluster relies on a deep feature extraction phase followed by k-means clustering. They are among the most promising approaches for discovering meaningful latent features from unlabeled, high-dimensional data. Applying k-means over these embeddings leads to unbalanced clusters, which determine unbalanced tasks. This is in contrast with typical meta-learning and continual learning problems, where data are perfectly balanced.
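As a minimal illustration of the clustering-based task construction described above, the sketch below groups precomputed embeddings into pseudo-labeled, deliberately unbalanced clusters; the data layout, function names and the use of scikit-learn's k-means are our own assumptions, not the authors' code (the embeddings are assumed to come from ACAI or DeepCluster).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_unbalanced_tasks(embeddings, num_clusters, seed=0):
    """Cluster precomputed embeddings and group sample indices by pseudo-label.

    embeddings: (N, D) array from an unsupervised encoder (e.g. ACAI or DeepCluster).
    Returns a dict {pseudo_label: [sample indices]}; cluster sizes are deliberately
    left unbalanced, so the derived tasks are unbalanced as well.
    """
    pseudo_labels = KMeans(n_clusters=num_clusters, random_state=seed).fit_predict(embeddings)
    tasks = {}
    for idx, label in enumerate(pseudo_labels):
        tasks.setdefault(int(label), []).append(idx)
    return tasks

# Hypothetical usage with 1100 clusters on Omniglot-style embeddings:
# Z = np.load("acai_embeddings.npy")            # shape (N, D)
# tasks = build_unbalanced_tasks(Z, num_clusters=1100)
```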
To recover a balanced setting, the authors of [13] set a threshold on the cluster dimension, discarding extra samples and smaller clusters. A recent alternative [21] forces the network to balance clusters, but this imposes a partitioning of the embedding space that contrasts with the extracted features. We believe that these approaches are sub-optimal as they alter the data distribution. In an unsupervised setting, where data points are grouped based on the similarity of their features, variability is an essential factor. By also keeping the small tasks, our model generalizes better and reaches higher accuracy at meta-test time. In a data-imbalanced setting, the obtained meta-representation is more influenced by large clusters. Since the latter may contain more generic features than the smaller ones, the model is able to generalize better by mostly learning from them. Despite this, the small clusters may contain important information for different classes presented during evaluation. To corroborate this claim, we investigate balancing techniques both at data-level, such as data augmentation, and at model-level, such as a balancing parameter in the loss term.

Once the tasks are built, they are sampled one at a time for the meta-continual training. The training process happens in a class-incremental way, where a task corresponds to one cluster. During this phase, the network has to learn a good representation that is able to generalize to unseen tasks, while avoiding forgetting. The meta-train phase relies completely on pseudo-labels. At meta-continual test time, novel and unseen tasks are presented to the network. The representation learned during meta-train remains fixed, while only the prediction layers are fine-tuned, testing on few data of novel classes.

Our network is composed of a Feature Extraction Network (FEN) and a CLassification Network (CLN), both updated during the meta-training phase through a meta-learning procedure based on the construction of a meta-example. MAML and all its variants rely on a two-loop mechanism that allows learning new tasks from a few steps of gradient descent. Recent investigations of this algorithm show that the real reason for MAML's success resides in feature reuse instead of rapid learning [22], proving that learning meaningful representations is a crucial factor. Based on this assumption, we focus on the generalization ability of the feature extraction layers. We remove the need for several inner loops, maintaining a single inner loop update through an attentive procedure that considerably reduces the training time and computational resources needed to train the model and increases the global performance. At each time-step, as shown in Figure 1, a task T_i = (S_cluster, S_query) is randomly sampled from the task distribution p(T). S_cluster contains elements of the same cluster and is defined as S_cluster = {(X_k, Y_k)}_{k=0}^{K}, with Y_0 = ... = Y_K the cluster pseudo-label. Instead, S_query contains a variable number of elements belonging to the current cluster and a fixed number of elements randomly sampled from all other clusters, and is defined as S_query = {(X_q, Y_q)}_{q=0}^{Q}. All the elements belonging to S_cluster are processed by the frozen FEN, parameterized by θ, computing the feature vectors R_1, R_2, ..., R_K in parallel for all task elements as R_k = f_θ(X_k). The obtained embeddings are refined with an attention function, parameterized by ρ, which computes the attention coefficients α from the feature vectors:

$$ \alpha_k = \mathrm{Softmax}\left[ f_\rho(R_k) \right]. \quad (1) $$

Then, the final aggregated representation vector ME, called meta-example, captures the most salient features, and is computed as follows:
$$ ME = \sum_{k=0}^{K} R_k \cdot \alpha_k. \quad (2) $$

The single inner loop is performed on this meta-example, which adds up the weighted feature contributions of each element of the current cluster. Then, the cross-entropy loss ℓ between the predicted label and the pseudo-label is computed, and both the classification network parameters W_i and the attention parameters ρ (ψ = {W_i, ρ}) are updated as follows:

$$ \psi \leftarrow \psi - \alpha \nabla_{\psi}\, \ell_i\big(f_{\psi}(ME), Y\big), \quad (3) $$

where α is the inner loop learning rate. Finally, to update the whole network parameters φ = {θ, W_i, ρ}, and to ensure generalization across tasks, the outer loop loss is computed on S_query. The outer loop parameters are thus updated as follows:

$$ \varphi \leftarrow \varphi - \beta \nabla_{\varphi}\, \ell_i\big(f_{\varphi}(X_Q), Y_Q\big), \quad (4) $$

where β is the outer loop learning rate.
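The following is a minimal PyTorch sketch of the meta-example construction of Equations (1)-(2) and the single inner-loop step of Equation (3); the module sizes, the two-layer Tanh attention and the explicit gradient step are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaExample(nn.Module):
    """Aggregates the feature vectors of one cluster into a single meta-example."""
    def __init__(self, feat_dim, hidden_dim=64):   # hidden_dim is an assumed value
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))

    def forward(self, features):                    # features: (K, feat_dim) from the FEN
        alpha = F.softmax(self.attention(features), dim=0)   # Eq. (1)
        return (alpha * features).sum(dim=0)                 # Eq. (2): weighted sum over the cluster

def inner_update(cln, meta_example, pseudo_label, lr_inner):
    """Single inner-loop step (Eq. 3) on the classification network; create_graph=True
    lets the outer loss backpropagate through this step (the attention parameters
    would be updated jointly in the same step)."""
    loss = F.cross_entropy(cln(meta_example.unsqueeze(0)), pseudo_label.view(1))
    params = list(cln.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - lr_inner * g for p, g in zip(params, grads)]
```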
Table 1: Meta-test results on Omniglot.

Algorithm/Classes      10    50    75    100   150   200
Oracle OML [24]        88.4  74.0  69.8  57.4  51.6  47.9
Oracle MEML            -     -     -     -     -     -
OML balanced 500       67.8  27.6  29.4  24.5  18.7  15.8
OML balancing param    59.4  27.2  24.3  18.4  15.5  11.8
OML augmentation       72.2  35.1  32.5  27.5  21.8  17.3
OML                    74.6  32.5  30.6  25.8  19.9  16.1
OML single update      67.5  32.0  30.2  24.3  18.4  15.3
MEML mean              60.6  31.2  25.8  21.3  17.0  13.7
MEML (Ours)            -     -     -     -     -     -
Table 2: Meta-test results on Mini-ImageNet.
Algorithm/Classes   2     4     6     8     10
Oracle OML [24]     50.0  31.9  27.0  16.7  13.9
Oracle MEML         -     -     -     -     -
OML                 49.3  41.0  19.2  18.2  12.0
MEML 64             58.0  41.2  -     -     -
MEML 512            54.7  36.4  26.2  14.1  21.4
MEML 64 RS          54.0  39.0  31.2  27.3  16.4
Balanced vs. Unbalanced Tasks
To justify the use of unbalanced tasks and show that allowing unbalanced clusters is more beneficial than enforcing fewer balanced ones, we present in Table 1 some comparisons on the Omniglot dataset. First of all, we introduce a baseline in which the number of clusters is set to the true number of classes, removing from the task distribution the ones containing less than N elements and sampling N elements from the bigger ones (OML). We thus obtain a perfectly balanced training set at the cost of less variety within the clusters; however, this leads to poor performance, as small clusters are never represented. Setting a smaller number of clusters than the number of true classes gives the same results (OML balanced 500). This test shows that cluster variety is more important than balancing for generalization. To verify whether maintaining variety while balancing data can lead to better performance, we try two balancing strategies: augmentation, at data-level, and a balancing parameter, at model-level. For the first one, we keep all clusters, sampling N elements from the bigger ones and using data augmentation on the smaller ones to reach N elements (OML augmentation). At model-level, we multiply the loss term by a balancing parameter, to weight the update for each task based on cluster length (OML balancing param). These two tests, especially the latter one, result in lower performance with respect to the unbalanced setting, suggesting that what really matters is cluster variety. We can also presume that bigger clusters contain the most meaningful and general features, so unbalancing does not negatively affect the training of our unsupervised meta-continual learning model. Finally, as we want to confirm that this intuition is valid in a more general unsupervised meta-learning model, we perform the balanced/unbalanced experiments also on CACTUs [13]. The results are shown in Table 3 (top) and attest that the model trained on unbalanced data outperforms the balanced one, further proving the importance of task variance to better generalize to new classes at meta-test time. We report the results training the algorithms on 20 ways for generality purposes and on 5 and 15 shots, in order to have enough data points per class to create the imbalance.

Single vs. Multiple Updates
In Table 1, we show that the model trained with the attention-based method consistently outperforms all the other baselines. The single update gives the worst performance, but not far from the multiple-update one, confirming the idea that the strength of generalization relies on feature reuse. Also, the mean test has performance comparable with the multiple and single update ones, proving the effectiveness of the attention mechanism in determining a suitable and general embedding vector for the CLN. Training time and resource consumption are considerably reduced with our model based on a single update on the generated meta-example (see Supplementary Material). We also test our technique in a standard meta-learning setting. We compare our meta-example based algorithm MEML to MAML [23] on the Omniglot dataset in Table 3 (bottom), consistently outperforming it. We report the results training on 20 ways and on 1 and 5 shots. In particular, the multi-shot results highlight the effectiveness of our aggregation method.

MEML vs. Oracles
To see how far the performance of MEML is from that achievable with the real labels, we also report for all datasets the accuracy reached in a supervised setting (oracles), on both Omniglot (see Table 1) and Mini-ImageNet (see Table 2). We define as Oracle OML the supervised model presented in [24], and as Oracle MEML the supervised model updated with our meta-example strategy. Oracle MEML outperforms Oracle OML on Omniglot and Mini-ImageNet, suggesting that the meta-example strategy is beneficial even in a fully supervised case. MEML reaches higher performance compared to the other OML baselines, but lower on Omniglot compared to the Oracle OML. On Mini-ImageNet, our model trained with 256 clusters outperforms both oracles. To further improve the performance, avoiding forgetting at meta-test time, we add a rehearsal strategy based on reservoir sampling on the CLN (MEML RS). This generally results in superior performance on Omniglot. On Mini-ImageNet the performance with and without rehearsal is similar, due to the low number of test classes in the dataset, which alleviates catastrophic forgetting.

Table 3: Balanced vs. unbalanced CACTUs-MAML (top) and MEML, with our meta-example update, compared to basic MAML (bottom) on the Omniglot dataset.

Algorithm/Ways, Shots          5,1    5,5    20,1   20,5
CACTUs-MAML Balanced 20,5      60.50  84.00  40.50  67.62
CACTUs-MAML Unbalanced 20,5    -      -      -      -
CACTUs-MAML Balanced 20,15     67.00  86.00  32.50  64.62
CACTUs-MAML Unbalanced 20,15   -      -      -      -
MAML 20,1                      78.00  97.50  77.62  92.87
MEML 20,1 (Ours)               -      -      -      -
MAML 20,5                      88.00  99.50  74.62  92.75
MEML 20,5 (Ours)               -      -      -      -
Varying the Number of Clusters

In an unsupervised setting, the number of original classes could be unknown. Consequently, it is important to assess the performance of our model when varying the number of clusters at meta-train time. With a coarse-grained clustering, a low number of clusters is formed and distant embeddings can be assigned to the same pseudo-label, grouping classes that can be rather different. On the other hand, with a fine-grained clustering, a high number of clusters with low variance is generated. Both cases lead to poor performance at meta-test time. We test on Omniglot (see Table 1), setting the number of clusters to the true number of classes (OML) and to a lower number of clusters (OML balanced 500), resulting in more than 20 samples each. Since the Omniglot dataset contains 20 samples per class, the first case results in unbalanced tasks, while in the second we sample 20 elements from the bigger clusters. The performance of the 1100-cluster test is consistently higher than that obtained with the 500-cluster test, confirming that variability is more important than balancing. On Mini-ImageNet, we test our method in Table 2 with 64, 128, 256, and 512 clusters (MEML number of clusters). Since Mini-ImageNet contains 600 examples per class, after clustering we sample between 10 and 30 examples, proportionally to the cluster dimension. We obtain the best results with 256 clusters and the meta-example approach, outperforming not only the other unsupervised experiments but also the supervised oracle. We observe that using 512 clusters degrades performance with respect to the 256 case, suggesting that tasks constructed over an embedding space with too specific features fail to generalize. Using a lower number of clusters, such as 64 or 128, also achieves worse performance. In this case, the embedding space is likely aggregating distant features, leading to a complex meta-continual training whose pseudo-classes are not clearly separated.
Related Work

Continual learning is one of the most challenging problems for neural networks, which are heavily affected by catastrophic forgetting. The proposed methods can be divided into three main categories.
Architectural strategies are based on specific architectures designed to mitigate catastrophic forgetting [25, 26].
Regularization strategies are based on putting regularization terms into the loss function, promoting selective consolidation of important past weights [1, 5]. Finally, rehearsal strategies focus on retaining part of past information and periodically replaying it to the model to strengthen memories, involving meta-learning [4, 27], combinations of rehearsal and regularization strategies [2, 3], knowledge distillation [28–31], generative replay [32–34] and channel gating [35]. Only a few recent works have studied the problem of unlabeled data, which mainly involves representation learning. CURL [16] proposes an unsupervised model built on a representation learning network. The latter learns a mixture of Gaussians encoding task variations, then integrates a generative memory replay buffer as a strategy to overcome forgetting.
Meta-learning, or learning to learn, aims to improve the ability of neural networks to rapidly learn new tasks with few training samples. The majority of meta-learning approaches proposed in the literature are based on Model-Agnostic Meta-Learning (MAML) [23, 36–38]. Through the learning of a profitable parameter initialization with a double-loop procedure, MAML limits the number of stochastic gradient descent steps required to learn new tasks, speeding up the adaptation process performed at meta-test time. Although MAML is suitable for many learning settings, few works investigate the unsupervised meta-learning problem. CACTUs [13] proposes a new unsupervised meta-learning method relying on clustering feature embeddings through the k-means algorithm and then building tasks upon the predicted classes. UMTRA [14] is a further unsupervised meta-learning method, based on a random sampling and data augmentation strategy to build meta-learning tasks, achieving results comparable to CACTUs. UFLST [15] proposes an unsupervised few-shot learning method based on self-supervised training, alternating between progressive clustering and updates of the representations.
Meta-learning has been extensively merged with continual learning for different purposes. We can highlight the existence of two strands of literature [39]: meta-continual learning, with the aim of incremental task learning, and continual-meta learning, with the aim of fast remembering. Continual-meta learning approaches mainly focus on making meta-learning algorithms online, with the aim of rapidly remembering meta-test tasks [40, 7, 11]. More relevant to our work are meta-continual learning algorithms [8, 24, 41, 12, 10, 9, 42], which use meta-learning rules to "learn how not to forget". OML [24] and its variant ANML [41] favor sparse representations by employing a trajectory-input update in the inner loop and a random-input update in the outer one. The algorithm jointly trains a representation learning network (RLN) and a prediction learning network (PLN) during the meta-training phase. Then, at meta-test time, the RLN layers are frozen and only the PLN is updated. ANML replaces the RLN with a neuro-modulatory network that acts as a gating mechanism on the PLN activations, following the idea of conditional computation.
Conclusion

In this work, we tackle a novel problem concerning few-shot unsupervised continual learning. We propose a simple but effective model based on the construction of unbalanced tasks and meta-examples. Our model is motivated by the power of representation learning, which relies on few and raw data with no need for human supervision. With an unconstrained clustering approach, we find that no balancing technique is necessary for an unsupervised scenario that needs to generalize to new tasks. In fact, the most robust and general features are gained through task variety; even if favoring larger clusters leads to more general features, smaller ones should not be discarded, as they can be representative of less common tasks. This means that there is no need for complex representation learning algorithms that try to balance cluster elements. A future direction is to investigate this insight more deeply by observing the variability of the embeddings in the feature space. A further improvement consists in the introduction of FiLM layers [43] into the FEN to change the data representation at meta-test time, and the introduction of an OoD detector to deal with Out-of-Distribution tasks. The performance of our model with meta-examples suggests that a single inner update can increase performance if the most relevant features for the task are selected. To this end, a more refined technique, relying on hierarchical aggregation, could be considered.

References
[1] Kirkpatrick, J.N., Pascanu, R., Rabinowitz, N.C., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., Hadsell, R.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America 114(13) (2016) 3521–3526
[2] Lopez-Paz, D., Ranzato, M.A.: Gradient episodic memory for continual learning. In Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., eds.: Advances in Neural Information Processing Systems 30. Curran Associates, Inc. (2017) 6467–6476
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., Elhoseiny, M.: Efficient lifelong learning with A-GEM. In: ICLR. (2019)
[4] Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., Tesauro, G.: Learning to learn without forgetting by maximizing transfer and minimizing interference. In: International Conference on Learning Representations. (2019)
[5] Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. ICML'17, JMLR.org (2017) 3987–3995
[6] Rebuffi, S.A., Kolesnikov, A.I., Sperl, G., Lampert, C.H.: iCaRL: Incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5533–5542
[7] Jerfel, G., Grant, E., Griffiths, T., Heller, K.A.: Reconciling meta-learning and continual learning with online mixtures of tasks. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R., eds.: Advances in Neural Information Processing Systems 32. Curran Associates, Inc. (2019) 9119–9130
[8] Vuorio, R., Cho, D.Y., Kim, D., Kim, J.: Meta continual learning. ArXiv abs/1806.06928 (2018)
[9] Rajasegaran, J., Khan, S., Hayat, M., Khan, F.S., Shah, M.: iTAML: An incremental task-agnostic meta-learning approach. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
[10] Liu, Q., Majumder, O., Ravichandran, A., Bhotika, R., Soatto, S.: Incremental learning for metric-based meta-learners. ArXiv abs/2002.04162 (2020)
[11] Harrison, J., Sharma, A., Finn, C., Pavone, M.: Continuous meta-learning without tasks (2020)
[12] Yao, H., Wei, Y., Huang, J., Li, Z.: Hierarchically structured meta-learning. In: Proceedings of the 36th International Conference on Machine Learning. ICML'19 (2019)
[13] Hsu, K., Levine, S., Finn, C.: Unsupervised learning via meta-learning. In: International Conference on Learning Representations. (2019)
[14] Khodadadeh, S., Bölöni, L., Shah, M.: Unsupervised meta-learning for few-shot image and video classification. ArXiv abs/1811.11819 (2018)
[15] Ji, Z., Zou, X., Huang, T., Wu, S.: Unsupervised few-shot learning via self-supervised training. ArXiv abs/1912.12178 (2019)
[16] Rao, D., Visin, F., Rusu, A.A., Teh, Y.W., Pascanu, R., Hadsell, R.: Continual unsupervised representation learning (2019)
[17] Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266) (2015) 1332–1338
[18] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR09. (2009)
[19] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: European Conference on Computer Vision. (2018)
[20] Berthelot, D., Raffel, C., Roy, A., Goodfellow, I.: Understanding and improving interpolation in autoencoders via an adversarial regularizer. In: International Conference on Learning Representations. (2019)
[21] Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and representation learning. In: International Conference on Learning Representations. (2020)
[22] Raghu, A., Raghu, M., Bengio, S., Vinyals, O.: Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In: ICLR. (2020)
[23] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D., Teh, Y.W., eds.: Proceedings of the 34th International Conference on Machine Learning. Volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, PMLR (06–11 Aug 2017) 1126–1135
[24] Javed, K., White, M.: Meta-learning representations for continual learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R., eds.: Advances in Neural Information Processing Systems 32. Curran Associates, Inc. (2019) 1818–1828
[25] Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., Hadsell, R.: Progressive neural networks. ArXiv abs/1606.04671 (2016)
[26] Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y.W., Pascanu, R., Hadsell, R.: Progress & compress: A scalable framework for continual learning. In: ICML. (2018)
[27] Spigler, G.: Meta-learnt priors slow down catastrophic forgetting in neural networks. ArXiv abs/1909.04170 (2019)
[28] Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again neural networks. In: ICML. (2018)
[29] Li, Z., Hoiem, D.: Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) 2935–2947
[30] Hou, S., Pan, X., Loy, C.C., Wang, Z., Lin, D.: Lifelong learning via progressive distillation and retrospection. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018)
[31] Lee, K., Lee, K., Shin, J., Lee, H.: Overcoming catastrophic forgetting with unlabeled data in the wild. In: ICCV. (2019)
[32] Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17, Red Hook, NY, USA, Curran Associates Inc. (2017) 2994–3003
[33] Silver, D.L., Mahfuz, S.: Generating accurate pseudo examples for continual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. (June 2020)
[34] Liu, X., Wu, C., Menta, M., Herranz, L., Raducanu, B., Bagdanov, A.D., Jui, S., van de Weijer, J.: Generative feature replay for class-incremental learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2020) 915–924
[35] Abati, D., Tomczak, J., Blankevoort, T., Calderara, S., Cucchiara, R., Bejnordi, B.E.: Conditional channel gated networks for task-aware continual learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
[36] Rajeswaran, A., Finn, C., Kakade, S.M., Levine, S.: Meta-learning with implicit gradients. In: NeurIPS. (2019)
[37] Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. ArXiv abs/1803.02999 (2018)
[38] Rusu, A.A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., Hadsell, R.: Meta-learning with latent embedding optimization. In: International Conference on Learning Representations. (2019)
[39] Caccia, M., Rodriguez, P., Ostapenko, O., Normandin, F., Lin, M., Caccia, L., Laradji, I., Rish, I., Lacoste, A., Vazquez, D., et al.: Online fast adaptation and knowledge accumulation: a new approach to continual learning. arXiv preprint arXiv:2003.05856 (2020)
[40] Finn, C., Rajeswaran, A., Kakade, S., Levine, S.: Online meta-learning. In Chaudhuri, K., Salakhutdinov, R., eds.: Proceedings of the 36th International Conference on Machine Learning. Volume 97 of Proceedings of Machine Learning Research, Long Beach, California, USA, PMLR (09–15 Jun 2019) 1920–1930
[41] Beaulieu, S., Frati, L., Miconi, T., Lehman, J., Stanley, K.O., Clune, J., Cheney, N.: Learning to continually learn. 24th European Conference on Artificial Intelligence (ECAI) (2020)
[42] Tao, X., Hong, X., Chang, X., Dong, S., Wei, X., Gong, Y.: Few-shot class-incremental learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
[43] Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871 (2017)
[44] Antoniou, A., Patacchiola, M., Ochal, M., Storkey, A.J.: Defining benchmarks for continual few-shot learning. ArXiv abs/2004.11967 (2020)
[45] Lee, H.B., Lee, H., Na, D., Kim, S., Park, M., Yang, E., Hwang, S.J.: Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. In: International Conference on Learning Representations. (2020)
[46] Zintgraf, L., Shiarlis, K., Kurin, V., Hofmann, K., Whiteson, S.: Fast context adaptation via meta-learning. In: Thirty-sixth International Conference on Machine Learning (ICML). (June 2019)
[47] Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR. (2017)
[48] Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, Canadian Institute for Advanced Research (2009)
[49] Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology (2010)

Supplementary Material

A Test on SlimageNet64
We also make some preliminary attempts on SlimageNet64 [44], a novel and difficult benchmark for few-shot continual learning. We compute the embeddings using DeepCluster [19] and report the obtained results in Table 4. We find that our update method based on the meta-example overcomes the baselines in both the supervised and the unsupervised approach. MEML with 1600 clusters reaches better performance than the 800-cluster case, meaning that a more refined partition of the embedding space is more beneficial than a rigid partitioning into classes.

Table 4: Meta-test test results on the SlimageNet64 dataset.

Algorithm/Classes   5          10         20         30        40        50
Oracle OML          24.0 ±2.5  14.3 ±3.1  15.9 ±8.3  5.0 ±1.1  -         -
OML                 22.4 ±1.9  16.1 ±1.9  23.3 ±5.0  3.8 ±0.6  8.9 ±0.0  2.1 ±1.2
MEML 800            22.4 ±3.2  13.9 ±0.0  25.8 ±0.0  -         -         -
MEML 1600           -          -          -          -         -         -
B Out-of-Distribution Tasks
Since our model is unsupervised, the FEN training is only based on feature embeddings, with no class-dependent bias. This way, our model could be general enough for OoD tasks, where the training tasks belong to a different data distribution (i.e. a different dataset) with respect to the test tasks. To investigate this conjecture, we test our model on the Cifar100 and Cub datasets. Results in Table 5 show that, by training on Mini-ImageNet and testing on Cifar100 (top half) or training on Omniglot and testing on Cub (bottom half), the unsupervised approach generally outperforms the supervised one. In the latter case, MEML also outperforms the supervised oracle trained on Cub, which is incapable of learning a meaningful representation in our particular setting.

Table 5: Meta-test test results with Out-of-Distribution tasks on the Cifar100 and Cub datasets.

Cifar100/Classes            2     4     6     8     10
Oracle OML Cifar100         66.0  45.0  34.0  30.0  29.5
Oracle MEML Mini-ImageNet   58.0  33.0  -     -     -
MEML Mini-ImageNet          -     -     -     -     -

Cub/Classes                 2     10    20    30    40
Oracle OML Cub              50.0  13.9  25.8  4.5   8.9
Oracle MEML Omniglot        44.0  49.1  32.7  -     -
MEML Omniglot               -     -     -     -     -
C Rehearsal at Meta-Train Time
A rehearsal strategy can be useful at meta-test time. In particular, when the CLN is adapted to new tasks in an incremental fashion, its weights can be overridden, favoring the last tasks at the expense of the first ones. The beneficial effect of rehearsal at meta-test time can be noticed when the number of test tasks is high. In fact, reservoir sampling is generally helpful on Omniglot, which is tested on 200 classes, while it does not give the same benefit on Mini-ImageNet, where it reaches similar or slightly lower performance. We want to verify whether rehearsal can also be beneficial at meta-train time, replacing the query set S_query with a coreset S_coreset built with reservoir sampling. This way, instead of sampling from random clusters, previously seen data are stored in a buffer of fixed dimension. We try three different memory sizes, 200, 500 and 1000, obtaining, as expected, increasing results as the size increases. In Table 6 we report accuracy results on Omniglot adding rehearsal only at meta-test time, and adding it both at meta-train and meta-test time, with OML and MEML. We report only the results obtained with a buffer size of 500 to avoid redundancies.

Table 6: Meta-test test results on the Omniglot dataset with rehearsal only during meta-test, and both at meta-train and meta-test time.

Algorithm/Classes         10    50    75    100   150   200
OML RS only test          67.9  55.1  46.2  37.0  29.6  -
OML RS both train/test    -     -     -     -     -     -
MEML RS both train/test   74.7  47.0  48.4  38.3  28.9  24.2

As can be noted, with OML, using a coreset instead of a query set at meta-train time increases the performance with respect to the query-set case, meaning that the representation suffers from catastrophic forgetting and that random data (acting only for generality purposes and not contrasting forgetting) are not enough to learn a good representation. On the contrary, with MEML, the use of a rehearsal strategy at meta-train time leads to worse performance. We hypothesize that this behavior is due to the different number of inner loop updates between the two models. In fact, OML, making several inner loop updates on data belonging to the same cluster, brings the CLN weights closer to the current cluster, suffering the effect of forgetting more than MEML, which makes a single inner loop update on the meta-example. These results prove that, at meta-train time, MEML needs only the generalization ability given by S_query, while OML also needs the remembering ability given by S_coreset.
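A minimal sketch of the reservoir-sampling memory that plays the role of S_coreset (the class interface is our own assumption; the buffer of size 500 used above would correspond to capacity=500):

```python
import random

class ReservoirBuffer:
    """Fixed-size memory filled with reservoir sampling: every element of the stream
    ends up in the buffer with equal probability, regardless of arrival time."""
    def __init__(self, capacity=500, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = self.rng.randint(0, self.n_seen - 1)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        """Draw a random batch to act as S_coreset (or as CLN replay at meta-test time)."""
        return self.rng.sample(self.data, min(k, len(self.data)))
```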
D Details on Balancing Techniques

To verify the effect of unbalanced tasks during meta-training, we apply two techniques to balance tasks: one at data-level, data augmentation, and the other at model-level, loss balancing. We briefly explain how these methods are implemented.
D.1 Data Augmentation
We apply data augmentation on the Omniglot dataset to observe whether balancing the clusters could lead to superior performance. We notice that the results reached applying data augmentation are comparable with the ones obtained with unbalanced tasks. Practically, we sample 20 elements from the clusters bigger than 20, while we exploit augmentation on the clusters with fewer than 20 elements. Until reaching 20 samples per task, we each time pick a random image among the ones in the cluster and apply a random combination of various augmentation techniques, such as horizontal flip, vertical flip, affine transformations, random crop, and color jitter. In detail, for the random crop we select a random portion of the entire image, while for the color jitter we adjust the image through brightness, contrast, saturation, and hue factors sampled within fixed ranges.
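An illustrative torchvision pipeline for this kind of augmentation; the image size, crop scale, rotation and jitter ranges below are placeholder values of our own, not the exact parameters used in the experiments.

```python
from torchvision import transforms

# Hypothetical augmentation used to grow small Omniglot clusters up to 20 samples;
# all numeric ranges here are assumptions, not the paper's values.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    transforms.RandomResizedCrop(size=28, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])

# new_sample = augment(random.choice(cluster_images))  # repeated until the cluster reaches 20
```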
D.2 Loss Balancing

Our model applies clustering on all training data before starting to learn the meta-representation. This way, we can find the maximum C_max and minimum C_min number of elements per cluster obtained by the k-means algorithm. Then, for each cluster, we find its number of elements C_current and compute the balancing vector Γ as follows:

$$ \Gamma = \frac{C_{max} - C_{min}}{C_{current} - C_{min} + \epsilon}, \quad (5) $$

where ε is used to avoid division by zero. Finally, Γ is normalized as follows:

$$ \Gamma_{norm} = \frac{\Gamma - \Gamma_{min}}{\Gamma_{max} - \Gamma_{min}}. \quad (6) $$

For each sampled task (taskId), the corresponding balancing parameter is selected and multiplied by the cross-entropy loss CE during meta-optimization, as reported below:

$$ L = \Gamma_{norm}[taskId] \cdot CE(logits, Y), \quad (7) $$

where logits indicates the output of the model.
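A short sketch of Equations (5)-(7) in code form (the function name and the NumPy formulation are ours):

```python
import numpy as np

def balancing_weights(cluster_sizes, eps=1e-8):
    """Compute the normalized balancing vector of Eqs. (5)-(6) from the cluster sizes."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    c_max, c_min = sizes.max(), sizes.min()
    gamma = (c_max - c_min) / (sizes - c_min + eps)                    # Eq. (5)
    gamma_norm = (gamma - gamma.min()) / (gamma.max() - gamma.min())   # Eq. (6)
    return gamma_norm

# During meta-optimization (Eq. 7), the weight of the sampled task scales its loss:
# loss = gamma_norm[task_id] * F.cross_entropy(logits, pseudo_labels)
```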
E Comparison with SeLa Embeddings

We try a recent embedding learning method based on self-labeling, SeLa [21], that forces a balanced separation between clusters. In Table 7, we report the results obtained training our model with SeLa embeddings on Mini-ImageNet.

Table 7: Meta-test test results on the Mini-ImageNet dataset with SeLa embeddings.

Algorithm/Classes   2     4     6     8     10
OML                 50.0  25.0  -     -     -

The main idea, taking up what was done in DeepCluster [19], is to join clustering and representation learning, combining cross-entropy minimization with a clustering algorithm like k-means. This approach could lead to degenerate solutions, such as all data points being mapped to the same cluster. The authors of SeLa address this issue by adding the constraint that the labels must induce an equipartition of the data, which they observe maximizes the information between data indices and labels. This new criterion extends standard cross-entropy minimization to an optimal transport problem, which is harder to optimize with traditional algorithms that scale badly to larger datasets; to solve it, a fast version of the Sinkhorn-Knopp algorithm is applied. In detail, given a dataset of N data points I_1, ..., I_N with corresponding labels y_1, ..., y_N ∈ {1, ..., K}, drawn from a space of K possible labels, and a deep neural network x = Φ(I) mapping I to feature vectors x ∈ R^D, the learning objective is defined as:

$$ \min_{p,q} E(p, q) \quad \text{subject to} \quad \forall y: \ q(y \mid x_i) \in \{0, 1\} \ \ \text{and} \ \ \sum_{i=1}^{N} q(y \mid x_i) = \frac{N}{K}. \quad (8) $$

E(p, q) is defined as the average cross-entropy loss, while the constraints mean that the N data points are split uniformly among the K classes and that each x_i is assigned to exactly one label. The objective in Equation (8) is solved as an instance of the optimal transport problem; for further details refer to the original paper. DeepCluster adopts particular implementation choices to avoid degenerate solutions but, contrary to SeLa, it does not force the clusters to contain the same number of samples. We empirically observe that in our setting an unconstrained approach leads to better results.
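For illustration only, a compact Sinkhorn-Knopp style normalization that produces (soft) equipartitioned assignments from class log-probabilities; this is a generic sketch of the balanced-assignment idea written from the description above, not code taken from the SeLa implementation, and the temperature and iteration count are assumed values.

```python
import torch

def balanced_assignments(log_probs, n_iters=3, eps=0.05):
    """Turn an (N, K) matrix of class log-probabilities into soft assignments whose
    columns receive (approximately) equal total mass, via alternating row/column
    normalization (Sinkhorn-Knopp)."""
    Q = torch.exp(log_probs / eps).t()            # (K, N)
    Q = Q / Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True) / K    # every class keeps 1/K of the mass
        Q = Q / Q.sum(dim=0, keepdim=True) / N    # every sample keeps 1/N of the mass
    return (Q * N).t()                            # rows sum to 1: soft pseudo-labels
```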
F Time and Computational Analysis

In Table 8, we compare training time and computational resource usage between OML and MEML on Omniglot and Mini-ImageNet. Both datasets confirm that MEML, adopting a single inner update, trains considerably faster and uses approximately one third of the GPU resources with respect to OML. The latter performs an update for each sample included in S_cluster, keeping a computational graph of the model in memory for each update. This leads to slower training, especially when the required number of epochs is high, as for Mini-ImageNet. Even though these datasets do not require particular GPU resources, this test shows the strength of our model in a future scenario exploiting larger images, deeper networks, and more samples per cluster.

Table 8: Training time and GPU usage of MEML vs. OML on Omniglot and Mini-ImageNet.

Model   Dataset         Training time   GPU usage
OML     Omniglot        1h 32m          2.239 Gb
MEML    Omniglot        47m             0.743 Gb
OML     Mini-ImageNet   7h 44m          3.111 Gb
MEML    Mini-ImageNet   3h 58m          1.147 Gb

Table 9: Feature comparison between our MEML and several works recently proposed in the literature involving continual learning and few-shot learning in the wild.

Algorithm             Few-shot   Unsupervised   Continual   Imbalance   OoD
iCARL [6]             ✗          ✗              ✓           ✗           ✗
CURL [16]             ✗          ✓              ✓           ✗           ✗
CACTUs [13]           ✓          ✓              ✗           ✗           ✗
UMTRA [14]            ✓          ✓              ✗           ✗           ✗
UFLST [15]            ✓          ✓              ✗           ✗           ✗
L2B [45]              ✓          ✗              ✗           ✓           ✓
GD [31]               ✗          ✓              ✓           ✗           ✓
OML [24]              ✓          ✗              ✓           ✗           ✗
ANML [41]             ✓          ✗              ✓           ✗           ✗
Continual-MAML [39]   ✓          ✗              ✓           ✗           ✓
iTAML [9]             ✓          ✗              ✓           ✗           ✗
MEML (Ours)           ✓          ✓              ✓           ✓           ✓
G Learning in the Jungle
To the best of our knowledge, a few-shot unsupervised continual learning setting has never been studied before in the literature. However, some works propose "learning in the jungle" problems, which involve a mixture of non-trivial settings. In Table 9 we compare some of these methods to our MEML, highlighting the features of each one. Our model is the only one that handles such a complex setting, involving few-shot learning, continual learning, unlabelled and unbalanced tasks, and that proposes experiments showing the ability of the model to learn from OoD data. Note that this is not intended to be a complete analysis of all continual learning and few-shot learning methods, but only of those methods that have been placed in a setting different from the one commonly used in these two fields, or that are related to them.
H FiLM Layers for OoD Tasks
To further improve the results when testing on OoD tasks, we introduce FiLM [43] layers within the OML architecture (the supervised baseline). In a FiLMed neural network, some conditional input is used to condition the FiLM layers, so that the final prediction is influenced by this input. The FiLM generator maps this information into FiLM parameters, applying a feature-wise affine transformation (in particular, scaling and shifting) element-wise (feature-map-wise for CNNs). If x is the input of a FiLM layer, z a conditional input, and γ and β are z-dependent scaling and shifting vectors, the FiLM transformation is reported below:

$$ \mathrm{FiLM}(x) = \gamma(z) \odot x + \beta(z). \quad (9) $$

We apply this concept to OML, conditioning the prediction on task-specific features. We add two FiLM layers, as linear layers, after each of the last two convolutional layers of the FEN. These layers have adaptable parameters, updated in both the inner and the outer loop. In detail, recovering what was already done in [46], we introduce a context parameter vector producing, through the linear layer, the filters. These filters are used to apply an affine transformation on the output of the convolutional layer. Context parameters are reset to zero before each new task, while FiLM layers are trained to be general for all tasks and are never reset during meta-train.

Table 10: Meta-test test results on the Omniglot dataset with FiLM layers applied on Oracle OML.

Algorithm/Classes   10    50    75    100   150   200
Oracle OML          88.4  74.0  69.8  57.4  51.6  47.9
OML FiLM            -     -     -     -     -     -

Table 11: Meta-test test results on the Cifar100 dataset with FiLM layers applied on Oracle OML trained on Omniglot.

Cifar100/Classes    2     4     6     8     10
OML Omniglot        -     -     -     -     -
OML FiLM Omniglot   -     -     -     -     -
At meta-test time, we update the FiLM layers (during the meta-test train phase) and reset the context parameters after each new task. This way, the context parameters are specific to and dependent on each task, while the FiLM layers can adapt themselves to the new unseen classes, in order to shift the frozen representation according to the context. Thus, if a task changes, the model can shift the representation, reaching better generalization capabilities. The advantage is more pronounced when facing OoD tasks, since their distribution is very different from the meta-train one. We report some preliminary results obtained applying FiLM layers to the OML [24] model, trained on Omniglot and tested on both Omniglot (see Table 10) and Cifar100 (see Table 11). We find that OML with FiLM layers outperforms, or at least equals, plain OML on both datasets. The results are promising, but we believe that much better performance could be achieved by training context parameters and FiLM layers separately, or by introducing some tricks to train them together.
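A minimal sketch of such a FiLM layer (the context dimensionality, the single linear generator and the placement after a convolutional block are our own assumptions, not the exact architecture used above):

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise affine modulation (Eq. 9): a context vector z is mapped to
    per-channel scales gamma and shifts beta applied to the convolutional features."""
    def __init__(self, context_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(context_dim, 2 * num_channels)

    def forward(self, x, z):                 # x: (B, C, H, W), z: (context_dim,)
        gamma, beta = self.to_gamma_beta(z).chunk(2, dim=-1)
        return gamma.view(1, -1, 1, 1) * x + beta.view(1, -1, 1, 1)

# Hypothetical usage on a FEN feature map, with z reset to zeros before each new task:
# film = FiLMLayer(context_dim=100, num_channels=256)
# z = torch.zeros(100, requires_grad=True)
# modulated = film(conv_features, z)
```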
I The Effect of Self-Attention
Here we want to empirically observe how our self-attention mechanism acts on cluster images. We report some examples of clusters and the respective self-attention coefficients that MEML associates to each image. In Figure 2, some samples obtained during MEML training are reported, on Mini-ImageNet and Omniglot respectively. Darker colors indicate the highest attention coefficients, while lighter colors indicate the lower ones. In the majority of cases, our mechanism rewards the most representative examples of the cluster, i.e. the ones that globally contain most of the features present in the other examples as well. A further improvement could be to identify the outliers of a cluster (the samples most distant from the others at feature level) and discard them before the self-attention mechanism is applied. This way, only the features of the correctly grouped samples would be employed to build the meta-example.
J Datasets
To evaluate our model, we employ two standard datasets typically used to validate few-shot learning methods: Omniglot and Mini-ImageNet. In addition, we try our model on a new and challenging few-shot continual learning benchmark, SlimageNet64. The Omniglot dataset contains 1623 characters from 50 different alphabets, with 20 greyscale image samples per class. We use the same splits as [13], using 1100 characters for meta-training, 100 for meta-validation, and 423 for meta-testing. The Mini-ImageNet dataset consists of 100 classes of realistic RGB images with 600 examples per class. As done in [47, 13], we use 64 classes for meta-training, 16 for meta-validation and 20 for meta-test. The SlimageNet64 dataset contains 1000 classes with 200 RGB images per class, taken from the down-scaled version of ILSVRC-2012, ImageNet64x64. 800 classes are used for meta-train and the remaining 200 for meta-test purposes. Finally, we use the Cifar100 [48] and Cub [49] datasets to prove our model performance on Out-of-Distribution tasks.
Figure 2: Samples of clusters (one for each row) generated on Omniglot (left) and Mini-ImageNet (right). The self-attention coefficients associated to each image are reported.
Algorithm 1: MEML algorithm in the FUSION setting
Require: D = {X_1, X_2, ..., X_N}: unlabeled training set
Require: α, β: inner loop and outer loop learning rates
Run embedding learning on D, producing Z_1, ..., Z_N from X_1, ..., X_N
Run k-means on Z_1, ..., Z_N, generating a distribution of unbalanced tasks p(T) from the clusters
Randomly initialize θ and W
while not done do
    Sample a task T_i ~ p(T) = (S_cluster, S_query)
    Randomly initialize W_i
    S_cluster = {(X_k, Y_k)}_{k=0}^{K}, with Y_0 = ... = Y_K
    S_query = {(X_q, Y_q)}_{q=0}^{Q}
    R_k = f_θ(X_k)
    α_k = Softmax[f_ρ(R_k)]
    ME = Σ_{k=0}^{K} R_k * α_k
    ψ, φ = {W_i, ρ}, {θ, W_i, ρ}
    ψ ← ψ − α ∇_ψ ℓ_i(f_ψ(ME), Y)
    φ ← φ − β ∇_φ ℓ_i(f_φ(X_Q), Y_Q)
end while
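A condensed, illustrative PyTorch version of one iteration of this loop follows; the attention module is assumed to follow the sketch given earlier, and, for brevity, the CLN is assumed to be a single linear layer so that the fast weights can be applied with a functional forward pass. This is a sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def meml_step(fen, attention, cln, task, lr_inner, outer_opt):
    """One MEML meta-training step following Algorithm 1 (illustrative sketch).
    outer_opt is assumed to hold the parameters of fen, attention and cln."""
    (x_cluster, y_cluster), (x_query, y_query) = task

    # Meta-example from the current cluster (Eqs. 1-2).
    feats = fen(x_cluster)                                    # (K, D)
    alpha = F.softmax(attention(feats), dim=0)                # (K, 1)
    meta_example = (alpha * feats).sum(dim=0, keepdim=True)   # (1, D)

    # Single inner-loop update of the classification weights on the meta-example (Eq. 3);
    # the attention parameters would be updated jointly in the same step.
    inner_loss = F.cross_entropy(cln(meta_example), y_cluster[:1])
    w, b = list(cln.parameters())
    gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
    fast_w, fast_b = w - lr_inner * gw, b - lr_inner * gb

    # Outer-loop update of all parameters (theta, W, rho) on the query set (Eq. 4).
    query_logits = F.linear(fen(x_query), fast_w, fast_b)
    outer_loss = F.cross_entropy(query_logits, y_query)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
    return outer_loss.item()
```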
K Implementation Details

The FEN is composed of convolutional layers followed by ReLU activations, followed by linear layers interleaved by a ReLU activation. The attention mechanism is implemented with two additional linear layers interleaved by a Tanh function and followed by a Softmax and a sum, to compute the attention coefficients and aggregate the features. We train the model with the Adam optimizer, with separate learning rates for the inner and the outer loop and a fixed meta-batch size; the number of training steps differs between the Omniglot and the Mini-ImageNet/SlimageNet64 experiments. We report the MEML meta-training procedure in Algorithm 1 and an illustration of the four phases in Figure 3.
Figure 3: Overview of the four phases of FUSION: 1. Embedding Learning, 2. Unsupervised Task construction, 3. Meta-Continual Training, 4. Meta-Continual Test.