Demystifying Assumptions in Learning to Discover Novel Classes
Haoang Chi, Feng Liu, Wenjing Yang, Long Lan, Tongliang Liu, Bo Han, Gang Niu, Mingyuan Zhou, Masashi Sugiyama
Meta Discovery: Learning to Discover Novel Classes given Very Limited Data
Haoang Chi∗, Feng Liu∗, Wenjing Yang†, Long Lan, Tongliang Liu, Gang Niu, and Bo Han
Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology; University of Technology Sydney; University of Sydney; RIKEN Center for Advanced Intelligence Project; Hong Kong Baptist University
Abstract

In learning to discover novel classes (L2DNC), we are given labeled data from seen classes and unlabeled data from unseen classes, and we need to train clustering models for the unseen classes. Since L2DNC is a new problem, its application scenario and implicit assumption are unclear. In this paper, we analyze and improve it by linking it to meta-learning: although there are no meta-training and meta-test phases, the underlying assumption is exactly the same, namely that high-level semantic features are shared among the seen and unseen classes. Under this assumption, L2DNC is not only theoretically solvable but can also be empirically solved by meta-learning algorithms slightly modified to fit our proposed framework. This L2DNC methodology significantly reduces the amount of unlabeled data needed for training and makes it more practical, as demonstrated in experiments. The use of very limited data is also justified by the application scenario of L2DNC: since it is unnatural to label only seen-class data, L2DNC is causally sampling instead of labeling. The unseen-class data should be collected on the way of collecting seen-class data, which is why they are novel and first need to be clustered.

1 Introduction

With the development of high-performance computing, we can train deep networks to achieve various tasks well [Deng et al., 2009]. However, the trained networks can only recognize the classes seen in the training set (i.e., known/seen classes), and cannot identify and cluster novel classes (i.e., unseen classes) as human beings do. A prime example is that humans can easily tell a novel animal category (e.g., okapi) after learning a few seen animal categories (e.g., horse, dog). Namely, humans can effortlessly discover (cluster) novel categories of animals.
Inspired by this fact, previous works formulated a novel problem called learning to discover novel classes (L2DNC) [Hsu et al., 2018], where we train a clustering model using plenty of unlabeled novel-class data and labeled known-class data (Figure 1a). However, a rigorous definition of L2DNC is still unexplored. As a result, it is unclear whether the L2DNC problem can be addressed. For example, if we only have labeled images of cars, we cannot use the car dataset to help cluster images of animals. Besides, it is unrealistic that we can have a dataset that contains plenty of unlabeled novel-class data. For instance, botanists collect plant specimens they need in the forests.

∗ Equal contributions. † Corresponding authors.

[Figure 1: (a) We can see rare animals many times (existing works). (b) Rare animals can only be seen few times (ours). Panels depict okapi, gerenuk, and octopus as examples of unseen classes.]
Figure 1: In L2DNC, previous works assume that we can have a dataset containing plenty of unlabeled novel-class data (subfigure (a)). However, it is unrealistic that we can always have such a dataset. In this paper, we revisit the L2DNC problem from the view of data collection and find that novel-class data can be collected along the way of collecting known-class data. In this view, we can only observe few novel-class data. To this end, we reshape the L2DNC problem into a more realistic one: L2DNC given very limited data (L2DNCL, subfigure (b)).

Besides the plants they are interested in (i.e., known classes), they also find scarce plants never seen before (i.e., novel classes) [Smith, 1874]. Since a trip to the forests is relatively costly and toilsome, botanists had better collect these scarce plants in passing for future research. Thus, botanists will have plenty of labeled data with known classes, but few unlabeled data with novel classes.

Motivated by this example, we reshape L2DNC into a more realistic problem called learning to discover novel classes given very limited data (L2DNCL), which aims to discover novel classes from only a few unlabeled data (Figure 1b). Besides, we believe that the novel-class data should be collected along the way of collecting known-class data, and they need to be clustered first. Since both kinds of data are collected together from the same scenario (e.g., plants in forests), they share high-level semantic features, which is exactly the same as the underlying assumption of meta-learning [Maurer, 2005, Chen et al., 2020].

In this paper, we are naturally motivated to link L2DNCL with meta-learning. Therefore, we formalize the L2DNCL problem by referring to the definition of meta-learning. As a result, L2DNCL is not only theoretically solvable (Theorem 1), but can also be empirically addressed by modified meta-learning approaches. Based on our formalization, we find that the key difference between meta-learning and L2DNCL lies in their inner tasks. In meta-learning, the inner task is a classification task, while in L2DNCL, it is a clustering task. Thus, we can modify the training strategies of the inner tasks of meta-learning methods such that they can discover novel classes, i.e., meta discovery (MEDI).

Specifically, we first propose a novel sampling method to sample training tasks for meta-learning methods. In sampled tasks, labeled and unlabeled data share high-level semantic features and the same clustering rule (Figure 3).
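To make the episodic setup concrete, here is a rough sketch of how such training tasks could be sampled from labeled known-class data. This is a hypothetical illustration, not the paper's actual implementation: the function name, signature, and data layout are all assumptions. The key idea it demonstrates is that each episode holds out some known classes and hides their labels, so the inner task mimics discovering novel classes from very limited unlabeled data.

```python
import random


def sample_l2dncl_task(labeled_by_class, n_way=3, k_shot=5, m_unlabeled=5, rng=None):
    """Sample one training episode (hypothetical sketch, not the authors' code).

    Splits the seen classes into a labeled support set and a pseudo-"novel"
    set whose labels are discarded, so the inner task must cluster the
    pseudo-novel points rather than classify them.
    """
    rng = rng or random.Random()
    classes = list(labeled_by_class)
    picked = rng.sample(classes, 2 * n_way)
    support_classes, novel_classes = picked[:n_way], picked[n_way:]

    # Labeled support set drawn from the pseudo-seen classes.
    support = [(x, c) for c in support_classes
               for x in rng.sample(labeled_by_class[c], k_shot)]

    # Labels of the pseudo-novel classes are dropped: only raw points remain.
    unlabeled = [x for c in novel_classes
                 for x in rng.sample(labeled_by_class[c], m_unlabeled)]
    rng.shuffle(unlabeled)
    return support, unlabeled
```

Because labels of the held-out classes are hidden rather than absent, both halves of the episode come from the same data collection and hence share high-level semantic features, matching the assumption discussed above.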
Then, based on this novel sampling method, we realize MEDI using two representative meta-learning methods: model-agnostic meta-learning (MAML) [Finn et al., 2017] and prototypical networks (ProtoNet) [Snell et al., 2017]. Figure 2 demonstrates that existing L2DNC methods cannot address L2DNCL tasks well, while our methods (i.e., MEDI-MAML and MEDI-PRO) perform much better.

We conduct experiments on four benchmarks (SVHN, CIFAR-10, CIFAR-100, Omniglot) and compare our method with six competitive baselines [MacQueen et al., 1967, Hsu et al., 2018, 2019, Han et al., 2019, 2020]. Empirical results show that our method outperforms these baselines significantly. Moreover, we provide a new practical application to the prosperous meta-learning community [Chen et al., 2020], which lights up a novel road for L2DNCL.

[Figure 2: clustering accuracy vs. training epoch for MEDI-PRO, MEDI-MAML, K-means, and other baselines; plot axes and legend omitted.]
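To illustrate the classification-to-clustering change in the inner task mentioned above, the following is a minimal sketch (hypothetical, not the authors' implementation). In a ProtoNet-style inner task, prototypes are normally computed from labels; when the points are unlabeled, prototypes must instead be estimated, e.g., by a simple k-means over the embedded points. The function name and parameters here are assumptions for illustration.

```python
import numpy as np


def cluster_inner_task(embeddings, n_clusters, n_iter=20, seed=0):
    """Estimate prototypes for unlabeled embedded points via plain k-means.

    A sketch of turning a ProtoNet-style inner task from classification
    into clustering: prototypes are learned from the data, not from labels.
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes at randomly chosen (distinct) data points.
    protos = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest prototype (squared Euclidean).
        d = ((embeddings[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Move each prototype to the mean of its assigned points.
        for k in range(n_clusters):
            pts = embeddings[assign == k]
            if len(pts):
                protos[k] = pts.mean(axis=0)
    return assign, protos
```

In an actual MEDI-style method, the embedding network would be meta-trained so that this inner clustering recovers the ground-truth grouping on sampled episodes; the sketch above only shows the clustering step itself.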