Demystifying Assumptions in Learning to Discover Novel Classes
Haoang Chi, Feng Liu, Wenjing Yang, Long Lan, Tongliang Liu, Bo Han, Gang Niu, Mingyuan Zhou, Masashi Sugiyama
Meta Discovery: Learning to Discover Novel Classes given Very Limited Data
Haoang Chi∗, Feng Liu∗, Wenjing Yang†, Long Lan, Tongliang Liu, Gang Niu, and Bo Han
Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology; University of Technology Sydney; University of Sydney; RIKEN Center for Advanced Intelligence Project; Hong Kong Baptist University
Abstract

In learning to discover novel classes (L2DNC), we are given labeled data from seen classes and unlabeled data from unseen classes, and we need to train clustering models for the unseen classes. Since L2DNC is a new problem, its application scenario and implicit assumption are unclear. In this paper, we analyze and improve it by linking it to meta-learning: although there are no meta-training and meta-test phases, the underlying assumption is exactly the same, namely that high-level semantic features are shared among the seen and unseen classes. Under this assumption, L2DNC is not only theoretically solvable but can also be empirically solved by meta-learning algorithms slightly modified to fit our proposed framework. This L2DNC methodology significantly reduces the amount of unlabeled data needed for training and makes it more practical, as demonstrated in experiments. The use of very limited data is also justified by the application scenario of L2DNC: since it is unnatural to label only seen-class data, L2DNC is causally sampling instead of labeling. The unseen-class data should be collected on the way of collecting seen-class data, which is why they are novel and first need to be clustered.

1 Introduction

With the development of high-performance computing, we can train deep networks to achieve various tasks well [Deng et al., 2009]. However, the trained networks can only recognize the classes seen in the training set (i.e., known/seen classes), and cannot identify and cluster novel classes (i.e., unseen classes) as human beings do. A prime example is that humans can easily tell a novel animal category (e.g., okapi) after learning a few seen animal categories (e.g., horse, dog). Namely, humans can effortlessly discover (cluster) novel categories of animals.
Inspired by this fact, previous works formulated a novel problem called learning to discover novel classes (L2DNC) [Hsu et al., 2018], where we train a clustering model using plenty of unlabeled novel-class data and labeled known-class data (Figure 1a). However, a rigorous definition of L2DNC is still unexplored. As a result, it is unclear whether the L2DNC problem can be addressed. For example, if we only have labeled images of cars, we cannot use the car dataset to help cluster images of animals. Besides, it is unrealistic that we can have a dataset that contains plenty of unlabeled novel-class data. For instance, botanists collect plant specimens they need in the forests.

∗ Equal contributions. † Corresponding authors.

[Figure 1: (a) We can see rare animals many times (existing works). (b) Rare animals can only be seen few times (ours). Panels depict okapi, gerenuk, and octopus as examples of unseen classes.]
Figure 1: In L2DNC, previous works assume that we can have a dataset containing plenty of unlabeled novel-class data (subfigure (a)). However, it is unrealistic that we can always have such a dataset. In this paper, we revisit the L2DNC problem from the view of data collection and find that novel-class data can be collected along the way of collecting known-class data. In this view, we can only observe few novel-class data. To this end, we reshape the L2DNC problem into a more realistic one: L2DNC given very limited data (L2DNCL, subfigure (b)).

Besides the plants they are interested in (i.e., known classes), they also find scarce plants never seen before (i.e., novel classes) [Smith, 1874]. Since a trip to the forests is relatively costly and toilsome, botanists had better collect these scarce plants in passing for future research. Thus, botanists will have plenty of labeled data with known classes, but few unlabeled data with novel classes.

Motivated by this example, we reshape L2DNC into a more realistic problem called learning to discover novel classes given very limited data (L2DNCL), which aims to discover novel classes from only a few unlabeled data (Figure 1b). Besides, we believe that the novel-class data should be collected along the way of collecting known-class data, and they need to be clustered first. Since both kinds of data are collected together from the same scenario (e.g., plants in forests), they share high-level semantic features, which is exactly the same as the underlying assumption of meta-learning [Maurer, 2005, Chen et al., 2020].

In this paper, we are naturally motivated to link L2DNCL with meta-learning. Therefore, we formalize the L2DNCL problem by referring to the definition of meta-learning. As a result, L2DNCL is not only theoretically solvable (Theorem 1), but can also be empirically addressed by modified meta-learning approaches. Based on our formalization, we find that the key difference between meta-learning and L2DNCL lies in their inner tasks. In meta-learning, the inner task is a classification task, while in L2DNCL, it is a clustering task. Thus, we can modify the training strategies of the inner tasks of meta-learning methods such that they can discover novel classes, i.e., meta discovery (MEDI).

Specifically, we first propose a novel sampling method to sample training tasks for meta-learning methods. In sampled tasks, labeled and unlabeled data share high-level semantic features and the same clustering rule (Figure 3).
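To make the episodic setup concrete, here is a rough sketch of how such training tasks could be sampled from labeled known-class data. This is a hypothetical illustration, not the paper's actual implementation: the function name, signature, and data layout are all assumptions. The key idea it demonstrates is that each episode holds out some known classes and hides their labels, so the inner task mimics discovering novel classes from very limited unlabeled data.

```python
import random


def sample_l2dncl_task(labeled_by_class, n_way=3, k_shot=5, m_unlabeled=5, rng=None):
    """Sample one training episode (hypothetical sketch, not the authors' code).

    Splits the seen classes into a labeled support set and a pseudo-"novel"
    set whose labels are discarded, so the inner task must cluster the
    pseudo-novel points rather than classify them.
    """
    rng = rng or random.Random()
    classes = list(labeled_by_class)
    picked = rng.sample(classes, 2 * n_way)
    support_classes, novel_classes = picked[:n_way], picked[n_way:]

    # Labeled support set drawn from the pseudo-seen classes.
    support = [(x, c) for c in support_classes
               for x in rng.sample(labeled_by_class[c], k_shot)]

    # Labels of the pseudo-novel classes are dropped: only raw points remain.
    unlabeled = [x for c in novel_classes
                 for x in rng.sample(labeled_by_class[c], m_unlabeled)]
    rng.shuffle(unlabeled)
    return support, unlabeled
```

Because labels of the held-out classes are hidden rather than absent, both halves of the episode come from the same data collection and hence share high-level semantic features, matching the assumption discussed above.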
Then, based on this novel sampling method, we realize MEDI using two representative meta-learning methods: model-agnostic meta-learning (MAML) [Finn et al., 2017] and prototypical networks (ProtoNet) [Snell et al., 2017]. Figure 2 demonstrates that existing L2DNC methods cannot address L2DNCL tasks well, while our methods (i.e., MEDI-MAML and MEDI-PRO) perform much better.

We conduct experiments on four benchmarks (SVHN, CIFAR-10, CIFAR-100, Omniglot) and compare our method with six competitive baselines [MacQueen et al., 1967, Hsu et al., 2018, 2019, Han et al., 2019, 2020]. Empirical results show that our method outperforms these baselines significantly. Moreover, we provide a new practical application to the prosperous meta-learning community [Chen et al., 2020], which lights up a novel road for L2DNCL.

[Figure 2: clustering accuracy vs. training epoch for MEDI-PRO, MEDI-MAML, K-means, and other baselines; plot axes and legend omitted.]
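To illustrate the classification-to-clustering change in the inner task mentioned above, the following is a minimal sketch (hypothetical, not the authors' implementation). In a ProtoNet-style inner task, prototypes are normally computed from labels; when the points are unlabeled, prototypes must instead be estimated, e.g., by a simple k-means over the embedded points. The function name and parameters here are assumptions for illustration.

```python
import numpy as np


def cluster_inner_task(embeddings, n_clusters, n_iter=20, seed=0):
    """Estimate prototypes for unlabeled embedded points via plain k-means.

    A sketch of turning a ProtoNet-style inner task from classification
    into clustering: prototypes are learned from the data, not from labels.
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes at randomly chosen (distinct) data points.
    protos = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest prototype (squared Euclidean).
        d = ((embeddings[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Move each prototype to the mean of its assigned points.
        for k in range(n_clusters):
            pts = embeddings[assign == k]
            if len(pts):
                protos[k] = pts.mean(axis=0)
    return assign, protos
```

In an actual MEDI-style method, the embedding network would be meta-trained so that this inner clustering recovers the ground-truth grouping on sampled episodes; the sketch above only shows the clustering step itself.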