Deep Metric Transfer for Label Propagation with Limited Annotated Data
Bin Liu†    Zhirong Wu†    Han Hu*    Stephen Lin
Tsinghua University    Microsoft Research Asia
[email protected]    {wuzhiron,hanhu,stevelin}@microsoft.com

Abstract
We study object recognition under the constraint that each object class is only represented by very few observations. Semi-supervised learning, transfer learning, and few-shot recognition are all concerned with achieving fast generalization from few labeled data. In this paper, we propose a generic framework that utilizes unlabeled data to aid generalization for all three tasks. Our approach is to create much more training data through label propagation from the few labeled examples to a vast collection of unannotated images. The main contribution of the paper is that we show such a label propagation scheme can be highly effective when the similarity metric used for propagation is transferred from other related domains. We test various combinations of supervised and unsupervised metric learning methods with various label propagation algorithms. We find that our framework is very generic, without being sensitive to any specific techniques. By taking advantage of unlabeled data in this way, we achieve significant improvements on all three tasks. Code is available at http://github.com/Microsoft/metric-transfer.pytorch.
1. Introduction
We address the problem of object recognition from a very small amount of labeled data. This problem is of particular importance when limited labels can be collected due to either time or financial constraints. Though this is a difficult challenge, we are encouraged by evidence from cognitive science suggesting that infants can quickly learn new concepts from very few examples [21, 1].

Many recognition problems in computer vision are concerned with learning on few labeled data. Semi-supervised learning, transfer learning, and few-shot recognition all aim to achieve fast generalization from few examples, by leveraging unlabeled data or labeled data from other domains.

The fundamental difficulty of this problem is that naive supervised training with very few examples results in severe over-fitting. Because of this, prior work in semi-supervised learning relies on strong regularizations such as augmentations [10], temporal consistency [20], and adversarial examples [27] to improve performance. Some related works in few-shot learning do not even refine an online classifier. Instead, they simply apply the similarity metric learned from training categories to new categories without adaptation. Meta-learning [8] seeks to optimize an online parametric classifier with few samples, but under the assumption that just a few steps of optimization will lead to effective generalization with less overfitting. These approaches indirectly address the inherent problem of limited training data.

In this paper, we propose a new framework of label propagation via metric transfer to tackle the problem of limited training data. We propagate labels to an unlabeled dataset, so that training a supervised model with great learning capacity no longer faces over-fitting. This approach is related to work on "pseudo-labeling" [22, 31], where the model is bootstrapped from limited data and trained on the new data/label pairs it infers. However, that is unlikely to work well when the labeled data is scarce, since the initial model is likely to be poor. Instead of bootstrapping, our work transfers the metric learned from another related domain, and thus provides much better generalization ability.

Our approach works with three data domains: a source domain to learn a similarity metric, few labeled examples to define the target problem, and an unlabeled dataset in which to propagate labels.

† Equal contribution. * Corresponding author. This work was done when Bin Liu was an intern at MSRA.

Figure 1: Overview of the approach. Often, object categories are represented by very few images. We transfer a metric learned from another domain and propagate the labels from the few labeled images to a vast collection of unannotated images. We show this can reliably create much more labeled data for the target problem.
As in Figure 1, we first learn a similarity metric on the source domain, which can be either labeled or unlabeled. Supervised learning or unsupervised (self-supervised) learning is used to learn the metric accordingly. Then, given few observations of the target problem, we propagate the labels from these observations to the unlabeled dataset using the metric learned in the source domain. This creates an abundance of labeled data for learning a classifier. Finally, we train a standard supervised model using the propagated labels.

The main contribution of this work is the metric transfer approach for label propagation. By studying different combinations of metric pretraining methods (e.g., unsupervised, supervised) and label propagation algorithms (e.g., nearest neighbors, spectral clustering), we find that our metric transfer approach on unlabeled data is general enough to work effectively in many settings. For semi-supervised learning on CIFAR10 and ImageNet, we obtain a significant absolute improvement over the state-of-the-art when labeled data is limited to few labels per category. We also achieve a notable improvement when transferring representations from ImageNet to CIFAR10 for transfer learning, and improved performance for few-shot recognition on the mini-ImageNet benchmark.

Due to this generic framework, our work also brings individual insights into the respective tasks we studied: 1) for semi-supervised learning, algorithms may better develop from unsupervised learning, as opposed to using unlabeled data for regularization; 2) for transfer learning, we propose an alternative method for transferring knowledge other than the dominant finetuning approach; 3) for few-shot recognition, in certain scenarios, unlabeled data in the target domain is more beneficial than labeled data in the source domain.
2. Related Work
Large-scale Recognition.
To solve a computer vision problem, it has become common practice to build a large-scale dataset [6, 3] and train deep neural networks [19, 34] on it. This philosophy has achieved unprecedented success on many important computer vision problems [6, 24, 32]. However, constructing a large-scale dataset is often time-consuming and expensive, and this has motivated work on unsupervised learning and on problems defined on few labeled samples.
Semi-supervised Learning.
Semi-supervised learning [39] is a problem that lies in between supervised learning and unsupervised learning. It aims to make more accurate predictions by leveraging a large amount of unlabeled data than is possible by relying on the labeled data alone. In the era of deep learning, one line of work leverages unlabeled data through deep generative models [17, 29]. However, training of generative models is often unstable, making them tricky to use for recognition tasks. Recent efforts on semi-supervised learning focus on regularization by self-ensembling through consistency losses, such as temporal ensembling [20], adversarial ensembling [27], teacher-student distillation [36], and cross-view ensembling [2]. The pseudo-labeling approach [22, 31] initializes a model on a small labeled dataset and bootstraps on the new data it predicts. This tends to fail when the labeled set is small.

Our work is most closely related to transductive approaches [16, 45]. Prior work [7] in computer vision shows that label propagation can work well with handcrafted GIST descriptors. We bring it to the context of deep learning, and demonstrate that metric transfer may further improve the accuracy of label propagation.
Few-shot Recognition.
Given some training data in training categories, few-shot recognition [1] requires the classifier to generalize to new categories from observing very few examples, often 1-shot or 5-shot. A body of work approaches this problem by offline metric learning [37, 35, 40], where a generic similarity metric is learned on the training data and directly transferred to the new categories using simple nearest neighbor classifiers without further adaptation. Recent works on meta-learning [8, 23, 26] take a learning-to-learn approach using online algorithms. In order not to overfit to the few examples, they develop meta-learners to find a common embedding space, which can be further finetuned with fast convergence to the target problem. Recent works [30, 9] using meta-learning consider the combined problem of semi-supervised learning and few-shot recognition, by allowing access to unlabeled data in few-shot recognition. This drives few-shot recognition into more realistic scenarios. We follow this setting as we study few-shot recognition.
Transfer Learning.
Since the inception of the ImageNet challenge [32], transfer learning has emerged almost everywhere in visual recognition, such as in object detection [11] and semantic segmentation [25], by simply transferring the network weights learned on ImageNet classification and finetuning on the target task. When the pretraining task and the target task are closely related, this tends to generalize much better than training from scratch on the target task alone. Domain adaptation seeks to address a much more difficult scenario where there is a large gap between the inputs of the source and target domains [14], for example, between real images and synthetic images. What we study in this paper is metric transfer. Different from prior work [42] that employs metric transfer just to reduce the distribution divergence of different domains, we use metric transfer to propagate labels. Through this, we show that metric propagation is an effective method for learning with small data.
3. Approach
To deal with the shortage of labeled data, our approach is to enlarge it by propagating labels from annotated images to unlabeled data using the similarity metric between data pairs. The creation of much more labeled data enables us to train deep neural networks to their full learning capacity.

Our framework works on three data domains: the source domain S, the target domain T, and additional unlabeled data U. The source domain S can be labeled or unlabeled with abundant data, and it is used to learn a generic similarity metric between data pairs. The target domain T only has few labeled data, but it defines the problem we want to optimize. The unlabeled data U is the resource in which to propagate labels, and may potentially contain classes similar to the task defined in T. It may or may not have overlapping classes with S.

The approach we propose in this paper is very general, suggesting that a spectrum of metric pretraining and label propagation algorithms can all work well in this framework. Below we introduce our method in detail, and overview several metric learning and label propagation methods we used for our experiments.

The source domain S is used for pretraining a similarity metric between data pairs. Ideally, we desire the metric to capture the inherent structure in the target domain T, so that transferring labels from T is reliable and useful. For this to happen, we usually hold some prior knowledge about the source S and the target T. For example, the source domain is sampled from the same distribution as the target domain but is completely unannotated, or the source domain is annotated with a different task but is closely related to the target. Formally, a similarity metric s_{ij} between data x_i and x_j can be defined as

s_{ij} = f(x_i, x_j),    (1)

where f is the similarity function to be learned. In this work, we use deep neural networks as a parametric model of this similarity function.
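As a concrete illustration, the similarity function f in Eqn (1) is commonly realized as the cosine similarity between embeddings produced by a network. The sketch below is a minimal stand-in, where the hypothetical `embed` function uses a plain linear map in place of the actual deep model:

```python
import numpy as np

def embed(x, weights):
    # Stand-in for a deep embedding network; a real model would be a CNN.
    v = weights @ x
    return v / np.linalg.norm(v)  # L2-normalize the feature

def similarity(x_i, x_j, weights):
    # s_ij = f(x_i, x_j): cosine similarity of the two normalized embeddings.
    return float(embed(x_i, weights) @ embed(x_j, weights))
```

Under this definition the self-similarity s_ii of any input is 1, and s_ij is symmetric in its two arguments.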
The metric can be trained with either supervised or unsupervised methods, depending on whether labels are given in the source domain S. We briefly review the training algorithms as follows.

Unsupervised Metric Pretraining
Recently, there has been growing interest in unsupervised learning and self-supervised learning. Different algorithms are based on different data properties (e.g., color [44], context [4], motion [46]) and thus may vary in performance on the target task we want to transfer to. However, it is not our intent to give a comprehensive comparison over various methods and choose the best one. Instead, we show that general unsupervised transfer is beneficial for label propagation and leads to improved performance.

In this work, we utilize two unsupervised learning methods: instance discrimination [41] and colorization [44]. For instance discrimination, we treat each instance as a class, and maximize the probability of each example belonging to the class of itself,

P(i | x_i) = exp(s_{ii}) / Σ_{j=1}^{n} exp(s_{ij}).    (2)

For colorization, the idea is to learn a mapping from grayscale images to colorful ones. Following the original paper [44], instead of predicting raw pixel colors, we quantize the color space into soft bins q, and use the cross-entropy loss on the soft bins,

L_color = − Σ_{h,w} q_{h,w} log(q̂_{h,w}),    (3)

where h, w are spatial indices and q̂ is the predicted distribution over the soft bins. We follow previous work [5] for applying ResNet to colorization, where we use a base network to map inputs to features, and a head network of three convolutional layers to convert features to colors. Since colorization does not automatically output a metric, we use the Euclidean distance on the features from the base network to measure similarity.

Supervised Metric Pretraining
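A minimal sketch of the instance discrimination objective in Eqn (2), assuming the pairwise similarities have already been collected into an n × n matrix (this is an illustration of the loss, not the memory-bank implementation of [41]):

```python
import numpy as np

def instance_probs(S):
    # S: (n, n) pairwise similarity matrix s_ij over all training examples.
    # P(i | x_i) = exp(s_ii) / sum_j exp(s_ij): each instance is its own class.
    shifted = S - S.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return np.diag(e) / e.sum(axis=1)

def instance_loss(S):
    # Negative log-likelihood, averaged over instances.
    return float(-np.log(instance_probs(S)).mean())
```

A metric that assigns much higher similarity to each example with itself than with other examples drives this loss toward zero.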
In some scenarios, we have access to a labeled dataset, such as PASCAL VOC or ImageNet, having commonalities with the target task. Traditional metric learning with supervision minimizes the intra-class distance and maximizes the inter-class distance of the labeled samples. For this purpose, many types of loss functions such as the contrastive loss, triplet loss [13], and neighborhood analysis [12] have been proposed. In this work, we use neighborhood analysis [12] to learn our metric. Concretely, we maximize the likelihood of each example being supported by other examples belonging to the same category,

P(y_i | x_i) = Σ_{y_k = y_i} exp(s_{ik}) / Σ_{j=1}^{n} exp(s_{ij}).    (4)

Figure 2: Left: raw similarity matrix. Right: similarity matrix by spectral embedding. Through spectral embedding, sparse similarities are propagated to distant areas to reveal global structure. Samples are sorted by their class id for better visualization.

Given a target T represented by a small number of labeled examples, and an unlabeled set U, we propagate labels from T to U using the similarity function f(·) learned from S. Suppose T = {(x_1, y_1), (x_2, y_2), ..., (x_{n_t}, y_{n_t})}, and U = {x_{n_t+1}, x_{n_t+2}, ..., x_{n_t+n_u}}, where n_t and n_u are the numbers of images in T and U respectively. Label y_i is represented as a vector with the ground-truth class element set to 1 and the others set to −1. We consider two propagation algorithms.

Naive Nearest Neighbors
A straightforward propagation approach is to vote for the class of an unlabeled sample based on its similarity to each of the exemplars in the target set T. For an unlabeled example x_u ∈ U, we calculate its logits z_{u,c} for every class c,

z_{u,c} = (1 / n_{t,c}) Σ_{i=1}^{n_t} I(y_{i,c} = 1) · W_{i,u},    (5)

where I(·) is the indicator function, W_{i,u} = exp(f(x_i, x_u)) denotes the similarity between examples i and u, and n_{t,c} is the number of labeled images available for class c.

The nearest neighbor propagation method is essentially a one-step random walk where the similarity metric acts as the transition matrix and the indicator function acts as the initial distribution. The effectiveness of such one-step propagation depends heavily on the quality of the similarity metric.

In general, it is hard to learn such a metric well, especially when limited supervision is available, because of the visual diversity of images. Figure 2 (left) shows a typical similarity matrix computed from unsupervised features. Data points in the similarity matrix are sparsely connected, thus limiting the one-step label propagation approach.
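The voting rule of Eqn (5) amounts to averaging, for each class, the similarities from that class's labeled exemplars to each unlabeled point. A minimal sketch, assuming the matrix W of similarities exp(f(x_i, x_u)) has been precomputed:

```python
import numpy as np

def nn_propagate(W, y_onehot):
    # W: (n_t, n_u) similarities exp(f(x_i, x_u)) from labeled to unlabeled data.
    # y_onehot: (n_t, C) one-hot class indicators for the target set T.
    n_tc = y_onehot.sum(axis=0)           # n_{t,c}: labeled count per class
    z = (y_onehot.T @ W) / n_tc[:, None]  # per-class mean similarity
    return z.T                            # (n_u, C) logits z_{u,c}
```

Pseudo labels then follow by taking the argmax over classes for each unlabeled example.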
Figure 3: The accumulated accuracy of the pseudo labels on the validation data sorted by the confidence measure, for spectral clustering and k-nearest neighbor propagation.
Constrained Spectral Clustering
Constrained spectral clustering [15, 7] may potentially relieve this problem. Instead of propagating labels in one step as in the naive nearest neighbor approach, constrained spectral clustering propagates labels through multiple steps by taking advantage of structure within the unlabeled dataset. It computes a spectral embedding [33, 38] from the original similarity metric, which is then used as the new metric for label propagation. The spectral embedding is formulated as

W' = Σ_{j=2}^{η} λ_j e_j e_j^T,    (6)

where λ_j and e_j are the eigenvalues and eigenvectors of the normalized Laplacian in ascending order. The Laplacian matrix L_sym is derived from the original similarity metric as L_sym = I − D^{−1/2} W D^{−1/2}, with degree matrix D = diag(d) and d_i = Σ_j W_{ij}. The parameter η is the total number of eigen components used.

Due to its globalized nature, spectral clustering is able to pass messages between distant areas, which is in contrast to the local behavior of the naive nearest neighbors approach. The embedded metric is usually densely connected and better aligned with object classes, as illustrated in Figure 2 (right). Using the same voting approach as in Eqn (5), label propagation can then be more accurate than with the original raw similarity metric.

Constrained spectral clustering is also efficient. By following the common practice of using k-nearest neighbors to build the similarity graph [38], propagating labels over the whole unlabeled set takes about 10 seconds on a regular GPU.

Given the logits z_i, the pseudo label ŷ_i is estimated as

ŷ_i = argmax_c z_{i,c}.    (7)
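The embedding of Eqn (6) can be sketched with a dense eigendecomposition. This toy version assumes a small symmetric similarity matrix with positive degrees; a practical implementation would use a sparse k-NN graph and a partial eigensolver for scalability:

```python
import numpy as np

def spectral_embed(W, eta):
    # W: (n, n) symmetric nonnegative similarity matrix; eta: number of
    # eigen components. Returns W' = sum_{j=2}^{eta} lambda_j e_j e_j^T,
    # where (lambda_j, e_j) come from the normalized Laplacian
    # L_sym = I - D^{-1/2} W D^{-1/2}, in ascending eigenvalue order.
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - (d_isqrt[:, None] * W * d_isqrt[None, :])
    lam, E = np.linalg.eigh(L_sym)  # eigh returns ascending eigenvalues
    sel = slice(1, eta)             # skip the trivial first component
    return (E[:, sel] * lam[sel]) @ E[:, sel].T
```

The returned matrix W' can be plugged into the same voting rule as Eqn (5) in place of the raw similarities.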
Table 1: Ablation study of the mean average precision (mAP) of pseudo labels on CIFAR10.

Metric pretraining   Propagation method   50     100    250    500    1000   2000   4000   8000
Bootstrapping        Nearest neighbor     22.03  25.74  48.35  68.03  77.57  77.28  87.77  90.88
Bootstrapping        Spectral             23.49  28.88  54.46  70.02  80.94  87.77  –      –
Colorization [44]    Nearest neighbor     57.32  67.61  75.48  79.34  80.70  82.14  83.66  84.79
Colorization [44]    Spectral             60.85  67.34  76.31  80.04  81.78  81.89  82.93  82.03
Instance [41]        Nearest neighbor     54.82  62.99  77.08  84.90  88.68  91.34  92.72  93.67
Instance [41]        Spectral             –      –      –      –      –      –      –      –
Table 2: Final semi-supervised classification accuracy on CIFAR10.

Metric pretraining   Propagation method   50     100    250    500    1000   2000   4000   8000
Colorization [44]    No                   49.57  55.41  64.65  68.81  73.40  77.93  82.17  86.25
Colorization [44]    Nearest neighbor     49.96  52.69  65.63  65.88  70.88  76.36  80.16  84.64
Colorization [44]    Spectral             53.47  55.08  68.40  71.15  72.38  76.50  80.31  84.03
Instance [41]        No                   35.27  37.87  62.46  71.04  75.96  80.12  83.90  87.82
Instance [41]        Nearest neighbor     46.68  54.45  66.93  74.16  79.17  82.24  –      –
Instance [41]        Spectral             –      –      –      –      –      –      –      –

Given the logits z_i produced by the label propagation algorithm, we first normalize them into a probability distribution,

z̄_{i,c} = exp(z_{i,c}/τ) / Σ_j exp(z_{i,j}/τ),    (8)

where c indexes the dimension of categories, and the temperature τ controls the sharpness of the distribution. We then define the confidence measure α_i of the pseudo label as the difference between the maximum response and the second largest response,

α_i = max_j z̄_{i,j} − max_{c ≠ argmax_j z̄_{i,j}} z̄_{i,c}.    (9)

A high value of α_i indicates a confident estimate of the pseudo label, and a low value of α_i indicates an ambiguous estimate. In Figure 3, we measure the accumulated accuracy of pseudo labels on validation data sorted by this confidence. It can be seen that our confidence measure gives a good indication of the quality of pseudo labels.

Our final training criterion is given by

L = − Σ_{i=1}^{N} α_i · log p(ŷ_i),    (10)

where ŷ_i is the pseudo label for example i, and p(·) is the softmax probability output of the classification network. In practice, since some pseudo labels have very low confidence and thus contribute negligibly to the overall learning criterion, we may safely discard those examples to speed up learning.
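Eqns (8)–(10) can be sketched as follows: a temperature softmax over the propagated logits, a top-two margin as the confidence α_i, and an α-weighted cross-entropy on the pseudo labels. This is a minimal NumPy illustration; in the actual training the weighting is applied inside the network's loss:

```python
import numpy as np

def pseudo_labels_with_confidence(z, tau=1.0):
    # z: (n, C) propagated logits. Softmax with temperature tau (Eqn 8),
    # then confidence = gap between the two largest responses (Eqn 9).
    s = z / tau
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    top2 = np.sort(p, axis=1)[:, -2:]
    return p.argmax(axis=1), top2[:, 1] - top2[:, 0]

def weighted_pseudo_loss(probs, labels, alpha):
    # Eqn (10): confidence-weighted cross-entropy on the pseudo labels,
    # where probs are the classifier's softmax outputs.
    return float(-(alpha * np.log(probs[np.arange(len(labels)), labels])).sum())
```

An example with a uniform logit row receives α = 0 and therefore drops out of the criterion, matching the pruning described above.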
4. Experiments
Through experiments, we show that, with unlabeled data, metric propagation is able to effectively label large amounts of data when little labeled data is given. We verify our approach on semi-supervised learning, where an unsupervised metric is transferred; on transfer learning, where supervised metrics generalize across different data distributions; and on few-shot recognition, where the metric generalizes across open-set object categories. While studying few-shot recognition, we leverage extra unlabeled data for label propagation, which is also known as semi-supervised few-shot recognition [30].

Our approach has two major hyper-parameters: the number of eigenvectors η for spectral clustering and the temperature σ controlling the confidence distribution. Different parameter settings may slightly change the performance. We use η = 200 and σ = 40 across the experiments. A detailed analysis is provided in the supplementary materials.

Table 3: Scalability to large network architectures on CIFAR10.

Methods        Network architecture   50     100    250    500    1000   2000   4000   8000
Mean Teacher   WideResNet-28-2        29.66  36.62  45.49  57.19  65.07  79.26  84.38  87.55
Ours           WideResNet-28-2        –      –      –      –      –      –      –      –
Ours           WideResNet-28-10       –      –      –      –      –      –      –      –

Figure 4: Comparisons to the state-of-the-art on CIFAR10: test accuracy versus the number of labeled datapoints, for the Π-model, Mean Teacher, VAT, Pseudo Label, and ours.
We follow a recent evaluation paper [28], which gives a comprehensive benchmark for state-of-the-art semi-supervised learning approaches. A majority of our ablation studies are conducted on CIFAR10 [18], while we also test our method on ImageNet. On CIFAR10, we use the same Wide-ResNet [43] architecture with 28 layers and a width factor of 2. We report performance varying the number of labeled examples from 50 to 8,000 out of the 50,000 training examples.

For training our model, we pretrain the metric using the unlabeled split, and propagate labels to the same unlabeled set. This means S = U in our framework. We use SGD for optimization with an initial learning rate of 0.01 and a cosine decay schedule. We fix the total number of optimization iterations, as opposed to fixing the number of optimization epochs, because this gives more consistent comparisons when the number of labeled data varies.

Study of different pretrained metrics.
Our label propagation algorithm needs a pretrained similarity metric to guide it. The pretrained metric can be learned by supervised methods using limited labeled data, or by unsupervised methods using large-scale unlabeled data. Here, we consider three metric pretraining methods:
1. supervised bootstrapping on limited labeled data;
2. self-supervised learning by image colorization [44];
3. unsupervised learning by instance discrimination [41].
We train the models using the optimal parameters for each pretraining method. Then we use cosine similarity in the feature space for propagating labels to the unlabeled data.

In Table 1, we evaluate the quality of pseudo labels as the mean average precision (mAP) sorted by the confidence as in Figure 3. Table 2 lists the final semi-supervised recognition accuracy. We can see that both unsupervised methods generalize much better than the supervised bootstrapping method most of the time, until the labeled set becomes relatively large at 4000 labels. This confirms our claim that unsupervised transfer is the key for label propagation. Among the unsupervised methods, non-parametric metric learning performs better than colorization, probably because it explicitly learns a similarity metric. We also include the result of the naive baseline which trains from scratch using limited labeled data without label propagation.
Study of different label propagation schemes.
Given the pretrained metrics, there are various ways to transfer them. We consider three possible solutions:
1. no propagation, only transferring network weights;
2. nearest neighbor metric transfer;
3. spectral metric transfer.
The first baseline is a common practice, which basically transfers the network weights and then finetunes on the labeled data. The second is much weaker than the third because it only considers one-hop distances, without taking into account the similarities between unlabeled pairs.

The results are summarized in Table 1 and Table 2. Compared to the state-of-the-art performance in Table 4, even the simple finetuning approach outperforms the state of the art when the labeled set is small. For example, finetuning from instance discrimination significantly outperforms the best prior result in the low-label regime. This suggests that unsupervised pretraining generally improves semi-supervised learning. When unlabeled data is used for label propagation, metric transfer can be much stronger than weight transfer alone, further improving performance with few labels. It is also evident that the spectral clustering method performs better than weighted nearest neighbors because of its globalized behavior.

Scalability to large network architectures.
In contrast to prior methods, which face over-fitting issues, our approach easily scales to larger network architectures. Here, we keep all the learning parameters unchanged, and experiment with a wider version of Wide-ResNet-28 with a width factor of 10. We consider the state-of-the-art Mean Teacher method as a reference (Table 3). Our method enjoys consistently significant gains from the larger network in all the testing scenarios, achieving its best accuracy with only 50 labels when using Wide-ResNet-28-10.

Table 4: Ours is complementary to all prior state-of-the-art methods on CIFAR10.

Num Labeled     250    4000
Ours            71.26  84.52
Pi Model [20]   47.07  84.17
  + Ours        –      –

Comparison to the state-of-the-art on CIFAR10.
We compare our approach to state-of-the-art methods in Figure 4. Ours is particularly strong when the labeled set is small, but this advantage diminishes as the labeled set grows. However, as most prior approaches focus on self-ensembling, ours is orthogonal to them. We examine the complementarity of our method by combining it with each of the prior approaches. To do so, we generate pseudo labels for our most confident examples (a large fraction of the full data), and use them as ground truth for the other algorithms. For fair comparisons, we run public code with our generated pseudo labels. In Table 4, combining our approach leads to improved performance for all of the methods.

Comparison to the state-of-the-art on ImageNet.
We notice that few works in the literature report semi-supervised classification performance on ImageNet consistently. In this paper, we consider finetuning from an unsupervised model trained with instance discrimination [41] as our baseline. We vary the number of labeled examples from 1% to 4% of the entire ImageNet. We use ResNet-50 to pretrain the unsupervised model, and split the dataset into 10 chunks for spectral clustering to speed up the computation. In Table 5, finetuning from the unsupervised model significantly improves upon training from scratch, and our label propagation approach outperforms the finetuning approach. Notably, ours remains better across all amounts of labeled data.

Table 5: Semi-supervised classification results on the ImageNet dataset.

Num Labeled   1%    2%    4%
Scratch       22.4  40.2  58.2
Finetune      39.2  52.8  65.2
Ours          58.6  66.3  72.4

Public code: https://github.com/brain-research/realistic-ssl-evaluation

We also examine whether the proposed metric transfer can work across different data distributions. We pretrain the metric on the source S (ImageNet) and transfer it to the unlabeled U (CIFAR10). For this, we study both supervised and unsupervised pretraining for transfer learning.
Transferring from labeled ImageNet.
We resize ImageNet images down to the CIFAR10 input resolution and pretrain the metric on them by supervised learning, keeping the WideResNet-28-2 network architecture for meaningful comparison with the semi-supervised settings in Sec 4.1. Then we transfer the metric to CIFAR10. This transfer is conducted both by network finetuning and by metric propagation. In Table 6, we can see that simple network finetuning can reach the best results obtained in the semi-supervised settings of the previous subsection. By using label propagation with spectral clustering, we observe a large improvement, yielding 77.71% accuracy with just 50 labeled images. This illustrates the generality of our metric transfer approach, where supervised transfer can also take advantage of unlabeled data to improve generalization.
Instead of supervised training, which encodes prior knowledge about object categories, we treat ImageNet images as unlabeled and repeat the previous experiment. Different from the earlier unsupervised experiments, this setting involves substantially more unlabeled data, which could potentially lead to a better unsupervised metric. However, our results suggest otherwise. When propagating to CIFAR10, the unsupervised metric learned from ImageNet is inferior to the metric learned from CIFAR10. This is possibly due to the data distribution gap between CIFAR10 and ImageNet. Nevertheless, our unsupervised transfer from ImageNet still surpasses the state-of-the-art in the semi-supervised setting when labeled samples are limited.
Few-shot recognition targets a more challenging sce-nario, the generalization across object categories ( a.k.a. open-set recognition). Originally, the problem is definedwith numerous labeled examples in a source dataset, andfew examples in the target categories. Recent works [30, 9]also explore the scenario where extra unlabeled data isavailable for this problem. This fits into our framework forstudying label propagation via metric transfer.We follow the protocols in [30] for conducting the ex-able 6: Transfer learning from ImageNet to CIFAR10.Metric pretraining Transfer method 50 100 250 500 1000 2000 4000 8000Unsupervised Network finetuning 28.92 34.56 57.14 67.54 76.20 80.92 85.01 88.74Spectral 44.30 46.51 61.29 68.31 72.61 77.86 84.00 88.19Supervised Network finetuning 54.95 61.88 73.01 78.43 84.52 88.79 91.44 93.05Spectral 77.71 85.34 86.07 86.91 88.27 89.93 91.22 93.49Table 7: Few-shot recognition on Mini-ImageNet dataset.Method Fintune Unlabel 5-way Classificationdata 1-shot 5-shotNN baseline [37] No No 41.1 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± categories,with for training, for validation and for testing.Images in each category are split into as labeled, and as unlabeled. Training uses only the labeled split inthe training categories. During evaluation, a testing episodeis constructed by sampling few-shot labeled observationsfrom the labeled split in the testing categories, and all ofthe unlabeled images in all the testing categories. A testingepisode requires the model to find useful information in theunlabeled set to aid recognition from the few-shot observa-tions. Unlike [30], which includes five distractor categoriesin the unlabeled set, we consider all categories in thetesting set, which better reflects practical scenarios. We test episodes and report the results.We follow prior work [37] by using a shallow architec-ture with four convolutional layers and a final fully con-nected layer. 
Each convolutional layer has 64 channels, interleaved with ReLU, subsampling, and batch normalization layers. Images are resized to 84 × 84 to train the model. We use the spectral embedding approach for label propagation. During online training, we train for a total of 30 epochs, decreasing the initial learning rate partway through training.

Transfer from supervised models.
We use a recent supervised metric learning approach, SNCA [40], as the baseline. After label propagation and finetuning on the new data, our supervised propagation obtains a significant boost over SNCA. Prior work [30] improves upon its baselines, but fails to make further improvement because of limited training data. In Figure 5, we visualize the top retrievals from the unlabeled set in the one-shot scenario. These retrievals not only belong to the same class as the ground truth, but their diversity facilitates a strong classifier.

Figure 5: Visualizations of top ranked retrievals from the unlabeled set given one-shot observations.
Transfer from unsupervised models.
We also investigate pretraining the metric without labels, using instance discrimination [41] for learning the metric. Surprisingly, in Table 7, our unsupervised propagation obtains better performance than the offline metric learning approach with annotations [40], in both 1-shot and 5-shot recognition. This suggests that leveraging unlabeled data in the target problem may possibly be more beneficial than using labeled samples in the source domain.
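As a rough illustration of this unsupervised objective, instance discrimination [41] treats every image as its own class and classifies an embedding against a memory bank of all instance embeddings. The sketch below omits the memory-bank update and the noise-contrastive approximation of the full method; the function name is ours, and τ = 0.07 follows the default reported in [41].

```python
import numpy as np

def instance_discrimination_loss(v, bank, index, tau=0.07):
    """Cross-entropy for classifying embedding `v` as instance `index`.

    v     : (d,) embedding of the current image.
    bank  : (n, d) memory bank of embeddings for all n instances.
    tau   : softmax temperature.
    All embeddings are L2-normalized before comparison.
    """
    v = v / np.linalg.norm(v)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    logits = bank @ v / tau
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()    # softmax over all instances
    return -np.log(p[index])
```

Minimizing this loss pulls each image toward its own bank entry and away from all others, which is what yields a transferable similarity metric without any labels.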
5. Discussions

• The effectiveness of label propagation depends heavily on the learned metric, so advances in metric learning should lead to improved results. Since the prevalent pretraining methods in deep learning use softmax classification, we hope to draw more attention to pretraining networks with metric learning.
• Currently, we study metric pretraining and label propagation separately. It may be beneficial to formulate them jointly in an end-to-end framework.
• Our algorithm takes advantage of the unlabeled dataset U to create more training data. The overall performance is affected by the relevance of image content in the unlabeled set U to that of the target T, as this impacts the ability to effectively propagate labels.

Figure 6: Ablations of model parameters η and σ (pseudo-label accuracy and test accuracy, for temperatures 1, 5, 40 and 100).

A1. Ablations of Model Parameters
Our model depends on two parameters: the number of eigen components η used for spectral clustering, and the temperature σ used for controlling the confidence. We used η = 200 and σ = 40 in our main submission. In Figure 6, we show the effects of the two parameters respectively. The number of eigenvectors η works well over a wide range of values, with a trade-off in η depending on the number of labeled samples: smaller η benefits settings with very few labeled samples, while larger η benefits settings with comparably more labeled samples. The temperature parameter σ is generally robust over a wide range of values.

A2. Additional Visualizations
We provide more retrieval visualizations on the CIFAR10 and mini-ImageNet datasets in Figure 7 and Figure 8. For CIFAR10, we show the top retrievals for each class in the unlabeled set given the labeled examples. For mini-ImageNet, we show the top retrievals in the 5-class 1-shot scenario.
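The spectral propagation used throughout these experiments, together with the two parameters ablated in A1, can be sketched in a few lines. The sketch below is a simplified, self-contained illustration in the spirit of Fergus et al. [7]: it uses an unnormalized graph Laplacian and a toy least-squares fit in the spectral basis, so the helper name and these details are assumptions rather than our exact implementation.

```python
import numpy as np

def spectral_propagate(W, labels, eta=2, sigma=40.0):
    """Propagate few labels over a similarity graph via a spectral embedding.

    W      : (n, n) symmetric affinity matrix over all images.
    labels : length-n array, class id for labeled nodes and -1 otherwise.
    eta    : number of smoothest Laplacian eigenvectors (spectral components).
    sigma  : temperature controlling the confidence of the pseudo labels.
    """
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :eta]                         # eta smoothest eigenvectors
    labeled = labels >= 0
    classes = np.unique(labels[labeled])
    Y = (labels[labeled, None] == classes[None, :]).astype(float)  # one-hot targets
    # Fit the one-hot targets in the spectral basis, then extrapolate to all nodes.
    A, *_ = np.linalg.lstsq(U[labeled], Y, rcond=None)
    scores = U @ A
    # Temperature-scaled softmax turns scores into pseudo-label confidences.
    p = np.exp(sigma * (scores - scores.max(axis=1, keepdims=True)))
    p /= p.sum(axis=1, keepdims=True)
    return classes[p.argmax(axis=1)], p.max(axis=1)
```

Because the label function is constrained to the span of the η smoothest eigenvectors, labels diffuse along dense regions of the graph; σ then converts the raw scores into the confidence weights used when training on the propagated labels.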
References

[1] S. Carey and E. Bartlett. Acquiring a single new word. 1978.
[2] K. Clark, T. Luong, and Q. V. Le. Cross-view training for semi-supervised learning. In ICLR, 2019.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
[4] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[5] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[7] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[8] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[9] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
[10] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[12] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2005.
[13] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition. Springer, 2015.
[14] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
[15] H. Hu, J. Feng, C. Yu, and J. Zhou. Multi-class constrained normalized cut with hard, soft, unary and pairwise priors and its applications to object segmentation. TIP, 2013.
[16] T. Joachims. Transductive learning via spectral graph partitioning. In ICML, pages 290–297, 2003.
[17] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[18] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[21] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2011.
[22] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
[23] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV. Springer, 2014.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[26] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. 2018.
[27] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. 2017.
[28] A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow. Realistic evaluation of semi-supervised learning algorithms. 2018.
[29] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
[30] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
[31] C. Rosenberg, M. Hebert, and H. Schneiderman. Semi-supervised self-training of object detection models. In WACV/MOTION, 2005.
[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[33] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 2000.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[35] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[36] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.
[37] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[38] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[39] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade. Springer, 2012.
[40] Z. Wu, A. A. Efros, and S. X. Yu. Improving generalization via scalable neighborhood component analysis. arXiv preprint arXiv:1808.04699, 2018.
[41] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
[42] Y. Xu, S. J. Pan, H. Xiong, Q. Wu, R. Luo, H. Min, and H. Song. A unified framework for metric transfer learning. IEEE Trans. Knowl. Data Eng., 2017.
[43] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[44] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV. Springer, 2016.
[45] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, pages 321–328, 2004.
[46] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.