Unsupervised Representation Learning by Invariance Propagation
Feng Wang, Huaping Liu∗, Di Guo, Fuchun Sun
Department of Computer Science and Technology, Tsinghua University, China
Beijing National Research Center for Information Science and Technology
[email protected], [email protected], [email protected], [email protected]
∗ Corresponding author.
Abstract
Unsupervised learning methods based on contrastive learning have drawn increasing attention and achieved promising results. Most of them aim to learn representations invariant to instance-level variations, which are provided by different views of the same instance. In this paper, we propose Invariance Propagation to focus on learning representations invariant to category-level variations, which are provided by different instances from the same category. Our method recursively discovers semantically consistent samples residing in the same high-density regions in representation space. We demonstrate a hard sampling strategy to concentrate on maximizing the agreement between the anchor sample and its hard positive samples, which provide more intra-class variations to help capture more abstract invariance. As a result, with a ResNet-50 as the backbone, our method achieves 71.3% top-1 accuracy on ImageNet linear classification and 78.2% top-5 accuracy by fine-tuning on only 1% of the labels, surpassing previous results. We also achieve state-of-the-art performance on other downstream tasks, including linear classification on Places205 and Pascal VOC, and transfer learning on small-scale datasets.
Introduction

Deep convolutional neural networks have gained great progress since the emergence of large-scale annotated datasets [7, 44]. The learned networks perform well on classification tasks [23, 37, 18], and can be transferred well to other tasks such as detection [13, 36] and segmentation [27, 17]. However, such progress requires large-scale human-annotated datasets, which are difficult to acquire. Unsupervised learning is introduced to give us the promise of learning useful representations without manual annotations. Specifically, many self-supervised methods are proposed to learn representations by solving handcrafted auxiliary tasks, such as jigsaw puzzle [31], rotation [12], colorization [42], etc. Although the self-supervised methods have gained remarkable performance, the design of pretext tasks depends on domain-specific knowledge, limiting the generality of both the learned representations and the design of future methods. Recently, methods based on contrastive learning [39, 33, 38, 16, 1] have drawn increasing attention and achieved promising results. Most of them aim to learn representations invariant to different views of the same instance, such as data augmentations [39, 33, 16, 1], color information [38] and context information [33], while ignoring the relations between different instances, which are the key to reflecting the global semantic structure.

In this work, we present Invariance Propagation (InvP), a novel method which embodies the relations between different instances to learn representations with category-level invariance. Specifically, InvP discovers positive samples by recursively propagating the local invariance through the k-nearest neighbors graph. We keep k small to preserve a high semantic consistency. By applying transitivity on the kNN graph, we can obtain positive images exhibiting richer intra-class variations. In this way, the positive samples contain different semantically consistent instances discovered by our algorithm, which enable the network to learn representations invariant to intra-class inter-instance variations. InvP keeps the positive samples residing in the same high-density regions, making the positive samples more consistent, which coincides with the smoothness assumption [4] widely adopted in semi-supervised learning.

After obtaining the positive samples, our goal is to learn a model which maps the pixel space to an embedding space where positive samples are attracted and negative samples are separated. To learn the model effectively, we demonstrate a hard sampling strategy. Specifically, we regard those positive samples with low similarities to the anchor sample as hard positive samples, and concentrate on maximizing the agreement between the anchor sample and its hard positive samples. We will show in the experiments section that the hard positive samples provide more intra-class variations, which helps improve the robustness of the learned representations.

We evaluate our method on extensive downstream tasks including linear classification on ImageNet [7], Places205 [44] and Pascal VOC [10], transfer learning on seven small-scale datasets, semi-supervised learning and object detection. Overall, our contributions can be summarized as follows:
• We propose Invariance Propagation, a novel unsupervised learning method that exploits the relations between different instances to learn representations invariant to category-level variations. Our method is both effective and reproducible on standard hardware.
• We demonstrate a hard sampling strategy to find positive samples that provide more intra-class variations to help capture more abstract invariance. Experiments are conducted to validate the effectiveness of the proposed hard sampling strategy.
• We conduct extensive quantitative experiments to validate the effectiveness of our method, achieving competitive results on object detection and state-of-the-art results on ImageNet, Places205, and VOC2007 linear classification, semi-supervised learning, and transfer learning.
• We conduct qualitative experiments, including the visualization of similarity distributions, which shows that our method successfully captures category-level invariance, and the visualization of hard positive samples, which gives an intuitive understanding of the hard sampling strategy.

Related Work

Self-supervised Learning.
Many self-supervised learning methods are proposed to solve artificially designed pretext tasks. The underlying assumption is that solving the pretext tasks learns some general knowledge, which is also required by some downstream tasks. Examples include context prediction [8], jigsaw puzzle [31], rotations [12], colorization [42, 24], context encoder [35], split-brain [43], learning by counting [32], etc. Although the self-supervised learning methods have gained remarkable performance, the design of pretext tasks depends on domain-specific knowledge, which limits the generality of both the learned representations and the design of future methods. A more general paradigm of self-supervised learning is to utilize clustering techniques. DeepCluster [2, 3] first proposes to use clustering results as pseudo-labels to help learn useful representations. Very recently, BoWNet [11] was proposed to learn representations by classifying the visual words produced by clustering, achieving competitive results on many downstream tasks.
Contrastive Learning.
Recently, unsupervised methods based on contrastive loss have drawn increasing attention and achieved state-of-the-art performance. These methods use a contrastive loss to learn representations which are invariant to data augmentations [1, 39, 16], context information [33, 19], cross-channel information [38] or different pretext tasks [28]. Methods based on contrastive learning require many negative samples, and some works concentrate on how to store negative samples. Wu et al. [39] first propose to save the computed features to a memory bank. He et al. [16] propose MoCo to retrieve negative samples from a momentum queue. Some works [5, 40] directly use features of the current batch as negative samples. Several works have tried to learn representations invariant to inter-instance variations. Huang et al. [20] propose neighborhood discovery to learn more robust representations by incorporating the nearest neighbors into the positive samples, and they adopt a curriculum manner to choose positive samples progressively. Zhuang et al. [46] propose local aggregation to incorporate the samples in the same cluster into the positive samples, and they use nearest neighbors as background samples. With regard to concurrent work, PCL [26] also concentrates on inter-instance relations and combines the contrastive loss with clustering.
Figure 1: Illustration of the positive sample discovery algorithm. We construct the kNN graph to find the positive samples of point A recursively (k = 6 in the example). The dark blue points represent the discovered positive samples. We show the positive samples of A in steps 1, 6, 11 and 16. In each step, all k-nearest neighbors of the current positive samples are added to the positive sample set.

Method

Given an unlabeled dataset $X = \{x_1, \dots, x_n\}$, our goal is to learn an embedding function $f_\theta$ that maps $X$ to $V = \{v_1, \dots, v_n\}$, where $v_i = f_\theta(x_i)$ resides in a low-dimensional manifold in which semantically consistent images are concentrated and semantically inconsistent images are separated. To model the relative similarity of two samples, we follow [39, 20, 46] to define the probability of sample $v_i$ being recognized as the $j$-th sample as
$$P_{v_i}(j) = \frac{\exp(\bar{v}_j \cdot v_i / \tau)}{\sum_{k=1}^{n} \exp(\bar{v}_k \cdot v_i / \tau)} \qquad (1)$$
where $\tau$ is the temperature parameter and $\bar{v}_k$ may come from a memory bank [39], a momentum queue [16], or the current batch of features [40, 5]. In this paper, we simply use the memory bank and update $\bar{v}_i$ as an exponential moving average of $v_i$. Furthermore, given an image set $S$, we define the probability of $v_i$ being recognized as an image in $S$ as
$$P_{v_i}(S) = \sum_{j \in S} P_{v_i}(j) \qquad (2)$$
which is similar to [20, 46]. Next, with these definitions, we introduce the positive sample discovery algorithm to find the semantically consistent samples for each image, and then the hard sampling strategy to learn our models effectively.

Formally, we define the indices of the $k$-nearest neighbors of feature $v_i$ as $N_k(i)$. Correspondingly, we denote $N_k(I)$ as the union of $N_k(i)$ for $i \in I$. Note that $i \notin N_k(i)$. With these definitions, we formulate the positive sample set $N(i)$ of image $x_i$ as
$$N(i) = N_k(i) \cup N_k(N_k(i)) \cup \cdots \cup \underbrace{N_k(N_k(\cdots N_k}_{l~\text{times}}(i)\cdots)) \qquad (3)$$
The process is illustrated in Fig. 1. In each step, all $k$-nearest neighbors of the currently discovered positive samples are added to the positive sample set. The process repeats for $l$ steps. Viewed differently, the positive samples discovered by our algorithm are exactly those samples whose graph distance from $v_i$ is less than or equal to $l$.

The underlying principle of the positive sample discovery algorithm is the smoothness assumption [4]. Specifically, if two points in a high-density region are close, then their semantic information should be similar. By transitivity, the assumption implies that if two samples are linked by a path of high density, their semantic information is likely to be close. In our method, we keep $k$ small to guarantee the high-density condition and adjust $l$ to find the appropriate number of positive samples. This is illustrated in Fig. 1, in which we can observe that the discovered positive samples all reside in a connected high-density region.
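For concreteness, the discovery step of Eq. (3) can be viewed as a breadth-first expansion on the kNN graph. The snippet below is a minimal NumPy illustration written by us, not the authors' released code; the array `memory` (L2-normalized features, one row per image) and the brute-force neighbor search are simplifying assumptions, while `k` and `l` follow the definitions above.

```python
import numpy as np

def knn_indices(memory, idx, k):
    """k nearest neighbors of sample `idx` by cosine similarity
    (rows of `memory` are assumed L2-normalized), excluding idx itself."""
    sims = memory @ memory[idx]          # (n,) similarities to every sample
    sims[idx] = -np.inf                  # enforce i not in N_k(i)
    return np.argpartition(-sims, k)[:k]

def positive_samples(memory, idx, k=4, l=3):
    """Recursive positive sample discovery (Eq. 3): for l steps, add the
    k nearest neighbors of every sample found so far, i.e. keep all samples
    whose kNN-graph distance from `idx` is at most l."""
    frontier = {idx}
    positives = set()
    for _ in range(l):
        reached = set()
        for j in frontier:
            reached.update(knn_indices(memory, j, k).tolist())
        frontier = reached - positives - {idx}   # expand only newly found samples
        positives |= frontier
    return positives
```

Restricting each expansion to newly reached nodes is a standard breadth-first shortcut; it returns the same set as the literal union in Eq. (3), namely all nodes within graph distance l of the anchor.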
We compare our algorithm with the KNN approach of [20], i.e., choosing the K-nearest neighbors of $v_i$ as positive samples (to distinguish the KNN here from the kNN adopted by our method, we use the uppercase K for the former and the lowercase k for the latter). In the last step of Fig. 1, samples in the dashed circle are the positive samples discovered by KNN. By comparison, we observe that point B is included in the K-nearest neighbors of point A while point C is not. However, points A and B do not belong to the same category because they reside in different high-density regions. In a good embedding space, samples from the same category should not be separated by a low-density region. The KNN method uses Euclidean distance as a global metric, which is not suitable for the manifold structure. In contrast, our method uses Euclidean distance only as a local metric to construct the kNN graph (manifolds locally have the structure of Euclidean space) and uses graph distance as the global metric.

Up to now, we could learn the unsupervised model by optimizing $-\log(P_{v_i}(N(i)))$ for all $v_i$. However, if we simply optimize this loss, the penalty strength on $P_{v_i}(j)$ is equal for all $j \in N(i)$. For example, if $P_{v_i}(j_1)$ is sufficiently large and $P_{v_i}(j_2)$ is relatively small ($j_1, j_2 \in N(i)$), the gradients with respect to $P_{v_i}(j_1)$ and $P_{v_i}(j_2)$ have the same amplitude, so optimizing $-\log(P_{v_i}(N(i)))$ tends to maximize only the easily optimized similarities. Ideally, $v_i$ should be similar to all discovered positive samples, instead of only some of them, to capture abstract invariance effectively.

To solve this problem, we maximize the infimum of $\{P_{v_i}(j) \mid j \in N(i)\}$, i.e., the minimum probability, in order to raise all probabilities. In practice, we select the $P$ samples with the lowest similarity to construct the hard positive sample set $N_h(i)$. These hard positive samples deviate far from the anchor sample, so they provide more intra-class variations, which is beneficial for learning more abstract invariance. Correspondingly, we also choose hard negative samples. We denote the $M$ nearest neighbors of $v_i$ as $N_M(i)$, where $M$ is large enough that $N(i) \subseteq N_M(i)$. We then define the hard negative sample set $N_{neg}(i) = N_M(i) - N(i)$ and the background set $B(i) = N_{neg}(i) \cup N_h(i)$. With these definitions, we formulate the loss for $x_i$ as
$$L_{inv}(x_i) = -\log P_{v_i}(N_h(i) \mid B(i)) \qquad (4)$$
$$= -\log \frac{\sum_{p \in N_h(i)} \exp(\bar{v}_p \cdot v_i / \tau)}{\sum_{n \in B(i)} \exp(\bar{v}_n \cdot v_i / \tau)} \qquad (5)$$
The intuition behind the hard negative sampling is that it is not necessary to push away negative points that are already far from the anchor point; it is more important to push away the ambiguous negative samples. In practice, due to the inefficiency of computing $B(i)$, we approximate $B(i)$ by $N_M(i)$, which works fairly well.

At the beginning of the training process, the discovered positive samples are not reliable due to the random network initialization. So we combine the Invariance Propagation loss with the instance discrimination loss $L_{ins}$. For the instance discrimination term, the objective with hard negative sampling is
$$L_{ins}(x_i) = -\log P_{v_i}(i \mid N_M(i) \cup \{i\}) \qquad (6)$$
As training proceeds, the discovered positive samples become more and more reliable. A reasonable strategy is to ramp up the weight of the Invariance Propagation loss according to a time-dependent weighting function $\omega(t)$. The total loss is formulated as
$$L(x_i) = L_{ins}(x_i) + \lambda_{inv} \cdot \omega(t) \cdot L_{inv}(x_i) \qquad (7)$$
In practice, a simple binary ramp-up function works sufficiently well: we set $\omega(t) = 0$ in the first $T$ epochs, i.e., $t \leq T$, and $\omega(t) = 1$ when $t > T$.
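The hard-sampling losses of Eqs. (4)–(7) can be sketched in PyTorch as below. This is our own schematic rather than the official implementation: feature vectors are assumed L2-normalized, `bank` plays the role of the moving-average features $\bar{v}$, the helper names are ours, and the default temperature and $\lambda_{inv}$ values are illustrative placeholders.

```python
import torch

def invp_losses(v, i, bank, pos_idx, nn_M_idx, tau=0.07, P=50):
    """Per-anchor losses of Eqs. (4)-(6). All features are assumed L2-normalized.
    v:        (D,) anchor feature from the current network
    i:        index of the anchor image in the memory bank
    bank:     (N, D) memory bank holding the moving-average features (the v_bar's)
    pos_idx:  LongTensor with the discovered positive set N(i)
    nn_M_idx: LongTensor with the M nearest neighbors N_M(i), where N(i) is a subset of N_M(i)
    """
    sims = (bank @ v) / tau                   # similarity of the anchor to every bank entry

    # Hard positives N_h(i): the P discovered positives least similar to the anchor.
    num_hard = min(P, pos_idx.numel())
    hard_pos = pos_idx[sims[pos_idx].topk(num_hard, largest=False).indices]

    # Background set B(i) = (N_M(i) - N(i)) ∪ N_h(i); approximated by N_M(i) in practice.
    background = nn_M_idx

    # Eq. (5): L_inv = -log( sum_{p in N_h(i)} exp(sim_p) / sum_{n in B(i)} exp(sim_n) )
    l_inv = torch.logsumexp(sims[background], dim=0) - torch.logsumexp(sims[hard_pos], dim=0)

    # Eq. (6): instance discrimination, recognizing the anchor among itself and N_M(i).
    candidates = torch.cat([sims[i].unsqueeze(0), sims[nn_M_idx]])
    l_ins = torch.logsumexp(candidates, dim=0) - sims[i]

    return l_ins, l_inv

def total_loss(l_ins, l_inv, epoch, T=30, lambda_inv=1.0):
    """Eq. (7) with a binary ramp-up weight w(t): 0 for the first T epochs, 1 afterwards."""
    w = 0.0 if epoch <= T else 1.0
    return l_ins + lambda_inv * w * l_inv
```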
Table 1: Linear classification results on ImageNet, Places205 and Pascal VOC07. We report 1-crop, top-1 accuracy. For ImageNet and VOC, we report the linear results for the output of the 16th block. For Places205, we report the linear results for the output of the 15th block. BoWNet has many more parameters than an ordinary ResNet-50 because of its large fully connected layer.

Method | Architecture | Params (M) | Epochs | ImageNet | Places205 | VOC07
Supervised [28] | ResNet-50 | 26 | 200 | 75.9 | 51.5 | 87.5
Self-supervised learning methods
Colorization [42] | ResNet-50 | 24 | 28 | 39.6 | 37.5 | 55.6
Jigsaw [14] | ResNet-50 | 24 | 90 | 45.7 | 41.2 | 64.5
Rotation [12] | ResNet-50 | 24 | 35 | 48.9 | 41.5 | 63.9
BigBiGAN [9] | ResNet-50 | 24 | 488 | 56.6 | 49.8 | -
BoWNet (conv5) [11] | ResNet-50 | 65 | 280 | 60.5 | 50.1 | 78.4
BoWNet (conv4) [11] | ResNet-50 | 65 | 280 | 62.1 | 51.1 | 79.3
Methods based on contrastive learning
InsDis [39] | ResNet-50 | 24 | 200 | 54.0 | 45.5 | -
LocalAgg [46] | ResNet-50 | 24 | 200 | 58.8 | 49.1 | -
MoCo [16] | ResNet-50 | 24 | 200 | 60.6 | - | -
PIRL [28] | ResNet-50 | 24 | 800 | 63.6 | 49.8 | 81.1
CMC [38] | ResNet-50-Lab | 47 | 400 | 64.1 | - | -
CPC [33] | ResNet-101 | 28 | - | 48.7 | - | -
CPC v2 [19] | ResNet-170 | 303 | - | 65.9 | - | -
AMDIM [1] | AMDIM | 626 | 150 | 68.1 | 55.0 | -
SimCLR [5] | ResNet-50-MLP | 28 | 1000 | 69.3 | - | 80.5
MoCo v2 [6] | ResNet-50-MLP | 28 | 800 | 71.1 | - | -
PCL [26] | ResNet-50 | 24 | 200 | 62.2 | 49.2 | 82.2
PCL [26] | ResNet-50-MLP | 28 | 200 | 65.9 | 49.8 | 84.0
InvP (Ours) | ResNet-50 | 24 | 800 | 67.7 | 52.6 | 84.2
InvP (Ours) | ResNet-50-MLP | 28 | 800 | 71.3 | |
To efficiently retrieve the features of distractor samples when computing Eq. (7), we maintain a memory bank to save the features of all samples, as proposed by Wu et al. [39]. Given the currently calculated feature $v_i$ for $x_i$, we update the corresponding feature in the memory bank as the exponential moving average of the historically calculated features. The memory bank is initialized with random D-dimensional unit vectors and its values are then updated after each epoch, following the common design [39, 16, 46]. More details are in the supplementary material.

Experiments

In this section, we conduct quantitative and qualitative experiments to evaluate the proposed method. We train our unsupervised models on the training set of ImageNet [7]. We evaluate the quality of the learned representations on extensive downstream tasks, including linear classification on ImageNet, Places205 and Pascal VOC, semi-supervised classification on ImageNet, transfer learning on seven small-scale datasets, and object detection. We also give an ablation study on several critical components of our method. We visualize the embedding statistics and the easy and hard positive neighbourhoods to provide qualitative analysis.

For all experiments, we set τ = 0. for the linear head and τ = 0. for the MLP head. We set λ_inv = 0., T = 30, k = 4, M = 4096, l = 3, P = 50. We use the SGD optimizer with a momentum of 0.9 to optimize our models. The batch size is set to 128 for ImageNet. More details can be found in the supplementary material.

Linear Classification.

We evaluate our method by linear classification on frozen features, following a common protocol proposed by [42]. Specifically, we freeze the parameters of the convolutional layers, add a global average pooling layer, and train a linear classifier to classify images with the true labels. We evaluate the linear classification results on all 17 blocks of ResNet-50 and report the best top-1, 1-crop accuracy.
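As a rough illustration of the memory-bank bookkeeping described above, a minimal sketch under our own naming is given below; the momentum value is illustrative, and the exact update schedule follows the released code and the supplementary material rather than this snippet.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """One D-dimensional unit vector per training image, updated as an
    exponential moving average of the features computed for that image."""
    def __init__(self, num_samples, dim, momentum=0.5):
        self.momentum = momentum
        # random unit vectors at initialization
        self.bank = F.normalize(torch.randn(num_samples, dim), dim=1)

    @torch.no_grad()
    def update(self, indices, features):
        """EMA update for the images in the current batch:
        v_bar <- m * v_bar + (1 - m) * v, re-projected onto the unit sphere."""
        old = self.bank[indices]
        new = self.momentum * old + (1.0 - self.momentum) * features
        self.bank[indices] = F.normalize(new, dim=1)
```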
Table 2: Semi-supervised learning performance on ImageNet. We fine-tune our pre-trained models on 1% or 10% of the ImageNet labeled data sampled from the training set. We report top-5 accuracy on the held-out validation set. The results of other methods are taken from the original papers.

Method | Architecture | Pretrain Epochs | Top-5 Accuracy (1%) | Top-5 Accuracy (10%)
Semi-supervised learning methods
VAT + Ent Min [15, 29] | ResNet-50v2 | - | 47.0 | 83.4
S4L Exemplar [41] | ResNet-50v2 | - | 47.0 | 83.7
S4L Rotation [41] | ResNet-50v2 | - | 53.4 | 83.8
LLP [45] | ResNet-50 | - | 61.9 | 88.5
Unsupervised learning methods
Jigsaw [14] | ResNet-50 | 90 | 45.3 | 79.3
InsDis [39] | ResNet-50 | 200 | 39.2 | 77.4
PIRL [28] | ResNet-50 | 800 | 57.2 | 83.8
SimCLR [5] | ResNet-50-MLP | 1000 | 75.5 | 87.8
PCL [26] | ResNet-50 | 200 | 75.6 | 86.2
InvP (Ours) | ResNet-50 | 800 | 76.7 | 87.2
InvP (Ours) | ResNet-50-MLP | 800 | 78.2 |
Table 3: Transfer learning performance on different datasets. We compare our method with SimCLR [5], the supervised model, and a model trained from scratch. We report top-1 accuracy for CIFAR10, CIFAR100 and Stanford Cars; mean per-class accuracy for Caltech-101, Oxford-IIIT Pets and Oxford 102 Flowers; and the 11-point mAP for Pascal VOC2007, which is the same as the setting of SimCLR.
Method | CIFAR10 | CIFAR100 | VOC | Caltech101 | Cars | Pets | Flowers
Scratch | 95.9 | 80.2 | 67.3 | 72.6 | 91.4 | 81.5 | 92.0
Supervised | 97.5 | 86.4 | 85.0 | 93.3 | 92.1 | 92.1 | 97.6
SimCLR | 97.7 | | | | | |
InvP (Ours) | | | | | | |
Semi-supervised Learning.
We validate our method on the task of semi-supervised learning. We follow the setting of [41] to randomly choose 1% and 10% of the labeled images from the ImageNet training set, and fine-tune the pre-trained unsupervised models. We report the top-5 accuracy on the held-out validation set. The results are shown in Table 2. Our method outperforms other semi-supervised and unsupervised methods, setting a new state of the art on the large-scale semi-supervised classification task. With only 1% labeled images, our method achieves 78.2% top-5 accuracy, surpassing the previous best result by an absolute margin of 2.6%, which shows the high quality of the learned features.
Transfer Learning.
To investigate the transferability of our unsupervised models, we evaluate the fine-tuning performance of our method on seven different datasets. Specifically, we choose four natural image datasets: CIFAR10 and CIFAR100 [22], Pascal VOC2007 [10] and Caltech-101 [25], as well as three fine-grained classification datasets: Stanford Cars [21], Oxford-IIIT Pets [34] and Oxford 102 Flowers [30]. We first train the unsupervised model on ImageNet without labels. Then we fine-tune the pre-trained unsupervised model on the above seven datasets. In this experiment, we use the ResNet-50 with an MLP projection head as the backbone. The results are shown in Table 3. Specifically, our method surpasses SimCLR on CIFAR10, VOC2007, Caltech101, and Pets. On the other three datasets, we also achieve competitive results. Besides, compared with SimCLR, which requires a large batch size of 4096 allocated on 128 TPUs, our method is much easier to implement on standard hardware, with a batch size of only 128.

Table 4: Results of object detection. We fine-tune the unsupervised model on the Pascal VOC2007+2012 training set and report AP, AP50 and AP75 on the VOC2007 test set, which is a widely adopted setting [16, 28, 14]. The proposed method outperforms other competitors.

Method | Dataset | Network | AP | AP50 | AP75
Supervised | ImageNet-1k | R50-C4 | 53.2 | 80.8 | 58.5
Jigsaw [14, 28] | ImageNet-22k | R50-C4 | 48.9 | 75.1 | 52.9
InsDis [39] | ImageNet-1k | R50-C4 | 52.3 | 79.1 | 56.9
MoCo [16] | ImageNet-1k | R50-C4 | 55.2 | 81.4 | 61.2
MoCo [16] | ImageNet-1k | R50-C5 | 53.8 | 81.1 | 58.6
PIRL [28] | ImageNet-1k | R50-C4 | 54.0 | 80.7 | 59.7
BoWNet [11] | ImageNet-1k | R50-C4 | 55.8 | 81.3 | 61.1
MoCo v2 [6] | ImageNet-1k | R50-C4 | | |
InvP (Ours) | ImageNet-1k | R50-C4 | 56.2 | 81.8 | 61.5
Object Detection.
We further evaluate the learned unsupervised models on object detection. Following [16], we fine-tune the learned unsupervised models (ResNet-50 with MLP) on the Pascal VOC dataset [10], training on the VOC2007+2012 training set and evaluating on the VOC2007 test set. We use the Faster-RCNN-C4 object detector [36] with ResNet-50 as the backbone and report the detection performance in terms of AP, AP50 and AP75. The results are presented in Table 4. Our method outperforms most alternative competitors, including the ImageNet supervised one. We believe our method would obtain even better detection results by replacing the memory bank with the momentum queue proposed in MoCo [16, 6].

Table 5: Results of the ablation study on ImageNet linear classification. We train all models for 200 epochs and report the top-1 center-crop accuracy. The backbone is ResNet-50 without an MLP head.

 | InvP | KNN | Without Hard Positive | Without Hard Negative
Acc | 63.3 | 57.6 | 60.7 | 61.9

Impact of Positive Sample Discovery. We study the impact of our positive sample discovery algorithm. We implement an alternative method that uses the KNN algorithm to find positive samples, which is similar to [20]. The ImageNet linear classification results are shown in Table 5, from which we can observe that our method outperforms the KNN counterpart by a large margin, which shows the effectiveness of our positive sample discovery algorithm. Note that we train all models in Table 5 for 200 epochs using ResNet-50 with a linear projection head as the backbone.

Impact of Hard Positive Samples. To investigate the effectiveness of the hard positive sampling strategy, we implement an alternative model by disabling the hard positive sampling strategy and train it for 200 epochs. Table 5 shows the result: without the hard sampling strategy, the linear classification performance decreases from 63.3 to 60.7. The results show the effectiveness of the hard positive sampling strategy. We believe this is because the hard positive samples provide more intra-class variations, which will be further analyzed in the following subsection.
Impact of Hard Negative Samples.
Table 5 also shows the comparison between the model with the hard negative sampling strategy and the model without it. We observe that the hard negative sampling strategy gives an absolute improvement of 1.4% on the ImageNet linear classification task, from 61.9% to 63.3%. The results show that concentrating on separating the ambiguous negative samples is more effective than separating all negative samples evenly.
Figure 2: (a) The distribution of positive and negative similarities; we compare our model with the instance discrimination model [39]. (b) Visualization of the positive samples. The first column shows the anchor samples (query). For each anchor sample, we compare its easy positive samples with its hard positive samples.
To understand the semantic properties of the learned representations, we calculate, for each sample, the 5 nearest neighbors from the same category as well as the 5 nearest neighbors from different categories [40] (we randomly sample 1500 images from different categories to balance the positive and negative candidates). The distributions of similarities for the two settings are shown in Fig. 2(a). We compare our method with instance discrimination [39], which is a representative method for learning instance-level invariant features. We have the following observations.
(1) For samples from different categories, the similarity distributions of the two methods are similar.
(2) For samples from the same category, our method tends to give much higher similarities than instance discrimination. Most similarities of our method concentrate around 0.9, while the similarities of instance discrimination concentrate around 0.5.
(3) For the overlap between the positive and negative distributions, our method has only negligible overlap, while instance discrimination has a relatively large overlap. Overall, the positive and negative similarity distributions of our model are more separable than those of the instance-wise method. We believe this is because our method considers the inter-instance relations to learn intra-class invariant representations, so that for positive samples from the same category, our method confidently gives higher similarities.
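For reference, the statistic behind Fig. 2(a) can be approximated as in the sketch below; this is our own reading of the procedure described above (top-5 same-category and top-5 different-category neighbors per anchor, with 1500 sampled different-category candidates), and the function name and feature/label arrays are hypothetical.

```python
import numpy as np

def similarity_distributions(features, labels, num_neg_candidates=1500, top=5, seed=0):
    """For every anchor, collect the similarities of its `top` nearest neighbors from
    the same category and from a random pool of different-category candidates.
    `features` is an (n, D) array of L2-normalized embeddings, `labels` an (n,) array."""
    rng = np.random.default_rng(seed)
    pos_sims, neg_sims = [], []
    for i in range(len(features)):
        f, y = features[i], labels[i]
        same = np.where(labels == y)[0]
        same = same[same != i]
        diff = np.where(labels != y)[0]
        diff = rng.choice(diff, size=min(num_neg_candidates, len(diff)), replace=False)
        pos_sims.extend(np.sort(features[same] @ f)[-top:])   # 5 most similar, same category
        neg_sims.extend(np.sort(features[diff] @ f)[-top:])   # 5 most similar, other categories
    return np.array(pos_sims), np.array(neg_sims)
```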
Neighborhoods visualization.
To give an intuitive understanding of the hard positive sampling strategy, we display the easy and hard positive samples generated by our method in Fig. 2(b). We observe that the easy positive samples are more similar to the anchor samples than the hard positive samples in textures, colors, patterns, and views, i.e., they mostly share low-level, semantically unrelated information. For the hard positive samples, the semantic content is preserved while they provide more variation in the low-level information. For example, in the first row of Fig. 2(b), the hard positive samples present more colorful parachutes. In the second row, the hard positive samples present sailboats of different patterns and colors compared with the easy positive ones. In the last row, the easy positive samples all show the head of the dolphin, while the hard positive ones show the whole body of the dolphin from different views. The intra-class variations brought by the hard positive samples help the network learn more invariant representations.
Conclusion

In this paper, we propose Invariance Propagation, a novel unsupervised learning method to learn representations invariant to intra-class variations from large numbers of unlabeled images. It encourages all positive samples to reside in the same high-density region, and a hard sampling strategy is used to provide more intra-class variations to help capture more abstract invariance. The learned representations are useful for a wide range of downstream tasks, and extensive experiments are conducted on linear classification, semi-supervised learning and transfer learning. Both the quantitative and qualitative results demonstrate the superiority of the proposed method over state-of-the-art methods, and some results even surpass supervised models.
Broader Impact
This work presents a novel unsupervised learning method, which effectively utilizes large numbers of unlabeled images to learn representations useful for a wide range of downstream tasks, such as image recognition, semi-supervised learning, object detection, etc. Without labels annotated by humans, our method reduces the prejudice caused by human priors, which may guide the models to learn more intrinsic information. The learned representations may benefit robustness in many scenarios such as adversarial robustness, out-of-distribution detection, and label corruption. What is more, unsupervised learning can be applied to autonomous learning in robotics: a robot can autonomously collect data without specifically labelling it and achieve lifelong learning.

There also exist some potential risks for our method. Unsupervised learning solely depends on the distribution of the data itself to discover information. Therefore, the learned model may be vulnerable to the data distribution. With a biased dataset, the model is likely to learn incorrect causal information. For example, in an autonomous system, it is inevitable that bias will be introduced during data collection due to the inherent constraints of the system. The model can also be easily attacked when the data used for training is intentionally contaminated. Additionally, since the learned representations can be used for a wide range of downstream tasks, it should be guaranteed that they are used for beneficial purposes.

We see the effectiveness and convenience of the proposed method, as well as the potential risks. To mitigate the risks associated with using unsupervised learning, we encourage researchers to keep an eye on the distribution of the collected datasets and to stop the use of the learned representations for harmful purposes.
Acknowledgments and Disclosure of Funding
This work was supported in part by the National Key Research and Development Program under Grant 2018YFB1305102, and the Guoqiang Research Institute Project under Grant No. 2019GQG1010.
References

[1] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
[2] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), September 2018.
[3] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2959–2968, 2019.
[4] O. Chapelle, B. Scholkopf, and A. Zien, Eds. Semi-supervised learning (Chapelle, O. et al., Eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[6] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[8] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[9] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pages 10541–10551, 2019.
[10] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–338, 2009.
[11] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words, 2020.
[12] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[13] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[14] Priya Goyal, Dhruv Kumar Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 6390–6399, 2019.
[15] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[20] Jiabo Huang, Qi Dong, Shaogang Gong, and Xiatian Zhu. Unsupervised deep learning by neighbourhood discovery. In International Conference on Machine Learning, pages 2849–2858, 2019.
[21] Jonathan Krause, Jun Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[24] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577–593. Springer, 2016.
[25] Fei-Fei Li, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshops, 2004.
[26] Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven C. H. Hoi. Prototypical contrastive learning of unsupervised representations. ArXiv, abs/2005.04966, 2020.
[27] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[28] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. ArXiv, abs/1912.01991, 2019.
[29] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:1979–1993, 2018.
[30] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[31] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[32] Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pages 5898–5906, 2017.
[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[34] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012.
[35] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[39] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
[40] Mang Ye, Xu Zhang, Pong C. Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[41] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1485, 2019.
[42] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[43] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017.
[44] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[45] Chengxu Zhuang, Xuehao Ding, Divyanshu Murli, and Daniel L K Yamins. Local label propagation for large-scale semi-supervised learning. ArXiv, abs/1905.11581, 2019.
[46] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012, 2019.