K-Shot Contrastive Learning of Visual Features with Multiple Instance Augmentations
Haohang Xu, Hongkai Xiong, and Guo-Jun Qi, Senior Member, IEEE
Abstract—In this paper, we propose the K-Shot Contrastive Learning (KSCL) of visual features, which applies multiple augmentations to investigate the sample variations within individual instances. It aims to combine the advantages of inter-instance discrimination, learning discriminative features that distinguish between different instances, with those of intra-instance variations, matching queries against the variants of augmented samples of each instance. In particular, for each instance it constructs an instance subspace that models how the significant factors of variation in K-shot augmentations can be combined to form the variants of augmentations. Given a query, the most relevant variant of each instance is then retrieved by projecting the query onto the instance subspaces to predict the positive instance class. This generalizes the existing contrastive learning, which can be viewed as a special one-shot case. An eigenvalue decomposition is performed to configure the instance subspaces, and the embedding network can be trained end-to-end through this differentiable subspace configuration. Experiment results demonstrate that the proposed K-shot contrastive learning achieves superior performances to the state-of-the-art unsupervised methods.

Index Terms—Unsupervised learning, self-supervised learning, contrastive learning
• H. Xu and H. Xiong are with Shanghai Jiaotong University.
• G.-J. Qi was with the Seattle Cloud Lab, Futurewei Technologies, Bellevue, WA 98004. E-mail: [email protected]

1 INTRODUCTION

Unsupervised learning of visual features has attracted wide attention as it provides an alternative way to efficiently train very deep networks without labeled data. Recent breakthroughs in this direction focus on two categories of methods: contrastive learning [1], [2], [3] and transformation prediction [4], [5], [6], [7], among many alternative unsupervised methods such as generative adversarial networks [8], [9], [10], [11] and auto-encoders [12], [13].

The former category [1], [2], [3], [14], [15], [16], [17] trains a network on a self-training task that distinguishes between different instance classes, each containing the samples augmented from the same instance. Such a contrastive learning problem seeks to explore the inter-instance discrimination to perform unsupervised learning. On the contrary, the other category of transformation prediction methods [4], [5], [6], [7] trains a deep network by predicting the transformations used to augment input instances. It attempts to explore the intra-instance variations under multiple augmentations to learn the feature embedding.

A good visual representation ought to combine the advantages of both the inter-instance discrimination and the intra-instance variations. In particular, the feature embedding should not only capture the significant intra-instance variations among the augmented samples from each instance, but also discern the distinction between instances by considering their potential variations to enable the inter-instance discrimination. In other words, the inter-instance discrimination should be performed by matching a query against all potential variants of an instance. To this end, we propose a novel K-shot contrastive learning as a first attempt to combine their strengths, and we will show that most existing contrastive learning methods are a special one-shot case.

In particular, we apply multiple augmentations to transform each instance, resulting in an instance subspace spanned by the augmented samples. Each instance subspace learns significant factors of variation from the augmented samples, which configure how these factors can be linearly combined to form the variants of the instance. Then, given a query, the most relevant variant of each instance is retrieved by projecting the query onto the associated subspace [18], [19]. After that, the inter-instance discrimination is conducted by assigning the query to the instance class with the shortest projection distance. An eigenvalue decomposition is performed to configure each instance subspace with the orthonormal eigenvectors as its basis. This configuration of instance subspaces is non-parametric and differentiable, allowing an end-to-end training of the embedding network [20] through back-propagation.

Experiment results demonstrate that the proposed K-Shot Contrastive Learning (KSCL) can consistently improve the state-of-the-art performance on unsupervised learning. Particularly, with the ResNet-50 backbone, it improves the top-1 accuracy of the SimCLR and the MoCo v2 to 68.8% on ImageNet over 200 epochs. It also reaches a higher top-1 accuracy over 800 epochs than the baseline SimCLR and the rerun MoCo v2.
For the sake of fair comparison, all these improvements are achieved with the same experiment settings, such as the network architecture, data augmentation, training strategy, and the versions of the deep learning framework and libraries. The consistently improved performances under the same model settings suggest that the proposed KSCL can serve as a generic plugin to further increase the accuracy of contrastive learning methods on downstream tasks.

The remainder of the paper is organized as follows. We review the related works in Section 2 and present the proposed K-shot contrastive learning in Section 3. Implementation details are described in Section 4. We demonstrate the experiment results in Section 5 and conclude the paper in Section 6.

2 RELATED WORKS
In this section, we review the works related to the proposed K-Shot Contrastive Learning (KSCL) in the following four areas.
Contrastive learning [1] was first proposed to learn unsupervised representations by maximizing the mutual information between the learned representation and a particular context. It usually focuses on the context of the same instance, learning features by discriminating one example from another in an embedding space [2], [3], [21]. For example, instance discrimination has been used as a pretext task by distinguishing augmented samples from each other within a minibatch [2], over a memory bank [21], or over a dynamic dictionary with a queue [3]. The comparison between the augmented samples of individual instances was usually performed on a pairwise basis. The state-of-the-art performances of contrastive learning have relied on a composite of carefully designed augmentations [2] to prevent the unsupervised training from utilizing side information to accomplish the pretext task. This has been shown necessary to reach competitive results on downstream tasks.
Transformation prediction [4], [5] also constitutes a category of unsupervised methods for learning visual embeddings. In contrast to contrastive learning, which focuses on inter-instance discrimination, it aims to learn representations that equivary against various transformations [6], [22], [23]. These transformations are used to augment images, and the learned representation is trained to capture the visual structures from which the transformations can be recognized. It focuses on modeling the intra-instance variations, from which variants of an instance can be leveraged on downstream tasks such as classification [4], [5], [22], object detection [4], [22], and semantic segmentation on images [4], [24] and 3D point clouds [23]. This category of methods provides a perspective orthogonal to contrastive learning based on inter-instance discrimination.
From an alternative perspective, contrastive learning based on inter-instance discrimination can be viewed as a special case of few-shot learning [25], [26], [27], [28], [29], where each instance forms a class with several examples augmented from that instance. The difference is that the examples for each class can be much more abundant, since one can apply many augmentations to generate an arbitrary number of examples. Of course, these examples are not statistically independent, as they share the same instance. From this point of view, the non-parametric instance discrimination [21] and the subsequent works [2], [3] can be viewed as an extension of weight imprinting [30], which initializes the weights of each instance class with the embedded feature vector of an augmented sample, resulting in the inner product and cosine similarity used in these algorithms [2], [3], [21]. Such a surprising connection between the non-parametric instance discrimination and few-shot learning may open a new way to train contrastive prediction models. In this sense, the proposed K-shot contrastive learning generalizes few-shot learning by imprinting the orthonormal basis of an instance subspace with the embeddings of augmented samples from the instance.
The length of a vectorized feature representation has been used in capsule nets pioneered by Hinton et al. [31], [32]. In capsule nets, a group of neurons forms a capsule (vector) whose direction represents different instantiations that equivary against transformations, and whose length accounts for the confidence that a particular class of object is detected. From this point of view, the projected vector of a query example onto an instance subspace in this paper also carries an analogy to a capsule. Its direction represents the instantiated configuration of how the K-shot augmentations from the instance are linearly combined to form the query, while its length gives rise to the likelihood of the query belonging to this instance class, since a longer projection means a shorter distance to the subspace. This idea of using projections onto several capsule subspaces, each corresponding to a class, has shown promising results in effectively training deep networks [32].

3 THE APPROACH
In this section, we define the K-shot contrastive learning as the pretext task for training an unsupervised feature embedding with Multiple Instance Augmentations (MIAs).

Suppose we are given a set of N unlabeled instances X ≜ {x_n}, either in a minibatch (e.g., in the SimCLR [2]) or from a dictionary (e.g., the memory bank in non-parametric instance discrimination [21] or the dynamic queue in the MoCo [3]). Then the contrastive learning can be formulated as classifying a query example x into one of N instance classes, each corresponding to an instance x_n.

The goal is to learn a deep network embedding each instance x_n and the query x into feature vectors v_n and v, respectively. The probability of the embedded query v belonging to an instance class n is defined as

$$p(n \mid v) = \frac{\exp(\mathrm{sim}(v_n, v)/\tau)}{\sum_{i=1}^{N} \exp(\mathrm{sim}(v_i, v)/\tau)}, \qquad (1)$$

where sim(·,·) is a similarity measure (e.g., cosine similarity) between two embeddings, and τ is a positive temperature hyperparameter. When the query v is the embedding of an augmented sample from x_n, p(n|v) gives the probability of a relevant embedding v_n being successfully retrieved from the instance class n. One can minimize the contrastive loss, called InfoNCE in [1], resulting from the negative log-likelihood of the above probability over a dictionary to train the embedding network.

The idea underlying the contrastive learning approach is that a good representation ought to help retrieve the relevant samples from a set of instances X given a query x. For example, the SimCLR [2] has achieved state-of-the-art performance by applying two separate augmentations to each instance in a minibatch. Then, given a query example, it views the sample augmented from the same instance as the positive example, while treating those augmented from the other instances as negative ones. Alternatively, the MoCo [3] seeks to retrieve relevant samples from a dynamic queue separate from the current minibatch. Both are based on the similarity between a query and a candidate sample to train the embedding network, which can be viewed as one-shot contrastive learning, as explained later.

However, the discrimination between different instance classes not only relies on their inter-instance similarities, but is also characterized by the distribution of augmented samples from the same instance, i.e., the intra-instance variations. While existing contrastive learning methods explore the inter-instance discrimination to predict instance classes, we believe the intra-instance variations also play an indispensable role. Thus, we propose the K-shot contrastive learning by matching a query against the variants of each instance in the associated instance subspace spanned by K-shot augmentations.

K-Shot Multiple Instance Augmentations

Let us consider a K-Shot Contrastive Learning (KSCL) problem. Suppose that K different augmentations are drawn and applied to each instance x_n, resulting in K augmented samples x_n^k and their embeddings v_n^k for k = 1, ..., K.

As aforementioned, the information contained in K-shot augmentations provides important clues to distinguish between different instance classes. Comparing a query against each augmented sample individually fails to leverage such intra-instance variations, since the most relevant variant could be a combination of factors of variation rather than an individual factor. Therefore, we are motivated to explore the intra-instance variations through a linear subspace spanned by the augmented samples of each instance. Given a query, the most relevant instance is retrieved by projecting it onto the closest subspace.
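Before constructing the subspaces, it helps to see the one-shot objective of Eq. (1) in code, since the K-shot method generalizes exactly this. Below is a minimal PyTorch-style sketch; the batch layout (the i-th query's positive key at index i) and all names are illustrative assumptions rather than any released implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries, keys, tau=0.2):
    """One-shot contrastive loss of Eq. (1).

    queries: (B, d) embeddings of augmented query samples.
    keys:    (N, d) candidate embeddings v_n (N >= B); the positive
             key of the i-th query is assumed to sit at index i.
    """
    queries = F.normalize(queries, dim=1)  # unit norm: dot product = cosine
    keys = F.normalize(keys, dim=1)
    logits = queries @ keys.t() / tau      # (B, N) entries sim(v_n, v)/tau
    targets = torch.arange(queries.size(0), device=queries.device)
    # negative log-likelihood of p(n|v) for the positive instance class
    return F.cross_entropy(logits, targets)
```

The KSCL keeps this cross-entropy structure and only replaces the pairwise similarity logits with subspace projection lengths, as developed next.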
As illustrated in Figure 1, consider the embeddings {v_n^k} of the K-shot augmentations for an instance x_n.
Fig. 1: The pipeline of the proposed K-Shot Contrastive Learning (KSCL). For each instance x_n, an instance subspace S_n is spanned by the ℓ2-normalized embeddings {v_n^k} of the K-shot augmentations {x_n^k} on a unit hypersphere. A given query embedding v of unit length is projected onto the subspace of each instance, resulting in the projection length ‖Π_n(v)‖ that measures the probability q(n|v) of the query belonging to the associated instance class. The projection length also gives the cosine of the acute angle θ_vn between the query vector v and the instance subspace S_n.

These embeddings are normalized to have unit length and thus reside on the surface of a unit hypersphere. Meanwhile, they span an instance subspace S_n in the ambient feature space. Then, the projection of the query v (of unit length) onto the instance subspace is Π_n(v), and the projection distance of the query from S_n becomes

$$D_n(v) = \|v - \Pi_n(v)\| = \sqrt{\|v\|^2 - \|\Pi_n(v)\|^2}, \qquad (2)$$

where v − Π_n(v) is normal to S_n, and the second equality follows from [v − Π_n(v)] ⊥ Π_n(v), since the normal vector is orthogonal to all vectors within the subspace. As the embedding v has a constant unit length ‖v‖ = 1, minimizing the projection distance is equivalent to maximizing the projection length ‖Π_n(v)‖.

Let θ_vn be the acute angle between the query v and the instance subspace S_n. Then we have ‖Π_n(v)‖ = cos(θ_vn), i.e., the projection length can be viewed as the cosine similarity between the query and the whole instance subspace. Compared with the cosine similarity between individual embeddings of instances used in the literature [2], [3], [33], it aims to learn a better representation by discriminating between different instance subspaces containing the variations of sample augmentations.

Now we can define the probability of v belonging to an instance class n as

$$p(n \mid v) = \frac{\exp(\|\Pi_n(v)\|/\tau)}{\sum_{m=1}^{N} \exp(\|\Pi_m(v)\|/\tau)}. \qquad (3)$$

The KSCL then trains the embedding network by maximizing the log-likelihood of the above probability over mini-batches to match a query against the correct instance. Particularly, given a query v of unit norm, its projection length achieves its maximum if v belongs to S_n, i.e., if it is a linear combination of the K-shot augmentations {v_n^k}. In other words, it matches the query against all linear combinations of the augmented samples from each instance x_n, and retrieves the most similar one by projecting the query onto the instance subspace with the shortest distance.
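To see Eqs. (2) and (3) operationally, the following sketch computes the projection-length logits, assuming each instance subspace has already been summarized by an orthonormal basis (its construction is detailed in Section 4); the stacked (N, d, L) layout and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def projection_logits(queries, bases, tau=0.2):
    """K-shot contrastive logits from Eq. (3).

    queries: (B, d) query embeddings, l2-normalized below.
    bases:   (N, d, L) orthonormal bases of the N instance subspaces
             (zero-padded columns if L differs across instances).
    With orthonormal columns, ||Pi_n(v)|| = ||W_n^T v||; since ||v|| = 1,
    this equals cos(theta_vn) between the query and the subspace S_n.
    """
    queries = F.normalize(queries, dim=1)
    # subspace coordinates of every query in every basis: (N, B, L)
    coords = torch.einsum('bd,ndl->nbl', queries, bases)
    proj_len = coords.norm(dim=2).t()   # (B, N) projection lengths
    return proj_len / tau               # softmax over N yields p(n|v)
```

Training then maximizes the log-likelihood with the same cross-entropy as in the one-shot sketch, e.g., F.cross_entropy(projection_logits(queries, bases), targets).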
4 IMPLEMENTATIONS

In this section, we discuss the details of implementing the proposed K-Shot Contrastive Learning (KSCL) model.

Mathematically, there is a closed-form solution to the projection Π_n(v) onto the instance subspace S_n spanned by the K-shot augmentations {v_n^k}. Suppose there exists an orthonormal basis for S_n denoted by the columns of a matrix W_n; then the projection Π_n(v) of a feature vector v can be written as W_n W_nᵀ v.

Since we have Π_n(v_n^k) = v_n^k with {v_n^k} spanning S_n, the problem of finding W_n can be formulated as minimizing the following projection residual

$$\min_{W_n^\top W_n = I} \sum_{k=1}^{K} \|v_n^k - \Pi_n(v_n^k)\|^2 = \mathrm{tr}\left(-W_n^\top \Sigma_n W_n + \Sigma_n\right), \qquad (4)$$

where Σ_n = V_n V_nᵀ, with V_n ≜ [v_n^1, ..., v_n^K] containing the embeddings of the K augmented samples in its columns.

After conducting an eigenvalue decomposition of the positive semi-definite matrix Σ_n, the eigenvectors corresponding to the largest K eigenvalues give rise to an orthonormal basis W_n of the associated instance subspace, which minimizes (4).

Since the eigenvalue decomposition is differentiable, the embedding network can be trained end-to-end through error back-propagation. However, like the other contrastive learning methods [3], [33], the errors are only back-propagated through the embedding network of queries to save computing cost.

Usually, we only consider a smaller number of eigenvectors corresponding to the largest L (< K) eigenvalues, which account for the most significant factors of variation among the K-shot augmentations. This ignores the remaining (K − L) minor factors of intra-instance variation that may be incurred by noisy augmentations. It also results in a thinner projection matrix W̃_n than W_n, and the projection length becomes ‖Π_n(v)‖ = ‖W̃_nᵀ v‖. Thus, we only need to store and update W̃_n in the KSCL.

In practice, rather than setting L to a prefixed number, we choose L such that the largest L eigenvalues cover a preset percentage of the total eigenvalues. The larger the percentage of total eigenvalues preserved, the smaller the projection residual in Eq. (4); when L ≥ K, the residual vanishes. This allows a distinct number of eigenvectors per instance to flexibly model various degrees of variation among the K-shot augmentations.

It is not hard to see that the cosine similarity used in the SimCLR and the MoCo is a special case with K = 1, i.e., they perform one-shot contrastive learning of visual embeddings. When K = 1, there is a single augmented sample v_n per instance, and its instance subspace S_n collapses to the vector v_n. Since v_n is ℓ2-normalized to unit length in the SimCLR and the MoCo, the projection length of a query v onto this single vector becomes |v_nᵀ v|. This is the cosine similarity between two vectors used in the existing contrastive learning methods [2], [3], [21], up to an absolute value.
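Putting the pieces of this section together, below is a minimal sketch of the subspace configuration of Eq. (4): eigendecompose Σ_n = V_n V_nᵀ and keep the leading eigenvectors covering a fraction ρ of the total eigenvalue mass. torch.linalg.eigh is differentiable, consistent with the end-to-end training described above; in a MoCo-style setup, the K key embeddings would come from the momentum encoder and carry no gradient. All names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def instance_basis(v_k, rho=0.4):
    """Thin orthonormal basis W~_n of one instance subspace (Eq. (4)).

    v_k: (K, d) embeddings of the K-shot augmentations of one instance.
    rho: fraction of the total eigenvalue mass to preserve; the minor
         (K - L) directions, attributed to noisy augmentations, are dropped.
    Returns a (d, L) matrix with orthonormal columns, L <= K.
    """
    v_k = F.normalize(v_k, dim=1)            # unit-length embeddings
    sigma = v_k.t() @ v_k                    # (d, d) Sigma_n = V_n V_n^T
    evals, evecs = torch.linalg.eigh(sigma)  # ascending eigenvalues
    evals, evecs = evals.flip(0), evecs.flip(1)  # sort descending
    ratio = torch.cumsum(evals, dim=0) / evals.sum()
    L = int((ratio < rho).sum().item()) + 1  # smallest L covering rho
    # With K = 1 this returns the single unit vector v_n (up to sign),
    # so the projection length |v_n^T v| reduces to cosine similarity.
    return evecs[:, :L]
```

Stacking the per-instance bases (zero-padded to a common L) yields the (N, d, L) tensor consumed by the projection-length logits sketched in Section 3.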
5 EXPERIMENTS

In this section, we perform experiments to compare the KSCL with other state-of-the-art unsupervised learning methods.
To ensure a fair comparison with the previous unsupervised methods [2], [3], [33], in particular SimCLR [2] and MoCo v2 [33], we follow the same evaluation protocol with the same hyperparameters. Specifically, a ResNet-50 network is first pretrained on the 1.28M-image ImageNet dataset [34] without labels, and the performance is evaluated by training a linear classifier upon the fixed features. We report top-1 accuracy on the ImageNet validation set with a single crop of 224 × 224 pixels. The momentum update with the same size of dynamic queue, the MLP head, and the data augmentations (e.g., color distortion and blur) are also adopted for the sake of fair comparison with the SimCLR and MoCo v2. We adopt the same temperature τ = 0.2 as in [33] without exhaustively searching for an optimal one, yet still obtain better results. This demonstrates that the proposed KSCL can be used as a universal plugin to consistently improve contrastive learning without further tuning of existing models. We evaluate the impact of K and the percentage ρ of preserved eigenvalues on the performance later.

Table 1 compares the top-1 accuracy of the proposed KSCL with that of SimCLR and MoCo on ImageNet. We make a direct comparison between the KSCL and the MoCo v2 by running both on the same hardware platform with the same software, such as CUDA 10, PyTorch v1.3, and torchvision 1.1.0 (used in the data augmentation, which plays a key role in contrastive learning). With 200 epochs of pretraining, our rerun of MoCo v2 achieves the same top-1 accuracy as reported. However, its rerun result over 800 epochs is slightly lower than the reported result (71.1%) in the literature [33], which may be due to different versions of deep learning frameworks and drivers that could cause variations in the model performance.
TABLE 1: The top-1 accuracy of different models on ImageNet. The ResNet-50 backbone was unsupervisedly pretrained with a two-layer MLP head by applying the same combination of enhanced data augmentations used in SimCLR, including stronger color distortion and blurring, for a fair comparison. The proposed KSCL is trained with K = 5 and ρ = 40%. The top-1 accuracy is obtained by training a single-layer linear classifier upon the pretrained features.

Model                 | epochs | batch size | top-1 accuracy
SimCLR [2]            | 200    | 256        | 61.9
SimCLR (baseline) [2] | 200    | 8192       | 66.6
MoCo v1 [3]           | 200    | 256        | 60.5
MoCo v2 (rerun) [33]  | 200    | 256        | 67.5
Proposed KSCL         | 200    | 256        | 68.8

Results under more epochs of unsupervised pretraining:

Model                 | epochs | batch size | top-1 accuracy
SimCLR (baseline) [2] | 1000   | 4096       | 69.3
MoCo v2 (rerun) [33]  | 800    | 256        | 70.6
Proposed KSCL         | 800    | 256        |

Table 1 shows that, after unsupervised pretraining of the KSCL over 200 epochs with a batch size of 256, the KSCL achieves a top-1 accuracy of 68.8% with K = 5 augmentations and ρ = 40% of preserved eigenvalues. It is worth noting that a larger batch size is often required to sufficiently train the SimCLR, while the other models such as the KSCL and MoCo maintain a long dynamic queue as the dictionary. Taking the SimCLR with a larger batch size of 8192 as the baseline, the KSCL makes a much larger improvement (+2.2%) than the MoCo v2 (+0.9%) over the SimCLR baseline under 200 epochs. The KSCL also achieves a higher top-1 accuracy than both the SimCLR baseline and the rerun MoCo v2 on ImageNet over 800 epochs of pretraining. Although a better result might be obtained by fine-tuning the hyperparameters and the data augmentation [35], we stick to the same experimental settings as the previous methods [2], [33] for a direct comparison.

We also visualize the learned basis images in Figure 2. The last column presents the basis images spanning the underlying instance subspace for a "cat" image. The weight beneath each image is the inner product between the decomposed eigenvector and the embedding of the corresponding augmentation, and each basis image is a weighted combination of the augmented images in the row. The results show that two bases suffice to capture the major variations among the five image augmentations, while the remaining three only model minor ones that can be discarded as noise.

Impact of K and ρ on Performance

We also study the impact of different K's and ρ's on the model performance. Table 2 shows the top-1 accuracy under various K's and ρ's. When K = 1, the model reduces to one-shot contrastive learning, which is similar to the MoCo v2. The small difference (67.2% vs. 67.5%) between the KSCL (K = 1) and the MoCo v2 is probably because we did not fine-tune the temperature τ for the projection length to optimize the KSCL.

The accuracy increases with a larger number K of augmentations per instance and a smaller value ρ of preserved eigenvalues. This implies that eliminating the minor noisy variations (as illustrated in Figure 2) with a smaller ρ could improve the performance. Further growing K only marginally improves the performance. This is probably because the data augmentations adopted in the experiments are limited to those used in the compared methods for a direct comparison. Applying more types of augmentations (e.g., jigsaw and rotations) may inject more intra-instance variations that encourage a larger K. However, studying the role of more types of augmentations in contrastive learning is beyond the scope of this paper, and we leave it to future research.

Finally, we evaluate the unsupervised representations on the VOC object detection task [36]. The ResNet-50 backbone pretrained on the ImageNet dataset is fine-tuned with a Faster R-CNN detector [37] end-to-end on the VOC 2007+2012 trainval set, and is evaluated on the VOC 2007 test set.
Fig. 2: The learned basis in an instance subspace. Each of the first five columns is an augmented image from an instance, and the last column shows the basis images, each of which is synthesized as a linear combination of the five augmented images weighted by the inner products with the corresponding eigenvector in the embedding space.

Table 3 compares the results with the MoCo models. Under the same setting, the proposed KSCL outperforms the compared MoCo v1 and MoCo v2 models. The SimCLR model does not report results on the VOC object detection task in [2].
6 CONCLUSION
In this paper, we present a novel K-shot contrastive learning method to learn unsupervised visual features. It randomly draws K augmentations and applies them separately to each instance. This results in an instance subspace modeling how the significant factors of variation learned from the augmented samples can be linearly combined to form the variants of the associated instance. Given a query, the most relevant samples are then retrieved by projecting the query onto the individual instance subspaces, and the query is assigned to the instance subspace with the shortest projection distance. The proposed K-shot contrastive learning combines the advantages of both the inter-instance discrimination and the intra-instance variations to discriminate between different instances. The experiment results demonstrate its superior performances to the state-of-the-art contrastive learning methods under the same experimental settings.

REFERENCES

[1] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
TABLE 2: The top-1 accuracy of the proposed KSCL with varying K's and ρ's under 200 epochs of pretraining on ImageNet. The ResNet-50 backbone was pretrained with a two-layer MLP head by applying the same combination of enhanced data augmentations used in SimCLR, including stronger color distortion and blurring, for a fair comparison. The top-1 accuracy is obtained by training a single-layer linear classifier upon the pretrained features. We also report the computing time used to train the KSCL per epoch on eight V100 GPUs. Note that when K = 1, ρ need not be set, as the model reduces to the trivial case of one-shot contrastive learning.

K | ρ   | epochs | batch size | top-1 accuracy | time/epoch (min.)
1 | –   | 200    | 256        | 67.2           | 16
3 | 40% | 200    | 256        | 68.5           | 26
5 | 40% | 200    | 256        | 68.8           | 37
5 | 90% | 200    | 256        | 68.4           | 37
TABLE 3: The comparison between the proposed KSCL (K = 5 and ρ = 40%) and the MoCo models. The pretrained ResNet-50 backbone was transferred to the VOC 2007+2012 trainval set, trained with a Faster R-CNN detector end-to-end, and evaluated on the VOC 2007 test set. The COCO metrics were adopted to evaluate the performance.

Model   | epochs | batch size | AP_50 | AP   | AP_75
MoCo v1 | 200    | 256        | 81.5  | 55.9 | 62.6
MoCo v2 | 200    | 256        | 82.4  | 57.0 | 63.6
MoCo v2 | 800    | 256        | 82.5  | 57.4 | 64.0
KSCL    | 200    | 256        | 82.4  | 57.1 | 63.9
KSCL    | 800    | 256        |       |      |

[2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
[3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," arXiv preprint arXiv:1911.05722, 2019.
[4] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," arXiv preprint arXiv:1803.07728, 2018.
[5] L. Zhang, G.-J. Qi, L. Wang, and J. Luo, "AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data," arXiv preprint arXiv:1901.04596, 2019.
[6] G.-J. Qi, L. Zhang, C. W. Chen, and Q. Tian, "AVT: Unsupervised learning of transformation equivariant representations by autoencoding variational transformations," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8130–8139.
[7] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[9] G.-J. Qi, "Loss-sensitive generative adversarial networks on Lipschitz densities," International Journal of Computer Vision, vol. 128, no. 5, pp. 1118–1140, 2020.
[10] G.-J. Qi, L. Zhang, H. Hu, M. Edraki, J. Wang, and X.-S. Hua, "Global versus localized generative adversarial nets," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1517–1525.
[11] Y. Zhao, Z. Jin, G.-J. Qi, H. Lu, and X.-S. Hua, "An adversarial approach to hard triplet generation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 501–517.
[12] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 52–59.
[13] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 44–51.
[14] P. Bachman, R. D. Hjelm, and W. Buchwalter, "Learning representations by maximizing mutual information across views," in Advances in Neural Information Processing Systems, 2019, pp. 15509–15519.
[15] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, "Data-efficient image recognition with contrastive predictive coding," arXiv preprint arXiv:1905.09272, 2019.
[16] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," arXiv preprint arXiv:1808.06670, 2018.
[17] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," arXiv preprint arXiv:1906.05849, 2019.
[18] X.-S. Hua and G.-J. Qi, "Online multi-label active annotation: towards large-scale content-based video search," in Proceedings of the 16th ACM International Conference on Multimedia, 2008, pp. 141–150.
[19] X. Shu, J. Tang, G.-J. Qi, Z. Li, Y.-G. Jiang, and S. Yan, "Image classification with tailored fine-grained dictionaries," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 2, pp. 454–467, 2016.
[20] S. Chang, G.-J. Qi, C. C. Aggarwal, J. Zhou, M. Wang, and T. S. Huang, "Factorized similarity learning in networks," in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 60–69.
[21] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, "Unsupervised feature learning via non-parametric instance discrimination," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[22] G.-J. Qi, "Learning generalized transformation equivariant representations via autoencoding transformations," arXiv preprint arXiv:1906.08628, 2019.
[23] X. Gao, W. Hu, and G.-J. Qi, "GraphTER: Unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7163–7172.
[24] G.-J. Qi, "Hierarchically gated deep networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2267–2275.
[25] Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, "Few-shot image recognition with knowledge transfer," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 441–449.
[26] M. A. Jamal and G.-J. Qi, "Task agnostic meta-learning for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11719–11727.
[27] H. Xu, H. Xiong, and G. Qi, "FLAT: Few-shot learning via autoencoding transformation regularizers," arXiv preprint arXiv:1912.12674, 2019.
[28] Y. Qin, W. Zhang, C. Zhao, Z. Wang, X. Zhu, G. Qi, J. Shi, and Z. Lei, "Prior-knowledge and attention-based meta-learning for few-shot learning," arXiv preprint, 2018.
[29] G.-J. Qi, W. Liu, C. Aggarwal, and T. Huang, "Joint intermodal and intramodal label transfers for extremely rare or unseen classes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1360–1373, 2016.
[30] H. Qi, M. Brown, and D. G. Lowe, "Low-shot learning with imprinted weights," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5822–5830.
[31] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
[32] L. Zhang, M. Edraki, and G.-J. Qi, "CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces," in Advances in Neural Information Processing Systems, 2018, pp. 5814–5823.
[33] X. Chen, H. Fan, R. Girshick, and K. He, "Improved baselines with momentum contrastive learning," arXiv preprint arXiv:2003.04297, 2020.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[35] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, "What makes for good views for contrastive learning," arXiv preprint arXiv:2005.10243, 2020.
[36] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[37] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.