DetCo: Unsupervised Contrastive Learning for Object Detection
Enze Xie*, Jian Ding*, Wenhai Wang, Xiaohang Zhan, Hang Xu, Zhenguo Li, Ping Luo

The University of Hong Kong, Huawei Noah's Ark Lab, Wuhan University, Nanjing University, Chinese University of Hong Kong

*Equal contribution.
Abstract
Unsupervised contrastive learning achieves great success in learning image representations with CNNs. Unlike most recent methods that focus on improving the accuracy of image classification, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between a global image and local image patches to learn discriminative representations for object detection. DetCo has several appealing benefits. (1) It is carefully designed by investigating the weaknesses of current self-supervised methods, which discard representations important for object detection. (2) DetCo builds hierarchical intermediate contrastive losses between the global image and local patches to improve object detection, while maintaining global representations for image recognition. Theoretical analysis shows that the local patches actually remove the contextual information of an image, improving the lower bound of mutual information for better contrastive learning. (3) Extensive experiments on PASCAL VOC, COCO and Cityscapes demonstrate that DetCo not only outperforms state-of-the-art methods on object detection, but also on segmentation, pose estimation, and 3D shape prediction, while remaining competitive on image classification. For example, on PASCAL VOC, DetCo-100ep achieves 57.4 mAP, which is on par with the result of MoCo v2-800ep. Moreover, DetCo consistently outperforms the supervised method by 1.6/1.2/1.0 AP on Mask RCNN-C4/FPN/RetinaNet with the 1x schedule. Code will be released at github.com/xieenze/DetCo and github.com/open-mmlab/OpenSelfSup.
1. Introduction
Self-supervised learning of visual representations is an important problem in computer vision, facilitating many downstream tasks such as image classification, object detection, and semantic segmentation [22, 36, 41]. Previous methods [11, 14, 12, 2, 5, 19, 3, 18, 9, 10] focus on designing different pretext tasks. One of the most promising directions among them is contrastive learning [32], which transforms one image into multiple views, minimizes the distance between views from the same image, and maximizes the distance between views from different images.

Figure 1. Comparisons of mAP on PASCAL VOC [15] 07+12 for object detection. For fair comparison, all results are evaluated following the setting of MoCo [19]. Across different pre-training epochs, DetCo consistently outperforms MoCo v2 [5], which is a strong competitor on VOC compared to many other newly-proposed approaches such as BYOL [18], PIRL [30], and SwAV [3]. For example, DetCo-100ep already achieves mAP similar to MoCo v2-800ep. (Curves shown: DetCo and MoCo v2 at 50-800 epochs; MoCo, SimCLR, BYOL, and PIRL at 200 epochs; SwAV at 800 epochs; supervised at 90 epochs.)

Recent trends in self-supervised contrastive learning mainly design different contrastive pretext tasks in order to bridge the performance gap between unsupervised and fully-supervised methods for image classification, whose accuracy is constantly refreshed by many unsupervised methods such as MoCo v1/v2 [19, 5], BYOL [18], and SwAV [3]. However, the pretext tasks designed in these works are suboptimal when transferred to object detection, because many differences between image classification and object detection are neglected in previous methods. First, image classification is typically solved using the 1-of-K loss function, which assumes that each image has only one category. This conflicts with object detection, where an image often contains many objects of different categories. Second, object detection often needs to perform object classification and box regression on local image regions (patches), whereas image classification needs a global image representation. Third, recent advanced object detectors usually predict objects on multi-level features, while image classifiers typically learn only high-level discriminative features.

As object detection plays an indispensable role in many computer vision tasks, this paper aims at designing a self-supervised representation learning method that is more powerful than existing self-supervised approaches when the learned representations are transferred to many detection-related downstream tasks (e.g., detection, segmentation, pose estimation, and 3D shape prediction).

To this end, we first investigate the inconsistency between the accuracy of image classification and the accuracy of object detection produced by the latest self-supervised methods. Then we present three potential practices for designing a suitable pretext task for object detection. Finally, following these practices, we design DetCo, a detection-friendly contrastive pretext task that can be trained on large-scale unlabeled data.

Specifically, DetCo has two merits that are specially designed for improving object detection. First, hierarchical contrastive loss functions are applied to different stages of the backbone network. This ensures the discriminative capability of each stage, leading to better performance for object detection at multiple scales, and compensates for the discrepancy between object detection and image classification. We analyze this design choice in Section 3.2.1 and evaluate its effectiveness with ablation studies in Section 6.6.
Second, we propose a novel contrastive learning method that combines the advantages of both global and local representations, taking global images and local image patches as input and establishing different contrastive losses between them. That is, we design cross contrasts between global and local information, which is not only beneficial for object detection but also favorable for image classification, enabling us to surpass previous works, as shown in Figure 1. We also theoretically justify the necessity of combining multiple global and local contrastive losses, showing that they improve the lower bound of mutual information between two different views in contrastive learning, as shown in Sections 3.2.2 and 3.2.3.

The main contributions of this work are three-fold. (1) We demonstrate the inconsistency of accuracy between image classification and object detection when previous self-supervised learned representations are transferred to downstream tasks, and we propose three potential practices for designing a suitable unsupervised pretext task for object detection. As far as we know, this is the first work that studies this problem in depth. (2) We propose a novel detection-friendly self-supervised method, DetCo, which combines multiple global and local contrastive losses to pre-train discriminative representations for object detection. Theoretical justification shows that DetCo improves the lower bound of mutual information in contrastive learning. (3) Extensive experiments on PASCAL VOC [15], COCO [28] and Cityscapes [6] show that DetCo outperforms previous state-of-the-art methods when transferred to different downstream tasks such as object detection, segmentation, pose estimation, and 3D shape prediction.
2. Related Work
Existing unsupervised methods for representation learning can be roughly divided into two classes: generative and discriminative. Generative methods [11, 14, 12, 2] typically rely on auto-encoding of images [38, 23, 37] or adversarial learning [17], and operate directly in pixel space. Therefore, most of them are computationally expensive, and the pixel-level details required for image generation may not be necessary for learning high-level representations.

Among discriminative methods [9, 5], self-supervised contrastive learning [5, 19, 3, 18] currently achieves state-of-the-art performance, arousing extensive attention from researchers. Unlike generative methods, contrastive learning avoids the computation-consuming generation step by pulling representations of different views of the same image (i.e., positive pairs) close, and pushing representations of views from different images (i.e., negative pairs) apart. Chen et al. [4] developed a simple framework, termed SimCLR, for contrastive learning of visual representations. It learns features by contrasting images after a composition of data augmentations. After that, He et al. [19] and Chen et al. [5] proposed MoCo and MoCo v2, using a moving average network (momentum encoder) to maintain consistent representations of negative pairs drawn from a memory bank. Recently, SwAV [3] introduced online clustering into contrastive learning, without requiring pairwise comparisons. BYOL [18] avoided the use of negative pairs by iteratively bootstrapping the outputs of a network to serve as targets for an enhanced representation.

Moreover, earlier methods rely on various pretext tasks to learn visual representations. Relative patch prediction [9, 10], colorizing gray-scale images [40, 24], image inpainting [34], image jigsaw puzzles [31], image super-resolution [25], and geometric transformations [13, 16] have been proven useful for representation learning.

Nonetheless, most of the aforementioned methods are specifically designed for image classification while neglecting object detection. Our work focuses on designing a better pretext task that is friendly to object detection.

Figure 2. Performance of several self-supervised methods when transferred to downstream tasks: ImageNet classification and PASCAL VOC detection. The accuracy of classification and detection is inconsistent and shows low correlation. (Methods shown: MoCo v1, MoCo v2, DetCo, SwAV, supervised; axes: top-1 accuracy of ImageNet classification and mAP of VOC detection.)
3. Methods
In this section, we first analyze the misalignment of accuracy between classification and detection for state-of-the-art unsupervised methods, and then point out three practices for designing a detection-friendly pretext task. Second, following the proposed practices, we present DetCo, shown in Figure 4. It is composed of (1) a hierarchical intermediate contrastive loss that keeps the features at multiple stages discriminative, and (2) cross global-and-local contrasts, i.e., contrastive losses built across global image and local patch features, which remove part of the contextual information to enhance the representation ability. After that, we give a theoretical proof that the cross global-and-local contrast improves the lower bound of mutual information. Finally, we present the implementation details of DetCo, e.g., the settings of important hyper-parameters.
We analyze in detail the performance of recent self-supervised learning methods by transferring them to image classification and object detection. We are surprised to find that the performance on classification and detection is largely inconsistent. Specifically, we select a series of methods: supervised ResNet-50 [22], Relative-Loc [9], MoCo v1 [19], MoCo v2 [5], and SwAV [3]. To ensure impartial comparisons, we follow the same fine-tuning setting as MoCo [19]; detailed settings can be found in the Appendix. Following [19, 4], we report the linear classification top-1 accuracy on ImageNet [8] for image classification, and mAP on PASCAL VOC 07+12 [15] for object detection. Note that the results of Relative-Loc are borrowed from OpenSelfSup (https://github.com/open-mmlab/OpenSelfSup), and the detection results of SwAV are produced by us using the pre-trained weights from the official code (https://github.com/facebookresearch/swav).

Figure 3. Performance of VOC SVM classification at Res4 and Res5, and VOC detection. Although Relative-Loc is a non-contrastive method, it keeps shallow-layer features discriminative and predicts positions between local patches, enabling competitive detection results.
As shown in Figure 2, SwAV achieves the best linear classification top-1 accuracy of 72.7%, which is 12.1% higher than MoCo v1 and 5.2% higher than MoCo v2. However, on the detection task, MoCo v2 achieves 57.0% mAP, while SwAV yields merely 54.5%, close to the supervised ResNet-50. Moreover, from Figure 3, we find that although the VOC classification performance of Relative-Loc [9] is much lower than that of the other methods, its detection performance is competitive. These phenomena indicate that for self-supervised pretext methods, the transfer performance on image classification has a low correlation with that on object detection.
Why is the detection performance of these methods so different?
MoCo v1 and v2 are contrastive learning methods, while SwAV is a clustering-based method in which samples are classified into 3000 cluster centers during training. Therefore, the training process of SwAV is, to some extent, similar to supervised classification. As a result, compared with contrastive learning methods, clustering-based methods are friendlier to the image classification task, which is why SwAV has performance similar to the supervised ResNet-50 on both image classification and object detection. Moreover, we consider contrastive learning methods better suited than clustering-/classification-based methods for object detection, because the latter assume the prior knowledge that there is only one object in a given image, which is misaligned with the target of object detection. Contrastive learning methods do not require this prior knowledge; they discriminate images from a holistic perspective.
Why does the non-contrastive method Relative-Loc achieve competitive detection performance?
It is an interesting phenomenon that Relative-Loc falls far below contrastive methods in classification while remaining competitive in detection, as shown in Figure 3. We consider two potential reasons. (1) In Relative-Loc, not only the final features but also the features from the shallow stages have strong discrimination capability (see the red circle in Figure 3), which inspires us to enhance the discrimination capability of features from different depths, thereby improving the detection performance. (2) Relative-Loc focuses on predicting relative positions between local patches, which benefits the detection task because the capability of local representations is essential for object detection. This phenomenon inspires us to enhance the local representation in the contrastive learning framework.

Figure 4. The overall pipeline of DetCo compared with MoCo [19]. (a) MoCo's framework, which considers only the high-level feature and learns contrast from a global perspective. (b) Our DetCo, which appends hierarchical intermediate contrasts and two additional local patch views as input, building contrastive losses across the global and local representations. DetCo improves detection transfer ability by following the proposed three good practices. Note that "T" denotes image transforms and "L_{l↔g}" denotes the contrastive loss across local and global features. "Queue_{g/l}" denotes the different memory banks [39] for global/local features.
What is the guideline for designing a detection-friendly pretext task?
Based on the above analysis, we argue that designing a good detection-friendly pretext task is different from designing a classification-friendly one. Here, we summarize three good practices for detection-friendly pretext tasks. (1) Instance discrimination serves better than classification or clustering as a pretext task for object detection. (2) A pretext task should keep both low-level and high-level features discriminative for object detection. (3) Apart from global image features, local patch features are also essential for object detection. In particular, regarding practice (2), modern detectors often predict results on multi-level feature maps (e.g., Faster RCNN-FPN [26], RetinaNet [27]) and thus require reliable multi-level feature maps. We follow practice (1) and adopt MoCo v2 as our baseline model, and we design DetCo in Section 3.2 by strictly following practices (2) and (3). We conduct controlled experiments in Section 4 to verify the proposed practices.
Following the proposed practices, DetCo is designed by adding the intermediate multi-stage contrastive loss and the cross local-and-global contrasts on top of MoCo v2. The overall architecture of DetCo is illustrated in Figure 4. The loss function of DetCo is defined as follows:

$$\mathcal{L}(I_q, I_k, P_q, P_k) = \sum_{i=1}^{4} w_i \cdot \left( \mathcal{L}^i_{g\leftrightarrow g} + \mathcal{L}^i_{l\leftrightarrow l} + \mathcal{L}^i_{g\leftrightarrow l} \right), \qquad (1)$$

where $I$ represents a global image and $P$ represents the local patch set. Eqn. 1 is a multi-stage contrastive loss; in each stage, there are three cross local-and-global contrastive losses. We describe the multi-stage contrastive loss $\sum_{i=1}^{4} w_i \cdot \mathcal{L}^i$ in Section 3.2.1, and the cross local-and-global contrasts $\mathcal{L}_{g\leftrightarrow g} + \mathcal{L}_{l\leftrightarrow l} + \mathcal{L}_{g\leftrightarrow l}$ in Section 3.2.2.

Practice (2) requires both low-level and high-level features to keep strong instance discrimination ability. To verify its effectiveness, we make an intuitive modification to the original MoCo. Specifically, we feed one image to a standard ResNet-50 backbone, which outputs features from different stages, termed Res2, Res3, Res4, Res5. MoCo uses only Res5, but we use all levels of features to calculate contrastive losses, ensuring that each stage of the backbone produces discriminative representations.

Given an image $I \in \mathbb{R}^{H \times W \times 3}$, it is first transformed into two global views $I_q$ and $I_k$ with two transformations randomly drawn from a set of transformations on global views, termed $T_g$. We aim at training an $encoder_q$ together with an $encoder_k$ of the same architecture, where $encoder_k$ updates its weights with a momentum update strategy [19]. The $encoder_q$ contains a backbone and four global MLP heads to extract features from four levels. We feed $I_q$ to the backbone $b_{\theta_q}(\cdot)$ with parameters $\theta$, which extracts features $\{f_2, f_3, f_4, f_5\} = b_{\theta_q}(I_q)$, where $f_i$ denotes the feature from the $i$-th stage. After obtaining the multi-level features, we append four global MLP heads $\{mlp^2_q(\cdot), mlp^3_q(\cdot), mlp^4_q(\cdot), mlp^5_q(\cdot)\}$ whose weights are not shared. As a result, we obtain four global representations $\{q^2_g, q^3_g, q^4_g, q^5_g\} = encoder_q(I_q)$. Likewise, we can easily get $\{k^2_g, k^3_g, k^4_g, k^5_g\} = encoder_k(I_k)$.

MoCo uses InfoNCE to calculate the contrastive loss, formulated as:

$$\mathcal{L}_{g\leftrightarrow g}(I_q, I_k) = -\log \frac{\exp(q_g \cdot k_{g+}/\tau)}{\sum_{i=0}^{K} \exp(q_g \cdot k_{gi}/\tau)}, \qquad (2)$$

where $\tau$ is a temperature hyper-parameter [39]. Our loss function is similar to MoCo's, except that we extend it to multi-level contrastive losses over the multi-stage features:

$$\mathcal{L} = \sum_{i=1}^{4} w_i \cdot \mathcal{L}^i_{g\leftrightarrow g}, \qquad (3)$$

where $w_i$ is the loss weight and $i$ indicates the current stage. Inspired by the loss weight setting in PSPNet [41], we set the loss weights of shallow layers smaller than those of deep layers, and we obtain the optimal setting by grid search. In addition, we build an individual memory bank $queue_i$ for each level. In the Appendix, we provide the pseudo-code of the intermediate contrastive loss.

Following practice (3), we aim at enhancing the local patch representation of DetCo. We first transform the input image into 9 local patches using a jigsaw augmentation (the augmentation details are given in Section 6.5). In this way, the contextual information of the global image is reduced. These patches pass through the encoder, yielding 9 local feature representations. After that, we combine these features into one representation and build a cross global-and-local contrastive loss.

Given an image $I \in \mathbb{R}^{H \times W \times 3}$, it is first transformed into two local patch sets $P_q$ and $P_k$ by two transformations selected from a local transformation set, termed $T_l$. There are 9 patches $\{p_1, p_2, \dots, p_9\}$ in each local patch set. We feed the local patch set to the backbone and get 9 features $F_p = \{f_{p_1}, f_{p_2}, \dots, f_{p_9}\}$ at each stage. Taking one stage as an example, we build a local patch MLP head $mlp_{local}(\cdot)$, which does not share weights with the $mlp_{global}(\cdot)$ of Section 3.2.1. Then $F_p$ is concatenated and fed to the local patch MLP head to obtain the final representation $q_l$. Likewise, we use the same approach to get $k_l$.

The cross contrastive loss has two parts: the global↔local contrastive loss and the local↔local contrastive loss. The global↔local contrastive loss can be written as:

$$\mathcal{L}_{g\leftrightarrow l}(P_q, I_k) = -\log \frac{\exp(q_l \cdot k_{g+}/\tau)}{\sum_{i=0}^{K} \exp(q_l \cdot k_{gi}/\tau)}. \qquad (4)$$

Similarly, the local↔local contrastive loss can be formulated as:

$$\mathcal{L}_{l\leftrightarrow l}(P_q, P_k) = -\log \frac{\exp(q_l \cdot k_{l+}/\tau)}{\sum_{i=0}^{K} \exp(q_l \cdot k_{li}/\tau)}. \qquad (5)$$
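To make the three per-stage contrasts concrete, below is a minimal PyTorch-style sketch of Eqns. 2, 4, and 5 for one stage. It assumes L2-normalized query/key embeddings and per-branch memory banks; the helper name info_nce, the queue layout, and the temperature value are illustrative (τ = 0.2 follows MoCo v2), not the authors' exact implementation.

import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau=0.2):
    # Generic InfoNCE (Eqns. 2, 4, 5): one positive key per query, K negatives from a memory bank.
    # q, k_pos: (N, D) L2-normalized embeddings; queue: (D, K) negative keys.
    l_pos = torch.einsum("nd,nd->n", q, k_pos).unsqueeze(-1)  # (N, 1) positive logits
    l_neg = torch.einsum("nd,dk->nk", q, queue)               # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)

def stage_loss(q_g, k_g, q_l, k_l, queue_g, queue_l, tau=0.2):
    # Per-stage DetCo loss: the inner term of Eqn. 1 for stage i.
    loss_gg = info_nce(q_g, k_g, queue_g, tau)  # global<->global, Eqn. 2
    loss_ll = info_nce(q_l, k_l, queue_l, tau)  # local<->local, Eqn. 5
    loss_gl = info_nce(q_l, k_g, queue_g, tau)  # global<->local, Eqn. 4 (local query vs. global keys)
    return loss_gg + loss_ll + loss_gl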
By reducing the contextual information of the global image and building cross global-and-local contrasts, each local patch of the image becomes aware of instance discrimination. As a result, both detection and classification performance improve. We give a theoretical explanation from the perspective of mutual information optimization in Section 3.2.3.

This section analyzes how DetCo improves the lower bound (LB) of the mutual information (MI) between two views, leading to better contrastive learning. Oord et al. [33] demonstrated that contrastive learning maximizes the lower bound of mutual information, which is equivalent to minimizing the InfoNCE loss function:

$$MI \geq \log(K) - \mathcal{L}_{NCE} \triangleq \text{Lower Bound } (LB),$$

where $\mathcal{L}_{NCE}$ is defined in Eqn. (2) and $K$ is the number of negative samples. In Section 3.2.2, we introduced the cross global-and-local contrasts. Here we show that the MI lower bound between a global image view $I_1$ and a set of local patches $P_2$ is larger than that between two global image views $I_1$ and $I_2$, denoted as $LB_{g\leftrightarrow l} > LB_{g\leftrightarrow g}$. We have

$$LB_{g\leftrightarrow l} - LB_{g\leftrightarrow g} = (\log(K) - \mathcal{L}^{g\leftrightarrow l}_{NCE}) - (\log(K) - \mathcal{L}^{g\leftrightarrow g}_{NCE}) = \mathcal{L}^{g\leftrightarrow g}_{NCE}(I_1, I_2) - \mathcal{L}^{g\leftrightarrow l}_{NCE}(I_1, P_2) > 0. \qquad (6)$$

Intuitively, Eqn. (6) holds because the information of a complete image often contains two parts, content information and contextual information, denoted as $Info_I = (Info_{content}, Info_{context})$. A set of randomly-jittered local patches $P$ loses the contextual information compared to the global image $I$, making the similarity between positive global↔local pairs tend to be smaller than that between positive global↔global pairs (i.e., the similarity between two global image views is often larger than the similarity between a set of random local patches and a global image), as demonstrated in the Appendix. That is, we have $\mathcal{L}^{g\leftrightarrow g}_{NCE}(I_1, I_2) > \mathcal{L}^{g\leftrightarrow l}_{NCE}(I_1, P_2)$.

With $LB_{g\leftrightarrow l} > LB_{g\leftrightarrow g}$, we see that the global↔local contrastive loss improves the lower bound of mutual information compared to the contrast between two global image views. Similarly, we can verify that the local↔local contrast has the same benefit.

Here we introduce the basic implementation details; see the Appendix for more.

Image augmentations. The global image augmentation is the same as MoCo v2 [5]. First, a random region covering at least 20% of the image is cropped and resized, with a random horizontal flip, followed by random color jittering of brightness, contrast, saturation, hue, and grayscale. Gaussian blur is also used for augmentation. The local patch augmentation follows PIRL [30]. First, a random region covering at least 60% of the image is cropped and resized, followed by random flip, color jitter, and blur, sharing the same parameters with the global augmentation. Then we divide the image into 3×3 grids and apply a random crop within each grid cell to avoid continuity between patches. Finally, we obtain nine patches.

Training details. We use OpenSelfSup (https://github.com/open-mmlab/OpenSelfSup) as the codebase. The batch size is 256 with 8 V100 GPUs for every experiment. We use a standard ResNet-50 [22] for all experiments. For the pretext task, most training hyper-parameters are the same as MoCo v2. We pre-train for 100 epochs on ImageNet for the ablation study; unless otherwise specified, we pre-train for 200 epochs on ImageNet for fair comparison with other methods. We also try an Auto Augmentation strategy [7] in DetCo pre-training, which improves performance on downstream COCO detection.
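As a rough sketch of the jigsaw-style patch generation described above, the snippet below splits an image into a 3×3 grid and randomly crops a smaller patch inside each cell so that neighboring patches are not continuous. The concrete cell and patch sizes (85 and 64 pixels) are illustrative assumptions in the spirit of PIRL-style jigsaw augmentation, not the paper's exact values.

import random
from PIL import Image

def jigsaw_patches(img, grid=3, cell=85, patch=64):
    # Resize so the image is exactly grid x grid cells, then jitter-crop a patch in each cell.
    img = img.resize((grid * cell, grid * cell))
    patches = []
    for row in range(grid):
        for col in range(grid):
            # A random offset inside the cell breaks continuity between adjacent patches.
            dx = random.randint(0, cell - patch)
            dy = random.randint(0, cell - patch)
            left, top = col * cell + dx, row * cell + dy
            patches.append(img.crop((left, top, left + patch, top + patch)))
    return patches  # nine patches for a 3x3 grid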
4. Experiments
Experiment Settings.
We conduct all the controlled experiments by training for 100 epochs. We adopt MoCo v2 as our strong baseline. More ablation studies on hyper-parameters are shown in the Appendix. In Tables 1 and 2, "HIC" means Hierarchical Intermediate Contrastive loss, and "CGLC" means Cross Global and Local Contrasts.

Effectiveness of hierarchical intermediate contrastive loss. As shown in Table 1 (a) and (b), when adding the hierarchical intermediate contrastive loss to MoCo v2, the classification performance drops but detection improves. This matches the analysis and practice (2) in Section 3.1: for classification, only the last feature needs to be discriminative, whereas for detection it is better to keep the features at multiple stages discriminative. We also evaluate the VOC SVM classification accuracy at the four stages Res2, Res3, Res4, Res5 to demonstrate the enhancement of the intermediate features. As shown in Table 2 (a) and (b), the discrimination ability of the shallow features improves vastly compared with the baseline.
Effectiveness of cross global and local contrasts. As shown in Table 1 (b) and (c), when adding the cross global and local contrasts, both classification and detection performance improve and surpass the MoCo v2 baseline. This improvement mainly benefits from the local view, which removes the context information and improves the lower bound of mutual information. Moreover, the cross global and local contrasts make each local patch learn an independent representation of the whole image, which is consistent with object detection. From Table 2 (b) and (c), the cross contrasts further improve the representation ability of all the stages.

+HIC | +CGLC | Top1 | Top5 | mAP
(a) ✗ | ✗ |  |  |
(b) ✓ | ✗ | ↓ | ↓ | ↑
(c) ✓ | ✓ | ↑ | ↑ | ↑

Table 1. Ablation: hierarchical intermediate contrastive loss (HIC) and cross global and local contrasts (CGLC). The results are evaluated on ImageNet linear classification and PASCAL VOC 07+12 detection.

+HIC | +CGLC | Res2 | Res3 | Res4 | Res5
(a) ✗ | ✗ |  |  |  |
(b) ✓ | ✗ | ↑ | ↑ | ↑ | ↓
(c) ✓ | ✓ | ↑ | ↑ | ↑ | ↑

Table 2. Ablation: hierarchical intermediate contrastive loss (HIC) and cross global and local contrasts (CGLC). Accuracy of the features at different stages, evaluated by PASCAL VOC07 SVM classification.
Setup. We choose three representative detectors: Faster RCNN-C4 [36], Faster RCNN-FPN [26], and RetinaNet [27]. The first two are two-stage detectors and RetinaNet is a one-stage detector. Our training settings strictly follow MoCo [19], including the use of "SyncBN" [35] in the backbone and FPN. We report object detection results on the PASCAL VOC and COCO datasets.
PASCAL VOC.
As shown in Table 6, MoCo v2 is a strong baseline compared with other unsupervised methods. However, with only 100 epochs of pre-training, DetCo achieves almost the same performance as MoCo v2-800ep (800 epochs of pre-training). Moreover, DetCo-800ep establishes a new state of the art at 58.2 mAP and 65.0 AP75, a 6.2% improvement in AP75 over the supervised counterpart.

COCO with 1× and 2× schedules. Table 4 shows the Mask RCNN [21] results with the standard 1× schedule, where DetCo also outperforms MoCo v2 and the other methods in all metrics. The results for the 2× schedule are in the Appendix. Columns 2-3 of Table 5 show the results of the one-stage detector RetinaNet. DetCo pre-training is also better than ImageNet supervised pre-training and MoCo v2 with both the 1× and 2× schedules. For instance, DetCo is 1.3% higher than MoCo v2 on AP with the 1× schedule.

COCO with fewer training iterations.
COCO dataset ismuch larger than PASCAL VOC in the data scale. Eventraining from scratch [20] can get a satisfactory result.To verify the effectiveness of unsupervised pre-training,we conduct experiments on extremely stringent conditions:only train detectors with 12k iterations( ≈ × vs. ethod Mask R-CNN R50-C4 COCO 12k Mask R-CNN R50-FPN COCO 12kAP bb AP bb AP bb AP mk AP mk AP mk AP bb AP bb AP bb AP mk AP mk AP mk Rand Init 7.9 16.4 6.9 7.6 14.8 7.2 10.7 20.7 9.9 10.3 19.3 9.6Supervised 27.1 46.8 27.6 24.7 43.6 25.3 28.4 48.3 29.5 26.4 45.2 25.7InsDis[39] 25.8 (-1.3) (-3.6) (-0.6) (-1.0) (-3.2) (-0.8) (-4.2) (-6.8) (-4.4) (-3.6) (-6.3) (-2.0)
PIRL[30] 25.5 (-1.6) (-4.2) (-0.8) (-1.5) (-3.7) (-1.4) (-4.7) (-7.9) (-5.1) (-4.3) (-7.3) (-3.0)
SwAV[3] 16.5 (-10.6) (-11.6) (-14.1) (-8.6) (-11.6) (-10.7) (-2.9) (-2.1) (-4.1) (-1.6) (-1.7) (-0.4)
MoCo[19] 26.9 (-0.2) (-2.3) (+0.6) (-0.1) (-1.8) (+0.3) (-2.8) (-4.9) (-2.9) (-2.5) (-4.4) (-0.9)
MoCov2[5] 27.6 (+0.5) (-1.5) (+1.3) (+0.4) (-1.0) (+1.0) (-1.8) (-3.4) (-1.8) (-1.6) (-3.1) (0.0)
DetCo (+2.2) (+1.6) (+2.7) (+1.8) (+1.7) (+2.0) (-0.5) (-1.4) (-0.2) (-0.4) (-1.0) (+1.2)
DetCo+AA (+2.7) (+2.3) (+3.8) (+2.2) (+2.4) (+2.6) (+1.2) (+1.1) (+1.5) (+1.2) (+1.4) (+3.0)
Table 3.
Object detection and instance segmentation fine-tuned on COCO . All methods are pretrained 200 epochs on ImageNet.
Green means increase and gray means decrease. Our method converges faster than other unsupervised methods under a limited number of iterations. "AA" means we use Auto Augmentation in pre-training.
Method | Mask R-CNN R50-C4 COCO 90k: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75 | Mask R-CNN R50-FPN COCO 90k: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75
Rand Init | 26.4 44.0 27.8 29.3 46.9 30.8 | 31.0 49.5 33.2 28.5 46.8 30.4
Supervised | 38.2 58.2 41.2 33.3 54.7 35.2 | 38.9 59.6 42.7 35.4 56.5 38.1
InsDis [39] | 37.7 (-0.5) (-1.2) (-0.3) (-0.3) (-0.6) (0.0) | (-1.5) (-2.0) (-2.1) (-1.3) (-1.9) (-1.7)
PIRL[30] 37.4 (-0.8) (-1.7) (-1.0) (-0.6) (-1.3) (-0.5) (-1.4) (-2.0) (-1.7) (-1.4) (-1.9) (-1.9)
SwAV[3] 32.9 (-5.3) (-3.9) (-6.7) (-3.8) (-4.3) (-4.8) (-0.4) (+0.8) (-1.3) (0.0) (+0.5) (-0.4)
MoCo[19] 38.5 (+0.3) (+0.1) (+0.4) (+0.3) (+0.1) (+0.4) (-0.4) (-0.7) (-0.7) (-0.3) (-0.6) (-0.4)
MoCov2[5] 38.9 (+0.7) (+0.2) (+0.8) (+0.9) (+0.5) (+1.3) (0.0) (-0.2) (-0.3) (+0.1) (0.0) (0.0)
DetCo (+1.2) (+1.0) (+1.1) (+1.1) (+1.0) (+1.4) (+0.6) (+0.7) (+0.4) (+0.5) (+0.4) (+0.5)
DetCo+AA (+1.6) (+1.5) (+1.8) (+1.4) (+1.6) (+1.5) (+1.2) (+1.4) (+1.2) (+1.0) (+1.5) (+0.8)
Table 4.
Object detection and instance segmentation fine-tuned on COCO. All methods are pretrained 200 epochs on ImageNet. Our DetCo is state-of-the-art, surpassing MoCo v2 and the supervised method in all metrics. "AA" means we use Auto Augmentation in pre-training.
Method | RetinaNet R50 12k: AP AP50 AP75 | RetinaNet R50 90k: AP AP50 AP75 | RetinaNet R50 180k: AP AP50 AP75 | Keypoint RCNN R50 180k: AP^kp AP^kp_50 AP^kp_75
Rand Init | 4.0 7.9 3.5 | 24.5 39.0 25.7 | 32.2 49.4 34.2 | 65.9 86.5 71.7
Supervised | 24.3 40.7 25.1 | 37.4 56.5 39.7 | 38.9 58.5 41.5 | 65.8 86.9 71.9
InsDis [39] | 19.0 (-5.3) (-8.7) (-5.5) | (-1.9) (-2.4) (-1.5) | (-0.9) (-1.1) (-1.0) | (+0.7) (+0.2) (+0.7)
PIRL[30] 19.0 (-5.3) (-9.0) (-5.3) (-1.7) (-2.3) (-1.3) (-0.4) (-0.9) (-0.3) (+0.7) (+0.6) (+0.2)
SwAV[3] 19.7 (-4.6) (-6.0) (-5.6) (-2.2) (-1.6) (-2.2) (-0.3) (+0.3) (-0.4) (+0.2) (0.0) (-0.4)
MoCo[19] 20.2 (-4.1) (-6.8) (-4.3) (-1.1) (-1.5) (-0.7) (-0.2) (-0.6) (0.0) (+1.0) (+0.5) (+0.6)
MoCov2[5] 22.2 (-2.1) (-3.8) (-2.1) (-0.2) (-0.3) (-0.1) (+0.4) (+0.4) (+0.6) (+1.0) (+0.4) (+1.2)
DetCo 23.6 (-0.7) (-2.0) (-0.5) (+0.6) (+0.9) (+1.0) (+0.9) (+1.0) (+0.9) (+1.4) (+0.6) (+1.5)
DetCo+AA (+1.0) (+0.9) (+1.4) (+1.0) (+1.3) (+1.5) (+0.8) (+0.8) (+1.1) - - -
Table 5.
One-stage object detection and keypoint detection fine-tuned on COCO. All methods are pretrained 200 epochs on ImageNet. DetCo outperforms all unsupervised counterparts. "AA" means we use Auto Augmentation in pre-training.

The 12k iterations leave the detectors heavily under-trained and far from convergence, as shown in Table 3 and Table 5 (column 1). Under this setting, for Mask RCNN-C4, DetCo exceeds MoCo v2 by 3.1% in AP^bb and outperforms supervised methods in all metrics, which indicates that DetCo can speed up training convergence. For Mask RCNN-FPN and RetinaNet, DetCo also has significant advantages over MoCo v2 and comes closest to the supervised counterpart.

Discussion.
On the one hand, when the dataset of the downstream task is small (e.g., PASCAL VOC), DetCo pre-training has significant advantages over the supervised method. When the dataset is very large (e.g., COCO), DetCo is still better than the supervised method, but the advantage is smaller than on the small-scale dataset. On the other hand, when computational resources are limited, DetCo speeds up training convergence compared with other unsupervised methods and is on par with supervised methods.
Multi-Person Pose Estimation.
The last column of Table 5 shows the COCO keypoint detection results using Mask RCNN. DetCo surpasses the other methods in all metrics, e.g., 1.4% AP^kp and 1.5% AP^kp_75 higher than the supervised counterpart.

Segmentation for Autonomous Driving.
Cityscapes is a dataset for autonomous driving in urban streets. We follow MoCo to evaluate instance segmentation with Mask RCNN and semantic segmentation with FCN-16s [29]. The results are shown in Table 7. On instance segmentation, DetCo outperforms the supervised counterpart by 3.6% on AP^mk_50. On semantic segmentation, which is also a dense prediction task, DetCo is 1.9% higher than supervised pre-training and 0.8% higher than MoCo v2.
3D Human Shape Prediction
Estimating 3D shape from a single 2D image is challenging, so we evaluate DetCo on the DensePose [1] task. As shown in Table 9, DetCo substantially outperforms the ImageNet supervised method and MoCo v2 in all metrics, especially by 1.4% on AP^dp_50.

Method | Epoch | AP | AP50 | AP75
Rand Init | - | 33.8 | 60.2 | 33.1
Supervised | 90 | 53.5 | 81.3 | 58.8
InsDis [39] | 200 | 55.2 (+1.7) | (-0.4) | (+2.4)
PIRL [30] 200 55.5 (+2.0) (-0.3) (+2.5)
SwAV [3] 800 56.1 (+2.6) (+1.3) (+3.9)
MoCo [19] 200 55.9 (+2.4) (+0.2) (+3.8)
MoCov2 [5] 200 57.0 (+3.5) (+1.1) (+4.8)
MoCov2 [5] 800 57.4 (+3.9) (+1.2) (+5.2)
DetCo 100 57.4 (+3.9) (+1.2) (+5.1)
DetCo | 200 | 57.8 (+4.3) | (+1.3) | (+5.4)
DetCo | 800 | 58.2 (+4.7) | (+1.4) | (+6.2)
Table 6.
Object detection fine-tuned on PASCAL VOC 07+12 using Faster RCNN-C4.
DetCo-100ep is on par with the previous state of the art, and DetCo-800ep achieves the best performance.
Methods | Instance Seg.: AP^mk AP^mk_50 | Semantic Seg.: mIoU
Rand Init | 25.4 | 51.1 | 65.3
Supervised | 32.9 | 59.6 | 74.6
InsDis [39] | 33.0 (+0.1) | (+0.5) | (-1.3)
PIRL [30] | 33.9 (+1.0) | (+2.1) | (0.0)
SwAV [3] 33.9 (+1.0) (+2.8) (-1.6)
MoCo [19] 32.3 (-0.6) (-0.3) (+0.7)
MoCov2 [5] 33.9 (+1.0) (+1.2) (+1.1)
DetCo (+1.8) (+3.6) (+1.9)
Table 7.
DetCo vs. supervised and other unsupervised methods on the Cityscapes dataset. All methods are pretrained 200 epochs on ImageNet. We evaluate instance segmentation and semantic segmentation tasks.
Figure 5.
Attention maps generated by DetCo and MoCov2 [5].
DetCo can activate more accurate object regions in the heatmap than MoCo v2. More visualization results are in the Appendix.
We follow the standard settings: ImageNet linear classification and VOC SVM classification. The training epochs and learning rate are the same as MoCo. Table 8 shows the results: DetCo outperforms its strong baseline MoCo v2 by 1.1% in top-1 accuracy. It is also competitive in VOC SVM classification accuracy compared with state-of-the-art counterparts.
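For reference, a minimal sketch of the standard linear evaluation protocol used in such comparisons: freeze the pre-trained backbone and train only a linear classifier on the pooled features. The optimizer settings below are illustrative, not the paper's exact values.

import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim=2048, num_classes=1000):
    # Freeze all backbone weights; only the linear classifier is trained.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)  # lr is illustrative
    return classifier, optimizer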
Discussion.
We report classification accuracy only to verify the robustness of DetCo, because our primary goal is detection transfer ability. As analyzed above,
Method | Epoch | ImageNet Top1 | ImageNet Top5 | VOC07 Acc
Jigsaw [31] | - | 44.6 | - | 64.5
Rotation [16] | - | 55.4 | - | 63.9
InsDis [39] | 200 | 56.5 | - | 76.6
LocalAgg [42] | 200 | 58.8 | - | -
PIRL [30] | 800 | 63.6 | - | 81.1
SimCLR [4] | 1000 | 69.3 | 89.0 | -
BYOL [18] | 1000 | 74.3 | 91.6 | -
SwAV [3] | 200 | 72.7 | - | 87.6
MoCo [19] | 200 | 60.6 | - | 79.2
MoCo v2 [5] | 200 | 67.5 | - | 84.1
DetCo | 200 | 68.6 | 88.5 | 85.1
Table 8.
Comparison of ImageNet linear classification and VOC SVM classification. Although DetCo is designed for detection, it is also robust and competitive on the classification task, and it exceeds the MoCo v2 baseline by 1.1%.
Method | Epoch | AP^dp | AP^dp_50 | AP^dp_75
Rand Init | - | 40.8 | 78.6 | 37.3
Supervised | 90 | 50.8 | 86.3 | 52.6
MoCo [19] | 200 | 49.6 (-1.2) | (-0.4) | (-2.1)
MoCo v2 [5] 200 50.9 (+0.1) (+0.9) (+0.3)
DetCo 200 (+0.5) (+1.4) (+0.7)
Table 9.
DetCo vs. other methods on the DensePose task. It also performs best on monocular 3D human shape prediction.
SwAV, the strongest method on classification, does not perform well on object detection. This supports our point that detection needs a different pretext task design from classification, and that the design of DetCo is friendlier to detection.
Figure 5 visualizes the attention maps of DetCo and MoCo v2. When there is more than one object in the image, DetCo successfully locates all the objects, while MoCo v2 fails to activate some of them. Moreover, in the last column, the attention map of DetCo is more accurate than that of MoCo v2 on the object boundary. This further indicates that the localization capability of DetCo is stronger than that of MoCo v2.
5. Conclusion
In this paper, we focus on designing a good pretext task for object detection. First, we analyze a series of self-supervised methods in detail and identify the inconsistency of their performance when transferred to classification and detection tasks. Second, we propose three good practices for designing a detection-friendly self-supervised learning framework. Third, following the proposed practices, we propose DetCo, with a hierarchical intermediate contrastive loss and cross global and local contrasts. It achieves state-of-the-art performance on a series of detection-related tasks. We believe that there is no single best unsupervised pretext task for all downstream tasks, and we will put more effort into exploring this in the future.
Acknowledgement. We thank Huawei for supporting this work.

6. Appendix

In the appendix, we first show the results of DetCo on semi-supervised object detection in Section 6.1 and on more downstream tasks in Section 6.2. Second, in Section 6.3, we show the visualization results of DetCo and MoCo v2. Third, we give a proof of the lower bound improvement in Section 6.4. Fourth, we show more implementation details in Section 6.5. Finally, we present more ablation studies of DetCo in Section 6.6.
6.1. Semi-Supervised Object Detection

To verify the effectiveness of self-supervised learning on small-scale data, we randomly sample 1%, 2%, 5%, and 10% of the data to fine-tune Mask RCNN C4/FPN and RetinaNet. For all settings, we fine-tune the detectors with 12k iterations to avoid overfitting; other settings are the same as the COCO 1× and 2× schedules. The results for Mask RCNN with 1% and 2% data are shown in Table 10, the results for Mask RCNN with 5% and 10% data in Table 11, and the results for RetinaNet with 1%, 2%, 5%, and 10% data in Table 12. From Tables 10 and 12, we find that with only 1% and 2% data, all other unsupervised methods achieve lower results than the supervised counterpart; however, DetCo performs better than all supervised and unsupervised methods. Moreover, with 5% and 10% training data, DetCo also outperforms all counterparts by a large margin. These results show that the feature representations pre-trained by self-supervised learning are beneficial for semi-supervised object detection.

6.2. More Downstream Tasks

COCO with 2× schedule. Table 13 shows the results of Mask RCNN R50 C4/FPN on COCO with the 2× schedule. DetCo achieves state-of-the-art performance on both object detection and instance segmentation; for example, with Mask RCNN-C4, DetCo is better than the supervised method and better than MoCo v2 on AP^bb.

LVIS Instance Segmentation.
We use LVIS v1.0 for training and evaluation. MoCo [19] adopted LVIS v0.5, but that version is outdated and can no longer be downloaded from the official website, so we fine-tune and compare all methods on LVIS v1.0. The training schedule for LVIS is 180k iterations, the same as MoCo; other settings are also kept the same as MoCo. The results are shown in Table 14. DetCo again outperforms MoCo v2 and the supervised method on both detection and instance segmentation.
6.3. Visualization

We visualize the attention map of Res5 on the ImageNet dataset, whose resolution is 1/32 of the input image size. To obtain a relatively clear attention map, we enlarge the input image size. We calculate the mean of the Res5 tensor along the channel dimension and normalize the values to the range 0-1 to obtain the attention map. We then upsample the attention map to the input image size using bilinear interpolation and project it onto the image to obtain the visualized results.
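The procedure above can be sketched as follows, assuming res5 is the last-stage feature map of a standard ResNet-50 (stride 32); the function name and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def attention_map(res5, out_hw):
    # res5: (1, C, h, w) last-stage feature map; out_hw: (H, W) input image resolution.
    attn = res5.mean(dim=1, keepdim=True)            # channel-wise mean -> (1, 1, h, w)
    attn = attn - attn.amin()
    attn = attn / (attn.amax() + 1e-6)               # normalize values to [0, 1]
    attn = F.interpolate(attn, size=out_hw, mode="bilinear", align_corners=False)
    return attn[0, 0]                                # (H, W) map, projected onto the image for display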
Visualization Results. As shown in Figure 6, it is surprising that both MoCo v2 and DetCo can generate relatively high-quality attention maps that focus on the foreground objects. This demonstrates that contrastive-learning-based self-supervised representation methods can potentially solve salient object detection or object localization in an unsupervised manner.
Moreover, the attention maps of DetCo are much better than those of MoCo v2 in two main aspects: (1) more accurate boundary localization, and (2) more object discovery. We attribute this mainly to introducing the global-to-local contrasts into DetCo, which forces each local patch to be aware of instance discrimination. To optimize the global-to-local contrastive loss, each local patch needs to distinguish the foreground features; that is why DetCo can output a more accurate attention map. In contrast, MoCo v2 uses the whole image to extract features, so it only needs to activate the most discriminative area.
We visualize image retrieval results on the ImageNet validation dataset. First, we extract the final-layer features of all images using the representations learned by DetCo. Then we apply global average pooling (GAP) to the extracted features, followed by L2 normalization. For retrieval, we randomly select several images as queries and directly find the K nearest images in the feature space; K is set to 9 in this paper.
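A minimal sketch of this retrieval procedure (global average pooling, L2 normalization, cosine nearest neighbors); the function names are illustrative.

import torch
import torch.nn.functional as F

def knn_retrieve(query_feat, gallery_feats, k=9):
    # query_feat: (C,) pooled feature of the query; gallery_feats: (N, C) pooled gallery features.
    q = F.normalize(query_feat, dim=0)   # L2 normalization
    g = F.normalize(gallery_feats, dim=1)
    sims = g @ q                         # (N,) cosine similarities
    return sims.topk(k).indices          # indices of the k nearest images (k = 9 in the paper)

# Pooled features come from global average pooling of the final feature map, e.g.:
# feat = backbone(img).mean(dim=(2, 3))   # (1, C) from a (1, C, h, w) feature map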
Visualization Results. Figure 7 shows the nearest-neighbor retrieval results. We find that DetCo can successfully group images according to their categories in most cases, in an unsupervised manner.

Method | Mask R-CNN R50-FPN COCO 1% Data: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75 | Mask R-CNN R50-FPN COCO 2% Data: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75
Rand Init | 2.5 5.8 1.7 2.3 4.9 1.8 | 4.5 10.3 3.2 4.3 9.3 3.5
Supervised | 10.0 19.9 9.2 9.7 18.3 9.2 | 13.7 26.6 12.8 13.0 24.2 12.6
MoCo [19] | 9.1 (-0.9) (-2.6) (-0.6) (-1.1) (-2.2) (-0.9) | (-0.7) (-2.5) (-0.2) (-0.7) (-1.8) (-0.4)
MoCo v2[5] 9.9 (-0.1) (-1.2) (+0.3) (-0.2) (-1.1) (0.0) (+0.1) (-1.3) (+0.6) (-0.1) (-0.9) (+0.1)
DetCo (+0.7) (+0.3) (+1.2) (+0.5) (+0.7) (+0.7) (+0.6) (-0.3) (+1.1) (+0.5) (+0.4) (+0.5)
DetCo+AA (+2.4) (+3.6) (+2.6) (+2.4) (+3.6) (+2.8) (+2.3) (+3.0) (+2.8) (+2.3) (+2.2) (+2.5)
Table 10.
Semi-supervised two-stage detection fine-tuned on COCO with 1% and 2% data. All methods are pretrained 200 epochs on ImageNet.
Green means increase and gray means decrease. DetCo is better than supervised / unsupervised counterparts in all metrics.
Method | Mask R-CNN R50-FPN COCO 5% Data: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75 | Mask R-CNN R50-FPN COCO 10% Data: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75
Rand Init | 9.2 18.6 7.9 8.8 16.9 8.1 | 10.1 20.2 9.1 9.8 18.6 9.2
Supervised | 19.9 37.0 19.3 18.6 33.7 18.4 | 23.8 42.8 23.9 22.2 39.6 22.3
MoCo [19] | 19.6 (-0.3) (-1.9) (+0.7) (-0.3) (-1.4) (+0.2) | (-0.5) (-2.1) (0.0) (-0.3) (-1.6) (+0.1)
MoCo v2[5] 20.6 (+0.7) (-0.4) (+1.7) (+0.5) (0.0) (+0.8) (+0.3) (-0.8) (+0.9) (+0.3) (-0.5) (+1.0)
DetCo (+1.5) (+1.1) (+2.3) (+1.3) (+1.4) (+1.4) (+1.5) (+1.1) (+2.1) (+1.4) (+1.2) (+1.7)
DetCo+AA (+2.0) (+2.1) (+2.9) (+1.8) (+2.4) (+2.2) (+2.2) (+2.4) (+3.1) (+2.1) (+2.4) (+2.7)
Table 11.
Semi-supervised two-stage detection fine-tuned on COCO with 5% and 10% data. All methods are pretrained 200 epochs on ImageNet. DetCo is better than the supervised and unsupervised counterparts in all metrics.
Method | RetinaNet R50 COCO 1% Data: AP AP50 AP75 | 2% Data: AP AP50 AP75 | 5% Data: AP AP50 AP75 | 10% Data: AP AP50 AP75
Rand Init | 1.4 3.5 1.0 | 2.5 5.6 2.0 | 3.6 7.4 3.0 | 3.7 7.5 3.2
Supervised | 8.2 16.2 7.2 | 11.2 21.7 10.3 | 16.5 30.3 15.9 | 19.6 34.5 19.7
MoCo [19] | 7.0 (-1.2) (-2.7) (-0.7) | (-0.9) (-2.5) (-0.6) | (-1.5) (-3.3) (-1.0) | (-1.4) (-2.9) (-1.3)
MoCo v2[5] 8.4 (+0.2) (-0.4) (+0.8) (+0.8) (+0.1) (+1.2) (+0.3) (-0.7) (+0.9) (+0.4) (-0.2) (+0.5)
DetCo (+0.6) (+0.5) (+1.0) (+1.8) (+2.3) (+2.2) (+1.4) (+1.4) (+1.8) (+1.2) (+1.1) (+1.6)
DetCo+AA (+1.7) (+3.1) (+1.9) (+2.3) (+3.4) (+2.4) (+2.2) (+2.6) (+2.8) (+2.3) (+3.1) (+2.6)
Table 12.
Semi-supervised one-stage detection fine-tuned on COCO with 1%, 2%, 5%, and 10% data. All methods are pretrained 200 epochs on ImageNet. DetCo is better than the supervised and unsupervised counterparts in all metrics.
Method | Mask R-CNN R50-C4 COCO 180k: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75 | Mask R-CNN R50-FPN COCO 180k: AP^bb AP^bb_50 AP^bb_75 AP^mk AP^mk_50 AP^mk_75
Rand Init | 35.6 54.6 38.2 31.4 51.5 33.5 | 36.7 56.7 40.0 33.7 53.8 35.9
Supervised | 40.0 59.9 43.1 34.7 56.5 36.9 | 40.6 61.3 44.4 36.8 58.1 39.5
MoCo [19] | 40.7 (+0.7) (+0.6) (+1.0) (+0.7) (+0.8) (+0.7) | (+0.2) (+0.3) (+0.3) (+0.1) (+0.3) (+0.2)
MoCov2[5] 41.0 (+1.0) (+0.7) (+1.4) (+0.9) (+0.7) (+1.1) (+0.3) (+0.2) (+0.3) (+0.2) (+0.6) (+0.3)
DetCo (+1.4) (+1.3) (+1.6) (+1.1) (+1.3) (+1.4) (+0.9) (+0.8) (+1.2) (+0.8) (+1.1) (+1.0)
DetCo+AA (+1.3) (+1.3) (+1.9) (+1.1) (+1.4) (+1.3) (+0.9) (+1.2) (+1.2) (+0.9) (+1.4) (+1.0)
Table 13.
Object detection and instance segmentation fine-tuned on COCO (2× schedule). All methods are pretrained 200 epochs on ImageNet. Our DetCo is state-of-the-art, surpassing MoCo v2 and the supervised method in all metrics.

Method | AP^bb | AP^bb_50 | AP^bb_75 | AP^mk | AP^mk_50 | AP^mk_75
Rand Init | 19.6 | 31.8 | 20.9 | 19.0 | 29.8 | 20.1
Supervised | 22.7 | 36.8 | 24.1 | 22.2 | 34.6 | 23.5
MoCo [19] | 23.1 | 37.4 | 24.5 | 22.5 | 35.1 | 23.8
MoCo v2 [5] | 23.2 | 37.4 | 24.7 | 22.8 | 35.1 | 24.4
DetCo
Table 14.
DetCo vs. supervised and other unsupervised methods on the LVIS v1.0 dataset.
All methods are pretrained 200 epochs on ImageNet. We evaluate object detection and instance segmentation tasks.

Figure 6.
Attention maps generated by DetCo and MoCov2.
DetCo can activate more object regions in the heatmap than MoCo v2, and the attention map of DetCo is more accurate than that of MoCo v2 at object boundaries.
Zoom in for better visualization.

Figure 7.
Retrieval results of DetCo on ImageNet.
The left column shows queries from the validation set, while the right columns show the 9 nearest neighbors retrieved from the validation set.

6.4. Improving Lower Bound of Mutual Information with Patch Representation

By adding global-to-local contrasts, we define the DetCo loss in Eqn. 1 of the main paper. The additional global↔local contrastive loss improves the lower bound of mutual information compared with the global↔global contrastive loss. As already shown in Section 3.2.3,

$$LB_{g\leftrightarrow l} - LB_{g\leftrightarrow g} = \mathcal{L}^{g\leftrightarrow g}_{NCE}(I_1, I_2) - \mathcal{L}^{g\leftrightarrow l}_{NCE}(P_1, I_2), \qquad (7)$$

where $\mathcal{L}^{g\leftrightarrow g}_{NCE}(I_1, I_2)$ and $\mathcal{L}^{g\leftrightarrow l}_{NCE}(P_1, I_2)$ are defined in Eqns. 2 and 4 of the main paper. Here we define the exponential cosine similarity $Sim = \exp(q \cdot k_+/\tau)$ for simplicity, so the InfoNCE losses of global↔global and global↔local can be rewritten as:

$$\mathcal{L}^{g\leftrightarrow g}_{NCE}(I_q, I_k) = -\log \frac{Sim^{g\leftrightarrow g}_P}{Sim^{g\leftrightarrow g}_P + \sum_{i=1}^{K} Sim^{g\leftrightarrow g}_N}, \qquad (8)$$

and

$$\mathcal{L}^{g\leftrightarrow l}_{NCE}(P_q, I_k) = -\log \frac{Sim^{g\leftrightarrow l}_P}{Sim^{g\leftrightarrow l}_P + \sum_{i=1}^{K} Sim^{g\leftrightarrow l}_N}, \qquad (9)$$

where $Sim_P$ denotes the similarity between positive pairs and $Sim_N$ the similarity between negative pairs. So $LB_{g\leftrightarrow l} - LB_{g\leftrightarrow g}$ translates to Eqn. 8 minus Eqn. 9:

$$LB_{g\leftrightarrow l} - LB_{g\leftrightarrow g} = \log \frac{Sim^{g\leftrightarrow g}_P \cdot (Sim^{g\leftrightarrow l}_P + \sum_{i=1}^{K} Sim^{g\leftrightarrow l}_N)}{Sim^{g\leftrightarrow l}_P \cdot (Sim^{g\leftrightarrow g}_P + \sum_{i=1}^{K} Sim^{g\leftrightarrow g}_N)}. \qquad (10)$$

Intuitively, to obtain $LB_{g\leftrightarrow l} > LB_{g\leftrightarrow g}$, we need to prove that the numerator (denoted A) is larger than the denominator (denoted B) in Eqn. 10:

$$A - B = Sim^{g\leftrightarrow g}_P \cdot \sum_{i=1}^{K} Sim^{g\leftrightarrow l}_N - Sim^{g\leftrightarrow l}_P \cdot \sum_{i=1}^{K} Sim^{g\leftrightarrow g}_N = P_g \cdot N_l - P_l \cdot N_g, \qquad (11)$$
Figure 8.
Illustration of the information of the global image and local patches extracted by a CNN.
For the global image, both the content information and the context information are extracted by the CNN; for the local patches, only the content information is extracted.

where we use $P_g$, $N_l$, $P_l$, $N_g$ to denote $Sim^{g\leftrightarrow g}_P$, $\sum_{i=1}^{K} Sim^{g\leftrightarrow l}_N$, $Sim^{g\leftrightarrow l}_P$, and $\sum_{i=1}^{K} Sim^{g\leftrightarrow g}_N$ for simplicity. In summary, if we can prove $P_g \cdot N_l > P_l \cdot N_g$, then we can conclude that $LB_{g\leftrightarrow l} > LB_{g\leftrightarrow g}$.

Here we define $\Delta P = P_g - P_l$ and $\Delta N = N_g - N_l$, where $\Delta P$ and $\Delta N$ denote the differences between global and local similarities. Substituting $\Delta P$ and $\Delta N$ into Eqn. 11, we get:

$$LB_{g\leftrightarrow l} - LB_{g\leftrightarrow g} = P_g \cdot N_l - P_l \cdot N_g = P_g \cdot (N_g - \Delta N) - (P_g - \Delta P) \cdot N_g = \Delta P \cdot N_g - \Delta N \cdot P_g. \qquad (12)$$

Here we naturally have $P_g, N_g \in (1, e)$. We then make the empirical assumption that $\Delta P > 0$ and $\Delta N \to 0$, and we verify that this assumption holds with both theoretical analysis and experimental support. For the experimental support, we collect statistics over 32000 samples; as shown in Figure 9, the experimental results match our assumption. The left plot of Figure 9 shows a clear gap $\Delta P$ between the two positive-pair similarity distributions, and the right plot shows that the two negative-pair similarity distributions satisfy $\Delta N \to 0$. We discuss the theoretical analysis in the next paragraph. Combining Eqn. 12 with Figure 9, we can conclude that $\Delta P \cdot N_g - \Delta N \cdot P_g > 0$, that is, $LB_{g\leftrightarrow l} > LB_{g\leftrightarrow g}$.
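As a quick numerical illustration of the sign in Eqn. 12, one can plug in hypothetical values consistent with the assumption above (the numbers below are illustrative only, chosen to mirror the trends in Figure 9, with $P_g, N_g \in (1, e)$):

$$P_g = 2.2,\quad P_l = 1.9,\quad N_g = N_l = 1.3 \;\;\Rightarrow\;\; \Delta P = 0.3,\;\; \Delta N = 0,\;\; \Delta P \cdot N_g - \Delta N \cdot P_g = 0.3 \times 1.3 = 0.39 > 0.$$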
Discussion. In the main paper, Section 3.2.3, we briefly define the information of one image as $Info_I = (Info_{content}, Info_{context})$, and $Info_P$ is reduced because a set of randomly-jittered local patches $P$ loses the contextual information compared to the global image $I$. As shown in Figure 8, the CNN extracts both the content and the context information from the global image, whereas from the 3×3 local patches it extracts only the content information.

Figure 9.
The statistical distribution of the exponential cosine similarity between positive pairs and negative pairs. The left plot shows the similarity between positive pairs for the global↔global and global↔local contrasts; the right plot shows the similarity between negative pairs.

This loss of contextual information makes $\Delta P > 0$ for positive samples. For negative samples, however, we assume that the similarity between two global images and that between local patches and a global image have only a minor gap, because for a CNN it is easy to distinguish two different images regardless of contextual information. We conduct experiments to obtain statistics of the similarity between positive/negative samples for the global↔global and global↔local contrastive losses. In detail, we calculate the cosine similarity over 1000 batches; each batch contains 32 positive pairs, and we calculate the mean similarity of each batch and show the results in Figure 9. In summary, Figures 8 and 9 give the theoretical explanation and experimental support that the global↔local contrastive loss improves the lower bound of mutual information, leading to better feature representation.

6.5. More Implementation Details

First, we provide pseudo-code for the DetCo training loop in PyTorch style, as shown in Algorithm 1. We use Apex (https://github.com/NVIDIA/apex) for mixed-precision training to speed up the training process. Most of our training hyper-parameters are taken directly from MoCo [19]. The loss weights for the intermediate contrastive loss are 1, 0.7, 0.4, and 0.1 for Res5, Res4, Res3, and Res2 by default. The learning rate for pre-training is 0.06 with a cosine decay schedule. For each intermediate layer and global-to-local contrast, we use a 2-layer multi-layer perceptron (MLP) head that projects the feature to a 128-D space. The design of the MLP is the same as MoCo v2, except for the input channels. We also build an individual memory bank for each head to store the negative samples.
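A sketch of the 2-layer projection head described above, in MoCo v2 style; the hidden width of 2048 is an assumption (matching the ResNet-50 output width), and one such head with non-shared weights is built per stage and per global/local branch.

import torch.nn as nn

def make_projection_head(in_dim, hidden_dim=2048, out_dim=128):
    # fc -> ReLU -> fc, projecting the pooled feature to a 128-D space (MoCo v2-style head).
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )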
Downstream tasks. (1) DensePose. In the main paper, we report the DensePose results. For this task, MoCo used Detectron2 to evaluate DensePose; however, Detectron2 was updated recently and its performance is higher than in MoCo, so we re-fine-tuned all methods with the latest Detectron2 code. We fine-tune DensePose RCNN with 26k iterations for all methods and report the "densepose gps" results. (2) RetinaNet. We use ResNet-50 as the backbone and follow the setting of MoCo on Mask RCNN, adding extra normalization layers in both the backbone and FPN. (3) Other methods.
For SwAV [3], we download the pre-trained weights from the official code (https://github.com/facebookresearch/swav). For a fair comparison, we choose the SwAV model with batch size 256 and 200 training epochs, whose ImageNet linear classification accuracy is 72.7% top-1. We then fine-tune the weights on PASCAL VOC, increasing the learning rate from 0.01 to 0.1 and setting the warmup factor to 0.333 for 1000 iterations. These settings are the same as in the SwAV [3] paper.

Algorithm 1: Pseudocode of the intermediate contrastive loss, operating on the 4 different stages of the backbone, termed res2, res3, res4, res5.
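A minimal PyTorch-style sketch of what Algorithm 1 computes: the intermediate contrastive loss over the four backbone stages with the default weights (0.1, 0.4, 0.7, 1.0) and one memory bank per stage. The names (heads_q, queues, etc.) are illustrative, and the momentum update of the key encoder and the queue enqueue/dequeue steps (omitted here) follow MoCo.

import torch
import torch.nn.functional as F

STAGE_WEIGHTS = [0.1, 0.4, 0.7, 1.0]  # res2, res3, res4, res5 (paper default)

def intermediate_contrastive_loss(backbone_q, backbone_k, heads_q, heads_k, queues, x_q, x_k, tau=0.2):
    # backbone_*: return the list of res2..res5 features; heads_*: one MLP head per stage;
    # queues: one memory bank of shape (D, K) per stage.
    feats_q = backbone_q(x_q)
    with torch.no_grad():               # keys come from the momentum encoder, no gradient
        feats_k = backbone_k(x_k)
    loss = 0.0
    for w, f_q, f_k, h_q, h_k, queue in zip(STAGE_WEIGHTS, feats_q, feats_k, heads_q, heads_k, queues):
        q = F.normalize(h_q(f_q.mean(dim=(2, 3))), dim=1)   # pooled + projected query
        k = F.normalize(h_k(f_k.mean(dim=(2, 3))), dim=1)
        l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)
        l_neg = torch.einsum("nd,dk->nk", q, queue)
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = loss + w * F.cross_entropy(logits, labels)   # weighted per-stage InfoNCE (Eqn. 3)
    return loss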
6.6. More Ablation Studies

Weight of Hierarchical Intermediate Loss. We find that the transfer detection performance is highly sensitive to the loss weight hyper-parameter of the hierarchical intermediate loss, so we set different weights for Res2, Res3, Res4, Res5. The results are shown in Table 15. As shown in Table 15 (a), if we set the loss weights to (0,0,0,1), our method degenerates to MoCo. In Table 15 (b)(c), we find that directly adding the hierarchical intermediate loss with inappropriate weights leads to negative results. We find that the shallow layers and the deep layers are in a competitive relationship: if we set the loss weights equally to (1,1,1,1), the discriminative ability of the deep layers is strongly and negatively influenced by the shallow layers. Here we revisit PSPNet [41], which also uses a shallow feature as an auxiliary loss; in PSPNet, the loss weights of the shallow and deep features are (0.4, 1). In DetCo, we set the loss weights to (0.1, 0.4, 0.7, 1.0), as shown in Table 15 (d)(e), and we find that this setting improves the transfer detection performance. We argue that making the shallow layers' weights equal to those of the deep layers is too aggressive for optimization. Moreover, if we normalize the weights, making the sum of the loss weights equal to 1, the accuracy also drops.

weight | AP | AP50 | AP75
(a) (0,0,0,1) | 56.3 | 81.8 | 62.1
(b) (1,1,1,1)+Norm | 55.1 | 80.4 | 60.4
(c) (1,1,1,1) | 55.8 | 81.6 | 62.2
(d) (0.1,0.4,0.7,1)+Norm | 56.5 | 82.2 | 62.7
(e) (0.1,0.4,0.7,1) | 57.0 | 82.2 | 63.1

Table 15. Ablation study of intermediate loss weights, under 100-epoch pre-training. "Norm" means the loss weights are normalized so that they sum to 1.0.

share queue | AP | AP50 | AP75
(a) ✓
(b) ✗

Table 16. Ablation study of shared vs. individual queues, under 100-epoch pre-training.
Memory Banks for Hierarchical Intermediate Loss.
The original MoCo utilizes only the final feature to calculate the contrastive loss, so it uses a single memory bank (queue) to store negative samples. Here, however, we utilize Res2, Res3, Res4, Res5 to calculate the multi-level contrastive loss. An intuitive idea is to use one shared memory bank to store the negative samples from all four stages, so that a single memory bank serves both high-level and low-level features. However, we find that sharing the memory bank leads to a performance drop, so we build an individual memory bank for the features of each level, which improves performance. The results are shown in Table 16. We conjecture that when the memory bank is shared across levels, the positive samples of each stage need to discriminate against negative samples from all levels, which is challenging to optimize. If each layer owns an individual memory bank, the positive samples of each stage only need to discriminate against negative samples from their corresponding layer, making the network easier to converge.
References

[1] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[3] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[5] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[7] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR, 2020.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[9] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[10] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051–2060, 2017.
[11] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[12] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pages 10542–10552, 2019.
[13] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pages 766–774, 2014.
[14] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[15] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[16] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In
Proceedings of the IEEE/CVF Con-ference on Computer Vision and Pattern Recognition , pages9729–9738, 2020. 1, 2, 3, 4, 6, 7, 8, 9, 10, 14[20] Kaiming He, Ross Girshick, and Piotr Doll´ar. Rethinkingimagenet pre-training. In
Proceedings of the IEEE inter-national conference on computer vision , pages 4918–4927,2019. 6[21] Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Gir-shick. Mask r-cnn. In
Proceedings of the IEEE internationalconference on computer vision , pages 2961–2969, 2017. 6[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In
Proceed-ings of the IEEE conference on computer vision and patternrecognition , pages 770–778, 2016. 1, 3, 6[23] Diederik P Kingma and Max Welling. Auto-encoding varia-tional bayes. arXiv preprint arXiv:1312.6114 , 2013. 2[24] Gustav Larsson, Michael Maire, and GregoryShakhnarovich. Learning representations for automaticcolorization. In
European conference on computer vision ,pages 577–593. Springer, 2016. 2[25] Christian Ledig, Lucas Theis, Ferenc Husz´ar, Jose Caballero,Andrew Cunningham, Alejandro Acosta, Andrew Aitken,Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative ad-versarial network. In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pages 4681–4690,2017. 2[26] Tsung-Yi Lin, Piotr Doll´ar, Ross Girshick, Kaiming He,Bharath Hariharan, and Serge Belongie. Feature pyra-mid networks for object detection. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 2117–2125, 2017. 4, 6[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, andPiotr Doll´ar. Focal loss for dense object detection. In
Pro-ceedings of the IEEE international conference on computervision , pages 2980–2988, 2017. 4, 6[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C LawrenceZitnick. Microsoft coco: Common objects in context. In
European conference on computer vision , pages 740–755.Springer, 2014. 2[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fullyconvolutional networks for semantic segmentation. In
Pro-ceedings of the IEEE conference on computer vision and pat-tern recognition , pages 3431–3440, 2015. 7[30] Ishan Misra and Laurens van der Maaten. Self-supervisedlearning of pretext-invariant representations. In
Proceedingsof the IEEE/CVF Conference on Computer Vision and Pat-tern Recognition , pages 6707–6717, 2020. 1, 6, 7, 8[31] Mehdi Noroozi and Paolo Favaro. Unsupervised learningof visual representations by solving jigsaw puzzles. In
European Conference on Computer Vision , pages 69–84.Springer, 2016. 2, 8[32] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre-sentation learning with contrastive predictive coding. arXivpreprint arXiv:1807.03748 , 2018. 1
33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre-sentation learning with contrastive predictive coding. arXivpreprint arXiv:1807.03748 , 2018. 5[34] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, TrevorDarrell, and Alexei A Efros. Context encoders: Featurelearning by inpainting. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages2536–2544, 2016. 2[35] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, XiangyuZhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A largemini-batch object detector. In
Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition , pages6181–6189, 2018. 6[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster r-cnn: Towards real-time object detection with regionproposal networks. In
Advances in neural information pro-cessing systems , pages 91–99, 2015. 1, 6[37] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wier-stra. Stochastic backpropagation and variational inference indeep latent gaussian models. In
International Conference onMachine Learning , volume 2, 2014. 2[38] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, andPierre-Antoine Manzagol. Extracting and composing robustfeatures with denoising autoencoders. In
Proceedings of the25th international conference on Machine learning , pages1096–1103, 2008. 2[39] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.Unsupervised feature learning via non-parametric instancediscrimination. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages 3733–3742, 2018. 4, 5, 7, 8[40] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorfulimage colorization. In
European conference on computervision , pages 649–666. Springer, 2016. 2[41] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, XiaogangWang, and Jiaya Jia. Pyramid scene parsing network. In
Proceedings of the IEEE conference on computer vision andpattern recognition , pages 2881–2890, 2017. 1, 5, 15[42] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Lo-cal aggregation for unsupervised learning of visual embed-dings. In
Proceedings of the IEEE International Conferenceon Computer Vision , pages 6002–6012, 2019. 8, pages 6002–6012, 2019. 8