Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning
Yu Liu, Lianghua Huang, Pan Pan, Bin Wang, Yinghui Xu, Rong Jin
Machine Intelligence Technology Lab, Alibaba Group
{ly103369, xuangen.hlh, panpan.pp, ganfu.wb, renji.xyh, jinrong.jr}@alibaba-inc.com

Abstract
This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level classifier. The overall framework is a replica of a supervised classification model, where semantic classes (e.g., dog, bird, and ship) are replaced by instance IDs. However, scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges, including 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes, which can be noisy. This work presents several novel techniques to handle these difficulties. First, we introduce a hybrid parallel training framework to make large-scale training feasible. Second, we present a raw-feature initialization mechanism for the classification weights, which we assume offers a contrastive prior for instance discrimination and clearly speeds up convergence in our experiments. Finally, we propose to smooth the labels of the few hardest classes to avoid optimizing over very similar negative pairs. While being conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i.e., SimCLR, MoCo v2, and PIC, under the ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full instance classification is a strong pretraining technique for many semantic visual tasks.
Introduction

Unsupervised visual representation learning has recently shown encouraging progress (He et al. 2020; Chen et al. 2020a). Methods using instance discrimination as a pretext task (Tian, Krishnan, and Isola 2019; He et al. 2020; Chen et al. 2020a) have demonstrated competitive or even superior performance compared to supervised counterparts under the ImageNet (Deng et al. 2009) linear evaluation protocol and on many downstream visual tasks. This shows the potential of unsupervised representation learning methods, since they can utilize almost unlimited data without manual labels.

To solve the instance discrimination task, a dual-branch structure is usually used, where two transformed views of the same image are encouraged to get close, while transformed views from different images are expected to get far apart (He et al. 2020; Chen et al. 2020a,c). These methods often rely on specialized designs such as a memory bank (Wu et al. 2018a), a momentum encoder (He et al. 2020; Chen et al. 2020c), a large batch size (Chen et al. 2020a,b), or shuffled batch normalization (BN) (He et al. 2020; Chen et al. 2020c) to compensate for the lack of negative samples or to handle the information leakage issue (i.e., samples on the same GPU tend to get closer due to shared BN statistics).
[Figure 1 depicts the two-phase pipeline: Phase 1, unsupervised pretraining (Encoder, Pool + MLP, N instance classes); Phase 2, supervised transferring (Encoder, Pool + FC, C semantic classes).]

Figure 1: (a) An overview of our unsupervised visual representation learning framework. Without manual labels, we simply train an instance-level classifier that tries to distinguish all images in a dataset, learning discriminative representations that can be well transferred to supervised tasks. (b) The relationship between instance (unsupervised) and semantic (supervised) classification accuracies. We observe a strong positive correlation between them in our experiments.

Unlike dual-branch approaches, a one-branch scheme (e.g., parametric instance-level classification) usually avoids the information leakage issue and can potentially explore a larger set of negative samples. ExemplarCNN (Dosovitskiy et al. 2014) and PIC (Cao et al. 2020) are of this category.
Figure 2: An outline of our distributed hybrid parallel (DHP) training process on T GPU nodes. Data parallel: following the data parallel mechanism, we copy the encoding and MLP layers to all nodes, each processing a subset of the minibatch data. Model parallel: following the model parallel mechanism, we evenly divide the classification weights across nodes and distribute the computation of the classification scores (forward pass) and the weight/feature gradients (backward pass) to different GPUs. Label smoothing: we smooth the labels of the top-K hardest negative classes for each instance to avoid optimizing over noisy pairs.
Nevertheless, due to the high GPU computation and memory overhead of large-scale instance-level classification, these methods are either tested on small datasets (Dosovitskiy et al. 2014) or rely on negative class sampling (Cao et al. 2020) to make training feasible.

This work summarizes the typical challenges of using one-branch instance discrimination for unsupervised representation learning, including 1) the large-scale classification; 2) the slow convergence due to infrequent instance access; and 3) the large number of negative classes, which can be noisy, and proposes novel techniques to handle them. First, we introduce a hybrid parallel training framework to make large-scale classifier training feasible. It relies on model parallelism, dividing the classification weights across GPUs and evenly distributing the softmax computation (in both forward and backward passes) among them. Figure 2 shows an overview of our distributed training process. This training framework can theoretically support up to 100-million-way classification using 256 GPUs (Song et al. 2020), which far exceeds the 1.28 million instances of ImageNet-1K, indicating the scalability of our method.

Second, instance classification faces a slow convergence problem due to the extremely infrequent visiting of instance samples (Cao et al. 2020). In this work, we tackle this problem by introducing a contrastive prior to the instance classifier. Specifically, with a randomly initialized network, we fix all but its BN layers and run an inference epoch to extract all instance features; we then directly assign these features to the classification weights as an initialization. The intuition is two-fold. On the one hand, we presume that running BNs may offer a contrastive prior in the output instance features, since in each iteration the features subtract a weighted average of other instance features extracted in previous iterations. On the other hand, initializing the classification weights as instance features in essence converts the classification task into a pair-wise instance comparison task, providing a warm start for convergence.

Finally, regarding the massive number of negative instance classes, which significantly raises the risk of optimizing over very similar negative pairs, we propose to smooth the labels of the top-K hardest negative classes to make training easier. Specifically, we compute cosine similarities between instance proxies (their corresponding classification weights) and find the negative classes with the top-K highest similarities for each instance. The labels of these classes are smoothed by a factor of α (i.e., from y− = 0 to y− = α/K). The right part of Figure 2 shows the smoothing process. Note that these top-K indices are computed once per training epoch, which is very efficient and adds only minimal computational overhead to training. (A code sketch of this smoothing step closes this introduction.)

We evaluate our method under the common ImageNet linear evaluation protocol as well as on several downstream tasks related to detection or fine-grained classification. Despite its simplicity, our method shows competitive results on these tasks. For example, it achieves a top-1 accuracy of 71.4% under the ImageNet linear classification protocol, outperforming all other instance discrimination based methods (Chen et al. 2020a,c; Cao et al. 2020). We also obtain a semi-supervised accuracy of 81.8% on ImageNet-1K when only 1% of the labels are provided, surpassing the previous best result by around 4.7%. We hope our full instance classification framework can serve as a simple and strong baseline for the unsupervised representation learning community.
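Closing this introduction, a minimal PyTorch-style sketch shows how such smoothed targets could be built; the helper name and the default K and α values are illustrative assumptions rather than the authors' released code:

    import torch
    import torch.nn.functional as F

    def topk_smoothed_targets(weights, labels, K=10, alpha=0.2):
        # weights: (N, D) instance classification weights (the instance proxies)
        # labels:  (B,) instance IDs of the current minibatch
        # Returns (B, N) targets: the true class receives 1 - alpha and each of
        # its K most similar negative classes receives alpha / K.
        w = F.normalize(weights, dim=1)             # rows on the unit sphere
        sim = w[labels] @ w.t()                     # (B, N) proxy-to-proxy cosines
        sim.scatter_(1, labels.unsqueeze(1), -1.0)  # exclude each positive class
        _, hardest = sim.topk(K, dim=1)             # top-K hardest negatives

        targets = torch.zeros_like(sim)
        targets.scatter_(1, labels.unsqueeze(1), 1.0 - alpha)
        targets.scatter_(1, hardest, alpha / K)
        return targets

In practice, the paper rebuilds this top-K lookup table only once per epoch, so the search adds negligible overhead to training.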
Related Work

Unsupervised visual representation learning.
Unsupervised or self-supervised visual representation learning aims to learn discriminative representations from visual data where no manual labels are available. Usually, a pretext task is utilized to determine the quality of the learned representation and to iteratively optimize the parameters. Representative pretext tasks include transformation prediction (Gidaris, Singh, and Komodakis 2018; Zhang et al. 2019), inpainting (Pathak et al. 2016), spatial or temporal patch order prediction (Doersch, Gupta, and Efros 2015; Noroozi and Favaro 2016; Noroozi et al. 2018), colorization (Zhang, Isola, and Efros 2016), clustering (Caron et al. 2018; Zhuang, Zhai, and Yamins 2019a; Asano, Rupprecht, and Vedaldi 2020), data generation (Jenni and Favaro 2018; Donahue and Simonyan 2019; Donahue, Krähenbühl, and Darrell 2016), geometry (Dosovitskiy et al. 2015), and combinations of multiple pretext tasks (Doersch and Zisserman 2017; Feng, Xu, and Tao 2019).
Contrastive visual representation learning.
More recently, contrastive representation learning methods (Hénaff et al. 2019; He et al. 2020) have shown significant performance improvements by using strong data augmentation and proper loss functions (Chen et al. 2020c,a). These methods usually employ a dual-branch structure, where two augmented views of an image are encouraged to get close while augmented views from different images are forced to get far apart. One problem with these methods is the shortage of negative samples. Some methods rely on a large batch size (Chen et al. 2020a), a memory bank (Wu et al. 2018a), or a momentum encoder (He et al. 2020; Chen et al. 2020c) to enlarge the negative pool. Another issue is information leakage (He et al. 2020; Chen et al. 2020c), where features extracted on the same GPU tend to get close due to shared BN statistics. MoCo (He et al. 2020; Chen et al. 2020c) solves this problem by using shuffled batch normalization (BN), while SimCLR (Chen et al. 2020a) handles it with a synchronized global BN.
Instance discrimination for representation learning.
Unlike the two-branch structure used in contrastive methods, some approaches (Dosovitskiy et al. 2014; Cao et al. 2020) employ a parametric, one-branch structure for instance discrimination, which avoids the information leakage issue. Exemplar-CNN (Dosovitskiy et al. 2014) learns a classifier to discriminate between a set of "surrogate classes", where each class represents different transformed patches of a single image. Nevertheless, it shows worse performance than non-parametric approaches (Wu et al. 2018a). PIC (Cao et al. 2020) improves Exemplar-CNN in two ways: 1) it introduces a sliding-window data scheduler to alleviate the infrequent instance visiting problem; and 2) it utilizes recent classes sampling to reduce the GPU memory consumption. Despite its effectiveness, it relies on complicated scheduling and optimization processes, and it cannot fully explore the large number of negative instances. This work presents a much simpler instance discrimination method that uses an ordinary data scheduler and optimization process. In addition, it is able to make full use of the massive number of negative instances in every training iteration.
Methodology
Overall Framework
This work presents an unsupervised representation learning method with a pretext task of classifying all image instances in a dataset. Figure 1 (a) shows the outline of our method. The pipeline is similar to common supervised classification, where semantic classes are replaced by instance IDs. Inspired by the design improvements used in recent unsupervised frameworks (Chen et al. 2020a), we slightly modify some components, including using stronger data augmentation (i.e., random crop, color jitter, and Gaussian blur), a two-layer MLP projection head, and a cosine softmax loss. The cosine softmax loss is defined as

$$J = -\frac{1}{|I|}\sum_{i \in I} \log \frac{\exp(\cos(w_i, x_i)/\tau)}{\sum_{j=1}^{N}\exp(\cos(w_j, x_i)/\tau)}, \qquad (1)$$

where $I$ denotes the indices of the sampled image instances in a minibatch, $x_i$ is the projected embedding of instance $i$, $W = \{w_1, w_2, \cdots, w_N\} \in \mathbb{R}^{D \times N}$ represents the instance classification weights, $\cos(w_j, x_i) = (w_j^T x_i)/(\|w_j\| \cdot \|x_i\|)$ denotes the cosine similarity between $w_j$ and $x_i$, and $\tau$ is a temperature adjusting the scale of the cosine similarities. (A code sketch of this loss is given at the end of this subsection.)

Nevertheless, there are still challenges for this vanilla instance classification model to learn good representations, including 1) the large-scale instance classes (e.g., 1.28 million instance classes for the ImageNet-1K dataset); 2) the extremely infrequent visiting of instance samples; and 3) the massive number of negative classes, which makes training difficult. We propose three efficient techniques to improve the representation learning and the scalability of our method:

• Hybrid parallelism. To support large-scale instance classification, we rely on hybrid parallelism and evenly distribute the softmax computation (in both forward and backward passes) to different GPUs. Figure 2 shows a schematic of the distributed training process on T GPUs.

• A contrastive prior. To improve convergence, we propose to introduce a contrastive prior to the instance classifier. This is simply achieved by initializing the classification weights as raw instance features extracted by a fixed random network with running BNs.

• Smoothing labels of the hardest classes. The massive number of negative classes raises the risk of optimizing over very similar pairs. We apply label smoothing on the top-K hardest instance classes to alleviate this issue.

Note that the above improvements add little or no computational overhead to the training process. Next, we introduce these techniques in detail.
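As a reference for Eq. (1), the loss is a standard cross-entropy over temperature-scaled cosine logits. The following minimal PyTorch-style sketch shows the single-GPU case; the function name and the τ value are illustrative assumptions, not the paper's setting:

    import torch.nn.functional as F

    def cosine_softmax_loss(x, weights, labels, tau=0.15):
        # x:       (B, D) projected embeddings of the minibatch
        # weights: (N, D) instance classification weights, one row per image
        # labels:  (B,) instance IDs, i.e., the "classes" to be predicted
        logits = F.normalize(x, dim=1) @ F.normalize(weights, dim=1).t()
        return F.cross_entropy(logits / tau, labels)  # Eq. (1) averaged over I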
Hybrid Parallelism

Training an instance classifier usually requires learning a large-scale fc layer. For example, for ImageNet-1K with approximately 1.28 million images, one needs to optimize a weight matrix of size $W \in \mathbb{R}^{D \times 1.28\mathrm{M}}$; for ImageNet-21K, the size further grows to $W \in \mathbb{R}^{D \times 14.2\mathrm{M}}$. This is often infeasible with a regular distributed data parallel (DDP) training pipeline. In this work, we introduce a distributed hybrid parallel (DHP) training framework (Song et al. 2020) to make large-scale classification feasible.

Figure 2 summarizes the outline of the distributed hybrid parallel training process on T GPU nodes. For the encoding and MLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of the minibatch data; for the large-scale fc layer, we follow the model parallel mechanism and split the weights evenly across the T GPUs. At each training iteration, each GPU node 1) extracts features of a subset of minibatch samples; 2) gathers features from all other nodes; 3) computes partial cosine logits using the local classification weights; 4) computes the exponentials of the logits and sums them over all nodes to obtain the softmax denominators; 5) computes the softmax probabilities and the cross-entropy loss on the subset data; 6) derives gradients of the local loss with respect to features and weights; 7) gathers the feature gradients from all GPU nodes and sums them; and 8) runs an optimization step to update the parameters of the encoding, MLP, and classification layers. This pipeline is repeated to loop through the complete dataset for several epochs, optimizing for better representations. (A condensed sketch of steps 1-5 is given after Table 1.)

Figure 3 compares the GPU memory overhead of the DDP and DHP training frameworks when the (pseudo) class number increases from 10K to 30M. The experiment is conducted on 64 V100 GPUs with 32 GB of memory and a total batch size of 4096. DDP reports an out-of-memory (OOM) error when the class number reaches 4.7 million, while the DHP framework supports up to 30 million classes, about 6.4× the DDP limit. We also note that DHP can benefit from more GPUs to support larger-scale instance classification, whereas DDP does not offer this scalability. Table 1 compares the training efficiency of the DDP and DHP frameworks on ImageNet-1K and ImageNet-21K under the same batch size settings. The DHP framework not only consumes less GPU memory but also trains much faster than its DDP counterpart.

Figure 3: Comparison of the maximum number of classes supported by the DDP and DHP training frameworks under different GPU memory constraints.

Table 1: Comparison of the GPU memory consumption and the training time per epoch of the DDP and DHP training frameworks. Experiments are conducted on the ImageNet-1K and ImageNet-21K datasets.

    Dataset        #Instances   DDP    DHP
    ImageNet-21K   14.2M        OOM*   21.51 GB, 2940 s/epoch
    * OOM: out of memory.
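The following condensed sketch illustrates how steps 1)-5) could be realized with torch.distributed on one of the T GPUs. It assumes an initialized process group, and it simplifies gradient routing (steps 6)-8) are handled by autograd plus the explicit gradient all-reduce described above); it is an illustration under assumed shapes, not the Song et al. (2020) implementation:

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F

    def dhp_cosine_softmax_loss(local_x, local_w, labels, rank, world_size, tau=0.15):
        # local_x: (b, D) embeddings of this GPU's sub-batch; B = b * world_size
        # local_w: (N // world_size, D) this GPU's shard of the classifier weights
        # labels:  (B,) global instance IDs of the full minibatch
        # Steps 1)-2): extract local features, then gather features from all nodes.
        gathered = [torch.empty_like(local_x) for _ in range(world_size)]
        dist.all_gather(gathered, local_x)
        x = F.normalize(torch.cat(gathered), dim=1)           # (B, D)

        # Step 3): partial cosine logits against the local weight shard.
        logits = x @ F.normalize(local_w, dim=1).t() / tau    # (B, N // world_size)

        # Step 4): softmax denominator summed over all shards (log-sum-exp trick).
        m = logits.max(dim=1, keepdim=True).values
        dist.all_reduce(m, op=dist.ReduceOp.MAX)              # global per-sample max
        exp_sum = (logits - m).exp().sum(dim=1, keepdim=True)
        dist.all_reduce(exp_sum, op=dist.ReduceOp.SUM)        # global denominator

        # Step 5): cross-entropy; each positive logit lives on exactly one shard.
        lo = rank * local_w.size(0)
        on_shard = (labels >= lo) & (labels < lo + local_w.size(0))
        pos = torch.zeros(labels.size(0), device=logits.device)
        pos[on_shard] = logits[on_shard, labels[on_shard] - lo]
        dist.all_reduce(pos, op=dist.ReduceOp.SUM)            # fill in missing logits
        return (exp_sum.squeeze(1).log() + m.squeeze(1) - pos).mean()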
Raw-Feature Initialization with Running BNs

Instance classification faces a slow convergence problem in early epochs due to the infrequent visiting of instance samples (i.e., one access per epoch) and the lack of prior knowledge in the classifier. A recent work (Cao et al. 2020) handles the infrequent instance visiting problem with a sliding-window data scheduler, which samples overlapping batches in adjacent iterations. This increases the positive instance visiting frequency, but it also significantly multiplies the time needed to loop over the complete dataset.

In this work, we present a simple weight initialization mechanism to handle the convergence issue. Specifically, before training starts, we run an inference epoch using the fixed random initial network with trainable BNs to extract all instance features $X = \{x_1, x_2, \cdots, x_N\} \in \mathbb{R}^{D \times N}$; we then directly assign them to the classification weights $W = \{w_1, w_2, \cdots, w_N\} \in \mathbb{R}^{D \times N}$ as an initialization. The intuition behind this mechanism is two-fold. First, a weight vector $w_i$ represents the feature center of the different transformed views of instance $i$; initializing the weights as raw instance features thus in essence converts the classification task into a pair-wise instance comparison task, providing a warm start for convergence. Second, we presume that the running BNs offer a contrastive prior in the extracted features, since in each iteration the features subtract a weighted average of other instance features extracted in previous iterations.

Figure 4: Comparison of initialization methods (intra-class vs. inter-class cosine similarity, in %, for raw features extracted with fixed BNs and with running BNs).

Table 2: Ablation study on the effectiveness of different components in our method.

    Init. method             Linear eval. (epoch 10)
    Random init.             12.4
    Raw with fixed BNs       22.7
    Raw with running BNs
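A minimal PyTorch-style sketch of this initialization pass follows; the function and loader names are hypothetical, and we assume model(images) returns the (B, dim) projected features:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def init_classifier_weights(model, loader, num_instances, dim, device):
        # Keep the network in training mode so BN layers keep updating their
        # running statistics ("running BNs"); since no optimizer step is taken,
        # all weights stay at their random values (the "fixed" random network).
        model.train()
        W = torch.empty(num_instances, dim, device=device)
        for images, ids in loader:                    # one inference epoch
            feats = model(images.to(device))          # (B, dim) projected features
            W[ids.to(device)] = F.normalize(feats, dim=1)
        return W  # assigned to the instance classifier (or its shards) as init

Normalizing the stored features is harmless here, since the cosine classifier in Eq. (1) only uses the direction of each weight vector.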
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights as C o s i n e s i m il a r i t y ( % ) Figure 4: Comparison of initialization methods. model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart.Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvethe convergence, we propose to initialize the classificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method Linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs
Raw withfixed BNs Raw withrunning BNs
Intra-class sim.Inter-class sim.
Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method.
Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights asFigure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights as C o s i n e s i m il a r i t y ( % ) Figure 4: Comparison of initialization methods. model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Figure 4: Comparison of initialization methods. model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. 
The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart.Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvethe convergence, we propose to initialize the classificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method Linear eval.(epoch 10)Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs
Raw withfixed BNs Raw withrunning BNs
Intra-class sim.Inter-class sim.020406080100 C o s i n e s i m il a r i t y ( % ) Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method.
Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvethe convergence, we propose to initialize the classificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method Linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs
Raw withfixed BNs Raw withrunning BNs
Intra-class sim.Inter-class sim.
Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method.
Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights asFigure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Figure 3 compares the GPU memory overhead of the DDP and DHP training frameworks as the (pseudo) class number increases from 10 thousand to 30 million. The experiment is conducted on 64 V100 GPUs with 32GB memory. DDP reports an out-of-memory (OOM) error once the class number reaches 4.7 million, while the DHP framework supports up to 30 million classes, roughly 6.4× DDP's limit. We also note that DHP can exploit additional GPUs to support even larger-scale instance classification, whereas DDP does not offer this scalability. Table 1 compares the training efficiency of the DDP and DHP frameworks on ImageNet-1K and ImageNet-21K under the same batch size settings. The DHP framework not only consumes less GPU memory but also trains much faster than its DDP counterpart.

[Figure 3: Comparison of GPU memory consumption of the DDP and DHP training frameworks.]

[Table 1: Training efficiency of DDP vs. DHP on ImageNet-1K and ImageNet-21K under the same batch size; for ImageNet-21K, DDP hits OOM while DHP runs at 21.51G of memory in 2940s.]
[Figure 4: Comparison of initialization methods. Table: ImageNet linear evaluation accuracies of different weight-initialization methods, evaluated after 10 epochs of training (random init.: 12.4; raw with fixed BNs: 22.7; raw with running BNs: best). Bar chart: average intra- and inter-instance-class cosine similarities under each initialization method.]
A Contrastive Prior
Instance classification suffers from slow convergence in early epochs due to the infrequent visiting of instance samples (i.e., one access per epoch). A recent work (Cao et al. 2020) handles this infrequent visiting problem with a sliding-window data scheduler, which samples overlapping batches between adjacent iterations. This increases positive instance visiting, but it also significantly multiplies the time needed to loop over the complete dataset.

In this work, we handle the problem from a different perspective: we speed up convergence by introducing a contrastive prior to the classification weights. Specifically, before training starts, we run an inference epoch using the fixed, randomly initialized network with running BNs to extract all instance features X = {x_1, x_2, ..., x_N} ∈ R^{D×N}; we then directly assign them to the classification weights W = {w_1, w_2, ..., w_N} ∈ R^{D×N} as an initialization. The intuition behind this initialization mechanism is two-fold. First, running BNs offer a contrastive prior in the output features, since in each inference phase the features computed after every BN layer subtract a running average of the instance features extracted in previous iterations. Second, assigning features to weights approximately converts the classification task into a pair-wise metric learning task in early epochs, which is relatively easier to converge and offers a warm start for instance classification.
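For illustration, the initialization can be sketched as follows; `encoder`, `loader`, and the feature dimension are assumed names and values, not our exact pipeline.

```python
# A minimal sketch of the raw-feature initialization with running BNs.
# `encoder` (backbone + MLP head) and `loader` (yields images together with
# their global instance indices) are assumed names.
import torch
import torch.nn.functional as F

@torch.no_grad()
def init_classifier_weights(encoder, loader, num_instances, dim=128):
    # train mode keeps the BN running statistics updating across batches
    # ("running BNs"); this happens even under no_grad.
    encoder.train()
    for p in encoder.parameters():
        p.requires_grad_(False)            # every weight stays at random init
    W = torch.zeros(num_instances, dim)
    for images, ids in loader:             # one full inference epoch
        feats = F.normalize(encoder(images), dim=1)
        W[ids] = feats.cpu()               # row i becomes instance i's feature
    return W                               # used to initialize the fc weights
```

Because only the BN statistics move during this pass, the whole procedure costs a single inference sweep over the dataset and introduces no trainable parameters.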
Figure 4 compares the discriminative ability of different classifier initialization schemes, i.e., random weight initialization (random init. for short), raw-feature initialization with fixed BNs (raw with fixed BNs), and raw-feature initialization with running BNs (raw with running BNs). We use ImageNet linear evaluation accuracy (evaluated after 10 epochs of training) as well as the average intra- and inter-instance-class similarities as indicators. As shown in Figure 4, raw with running BNs achieves the best linear evaluation accuracy, clearly outperforming the other initialization methods. In addition, raw with running BNs obtains a much larger similarity gap between positive and negative instance pairs than raw with fixed BNs, validating the assumption that running BNs provide a contrastive prior for instance discrimination.

We also note that the instance features extracted by a random network with running BNs are a robust starting point for semantic classification. We run an instance retrieval experiment on the train set of ImageNet-1K, using a randomly initialized ResNet-50 to extract all image features, and check whether each retrieved instance and its query belong to the same semantic category. A top-1 accuracy of 3% can be achieved, which far exceeds the 0.1% top-1 accuracy of a random guess.

[Table 2: Ablation study on the effectiveness of different components in our method; columns mark which of the MLP head, the contrastive prior, and label smoothing are enabled, with the resulting ImageNet Top-1/Top-5 accuracies.]

[Table 3: Ablation study that compares full instance classification and sampled instance classification; the full-classification row reads 67.3 Top-1 / 87.7 Top-5.]

Smoothing Labels of Hardest Classes
A challenge of instance-level classification is that it introduces a very large number of negative classes, significantly raising the risk of optimizing over very similar pairs, which can be noisy and make the training hard to converge.

In this work, we handle this problem by applying label smoothing on a few hardest instance classes. Although other techniques (e.g., clustering) are also applicable, we choose label smoothing for its simplicity and efficiency. We notice that semantically similar instance pairs are relatively stable across the training process. Therefore, we represent each instance i by its corresponding classification weights w_i (instead of its unstable features x_i), and compute the cosine similarities between w_i and all other weights W_{\bar{i}} = {w_1, ..., w_{i-1}, w_{i+1}, ..., w_N} ∈ R^{D×(N−1)} to find the top-K hardest negative classes H_i^- = {c_1, c_2, ..., c_K}.

Table 4: Ablation study that compares Gaussian random and a contrastive prior for classifier initialization (ImageNet linear evaluation top-1 accuracy; gains of the contrastive prior in parentheses).

Epochs   Random init.   Contrastive prior
10       -              - (+15.0)
25       40.8           46.3 (+5.5)
50       56.1           58.0 (+1.9)
100      62.9           64.1 (+1.2)
200      67.3           67.6 (+0.3)
400      69.3           69.7 (+0.4)

Table 5: Ablation study of label smoothing with different numbers of hard classes K and smoothing factors α.

Hard class number K   Smoothing factor α   Top-1   Top-5
no smoothing          -                    -       -
100                   0.3                  67.5    88.2
50                    0.2                  68.0    88.5
200                   0.2                  67.9    88.4
The label of class j ∈ {1, 2, ..., N} is then defined as

y_{ij} = \begin{cases} 1 - \alpha, & j = i, \\ \alpha / K, & j \in H_i^-, \\ 0, & \text{otherwise}, \end{cases}    (2)

and the loss function in Eq. (1) is redefined as

J = -\frac{1}{|I|} \sum_{i \in I} \log \frac{\sum_{j=1}^{N} y_{ij} \exp(\cos(w_j, x_i)/\tau)}{\sum_{j=1}^{N} \exp(\cos(w_j, x_i)/\tau)}.    (3)

The top-K similarities between instance weights are computed once per epoch, which accounts for only a small fraction of the training time. The smoothed softmax cross-entropy loss reduces the impact of noisy or very similar negative pairs on the learned representation. This is verified later in our ablation study, where smoothing the labels of several hardest classes improves the transfer performance.
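For illustration, Eqs. (2) and (3) can be sketched on a single GPU as follows; the dense N × N similarity is shown for clarity only (at million-way scale it is computed in chunks over weight shards), and the default α and τ values are placeholders.

```python
# A single-GPU sketch of Eqs. (2)-(3); alpha and tau defaults are assumed.
import torch
import torch.nn.functional as F

def topk_hardest(W, K):
    """Once per epoch: for each instance i, its K most similar other classes
    H_i^- (dense N x N shown for clarity; chunked at full scale)."""
    Wn = F.normalize(W, dim=1)
    sim = Wn @ Wn.t()                               # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))               # exclude j == i
    return sim.topk(K, dim=1).indices               # (N, K)

def smoothed_instance_loss(W, x, ids, hard, alpha=0.2, tau=0.15):
    """W: (N, D) classifier weights; x: (B, D) features; ids: (B,) targets;
    hard: (N, K) indices returned by topk_hardest."""
    K = hard.size(1)
    logits = F.normalize(x, dim=1) @ F.normalize(W, dim=1).t() / tau  # (B, N)
    # Eq. (2): soft targets with 1 - alpha on the positive class and
    # alpha / K spread over the K hardest negatives.
    y = torch.zeros_like(logits)
    y[torch.arange(len(ids)), ids] = 1.0 - alpha
    y.scatter_(1, hard[ids], alpha / K)
    # Eq. (3): log of (sum_j y_ij exp(l_ij) / sum_j exp(l_ij)), computed
    # stably in log space; entries with y_ij = 0 become -inf and vanish.
    log_y = torch.where(y > 0, y.log(), torch.full_like(y, float("-inf")))
    log_num = torch.logsumexp(logits + log_y, dim=1)
    log_den = torch.logsumexp(logits, dim=1)
    return -(log_num - log_den).mean()
```

Since the top-K search runs once per epoch, the extra cost is a single similarity pass over the classifier weights rather than a per-iteration overhead.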
Experiments

Experiment Settings
Training datasets.
Unless specified otherwise, we use ImageNet-1K to train our unsupervised model. ImageNet-1K consists of around 1.28 million images belonging to 1000 classes. We treat every image instance (along with its various transformed views) in the dataset as a unique class, and train a 1.28-million-way instance classifier as a pretext task to learn visual representations.
Evaluation datasets.
The learned visual representations are evaluated in three ways. First, under the linear evaluation protocol of ImageNet-1K, we fix the representation model and learn a linear classifier on top of it; the top-1/top-5 classification accuracies are used to compare different unsupervised methods. Second, we evaluate the semi-supervised
learning performance on ImageNet-1K, where methods must classify images in the val set when only a small fraction (i.e., 1% or 10%) of manual labels is provided in the train set. Third, we evaluate transfer performance by finetuning the representations on several downstream tasks and measuring the performance gains. In our experiments, the downstream tasks include Pascal-VOC object detection (Everingham et al. 2010), iNaturalist18 fine-grained image classification (Van Horn et al. 2018), and many others.

[Table 6: State-of-the-art comparison of linear classification accuracy of unsupervised methods on ImageNet-1K (Method / Top-1 / Top-5); rows include supervised pretraining and prior unsupervised methods, with Ours at 71.4 / 90.3.]
Implementation details.
We use ResNet-50 (He et al. 2016) as the backbone in all our experiments. We train our model using the SGD optimizer, with weight decay and momentum set to 0.0001 and 0.9, respectively. The initial learning rate (lr) is set to 0.48 and decays under the cosine annealing scheduler. In addition, we use 10 epochs of linear lr warmup to stabilize training. The minibatch size is 4096 and the feature dimension is D = 128. We set the temperature in Eq. (1) as τ = 0. , and the smoothing factor in Eq. (3) as α = 0. . For fair comparison, following the practices of recent works (Chen et al. 2020a; Cao et al. 2020), we feed two augmented views per instance for training. All experiments are conducted on 64 V100 GPUs with 32GB memory.
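Assembled in PyTorch, this optimization setup reads roughly as follows; the scheduler composition and the warmup start factor are illustrative choices, not our exact code.

```python
# A sketch of the optimizer and lr schedule described above (assumed details:
# warmup start factor, per-epoch scheduler stepping).
import torch

def build_optimizer(model, epochs=200, warmup_epochs=10, base_lr=0.48):
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=1e-4)
    warmup = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=0.01, total_iters=warmup_epochs)   # linear lr warmup
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs - warmup_epochs)                   # cosine decay
    sched = torch.optim.lr_scheduler.SequentialLR(
        opt, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    return opt, sched  # call sched.step() once per epoch
```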
Ablation Study

This section validates several modeling and configuration options for our method. We compare representation quality under the ImageNet linear protocol, evaluated on the val set. In each experiment, the linear classifier is trained with a batch size of 2048 and an lr of 40 that decays during training under the cosine annealing rule.
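In code, this protocol amounts to freezing the backbone and fitting a single linear layer; a sketch with assumed names (the epoch count is a placeholder, and the 2048-dimensional input matches pooled ResNet-50 features).

```python
# Sketch of the linear evaluation setup: freeze the pretrained backbone and
# train only a linear classifier on top (names and `epochs` are illustrative).
import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim=2048, num_classes=1000, epochs=100):
    backbone.eval()                              # representation stays fixed
    for p in backbone.parameters():
        p.requires_grad_(False)
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=40.0, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return head, opt, sched
```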
Ablation: effectiveness of components. Table 2 shows the linear evaluation performance using different combinations of components in our method, including a two-layer MLP head, a contrastive prior, and label smoothing. Accuracies are measured after 200 epochs of training. We find that all three components bring performance gains, improving the top-1 accuracy of our method from 58.6% to a competitive 68.2%.

Table 7: Comparison of semi-supervised learning accuracy (top-5, with 1% / 10% of labels) of both label-propagation and representation-learning methods on ImageNet-1K.

Method                                     1%     10%
Supervised                                 -      -
Label propagation:
  PseudoLabels (Zhai et al. 2019)          51.6   82.4
  VAT + Entropy Min. (Miyato et al. 2018)  47.0   83.4
  UDA (Xie et al. 2019)                    -      88.5
  FixMatch (Sohn et al. 2020)              -      89.1
Representation learning:
  InstDisc. (Dosovitskiy et al. 2014)      39.2   77.4
  PIRL (Misra and Maaten 2020)             57.2   83.8
  PCL (Li et al. 2020)                     75.6   86.2
  SimCLR (Chen et al. 2020a)               75.5   87.8
  PIC (Cao et al. 2020)                    77.1   88.7
  Ours                                     81.8   89.2
We also observe that a vanilla instance classification model can already achieve a top-1 accuracy of 67.3%, suggesting that full instance classification is a very strong baseline for unsupervised representation learning. The contrastive prior and label smoothing on the top-K hardest classes further boost the linear evaluation accuracy by around 1%.

Ablation: full instance classification vs. sampled instance classification.
Table 3 compares linear classification accuracies using representations learned by full instance classification and by sampled instance classification over a range of sampling sizes. Note that we remove the contrastive prior and label smoothing in these experiments and analyze only the impact of class sampling. We observe that full instance classification clearly outperforms sampled instance classification, by a margin of 1.8%, verifying the benefit of exploring the complete set of negative instances.

Ablation: a contrastive prior vs. random initialization.
Table 4 compares the linear evaluation performance of our method using Gaussian random initialization versus a contrastive prior for the classifier, with the training length increased from 10 to 400 epochs. We observe that the contrastive prior significantly speeds up convergence compared to random initialization, especially in early epochs (i.e., epochs 10, 25, and 50). Moreover, the contrastive-prior variant consistently outperforms its randomly initialized counterpart, showing the robustness of our initialization mechanism.
Ablation: label smoothing on hardest classes.
Table 5 shows the impact of label smoothing on representation learning. We vary the number K of hardest negative classes considered and the smoothing factor α; a no-smoothing baseline with K = 0 and α = 0 is also included for comparison. We observe that smoothing the labels of a few hardest classes improves linear evaluation performance over the non-smoothing baseline in most hyper-parameter settings. The best accuracy can be obtained with K = 100 and α = 0. , where a 0.6% gain in top-1 accuracy is achieved.

Table 8: Comparison of transfer performance on PASCAL VOC object detection.

Method                  AP     AP50   AP75
Supervised              -      -      -
PIC (Cao et al. 2020)   57.1   82.4   63.4
Ours                    -      -      -

Table 9: Comparison of transfer performance on iNaturalist fine-grained classification.

Method       Top-1   Top-5
Scratch      -       -
Supervised   -       -
Ours         66.2    86.2
Comparison with Previous Results
ImageNet linear evaluation.
Table 6 compares our work with previous unsupervised visual representation learning methods under the ImageNet linear evaluation protocol. Following recent practice (Chen et al. 2020a,c), we train for a longer schedule of 1000 epochs. The proposed unsupervised learning framework achieves a top-1 accuracy of 71.4% on ImageNet-1K, outperforming SimCLR (+2.1%), PIC (+0.6%), and MoCoV2 (+0.3%). The results verify that a simple full-instance-classification framework can learn very competitive visual representations. The performance gains can partly be attributed to the ability to explore the full set of negative instances at large scale, which is not supported by previous unsupervised frameworks (Chen et al. 2020a,c; Cao et al. 2020).
Semi-supervised learning.
Following (Kolesnikov, Zhai, and Beyer 2019; Chen et al. 2020a), we sample a 1% or 10% fraction of labeled data from ImageNet and train a classifier starting from our unsupervised pretrained model to evaluate semi-supervised learning performance. For 1% labels, we train the backbone with an lr of 0.001 and the classifier with an lr of 15. For 10% labels, the lrs for the backbone and the classifier are set to 0.001 and 10, respectively (Li et al. 2020). Table 7 compares our work with both representation-learning and label-propagation methods. We obtain a top-5 accuracy of 81.8% when only 1% of labels are used, outperforming all previous methods by a non-negligible margin (+4.7%). We also achieve the best result when 10% of labels are provided, surpassing SimCLR and PIC by 1.4% and 0.5%, respectively. The results suggest the strong discriminative ability of our learned representation.
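In code, the two learning rates are simply separate parameter groups; a sketch with assumed names (the momentum value is an assumption, as only the learning rates are specified above).

```python
# Sketch of the semi-supervised finetuning optimizer: a small lr for the
# pretrained backbone and a large one for the fresh classifier.
import torch

def build_semisup_optimizer(backbone, classifier, label_fraction=0.10):
    head_lr = 10.0 if label_fraction >= 0.10 else 15.0  # per the settings above
    return torch.optim.SGD([
        {"params": backbone.parameters(), "lr": 0.001},
        {"params": classifier.parameters(), "lr": head_lr},
    ], momentum=0.9)  # momentum assumed
```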
Transfer learning. To further evaluate the learned representation, we apply the pretrained model to several downstream visual tasks (including detection, fine-grained classification, and others) and measure the transfer performance.
PASCAL VOC Object Detection:
Following (He et al. 2020), we use Faster R-CNN (Ren et al. 2015) with a ResNet-50 backbone as the object detector. We initialize ResNet-50 with our pretrained weights and finetune all layers end-to-end on the trainval07+12 set of the PASCAL VOC dataset (Everingham et al. 2010). We adopt the same experimental settings as MoCoV2 (Chen et al. 2020c). The AP (average precision), AP50, and AP75 scores on the test2007 set are used as indicators. Table 8 shows the results. Our transfer performance is significantly better than the supervised pretraining counterpart (+3.7% in AP) and is competitive with state-of-the-art unsupervised learning methods.

Table 10: Transfer performance of different pretrained models on more downstream visual tasks.

Method       CIFAR10   CIFAR100   SUN397   DTD
Scratch      -         -          -        -
Supervised   -         -          -        -
Ours         97.8      86.2       64.2     77.6

iNaturalist fine-grained classification:
We finetune the pretrained model end-to-end on the train set of the iNaturalist 2018 dataset (Van Horn et al. 2018) and evaluate the top-1 and top-5 classification accuracies on the val set. Results are shown in Table 9. Our method is closely competitive with the ImageNet supervised pretraining counterpart as well as previous state-of-the-art unsupervised methods, indicating the discriminative ability of our pretrained representation for fine-grained classification.
More downstream tasks:
Table 10 shows transfer results on more downstream tasks, including image classification on CIFAR10, CIFAR100 (Krizhevsky, Hinton et al. 2009), SUN397 (Xiao et al. 2010), and DTD (Cimpoi et al. 2014). In summary, our method performs competitively with ImageNet supervised pretraining as well as state-of-the-art unsupervised pretraining.
Conclusion
In this work, we present an unsupervised visual representation learning framework whose pretext task is to distinguish all instances in a dataset with a parametric classifier. The task is similar to supervised semantic classification, but with a much larger number of classes (equal to the dataset size) and finer granularity. We first introduce a hybrid parallel training framework to make large-scale instance classification feasible, which significantly reduces GPU memory overhead and speeds up training in our experiments. Second, we propose to improve convergence by introducing a contrastive prior to the instance classifier, achieved by initializing the classification weights with raw instance features extracted by a fixed random network with running BNs. We show in our experiments that this simple strategy clearly speeds up convergence and improves transfer performance. Finally, to reduce the impact of noisy negative instance pairs, we propose to smooth the labels of a few hardest classes. Extensive experiments on ImageNet classification, semi-supervised classification, and many downstream tasks show that our simple unsupervised representation learning method performs comparably to or better than state-of-the-art unsupervised methods.

References
Asano, Y. M.; Rupprecht, C.; and Vedaldi, A. 2020. Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations (ICLR).
Cao, Y.; Xie, Z.; Liu, B.; Lin, Y.; Zhang, Z.; and Hu, H. 2020. Parametric Instance Classification for Unsupervised Visual Feature Learning. arXiv preprint arXiv:2006.14618.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 132–149.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709.
Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; and Hinton, G. 2020b. Big Self-Supervised Models are Strong Semi-Supervised Learners. arXiv preprint arXiv:2006.10029.
Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020c. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing Textures in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422–1430.
Doersch, C.; and Zisserman, A. 2017. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, 2051–2060.
Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
Donahue, J.; and Simonyan, K. 2019. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, 10542–10552.
Dosovitskiy, A.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2014. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, 766–774.
Dosovitskiy, A.; Fischer, P.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision.
Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hénaff, O. J.; Srinivas, A.; De Fauw, J.; Razavi, A.; Doersch, C.; Eslami, S.; and Oord, A. v. d. 2019. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
Jenni, S.; and Favaro, P. 2018. Self-supervised feature learning by learning to spot artifacts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2733–2742.
Kolesnikov, A.; Zhai, X.; and Beyer, L. 2019. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1920–1929.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Li, J.; Zhou, P.; Xiong, C.; Socher, R.; and Hoi, S. C. 2020. Prototypical Contrastive Learning of Unsupervised Representations. arXiv preprint arXiv:2005.04966.
Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6707–6717.
Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Noroozi, M.; and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 69–84. Springer.
Noroozi, M.; Vinjimoor, A.; Favaro, P.; and Pirsiavash, H. 2018. Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9359–9367.
Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2536–2544.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, 91–99.
Sohn, K.; Berthelot, D.; Li, C.-L.; Zhang, Z.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Zhang, H.; and Raffel, C. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
Song, L.; Pan, P.; Zhao, K.; Yang, H.; Chen, Y.; Zhang, Y.; Xu, Y.; and Jin, R. 2020. Large-Scale Training System for 100-Million Classification at Alibaba. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2909–2930.
Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive Multiview Coding. arXiv preprint arXiv:1906.05849.
Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. 2018. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8769–8778.
Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742.
Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3485–3492. IEEE.
Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, 1476–1485.
Zhang, L.; Qi, G.-J.; Wang, L.; and Luo, J. 2019. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2547–2555.
Zhang, R.; Isola, P.; and Efros, A. A. 2016. Colorful image colorization. In European Conference on Computer Vision, 649–666. Springer.
Zhuang, C.; Zhai, A. L.; and Yamins, D. 2019. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, 6002–6012.