Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning
Yu Liu, Lianghua Huang, Pan Pan, Bin Wang, Yinghui Xu, Rong Jin
Machine Intelligence Technology Lab, Alibaba Group
{ly103369, xuangen.hlh, panpan.pp, ganfu.wb, renji.xyh, jinrong.jr}@alibaba-inc.com

Abstract
This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level classifier. The overall framework is a replica of a supervised classification model, where semantic classes (e.g., dog, bird, and ship) are replaced by instance IDs. However, scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges, including 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes, which can be noisy. This work presents several novel techniques to handle these difficulties. First, we introduce a hybrid parallel training framework to make large-scale training feasible. Second, we present a raw-feature initialization mechanism for the classification weights, which we assume offers a contrastive prior for instance discrimination and clearly speeds up convergence in our experiments. Finally, we propose to smooth the labels of the few hardest classes to avoid optimizing over very similar negative pairs. While being conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i.e., SimCLR, MoCo v2, and PIC, under the ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full instance classification is a strong pretraining technique for many semantic visual tasks.
Introduction

Unsupervised visual representation learning has recently shown encouraging progress (He et al. 2020; Chen et al. 2020a). Methods using instance discrimination as a pretext task (Tian, Krishnan, and Isola 2019; He et al. 2020; Chen et al. 2020a) have demonstrated competitive or even superior performance compared to supervised counterparts under the ImageNet (Deng et al. 2009) linear evaluation protocol and on many downstream visual tasks. This shows the potential of unsupervised representation learning methods, since they can utilize almost unlimited data without manual labels.

To solve the instance discrimination task, a dual-branch structure is usually used, where two transformed views of the same image are encouraged to get close, while transformed views from different images are expected to get far apart (He et al. 2020; Chen et al. 2020a,c). These methods often rely on specialized designs such as a memory bank (Wu et al. 2018a), a momentum encoder (He et al. 2020; Chen et al. 2020c), a large batch size (Chen et al. 2020a,b), or shuffled batch normalization (BN) (He et al. 2020; Chen et al. 2020c) to compensate for the lack of negative samples or to handle the information leakage issue (i.e., samples on the same GPU tend to get closer due to shared BN statistics).
[Figure 1 depicts the two-phase pipeline: Phase 1, unsupervised pretraining (Encoder, Pool + MLP, N instance classes); Phase 2, supervised transferring (Encoder, Pool + FC, C semantic classes).]

Figure 1: (a) An overview of our unsupervised visual representation learning framework. Without manual labels, we simply train an instance-level classifier that tries to distinguish all images in a dataset, learning discriminative representations that can be well transferred to supervised tasks. (b) The relationship between instance (unsupervised) and semantic (supervised) classification accuracies. We observe a strong positive correlation between them in our experiments.

Unlike dual-branch approaches, a one-branch scheme (e.g., parametric instance-level classification) usually avoids the information leakage issue and can potentially explore a larger set of negative samples. ExemplarCNN (Dosovitskiy et al. 2014) and PIC (Cao et al. 2020) are of this category.
Figure 2: An outline of our distributed hybrid parallel (DHP) training process on T GPU nodes. Data parallel: following the data parallel mechanism, we copy the encoding and MLP layers to all nodes, each processing a subset of the minibatch data. Model parallel: following the model parallel mechanism, we evenly divide the classification weights across nodes and distribute the computation of the classification scores (forward pass) and the weight/feature gradients (backward pass) to different GPUs. Label smoothing: we smooth the labels of the top-K hardest negative classes for each instance to avoid optimizing over noisy pairs.
Nevertheless, due to the high GPU computation and memory overhead of large-scale instance-level classification, these methods are either tested on small datasets (Dosovitskiy et al. 2014) or rely on negative class sampling (Cao et al. 2020) to make training feasible.

This work summarizes the typical challenges of using one-branch instance discrimination for unsupervised representation learning, including 1) the large-scale classification; 2) the slow convergence due to infrequent instance access; and 3) the large number of negative classes, which can be noisy, and proposes novel techniques to handle them. First, we introduce a hybrid parallel training framework to make large-scale classifier training feasible. It relies on model parallelism, dividing the classification weights across GPUs and evenly distributing the softmax computation (in both forward and backward passes) among them. Figure 2 shows an overview of our distributed training process. This training framework can theoretically support up to 100-million-way classification using 256 GPUs (Song et al. 2020), which far exceeds the 1.28 million instances of ImageNet-1K, indicating the scalability of our method.

Second, instance classification faces a slow convergence problem due to the extremely infrequent visiting of instance samples (Cao et al. 2020). In this work, we tackle this problem by introducing a contrastive prior to the instance classifier. Specifically, with a randomly initialized network, we fix all but its BN layers and run an inference epoch to extract all instance features; we then directly assign these features to the classification weights as an initialization. The intuition is two-fold. On the one hand, we presume that running BNs may offer a contrastive prior in the output instance features, since in each iteration the features subtract a weighted average of other instance features extracted in previous iterations. On the other hand, initializing the classification weights as instance features in essence converts the classification task into a pair-wise instance comparison task, providing a warm start for convergence.

Finally, regarding the massive number of negative instance classes, which significantly raises the risk of optimizing over very similar negative pairs, we propose to smooth the labels of the top-K hardest negative classes to make training easier. Specifically, we compute cosine similarities between instance proxies (their corresponding classification weights) and find the negative classes with the top-K highest similarities for each instance. The labels of these classes are smoothed by a factor of α (i.e., from y− = 0 to y− = α/K). The right part of Figure 2 shows the smoothing process. Note that these top-K indices are computed once per training epoch, which is very efficient and adds only minimal computational overhead to training. (A code sketch of this smoothing step closes this introduction.)

We evaluate our method under the common ImageNet linear evaluation protocol as well as on several downstream tasks related to detection or fine-grained classification. Despite its simplicity, our method shows competitive results on these tasks. For example, it achieves a top-1 accuracy of 71.4% under the ImageNet linear classification protocol, outperforming all other instance discrimination based methods (Chen et al. 2020a,c; Cao et al. 2020). We also obtain a semi-supervised accuracy of 81.8% on ImageNet-1K when only 1% of the labels are provided, surpassing the previous best result by around 4.7%. We hope our full instance classification framework can serve as a simple and strong baseline for the unsupervised representation learning community.
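Closing this introduction, a minimal PyTorch-style sketch shows how such smoothed targets could be built; the helper name and the default K and α values are illustrative assumptions rather than the authors' released code:

    import torch
    import torch.nn.functional as F

    def topk_smoothed_targets(weights, labels, K=10, alpha=0.2):
        # weights: (N, D) instance classification weights (the instance proxies)
        # labels:  (B,) instance IDs of the current minibatch
        # Returns (B, N) targets: the true class receives 1 - alpha and each of
        # its K most similar negative classes receives alpha / K.
        w = F.normalize(weights, dim=1)             # rows on the unit sphere
        sim = w[labels] @ w.t()                     # (B, N) proxy-to-proxy cosines
        sim.scatter_(1, labels.unsqueeze(1), -1.0)  # exclude each positive class
        _, hardest = sim.topk(K, dim=1)             # top-K hardest negatives

        targets = torch.zeros_like(sim)
        targets.scatter_(1, labels.unsqueeze(1), 1.0 - alpha)
        targets.scatter_(1, hardest, alpha / K)
        return targets

In practice, the paper rebuilds this top-K lookup table only once per epoch, so the search adds negligible overhead to training.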
Related Work

Unsupervised visual representation learning.
Unsupervised or self-supervised visual representation learning aims to learn discriminative representations from visual data where no manual labels are available. Usually, a pretext task is utilized to determine the quality of the learned representation and to iteratively optimize the parameters. Representative pretext tasks include transformation prediction (Gidaris, Singh, and Komodakis 2018; Zhang et al. 2019), inpainting (Pathak et al. 2016), spatial or temporal patch order prediction (Doersch, Gupta, and Efros 2015; Noroozi and Favaro 2016; Noroozi et al. 2018), colorization (Zhang, Isola, and Efros 2016), clustering (Caron et al. 2018; Zhuang, Zhai, and Yamins 2019a; Asano, Rupprecht, and Vedaldi 2020), data generation (Jenni and Favaro 2018; Donahue and Simonyan 2019; Donahue, Krähenbühl, and Darrell 2016), geometry (Dosovitskiy et al. 2015), and combinations of multiple pretext tasks (Doersch and Zisserman 2017; Feng, Xu, and Tao 2019).
Contrastive visual representation learning.
More recently, contrastive representation learning methods (Hénaff et al. 2019; He et al. 2020) have shown significant performance improvements by using strong data augmentation and proper loss functions (Chen et al. 2020c,a). These methods usually employ a dual-branch structure, where two augmented views of an image are encouraged to get close while augmented views from different images are forced to get far apart. One problem with these methods is the shortage of negative samples. Some methods rely on a large batch size (Chen et al. 2020a), a memory bank (Wu et al. 2018a), or a momentum encoder (He et al. 2020; Chen et al. 2020c) to enlarge the negative pool. Another issue is information leakage (He et al. 2020; Chen et al. 2020c), where features extracted on the same GPU tend to get close due to shared BN statistics. MoCo (He et al. 2020; Chen et al. 2020c) solves this problem by using shuffled batch normalization (BN), while SimCLR (Chen et al. 2020a) handles it with a synchronized global BN.
Instance discrimination for representation learning.
Unlike the two-branch structure used in contrastive methods, some approaches (Dosovitskiy et al. 2014; Cao et al. 2020) employ a parametric, one-branch structure for instance discrimination, which avoids the information leakage issue. Exemplar-CNN (Dosovitskiy et al. 2014) learns a classifier to discriminate between a set of "surrogate classes", where each class represents different transformed patches of a single image. Nevertheless, it shows worse performance than non-parametric approaches (Wu et al. 2018a). PIC (Cao et al. 2020) improves Exemplar-CNN in two ways: 1) it introduces a sliding-window data scheduler to alleviate the infrequent instance visiting problem; and 2) it utilizes recent classes sampling to reduce the GPU memory consumption. Despite its effectiveness, it relies on complicated scheduling and optimization processes, and it cannot fully explore the large number of negative instances. This work presents a much simpler instance discrimination method that uses an ordinary data scheduler and optimization process. In addition, it is able to make full use of the massive number of negative instances in every training iteration.
Methodology
Overall Framework
This work presents an unsupervised representation learning method with a pretext task of classifying all image instances in a dataset. Figure 1 (a) shows the outline of our method. The pipeline is similar to common supervised classification, where semantic classes are replaced by instance IDs. Inspired by the design improvements used in recent unsupervised frameworks (Chen et al. 2020a), we slightly modify some components, including using stronger data augmentation (i.e., random crop, color jitter, and Gaussian blur), a two-layer MLP projection head, and a cosine softmax loss. The cosine softmax loss is defined as

$$J = -\frac{1}{|I|}\sum_{i \in I} \log \frac{\exp(\cos(w_i, x_i)/\tau)}{\sum_{j=1}^{N}\exp(\cos(w_j, x_i)/\tau)}, \qquad (1)$$

where $I$ denotes the indices of the sampled image instances in a minibatch, $x_i$ is the projected embedding of instance $i$, $W = \{w_1, w_2, \cdots, w_N\} \in \mathbb{R}^{D \times N}$ represents the instance classification weights, $\cos(w_j, x_i) = (w_j^T x_i)/(\|w_j\| \cdot \|x_i\|)$ denotes the cosine similarity between $w_j$ and $x_i$, and $\tau$ is a temperature adjusting the scale of the cosine similarities. (A code sketch of this loss is given at the end of this subsection.)

Nevertheless, there are still challenges for this vanilla instance classification model to learn good representations, including 1) the large-scale instance classes (e.g., 1.28 million instance classes for the ImageNet-1K dataset); 2) the extremely infrequent visiting of instance samples; and 3) the massive number of negative classes, which makes training difficult. We propose three efficient techniques to improve the representation learning and the scalability of our method:

• Hybrid parallelism. To support large-scale instance classification, we rely on hybrid parallelism and evenly distribute the softmax computation (in both forward and backward passes) to different GPUs. Figure 2 shows a schematic of the distributed training process on T GPUs.

• A contrastive prior. To improve convergence, we propose to introduce a contrastive prior to the instance classifier. This is simply achieved by initializing the classification weights as raw instance features extracted by a fixed random network with running BNs.

• Smoothing labels of the hardest classes. The massive number of negative classes raises the risk of optimizing over very similar pairs. We apply label smoothing on the top-K hardest instance classes to alleviate this issue.

Note that the above improvements add little or no computational overhead to the training process. Next, we introduce these techniques in detail.
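As a reference for Eq. (1), the loss is a standard cross-entropy over temperature-scaled cosine logits. The following minimal PyTorch-style sketch shows the single-GPU case; the function name and the τ value are illustrative assumptions, not the paper's setting:

    import torch.nn.functional as F

    def cosine_softmax_loss(x, weights, labels, tau=0.15):
        # x:       (B, D) projected embeddings of the minibatch
        # weights: (N, D) instance classification weights, one row per image
        # labels:  (B,) instance IDs, i.e., the "classes" to be predicted
        logits = F.normalize(x, dim=1) @ F.normalize(weights, dim=1).t()
        return F.cross_entropy(logits / tau, labels)  # Eq. (1) averaged over I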
Hybrid Parallelism

Training an instance classifier usually requires learning a large-scale fc layer. For example, for ImageNet-1K with approximately 1.28 million images, one needs to optimize a weight matrix of size $W \in \mathbb{R}^{D \times 1.28\mathrm{M}}$; for ImageNet-21K, the size further grows to $W \in \mathbb{R}^{D \times 14.2\mathrm{M}}$. This is often infeasible with a regular distributed data parallel (DDP) training pipeline. In this work, we introduce a distributed hybrid parallel (DHP) training framework (Song et al. 2020) to make large-scale classification feasible.

Figure 2 summarizes the outline of the distributed hybrid parallel training process on T GPU nodes. For the encoding and MLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of the minibatch data; for the large-scale fc layer, we follow the model parallel mechanism and split the weights evenly across the T GPUs. At each training iteration, each GPU node 1) extracts features of a subset of minibatch samples; 2) gathers features from all other nodes; 3) computes partial cosine logits using the local classification weights; 4) computes the exponentials of the logits and sums them over all nodes to obtain the softmax denominators; 5) computes the softmax probabilities and the cross-entropy loss on the subset data; 6) derives gradients of the local loss with respect to features and weights; 7) gathers the feature gradients from all GPU nodes and sums them; and 8) runs an optimization step to update the parameters of the encoding, MLP, and classification layers. This pipeline is repeated to loop through the complete dataset for several epochs, optimizing for better representations. (A condensed sketch of steps 1-5 is given after Table 1.)

Figure 3 compares the GPU memory overhead of the DDP and DHP training frameworks when the (pseudo) class number increases from 10K to 30M. The experiment is conducted on 64 V100 GPUs with 32 GB of memory and a total batch size of 4096. DDP reports an out-of-memory (OOM) error when the class number reaches 4.7 million, while the DHP framework supports up to 30 million classes, about 6.4× the DDP limit. We also note that DHP can benefit from more GPUs to support larger-scale instance classification, whereas DDP does not offer this scalability. Table 1 compares the training efficiency of the DDP and DHP frameworks on ImageNet-1K and ImageNet-21K under the same batch size settings. The DHP framework not only consumes less GPU memory but also trains much faster than its DDP counterpart.

Figure 3: Comparison of the maximum number of classes supported by the DDP and DHP training frameworks under different GPU memory constraints.

Table 1: Comparison of the GPU memory consumption and the training time per epoch of the DDP and DHP training frameworks. Experiments are conducted on the ImageNet-1K and ImageNet-21K datasets.

    Dataset        #Instances   DDP    DHP
    ImageNet-21K   14.2M        OOM*   21.51 GB, 2940 s/epoch
    * OOM: out of memory.
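The following condensed sketch illustrates how steps 1)-5) could be realized with torch.distributed on one of the T GPUs. It assumes an initialized process group, and it simplifies gradient routing (steps 6)-8) are handled by autograd plus the explicit gradient all-reduce described above); it is an illustration under assumed shapes, not the Song et al. (2020) implementation:

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F

    def dhp_cosine_softmax_loss(local_x, local_w, labels, rank, world_size, tau=0.15):
        # local_x: (b, D) embeddings of this GPU's sub-batch; B = b * world_size
        # local_w: (N // world_size, D) this GPU's shard of the classifier weights
        # labels:  (B,) global instance IDs of the full minibatch
        # Steps 1)-2): extract local features, then gather features from all nodes.
        gathered = [torch.empty_like(local_x) for _ in range(world_size)]
        dist.all_gather(gathered, local_x)
        x = F.normalize(torch.cat(gathered), dim=1)           # (B, D)

        # Step 3): partial cosine logits against the local weight shard.
        logits = x @ F.normalize(local_w, dim=1).t() / tau    # (B, N // world_size)

        # Step 4): softmax denominator summed over all shards (log-sum-exp trick).
        m = logits.max(dim=1, keepdim=True).values
        dist.all_reduce(m, op=dist.ReduceOp.MAX)              # global per-sample max
        exp_sum = (logits - m).exp().sum(dim=1, keepdim=True)
        dist.all_reduce(exp_sum, op=dist.ReduceOp.SUM)        # global denominator

        # Step 5): cross-entropy; each positive logit lives on exactly one shard.
        lo = rank * local_w.size(0)
        on_shard = (labels >= lo) & (labels < lo + local_w.size(0))
        pos = torch.zeros(labels.size(0), device=logits.device)
        pos[on_shard] = logits[on_shard, labels[on_shard] - lo]
        dist.all_reduce(pos, op=dist.ReduceOp.SUM)            # fill in missing logits
        return (exp_sum.squeeze(1).log() + m.squeeze(1) - pos).mean()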
Raw-Feature Initialization with Running BNs

Instance classification faces a slow convergence problem in early epochs due to the infrequent visiting of instance samples (i.e., one access per epoch) and the lack of prior knowledge in the classifier. A recent work (Cao et al. 2020) handles the infrequent instance visiting problem with a sliding-window data scheduler, which samples overlapping batches in adjacent iterations. This increases the positive instance visiting frequency, but it also significantly multiplies the time needed to loop over the complete dataset.

In this work, we present a simple weight initialization mechanism to handle the convergence issue. Specifically, before training starts, we run an inference epoch using the fixed random initial network with trainable BNs to extract all instance features $X = \{x_1, x_2, \cdots, x_N\} \in \mathbb{R}^{D \times N}$; we then directly assign them to the classification weights $W = \{w_1, w_2, \cdots, w_N\} \in \mathbb{R}^{D \times N}$ as an initialization. The intuition behind this mechanism is two-fold. First, a weight vector $w_i$ represents the feature center of the different transformed views of instance $i$; initializing the weights as raw instance features thus in essence converts the classification task into a pair-wise instance comparison task, providing a warm start for convergence. Second, we presume that the running BNs offer a contrastive prior in the extracted features, since in each iteration the features subtract a weighted average of other instance features extracted in previous iterations.

Figure 4: Comparison of initialization methods (intra-class vs. inter-class cosine similarity, in %, for raw features extracted with fixed BNs and with running BNs).

Table 2: Ablation study on the effectiveness of different components in our method.

    Init. method             Linear eval. (epoch 10)
    Random init.             12.4
    Raw with fixed BNs       22.7
    Raw with running BNs
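A minimal PyTorch-style sketch of this initialization pass follows; the function and loader names are hypothetical, and we assume model(images) returns the (B, dim) projected features:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def init_classifier_weights(model, loader, num_instances, dim, device):
        # Keep the network in training mode so BN layers keep updating their
        # running statistics ("running BNs"); since no optimizer step is taken,
        # all weights stay at their random values (the "fixed" random network).
        model.train()
        W = torch.empty(num_instances, dim, device=device)
        for images, ids in loader:                    # one inference epoch
            feats = model(images.to(device))          # (B, dim) projected features
            W[ids.to(device)] = F.normalize(feats, dim=1)
        return W  # assigned to the instance classifier (or its shards) as init

Normalizing the stored features is harmless here, since the cosine classifier in Eq. (1) only uses the direction of each weight vector.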
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights as C o s i n e s i m il a r i t y ( % ) Figure 4: Comparison of initialization methods. model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart.Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvethe convergence, we propose to initialize the classificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method Linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs
Raw withfixed BNs Raw withrunning BNs
Intra-class sim.Inter-class sim.
Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method.
Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights asFigure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights as C o s i n e s i m il a r i t y ( % ) Figure 4: Comparison of initialization methods. model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Figure 4: Comparison of initialization methods. model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. 
The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart.Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvethe convergence, we propose to initialize the classificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method Linear eval.(epoch 10)Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs
Raw withfixed BNs Raw withrunning BNs
Intra-class sim.Inter-class sim.020406080100 C o s i n e s i m il a r i t y ( % ) Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method.
Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvethe convergence, we propose to initialize the classificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method Linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs
Raw withfixed BNs Raw withrunning BNs
Intra-class sim.Inter-class sim.
Figure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method.
Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Instance classification faces the slow convergence problemin early epochs due to the infrequent visiting of instancesamples (i.e., once access per epoch) and the lack of priorknowledge in the classifier. A recent work (Cao et al. 2020)handles the infrequent instance visiting problem by using asliding-window data scheduler, which samples overlappedbatches between adjacent iterations. This increases the pos-itive instance visiting but it also significantly multiplies thetime of looping over the complete dataset.In this work, we present a simple weight initializationmechanism to handle the convergence issue. Specifically,before training started, we run an inference epoch using the fixed random initial network with trainable BNs to extractall instance features X = { x , x , ··· , x N } 2 R C ⇥ N ;then we directly assign them to classification weights W = { w , w , ··· , w N } 2 R C ⇥ N as an initialization. The intu-ition behind this initialization mechanism is two-fold. First,a weight vector w i represents the feature center of differenttransformed views of instance i , thus initializing weights asFigure 3: Comparison of GPU memory consumption ofDDP and DHP training frameworks.Table 1: Ablation study on the effectiveness of differentcomponents in our method. Dataset
OOM - 21.51G 2940s • Raw-feature initialization with running BNs.
To improvetheconvergence,weproposetoinitializetheclassificationweights as raw instance features extracted by a randominitial network, where all but BN layers of it are fixed. Weassume that such initialization offers a contrastive priorfor instance discrimination.•
Smoothing labels of hardest classes.
The massive amountof negative classes raises the risk of optimizing over verysimilar pairs. We apply label smoothing on the top- K hardest instance classes to alleviate this issue.Note that the above improvements only bring little or nocomputational overhead for the training process. Next wewill introduce these techniques respectively in details. Hybrid Parallelism
Training an instance-level classifier usually requires to learna large-scale fc layer. For example, for ImageNet-1K withapproximately 1.28 million images, one needs to optimize aweight matrix of size W R C ⇥ ; for ImageNet-21K,the size further enlarged to W R C ⇥ . This is of-ten infeasible when using a regular distributed data training(DDP) pipeline. In this work, we introduce a distributed hy-brid parallel (DHP) training framework (Song et al. 2020) tomake the large-scale instance classification feasible.Figure 2 summarizes the outline of the distributed hybridparallel training process on T GPU nodes. For encoding andMLP layers, we follow the data parallel pipeline and copy them to different GPUs, each processing a subset of mini-batch data; while for the large-scale fc layer, we follow the Table 2: Ablation study on the effectiveness of differentcomponents in our method. Init. method linear eval.Random init. 12.4Raw with fixed BNs 22.7Raw with running BNs model parallel mechanism and split the weights evenly to T GPUs. At each training iteration and for each GPU node,we 1) extract features of a subset of minibatch samples; 2)gather features from all other nodes; 3) compute partial co-sine logits using local classification weights; 4) compute theexponential values of logits and sum them over all nodes toobtain the softmax denominators; 5) compute softmax prob-abilities and the cross entropy loss on the subset data; 6) de-duce gradients of the local loss with respect to features andweights; 7) gather gradients from all GPU node and sumthem; 8) run a step of optimization to update parameters ofencoding, MLP, and classification layers. The pipeline is re-peated to loop over the complete dataset for several epochsto optimize for better representation.Figure 3 compares the GPU memory overhead of DDPand DHP training frameworks when increasing the (pseudo)class number from 10 thousand to 30 million. The exper-iment is conducted on 64 V100 GPUs with 32G memory.DDP reports out-of-memory (OOM) error when the classnumber reaches 4.7 million, while the DHP training frame-work can support up to 30 million number of classes, whichis . ⇥ of the DDP’s limit. We also note that the DHPcan benefit from more GPUs to support larger-scale instanceclassification, but DDP does not bear this scalability. Table 1compares the training efficiency of DDP and DHP frame-works on ImageNet-1K and ImageNet-21K under the samebatch size settings. We show that the DHP training frame-work not only consumes less GPU memory, but also trainsmuch faster than the DDP counterpart. Raw-Feature Initialization with Running BNs
Figure 3 compares the GPU memory overhead of the DDP and DHP training frameworks as the (pseudo) class number increases from 10 thousand to 30 million. The experiment is conducted on 64 V100 GPUs with 32GB memory. DDP reports an out-of-memory (OOM) error once the class number reaches 4.7 million, while the DHP framework supports up to 30 million classes, roughly 6.4× DDP's limit. We also note that DHP can exploit additional GPUs to support even larger-scale instance classification, whereas DDP does not offer this scalability. Table 1 compares the training efficiency of the DDP and DHP frameworks on ImageNet-1K and ImageNet-21K under the same batch size settings. The DHP framework not only consumes less GPU memory but also trains much faster than its DDP counterpart.

[Figure 3: Comparison of GPU memory consumption of the DDP and DHP training frameworks.]

[Table 1: Training efficiency of DDP vs. DHP on ImageNet-1K and ImageNet-21K under the same batch size; for ImageNet-21K, DDP hits OOM while DHP runs at 21.51G of memory in 2940s.]
[Figure 4: Comparison of initialization methods. Table: ImageNet linear evaluation accuracies of different weight-initialization methods, evaluated after 10 epochs of training (random init.: 12.4; raw with fixed BNs: 22.7; raw with running BNs: best). Bar chart: average intra- and inter-instance-class cosine similarities under each initialization method.]
A Contrastive Prior
Instance classification suffers from slow convergence in early epochs due to the infrequent visiting of instance samples (i.e., one access per epoch). A recent work (Cao et al. 2020) handles this infrequent visiting problem with a sliding-window data scheduler, which samples overlapping batches between adjacent iterations. This increases positive instance visiting, but it also significantly multiplies the time needed to loop over the complete dataset.

In this work, we handle the problem from a different perspective: we speed up convergence by introducing a contrastive prior to the classification weights. Specifically, before training starts, we run an inference epoch using the fixed, randomly initialized network with running BNs to extract all instance features X = {x_1, x_2, ..., x_N} ∈ R^{D×N}; we then directly assign them to the classification weights W = {w_1, w_2, ..., w_N} ∈ R^{D×N} as an initialization. The intuition behind this initialization mechanism is two-fold. First, running BNs offer a contrastive prior in the output features, since in each inference phase the features computed after every BN layer subtract a running average of the instance features extracted in previous iterations. Second, assigning features to weights approximately converts the classification task into a pair-wise metric learning task in early epochs, which is relatively easier to converge and offers a warm start for instance classification.
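For illustration, the initialization can be sketched as follows; `encoder`, `loader`, and the feature dimension are assumed names and values, not our exact pipeline.

```python
# A minimal sketch of the raw-feature initialization with running BNs.
# `encoder` (backbone + MLP head) and `loader` (yields images together with
# their global instance indices) are assumed names.
import torch
import torch.nn.functional as F

@torch.no_grad()
def init_classifier_weights(encoder, loader, num_instances, dim=128):
    # train mode keeps the BN running statistics updating across batches
    # ("running BNs"); this happens even under no_grad.
    encoder.train()
    for p in encoder.parameters():
        p.requires_grad_(False)            # every weight stays at random init
    W = torch.zeros(num_instances, dim)
    for images, ids in loader:             # one full inference epoch
        feats = F.normalize(encoder(images), dim=1)
        W[ids] = feats.cpu()               # row i becomes instance i's feature
    return W                               # used to initialize the fc weights
```

Because only the BN statistics move during this pass, the whole procedure costs a single inference sweep over the dataset and introduces no trainable parameters.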
Figure 4 compares the discriminative ability of different classifier initialization schemes, i.e., random weight initialization (random init. for short), raw-feature initialization with fixed BNs (raw with fixed BNs), and raw-feature initialization with running BNs (raw with running BNs). We use ImageNet linear evaluation accuracy (evaluated after 10 epochs of training) as well as the average intra- and inter-instance-class similarities as indicators. As shown in Figure 4, raw with running BNs achieves the best linear evaluation accuracy, clearly outperforming the other initialization methods. In addition, raw with running BNs obtains a much larger similarity gap between positive and negative instance pairs than raw with fixed BNs, validating the assumption that running BNs provide a contrastive prior for instance discrimination.

We also note that the instance features extracted by a random network with running BNs are a robust starting point for semantic classification. We run an instance retrieval experiment on the train set of ImageNet-1K, using a randomly initialized ResNet-50 to extract all image features, and check whether each retrieved instance and its query belong to the same semantic category. A top-1 accuracy of 3% can be achieved, which far exceeds the 0.1% top-1 accuracy of a random guess.

[Table 2: Ablation study on the effectiveness of different components in our method; columns mark which of the MLP head, the contrastive prior, and label smoothing are enabled, with the resulting ImageNet Top-1/Top-5 accuracies.]

[Table 3: Ablation study that compares full instance classification and sampled instance classification; the full-classification row reads 67.3 Top-1 / 87.7 Top-5.]

Smoothing Labels of Hardest Classes
A challenge of instance-level classification is that it introduces a very large number of negative classes, significantly raising the risk of optimizing over very similar pairs, which can be noisy and make the training hard to converge.

In this work, we handle this problem by applying label smoothing on a few hardest instance classes. Although other techniques (e.g., clustering) are also applicable, we choose label smoothing for its simplicity and efficiency. We notice that semantically similar instance pairs are relatively stable across the training process. Therefore, we represent each instance i by its corresponding classification weights w_i (instead of its unstable features x_i), and compute the cosine similarities between w_i and all other weights W_{\bar{i}} = {w_1, ..., w_{i-1}, w_{i+1}, ..., w_N} ∈ R^{D×(N−1)} to find the top-K hardest negative classes H_i^- = {c_1, c_2, ..., c_K}.

Table 4: Ablation study that compares Gaussian random and a contrastive prior for classifier initialization (ImageNet linear evaluation top-1 accuracy; gains of the contrastive prior in parentheses).

Epochs   Random init.   Contrastive prior
10       -              - (+15.0)
25       40.8           46.3 (+5.5)
50       56.1           58.0 (+1.9)
100      62.9           64.1 (+1.2)
200      67.3           67.6 (+0.3)
400      69.3           69.7 (+0.4)

Table 5: Ablation study of label smoothing with different numbers of hard classes K and smoothing factors α.

Hard class number K   Smoothing factor α   Top-1   Top-5
no smoothing          -                    -       -
100                   0.3                  67.5    88.2
50                    0.2                  68.0    88.5
200                   0.2                  67.9    88.4
The label of class j ∈ {1, 2, ..., N} is then defined as

y_{ij} = \begin{cases} 1 - \alpha, & j = i, \\ \alpha / K, & j \in H_i^-, \\ 0, & \text{otherwise}, \end{cases}    (2)

and the loss function in Eq. (1) is redefined as

J = -\frac{1}{|I|} \sum_{i \in I} \log \frac{\sum_{j=1}^{N} y_{ij} \exp(\cos(w_j, x_i)/\tau)}{\sum_{j=1}^{N} \exp(\cos(w_j, x_i)/\tau)}.    (3)

The top-K similarities between instance weights are computed once per epoch, which accounts for only a small fraction of the training time. The smoothed softmax cross-entropy loss reduces the impact of noisy or very similar negative pairs on the learned representation. This is verified later in our ablation study, where smoothing the labels of several hardest classes improves the transfer performance.
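For illustration, Eqs. (2) and (3) can be sketched on a single GPU as follows; the dense N × N similarity is shown for clarity only (at million-way scale it is computed in chunks over weight shards), and the default α and τ values are placeholders.

```python
# A single-GPU sketch of Eqs. (2)-(3); alpha and tau defaults are assumed.
import torch
import torch.nn.functional as F

def topk_hardest(W, K):
    """Once per epoch: for each instance i, its K most similar other classes
    H_i^- (dense N x N shown for clarity; chunked at full scale)."""
    Wn = F.normalize(W, dim=1)
    sim = Wn @ Wn.t()                               # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))               # exclude j == i
    return sim.topk(K, dim=1).indices               # (N, K)

def smoothed_instance_loss(W, x, ids, hard, alpha=0.2, tau=0.15):
    """W: (N, D) classifier weights; x: (B, D) features; ids: (B,) targets;
    hard: (N, K) indices returned by topk_hardest."""
    K = hard.size(1)
    logits = F.normalize(x, dim=1) @ F.normalize(W, dim=1).t() / tau  # (B, N)
    # Eq. (2): soft targets with 1 - alpha on the positive class and
    # alpha / K spread over the K hardest negatives.
    y = torch.zeros_like(logits)
    y[torch.arange(len(ids)), ids] = 1.0 - alpha
    y.scatter_(1, hard[ids], alpha / K)
    # Eq. (3): log of (sum_j y_ij exp(l_ij) / sum_j exp(l_ij)), computed
    # stably in log space; entries with y_ij = 0 become -inf and vanish.
    log_y = torch.where(y > 0, y.log(), torch.full_like(y, float("-inf")))
    log_num = torch.logsumexp(logits + log_y, dim=1)
    log_den = torch.logsumexp(logits, dim=1)
    return -(log_num - log_den).mean()
```

Since the top-K search runs once per epoch, the extra cost is a single similarity pass over the classifier weights rather than a per-iteration overhead.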
Experiments

Experiment Settings
Training datasets.
Unless specified otherwise, we use ImageNet-1K to train our unsupervised model. ImageNet-1K consists of around 1.28 million images belonging to 1000 classes. We treat every image instance (along with its various transformed views) in the dataset as a unique class, and train a 1.28-million-way instance classifier as a pretext task to learn visual representations.
Evaluation datasets.
The learned visual representations are evaluated in three ways. First, under the linear evaluation protocol of ImageNet-1K, we fix the representation model and learn a linear classifier on top of it; the top-1/top-5 classification accuracies are used to compare different unsupervised methods. Second, we evaluate the semi-supervised
learning performance on ImageNet-1K, where methods must classify images in the val set when only a small fraction (i.e., 1% or 10%) of manual labels is provided in the train set. Third, we evaluate transfer performance by finetuning the representations on several downstream tasks and measuring the performance gains. In our experiments, the downstream tasks include Pascal-VOC object detection (Everingham et al. 2010), iNaturalist18 fine-grained image classification (Van Horn et al. 2018), and many others.

[Table 6: State-of-the-art comparison of linear classification accuracy of unsupervised methods on ImageNet-1K (Method / Top-1 / Top-5); rows include supervised pretraining and prior unsupervised methods, with Ours at 71.4 / 90.3.]
Implementation details.
We use ResNet-50 (He et al. 2016) as the backbone in all our experiments. We train our model using the SGD optimizer, with weight decay and momentum set to 0.0001 and 0.9, respectively. The initial learning rate (lr) is set to 0.48 and decays under the cosine annealing scheduler. In addition, we use 10 epochs of linear lr warmup to stabilize training. The minibatch size is 4096 and the feature dimension is D = 128. We set the temperature in Eq. (1) as τ = 0. , and the smoothing factor in Eq. (3) as α = 0. . For fair comparison, following the practices of recent works (Chen et al. 2020a; Cao et al. 2020), we feed two augmented views per instance for training. All experiments are conducted on 64 V100 GPUs with 32GB memory.
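Assembled in PyTorch, this optimization setup reads roughly as follows; the scheduler composition and the warmup start factor are illustrative choices, not our exact code.

```python
# A sketch of the optimizer and lr schedule described above (assumed details:
# warmup start factor, per-epoch scheduler stepping).
import torch

def build_optimizer(model, epochs=200, warmup_epochs=10, base_lr=0.48):
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=1e-4)
    warmup = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=0.01, total_iters=warmup_epochs)   # linear lr warmup
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs - warmup_epochs)                   # cosine decay
    sched = torch.optim.lr_scheduler.SequentialLR(
        opt, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    return opt, sched  # call sched.step() once per epoch
```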
Ablation Study

This section validates several modeling and configuration options for our method. We compare representation quality under the ImageNet linear protocol, evaluated on the val set. In each experiment, the linear classifier is trained with a batch size of 2048 and an lr of 40 that decays during training under the cosine annealing rule.
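In code, this protocol amounts to freezing the backbone and fitting a single linear layer; a sketch with assumed names (the epoch count is a placeholder, and the 2048-dimensional input matches pooled ResNet-50 features).

```python
# Sketch of the linear evaluation setup: freeze the pretrained backbone and
# train only a linear classifier on top (names and `epochs` are illustrative).
import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim=2048, num_classes=1000, epochs=100):
    backbone.eval()                              # representation stays fixed
    for p in backbone.parameters():
        p.requires_grad_(False)
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=40.0, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return head, opt, sched
```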
Ablation: effectiveness of components. Table 2 shows the linear evaluation performance using different combinations of components in our method, including a two-layer MLP head, a contrastive prior, and label smoothing. Accuracies are measured after 200 epochs of training. We find that all three components bring performance gains, improving the top-1 accuracy of our method from 58.6% to a competitive 68.2%.

Table 7: Comparison of semi-supervised learning accuracy (top-5, with 1% / 10% of labels) of both label-propagation and representation-learning methods on ImageNet-1K.

Method                                     1%     10%
Supervised                                 -      -
Label propagation:
  PseudoLabels (Zhai et al. 2019)          51.6   82.4
  VAT + Entropy Min. (Miyato et al. 2018)  47.0   83.4
  UDA (Xie et al. 2019)                    -      88.5
  FixMatch (Sohn et al. 2020)              -      89.1
Representation learning:
  InstDisc. (Dosovitskiy et al. 2014)      39.2   77.4
  PIRL (Misra and Maaten 2020)             57.2   83.8
  PCL (Li et al. 2020)                     75.6   86.2
  SimCLR (Chen et al. 2020a)               75.5   87.8
  PIC (Cao et al. 2020)                    77.1   88.7
  Ours                                     81.8   89.2
We also observe that a vanilla instance classification model can already achieve a top-1 accuracy of 67.3%, suggesting that full instance classification is a very strong baseline for unsupervised representation learning. The contrastive prior and label smoothing on the top-K hardest classes further boost the linear evaluation accuracy by around 1%.

Ablation: full instance classification vs. sampled instance classification.
Table 3 compares linear classification accuracies using representations learned by full instance classification and by sampled instance classification over a range of sampling sizes. Note that we remove the contrastive prior and label smoothing in these experiments and analyze only the impact of class sampling. We observe that full instance classification clearly outperforms sampled instance classification, by a margin of 1.8%, verifying the benefit of exploring the complete set of negative instances.

Ablation: a contrastive prior vs. random initialization.
Table 4 compares the linear evaluation performance of our method using Gaussian random initialization versus a contrastive prior for the classifier, with the training length increased from 10 to 400 epochs. We observe that the contrastive prior significantly speeds up convergence compared to random initialization, especially in early epochs (i.e., epochs 10, 25, and 50). Moreover, the contrastive-prior variant consistently outperforms its randomly initialized counterpart, showing the robustness of our initialization mechanism.
Ablation: label smoothing on hardest classes.
Table 5 shows the impact of label smoothing on representation learning. We vary the number K of hardest negative classes considered and the smoothing factor α; a no-smoothing baseline with K = 0 and α = 0 is also included for comparison. We observe that smoothing the labels of a few hardest classes improves linear evaluation performance over the non-smoothing baseline in most hyper-parameter settings. The best accuracy can be obtained with K = 100 and α = 0. , where a 0.6% gain in top-1 accuracy is achieved.

Table 8: Comparison of transfer performance on PASCAL VOC object detection.

Method                  AP     AP50   AP75
Supervised              -      -      -
PIC (Cao et al. 2020)   57.1   82.4   63.4
Ours                    -      -      -

Table 9: Comparison of transfer performance on iNaturalist fine-grained classification.

Method       Top-1   Top-5
Scratch      -       -
Supervised   -       -
Ours         66.2    86.2
Comparison with Previous Results
ImageNet linear evaluation.
Table 6 compares our work with previous unsupervised visual representation learning methods under the ImageNet linear evaluation protocol. Following recent practice (Chen et al. 2020a,c), we train for a longer schedule of 1000 epochs. The proposed unsupervised learning framework achieves a top-1 accuracy of 71.4% on ImageNet-1K, outperforming SimCLR (+2.1%), PIC (+0.6%), and MoCoV2 (+0.3%). The results verify that a simple full-instance-classification framework can learn very competitive visual representations. The performance gains can partly be attributed to the ability to explore the full set of negative instances at large scale, which is not supported by previous unsupervised frameworks (Chen et al. 2020a,c; Cao et al. 2020).
Semi-supervised learning.
Following (Kolesnikov, Zhai, and Beyer 2019; Chen et al. 2020a), we sample a 1% or 10% fraction of labeled data from ImageNet and train a classifier starting from our unsupervised pretrained model to evaluate semi-supervised learning performance. For 1% labels, we train the backbone with an lr of 0.001 and the classifier with an lr of 15. For 10% labels, the lrs for the backbone and the classifier are set to 0.001 and 10, respectively (Li et al. 2020). Table 7 compares our work with both representation-learning and label-propagation methods. We obtain a top-5 accuracy of 81.8% when only 1% of labels are used, outperforming all previous methods by a non-negligible margin (+4.7%). We also achieve the best result when 10% of labels are provided, surpassing SimCLR and PIC by 1.4% and 0.5%, respectively. The results suggest the strong discriminative ability of our learned representation.
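In code, the two learning rates are simply separate parameter groups; a sketch with assumed names (the momentum value is an assumption, as only the learning rates are specified above).

```python
# Sketch of the semi-supervised finetuning optimizer: a small lr for the
# pretrained backbone and a large one for the fresh classifier.
import torch

def build_semisup_optimizer(backbone, classifier, label_fraction=0.10):
    head_lr = 10.0 if label_fraction >= 0.10 else 15.0  # per the settings above
    return torch.optim.SGD([
        {"params": backbone.parameters(), "lr": 0.001},
        {"params": classifier.parameters(), "lr": head_lr},
    ], momentum=0.9)  # momentum assumed
```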
Transfer learning. To further evaluate the learned representation, we apply the pretrained model to several downstream visual tasks (including detection, fine-grained classification, and others) and measure the transfer performance.
PASCAL VOC Object Detection:
Following (He et al. 2020), we use Faster R-CNN (Ren et al. 2015) with a ResNet-50 backbone as the object detector. We initialize ResNet-50 with our pretrained weights and finetune all layers end-to-end on the trainval07+12 set of the PASCAL VOC dataset (Everingham et al. 2010). We adopt the same experimental settings as MoCoV2 (Chen et al. 2020c). The AP (average precision), AP50, and AP75 scores on the test2007 set are used as indicators. Table 8 shows the results. Our transfer performance is significantly better than the supervised pretraining counterpart (+3.7% in AP) and is competitive with state-of-the-art unsupervised learning methods.

Table 10: Transfer performance of different pretrained models on more downstream visual tasks.

Method       CIFAR10   CIFAR100   SUN397   DTD
Scratch      -         -          -        -
Supervised   -         -          -        -
Ours         97.8      86.2       64.2     77.6

iNaturalist fine-grained classification:
We finetune the pretrained model end-to-end on the train set of the iNaturalist 2018 dataset (Van Horn et al. 2018) and evaluate the top-1 and top-5 classification accuracies on the val set. Results are shown in Table 9. Our method is closely competitive with the ImageNet supervised pretraining counterpart as well as previous state-of-the-art unsupervised methods, indicating the discriminative ability of our pretrained representation for fine-grained classification.
More downstream tasks:
Table 10 shows transfer results on more downstream tasks, including image classification on CIFAR10, CIFAR100 (Krizhevsky, Hinton et al. 2009), SUN397 (Xiao et al. 2010), and DTD (Cimpoi et al. 2014). In summary, our method performs competitively with ImageNet supervised pretraining as well as state-of-the-art unsupervised pretraining.
Conclusion
In this work, we present an unsupervised visual representation learning framework whose pretext task is to distinguish all instances in a dataset with a parametric classifier. The task is similar to supervised semantic classification, but with a much larger number of classes (equal to the dataset size) and finer granularity. We first introduce a hybrid parallel training framework to make large-scale instance classification feasible, which significantly reduces GPU memory overhead and speeds up training in our experiments. Second, we propose to improve convergence by introducing a contrastive prior to the instance classifier, achieved by initializing the classification weights with raw instance features extracted by a fixed random network with running BNs. We show in our experiments that this simple strategy clearly speeds up convergence and improves transfer performance. Finally, to reduce the impact of noisy negative instance pairs, we propose to smooth the labels of a few hardest classes. Extensive experiments on ImageNet classification, semi-supervised classification, and many downstream tasks show that our simple unsupervised representation learning method performs comparably to or better than state-of-the-art unsupervised methods.

References
Asano, Y. M.; Rupprecht, C.; and Vedaldi, A. 2020. Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations (ICLR).
Cao, Y.; Xie, Z.; Liu, B.; Lin, Y.; Zhang, Z.; and Hu, H. 2020. Parametric Instance Classification for Unsupervised Visual Feature Learning. arXiv preprint arXiv:2006.14618.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 132–149.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709.
Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; and Hinton, G. 2020b. Big Self-Supervised Models are Strong Semi-Supervised Learners. arXiv preprint arXiv:2006.10029.
Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020c. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing Textures in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422–1430.
Doersch, C.; and Zisserman, A. 2017. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, 2051–2060.
Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
Donahue, J.; and Simonyan, K. 2019. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, 10542–10552.
Dosovitskiy, A.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2014. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, 766–774.
Dosovitskiy, A.; Fischer, P.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision.
Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hénaff, O. J.; Srinivas, A.; De Fauw, J.; Razavi, A.; Doersch, C.; Eslami, S.; and Oord, A. v. d. 2019. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
Jenni, S.; and Favaro, P. 2018. Self-supervised feature learning by learning to spot artifacts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2733–2742.
Kolesnikov, A.; Zhai, X.; and Beyer, L. 2019. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1920–1929.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Li, J.; Zhou, P.; Xiong, C.; Socher, R.; and Hoi, S. C. 2020. Prototypical Contrastive Learning of Unsupervised Representations. arXiv preprint arXiv:2005.04966.
Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6707–6717.
Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Noroozi, M.; and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 69–84. Springer.
Noroozi, M.; Vinjimoor, A.; Favaro, P.; and Pirsiavash, H. 2018. Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9359–9367.
Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2536–2544.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, 91–99.
Sohn, K.; Berthelot, D.; Li, C.-L.; Zhang, Z.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Zhang, H.; and Raffel, C. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
Song, L.; Pan, P.; Zhao, K.; Yang, H.; Chen, Y.; Zhang, Y.; Xu, Y.; and Jin, R. 2020. Large-Scale Training System for 100-Million Classification at Alibaba. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2909–2930.
Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive Multiview Coding. arXiv preprint arXiv:1906.05849.
Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. 2018. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8769–8778.
Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742.
Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3485–3492. IEEE.
Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, 1476–1485.
Zhang, L.; Qi, G.-J.; Wang, L.; and Luo, J. 2019. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2547–2555.
Zhang, R.; Isola, P.; and Efros, A. A. 2016. Colorful image colorization. In European Conference on Computer Vision, 649–666. Springer.
Zhuang, C.; Zhai, A. L.; and Yamins, D. 2019. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, 6002–6012.