Improving memory banks for unsupervised learning with large mini-batch, consistency and hard negative mining
Adrian Bulat, Enrique Sánchez-Lozano, Georgios Tzimiropoulos
Samsung AI Cambridge, Cambridge, UK
ABSTRACT
An important component of unsupervised learning by instance-based discrimination is a memory bank for storing a feature representation for each training sample in the dataset. In this paper, we introduce 3 improvements to the vanilla memory bank-based formulation which bring massive accuracy gains: (a) Large mini-batch: we pull multiple augmentations for each sample within the same batch and show that this leads to better models and enhanced memory bank updates. (b) Consistency: we enforce the logits obtained by different augmentations of the same sample to be close without trying to enforce discrimination with respect to negative samples as proposed by previous approaches. (c) Hard negative mining: since instance discrimination is not meaningful for samples that are too visually similar, we devise a novel nearest neighbour approach for improving the memory bank that gradually merges extremely similar data samples that were previously forced to be apart by the instance level classification loss. Overall, our approach greatly improves the vanilla memory-bank based instance discrimination and outperforms all existing methods for both seen and unseen testing categories with cosine similarity.
Index Terms — unsupervised learning, contrastive loss, memory banks
1. INTRODUCTION
Supervised learning with Deep Neural Networks has been the de facto approach for feature learning in Computer Vision over the last decade. Recently, there has been a surge of interest in learning features in an unsupervised manner. This has the advantage of learning from massive amounts of unlabelled/uncurated data for feature extraction and network pre-training, and is envisaged to surpass the standard approach of transfer learning from ImageNet or other large labelled datasets.

The approach we describe in this paper builds upon the widely-used framework of contrastive learning [1, 2, 3, 4, 5, 6, 7], which utilizes a contrastive loss to maximize the similarity between the representations of two different instances of the same training sample while simultaneously minimizing the similarity with the representations computed from different samples. A key point for contrastive learning is the availability of a large number of negative samples for computing the contrastive loss, which are stored in a memory bank. Since the memory bank is updated rarely, this is believed to hamper training stability, hence recent methods, like [1, 3, 8], advocate online learning without a memory bank.

For this reason, the method of [1] advocates an online approach by defining the positive pair from two differently augmented versions of the same training sample and considers as negatives all other pairs from the same batch, eliminating the memory bank. As opposed to [1], we show how to train a powerful network in an unsupervised manner relying on a memory bank-based training approach. Momentum Contrast [3] maintains and updates a separate encoder for the negative samples rather than storing a memory bank, in a fashion similar to the "mean teacher" [9]. More recently, SimCLR [8] emphasized the importance of composite augmentations, large batch sizes, bigger models and the use of a nonlinear projection head. They suggested that a large minibatch can replace a memory bank. In contrast, our approach employs a memory bank for contrastive learning.

Our main contribution is to show how to massively improve the vanilla memory bank approach of [2] by introducing minimal changes. We explore 2 key ideas:
(1) What is the effect of larger batch sizes on contrastive learning with a memory bank? Concurrent work [8] has advocated the use of a large batch size for online training, i.e. without a memory bank, as it increases the number of negative pairs. We show that a large batch size is also effective for contrastive learning with a memory bank (hence decoupling its positive effect from the number of negative pairs), which identifies a connection with gradient smoothing and improved memory bank updates. Furthermore, we show that if a larger mini-batch is constructed so that a set of K augmentations for each instance are used, additional consistency between the instance augmentations can be enforced to further enhance training. (2) Is contrastive learning effective when instances are too visually similar? Intuitively, instance discrimination is not meaningful for such cases. We show that if these samples are "merged" into the memory bank, a much more powerful network can be trained.

When reproducing the evaluation protocol of [1], we report improvements over [2] of up to ∼ on CIFAR-10 and of up to ∼ on STL-10. Furthermore, with these improvements, our method surpasses [1] and [10] by ∼ on CIFAR-10 and by ∼ on STL-10, setting a new state-of-the-art for these datasets.

Fig. 1: Overall training process. Each instance within the batch is augmented K times and passed as input to the network, producing N × K embeddings. The final scores are produced by taking the inner product between the feature embedding f_i and the representations stored in the memory bank M.

Overall, we make the following contributions:
1. We propose a large mini-batch for memory-bank based contrastive learning by pulling, for each sample, a set of K augmentations within the same batch. We show that this approach leads to stronger networks and improves the memory bank representation (Section 2.2).
2. By having a set of K augmentations at our disposal, we also propose a simple consistency loss which enforces the logits obtained by different augmentations of the same sample to be close enough. Notably, this is achieved without trying to enforce discrimination with respect to the negative samples as proposed by previous approaches (Section 2.3).
3. We observe that instance discrimination is not meaningful for samples that are too visually similar. Hence, we propose a hard negative mining approach for improving the memory bank that gradually merges extremely visually similar data samples that were previously forced to be apart by the instance level classification loss (Section 2.4).
2. METHOD

2.1. Background
Given a set of n unlabelled images x_1, x_2, ..., x_n, our goal is to learn a mapping Φ(x, θ) from the data to a d-dimensional feature embedding f(x_i) = Φ(x_i, θ) ∈ R^d. Typically Φ is a neural network and θ its parameters. Throughout the paper we will simply refer to the feature embedding of the i-th sample as f_i and assume that ‖f_i‖ = 1. Following [11, 2], our pretext task will consist in distinguishing the i-th instance from the rest of the samples present in the dataset (i.e. each data sample will be treated as a separate class). The training objective is thus formulated as minimizing the negative log-likelihood over all instances of the training set:

\mathcal{L}_{CE} = -\log \prod_{i=1}^{n} P(i \mid x_i) = -\sum_{i=1}^{n} \log \frac{e^{\hat{f}_i^T f_i / \tau}}{\sum_{j=1}^{n} e^{\hat{f}_j^T f_i / \tau}}   (1)

where f̂_j is a negative sample coming from within the batch [1] or from a memory bank [2], and τ is a temperature parameter that controls the concentration of the parameters [2].
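To make the notation of Eq. (1) concrete, a minimal PyTorch-style sketch of the memory-bank instance-discrimination loss is given below. It is only illustrative: it uses the full softmax over all n memory-bank entries (the implementation of [2] also offers an NCE approximation), and names such as instance_discrimination_loss, memory_bank and the default temperature value are our own choices, not settings taken from the paper.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(features, indices, memory_bank, tau=0.07):
    """Eq. (1): every training sample is treated as its own class.

    features:    (B, d) L2-normalised batch embeddings f_i
    indices:     (B,)   dataset indices i of the batch samples
    memory_bank: (n, d) L2-normalised stored representations (one row per sample)
    """
    # Score each batch embedding against every memory-bank entry: (B, n).
    logits = features @ memory_bank.t() / tau
    # The "correct class" of sample i is its own slot i in the memory bank,
    # so the negative log-likelihood reduces to a standard cross-entropy.
    return F.cross_entropy(logits, indices)
```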
2.2. Large mini-batch

In contrastive learning, a large mini-batch can be motivated for the case of online learning (no memory bank is used) for increasing the number of negative samples. However, for the case of contrastive learning with a memory bank, the number of negative samples is fixed and independent of the batch size. We make the observation that for the memory-bank case a large mini-batch is useful because it results in more frequent updates for a given feature f̂_i inside the memory bank. For example, if the batch size is doubled then f̂_i will be updated twice as frequently. As already mentioned in [2], a memory-bank approach comes at the cost of a large oscillation during training due to inconsistencies caused by updating the feature representations for different samples at very different time instances. Hence, more frequent updates of the memory bank, offered by a larger mini-batch, can help stabilize training. We consider increasing the batch size by an expansion factor of K. There are two ways to achieve this. The standard way is to just increase the number of samples at each iteration. All samples, in this case, are different to each other. Table 1 shows the results obtained by training a network with contrastive learning for K = 1, 2, 4 on CIFAR-10 using the kNN evaluation protocol. Clearly, a large batch size results in much higher accuracy, showcasing its benefit in contrastive learning.

Table 1: Top-1 (%) accuracy on CIFAR-10 for different ways of increasing the batch size, for expansion factors K = 1, 2, 4 (Standard expansion: 80.6, 84.7, 86.6).

The second way to increase the batch size we explore in this work is by using multiple, in particular K, augmentations per sample within the same batch. Specifically, for every input sample x_i from the batch we propose to construct a series of K perturbed copies x_i^{(0)}, x_i^{(1)}, ..., x_i^{(k)}, ..., x_i^{(K-1)} using a randomly composed set of augmentations T_k. As such, the loss from one batch B (with size |B|) becomes:

\mathcal{L}_{CE} = -\sum_{i=1}^{|\mathcal{B}|} \sum_{k=1}^{K} \log \frac{e^{\hat{f}_i^T f_i^{(k)} / \tau}}{\sum_{j=1}^{n} e^{\hat{f}_j^T f_i^{(k)} / \tau}},   (2)

where x_i^{(k)} = T_k(x_i) is the k-th augmented copy of image x_i transformed using a randomly selected set of chained augmentation operators T (i.e. flipping, color jittering etc.) and f_i^{(k)} is the corresponding embedding produced by passing the sample x_i^{(k)} through the network. This is illustrated in Fig. 1, where different shades of the same color represent different augmentations of the same instance. This second way is primarily motivated by being able to enforce the consistency loss described in the next section. The results, shown in Table 1, confirm that by applying the proposed way even higher accuracies can be achieved.

We note that by increasing the batch size in the proposed way (i.e. using multiple augmentations) by K, the feature f̂_i is actually updated after the same number of iterations regardless of the value of K, which corresponds to the same number of iterations as when not increasing the batch size. To overcome this issue, we propose the feature f̂_i to be updated by aggregating the features produced by the K augmented versions of x_i: \hat{f}_i = m \hat{f}_i + (1-m) \frac{1}{K} \sum_{k=1}^{K} f_i^{(k)} (see Fig. 2a). The latter observation allows us to further study where the accuracy improvement in Table 1 comes from.

Table 2: Top-1 (%) accuracy on CIFAR-10 vs. number of features used for updating the memory bank (1 feature: 83.9; all K features: 85.3).

Fig. 2: Proposed memory bank update mechanisms. (a) The proposed memory bank update rule shown for a given instance i. (b) Offline hard mining strategy: samples with large cosine similarity are merged together.
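The following sketch illustrates, under our own naming conventions, how a batch with K augmentations per sample could be formed and how the aggregated memory-bank update f̂_i ← m f̂_i + (1 − m)(1/K) Σ_k f_i^{(k)} could be applied. The dataset wrapper, the default momentum m = 0.5 and the temperature value are assumptions made for illustration, not settings reported in the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

class MultiAugmentDataset(Dataset):
    """Returns K independently augmented views of each image (illustrative)."""
    def __init__(self, images, transform, k=2):
        self.images, self.transform, self.k = images, transform, k

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        views = torch.stack([self.transform(self.images[idx]) for _ in range(self.k)])
        return views, idx  # views: (K, C, H, W)

def batch_loss_and_update(model, memory_bank, views, indices, tau=0.07, m=0.5):
    """Eq. (2) over a batch of K views per instance, plus the aggregated update."""
    b, k = views.shape[:2]
    x = views.flatten(0, 1)                    # (B*K, C, H, W)
    f = F.normalize(model(x), dim=1)           # (B*K, d) embeddings f_i^(k)
    logits = f @ memory_bank.t() / tau         # (B*K, n) scores vs. memory bank
    targets = indices.repeat_interleave(k)     # every view keeps its instance label
    loss = F.cross_entropy(logits, targets)    # Eq. (2)

    # Aggregated memory-bank update: average the K views of each instance (Fig. 2a).
    with torch.no_grad():
        f_mean = f.view(b, k, -1).mean(dim=1)              # (B, d)
        new = m * memory_bank[indices] + (1 - m) * f_mean
        memory_bank[indices] = F.normalize(new, dim=1)
    return loss, logits.view(b, k, -1)         # per-view logits reused in Section 2.3
```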
To this end, we further study the case of using K augmentations to calculate the loss of Eq. (2) but updating the memory bank only once (equivalent to using K = 1). The results for this case, for K = 2, 4, are shown in Table 2. Interestingly, we observe a significant accuracy improvement over the baseline (no augmentation). Since the memory bank is updated in the same way as for the case K = 1, we conjecture that this accuracy improvement is coming from the smoothed gradients due to the use of the large batch size. When measured, the (average) cosine distance between the memory bank representations at adjacent epochs becomes smaller as K increases. Overall, we conclude that a large batch size helps improve both network training and updating the memory bank.

2.3. Consistency

With the introduction in the previous subsection of multiple instantiations x_i^{(0)}, x_i^{(1)}, ..., x_i^{(K-1)} of the same sample within the batch, generated by applying a different set of randomly selected transformations T_k, herein we propose to explicitly enforce a consistent representation between the augmented representations of the same image. A similar idea has been explored for the case of semi-supervised learning [12, 13, 14]; however, to our knowledge, in the context of contrastive learning, this has not been explored before. Notably, this consistency is enforced without trying to enforce discrimination with respect to the negative samples, as proposed by recent contrastive approaches [1, 2, 3, 4, 5, 6, 7]. More specifically, given the set of logits produced by each of the K augmented copies of i, we define our consistency loss as follows:

\mathcal{L}_{cons} = \sum_{k=1}^{K} \sum_{j \neq k} \mathrm{KL}\left( P(i \mid x_i^{(k)}) \,\|\, P(i \mid x_i^{(j)}) \right)   (3)

Fig. 3: KL consistency loss applied between the logits produced by the K augmented samples of a given sample i.

Note that the proposed loss term performs a dense correspondence matching (i.e. every possible pair formed using the K augmented samples is considered). This is illustrated in Fig. 3. For completeness, we also evaluated an ℓ2 loss for enforcing consistency. As the results in Table 3 show, the proposed consistency loss offers noticeable improvements over the vanilla training process and over the ℓ2 form of regularization applied directly on the feature embeddings.

Table 3: Top-1 (%) accuracy on CIFAR-10 obtained using kNN for different methods of consistency regularization for K = 2 augmentations (ℓ2: 85.0; KL, Eq. (3): 85.6).
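A possible implementation of the dense pairwise consistency term of Eq. (3) is sketched below; it operates on the (B, K, n) per-view logits returned by the previous sketch, and its symmetric double sum over ordered pairs j ≠ k mirrors the equation. The function name and the use of log_target in F.kl_div are our choices; during training, this term would be added to Eq. (2) weighted by the consistency factor β.

```python
import torch.nn.functional as F

def consistency_loss(logits):
    """Eq. (3): KL divergence between the instance-classification distributions
    produced by every ordered pair of the K augmented views.

    logits: (B, K, n) scores of each view against the memory bank.
    """
    b, k, n = logits.shape
    log_p = F.log_softmax(logits, dim=-1)      # log P(i | x_i^(k)) per view
    loss = logits.new_zeros(())
    for a in range(k):
        for c in range(k):
            if a == c:
                continue
            # KL( P(. | x^(a)) || P(. | x^(c)) ), averaged over the batch
            loss = loss + F.kl_div(log_p[:, c], log_p[:, a],
                                   reduction='batchmean', log_target=True)
    return loss
```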
2.4. Hard negative mining

Unsupervised learning with instance discrimination assumes that each sample within the dataset forms a unique class. An obvious limitation of this approach is that near-identical or very similar samples are artificially forced to be apart in the embedding space. To alleviate this, we propose an offline kNN-based strategy that merges similar instances into a single class. As opposed to the deep clustering approach from [15], we do not seek to construct large clusters in an online manner via K-means, nor to replace the instance-level discrimination task; instead, during an offline grouping stage, for each memory bank feature representation, we compute its nearest neighbours and then group the ones located in its immediate σ vicinity (see Fig. 2b). This process is reminiscent of hard negative mining, with the difference being that after the hard negative samples are identified they are treated as positives. Once the selected instances are merged together, they will have a common representation and share the same location inside the memory bank. Similarly, during training, for the grouped instances, instead of using K augmentations of the same image, we uniformly sample and augment images located within the same group.

By using a small σ, the large majority of the samples after the grouping stage remain ungrouped (only 5-10% of samples are grouped). As such, the effect of the proposed approach is to remove very similar samples from being forced to produce different features. Our proposed conservative hard mining strategy is run in an offline manner near the end of the training, each time grouping the most similar samples by means of measuring their cosine distance. Firstly, we notice that the gains flatten out after the algorithm is run twice (i.e. denoted as stages in the tables). Secondly, while the method offers improvements even when the model is retrained from scratch using the computed assignments, we find the gains are significantly larger if we continue training from the current checkpoint. Table 4 summarizes results showcasing the large impact of our hard negative mining approach.

Table 4: Top-1 (%) accuracy on CIFAR-10 using kNN for different stages and training strategies (scratch: 86.5 and 86.7 for stages 1 and 2; resume: 88.4 for stage 1).

Table 5: Top-1 (%) acc. on CIFAR-10 obtained using kNN.
Method                     kNN
Random CNN                 32.1
DeepCluster (1000) [15]    67.6
Exemplar [11]              74.5
NPSoftmax [2]              80.8
NCE [2]                    80.4
Triplet [1]                57.5
Triplet (Hard) [1]         78.4
Invariant Instance [1]     83.6
Ours                       89.5
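The offline grouping stage can be sketched as follows, reading the σ vicinity as a threshold on cosine similarity between memory-bank entries and using a union-find structure for transitive merging; the threshold value, the function name and the brute-force n × n similarity computation (which would be chunked or replaced by an approximate nearest-neighbour search for large datasets) are illustrative assumptions, not details released with the paper.

```python
import torch

def group_similar_instances(memory_bank, sigma=0.95):
    """Offline hard-negative merging: instances whose stored representations are
    within a small vicinity (cosine similarity > sigma) share one group id.

    memory_bank: (n, d) L2-normalised representations (a detached buffer).
    Returns a (n,) tensor of group ids; most samples stay in singleton groups.
    """
    n = memory_bank.shape[0]
    group = torch.arange(n)

    def find(i):  # union-find root lookup with path compression
        while group[i].item() != i:
            group[i] = group[group[i]]
            i = group[i].item()
        return i

    sims = memory_bank @ memory_bank.t()       # (n, n) cosine similarities
    sims.fill_diagonal_(-1.0)                  # ignore self-matches
    src, dst = (sims > sigma).nonzero(as_tuple=True)
    for i, j in zip(src.tolist(), dst.tolist()):
        ri, rj = find(i), find(j)
        if ri != rj:                           # merge very similar instances
            group[max(ri, rj)] = min(ri, rj)
    return torch.tensor([find(i) for i in range(n)])
```

Samples sharing a group id would then share a single memory-bank slot, and their K views would be drawn from within the group, as described above.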
3. EXPERIMENTS
We report results for two popular settings: seen testing categories (testing and training are performed on images that contain mutual categories) and unseen testing categories (training and testing categories are disjoint). All methods were implemented using PyTorch [16].
Seen Testing Categories.
Following [2, 1], the experiments are performed on the CIFAR-10 [17] and STL-10 [18] datasets under the same settings. In particular, we use a ResNet18 [19] as a feature extractor, setting the output embedding size to 128. As per [1], the network is trained for 300 epochs using a starting learning rate of . , which is then dropped by . at epochs 80, 140 and 200. The network is optimized using SGD with momentum ( = 0 . ) and a weight decay of e − . During training, each input sample is randomly augmented using a combination of the following transformations: random resize and crop, random grayscale, random mirroring and color jittering. The temperature τ is set to . , the memory bank momentum to . , and the consistency regularization factor to β = 10. Following [2], we adhere to the linear and kNN evaluation protocols. As Tables 5 and 6 show, our method surpasses other methods, including our direct baseline, the method of [2], by a significant margin.
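Since the kNN protocol of [2] is used throughout the tables, a simplified stand-in is sketched below: a plain cosine-similarity kNN vote on frozen embeddings. The neighbourhood size and the weighting scheme of the actual protocol follow [2] and are not restated in this paper, so the k = 200 default and the unweighted majority vote here are assumptions.

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=200):
    """Plain cosine-similarity kNN on frozen, L2-normalised embeddings
    (a simplified stand-in for the weighted kNN protocol of [2])."""
    sims = test_feats @ train_feats.t()            # (n_test, n_train)
    _, topk_idx = sims.topk(k, dim=1)              # nearest training samples
    topk_labels = train_labels[topk_idx]           # (n_test, k)
    return torch.mode(topk_labels, dim=1).values   # unweighted majority vote

# accuracy = (knn_classify(tr_f, tr_y, te_f) == te_y).float().mean()
```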
Unseen Testing Categories. Following Song et al. [20], we report results by training a ResNet-18 model on unseen categories of the Stanford Online Products [20] dataset. The images corresponding to the first half of the categories are used for training, in an unsupervised manner, without using their labels, while testing is done on images belonging to unseen categories. We closely align our setting and training details with [1, 21]: we report results in terms of the clustering quality and NN retrieval performance. We denote by R@k the probability of any correct match occurring among the top-k retrieved results [20]. NMI, the second reported metric, measures the quality of the clustering. As Table 7 shows, our method improves over the state-of-the-art by almost 4% in terms of R@1, and over our baseline from [2] by 9%.

Table 6: Top-1 (%) acc. on STL-10 using a linear and kNN classifier.

Table 7: Results (%) on the Stanford Online Products dataset.
Method                   R@1    R@10   R@100   NMI
Exemplar [11]            31.5   46.7   64.2    82.9
NCE [2]                  34.4   49.0   65.2    84.1
MOM [24]                 16.3   27.6   44.5    80.6
Invariant Instance [1]   39.7   54.9   71.0    84.7
Ours                     43.6   57.5   71.8    85.3
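For reference, the retrieval metric R@k described above can be computed along the following lines (a sketch; it assumes L2-normalised embeddings and excludes each query from its own ranking, which is the usual convention rather than a detail stated in the paper).

```python
import torch

def recall_at_k(embeddings, labels, ks=(1, 10, 100)):
    """R@k: fraction of queries whose top-k retrieved neighbours
    (excluding the query itself) contain at least one same-class sample."""
    sims = embeddings @ embeddings.t()             # (n, n) cosine similarities
    sims.fill_diagonal_(float('-inf'))             # drop self-matches
    _, nn_idx = sims.topk(max(ks), dim=1)          # (n, max_k) neighbour indices
    hits = labels[nn_idx] == labels.unsqueeze(1)   # (n, max_k) correct matches
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```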
4. CONCLUSION
We described three simple yet powerful ways to improve unsupervised contrastive learning with a memory bank. Firstly, we proposed a large mini-batch with multiple instance augmentations, providing smoother gradients that improve network training and increase the quality of the features stored in the memory bank. Secondly, we introduced a simple, yet effective, intra-instance consistency loss that encourages the distribution of each augmented sample to match that of the remaining augmentations. Finally, we presented our hard mining strategy that attempts to overcome one of the problems of unsupervised instance discrimination: that of trying to push apart near-identical images. We exhaustively evaluated the proposed improvements, reporting large accuracy improvements.

5. REFERENCES

[1] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang, "Unsupervised embedding learning via invariant and spreading instance feature," in CVPR, 2019.
[2] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin, "Unsupervised feature learning via non-parametric instance discrimination," in CVPR, 2018.
[3] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, "Momentum contrast for unsupervised visual representation learning," arXiv, 2019.
[4] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," arXiv, 2018.
[5] Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord, "Data-efficient image recognition with contrastive predictive coding," arXiv, 2019.
[6] Yonglong Tian, Dilip Krishnan, and Phillip Isola, "Contrastive multiview coding," arXiv, 2019.
[7] Philip Bachman, R Devon Hjelm, and William Buchwalter, "Learning representations by maximizing mutual information across views," in NeurIPS, 2019.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, "A simple framework for contrastive learning of visual representations," arXiv, 2020.
[9] Antti Tarvainen and Harri Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in NeurIPS, 2017.
[10] Ishan Misra and Laurens van der Maaten, "Self-supervised learning of pretext-invariant representations," arXiv, 2019.
[11] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox, "Discriminative unsupervised feature learning with exemplar convolutional neural networks," TPAMI, 2015.
[12] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in NeurIPS, 2016.
[13] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii, "Virtual adversarial training: a regularization method for supervised and semi-supervised learning," TPAMI, 2018.
[14] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel, "Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring," arXiv, 2019.
[15] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze, "Deep clustering for unsupervised learning of visual features," in ECCV, 2018.
[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, "Automatic differentiation in PyTorch," 2017.
[17] Alex Krizhevsky, Geoffrey Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[18] Adam Coates, Andrew Ng, and Honglak Lee, "An analysis of single-layer networks in unsupervised feature learning," in AISTATS, 2011.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[20] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese, "Deep metric learning via lifted structured feature embedding," in CVPR, 2016.
[21] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh, "No fuss distance metric learning using proxies," in ICCV, 2017.
[22] Liefeng Bo, Xiaofeng Ren, and Dieter Fox, "Unsupervised feature learning for RGB-D based object recognition," in Experimental Robotics, 2013.
[23] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun, "Stacked what-where auto-encoders," arXiv, 2015.
[24] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum, "Mining on manifolds: Metric learning without labels," in CVPR, 2018.