Negative Data Augmentation
Abhishek Sinha, Kumar Ayush, Jiaming Song, Burak Uzkent, Hongxia Jin, Stefano Ermon
Published as a conference paper at ICLR 2021

Department of Computer Science, Stanford University; Samsung Research America
{a7b23, kayush, tsong, buzkent, ermon}@stanford.edu
*Equal contribution: Abhishek Sinha, Kumar Ayush, Jiaming Song

ABSTRACT
Data augmentation is often used to enlarge datasets with synthetic samples generated in accordance with the underlying data distribution. To enable a wider range of augmentations, we explore negative data augmentation strategies (NDA) that intentionally create out-of-distribution samples. We show that such negative out-of-distribution samples provide information on the support of the data distribution, and can be leveraged for generative modeling and representation learning. We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator. We prove that under suitable conditions, optimizing the resulting objective still recovers the true data distribution but can directly bias the generator towards avoiding samples that lack the desired structure. Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities. Further, we incorporate the same negative data augmentation strategy in a contrastive learning framework for self-supervised representation learning on images and videos, achieving improved performance on downstream image classification, object detection, and action recognition tasks. These results suggest that prior knowledge on what does not constitute valid data is an effective form of weak supervision across a range of unsupervised learning tasks.
INTRODUCTION
Data augmentation strategies for synthesizing new data in a way that is consistent with an underlying task are extremely effective in both supervised and unsupervised learning (Oord et al., 2018; Zhang et al., 2016; Noroozi & Favaro, 2016; Asano et al., 2019). Because they operate at the level of samples, they can be combined with most learning algorithms. They allow for the incorporation of prior knowledge (inductive bias) about properties of typical samples from the underlying data distribution (Jaiswal et al., 2018; Antoniou et al., 2017), e.g., by leveraging invariances to produce additional "positive" examples of how a task should be solved.

To enable users to specify an even wider range of inductive biases, we propose to leverage an alternative and complementary source of prior knowledge that specifies how a task should not be solved. We formalize this intuition by assuming access to a way of generating samples that are guaranteed to be out-of-support for the data distribution, which we call a
Negative Data Augmentation (NDA). Intuitively, negative out-of-distribution (OOD) samples can be leveraged as a useful inductive bias because they provide information about the support of the data distribution to be learned by the model. For example, in a density estimation problem we can bias the model to avoid putting any probability mass in regions which we know a priori should have zero probability. This can be an effective prior if the negative samples cover a sufficiently large area. The best NDA candidates are ones that expose common pitfalls of existing models, such as prioritizing local structure over global structure (Geirhos et al., 2018); this motivates us to consider known transformations from the literature that intentionally destroy the spatial coherence of an image (Noroozi & Favaro, 2016; DeVries & Taylor, 2017; Yun et al., 2019), such as Jigsaw transforms.

Building on this intuition, we introduce a new GAN training objective where we use NDA as an additional source of fake data for the discriminator, as shown in Fig. 1. Theoretically, we can show that if the NDA assumption is valid, optimizing this objective will still recover the data distribution in the limit of infinite data. However, in the finite data regime, there is a need to generalize beyond the empirical distribution (Zhao et al., 2018). By explicitly providing the discriminator with samples we want to avoid, we are able to bias the generator towards avoiding undesirable samples, thus improving generation quality.

Figure 1: Negative Data Augmentation for GANs.

Furthermore, we propose a way of leveraging NDA for unsupervised representation learning. We propose a new contrastive predictive coding (He et al., 2019; Han et al., 2019) (CPC) objective that encourages the distribution of representations corresponding to in-support data to become disjoint from that of NDA data.
Empirically, we show that applying NDA with our proposed transformations (e.g., forcing the representations of normal and jigsaw images to be disjoint) improves performance in downstream tasks.

With appropriately chosen NDA strategies, we obtain superior empirical performance on a variety of tasks, with almost no cost in computation. For generative modeling, models trained with NDA achieve better image generation, image translation and anomaly detection performance compared with the same model trained without NDA. Similar gains are observed on representation learning for images and videos over downstream tasks such as image classification, object detection and action recognition. These results suggest that NDA has much potential to improve a variety of self-supervised learning techniques.

NEGATIVE DATA AUGMENTATION
The input to most learning algorithms is a dataset of samples from an underlying data distribution p_data. While p_data is unknown, learning algorithms always rely on prior knowledge about its properties (inductive biases (Wolpert & Macready, 1997)), e.g., by using specific functional forms such as neural networks. Similarly, data augmentation strategies exploit known invariances of p_data, such as the conditional label distribution being invariant to semantic-preserving transformations.

While typical data augmentation strategies exploit prior knowledge about what is in the support of p_data, in this paper we propose to exploit prior knowledge about what is not in the support of p_data. This information is often available for common data modalities (e.g., natural images and videos) and is under-exploited by existing approaches. Specifically, we assume: (1) there exists an alternative distribution p̄ such that its support is disjoint from that of p_data; and (2) access to a procedure to efficiently sample from p̄. We emphasize p̄ need not be explicitly defined (e.g., through an explicit density) – it may be implicitly defined by a dataset or by a procedure that transforms samples from p_data into ones from p̄ by suitably altering their structure.

Figure 2: Negative augmentations produce out-of-distribution samples lacking the typical structure of natural images; these negative samples can be used to inform a model on what it should not learn.

Analogous to typical data augmentations, NDA strategies are by definition domain and task specific. In this paper, we focus on natural images and videos, and leave the application to other domains (such as natural language processing) as future work. How do we select a good NDA strategy? According to the manifold hypothesis (Fefferman et al., 2016), natural images lie on low-dimensional manifolds: p_data is supported on a low-dimensional manifold of the ambient (pixel) space.
This suggests that many negative data augmentation strategies exist. Indeed, sampling random noise is in most cases a valid NDA. However, while this prior is generic, it is not very informative, and this NDA will likely be ineffective for most learning problems. Intuitively, NDA is informative if its support is close (in a suitable metric) to that of p_data, while being disjoint. These negative samples will provide information on the "boundary" of the support of p_data, which we will show is helpful in several learning problems. In most of our tasks, the images are processed by convolutional neural networks (CNNs) that are good at processing local features but not necessarily global features (Geirhos et al., 2018). Therefore, we may consider NDA examples to be ones that preserve local features ("informative") and break global features, forcing the CNNs to learn global features (by realizing NDA samples are different from real data).

Leveraging this intuition, Figure 2 shows several image transformations from the literature that can be viewed as generic NDAs over natural images, which we will use for generative modeling and representation learning in the following sections. Details about these transformations can be found in Appendix B.

NDA FOR GENERATIVE ADVERSARIAL NETWORKS
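As a concrete example of the transformations in Figure 2, a jigsaw NDA shuffles the patches of an image, keeping local statistics intact while destroying global spatial coherence. A minimal NumPy sketch (function name, grid size, and the resample-until-nonidentity choice are our own, not the authors' code):

```python
import numpy as np

def jigsaw_nda(image, grid=2, rng=None):
    """Split an HxWxC image into a grid of patches and shuffle them.

    Each patch is untouched (local structure preserved), but their
    arrangement is permuted (global structure destroyed), so the result
    lies outside the support of the natural-image distribution.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw].copy()
               for i in range(grid) for j in range(grid)]
    # Resample until at least one patch moves, so the output is never
    # identical to the input (otherwise it would not be out-of-support).
    perm = np.arange(grid * grid)
    while np.array_equal(perm, np.arange(grid * grid)):
        perm = rng.permutation(grid * grid)
    out = image.copy()
    for k, p in enumerate(perm):
        i, j = divmod(k, grid)
        out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = patches[p]
    return out
```

The same pattern applies to the other spatial corruptions (Cutout, Stitching, Cutmix), which differ only in how the patches are replaced or recombined.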
Figure 3: Schematic overview of our NDA framework.
Left: In the absence of NDA, the support of a generative model P_θ (blue oval) learned from samples (green dots) may "over-generalize" and include samples from p̄_1 or p̄_2. Right: With NDA, the learned distribution P_θ becomes disjoint from the NDA distributions p̄_1 and p̄_2, thus pushing P_θ closer to the true data distribution p_data (green oval). As long as the prior is consistent, i.e. the supports of p̄_1 and p̄_2 are truly disjoint from p_data, the best-fit distribution in the infinite data regime does not change.

In GANs, we are interested in learning a generative model G_θ from samples drawn from some data distribution p_data (Goodfellow et al., 2014). GANs use a binary classifier, the so-called discriminator D_φ, to distinguish real data from generated (fake) samples. The generator G_θ is trained via the following mini-max objective that performs variational Jensen–Shannon divergence minimization:

$$\min_{G_\theta \in \mathcal{P}(\mathcal{X})} \max_{D_\phi} L_{\mathrm{JS}}(G_\theta, D_\phi) \quad \text{where} \quad (1)$$
$$L_{\mathrm{JS}}(G_\theta, D_\phi) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log(D_\phi(x))] + \mathbb{E}_{x \sim G_\theta}[\log(1 - D_\phi(x))] \quad (2)$$

This is a special case of the more general variational f-divergence minimization objective (Nowozin et al., 2016). The optimal D_φ for any G_θ is (p_data/G_θ)/(1 + p_data/G_θ), so the discriminator can serve as a density ratio estimator between p_data and G_θ.

With sufficiently expressive models and infinite capacity, G_θ will match p_data. In practice, however, we have access to finite datasets and limited model capacity. This means that the generator needs to generalize beyond the empirical distribution, which is challenging because the number of possible discrete distributions scales doubly exponentially w.r.t. the data dimension. Hence, as studied in (Zhao et al., 2018), the role of the inductive bias is critical. For example, Zhao et al.
(2018) report that when trained on images containing 2 objects only, GANs and other generative models can sometimes "generalize" by generating images with 1 or 3 objects (which were never seen in the training set). This generalization behavior – which may or may not be desirable – is determined by factors such as network architectures, hyperparameters, etc., and is difficult to characterize analytically.

Here we propose to bias the learning process by directly specifying what the generator should not generate through NDA. We consider an adversarial game based on the following objective:

$$\min_{G_\theta \in \mathcal{P}(\mathcal{X})} \max_{D_\phi} L_{\mathrm{JS}}(\lambda G_\theta + (1 - \lambda)\bar{p}, D_\phi) \quad (3)$$

where the negative samples are generated from a mixture of G_θ (the generator distribution) and p̄ (the NDA distribution); the mixture weights are controlled by the hyperparameter λ. Intuitively, this can help address the above "over-generalization" issue, as we can directly provide supervision on what should not be generated and thus guide the support of G_θ (see Figure 3). For instance, in the object count example above, we can empirically prevent the model from generating images with an undesired number of objects (see Appendix Section A for experimental results on this task).

In addition, the introduction of NDA samples will not affect the solution of the original GAN objective in the limit. In the following theorem, we show that given infinite training data and infinite capacity discriminators and generators, using NDA will not affect the optimal solution of the generator, i.e. the generator will still recover the true data distribution.

Theorem 1.
Let p̄ ∈ P(X) be any distribution over X whose support is disjoint from that of p_data, i.e., such that supp(p_data) ∩ supp(p̄) = ∅. Let D_φ : X → R range over all discriminators on X, let f : R_{≥0} → R be a convex, lower semi-continuous function such that f(1) = 0, let f* be the convex conjugate of f and f′ its derivative, and let G_θ be a distribution with sample space X. Then for all λ ∈ (0, 1], we have:

$$\arg\min_{G_\theta \in \mathcal{P}(\mathcal{X})} \max_{D_\phi : \mathcal{X} \to \mathbb{R}} L_f(G_\theta, D_\phi) = \arg\min_{G_\theta \in \mathcal{P}(\mathcal{X})} \max_{D_\phi : \mathcal{X} \to \mathbb{R}} L_f(\lambda G_\theta + (1 - \lambda)\bar{p}, D_\phi) = p_{\mathrm{data}} \quad (4)$$

where $L_f(Q, D_\phi) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[D_\phi(x)] - \mathbb{E}_{x \sim Q}[f^\star(D_\phi(x))]$ is the objective for f-GAN (Nowozin et al., 2016). However, the optimal discriminators are different for the two objectives:

$$\arg\max_{D_\phi : \mathcal{X} \to \mathbb{R}} L_f(G_\theta, D_\phi) = f'(p_{\mathrm{data}}/G_\theta) \quad (5)$$
$$\arg\max_{D_\phi : \mathcal{X} \to \mathbb{R}} L_f(\lambda G_\theta + (1 - \lambda)\bar{p}, D_\phi) = f'(p_{\mathrm{data}}/(\lambda G_\theta + (1 - \lambda)\bar{p})) \quad (6)$$

Proof.
See Appendix C.

The above theorem shows that in the limit of infinite data and computation, adding NDA changes the optimal discriminator but not the optimal generator. In practice, when dealing with finite data, existing regularization techniques such as weight decay and spectral normalization (Miyato et al., 2018) allow potentially many solutions that achieve the same objective value. The introduction of NDA samples allows us to filter out certain solutions by providing additional inductive bias through OOD samples. In fact, the optimal discriminator will reflect the density ratio between p_data and λG_θ + (1 − λ)p̄ (see Eq. (6)), and its values will be higher for samples from p_data compared to those from p̄. As we will show in Section 5, a discriminator trained with this objective and suitable NDA performs better than relevant baselines on other downstream tasks such as anomaly detection.

NDA FOR CONTRASTIVE REPRESENTATION LEARNING
Using a classifier to estimate a density ratio is useful not only for estimating f-divergences (as in the previous section) but also for estimating mutual information between two random variables. In representation learning, mutual information (MI) maximization is often employed to learn compact yet useful representations of the data, allowing one to perform downstream tasks efficiently (Tishby & Zaslavsky, 2015; Nguyen et al., 2008; Poole et al., 2019b; Oord et al., 2018). Here, we show that NDA samples are also beneficial for representation learning.

In contrastive representation learning (such as CPC (Oord et al., 2018)), the goal is to learn a mapping h_θ(x) : X → P(Z) that maps a datapoint x to some distribution over the representation space Z; once the network h_θ is learned, representations are obtained by sampling z ∼ h_θ(x). CPC maximizes the following objective:

$$I_{\mathrm{CPC}}(h_\theta, g_\phi) := \mathbb{E}_{x \sim p_{\mathrm{data}}(x),\, z \sim h_\theta(x),\, \widehat{z}_i \sim p_\theta(z)} \left[ \log \frac{n\, g_\phi(x, z)}{g_\phi(x, z) + \sum_{j=1}^{n-1} g_\phi(x, \widehat{z}_j)} \right] \quad (7)$$

where $p_\theta(z) = \int h_\theta(z|x)\, p_{\mathrm{data}}(x)\, \mathrm{d}x$ is the marginal distribution of the representations associated with p_data. Intuitively, the CPC objective involves an n-class classification problem where g_φ attempts to identify a matching pair (x, z) sampled from the joint distribution among the (n − 1) non-matching pairs (x, ẑ_j) sampled from the product of marginals. Note that g_φ plays the role of a discriminator/critic, and is implicitly estimating a density ratio. As n → ∞, the optimal g_φ corresponds to an un-normalized density ratio between the joint distribution and the product of marginals, and the CPC objective matches its upper bound, which is the mutual information between X and Z (Poole et al., 2019a; Song & Ermon, 2019).
However, this objective is no longer able to control the representations for data that are out of the support of p_data, so there is a risk that the representations are similar between p_data samples and out-of-distribution ones.

To mitigate this issue, we propose to use NDA in the CPC objective, where we additionally introduce a batch of m NDA samples for each positive sample:

$$I_{\mathrm{NDA\text{-}CPC}}(h_\theta, g_\phi) := \mathbb{E} \left[ \log \frac{(n + m)\, g_\phi(x, z)}{g_\phi(x, z) + \sum_{j=1}^{n-1} g_\phi(x, \widehat{z}_j) + \sum_{k=1}^{m} g_\phi(x, \bar{z}_k)} \right] \quad (8)$$

where the expectation is taken over x ∼ p_data(x), z ∼ h_θ(x), ẑ_i ∼ p_θ(z), x̄_k ∼ p̄ (the NDA distribution), and z̄_k ∼ h_θ(x̄_k) for all k ∈ [m]. Here, the behavior of h_θ(x) when x is an NDA sample is optimized explicitly, allowing us to impose additional constraints on the NDA representations. This corresponds to a more challenging classification problem (compared to basic CPC) that encourages learning more informative representations. In the following theorem, we show that the proposed objective encourages the representations for NDA samples to become disjoint from the representations for p_data samples, i.e. NDA samples and p_data samples do not map to the same representation.

Theorem 2. (Informal) The optimal solution for h_θ in the NDA-CPC objective maps the representations of data samples and NDA samples to disjoint regions.

Proof. See Appendix D for a detailed statement and proof.
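The objective in Eq. (8) can be sketched per-anchor as an ordinary softmax cross-entropy over positive, marginal, and NDA scores. A minimal NumPy sketch (names are our own; the constant log(n + m) is dropped since it does not affect optimization):

```python
import numpy as np

def nda_cpc_loss(pos, negs, nda):
    """Per-example loss for the NDA-CPC objective (Eq. 8), up to the
    additive constant log(n + m).

    pos:  critic score g(x, z) for the matching pair (positive scalar).
    negs: scores for the n - 1 pairs drawn from the marginal p_theta(z).
    nda:  scores g(x, z_k) for the m NDA representations; they simply
          join the negatives in the softmax denominator.
    """
    denom = pos + np.sum(negs) + np.sum(nda)
    # Minimizing this pushes pos up and all (negative + NDA) scores down,
    # which is what drives NDA representations away from data ones.
    return -np.log(pos / denom)
```

Making the NDA scores part of the denominator is the entire change relative to basic CPC: the critic must now also reject pairs built from negatively augmented inputs.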
EXPERIMENTS
In this section we report experiments with different types of NDA for image generation. Additional details about the network architectures and hyperparameters can be found in Appendix J.

Figure 4: Histogram of the difference in the discriminator output for a real image and its Jigsaw version.
Unconditional Image Generation.
We conduct experiments on various datasets using the BigGAN architecture (Brock et al., 2018) for unconditional image generation. (We feed a single label to all images to make the architecture suitable for unconditional generation.) We first explore various image transformations from the literature to evaluate which ones are effective as NDA. For each transformation, we evaluate its performance as NDA (training as in Eq. 3) and as a traditional data augmentation strategy, where we enlarge the training set by applying the transformation to real images (denoted PDA for positive data augmentation). Table 1 shows the FID scores for different types of transformations as PDA/NDA. The results suggest that transformations that spatially corrupt the image are strong NDA candidates. It can be seen that Random Horizontal Flip is not effective as an NDA; this is because flipping does not spatially corrupt the image but is rather a semantics-preserving transformation, hence the NDA distribution p̄ is not disjoint from p_data. On the contrary, it is reasonable to assume that if an image is likely under p_data, its flipped variant should also be likely. This is confirmed by the effectiveness of this strategy as PDA.

Table 1: FID scores for each transformation (Jigsaw, Cutout, Stitch, Mixup, Cutmix, Random Crop, Random Flip, Gaussian) used as PDA vs. NDA, compared to training without augmentation.
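Training as in Eq. (3) amounts to mixing NDA samples into the discriminator's fake batch. A minimal sketch of the resulting discriminator loss (illustrative NumPy; function name and the default λ are our own, not the authors' code):

```python
import numpy as np

def nda_discriminator_loss(d_real, d_gen, d_nda, lam=0.5):
    """Discriminator loss for the NDA-GAN objective (Eq. 3).

    d_real, d_gen, d_nda: discriminator outputs in (0, 1) on real,
    generator, and NDA batches. The fake term weights generator and NDA
    samples by lam and (1 - lam); drawing lam*B generator samples and
    (1 - lam)*B NDA samples per batch is equivalent in expectation.
    """
    real_term = np.mean(np.log(d_real))
    fake_term = (lam * np.mean(np.log1p(-d_gen))
                 + (1.0 - lam) * np.mean(np.log1p(-d_nda)))
    # The discriminator maximizes real_term + fake_term; return the
    # negation as a loss to minimize.
    return -(real_term + fake_term)
```

The generator update is unchanged: only the discriminator ever sees NDA samples, which is why the method adds almost no computational cost.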
Table 2: Comparison of FID scores of different types of NDA for unconditional image generation on various datasets; the numbers in brackets give the image resolution in pixels. Jigsaw consistently achieves the best or second-best result.
Columns: BigGAN / Jigsaw / Stitching / Mixup / Cutout / Cutmix / CR-BigGAN
Rows: CIFAR-10 (32), CIFAR-100 (32), CelebA (64), STL10 (32)
We believe spatially corrupted negatives perform well as NDA because they push the discriminator to focus on global features instead of local ones (e.g., texture). We confirm this by plotting the histogram of differences in the discriminator output for a real image and its Jigsaw version, as shown in Fig. 4. The difference is (a) centered close to zero for the normal BigGAN (so without NDA training, the discriminator cannot distinguish real and Jigsaw samples well), and (b) centered at a positive number (logit 10) for our method (NDA-BigGAN). Following these findings, in our remaining experiments we use Jigsaw, Cutout, Stitch, Mixup and Cutmix, as they achieve significant improvements when used as NDA for unconditional image generation on CIFAR-10.

Table 2 shows the FID scores for BigGAN when trained with five types of negative data augmentation on four different benchmarks. Almost all the NDA augmentations improve the baseline across datasets. We use the same value of λ for all datasets except CIFAR-100, where it is 0.5. We show the effect of λ on CIFAR-10 performance in Appendix G. We additionally performed an experiment using a mixture of augmentation policies. The results (FID 16.24) were better than the baseline method (18.64) but not as good as using a single strategy.

Conditional Image Generation.
We also investigate the benefits of NDA in conditional image generation using BigGAN. The results are shown in Table 3. In this setting as well, NDA gives a significant boost over the baseline model. We again use dataset-specific values of λ for CIFAR-10 and CIFAR-100. For both the unconditional and conditional setups we find the Jigsaw and Stitching augmentations achieve better FID scores than the other augmentations.

Table 3: FID scores for conditional image generation using different NDAs.
Columns: BigGAN / Jigsaw / Stitching / Mixup / Cutout / Cutmix / CR-BigGAN
Rows: C-10, C-100

Image Translation.
Next, we apply the NDA method to image translation. In particular, we use the Pix2Pix model (Isola et al., 2017), which performs image-to-image translation using GANs given paired training data. Here, the generator is conditioned on an image I, and the discriminator takes as input the concatenation of the generated/real image and I. We use Pix2Pix for semantic segmentation on the Cityscapes dataset (Cordts et al., 2016) (i.e. photos → labels). Table 4 shows the quantitative gains obtained by using Jigsaw NDA, while Figure 7 in Appendix E highlights the qualitative improvements. The NDA-Pix2Pix model avoids noisy segmentation on objects including buildings and trees. (We use a PyTorch implementation of BigGAN; the number reported in Brock et al. (2018) for C-10 is 14.73. For Pix2Pix, we use the official PyTorch implementation and show the best results.)
Table 4: Semantic segmentation results on Cityscapes with and without Jigsaw NDA.
Metric: Pp. (per-pixel acc.) / Pc. (per-class acc.) / mIOU
Pix2Pix (cGAN): 0.80 / 0.24 / 0.27
NDA (cGAN): –
Pix2Pix (L1+cGAN): 0.72 / 0.23 / 0.18
NDA (L1+cGAN): –
Table 5: AUROC scores for different OOD datasets. OOD-1 contains different datasets, while OOD-2 contains the set of 19 different corruptions in CIFAR-10-C (Hendrycks & Dietterich, 2018) (the average score is reported).

OOD-1 dataset: BigGAN / Jigsaw / EBM
DTD: 0.70 / 0.69 / 0.48
SVHN: 0.75 / 0.61 / 0.63
Places-365: 0.35 / 0.58 / 0.68
TinyImageNet: 0.40 / 0.62 / 0.67
CIFAR-100: 0.63 / 0.64 / 0.50
Average: 0.57 / 0.63 / 0.59
Anomaly Detection.
As another added benefit of NDA for GANs, we utilize the output scores of the BigGAN discriminator for anomaly detection. We experiment with two different types of OOD datasets. The first set consists of SVHN (Netzer et al., 2011), DTD (Cimpoi et al., 2014), Places-365 (Zhou et al., 2017), TinyImageNet, and CIFAR-100 as the OOD datapoints, following the protocol in (Du & Mordatch, 2019; Hendrycks et al., 2018). We train BigGAN w/ and w/o Jigsaw NDA on the train set of CIFAR-10 and then use the output value of the discriminator to classify the test set of CIFAR-10 (not anomalous) and the OOD datapoints (anomalous). We use the AUROC metric proposed in (Hendrycks & Gimpel, 2016) to evaluate anomaly detection performance. Table 5 compares the performance of NDA with a likelihood-based model (Energy-Based Models, EBM (Du & Mordatch, 2019)). Results show that Jigsaw NDA performs much better than the baseline BigGAN and other generative models. We did not include other NDAs as Jigsaw achieved the best results.

We consider the extreme corruptions in CIFAR-10-C (Hendrycks & Dietterich, 2018) as the second set of OOD datasets. It consists of 19 different corruptions, each with 5 levels of severity. We only consider the corruption of highest severity for our experiment, as these constitute a significant shift from the true data distribution. Averaged over all 19 corruptions, the BigGAN trained with Jigsaw NDA achieves a higher AUROC score than the normal BigGAN. Histograms of the difference in the discriminator's output for clean and OOD samples are shown in Figure 8 in the appendix. High difference values imply that the Jigsaw NDA is better at distinguishing OOD samples than the normal BigGAN.
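The AUROC metric used above has a simple rank interpretation: it is the probability that a random in-distribution sample outscores a random OOD sample. A self-contained sketch (our own code, not tied to the paper's implementation):

```python
import numpy as np

def auroc(in_scores, ood_scores):
    """AUROC for in-distribution vs. OOD detection, where a higher
    score (e.g. the discriminator output) indicates in-distribution.

    Computed as the fraction of (in, ood) pairs where the
    in-distribution sample scores higher, counting ties as half.
    """
    a = np.asarray(in_scores, dtype=float)[:, None]
    b = np.asarray(ood_scores, dtype=float)[None, :]
    return float((a > b).mean() + 0.5 * (a == b).mean())
```

A score of 0.5 means the detector is no better than chance; 1.0 means the two score distributions are perfectly separated.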
REPRESENTATION LEARNING USING CONTRASTIVE LOSS AND NDA
Unsupervised Learning on Images.
In this section, we perform experiments on three benchmarks: (a) CIFAR10 (C10), (b) CIFAR100 (C100), and (c) ImageNet-100 (Deng et al., 2009) to show the benefits of NDA on representation learning with the contrastive loss function. In our experiments, we use the momentum contrast method MoCo-V2 (He et al., 2019), as it is currently the state-of-the-art model for unsupervised learning on ImageNet. For C10 and C100, we train the MoCo-V2 model (w/ and w/o NDA) for 1000 epochs; for ImageNet-100, we train it (w/ and w/o NDA) for 200 epochs. Additional hyperparameter details can be found in the appendix. To evaluate the representations, we train a linear classifier on the representations on the same dataset with labels. Table 6 shows the top-1 accuracy of the classifier. We find that across all three datasets, different NDA approaches outperform MoCo-V2. While Cutout NDA performs best for C10, the best-performing NDAs for C100 and ImageNet-100 are Jigsaw and Mixup, respectively. Figure 9 compares the cosine distance of the representations learned w/ and w/o NDA (Jigsaw): jigsaw and normal images are projected far apart from each other when trained using NDA, whereas with the original MoCo-V2 they are projected close to each other.
Transfer Learning for Object Detection.
We transfer the network pre-trained on ImageNet-100 to the task of Pascal-VOC object detection using a Faster R-CNN detector (C4 backbone) (Ren et al., 2015). We fine-tune the network on the Pascal VOC 2007+2012 trainval set and test it on the 2007 test set. MoCo trained with Mixup NDA improves AP, AP50 and AP75 over the baseline MoCo.

Table 6: Top-1 accuracy results on image recognition w/ and w/o NDA on MoCo-V2.
Columns: MoCo-V2 / Jigsaw / Stitching / Cutout / Cutmix / Mixup
CIFAR-10: 91.20 / 91.66 / 91.59 / – / – / –
CIFAR-100: –
ImageNet-100: –

Unsupervised Learning on Videos.
In this section, we investigate the benefits of NDA in self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We apply NDA to Dense Predictive Coding (Han et al., 2019), a single-stream (RGB only) method for self-supervised representation learning on videos. For videos, we create NDA samples by performing the same transformation on all frames of the video (e.g. the same jigsaw permutation is applied to all frames of a video). We evaluate the approach by first training the DPC model with NDA on a large-scale dataset (UCF101), and then evaluating the representations by training a supervised action classifier on the UCF101 and HMDB51 datasets. As shown in Table 7, Jigsaw and Cutmix NDA improve downstream task accuracy on UCF-101 and HMDB-51, achieving new state-of-the-art performance among single-stream (RGB only) methods for self-supervised representation learning (when pre-trained on UCF-101).

Table 7: Top-1 accuracy results on action recognition in videos w/ and w/o NDA in DPC.
Columns: DPC / Jigsaw / Stitching / Cutout / Cutmix / Mixup
UCF-101 (pre-trained on UCF-101): 61.35 / 64.54 / – / – / – / –
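The frame-consistent video NDA described above can be sketched as follows (illustrative NumPy; function name and grid size are our own, not the authors' code):

```python
import numpy as np

def video_jigsaw_nda(frames, grid=2, rng=None):
    """Apply one fixed jigsaw permutation to every frame of a clip.

    frames: array of shape (T, H, W, C). Sampling a single permutation
    and reusing it for all T frames destroys spatial structure while
    keeping the negative clip temporally consistent, matching the
    recipe of applying an identical transformation to each frame.
    """
    rng = np.random.default_rng(rng)
    t, h, w = frames.shape[:3]
    ph, pw = h // grid, w // grid
    perm = rng.permutation(grid * grid)  # one permutation for the whole clip
    out = frames.copy()
    for k, p in enumerate(perm):
        i, j = divmod(k, grid)
        si, sj = divmod(int(p), grid)
        out[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = \
            frames[:, si * ph:(si + 1) * ph, sj * pw:(sj + 1) * pw]
    return out
```

If a fresh permutation were drawn per frame, the negative clip would also break temporal coherence; reusing one permutation isolates the spatial corruption.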
RELATED WORK
In several machine learning settings, negative samples are produced from a statistical generative model. Sung et al. (2019) aim to generate negative data using GANs for semi-supervised learning and novelty detection, while we are concerned with efficiently creating negative data to improve generative models and self-supervised representation learning. Hanneke et al. (2018) propose an alternative theoretical framework that relies on access to an oracle which classifies a sample as valid or not, but do not provide any practical implementation. Bose et al. (2018) use adversarial training to generate hard negatives that fool the discriminator for NLP tasks, whereas we obtain NDA data from positive data to improve image generation and representation learning. Hou et al. (2018) use a GAN to learn the negative data distribution with the aim of classifying positive-unlabeled (PU) data, whereas we do not have access to mixture data but rather generate negatives by transforming the positive data.

In contrastive unsupervised learning, common negative examples are ones that are assumed to be semantically further away than the positive samples. Word2Vec (Mikolov et al., 2013) considers negative samples drawn from a different context, and in CPC-based methods (Oord et al., 2018) such as momentum contrast (He et al., 2019), the negative samples are data augmentations of a different image. Our work considers a new type of "negative samples" that are neither generated from some model nor sampled from the data distribution. Instead, by applying negative data augmentation (NDA) to existing samples, we are able to incorporate useful inductive biases that might be difficult to capture otherwise (Zhao et al., 2018).
CONCLUSION
We proposed negative data augmentation as a method to incorporate prior knowledge through out-of-distribution (OOD) samples. NDAs are complementary to traditional data augmentation strategies, which are typically focused on in-distribution samples. Using the NDA framework, we interpret existing image transformations (e.g., jigsaw) as producing OOD samples and develop new learning algorithms to leverage them. Owing to a rigorous mathematical characterization of the NDA assumption, we are able to theoretically analyze its properties. As an example, we bias the generator of a GAN to avoid the support of negative samples, improving results on conditional/unconditional image generation tasks. Finally, we leverage NDA for unsupervised representation learning in images and videos. By integrating NDA into MoCo-V2 and DPC, we improve results on image and action recognition on the CIFAR10, CIFAR100, ImageNet-100, UCF-101, and HMDB-51 datasets. Future work includes exploring other augmentation strategies as well as NDAs for other modalities.

REFERENCES
Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.

Yuki M Asano, Christian Rupprecht, and Andrea Vedaldi. A critical analysis of self-supervision, or what we can learn from a single image. arXiv preprint arXiv:1904.13132, 2019.

Avishek Joey Bose, Huan Ling, and Yanshuai Cao. Adversarial contrastive estimation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1021–1032, 2018.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

Steve Hanneke, Adam Tauman Kalai, Gautam Kamath, and Christos Tzamos. Actively avoiding nonsense in generative models. In Conference On Learning Theory, pp. 209–227, 2018.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Dan Hendrycks and Thomas G Dietterich. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.

Ming Hou, Brahim Chaib-Draa, Chao Li, and Qibin Zhao. Generative adversarial positive-unlabeled learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2255–2261, 2018.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.

Ayush Jaiswal, Rex Yue Wu, Wael Abd-Almageed, and Prem Natarajan. Unsupervised adversarial invariance. In Advances in Neural Information Processing Systems, pp. 5092–5102, 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.

Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222, 2019.

Yi Lin Sung, Sung-Hsien Hsieh, Soo-Chang Pei, and Chun-Shien Lu. Difference-seeking generative adversarial network–unseen sample generation. In International Conference on Learning Representations, 2019.

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406, 2015.

David H Wolpert and William G Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023–6032, 2019.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Springer, 2016.

Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10792–10801, 2018.

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
A NUMEROSITY CONTAINMENT
Zhao et al. (2018) systematically investigate generalization in deep generative models using two different datasets: (a) a toy dataset in which each image contains k non-overlapping dots with random color and location (see Figure 5a), and (b) the CLEVR dataset, in which each image contains k objects with random shape, color, location, and size (see Figure 5b). They train a GAN model (WGAN-GP, Gulrajani et al. (2017)) on either dataset and observe that the learned distribution does not produce the same number of objects as the dataset it was trained on. The distribution of the numerosity in the generated images is centered at the numerosity of the dataset, with a slight bias towards over-estimation. For example, when trained on images with six dots, the generated images contain anywhere from two to eight dots (see Figure 6a). The observation is similar when training on images with two CLEVR objects: the generated images contain anywhere from one to three objects (see Figure 6b).

To remove samples whose numerosity differs from the training dataset, we use such samples as negative data during training. For example, while training on images with six dots we use images with four, five, and seven dots as negative data for the GAN. The resulting distribution of the numerosity in the generated images is then concentrated on six. We observe similar behaviour when training a GAN with images containing two CLEVR objects as positive data and images with one or three objects as negative data.

B IMAGE TRANSFORMATIONS
Given an image of size H × W, the different image transformations that we used are described below.

Jigsaw-K (Noroozi & Favaro, 2016) We partition the image into a grid of K × K patches of size (H/K) × (W/K), indexed by [1, . . . , K × K]. We then shuffle the image patches according to a random permutation (different from the original order) to produce the NDA image. Empirically, we find K = 2 to work best for Jigsaw-K NDA.
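The Jigsaw-K transform described above can be sketched as follows. This is a minimal numpy illustration, not the authors' code; the helper name `jigsaw_nda` is hypothetical, and H and W are assumed divisible by K.

```python
import numpy as np

def jigsaw_nda(image, k=2, rng=None):
    """Split an H x W image into a k x k grid of patches and shuffle them.

    The permutation is resampled until it differs from the identity, so the
    output is guaranteed to differ from the input arrangement (a negative,
    out-of-distribution sample). H and W are assumed divisible by k.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0] // k, image.shape[1] // k
    # Collect the k*k patches in row-major order.
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(k) for j in range(k)]
    perm = np.arange(k * k)
    while np.array_equal(perm, np.arange(k * k)):  # reject the identity
        rng.shuffle(perm)
    out = np.empty_like(image)
    for idx, p in enumerate(perm):
        i, j = divmod(idx, k)
        out[i * h:(i + 1) * h, j * w:(j + 1) * w] = patches[p]
    return out
```

Because the patches are only permuted, the output preserves local texture statistics while breaking global structure, which is what makes it a useful negative sample.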
Stitching We stitch two equal-sized patches of two different images, either horizontally (H/2 × W) or vertically (H × W/2), chosen uniformly at random, to produce the NDA image.

Figure 5: Toy datasets used in the numerosity experiments: (a) Dots, (b) CLEVR.
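The stitching transform can be sketched as follows (a minimal numpy illustration; the helper name `stitch_nda` is hypothetical, and the two inputs are assumed to have the same shape).

```python
import numpy as np

def stitch_nda(img_a, img_b, rng=None):
    """Stitch halves of two equal-sized images into one NDA image.

    With probability 1/2 the top (H/2 x W) half of img_a is joined to the
    bottom half of img_b; otherwise the left (H x W/2) half of img_a is
    joined to the right half of img_b.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_a.shape[0], img_a.shape[1]
    if rng.random() < 0.5:
        # Horizontal cut: stack top half of img_a over bottom half of img_b.
        return np.concatenate([img_a[: h // 2], img_b[h // 2:]], axis=0)
    # Vertical cut: place left half of img_a beside right half of img_b.
    return np.concatenate([img_a[:, : w // 2], img_b[:, w // 2:]], axis=1)
```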
Figure 6: Left: Distribution over the number of dots. The arrows mark the number of dots the learning algorithm is trained on, and the solid line is the distribution over the number of dots the model generates. Right: Distribution over the number of CLEVR objects the model generates. Generating CLEVR images is harder, so we explore only one setting, but the behaviour with NDA is similar to that for dots.
Cutout / Cutmix We select a random patch in the image, with its height and width lying between one third and one half of the image height and width, respectively. To construct NDA images, this patch is replaced either with the mean pixel value of the patch (as in cutout (DeVries & Taylor, 2017), except that cutout uses zero-masking), or with the pixel values of another image at the same location (cutmix (Yun et al., 2019)).
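The cutout/cutmix NDA variants above can be sketched in one helper (a minimal numpy illustration; the function name `cutout_cutmix_nda` is hypothetical):

```python
import numpy as np

def cutout_cutmix_nda(image, other=None, rng=None):
    """Replace a random patch of `image` to create an NDA sample.

    The patch height/width are drawn between one third and one half of the
    image height/width. If `other` is None, the patch is filled with its own
    mean pixel value (the mean-masked cutout variant); otherwise it is
    replaced by the pixels of `other` at the same location (cutmix).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[0], image.shape[1]
    ph = rng.integers(h // 3, h // 2 + 1)   # patch height in [H/3, H/2]
    pw = rng.integers(w // 3, w // 2 + 1)   # patch width in [W/3, W/2]
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    out = image.copy()
    patch = out[top:top + ph, left:left + pw]
    if other is None:
        patch[...] = patch.mean()                          # cutout (mean mask)
    else:
        patch[...] = other[top:top + ph, left:left + pw]   # cutmix
    return out
```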
Mixup-α The NDA image is constructed as a linear interpolation between two images x and y (Zhang et al., 2017): γx + (1 − γ)y, with γ ∼ Beta(α, α). α is chosen so that the distribution of γ has high density around 0.5.

Other classes NDA images are sampled from other classes in the same dataset. See Appendix A.
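The Mixup-α transform above is a one-liner in practice; a minimal numpy sketch (the helper name `mixup_nda` is hypothetical, and α > 1 is one way to concentrate γ around 0.5):

```python
import numpy as np

def mixup_nda(x, y, alpha=2.0, rng=None):
    """Interpolate two images with gamma ~ Beta(alpha, alpha).

    With alpha > 1, gamma concentrates around 0.5, so the mixture is far
    from both source images and can serve as a negative (NDA) sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = rng.beta(alpha, alpha)
    return gamma * x + (1.0 - gamma) * y
```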
C NDA FOR GANS

Theorem 1. Let $\bar{P} \in \mathcal{P}(\mathcal{X})$ be any distribution over $\mathcal{X}$ with support disjoint from that of $p_{\mathrm{data}}$, i.e., such that $\mathrm{supp}(p_{\mathrm{data}}) \cap \mathrm{supp}(\bar{P}) = \emptyset$. Let $D_\phi : \mathcal{X} \to \mathbb{R}$ range over the set of all discriminators over $\mathcal{X}$, let $f : \mathbb{R}_{\geq 0} \to \mathbb{R}$ be a convex, lower semi-continuous function such that $f(1) = 0$, let $f^\star$ be the convex conjugate of $f$ and $f'$ its derivative, and let $G_\theta$ be a distribution with sample space $\mathcal{X}$. Then for all $\lambda \in (0, 1]$ we have:

$$\arg\min_{G_\theta \in \mathcal{P}(\mathcal{X})} \max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(G_\theta, D_\phi) = \arg\min_{G_\theta \in \mathcal{P}(\mathcal{X})} \max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(\lambda G_\theta + (1 - \lambda)\bar{P}, D_\phi) = p_{\mathrm{data}} \quad (4)$$

where $\mathcal{L}_f(Q, D_\phi) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[D_\phi(x)] - \mathbb{E}_{x \sim Q}[f^\star(D_\phi(x))]$ is the objective for $f$-GAN (Nowozin et al., 2016). However, the optimal discriminators are different for the two objectives:

$$\arg\max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(G_\theta, D_\phi) = f'(p_{\mathrm{data}} / G_\theta) \quad (5)$$

$$\arg\max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(\lambda G_\theta + (1 - \lambda)\bar{P}, D_\phi) = f'\!\left(p_{\mathrm{data}} / (\lambda G_\theta + (1 - \lambda)\bar{P})\right) \quad (6)$$

Proof.
Let $p(x)$, $\bar{p}(x)$, $q(x)$ denote the density functions of $p_{\mathrm{data}}$, $\bar{P}$ and $G_\theta$ respectively (and $P$, $\bar{P}$, $Q$ the respective distributions). First, from Lemma 1 in Nguyen et al. (2008), we have

$$\max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(G_\theta, D_\phi) = D_f(P \,\|\, G_\theta) \quad (9)$$

$$\max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(\lambda G_\theta + (1 - \lambda)\bar{P}, D_\phi) = D_f(P \,\|\, \lambda Q + (1 - \lambda)\bar{P}) \quad (10)$$

where $D_f$ denotes the $f$-divergence. Then we have

$$
\begin{aligned}
D_f(P \,\|\, \lambda Q + (1 - \lambda)\bar{P})
&= \int_{\mathcal{X}} \left(\lambda q(x) + (1 - \lambda)\bar{p}(x)\right) f\!\left(\frac{p(x)}{\lambda q(x) + (1 - \lambda)\bar{p}(x)}\right) \mathrm{d}x \\
&= \int_{\mathcal{X}} \lambda q(x)\, f\!\left(\frac{p(x)}{\lambda q(x) + (1 - \lambda)\bar{p}(x)}\right) \mathrm{d}x + (1 - \lambda) f(0) \\
&\geq \lambda f\!\left(\int_{\mathcal{X}} q(x)\, \frac{p(x)}{\lambda q(x) + (1 - \lambda)\bar{p}(x)}\, \mathrm{d}x\right) + (1 - \lambda) f(0) \quad (11) \\
&= \lambda f\!\left(\frac{1}{\lambda}\int_{\mathcal{X}} \lambda q(x)\, \frac{p(x)}{\lambda q(x) + (1 - \lambda)\bar{p}(x)}\, \mathrm{d}x\right) + (1 - \lambda) f(0) \\
&= \lambda f\!\left(\frac{1}{\lambda}\int_{\mathcal{X}} \left(\lambda q(x) + (1 - \lambda)\bar{p}(x) - (1 - \lambda)\bar{p}(x)\right) \frac{p(x)}{\lambda q(x) + (1 - \lambda)\bar{p}(x)}\, \mathrm{d}x\right) + (1 - \lambda) f(0) \\
&= \lambda f\!\left(\frac{1}{\lambda}\left(1 - \int_{\mathcal{X}} (1 - \lambda)\bar{p}(x)\, \frac{p(x)}{\lambda q(x) + (1 - \lambda)\bar{p}(x)}\, \mathrm{d}x\right)\right) + (1 - \lambda) f(0) \\
&= \lambda f\!\left(\frac{1}{\lambda}\right) + (1 - \lambda) f(0) \quad (12)
\end{aligned}
$$

where Eq. (11) uses the convexity of $f$ with Jensen's inequality, and Eq. (12) uses the fact that $\bar{p}(x)\, p(x) = 0$ for all $x \in \mathcal{X}$, since $P$ and $\bar{P}$ have disjoint support. We also have

$$
\begin{aligned}
D_f(P \,\|\, \lambda P + (1 - \lambda)\bar{P})
&= \int_{\mathcal{X}} \left(\lambda p(x) + (1 - \lambda)\bar{p}(x)\right) f\!\left(\frac{p(x)}{\lambda p(x) + (1 - \lambda)\bar{p}(x)}\right) \mathrm{d}x \\
&= \int_{\mathcal{X}} \lambda p(x)\, f\!\left(\frac{p(x)}{\lambda p(x) + (1 - \lambda)\bar{p}(x)}\right) \mathrm{d}x + (1 - \lambda) f(0) \\
&= \int_{\mathcal{X}} \lambda p(x)\, f\!\left(\frac{p(x)}{\lambda p(x) + 0}\right) \mathrm{d}x + (1 - \lambda) f(0) \\
&= \lambda f\!\left(\frac{1}{\lambda}\right) + (1 - \lambda) f(0)
\end{aligned}
$$

Therefore, for the inequality in Eq. (11) to be an equality, we must have $q(x) = p(x)$ for all $x \in \mathcal{X}$. Hence the generator distribution recovers the data distribution at the equilibrium posed by the NDA-GAN objective, as is also the case for the original GAN objective. Moreover, from Lemma 1 in Nguyen et al. (2008), we have

$$\arg\max_{D_\phi} \mathcal{L}_f(Q, D_\phi) = f'(p_{\mathrm{data}} / Q) \quad (13)$$

Therefore, by replacing $Q$ with $G_\theta$ and with $\lambda G_\theta + (1 - \lambda)\bar{P}$, we obtain

$$\arg\max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(G_\theta, D_\phi) = f'(p_{\mathrm{data}} / G_\theta) \quad (14)$$

$$\arg\max_{D_\phi : \mathcal{X} \to \mathbb{R}} \mathcal{L}_f(\lambda G_\theta + (1 - \lambda)\bar{P}, D_\phi) = f'\!\left(p_{\mathrm{data}} / (\lambda G_\theta + (1 - \lambda)\bar{P})\right) \quad (15)$$

which shows that the optimal discriminators are indeed different for the two objectives.

D NDA FOR CONTRASTIVE REPRESENTATION LEARNING
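Throughout this section, $I_{\mathrm{CPC}}$ denotes the NDA-CPC objective, which augments the InfoNCE denominator with $m$ NDA negatives in addition to the usual $n - 1$ in-distribution negatives. A minimal numpy sketch of the per-sample objective (the helper name `nda_cpc_objective` is hypothetical; scores are assumed to be non-negative critic outputs $g(x, z)$):

```python
import numpy as np

def nda_cpc_objective(pos_score, neg_scores, nda_scores):
    """InfoNCE-style lower bound with NDA negatives in the denominator.

    pos_score:  critic value g(x, z) for the positive pair.
    neg_scores: length n-1 array of critic values for ordinary negatives.
    nda_scores: length m array of critic values for NDA negatives.
    Returns log((n + m) * g(x, z) / (g(x, z) + sum(neg) + sum(nda))).
    """
    n = len(neg_scores) + 1
    m = len(nda_scores)
    denom = pos_score + np.sum(neg_scores) + np.sum(nda_scores)
    return np.log((n + m) * pos_score / denom)
```

Note that driving the critic's scores on NDA samples to zero strictly increases this bound, which is the mechanism exploited in the proof of the theorem below.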
We present the detailed statement of Theorem 2 and its proof below.
Theorem 3.
For some distribution $\bar{p}$ over $\mathcal{X}$ such that $\mathrm{supp}(\bar{p}) \cap \mathrm{supp}(p_{\mathrm{data}}) = \emptyset$, and for any maximizer of the NDA-CPC objective

$$\hat{h} \in \arg\max_{h_\theta} \max_{g_\phi} I_{\mathrm{CPC}}(h_\theta, g_\phi),$$

the representations of negative samples are disjoint from those of positive samples under $\hat{h}$; i.e.,

$$\forall x \in \mathrm{supp}(p_{\mathrm{data}}),\ \bar{x} \in \mathrm{supp}(\bar{p}),\quad \mathrm{supp}(\hat{h}(\bar{x})) \cap \mathrm{supp}(\hat{h}(x)) = \emptyset.$$

Proof.
We use a contradiction argument. For any representation mapping that maximizes the NDA-CPC objective,

$$\hat{h} \in \arg\max_{h_\theta} \max_{g_\phi} I_{\mathrm{CPC}}(h_\theta, g_\phi),$$

suppose that the positive and NDA samples share some support, i.e.,

$$\exists x \in \mathrm{supp}(p_{\mathrm{data}}),\ \bar{x} \in \mathrm{supp}(\bar{p}),\quad \mathrm{supp}(\hat{h}(\bar{x})) \cap \mathrm{supp}(\hat{h}(x)) \neq \emptyset.$$

We can always construct $\hat{h}'$ that shares the same representation with $\hat{h}$ on $p_{\mathrm{data}}$ but has disjoint representations for NDA samples; i.e., $\forall x \in \mathrm{supp}(p_{\mathrm{data}}),\ \bar{x} \in \mathrm{supp}(\bar{p})$, the following two statements are true:

1. $\hat{h}(x) = \hat{h}'(x)$;
2. $\mathrm{supp}(\hat{h}'(\bar{x})) \cap \mathrm{supp}(\hat{h}'(x)) = \emptyset$.

Our goal is to prove that

$$\max_{g_\phi} I_{\mathrm{CPC}}(\hat{h}', g_\phi) > \max_{g_\phi} I_{\mathrm{CPC}}(\hat{h}, g_\phi), \quad (16)$$

which yields a contradiction.

For ease of exposition, let us allow zero values for the output of $g$ (if $g$ assigns zero to positive pairs, the CPC objective becomes $-\infty$, so such a $g$ cannot be a maximizer of the objective). Let $\hat{g} \in \arg\max_{g_\phi} I_{\mathrm{CPC}}(\hat{h}, g_\phi)$ be an optimal critic for the representation model $\hat{h}$. We then define the following critic function:

$$\hat{g}'(x, z) = \begin{cases} \hat{g}(x, z) & \text{if } \exists x \in \mathrm{supp}(p_{\mathrm{data}}) \text{ s.t. } z \in \mathrm{supp}(\hat{h}'(x)) \\ 0 & \text{otherwise} \end{cases} \quad (17)$$

In other words, the critic assigns the same value to data-representation pairs over the support of $p_{\mathrm{data}}$, and zero otherwise. From the assumption over $\hat{h}$, there exist $x \in \mathrm{supp}(p_{\mathrm{data}})$, $\bar{x} \in \mathrm{supp}(\bar{p})$, and $z \in \mathrm{supp}(\hat{h}(\bar{x})) \cap \mathrm{supp}(\hat{h}(x))$ such that $(x, z)$ can be sampled as a positive pair and $\hat{g}(x, z) > 0$. Therefore,

$$
\begin{aligned}
\max_{g_\phi} I_{\mathrm{CPC}}(\hat{h}', g_\phi) &\geq I_{\mathrm{CPC}}(\hat{h}', \hat{g}') \quad (18) \\
&= \mathbb{E}\left[\log \frac{(n + m)\,\hat{g}'(x, z)}{\hat{g}'(x, z) + \sum_{j=1}^{n-1} \hat{g}'(x, \hat{z}_j) + \underbrace{\sum_{k=1}^{m} \hat{g}'(x, \bar{z}_k)}_{=0}}\right] && \text{(definition of NDA-CPC)} \\
&> \mathbb{E}\left[\log \frac{(n + m)\,\hat{g}(x, z)}{\hat{g}(x, z) + \sum_{j=1}^{n-1} \hat{g}(x, \hat{z}_j) + \sum_{k=1}^{m} \hat{g}(x, \bar{z}_k)}\right] && \text{(existence of some } \hat{g}(x, \bar{z}_k) > 0\text{)} \\
&= \max_{g_\phi} I_{\mathrm{CPC}}(\hat{h}, g_\phi) && (\hat{g} \text{ is an optimal critic})
\end{aligned}
$$

which proves the theorem by contradiction.

E PIX2PIX

Figure 7 highlights the qualitative improvements when we apply the NDA method to the Pix2Pix model (Isola et al., 2017).

Figure 7: Qualitative results on Cityscapes.
F ANOMALY DETECTION

Here, we show the histogram of the difference in the discriminator's output for clean and OOD samples in Figure 8. High difference values imply that Jigsaw NDA is better at distinguishing OOD samples than the normal BigGAN.
G EFFECT OF HYPERPARAMETER ON UNCONDITIONAL IMAGE GENERATION

Here, we show the effect of λ for unconditional image generation on the CIFAR-10 dataset.

Table 8: Effect of λ on the FID score for unconditional image generation on CIFAR-10 using Jigsaw as NDA.

Figure 8: Histogram of D(clean) − D(corrupt) for three different corruptions: (a) Gaussian noise, (b) speckle noise, (c) JPEG compression.
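The λ studied here is the mixture weight in the NDA-GAN objective λG_θ + (1 − λ)P̄: in practice, each fake sample shown to the discriminator is drawn from the generator with probability λ and from the NDA distribution otherwise. A minimal, framework-agnostic numpy sketch of this fake-batch construction (the helper name `make_fake_batch` is hypothetical, not the authors' code):

```python
import numpy as np

def make_fake_batch(generated, nda, lam, rng=None):
    """Sample a fake batch from the mixture lam * G + (1 - lam) * NDA.

    `generated` holds samples from the generator and `nda` holds negative
    augmentations of real data (same shapes). Each slot in the batch keeps
    the generated sample with probability `lam`; otherwise it takes the NDA
    sample. The discriminator is then trained on real vs. this mixed batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(len(generated)) < lam  # Bernoulli(lam) per slot
    # Reshape the mask so it broadcasts over the remaining sample axes.
    mask = keep.reshape(-1, *([1] * (generated.ndim - 1)))
    return np.where(mask, generated, nda)
```

Setting lam = 1 recovers standard GAN training; smaller values show the discriminator more NDA samples, strengthening the penalty on out-of-support generations.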
H UNSUPERVISED LEARNING ON IMAGES

Figure 9: Comparing the cosine distance of the representations learned with Jigsaw NDA and MoCo-v2 (shaded blue) against the original MoCo-v2 (white). With NDA, the representations of a normal image and its jigsaw version are projected further away from each other than without NDA.
I DATASET PREPARATION FOR FID EVALUATION

For dataset preparation, we follow these procedures: (a) CIFAR-10 contains 60K 32 × 32 images with 10 labels, out of which 50K are used for training and 10K for testing; (b) CIFAR-100 contains 60K 32 × 32 images with 100 labels, with the same 50K/10K train/test split; (c) CelebA contains 162,770 train images and 19,962 test images (we resize the images to 64 × 64).

J HYPERPARAMETERS AND NETWORK ARCHITECTURE
Generative Modeling.
We use the same network architecture as BigGAN (Brock et al., 2018) for our experiments. The code used for our experiments is based on the authors' PyTorch code. For CIFAR-10, CIFAR-100, and CelebA we train for 500 epochs, whereas for STL-10 we train for 300 epochs. For all the datasets we use the following hyperparameters: batch size = 64, generator learning rate = 2e-4, discriminator learning rate = 2e-4, discriminator update steps per generator update step = 4. The best model was selected on the basis of FID scores on the test set (as explained above).
Momentum Contrastive Learning.
We use the official PyTorch implementation for our experiments. For CIFAR-10 and CIFAR-100, we perform unsupervised pre-training for 1000 epochs and supervised training (linear classifier) for 100 epochs. For ImageNet-100, we perform unsupervised pre-training for 200 epochs and supervised training (linear classifier) for 100 epochs. For CIFAR-10 and CIFAR-100, we use the following hyperparameters during pre-training: batch size = 256, learning rate = 0.3, temperature = 0.07, feature dimensionality = 2048. For ImageNet-100 pre-training we use: batch size = 128, learning rate = 0.015, temperature = 0.2, feature dimensionality = 128. During linear classification we use a batch size of 256 for all datasets, a learning rate of 10 for CIFAR-10 and CIFAR-100, and a learning rate of 30 for ImageNet-100.
Dense Predictive Coding.
We use the same network architecture and hyperparameters as DPC (Han et al., 2019) for our experiments and use the official PyTorch implementation. We perform self-supervised training on UCF-101 for 200 epochs and supervised training (action classifier) for 200 epochs on both the UCF-101 and HMDB-51 datasets.
K C