Towards Domain-Agnostic Contrastive Learning
Vikas Verma, Minh-Thang Luong, Kenji Kawaguchi, Hieu Pham, Quoc V. Le
TTowards Domain-Agnostic Contrastive Learning
Vikas Verma , , Minh-Thang Luong , Kenji Kawaguchi , Hieu Pham , Quoc V. Le Google Research, Brain Team Aalto University, Finland Harvard University [email protected] [email protected]{thangluong,hyhieu,qvl}@google.com
Abstract
Despite recent success, most contrastive self-supervised learning methods aredomain-specific, relying heavily on data augmentation techniques that requireknowledge about a particular domain, such as image cropping and rotation. Toovercome such limitation, we propose a novel domain-agnostic approach to con-trastive learning, named
DACL , that is applicable to domains where invariances, andthus, data augmentation techniques, are not readily available. Key to our approachis the use of
Mixup noise to create similar and dissimilar examples by mixing datasamples differently either at the input or hidden-state levels. To demonstrate theeffectiveness of DACL, we conduct experiments across various domains such astabular data, images, and graphs. Our results show that DACL not only outperformsother domain-agnostic noising methods, such as Gaussian-noise, but also combineswell with domain-specific methods, such as SimCLR, to improve self-supervisedvisual representation learning. Finally, we theoretically analyze our method andshow advantages over the Gaussian-noise based contrastive learning approach.
One of the core objectives of deep learning is to recover useful representations from the raw inputsignals without explicit labels provided by human annotators. Recently, self-supervised learningmethods have emerged as one of the most promising classes of methods to accomplish this objectivewith state-of-the-art performances across various domains such as computer vision (Oord et al., 2018b;He et al., 2019; Chen et al., 2020b), natural language processing (Dai & Le, 2015; Howard & Ruder,2018; Peters et al., 2018; Radford et al., 2019; Clark et al., 2020), and speech recognition (Schneideret al., 2019; Baevski et al., 2020). These self-supervised methods learn useful representationswithout explicit annotations by reformulating the unsupervised representation learning problem intoa supervised learning problem. This reformulation is done by defining a pretext task. The pretexttasks defined in these methods are based on certain domain-specific regularities and would generallydiffer from domain to domain (more discussion about this is in Section 6).Among various pretext tasks defined for self-supervised learning, contrastive learning (Chopra et al.,2005; Hadsell et al., 2006; Oord et al., 2018a; Hénaff et al., 2019; He et al., 2019; Chen et al., 2020b)is perhaps the most popular approach that learns to distinguish semantically similar examples overdissimilar ones. Despite its general applicability, contrastive learning requires a way, often by meansof data augmentations, to create semantically similar and dissimilar examples in the domain of interestfor it to work. For example, in computer vision, semantically similar samples can be constructedusing semantic-preserving augmentation techniques such as flipping, rotating, jittering, and cropping.These semantic-preserving augmentations, however, require domain-specific knowledge and may notbe readily available for other modalities such as graph or tabular data.
Work under progress. a r X i v : . [ c s . L G ] N ov ow to create semantically similar and dissimilar samples for new domains remains an open problem.As a simplest solution, one may add a sufficiently small random noise (such as Gaussian-noise) toa given sample to construct examples that are similar to it. Although simple, such augmentationstrategies do not exploit the underlying structure of the data manifold. In this work, we propose DACL,which stands for Domain-Agnostic Contrastive Learning, an approach that utilizes Mixup-noise tocreate similar and dissimilar examples by mixing data samples differently either at the input orhidden-state levels. Our experiments demonstrate the effectiveness of DACL across various domains,ranging from tabular data, to images and graphs; whereas, our theoretical analysis sheds light on whyMixup-noise works.In summary, the contribution of this work is as follows:• We propose Mixup-noise as a way of constructing positive and negative samples for con-trastive learning and conduct theoretical analysis to show that Mixup-noise has bettergeneralization bounds than Gaussian-noise.• We experimentally evaluate the effectiveness of DACL against Gaussian-noise based con-trastive learning on the tabular data. We use flattened version of CIFAR10 and Fashion-MNIST datasets as a proxy for the tabular data.• We demonstrate that Mixup-noise based data augmentation is complementary to other image-specific augmentations for contrastive learning, resulting in improvements over SimCLRbaseline for CIFAR10, CIFAR100 and ImageNet datasets.• We extend DACL to domains where data has a non-fixed topology (for example, graphs) byapplying Mixup-noise in the hidden states.• We show that using other forms of data-dependent noise (geometric-mixup, binary-mixup)can further improve the performance of DACL.
Contrastive learning can be formally defined using the notions of “anchor”, “positive” and “negative”samples. Here, positive and negative samples refer to samples that are semantically similar anddissimilar to anchor samples. Given an encoding function h : x (cid:55)→ h , an anchor sample x and itscorresponding positive and negative samples, x + and x − , the objective of contrastive learning isto bring the anchor and the positive sample closer in the embedding space than the anchor and thenegative sample. Formally, contrastive learning seeks to satisfy the following condition, where sim isa measure of similarity between two vectors:sim ( h , h + ) > sim ( h , h − ) (1)While the above objective can be reformulated in various ways, including max-margin contrastiveloss in Hadsell et al. (2006) and triplet loss in Weinberger & Saul (2009), in this work we considerInfoNCE loss because of its adaptation in multiple current state-of-the-art methods (Sohn, 2016;van den Oord et al., 2019; He et al., 2019; Chen et al., 2020b). Let us suppose that { x k } Nk =1 is a setof N samples such that it consists of a sample x i which is semantically similar to x j and dissimilarto all the other samples in the set, then the InfoNCE tries to maximize the similarity between thepositive pair and minimize the similarity between the negative pairs in such a set, and is defined as: (cid:96) i,j = − log exp(sim( h i , h j )) (cid:80) Nk =1 [ k (cid:54) = i ] exp(sim( h i , h k )) (2) For domains where natural data augmentation methods are not available, we propose to apply Mixup(Zhang et al., 2018) based data interpolation for creating positive and negative samples. Given adata distribution D = { x k } Kk =1 , a positive sample of an anchor x is created by taking its randominterpolation with another randomly chosen sample ˜ x from D : x + = λ x + (1 − λ ) ˜ x (3)where λ is a coefficient sampled from a random distribution such that x + is closer to x than ˜ x . Forinstance, we can sample λ from a uniform distribution λ ∼ U ( α, . with high values of α such as2 lgorithm 1 Mixup-noise Domain-Agnostic Contrastive Learning. input: batch size N , temperature τ , encoder function h , projection-head g , hyperparameter α . for sampled minibatch { x k } Nk =1 do for all k ∈ { , . . . , N } do λ ∼ U ( α, . x ∼ { x k } Nk =1 − { x k } ˜ x k − = λ x k + (1 − λ ) x h k − = h ( ˜ x k − ) z k − = g ( h k − ) λ ∼ U ( α, . x ∼ { x k } Nk =1 − { x k } ˜ x k − = λ x k + (1 − λ ) x h k = h ( ˜ x k ) z k = g ( h k ) end for for all i ∈ { , . . . , N } and j ∈ { , . . . , N } do s i,j = z (cid:62) i z j / ( (cid:107) z i (cid:107)(cid:107) z j (cid:107) ) end for define (cid:96) ( i, j ) as (cid:96) ( i, j ) = − log exp( s i,j /τ ) (cid:80) Nk =1 [ k (cid:54) = i ] exp( s i,k /τ ) L = N (cid:80) Nk =1 [ (cid:96) (2 k − , k ) + (cid:96) (2 k, k − update networks h and g to minimize L end for return encoder function h ( · ) , and projection-head g ( · ) x .Creating positive samples using Mixup in the input space (Eq. 3) is not feasible in domains wheredata has a non-fixed topology, such as sequences, trees, and graphs. 
For such domains, we createpositive samples by mixing fixed-length hidden representations of samples (Verma et al., 2019a).Formally, let us assume that there exists an encoder function h : I (cid:55)→ h that maps a sample I from such domains to a representation h via an intermediate layer that has a fixed-length hiddenrepresentation v , then we create positive sample in the intermediate layer as: v + = λ v + (1 − λ )˜ v (4)The above Mixup based method for constructing positive samples can be interpreted as addingnoise to a given sample in the direction of another sample in the data distribution. We term thisas Mixup-noise. One might ask how Mixup-noise is a better choice for contrastive learning thanother forms of noise? The central hypothesis of our method is that a network is forced to learnbetter features if the noise captures the structure of the data manifold rather than being independentof it. Consider an image x and adding Gaussian-noise to it for constructing the positive sample: x + = x + δ , where δ ∼ N ( , σ I ) . In this case, to maximize the similarity between x and x + ,the network can learn just to take an average over the neighboring pixels to remove the noise, thusbypassing learning the semantic concepts in the image. Such kind of trivial feature transformationis not possible with Mixup-noise, enforcing the network to learn better features. In addition to theaforementioned hypothesis, in Section 5, we formally conduct a theoretical analysis to understandthe effect of using Gaussian-noise vs Mixup-noise in the contrastive learning framework.For experiments, we closely follow the encoder and projection-head architecture, and the processfor computing the "normalized and temperature-scaled InfoNCE loss" from SimCLR (Chen et al.,2020b). Our approach for Mixup-noise based Domain-Agnostic Contrastive Learning (DACL) ininput space is summarized in Algorithm 1. (Algorithm for DACL in hidden representations can beeasily derived from Algorithm 1 by applying mixing in Line 8 and 14 instead of line 7 and 13.)3 .1 Additional forms of Mixup-based noise We have thus far proposed the contrastive learning method using the linear-interpolation Mixup.Other forms of Mixup-noise can also be used to obtain more diverse samples for contrastive learning.In particular, we explore "Geometric-Mixup" and "Binary-Mixup" based noise. In geometric-mixup,we create a positive sample corresponding to a sample x by taking its weighted-geometric mean withanother randomly chosen sample ˜ x : x + = x λ (cid:12) ˜ x (1 − λ ) (5)Similar to linear interpolation based Mixup-noise, λ is sampled from a uniform distribution λ ∼ U ( β, . with high values of β .In Binary-Mixup (Beckham et al., 2019), the elements of x are swapped with the elements of anotherrandomly chosen sample ˜ x . This is implemented by sampling a binary mask m ∈ { , } k (where k denotes the number of input features) and performing the following operation: x + = x (cid:12) m + ˜ x (cid:12) (1 − m ) (6)where elements of m are sampled from a Bernoulli ( ρ ) distribution with high ρ parameter.We extend the DACL procedure with the aforementioned additional Mixup-noise functions as follows:for a given sample x , we randomly select a noise function from linear-mixup, geometric-mixup,and binary-mixup, and apply this function to create both of the positive samples corresponding to x (line 7 and 13 in Algorithm 1). The rest of the details are the same as Algorithm 1. We refer to thisprocedure as DACL+ in the following experiments. 
We present results on three different application domains: tabular data, images, and graphs. For alldatasets, to evaluate the learned representations under different contrastive learning methods, we usethe linear evaluation protocol (Bachman et al., 2019b; Hénaff et al., 2019; He et al., 2019; Chen et al.,2020b), where a linear classifier is trained on top of a frozen encoder network (similar to SimCLR,we discard the projection-head during linear evaluation), and the test accuracy is used as a proxy forrepresentation quality.For each of the experiments, we give details about the architecture and the experimental setup in thecorresponding section. In the following, we describe common hyperparameter search settings. Forexperiments on tabular and image datasets ( Section 4.1 and 4.2), we search the hyperparameter α forlinear mixing (Section 3 or line 5 in Algorithm 1) from the set [0 . , . , . , . , . . To avoid thesearch over hyperparameter β (of Section 3.1), we set it to same value as α . For the hyperparameter ρ of Binary-mixup (Section 3.1), we search the value from the set [0 . , . , . . For Gaussian-noisebased contrastive learning, we chose the mean of Gaussian-noise from the set [0 . , . , . , . andthe standard deviation is set to . . For all experiments, the hyperparameter temperature τ (line 20 inAlgorithm 1) is searched from the set [0 . , . , . . For each of the experiments, we report the bestvalues of aforementioned hyperparameters in the Appendix C .For experiments on graph datasets (Section 4.3), we fix the value of α to . and value of temperature τ to . . For tabular data experiments, we use Fashion-MNIST and CIFAR-10 datasets as a proxy by flatteningthem into a vector format. We consider either linear evaluation from random weights of encoder(No-Pretraining) or from contrastive-learned encoders under either Gaussian noise or Mixup-noise(DACL, DACL+). Additionally, we report supervised learning results (training the full network in asupervised manner).We use a 12-layer fully-connected network as the base encoder and a 3-layer projection head, withReLU non-linearity and batch-normalization for all layers. All pre-training methods are trained for1000 epochs with a batch size of 4096. The linear classifier is trained for 200 epochs with a batch sizeof 256. We use LARS optimizer with cosine decay schedule without restarts (Loshchilov & Hutter,2017), for both pre-training and linear evaluation. The initial learning rate for both pre-training andlinear classifier is set to 0.1. 4 ethod Fashion-MNIST CIFAR10
No-Pretraining 66.6 26.8Gaussian-noise 75.8 27.4DACL 81.4 37.6DACL+
Full networksupervised training 79.1 35.2Table 1: Results on Tabular data with a 12-layer fully-connected network.
Results: as shown in Table 1, DACLperforms significantly better than theGaussian-noise based contrastive learn-ing. DACL+ , which uses additionalMixup-noises (Section 3.1), further im-proves the performance of DACL. Moreinterestingly, our results show that thelinear classifier applied to the representa-tions learned by DACL gives better per-formance than training the full networkin a supervised manner.
We use three benchmark image datasets: CIFAR-10, CIFAR-100, and ImageNet. For CIFAR-10 andCIFAR-100, we use No-pretraining, Gaussian-noise pretraining and SimCLR (Chen et al., 2020b).For ImageNet, we use recent contrastive learning methods as additional baselines. SimCLR+DACLrefers to the combination of the SimCLR and DACL methods, which is implemented using thefollowing steps: (1) for each training batch, compute the SimCLR loss and DACL loss separately and(2) pretrain the network using the sum of SimCLR and DACL losses.
Method CIFAR-10 CIFAR-100
No-Pretraining 43.1 18.1Gaussian-noise 56.1 29.8DACL 81.3 46.5DACL+ 83.8 52.7SimCLR 93.4 73.8SimCLR+DACL
Table 2: Results on Image data with ResNet50( × )For all experiments, we closely follow the de-tails in SimCLR (Chen et al., 2020b), bothfor pre-training and linear evaluation. We useResNet-50(x4) as the base encoder network, anda 3-layer MLP projection-head to project therepresentation to a 128-dimensional latent space. Pre-training:
For SimCLR and Sim-CLR+DACL pretraining, we use the followingaugmentation operations: random crop andresize (with random flip), color distortions,and Gaussian blur. We train all models witha batch size of 4096 for 1000 epochs forCIFAR10/100 and 100 epochs for ImageNet . We use LARS optimizer with learning rate 16.0 (= 1 . × Batch-size / for CIFAR10/100 and 4.8 (= 0 . × Batch-size / for ImageNet.Furthermore, we use linear warmup for the first 10 epochs and decay the learning rate with the cosinedecay schedule without restarts (Loshchilov & Hutter, 2017). The weight decay is set to − .Method Architecture Param Top 1 Top 5Rotation (Gidaris et al., 2018) ResNet50 ( × ) 86 55.4 -BigBiGAN(Donahue & Simonyan, 2019) ResNet50 ( × ) 86 61.3 81.9AMDIM (Bachman et al., 2019b) Custom-ResNet 626 68.1 -CMC (Tian et al., 2020) ResNet50 ( × ) 188 68.4 88.2MoCo (He et al., 2019) ResNet50 ( × ) 375 68.6 -CPC v2 (Hénaff et al., 2019) ResNet161 305 71.5 90.1No-Pretraining ResNet50 ( × ) 375 4.1 11.5Gaussian-noise ResNet50 ( × ) 375 10.2 23.6DACL ResNet50 ( × ) 375 24.6 44.4SimCLR (Chen et al., 2020b) ResNet50 ( × ) 375 73.4 91.6SimCLR+DACL ResNet50 ( × ) 375 Table 3: Accuracy of linear classifiers trained on representations learned withdifferent self-supervised methods on ImageNet dataset.
Linear evalu-ation:
Sinceour goal isnot to producestate-of-the-artperformance,but to examinewhat levelsof perfor-mance canbe achievedwithout usingany domain-specific dataaugmentations,we do notuse any dataaugmentationduring the lin- Our reproduction of the results of SimCLR for ImageNet in Table 3 differs from Chen et al. (2020b) becauseour experiments are run for 100 epochs vs their 1000 epochs. (= 1 . × Batch-size / andcosine decay schedule without restarts. For ImageNet, we use a batch-size of 4096 and train themodel for 90 epochs, using LARS optimizer with learning rate 1.6 (= 0 . × Batch-size / andcosine decay schedule without restarts. For both the CIFAR10/100, we do not use weight-decay andlearning rate warm-up. Results:
We present the results for CIFAR10/CIFAR100 and ImageNet in Table 2 and Table 3respectively. We observe that DACL is better than Gaussian-noise pre-training by a wide margin andDACL+ can improve the test accuracy even further. However, DACL falls short of methods that useimage augmentations such as SimCLR (Chen et al., 2020b). Interestingly, combining DACL withSimCLR (SimCLR+DACL in Table 2 and Table 3) can improve the performance of SimCLR acrossall the datasets, which suggests that Mixup-noise is complementary to other image data augmentationsfor contrastive learning.
We present the results of applying DACL to graph classification problems using six well-knownbenchmark datasets: MUTAG, PTC-MR, REDDIT-BINARY, REDDIT-MULTI-5K, IMDB-BINARY,and IMDB-MULTI (Yanardag & Vishwanathan, 2015). For baselines, we use No-Pretraining andInfoGraph (Sun et al., 2020). InfoGraph is a state-of-the-art contrastive learning method for graphclassification problems, which maximizes the mutual-information between the global and node-levelfeatures of a graph by formulating this as a contrastive learning problem.For applying DACL to graph structured data, as discussed in Section 3, it is required to obtainfixed-length representations from an intermediate layer of the encoder. For graph neural networks(such as Graph Isomorphism Network (GIN) Xu et al. (2018)), such fixed-length representation canbe obtained by applying global pooling over the node-level representations at any intermediate layer.Thus, the Mixup-noise can be applied to any of the intermediate layer. However, since we followthe encoder and projection-head architecture of SimCLR, we can also apply the Mixup-noise to theoutput of the encoder. In this work, we present experiments with Mixup-noise applied to the outputof the encoder and leave the experiments with Mixup-noise at intermediate layers for future work.We closely follow the experimental setup of InfoGraph (Sun et al., 2020) for a fair comparison,except that we report results for linear classifier instead of the Support Vector Classifier applied to thepre-trained representations. . For all the pre-training methods in Table 4, as graph encoder network,we use GIN (Xu et al., 2018) with 4 hidden layers and node embedding dimension of 512. The outputof this encoder network is a fixed-length vector of dimension × . Further, we use a 3-layerprojection-head with its hidden state dimension being the same as the output dimension of a 4-layerGIN (4 × . Similarly for InfoGraph experiments, we used a 3-layer discriminator network withhidden state dimension × .For all experiments, for pretraining, we train the model for 20 epochs with a batch-size of 128, andfor linear evaluation, we train the linear classifier on the learned representations for 100 updates withfull-batch training. For both pre-training and linear evaluation, we use Adam optimizer with an initiallearning rate chosen from the set { − , − , − } . We perform linear evaluation using 10-foldcross-validation. Since these datasets are small in the number of samples (refer to Appendix SectionD for the dataset details), the linear-evaluation accuracy varies significantly across the pre-trainingepochs. Thus, we report the average of linear classifier accuracy over the last five pre-training epochs.All the experiments are repeated five times. Results:
In Table 4 we see that DACL closely matches the performance of InfoGraph, with theclassification accuracy of these methods being within the standard deviation of each other. In termsof the classification accuracy mean, DACL outperforms InfoGraph on four out of six datasets. Our reproduction of the results for InfoGraph differs from Sun et al. (2020) because of this reason. ethod MUTAG PTC-MR REDDIT-BINARY REDDIT-M5K IMDB-BINARY IMDB-MULTINo-Pretraining . ± .
58 53 . ± .
27 55 . ± .
86 24 . ± .
93 52 . ± .
08 33 . ± . InfoGraph (Sun et al., 2020) . ± . . ± .
52 63 . ± . . ± . . ± .
05 39 . ± . DACL . ± . . ± .
57 66 . ± . . ± . . ± .
13 40 . ± . Table 4: Classification accuracy using a linear classifier trained on representations obtained usingdifferent self-supervised methods on 6 benchmark graph classification datasets.
In this section, we mathematically analyze and compare the properties of Mixup-noise and Gaussian-noise based contrastive learning for a binary classification task. We first prove that for both Mixup-noise and Gaussian-noise, optimizing hidden layers with contrastive loss is related to minimizingclassification loss with the last layer being optimized using labeled data. We then prove that theproposed method with Mixup-noise induces a different regularization effect on the classification losswhen compared with that of Gaussian-noise. This different regularization effect shows the advantageof Mixup-noise over Gaussian-noise when the data manifold lies in a low dimensional subspace.To compare the cases of Mixup-noise and Gaussian-noise, we focus on linear-interpolation basedMixup-noise and unify the two cases using the following observation. For Mixup-noise, we can write x + mix = λ x +(1 − λ ) ˜ x = x + αδ ( x , ˜ x ) with α = 1 − λ > and δ ( x , ˜ x ) = ( ˜ x − x ) where ˜ x is drawnfrom some (empirical) input data distribution. For Gaussian-noise, we can write x + gauss = x + αδ ( x , ˜ x ) with α > and δ ( x , ˜ x ) = ˜ x where ˜ x is drawn from some Gaussian distribution. Accordingly, foreach input x , we can write the positive example pair ( x + , x ++ ) and the negative example x − for bothcases as: x + = x + αδ ( x , ˜ x ) , x ++ = x + α (cid:48) δ ( x , ˜ x (cid:48) ) , and x − = ¯ x + α (cid:48)(cid:48) δ ( ¯ x , ˜ x (cid:48)(cid:48) ) , where ¯ x is anotherinput sample. Using this unified notation, we theoretically analyze our method with the standard con-trastive loss (cid:96) ctr defined by (cid:96) ctr ( x + , x ++ , x − ) = − log exp( sim [ h ( x + ) ,h ( x ++ )])exp( sim [ h ( x + ) ,h ( x ++ )])+exp( sim [ h ( x + ) ,h ( x − )]) , where h ( x ) ∈ R d is the output of the last hidden layer and sim [ q, q (cid:48) ] = q (cid:62) q (cid:48) (cid:107) q (cid:107)(cid:107) q (cid:48) (cid:107) for any given vectors q and q (cid:48) . This contrastive loss (cid:96) ctr without the projection-head g is commonly used in practice andcaptures the essence of contrastive learning. Theoretical analyses of the benefit of the projection-head g and other forms of Mixup-noise are left to future work.This section focuses on binary classification with y ∈ { , } using the standard binary cross-entropyloss: (cid:96) cf ( q, y ) = − y log(ˆ p q ( y = 1)) − (1 − y ) log(ˆ p q ( y = 0)) with ˆ p q ( y = 0) = 1 − ˆ p q ( y = 1) where ˆ p q ( y = 1) = − q ) . We use f ( x ) = h ( x ) (cid:62) w to represent the output of the classifierfor some w ; i.e., (cid:96) cf ( f ( x ) , y ) is the cross-entropy loss of the classifier f on the sample ( x , y ) . Let φ : R → [0 , be any Lipschitz function with constant L φ such that φ ( q ) ≥ [ q ≤ for all q ∈ R ; i.e., φ is an smoothed version of 0-1 loss. For example, we can set φ to be the hinge loss. Let X ⊆ R d and Y be the input and output spaces as x ∈ X and y ∈ Y . Let c x be a real number such that c x ≥ ( x k ) for all x ∈ X and k ∈ { , . . . , d } .As we aim to compare the cases of Mixup-noise and Gaussian-noise accurately without taking loosebounds, we first prove an exact relationship between the contrastive loss and classification loss: Theorem 1.
Let D be a probability distribution over ( x , y ) as ( x , y ) ∼ D , with the corre-sponding marginal distribution D x of x and conditional distribution D y of x given a y . Let ¯ ρ ( y ) = E ( x (cid:48) ,y (cid:48) ) ∼D [ [ y (cid:48) (cid:54) = y ] ] (= Pr( y (cid:48) (cid:54) = y | y ) > . Then, for any distribution pair ( D ˜ x , D α ) and function δ , the following holds: E x , ¯ x ∼D x , ˜ x , ˜ x (cid:48) , ˜ x (cid:48)(cid:48) ∼D ˜ x ,α,α (cid:48) ,α (cid:48)(cid:48) ∼D α [ (cid:96) ctr ( x + , x ++ , x − )] = E ( x ,y ) ∼D , ¯ x ∼ D ¯ y , ˜ x , ˜ x (cid:48) , ˜ x (cid:48)(cid:48) ∼D ˜ x ,α,α (cid:48) ,α (cid:48)(cid:48) ∼D α (cid:2) ρ ( y ) (cid:96) cf (cid:0) f ( x + ) , y (cid:1)(cid:3) + E y [(1 − ¯ ρ ( y )) E y ] where E y = E x , ¯ x ∼D y E ˜ x , ˜ x (cid:48) , ˜ x (cid:48)(cid:48) ∼D ˜ x ,α,α (cid:48) ,α (cid:48)(cid:48) ∼D α (cid:2) log (cid:0) (cid:2) − h ( x + ) (cid:62) (cid:107) h ( x + ) (cid:107) ( h ( x ++ ) (cid:107) h ( x ++ ) (cid:107) − h ( x − ) (cid:107) h ( x − ) (cid:107) (cid:1)(cid:3)(cid:1)(cid:3) , f ( x + ) = h ( x + ) (cid:62) ˜ w , ¯ y = 1 − y , ˜ w = (cid:107) h ( x + ) (cid:107) − ( (cid:107) h ( π y, ( x ++ , x − )) (cid:107) − h ( π y, ( x ++ , x − )) −(cid:107) h ( π y, ( x ++ , x − )) (cid:107) − h ( π y, ( x ++ , x − ))) , and π y,y (cid:48) ( x ++ , x − ) = [ y = y (cid:48) ] x ++ +(1 − [ y = y (cid:48) ] ) x − . All the proofs are presented in Appendix B. Theorem 1 proves the exact relationship for training losswhen we set the distribution D to be an empirical distribution with dirac measures on training datapoints: see Appendix A for more details. In general, Theorem 1 relates optimizing the contrastiveloss (cid:96) ctr ( x + , x ++ , x − ) to minimizing the classification loss (cid:96) cf ( f ( x + ) , y i ) at the perturbed point7 + . The following theorem then shows that it is approximately minimizing the classification loss (cid:96) cf ( f ( x ) , y i ) at the original point x with additional regularization terms: Theorem 2.
Let x and w be vectors such that ∇ f ( x ) and ∇ f ( x ) exist. Assume that f ( x ) = ∇ f ( x ) (cid:62) x , ∇ f ( x ) = 0 , and E ˜ x ∼D ˜ x [ ˜ x ] = 0 . Then, if yf ( x ) + ( y − f ( x ) ≥ , the following twostatements hold for any D ˜ x and α > : (i) (Mixup) if δ ( x , ˜ x ) = ˜ x − x , E ˜ x ∼D ˜ x [ (cid:96) cf ( f ( x + ) , y )] (7) = (cid:96) cf ( f ( x ) , y ) + c ( x ) |(cid:107)∇ f ( x ) (cid:107) + c ( x ) (cid:107)∇ f ( x ) (cid:107) + c ( x ) (cid:107)∇ f ( x ) (cid:107) E ˜ x ∼D ˜ x [˜ x ˜ x (cid:62) ] + O ( α ) , (ii) (Gaussian-noise) if δ ( x , ˜ x ) = ˜ x ∼ N (0 , σ I ) , E ˜ x ∼N (0 ,σ I ) [ (cid:96) cf (cid:0) f ( x + ) , y (cid:1) ] = (cid:96) cf ( f ( x ) , y ) + σ c ( x ) (cid:107)∇ f ( x ) (cid:107) + O ( α ) , (8) where c ( x ) = α | cos( ∇ f ( x ) , x ) || y − ψ ( f ( x )) |(cid:107) x (cid:107) ≥ , c ( x ) = α | cos( ∇ f ( x ) , x ) | (cid:107) x (cid:107) | ψ (cid:48) ( f ( x )) | ≥ , and c ( x ) = α | ψ (cid:48) ( f ( x )) | > . Here, ψ is the logicfunction as ψ ( q ) = exp( q )1+exp( q ) ( ψ (cid:48) is its derivative), cos( a, b ) is the cosine similarity of two vector a and b , and (cid:107) v (cid:107) M = v (cid:62) M v for any positive semidefinite matrix M . The assumptions of f ( x ) = ∇ f ( x ) (cid:62) x and ∇ f ( x ) = 0 in Theorem 2 are satisfied by feedforwarddeep neural networks with ReLU and max pooling (without skip connections) as well as by linearmodels. The condition of yf ( x ) + ( y − f ( x ) ≥ is satisfied whenever the training sample ( x , y ) is classified correctly. In other words, Theorem 2 states that when the model classifies a trainingsample ( x , y ) correctly, a training algorithm implicitly minimizes the additional regularization termsfor the sample ( x , y ) , which partially explains the benefit of training after correct classification oftraining samples.In equations (7)–(8), we can see that both the Mixup-noise and Gaussian-noise versions haveregularization effect on (cid:107)∇ f ( x ) (cid:107) — the Euclidean norm of the gradient of the model f with respectto input x . In the case of the linear model, we know from previous work that the regularization on (cid:107)∇ f ( x ) (cid:107) = (cid:107) w (cid:107) indeed promotes generalization: Remark 1. (Bartlett & Mendelson, 2002)
Let F b = { x (cid:55)→ w (cid:62) x : (cid:107) w (cid:107) ≤ b } . Then, for any δ > ,with probability at least − δ over an iid draw of n examples (( x i , y i )) ni =1 , the following holds forall f ∈ F b : E ( x ,y ) [ [(2 y − (cid:54) =sign( f ( x ))] ] − n n (cid:88) i =1 φ ((2 y i − f ( x i )) ≤ L φ (cid:114) bc x dn + (cid:114) ln(2 /δ )2 n . (9)By comparing equations (7)–(8) and by setting D ˜ x to be the input data distribution, we can see thatthe Mixup version has additional regularization effect on (cid:107)∇ f ( x ) (cid:107) X = (cid:107) w (cid:107) X , while the Gaussian-noise version does not, where Σ X = E x [ xx (cid:62) ] is the input covariance matrix. The following theoremshows that this implicit regularization with the Mixup version can further reduce the generalizationerror: Theorem 3.
Let F ( mix ) b = { x (cid:55)→ w (cid:62) x : (cid:107) w (cid:107) X ≤ b } . Then, for any δ > , with probability at least − δ over an iid draw of n examples (( x i , y i )) ni =1 , the following holds for all f ∈ F ( mix ) b : E ( x ,y ) [ [(2 y − (cid:54) =sign( f ( x ))] ] − n n (cid:88) i =1 φ ((2 y i − f ( x i )) ≤ L φ (cid:114) b rank(Σ X ) n + (cid:114) ln(2 /δ )2 n . (10)Comparing equations (9)–(10), we can see that the proposed method with Mixup-noise has theadvantage over that with Gaussian-noise when the input data distribution lies in low dimensionalmanifold as rank(Σ X ) < d . In general, our theoretical results show that the proposed method with We use this notation for conciseness without assuming that it is a norm. If M is only positive semidefiniteinstead of positive definite, (cid:107) · (cid:107) M is not a norm since this does not satisfy the definition of the norm for positivedefiniteness; i.e, (cid:107) v (cid:107) = 0 does not imply v = 0 . (cid:107)∇ f ( x ) (cid:107) X , which can reduce the complexity of themodel class of f along the data manifold captured by the covariance Σ X . See Appendix A foradditional discussions on the interpretation of Theorems 1 and 2 for neural networks.The proofs of Theorems 1 and 2 hold true also when we set x to be the output of a hidden layer (andby redefining the domains of h and f to be the output of the hidden layer). Therefore, by treating x to be the output of a hidden layer, our theory also applies to the contrastive learning with positivesamples created by mixing the hidden representations of samples. In this case, Theorems 1 and 2show that the contrastive learning method implicitly regularizes (cid:107)∇ f ( x ( l ) ) (cid:107) E [˜ x ( l ) (˜ x ( l ) ) (cid:62) ] — the normof the gradient of the model f with respect to the output x ( l ) of the l -th hidden layer in the directionof data manifold. Therefore, contrastive learning with Mixup-noise at the input space or a hiddenspace can promote generalization in the data manifold in the input space or the hidden space. Self-supervised learning:
Self-supervised learning methods can be categorized based on thepretext task they seek to learn. The earliest research work in self-supervised learning exists as earlyas in the 1990s. For instance, in de Sa (1994), the pretext task is to minimize the disagreementbetween the outputs of neural networks processing two different modalities of a given sample. In thefollowing, we briefly review various pretext tasks across different domains. In the natural languageunderstanding, pretext tasks include, predicting the neighbouring words (word2vec Mikolov et al.(2013)), predicting the next word (Dai & Le, 2015; Peters et al., 2018; Radford et al., 2019), predictingthe next sentence (Kiros et al., 2015; Devlin et al., 2019), predicting the masked word (Devlin et al.,2019; Yang et al., 2019; Liu et al., 2019; Lan et al., 2020)), and predicting the replaced word inthe sentence (Clark et al., 2020). For computer vision, examples of pretext tasks include rotationprediction (Gidaris et al., 2018), relative position prediction of image patches (Doersch et al., 2015),image colorization (Zhang et al., 2016b), reconstructing the original image from the partial image and(Pathak et al., 2016; Zhang et al., 2016a). For graph-structured data, the pretext task can be predictingthe context (neighbourhood of a given node) or predicting the masked attributes of the node (Hu et al.,2020). Most of the above pretext tasks in these methods are domain-specific, and hence they cannotbe applied to other domains. Perhaps a notable exception is the language modeling objectives, whichhave been shown to work for both NLP and computer vision (Dai & Le, 2015; Chen et al., 2020a).
Contrastive learning:
Contrastive learning is a form of self-supervised learning where the pretexttask is to bring positive samples closer than the negative samples in the representation space. Thesemethods can be categorized based on how the positive and negative samples are constructed. Inthe following, we will discuss these categories and the domains where these methods can not beapplied: (a) this class of methods use domain-specific augmentations (Chopra et al., 2005; Hadsellet al., 2006; He et al., 2019; Chen et al., 2020b) for creating positive and negative samples. Thesemethods are state-of-the-art for computer vision tasks but can not be applied to domains wheresemantic-preserving data augmentation does not exist, such as graph-data or tabular data. (b) anotherclass of methods constructs positive and negative samples by defining the local and global context ina sample (Hjelm et al., 2019; Sun et al., 2020; Veliˇckovi´c et al., 2019; Bachman et al., 2019a; Trinhet al., 2019). These methods can not be applied to domains where such global and local context doesnot exist, such as tabular data. (c) yet another class of methods uses the ordering in the sequential datato construct positive and negative samples (Oord et al., 2018a; Hénaff et al., 2019). These methodscan not be applied if the data sample can not be expressed as an ordered sequence, such as graphsand tabular data. Thus our motivation in this work is to propose a contrastive learning method thatcan be applied to a wide variety of domains.
Mixup based methods:
Mixup-based methods allow inducing inductive biases about how amodel’s predictions should behave in-between two or more data samples. Originally, this kindof data-interpolation method was introduced for classification by interpolating the data samples inthe input space and their corresponding targets (Zhang et al., 2018; Tokozume et al., 2017). Vermaet al. (2019a) proposed to perform the mixing in the representation space to learn more discrimina-tive features. This also alleviates the data collision problem which might occur in the input spacemixing. Mixup based methods have also seen remarkable success in other problem settings, suchas semi-supervised learning (Verma et al., 2019b; Berthelot et al., 2019), unsupervised learning9sing autoencoders (Beckham et al., 2019; Berthelot* et al., 2019), adversarial learning (Lamb et al.,2019; Lee et al., 2020; Pang* et al., 2020), graph-based learning (Verma et al., 2020; Wang et al.,2020), as well as various domains such as computer vision (Yun et al., 2019; Faramarzi et al., 2020;Jeong et al., 2020; Panfilov et al., 2019), natural language (Guo et al., 2019; Zhang et al., 2020)and speech (Lam et al., 2020; Tomashenko et al., 2018). Because of their simplicity and virtuallyno added computation cost, mixing based methods are an appealing choice for regularizing neuralnetworks in various problem settings and domains. Our work takes steps towards addressing theproblem of domain-agnostic self-supervised learning using the Mixup-noise and theoretically studiesits advantage over Gaussian-noise.
In this work, with the motivation of designing a domain-agnostic self-supervised learning method,we study Mixup-noise as a way for creating positive and negative samples for the contrastive learningformulation. Our results show that the proposed method DACL is a viable option for the domainswhere data augmentation methods are not available. Specifically, for the tabular data, we showthat DACL and DACL+ can achieve better test accuracy than training the neural network in afully-supervised manner. For graph classification, DACL is on par with the recently proposed mutual-information maximization method for contrastive learning (Sun et al., 2020). For the image datasets,DACL falls short of those methods which use image-specific augmentations (random cropping,horizontal flipping, color distortions, etc). However, our experiments show that Mixup-noise can beused as a complementary data augmentation to these image-specific data augmentations.As future work, one could easily extend DACL to other domains such as natural language andspeech. From a theoretical perspective, we have analyzed DACL in the binary classification setting,and extending this analysis to the multi-class setting might shed more light on developing a betterMixup-noise based contrastive learning method. Furthermore, since different kinds of Mixup-noiseexamined in this work are based only on random interpolation between two samples, extendingthe experiments by mixing between more than two samples or learning the optimal mixing policythrough an auxiliary network is another promising avenue for future research.
References
Bachman, Philip, Hjelm, R Devon, and Buchwalter, William. Learning representations by maximizingmutual information across views. In Wallach, H., Larochelle, H., Beygelzimer, A., dé Buc, F.,Fox, E., and Garnett, R. (eds.),
Advances in Neural Information Processing Systems 32 , pp.15535–15545. Curran Associates, Inc., 2019a. URL http://papers.nips.cc/paper/9686-learning-representations-by-maximizing-mutual-information-across-views.pdf .Bachman, Philip, Hjelm, R Devon, and Buchwalter, William. Learning representations by maximizingmutual information across views. In
Advances in Neural Information Processing Systems , pp.15535–15545, 2019b.Baevski, Alexei, Zhou, Henry, Mohamed, Abdelrahman, and Auli, Michael. wav2vec 2.0: Aframework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477 ,2020.Bartlett, Peter L and Mendelson, Shahar. Rademacher and gaussian complexities: Risk bounds andstructural results.
Journal of Machine Learning Research , 3(Nov):463–482, 2002.Beckham, Christopher, Honari, Sina, Verma, Vikas, Lamb, Alex, Ghadiri, Farnoosh, Devon Hjelm,R, Bengio, Yoshua, and Pal, Christopher. On Adversarial Mixup Resynthesis. arXiv e-prints , art.arXiv:1903.02709, Mar 2019.Berthelot, David, Carlini, Nicholas, Goodfellow, Ian, Papernot, Nicolas, Oliver, Avital, and Raffel,Colin. MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv e-prints , art.arXiv:1905.02249, May 2019.Berthelot*, David, Raffel*, Colin, Roy, Aurko, and Goodfellow, Ian. Understanding and improvinginterpolation in autoencoders via an adversarial regularizer. In
International Conference onLearning Representations , 2019. URL https://openreview.net/forum?id=S1fQSiCcYm .10hen, Mark, Radford, Alec, Child, Rewon, Wu, Jeff, Jun, Heewoo, Luan, David, and Sutskever, Ilya.Generative pretraining from pixels. In
ICML , 2020a.Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, and Hinton, Geoffrey E. A simple frameworkfor contrastive learning of visual representations.
ArXiv , abs/2002.05709, 2020b.Chopra, Sumit, Hadsell, Raia, and LeCun, Yann. Learning a similarity metric discriminatively, withapplication to face verification. In
Proceedings of the 2005 IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 - Volume 01 , CVPR ’05, pp.539–546, USA, 2005. IEEE Computer Society. ISBN 0769523722. doi: 10.1109/CVPR.2005.202.URL https://doi.org/10.1109/CVPR.2005.202 .Clark, Kevin, Luong, Minh-Thang, Le, Quoc V., and Manning, Christopher D. Electra: Pre-trainingtext encoders as discriminators rather than generators. In
International Conference on LearningRepresentations , 2020. URL https://openreview.net/forum?id=r1xMH1BtvB .Dai, Andrew M and Le, Quoc V. Semi-supervised sequence learning. In
Advances in neuralinformation processing systems , pp. 3079–3087, 2015.de Sa, Virginia R. Learning classification with unlabeled data. In Cowan, J. D., Tesauro, G.,and Alspector, J. (eds.),
Advances in Neural Information Processing Systems 6 , pp. 112–119.Morgan-Kaufmann, 1994. URL http://papers.nips.cc/paper/831-learning-classification-with-unlabeled-data.pdf .Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina. BERT: Pre-trainingof deep bidirectional transformers for language understanding. In
Proceedings of the 2019Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, Volume 1 (Long and Short Papers) , pp. 4171–4186, Minneapolis,Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.URL .Doersch, Carl, Gupta, Abhinav, and Efros, Alexei A. Unsupervised visual representation learning bycontext prediction. In
Proceedings of the 2015 IEEE International Conference on Computer Vision(ICCV) , ICCV ’15, pp. 1422–1430, USA, 2015. IEEE Computer Society. ISBN 9781467383912.doi: 10.1109/ICCV.2015.167. URL https://doi.org/10.1109/ICCV.2015.167 .Donahue, Jeff and Simonyan, Karen. Large scale adversarial representation learning, 2019.Faramarzi, Mojtaba, Amini, Mohammad, Badrinaaraayanan, Akilesh, Verma, Vikas, and Chandar,Sarath. Patchup: A regularization technique for convolutional neural networks, 2020.Gidaris, Spyros, Singh, Praveer, and Komodakis, Nikos. Unsupervised representation learning bypredicting image rotations, 2018.Guo, Hongyu, Mao, Yongyi, and Zhang, Richong. Augmenting data with mixup for sentenceclassification: An empirical study, 2019.Hadsell, Raia, Chopra, Sumit, and LeCun, Yann. Dimensionality reduction by learning an invariantmapping. CVPR ’06, pp. 1735–1742, USA, 2006. IEEE Computer Society. ISBN 0769525970.doi: 10.1109/CVPR.2006.100. URL https://doi.org/10.1109/CVPR.2006.100 .He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, and Girshick, Ross B. Momentum contrast forunsupervised visual representation learning.
ArXiv , abs/1911.05722, 2019.Hénaff, Olivier J., Srinivas, Aravind, Fauw, Jeffrey De, Razavi, Ali, Doersch, Carl, Eslami, S. M. Ali,and van den Oord, Aäron. Data-efficient image recognition with contrastive predictive coding.
CoRR , abs/1905.09272, 2019. URL http://arxiv.org/abs/1905.09272 .Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil,Trischler, Adam, and Bengio, Yoshua. Learning deep representations by mutual informationestimation and maximization. In
International Conference on Learning Representations , 2019.URL https://openreview.net/forum?id=Bklr3j0cKX .11oward, Jeremy and Ruder, Sebastian. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 , 2018.Hu, Weihua, Liu*, Bowen, Gomes, Joseph, Zitnik, Marinka, Liang, Percy, Pande, Vijay, and Leskovec,Jure. Strategies for pre-training graph neural networks. In
International Conference on LearningRepresentations , 2020. URL https://openreview.net/forum?id=HJlWWJSFDH .Jeong, Jisoo, Verma, Vikas, Hyun, Minsung, Kannala, Juho, and Kwak, Nojun. Interpolation-basedsemi-supervised learning for object detection, 2020.Kawaguchi, Kenji, Kaelbling, Leslie Pack, and Bengio, Yoshua. Generalization in deep learning. arXiv preprint arXiv:1710.05468 , 2017.Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Russ R, Zemel, Richard, Urtasun, Raquel, Torralba,Antonio, and Fidler, Sanja. Skip-thought vectors. In Cortes, C., Lawrence, N. D., Lee, D. D.,Sugiyama, M., and Garnett, R. (eds.),
Advances in Neural Information Processing Systems 28 ,pp. 3294–3302. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf .Lam, M. W. Y., Wang, J., Su, D., and Yu, D. Mixup-breakdown: A consistency training method forimproving generalization of speech separation models. In
ICASSP 2020 - 2020 IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP) , pp. 6374–6378, 2020.Lamb, Alex, Verma, Vikas, Kannala, Juho, and Bengio, Yoshua. Interpolated adversarial training:Achieving robust neural networks without sacrificing too much accuracy. In
Proceedings ofthe 12th ACM Workshop on Artificial Intelligence and Security , AISec’19, pp. 95–103, NewYork, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450368339. doi:10.1145/3338501.3357369. URL https://doi.org/10.1145/3338501.3357369 .Lan, Zhenzhong, Chen, Mingda, Goodman, Sebastian, Gimpel, Kevin, Sharma, Piyush,and Soricut, Radu. Albert: A lite bert for self-supervised learning of language rep-resentations. In
International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=H1eA7AEtvS .Lee, Saehyung, Lee, H., and Yoon, S. Adversarial vertex mixup: Toward better adversarially robustgeneralization. ,pp. 269–278, 2020.Liu, Yinhan, Ott, Myle, Goyal, Naman, Du, Jingfei, Joshi, Mandar, Chen, Danqi, Levy, Omer, Lewis,Mike, Zettlemoyer, Luke, and Stoyanov, Veselin. Roberta: A robustly optimized bert pretrainingapproach, 07 2019.Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. In
InternationalConference on Learning Representations (ICLR) 2017 Conference Track , April 2017.Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representa-tions of words and phrases and their compositionality. In Burges, C. J. C., Bottou, L., Welling, M.,Ghahramani, Z., and Weinberger, K. Q. (eds.),
Advances in Neural Information Processing Systems26 , pp. 3111–3119. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf .Oord, A., Li, Y., and Vinyals, Oriol. Representation learning with contrastive predictive coding.
ArXiv , abs/1807.03748, 2018a.Oord, Aaron van den, Li, Yazhe, and Vinyals, Oriol. Representation learning with contrastivepredictive coding. arXiv preprint arXiv:1807.03748 , 2018b.Panfilov, Egor, Tiulpin, Aleksei, Klein, Stefan, Nieminen, Miika T., and Saarakkala, Simo. Improv-ing robustness of deep learning based knee mri segmentation: Mixup and adversarial domainadaptation, 2019.Pang*, Tianyu, Xu*, Kun, and Zhu, Jun. Mixup inference: Better exploiting mixup to defendadversarial attacks. In
International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=ByxtC2VtPB .12athak, Deepak, Krähenbühl, Philipp, Donahue, J., Darrell, Trevor, and Efros, Alexei A. Contextencoders: Feature learning by inpainting. , pp. 2536–2544, 2016.Peters, Matthew E, Neumann, Mark, Iyyer, Mohit, Gardner, Matt, Clark, Christopher, Lee,Kenton, and Zettlemoyer, Luke. Deep contextualized word representations. arXiv preprintarXiv:1802.05365 , 2018.Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, and Sutskever, Ilya.Language models are unsupervised multitask learners. 2019.Schneider, Steffen, Baevski, Alexei, Collobert, Ronan, and Auli, Michael. wav2vec: Unsupervisedpre-training for speech recognition. arXiv preprint arXiv:1904.05862 , 2019.Sohn, Kihyuk. Improved deep metric learning with multi-class n-pair loss objective. InLee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.),
Advances inNeural Information Processing Systems 29 , pp. 1857–1865. Curran Associates, Inc., 2016.URL http://papers.nips.cc/paper/6200-improved-deep-metric-learning-with-multi-class-n-pair-loss-objective.pdf .Sun, Fan-Yun, Hoffman, Jordan, Verma, Vikas, and Tang, Jian. Infograph: Unsuper-vised and semi-supervised graph-level representation learning via mutual information max-imization. In
International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=r1lfF2NYvH .Tian, Yonglong, Krishnan, Dilip, and Isola, Phillip. Contrastive multiview coding, 2020.Tokozume, Yuji, Ushiku, Yoshitaka, and Harada, Tatsuya. Between-class learning for image classifi-cation. , pp. 5486–5494,2017.Tomashenko, Natalia, Khokhlov, Yuri, and Estève, Yannick. Speaker adaptive training and mixupregularization for neural network acoustic models in automatic speech recognition. pp. 2414–2418,09 2018. doi: 10.21437/Interspeech.2018-2209.Trinh, Trieu H., Luong, Minh-Thang, and Le, Quoc V. Selfie: Self-supervised pretraining for imageembedding.
CoRR , abs/1906.02940, 2019. URL http://arxiv.org/abs/1906.02940 .van den Oord, Aaron, Li, Yazhe, and Vinyals, Oriol. Representation learning with contrastivepredictive coding, 2019.Veliˇckovi´c, Petar, Fedus, William, Hamilton, William L., Liò, Pietro, Bengio, Yoshua, and Hjelm,R Devon. Deep graph infomax. In
International Conference on Learning Representations , 2019.URL https://openreview.net/forum?id=rklz9iAcKQ .Verma, Vikas, Lamb, Alex, Beckham, Christopher, Najafi, Amir, Mitliagkas, Ioannis, Lopez-Paz,David, and Bengio, Yoshua. Manifold mixup: Better representations by interpolating hiddenstates. In Chaudhuri, Kamalika and Salakhutdinov, Ruslan (eds.),
Proceedings of the 36th In-ternational Conference on Machine Learning , volume 97 of
Proceedings of Machine Learn-ing Research , pp. 6438–6447, Long Beach, California, USA, 09–15 Jun 2019a. PMLR. URL http://proceedings.mlr.press/v97/verma19a.html .Verma, Vikas, Lamb, Alex, Juho, Kannala, Bengio, Yoshua, and Lopez-Paz, David. Inter-polation consistency training for semi-supervised learning. In Kraus, Sarit (ed.),
Proceed-ings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI2019, Macao, China, August 10-16, 2019 . ijcai.org, 2019b. doi: 10.24963/ijcai.2019. URL https://doi.org/10.24963/ijcai.2019 .Verma, Vikas, Qu, Meng, Kawaguchi, Kenji, Lamb, Alex, Bengio, Yoshua, Kannala, Juho, and Tang,Jian. Graphmix: Improved training of gnns for semi-supervised learning, 2020.Wang, Yiwei, Wang, Wei, Liang, Yuxuan, Cai, Yujun, Liu, Juncheng, and Hooi, Bryan. Nodeaug:Semi-supervised node classification with data augmentation. KDD ’20, pp. 207–217, New York,NY, USA, 2020. Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3403063. URL https://doi.org/10.1145/3394486.3403063 .13einberger, Kilian Q. and Saul, Lawrence K. Distance metric learning for large margin nearestneighbor classification.
J. Mach. Learn. Res. , 10:207–244, June 2009. ISSN 1532-4435.Xu, Keyulu, Hu, Weihua, Leskovec, Jure, and Jegelka, Stefanie. How powerful are graph neuralnetworks? arXiv preprint arXiv:1810.00826 , 2018.Yanardag, Pinar and Vishwanathan, SVN. Deep graph kernels. In
Proceedings of the 21th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining , pp. 1365–1374.ACM, 2015.Yang, Zhilin, Dai, Zihang, Yang, Yiming, Carbonell, Jaime, Salakhutdinov, Russ R, and Le, Quoc V.Xlnet: Generalized autoregressive pretraining for language understanding. In
Advances in neuralinformation processing systems , pp. 5753–5763, 2019.Yun, Sangdoo, Han, Dongyoon, Oh, Seong Joon, Chun, Sanghyuk, Choe, Junsuk, and Yoo,Youngjoon. Cutmix: Regularization strategy to train strong classifiers with localizable features,2019.Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N., and Lopez-Paz, David. mixup: Beyondempirical risk minimization.
International Conference on Learning Representations , 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb .Zhang, Richard, Isola, Phillip, and Efros, Alexei A. Split-brain autoencoders: Unsu-pervised learning by cross-channel prediction.
CoRR , abs/1611.09842, 2016a. URL http://arxiv.org/abs/1611.09842 .Zhang, Richard, Isola, Phillip, and Efros, Alexei A. Colorful image colorization. In
ECCV , 2016b.Zhang, Rongzhi, Yu, Yue, and Zhang, Chao. Seqmix: Augmenting active sequence labeling viasequence mixup, 2020. 14 ppendix
A Additional discussion on theoretical analysis
On the interpretation of Theorem 1.
In Theorem 1, the distribution D is arbitrary. For exam-ple, if the number of samples generated during training is finite and n , then the simplest way toinstantiate Theorem 1 is to set D to represent the empirical measure n (cid:80) mi =1 δ ( x i ,y i ) for training data (( x i , y i )) mi =1 (where the Dirac measures δ ( x i ,y i ) ), which yields the following: n m (cid:88) i =1 m (cid:88) j =1 E ˜ x , ˜ x (cid:48) , ˜ x (cid:48)(cid:48) ∼D ˜ x ,α,α (cid:48) ,α (cid:48)(cid:48) ∼D α [ (cid:96) ctr ( x + i , x ++ i , x − j )]= 1 n m (cid:88) i =1 (cid:88) j ∈ S yi E ˜ x , ˜ x (cid:48) , ˜ x (cid:48)(cid:48) ∼D ˜ x ,α,α (cid:48) ,α (cid:48)(cid:48) ∼D α (cid:2) (cid:96) cf (cid:0) f ( x + i ) , y i (cid:1)(cid:3) + 1 n n (cid:88) i =1 [( n − | S y i | ) E y ] , where x + i = x i + αδ ( x i , ˜ x ) , x ++ i = x i + α (cid:48) δ ( x i , ˜ x (cid:48) ) , x − j = ¯ x j + α (cid:48)(cid:48) δ ( ¯ x j , ˜ x (cid:48)(cid:48) ) , S y = { i ∈ [ m ] : y i (cid:54) = y } , f ( x + i ) = (cid:107) ( h ( x + i ) (cid:107) − h ( x + i ) (cid:62) ˜ w , and [ m ] = { , . . . , m } . Here, we used the fact that ¯ ρ ( y ) = | S y | n where | S y | is the number of elements in the set S y . In general, in Theorem 1, we can setthe distribution D to take into account additional data augmentations (that generate infinite numberof samples) and the different ways that we generate positive and negative pairs. On the interpretation of Theorem 2 for deep neural networks.
Consider the case of deep neuralnetworks with ReLU in the form of f ( x ) = W ( H ) σ ( H − ( W ( H − σ ( H − ( · · · σ (1) ( W (1) x ) · · · )) ,where W ( l ) is the weight matrix and σ ( l ) is the ReLU nonlinear function at the l -th layer. In this case,we have (cid:107)∇ f ( x ) (cid:107) = (cid:107) W ( H ) ˙ σ ( H − W ( H − ˙ σ ( H − · · · ˙ σ (1) W (1) (cid:107) , where ˙ σ ( l ) = ∂σ ( l ) ( q ) ∂q | q = W ( l − σ ( l − ( ··· σ (1) ( W (1) x ) ··· ) is a Jacobian matrix and hence W ( H ) ˙ σ ( H − W ( H − ˙ σ ( H − · · · ˙ σ (1) W (1) is the sum of the product of path weights. Thus,regularizing (cid:107)∇ f ( x ) (cid:107) tends to promote generalization as it corresponds to the path weight norm usedin generalization error bounds in previous work (Kawaguchi et al., 2017). B Proof
In this section, we present complete proofs for our theoretical results.
B.1 Proof of Theorem 1
We begin by introducing additional notation to be used in our proof. For two vectors q and q (cid:48) , define cov[ q , q (cid:48) ] = (cid:88) k cov( q k , q (cid:48) k ) Let ρ y = E ¯ y | y [ [¯ y = y ] ] = (cid:80) ¯ y ∈{ , } p ¯ y (¯ y | y ) [¯ y = y ] = Pr(¯ y = y | y ) . For the completeness, we firstrecall the following well known fact: Lemma 1.
For any y ∈ { , } and q ∈ R , (cid:96) ( q, y ) = − log (cid:18) exp( yq )1 + exp( q ) (cid:19) roof. By simple arithmetic manipulations, (cid:96) ( q, y ) = − y log (cid:18)
11 + exp( − q ) (cid:19) − (1 − y ) log (cid:18) −
11 + exp( − q ) (cid:19) = − y log (cid:18)
11 + exp( − q ) (cid:19) − (1 − y ) log (cid:18) exp( − q )1 + exp( − q ) (cid:19) = − y log (cid:18) exp( q )1 + exp( q ) (cid:19) − (1 − y ) log (cid:18)
11 + exp( q ) (cid:19) = − log (cid:16) exp( q )1+exp( q ) (cid:17) if y = 1 − log (cid:16) q ) (cid:17) if y = 0= − log (cid:18) exp( yq )1 + exp( q ) (cid:19) . Before starting the main parts of the proof, we also prepare the following simple facts:
Lemma 2.
For any $(x^{+},x^{++},x^{-})$, we have
\[
\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-}) = \ell\big(\mathrm{sim}[h(x^{+}),h(x^{++})]-\mathrm{sim}[h(x^{+}),h(x^{-})],\,1\big).
\]
Proof. By simple arithmetic manipulations,
\begin{align*}
\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-}) &= -\log\frac{\exp(\mathrm{sim}[h(x^{+}),h(x^{++})])}{\exp(\mathrm{sim}[h(x^{+}),h(x^{++})])+\exp(\mathrm{sim}[h(x^{+}),h(x^{-})])}\\
&= -\log\frac{1}{1+\exp(\mathrm{sim}[h(x^{+}),h(x^{-})]-\mathrm{sim}[h(x^{+}),h(x^{++})])}\\
&= -\log\frac{\exp(\mathrm{sim}[h(x^{+}),h(x^{++})]-\mathrm{sim}[h(x^{+}),h(x^{-})])}{1+\exp(\mathrm{sim}[h(x^{+}),h(x^{++})]-\mathrm{sim}[h(x^{+}),h(x^{-})])}.
\end{align*}
Using Lemma 1 with $q=\mathrm{sim}[h(x^{+}),h(x^{++})]-\mathrm{sim}[h(x^{+}),h(x^{-})]$, this yields the desired statement.
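Lemmas 1 and 2 (and Lemma 3 below) are straightforward to confirm numerically. The following sketch (not from the paper) checks that the two-candidate contrastive loss equals the logistic loss of the similarity gap, and that $\ell(-q,1)=\ell(q,0)$:

```python
import numpy as np

def ell(q, y):
    # logistic loss, Lemma 1: -log( exp(y*q) / (1 + exp(q)) ) = log(1+exp(q)) - y*q
    return np.log1p(np.exp(q)) - y * q

def ell_ctr(sim_pos, sim_neg):
    # contrastive loss with one positive and one negative candidate
    return -np.log(np.exp(sim_pos) / (np.exp(sim_pos) + np.exp(sim_neg)))

rng = np.random.default_rng(2)
for _ in range(5):
    sp, sn, q = rng.normal(size=3)
    assert np.isclose(ell_ctr(sp, sn), ell(sp - sn, 1))   # Lemma 2
    assert np.isclose(ell(-q, 1), ell(q, 0))              # Lemma 3
```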
Lemma 3.
For any $y\in\{0,1\}$ and $q\in\mathbb{R}$, $\ell(-q,1)=\ell(q,0)$.
Proof. Using Lemma 1,
\[
\ell(-q,1) = -\log\left(\frac{\exp(-q)}{1+\exp(-q)}\right) = -\log\left(\frac{1}{1+\exp(q)}\right) = \ell(q,0).
\]
With these facts, we are now ready to start our proof. We first prove the relationship between the contrastive loss and the classification loss under an ideal situation:
Lemma 4.
Assume that $x^{+}=x+\alpha\,\delta(x,\tilde{x})$, $x^{++}=x+\alpha'\delta(x,\tilde{x}')$, $x^{-}=\bar{x}+\alpha''\delta(\bar{x},\tilde{x}'')$, and $\mathrm{sim}[z,z']=\frac{z^{\top}z'}{\zeta(z)\zeta(z')}$, where $\zeta:z\mapsto\zeta(z)\in\mathbb{R}$. Then for any $(\alpha,\tilde{x},\delta,\zeta)$ and $(y,\bar{y})$ such that $y\neq\bar{y}$, we have that
\[
\mathbb{E}_{x\sim\mathcal{D}_y}\mathbb{E}_{\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= \mathbb{E}_{x\sim\mathcal{D}_y}\mathbb{E}_{\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[\ell\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right].
\]
Proof. Using Lemma 2 and the assumption on $\mathrm{sim}$,
\begin{align*}
\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-}) &= \ell\big(\mathrm{sim}[h(x^{+}),h(x^{++})]-\mathrm{sim}[h(x^{+}),h(x^{-})],\,1\big)\\
&= \ell\left(\frac{h(x^{+})^{\top}h(x^{++})}{\zeta(h(x^{+}))\zeta(h(x^{++}))}-\frac{h(x^{+})^{\top}h(x^{-})}{\zeta(h(x^{+}))\zeta(h(x^{-}))},\,1\right)\\
&= \ell\left(\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\left(\frac{h(x^{++})}{\zeta(h(x^{++}))}-\frac{h(x^{-})}{\zeta(h(x^{-}))}\right),\,1\right).
\end{align*}
Write $\widetilde{W}(x,\bar{x})=\frac{h(x^{++})}{\zeta(h(x^{++}))}-\frac{h(x^{-})}{\zeta(h(x^{-}))}$. Taking expectations with $x\sim\mathcal{D}_y$ and $\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}$, the case $y=1$ (so that $x\sim\mathcal{D}_1$ and $\bar{x}\sim\mathcal{D}_0$) yields terms of the form $\ell\big(\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\widetilde{W}(x,\bar{x}),\,1\big)$, whereas the case $y=0$ (so that $x\sim\mathcal{D}_0$ and $\bar{x}\sim\mathcal{D}_1$) yields, after exchanging the roles of the two classes, terms of the form $\ell\big(-\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\widetilde{W}(\bar{x},x),\,1\big)$.
Using Lemma 3, $\ell(-q,1)=\ell(q,0)$, so in both cases the label inside the loss can be written as $y$, which gives
\[
\mathbb{E}_{x\sim\mathcal{D}_y}\mathbb{E}_{\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= \mathbb{E}_{x\sim\mathcal{D}_y}\mathbb{E}_{\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[\ell\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right].
\]
Using the above relationship under the ideal situation, we now prove the relationship under the practical situation:
Lemma 5.
Assume that $x^{+}=x+\alpha\,\delta(x,\tilde{x})$, $x^{++}=x+\alpha'\delta(x,\tilde{x}')$, $x^{-}=\bar{x}+\alpha''\delta(\bar{x},\tilde{x}'')$, and $\mathrm{sim}[z,z']=\frac{z^{\top}z'}{\zeta(z)\zeta(z')}$, where $\zeta:z\mapsto\zeta(z)\in\mathbb{R}$. Then for any $(\alpha,\tilde{x},\delta,\zeta,y)$, we have that
\[
\mathbb{E}_{\bar{y}|y}\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= (1-\rho_y)\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[\ell\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right] + \rho_y E,
\]
where
\[
E = \mathbb{E}_{x,\bar{x}\sim\mathcal{D}_y}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[\log\left(1+\exp\left[-\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\left(\frac{h(x^{++})}{\zeta(h(x^{++}))}-\frac{h(x^{-})}{\zeta(h(x^{-}))}\right)\right]\right)\right]
\ge \log\left(1+\exp\left[-\,\mathrm{cov}_{x\sim\mathcal{D}_y,\tilde{x}'\sim\mathcal{D}_{\tilde{x}},\alpha'\sim\mathcal{D}_{\alpha}}\left[\frac{h(x^{+})}{\zeta(h(x^{+}))},\frac{h(x^{++})}{\zeta(h(x^{++}))}\right]\right]\right).
\]
Proof. Using Lemma 4,
\begin{align*}
&\mathbb{E}_{\bar{y}|y}\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]\\
&= \sum_{\bar{y}\in\{0,1\}}p_{\bar{y}}(\bar{y}\,|\,y)\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]\\
&= \Pr(\bar{y}\neq y\,|\,y)\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
+ \Pr(\bar{y}=y\,|\,y)\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}=y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]\\
&= (1-\rho_y)\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}\left[\ell\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right]
+ \rho_y\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}=y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})],
\end{align*}
which proves the desired statement for the first term. We now focus on the second term. Using Lemmas 1 and 2 with $q=\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\big(\frac{h(x^{++})}{\zeta(h(x^{++}))}-\frac{h(x^{-})}{\zeta(h(x^{-}))}\big)$,
\[
\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-}) = \ell(q,1) = -\log\left(\frac{\exp(q)}{1+\exp(q)}\right) = -\log\left(\frac{1}{1+\exp(-q)}\right) = \log(1+\exp(-q)).
\]
Therefore,
\[
\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}=y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= \mathbb{E}_{x,\bar{x}\sim\mathcal{D}_y}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}\left[\log\left(1+\exp\left[-\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\left(\frac{h(x^{++})}{\zeta(h(x^{++}))}-\frac{h(x^{-})}{\zeta(h(x^{-}))}\right)\right]\right)\right] = E,
\]
which proves the desired statement with $E$. We now focus on the lower bound on $E$. By the convexity of $q\mapsto\log(1+\exp(-q))$ and Jensen's inequality,
\[
E \ge \log\left(1+\exp\left[\mathbb{E}_{x,\bar{x}\sim\mathcal{D}_y}\,\mathbb{E}_{\tilde{x}',\tilde{x}'',\alpha',\alpha''}\left[\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\left(\frac{h(x^{-})}{\zeta(h(x^{-}))}-\frac{h(x^{++})}{\zeta(h(x^{++}))}\right)\right]\right]\right).
\]
Here, using the coordinate-wise covariance notation $\mathrm{cov}[\cdot,\cdot]$ defined above and the independence of $x^{+}$ and $x^{-}$,
\begin{align*}
\mathbb{E}\left[\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\frac{h(x^{++})}{\zeta(h(x^{++}))}\right]
&= \mathbb{E}\left[\frac{h(x^{+})}{\zeta(h(x^{+}))}\right]^{\top}\mathbb{E}\left[\frac{h(x^{++})}{\zeta(h(x^{++}))}\right]
+ \mathrm{cov}\left[\frac{h(x^{+})}{\zeta(h(x^{+}))},\frac{h(x^{++})}{\zeta(h(x^{++}))}\right],\\
\mathbb{E}\left[\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\frac{h(x^{-})}{\zeta(h(x^{-}))}\right]
&= \mathbb{E}\left[\frac{h(x^{+})}{\zeta(h(x^{+}))}\right]^{\top}\mathbb{E}\left[\frac{h(x^{-})}{\zeta(h(x^{-}))}\right].
\end{align*}
Moreover, since $\bar{x}\sim\mathcal{D}_y$ here, $\mathbb{E}_{x\sim\mathcal{D}_y,\tilde{x}'\sim\mathcal{D}_{\tilde{x}},\alpha'\sim\mathcal{D}_{\alpha}}\big[\frac{h(x^{++})}{\zeta(h(x^{++}))}\big] = \mathbb{E}_{\bar{x}\sim\mathcal{D}_y,\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\alpha''\sim\mathcal{D}_{\alpha}}\big[\frac{h(x^{-})}{\zeta(h(x^{-}))}\big]$, so the two expectations above differ only by the covariance term:
\[
\mathbb{E}\left[\frac{h(x^{+})^{\top}}{\zeta(h(x^{+}))}\left(\frac{h(x^{-})}{\zeta(h(x^{-}))}-\frac{h(x^{++})}{\zeta(h(x^{++}))}\right)\right]
= -\,\mathrm{cov}\left[\frac{h(x^{+})}{\zeta(h(x^{+}))},\frac{h(x^{++})}{\zeta(h(x^{++}))}\right].
\]
Substituting this into the above inequality on $E$,
\[
E \ge \log\left(1+\exp\left[-\,\mathrm{cov}\left[\frac{h(x^{+})}{\zeta(h(x^{+}))},\frac{h(x^{++})}{\zeta(h(x^{++}))}\right]\right]\right),
\]
which proves the desired statement for the lower bound on $E$.
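The key algebraic step above is the coordinate-wise covariance identity $\mathbb{E}[u^{\top}v]=\mathbb{E}[u]^{\top}\mathbb{E}[v]+\mathrm{cov}[u,v]$ with $\mathrm{cov}[u,v]=\sum_k\mathrm{cov}(u_k,v_k)$. A quick Monte Carlo sanity check (a sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200_000, 4
z = rng.normal(size=(n, d))
u = z + 0.3 * rng.normal(size=(n, d))     # u and v are correlated through z
v = 2.0 * z + rng.normal(size=(n, d))

lhs = np.mean(np.sum(u * v, axis=1))                               # E[u^T v]
cov_sum = sum(np.cov(u[:, k], v[:, k])[0, 1] for k in range(d))    # sum_k cov(u_k, v_k)
rhs = u.mean(0) @ v.mean(0) + cov_sum
assert np.isclose(lhs, rhs, rtol=1e-2)
```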
With these lemmas, we are now ready to prove Theorem 1.
Proof of Theorem 1.
From Lemma 5, we have that
\[
\mathbb{E}_{\bar{y}|y}\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= (1-\rho_y)\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[\ell_{\mathrm{cf}}\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right] + \rho_y E.
\]
By taking the expectation over $y$ on both sides,
\[
\mathbb{E}_{y,\bar{y}}\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= \mathbb{E}_{y}\,\mathbb{E}_{x\sim\mathcal{D}_y,\bar{x}\sim\mathcal{D}_{\bar{y}\neq y}}\,\mathbb{E}_{\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[(1-\rho_y)\,\ell_{\mathrm{cf}}\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right] + \mathbb{E}_{y}[\rho_y E].
\]
Since $\mathbb{E}_{y}\mathbb{E}_{x\sim\mathcal{D}_y}[\varphi(x)]=\mathbb{E}_{(x,y)\sim\mathcal{D}}[\varphi(x)]=\mathbb{E}_{x\sim\mathcal{D}_x}[\varphi(x)]$ for a function $\varphi$ of $x$, we have
\[
\mathbb{E}_{x,\bar{x}\sim\mathcal{D}_x,\,\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}[\ell_{\mathrm{ctr}}(x^{+},x^{++},x^{-})]
= \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\mathbb{E}_{\bar{x}\sim\mathcal{D}_{\bar{y}},\,\tilde{x}',\tilde{x}''\sim\mathcal{D}_{\tilde{x}},\,\alpha',\alpha''\sim\mathcal{D}_{\alpha}}\left[\bar{\rho}(y)\,\ell_{\mathrm{cf}}\left(\frac{h(x^{+})^{\top}\tilde{w}}{\zeta(h(x^{+}))},\,y\right)\right] + \mathbb{E}_{y}[(1-\bar{\rho}(y))\,E].
\]
Taking expectations over $\tilde{x}\sim\mathcal{D}_{\tilde{x}}$ and $\alpha\sim\mathcal{D}_{\alpha}$ on both sides yields the desired statement.
B.2 Proof of Theorem 2

We begin by introducing additional notation. Define $\ell_{f,y}(q)=\ell(f(q),y)$ and $\ell_y(q)=\ell(q,y)$. Note that $\ell(f(q),y)=\ell_{f,y}(q)=(\ell_y\circ f)(q)$. The following shows that contrastive pre-training is related to minimizing the standard classification loss $\ell(f(x),y)$ while regularizing the change of the loss values in the direction of $\delta(x,\tilde{x})$:
Lemma 6.
Assume that $\ell_{f,y}$ is twice differentiable. Then there exists a function $\varphi_1$ such that $\lim_{q\to 0}\varphi_1(q)=0$ and
\[
\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \alpha\,\nabla\ell_{f,y}(x)^{\top}\delta(x,\tilde{x}) + \frac{\alpha^{2}}{2}\,\delta(x,\tilde{x})^{\top}\nabla^{2}\ell_{f,y}(x)\,\delta(x,\tilde{x}) + \alpha^{2}\varphi_1(\alpha).
\]
Proof. Let $x$ be an arbitrary point in the domain of $f$. Let $\varphi(\alpha)=\ell(f(x^{+}),y)=\ell_{f,y}(x+\alpha\,\delta(x,\tilde{x}))$. Then, by the definition of twice differentiability of the function $\varphi$, there exists a function $\varphi_1$ such that
\[
\ell\big(f(x^{+}),y\big) = \varphi(\alpha) = \varphi(0) + \varphi'(0)\,\alpha + \tfrac{1}{2}\varphi''(0)\,\alpha^{2} + \alpha^{2}\varphi_1(\alpha), \qquad (11)
\]
where $\lim_{\alpha\to 0}\varphi_1(\alpha)=0$. By the chain rule,
\begin{align*}
\varphi'(\alpha) &= \frac{\partial\ell(f(x^{+}),y)}{\partial\alpha} = \frac{\partial\ell(f(x^{+}),y)}{\partial x^{+}}\frac{\partial x^{+}}{\partial\alpha} = \frac{\partial\ell(f(x^{+}),y)}{\partial x^{+}}\,\delta(x,\tilde{x}) = \nabla\ell_{f,y}(x^{+})^{\top}\delta(x,\tilde{x}),\\
\varphi''(\alpha) &= \delta(x,\tilde{x})^{\top}\left[\frac{\partial}{\partial\alpha}\left(\frac{\partial\ell(f(x^{+}),y)}{\partial x^{+}}\right)^{\top}\right]
= \delta(x,\tilde{x})^{\top}\left[\frac{\partial}{\partial x^{+}}\left(\frac{\partial\ell(f(x^{+}),y)}{\partial x^{+}}\right)^{\top}\right]\frac{\partial x^{+}}{\partial\alpha}
= \delta(x,\tilde{x})^{\top}\nabla^{2}\ell_{f,y}(x^{+})\,\delta(x,\tilde{x}).
\end{align*}
Therefore,
\[
\varphi'(0) = \nabla\ell_{f,y}(x)^{\top}\delta(x,\tilde{x}), \qquad \varphi''(0) = \delta(x,\tilde{x})^{\top}\nabla^{2}\ell_{f,y}(x)\,\delta(x,\tilde{x}).
\]
Substituting these into the equation above based on the definition of twice differentiability,
\[
\ell\big(f(x^{+}),y\big) = \varphi(\alpha) = \ell(f(x),y) + \alpha\,\nabla\ell_{f,y}(x)^{\top}\delta(x,\tilde{x}) + \frac{\alpha^{2}}{2}\,\delta(x,\tilde{x})^{\top}\nabla^{2}\ell_{f,y}(x)\,\delta(x,\tilde{x}) + \alpha^{2}\varphi_1(\alpha).
\]
Whereas the above lemma is at the level of the loss, we now analyze the phenomenon at the level of the model:
Lemma 7.
Let $x$ be a fixed point in the domain of $f$. Given the fixed $x$, let $w\in\mathcal{W}$ be a point such that $\nabla f(x)$ and $\nabla^{2}f(x)$ exist. Assume that $f(x)=\nabla f(x)^{\top}x$ and $\nabla^{2}f(x)=0$. Then we have
\[
\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \alpha\,(\psi(f(x))-y)\,\nabla f(x)^{\top}\delta(x,\tilde{x}) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\big|\nabla f(x)^{\top}\delta(x,\tilde{x})\big|^{2} + \alpha^{2}\varphi_1(\alpha),
\]
where $\psi'(\cdot)=\psi(\cdot)(1-\psi(\cdot))>0$.
Proof. Under these conditions,
\begin{align*}
\nabla\ell_{f,y}(x) &= \nabla(\ell_y\circ f)(x) = \ell_y'(f(x))\,\nabla f(x),\\
\nabla^{2}\ell_{f,y}(x) &= \ell_y''(f(x))\,\nabla f(x)\nabla f(x)^{\top} + \ell_y'(f(x))\,\nabla^{2}f(x) = \ell_y''(f(x))\,\nabla f(x)\nabla f(x)^{\top}.
\end{align*}
Substituting these into Lemma 6 yields
\begin{align*}
\ell\big(f(x^{+}),y\big) &= \ell(f(x),y) + \alpha\,\ell_y'(f(x))\,\nabla f(x)^{\top}\delta(x,\tilde{x}) + \frac{\alpha^{2}}{2}\,\ell_y''(f(x))\,\delta(x,\tilde{x})^{\top}\big[\nabla f(x)\nabla f(x)^{\top}\big]\delta(x,\tilde{x}) + \alpha^{2}\varphi_1(\alpha)\\
&= \ell(f(x),y) + \alpha\,\ell_y'(f(x))\,\nabla f(x)^{\top}\delta(x,\tilde{x}) + \frac{\alpha^{2}}{2}\,\ell_y''(f(x))\,\big[\nabla f(x)^{\top}\delta(x,\tilde{x})\big]^{2} + \alpha^{2}\varphi_1(\alpha).
\end{align*}
Using Lemma 1, we can rewrite the loss as
\[
\ell(f(x),y) = -\log\frac{\exp(yf(x))}{1+\exp(f(x))} = \log[1+\exp(f(x))] - y\,f(x),
\]
so that
\[
\ell_y'(f(x)) = \psi(f(x)) - y, \qquad \ell_y''(f(x)) = \psi'(f(x)),
\]
where $\psi(q)=\frac{1}{1+\exp(-q)}$ is the logistic sigmoid and $\psi'(q)=\psi(q)(1-\psi(q))$. Substituting these into the equation above, we have
\[
\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \alpha\,(\psi(f(x))-y)\,\nabla f(x)^{\top}\delta(x,\tilde{x}) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\big[\nabla f(x)^{\top}\delta(x,\tilde{x})\big]^{2} + \alpha^{2}\varphi_1(\alpha).
\]
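A small numerical check of the expansion in Lemma 7 for a linear model (a sketch, not from the paper): for $f(x)=w^{\top}x$, the first- and second-order terms below match the exact loss up to a remainder that shrinks like $\alpha^{3}$.

```python
import numpy as np

rng = np.random.default_rng(4)
w, x, delta = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)
y = 1.0

softplus = lambda q: np.log1p(np.exp(q))       # log(1 + exp(q))
sigmoid = lambda q: 1.0 / (1.0 + np.exp(-q))   # psi(q)
loss = lambda q: softplus(q) - y * q           # logistic loss, as in the proof

fx, g = w @ x, w @ delta                       # f(x) and grad f(x)^T delta
for alpha in [1e-1, 1e-2, 1e-3]:
    exact = loss(w @ (x + alpha * delta))
    second_order = (loss(fx)
                    + alpha * (sigmoid(fx) - y) * g
                    + 0.5 * alpha**2 * sigmoid(fx) * (1 - sigmoid(fx)) * g**2)
    print(alpha, abs(exact - second_order))    # decreases roughly like alpha**3
```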
Lemma 8.
Let $\delta(x,\tilde{x})=\tilde{x}-x$. Let $x$ be a fixed point in the domain of $f$. Given the fixed $x$, let $w\in\mathcal{W}$ be a point such that $\nabla f(x)$ and $\nabla^{2}f(x)$ exist. Assume that $f(x)=\nabla f(x)^{\top}x$ and $\nabla^{2}f(x)=0$. Assume that $\mathbb{E}_{\tilde{x}}[\tilde{x}]=0$. Then, if $yf(x)+(y-1)f(x)\ge 0$,
\[
\mathbb{E}_{\tilde{x}}\,\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + c_{1}(x)\,\|\nabla f(x)\| + c_{2}(x)\,\|\nabla f(x)\|^{2} + c_{3}(x)\,\|\nabla f(x)\|^{2}_{\mathbb{E}_{\tilde{x}\sim\mathcal{D}_{\tilde{x}}}[\tilde{x}\tilde{x}^{\top}]} + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)],
\]
where
\[
c_{1}(x) = \alpha\,|\cos(\nabla f(x),x)|\,|y-\psi(f(x))|\,\|x\| \ge 0,\qquad
c_{2}(x) = \frac{\alpha^{2}}{2}\,\cos^{2}(\nabla f(x),x)\,\|x\|^{2}\,|\psi'(f(x))| \ge 0,\qquad
c_{3}(x) = \frac{\alpha^{2}}{2}\,|\psi'(f(x))| > 0,
\]
and $\|w\|^{2}_{M}=w^{\top}Mw$.
Proof.
Using Lemma 7 with $\delta(x,\tilde{x})=\tilde{x}-x$,
\begin{align*}
\ell\big(f(x^{+}),y\big) &= \ell(f(x),y) + \alpha\,(\psi(f(x))-y)\,\nabla f(x)^{\top}(\tilde{x}-x) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\big|\nabla f(x)^{\top}(\tilde{x}-x)\big|^{2} + \alpha^{2}\varphi_1(\alpha)\\
&= \ell(f(x),y) - \alpha\,(\psi(f(x))-y)\,\nabla f(x)^{\top}(x-\tilde{x}) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\big|\nabla f(x)^{\top}(x-\tilde{x})\big|^{2} + \alpha^{2}\varphi_1(\alpha)\\
&= \ell(f(x),y) + \alpha\,(y-\psi(f(x)))\,\big(f(x)-\nabla f(x)^{\top}\tilde{x}\big) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\big|f(x)-\nabla f(x)^{\top}\tilde{x}\big|^{2} + \alpha^{2}\varphi_1(\alpha).
\end{align*}
Therefore, using $\mathbb{E}_{\tilde{x}}[\tilde{x}]=0$,
\[
\mathbb{E}_{\tilde{x}}\,\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \alpha\,[y-\psi(f(x))]\,f(x) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\mathbb{E}_{\tilde{x}}\big|f(x)-\nabla f(x)^{\top}\tilde{x}\big|^{2} + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)].
\]
Since $|f(x)-\nabla f(x)^{\top}\tilde{x}|^{2}=f(x)^{2}-2f(x)\nabla f(x)^{\top}\tilde{x}+(\nabla f(x)^{\top}\tilde{x})^{2}$,
\[
\mathbb{E}_{\tilde{x}}\big|f(x)-\nabla f(x)^{\top}\tilde{x}\big|^{2} = f(x)^{2} + \mathbb{E}_{\tilde{x}}\big(\nabla f(x)^{\top}\tilde{x}\big)^{2} = f(x)^{2} + \nabla f(x)^{\top}\mathbb{E}_{\tilde{x}}[\tilde{x}\tilde{x}^{\top}]\,\nabla f(x).
\]
Thus,
\[
\mathbb{E}_{\tilde{x}}\,\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \alpha\,[y-\psi(f(x))]\,f(x) + \frac{\alpha^{2}}{2}\,|\psi'(f(x))|\,\big[f(x)^{2}+\nabla f(x)^{\top}\mathbb{E}_{\tilde{x}}[\tilde{x}\tilde{x}^{\top}]\,\nabla f(x)\big] + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)].
\]
The assumption that $yf(x)+(y-1)f(x)\ge 0$ implies that $f(x)\ge 0$ if $y=1$ and $f(x)\le 0$ if $y=0$. Thus, if $y=1$, $[y-\psi(f(x))]f(x)=[1-\psi(f(x))]f(x)\ge 0$, since $f(x)\ge 0$ and $1-\psi(f(x))\ge 0$ due to $\psi(f(x))\in(0,1)$. If $y=0$, $[y-\psi(f(x))]f(x)=-\psi(f(x))f(x)\ge 0$, since $f(x)\le 0$ and $-\psi(f(x))<0$. Therefore, in both cases, $[y-\psi(f(x))]f(x)\ge 0$, which implies that
\[
[y-\psi(f(x))]\,f(x) = \big|[y-\psi(f(x))]\,f(x)\big| = |y-\psi(f(x))|\,\big|\nabla f(x)^{\top}x\big| = |y-\psi(f(x))|\,\|\nabla f(x)\|\,\|x\|\,|\cos(\nabla f(x),x)|.
\]
Moreover, $f(x)^{2}=\|\nabla f(x)\|^{2}\,\|x\|^{2}\cos^{2}(\nabla f(x),x)$. Combining the last three displays,
\[
\mathbb{E}_{\tilde{x}}\,\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + c_{1}(x)\,\|\nabla f(x)\| + c_{2}(x)\,\|\nabla f(x)\|^{2} + c_{3}(x)\,\nabla f(x)^{\top}\mathbb{E}_{\tilde{x}}[\tilde{x}\tilde{x}^{\top}]\,\nabla f(x) + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)].
\]
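The expectation computed above, $\mathbb{E}_{\tilde{x}}|f(x)-\nabla f(x)^{\top}\tilde{x}|^{2}=f(x)^{2}+\nabla f(x)^{\top}\mathbb{E}_{\tilde{x}}[\tilde{x}\tilde{x}^{\top}]\nabla f(x)$ when $\mathbb{E}_{\tilde{x}}[\tilde{x}]=0$, is easy to verify by simulation (a sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 6, 100_000
w, x = rng.normal(size=d), rng.normal(size=d)        # w plays the role of grad f(x)
fx = w @ x                                           # f(x) = grad f(x)^T x (linear model)

xt = rng.normal(size=(n, d)) * rng.uniform(0.5, 2.0, size=d)  # noise samples
xt -= xt.mean(axis=0)                                # enforce E[x_tilde] = 0 exactly

lhs = np.mean((fx - xt @ w) ** 2)                    # E |f(x) - grad f(x)^T x_tilde|^2
Sigma = (xt.T @ xt) / n                              # empirical E[x_tilde x_tilde^T]
rhs = fx**2 + w @ Sigma @ w
assert np.isclose(lhs, rhs)
```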
In the case of Gaussian noise, we have $\delta(x,\tilde{x})=\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)$:
Lemma 9.
Let $\delta(x,\tilde{x})=\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)$. Let $x$ be a fixed point in the domain of $f$. Given the fixed $x$, let $w\in\mathcal{W}$ be a point such that $\nabla f(x)$ and $\nabla^{2}f(x)$ exist. Assume that $f(x)=\nabla f(x)^{\top}x$ and $\nabla^{2}f(x)=0$. Then
\[
\mathbb{E}_{\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)}\,\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \sigma^{2}c_{3}(x)\,\|\nabla f(x)\|^{2} + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)],
\]
where $c_{3}(x)=\frac{\alpha^{2}}{2}\,|\psi'(f(x))|>0$ is as in Lemma 8.
Proof.
With $\delta(x,\tilde{x})=\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)$, Lemma 7 yields
\[
\ell\big(f(x^{+}),y\big) = \ell(f(x),y) + \alpha\,(\psi(f(x))-y)\,\nabla f(x)^{\top}\tilde{x} + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\big|\nabla f(x)^{\top}\tilde{x}\big|^{2} + \alpha^{2}\varphi_1(\alpha).
\]
Thus,
\begin{align*}
\mathbb{E}_{\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)}\,\ell\big(f(x^{+}),y\big)
&= \ell(f(x),y) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\mathbb{E}_{\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)}\big|\nabla f(x)^{\top}\tilde{x}\big|^{2} + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)]\\
&= \ell(f(x),y) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\nabla f(x)^{\top}\mathbb{E}_{\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)}[\tilde{x}\tilde{x}^{\top}]\,\nabla f(x) + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)]\\
&= \ell(f(x),y) + \frac{\alpha^{2}}{2}\,\psi'(f(x))\,\|\nabla f(x)\|^{2}_{\mathbb{E}_{\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)}[\tilde{x}\tilde{x}^{\top}]} + \mathbb{E}_{\tilde{x}}[\alpha^{2}\varphi_1(\alpha)].
\end{align*}
By noticing that $\|w\|^{2}_{\mathbb{E}_{\tilde{x}\sim\mathcal{N}(0,\sigma^{2}I)}[\tilde{x}\tilde{x}^{\top}]}=\sigma^{2}w^{\top}Iw=\sigma^{2}\|w\|^{2}$, this implies the desired statement.
Combining Lemmas 8 and 9 yields the statement of Theorem 2.
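The contrast between Lemmas 8 and 9 is that Mixup-noise penalizes the data-dependent quadratic form $\nabla f(x)^{\top}\mathbb{E}_{\tilde{x}}[\tilde{x}\tilde{x}^{\top}]\nabla f(x)$, whereas Gaussian-noise penalizes $\sigma^{2}\|\nabla f(x)\|^{2}$, which treats all input directions equally. A small sketch (not from the paper) makes this concrete on anisotropic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 10_000, 2
data = rng.normal(size=(n, d)) * np.array([3.0, 0.1])   # variance ~9 along dim 0, ~0.01 along dim 1
Sigma_X = (data.T @ data) / n                            # data second-moment matrix

w_along = np.array([1.0, 0.0])   # gradient aligned with the high-variance data direction
w_ortho = np.array([0.0, 1.0])   # gradient along a direction the data barely occupies

# Mixup-noise regularizer is direction dependent ...
print(w_along @ Sigma_X @ w_along, w_ortho @ Sigma_X @ w_ortho)   # ~9.0 vs ~0.01
# ... whereas the Gaussian-noise regularizer sigma^2 * ||w||^2 is not.
sigma2 = 1.0
print(sigma2 * w_along @ w_along, sigma2 * w_ortho @ w_ortho)     # 1.0 vs 1.0
```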
B.3 Proof of Theorem 3

Proof.
Applying the standard result of (Bartlett & Mendelson, 2002) yields that, with probability at least $1-\delta$,
\[
\mathbb{E}_{(x,y)}\big[\mathbb{1}[(2y-1)\neq\operatorname{sign}(f(x))]\big] - \frac{1}{n}\sum_{i=1}^{n}\phi\big((2y_i-1)f(x_i)\big) \le L_{\phi}\,R_n\big(\mathcal{F}^{(\mathrm{mix})}_{b}\big) + \sqrt{\frac{\ln(2/\delta)}{2n}}.
\]
The rest of the proof bounds the Rademacher complexity $R_n(\mathcal{F}^{(\mathrm{mix})}_{b})$:
\begin{align*}
\hat{R}_n\big(\mathcal{F}^{(\mathrm{mix})}_{b}\big)
&= \mathbb{E}_{\xi}\sup_{f\in\mathcal{F}_b}\frac{1}{n}\sum_{i=1}^{n}\xi_i f(x_i)
= \mathbb{E}_{\xi}\sup_{w:\,\|w\|^{2}_{\mathbb{E}_{\tilde{x}\sim\mathcal{D}_x}[\tilde{x}\tilde{x}^{\top}]}\le b}\frac{1}{n}\sum_{i=1}^{n}\xi_i\,w^{\top}x_i\\
&= \mathbb{E}_{\xi}\sup_{w:\,w^{\top}\Sigma_X w\le b}\frac{1}{n}\sum_{i=1}^{n}\xi_i\,(\Sigma_X^{1/2}w)^{\top}\Sigma_X^{\dagger 1/2}x_i
\le \frac{1}{n}\,\mathbb{E}_{\xi}\sup_{w:\,w^{\top}\Sigma_X w\le b}\big\|\Sigma_X^{1/2}w\big\|_2\,\Bigg\|\sum_{i=1}^{n}\xi_i\,\Sigma_X^{\dagger 1/2}x_i\Bigg\|_2\\
&\le \frac{\sqrt{b}}{n}\,\mathbb{E}_{\xi}\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}\xi_i\xi_j\,(\Sigma_X^{\dagger 1/2}x_i)^{\top}(\Sigma_X^{\dagger 1/2}x_j)}
\le \frac{\sqrt{b}}{n}\sqrt{\mathbb{E}_{\xi}\sum_{i=1}^{n}\sum_{j=1}^{n}\xi_i\xi_j\,(\Sigma_X^{\dagger 1/2}x_i)^{\top}(\Sigma_X^{\dagger 1/2}x_j)}\\
&= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}(\Sigma_X^{\dagger 1/2}x_i)^{\top}(\Sigma_X^{\dagger 1/2}x_i)}
= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}x_i^{\top}\Sigma_X^{\dagger}x_i}.
\end{align*}
Therefore,
\begin{align*}
R_n\big(\mathcal{F}^{(\mathrm{mix})}_{b}\big) = \mathbb{E}_{S}\,\hat{R}_n\big(\mathcal{F}^{(\mathrm{mix})}_{b}\big)
&= \mathbb{E}_{S}\,\frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}x_i^{\top}\Sigma_X^{\dagger}x_i}
\le \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}\mathbb{E}_{x_i}\,x_i^{\top}\Sigma_X^{\dagger}x_i}
= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}\sum_{k,l}(\Sigma_X^{\dagger})_{kl}\,\mathbb{E}_{x_i}(x_i)_k(x_i)_l}\\
&= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}\sum_{k,l}(\Sigma_X^{\dagger})_{kl}(\Sigma_X)_{kl}}
= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}\operatorname{tr}(\Sigma_X^{\top}\Sigma_X^{\dagger})}
= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}\operatorname{tr}(\Sigma_X\Sigma_X^{\dagger})}
= \frac{\sqrt{b}}{n}\sqrt{\sum_{i=1}^{n}\operatorname{rank}(\Sigma_X)}
\le \frac{\sqrt{b}\,\sqrt{\operatorname{rank}(\Sigma_X)}}{\sqrt{n}}.
\end{align*}
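The final step uses $\mathbb{E}_{x}[x^{\top}\Sigma_X^{\dagger}x]=\operatorname{tr}(\Sigma_X\Sigma_X^{\dagger})=\operatorname{rank}(\Sigma_X)$. The following sketch (not from the paper) checks this on data confined to a low-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, r = 50_000, 10, 3
basis = rng.normal(size=(r, d))               # data lives in a rank-3 subspace of R^10
x = rng.normal(size=(n, r)) @ basis
Sigma_X = (x.T @ x) / n                       # empirical second-moment matrix
Sigma_pinv = np.linalg.pinv(Sigma_X)

mean_quad = np.mean(np.einsum('ij,jk,ik->i', x, Sigma_pinv, x))   # E[x^T Sigma^+ x]
print(mean_quad, np.trace(Sigma_X @ Sigma_pinv), np.linalg.matrix_rank(Sigma_X))  # all ~3
```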
C Best Hyperparameter values for various experiments

In Tables 5, 6 and 7, we present the best hyperparameter values for the experiments in Section 4.
D Graph classification datasets
Table 8 lists the number of graphs, the number of classes, and the average number of nodes per graph for each of the graph classification datasets.

Method         | Fashion-MNIST              | CIFAR10
Gaussian-noise | Gaussian-mean=0.1, τ=1.0   | Gaussian-mean=0.05, τ=1.0
DACL           | α=0.9, τ=1.0               | α=0.9, τ=1.0
DACL+          | α=0.6, τ=1.0, ρ=0.1        | α=0.7, τ=1.0, ρ=0.5
Table 5: Best hyperparameter values for experiments on tabular data (Table 1)

Method         | CIFAR10                    | CIFAR100
Gaussian-noise | Gaussian-mean=0.05, τ=0.1  | Gaussian-mean=0.05, τ=0.1
DACL           | α=0.9, τ=1.0               | α=0.9, τ=1.0
DACL+          | α=0.9, ρ=0.1, τ=1.0        | α=0.9, ρ=0.5, τ=1.0
SimCLR         | τ=0.5                      | τ=0.5
SimCLR+DACL    | α=0.7, τ=1.0               | α=0.7, τ=1.0
Table 6: Best hyperparameter values for experiments on the CIFAR10/100 datasets (Table 2)

Method         | ImageNet
Gaussian-noise | Gaussian-mean=0.1, τ=1.0
DACL           | α=0.9, τ=1.0
SimCLR         | τ=0.1
SimCLR+DACL    | α=0.9, τ=0.1
Table 7: Best hyperparameter values for experiments on ImageNet data (Table 3)

Dataset         | MUTAG | PTC-MR | RDT-BINARY | RDT-M5K | IMDB-BINARY | IMDB-MULTI
No. graphs      | 188   | 344    | 2000       | 4999    | 1000        | 1500
No. classes     | 2     | 2      | 2          | 5       | 2           | 3
Avg. graph size | 17.93 | 14.29  | 429.63     | 508.52  | 19.77       | 13.00
Table 8: Number of graphs, number of classes, and average graph size for the graph classification datasets