For self-supervised learning, Rationality implies generalization, provably
Yamini Bansal ∗ Harvard University
Gal Kaplun ∗ Harvard University
Boaz Barak† Harvard University

ABSTRACT
We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation r of the training data, and then fitting a simple (e.g., linear) classifier g to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if C(g) ≪ n, where C(g) is an appropriately-defined measure of the simple classifier g's complexity, and n is the number of training samples. We stress that our bound is independent of the complexity of the representation r. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding a small amount of label noise causes small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and BigBiGAN.

1 INTRODUCTION
The current standard approach for classification is "end-to-end supervised learning," where one fits a complex (e.g., a deep neural network) classifier to the given training set (Tan & Le, 2019; He et al., 2016). However, modern classifiers are heavily over-parameterized, and as demonstrated by Zhang et al. (2017), can fit 100% of their training set even when given random labels as inputs (in which case test performance is no better than chance). Hence, the training performance of such methods is by itself no indication of their performance on new unseen test points.

In this work, we study a different class of supervised learning procedures that have recently attracted significant interest. These classifiers are obtained by: (i) performing pre-training with a self-supervised task (i.e., without labels) to obtain a complex representation of the data points, and then (ii) fitting a simple (e.g., linear) classifier on the representation and the labels. Such "Self-Supervised + Simple" (SSS for short) algorithms are commonly used in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020), and have recently found uses in other domains as well (Baevski et al., 2020; Ravanelli et al., 2020; Liu et al., 2019). (In this work we focus only on algorithms that learn a representation, "freeze" it, and then perform classification using a simple classifier. We do not consider algorithms that "fine tune" the entire representation.)

Compared to standard "end-to-end supervised learning," SSS algorithms have several practical advantages. In particular, SSS algorithms can incorporate additional unlabeled data, the representation obtained can be useful for multiple downstream tasks, and they can have improved out-of-distribution performance (Hendrycks et al., 2019). Moreover, recent works show that even without additional unlabeled data, SSS algorithms can get close to state-of-the-art accuracy in several classification tasks (Chen et al., 2020b; He et al., 2020; Misra & Maaten, 2020; Tian et al., 2019). For instance, SimCLRv2 (Chen et al., 2020b) achieves top-1 performance on ImageNet with a variant of ResNet-152 that is on par with the end-to-end supervised accuracy of this architecture.

∗ Equal contribution. Email: {ybansal, galkaplun}@g.harvard.edu. † Email: [email protected].

We show that SSS algorithms have another advantage over standard supervised learning: they often have a small generalization gap between their train and test accuracy, and we prove non-vacuous bounds on this gap. We stress that SSS algorithms use over-parameterized models to extract the representation, and reuse the same training data to learn a simple classifier on this representation. Thus, the final classifier they produce has high complexity by most standard measures, and the resulting representation could "memorize" the training set. Consequently, it is not a priori evident that their generalization gap will be small.

Our bound is obtained by first noting that the generalization gap of every training algorithm is bounded by the sum of three quantities, which we name the Robustness gap, Rationality gap, and
Memorization gap (we call this the RRM bound; see Fact I). We now describe these gaps at a high level, deferring the formal definitions to Section 2. All three gaps involve comparison with a setting where we inject label noise by replacing a small fraction η of the labels with random values.

The robustness gap corresponds to the amount by which training performance degrades under noise injection. That is, it equals the difference between the standard expected training accuracy (with no label noise) and the expected training accuracy in the noisy setting; in both cases, we measure accuracy with respect to the original (uncorrupted) labels. The robustness gap is nearly always small, and sometimes provably so (see Section 4).

The rationality gap corresponds to the difference between performance on the noisy training samples (on which the training algorithm gets the wrong label) and test samples (on which it doesn't get any label at all), again with respect to uncorrupted labels. An optimal Bayesian procedure would have zero rationality gap, and we show that this gap is typically zero or small in practice.

The memorization gap, which often accounts for the lion's share of the generalization gap, corresponds to the difference in the noisy experiment between the training accuracy on the entire train set and the training accuracy on the samples that received the wrong label (both measured with respect to uncorrupted labels). The memorization gap can be thought of as quantifying the extent to which the classifier can "memorize" noisy labels, or act differently on the noisy points compared to the rest of the train set.

Figure 1 – Empirical RRM bound. The components of the RRM bound, as well as the upper bound of Theorem II, for a variety of SSS models on the CIFAR-10 dataset with noise η = 0.05. Each vertical line corresponds to a single model (architecture + self-supervised task + fitting algorithm) and plots the RRM bound for this model. The green component corresponds to robustness, yellow to rationality, and red to memorization. The x axis is the generalization gap, and so the RRM bound is always above the dashed x = y line. A negative generalization gap can occur in algorithms that use augmentation. The blue dots correspond to the bound on the generalization gap obtained by replacing the memorization gap with the bound of Theorem II. See Sections 5 and B.3 for more information.

Our main theoretical result bounds the memorization gap independently of the complexity of the representation. As long as the simple classifier is under-parameterized (i.e., its complexity is asymptotically smaller than the sample size), our bound on the memorization gap tends to zero. When combined with small rationality and robustness, we get concrete non-vacuous generalization bounds for various SSS algorithms on the CIFAR-10 and ImageNet datasets (see Figures 1 and 4).

Our results.
In a nutshell, our contributions are the following:

1. Our main theoretical result (Theorem II) is that the memorization gap of an SSS algorithm is bounded by O(sqrt(C/n)), where C is the complexity of the simple classifier produced in the "simple fit" stage. This bound is oblivious to the complexity of the representation produced in the pre-training and does not make any assumptions on the relationship between the representation learning method and the supervised learning task.

2. We complement this result with an empirical study of the robustness, rationality, and memorization gaps. We show that the RRM bound is typically non-vacuous, and in fact, often close to tight, for a variety of SSS algorithms on the CIFAR-10 and ImageNet datasets, including SimCLR (which achieves test errors close to its supervised counterparts). Moreover, in our experimental study, we demonstrate that the generalization gap for SSS algorithms is substantially smaller than their fully-supervised counterparts. See Figures 1 and 4 for sample results and Section 5 for more details.

3. We demonstrate that replacing the memorization gap with the upper bound of Theorem II yields a non-vacuous generalization bound for a variety of SSS algorithms on CIFAR-10 and ImageNet. Moreover, this bound gets tighter with more data augmentation.

4. The robustness gap is often negligible in practice, and sometimes provably so (see Section 4). We show that the rationality gap is small in practice as well. We also prove that a positive rationality gap corresponds to "leaving performance on the table", in the sense that we can transform a learning procedure with a large rationality gap into a procedure with better test performance (Theorem 4.1).

One way to interpret our results is that instead of obtaining generalization bounds under statistical assumptions on the distribution, we assume that the rationality and robustness gaps are at most some value (e.g., 5%). Readers might worry that we are "assuming away the difficulty", but small rationality and robustness gaps do not by themselves imply a small generalization gap. Indeed, these conditions widely hold across many natural algorithms (including not just SSS but also end-to-end supervised algorithms) with both small and large generalization gaps. As discussed in Section 4, apart from the empirical evidence, there are also theoretical justifications for small robustness and rationality. See Remark 4.2 and Appendix C for examples showing the necessity of these conditions.

1.1 RELATED WORK

Our work analyses the generalization gap for supervised classifiers that first use self-supervision to learn a representation. We provide a brief exposition of the various types of self-supervised methods in Section 5, and a more detailed discussion in Appendix B.1.

A variety of prior works have provided generalization bounds for supervised deep learning (e.g., Neyshabur et al. (2017); Bartlett et al. (2017); Dziugaite & Roy (2017); Neyshabur et al. (2018); Golowich et al. (2018); Cao & Gu (2019), and references therein). However, many of these bounds provide vacuous guarantees for modern architectures (such as the ones considered in this paper) that have the capacity to memorize their entire training set (Zhang et al., 2017). While some non-vacuous bounds are known (e.g., Zhou et al. (2019) gave a 96.5% bound on the error of MobileNet on ImageNet), Belkin et al. (2019); Nagarajan & Kolter (2019) have highlighted some general barriers for bounding the generalization gaps of over-parameterized networks that are trained end-to-end. For similar reasons, standard approaches such as Rademacher complexity cannot directly bound SSS algorithms' generalization gap (see Remark 4.3).

Recently, Saunshi et al. (2019) and Lee et al. (2020) gave generalization bounds for self-supervision based classifiers. The two works considered special cases of SSS algorithms, such as contrastive learning and pre-text tasks. Both works make strong statistical assumptions of (exact or approximate) conditional independence relating the pre-training and classification tasks. For example, if the pre-training task is obtained by splitting a given image x into two pieces (x_1, x_2) and predicting x_2 from x_1, then Lee et al. (2020)'s results require x_1 and x_2 to be approximately independent conditioned on their class y. However, in many realistic cases, the two parts of the same image will share a significant amount of information not explained by the label.

Our work applies to general SSS algorithms without such statistical assumptions, at the expense of assuming bounds on the robustness and rationality gaps. There have been works providing rigorous bounds on the robustness gap or related quantities (see Section 4). However, as far as we know, the rationality gap has not been explicitly defined or studied before. To bound the memorization gap, we use information-theoretic complexity measures. Various information-theoretic quantities have been proposed to bound the generalization gap in previous work (see Steinke & Zakynthinou (2020) and references therein). While these works bound generalization directly, we bound a different quantity, the memorization gap in the RRM decomposition.

1.2 PAPER ORGANIZATION
Section 2 contains formal definitions and statements of our results. Section 4 provides an overview of prior work and our new results on the three gaps of the RRM bound. In Section 5, we describe our experimental setup and detail our empirical results. Section 7 concludes the paper and discusses important open questions. Section 3 contains the proof of Theorem II, while Section 6 contains the proof of Theorem 4.1. Appendix B fully details our experimental setup.

1.3 NOTATION
We use capital letters (e.g., X) for random variables, lower case letters (e.g., x) for a single value, and bold font (e.g., x) for tuples (which will typically have dimension corresponding to the number of samples, denoted by n). We use x_i for the i-th element of the tuple x. We use calligraphic letters (e.g., X, D) for both sets and distributions.

2 FORMAL STATEMENT OF RESULTS

A training procedure is a (possibly randomized) algorithm T that takes as input a train set (x, y) = (x_i, y_i)_{i ∈ [n]} ∈ (X × Y)^n and outputs a classifier f : X → Y. For our current discussion, we make no assumptions on the type of classifier output or the way that it is computed. We denote the distribution over training sets in (X × Y)^n by D_train and the distribution over test samples in X × Y by D_test. (The train and test data often stem from the same distribution (i.e., D_train = D_test^n), but not always; e.g., this does not hold if we use data augmentation. D_test enters the RRM bound only via the rationality gap, so the assumption of small rationality may be affected if D_train ≠ D_test^n, but the RRM bound still holds.)

The generalization gap of a training algorithm T with respect to a distribution pair D = (D_train, D_test) is the expected difference between its train accuracy (which we denote by Train_{D,T}) and its test performance (which we denote by Test_{D,T}). We will often drop subscripts such as D, T when they can be inferred from the context. We will also consider the η-noisy experiment, which involves computing the classifier f̃ = T(x, ỹ) where ỹ_i = y_i with probability 1 − η and is uniform otherwise. Our starting point is the following observation, which we call the RRM bound (for Robustness, Rationality, and Memorization). The quantities appearing in it are defined in Table 1. We provide our code and data at https://gitlab.com/harvard-machine-learning/.

Fact I (RRM bound). For every noise parameter η > 0, training procedure T, and distribution D = (D_train, D_test) over training sets and test samples, the RRM bound with respect to T and D is

  Train − Test  ≤  (Train − Train(η))_+  +  (NTrain(η) − Test)_+  +  (Train(η) − NTrain(η))_+ ,

where the left-hand side is the generalization gap and the three terms on the right are the robustness gap, the rationality gap, and the memorization gap respectively, and where we denote x_+ = max(x, 0).

Table 1 – The measurements of accuracy in the RRM bound, all with respect to a training algorithm T, distributions (D_train, D_test), and parameter η > 0. The robustness gap is max(Train − Train(η), 0), the rationality gap is max(NTrain(η) − Test, 0), and the memorization gap is max(Train(η) − NTrain(η), 0).

  Test_{D,T}. Training: f = T(x, y) for (x, y) ∼ D_train. Measurement: Pr[f(x) = y] for (x, y) ∼ D_test.
  Train_{D,T}. Training: f = T(x, y) for (x, y) ∼ D_train. Measurement: Pr[f(x_i) = y_i] for a train sample (x_i, y_i).
  Train_{D,T}(η). Training: f̃ = T(x, ỹ) for (x, y) ∼ D_train, where ỹ_i = y_i w.p. 1 − η and uniform otherwise. Measurement: Pr[f̃(x_i) = y_i] for a train sample (x_i, ỹ_i), where y_i is the original label for x_i.
  NTrain_{D,T}(η). Training: f̃ = T(x, ỹ) for (x, y) ∼ D_train, where ỹ_i = y_i w.p. 1 − η and uniform otherwise. Measurement: Pr[f̃(x_i) = y_i | ỹ_i ≠ y_i] for a corrupted train sample x_i, where y_i is the original label for x_i.

The RRM bound is but an observation, as it directly follows from the fact that x_+ ≥ x for every x. However, it is a very useful one. As mentioned above, for natural algorithms, we expect both the robustness and rationality components of this gap to be small, and hence the most significant component is the memorization gap. In this work we show a rigorous upper bound on this gap for SSS models.

We define formally an SSS algorithm to be a training procedure T = (T_pre, T_fit) that is obtained by (1) first training T_pre on x ∈ X^n to get a representation r : X → R and then (2) training T_fit on (r(x), y) for y ∈ Y^n to obtain a classifier g : R → Y. The classifier output by T is f : X → Y defined as f(x) = g(r(x)).
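To make the two-phase structure concrete, the following is a minimal sketch (our own illustration, not the authors' code) of an SSS algorithm: `encode` stands in for an arbitrary frozen representation r produced by T_pre, and T_fit is an L2-regularized linear classifier trained by ridge regression onto one-hot labels. All function names and parameters here are assumptions for illustration.

    import numpy as np

    def fit_simple(feats, labels, num_classes, reg=1e-2):
        # T_fit: ridge regression onto one-hot labels, i.e., an L2-regularized linear g.
        n, d = feats.shape
        onehot = np.eye(num_classes)[labels]                       # (n, k)
        W = np.linalg.solve(feats.T @ feats + reg * np.eye(d), feats.T @ onehot)
        return W                                                   # (d, k) weight matrix

    def predict(W, feats):
        # Labels predicted by the linear classifier g on a batch of representations.
        return np.argmax(feats @ W, axis=1)

    def sss_classifier(encode, train_x, train_y, num_classes):
        # Phase 1 output `encode` is frozen; phase 2 fits the simple classifier on (r(x), y).
        feats = np.stack([encode(x) for x in train_x])             # r(x) for the train set
        W = fit_simple(feats, np.asarray(train_y), num_classes)
        return lambda x: int(predict(W, encode(x)[None, :])[0])    # f(x) = g(r(x))

The key point mirrored by the sketch is that the labels y touch only the cheap second phase; the representation is produced without them and is never updated afterwards.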
Our main theoretical result is the following.

Theorem II (Memorization gap bound). For every SSS algorithm T = (T_pre, T_fit), noise parameter η > 0, and distribution D over X^n × Y^n:

  Memorization gap(T) = (Train_{T,D}(η) − NTrain_{T,D}(η))_+ ≤ O( sqrt(C_η(T_fit)/n) · (1/η) ),

where C_η(T_fit) is a complexity measure of the second-phase training procedure, which in particular is upper bounded by the number of bits required to describe the classifier g (see Definition 2.3).
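The quantities Train(η) and NTrain(η) appearing in Theorem II can be estimated directly by rerunning only the cheap second phase on noisy labels. The sketch below (illustrative only; `fit_simple` and `predict` are second-phase fitting and prediction routines with assumed signatures, e.g., as in the sketch above) runs a single η-noisy experiment for a fixed representation and reports the empirical memorization gap.

    import numpy as np

    def run_noisy_experiment(fit_simple, predict, feats, labels, num_classes, eta=0.05, seed=0):
        # One eta-noisy experiment: corrupt labels, refit only the simple classifier,
        # and measure accuracy with respect to the ORIGINAL labels.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        noisy = labels.copy()
        flip = rng.random(len(labels)) < eta                    # each label resampled w.p. eta
        noisy[flip] = rng.integers(0, num_classes, flip.sum())  # uniform, may equal the true label
        g = fit_simple(feats, noisy, num_classes)
        correct = predict(g, feats) == labels
        corrupted = noisy != labels                              # samples whose label actually changed
        train_eta = correct.mean()                               # Train(eta)
        ntrain_eta = correct[corrupted].mean() if corrupted.any() else float("nan")  # NTrain(eta)
        return train_eta, ntrain_eta, max(train_eta - ntrain_eta, 0.0)               # memorization gap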
2.1 COMPLEXITY MEASURES

We now define three complexity measures, all of which can be plugged in as the measure in Theorem II. The first one, C^mdl, is the minimum description length of a classifier. The other two measures, C^pc and C^dc, are superficially similar to Rademacher complexity (cf. Bartlett & Mendelson (2002)) in the sense that they capture the ability of the hypothesis to correlate with random noise.
Definition 2.3 (Complexity of training procedures). Let T be a training procedure taking as input a set (r, y) = {(r_i, y_i)}_{i=1}^n ∈ (R × Y)^n and outputting a classifier g : R → Y, and let η > 0. For every training set (r, y), we define the following three complexity measures with respect to r, y, η:

• The minimum description length of T is defined as C^mdl_{r,y,η}(T) := H(g), where we consider the model g as a random variable arising in the η-noisy experiment. (The name "minimum description length" is justified by the operational definition of entropy relating it to the minimum amortized length of a prefix-free encoding of a random variable.)

• The prediction complexity of T is defined as C^pc_{r,y,η}(T) := Σ_{i=1}^n I(g(r_i); ỹ_i), where the ỹ_i's are the labels obtained in the η-noisy experiment.

• The (unconditional) deviation complexity of T is defined as C^dc_{r,y,η}(T) := n · I(g(r_i) − y_i ; ỹ_i − y_i), where the random variables above are taken over i ∼ [n] and subtraction is done modulo |Y|, identifying Y with the set {0, ..., |Y| − 1}.

Conditioned on y and the choice of the index i, the deviations g(r_i) − y_i and ỹ_i − y_i determine the predictions g(r_i) and noisy labels ỹ_i, and vice versa. Hence we can think of C^dc as an "averaged" variant of C^pc, where we make the choice of the index i part of the sample space for the random variables. While we expect the two measures to be approximately close, the fact that C^dc takes i into the sample space makes it easier to estimate this quantity in practice without using a large number of experiment repetitions (see Figure B.2 for convergence rates). The measure C^mdl is harder to evaluate in practice, as it requires finding the optimal compression scheme for the classifier.

Section 3 contains the full proof of Theorem II. It is obtained by showing that: (i) for every r, y, η, and T it holds that C^dc_{r,y,η}(T) ≤ C^pc_{r,y,η}(T) ≤ C^mdl_{r,y,η}(T), and (ii) for every SSS algorithm T = (T_pre, T_fit) and distribution D = (D_train, D_test), the memorization gap of T is at most

  sqrt( E_{(x,y)∼D_train} C^dc_{T_pre(x),y,η}(T_fit) ) / ( η · sqrt(2n) ).    (1)

It is the quantity (1) that we compute in our experiments.
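C^dc lends itself to a simple plug-in estimate: run the η-noisy experiment a few times, pool the pairs (g(r_i) − y_i, ỹ_i − y_i) modulo |Y| over samples and trials, and compute the empirical mutual information of the pooled pairs. The sketch below is our own illustration of this idea (again with assumed `fit_simple`/`predict` helpers), not the authors' implementation.

    import numpy as np

    def estimate_cdc(fit_simple, predict, feats, labels, num_classes, eta=0.05, trials=20, seed=0):
        # Plug-in estimate of C^dc = n * I(Delta; N), pooling deviations over samples and trials.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        n = len(labels)
        deltas, noises = [], []
        for _ in range(trials):
            noisy = labels.copy()
            flip = rng.random(n) < eta
            noisy[flip] = rng.integers(0, num_classes, flip.sum())
            g = fit_simple(feats, noisy, num_classes)
            deltas.append((predict(g, feats) - labels) % num_classes)  # Delta_i = g(r_i) - y_i (mod |Y|)
            noises.append((noisy - labels) % num_classes)              # N_i = y~_i - y_i (mod |Y|)
        d, m = np.concatenate(deltas), np.concatenate(noises)
        joint = np.zeros((num_classes, num_classes))
        np.add.at(joint, (d, m), 1.0)                                  # empirical joint of (Delta, N)
        joint /= joint.sum()
        pd_, pn = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        nz = joint > 0
        mi = float((joint[nz] * np.log(joint[nz] / (pd_ @ pn)[nz])).sum())  # I(Delta; N) in nats
        return n * mi

Plugging such an estimate into Theorem 3.2 then gives the memorization-gap bound sqrt(C^dc/(2n))/η; because the estimator pools over all n samples, a modest number of trials already suffices, which is the practical advantage of C^dc over C^pc noted above.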
3 PROOF OF THEOREM II

We now prove Theorem II. We start by relating our three complexity measures. The following theorem shows that C^dc is upper bounded by C^pc, which in turn is bounded by the entropy of g.

Theorem 3.1 (Relation of complexity measures). For every r, y, η > 0, and T,

  C^dc_{r,y,η}(T) ≤ C^pc_{r,y,η}(T) ≤ C^mdl_{r,y,η}(T),

where g is the classifier output by T (considered as a random variable).

Proof. Fix T, r, y, η. We get ỹ by choosing i.i.d. random variables N_1, ..., N_n, each equalling 0 with probability 1 − η and uniform otherwise, and letting ỹ_i = y_i + N_i (mod |Y|).

We start by proving the second inequality, C^pc_{r,y,η}(T) ≤ H(g). Let g = T(r, ỹ) and let p = (g(r_1), ..., g(r_n)) be the vector of predictions. Then

  C^pc_{r,y,η}(T) = Σ_i I(p_i; ỹ_i) = Σ_i I(p_i; N_i),    (2)

with the last equality holding since for fixed y_i, N_i determines ỹ_i and vice versa. However, since the full vector p contains only more information than p_i, the right-hand side of (2) is at most Σ_{i=1}^n I(p; N_i) ≤ I(p; N_1, ..., N_n), using the fact that the N_i random variables are independent (see Lemma A.2). For a fixed r, the value of p is completely determined by g, and hence the entropy of p is at most H(g), establishing the second inequality of the theorem.

We now turn to the first inequality, C^dc_{r,y,η}(T) ≤ C^pc_{r,y,η}(T). Let Δ_i = p_i − y_i (mod |Y|). Then

  (1/n) C^pc_{r,y,η}(T) = E_{j∼[n]} I(p_j; N_j) = E_{j∼[n]} I(Δ_j; N_j),    (3)

since p_i determines Δ_i and vice versa (given y). But, since N_j = N | i = j and Δ_j = Δ | i = j, the right-hand side of (3) equals

  E_{j∼[n]} I(Δ; N | i = j) = E_{j∼[n]} [ H(N | i = j) − H(N | Δ, i = j) ].    (4)

Since N_1, ..., N_n are identically distributed, H(N | i = j) = H(N), which means that the right-hand side of (4) equals

  H(N) − E_{j∼[n]} H(N | Δ, i = j) ≥ H(N) − H(N | Δ) = I(Δ; N),

with the inequality holding since on average conditioning reduces entropy. By definition I(Δ; N) = (1/n) C^dc_{r,y,η}(T), establishing what we wanted to prove.

The complexity measures C^pc and C^dc are defined with respect to a fixed train set (r, y), rendering them applicable for single training sets such as CIFAR-10 and ImageNet that arise in practice. If D is a distribution over (r, y), then we define the complexity measures C^pc and C^dc with respect to D as the average of the corresponding measure with respect to (r, y) ∼ D. We now restate Theorem II:

Theorem 3.2 (Theorem II, restated). Let T = (T_pre, T_fit) be a training procedure obtained by first training T_pre on x ∈ X^n to obtain a representation r : X → R and then training T_fit on (r(x), y), where y ∈ Y^n, to obtain a classifier g : R → Y. Then, for every noise parameter η > 0 and distribution D_train over (X, Y)^n,

  Memorization gap(T) = (Train_{D_train,T}(η) − NTrain_{D_train,T}(η))_+ ≤ sqrt( C^dc_{R,η}(T_fit) / (2n) ) · (1/η),

where R is the distribution over (R × Y)^n induced by T_pre on D_train.

Note that the bound on the right-hand side is expressed only in terms of the complexity of the second stage T_fit and is independent of the complexity of T_pre. The crux of the proof is showing (close to) independence between the corrupted indices and the prediction deviation of g resulting from the noise.
Proof. Let (r, y) be sampled by first drawing (x, y) ∼ D_train over (X × Y)^n and then applying r = r(x) where r = T_pre(x). Consider the sample space of sampling ỹ according to the η-noisy distribution with respect to y, computing g = T_fit(r, ỹ), and sampling i ∼ [n]. We define the following two Bernoulli random variables over this sample space:

  Z = 1_{Δ=0} = 1 if g(r_i) = y_i and 0 otherwise;    B = 1_{N≠0} = 1 if ỹ_i ≠ y_i and 0 otherwise.

For a given r, y, since Z is determined by Δ and B is determined by N,

  I(Z; B) ≤ I(Δ; N) = C^dc_{r,y,η}(T_fit) / n.

By Lemma A.1, for every Bernoulli random variables B, Z,

  | E[Z] − E[Z | B = 1] | ≤ sqrt( I(Z; B) / 2 ) / E[B],

and hence in our case (since E[B] = η),

  E[Z] − E[Z | B = 1] ≤ sqrt( C^dc_{r,y,η}(T_fit) / (2n) ) · (1/η).

But E[Z] corresponds to the probability that g(r) = y for (r, y) in the train set, while E[Z | B = 1] corresponds to this probability over the noisy samples. Hence the memorization gap is bounded by

  E_{(r,y)∼R} [ sqrt( C^dc_{r,y,η}(T_fit) / (2n) ) · (1/η) ] ≤ (1/η) · sqrt( E_{(r,y)∼R} [ C^dc_{r,y,η}(T_fit) / (2n) ] ) = sqrt( C^dc_{R,η}(T_fit) / (2n) ) · (1/η),

using Jensen's inequality and the concavity of the square root for the first inequality.
4 THE THREE GAPS

We now briefly describe what is known and what we prove about the three components of the RRM bound. We provide some additional discussions in Appendix C, including "counter-examples" of algorithms that exhibit large values for each one of these gaps.
The robustness gap.
The robustness gap measures the decrease in training accuracy from adding η noisy labels, measured with respect to the clean labels. The robustness gap and related notions such as noise stability or tolerance have been studied in various works (cf. Frénay & Verleysen (2013); Manwani & Sastry (2013)). Interpolating classifiers (with zero train error) satisfy Train(η) ≥ 1 − η, and hence their robustness gap is at most η (see left panel of Figure 2). In SSS algorithms, since the representation is learned without using labels, the injection of label noise only affects the simple classifier, which is often linear. Robustness guarantees for linear classifiers have been given previously by Rudin (2005). While proving robustness bounds is not the focus of this paper, we note in the appendix some simple bounds for least-squares minimization of linear classifiers and the (potentially inefficient) Empirical Risk Minimization algorithm (see Appendices D.1 and D.2). Empirically, we observe that the robustness gap of SSS algorithms is often significantly smaller than η. (See left panels of Figure 2 and Figure 3.)

Figure 2 – Robustness, Rationality, and Memorization for CIFAR-10. Each blue point is a different combination of (architecture + self-supervised task + fitting algorithm). Each red point is a different architecture trained end-to-end with supervision. We use the '+' marker to denote the two best models of each type (SSS and supervised). No augmentations were added. Noise level is η = 0.05. Details in Appendix B.3.

The rationality gap.
To build intuition for the rationality gap, consider the case where the inputs x are images, and the label y is either "cat" or "dog". A positive rationality gap means that giving the incorrect label "dog" for a cat image x makes the output classifier more likely to classify x as a cat compared to the case where it is not given any label for x at all. Hence intuitively, a positive rationality gap corresponds to the training procedure being "irrational" or "inconsistent": wrong information should be only worse than no information, and we would expect the rationality gap to be zero or close to it. Indeed, the rationality gap is always zero for interpolating classifiers that fit the training data perfectly. Moreover, empirically the rationality gap is often small for SSS algorithms, particularly for the better-performing ones. (See middle panels of Figure 2 and Figure 3.)

We also show that a positive rationality gap corresponds to "leaving performance on the table" by proving the following theorem (see Section 6 for a formal statement and proof):

Theorem 4.1 (Performance on the table theorem, informal). For every training procedure T and distributions D_test, D_train = D_test^n, there exists a training procedure S satisfying Test_S ≥ Test_T + rationality gap(T) − o(1).

One interpretation of Theorem 4.1 is that we can always reduce the generalization gap to robustness + memorization if we are willing to move from the procedure T to S. In essence, if the rationality gap is positive, we could include the test sample in the train set with a random label to increase the test performance. However, this transformation comes at a high computational cost; inference for the classifier produced by S is as expensive as retraining from scratch. Hence, we view Theorem 4.1 more as a "proof of concept" than as a practical approach for improving performance.
Remark 4.2 (Why rationality?). Since SSS algorithms use a simple classifier (e.g., linear), the reader may wonder why we cannot directly prove bounds on the generalization gap. The issue is that the representation used by SSS algorithms is still sufficiently over-parameterized to allow memorizing the training set samples. As a pedagogical example, consider a representation-learning procedure that maps a label-free training set x to a representation r : X → R that has high quality, in the sense that the underlying classes become linearly separable in the representation space. Moreover, suppose that the representation space has dimension much smaller than n, and hence a linear classifier would not be able to fit noise, meaning the resulting procedure will have a small memorization gap and small empirical Rademacher complexity. Without access to the labels, we can transform r to a representation r′ that on input x will output r(x) if x is in the training set, and output the all-zero vector (or some other trivial value) otherwise. Given sufficiently many parameters, the representation r′ (or a close-enough approximation) can be implemented by a neural network. Since r and r′ are identical on the training set, the procedure using r′ will have the same train accuracy, memorization gap, and empirical Rademacher complexity. However, using r′, one cannot achieve better than trivial accuracy on unseen test examples. This does not contradict the RRM bound since this algorithm will be highly irrational.

The memorization gap. The memorization gap corresponds to the algorithm's ability to fit the noise (i.e., the gap increases with the number of fit noisy labels). If, for example, the classifier output is interpolating, i.e., it satisfies f(x_i) = ỹ_i for every i, then accuracy over the noisy samples will be 0 (since for them y_i ≠ ỹ_i). In contrast, the overall accuracy will be in expectation at least 1 − η, which means that the memorization gap will be ≈ 1 for small η. However, we show empirically (see right panels of Figures 2 and 3) that the memorization gap is small for many SSS algorithms, and we prove a bound on it in Theorem II. When combined with small rationality and robustness, this bound results in non-vacuous generalization bounds for various real settings (e.g., 48% for ResNet101 with SimCLRv2 on ImageNet, and as low as 4% for MoCo V2 with ResNet-18 on CIFAR-10). Moreover, unlike other generalization bounds, our bound decreases with data augmentation (see Figure 5).
Remark 4.3 (Memorization vs. Rademacher). The memorization gap, as well as the complexity measures defined in Section 2.1, have a superficial similarity to Rademacher complexity (Bartlett & Mendelson, 2002), in the sense that they quantify the ability of the output classifier to fit noise. One difference is that Rademacher complexity is defined with respect to 100% noise, while we consider the η-noisy experiment for small η. A more fundamental difference is that Rademacher complexity is defined via a supremum over all classifiers in some class. In contrast, our measures are defined with respect to a particular training algorithm. As mentioned, Zhang et al. (2017) showed that modern end-to-end supervised learning algorithms can fit 100% of their label noise. This is not the case for SSS algorithms, which can only fit 15%-25% of the CIFAR-10 training set when the labels are completely random (see Table B.1 in the appendix). However, by itself, the inability of an algorithm to fit random noise does not imply that the Rademacher complexity is small, and does not imply a small generalization gap. Indeed, the example of Remark 4.2 yields an SSS method with both a small memorization gap and small empirical Rademacher complexity, and yet it has a large generalization gap.

5 EMPIRICAL STUDY OF THE RRM BOUND
In support of our theoretical results, we conduct an extensive empirical study of the three gaps and empirically evaluate the theoretical bound on the memorization gap (from Equation (1)) for a variety of SSS algorithms for the CIFAR-10 and ImageNet datasets. We provide a summary of our setup and findings below. For a full description of the algorithms and hyperparameters, see Appendix B.
Figure 3 – Robustness, Rationality and Memorization for ImageNet. Each point represents a different combination of self-supervised learning algorithm (e.g., SimCLR), backbone architecture (e.g., ResNet-50) and simple classifier (e.g., linear classification). A star indicates experiments with 10 augmentations per training sample. Noise level is η = 5%. Full experimental details in Section B.

SSS Algorithms (T_pre, T_fit). For the first phase of training T_pre, we consider various self-supervised training algorithms that learn a representation without explicit training labels. There are two main types of representation learning methods: (1) contrastive learning, which finds an embedding by pushing "similar" samples closer, and (2) pre-text tasks, which hand-craft a supervised task that is independent of downstream tasks, such as predicting the rotation angle of a given image (Gidaris et al., 2018). Our analysis is independent of the type of representation learning method, and we focus on methods that achieve high test accuracy when combined with the simple test phase. The list of methods included in our study is Instance Discrimination (Wu et al., 2018), MoCoV2 (He et al., 2020), SimCLR (Chen et al., 2020a;b), AMDIM (Bachman et al., 2019), CMC (Tian et al., 2019), InfoMin (Tian et al., 2020), as well as adversarial methods such as BigBiGAN (Donahue & Simonyan, 2019).
Figure 4 – The RRM bound of SSS methods on ImageNet, with models sorted by the generalization gap. We plot the robustness, rationality and memorization gaps. Similar to Figure 1, for most models, the bound is tight and is dominated by the memorization gap. The Theorem II bound is marked for the two leftmost models (we did not evaluate it for the others, for computational reasons).
Figure 5 – Empirical RRM for the AMDIM SSS model on CIFAR-10 with an increasing number of augmentations. While the robustness and memorization gaps decrease, and so does our generalization bound, the rationality gap increases since D_train and D_test grow apart.

For the second phase of training (also known as the evaluation phase (Goyal et al., 2019)), we consider simple models such as regularized linear regression, or small Multi-Layer Perceptrons (MLPs). For each evaluation method, we run two experiments: 1) the clean experiment, where we train T_fit on the data and labels (x, y); 2) the η-noisy experiment, where we train T_fit on (x, ỹ) where ỹ are the η-noised labels. Unless specified otherwise, we set the noise to η = 5%.
Adding augmentations. We investigate the effect of data augmentation on the three gaps and the theoretical bound. For each training point, we sample t random augmentations (t = 10 unless stated otherwise) and add them to the train set. Note that in the noisy experiment two augmented samples of the same original point might be assigned different labels. We use the same augmentations used in the corresponding self-supervised training phase.
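As an illustration, the sketch below builds such an augmented train set, drawing a fresh (and possibly corrupted) label for each augmented copy. The torchvision-style transforms are stand-ins for the method-specific augmentations used in the paper, and the function names are ours.

    import random
    import torchvision.transforms as T

    augment = T.Compose([                      # CIFAR-10-style augmentations; stand-ins for the
        T.RandomResizedCrop(32),               # method-specific augmentations of each SSS model
        T.RandomHorizontalFlip(),
        T.ColorJitter(0.4, 0.4, 0.4, 0.1),
        T.RandomGrayscale(p=0.2),
    ])

    def noisy_label(y, num_classes=10, eta=0.05):
        # With probability eta, draw a uniformly random label (which may coincide with y).
        return random.randrange(num_classes) if random.random() < eta else y

    def build_augmented_noisy_set(images, labels, t=10, eta=0.05):
        aug_x, aug_y = [], []
        for x, y in zip(images, labels):
            for _ in range(t):
                aug_x.append(augment(x))               # t augmented copies per training point
                aug_y.append(noisy_label(y, eta=eta))  # copies of the same image may disagree
        return aug_x, aug_y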
Results. Figures 1 and 2 provide a summary of our experimental results for CIFAR-10. The robustness and rationality gaps are close to zero for most SSS algorithms, while the memorization gap is usually the dominant term, especially so for models with a larger generalization gap. Moreover, we see that C^dc often produces a reasonably tight bound for the memorization gap, leading to a generalization bound that can be as low as 4%. In Figures 3 and 4 we give a summary of our experimental results for SSS algorithms on ImageNet. Again, the rationality and robustness gaps are bounded by small constants. Notice that adding augmentations reduces memorization, but may lead to an increase in the rationality gap. This is also demonstrated in Figure 5, where we vary the number of data augmentations systematically for one SSS algorithm (AMDIM) on CIFAR-10. Since computing the Theorem II bound for ImageNet is computationally expensive, we compute it only for two algorithms, which achieve non-vacuous bounds between 47% and 48%, with room for improvement (see Appendix B.5.1).

6 POSITIVE RATIONALITY GAP LEAVES ROOM FOR IMPROVEMENT
We now prove the "performance on the table theorem," which states that we can always transform a training procedure with a positive rationality gap into a training procedure with better performance:
Theorem 6.1 (Performance on the table theorem, restated). For every training procedure T and D_test, n, η, if D_train = D_test^n and T has a positive rationality gap with respect to these parameters, then there exists a training procedure S such that

  Test_{S,D} ≥ NTrain_{T,D}(η) − o(1) = Test_{T,D} + rationality-gap(T) − o(1),    (5)

where o(1) is a term that vanishes with n, and assuming that Train_{T,D}(η) ≥ NTrain_{T,D}(η).

The assumption, stated differently, implies that the memorization gap will be positive. We expect this assumption to be true for any reasonable training procedure T (see right panel of Figure 2), since performance on noisy train samples will not be better than the overall train accuracy. Indeed, it holds in all the experiments described in Section 5. In particular (since we can always add noise to our data), the above means that if the rationality gap is positive, we can use the above to improve the test performance of "irrational" networks. We now provide a proof for the theorem.
Proof. Let T be a procedure with a positive rationality gap that we are trying to transform. Our new algorithm S would be the following:

• Training: On input a training set D = (x, ỹ) ∈ (X × Y)^n, algorithm S does not perform any computation, but merely stores the dataset D. Thus the "representation" of a point x is simply (x, D).

• Inference: On input a data point x and the original training dataset D, algorithm S chooses i ∼ [n] and lets D′ be the training set obtained by replacing (x_i, y_i) with (x, ỹ), where ỹ is chosen uniformly at random. We then compute f = T(D′) and output f(x).

First note that while the number of noisy samples could change by one by replacing (x_i, y_i) with (x, ỹ), since this number is distributed according to the Binomial distribution with mean ηn and standard deviation sqrt((1 − η)ηn) ≫ 1, this change can affect probabilities by at most an o(1) additive factor (since the statistical distance between the distribution Binom(η, n) and Binom(η, n) + 1 is o(1)). If Y has k classes, then with probability 1 − 1/k we will make (x, ỹ) noisy (y ≠ ỹ), in which case the expected performance on it will be NTrain_T(η). With probability 1/k, we choose the correct label y, in which case performance on this sample will be equal to the expected performance on clean samples, which by our assumption is at least NTrain_T(η) as well. Hence, the accuracy on the new test point is at least NTrain_T(η).

We stress that the procedure described above, while running in "polynomial time", is not particularly practical, since it makes inference as computationally expensive as training. However, it is a proof of concept that irrational networks are, to some extent, "leaving performance on the table".
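For concreteness, here is a minimal sketch (illustrative names, not the authors' code) of the inference step of S: the test point is spliced into the stored train set with a random label, T is retrained, and the retrained classifier labels the point.

    import random

    def s_infer(train_T, stored_x, stored_y, x, num_classes):
        # Inference for S: splice (x, random label) into the stored train set, retrain T, predict.
        i = random.randrange(len(stored_x))        # train index to overwrite
        y_rand = random.randrange(num_classes)     # uniformly random label for the test point
        new_x = list(stored_x); new_x[i] = x
        new_y = list(stored_y); new_y[i] = y_rand
        f = train_T(new_x, new_y)                  # every prediction retrains from scratch
        return f(x)

The retraining inside `s_infer` is exactly why the theorem is a proof of concept rather than a practical speed-up: each prediction costs a full training run of T.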
7 CONCLUSIONS AND OPEN QUESTIONS

This work demonstrates that SSS algorithms have small generalization gaps. While our focus is on the memorization gap, our work motivates more investigation of both the robustness and rationality gaps. In particular, we are not aware of any rigorous bounds for the rationality gap of SSS algorithms, but we view our "performance on the table" theorem (Theorem 4.1) as a strong indication that it is close to zero for natural algorithms. Given our empirical studies, we believe the assumptions of small robustness and rationality conform well to practice.

Our numerical bounds are still far from tight, especially for ImageNet, where evaluating the bound (more so with augmentations) is computationally expensive. Nevertheless, we find it striking that already in this initial work, we get non-vacuous (and sometimes quite good) bounds. Furthermore, the fact that the empirical RRM bound is often close to the generalization gap shows that there is significant room for improvement.

Overall, this work can be viewed as additional evidence for the advantages of SSS algorithms over end-to-end supervised learning. Moreover, some (very preliminary) evidence shows that end-to-end supervised learning implicitly separates into representation learning and classification phases (Morcos et al., 2018). Understanding the extent to which supervised learning algorithms implicitly perform SSS learning is an important research direction in its own right. To the extent this holds, our work might shed light on such algorithms' generalization performance as well.
ACKNOWLEDGEMENTS
We thank Dimitris Kalimeris, Preetum Nakkiran, and Eran Malach for comments on early drafts of this work. This work was supported in part by NSF award CCF 1565264, IIS 1409097, DARPA grant W911NF2010021, and a Simons Investigator Fellowship. We also thank Oracle and Microsoft for grants used for computational resources. Y.B. is partially supported by the MIT-IBM Watson AI Lab. Work partially performed while G.K. was an intern at Google Research.
REFERENCES
Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545, 2019.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020.

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249, 2017.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

Avrim Blum, Merrick L. Furst, Michael J. Kearns, and Richard J. Lipton. Cryptographic primitives based on hard learning problems. In Douglas R. Stinson (ed.), Advances in Cryptology - CRYPTO '93, 13th Annual International Cryptology Conference, Santa Barbara, California, USA, August 22-26, 1993, Proceedings, volume 773 of Lecture Notes in Computer Science, pp. 278–291. Springer, 1993. doi: 10.1007/3-540-48329-2.

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems 32, pp. 10836–10846. Curran Associates, Inc., 2019.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020a.

Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10542–10552, 2019.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

William Falcon and Kyunghyun Cho. A framework for contrastive self-supervised learning and designing a new approach, 2020.

Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2013.
Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Advances in Neural Information Processing Systems, volume 75 of Proceedings of Machine Learning Research, pp. 297–299. PMLR, 06–09 Jul 2018.

Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400, 2019.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.

Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 15663–15674. Curran Associates, Inc., 2019.

Alex Krizhevsky et al. Learning multiple layers of features from tiny images, 2009.

Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.

Pengpeng Liu, Michael Lyu, Irwin King, and Jia Xu. Selflow: Self-supervised learning of optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
Naresh Manwani and PS Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3):1146–1151, 2013.

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Ari S. Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation, 2018.

Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 11611–11622, 2019.

Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nati Srebro. Exploring generalization in deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5947–5956. Curran Associates, Inc., 2017.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. CoRR, abs/1805.12076, 2018.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.

David Page. How to train your resnet. https://myrtle.ai/how-to-train-your-resnet-4-architecture/, 2018.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.

Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6989–6993. IEEE, 2020.
Cynthia Rudin. Stability analysis for regularized least squares regression. arXiv preprint cs/0502016, 2005.

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 5628–5637. PMLR, 2019.

Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. arXiv preprint arXiv:2001.09122, 2020.

Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. CoRR, abs/1906.05849, 2019.

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.

Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978, 2018.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations. OpenReview.net, 2017.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Springer, 2016.

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. In International Conference on Learning Representations. OpenReview.net, 2019.

A MUTUAL INFORMATION FACTS
Lemma A.1. If A, B are two Bernoulli random variables with nonzero expectation, then

  | E[A | B = 1] − E[A] | ≤ sqrt( I(A; B) / 2 ) / E[B].

Proof. A standard relation between mutual information and KL-divergence gives

  I(A; B) = D_KL( p_{A,B} || p_A p_B ).

On the other hand, by the Pinsker inequality,

  sup_{S ⊆ {0,1}×{0,1}} | p_{A,B}(S) − p_{A×B}(S) | ≤ sqrt( (1/2) D_KL( p_{A,B} || p_A p_B ) ) = sqrt( I(A; B) / 2 ).

Thus (letting S = {(1, 1)}),

  | Pr[A = 1, B = 1] − Pr[A = 1] Pr[B = 1] | ≤ sqrt( I(A; B) / 2 ).

Consequently,

  | E[A | B = 1] − E[A] | ≤ sqrt( I(A; B) / 2 ) / E[B].
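As a quick numerical sanity check of Lemma A.1 (our own addition, not part of the original appendix), one can verify the inequality on random joint distributions of two Bernoulli variables, computing I(A; B) with a plug-in estimate in nats:

    import numpy as np

    # Numeric check of Lemma A.1 on random Bernoulli joint distributions; illustrative only.
    rng = np.random.default_rng(0)
    for _ in range(1000):
        p = rng.random(4)
        p /= p.sum()                          # joint distribution over (A, B) in {0,1}^2
        p = p.reshape(2, 2)
        pa, pb = p.sum(1), p.sum(0)
        if pa[1] == 0 or pb[1] == 0:
            continue                          # the lemma requires nonzero expectations
        mi = sum(p[a, b] * np.log(p[a, b] / (pa[a] * pb[b]))
                 for a in range(2) for b in range(2) if p[a, b] > 0)
        lhs = abs(p[1, 1] / pb[1] - pa[1])    # |E[A | B=1] - E[A]|
        rhs = np.sqrt(mi / 2) / pb[1]         # sqrt(I(A;B)/2) / E[B]
        assert lhs <= rhs + 1e-12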
Lemma A.2. For three random variables W, X, Y such that X and Y are independent,

  I(W; X, Y) ≥ I(W; X) + I(W; Y).

Proof. Using the chain rule for mutual information we have:

  I(W; X, Y) = I(W; X) + I(W; Y | X).

Since X, Y are independent, H(Y | X) = H(Y), and since conditioning only reduces entropy, we have H(Y | W, X) ≤ H(Y | W). Combining the two we get

  I(W; Y | X) = H(Y | X) − H(Y | W, X) ≥ H(Y) − H(Y | W) = I(W; Y).

Thus we have that I(W; X, Y) ≥ I(W; X) + I(W; Y). Note that by induction we can extend this argument to show that I(W; X_1, ..., X_n) ≥ Σ I(W; X_i), where the X_i are mutually independent.
B EXPERIMENTAL DETAILS
We perform an empirical study of the RRM bound for a wide variety of self-supervised training methods on the ImageNet (Deng et al., 2009) and CIFAR-10 (Krizhevsky et al., 2009) training datasets. We provide a brief description of all the self-supervised training methods that appear in our results below. For each method, we use the official pre-trained models on ImageNet wherever available. Since very few methods provide pre-trained models for CIFAR-10, we train models from scratch. The architectures and other training hyper-parameters are summarized in Table E.4 and Table E.3. Since our primary aim is to study the RRM bound, we do not optimize for reaching the state-of-the-art performance in our re-implementations. For the second phase of training, we use L2-regularized linear regression, or small non-interpolating Multi-layer perceptrons (MLPs).
B.1 SELF-SUPERVISED TRAINING METHODS (T_pre)

There is a variety of self-supervised training methods for learning representations without explicit labels. The two main branches of self-supervised learning methods are:

1. Contrastive learning: These methods seek to find an embedding of the dataset that pushes a positive pair of images close together and a pair of negative images far from each other. For example, two different augmented versions of the same image may be considered a positive pair, while two different images may be considered a negative pair. Different methods such as Instance Discrimination, MoCo, SimCLR, AMDIM, differ in the way they select the positive/negative pairs, as well as other details like the use of a memory bank or the encoder architecture. (See Falcon & Cho (2020) for a detailed comparison of these methods.)

2. Handcrafted pretext tasks: These methods learn a representation by designing a fairly general supervised task, and utilizing the penultimate or other intermediate layers of this network as the representation. Pretext tasks include a diverse range of methods such as predicting the rotation angle of an input image (Gidaris et al., 2018) (see the sketch below), solving jigsaw puzzles (Noroozi & Favaro, 2016), colorization (Zhang et al., 2016), denoising images (Vincent et al., 2008) or image inpainting (Pathak et al., 2016).

Additionally, adversarial image generation can be used by augmenting the image generator with an encoder (Donahue & Simonyan, 2019). We focus primarily on contrastive learning methods since they achieve state-of-the-art performance.
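To illustrate the pretext-task idea concretely, the following is a minimal sketch in the spirit of rotation prediction (Gidaris et al., 2018); it is our own simplification, not the original implementation, and the helper name is an assumption.

    import torch

    def rotation_batch(images):
        # images: (B, C, H, W). Returns 4 rotated copies of each image and the rotation class.
        rots = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]   # 0, 90, 180, 270 degrees
        x = torch.cat(rots, dim=0)
        y = torch.arange(4).repeat_interleave(images.size(0))            # rotation labels 0..3
        return x, y

    # A network trained with cross-entropy to predict y from x; its penultimate layer
    # then serves as the (label-free) representation r used in the SSS pipeline.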
We now describe these methods briefly.

Instance Discrimination: (Wu et al., 2018) In essence, Instance Discrimination performs supervised learning with each training sample as a separate class. For each training sample v = f_θ(x), they minimize the non-parametric softmax loss

  J(θ) = − Σ_{i=1}^n log( exp(v_i^T v / τ) / Σ_{j=1}^n exp(v_j^T v / τ) ),    (6)

where v_i = f_θ(x_i) is the feature vector for the i-th example and τ is a temperature hyperparameter. They use memory banks and a contrastive loss (also known as Noise Contrastive Estimation or NCE (Gutmann & Hyvärinen, 2010)) for computing this loss efficiently for large datasets. So in this case, a positive pair is an image and itself, while a negative pair is two different training images.

Momentum Contrastive (MoCo): (He et al., 2020) MoCo replaces the memory bank in Instance Discrimination with a momentum-based query encoder. MoCoV2 (Chen et al., 2020c) applies various modifications over SimCLR, like a projection head, and combines it with the MoCo framework for improved performance.
AMDIM: (Bachman et al., 2019) AMDIM uses two augmented versions of the same image aspossitive pairs. For these augmentations, they use random resized crops, random jitters in colorspace, random horizontal flips and random conversions to grayscale. They apply the NCE lossacross multiple scales, by using features from multiple layers. They use a modified ResNet bychanging the receptive fields to decrease overlap between positive pairs.
CMC: (Tian et al., 2019) CMC creates two views for contrastive learning by converting each imageinto the Lab color space. L and ab channels from the same image are considered to be a positivepair, while those from two different images are considered to be a negative pair.
PiRL: (Misra & Maaten, 2020) PiRL first creates a jigsaw transformation of an image (it divides the image into 9 patches and shuffles these patches). It treats an image and its jigsaw transform as a positive pair, and an image and the jigsaw transform of a different image as a negative pair.
SimCLRv1 and SimCLRv2: (Chen et al., 2020a;b) SimCLR also uses strong augmentations to create positive and negative pairs: random resized crops, random Gaussian blurring, and random jitters in color space. Crucially, it uses a projection head that maps the representations to a 128-dimensional space where the contrastive loss is applied. It does not use a memory bank, but instead uses a large batch size.

InfoMin: (Tian et al., 2020) InfoMin uses random resized crops, random color jitters, and random Gaussian blurring, as well as the jigsaw shuffling from PiRL.
B.2 Simple Classifier (T_fit)

After training the representation-learning method, we extract representations r for the training and test images. We do not add random augmentations to the training images (unless stated otherwise). Then, we train a simple classifier on the dataset {(r(x_i), y_i)}_{i=1}^n. We use a linear classifier in most cases, but we also try a small multi-layer perceptron (as long as it has few parameters and does not interpolate the training data). For some methods we add weight decay to achieve good test accuracy (see Table E.4 and Table E.3 for the values used for each method). For the noisy experiment, we set the noise level to η = 5%. To compute the complexity bound C_dc we run 20 trials (the same experiment with different random seeds) of the noisy experiment for CIFAR-10 and 50 trials for ImageNet.
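A minimal sketch of this fitting phase and of the η-noisy re-run is given below. It assumes the frozen representations have already been extracted into arrays, uses scikit-learn's Ridge as one concrete choice of L2-regularized linear regression, and treats the exact noise model (replacing a label with a uniformly random class) and regularization strength as illustrative rather than the exact settings from our experiments.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_linear(R, y, num_classes=10, alpha=1e-6):
    # L2-regularized linear regression onto one-hot labels (the "simple" classifier).
    Y = np.eye(num_classes)[y]
    return Ridge(alpha=alpha).fit(R, Y)

def noisy_experiment(R, y, eta=0.05, num_classes=10, seed=0):
    # Replace each label with a uniformly random class with probability eta,
    # then re-fit the simple classifier on the noisy labels.
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < eta
    y_noisy[flip] = rng.integers(0, num_classes, size=flip.sum())
    clf = fit_linear(R, y_noisy, num_classes)
    preds = clf.predict(R).argmax(axis=1)
    return preds, y_noisy, flip

# Repeating noisy_experiment with different seeds (20 trials for CIFAR-10, 50 for
# ImageNet) gives the samples from which the complexity bound is estimated.
```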
B.3 Experimental Details for Each Plot

Figure 1.
This figure shows the robustness, rationality, and memorization gaps for various SSS algorithms trained on CIFAR-10. The type of self-supervised method, the encoder architecture, and the training hyperparameters are described in Table E.3. For the second phase T_fit, we use L2-regularized linear regression for all the methods. For each algorithm listed in Table E.3, the figure contains two points, one without augmentations and one with augmentations. Further, we compute the complexity measure C_dc for all the methods. All the values (along with the test accuracy) are listed in Table E.1.

Figure 2.
This figure shows the robustness, rationality, and memorization gaps on CIFAR-10 for the same methods as in Figure 1. We only include the points without augmentation, to show how rationality behaves when (D_train, D_test) are identical. All the values (along with the test accuracy) are listed in Table E.1. In addition, we add three end-to-end fully supervised methods (red circles) to compare and contrast the behavior of each of the gaps for SSS and supervised methods. For the supervised architectures, we train a Myrtle-5 (Page, 2018) convolutional network, a ResNet-18 (He et al., 2016), and a WideResNet-28-10 (Zagoruyko & Komodakis, 2016) with standard hyperparameters.

Figure 3 and Figure 4.
These figures show the robustness, rationality, and memorization gaps on the ImageNet dataset. The type of self-supervised method, the encoder architecture, and the training hyperparameters are described in Table E.4. For the second phase T_fit, we use L2-regularized linear regression for all the methods. The figures also contain some points with 10 augmentations per training image. Further, we compute the complexity measure C_dc for SimCLRv2 with the ResNet-50-1x and ResNet-101-2x architectures. All the values (along with the test accuracy) are listed in Table E.2.

Figure 5.
This figure shows the effect of increasing the number of augmentations. We vary the number of augmentations t added per training image and re-train the simple classifier. We do this for the CIFAR-10 dataset, with AMDIM self-supervised training, the AMDIM encoder, and linear regression (see Table E.3 for the hyperparameters).

B.4 Additional Results
B.4.1 Generalization Error of SSS Algorithms
To show that SSS algorithms have qualitatively different generalization behavior compared to standard end-to-end supervised methods, we repeat the experiment from Zhang et al. (2017). We randomize all the training labels in the CIFAR-10 dataset and train 3 high-performing SSS methods on these noisy labels. For results see Table B.1. Unlike fully supervised methods, SSS algorithms do not achieve 100% training accuracy on the dataset with noisy labels. In fact, their training accuracies are fairly low (≈ 15%–22%). This suggests that the empirical Rademacher complexity is bounded. The algorithms were trained without any augmentations during the simple fitting phase for both SSS and supervised algorithms. The SSS methods were trained using the parameters described in Table E.3.

Table B.1 – Train and test performance with 100% label noise for fully supervised vs. SSS algorithms on CIFAR-10. The first rows are from Zhang et al. (2017), while the SSS rows are our results, averaged over 5 runs without augmentations.

Training method | Architecture / Method | Train Acc | Test Acc
Supervised (Zhang et al., 2017) | Inception (no aug) | 100% | 86%
  (fitting random labels) | | 100% | 10%
SSS | SimCLR (ResNet-50) + Linear | 94% | 92%
  (fitting random labels) | | 22% | 10%
SSS | AMDIM (AMDIM Encoder) + Linear | 94% | 87.4%
  (fitting random labels) | | 18% | 10%
SSS | MoCoV2 (ResNet-18) + Linear | 69% | 67.6%
  (fitting random labels) | | 15% | 10%
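The label-randomization experiment behind Table B.1 can be sketched in the same way as the noisy experiment above; Ridge regression on frozen features is again only one possible instantiation of the simple fitting phase, and the feature arrays are assumed to have been extracted beforehand.

```python
import numpy as np
from sklearn.linear_model import Ridge

def random_label_experiment(R_train, R_test, y_test, num_classes=10, alpha=1e-6, seed=0):
    # Fit the simple classifier on frozen SSS features with 100% random labels.
    rng = np.random.default_rng(seed)
    y_rand = rng.integers(0, num_classes, size=len(R_train))
    clf = Ridge(alpha=alpha).fit(R_train, np.eye(num_classes)[y_rand])
    train_acc = (clf.predict(R_train).argmax(1) == y_rand).mean()
    test_acc = (clf.predict(R_test).argmax(1) == y_test).mean()  # ~chance by construction
    return train_acc, test_acc
```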
B.5 RRM Bound with Varying Noise Parameter
We now investigate the effect of varying noise levels on the three gaps as well as on the complexity. We see that the robustness gap increases as we add more noise; this is expected, as noise should affect the clean training accuracy. We also observe that the memorization gap decreases, suggesting that C_dc, as a function of η, goes down faster than η (see Section 2.1). The Theorem II bound on the memorization gap also decays strongly with η, becoming tighter as the noise increases.

Figure B.1 – RRM + bound with changing η.
B.5.1 Convergence of Complexity Measures
We now plot (see Figure B.2) the complexity measures C_dc and C_pc as the number of trials increases, for one of the SSS algorithms. As expected, C_dc < C_pc, and C_dc converges within about 20 trials for CIFAR-10. On the other hand, the complexity computations for ImageNet need many more trials for convergence, since the dataset contains about 10 augmentations × 1.2 million training samples, making it cost-prohibitive to compute for all the methods. For CIFAR-10, we use AMDIM with the AMDIM encoder architecture without augmentations. For ImageNet, we use SimCLRv2 with the ResNet-101 architecture with 10 augmentations per training sample.
C Examples of Algorithms with Large Gaps
While we argued that SSS algorithms will tend to have small robustness, rationality, and memorization gaps, this does not hold in the worst case, and there are examples of such algorithms that exhibit large gaps in each of those cases.

Figure B.2 – Convergence of Theorem II bounds for CIFAR-10 and ImageNet. (a) For CIFAR-10, the bound based on C_dc is lower than the one based on C_pc, as expected, and converges within 20 trials. (b) For ImageNet, C_dc is slow to converge due to the large dataset size (10 augmentations × the number of training samples).
C.1 Large Robustness Gap
A large robustness gap can only arise via computational (as opposed to statistical) considerations. That is, if a training procedure outputs a classifier f ∈ F that achieves on average accuracy α on a clean train set (X, Y), then with high probability, if (X, Ỹ) is an η-noisy train set, there exists f′ ∈ F that achieves α(1 − η) accuracy on this train set (by fitting only the "clean" points).

However, the training algorithm might not always be able to find such a classifier. For example, if the distribution has the form (x, y) = (x, ∑_j a_j x_j mod 2), where x ∼ GF(2)^ℓ = Z_2^ℓ and a ∈ GF(2)^ℓ is some hidden vector, then there is an efficient algorithm (namely Gaussian elimination) to find a given the samples (x, y) and hence achieve accuracy 1. However, for every ε > 0 and η > 0, there is no known efficient algorithm that, given η-perturbed equations of the form {⟨a, x_i⟩ = ỹ_i}_{i∈[n]} (each right-hand side flipped with probability η), finds a′ ∈ GF(2)^ℓ such that ∑_j a′_j x_j = ∑_j a_j x_j (mod 2) on a 1/2 + ε fraction of the x's. This is known as the learning parity with noise (LPN) problem (Blum et al., 1993).

The assumption of robustness is necessary for a small generalization gap, in the sense that we can come up with (contrived) examples of algorithms that have small rationality and memorization gaps while still having a large generalization gap. For example, consider an algorithm T that has a large generalization gap (high train accuracy and small test accuracy), and suppose we augment it to the following algorithm:

T′(x, y) = T(x, y) if y is "clean", and 0 if y is "noisy",

where 0 denotes the constant zero function (e.g., some trivial classifier) and we use some algorithm to estimate whether or not the labels are noisy. (Such estimates can often be achieved in many natural cases.) The algorithm T′ will inherit the generalization gap of T, since that depends only on the experiment without noise. Since performance on noisy and clean training samples will be the same (close to random), T′ will have zero memorization gap. Since we have assumed small test accuracy, it will have zero rationality gap as well.
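To illustrate why the noiseless parity problem really is computationally easy, here is a small self-contained sketch of Gaussian elimination over GF(2); the function name and interface are ours, not from the paper, and it assumes the system is consistent (no label noise).

```python
import numpy as np

def solve_parity_gf2(X, y):
    # Recover a hidden vector a with <a, x_i> = y_i (mod 2) for all i,
    # by Gaussian elimination on the augmented matrix [X | y] over GF(2).
    A = np.concatenate([X % 2, (y % 2).reshape(-1, 1)], axis=1).astype(np.uint8)
    n, m = X.shape
    row = 0
    for col in range(m):
        pivot = next((r for r in range(row, n) if A[r, col]), None)
        if pivot is None:
            continue                        # no pivot in this column (underdetermined)
        A[[row, pivot]] = A[[pivot, row]]   # swap the pivot row into place
        for r in range(n):
            if r != row and A[r, col]:
                A[r] ^= A[row]              # eliminate this column in all other rows
        row += 1
    a = np.zeros(m, dtype=np.uint8)
    for r in range(row):                    # read the solution off the pivot rows
        lead = int(np.argmax(A[r, :m] == 1))
        a[lead] = A[r, m]
    return a

# With n >> ell random equations the system has full column rank with high
# probability, and the hidden vector is recovered exactly:
rng = np.random.default_rng(0)
ell, n = 20, 100
a_true = rng.integers(0, 2, ell)
X = rng.integers(0, 2, (n, ell))
assert np.array_equal(solve_parity_gf2(X, X @ a_true % 2), a_true)
```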
C.2 Large Rationality Gap

As discussed in Section 6, in the case that D_train = D_test^n, a robust algorithm with a large rationality gap leaves "performance on the table". We can obtain such algorithms by artificially dropping performance on the test data. For example, in the SSS framework, since the representation r is over-parameterized and can memorize the entire train set, we can consider the trivial representation

r(x) = x if x is in the train set, and 0 otherwise.

If we now train some simple classifier on r(x), then it can have non-trivial performance on the noisy train samples, while getting trivial accuracy on all samples outside the train set.

In cases where D_train and D_test are different (for example, when D_train is an augmented version of D_test), we can no longer claim that a large rationality gap corresponds to "leaving performance on the table". For example, we do observe (mild) growth in the rationality gap as we add more augmented points to the training set.

C.3 Large Memorization Gap
It is not hard to find examples of networks with a large memorization gap. Indeed, as mentioned before, any standard interpolating supervised learning algorithm will have a memorization gap close to 1.
D Simple Robustness Bounds
While robustness is not the focus of this work, we collect here two observations on the robustness of the least-squares and minimum-risk classifiers. These bounds are arguably folklore, but we state them here for completeness.
D.1 Robustness of Least Squares Classifiers
One can prove robustness for classes of algorithms under varying assumptions. As a simple example, we record here a self-contained observation of how margin leads to robustness in least squares minimization. This is a very simple but also pessimistic bound, and much better ones often hold.
Lemma D.1. Let x_1, …, x_n ∈ R^d and y_1, …, y_n ∈ [k], and consider a linear function f : R^d → R^k that minimizes the quantity ∑_{i∈[n], j∈[k]} |f(x_i)_j − 1[y_i = j]|², and suppose that for a p fraction of the i's, the maximum over j ∈ [k] of f(x_i)_j is attained at j = y_i and is γ larger than the second-largest value. Then in expectation, if we let ỹ be the η-noisy version of y and f̃ minimizes ∑_{i∈[n], j∈[k]} |f̃(x_i)_j − 1[ỹ_i = j]|², we get that argmax_j f̃(x_i)_j = y_i for at least a p − 4η/γ² fraction of the i's.

Proof. We identify y with its "one-hot" encoding as a vector in R^{nk}. Let V ⊆ R^{nk} be the subspace of all vectors of the form (g(x_1), …, g(x_n)) for linear g : R^d → R^k. If f is the minimizer in the lemma statement and z = (f(x_1), …, f(x_n)), then z = Π_V y, where Π_V is the orthogonal projection to the subspace V. If f̃ is the minimizer for the noisy labels and z̃ = (f̃(x_1), …, f̃(x_n)), then z̃ = Π_V ỹ = Π_V (y + e), where e is the noise vector ỹ − y.

Hence ‖z − z̃‖² = ‖Π_V e‖² ≤ ‖e‖². But in expectation ‖e‖² ≤ 2ηn (since we flip a label with probability at most η, and each flip changes two coordinates of the one-hot encoding). For every point i for which the margin was at least γ in z, if z̃'s prediction differs from y_i, then the contribution of the i-th block to their squared norm difference is at least γ²/2 (by shifting the maximum coordinate by −γ/2 and the second-largest one by γ/2). Hence at most 4ηn/γ² of these points can have different predictions in z and z̃.
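As a rough numerical illustration of the lemma on synthetic, well-separated Gaussian data (this checks the flavor of the statement, not its constants, which are quite pessimistic):

```python
import numpy as np

def least_squares_onehot(X, y, k):
    # Linear f minimizing sum_{i,j} |f(x_i)_j - 1[y_i = j]|^2 (no intercept, for simplicity).
    W, *_ = np.linalg.lstsq(X, np.eye(k)[y], rcond=None)
    return W                                            # f(x) = x @ W

def margin_robustness_demo(n=2000, d=50, k=10, eta=0.1, gamma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=(k, d))                        # well-separated class means
    y = rng.integers(0, k, size=n)
    X = mu[y] + 0.3 * rng.normal(size=(n, d))

    scores = X @ least_squares_onehot(X, y, k)
    top2 = np.sort(scores, axis=1)[:, -2:]              # second-largest and largest score
    p = ((scores.argmax(1) == y) & (top2[:, 1] - top2[:, 0] >= gamma)).mean()

    y_noisy = y.copy()                                  # eta-noisy labels
    flip = rng.random(n) < eta
    y_noisy[flip] = rng.integers(0, k, size=flip.sum())
    agree = ((X @ least_squares_onehot(X, y_noisy, k)).argmax(1) == y).mean()
    print(f"margin fraction p = {p:.3f}, "
          f"agreement of the noisy-trained classifier with the clean labels = {agree:.3f}")

margin_robustness_demo()
```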
D.2 Robustness of Empirical Risk Minimizer
The (potentially inefficient) algorithm that minimizes the classification errors is always robust.
Lemma D.2. Let T(x, y) = argmin_{f∈F} (1/n) ∑_{i=1}^n 1[f(x_i) ≠ y_i]. Then for every η > 0,

Robustness gap(T) ≤ 2η.

Proof. Let (x, y) be any train set, let α = min_{g∈F} (1/n) ∑_{i=1}^n 1[g(x_i) ≠ y_i], and let f be the minimizer of this quantity. Let ỹ be the η-noisy version of y and let η̃ be the fraction of i on which y_i ≠ ỹ_i. Then,

(1/n) ∑_{i=1}^n 1[f(x_i) ≠ ỹ_i] ≤ α + η̃.   (7)

Hence if f̃ is the minimizer of the left-hand side of (7) over F, then we know that f̃(x_i) ≠ ỹ_i for at most an α + η̃ fraction of the i's, and so f̃(x_i) ≠ y_i for at most an α + 2η̃ fraction of the i's. Since the train accuracy of T is 1 − α and the expectation of η̃ is η, we get that in expectation Train_T(η) ≥ Train_T − 2η.

E Large Tables
Table E.1 – Summary of all the methods, architectures, and the corresponding results (gaps and accuracies) on CIFAR-10, sorted by generalization gap. All values are percentages. While Figure 1 already plots this data, here we also provide the test performance of the corresponding models.

Method | Backbone | DataAug | Generalization Gap | Robustness | Memorization | Rationality | Theorem II bound | RRM bound | Test Acc
mocov2 | resnet18 | True | -7.35 | 0.07 | 0.21 | 0.00 | 3.47 | 0.28 | 67.19
mocov2 | wide_resnet50_2 | True | -6.37 | 0.18 | 1.03 | 0.00 | 7.63 | 1.21 | 70.99
mocov2 | resnet101 | True | -6.01 | 0.15 | 0.71 | 0.00 | 6.38 | 0.86 | 68.58
mocov2 | resnet50 | True | -5.38 | 0.19 | 0.84 | 0.00 | 6.99 | 1.03 | 69.68
simclr | resnet50 | True | -2.89 | 0.30 | 0.55 | 0.00 | 6.63 | 0.85 | 91.96
amdim | resnet101 | True | -0.91 | 0.64 | 3.70 | 0.00 | 25.99 | 4.34 | 63.56
amdim | resnet18 | True | 0.33 | 0.23 | 1.15 | 0.00 | 8.66 | 1.38 | 62.84
mocov2 | resnet18 | False | 1.43 | 0.15 | 1.24 | 0.03 | 14.14 | 1.43 | 67.60
simclr | resnet18 | False | 1.43 | 0.28 | 0.79 | 0.36 | 13.35 | 1.43 | 82.50
amdim | wide_resnet50_2 | True | 1.60 | 0.69 | 2.46 | 0.00 | 19.20 | 3.15 | 64.38
simclr | resnet50 | False | 1.97 | 0.22 | 0.78 | 0.97 | 15.75 | 1.97 | 92.00
simclr | resnet50 | False | 2.24 | 0.52 | 1.71 | 0.01 | 19.53 | 2.24 | 84.94
mocov2 | resnet50 | False | 2.72 | 0.30 | 2.96 | 0.00 | 24.18 | 3.26 | 70.09
mocov2 | resnet101 | False | 2.82 | 0.33 | 3.03 | 0.00 | 22.78 | 3.36 | 69.08
mocov2 | wide_resnet50_2 | False | 3.11 | 0.38 | 2.79 | 0.00 | 22.39 | 3.18 | 70.84
amdim | resnet50_bn | True | 3.69 | 0.84 | 4.22 | 0.00 | 31.12 | 5.06 | 66.44
amdim | resnet18 | False | 4.34 | 0.42 | 4.58 | 0.00 | 33.47 | 5.00 | 62.28
amdim | amdim_encoder | True | 4.43 | 0.68 | 0.36 | 3.39 | 10.32 | 4.43 | 87.33
amdim | amdim_encoder | False | 6.68 | 2.08 | 5.69 | 0.00 | 70.52 | 7.77 | 87.38
amdim | resnet101 | False | 12.46 | 1.22 | 14.26 | 0.00 | 100.00 | 15.49 | 62.43
amdim | wide_resnet50_2 | False | 13.07 | 1.70 | 15.33 | 0.00 | 100.00 | 17.03 | 63.80
amdim | resnet50_bn | False | 14.73 | 1.81 | 16.63 | 0.00 | 100.00 | 18.43 | 66.28

Table E.2 – Summary of all the methods, architectures, and their corresponding results (gaps and accuracies) on ImageNet, sorted by generalization gap. All values are percentages. While Figure 4 already plots this data, here we also provide the test performance of the corresponding models.

Method | Backbone | DataAug | Generalization Gap | Robustness | Memorization | Rationality | Theorem II bound | RRM bound | Test Acc
simclrv2 | r50_1x_sk0 | True | -2.34 | 0.26 | 0.68 | 0.00 | 46.93 | 0.94 | 70.96
simclrv2 | r101_2x_sk0 | True | 0.63 | 0.10 | 0.80 | 0.00 | 47.90 | 0.91 | 77.24
simclrv2 | r152_2x_sk0 | True | 1.00 | 0.13 | 0.77 | 0.10 | NA | 1.00 | 77.65
moco | ResNet-50 | True | 1.32 | 0.57 | 0.93 | 0.00 | NA | 1.49 | 70.15
InfoMin | ResNet-50 | True | 4.88 | 0.81 | 1.01 | 3.06 | NA | 4.88 | 72.29
PiRL | ResNet-50 | True | 6.23 | 0.29 | 0.99 | 4.95 | NA | 6.23 | 60.56
InsDis | ResNet-50 | True | 6.85 | 0.25 | 1.13 | 5.46 | NA | 6.85 | 58.30
simclrv2 | r101_1x_sk1 | False | 8.23 | 0.71 | 4.66 | 2.86 | NA | 8.23 | 76.07
InfoMin | ResNet-50 | False | 10.21 | 2.34 | 8.96 | 0.00 | NA | 11.31 | 70.31
simclrv2 | r152_1x_sk0 | False | 10.32 | 1.12 | 6.93 | 2.26 | NA | 10.32 | 74.17
simclrv2 | r101_1x_sk0 | False | 10.53 | 1.11 | 6.99 | 2.42 | NA | 10.53 | 73.04
simclrv2 | r50_1x_sk0 | False | 10.62 | 0.99 | 7.31 | 2.31 | NA | 10.62 | 70.69
moco | ResNet-50 | False | 10.72 | 1.82 | 7.86 | 1.04 | NA | 10.72 | 68.39
simclrv2 | r152_2x_sk0 | False | 10.92 | 0.75 | 7.45 | 2.72 | NA | 10.92 | 77.25
simclrv2 | r101_2x_sk0 | False | 11.02 | 0.74 | 7.51 | 2.78 | NA | 11.02 | 76.72
simclr | ResNet50_1x | False | 11.07 | 1.22 | 7.73 | 2.13 | NA | 11.07 | 68.73
simclrv2 | ResNet-50 | False | 11.16 | 0.64 | 7.67 | 2.85 | NA | 11.16 | 74.99
PiRL | ResNet-50 | False | 11.43 | 1.49 | 8.26 | 1.68 | NA | 11.43 | 59.11
InsDis | ResNet-50 | False | 12.02 | 1.40 | 8.52 | 2.10 | NA | 12.02 | 56.67
amdim | ResNet-50 | False | 13.62 | 0.90 | 9.72 | 3.01 | NA | 13.62 | 67.69
CMC | ResNet-50 | False | 14.73 | 2.30 | 12.30 | 0.13 | NA | 14.73 | 54.60
bigbigan | ResNet-50 | False | 29.60 | 3.13 | 25.19 | 1.27 | NA | 29.60 | 50.24

Table E.3 – Summary of training methods with their hyper-parameters for CIFAR-10.
Self-supervised method | Backbone architectures | Self-supervised training | Evaluation | Simple-phase optimization
AMDIM | AMDIM Encoder, ResNet-18, ResNet-50, WideResNet-50, ResNet-101 | PLB default parameters | Linear | Adam (β1 = 0.…, β2 = 0.…), constant LR = 2e-4, batch size = 500, weight decay = 1e-6
MoCoV2 | ResNet-18, ResNet-50, WideResNet-50, ResNet-101 | PLB default parameters | Linear | Adam (β1 = 0.…, β2 = 0.…), constant LR = 2e-4, batch size = 500, weight decay = 1e-6
SimCLR | ResNet-18, ResNet-50 | batch size = 128, 200 epochs | Linear | SGD (momentum = 0.9), constant LR = 0.1, weight decay = 1e-6
SimCLR | ResNet-50 | batch size = 512, 600 epochs | Linear | SGD (momentum = 0.9), constant LR = 0.1, weight decay = 1e-6
Table E.4 – Summary of training methods with their hyper-parameters for ImageNet.

Self-supervised method | Backbone architecture | Pre-trained model | Evaluation | Optimization | Weight decay | Epochs
Instance Discrimination | ResNet-50 | PyContrast | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
MoCo | ResNet-50 | Official | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
PiRL | ResNet-50 | PyContrast | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
CMC | ResNet-50 | PyContrast | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
AMDIM | AMDIM Encoder | Official | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {15, 25} by factor 0.2 | 1e-3 | 40
BigBiGAN | ResNet-50 | Official | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {15, 25} | … | …