For self-supervised learning, Rationality implies generalization, provably
Yamini Bansal ∗ Harvard University
Gal Kaplun ∗ Harvard University
Boaz Barak† Harvard University

ABSTRACT
We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation r of the training data, and then fitting a simple (e.g., linear) classifier g to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if C(g) ≪ n, where C(g) is an appropriately-defined measure of the simple classifier g's complexity, and n is the number of training samples. We stress that our bound is independent of the complexity of the representation r. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding a small amount of label noise causes small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and BigBiGAN.

1 INTRODUCTION
The current standard approach for classification is "end-to-end supervised learning," where one fits a complex (e.g., a deep neural network) classifier to the given training set (Tan & Le, 2019; He et al., 2016). However, modern classifiers are heavily over-parameterized, and as demonstrated by Zhang et al. (2017), can fit 100% of their training set even when given random labels as inputs (in which case test performance is no better than chance). Hence, the training performance of such methods is by itself no indication of their performance on new unseen test points.

In this work, we study a different class of supervised learning procedures that have recently attracted significant interest. These classifiers are obtained by: (i) performing pre-training with a self-supervised task (i.e., without labels) to obtain a complex representation of the data points, and then (ii) fitting a simple (e.g., linear) classifier on the representation and the labels. Such "Self-Supervised + Simple" (SSS for short) algorithms are commonly used in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020), and have recently found uses in other domains as well (Baevski et al., 2020; Ravanelli et al., 2020; Liu et al., 2019). (In this work we focus only on algorithms that learn a representation, "freeze" it, and then perform classification using a simple classifier. We do not consider algorithms that "fine tune" the entire representation.)

Compared to standard "end-to-end supervised learning," SSS algorithms have several practical advantages. In particular, SSS algorithms can incorporate additional unlabeled data, the representation obtained can be useful for multiple downstream tasks, and they can have improved out-of-distribution performance (Hendrycks et al., 2019). Moreover, recent works show that even without additional unlabeled data, SSS algorithms can get close to state-of-the-art accuracy in several classification tasks (Chen et al., 2020b; He et al., 2020; Misra & Maaten, 2020; Tian et al., 2019). For instance, SimCLRv2 (Chen et al., 2020b) achieves top-1 performance on ImageNet with a variant of ResNet-152 that is on par with the end-to-end supervised accuracy of this architecture.

∗ Equal contribution. Email: {ybansal, galkaplun}@g.harvard.edu. † Email: [email protected].

We show that SSS algorithms have another advantage over standard supervised learning: they often have a small generalization gap between their train and test accuracy, and we prove non-vacuous bounds on this gap. We stress that SSS algorithms use over-parameterized models to extract the representation, and reuse the same training data to learn a simple classifier on this representation. Thus, the final classifier they produce has high complexity by most standard measures, and the resulting representation could "memorize" the training set. Consequently, it is not a priori evident that their generalization gap will be small.

Our bound is obtained by first noting that the generalization gap of every training algorithm is bounded by the sum of three quantities, which we name the Robustness gap, Rationality gap, and
Memorization gap (we call this the RRM bound; see Fact I). We now describe these gaps at a high level, deferring the formal definitions to Section 2. All three gaps involve comparison with a setting where we inject label noise by replacing a small fraction η of the labels with random values.

The robustness gap corresponds to the amount by which training performance degrades under noise injection. That is, it equals the difference between the standard expected training accuracy (with no label noise) and the expected training accuracy in the noisy setting; in both cases, we measure accuracy with respect to the original (uncorrupted) labels. The robustness gap is nearly always small, and sometimes provably so (see Section 4).

The rationality gap corresponds to the difference between performance on the noisy training samples (on which the training algorithm gets the wrong label) and test samples (on which it doesn't get any label at all), again with respect to uncorrupted labels. An optimal Bayesian procedure would have zero rationality gap, and we show that this gap is typically zero or small in practice.

The memorization gap, which often accounts for the lion's share of the generalization gap, corresponds to the difference in the noisy experiment between the training accuracy on the entire train set and the training accuracy on the samples that received the wrong label (both measured with respect to uncorrupted labels). The memorization gap can be thought of as quantifying the extent to which the classifier can "memorize" noisy labels, or act differently on the noisy points compared to the rest of the train set.

Figure 1 – Empirical RRM bound. The components of the RRM bound, as well as the upper bound of Theorem II, for a variety of SSS models on the CIFAR-10 dataset with noise η = 0.05. Each vertical line corresponds to a single model (architecture + self-supervised task + fitting algorithm) and plots the RRM bound for this model. The green component corresponds to robustness, yellow to rationality, and red to memorization. The x axis is the generalization gap, and so the RRM bound is always above the dashed x = y line. A negative generalization gap can occur in algorithms that use augmentation. The blue dots correspond to the bound on the generalization gap obtained by replacing the memorization gap with the bound of Theorem II. See Sections 5 and B.3 for more information.

Our main theoretical result bounds the memorization gap independently of the complexity of the representation. As long as the simple classifier is under-parameterized (i.e., its complexity is asymptotically smaller than the sample size), our bound on the memorization gap tends to zero. When combined with small rationality and robustness, we get concrete non-vacuous generalization bounds for various SSS algorithms on the CIFAR-10 and ImageNet datasets (see Figures 1 and 4).

Our results.
In a nutshell, our contributions are the following:

1. Our main theoretical result (Theorem II) is that the memorization gap of an SSS algorithm is bounded by O(sqrt(C/n)), where C is the complexity of the simple classifier produced in the "simple fit" stage. This bound is oblivious to the complexity of the representation produced in the pre-training and does not make any assumptions on the relationship between the representation learning method and the supervised learning task.

2. We complement this result with an empirical study of the robustness, rationality, and memorization gaps. We show that the RRM bound is typically non-vacuous, and in fact, often close to tight, for a variety of SSS algorithms on the CIFAR-10 and ImageNet datasets, including SimCLR (which achieves test errors close to its supervised counterparts). Moreover, in our experimental study, we demonstrate that the generalization gap for SSS algorithms is substantially smaller than their fully-supervised counterparts. See Figures 1 and 4 for sample results and Section 5 for more details.

3. We demonstrate that replacing the memorization gap with the upper bound of Theorem II yields a non-vacuous generalization bound for a variety of SSS algorithms on CIFAR-10 and ImageNet. Moreover, this bound gets tighter with more data augmentation.

4. The robustness gap is often negligible in practice, and sometimes provably so (see Section 4). We show that the rationality gap is small in practice as well. We also prove that a positive rationality gap corresponds to "leaving performance on the table", in the sense that we can transform a learning procedure with a large rationality gap into a procedure with better test performance (Theorem 4.1).

One way to interpret our results is that instead of obtaining generalization bounds under statistical assumptions on the distribution, we assume that the rationality and robustness gaps are at most some value (e.g., 5%). Readers might worry that we are "assuming away the difficulty", but small rationality and robustness gaps do not by themselves imply a small generalization gap. Indeed, these conditions widely hold across many natural algorithms (including not just SSS but also end-to-end supervised algorithms) with both small and large generalization gaps. As discussed in Section 4, apart from the empirical evidence, there are also theoretical justifications for small robustness and rationality. See Remark 4.2 and Appendix C for examples showing the necessity of these conditions.

1.1 RELATED WORK

Our work analyses the generalization gap for supervised classifiers that first use self-supervision to learn a representation. We provide a brief exposition of the various types of self-supervised methods in Section 5, and a more detailed discussion in Appendix B.1.

A variety of prior works have provided generalization bounds for supervised deep learning (e.g., Neyshabur et al. (2017); Bartlett et al. (2017); Dziugaite & Roy (2017); Neyshabur et al. (2018); Golowich et al. (2018); Cao & Gu (2019), and references therein). However, many of these bounds provide vacuous guarantees for modern architectures (such as the ones considered in this paper) that have the capacity to memorize their entire training set (Zhang et al., 2017). While some non-vacuous bounds are known (e.g., Zhou et al. (2019) gave a 96.5% bound on the error of MobileNet on ImageNet), Belkin et al. (2019); Nagarajan & Kolter (2019) have highlighted some general barriers for bounding the generalization gaps of over-parameterized networks that are trained end-to-end. For similar reasons, standard approaches such as Rademacher complexity cannot directly bound SSS algorithms' generalization gap (see Remark 4.3).

Recently, Saunshi et al. (2019) and Lee et al. (2020) gave generalization bounds for self-supervision based classifiers. The two works considered special cases of SSS algorithms, such as contrastive learning and pre-text tasks. Both works make strong statistical assumptions of (exact or approximate) conditional independence relating the pre-training and classification tasks. For example, if the pre-training task is obtained by splitting a given image x into two pieces (x_1, x_2) and predicting x_2 from x_1, then Lee et al. (2020)'s results require x_1 and x_2 to be approximately independent conditioned on their class y. However, in many realistic cases, the two parts of the same image will share a significant amount of information not explained by the label.

Our work applies to general SSS algorithms without such statistical assumptions, at the expense of assuming bounds on the robustness and rationality gaps. There have been works providing rigorous bounds on the robustness gap or related quantities (see Section 4). However, as far as we know, the rationality gap has not been explicitly defined or studied before. To bound the memorization gap, we use information-theoretic complexity measures. Various information-theoretic quantities have been proposed to bound the generalization gap in previous work (see Steinke & Zakynthinou (2020) and references therein). While these works bound generalization directly, we bound a different quantity, the memorization gap in the RRM decomposition.

1.2 PAPER ORGANIZATION
Section 2 contains formal definitions and statements of our results. Section 4 provides an overview of prior work and our new results on the three gaps of the RRM bound. In Section 5, we describe our experimental setup and detail our empirical results. Section 7 concludes the paper and discusses important open questions. Section 3 contains the proof of Theorem II, while Section 6 contains the proof of Theorem 4.1. Appendix B fully details our experimental setup.

1.3 NOTATION
We use capital letters (e.g., X) for random variables, lower case letters (e.g., x) for a single value, and bold font (e.g., x) for tuples (which will typically have dimension corresponding to the number of samples, denoted by n). We use x_i for the i-th element of the tuple x. We use calligraphic letters (e.g., X, D) for both sets and distributions.

2 FORMAL STATEMENT OF RESULTS

A training procedure is a (possibly randomized) algorithm T that takes as input a train set (x, y) = (x_i, y_i)_{i ∈ [n]} ∈ (X × Y)^n and outputs a classifier f : X → Y. For our current discussion, we make no assumptions on the type of classifier output or the way that it is computed. We denote the distribution over training sets in (X × Y)^n by D_train and the distribution over test samples in X × Y by D_test. (The train and test data often stem from the same distribution (i.e., D_train = D_test^n), but not always; e.g., this does not hold if we use data augmentation. D_test enters the RRM bound only via the rationality gap, so the assumption of small rationality may be affected if D_train ≠ D_test^n, but the RRM bound still holds.)

The generalization gap of a training algorithm T with respect to a distribution pair D = (D_train, D_test) is the expected difference between its train accuracy (which we denote by Train_{D,T}) and its test performance (which we denote by Test_{D,T}). We will often drop subscripts such as D, T when they can be inferred from the context. We will also consider the η-noisy experiment, which involves computing the classifier f̃ = T(x, ỹ) where ỹ_i = y_i with probability 1 − η and is uniform otherwise. Our starting point is the following observation, which we call the RRM bound (for Robustness, Rationality, and Memorization). The quantities appearing in it are defined in Table 1. We provide our code and data at https://gitlab.com/harvard-machine-learning/.

Fact I (RRM bound). For every noise parameter η > 0, training procedure T, and distribution D = (D_train, D_test) over training sets and test samples, the RRM bound with respect to T and D is

  Train − Test  ≤  (Train − Train(η))_+  +  (NTrain(η) − Test)_+  +  (Train(η) − NTrain(η))_+ ,

where the left-hand side is the generalization gap and the three terms on the right are the robustness gap, the rationality gap, and the memorization gap respectively, and where we denote x_+ = max(x, 0).

Table 1 – The measurements of accuracy in the RRM bound, all with respect to a training algorithm T, distributions (D_train, D_test), and parameter η > 0. The robustness gap is max(Train − Train(η), 0), the rationality gap is max(NTrain(η) − Test, 0), and the memorization gap is max(Train(η) − NTrain(η), 0).

  Test_{D,T}. Training: f = T(x, y) for (x, y) ∼ D_train. Measurement: Pr[f(x) = y] for (x, y) ∼ D_test.
  Train_{D,T}. Training: f = T(x, y) for (x, y) ∼ D_train. Measurement: Pr[f(x_i) = y_i] for a train sample (x_i, y_i).
  Train_{D,T}(η). Training: f̃ = T(x, ỹ) for (x, y) ∼ D_train, where ỹ_i = y_i w.p. 1 − η and uniform otherwise. Measurement: Pr[f̃(x_i) = y_i] for a train sample (x_i, ỹ_i), where y_i is the original label for x_i.
  NTrain_{D,T}(η). Training: f̃ = T(x, ỹ) for (x, y) ∼ D_train, where ỹ_i = y_i w.p. 1 − η and uniform otherwise. Measurement: Pr[f̃(x_i) = y_i | ỹ_i ≠ y_i] for a corrupted train sample x_i, where y_i is the original label for x_i.

The RRM bound is but an observation, as it directly follows from the fact that x_+ ≥ x for every x. However, it is a very useful one. As mentioned above, for natural algorithms, we expect both the robustness and rationality components of this gap to be small, and hence the most significant component is the memorization gap. In this work we show a rigorous upper bound on this gap for SSS models.

We define formally an SSS algorithm to be a training procedure T = (T_pre, T_fit) that is obtained by (1) first training T_pre on x ∈ X^n to get a representation r : X → R and then (2) training T_fit on (r(x), y) for y ∈ Y^n to obtain a classifier g : R → Y. The classifier output by T is f : X → Y defined as f(x) = g(r(x)).
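To make the two-phase structure concrete, the following is a minimal sketch (our own illustration, not the authors' code) of an SSS algorithm: `encode` stands in for an arbitrary frozen representation r produced by T_pre, and T_fit is an L2-regularized linear classifier trained by ridge regression onto one-hot labels. All function names and parameters here are assumptions for illustration.

    import numpy as np

    def fit_simple(feats, labels, num_classes, reg=1e-2):
        # T_fit: ridge regression onto one-hot labels, i.e., an L2-regularized linear g.
        n, d = feats.shape
        onehot = np.eye(num_classes)[labels]                       # (n, k)
        W = np.linalg.solve(feats.T @ feats + reg * np.eye(d), feats.T @ onehot)
        return W                                                   # (d, k) weight matrix

    def predict(W, feats):
        # Labels predicted by the linear classifier g on a batch of representations.
        return np.argmax(feats @ W, axis=1)

    def sss_classifier(encode, train_x, train_y, num_classes):
        # Phase 1 output `encode` is frozen; phase 2 fits the simple classifier on (r(x), y).
        feats = np.stack([encode(x) for x in train_x])             # r(x) for the train set
        W = fit_simple(feats, np.asarray(train_y), num_classes)
        return lambda x: int(predict(W, encode(x)[None, :])[0])    # f(x) = g(r(x))

The key point mirrored by the sketch is that the labels y touch only the cheap second phase; the representation is produced without them and is never updated afterwards.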
Our main theoretical result is the following.

Theorem II (Memorization gap bound). For every SSS algorithm T = (T_pre, T_fit), noise parameter η > 0, and distribution D over X^n × Y^n:

  Memorization gap(T) = (Train_{T,D}(η) − NTrain_{T,D}(η))_+ ≤ O( sqrt(C_η(T_fit)/n) · (1/η) ),

where C_η(T_fit) is a complexity measure of the second-phase training procedure, which in particular is upper bounded by the number of bits required to describe the classifier g (see Definition 2.3).
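The quantities Train(η) and NTrain(η) appearing in Theorem II can be estimated directly by rerunning only the cheap second phase on noisy labels. The sketch below (illustrative only; `fit_simple` and `predict` are second-phase fitting and prediction routines with assumed signatures, e.g., as in the sketch above) runs a single η-noisy experiment for a fixed representation and reports the empirical memorization gap.

    import numpy as np

    def run_noisy_experiment(fit_simple, predict, feats, labels, num_classes, eta=0.05, seed=0):
        # One eta-noisy experiment: corrupt labels, refit only the simple classifier,
        # and measure accuracy with respect to the ORIGINAL labels.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        noisy = labels.copy()
        flip = rng.random(len(labels)) < eta                    # each label resampled w.p. eta
        noisy[flip] = rng.integers(0, num_classes, flip.sum())  # uniform, may equal the true label
        g = fit_simple(feats, noisy, num_classes)
        correct = predict(g, feats) == labels
        corrupted = noisy != labels                              # samples whose label actually changed
        train_eta = correct.mean()                               # Train(eta)
        ntrain_eta = correct[corrupted].mean() if corrupted.any() else float("nan")  # NTrain(eta)
        return train_eta, ntrain_eta, max(train_eta - ntrain_eta, 0.0)               # memorization gap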
2.1 COMPLEXITY MEASURES

We now define three complexity measures, all of which can be plugged in as the measure in Theorem II. The first one, C^mdl, is the minimum description length of a classifier. The other two measures, C^pc and C^dc, are superficially similar to Rademacher complexity (cf. Bartlett & Mendelson (2002)) in the sense that they capture the ability of the hypothesis to correlate with random noise.
Definition 2.3 (Complexity of training procedures). Let T be a training procedure taking as input a set (r, y) = {(r_i, y_i)}_{i=1}^n ∈ (R × Y)^n and outputting a classifier g : R → Y, and let η > 0. For every training set (r, y), we define the following three complexity measures with respect to r, y, η:

• The minimum description length of T is defined as C^mdl_{r,y,η}(T) := H(g), where we consider the model g as a random variable arising in the η-noisy experiment. (The name "minimum description length" is justified by the operational definition of entropy relating it to the minimum amortized length of a prefix-free encoding of a random variable.)

• The prediction complexity of T is defined as C^pc_{r,y,η}(T) := Σ_{i=1}^n I(g(r_i); ỹ_i), where the ỹ_i's are the labels obtained in the η-noisy experiment.

• The (unconditional) deviation complexity of T is defined as C^dc_{r,y,η}(T) := n · I(g(r_i) − y_i ; ỹ_i − y_i), where the random variables above are taken over i ∼ [n] and subtraction is done modulo |Y|, identifying Y with the set {0, ..., |Y| − 1}.

Conditioned on y and the choice of the index i, the deviations g(r_i) − y_i and ỹ_i − y_i determine the predictions g(r_i) and noisy labels ỹ_i, and vice versa. Hence we can think of C^dc as an "averaged" variant of C^pc, where we make the choice of the index i part of the sample space for the random variables. While we expect the two measures to be approximately close, the fact that C^dc takes i into the sample space makes it easier to estimate this quantity in practice without using a large number of experiment repetitions (see Figure B.2 for convergence rates). The measure C^mdl is harder to evaluate in practice, as it requires finding the optimal compression scheme for the classifier.

Section 3 contains the full proof of Theorem II. It is obtained by showing that: (i) for every r, y, η, and T it holds that C^dc_{r,y,η}(T) ≤ C^pc_{r,y,η}(T) ≤ C^mdl_{r,y,η}(T), and (ii) for every SSS algorithm T = (T_pre, T_fit) and distribution D = (D_train, D_test), the memorization gap of T is at most

  sqrt( E_{(x,y)∼D_train} C^dc_{T_pre(x),y,η}(T_fit) ) / ( η · sqrt(2n) ).    (1)

It is the quantity (1) that we compute in our experiments.
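C^dc lends itself to a simple plug-in estimate: run the η-noisy experiment a few times, pool the pairs (g(r_i) − y_i, ỹ_i − y_i) modulo |Y| over samples and trials, and compute the empirical mutual information of the pooled pairs. The sketch below is our own illustration of this idea (again with assumed `fit_simple`/`predict` helpers), not the authors' implementation.

    import numpy as np

    def estimate_cdc(fit_simple, predict, feats, labels, num_classes, eta=0.05, trials=20, seed=0):
        # Plug-in estimate of C^dc = n * I(Delta; N), pooling deviations over samples and trials.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        n = len(labels)
        deltas, noises = [], []
        for _ in range(trials):
            noisy = labels.copy()
            flip = rng.random(n) < eta
            noisy[flip] = rng.integers(0, num_classes, flip.sum())
            g = fit_simple(feats, noisy, num_classes)
            deltas.append((predict(g, feats) - labels) % num_classes)  # Delta_i = g(r_i) - y_i (mod |Y|)
            noises.append((noisy - labels) % num_classes)              # N_i = y~_i - y_i (mod |Y|)
        d, m = np.concatenate(deltas), np.concatenate(noises)
        joint = np.zeros((num_classes, num_classes))
        np.add.at(joint, (d, m), 1.0)                                  # empirical joint of (Delta, N)
        joint /= joint.sum()
        pd_, pn = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        nz = joint > 0
        mi = float((joint[nz] * np.log(joint[nz] / (pd_ @ pn)[nz])).sum())  # I(Delta; N) in nats
        return n * mi

Plugging such an estimate into Theorem 3.2 then gives the memorization-gap bound sqrt(C^dc/(2n))/η; because the estimator pools over all n samples, a modest number of trials already suffices, which is the practical advantage of C^dc over C^pc noted above.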
3 PROOF OF THEOREM II

We now prove Theorem II. We start by relating our three complexity measures. The following theorem shows that C^dc is upper bounded by C^pc, which in turn is bounded by the entropy of g.

Theorem 3.1 (Relation of complexity measures). For every r, y, η > 0, and T,

  C^dc_{r,y,η}(T) ≤ C^pc_{r,y,η}(T) ≤ C^mdl_{r,y,η}(T),

where g is the classifier output by T (considered as a random variable).

Proof. Fix T, r, y, η. We get ỹ by choosing i.i.d. random variables N_1, ..., N_n, each equalling 0 with probability 1 − η and uniform otherwise, and letting ỹ_i = y_i + N_i (mod |Y|).

We start by proving the second inequality, C^pc_{r,y,η}(T) ≤ H(g). Let g = T(r, ỹ) and let p = (g(r_1), ..., g(r_n)) be the vector of predictions. Then

  C^pc_{r,y,η}(T) = Σ_i I(p_i; ỹ_i) = Σ_i I(p_i; N_i),    (2)

with the last equality holding since for fixed y_i, N_i determines ỹ_i and vice versa. However, since the full vector p contains only more information than p_i, the right-hand side of (2) is at most Σ_{i=1}^n I(p; N_i) ≤ I(p; N_1, ..., N_n), using the fact that the N_i random variables are independent (see Lemma A.2). For a fixed r, the value of p is completely determined by g, and hence the entropy of p is at most H(g), establishing the second inequality of the theorem.

We now turn to the first inequality, C^dc_{r,y,η}(T) ≤ C^pc_{r,y,η}(T). Let Δ_i = p_i − y_i (mod |Y|). Then

  (1/n) C^pc_{r,y,η}(T) = E_{j∼[n]} I(p_j; N_j) = E_{j∼[n]} I(Δ_j; N_j),    (3)

since p_i determines Δ_i and vice versa (given y). But, since N_j = N | i = j and Δ_j = Δ | i = j, the right-hand side of (3) equals

  E_{j∼[n]} I(Δ; N | i = j) = E_{j∼[n]} [ H(N | i = j) − H(N | Δ, i = j) ].    (4)

Since N_1, ..., N_n are identically distributed, H(N | i = j) = H(N), which means that the right-hand side of (4) equals

  H(N) − E_{j∼[n]} H(N | Δ, i = j) ≥ H(N) − H(N | Δ) = I(Δ; N),

with the inequality holding since on average conditioning reduces entropy. By definition I(Δ; N) = (1/n) C^dc_{r,y,η}(T), establishing what we wanted to prove.

The complexity measures C^pc and C^dc are defined with respect to a fixed train set (r, y), rendering them applicable for single training sets such as CIFAR-10 and ImageNet that arise in practice. If D is a distribution over (r, y), then we define the complexity measures C^pc and C^dc with respect to D as the average of the corresponding measure with respect to (r, y) ∼ D. We now restate Theorem II:

Theorem 3.2 (Theorem II, restated). Let T = (T_pre, T_fit) be a training procedure obtained by first training T_pre on x ∈ X^n to obtain a representation r : X → R and then training T_fit on (r(x), y), where y ∈ Y^n, to obtain a classifier g : R → Y. Then, for every noise parameter η > 0 and distribution D_train over (X, Y)^n,

  Memorization gap(T) = (Train_{D_train,T}(η) − NTrain_{D_train,T}(η))_+ ≤ sqrt( C^dc_{R,η}(T_fit) / (2n) ) · (1/η),

where R is the distribution over (R × Y)^n induced by T_pre on D_train.

Note that the bound on the right-hand side is expressed only in terms of the complexity of the second stage T_fit and is independent of the complexity of T_pre. The crux of the proof is showing (close to) independence between the corrupted indices and the prediction deviation of g resulting from the noise.
Proof. Let (r, y) be sampled by first drawing (x, y) ∼ D_train over (X × Y)^n and then applying r = r(x) where r = T_pre(x). Consider the sample space of sampling ỹ according to the η-noisy distribution with respect to y, computing g = T_fit(r, ỹ), and sampling i ∼ [n]. We define the following two Bernoulli random variables over this sample space:

  Z = 1_{Δ=0} = 1 if g(r_i) = y_i and 0 otherwise;    B = 1_{N≠0} = 1 if ỹ_i ≠ y_i and 0 otherwise.

For a given r, y, since Z is determined by Δ and B is determined by N,

  I(Z; B) ≤ I(Δ; N) = C^dc_{r,y,η}(T_fit) / n.

By Lemma A.1, for every Bernoulli random variables B, Z,

  | E[Z] − E[Z | B = 1] | ≤ sqrt( I(Z; B) / 2 ) / E[B],

and hence in our case (since E[B] = η),

  E[Z] − E[Z | B = 1] ≤ sqrt( C^dc_{r,y,η}(T_fit) / (2n) ) · (1/η).

But E[Z] corresponds to the probability that g(r) = y for (r, y) in the train set, while E[Z | B = 1] corresponds to this probability over the noisy samples. Hence the memorization gap is bounded by

  E_{(r,y)∼R} [ sqrt( C^dc_{r,y,η}(T_fit) / (2n) ) · (1/η) ] ≤ (1/η) · sqrt( E_{(r,y)∼R} [ C^dc_{r,y,η}(T_fit) / (2n) ] ) = sqrt( C^dc_{R,η}(T_fit) / (2n) ) · (1/η),

using Jensen's inequality and the concavity of the square root for the first inequality.
4 THE THREE GAPS

We now briefly describe what is known and what we prove about the three components of the RRM bound. We provide some additional discussions in Appendix C, including "counter-examples" of algorithms that exhibit large values for each one of these gaps.
The robustness gap.
The robustness gap measures the decrease in training accuracy from adding η noisy labels, measured with respect to the clean labels. The robustness gap and related notions such as noise stability or tolerance have been studied in various works (cf. Frénay & Verleysen (2013); Manwani & Sastry (2013)). Interpolating classifiers (with zero train error) satisfy Train(η) ≥ 1 − η, and hence their robustness gap is at most η (see left panel of Figure 2). In SSS algorithms, since the representation is learned without using labels, the injection of label noise only affects the simple classifier, which is often linear. Robustness guarantees for linear classifiers have been given previously by Rudin (2005). While proving robustness bounds is not the focus of this paper, we note in the appendix some simple bounds for least-squares minimization of linear classifiers and the (potentially inefficient) Empirical Risk Minimization algorithm (see Appendices D.1 and D.2). Empirically, we observe that the robustness gap of SSS algorithms is often significantly smaller than η. (See left panels of Figure 2 and Figure 3.)

Figure 2 – Robustness, Rationality, and Memorization for CIFAR-10. Each blue point is a different combination of (architecture + self-supervised task + fitting algorithm). Each red point is a different architecture trained end-to-end with supervision. We use the '+' marker to denote the two best models of each type (SSS and supervised). No augmentations were added. Noise level is η = 0.05. Details in Appendix B.3.

The rationality gap.
To build intuition for the rationality gap, consider the case where the inputs x are images, and the label y is either "cat" or "dog". A positive rationality gap means that giving the incorrect label "dog" for a cat image x makes the output classifier more likely to classify x as a cat compared to the case where it is not given any label for x at all. Hence intuitively, a positive rationality gap corresponds to the training procedure being "irrational" or "inconsistent": wrong information should be only worse than no information, and we would expect the rationality gap to be zero or close to it. Indeed, the rationality gap is always zero for interpolating classifiers that fit the training data perfectly. Moreover, empirically the rationality gap is often small for SSS algorithms, particularly for the better-performing ones. (See middle panels of Figure 2 and Figure 3.)

We also show that a positive rationality gap corresponds to "leaving performance on the table" by proving the following theorem (see Section 6 for a formal statement and proof):

Theorem 4.1 (Performance on the table theorem, informal). For every training procedure T and distributions D_test, D_train = D_test^n, there exists a training procedure S satisfying Test_S ≥ Test_T + rationality gap(T) − o(1).

One interpretation of Theorem 4.1 is that we can always reduce the generalization gap to robustness + memorization if we are willing to move from the procedure T to S. In essence, if the rationality gap is positive, we could include the test sample in the train set with a random label to increase the test performance. However, this transformation comes at a high computational cost; inference for the classifier produced by S is as expensive as retraining from scratch. Hence, we view Theorem 4.1 more as a "proof of concept" than as a practical approach for improving performance.
Remark 4.2 (Why rationality?). Since SSS algorithms use a simple classifier (e.g., linear), the reader may wonder why we cannot directly prove bounds on the generalization gap. The issue is that the representation used by SSS algorithms is still sufficiently over-parameterized to allow memorizing the training set samples. As a pedagogical example, consider a representation-learning procedure that maps a label-free training set x to a representation r : X → R that has high quality, in the sense that the underlying classes become linearly separable in the representation space. Moreover, suppose that the representation space has dimension much smaller than n, and hence a linear classifier would not be able to fit noise, meaning the resulting procedure will have a small memorization gap and small empirical Rademacher complexity. Without access to the labels, we can transform r to a representation r′ that on input x will output r(x) if x is in the training set, and output the all-zero vector (or some other trivial value) otherwise. Given sufficiently many parameters, the representation r′ (or a close-enough approximation) can be implemented by a neural network. Since r and r′ are identical on the training set, the procedure using r′ will have the same train accuracy, memorization gap, and empirical Rademacher complexity. However, using r′, one cannot achieve better than trivial accuracy on unseen test examples. This does not contradict the RRM bound since this algorithm will be highly irrational.

The memorization gap. The memorization gap corresponds to the algorithm's ability to fit the noise (i.e., the gap increases with the number of fit noisy labels). If, for example, the classifier output is interpolating, i.e., it satisfies f(x_i) = ỹ_i for every i, then accuracy over the noisy samples will be 0 (since for them y_i ≠ ỹ_i). In contrast, the overall accuracy will be in expectation at least 1 − η, which means that the memorization gap will be ≈ 1 for small η. However, we show empirically (see right panels of Figures 2 and 3) that the memorization gap is small for many SSS algorithms, and we prove a bound on it in Theorem II. When combined with small rationality and robustness, this bound results in non-vacuous generalization bounds for various real settings (e.g., 48% for ResNet101 with SimCLRv2 on ImageNet, and as low as 4% for MoCo V2 with ResNet-18 on CIFAR-10). Moreover, unlike other generalization bounds, our bound decreases with data augmentation (see Figure 5).
Remark 4.3 (Memorization vs. Rademacher). The memorization gap, as well as the complexity measures defined in Section 2.1, have a superficial similarity to Rademacher complexity (Bartlett & Mendelson, 2002), in the sense that they quantify the ability of the output classifier to fit noise. One difference is that Rademacher complexity is defined with respect to 100% noise, while we consider the η-noisy experiment for small η. A more fundamental difference is that Rademacher complexity is defined via a supremum over all classifiers in some class. In contrast, our measures are defined with respect to a particular training algorithm. As mentioned, Zhang et al. (2017) showed that modern end-to-end supervised learning algorithms can fit 100% of their label noise. This is not the case for SSS algorithms, which can only fit 15%-25% of the CIFAR-10 training set when the labels are completely random (see Table B.1 in the appendix). However, by itself, the inability of an algorithm to fit random noise does not imply that the Rademacher complexity is small, and does not imply a small generalization gap. Indeed, the example of Remark 4.2 yields an SSS method with both a small memorization gap and small empirical Rademacher complexity, and yet it has a large generalization gap.

5 EMPIRICAL STUDY OF THE RRM BOUND
In support of our theoretical results, we conduct an extensive empirical study of the three gaps and empirically evaluate the theoretical bound on the memorization gap (from Equation (1)) for a variety of SSS algorithms for the CIFAR-10 and ImageNet datasets. We provide a summary of our setup and findings below. For a full description of the algorithms and hyperparameters, see Appendix B.
Figure 3 – Robustness, Rationality and Memorization for ImageNet. Each point represents a different combination of self-supervised learning algorithm (e.g., SimCLR), backbone architecture (e.g., ResNet-50) and simple classifier (e.g., linear classification). A star indicates experiments with 10 augmentations per training sample. Noise level is η = 5%. Full experimental details in Section B.

SSS Algorithms (T_pre, T_fit). For the first phase of training T_pre, we consider various self-supervised training algorithms that learn a representation without explicit training labels. There are two main types of representation learning methods: (1) contrastive learning, which finds an embedding by pushing "similar" samples closer, and (2) pre-text tasks, which hand-craft a supervised task that is independent of downstream tasks, such as predicting the rotation angle of a given image (Gidaris et al., 2018). Our analysis is independent of the type of representation learning method, and we focus on methods that achieve high test accuracy when combined with the simple test phase. The list of methods included in our study is Instance Discrimination (Wu et al., 2018), MoCoV2 (He et al., 2020), SimCLR (Chen et al., 2020a;b), AMDIM (Bachman et al., 2019), CMC (Tian et al., 2019), InfoMin (Tian et al., 2020), as well as adversarial methods such as BigBiGAN (Donahue & Simonyan, 2019).
Figure 4 – The RRM bound of SSS methods on ImageNet, with models sorted by the generalization gap. We plot the robustness, rationality and memorization gaps. Similar to Figure 1, for most models, the bound is tight and is dominated by the memorization gap. The Theorem II bound is marked for the two leftmost models (we did not evaluate it for the others, for computational reasons).
Figure 5 – Empirical RRM for the AMDIM SSS model on CIFAR-10 with an increasing number of augmentations. While the robustness and memorization gaps decrease, and so does our generalization bound, the rationality gap increases since D_train and D_test grow apart.

For the second phase of training (also known as the evaluation phase (Goyal et al., 2019)), we consider simple models such as regularized linear regression, or small Multi-Layer Perceptrons (MLPs). For each evaluation method, we run two experiments: 1) the clean experiment, where we train T_fit on the data and labels (x, y); 2) the η-noisy experiment, where we train T_fit on (x, ỹ) where ỹ are the η-noised labels. Unless specified otherwise, we set the noise to η = 5%.
Adding augmentations. We investigate the effect of data augmentation on the three gaps and the theoretical bound. For each training point, we sample t random augmentations (t = 10 unless stated otherwise) and add them to the train set. Note that in the noisy experiment two augmented samples of the same original point might be assigned different labels. We use the same augmentations used in the corresponding self-supervised training phase.
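As an illustration, the sketch below builds such an augmented train set, drawing a fresh (and possibly corrupted) label for each augmented copy. The torchvision-style transforms are stand-ins for the method-specific augmentations used in the paper, and the function names are ours.

    import random
    import torchvision.transforms as T

    augment = T.Compose([                      # CIFAR-10-style augmentations; stand-ins for the
        T.RandomResizedCrop(32),               # method-specific augmentations of each SSS model
        T.RandomHorizontalFlip(),
        T.ColorJitter(0.4, 0.4, 0.4, 0.1),
        T.RandomGrayscale(p=0.2),
    ])

    def noisy_label(y, num_classes=10, eta=0.05):
        # With probability eta, draw a uniformly random label (which may coincide with y).
        return random.randrange(num_classes) if random.random() < eta else y

    def build_augmented_noisy_set(images, labels, t=10, eta=0.05):
        aug_x, aug_y = [], []
        for x, y in zip(images, labels):
            for _ in range(t):
                aug_x.append(augment(x))               # t augmented copies per training point
                aug_y.append(noisy_label(y, eta=eta))  # copies of the same image may disagree
        return aug_x, aug_y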
Results. Figures 1 and 2 provide a summary of our experimental results for CIFAR-10. The robustness and rationality gaps are close to zero for most SSS algorithms, while the memorization gap is usually the dominant term, especially so for models with a larger generalization gap. Moreover, we see that C^dc often produces a reasonably tight bound for the memorization gap, leading to a generalization bound that can be as low as 4%. In Figures 3 and 4 we give a summary of our experimental results for SSS algorithms on ImageNet. Again, the rationality and robustness gaps are bounded by small constants. Notice that adding augmentations reduces memorization, but may lead to an increase in the rationality gap. This is also demonstrated in Figure 5, where we vary the number of data augmentations systematically for one SSS algorithm (AMDIM) on CIFAR-10. Since computing the Theorem II bound for ImageNet is computationally expensive, we compute it only for two algorithms, which achieve non-vacuous bounds between 47% and 48%, with room for improvement (see Appendix B.5.1).

6 POSITIVE RATIONALITY GAP LEAVES ROOM FOR IMPROVEMENT
We now prove the "performance on the table theorem," which states that we can always transform a training procedure with a positive rationality gap into a training procedure with better performance:
Theorem 6.1 (Performance on the table theorem, restated). For every training procedure T and D_test, n, η, if D_train = D_test^n and T has a positive rationality gap with respect to these parameters, then there exists a training procedure S such that

  Test_{S,D} ≥ NTrain_{T,D}(η) − o(1) = Test_{T,D} + rationality-gap(T) − o(1),    (5)

where o(1) is a term that vanishes with n, and assuming that Train_{T,D}(η) ≥ NTrain_{T,D}(η).

The assumption, stated differently, implies that the memorization gap will be positive. We expect this assumption to be true for any reasonable training procedure T (see right panel of Figure 2), since performance on noisy train samples will not be better than the overall train accuracy. Indeed, it holds in all the experiments described in Section 5. In particular (since we can always add noise to our data), the above means that if the rationality gap is positive, we can use the above to improve the test performance of "irrational" networks. We now provide a proof for the theorem.
Proof. Let T be a procedure with a positive rationality gap that we are trying to transform. Our new algorithm S would be the following:

• Training: On input a training set D = (x, ỹ) ∈ (X × Y)^n, algorithm S does not perform any computation, but merely stores the dataset D. Thus the "representation" of a point x is simply (x, D).

• Inference: On input a data point x and the original training dataset D, algorithm S chooses i ∼ [n] and lets D′ be the training set obtained by replacing (x_i, y_i) with (x, ỹ), where ỹ is chosen uniformly at random. We then compute f = T(D′) and output f(x).

First note that while the number of noisy samples could change by one by replacing (x_i, y_i) with (x, ỹ), since this number is distributed according to the Binomial distribution with mean ηn and standard deviation sqrt((1 − η)ηn) ≫ 1, this change can affect probabilities by at most an o(1) additive factor (since the statistical distance between the distribution Binom(η, n) and Binom(η, n) + 1 is o(1)). If Y has k classes, then with probability 1 − 1/k we will make (x, ỹ) noisy (y ≠ ỹ), in which case the expected performance on it will be NTrain_T(η). With probability 1/k, we choose the correct label y, in which case performance on this sample will be equal to the expected performance on clean samples, which by our assumption is at least NTrain_T(η) as well. Hence, the accuracy on the new test point is at least NTrain_T(η).

We stress that the procedure described above, while running in "polynomial time", is not particularly practical, since it makes inference as computationally expensive as training. However, it is a proof of concept that irrational networks are, to some extent, "leaving performance on the table".
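For concreteness, here is a minimal sketch (illustrative names, not the authors' code) of the inference step of S: the test point is spliced into the stored train set with a random label, T is retrained, and the retrained classifier labels the point.

    import random

    def s_infer(train_T, stored_x, stored_y, x, num_classes):
        # Inference for S: splice (x, random label) into the stored train set, retrain T, predict.
        i = random.randrange(len(stored_x))        # train index to overwrite
        y_rand = random.randrange(num_classes)     # uniformly random label for the test point
        new_x = list(stored_x); new_x[i] = x
        new_y = list(stored_y); new_y[i] = y_rand
        f = train_T(new_x, new_y)                  # every prediction retrains from scratch
        return f(x)

The retraining inside `s_infer` is exactly why the theorem is a proof of concept rather than a practical speed-up: each prediction costs a full training run of T.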
7 CONCLUSIONS AND OPEN QUESTIONS

This work demonstrates that SSS algorithms have small generalization gaps. While our focus is on the memorization gap, our work motivates more investigation of both the robustness and rationality gaps. In particular, we are not aware of any rigorous bounds for the rationality gap of SSS algorithms, but we view our "performance on the table" theorem (Theorem 4.1) as a strong indication that it is close to zero for natural algorithms. Given our empirical studies, we believe the assumptions of small robustness and rationality conform well to practice.

Our numerical bounds are still far from tight, especially for ImageNet, where evaluating the bound (more so with augmentations) is computationally expensive. Nevertheless, we find it striking that already in this initial work, we get non-vacuous (and sometimes quite good) bounds. Furthermore, the fact that the empirical RRM bound is often close to the generalization gap shows that there is significant room for improvement.

Overall, this work can be viewed as additional evidence for the advantages of SSS algorithms over end-to-end supervised learning. Moreover, some (very preliminary) evidence shows that end-to-end supervised learning implicitly separates into representation learning and classification phases (Morcos et al., 2018). Understanding the extent to which supervised learning algorithms implicitly perform SSS learning is an important research direction in its own right. To the extent this holds, our work might shed light on such algorithms' generalization performance as well.
ACKNOWLEDGEMENTS
We thank Dimitris Kalimeris, Preetum Nakkiran, and Eran Malach for comments on early drafts of this work. This work was supported in part by NSF award CCF 1565264, IIS 1409097, DARPA grant W911NF2010021, and a Simons Investigator Fellowship. We also thank Oracle and Microsoft for grants used for computational resources. Y.B. is partially supported by the MIT-IBM Watson AI Lab. Work partially performed while G.K. was an intern at Google Research.
REFERENCES
Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545, 2019.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020.

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249, 2017.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

Avrim Blum, Merrick L. Furst, Michael J. Kearns, and Richard J. Lipton. Cryptographic primitives based on hard learning problems. In Douglas R. Stinson (ed.), Advances in Cryptology - CRYPTO '93, 13th Annual International Cryptology Conference, Santa Barbara, California, USA, August 22-26, 1993, Proceedings, volume 773 of Lecture Notes in Computer Science, pp. 278–291. Springer, 1993. doi: 10.1007/3-540-48329-2.

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems 32, pp. 10836–10846. Curran Associates, Inc., 2019.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020a.

Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10542–10552, 2019.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

William Falcon and Kyunghyun Cho. A framework for contrastive self-supervised learning and designing a new approach, 2020.

Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2013.
Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Advances in Neural Information Processing Systems, volume 75 of Proceedings of Machine Learning Research, pp. 297–299. PMLR, 06–09 Jul 2018.

Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400, 2019.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.

Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 15663–15674. Curran Associates, Inc., 2019.

Alex Krizhevsky et al. Learning multiple layers of features from tiny images, 2009.

Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.

Pengpeng Liu, Michael Lyu, Irwin King, and Jia Xu. Selflow: Self-supervised learning of optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
Naresh Manwani and PS Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3):1146–1151, 2013.

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Ari S. Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation, 2018.

Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 11611–11622, 2019.

Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nati Srebro. Exploring generalization in deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5947–5956. Curran Associates, Inc., 2017.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. CoRR, abs/1805.12076, 2018.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.

David Page. How to train your resnet. https://myrtle.ai/how-to-train-your-resnet-4-architecture/, 2018.

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.

Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6989–6993. IEEE, 2020.
Cynthia Rudin. Stability analysis for regularized least squares regression. arXiv preprint cs/0502016, 2005.

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 5628–5637. PMLR, 2019.

Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. arXiv preprint arXiv:2001.09122, 2020.

Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. CoRR, abs/1906.05849, 2019.

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.

Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978, 2018.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations. OpenReview.net, 2017.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Springer, 2016.

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. In International Conference on Learning Representations. OpenReview.net, 2019.

A MUTUAL INFORMATION FACTS
Lemma A.1. If A, B are two Bernoulli random variables with nonzero expectation, then

  | E[A | B = 1] − E[A] | ≤ sqrt( I(A; B) / 2 ) / E[B].

Proof. A standard relation between mutual information and KL-divergence gives

  I(A; B) = D_KL( p_{A,B} || p_A p_B ).

On the other hand, by the Pinsker inequality,

  sup_{S ⊆ {0,1}×{0,1}} | p_{A,B}(S) − p_{A×B}(S) | ≤ sqrt( (1/2) D_KL( p_{A,B} || p_A p_B ) ) = sqrt( I(A; B) / 2 ).

Thus (letting S = {(1, 1)}),

  | Pr[A = 1, B = 1] − Pr[A = 1] Pr[B = 1] | ≤ sqrt( I(A; B) / 2 ).

Consequently,

  | E[A | B = 1] − E[A] | ≤ sqrt( I(A; B) / 2 ) / E[B].
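As a quick numerical sanity check of Lemma A.1 (our own addition, not part of the original appendix), one can verify the inequality on random joint distributions of two Bernoulli variables, computing I(A; B) with a plug-in estimate in nats:

    import numpy as np

    # Numeric check of Lemma A.1 on random Bernoulli joint distributions; illustrative only.
    rng = np.random.default_rng(0)
    for _ in range(1000):
        p = rng.random(4)
        p /= p.sum()                          # joint distribution over (A, B) in {0,1}^2
        p = p.reshape(2, 2)
        pa, pb = p.sum(1), p.sum(0)
        if pa[1] == 0 or pb[1] == 0:
            continue                          # the lemma requires nonzero expectations
        mi = sum(p[a, b] * np.log(p[a, b] / (pa[a] * pb[b]))
                 for a in range(2) for b in range(2) if p[a, b] > 0)
        lhs = abs(p[1, 1] / pb[1] - pa[1])    # |E[A | B=1] - E[A]|
        rhs = np.sqrt(mi / 2) / pb[1]         # sqrt(I(A;B)/2) / E[B]
        assert lhs <= rhs + 1e-12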
Lemma A.2. For three random variables W, X, Y such that X and Y are independent,

  I(W; X, Y) ≥ I(W; X) + I(W; Y).

Proof. Using the chain rule for mutual information we have:

  I(W; X, Y) = I(W; X) + I(W; Y | X).

Since X, Y are independent, H(Y | X) = H(Y), and since conditioning only reduces entropy, we have H(Y | W, X) ≤ H(Y | W). Combining the two we get

  I(W; Y | X) = H(Y | X) − H(Y | W, X) ≥ H(Y) − H(Y | W) = I(W; Y).

Thus we have that I(W; X, Y) ≥ I(W; X) + I(W; Y). Note that by induction we can extend this argument to show that I(W; X_1, ..., X_n) ≥ Σ I(W; X_i), where the X_i are mutually independent.
B EXPERIMENTAL DETAILS
We perform an empirical study of the RRM bound for a wide variety of self-supervised training methods on the ImageNet (Deng et al., 2009) and CIFAR-10 (Krizhevsky et al., 2009) training datasets. We provide a brief description of all the self-supervised training methods that appear in our results below. For each method, we use the official pre-trained models on ImageNet wherever available. Since very few methods provide pre-trained models for CIFAR-10, we train models from scratch. The architectures and other training hyper-parameters are summarized in Table E.4 and Table E.3. Since our primary aim is to study the RRM bound, we do not optimize for reaching the state-of-the-art performance in our re-implementations. For the second phase of training, we use L2-regularized linear regression, or small non-interpolating Multi-layer perceptrons (MLPs).
B.1 SELF-SUPERVISED TRAINING METHODS (T_pre)

There is a variety of self-supervised training methods for learning representations without explicit labels. The two main branches of self-supervised learning methods are:

1. Contrastive learning: These methods seek to find an embedding of the dataset that pushes a positive pair of images close together and a pair of negative images far from each other. For example, two different augmented versions of the same image may be considered a positive pair, while two different images may be considered a negative pair. Different methods such as Instance Discrimination, MoCo, SimCLR, AMDIM, differ in the way they select the positive/negative pairs, as well as other details like the use of a memory bank or the encoder architecture. (See Falcon & Cho (2020) for a detailed comparison of these methods.)

2. Handcrafted pretext tasks: These methods learn a representation by designing a fairly general supervised task, and utilizing the penultimate or other intermediate layers of this network as the representation. Pretext tasks include a diverse range of methods such as predicting the rotation angle of an input image (Gidaris et al., 2018) (see the sketch below), solving jigsaw puzzles (Noroozi & Favaro, 2016), colorization (Zhang et al., 2016), denoising images (Vincent et al., 2008) or image inpainting (Pathak et al., 2016).

Additionally, adversarial image generation can be used by augmenting the image generator with an encoder (Donahue & Simonyan, 2019). We focus primarily on contrastive learning methods since they achieve state-of-the-art performance.
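To illustrate the pretext-task idea concretely, the following is a minimal sketch in the spirit of rotation prediction (Gidaris et al., 2018); it is our own simplification, not the original implementation, and the helper name is an assumption.

    import torch

    def rotation_batch(images):
        # images: (B, C, H, W). Returns 4 rotated copies of each image and the rotation class.
        rots = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]   # 0, 90, 180, 270 degrees
        x = torch.cat(rots, dim=0)
        y = torch.arange(4).repeat_interleave(images.size(0))            # rotation labels 0..3
        return x, y

    # A network trained with cross-entropy to predict y from x; its penultimate layer
    # then serves as the (label-free) representation r used in the SSS pipeline.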
We now describe these methods briefly.

Instance Discrimination: (Wu et al., 2018) In essence, Instance Discrimination performs supervised learning with each training sample as a separate class. For each training sample v = f_θ(x), they minimize the non-parametric softmax loss

  J(θ) = − Σ_{i=1}^n log( exp(v_i^T v / τ) / Σ_{j=1}^n exp(v_j^T v / τ) ),    (6)

where v_i = f_θ(x_i) is the feature vector for the i-th example and τ is a temperature hyperparameter. They use memory banks and a contrastive loss (also known as Noise Contrastive Estimation or NCE (Gutmann & Hyvärinen, 2010)) for computing this loss efficiently for large datasets. So in this case, a positive pair is an image and itself, while a negative pair is two different training images.

Momentum Contrastive (MoCo): (He et al., 2020) MoCo replaces the memory bank in Instance Discrimination with a momentum-based query encoder. MoCoV2 (Chen et al., 2020c) applies various modifications over SimCLR, like a projection head, and combines it with the MoCo framework for improved performance.
AMDIM: (Bachman et al., 2019) AMDIM uses two augmented versions of the same image aspossitive pairs. For these augmentations, they use random resized crops, random jitters in colorspace, random horizontal flips and random conversions to grayscale. They apply the NCE lossacross multiple scales, by using features from multiple layers. They use a modified ResNet bychanging the receptive fields to decrease overlap between positive pairs.
CMC: (Tian et al., 2019) CMC creates two views for contrastive learning by converting each imageinto the Lab color space. L and ab channels from the same image are considered to be a positivepair, while those from two different images are considered to be a negative pair.
PiRL: (Misra & Maaten, 2020) PiRL first creates a jigsaw transformation of an image (it divides the image into 9 patches and shuffles these patches). It treats an image and its jigsaw transform as a positive pair, and an image and the jigsaw transform of a different image as a negative pair.
SimCLRv1 and SimCLRv2: (Chen et al., 2020a;b) SimCLR also uses strong augmentations to create positive and negative pairs: random resized crops, random Gaussian blurring, and random jitters in color space. Crucially, it uses a projection head that maps the representations to a 128-dimensional space where the contrastive loss is applied. It does not use a memory bank, but instead uses a large batch size.

InfoMin: (Tian et al., 2020) InfoMin uses random resized crops, random color jitters, and random Gaussian blurring, as well as the jigsaw shuffling from PiRL.
B.2 Simple Classifier (T_fit)

After training the representation-learning method, we extract representations r for the training and test images. We do not add random augmentations to the training images (unless stated otherwise). Then, we train a simple classifier on the dataset {(r(x_i), y_i)}_{i=1}^n. We use a linear classifier in most cases, but we also try a small multi-layer perceptron (as long as it has few parameters and does not interpolate the training data). For some methods we add weight decay to achieve good test accuracy (see Table E.4 and Table E.3 for the values used for each method). For the noisy experiment, we set the noise level to η = 5%. To compute the complexity bound C_dc we run 20 trials (the same experiment with different random seeds) of the noisy experiment for CIFAR-10 and 50 trials for ImageNet.
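A minimal sketch of this fitting phase and of the η-noisy re-run is given below. It assumes the frozen representations have already been extracted into arrays, uses scikit-learn's Ridge as one concrete choice of L2-regularized linear regression, and treats the exact noise model (replacing a label with a uniformly random class) and regularization strength as illustrative rather than the exact settings from our experiments.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_linear(R, y, num_classes=10, alpha=1e-6):
    # L2-regularized linear regression onto one-hot labels (the "simple" classifier).
    Y = np.eye(num_classes)[y]
    return Ridge(alpha=alpha).fit(R, Y)

def noisy_experiment(R, y, eta=0.05, num_classes=10, seed=0):
    # Replace each label with a uniformly random class with probability eta,
    # then re-fit the simple classifier on the noisy labels.
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < eta
    y_noisy[flip] = rng.integers(0, num_classes, size=flip.sum())
    clf = fit_linear(R, y_noisy, num_classes)
    preds = clf.predict(R).argmax(axis=1)
    return preds, y_noisy, flip

# Repeating noisy_experiment with different seeds (20 trials for CIFAR-10, 50 for
# ImageNet) gives the samples from which the complexity bound is estimated.
```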
B.3 Experimental Details for Each Plot

Figure 1.
This figure shows the robustness, rationality, and memorization gaps for various SSS algorithms trained on CIFAR-10. The type of self-supervised method, the encoder architecture, and the training hyperparameters are described in Table E.3. For the second phase T_fit, we use L2-regularized linear regression for all the methods. For each algorithm listed in Table E.3, the figure contains two points, one without augmentations and one with augmentations. Further, we compute the complexity measure C_dc for all the methods. All the values (along with the test accuracy) are listed in Table E.1.

Figure 2.
This figure shows the robustness, rationality, and memorization gaps on CIFAR-10 for the same methods as in Figure 1. We only include the points without augmentation, to show how rationality behaves when (D_train, D_test) are identical. All the values (along with the test accuracy) are listed in Table E.1. In addition, we add three end-to-end fully supervised methods (red circles) to compare and contrast the behavior of each of the gaps for SSS and supervised methods. For the supervised architectures, we train a Myrtle-5 (Page, 2018) convolutional network, a ResNet-18 (He et al., 2016), and a WideResNet-28-10 (Zagoruyko & Komodakis, 2016) with standard hyperparameters.

Figure 3 and Figure 4.
These figures show the robustness, rationality, and memorization gaps on the ImageNet dataset. The type of self-supervised method, the encoder architecture, and the training hyperparameters are described in Table E.4. For the second phase T_fit, we use L2-regularized linear regression for all the methods. The figures also contain some points with 10 augmentations per training image. Further, we compute the complexity measure C_dc for SimCLRv2 with the ResNet-50-1x and ResNet-101-2x architectures. All the values (along with the test accuracy) are listed in Table E.2.

Figure 5.
This figure shows the effect of increasing the number of augmentations. We vary the number of augmentations t added per training image and re-train the simple classifier. We do this for the CIFAR-10 dataset, with AMDIM self-supervised training, the AMDIM encoder, and linear regression (see Table E.3 for the hyperparameters).

B.4 Additional Results
B.4.1 Generalization Error of SSS Algorithms
To show that SSS algorithms have qualitatively different generalization behavior compared to standard end-to-end supervised methods, we repeat the experiment from Zhang et al. (2017). We randomize all the training labels in the CIFAR-10 dataset and train 3 high-performing SSS methods on these noisy labels. For results see Table B.1. Unlike fully supervised methods, SSS algorithms do not achieve 100% training accuracy on the dataset with noisy labels. In fact, their training accuracies are fairly low (≈ 15%–22%). This suggests that the empirical Rademacher complexity is bounded. The algorithms were trained without any augmentations during the simple fitting phase for both SSS and supervised algorithms. The SSS methods were trained using the parameters described in Table E.3.

Table B.1 – Train and test performance with 100% label noise for fully supervised vs. SSS algorithms on CIFAR-10. The first rows are from Zhang et al. (2017), while the SSS rows are our results, averaged over 5 runs without augmentations.

Training method | Architecture / Method | Train Acc | Test Acc
Supervised (Zhang et al., 2017) | Inception (no aug) | 100% | 86%
  (fitting random labels) | | 100% | 10%
SSS | SimCLR (ResNet-50) + Linear | 94% | 92%
  (fitting random labels) | | 22% | 10%
SSS | AMDIM (AMDIM Encoder) + Linear | 94% | 87.4%
  (fitting random labels) | | 18% | 10%
SSS | MoCoV2 (ResNet-18) + Linear | 69% | 67.6%
  (fitting random labels) | | 15% | 10%
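The label-randomization experiment behind Table B.1 can be sketched in the same way as the noisy experiment above; Ridge regression on frozen features is again only one possible instantiation of the simple fitting phase, and the feature arrays are assumed to have been extracted beforehand.

```python
import numpy as np
from sklearn.linear_model import Ridge

def random_label_experiment(R_train, R_test, y_test, num_classes=10, alpha=1e-6, seed=0):
    # Fit the simple classifier on frozen SSS features with 100% random labels.
    rng = np.random.default_rng(seed)
    y_rand = rng.integers(0, num_classes, size=len(R_train))
    clf = Ridge(alpha=alpha).fit(R_train, np.eye(num_classes)[y_rand])
    train_acc = (clf.predict(R_train).argmax(1) == y_rand).mean()
    test_acc = (clf.predict(R_test).argmax(1) == y_test).mean()  # ~chance by construction
    return train_acc, test_acc
```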
B.5 RRM Bound with Varying Noise Parameter
We now investigate the effect of varying noise levels on the three gaps as well as on the complexity. We see that the robustness gap increases as we add more noise; this is expected, as noise should affect the clean training accuracy. We also observe that the memorization gap decreases, suggesting that C_dc, as a function of η, goes down faster than η (see Section 2.1). The Theorem II bound on the memorization gap also decays strongly with η, becoming tighter as the noise increases.

Figure B.1 – RRM + bound with changing η.
B.5.1 Convergence of Complexity Measures
We now plot (see Figure B.2) the complexity measures C_dc and C_pc as the number of trials increases, for one of the SSS algorithms. As expected, C_dc < C_pc, and C_dc converges within about 20 trials for CIFAR-10. On the other hand, the complexity computations for ImageNet need many more trials for convergence, since the dataset contains about 10 augmentations × 1.2 million training samples, making it cost-prohibitive to compute for all the methods. For CIFAR-10, we use AMDIM with the AMDIM encoder architecture without augmentations. For ImageNet, we use SimCLRv2 with the ResNet-101 architecture with 10 augmentations per training sample.
C Examples of Algorithms with Large Gaps
While we argued that SSS algorithms will tend to have small robustness, rationality, and memorization gaps, this does not hold in the worst case, and there are examples of such algorithms that exhibit large gaps in each of those cases.

Figure B.2 – Convergence of Theorem II bounds for CIFAR-10 and ImageNet. (a) For CIFAR-10, the bound based on C_dc is lower than the one based on C_pc, as expected, and converges within 20 trials. (b) For ImageNet, C_dc is slow to converge due to the large dataset size (10 augmentations × the number of training samples).
C.1 Large Robustness Gap
A large robustness gap can only arise via computational (as opposed to statistical) considerations. That is, if a training procedure outputs a classifier f ∈ F that achieves on average accuracy α on a clean train set (X, Y), then with high probability, if (X, Ỹ) is an η-noisy train set, there exists f′ ∈ F that achieves α(1 − η) accuracy on this train set (by fitting only the "clean" points).

However, the training algorithm might not always be able to find such a classifier. For example, if the distribution has the form (x, y) = (x, ∑_j a_j x_j mod 2), where x ∼ GF(2)^ℓ = Z_2^ℓ and a ∈ GF(2)^ℓ is some hidden vector, then there is an efficient algorithm (namely Gaussian elimination) to find a given the samples (x, y) and hence achieve accuracy 1. However, for every ε > 0 and η > 0, there is no known efficient algorithm that, given η-perturbed equations of the form {⟨a, x_i⟩ = ỹ_i}_{i∈[n]} (each right-hand side flipped with probability η), finds a′ ∈ GF(2)^ℓ such that ∑_j a′_j x_j = ∑_j a_j x_j (mod 2) on a 1/2 + ε fraction of the x's. This is known as the learning parity with noise (LPN) problem (Blum et al., 1993).

The assumption of robustness is necessary for a small generalization gap, in the sense that we can come up with (contrived) examples of algorithms that have small rationality and memorization gaps while still having a large generalization gap. For example, consider an algorithm T that has a large generalization gap (high train accuracy and small test accuracy), and suppose we augment it to the following algorithm:

T′(x, y) = T(x, y) if y is "clean", and 0 if y is "noisy",

where 0 denotes the constant zero function (e.g., some trivial classifier) and we use some algorithm to estimate whether or not the labels are noisy. (Such estimates can often be achieved in many natural cases.) The algorithm T′ will inherit the generalization gap of T, since that depends only on the experiment without noise. Since performance on noisy and clean training samples will be the same (close to random), T′ will have zero memorization gap. Since we have assumed small test accuracy, it will have zero rationality gap as well.
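To illustrate why the noiseless parity problem really is computationally easy, here is a small self-contained sketch of Gaussian elimination over GF(2); the function name and interface are ours, not from the paper, and it assumes the system is consistent (no label noise).

```python
import numpy as np

def solve_parity_gf2(X, y):
    # Recover a hidden vector a with <a, x_i> = y_i (mod 2) for all i,
    # by Gaussian elimination on the augmented matrix [X | y] over GF(2).
    A = np.concatenate([X % 2, (y % 2).reshape(-1, 1)], axis=1).astype(np.uint8)
    n, m = X.shape
    row = 0
    for col in range(m):
        pivot = next((r for r in range(row, n) if A[r, col]), None)
        if pivot is None:
            continue                        # no pivot in this column (underdetermined)
        A[[row, pivot]] = A[[pivot, row]]   # swap the pivot row into place
        for r in range(n):
            if r != row and A[r, col]:
                A[r] ^= A[row]              # eliminate this column in all other rows
        row += 1
    a = np.zeros(m, dtype=np.uint8)
    for r in range(row):                    # read the solution off the pivot rows
        lead = int(np.argmax(A[r, :m] == 1))
        a[lead] = A[r, m]
    return a

# With n >> ell random equations the system has full column rank with high
# probability, and the hidden vector is recovered exactly:
rng = np.random.default_rng(0)
ell, n = 20, 100
a_true = rng.integers(0, 2, ell)
X = rng.integers(0, 2, (n, ell))
assert np.array_equal(solve_parity_gf2(X, X @ a_true % 2), a_true)
```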
C.2 Large Rationality Gap

As discussed in Section 6, in the case that D_train = D_test^n, a robust algorithm with a large rationality gap leaves "performance on the table". We can obtain such algorithms by artificially dropping performance on the test data. For example, in the SSS framework, since the representation r is over-parameterized and can memorize the entire train set, we can consider the trivial representation

r(x) = x if x is in the train set, and 0 otherwise.

If we now train some simple classifier on r(x), then it can have non-trivial performance on the noisy train samples, while getting trivial accuracy on all samples outside the train set.

In cases where D_train and D_test are different (for example, when D_train is an augmented version of D_test), we can no longer claim that a large rationality gap corresponds to "leaving performance on the table". For example, we do observe (mild) growth in the rationality gap as we add more augmented points to the training set.

C.3 Large Memorization Gap
It is not hard to find examples of networks with a large memorization gap. Indeed, as mentioned before, any standard interpolating supervised learning algorithm will have a memorization gap close to 1.
D Simple Robustness Bounds
While robustness is not the focus of this work, we collect here two observations on the robustness of the least-squares and minimum-risk classifiers. These bounds are arguably folklore, but we state them here for completeness.
D.1 Robustness of Least Squares Classifiers
One can prove robustness for classes of algorithms under varying assumptions. As a simple example, we record here a self-contained observation of how margin leads to robustness in least squares minimization. This is a very simple but also pessimistic bound, and much better ones often hold.
Lemma D.1. Let x_1, …, x_n ∈ R^d and y_1, …, y_n ∈ [k], and consider a linear function f : R^d → R^k that minimizes the quantity ∑_{i∈[n], j∈[k]} |f(x_i)_j − 1[y_i = j]|², and suppose that for a p fraction of the i's, the maximum over j ∈ [k] of f(x_i)_j is attained at j = y_i and is γ larger than the second-largest value. Then in expectation, if we let ỹ be the η-noisy version of y and f̃ minimizes ∑_{i∈[n], j∈[k]} |f̃(x_i)_j − 1[ỹ_i = j]|², we get that argmax_j f̃(x_i)_j = y_i for at least a p − 4η/γ² fraction of the i's.

Proof. We identify y with its "one-hot" encoding as a vector in R^{nk}. Let V ⊆ R^{nk} be the subspace of all vectors of the form (g(x_1), …, g(x_n)) for linear g : R^d → R^k. If f is the minimizer in the lemma statement and z = (f(x_1), …, f(x_n)), then z = Π_V y, where Π_V is the orthogonal projection to the subspace V. If f̃ is the minimizer for the noisy labels and z̃ = (f̃(x_1), …, f̃(x_n)), then z̃ = Π_V ỹ = Π_V (y + e), where e is the noise vector ỹ − y.

Hence ‖z − z̃‖² = ‖Π_V e‖² ≤ ‖e‖². But in expectation ‖e‖² ≤ 2ηn (since we flip a label with probability at most η, and each flip changes two coordinates of the one-hot encoding). For every point i for which the margin was at least γ in z, if z̃'s prediction differs from y_i, then the contribution of the i-th block to their squared norm difference is at least γ²/2 (by shifting the maximum coordinate by −γ/2 and the second-largest one by γ/2). Hence at most 4ηn/γ² of these points can have different predictions in z and z̃.
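As a rough numerical illustration of the lemma on synthetic, well-separated Gaussian data (this checks the flavor of the statement, not its constants, which are quite pessimistic):

```python
import numpy as np

def least_squares_onehot(X, y, k):
    # Linear f minimizing sum_{i,j} |f(x_i)_j - 1[y_i = j]|^2 (no intercept, for simplicity).
    W, *_ = np.linalg.lstsq(X, np.eye(k)[y], rcond=None)
    return W                                            # f(x) = x @ W

def margin_robustness_demo(n=2000, d=50, k=10, eta=0.1, gamma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=(k, d))                        # well-separated class means
    y = rng.integers(0, k, size=n)
    X = mu[y] + 0.3 * rng.normal(size=(n, d))

    scores = X @ least_squares_onehot(X, y, k)
    top2 = np.sort(scores, axis=1)[:, -2:]              # second-largest and largest score
    p = ((scores.argmax(1) == y) & (top2[:, 1] - top2[:, 0] >= gamma)).mean()

    y_noisy = y.copy()                                  # eta-noisy labels
    flip = rng.random(n) < eta
    y_noisy[flip] = rng.integers(0, k, size=flip.sum())
    agree = ((X @ least_squares_onehot(X, y_noisy, k)).argmax(1) == y).mean()
    print(f"margin fraction p = {p:.3f}, "
          f"agreement of the noisy-trained classifier with the clean labels = {agree:.3f}")

margin_robustness_demo()
```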
D.2 Robustness of Empirical Risk Minimizer
The (potentially inefficient) algorithm that minimizes the classification errors is always robust.
Lemma D.2. Let T(x, y) = argmin_{f∈F} (1/n) ∑_{i=1}^n 1[f(x_i) ≠ y_i]. Then for every η > 0,

Robustness gap(T) ≤ 2η.

Proof. Let (x, y) be any train set, let α = min_{g∈F} (1/n) ∑_{i=1}^n 1[g(x_i) ≠ y_i], and let f be the minimizer of this quantity. Let ỹ be the η-noisy version of y and let η̃ be the fraction of i on which y_i ≠ ỹ_i. Then,

(1/n) ∑_{i=1}^n 1[f(x_i) ≠ ỹ_i] ≤ α + η̃.   (7)

Hence if f̃ is the minimizer of the left-hand side of (7) over F, then we know that f̃(x_i) ≠ ỹ_i for at most an α + η̃ fraction of the i's, and so f̃(x_i) ≠ y_i for at most an α + 2η̃ fraction of the i's. Since the train accuracy of T is 1 − α and the expectation of η̃ is η, we get that in expectation Train_T(η) ≥ Train_T − 2η.

E Large Tables
Table E.1 – Summary of all the methods, architectures, and the corresponding results (gaps and accuracies) on CIFAR-10, sorted by generalization gap. All values are percentages. While Figure 1 already plots this data, here we also provide the test performance of the corresponding models.

Method | Backbone | DataAug | Generalization Gap | Robustness | Memorization | Rationality | Theorem II bound | RRM bound | Test Acc
mocov2 | resnet18 | True | -7.35 | 0.07 | 0.21 | 0.00 | 3.47 | 0.28 | 67.19
mocov2 | wide_resnet50_2 | True | -6.37 | 0.18 | 1.03 | 0.00 | 7.63 | 1.21 | 70.99
mocov2 | resnet101 | True | -6.01 | 0.15 | 0.71 | 0.00 | 6.38 | 0.86 | 68.58
mocov2 | resnet50 | True | -5.38 | 0.19 | 0.84 | 0.00 | 6.99 | 1.03 | 69.68
simclr | resnet50 | True | -2.89 | 0.30 | 0.55 | 0.00 | 6.63 | 0.85 | 91.96
amdim | resnet101 | True | -0.91 | 0.64 | 3.70 | 0.00 | 25.99 | 4.34 | 63.56
amdim | resnet18 | True | 0.33 | 0.23 | 1.15 | 0.00 | 8.66 | 1.38 | 62.84
mocov2 | resnet18 | False | 1.43 | 0.15 | 1.24 | 0.03 | 14.14 | 1.43 | 67.60
simclr | resnet18 | False | 1.43 | 0.28 | 0.79 | 0.36 | 13.35 | 1.43 | 82.50
amdim | wide_resnet50_2 | True | 1.60 | 0.69 | 2.46 | 0.00 | 19.20 | 3.15 | 64.38
simclr | resnet50 | False | 1.97 | 0.22 | 0.78 | 0.97 | 15.75 | 1.97 | 92.00
simclr | resnet50 | False | 2.24 | 0.52 | 1.71 | 0.01 | 19.53 | 2.24 | 84.94
mocov2 | resnet50 | False | 2.72 | 0.30 | 2.96 | 0.00 | 24.18 | 3.26 | 70.09
mocov2 | resnet101 | False | 2.82 | 0.33 | 3.03 | 0.00 | 22.78 | 3.36 | 69.08
mocov2 | wide_resnet50_2 | False | 3.11 | 0.38 | 2.79 | 0.00 | 22.39 | 3.18 | 70.84
amdim | resnet50_bn | True | 3.69 | 0.84 | 4.22 | 0.00 | 31.12 | 5.06 | 66.44
amdim | resnet18 | False | 4.34 | 0.42 | 4.58 | 0.00 | 33.47 | 5.00 | 62.28
amdim | amdim_encoder | True | 4.43 | 0.68 | 0.36 | 3.39 | 10.32 | 4.43 | 87.33
amdim | amdim_encoder | False | 6.68 | 2.08 | 5.69 | 0.00 | 70.52 | 7.77 | 87.38
amdim | resnet101 | False | 12.46 | 1.22 | 14.26 | 0.00 | 100.00 | 15.49 | 62.43
amdim | wide_resnet50_2 | False | 13.07 | 1.70 | 15.33 | 0.00 | 100.00 | 17.03 | 63.80
amdim | resnet50_bn | False | 14.73 | 1.81 | 16.63 | 0.00 | 100.00 | 18.43 | 66.28

Table E.2 – Summary of all the methods, architectures, and their corresponding results (gaps and accuracies) on ImageNet, sorted by generalization gap. All values are percentages. While Figure 4 already plots this data, here we also provide the test performance of the corresponding models.

Method | Backbone | DataAug | Generalization Gap | Robustness | Memorization | Rationality | Theorem II bound | RRM bound | Test Acc
simclrv2 | r50_1x_sk0 | True | -2.34 | 0.26 | 0.68 | 0.00 | 46.93 | 0.94 | 70.96
simclrv2 | r101_2x_sk0 | True | 0.63 | 0.10 | 0.80 | 0.00 | 47.90 | 0.91 | 77.24
simclrv2 | r152_2x_sk0 | True | 1.00 | 0.13 | 0.77 | 0.10 | NA | 1.00 | 77.65
moco | ResNet-50 | True | 1.32 | 0.57 | 0.93 | 0.00 | NA | 1.49 | 70.15
InfoMin | ResNet-50 | True | 4.88 | 0.81 | 1.01 | 3.06 | NA | 4.88 | 72.29
PiRL | ResNet-50 | True | 6.23 | 0.29 | 0.99 | 4.95 | NA | 6.23 | 60.56
InsDis | ResNet-50 | True | 6.85 | 0.25 | 1.13 | 5.46 | NA | 6.85 | 58.30
simclrv2 | r101_1x_sk1 | False | 8.23 | 0.71 | 4.66 | 2.86 | NA | 8.23 | 76.07
InfoMin | ResNet-50 | False | 10.21 | 2.34 | 8.96 | 0.00 | NA | 11.31 | 70.31
simclrv2 | r152_1x_sk0 | False | 10.32 | 1.12 | 6.93 | 2.26 | NA | 10.32 | 74.17
simclrv2 | r101_1x_sk0 | False | 10.53 | 1.11 | 6.99 | 2.42 | NA | 10.53 | 73.04
simclrv2 | r50_1x_sk0 | False | 10.62 | 0.99 | 7.31 | 2.31 | NA | 10.62 | 70.69
moco | ResNet-50 | False | 10.72 | 1.82 | 7.86 | 1.04 | NA | 10.72 | 68.39
simclrv2 | r152_2x_sk0 | False | 10.92 | 0.75 | 7.45 | 2.72 | NA | 10.92 | 77.25
simclrv2 | r101_2x_sk0 | False | 11.02 | 0.74 | 7.51 | 2.78 | NA | 11.02 | 76.72
simclr | ResNet50_1x | False | 11.07 | 1.22 | 7.73 | 2.13 | NA | 11.07 | 68.73
simclrv2 | ResNet-50 | False | 11.16 | 0.64 | 7.67 | 2.85 | NA | 11.16 | 74.99
PiRL | ResNet-50 | False | 11.43 | 1.49 | 8.26 | 1.68 | NA | 11.43 | 59.11
InsDis | ResNet-50 | False | 12.02 | 1.40 | 8.52 | 2.10 | NA | 12.02 | 56.67
amdim | ResNet-50 | False | 13.62 | 0.90 | 9.72 | 3.01 | NA | 13.62 | 67.69
CMC | ResNet-50 | False | 14.73 | 2.30 | 12.30 | 0.13 | NA | 14.73 | 54.60
bigbigan | ResNet-50 | False | 29.60 | 3.13 | 25.19 | 1.27 | NA | 29.60 | 50.24

Table E.3 – Summary of training methods with their hyper-parameters for CIFAR-10.
Self-supervised method | Backbone architectures | Self-supervised training | Evaluation | Simple-phase optimization
AMDIM | AMDIM Encoder, ResNet-18, ResNet-50, WideResNet-50, ResNet-101 | PLB default parameters | Linear | Adam (β1 = 0.…, β2 = 0.…), constant LR = 2e-4, batch size = 500, weight decay = 1e-6
MoCoV2 | ResNet-18, ResNet-50, WideResNet-50, ResNet-101 | PLB default parameters | Linear | Adam (β1 = 0.…, β2 = 0.…), constant LR = 2e-4, batch size = 500, weight decay = 1e-6
SimCLR | ResNet-18, ResNet-50 | batch size = 128, 200 epochs | Linear | SGD (momentum = 0.9), constant LR = 0.1, weight decay = 1e-6
SimCLR | ResNet-50 | batch size = 512, 600 epochs | Linear | SGD (momentum = 0.9), constant LR = 0.1, weight decay = 1e-6
Table E.4 – Summary of training methods with their hyper-parameters for ImageNet.

Self-supervised method | Backbone architecture | Pre-trained model | Evaluation | Optimization | Weight decay | Epochs
Instance Discrimination | ResNet-50 | PyContrast | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
MoCo | ResNet-50 | Official | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
PiRL | ResNet-50 | PyContrast | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
CMC | ResNet-50 | PyContrast | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {…} by factor 0.2 | 0 | 40
AMDIM | AMDIM Encoder | Official | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {15, 25} by factor 0.2 | 1e-3 | 40
BigBiGAN | ResNet-50 | Official | Linear | SGD (momentum = 0.9), initial LR = 30, LR drop at {15, 25} | … | …