Model Similarity Mitigates Test Set Overuse
Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, Benjamin Recht
University of California, Berkeley
Abstract
Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, industrial scale tuning, among other applications, all involve test data reuse beyond guidance by statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We proffer a new explanation for the apparent longevity of test data: many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings.
Be it validation sets for model tuning, popular benchmark data, or machine learning competitions, the holdout method is central to the scientific and industrial activities of the machine learning community. As compute resources scale, a growing number of practitioners evaluate an unprecedented number of models against various holdout sets. These practices, collectively, put significant pressure on the statistical guarantees of the holdout method. Theory suggests that for $k$ models chosen independently of $n$ test data points, the holdout method provides valid risk estimates for each of these models up to a deviation on the order of $\sqrt{\log(k)/n}$. But this bound is the consequence of an unrealistic assumption. In practice, models incorporate prior information about the available test data since human analysts choose models in a manner guided by previous results. Adaptive hyperparameter search algorithms similarly evolve models on the basis of past trials.

Adaptivity significantly complicates the theoretical guarantees of the holdout method. A simple adaptive strategy, resembling the practice of selectively ensembling $k$ models, can bias the holdout method by as much as $\sqrt{k/n}$. If this bound were attained in practice, holdout data across the board would rapidly lose its value over time. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse [14, 19].

In this work, we contribute a new explanation for why the adaptive bound is not attained in practice and why even the standard non-adaptive bound is more pessimistic than it needs to be. Our explanation centers around the phenomenon of model similarity.
Figure 1: (a) Pairwise model similarities on ImageNet: the empirical pairwise similarity between ImageNet models and the hypothetical similarity between the same models if they were making mistakes independently. (b) The number of models that can be tested on ImageNet such that the population error rates for all models are estimated up to a given error $\pm\varepsilon$ with a given probability; we compare the guarantee of the standard union bound with that of a union bound which takes model similarities into account.

Practitioners evaluate models that incorporate common priors, past experiences, and standard practices. As we show empirically, this results in models that exhibit significant agreement in their predictions, well beyond what would follow from their accuracy values alone. Complementing our empirical investigation of model similarity, we provide a new theoretical analysis of the holdout method that takes model similarity into account, vastly improving over known bounds in the adaptive and non-adaptive cases when model similarity is high.

Our contributions are two-fold. On the empirical side, we demonstrate that a large number of proposed ImageNet [3, 15] and CIFAR-10 [9] models exhibit a high degree of similarity: their predictions agree far more than we would be able to deduce from their accuracy levels alone. Complementing our empirical findings, we give new generalization bounds that incorporate a measure of similarity. Our generalization bounds help to explain why holdout data has much greater longevity than prior bounds suggest when models are highly similar, as is the case in practice. Figure 1 summarizes these two complementary developments.

Underlying Figure 1a is a family of representative ImageNet models whose pairwise similarity we evaluate. The mean level of similarity of these models, together with a refined union bound, offers a multiplicative improvement over a carefully optimized baseline bound that does not take model similarity into account. In Figure 1b we compare our guarantee on the number of holdout reuses with the baseline bound. This illustrates that our bound is not just asymptotic, but concrete: it gives meaningful values in the practical regime. Moreover, in Section 5 we discuss how an additional assumption on model predictions can boost the similarity-based guarantee by multiple orders of magnitude.

Investigating model similarity in practice further, we evaluate the similarity of models encountered during the course of a large random hyperparameter search and a large neural architecture search for the CIFAR-10 dataset. We find that the pairwise model similarities throughout both procedures remain high. The similarity provides a counterweight to the massive number of model evaluations, limiting the amount of overfitting we observe.

Recht et al. [14] recently created new test sets for ImageNet and CIFAR-10, carefully following the original test set creation processes. Reevaluating all proposed models on the new test sets showed that while there was generally an absolute performance drop, the effect of overfitting due to adaptive behavior was limited to non-existent. Indeed, newer and better models on the old test set also performed better on the new test set, even though they had in principle more time to adapt to the test set. Also, Yadav and Bottou [19] recently released a new test set for the seminal MNIST task, on which they observed no overfitting.

Dwork et al. [5] recognized the issue of adaptivity in holdout reuse and provided new holdout mechanisms based on noise addition that support quadratically more queries than the standard method in the worst case.
There is a rich line of work on adaptive data analysis; Smith [17] offers a comprehensive survey of the field. We are not the first to proffer an explanation for the apparent lack of overfitting in machine learning benchmarks. Blum and Hardt [2] argued that if analysts only check if they improved on the previous best model, while ignoring models that did not improve, better adaptive generalization bounds are possible. Zrnic and Hardt [20] offered improved guarantees for adaptive analysts that satisfy natural assumptions, e.g. the analyst is unable to arbitrarily use information from queries asked far in the past. More recently, Feldman et al. [6] gave evidence that the number of classes in a classification problem helps mitigate overfitting in benchmarks. We see these different explanations as playing together in what is likely the full explanation of the available empirical evidence. In parallel to our work, Yadav and Bottou [19] discussed the advantages of comparing models on the same test set; pairing tests can provide tighter confidence bounds for model comparisons in this setting than individual confidence intervals for each model.
Let $f : \mathcal{X} \to \mathcal{Y}$ be a classifier mapping examples from a domain $\mathcal{X}$ to a label from the set $\mathcal{Y}$. Moreover, we consider a test set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $n$ examples sampled i.i.d. from a data distribution $\mathcal{D}$. The main quantity we aim to analyze is the gap between the accuracy of the classifier $f$ on the test set $S$ and the population accuracy of the same classifier under the distribution $\mathcal{D}$. If the gap between the two accuracies is large, we say $f$ overfit to the test set.

As is commonly done in the adaptive data analysis literature [1], we formalize interactions with the test set via statistical queries $q : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. In our case, the queries are $\{0,1\}$-valued; given a classifier $f$ we consider the query $q_f$ defined by $q_f(z) = \mathbf{1}\{f(x) \neq y\}$, where $z = (x, y)$. Then, we denote the empirical mean of the query $q_f$ on the test set $S$ (i.e., $f$'s test error) by $\mathbb{E}_S[q_f] = \frac{1}{n}\sum_{i=1}^n q_f(z_i)$. The population mean (population error) is accordingly defined as $\mathbb{E}_{\mathcal{D}}[q] = \mathbb{E}_{z \sim \mathcal{D}}\, q(z)$.

When discussing overfitting, we are usually interested in a set of classifiers, e.g., obtained via a hyperparameter search. Let $f_1, \ldots, f_k$ be such a set of classifiers and $q_1, \ldots, q_k$ be the set of corresponding queries. To quantify the probability that overfitting occurs (i.e., one of the $f_i$ has a large deviation between test and population accuracy), we would like to upper bound the probability
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big). \quad (1)$$
A standard way to bound (1) is to invoke the union bound and treat each query separately:
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big) \le \sum_{i=1}^k P\big(\big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\big). \quad (2)$$
We can then utilize standard concentration results to bound the right hand side. However, such an approach inherently cannot capture dependencies between the queries $q_i$ (or classifiers $f_i$). In particular, we are interested in the similarity between two queries $q$ and $q'$, measured by $P(q(z) = q'(z))$ (the probability of agreement between the 0-1 losses of the corresponding two classifiers). The main goal of this paper is to understand how high similarity can lead to better bounds on (1), both in theory and in numerical experiments with real data from ImageNet and CIFAR-10.

We begin by analyzing the effect of classifier similarity when the classifiers to be evaluated are chosen non-adaptively. For instance, this is the case when the algorithm designer fixes a grid of hyperparameters to be explored before evaluating any of the classifiers on the test set. To draw valid gains from the hyperparameter search, it is important that the resulting test accuracies reflect the true population accuracies, i.e., that probability (1) is small.

Bound (2) is sharp when the events $\{|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]| \ge \varepsilon\}$ are almost disjoint, which is not true when the queries are similar to each other. To address this issue, we modify our use of the union bound. We consider the events $E_i = \{\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i] \ge \varepsilon\}$. For any $t \ge 0$, we obtain
$$P\Big(\bigcup_{i=1}^k E_i\Big) \le P\Big(\{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \ge \varepsilon - t\} \cup \bigcup_{i=2}^k E_i\Big) \quad (3)$$
$$= P\big(\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \ge \varepsilon - t\big) + P\Big(\bigcup_{i=2}^k E_i \cap \{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] < \varepsilon - t\}\Big)$$
$$\le P\big(\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \ge \varepsilon - t\big) + \sum_{i=2}^k P\big(E_i \cap \{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] < \varepsilon - t\}\big).$$
Intuitively, the terms $P(E_i \cap \{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] < \varepsilon - t\})$ are small when the queries $q_1$ and $q_i$ are similar: if $P(q_1(z) = q_i(z))$ is large, we cannot simultaneously have $\mathbb{E}_S[q_1] < \mathbb{E}_{\mathcal{D}}[q_1] + \varepsilon - t$ and $\mathbb{E}_S[q_i] \ge \mathbb{E}_{\mathcal{D}}[q_i] + \varepsilon$, since the two deviations go in opposite directions. In the rest of this section, we make this intuition precise and derive an upper bound on (1) in terms of the query similarities. Before we state our main result, we introduce the following notion of a similarity covering.

Definition 1. Let $F$ be a set of queries. We say a query set $M$ is an $\eta$ similarity cover of $F$ if for any query $q \in F$ there exist $q', q'' \in M$ such that $\mathbb{E}_{\mathcal{D}}[q'] \le \mathbb{E}_{\mathcal{D}}[q]$, $\mathbb{E}_{\mathcal{D}}[q''] \ge \mathbb{E}_{\mathcal{D}}[q]$, $P(q'(z) = q(z)) \ge \eta$, and $P(q''(z) = q(z)) \ge \eta$ ($M$ does not necessarily have to be a subset of $F$). Let $N_\eta(F)$ denote the size of a minimal $\eta$ similarity cover of $F$ (when the query set $F$ is clear from context we use the simpler notation $N_\eta$).
Theorem 2. Let $F = \{q_1, q_2, \ldots, q_k\}$ be a collection of queries $q_i : \mathcal{Z} \to \{0,1\}$ independent of the test set $\{z_1, z_2, \ldots, z_n\}$. Then, for any $\eta \in [0,1]$ we have
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big) \le 2 N_\eta e^{-n\varepsilon^2/2} + 2k\, e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right)}. \quad (4)$$
Then, for all $\eta \le 1 - \max\left\{\frac{\log(4k/\delta)}{n},\ \sqrt{\frac{\log(4 N_\eta/\delta)}{32 n}}\right\}$, we have with probability $1 - \delta$
$$\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \le \max\left\{\sqrt{\frac{2\log(4 N_\eta/\delta)}{n}},\ 8\sqrt{\frac{(1-\eta)\log(4k/\delta)}{n}}\right\}. \quad (5)$$
Moreover, if $\varepsilon = \sqrt{\frac{2\log((2 N_\eta + 1)/\delta)}{n}}$ and $\eta \ge 1 - \frac{\varepsilon}{8\left(e^{2\varepsilon}(2k)^{4/(n\varepsilon)} - 1\right)}$, we have with probability $1 - \delta$
$$\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \le \varepsilon. \quad (6)$$

The proof starts with the refined union bound (3), or a standard triangle inequality, and then applies the Chernoff-type concentration bound shown in Lemma 3 for random variables which take values in $\{-1, 0, 1\}$. We defer the proof details of both the lemma and the theorem to Appendix A.
Lemma 3. Suppose $X_1, \ldots, X_n$ are i.i.d. discrete random variables which take values $-1$, $0$, and $1$ with probabilities $p_{-1}$, $p_0$, and $p_1$ respectively, and hence $\mathbb{E}\, X_i = p_1 - p_{-1}$. Then, for any $t \ge 0$ such that $p_1 - p_{-1} + t/2 \ge 0$ we have
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i > p_1 - p_{-1} + t\right) \le e^{-\frac{nt}{2}\log\left(1 + \frac{t}{4 p_1}\right)}.$$

Discretization arguments based on coverings are standard in statistical learning theory. Covers based on the population Hamming distance $P(q'(z) \neq q(z))$ have been previously studied [4, 11] (note that for $\{0,1\}$-valued queries the Hamming distance is equal to the $L_1$ and squared $L_2$ distances). An important distinction between our result and prior work is that prior work requires $\eta$ to be greater than $1 - \varepsilon$. Theorem 2 can offer an improvement over the standard guarantee $\sqrt{\log(k)/n}$ even when $\eta$ is much smaller than $1 - \varepsilon$. First of all, note that (5) holds for $\eta$ bounded away from one. Moreover, since $e^{2\varepsilon} \approx 1 + 2\varepsilon$, if $(2k)^{4/(n\varepsilon)} \le 1 + \sqrt{\varepsilon}$ (the choice of $\sqrt{\varepsilon}$ is somewhat arbitrary), we see that the requirement on $\eta$ for (6) is satisfied when $\eta$ is on the order of $1 - \sqrt{\varepsilon}$.

In the previous section, we showed similarity can prevent overfitting when the sequence of queries is chosen non-adaptively, i.e., when the queries $\{q_1, q_2, \ldots, q_k\}$ are fixed independently of the test set $S$. In the adaptive setting, we assume the query $q_t$ can be selected as a function of the previous queries $\{q_1, q_2, \ldots, q_{t-1}\}$ and estimates $\{\mathbb{E}_S[q_1], \mathbb{E}_S[q_2], \ldots, \mathbb{E}_S[q_{t-1}]\}$. Even when queries are chosen adaptively, we show that leveraging similarity can provide sharper bounds on the probability of overfitting, $P(\max_{1 \le i \le k} |\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]| \ge \varepsilon)$.

In the adaptive setting, the field of adaptive data analysis offers a rich technical repertoire to address overfitting [5, 17]. In this framework, analogous to the typical machine learning workflow, an analyst iteratively selects a classifier and then queries a mechanism to provide an estimate of test-set performance. In practice, the mechanism often used is the Trivial Mechanism, which computes the empirical mean of the query on the test set and returns the exact value to the analyst. For simplicity, we study how similarity improves the performance of the Trivial Mechanism.

The empirical mean of any query can take at most $n + 1$ values, and thus a deterministic analyst might ask at most $(n+1)^{k-1}$ queries in $k$ rounds of interaction with the Trivial Mechanism. Let $F$ denote the set of $(n+1)^{k-1}$ possible queries. Then, we apply Theorem 2 to $F$.
Corollary 4. Let $F$ be the set of queries that a fixed analyst $A$ might query the Trivial Mechanism. We assume that the Trivial Mechanism has access to a test set of size $n$. Let $\alpha \in [0, 1]$,
$$\varepsilon = \sqrt{\frac{k^{1-\alpha}\log(n+1) + \log(2/\delta)}{n}}, \qquad \text{and} \qquad \eta = 1 - \frac{\varepsilon}{8\left(e^{2\varepsilon + k^{\alpha}} - 1\right)}.$$
If $N_\eta(F) \le (n+1)^{k^{1-\alpha}}$, we have with probability $1 - \delta$
$$\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \le \varepsilon \quad (7)$$
for any queries $q_1, q_2, \ldots, q_k$ chosen adaptively by $A$.

Proof. Note that for this choice of $\eta$ we have $\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right) \ge \varepsilon k^{\alpha}$. Then, the result follows from the first part of Theorem 2.

Corollary 4 always applies with $\alpha = 0$, in which case the bound matches standard results for the Trivial Mechanism with $\varepsilon = \tilde{O}(\sqrt{k/n})$. However, if $F$ permits $N_\eta(F) \le (n+1)^{k^{1-\alpha}}$ for the above choice of $\eta$ and some $\alpha > 0$, we obtain a super-linear improvement in the dependence on $k$. For instance, if $\alpha = 1/2$, then $\varepsilon = \tilde{O}(\sqrt{k^{1/2}/n})$, and we obtain a quadratic improvement in the number of queries for a fixed sample size, an improvement similar to that achieved by the Gaussian mechanism [1, 5]. Moreover, since our technique is essentially tightening a union bound, this improvement easily extends to other mechanisms that rely on compression-based arguments, for instance, the Ladder Mechanism [2].

So far, we have established theoretically that similarity between classifiers allows us to evaluate a larger number of classifiers on the test set without overfitting. In this section, we investigate whether these improvements already occur in the regime of contemporary machine learning. We specifically focus on ImageNet and CIFAR-10, two widely used machine learning benchmarks that have recently been shown to exhibit little to no adaptive overfitting in spite of almost a decade of test set re-use [14]. For both datasets, we empirically measure two main quantities: (i) the similarity between a wide range of models, some of them arising from hyperparameter search experiments; (ii) the resulting increase in the number of models we can evaluate in a non-adaptive setting compared to a baseline that does not utilize the model similarities.
We utilize the model testbed from Recht et al. [14], who collected a dataset of 66 image classifiers that includes a wide range of standard ImageNet models such as AlexNet [10], ResNets [7], DenseNets [8], VGG [16], Inception [18], and several other models (available at https://github.com/modestyachts/ImageNetV2). As a baseline for the observed similarities between these models, we compare them to classifiers with the same accuracy but otherwise random predictions: given two models $f_1$ and $f_2$ with population error rates $\mu_1$ and $\mu_2$, we know that the similarity $P(\mathbf{1}\{f_1(x) \neq y\} = \mathbf{1}\{f_2(x) \neq y\})$ equals $\mu_1\mu_2 + (1-\mu_1)(1-\mu_2)$ if the random variables $\mathbf{1}\{f_1(x) \neq y\}$ and $\mathbf{1}\{f_2(x) \neq y\}$ are independent. Figure 1a in the introduction shows these model similarities assuming the models make independent mistakes and also the empirical data for the $\binom{66}{2} = 2{,}145$ pairs of models. We see that the empirical similarities are significantly higher than the random baseline (mean 0.85 vs 0.62).

The corresponding Figure 1b shows two lower bounds on the number of models that can be evaluated for the empirical ImageNet data. In particular, we use $n = 50{,}000$ (the size of the ImageNet validation set), a fixed target probability $\delta$ for the overfitting event (1), and a fixed error level $\varepsilon$. We compare two methods for computing the number of non-adaptively testable models: a guarantee based on the simple union bound (2) and a guarantee based on our more refined union bound derived from our theoretical analysis in Section 3. Later in this section, we introduce an even stronger bound that utilizes higher-order interactions between the model similarities and yields significantly larger improvements under an assumption on the structure among the classifiers.

To obtain meaningful quantities in the regime of ImageNet, all bounds here require significantly sharper numerical calculations than standard theoretical tools such as Chernoff bounds. We now describe these calculations at a high level and defer the details to Appendix B. After introducing the three methods, we compare them on the ImageNet data.
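The empirical similarities above are simple functions of per-example correctness indicators. The sketch below shows one way to compute the observed pairwise similarity and the corresponding independence baseline; the array names, shapes, and data-loading step are hypothetical and not part of the released testbed.

```python
import numpy as np

def pairwise_similarity(correct_a, correct_b):
    """Empirical agreement of the 0-1 losses of two models.
    correct_a, correct_b: boolean arrays of length n, True where the model
    classifies the corresponding test point correctly."""
    return float(np.mean(correct_a == correct_b))

def independent_similarity(err_a, err_b):
    """Similarity two models would have if their mistakes were independent
    events with the given population error rates."""
    return err_a * err_b + (1.0 - err_a) * (1.0 - err_b)

# Hypothetical usage: `correct` is a (num_models, n) boolean matrix of
# per-example correctness loaded from saved predictions.
# errors = 1.0 - correct.mean(axis=1)
# observed = pairwise_similarity(correct[0], correct[1])
# baseline = independent_similarity(errors[0], errors[1])
```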
Standard union bound.

Given $n$, $\varepsilon$, and the population error rates $\mathbb{E}_{\mathcal{D}}[q_i]$ of all models, we can compute the right hand side of (2) exactly (after an additional union bound to decouple the left and right tails). It is well known that higher accuracies lead to a smaller probability of error and hence allow for a larger number of test set reuses. We assume all models have population accuracy equal to the average top-1 accuracy of the ImageNet models in the testbed. In this case, the vanilla union bound (2) guarantees that approximately 257,000 models can be evaluated on a test set of size 50,000 so that their empirical accuracies lie in the target confidence interval with the target probability.

Similarity union bound.

While the union bound (2) is easy to use, it does not leverage the dependencies between the random variables $\mathbf{1}\{f_i(x) \neq y\}$ for $i \in \{1, 2, \ldots, k\}$. To exploit this property, we utilize the refined union bound (3), which is guaranteed to be an improvement over (2) when the parameter $t$ is optimized. In order to use (3), we must compute the probabilities
$$P\big(\{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \le \alpha_1\} \cap \{\mathbb{E}_S[q_2] - \mathbb{E}_{\mathcal{D}}[q_2] \ge \alpha_2\}\big) \quad (8)$$
for given $\alpha_1$, $\alpha_2$, $\mathbb{E}_{\mathcal{D}}[q_1]$, $\mathbb{E}_{\mathcal{D}}[q_2]$, and similarity $P(q_1(z) = q_2(z))$. In Appendix B, we show that we can compute these probabilities efficiently by assigning success probabilities to three independent Bernoulli random variables $X_1$, $X_2$, and $W$ such that $(X_1 W, X_2 W)$ is equal to $(q_1(z), q_2(z))$ in distribution. Let $p_w := P(W = 1)$, $p_1 = \mathbb{E}_{\mathcal{D}}[q_1]$, and $p_2 = \mathbb{E}_{\mathcal{D}}[q_2]$. Then, given i.i.d. draws $X_{1,i}$, $X_{2,i}$, and $W_i$, we condition on the values of the $W_i$ to express probability (8) as
$$P\big(\{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \le \alpha_1\} \cap \{\mathbb{E}_S[q_2] - \mathbb{E}_{\mathcal{D}}[q_2] \ge \alpha_2\}\big) \quad (9)$$
$$= \sum_{j=0}^{n} \binom{n}{j} p_w^j (1 - p_w)^{n-j}\, P\left(\sum_{i=1}^{j} X_{1,i} \le \lfloor n(p_1 + \alpha_1)\rfloor\right) P\left(\sum_{i=1}^{j} X_{2,i} \ge \lceil n(p_2 + \alpha_2)\rceil\right).$$
We refer the reader to Appendix B for more details. The two tail probabilities for $X_{1,i}$ and $X_{2,i}$ can be computed efficiently with the use of beta functions. Using (9) and (3) with a binary search over $t$, we can compute the probability of making an error $\varepsilon$ when estimating the population error rates of $k$ models with given error rates and pairwise similarities. Figure 1b shows the maximum number of models $k$ that can be evaluated on the same test set so that the probability of making an error $\varepsilon$ in estimating all their error rates is at most $\delta$ when the models satisfy $\mathbb{E}_{\mathcal{D}}[q_i] = \mu$ and $P(q_i(z) = q_j(z)) \ge \eta$ for all $1 \le i, j \le k$, with the parameters chosen to match the empirical ImageNet data. The figure shows that our new bound offers a significant improvement over the guarantee given by the standard union bound (2).
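Both computations reduce to binomial tail evaluations. For the simpler baseline, the count guaranteed by the standard union bound can be reproduced from exact binomial tails, as in the following minimal sketch; this is not the authors' code, and the example values in the comment are placeholders. The computation of (9) is spelled out, with a similar sketch, in Appendix B.

```python
import numpy as np
from scipy.stats import binom

def two_sided_tail(n, p, eps):
    """P(|Bin(n, p)/n - p| >= eps), computed from exact binomial tails."""
    left = binom.cdf(np.floor(n * (p - eps)), n, p)       # P(X/n - p <= -eps)
    right = binom.sf(np.ceil(n * (p + eps)) - 1, n, p)    # P(X/n - p >= eps)
    return left + right

def testable_models_union_bound(n, p, eps, delta):
    """Largest k with k * P(|Bin(n, p)/n - p| >= eps) <= delta,
    i.e. the number of models certified by the plain union bound (2)."""
    tail = two_sided_tail(n, p, eps)
    return int(delta / tail) if tail > 0 else float("inf")

# Placeholder values, not the ones used in the paper:
# testable_models_union_bound(n=50_000, p=0.3, eps=0.01, delta=0.05)
```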
Similarity union bound with a Naive Bayes assumption.

While the previous computation uses the pairwise similarities observed empirically to offer an improved guarantee on the number of allowed test set reuses, it does not take into account higher-order dependencies between the models. In particular, Figure 4 in Appendix C shows that 27.8% of test images are correctly classified by all the models, 55.9% of test images are correctly classified by at least 60 of the 66 models considered, and 4.7% of test images are incorrectly classified by all the models. We now show how this kind of agreement between models enables a larger number of test set reuses. Inspired by the coupling used in (9), we make the following assumption.

Assumption A1 (Naive Bayes). Let $q_1, q_2, \ldots, q_k$ be a collection of queries such that $\mathbb{E}_{\mathcal{D}}[q_i] = p$ and $P(q_i(z) = q_j(z)) = \eta$ for some $p$ and $\eta$, for all $1 \le i, j \le k$. We say such a collection has a Naive Bayes structure if there exist $p_x$ and $p_w$ in $[0,1]$ such that $(q_1(z), q_2(z), \ldots, q_k(z))$ is equal to $(X_1 W, X_2 W, \ldots, X_k W)$ in distribution, where $W, X_1, \ldots, X_k$ are independent Bernoulli random variables with $P(W = 1) = p_w$ and $P(X_i = 1) = p_x$ for all $1 \le i \le k$.

Intuitively, a collection of queries $\mathbf{1}\{f_i(x) \neq y\}$ has a Naive Bayes structure if the data distribution $\mathcal{D}$ generates, with probability $1 - p_w$, easy examples $(x, y)$ that all the models $f_i$ classify correctly, and if an example is not easy, the models make mistakes independently. As mentioned before, Figure 4 supports the existence of such an easy set. When a test point in the ImageNet test set is not an easy example, the models do not make mistakes independently. Therefore, Assumption A1 is not exactly satisfied by existing ImageNet models.
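To make the structure in Assumption A1 concrete, the sketch below draws a synthetic 0-1 loss matrix with exactly this structure. It only illustrates the assumption; it is not a model of the actual ImageNet data, and the function name and default seed are arbitrary.

```python
import numpy as np

def sample_naive_bayes_losses(n, k, pw, px, seed=0):
    """Draw a (k, n) matrix of 0-1 losses with the structure of Assumption A1:
    q_i(z_j) = X_{ij} * W_j for independent Bernoulli variables."""
    rng = np.random.default_rng(seed)
    W = rng.binomial(1, pw, size=n)         # per-example indicator: mistakes are possible
    X = rng.binomial(1, px, size=(k, n))    # independent mistakes when W_j = 1
    return X * W                            # E_D[q_i] = pw * px; models agree wherever W_j = 0
```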
Figure 2:
Left figure shows the multiplicative gains in the number of testable models, as a function of model similarity, over the guarantee offered by the standard union plus binomial bound, for a fixed error $\varepsilon$ and confidence level $\delta$. Right figure shows the same multiplicative gains, but as a function of $\varepsilon$, for a fixed $\delta$ and pairwise similarity $\eta$.

However, we know that independent Bernoulli trials saturate the standard union bound (2). This effect can also be observed in Figure 2. As the similarity between the models decreases, the models make mistakes more and more independently and the guarantee obtained under Assumption A1 converges to the standard union bound guarantee. So while Assumption A1 is not exactly satisfied in practice, the violation among the ImageNet classifiers likely implies an even better lower bound on the number of testable models.

Assumption A1 is computationally advantageous. It allows us to compute the overfitting probability (1) exactly, as we detail in Appendix B. Figure 2 is an extension of Figure 1b; it shows the relative improvement of our bounds over the standard union bound in terms of the number of testable models for a fixed $\varepsilon$ and $\delta$. Moreover, Figure 2 also shows that the relative improvement of our bounds increases quickly with $\varepsilon$. According to Figure 2, Assumption A1 implies that we can evaluate an extremely large number of models on the test set in the regime of ImageNet without overfitting. While this number of models might seem unnecessarily large, in Section 4 we saw that when models are chosen adaptively we must consider a tree of possible models, which can easily contain an astronomically large number of models.

Practitioners often evaluate many more models than the handful that ultimately appear in publication. The choice of architecture is the result of a long period of iterative refinement, and the hyperparameters for any fixed architecture are often chosen by evaluating a large grid of plausible models. Using data from CIFAR-10, we demonstrate that these common practices both generate large classes of very similar models.
Random hyperparameter search.
To understand the similarity between models evaluated in hyperparameter search, we ran our own random search to choose hyperparameters for a ResNet-110. The grid included properties of the architecture (e.g. type of residual block), the optimization algorithm (e.g. choice of optimizer), and the data distribution (e.g. data augmentation strategies).
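For illustration, a single draw from a random search grid of this kind can be sketched as follows; the dictionary below lists only a few entries taken from Table 2 in Appendix D, and the structure and names are hypothetical rather than the code used for the experiments.

```python
import random

GRID = {
    "residual_block": ["Basic", "Bottleneck"],
    "batch_size": [32, 64, 128, 256],
    "optimizer": ["SGD", "SGD with Momentum", "Nesterov GD", "Adam"],
    "lr_schedule": ["Cosine", "Fixed Decay"],
    "use_cutout": [True, False],
}

def sample_config(rng=random):
    """Sample one hyperparameter configuration from the (partial) grid."""
    config = {name: rng.choice(values) for name, values in GRID.items()}
    config["base_lr"] = rng.uniform(1e-4, 0.5)  # continuous range from Table 2
    return config
```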
Figure 3:
Model similarities (left: observed versus independent similarity for each pair of checkpoints) and similarity covering numbers (right: cover size as a function of the similarity resolution $\eta$) for random hyperparameter search on CIFAR-10.
Table 1: Neural architecture search similarities. For randomly selected configurations and for the highest scoring configurations evaluated by DARTS, the table reports the mean accuracy, the mean pairwise similarity, and the increase in the number of testable models under the similarity bound (SB) and the Naive Bayes bound (NBB).
A full specification of the grid is included in Appendix D. We sample and train 320 models and, for each model, we select 10 checkpoints evenly spaced throughout training. The best model considered achieves an accuracy of 96.6% and, after restricting to models with accuracy at least 50%, we are left with 1,235 model checkpoints. In Figure 3, we show the similarity for each pair of checkpoints and compute an upper bound on the corresponding similarity covering number $N_\eta(F)$ for each possible value of $\eta$. As in the case of ImageNet, CIFAR-10 models found by random search are significantly more similar than random chance would suggest.
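The paper does not spell out how the covering number upper bound is constructed, but any explicitly constructed cover upper bounds $N_\eta(F)$. The sketch below is one simple greedy construction over a 0-1 loss matrix, using empirical error rates in place of the population quantities in Definition 1; all names are illustrative.

```python
import numpy as np

def greedy_cover_size(loss_matrix, eta):
    """Upper bound on the eta similarity covering number N_eta.
    loss_matrix: (num_models, n) 0/1 array with entry [i, j] = q_i(z_j)."""
    errors = loss_matrix.mean(axis=1)
    order = np.argsort(errors)

    def one_sided_cover(order):
        cover = []
        for i in order:
            # Model i is covered from this side if a previously chosen model,
            # whose error is on the required side by construction, agrees with
            # it on at least an eta fraction of the test points.
            if not any(np.mean(loss_matrix[i] == loss_matrix[m]) >= eta for m in cover):
                cover.append(i)
        return set(cover)

    lower = one_sided_cover(order)        # increasing error: covers from below
    upper = one_sided_cover(order[::-1])  # decreasing error: covers from above
    return len(lower | upper)             # size of a valid cover, hence an upper bound on N_eta
```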
Neural architecture search.

In the random search experiment, all of the models were chosen non-adaptively: the grid of models is fixed in advance. However, similarity protects against overfitting also in the adaptive setting. To illustrate this, we compute the similarity for models evaluated by automatic neural architecture search. In particular, we ran the DARTS neural architecture search pipeline to adaptively evaluate a large number of plausible models in search of promising configurations [12, 13]. In Table 1, we report the mean accuracies and pairwise similarities for randomly selected configurations evaluated by DARTS, as well as for the top scoring configurations according to DARTS' internal scoring mechanism. Table 1 also shows that the multiplicative gains in the number of testable models offered by our similarity bound (SB) and our Naive Bayes bound (NBB) over the standard union bound are between one and four orders of magnitude. Therefore, even in a high accuracy regime we can guarantee a significantly higher number of test set reuses without overfitting when taking into account model similarities.

Conclusions and future work
We have shown that contemporary image classification models are highly similar, and that this similarity increases the longevity of the test set both in theory and in experiment. It is worth noting that model similarity does not preclude progress on the test set: two models that are 85% similar can differ by as much as 15% in accuracy (for context: the top-5 accuracy improvement from the seminal AlexNet to the current state of the art on ImageNet is about 17%). In addition, it is well known that higher model accuracy implies a larger number of test set reuses without overfitting. So as the machine learning practitioner explores increasingly better performing models that also become more similar, it can actually become harder to overfit.

There are multiple important avenues for future work. First, one natural question is why the classification models turn out to be so similar. In addition, it would be insightful to understand whether the similarity phenomenon is specific to image classification or also arises in other classification tasks. There may also be further structural dependencies between models that mitigate the amount of overfitting. Finally, it would be ideal to have a statistical procedure that leverages such model structure to provide reliable and accurate performance bounds for test set re-use.
Acknowledgements.
We thank Vitaly Feldman for helpful discussions. This work is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, an Amazon AWS AI Research Award, a gift from Microsoft Research, and the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814.
References

[1] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In Symposium on Theory of Computing (STOC), 2016. https://arxiv.org/abs/1511.02513.
[2] A. Blum and M. Hardt. The Ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning (ICML), 2015. https://arxiv.org/abs/1502.04585.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[5] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Symposium on Theory of Computing (STOC), 2015. https://arxiv.org/abs/1411.2664.
[6] V. Feldman, R. Frostig, and M. Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1905.10360.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. https://arxiv.org/abs/1512.03385.
[8] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://arxiv.org/abs/1608.06993.
[9] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.
[11] J. Langford. Quantitatively Tight Sample Complexity Bounds. PhD thesis, Carnegie Mellon University, 2002. http://hunch.net/~jl/projects/prediction_bounds/thesis/thesis.pdf.
[12] L. Li and A. Talwalkar. Random search and reproducibility for neural architecture search. In Conference on Uncertainty in Artificial Intelligence (UAI), 2019. https://arxiv.org/abs/1902.07638.
[13] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019. https://arxiv.org/abs/1806.09055.
[14] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1902.10811.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015. https://arxiv.org/abs/1409.0575.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014. https://arxiv.org/abs/1409.1556.
[17] A. Smith. Information, privacy and stability in adaptive data analysis, 2017. https://arxiv.org/abs/1706.00820.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015. https://arxiv.org/abs/1409.4842v1.
[19] C. Yadav and L. Bottou. Cold Case: The Lost MNIST Digits. 2019. https://arxiv.org/abs/1905.10498.
[20] T. Zrnic and M. Hardt. Natural analysts in adaptive data analysis. In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1901.11143.

A Proofs for Section 3
Lemma 3.
Suppose $X_1, \ldots, X_n$ are i.i.d. discrete random variables which take values $-1$, $0$, and $1$ with probabilities $p_{-1}$, $p_0$, and $p_1$ respectively, and hence $\mathbb{E}\, X_i = p_1 - p_{-1}$. Then, for any $t \ge 0$ such that $p_1 - p_{-1} + t/2 \ge 0$ we have
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i > p_1 - p_{-1} + t\right) \le e^{-\frac{nt}{2}\log\left(1 + \frac{t}{4 p_1}\right)}.$$

Proof.
We assume $p_1 > 0$; the result follows by continuity when $p_1 = 0$. We prove the more general case since the first part of the lemma is a particular case. By standard Chernoff methods we have
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i > p_1 - p_{-1} + t\right) \le e^{-n\lambda(t + p_1 - p_{-1})}\left(p_0 + p_1 e^{\lambda} + p_{-1} e^{-\lambda}\right)^n$$
for any $\lambda \in [0, \infty)$. Let $r > 0$ be chosen later. Now, we would like to choose $\lambda$ to be nonnegative and as large as possible so that
$$p_0 + p_1 e^{\lambda} + p_{-1} e^{-\lambda} \le e^{\lambda r}. \quad (10)$$
By changing variables to $e^{\lambda} = z + 1$ for some $z \ge 0$, we want to find $z$ as large as possible so that $p_0(z+1) + p_1(z+1)^2 + p_{-1} \le (z+1)^{1+r}$. Then, by Bernoulli's inequality it suffices if $z$ satisfies the inequality
$$p_0(z+1) + p_1(z+1)^2 + p_{-1} \le 1 + (1+r)z,$$
which is equivalent to $p_0 + p_1 z + 2 p_1 \le 1 + r$. Hence, the desired inequality (10) is satisfied if
$$z \le \frac{p_{-1} - p_1 + r}{2 p_1},$$
which can be satisfied by choosing $z = \frac{p_{-1} - p_1 + r}{2 p_1}$ when $p_{-1} - p_1 + r \ge 0$. In this case, we can set $\lambda = \log\left(1 + \frac{p_{-1} - p_1 + r}{2 p_1}\right)$ and obtain
$$P\left(\frac{1}{n}\sum_{i=1}^n X_i > p_1 - p_{-1} + t\right) \le e^{-n \log\left(1 + \frac{p_{-1} - p_1 + r}{2 p_1}\right)(t + p_1 - p_{-1} - r)}.$$
Set $r = p_1 - p_{-1} + t/2$; by the assumption on $t$ we are guaranteed that $r \ge 0$ and $p_{-1} - p_1 + r \ge 0$. The conclusion follows.

Theorem 2. Let $F = \{q_1, q_2, \ldots, q_k\}$ be a collection of queries $q_i : \mathcal{Z} \to \{0,1\}$ independent of the test set $\{z_1, z_2, \ldots, z_n\}$. Then, for any $\eta \in [0,1]$ we have
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big) \le 2 N_\eta e^{-n\varepsilon^2/2} + 2k\, e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right)}. \quad (4)$$
Then, for all $\eta \le 1 - \max\left\{\frac{\log(4k/\delta)}{n},\ \sqrt{\frac{\log(4 N_\eta/\delta)}{32 n}}\right\}$, we have with probability $1 - \delta$
$$\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \le \max\left\{\sqrt{\frac{2\log(4 N_\eta/\delta)}{n}},\ 8\sqrt{\frac{(1-\eta)\log(4k/\delta)}{n}}\right\}. \quad (5)$$
Moreover, if $\varepsilon = \sqrt{\frac{2\log((2 N_\eta + 1)/\delta)}{n}}$ and $\eta \ge 1 - \frac{\varepsilon}{8\left(e^{2\varepsilon}(2k)^{4/(n\varepsilon)} - 1\right)}$, we have with probability $1 - \delta$
$$\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \le \varepsilon. \quad (6)$$

Proof.
First we prove (4), and we start with the right tails. We have
$$P\left(\bigcup_{i=1}^k \{\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i] \ge \varepsilon\}\right) \le P\left(\bigcup_{i=1}^k \{\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i] \ge \varepsilon\} \cup \bigcup_{\tilde q \in M}\{\mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] \ge \varepsilon/2\}\right),$$
where $M$ is a minimal $\eta$ similarity cover of $F$. Then, there exists a partition of $F$ into subsets $R_{\tilde q}$, with $\tilde q \in M$, such that for any $q \in F$ there exists $\tilde q$ such that $q \in R_{\tilde q}$, $\mathbb{E}_{\mathcal{D}}[q] \ge \mathbb{E}_{\mathcal{D}}[\tilde q]$, and $P(q(z) = \tilde q(z)) \ge \eta$. Since the $R_{\tilde q}$ form a partition of $F$, we have $\sum_{\tilde q \in M} |R_{\tilde q}| = k$. Therefore, following the same argument as in (3), we have
$$P\left(\bigcup_{i=1}^k \{\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i] \ge \varepsilon\}\right) \le \sum_{\tilde q \in M} P\big(\mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] \ge \varepsilon/2\big) + \sum_{\tilde q \in M}\sum_{q \in R_{\tilde q}} P\big(\{\mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] \le \varepsilon/2\} \cap \{\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q] \ge \varepsilon\}\big)$$
$$\le \sum_{\tilde q \in M} P\big(\mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] \ge \varepsilon/2\big) + \sum_{\tilde q \in M}\sum_{q \in R_{\tilde q}} P\big(\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q] \ge \mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] + \varepsilon/2\big).$$
Now, for every $\tilde q \in M$ and any $q \in R_{\tilde q}$ we use a standard Chernoff bound and Lemma 3 to show
$$P\big(\mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] \ge \varepsilon/2\big) \le e^{-n\varepsilon^2/2} \quad \text{and} \quad P\big(\mathbb{E}_S[q] - \mathbb{E}_{\mathcal{D}}[q] \ge \mathbb{E}_S[\tilde q] - \mathbb{E}_{\mathcal{D}}[\tilde q] + \varepsilon/2\big) \le e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right)}.$$
To see why we can apply Lemma 3, note that $q(z) - \tilde q(z)$ takes values in $\{-1, 0, 1\}$, with the probability of $0$ being at least $\eta$, and $\mathbb{E}_{\mathcal{D}}[q - \tilde q] \ge 0$ by the choice of the covering set. Since $|M| = N_\eta$ and since $\sum_{\tilde q \in M} |R_{\tilde q}| = k$, we find
$$P\left(\bigcup_{i=1}^k \{\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i] \ge \varepsilon\}\right) \le N_\eta e^{-n\varepsilon^2/2} + k\, e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right)}.$$
An analogous argument for the left tails yields (4). Now, we turn to showing (5). The goal is to find $\varepsilon$ such that
$$2 N_\eta e^{-n\varepsilon^2/2} \le \frac{\delta}{2} \quad \text{and} \quad 2k\, e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right)} \le \frac{\delta}{2}. \quad (11)$$
The first inequality is satisfied if $\varepsilon \ge \sqrt{\frac{2\log(4 N_\eta/\delta)}{n}}$. To find $\varepsilon$ that satisfies the second condition we make use of the inequality $\log(1 + t) \ge \frac{t}{t+1}$ for all $t \ge 0$. We search for $\varepsilon$ that also satisfies $\varepsilon \le 8(1 - \eta)$. Then,
$$\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right) \ge \frac{n\varepsilon^2}{64(1-\eta)},$$
and we would like the right hand side to be at least $\log(4k/\delta)$. If we choose
$$\varepsilon = \max\left\{\sqrt{\frac{2\log(4 N_\eta/\delta)}{n}},\ 8\sqrt{\frac{(1-\eta)\log(4k/\delta)}{n}}\right\},$$
the condition $\varepsilon \le 8(1 - \eta)$ is satisfied because of the assumption on $\eta$. In this case, both conditions in (11) are satisfied and (5) is proven. Finally, note that when $\eta \ge 1 - \frac{\varepsilon}{8\left(e^{2\varepsilon}(2k)^{4/(n\varepsilon)} - 1\right)}$ we have
$$2k\, e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{8(1-\eta)}\right)} \le e^{-n\varepsilon^2/2}.$$
Then, (6) follows by choosing $\varepsilon = \sqrt{\frac{2\log((2 N_\eta + 1)/\delta)}{n}}$. This completes the proof.

B Tail probability of two dependent binomials
In this section we detail the computations of the two similarity union bounds (with and without the Naive Bayes assumption).
Similarity Union Bound.
We wish to compute the probability
$$P\big(\{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \le \alpha_1\} \cap \{\mathbb{E}_S[q_2] - \mathbb{E}_{\mathcal{D}}[q_2] \ge \alpha_2\}\big), \quad (12)$$
where $q_1(z)$ and $q_2(z)$ have some joint distribution over $\{0,1\}^2$. Let us denote $p_1 = \mathbb{E}_{\mathcal{D}}[q_1]$, $p_2 = \mathbb{E}_{\mathcal{D}}[q_2]$, and $\eta = P(q_1(z) = q_2(z))$ respectively. These three quantities fully determine the joint probability distribution of $q_1(z)$ and $q_2(z)$. Specifically, we have
$$P(q_1(z) = 1, q_2(z) = 1) = \frac{p_1 + p_2 + \eta - 1}{2}, \qquad P(q_1(z) = 1, q_2(z) = 0) = \frac{1 + p_1 - p_2 - \eta}{2},$$
$$P(q_1(z) = 0, q_2(z) = 1) = \frac{1 - p_1 + p_2 - \eta}{2}, \qquad P(q_1(z) = 0, q_2(z) = 0) = \frac{1 + \eta - p_1 - p_2}{2}.$$
We denote these four probabilities by $p_{11}$, $p_{10}$, $p_{01}$, and $p_{00}$ respectively. We aim to find three independent Bernoulli random variables $X_1$, $X_2$, and $W$ such that $(X_1 W, X_2 W)$ equals $(q_1(z), q_2(z))$ in distribution. It turns out we can achieve this whenever $p_{11} \ge (p_{11} + p_{10})(p_{11} + p_{01})$, a condition that is always satisfied in the settings we consider, by setting
$$P(X_1 = 1) = \frac{p_{11}}{p_{11} + p_{01}}, \qquad P(X_2 = 1) = \frac{p_{11}}{p_{11} + p_{10}}, \qquad P(W = 1) = \frac{(p_{11} + p_{01})(p_{11} + p_{10})}{p_{11}}.$$
Then, given i.i.d. draws $X_{1,i}$, $X_{2,i}$, and $W_i$, probability (12) equals
$$P\left(\left\{\sum_{i=1}^n X_{1,i} W_i \le \lfloor n(p_1 + \alpha_1)\rfloor\right\} \cap \left\{\sum_{i=1}^n X_{2,i} W_i \ge \lceil n(p_2 + \alpha_2)\rceil\right\}\right). \quad (13)$$
Denote $p_w = P(W = 1)$. Then, we condition on the possible values of the $W_i$ to obtain
$$P\big(\{\mathbb{E}_S[q_1] - \mathbb{E}_{\mathcal{D}}[q_1] \le \alpha_1\} \cap \{\mathbb{E}_S[q_2] - \mathbb{E}_{\mathcal{D}}[q_2] \ge \alpha_2\}\big) \quad (14)$$
$$= \sum_{j=0}^{n} \binom{n}{j} p_w^j (1 - p_w)^{n-j}\, P\left(\sum_{i=1}^{j} X_{1,i} \le \lfloor n(p_1 + \alpha_1)\rfloor\right) P\left(\sum_{i=1}^{j} X_{2,i} \ge \lceil n(p_2 + \alpha_2)\rceil\right).$$
The two tail probabilities for $X_{1,i}$ and $X_{2,i}$ can be computed efficiently with the use of beta functions.
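The expression in (14) can also be evaluated numerically with standard binomial routines. The following sketch is one way to do it under the coupling above; the function name and argument order are ours, and the assertion simply checks the positive-correlation condition required for the coupling to exist.

```python
import numpy as np
from scipy.stats import binom

def dependent_tail_probability(n, p1, p2, eta, alpha1, alpha2):
    """P(E_S[q1] - p1 <= alpha1 and E_S[q2] - p2 >= alpha2) under the
    three-Bernoulli coupling (q1, q2) = (X1*W, X2*W), as in (14)."""
    p11 = (p1 + p2 + eta - 1.0) / 2.0        # P(q1 = 1, q2 = 1)
    p10, p01 = p1 - p11, p2 - p11
    assert p11 >= (p11 + p10) * (p11 + p01)  # condition for the coupling to exist
    x1 = p11 / (p11 + p01)                   # P(X1 = 1)
    x2 = p11 / (p11 + p10)                   # P(X2 = 1)
    pw = (p11 + p01) * (p11 + p10) / p11     # P(W = 1)

    j = np.arange(n + 1)                     # number of indices with W_i = 1
    weights = binom.pmf(j, n, pw)
    left = binom.cdf(np.floor(n * (p1 + alpha1)), j, x1)     # P(sum X1_i <= floor(...))
    right = binom.sf(np.ceil(n * (p2 + alpha2)) - 1, j, x2)  # P(sum X2_i >= ceil(...))
    return float(np.sum(weights * left * right))
```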
Similarity union bound with a Naive Bayes assumption.

In this section we wish to compute directly the overfitting probability
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big) \quad (15)$$
when the query vector $(q_1(z), q_2(z), \ldots, q_k(z))$ is equal in distribution to $(X_1 W, X_2 W, \ldots, X_k W)$ for some independent Bernoulli random variables $W, X_1, \ldots, X_k$. Recall that we assume that all queries $q_i$ have equal error rates $\mathbb{E}_{\mathcal{D}}[q_i]$; let us denote it $\mu = \mathbb{E}_{\mathcal{D}}[q_i]$. Moreover, for any two queries $q_i$ and $q_j$ we have $P(q_i(z) = q_j(z)) = \eta$.

Suppose we are given i.i.d. draws $W_i$ and i.i.d. draws $X_{\ell,i}$ for $1 \le i \le n$ and $1 \le \ell \le k$. Then, if $p_w := P(W = 1)$, by conditioning on the values of the random variables $W_i$ we obtain
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big) = \sum_{j=0}^{n} \binom{n}{j} p_w^j (1-p_w)^{n-j}\, P\left(\bigcup_{\ell=1}^{k}\left\{\left|\frac{1}{n}\sum_{i=1}^{j} X_{\ell,i} - \mu\right| \ge \varepsilon\right\}\right).$$
The random variables $\sum_{i=1}^{j} X_{\ell,i}$ have the same distribution for all $\ell$ and are independent. Then,
$$P\left(\bigcup_{\ell=1}^{k}\left\{\left|\frac{1}{n}\sum_{i=1}^{j} X_{\ell,i} - \mu\right| \ge \varepsilon\right\}\right) = 1 - P\left(\bigcap_{\ell=1}^{k}\left\{\left|\frac{1}{n}\sum_{i=1}^{j} X_{\ell,i} - \mu\right| < \varepsilon\right\}\right) = 1 - P\left(\left|\frac{1}{n}\sum_{i=1}^{j} X_{1,i} - \mu\right| < \varepsilon\right)^{k}.$$
Therefore, we have
$$P\Big(\max_{1 \le i \le k} \big|\mathbb{E}_S[q_i] - \mathbb{E}_{\mathcal{D}}[q_i]\big| \ge \varepsilon\Big) = \sum_{j=0}^{n} \binom{n}{j} p_w^j (1-p_w)^{n-j}\left(1 - P\left(\left|\frac{1}{n}\sum_{i=1}^{j} X_{1,i} - \mu\right| < \varepsilon\right)^{k}\right).$$
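This last expression can be evaluated directly. The sketch below assumes $p_w \ge \mu$ so that $p_x = \mu / p_w$ is a valid probability; the function name and parametrization are ours, not the paper's.

```python
import numpy as np
from scipy.stats import binom

def naive_bayes_overfit_probability(n, k, mu, pw, eps):
    """Exact P(max_i |E_S[q_i] - mu| >= eps) for k queries with the Naive
    Bayes structure q_i(z) = X_i * W, where P(W = 1) = pw and P(X_i = 1) = mu / pw."""
    px = mu / pw
    j = np.arange(n + 1)                  # number of test points with W_i = 1
    weights = binom.pmf(j, n, pw)
    hi = np.ceil(n * (mu + eps)) - 1      # largest count with count/n - mu < eps
    lo = np.floor(n * (mu - eps)) + 1     # smallest count with mu - count/n < eps
    inside = np.clip(binom.cdf(hi, j, px) - binom.cdf(lo - 1, j, px), 0.0, 1.0)
    return float(np.sum(weights * (1.0 - inside ** k)))
```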
C Empirical distribution of image difficulty in ImageNet
Figure 4: The empirical "difficulty" distribution of the 50,000 images in the ImageNet validation set as measured by the classifiers in the testbed from Recht et al. [14]. The plot shows how many of the images are misclassified by at most a certain number of the models. For instance, about 27.8% of the images are correctly classified by all models, and 55.9% of the images are correctly classified by 60 of the 66 models. 4.7% of the images are misclassified by all models. The plot shows that a significant fraction of images is classified correctly by all or almost all of the models. These empirical findings support the Naive Bayes assumption in Section 5.1.
D CIFAR-10 random hyperparameter grid search
We conducted a large hyperparameter search on a ResNet-110. All of our experiments build on the ResNet implementation and training code provided by https://github.com/hysts/pytorch_image_classification. In Table 2, we specify the grid used in the experiments. If not explicitly stated, all other hyperparameters are set to their default settings.

Table 2: Random grid search hyperparameters.

Parameter: Sampling distribution
Number of base channels: Uniform{ , , , }
Residual block type: Uniform{"Basic", "Bottleneck"}
Remove ReLU before residual units: Uniform{True, False}
Add BatchNorm after last convolutions: Uniform{True, False}
Preactivation of shortcuts after downsampling: Uniform{True, False}
Batch size: Uniform{32, 64, 128, 256}
Base learning rate: Uniform[1e-4, 0.5]
Weight decay: Uniform[ , ]
Use weight decay with batch norm: Uniform{True, False}
Optimizer: Uniform{SGD, SGD with Momentum, Nesterov GD, Adam}
Momentum (SGD with momentum): Uniform{0.6, 0.99}
β1 (Adam): Uniform[0.8, 0.95]
β2 (Adam): Uniform[0.9, 0.999]
Learning rate schedule: Uniform{Cosine, Fixed Decay}
Learning rate decay point 1 (Fixed Decay): Uniform{40, 60, 80, 100}
Learning rate decay point 2 (Fixed Decay): Uniform{120, 140, 160, 180}
Use random crops: Uniform{True, False}
Random crop padding: Uniform{2, 4, 8}
Use horizontal flips: Uniform{True, False}
Use cutout: Uniform{True, False}
Cutout size: Uniform{8, 12, 16}
Use dual cutout augmentation: Uniform{True, False}
Dual cutout α: Uniform[0.05, 0.3]
Use random erasing: Uniform{True, False}
Random erasing probability: Uniform[0.2, 0.8]
Use mixup data augmentation: Uniform{True, False}
Mixup α: Uniform[0.6, 1.4]
Use label smoothing: Uniform{True, False}
Label smoothing ε