On Robustness and Transferability of Convolutional Neural Networks
Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D'Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, Mario Lucic
Google Research (Brain Team)
Abstract
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we revisit the out-of-distribution and transfer performance of modern image classification CNNs and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improves the distributional shift robustness. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing, such as modifying the image resolution, can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset we use for a systematic analysis across common factors of variation.
Deep convolutional networks have attained impressive results across a plethora of visual classification benchmarks [34, 58] where the training and testing distributions match. In the real world, however, the conditions in which the models are deployed can often differ significantly from the conditions in which the model was trained. It is imperative to understand the impact dataset shifts [48] have on the performance of these models. This problem has gained a lot of traction, and several systematic investigations have shown unexpectedly high sensitivity of image classifiers along various dimensions, including photometric perturbations [25], natural perturbations obtained from video data [52], as well as model-specific adversarial perturbations [22].

The problem of dataset shift, or out-of-distribution (OOD) generalization, is closely related to a learning paradigm known as transfer learning [54, §13]. In transfer learning we are interested in constructing models that can improve their performance on some target task by leveraging data from different related problems. In contrast, under dataset shift one assumes that there are two environments, namely training and testing [54], with the constraint that the model cannot be adapted using data from the target environment. As a consequence, the two environments typically have to be more similar and their differences more structured than in the transfer setting (c.f. Section 2.1).

In this work we evaluate the most successful recent recipes for transfer learning (model and data scale) on the problem of OOD generalization on the most prominent recent datasets, neural architectures, and training regimes, and find that increasing model and data scale is surprisingly effective at mitigating the effects of different types of dataset shift.

∗ Shared first authorship. Correspondence to {josipd,jessicayung,lucic}@google.com.

Contributions
We systematically investigate the classification accuracy of image classification models on the training distribution, their generalization to OOD data (without adaptation), and their transfer learning performance with adaptation in the low-data regime. Specifically, we present: (i) A meta-analysis of existing OOD metrics and transfer learning benchmarks across a wide variety of models, ranging from self-supervised to fully supervised models with up to 900M parameters, showing that most of the variance contained in the various metrics is explained by the ImageNet validation set accuracy. However, increasing the model and data scale disproportionately improves transfer performance, despite providing only marginal improvements in performance on the ImageNet validation set. (ii) Focusing on OOD robustness, we analyze the effects of the training set size, model scale, training regime, and testing resolution, and conclude that the effect of scale overshadows all other dimensions. (iii) We introduce a novel dataset for a fine-grained OOD analysis that quantifies the robustness to common factors of variation: object size, object location, and object orientation (rotation angle). In a systematic study we show that the models become less sensitive (and hence more robust) to each of these factors of variation as the dataset size and model size increase.

Understanding and correcting for dataset shifts are classical problems in statistics and machine learning, and have as such received substantial attention, see e.g. the monograph [48]. Formally, let us denote the observed variable by X and the variable we want to predict by Y. A dataset shift occurs when we train on samples from P_train(X, Y), but are at test time evaluated under a different distribution P_test(X, Y). Storkey [54] discusses and precisely defines different possibilities for how P_train and P_test can differ. We are mostly interested in covariate shifts, i.e., when the conditionals P_train(Y | X) = P_test(Y | X) agree, but the marginals P_train(X) and P_test(X) differ. Most robustness datasets proposed in the literature targeting ImageNet models are such instances: the images X come from a source P_test(X) different from the original collection process P_train(X), but the label semantics are not supposed to change. As a robustness score one typically uses the expected accuracy, i.e., P_test(Y = f(X)), where f(X) is the class predicted by the model.

Robustness datasets and dataset shift types

ImageNet-V2 [50] is a newly collected test set mimicking the original ImageNet validation set. The authors attempted to replicate the data collection process, but found that all models drop significantly in accuracy. Recent work attributes this drop to statistical bias in the data collection [17]. ImageNet-C and ImageNet-P [25] are obtained by corrupting the ImageNet validation set with classical corruptions, such as blur, different types of noise and compression, and further cropping the images to 224 × 224. These datasets define a total of 15 noise, blur, weather, and digital corruption types, each appearing at 5 severity levels or intensities. ObjectNet [3] presents a new test set of images collected directly using crowd-sourcing. ObjectNet is particular as the objects are captured at unusual poses in cluttered, natural scenes, which can severely degrade recognition performance. Given this clutter, and arguably better suitability as a detection than a recognition task [5], Y | X might be hard to define and the dataset goes beyond a covariate shift. In contrast, the ImageNet-A dataset [28] consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Hence, in addition to the covariate shift due to the data source, this dataset is not model-agnostic and exhibits a strong selection bias [54].

In an attempt to focus on naturally occurring invariances, [52] turned to videos and annotated two datasets, namely ImageNet-Vid-Robust and YouTube-BB-Robust, derived from the ImageNet-Vid [11] and YouTube-BB [49] video datasets, respectively. In addition to measuring accuracy over the frames, the available temporal structure allows for more fine-grained robustness metrics. In [52] the authors suggest the following pm-k metric: given an anchor frame and up to k frames before and after it, a prediction is marked as correct only if the classifier correctly classifies all 2k + 1 frames around the anchor. We present the details of each dataset in Appendix A.
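A minimal sketch of how such a pm-k accuracy can be computed from per-frame predictions (this is an illustration, not the evaluation code of [52]); it assumes each clip has a single ground-truth label and a designated anchor-frame index:

```python
from typing import Sequence


def pm_k_accuracy(frame_preds: Sequence[Sequence[int]],
                  labels: Sequence[int],
                  anchors: Sequence[int],
                  k: int) -> float:
    """pm-k accuracy: a clip counts as correct only if the classifier is right
    on the anchor frame and on every available frame within k frames of it."""
    correct = 0
    for preds, label, anchor in zip(frame_preds, labels, anchors):
        lo, hi = max(0, anchor - k), min(len(preds), anchor + k + 1)
        # Correct only if every frame in the window is classified correctly.
        correct += all(p == label for p in preds[lo:hi])
    return correct / len(labels)
```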
In transfer learning [46], a model might leverage the data it has seen on a related distribution, P_pre-train, to achieve better performance on a new task P_train. Note that, in contrast to the covariate shift setting, the disparity between P_pre-train and the new task is typically larger, but one is additionally given samples from P_train. While there exist many approaches for how to transfer knowledge to the new task, the most common approach in modern deep learning, which we use, is to (i) train a model on P_pre-train (using perhaps an auxiliary, self-supervised task [15, 21]), and then (ii) train a model on P_train by initializing the model weights from the model trained in the first step.

Recently, a suite of datasets has been collected to benchmark modern image classification transfer techniques [69]. The Visual Task Adaptation Benchmark (VTAB) defines 19 datasets with 1000 labeled samples each, categorized into three groups of natural, specialized, and structured datasets: natural (most similar to ImageNet) consists of standard natural classification tasks (e.g. CIFAR, VGG Flowers); specialized contains medical and satellite images; and structured (least similar to ImageNet) consists mostly of synthetic tasks that require understanding of the geometric layout of scenes. We compute an overall transfer score as the mean across all 19 datasets, as well as scores for each subgroup of tasks. We provide details for all of the tasks in Appendix A.

While many robustness metrics have been proposed to capture different sources of brittleness, it is not well understood how these metrics relate to each other. We investigate (i) the amount of complementary information in these metrics, and (ii) their usefulness in guiding design choices. Further, despite the close relationship between the notions of robustness and transferability, there has been no analysis of how predictive of each other their corresponding metrics are. To analyze these questions, we evaluated 39 different models over 23 robustness metrics and the 19 transfer tasks.

Figure 1: Correlation and informativeness of robustness metrics. Most metrics correlate strongly with ImageNet accuracy and provide little additional discriminability. (Left) Spearman's correlation between metrics. (Right) Difference in accuracy of a logistic classifier trained to discriminate between model types based on ImageNet accuracy plus one additional metric, compared to a classifier trained only on ImageNet accuracy (higher is better, top 10 metrics shown). Bars show mean ± s.d. of 1000 bootstrap samples from the 39 models.
Metrics
We consider metrics that quantify both robustness and transfer performance. For robustness, we measure the model accuracy on the ImageNet, ImageNet-V2 (matched frequency variant), and ObjectNet datasets. We also consider video datasets, ImageNet-Vid and YouTube-BB; we use both the accuracy metric and the pm-k metric (suffix -W). On ImageNet-C we report the AlexNet-accuracy-weighted [37] accuracy over all corruption types (called mean corruption error in [25]). To evaluate the transferability of the models, we use the VTAB-1K benchmark introduced in Section 2.2. We evaluate average transfer performance across all 19 datasets, with 1000 examples each, as well as per-group performance. To transfer a model we performed a sweep over two learning rates and schedules. We report the median testing accuracy over three fine-tuning runs with parameters selected using an 800-200 example train-validation split.
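As a sketch of this selection protocol (the sweep values and the `finetune_fn`/`eval_fn` callables are placeholders, not the exact configuration used here):

```python
import statistics
from typing import Callable, Sequence


def vtab_transfer_score(finetune_fn: Callable[..., object],
                        eval_fn: Callable[[object, str], float],
                        learning_rates: Sequence[float] = (0.01, 0.001),
                        schedules: Sequence[str] = ("short", "long"),
                        num_runs: int = 3) -> float:
    """Sweep learning rates and schedules, pick the best configuration on an
    800/200 train/validation split of the 1000 examples, then report the
    median test accuracy over `num_runs` fine-tuning runs."""
    best_cfg, best_val = None, float("-inf")
    for lr in learning_rates:
        for schedule in schedules:
            model = finetune_fn(split="train800", lr=lr, schedule=schedule)
            val_acc = eval_fn(model, "val200")
            if val_acc > best_val:
                best_cfg, best_val = (lr, schedule), val_acc
    lr, schedule = best_cfg
    # Re-run the selected configuration and report the median test accuracy.
    test_accs = [eval_fn(finetune_fn(split="train1000", lr=lr, schedule=schedule), "test")
                 for _ in range(num_runs)]
    return statistics.median(test_accs)
```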
Models

We consider several model families, some of which make use of additional data besides ImageNet. We evaluate ResNet-50 [23] and six EfficientNet (B0 through B5) models [58], including variants using AutoAugment [10] and AdvProp [66], which have been trained on ImageNet. We include self-supervised SimCLR [6] (three variants: linear classifier on top of the representation (lin), fine-tuned on 10% (ft-10), and on 100% (ft-100) of the ImageNet data), as well as self-supervised-semi-supervised (S4L) [68] models that have been fine-tuned on 10% and 100% of the ImageNet data. We also consider a set of models that incorporate other data sources. Specifically, we test three NoisyStudent [67] variants which use ImageNet and unlabelled data from the JFT dataset, BiT (BigTransfer) [34] models that have been first trained on ImageNet, ImageNet-21k, or JFT and then transferred to ImageNet by fine-tuning, and the Video-Induced Visual Invariance (VIVI) model [63], which uses ImageNet and unlabelled videos from the YT8M dataset [1]. Finally, we consider the BigBiGAN [14] model, which has been first trained as a class-conditional generative model and then fine-tuned to an ImageNet classifier. All model details can be found in Appendix E.

Figure 2: The relationship between transfer learning, ImageNet, and robustness performance. (Left) Average score on all transfer benchmarks versus ImageNet performance. (Center) Average score on all robustness benchmarks versus average transfer performance. (Right) Correlation between different groups of transfer datasets (natural, specialized, structured) and robustness metrics.
How well does ImageNet accuracy predict performance on OOD data?

We start with an analysis of the mutual dependence of the robustness metrics by measuring Spearman's rank correlation coefficient ρ. Figure 1 (left) shows the rank correlation between the metrics. We observe that all metrics are highly correlated with each other, and that they also strongly correlate with the accuracy on the ImageNet validation set. To understand the benefit of these metrics beyond ImageNet accuracy, we fit linear regression models for each metric with ImageNet accuracy as the single covariate. Consistent with the rank correlation analysis, we find that 75.2% of the variance in the metric values is explained by ImageNet accuracy. A principal component analysis shows that the space of robustness metric residuals spans approximately one statistically significant dimension (Appendix A). This raises the question to what degree the robustness metrics provide useful information beyond standard ImageNet accuracy, which we investigate next.
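The following sketch illustrates the two computations (rank correlation, and variance explained by ImageNet accuracy); the metric values here are synthetic placeholders, not the actual measurements:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
imagenet_acc = rng.uniform(0.70, 0.88, size=39)           # 39 models (placeholder values)
metric = 0.8 * imagenet_acc + 0.05 * rng.normal(size=39)  # one robustness metric

# Rank correlation between the metric and ImageNet accuracy.
rho, _ = spearmanr(imagenet_acc, metric)

# Fraction of the metric's variance explained by ImageNet accuracy alone.
reg = LinearRegression().fit(imagenet_acc[:, None], metric)
r2 = reg.score(imagenet_acc[:, None], metric)
residuals = metric - reg.predict(imagenet_acc[:, None])   # input to the residual PCA (Appendix A)
print(f"Spearman rho = {rho:.2f}, variance explained = {r2:.1%}")
```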
Can robustness metrics discriminate between models?

The goal of a metric is to discriminate between different models and thus guide design choices. We therefore quantify the usefulness of each metric in terms of how much it improves the discriminability between the various models beyond the information provided by ImageNet accuracy. Specifically, we train logistic regression classifiers to discriminate between the 12 model groups outlined above. We compared the performance of a classifier using only ImageNet accuracy as input feature to a classifier using ImageNet accuracy and up to two of the other metrics; see Fig. 1 (right) and Appendix A. We found that most of the tested metrics provide little increase in model discriminability over ImageNet accuracy. Of course, this result is conditioned on the size and composition of our dataset, and may differ for a different set of models. However, based on our dataset of 39 models in 12 groups, the most informative metrics are those based on different datasets and/or video, rather than ImageNet-derived datasets.

How related are OOD robustness and transfer metrics?
Next, we turn to transfer learning. It has been observed that better ImageNet models transfer better [35, 69]. Since robustness is correlated with ImageNet accuracy (Figure 1), we might expect a similar relationship. To get an overall view, we compute the mean of all robustness metrics and compare it to transfer performance. Figure 2 (center) shows this average robustness plotted against transfer performance, while Figure 2 (left) shows transfer versus ImageNet accuracy. Indeed, we observe a large correlation (ρ = 0.73) between robustness metrics and transfer; however, the correlation is not stronger than between transfer and ImageNet accuracy. Further, we compute the correlation of the residual robustness score (mean robustness minus ImageNet accuracy) against the transfer score, and find only a weak relationship. This indicates that robustness metrics, on aggregate, do not provide additional signal that predicts model transferability beyond that of the base ImageNet performance. We do, however, see some interesting differences in the relative performances of different model groups. Certain model groups, while attaining reasonable ImageNet/robustness scores, transfer less well to VTAB. Therefore, there are factors that influence transferability unrelated to robust inference. One example is batch normalization, which is outperformed by group normalization with weight standardization in transfer [34]. Next, we break down the correlation by robustness metrics and transfer datasets in Fig. 2 (right). We see that each metric correlates similarly with the task groups. However, for the groups that require more distant transfer (Specialized, Structured), no metric predicts transferability well. Curiously, raw ImageNet accuracy is the best predictor of transfer to structured tasks, indicating that robustness metrics do not relate to challenging transfer tasks, at least not more than raw ImageNet accuracy.
Figure 3: (Top) Reduction (in %) in classification error relative to the classification error of the model trained for 112k steps on 1M examples (bottom left corner), as a function of training iterations and training set size. The results are for ResNet-101x3 trained on ImageNet-21k subsets. (Bottom) Relative reduction (in %) in classification error going from ResNet-50 to ResNet-101x3 as a function of training steps and training set size (ImageNet-21k subsets). The reduction generally increases with the training set size and longer training.

Summary
We have seen that many popular robustness metrics are highly correlated. Some metrics, particularly those not based on ImageNet, have only a little additional discriminative power to distinguish models over ImageNet accuracy. Transferability is also related to ImageNet accuracy, and hence to robustness. We observe that while there is correlation, transfer highlights failures that are somewhat independent of robustness. Further, no particular robustness metric appears to correlate better with any particular group of transfer tasks than ImageNet accuracy does. Since all of these metrics seem closely linked, we investigate strategies known to be effective for ImageNet and transfer learning on the newer robustness benchmarks.
Increasing the scale of pre-training data, model architecture, and training steps has recently led to diminishing improvements in terms of ImageNet accuracy. By contrast, it has been recently established that scaling along these axes can lead to substantial improvements in transfer learning performance [34, 58]. In the context of robustness, this type of scaling has been explored less. While there are some results suggesting that scale improves robustness [25, 50, 67, 61], no principled study decoupling the different scale axes has been performed. Given the strong correlation between transfer performance and robustness, this motivates a systematic investigation of the effects of the pre-training data size, model architecture size, training steps, and input resolution.

We consider the standard ImageNet training setup [23] as a baseline, and scale up the training accordingly. To study the impact of dataset size, we consider the ImageNet-21k [11] and JFT [55] datasets for the experiments, as pre-training on either of them has shown great performance in transfer learning [34]. We scale from the ImageNet training set size (1.3M images) to the ImageNet-21k training set size (13M images, about 10 times larger than ImageNet). To explore the effect of the model size, we use a ResNet-50 as well as the deeper and 3× wider ResNet-101x3 model. We further investigate the impact of the training schedule, as larger datasets are known to benefit from longer training for transfer learning [34]. To disentangle the impact of dataset size and training schedules, we train the models for every pair of dataset size and schedule. We fine-tune the trained models to ImageNet using the BiT HyperRule [34], and assess their OOD generalization performance in the next section. Throughout, we report the reduction in classification error relative to the model which was trained on the smallest number of examples, for the fewest iterations, and hence achieves the lowest accuracy. Other details are presented in Appendix B.
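For reference, the reported quantity can be computed as follows (a minimal sketch):

```python
def relative_error_reduction(acc_model: float, acc_baseline: float) -> float:
    """Reduction (in %) of the classification error relative to the baseline
    model, e.g. the one trained on the fewest examples for the fewest steps."""
    err_model, err_base = 1.0 - acc_model, 1.0 - acc_baseline
    return 100.0 * (err_base - err_model) / err_base

# Example: baseline at 60% accuracy, scaled-up model at 70% accuracy
# -> (0.40 - 0.30) / 0.40 = 25% relative error reduction.
print(relative_error_reduction(0.70, 0.60))
```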
Figure 4: Comparison of different types of evaluation preprocessing and resolutions. Default: accuracy obtained for the preprocessing and resolution proposed by the authors of the respective models. Best: the accuracy when selecting the best resolution from a fixed set of eight candidate resolutions. FixRes: applying FixRes for the same set of resolutions and selecting the best resolution. Increasing the evaluation resolution and additionally using FixRes helps across a large range of models and pre-training datasets on ImageNet-A and ObjectNet.

Dataset size impact
The results for the ResNet-101x3 model are presented in Fig. 3. When trained on ImageNet-21k, the OOD classification error significantly decreases with increasing dataset size and training duration: we observe large relative error reductions when going from 112k steps on 1M data points to 1.12M steps on 13M data points. The reductions are least pronounced for YouTube-BB/YouTube-BB-W. Also note that training for 1.12M steps leads to a lower accuracy than training for only 457k steps unless the full ImageNet-21k dataset is used. For models trained on JFT we observe a similar behavior, except that training for 1.12M steps leads to a higher accuracy than training for 457k steps even when only 1M or 5M data points are used. The JFT results are presented in Appendix B. These results suggest that, if the models have enough capacity, increasing the amount of training data, with no additional changes, leads to massive gains on all datasets simultaneously, which is in line with recent results in transfer learning [34].

Model size impact
Figure 3 shows the relative reduction in classification error when using ResNet-101x3 instead of ResNet-50 as a function of the number of training steps and the dataset size. It can be seen that increasing the model size can lead to substantial reductions in error. For a fixed training duration, using more data always helps. However, on ImageNet-21k, training too long can lead to increases in the classification error when the model size is increased, unless the full ImageNet-21k dataset is used. This is likely due to overfitting. This effect is much less pronounced when JFT is used for training. JFT results are presented in Appendix B. Again, reductions in classification error are least pronounced for YouTube-BB/YouTube-BB-W.
During training, images are typically cropped randomly, with many crop sizes and aspect ratios, to prevent overfitting. In contrast, during testing, the images are usually rescaled such that the shorter side has a pre-specified length, and a fixed-size center crop is taken and then fed to the classifier. This leads to a mismatch in object sizes between training and testing. Increasing the resolution at which images are tested leads to an improvement in accuracy across different architectures [60, 61]. Furthermore, additional benefits can be obtained by applying FixRes, i.e., fine-tuning the network on the training set with the test-time preprocessing (omitting random cropping with aspect ratio changes) and at higher resolution. We explore the effect of this discrepancy on the robustness of different architectures. As some of the robustness datasets were collected in a different way from ImageNet, discrepancies in the cropping are likely. We investigate both adjusting the test-time resolution and applying FixRes. For FixRes, we use a simple setup with a single schedule and learning rate for all models (except using a smaller learning rate for the BiT models), and without the heavy color augmentation of [60] or the label smoothing of [61]. Furthermore, we did not extensively tune hyperparameters, but chose a setup that works reasonably well across architectures and datasets.
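A sketch of the test-resolution sweep described above (the candidate resolutions and the `evaluate_at_resolution` callable are illustrative placeholders, not the exact set used in our experiments; with FixRes, the model would first be fine-tuned using the test-time preprocessing at the candidate resolution):

```python
from typing import Callable, Sequence, Tuple


def best_test_resolution(evaluate_at_resolution: Callable[[int], float],
                         candidates: Sequence[int] = (224, 288, 320, 384, 448, 512),
                         ) -> Tuple[int, float]:
    """Evaluate the model with center-crop preprocessing at each candidate
    resolution and keep the one with the highest accuracy."""
    results = {res: evaluate_at_resolution(res) for res in candidates}
    best = max(results, key=results.get)
    return best, results[best]
```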
Results and discussion

Figure 4 shows the accuracy on ImageNet-A and ObjectNet at the testing resolution proposed by the authors of the respective architecture, along with the highest accuracy obtained by selecting the best testing resolution from a fixed set of candidate resolutions, and after applying FixRes. The results for the other datasets are deferred to Appendix C. We start by discussing observations that apply to most of the models, excluding the BiT models, which will be discussed below. While FixRes only leads to marginal benefits on ImageNet, it can lead to substantial improvements on the robustness metrics. Choosing the optimal testing resolution leads to a significant increase in accuracy on ImageNet-A and ObjectNet in most cases, and applying FixRes often leads to additional substantial gains.
For ObjectNet, fine-tuning with testing preprocessing (i.e. fine-tuning with central cropping instead of the random cropping used during training) even helps without increasing the resolution in some cases. Increasing the resolution and/or applying FixRes often slightly helps on ImageNet-V2. For ImageNet-C, the optimal testing resolution often corresponds to the resolution used for training, and applying FixRes rarely changes this picture. This is not surprising, as the ImageNet-C images are cropped to 224 pixels by default, and increasing the resolution does not add any new information to the image.

Figure 5: (Left) Sample images from our synthetic dataset. We consider 614 foreground objects from 62 classes and 867 backgrounds, and vary the object location, rotation angle, and object size for a total of 611,608 images. (Right) Improvement across object locations for ResNet-50 and ResNet-101x3 trained on ImageNet-21k subsets of 1.3M (reference), 5.2M, and 13M images. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue versus white). Then, we consider models trained with more data, compute the same normalized score, and plot the difference with respect to the first column. We observe that, as the dataset size increases, the sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue versus white in the second and third columns). The effect is more pronounced for the larger model. The full set of results is presented in Figure 15.
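The normalization used in the right panel of Figure 5 can be written compactly; a sketch, assuming the per-location accuracies are available as a 2-D array:

```python
import numpy as np


def normalized_location_scores(per_location_acc: np.ndarray) -> np.ndarray:
    """Normalize a grid of per-location accuracies by their 95th percentile,
    quantifying how much worse the weakest locations are."""
    return per_location_acc / np.percentile(per_location_acc, 95)


def improvement_over_reference(reference_acc: np.ndarray,
                               larger_data_acc: np.ndarray) -> np.ndarray:
    """Difference of normalized scores with respect to the reference model
    (trained on the smallest subset); positive values mean a location became
    relatively less of a weak spot as the training set grew."""
    return (normalized_location_scores(larger_data_acc)
            - normalized_location_scores(reference_acc))
```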
For the video-derived robustness datasets ImageNet-Vid-Robust and YouTube-BB-Robust, evaluating at a larger testing resolution and/or applying FixRes at a higher resolution can substantially improve the accuracy on the anchor frame and the robustness accuracy for small EfficientNet and ResNet models, but does not help the larger ones. For the BiT models, the resolution suggested by the authors is almost always optimal, except on ObjectNet and ImageNet-A, where changing the preprocessing considerably helps. FixRes arguably does not lead to improvements, as it was already applied in BiT as a part of the BiT HyperRule. Based on these results, we strongly suggest applying these adjustments to address the shift caused by resolution mismatch.

There are several factors of variation, such as object location, size, and rotation, that we want our models to be robust to. For a solid diagnostic of the failure modes, one should ideally be able to vary testing data along these axes. However, the combinatorial number of possible combinations of such factors of variation precludes any large-scale systematic data collection scheme. In this work we present a scalable alternative and construct a novel synthetic dataset for fine-grained evaluation. We paste objects extracted from OpenImages [38] using segmentation masks onto uncluttered backgrounds sourced from the web (Figure 5, details in Appendix D). We can thus conduct controlled studies by systematically varying the object class, size, location, and orientation (rotation angle). We study one factor of variation at a time (e.g. the location of the object center), and look at the average performance for each location over a uniform grid.

We investigate the effect of model and dataset size on these three factors of variation by evaluating the ResNet-50 and ResNet-101x3 models. We observe that the models become more invariant to the location (Figure 5), size (Figure 6), and rotation (Figure 6) of the objects as the model or training set size increases. The improvements are more pronounced for the larger ResNet-101x3 model. The analogous results on the JFT dataset are presented in Appendix D.
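A minimal sketch of this kind of compositing (using PIL; the scaling rule and argument names are illustrative assumptions, not the exact generation pipeline behind our dataset):

```python
from PIL import Image


def composite(foreground_rgba: Image.Image,
              background_rgb: Image.Image,
              rel_area: float, rel_x: float, rel_y: float,
              rotation_deg: float) -> Image.Image:
    """Paste a segmented foreground object onto a background at a controlled
    relative area, center location (in [0, 1] coordinates), and rotation."""
    bg = background_rgb.convert("RGB").copy()
    fg = foreground_rgba.convert("RGBA").rotate(rotation_deg, expand=True)
    # Rescale the object so that its bounding box covers roughly `rel_area`
    # of the background (a simplification of an exact mask-area target).
    target = (rel_area * bg.width * bg.height) ** 0.5
    scale = target / max(fg.width, fg.height)
    fg = fg.resize((max(1, int(fg.width * scale)), max(1, int(fg.height * scale))))
    # Place the object center at the requested relative location.
    cx, cy = int(rel_x * bg.width), int(rel_y * bg.height)
    bg.paste(fg, (cx - fg.width // 2, cy - fg.height // 2), mask=fg)
    return bg
```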
There has been a growing literature exploring the robustness of image classification networks. Early investigations in face and natural image recognition found that performance degrades when introducing blur, Gaussian noise, occlusion, and compression artifacts, but less so under color distortions [12, 33].
Figure 6: Sensitivity to object rotation (left, in degrees) and object size (right, area as a percentage of the image) for ResNet-50 and ResNet-101x3 trained on ImageNet-21k subsets. (Left) In the first row of both plots we show the ratio of the accuracy and the best accuracy (across all rotations). For the second row (model trained on 2.6M instances) and the other rows, we compute the same normalized score and visualize the difference with respect to the first row. Larger differences imply more uniform behavior across object rotations. We observe that, as the dataset size increases, the average prediction accuracy across various rotation angles becomes more uniform. The effect is more pronounced for the larger model. (Right) Similarly, the average accuracy across various object sizes becomes more uniform for both models. As expected, the improvement is most pronounced for small object sizes covering only a small fraction of the pixels. The full set of results is presented in Figures 13 and 14.

Subsequent studies have investigated brittleness to similar corruptions [51, 73], as well as to impulse noise [29], photometric perturbations [59], and small shifts and other transformations [2, 17, 71]. CNNs have also been shown to over-rely upon texture rather than shape to make predictions, in contrast to human behavior [20]. Robustness to adversarial attacks [22] is a related, but distinct problem, where performance under worst-case perturbations is studied. In this paper we did not study such adversarial robustness, but have focused on average-case robustness to natural perturbations.

Several techniques have been shown to improve model robustness on these datasets. Using better data augmentation can improve performance on data with synthetic noise [27, 41]. Auxiliary self-supervision [7, 68] can improve robustness to label noise and common corruptions [26]. Transductive fine-tuning using self-supervision on the test data improves performance under distribution shift [56]. Training with adversarial perturbations improves many robustness benchmarks if one uses separate Batch-Norm parameters for clean and adversarial data [66]. Finally, additional pre-training using very large auxiliary datasets has recently shown significant improvements in robustness: NoisyStudent [67] reports good performance on several robustness datasets, while Big Transfer (BiT) [34] reports strong performance on the recently introduced ObjectNet dataset [3].

Deep networks are often trained by pre-training the network on a different problem and then fine-tuning on the target task. This pre-training is often referred to as representation learning; representations can be trained using supervised [30, 34], weakly-supervised [42], or unsupervised data [13, 14, 63, 67]. Recent benchmarks have been proposed to evaluate transfer to several datasets, to assess generalization to tasks with different characteristics, or to tasks disjoint from the pre-training data [62, 69]. While state-of-the-art performance on many competitive datasets is attained via transfer learning [67, 34], the implications for final robustness metrics remain unclear. Creating synthetic datasets by inserting objects onto backgrounds has been used for training [72, 16] and evaluating models [34], but previous works either do not systematically vary object size, location, or orientation, or analyze translation and rotation robustness only at the image level [18].

We analyzed OOD generalization and transferability of image classifiers, and demonstrated that model and data scale, together with a simple training recipe, lead to large improvements. However, the models do exhibit a substantial gap in performance when tested on OOD data, and scale is unlikely to be the only approach needed to close this gap. Secondly, this approach hinges on the availability of curated datasets and significant computing capabilities, which is not always practical. Hence, we believe that transfer learning, i.e. train once, apply many times, is the most promising paradigm for OOD robustness in the short term.
One limitation of this study is that we consider image classification models fine-tuned to the ImageNet label space which were developed with the goal of optimizing the accuracy on the ImageNet test set. While existing work shows that the community has not simply overfit to ImageNet, it is possible that these models have correlated failure modes on datasets which share the biases with ImageNet [50]. This highlights the need for datasets which enable fine-grained analysis of all important factors of variation, and we hope that our dataset will be useful for researchers. Instead of requiring the model to work under various dataset shifts, one can ask an alternative question: assuming that the model will be deployed in an environment significantly different from the training one, can we at least quantify the model uncertainty for each prediction? This important property remains elusive even for moderate-scale neural networks [53], but could potentially be improved by considering larger models and larger pre-training datasets, which we leave for future work.

References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675, 2016.
[2] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations?
Journal of Machine Learning Research , 20, 2019.[3] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund,Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushingthe limits of object recognition models. In
Advances in Neural Information Processing Systems ,2019.[4] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, HeinrichKüttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXivpreprint arXiv:1612.03801 , 2016.[5] Ali Borji. Objectnet dataset: Reanalysis and correction. In arXiv 2004.02042 , 2020.[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple frameworkfor contrastive learning of visual representations. arXiv:2002.05709 , 2020.[7] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervisedGANs via auxiliary rotation loss. In
Conference on Computer Vision and Pattern Recognition ,2019.[8] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification:Benchmark and state of the art.
Proceedings of the IEEE , 2017.[9] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in thewild. In
IEEE Conference on Computer Vision and Pattern Recognition , 2014.[10] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:Learning augmentation strategies from data. In
Conference on Computer Vision and PatternRecognition , 2019.[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scalehierarchical image database. In
Conference on Computer Vision and Pattern Recognition , 2009.[12] Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks.In
International Conference on Quality of Multimedia Experience , 2016.[13] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learningby context prediction. In
International Conference on Computer Vision , 2015.[14] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In
Advancesin Neural Information Processing Systems , 2019.[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and ThomasBrox. Discriminative unsupervised feature learning with exemplar convolutional neural net-works.
IEEE transactions on pattern analysis and machine intelligence , 38(9), 2015.[16] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easysynthesis for instance detection. In
International Conference on Computer Vision, 2017.
[17] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying statistical bias in dataset replication. arXiv:2005.09619, 2020.
[18] Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations.
CoRR , abs/1712.02779, 2017.[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: Thekitti dataset.
International Journal of Robotics Research , 2013.[20] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann,and Wieland Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape biasimproves accuracy and robustness. In
International Conference on Learning Representations ,2019.[21] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning bypredicting image rotations. arXiv:1803.07728 , 2018.[22] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversar-ial examples. arXiv:1412.6572 , 2014.[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for imagerecognition. In
Conference on Computer Vision and Pattern Recognition , 2016.[24] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A noveldataset and deep learning benchmark for land use and land cover classification.
IEEE Journalof Selected Topics in Applied Earth Observations and Remote Sensing , 2019.[25] Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to commoncorruptions and surface variations. arXiv: 1807.01697 , 2018.[26] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervisedlearning can improve model robustness and uncertainty. In
Advances in Neural InformationProcessing Systems , 2019.[27] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshmi-narayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv:1912.02781 , 2019.[28] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Naturaladversarial examples. arXiv: 1907.07174 , 2019.[29] Hossein Hosseini, Baicen Xiao, and Radha Poovendran. Google’s cloud vision api is not robustto noise. In
International Conference on Machine Learning and Applications , 2017.[30] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transferlearning? arXiv:1608.08614 , 2016.[31] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick,and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementaryvisual reasoning. In
IEEE Conference on Computer Vision and Pattern Recognition , 2017.[32] Kaggle and EyePacs. Kaggle diabetic retinopathy detection, July 2015.[33] Samil Karahan, Merve Kilinc Yildirim, Kadir Kirtaç, Ferhat Sükrü Rende, Gultekin Butun, andHazim Kemal Ekenel. How image degradations affect deep CNN-based face recognition? In
International Conference of the Biometrics Special Interest Group , 2016.[34] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, SylvainGelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning.
EuropeanConference on Computer Vision , 2020.[35] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better?In
Conference on Computer Vision and Pattern Recognition , 2019.[36] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.[37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deepconvolutional neural networks. In
Advances in Neural Information Processing Systems, 2012.
[38] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2020.
[39] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In
IEEE Conference on Computer Vision and PatternRecognition , 2004.[40] Fei-Fei Li, Rob Fergus, and Pietro Perona. One-shot learning of object categories.
IEEETransactions on Pattern Analysis and Machine Intelligence , 2006.[41] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improvingrobustness without sacrificing accuracy with patch gaussian augmentation. arXiv:1906.02611 ,2019.[42] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, YixuanLi, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervisedpretraining. In
European Conference on Computer Vision , 2018.[43] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentangle-ment testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.[44] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng.Reading digits in natural images with unsupervised feature learning. In
NIPS Workshop onDeep Learning and Unsupervised Feature Learning 2011 , 2011.[45] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number ofclasses. In
Indian Conference on Computer Vision, Graphics and Image Processing , Dec 2008.[46] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.
IEEE Transactions onknowledge and data engineering , 22(10), 2009.[47] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In
IEEE Conferenceon Computer Vision and Pattern Recognition , 2012.[48] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence.
Dataset shift in machine learning . The MIT Press, 2009.[49] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video.In
Conference on Computer Vision and Pattern Recognition , 2017.[50] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenetclassifiers generalize to imagenet? arXiv: 1902.10811 , 2019.[51] Prasun Roy, Subhankar Ghosh, Saumik Bhattacharya, and Umapada Pal. Effects of degradationson deep neural network architectures. arXiv:1807.10108 , 2018.[52] Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and LudwigSchmidt. A systematic framework for natural perturbations from videos. arXiv:1906.02168 ,2019.[53] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin,D. Sculley, Joshua V. Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncer-tainty? evaluating predictive uncertainty under dataset shift. In
Advances in Neural InformationProcessing Systems , 2019.[54] Amos Storkey. When training and test sets are different: characterizing learning transfer.
Dataset shift in machine learning , 2009.[55] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonableeffectiveness of data in deep learning era. In
International Conference on Computer Vision ,2017.[56] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A Efros, and Moritz Hardt. Test-timetraining for out-of-distribution generalization. arXiv:1909.13231 , 2019.[57] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,and A. Rabinovich. Going deeper with convolutions. In
Conference on Computer Vision andPattern Recognition , 2015.[58] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neuralnetworks. arXiv:1905.11946 , 2019.[59] Dogancan Temel, Jinsol Lee, and Ghassan AlRegib. Cure-or: Challenging unreal and realenvironments for object recognition. In
International Conference on Machine Learning and Applications, 2018.
[60] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. In
Advances in Neural Information Processing Systems , 2019.[61] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-testresolution discrepancy: Fixefficientnet. arXiv:2003.08237 , 2020.[62] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin,Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset:A dataset of datasets for learning to learn from few examples. arXiv:1903.03096 , 2019.[63] Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, SylvainGelly, and Mario Lucic. Self-supervised learning of video-induced visual invariances. In
Conference on Computer Vision and Pattern Recognition , 2020.[64] Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotationequivariant cnns for digital pathology. In
International Conference on Medical Image Computingand Computer-Assisted Intervention , 2018.[65] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database:Large-scale scene recognition from abbey to zoo. In
IEEE Conference on Computer Vision andPattern Recognition , 2010.[66] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, and Quoc V Le. Adversarialexamples improve image recognition. arXiv:1911.09665 , 2019.[67] Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Self-training with noisy studentimproves imagenet classification. arXiv:1911.04252 , 2019.[68] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervisedsemi-supervised learning. In
International Conference on Computer Vision , 2019.[69] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, MarioLucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, LucasBeyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly,and Neil Houlsby. A large-scale study of representation learning with the visual task adaptationbenchmark. arXiv: 1910.04867 , 2019.[70] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyondempirical risk minimization. In
International Conference on Learning Representations , 2018.[71] Richard Zhang. Making convolutional networks shift-invariant again. In
International Confer-ence on Machine Learning , 2019.[72] Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau, and Stephen Lin. Distilling localization forself-supervised representation learning. arXiv: 2004.06638 , 2020.[73] Yiren Zhou, Sibo Song, and Ngai-Man Cheung. On classification of distorted images with deepconvolutional neural networks. In
International Conference on Acoustics, Speech and Signal Processing, 2017.
A Analysis of existing robustness and transfer metrics
Here, we provide additional details related to the analyses presented in Figure 1.
A.1 Dimensionality of the space of robustness metrics
To estimate how many different dimensions are measured by the robustness metrics beyond what is already explained by ImageNet accuracy, we proceeded as follows. For each of the robustness metrics shown in Figures 1 and 8, a linear regression was fit to predict that metric's value for the 39 models, using ImageNet accuracy as the sole predictor variable. Then, the residuals were computed for each metric by subtracting the linear regression prediction. The plot shows the fraction of variance explained by the first 4 principal components of the space of residuals of the robustness metrics. As a null hypothesis, we assumed that there is no correlation structure in the metric residuals. To construct corresponding null datasets, we randomly permuted the values for each metric independently, which destroys the correlation structure between metrics. Figure 7a shows that only the first principal component is significantly above the value expected under the null hypothesis.
(a) The space of robustness metrics.

Dataset              Instances          Classes
ImageNet [37]        50,000             1,000
ImageNet-A [28]      7,500              200
ImageNet-C [25]      15 × 5 × 50,000    1,000
ObjectNet [3]        18,574             113
ImageNet-V2 [50]     10,000             1,000
ImageNet-Vid [52]    22,179             293
YTBB-Robust [52]     51,826             229

(b) The name and reference, number of instances, and the number of classes overlapping with ImageNet for each dataset.
Figure 7: (Left) The space of robustness metrics spans approximately one statistically significant dimension after accounting for ImageNet accuracy. Error bars show 95% confidence intervals based on 1000 bootstrap samples (for the true data) or 1000 random permutations (for the null distribution). See Section A.1 for details. (Right) Details for the datasets used in this study. The datasets were used only for evaluation.
A.2 Informativeness of robustness metrics
To estimate how useful different combinations of robustness metrics are for discriminating between model types, we trained logistic regression classifiers to discriminate between the 12 model groups outlined in the main paper. We consider ImageNet accuracy as a baseline metric and therefore compare the performance of a classifier using only ImageNet accuracy as input feature to a classifier using ImageNet accuracy and either one (Figure 8, left) or two (Figure 8, right) additional metrics as input features. Figure 8 shows the difference in accuracy relative to the baseline (ImageNet-only) classifier. These results can serve practitioners with a limited budget as a rough guideline for which metric combinations are the most informative. In our experiments, the most informative combination of metrics in addition to ImageNet accuracy was ObjectNet and YouTube-BB, although other combinations performed similarly within the statistical uncertainty.
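A sketch of this discriminability comparison (placeholder features and group labels; for brevity it reports in-sample accuracy rather than cross-validated accuracy):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_models, n_groups = 39, 12
groups = rng.integers(0, n_groups, size=n_models)        # placeholder group labels
imagenet = rng.normal(size=(n_models, 1))                 # placeholder ImageNet accuracy
extra_metric = rng.normal(size=(n_models, 1))             # placeholder additional metric


def fit_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return clf.score(features, labels)


deltas = []
for _ in range(1000):  # bootstrap over models
    idx = rng.integers(0, n_models, size=n_models)
    base = fit_accuracy(imagenet[idx], groups[idx])
    both = fit_accuracy(np.hstack([imagenet, extra_metric])[idx], groups[idx])
    deltas.append(both - base)
print(np.mean(deltas), np.std(deltas))  # mean ± s.d. improvement over the baseline
```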
A.3 Visual Task Adaptation Benchmark
The Visual Task Adaptation Benchmark (VTAB) [69] contains 19 tasks. Either the full datasets or 1000-example training sets may be used; we use the version with 1000-example training sets (VTAB-1k). The tasks are divided into three groups: Natural, standard natural image classification problems; Specialized, domain-specific images captured with specialist equipment (e.g. medical images); and Structured, classification tasks that require geometric understanding of a scene. The Natural group contains the following datasets: Caltech101 [40], CIFAR-100 [36], DTD [9], Flowers102 [45], Pets [47], Sun397 [65], and SVHN [44]. The Specialized group contains remote sensing datasets, EuroSAT [24] and Resisc45 [8], and medical images, Patch Camelyon [64] and Diabetic Retinopathy [32]. The Structured group contains the following tasks: counting and distance prediction on CLEVR [31], pixel-location and orientation prediction on dSprites [43], camera elevation and object orientation on SmallNORB [39], object distance on DMLab [4], and vehicle distance on KITTI [19].
Figure 8: Informativeness of robustness metrics (related to Figure 1). (Left) Similar to Figure 1 (right), but showing all 23 robustness metrics: difference in accuracy of a logistic classifier trained to discriminate between model types based on ImageNet accuracy plus one additional metric, compared to a classifier trained only on ImageNet accuracy (higher is better). Bars show mean ± s.d. of 1000 bootstrap samples from the 39 models. (Right) Increase in classifier accuracy over ImageNet accuracy when including up to two robustness metrics as explanatory variables. The diagonal shows the single-feature values from (left).

B Scale and OOD generalization
Training Details
The models are first pre-trained on ImageNet-21k and JFT, followed by fine-tuning on ImageNet to match the label space for evaluation. We follow the pre-training and BiT-HyperRule fine-tuning setup proposed in [34]. Specifically, for pre-training, we use SGD with an initial learning rate of 0.1 and momentum 0.9. We use a linear learning rate warm-up for 5000 optimization steps and scale the learning rate with the batch size. We use a weight decay of 0.0001. We use the random image cropping technique from [57] and random horizontal mirroring, followed by resizing the image to the training resolution. We use a global batch size of 1024 and train on a Cloud TPUv3-128. We pre-train models for the cross product of the following combinations:

• Dataset size: {1.28M (1× ImageNet train set), 2.6M (2× ImageNet train set), 5.2M (4× ImageNet train set), 9M (7× ImageNet train set), 13M (10× ImageNet train set)}.
• Train schedule (steps): {113K (90 ImageNet epochs), 229K (180 ImageNet epochs), 457K (360 ImageNet epochs), 791K (630 ImageNet epochs), 1.1M (900 ImageNet epochs)}.

For fine-tuning, we use the BiT-HyperRule as described in [34]: batch size 512, learning rate 0.003, no weight decay, the classification head initialized to zeros, mixup [70], and fine-tuning for 20 000 steps at the resolution prescribed by the BiT-HyperRule. We present the results on the synthetic dataset in Appendix D.
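As a rough illustration of the pre-training learning-rate rule described above (a sketch, not the authors' training code): linear warm-up over the first 5000 steps towards a peak rate of 0.1 scaled with the batch size. The reference batch size of 256 used for the scaling is an assumption, since the exact scaling constant is not stated here.

```python
def pretrain_learning_rate(step: int, batch_size: int,
                           base_lr: float = 0.1,
                           warmup_steps: int = 5000,
                           reference_batch_size: int = 256) -> float:
    """Learning rate at a given step: linear warm-up, then a constant rate.

    The peak rate scales linearly with the batch size; the reference batch
    size of 256 is an assumption, not a value stated in this appendix.
    """
    peak_lr = base_lr * batch_size / reference_batch_size
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warm-up
    return peak_lr

# Example: the global batch size of 1024 used for pre-training.
for step in (0, 2500, 5000, 100_000):
    print(step, round(pretrain_learning_rate(step, batch_size=1024), 4))
```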
Additional Results
Here we highlight the results equivalent to Figure 3, with the only difference that we consider subsets of the JFT [55] dataset instead of ImageNet-21k (Figure 9).
Figure 9: (Top) Reduction (in %) in classification error relative to the classification error of the model trained for 112k steps on 1M examples (bottom-left corner), as a function of training iterations and training set size. The results are for ResNet-101x3 trained on JFT subsets. (Bottom) Relative reduction (in %) in classification error when going from ResNet-50 to ResNet-101x3, as a function of training steps and training set size (JFT subsets). Each column corresponds to one OOD benchmark (ImageNet-C, ImageNet-V2, ObjectNet, ImageNet-Vid, YouTube-BB, ImageNet-Vid-W, YouTube-BB-W). The reduction generally increases with the training set size and longer training.
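For concreteness, the value in each cell can be read as the relative error reduction (this is the standard definition; the caption does not spell out the formula):
\[
\text{reduction} \,=\, 100 \cdot \frac{\mathrm{err}_{\mathrm{ref}} - \mathrm{err}}{\mathrm{err}_{\mathrm{ref}}} \; [\%],
\]
where err_ref is the error of the reference model (the model trained for 112k steps on 1M examples in the top panel; ResNet-50 in the bottom panel) and err is the error of the model in the given cell.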
C Effect of the testing resolution
Cropping details
Before applying the respective model, we first resize every image, preserving the aspect ratio, so that the shorter side is a fixed constant factor larger than r, and then take a central crop of size r × r. For the widely used 224 × 224 testing resolution, this leads to the standard single-crop testing preprocessing, where the images are first resized such that the shorter side has length 256.
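As an illustration of this evaluation preprocessing (a minimal sketch, not the authors' pipeline), the snippet below resizes the shorter side and takes a central r × r crop; the 256/224 scale factor is assumed from the standard single-crop setting mentioned above.

```python
from PIL import Image

def central_crop_preprocess(image: Image.Image, r: int) -> Image.Image:
    """Resize so the shorter side is (256/224)*r, keeping the aspect ratio,
    then take a central r x r crop.

    The 256/224 factor is an assumption based on the standard single-crop
    setting (shorter side 256 for a 224 x 224 crop).
    """
    scale = (256 / 224) * r / min(image.size)
    width, height = (round(scale * s) for s in image.size)
    image = image.resize((width, height))
    left, top = (width - r) // 2, (height - r) // 2
    return image.crop((left, top, left + r, top + r))

# Example usage (hypothetical file path):
# crop = central_crop_preprocess(Image.open("example.jpg"), r=224)
```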
Training details for FixRes
For fine-tuning to the target resolution (FixRes) we use SGD with momentum 0.9 and a fixed initial learning rate (with a different value for the BiT models), and account for the varying batch size by scaling the learning rate with the batch size. We train for 15 000 steps (scaled with the batch size), decaying the learning rate by a fixed factor at two fixed fractions of the iterations. The batch size is chosen based on the model size to avoid memory overflow. We train on a Cloud TPUv3-64. We emphasize that we did not extensively tune the training parameters for FixRes, but chose a setting that works well across models and data sets.

Additional results
In Figure 10 we provide an extended version of Figure 4 that shows the effect of FixRes for all datasets and models. In Figure 11 we plot the performance of all models and their FixRes variants as a function of the resolution.
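Both comparisons boil down to sweeping the evaluation resolution and keeping the best-performing one. A minimal sketch of such a sweep is shown below; evaluate_accuracy is a placeholder for evaluating a given (possibly FixRes-fine-tuned) model at resolution r, and the candidate resolutions are illustrative, since the exact set is not listed here.

```python
from typing import Callable, Iterable, Tuple

def best_resolution(evaluate_accuracy: Callable[[int], float],
                    candidate_resolutions: Iterable[int]) -> Tuple[int, float]:
    """Evaluate the model at each candidate resolution and return the best one."""
    scores = {r: evaluate_accuracy(r) for r in candidate_resolutions}
    best_r = max(scores, key=scores.get)
    return best_r, scores[best_r]

# Illustrative usage with a dummy evaluation function that peaks at r = 384.
if __name__ == "__main__":
    dummy_eval = lambda r: 1.0 - abs(r - 384) / 1000.0
    print(best_resolution(dummy_eval, [224, 288, 384, 512]))
```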
Figure 10: Comparison of different types of evaluation preprocessing and resolutions. Default: the accuracy obtained for the preprocessing and resolution proposed by the authors of the respective models. Best: the accuracy when selecting the best resolution from a fixed set of eight candidate resolutions. FixRes: applying FixRes for the same set of resolutions and selecting the best resolution. Increasing the evaluation resolution and additionally using FixRes helps across a large range of models and pre-training datasets. (a) Different BiT variants; -M- stands for ResNet-101x3, while -S- stands for ResNet-50x1, and INet is a shorthand for ImageNet. (b) Two ImageNet-trained EfficientNet variants (B0, B5), as well as those models trained using the Noisy Student protocol. (c) SimCLR models that have been fine-tuned on ImageNet. (d) Two VIVI variants (R50x1 and R50x3), both co-trained with ImageNet.
Figure 11: Comparison of different types of evaluation preprocessing and resolutions (accuracy as a function of the evaluation resolution on ImageNet and the OOD test sets), without modifying the model and after applying FixRes. For brevity, the same shorthands are used in the model names as in Figure 10.
D Synthetic dataset
In order to measure how model performance changes as object position, size, and orientation change, we constructed a synthetic dataset. The dataset consists of objects pasted on relatively uncluttered backgrounds. We show a few examples in Figure 5 (left) in the main paper and here in Figure 12. The objects were extracted from OpenImages [38] using the provided segmentation masks. As we are investigating models trained or fine-tuned on ImageNet, we only used classes that could be mapped to ImageNet. We also removed all objects that are tagged as occluded or truncated, and manually removed highly incomplete or inaccurately labeled objects. We converged to 614 object instances across 62 classes. The backgrounds were images from nature taken from pexels.com (the license therein allows one to reuse photos with modifications). We manually filtered the backgrounds to remove ones with prominent objects, such as images focused on a single animal or person. We collected 867 such backgrounds.
F.O.V. | Dataset configuration | Images
Size | Objects in the center and upright, sizes ranging from 1% to 100% of the image area in 1% increments. | 92 884
Location | Objects upright. Sizes are 20% of the image area. We do a grid search of locations, dividing the x-coordinate and y-coordinate dimensions into 20 equal parts each, for a total of 400 coordinate locations. | 479 184
Rotation | Objects in the center, sizes equal to 20%, 50%, 80% or 100% of the image size. Rotation angles ranging from 1 to 341 degrees counterclockwise in 20-degree increments. | 39 540

Table 1: Synthetic dataset details. The first column shows the relevant factor of variation (F.O.V.). When there are multiple values for multiple factors of variation, we generate the full cross product of images.

Figure 12: Sample images from our synthetic dataset. We consider 614 foreground objects from 62 classes and 867 backgrounds and vary the object location, rotation angle, and object size, for a total of 611 608 images.
We constructed three subsets for evaluation, one corresponding to each factor of variation we wanted to investigate, as shown in Table 1. In particular, for each object instance, we sample two backgrounds, and for each of these object-background combinations, we take a cross product over all the factors of variation. For the datasets with multiple values for more than one factor of variation, we take a cross product of all the values for each factor of variation in the set (object size, rotation, location). For example, for the rotation angle dataset, there are four object sizes and 18 rotation angles, so we take a cross product and have 72 factor-of-variation combinations. For the object size and rotation datasets, we only consider images where objects are at least 95% in the image. For the location dataset, such filtering removes almost all images where objects are near the edges of the image, so in the main paper we do not do such filtering. Note that since we use the central coordinates of objects as their location, at least 25% of each object is in the image even if we do not do any filtering. We present results filtering out objects that are less than 50% or 75% in the image in this section, in Figures 16 and 17 respectively.
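To make the cross-product construction concrete, the sketch below enumerates the factor combinations of the rotation subset (object sizes × rotation angles) for a single object-background pair; the values come from Table 1, while the function and data structures are illustrative rather than the authors' generation code.

```python
from itertools import product

# Factors of variation for the rotation subset (Table 1).
object_sizes = (0.2, 0.5, 0.8, 1.0)      # fraction of the image size
rotation_angles = range(1, 342, 20)      # 1, 21, ..., 341 degrees (18 angles)

def rotation_subset_configs(object_id: int, background_id: int):
    """Yield one configuration per (size, angle) combination: 4 x 18 = 72 per pair."""
    for size, angle in product(object_sizes, rotation_angles):
        yield {"object": object_id, "background": background_id,
               "size": size, "rotation_deg": angle}

configs = list(rotation_subset_configs(object_id=0, background_id=0))
print(len(configs))  # 72 factor-of-variation combinations
```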
Figure 13: Relative performance improvement as a function of the object area (in % of the image) and the pre-training dataset size, for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first row of both plots we show the ratio of the accuracy and the best accuracy (across all areas). For the second row (model trained on 2.6M instances), and the other rows, we compute the same normalized score and visualize the difference with the first row. Larger differences imply a more uniform behavior across relative object areas. We observe that, as the dataset size increases, the average prediction accuracy across various object areas becomes more uniform. The effect is more pronounced for the larger model. As expected, the improvement is most pronounced for small objects covering only a small fraction of the pixels.
Figure 14: Relative performance improvement as a function of the object rotation (in degrees) and the pre-training dataset size, for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first row of both plots we show the ratio of the accuracy and the best accuracy (across all rotations). For the second row (model trained on 2.6M instances), and the other rows, we compute the same normalized score and visualize the difference with the first row. Larger differences imply a more uniform behavior across object rotations. We observe that, as the dataset size increases, the average prediction accuracy across various rotation angles becomes more uniform. The effect is more pronounced for the larger model.
Figure 15: Improvement across object locations without filtering partially visible objects (Filter=0%), for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue vs. white). Then, we consider models trained with more data (2.6M, 5.2M, 9.0M, and 13.0M examples), compute the same normalized score, and plot the difference with respect to the first column (the 1.3M reference). We observe that, as dataset size increases, sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue vs. white on the second and third columns). The effect is more pronounced for the larger model.
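As a concrete reading of this normalization (a sketch of the described computation, not the authors' plotting code): for each training-set size we divide the per-location accuracy by its 95th percentile and subtract the normalized map of the 1.3M reference model.

```python
import numpy as np

def normalized_location_map(accuracy_grid: np.ndarray) -> np.ndarray:
    """Normalize a grid of per-location accuracies by its 95th percentile."""
    return accuracy_grid / np.percentile(accuracy_grid, 95)

# Dummy 20x20 accuracy grids for the 1.3M reference and a 13M-example model.
rng = np.random.default_rng(0)
reference = rng.uniform(0.4, 0.8, size=(20, 20))
larger = rng.uniform(0.5, 0.9, size=(20, 20))

# Positive entries mark locations whose relative accuracy improved with more data.
improvement = normalized_location_map(larger) - normalized_location_map(reference)
print(float(improvement.mean()))
```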
Figure 16: Improvement across object locations when filtering partially visible objects (Filter=50%), for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue vs. white). Then, we consider models trained with more data, compute the same normalized score, and plot the difference with respect to the first column. We observe that, as dataset size increases, sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue vs. white on the second and third columns). The effect is more pronounced for the larger model. We filter out all test images for which the foreground object is not at least 50% within the image.
Figure 17: Improvement across object locations when filtering partially visible objects (Filter=75%), for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue vs. white). Then, we consider models trained with more data, compute the same normalized score, and plot the difference with respect to the first column. We observe that, as dataset size increases, sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue vs. white on the second and third columns). The effect is more pronounced for the larger model. We filter out all test images for which the foreground object is not at least 75% within the image.

Overview of model abbreviations

Model name | Type | Training data | Architecture | Depth | Ch.
R-ImageNet-100 | Supervised | ImageNet | ResNet | 50 | 1
R-ImageNet-10 | Supervised | ImageNet, 10% | ResNet | 50 | 1
BiT-ImageNet-R50x1 | Supervised [34] | ImageNet | ResNet | 50 | 1
BiT-ImageNet-R50x3 | Supervised [34] | ImageNet | ResNet | 50 | 3
BiT-ImageNet-R101x1 | Supervised [34] | ImageNet | ResNet | 101 | 1
BiT-ImageNet-R101x3 | Supervised [34] | ImageNet | ResNet | 101 | 3
BiT-ImageNet21k-R50x1 | Supervised [34] | ImageNet-21k | ResNet | 50 | 1
BiT-ImageNet21k-R50x3 | Supervised [34] | ImageNet-21k | ResNet | 50 | 3
BiT-ImageNet21k-R101x1 | Supervised [34] | ImageNet-21k | ResNet | 101 | 1
BiT-ImageNet21k-R101x3 | Supervised [34] | ImageNet-21k | ResNet | 101 | 3
BiT-JFT-R50x1 | Supervised [34] | JFT | ResNet | 50 | 1
BiT-JFT-R50x3 | Supervised [34] | JFT | ResNet | 50 | 3
BiT-JFT-R101x1 | Supervised [34] | JFT | ResNet | 101 | 1
BiT-JFT-R101x3 | Supervised [34] | JFT | ResNet | 101 | 3
BiT-JFT-R50x3 | Supervised [34] | JFT | ResNet | 50 | 3
R-ImageNet-10-exemplar | Self-sup. & cotraining [68] | ImageNet, 10% | ResNet | 50 | 1
R-ImageNet-10-rotation | Self-sup. & cotraining [68] | ImageNet, 10% | ResNet | 50 | 1
R-ImageNet-100-exemplar | Self-sup. & cotraining [68] | ImageNet | ResNet | 50 | 1
R-ImageNet-100-rotation | Self-sup. & cotraining [68] | ImageNet | ResNet | 50 | 1
SimCLR-1x-self-supervised | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 1
SimCLR-2x-self-supervised | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 2
SimCLR-4x-self-supervised | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 4
SimCLR-1x-fine-tuned-10 | Self-supervised [6], fine-tuning | ImageNet, 10% | ResNet | 50 | 1
SimCLR-2x-fine-tuned-10 | Self-supervised [6], fine-tuning | ImageNet, 10% | ResNet | 50 | 2
SimCLR-4x-fine-tuned-10 | Self-supervised [6], fine-tuning | ImageNet, 10% | ResNet | 50 | 3
SimCLR-1x-fine-tuned-100 | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 1
SimCLR-2x-fine-tuned-100 | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 2
SimCLR-4x-fine-tuned-100 | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 4
EfficientNet-std-B | Supervised [58] | ImageNet | EfficientNet | 18 | 1
EfficientNet-std-B | Supervised [58] | ImageNet | EfficientNet | 37 | 1
EfficientNet-adv-prop-B | Supervised & adversarial [66] | ImageNet | EfficientNet | 18 | 1
EfficientNet-adv-prop-B | Supervised & adversarial [66] | ImageNet | EfficientNet | 37 | 1
EfficientNet-adv-prop-B | Supervised & adversarial [66] | ImageNet | EfficientNet | 64 | 2
EfficientNet-noisy-student-B | Supervised & distillation [67] | ImageNet | EfficientNet | 18 | 1
EfficientNet-noisy-student-B | Supervised & distillation [67] | ImageNet | EfficientNet | 37 | 1
EfficientNet-noisy-student-B | Supervised & distillation [67] | ImageNet | EfficientNet | 64 | 2
VIVI-1x | Self-sup. & cotraining [63] | YT8M, ImageNet | ResNet | 50 | 1
VIVI-3x | Self-sup. & cotraining [63] | YT8M, ImageNet | ResNet | 50 | 3
BigBiGAN-linear | Bidirectional adversarial [14] | ImageNet | ResNet | 50 | 1
BigBiGAN-finetune | Bidirectional adversarial [14] | ImageNet | ResNet | 50 | 1
Table 2: Overview of models used in this study. Sup. abbreviates supervised pre-training. Ch. refers to the width multiplier for the number of channels.