On Robustness and Transferability of Convolutional Neural Networks
Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D'Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, Mario Lucic
Google Research (Brain Team)
Abstract
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we revisit the out-of-distribution and transfer performance of modern image classification CNNs and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improves the distributional shift robustness. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing, such as modifying the image resolution, can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset we use for a systematic analysis across common factors of variation.
Deep convolutional networks have attained impressive results across a plethora of visual classification benchmarks [34, 58] where the training and testing distributions match. In the real world, however, the conditions in which the models are deployed can often differ significantly from the conditions in which the model was trained. It is imperative to understand the impact dataset shifts [48] have on the performance of these models. This problem has gained a lot of traction, and several systematic investigations have shown unexpectedly high sensitivity of image classifiers along various dimensions, including photometric perturbations [25], natural perturbations obtained from video data [52], as well as model-specific adversarial perturbations [22].

The problem of dataset shift, or out-of-distribution (OOD) generalization, is closely related to a learning paradigm known as transfer learning [54, §13]. In transfer learning we are interested in constructing models that can improve their performance on some target task by leveraging data from different related problems. In contrast, under dataset shift one assumes that there are two environments, namely training and testing [54], with the constraint that the model cannot be adapted using data from the target environment. As a consequence, the two environments typically have to be more similar and their differences more structured than in the transfer setting (c.f. Section 2.1).

In this work we evaluate the most successful recent recipes for transfer learning (model and data scale) on the problem of OOD generalization on the most prominent recent datasets, neural architectures, and training regimes, and find that increasing model and data scale is surprisingly effective at mitigating the effects of different types of dataset shift.

∗ Shared first authorship. Correspondence to {josipd,jessicayung,lucic}@google.com.

Contributions
We systematically investigate the classification accuracy of image classification models on the training distribution, their generalization to OOD data (without adaptation), and their transfer learning performance with adaptation in the low-data regime. Specifically, we present: (i) A meta-analysis of existing OOD metrics and transfer learning benchmarks across a wide variety of models, ranging from self-supervised to fully supervised models with up to 900M parameters, showing that most of the variance contained in the various metrics is explained by the ImageNet validation set accuracy. However, increasing the model and data scale disproportionately improves transfer performance, despite providing only marginal improvements in performance on the ImageNet validation set. (ii) Focusing on OOD robustness, we analyze the effects of the training set size, model scale, training regime, and testing resolution, and conclude that the effect of scale overshadows all other dimensions. (iii) We introduce a novel dataset for a fine-grained OOD analysis that quantifies the robustness to common factors of variation: object size, object location, and object orientation (rotation angle). In a systematic study we show that the models become less sensitive (and hence more robust) to each of these factors of variation as the dataset size and model size increase.

Understanding and correcting for dataset shifts are classical problems in statistics and machine learning, and have as such received substantial attention, see e.g. the monograph [48]. Formally, let us denote the observed variable by X and the variable we want to predict by Y. A dataset shift occurs when we train on samples from P_train(X, Y), but are at test time evaluated under a different distribution P_test(X, Y). Storkey [54] discusses and precisely defines different possibilities for how P_train and P_test can differ. We are mostly interested in covariate shifts, i.e., when the conditionals P_train(Y | X) = P_test(Y | X) agree, but the marginals P_train(X) and P_test(X) differ. Most robustness datasets proposed in the literature targeting ImageNet models are such instances: the images X come from a source P_test(X) different from the original collection process P_train(X), but the label semantics are not supposed to change. As a robustness score one typically uses the expected accuracy, i.e., P_test(Y = f(X)), where f(X) is the class predicted by the model.

Robustness datasets and dataset shift types

ImageNet-V2 [50] is a newly collected test set mimicking the original ImageNet validation set. The authors attempted to replicate the data collection process, but found that all models drop significantly in accuracy. Recent work attributes this drop to statistical bias in the data collection [17]. ImageNet-C and ImageNet-P [25] are obtained by corrupting the ImageNet validation set with classical corruptions, such as blur, different types of noise and compression, and further cropping the images to 224 × 224. These datasets define a total of 15 noise, blur, weather, and digital corruption types, each appearing at 5 severity levels or intensities. ObjectNet [3] presents a new test set of images collected directly using crowd-sourcing. ObjectNet is particular as the objects are captured at unusual poses in cluttered, natural scenes, which can severely degrade recognition performance. Given this clutter, and arguably better suitability as a detection than a recognition task [5], Y | X might be hard to define and the dataset goes beyond a covariate shift. In contrast, the ImageNet-A dataset [28] consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Hence, in addition to the covariate shift due to the data source, this dataset is not model-agnostic and exhibits a strong selection bias [54].

In an attempt to focus on naturally occurring invariances, [52] turned to videos and annotated two datasets, namely ImageNet-Vid-Robust and YouTube-BB-Robust, derived from the ImageNet-Vid [11] and YouTube-BB [49] video datasets, respectively. In addition to measuring accuracy over the frames, the available temporal structure allows for more fine-grained robustness metrics. In [52] the authors suggest the following pm-k metric: given an anchor frame and up to k frames before and after it, a prediction is marked as correct only if the classifier correctly classifies all 2k + 1 frames around the anchor. We present the details of each dataset in Appendix A.
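A minimal sketch of how such a pm-k accuracy can be computed from per-frame predictions (this is an illustration, not the evaluation code of [52]); it assumes each clip has a single ground-truth label and a designated anchor-frame index:

```python
from typing import Sequence


def pm_k_accuracy(frame_preds: Sequence[Sequence[int]],
                  labels: Sequence[int],
                  anchors: Sequence[int],
                  k: int) -> float:
    """pm-k accuracy: a clip counts as correct only if the classifier is right
    on the anchor frame and on every available frame within k frames of it."""
    correct = 0
    for preds, label, anchor in zip(frame_preds, labels, anchors):
        lo, hi = max(0, anchor - k), min(len(preds), anchor + k + 1)
        # Correct only if every frame in the window is classified correctly.
        correct += all(p == label for p in preds[lo:hi])
    return correct / len(labels)
```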
In transfer learning [46], a model might leverage the data it has seen on a related distribution, P_pre-train, to achieve better performance on a new task P_train. Note that, in contrast to the covariate shift setting, the disparity between P_pre-train and the new task is typically larger, but one is additionally given samples from P_train. While there exist many approaches for how to transfer knowledge to the new task, the most common approach in modern deep learning, which we use, is to (i) train a model on P_pre-train (using perhaps an auxiliary, self-supervised task [15, 21]), and then (ii) train a model on P_train by initializing the model weights from the model trained in the first step.

Recently, a suite of datasets has been collected to benchmark modern image classification transfer techniques [69]. The Visual Task Adaptation Benchmark (VTAB) defines 19 datasets with 1000 labeled samples each, categorized into three groups of natural, specialized, and structured datasets: natural (most similar to ImageNet) consists of standard natural classification tasks (e.g. CIFAR, VGG Flowers); specialized contains medical and satellite images; and structured (least similar to ImageNet) consists mostly of synthetic tasks that require understanding of the geometric layout of scenes. We compute an overall transfer score as the mean across all 19 datasets, as well as scores for each subgroup of tasks. We provide details for all of the tasks in Appendix A.

While many robustness metrics have been proposed to capture different sources of brittleness, it is not well understood how these metrics relate to each other. We investigate (i) the amount of complementary information in these metrics, and (ii) their usefulness in guiding design choices. Further, despite the close relationship between the notions of robustness and transferability, there has been no analysis of how predictive of each other their corresponding metrics are. To analyze these questions, we evaluated 39 different models over 23 robustness metrics and the 19 transfer tasks.

Figure 1: Correlation and informativeness of robustness metrics. Most metrics correlate strongly with ImageNet accuracy and provide little additional discriminability. (Left) Spearman's correlation between metrics. (Right) Difference in accuracy of a logistic classifier trained to discriminate between model types based on ImageNet accuracy plus one additional metric, compared to a classifier trained only on ImageNet accuracy (higher is better, top 10 metrics shown). Bars show mean ± s.d. of 1000 bootstrap samples from the 39 models.
Metrics
We consider metrics that quantify both robustness and transfer performance. For robustness, we measure the model accuracy on the ImageNet, ImageNet-V2 (matched frequency variant), and ObjectNet datasets. We also consider video datasets, ImageNet-Vid and YouTube-BB; we use both the accuracy metric and the pm-k metric (suffix -W). On ImageNet-C we report the AlexNet-accuracy-weighted [37] accuracy over all corruption types (called mean corruption error in [25]). To evaluate the transferability of the models, we use the VTAB-1K benchmark introduced in Section 2.2. We evaluate average transfer performance across all 19 datasets, with 1000 examples each, as well as per-group performance. To transfer a model we performed a sweep over two learning rates and schedules. We report the median testing accuracy over three fine-tuning runs with parameters selected using an 800-200 example train-validation split.
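As a sketch of this selection protocol (the sweep values and the `finetune_fn`/`eval_fn` callables are placeholders, not the exact configuration used here):

```python
import statistics
from typing import Callable, Sequence


def vtab_transfer_score(finetune_fn: Callable[..., object],
                        eval_fn: Callable[[object, str], float],
                        learning_rates: Sequence[float] = (0.01, 0.001),
                        schedules: Sequence[str] = ("short", "long"),
                        num_runs: int = 3) -> float:
    """Sweep learning rates and schedules, pick the best configuration on an
    800/200 train/validation split of the 1000 examples, then report the
    median test accuracy over `num_runs` fine-tuning runs."""
    best_cfg, best_val = None, float("-inf")
    for lr in learning_rates:
        for schedule in schedules:
            model = finetune_fn(split="train800", lr=lr, schedule=schedule)
            val_acc = eval_fn(model, "val200")
            if val_acc > best_val:
                best_cfg, best_val = (lr, schedule), val_acc
    lr, schedule = best_cfg
    # Re-run the selected configuration and report the median test accuracy.
    test_accs = [eval_fn(finetune_fn(split="train1000", lr=lr, schedule=schedule), "test")
                 for _ in range(num_runs)]
    return statistics.median(test_accs)
```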
Models

We consider several model families, some of which make use of additional data besides ImageNet. We evaluate ResNet-50 [23] and six EfficientNet (B0 through B5) models [58], including variants using AutoAugment [10] and AdvProp [66], which have been trained on ImageNet. We include self-supervised SimCLR [6] (three variants: linear classifier on top of the representation (lin), fine-tuned on 10% (ft-10), and on 100% (ft-100) of the ImageNet data), as well as self-supervised-semi-supervised (S4L) [68] models that have been fine-tuned on 10% and 100% of the ImageNet data. We also consider a set of models that incorporate other data sources. Specifically, we test three NoisyStudent [67] variants which use ImageNet and unlabelled data from the JFT dataset, BiT (BigTransfer) [34] models that have been first trained on ImageNet, ImageNet-21k, or JFT and then transferred to ImageNet by fine-tuning, and the Video-Induced Visual Invariance (VIVI) model [63], which uses ImageNet and unlabelled videos from the YT8M dataset [1]. Finally, we consider the BigBiGAN [14] model, which has been first trained as a class-conditional generative model and then fine-tuned to an ImageNet classifier. All model details can be found in Appendix E.

Figure 2: The relationship between transfer learning, ImageNet, and robustness performance. (Left) Average score on all transfer benchmarks versus ImageNet performance. (Center) Average score on all robustness benchmarks versus average transfer performance. (Right) Correlation between different groups of transfer datasets (natural, specialized, structured) and robustness metrics.
How well does ImageNet accuracy predict performance on OOD data?

We start with an analysis of the mutual dependence of the robustness metrics by measuring Spearman's rank correlation coefficient ρ. Figure 1 (left) shows the rank correlation between the metrics. We observe that all metrics are highly correlated with each other, and that they also strongly correlate with the accuracy on the ImageNet validation set. To understand the benefit of these metrics beyond ImageNet accuracy, we fit linear regression models for each metric with ImageNet accuracy as the single covariate. Consistent with the rank correlation analysis, we find that 75.2% of the variance in the metric values is explained by ImageNet accuracy. A principal component analysis shows that the space of robustness metric residuals spans approximately one statistically significant dimension (Appendix A). This raises the question to what degree the robustness metrics provide useful information beyond standard ImageNet accuracy, which we investigate next.
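The following sketch illustrates the two computations (rank correlation, and variance explained by ImageNet accuracy); the metric values here are synthetic placeholders, not the actual measurements:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
imagenet_acc = rng.uniform(0.70, 0.88, size=39)           # 39 models (placeholder values)
metric = 0.8 * imagenet_acc + 0.05 * rng.normal(size=39)  # one robustness metric

# Rank correlation between the metric and ImageNet accuracy.
rho, _ = spearmanr(imagenet_acc, metric)

# Fraction of the metric's variance explained by ImageNet accuracy alone.
reg = LinearRegression().fit(imagenet_acc[:, None], metric)
r2 = reg.score(imagenet_acc[:, None], metric)
residuals = metric - reg.predict(imagenet_acc[:, None])   # input to the residual PCA (Appendix A)
print(f"Spearman rho = {rho:.2f}, variance explained = {r2:.1%}")
```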
Can robustness metrics discriminate between models?

The goal of a metric is to discriminate between different models and thus guide design choices. We therefore quantify the usefulness of each metric in terms of how much it improves the discriminability between the various models beyond the information provided by ImageNet accuracy. Specifically, we train logistic regression classifiers to discriminate between the 12 model groups outlined above. We compared the performance of a classifier using only ImageNet accuracy as input feature to a classifier using ImageNet accuracy and up to two of the other metrics; see Fig. 1 (right) and Appendix A. We found that most of the tested metrics provide little increase in model discriminability over ImageNet accuracy. Of course, this result is conditioned on the size and composition of our dataset, and may differ for a different set of models. However, based on our dataset of 39 models in 12 groups, the most informative metrics are those based on different datasets and/or video, rather than ImageNet-derived datasets.

How related are OOD robustness and transfer metrics?
Next, we turn to transfer learning. It has been observed that better ImageNet models transfer better [35, 69]. Since robustness is correlated with ImageNet accuracy (Figure 1), we might expect a similar relationship. To get an overall view, we compute the mean of all robustness metrics and compare it to transfer performance. Figure 2 (center) shows this average robustness plotted against transfer performance, while Figure 2 (left) shows transfer versus ImageNet accuracy. Indeed, we observe a large correlation (ρ = 0.73) between robustness metrics and transfer; however, the correlation is not stronger than between transfer and ImageNet accuracy. Further, we compute the correlation of the residual robustness score (mean robustness minus ImageNet accuracy) against the transfer score, and find only a weak relationship. This indicates that robustness metrics, on aggregate, do not provide additional signal that predicts model transferability beyond that of the base ImageNet performance. We do, however, see some interesting differences in the relative performances of different model groups. Certain model groups, while attaining reasonable ImageNet/robustness scores, transfer less well to VTAB. Therefore, there are factors that influence transferability unrelated to robust inference. One example is batch normalization, which is outperformed by group normalization with weight standardization in transfer [34]. Next, we break down the correlation by robustness metrics and transfer datasets in Fig. 2 (right). We see that each metric correlates similarly with the task groups. However, for the groups that require more distant transfer (Specialized, Structured), no metric predicts transferability well. Curiously, raw ImageNet accuracy is the best predictor of transfer to structured tasks, indicating that robustness metrics do not relate to challenging transfer tasks, at least not more than raw ImageNet accuracy.
Figure 3: (Top) Reduction (in %) in classification error relative to the classification error of the model trained for 112k steps on 1M examples (bottom left corner), as a function of training iterations and training set size. The results are for ResNet-101x3 trained on ImageNet-21k subsets. (Bottom) Relative reduction (in %) in classification error going from ResNet-50 to ResNet-101x3 as a function of training steps and training set size (ImageNet-21k subsets). The reduction generally increases with the training set size and longer training.

Summary
We have seen that many popular robustness metrics are highly correlated. Some metrics, particularly those not based on ImageNet, have only a little additional discriminative power to distinguish models over ImageNet accuracy. Transferability is also related to ImageNet accuracy, and hence to robustness. We observe that while there is correlation, transfer highlights failures that are somewhat independent of robustness. Further, no particular robustness metric appears to correlate better with any particular group of transfer tasks than ImageNet accuracy does. Since all of these metrics seem closely linked, we investigate strategies known to be effective for ImageNet and transfer learning on the newer robustness benchmarks.
Increasing the scale of pre-training data, model architecture, and training steps has recently led to diminishing improvements in terms of ImageNet accuracy. By contrast, it has been recently established that scaling along these axes can lead to substantial improvements in transfer learning performance [34, 58]. In the context of robustness, this type of scaling has been explored less. While there are some results suggesting that scale improves robustness [25, 50, 67, 61], no principled study decoupling the different scale axes has been performed. Given the strong correlation between transfer performance and robustness, this motivates a systematic investigation of the effects of the pre-training data size, model architecture size, training steps, and input resolution.

We consider the standard ImageNet training setup [23] as a baseline, and scale up the training accordingly. To study the impact of dataset size, we consider the ImageNet-21k [11] and JFT [55] datasets for the experiments, as pre-training on either of them has shown great performance in transfer learning [34]. We scale from the ImageNet training set size (1.3M images) to the ImageNet-21k training set size (13M images, about 10 times larger than ImageNet). To explore the effect of the model size, we use a ResNet-50 as well as the deeper and 3× wider ResNet-101x3 model. We further investigate the impact of the training schedule, as larger datasets are known to benefit from longer training for transfer learning [34]. To disentangle the impact of dataset size and training schedules, we train the models for every pair of dataset size and schedule. We fine-tune the trained models to ImageNet using the BiT HyperRule [34], and assess their OOD generalization performance in the next section. Throughout, we report the reduction in classification error relative to the model which was trained on the smallest number of examples, for the fewest iterations, and hence achieves the lowest accuracy. Other details are presented in Appendix B.
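For reference, the reported quantity can be computed as follows (a minimal sketch):

```python
def relative_error_reduction(acc_model: float, acc_baseline: float) -> float:
    """Reduction (in %) of the classification error relative to the baseline
    model, e.g. the one trained on the fewest examples for the fewest steps."""
    err_model, err_base = 1.0 - acc_model, 1.0 - acc_baseline
    return 100.0 * (err_base - err_model) / err_base

# Example: baseline at 60% accuracy, scaled-up model at 70% accuracy
# -> (0.40 - 0.30) / 0.40 = 25% relative error reduction.
print(relative_error_reduction(0.70, 0.60))
```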
Figure 4: Comparison of different types of evaluation preprocessing and resolutions. Default: accuracy obtained for the preprocessing and resolution proposed by the authors of the respective models. Best: the accuracy when selecting the best resolution from a fixed set of eight candidate resolutions. FixRes: applying FixRes for the same set of resolutions and selecting the best resolution. Increasing the evaluation resolution and additionally using FixRes helps across a large range of models and pre-training datasets on ImageNet-A and ObjectNet.

Dataset size impact
The results for the ResNet-101x3 model are presented in Fig. 3. When trained on ImageNet-21k, the OOD classification error significantly decreases with increasing dataset size and training duration: we observe large relative error reductions when going from 112k steps on 1M data points to 1.12M steps on 13M data points. The reductions are least pronounced for YouTube-BB/YouTube-BB-W. Also note that training for 1.12M steps leads to a lower accuracy than training for only 457k steps unless the full ImageNet-21k dataset is used. For models trained on JFT we observe a similar behavior, except that training for 1.12M steps leads to a higher accuracy than training for 457k steps even when only 1M or 5M data points are used. The JFT results are presented in Appendix B. These results suggest that, if the models have enough capacity, increasing the amount of training data, with no additional changes, leads to massive gains on all datasets simultaneously, which is in line with recent results in transfer learning [34].

Model size impact
Figure 3 shows the relative reduction in classification error when using ResNet-101x3 instead of ResNet-50 as a function of the number of training steps and the dataset size. It can be seen that increasing the model size can lead to substantial reductions in error. For a fixed training duration, using more data always helps. However, on ImageNet-21k, training too long can lead to increases in the classification error when the model size is increased, unless the full ImageNet-21k dataset is used. This is likely due to overfitting. This effect is much less pronounced when JFT is used for training. JFT results are presented in Appendix B. Again, reductions in classification error are least pronounced for YouTube-BB/YouTube-BB-W.
During training, images are typically cropped randomly, with many crop sizes and aspect ratios, to prevent overfitting. In contrast, during testing, the images are usually rescaled such that the shorter side has a pre-specified length, and a fixed-size center crop is taken and then fed to the classifier. This leads to a mismatch in object sizes between training and testing. Increasing the resolution at which images are tested leads to an improvement in accuracy across different architectures [60, 61]. Furthermore, additional benefits can be obtained by applying FixRes, i.e., fine-tuning the network on the training set with the test-time preprocessing (omitting random cropping with aspect ratio changes) and at higher resolution. We explore the effect of this discrepancy on the robustness of different architectures. As some of the robustness datasets were collected in a different way from ImageNet, discrepancies in the cropping are likely. We investigate both adjusting the test-time resolution and applying FixRes. For FixRes, we use a simple setup with a single schedule and learning rate for all models (except using a smaller learning rate for the BiT models), and without the heavy color augmentation of [60] or the label smoothing of [61]. Furthermore, we did not extensively tune hyperparameters, but chose a setup that works reasonably well across architectures and datasets.
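A sketch of the test-resolution sweep described above (the candidate resolutions and the `evaluate_at_resolution` callable are illustrative placeholders, not the exact set used in our experiments; with FixRes, the model would first be fine-tuned using the test-time preprocessing at the candidate resolution):

```python
from typing import Callable, Sequence, Tuple


def best_test_resolution(evaluate_at_resolution: Callable[[int], float],
                         candidates: Sequence[int] = (224, 288, 320, 384, 448, 512),
                         ) -> Tuple[int, float]:
    """Evaluate the model with center-crop preprocessing at each candidate
    resolution and keep the one with the highest accuracy."""
    results = {res: evaluate_at_resolution(res) for res in candidates}
    best = max(results, key=results.get)
    return best, results[best]
```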
Results and discussion

Figure 4 shows the accuracy on ImageNet-A and ObjectNet at the testing resolution proposed by the authors of the respective architecture, along with the highest accuracy obtained by selecting the best testing resolution from a fixed set of candidate resolutions, and after applying FixRes. The results for the other datasets are deferred to Appendix C. We start by discussing observations that apply to most of the models, excluding the BiT models, which will be discussed below. While FixRes only leads to marginal benefits on ImageNet, it can lead to substantial improvements on the robustness metrics. Choosing the optimal testing resolution leads to a significant increase in accuracy on ImageNet-A and ObjectNet in most cases, and applying FixRes often leads to additional substantial gains.
For ObjectNet, fine-tuning with testing preprocessing (i.e. fine-tuning with central cropping instead of the random cropping used during training) even helps without increasing the resolution in some cases. Increasing the resolution and/or applying FixRes often slightly helps on ImageNet-V2. For ImageNet-C, the optimal testing resolution often corresponds to the resolution used for training, and applying FixRes rarely changes this picture. This is not surprising, as the ImageNet-C images are cropped to 224 pixels by default, and increasing the resolution does not add any new information to the image.

Figure 5: (Left) Sample images from our synthetic dataset. We consider 614 foreground objects from 62 classes and 867 backgrounds, and vary the object location, rotation angle, and object size for a total of 611,608 images. (Right) Improvement across object locations for ResNet-50 and ResNet-101x3 trained on ImageNet-21k subsets of 1.3M (reference), 5.2M, and 13M images. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue versus white). Then, we consider models trained with more data, compute the same normalized score, and plot the difference with respect to the first column. We observe that, as the dataset size increases, the sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue versus white in the second and third columns). The effect is more pronounced for the larger model. The full set of results is presented in Figure 15.
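The normalization used in the right panel of Figure 5 can be written compactly; a sketch, assuming the per-location accuracies are available as a 2-D array:

```python
import numpy as np


def normalized_location_scores(per_location_acc: np.ndarray) -> np.ndarray:
    """Normalize a grid of per-location accuracies by their 95th percentile,
    quantifying how much worse the weakest locations are."""
    return per_location_acc / np.percentile(per_location_acc, 95)


def improvement_over_reference(reference_acc: np.ndarray,
                               larger_data_acc: np.ndarray) -> np.ndarray:
    """Difference of normalized scores with respect to the reference model
    (trained on the smallest subset); positive values mean a location became
    relatively less of a weak spot as the training set grew."""
    return (normalized_location_scores(larger_data_acc)
            - normalized_location_scores(reference_acc))
```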
For the video-derived robustness datasets ImageNet-Vid-Robust and YouTube-BB-Robust, evaluating at a larger testing resolution and/or applying FixRes at a higher resolution can substantially improve the accuracy on the anchor frame and the robustness accuracy for small EfficientNet and ResNet models, but does not help the larger ones. For the BiT models, the resolution suggested by the authors is almost always optimal, except on ObjectNet and ImageNet-A, where changing the preprocessing considerably helps. FixRes arguably does not lead to improvements, as it was already applied in BiT as a part of the BiT HyperRule. Based on these results, we strongly suggest applying these adjustments to address the shift caused by resolution mismatch.

There are several factors of variation, such as object location, size, and rotation, that we want our models to be robust to. For a solid diagnostic of the failure modes, one should ideally be able to vary testing data along these axes. However, the combinatorial number of possible combinations of such factors of variation precludes any large-scale systematic data collection scheme. In this work we present a scalable alternative and construct a novel synthetic dataset for fine-grained evaluation. We paste objects extracted from OpenImages [38] using segmentation masks onto uncluttered backgrounds sourced from the web (Figure 5, details in Appendix D). We can thus conduct controlled studies by systematically varying the object class, size, location, and orientation (rotation angle). We study one factor of variation at a time (e.g. the location of the object center), and look at the average performance for each location over a uniform grid.

We investigate the effect of model and dataset size on these three factors of variation by evaluating the ResNet-50 and ResNet-101x3 models. We observe that the models become more invariant to the location (Figure 5), size (Figure 6), and rotation (Figure 6) of the objects as the model or training set size increases. The improvements are more pronounced for the larger ResNet-101x3 model. The analogous results on the JFT dataset are presented in Appendix D.
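A minimal sketch of this kind of compositing (using PIL; the scaling rule and argument names are illustrative assumptions, not the exact generation pipeline behind our dataset):

```python
from PIL import Image


def composite(foreground_rgba: Image.Image,
              background_rgb: Image.Image,
              rel_area: float, rel_x: float, rel_y: float,
              rotation_deg: float) -> Image.Image:
    """Paste a segmented foreground object onto a background at a controlled
    relative area, center location (in [0, 1] coordinates), and rotation."""
    bg = background_rgb.convert("RGB").copy()
    fg = foreground_rgba.convert("RGBA").rotate(rotation_deg, expand=True)
    # Rescale the object so that its bounding box covers roughly `rel_area`
    # of the background (a simplification of an exact mask-area target).
    target = (rel_area * bg.width * bg.height) ** 0.5
    scale = target / max(fg.width, fg.height)
    fg = fg.resize((max(1, int(fg.width * scale)), max(1, int(fg.height * scale))))
    # Place the object center at the requested relative location.
    cx, cy = int(rel_x * bg.width), int(rel_y * bg.height)
    bg.paste(fg, (cx - fg.width // 2, cy - fg.height // 2), mask=fg)
    return bg
```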
There has been a growing literature exploring the robustness of image classification networks. Early investigations in face and natural image recognition found that performance degrades when introducing blur, Gaussian noise, occlusion, and compression artifacts, but less so under color distortions [12, 33].
Figure 6: Sensitivity to object rotation (left, in degrees) and object size (right, area as a percentage of the image) for ResNet-50 and ResNet-101x3 trained on ImageNet-21k subsets. (Left) In the first row of both plots we show the ratio of the accuracy and the best accuracy (across all rotations). For the second row (model trained on 2.6M instances) and the other rows, we compute the same normalized score and visualize the difference with respect to the first row. Larger differences imply more uniform behavior across object rotations. We observe that, as the dataset size increases, the average prediction accuracy across various rotation angles becomes more uniform. The effect is more pronounced for the larger model. (Right) Similarly, the average accuracy across various object sizes becomes more uniform for both models. As expected, the improvement is most pronounced for small object sizes covering only a small fraction of the pixels. The full set of results is presented in Figures 13 and 14.

Subsequent studies have investigated brittleness to similar corruptions [51, 73], as well as to impulse noise [29], photometric perturbations [59], and small shifts and other transformations [2, 17, 71]. CNNs have also been shown to over-rely upon texture rather than shape to make predictions, in contrast to human behavior [20]. Robustness to adversarial attacks [22] is a related, but distinct problem, where performance under worst-case perturbations is studied. In this paper we did not study such adversarial robustness, but have focused on average-case robustness to natural perturbations.

Several techniques have been shown to improve model robustness on these datasets. Using better data augmentation can improve performance on data with synthetic noise [27, 41]. Auxiliary self-supervision [7, 68] can improve robustness to label noise and common corruptions [26]. Transductive fine-tuning using self-supervision on the test data improves performance under distribution shift [56]. Training with adversarial perturbations improves many robustness benchmarks if one uses separate Batch-Norm parameters for clean and adversarial data [66]. Finally, additional pre-training using very large auxiliary datasets has recently shown significant improvements in robustness: NoisyStudent [67] reports good performance on several robustness datasets, while Big Transfer (BiT) [34] reports strong performance on the recently introduced ObjectNet dataset [3].

Deep networks are often trained by pre-training the network on a different problem and then fine-tuning on the target task. This pre-training is often referred to as representation learning; representations can be trained using supervised [30, 34], weakly-supervised [42], or unsupervised data [13, 14, 63, 67]. Recent benchmarks have been proposed to evaluate transfer to several datasets, to assess generalization to tasks with different characteristics, or to tasks disjoint from the pre-training data [62, 69]. While state-of-the-art performance on many competitive datasets is attained via transfer learning [67, 34], the implications for final robustness metrics remain unclear. Creating synthetic datasets by inserting objects onto backgrounds has been used for training [72, 16] and evaluating models [34], but previous works either do not systematically vary object size, location, or orientation, or analyze translation and rotation robustness only at the image level [18].

We analyzed OOD generalization and transferability of image classifiers, and demonstrated that model and data scale, together with a simple training recipe, lead to large improvements. However, the models do exhibit a substantial gap in performance when tested on OOD data, and scale is unlikely to be the only approach needed to close this gap. Secondly, this approach hinges on the availability of curated datasets and significant computing capabilities, which is not always practical. Hence, we believe that transfer learning, i.e. train once, apply many times, is the most promising paradigm for OOD robustness in the short term.
One limitation of this study is that we consider image classification models fine-tuned to the ImageNet label space which were developed with the goal of optimizing the accuracy on the ImageNet test set. While existing work shows that the community has not simply overfit to ImageNet, it is possible that these models have correlated failure modes on datasets which share the biases with ImageNet [50]. This highlights the need for datasets which enable fine-grained analysis of all important factors of variation, and we hope that our dataset will be useful for researchers. Instead of requiring the model to work under various dataset shifts, one can ask an alternative question: assuming that the model will be deployed in an environment significantly different from the training one, can we at least quantify the model uncertainty for each prediction? This important property remains elusive even for moderate-scale neural networks [53], but could potentially be improved by considering larger models and larger pre-training datasets, which we leave for future work.

References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675, 2016.
[2] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations?
Journal of Machine Learning Research , 20, 2019.[3] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund,Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushingthe limits of object recognition models. In
Advances in Neural Information Processing Systems ,2019.[4] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, HeinrichKüttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXivpreprint arXiv:1612.03801 , 2016.[5] Ali Borji. Objectnet dataset: Reanalysis and correction. In arXiv 2004.02042 , 2020.[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple frameworkfor contrastive learning of visual representations. arXiv:2002.05709 , 2020.[7] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervisedGANs via auxiliary rotation loss. In
Conference on Computer Vision and Pattern Recognition ,2019.[8] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification:Benchmark and state of the art.
Proceedings of the IEEE , 2017.[9] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in thewild. In
IEEE Conference on Computer Vision and Pattern Recognition , 2014.[10] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:Learning augmentation strategies from data. In
Conference on Computer Vision and PatternRecognition , 2019.[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scalehierarchical image database. In
Conference on Computer Vision and Pattern Recognition , 2009.[12] Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks.In
International Conference on Quality of Multimedia Experience , 2016.[13] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learningby context prediction. In
International Conference on Computer Vision , 2015.[14] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In
Advancesin Neural Information Processing Systems , 2019.[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and ThomasBrox. Discriminative unsupervised feature learning with exemplar convolutional neural net-works.
IEEE transactions on pattern analysis and machine intelligence , 38(9), 2015.[16] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easysynthesis for instance detection. In
International Conference on Computer Vision, 2017.
[17] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying statistical bias in dataset replication. arXiv:2005.09619, 2020.
[18] Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations.
CoRR , abs/1712.02779, 2017.[19] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: Thekitti dataset.
International Journal of Robotics Research , 2013.[20] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann,and Wieland Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape biasimproves accuracy and robustness. In
International Conference on Learning Representations ,2019.[21] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning bypredicting image rotations. arXiv:1803.07728 , 2018.[22] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversar-ial examples. arXiv:1412.6572 , 2014.[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for imagerecognition. In
Conference on Computer Vision and Pattern Recognition , 2016.[24] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A noveldataset and deep learning benchmark for land use and land cover classification.
IEEE Journalof Selected Topics in Applied Earth Observations and Remote Sensing , 2019.[25] Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to commoncorruptions and surface variations. arXiv: 1807.01697 , 2018.[26] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervisedlearning can improve model robustness and uncertainty. In
Advances in Neural InformationProcessing Systems , 2019.[27] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshmi-narayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv:1912.02781 , 2019.[28] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Naturaladversarial examples. arXiv: 1907.07174 , 2019.[29] Hossein Hosseini, Baicen Xiao, and Radha Poovendran. Google’s cloud vision api is not robustto noise. In
International Conference on Machine Learning and Applications , 2017.[30] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transferlearning? arXiv:1608.08614 , 2016.[31] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick,and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementaryvisual reasoning. In
IEEE Conference on Computer Vision and Pattern Recognition , 2017.[32] Kaggle and EyePacs. Kaggle diabetic retinopathy detection, July 2015.[33] Samil Karahan, Merve Kilinc Yildirim, Kadir Kirtaç, Ferhat Sükrü Rende, Gultekin Butun, andHazim Kemal Ekenel. How image degradations affect deep CNN-based face recognition? In
International Conference of the Biometrics Special Interest Group , 2016.[34] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, SylvainGelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning.
EuropeanConference on Computer Vision , 2020.[35] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better?In
Conference on Computer Vision and Pattern Recognition , 2019.[36] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.[37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deepconvolutional neural networks. In
Advances in Neural Information Processing Systems, 2012.
[38] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2020.
[39] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In
IEEE Conference on Computer Vision and PatternRecognition , 2004.[40] Fei-Fei Li, Rob Fergus, and Pietro Perona. One-shot learning of object categories.
IEEETransactions on Pattern Analysis and Machine Intelligence , 2006.[41] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improvingrobustness without sacrificing accuracy with patch gaussian augmentation. arXiv:1906.02611 ,2019.[42] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, YixuanLi, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervisedpretraining. In
European Conference on Computer Vision , 2018.[43] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentangle-ment testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.[44] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng.Reading digits in natural images with unsupervised feature learning. In
NIPS Workshop onDeep Learning and Unsupervised Feature Learning 2011 , 2011.[45] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number ofclasses. In
Indian Conference on Computer Vision, Graphics and Image Processing , Dec 2008.[46] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.
IEEE Transactions onknowledge and data engineering , 22(10), 2009.[47] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In
IEEE Conferenceon Computer Vision and Pattern Recognition , 2012.[48] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence.
Dataset shift in machine learning . The MIT Press, 2009.[49] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video.In
Conference on Computer Vision and Pattern Recognition , 2017.[50] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenetclassifiers generalize to imagenet? arXiv: 1902.10811 , 2019.[51] Prasun Roy, Subhankar Ghosh, Saumik Bhattacharya, and Umapada Pal. Effects of degradationson deep neural network architectures. arXiv:1807.10108 , 2018.[52] Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and LudwigSchmidt. A systematic framework for natural perturbations from videos. arXiv:1906.02168 ,2019.[53] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin,D. Sculley, Joshua V. Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncer-tainty? evaluating predictive uncertainty under dataset shift. In
Advances in Neural InformationProcessing Systems , 2019.[54] Amos Storkey. When training and test sets are different: characterizing learning transfer.
Dataset shift in machine learning , 2009.[55] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonableeffectiveness of data in deep learning era. In
International Conference on Computer Vision ,2017.[56] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A Efros, and Moritz Hardt. Test-timetraining for out-of-distribution generalization. arXiv:1909.13231 , 2019.[57] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,and A. Rabinovich. Going deeper with convolutions. In
Conference on Computer Vision andPattern Recognition , 2015.[58] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neuralnetworks. arXiv:1905.11946 , 2019.[59] Dogancan Temel, Jinsol Lee, and Ghassan AlRegib. Cure-or: Challenging unreal and realenvironments for object recognition. In
International Conference on Machine Learning and Applications, 2018.
[60] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. In
Advances in Neural Information Processing Systems , 2019.[61] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-testresolution discrepancy: Fixefficientnet. arXiv:2003.08237 , 2020.[62] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin,Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset:A dataset of datasets for learning to learn from few examples. arXiv:1903.03096 , 2019.[63] Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, SylvainGelly, and Mario Lucic. Self-supervised learning of video-induced visual invariances. In
Conference on Computer Vision and Pattern Recognition , 2020.[64] Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotationequivariant cnns for digital pathology. In
International Conference on Medical Image Computingand Computer-Assisted Intervention , 2018.[65] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database:Large-scale scene recognition from abbey to zoo. In
IEEE Conference on Computer Vision andPattern Recognition , 2010.[66] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, and Quoc V Le. Adversarialexamples improve image recognition. arXiv:1911.09665 , 2019.[67] Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Self-training with noisy studentimproves imagenet classification. arXiv:1911.04252 , 2019.[68] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervisedsemi-supervised learning. In
International Conference on Computer Vision , 2019.[69] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, MarioLucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, LucasBeyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly,and Neil Houlsby. A large-scale study of representation learning with the visual task adaptationbenchmark. arXiv: 1910.04867 , 2019.[70] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyondempirical risk minimization. In
International Conference on Learning Representations , 2018.[71] Richard Zhang. Making convolutional networks shift-invariant again. In
International Confer-ence on Machine Learning , 2019.[72] Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau, and Stephen Lin. Distilling localization forself-supervised representation learning. arXiv: 2004.06638 , 2020.[73] Yiren Zhou, Sibo Song, and Ngai-Man Cheung. On classification of distorted images with deepconvolutional neural networks. In
International Conference on Acoustics, Speech and Signal Processing, 2017.
A Analysis of existing robustness and transfer metrics
Here, we provide additional details related to the analyses presented in Figure 1.
A.1 Dimensionality of the space of robustness metrics
To estimate how many different dimensions are measured by the robustness metrics beyond what is already explained by ImageNet accuracy, we proceeded as follows. For each of the robustness metrics shown in Figures 1 and 8, a linear regression was fit to predict that metric's value for the 39 models, using ImageNet accuracy as the sole predictor variable. Then, the residuals were computed for each metric by subtracting the linear regression prediction. The plot shows the fraction of variance explained by the first 4 principal components of the space of residuals of the robustness metrics. As a null hypothesis, we assumed that there is no correlation structure in the metric residuals. To construct corresponding null datasets, we randomly permuted the values for each metric independently, which destroys the correlation structure between metrics. Figure 7a shows that only the first principal component is significantly above the value expected under the null hypothesis.
(a) The space of robustness metrics.

Dataset              Instances          Classes
ImageNet [37]        50,000             1,000
ImageNet-A [28]      7,500              200
ImageNet-C [25]      15 × 5 × 50,000    1,000
ObjectNet [3]        18,574             113
ImageNet-V2 [50]     10,000             1,000
ImageNet-Vid [52]    22,179             293
YTBB-Robust [52]     51,826             229

(b) The name and reference, number of instances, and the number of classes overlapping with ImageNet for each dataset.
Figure 7: (Left) The space of robustness metrics spans approximately one statistically significant dimension after accounting for ImageNet accuracy. Error bars show 95% confidence intervals based on 1000 bootstrap samples (for the true data) or 1000 random permutations (for the null distribution). See Section A.1 for details. (Right) Details for the datasets used in this study. The datasets were used only for evaluation.
A.2 Informativeness of robustness metrics
To estimate how useful different combinations of robustness metrics are for discriminating between model types, we trained logistic regression classifiers to discriminate between the 12 model groups outlined in the main paper. We consider ImageNet accuracy as a baseline metric and therefore compare the performance of a classifier using only ImageNet accuracy as input feature to a classifier using ImageNet accuracy and either one (Figure 8, left) or two (Figure 8, right) additional metrics as input features. Figure 8 shows the difference in accuracy relative to the baseline (ImageNet-only) classifier. These results can serve practitioners with a limited budget as a rough guideline for which metric combinations are the most informative. In our experiments, the most informative combination of metrics in addition to ImageNet accuracy was ObjectNet and YouTube-BB, although other combinations performed similarly within the statistical uncertainty.
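A sketch of this discriminability comparison (placeholder features and group labels; for brevity it reports in-sample accuracy rather than cross-validated accuracy):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_models, n_groups = 39, 12
groups = rng.integers(0, n_groups, size=n_models)        # placeholder group labels
imagenet = rng.normal(size=(n_models, 1))                 # placeholder ImageNet accuracy
extra_metric = rng.normal(size=(n_models, 1))             # placeholder additional metric


def fit_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return clf.score(features, labels)


deltas = []
for _ in range(1000):  # bootstrap over models
    idx = rng.integers(0, n_models, size=n_models)
    base = fit_accuracy(imagenet[idx], groups[idx])
    both = fit_accuracy(np.hstack([imagenet, extra_metric])[idx], groups[idx])
    deltas.append(both - base)
print(np.mean(deltas), np.std(deltas))  # mean ± s.d. improvement over the baseline
```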
A.3 Visual Task Adaptation Benchmark
The Visual Task Adaptation Benchmark (VTAB) [69] contains 19 tasks. Either the full datasets or 1000-example training sets may be used; we use the version with 1000-example training sets (VTAB-1k). The tasks are divided into three groups: Natural, standard natural image classification problems; Specialized, domain-specific images captured with specialist equipment (e.g. medical images); and Structured, classification tasks that require geometric understanding of a scene. The Natural group contains the following datasets: Caltech101 [40], CIFAR-100 [36], DTD [9], Flowers102 [45], Pets [47], Sun397 [65], and SVHN [44]. The Specialized group contains remote sensing datasets, EuroSAT [24] and Resisc45 [8], and medical images, Patch Camelyon [64] and Diabetic Retinopathy [32]. The Structured group contains the following tasks: counting and distance prediction on CLEVR [31], pixel-location and orientation prediction on dSprites [43], camera elevation and object orientation on SmallNORB [39], object distance on DMLab [4], and vehicle distance on KITTI [19].
Figure 8: Informativeness of robustness metrics (related to Figure 1). (Left) Similar to Figure 1 (right), but showing all 23 robustness metrics: difference in accuracy of a logistic classifier trained to discriminate between model types based on ImageNet accuracy plus one additional metric, compared to a classifier trained only on ImageNet accuracy (higher is better). Bars show mean ± s.d. of 1000 bootstrap samples from the 39 models. (Right) Increase in classifier accuracy over ImageNet accuracy when including up to two robustness metrics as explanatory variables. The diagonal shows the single-feature values from (left).

B Scale and OOD generalization
Training Details
The models are first pre-trained on ImageNet-21k and JFT, followed by fine-tuning on ImageNet to match the label space for evaluation. We follow the pre-training and BiT-HyperRule fine-tuning setup proposed in [34]. Specifically, for pre-training, we use SGD with an initial learning rate of 0.1 and momentum 0.9. We use a linear learning rate warm-up for 5000 optimization steps and scale the learning rate with the batch size. We use a weight decay of 0.0001. We use the random image cropping technique from [57] and random horizontal mirroring, followed by resizing the image to the training resolution. We use a global batch size of 1024 and train on a Cloud TPUv3-128. We pre-train models for the cross product of the following combinations:

• Dataset size: {1.28M (1× ImageNet train set), 2.6M (2× ImageNet train set), 5.2M (4× ImageNet train set), 9M (7× ImageNet train set), 13M (10× ImageNet train set)}.
• Train schedule (steps): {113K (90 ImageNet epochs), 229K (180 ImageNet epochs), 457K (360 ImageNet epochs), 791K (630 ImageNet epochs), 1.1M (900 ImageNet epochs)}.

For fine-tuning, we use the BiT-HyperRule as described in [34]: batch size 512, learning rate 0.003, no weight decay, the classification head initialized to zeros, mixup [70], and fine-tuning for 20 000 steps at the resolution prescribed by the BiT-HyperRule. We present the results on the synthetic dataset in Appendix D.
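As a rough illustration of the pre-training learning-rate rule described above (a sketch, not the authors' training code): linear warm-up over the first 5000 steps towards a peak rate of 0.1 scaled with the batch size. The reference batch size of 256 used for the scaling is an assumption, since the exact scaling constant is not stated here.

```python
def pretrain_learning_rate(step: int, batch_size: int,
                           base_lr: float = 0.1,
                           warmup_steps: int = 5000,
                           reference_batch_size: int = 256) -> float:
    """Learning rate at a given step: linear warm-up, then a constant rate.

    The peak rate scales linearly with the batch size; the reference batch
    size of 256 is an assumption, not a value stated in this appendix.
    """
    peak_lr = base_lr * batch_size / reference_batch_size
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warm-up
    return peak_lr

# Example: the global batch size of 1024 used for pre-training.
for step in (0, 2500, 5000, 100_000):
    print(step, round(pretrain_learning_rate(step, batch_size=1024), 4))
```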
Additional Results
Here we highlight the results equivalent to Figure 3, with the only difference that we consider subsets of the JFT [55] dataset instead of ImageNet-21k (Figure 9).
Figure 9: (Top) Reduction (in %) in classification error relative to the classification error of the model trained for 112k steps on 1M examples (bottom-left corner), as a function of training iterations and training set size. The results are for ResNet-101x3 trained on JFT subsets. (Bottom) Relative reduction (in %) in classification error when going from ResNet-50 to ResNet-101x3, as a function of training steps and training set size (JFT subsets). Each column corresponds to one OOD benchmark (ImageNet-C, ImageNet-V2, ObjectNet, ImageNet-Vid, YouTube-BB, ImageNet-Vid-W, YouTube-BB-W). The reduction generally increases with the training set size and longer training.
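For concreteness, the value in each cell can be read as the relative error reduction (this is the standard definition; the caption does not spell out the formula):
\[
\text{reduction} \,=\, 100 \cdot \frac{\mathrm{err}_{\mathrm{ref}} - \mathrm{err}}{\mathrm{err}_{\mathrm{ref}}} \; [\%],
\]
where err_ref is the error of the reference model (the model trained for 112k steps on 1M examples in the top panel; ResNet-50 in the bottom panel) and err is the error of the model in the given cell.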
C Effect of the testing resolution
Cropping details
Before applying the respective model, we first resize every image, preserving the aspect ratio, so that the shorter side is a fixed constant factor larger than r, and then take a central crop of size r × r. For the widely used 224 × 224 testing resolution, this leads to the standard single-crop testing preprocessing, where the images are first resized such that the shorter side has length 256.
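As an illustration of this evaluation preprocessing (a minimal sketch, not the authors' pipeline), the snippet below resizes the shorter side and takes a central r × r crop; the 256/224 scale factor is assumed from the standard single-crop setting mentioned above.

```python
from PIL import Image

def central_crop_preprocess(image: Image.Image, r: int) -> Image.Image:
    """Resize so the shorter side is (256/224)*r, keeping the aspect ratio,
    then take a central r x r crop.

    The 256/224 factor is an assumption based on the standard single-crop
    setting (shorter side 256 for a 224 x 224 crop).
    """
    scale = (256 / 224) * r / min(image.size)
    width, height = (round(scale * s) for s in image.size)
    image = image.resize((width, height))
    left, top = (width - r) // 2, (height - r) // 2
    return image.crop((left, top, left + r, top + r))

# Example usage (hypothetical file path):
# crop = central_crop_preprocess(Image.open("example.jpg"), r=224)
```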
Training details for FixRes
For fine-tuning to the target resolution (FixRes) we use SGD with momentum 0.9 and a fixed initial learning rate (with a different value for the BiT models), and account for the varying batch size by scaling the learning rate with the batch size. We train for 15 000 steps (scaled with the batch size), decaying the learning rate by a fixed factor at two fixed fractions of the iterations. The batch size is chosen based on the model size to avoid memory overflow. We train on a Cloud TPUv3-64. We emphasize that we did not extensively tune the training parameters for FixRes, but chose a setting that works well across models and data sets.

Additional results
In Figure 10 we provide an extended version of Figure 4 that shows the effect of FixRes for all datasets and models. In Figure 11 we plot the performance of all models and their FixRes variants as a function of the resolution.
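Both comparisons boil down to sweeping the evaluation resolution and keeping the best-performing one. A minimal sketch of such a sweep is shown below; evaluate_accuracy is a placeholder for evaluating a given (possibly FixRes-fine-tuned) model at resolution r, and the candidate resolutions are illustrative, since the exact set is not listed here.

```python
from typing import Callable, Iterable, Tuple

def best_resolution(evaluate_accuracy: Callable[[int], float],
                    candidate_resolutions: Iterable[int]) -> Tuple[int, float]:
    """Evaluate the model at each candidate resolution and return the best one."""
    scores = {r: evaluate_accuracy(r) for r in candidate_resolutions}
    best_r = max(scores, key=scores.get)
    return best_r, scores[best_r]

# Illustrative usage with a dummy evaluation function that peaks at r = 384.
if __name__ == "__main__":
    dummy_eval = lambda r: 1.0 - abs(r - 384) / 1000.0
    print(best_resolution(dummy_eval, [224, 288, 384, 512]))
```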
Figure 10: Comparison of different types of evaluation preprocessing and resolutions. Default: the accuracy obtained for the preprocessing and resolution proposed by the authors of the respective models. Best: the accuracy when selecting the best resolution from a fixed set of eight candidate resolutions. FixRes: applying FixRes for the same set of resolutions and selecting the best resolution. Increasing the evaluation resolution and additionally using FixRes helps across a large range of models and pre-training datasets. (a) Different BiT variants; -M- stands for ResNet-101x3, while -S- stands for ResNet-50x1, and INet is a shorthand for ImageNet. (b) Two ImageNet-trained EfficientNet variants (B0, B5), as well as those models trained using the Noisy Student protocol. (c) SimCLR models that have been fine-tuned on ImageNet. (d) Two VIVI variants (R50x1 and R50x3), both co-trained with ImageNet.
Figure 11: Comparison of different types of evaluation preprocessing and resolutions (accuracy as a function of the evaluation resolution on ImageNet and the OOD test sets), without modifying the model and after applying FixRes. For brevity, the same shorthands are used in the model names as in Figure 10.
D Synthetic dataset
In order to measure how model performance changes as object position, size, and orientation change, we constructed a synthetic dataset. The dataset consists of objects pasted on relatively uncluttered backgrounds. We show a few examples in Figure 5 (left) in the main paper and here in Figure 12. The objects were extracted from OpenImages [38] using the provided segmentation masks. As we are investigating models trained or fine-tuned on ImageNet, we only used classes that could be mapped to ImageNet. We also removed all objects that are tagged as occluded or truncated, and manually removed highly incomplete or inaccurately labeled objects. We converged to 614 object instances across 62 classes. The backgrounds were images from nature taken from pexels.com (the license therein allows one to reuse photos with modifications). We manually filtered the backgrounds to remove ones with prominent objects, such as images focused on a single animal or person. We collected 867 such backgrounds.
F.O.V. | Dataset configuration | Images
Size | Objects in the center and upright, sizes ranging from 1% to 100% of the image area in 1% increments. | 92 884
Location | Objects upright. Sizes are 20% of the image area. We do a grid search of locations, dividing the x-coordinate and y-coordinate dimensions into 20 equal parts each, for a total of 400 coordinate locations. | 479 184
Rotation | Objects in the center, sizes equal to 20%, 50%, 80% or 100% of the image size. Rotation angles ranging from 1 to 341 degrees counterclockwise in 20-degree increments. | 39 540

Table 1: Synthetic dataset details. The first column shows the relevant factor of variation (F.O.V.). When there are multiple values for multiple factors of variation, we generate the full cross product of images.

Figure 12: Sample images from our synthetic dataset. We consider 614 foreground objects from 62 classes and 867 backgrounds and vary the object location, rotation angle, and object size, for a total of 611 608 images.
We constructed three subsets for evaluation, one corresponding to each factor of variation we wanted to investigate, as shown in Table 1. In particular, for each object instance, we sample two backgrounds, and for each of these object-background combinations, we take a cross product over all the factors of variation. For the datasets with multiple values for more than one factor of variation, we take a cross product of all the values for each factor of variation in the set (object size, rotation, location). For example, for the rotation angle dataset, there are four object sizes and 18 rotation angles, so we take a cross product and have 72 factor-of-variation combinations. For the object size and rotation datasets, we only consider images where objects are at least 95% in the image. For the location dataset, such filtering removes almost all images where objects are near the edges of the image, so in the main paper we do not do such filtering. Note that since we use the central coordinates of objects as their location, at least 25% of each object is in the image even if we do not do any filtering. We present results filtering out objects that are less than 50% or 75% in the image in this section, in Figures 16 and 17 respectively.
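To make the cross-product construction concrete, the sketch below enumerates the factor combinations of the rotation subset (object sizes × rotation angles) for a single object-background pair; the values come from Table 1, while the function and data structures are illustrative rather than the authors' generation code.

```python
from itertools import product

# Factors of variation for the rotation subset (Table 1).
object_sizes = (0.2, 0.5, 0.8, 1.0)      # fraction of the image size
rotation_angles = range(1, 342, 20)      # 1, 21, ..., 341 degrees (18 angles)

def rotation_subset_configs(object_id: int, background_id: int):
    """Yield one configuration per (size, angle) combination: 4 x 18 = 72 per pair."""
    for size, angle in product(object_sizes, rotation_angles):
        yield {"object": object_id, "background": background_id,
               "size": size, "rotation_deg": angle}

configs = list(rotation_subset_configs(object_id=0, background_id=0))
print(len(configs))  # 72 factor-of-variation combinations
```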
Figure 13: Relative performance improvement as a function of the object area (in % of the image) and the pre-training dataset size, for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first row of both plots we show the ratio of the accuracy and the best accuracy (across all areas). For the second row (model trained on 2.6M instances), and the other rows, we compute the same normalized score and visualize the difference with the first row. Larger differences imply a more uniform behavior across relative object areas. We observe that, as the dataset size increases, the average prediction accuracy across various object areas becomes more uniform. The effect is more pronounced for the larger model. As expected, the improvement is most pronounced for small objects covering only a small fraction of the pixels.
Figure 14: Relative performance improvement as a function of the object rotation (in degrees) and the pre-training dataset size, for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first row of both plots we show the ratio of the accuracy and the best accuracy (across all rotations). For the second row (model trained on 2.6M instances), and the other rows, we compute the same normalized score and visualize the difference with the first row. Larger differences imply a more uniform behavior across object rotations. We observe that, as the dataset size increases, the average prediction accuracy across various rotation angles becomes more uniform. The effect is more pronounced for the larger model.
Figure 15: Improvement across object locations without filtering partially visible objects (Filter=0%), for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue vs. white). Then, we consider models trained with more data (2.6M, 5.2M, 9.0M, and 13.0M examples), compute the same normalized score, and plot the difference with respect to the first column (the 1.3M reference). We observe that, as dataset size increases, sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue vs. white on the second and third columns). The effect is more pronounced for the larger model.
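As a concrete reading of this normalization (a sketch of the described computation, not the authors' plotting code): for each training-set size we divide the per-location accuracy by its 95th percentile and subtract the normalized map of the 1.3M reference model.

```python
import numpy as np

def normalized_location_map(accuracy_grid: np.ndarray) -> np.ndarray:
    """Normalize a grid of per-location accuracies by its 95th percentile."""
    return accuracy_grid / np.percentile(accuracy_grid, 95)

# Dummy 20x20 accuracy grids for the 1.3M reference and a 13M-example model.
rng = np.random.default_rng(0)
reference = rng.uniform(0.4, 0.8, size=(20, 20))
larger = rng.uniform(0.5, 0.9, size=(20, 20))

# Positive entries mark locations whose relative accuracy improved with more data.
improvement = normalized_location_map(larger) - normalized_location_map(reference)
print(float(improvement.mean()))
```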
Figure 16: Improvement across object locations when filtering partially visible objects (Filter=50%), for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue vs. white). Then, we consider models trained with more data, compute the same normalized score, and plot the difference with respect to the first column. We observe that, as dataset size increases, sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue vs. white on the second and third columns). The effect is more pronounced for the larger model. We filter out all test images for which the foreground object is not at least 50% within the image.
Figure 17: Improvement across object locations when filtering partially visible objects (Filter=75%), for ResNet-50 and ResNet-101x3 pre-trained on ImageNet-21k and JFT. In the first column, for each location on the grid, we compute the average accuracy. Then, we normalize each location by the 95th percentile across all locations, which quantifies the gap between the locations where the model performs well and the ones where it under-performs (first column, dark blue vs. white). Then, we consider models trained with more data, compute the same normalized score, and plot the difference with respect to the first column. We observe that, as dataset size increases, sensitivity to object location decreases: the outer regions improve in relative accuracy more than the inner ones (e.g. dark blue vs. white on the second and third columns). The effect is more pronounced for the larger model. We filter out all test images for which the foreground object is not at least 75% within the image.

Overview of model abbreviations

Model name | Type | Training data | Architecture | Depth | Ch.
R-ImageNet-100 | Supervised | ImageNet | ResNet | 50 | 1
R-ImageNet-10 | Supervised | ImageNet, 10% | ResNet | 50 | 1
BiT-ImageNet-R50x1 | Supervised [34] | ImageNet | ResNet | 50 | 1
BiT-ImageNet-R50x3 | Supervised [34] | ImageNet | ResNet | 50 | 3
BiT-ImageNet-R101x1 | Supervised [34] | ImageNet | ResNet | 101 | 1
BiT-ImageNet-R101x3 | Supervised [34] | ImageNet | ResNet | 101 | 3
BiT-ImageNet21k-R50x1 | Supervised [34] | ImageNet-21k | ResNet | 50 | 1
BiT-ImageNet21k-R50x3 | Supervised [34] | ImageNet-21k | ResNet | 50 | 3
BiT-ImageNet21k-R101x1 | Supervised [34] | ImageNet-21k | ResNet | 101 | 1
BiT-ImageNet21k-R101x3 | Supervised [34] | ImageNet-21k | ResNet | 101 | 3
BiT-JFT-R50x1 | Supervised [34] | JFT | ResNet | 50 | 1
BiT-JFT-R50x3 | Supervised [34] | JFT | ResNet | 50 | 3
BiT-JFT-R101x1 | Supervised [34] | JFT | ResNet | 101 | 1
BiT-JFT-R101x3 | Supervised [34] | JFT | ResNet | 101 | 3
BiT-JFT-R50x3 | Supervised [34] | JFT | ResNet | 50 | 3
R-ImageNet-10-exemplar | Self-sup. & cotraining [68] | ImageNet, 10% | ResNet | 50 | 1
R-ImageNet-10-rotation | Self-sup. & cotraining [68] | ImageNet, 10% | ResNet | 50 | 1
R-ImageNet-100-exemplar | Self-sup. & cotraining [68] | ImageNet | ResNet | 50 | 1
R-ImageNet-100-rotation | Self-sup. & cotraining [68] | ImageNet | ResNet | 50 | 1
SimCLR-1x-self-supervised | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 1
SimCLR-2x-self-supervised | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 2
SimCLR-4x-self-supervised | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 4
SimCLR-1x-fine-tuned-10 | Self-supervised [6], fine-tuning | ImageNet, 10% | ResNet | 50 | 1
SimCLR-2x-fine-tuned-10 | Self-supervised [6], fine-tuning | ImageNet, 10% | ResNet | 50 | 2
SimCLR-4x-fine-tuned-10 | Self-supervised [6], fine-tuning | ImageNet, 10% | ResNet | 50 | 3
SimCLR-1x-fine-tuned-100 | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 1
SimCLR-2x-fine-tuned-100 | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 2
SimCLR-4x-fine-tuned-100 | Self-supervised [6], fine-tuning | ImageNet | ResNet | 50 | 4
EfficientNet-std-B | Supervised [58] | ImageNet | EfficientNet | 18 | 1
EfficientNet-std-B | Supervised [58] | ImageNet | EfficientNet | 37 | 1
EfficientNet-adv-prop-B | Supervised & adversarial [66] | ImageNet | EfficientNet | 18 | 1
EfficientNet-adv-prop-B | Supervised & adversarial [66] | ImageNet | EfficientNet | 37 | 1
EfficientNet-adv-prop-B | Supervised & adversarial [66] | ImageNet | EfficientNet | 64 | 2
EfficientNet-noisy-student-B | Supervised & distillation [67] | ImageNet | EfficientNet | 18 | 1
EfficientNet-noisy-student-B | Supervised & distillation [67] | ImageNet | EfficientNet | 37 | 1
EfficientNet-noisy-student-B | Supervised & distillation [67] | ImageNet | EfficientNet | 64 | 2
VIVI-1x | Self-sup. & cotraining [63] | YT8M, ImageNet | ResNet | 50 | 1
VIVI-3x | Self-sup. & cotraining [63] | YT8M, ImageNet | ResNet | 50 | 3
BigBiGAN-linear | Bidirectional adversarial [14] | ImageNet | ResNet | 50 | 1
BigBiGAN-finetune | Bidirectional adversarial [14] | ImageNet | ResNet | 50 | 1
Table 2: Overview of models used in this study. Sup. abbreviates supervised pre-training. Ch. refers to the width multiplier for the number of channels.