Affinity and Diversity: Quantifying Mechanisms of Data Augmentation

Raphael Gontijo-Lopes∗ (Google Brain, Mountain View, CA 94043)
Sylvia J. Smullin∗ (Google, Mountain View, CA 94043)
Ekin D. Cubuk (Google Brain, Mountain View, CA 94043; [email protected])
Ethan Dyer (Google, Mountain View, CA 94043; [email protected])
Abstract
Though data augmentation has become a standard component of deep neural network training, the underlying mechanism behind the effectiveness of these techniques remains poorly understood. In practice, augmentation policies are often chosen using heuristics of either distribution shift or augmentation diversity. Inspired by these, we seek to quantify how data augmentation improves model generalization. To this end, we introduce interpretable and easy-to-compute measures: Affinity and Diversity. We find that augmentation performance is predicted not by either of these alone but by jointly optimizing the two.
[Figure 1 image: (a) Affinity vs. Diversity; (b) Model's View of Data]
Figure 1:
Affinity and Diversity parameterize the performance of a model trained with augmentation. (a) CIFAR-10: color shows the final test accuracy. The * marks the clean baseline. Each point represents a different augmentation that yields test accuracy greater than 88.7%. (b) Representation of how clean data and augmented data are related in the space of these two metrics. Higher diversity is represented by a larger bubble, while distributional similarity is depicted through the overlap of bubbles. Test accuracy generally improves to the upper right in this space. Adding real new data to the training set is expected to be in the far upper right corner.

∗ Equal contribution. Preprint. Under review.

Introduction
Models that achieve state-of-the-art performance in image classification often use heavy data augmentation strategies. The best techniques use various transforms applied sequentially and stochastically. Though the effectiveness of this is well-established, the mechanism through which these transformations work is not well-understood.

Since early uses of data augmentation, it has been assumed that augmentation works because it simulates realistic samples from the true data distribution: "[augmentation strategies are] reasonable since the transformed reference data is now extremely close to the original data. In this way, the amount of training data is effectively increased" [1]. Because of this, augmentations have often been designed with the heuristic of incurring minimal distribution shift from the training data.

This rationale does not explain why unrealistic distortions such as cutout [2], SpecAugment [3], and mixup [4] significantly improve generalization performance. Furthermore, methods do not always transfer across datasets:
Cutout, for example, is useful on CIFAR-10 but not on ImageNet [5]. Additionally, many augmentation policies heavily modify images by stochastically applying multiple transforms to a single image. Based on this observation, some have proposed that augmentation strategies are effective because they increase the diversity of images seen by the model.

In this complex landscape, claims about diversity and distributional similarity remain unverified heuristics. Without a more precise science of data augmentation, finding state-of-the-art strategies requires brute force that can cost thousands of GPU hours [6, 7]. This highlights a need to specify and measure the relationship between the original training data and the augmented dataset, as relevant to a given model's performance.

In this paper, we quantify these heuristics. Seeking to understand the mechanisms of augmentation, we focus on single transforms as a foundation. We present an extensive study of 204 different augmentations on CIFAR-10 and 223 on ImageNet, varying both broad transform families and finer transform parameters. Our contributions are:

1. We introduce Affinity and Diversity: interpretable, easy-to-compute metrics for parameterizing augmentation performance. Affinity quantifies how much an augmentation shifts the training data distribution from that learned by a model. Diversity quantifies the complexity of the augmented data with respect to the model and learning procedure.
2. We show that performance depends on both metrics. In the Affinity-Diversity plane, the best augmentation strategies jointly optimize the two (see Fig. 1).
3. We connect augmentation to other familiar forms of regularization, such as ℓ2 regularization and learning rate scheduling, observing common features of the dynamics: performance can be improved and training accelerated by turning off regularization at an appropriate time.
4. We show that performance is only improved when a transform increases the total number of unique training examples.
The utility of these new training examples is informed by the augmentation's Affinity and Diversity.

Related Work

Since early uses of data augmentation in training neural networks, there has been an assumption that effective transforms for data augmentation are those that produce images from an "overlapping but different" distribution [1, 8]. Indeed, elastic distortions as well as distortions in the scale, position, and orientation of training images have been used on MNIST [9–12], while horizontal flips, random crops, and random distortions to color channels have been used on CIFAR-10 and ImageNet [13–15]. For object detection and image segmentation, one can also use object-centric cropping [16] or cut-and-paste new objects [17–19].

In contrast, researchers have also successfully used more generic transformations that are less domain-specific, such as Gaussian noise [5, 20], input dropout [21], erasing random patches of the training samples during training [2, 3, 22], and adversarial noise [23]. Mixup [4] and Sample Pairing [24] are two augmentation methods that use convex combinations of training samples.

It is also possible to improve generalization by combining individual transformations. For example, reinforcement learning has been used to choose more optimal combinations of data augmentation transformations [6, 25]. Follow-up research has lowered the computational cost of such optimization by using population-based training [26], density matching [27], adversarial policy design that evolves throughout training [7], or a reduced search space [28]. Despite producing unrealistic outputs, such combinations of augmentations can be highly effective in different tasks [29–33].

Across these different examples, the role of distribution shift in training remains unclear. Lim et al. [27] and Hataya et al. [34] have found augmentation policies by minimizing the distance between the distributions of augmented data and clean data.
Recent work found that after training with augmented data, fine-tuning on clean training data can be beneficial [35], while Touvron et al. [36] found it beneficial to fine-tune with a test-set resolution that aligns with the training-set resolution.

The true input-space distribution from which a training dataset is drawn remains elusive. To better understand the effect of distribution shift on model performance, many works attempt to estimate it. Often these techniques require training secondary models, such as those based on variational methods [37–40]. Others have tried to augment the training set by modelling the data distribution directly [41]. Recent work has suggested that even unrealistic distribution modelling can be beneficial [42].

These methods try to specify the distribution separately from the model they are trying to optimize. As a result, they are insensitive to any interaction between the model and the data distribution. Instead, we are interested in a measure of how much the data shifts along directions that are most relevant to the model's performance.
Methods

We performed extensive experiments with various augmentations on CIFAR-10 and ImageNet. Experiments on CIFAR-10 used the WRN-28-2 model [14], trained for 78k steps with cosine learning rate decay. Results are the mean over 10 initializations, and reported errors (often too small to show on figures) are the standard error on the mean. Details on the error analysis are in Sec. C. Experiments on ImageNet used the ResNet-50 model [43], trained for 112.6k steps with a weight decay rate of 1e-4 and a learning rate of 0.2, which is decayed by a factor of 10 at epochs 30, 60, and 80.

Images were pre-processed by dividing each pixel value by 255 and normalizing by the dataset statistics. Random crop was also applied on all ImageNet models. These pre-processed data without further augmentation are "clean data", and a model trained on them is the "clean baseline". We followed the same implementation details as Cubuk et al. [6], including for most augmentation operations. Further implementation details are in Sec. A. For CIFAR-10, test accuracy on the clean baseline is . ± . and the validation accuracy is . ± . . On ImageNet, the test accuracy is 76.06%.

Unless specified otherwise, data augmentation was applied following standard practice: each time an image is drawn, the given augmentation is applied with a given probability. We call this mode dynamic augmentation. Due to stochasticity in the transform itself (such as randomly selecting the location of a crop) or in the policy (such as applying a flip only with 50% probability), the augmented image can be different each time. Thus, most of the tested augmentations increase the number of possible distinct images that can be shown during training.

We also performed select experiments using static training. In static augmentation, the augmentation policy (one or more transforms) is applied once to the entire clean training set. Static augmentation does not change the number of unique images in the dataset.
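The distinction between the two training modes can be sketched in a few lines. This is a hedged illustration only: the function names and the toy "jitter" transform are ours, not the paper's implementation.

```python
import random

def static_trainset(clean_set, augment, rng):
    """Static mode: apply the stochastic transform once, up front.  The
    number of unique training inputs stays equal to the clean set's size."""
    return [(augment(x, rng), y) for x, y in clean_set]

def dynamic_stream(clean_set, augment, steps, rng):
    """Dynamic mode: re-apply the transform each time an image is drawn,
    so the model can see a different version of an image on every draw."""
    for _ in range(steps):
        x, y = rng.choice(clean_set)
        yield augment(x, rng), y

# Toy data and a stochastic "jitter" transform (illustrative stand-ins).
clean = [(float(i), i % 2) for i in range(4)]
jitter = lambda x, rng: x + rng.random()

static = static_trainset(clean, jitter, random.Random(0))
dynamic = list(dynamic_stream(clean, jitter, 100, random.Random(1)))
print(len({x for x, _ in static}))    # 4 unique inputs, same as the clean set
print(len({x for x, _ in dynamic}))   # many more unique inputs
```

This mirrors the experimental setup: dynamic augmentation inflates the number of distinct inputs seen across epochs, while static augmentation only replaces each clean image with one fixed transformed version.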
Affinity

Thus far, heuristics of distribution shift have motivated the design of augmentation policies. Inspired by this focus, we introduce a simple metric to quantify how augmentation shifts data with respect to the decision boundary of the clean baseline model.

We start by noting that a trained model is often sensitive to the distribution of the training data. That is, model performance varies greatly between new samples from the true data distribution and samples from a shifted distribution. Importantly, the model's sensitivity to distribution shift is not purely a function of the input data distribution, since training dynamics and the model's implicit biases affect performance.

(Available at bit.ly/2v2FojN)
Definition 1.
Let D_train and D_val be training and validation datasets drawn IID from the same clean data distribution, and let D′_val be derived from D_val by applying a stochastic augmentation strategy a once to each image in D_val: D′_val = {(a(x), y) : (x, y) ∈ D_val}. Further, let m be a model trained on D_train, and let A(m, D) denote the model's accuracy when evaluated on dataset D. The Affinity, T[a; m; D_val], is given by

    T[a; m; D_val] = A(m, D′_val) − A(m, D_val).    (1)

With this definition, an Affinity of zero represents no shift, and a negative number suggests that the augmented data is out-of-distribution for the model.

[Figure 2 panels: (a) Affinity; (b) D_KL, each plotted against the X shift and Y shift of the data]

Figure 2:
Affinity is a model-sensitive measure of distribution shift. Contours indicate lines of equal (a) Affinity, or (b) KL divergence between the joint distribution of the original data and targets and the shifted data. The two axes indicate the actual shifts that define the augmentation. Affinity captures model-dependent features, such as the decision boundary.
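In practice, Eq. (1) needs only a clean-trained model, a held-out validation set, and one static pass of the augmentation. A minimal sketch follows; the threshold "model" and toy data are illustrative stand-ins, not the paper's setup.

```python
def accuracy(predict, dataset):
    """Fraction of (x, y) pairs that the model classifies correctly."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

def affinity(predict, val_set, augment):
    """Eq. (1): accuracy on a statically augmented copy of the validation
    set, minus accuracy on the clean validation set.  `predict` stands in
    for a model trained on clean data only."""
    augmented = [(augment(x), y) for x, y in val_set]
    return accuracy(predict, augmented) - accuracy(predict, val_set)

# Toy 1-D example: a threshold classifier with decision boundary at x = 0.
predict = lambda x: int(x > 0.0)
val_set = [(-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1)]
shift = lambda x: x - 0.6   # pushes one positive example across the boundary

print(affinity(predict, val_set, shift))         # -0.25: one of four labels flips
print(affinity(predict, val_set, lambda x: x))   # 0.0: identity transform, no shift
```

Note that no retraining is involved: the augmentation touches only the validation copy, matching the paper's point that Affinity is independent of any interaction between augmentation and the training process.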
In Fig. 2 we illustrate Affinity with a two-class classification task on a mixture of two Gaussians. Augmentation in this example comprises a shift of the means of the Gaussians of the validation data compared to those used for training. Under this shift, we calculate both the Affinity and the KL divergence of the shifted data with respect to the original data. Affinity changes only when the shift in the data is with respect to the model's decision boundary, whereas the KL divergence changes even when data is shifted in a direction that is irrelevant to the classification task. In this way, Affinity captures what is relevant to a model: shifts that impact predictions.

This same metric has been used as a measure of a model's robustness to image corruptions that do not change images' semantic content [20, 44–48]. Here, we turn this around and use it to quantify the shift of augmented data compared to clean data. Affinity has the following advantages as a metric:

1. It is easy to measure. It requires only clean training of the model in question.
2. It is independent of any confounding interaction between the data augmentation and the training process, since augmentation is only used on the validation set and applied statically.
3. It is a measure of distance sensitive to properties of both the data distribution and the model.

We gain confidence in this metric by comparing it to other potential model-dependent measures of distribution shift. We consider the mean log likelihood of augmented test images [49] and the Watanabe–Akaike information criterion (WAIC) [50]. These other metrics have high correlation with Affinity. Details can be found in Sec. F.
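The two-Gaussian example of Fig. 2 can be worked out in closed form. If the class means sit at (±1, 0) with isotropic noise, the clean model's decision boundary is x = 0, so only the x-component of a mean shift changes Affinity, while the KL divergence between shifted and original Gaussians depends only on the shift's magnitude. The specific means, σ, and function names below are our illustrative choices, not taken from the paper.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kl_shift(shift, sigma=1.0):
    """KL( N(mu + shift, sigma^2 I) || N(mu, sigma^2 I) ) = |shift|^2 / (2 sigma^2).
    Direction-blind: it only sees the size of the shift."""
    return (shift[0] ** 2 + shift[1] ** 2) / (2.0 * sigma ** 2)

def affinity_shift(shift, sigma=1.0):
    """Affinity (Eq. 1) for the clean boundary x = 0 and class means (+/-1, 0),
    when validation data from both classes is shifted by `shift`."""
    sx = shift[0]
    acc_shifted = 0.5 * (phi((1.0 + sx) / sigma) + phi((1.0 - sx) / sigma))
    acc_clean = phi(1.0 / sigma)
    return acc_shifted - acc_clean

# A shift along y (parallel to the boundary) vs. an equal-sized shift along x:
print(kl_shift((0.0, 2.0)), affinity_shift((0.0, 2.0)))  # KL = 2.0, Affinity = 0.0
print(kl_shift((2.0, 0.0)), affinity_shift((2.0, 0.0)))  # KL = 2.0, Affinity < 0
```

The two shifts are indistinguishable to KL divergence, yet only the shift across the decision boundary moves Affinity, which is exactly the contrast Fig. 2 draws.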
Diversity

Inspired by the observation that multi-factor augmentation policies such as FlipLR + Crop + Cutout and RandAugment [28] greatly improve performance, we propose another axis on which to view augmentation policies, which we dub Diversity. This measure is intended to quantify the intuition that augmentations prevent models from over-fitting by increasing the number of samples in the training set; the importance of this is shown in Sec. 4.3. Based on the intuition that more diverse data should be more difficult for a model to fit, we propose a model-based measure. The Diversity metric in this paper is the final training loss of a model trained with a given augmentation:
Definition 2.
Let a be an augmentation and D′_train be the augmented training data resulting from applying the augmentation a stochastically. Further, let L_train be the training loss for a model m trained on D′_train. We define the Diversity, D[a; m; D_train], as

    D[a; m; D_train] := E_{D′_train}[L_train].    (2)

Though determining the training loss requires the same amount of work as determining final test accuracy, here we focus on this metric as a tool for understanding. As with Affinity, this definition of Diversity has the advantage that it can capture model-dependent elements, i.e. it is informed by the class of priors implicit in choosing a model and optimization scheme, as well as by the stopping criterion used in training.

Another potential diversity measure is the entropy of the transformed data, D_Ent. This is inspired by the intuition that augmentations with more degrees of freedom perform better. For discrete transformations, we consider the conditional entropy of the augmented data,

    D_Ent := H(X′ | X) = −E_X [ Σ_{x′} p(x′ | X) log p(x′ | X) ],

where x ∈ X is a clean training image and x′ ∈ X′ is an augmented image. This measure has the property that it can be evaluated without any training or reference to model architecture. However, the appropriate entropy for continuously-varying transforms is less straightforward.

A third proxy for Diversity is the training time needed for a model to reach a given training accuracy threshold. In Sec. E, we show that these three metrics correlate well with each other. In the remaining sections we describe how the complementary metrics of Diversity and Affinity can be used to characterize and understand augmentation performance.

Despite the original inspiration to mimic realistic transformations and minimize distribution shift, many state-of-the-art augmentations yield unrealistic images.
This suggests that distribution shift alone does not fully describe or predict augmentation performance.
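Looking back at the entropy-based proxy D_Ent: for a discrete stochastic transform it can be estimated by sampling augmented versions of each image and averaging the per-image entropy. The sampling scheme and names below are ours, not the paper's implementation.

```python
import math
import random
from collections import Counter

def d_ent(clean_images, augment, n_draws=1000, seed=0):
    """Sampling estimate of D_Ent = H(X' | X) (in nats) for a discrete
    stochastic transform: the average, over clean images x, of the entropy
    of the distribution of augmented versions a(x)."""
    rng = random.Random(seed)
    total = 0.0
    for x in clean_images:
        counts = Counter(augment(x, rng) for _ in range(n_draws))
        total += -sum((c / n_draws) * math.log(c / n_draws)
                      for c in counts.values())
    return total / len(clean_images)

# A flip applied with 50% probability carries about log(2) nats of
# conditional entropy per image; applied with 100% probability it is
# deterministic and carries none.  (`-x` stands in for "the flipped image".)
flip_half = lambda x, rng: x if rng.random() < 0.5 else -x
flip_always = lambda x, rng: -x
print(d_ent([1.0, 2.0], flip_half))    # close to log(2) = 0.693 nats
print(d_ent([1.0, 2.0], flip_always))  # 0.0
```

This matches the text's observation that a deterministic transform adds no conditional entropy, which foreshadows why fully deterministic augmentations like FlipLR(100%) behave differently later in the paper.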
[Figure 3 panels: (a) test accuracy vs. Affinity and vs. Diversity, top: CIFAR-10, bottom: ImageNet; (b) CIFAR-10 and (c) ImageNet, test accuracy in the Affinity-Diversity plane]

Figure 3:
Augmentation performance is determined by both Affinity and Diversity. (a) Test accuracy plotted against each of Affinity and Diversity for the two datasets, showing that neither metric alone predicts performance. In the CIFAR-10 plots (top), blue highlights (also in inset) are the augmentations that increase test accuracy above the clean baseline. Dashed lines indicate the clean baseline. (b) and (c) show test accuracy on the color scale in the plane of Affinity and Diversity. The three star markers in (b) are (left to right) RandAugment, AutoAugment, and mixup. The * on the color bar indicates the clean baseline case. For fixed values of Affinity, test accuracy generally increases with higher values of Diversity. For fixed values of Diversity, test accuracy generally increases with higher values of Affinity. Note that the gains observed on ImageNet are expected to be small, in line with previous work on single-transformation policies [48].
Figure 3(a) (left) measures Affinity across 204 different augmentations for CIFAR-10 and 223 for ImageNet, respectively. We find that for the most important augmentations (those that help performance), Affinity is a poor predictor of accuracy. Furthermore, we find many successful augmentations with low Affinity. For example, Rotate(fixed, 45deg, 50%), Cutout(16), and combinations of FlipLR, Crop(32), and Cutout(16) all have Affinity < − and test accuracy > above the clean baseline on CIFAR-10. Augmentation details are in Sec. B.

As Affinity does not fully characterize the performance of an augmentation, we seek another metric. To assess the importance of an augmentation's complexity, we measure Diversity across the same set of augmentations. We find that Diversity is complementary in explaining how augmentations can increase test performance. As shown in Fig. 3(b) and (c), Affinity and Diversity together provide a much clearer parameterization of an augmentation policy's benefit to performance. For a fixed level of Diversity, augmentations with higher Affinity are consistently better. Similarly, for a fixed Affinity, it is generally better to have higher Diversity.

A simple case study is presented in Fig. 4, in which the probability of the transform Rotate(fixed, 60deg) is varied. The accuracy and Affinity are not monotonically related, with the peak accuracy falling at an intermediate value of Affinity. Similarly, accuracy is correlated with Diversity for low-probability transformations but does not track for higher probabilities. The optimal probability for Rotate(fixed, 60deg) lies at an intermediate value of Affinity and Diversity.
[Figure 4 panels: test accuracy vs. Affinity (left) and vs. Diversity (center) as the transform probability varies; test accuracy in the Affinity-Diversity plane (right)]

Figure 4:
Test accuracy varies differently than either Affinity or Diversity. Here, the probability of Rotate(fixed, 60deg) on CIFAR-10 is varied from 10% to 90%. Left: as probability increases, Affinity decreases linearly while the accuracy changes non-monotonically. Center: accuracy and Diversity vary differently from each other as probability is changed. Right: test accuracy is maximized at intermediate values.
To situate the tested augmentations (mostly single transforms) within the context of the state of the art, we tested three high-performance augmentations from the literature: mixup [4], AutoAugment [6], and RandAugment [28]. These are highlighted with star markers in Fig. 3(b).

More than either of the metrics alone, Affinity and Diversity together provide a useful parameterization of an augmentation's performance. We now turn to investigating the utility of this tool for explaining other observed phenomena of data augmentation.
The term "regularizer" is ill-defined in the literature, often referring to any technique used to reduce generalization error without necessarily reducing training error [51]. With this definition, it is widely acknowledged that commonly-used augmentations act as regularizers [52–54]. Though this is a broad definition, we notice another commonality across seemingly different kinds of regularizers: various regularization techniques yield boosts in performance (or at least no degradation) if the regularization is turned off at the right time during training. For instance:

1. Decaying a large learning rate on an appropriate schedule can be better than maintaining a large learning rate throughout training [14].
2. Turning off ℓ2 regularization at the right time in training does not hurt performance [55].
3. Relaxing architectural constraints mid-training can boost final performance [56].
4. Turning augmentations off and fine-tuning on clean data can improve final test accuracy [35].

To further study augmentation as a regularizer, we compare the constant augmentation case (with the same augmentation throughout) to the case where the augmentation is turned off partway through training and training is completed with clean data. For each transform, we test over a range of switch-off points and select the one that yields the best final validation or test accuracy on CIFAR-10 and ImageNet, respectively. The Switch-off Lift is the resulting increase in final test accuracy, compared to training with augmented data the entire time.
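The switch-off protocol can be summarized procedurally: one run per candidate switch-off point, each finishing on clean data, with the Switch-off Lift taken relative to augmenting for the whole run. A hedged sketch follows; the toy step function and the accuracy numbers are invented for illustration, not the paper's results.

```python
def train(total_steps, aug_off_step, step_fn, state=0.0):
    """Run `total_steps` of training; augmentation is applied only while
    step < aug_off_step, after which training finishes on clean data."""
    for step in range(total_steps):
        state = step_fn(state, augmented=(step < aug_off_step))
    return state

def switch_off_lift(final_acc_by_switch_point, full_aug_acc):
    """Best final accuracy over the tested switch-off points, minus the
    accuracy from training with augmented data the entire time."""
    return max(final_acc_by_switch_point.values()) - full_aug_acc

# Hypothetical final test accuracies for three switch-off points (in steps):
accs = {20_000: 0.900, 55_000: 0.931, 70_000: 0.912}
print(switch_off_lift(accs, full_aug_acc=0.895))  # about +0.036 lift
```

The important design point is that only the best switch-off point counts toward the Lift, so a positive value means there exists some schedule under which removing the augmentation helps.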
[Figure 5 panels: (a) Slingshot effect on CIFAR-10: validation accuracy vs. training steps for augmentation (Turn Aug Off / Baseline / Constant Aug), ℓ2 regularization, and learning rate schedules (Stepped LR / Constant LR); (b) Switch-off Lift on CIFAR-10: test accuracy with augmentation on vs. turned off; (c) Switch-off Lift in the Affinity-Diversity plane for CIFAR-10 (left) and ImageNet (right)]

Figure 5: (a) Switching off regularizers yields a performance boost: three examples of how turning off a regularizer increases the validation accuracy. This slingshot effect can speed up training and improve the best validation accuracy. Top: training with no augmentation (clean baseline), compared to constant augmentation, and augmentation that is turned off at 55k steps. Here, the augmentation is
Rotate(fixed, 20deg, 100%). Middle: baseline with constant ℓ2 regularization, compared to turning off ℓ2 regularization part way through training. Bottom: constant learning rate of 0.1 compared to training where the learning rate is decayed in one step by a factor of 10. (b) Bad augmentations can become helpful if switched off: colored lines connect the test accuracy with augmentation applied throughout training (top) to the test accuracy with switching mid-training. Color indicates the amount of Switch-off Lift; blue is positive and orange is negative. (c) Switch-off Lift varies with Affinity and Diversity. Where Switch-off Lift is negative, it is mapped to 0 on the color scale.

For some poor-performing augmentations, this gain can actually bring the final test accuracy above the baseline, as shown in Fig. 5(b). We additionally observe (Fig. 5(a)) that this test accuracy improvement can happen quite rapidly, both for augmentations and for the other regularizers tested. This suggests an opportunity to accelerate training without hurting performance by appropriately switching off regularization. We call this a slingshot effect.

Interestingly, we find that the best time for turning off an augmentation is not always close to the end of training, contrary to what is shown in He et al. [35]. For example, without switching,
FlipUD(100%) decreases test accuracy by almost 50% compared to the clean baseline. When the augmentation is used for only the first third of training, final test accuracy is above the baseline.

He et al. [35] hypothesized that the gain from turning augmentation off is due to recovery from a distribution shift. Indeed, for many detrimental transformations, the test accuracy gained by turning off the augmentation merely recovers the clean baseline performance. However, in Fig. 5(c), we see that for a given value of Affinity, the Switch-off Lift can vary. This result suggests that the Switch-off Lift derives from more than simply correction of a distribution shift.

A few of the tested augmentations, such as
FlipLR(100%), are fully deterministic. Thus, each time an image is drawn in training, it is augmented the same way. When such an augmentation is turned off partway through training, the model then sees images (the clean ones) that are now new. Indeed, when
FlipLR(100%) is switched off at the right time, its final test accuracy exceeds that of
FlipLR(50%) without switching. In this way, switching augmentation off may adjust for not only low Affinity but also low Diversity.
Most augmentations we tested, and those used in practice, have inherent stochasticity and thus may alter a given training image differently each time the image is drawn. In the typical dynamic training mode, these augmentations increase the number of unique inputs seen across training epochs.

To further study how augmentations act as regularizers, we seek to discriminate this increase in effective dataset size from other effects. We train models with static augmentation, as described in Sec. 3. This altered training set is used without further modification during training, so that the number of unique training inputs is the same between the augmented and the clean training settings.

For almost all tested augmentations, using static augmentation yields lower test accuracy than the clean baseline. Where static augmentation shows a gain (versions of crop), the difference is less than the standard error on the mean. As in the dynamic case, poorer performance in the static case is for transforms that have lower Affinity and lower Diversity.

Static augmentations also always perform worse than their non-deterministic, dynamic counterparts, as shown in Fig. 6. This may be because the Diversity of a static augmentation is always less than in the dynamic case (see also Sec. E). The decrease in Diversity in the static case suggests a connection between Diversity and the number of training examples.
[Figure 6 panels: relative test accuracy of static vs. dynamic augmentation (left); Diversity of static vs. dynamic augmentation (right)]

Figure 6:
Static augmentations decrease diversity and performance. CIFAR-10: static augmentation performance is less than the clean baseline, (0, 0), and less than the dynamic augmentation case. Augmentations with no stochasticity are excluded because they are trivially equal on the two axes (left). Diversity in the static case is less than in the dynamic case (right). The diagonal line indicates where the static and dynamic cases would be equal.

Together, these results point to the following conclusion:
Increased effective training set size is crucial to the performance benefit of data augmentation. An augmentation's Affinity and Diversity inform how useful the additional training examples are.
Discussion

In this work, we focused on single transforms in an attempt to understand the essential parts of augmentation in a controlled context. This builds a foundation for using these metrics to quantify and design more complex and powerful combinations of augmentations.

Though earlier work has often explicitly focused on just one of these metrics, chosen priors have implicitly ensured reasonable values for both. One way to achieve Diversity is to use combinations of many single augmentations, as in AutoAugment [6]. Because transforms and hyperparameters in Cubuk et al. [6] were chosen by optimizing performance on proxy tasks, the optimal policies include high and low Affinity transforms. Fast AutoAugment [27], CTAugment [29, 57], and differentiable RandAugment [28] all aim to increase Affinity by what Lim et al. [27] called "density matching". However, these methods use the search space of AutoAugment and thus inherit its Diversity.

On the other hand, Adversarial AutoAugment [7] focused on increasing Diversity by optimizing policies to increase the training loss. While this method did not explicitly aim to increase Affinity, it also used transforms and hyperparameters from the AutoAugment search space. Without such a prior, which includes useful Affinity, the goal of maximizing training loss with no other constraints would lead to data augmentation policies that erase all the information from the images.

Our results motivate casting an even wider net when searching for augmentation strategies. Firstly, our work suggests that explicitly optimizing along axes of both Affinity and Diversity yields better performance. Furthermore, we have seen that poor-performing augmentations can actually be helpful if turned off during training (Fig. 5). With the inclusion of scheduling in augmentation optimization, we expect there are opportunities for including a different set of augmentations in an ideal policy. Ho et al.
[26] observe trends in how the probability and magnitude of various transforms change during training for an optimized augmentation schedule. We suggest that with further study, Diversity and Affinity can provide priors for optimization of augmentation schedules.
Conclusion
We attempted to quantify the common intuition that more in-distribution and more diverse augmentation policies perform well. To this end, we introduced two easy-to-compute metrics, Affinity and Diversity, intended to measure to what extent a given augmentation is in-distribution and how complex the augmentation is to learn. Because they are model-dependent, these metrics capture the data shifts that affect model performance.

With these tools, we conducted a study over a large class of augmentations for CIFAR-10 and ImageNet and found that neither feature alone is a perfect predictor of performance. Rather, we presented evidence that Diversity and Affinity play dual roles in determining augmentation quality. Optimizing for either metric separately is sub-optimal, and the best augmentations balance the two. Additionally, we found that an increased number of training examples, connected to Diversity, is a necessary ingredient of beneficial augmentation.

Finally, we found that augmentations share an important feature with other regularizers: switching off regularization at the right time can improve performance. In some cases, this can cause an otherwise poorly-performing augmentation to be beneficial.

We hope our findings provide a foundation for continued scientific study of data augmentation.
Data augmentation has the potential to amplify bias
Data augmentation takes a smaller, potentially biased training set and recycles it as the basis of a larger augmented training program. A central finding of this work is that the success of an augmentation policy varies with the dual metrics of Affinity and Diversity; as Affinity is explicitly model-dependent, it depends on biases present in the model. This data reuse and the model-dependence of successful augmentation suggest the possibility that augmentation may amplify biases in the data or model, and this warrants future investigation.
Robust data augmentation can reduce social, environmental, and financial costs
At its best, data augmentation provides a means for less well-funded or data-rich practitioners to design performant models by supplementing a smaller training data set with additional transformed images. Commonly-used policies, however, such as those found by AutoAugment [6], have relied on expensive brute-force searches costing thousands of GPU-hours, replacing the need for extensive data collection with the need for financially and environmentally expensive compute. We hope that by understanding the mechanisms behind successful data augmentation we can design guided augmentation policies for new datasets and models, and mitigate the social and financial costs of data collection without undue compute expense.
Fundamental understanding facilitates impact assessment
More broadly, the central aim of this work is to better understand the elements driving successful augmentation policies. Truly understanding the conceptual mechanisms at play is crucial in making informed judgements about the impact of data augmentation.
Acknowledgements
The authors would like to thank Alex Alemi, Justin Gilmer, Guy Gur-Ari, Albin Jones, Behnam Neyshabur, Zan Armstrong, and Ben Poole for thoughtful discussions on this work.
References

[1] J. R. Bellegarda, P. V. de Souza, A. J. Nadas, D. Nahamoo, M. A. Picheny, and L. R. Bahl. Robust speaker adaptation using a piecewise linear acoustic mapping. In ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.
[2] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[3] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
[4] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[5] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
[6] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
[7] Xinyu Zhang, Qiang Wang, Jian Zhang, and Zhao Zhong. Adversarial autoaugment, 2019.
[8] Yoshua Bengio, Frédéric Bastien, Arnaud Bergeron, Nicolas Boulanger-Lewandowski, Thomas Breuel, Youssouf Chherawala, Moustapha Cisse, Myriam Côté, Dumitru Erhan, Jeremy Eustache, et al. Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 164–172, 2011.
[9] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012.
[10] Ikuro Sato, Hiroki Nishimura, and Kensuke Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
[11] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, 2003.
[12] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[14] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
[15] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[17] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1301–1310, 2017.
[18] Hao-Shu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yong-Lu Li, and Cewu Lu. Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE International Conference on Computer Vision, pages 682–691, 2019.
[19] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019.
[20] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
[21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[22] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
[23] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[24] Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
[25] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pages 3239–3249, 2017.
[26] Daniel Ho, Eric Liang, Ion Stoica, Pieter Abbeel, and Xi Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
[27] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. arXiv preprint arXiv:1905.00397, 2019.
[28] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
[29] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.
[30] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[31] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.
[32] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, and Quoc V Le. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665, 2019.
[33] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
[34] Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, and Hideki Nakayama. Faster autoaugment: Learning augmentation strategies using backpropagation, 2019.
[35] Zhuoxun He, Lingxi Xie, Xin Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Data augmentation revisited: Rethinking the distribution gap between clean and augmented data, 2019.
[36] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy, 2019.
[37] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
[38] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors, ICLR, 2014. URL http://dblp.uni-trier.de/db/conf/iclr/iclr2014.html.
[39] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization, 2016.
[40] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, Feb 2017. ISSN 1537-274X. doi: 10.1080/01621459.2017.1285773. URL http://dx.doi.org/10.1080/01621459.2017.1285773.
[41] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2794–2803, 2017.
[42] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[44] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[45] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In , pages 1–7. IEEE, 2017.
[46] Dan Hendrycks and Thomas G Dietterich. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.
[47] Amir Rosenfeld, Richard Zemel, and John K Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
[48] Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. arXiv preprint arXiv:1906.08988, 2019.
[49] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one, 2019.
[50] Sumio Watanabe. Asymptotic equivalence of bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res., 11:3571–3594, December 2010. ISSN 1532-4435.
[51] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[52] Alex Hernández-García and Peter König. Further advantages of data augmentation on convolutional neural networks. In International Conference on Artificial Neural Networks, pages 95–103. Springer, 2018.
[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[54] Tri Dao, Albert Gu, Alexander J Ratner, Virginia Smith, Christopher De Sa, and Christopher Ré. A kernel theory of modern data augmentation. Proceedings of Machine Learning Research, 97:1528, 2019.
[55] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence, 2019.
[56] Stéphane d'Ascoli, Levent Sagun, Joan Bruna, and Giulio Biroli. Finding the needle in the haystack with convolutions: on the benefits of architectural bias, 2019.
[57] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
Supplementary Material

A Training methods
CIFAR-10 models were trained using code based on the AutoAugment code, with the following choices:

1. Learning rate was decayed following a cosine decay schedule, starting with a value of 0.1.
2. 78050 training steps were used, with data shuffled after every epoch.
3. As implemented in the AutoAugment code, the WRN-28-2 model was used with stochastic gradient descent and momentum. The optimizer used cross-entropy loss with ℓ₂ weight decay of 0.0005.
4. Before selecting the validation set, the full training set was shuffled and balanced such that the subset selected for training was balanced across classes.
5. The validation set was the last 5000 samples of the shuffled CIFAR-10 training data.
6. Models were trained using Python 2.7 and TensorFlow 1.13.

A training time of 78k steps was chosen because it showed reasonable convergence with the standard data augmentation of FlipLR, Crop, and Cutout.
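The cosine learning-rate schedule and step count described above can be sketched as follows. This is a minimal sketch: decaying all the way to zero at the final step is an assumption, and only the initial rate (0.1) and step count (78050) are taken from the list above.

```python
import math

def cosine_lr(step, total_steps=78050, init_lr=0.1):
    """Cosine-decayed learning rate: starts at init_lr, decays to 0
    at total_steps (decay-to-zero endpoint is an assumption)."""
    return init_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

At step 0 this returns the initial rate of 0.1, and it falls to half that value at the midpoint of training.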
In the clean baseline case, test accuracy actually reached its peak much earlier than 78k steps.

With CIFAR-10, experiments were also performed for training dataset sizes of 1024, 4096, and 16384. At smaller dataset sizes, the impact of augmentation and the Switch-off Lift tended to be larger. These results are not shown in this paper.

ImageNet models were ResNet-50 trained using the Cloud TPU codebase. Models were trained for 112.6k steps with a weight decay rate of 1e-4, and a learning rate of 0.2, which was decayed by 10 at epochs 30, 60, and 80. Batch size was set to be 1024.

B Details of augmentation
B.1 CIFAR-10
On CIFAR-10, both color and affine transforms were tested, as given in the full results (see Sec. G). Most augmentations were as defined in Cubuk et al. [6]; additional conventions for augmentations as labeled in Fig. 7 are defined here. For Rotate, fixed means each augmented image was rotated by exactly the stated amount, with a randomly-chosen direction. Variable means an augmented image was rotated a random amount up to the given value in a randomly-chosen direction. Shear is defined similarly. Rotate(square) means that an image was rotated by an amount chosen randomly from [0°, 90°, 180°, 270°].

Crop included a padding before the random-location crop so that the final image remained 32 × 32 in size. The magnitude given for Crop is the number of pixels that were added in each dimension. The magnitude given in the label for Cutout is the size, in pixels, of each dimension of the square cutout.
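The Crop and Cutout conventions just described can be sketched as below. This is a hedged sketch, not the paper's implementation: the per-side padding interpretation, the random patch center, and the zero fill value (standing in for the constant gray fill) are assumptions.

```python
import numpy as np

def pad_and_crop(img, pad=4):
    """Pad `pad` pixels on each side, then crop a random window of the
    original size, so a 32x32 CIFAR-10 image stays 32x32."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    y = np.random.randint(0, 2 * pad + 1)
    x = np.random.randint(0, 2 * pad + 1)
    return padded[y:y + h, x:x + w]

def cutout(img, size=16):
    """Fill a size x size square, at a random center, with a constant value
    (0 here; the exact fill constant is an assumption)."""
    h, w = img.shape[:2]
    out = img.copy()
    cy, cx = np.random.randint(h), np.random.randint(w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out[y0:y1, x0:x1] = 0
    return out
```

Note that with a random center the cutout square may be clipped at the image border, so the erased area can be smaller than size × size.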
PatchGaussian was defined as in Lopes et al. [5], with the patch specified to be contained entirely within the image domain. In Fig. 7, it is labeled by two hyperparameters: the size of the square patch (in pixels) that was applied and σ_max, which is the maximum standard deviation of the noise that could be selected for any given patch. Here, "fixed" means the patch size was always the same.

Since FlipLR, Crop, and Cutout are part of standard pipelines for CIFAR-10, we tested combinations of the three augmentations (varying probabilities of each) as well as these three augmentations plus a single additional augmentation. As in standard processing of CIFAR-10 images, the first augmentation applied was anything that is not one of FlipLR, Crop, or Cutout. After that, augmentations were applied in the order Crop, then FlipLR, then Cutout.

Finally, we tested the CIFAR-10 AutoAugment policy [6], RandAugment [28], and mixup [4]. The hyperparameters for these augmentations followed the guidelines described in the respective papers.

(AutoAugment code available at github.com/tensorflow/models/tree/master/research/autoaugment; Cloud TPU codebase available at https://github.com/tensorflow/tpu/tree/master/models/official/resnet)
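A minimal sketch of PatchGaussian as summarized above: a square patch of additive Gaussian noise, placed fully inside the image, labeled by patch size and σ_max. Uniform sampling of the noise standard deviation in [0, σ_max] and clipping to a [0, 1] pixel range are assumptions; see Lopes et al. [5] for the reference definition.

```python
import numpy as np

def patch_gaussian(img, patch_size=16, sigma_max=1.0):
    """Add Gaussian noise inside a square patch contained entirely within
    the image. The noise std is drawn uniformly in [0, sigma_max] per image;
    pixels are assumed to be floats in [0, 1] (assumptions)."""
    h, w = img.shape[:2]
    sigma = np.random.uniform(0.0, sigma_max)
    # Top-left corner chosen so the patch stays inside the image domain.
    y = np.random.randint(0, h - patch_size + 1)
    x = np.random.randint(0, w - patch_size + 1)
    out = img.astype(np.float32).copy()
    noise = np.random.normal(0.0, sigma,
                             size=(patch_size, patch_size) + img.shape[2:])
    out[y:y + patch_size, x:x + patch_size] += noise
    return np.clip(out, 0.0, 1.0)
```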
[Figure 7 plot: tested CIFAR-10 augmentations on the Affinity–Diversity plane; the legend lists each transform with its hyperparameters and application probability.]
Figure 7: CIFAR-10: Labeled map of tested augmentations on the plane of Affinity and Diversity. Color distinguishes different hyperparameters for a given transform.
B.2 ImageNet
On ImageNet, we experimented with
PatchGaussian, Cutout, operations from the PIL imaging library, and techniques from the AutoAugment code, as described above for CIFAR-10. In addition to PatchGaussian(fixed), we also tested
PatchGaussian(variable), where the patch size was uniformly sampled up to a maximum size. The implementation here did not constrain the patch to be entirely contained within the image. Additionally, we experimented with SolarizeAdd. SolarizeAdd is similar to Solarize from the PIL library, but has an additional hyperparameter which determines how much value was added to each pixel that is below the threshold. Finally, we also experimented with Full Gaussian and Random Erasing on ImageNet. Full Gaussian adds Gaussian noise to the whole image. Random Erasing is similar to Cutout, but randomly samples the values of the pixels in the patch [22] (whereas Cutout sets them to a constant, gray pixel). These augmentations are labeled in Fig. 8.

We note that the gains on ImageNet are expected to be small. This is in-line with the magnitude of the gains observed by related works with single transformations [48]. While combinations of transformations can lead to bigger improvements [28], our focus is on understanding single augmentations as a foundation for future work on their combinations.

(PIL available at https://pillow.readthedocs.io/en/5.1.x/)
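The SolarizeAdd description above can be sketched as follows. This is a hedged sketch: the uint8 pixel range and the strict less-than threshold comparison are assumptions, and the default values are illustrative.

```python
import numpy as np

def solarize_add(img, addition=50, threshold=128):
    """Add `addition` (which may be negative) to every pixel below
    `threshold`, clipping to [0, 255]; pixels at or above the threshold
    are left unchanged. Assumes uint8 input."""
    added = np.clip(img.astype(np.int32) + addition, 0, 255).astype(np.uint8)
    return np.where(img < threshold, added, img)
```

Negative additions, as in the Solarize Add(-052, ...) entries of Fig. 8, darken the below-threshold pixels instead of brightening them.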
[Figure 8 point labels omitted: each marker corresponds to one tested transform and hyperparameter setting (AutoContrast, Brightness, Color, Contrast, Cutout, Equalize, FlipUD, FullGaussian, Invert, Patch Gaussian, Posterize, Random Erasing, Rotate, Sharpness, ShearX, Solarize, Solarize Add, TranslateX), plotted on the Affinity–Diversity plane.]
Figure 8: ImageNet: Labeled map of tested augmentations on the plane of Affinity and Diversity. Color distinguishes different hyperparameters for a given transform. Legend is below. Each augmentation was applied with a certain probability (given as a percentage in the label). Each time an image was pulled for training, the given image was augmented with that probability.
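The per-image application described in the caption can be sketched as follows; `maybe_augment` and its arguments are hypothetical names for illustration, not from the paper's code:

```python
import random

def maybe_augment(image, transform, probability=1.0):
    """Each time an image is drawn for training, apply the augmentation
    with the stated probability; otherwise return the image unchanged."""
    if random.random() < probability:
        return transform(image)
    return image

# With probability 100%, every draw of the image is augmented.
augmented = maybe_augment([1, 2, 3], lambda img: img[::-1], probability=1.0)
```

Because the draw is repeated every time the image is pulled, a probability below 100% yields a mixture of clean and augmented copies over the course of training.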
C Error analysis
All of the CIFAR-10 experiments were repeated with 10 different initializations. In most cases, the resulting standard error of the mean (SEM) is too small to show as error bars on plots. The error on each measurement is given in the full results (see Sec. G).
[Figure residue omitted: point labels for a map of the tested augmentations on the Typicality–Diversity plane.]
Affinity and Switch-off Lift were both computed from differences between runs that share the same initialization. For Affinity, the same trained model was used for inference on clean validation data and on augmented validation data. Thus, the variance of Affinity for the clean baseline is not independent of the variance of Affinity for a given augmentation. The difference between the augmentation case and the clean baseline case was taken on a per-experiment basis (for each initialization of the clean baseline model) before the error was computed.

In the switching experiments, the final training without augmentation was completed starting from a given checkpoint in the model that was trained with augmentation. Thus, each switching experiment shared an initialization with an experiment that had no switching. Again, in this case the difference was taken on a per-experiment basis before the error (based on the standard deviation) was computed.

All ImageNet experiments shown are with one initialization. Thus, there are no statistics from which to analyze the error.
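As a sketch of this paired analysis (the function name and example values are illustrative, not the paper's code), the difference is taken per shared initialization before the standard error is computed:

```python
import numpy as np

def paired_lift_and_sem(aug_metric, clean_metric):
    """Mean and standard error of the per-initialization difference.

    aug_metric[i] and clean_metric[i] must come from runs sharing
    initialization i, so differencing happens before error estimation.
    """
    diffs = np.asarray(aug_metric, float) - np.asarray(clean_metric, float)
    sem = diffs.std(ddof=1) / np.sqrt(len(diffs))
    return diffs.mean(), sem

# Hypothetical final accuracies from 10 paired initializations.
aug = [0.958, 0.959, 0.957, 0.960, 0.958, 0.959, 0.957, 0.958, 0.960, 0.959]
clean = [0.956, 0.957, 0.955, 0.957, 0.956, 0.957, 0.955, 0.956, 0.957, 0.956]
lift, err = paired_lift_and_sem(aug, clean)
```

Pairing matters because run-to-run variation shared by both conditions cancels in the difference, which is why the error on the difference can be much smaller than the error on either condition alone.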
D Switching off augmentations
For CIFAR-10, switching times were tested in increments of approximately 5k steps between ∼ k and ∼ k steps. The best point for switching was determined by the final validation accuracy. On ImageNet, we tested turning augmentation off at 50, 60, 70, and 80 epochs. Total training took 90 epochs. The best point for switching was determined by the final test accuracy.

The Switch-off Lift was derived from the experiment at the best switch-off point for each augmentation.

For CIFAR-10, there are some augmentations where the validation accuracy was best at 25k steps, which means that further testing is needed to determine whether the true optimal switch-off point is lower, or whether the best case is not to train at all with the given augmentation. Some of the best augmentations have a small negative Switch-off Lift, indicating that it is better to train the entire time with the given augmentations.

For each augmentation, the best time for switch-off is listed in the full results (see Sec. G).

E Diversity metrics
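The selection of the best switch-off point can be sketched as below; the sweep values are hypothetical and the helper name is ours, not the paper's:

```python
def best_switch_off(val_acc_by_switch_step, always_on_acc):
    """Choose the switch-off step with the highest final validation
    accuracy and report the Switch-off Lift relative to training with
    augmentation for the entire run."""
    best_step = max(val_acc_by_switch_step, key=val_acc_by_switch_step.get)
    lift = val_acc_by_switch_step[best_step] - always_on_acc
    return best_step, lift

# Hypothetical CIFAR-10 sweep in ~5k-step increments.
sweep = {25_000: 0.9552, 30_000: 0.9561, 35_000: 0.9558, 40_000: 0.9549}
step, lift = best_switch_off(sweep, always_on_acc=0.9540)
# A negative lift would indicate it is better to keep augmenting throughout.
```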
[Figure 9 axis labels omitted: Final Training Loss vs. Steps to Training 97% Accuracy, Final Training Loss vs. Entropy, and Steps to Training 97% Accuracy vs. Entropy.]

Figure 9: CIFAR-10: Three different diversity metrics are strongly correlated for high-entropy augmentations. Here, the entropy is calculated only for discrete augmentations.

We computed three possible diversity metrics, shown in Fig. 9: Entropy, Final Training Loss, and Training Steps to Accuracy Threshold. The entropy was calculated only for augmentations that have a discrete stochasticity (such as Rotate(fixed)) and not for augmentations that have a continuous variation (such as Rotate(variable) or PatchGaussian). Final Training Loss is the batch statistic at the last step of training. For CIFAR-10 experiments, this was averaged across the 10 initializations. For ImageNet, it was averaged over the last 10 steps of training. Training Steps to Accuracy Threshold is the number of training steps at which the training accuracy first hits a threshold of 97%. A few of the tested augmentations (extreme versions of PatchGaussian) did not reach this threshold in the given time, and that column is left blank in the full results.

Entropy is unique in that it is independent of the model or data set: it is a counting of states. However, it is difficult to compare between discrete and continuously-varying transforms, and it is not clear how proper it is to compare even across different types of transforms.

Final Training Loss and Training Steps to Accuracy Threshold correlate well across the tested transforms. Entropy is highly correlated with these measures for PatchGaussian and for versions of FlipLR, Crop, and Cutout where only probabilities are varying. For Rotate and Shear, where magnitudes are varying as well, the correlation between Entropy and the other two measures is less clear.

To build intuition for what Diversity means here, the Final Training Loss in the case of static augmentation was compared to that in the case of dynamic augmentation. As shown in Fig. 6, in the case of static augmentation, the Diversity was always less than in the typical case of dynamic augmentation. Moreover, across this large range of augmentations, the numerical span of Diversity was very small in the case of static augmentation, compared to dynamic augmentation. This suggests that this particular measure of Diversity is indeed connected to the number of unique or useful training images that can be created with a given augmentation. In the case of static augmentation, the number of unique images is exactly the same for all augmentations; dynamic augmentations allow for more unique images, and both the number and utility of unique images vary with augmentation.
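The two diversity metrics that are simple to state in code can be sketched as follows (the function names and the example distribution are ours, for illustration):

```python
import math

def augmentation_entropy(state_probs):
    """Shannon entropy (in nats) of a discrete augmentation's states,
    e.g. the finitely many outcomes of a fixed-angle rotation."""
    return -sum(p * math.log(p) for p in state_probs if p > 0)

def steps_to_accuracy_threshold(train_acc_per_step, threshold=0.97):
    """First training step at which training accuracy reaches the
    threshold; None if it is never reached (left blank in the results)."""
    for step, acc in enumerate(train_acc_per_step):
        if acc >= threshold:
            return step
    return None

# A flip applied with probability p has two states; entropy peaks at p = 0.5.
entropy_at_half = augmentation_entropy([0.5, 0.5])  # = ln 2
```

This also makes the limitation above concrete: a continuously-varying transform has no finite list of state probabilities to count, so its entropy is not directly comparable.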
F Comparing Affinity to other related measures
We gain confidence in the Affinity measure by comparing it to other potential model-dependent measures of distribution shift. In Fig. 10, we show the correlation between Affinity and two such measures: the mean log likelihood of augmented test images [49] (labeled "logsumexp(logits)") and the Watanabe–Akaike information criterion (labeled "WAIC") [50]. Like Affinity, these other two measures indicate how well a model trained on clean data comprehends augmented data.
(a) CIFAR-10; (b) ImageNet.

Figure 10: Affinity correlates with two other measures of how augmented images are related to a trained model's distribution: logsumexp of the logits (left, for CIFAR-10, and right, for ImageNet) is the mean log likelihood for the image. WAIC (middle, for CIFAR-10) corrects for a possible bias in that estimate. In all three plots, numbers are referenced to the clean baseline, which is assigned a value of 0.
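A minimal sketch of the logsumexp measure, assuming per-image rows of logits (the helper name and example arrays are illustrative, not the paper's code):

```python
import numpy as np

def mean_logsumexp(logits):
    """Mean over images of logsumexp of the logits, used as the mean
    log likelihood of the inputs under the trained model [49]."""
    logits = np.asarray(logits, float)
    m = logits.max(axis=1, keepdims=True)  # subtract max for stability
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return lse.mean()

# Referenced to the clean baseline, as in Fig. 10.
clean_logits = np.array([[5.0, 1.0, 0.5], [4.0, 2.0, 1.0]])
aug_logits = np.array([[3.0, 2.0, 1.5], [2.5, 2.0, 1.8]])
relative = mean_logsumexp(aug_logits) - mean_logsumexp(clean_logits)
```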
G Full results
The plotted data for CIFAR-10 and ImageNet are given in .csv files uploaded at https://storage.googleapis.com/public_research_data/augmentation/data.zip