Affinity and Diversity: Quantifying Mechanisms of Data Augmentation

Raphael Gontijo-Lopes∗ (Google Brain, Mountain View, CA 94043)
Sylvia J. Smullin∗ (Google, Mountain View, CA 94043)
Ekin D. Cubuk (Google Brain, Mountain View, CA 94043; [email protected])
Ethan Dyer (Google, Mountain View, CA 94043; [email protected])
Abstract
Though data augmentation has become a standard component of deep neural network training, the underlying mechanism behind the effectiveness of these techniques remains poorly understood. In practice, augmentation policies are often chosen using heuristics of either distribution shift or augmentation diversity. Inspired by these, we seek to quantify how data augmentation improves model generalization. To this end, we introduce interpretable and easy-to-compute measures: Affinity and Diversity. We find that augmentation performance is predicted not by either of these alone but by jointly optimizing the two.
[Figure 1 image: (a) Affinity vs. Diversity; (b) Model's View of Data]
Figure 1:
Affinity and Diversity parameterize the performance of a model trained with augmentation. (a) CIFAR-10: color shows the final test accuracy. The * marks the clean baseline. Each point represents a different augmentation that yields test accuracy greater than 88.7%. (b) Representation of how clean data and augmented data are related in the space of these two metrics. Higher diversity is represented by a larger bubble, while distributional similarity is depicted through the overlap of bubbles. Test accuracy generally improves to the upper right in this space. Adding real new data to the training set is expected to be in the far upper right corner.

∗ Equal contribution. Preprint. Under review.

Introduction
Models that achieve state-of-the-art performance in image classification often use heavy data augmentation strategies. The best techniques use various transforms applied sequentially and stochastically. Though the effectiveness of this is well-established, the mechanism through which these transformations work is not well-understood.

Since early uses of data augmentation, it has been assumed that augmentation works because it simulates realistic samples from the true data distribution: "[augmentation strategies are] reasonable since the transformed reference data is now extremely close to the original data. In this way, the amount of training data is effectively increased" [1]. Because of this, augmentations have often been designed with the heuristic of incurring minimal distribution shift from the training data.

This rationale does not explain why unrealistic distortions such as cutout [2], SpecAugment [3], and mixup [4] significantly improve generalization performance. Furthermore, methods do not always transfer across datasets:
Cutout, for example, is useful on CIFAR-10 but not on ImageNet [5]. Additionally, many augmentation policies heavily modify images by stochastically applying multiple transforms to a single image. Based on this observation, some have proposed that augmentation strategies are effective because they increase the diversity of images seen by the model.

In this complex landscape, claims about diversity and distributional similarity remain unverified heuristics. Without a more precise science of data augmentation, finding state-of-the-art strategies requires brute force that can cost thousands of GPU hours [6, 7]. This highlights a need to specify and measure the relationship between the original training data and the augmented dataset, as relevant to a given model's performance.

In this paper, we quantify these heuristics. Seeking to understand the mechanisms of augmentation, we focus on single transforms as a foundation. We present an extensive study of 204 different augmentations on CIFAR-10 and 223 on ImageNet, varying both broad transform families and finer transform parameters. Our contributions are:

1. We introduce Affinity and Diversity: interpretable, easy-to-compute metrics for parameterizing augmentation performance. Affinity quantifies how much an augmentation shifts the training data distribution from that learned by a model. Diversity quantifies the complexity of the augmented data with respect to the model and learning procedure.
2. We show that performance depends on both metrics. In the Affinity-Diversity plane, the best augmentation strategies jointly optimize the two (see Fig. 1).
3. We connect augmentation to other familiar forms of regularization, such as ℓ2 regularization and learning rate scheduling, observing common features of the dynamics: performance can be improved and training accelerated by turning off regularization at an appropriate time.
4. We show that performance is only improved when a transform increases the total number of unique training examples.
The utility of these new training examples is informed by the augmentation's Affinity and Diversity.

Related Work

Since early uses of data augmentation in training neural networks, there has been an assumption that effective transforms for data augmentation are those that produce images from an "overlapping but different" distribution [1, 8]. Indeed, elastic distortions as well as distortions in the scale, position, and orientation of training images have been used on MNIST [9–12], while horizontal flips, random crops, and random distortions to color channels have been used on CIFAR-10 and ImageNet [13–15]. For object detection and image segmentation, one can also use object-centric cropping [16] or cut-and-paste new objects [17–19].

In contrast, researchers have also successfully used more generic transformations that are less domain-specific, such as Gaussian noise [5, 20], input dropout [21], erasing random patches of the training samples during training [2, 3, 22], and adversarial noise [23]. Mixup [4] and Sample Pairing [24] are two augmentation methods that use convex combinations of training samples.

It is also possible to improve generalization by combining individual transformations. For example, reinforcement learning has been used to choose more optimal combinations of data augmentation transformations [6, 25]. Follow-up research has lowered the computational cost of such optimization by using population-based training [26], density matching [27], adversarial policy design that evolves throughout training [7], or a reduced search space [28]. Despite producing unrealistic outputs, such combinations of augmentations can be highly effective in different tasks [29–33].

Across these different examples, the role of distribution shift in training remains unclear. Lim et al. [27] and Hataya et al. [34] have found augmentation policies by minimizing the distance between the distributions of augmented data and clean data.
Recent work found that after training with augmented data, fine-tuning on clean training data can be beneficial [35], while Touvron et al. [36] found it beneficial to fine-tune with a test-set resolution that aligns with the training-set resolution.

The true input-space distribution from which a training dataset is drawn remains elusive. To better understand the effect of distribution shift on model performance, many works attempt to estimate it. Often these techniques require training secondary models, such as those based on variational methods [37–40]. Others have tried to augment the training set by modelling the data distribution directly [41]. Recent work has suggested that even unrealistic distribution modelling can be beneficial [42].

These methods try to specify the distribution separately from the model they are trying to optimize. As a result, they are insensitive to any interaction between the model and the data distribution. Instead, we are interested in a measure of how much the data shifts along directions that are most relevant to the model's performance.
Methods

We performed extensive experiments with various augmentations on CIFAR-10 and ImageNet. Experiments on CIFAR-10 used the WRN-28-2 model [14], trained for 78k steps with cosine learning rate decay. Results are the mean over 10 initializations, and reported errors (often too small to show on figures) are the standard error on the mean. Details on the error analysis are in Sec. C. Experiments on ImageNet used the ResNet-50 model [43], trained for 112.6k steps with a weight decay rate of 1e-4 and a learning rate of 0.2, which is decayed by a factor of 10 at epochs 30, 60, and 80.

Images were pre-processed by dividing each pixel value by 255 and normalizing by the dataset statistics. Random crop was also applied on all ImageNet models. These pre-processed data without further augmentation are "clean data", and a model trained on them is the "clean baseline". We followed the same implementation details as Cubuk et al. [6], including for most augmentation operations. Further implementation details are in Sec. A. For CIFAR-10, test accuracy on the clean baseline is . ± . and the validation accuracy is . ± . . On ImageNet, the test accuracy is 76.06%.

Unless specified otherwise, data augmentation was applied following standard practice: each time an image is drawn, the given augmentation is applied with a given probability. We call this mode dynamic augmentation. Due to stochasticity in the transform itself (such as randomly selecting the location of a crop) or in the policy (such as applying a flip only with 50% probability), the augmented image can be different each time. Thus, most of the tested augmentations increase the number of possible distinct images that can be shown during training.

We also performed select experiments using static training. In static augmentation, the augmentation policy (one or more transforms) is applied once to the entire clean training set. Static augmentation does not change the number of unique images in the dataset.
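The distinction between the two training modes can be sketched in a few lines. This is a hedged illustration only: the function names and the toy "jitter" transform are ours, not the paper's implementation.

```python
import random

def static_trainset(clean_set, augment, rng):
    """Static mode: apply the stochastic transform once, up front.  The
    number of unique training inputs stays equal to the clean set's size."""
    return [(augment(x, rng), y) for x, y in clean_set]

def dynamic_stream(clean_set, augment, steps, rng):
    """Dynamic mode: re-apply the transform each time an image is drawn,
    so the model can see a different version of an image on every draw."""
    for _ in range(steps):
        x, y = rng.choice(clean_set)
        yield augment(x, rng), y

# Toy data and a stochastic "jitter" transform (illustrative stand-ins).
clean = [(float(i), i % 2) for i in range(4)]
jitter = lambda x, rng: x + rng.random()

static = static_trainset(clean, jitter, random.Random(0))
dynamic = list(dynamic_stream(clean, jitter, 100, random.Random(1)))
print(len({x for x, _ in static}))    # 4 unique inputs, same as the clean set
print(len({x for x, _ in dynamic}))   # many more unique inputs
```

This mirrors the experimental setup: dynamic augmentation inflates the number of distinct inputs seen across epochs, while static augmentation only replaces each clean image with one fixed transformed version.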
Affinity

Thus far, heuristics of distribution shift have motivated the design of augmentation policies. Inspired by this focus, we introduce a simple metric to quantify how augmentation shifts data with respect to the decision boundary of the clean baseline model.

We start by noting that a trained model is often sensitive to the distribution of the training data. That is, model performance varies greatly between new samples from the true data distribution and samples from a shifted distribution. Importantly, the model's sensitivity to distribution shift is not purely a function of the input data distribution, since training dynamics and the model's implicit biases affect performance.

(Available at bit.ly/2v2FojN)
Definition 1.
Let D_train and D_val be training and validation datasets drawn IID from the same clean data distribution, and let D′_val be derived from D_val by applying a stochastic augmentation strategy a once to each image in D_val: D′_val = {(a(x), y) : (x, y) ∈ D_val}. Further, let m be a model trained on D_train, and let A(m, D) denote the model's accuracy when evaluated on dataset D. The Affinity, T[a; m; D_val], is given by

    T[a; m; D_val] = A(m, D′_val) − A(m, D_val).    (1)

With this definition, an Affinity of zero represents no shift, and a negative number suggests that the augmented data is out-of-distribution for the model.

[Figure 2 panels: (a) Affinity; (b) D_KL, each plotted against the X shift and Y shift of the data]

Figure 2:
Affinity is a model-sensitive measure of distribution shift. Contours indicate lines of equal (a) Affinity, or (b) KL divergence between the joint distribution of the original data and targets and the shifted data. The two axes indicate the actual shifts that define the augmentation. Affinity captures model-dependent features, such as the decision boundary.
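In practice, Eq. (1) needs only a clean-trained model, a held-out validation set, and one static pass of the augmentation. A minimal sketch follows; the threshold "model" and toy data are illustrative stand-ins, not the paper's setup.

```python
def accuracy(predict, dataset):
    """Fraction of (x, y) pairs that the model classifies correctly."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

def affinity(predict, val_set, augment):
    """Eq. (1): accuracy on a statically augmented copy of the validation
    set, minus accuracy on the clean validation set.  `predict` stands in
    for a model trained on clean data only."""
    augmented = [(augment(x), y) for x, y in val_set]
    return accuracy(predict, augmented) - accuracy(predict, val_set)

# Toy 1-D example: a threshold classifier with decision boundary at x = 0.
predict = lambda x: int(x > 0.0)
val_set = [(-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1)]
shift = lambda x: x - 0.6   # pushes one positive example across the boundary

print(affinity(predict, val_set, shift))         # -0.25: one of four labels flips
print(affinity(predict, val_set, lambda x: x))   # 0.0: identity transform, no shift
```

Note that no retraining is involved: the augmentation touches only the validation copy, matching the paper's point that Affinity is independent of any interaction between augmentation and the training process.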
In Fig. 2 we illustrate Affinity with a two-class classification task on a mixture of two Gaussians. Augmentation in this example comprises a shift of the means of the Gaussians of the validation data compared to those used for training. Under this shift, we calculate both the Affinity and the KL divergence of the shifted data with respect to the original data. Affinity changes only when the shift in the data is with respect to the model's decision boundary, whereas the KL divergence changes even when data is shifted in a direction that is irrelevant to the classification task. In this way, Affinity captures what is relevant to a model: shifts that impact predictions.

This same metric has been used as a measure of a model's robustness to image corruptions that do not change images' semantic content [20, 44–48]. Here, we turn this around and use it to quantify the shift of augmented data compared to clean data. Affinity has the following advantages as a metric:

1. It is easy to measure. It requires only clean training of the model in question.
2. It is independent of any confounding interaction between the data augmentation and the training process, since augmentation is only used on the validation set and applied statically.
3. It is a measure of distance sensitive to properties of both the data distribution and the model.

We gain confidence in this metric by comparing it to other potential model-dependent measures of distribution shift. We consider the mean log likelihood of augmented test images [49] and the Watanabe–Akaike information criterion (WAIC) [50]. These other metrics have high correlation with Affinity. Details can be found in Sec. F.
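The two-Gaussian example of Fig. 2 can be worked out in closed form. If the class means sit at (±1, 0) with isotropic noise, the clean model's decision boundary is x = 0, so only the x-component of a mean shift changes Affinity, while the KL divergence between shifted and original Gaussians depends only on the shift's magnitude. The specific means, σ, and function names below are our illustrative choices, not taken from the paper.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kl_shift(shift, sigma=1.0):
    """KL( N(mu + shift, sigma^2 I) || N(mu, sigma^2 I) ) = |shift|^2 / (2 sigma^2).
    Direction-blind: it only sees the size of the shift."""
    return (shift[0] ** 2 + shift[1] ** 2) / (2.0 * sigma ** 2)

def affinity_shift(shift, sigma=1.0):
    """Affinity (Eq. 1) for the clean boundary x = 0 and class means (+/-1, 0),
    when validation data from both classes is shifted by `shift`."""
    sx = shift[0]
    acc_shifted = 0.5 * (phi((1.0 + sx) / sigma) + phi((1.0 - sx) / sigma))
    acc_clean = phi(1.0 / sigma)
    return acc_shifted - acc_clean

# A shift along y (parallel to the boundary) vs. an equal-sized shift along x:
print(kl_shift((0.0, 2.0)), affinity_shift((0.0, 2.0)))  # KL = 2.0, Affinity = 0.0
print(kl_shift((2.0, 0.0)), affinity_shift((2.0, 0.0)))  # KL = 2.0, Affinity < 0
```

The two shifts are indistinguishable to KL divergence, yet only the shift across the decision boundary moves Affinity, which is exactly the contrast Fig. 2 draws.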
Diversity

Inspired by the observation that multi-factor augmentation policies such as FlipLR + Crop + Cutout and RandAugment [28] greatly improve performance, we propose another axis on which to view augmentation policies, which we dub Diversity. This measure is intended to quantify the intuition that augmentations prevent models from over-fitting by increasing the number of samples in the training set; the importance of this is shown in Sec. 4.3. Based on the intuition that more diverse data should be more difficult for a model to fit, we propose a model-based measure. The Diversity metric in this paper is the final training loss of a model trained with a given augmentation:
Definition 2.
Let a be an augmentation and D′_train be the augmented training data resulting from applying the augmentation a stochastically. Further, let L_train be the training loss for a model m trained on D′_train. We define the Diversity, D[a; m; D_train], as

    D[a; m; D_train] := E_{D′_train}[L_train].    (2)

Though determining the training loss requires the same amount of work as determining final test accuracy, here we focus on this metric as a tool for understanding. As with Affinity, this definition of Diversity has the advantage that it can capture model-dependent elements, i.e. it is informed by the class of priors implicit in choosing a model and optimization scheme, as well as by the stopping criterion used in training.

Another potential diversity measure is the entropy of the transformed data, D_Ent. This is inspired by the intuition that augmentations with more degrees of freedom perform better. For discrete transformations, we consider the conditional entropy of the augmented data,

    D_Ent := H(X′ | X) = −E_X [ Σ_{x′} p(x′ | X) log p(x′ | X) ],

where x ∈ X is a clean training image and x′ ∈ X′ is an augmented image. This measure has the property that it can be evaluated without any training or reference to model architecture. However, the appropriate entropy for continuously-varying transforms is less straightforward.

A third proxy for Diversity is the training time needed for a model to reach a given training accuracy threshold. In Sec. E, we show that these three metrics correlate well with each other. In the remaining sections we describe how the complementary metrics of Diversity and Affinity can be used to characterize and understand augmentation performance.

Despite the original inspiration to mimic realistic transformations and minimize distribution shift, many state-of-the-art augmentations yield unrealistic images.
This suggests that distribution shift alone does not fully describe or predict augmentation performance.
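Looking back at the entropy-based proxy D_Ent: for a discrete stochastic transform it can be estimated by sampling augmented versions of each image and averaging the per-image entropy. The sampling scheme and names below are ours, not the paper's implementation.

```python
import math
import random
from collections import Counter

def d_ent(clean_images, augment, n_draws=1000, seed=0):
    """Sampling estimate of D_Ent = H(X' | X) (in nats) for a discrete
    stochastic transform: the average, over clean images x, of the entropy
    of the distribution of augmented versions a(x)."""
    rng = random.Random(seed)
    total = 0.0
    for x in clean_images:
        counts = Counter(augment(x, rng) for _ in range(n_draws))
        total += -sum((c / n_draws) * math.log(c / n_draws)
                      for c in counts.values())
    return total / len(clean_images)

# A flip applied with 50% probability carries about log(2) nats of
# conditional entropy per image; applied with 100% probability it is
# deterministic and carries none.  (`-x` stands in for "the flipped image".)
flip_half = lambda x, rng: x if rng.random() < 0.5 else -x
flip_always = lambda x, rng: -x
print(d_ent([1.0, 2.0], flip_half))    # close to log(2) = 0.693 nats
print(d_ent([1.0, 2.0], flip_always))  # 0.0
```

This matches the text's observation that a deterministic transform adds no conditional entropy, which foreshadows why fully deterministic augmentations like FlipLR(100%) behave differently later in the paper.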
[Figure 3 panels: (a) test accuracy vs. Affinity and vs. Diversity, top: CIFAR-10, bottom: ImageNet; (b) CIFAR-10 and (c) ImageNet, test accuracy in the Affinity-Diversity plane]

Figure 3:
Augmentation performance is determined by both Affinity and Diversity. (a) Test accuracy plotted against each of Affinity and Diversity for the two datasets, showing that neither metric alone predicts performance. In the CIFAR-10 plots (top), blue highlights (also in inset) are the augmentations that increase test accuracy above the clean baseline. Dashed lines indicate the clean baseline. (b) and (c) show test accuracy on the color scale in the plane of Affinity and Diversity. The three star markers in (b) are (left to right) RandAugment, AutoAugment, and mixup. The * on the color bar indicates the clean baseline case. For fixed values of Affinity, test accuracy generally increases with higher values of Diversity. For fixed values of Diversity, test accuracy generally increases with higher values of Affinity. Note that the gains observed on ImageNet are expected to be small, in line with previous work on single-transformation policies [48].
Figure 3(a) (left) measures Affinity across 204 different augmentations for CIFAR-10 and 223 for ImageNet, respectively. We find that for the most important augmentations (those that help performance), Affinity is a poor predictor of accuracy. Furthermore, we find many successful augmentations with low Affinity. For example, Rotate(fixed, 45deg, 50%), Cutout(16), and combinations of FlipLR, Crop(32), and Cutout(16) all have Affinity < − and test accuracy > above the clean baseline on CIFAR-10. Augmentation details are in Sec. B.

As Affinity does not fully characterize the performance of an augmentation, we seek another metric. To assess the importance of an augmentation's complexity, we measure Diversity across the same set of augmentations. We find that Diversity is complementary in explaining how augmentations can increase test performance. As shown in Fig. 3(b) and (c), Affinity and Diversity together provide a much clearer parameterization of an augmentation policy's benefit to performance. For a fixed level of Diversity, augmentations with higher Affinity are consistently better. Similarly, for a fixed Affinity, it is generally better to have higher Diversity.

A simple case study is presented in Fig. 4, in which the probability of the transform Rotate(fixed, 60deg) is varied. The accuracy and Affinity are not monotonically related, with the peak accuracy falling at an intermediate value of Affinity. Similarly, accuracy is correlated with Diversity for low-probability transformations but does not track for higher probabilities. The optimal probability for Rotate(fixed, 60deg) lies at an intermediate value of Affinity and Diversity.
[Figure 4 panels: test accuracy vs. Affinity (left) and vs. Diversity (center) as the transform probability varies; test accuracy in the Affinity-Diversity plane (right)]

Figure 4:
Test accuracy varies differently than either Affinity or Diversity. Here, the probability of Rotate(fixed, 60deg) on CIFAR-10 is varied from 10% to 90%. Left: as probability increases, Affinity decreases linearly while the accuracy changes non-monotonically. Center: accuracy and Diversity vary differently from each other as probability is changed. Right: test accuracy is maximized at intermediate values.
To situate the tested augmentations (mostly single transforms) within the context of the state of the art, we tested three high-performance augmentations from the literature: mixup [4], AutoAugment [6], and RandAugment [28]. These are highlighted with star markers in Fig. 3(b).

More than either of the metrics alone, Affinity and Diversity together provide a useful parameterization of an augmentation's performance. We now turn to investigating the utility of this tool for explaining other observed phenomena of data augmentation.
The term "regularizer" is ill-defined in the literature, often referring to any technique used to reduce generalization error without necessarily reducing training error [51]. With this definition, it is widely acknowledged that commonly-used augmentations act as regularizers [52–54]. Though this is a broad definition, we notice another commonality across seemingly different kinds of regularizers: various regularization techniques yield boosts in performance (or at least no degradation) if the regularization is turned off at the right time during training. For instance:

1. Decaying a large learning rate on an appropriate schedule can be better than maintaining a large learning rate throughout training [14].
2. Turning off ℓ2 regularization at the right time in training does not hurt performance [55].
3. Relaxing architectural constraints mid-training can boost final performance [56].
4. Turning augmentations off and fine-tuning on clean data can improve final test accuracy [35].

To further study augmentation as a regularizer, we compare the constant augmentation case (with the same augmentation throughout) to the case where the augmentation is turned off partway through training and training is completed with clean data. For each transform, we test over a range of switch-off points and select the one that yields the best final validation or test accuracy on CIFAR-10 and ImageNet, respectively. The Switch-off Lift is the resulting increase in final test accuracy, compared to training with augmented data the entire time.
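The switch-off protocol can be summarized procedurally: one run per candidate switch-off point, each finishing on clean data, with the Switch-off Lift taken relative to augmenting for the whole run. A hedged sketch follows; the toy step function and the accuracy numbers are invented for illustration, not the paper's results.

```python
def train(total_steps, aug_off_step, step_fn, state=0.0):
    """Run `total_steps` of training; augmentation is applied only while
    step < aug_off_step, after which training finishes on clean data."""
    for step in range(total_steps):
        state = step_fn(state, augmented=(step < aug_off_step))
    return state

def switch_off_lift(final_acc_by_switch_point, full_aug_acc):
    """Best final accuracy over the tested switch-off points, minus the
    accuracy from training with augmented data the entire time."""
    return max(final_acc_by_switch_point.values()) - full_aug_acc

# Hypothetical final test accuracies for three switch-off points (in steps):
accs = {20_000: 0.900, 55_000: 0.931, 70_000: 0.912}
print(switch_off_lift(accs, full_aug_acc=0.895))  # about +0.036 lift
```

The important design point is that only the best switch-off point counts toward the Lift, so a positive value means there exists some schedule under which removing the augmentation helps.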
[Figure 5 panels: (a) Slingshot effect on CIFAR-10: validation accuracy vs. training steps for augmentation (Turn Aug Off / Baseline / Constant Aug), ℓ2 regularization, and learning rate schedules (Stepped LR / Constant LR); (b) Switch-off Lift on CIFAR-10: test accuracy with augmentation on vs. turned off; (c) Switch-off Lift in the Affinity-Diversity plane for CIFAR-10 (left) and ImageNet (right)]

Figure 5: (a) Switching off regularizers yields a performance boost: three examples of how turning off a regularizer increases the validation accuracy. This slingshot effect can speed up training and improve the best validation accuracy. Top: training with no augmentation (clean baseline), compared to constant augmentation, and augmentation that is turned off at 55k steps. Here, the augmentation is
Rotate(fixed, 20deg, 100%). Middle: baseline with constant ℓ2 regularization, compared to turning off ℓ2 regularization part way through training. Bottom: constant learning rate of 0.1 compared to training where the learning rate is decayed in one step by a factor of 10. (b) Bad augmentations can become helpful if switched off: colored lines connect the test accuracy with augmentation applied throughout training (top) to the test accuracy with switching mid-training. Color indicates the amount of Switch-off Lift; blue is positive and orange is negative. (c) Switch-off Lift varies with Affinity and Diversity. Where Switch-off Lift is negative, it is mapped to 0 on the color scale.

For some poor-performing augmentations, this gain can actually bring the final test accuracy above the baseline, as shown in Fig. 5(b). We additionally observe (Fig. 5(a)) that this test accuracy improvement can happen quite rapidly, both for augmentations and for the other regularizers tested. This suggests an opportunity to accelerate training without hurting performance by appropriately switching off regularization. We call this a slingshot effect.

Interestingly, we find that the best time for turning off an augmentation is not always close to the end of training, contrary to what is shown in He et al. [35]. For example, without switching,
FlipUD(100%) decreases test accuracy by almost 50% compared to the clean baseline. When the augmentation is used for only the first third of training, final test accuracy is above the baseline.

He et al. [35] hypothesized that the gain from turning augmentation off is due to recovery from a distribution shift. Indeed, for many detrimental transformations, the test accuracy gained by turning off the augmentation merely recovers the clean baseline performance. However, in Fig. 5(c), we see that for a given value of Affinity, the Switch-off Lift can vary. This result suggests that the Switch-off Lift derives from more than simply correction of a distribution shift.

A few of the tested augmentations, such as
FlipLR(100%), are fully deterministic. Thus, each time an image is drawn in training, it is augmented the same way. When such an augmentation is turned off partway through training, the model then sees images (the clean ones) that are now new. Indeed, when
FlipLR(100%) is switched off at the right time, its final test accuracy exceeds that of
FlipLR(50%) without switching. In this way, switching augmentation off may adjust for not only low Affinity but also low Diversity.
Most augmentations we tested, and those used in practice, have inherent stochasticity and thus may alter a given training image differently each time the image is drawn. In the typical dynamic training mode, these augmentations increase the number of unique inputs seen across training epochs.

To further study how augmentations act as regularizers, we seek to discriminate this increase in effective dataset size from other effects. We train models with static augmentation, as described in Sec. 3. This altered training set is used without further modification during training, so that the number of unique training inputs is the same between the augmented and the clean training settings.

For almost all tested augmentations, using static augmentation yields lower test accuracy than the clean baseline. Where static augmentation shows a gain (versions of crop), the difference is less than the standard error on the mean. As in the dynamic case, poorer performance in the static case is for transforms that have lower Affinity and lower Diversity.

Static augmentations also always perform worse than their non-deterministic, dynamic counterparts, as shown in Fig. 6. This may be because the Diversity of a static augmentation is always less than in the dynamic case (see also Sec. E). The decrease in Diversity in the static case suggests a connection between Diversity and the number of training examples.
[Figure 6 panels: relative test accuracy of static vs. dynamic augmentation (left); Diversity of static vs. dynamic augmentation (right)]

Figure 6:
Static augmentations decrease diversity and performance. CIFAR-10: static augmentation performance is less than the clean baseline, (0, 0), and less than the dynamic augmentation case. Augmentations with no stochasticity are excluded because they are trivially equal on the two axes (left). Diversity in the static case is less than in the dynamic case (right). The diagonal line indicates where the static and dynamic cases would be equal.

Together, these results point to the following conclusion:
Increased effective training set size is crucial to the performance benefit of data augmentation. An augmentation's Affinity and Diversity inform how useful the additional training examples are.
Discussion

In this work, we focused on single transforms in an attempt to understand the essential parts of augmentation in a controlled context. This builds a foundation for using these metrics to quantify and design more complex and powerful combinations of augmentations.

Though earlier work has often explicitly focused on just one of these metrics, chosen priors have implicitly ensured reasonable values for both. One way to achieve Diversity is to use combinations of many single augmentations, as in AutoAugment [6]. Because transforms and hyperparameters in Cubuk et al. [6] were chosen by optimizing performance on proxy tasks, the optimal policies include high and low Affinity transforms. Fast AutoAugment [27], CTAugment [29, 57], and differentiable RandAugment [28] all aim to increase Affinity by what Lim et al. [27] called "density matching". However, these methods use the search space of AutoAugment and thus inherit its Diversity.

On the other hand, Adversarial AutoAugment [7] focused on increasing Diversity by optimizing policies to increase the training loss. While this method did not explicitly aim to increase Affinity, it also used transforms and hyperparameters from the AutoAugment search space. Without such a prior, which includes useful Affinity, the goal of maximizing training loss with no other constraints would lead to data augmentation policies that erase all the information from the images.

Our results motivate casting an even wider net when searching for augmentation strategies. Firstly, our work suggests that explicitly optimizing along axes of both Affinity and Diversity yields better performance. Furthermore, we have seen that poor-performing augmentations can actually be helpful if turned off during training (Fig. 5). With the inclusion of scheduling in augmentation optimization, we expect there are opportunities for including a different set of augmentations in an ideal policy. Ho et al.
[26] observe trends in how the probability and magnitude of various transforms change during training for an optimized augmentation schedule. We suggest that with further study, Diversity and Affinity can provide priors for optimization of augmentation schedules.
Conclusion
We attempted to quantify the common intuition that more in-distribution and more diverse augmentation policies perform well. To this end, we introduced two easy-to-compute metrics, Affinity and Diversity, intended to measure to what extent a given augmentation is in-distribution and how complex the augmentation is to learn. Because they are model-dependent, these metrics capture the data shifts that affect model performance.

With these tools, we conducted a study over a large class of augmentations for CIFAR-10 and ImageNet and found that neither feature alone is a perfect predictor of performance. Rather, we presented evidence that Diversity and Affinity play dual roles in determining augmentation quality. Optimizing for either metric separately is sub-optimal, and the best augmentations balance the two. Additionally, we found that an increased number of training examples, connected to Diversity, is a necessary ingredient of beneficial augmentation.

Finally, we found that augmentations share an important feature with other regularizers: switching off regularization at the right time can improve performance. In some cases, this can cause an otherwise poorly-performing augmentation to be beneficial.

We hope our findings provide a foundation for continued scientific study of data augmentation.
Data augmentation has the potential to amplify bias
Data augmentation takes a smaller, potentially biased training set and recycles it as the basis of a larger augmented training program. A central finding of this work is that the success of an augmentation policy varies with the dual metrics of Affinity and Diversity; as Affinity is explicitly model-dependent, it depends on biases present in the model. This data reuse and the model-dependence of successful augmentation suggest the possibility that augmentation may amplify biases in the data or model, and this warrants future investigation.
Robust data augmentation can reduce social, environmental, and financial costs
At its best, data augmentation provides a means for less well-funded or data-rich practitioners to design performant models by supplementing a smaller training data set with additional transformed images. Commonly-used policies, however, such as those found by AutoAugment [6], have relied on expensive brute-force searches costing thousands of GPU-hours, replacing the need for extensive data collection with the need for financially and environmentally expensive compute. We hope that by understanding the mechanisms behind successful data augmentation we can design guided augmentation policies for new datasets and models, and mitigate the social and financial costs of data collection without undue compute expense.
Fundamental understanding facilitates impact assessment
More broadly, the central aim of this work is to better understand the elements driving successful augmentation policies. Truly understanding the conceptual mechanisms at play is crucial in making informed judgements about the impact of data augmentation.
Acknowledgements
The authors would like to thank Alex Alemi, Justin Gilmer, Guy Gur-Ari, Albin Jones, Behnam Neyshabur, Zan Armstrong, and Ben Poole for thoughtful discussions on this work.
References

[1] J. R. Bellegarda, P. V. de Souza, A. J. Nadas, D. Nahamoo, M. A. Picheny, and L. R. Bahl. Robust speaker adaptation using a piecewise linear acoustic mapping. In ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.
[2] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[3] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
[4] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[5] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
[6] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
[7] Xinyu Zhang, Qiang Wang, Jian Zhang, and Zhao Zhong. Adversarial autoaugment, 2019.
[8] Yoshua Bengio, Frédéric Bastien, Arnaud Bergeron, Nicolas Boulanger-Lewandowski, Thomas Breuel, Youssouf Chherawala, Moustapha Cisse, Myriam Côté, Dumitru Erhan, Jeremy Eustache, et al. Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 164–172, 2011.
[9] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012.
[10] Ikuro Sato, Hiroki Nishimura, and Kensuke Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
[11] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, 2003.
[12] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[14] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
[15] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[17] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1301–1310, 2017.
[18] Hao-Shu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yong-Lu Li, and Cewu Lu. Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE International Conference on Computer Vision, pages 682–691, 2019.
[19] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019.
[20] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
[21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[22] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
[23] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[24] Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
[25] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pages 3239–3249, 2017.
[26] Daniel Ho, Eric Liang, Ion Stoica, Pieter Abbeel, and Xi Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
[27] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. arXiv preprint arXiv:1905.00397, 2019.
[28] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
[29] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.
[30] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[31] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.
[32] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, and Quoc V Le. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665, 2019.
[33] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
[34] Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, and Hideki Nakayama. Faster autoaugment: Learning augmentation strategies using backpropagation, 2019.
[35] Zhuoxun He, Lingxi Xie, Xin Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Data augmentation revisited: Rethinking the distribution gap between clean and augmented data, 2019.
[36] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy, 2019.
[37] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
[38] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors, ICLR, 2014. URL http://dblp.uni-trier.de/db/conf/iclr/iclr2014.html.
[39] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization, 2016.
[40] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, Feb 2017. ISSN 1537-274X. doi: 10.1080/01621459.2017.1285773. URL http://dx.doi.org/10.1080/01621459.2017.1285773.
[41] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2794–2803, 2017.
[42] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[44] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[45] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In , pages 1–7. IEEE, 2017.
[46] Dan Hendrycks and Thomas G Dietterich. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.
[47] Amir Rosenfeld, Richard Zemel, and John K Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
[48] Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. arXiv preprint arXiv:1906.08988, 2019.
[49] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one, 2019.
[50] Sumio Watanabe. Asymptotic equivalence of bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res., 11:3571–3594, December 2010. ISSN 1532-4435.
[51] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[52] Alex Hernández-García and Peter König. Further advantages of data augmentation on convolutional neural networks. In International Conference on Artificial Neural Networks, pages 95–103. Springer, 2018.
[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[54] Tri Dao, Albert Gu, Alexander J Ratner, Virginia Smith, Christopher De Sa, and Christopher Ré. A kernel theory of modern data augmentation. Proceedings of Machine Learning Research, 97:1528, 2019.
[55] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence, 2019.
[56] Stéphane d'Ascoli, Levent Sagun, Joan Bruna, and Giulio Biroli. Finding the needle in the haystack with convolutions: on the benefits of architectural bias, 2019.
[57] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
Supplementary Material

A Training methods
CIFAR-10 models were trained using code based on the AutoAugment code, with the following choices:

1. Learning rate was decayed following a cosine decay schedule, starting with a value of 0.1.
2. 78050 training steps were used, with data shuffled after every epoch.
3. As implemented in the AutoAugment code, the WRN-28-2 model was used with stochastic gradient descent and momentum. The optimizer used cross-entropy loss with ℓ₂ weight decay of 0.0005.
4. Before selecting the validation set, the full training set was shuffled and balanced such that the subset selected for training was balanced across classes.
5. The validation set was the last 5000 samples of the shuffled CIFAR-10 training data.
6. Models were trained using Python 2.7 and TensorFlow 1.13.

A training time of 78k steps was chosen because it showed reasonable convergence with the standard data augmentation of FlipLR, Crop, and Cutout.
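The cosine learning-rate schedule and step count described above can be sketched as follows. This is a minimal sketch: decaying all the way to zero at the final step is an assumption, and only the initial rate (0.1) and step count (78050) are taken from the list above.

```python
import math

def cosine_lr(step, total_steps=78050, init_lr=0.1):
    """Cosine-decayed learning rate: starts at init_lr, decays to 0
    at total_steps (decay-to-zero endpoint is an assumption)."""
    return init_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

At step 0 this returns the initial rate of 0.1, and it falls to half that value at the midpoint of training.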
In the clean baseline case, test accuracy actually reached its peak much earlier than 78k steps.

With CIFAR-10, experiments were also performed for training dataset sizes of 1024, 4096, and 16384. At smaller dataset sizes, the impact of augmentation and the Switch-off Lift tended to be larger. These results are not shown in this paper.

ImageNet models were ResNet-50 trained using the Cloud TPU codebase. Models were trained for 112.6k steps with a weight decay rate of 1e-4, and a learning rate of 0.2, which was decayed by 10 at epochs 30, 60, and 80. Batch size was set to be 1024.

B Details of augmentation
B.1 CIFAR-10
On CIFAR-10, both color and affine transforms were tested, as given in the full results (see Sec. G). Most augmentations were as defined in Cubuk et al. [6]; additional conventions for augmentations as labeled in Fig. 7 are defined here. For Rotate, fixed means each augmented image was rotated by exactly the stated amount, with a randomly-chosen direction. Variable means an augmented image was rotated a random amount up to the given value in a randomly-chosen direction. Shear is defined similarly. Rotate(square) means that an image was rotated by an amount chosen randomly from [0°, 90°, 180°, 270°].

Crop included a padding before the random-location crop so that the final image remained 32 × 32 in size. The magnitude given for Crop is the number of pixels that were added in each dimension. The magnitude given in the label for Cutout is the size, in pixels, of each dimension of the square cutout.
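The Crop and Cutout conventions just described can be sketched as below. This is a hedged sketch, not the paper's implementation: the per-side padding interpretation, the random patch center, and the zero fill value (standing in for the constant gray fill) are assumptions.

```python
import numpy as np

def pad_and_crop(img, pad=4):
    """Pad `pad` pixels on each side, then crop a random window of the
    original size, so a 32x32 CIFAR-10 image stays 32x32."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    y = np.random.randint(0, 2 * pad + 1)
    x = np.random.randint(0, 2 * pad + 1)
    return padded[y:y + h, x:x + w]

def cutout(img, size=16):
    """Fill a size x size square, at a random center, with a constant value
    (0 here; the exact fill constant is an assumption)."""
    h, w = img.shape[:2]
    out = img.copy()
    cy, cx = np.random.randint(h), np.random.randint(w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out[y0:y1, x0:x1] = 0
    return out
```

Note that with a random center the cutout square may be clipped at the image border, so the erased area can be smaller than size × size.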
PatchGaussian was defined as in Lopes et al. [5], with the patch specified to be contained entirely within the image domain. In Fig. 7, it is labeled by two hyperparameters: the size of the square patch (in pixels) that was applied and σ_max, which is the maximum standard deviation of the noise that could be selected for any given patch. Here, "fixed" means the patch size was always the same.

Since FlipLR, Crop, and Cutout are part of standard pipelines for CIFAR-10, we tested combinations of the three augmentations (varying probabilities of each) as well as these three augmentations plus a single additional augmentation. As in standard processing of CIFAR-10 images, the first augmentation applied was anything that is not one of FlipLR, Crop, or Cutout. After that, augmentations were applied in the order Crop, then FlipLR, then Cutout.

Finally, we tested the CIFAR-10 AutoAugment policy [6], RandAugment [28], and mixup [4]. The hyperparameters for these augmentations followed the guidelines described in the respective papers.

(AutoAugment code available at github.com/tensorflow/models/tree/master/research/autoaugment; Cloud TPU codebase available at https://github.com/tensorflow/tpu/tree/master/models/official/resnet)
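A minimal sketch of PatchGaussian as summarized above: a square patch of additive Gaussian noise, placed fully inside the image, labeled by patch size and σ_max. Uniform sampling of the noise standard deviation in [0, σ_max] and clipping to a [0, 1] pixel range are assumptions; see Lopes et al. [5] for the reference definition.

```python
import numpy as np

def patch_gaussian(img, patch_size=16, sigma_max=1.0):
    """Add Gaussian noise inside a square patch contained entirely within
    the image. The noise std is drawn uniformly in [0, sigma_max] per image;
    pixels are assumed to be floats in [0, 1] (assumptions)."""
    h, w = img.shape[:2]
    sigma = np.random.uniform(0.0, sigma_max)
    # Top-left corner chosen so the patch stays inside the image domain.
    y = np.random.randint(0, h - patch_size + 1)
    x = np.random.randint(0, w - patch_size + 1)
    out = img.astype(np.float32).copy()
    noise = np.random.normal(0.0, sigma,
                             size=(patch_size, patch_size) + img.shape[2:])
    out[y:y + patch_size, x:x + patch_size] += noise
    return np.clip(out, 0.0, 1.0)
```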
[Figure 7 plot: tested CIFAR-10 augmentations on the Affinity–Diversity plane; the legend lists each transform with its hyperparameters and application probability.]
Figure 7: CIFAR-10: Labeled map of tested augmentations on the plane of Affinity and Diversity. Color distinguishes different hyperparameters for a given transform.
B.2 ImageNet
On ImageNet, we experimented with
PatchGaussian, Cutout, operations from the PIL imaging library, and techniques from the AutoAugment code, as described above for CIFAR-10. In addition to PatchGaussian(fixed), we also tested
PatchGaussian(variable), where the patch size was uniformly sampled up to a maximum size. The implementation here did not constrain the patch to be entirely contained within the image. Additionally, we experimented with SolarizeAdd. SolarizeAdd is similar to Solarize from the PIL library, but has an additional hyperparameter which determines how much value was added to each pixel that is below the threshold. Finally, we also experimented with Full Gaussian and Random Erasing on ImageNet. Full Gaussian adds Gaussian noise to the whole image. Random Erasing is similar to Cutout, but randomly samples the values of the pixels in the patch [22] (whereas Cutout sets them to a constant, gray pixel). These augmentations are labeled in Fig. 8.

We note that the gains on ImageNet are expected to be small. This is in-line with the magnitude of the gains observed by related works with single transformations [48]. While combinations of transformations can lead to bigger improvements [28], our focus is on understanding single augmentations as a foundation for future work on their combinations.

(PIL available at https://pillow.readthedocs.io/en/5.1.x/)
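The SolarizeAdd description above can be sketched as follows. This is a hedged sketch: the uint8 pixel range and the strict less-than threshold comparison are assumptions, and the default values are illustrative.

```python
import numpy as np

def solarize_add(img, addition=50, threshold=128):
    """Add `addition` (which may be negative) to every pixel below
    `threshold`, clipping to [0, 255]; pixels at or above the threshold
    are left unchanged. Assumes uint8 input."""
    added = np.clip(img.astype(np.int32) + addition, 0, 255).astype(np.uint8)
    return np.where(img < threshold, added, img)
```

Negative additions, as in the Solarize Add(-052, ...) entries of Fig. 8, darken the below-threshold pixels instead of brightening them.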
[Figure 8 point labels omitted: each marker corresponds to one tested transform and hyperparameter setting (AutoContrast, Brightness, Color, Contrast, Cutout, Equalize, FlipUD, FullGaussian, Invert, Patch Gaussian, Posterize, Random Erasing, Rotate, Sharpness, ShearX, Solarize, Solarize Add, TranslateX), plotted on the Affinity–Diversity plane.]
Figure 8: ImageNet: Labeled map of tested augmentations on the plane of Affinity and Diversity. Color distinguishes different hyperparameters for a given transform. Legend is below. Each augmentation was applied with a certain probability (given as a percentage in the label). Each time an image was pulled for training, the given image was augmented with that probability.
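The per-image application described in the caption can be sketched as follows; `maybe_augment` and its arguments are hypothetical names for illustration, not from the paper's code:

```python
import random

def maybe_augment(image, transform, probability=1.0):
    """Each time an image is drawn for training, apply the augmentation
    with the stated probability; otherwise return the image unchanged."""
    if random.random() < probability:
        return transform(image)
    return image

# With probability 100%, every draw of the image is augmented.
augmented = maybe_augment([1, 2, 3], lambda img: img[::-1], probability=1.0)
```

Because the draw is repeated every time the image is pulled, a probability below 100% yields a mixture of clean and augmented copies over the course of training.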
C Error analysis
All of the CIFAR-10 experiments were repeated with 10 different initializations. In most cases, the resulting standard error of the mean (SEM) is too small to show as error bars on plots. The error on each measurement is given in the full results (see Sec. G).
[Figure residue omitted: point labels for a map of the tested augmentations on the Typicality–Diversity plane.]
Affinity and Switch-off Lift were both computed from differences between runs that share the same initialization. For Affinity, the same trained model was used for inference on clean validation data and on augmented validation data. Thus, the variance of Affinity for the clean baseline is not independent of the variance of Affinity for a given augmentation. The difference between the augmentation case and the clean baseline case was taken on a per-experiment basis (for each initialization of the clean baseline model) before the error was computed.

In the switching experiments, the final training without augmentation was completed starting from a given checkpoint in the model that was trained with augmentation. Thus, each switching experiment shared an initialization with an experiment that had no switching. Again, in this case the difference was taken on a per-experiment basis before the error (based on the standard deviation) was computed.

All ImageNet experiments shown are with one initialization. Thus, there are no statistics from which to analyze the error.
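As a sketch of this paired analysis (the function name and example values are illustrative, not the paper's code), the difference is taken per shared initialization before the standard error is computed:

```python
import numpy as np

def paired_lift_and_sem(aug_metric, clean_metric):
    """Mean and standard error of the per-initialization difference.

    aug_metric[i] and clean_metric[i] must come from runs sharing
    initialization i, so differencing happens before error estimation.
    """
    diffs = np.asarray(aug_metric, float) - np.asarray(clean_metric, float)
    sem = diffs.std(ddof=1) / np.sqrt(len(diffs))
    return diffs.mean(), sem

# Hypothetical final accuracies from 10 paired initializations.
aug = [0.958, 0.959, 0.957, 0.960, 0.958, 0.959, 0.957, 0.958, 0.960, 0.959]
clean = [0.956, 0.957, 0.955, 0.957, 0.956, 0.957, 0.955, 0.956, 0.957, 0.956]
lift, err = paired_lift_and_sem(aug, clean)
```

Pairing matters because run-to-run variation shared by both conditions cancels in the difference, which is why the error on the difference can be much smaller than the error on either condition alone.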
D Switching off augmentations
For CIFAR-10, switching times were tested in increments of approximately 5k steps between ∼ k and ∼ k steps. The best point for switching was determined by the final validation accuracy. On ImageNet, we tested turning augmentation off at 50, 60, 70, and 80 epochs. Total training took 90 epochs. The best point for switching was determined by the final test accuracy.

The Switch-off Lift was derived from the experiment at the best switch-off point for each augmentation.

For CIFAR-10, there are some augmentations where the validation accuracy was best at 25k steps, which means that further testing is needed to determine whether the true optimal switch-off point is lower, or whether the best case is not to train at all with the given augmentation. Some of the best augmentations have a small negative Switch-off Lift, indicating that it is better to train the entire time with the given augmentations.

For each augmentation, the best time for switch-off is listed in the full results (see Sec. G).

E Diversity metrics
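The selection of the best switch-off point can be sketched as below; the sweep values are hypothetical and the helper name is ours, not the paper's:

```python
def best_switch_off(val_acc_by_switch_step, always_on_acc):
    """Choose the switch-off step with the highest final validation
    accuracy and report the Switch-off Lift relative to training with
    augmentation for the entire run."""
    best_step = max(val_acc_by_switch_step, key=val_acc_by_switch_step.get)
    lift = val_acc_by_switch_step[best_step] - always_on_acc
    return best_step, lift

# Hypothetical CIFAR-10 sweep in ~5k-step increments.
sweep = {25_000: 0.9552, 30_000: 0.9561, 35_000: 0.9558, 40_000: 0.9549}
step, lift = best_switch_off(sweep, always_on_acc=0.9540)
# A negative lift would indicate it is better to keep augmenting throughout.
```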
[Figure 9 axis labels omitted: Final Training Loss vs. Steps to Training 97% Accuracy, Final Training Loss vs. Entropy, and Steps to Training 97% Accuracy vs. Entropy.]

Figure 9: CIFAR-10: Three different diversity metrics are strongly correlated for high-entropy augmentations. Here, the entropy is calculated only for discrete augmentations.

We computed three possible diversity metrics, shown in Fig. 9: Entropy, Final Training Loss, and Training Steps to Accuracy Threshold. The entropy was calculated only for augmentations that have a discrete stochasticity (such as Rotate(fixed)) and not for augmentations that have a continuous variation (such as Rotate(variable) or PatchGaussian). Final Training Loss is the batch statistic at the last step of training. For CIFAR-10 experiments, this was averaged across the 10 initializations. For ImageNet, it was averaged over the last 10 steps of training. Training Steps to Accuracy Threshold is the number of training steps at which the training accuracy first hits a threshold of 97%. A few of the tested augmentations (extreme versions of PatchGaussian) did not reach this threshold in the given time, and that column is left blank in the full results.

Entropy is unique in that it is independent of the model or data set: it is a counting of states. However, it is difficult to compare between discrete and continuously-varying transforms, and it is not clear how proper it is to compare even across different types of transforms.

Final Training Loss and Training Steps to Accuracy Threshold correlate well across the tested transforms. Entropy is highly correlated with these measures for PatchGaussian and for versions of FlipLR, Crop, and Cutout where only probabilities are varying. For Rotate and Shear, where magnitudes are varying as well, the correlation between Entropy and the other two measures is less clear.

To build intuition for what Diversity means here, the Final Training Loss in the case of static augmentation was compared to that in the case of dynamic augmentation. As shown in Fig. 6, in the case of static augmentation, the Diversity was always less than in the typical case of dynamic augmentation. Moreover, across this large range of augmentations, the numerical span of Diversity was very small in the case of static augmentation, compared to dynamic augmentation. This suggests that this particular measure of Diversity is indeed connected to the number of unique or useful training images that can be created with a given augmentation. In the case of static augmentation, the number of unique images is exactly the same for all augmentations; dynamic augmentations allow for more unique images, and both the number and utility of unique images vary with augmentation.
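The two diversity metrics that are simple to state in code can be sketched as follows (the function names and the example distribution are ours, for illustration):

```python
import math

def augmentation_entropy(state_probs):
    """Shannon entropy (in nats) of a discrete augmentation's states,
    e.g. the finitely many outcomes of a fixed-angle rotation."""
    return -sum(p * math.log(p) for p in state_probs if p > 0)

def steps_to_accuracy_threshold(train_acc_per_step, threshold=0.97):
    """First training step at which training accuracy reaches the
    threshold; None if it is never reached (left blank in the results)."""
    for step, acc in enumerate(train_acc_per_step):
        if acc >= threshold:
            return step
    return None

# A flip applied with probability p has two states; entropy peaks at p = 0.5.
entropy_at_half = augmentation_entropy([0.5, 0.5])  # = ln 2
```

This also makes the limitation above concrete: a continuously-varying transform has no finite list of state probabilities to count, so its entropy is not directly comparable.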
F Comparing Affinity to other related measures
We gain confidence in the Affinity measure by comparing it to other potential model-dependent measures of distribution shift. In Fig. 10, we show the correlation between Affinity and two such measures: the mean log likelihood of augmented test images [49] (labeled "logsumexp(logits)") and the Watanabe–Akaike information criterion (labeled "WAIC") [50]. Like Affinity, these other two measures indicate how well a model trained on clean data comprehends augmented data.
(a) CIFAR-10; (b) ImageNet.

Figure 10: Affinity correlates with two other measures of how augmented images are related to a trained model's distribution: logsumexp of the logits (left, for CIFAR-10, and right, for ImageNet) is the mean log likelihood for the image. WAIC (middle, for CIFAR-10) corrects for a possible bias in that estimate. In all three plots, numbers are referenced to the clean baseline, which is assigned a value of 0.
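A minimal sketch of the logsumexp measure, assuming per-image rows of logits (the helper name and example arrays are illustrative, not the paper's code):

```python
import numpy as np

def mean_logsumexp(logits):
    """Mean over images of logsumexp of the logits, used as the mean
    log likelihood of the inputs under the trained model [49]."""
    logits = np.asarray(logits, float)
    m = logits.max(axis=1, keepdims=True)  # subtract max for stability
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return lse.mean()

# Referenced to the clean baseline, as in Fig. 10.
clean_logits = np.array([[5.0, 1.0, 0.5], [4.0, 2.0, 1.0]])
aug_logits = np.array([[3.0, 2.0, 1.5], [2.5, 2.0, 1.8]])
relative = mean_logsumexp(aug_logits) - mean_logsumexp(clean_logits)
```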
G Full results
The plotted data for CIFAR-10 and ImageNet are given in .csv files uploaded at https://storage.googleapis.com/public_research_data/augmentation/data.zip