Hypernetwork-Based Augmentation
Chih-Yang Chen, Che-Han Chang, and Edward Y. Chang
HTC Research & Healthcare, Stanford University
{harry.cy_chen,chehan_chang,edward_chang}@htc.com
Abstract
Data augmentation is an effective technique to improve the generalization of deep neural networks. Recently, AutoAugment [1] proposed a well-designed search space and a search algorithm that automatically finds augmentation policies in a data-driven manner. However, AutoAugment is computationally intensive. In this paper, we propose an efficient gradient-based search algorithm, called Hypernetwork-Based Augmentation (HBA), which simultaneously learns model parameters and augmentation hyperparameters in a single training. Our HBA uses a hypernetwork to approximate a population-based training algorithm, which enables us to tune augmentation hyperparameters by gradient descent. Besides, we introduce a weight sharing strategy that simplifies our hypernetwork architecture and speeds up our search algorithm. We conduct experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet. Our results demonstrate that HBA is significantly faster than state-of-the-art methods while achieving competitive accuracy.
Data augmentation techniques, such as cropping, horizontal flipping, and color jittering, are widely used in training deep neural networks for image classification. Data augmentation acts as a regularizer that reduces overfitting by transforming images to increase the quantity and diversity of training data. Recently, data augmentation has been shown to be effective not only for supervised learning [1, 2, 3], but also for semi-supervised learning [4, 5], self-supervised learning [6], and reinforcement learning (RL) [7]. However, given a new task or dataset, it is non-trivial to either manually or automatically design its data augmentation policy: determining a set of augmentation functions and adequately specifying their configurations, such as the range of rotation, the size of cropping, and the degree of color jittering. A weak augmentation policy may not help much, while a strong one could hinder performance by making an augmented image look inconsistent with its label. AutoAugment [1], a representative pioneering work by Cubuk et al., proposed a search space (consisting of 16 image operations) and an RL-based search algorithm to automate the process of finding an effective augmentation policy over the search space. AutoAugment made remarkable improvements in image classification. However, its search algorithm requires 5000 GPU hours for a single augmentation policy search, which is extremely computationally intensive.

To address the computational issue, we formulate the augmentation policy search problem as a hyperparameter optimization problem and propose an efficient algorithm for tuning data augmentation hyperparameters. The central idea of our method is a novel combination of a population-based training algorithm and a hypernetwork, which is a function that outputs the weights of a neural network. In particular, we introduce a population-based training procedure and four modifications to derive our algorithm. First, we employ a hypernetwork to represent the set of models. By doing so, training a hypernetwork can be treated as training a continuous set of models. Second, instead of evaluating the performance on a discrete set of models, thanks to the use of a hypernetwork, we can perform gradient descent and back-propagate gradients through the hypernetwork to efficiently find the approximate best model from the continuous set of models. Third, we propose a new hyper-layer for batch normalization [8] to facilitate the construction of hypernetworks. The resulting algorithm resembles the recently proposed Self-Tuning Network (STN) algorithm [9] but is derived entirely from a different perspective. Fourth, inspired by one-shot neural architecture search [10, 11, 12], we apply a weight sharing strategy across the models. We show that such a strategy corresponds to a simplified hypernetwork architecture that effectively reduces the search time and slightly improves the accuracy. Our method, dubbed Hypernetwork-Based Augmentation (HBA), is a gradient-based method that jointly trains the network model and tunes the augmentation hyperparameters. HBA yields hyperparameter schedules that can be used to train on different datasets or to train different network architectures. Experimental results show that HBA achieves significantly faster search speed while maintaining competitive accuracy. Table 1 summarizes our main results.

Table 1: Our HBA is at least an order of magnitude faster than AutoAugment (AA) [1], Population Based Augmentation (PBA) [2], and Fast AutoAugment (FAA) [3], while achieving similar accuracy. The table reports GPU hours and test error (%) for Wide-ResNet-28-10 [14] on CIFAR-10 [13] and SVHN [15], and for ResNet-50 [17] on ImageNet [16].

Our contributions can be summarized as follows. (1) We propose Hypernetwork-Based Augmentation (HBA), an efficient gradient-based method for automated data augmentation. (2) The derivation of our HBA algorithm reveals the underlying relationship between population-based and gradient-based methods. (3) We propose a weight-sharing strategy that simplifies our hypernetwork architecture and reduces the search time considerably.

Data Augmentation.
We review previous work relevant to the development of our method; we refer readers to [18] for a comprehensive survey of data augmentation. Hand-designed data augmentation techniques, such as horizontal flipping, random cropping, and color transformations, are commonly used in training deep neural networks for image classification [19, 17]. In recent years, several effective data augmentation techniques have been proposed. Cutout [20] randomly erases content by sampling a patch from the input image and replacing it with a constant value. Mixup [21] performs data interpolation that combines pairs of images and their labels in a convex manner to generate virtual training data. CutMix [22] generates new samples by cutting and pasting image patches within mini-batches. These data augmentation techniques are hand-designed, and their hyperparameters are usually manually tuned.
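For concreteness, the two techniques above admit compact implementations. The following NumPy sketch illustrates Cutout (erase a random patch) and Mixup (convex combination of images and one-hot labels); function names, patch-handling details, and default values are illustrative, not taken from the cited papers.

```python
import numpy as np

def cutout(image, size=8, rng=None):
    """Cutout-style erasing: fill a random square patch with a constant (zero).
    The patch center is sampled anywhere in the image; the patch is clipped at
    the borders."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0.0
    return out

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mixup-style interpolation: convex combination of two images and their
    one-hot labels, with mixing weight drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Illustrative usage on a dummy image and one-hot labels.
x = np.ones((32, 32, 3), dtype=np.float32)
x_erased = cutout(x, size=8)
x_mixed, y_mixed = mixup(x, np.array([1.0, 0.0]),
                         np.zeros_like(x), np.array([0.0, 1.0]))
```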
Automated Data Augmentation.
Inspired by recent advances in neural architecture search [23, 10, 24], one of the recent trends in data augmentation is automatically finding augmentation policies within a pre-defined search space in a data-driven manner. AutoAugment [1] introduced a well-designed search space and proposed an RL-based search algorithm that trains a recurrent neural network (RNN) controller to search for effective augmentation policies within the search space. Although AutoAugment achieves promising results, its search process is computationally expensive (5000 GPU hours on CIFAR-10). Our algorithm treats the data augmentation hyperparameters as continuous variables and efficiently optimizes them by gradient descent in a single round of training. Recently, several efficient search algorithms based on AutoAugment's search space have been proposed. Population Based Augmentation (PBA) [2] employed Population Based Training (PBT) [25], an evolution-based hyperparameter optimization algorithm, to search for data augmentation schedules. Fast AutoAugment [3] treated the problem as a density matching problem and used Bayesian optimization to find augmentation policies. OHL-Auto-Aug [26] proposed an online hyperparameter learning algorithm that jointly learns network parameters and the augmentation policy. Our algorithm also tunes augmentation hyperparameters in an online manner while achieving better search efficiency. RandAugment [27] simplified the search space of AutoAugment and used grid search to find the optimal augmentation policy. Differentiable Automatic Data Augmentation (DADA) [28], inspired by Differentiable Architecture Search (DARTS) [24], proposed a gradient-based method that relaxes the discrete policy selection to be differentiable and optimizes the augmentation policy by stochastic gradient descent. Our algorithm is also gradient-based but is fundamentally different from DADA: DADA is based on differentiable relaxation, while ours is based on population-based training and hypernetworks.
Hypernetworks.
Ha et al. [29] used hypernetworks to generate weights for recurrent networks. SMASH [11] presented a neural architecture search method that employs a hypernetwork to learn a mapping from a binary-encoded architecture space to the weight space. Self-Tuning Network (STN) [9] used a hypernetwork as an approximation to the best response function in bilevel optimization. Our method uses a hypernetwork to represent a set of models trained with different hyperparameters.
Hyperparameter Optimization.
Searching data augmentation policies can be formulated as a hyperparameter optimization problem. We refer readers to a recent survey paper [30] on hyperparameter optimization. Population Based Training (PBT) [25] presented a generic hyperparameter optimization algorithm that trains a population of models in parallel and periodically evaluates their performance to perform the so-called "exploit-and-explore" procedure. Recently, MacKay et al. proposed the Self-Tuning Network (STN) [9], a gradient-based method for tuning regularization hyperparameters, including data augmentation and dropout [31]. Our work is closely related to PBT [25] and STN [9]. In our method, we start from a population-based training algorithm and apply a series of modifications to reach our algorithm, which resembles the STN algorithm. Our algorithm can thus be viewed as an efficient approximation of the PBT algorithm. Our work differs from STN in three aspects. First, STN is based on the best response approximation to bilevel optimization, while our HBA is derived entirely from the perspective of population-based training; our method can therefore be viewed as a novel interpretation of STN. Second, to facilitate the construction of hypernetworks, we propose a new hyper-layer for batch normalization. Third, our method employs a weight sharing strategy to improve both accuracy and search time.
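As background for the derivation that follows, PBT's exploit-and-explore loop can be sketched on a toy one-dimensional problem. Everything here is illustrative (the quadratic objective, population size, and noise scale are made up); the sketch only shows the evaluate / exploit / explore cycle.

```python
import random

def pbt(n=4, rounds=20, sigma=0.3, seed=0):
    """Toy Population Based Training over a single scalar hyperparameter.
    The 'validation score' is a quadratic with its optimum at h = 2.0;
    higher is better."""
    rng = random.Random(seed)
    objective = lambda h: -(h - 2.0) ** 2
    # Initialize a population of n hyperparameter values.
    population = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(rounds):
        # Evaluate: score every member and pick the best performer.
        best = max(population, key=objective)
        # Exploit-and-explore: every member copies the best, then perturbs
        # its hyperparameter with Gaussian noise.
        population = [best + rng.gauss(0.0, sigma) for _ in range(n)]
    return max(population, key=objective)
```

Running `pbt()` drives the population toward the optimum at 2.0, illustrating how exploitation (copying the best member) and exploration (Gaussian perturbation) interact.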
Our proposed method HBA consists of two components: a search space and a search algorithm. We describe our search space in Section 3.1 and our search algorithm in Sections 3.2, 3.3, and 3.4. We start from a baseline algorithm (Section 3.2) and then present a series of modifications to derive HBA (Sections 3.3 and 3.4). Finally, we present our hypernetwork architecture and the proposed weight sharing strategy in Section 3.5.
We define a data augmentation policy as a stochastic transformation function constructed from a set of elementary image operations. Each operation is associated with a probability and a magnitude, which play the role of data augmentation hyperparameters. For a fair comparison in the experiments, our transformation function follows PBA [2]: it consists of a set of operations (listed in Appendix A) and performs three steps: (1) sampling an integer K from a categorical distribution as the number of operations to be applied, (2) sampling K of the operations according to their probabilities, and (3) sequentially applying the sampled operations to the input image. Please refer to Appendix A and PBA [2] for the details of the PBA search space. We note that unlike PBA, which discretizes continuous hyperparameters, we treat all hyperparameters as continuous variables.

We formulate tuning data augmentation hyperparameters as training a population of n models. Each model i starts from a different initialization of both model parameters θ_i and hyperparameters λ_i. During training, we periodically exploit the best performing model and explore its hyperparameters and parameters. Specifically, our baseline algorithm is defined as a population-based training procedure that iterates the following three steps.

Update.
For each model i, we update its parameters θ_i by T_train steps of stochastic gradient descent (SGD) with a learning rate of α and a batch size of 1, which can be expressed as

θ_i = θ_i − α ∇_θ L(F(A(x; λ_i); θ_i), y),   (1)

where (x, y) is a training example, F(·; θ_i) is the i-th model, A(·; λ_i) is the augmentation policy used for model i, and L is the loss function.

Evaluate.
We evaluate each model i on the validation set D_V and identify the best performing model k, where k = arg min_{i ∈ {1,...,n}} Σ_{(x_v, y_v) ∈ D_V} L(F(x_v; θ_i), y_v).

Exploit-and-Explore.
As exploitation, we select the best performing model k and copy its parameters and hyperparameters to replace those of all the other models. Then, for each model, we perform exploration by perturbing the values of its parameters and hyperparameters with Gaussian noise. This step can be expressed as λ_i = λ_k + ε_i and θ_i = θ_k + ε′_i, where ε_i ∼ N(0, σ) and ε′_i ∼ N(0, σ′) are Gaussian noise. Algorithm 1 shows the baseline algorithm described above. In terms of the exploitation-exploration tradeoff, we can see that our baseline algorithm emphasizes exploitation.

Algorithm 1
The baseline algorithm of HBA.
Input: population size n, number of steps T_outer and T_train, training set D_T, augmentation policy A, neural network F, learning rate α, validation set D_V, Gaussian sigmas σ and σ′.
Initialize {θ_i}_{i=1}^{n} and {λ_i}_{i=1}^{n}
for j = 1 to T_outer do
    for i = 1 to n (synchronously in parallel) do
        for t = 1 to T_train do
            Sample (x, y) from D_T
            θ_i = θ_i − α ∇_θ L(F(A(x; λ_i); θ_i), y)
        end for
    end for
    k = arg min_{i ∈ {1,...,n}} Σ_{(x_v, y_v) ∈ D_V} L(F(x_v; θ_i), y_v)
    for i = 1 to n do
        λ_i = λ_k + ε_i, where ε_i ∼ N(0, σ)
        θ_i = θ_k + ε′_i, where ε′_i ∼ N(0, σ′)
    end for
end for
Output: θ_k

We define a hypernetwork as a neural network θ̂_φ that takes the data augmentation hyperparameters λ as inputs and outputs the parameters θ of the neural network F, where φ denotes the parameters of the hypernetwork to be learned. We use the hypernetwork θ̂_φ to represent the members of the population by θ_i = θ̂_φ(λ_i) for i ∈ {1, ..., n}. With θ̂_φ, the three steps of the training procedure are modified as follows.

Update.
Training a population of models with parameters {θ_i}_{i=1}^{n} is changed into training a single hypernetwork with parameters φ. The SGD update formula in Equation 1 accordingly becomes

φ = φ − (α/n) Σ_{i=1}^{n} ∇_φ L(F(A(x_i; λ_i); θ̂_φ(λ_i)), y_i).   (2)

Compared with Equation 1, which independently updates each θ_i by SGD with a batch size of 1, Equation 2 updates φ by SGD with a batch size of n.

Evaluate.
This step is expressed as k = arg min_{i ∈ {1,...,n}} Σ_{(x_v, y_v) ∈ D_V} L(F(x_v; θ̂_φ(λ_i)), y_v).

Exploit-and-Explore.
This step is simplified into λ_i = λ_k + ε_i. The term θ_i disappears and is implicitly expressed by θ̂_φ(λ_k + ε_i), which can be seen as an approximation of θ̂_φ(λ_k) + ε′_i. Up to now, we have used a hypernetwork to reformulate population-based training as the training of a single hypernetwork, as shown in Algorithm 4 in Appendix B. We continue deriving our algorithm below and detail the design and construction of our hypernetwork architecture in Section 3.5.

A hypernetwork, as a continuous function, represents a continuous set of models. Therefore, instead of finding the best performing model within a set of n models, which is computationally heavy, we propose to solve min_λ Σ_{(x_v, y_v) ∈ D_V} L(F(x_v; θ̂_φ(λ)), y_v) approximately by SGD. The Evaluate step is modified as follows.

Evaluate.
Given the current hyperparameters λ, we apply T_val steps of SGD on the validation set:

λ = λ − (α′/n) Σ_{i=1}^{n} ∇_λ L(F(x_v^i; θ̂_φ(λ)), y_v^i).   (3)

Exploit-and-Explore.
This step is expressed as λ_i = λ + ε_i.

Figure 1: (a) A toy example network with three trainable layers (in green) and its linear hypernetwork (in blue), which consists of three linear hyper-layers. (b) If we run Algorithm 1 and share the parameters of the Conv and Linear layers, it corresponds to running Algorithm 2 with a simplified hypernetwork that has only a HyperBN layer.

Update.
Furthermore, by substituting λ_i = λ + ε_i into Equation 2, we merge the Exploit-and-Explore step into the Update step and obtain

φ = φ − (α/n) Σ_{i=1}^{n} ∇_φ L(F(A(x_i; λ + ε_i); θ̂_φ(λ + ε_i)), y_i).   (4)

Up to now, the baseline algorithm has been modified into a gradient-based algorithm that alternates between updating φ (Equation 4) and λ (Equation 3). We show the modified algorithm in Algorithm 5 in Appendix B.

Main Algorithm.
Lastly, instead of using the same perturbation noise {ε_i}_{i=1}^{n} across the T_train SGD steps (Equation 4), we re-sample {ε_i}_{i=1}^{n} at each iteration to enhance sample diversity. In other words, at each SGD step, we sample a different discrete set of models from the hypernetwork to update its parameters. With this modification, we finally arrive at our HBA algorithm, shown in Algorithm 2. We show the computation graph of Equations 3 and 4 in Figure 2 in Appendix F.

Algorithm 2
The main algorithm of HBA.
Input: batch size n, number of steps T_outer, T_train, and T_val, training set D_T, augmentation policy A, neural network F, learning rates α and α′, validation set D_V, Gaussian sigma σ.
Initialize φ and λ
for j = 1 to T_outer do
    for t = 1 to T_train do
        Sample {(x_i, y_i)}_{i=1}^{n} from D_T
        {ε_i}_{i=1}^{n} ∼ N(0, σ)
        φ = φ − (α/n) Σ_{i=1}^{n} ∇_φ L(F(A(x_i; λ + ε_i); θ̂_φ(λ + ε_i)), y_i)   ▷ Update φ (Equation 4)
    end for
    for t = 1 to T_val do
        Sample {(x_v^i, y_v^i)}_{i=1}^{n} from D_V
        λ = λ − (α′/n) Σ_{i=1}^{n} ∇_λ L(F(x_v^i; θ̂_φ(λ)), y_v^i)   ▷ Update λ (Equation 3)
    end for
end for
Output: θ̂_φ(λ)

We now turn back to the design and construction of our hypernetworks mentioned in Section 3.3. Recall that a hypernetwork takes the hyperparameters λ as inputs and generates the parameters θ of a neural network. Prior work STN [9] uses a linear neural network as the hypernetwork. Unfortunately, even a linear hypernetwork requires a prohibitively large number of parameters, namely |λ||θ|. STN assumes that the linear transform matrix is low-rank, which effectively reduces the parameter complexity from O(|λ||θ|) to O(|λ| + |θ|). Given the architecture of a neural network, its linear hypernetwork is constructed in a layer-wise manner: each trainable layer of the neural network is associated with a linear hyper-layer. In Figure 1(a), we show a toy example network and its corresponding hypernetwork. In this example, since the network has three trainable layers (a convolutional (Conv) layer, a batch normalization (BN) layer, and a linear layer), we associate each of the three layers with a linear hyper-layer to construct the linear hypernetwork. We adopt the linear hyper-layers proposed by STN.
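As a concrete example of such a hyper-layer, the HyperBN layer introduced next, θ_BN = φ_b + diag(φ_V λ) φ_U, can be sketched in a few lines of NumPy. The sizes below (c channels, d hyperparameters) and the random values are illustrative only; the sketch also shows that zeroing φ_V makes the hyper-layer a constant function of λ, which is exactly the weight-sharing case discussed later.

```python
import numpy as np

def hyper_bn(lmbda, phi_b, phi_U, phi_V):
    """HyperBN hyper-layer: maps hyperparameters lmbda (shape (d,)) to the
    affine parameters theta_BN (shape (c,)) of a batch-normalization layer.

    theta_BN = phi_b + diag(phi_V @ lmbda) @ phi_U
             = phi_b + (phi_V @ lmbda) * phi_U   # equivalent elementwise form

    phi_b, phi_U: shape (c,); phi_V: shape (c, d).
    """
    return phi_b + (phi_V @ lmbda) * phi_U

# Demo with illustrative sizes: c = 4 BN channels, d = 3 hyperparameters.
rng = np.random.default_rng(0)
c, d = 4, 3
phi_b = rng.standard_normal(c)
phi_U = rng.standard_normal(c)
phi_V = rng.standard_normal((c, d))
lmbda = rng.standard_normal(d)

theta_bn = hyper_bn(lmbda, phi_b, phi_U, phi_V)
assert theta_bn.shape == (c,)

# With phi_V = 0 the hyper-layer is constant in lmbda (weight sharing):
shared = hyper_bn(lmbda, phi_b, phi_U, np.zeros((c, d)))
assert np.allclose(shared, phi_b)
```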
Since STN lacks a hyper-layer for BN, we follow their design spirit and propose a HyperBN layer, which takes λ as input and outputs the affine parameters θ_BN ∈ R^c of the BN layer by θ_BN = HyperBN(λ; φ_b, φ_U, φ_V) = φ_b + diag(φ_V λ) φ_U, where φ_b, φ_U ∈ R^c, φ_V ∈ R^{c×|λ|}, and diag(·) turns a vector into a diagonal matrix. Due to the space limit, please refer to Appendix D for the details of the hyper-layers.

Table 2: Ablation study on different weight sharing strategies. Validation errors (%) and search time (GPU hours) are reported for Wide-ResNet-40-2 and Wide-ResNet-28-10 on CIFAR-10 and CIFAR-100 under the hyper-layer strategies Conv+BN, Conv, BN, 1st Conv, and 1st BN.

Weight Sharing across Models.
We propose a weight sharing strategy that effectively reduces both the search time and the memory consumption of our method. The central idea is that adopting a weight sharing strategy in Algorithm 1 corresponds to using a simplified hypernetwork in Algorithm 2. Specifically, under the perspective of population-based training (Algorithm 1), training a set of models with weight sharing can be viewed as multi-task training, where the i-th task corresponds to training model i with hyperparameters λ_i. If all the members of the population share a particular layer's parameters, then the corresponding hyper-layer in Algorithm 2 is equivalently a constant function, independent of the hyperparameters. In other words, if we share a layer's parameters in Algorithm 1, there is no need to pair it with a hyper-layer in Algorithm 2. By doing so, our hypernetwork becomes smaller, and training becomes faster. In Figure 1(b), we show an example of a hypernetwork that corresponds to population-based training with weight sharing applied to the Conv and Linear layers.

We first present an ablation study of weight sharing strategies and then evaluate HBA on four image classification datasets: CIFAR-10 [13], CIFAR-100 [13], SVHN [15], and ImageNet [16]. We compare HBA with standard augmentation (Baseline), Cutout [20], AutoAugment (AA) [1], Population Based Augmentation (PBA) [2], Fast AutoAugment (FAA) [3], OHL-Auto-Aug (OHLAA) [26], RandAugment (RA) [27], Adversarial AutoAugment (AdvAA) [32], and Differentiable Automatic Data Augmentation (DADA) [28]. Lastly, we compare with Self-Tuning Network (STN) [9] on hyperparameter optimization.
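The weight sharing strategies compared in our ablation study (Conv+BN, Conv, BN, 1st Conv, and 1st BN) amount to choosing which layers are paired with hyper-layers; every other layer shares its weights across the population. A small sketch of that selection logic, with hypothetical layer names:

```python
def hyper_layer_plan(layer_names, strategy):
    """Return the layers to pair with hyper-layers under a sharing strategy;
    all remaining layers share weights across the population. Following the
    ablation setup, the last linear layer is never given a hyper-layer.
    Layer names here are hypothetical identifiers, not a real architecture."""
    convs = [n for n in layer_names if "conv" in n]
    bns = [n for n in layer_names if "bn" in n]
    return {
        "Conv+BN": convs + bns,   # hyper-layers on all Conv and BN layers
        "Conv": convs,            # hyper-layers on all Conv layers only
        "BN": bns,                # hyper-layers on all BN layers only
        "1st Conv": convs[:1],    # hyper-layer on the first Conv layer only
        "1st BN": bns[:1],        # hyper-layer on the first BN layer only
    }[strategy]

layers = ["conv1", "bn1", "conv2", "bn2", "linear"]
print(hyper_layer_plan(layers, "1st BN"))  # → ['bn1']
```

Under the 1st BN strategy, only the first batch-normalization layer receives a hyper-layer, which is the smallest hypernetwork among the five options.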
Implementation Details.
Following PBA [2], our search space consists of the operations listed in Appendix A, where each operation has two copies. Please see Appendix A for the list of operations and their hyperparameters. Following STN [9], HBA trains the hypernetwork parameters using SGD with T_train = 2 and optimizes the hyperparameters using Adam [33] with T_val = 1. We implemented HBA using the PyTorch [34] framework and measured its search time on an NVIDIA Tesla V100 GPU. For all of our results, we report the mean and standard deviation over five runs with different random seeds, measured at the last epoch of training. For policy evaluation, we scaled the length of the discovered schedules linearly when necessary.

We conducted an ablation study on different weight sharing strategies. In general, sharing a subset of the trainable layers' parameters, from the perspective of population-based training, means that we do not associate those layers with hyper-layers. We experimented with five weight sharing strategies: Conv+BN, Conv, BN, 1st Conv, and 1st BN, named by which layers are paired with hyper-layers. For example, the Conv strategy adds HyperConv layers to all the convolutional layers and ignores the other layers. No strategy adds a hyper-layer to the last linear layer of the network, which performed better in our preliminary experiments. The intuition behind these strategies is as follows: for the low-level layers, we do not share parameters, so that they are forced to reflect the influence of different augmentation policies; for the mid- and high-level layers, we share parameters to encourage the population to learn a joint feature representation. In this study, we split the CIFAR training set into three sets (30k/10k/10k), denoted Train30k/Val0/Val; the CIFAR test set was not used. For policy search, we applied HBA to train on the Train30k/Val0 split to find policies.
For policy evaluation, we used the searched policies to train networks on the Train30k+Val0 set and evaluated them on the Val set. Table 2 shows the comparison results between the five strategies. We can see that adding fewer hyper-layers reduces search time and achieves slightly better performance across different datasets and models. Based on the ablation results, we employ the 1st BN strategy in the subsequent experiments.

Table 3: Comparison of search time in GPU hours for AA, PBA, FAA, OHLAA, AdvAA, DADA, and HBA on CIFAR-10, SVHN, and ImageNet, together with the GPU used by each method (P100, Titan XP, or V100).

Table 4: CIFAR-10 and CIFAR-100 test error rates (%). WRN, SS, and PN+SD are shorthand for Wide-ResNet, Shake-Shake, and PyramidNet+ShakeDrop, respectively. Results are reported for WRN-40-2, WRN-28-10, SS (26 2x32d), SS (26 2x96d), SS (26 2x112d), and PN+SD, comparing Baseline, Cutout, AA, PBA, FAA, RA, AdvAA, DADA, and HBA.

For each dataset, we randomly sampled a subset of 10,000 images from the original training set as the validation set D_V for HBA. Following the experimental setting of [1, 2, 3], we randomly sampled a subset from the remaining images to create a reduced version of the training set. For policy search, we applied HBA to train a Wide-ResNet-40-2 (WRN-40-2) [14] network on the reduced datasets. For policy evaluation, we evaluated the performance of a model M by using the searched policy to train M on the original training set and measuring the accuracy on the test set. Please see Tables 10 and 11 in Appendix E for the detailed hyperparameter settings.

CIFAR-10 and CIFAR-100.
The CIFAR-10 and CIFAR-100 datasets consist of 60,000 natural images of size 32x32; the training and test sets have 50,000 and 10,000 images, respectively. We applied our augmentation policy, the baseline augmentation, and Cutout (with 16x16 pixels) in sequence to each training image. The baseline augmentation is defined as the following operations in sequence: standardization, horizontal flipping, and random cropping. We evaluated our searched policies with Wide-ResNet-40-2 (WRN-40-2) [14], Wide-ResNet-28-10 (WRN-28-10) [14], Shake-Shake [35], and PyramidNet+ShakeDrop [36]. As shown in Tables 3 and 4, HBA significantly improves performance over the baseline and Cutout, while achieving competitive accuracy with the compared methods. HBA's policy search on the Reduced CIFAR-10 is an order of magnitude faster than PBA (a population-based method) and comparable to DADA (a gradient-based method).
SVHN.
The SVHN dataset consists of 73,257 training images (the core training set), 531,131 additional training images, and 26,032 test images. We applied our augmentation policy, the baseline augmentation, and Cutout (with 20x20 pixels) in sequence to each training image. The baseline augmentation applied standardization only. We evaluated our searched policies with WRN-40-2, WRN-28-10, and Shake-Shake. As shown in Tables 3 and 5, HBA achieves competitive accuracy and slightly outperforms PBA and DADA.
ImageNet.
The ImageNet dataset has a training set of about 1.2M images and a validation set of 50,000 images. We applied our augmentation policy and then the baseline augmentation to each training image. Following [3], the baseline augmentation applied random resized crop, horizontal flip, color jittering, PCA jittering, and standardization in sequence. We evaluated our searched policies with ResNet-50 [17]. As shown in Tables 3 and 6, HBA is at least three orders of magnitude faster than all the compared methods except DADA. Compared with DADA, HBA achieves higher accuracy.

Table 5: SVHN test error rates (%) of Wide-ResNet-28-10 and Shake-Shake (26 2x96d), comparing Baseline, Cutout, AA, PBA, FAA, RA, DADA, and HBA.

Table 6: ImageNet validation error rates of ResNet-50. Top-1 and Top-5 errors (%) are reported for Baseline, AA, FAA, OHLAA, RA, AdvAA, DADA, and HBA.

Table 7: Comparison between policy search on a proxy task (top half) and target tasks (bottom half). For each policy search setting (dataset, model, and search time in GPU hours), the table reports the test error (%) of policy evaluation on the corresponding dataset and model.

Table 8: Comparison with STN [9] on CIFAR-10 using AlexNet. Validation loss, test loss, and test error (%) are reported.

Policy Search on the Target Tasks.
In the previous experiments (Tables 3, 4, 5, and 6), we applied HBA to a proxy task, which trains a smaller network (WRN-40-2) on a reduced dataset. Here, we compare against policies searched on the target tasks, i.e., training the target network on the full dataset to find the augmentation policy. We consider training a model M ∈ {WRN-40-2, WRN-28-10} on a dataset D ∈ {CIFAR-10, CIFAR-100}, giving us four target tasks. Table 7 shows that searching on the target tasks requires longer search time while achieving slightly better performance (in three out of four cases), which is similar to the findings in RA [27] and DADA [28].

Comparison with STN.
We followed the experimental settings of STN [9] and applied HBA to simultaneously train an AlexNet [19] model and tune the regularization hyperparameters on CIFAR-10. We employed the Conv strategy to construct our hypernetwork for the AlexNet model. As shown in Table 8, HBA performs much better than STN, demonstrating the effectiveness of our weight sharing strategy.
Conclusion
In this paper, we proposed Hypernetwork-Based Augmentation (HBA) for automated data augmentation. HBA employs a hypernetwork to train a continuous set of models and uses gradient descent to tune augmentation hyperparameters. Our weight sharing strategy improves both search speed and accuracy. Experimental results showed that HBA is significantly faster than state-of-the-art methods while offering comparable accuracy. Future directions include: (1) applying HBA to medical image datasets, (2) exploring hybrid algorithms that interpolate between PBT and HBA, and (3) determining the hypernetwork architecture by neural architecture search methods.
Broader Impact
Our work is a new algorithm for automated data augmentation, which can be treated as a specific technical component in an AutoML system. On the one hand, AutoML makes it easier for non-experts to make use of machine learning models and techniques. On the other hand, AutoML helps machine learning researchers to lift the focus of development from feature design to architecture design and optimization scheme design.
References

[1] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In CVPR, 2019.
[2] Daniel Ho, Eric Liang, Ion Stoica, Pieter Abbeel, and Xi Chen. Population based augmentation: Efficient learning of augmentation policy schedules. In ICML, 2019.
[3] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast AutoAugment. In NeurIPS, 2019.
[4] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2019.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[7] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[9] Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In ICLR, 2019.
[10] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[11] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In ICLR, 2018.
[12] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018.
[13] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[14] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[15] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 2019.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[20] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552, 2017.
[21] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[22] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[23] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[24] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
[25] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
[26] Chen Lin, Minghao Guo, Chuming Li, Xin Yuan, Wei Wu, Junjie Yan, Dahua Lin, and Wanli Ouyang. Online hyper-parameter learning for auto-augmentation strategy. In ICCV, 2019.
[27] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719, 2019.
[28] Yonggang Li, Guosheng Hu, Yongtao Wang, Timothy Hospedales, Neil M. Robertson, and Yongxing Yang. DADA: Differentiable automatic data augmentation. arXiv preprint arXiv:2003.03780, 2020.
[29] David Ha, Andrew Dai, and Quoc V. Le. Hypernetworks. In ICLR, 2017.
[30] Tong Yu and Hong Zhu. Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689, 2020.
[31] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[32] Xinyu Zhang, Qiang Wang, Jian Zhang, and Zhao Zhong. Adversarial AutoAugment. In ICLR, 2020.
[33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
[35] Xavier Gastaldi. Shake-shake regularization. In ICLR, 2017.
[36] Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. ShakeDrop regularization for deep residual learning. IEEE Access, 2019.

Appendix A Search space
Elementary Image Operations.
Table 9 shows the list of augmentation operations used in our search space. When applying an operation op ∈ {ShearX, ShearY, TranslateX, TranslateY, Rotate} to an input image, we randomly negate the sampled magnitude with a probability of 0.5.

Augmentation Function.
Our search space is based on PBA [2]. Please refer to Algorithm 1 in the PBA paper for the augmentation function used by both PBA and our HBA.
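As a rough illustration of such an augmentation function (this is a hedged sketch, not the paper's or PBA's actual code), each operation carries its own probability p and magnitude m and fires independently. The names `Operation`, `apply_policy`, and the stand-in transform `shear_x` are hypothetical; a real implementation would invoke PIL-style image transforms instead of recording calls.

```python
import random

# Hypothetical sketch of a PBA-style augmentation function.  Instead of
# editing pixels, the stand-in transform records which op fired so the
# control flow is easy to inspect.

def shear_x(img, m):
    # placeholder transform: append a record rather than warping pixels
    return img + [("ShearX", m)]

class Operation:
    """One search-space operation with probability p and magnitude m."""
    def __init__(self, fn, p, m):
        self.fn, self.p, self.m = fn, p, m

    def __call__(self, img):
        # fire with probability p, at magnitude m
        if random.random() < self.p:
            return self.fn(img, self.m)
        return img

def apply_policy(img, ops):
    # operations are applied independently, in order
    for op in ops:
        img = op(img)
    return img

always = Operation(shear_x, p=1.0, m=0.3)
never = Operation(shear_x, p=0.0, m=0.3)
print(apply_policy([], [always, never]))  # only the p=1.0 op fires
```

In HBA, the per-operation probabilities and magnitudes above are the hyperparameters λ that the search tunes.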
Initialization.
For each augmentation hyperparameter, we initialize its value to 0.5 M_min + 0.5 M_max, where [M_min, M_max] is the magnitude range of the hyperparameter.

Implementation Details.

Following STN [9], we define the hyperparameters λ as the unbounded version of the operation magnitudes and probabilities by mapping a magnitude range [M_min, M_max] to (−∞, ∞) through a logit function (the inverse of the sigmoid function). The magnitude range of each operation is shown in Table 9, and the probability range of each operation is [0, 1].

Table 9: List of augmentation operations.

Operation    | Range of Magnitude | Unit of Magnitude
ShearX       | [0, 0.3]           | -
ShearY       | [0, 0.3]           | -
TranslateX   | [0, 0.45]          | Image size
TranslateY   | [0, 0.45]          | Image size
Rotate       | [0, 30]            | Degree
AutoContrast | None               | -
Invert       | None               | -
Equalize     | None               | -
Solarize     | [0, 255]           | -
Posterize    | [0, 8]             | Bit
Contrast     | [0.1, 1.9]         | -
Color        | [0.1, 1.9]         | -
Brightness   | [0.1, 1.9]         | -
Sharpness    | [0.1, 1.9]         | -
Cutout       | [0, 0.2]           | Image size

Appendix B The Derivation of Hypernetwork-Based Augmentation (HBA)

Algorithms 3, 4, 5, and 6 show the derivation of our algorithm from a population-based to a gradient-based training algorithm. For each algorithm, we highlight the changes from the previous algorithm in blue. The high-level descriptions of these four algorithms are:
• Algorithm 3: training a population of models by stochastic gradient descent.
• Algorithm 4: representing a population of models by a hypernetwork.
• Algorithm 5: approximating model selection by stochastic gradient descent.
• Algorithm 6: the main algorithm.
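Before the listings, a toy numeric sketch of the starting point, Algorithm 3's population-based training loop, may help. Here the "model" is a single scalar parameter and the losses are illustrative quadratics, so every name (`train_population`, `val_loss`, the target values) is hypothetical and stands in for real networks and datasets.

```python
import random

# Toy sketch of the population-based baseline (Algorithm 3): n scalar
# "models" theta_i with hyperparameters lam_i, trained by SGD, then
# exploit-and-explore.  The quadratic losses are illustrative only.

def train_population(n=4, T_outer=20, T_train=10, alpha=0.1,
                     sigma=0.3, sigma_p=0.01, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-1, 1) for _ in range(n)]
    lam = [rng.uniform(-1, 1) for _ in range(n)]
    target = 0.5  # the "best" hyperparameter value in this toy problem

    def val_loss(i):
        # validation loss: distance of theta_i and lam_i from toy optima
        return (theta[i] - 1.0) ** 2 + (lam[i] - target) ** 2

    for _ in range(T_outer):
        for i in range(n):
            for _ in range(T_train):
                # SGD step on a toy training loss with optimum at 1.0
                grad = 2 * (theta[i] - 1.0)
                theta[i] -= alpha * grad
        # model selection on the validation set
        k = min(range(n), key=val_loss)
        for i in range(n):
            # exploit the best model, then explore by Gaussian perturbation
            lam[i] = lam[k] + rng.gauss(0, sigma)
            theta[i] = theta[k] + rng.gauss(0, sigma_p)
    k = min(range(n), key=val_loss)
    return theta[k], lam[k]

theta_k, lam_k = train_population()
print(round(theta_k, 2))  # the selected model fits the toy objective near 1.0
```

Algorithms 4-6 below progressively replace the explicit population with a hypernetwork and the selection step with gradient descent on λ.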
Algorithm 3
The baseline algorithm of HBA in Section 3.2.
Input: population size n, number of steps T_outer and T_train, training set D_T, augmentation policy A, neural network F, learning rate α, validation set D_V, Gaussian sigmas σ and σ′.
Initialize {θ_i}_{i=1}^{n} and {λ_i}_{i=1}^{n}
for j = 1 to T_outer do
  for i = 1 to n (synchronously in parallel) do
    for t = 1 to T_train do
      Sample (x, y) from D_T
      θ_i = θ_i − α ∇_θ L(F(A(x; λ_i); θ_i), y)
    end for
  end for
  k = argmin_{i∈{1,…,n}} Σ_{(x_v,y_v)∈D_V} L(F(x_v; θ_i), y_v)
  for i = 1 to n do
    λ_i = λ_k + ε_i where ε_i ∼ N(0, σ)
    θ_i = θ_k + ε′_i where ε′_i ∼ N(0, σ′)
  end for
end for
Output: θ_k

Algorithm 4
The modified baseline algorithm in Section 3.3.
Input: batch size n, number of steps T_outer and T_train, training set D_T, augmentation policy A, neural network F, learning rate α, validation set D_V, Gaussian sigma σ.
Initialize φ and {λ_i}_{i=1}^{n}
for j = 1 to T_outer do
  for t = 1 to T_train do
    Sample {(x_i, y_i)}_{i=1}^{n} from D_T
    φ = φ − (α/n) Σ_{i=1}^{n} ∇_φ L(F(A(x_i; λ_i); θ̂_φ(λ_i)), y_i)
  end for
  k = argmin_{i∈{1,…,n}} Σ_{(x_v,y_v)∈D_V} L(F(x_v; θ̂_φ(λ_i)), y_v)
  for i = 1 to n do
    λ_i = λ_k + ε_i where ε_i ∼ N(0, σ)
  end for
end for
Output: θ̂_φ(λ_k)

Algorithm 5
The modified baseline algorithm in Section 3.4.
Input: batch size n, number of steps T_outer, T_train, and T_val, training set D_T, augmentation policy A, neural network F, learning rates α and α′, validation set D_V, Gaussian sigma σ.
Initialize φ and λ
for j = 1 to T_outer do
  {ε_i}_{i=1}^{n} ∼ N(0, σ)
  for t = 1 to T_train do
    Sample {(x_i, y_i)}_{i=1}^{n} from D_T
    φ = φ − (α/n) Σ_{i=1}^{n} ∇_φ L(F(A(x_i; λ + ε_i); θ̂_φ(λ + ε_i)), y_i)
  end for
  for t = 1 to T_val do
    Sample {(x_i^v, y_i^v)}_{i=1}^{n} from D_V
    λ = λ − (α′/n) Σ_{i=1}^{n} ∇_λ L(F(x_i^v; θ̂_φ(λ)), y_i^v)
  end for
end for
Output: θ̂_φ(λ)

Algorithm 6
The main algorithm of HBA in Section 3.4.
Input: batch size n, number of steps T_outer, T_train, and T_val, training set D_T, augmentation policy A, neural network F, learning rates α and α′, validation set D_V, Gaussian sigma σ.
Initialize φ and λ
for j = 1 to T_outer do
  for t = 1 to T_train do
    Sample {(x_i, y_i)}_{i=1}^{n} from D_T
    {ε_i}_{i=1}^{n} ∼ N(0, σ)
    φ = φ − (α/n) Σ_{i=1}^{n} ∇_φ L(F(A(x_i; λ + ε_i); θ̂_φ(λ + ε_i)), y_i)
  end for
  for t = 1 to T_val do
    Sample {(x_i^v, y_i^v)}_{i=1}^{n} from D_V
    λ = λ − (α′/n) Σ_{i=1}^{n} ∇_λ L(F(x_i^v; θ̂_φ(λ)), y_i^v)
  end for
end for
Output: θ̂_φ(λ)

Appendix C Our Baseline Algorithm is a Special Case of PBT

Our baseline algorithm (Algorithm 3) can be expressed in the format of PBT [25] as follows:
• Step: each model runs an SGD step with a batch size of 1.
• Eval: we evaluate each model on a validation set by the cross-entropy loss.
• Ready: each model goes through the exploit-and-explore process every T_train steps.
• Exploit: each model clones the weights and hyperparameters of the best-performing model.
• Explore: for each model, we perturb the values of its parameters and hyperparameters.
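The Exploit and Explore steps above compose into exactly the fused update used in Algorithm 3 (λ_i = λ_k + ε_i, and likewise for θ). A tiny sketch with illustrative names and a fixed noise draw makes the equivalence concrete:

```python
import random

# The Exploit step (clone the best member k) followed by the Explore step
# (add Gaussian noise) composes to the baseline's single fused update
# lam_i = lam_k + eps_i.  Checked here on a fixed noise draw.

def exploit_then_explore(lam, k, eps):
    # PBT formulation: clone the best member, then perturb everyone
    cloned = [lam[k] for _ in lam]               # Exploit
    return [c + e for c, e in zip(cloned, eps)]  # Explore

def combined_update(lam, k, eps):
    # Baseline formulation (Algorithm 3): one fused update
    return [lam[k] + e for e in eps]

rng = random.Random(1)
lam = [0.2, -0.4, 0.9]
eps = [rng.gauss(0, 0.1) for _ in lam]
assert exploit_then_explore(lam, k=2, eps=eps) == combined_update(lam, k=2, eps=eps)
print("equivalent")
```

The same check applies verbatim to the θ update with its own noise scale σ′.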
Algorithm 7
The baseline algorithm of HBA in the format of PBT.
Input: population size n, number of steps T_outer and T_train, training set D_T, augmentation policy A, neural network F, learning rate α, validation set D_V, Gaussian sigmas σ and σ′.
Initialize {θ_i}_{i=1}^{n} and {λ_i}_{i=1}^{n}
for j = 1 to T_outer do
  for i = 1 to n (synchronously in parallel) do
    for t = 1 to T_train do
      // Step
      Sample (x, y) from D_T
      θ_i = θ_i − α ∇_θ L(F(A(x; λ_i); θ_i), y)
    end for
  end for
  // Eval
  k = argmin_{i∈{1,…,n}} Σ_{(x_v,y_v)∈D_V} L(F(x_v; θ_i), y_v)
  // Exploit
  for i = 1 to n do
    λ_i = λ_k
    θ_i = θ_k
  end for
  // Explore
  for i = 1 to n do
    λ_i = λ_i + ε_i where ε_i ∼ N(0, σ)
    θ_i = θ_i + ε′_i where ε′_i ∼ N(0, σ′)
  end for
end for
Output: θ_k

Appendix D Hyper-layers
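As a concrete companion to this appendix, the following is a minimal pure-Python sketch of the HyperLinear mapping W = W₀ + diag(Vλ)U (Equation 9). The helper names and the tiny shapes are illustrative; the paper's layers would be implemented as framework modules (e.g., PyTorch [34]) rather than list-of-lists arithmetic.

```python
# Minimal numeric sketch of a HyperLinear layer (Appendix D):
# W(lam) = W0 + diag(V @ lam) @ U, i.e., a rank-1 best-response per row.
# Pure-Python matrices (lists of lists); W0, U are c2 x c1, V is c2 x n.

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def hyperlinear_weights(W0, U, V, lam):
    # scale the i-th row of U by (V @ lam)_i and add it to W0
    s = matvec(V, lam)  # s = V @ lam, one scalar per output row
    return [[w + s_i * u for w, u in zip(w_row, u_row)]
            for w_row, u_row, s_i in zip(W0, U, s)]

def hyperlinear_forward(x, W0, U, V, lam):
    # y = W(lam) @ x, the hypernetwork's best-response forward pass
    return matvec(hyperlinear_weights(W0, U, V, lam), x)

# c1 = 2 inputs, c2 = 2 outputs, n = 1 hyperparameter
W0 = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 1.0], [1.0, 1.0]]
V = [[0.5], [-0.5]]
y = hyperlinear_forward([1.0, 1.0], W0, U, V, lam=[2.0])
print(y)  # [3.0, -1.0]
```

Setting λ = 0 recovers the plain linear layer W₀, which is why the hypernetwork adds only the low-rank terms U and V on top of an ordinary weight matrix.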
HyperLinear Layer.
Let f be a linear layer. We have

y = f(x; W) = W x,    (5)

where x ∈ R^{c1}, y ∈ R^{c2}, and the weight matrix W ∈ R^{c2×c1} contains the parameters of the linear layer. Given a linear layer f, we define its corresponding HyperLinear layer Ŵ as a linear function that maps hyperparameters λ ∈ R^{n} to the weight matrix W ∈ R^{c2×c1} of the linear layer f. We can decompose the HyperLinear layer Ŵ into a set of linear functions {ŵ_i(λ)}_{i=1}^{c2} such that each function ŵ_i(λ) outputs the transpose of the i-th row of W. Ŵ can be expressed as

W = Ŵ(λ) = [ŵ_1(λ), …, ŵ_{c2}(λ)]^T    (6)

and

ŵ_i(λ) = w_i + A_i λ,    (7)

where A_i ∈ R^{c1×n} and w_i ∈ R^{c1}. A HyperLinear layer parameterized by {A_i}_{i=1}^{c2} and {w_i}_{i=1}^{c2} requires n c1 c2 and c1 c2 parameters, respectively. Due to its prohibitively large memory consumption, following Self-Tuning Networks (STN) [9], we assume A_i is a rank-1 matrix to greatly reduce the number of parameters of the HyperLinear layer. Specifically, we define A_i = u_i v_i^T, where u_i ∈ R^{c1} and v_i ∈ R^{n}. By doing so, the number of parameters of the HyperLinear layer is reduced to (n + c1) c2 + c1 c2. ŵ_i(λ) is written as

ŵ_i(λ) = w_i + A_i λ = w_i + u_i v_i^T λ = w_i + (v_i^T λ) u_i.    (8)

The HyperLinear layer Ŵ(λ) can then be expressed as

W = Ŵ(λ) = [ŵ_1(λ), …, ŵ_{c2}(λ)]^T = W_0 + diag(V λ) U    (9)

where W_0 = [w_1, …, w_{c2}]^T ∈ R^{c2×c1}, V = [v_1, …, v_{c2}]^T ∈ R^{c2×n}, U = [u_1, …, u_{c2}]^T ∈ R^{c2×c1}, and diag(·) turns a vector into a diagonal matrix. In particular, W, W_0, and U have the same matrix size. Let W_0, U, and V be denoted by φ_W, φ_U, and φ_V, respectively. The HyperLinear layer can be expressed in a general form as

W = Ŵ(λ; φ_W, φ_U, φ_V) = φ_W + diag(φ_V λ) φ_U.    (10)

For a linear layer with a bias b, the bias part of the HyperLinear layer can additionally be defined in a similar way as

b = b̂(λ; φ_b, φ_U^b, φ_V^b) = φ_b + diag(φ_V^b λ) φ_U^b    (11)

where φ_b ∈ R^{c2}, φ_V^b ∈ R^{c2×n}, and φ_U^b ∈ R^{c2}.

HyperConv Layer.
A linear layer can be interpreted as a 1×1 convolutional layer by (1) viewing the input vector x as an image with c1 channels and a spatial size of 1×1, and (2) viewing the weight matrix W as the set of 1×1 convolutional filters, where each row of W is a filter. Therefore, we can define the hyper-layer of a 1×1 convolutional layer in the same way as the HyperLinear layer. We follow the definition of the HyperLinear layer (Equation 10) and define the HyperConv1x1 layer as

θ_conv1x1 = θ̂_conv1x1(λ; φ_c, φ_U^c, φ_V^c) = φ_c + diag(φ_V^c λ) φ_U^c,    (12)

where θ_conv1x1, φ_c, φ_U^c ∈ R^{c2×c1} are three sets of 1×1 filters, and c1 and c2 are the numbers of input and output channels, respectively. diag(φ_V^c λ) φ_U^c can be interpreted as a filter-wise scaling of φ_U^c, in which the j-th filter weights of φ_U^c are scaled by the j-th element of φ_V^c λ. In general, the HyperConv layer for a k×k convolutional layer can be defined from the HyperConv1x1 layer by changing φ_c and φ_U^c from 1×1 to k×k filters as follows:

θ_conv = θ̂_conv(λ; φ_c, φ_U^c, φ_V^c) = φ_c + diag(φ_V^c λ) φ_U^c,    (13)

where each row of φ_c and φ_U^c corresponds to the parameters of a k×k filter. Specifically, θ_conv, φ_c, φ_U^c ∈ R^{c2×c1k²} and φ_V^c ∈ R^{c2×n}. For a Conv layer with a bias, the bias part of the HyperConv layer is the same as the HyperLinear one (Equation 11).

HyperBN Layer.
A batch normalization layer has a trainable affine transformation. Following the design spirit of the HyperLinear and HyperConv layers, we denote the affine parameters by θ_BN and define the HyperBN layer as

θ_BN = HyperBN(λ; φ_BN, φ_U^BN, φ_V^BN) = φ_BN + diag(φ_V^BN λ) φ_U^BN,    (14)

where φ_BN ∈ R^{2c2}, φ_V^BN ∈ R^{2c2×n}, φ_U^BN ∈ R^{2c2}, and c2 is the number of output channels. There is a factor of 2 because the affine transformation has c2 scaling parameters and c2 offset parameters.

Appendix E Hyperparameters

Tables 10 and 11 show the hyperparameters used in policy search and policy evaluation, respectively. For the ImageNet dataset, we used a step-decay learning rate schedule.

Table 10: Hyperparameters for policy search. Entries marked "?" were illegible in the source.

Dataset                   | Reduced CIFAR-10 | Reduced SVHN | Reduced ImageNet | CIFAR-10 / CIFAR-100
No. classes               | 10               | 10           | 120              | 10
No. training images       | 4,000            | 4,000        | 6,000            | 40,000
No. validation images     | 10,000           | 10,000       | 10,000           | 10,000
Model                     | WRN-40-2         | WRN-40-2     | WRN-40-2         | WRN-40-2 / WRN-28-10
Input size                | 32x32            | 32x32        | 32x32            | 32x32
No. training epochs       | 200              | 160          | 270              | 200
Learning rate α           | 0.05             | 0.01         | ?                | ?
Learning rate schedule (α)| cosine           | cosine       | step             | cosine
Weight decay              | 0.005            | 0.01         | 0.001            | ?
Batch size                | 128              | 128          | 128              | 128
Learning rate α′          | 0.03             | 0.02         | 0.007            | ?
Learning rate schedule (α′)| constant        | constant     | constant         | constant
Results in the main paper | Table 4          | Table 5      | Table 6          | Table 7

Table 11: Hyperparameters for policy evaluation. LR: learning rate. Schedule: learning rate schedule. WD: weight decay. BS: batch size. Epoch: number of training epochs. All hyperparameters follow the settings in PBA [2] except those for the ImageNet dataset. We used NVIDIA Tesla V100 GPUs to train ResNet-50 on the ImageNet dataset.

Dataset   | Model                    | LR    | Schedule | WD      | BS  | Epoch
CIFAR-10  | WRN-40-2                 | 0.1   | cosine   | 0.0005  | 128 | 200
CIFAR-10  | WRN-28-10                | 0.1   | cosine   | 0.0005  | 128 | 200
CIFAR-10  | Shake-Shake (26 2x32d)   | 0.01  | cosine   | 0.001   | 128 | 1800
CIFAR-10  | Shake-Shake (26 2x96d)   | 0.01  | cosine   | 0.001   | 128 | 1800
CIFAR-10  | Shake-Shake (26 2x112d)  | 0.01  | cosine   | 0.001   | 128 | 1800
CIFAR-10  | PyramidNet+ShakeDrop     | 0.05  | cosine   | 0.00005 | 64  | 1800
CIFAR-100 | WRN-28-10                | 0.1   | cosine   | 0.0005  | 128 | 200
CIFAR-100 | Shake-Shake (26 2x96d)   | 0.01  | cosine   | 0.0025  | 128 | 1800
CIFAR-100 | PyramidNet+ShakeDrop     | 0.025 | cosine   | 0.0005  | 64  | 1800
SVHN      | WRN-28-10                | 0.005 | cosine   | 0.001   | 128 | 160
SVHN      | Shake-Shake (26 2x96d)   | 0.01  | cosine   | 0.00015 | 128 | 160
ImageNet  | ResNet-50                | 0.1   | step     | 0.0001  | 256 | 270

Appendix F Computation graph of HBA
Figure 2: The computation graphs of our algorithm. The red arrows represent the gradient flow of the backward propagation.

Preprint. Under review.