Bag of Tricks for Image Classification with Convolutional Neural Networks
Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li
Amazon Web Services
{htong, zhiz, hzaws, zhongyue, junyuanx, mli}@amazon.com

Abstract
Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we will examine a collection of such refinements and empirically evaluate their impact on the final model accuracy through ablation study. We will show that, by combining these refinements together, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. We will also demonstrate that improvement on image classification accuracy leads to better transfer learning performance in other application domains such as object detection and semantic segmentation.
1. Introduction
Since the introduction of AlexNet [15] in 2012, deep convolutional neural networks have become the dominating approach for image classification. Various new architectures have been proposed since then, including VGG [24], NiN [16], Inception [26], ResNet [9], DenseNet [13], and NASNet [34]. At the same time, we have seen a steady trend of model accuracy improvement. For example, the top-1 validation accuracy on ImageNet [23] has been raised from 62.5% (AlexNet) to 82.7% (NASNet-A).

However, these advancements did not come solely from improved model architectures. Training procedure refinements, including changes in loss functions, data preprocessing, and optimization methods, also played a major role. A large number of such refinements has been proposed in the past years, but they have received relatively little attention. In the literature, most were only briefly mentioned as implementation details, while others can only be found in source code.

Model                         FLOPs    Top-1    Top-5
ResNet-50 [9]                 3.9 G    75.3     92.2
ResNeXt-50 [27]               4.2 G    77.8     -
SE-ResNet-50 [12]             3.9 G    76.71    93.38
SE-ResNeXt-50 [12]            4.3 G    78.90    94.51
DenseNet-201 [13]             4.3 G    77.42    93.66
ResNet-50 + tricks (ours)     4.3 G    79.29    -

Table 1: Computational costs and validation accuracy of various models. ResNet, trained with our "tricks", is able to outperform newer and improved architectures trained with the standard pipeline.

In this paper, we will examine a collection of training
procedure and model architecture refinements that improve model accuracy but barely change computational complexity. Many of them are minor "tricks" like modifying the stride size of a particular convolution layer or adjusting the learning rate schedule. Collectively, however, they make a big difference. We will evaluate them on multiple network architectures and datasets and report their impact on the final model accuracy.

Our empirical evaluation shows that several tricks lead to significant accuracy improvement, and combining them together can further boost the model accuracy. We compare ResNet-50, after applying all tricks, to other related networks in Table 1. Note that these tricks raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. It also outperforms other newer and improved network architectures, such as SE-ResNeXt-50. In addition, we show that our approach can generalize to other networks (Inception-V3 [26] and MobileNet [11]) and datasets (Places365 [32]). We further show that models trained with our tricks bring better transfer learning performance in other application domains such as object detection and semantic segmentation.
Paper Outline.
We first set up a baseline training procedure in Section 2, and then discuss several tricks that are useful for efficient training on new hardware in Section 3. In Section 4 we review three minor model architecture tweaks for ResNet and propose a new one. Four additional training procedure refinements are then discussed in Section 5. At last, we study whether these more accurate models can help transfer learning in Section 6.

Our model implementations and training scripts are publicly available in GluonCV (https://github.com/dmlc/gluon-cv).
2. Training Procedures
The template of training a neural network with mini-batch stochastic gradient descent is shown in Algorithm 1. In each iteration, we randomly sample b images to compute the gradients and then update the network parameters. It stops after K passes through the dataset. All functions and hyper-parameters in Algorithm 1 can be implemented in many different ways. In this section, we first specify a baseline implementation of Algorithm 1.

Algorithm 1: Train a neural network with mini-batch stochastic gradient descent.
  initialize(net)
  for epoch = 1, ..., K do
    for batch = 1, ..., #images / b do
      images <- uniformly random sample b images
      X, y <- preprocess(images)
      z <- forward(net, X)
      l <- loss(z, y)
      grad <- backward(l)
      update(net, grad)
    end for
  end for

We follow a widely used implementation [8] of ResNet as our baseline. The preprocessing pipelines for training and validation are different. During training, we perform the following steps one-by-one:

1. Randomly sample an image and decode it into 32-bit floating point raw pixel values in [0, 255].
2. Randomly crop a rectangular region whose aspect ratio is randomly sampled in [3/4, 4/3] and area randomly sampled in [8%, 100%], then resize the cropped region into a 224-by-224 square image.
3. Flip horizontally with 0.5 probability.
4. Scale hue, saturation, and brightness with coefficients uniformly drawn from [0.6, 1.4].
5. Add PCA noise with a coefficient sampled from a normal distribution N(0, 0.1).
6. Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.

During validation, we resize each image's shorter edge to 256 pixels while keeping its aspect ratio. Next, we crop out the 224-by-224 region in the center and normalize the RGB channels as in training. We do not perform any random augmentations during validation.

The weights of both convolutional and fully-connected layers are initialized with the Xavier algorithm [6]. In particular, we set the parameters to random values uniformly drawn from [-a, a], where a = sqrt(6 / (d_in + d_out)). Here d_in and d_out are the input and output channel sizes, respectively. All biases are initialized to 0. For batch normalization layers, γ vectors are initialized to 1 and β vectors to 0.

Nesterov Accelerated Gradient (NAG) descent [20] is used for training. Each model is trained for 120 epochs on 8 Nvidia V100 GPUs with a total batch size of 256. The learning rate is initialized to 0.1 and divided by 10 at the 30th, 60th, and 90th epochs.

We evaluate three CNNs: ResNet-50 [9], Inception-V3 [26], and MobileNet [11]. For Inception-V3 we resize the input images into 299-by-299. We use the ILSVRC2012 [23] dataset, which has 1.3 million images for training and 1000 classes. The validation accuracies are shown in Table 2. As can be seen, our ResNet-50 results are slightly better than the reference results, while our baseline Inception-V3 and MobileNet are slightly lower in accuracy due to a different training procedure.

Model               Baseline          Reference
                    Top-1    Top-5    Top-1    Top-5
ResNet-50 [9]       75.87    92.70    75.3     92.2
Inception-V3 [26]   77.32    93.43    78.8     94.4
MobileNet [11]      69.03    88.71    70.6     -

Table 2: Validation accuracy of reference implementations and our baseline. Note that the numbers for Inception-V3 are obtained with 299-by-299 input images.
3. Efficient Training
Hardware, especially GPUs, has been rapidly evolving in recent years. As a result, the optimal choices for many performance-related trade-offs have changed. For example, it is now more efficient to use lower numerical precision and larger batch sizes during training. In this section, we review various techniques that enable low-precision and large-batch training without sacrificing model accuracy. Some techniques can even improve both accuracy and training speed.
Mini-batch SGD groups multiple samples into a mini-batch to increase parallelism and decrease communication costs. Using a large batch size, however, may slow down the training progress. For convex problems, the convergence rate decreases as the batch size increases. Similar empirical results have been reported for neural networks [25]. In other words, for the same number of epochs, training with a large batch size results in a model with degraded validation accuracy compared to models trained with smaller batch sizes.

Multiple works [7, 14] have proposed heuristics to address this issue. In the following paragraphs, we will examine four heuristics that help scale the batch size up for single-machine training.
Linear scaling learning rate.
In mini-batch SGD, gradient descent is a random process because the examples are randomly selected in each batch. Increasing the batch size does not change the expectation of the stochastic gradient but reduces its variance. In other words, a large batch size reduces the noise in the gradient, so we may increase the learning rate to make larger progress along the direction opposite to the gradient. Goyal et al. [7] report that linearly increasing the learning rate with the batch size works empirically for ResNet-50 training. In particular, if we follow He et al. [9] and choose 0.1 as the initial learning rate for batch size 256, then when changing to a larger batch size b, we increase the initial learning rate to 0.1 × b/256.

Learning rate warmup. At the beginning of training, all parameters are typically random values and therefore far away from the final solution. Using a too-large learning rate may result in numerical instability. In the warmup heuristic, we use a small learning rate at the beginning and then switch back to the initial learning rate when the training process is stable [9]. Goyal et al. [7] propose a gradual warmup strategy that increases the learning rate from 0 to the initial learning rate linearly. In other words, assume we will use the first m batches (e.g. 5 data epochs) to warm up, and the initial learning rate is η; then at batch i, 1 ≤ i ≤ m, we set the learning rate to iη/m.

Zero γ. A ResNet network consists of multiple residual blocks, each of which consists of several convolutional layers. Given input x, if block(x) is the output of the last layer in the block, then the residual block outputs x + block(x). Note that the last layer of a block is often a batch normalization (BN) layer. The BN layer first standardizes its input, denoted by x̂, and then performs a scale transformation γx̂ + β. Both γ and β are learnable parameters whose elements are initialized to 1s and 0s, respectively. In the zero-γ initialization heuristic, we initialize γ = 0 for all BN layers that sit at the end of a residual block. Therefore, all residual blocks initially just return their inputs, which mimics a network with fewer layers and is easier to train at the initial stage.
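The three heuristics above amount to a few lines of setup code. Below is a hedged PyTorch-style sketch; the function names are illustrative, the momentum and weight decay values are typical defaults rather than quantities specified in this section, and the bn3 attribute name is an assumption about torchvision's bottleneck implementation.

# Hedged sketch: linear LR scaling, gradual warmup, and zero-gamma initialization.
import torch
import torch.nn as nn
import torchvision

def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    # Linear scaling rule: lr = 0.1 * b / 256.
    return base_lr * batch_size / base_batch

def warmup_lr(step, warmup_steps, init_lr):
    # Gradual warmup: grow linearly from 0 to init_lr over the first m batches,
    # e.g. warmup_lr(i, m, eta) = i * eta / m during warmup.
    if step < warmup_steps:
        return init_lr * (step + 1) / warmup_steps
    return init_lr

def zero_gamma_init(model):
    # Set gamma of the last BN layer in every residual block to zero, so each
    # block initially acts as an identity mapping. For torchvision's ResNet
    # bottleneck blocks that BN layer is `bn3` (adapt the name to your model).
    for module in model.modules():
        if isinstance(module, torchvision.models.resnet.Bottleneck):
            nn.init.zeros_(module.bn3.weight)

batch_size = 1024
init_lr = scaled_lr(batch_size)            # 0.1 * 1024 / 256 = 0.4
model = torchvision.models.resnet50()
zero_gamma_init(model)
optimizer = torch.optim.SGD(model.parameters(), lr=init_lr,
                            momentum=0.9, nesterov=True, weight_decay=1e-4)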
No bias decay. The weight decay is often applied to all learnable parameters, including both weights and biases. It is equivalent to applying an L2 regularization to all parameters to drive their values towards 0. As pointed out by Jia et al. [14], however, it is recommended to only apply the regularization to weights to avoid overfitting. The no-bias-decay heuristic follows this recommendation: it only applies the weight decay to the weights in convolution and fully-connected layers. Other parameters, including the biases and the γ and β in BN layers, are left unregularized.

Note that LARS [4] offers a layer-wise adaptive learning rate and is reported to be effective for extremely large batch sizes (beyond 16K). In this paper we limit ourselves to methods that are sufficient for single-machine training, in which case a batch size of no more than 2K often leads to good system efficiency.
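A hedged sketch of the no-bias-decay heuristic using optimizer parameter groups in PyTorch; splitting parameters by dimensionality is one common way to separate conv/FC weights from biases and BN parameters, not necessarily how the authors implement it.

import torch
import torch.nn as nn

def split_weight_decay_params(model, weight_decay=1e-4):
    # Apply weight decay only to conv / fully-connected weights; leave biases
    # and BN gamma/beta unregularized. Biases and BN parameters are 1-D while
    # conv/FC weights have two or more dimensions, so the dimensionality is a
    # simple proxy for the split.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim == 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Tiny example model; any network with conv, BN, and linear layers works.
model = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 1000),
)
optimizer = torch.optim.SGD(split_weight_decay_params(model),
                            lr=0.4, momentum=0.9, nesterov=True)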
Neural networks are commonly trained with 32-bit floating point (FP32) precision. That is, all numbers are stored in FP32 format, and both inputs and outputs of arithmetic operations are FP32 numbers as well. New hardware, however, may have enhanced arithmetic logic units for lower-precision data types. For example, the previously mentioned Nvidia V100 offers 14 TFLOPS in FP32 but over 100 TFLOPS in FP16. As shown in Table 3, the overall training speed is accelerated by 2 to 3 times after switching from FP32 to FP16 on V100.

Despite the performance benefit, a reduced precision has a narrower range, which makes results more likely to go out of range and then disturb the training progress. Micikevicius et al. [19] propose to store all parameters and activations in FP16 and use FP16 to compute gradients. At the same time, all parameters have a copy in FP32 for parameter updating. In addition, multiplying a scalar to the loss to better align the range of the gradient into FP16 is also a practical solution.

The evaluation results for ResNet-50 are shown in Table 3. Compared to the baseline with batch size 256 and FP32, using a larger 1024 batch size and FP16 reduces the training time for ResNet-50 from 13.3 minutes per epoch to 4.4 minutes per epoch. In addition, by stacking all heuristics for large-batch training, the model trained with a 1024 batch size and FP16 even slightly increases the top-1 accuracy by 0.5% compared to the baseline model.

The ablation study of all heuristics is shown in Table 4. Increasing the batch size from 256 to 1024 by linear scaling of the learning rate alone leads to a 0.9% decrease of the top-1 accuracy, while stacking the remaining three heuristics bridges the gap. Switching from FP32 to FP16 at the end of training does not affect the accuracy.
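As a concrete reference, the FP16-with-loss-scaling recipe described above closely matches what automatic mixed precision provides in PyTorch; the sketch below is illustrative only and is not the authors' MXNet/GluonCV implementation.

# Hedged sketch of FP16 training with dynamic loss scaling using PyTorch AMP.
import torch
import torchvision

device = "cuda"
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.4,
                            momentum=0.9, nesterov=True)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

def train_step(images, labels):
    optimizer.zero_grad()
    # Run the forward pass in reduced precision where it is safe to do so.
    with torch.cuda.amp.autocast():
        logits = model(images)
        loss = criterion(logits, labels)
    # Scale the loss so FP16 gradients stay within representable range,
    # then unscale before the FP32 parameter update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()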
4. Model Tweaks
A model tweak is a minor adjustment to the network architecture, such as changing the stride of a particular convolution layer. Such a tweak often barely changes the computational complexity but might have a non-negligible effect on the model accuracy. In this section, we will use ResNet as an example to investigate the effects of model tweaks.
Figure 1: The architecture of ResNet-50. The convolution kernel size, output channel size, and stride size (default is 1) are illustrated, and similarly for pooling layers.

We will briefly present the ResNet architecture, especially its modules related to the model tweaks. For detailed information please refer to He et al. [9]. A ResNet network consists of an input stem, four subsequent stages, and a final output layer, which is illustrated in Figure 1. The input stem has a 7×7 convolution with an output channel of 64 and a stride of 2, followed by a 3×3 max pooling layer also with a stride of 2. The input stem reduces the input width and height by 4 times and increases the channel size to 64.

Starting from stage 2, each stage begins with a downsampling block, which is then followed by several residual blocks. In the downsampling block, there are path A and path B. Path A has three convolutions, whose kernel sizes are 1×1, 3×3 and 1×1, respectively. The first convolution has a stride of 2 to halve the input width and height, and the last convolution's output channel is 4 times larger than the previous two, which is called the bottleneck structure. Path B uses a 1×1 convolution with a stride of 2 to transform the input shape into the output shape of path A, so we can sum the outputs of both paths to obtain the output of the downsampling block. A residual block is similar to a downsampling block except that it only uses convolutions with a stride of 1.

One can vary the number of residual blocks in each stage to obtain different ResNet models, such as ResNet-50 and ResNet-152, where the number denotes the number of convolutional layers in the network.

Figure 2: Three ResNet tweaks. ResNet-B modifies the downsampling block of ResNet. ResNet-C further modifies the input stem. On top of that, ResNet-D again modifies the downsampling block.

Next, we revisit two popular ResNet tweaks, which we call ResNet-B and ResNet-C, respectively. We then propose a new model tweak, ResNet-D.
ResNet-B.
This tweak first appeared in a Torch implementation of ResNet [8] and was then adopted by multiple works [7, 12, 27]. It changes the downsampling block of ResNet. The observation is that the convolution in path A ignores three-quarters of the input feature map because it uses a kernel size of 1×1 with a stride of 2. ResNet-B switches the stride sizes of the first two convolutions in path A, as shown in Figure 2a, so no information is ignored. Because the second convolution has a kernel size of 3×3, the output shape of path A remains unchanged.
ResNet-C. This tweak was proposed in Inception-v2 [26] originally, and it can be found in the implementations of other models such as SENet [12], PSPNet [31], DeepLabV3 [1], and ShuffleNetV2 [21]. The observation is that the computational cost of a convolution is quadratic in the kernel width or height: a 7×7 convolution is 5.4 times more expensive than a 3×3 convolution. So this tweak replaces the 7×7 convolution in the input stem with three conservative 3×3 convolutions, as shown in Figure 2b, with the first and second convolutions having an output channel of 32 and a stride of 2, while the last convolution uses a 64 output channel.

ResNet-D.
Inspired by ResNet-B, we note that the 1×1 convolution in path B of the downsampling block also ignores 3/4 of the input feature maps, and we would like to modify it so that no information is ignored. Empirically, we found that adding a 2×2 average pooling layer with a stride of 2 before the convolution, whose stride is then changed to 1, works well in practice and adds little computational cost. This tweak is illustrated in Figure 2c.

We evaluate ResNet-50 with the three tweaks and the settings described in Section 3, namely a batch size of 1024 and FP16 precision. The results are shown in Table 5.

Table 5: Comparison of ResNet-50 with the three model tweaks on model size, FLOPs, and ImageNet validation accuracy.

Suggested by the results, ResNet-B receives more information in path A of the downsampling blocks and improves validation accuracy by around 0.5% compared to ResNet-50. Replacing the 7×7 convolution with three 3×3 ones gives another 0.2% improvement. Taking more information in path B of the downsampling blocks improves the validation accuracy by another 0.3%. In total, ResNet-50-D improves over ResNet-50 by about 1%.

On the other hand, these four models have the same model size. ResNet-D has the largest computational cost, but its difference compared to ResNet-50 is within 15% in terms of floating point operations. In practice, we observed that ResNet-50-D is only 3% slower in training throughput compared to ResNet-50.
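A hedged PyTorch sketch of a bottleneck downsampling block with the ResNet-B and ResNet-D tweaks applied together (stride moved to the 3×3 convolution in path A, and average pooling before the 1×1 convolution in path B); the class name and layer widths are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class DownsampleBlockD(nn.Module):
    """Bottleneck downsampling block with the ResNet-B and ResNet-D tweaks."""

    def __init__(self, in_ch, mid_ch, out_ch, stride=2):
        super().__init__()
        # Path A (ResNet-B): the 1x1 conv keeps stride 1 and the 3x3 conv
        # downsamples, so no activations are simply skipped over.
        self.path_a = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Path B (ResNet-D): a 2x2 average pooling with stride 2 does the
        # downsampling, so the following 1x1 conv can use stride 1.
        self.path_b = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=stride, ceil_mode=True),
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.path_a(x) + self.path_b(x))

block = DownsampleBlockD(in_ch=256, mid_ch=128, out_ch=512)
out = block(torch.randn(1, 256, 56, 56))   # -> shape (1, 512, 28, 28)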
5. Training Refinements
In this section, we will describe four training refinements that aim to further improve the model accuracy.
Learning rate adjustment is crucial to training. After the learning rate warmup described in Section 3, we typically steadily decrease the value from the initial learning rate. A widely used strategy is exponentially decaying the learning rate. He et al. [9] decrease the rate by a factor of 0.1 every 30 epochs; we call this "step decay". Szegedy et al. [26] decrease the rate by a factor of 0.94 every two epochs.

In contrast, Loshchilov et al. [18] propose a cosine annealing strategy. A simplified version decreases the learning rate from the initial value to 0 by following the cosine function. Assume the total number of batches is T (the warmup stage is ignored); then at batch t, the learning rate η_t is computed as

    η_t = (1/2) (1 + cos(tπ/T)) η,    (1)

where η is the initial learning rate. We call this schedule "cosine" decay.
Figure 3: Visualization of learning rate schedules with warmup. (a) Cosine and step schedules for batch size 1024. (b) Top-1 validation accuracy curves for the two schedules.

The comparison between step decay and cosine decay is illustrated in Figure 3a. As can be seen, the cosine decay decreases the learning rate slowly at the beginning, becomes almost linear in the middle, and slows down again at the end. Compared to the step decay, the cosine decay starts to decay the learning rate from the beginning but remains large until the step decay reduces the learning rate by 10x, which potentially improves the training progress.
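A minimal sketch of the cosine schedule in Eq. (1), combined with the linear warmup from Section 3; the iteration counts in the example are illustrative and use the same symbols as the text.

import math

def learning_rate(t, total_batches, init_lr, warmup_batches=0):
    # Linear warmup for the first m batches (Section 3), then cosine decay
    # following Eq. (1): eta_t = 0.5 * (1 + cos(t * pi / T)) * eta.
    if t < warmup_batches:
        return init_lr * (t + 1) / warmup_batches
    t = t - warmup_batches
    T = total_batches - warmup_batches
    return 0.5 * (1.0 + math.cos(t * math.pi / T)) * init_lr

# Example: 120 epochs at ~1252 iterations per epoch (1.28M images / batch 1024),
# with 5 warmup epochs.
iters_per_epoch = 1252
total = 120 * iters_per_epoch
warmup = 5 * iters_per_epoch
lrs = [learning_rate(t, total, init_lr=0.4, warmup_batches=warmup)
       for t in range(total)]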
The last layer of an image classification network is often a fully-connected layer with a hidden size equal to the number of labels, denoted by K, which outputs the predicted confidence scores. Given an image, denote by z_i the predicted score for class i. These scores can be normalized by the softmax operator to obtain predicted probabilities. Denote by q the output of the softmax operator, q = softmax(z); the probability for class i, q_i, can be computed by

    q_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j).    (2)

It is easy to see that q_i > 0 and Σ_{i=1}^{K} q_i = 1, so q is a valid probability distribution.

On the other hand, assume the true label of this image is y; we can construct a truth probability distribution with p_i = 1 if i = y and 0 otherwise. During training, we minimize the negative cross entropy loss

    ℓ(p, q) = − Σ_{i=1}^{K} p_i log q_i    (3)

to update the model parameters and make these two probability distributions similar to each other. In particular, by the way p is constructed, we know ℓ(p, q) = −log q_y = −z_y + log(Σ_{i=1}^{K} exp(z_i)). The optimal solution is z*_y = inf while keeping the other scores small enough. In other words, it encourages the output scores to be dramatically distinctive, which potentially leads to overfitting.

The idea of label smoothing was first proposed to train Inception-v2 [26]. It changes the construction of the true probability to

    p_i = 1 − ε         if i = y,
          ε/(K − 1)     otherwise,    (4)

where ε is a small constant. Now the optimal solution becomes

    z*_i = log((K − 1)(1 − ε)/ε) + α    if i = y,
           α                            otherwise,    (5)

where α can be an arbitrary real number. This encourages a finite output from the fully-connected layer and can generalize better.

When ε = 0, the gap log((K − 1)(1 − ε)/ε) will be ∞, and as ε increases, the gap decreases. Specifically, when ε = (K − 1)/K, all optimal z*_i will be identical. Figure 4a shows how the gap changes as we vary ε, given K = 1000 for the ImageNet dataset.

Figure 4: Visualization of the effectiveness of label smoothing on ImageNet. (a) The theoretical gap between z*_y and the other scores decreases as ε increases. (b) The empirical distributions of the gap between the maximum prediction and the average of the rest.

We empirically compare the output values from two ResNet-50-D models that are trained with and without label smoothing, respectively, and calculate the gap between the maximum prediction value and the average of the rest. Under ε = 0.1 and K = 1000, the theoretical gap is around 9.1. Figure 4b demonstrates the gap distributions from the two models predicting over the validation set of ImageNet. It is clear that with label smoothing the distribution centers at the theoretical value and has fewer extreme values.
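A hedged sketch of label smoothing as a loss function in PyTorch; the smoothed target distribution of Eq. (4) is built explicitly rather than relying on any built-in option, and the function name is illustrative.

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, epsilon=0.1):
    # logits: (N, K) raw scores z; target: (N,) integer class labels y.
    n, k = logits.shape
    log_q = F.log_softmax(logits, dim=-1)
    # Smoothed "true" distribution from Eq. (4):
    # 1 - epsilon on the correct class, epsilon / (K - 1) elsewhere.
    p = torch.full_like(log_q, epsilon / (k - 1))
    p.scatter_(1, target.unsqueeze(1), 1.0 - epsilon)
    # Cross entropy between the smoothed targets and the predictions, Eq. (3).
    return -(p * log_q).sum(dim=-1).mean()

logits = torch.randn(4, 1000)
target = torch.randint(0, 1000, (4,))
loss = label_smoothing_loss(logits, target, epsilon=0.1)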
In knowledge distillation [10], we use a teacher model to help train the current model, which is called the student model. The teacher model is often a pre-trained model with higher accuracy, so by imitation, the student model is able to improve its own accuracy while keeping the model complexity the same. One example is using a ResNet-152 as the teacher model to help train a ResNet-50.

During training, we add a distillation loss to penalize the difference between the softmax outputs of the teacher model and the student model. Given an input, assume p is the true probability distribution, and z and r are the outputs of the last fully-connected layer of the student model and the teacher model, respectively. Recall that previously we used a negative cross entropy loss ℓ(p, softmax(z)) to measure the difference between p and softmax(z); here we use the same loss again for the distillation. Therefore, the loss is changed to

    ℓ(p, softmax(z)) + T² ℓ(softmax(r/T), softmax(z/T)),    (6)

where T is the temperature hyper-parameter that makes the softmax outputs smoother and thus distills the knowledge of the label distribution from the teacher's predictions.

In Section 2 we described how images are augmented before training. Here we consider another augmentation method called mixup [29]. In mixup, each time we randomly sample two examples (x_i, y_i) and (x_j, y_j). Then we form a new example by a weighted linear interpolation of these two examples:

    x̂ = λ x_i + (1 − λ) x_j,    (7)
    ŷ = λ y_i + (1 − λ) y_j,    (8)

where λ ∈ [0, 1] is a random number drawn from the Beta(α, α) distribution. In mixup training, we only use the new example (x̂, ŷ).

Now we evaluate the four training refinements. We set ε = 0.1 for label smoothing, following Szegedy et al. [26]. For the model distillation we use T = 20; specifically, a pretrained ResNet-152-D model with both cosine decay and label smoothing applied is used as the teacher. In the mixup training, we choose α = 0.2 in the Beta distribution and increase the number of epochs from 120 to 200, because the mixed examples require longer training to converge well. When combining mixup training with distillation, we train the teacher model with mixup as well.

We demonstrate that the refinements are not limited to the ResNet architecture or the ImageNet dataset. First, we train ResNet-50-D, Inception-V3 and MobileNet on the ImageNet dataset with the refinements. The validation accuracies for applying these training refinements one-by-one are shown in Table 6. By stacking cosine decay, label smoothing and mixup, we steadily improve the ResNet, Inception-V3 and MobileNet models. Distillation works well on ResNet; however, it does not work well on Inception-V3 and MobileNet. Our interpretation is that the teacher model is not from the same family as the student, therefore it has a different distribution in its predictions and brings a negative impact to the student model.

To show that our tricks are transferable to other datasets, we train a ResNet-50-D model on the MIT Places365 dataset with and without the refinements. Results are reported in Table 7. We see the refinements improve the top-5 accuracy consistently on both the validation and test set.
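A hedged PyTorch sketch of the mixup augmentation (Eqs. 7-8) and the temperature-scaled distillation term of Eq. (6); a single λ per batch and a KL-divergence form of the soft term (equal to the cross entropy in Eq. (6) up to a constant independent of the student) are common implementation choices, not necessarily the authors'.

import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=0.2):
    # Mixup: blend each example with a randomly chosen partner in the batch,
    # using one lambda ~ Beta(alpha, alpha) per batch.
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_hat = lam * x + (1 - lam) * x[perm]                 # Eq. (7)
    y_onehot = F.one_hot(y, num_classes).float()
    y_hat = lam * y_onehot + (1 - lam) * y_onehot[perm]   # Eq. (8)
    return x_hat, y_hat

def distillation_loss(student_logits, teacher_logits, targets, T=20.0):
    # Hard-label cross entropy (soft targets supported) plus the
    # temperature-scaled teacher term of Eq. (6).
    log_q = F.log_softmax(student_logits, dim=-1)
    hard = -(targets * log_q).sum(dim=-1).mean()
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean")
    return hard + (T ** 2) * soft

x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 1000, (8,))
x_mix, y_mix = mixup_batch(x, y, num_classes=1000)  # feed (x_mix, y_mix) to the student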
6. Transfer Learning
Transfer learning is one major downstream use case of trained image classification models. In this section, we will investigate whether the improvements discussed so far can benefit transfer learning. In particular, we pick two important computer vision tasks, object detection and semantic segmentation, and evaluate their performance by varying base models.
The goal of object detection is to locate bounding boxes of objects in an image. We evaluate performance using PASCAL VOC [3]. Similar to Ren et al. [22], we use the union set of VOC 2007 trainval and VOC 2012 trainval for training, and VOC 2007 test for evaluation. We train Faster-RCNN [22] on this dataset, with refinements from Detectron [5] such as linear warmup and a longer training schedule. The VGG-19 base model in Faster-RCNN is replaced with the various pretrained models from the previous discussion. We keep other settings the same, so the gain is solely from the base models.

Mean average precision (mAP) results are reported in Table 8. We can observe that a base model with a higher validation accuracy leads to a higher mAP for Faster-RCNN in a consistent manner. In particular, the best base model, with 79.29% accuracy on ImageNet, leads to the best mAP at 81.33% on VOC, which outperforms the standard model by 4%.

Refinements            ResNet-50-D      Inception-V3     MobileNet
                       Top-1   Top-5    Top-1   Top-5    Top-1   Top-5
Efficient              77.16   93.52    77.50   93.60    71.90   90.53
+ cosine decay         77.91   93.81    78.19   94.06    72.83   91.00
+ label smoothing      78.31   94.09    78.40   94.13    72.93   91.14
+ distill w/o mixup    78.67   94.36    78.26   94.01    71.97   90.89
+ mixup w/o distill    79.15   94.58    -       -        -       -
+ distill w/ mixup     -       -        -       -        -       -

Table 6: ImageNet validation accuracies for the training refinements applied one by one. "Efficient" refers to the settings from Section 3.

Table 7: Results on both the validation set and the test set of the MIT Places365 dataset. Predictions are generated as stated in Section 2. ResNet-50-D Efficient refers to ResNet-50-D trained with the settings from Section 3, and ResNet-50-D Best further incorporates cosine scheduling, label smoothing and mixup.

Refinement             Top-1    mAP
B-standard             76.14    77.54
D-efficient            77.16    78.30
+ cosine               77.91    79.23
+ smooth               78.34    80.71
+ distill w/o mixup    78.67    80.96
+ mixup w/o distill    79.16    81.10
+ distill w/ mixup     79.29    81.33

Table 8: Faster-RCNN performance with various pretrained base networks evaluated on PASCAL VOC.

Refinement             Top-1    PixAcc    mIoU
B-standard             76.14    78.08     37.05
D-efficient            77.16    78.88     38.88
+ cosine               77.91    -         -
+ smooth               78.34    78.64     38.75
+ distill w/o mixup    78.67    78.97     38.90
+ mixup w/o distill    79.16    78.47     37.99
+ mixup w/ distill     79.29    78.72     38.40

Table 9: FCN performance with various base networks evaluated on ADE20K.
Semantic segmentation predicts the category for every pixel of the input image. We use a Fully Convolutional Network (FCN) [17] for this task and train models on the ADE20K [33] dataset. Following PSPNet [31] and Zhang et al. [30], we replace the base network with the various pretrained models discussed in previous sections and apply a dilation network strategy [2, 28] on stage 3 and stage 4. A fully convolutional decoder is built on top of the base network to make the final prediction.

Both pixel accuracy (pixAcc) and mean intersection over union (mIoU) are reported in Table 9. In contrast to our results on object detection, the cosine learning rate schedule effectively improves the FCN performance, while the other refinements provide suboptimal results. A potential explanation of this phenomenon is that semantic segmentation predicts at the pixel level: while models trained with label smoothing, distillation and mixup favor softened labels, pixel-level information may be blurred, degrading the overall pixel-level accuracy.
7. Conclusion
In this paper, we survey a dozen tricks for training deep convolutional neural networks to improve model accuracy. These tricks introduce minor modifications to the model architecture, data preprocessing, loss function, and learning rate schedule. Our empirical results on ResNet-50, Inception-V3 and MobileNet indicate that these tricks improve model accuracy consistently. More excitingly, stacking all of them together leads to a significantly higher accuracy. In addition, these improved pre-trained models show strong advantages in transfer learning, improving both object detection and semantic segmentation. We believe the benefits can extend to broader domains where classification base models are favored.
References

[1] L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Detectron. https://github.com/facebookresearch/detectron, 2018.
[6] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[7] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
[8] S. Gross and M. Wilber. Training and investigating residual nets. http://torch.ch/blog/2016/02/04/resnets.html.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[13] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
[14] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[16] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[18] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with restarts. CoRR, abs/1608.03983, 2016.
[19] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
[20] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
[21] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.
[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[25] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
[27] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
[28] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[29] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017.
[30] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
[31] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6230–6239, 2017.
[32] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[33] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition.