Complement Objective Training
Hao-Yun Chen, Pei-Hsin Wang, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, Da-Cheng Juan
Published as a conference paper at ICLR 2019

Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
Electronic and Optoelectronic System Research Laboratories, ITRI, Hsinchu, Taiwan
Google Research, Mountain View, CA, USA
{haoyunchen, peihsin, newgod1992}@[email protected]
{jypan, yutingchen, wewei, dacheng}@google.com

Abstract
Learning with a primary objective, such as softmax cross entropy for classification and sequence generation, has been the norm for training deep neural networks for years. Although being a widely-adopted approach, using cross entropy as the primary objective exploits mostly the information from the ground-truth class for maximizing data likelihood, and largely ignores information from the complement (incorrect) classes. We argue that, in addition to the primary objective, training also using a complement objective that leverages information from the complement classes can be effective in improving model performance. This motivates us to study a new training paradigm that maximizes the likelihood of the ground-truth class while neutralizing the probabilities of the complement classes. We conduct extensive experiments on multiple tasks ranging from computer vision to natural language understanding. The experimental results confirm that, compared to conventional training with just one primary objective, training also with the complement objective further improves the performance of state-of-the-art models across all tasks. In addition to the accuracy improvement, we also show that models trained with both primary and complement objectives are more robust to single-step adversarial attacks.
1 Introduction
Statistical learning algorithms work by optimizing towards a training objective. A dominant principle for training is to optimize likelihood (Mitchell, 1997), which measures the probability of data given the model under a specific set of parameters. The popularity of deep neural networks has given rise to the use of cross entropy (Kullback & Leibler, 1951) as the primary training objective, since minimizing cross entropy is essentially equivalent to maximizing likelihood for disjoint classes. Cross entropy has become the standard training objective for many tasks including classification (Krizhevsky et al., 2012) and sequence generation (Sutskever et al., 2014).

Let y_i ∈ {0, 1}^K be the label of the i-th sample in one-hot encoded representation and ŷ_i ∈ [0, 1]^K be the predicted probabilities. The cross entropy H(y, ŷ) is defined as:

H(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i^\top \cdot \log(\hat{\mathbf{y}}_i) = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{ig})    (1)

where ŷ_ig represents the predicted probability of the ground-truth class for the i-th sample. Training with cross entropy as the primary objective aims at finding θ̂ = argmin_θ H(y, ŷ), where ŷ = h_θ(x), h_θ is a neural network, and x is a sample.

(a) ŷ from the model trained with cross entropy. (b) ŷ from the model trained with COT.

Figure 1: Predicted probabilities ŷ from two training paradigms: (a) with cross entropy as the primary objective; (b) COT: with both primary and complement objectives. The model is ResNet-110 and the sample image is from the CIFAR-10 dataset. The ground-truth class is "horse." Compared to (b), the model in (a) is confused by other classes such as "airplane" and "automobile," which suggests (a) might be more susceptible to generalization issues and potentially adversarial attacks.

(a) Embeddings from the model trained with cross entropy. (b) Embeddings from the model trained with COT.

Figure 2: Embeddings for CIFAR-10 test images from two training paradigms: (a) with cross entropy as the primary objective; (b) COT: training with both primary and complement objectives. The model is ResNet-110, and the "embedding" is the vector representation before the softmax operation. The embedding of each sample is projected to two dimensions using t-SNE for visualization. Compared to (a), the cluster of each class in (b) is "narrower" in terms of intra-cluster distance. Also, the clusters in (b) seem to have clean and separable boundaries, leading to more accurate and robust classification results.

Although training using cross entropy as the primary objective has achieved tremendous success, we have observed one limitation: it exploits mostly the information from the ground-truth class, as Eq. (1) shows; the information from the complement classes (i.e., incorrect classes) has been largely ignored, since the predicted probabilities other than ŷ_ig are zeroed out by the dot product with the one-hot encoded y_i. Therefore, for classes other than the ground truth, the model behavior is not explicitly optimized: their predicted probabilities are only indirectly minimized when ŷ_ig is maximized, since the probabilities sum up to 1. One way to utilize the information from the complement classes is to neutralize their predicted probabilities. To this end, we propose Complement Objective Training (COT), a new training paradigm that achieves this optimization goal without compromising the model's primary objective. Figure 1 illustrates the comparison between Figure 1a, the predicted probability ŷ from the model trained with just cross entropy as the primary objective, and Figure 1b, ŷ from the model trained with both primary and complement objectives.
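The mechanics of Eq. (1) can be sketched in a few lines of NumPy (a minimal illustration for intuition, not the authors' implementation): the dot product with the one-hot label keeps only the ground-truth probability, so the complement-class probabilities never contribute to the loss directly.

```python
import numpy as np

def softmax(logits):
    # Numerically stabilized softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, y_hat):
    # H(y, y_hat) = -(1/N) sum_i y_i^T log(y_hat_i) = -(1/N) sum_i log(y_hat_{ig});
    # multiplying by the one-hot label zeroes out every complement-class term.
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3]])
y = np.array([[1, 0, 0],
              [0, 1, 0]])            # ground-truth classes: 0 and 1
loss = cross_entropy(y, softmax(logits))
```

Note that only ŷ_ig enters the loss: redistributing probability mass among the complement classes while keeping ŷ_ig fixed leaves the loss unchanged, which is exactly the blind spot COT addresses.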
Training with the complement objective finds parameters θ that evenly suppress the complement classes without compromising the primary objective (i.e., maximizing ŷ_g), making the model more confident of the ground-truth class.

Figure 2 further illustrates the embeddings of CIFAR-10 images computed by ResNet-110 under the two training paradigms: cross entropy and COT. An embedding of an image is the vector representation computed by the ResNet-110 model before the softmax operation. Compared to Figure 2a, the clusters in Figure 2b seem to have clean and separable boundaries, leading to more accurate and robust classification results. The experimental results in Section 3 further confirm this observation.

Complement objective training requires a function that complements the primary objective. In this paper, we propose "complement entropy" (defined in Section 2) to complement the softmax cross entropy for neutralizing the effects of the complement classes. The neural network parameters θ are then updated by alternating iteratively between (a) minimizing the cross entropy to increase ŷ_g, and (b) maximizing the complement entropy to neutralize ŷ_{j≠g}. Experimental results (in Section 3) confirm that COT improves the accuracies of state-of-the-art methods for both (a) image classification tasks on ImageNet-2012, Tiny ImageNet, CIFAR-10, CIFAR-100, and SVHN, and (b) language understanding tasks on machine translation and speech recognition. Furthermore, the experimental results also show that models trained by COT are more robust to adversarial attacks.

2 Complement Objective Training
In this section, we first define "complement entropy" as the complement objective, and then provide a new training algorithm for updating the neural network parameters θ by alternating iteratively between the primary objective and the complement objective.

2.1 Complement Entropy
Conventionally, training with cross entropy as the primary objective aims at maximizing the predicted probability of the ground-truth class ŷ_g in Eq. (1). As mentioned in the introduction, the proposed COT also maximizes the complement objective for neutralizing the predicted probabilities of the complement classes. To achieve this, we propose "complement entropy" as the complement objective; complement entropy C(·) is defined to be the average of the sample-wise entropies over the complement classes in a mini-batch:

C(\hat{\mathbf{y}}_{\bar{c}}) = \frac{1}{N} \sum_{i=1}^{N} H(\hat{\mathbf{y}}_{i\bar{c}}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1, j \neq g}^{K} \frac{\hat{y}_{ij}}{1 - \hat{y}_{ig}} \log\!\left(\frac{\hat{y}_{ij}}{1 - \hat{y}_{ig}}\right)    (2)

H(·) is the entropy function. All the symbols and notations used in this paper are summarized in Table 1. One thing worth noticing is that this sample-wise entropy is calculated by considering only the complement classes, i.e., the classes other than the ground-truth class g. The sample-wise predicted probability ŷ_ij is normalized by one minus the ground-truth probability (i.e., 1 − ŷ_ig). The term ŷ_ij / (1 − ŷ_ig) can be understood as the predicted probability of seeing class j for the i-th sample, conditioned on the ground-truth class g not happening. Since entropy is maximized when the events are equally likely to occur, optimizing the complement entropy drives ŷ_ij toward (1 − ŷ_ig)/(K − 1), which essentially neutralizes the predicted probabilities of the complement classes as K grows large. In other words, maximizing the complement entropy "flattens" the predicted probabilities of the complement classes ŷ_{j≠g}. We conjecture that, when ŷ_{j≠g} are neutralized, the neural network h_θ generalizes better, since it is less likely that an incorrect class attains a sufficiently high predicted probability to "challenge" the ground-truth class.

2.2 Training with Complement Objective
Given a training procedure using a primary objective, such as softmax cross entropy, one can easily adopt the complement entropy to turn the procedure into Complement Objective Training (COT). Algorithm 1 describes the new training mechanism, which alternates iteratively between the primary and complement objectives. At each training step, the cross entropy is first calculated as the loss value to update the model parameters; next, the complement entropy is calculated as the loss value to perform a second update. Therefore, an additional forward and backward propagation is required in each iteration when using the complement objective, making the total training time empirically about 1.6 times longer.

Table 1: Notations used in this paper.

Symbol            Meaning
y_i               One-hot vector representing the label of the i-th sample.
ŷ_i               The predicted probability of each class for the i-th sample.
g                 Index of the ground-truth class.
y_ij or ŷ_ij      The j-th class (element) of y_i or ŷ_i.
ŷ_c̄               Predicted probabilities of the complement (incorrect) classes.
H(·, ·)           Cross entropy function.
H(·)              Entropy function.
C(·)              Complement entropy.
N and K           Total number of samples and total number of classes.
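As a concrete illustration of Eq. (2), and of why maximizing it "flattens" the complement classes, the complement entropy can be sketched in NumPy as follows. This is a simplified reference sketch, not the authors' code; in Algorithm 1 a gradient step on the cross entropy would be followed by a second step that ascends this quantity.

```python
import numpy as np

def complement_entropy(y_hat, g):
    # C(y_hat) = (1/N) sum_i H(y_hat_{i, c_bar}): each complement probability
    # y_hat_{ij} (j != g_i) is renormalized by (1 - y_hat_{ig}) before taking
    # the entropy, per Eq. (2).
    n, k = y_hat.shape
    rows = np.arange(n)
    q = y_hat / (1.0 - y_hat[rows, g])[:, None]  # conditional complement dist.
    q[rows, g] = 1.0                             # q*log(q) = 0 at the ground truth
    q = np.clip(q, 1e-12, None)                  # avoid log(0)
    return -np.mean(np.sum(q * np.log(q), axis=1))

g = np.array([0])
flat   = np.array([[0.7, 0.1, 0.1, 0.1]])     # complement mass spread evenly
peaked = np.array([[0.7, 0.28, 0.01, 0.01]])  # one "challenger" class
```

The flat distribution attains the maximum, log(K − 1), while the peaked one scores strictly lower: maximizing C drives each ŷ_ij toward (1 − ŷ_ig)/(K − 1), so no single incorrect class can accumulate enough probability to challenge the ground truth.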
Algorithm 1: Training by alternating between primary and complement objectives

for t ← 1 to n_train_steps do
    1. Update parameters by the primary objective: minimize −(1/N) Σ_{i=1}^{N} log(ŷ_ig)
    2. Update parameters by the complement objective: minimize −(1/N) Σ_{i=1}^{N} H(ŷ_ic̄)

3 Experiments
We perform extensive experiments to evaluate COT on tasks in domains ranging from computer vision to natural language understanding, and compare it with baseline algorithms that achieve the state of the art in their respective domains. We also perform experiments to evaluate the robustness of models trained by COT when attacked by adversarial examples. For each task, we select a state-of-the-art model that has an open-source implementation (referred to as the "baseline") and reproduce its results with the hyper-parameters reported in the paper or code repository. Our code is available at https://github.com/henry8527/COT.

3.1 Balancing Training Objectives
In theory, the loss values of the primary and the complement objectives can be on different scales; therefore, additional effort in tuning learning rates might be required for optimizers to achieve the best performance. Empirically, we find the complement entropy in Eq. (2) can be modified as follows to balance the losses between the two objectives:

C'(\hat{\mathbf{y}}_{\bar{c}}) = \frac{1}{K-1} \cdot C(\hat{\mathbf{y}}_{\bar{c}}) = \frac{1}{K-1} \cdot \frac{1}{N} \sum_{i=1}^{N} H(\hat{\mathbf{y}}_{i\bar{c}})    (3)

where K is the number of classes. This modification can be viewed as the complement entropy C(·) being "normalized" by (K − 1). For all the experiments conducted in this paper, we use this normalized complement entropy as the complement objective to improve the baselines without further tuning of learning rates.

3.2 Image Classification
We consider the following datasets for the image classification experiments: CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet-2012. For CIFAR-10, CIFAR-100, and SVHN, we choose the following baseline models: ResNet-110 (He et al., 2016b), PreAct ResNet-18 (He et al., 2016a), ResNeXt-29 (2×64d) (Xie et al., 2017), WideResNet-28-10 (Zagoruyko & Komodakis, 2016), and DenseNet-BC-121 (Huang et al., 2017b). The learning rate is divided by 10 at the 100th and 150th epoch, and the models are trained for 200 epochs with mini-batches of size 128. The only exception is WideResNet-28-10: we follow the settings described in (Zagoruyko & Komodakis, 2016), and the learning rate is divided by 10 at the 60th, 120th, and 180th epoch. In addition, no dropout (Srivastava et al., 2014) is applied to any baseline, following the best practices in (Ioffe & Szegedy, 2015). For Tiny ImageNet and ImageNet-2012, the baseline models are slightly different: we follow the settings from (Zhang et al., 2018), and the details are described in the corresponding paragraphs.

CIFAR-10 and CIFAR-100.
CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) contain colored natural images of 32×32 pixels, in 10 and 100 classes, respectively. We follow the baseline settings (He et al., 2016b) to pre-process the datasets; both datasets are split into a training set with 50,000 samples and a testing set with 10,000 samples. During training, zero-padding, random cropping, and horizontal mirroring (with a probability of 0.5) are applied to the images. For the testing images, we use the original 32×32-pixel images.

A comparison between the models trained with the primary objective alone and the COT models is illustrated in Figures 3a and 4a for CIFAR-10 and CIFAR-100, respectively. COT consistently outperforms the baseline models. Some of the models, for example ResNeXt-29, achieve a significant relative reduction of 12.5% in classification error. For other models such as WideResNet-28-10 and DenseNet-BC-121, the improvements are not as significant but are still large enough to justify the differences. Similar conclusions can be observed on the CIFAR-100 dataset. In addition to these comparisons, we also present the testing error over the course of training in Figures 3b and 4b for the ResNet-110 model. Following the standard training practice, the learning rate drops after the 100th epoch, which corresponds to a drop in testing error. As the plots show, COT consistently outperforms the baseline models as the models approach convergence.

Street View House Numbers (SVHN).
The SVHN dataset (Netzer et al., 2011) consists of images extracted from Google Street View. We divide the dataset into a set of 73,257 digits for training and a set of 26,032 digits for testing. When pre-processing the training and validation images, we follow the general practice of normalizing pixel values into [-1, 1]. Table 2 shows the experimental results and confirms that COT consistently improves the baseline models, with the biggest improvement being ResNet-110 with an 11.7% relative reduction in error rate.

Model                 Baseline    COT
ResNet-110            7.56
PreAct ResNet-18      5.46
ResNeXt-29 (2×64d)
WideResNet-28-10      4.40
DenseNet-BC-121       4.72

(a) Test errors (in %) on CIFAR-10. For COT, we repeat 5 runs and report the "best (mean ± std)" error values. (b) Test errors of ResNet-110 on CIFAR-10 over epochs.

Figure 3: Classification errors on CIFAR-10: (a) COT improves all 5 state-of-the-art models. (b) The improvement over epochs. Notice that the performance improvement from COT becomes stable after the 100th epoch due to the learning rate decrease.
Model                 Baseline    COT
ResNet-110            29.22
PreAct ResNet-18      25.44
ResNeXt-29 (2×64d)
WideResNet-28-10      21.91
DenseNet-BC-121       21.73

(a) Test errors (in %) on CIFAR-100. For COT, we repeat 3 runs and report the mean value. (b) Test errors of ResNet-110 on CIFAR-100 over epochs.

Figure 4: Classification errors on CIFAR-100: (a) COT improves all 5 state-of-the-art models. (b) The improvement over epochs. Similar to the trend observed on CIFAR-10, the performance improvement from COT becomes stable after the 100th epoch due to the learning rate decrease.

Table 2: Test errors (in %) of the baseline models and the COT-trained models on the SVHN dataset. The values presented are the means of 3 runs.
Model                 Baseline    COT
ResNet-110            4.94
PreAct ResNet-18      4.31
ResNeXt-29 (2×64d)
WideResNet-28-10      3.72
DenseNet-BC-121       3.52
Tiny ImageNet. The Tiny ImageNet dataset is a subset of ImageNet (Deng et al., 2009), which contains 100,000 images for training and 10,000 images for testing, across 200 classes. In this dataset, each image is down-sampled to 64×64 pixels from the original 256×256 pixels. We consider four state-of-the-art models as baselines: ResNet-50, ResNet-101 (He et al., 2016b), ResNeXt-50 (32×4d), and ResNeXt-101 (32×4d) (Xie et al., 2017).

Table 3: Test errors (in %) for the Tiny ImageNet experiments.

Model                 Baseline    COT
ResNet-50             39.39
ResNet-101            38.23
ResNeXt-50 (32×4d)
ResNeXt-101 (32×4d)

ImageNet-2012. The ImageNet-2012 dataset (Russakovsky et al., 2015) is one of the largest datasets for image classification, containing 1.3 million training images and 50,000 testing images across 1,000 classes. Random crops and horizontal flips are applied during training (He et al., 2016b), while images in the testing set use 224×224 center crops (1-crop testing) for data augmentation. ResNet-50 is selected as the baseline model, and we follow (Goyal et al., 2017) for the experimental settings; the learning rate is divided by 10 at the 30th, 60th, and 80th epoch. Table 4 shows (a) the error rate of the baseline reported by (He et al., 2016b) and (b) the error rate of the baseline model trained by COT, which confirms that COT further improves the baseline performance.

https://tiny-imagenet.herokuapp.com/

Table 4: Validation errors (in %) for the ImageNet-2012 experiments.

Model                         Baseline    COT
ResNet-50 (Top-1 Error)       24.7
3.3 Natural Language Understanding
COT is also evaluated on two natural language understanding (NLU) tasks: machine translation and speech recognition. One distinct characteristic of most NLU tasks is the large number of target classes. For example, the machine translation dataset used in this paper, IWSLT 2015 English-Vietnamese (Cettolo et al., 2015), has vocabularies of 17,191 English words and 7,709 Vietnamese words. This necessitates the normalized complement entropy in Eq. (3).
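To see why the (K − 1) normalization of Eq. (3) matters for large label spaces, note that the complement entropy of a flat complement distribution equals log(K − 1), which grows with the vocabulary size; dividing by (K − 1) keeps the complement loss on a scale comparable to the cross entropy without retuning learning rates. A sketch under the same simplified NumPy conventions as before (not the authors' code):

```python
import numpy as np

def normalized_complement_entropy(y_hat, g):
    # C'(y_hat) = C(y_hat) / (K - 1), per Eq. (3).
    n, k = y_hat.shape
    rows = np.arange(n)
    q = y_hat / (1.0 - y_hat[rows, g])[:, None]  # conditional complement dist.
    q[rows, g] = 1.0                             # contributes 0 to the sum
    q = np.clip(q, 1e-12, None)                  # avoid log(0)
    c = -np.mean(np.sum(q * np.log(q), axis=1))  # Eq. (2)
    return c / (k - 1)                           # Eq. (3)

def flat_case(k):
    # One sample, ground-truth probability 0.5, remaining mass spread evenly
    # over the k-1 complement classes, so the unnormalized C equals log(k-1).
    y_hat = np.full((1, k), 0.5 / (k - 1))
    y_hat[0, 0] = 0.5
    return normalized_complement_entropy(y_hat, np.array([0]))
```

Here `flat_case(k)` returns log(k − 1)/(k − 1), which shrinks as k grows; for a vocabulary-sized k (such as the 17,191-word English side above) the complement loss stays small relative to the cross entropy.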
Machine translation.
Neural machine translation (NMT) has popularized the use of neural sequence models (Sutskever et al., 2014; Cho et al., 2014). Specifically, we apply COT to the seq2seq model with the Luong attention mechanism (Luong et al., 2015) on the IWSLT 2015 English-Vietnamese dataset, which contains 133 thousand translation pairs. For validation and testing, we use TED tst2012 and TED tst2013, respectively. For the baseline implementation, we follow the official TensorFlow-NMT implementation: the number of total training steps is 12,000, and the weight decay starts at the 8,000th step and is applied every 1,000 steps. We experiment with models using both a greedy decoder and a beam-search decoder. The model trained by COT gives the best testing results with a beam width of 3, while the baseline uses 10 as the best beam width. Table 5 shows the experimental results: COT improves the testing BLEU scores over the baseline NMT model with both the greedy decoder and the beam-search decoder.

Table 5: Results on IWSLT 2015 English-Vietnamese. The BLEU scores on tst2013 are reported.

Model                 Baseline    COT
NMT (greedy)          25.5
NMT (beam search)     26.1
Speech recognition. For speech recognition, we experiment on the Google Commands dataset (Warden, 2018), which consists of 65,000 one-second utterances of 30 different keywords such as "Yes," "No," "Up," "Down," and "Stop." Our baseline model is referenced from (Zhang et al., 2018). We apply the same pre-processing steps as described in that paper, first performing the short-time Fourier transform on the original waveforms at a sampling rate of 4 kHz to obtain the corresponding spectrograms, and then zero-padding the spectrograms to equalize the samples' lengths. For the baseline model, we select VGG-11 (Simonyan & Zisserman, 2015) and train it for 30 epochs following the steps in (Zhang et al., 2018). We use the SGD optimizer with momentum and a weight decay of 0.0001. The learning rate starts at 0.0001 and is then divided by 10 at the 10th and 20th epoch. COT improves the baseline by further reducing the error rate by 1.56%, as shown in Table 6.

3.4 Adversarial Examples
An adversarial example is an imperceptibly-perturbed input that causes the model to output an incorrect answer with high confidence (Szegedy et al., 2014; Goodfellow et al., 2015). Prior literature has shown several methods of generating effective adversarial examples that greatly mislead a model into providing wrong predictions.

Table 6: Test errors (in %) for the speech recognition experiments.

Model      Baseline    COT
VGG-11     6.06

As shown in Figure 2, the proposed COT generates embeddings where the class boundaries are clear and well-separated. We believe that models trained using COT generalize better and are more robust to adversarial attacks. To verify this conjecture, we conduct white-box attack experiments on the models trained by COT. We consider a common approach to single-step adversarial attacks: the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), which uses the gradient to determine the direction of the perturbation applied to an input for creating an adversarial example. To set up FGSM white-box attacks on a baseline model, adversarial examples are generated using the gradients calculated from the primary objective (referred to as the "primary gradient") of the baseline model. For FGSM white-box attacks on COT, adversarial perturbations are generated from the sum of the primary gradient and the complement gradient (i.e., the gradient calculated from the complement objective), with both gradients taken from the model trained by COT. In our experiments, the baseline models are the same as in Section 3.2, and the amount of perturbation is limited to a maximum value of 0.1, as described in (Goodfellow et al., 2015), when creating adversarial examples. Furthermore, we also conduct experiments on FGSM transfer attacks, which use the adversarial examples generated from a baseline model to attack and test the robustness of the model trained by COT.

https://github.com/KaimingHe/deep-residual-networks
https://github.com/tensorflow/nmt/tree/tf-1.4

Table 7: Classification errors (in %) on CIFAR-10 under FGSM white-box & transfer attacks.

Model                 Baseline    COT (White-Box)    COT (Transfer)
ResNet-110            62.23
PreAct ResNet-18      65.60
ResNeXt-29 (2×64d)
WideResNet-28-10      59.39
DenseNet-BC-121       65.97
Table 7 shows the performance of the models on the CIFAR-10 dataset under FGSM white-box and transfer attacks. In general, the models trained using COT have lower classification errors under both FGSM white-box and transfer attacks, indicating that COT models are more robust to both kinds of attacks. We also conduct experiments on basic iterative attacks using I-FGSM (Kurakin et al., 2017); the corresponding results can be found in Appendix A.

We conjecture that, since the main goal of the complement gradient is to neutralize the probabilities of the incorrect classes (instead of maximizing the probability of the correct class), the complement gradient may "push away" the primary gradient when forming adversarial perturbations, which might partially explain why COT is more robust to FGSM white-box attacks than the baseline. Regarding transfer attacks, only the primary objective of the baseline model is used to calculate the gradients for generating adversarial examples. In other words, the complement gradient is not considered when generating adversarial examples in a transfer attack, and this might be why models trained by COT are more robust to transfer attacks. Both conjectures leave a large space for future work: using the complement objective to defend against more advanced adversarial attacks.
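The white-box attack on a COT model described above can be sketched end-to-end on a toy differentiable model. This is a hypothetical illustration (a tiny linear classifier with finite-difference gradients), not the attack code used in the experiments: the perturbation follows the sign of the summed primary and complement gradients with respect to the input, clipped to eps = 0.1 as in Goodfellow et al. (2015).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy linear "model": logits = W @ x (a stand-in for a trained network).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
g = 0  # ground-truth class index

def primary_loss(v):
    # Cross entropy of the ground-truth class, as in Eq. (1).
    return -np.log(softmax(W @ v)[g])

def complement_loss(v):
    # Negative complement entropy: COT maximizes C, so the attacker ascends -C.
    p = softmax(W @ v)
    q = np.delete(p, g) / (1.0 - p[g])
    return np.sum(q * np.log(np.clip(q, 1e-12, None)))

def num_grad(f, v, h=1e-5):
    # Central-difference gradient; fine for a sketch, too slow for real nets.
    grad = np.zeros_like(v)
    for i in range(v.size):
        d = np.zeros_like(v)
        d[i] = h
        grad[i] = (f(v + d) - f(v - d)) / (2 * h)
    return grad

eps = 0.1  # maximum perturbation
x_adv = x + eps * np.sign(num_grad(primary_loss, x) + num_grad(complement_loss, x))
```

Because the two gradient fields generally point in different directions, their sum can deviate from the pure primary gradient; this is one way to picture the "push away" effect conjectured above.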
4 Conclusion and Future Work
In this paper, we study Complement Objective Training (COT), a new training paradigm that optimizes the complement objective in addition to the primary objective. We propose complement entropy as the complement objective for neutralizing the effects of the complement (incorrect) classes. Models trained using COT demonstrate superior performance compared to the baseline models. We also find that COT makes models more robust to single-step adversarial attacks.

COT can be extended in several ways. First, in this paper, the complement objective is chosen to be the complement entropy; non-entropy-based complement objectives should also be considered in future studies. Second, the exploration of COT in broader applications remains an open research question: one example would be applying COT to generative models such as Generative Adversarial Networks (Goodfellow et al., 2014); another would be using COT for object detection and segmentation. Finally, in this work, we show that using the complement objective helps defend against single-step adversarial attacks; the behavior of COT under more advanced adversarial attacks deserves further investigation and is left as future work.

References
Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. The IWSLT 2015 evaluation campaign. In ICSLP'15, 2015.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP'14, 2014.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR'09, 2009.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS'14, 2014.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR'15, 2015.

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV'16, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR'16, 2016b.

Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. In ICLR'17, 2017a.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR'17, 2017b.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML'15, 2015.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS'12, 2012.

Solomon Kullback and Richard A. Leibler. On information and sufficiency. Ann. Math. Statist., 1951.

Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR'17 Workshop, 2017.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP'15, 2015.

Tom M. Mitchell. Machine learning. Burr Ridge, IL: McGraw Hill, 1997.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS'11 Workshop, 2011.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR'15, 2015.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS'14, 2014.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR'14, 2014.

Pete Warden. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR'17, 2017.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC'16, 2016.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR'18, 2018.
A Iterative Fast Gradient Sign Method

Table 8: Classification errors (in %) on CIFAR-10 under I-FGSM transfer attacks.

Model                 Baseline    COT (Transfer)
ResNet-110            88.00
PreAct ResNet-18      84.56
ResNeXt-29 (2×64d)
WideResNet-28-10      81.97
DenseNet-BC-121       86.95       83.66