Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets
Yogesh Balaji∗, Tom Goldstein, Judy Hoffman
Facebook AI Research, University of Maryland, Georgia Institute of Technology

Abstract
Adversarial training is by far the most successful strategy for improving the robustness of neural networks to adversarial attacks. Despite its success as a defense mechanism, adversarial training fails to generalize well to the unperturbed test set. We hypothesize that this poor generalization is a consequence of adversarial training with a uniform perturbation radius around every training sample. Samples close to the decision boundary can be morphed into a different class under a small perturbation budget, and enforcing large margins around these samples produces poor decision boundaries that generalize poorly. Motivated by this hypothesis, we propose instance adaptive adversarial training: a technique that enforces sample-specific perturbation margins around every training sample. We show that using our approach, test accuracy on unperturbed samples improves with a marginal drop in robustness. Extensive experiments on the CIFAR-10, CIFAR-100 and Imagenet datasets demonstrate the effectiveness of our proposed approach.
1 Introduction
A key challenge when deploying neural networks in safety-critical applications is their poor stability to input perturbations. Extremely tiny perturbations to network inputs may be imperceptible to the human eye, and yet cause major changes to outputs. One of the most effective and widely used methods for hardening networks to small perturbations is "adversarial training" (Madry et al., 2018), in which a network is trained using adversarially perturbed samples with a fixed perturbation size. By doing so, adversarial training typically tries to enforce that the output of a neural network remains nearly constant within an ℓ_p ball of every training input.

Despite its ability to increase robustness, adversarial training suffers from poor accuracy on clean (natural) test inputs. The drop in clean accuracy can be substantial on both CIFAR-10 and Imagenet (Madry et al., 2018; Xie et al., 2019), making robust models undesirable in some industrial settings. The consistently poor performance of robust models on clean data has led to the line of thought that there may be a fundamental trade-off between robustness and accuracy (Zhang et al., 2019; Tsipras et al., 2019), and recent theoretical results characterized this tradeoff (Fawzi et al., 2018; Shafahi et al., 2018; Mahloujifar et al., 2019).

In this work, we aim to understand and optimize the tradeoff between robustness and clean accuracy. More concretely, our objective is to improve the clean accuracy of adversarial training for a chosen level of adversarial robustness. Our method is inspired by the observation that the constraints enforced by adversarial training are infeasible; for commonly used values of ε, it is not possible to achieve label consistency within an ε-ball of each input image because the balls around images of different classes overlap. This is illustrated on the left of Figure 1, which shows that the ε-ball around a "bird" (from the CIFAR-10 training set) contains images of class "deer" (that do not appear in the training set). If adversarial training were successful at enforcing label stability in an ε = 8 ball around the "bird" training image, doing so would come at the unavoidable cost of misclassifying the nearby "deer" images that come along at test time. At the same time, when training images lie far from the decision boundary (e.g., the deer image on the right in Fig. 1), it is possible to enforce stability with large ε with no compromise in clean accuracy. When adversarially training on CIFAR-10, we see that ε = 8 is too large for some images, causing accuracy loss, while being unnecessarily small for others, leading to sub-optimal robustness.

∗ Work done during an internship at Facebook AI Research

Figure 1: Overview of instance adaptive adversarial training. Samples close to the decision boundary (bird on the left) have nearby samples from a different class (deer) within a small ℓ_p ball, making the constraints imposed by PGD-8 / PGD-16 adversarial training infeasible. Samples far from the decision boundary (deer on the right) can withstand large perturbations well beyond ε = 8. Our adaptive adversarial training correctly assigns the perturbation radius (shown in dotted line) so that samples within each ℓ_p ball maintain the same class.

The above observation naturally motivates adversarial training with instance adaptive perturbation radii that are customized to each training image.
By choosing larger robustness radii at locations where class manifolds are far apart, and smaller radii at locations where class manifolds are close together, we get high adversarial robustness where possible while minimizing the clean accuracy loss that comes from enforcing overly-stringent constraints on images that lie near class boundaries. As a result, instance adaptive training significantly improves the tradeoff between accuracy and robustness, breaking through the Pareto frontier achieved by standard adversarial training. Additionally, we show that the learned instance-specific perturbation radii are interpretable: samples with small radii are often ambiguous and have nearby images of another class, while images with large radii have unambiguous class labels that are difficult to manipulate.

Parallel to our work, we found that Ding et al. (2018) use adaptive margins in a max-margin framework for adversarial training. Their work focuses on improving adversarial robustness, which differs from our goal of understanding and improving the robustness-accuracy tradeoff. Moreover, our algorithm for choosing adaptive margins differs significantly from that of Ding et al. (2018).

2 Background
Adversarial attacks are data items containing small perturbations that cause misclassification in neural network classifiers (Szegedy et al., 2014). Popular methods for crafting attacks include the fast gradient sign method (FGSM) (Goodfellow et al., 2015), which is a one-step gradient attack, projected gradient descent (PGD) (Madry et al., 2018), which is a multi-step extension of FGSM, the C/W attack (Carlini & Wagner, 2017), DeepFool (Moosavi-Dezfooli et al., 2016), and many more. All these methods use the gradient of the loss function with respect to the inputs to construct additive perturbations with a norm constraint. Alternative attack metrics include spatial transformer attacks (Xiao et al., 2018), attacks based on Wasserstein distance in pixel space (Wong et al., 2019), etc.

Defending against adversarial attacks is a crucial problem in machine learning. Many early defenses (Buckman et al., 2018; Samangouei et al., 2018; Dhillon et al., 2018) were broken by strong attacks. Fortunately, adversarial training is one defense strategy that remains fairly resistant to most existing attacks.

Let D = {(x_i, y_i)}_{i=1}^{n} denote the set of training samples in the input dataset. In this paper, we focus on classification problems; hence y_i ∈ {1, 2, ..., N_c}, where N_c denotes the number of classes. Let f_θ(x): R^{c×m×n} → R^{N_c} denote a neural network model parameterized by θ. Classifiers are often trained by minimizing the cross-entropy loss given by

\min_\theta \; \frac{1}{N} \sum_{(x_i, y_i) \sim D} -\tilde{y}_i^\top \log\big(f_\theta(x_i)\big),

where ỹ_i is the one-hot vector corresponding to the label y_i. In adversarial training, instead of optimizing the neural network over the clean training set, we use the adversarially perturbed training set. Mathematically, this can be written as the following min-max problem:

\min_\theta \; \max_{\|\delta_i\|_\infty \le \epsilon} \; \frac{1}{N} \sum_{(x_i, y_i) \sim D} -\tilde{y}_i^\top \log\big(f_\theta(x_i + \delta_i)\big) \qquad (1)

This problem is solved by an alternating stochastic method that takes minimization steps for θ, followed by maximization steps that approximately solve the inner problem using k steps of PGD. For more details, refer to Madry et al. (2018).
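The inner maximization in Eq. (1) is typically approximated with a few steps of PGD. Below is a minimal PyTorch-style sketch of such an ℓ∞ PGD attack; it is our illustration rather than the authors' code, and the function and argument names (pgd_linf, step_size, num_steps) are our own.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, step_size, num_steps):
    """k-step l-infinity PGD, approximating the inner maximization of Eq. (1).
    `eps` may be a scalar or a per-sample tensor broadcastable to x's shape."""
    eps = torch.as_tensor(eps, dtype=x.dtype, device=x.device)
    delta = ((torch.rand_like(x) * 2 - 1) * eps).detach()          # random start in the eps-ball
    delta.requires_grad_(True)
    for _ in range(num_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = delta.detach() + step_size * grad.sign()            # gradient-sign ascent step
        delta = torch.max(torch.min(delta, eps), -eps)              # project back onto the eps-ball
        delta = ((x + delta).clamp(0, 1) - x).requires_grad_(True)  # keep pixels in [0, 1]
    return (x + delta).detach()
```

With images scaled to [0, 1], a CIFAR-style call matching the hyper-parameters in Table 7 of Appendix D would be `pgd_linf(model, x, y, eps=8/255, step_size=2/255, num_steps=10)`.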
Algorithm 1: Adaptive adversarial training

Require: N_iter: number of training iterations, N_warm: warmup period
Require: PGD_k(x, y, ε): function to generate PGD-k adversarial samples with ε norm-bound
Require: ε_w: ε used in warmup

for t = 1 to N_iter do
    Sample a batch of training samples {(x_i, y_i)}_{i=1}^{N_batch} ∼ D
    if t < N_warm then
        ε_i = ε_w
    else
        Choose ε_i using Alg. 2
    end if
    x_i^adv = PGD_k(x_i, y_i, ε_i)
    S+ = {i | f(x_i) is correctly classified as y_i}
    S- = {i | f(x_i) is incorrectly classified as y_i}
    Update θ by minimizing (1/N_batch) [ Σ_{i ∈ S+} L_cls(x_i^adv, y_i) + Σ_{i ∈ S-} L_cls(x_i, y_i) ]
end for
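Algorithm 1 can be condensed into a short PyTorch-style training loop. The sketch below follows the structure of Alg. 1, but all names are ours: pgd_linf is the attack sketch above, and select_eps is a stand-in for Algorithm 2 (a sketch of which appears after Algorithm 2 below).

```python
import torch
import torch.nn.functional as F

def iaat_epoch(model, optimizer, loader, epoch, eps_warm, warmup_epochs,
               select_eps, step_size, num_steps):
    """One epoch of Algorithm 1 (sketch). `loader` is assumed to yield
    (image, label, sample_index) triples; `select_eps(model, x, y, idx)`
    stands in for Algorithm 2 and returns one epsilon per sample."""
    model.train()
    for x, y, idx in loader:
        if epoch < warmup_epochs:
            eps = torch.full((x.size(0), 1, 1, 1), eps_warm)       # uniform eps during warmup
        else:
            eps = select_eps(model, x, y, idx).view(-1, 1, 1, 1)   # per-sample eps (Alg. 2)
        x_adv = pgd_linf(model, x, y, eps, step_size, num_steps)   # attack from the earlier sketch
        with torch.no_grad():
            correct = model(x).argmax(dim=1).eq(y)                 # split the batch into S+ / S-
        # Correctly classified samples (S+) are trained on their adversarial copies;
        # misclassified samples (S-) are trained on the clean image, as in Alg. 1.
        x_train = torch.where(correct.view(-1, 1, 1, 1), x_adv, x)
        loss = F.cross_entropy(model(x_train), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```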
3 Instance Adaptive Adversarial Training

To remedy the shortcomings of a uniform perturbation radius in adversarial training (Section 1), we propose Instance Adaptive Adversarial Training (IAAT), which solves the following optimization:

\min_\theta \; \max_{\|\delta_i\|_\infty < \epsilon_i} \; \frac{1}{N} \sum_{(x_i, y_i) \sim D} -\tilde{y}_i^\top \log\big(f_\theta(x_i + \delta_i)\big) \qquad (2)

Like vanilla adversarial training, we solve this by sampling mini-batches of images {x_i}, crafting adversarial perturbations {δ_i} of size at most {ε_i}, and then updating the network model using the perturbed images.

The proposed algorithm is distinctive in that it uses a different ε_i for each image x_i. Ideally, we would choose each ε_i to be as large as possible without finding images of a different class within the ε_i-ball around x_i. Since we have no a-priori knowledge of what this radius is, we use a simple heuristic to update ε_i after each epoch. After crafting a perturbation for x_i, we check if the perturbed image was a successful adversarial example. If PGD succeeded in finding an image with a different class label, then ε_i is too big, so we replace ε_i ← ε_i − γ. If PGD failed, then we set ε_i ← ε_i + γ.

Since the network is randomly initialized at the start of training, random predictions are made, and this causes {ε_i} to shrink rapidly. For this reason, we begin with a warmup period of a few epochs (usually 10 epochs for CIFAR-10/100) where adversarial training is performed using a uniform ε for every sample. After the warmup period ends, we perform instance adaptive adversarial training. A detailed training algorithm is provided in Alg. 1.

Figure 2: Visualizing training samples and their perturbations. (a) Samples from the bottom of the ε distribution; (b) samples from the top. The left panel shows samples that are assigned small ε (displayed below the images) during adaptive training. These images are close to class boundaries, and change class under larger perturbations. The right panel shows images that are assigned large ε. These lie far from the decision boundary, and retain class information even with very large perturbations.

Algorithm 2: ε selection algorithm
Require: i: sample index, j: epoch index
Require: β: smoothing constant, γ: discretization for ε search

Set ε_1 = ε_mem[j−1, i] + γ
Set ε_2 = ε_mem[j−1, i]
Set ε_3 = ε_mem[j−1, i] − γ
if f_θ(PGD_k(x_i, y_i, ε_1)) predicts y_i then
    Set ε_i = ε_1
else if f_θ(PGD_k(x_i, y_i, ε_2)) predicts y_i then
    Set ε_i = ε_2
else
    Set ε_i = ε_3
end if
ε_i ← (1 − β) ε_mem[j−1, i] + β ε_i
Update ε_mem[j, i] ← ε_i
Return ε_i
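The following is a compact sketch of how Algorithm 2 could be implemented; it fills the role of the select_eps placeholder used in the earlier training-loop sketch. The class name, eps_mem, and the attack callable are ours, and the values of γ and β (listed in Tables 7 and 8) are passed in by the caller.

```python
import torch

class EpsilonSelector:
    """Sketch of Algorithm 2: per-sample epsilon search with exponential smoothing.
    `eps_mem` plays the role of the epsilon memory in Alg. 2; all names are ours."""

    def __init__(self, num_samples, eps_init, gamma, beta, attack):
        self.eps_mem = torch.full((num_samples,), float(eps_init))
        self.gamma = gamma      # discretization of the epsilon search
        self.beta = beta        # smoothing constant
        self.attack = attack    # e.g. the pgd_linf sketch with step count / step size bound in

    def _robust(self, model, x, y, eps):
        # The sample counts as robust at radius eps if the attack fails to flip the label.
        x_adv = self.attack(model, x, y, eps)
        return model(x_adv).argmax(dim=1).eq(y).item()

    def select(self, model, x, y, idx):
        """Return per-sample epsilons for a batch and update the memory."""
        eps_prev = self.eps_mem[idx]
        chosen = eps_prev.clone()
        for i in range(x.size(0)):
            e_up, e_same, e_down = eps_prev[i] + self.gamma, eps_prev[i], eps_prev[i] - self.gamma
            xi, yi = x[i:i + 1], y[i:i + 1]
            # Pick the largest of the three candidate radii at which the sample is still robust.
            if self._robust(model, xi, yi, e_up):
                chosen[i] = e_up
            elif self._robust(model, xi, yi, e_same):
                chosen[i] = e_same
            else:
                chosen[i] = e_down
        eps_new = (1 - self.beta) * eps_prev + self.beta * chosen   # exponential smoothing
        self.eps_mem[idx] = eps_new
        return eps_new
```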
4 Experiments

To evaluate the robustness and generalization of our models, we report the following metrics: (1) test accuracy on unperturbed (natural) test samples, (2) adversarial accuracy under white-box PGD attacks, (3) adversarial accuracy under transfer attacks, and (4) accuracy on test samples under common image corruptions (Hendrycks & Dietterich, 2019). Following the protocol introduced in Hendrycks & Dietterich (2019), we do not train our models on any image corruptions.

4.1 CIFAR

On the CIFAR-10 and CIFAR-100 datasets, we perform experiments on Resnet-18 and WideResnet-32-10 models following (Madry et al., 2018; Zhang et al., 2019). All models are trained on PGD-10 attacks, i.e., 10 steps of PGD iterations are used for crafting adversarial attacks during training. In the whitebox setting, models are evaluated on three PGD attack settings of increasing strength, each with multiple random restarts; a sketch of this multi-restart evaluation appears below Figure 3. For transfer attacks, an independent copy of the model is trained using the same training algorithm and hyper-parameter settings, and PGD adversarial attacks with random restarts are crafted on the surrogate model. For image corruptions, following (Hendrycks & Dietterich, 2019), we report the average classification accuracy over the image corruptions.

Figure 3: Tradeoffs between accuracy and robustness on (a) CIFAR-10 and (b) CIFAR-100. Each blue dot denotes an adversarially trained model with a different ε. Models trained using instance adaptive adversarial training are shown in red. Adaptive training breaks through the Pareto frontier achieved by plain adversarial training with a fixed ε.
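To make the white-box protocol concrete, here is a minimal sketch of how robust accuracy under PGD with random restarts could be computed, reusing the pgd_linf sketch from Section 2. A sample counts as robust only if every restart fails; the function name and arguments are ours, and the step and restart counts are left as parameters.

```python
import torch

def pgd_accuracy(model, loader, eps, step_size, num_steps, num_restarts):
    """White-box robust accuracy (%) under PGD with random restarts (sketch)."""
    model.eval()
    robust, total = 0, 0
    for x, y in loader:
        still_robust = torch.ones(x.size(0), dtype=torch.bool)
        for _ in range(num_restarts):
            x_adv = pgd_linf(model, x, y, eps, step_size, num_steps)  # restart from a new random point
            with torch.no_grad():
                still_robust &= model(x_adv).argmax(dim=1).eq(y).cpu()
        robust += still_robust.sum().item()
        total += x.size(0)
    return 100.0 * robust / total
```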
Beating the robustness-accuracy tradeoff: In adversarial training, the perturbation radius ε is a hyper-parameter. Training models with varying ε produces a robustness-accuracy tradeoff curve: models with small training ε achieve better natural accuracy and poor adversarial robustness, while models trained on large ε have improved robustness and poor natural accuracy. To generate this tradeoff, we perform adversarial training with ε swept over a range of fixed values. Instance adaptive adversarial training is then compared with respect to this tradeoff curve in Figs. 3a and 3b. Two versions of IAAT are reported: with and without a warmup phase. In both versions, we clearly achieve an improvement over the accuracy-robustness tradeoff. Use of the warmup phase helps retain robustness, with a drop in natural accuracy compared to its no-warmup counterpart.
Clean accuracy improves for a fixed level of robustness: On CIFAR-10, as shown in Table 1, we observe that our instance adaptive adversarial training algorithm achieves similar adversarial robustness to the adversarial training baseline. However, the accuracy on clean test samples increases by roughly 4% for Resnet-18 and 4.5% for WideResnet-32-10. We also observe that the adaptive training algorithm improves robustness to unseen image corruptions. This points to an improvement in the overall generalization ability of the network. On CIFAR-100 (Table 2), the performance gain in natural test accuracy further increases: roughly 8.8% for Resnet-18, and 9.2% for WideResnet-32-10. The adversarial robustness drop is marginal.

Maintaining performance over a range of test ε: Next, we plot adversarial robustness over a sweep of ε values used to craft attacks at test time. Figs. 4a and 4b show that an adversarial training baseline with ε = 8 performs well in high ε regimes and poorly in low ε regimes. On the other hand, adversarial training with ε = 2 has the reverse effect, performing well at low ε and poorly at high ε. Our instance adaptive training algorithm maintains good performance over all ε regimes, achieving slightly lower performance than the ε = 2 model for small test ε, and dominating all models for larger test ε.
Interpretability of ε: We find that the values of ε_i chosen by our adaptive algorithm correlate well with our own human concept of class ambiguity. Figure 2 (and Figure 7 in Appendix B) shows that a sampling of images that receive small ε_i contains many ambiguous images, and these images are perturbed into a (visually) different class using ε = 16. In contrast, images that receive a large ε_i have a visually definite class, and are not substantially altered by an ε = 16 perturbation.

Table 1: Robustness experiments on CIFAR-10. PGD attacks are generated with ε = 8 and use random restarts; the three whitebox columns correspond to the three PGD attack settings described in Section 4.1. All numbers are accuracies in %.

| Method | Natural acc. | PGD (1) | PGD (2) | PGD (3) | Transfer acc. | Corruption acc. |
| Resnet-18: Clean | 94.21 | 0.02 | 0.00 | 0.00 | 3.03 | 72.71 |
| Resnet-18: Adversarial | 83.20 | 43.79 | 42.30 | 42.36 | 59.80 | 73.73 |
| Resnet-18: IAAT | 87.26 | 43.08 | 41.16 | 41.16 | 59.87 | 78.82 |
| WideResnet 32-10: Clean | 95.50 | 0.05 | 0.00 | 0.00 | 5.02 | 78.35 |
| WideResnet 32-10: Adversarial | 86.85 | 46.86 | 44.82 | 44.84 | 62.77 | 77.99 |
| WideResnet 32-10: IAAT | 91.34 | 48.53 | 46.50 | 46.54 | 58.20 | 83.13 |

Table 2: Robustness experiments on CIFAR-100. PGD attacks are generated with ε = 8 and use random restarts; whitebox columns as in Table 1. All numbers are accuracies in %.

| Method | Natural acc. | PGD (1) | PGD (2) | PGD (3) | Transfer acc. |
| Resnet-18: Clean | 74.88 | 0.02 | 0.00 | 0.01 | 1.81 |
| Resnet-18: Adversarial | 55.11 | 20.69 | 19.68 | 19.91 | 35.57 |
| Resnet-18: IAAT | 63.90 | 18.50 | 17.10 | 17.11 | 35.74 |
| WideResnet 32-10: Clean | 79.91 | 0.01 | 0.00 | 0.00 | 1.20 |
| WideResnet 32-10: Adversarial | 59.58 | 26.24 | 25.47 | 25.49 | 38.10 |
| WideResnet 32-10: IAAT | 68.80 | 26.17 | 24.22 | 24.36 | 35.18 |

4.2 Imagenet
Following the protocol introduced in Xie et al. (2019), we attack Imagenet models using random targeted attacks instead of the untargeted attacks used in the previous experiments. During training, adversarial attacks are generated using 30 steps of PGD. As a baseline, we use adversarial training with a fixed ε of 16/255, which is the setting used in Xie et al. (2019). Adversarial training on Imagenet is computationally intensive. To make training practical, we use distributed training with synchronized SGD on 64/128 GPUs. More implementation details can be found in Appendix D.

Figure 4: Plot of adversarial robustness over a sweep of test ε on (a) CIFAR-10 and (b) CIFAR-100.

Table 3: Robustness experiments on Imagenet. Whitebox accuracy (in %) is reported at several attack radii ε. (↑) indicates higher numbers are better, while (↓) indicates lower numbers are better.

| Method | Natural acc. (↑) | ε = 4 (↑) | ε = 8 (↑) | ε = 12 (↑) | ε = 16 (↑) | mCE (↓) |
| Resnet-50: Clean training | 75.80 | 0.64 | 0.18 | 0.00 | 0.00 | 76.69 |
| Resnet-50: Adversarial training | 50.99 | 50.89 | 49.11 | 44.71 | 35.82 | 95.48 |
| Resnet-50: IAAT | 62.71 | 61.52 | 54.63 | 39.90 | 22.72 | 85.21 |
| Resnet-101: Clean training | 77.10 | 0.83 | 0.12 | 0.00 | 0.00 | 70.37 |
| Resnet-101: Adversarial training | 55.42 | 55.11 | 53.07 | 48.35 | 39.08 | 91.45 |
| Resnet-101: IAAT | 65.29 | 63.83 | 56.62 | 41.51 | 23.91 | 79.52 |
| Resnet-152: Clean training | 77.60 | 0.57 | 0.08 | 0.00 | 0.00 | 69.27 |
| Resnet-152: Adversarial training | 57.26 | 56.77 | 54.75 | 49.86 | 40.40 | 89.31 |
| Resnet-152: IAAT | 67.44 | 65.97 | 59.28 | 45.01 | 27.85 | 78.53 |

Table 4: Ablation: effect of warmup on CIFAR-10. Whitebox columns as in Table 1. All numbers are accuracies in %.

| Method | Natural acc. | PGD (1) | PGD (2) | PGD (3) | Transfer acc. | Corruption acc. |
| Resnet-18: IAAT (no warmup) | 89.62 | 40.55 | 38.15 | 38.08 | 58.89 | 81.10 |
| Resnet-18: IAAT (warmup) | 87.26 | 43.08 | 41.16 | 41.16 | 59.87 | 78.82 |
| WideResnet 32-10: IAAT (no warmup) | 92.62 | 45.12 | 41.08 | 41.11 | 53.08 | 84.92 |
| WideResnet 32-10: IAAT (warmup) | 90.67 | 48.53 | 46.50 | 46.54 | 58.20 | 83.13 |

At test time, we evaluate the models on clean test samples and on whitebox adversarial attacks with ε = {4, 8, 12, 16}; multi-step PGD attacks are used (see Appendix C for a sweep over attack iterations). Additionally, we also report the normalized mean corruption error (mCE), an evaluation metric introduced in Hendrycks & Dietterich (2019) to test the robustness of neural networks to image corruptions. This metric reports the mean classification error on different image corruptions averaged over varying levels of degradation. Note that while accuracies are reported for natural and adversarial robustness, mCE reports classification errors, so lower numbers are better.

Our experimental results are reported in Table 3. We observe a huge drop in natural accuracy for adversarial training (roughly 25%, 22% and 20% for Resnet-50, 101 and 152 respectively, relative to clean training). Adaptive adversarial training significantly improves the natural accuracy: we obtain a consistent gain of roughly 10% on all three models over the adversarial training baseline. On whitebox attacks, IAAT outperforms the adversarial training baseline in low ε regimes; however, a drop of roughly 13-15% is observed at high ε (ε = 16). On the corruption dataset, our model consistently outperforms adversarial training.
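For reference, the normalized mCE metric mentioned above is computed, following Hendrycks & Dietterich (2019), by averaging a classifier's top-1 error E^f_{s,c} on corruption type c over severity levels s, normalizing by the corresponding AlexNet errors, and then averaging over the set C of corruption types:

\mathrm{CE}^{f}_{c} = \frac{\sum_{s=1}^{5} E^{f}_{s,c}}{\sum_{s=1}^{5} E^{\mathrm{AlexNet}}_{s,c}}, \qquad \mathrm{mCE}^{f} = \frac{1}{|C|} \sum_{c \in C} \mathrm{CE}^{f}_{c}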
5 Ablation Experiments

5.1 Effect of Warmup

In this section, we study the effect of using a warmup phase in adaptive adversarial training. Recall from Section 3 that during warmup, adversarial training is performed with uniform norm-bound constraints. Once the warmup phase ends, we switch to instance adaptive training. From Tables 4 and 5, we observe that when warmup is used, adversarial robustness improves with a small drop in natural accuracy. The improvement in robustness is more pronounced on the CIFAR-100 dataset. However, as shown in Figs. 3a and 3b, both settings improve the accuracy-robustness tradeoff.

Table 5: Ablation: effect of warmup on CIFAR-100. Whitebox columns as in Table 1. All numbers are accuracies in %.

| Method | Natural acc. | PGD (1) | PGD (2) | PGD (3) | Transfer acc. |
| Resnet-18: Adaptive (no warmup) | 68.34 | 14.76 | 13.29 | 13.30 | 32.39 |
| Resnet-18: Adaptive (warmup) | 63.90 | 18.50 | 17.10 | 17.11 | 35.74 |
| WideResnet 32-10: Adaptive (no warmup) | 75.48 | 18.14 | 13.78 | 13.71 | 24.00 |
| WideResnet 32-10: Adaptive (warmup) | 68.80 | 26.17 | 24.22 | 24.36 | 35.18 |

Figure 5: Visualizing the ε progress of instance adaptive adversarial training. (a) Average ε of samples over epochs; (b) ε progress of three randomly chosen samples.

5.2 Visualizing ε Progress
Next, we visualize the evolution of ε over epochs in adaptive adversarial training. A plot showing the average ε growth, along with the ε progress of randomly picked samples, is shown in Figs. 5a and 5b. We observe that the average ε converges to a value above the default setting of ε = 8 used in adversarial training. Also, each sample has a different ε profile: for some, ε increases well beyond the commonly used radius of ε = 8, while for others, it converges below it. In addition, a plot showing the histogram of ε at different snapshots of training is shown in Fig. 8. We observe an increase in the spread of the histogram as training progresses.

6 Conclusion
In this work, we focus on improving the robustness-accuracy tradeoff in adversarial training. We first show that realizable robustness is a sample-specific attribute: samples close to the decision boundary can only achieve robustness within a small ε ball, as they have samples from a different class beyond this radius. On the other hand, samples far from the decision boundary can be robust over a relatively large perturbation radius. Motivated by this observation, we develop instance adaptive adversarial training, in which label consistency constraints are imposed within sample-specific perturbation radii that are in turn estimated during training. Our proposed algorithm empirically improves the robustness-accuracy tradeoff on the CIFAR-10, CIFAR-100 and Imagenet datasets.
Acknowledgements
Goldstein and Balaji were supported in part by the DARPA GARD program, DARPA QED for RML, DARPA Lifelong Learning Machines, the DARPA Young Faculty Award program, the AFOSR MURI program, and the National Science Foundation.
References
Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S18Su--CW.

Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39-57, 2017.

Guneet S. Dhillon, Kamyar Azizzadenesheli, Jeremy D. Bernstein, Jean Kossaifi, Aran Khanna, Zachary C. Lipton, and Animashree Anandkumar. Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1uR4GZRZ.

Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang. Max-margin adversarial (MMA) training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018.

Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. In Advances in Neural Information Processing Systems, pp. 1178-1187, 2018.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training Imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJz6tiCqYm.

Alex Lamb, Vikas Verma, Juho Kannala, and Yoshua Bengio. Interpolated adversarial training: Achieving robust neural networks without sacrificing accuracy. CoRR, abs/1906.06784, 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

Saeed Mahloujifar, Dimitrios I. Diochnos, and Mohammad Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4536-4543, 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574-2582, 2016.

Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkJ3ibb0-.

Ali Shafahi, W. Ronny Huang, Christoph Studer, Soheil Feizi, and Tom Goldstein. Are adversarial examples inevitable? arXiv preprint arXiv:1809.02104, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.

Eric Wong, Frank Schmidt, and Zico Kolter. Wasserstein adversarial examples via projected Sinkhorn iterations. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 6808-6817. PMLR, 2019.

Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyydRMZC-.

Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7472-7482, Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/v97/zhang19p.html.

Appendix
A.1 Comparison with Mixup

A recent paper that addresses the problem of improving natural accuracy in adversarial training is mixup adversarial training (Lamb et al., 2019), where adversarially trained models are optimized using the mixup loss instead of the standard cross-entropy loss. In that paper, natural accuracy was shown to improve with no drop in adversarial robustness. However, the robustness experiments were not evaluated on strong attacks (experiments were reported only on PGD-20). We compare our implementation of mixup adversarial training with IAAT on stronger attacks in Table 6. We observe that while natural accuracy improves for mixup, the drop in adversarial accuracy is much higher than for IAAT.

Table 6: Comparison with mixup. Whitebox columns as in Table 1. All numbers are accuracies in %.

| Method | Natural acc. | PGD (1) | PGD (2) | PGD (3) | Transfer attack |
| Resnet-18: Mixup | 89.47 | 42.60 | 38.42 | 38.49 | 59.48 |
| Resnet-18: IAAT | 87.26 | 43.08 | 41.16 | 41.16 | 59.87 |
| WideResnet 32-10: Mixup | 92.57 | 45.01 | 36.60 | 36.44 | 63.57 |
| WideResnet 32-10: IAAT | 90.67 | 48.53 | 46.50 | 46.54 | 58.20 |
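As a rough sketch of the mixup loss used in this baseline (our paraphrase of the standard mixup recipe, not the code of Lamb et al. (2019); the function name and the default α are ours):

```python
import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, alpha=1.0):
    """Mixup cross-entropy on a batch (sketch): convexly combine pairs of
    inputs and mix their per-label losses with the same coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    logits = model(x_mix)
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])
```

In mixup adversarial training, this loss would be applied to adversarially perturbed batches in place of the plain cross-entropy term.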
B Sample Visualization

A visualization of samples from the CIFAR-10 dataset with the corresponding ε value assigned by IAAT is shown in Figure 6. We observe that samples assigned a low ε are visually confusing (e.g., top row of Figure 6), while samples with high ε distinctively belong to one class.

In addition, we also show more visualizations of samples near the decision boundary that contain samples from a different class within a fixed ℓ∞ ball in Figure 7. The infeasibility of label consistency constraints within the commonly used perturbation radius of ℓ∞ = 8 is apparent in this visualization. Our algorithm effectively chooses an appropriate ε that retains label information within the chosen radius.

C Imagenet Sweep over PGD Iterations
Testing against a strong adversary is crucial to assess the true robustness of a model. A popular practice in the adversarial robustness community is to attack models using PGD with many attack iterations (Xie et al., 2019). So, we test our instance adaptive adversarially trained models on a sweep of PGD iterations for a fixed ε level. Following Xie et al. (2019), we sweep the number of attack steps while fixing ε = 16. The resulting plot is shown in Figure 9. For all three Resnet models, we observe a saturation in adversarial robustness beyond 500 attack iterations.

D Implementation Details
D.1 CIFAR

On the CIFAR-10 and CIFAR-100 datasets, our implementation follows the standard adversarial training setting used in Madry et al. (2018). During training, adversarial examples are generated using PGD-10 attacks, which are then used to update the model. All hyperparameters we used are tabulated in Table 7.

Figure 6: Visualizing training samples with their corresponding perturbation radius ε.

Table 7: Hyper-parameters for experiments on CIFAR-10 and CIFAR-100.

| Hyperparameter | Resnet-18 | WideResnet-32-10 |
| Optimizer | SGD | SGD |
| Start learning rate | 0.1 | 0.1 |
| Weight decay | 0.0002 | 0.0005 |
| Number of epochs trained | 200 | 110 |
| Learning rate annealing | Step decay | Step decay |
| Learning rate decay steps | [80, 140, 170] | [70, 90, 100] |
| Learning rate decay factor | 0.1 | 0.2 |
| Batch size | 128 | 128 |
| Warmup period | 5 epochs | 10 epochs |
| ε used in warmup (ε_w) | 8 | 8 |
| Discretization γ | | |
| Smoothing constant β | | |
| ε (for adv. training only) | 8 | 8 |
| Attack learning rate | 2/255 | 2/255 |

Figure 7: Visualizations of samples that are assigned low ε by instance adaptive adversarial training (columns: clean samples, samples perturbed with a fixed ε, and samples perturbed with the adaptive ε; class changes such as bird to deer, cat to deer, and airplane to ship are shown). These samples are close to the decision boundary and change class when perturbed with larger ε. Perturbing them with the ε assigned by IAAT retains the class information.

Figure 8: Histogram of ε of training samples at different training epochs.

Figure 9: Imagenet robustness of IAAT over the number of PGD iterations.

Table 8: Hyper-parameters for experiments on Imagenet.

| Hyperparameter | Imagenet |
| Optimizer | SGD |
| Start learning rate | 0.1 × (effective batch size / 256) |
| Weight decay | 0.0001 |
| Number of epochs trained | 110 |
| Learning rate annealing | Step decay with LR warmup |
| Learning rate decay steps | [35, 70, 95] |
| Learning rate decay factor | 0.1 |
| Batch size | 32 per GPU |
| Warmup period | 30 epochs |
| ε used in warmup (ε_w) | 16 |
| Discretization γ | |
| Smoothing constant β | |
| ε (for adv. training only) | 16 |
| Attack learning rate | 1/255 |

D.2 Imagenet
For the Imagenet implementation, we mimic the setting used in Xie et al. (2019). During training, adversaries are generated with PGD-30 attacks. This is computationally expensive, as every training update is followed by 30 backprop iterations to generate the adversarial attack. To make training feasible, we perform distributed training using synchronized SGD updates on 64 / 128 GPUs. We follow the training recipe introduced in Goyal et al. (2017) for large-batch training. Also, during training, adversarial attacks are generated with FP16 precision; in the test phase, however, we use FP32.

We further use two tricks to speed up instance adaptive adversarial training: (1) a weaker attacker (PGD-10) is used in the algorithm for selecting ε (Alg. 2); (2) after ε_i is selected per Alg. 2, we clip it with a lower bound, i.e., ε_i ← max(ε_i, ε_lb), with ε_lb = 4.