Understanding and Improving Fast Adversarial Training
Maksym Andriushchenko (EPFL), [email protected]
Nicolas Flammarion (EPFL), [email protected]
Abstract
A recent line of work focused on making adversarial training computationally efficient for deep learning models. In particular, Wong et al. [46] showed that ℓ∞-adversarial training with the fast gradient sign method (FGSM) can fail due to a phenomenon called catastrophic overfitting, when the model quickly loses its robustness over a single epoch of training. We show that adding a random step to FGSM, as proposed in [46], does not prevent catastrophic overfitting, and that randomness is not important per se: its main role is simply to reduce the magnitude of the perturbation. Moreover, we show that catastrophic overfitting is not inherent to deep and overparametrized networks, but can occur in a single-layer convolutional network with a few filters. In an extreme case, even a single filter can make the network highly non-linear locally, which is the main reason why FGSM training fails. Based on this observation, we propose a new regularization method, GradAlign, that prevents catastrophic overfitting by explicitly maximizing the gradient alignment inside the perturbation set and improves the quality of the FGSM solution. As a result,
GradAlign allows to successfully apply FGSM training also for larger ℓ∞-perturbations and reduce the gap to multi-step adversarial training. The code of our experiments is available at https://github.com/tml-epfl/understanding-fast-adv-training.

Introduction

Machine learning models based on empirical risk minimization are known to be often non-robust to small worst-case perturbations. For decades, this has been the topic of active research by the statistics, optimization, and machine learning communities [19, 2, 10, 3]. However, the recent success of deep learning [22, 33] has raised the interest in this topic. The lack of robustness in deep learning is clearly illustrated by the existence of adversarial examples, i.e. tiny input perturbations that can easily fool state-of-the-art deep neural networks into making wrong predictions [38, 12].

The benefits of adversarially robust models extend beyond security considerations [3] to model interpretability [41, 32] and generalization [50, 47, 4]. In order to improve the robustness, two families of solutions have been developed: adversarial training (AT), which amounts to training the model on adversarial examples [12, 23], and provable defenses, which derive and optimize robustness certificates [45, 29, 7]. Currently, adversarial-training based methods appear to be preferred by practitioners since they (a) achieve higher empirical robustness (although without providing a robustness certificate), (b) are scalable to state-of-the-art deep networks, and (c) work equally well for different threat models. Adversarial training can be formulated as a robust optimization problem [35, 23] which takes the form of a non-convex non-concave min-max problem. However, computing the optimal adversarial examples is an NP-hard problem [21, 44]. Thus adversarial training can only rely on approximate methods to solve the inner maximization problem.

One popular approximation method successfully used in adversarial training is the PGD attack [23], where multiple steps of projected gradient descent are performed.
Figure 1: Robustness of different adversarial training (AT) methods on CIFAR-10 with ResNet-18 trained and evaluated with different ℓ∞-radii, without early stopping (left) and with early stopping (right). The results are averaged over 5 random seeds used for training and reported with the standard deviation. FGSM AT: standard FGSM AT; FGSM-RS AT: FGSM AT with a random step [46]; FGSM AT + GradAlign: FGSM AT combined with our proposed regularizer GradAlign; AT for Free: recently proposed method for fast PGD AT [34]; PGD-2/PGD-10 AT: AT with a 2-/10-step PGD attack. Our proposed regularizer GradAlign prevents catastrophic overfitting in FGSM training and leads to significantly better results which are close to the computationally demanding PGD-10 AT.

It is now widely believed that models adversarially trained via the PGD attack [23, 49] are robust, since small adversarially trained networks can be formally verified [5, 39, 46] and larger models could not be broken on public challenges [23, 49]. Recently, [8] evaluated the majority of recently published defenses and concluded that the standard ℓ∞ PGD training achieves the best empirical robustness; a result which can only be improved using semi-supervised approaches [18, 1, 6]. In contrast, other empirical defenses that were claiming improvements over standard PGD training had overestimated the robustness of their reported models [8]. These experiments imply that adversarial training in general is the key algorithm for robust deep learning, and thus that performing it efficiently is of paramount importance.

Another approximation method for adversarial training is the
Fast Gradient Sign Method (FGSM) [12], which is based on a linear approximation of the neural network loss function. However, the literature is still ambiguous about the performance of FGSM training, i.e. it remains unclear whether FGSM training can consistently lead to robust models. For example, [23] and [40] claim that FGSM training works only for small ℓ∞-perturbations, while [46] suggest that FGSM training can lead to robust models for arbitrary ℓ∞-perturbations if one adds uniformly random initialization before the FGSM step. Related to this, [46] further identified a phenomenon called catastrophic overfitting, where FGSM training first leads to some robustness at the beginning of training, but then suddenly becomes non-robust within a single training epoch. However, the reasons for such a failure remain unknown. This motivates us to consider the following question as the main theme of the paper:

When and why does fast adversarial training with FGSM lead to robust models?
Contributions.
We first show that not only is FGSM training prone to catastrophic overfitting, but so are the recently proposed fast adversarial training methods [34, 46] (see Fig. 1). We then analyze the reasons why using a random step in FGSM [46] helps to slightly mitigate catastrophic overfitting, and show that it simply boils down to reducing the average magnitude of the perturbations. Then we discuss the connection between catastrophic overfitting and local linearity in deep networks and in single-layer convolutional networks, where we show that even a single filter can make the network non-linear locally and cause the failure of FGSM training. We additionally provide for this case a theoretical explanation which helps to explain why FGSM AT is successful at the beginning of the training. Finally, we propose a regularization method, GradAlign, that prevents catastrophic overfitting by explicitly maximizing the gradient alignment inside the perturbation set and therefore improves the quality of the FGSM solution. We compare GradAlign to other adversarial training schemes in Fig. 1 and point out that, among all fast adversarial training methods considered, only FGSM +
GradAlign does not suffer from catastrophic overfitting and leads to high robustness even for large ℓ∞-perturbations.

Let ℓ(x, y; θ) denote the loss of a ReLU network parametrized by θ ∈ R^m on the example (x, y) ∼ D, where D is the data-generating distribution. Previous works [35, 23] formalized the goal of training adversarially robust models as the following robust optimization problem:

    min_θ E_{(x,y)∼D} [ max_{δ∈Δ} ℓ(x + δ, y; θ) ].    (1)

We focus here on the ℓ∞ threat model, i.e. Δ = {δ ∈ R^d : ‖δ‖∞ ≤ ε}, where the adversary can change each input coordinate x_i by at most ε. (In practice, the expectation is taken over training samples with random data augmentation; throughout the paper we focus on image classification, i.e. the inputs x are images.) Unlike classical stochastic saddle point problems of the form min_θ max_δ E[ℓ(θ, δ)] [20], the inner maximization problem here is inside the expectation. Therefore the solution of each subproblem max_{δ∈Δ} ℓ(x + δ, y; θ) depends on the particular example (x, y), and standard algorithms such as gradient descent-ascent, which alternate gradient descent in θ and gradient ascent in δ, cannot be used. Instead, each of these non-concave maximization problems has to be solved independently. Thus, an inherent trade-off appears between computationally efficient approaches which aim at solving this inner problem in as few iterations as possible and approaches which aim at solving the problem more accurately but with more iterations. In an extreme case, the PGD attack [23] uses multiple steps of projected gradient ascent, which is accurate but computationally expensive. At the other end of the spectrum, the Fast Gradient Sign Method (FGSM) [12] performs only one iteration of gradient ascent with respect to the ℓ∞-norm:

    δ_FGSM := ε sign(∇_x ℓ(x, y; θ)),    (2)

followed by a projection of x + δ_FGSM onto [0, 1]^d to ensure it is a valid input. This leads to a fast algorithm which, however, does not always lead to robust models, as observed in [23, 40]. A closer look at the evolution of the robustness during FGSM AT reveals that using FGSM can lead to a model with some degree of robustness, but only until a point where the robustness suddenly drops. This phenomenon is called catastrophic overfitting in [46]. As a partial solution, the training can be stopped just before that point, which leads to non-trivial but suboptimal robustness, as illustrated in Fig. 1. [46] further notice that initializing FGSM from a random starting point η ∼ U([−ε, ε]^d), i.e.:

    δ_FGSM-RS := Π_{[−ε,ε]^d} [ η + α sign(∇_x ℓ(x + η, y; θ)) ],    (3)

helps to avoid catastrophic overfitting, although only for small enough ε values (e.g. ε = 8/255 on CIFAR-10). Along the same lines, [42] observe that using dropout on all layers (including convolutional ones) also helps to stabilize FGSM AT.

An alternative solution is to interpolate between FGSM and PGD AT. For example, [43] suggest to first use FGSM AT and later to switch to multi-step PGD AT, which is motivated by their analysis suggesting that the inner maximization problem has to be solved more accurately at the end of training. [34] propose to run PGD with step size α = ε and simultaneously update the weights of the network. On a related note, [48] collect the weight updates during PGD but apply them after PGD is completed. Additionally, [48] update the gradients of the first layer multiple times.
However, none of these approaches is conclusive, either leading to robustness comparable to FGSM-RS training [46] and still failing for higher ℓ∞-radii (see Fig. 1 for [34] and [46]) or being in the worst case as expensive as multi-step PGD AT [43]. We focus next on analyzing FGSM-RS training [46], since the other recent variations of fast adversarial training [34, 48, 42] lead to models with similar robustness. For reference, a minimal sketch of the perturbations of Eqs. (2) and (3) is given below.
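The following PyTorch sketch illustrates Eqs. (2) and (3). It is a minimal illustration under standard assumptions (a classification model trained with cross-entropy), not the authors' exact implementation; their code is available at the repository above.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, eps):
    """Eq. (2): one signed-gradient step of size eps; x + delta is clipped to [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    delta = eps * grad.sign()
    return (x + delta).clamp(0, 1) - x  # keep x + delta a valid image

def fgsm_rs_perturbation(model, x, y, eps, alpha):
    """Eq. (3): random start eta ~ U([-eps, eps]^d), FGSM step of size alpha,
    then projection of the perturbation back onto the l_inf ball [-eps, eps]^d."""
    eta = torch.empty_like(x).uniform_(-eps, eps)
    x_adv = (x + eta).clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    delta = (eta + alpha * grad.sign()).clamp(-eps, eps)
    return (x + delta).clamp(0, 1) - x
```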
Experimental setup.
Unless mentioned otherwise, we perform training on PreAct ResNet-18 [16] with cyclic learning rates [37] and half-precision training [24], following the setup of [46]. We evaluate adversarial robustness using the PGD-50-10 attack, i.e. with 50 iterations and 10 restarts, with step size α = ε/4. More experimental details are specified in Appendix B.

First, we show that FGSM with a random step fails to resolve catastrophic overfitting for larger ε. Then we provide evidence against the explanation given by [46] on the benefit of randomness for FGSM AT, and propose a new explanation based on the linear approximation quality of FGSM.

FGSM with a random step does not resolve catastrophic overfitting.
Crucially, [46] observed that adding an initial random step to FGSM as in Eq. (3) helps to avoid catastrophic overfitting. However, this holds only if the step size is not too large (as illustrated in Fig. 3 of [46] for ε = 8/255) and, more importantly, only for small enough ε, as we show in Fig. 1. Indeed, using the step size recommended by [46] extends the working regime of FGSM AT only slightly, with 0% adversarial accuracy for larger ℓ∞-radii. When early stopping is applied (Fig. 1, right), there is still a significant gap compared to PGD-10 training, particularly for large ℓ∞-radii. For example, for ε = 16/255, FGSM-RS AT leads to 22.24% PGD-50-10 accuracy while PGD-10 AT obtains a much better accuracy of 30.65%.

Previous explanation: randomness diversifies the threat model.
A hypothesis stated in [46] was that FGSM-RS helps to avoid catastrophic overfitting by diversifying the threat model. Indeed, the random step allows to have perturbations not only at the corners {−ε, ε}^d, like the FGSM attack, but rather in the whole ℓ∞-ball [−ε, ε]^d. (For simplicity, we ignore the projection of x + δ onto [0, 1]^d in this section.) Here we refute this hypothesis by modifying the usual PGD training: we project the perturbation obtained via the PGD attack onto {−ε, ε}^d. We perform experiments on CIFAR-10 with ResNet-18 with ℓ∞-perturbations of radius ε = 8/255 over 5 random seeds. FGSM AT leads to catastrophic overfitting and close to 0% adversarial accuracy if early stopping is not applied, while the standard PGD-10 AT and our modified PGD-10 AT schemes achieve very similar adversarial accuracy to each other. Thereby, robustness similar to that of the original PGD AT can still be achieved without training on perturbations from the interior of the ℓ∞-ball. We conclude that diversity of adversarial examples is not crucial here. What makes the difference is rather having an iterative instead of a single-step procedure to find a corner of the ℓ∞-ball that sufficiently maximizes the loss.

New explanation: a random step improves the linear approximation quality.
Using a random step in FGSM is guaranteed to decrease the expected magnitude of the perturbation. This simple observation is formalized in the following lemma.
Lemma 1. (Effect of the random step)
Let η ∼ U([−ε, ε]^d) be a random starting point, and let α ∈ [0, ε] be the step size of FGSM-RS defined in Eq. (3). Then

    E_η[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E_η[‖δ_FGSM-RS(η)‖₂²] ) = √d · sqrt( −α³/(6ε) + α²/2 + ε²/3 ).    (4)

The proof is deferred to Appendix A.1. We first remark that the upper bound lies in the range [√(1/3)·√d·ε, √(2/3)·√d·ε], and is therefore always smaller than ‖δ_FGSM‖₂ = √d·ε. We visualize our bound in Fig. 2, where the expectation is approximated by Monte-Carlo sampling over η, and note that the bound becomes increasingly tight for high-dimensional inputs.

The key observation here is that among all possible perturbations of ℓ∞-norm ε, perturbations with a smaller ℓ₂-norm benefit from a better linear approximation. This statement follows from the second-order Taylor expansion for twice differentiable functions: f(x + δ) ≈ f(x) + ⟨∇_x f(x), δ⟩ + ½ ⟨δ, ∇²_xx f(x) δ⟩, i.e. a smaller value of ‖δ‖₂ implies a smaller linear approximation error |f(x + δ) − f(x) − ⟨∇_x f(x), δ⟩|. Moreover, the same property still holds empirically for non-differentiable ReLU networks (see Appendix C.1). We conclude that by reducing the length ‖δ‖₂ of the perturbation in expectation, the FGSM-RS approach of [46] takes advantage of a better linear approximation. This is supported by the fact that FGSM-RS AT also leads to catastrophic overfitting if the step size α is chosen too large (see Fig. 3 in [46]), thus providing no benefits over FGSM AT even when combined with early stopping. We argue this is the main improvement over the standard FGSM AT.
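The bound of Eq. (4) is easy to check numerically. The NumPy sketch below is our own illustration, not part of the original experiments; since the per-coordinate computation is identical for both gradient signs, we model sign(∇_i) as independent random ±1 signs.

```python
import numpy as np

def fgsm_rs_norm_mc(eps, alpha, d, n_samples=1_000, rng=None):
    """Monte-Carlo estimate of E_eta ||delta_FGSM-RS(eta)||_2 from Eq. (3),
    with sign(grad_i) modeled as independent uniform +-1 signs."""
    rng = np.random.default_rng(rng)
    eta = rng.uniform(-eps, eps, size=(n_samples, d))
    signs = rng.choice([-1.0, 1.0], size=(n_samples, d))
    delta = np.clip(eta + alpha * signs, -eps, eps)
    return np.linalg.norm(delta, axis=1).mean()

def lemma1_upper_bound(eps, alpha, d):
    """Analytical upper bound of Eq. (4)."""
    return np.sqrt(d) * np.sqrt(-alpha**3 / (6 * eps) + alpha**2 / 2 + eps**2 / 3)

eps, d = 8 / 255, 3 * 32 * 32  # CIFAR-10 input dimension
for alpha in [0.5 * eps, eps]:
    print(alpha / eps, fgsm_rs_norm_mc(eps, alpha, d), lemma1_upper_bound(eps, alpha, d))
```

For inputs of CIFAR-10 dimension, the estimate and the bound nearly coincide, in line with the observation that the bound becomes tight in high dimensions.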
Successful FGSM AT does not require randomness.
If having a perturbation with too large an ℓ₂-norm is indeed the key factor in catastrophic overfitting, we can expect that simply reducing the step size of standard FGSM should work equally well as FGSM-RS.

Figure 2: Visualization of our upper bound on E_η[‖δ_FGSM-RS‖₂] together with its empirical estimation for several input dimensions d. The dashed line corresponds to the step size α = 1.25ε recommended in [46].

Figure 3: Robustness of FGSM-trained ResNet-18 on CIFAR-10 with different ε_train ∈ {5/255, . . . , 8/255} used for training, compared to FGSM-RS AT with ε_train = 8/255, evaluated over a range of ε.

Table 1: Robustness of FGSM AT with a reduced step size (α = 7/255) compared to the FGSM-RS AT proposed in [46] (α = 1.25ε = 10/255) for ε = 8/255 on CIFAR-10 for ResNet-18 trained with early stopping. The results are averaged over 5 random seeds used for training: FGSM AT with the reduced step size and FGSM-RS AT both reach a PGD-50-10 accuracy of about 45%.

For ε = 8/255 on CIFAR-10, [46] recommend to use FGSM-RS with step size α = 1.25ε, which induces a perturbation of expected ℓ₂-norm E‖δ_FGSM-RS‖₂ ≈ (7/255)·√d. This corresponds to using standard FGSM with a step size α ≈ 7/255 instead of α = ε = 8/255 (see the dashed line in Fig. 2). We report the results in Table 1 and observe that simply reducing the step size of FGSM (without any randomness) leads to the same level of robustness. We show further in Fig. 3 that, when used with a smaller step size, the robustness of standard FGSM training, even without early stopping, can generalize to much higher ε. This contrasts with the previous literature [23, 40]. We conclude from these experiments that a more direct way to improve FGSM AT and to prevent catastrophic overfitting is to simply reduce the step size. Note that this still leads to suboptimal robustness compared to PGD AT (see Fig. 1) for ε larger than the one used during training, since in this case adversarial examples can only be generated inside a smaller ℓ∞-ball. This motivates us to take a closer look at how and why catastrophic overfitting occurs, so that we can prevent it without reducing the FGSM step size.

First, we establish a connection between catastrophic overfitting and local linearity of the model. Then we show that catastrophic overfitting also occurs in a single-layer convolutional network, for which we analyze local linearity both empirically and theoretically.
When can the inner maximization problem be accurately solved with FGSM?
Recall that the FGSM attack [12] is obtained as a closed-form solution of the following optimization problem: δ_FGSM = argmax_{‖δ‖∞ ≤ ε} ⟨∇_x ℓ(x, y; θ), δ⟩. Thus, the FGSM attack is guaranteed to find the optimal adversarial perturbation if ∇_x ℓ(x, y; θ) is constant inside the ℓ∞-ball around the input x, i.e. if the loss function is locally linear. This motivates us to study the evolution of local linearity during FGSM training and its connection to catastrophic overfitting. With this aim, we define the following local linearity metric of the loss function ℓ:

    E_{(x,y)∼D, η∼U([−ε,ε]^d)} [ cos( ∇_x ℓ(x, y; θ), ∇_x ℓ(x + η, y; θ) ) ],    (5)

which we refer to as gradient alignment. This quantity is easily interpretable: it is equal to one for models linear inside the ℓ∞-ball of radius ε, and it is approximately zero when the input gradients are nearly orthogonal to each other. Previous works also considered local linearity of deep networks [25, 28], however rather with the goal of introducing regularization methods that improve robustness as an alternative to adversarial training. More precisely, [25] propose a curvature regularization method that uses the FGSM point, and [28] find the input point where local linearity is maximally violated using an iterative method, leading to a computational cost comparable to PGD AT. In contrast, we analyze here gradient alignment to improve FGSM training without seeking an alternative to it.
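A Monte-Carlo estimate of the gradient alignment of Eq. (5) is straightforward to implement. The following PyTorch sketch is our own illustration (one random η per input; `model` and the data are placeholders):

```python
import torch
import torch.nn.functional as F

def input_grad(model, x, y):
    """Gradient of the cross-entropy loss with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x)[0]

def gradient_alignment(model, x, y, eps):
    """Monte-Carlo estimate of Eq. (5): cosine between the input gradients
    at x and at a random point x + eta inside the l_inf ball of radius eps."""
    eta = torch.empty_like(x).uniform_(-eps, eps)
    g1 = input_grad(model, x, y).flatten(1)
    g2 = input_grad(model, x + eta, y).flatten(1)
    return F.cosine_similarity(g1, g2, dim=1).mean()
```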
Figure 4: Visualization of the training process of standardly trained, FGSM-trained, and PGD-10-trained ResNet-18 on CIFAR-10 with ε = 8/255. All the statistics are calculated on the test set. Catastrophic overfitting for the FGSM AT model occurs around epoch 23 and is characterized by a sudden drop in the PGD accuracy, a gap between the FGSM and PGD losses, and a dramatic decrease of local linearity.

Catastrophic overfitting in deep networks.
To understand the link between catastrophic overfitting and local linearity, we plot in Fig. 4 the adversarial accuracies and the loss values obtained
by FGSM and PGD AT on CIFAR-10 using ResNet-18, together with the gradient alignment (see Eq. 5) and the cosine between FGSM and PGD perturbations. We compute these statistics on the test set. Catastrophic overfitting occurs for FGSM AT around epoch 23 and is characterized by the following intertwined events. (a) There is a sudden drop in the PGD accuracy, along with an abrupt jump of the FGSM accuracy. In contrast, before catastrophic overfitting, the ratio between the average PGD and FGSM losses remained small, whereas after it the PGD loss becomes much larger than the FGSM loss. This suggests that FGSM can no longer accurately solve the inner maximization problem. (b) Concurrently, after catastrophic overfitting, the gradient alignment of the FGSM model drops dramatically within an epoch of training, i.e. the input gradients become nearly orthogonal inside the ℓ∞-ball. We observe the same drop for cos(δ_FGSM, δ_PGD), which means that the FGSM and PGD directions are no longer aligned (as also observed in [40]). This echoes the observation made in [26] that SGD on the standard loss of a neural network learns models of increasing complexity. We observe qualitatively the same phenomenon for FGSM AT, where the complexity is captured by the degree of local non-linearity. The connection between local linearity and catastrophic overfitting sparks interest for a further analysis in a simpler setting.

Catastrophic overfitting in a single-layer CNN.
We show that catastrophic overfitting is not inherent to deep and overparametrized networks: it can be observed in a very simple setup. For this, we train a single-layer CNN with four filters on CIFAR-10 using FGSM AT (see Sec. B for details). We observe that catastrophic overfitting occurs in this simple model as well, and its pattern is the same as in ResNet: a simultaneous drop of the PGD accuracy and gradient alignment (see Appendix C.2). The advantage of considering a simple model is that we can inspect the learned filters and understand what causes the network to become highly non-linear locally. We observe that after catastrophic overfitting the network has learned in one of its filters, which we denote w₂, a variant of the Laplace filter (see Fig. 5), an edge-detector filter which is well known for amplifying high-frequency noise such as uniform noise [11]. Until the end of training, filter w₂ preserves its direction (see Appendix C.2 for detailed visualizations), but grows significantly in magnitude together with its outgoing weights, in contrast to the rest of the filters, as shown in Fig. 6. Interestingly, if we set w₂ to zero, the network largely recovers local linearity: the gradient alignment increases back to its value before catastrophic overfitting. Thus, in this extreme case, even a single convolutional filter can cause catastrophic overfitting (a small sketch of this probe is given below). Next we analyze formally gradient alignment in a single-layer CNN and elaborate on the connection to noise sensitivity.
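The following PyTorch sketch shows a single-layer CNN of the kind used here together with the filter-zeroing probe described above. It is a minimal illustration with assumed hyperparameters (e.g. the 3×3 filter size), not the exact training code.

```python
import torch
import torch.nn as nn

class SingleLayerCNN(nn.Module):
    """Conv layer with m filters over non-overlapping patches, ReLU, then a linear layer."""
    def __init__(self, m=4, filter_size=3, n_classes=10, img_size=32):
        super().__init__()
        # stride = kernel size, so each filter sees non-overlapping image patches z_j
        self.conv = nn.Conv2d(3, m, filter_size, stride=filter_size)
        k = (img_size // filter_size) ** 2  # number of patches per image
        self.fc = nn.Linear(m * k, n_classes)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.fc(h.flatten(1))

def zero_filter(model, i):
    """Probe from this section: remove the contribution of filter i entirely."""
    with torch.no_grad():
        model.conv.weight[i].zero_()
        model.conv.bias[i].zero_()
```

Zeroing the Laplace-like filter and re-measuring the metric of Eq. (5) reproduces the recovery of gradient alignment described above.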
Analysis of gradient alignment in a single-layer CNN.
Figure 5: Filter w₂ (green channel) in a single-layer CNN before catastrophic overfitting (epoch 5) and after it (epoch 6).

Figure 6: Evolution of the norms of the four filters and of their outgoing weights in a single-layer CNN before and after catastrophic overfitting (dashed line).
We analyze here a single-layer CNN with ReLU activation. Let Z ∈ R^{p×k} be the matrix of k non-overlapping image patches extracted from the image x = vec(Z) ∈ R^d, such that z_j = z_j(x) ∈ R^p. The model is parametrized by (W, b, U, c) ∈ R^{p×m} × R^m × R^{m×k} × R, and its prediction and input gradient are given as

    f(x) = Σ_{i=1}^m Σ_{j=1}^k u_{ij} · max{ ⟨w_i, z_j⟩ + b_i, 0 } + c,
    ∇_x f(x) = vec( Σ_{i=1}^m Σ_{j=1}^k u_{ij} · 1_{⟨w_i, z_j⟩ + b_i ≥ 0} · w_i e_j^T ).

We observe that catastrophic overfitting only happens at later stages of training. At the beginning of training, the gradient alignment is very high (see Fig. 4 and Fig. 11), and FGSM solves the inner maximization problem accurately enough. Thus, an important aspect of FGSM training is that the model starts training from highly aligned gradients. This motivates us to inspect closely the gradient alignment at initialization.
Lemma 2. (Gradient alignment at initialization)
Let z ∼ U([0, 1]^p) be an image patch for p ≥ 2, let η ∼ U([−ε, ε]^d) be a point inside the ℓ∞-ball, and let the parameters of a single-layer CNN be initialized i.i.d. as w ∼ N(0, σ_w² I_p) for every column of W and u ∼ N(0, σ_u² I_m) for every column of U, with b := 0. Then the gradient alignment is lower bounded by

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) ) ≥ max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }.
Figure 7: Feature maps of a regular filter w₁ and the Laplace filter w₂ in a single-layer CNN. A small noise η is significantly amplified by the Laplace filter w₂, in contrast to the regular filter w₁.

The lemma implies that for randomly initialized CNNs with a large enough number of image patches k and filters m, the gradient alignment cannot be smaller than 0.5. This is in contrast to the much lower values that we observe after catastrophic overfitting, when the weights are no longer i.i.d. We note that the lower bound of 0.5 is quite pessimistic since it holds for an arbitrarily large ε. The lower bound is close to 1 when ε is small compared to E‖z‖₂, which is typical in adversarial robustness (see Appendix A.2 for a visualization of the lower bound). High gradient alignment at initialization also holds empirically for deep networks, e.g. for ResNet-18 (see Fig. 4), which starts training with gradient alignment close to 1, in contrast to the near-zero values after catastrophic overfitting. Thus, it appears to be a general phenomenon that the standard initialization scheme of neural network weights [15] ensures the initial success of FGSM training.

In contrast, after some point during training, the network can learn parameters which lead to a significant reduction of gradient alignment. For simplicity, let us consider a single-filter CNN, where the gradient alignment for a filter w and bias b at the points x and x + η has a simple expression:

    cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) ) = [ Σ_{i=1}^k u_i² 1_{⟨w, z_i⟩ + b ≥ 0} 1_{⟨w, z_i + η_i⟩ + b ≥ 0} ] / [ sqrt(Σ_{i=1}^k u_i² 1_{⟨w, z_i⟩ + b ≥ 0}) · sqrt(Σ_{i=1}^k u_i² 1_{⟨w, z_i + η_i⟩ + b ≥ 0}) ].    (6)

Considering a single-filter CNN is also motivated by the fact that in the single-layer CNN introduced earlier, the norms of w₂ and of its outgoing weights are much higher than those of the rest of the filters (see Fig. 6), and thus the contribution of w₂ to the predictions and gradients of the network is the most significant. We observe that when an image x is convolved with the Laplace filter w₂, even a uniformly random noise η of small magnitude is able to significantly affect the output of (x + η) ∗ w₂ (see Fig. 7). As a consequence, the ReLU activations of the network change their signs, which directly affects the gradient alignment in Eq. (6). Namely, x ∗ w₂ + b₂ has mostly negative values, and thus many of the indicators {1_{⟨w₂, z_i⟩ + b₂ ≥ 0}}_{i=1}^k are equal to 0. On the other hand, nearly half of the indicators {1_{⟨w₂, z_i + η_i⟩ + b₂ ≥ 0}}_{i=1}^k become 1, which significantly increases the denominator of Eq. (6) and thus makes the cosine close to 0. At the same time, the output of a regular filter w₁ shown in Fig. 7 is only slightly affected by the random noise η; this noise-amplification effect is easy to reproduce, as sketched below. For deep networks, however, we could not identify particular filters responsible for catastrophic overfitting, and thus we consider next a more general solution.
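The noise amplification by a Laplace filter can be checked in a few lines of NumPy/SciPy. This is our own illustration; an averaging filter stands in for a generic "regular" filter.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
eta = rng.uniform(-1, 1, (64, 64))   # uniform noise patch (its magnitude cancels below)

laplace = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
regular = np.full((3, 3), 1 / 9)     # averaging filter as a generic "regular" filter

for name, w in [("laplace", laplace), ("regular", regular)]:
    out = convolve2d(eta, w, mode="valid")  # = (x + eta) * w - x * w, by linearity
    print(name, out.std() / eta.std())      # noise amplification factor
# The Laplace filter amplifies the noise by roughly sqrt(sum(w**2)) ~ 4.5x,
# while the averaging filter shrinks it by ~3x.
```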
Based on the importance of gradient alignment for successful FGSM training, we propose a regularizer, GradAlign, that aims at increasing gradient alignment and preventing catastrophic overfitting. The core idea of GradAlign is to maximize the gradient alignment (as defined in Eq. 5) between the gradients at the point x and at a randomly perturbed point x + η inside the ℓ∞-ball around x:

    Ω(x, y, θ) := E_{(x,y)∼D, η∼U([−ε,ε]^d)} [ 1 − cos( ∇_x ℓ(x, y; θ), ∇_x ℓ(x + η, y; θ) ) ].    (7)

Crucially, GradAlign uses gradients at the points x and x + η, which does not require an expensive iterative procedure, unlike, e.g., the LLR method of [28]. Note that the regularizer depends only on the gradient direction and is invariant to the gradient norm, which contrasts with gradient penalties [14, 17, 31, 36] or CURE [25] (see the comparison in Appendix D).
Experimental setup.
We compare the following methods: standard FGSM AT, FGSM-RS AT with α = 1.25ε [46], FGSM AT + GradAlign, AT for Free with m = 8 [34], PGD-2 AT with a 2-step PGD attack, and PGD-10 AT with a 10-step PGD attack (with step sizes proportional to ε; see Appendix B). We train these methods using PreAct ResNet-18 [16] with ℓ∞-radii ε ∈ {1/255, . . . , 16/255} on CIFAR-10 for 30 epochs and ε ∈ {1/255, . . . , 12/255} on SVHN for 15 epochs. The only exception is AT for Free [34], which we train for more epochs on both CIFAR-10 and SVHN, as this was necessary to obtain results comparable to the other methods. Unlike [28] and [48], with the training scheme of [46] we could successfully train a PGD-2 model on CIFAR-10 with robustness better than that of their methods that use the same number of PGD steps (see Appendix D). This also echoes the recent finding of [30] that properly tuned multi-step PGD AT outperforms more recently published methods. As before, we evaluate robustness using PGD-50-10, i.e. with 50 iterations and 10
restarts, with step size α = ε/4, for the same ε that was used for training. We train each model with 5 random seeds since the final robustness can have a large variance for high ε. We also remark that training with GradAlign leads on average to a noticeable slowdown compared to FGSM training, which is due to the use of double backpropagation (see [9] for a detailed analysis). We think that improving the runtime of GradAlign is possible, but we postpone it to future work. Additional implementation details are provided in Appendix B. The code of our experiments is available at https://github.com/tml-epfl/understanding-fast-adv-training.

Figure 8: Accuracy (dashed line) and robustness (solid line) of different adversarial training (AT) methods on CIFAR-10 and SVHN with ResNet-18 trained and evaluated with different ℓ∞-radii. The results are obtained without early stopping, averaged over 5 random seeds used for training, and reported with the standard deviation.

Results.
We provide the main comparison in Fig. 8 and give detailed numbers for specific values of ε in Appendix D.3. First, we notice that all the methods perform almost equally well for small enough ε. However, the performance for larger ε varies a lot depending on the method, due to catastrophic overfitting. Importantly, GradAlign successfully prevents catastrophic overfitting in FGSM AT, thus allowing to successfully apply FGSM training also for larger ℓ∞-perturbations and to reduce the gap to PGD-10 training. In Appendix D.4, we additionally show that FGSM + GradAlign does not suffer from catastrophic overfitting even for ε beyond 16/255. At the same time, not only FGSM AT and FGSM-RS AT experience catastrophic overfitting, but also the recently proposed
AT for Free and PGD-2, although at higher ε values than FGSM AT. We note that GradAlign is applicable not only to FGSM AT, but also to other methods that can suffer from catastrophic overfitting. In particular, combining PGD-2 with
GradAlign prevents catastrophic overfitting and leads to better robustness for large ε on CIFAR-10 (see Appendix D.3). Although performing early stopping can lead to non-trivial robustness, standard accuracy is often significantly sacrificed, which limits the usefulness of this technique (see Appendix D). This is in contrast to training with GradAlign, which leads to the same standard accuracy as PGD-10 AT.

We also performed similar experiments on ImageNet in Appendix D.3, but observed that even for standard FGSM training with the training schedule of [46], catastrophic overfitting does not occur for the radii ε ∈ {2/255, 4/255} considered in [34, 46], and thus there is no need to use GradAlign, as its main role is to prevent catastrophic overfitting. Finally, with regard to the robust overfitting phenomenon outlined in [30], we observed that training FGSM +
GradAlign for more than 30 epochs also leads to slightly worse robustness on the test set (see Appendix D.4), thus suggesting that catastrophic and robust overfitting are two distinct phenomena that have to be addressed separately.

Conclusions

We observed that catastrophic overfitting is a fundamental problem not only for standard FGSM training, but for computationally efficient adversarial training in general. In particular, many recently proposed schemes, such as FGSM AT enhanced by a random step or
AT for Free, are also prone to catastrophic overfitting. Motivated by this, we explored the questions of when and why
FGSM adversarial training works, and how to improve it by increasing the gradient alignment and thus the quality of the solution of the inner maximization problem. Our proposed regularizer
GradAlign prevents catastrophic overfitting and improves the robustness compared to other fast adversarial training methods, reducing the gap to multi-step PGD training.
Acknowledgements
We thank Eric Wong, Francesco Croce, Guillermo Ortiz-Jimenez, Apostolos Modas, and Chen Liu for many fruitful discussions.
References

[1] Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, and Pushmeet Kohli. Are labels required for improving adversarial robustness? NeurIPS, 2019.
[2] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton Series in Applied Mathematics. Princeton University Press, Princeton, NJ, 2009.
[3] Battista Biggio and Fabio Roli. Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition, 2018.
[4] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[5] Nicholas Carlini, Guy Katz, Clark Barrett, and David L. Dill. Provably minimally-distorted adversarial examples. arXiv preprint arXiv:1709.10207, 2017.
[6] Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C. Duchi. Unlabeled data improves adversarial robustness. NeurIPS, 2019.
[7] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. ICML, 2019.
[8] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. ICML, 2020.
[9] Christian Etmann. A closer look at double backpropagation. arXiv preprint arXiv:1906.06637, 2019.
[10] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. ICML, 2006.
[11] Rafael C. Gonzalez and Richard E. Woods. Digital image processing (2nd edition). Prentice Hall, New Jersey, 2002.
[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015.
[13] Sven Gowal, Jonathan Uesato, Chongli Qin, Po-Sen Huang, Timothy Mann, and Pushmeet Kohli. An alternative surrogate loss for PGD-based adversarial testing. arXiv preprint arXiv:1910.09338, 2019.
[14] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. ICLR Workshops, 2015.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. ECCV, 2016.
[17] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. NeurIPS, 2017.
[18] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. ICML, 2019.
[19] Peter J. Huber. Robust statistics. John Wiley & Sons, Inc., New York, 1981.
[20] A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
[21] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: an efficient SMT solver for verifying deep neural networks. CAV, 2017.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553), 2015.
[23] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
[24] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. ICLR, 2018.
[25] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. CVPR, 2019.
[26] Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. NeurIPS, 2019.
[27] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. ASIA CCS '17, 2017.
[28] Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. NeurIPS, 2019.
[29] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. ICLR, 2018.
[30] Leslie Rice, Eric Wong, and J. Zico Kolter. Overfitting in adversarially robust deep learning. ICML, 2020.
[31] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. AAAI, 2018.
[32] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. NeurIPS, 2019.
[33] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85-117, 2015.
[34] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! NeurIPS, 2019.
[35] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 2018.
[36] Carl-Johann Simon-Gabriel, Yann Ollivier, Léon Bottou, Bernhard Schölkopf, and David Lopez-Paz. First-order adversarial vulnerability of neural networks and input dimension. ICML, 2019.
[37] Leslie N. Smith. Cyclical learning rates for training neural networks. WACV, 2017.
[38] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2014.
[39] Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. ICLR, 2019.
[40] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. ICLR, 2018.
[41] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ICLR, 2019.
[42] B. S. Vivek and R. Venkatesh Babu. Single-step adversarial training with dropout scheduling. CVPR, 2020.
[43] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. ICML, 2019.
[44] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S. Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. ICML, 2018.
[45] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML, 2018.
[46] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training. ICLR, 2020.
[47] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, and Quoc V. Le. Adversarial examples improve image recognition. CVPR, 2020.
[48] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. NeurIPS, 2019.
[49] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. ICML, 2019.
[50] Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding. ICLR, 2019.

Appendix

A Deferred proofs

In this section, we show the proofs omitted from Sec. 3 and Sec. 4.
A.1 Proof of Lemma 1
We state again Lemma 1 from Sec. 3 and present the proof.
Lemma 1. (Effect of the random step)
Let η ∼ U([−ε, ε]^d) be a random starting point, and let α ∈ [0, ε] be the step size of FGSM-RS defined in Eq. (3). Then

    E_η[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E_η[‖δ_FGSM-RS(η)‖₂²] ) = √d · sqrt( −α³/(6ε) + α²/2 + ε²/3 ).

Proof.
First, note that due to Jensen's inequality, we have the convenient upper bound

    E[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E[‖δ_FGSM-RS(η)‖₂²] ),    (8)

which is easier to work with. Therefore, we can focus on E[‖δ_FGSM-RS‖₂²], which can be computed analytically. Denoting ∇ := ∇_x ℓ(x + η, y; θ) ∈ R^d, we obtain:

    E_η[‖δ_FGSM-RS‖₂²] = E_η[ ‖Π_{[−ε,ε]^d}[η + α sign(∇)]‖₂² ]
                      = Σ_{i=1}^d E_{η_i}[ Π_{[−ε,ε]}[η_i + α sign(∇_i)]² ]
                      = d · E_{η_i}[ min{ε, |η_i + α sign(∇_i)|}² ]
                      = d · E_{η_i}[ min{ε², (η_i + α sign(∇_i))²} ]
                      = d · E_{r_i}[ E_{η_i}[ min{ε², (η_i + α r_i)²} | sign(∇_i) = r_i ] ],

where in the last step we use the law of total expectation, noting that sign(∇_i) is itself a random variable since it depends on η_i.

We first consider the case sign(∇_i) = 1, for which the inner conditional expectation equals

    (1/(2ε)) ∫_{−ε}^{ε} min{ε², (η_i + α)²} dη_i = (1/(2ε)) ∫_{−ε+α}^{ε+α} min{ε², x²} dx
    = (1/(2ε)) ( ∫_{ε}^{ε+α} ε² dx + ∫_{−ε+α}^{ε} x² dx )
    = −α³/(6ε) + α²/2 + ε²/3.

The case sign(∇_i) = −1 leads to the same expression:

    (1/(2ε)) ∫_{−ε}^{ε} min{ε², (η_i − α)²} dη_i = (1/(2ε)) ∫_{−ε−α}^{ε−α} min{ε², x²} dx = −α³/(6ε) + α²/2 + ε²/3.

Combining the two cases with Eq. (8), we conclude that

    E_η[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E[‖δ_FGSM-RS(η)‖₂²] ) = √d · sqrt( −α³/(6ε) + α²/2 + ε²/3 ).

A.2 Proof and discussion of Lemma 2
We state again Lemma 2 from Sec. 4 and present the proof.
Lemma 2. (Gradient alignment at initialization)
Let z ∼ U([0, 1]^p) be an image patch for p ≥ 2, let η ∼ U([−ε, ε]^d) be a point inside the ℓ∞-ball, and let the parameters of a single-layer CNN be initialized i.i.d. as w ∼ N(0, σ_w² I_p) for every column of W and u ∼ N(0, σ_u² I_m) for every column of U, with b := 0. Then the gradient alignment is lower bounded by

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) ) ≥ max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }.

Proof.
For k and m large enough, the law of large numbers ensures that an empirical mean of i.i.d. random variables can be approximated by its expectation with respect to the random variables z, η, w, u. Writing out the cosine between the input gradients of the single-layer CNN, and using that the cross terms with r ≠ l vanish in the limit (the columns of U are independent with zero mean), this leads to

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) )
    = lim_{k,m→∞} [ Σ_{r,l=1}^m Σ_{i=1}^k ⟨w_r, w_l⟩ u_{ri} u_{li} 1_{⟨w_r, z_i⟩ ≥ 0} 1_{⟨w_l, z_i + η_i⟩ ≥ 0} ]
      / [ sqrt( Σ_{r,l,i} ⟨w_r, w_l⟩ u_{ri} u_{li} 1_{⟨w_r, z_i⟩ ≥ 0} 1_{⟨w_l, z_i⟩ ≥ 0} ) · sqrt( Σ_{r,l,i} ⟨w_r, w_l⟩ u_{ri} u_{li} 1_{⟨w_r, z_i + η_i⟩ ≥ 0} 1_{⟨w_l, z_i + η_i⟩ ≥ 0} ) ]
    = lim_{k,m→∞} [ (1/(km)) Σ_{r=1}^m Σ_{i=1}^k ‖w_r‖₂² u_{ri}² 1_{⟨w_r, z_i⟩ ≥ 0} 1_{⟨w_r, z_i + η_i⟩ ≥ 0} ]
      / [ sqrt( (1/(km)) Σ_{r,i} ‖w_r‖₂² u_{ri}² 1_{⟨w_r, z_i⟩ ≥ 0} ) · sqrt( (1/(km)) Σ_{r,i} ‖w_r‖₂² u_{ri}² 1_{⟨w_r, z_i + η_i⟩ ≥ 0} ) ]
    = E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} 1_{⟨w,z+η⟩ ≥ 0} ] / ( sqrt( E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} ] ) · sqrt( E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z+η⟩ ≥ 0} ] ) ),    (9)

where the factors u² cancel since u is independent of the remaining random variables. For the denominator we directly compute

    E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} ] = E_{w,η,z}[ ‖w‖₂² 1_{⟨w,z+η⟩ ≥ 0} ] = 0.5 p σ_w².

For the numerator, using P_η[⟨w, η⟩ ≥ ⟨w, z⟩] ≤ exp(−⟨w, z⟩² / (2ε²‖w‖₂²)), which follows from Hoeffding's inequality, we obtain

    E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} 1_{⟨w,z+η⟩ ≥ 0} ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, z + η⟩ ≥ 0) ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, η⟩ ≥ −⟨w, z⟩) ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, η⟩ ≤ ⟨w, z⟩) ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} (1 − P_η(⟨w, η⟩ ≥ ⟨w, z⟩)) ]
    ≥ E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} (1 − exp(−⟨w, z⟩² / (2ε²‖w‖₂²))) ]
    = 0.5 p σ_w² − 0.5 E_{w,z}[ ‖w‖₂² exp(−⟨w, z⟩² / (2ε²‖w‖₂²)) ]
    ≥ 0.5 p σ_w² − 0.5 E_w[ ‖w‖₂⁴ ]^{1/2} · E_{w,z}[ exp(−⟨w, z⟩² / (ε²‖w‖₂²)) ]^{1/2}
    = 0.5 p σ_w² − 0.5 σ_w² sqrt(p² + 2p) · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2},

where the last inequality is the Cauchy-Schwarz inequality. On the other hand, we also have

    E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} 1_{⟨w,z+η⟩ ≥ 0} ] = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, η⟩ ≤ ⟨w, z⟩) ] ≥ E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} · 0.5 ] = 0.25 p σ_w².

Combining both lower bounds on the numerator with the value of the denominator, we obtain a lower bound on Eq. (9):

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) )
    ≥ max{ 1 − sqrt(1 + 2/p) · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }
    ≥ max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 },    (10)

where in the last step we used that p ≥ 2.

The main purpose of obtaining the lower bound in Lemma 2 was to get an expression that gives insight into the key quantities on which the gradient alignment at initialization depends. Considering the limiting case k, m → ∞ was necessary to obtain a ratio of expectations that allowed us to derive a simpler expression. Finally, we lower bounded the gradient alignment from Eq. (9) using Hoeffding's and the Cauchy-Schwarz inequalities, and used p ≥ 2 to obtain a dimension-independent constant in front of the expectation in Eq. (10). To provide a better understanding of the key quantities involved in the lemma and to assess the tightness of the derived lower bound, in Fig. 9 we plot:

• cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) for k = 100 patches and m = 4 filters (which resembles the setting of the 4-filter CNN on CIFAR-10); note that it is a random variable since it is a function of the random variables x, η, W, U;
• lim_{k,m→∞} cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) evaluated via Eq. (9);
• our first lower bound, max{ 1 − (1/(p σ_w²)) E_{w,z}[ ‖w‖₂² exp(−⟨w/‖w‖₂, z⟩² / (2ε²)) ], 0.5 }, obtained via Hoeffding's inequality;
• our final lower bound, max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }.

Figure 9: Visualization of the key quantities involved in Lemma 2.

For the last three quantities we approximate the expectations by Monte-Carlo sampling. For all the quantities we use patches of the same dimension p as in our CIFAR-10 experiments. We plot gradient alignment values for small ε, since we are interested in small ℓ∞-perturbations such as, e.g., ε = 8/255 ≈ 0.031, which is a typical value used for CIFAR-10 [23]. First, we can observe that all four quantities are very high for such small ε, in contrast to the much lower gradient alignment observed after catastrophic overfitting. Next, we observe that cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) has some noticeable variance for the chosen parameters k = 100 patches and m = 4 filters. However, this variance is significantly reduced when we increase the parameters k and m, and disappears in the limiting case k, m → ∞. Finally, we observe that both lower bounds on lim_{k,m→∞} cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) are quite tight for small ε. We choose to report the final one in the lemma since it is slightly more concise than the one obtained via Hoeffding's inequality.
B Experimental details

We list the detailed evaluation and training settings below.
Evaluation.
Throughout the paper, we use PGD-50-10 for the evaluation of adversarial accuracy, which stands for the PGD attack with 50 iterations and 10 random restarts, following [46]. We use the step size α = ε/4. The choice of this attack is motivated by the fact that in both public benchmarks of [23] on MNIST and CIFAR-10, the adversarial accuracy of PGD-100-50 and PGD-20-10, respectively, is only 2% away from the best entries. Although we train our models using half precision [24], we always perform the robustness evaluation in single precision, since evaluation with half precision can sometimes overestimate the robustness of the model due to limited numerical precision in the calculation of the gradients. We perform the evaluation of standard accuracy using the full test sets, but we evaluate adversarial accuracy using a random subset of points on each dataset.
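For concreteness, here is a minimal PyTorch sketch of an ℓ∞ PGD attack with random restarts in the spirit of PGD-50-10. It is our own illustration, not the exact evaluation code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, n_iter=50, n_restarts=10):
    """l_inf PGD with random restarts; keeps the perturbation that maximizes the loss."""
    best_delta = torch.zeros_like(x)
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)
    for _ in range(n_restarts):
        delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
        for _ in range(n_iter):
            loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
            grad = torch.autograd.grad(loss, delta)[0]
            with torch.no_grad():
                delta += alpha * grad.sign()
                delta.clamp_(-eps, eps)
        with torch.no_grad():
            losses = F.cross_entropy(model((x + delta).clamp(0, 1)), y, reduction="none")
            improved = losses > best_loss
            best_delta[improved] = delta.detach()[improved]
            best_loss[improved] = losses[improved]
    return (x + best_delta).clamp(0, 1)
```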
Training details for ResNet-18.
We use the implementation code of [46], with the only difference that we do not use image normalization and gradient clipping on CIFAR-10 and SVHN, since we found that they have no significant influence on the final results. We use cyclic learning rates and half-precision training following [46]. We do not use random initialization for PGD during adversarial training, as we did not find that it leads to any improvements on the considered datasets (see the justification in Sec. D.1 below). We perform early stopping based on the PGD accuracy on the training set, following [46]; we observed that such a simple model selection scheme can successfully select a model before catastrophic overfitting that has non-trivial robustness. On CIFAR-10, we train all the models for 30 epochs, except AT for Free [34], which we train for more epochs with m = 8 minibatch replays to get results comparable to the other methods. On SVHN, we train all the models for 15 epochs, again except AT for Free [34], which we train for more epochs with m = 8 minibatch replays. Moreover, in order to prevent convergence to a constant classifier on SVHN, we linearly increase the perturbation radius from 0 to ε during the first 5 epochs for all methods. For PGD-2 AT we use for training a 2-step PGD attack, and for PGD-10 AT a 10-step PGD attack, with step sizes proportional to ε. For Fig. 1 and Fig. 8, we used GradAlign λ values obtained via a linear interpolation on the logarithmic scale between the best λ values that we found for ε = 8/255 and ε = 16/255 on the test sets. We perform the interpolation on the logarithmic scale since the values of λ are non-negative, and a usual linear interpolation would lead to negative values of λ. The resulting λ values for ε ∈ {1/255, . . . , 16/255} are summarized in Table 2. We note that in the end we do not report results for the largest ε values on SVHN, since many models have trivial robustness close to that of a constant classifier. For the PGD-2 + GradAlign experiments reported below in Table 4 and Table 5, we use separately tuned λ values for the CIFAR-10 and SVHN experiments.

Training details for the single-layer CNN.
The single-layer CNN that we study in Sec. 4 has 4 convolutional filters. After the convolution we apply a ReLU activation, and then we directly have a fully-connected layer, i.e. we do not use any pooling layer. For training, we use the Adam optimizer for 30 epochs with the same cyclical learning rate schedule.

Table 2: GradAlign λ values used for the experiments on CIFAR-10 and SVHN for ε ∈ {1/255, . . . , 16/255}. These values are obtained via a linear interpolation on the logarithmic scale between successful λ values at ε = 8/255 and ε = 16/255.
Standard model b.)
PGD-trained model d d d d -norm of the perturbation | ( x + )( x ) , | Start from
FGSM
Start from random d d d d -norm of the perturbation | L ( x + ) L ( x ) , | FGSMrandom
Figure 10:
The quality of the linear approximation of (cid:96) ( x + δ ) for δ with different (cid:96) -norm for (cid:107) δ (cid:107) ∞ fixedto ε for a standard and PGD-trained ResNet-18 on CIFAR-10. we use the ADAM optimizer with learning rate . for 30 epochs using the same cyclical learningrate schedule. ImageNet experiments.
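For reference, this architecture takes only a few lines of PyTorch. The sketch below is an illustration under our own assumptions: the filter size (3×3 here) is not fixed by the text above, and the class name is ours.

```python
import torch
import torch.nn as nn

class SingleLayerCNN(nn.Module):
    """One convolution (4 filters), ReLU, then a linear classifier; no pooling.
    The 3x3 kernel size is an assumption made for this sketch."""
    def __init__(self, n_filters=4, kernel_size=3, n_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, n_filters, kernel_size)
        out_hw = 32 - kernel_size + 1  # CIFAR-10 inputs are 32x32
        self.fc = nn.Linear(n_filters * out_hw * out_hw, n_classes)

    def forward(self, x):
        return self.fc(torch.relu(self.conv(x)).flatten(1))

model = SingleLayerCNN()
opt = torch.optim.Adam(model.parameters())  # ADAM, combined with a cyclic LR schedule
```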
ImageNet experiments. We use ResNet-50 following the training scheme of [46], which includes 3 training stages at different image resolutions. For GradAlign, we slightly reduce the batch size in the second and third stages from 224 and 128 to 180 and 100 respectively, in order to reduce the memory consumption. For all ε ∈ { , , }, we train FGSM models with GradAlign using λ ∈ { . , . }. The final λ values we report are λ ∈ { . , . , . } for the three values of ε respectively.
Computing infrastructure.
We perform all our experiments on NVIDIA V100 GPUs with 32GB of memory.
C Supporting experiments and visualizations for Sec. 3 and Sec. 4
We describe here supporting experiments and visualizations related to Sec. 3 and Sec. 4.
Figure 11: Visualization of the training process of an FGSM-trained CNN with 4 filters with ε = / . We can observe catastrophic overfitting around epoch 6. (The three panels show, over training epochs: the FGSM and PGD adversarial accuracy, the FGSM and PGD adversarial loss, and the cosines cos(∇ℓ(x), ∇ℓ(x + η)) and cos(δ_FGSM, δ_PGD).)

C.1 Quality of the linear approximation for ReLU networks

For the loss function ℓ of a ReLU network, we compute empirically the quality of the linear approximation, defined as

|ℓ(x + δ) − ℓ(x) − ⟨δ, ∇_x ℓ(x)⟩|,

where the dependency of the loss ℓ on the label y and the parameters θ is omitted for clarity. We then perform the following experiment: we take a perturbation δ ∈ {−ε, ε}^d and zero out different fractions of its coordinates, which leads to perturbations with a fixed ‖δ‖∞ = ε but with different ‖δ‖2 ∈ [0, √d ε]. As the starting δ we choose two types of perturbations: δ_FGSM generated by FGSM, and δ_random sampled uniformly from the corners of the ℓ∞-ball. We plot the results in Fig. 10 on CIFAR-10 for ε = 8/255, averaged over 512 test points, and conclude that for both δ_FGSM and δ_random the validity of the linear approximation crucially depends on ‖δ‖2, even when ‖δ‖∞ is fixed. The phenomenon is even more pronounced for FGSM perturbations, for which the linearization error is much higher. Moreover, this observation is consistent across both standardly and adversarially trained ResNet-18 models.
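This experiment is easy to reproduce; the following is a minimal PyTorch sketch (the helper names `linearization_error` and `sparsify` are ours, not taken from the released code). Zeroing out coordinates keeps ‖δ‖∞ = ε while shrinking ‖δ‖2 by roughly the square root of the kept fraction.

```python
import torch
import torch.nn.functional as F

def linearization_error(model, x, y, delta):
    """Batch-averaged |l(x + delta) - l(x) - <delta, grad_x l(x)>|."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    with torch.no_grad():
        loss_pert = F.cross_entropy(model(x + delta), y)
        linear_term = (delta * grad).flatten(1).sum(1).mean()
        return (loss_pert - loss - linear_term).abs()

def sparsify(delta, keep_frac):
    """Zero out a random fraction (1 - keep_frac) of the coordinates of delta."""
    mask = (torch.rand_like(delta) < keep_frac).float()
    return delta * mask
```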
C.2 Catastrophic overfitting in a single-layer CNN

We describe here complementary figures to Sec. 4 related to the single-layer CNN.
Training curves.
In Fig. 11, we show the evolution of the FGSM/PGD accuracy, the FGSM/PGD loss, and the gradient alignment together with cos(δ_FGSM, δ_PGD). We observe that catastrophic overfitting occurs around epoch 6 and that its pattern is the same as for the deep ResNet illustrated in Fig. 4. Namely, the following changes occur concurrently around epoch 6: (a) there is a sudden drop of the PGD accuracy together with an increase of the FGSM accuracy, (b) the PGD loss grows by an order of magnitude while the FGSM loss decreases, (c) both the gradient alignment and cos(δ_FGSM, δ_PGD) decrease significantly. Throughout all our experiments we observe a very high correlation between cos(δ_FGSM, δ_PGD) and the gradient alignment. This motivates our proposed regularizer GradAlign, which relies on the cosine between ∇_x ℓ(x, y; θ) and ∇_x ℓ(x + η, y; θ), where η is a random point. In this way, we avoid using an iterative procedure inside the regularizer unlike, for example, the approach of [28].
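In code, the regularizer is very compact. The sketch below is a minimal PyTorch illustration of this idea, not the exact released implementation (in particular, details such as whether one of the two gradients is detached may differ):

```python
import torch
import torch.nn.functional as F

def grad_align_reg(model, x, y, eps):
    """GradAlign sketch: 1 - cos(grad at x, grad at x + eta), eta ~ U([-eps, eps]^d)."""
    x1 = x.clone().requires_grad_(True)
    x2 = (x + torch.empty_like(x).uniform_(-eps, eps)).requires_grad_(True)
    g1 = torch.autograd.grad(F.cross_entropy(model(x1), y), x1, create_graph=True)[0]
    g2 = torch.autograd.grad(F.cross_entropy(model(x2), y), x2, create_graph=True)[0]
    cos = F.cosine_similarity(g1.flatten(1), g2.flatten(1), dim=1)
    return (1.0 - cos).mean()

# FGSM + GradAlign training objective (lambda_reg is the regularization weight):
#   loss = F.cross_entropy(model(x_fgsm), y) + lambda_reg * grad_align_reg(model, x, y, eps)
```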
Additional filters. In Fig. 12, we show the evolution of a regular filter and of the filter that leads to catastrophic overfitting, for the three input channels (red, green, blue). We can observe that in the red and green channels, the problematic filter has learned a Laplace filter, which is very sensitive to noise. Moreover, this filter significantly increases in magnitude after catastrophic overfitting, contrary to the regular filter, whose magnitude only decreases (see the colorbar values in Fig. 12 and the plots in Fig. 5). Additional feature maps.
In Fig. 13, we show additional feature maps for images with and without uniform random noise η ∼ U([− / , / ]^d). These figures complement Fig. 7 shown in the main part. We clearly see that only the last filter is sensitive to the noise, since its feature maps change dramatically. At the same time, the other filters are only slightly affected by the addition of the noise. We also show the input gradients in the last column, which illustrate that after adding the noise the gradients change dramatically, leading to a small gradient alignment and, in turn, to the failure of FGSM as a solution of the inner maximization problem.

Figure 12: Evolution of the regular filter and of the filter that leads to catastrophic overfitting (epochs 3, 4, 5, 6, 7, and 30), with the red (R), green (G), and blue (B) channels of each filter plotted separately. We can observe that in the R and G channels, the problematic filter has learned a Laplace filter, which is very sensitive to noise.

Figure 13: Input images, feature maps, and gradients of the single-layer CNN trained on CIFAR-10 at the end of training (after catastrophic overfitting). Odd rows: original images. Even rows: original images plus random noise U([− / , / ]^d). We observe that only the last filter is highly sensitive to the small uniform noise, since its feature maps change dramatically.

D Additional experiments for different adversarial training schemes
In this section, we describe additional experiments related to GradAlign that complement the results shown in Sec. 5.
D.1 Stronger PGD-2 baseline
As mentioned in Sec. 5, the PGD-2 training baseline that we report outperforms other similar baselines reported in the literature [48, 28]. Here we elaborate on what are likely the most important sources of the difference. First, we follow the cyclical learning rate schedule of [46], which can act as implicit early stopping and thus can help to prevent the catastrophic overfitting observed for PGD-2 in [28]. Another source of difference is that [28] use the ADAM optimizer, while we stick to the standard PGD updates using the sign of the gradient [23].
The second important factor is a proper step size selection. While [48] do not observe catastrophic overfitting, their PGD-3 baseline achieves only .  adversarial accuracy, compared to .  for our PGD-2 baseline evaluated with a stronger attack (PGD-50-10 instead of PGD-20-1). One potential explanation for this difference lies in the step size selection, where for PGD-2 we use α = ε/ .
Related to the step size selection, we also found that using random initialization in PGD (which we refer to as PGD-k-RS), as suggested in [23], requires a larger step size α. We show the results in Table 3, where we can see that PGD-2-RS AT with α = ε/  achieves suboptimal robustness compared to using α = ε for training. However, we consistently observed that PGD-2 AT with α = ε/  and no random step performs best. Thus, we use the latter as our PGD-2 baseline throughout the paper, always starting PGD-2 from the original point, without using any random step.

Table 3: Robustness of different PGD-2 schemes for ε = 8/255 on CIFAR-10 for ResNet-18. The results are averaged over 5 random seeds used for training.

Model | PGD-50-10 accuracy
PGD-2-RS AT, α = ε/  | ±
PGD-2-RS AT, α = ε | ±
PGD-2 AT, α = ε/  | ±
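To make the two schemes in Table 3 concrete, here is a minimal PyTorch sketch of the inner maximization of PGD-k AT with and without the random step (the function name and its arguments are ours):

```python
import torch
import torch.nn.functional as F

def pgd_train_delta(model, x, y, eps, alpha, n_steps=2, random_start=False):
    """Inner maximization for PGD-k AT; random_start=True gives PGD-k-RS.
    Projection onto the input range [0, 1] is omitted for brevity."""
    if random_start:
        delta = torch.empty_like(x).uniform_(-eps, eps)  # PGD-k-RS
    else:
        delta = torch.zeros_like(x)  # plain PGD-k: start at the original point
    delta.requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()
```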
D.2 Results with early stopping

We complement the results presented in Fig. 8 without early stopping with results obtained with early stopping, which we show in Fig. 14. For CIFAR-10, we observe that FGSM + GradAlign leads to good robustness and accuracy, outperforming FGSM AT and FGSM-RS AT, performing similarly to PGD-2, and improving slightly over it for larger ε close to / . For SVHN, GradAlign leads to better robustness than the other FGSM-based methods. We also observe that for large ε, on both CIFAR-10 and SVHN, AT for Free performs similarly to the FGSM-based methods. Moreover, for ε ≥ /  on SVHN, AT for Free converges to a constant classifier.
On both CIFAR-10 and SVHN, we can see that although early stopping can lead to non-trivial robustness, standard accuracy is often significantly sacrificed, which limits the usefulness of this technique. This is in contrast to training with GradAlign, which leads to the same standard accuracy as PGD-10 training.

Figure 14: Accuracy (dashed lines) and robustness (solid lines) of different adversarial training (AT) methods on CIFAR-10 and SVHN with ResNet-18, trained and evaluated with different ℓ∞-radii. The results are obtained with early stopping, averaged over 5 random seeds used for training, and reported with the standard deviation. (Methods: FGSM AT, FGSM-RS AT, FGSM AT + GradAlign, AT for Free, PGD-2 AT, PGD-10 AT; x-axis: ε used for training and evaluation; y-axis: standard and PGD-50-10 accuracy.)

D.3 Results for specific ℓ∞-radii

Here we report the results from Fig. 8 for the specific ℓ∞-radii that are most often studied in the literature.

CIFAR-10 results.
We report robustness and accuracy in Table 4 for CIFAR-10 without using early stopping, where we can clearly see which methods lead to catastrophic overfitting and thus to suboptimal robustness. We compare the same methods as in Fig. 8, and additionally we report the results for ε = 8/255 of the CURE [25], YOPO [48], and LLR [28] approaches. First, for ε = 8/255, we see that FGSM + GradAlign outperforms AT for Free and all methods that use FGSM training. We also observe that the model trained with CURE [25] achieves robustness that is suboptimal compared to FGSM-RS AT evaluated with a stronger attack: .  vs . . YOPO-3-5 and YOPO-5-3 [48] require 3 and 5 full steps of PGD respectively, so they are much more expensive than FGSM-RS AT, and yet they lead to worse adversarial accuracy: .  and .  vs . . [28] report that LLR-2, i.e. their approach with 2 steps of PGD, achieves .  adversarial accuracy. This result is not directly comparable to the other results in Table 4, since [28] use (1) a larger network (Wide-ResNet-28-8) and (2) a stronger attack (MultiTargeted [13]). However, we think that the gap of −  compared to the adversarial accuracy of our reported FGSM + GradAlign and PGD-2 methods ( .  and .  respectively) is still significant, since the difference between MultiTargeted and a PGD attack with random restarts is observed to be small (e.g. around 1% between MultiTargeted and PGD-20-10 on the CIFAR-10 challenge of [23]).
For ε = 16/255, none of the one-step methods work without early stopping except FGSM + GradAlign. We also evaluate PGD-2 + GradAlign and conclude that the benefit of combining the two comes when PGD-2 alone leads to catastrophic overfitting, which occurs at ε = 16/255. For ε = 8/255, there is no benefit in combining the two approaches. This is consistent with our observation regarding catastrophic overfitting for FGSM (e.g. see Fig. 8 for small ε): if there is no catastrophic overfitting, there is no benefit in adding GradAlign to FGSM training.
To further ensure that FGSM + GradAlign models do not benefit from gradient masking [27], we additionally compare the robustness of FGSM + GradAlign and FGSM-RS models obtained via AutoAttack [8]. We observe that AutoAttack proportionally reduces the adversarial accuracy of both models: for ε = 8/255, FGSM + GradAlign achieves 44.54 ±  while FGSM-RS achieves 42.80 ± . AutoAttack reduces the adversarial accuracy of many models by 2%-3% for ε = 8/255 compared to the originally reported results based on the standard PGD attack (see Table 2 in [8]). The same tendency is also observed for higher ε: e.g., for ε = 16/255, FGSM + GradAlign achieves 20.56 ±  under AutoAttack.

Table 4: Robustness and accuracy of different robust training methods on CIFAR-10. We report results without early stopping for ResNet-18 unless specified otherwise in parentheses. The results of all the methods reported in Fig. 8 are shown here with the standard deviation, averaged over 5 random seeds used for training.

Model | Standard accuracy | Adversarial accuracy | Attack
ε = 8/255:
Standard | 94.03% | 0.00% | PGD-50-10
CURE [25] | 81.20% | 36.30% | PGD-20-1
YOPO-3-5 [48] | 82.14% | 38.18% | PGD-20-1
YOPO-5-3 [48] | 83.99% | 44.72% | PGD-20-1
LLR-2 (Wide-ResNet-28-8) [28] | 90.46% | 44.50% | MultiTargeted [28]
FGSM | 85.16 ± | ± | PGD-50-10
FGSM-RS | ± | ± | PGD-50-10
FGSM + GradAlign | ± | ± | PGD-50-10
AT for Free (m = 8) | 77.92 ± | ± | PGD-50-10
PGD-2 (α = 4/255) | 82.15 ± | ± | PGD-50-10
PGD-2 (α = 4/255) + GradAlign | ± | ± | PGD-50-10
PGD-10 (α = 2ε/ ) | 81.88 ± | ± | PGD-50-10
ε = 16/255:
FGSM | 73.76 ± | ± | PGD-50-10
FGSM-RS | ± | ± | PGD-50-10
FGSM + GradAlign | ± | ± | PGD-50-10
AT for Free (m = 8) | 48.10 ± | ± | PGD-50-10
PGD-2 (α = ε/ ) | 68.65 ± | ± | PGD-50-10
PGD-2 (α = ε/ ) + GradAlign | ± | ± | PGD-50-10
PGD-10 (α = 2ε/ ) | 60.28 ± | ± | PGD-50-10

SVHN results.
We report robustness and accuracy in Table 5 for SVHN without using early stopping. We can see that for both ε = 8/255 and ε = 12/255, GradAlign successfully prevents catastrophic overfitting, in contrast to FGSM and FGSM-RS, although there is still a gap to PGD-2 training for ε = 12/255. AT for free performs slightly better than FGSM + GradAlign for ε = 8/255, but it already starts to show a high variance in robustness and accuracy depending on the random seed. For ε = 12/255, all the 5 models of AT for free converge to a constant classifier. Combining PGD-2 with GradAlign does not lead to improved results for ε = 8/255, since there is no catastrophic overfitting for PGD-2. However, for ε = 12/255, we can clearly see that PGD-2 + GradAlign leads to better results than PGD-2, achieving .  ± .  instead of .  ± .  adversarial accuracy.

ImageNet results. We also perform similar experiments on ImageNet in Table 6. We observe that even for standard FGSM training, catastrophic overfitting does not occur for the ε ∈ {2/255, 4/255} considered in [34, 46], and thus there is no additional benefit from using GradAlign, since its main role is to prevent catastrophic overfitting. We report the results of FGSM + GradAlign for completeness, to show that GradAlign can be applied at ImageNet scale, although it leads to approximately  × slowdown on ImageNet. We find that the exact slowdown of GradAlign depends on the GPU utilization and the batch size, ranging from  × to  × on different datasets. For ε = / , we observe that catastrophic overfitting occurs for FGSM-RS very early in training (around epoch 3), but not for FGSM or FGSM + GradAlign. This contradicts our observations on CIFAR-10 and SVHN, where FGSM-RS usually helps to postpone catastrophic overfitting to higher ε. However, it is computationally demanding to replicate the results on ImageNet multiple times over different random seeds, as we did for CIFAR-10 and SVHN. Thus, we leave a more detailed investigation of catastrophic overfitting on ImageNet for future work.

Table 5: Robustness and accuracy of different robust training methods on SVHN. We report results without early stopping for ResNet-18. All the results are reported with the standard deviation and averaged over 5 random seeds used for training.

Model | Standard | PGD-50-10
ε = 8/255:
Standard | 96.00% | 1.00%
FGSM | 91.40 ± | ±
FGSM-RS | ± | ±
FGSM + GradAlign | ± | ±
AT for Free (m = 8) | 75.34 ± | ±
PGD-2 (α = ε/ ) | 92.68 ± | ±
PGD-2 + GradAlign (α = ε/ ) | 92.46 ± | ±
PGD-10 (α = 2ε/ ) | 91.92 ± | ±
ε = 12/255:
FGSM | 88.74 ± | ±
FGSM-RS | ± | ±
FGSM + GradAlign | ± | ±
AT for Free (m = 8) | 18.50 ± | ±
PGD-2 (α = ε/ ) | 92.74 ± | ±
PGD-2 + GradAlign (α = ε/ ) | 87.14 ± | ±
PGD-10 (α = 2ε/ ) | 84.52 ± | ±

Table 6: Robustness and accuracy of different robust training methods on ImageNet. We report results without early stopping for ResNet-50.

Model | ℓ∞-radius | Standard accuracy | PGD-50-10 accuracy
FGSM | 2/255 | 61.7% | 42.1%
FGSM-RS | 2/255 | 59.3% | 41.1%
FGSM + GradAlign | 2/255 | ± | ±

D.4 Ablation studies
In this section, we provide more details about the sensitivity of GradAlign to its hyperparameter λ and to the total number of training epochs, and we also discuss training with GradAlign for very high ε values.

Ablation study for GradAlign λ. We provide an ablation study for the regularization parameter λ of GradAlign in Fig. 15, where we plot the adversarial accuracy of ResNet-18 trained using FGSM + GradAlign with ε = /  on CIFAR-10. First, we observe that for small λ catastrophic overfitting occurs, so that the average PGD-50-10 accuracy is either  or greater than  but with a high standard deviation, since only some runs are successful while the other runs fail because of catastrophic overfitting. We observe that the best performance is achieved for λ = 2, where catastrophic overfitting does not occur and the final adversarial accuracy is well concentrated. For larger λ values, we observe a slow decrease in the adversarial accuracy, since the model becomes over-regularized. We note that the λ values with performance close to the best (≥  adversarial accuracy) span the range [0. , ], so we conclude that GradAlign is robust to the exact choice of λ. This is also confirmed by our hyperparameter selection method for Fig. 8, where we performed a linear interpolation on the logarithmic scale between successful λ values for ε = 8/255 and ε = 16/255. Even such a coarse hyperparameter selection method could ensure that none of the FGSM + GradAlign runs reported in Fig. 15 suffered from catastrophic overfitting.
Ablation study for the total number of training epochs.
Recently, Rice et al. [30] brought up the importance of early stopping in adversarial training. They identify a phenomenon called robust overfitting, where training longer hurts the adversarial accuracy on the test set. Thus, we check here whether training with GradAlign has some influence on robust overfitting. We note that the authors of [30] suggest that robust and catastrophic overfitting are distinct phenomena, since robust overfitting implies a gap between training and test set robustness, while catastrophic overfitting implies low robustness on both training and test sets. To explore this for FGSM + GradAlign, in Fig. 16 we show the final clean and adversarial accuracies for five different models trained with { , , , , } epochs. We observe the same trend as [30] report: training longer slightly degrades the adversarial accuracy while the clean accuracy slightly improves. Thus, this experiment also suggests that robust overfitting is not directly connected to catastrophic overfitting and has to be addressed separately. Finally, we note based on Fig. 16 that when we use FGSM in combination with GradAlign, even training for up to 200 epochs does not lead to catastrophic overfitting.

Figure 15: Ablation study for the regularization parameter λ for FGSM + GradAlign under ε = /  without early stopping. We train ResNet-18 models on CIFAR-10. The results are averaged over 3 random seeds used for training and reported with the standard deviation. (x-axis: regularization parameter λ of GradAlign; y-axis: PGD-50-10 accuracy.)

Figure 16: Ablation study for the total number of training epochs for FGSM + GradAlign under ε = /  without early stopping. We train ResNet-18 models on CIFAR-10. The results are averaged over 3 random seeds used for training and reported with the standard deviation. (x-axis: total number of training epochs, from 25 to 200; y-axis: standard and PGD-50-10 accuracy.)
Ablation study for very high ε. Here we make an additional test of whether
GradAlign prevents catastrophic overfitting for very high ε values. In Fig. 8 and Fig. 14 we showed results for ε ≤  on CIFAR-10 and ε ≤  on SVHN. For SVHN, FGSM + GradAlign achieves 24.04 ±  adversarial accuracy at the largest radius, and increasing ε on SVHN even further just leads to learning a constant classifier. However, on CIFAR-10 for ε = 16, FGSM + GradAlign achieves 28.88 ±  adversarial accuracy, so one may wonder whether catastrophic overfitting can still occur for GradAlign on CIFAR-10, just for higher ε values than those we considered in the main part of the paper. To show that this is not the case, in Table 7 we show the results of FGSM + GradAlign trained with ε ∈ { / , / } (we use λ = 2.  and the maximum learning rate . ). We observe no signs of catastrophic overfitting even for ε as high as / . Note that in this case the standard accuracy is very low (23.07 ± ), which is expected for such a large ε.

Table 7: Robustness and accuracy of FGSM + GradAlign for very high ε on CIFAR-10, without early stopping, for ResNet-18. We report results with the standard deviation, averaged over 3 random seeds used for training. We observe no catastrophic overfitting even for very high ε.

ℓ∞-radius | Standard accuracy | PGD-50-10 accuracy
/  | ± | ±
/  | ± | ±

D.5 Comparison of GradAlign to gradient-based penalties
In this section, we compare GradAlign to two alternatives: ℓ2 gradient norm penalization and CURE [25]. The motivation to study them comes from the fact that after catastrophic overfitting, the input gradients change dramatically inside the ℓ∞-balls around the input points, and thus other gradient-based regularizers may also be able to improve the stability of the input gradients and thereby prevent catastrophic overfitting. In Table 8, we present results of FGSM training with other gradient-based penalties studied in the literature:
• ℓ2 gradient norm regularization [31, 36]: λ‖∇_x ℓ(x, y; θ)‖2,
• curvature regularization (CURE) [25]: λ‖∇_x ℓ(x + δ_FGSM, y; θ) − ∇_x ℓ(x, y; θ)‖2.
First of all, we note that the originally proposed approaches [31, 36, 25] do not involve adversarial training and rely only on these gradient penalties to achieve some degree of robustness. In contrast, we combine the gradient penalties with FGSM training to see whether they can prevent catastrophic overfitting similarly to GradAlign. For the gradient norm penalty, we use the regularization parameters λ ∈ { , , , } for ε ∈ { / , / } respectively. For CURE, we use λ ∈ { , , } for ε ∈ { / , / } respectively. In both cases, we found the optimal hyperparameters using a grid search over λ.

Table 8: Additional comparison of FGSM AT with GradAlign to FGSM AT with other gradient penalties on CIFAR-10. We report results without early stopping for ResNet-18. All the results are reported with the standard deviation and averaged over 5 random seeds used for training.

Model | Standard | PGD-50-10
ε = 8/255:
FGSM + ‖∇_x ℓ‖2 | ± | ±
FGSM + CURE | ± | ±
FGSM + GradAlign | ± | ±
ε = 16/255:
FGSM + ‖∇_x ℓ‖2 | ± | ±
FGSM + CURE | ± | ±
FGSM + GradAlign | ± | ±

We can see that for ε = 8/255, all three approaches successfully prevent catastrophic overfitting, although the final robustness varies slightly, between .  for FGSM with the ℓ2 gradient penalty and .  for FGSM with GradAlign.
For ε = 16/255, both FGSM + CURE and FGSM + GradAlign prevent catastrophic overfitting, leading to well-concentrated results with a small standard deviation (0.29% and 0.70% respectively). However, the average adversarial accuracy is better for FGSM + GradAlign: .  vs . . At the same time, FGSM with the ℓ2 gradient penalty leads to unstable final performance: its adversarial accuracy has a high standard deviation ( .  ± . ). We think that the main difference in performance between GradAlign and the gradient penalties we considered comes from the fact that GradAlign is invariant to the gradient norm: it takes into account only the directions of the two gradients inside the ℓ∞-ball around the given input.
Inspired by CURE, we also tried two additional experiments:
1. Using the FGSM point δ_FGSM for the gradient taken at the second input point for
GradAlign, but we observed that it does not make a substantial difference, i.e. this version of GradAlign also prevents catastrophic overfitting and leads to similar results. However, if we use CURE without FGSM in the cross-entropy loss, then we do observe a benefit of using δ_FGSM in the regularizer, which is consistent with the observations made in Moosavi-Dezfooli et al. [25].
2. Using GradAlign without FGSM in the cross-entropy loss. In this case, we observed that the model did not significantly improve its robustness, suggesting that GradAlign is not a sufficient regularizer on its own to promote robustness and has to be used together with some adversarial training method.
We think that an interesting future direction is to explore how one can speed up GradAlign, or to come up with other regularization methods that are also able to prevent catastrophic overfitting while avoiding reliance on the input gradients, which cause the slowdown in training. We think that some potential strategies to speed up