Understanding and Improving Fast Adversarial Training
Maksym Andriushchenko (EPFL), [email protected]
Nicolas Flammarion (EPFL), [email protected]
Abstract
A recent line of work focused on making adversarial training computationally efficient for deep learning models. In particular, Wong et al. [46] showed that ℓ∞-adversarial training with the fast gradient sign method (FGSM) can fail due to a phenomenon called catastrophic overfitting, when the model quickly loses its robustness over a single epoch of training. We show that adding a random step to FGSM, as proposed in [46], does not prevent catastrophic overfitting, and that randomness is not important per se: its main role is simply to reduce the magnitude of the perturbation. Moreover, we show that catastrophic overfitting is not inherent to deep and overparametrized networks, but can occur in a single-layer convolutional network with a few filters. In an extreme case, even a single filter can make the network highly non-linear locally, which is the main reason why FGSM training fails. Based on this observation, we propose a new regularization method, GradAlign, that prevents catastrophic overfitting by explicitly maximizing the gradient alignment inside the perturbation set and improves the quality of the FGSM solution. As a result,
GradAlign allows to successfully apply FGSM training also for larger ℓ∞-perturbations and reduce the gap to multi-step adversarial training. The code of our experiments is available at https://github.com/tml-epfl/understanding-fast-adv-training.

Introduction

Machine learning models based on empirical risk minimization are known to be often non-robust to small worst-case perturbations. For decades, this has been the topic of active research by the statistics, optimization, and machine learning communities [19, 2, 10, 3]. However, the recent success of deep learning [22, 33] has raised the interest in this topic. The lack of robustness in deep learning is clearly illustrated by the existence of adversarial examples, i.e. tiny input perturbations that can easily fool state-of-the-art deep neural networks into making wrong predictions [38, 12].

The benefits of adversarially robust models extend beyond security considerations [3] to model interpretability [41, 32] and generalization [50, 47, 4]. In order to improve the robustness, two families of solutions have been developed: adversarial training (AT), which amounts to training the model on adversarial examples [12, 23], and provable defenses, which derive and optimize robustness certificates [45, 29, 7]. Currently, adversarial-training based methods appear to be preferred by practitioners since they (a) achieve higher empirical robustness (although without providing a robustness certificate), (b) are scalable to state-of-the-art deep networks, and (c) work equally well for different threat models. Adversarial training can be formulated as a robust optimization problem [35, 23] which takes the form of a non-convex non-concave min-max problem. However, computing the optimal adversarial examples is an NP-hard problem [21, 44]. Thus adversarial training can only rely on approximate methods to solve the inner maximization problem.

One popular approximation method successfully used in adversarial training is the PGD attack [23], where multiple steps of projected gradient descent are performed.
Figure 1: Robustness of different adversarial training (AT) methods on CIFAR-10 with ResNet-18 trained and evaluated with different ℓ∞-radii, without early stopping (left) and with early stopping (right). The results are averaged over 5 random seeds used for training and reported with the standard deviation. FGSM AT: standard FGSM AT; FGSM-RS AT: FGSM AT with a random step [46]; FGSM AT + GradAlign: FGSM AT combined with our proposed regularizer GradAlign; AT for Free: recently proposed method for fast PGD AT [34]; PGD-2/PGD-10 AT: AT with a 2-/10-step PGD attack. Our proposed regularizer GradAlign prevents catastrophic overfitting in FGSM training and leads to significantly better results which are close to the computationally demanding PGD-10 AT.

It is now widely believed that models adversarially trained via the PGD attack [23, 49] are robust, since small adversarially trained networks can be formally verified [5, 39, 46] and larger models could not be broken on public challenges [23, 49]. Recently, [8] evaluated the majority of recently published defenses and concluded that the standard ℓ∞ PGD training achieves the best empirical robustness; a result which can only be improved using semi-supervised approaches [18, 1, 6]. In contrast, other empirical defenses that were claiming improvements over standard PGD training had overestimated the robustness of their reported models [8]. These experiments imply that adversarial training in general is the key algorithm for robust deep learning, and thus that performing it efficiently is of paramount importance.

Another approximation method for adversarial training is the
Fast Gradient Sign Method (FGSM) [12], which is based on a linear approximation of the neural network loss function. However, the literature is still ambiguous about the performance of FGSM training, i.e. it remains unclear whether FGSM training can consistently lead to robust models. For example, [23] and [40] claim that FGSM training works only for small ℓ∞-perturbations, while [46] suggest that FGSM training can lead to robust models for arbitrary ℓ∞-perturbations if one adds uniformly random initialization before the FGSM step. Related to this, [46] further identified a phenomenon called catastrophic overfitting, where FGSM training first leads to some robustness at the beginning of training, but then suddenly becomes non-robust within a single training epoch. However, the reasons for such a failure remain unknown. This motivates us to consider the following question as the main theme of the paper:

When and why does fast adversarial training with FGSM lead to robust models?
Contributions.
We first show that not only is FGSM training prone to catastrophic overfitting, but so are the recently proposed fast adversarial training methods [34, 46] (see Fig. 1). We then analyze the reasons why using a random step in FGSM [46] helps to slightly mitigate catastrophic overfitting, and show that it simply boils down to reducing the average magnitude of the perturbations. Then we discuss the connection between catastrophic overfitting and local linearity in deep networks and in single-layer convolutional networks, where we show that even a single filter can make the network non-linear locally and cause the failure of FGSM training. We additionally provide for this case a theoretical explanation which helps to explain why FGSM AT is successful at the beginning of the training. Finally, we propose a regularization method, GradAlign, that prevents catastrophic overfitting by explicitly maximizing the gradient alignment inside the perturbation set and therefore improves the quality of the FGSM solution. We compare GradAlign to other adversarial training schemes in Fig. 1 and point out that, among all fast adversarial training methods considered, only FGSM +
GradAlign does not suffer from catastrophic overfitting and leads to high robustness even for large ℓ∞-perturbations.

Let ℓ(x, y; θ) denote the loss of a ReLU network parametrized by θ ∈ R^m on the example (x, y) ∼ D, where D is the data-generating distribution. Previous works [35, 23] formalized the goal of training adversarially robust models as the following robust optimization problem:

    min_θ E_{(x,y)∼D} [ max_{δ∈Δ} ℓ(x + δ, y; θ) ].    (1)

We focus here on the ℓ∞ threat model, i.e. Δ = {δ ∈ R^d : ‖δ‖∞ ≤ ε}, where the adversary can change each input coordinate x_i by at most ε. (In practice, the expectation is taken over training samples with random data augmentation; throughout the paper we focus on image classification, i.e. the inputs x are images.) Unlike classical stochastic saddle point problems of the form min_θ max_δ E[ℓ(θ, δ)] [20], the inner maximization problem here is inside the expectation. Therefore the solution of each subproblem max_{δ∈Δ} ℓ(x + δ, y; θ) depends on the particular example (x, y), and standard algorithms such as gradient descent-ascent, which alternate gradient descent in θ and gradient ascent in δ, cannot be used. Instead, each of these non-concave maximization problems has to be solved independently. Thus, an inherent trade-off appears between computationally efficient approaches which aim at solving this inner problem in as few iterations as possible and approaches which aim at solving the problem more accurately but with more iterations. In an extreme case, the PGD attack [23] uses multiple steps of projected gradient ascent, which is accurate but computationally expensive. At the other end of the spectrum, the Fast Gradient Sign Method (FGSM) [12] performs only one iteration of gradient ascent with respect to the ℓ∞-norm:

    δ_FGSM := ε sign(∇_x ℓ(x, y; θ)),    (2)

followed by a projection of x + δ_FGSM onto [0, 1]^d to ensure it is a valid input. This leads to a fast algorithm which, however, does not always lead to robust models, as observed in [23, 40]. A closer look at the evolution of the robustness during FGSM AT reveals that using FGSM can lead to a model with some degree of robustness, but only until a point where the robustness suddenly drops. This phenomenon is called catastrophic overfitting in [46]. As a partial solution, the training can be stopped just before that point, which leads to non-trivial but suboptimal robustness, as illustrated in Fig. 1. [46] further notice that initializing FGSM from a random starting point η ∼ U([−ε, ε]^d), i.e.:

    δ_FGSM-RS := Π_{[−ε,ε]^d} [ η + α sign(∇_x ℓ(x + η, y; θ)) ],    (3)

helps to avoid catastrophic overfitting, although only for small enough ε values (e.g. ε = 8/255 on CIFAR-10). Along the same lines, [42] observe that using dropout on all layers (including convolutional ones) also helps to stabilize FGSM AT.

An alternative solution is to interpolate between FGSM and PGD AT. For example, [43] suggest to first use FGSM AT and later to switch to multi-step PGD AT, which is motivated by their analysis suggesting that the inner maximization problem has to be solved more accurately at the end of training. [34] propose to run PGD with step size α = ε and simultaneously update the weights of the network. On a related note, [48] collect the weight updates during PGD but apply them after PGD is completed. Additionally, [48] update the gradients of the first layer multiple times.
However, none of these approaches is conclusive, either leading to robustness comparable to FGSM-RS training [46] and still failing for higher ℓ∞-radii (see Fig. 1 for [34] and [46]) or being in the worst case as expensive as multi-step PGD AT [43]. We focus next on analyzing FGSM-RS training [46], since the other recent variations of fast adversarial training [34, 48, 42] lead to models with similar robustness. For reference, a minimal sketch of the perturbations of Eqs. (2) and (3) is given below.
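The following PyTorch sketch illustrates Eqs. (2) and (3). It is a minimal illustration under standard assumptions (a classification model trained with cross-entropy), not the authors' exact implementation; their code is available at the repository above.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, eps):
    """Eq. (2): one signed-gradient step of size eps; x + delta is clipped to [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    delta = eps * grad.sign()
    return (x + delta).clamp(0, 1) - x  # keep x + delta a valid image

def fgsm_rs_perturbation(model, x, y, eps, alpha):
    """Eq. (3): random start eta ~ U([-eps, eps]^d), FGSM step of size alpha,
    then projection of the perturbation back onto the l_inf ball [-eps, eps]^d."""
    eta = torch.empty_like(x).uniform_(-eps, eps)
    x_adv = (x + eta).clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    delta = (eta + alpha * grad.sign()).clamp(-eps, eps)
    return (x + delta).clamp(0, 1) - x
```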
Experimental setup.
Unless mentioned otherwise, we perform training on PreAct ResNet-18 [16] with cyclic learning rates [37] and half-precision training [24], following the setup of [46]. We evaluate adversarial robustness using the PGD-50-10 attack, i.e. with 50 iterations and 10 restarts, with step size α = ε/4. More experimental details are specified in Appendix B.

First, we show that FGSM with a random step fails to resolve catastrophic overfitting for larger ε. Then we provide evidence against the explanation given by [46] on the benefit of randomness for FGSM AT, and propose a new explanation based on the linear approximation quality of FGSM.

FGSM with a random step does not resolve catastrophic overfitting.
Crucially, [46] observed that adding an initial random step to FGSM as in Eq. (3) helps to avoid catastrophic overfitting. However, this holds only if the step size is not too large (as illustrated in Fig. 3 of [46] for ε = 8/255) and, more importantly, only for small enough ε, as we show in Fig. 1. Indeed, using the step size recommended by [46] extends the working regime of FGSM AT only slightly, with 0% adversarial accuracy for larger ℓ∞-radii. When early stopping is applied (Fig. 1, right), there is still a significant gap compared to PGD-10 training, particularly for large ℓ∞-radii. For example, for ε = 16/255, FGSM-RS AT leads to 22.24% PGD-50-10 accuracy while PGD-10 AT obtains a much better accuracy of 30.65%.

Previous explanation: randomness diversifies the threat model.
A hypothesis stated in [46] was that FGSM-RS helps to avoid catastrophic overfitting by diversifying the threat model. Indeed, the random step allows to have perturbations not only at the corners {−ε, ε}^d, like the FGSM attack, but rather in the whole ℓ∞-ball [−ε, ε]^d. (For simplicity, we ignore the projection of x + δ onto [0, 1]^d in this section.) Here we refute this hypothesis by modifying the usual PGD training: we project the perturbation obtained via the PGD attack onto {−ε, ε}^d. We perform experiments on CIFAR-10 with ResNet-18 with ℓ∞-perturbations of radius ε = 8/255 over 5 random seeds. FGSM AT leads to catastrophic overfitting and close to 0% adversarial accuracy if early stopping is not applied, while the standard PGD-10 AT and our modified PGD-10 AT schemes achieve very similar adversarial accuracy to each other. Thereby, robustness similar to that of the original PGD AT can still be achieved without training on perturbations from the interior of the ℓ∞-ball. We conclude that diversity of adversarial examples is not crucial here. What makes the difference is rather having an iterative instead of a single-step procedure to find a corner of the ℓ∞-ball that sufficiently maximizes the loss.

New explanation: a random step improves the linear approximation quality.
Using a random step in FGSM is guaranteed to decrease the expected magnitude of the perturbation. This simple observation is formalized in the following lemma.
Lemma 1. (Effect of the random step)
Let η ∼ U([−ε, ε]^d) be a random starting point, and let α ∈ [0, ε] be the step size of FGSM-RS defined in Eq. (3). Then

    E_η[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E_η[‖δ_FGSM-RS(η)‖₂²] ) = √d · sqrt( −α³/(6ε) + α²/2 + ε²/3 ).    (4)

The proof is deferred to Appendix A.1. We first remark that the upper bound lies in the range [√(1/3)·√d·ε, √(2/3)·√d·ε], and is therefore always smaller than ‖δ_FGSM‖₂ = √d·ε. We visualize our bound in Fig. 2, where the expectation is approximated by Monte-Carlo sampling over η, and note that the bound becomes increasingly tight for high-dimensional inputs.

The key observation here is that among all possible perturbations of ℓ∞-norm ε, perturbations with a smaller ℓ₂-norm benefit from a better linear approximation. This statement follows from the second-order Taylor expansion for twice differentiable functions: f(x + δ) ≈ f(x) + ⟨∇_x f(x), δ⟩ + ½ ⟨δ, ∇²_xx f(x) δ⟩, i.e. a smaller value of ‖δ‖₂ implies a smaller linear approximation error |f(x + δ) − f(x) − ⟨∇_x f(x), δ⟩|. Moreover, the same property still holds empirically for non-differentiable ReLU networks (see Appendix C.1). We conclude that by reducing the length ‖δ‖₂ of the perturbation in expectation, the FGSM-RS approach of [46] takes advantage of a better linear approximation. This is supported by the fact that FGSM-RS AT also leads to catastrophic overfitting if the step size α is chosen too large (see Fig. 3 in [46]), thus providing no benefits over FGSM AT even when combined with early stopping. We argue this is the main improvement over the standard FGSM AT.
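The bound of Eq. (4) is easy to check numerically. The NumPy sketch below is our own illustration, not part of the original experiments; since the per-coordinate computation is identical for both gradient signs, we model sign(∇_i) as independent random ±1 signs.

```python
import numpy as np

def fgsm_rs_norm_mc(eps, alpha, d, n_samples=1_000, rng=None):
    """Monte-Carlo estimate of E_eta ||delta_FGSM-RS(eta)||_2 from Eq. (3),
    with sign(grad_i) modeled as independent uniform +-1 signs."""
    rng = np.random.default_rng(rng)
    eta = rng.uniform(-eps, eps, size=(n_samples, d))
    signs = rng.choice([-1.0, 1.0], size=(n_samples, d))
    delta = np.clip(eta + alpha * signs, -eps, eps)
    return np.linalg.norm(delta, axis=1).mean()

def lemma1_upper_bound(eps, alpha, d):
    """Analytical upper bound of Eq. (4)."""
    return np.sqrt(d) * np.sqrt(-alpha**3 / (6 * eps) + alpha**2 / 2 + eps**2 / 3)

eps, d = 8 / 255, 3 * 32 * 32  # CIFAR-10 input dimension
for alpha in [0.5 * eps, eps]:
    print(alpha / eps, fgsm_rs_norm_mc(eps, alpha, d), lemma1_upper_bound(eps, alpha, d))
```

For inputs of CIFAR-10 dimension, the estimate and the bound nearly coincide, in line with the observation that the bound becomes tight in high dimensions.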
Successful FGSM AT does not require randomness.
If having a perturbation with too large an ℓ₂-norm is indeed the key factor in catastrophic overfitting, we can expect that simply reducing the step size of standard FGSM should work equally well as FGSM-RS.

Figure 2: Visualization of our upper bound on E_η[‖δ_FGSM-RS‖₂] together with its empirical estimation for several input dimensions d. The dashed line corresponds to the step size α = 1.25ε recommended in [46].

Figure 3: Robustness of FGSM-trained ResNet-18 on CIFAR-10 with different ε_train ∈ {5/255, . . . , 8/255} used for training, compared to FGSM-RS AT with ε_train = 8/255, evaluated over a range of ε.

Table 1: Robustness of FGSM AT with a reduced step size (α = 7/255) compared to the FGSM-RS AT proposed in [46] (α = 1.25ε = 10/255) for ε = 8/255 on CIFAR-10 for ResNet-18 trained with early stopping. The results are averaged over 5 random seeds used for training: FGSM AT with the reduced step size and FGSM-RS AT both reach a PGD-50-10 accuracy of about 45%.

For ε = 8/255 on CIFAR-10, [46] recommend to use FGSM-RS with step size α = 1.25ε, which induces a perturbation of expected ℓ₂-norm E‖δ_FGSM-RS‖₂ ≈ (7/255)·√d. This corresponds to using standard FGSM with a step size α ≈ 7/255 instead of α = ε = 8/255 (see the dashed line in Fig. 2). We report the results in Table 1 and observe that simply reducing the step size of FGSM (without any randomness) leads to the same level of robustness. We show further in Fig. 3 that, when used with a smaller step size, the robustness of standard FGSM training, even without early stopping, can generalize to much higher ε. This contrasts with the previous literature [23, 40]. We conclude from these experiments that a more direct way to improve FGSM AT and to prevent catastrophic overfitting is to simply reduce the step size. Note that this still leads to suboptimal robustness compared to PGD AT (see Fig. 1) for ε larger than the one used during training, since in this case adversarial examples can only be generated inside a smaller ℓ∞-ball. This motivates us to take a closer look at how and why catastrophic overfitting occurs, so that we can prevent it without reducing the FGSM step size.

First, we establish a connection between catastrophic overfitting and local linearity of the model. Then we show that catastrophic overfitting also occurs in a single-layer convolutional network, for which we analyze local linearity both empirically and theoretically.
When can the inner maximization problem be accurately solved with FGSM?
Recall that the FGSM attack [12] is obtained as a closed-form solution of the following optimization problem: δ_FGSM = argmax_{‖δ‖∞ ≤ ε} ⟨∇_x ℓ(x, y; θ), δ⟩. Thus, the FGSM attack is guaranteed to find the optimal adversarial perturbation if ∇_x ℓ(x, y; θ) is constant inside the ℓ∞-ball around the input x, i.e. if the loss function is locally linear. This motivates us to study the evolution of local linearity during FGSM training and its connection to catastrophic overfitting. With this aim, we define the following local linearity metric of the loss function ℓ:

    E_{(x,y)∼D, η∼U([−ε,ε]^d)} [ cos( ∇_x ℓ(x, y; θ), ∇_x ℓ(x + η, y; θ) ) ],    (5)

which we refer to as gradient alignment. This quantity is easily interpretable: it is equal to one for models linear inside the ℓ∞-ball of radius ε, and it is approximately zero when the input gradients are nearly orthogonal to each other. Previous works also considered local linearity of deep networks [25, 28], however rather with the goal of introducing regularization methods that improve robustness as an alternative to adversarial training. More precisely, [25] propose a curvature regularization method that uses the FGSM point, and [28] find the input point where local linearity is maximally violated using an iterative method, leading to a computational cost comparable to PGD AT. In contrast, we analyze here gradient alignment to improve FGSM training without seeking an alternative to it.
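A Monte-Carlo estimate of the gradient alignment of Eq. (5) is straightforward to implement. The following PyTorch sketch is our own illustration (one random η per input; `model` and the data are placeholders):

```python
import torch
import torch.nn.functional as F

def input_grad(model, x, y):
    """Gradient of the cross-entropy loss with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x)[0]

def gradient_alignment(model, x, y, eps):
    """Monte-Carlo estimate of Eq. (5): cosine between the input gradients
    at x and at a random point x + eta inside the l_inf ball of radius eps."""
    eta = torch.empty_like(x).uniform_(-eps, eps)
    g1 = input_grad(model, x, y).flatten(1)
    g2 = input_grad(model, x + eta, y).flatten(1)
    return F.cosine_similarity(g1, g2, dim=1).mean()
```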
Figure 4: Visualization of the training process of standardly trained, FGSM-trained, and PGD-10-trained ResNet-18 on CIFAR-10 with ε = 8/255. All the statistics are calculated on the test set. Catastrophic overfitting for the FGSM AT model occurs around epoch 23 and is characterized by a sudden drop in the PGD accuracy, a gap between the FGSM and PGD losses, and a dramatic decrease of local linearity.

Catastrophic overfitting in deep networks.
To understand the link between catastrophic overfitting and local linearity, we plot in Fig. 4 the adversarial accuracies and the loss values obtained
by FGSM and PGD AT on CIFAR-10 using ResNet-18, together with the gradient alignment (see Eq. 5) and the cosine between FGSM and PGD perturbations. We compute these statistics on the test set. Catastrophic overfitting occurs for FGSM AT around epoch 23 and is characterized by the following intertwined events. (a) There is a sudden drop in the PGD accuracy, along with an abrupt jump of the FGSM accuracy. In contrast, before catastrophic overfitting, the ratio between the average PGD and FGSM losses remained small, whereas after it the PGD loss becomes much larger than the FGSM loss. This suggests that FGSM can no longer accurately solve the inner maximization problem. (b) Concurrently, after catastrophic overfitting, the gradient alignment of the FGSM model drops dramatically within an epoch of training, i.e. the input gradients become nearly orthogonal inside the ℓ∞-ball. We observe the same drop for cos(δ_FGSM, δ_PGD), which means that the FGSM and PGD directions are no longer aligned (as also observed in [40]). This echoes the observation made in [26] that SGD on the standard loss of a neural network learns models of increasing complexity. We observe qualitatively the same phenomenon for FGSM AT, where the complexity is captured by the degree of local non-linearity. The connection between local linearity and catastrophic overfitting sparks interest for a further analysis in a simpler setting.

Catastrophic overfitting in a single-layer CNN.
We show that catastrophic overfitting is not inherent to deep and overparametrized networks: it can be observed in a very simple setup. For this, we train a single-layer CNN with four filters on CIFAR-10 using FGSM AT (see Sec. B for details). We observe that catastrophic overfitting occurs in this simple model as well, and its pattern is the same as in ResNet: a simultaneous drop of the PGD accuracy and gradient alignment (see Appendix C.2). The advantage of considering a simple model is that we can inspect the learned filters and understand what causes the network to become highly non-linear locally. We observe that after catastrophic overfitting the network has learned in one of its filters, which we denote w₂, a variant of the Laplace filter (see Fig. 5), an edge-detector filter which is well known for amplifying high-frequency noise such as uniform noise [11]. Until the end of training, filter w₂ preserves its direction (see Appendix C.2 for detailed visualizations), but grows significantly in magnitude together with its outgoing weights, in contrast to the rest of the filters, as shown in Fig. 6. Interestingly, if we set w₂ to zero, the network largely recovers local linearity: the gradient alignment increases back to its value before catastrophic overfitting. Thus, in this extreme case, even a single convolutional filter can cause catastrophic overfitting (a small sketch of this probe is given below). Next we analyze formally gradient alignment in a single-layer CNN and elaborate on the connection to noise sensitivity.
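The following PyTorch sketch shows a single-layer CNN of the kind used here together with the filter-zeroing probe described above. It is a minimal illustration with assumed hyperparameters (e.g. the 3×3 filter size), not the exact training code.

```python
import torch
import torch.nn as nn

class SingleLayerCNN(nn.Module):
    """Conv layer with m filters over non-overlapping patches, ReLU, then a linear layer."""
    def __init__(self, m=4, filter_size=3, n_classes=10, img_size=32):
        super().__init__()
        # stride = kernel size, so each filter sees non-overlapping image patches z_j
        self.conv = nn.Conv2d(3, m, filter_size, stride=filter_size)
        k = (img_size // filter_size) ** 2  # number of patches per image
        self.fc = nn.Linear(m * k, n_classes)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.fc(h.flatten(1))

def zero_filter(model, i):
    """Probe from this section: remove the contribution of filter i entirely."""
    with torch.no_grad():
        model.conv.weight[i].zero_()
        model.conv.bias[i].zero_()
```

Zeroing the Laplace-like filter and re-measuring the metric of Eq. (5) reproduces the recovery of gradient alignment described above.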
Analysis of gradient alignment in a single-layer CNN.
Figure 5: Filter w₂ (green channel) in a single-layer CNN before catastrophic overfitting (epoch 5) and after it (epoch 6).

Figure 6: Evolution of the norms of the four filters and of their outgoing weights in a single-layer CNN before and after catastrophic overfitting (dashed line).
We analyze here a single-layer CNN with ReLU activation. Let Z ∈ R^{p×k} be the matrix of k non-overlapping image patches extracted from the image x = vec(Z) ∈ R^d, such that z_j = z_j(x) ∈ R^p. The model is parametrized by (W, b, U, c) ∈ R^{p×m} × R^m × R^{m×k} × R, and its prediction and input gradient are given as

    f(x) = Σ_{i=1}^m Σ_{j=1}^k u_{ij} · max{ ⟨w_i, z_j⟩ + b_i, 0 } + c,
    ∇_x f(x) = vec( Σ_{i=1}^m Σ_{j=1}^k u_{ij} · 1_{⟨w_i, z_j⟩ + b_i ≥ 0} · w_i e_j^T ).

We observe that catastrophic overfitting only happens at later stages of training. At the beginning of training, the gradient alignment is very high (see Fig. 4 and Fig. 11), and FGSM solves the inner maximization problem accurately enough. Thus, an important aspect of FGSM training is that the model starts training from highly aligned gradients. This motivates us to inspect closely the gradient alignment at initialization.
Lemma 2. (Gradient alignment at initialization)
Let z ∼ U([0, 1]^p) be an image patch for p ≥ 2, let η ∼ U([−ε, ε]^d) be a point inside the ℓ∞-ball, and let the parameters of a single-layer CNN be initialized i.i.d. as w ∼ N(0, σ_w² I_p) for every column of W and u ∼ N(0, σ_u² I_m) for every column of U, with b := 0. Then the gradient alignment is lower bounded by

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) ) ≥ max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }.
Figure 7: Feature maps of a regular filter w₁ and the Laplace filter w₂ in a single-layer CNN. A small noise η is significantly amplified by the Laplace filter w₂, in contrast to the regular filter w₁.

The lemma implies that for randomly initialized CNNs with a large enough number of image patches k and filters m, the gradient alignment cannot be smaller than 0.5. This is in contrast to the much lower values that we observe after catastrophic overfitting, when the weights are no longer i.i.d. We note that the lower bound of 0.5 is quite pessimistic since it holds for an arbitrarily large ε. The lower bound is close to 1 when ε is small compared to E‖z‖₂, which is typical in adversarial robustness (see Appendix A.2 for a visualization of the lower bound). High gradient alignment at initialization also holds empirically for deep networks, e.g. for ResNet-18 (see Fig. 4), which starts training with gradient alignment close to 1, in contrast to the near-zero values after catastrophic overfitting. Thus, it appears to be a general phenomenon that the standard initialization scheme of neural network weights [15] ensures the initial success of FGSM training.

In contrast, after some point during training, the network can learn parameters which lead to a significant reduction of gradient alignment. For simplicity, let us consider a single-filter CNN, where the gradient alignment for a filter w and bias b at the points x and x + η has a simple expression:

    cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) ) = [ Σ_{i=1}^k u_i² 1_{⟨w, z_i⟩ + b ≥ 0} 1_{⟨w, z_i + η_i⟩ + b ≥ 0} ] / [ sqrt(Σ_{i=1}^k u_i² 1_{⟨w, z_i⟩ + b ≥ 0}) · sqrt(Σ_{i=1}^k u_i² 1_{⟨w, z_i + η_i⟩ + b ≥ 0}) ].    (6)

Considering a single-filter CNN is also motivated by the fact that in the single-layer CNN introduced earlier, the norms of w₂ and of its outgoing weights are much higher than those of the rest of the filters (see Fig. 6), and thus the contribution of w₂ to the predictions and gradients of the network is the most significant. We observe that when an image x is convolved with the Laplace filter w₂, even a uniformly random noise η of small magnitude is able to significantly affect the output of (x + η) ∗ w₂ (see Fig. 7). As a consequence, the ReLU activations of the network change their signs, which directly affects the gradient alignment in Eq. (6). Namely, x ∗ w₂ + b₂ has mostly negative values, and thus many of the indicators {1_{⟨w₂, z_i⟩ + b₂ ≥ 0}}_{i=1}^k are equal to 0. On the other hand, nearly half of the indicators {1_{⟨w₂, z_i + η_i⟩ + b₂ ≥ 0}}_{i=1}^k become 1, which significantly increases the denominator of Eq. (6) and thus makes the cosine close to 0. At the same time, the output of a regular filter w₁ shown in Fig. 7 is only slightly affected by the random noise η; this noise-amplification effect is easy to reproduce, as sketched below. For deep networks, however, we could not identify particular filters responsible for catastrophic overfitting, and thus we consider next a more general solution.
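The noise amplification by a Laplace filter can be checked in a few lines of NumPy/SciPy. This is our own illustration; an averaging filter stands in for a generic "regular" filter.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
eta = rng.uniform(-1, 1, (64, 64))   # uniform noise patch (its magnitude cancels below)

laplace = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
regular = np.full((3, 3), 1 / 9)     # averaging filter as a generic "regular" filter

for name, w in [("laplace", laplace), ("regular", regular)]:
    out = convolve2d(eta, w, mode="valid")  # = (x + eta) * w - x * w, by linearity
    print(name, out.std() / eta.std())      # noise amplification factor
# The Laplace filter amplifies the noise by roughly sqrt(sum(w**2)) ~ 4.5x,
# while the averaging filter shrinks it by ~3x.
```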
Based on the importance of gradient alignment for successful FGSM training, we propose a regularizer, GradAlign, that aims at increasing gradient alignment and preventing catastrophic overfitting. The core idea of GradAlign is to maximize the gradient alignment (as defined in Eq. 5) between the gradients at the point x and at a randomly perturbed point x + η inside the ℓ∞-ball around x:

    Ω(x, y, θ) := E_{(x,y)∼D, η∼U([−ε,ε]^d)} [ 1 − cos( ∇_x ℓ(x, y; θ), ∇_x ℓ(x + η, y; θ) ) ].    (7)

Crucially, GradAlign uses gradients at the points x and x + η, which does not require an expensive iterative procedure, unlike, e.g., the LLR method of [28]. Note that the regularizer depends only on the gradient direction and is invariant to the gradient norm, which contrasts with gradient penalties [14, 17, 31, 36] or CURE [25] (see the comparison in Appendix D).
Experimental setup.
We compare the following methods: standard FGSM AT, FGSM-RS AT with α = 1.25ε [46], FGSM AT + GradAlign, AT for Free with m = 8 [34], PGD-2 AT with a 2-step PGD attack, and PGD-10 AT with a 10-step PGD attack (with step sizes proportional to ε; see Appendix B). We train these methods using PreAct ResNet-18 [16] with ℓ∞-radii ε ∈ {1/255, . . . , 16/255} on CIFAR-10 for 30 epochs and ε ∈ {1/255, . . . , 12/255} on SVHN for 15 epochs. The only exception is AT for Free [34], which we train for more epochs on both CIFAR-10 and SVHN, as this was necessary to obtain results comparable to the other methods. Unlike [28] and [48], with the training scheme of [46] we could successfully train a PGD-2 model on CIFAR-10 with robustness better than that of their methods that use the same number of PGD steps (see Appendix D). This also echoes the recent finding of [30] that properly tuned multi-step PGD AT outperforms more recently published methods. As before, we evaluate robustness using PGD-50-10, i.e. with 50 iterations and 10
restarts, with step size α = ε/4, for the same ε that was used for training. We train each model with 5 random seeds since the final robustness can have a large variance for high ε. We also remark that training with GradAlign leads on average to a noticeable slowdown compared to FGSM training, which is due to the use of double backpropagation (see [9] for a detailed analysis). We think that improving the runtime of GradAlign is possible, but we postpone it to future work. Additional implementation details are provided in Appendix B. The code of our experiments is available at https://github.com/tml-epfl/understanding-fast-adv-training.

Figure 8: Accuracy (dashed line) and robustness (solid line) of different adversarial training (AT) methods on CIFAR-10 and SVHN with ResNet-18 trained and evaluated with different ℓ∞-radii. The results are obtained without early stopping, averaged over 5 random seeds used for training, and reported with the standard deviation.

Results.
We provide the main comparison in Fig. 8 and give detailed numbers for specific values of ε in Appendix D.3. First, we notice that all the methods perform almost equally well for small enough ε. However, the performance for larger ε varies a lot depending on the method, due to catastrophic overfitting. Importantly, GradAlign successfully prevents catastrophic overfitting in FGSM AT, thus allowing to successfully apply FGSM training also for larger ℓ∞-perturbations and to reduce the gap to PGD-10 training. In Appendix D.4, we additionally show that FGSM + GradAlign does not suffer from catastrophic overfitting even for ε beyond 16/255. At the same time, not only FGSM AT and FGSM-RS AT experience catastrophic overfitting, but also the recently proposed
AT for Free and PGD-2, although at higher ε values than FGSM AT. We note that GradAlign is applicable not only to FGSM AT, but also to other methods that can suffer from catastrophic overfitting. In particular, combining PGD-2 with
GradAlign prevents catastrophic overfitting and leads to better robustness for large ε on CIFAR-10 (see Appendix D.3). Although performing early stopping can lead to non-trivial robustness, standard accuracy is often significantly sacrificed, which limits the usefulness of this technique (see Appendix D). This is in contrast to training with GradAlign, which leads to the same standard accuracy as PGD-10 AT.

We also performed similar experiments on ImageNet in Appendix D.3, but observed that even for standard FGSM training with the training schedule of [46], catastrophic overfitting does not occur for the radii ε ∈ {2/255, 4/255} considered in [34, 46], and thus there is no need to use GradAlign, as its main role is to prevent catastrophic overfitting. Finally, with regard to the robust overfitting phenomenon outlined in [30], we observed that training FGSM +
GradAlign for more than 30 epochs also leads to slightly worse robustness on the test set (see Appendix D.4), thus suggesting that catastrophic and robust overfitting are two distinct phenomena that have to be addressed separately.

Conclusions

We observed that catastrophic overfitting is a fundamental problem not only for standard FGSM training, but for computationally efficient adversarial training in general. In particular, many recently proposed schemes, such as FGSM AT enhanced by a random step or
AT for Free, are also prone to catastrophic overfitting. Motivated by this, we explored the questions of when and why
FGSM adversarial training works, and how to improve it by increasing the gradient alignment and thus the quality of the solution of the inner maximization problem. Our proposed regularizer
GradAlign prevents catastrophic overfitting and improves the robustness compared to other fast adversarial training methods, reducing the gap to multi-step PGD training.
Acknowledgements
We thank Eric Wong, Francesco Croce, Guillermo Ortiz-Jimenez, Apostolos Modas, and Chen Liu for many fruitful discussions.
References

[1] Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, and Pushmeet Kohli. Are labels required for improving adversarial robustness? NeurIPS, 2019.
[2] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton Series in Applied Mathematics. Princeton University Press, Princeton, NJ, 2009.
[3] Battista Biggio and Fabio Roli. Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition, 2018.
[4] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[5] Nicholas Carlini, Guy Katz, Clark Barrett, and David L. Dill. Provably minimally-distorted adversarial examples. arXiv preprint arXiv:1709.10207, 2017.
[6] Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C. Duchi. Unlabeled data improves adversarial robustness. NeurIPS, 2019.
[7] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. ICML, 2019.
[8] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. ICML, 2020.
[9] Christian Etmann. A closer look at double backpropagation. arXiv preprint arXiv:1906.06637, 2019.
[10] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. ICML, 2006.
[11] Rafael C. Gonzalez and Richard E. Woods. Digital image processing (2nd edition). Prentice Hall, New Jersey, 2002.
[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015.
[13] Sven Gowal, Jonathan Uesato, Chongli Qin, Po-Sen Huang, Timothy Mann, and Pushmeet Kohli. An alternative surrogate loss for PGD-based adversarial testing. arXiv preprint arXiv:1910.09338, 2019.
[14] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. ICLR Workshops, 2015.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. ECCV, 2016.
[17] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. NeurIPS, 2017.
[18] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. ICML, 2019.
[19] Peter J. Huber. Robust statistics. John Wiley & Sons, Inc., New York, 1981.
[20] A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
[21] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: an efficient SMT solver for verifying deep neural networks. CAV, 2017.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553), 2015.
[23] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
[24] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. ICLR, 2018.
[25] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. CVPR, 2019.
[26] Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. NeurIPS, 2019.
[27] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. ASIA CCS '17, 2017.
[28] Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. NeurIPS, 2019.
[29] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. ICLR, 2018.
[30] Leslie Rice, Eric Wong, and J. Zico Kolter. Overfitting in adversarially robust deep learning. ICML, 2020.
[31] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. AAAI, 2018.
[32] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. NeurIPS, 2019.
[33] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85-117, 2015.
[34] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! NeurIPS, 2019.
[35] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 2018.
[36] Carl-Johann Simon-Gabriel, Yann Ollivier, Léon Bottou, Bernhard Schölkopf, and David Lopez-Paz. First-order adversarial vulnerability of neural networks and input dimension. ICML, 2019.
[37] Leslie N. Smith. Cyclical learning rates for training neural networks. WACV, 2017.
[38] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2014.
[39] Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. ICLR, 2019.
[40] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. ICLR, 2018.
[41] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ICLR, 2019.
[42] B. S. Vivek and R. Venkatesh Babu. Single-step adversarial training with dropout scheduling. CVPR, 2020.
[43] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. ICML, 2019.
[44] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S. Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. ICML, 2018.
[45] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML, 2018.
[46] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training. ICLR, 2020.
[47] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, and Quoc V. Le. Adversarial examples improve image recognition. CVPR, 2020.
[48] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. NeurIPS, 2019.
[49] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. ICML, 2019.
[50] Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding. ICLR, 2019.

Appendix

A Deferred proofs

In this section, we show the proofs omitted from Sec. 3 and Sec. 4.
A.1 Proof of Lemma 1
We state again Lemma 1 from Sec. 3 and present the proof.
Lemma 1. (Effect of the random step)
Let η ∼ U([−ε, ε]^d) be a random starting point, and let α ∈ [0, ε] be the step size of FGSM-RS defined in Eq. (3). Then

    E_η[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E_η[‖δ_FGSM-RS(η)‖₂²] ) = √d · sqrt( −α³/(6ε) + α²/2 + ε²/3 ).

Proof.
First, note that due to Jensen's inequality, we have the convenient upper bound

    E[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E[‖δ_FGSM-RS(η)‖₂²] ),    (8)

which is easier to work with. Therefore, we can focus on E[‖δ_FGSM-RS‖₂²], which can be computed analytically. Denoting ∇ := ∇_x ℓ(x + η, y; θ) ∈ R^d, we obtain:

    E_η[‖δ_FGSM-RS‖₂²] = E_η[ ‖Π_{[−ε,ε]^d}[η + α sign(∇)]‖₂² ]
                      = Σ_{i=1}^d E_{η_i}[ Π_{[−ε,ε]}[η_i + α sign(∇_i)]² ]
                      = d · E_{η_i}[ min{ε, |η_i + α sign(∇_i)|}² ]
                      = d · E_{η_i}[ min{ε², (η_i + α sign(∇_i))²} ]
                      = d · E_{r_i}[ E_{η_i}[ min{ε², (η_i + α r_i)²} | sign(∇_i) = r_i ] ],

where in the last step we use the law of total expectation, noting that sign(∇_i) is itself a random variable since it depends on η_i.

We first consider the case sign(∇_i) = 1, for which the inner conditional expectation equals

    (1/(2ε)) ∫_{−ε}^{ε} min{ε², (η_i + α)²} dη_i = (1/(2ε)) ∫_{−ε+α}^{ε+α} min{ε², x²} dx
    = (1/(2ε)) ( ∫_{ε}^{ε+α} ε² dx + ∫_{−ε+α}^{ε} x² dx )
    = −α³/(6ε) + α²/2 + ε²/3.

The case sign(∇_i) = −1 leads to the same expression:

    (1/(2ε)) ∫_{−ε}^{ε} min{ε², (η_i − α)²} dη_i = (1/(2ε)) ∫_{−ε−α}^{ε−α} min{ε², x²} dx = −α³/(6ε) + α²/2 + ε²/3.

Combining the two cases with Eq. (8), we conclude that

    E_η[‖δ_FGSM-RS(η)‖₂] ≤ sqrt( E[‖δ_FGSM-RS(η)‖₂²] ) = √d · sqrt( −α³/(6ε) + α²/2 + ε²/3 ).

A.2 Proof and discussion of Lemma 2
We state again Lemma 2 from Sec. 4 and present the proof.
Lemma 2. (Gradient alignment at initialization)
Let z ∼ U([0, 1]^p) be an image patch for p ≥ 2, let η ∼ U([−ε, ε]^d) be a point inside the ℓ∞-ball, and let the parameters of a single-layer CNN be initialized i.i.d. as w ∼ N(0, σ_w² I_p) for every column of W and u ∼ N(0, σ_u² I_m) for every column of U, with b := 0. Then the gradient alignment is lower bounded by

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) ) ≥ max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }.

Proof.
For k and m large enough, the law of large numbers ensures that an empirical mean of i.i.d. random variables can be approximated by its expectation with respect to the random variables z, η, w, u. Writing out the cosine between the input gradients of the single-layer CNN, and using that the cross terms with r ≠ l vanish in the limit (the columns of U are independent with zero mean), this leads to

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) )
    = lim_{k,m→∞} [ Σ_{r,l=1}^m Σ_{i=1}^k ⟨w_r, w_l⟩ u_{ri} u_{li} 1_{⟨w_r, z_i⟩ ≥ 0} 1_{⟨w_l, z_i + η_i⟩ ≥ 0} ]
      / [ sqrt( Σ_{r,l,i} ⟨w_r, w_l⟩ u_{ri} u_{li} 1_{⟨w_r, z_i⟩ ≥ 0} 1_{⟨w_l, z_i⟩ ≥ 0} ) · sqrt( Σ_{r,l,i} ⟨w_r, w_l⟩ u_{ri} u_{li} 1_{⟨w_r, z_i + η_i⟩ ≥ 0} 1_{⟨w_l, z_i + η_i⟩ ≥ 0} ) ]
    = lim_{k,m→∞} [ (1/(km)) Σ_{r=1}^m Σ_{i=1}^k ‖w_r‖₂² u_{ri}² 1_{⟨w_r, z_i⟩ ≥ 0} 1_{⟨w_r, z_i + η_i⟩ ≥ 0} ]
      / [ sqrt( (1/(km)) Σ_{r,i} ‖w_r‖₂² u_{ri}² 1_{⟨w_r, z_i⟩ ≥ 0} ) · sqrt( (1/(km)) Σ_{r,i} ‖w_r‖₂² u_{ri}² 1_{⟨w_r, z_i + η_i⟩ ≥ 0} ) ]
    = E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} 1_{⟨w,z+η⟩ ≥ 0} ] / ( sqrt( E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} ] ) · sqrt( E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z+η⟩ ≥ 0} ] ) ),    (9)

where the factors u² cancel since u is independent of the remaining random variables. For the denominator we directly compute

    E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} ] = E_{w,η,z}[ ‖w‖₂² 1_{⟨w,z+η⟩ ≥ 0} ] = 0.5 p σ_w².

For the numerator, using P_η[⟨w, η⟩ ≥ ⟨w, z⟩] ≤ exp(−⟨w, z⟩² / (2ε²‖w‖₂²)), which follows from Hoeffding's inequality, we obtain

    E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} 1_{⟨w,z+η⟩ ≥ 0} ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, z + η⟩ ≥ 0) ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, η⟩ ≥ −⟨w, z⟩) ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, η⟩ ≤ ⟨w, z⟩) ]
    = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} (1 − P_η(⟨w, η⟩ ≥ ⟨w, z⟩)) ]
    ≥ E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} (1 − exp(−⟨w, z⟩² / (2ε²‖w‖₂²))) ]
    = 0.5 p σ_w² − 0.5 E_{w,z}[ ‖w‖₂² exp(−⟨w, z⟩² / (2ε²‖w‖₂²)) ]
    ≥ 0.5 p σ_w² − 0.5 E_w[ ‖w‖₂⁴ ]^{1/2} · E_{w,z}[ exp(−⟨w, z⟩² / (ε²‖w‖₂²)) ]^{1/2}
    = 0.5 p σ_w² − 0.5 σ_w² sqrt(p² + 2p) · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2},

where the last inequality is the Cauchy-Schwarz inequality. On the other hand, we also have

    E_{w,z,η}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} 1_{⟨w,z+η⟩ ≥ 0} ] = E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} P_η(⟨w, η⟩ ≤ ⟨w, z⟩) ] ≥ E_{w,z}[ ‖w‖₂² 1_{⟨w,z⟩ ≥ 0} · 0.5 ] = 0.25 p σ_w².

Combining both lower bounds on the numerator with the value of the denominator, we obtain a lower bound on Eq. (9):

    lim_{k,m→∞} cos( ∇_x ℓ(x, y), ∇_x ℓ(x + η, y) )
    ≥ max{ 1 − sqrt(1 + 2/p) · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }
    ≥ max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 },    (10)

where in the last step we used that p ≥ 2.

The main purpose of obtaining the lower bound in Lemma 2 was to get an expression that gives insight into the key quantities on which the gradient alignment at initialization depends. Considering the limiting case k, m → ∞ was necessary to obtain a ratio of expectations that allowed us to derive a simpler expression. Finally, we lower bounded the gradient alignment from Eq. (9) using Hoeffding's and the Cauchy-Schwarz inequalities, and used p ≥ 2 to obtain a dimension-independent constant in front of the expectation in Eq. (10). To provide a better understanding of the key quantities involved in the lemma and to assess the tightness of the derived lower bound, in Fig. 9 we plot:

• cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) for k = 100 patches and m = 4 filters (which resembles the setting of the 4-filter CNN on CIFAR-10); note that it is a random variable since it is a function of the random variables x, η, W, U;
• lim_{k,m→∞} cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) evaluated via Eq. (9);
• our first lower bound, max{ 1 − (1/(p σ_w²)) E_{w,z}[ ‖w‖₂² exp(−⟨w/‖w‖₂, z⟩² / (2ε²)) ], 0.5 }, obtained via Hoeffding's inequality;
• our final lower bound, max{ 1 − √2 · E_{w,z}[ exp(−⟨w/‖w‖₂, z⟩² / ε²) ]^{1/2}, 0.5 }.

Figure 9: Visualization of the key quantities involved in Lemma 2.

For the last three quantities we approximate the expectations by Monte-Carlo sampling. For all the quantities we use patches of the same dimension p as in our CIFAR-10 experiments. We plot gradient alignment values for small ε, since we are interested in small ℓ∞-perturbations such as, e.g., ε = 8/255 ≈ 0.031, which is a typical value used for CIFAR-10 [23]. First, we can observe that all four quantities are very high for such small ε, in contrast to the much lower gradient alignment observed after catastrophic overfitting. Next, we observe that cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) has some noticeable variance for the chosen parameters k = 100 patches and m = 4 filters. However, this variance is significantly reduced when we increase the parameters k and m, and disappears in the limiting case k, m → ∞. Finally, we observe that both lower bounds on lim_{k,m→∞} cos(∇_x ℓ(x, y), ∇_x ℓ(x + η, y)) are quite tight for small ε. We choose to report the final one in the lemma since it is slightly more concise than the one obtained via Hoeffding's inequality.
B Experimental details

We list the detailed evaluation and training settings below.
Evaluation.
Throughout the paper, we use PGD-50-10 for the evaluation of adversarial accuracy, which stands for the PGD attack with 50 iterations and 10 random restarts, following [46]. We use the step size α = ε/4. The choice of this attack is motivated by the fact that in both public benchmarks of [23] on MNIST and CIFAR-10, the adversarial accuracy of PGD-100-50 and PGD-20-10, respectively, is only 2% away from the best entries. Although we train our models using half precision [24], we always perform the robustness evaluation in single precision, since evaluation with half precision can sometimes overestimate the robustness of the model due to limited numerical precision in the calculation of the gradients. We perform the evaluation of standard accuracy using the full test sets, but we evaluate adversarial accuracy using a random subset of points on each dataset.
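For concreteness, here is a minimal PyTorch sketch of an ℓ∞ PGD attack with random restarts in the spirit of PGD-50-10. It is our own illustration, not the exact evaluation code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, n_iter=50, n_restarts=10):
    """l_inf PGD with random restarts; keeps the perturbation that maximizes the loss."""
    best_delta = torch.zeros_like(x)
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)
    for _ in range(n_restarts):
        delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
        for _ in range(n_iter):
            loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
            grad = torch.autograd.grad(loss, delta)[0]
            with torch.no_grad():
                delta += alpha * grad.sign()
                delta.clamp_(-eps, eps)
        with torch.no_grad():
            losses = F.cross_entropy(model((x + delta).clamp(0, 1)), y, reduction="none")
            improved = losses > best_loss
            best_delta[improved] = delta.detach()[improved]
            best_loss[improved] = losses[improved]
    return (x + best_delta).clamp(0, 1)
```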
Training details for ResNet-18.
We use the implementation code of [46], with the only difference that we do not use image normalization and gradient clipping on CIFAR-10 and SVHN, since we found that they have no significant influence on the final results. We use cyclic learning rates and half-precision training following [46]. We do not use random initialization for PGD during adversarial training, as we did not find that it leads to any improvements on the considered datasets (see the justification in Sec. D.1 below). We perform early stopping based on the PGD accuracy on the training set, following [46]; we observed that such a simple model selection scheme can successfully select a model before catastrophic overfitting that has non-trivial robustness. On CIFAR-10, we train all the models for 30 epochs, except AT for Free [34], which we train for more epochs with m = 8 minibatch replays to get results comparable to the other methods. On SVHN, we train all the models for 15 epochs, again except AT for Free [34], which we train for more epochs with m = 8 minibatch replays. Moreover, in order to prevent convergence to a constant classifier on SVHN, we linearly increase the perturbation radius from 0 to ε during the first 5 epochs for all methods. For PGD-2 AT we use for training a 2-step PGD attack, and for PGD-10 AT a 10-step PGD attack, with step sizes proportional to ε. For Fig. 1 and Fig. 8, we used GradAlign λ values obtained via a linear interpolation on the logarithmic scale between the best λ values that we found for ε = 8/255 and ε = 16/255 on the test sets. We perform the interpolation on the logarithmic scale since the values of λ are non-negative, and a usual linear interpolation would lead to negative values of λ. The resulting λ values for ε ∈ {1/255, . . . , 16/255} are summarized in Table 2. We note that in the end we do not report results for the largest ε values on SVHN, since many models have trivial robustness close to that of a constant classifier. For the PGD-2 + GradAlign experiments reported below in Table 4 and Table 5, we use separately tuned λ values for the CIFAR-10 and SVHN experiments.

Training details for the single-layer CNN.
The single-layer CNN that we study in Sec. 4 has 4 convolutional filters. After the convolution we apply a ReLU activation, and then we directly have a fully-connected layer, i.e. we do not use any pooling layer. For training, we use the Adam optimizer for 30 epochs with the same cyclical learning rate schedule.

Table 2: GradAlign λ values used for the experiments on CIFAR-10 and SVHN for ε ∈ {1/255, . . . , 16/255}. These values are obtained via a linear interpolation on the logarithmic scale between successful λ values at ε = 8/255 and ε = 16/255.
Standard model b.)
PGD-trained model d d d d -norm of the perturbation | ( x + )( x ) , | Start from
FGSM
Start from random d d d d -norm of the perturbation | L ( x + ) L ( x ) , | FGSMrandom
Figure 10:
The quality of the linear approximation of (cid:96) ( x + δ ) for δ with different (cid:96) -norm for (cid:107) δ (cid:107) ∞ fixedto ε for a standard and PGD-trained ResNet-18 on CIFAR-10. we use the ADAM optimizer with learning rate . for 30 epochs using the same cyclical learningrate schedule. ImageNet experiments.
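For reference, this architecture takes only a few lines of PyTorch. The sketch below is an illustration under our own assumptions: the filter size (3×3 here) is not fixed by the text above, and the class name is ours.

```python
import torch
import torch.nn as nn

class SingleLayerCNN(nn.Module):
    """One convolution (4 filters), ReLU, then a linear classifier; no pooling.
    The 3x3 kernel size is an assumption made for this sketch."""
    def __init__(self, n_filters=4, kernel_size=3, n_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, n_filters, kernel_size)
        out_hw = 32 - kernel_size + 1  # CIFAR-10 inputs are 32x32
        self.fc = nn.Linear(n_filters * out_hw * out_hw, n_classes)

    def forward(self, x):
        return self.fc(torch.relu(self.conv(x)).flatten(1))

model = SingleLayerCNN()
opt = torch.optim.Adam(model.parameters())  # ADAM, combined with a cyclic LR schedule
```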
ImageNet experiments. We use ResNet-50 following the training scheme of [46], which includes 3 training stages at different image resolutions. For GradAlign, we slightly reduce the batch size in the second and third stages from 224 and 128 to 180 and 100 respectively, in order to reduce the memory consumption. For all ε ∈ { , , }, we train FGSM models with GradAlign using λ ∈ { . , . }. The final λ values we report are λ ∈ { . , . , . } for the three values of ε respectively.
Computing infrastructure.
We perform all our experiments on NVIDIA V100 GPUs with 32GB of memory.
C Supporting experiments and visualizations for Sec. 3 and Sec. 4
We describe here supporting experiments and visualizations related to Sec. 3 and Sec. 4.
Figure 11: Visualization of the training process of an FGSM-trained CNN with 4 filters with ε = / . We can observe catastrophic overfitting around epoch 6. (The three panels show, over training epochs: the FGSM and PGD adversarial accuracy, the FGSM and PGD adversarial loss, and the cosines cos(∇ℓ(x), ∇ℓ(x + η)) and cos(δ_FGSM, δ_PGD).)

C.1 Quality of the linear approximation for ReLU networks

For the loss function ℓ of a ReLU network, we compute empirically the quality of the linear approximation, defined as

|ℓ(x + δ) − ℓ(x) − ⟨δ, ∇_x ℓ(x)⟩|,

where the dependency of the loss ℓ on the label y and the parameters θ is omitted for clarity. We then perform the following experiment: we take a perturbation δ ∈ {−ε, ε}^d and zero out different fractions of its coordinates, which leads to perturbations with a fixed ‖δ‖∞ = ε but with different ‖δ‖2 ∈ [0, √d ε]. As the starting δ we choose two types of perturbations: δ_FGSM generated by FGSM, and δ_random sampled uniformly from the corners of the ℓ∞-ball. We plot the results in Fig. 10 on CIFAR-10 for ε = 8/255, averaged over 512 test points, and conclude that for both δ_FGSM and δ_random the validity of the linear approximation crucially depends on ‖δ‖2, even when ‖δ‖∞ is fixed. The phenomenon is even more pronounced for FGSM perturbations, for which the linearization error is much higher. Moreover, this observation is consistent across both standardly and adversarially trained ResNet-18 models.
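This experiment is easy to reproduce; the following is a minimal PyTorch sketch (the helper names `linearization_error` and `sparsify` are ours, not taken from the released code). Zeroing out coordinates keeps ‖δ‖∞ = ε while shrinking ‖δ‖2 by roughly the square root of the kept fraction.

```python
import torch
import torch.nn.functional as F

def linearization_error(model, x, y, delta):
    """Batch-averaged |l(x + delta) - l(x) - <delta, grad_x l(x)>|."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    with torch.no_grad():
        loss_pert = F.cross_entropy(model(x + delta), y)
        linear_term = (delta * grad).flatten(1).sum(1).mean()
        return (loss_pert - loss - linear_term).abs()

def sparsify(delta, keep_frac):
    """Zero out a random fraction (1 - keep_frac) of the coordinates of delta."""
    mask = (torch.rand_like(delta) < keep_frac).float()
    return delta * mask
```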
C.2 Catastrophic overfitting in a single-layer CNN

We describe here complementary figures to Sec. 4 related to the single-layer CNN.
Training curves.
In Fig. 11, we show the evolution of the FGSM/PGD accuracy, the FGSM/PGD loss, and the gradient alignment together with cos(δ_FGSM, δ_PGD). We observe that catastrophic overfitting occurs around epoch 6 and that its pattern is the same as for the deep ResNet illustrated in Fig. 4. Namely, the following changes occur concurrently around epoch 6: (a) there is a sudden drop of the PGD accuracy together with an increase of the FGSM accuracy, (b) the PGD loss grows by an order of magnitude while the FGSM loss decreases, (c) both the gradient alignment and cos(δ_FGSM, δ_PGD) decrease significantly. Throughout all our experiments we observe a very high correlation between cos(δ_FGSM, δ_PGD) and the gradient alignment. This motivates our proposed regularizer GradAlign, which relies on the cosine between ∇_x ℓ(x, y; θ) and ∇_x ℓ(x + η, y; θ), where η is a random point. In this way, we avoid using an iterative procedure inside the regularizer unlike, for example, the approach of [28].
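In code, the regularizer is very compact. The sketch below is a minimal PyTorch illustration of this idea, not the exact released implementation (in particular, details such as whether one of the two gradients is detached may differ):

```python
import torch
import torch.nn.functional as F

def grad_align_reg(model, x, y, eps):
    """GradAlign sketch: 1 - cos(grad at x, grad at x + eta), eta ~ U([-eps, eps]^d)."""
    x1 = x.clone().requires_grad_(True)
    x2 = (x + torch.empty_like(x).uniform_(-eps, eps)).requires_grad_(True)
    g1 = torch.autograd.grad(F.cross_entropy(model(x1), y), x1, create_graph=True)[0]
    g2 = torch.autograd.grad(F.cross_entropy(model(x2), y), x2, create_graph=True)[0]
    cos = F.cosine_similarity(g1.flatten(1), g2.flatten(1), dim=1)
    return (1.0 - cos).mean()

# FGSM + GradAlign training objective (lambda_reg is the regularization weight):
#   loss = F.cross_entropy(model(x_fgsm), y) + lambda_reg * grad_align_reg(model, x, y, eps)
```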
Additional filters. In Fig. 12, we show the evolution of a regular filter and of the filter that leads to catastrophic overfitting, for the three input channels (red, green, blue). We can observe that in the red and green channels, the problematic filter has learned a Laplace filter, which is very sensitive to noise. Moreover, this filter significantly increases in magnitude after catastrophic overfitting, contrary to the regular filter, whose magnitude only decreases (see the colorbar values in Fig. 12 and the plots in Fig. 5). Additional feature maps.
In Fig. 13, we show additional feature maps for images with and without uniform random noise η ∼ U([− / , / ]^d). These figures complement Fig. 7 shown in the main part. We clearly see that only the last filter is sensitive to the noise, since its feature maps change dramatically. At the same time, the other filters are only slightly affected by the addition of the noise. We also show the input gradients in the last column, which illustrate that after adding the noise the gradients change dramatically, leading to a small gradient alignment and, in turn, to the failure of FGSM as a solution of the inner maximization problem.

Figure 12: Evolution of the regular filter and of the filter that leads to catastrophic overfitting (epochs 3, 4, 5, 6, 7, and 30), with the red (R), green (G), and blue (B) channels of each filter plotted separately. We can observe that in the R and G channels, the problematic filter has learned a Laplace filter, which is very sensitive to noise.

Figure 13: Input images, feature maps, and gradients of the single-layer CNN trained on CIFAR-10 at the end of training (after catastrophic overfitting). Odd rows: original images. Even rows: original images plus random noise U([− / , / ]^d). We observe that only the last filter is highly sensitive to the small uniform noise, since its feature maps change dramatically.

D Additional experiments for different adversarial training schemes
In this section, we describe additional experiments related to GradAlign that complement the results shown in Sec. 5.
D.1 Stronger PGD-2 baseline
As mentioned in Sec. 5, the PGD-2 training baseline that we report outperforms other similar baselines reported in the literature [48, 28]. Here we elaborate on what are likely the most important sources of the difference. First, we follow the cyclical learning rate schedule of [46], which can act as implicit early stopping and thus can help to prevent the catastrophic overfitting observed for PGD-2 in [28]. Another source of difference is that [28] use the ADAM optimizer, while we stick to the standard PGD updates using the sign of the gradient [23].
The second important factor is a proper step size selection. While [48] do not observe catastrophic overfitting, their PGD-3 baseline achieves only .  adversarial accuracy, compared to .  for our PGD-2 baseline evaluated with a stronger attack (PGD-50-10 instead of PGD-20-1). One potential explanation for this difference lies in the step size selection, where for PGD-2 we use α = ε/ .
Related to the step size selection, we also found that using random initialization in PGD (which we refer to as PGD-k-RS), as suggested in [23], requires a larger step size α. We show the results in Table 3, where we can see that PGD-2-RS AT with α = ε/  achieves suboptimal robustness compared to using α = ε for training. However, we consistently observed that PGD-2 AT with α = ε/  and no random step performs best. Thus, we use the latter as our PGD-2 baseline throughout the paper, always starting PGD-2 from the original point, without using any random step.

Table 3: Robustness of different PGD-2 schemes for ε = 8/255 on CIFAR-10 for ResNet-18. The results are averaged over 5 random seeds used for training.

Model | PGD-50-10 accuracy
PGD-2-RS AT, α = ε/  | ±
PGD-2-RS AT, α = ε | ±
PGD-2 AT, α = ε/  | ±
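To make the two schemes in Table 3 concrete, here is a minimal PyTorch sketch of the inner maximization of PGD-k AT with and without the random step (the function name and its arguments are ours):

```python
import torch
import torch.nn.functional as F

def pgd_train_delta(model, x, y, eps, alpha, n_steps=2, random_start=False):
    """Inner maximization for PGD-k AT; random_start=True gives PGD-k-RS.
    Projection onto the input range [0, 1] is omitted for brevity."""
    if random_start:
        delta = torch.empty_like(x).uniform_(-eps, eps)  # PGD-k-RS
    else:
        delta = torch.zeros_like(x)  # plain PGD-k: start at the original point
    delta.requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()
```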
D.2 Results with early stopping

We complement the results presented in Fig. 8 without early stopping with results obtained with early stopping, which we show in Fig. 14. For CIFAR-10, we observe that FGSM + GradAlign leads to good robustness and accuracy, outperforming FGSM AT and FGSM-RS AT, performing similarly to PGD-2, and improving slightly over it for larger ε close to / . For SVHN, GradAlign leads to better robustness than the other FGSM-based methods. We also observe that for large ε, on both CIFAR-10 and SVHN, AT for Free performs similarly to the FGSM-based methods. Moreover, for ε ≥ /  on SVHN, AT for Free converges to a constant classifier.
On both CIFAR-10 and SVHN, we can see that although early stopping can lead to non-trivial robustness, standard accuracy is often significantly sacrificed, which limits the usefulness of this technique. This is in contrast to training with GradAlign, which leads to the same standard accuracy as PGD-10 training.

Figure 14: Accuracy (dashed lines) and robustness (solid lines) of different adversarial training (AT) methods on CIFAR-10 and SVHN with ResNet-18, trained and evaluated with different ℓ∞-radii. The results are obtained with early stopping, averaged over 5 random seeds used for training, and reported with the standard deviation. (Methods: FGSM AT, FGSM-RS AT, FGSM AT + GradAlign, AT for Free, PGD-2 AT, PGD-10 AT; x-axis: ε used for training and evaluation; y-axis: standard and PGD-50-10 accuracy.)

D.3 Results for specific ℓ∞-radii

Here we report the results from Fig. 8 for the specific ℓ∞-radii that are most often studied in the literature.

CIFAR-10 results.
We report robustness and accuracy in Table 4 for CIFAR-10 without using early stopping, where we can clearly see which methods lead to catastrophic overfitting and thus to suboptimal robustness. We compare the same methods as in Fig. 8, and additionally we report the results for ε = 8/255 of the CURE [25], YOPO [48], and LLR [28] approaches. First, for ε = 8/255, we see that FGSM + GradAlign outperforms AT for Free and all methods that use FGSM training. We also observe that the model trained with CURE [25] achieves robustness that is suboptimal compared to FGSM-RS AT evaluated with a stronger attack: .  vs . . YOPO-3-5 and YOPO-5-3 [48] require 3 and 5 full steps of PGD respectively, so they are much more expensive than FGSM-RS AT, and yet they lead to worse adversarial accuracy: .  and .  vs . . [28] report that LLR-2, i.e. their approach with 2 steps of PGD, achieves .  adversarial accuracy. This result is not directly comparable to the other results in Table 4, since [28] use (1) a larger network (Wide-ResNet-28-8) and (2) a stronger attack (MultiTargeted [13]). However, we think that the gap of −  compared to the adversarial accuracy of our reported FGSM + GradAlign and PGD-2 methods ( .  and .  respectively) is still significant, since the difference between MultiTargeted and a PGD attack with random restarts is observed to be small (e.g. around 1% between MultiTargeted and PGD-20-10 on the CIFAR-10 challenge of [23]).
For ε = 16/255, none of the one-step methods work without early stopping except FGSM + GradAlign. We also evaluate PGD-2 + GradAlign and conclude that the benefit of combining the two comes when PGD-2 alone leads to catastrophic overfitting, which occurs at ε = 16/255. For ε = 8/255, there is no benefit in combining the two approaches. This is consistent with our observation regarding catastrophic overfitting for FGSM (e.g. see Fig. 8 for small ε): if there is no catastrophic overfitting, there is no benefit in adding GradAlign to FGSM training.
To further ensure that FGSM + GradAlign models do not benefit from gradient masking [27], we additionally compare the robustness of FGSM + GradAlign and FGSM-RS models obtained via AutoAttack [8]. We observe that AutoAttack proportionally reduces the adversarial accuracy of both models: for ε = 8/255, FGSM + GradAlign achieves 44.54 ±  while FGSM-RS achieves 42.80 ± . AutoAttack reduces the adversarial accuracy of many models by 2%-3% for ε = 8/255 compared to the originally reported results based on the standard PGD attack (see Table 2 in [8]). The same tendency is also observed for higher ε: e.g., for ε = 16/255, FGSM + GradAlign achieves 20.56 ±  under AutoAttack.

Table 4: Robustness and accuracy of different robust training methods on CIFAR-10. We report results without early stopping for ResNet-18 unless specified otherwise in parentheses. The results of all the methods reported in Fig. 8 are shown here with the standard deviation, averaged over 5 random seeds used for training.

Model | Standard accuracy | Adversarial accuracy | Attack
ε = 8/255:
Standard | 94.03% | 0.00% | PGD-50-10
CURE [25] | 81.20% | 36.30% | PGD-20-1
YOPO-3-5 [48] | 82.14% | 38.18% | PGD-20-1
YOPO-5-3 [48] | 83.99% | 44.72% | PGD-20-1
LLR-2 (Wide-ResNet-28-8) [28] | 90.46% | 44.50% | MultiTargeted [28]
FGSM | 85.16 ± | ± | PGD-50-10
FGSM-RS | ± | ± | PGD-50-10
FGSM + GradAlign | ± | ± | PGD-50-10
AT for Free (m = 8) | 77.92 ± | ± | PGD-50-10
PGD-2 (α = 4/255) | 82.15 ± | ± | PGD-50-10
PGD-2 (α = 4/255) + GradAlign | ± | ± | PGD-50-10
PGD-10 (α = 2ε/ ) | 81.88 ± | ± | PGD-50-10
ε = 16/255:
FGSM | 73.76 ± | ± | PGD-50-10
FGSM-RS | ± | ± | PGD-50-10
FGSM + GradAlign | ± | ± | PGD-50-10
AT for Free (m = 8) | 48.10 ± | ± | PGD-50-10
PGD-2 (α = ε/ ) | 68.65 ± | ± | PGD-50-10
PGD-2 (α = ε/ ) + GradAlign | ± | ± | PGD-50-10
PGD-10 (α = 2ε/ ) | 60.28 ± | ± | PGD-50-10

SVHN results.
We report robustness and accuracy in Table 5 for SVHN without using early stopping. We can see that for both ε = 8/255 and ε = 12/255, GradAlign successfully prevents catastrophic overfitting, in contrast to FGSM and FGSM-RS, although there is still a gap to PGD-2 training for ε = 12/255. AT for free performs slightly better than FGSM + GradAlign for ε = 8/255, but it already starts to show a high variance in robustness and accuracy depending on the random seed. For ε = 12/255, all the 5 models of AT for free converge to a constant classifier. Combining PGD-2 with GradAlign does not lead to improved results for ε = 8/255, since there is no catastrophic overfitting for PGD-2. However, for ε = 12/255, we can clearly see that PGD-2 + GradAlign leads to better results than PGD-2, achieving .  ± .  instead of .  ± .  adversarial accuracy.

ImageNet results. We also perform similar experiments on ImageNet in Table 6. We observe that even for standard FGSM training, catastrophic overfitting does not occur for the ε ∈ {2/255, 4/255} considered in [34, 46], and thus there is no additional benefit from using GradAlign, since its main role is to prevent catastrophic overfitting. We report the results of FGSM + GradAlign for completeness, to show that GradAlign can be applied at ImageNet scale, although it leads to approximately  × slowdown on ImageNet. We find that the exact slowdown of GradAlign depends on the GPU utilization and the batch size, ranging from  × to  × on different datasets. For ε = / , we observe that catastrophic overfitting occurs for FGSM-RS very early in training (around epoch 3), but not for FGSM or FGSM + GradAlign. This contradicts our observations on CIFAR-10 and SVHN, where FGSM-RS usually helps to postpone catastrophic overfitting to higher ε. However, it is computationally demanding to replicate the results on ImageNet multiple times over different random seeds, as we did for CIFAR-10 and SVHN. Thus, we leave a more detailed investigation of catastrophic overfitting on ImageNet for future work.

Table 5: Robustness and accuracy of different robust training methods on SVHN. We report results without early stopping for ResNet-18. All the results are reported with the standard deviation and averaged over 5 random seeds used for training.

Model | Standard | PGD-50-10
ε = 8/255:
Standard | 96.00% | 1.00%
FGSM | 91.40 ± | ±
FGSM-RS | ± | ±
FGSM + GradAlign | ± | ±
AT for Free (m = 8) | 75.34 ± | ±
PGD-2 (α = ε/ ) | 92.68 ± | ±
PGD-2 + GradAlign (α = ε/ ) | 92.46 ± | ±
PGD-10 (α = 2ε/ ) | 91.92 ± | ±
ε = 12/255:
FGSM | 88.74 ± | ±
FGSM-RS | ± | ±
FGSM + GradAlign | ± | ±
AT for Free (m = 8) | 18.50 ± | ±
PGD-2 (α = ε/ ) | 92.74 ± | ±
PGD-2 + GradAlign (α = ε/ ) | 87.14 ± | ±
PGD-10 (α = 2ε/ ) | 84.52 ± | ±

Table 6: Robustness and accuracy of different robust training methods on ImageNet. We report results without early stopping for ResNet-50.

Model | ℓ∞-radius | Standard accuracy | PGD-50-10 accuracy
FGSM | 2/255 | 61.7% | 42.1%
FGSM-RS | 2/255 | 59.3% | 41.1%
FGSM + GradAlign | 2/255 | ± | ±

D.4 Ablation studies
In this section, we provide more details about the sensitivity of GradAlign to its hyperparameter λ and to the total number of training epochs, and we also discuss training with GradAlign for very high ε values.

Ablation study for GradAlign λ. We provide an ablation study for the regularization parameter λ of GradAlign in Fig. 15, where we plot the adversarial accuracy of ResNet-18 trained using FGSM + GradAlign with ε = /  on CIFAR-10. First, we observe that for small λ catastrophic overfitting occurs, so that the average PGD-50-10 accuracy is either  or greater than  but with a high standard deviation, since only some runs are successful while the other runs fail because of catastrophic overfitting. We observe that the best performance is achieved for λ = 2, where catastrophic overfitting does not occur and the final adversarial accuracy is well concentrated. For larger λ values, we observe a slow decrease in the adversarial accuracy, since the model becomes over-regularized. We note that the λ values with performance close to the best (≥  adversarial accuracy) span the range [0. , ], so we conclude that GradAlign is robust to the exact choice of λ. This is also confirmed by our hyperparameter selection method for Fig. 8, where we performed a linear interpolation on the logarithmic scale between successful λ values for ε = 8/255 and ε = 16/255. Even such a coarse hyperparameter selection method could ensure that none of the FGSM + GradAlign runs reported in Fig. 15 suffered from catastrophic overfitting.
Ablation study for the total number of training epochs.
Recently, Rice et al. [30] brought up the importance of early stopping in adversarial training. They identify a phenomenon called robust overfitting, where training longer hurts the adversarial accuracy on the test set. Thus, we check here whether training with GradAlign has some influence on robust overfitting. We note that the authors of [30] suggest that robust and catastrophic overfitting are distinct phenomena, since robust overfitting implies a gap between training and test set robustness, while catastrophic overfitting implies low robustness on both training and test sets. To explore this for FGSM + GradAlign, in Fig. 16 we show the final clean and adversarial accuracies for five different models trained with { , , , , } epochs. We observe the same trend as [30] report: training longer slightly degrades the adversarial accuracy while the clean accuracy slightly improves. Thus, this experiment also suggests that robust overfitting is not directly connected to catastrophic overfitting and has to be addressed separately. Finally, we note based on Fig. 16 that when we use FGSM in combination with GradAlign, even training for up to 200 epochs does not lead to catastrophic overfitting.

Figure 15: Ablation study for the regularization parameter λ for FGSM + GradAlign under ε = /  without early stopping. We train ResNet-18 models on CIFAR-10. The results are averaged over 3 random seeds used for training and reported with the standard deviation. (x-axis: regularization parameter λ of GradAlign; y-axis: PGD-50-10 accuracy.)

Figure 16: Ablation study for the total number of training epochs for FGSM + GradAlign under ε = /  without early stopping. We train ResNet-18 models on CIFAR-10. The results are averaged over 3 random seeds used for training and reported with the standard deviation. (x-axis: total number of training epochs, from 25 to 200; y-axis: standard and PGD-50-10 accuracy.)
Ablation study for very high ε. Here we make an additional test of whether
GradAlign prevents catastrophic overfitting for very high ε values. In Fig. 8 and Fig. 14 we showed results for ε ≤  on CIFAR-10 and ε ≤  on SVHN. For SVHN, FGSM + GradAlign achieves 24.04 ±  adversarial accuracy at the largest radius, and increasing ε on SVHN even further just leads to learning a constant classifier. However, on CIFAR-10 for ε = 16, FGSM + GradAlign achieves 28.88 ±  adversarial accuracy, so one may wonder whether catastrophic overfitting can still occur for GradAlign on CIFAR-10, just for higher ε values than those we considered in the main part of the paper. To show that this is not the case, in Table 7 we show the results of FGSM + GradAlign trained with ε ∈ { / , / } (we use λ = 2.  and the maximum learning rate . ). We observe no signs of catastrophic overfitting even for ε as high as / . Note that in this case the standard accuracy is very low (23.07 ± ), which is expected for such a large ε.

Table 7: Robustness and accuracy of FGSM + GradAlign for very high ε on CIFAR-10, without early stopping, for ResNet-18. We report results with the standard deviation, averaged over 3 random seeds used for training. We observe no catastrophic overfitting even for very high ε.

ℓ∞-radius | Standard accuracy | PGD-50-10 accuracy
/  | ± | ±
/  | ± | ±

D.5 Comparison of GradAlign to gradient-based penalties
In this section, we compare GradAlign to two alternatives: ℓ2 gradient norm penalization and CURE [25]. The motivation to study them comes from the fact that after catastrophic overfitting, the input gradients change dramatically inside the ℓ∞-balls around the input points, and thus other gradient-based regularizers may also be able to improve the stability of the input gradients and thereby prevent catastrophic overfitting. In Table 8, we present results of FGSM training with other gradient-based penalties studied in the literature:
• ℓ2 gradient norm regularization [31, 36]: λ‖∇_x ℓ(x, y; θ)‖2,
• curvature regularization (CURE) [25]: λ‖∇_x ℓ(x + δ_FGSM, y; θ) − ∇_x ℓ(x, y; θ)‖2.
First of all, we note that the originally proposed approaches [31, 36, 25] do not involve adversarial training and rely only on these gradient penalties to achieve some degree of robustness. In contrast, we combine the gradient penalties with FGSM training to see whether they can prevent catastrophic overfitting similarly to GradAlign. For the gradient norm penalty, we use the regularization parameters λ ∈ { , , , } for ε ∈ { / , / } respectively. For CURE, we use λ ∈ { , , } for ε ∈ { / , / } respectively. In both cases, we found the optimal hyperparameters using a grid search over λ.

Table 8: Additional comparison of FGSM AT with GradAlign to FGSM AT with other gradient penalties on CIFAR-10. We report results without early stopping for ResNet-18. All the results are reported with the standard deviation and averaged over 5 random seeds used for training.

Model | Standard | PGD-50-10
ε = 8/255:
FGSM + ‖∇_x ℓ‖2 | ± | ±
FGSM + CURE | ± | ±
FGSM + GradAlign | ± | ±
ε = 16/255:
FGSM + ‖∇_x ℓ‖2 | ± | ±
FGSM + CURE | ± | ±
FGSM + GradAlign | ± | ±

We can see that for ε = 8/255, all three approaches successfully prevent catastrophic overfitting, although the final robustness varies slightly, between .  for FGSM with the ℓ2 gradient penalty and .  for FGSM with GradAlign.
For ε = 16/255, both FGSM + CURE and FGSM + GradAlign prevent catastrophic overfitting, leading to well-concentrated results with a small standard deviation (0.29% and 0.70% respectively). However, the average adversarial accuracy is better for FGSM + GradAlign: .  vs . . At the same time, FGSM with the ℓ2 gradient penalty leads to unstable final performance: its adversarial accuracy has a high standard deviation ( .  ± . ). We think that the main difference in performance between GradAlign and the gradient penalties we considered comes from the fact that GradAlign is invariant to the gradient norm: it takes into account only the directions of the two gradients inside the ℓ∞-ball around the given input.
Inspired by CURE, we also tried two additional experiments:
1. Using the FGSM point δ_FGSM for the gradient taken at the second input point for
GradAlign, but we observed that it does not make a substantial difference, i.e. this version of GradAlign also prevents catastrophic overfitting and leads to similar results. However, if we use CURE without FGSM in the cross-entropy loss, then we do observe a benefit of using δ_FGSM in the regularizer, which is consistent with the observations made in Moosavi-Dezfooli et al. [25].
2. Using GradAlign without FGSM in the cross-entropy loss. In this case, we observed that the model did not significantly improve its robustness, suggesting that GradAlign is not a sufficient regularizer on its own to promote robustness and has to be used together with some adversarial training method.
We think that an interesting future direction is to explore how one can speed up GradAlign, or to come up with other regularization methods that are also able to prevent catastrophic overfitting while avoiding reliance on the input gradients, which cause the slowdown in training. We think that some potential strategies to speed up