On Connections between Regularizations for Improving DNN Robustness
Yiwen Guo, Long Chen, Yurong Chen, and Changshui Zhang,
Fellow, IEEE
Abstract—This paper analyzes regularization terms proposed recently for improving the adversarial robustness of deep neural networks (DNNs), from a theoretical point of view. Specifically, we study possible connections between several effective methods, including input-gradient regularization, Jacobian regularization, curvature regularization, and a cross-Lipschitz functional. We investigate them on DNNs with general rectified linear activations, which constitute one of the most prevalent families of models for image classification and a host of other machine learning applications. We shed light on essential ingredients of these regularizations and re-interpret their functionality. Through the lens of our study, more principled and efficient regularizations can possibly be invented in the near future.
Index Terms—Deep neural networks, adversarial robustness, regularizations, network property
1 INTRODUCTION

It has been discovered that deep neural networks (DNNs) are vulnerable to adversarial examples [1], [2], [3], and the phenomenon can prohibit them from being deployed in security-sensitive applications. Amongst the most effective methods for mitigating the issue, adversarial training [1], [2], [3] is capable of resisting a series of malicious examples [3], [4] and yields adversarially robust DNN models in the sense of an $\ell_p$ norm. By injecting advanced adversarial examples (e.g., using BIM [5] or PGD [3]) into training as some sort of augmentation, the obtained models learn to defend against these examples. In addition, the obtained models may also resist some other types of adversarial examples (generated using, for example, the fast gradient sign method [2]). However, advanced adversarial examples are typically generated in an iterative manner by back-propagating through deep models multiple times, and thus the mechanism may demand a massive amount of computation [6].

Another thriving category of methods for hardening DNNs is to perform regularizations, aiming at trading off effectiveness and efficiency properly. Although most traditional regularization-based strategies (e.g., weight decay [7] and dropout [8]) do not operate properly in this respect, a variety of recent work [6], [9], [10], [11], [12] has shown that more dedicated and principled regularizations help to gain comparable or only slightly worse performance in improving DNN robustness. Instead of raising a perpetual "arms race", these regularization-based strategies are in general attack-agnostic and of benefit to the generalization ability [9] and interpretability of learning models [11]. Moreover, the computational and memory complexity of these methods is acceptable in very large models. It has also been shown that the methods can be combined with adversarial training to achieve even stronger DNN robustness.

• Y. Guo is with Bytedance AI Lab. E-mail: [email protected].
• L. Chen is with the Academy for Advanced Interdisciplinary Studies, Center for Data Science, Peking University, Beijing 100871, China. E-mail: [email protected].
• Y. Chen is with Intel Labs China. E-mail: [email protected].
• C. Zhang is with the Institute for Artificial Intelligence, Tsinghua University (THUAI), the State Key Lab of Intelligent Technologies and Systems, Beijing National Research Center for Information Science and Technology (BNRist), the Department of Automation, Tsinghua University, Beijing 100084, China. E-mail: [email protected].
Y. Guo and L. Chen contribute equally to this work.

While many regularizers have been developed for DNN robustness, there is as of yet little comparative analysis among these choices, especially from a theoretical point of view. In this paper, we attempt to shed light on the intrinsic functionality of and theoretical connections between several effective regularizers, even if their formulations may stem from different rationales. Concretely, it has been presented over the past few years that regularizing the Euclidean norm of an input-gradient [11], [13], the Frobenius norm of a Jacobian matrix [12], [14], the spectral norm of a Hessian matrix [6], and a cross-Lipschitz functional [10] all significantly contribute to the adversarial robustness of DNNs. We analyze all these choices on DNNs with general rectified linear activations, which are ubiquitous in image classification and a host of other machine learning tasks.

Some of our key contributions and observations are:
• We present, for the first time, an analytic expression for the $\ell_2$ norm of an approximately-optimal adversarial perturbation concerned in very recent papers [6], [15], to demonstrate that local cross-Lipschitz constants [10] and the prediction probability are its essential ingredients in binary classification cases. In addition to the $\ell_2$ norm-based results, we also show similar results for the robustness to $\ell_\infty$ norm-based attacks.
• We unveil that most discussed regularizations advocate small local cross-Lipschitz constants in binary classification, except for the Jacobian regularization that suggests small local Lipschitz constants, yet regularizing the two network properties can be equivalent.
• We further demonstrate that critical discrepancies still exist between specific methods, mostly in regularizing the prediction probability/confidence.
• We extend some analyses to multi-class classification and verify our findings with experiments.

2 REGULARIZATIONS IMPROVING ROBUSTNESS
Given an input instance $x \in \mathbb{R}^n$, a DNN-based classifier offers its prediction along with a softmax normalized probability $p(x)_k = \exp(z_k)/\sum_j \exp(z_j)$ for each class $k$ on top of a vector representation $z = g(x)$. Suppose that a set of labeled instances $\{(x_i, y_i)\}_i$ is provided; then a classifier is typically learned with the assistance of an objective function $L(\cdot,\cdot)$ that evaluates the training prediction loss, i.e., the average discrepancy between a set of predictions $\{p(x_i)\}_i$ and ground-truth labels $\{y_i\}_i$.

Existing adversarial attacks can be roughly divided into two main categories, i.e., white-box attacks [1], [2] and black-box attacks [16], [17], according to how much information about the victim model is accessible to an adversary [18]. Our study in this paper mainly focuses on white-box non-targeted attacks and defenses against them, in order to comply with prior theoretical work. Under such threats, substantial endeavors have been exerted to demonstrate the adversarial vulnerability of DNNs [1], [2], [3], [18], [19], [20], [21], [22]. Most of them are proposed within a framework that favors perturbations with the least $\ell_p$ norms that would still cause the DNNs to make incorrect predictions. That being said, an adversary opts to solve

$$\min_r \|r\|_p \quad \text{s.t.} \quad \operatorname*{argmax}_k\, g(x+r)_k \neq \operatorname*{argmax}_k\, g(x)_k. \quad (1)$$

Utilizing the objective function $L(\cdot,\cdot)$, the task of mounting adversarial attacks can also be formulated from a dual perspective which attempts to maximize the loss with a presumed perturbation magnitude (in the context of $\ell_p$ norms). That being said, given $\epsilon > 0$, one may resort to

$$\max_{\|r\|_p \le \epsilon} L(x+r, y). \quad (2)$$

Omitting box constraints on the image domain, many off-the-shelf attacks [2], [3], [18], [19], [20] can be considered as efficient approximations to either (1) or (2). Under certain circumstances, their solutions can be equivalent to the optimal solutions to (1) or (2).
For instance, the fast gradient sign method (FGSM) [2] achieves the optimum of (2) with some binary linear classifiers and $p = \infty$ [23]. Also, for any linear model together with $p = 2$, the DeepFool perturbation [20] is theoretically optimal to (1). Training with an augmented set involving adversarial examples, i.e., adversarial training, has been proven to be very effective in improving DNN robustness [3], regardless of the computational burden. A recent study [15] demonstrates the relationship between a classical regularization [24] and adversarial training [2]. It is conceivable that a principled regularization term involved in training suffices to yield DNN models with comparable robustness, whereby a whole series of methods has been developed. Unlike many traditional methods which are normally data-independent (e.g., weight decay and dropout), recent progress conforms closely with theoretical guarantees and focuses mostly on regularizing the loss landscape [6], [10], [11], [12], [13], [14]. Before systematically studying their functionality and relationships in the following sections, we first introduce some important notations.

Given the objective function $L(\cdot,\cdot)$ for classification, we will refer to 1) $\nabla := \nabla_x L(x, y)$, as its gradient with respect to (w.r.t.) the input vector $x$, 2) $H$, as the Hessian matrix of $L$, and 3) $J$, as the Jacobian matrix of $g(x)$ w.r.t. $x$. It has been presented that training regularized using $\|J\|_F$, the Frobenius norm of $J$ (dubbed the Jacobian regularization [12], [14]), $\|\nabla\|$, the Euclidean norm of $\nabla$ (i.e., the input-gradient regularization [11], [13]), $\|H\|$, the spectral norm of $H$ (i.e., the curvature regularization [6]), and a cross-Lipschitz functional [10], as will be elaborated later, all significantly improve the adversarial robustness of the obtained models.
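To make formulation (2) concrete, the following NumPy sketch mounts one-step FGSM on a linear binary classifier, the setting in which it attains the optimum of (2) for $p = \infty$ as noted above. This is our own toy illustration (random weights and data, hypothetical helper names), not code from the paper.

```python
import numpy as np

def cross_entropy_binary(w, x, y):
    """Cross-entropy loss of the linear classifier sign(w^T x), labels y in {+1, -1}."""
    margin = y * w.dot(x)              # p(x)_y = sigmoid(margin)
    return np.log1p(np.exp(-margin))   # L = -log p(x)_y

def fgsm(w, x, y, eps):
    """One-step FGSM, r = eps * sign(grad_x L): the maximizer of (2) for p = inf."""
    p_y = 1.0 / (1.0 + np.exp(-y * w.dot(x)))
    grad = -(1.0 - p_y) * y * w        # gradient of the loss w.r.t. the input x
    return x + eps * np.sign(grad)

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=5), rng.normal(size=5), 1.0
x_adv = fgsm(w, x, y, eps=0.1)
# For this linear model, no r with ||r||_inf <= eps yields a larger loss than the sign step.
assert cross_entropy_binary(w, x_adv, y) > cross_entropy_binary(w, x, y)
```

For the linear model the loss is monotone in the margin $y\,w^T x$, and the sign step decreases the margin by exactly $\epsilon\|w\|_1$, which is the worst case over the $\ell_\infty$ ball.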
We focus on DNNs with general rectified linear units (general ReLUs) [25], [26], [27] as nonlinear activations, and analyze binary classification and multi-class classification tasks separately in the following sections.

3 BINARY CLASSIFICATION
With the background information introduced in the previous section, here we first discuss different regularizations in binary classification DNNs, and we will generalize some of our results to multi-class classification in Section 4.

For simplicity of notations, let us first consider a multi-layer perceptron (MLP) parameterized by a series of weight matrices $W_1 \in \mathbb{R}^{n_0 \times n_1}, \ldots, W_d \in \mathbb{R}^{n_{d-1} \times n_d}$, where $n_0 = n$ and $n_d = 2$ in our theories. (We stress that, although a simple MLP is formulated here, our following discussions directly generalize to DNNs with convolutions, poolings, skip-connections [28], self-attentions [29], etc.) For a $d$-layer MLP, we have

$$g(x) = W_d^T \sigma(W_{d-1}^T \sigma(\ldots \sigma(W_1^T x))), \quad (3)$$

in which the general ReLU activation $\sigma(\cdot)$ of our particular interest is piecewise linear and hence $g(\cdot)$ is also piecewise linear. Following prior work [23], we can define $a_0 := x$ and $a_j := \sigma(W_j^T a_{j-1}) = D_j^T(x) W_j^T a_{j-1}$, for $1 \le j \le d-1$, in which

$$D_j(x) := \operatorname{diag}\big(1_{W_j[:,1]^T a_{j-1} > 0}, \ldots, 1_{W_j[:,n_j]^T a_{j-1} > 0}\big) \quad (4)$$

is an $n_j \times n_j$ diagonal matrix whose main diagonal entries corresponding to nonzero activations within the $j$-th parameterized layer take a value of $+1$, and the others take a value of $0$. Denoting by $w_\pm$ the two columns of the matrix $W_d$ (i.e., $W_d = [w_+, w_-]$), we have the two entries of $p(x)$ as $p(x)_+ = \exp(w_+^T a_{d-1}) / \sum \exp(w_\pm^T a_{d-1})$ and $p(x)_- = 1 - p(x)_+$. These two scalars estimate the probability of $x$ being sampled from the positive and negative classes, respectively. Since $g(\cdot)$ is piecewise linear as analyzed, there exists a polytope $Q(x)$ to which the input instance $x$ belongs and on which $g(\cdot)$ is linear, i.e., $D_j(x') = D_j(x)$ and

$$g(x')|_{x' \in Q(x)} = V^T x', \quad (5)$$

in which $V = [v_+, v_-]$ is a matrix with its columns $v_\pm := W_1 D_1(x) \ldots W_{d-1} D_{d-1}(x)\, w_\pm$. Our analyses stem from Problem (1).
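The masks $D_j(x)$ and the local matrix $V$ of Eqs. (4) and (5) can be computed explicitly. Below is a small NumPy sketch (our own toy construction, with the plain ReLU as the activation) that records the masks during a forward pass and checks that $g(x) = V^T x$ on the polytope $Q(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)
# A small ReLU MLP g(x) = W3^T relu(W2^T relu(W1^T x)) with two output logits,
# matching Eq. (3); the layer sizes are arbitrary toy choices.
W1 = rng.normal(size=(8, 6))
W2 = rng.normal(size=(6, 4))
W3 = rng.normal(size=(4, 2))   # columns are w_plus and w_minus

def forward_with_masks(x):
    """Forward pass recording the 0/1 diagonal masks D_j(x) of Eq. (4)."""
    a1 = W1.T @ x
    d1 = (a1 > 0).astype(float)
    a2 = W2.T @ (a1 * d1)
    d2 = (a2 > 0).astype(float)
    logits = W3.T @ (a2 * d2)
    return logits, d1, d2

x = rng.normal(size=8)
logits, d1, d2 = forward_with_masks(x)
# On the polytope Q(x), g is linear: g(x') = V^T x' with
# V = W1 D1(x) W2 D2(x) W3, as in Eq. (5).
V = W1 @ np.diag(d1) @ W2 @ np.diag(d2) @ W3
assert np.allclose(V.T @ x, logits)
```

Because the masks are fixed on $Q(x)$, the same $V$ also reproduces $g(x')$ for any $x'$ that keeps every pre-activation sign unchanged.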
For binary classification with $y \in \{+1, -1\}$, we can rewrite the optimization problem as $\min_r \|r\|_p$ s.t. $L(x+r, y) \ge \beta$, just as suggested [6], in which $\beta$ is a threshold for correct and incorrect classifications and its value solely depends on the choice of the loss function $L(\cdot,\cdot)$ (e.g., if the cross-entropy loss is chosen, then $\beta = \log(2)$). It follows from DeepFool and others [6] that we may well-approximate the constraint with a Taylor series and get bounds for the ($\ell_2$) magnitude of $r^* := \operatorname*{argmin} \|r\|_2$ s.t. $L(x, y) + \nabla^T r + r^T H r / 2 \ge \beta$, as will be presented in Lemma 3.1 below.

Fig. 1: Illustration of how $\|r^*\|_2$ varies with the prediction probability $p(x)_y$ in binary scenarios.

Lemma 3.1. [6]
Let $x$ be a correctly classified instance such that $\xi := \beta - L(x, y) \ge 0$, and let $u \in \mathbb{R}^n$ be the normalized eigenvector corresponding to the largest eigenvalue of $H$; then we have

$$\frac{\|\nabla\|}{\|H\|}\Big(\sqrt{1 + \frac{2\|H\|\xi}{\|\nabla\|^2}} - 1\Big) \;\le\; \|r^*\|_2 \;\le\; \frac{|\nabla^T u|}{\|H\|}\Big(\sqrt{1 + \frac{2\|H\|\xi}{|\nabla^T u|^2}} - 1\Big). \quad (6)$$

The above lemma establishes connections between the robustness of a DNN and the spectral norm of its Hessian matrix $H$. Though enlightening, the variables $u$, $\|\nabla\|$, and $\|H\|$ in Eq. (6) are heavily entangled, so it is difficult to reveal the functionality of the concerned regularizations.

Fortunately, we show that the derived bounds are tight, such that they collapse to the same expression in terms of $p(x)_y$ and a local cross-Lipschitz constant [10] in binary classification with some common choices of the loss function (e.g., the cross-entropy loss and logistic loss). To be concrete, suppose that the cross-entropy loss is adopted; then with the $n \times 2$ matrix $V = [v_+, v_-]$ introduced in Eq. (5), we have the following lemma and proposition.

Lemma 3.2. (Simplified expressions for $J$, $\nabla$, and $H$). Given an instance paired with its label $(x, y)$, we have for the Jacobian $J$, input-gradient $\nabla$, and Hessian $H$:

$$J = V, \qquad \nabla = y\,(p(x)_y - 1)(v_+ - v_-),$$
$$H = p(x)_+\, p(x)_-\,(v_+ - v_-)(v_+ - v_-)^T = p(x)_y (1 - p(x)_y)(v_+ - v_-)(v_+ - v_-)^T. \quad (7)$$

Proposition 3.1. (An analytic expression for $\|r^*\|_2$). For the binary classifier with a locally linear $g(\cdot)$ and a correctly classified instance $x$, we have

$$\|r^*\|_2 = \frac{1}{p(x)_y \|v_+ - v_-\|}\Big(\sqrt{1 + \frac{2\,p(x)_y\,\xi}{1 - p(x)_y}} - 1\Big). \quad (8)$$

Proposition 3.1 is obtained on the basis of Lemmas 3.1 and 3.2.
See our proofs in Appendices A and B, respectively. Similar results can be achieved with the logistic loss (as also demonstrated in the appendix). The decomposition of $\|r^*\|_2$ (i.e., the $\ell_2$ magnitude of $r^*$) in the derived Eq. (8) appears to be more obvious than in Eq. (6), and it can be concluded that $\xi$, $p(x)_y$, and $\|v_+ - v_-\|$ jointly affect the $\ell_2$ magnitude of $r^*$. Seeing that the value of $\xi$ is determinate w.r.t. $p(x)_y$, the prediction probability $p(x)_y$ and $\|v_+ - v_-\|$ become the only dominating ingredients. For better clarity, let us define $\nu := \|v_+ - v_-\|$, which is in fact a local cross-Lipschitz constant of $g(\cdot)$ [10]. Even though $\nu$ might as well be influential to the prediction probability $p(x)_y$, we discuss them separately here, considering that the latter can still be optimized with any presumed value of the former.

It is easy to verify that $\|r^*\|_2 = 0$ holds for all $\nu > 0$ in the special case of $p(x)_y \to 0.5^+$. Yet, for $p(x)_y > 0.5$, the general impact of the prediction probability $p(x)_y$ in Eq. (8) is still obscure. To gain direct insights, we depict how $\|r^*\|_2$ varies with $p(x)_y \in (0.5, 1.0)$ on the right panel of Figure 1, given specific $\nu$ values. We observe that, in general, a larger $p(x)_y$ implies a larger $\|r^*\|_2$ and thus lower vulnerability of a classification model, provided that the $\ell_2$ magnitude of $r^*$ is a reasonable measure of the robustness and $p(x)_y > 1 - p(x)_y$ (or equivalently, $p(x)_y > 0.5$). See also the left panel of the figure for an illustration with $p(x)_y$ approaching $0.5$ from above and at two larger fixed values.

Our theoretical result in Proposition 3.1 gives rise to a formal guarantee of the $\ell_2$ robustness for piecewise linear DNNs, without concerning much about the accuracy of the Taylor approximation.
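The collapse of the two bounds in Lemma 3.1 into Eq. (8) is straightforward to check numerically. The sketch below (ours; $p(x)_y = 0.9$ and $\nu = 3$ are arbitrary toy values) substitutes $\|\nabla\| = (1 - p(x)_y)\nu$ and $\|H\| = p(x)_y(1 - p(x)_y)\nu^2$ from Lemma 3.2 into the lower bound of Eq. (6) and compares it with the closed form of Eq. (8).

```python
import numpy as np

def r_star_norm(p_y, nu, xi):
    """Closed-form l2 magnitude of r* from Eq. (8), cross-entropy loss."""
    return (np.sqrt(1.0 + 2.0 * p_y * xi / (1.0 - p_y)) - 1.0) / (p_y * nu)

def lemma_bound(grad_norm, hess_norm, xi):
    """Lower bound of Lemma 3.1; it matches the upper bound when |grad^T u| = ||grad||."""
    return grad_norm / hess_norm * (np.sqrt(1.0 + 2.0 * hess_norm * xi / grad_norm**2) - 1.0)

p_y, nu = 0.9, 3.0
xi = np.log(2.0) + np.log(p_y)           # xi = beta - L(x, y) with beta = log(2)
grad_norm = (1.0 - p_y) * nu             # ||grad|| from Lemma 3.2
hess_norm = p_y * (1.0 - p_y) * nu**2    # ||H|| from Lemma 3.2
assert np.isclose(r_star_norm(p_y, nu, xi), lemma_bound(grad_norm, hess_norm, xi))
```

Consistent with the right panel of Figure 1, the closed form also grows with $p(x)_y$ for a fixed $\nu$.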
Regarding the adversarial robustness to some other $\ell_p$ norm-based attacks, we have similar results in this paper. One might be of special interest in the $p = \infty$ case, as it has been widely considered in practical attacks. Propositions 3.3 and 3.2 provide results from different viewpoints in correspondence to (2) and (1), i.e., by bounding the worst-case loss $\eta^* := \max_{\|r\|_\infty \le \epsilon} L(x, y) + \nabla^T r + r^T H r / 2$ with any fixed $\epsilon > 0$ and by providing an analytic expression for the $\ell_\infty$ norm of $\tilde{r}^* := \operatorname*{argmin} \|r\|_\infty$ s.t. $L(x, y) + \nabla^T r + r^T H r / 2 \ge \beta$.

Proposition 3.2. (An analytic expression for $\|\tilde{r}^*\|_\infty$). For the binary classifier with a locally linear $g(\cdot)$ and a correctly classified instance $x$, we have

$$\|\tilde{r}^*\|_\infty = \frac{1}{p(x)_y \|v_+ - v_-\|_1}\Big(\sqrt{1 + \frac{2\,p(x)_y\,\xi}{1 - p(x)_y}} - 1\Big). \quad (9)$$

Proposition 3.3. (An upper bound of $\eta^*$). For the binary classifier with a locally linear $g(\cdot)$ and a correctly classified input instance $x$, for all $r \in \mathbb{R}^n$ satisfying $\|r\|_\infty \le \epsilon$, it holds that

$$\eta^* = L(x, y) + \epsilon\,(1 - p(x)_y)\,\|v_+ - v_-\|_1 + \frac{1}{2}\epsilon^2\, p(x)_y (1 - p(x)_y)\,\|v_+ - v_-\|_1^2. \quad (10)$$

Besides Proposition 3.1, some intriguing corollaries can also be derived from Lemma 3.2. First, the direction of the input-gradient vector $\nabla$ is the same as that of the first eigenvector (i.e., the one corresponding to the largest eigenvalue) of the matrix $H$. Second, we can derive $\|\nabla\| = (1 - p(x)_y)\,\nu$ and $\|H\|_F = \|H\| = p(x)_y (1 - p(x)_y)\,\nu^2$, which means that we
further have simple analytic expressions for the concerned regularizers as:

$$\text{Jacobian regularizer} := \lambda \mu^2,$$
$$\text{Input-gradient regularizer} := \lambda (1 - p(x)_y)^2 \nu^2,$$
$$\text{Curvature regularizer} := \lambda\, p(x)_y^2 (1 - p(x)_y)^2 \nu^4, \quad (11)$$

in which $\|H\|$ calculates the spectral norm (i.e., the matrix $\ell_2$ norm) of $H$, $\lambda > 0$ is a hyper-parameter, and $\mu$ denotes $\|V\|_F$, which is apparently a local Lipschitz constant of $g(\cdot)$ [10]. Third, it holds that $(1 - p(x)_y)^2 \nu^2 \le p(x)_y (1 - p(x)_y)\,\nu^2 \le \mu^2/2$ (i.e., $\|\nabla\|^2 \le \|H\| \le \|V\|_F^2 / 2$), and thus we get a chained inequality of the regularizers. Without loss of generality, we write the regularizers in squared forms in Eq. (11) for direct comparison.

1. Within linear networks where all instances share the same (data-independent) $v_+$ and $v_-$, we can still have different prediction probabilities for different input instances.
2. Note that we probably have $r^* \neq \tilde{r}^*$.

Fig. 2: The adversarial robustness of obtained binary classification models evaluated with FGSM, PGD, DeepFool, and C&W's attacks: (a)-(d) for LeNet-300-100 and (e)-(h) for LeNet-5. Ten runs from different initializations were performed and the average results are illustrated for fair comparisons. The y axes of the four subfigures on the left are normalized to the same numerical scale, and so are the four on the right. It can be seen that penalizing $\nu^2/2$ and $\mu^2$ perform similarly.

One might have noticed that $\nu$ and $p(x)_y$ are also the only ingredients in two of the regularizers in Eq. (11).
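The expressions of Lemma 3.2 and Eq. (11) can be verified numerically on a locally linear binary classifier. The NumPy sketch below (ours, with random toy values for $V$ and $x$) compares the closed-form input-gradient against finite differences, evaluates the squared input-gradient regularizer, and checks the inequality $\nu^2/2 \le \mu^2$ implied by Eq. (12).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
v_plus, v_minus = rng.normal(size=n), rng.normal(size=n)
V = np.stack([v_plus, v_minus], axis=1)     # local linear map: g(x) = V^T x
x, y = rng.normal(size=n), +1

def p_correct(x):
    """Softmax probability of the true class (y = +1 picks the first logit)."""
    z = V.T @ x
    e = np.exp(z - z.max())
    p = e / e.sum()
    return p[0] if y == +1 else p[1]

def loss(x):
    return -np.log(p_correct(x))            # cross-entropy loss

# Finite-difference input-gradient vs. the closed form of Lemma 3.2.
eps = 1e-6
g_fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
p_y = p_correct(x)
g_closed = y * (p_y - 1.0) * (v_plus - v_minus)
assert np.allclose(g_fd, g_closed, atol=1e-5)

# Eq. (11): the squared input-gradient regularizer is (1 - p_y)^2 * nu^2 (lambda = 1).
nu = np.linalg.norm(v_plus - v_minus)
assert np.isclose(np.sum(g_closed**2), (1.0 - p_y)**2 * nu**2)
# And nu^2 / 2 <= mu^2 with mu = ||V||_F, as in Eq. (12).
assert nu**2 / 2.0 <= np.linalg.norm(V)**2 + 1e-12
```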
In the remainder of this subsection, we shall discuss and highlight that: (1) the input-gradient regularization and curvature regularization both enforce suppression of $\nu$, which is in principle consistent with a cross-Lipschitz regularization [10]; (2) though the Jacobian regularization focuses on $\mu$ instead of $\nu$, there probably exists an underlying equivalence between penalizing scaled $\nu$ and $\mu$; (3) critical discrepancies still exist amongst these regularizations, mostly about $p(x)_y$.

Cross-Lipschitz vs. Lipschitz:
With the clear expressions in Eqs. (7) and (11), we know that the input-gradient regularization and curvature regularization are similar to a cross-Lipschitz regularization that penalizes $\lambda \nu^2 / 2$ [10], while the Jacobian regularization penalizes $\lambda \mu^2$ (with a local Lipschitz constant $\mu$) and boils down to weight decay in single-layer perceptrons and linear classifiers. Although it seems as if the Jacobian regularization were different from the others, in light of the Parseval tight frame and Parseval networks [9],
we conjecture nonetheless that there exists an equivalence between penalizing scaled $\nu$ (as with the cross-Lipschitz regularization, input-gradient regularization, and curvature regularization) and $\mu$ (as with the Jacobian regularization). To shed light on this, more discussions are performed as follows.

First and foremost, it is self-evident that the inequality

$$\nu^2/2 \;\le\; \frac{1}{2}\|v_+\|^2 + |v_+^T v_-| + \frac{1}{2}\|v_-\|^2 \;\le\; \mu^2 \quad (12)$$

holds; thus one might argue that adopting the Jacobian regularization also implies a small $\nu$ in the obtained models, as with the cross-Lipschitz regularization. Second, for single-layer perceptrons, we can easily verify that the function $g(\cdot)$ is convex, and thus the Jacobian regularized training loss is strongly convex w.r.t. $V$. Considering that the columns of $V$ can be processed simultaneously by adding/subtracting a vector whilst the classification decision and cross-entropy loss won't change, we have $-v_+ = v_-$ for the optimal $V$, and an equivalence is achieved between penalizing $\lambda \nu^2 / 2$ and $\lambda \mu^2$ through derivation. The result naturally generalizes to DNNs with locally linear $g(\cdot)$ (i.e., DNNs with general ReLU activations) of our interest, if only the final layer is to be optimized. The following proposition makes this formal, and the proof can be found in Appendix D.

3. Interested readers can refer to Section G for rigorous analyses.

Proposition 3.4. (A derived equivalence).
For a single-layer perceptron or a piecewise linear DNN in which only the final layer parameterized by $W_d$ is to be optimized, we have the equivalence: for all $\lambda \ge 0$,

$$\operatorname*{argmin}_{W_d} \mathbb{E}_{(x,y)}\big[L(x, y; V) + \lambda \nu^2 / 2\big] = \operatorname*{argmin}_{W_d} \mathbb{E}_{(x,y)}\big[L(x, y; V) + \lambda \mu^2\big]. \quad (13)$$

In addition to the above results, we further show that the two regularizations can lead to the same gradient flow in certain scenarios. One example in which this can be demonstrated is when the first feature is uncorrelated with the label $y$ and the other $(n-1)$ features are distributed normally with the mean value being proportional to $y$ (i.e., they are weakly correlated with the label) [30]. We let $v_+ \leftarrow [0, a, \ldots, a]$ and $v_- \leftarrow [0, -a, \ldots, -a]$ approach the Bayes error rate. Under such circumstances, the two regularizations initialized from the Bayes classifier share the same gradient flow for their $V$ matrices, provided a $2\times$ smaller penalty on $\nu^2$ than on $\mu^2$, as in Eq. (13).

To test whether the revealed equivalence generalizes to practical scenarios, we conducted an experiment on distinguishing the digit "7" from "1" using MNIST images. Our experimental settings and many more details are carefully introduced in Appendix F. As suggested [6], [12], we first trained baseline models from scratch without any explicit regularization, then fine-tuned the models using different regularization strategies and evaluated the obtained adversarial robustness to FGSM [2], PGD [3], DeepFool [20], and the C&W attack [18]. We trained MLPs and convolutional networks with ReLU nonlinearity following the LeNet-300-100 and LeNet-5 architectures in prior work [31]. Figure 2 compares the performance of regularizations incorporating $\lambda \nu^2 / 2$ and $\lambda \mu^2$. With varying $\lambda$, it can be seen that the regularized models show similar robustness in almost all test cases. Similar results on CIFAR-10 with ResNets and VGG-like networks can be found in Appendix F.

Fig. 3: Different regularizers focus on samples with different prediction confidence.

"Confidence" in regularizations: Apart from suppressing the local (cross-)Lipschitz constants, the input-gradient regularizer and curvature regularizer both involve the prediction probability $p(x)_y$ in Eq.
(11), with different objectives though. By incorporating $(1 - p(x)_y)^2$, the input-gradient regularization encourages model predictions with high confidence. If $\nu$ is fixed, then the $p(x)_y$-related term in the input-gradient regularizer acts as an additional prediction loss during training. It has larger penalties and slopes (in absolute value) for the training instances with relatively smaller $p(x)_y$, i.e., lower confidence. Similarly, we know that the curvature regularization involves $p(x)_y^2 (1 - p(x)_y)^2$ and advocates a large $p(x)_y$ as well. However, as depicted by the green curve in Figure 3, the function exhibits a larger absolute value of slope at predictions with higher confidence, which is different from $p(x)_y (1 - p(x)_y)$ but consistent with the preference of $\|r^*\|_2$ as shown in the right panel of Figure 1. As for the cross-Lipschitz regularizer and Jacobian regularizer, no $p(x)_y$-related term is explicitly involved whatsoever.
4. See Eq. (11); the "regularizer" means the regularization term itself in this paper. Note that the cross-entropy term involves the prediction probability $p(x)_y$, of course.

Although it is unclear which of the tactics would be the most suitable one in practice, one might be aware that different choices perform dissimilarly; otherwise, we should have obtained functional equivalence for all these contestants. In order to figure out the best one in practice, we compared the robustness achieved via input-gradient regularization and curvature regularization empirically with our results using the cross-Lipschitz regularization and Jacobian regularization. As shown in Figure 4, the lately developed curvature regularization surpasses all its competitors with reasonably large $\lambda$ values, showing the superiority of its specific tactic of handling confident predictions. Notice that we retain the same numerical ranges of axes in Figure 4 as in Figure 2, but some newly drawn curves (for the curvature regularization) in Figure 4 may be too promising to stay within the plot.

4 MULTI-CLASS CLASSIFICATION
This section focuses on multi-class classification tasks. The notations are mostly the same as those in binary classification. Suppose there are $K$ possible labels for an instance, i.e., $y \in \{0, \ldots, K-1\}$ and $K \ge 3$; then for the discussed general ReLU networks, we have $n_d = K$. Similarly, there exists a polytope $Q(x)$ to which the input instance $x$ belongs and on which the network $g(\cdot)$ is linear, i.e.,

$$g(x')|_{x' \in Q(x)} = V^T x', \quad (14)$$

in which $V = [v_0, \ldots, v_{K-1}]$ is a matrix with its $j$-th column $v_j := W_1 D_1(x) \ldots W_{d-1} D_{d-1}(x)\, w_j$. For the properties of DNNs that are considered in the regularization strategies, we have the following lemma.

Lemma 4.1. (Simplified expressions for $J$, $\nabla$, and $H$ in multi-class classification). Given an input instance paired with the one-hot representation of its label $(x, y)$, we have for $J$, $\nabla$, and $H$:

$$J = V, \qquad \nabla = V(p(x) - y), \qquad H = V\big(\operatorname{diag}(p(x)) - p(x)\,p(x)^T\big)V^T. \quad (15)$$

Fig. 5: The robustness of obtained multi-class classification models evaluated with FGSM, PGD, DeepFool, and the C&W attacks: (a)-(d) for LeNet-300-100 and (e)-(h) for LeNet-5. Ten runs from different initializations were performed and the average results over the multiple runs are reported. The curvature regularization is not compared, as approximations seem inevitable in its multi-class implementation.

and

$$\|r^*\|_2 \ge \frac{1}{p(x)_y \|V\|_F}\Big(\sqrt{1 + \frac{2\,p(x)_y\,\xi}{1 - p(x)_y}} - 1\Big). \quad (17)$$

Considering that the value of $\xi = \log(K) + \log(p(x)_y)$ is determinate w.r.t. the prediction probability $p(x)_y$, we can conclude from Eq. (17) that the essential ingredients of such a lower bound are $p(x)_y$ and $\|V\|_F$ (i.e., the Frobenius norm of the $n \times K$ matrix $V$). Likewise, we can easily verify that $\|V\|_F$ is a local Lipschitz constant of $g(\cdot)$.
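Lemma 4.1 can likewise be checked numerically. The NumPy sketch below (ours; the sizes $n = 5$, $K = 4$ are arbitrary toy values) compares the closed-form gradient and Hessian of the cross-entropy loss against finite differences for a locally linear multi-class classifier.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 4
V = rng.normal(size=(n, K))        # local linear map g(x) = V^T x, as in Eq. (14)
x = rng.normal(size=n)
label = 2
y_onehot = np.eye(K)[label]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x):
    return -np.log(softmax(V.T @ x)[label])   # cross-entropy loss

p = softmax(V.T @ x)
g_closed = V @ (p - y_onehot)                            # grad = V (p(x) - y)
H_closed = V @ (np.diag(p) - np.outer(p, p)) @ V.T       # H = V (diag(p) - p p^T) V^T

# Central finite differences for the gradient and Hessian.
g_fd = np.array([(loss(x + 1e-6 * e) - loss(x - 1e-6 * e)) / 2e-6
                 for e in np.eye(n)])
eps = 1e-4
H_fd = np.array([[(loss(x + eps * (ei + ej)) - loss(x + eps * (ei - ej))
                   - loss(x + eps * (ej - ei)) + loss(x - eps * (ei + ej))) / (4 * eps**2)
                  for ej in np.eye(n)] for ei in np.eye(n)])
assert np.allclose(g_fd, g_closed, atol=1e-5)
assert np.allclose(H_fd, H_closed, atol=1e-4)
```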
Somewhat unsurprisingly, a property considered in the cross-Lipschitz regularization, defined as $\nu := \sum_{i,j} \|v_i - v_j\| / K$ [10], is involved in the other derived lower bound as given in Eq. (16). The results show that the local (cross-)Lipschitz constants and prediction probability are possibly still the essential ingredients of $\|r^*\|_2$. Apart from Proposition 4.1, we further know that the chained inequality $\|\nabla\|^2/2 \le \|H\| \le \|V\|_F^2/2$ holds by derivations from Lemma 4.1. More discussions similar to those made for binary classification in Section 3.2 will be given in Appendix E (right after the proof).

As in binary classification, we aim to study possible connections between regularizations penalizing a squared local Lipschitz constant $\mu := \|V\|_F$ and $\nu$. Experimental results are given to show a vague equivalence. The same MLP and convolutional architectures were adopted. Similar to the binary classification experiments, we trained multiple baseline models for each considered architecture and fine-tuned them using different regularizations. The same training and test policies were also kept. We report the average results of the obtained model robustness to FGSM, PGD, DeepFool, and the C&W attack in Figure 5. It can be seen that the Jacobian regularization and cross-Lipschitz regularization still perform similarly across all tested $\lambda$ values, except for the ones being too large to keep the models numerically stable. NaN was produced in Jacobian regularized LeNet-5 if $\lambda$ was further enlarged.

5 CONCLUSIONS

This paper aims at exploring and analyzing possible connections between recent network-property-based regularizations for improving the adversarial robustness of DNNs. While the empirical effectiveness of appropriate regularizations has been demonstrated in prior arts [6], [10], [11], [12], there still lacks a systematic understanding of their intrinsic functionality and connections.
We made some comparative analyses among these regularizations, and our achievements include:
• We have analyzed regularizations on DNNs with ReLU activations from a theoretical perspective.
• We have presented analytic expressions for the $\ell_2$ and $\ell_\infty$ magnitudes of some approximately-optimal adversarial perturbations, and we have shown that the local cross-Lipschitz constants and prediction probability are their essential ingredients in binary classification.
• We have demonstrated that the regularizations suggest either small Lipschitz constants or small cross-Lipschitz constants, and regularizing them can be equivalent. Yet, critical discrepancies still exist between specific regularizations, mostly in handling the prediction probability.
• We have verified that the curvature regularization [6] concerned in a very recent paper shows the most promising performance, and we have extended some of our analyses to multi-class classification and verified our findings with experiments.

REFERENCES

[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in ICLR, 2014.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[3] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.
[4] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," in ICLR, 2018.
[5] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," in ICLR, 2017.
[6] S.-M. Moosavi-Dezfooli, A. Fawzi, J. Uesato, and P. Frossard, "Robustness via curvature regularization, and vice versa," in CVPR, 2019.
[7] A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization," in NeurIPS, 1992.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov, “Dropout: a simple way to prevent neural net-works from overfitting,” The Journal of Machine Learning Research ,vol. 15, no. 1, pp. 1929–1958, 2014.[9] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier,“Parseval networks: Improving robustness to adversarial exam-ples,” in ICML , 2017.[10] M. Hein and M. Andriushchenko, “Formal guarantees on therobustness of a classifier against adversarial manipulation,” in NeurIPS , 2017.[11] A. S. Ross and F. Doshi-Velez, “Improving the adversarial robust-ness and interpretability of deep neural networks by regularizingtheir input gradients,” in AAAI , 2018.[12] D. Jakubovitz and R. Giryes, “Improving dnn robustness to adver-sarial attacks using jacobian regularization,” in ECCV , 2018.[13] C. Lyu, K. Huang, and H.-N. Liang, “A unified gradient regular-ization family for adversarial examples,” in ICDM , 2015.[14] J. Sokoli´c, R. Giryes, G. Sapiro, and M. R. Rodrigues, “Robustlarge margin deep neural networks,” IEEE Transactions on SignalProcessing , vol. 65, no. 16, pp. 4265–4280, 2017.[15] C.-J. Simon-Gabriel, Y. Ollivier, L. Bottou, B. Sch¨olkopf, andD. Lopez-Paz, “Adversarial vulnerability of neural networks in-creases with input dimension,” in ICML , 2019.[16] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, andA. Swami, “Practical black-box attacks against machine learning,”in Proceedings of the Asia Conference on Computer and Communica-tions Security , 2017.[17] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo:Zeroth order optimization based black-box attacks to deep neuralnetworks without training substitute models,” in Proceedings of the10th ACM Workshop on Artificial Intelligence and Security . ACM,2017, pp. 15–26.[18] N. Carlini and D. Wagner, “Towards evaluating the robustness ofneural networks,” in Proceedings of the IEEE Symposium on Securityand Privacy , 2017.[19] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik,and A. 
Swami, “The limitations of deep learning in adversarialsettings,” in Proceedings of the IEEE European Symposium on Securityand Privacy , 2016.[20] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “DeepFool: asimple and accurate method to fool deep neural networks,” in CVPR , 2016.[21] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C.-J. Hsieh, “Ead:elastic-net attacks to deep neural networks via adversarial exam-ples,” in AAAI , 2018.[22] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients givea false sense of security: Circumventing defenses to adversarialexamples,” in ICML , 2018.[23] Y. Guo, C. Zhang, C. Zhang, and Y. Chen, “Sparse dnns withimproved adversarial robustness,” in NeurIPS , 2018.[24] H. Drucker and Y. LeCun, “Double backpropagation increasinggeneralization performance,” in IJCNN , 1991.[25] V. Nair and G. E. Hinton, “Rectified linear units improve restrictedboltzmann machines,” in ICML , 2010.[26] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:Surpassing human-level performance on imagenet classification,”in CVPR , 2015.[27] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understandingand improving convolutional neural networks via concatenatedrectified linear units,” in ICML , 2016.[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in CVPR , 2016.[29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translationby jointly learning to align and translate,” in ICLR , 2015.[30] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry,“Robustness may be at odds with accuracy,” in ICLR , 2019.[31] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al. , “Gradient-basedlearning applied to document recognition,” Proceedings of the IEEE ,vol. 86, no. 11, pp. 2278–2324, 1998. On Connections between Regularizations forImproving DNN Robustness **Appendices** Yiwen Guo, Long Chen, Yurong Chen, and Changshui Zhang, Fellow, IEEE F A PPENDIX AP ROOF OF L EMMA Proof. 
According to the definition, it is self-evident that J = V. As for the input gradient, we have, according to the chain rule,

∇_x L(x, y) = −V(e_y − p(x)) = −y(1 − p(x)_y)(v₊ − v₋),   (18)

in which y ∈ {±1} and e_y is its one-hot vector representation. Similarly, the Hessian matrix of L(·, ·) w.r.t. x is

H = ∇_x (V(p(x) − e_y)) = V (∇_x p(x))^T = V (diag(p(x)) − p(x)p(x)^T) V^T = p(x)₊ p(x)₋ (v₊ − v₋)(v₊ − v₋)^T.   (19)

APPENDIX B
PROOF OF PROPOSITION 3.1

Proof. From the expression of H shown in Eq. (19), we know that H is a rank-1 positive semi-definite matrix, and its only nonzero eigenvalue is therefore the maximal eigenvalue (i.e., the spectral norm). Since the instance is correctly classified, we have p(x)_y > 0.5 and ‖v₊ − v₋‖ ≠ 0. Supposing that x and v₊ − v₋ have finite magnitudes, we further know p(x)_y ≠ 1. From

H∇ = p(x)_y (1 − p(x)_y) ‖v₊ − v₋‖² ∇,

we know that ∇ is an eigenvector of H corresponding to the eigenvalue p(x)_y(1 − p(x)_y)‖v₊ − v₋‖² > 0. Hence we have u = ±∇/‖∇‖ and further |∇^T u| = ‖∇‖. In consequence, the upper bound and the lower bound in Lemma 3.1 coincide at this point, and it follows that

‖r*‖ = (‖∇‖/‖H‖) (√(1 + 2‖H‖ξ/‖∇‖²) − 1)
     = ((1 − p(x)_y)‖v₊ − v₋‖ / (p(x)_y(1 − p(x)_y)‖v₊ − v₋‖²)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1)
     = (1/(p(x)_y ‖v₊ − v₋‖)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1).   (20)

(C. Zhang is with the Department of Automation, State Key Lab of Intelligence Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected]. Y. Guo and L. Chen contributed equally to this work.)

APPENDIX C
PROOF OF PROPOSITIONS 3.2 AND 3.3

We first provide our proof of Proposition 3.2, which gives an analytic expression for ‖r̃*‖∞, as below.

Proof. According to the definition, we have

‖r̃*‖∞ := min ‖r‖∞  s.t.  L(x, y) + ∇^T r + r^T H r / 2 ≥ β.   (21)

By substituting H with its expression given in Lemma 3.2, the above constraint can be written as

−ξ + ∇^T r + (p(x)_y / (2(1 − p(x)_y))) (∇^T r)² ≥ 0,   (22)

in which ξ := β − L(x, y). Now that ∇^T r is a scalar, we can treat (22) as a quadratic inequality, and equivalently we have

∇^T r ≥ ((1 − p(x)_y)/p(x)_y) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1)

or

∇^T r ≤ −((1 − p(x)_y)/p(x)_y) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) + 1).

Since ‖r‖∞ ≥ ∇^T r/‖∇‖₁ ≥ −‖r‖∞ holds for any r ∈ R^n, with the equalities attained at r = ‖r‖∞ sign(∇) and r = −‖r‖∞ sign(∇) respectively, we have

‖r̃*‖∞ = ((1 − p(x)_y)/(p(x)_y ‖∇‖₁)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1)
       = (1/(p(x)_y ‖v₊ − v₋‖₁)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1).   (23)

Then comes the proof of Proposition 3.3.

Proof. We aim at analyzing η* := max_{‖r‖∞ ≤ ε} L(x, y) + ∇^T r + r^T H r / 2. According to Hölder's inequality, for all r ∈ R^n satisfying ‖r‖∞ ≤ ε it holds that

∇^T r + r^T H r / 2 = ∇^T r + (p(x)_y / (2(1 − p(x)_y))) (∇^T r)²
                   ≤ ‖r‖∞ ‖∇‖₁ + (p(x)_y / (2(1 − p(x)_y))) (‖r‖∞ ‖∇‖₁)²
                   = ε ‖∇‖₁ + (p(x)_y / (2(1 − p(x)_y))) (ε ‖∇‖₁)².   (24)

By further substituting the vector ∇ with the expression given in Lemma 3.2, we have

∇^T r + r^T H r / 2 ≤ ε (1 − p(x)_y) ‖v₊ − v₋‖₁ + (1/2) ε² p(x)_y (1 − p(x)_y) ‖v₊ − v₋‖₁²   (25)

to complete our proof, and the equality is attained at r = ε · sign(∇).

APPENDIX D
PROOF OF PROPOSITION 3.4

Proof.
Let us denote by L₁(·) and L₂(·) the loss functions for training regularized by the Jacobian and the cross-Lipschitz strategies, respectively. That is, L₁(V) = −E[log p(x; V)_y] + λ₁μ and L₂(V) = −E[log p(x; V)_y] + λ₂ν, in which E[·] calculates the sample mean rather than the population mean. It is easy to verify that L₁(V) is strongly convex w.r.t. V for single-layer perceptrons and piecewise-linear DNNs in which only the final layer is to be optimized; thus there is a unique optimal solution to min_V L₁(V). Let us denote this optimal solution as V₁ = [v₊⁽¹⁾, v₋⁽¹⁾].

For binary classification in which y ∈ {+1, −1}, it holds for any matrix V = [v₊, v₋] that

−E[log p(x; V)_y] = −E[((1+y)/2) log p(x; V)₊ + ((1−y)/2) log p(x; V)₋]
                 = −E[((1+y)/2) log p(x; V)₊ + ((1−y)/2) log(1 − p(x; V)₊)].   (26)

By the definition of the softmax function, we have

p(x; V)₊ = exp(⟨v₊, x⟩) / (exp(⟨v₊, x⟩) + exp(⟨v₋, x⟩)) = exp(⟨v₊ − v₋, x⟩) / (exp(⟨v₊ − v₋, x⟩) + 1),   (27)

thus we can further write −E[log p(x; V)_y] = h(v₊ − v₋), in which h(·): R^n → R can easily be verified to be convex. We rewrite the loss function of Jacobian-regularized training as

L₁(V) = h(v₊ − v₋) + (1/2) λ₁ ν + (1/2) λ₁ ‖v₊ + v₋‖²,   (28)

considering μ = (ν + ‖v₊ + v₋‖²)/2. Apparently, the first two terms on the right-hand side of the above equation remain unchanged as long as the vector v₊ − v₋ does. Let us introduce V̂ = [v̂₊, v̂₋], in which v̂₊ = (v₊⁽¹⁾ − v₋⁽¹⁾)/2 and v̂₋ = −v̂₊. Now we have v̂₊ − v̂₋ = v₊⁽¹⁾ − v₋⁽¹⁾ and

L₁(V̂) − L₁(V₁) = (1/2)λ₁‖v̂₊ + v̂₋‖² − (1/2)λ₁‖v₊⁽¹⁾ + v₋⁽¹⁾‖² = −(1/2)λ₁‖v₊⁽¹⁾ + v₋⁽¹⁾‖² ≤ 0.   (29)

Recalling that V₁ is the optimal solution to min_V L₁(V), we also have 0 ≤ L₁(V̂) − L₁(V₁). Therefore, v₊⁽¹⁾ + v₋⁽¹⁾ = 0 must hold in order to avoid contradictions. The obtained equation eliminates the third term in Eq. (28). Thus we know that, for λ₂ = λ₁/2, it holds that

L₁(V₁) = L₂(V₁).   (30)

We now proceed similarly for an optimal solution to the problem min_V L₂(V). By writing L₂(V) = h(v₊ − v₋) + λ₂ν = l(v₊ − v₋), we can verify that the newly introduced function l(·): R^n → R is strongly convex as well. Let us denote the optimal solution to min_w l(w) as w₂; then, by further introducing V₂ = [v₊⁽²⁾, v₋⁽²⁾] with v₊⁽²⁾ = w₂/2 and v₋⁽²⁾ = −v₊⁽²⁾, we can verify that

L₁(V₂) = L₂(V₂).   (31)

According to Eqs. (30)-(31) and the definitions of the optimal solutions V₁ and V₂, we now have

L₁(V₁) = L₂(V₁) ≥ L₂(V₂) = L₁(V₂) ≥ L₁(V₁),   (32)

in which the two equalities follow from Eqs. (30) and (31) and the two inequalities from the optimality of V₂ and V₁. This leads to L₁(V) = L₂(V) for V being equal to V₁ or V₂, and we have proved the proposition.

APPENDIX E
PROOF OF PROPOSITION 4.1 AND MORE

Proof. For a K-class classification task with y ∈ {0, ..., K−1} and the cross-entropy loss, we still have from Lemma 3.1 that

(‖∇‖/‖H‖)(√(1 + 2‖H‖ξ/‖∇‖²) − 1) ≤ ‖r*‖ ≤ (|∇^T u|/‖H‖)(√(1 + 2‖H‖ξ/|∇^T u|²) − 1),   (33)

in which ξ := log(K) − L(x, y). In order to derive insightful bounds on ‖r*‖ with less entangled variables, we first analyze the involved network properties ‖∇‖ and ‖H‖, and then take advantage of the monotonicity of the lower bound given above.

From the expressions summarized in Lemma 4.1, we know that

‖∇‖ ≤ √2 (1 − p(x)_y) ‖V‖ ≤ √2 (1 − p(x)_y) ‖V‖_F,   (34)

and that ‖H‖ ≤ Σ_i p(x)_i ‖v_i − v̄‖², in which v̄ := V p(x). Even though it is difficult to compare K(K−1)ν/8 with ‖V‖²_F directly, we know from Proposition 4.1 that penalizing either of the two quantities contributes to improving DNN robustness, and in fact they have already been adopted in the cross-Lipschitz and Jacobian regularizations, respectively. We now discuss their connections with other network properties and regularizations.
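The relations between ‖∇‖, ‖H‖, and the regularized quantities can be sanity-checked numerically. The following NumPy sketch (our illustration, not the authors' code; all names are ours) builds a random softmax layer, forms H = V(diag(p) − pp^T)V^T as in Lemma 4.1, and checks the lower bound ‖H‖ ≥ p(x)_y‖v_y − v̄‖² (obtained from the sum of positive semi-definite rank-1 terms) together with a crude Frobenius upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 32, 10                                  # input dimension and number of classes
V = rng.normal(size=(n, K))                    # final-layer weights, columns v_i
x = rng.normal(size=n)

z = V.T @ x                                    # logits of the (locally) linear model
p = np.exp(z - z.max()); p /= p.sum()          # softmax probabilities
y = int(np.argmax(p))                          # treat the predicted class as the label

H = V @ (np.diag(p) - np.outer(p, p)) @ V.T    # Hessian of the CE loss w.r.t. x
v_bar = V @ p                                  # v_bar = V p(x)
grad = v_bar - V[:, y]                         # input gradient of the linearized model

spec_H = np.linalg.eigvalsh(H).max()           # spectral norm (H is PSD)
assert spec_H >= p[y] * np.dot(grad, grad) - 1e-9   # PSD-sum lower bound
assert spec_H <= (V ** 2).sum() / 2 + 1e-9          # crude Frobenius upper bound
```

Since p(x)_y ≥ 1/K for a correctly classified instance, the first assertion also implies the lower end of the chained inequality discussed here.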
As introduced in the main body of our paper, the chained inequality ‖∇‖²/K ≤ ‖H‖ ≤ ‖V‖²_F/2 holds, which is obtained by virtue of

‖H‖ ≤ 2 p(x)_y (1 − p(x)_y) ‖V‖²_F ≤ ‖V‖²_F / 2   (41)

and

‖H‖ = ‖Σ_i p(x)_i (v_i − v̄)(v_i − v̄)^T‖ ≥ p(x)_y ‖v_y − v̄‖² = p(x)_y ‖∇‖² ≥ ‖∇‖² / K,   (42)

in which v̄ := V p(x). Similarly, we can derive a chained inequality for the quantity adopted in the cross-Lipschitz regularization (and the first bound in Proposition 4.1) as ‖∇‖²/K ≤ ‖H‖ ≤ K(K−1)ν/8, by taking advantage of Eq. (42) and

‖H‖ ≤ (1/2) Σ_{i,j} p(x)_i p(x)_j ‖v_i − v_j‖² ≤ (1/8) Σ_{i≠j} ‖v_i − v_j‖² = K(K−1)ν / 8.   (43)

APPENDIX F
EXPERIMENTAL SETTINGS AND CIFAR-10 RESULTS

In this section, we first introduce our experimental settings on MNIST, involving the training and test policies, DNN architectures, evaluation metrics, etc., and then provide results on CIFAR-10.

Experimental settings: The official training/test split of MNIST [2] is utilized. As briefly introduced, we train/test with an MLP codenamed "LeNet-300-100" and a convolutional neural network codenamed "LeNet-5" on MNIST. The former is comprised of two parameterized fully-connected layers, and the latter contains two convolutional layers, two max-pooling layers, and two fully-connected layers. We trained 10 models each from different initializations as references, and fine-tuned the obtained models with the different regularizations discussed in this paper to evaluate their performance under adversarial attacks.

Fig. 6: The robustness of all obtained binary classification models evaluated with the FGSM, PGD, DeepFool, and C&W attacks on CIFAR-10: (a)-(d) for the four-layer convolutional network with batch normalization, (e)-(h) for the VGG-like network, and (i)-(l) for the ResNet. Ten runs from different initializations are performed and the average results are reported for fair comparisons.
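For concreteness, FGSM, the simplest of the attacks used for evaluation, can be sketched in a few lines on a locally linear binary model. This is a toy NumPy illustration under our own notation, not the CleverHans implementation used in the experiments:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(v, x, y):
    # y in {+1, -1}; L(x, y) = log(1 + exp(-y v^T x))
    return np.log1p(np.exp(-y * np.dot(v, x)))

def fgsm(v, x, y, eps):
    # single-step l_inf attack: move eps along the sign of the input gradient
    grad = -y * (1.0 - sigmoid(y * np.dot(v, x))) * v   # dL/dx for the linear model
    return x + eps * np.sign(grad)

rng = np.random.default_rng(1)
v = rng.normal(size=8)
x = rng.normal(size=8)
y = 1 if np.dot(v, x) > 0 else -1        # take the predicted label
x_adv = fgsm(v, x, y, eps=0.1)
assert logistic_loss(v, x_adv, y) >= logistic_loss(v, x, y)   # loss never decreases
```

On a linear model the FGSM step provably does not decrease the loss, since it moves the margin y v^T x by −ε‖v‖₁; iterative attacks such as PGD repeat this step with projection.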
All experiments were performed on a single NVIDIA Titan X GPU, and official implementations from the authors of the regularizations were adopted. TensorFlow [3] and CleverHans [4] were used.

We directly applied the training policies suggested in the Caffe model zoo [5] for LeNet-300-100 and LeNet-5, and we trained models on MNIST with a common batch size of 64 for 50,000 iterations such that they all reached the plateau. For the binary LeNet-300-100 and LeNet-5 models, we achieved prediction accuracies of . ± . and . ± . , respectively. To evaluate the adversarial robustness of DNN models, we chose four prevalent attacks, i.e., FGSM [6], PGD [7], DeepFool [8], and the C&W attack [9], two of which are ℓ∞ norm-based and the other two ℓ2 norm-based. With ε = 0. , the prediction accuracies of the reference models degraded significantly (to . ± . and . ± . ) on FGSM adversarial examples and (to . ± . and . ± . ) on PGD adversarial examples. In multi-class scenarios, we similarly had reference models with high prediction accuracies ( . ± . and . ± . for LeNet-300-100 and LeNet-5, respectively) on the benign test set, yet a deteriorating effect can be observed on the adversarial examples.

For training on MNIST, we regularized with various λ values chosen from { − , − , − , . , . , . , . , . }. Such a set should cover many of the values suggested for this hyper-parameter in the literature, and we also noticed in the experiments that further enlarging λ would probably cause numerical instability during training and generate NaN in the network gradients. In fact, with a λ as large as . on the tested dataset, most of the terms to be penalized have already become extremely small (typically with an order of magnitude ≤ ) on the training instances, so more attention should be paid to their numerical ranges and stability.
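The shrinking effect of a large λ described above can be reproduced on a toy problem. The sketch below (illustrative only; the model, data, and optimizer are ours) fine-tunes the weight vector w, playing the role of v₊ − v₋ in the binary softmax model, under a penalty of the cross-Lipschitz style ν = ‖w‖², and shows that the penalized term becomes much smaller as λ grows:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, lam, steps=2000, lr=0.1):
    """Minimize mean log-loss + lam * ||w||^2 by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the mean logistic loss plus the quadratic penalty
        grad = -(X * (y * (1.0 - sigmoid(margins)))[:, None]).mean(0) + 2.0 * lam * w
        w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 5))
y = np.sign(X @ rng.normal(size=5) + 1e-9)     # separable toy labels in {+1, -1}

nu_small = np.sum(fit(X, y, lam=1e-3) ** 2)    # weakly regularized
nu_large = np.sum(fit(X, y, lam=1.0) ** 2)     # strongly regularized
assert nu_large < nu_small                     # the penalized term shrinks with lambda
```

The same qualitative behavior underlies the numerical-range concern: once the penalized term is already tiny, further enlarging λ mostly amplifies round-off and gradient-scale issues.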
Curvature regularization, however, was stable over the range of tested λ, and we also observed that its robustness to the C&W attack increased to nearly 2.0 (i.e., the required magnitude of perturbations to successfully fool the model was ∼2.0) with λ = 6.

Experimental results on CIFAR-10: As introduced, we also performed experiments on CIFAR-10, with networks much deeper than those tested on MNIST. To be concrete, we chose a four-layer convolutional network similar to LeNet-5 but with batch normalization [10] incorporated, a VGG-like network [11] (incorporating twelve convolutional layers and two fully-connected layers), and a ResNet [12] (incorporating 31 convolutional layers, a single fully-connected layer, and no batch normalization). Again, we trained 10 models each from different initializations as references, and fine-tuned the obtained models with different regularizations. We trained them for 100,000 iterations to ensure convergence, and we decayed the learning rate 10-fold at iterations 60,000 and 80,000.

Fig. 7: The robustness of obtained multi-class classification models evaluated with the FGSM, PGD, DeepFool, and C&W attacks: (a)-(d) for the four-layer convolutional network incorporating batch normalization and (e)-(h) for the VGG-like DNN. The curvature regularization is not compared, as approximations seem inevitable in its multi-class implementation. NaN is triggered with λ = 0. on Jacobian- and cross-Lipschitz-regularized VGG-like models. ResNets are not evaluated due to limited computational resources.

We report the robustness of the regularized models to adversarial attacks in Figure 6. The hyper-parameter ε for ℓ∞ attacks was set to 0.05, and λ was chosen to be smaller (than the values on MNIST), from { − , − , − , − , − , . }, to guarantee a stable training process. We observed that even with the hyper-parameter as small as . , it is possible to produce NaN during the training of some multi-class models on CIFAR-10, while for the binary classification models, further increasing λ might lead to even stronger adversarial robustness. Nevertheless, performing a grid search for the best λ of each model is beyond the scope of this paper, so we might not achieve the optimal performance of the discussed methods in Figure 6. Multi-class results are provided in Figure 7.

APPENDIX G
THE LOGISTIC LOSS AND OPEN QUESTIONS

It is mentioned that our theoretical results in binary classification with the cross-entropy loss generalize to the case trained with the logistic loss, which calculates log(1 + exp(−y v^T x)) for an instance x with label y ∈ {±1}, in which v := W₁D₁(x) ··· W_{d−1}D_{d−1}(x) w_d is an n-dimensional vector. Let us now discuss this in more detail. Equivalently, we rewrite the logistic loss as −log(p(x)_y), in which p(x)₋ = 1/(1 + exp(v^T x)) and p(x)₊ = 1 − p(x)₋; hence, if V = [v, 0], we know by simple derivations that v plays the same role (in training with the logistic loss) as the vector v₊ − v₋ (in training with the cross-entropy loss). That is, what follows coincides with the results presented in the paper. Future work should include studies on other loss functions.

Just as mentioned in the main body of our paper, we consider the local (cross-)Lipschitz constants and the prediction probability p(x)_y separately in regularizers for simplicity, and our conclusions hold even if their mutual influence is rigorously analyzed. Regarding the mutual influence in binary classification, it is mostly on account of ν = ‖v₊ − v₋‖. Since the prediction probabilities p(x)₊ and p(x)₋ can be written as functions of ν and γ := |w̌^T x|, in which w̌ := (v₊ − v₋)/ν is a normalized vector, we cast the problem as discussing the monotonicity of ‖r*‖ in terms of ν with care.
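This monotonicity question can be probed numerically before any closed-form analysis. Writing p(x)_y = σ(γν) for a correctly classified instance, we have ‖∇‖² = ν²(1 − σ(γν))² and ‖H‖ = ν² σ(γν)(1 − σ(γν)); the sketch below (ours, for illustration) locates their maximizers over ν on a grid:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

gamma = 1.0
nu = np.linspace(1e-3, 6.0, 60000)        # grid over the local Lipschitz constant
s = sigmoid(gamma * nu)                   # p(x)_y as a function of nu

grad_sq = (nu * (1.0 - s)) ** 2           # ||grad||^2 = nu^2 (1 - p_y)^2
hess_norm = nu ** 2 * s * (1.0 - s)       # ||H|| = nu^2 p_y (1 - p_y)

crit_grad = nu[np.argmax(grad_sq)]        # maximizer of ||grad||^2
crit_hess = nu[np.argmax(hess_norm)]      # maximizer of ||H||
```

Both maximizers fall near γν ≈ 1.28 and γν ≈ 2.40, consistent with solving the stationarity conditions exactly; below either threshold, the corresponding regularizer is increasing in ν, so penalizing it encourages a smaller local Lipschitz constant.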
We emphasize that the situation is actually a bit complicated, as the monotonicity highly depends on the value of γ, and we accomplish the task by calculating the derivatives of the regularizers, including ‖∇‖² and ‖H‖, w.r.t. ν. Specifically, we have

∂‖∇‖²/∂ν = 2ν (1 − (γν − 1) exp(γν)) / (1 + exp(γν))³   (44)

and

∂‖H‖/∂ν = ν (2 + γν − (γν − 2) exp(γν)) exp(γν) / (1 + exp(γν))³.   (45)

By solving ∂‖∇‖²/∂ν = 0 and ∂‖H‖/∂ν = 0, we get the critical points as roughly 1.28/γ and 2.40/γ, respectively. That is, if νγ ≤ 1.28 and νγ ≤ 2.40 are fulfilled, then penalizing scaled ‖∇‖² and ‖H‖ indicates a smaller local Lipschitz constant ν, respectively. We evaluate νγ with models trained on MNIST and CIFAR-10, and we find that the conditions are satisfied for almost all training instances. More specifically, the average value of νγ on the LeNet-300-100 references is only roughly . and the largest value is roughly . . After training with regularizations, the values of both νγ and ν become orders of magnitude smaller, making the results rigorously hold for all instances. It can be interesting to evaluate in future work whether it is the gap between 1.28 and 2.40 that affects the regularization performance. Note that similar analyses can be made for the magnitude of r*, but we feel the result in that case makes less sense, since the definition of r* is subject to approximations.

APPENDIX H
MULTI-CLASS REGULARIZATIONS

We did not test the curvature regularization in multi-class scenarios, mostly because some approximations seem inevitable. With approximations, it is relatively difficult to tell from experimental results whether there is functional equivalence or not. As mentioned, through the lens of our study, it is possible to develop more regularization methods by consolidating the current ones and their essential ingredients. We would like to study this more carefully in future work and compare with current multi-class curvature approximators if possible.

REFERENCES

[1] M.
Hein and M. Andriushchenko, "Formal guarantees on the robustness of a classifier against adversarial manipulation," in NeurIPS, 2017.
[2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[4] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y.-L. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long, "Technical report on the CleverHans v2.1.0 adversarial examples library," arXiv preprint arXiv:1610.00768, 2018.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in MM, 2014.
[6] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[7] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.
[8] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: a simple and accurate method to fool deep neural networks," in CVPR, 2016.
[9] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Proceedings of the IEEE Symposium on Security and Privacy, 2017.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[11] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, "Structured Bayesian pruning via log-normal multiplicative noise," in NeurIPS, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in CVPR, 2015.

Yiwen Guo received the B.E. degree from Wuhan University, Wuhan, China, in 2011, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2016. He is a research scientist at Bytedance AI Lab, Beijing. Prior to this, he was a staff research scientist at Intel Labs China. His current research interests include computer vision, pattern recognition, and machine learning.

Long Chen received the B.S. degree in mathematics and the M.S. degree in data science from Peking University, in 2016 and 2019, respectively. His current research interests include machine learning, statistics, and financial data analysis.

Yurong Chen received the B.S. and Ph.D. degrees from Tsinghua University, Beijing, China, in 1998 and 2002, respectively. He joined Intel in 2004 after completing his postdoctoral research in the Institute of Software, CAS. He is currently a Principal Research Scientist and Director of the Cognitive Computing Lab at Intel Labs China, responsible for leading visual cognition and machine learning research for Intel platforms. He received one "Intel China Award" and three Intel Labs Academic Awards ("Gordy Awards") for delivering leading visual analytics and understanding technologies to impact Intel platforms/solutions. He has published over 60 papers and holds over 50 issued/pending patents.