On Connections between Regularizations for Improving DNN Robustness
Yiwen Guo, Long Chen, Yurong Chen, and Changshui Zhang,
Fellow, IEEE
Abstract—This paper analyzes regularization terms proposed recently for improving the adversarial robustness of deep neural networks (DNNs), from a theoretical point of view. Specifically, we study possible connections between several effective methods, including input-gradient regularization, Jacobian regularization, curvature regularization, and a cross-Lipschitz functional. We investigate them on DNNs with general rectified linear activations, which constitute one of the most prevalent families of models for image classification and a host of other machine learning applications. We shed light on essential ingredients of these regularizations and re-interpret their functionality. Through the lens of our study, more principled and efficient regularizations can possibly be invented in the near future.
Index Terms—Deep neural networks, adversarial robustness, regularizations, network property
1 INTRODUCTION

It has been discovered that deep neural networks (DNNs) are vulnerable to adversarial examples [1], [2], [3], and the phenomenon can prohibit them from being deployed in security-sensitive applications. Amongst the most effective methods for mitigating the issue, adversarial training [1], [2], [3] is capable of resisting a series of malicious examples [3], [4] and yields adversarially robust DNN models in the sense of an $\ell_p$ norm. By injecting advanced adversarial examples (e.g., using BIM [5] or PGD [3]) into training as some sort of augmentation, the obtained models learn to defend against these examples. In addition, the obtained models may also resist some other types of adversarial examples (generated using, for example, the fast gradient sign method [2]). However, advanced adversarial examples are typically generated in an iterative manner by back-propagating through deep models multiple times, and thus the mechanism may demand a massive amount of computation [6].

Another thriving category of methods for hardening DNNs is to perform regularizations, aiming at trading off effectiveness and efficiency properly. Although most traditional regularization-based strategies (e.g., weight decay [7] and dropout [8]) do not operate properly in this respect, a variety of recent work [6], [9], [10], [11], [12] has shown that more dedicated and principled regularizations help to gain comparable or only slightly worse performance in improving DNN robustness. Instead of raising a perpetual "arms race", these regularization-based strategies are in general attack-agnostic and of benefit to the generalization ability [9] and interpretability of learning models [11]. Moreover, the computational and memory complexity of these methods is acceptable in very large models. It has also been shown that the methods can be combined with adversarial training to achieve even stronger DNN robustness.

• Y. Guo is with Bytedance AI Lab. E-mail: [email protected].
• L. Chen is with the Academy for Advanced Interdisciplinary Studies, Center for Data Science, Peking University, Beijing 100871, China. E-mail: [email protected].
• Y. Chen is with Intel Labs China. E-mail: [email protected].
• C. Zhang is with the Institute for Artificial Intelligence, Tsinghua University (THUAI), the State Key Lab of Intelligent Technologies and Systems, Beijing National Research Center for Information Science and Technology (BNRist), the Department of Automation, Tsinghua University, Beijing 100084, China. E-mail: [email protected].
Y. Guo and L. Chen contribute equally to this work.

While many regularizers have been developed for DNN robustness, there is as of yet little comparative analysis among these choices, especially from a theoretical point of view. In this paper, we attempt to shed light on the intrinsic functionality of and theoretical connections between several effective regularizers, even if their formulations may stem from different rationales. Concretely, it has been presented over the past few years that regularizing the Euclidean norm of an input-gradient [11], [13], the Frobenius norm of a Jacobian matrix [12], [14], the spectral norm of a Hessian matrix [6], and a cross-Lipschitz functional [10] all significantly contribute to the adversarial robustness of DNNs. We analyze all these choices on DNNs with general rectified linear activations, which are ubiquitous in image classification and a host of other machine learning tasks.

Some of our key contributions and observations are:
• We present, for the first time, an analytic expression for the $\ell_2$ norm of an approximately-optimal adversarial perturbation concerned in very recent papers [6], [15], to demonstrate that local cross-Lipschitz constants [10] and the prediction probability are its essential ingredients in binary classification cases. In addition to the $\ell_2$ norm-based results, we also show similar results for the robustness to $\ell_\infty$ norm-based attacks.
• We unveil that most discussed regularizations advocate small local cross-Lipschitz constants in binary classification, except for the Jacobian regularization that suggests small local Lipschitz constants, yet regularizing the two network properties can be equivalent.
• We further demonstrate that critical discrepancies still exist between specific methods, mostly in regularizing the prediction probability/confidence.
• We extend some analyses to multi-class classification and verify our findings with experiments.

2 REGULARIZATIONS IMPROVING ROBUSTNESS
Given an input instance $x \in \mathbb{R}^n$, a DNN-based classifier offers its prediction along with a softmax normalized probability $p(x)_k = \exp(z_k)/\sum_j \exp(z_j)$ for each class $k$ on top of a vector representation $z = g(x)$. Suppose that a set of labeled instances $\{(x_i, y_i)\}_i$ is provided; then a classifier is typically learned with the assistance of an objective function $L(\cdot,\cdot)$ that evaluates the training prediction loss, i.e., the average discrepancy between a set of predictions $\{p(x_i)\}_i$ and ground-truth labels $\{y_i\}_i$.

Existing adversarial attacks can be roughly divided into two main categories, i.e., white-box attacks [1], [2] and black-box attacks [16], [17], according to how much information about the victim model is accessible to an adversary [18]. Our study in this paper mainly focuses on white-box non-targeted attacks and defenses against them, in order to comply with prior theoretical work. Under such threats, substantial endeavors have been exerted to demonstrate the adversarial vulnerability of DNNs [1], [2], [3], [18], [19], [20], [21], [22]. Most of them are proposed within a framework that favors perturbations with the least $\ell_p$ norms that would still cause the DNNs to make incorrect predictions. That being said, an adversary opts to solve

$$\min_r \|r\|_p \quad \text{s.t.} \quad \operatorname*{argmax}_k\, g(x+r)_k \neq \operatorname*{argmax}_k\, g(x)_k. \quad (1)$$

Utilizing the objective function $L(\cdot,\cdot)$, the task of mounting adversarial attacks can also be formulated from a dual perspective which attempts to maximize the loss with a presumed perturbation magnitude (in the context of $\ell_p$ norms). That being said, given $\epsilon > 0$, one may resort to

$$\max_{\|r\|_p \le \epsilon} L(x+r, y). \quad (2)$$

Omitting box constraints on the image domain, many off-the-shelf attacks [2], [3], [18], [19], [20] can be considered as efficient approximations to either (1) or (2). Under certain circumstances, their solutions can be equivalent to the optimal solutions to (1) or (2).
For instance, the fast gradient sign method (FGSM) [2] achieves the optimum of (2) with some binary linear classifiers and $p = \infty$ [23]. Also, for any linear model together with $p = 2$, the DeepFool perturbation [20] is theoretically optimal to (1). Training with an augmented set involving adversarial examples, i.e., adversarial training, has been proven to be very effective in improving DNN robustness [3], regardless of the computational burden. A recent study [15] demonstrates the relationship between a classical regularization [24] and adversarial training [2]. It is conceivable that a principled regularization term involved in training suffices to yield DNN models with comparable robustness, whereby a whole series of methods has been developed. Unlike many traditional methods which are normally data-independent (e.g., weight decay and dropout), recent progress conforms closely with theoretical guarantees and focuses mostly on regularizing the loss landscape [6], [10], [11], [12], [13], [14]. Before systematically studying their functionality and relationships in the following sections, we first introduce some important notations.

Given the objective function $L(\cdot,\cdot)$ for classification, we will refer to 1) $\nabla := \nabla_x L(x, y)$, as its gradient with respect to (w.r.t.) the input vector $x$, 2) $H$, as the Hessian matrix of $L$, and 3) $J$, as the Jacobian matrix of $g(x)$ w.r.t. $x$. It has been presented that training regularized using $\|J\|_F$, the Frobenius norm of $J$ (dubbed the Jacobian regularization [12], [14]), $\|\nabla\|$, the Euclidean norm of $\nabla$ (i.e., the input-gradient regularization [11], [13]), $\|H\|$, the spectral norm of $H$ (i.e., the curvature regularization [6]), and a cross-Lipschitz functional [10], as will be elaborated later, all significantly improve the adversarial robustness of the obtained models.
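To make formulation (2) concrete, the following NumPy sketch mounts one-step FGSM on a linear binary classifier, the setting in which it attains the optimum of (2) for $p = \infty$ as noted above. This is our own toy illustration (random weights and data, hypothetical helper names), not code from the paper.

```python
import numpy as np

def cross_entropy_binary(w, x, y):
    """Cross-entropy loss of the linear classifier sign(w^T x), labels y in {+1, -1}."""
    margin = y * w.dot(x)              # p(x)_y = sigmoid(margin)
    return np.log1p(np.exp(-margin))   # L = -log p(x)_y

def fgsm(w, x, y, eps):
    """One-step FGSM, r = eps * sign(grad_x L): the maximizer of (2) for p = inf."""
    p_y = 1.0 / (1.0 + np.exp(-y * w.dot(x)))
    grad = -(1.0 - p_y) * y * w        # gradient of the loss w.r.t. the input x
    return x + eps * np.sign(grad)

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=5), rng.normal(size=5), 1.0
x_adv = fgsm(w, x, y, eps=0.1)
# For this linear model, no r with ||r||_inf <= eps yields a larger loss than the sign step.
assert cross_entropy_binary(w, x_adv, y) > cross_entropy_binary(w, x, y)
```

For the linear model the loss is monotone in the margin $y\,w^T x$, and the sign step decreases the margin by exactly $\epsilon\|w\|_1$, which is the worst case over the $\ell_\infty$ ball.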
We focus on DNNs with general rectified linear units (general ReLUs) [25], [26], [27] as nonlinear activations, and analyze binary classification and multi-class classification tasks separately in the following sections.

3 BINARY CLASSIFICATION
With the background information introduced in the previous section, here we first discuss different regularizations in binary classification DNNs, and we will generalize some of our results to multi-class classification in Section 4.

For simplicity of notations, let us first consider a multi-layer perceptron (MLP) parameterized by a series of weight matrices $W_1 \in \mathbb{R}^{n_0 \times n_1}, \ldots, W_d \in \mathbb{R}^{n_{d-1} \times n_d}$, where $n_0 = n$ and $n_d = 2$ in our theories. (We stress that, although a simple MLP is formulated here, our following discussions directly generalize to DNNs with convolutions, poolings, skip-connections [28], self-attentions [29], etc.) For a $d$-layer MLP, we have

$$g(x) = W_d^T \sigma(W_{d-1}^T \sigma(\ldots \sigma(W_1^T x))), \quad (3)$$

in which the general ReLU activation $\sigma(\cdot)$ of our particular interest is piecewise linear and hence $g(\cdot)$ is also piecewise linear. Following prior work [23], we can define $a_0 := x$ and $a_j := \sigma(W_j^T a_{j-1}) = D_j^T(x) W_j^T a_{j-1}$, for $1 \le j \le d-1$, in which

$$D_j(x) := \operatorname{diag}\big(1_{W_j[:,1]^T a_{j-1} > 0}, \ldots, 1_{W_j[:,n_j]^T a_{j-1} > 0}\big) \quad (4)$$

is an $n_j \times n_j$ diagonal matrix whose main diagonal entries corresponding to nonzero activations within the $j$-th parameterized layer take a value of $+1$, and the others take a value of $0$. Denoting by $w_\pm$ the two columns of the matrix $W_d$ (i.e., $W_d = [w_+, w_-]$), we have the two entries of $p(x)$ as $p(x)_+ = \exp(w_+^T a_{d-1}) / \sum \exp(w_\pm^T a_{d-1})$ and $p(x)_- = 1 - p(x)_+$. These two scalars estimate the probability of $x$ being sampled from the positive and negative classes, respectively. Since $g(\cdot)$ is piecewise linear as analyzed, there exists a polytope $Q(x)$ to which the input instance $x$ belongs and on which $g(\cdot)$ is linear, i.e., $D_j(x') = D_j(x)$ and

$$g(x')|_{x' \in Q(x)} = V^T x', \quad (5)$$

in which $V = [v_+, v_-]$ is a matrix with its columns $v_\pm := W_1 D_1(x) \ldots W_{d-1} D_{d-1}(x)\, w_\pm$. Our analyses stem from Problem (1).
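The masks $D_j(x)$ and the local matrix $V$ of Eqs. (4) and (5) can be computed explicitly. Below is a small NumPy sketch (our own toy construction, with the plain ReLU as the activation) that records the masks during a forward pass and checks that $g(x) = V^T x$ on the polytope $Q(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)
# A small ReLU MLP g(x) = W3^T relu(W2^T relu(W1^T x)) with two output logits,
# matching Eq. (3); the layer sizes are arbitrary toy choices.
W1 = rng.normal(size=(8, 6))
W2 = rng.normal(size=(6, 4))
W3 = rng.normal(size=(4, 2))   # columns are w_plus and w_minus

def forward_with_masks(x):
    """Forward pass recording the 0/1 diagonal masks D_j(x) of Eq. (4)."""
    a1 = W1.T @ x
    d1 = (a1 > 0).astype(float)
    a2 = W2.T @ (a1 * d1)
    d2 = (a2 > 0).astype(float)
    logits = W3.T @ (a2 * d2)
    return logits, d1, d2

x = rng.normal(size=8)
logits, d1, d2 = forward_with_masks(x)
# On the polytope Q(x), g is linear: g(x') = V^T x' with
# V = W1 D1(x) W2 D2(x) W3, as in Eq. (5).
V = W1 @ np.diag(d1) @ W2 @ np.diag(d2) @ W3
assert np.allclose(V.T @ x, logits)
```

Because the masks are fixed on $Q(x)$, the same $V$ also reproduces $g(x')$ for any $x'$ that keeps every pre-activation sign unchanged.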
For binary classification with $y \in \{+1, -1\}$, we can rewrite the optimization problem as $\min_r \|r\|_p$ s.t. $L(x+r, y) \ge \beta$, just as suggested [6], in which $\beta$ is a threshold for correct and incorrect classifications and its value solely depends on the choice of the loss function $L(\cdot,\cdot)$ (e.g., if the cross-entropy loss is chosen, then $\beta = \log(2)$). It follows from DeepFool and others [6] that we may well-approximate the constraint with a Taylor series and get bounds for the ($\ell_2$) magnitude of $r^* := \operatorname*{argmin} \|r\|_2$ s.t. $L(x, y) + \nabla^T r + r^T H r / 2 \ge \beta$, as will be presented in Lemma 3.1 below.

Fig. 1: Illustration of how $\|r^*\|_2$ varies with the prediction probability $p(x)_y$ in binary scenarios.

Lemma 3.1. [6]
Let $x$ be a correctly classified instance such that $\xi := \beta - L(x, y) \ge 0$, and let $u \in \mathbb{R}^n$ be the normalized eigenvector corresponding to the largest eigenvalue of $H$; then we have

$$\frac{\|\nabla\|}{\|H\|}\Big(\sqrt{1 + \frac{2\|H\|\xi}{\|\nabla\|^2}} - 1\Big) \;\le\; \|r^*\|_2 \;\le\; \frac{|\nabla^T u|}{\|H\|}\Big(\sqrt{1 + \frac{2\|H\|\xi}{|\nabla^T u|^2}} - 1\Big). \quad (6)$$

The above lemma establishes connections between the robustness of a DNN and the spectral norm of its Hessian matrix $H$. Though enlightening, the variables $u$, $\|\nabla\|$, and $\|H\|$ in Eq. (6) are heavily entangled, so it is difficult to reveal the functionality of the concerned regularizations.

Fortunately, we show that the derived bounds are tight, such that they collapse to the same expression in terms of $p(x)_y$ and a local cross-Lipschitz constant [10] in binary classification with some common choices of the loss function (e.g., the cross-entropy loss and logistic loss). To be concrete, suppose that the cross-entropy loss is adopted; then with the $n \times 2$ matrix $V = [v_+, v_-]$ introduced in Eq. (5), we have the following lemma and proposition.

Lemma 3.2. (Simplified expressions for $J$, $\nabla$, and $H$). Given an instance paired with its label $(x, y)$, we have for the Jacobian $J$, input-gradient $\nabla$, and Hessian $H$:

$$J = V, \qquad \nabla = y\,(p(x)_y - 1)(v_+ - v_-),$$
$$H = p(x)_+\, p(x)_-\,(v_+ - v_-)(v_+ - v_-)^T = p(x)_y (1 - p(x)_y)(v_+ - v_-)(v_+ - v_-)^T. \quad (7)$$

Proposition 3.1. (An analytic expression for $\|r^*\|_2$). For the binary classifier with a locally linear $g(\cdot)$ and a correctly classified instance $x$, we have

$$\|r^*\|_2 = \frac{1}{p(x)_y \|v_+ - v_-\|}\Big(\sqrt{1 + \frac{2\,p(x)_y\,\xi}{1 - p(x)_y}} - 1\Big). \quad (8)$$

Proposition 3.1 is obtained on the basis of Lemmas 3.1 and 3.2.
See our proofs in Appendices A and B, respectively. Similar results can be achieved with the logistic loss (as also demonstrated in the appendix). The decomposition of $\|r^*\|_2$ (i.e., the $\ell_2$ magnitude of $r^*$) in the derived Eq. (8) appears to be more obvious than in Eq. (6), and it can be concluded that $\xi$, $p(x)_y$, and $\|v_+ - v_-\|$ jointly affect the $\ell_2$ magnitude of $r^*$. Seeing that the value of $\xi$ is determinate w.r.t. $p(x)_y$, the prediction probability $p(x)_y$ and $\|v_+ - v_-\|$ become the only dominating ingredients. For better clarity, let us define $\nu := \|v_+ - v_-\|$, which is in fact a local cross-Lipschitz constant of $g(\cdot)$ [10]. Even though $\nu$ might as well be influential to the prediction probability $p(x)_y$, we discuss them separately here, considering that the latter can still be optimized with any presumed value of the former.

It is easy to verify that $\|r^*\|_2 = 0$ holds for all $\nu > 0$ in the special case of $p(x)_y \to 0.5^+$. Yet, for $p(x)_y > 0.5$, the general impact of the prediction probability $p(x)_y$ in Eq. (8) is still obscure. To gain direct insights, we depict how $\|r^*\|_2$ varies with $p(x)_y \in (0.5, 1.0)$ on the right panel of Figure 1, given specific $\nu$ values. We observe that, in general, a larger $p(x)_y$ implies a larger $\|r^*\|_2$ and thus lower vulnerability of a classification model, provided that the $\ell_2$ magnitude of $r^*$ is a reasonable measure of the robustness and $p(x)_y > 1 - p(x)_y$ (or equivalently, $p(x)_y > 0.5$). See also the left panel of the figure for an illustration with $p(x)_y$ approaching $0.5$ from above and at two larger fixed values.

Our theoretical result in Proposition 3.1 gives rise to a formal guarantee of the $\ell_2$ robustness for piecewise linear DNNs, without concerning much about the accuracy of the Taylor approximation.
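The collapse of the two bounds in Lemma 3.1 into Eq. (8) is straightforward to check numerically. The sketch below (ours; $p(x)_y = 0.9$ and $\nu = 3$ are arbitrary toy values) substitutes $\|\nabla\| = (1 - p(x)_y)\nu$ and $\|H\| = p(x)_y(1 - p(x)_y)\nu^2$ from Lemma 3.2 into the lower bound of Eq. (6) and compares it with the closed form of Eq. (8).

```python
import numpy as np

def r_star_norm(p_y, nu, xi):
    """Closed-form l2 magnitude of r* from Eq. (8), cross-entropy loss."""
    return (np.sqrt(1.0 + 2.0 * p_y * xi / (1.0 - p_y)) - 1.0) / (p_y * nu)

def lemma_bound(grad_norm, hess_norm, xi):
    """Lower bound of Lemma 3.1; it matches the upper bound when |grad^T u| = ||grad||."""
    return grad_norm / hess_norm * (np.sqrt(1.0 + 2.0 * hess_norm * xi / grad_norm**2) - 1.0)

p_y, nu = 0.9, 3.0
xi = np.log(2.0) + np.log(p_y)           # xi = beta - L(x, y) with beta = log(2)
grad_norm = (1.0 - p_y) * nu             # ||grad|| from Lemma 3.2
hess_norm = p_y * (1.0 - p_y) * nu**2    # ||H|| from Lemma 3.2
assert np.isclose(r_star_norm(p_y, nu, xi), lemma_bound(grad_norm, hess_norm, xi))
```

Consistent with the right panel of Figure 1, the closed form also grows with $p(x)_y$ for a fixed $\nu$.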
Regarding the adversarial robustness to some other $\ell_p$ norm-based attacks, we have similar results in this paper. One might be of special interest in the $p = \infty$ case, as it has been widely considered in practical attacks. Propositions 3.3 and 3.2 provide results from different viewpoints in correspondence to (2) and (1), i.e., by bounding the worst-case loss $\eta^* := \max_{\|r\|_\infty \le \epsilon} L(x, y) + \nabla^T r + r^T H r / 2$ with any fixed $\epsilon > 0$ and by providing an analytic expression for the $\ell_\infty$ norm of $\tilde{r}^* := \operatorname*{argmin} \|r\|_\infty$ s.t. $L(x, y) + \nabla^T r + r^T H r / 2 \ge \beta$.

Proposition 3.2. (An analytic expression for $\|\tilde{r}^*\|_\infty$). For the binary classifier with a locally linear $g(\cdot)$ and a correctly classified instance $x$, we have

$$\|\tilde{r}^*\|_\infty = \frac{1}{p(x)_y \|v_+ - v_-\|_1}\Big(\sqrt{1 + \frac{2\,p(x)_y\,\xi}{1 - p(x)_y}} - 1\Big). \quad (9)$$

Proposition 3.3. (An upper bound of $\eta^*$). For the binary classifier with a locally linear $g(\cdot)$ and a correctly classified input instance $x$, for all $r \in \mathbb{R}^n$ satisfying $\|r\|_\infty \le \epsilon$, it holds that

$$\eta^* = L(x, y) + \epsilon\,(1 - p(x)_y)\,\|v_+ - v_-\|_1 + \frac{1}{2}\epsilon^2\, p(x)_y (1 - p(x)_y)\,\|v_+ - v_-\|_1^2. \quad (10)$$

Besides Proposition 3.1, some intriguing corollaries can also be derived from Lemma 3.2. First, the direction of the input-gradient vector $\nabla$ is the same as that of the first eigenvector (i.e., the one corresponding to the largest eigenvalue) of the matrix $H$. Second, we can derive $\|\nabla\| = (1 - p(x)_y)\,\nu$ and $\|H\|_F = \|H\| = p(x)_y (1 - p(x)_y)\,\nu^2$, which means that we
further have simple analytic expressions for the concerned regularizers as:

$$\text{Jacobian regularizer} := \lambda \mu^2,$$
$$\text{Input-gradient regularizer} := \lambda (1 - p(x)_y)^2 \nu^2,$$
$$\text{Curvature regularizer} := \lambda\, p(x)_y^2 (1 - p(x)_y)^2 \nu^4, \quad (11)$$

in which $\|H\|$ calculates the spectral norm (i.e., the matrix $\ell_2$ norm) of $H$, $\lambda > 0$ is a hyper-parameter, and $\mu$ denotes $\|V\|_F$, which is apparently a local Lipschitz constant of $g(\cdot)$ [10]. Third, it holds that $(1 - p(x)_y)^2 \nu^2 \le p(x)_y (1 - p(x)_y)\,\nu^2 \le \mu^2/2$ (i.e., $\|\nabla\|^2 \le \|H\| \le \|V\|_F^2 / 2$), and thus we get a chained inequality of the regularizers. Without loss of generality, we write the regularizers in squared forms in Eq. (11) for direct comparison.

1. Within linear networks where all instances share the same (data-independent) $v_+$ and $v_-$, we can still have different prediction probabilities for different input instances.
2. Note that we probably have $r^* \neq \tilde{r}^*$.

Fig. 2: The adversarial robustness of obtained binary classification models evaluated with FGSM, PGD, DeepFool, and C&W's attacks: (a)-(d) for LeNet-300-100 and (e)-(h) for LeNet-5. Ten runs from different initializations were performed and the average results are illustrated for fair comparisons. The y axes of the four subfigures on the left are normalized to the same numerical scale, and so are the four on the right. It can be seen that penalizing $\nu^2/2$ and $\mu^2$ perform similarly.

One might have noticed that $\nu$ and $p(x)_y$ are also the only ingredients in two of the regularizers in Eq. (11).
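The expressions of Lemma 3.2 and Eq. (11) can be verified numerically on a locally linear binary classifier. The NumPy sketch below (ours, with random toy values for $V$ and $x$) compares the closed-form input-gradient against finite differences, evaluates the squared input-gradient regularizer, and checks the inequality $\nu^2/2 \le \mu^2$ implied by Eq. (12).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
v_plus, v_minus = rng.normal(size=n), rng.normal(size=n)
V = np.stack([v_plus, v_minus], axis=1)     # local linear map: g(x) = V^T x
x, y = rng.normal(size=n), +1

def p_correct(x):
    """Softmax probability of the true class (y = +1 picks the first logit)."""
    z = V.T @ x
    e = np.exp(z - z.max())
    p = e / e.sum()
    return p[0] if y == +1 else p[1]

def loss(x):
    return -np.log(p_correct(x))            # cross-entropy loss

# Finite-difference input-gradient vs. the closed form of Lemma 3.2.
eps = 1e-6
g_fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
p_y = p_correct(x)
g_closed = y * (p_y - 1.0) * (v_plus - v_minus)
assert np.allclose(g_fd, g_closed, atol=1e-5)

# Eq. (11): the squared input-gradient regularizer is (1 - p_y)^2 * nu^2 (lambda = 1).
nu = np.linalg.norm(v_plus - v_minus)
assert np.isclose(np.sum(g_closed**2), (1.0 - p_y)**2 * nu**2)
# And nu^2 / 2 <= mu^2 with mu = ||V||_F, as in Eq. (12).
assert nu**2 / 2.0 <= np.linalg.norm(V)**2 + 1e-12
```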
In the remainder of this subsection, we shall discuss and highlight that: (1) the input-gradient regularization and curvature regularization both enforce suppression of $\nu$, which is in principle consistent with a cross-Lipschitz regularization [10]; (2) though the Jacobian regularization focuses on $\mu$ instead of $\nu$, there probably exists an underlying equivalence between penalizing scaled $\nu$ and $\mu$; (3) critical discrepancies still exist amongst these regularizations, mostly about $p(x)_y$.

Cross-Lipschitz vs. Lipschitz:
With the clear expressions in Eqs. (7) and (11), we know that the input-gradient regularization and curvature regularization are similar to a cross-Lipschitz regularization that penalizes $\lambda \nu^2 / 2$ [10], while the Jacobian regularization penalizes $\lambda \mu^2$ (with a local Lipschitz constant $\mu$) and boils down to weight decay in single-layer perceptrons and linear classifiers. Although it seems as if the Jacobian regularization were different from the others, in light of the Parseval tight frame and Parseval networks [9],
we conjecture nonetheless that there exists an equivalence between penalizing scaled $\nu$ (as with the cross-Lipschitz regularization, input-gradient regularization, and curvature regularization) and $\mu$ (as with the Jacobian regularization). To shed light on this, more discussions are performed as follows.

First and foremost, it is self-evident that the inequality

$$\nu^2/2 \;\le\; \frac{1}{2}\|v_+\|^2 + |v_+^T v_-| + \frac{1}{2}\|v_-\|^2 \;\le\; \mu^2 \quad (12)$$

holds; thus one might argue that adopting the Jacobian regularization also implies a small $\nu$ in the obtained models, as with the cross-Lipschitz regularization. Second, for single-layer perceptrons, we can easily verify that the function $g(\cdot)$ is convex, and thus the Jacobian regularized training loss is strongly convex w.r.t. $V$. Considering that the columns of $V$ can be processed simultaneously by adding/subtracting a vector whilst the classification decision and cross-entropy loss won't change, we have $-v_+ = v_-$ for the optimal $V$, and an equivalence is achieved between penalizing $\lambda \nu^2 / 2$ and $\lambda \mu^2$ through derivation. The result naturally generalizes to DNNs with locally linear $g(\cdot)$ (i.e., DNNs with general ReLU activations) of our interest, if only the final layer is to be optimized. The following proposition makes this formal, and the proof can be found in Appendix D.

3. Interested readers can refer to Section G for rigorous analyses.

Proposition 3.4. (A derived equivalence).
For a single-layer perceptron or a piecewise linear DNN in which only the final layer parameterized by $W_d$ is to be optimized, we have the equivalence: for all $\lambda \ge 0$,

$$\operatorname*{argmin}_{W_d} \mathbb{E}_{(x,y)}\big[L(x, y; V) + \lambda \nu^2 / 2\big] = \operatorname*{argmin}_{W_d} \mathbb{E}_{(x,y)}\big[L(x, y; V) + \lambda \mu^2\big]. \quad (13)$$

In addition to the above results, we further show that the two regularizations can lead to the same gradient flow in certain scenarios. One example in which this can be demonstrated is when the first feature is uncorrelated with the label $y$ and the other $(n-1)$ features are distributed normally with the mean value being proportional to $y$ (i.e., they are weakly correlated with the label) [30]. We let $v_+ \leftarrow [0, a, \ldots, a]$ and $v_- \leftarrow [0, -a, \ldots, -a]$ approach the Bayes error rate. Under such circumstances, the two regularizations initialized from the Bayes classifier share the same gradient flow for their $V$ matrices, provided a $2\times$ smaller penalty on $\nu^2$ than on $\mu^2$, as in Eq. (13).

To test whether the revealed equivalence generalizes to practical scenarios, we conducted an experiment on distinguishing the digit "7" from "1" using MNIST images. Our experimental settings and many more details are carefully introduced in Appendix F. As suggested [6], [12], we first trained baseline models from scratch without any explicit regularization, then fine-tuned the models using different regularization strategies and evaluated the obtained adversarial robustness to FGSM [2], PGD [3], DeepFool [20], and the C&W attack [18]. We trained MLPs and convolutional networks with ReLU nonlinearity following the LeNet-300-100 and LeNet-5 architectures in prior work [31]. Figure 2 compares the performance of regularizations incorporating $\lambda \nu^2 / 2$ and $\lambda \mu^2$. With varying $\lambda$, it can be seen that the regularized models show similar robustness in almost all test cases. Similar results on CIFAR-10 with ResNets and VGG-like networks can be found in Appendix F.

Fig. 3: Different regularizers focus on samples with different prediction confidence.

"Confidence" in regularizations: Apart from suppressing the local (cross-)Lipschitz constants, the input-gradient regularizer and curvature regularizer both involve the prediction probability $p(x)_y$ in Eq.
(11), with different objectives though. By incorporating $(1 - p(x)_y)^2$, the input-gradient regularization encourages model predictions with high confidence. If $\nu$ is fixed, then the $p(x)_y$-related term in the input-gradient regularizer acts as an additional prediction loss during training. It has larger penalties and slopes (in absolute value) for the training instances with relatively smaller $p(x)_y$, i.e., lower confidence. Similarly, we know that the curvature regularization involves $p(x)_y^2 (1 - p(x)_y)^2$ and advocates a large $p(x)_y$ as well. However, as depicted by the green curve in Figure 3, the function exhibits a larger absolute value of slope at predictions with higher confidence, which is different from $p(x)_y (1 - p(x)_y)$ but consistent with the preference of $\|r^*\|_2$ as shown in the right panel of Figure 1. As for the cross-Lipschitz regularizer and Jacobian regularizer, no $p(x)_y$-related term is explicitly involved whatsoever.
4. See Eq. (11); the "regularizer" means the regularization term itself in this paper. Note that the cross-entropy term involves the prediction probability $p(x)_y$, of course.

Although it is unclear which of the tactics would be the most suitable one in practice, one might be aware that different choices perform dissimilarly; otherwise, we should have obtained functional equivalence for all these contestants. In order to figure out the best one in practice, we compared the robustness achieved via input-gradient regularization and curvature regularization empirically with our results using the cross-Lipschitz regularization and Jacobian regularization. As shown in Figure 4, the lately developed curvature regularization surpasses all its competitors with reasonably large $\lambda$ values, showing the superiority of its specific tactic of handling confident predictions. Notice that we retain the same numerical ranges of axes in Figure 4 as in Figure 2, but some newly drawn curves (for the curvature regularization) in Figure 4 may be too promising to stay within the plot.

4 MULTI-CLASS CLASSIFICATION
This section focuses on multi-class classification tasks. The notations are mostly the same as those in binary classification. Suppose there are $K$ possible labels for an instance, i.e., $y \in \{0, \ldots, K-1\}$ and $K \ge 3$; then for the discussed general ReLU networks, we have $n_d = K$. Similarly, there exists a polytope $Q(x)$ to which the input instance $x$ belongs and on which the network $g(\cdot)$ is linear, i.e.,

$$g(x')|_{x' \in Q(x)} = V^T x', \quad (14)$$

in which $V = [v_0, \ldots, v_{K-1}]$ is a matrix with its $j$-th column $v_j := W_1 D_1(x) \ldots W_{d-1} D_{d-1}(x)\, w_j$. For the properties of DNNs that are considered in the regularization strategies, we have the following lemma.

Lemma 4.1. (Simplified expressions for $J$, $\nabla$, and $H$ in multi-class classification). Given an input instance paired with the one-hot representation of its label $(x, y)$, we have for $J$, $\nabla$, and $H$:

$$J = V, \qquad \nabla = V(p(x) - y), \qquad H = V\big(\operatorname{diag}(p(x)) - p(x)\,p(x)^T\big)V^T. \quad (15)$$

Fig. 5: The robustness of obtained multi-class classification models evaluated with FGSM, PGD, DeepFool, and the C&W attacks: (a)-(d) for LeNet-300-100 and (e)-(h) for LeNet-5. Ten runs from different initializations were performed and the average results over the multiple runs are reported. The curvature regularization is not compared, as approximations seem inevitable in its multi-class implementation.

and

$$\|r^*\|_2 \ge \frac{1}{p(x)_y \|V\|_F}\Big(\sqrt{1 + \frac{2\,p(x)_y\,\xi}{1 - p(x)_y}} - 1\Big). \quad (17)$$

Considering that the value of $\xi = \log(K) + \log(p(x)_y)$ is determinate w.r.t. the prediction probability $p(x)_y$, we can conclude from Eq. (17) that the essential ingredients of such a lower bound are $p(x)_y$ and $\|V\|_F$ (i.e., the Frobenius norm of the $n \times K$ matrix $V$). Likewise, we can easily verify that $\|V\|_F$ is a local Lipschitz constant of $g(\cdot)$.
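Lemma 4.1 can likewise be checked numerically. The NumPy sketch below (ours; the sizes $n = 5$, $K = 4$ are arbitrary toy values) compares the closed-form gradient and Hessian of the cross-entropy loss against finite differences for a locally linear multi-class classifier.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 4
V = rng.normal(size=(n, K))        # local linear map g(x) = V^T x, as in Eq. (14)
x = rng.normal(size=n)
label = 2
y_onehot = np.eye(K)[label]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x):
    return -np.log(softmax(V.T @ x)[label])   # cross-entropy loss

p = softmax(V.T @ x)
g_closed = V @ (p - y_onehot)                            # grad = V (p(x) - y)
H_closed = V @ (np.diag(p) - np.outer(p, p)) @ V.T       # H = V (diag(p) - p p^T) V^T

# Central finite differences for the gradient and Hessian.
g_fd = np.array([(loss(x + 1e-6 * e) - loss(x - 1e-6 * e)) / 2e-6
                 for e in np.eye(n)])
eps = 1e-4
H_fd = np.array([[(loss(x + eps * (ei + ej)) - loss(x + eps * (ei - ej))
                   - loss(x + eps * (ej - ei)) + loss(x - eps * (ei + ej))) / (4 * eps**2)
                  for ej in np.eye(n)] for ei in np.eye(n)])
assert np.allclose(g_fd, g_closed, atol=1e-5)
assert np.allclose(H_fd, H_closed, atol=1e-4)
```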
Somewhat unsurprisingly, a property considered in the cross-Lipschitz regularization, defined as $\nu := \sum_{i,j} \|v_i - v_j\| / K$ [10], is involved in the other derived lower bound as given in Eq. (16). The results show that the local (cross-)Lipschitz constants and prediction probability are possibly still the essential ingredients of $\|r^*\|_2$. Apart from Proposition 4.1, we further know that the chained inequality $\|\nabla\|^2/2 \le \|H\| \le \|V\|_F^2/2$ holds by derivations from Lemma 4.1. More discussions similar to those made for binary classification in Section 3.2 will be given in Appendix E (right after the proof).

As in binary classification, we aim to study possible connections between regularizations penalizing a squared local Lipschitz constant $\mu := \|V\|_F$ and $\nu$. Experimental results are given to show a vague equivalence. The same MLP and convolutional architectures were adopted. Similar to the binary classification experiments, we trained multiple baseline models for each considered architecture and fine-tuned them using different regularizations. The same training and test policies were also kept. We report the average results of the obtained model robustness to FGSM, PGD, DeepFool, and the C&W attack in Figure 5. It can be seen that the Jacobian regularization and cross-Lipschitz regularization still perform similarly across all tested $\lambda$ values, except for the ones being too large to keep the models numerically stable. NaN was produced in Jacobian regularized LeNet-5 if $\lambda$ was further enlarged.

5 CONCLUSIONS

This paper aims at exploring and analyzing possible connections between recent network-property-based regularizations for improving the adversarial robustness of DNNs. While the empirical effectiveness of appropriate regularizations has been demonstrated in prior arts [6], [10], [11], [12], there still lacks a systematic understanding of their intrinsic functionality and connections.
We made some comparative analyses among these regularizations, and our achievements include:
• We have analyzed regularizations on DNNs with ReLU activations from a theoretical perspective.
• We have presented analytic expressions for the $\ell_2$ and $\ell_\infty$ magnitudes of some approximately-optimal adversarial perturbations, and we have shown that the local cross-Lipschitz constants and prediction probability are their essential ingredients in binary classification.
• We have demonstrated that the regularizations suggest either small Lipschitz constants or small cross-Lipschitz constants, and regularizing them can be equivalent. Yet, critical discrepancies still exist between specific regularizations, mostly in handling the prediction probability.
• We have verified that the curvature regularization [6] concerned in a very recent paper shows the most promising performance, and we have extended some of our analyses to multi-class classification and verified our findings with experiments.

REFERENCES

[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in ICLR, 2014.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[3] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.
[4] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," in ICLR, 2018.
[5] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," in ICLR, 2017.
[6] S.-M. Moosavi-Dezfooli, A. Fawzi, J. Uesato, and P. Frossard, "Robustness via curvature regularization, and vice versa," in CVPR, 2019.
[7] A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization," in NeurIPS, 1992.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov, “Dropout: a simple way to prevent neural net-works from overfitting,” The Journal of Machine Learning Research ,vol. 15, no. 1, pp. 1929–1958, 2014.[9] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier,“Parseval networks: Improving robustness to adversarial exam-ples,” in ICML , 2017.[10] M. Hein and M. Andriushchenko, “Formal guarantees on therobustness of a classifier against adversarial manipulation,” in NeurIPS , 2017.[11] A. S. Ross and F. Doshi-Velez, “Improving the adversarial robust-ness and interpretability of deep neural networks by regularizingtheir input gradients,” in AAAI , 2018.[12] D. Jakubovitz and R. Giryes, “Improving dnn robustness to adver-sarial attacks using jacobian regularization,” in ECCV , 2018.[13] C. Lyu, K. Huang, and H.-N. Liang, “A unified gradient regular-ization family for adversarial examples,” in ICDM , 2015.[14] J. Sokoli´c, R. Giryes, G. Sapiro, and M. R. Rodrigues, “Robustlarge margin deep neural networks,” IEEE Transactions on SignalProcessing , vol. 65, no. 16, pp. 4265–4280, 2017.[15] C.-J. Simon-Gabriel, Y. Ollivier, L. Bottou, B. Sch¨olkopf, andD. Lopez-Paz, “Adversarial vulnerability of neural networks in-creases with input dimension,” in ICML , 2019.[16] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, andA. Swami, “Practical black-box attacks against machine learning,”in Proceedings of the Asia Conference on Computer and Communica-tions Security , 2017.[17] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo:Zeroth order optimization based black-box attacks to deep neuralnetworks without training substitute models,” in Proceedings of the10th ACM Workshop on Artificial Intelligence and Security . ACM,2017, pp. 15–26.[18] N. Carlini and D. Wagner, “Towards evaluating the robustness ofneural networks,” in Proceedings of the IEEE Symposium on Securityand Privacy , 2017.[19] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik,and A. 
Swami, “The limitations of deep learning in adversarialsettings,” in Proceedings of the IEEE European Symposium on Securityand Privacy , 2016.[20] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “DeepFool: asimple and accurate method to fool deep neural networks,” in CVPR , 2016.[21] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C.-J. Hsieh, “Ead:elastic-net attacks to deep neural networks via adversarial exam-ples,” in AAAI , 2018.[22] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients givea false sense of security: Circumventing defenses to adversarialexamples,” in ICML , 2018.[23] Y. Guo, C. Zhang, C. Zhang, and Y. Chen, “Sparse dnns withimproved adversarial robustness,” in NeurIPS , 2018.[24] H. Drucker and Y. LeCun, “Double backpropagation increasinggeneralization performance,” in IJCNN , 1991.[25] V. Nair and G. E. Hinton, “Rectified linear units improve restrictedboltzmann machines,” in ICML , 2010.[26] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:Surpassing human-level performance on imagenet classification,”in CVPR , 2015.[27] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understandingand improving convolutional neural networks via concatenatedrectified linear units,” in ICML , 2016.[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in CVPR , 2016.[29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translationby jointly learning to align and translate,” in ICLR , 2015.[30] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry,“Robustness may be at odds with accuracy,” in ICLR , 2019.[31] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al. , “Gradient-basedlearning applied to document recognition,” Proceedings of the IEEE ,vol. 86, no. 11, pp. 2278–2324, 1998. On Connections between Regularizations forImproving DNN Robustness **Appendices** Yiwen Guo, Long Chen, Yurong Chen, and Changshui Zhang, Fellow, IEEE F A PPENDIX AP ROOF OF L EMMA Proof. 
According to the definition, it is self-evident that J = V. As for the input gradient, we have, according to the chain rule,

∇_x L(x, y) = −V(e_y − p(x)) = −y(1 − p(x)_y)(v₊ − v₋),   (18)

in which y ∈ {±1} and e_y is its one-hot vector representation. Similarly, the Hessian matrix of L(·, ·) w.r.t. x is

H = ∇_x (V(p(x) − e_y)) = V (∇_x p(x))^T = V (diag(p(x)) − p(x)p(x)^T) V^T = p(x)₊ p(x)₋ (v₊ − v₋)(v₊ − v₋)^T.   (19)

APPENDIX B
PROOF OF PROPOSITION 3.1

Proof. From the expression of H shown in Eq. (19), we know that H is a rank-1 positive semi-definite matrix, and its only nonzero eigenvalue is therefore the maximal eigenvalue (i.e., the spectral norm). Since the instance is correctly classified, we have p(x)_y > 0.5 and ‖v₊ − v₋‖ ≠ 0. Supposing that x and v₊ − v₋ have finite magnitudes, we further know p(x)_y ≠ 1. From

H∇ = p(x)_y (1 − p(x)_y) ‖v₊ − v₋‖² ∇,

we know that ∇ is an eigenvector of H corresponding to the eigenvalue p(x)_y(1 − p(x)_y)‖v₊ − v₋‖² > 0. Hence we have u = ±∇/‖∇‖ and further |∇^T u| = ‖∇‖. In consequence, the upper bound and the lower bound in Lemma 3.1 coincide at this point, and it follows that

‖r*‖ = (‖∇‖/‖H‖) (√(1 + 2‖H‖ξ/‖∇‖²) − 1)
     = ((1 − p(x)_y)‖v₊ − v₋‖ / (p(x)_y(1 − p(x)_y)‖v₊ − v₋‖²)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1)
     = (1/(p(x)_y ‖v₊ − v₋‖)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1).   (20)

(C. Zhang is with the Department of Automation, State Key Lab of Intelligence Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected]. Y. Guo and L. Chen contributed equally to this work.)

APPENDIX C
PROOF OF PROPOSITIONS 3.2 AND 3.3

We first provide our proof of Proposition 3.2, which gives an analytic expression for ‖r̃*‖∞, as below.

Proof. According to the definition, we have

‖r̃*‖∞ := min ‖r‖∞  s.t.  L(x, y) + ∇^T r + r^T H r / 2 ≥ β.   (21)

By substituting H with its expression given in Lemma 3.2, the above constraint can be written as

−ξ + ∇^T r + (p(x)_y / (2(1 − p(x)_y))) (∇^T r)² ≥ 0,   (22)

in which ξ := β − L(x, y). Now that ∇^T r is a scalar, we can treat (22) as a quadratic inequality, and equivalently we have

∇^T r ≥ ((1 − p(x)_y)/p(x)_y) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1)

or

∇^T r ≤ −((1 − p(x)_y)/p(x)_y) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) + 1).

Since ‖r‖∞ ≥ ∇^T r/‖∇‖₁ ≥ −‖r‖∞ holds for any r ∈ R^n, with the equalities attained at r = ‖r‖∞ sign(∇) and r = −‖r‖∞ sign(∇) respectively, we have

‖r̃*‖∞ = ((1 − p(x)_y)/(p(x)_y ‖∇‖₁)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1)
       = (1/(p(x)_y ‖v₊ − v₋‖₁)) (√(1 + 2p(x)_y ξ/(1 − p(x)_y)) − 1).   (23)

Then comes the proof of Proposition 3.3.

Proof. We aim at analyzing η* := max_{‖r‖∞ ≤ ε} L(x, y) + ∇^T r + r^T H r / 2. According to Hölder's inequality, for all r ∈ R^n satisfying ‖r‖∞ ≤ ε it holds that

∇^T r + r^T H r / 2 = ∇^T r + (p(x)_y / (2(1 − p(x)_y))) (∇^T r)²
                   ≤ ‖r‖∞ ‖∇‖₁ + (p(x)_y / (2(1 − p(x)_y))) (‖r‖∞ ‖∇‖₁)²
                   = ε ‖∇‖₁ + (p(x)_y / (2(1 − p(x)_y))) (ε ‖∇‖₁)².   (24)

By further substituting the vector ∇ with the expression given in Lemma 3.2, we have

∇^T r + r^T H r / 2 ≤ ε (1 − p(x)_y) ‖v₊ − v₋‖₁ + (1/2) ε² p(x)_y (1 − p(x)_y) ‖v₊ − v₋‖₁²   (25)

to complete our proof, and the equality is attained at r = ε · sign(∇).

APPENDIX D
PROOF OF PROPOSITION 3.4

Proof.
Let us denote by L₁(·) and L₂(·) the loss functions for training regularized by the Jacobian and the cross-Lipschitz strategies, respectively. That is, L₁(V) = −E[log p(x; V)_y] + λ₁μ and L₂(V) = −E[log p(x; V)_y] + λ₂ν, in which E[·] calculates the sample mean rather than the population mean. It is easy to verify that L₁(V) is strongly convex w.r.t. V for single-layer perceptrons and piecewise-linear DNNs in which only the final layer is to be optimized; thus there is a unique optimal solution to min_V L₁(V). Let us denote this optimal solution as V₁ = [v₊⁽¹⁾, v₋⁽¹⁾].

For binary classification in which y ∈ {+1, −1}, it holds for any matrix V = [v₊, v₋] that

−E[log p(x; V)_y] = −E[((1+y)/2) log p(x; V)₊ + ((1−y)/2) log p(x; V)₋]
                 = −E[((1+y)/2) log p(x; V)₊ + ((1−y)/2) log(1 − p(x; V)₊)].   (26)

By the definition of the softmax function, we have

p(x; V)₊ = exp(⟨v₊, x⟩) / (exp(⟨v₊, x⟩) + exp(⟨v₋, x⟩)) = exp(⟨v₊ − v₋, x⟩) / (exp(⟨v₊ − v₋, x⟩) + 1),   (27)

thus we can further write −E[log p(x; V)_y] = h(v₊ − v₋), in which h(·): R^n → R can easily be verified to be convex. We rewrite the loss function of Jacobian-regularized training as

L₁(V) = h(v₊ − v₋) + (1/2) λ₁ ν + (1/2) λ₁ ‖v₊ + v₋‖²,   (28)

considering μ = (ν + ‖v₊ + v₋‖²)/2. Apparently, the first two terms on the right-hand side of the above equation remain unchanged as long as the vector v₊ − v₋ does. Let us introduce V̂ = [v̂₊, v̂₋], in which v̂₊ = (v₊⁽¹⁾ − v₋⁽¹⁾)/2 and v̂₋ = −v̂₊. Now we have v̂₊ − v̂₋ = v₊⁽¹⁾ − v₋⁽¹⁾ and

L₁(V̂) − L₁(V₁) = (1/2)λ₁‖v̂₊ + v̂₋‖² − (1/2)λ₁‖v₊⁽¹⁾ + v₋⁽¹⁾‖² = −(1/2)λ₁‖v₊⁽¹⁾ + v₋⁽¹⁾‖² ≤ 0.   (29)

Recalling that V₁ is the optimal solution to min_V L₁(V), we also have 0 ≤ L₁(V̂) − L₁(V₁). Therefore, v₊⁽¹⁾ + v₋⁽¹⁾ = 0 must hold in order to avoid contradictions. The obtained equation eliminates the third term in Eq. (28). Thus we know that, for λ₂ = λ₁/2, it holds that

L₁(V₁) = L₂(V₁).   (30)

We now proceed similarly for an optimal solution to the problem min_V L₂(V). By writing L₂(V) = h(v₊ − v₋) + λ₂ν = l(v₊ − v₋), we can verify that the newly introduced function l(·): R^n → R is strongly convex as well. Let us denote the optimal solution to min_w l(w) as w₂; then, by further introducing V₂ = [v₊⁽²⁾, v₋⁽²⁾] with v₊⁽²⁾ = w₂/2 and v₋⁽²⁾ = −v₊⁽²⁾, we can verify that

L₁(V₂) = L₂(V₂).   (31)

According to Eqs. (30)-(31) and the definitions of the optimal solutions V₁ and V₂, we now have

L₁(V₁) = L₂(V₁) ≥ L₂(V₂) = L₁(V₂) ≥ L₁(V₁),   (32)

in which the two equalities follow from Eqs. (30) and (31) and the two inequalities from the optimality of V₂ and V₁. This leads to L₁(V) = L₂(V) for V being equal to V₁ or V₂, and we have proved the proposition.

APPENDIX E
PROOF OF PROPOSITION 4.1 AND MORE

Proof. For a K-class classification task with y ∈ {0, ..., K−1} and the cross-entropy loss, we still have from Lemma 3.1 that

(‖∇‖/‖H‖)(√(1 + 2‖H‖ξ/‖∇‖²) − 1) ≤ ‖r*‖ ≤ (|∇^T u|/‖H‖)(√(1 + 2‖H‖ξ/|∇^T u|²) − 1),   (33)

in which ξ := log(K) − L(x, y). In order to derive insightful bounds on ‖r*‖ with less entangled variables, we first analyze the involved network properties ‖∇‖ and ‖H‖, and then take advantage of the monotonicity of the lower bound given above.

From the expressions summarized in Lemma 4.1, we know that

‖∇‖ ≤ √2 (1 − p(x)_y) ‖V‖ ≤ √2 (1 − p(x)_y) ‖V‖_F,   (34)

and that ‖H‖ ≤ Σ_i p(x)_i ‖v_i − v̄‖², in which v̄ := V p(x). Even though it is difficult to compare K(K−1)ν/8 with ‖V‖²_F directly, we know from Proposition 4.1 that penalizing either of the two quantities contributes to improving DNN robustness, and in fact they have already been adopted in the cross-Lipschitz and Jacobian regularizations, respectively. We now discuss their connections with other network properties and regularizations.
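The relations between ‖∇‖, ‖H‖, and the regularized quantities can be sanity-checked numerically. The following NumPy sketch (our illustration, not the authors' code; all names are ours) builds a random softmax layer, forms H = V(diag(p) − pp^T)V^T as in Lemma 4.1, and checks the lower bound ‖H‖ ≥ p(x)_y‖v_y − v̄‖² (obtained from the sum of positive semi-definite rank-1 terms) together with a crude Frobenius upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 32, 10                                  # input dimension and number of classes
V = rng.normal(size=(n, K))                    # final-layer weights, columns v_i
x = rng.normal(size=n)

z = V.T @ x                                    # logits of the (locally) linear model
p = np.exp(z - z.max()); p /= p.sum()          # softmax probabilities
y = int(np.argmax(p))                          # treat the predicted class as the label

H = V @ (np.diag(p) - np.outer(p, p)) @ V.T    # Hessian of the CE loss w.r.t. x
v_bar = V @ p                                  # v_bar = V p(x)
grad = v_bar - V[:, y]                         # input gradient of the linearized model

spec_H = np.linalg.eigvalsh(H).max()           # spectral norm (H is PSD)
assert spec_H >= p[y] * np.dot(grad, grad) - 1e-9   # PSD-sum lower bound
assert spec_H <= (V ** 2).sum() / 2 + 1e-9          # crude Frobenius upper bound
```

Since p(x)_y ≥ 1/K for a correctly classified instance, the first assertion also implies the lower end of the chained inequality discussed here.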
As introduced in the main body of our paper, the chained inequality ‖∇‖²/K ≤ ‖H‖ ≤ ‖V‖²_F/2 holds, which is obtained by virtue of

‖H‖ ≤ 2 p(x)_y (1 − p(x)_y) ‖V‖²_F ≤ ‖V‖²_F / 2   (41)

and

‖H‖ = ‖Σ_i p(x)_i (v_i − v̄)(v_i − v̄)^T‖ ≥ p(x)_y ‖v_y − v̄‖² = p(x)_y ‖∇‖² ≥ ‖∇‖² / K,   (42)

in which v̄ := V p(x). Similarly, we can derive a chained inequality for the quantity adopted in the cross-Lipschitz regularization (and the first bound in Proposition 4.1) as ‖∇‖²/K ≤ ‖H‖ ≤ K(K−1)ν/8, by taking advantage of Eq. (42) and

‖H‖ ≤ (1/2) Σ_{i,j} p(x)_i p(x)_j ‖v_i − v_j‖² ≤ (1/8) Σ_{i≠j} ‖v_i − v_j‖² = K(K−1)ν / 8.   (43)

APPENDIX F
EXPERIMENTAL SETTINGS AND CIFAR-10 RESULTS

In this section, we first introduce our experimental settings on MNIST, involving the training and test policies, DNN architectures, evaluation metrics, etc., and then provide results on CIFAR-10.

Experimental settings: The official training/test split of MNIST [2] is utilized. As briefly introduced, we train/test with an MLP codenamed "LeNet-300-100" and a convolutional neural network codenamed "LeNet-5" on MNIST. The former is comprised of two parameterized fully-connected layers, and the latter contains two convolutional layers, two max-pooling layers, and two fully-connected layers. We trained 10 models each from different initializations as references, and fine-tuned the obtained models with the different regularizations discussed in this paper to evaluate their performance under adversarial attacks.

Fig. 6: The robustness of all obtained binary classification models evaluated with the FGSM, PGD, DeepFool, and C&W attacks on CIFAR-10: (a)-(d) for the four-layer convolutional network with batch normalization, (e)-(h) for the VGG-like network, and (i)-(l) for the ResNet. Ten runs from different initializations are performed and the average results are reported for fair comparisons.
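For concreteness, FGSM, the simplest of the attacks used for evaluation, can be sketched in a few lines on a locally linear binary model. This is a toy NumPy illustration under our own notation, not the CleverHans implementation used in the experiments:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(v, x, y):
    # y in {+1, -1}; L(x, y) = log(1 + exp(-y v^T x))
    return np.log1p(np.exp(-y * np.dot(v, x)))

def fgsm(v, x, y, eps):
    # single-step l_inf attack: move eps along the sign of the input gradient
    grad = -y * (1.0 - sigmoid(y * np.dot(v, x))) * v   # dL/dx for the linear model
    return x + eps * np.sign(grad)

rng = np.random.default_rng(1)
v = rng.normal(size=8)
x = rng.normal(size=8)
y = 1 if np.dot(v, x) > 0 else -1        # take the predicted label
x_adv = fgsm(v, x, y, eps=0.1)
assert logistic_loss(v, x_adv, y) >= logistic_loss(v, x, y)   # loss never decreases
```

On a linear model the FGSM step provably does not decrease the loss, since it moves the margin y v^T x by −ε‖v‖₁; iterative attacks such as PGD repeat this step with projection.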
All experiments were performed on a single NVIDIA Titan X GPU, and official implementations from the authors of the regularizations were adopted. TensorFlow [3] and CleverHans [4] were used.

We directly applied the training policies suggested in the Caffe model zoo [5] for LeNet-300-100 and LeNet-5, and we trained models on MNIST with a common batch size of 64 for 50,000 iterations such that they all reached the plateau. For the binary LeNet-300-100 and LeNet-5 models, we achieved prediction accuracies of . ± . and . ± . , respectively. To evaluate the adversarial robustness of DNN models, we chose four prevalent attacks, i.e., FGSM [6], PGD [7], DeepFool [8], and the C&W attack [9], two of which are ℓ∞ norm-based and the other two ℓ2 norm-based. With ε = 0. , the prediction accuracies of the reference models degraded significantly (to . ± . and . ± . ) on FGSM adversarial examples and (to . ± . and . ± . ) on PGD adversarial examples. In multi-class scenarios, we similarly had reference models with high prediction accuracies ( . ± . and . ± . for LeNet-300-100 and LeNet-5, respectively) on the benign test set, yet a deteriorating effect can be observed on the adversarial examples.

For training on MNIST, we regularized with various λ values chosen from { − , − , − , . , . , . , . , . }. Such a set should cover many of the values suggested for this hyper-parameter in the literature, and we also noticed in the experiments that further enlarging λ would probably cause numerical instability during training and generate NaN in the network gradients. In fact, with a λ as large as . on the tested dataset, most of the terms to be penalized have already become extremely small (typically with an order of magnitude ≤ ) on the training instances, so more attention should be paid to their numerical ranges and stability.
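The shrinking effect of a large λ described above can be reproduced on a toy problem. The sketch below (illustrative only; the model, data, and optimizer are ours) fine-tunes the weight vector w, playing the role of v₊ − v₋ in the binary softmax model, under a penalty of the cross-Lipschitz style ν = ‖w‖², and shows that the penalized term becomes much smaller as λ grows:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, lam, steps=2000, lr=0.1):
    """Minimize mean log-loss + lam * ||w||^2 by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the mean logistic loss plus the quadratic penalty
        grad = -(X * (y * (1.0 - sigmoid(margins)))[:, None]).mean(0) + 2.0 * lam * w
        w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 5))
y = np.sign(X @ rng.normal(size=5) + 1e-9)     # separable toy labels in {+1, -1}

nu_small = np.sum(fit(X, y, lam=1e-3) ** 2)    # weakly regularized
nu_large = np.sum(fit(X, y, lam=1.0) ** 2)     # strongly regularized
assert nu_large < nu_small                     # the penalized term shrinks with lambda
```

The same qualitative behavior underlies the numerical-range concern: once the penalized term is already tiny, further enlarging λ mostly amplifies round-off and gradient-scale issues.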
Curvature regularization, however, was stable over the range of tested λ, and we also observed that its robustness to the C&W attack increased to nearly 2.0 (i.e., the required magnitude of perturbations to successfully fool the model was ∼2.0) with λ = 6.

Experimental results on CIFAR-10: As introduced, we also performed experiments on CIFAR-10, with networks much deeper than those tested on MNIST. To be concrete, we chose a four-layer convolutional network similar to LeNet-5 but with batch normalization [10] incorporated, a VGG-like network [11] (incorporating twelve convolutional layers and two fully-connected layers), and a ResNet [12] (incorporating 31 convolutional layers, a single fully-connected layer, and no batch normalization). Again, we trained 10 models each from different initializations as references, and fine-tuned the obtained models with different regularizations. We trained them for 100,000 iterations to ensure convergence, and we decayed the learning rate 10-fold at iterations 60,000 and 80,000.

Fig. 7: The robustness of obtained multi-class classification models evaluated with the FGSM, PGD, DeepFool, and C&W attacks: (a)-(d) for the four-layer convolutional network incorporating batch normalization and (e)-(h) for the VGG-like DNN. The curvature regularization is not compared, as approximations seem inevitable in its multi-class implementation. NaN is triggered with λ = 0. on Jacobian- and cross-Lipschitz-regularized VGG-like models. ResNets are not evaluated due to limited computational resources.

We report the robustness of the regularized models to adversarial attacks in Figure 6. The hyper-parameter ε for ℓ∞ attacks was set to 0.05, and λ was chosen to be smaller (than the values on MNIST), from { − , − , − , − , − , . }, to guarantee a stable training process. We observed that even with the hyper-parameter as small as . , it is possible to produce NaN during the training of some multi-class models on CIFAR-10, while for the binary classification models, further increasing λ might lead to even stronger adversarial robustness. Nevertheless, performing a grid search for the best λ of each model is beyond the scope of this paper, so we might not achieve the optimal performance of the discussed methods in Figure 6. Multi-class results are provided in Figure 7.

APPENDIX G
THE LOGISTIC LOSS AND OPEN QUESTIONS

It is mentioned that our theoretical results in binary classification with the cross-entropy loss generalize to the case trained with the logistic loss, which calculates log(1 + exp(−y v^T x)) for an instance x with label y ∈ {±1}, in which v := W₁D₁(x) ··· W_{d−1}D_{d−1}(x) w_d is an n-dimensional vector. Let us now discuss this in more detail. Equivalently, we rewrite the logistic loss as −log(p(x)_y), in which p(x)₋ = 1/(1 + exp(v^T x)) and p(x)₊ = 1 − p(x)₋; hence, if V = [v, 0], we know by simple derivations that v plays the same role (in training with the logistic loss) as the vector v₊ − v₋ (in training with the cross-entropy loss). That is, what follows coincides with the results presented in the paper. Future work should include studies on other loss functions.

Just as mentioned in the main body of our paper, we consider the local (cross-)Lipschitz constants and the prediction probability p(x)_y separately in regularizers for simplicity, and our conclusions hold even if their mutual influence is rigorously analyzed. Regarding the mutual influence in binary classification, it is mostly on account of ν = ‖v₊ − v₋‖. Since the prediction probabilities p(x)₊ and p(x)₋ can be written as functions of ν and γ := |w̌^T x|, in which w̌ := (v₊ − v₋)/ν is a normalized vector, we cast the problem as discussing the monotonicity of ‖r*‖ in terms of ν with care.
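This monotonicity question can be probed numerically before any closed-form analysis. Writing p(x)_y = σ(γν) for a correctly classified instance, we have ‖∇‖² = ν²(1 − σ(γν))² and ‖H‖ = ν² σ(γν)(1 − σ(γν)); the sketch below (ours, for illustration) locates their maximizers over ν on a grid:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

gamma = 1.0
nu = np.linspace(1e-3, 6.0, 60000)        # grid over the local Lipschitz constant
s = sigmoid(gamma * nu)                   # p(x)_y as a function of nu

grad_sq = (nu * (1.0 - s)) ** 2           # ||grad||^2 = nu^2 (1 - p_y)^2
hess_norm = nu ** 2 * s * (1.0 - s)       # ||H|| = nu^2 p_y (1 - p_y)

crit_grad = nu[np.argmax(grad_sq)]        # maximizer of ||grad||^2
crit_hess = nu[np.argmax(hess_norm)]      # maximizer of ||H||
```

Both maximizers fall near γν ≈ 1.28 and γν ≈ 2.40, consistent with solving the stationarity conditions exactly; below either threshold, the corresponding regularizer is increasing in ν, so penalizing it encourages a smaller local Lipschitz constant.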
We emphasize that the situation is actually a bit complicated, as the monotonicity highly depends on the value of γ, and we accomplish the task by calculating the derivatives of the regularizers, including ‖∇‖² and ‖H‖, w.r.t. ν. Specifically, we have

∂‖∇‖²/∂ν = 2ν (1 − (γν − 1) exp(γν)) / (1 + exp(γν))³   (44)

and

∂‖H‖/∂ν = ν (2 + γν − (γν − 2) exp(γν)) exp(γν) / (1 + exp(γν))³.   (45)

By solving ∂‖∇‖²/∂ν = 0 and ∂‖H‖/∂ν = 0, we get the critical points as roughly 1.28/γ and 2.40/γ, respectively. That is, if νγ ≤ 1.28 and νγ ≤ 2.40 are fulfilled, then penalizing scaled ‖∇‖² and ‖H‖ indicates a smaller local Lipschitz constant ν, respectively. We evaluate νγ with models trained on MNIST and CIFAR-10, and we find that the conditions are satisfied for almost all training instances. More specifically, the average value of νγ on the LeNet-300-100 references is only roughly . and the largest value is roughly . . After training with regularizations, the values of both νγ and ν become orders of magnitude smaller, making the results rigorously hold for all instances. It can be interesting to evaluate in future work whether it is the gap between 1.28 and 2.40 that affects the regularization performance. Note that similar analyses can be made for the magnitude of r*, but we feel the result in that case makes less sense, since the definition of r* is subject to approximations.

APPENDIX H
MULTI-CLASS REGULARIZATIONS

We did not test the curvature regularization in multi-class scenarios, mostly because some approximations seem inevitable. With approximations, it is relatively difficult to tell from experimental results whether there is functional equivalence or not. As mentioned, through the lens of our study, it is possible to develop more regularization methods by consolidating the current ones and their essential ingredients. We would like to study this more carefully in future work and compare with current multi-class curvature approximators if possible.

REFERENCES

[1] M.
Hein and M. Andriushchenko, "Formal guarantees on the robustness of a classifier against adversarial manipulation," in NeurIPS, 2017.
[2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[4] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y.-L. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long, "Technical report on the CleverHans v2.1.0 adversarial examples library," arXiv preprint arXiv:1610.00768, 2018.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in MM, 2014.
[6] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[7] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.
[8] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: a simple and accurate method to fool deep neural networks," in CVPR, 2016.
[9] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Proceedings of the IEEE Symposium on Security and Privacy, 2017.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[11] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, "Structured Bayesian pruning via log-normal multiplicative noise," in NeurIPS, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in CVPR, 2015.

Yiwen Guo received the B.E. degree from Wuhan University, Wuhan, China, in 2011, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2016. He is a research scientist at Bytedance AI Lab, Beijing. Prior to this, he was a staff research scientist at Intel Labs China. His current research interests include computer vision, pattern recognition, and machine learning.

Long Chen received the B.S. degree in mathematics and the M.S. degree in data science from Peking University, in 2016 and 2019, respectively. His current research interests include machine learning, statistics, and financial data analysis.

Yurong Chen received the B.S. and Ph.D. degrees from Tsinghua University, Beijing, China, in 1998 and 2002, respectively. He joined Intel in 2004 after completing his postdoctoral research in the Institute of Software, CAS. He is currently a Principal Research Scientist and Director of the Cognitive Computing Lab at Intel Labs China, responsible for leading visual cognition and machine learning research for Intel platforms. He received one "Intel China Award" and three Intel Labs Academic Awards ("Gordy Awards") for delivering leading visual analytics and understanding technologies to impact Intel platforms/solutions. He has published over 60 papers and holds over 50 issued/pending patents.