Notes on Margin Training and Margin p-Values for Deep Neural Network Classifiers
George Kesidis, David J. Miller, and Zhen Xiang
Abstract
We provide a new local class-purity theorem for Lipschitz continuous DNN classifiers. In addition, we discuss how to achieve classification margin for training samples. Finally, we describe how to compute margin p-values for test samples.
I. INTRODUCTION
Robust DNNs have been proposed to defeat bounded-perturbation test-time evasion attacks, i.e., small perturbations added to nominal test samples so that their class decision changes. One family of approaches controls the Lipschitz continuity parameter and targets training-set classification margin. Estimation and engineering of the Lipschitz parameter for a DNN is discussed in, e.g., [11], [1], [2], [12], [14], [6], [4]. How to engineer class purity (class-decision consistency) in a convex neighborhood (open ball) of a certain size about every training sample is addressed in [12], [7]. In the following, we give an alternative local class-purity result. Also, we show how to achieve classification margin on training samples by choice of a simple "dual" training objective, cf. (8) and (9). We numerically show how margin-based training can result in reduced accuracy (by overfitting the training set). Finally, we define a p-value associated with classification margin.
II. MARGIN IN DNN CLASSIFIERS
Consider the DNN $f : \mathbb{R}^n \rightarrow (\mathbb{R}_+)^C$ where $C$ is the number of classes. Further suppose that for an input pattern $x \in \mathbb{R}^n$ to the DNN, the class decision is $\hat{c}(x) = \arg\max_i f_i(x)$, where $f_i$ is the $i$th component of the $C$-vector $f$. That is, we have defined a class-discriminant output layer of the DNN. Here assume that a class for $x$ is chosen arbitrarily among those that tie for the maximum. In the following, we assume that the functions $f_i$ are rectified:

$$\forall i, x:\quad f_i(x) \geq 0. \qquad (1)$$

Define the margin of $x$ as

$$\mu_f(x) := f_{\hat{c}(x)}(x) - \max_{i \neq \hat{c}(x)} f_i(x) \geq 0. \qquad (2)$$

(The authors are with the School of EECS, Pennsylvania State University, University Park, PA, 16803, USA. This research is supported by an AFOSR DDDAS grant and a Cisco URP gift. Email: {gik2,djm25,zux49}@psu.edu.)

Now suppose the $\ell_\infty/\ell_2$ Lipschitz continuity parameter $L_\infty$ of $f$, i.e., the smallest $L_\infty > 0$ satisfying

$$\forall x, y:\quad |f(x) - f(y)|_\infty \leq L_\infty \, |x - y|_2, \qquad (3)$$

is estimated. Note that we have used two different norms in this definition. Now consider samples in an open $\ell_2$ ball centered at $x$, i.e., $y \in B(x, \varepsilon) := \{z \in \mathbb{R}^n : |x - z|_2 < \varepsilon\}$ for $\varepsilon > 0$. The following locally consistent (robust) classification result is an example of a Lipschitz margin bound [12].
Theorem 2.1: If $f$ is $\ell_\infty/\ell_2$ Lipschitz continuous with parameter $L_\infty > 0$ and $\mu_f(x) > 0$, then $B\!\left(x, \frac{\mu_f(x)}{2 L_\infty}\right)$ is class pure.

Proof:
For any $y \in B(x, \mu_f(x)/(2 L_\infty))$, we have

$$\tfrac{1}{2}\mu_f(x) > L_\infty |x - y|_2 \geq |f(x) - f(y)|_\infty := \max_i |f_i(x) - f_i(y)|$$
$$\geq \max_i \big(|f_i(x)| - |f_i(y)|\big) \quad \text{(triangle inequality)}$$
$$= \max_i \big(f_i(x) - f_i(y)\big) \quad \text{(since } f_i \geq 0\text{)}$$
$$\geq f_{\hat{c}(x)}(x) - f_{\hat{c}(x)}(y).$$

So,

$$f_{\hat{c}(x)}(y) > f_{\hat{c}(x)}(x) - \tfrac{1}{2}\mu_f(x). \qquad (4)$$

If we instead write $|f_i(y)| - |f_i(x)|$ in the triangle inequality above and then replace $\hat{c}(x)$ by any $i \neq \hat{c}(x)$, we get that

$$\forall i \neq \hat{c}(x):\quad f_i(y) < f_i(x) + \tfrac{1}{2}\mu_f(x). \qquad (5)$$

So, by (4) and (5), $\forall i \neq \hat{c}(x)$,

$$f_i(y) < f_i(x) + \tfrac{1}{2}\mu_f(x) \leq f_{\hat{c}(x)}(x) - \tfrac{1}{2}\mu_f(x) \ \text{(by (2))} < f_{\hat{c}(x)}(y). \qquad \blacksquare$$

Theorem 2.1 is similar to Proposition 4.1 of [12]. Let the 2-norm Lipschitz parameter of $f$ be $L_2$, i.e., using the 2-norm on both sides of (3). Since $|z|_\infty \leq |z|_2$ for all $z$, $L_2 \geq L_\infty$. Without assuming $f$ is rectified as in (1), [12] shows that $y$ is assigned the same class as $x$ if $\mu_f(x) > \sqrt{2}\, L_2 |x - y|_2$; thus, $B(x, \mu_f(x)/(\sqrt{2} L_2))$ is class pure. Note that $\sqrt{2}\, L_2$ (Prop. 4.1 of [12]) may or may not be larger than $2 L_\infty$ (Theorem 2.1). On the other hand, if the right-hand side of (3) is changed to the $\ell_\infty$ norm, then using $|z|_2 \leq \sqrt{n}\, |z|_\infty$ for all $z$, and arguing as for Theorem 2.1, leads to a weaker result than Prop. 4.1 of [12] (especially when $n \gg 1$).
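To make these quantities concrete, the following is a minimal sketch (ours, not from the paper) of computing the margin (2) and the certified $\ell_2$ radius of Theorem 2.1 from a vector of rectified DNN outputs, assuming an estimate of $L_\infty$ is available (e.g., via the methods of [14], [6], [4]):

```python
import numpy as np

def margin(f_x: np.ndarray) -> float:
    """Margin (2): top output minus the largest competing output.
    f_x is the C-vector of rectified DNN outputs f(x)."""
    c_hat = int(np.argmax(f_x))
    competitors = np.delete(f_x, c_hat)
    return float(f_x[c_hat] - competitors.max())

def certified_radius(f_x: np.ndarray, L_inf: float) -> float:
    """Radius of the class-pure ell_2 ball from Theorem 2.1:
    mu_f(x) / (2 * L_inf), where L_inf is the ell_inf/ell_2
    Lipschitz parameter of f (assumed estimated elsewhere)."""
    return margin(f_x) / (2.0 * L_inf)

# Example usage with a hypothetical 4-class output vector.
f_x = np.array([0.1, 3.2, 0.0, 1.5])
print(margin(f_x))                       # 1.7
print(certified_radius(f_x, L_inf=5.0))  # 0.17
```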
III. MARGIN TRAINING

Robust training is surveyed in [13]. Lipschitz margin training to achieve a class-pure convex neighborhood (open ball) of prescribed size about every training sample is discussed in [12], combining margin training (2) and Lipschitz continuity parameter control. (Also see, e.g., [2] for Lipschitz parameter control and the margin-gradient bounding approach of [9].) [7] relaxes the constraints of ReLU-based classifiers toward this same objective (assuming ReLU neurons with bounded outputs). For a given classifier, the approach of [7] can also check the class purity of a prescribed-size convex neighborhood of test samples; using this method to detect small-perturbation test-time evasion attacks may have a significant false-positive rate. Generally, these methods cannot certify that a test sample is not test-time evasive if the associated perturbation is larger than the prescribed neighborhood size, and they may be associated with a reduction in classification accuracy [12], [9].

We focus herein on just achieving a prescribed margin (2) for training samples.

Let $\theta$ represent the DNN parameters. Let $\mathcal{T}$ represent the training dataset and let $c(x)$ for any $x \in \mathcal{T}$ be the ground-truth class of $x$. The following is easily generalized to sample-dependent margins ($\mu(x) > 0$). [12] suggests adding the margin "to all elements in logits except for the index corresponding to" $c(x)$. For example, train the DNN by finding:

$$\min_\theta -\sum_{x \in \mathcal{T}} \log\!\left(\frac{f_{c(x)}(x)}{\sum_{i \neq c(x)} (f_i(x) + \mu)}\right) = \min_\theta -\sum_{x \in \mathcal{T}} \log\!\left(\frac{f_{c(x)}(x)}{(C-1)\mu + \sum_{i \neq c(x)} f_i(x)}\right) \qquad (6)$$

For a softmax example, one could train the DNN using the modified cross-entropy loss:

$$\min_\theta -\sum_{x \in \mathcal{T}} \log\!\left(\frac{e^{f_{c(x)}(x)}}{e^{f_{c(x)}(x)} + \sum_{i \neq c(x)} e^{f_i(x) + \mu}}\right) \qquad (7)$$

(Exponentiation is unnecessary when $\forall x, i$, $f_i(x) \geq 0$, i.e., the DNN outputs are rectified.)

These DNN objectives do not guarantee that the margins for all training samples will be met. Alternatively, one can perform (dual) optimization of the weighted margin constraints, e.g.,

$$\min_\theta \sum_{x \in \mathcal{T}} \lambda_x \left(\frac{\max_{i \neq c(x)} f_i(x) + \mu - f_{c(x)}(x)}{(C-1)\mu + \sum_j f_j(x)}\right), \qquad (8)$$

or just

$$\min_\theta \sum_{x \in \mathcal{T}} \lambda_x \left(\max_{i \neq c(x)} f_i(x) + \mu - f_{c(x)}(x)\right), \qquad (9)$$

where the DNN mappings $f_i$ obviously depend on the DNN parameters $\theta$, and the weights $\lambda_x \geq 0$ $\forall x \in \mathcal{T}$. For a hyperparameter $\delta > 1$, training can proceed simply as follows (see the code sketch at the end of this section):

0. Select initially equal $\lambda_x > 0$, say $\lambda_x = 1$ $\forall x \in \mathcal{T}$.
1. Optimize over $\theta$ (train the DNN).
2. If all margin constraints are satisfied, then stop.
3. For all $x \in \mathcal{T}$: if margin constraint $x$ is not satisfied, then $\lambda_x \rightarrow \delta \lambda_x$.
4. Go to step 1.

Again, the parameters of the previous DNN can initialize the training of the next, and the initial DNN can instead be trained using a logit or cross-entropy loss objective, as above. There are many other variations, including also decreasing $\lambda_x$ when the $x$-constraint is satisfied, additively (rather than exponentially) increasing $\lambda_x$ when it is not, or changing $\lambda_x$ in a way that depends on the degree of the corresponding margin violation.

Given a thus margin-trained classifier, one could estimate its Lipschitz continuity parameter, e.g., [14], [6], [4], and apply Theorem 2.1 or Proposition 4.1 of [12] to determine a region of class purity around each training sample.
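As an illustration only (not the authors' code), the following is a minimal PyTorch-style sketch of the dual training loop above using the per-sample hinge objective (9); the outer loop multiplies $\lambda_x$ by $\delta$ for samples whose margin constraint is violated. Function and variable names (e.g., `margin_loss`, `delta`) are ours, and the model, optimizer, and a data loader that also yields sample indices are assumed to be defined.

```python
import torch

def margin_loss(logits, labels, lam, mu):
    """Per-batch weighted hinge objective (9):
    sum_x lam_x * (max_{i != c(x)} f_i(x) + mu - f_{c(x)}(x))."""
    true_out = logits.gather(1, labels.view(-1, 1)).squeeze(1)   # f_{c(x)}(x)
    masked = logits.clone()
    masked.scatter_(1, labels.view(-1, 1), float('-inf'))        # exclude the true class
    runner_up = masked.max(dim=1).values                         # max_{i != c(x)} f_i(x)
    return (lam * (runner_up + mu - true_out)).sum()

def dual_margin_training(model, optimizer, loader, n_train, mu=50.0,
                         delta=2.0, epochs_per_round=200, max_rounds=5):
    lam = torch.ones(n_train)                                    # step 0: lambda_x = 1
    for _ in range(max_rounds):
        for _ in range(epochs_per_round):                        # step 1: optimize theta
            for x, y, idx in loader:                             # idx: sample indices in T
                optimizer.zero_grad()
                loss = margin_loss(model(x), y, lam[idx], mu)
                loss.backward()
                optimizer.step()
        # steps 2-3: check margin constraints and reweight violators
        violated = torch.zeros(n_train, dtype=torch.bool)
        with torch.no_grad():
            for x, y, idx in loader:
                logits = model(x)
                true_out = logits.gather(1, y.view(-1, 1)).squeeze(1)
                masked = logits.clone()
                masked.scatter_(1, y.view(-1, 1), float('-inf'))
                violated[idx] = (true_out - masked.max(dim=1).values) < mu
        if not violated.any():
            break                                                # step 2: all margins met
        lam[violated] *= delta                                   # step 3: lambda_x -> delta*lambda_x
    return model, lam
```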
IV. SOME NUMERICAL RESULTS FOR CLASSIFICATION MARGIN

In this section, we give an example using loss function (9). Training was performed on CIFAR-10 (50000 training samples and 10000 test/held-out samples) using the ResNet-18 DNN (ReLU activations are not used after the fully connected layer). The training was performed for 200 epochs with a batch size of 32. The results for margins $\mu = 50$ and $\mu = 150$ are given in Figures 1 and 2 and Table I.

All training-sample margins were achieved with one training pass using initial $\lambda_x = 1$ for all $x \in \mathcal{T}$; see Figures 1(a) and 2(a). Figures 1(b) and 2(b) show the margins of the dataset held out from training, i.e., the true class label was used to compute the margins. Here one can clearly see that many test samples have margins less than $\mu$ and some are misclassified (negative margins), cf. Table I. Figures 1(c) and 2(c) show the margins based on the class decisions of the classifiers themselves, as would be the case for unlabelled test samples (so no measured margin is negative). The held-out set and test set are the same. Finally, Figures 1(d) and 2(d) show the margins of FGSM [5] adversarial samples created using a surrogate ResNet-18 DNN of the same structure trained using standard cross-entropy loss (all such samples were used, including those based on the test samples that were misclassified by the surrogate).

In Table I, we show the accuracy of the classifiers, including a baseline classifier trained using the same dataset and ResNet-18 DNN structure but with the standard cross-entropy loss objective. As in Figures 1(d) and 2(d), the accuracy reported here is for FGSM adversarial samples that were crafted assuming the attacker knows the baseline DNN trained by cross-entropy loss. These attacks are then transferred to the margin-trained classifiers.

Fig. 1. After training using (9) with margin $\mu = 50$, resulting histograms of margins of: (a) training samples; (b) labelled samples held out from the training dataset; (c) test dataset (labels unknown, so decisions by the classifier itself are used to determine margin here); and (d) FGSM samples created from the test dataset of (c). Note that the sample values in cases (b) and (c) are the same.

training objective →    cross-entropy loss    margin µ = 50    margin µ = 150
clean test set          86.70%                85.49%           85.37%
FGSM attacks            6.017%                10.08%           10.08%
TABLE I. Test-time accuracy. Note that the FGSM attacks were created using the DNN trained with cross-entropy loss and transferred to the margin-trained DNNs. The FGSM attacks were based on all test samples, including those that were misclassified by the DNN trained with cross-entropy loss.
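For reference, the FGSM attack [5] used above perturbs each test input in the direction of the sign of the loss gradient, computed on the surrogate (cross-entropy-trained) model. A minimal sketch (ours, not the authors' code; the surrogate model and the value of the perturbation strength eps are assumptions) is:

```python
import torch
import torch.nn.functional as F

def fgsm(surrogate, x, y, eps):
    """FGSM [5]: x_adv = x + eps * sign(grad_x CE(surrogate(x), y)),
    crafted on a surrogate model and then transferred to the target classifier."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in a valid range (assumed [0, 1])
```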
V. LOW-MARGIN ATYPICALITY OF TEST SAMPLES

Given an arbitrary DNN $f : \mathbb{R}^n \rightarrow (\mathbb{R}_+)^C$, let $\mathcal{T}_\kappa$ be the (clean) training samples of class $\kappa \in \{1, 2, \ldots, C\}$, i.e., $\forall x \in \mathcal{T}_\kappa$, $\hat{c}(x) = c(x) = \kappa$. Recall (2) and suppose a Gaussian Mixture Model (GMM) is learned on the log-margins of the training dataset, $\{\log \mu_f(x) : x \in \mathcal{T}_\kappa\}$, by EM [3] using BIC model-order control [10], as in, e.g., [8]. (Instead of the margin (2), one could use an estimate of the radius of the largest $\ell_2$ ball of class purity about each training and test sample, e.g., directly [7] or via an estimated Lipschitz constant as discussed above.) Let the resulting GMM parameters be $\{w_i, m_i, \sigma_i\}_{i=1}^{I_\kappa}$, where $I_\kappa \leq |\mathcal{T}_\kappa|$ is the number of components, the $w_i \geq 0$ are their weights ($\sum_{i=1}^{I_\kappa} w_i = 1$), the $m_i$ are their means, and the $\sigma_i > 0$ are their standard deviations.

Fig. 2. After training using (9) with margin $\mu = 150$, resulting histograms of margins of: (a) training samples; (b) labelled samples held out from the training dataset; (c) test dataset (labels unknown, so decisions by the classifier itself are used to determine margin here); and (d) FGSM samples created from the test dataset of (c). Note that the sample values in cases (b) and (c) are the same.

So, we can simply compute the margin p-value of any test sample $x$:

$$\pi_f(x) = \sum_{i=1}^{I_\kappa} w_i \left(1 - F\!\left(\frac{|\log(\mu_f(x)) - m_i|}{\sigma_i}\right)\right),$$

where $F$ is the standard normal c.d.f. That is, $\pi_f(x)$ is the probability that a randomly chosen sample from the same distribution as that of the training samples has smaller margin than the test sample $x$. So, one can compare $\pi_f(x)$ to a threshold to detect whether a test sample $x$ has an abnormally small classification margin. For the example margin-trained DNNs of Figures 1(a) and 2(a), the fitted GMM has a single component for the entire training set $\mathcal{T} = \bigcup_{\kappa=1}^C \mathcal{T}_\kappa$. In an unsupervised fashion, the threshold criterion could be a bound on false positives based on the training set. Alternatively, the threshold could be set using a clean set of labelled samples that were held out from (not used for) training, considering both false-positive and false-negative performance.
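A minimal sketch (ours) of this detection statistic follows, using scikit-learn's GaussianMixture with BIC model-order selection to stand in for the EM/BIC procedure of [3], [10], [8]; the function names and the BIC search range are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def fit_log_margin_gmm(train_margins, max_components=10):
    """Fit a 1-D GMM to log-margins of (correctly classified) training samples,
    selecting the number of components by BIC."""
    z = np.log(train_margins).reshape(-1, 1)
    best = min((GaussianMixture(n_components=k, random_state=0).fit(z)
                for k in range(1, max_components + 1)),
               key=lambda g: g.bic(z))
    w = best.weights_
    m = best.means_.ravel()
    s = np.sqrt(best.covariances_.ravel())   # 1-D case: covariances are variances
    return w, m, s

def margin_p_value(test_margin, w, m, s):
    """pi_f(x) = sum_i w_i * (1 - F(|log mu_f(x) - m_i| / sigma_i)),
    with F the standard normal c.d.f."""
    z = np.log(test_margin)
    return float(np.sum(w * (1.0 - norm.cdf(np.abs(z - m) / s))))

# Usage: flag test samples whose margin p-value falls below a threshold.
# w, m, s = fit_log_margin_gmm(train_margins)
# is_suspicious = margin_p_value(mu_test, w, m, s) < 0.01
```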
REFERENCES

[1] P. Bartlett, D. Foster, and M. Telgarsky. Spectrally-normalized Margin Bounds for Neural Networks. In Proc. NIPS, 2017.
[2] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. In Proc. ICML, 2017.
[3] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[4] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G.J. Pappas. Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. https://arxiv.org/pdf/1906.04893.pdf, 2019.
[5] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In Proc. ICLR, 2015.
[6] H. Gouk, E. Frank, and B. Pfahringer. Regularisation of Neural Networks by Enforcing Lipschitz Continuity. https://arxiv.org/pdf/1804.04368.pdf, Sept. 2018.
[7] J. Kolter and E. Wong. Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope. In Proc. ICML, 2018.
[8] D.J. Miller, Z. Qiu, and G. Kesidis. Parsimonious Cluster-based Anomaly Detection (PCAD). In Proc. IEEE MLSP, Aalborg, Denmark, Sept. 2018.
[9] A. Raghunathan, J. Steinhardt, and P. Liang. Certified Defenses against Adversarial Examples. In Proc. ICLR, 2018.
[10] G. Schwarz. Estimating the Dimension of a Model. Annals of Statistics, 6(2):461–464, 1978.
[11] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing Properties of Neural Networks. In Proc. ICLR, 2014.
[12] Y. Tsuzuku, I. Sato, and M. Sugiyama. Lipschitz-margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks. In Proc. NIPS, 2018.
[13] S. Wang, Y. Chen, A. Abdou, and S. Jana. MixTrain: Scalable Training of Verifiably Robust Neural Networks. https://arxiv.org/abs/1811.02625, Nov. 2018.
[14] T.-W. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, D. Boning, I.S. Dhillon, and L. Daniel. Towards Fast Computation of Certified Robustness for ReLU Networks. In Proc. ICML, 2018.