Notes on Margin Training and Margin p-Values for Deep Neural Network Classifiers
George Kesidis, David J. Miller, and Zhen Xiang
Abstract
We provide a new local class-purity theorem for Lipschitz continuous DNN classifiers. In addition, we discuss how to achieve classification margin for training samples. Finally, we describe how to compute margin p-values for test samples.
I. INTRODUCTION
Robust DNNs have been proposed to defeat bounded-perturbation test-time evasion attacks, i.e., small perturbations added to nominal test samples so that their class decision changes. One family of approaches controls the Lipschitz continuity parameter and targets training-set classification margin. Estimation and engineering of the Lipschitz parameter for a DNN is discussed in, e.g., [11], [1], [2], [12], [14], [6], [4]. How to engineer class purity (class-decision consistency) in a convex neighborhood (open ball) of a certain size about every training sample is addressed in [12], [7]. In the following, we give an alternative local class-purity result. Also, we show how to achieve classification margin on training samples by choice of a simple "dual" training objective, cf. (8) and (9). We numerically show how margin-based training can result in reduced accuracy (by overfitting the training set). Finally, we define a p-value associated with classification margin.
II. MARGIN IN DNN CLASSIFIERS
Consider the DNN $f : \mathbb{R}^n \rightarrow (\mathbb{R}_+)^C$ where $C$ is the number of classes. Further suppose that for an input pattern $x \in \mathbb{R}^n$ to the DNN, the class decision is $\hat{c}(x) = \arg\max_i f_i(x)$, where $f_i$ is the $i$th component of the $C$-vector $f$. That is, we have defined a class-discriminant output layer of the DNN. Here assume that a class for $x$ is chosen arbitrarily among those that tie for the maximum. In the following, we assume that the functions $f_i$ are rectified:

$$\forall i, x:\quad f_i(x) \geq 0. \qquad (1)$$

Define the margin of $x$ as

$$\mu_f(x) := f_{\hat{c}(x)}(x) - \max_{i \neq \hat{c}(x)} f_i(x) \geq 0. \qquad (2)$$

(The authors are with the School of EECS, Pennsylvania State University, University Park, PA, 16803, USA. This research is supported by an AFOSR DDDAS grant and a Cisco URP gift. Email: {gik2,djm25,zux49}@psu.edu.)

Now suppose the $\ell_\infty/\ell_2$ Lipschitz continuity parameter $L_\infty$ of $f$, i.e., the smallest $L_\infty > 0$ satisfying

$$\forall x, y:\quad |f(x) - f(y)|_\infty \leq L_\infty \, |x - y|_2, \qquad (3)$$

is estimated. Note that we have used two different norms in this definition. Now consider samples in an open $\ell_2$ ball centered at $x$, i.e., $y \in B(x, \varepsilon) := \{z \in \mathbb{R}^n : |x - z|_2 < \varepsilon\}$ for $\varepsilon > 0$. The following locally consistent (robust) classification result is an example of a Lipschitz margin bound [12].
Theorem 2.1: If $f$ is $\ell_\infty/\ell_2$ Lipschitz continuous with parameter $L_\infty > 0$ and $\mu_f(x) > 0$, then $B\!\left(x, \frac{\mu_f(x)}{2 L_\infty}\right)$ is class pure.

Proof:
For any $y \in B(x, \mu_f(x)/(2 L_\infty))$, we have

$$\tfrac{1}{2}\mu_f(x) > L_\infty |x - y|_2 \geq |f(x) - f(y)|_\infty := \max_i |f_i(x) - f_i(y)|$$
$$\geq \max_i \big(|f_i(x)| - |f_i(y)|\big) \quad \text{(triangle inequality)}$$
$$= \max_i \big(f_i(x) - f_i(y)\big) \quad \text{(since } f_i \geq 0\text{)}$$
$$\geq f_{\hat{c}(x)}(x) - f_{\hat{c}(x)}(y).$$

So,

$$f_{\hat{c}(x)}(y) > f_{\hat{c}(x)}(x) - \tfrac{1}{2}\mu_f(x). \qquad (4)$$

If we instead write $|f_i(y)| - |f_i(x)|$ in the triangle inequality above and then replace $\hat{c}(x)$ by any $i \neq \hat{c}(x)$, we get that

$$\forall i \neq \hat{c}(x):\quad f_i(y) < f_i(x) + \tfrac{1}{2}\mu_f(x). \qquad (5)$$

So, by (4) and (5), $\forall i \neq \hat{c}(x)$,

$$f_i(y) < f_i(x) + \tfrac{1}{2}\mu_f(x) \leq f_{\hat{c}(x)}(x) - \tfrac{1}{2}\mu_f(x) \ \text{(by (2))} < f_{\hat{c}(x)}(y). \qquad \blacksquare$$

Theorem 2.1 is similar to Proposition 4.1 of [12]. Let the 2-norm Lipschitz parameter of $f$ be $L_2$, i.e., using the 2-norm on both sides of (3). Since $|z|_\infty \leq |z|_2$ for all $z$, $L_2 \geq L_\infty$. Without assuming $f$ is rectified as in (1), [12] shows that $y$ is assigned the same class as $x$ if $\mu_f(x) > \sqrt{2}\, L_2 |x - y|_2$; thus, $B(x, \mu_f(x)/(\sqrt{2} L_2))$ is class pure. Note that $\sqrt{2}\, L_2$ (Prop. 4.1 of [12]) may or may not be larger than $2 L_\infty$ (Theorem 2.1). On the other hand, if the right-hand side of (3) is changed to the $\ell_\infty$ norm, then using $|z|_2 \leq \sqrt{n}\, |z|_\infty$ for all $z$, and arguing as for Theorem 2.1, leads to a weaker result than Prop. 4.1 of [12] (especially when $n \gg 1$).
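To make these quantities concrete, the following is a minimal sketch (ours, not from the paper) of computing the margin (2) and the certified $\ell_2$ radius of Theorem 2.1 from a vector of rectified DNN outputs, assuming an estimate of $L_\infty$ is available (e.g., via the methods of [14], [6], [4]):

```python
import numpy as np

def margin(f_x: np.ndarray) -> float:
    """Margin (2): top output minus the largest competing output.
    f_x is the C-vector of rectified DNN outputs f(x)."""
    c_hat = int(np.argmax(f_x))
    competitors = np.delete(f_x, c_hat)
    return float(f_x[c_hat] - competitors.max())

def certified_radius(f_x: np.ndarray, L_inf: float) -> float:
    """Radius of the class-pure ell_2 ball from Theorem 2.1:
    mu_f(x) / (2 * L_inf), where L_inf is the ell_inf/ell_2
    Lipschitz parameter of f (assumed estimated elsewhere)."""
    return margin(f_x) / (2.0 * L_inf)

# Example usage with a hypothetical 4-class output vector.
f_x = np.array([0.1, 3.2, 0.0, 1.5])
print(margin(f_x))                       # 1.7
print(certified_radius(f_x, L_inf=5.0))  # 0.17
```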
III. MARGIN TRAINING

Robust training is surveyed in [13]. Lipschitz margin training to achieve a class-pure convex neighborhood (open ball) of prescribed size about every training sample is discussed in [12], combining margin training (2) and Lipschitz continuity parameter control. (Also see, e.g., [2] for Lipschitz parameter control and the margin-gradient bounding approach of [9].) [7] relaxes the constraints of ReLU-based classifiers toward this same objective (assuming ReLU neurons with bounded outputs). For a given classifier, the approach of [7] can also check the class purity of a prescribed-size convex neighborhood of test samples; using this method to detect small-perturbation test-time evasion attacks may have a significant false-positive rate. Generally, these methods cannot certify that a test sample is not test-time evasive if the associated perturbation is larger than the prescribed neighborhood size, and they may be associated with a reduction in classification accuracy [12], [9].

We focus herein on just achieving a prescribed margin (2) for training samples.

Let $\theta$ represent the DNN parameters. Let $\mathcal{T}$ represent the training dataset and let $c(x)$ for any $x \in \mathcal{T}$ be the ground-truth class of $x$. The following is easily generalized to sample-dependent margins ($\mu(x) > 0$). [12] suggests adding the margin "to all elements in logits except for the index corresponding to" $c(x)$. For example, train the DNN by finding:

$$\min_\theta -\sum_{x \in \mathcal{T}} \log\!\left(\frac{f_{c(x)}(x)}{\sum_{i \neq c(x)} (f_i(x) + \mu)}\right) = \min_\theta -\sum_{x \in \mathcal{T}} \log\!\left(\frac{f_{c(x)}(x)}{(C-1)\mu + \sum_{i \neq c(x)} f_i(x)}\right) \qquad (6)$$

For a softmax example, one could train the DNN using the modified cross-entropy loss:

$$\min_\theta -\sum_{x \in \mathcal{T}} \log\!\left(\frac{e^{f_{c(x)}(x)}}{e^{f_{c(x)}(x)} + \sum_{i \neq c(x)} e^{f_i(x) + \mu}}\right) \qquad (7)$$

(Exponentiation is unnecessary when $\forall x, i$, $f_i(x) \geq 0$, i.e., the DNN outputs are rectified.)

These DNN objectives do not guarantee that the margins for all training samples will be met. Alternatively, one can perform (dual) optimization of the weighted margin constraints, e.g.,

$$\min_\theta \sum_{x \in \mathcal{T}} \lambda_x \left(\frac{\max_{i \neq c(x)} f_i(x) + \mu - f_{c(x)}(x)}{(C-1)\mu + \sum_j f_j(x)}\right), \qquad (8)$$

or just

$$\min_\theta \sum_{x \in \mathcal{T}} \lambda_x \left(\max_{i \neq c(x)} f_i(x) + \mu - f_{c(x)}(x)\right), \qquad (9)$$

where the DNN mappings $f_i$ obviously depend on the DNN parameters $\theta$, and the weights $\lambda_x \geq 0$ $\forall x \in \mathcal{T}$. For a hyperparameter $\delta > 1$, training can proceed simply as follows (see the code sketch at the end of this section):

0. Select initially equal $\lambda_x > 0$, say $\lambda_x = 1$ $\forall x \in \mathcal{T}$.
1. Optimize over $\theta$ (train the DNN).
2. If all margin constraints are satisfied, then stop.
3. For all $x \in \mathcal{T}$: if margin constraint $x$ is not satisfied, then $\lambda_x \rightarrow \delta \lambda_x$.
4. Go to step 1.

Again, the parameters of the previous DNN can initialize the training of the next, and the initial DNN can instead be trained using a logit or cross-entropy loss objective, as above. There are many other variations, including also decreasing $\lambda_x$ when the $x$-constraint is satisfied, additively (rather than exponentially) increasing $\lambda_x$ when it is not, or changing $\lambda_x$ in a way that depends on the degree of the corresponding margin violation.

Given a thus margin-trained classifier, one could estimate its Lipschitz continuity parameter, e.g., [14], [6], [4], and apply Theorem 2.1 or Proposition 4.1 of [12] to determine a region of class purity around each training sample.
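As an illustration only (not the authors' code), the following is a minimal PyTorch-style sketch of the dual training loop above using the per-sample hinge objective (9); the outer loop multiplies $\lambda_x$ by $\delta$ for samples whose margin constraint is violated. Function and variable names (e.g., `margin_loss`, `delta`) are ours, and the model, optimizer, and a data loader that also yields sample indices are assumed to be defined.

```python
import torch

def margin_loss(logits, labels, lam, mu):
    """Per-batch weighted hinge objective (9):
    sum_x lam_x * (max_{i != c(x)} f_i(x) + mu - f_{c(x)}(x))."""
    true_out = logits.gather(1, labels.view(-1, 1)).squeeze(1)   # f_{c(x)}(x)
    masked = logits.clone()
    masked.scatter_(1, labels.view(-1, 1), float('-inf'))        # exclude the true class
    runner_up = masked.max(dim=1).values                         # max_{i != c(x)} f_i(x)
    return (lam * (runner_up + mu - true_out)).sum()

def dual_margin_training(model, optimizer, loader, n_train, mu=50.0,
                         delta=2.0, epochs_per_round=200, max_rounds=5):
    lam = torch.ones(n_train)                                    # step 0: lambda_x = 1
    for _ in range(max_rounds):
        for _ in range(epochs_per_round):                        # step 1: optimize theta
            for x, y, idx in loader:                             # idx: sample indices in T
                optimizer.zero_grad()
                loss = margin_loss(model(x), y, lam[idx], mu)
                loss.backward()
                optimizer.step()
        # steps 2-3: check margin constraints and reweight violators
        violated = torch.zeros(n_train, dtype=torch.bool)
        with torch.no_grad():
            for x, y, idx in loader:
                logits = model(x)
                true_out = logits.gather(1, y.view(-1, 1)).squeeze(1)
                masked = logits.clone()
                masked.scatter_(1, y.view(-1, 1), float('-inf'))
                violated[idx] = (true_out - masked.max(dim=1).values) < mu
        if not violated.any():
            break                                                # step 2: all margins met
        lam[violated] *= delta                                   # step 3: lambda_x -> delta*lambda_x
    return model, lam
```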
IV. SOME NUMERICAL RESULTS FOR CLASSIFICATION MARGIN

In this section, we give an example using loss function (9). Training was performed on CIFAR-10 (50000 training samples and 10000 test/held-out samples) using the ResNet-18 DNN (ReLU activations are not used after the fully connected layer). The training was performed for 200 epochs with a batch size of 32. The results for margins $\mu = 50$ and $\mu = 150$ are given in Figures 1 and 2 and Table I.

All training-sample margins were achieved with one training pass using initial $\lambda_x = 1$ for all $x \in \mathcal{T}$; see Figures 1(a) and 2(a). Figures 1(b) and 2(b) show the margins of the dataset held out from training, i.e., the true class label was used to compute the margins. Here one can clearly see that many test samples have margins less than $\mu$ and some are misclassified (negative margins), cf. Table I. Figures 1(c) and 2(c) show the margins based on the class decisions of the classifiers themselves, as would be the case for unlabelled test samples (so no measured margin is negative). The held-out set and test set are the same. Finally, Figures 1(d) and 2(d) show the margins of FGSM [5] adversarial samples created using a surrogate ResNet-18 DNN of the same structure trained using standard cross-entropy loss (all such samples were used, including those based on the test samples that were misclassified by the surrogate).

In Table I, we show the accuracy of the classifiers, including a baseline classifier trained using the same dataset and ResNet-18 DNN structure but with the standard cross-entropy loss objective. As in Figures 1(d) and 2(d), the accuracy reported here is for FGSM adversarial samples that were crafted assuming the attacker knows the baseline DNN trained by cross-entropy loss. These attacks are then transferred to the margin-trained classifiers.

Fig. 1. After training using (9) with margin $\mu = 50$, resulting histograms of margins of: (a) training samples; (b) labelled samples held out from the training dataset; (c) test dataset (labels unknown, so decisions by the classifier itself are used to determine margin here); and (d) FGSM samples created from the test dataset of (c). Note that the sample values in cases (b) and (c) are the same.

training objective →    cross-entropy loss    margin µ = 50    margin µ = 150
clean test set          86.70%                85.49%           85.37%
FGSM attacks            6.017%                10.08%           10.08%
TABLE I. Test-time accuracy. Note that the FGSM attacks were created using the DNN trained with cross-entropy loss and transferred to the margin-trained DNNs. The FGSM attacks were based on all test samples, including those that were misclassified by the DNN trained with cross-entropy loss.
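For reference, the FGSM attack [5] used above perturbs each test input in the direction of the sign of the loss gradient, computed on the surrogate (cross-entropy-trained) model. A minimal sketch (ours, not the authors' code; the surrogate model and the value of the perturbation strength eps are assumptions) is:

```python
import torch
import torch.nn.functional as F

def fgsm(surrogate, x, y, eps):
    """FGSM [5]: x_adv = x + eps * sign(grad_x CE(surrogate(x), y)),
    crafted on a surrogate model and then transferred to the target classifier."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in a valid range (assumed [0, 1])
```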
V. LOW-MARGIN ATYPICALITY OF TEST SAMPLES

Given an arbitrary DNN $f : \mathbb{R}^n \rightarrow (\mathbb{R}_+)^C$, let $\mathcal{T}_\kappa$ be the (clean) training samples of class $\kappa \in \{1, 2, \ldots, C\}$, i.e., $\forall x \in \mathcal{T}_\kappa$, $\hat{c}(x) = c(x) = \kappa$. Recall (2) and suppose a Gaussian Mixture Model (GMM) is learned on the log-margins of the training dataset, $\{\log \mu_f(x) : x \in \mathcal{T}_\kappa\}$, by EM [3] using BIC model-order control [10], as in, e.g., [8]. (Instead of the margin (2), one could use an estimate of the radius of the largest $\ell_2$ ball of class purity about each training and test sample, e.g., directly [7] or via an estimated Lipschitz constant as discussed above.) Let the resulting GMM parameters be $\{w_i, m_i, \sigma_i\}_{i=1}^{I_\kappa}$, where $I_\kappa \leq |\mathcal{T}_\kappa|$ is the number of components, the $w_i \geq 0$ are their weights ($\sum_{i=1}^{I_\kappa} w_i = 1$), the $m_i$ are their means, and the $\sigma_i > 0$ are their standard deviations.

Fig. 2. After training using (9) with margin $\mu = 150$, resulting histograms of margins of: (a) training samples; (b) labelled samples held out from the training dataset; (c) test dataset (labels unknown, so decisions by the classifier itself are used to determine margin here); and (d) FGSM samples created from the test dataset of (c). Note that the sample values in cases (b) and (c) are the same.

So, we can simply compute the margin p-value of any test sample $x$:

$$\pi_f(x) = \sum_{i=1}^{I_\kappa} w_i \left(1 - F\!\left(\frac{|\log(\mu_f(x)) - m_i|}{\sigma_i}\right)\right),$$

where $F$ is the standard normal c.d.f. That is, $\pi_f(x)$ is the probability that a randomly chosen sample from the same distribution as that of the training samples has smaller margin than the test sample $x$. So, one can compare $\pi_f(x)$ to a threshold to detect whether a test sample $x$ has an abnormally small classification margin. For the example margin-trained DNNs of Figures 1(a) and 2(a), the fitted GMM has a single component for the entire training set $\mathcal{T} = \bigcup_{\kappa=1}^C \mathcal{T}_\kappa$. In an unsupervised fashion, the threshold criterion could be a bound on false positives based on the training set. Alternatively, the threshold could be set using a clean set of labelled samples that were held out from (not used for) training, considering both false-positive and false-negative performance.
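A minimal sketch (ours) of this detection statistic follows, using scikit-learn's GaussianMixture with BIC model-order selection to stand in for the EM/BIC procedure of [3], [10], [8]; the function names and the BIC search range are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def fit_log_margin_gmm(train_margins, max_components=10):
    """Fit a 1-D GMM to log-margins of (correctly classified) training samples,
    selecting the number of components by BIC."""
    z = np.log(train_margins).reshape(-1, 1)
    best = min((GaussianMixture(n_components=k, random_state=0).fit(z)
                for k in range(1, max_components + 1)),
               key=lambda g: g.bic(z))
    w = best.weights_
    m = best.means_.ravel()
    s = np.sqrt(best.covariances_.ravel())   # 1-D case: covariances are variances
    return w, m, s

def margin_p_value(test_margin, w, m, s):
    """pi_f(x) = sum_i w_i * (1 - F(|log mu_f(x) - m_i| / sigma_i)),
    with F the standard normal c.d.f."""
    z = np.log(test_margin)
    return float(np.sum(w * (1.0 - norm.cdf(np.abs(z - m) / s))))

# Usage: flag test samples whose margin p-value falls below a threshold.
# w, m, s = fit_log_margin_gmm(train_margins)
# is_suspicious = margin_p_value(mu_test, w, m, s) < 0.01
```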
REFERENCES

[1] P. Bartlett, D. Foster, and M. Telgarsky. Spectrally-normalized Margin Bounds for Neural Networks. In Proc. NIPS, 2017.
[2] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. In Proc. ICML, 2017.
[3] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[4] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G.J. Pappas. Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. https://arxiv.org/pdf/1906.04893.pdf, 2019.
[5] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In Proc. ICLR, 2015.
[6] H. Gouk, E. Frank, and B. Pfahringer. Regularisation of Neural Networks by Enforcing Lipschitz Continuity. https://arxiv.org/pdf/1804.04368.pdf, Sept. 2018.
[7] J. Kolter and E. Wong. Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope. In Proc. ICML, 2018.
[8] D.J. Miller, Z. Qiu, and G. Kesidis. Parsimonious Cluster-based Anomaly Detection (PCAD). In Proc. IEEE MLSP, Aalborg, Denmark, Sept. 2018.
[9] A. Raghunathan, J. Steinhardt, and P. Liang. Certified Defenses against Adversarial Examples. In Proc. ICLR, 2018.
[10] G. Schwarz. Estimating the Dimension of a Model. Annals of Statistics, 6(2):461–464, 1978.
[11] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing Properties of Neural Networks. In Proc. ICLR, 2014.
[12] Y. Tsuzuku, I. Sato, and M. Sugiyama. Lipschitz-margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks. In Proc. NIPS, 2018.
[13] S. Wang, Y. Chen, A. Abdou, and S. Jana. MixTrain: Scalable Training of Verifiably Robust Neural Networks. https://arxiv.org/abs/1811.02625, Nov. 2018.
[14] T.-W. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, D. Boning, I.S. Dhillon, and L. Daniel. Towards Fast Computation of Certified Robustness for ReLU Networks. In Proc. ICML, 2018.