Feature Distillation With Guided Adversarial Contrastive Learning
Tao Bai, Jinnan Chen, Jun Zhao, Bihan Wen, Xudong Jiang, Alex Kot
Nanyang Technological University
{bait0002, jinnan001, junzhao, bihan.wen, exdjiang, eackot}@ntu.edu.sg

Abstract
Deep learning models are shown to be vulnerable to adversarial examples. Though adversarial training can enhance model robustness, typical approaches are computationally expensive. Recent works proposed to transfer the robustness to adversarial attacks across different tasks or models with soft labels. Compared to soft labels, features contain rich semantic information and hold the potential to be applied to different downstream tasks. In this paper, we propose a novel approach called Guided Adversarial Contrastive Distillation (GACD) to effectively transfer adversarial robustness from teacher to student with features. We first formulate this objective as contrastive learning and connect it with mutual information. With a well-trained teacher model as an anchor, students are expected to extract features similar to the teacher's. Then, considering the potential errors made by teachers, we propose sample reweighted estimation to eliminate their negative effects. With GACD, the student not only learns to extract robust features, but also captures structural knowledge from the teacher. Through extensive experiments on popular datasets such as CIFAR-10, CIFAR-100 and STL-10, we demonstrate that our approach can effectively transfer robustness across different models and even different tasks, achieving comparable or better results than existing methods. Besides, we provide a detailed analysis of various methods, showing that students produced by our approach capture more structural knowledge from teachers and learn more robust features under adversarial attacks.
Introduction
Deep neural networks (DNNs) have achieved impressive performances on many computer vision (LeCun, Bengio, and Hinton 2015; He et al. 2016) and natural language processing (Mikolov et al. 2013) tasks. Nevertheless, they are vulnerable to adversarial examples (Szegedy et al. 2014), and their performances degrade sharply. Such adversarial examples are usually crafted by adding human-imperceptible perturbations and are able to mislead well-trained models at inference time. Due to the potential safety risks, the existence of adversarial examples has become a crucial threat to safety- and reliability-critical applications (Parkhi, Vedaldi, and Zisserman 2015; Liu et al. 2019). Thus, extensive research has been devoted to studying and enhancing the robustness of DNNs (Schott et al. 2018; Zhong and Deng 2019; Prakash et al. 2018; Mao et al. 2019; Madry et al. 2018) against waves of new adversarial attacks (Papernot et al. 2017; Carlini and Wagner 2017; Croce and Hein 2019; Modas, Moosavi-Dezfooli, and Frossard 2019), among which adversarial training (Goodfellow, Shlens, and Szegedy 2015; Kannan, Kurakin, and Goodfellow 2018; Madry et al. 2018; Shafahi et al. 2019) produces the most adversarially robust models and is widely recognized as the strongest defense method.

Adversarial training is based on a simple yet effective idea: compared to traditional training, it only requires training models on adversarial samples generated in each training loop. It is, however, computationally expensive and time-consuming, as it needs multiple gradient computations to craft strong adversarial samples (Shafahi et al. 2019). To circumvent the high cost of adversarial training, various methods have recently been developed to transfer adversarial robustness between models (Papernot et al. 2016b; Goldblum et al. 2020). Existing methods are based on the idea of distillation (Hinton, Vinyals, and Dean 2015) and use soft labels.

Another problem neglected in distillation for adversarial robustness is that teachers make many errors. Compared to naturally trained models, adversarially trained models usually suffer a degradation in classification accuracy, which is theoretically proved in (Tsipras et al. 2019; Su et al. 2018; Zhang et al. 2019a). In existing distillation methods, errors from teachers are transferred to students as well.

Here, we propose a teacher-error-aware method to transfer robustness with features, called Guided Adversarial Contrastive Distillation (GACD). Our method is inspired by the observation that features learned by DNNs are purified during adversarial training (see Figure 5 in (Allen-Zhu and Li 2020)) and become more semantically aligned with input images. (Xie et al. 2019) empirically showed that models with purified features are more robust to adversarial examples. The core idea of GACD is therefore to train a student model to imitate the teacher model and extract similar robust features under some similarity metric. In the context of contrastive learning, the teacher model is used as an anchor and the student model learns to align with the teacher. However, the anchor is not always reliable, as explained in the last paragraph, so we use sample re-weighting to eliminate the negative effects caused by bad anchors. Through experiments on CIFAR-10, CIFAR-100 and STL-10, we show that our approach outperforms other methods in classification accuracy and transferability.
Some students even outperform their teachers. In summary, the key contributions of this paper are as follows:
• To the best of our knowledge, we are the first to transfer adversarial robustness across different model architectures with features. Based on contrastive learning, we propose a novel approach called GACD that is aware of the teacher's mistakes.
• Through extensive experiments on CIFAR-10, CIFAR-100 and STL-10, we show that our method outperforms existing distillation methods, achieving comparable or higher classification accuracy and better transferability.
• By analyzing different methods in feature space, we show that our method is more effective at capturing structural knowledge from teachers and extracts more tightly clustered, robust features across classes.
Related Work
In this section, we review prior work on adversarial attacks and adversarial training, self-supervised learning, and knowledge distillation.
Adversarial Training.
Since (Szegedy et al. 2014) first proved the existence of adversarial examples, extensive adversarial attacks (Papernot et al. 2016a; Dong et al. 2018; Carlini and Wagner 2017; Madry et al. 2018) have been proposed, showing the vulnerability of neural networks facing inputs with imperceptible perturbations. (Goodfellow, Shlens, and Szegedy 2015) attributed the vulnerability of neural networks to their linearity and proposed a single-step method called the Fast Gradient Sign Method (FGSM). Iterative attacks (Kurakin, Goodfellow, and Bengio 2016; Madry et al. 2018) were then studied and proved to be stronger than FGSM under the same metric norm. Momentum was introduced in (Dong et al. 2018) and enhanced the transferability of untargeted attacks.

As adversaries can attack well-trained models with a 100% success rate within a small perturbation budget ε, defense methods (Madry et al. 2018; Liao et al. 2018; Prakash et al. 2018; Qin et al. 2020; Gupta and Rahtu 2019) were developed right after. Adversarial training is proved to be the most effective among these methods; it simply requires training models on adversarial examples progressively. Intuitively, adversarial training encourages models to predict correctly in an ε-ball surrounding data points, and many variants have been developed from this observation. Recently, (Zhang and Wang 2019) introduced a feature-scatter-based approach for adversarial training, which generates adversarial examples in latent space in an unsupervised way. (Qin et al. 2019) noticed the highly convoluted loss surface caused by gradient obfuscation (Athalye, Carlini, and Wagner 2018) and introduced a local linearity regularizer (LLR) into adversarial training to encourage the loss surface to behave linearly. (Zhang et al. 2019a) decomposed the classification error on adversarial examples and proposed a regularization term to improve adversarial robustness.
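To make the training recipe concrete, here is a minimal PyTorch sketch of FGSM-based adversarial training; the model and optimizer interfaces and the [0, 1] input range are illustrative assumptions rather than details from the cited papers.

```python
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Single-step FGSM: perturb x along the sign of the loss gradient.
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()  # stay in image range

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    # One adversarial training step: craft adversarial examples on the fly,
    # then update the model on them instead of the clean batch.
    x_adv = fgsm(model, x, y, eps)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Iterative attacks such as PGD repeat the inner FGSM step with a smaller step size and project back into the ε-ball after each step.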
Self-Supervised Learning.
Built on deep learning models, self-supervised learning methods are becoming more and more powerful and gaining increasing popularity. The usual approach of self-supervised learning is to train models to learn general representations from unlabeled data, which can later be used for specific tasks like image classification. Predictive approaches have been shown to be effective for learning representations (Doersch, Gupta, and Efros 2015; Zhang, Isola, and Efros 2017; Noroozi and Favaro 2016). Contrastive learning is another collection of powerful methods for self-supervised representation learning. Inspired by (Gutmann and Hyvärinen 2010; Mnih and Kavukcuoglu 2013; Sohn 2016), the core idea of contrastive learning is to learn representations that are close under some distance metric for samples from the same class (positive samples) while pushing apart representations of different classes (negative samples). By leveraging instance-level identity for self-supervised learning, contrastive learning has been shown to be effective in learning representations (Chen et al. 2020; He et al. 2019; Tian et al. 2020) and achieves performance comparable to supervised methods. Note that such methods are used to learn embeddings; at test time, the embeddings are utilized for other tasks with fine-tuning.
Knowledge Distillation.
Distillation was originally introduced in (Hinton, Vinyals, and Dean 2015) to compress model size while preserving performance within a student-teacher scheme. The student usually has a small, lightweight architecture. Thereafter, more distillation methods were proposed (Heo et al. 2019; Zhang et al. 2019b; Park et al. 2019; Tian, Krishnan, and Isola 2020). (Park et al. 2019) noticed that relationships between samples are usually neglected and proposed Relational Knowledge Distillation (RKD), while (Tian, Krishnan, and Isola 2020) adapted contrastive learning to knowledge distillation and formulated the distillation problem as maximizing the mutual information between student and teacher representations. (Zhang et al. 2019b) proposed self-distillation to distill knowledge from the model itself. Note that such distillation methods only preserve performance in student models.

Knowledge distillation was first adapted for obtaining or transferring adversarial robustness in (Papernot et al. 2016b), under the name defensive distillation. Defensive distillation requires the student and teacher models to have identical architectures. Due to gradient masking, defensive distillation improves the robustness of the student model under a certain attack; it does not, however, make the decision boundary secure, and it was soon circumvented by (Carlini and Wagner 2017). Recently, (Goldblum et al. 2020) studied how to distill adversarial robustness onto student models with knowledge distillation. A line of work explores transferring adversarial robustness between models with the same architecture: (Hendrycks, Lee, and Mazeika 2019) shows large gains in adversarial robustness from pre-training on data from a different domain, and another work uses the idea of transfer learning to build new models on top of robust feature extractors (Shafahi et al. 2020). In contrast to these methods, our method focuses on robust feature distillation and is not restricted to identical model architectures.
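For reference in the comparisons below, a minimal sketch of the soft-label distillation loss of (Hinton, Vinyals, and Dean 2015); the temperature T and mixing weight alpha are illustrative hyper-parameters.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # KL divergence between temperature-softened teacher and student
    # distributions, plus standard cross-entropy on the hard labels.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, as suggested by Hinton et al.
    hard = F.cross_entropy(student_logits, labels)  # supervised term
    return alpha * soft + (1 - alpha) * hard
```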
Guided Adversarial Contrastive Distillation

In this section, we start by defining feature distillation between teachers and students. Then we formulate the problem with contrastive learning and show the connection between our objective and mutual information. Lastly, we derive our final objective function.
Problem Definition
The teacher-student paradigm is widely used in knowledge distillation. We are given two deep neural networks, a teacher $f_t$ and a student $f_s$, whose architectures are not necessarily identical. Let $\{(x, y) \mid x \in X\}$ with $K$ classes be the training data. The representations extracted at the penultimate layer (the layer before the logits) are denoted $f_t(x)$ and $f_s(x)$. During distillation, for two random samples $x_i$ and $x_j$, we expect to push the representations $f_s(x_i)$ and $f_t(x_i)$ closer if $i = j$, while pushing $f_s(x_i)$ and $f_t(x_j)$ apart if $i \neq j$. Figure 1 gives a visual explanation of this intuition. Note that we use adversarially robust models as teacher models, as in (Goldblum et al. 2020). The features extracted by the teacher and student are transformed to the same dimension, referred to as the normalized embedding.
Figure 1: Illustration of feature distillation with contrastive learning.
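The shared normalized embedding can be sketched as a small projection head on each network; the linear head and the 128-dimensional embedding below are our illustrative assumptions, as the paper does not specify them.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    # Maps penultimate-layer features to a shared, L2-normalized embedding
    # so that teacher and student representations are directly comparable.
    def __init__(self, feat_dim, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, features):
        return F.normalize(self.proj(features), dim=1)  # unit-norm embeddings
```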
Connecting to Mutual Information
Contrastive learning is commonly used to extract features in an unsupervised way; its core idea is to learn representations that are close under some distance metric for samples from the same class (positive samples) while pushing apart samples from different classes (negative samples). Following recent setups for contrastive learning (Gutmann and Hyvärinen 2010; Mnih and Kavukcuoglu 2013; Tian, Krishnan, and Isola 2019), we select samples from the training data $X$: $\{x \sim p_{data}(x)\}$, and construct a set $S = \{x^+, x_1^-, x_2^-, \dots, x_k^-\}$ containing a single positive sample and $k$ negative samples from different classes.

Since the teacher model is pretrained and fixed during distillation, we simply enumerate positives and negatives for the student model from $S$. We then have one contrastive congruent pair (the same input given to teacher and student model) for every $k$ incongruent pairs (different inputs given to teacher and student model), as given in $S_{pair} := \{(t^+, s^+), (t^+, s_1^-), (t^+, s_2^-), \dots, (t^+, s_k^-)\}$, where we denote $f_t(x)$ as $t$ and $f_s(x)$ as $s$ for simplicity.

Now we define a distribution $q$ with latent variable $C$:
• when $C = 1$, the congruent pair $(t, s)$ is drawn from the joint distribution $p(t, s)$:
$q(t, s \mid C = 1) = p(t, s).$ (1)
• when $C = 0$, the incongruent pair $(t, s)$ is drawn from the product of marginals $p(t)p(s)$:
$q(t, s \mid C = 0) = p(t)p(s).$ (2)

According to the set $S$, the priors on $C$ are
$q(C = 1) = \frac{1}{k+1}, \quad q(C = 0) = \frac{k}{k+1}.$ (3)

By applying Bayes' rule, we can easily derive the posterior for class $C = 1$:
$q(C = 1 \mid t, s) = \frac{q(t, s \mid C = 1)\, q(C = 1)}{q(t, s \mid C = 0)\, q(C = 0) + q(t, s \mid C = 1)\, q(C = 1)} = \frac{p(t, s)}{k\, p(t)\, p(s) + p(t, s)} \le \frac{p(t, s)}{k\, p(t)\, p(s)}.$ (4)

Taking the log of both sides of (4), we have
$\log q(C = 1 \mid t, s) \le -\log(k) + \log \frac{p(t, s)}{p(t)\, p(s)}.$ (5)

Taking the expectation of both sides w.r.t. $p(t, s)$, or equivalently $q(t, s \mid C = 1)$ (they are equal, as shown in Eq. (1)), the connection with mutual information is given as
$MI(t; s) \ge \log(k) + \mathbb{E}_{q(t, s \mid C = 1)} \log q(C = 1 \mid t, s),$ (6)
where $MI(t; s)$ represents the mutual information between $t$ and $s$.

Thus, our objective is to maximize this lower bound on mutual information, consistent with (van den Oord, Li, and Vinyals 2018; Poole et al. 2019; Tian, Krishnan, and Isola 2019, 2020).
Sample Reweighted Noise Contrastive Estimation

To maximize the right side of Inequality (6), the distribution $q(C = 1 \mid t, s)$ is required. Though we do not have the true distribution, we can estimate it by fitting a model (Gutmann and Hyvärinen 2010; Goodfellow et al. 2014) to the teacher-student pairs in $S_{pair}$.

Before proceeding, we first revisit the pair set. As shown in $S_{pair}$, the teacher is fixed and acts as an anchor. If the anchor is wrong, the whole set is unreliable. It is known that adversarially trained models suffer a large drop in performance, as explained in (Zhang et al. 2019a; Madry et al. 2018): robust teacher models in adversarial settings are not as accurate as naturally trained teacher models in benign settings (e.g., roughly 76% vs. 95% on CIFAR-10). Thus, a large number of errors made by teacher models are transferred to student models along with knowledge during distillation, a serious problem neglected in existing methods (Shafahi et al. 2020; Goldblum et al. 2020).

The direct way to handle this problem is to remove samples misclassified by the teacher during training. However, this reduces the amount of training data, and such a surrogate loss has inherent limitations, such as computational hardness. Instead, we propose Sample Reweighted Noise Contrastive Estimation, which assigns smaller weights to samples misclassified by the teacher and greater weights to those classified correctly (see Algorithm 1). With the softmax outputs of the teacher model at hand, we pick the probability of the true class for each sample as its weight, denoted $w_t$. The higher the probability, the greater the chance that the sample is classified correctly. Hence, misclassified samples do not significantly affect the estimation process.

Now we formulate the estimation of $q(C = 1 \mid t, s)$ as a binary classification problem, and our goal is to maximize the log likelihood. We use $h$ to represent the classification model, which takes $t$ and $s$ as inputs and outputs the probability of $C$:
$q(C = 1 \mid t, s) = h(t, s),$ (7)
$q(C = 0 \mid t, s) = 1 - h(t, s).$ (8)
The priors for the two classes are given in Eq. (3). Considering the sample weights, the log-likelihood on $S_{pair}$ is formulated as
$\ell(h, w_t) = \sum_{(t, s) \in S_{pair}} w_t\, C \log P(C = 1 \mid t, s) + w_t (1 - C) \log P(C = 0 \mid t, s) = w_t \log\left[h(t^+, s^+)\right] + \sum_{i=1}^{k} w_t \log\left[1 - h(t^+, s_i^-)\right].$ (9)
Formally, the expected log likelihood of $h$ is
$L(h, w_t) = \mathbb{E}_{q(t, s \mid C = 1)}\left[w_t \log h(t, s)\right] + k\, \mathbb{E}_{q(t, s \mid C = 0)}\left[w_t \log(1 - h(t, s))\right],$ (10)
where $\mathbb{E}_{q(t, s \mid C = 0)}\left[w_t \log(1 - h(t, s))\right]$ is strictly negative and $w_t$ cannot be larger than 1. Thus, adding the second term to the right side of Inequality (6), the inequality still holds. We rewrite it as
$MI(t; s) \ge \log(k) + L(h^*),$ (11)
where $L(h^*)$ is the upper bound of $L(h)$.

To summarize, our final learning problem is to learn a student model $f_s$ that maximizes the log likelihood $L(h)$ (see Eq. (10)).
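A minimal PyTorch sketch of the objective in Eq. (9); the batch shapes are our assumptions (teacher and student positives t_pos, s_pos of shape (B, d), student negatives s_neg of shape (B, k, d), weights w of shape (B,)), and the critic h is sketched in the implementation section below.

```python
import torch

def srnce_loss(h, t_pos, s_pos, s_neg, w):
    # Sample Reweighted NCE (Eq. 9): weighted log-likelihood over one
    # congruent pair and k incongruent pairs per anchor. h estimates
    # q(C = 1 | t, s); w is the teacher's probability of the true class.
    pos = torch.log(h(t_pos, s_pos))                                # (B,)
    neg = torch.log(1.0 - h(t_pos.unsqueeze(1), s_neg)).sum(dim=1)  # (B,)
    return -(w * (pos + neg)).mean()  # maximizing Eq. (9) = minimizing this
```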
Experimental Results

In this section, we conduct a series of experiments to evaluate the adversarial robustness and transferability of our method. We then illustrate the selection of hyper-parameters and study the influence of the teacher's errors on distillation.
Algorithm 1: Guided Adversarial Contrastive Distillation (GACD)
Input: Teacher model $f_t$, student model $f_s$, estimation model $h$, number of negatives $k$, learning rate $r$
Output: The final student model parameters $\theta_s$
Data: Training data $D_{train}$
for each training iteration do
    Sample $(x, y) \sim D_{train}$;
    Construct $S = \{x^+, x_1^-, \dots, x_k^-\}$;
    $t^+ \leftarrow f_t(x^+)$;
    $w_t \leftarrow p_{y_{pred} = y}(f_t, t^+)$;
    $s^+ \leftarrow f_s(x^+)$;
    for $x_i^- \in S$ do
        $s_i^- \leftarrow f_s(x_i^-)$;
    end
    $S_{pair} \leftarrow \{(t^+, s^+), (t^+, s_1^-), \dots, (t^+, s_k^-)\}$;
    $\theta_s \leftarrow \theta_s + r \nabla\left(L_{S_{pair}}(h, w_t)\right)$;
end
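One iteration of Algorithm 1 might look like the following sketch; the embed method returning normalized embeddings, the pre-batched negatives x_neg of shape (B, k, C, H, W), and srnce_loss from the earlier sketch are all our assumptions.

```python
import torch
import torch.nn.functional as F

def gacd_iteration(f_t, f_s, optimizer, h, x_pos, y, x_neg):
    # f_t is the frozen teacher anchor; f_s is the student being trained.
    with torch.no_grad():
        t_pos = f_t.embed(x_pos)                           # anchor t+
        probs = F.softmax(f_t(x_pos), dim=1)
        w = probs.gather(1, y.unsqueeze(1)).squeeze(1)     # w_t = P(true class)
    s_pos = f_s.embed(x_pos)                               # s+
    B, k = x_neg.shape[:2]
    s_neg = f_s.embed(x_neg.flatten(0, 1)).view(B, k, -1)  # s-_1 ... s-_k
    loss = srnce_loss(h, t_pos, s_pos, s_neg, w)           # sketched earlier
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```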
Datasets

In total, we consider three datasets in our experiments: (1) CIFAR-10 (Krizhevsky, Hinton et al. 2009), which contains 60K images in 10 classes, of which 50K are for training and 10K for testing. (2) CIFAR-100, which is just like CIFAR-10 except that it has 100 classes containing 600 images each. (3) STL-10 (Coates, Ng, and Lee 2011), which has a training set of 5K labeled images from 10 classes, 100K unlabeled images, and a test set of 8K images.

We use CIFAR-10 and CIFAR-100 to evaluate the classification accuracy of different distillation methods, while STL-10 and CIFAR-100 are used to test the transferability of the distilled students.
Implementation Details
Construction of Sample Set.
There are negatives and positives within a sample set. As our method is supervised, we sample negatives from different classes rather than different instances when picking a positive sample from the same class. However, we modify the positive samples. As suggested in (Goldblum et al. 2020), not all robust models are good teachers, so distillation with natural images often results in unexpected failures. In our view, adversarial examples are like hard examples supporting the decision boundaries; without hard examples, the distilled models would certainly make mistakes. Thus, we adopt a self-supervised way to generate adversarial examples using Projected Gradient Descent (PGD): given a certain perturbation budget, we aim to find the perturbation that leads to the maximal distortion of features. The feature distance metric we use is the Wasserstein distance used in (Zhang and Wang 2019).
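A sketch of this attack, using a squared L2 feature distance as a simple stand-in for the Wasserstein distance of (Zhang and Wang 2019); the extractor interface, step size, and step count are illustrative.

```python
import torch

def feature_pgd(embed, x, eps=8 / 255, alpha=2 / 255, steps=10):
    # PGD in feature space: find an l_inf perturbation within the eps-ball
    # that maximally distorts the extracted features; no labels are needed.
    with torch.no_grad():
        clean_feat = embed(x)
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        dist = (embed((x + delta).clamp(0, 1)) - clean_feat).pow(2).sum()
        dist.backward()
        with torch.no_grad():
            delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
    return (x + delta).clamp(0, 1).detach()
```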
Teacher        ResNet18                                 WRN-34-10
Student        ResNet18     WRN-34-2     MobileNetV2    ResNet18     WRN-34-2     MobileNetV2
Method         Nat.   Adv.  Nat.   Adv.  Nat.   Adv.    Nat.   Adv.  Nat.   Adv.  Nat.   Adv.
KD             76.13  40.13 73.12  34.16 76.86  38.21   --     --    --     --    --     --
GACD (ours)    84.14  42.12 86.28  45.66 81.84  39.51   84.16  40.60 86.74  44.64 81.13  36.05

Table 1: Classification accuracy (%) of student models with different methods on CIFAR-10. For each teacher, there are three students with different architecture styles and network capacities. We mainly compare GACD with KD (Hinton, Vinyals, and Dean 2015) and ARD (Goldblum et al. 2020).

Teacher        ResNet56                                 WRN-40-2
Student        ResNet32     WRN-16-2     MobileNetV2    ResNet32     WRN-16-2     MobileNetV2
Method         Nat.   Adv.  Nat.   Adv.  Nat.   Adv.    Nat.   Adv.  Nat.   Adv.  Nat.   Adv.
KD             --     --    --     --    --     --      --     --    --     --    --     --
GACD+AFT       49.86  --    --     --    --     --      --     --    --     --    --     --

Table 2: Classification accuracy (%) of student models with different methods on CIFAR-100.

Estimation Model h. $h$ is used in the binary classification problem so that we can estimate the distribution $q(C = 1 \mid t, s)$. The only requirement on $h$ is that its output lies in $[0, 1]$; for example,
$h(t, s) = \frac{e^{t^\top s / T}}{e^{t^\top s / T} + k/M},$
where $T$ is the temperature and $M$ is the cardinality of the dataset. Alternatively, $h$ can be a network like the discriminator in a Generative Adversarial Network (GAN) (Goodfellow et al. 2014). We use the former in our experiments.
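A sketch of this closed-form critic; k = 16384 matches the hyper-parameter study below, while T = 0.1 and M = 50000 (the CIFAR training-set size) are our assumptions.

```python
import torch

def make_critic(T=0.1, M=50000, k=16384):
    # h(t, s) = exp(t's / T) / (exp(t's / T) + k / M), outputs in [0, 1].
    def h(t, s):
        e = torch.exp((t * s).sum(dim=-1) / T)  # inner product; broadcasts over negatives
        return e / (e + k / M)
    return h
```

With h = make_critic(), the loss sketched in the method section is simply srnce_loss(h, t_pos, s_pos, s_neg, w).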
Adversarial Fine-Tuning.

Existing contrastive learning methods leverage linear evaluation for downstream tasks (Dosovitskiy et al. 2015; Chen et al. 2020; He et al. 2019). Concretely, this requires learning a linear layer $l_\psi(\cdot)$ on top of the fixed, well-trained contrastive learning model $f_\theta(\cdot)$. But it is proved in (Allen-Zhu and Li 2020) that deep models are not guaranteed to be adversarially robust if only low-level features are robust; in contrastive learning, the last layer is vulnerable to adversarial examples. So we perform full-network fine-tuning with adversarial training after feature distillation.
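This stage can be sketched as ordinary adversarial training applied to the whole distilled student; attack here is a hypothetical label-based PGD routine, and the loader and optimizer interfaces are illustrative.

```python
import torch.nn.functional as F

def adversarial_finetune(student, loader, optimizer, attack, epochs=10):
    # Full-network adversarial fine-tuning after feature distillation:
    # every layer is updated, not just a linear head on frozen features.
    student.train()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = attack(student, x, y)  # e.g., a label-based PGD attack
            optimizer.zero_grad()
            F.cross_entropy(student(x_adv), y).backward()
            optimizer.step()
```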
Robustness Evaluation of Student Model

Setup.
We experiment on CIFAR-10 and CIFAR-100 with different student-teacher combinations of various model capacities. For CIFAR-10, ResNet18 (He et al. 2016) and WideResNet-34-10 (Zagoruyko and Komodakis 2016) are the teacher models, while for CIFAR-100 we use ResNet56 and WideResNet-40-2. The performances of the adversarially trained teacher models are listed in Table 3. We use different teacher-student combinations to show that our approach is model-agnostic. More details are introduced in the following sections.
Dataset    CIFAR-10                CIFAR-100
Model      ResNet18   WRN-34-10    ResNet56   WRN-40-2
Nat.       76.54      84.41        59.29      60.27
Adv.       44.46      45.75        20.32      22.40

Table 3: Classification accuracy (%) of different adversarially trained teacher models on CIFAR-10 and CIFAR-100. We use Nat. to denote classification accuracy on natural images, and Adv. on adversarial images.
Results.
We mainly compare GACD with KD (Hinton, Vinyals, and Dean 2015) and ARD (Goldblum et al. 2020). Table 1 and Table 2 summarize the performances of the different distillation methods on CIFAR-10 and CIFAR-100 from two aspects: natural accuracy and adversarial accuracy under a 20-step PGD attack. Unless otherwise specified, we use a perturbation budget ε = 8/255 under $l_\infty$ and a 20-step PGD attack as defaults in our experiments. We also investigate the influence of model architectures: we select three students for each teacher, differing in model capacity and architectural style.

We use ResNet18, WRN-34-2 and MobileNetV2 as student models on CIFAR-10. Our approach achieves the best natural and adversarial accuracy for most student-teacher combinations (see Table 1), exceeding the teacher models as well. On CIFAR-100, we use ResNet32, WRN-16-2, and MobileNetV2 as student models. As shown in Table 2, student models produced by our approach have the best adversarial robustness among all methods, while students trained with KD have the best natural accuracy but the worst adversarial robustness.

Transferability of Distilled Features
Setup.
In representation learning, a primary goal is to learn general knowledge; in other words, the learned representations or features should be applicable to tasks or datasets not used for training. Therefore, in this section we test whether the distilled features transfer well.

We use ResNet18 as the teacher and student models. In our experiments, models are frozen once trained and later used to extract features (from the layer prior to the logits). We then train a linear classifier on top of the frozen student models to perform classification, and show the transferability of the distilled features on STL-10 and CIFAR-100.
Results.
We compare GACD with different methods; the results are reported in Table 4. As illustrated, all distillation methods improve the transferability of the learned features on both natural images and adversarial examples. Our method shows the best performance on both datasets, with an average improvement of 11.8% in natural accuracy and 5.84% in adversarial accuracy.
Datasets                 STL-10          CIFAR-100
Methods                  Nat.    Adv.    Nat.    Adv.
Teacher (Adv. Training)  48.81   28.40   27.17   11.42
Student (KD)             56.79   29.33   30.64   12.03
Student (ARD)            59.94   35.16   32.17   15.06
Student (GACD, ours)     --      --      --      --

Table 4: Illustration of the transferability of different student models. Here we use a 7-step PGD attack to evaluate adversarial robustness (%).
Hyper-parameters
We investigate the influence of the two main hyper-parameters used in our method: (1) the number of negative samples $k$ in Eq. (10), and (2) the temperature $T$, which suppresses the softmax probability. The teacher and student architectures are both ResNet18, but note that our method is model-agnostic. Experiments are conducted on CIFAR-10 and the results are shown in Figure 2.

Number of Negative Samples k. We validated a series of different $k$: 16, 64, 256, 1024, 4096 and 16384. As shown in Figure 2a, adversarial robustness increases as $k$ gets larger, while natural accuracy falls due to the trade-off (Zhang et al. 2019a). As the accuracy is highest when $k = 16384$, we use $k = 16384$ in all reported experiments.

Temperature.
We experimented with temperatures $T$ between 0.01 and 0.3; the results for five values in this range are illustrated in Figure 2b. The best natural and adversarial accuracy are obtained at an intermediate temperature, while extreme values give sub-optimal solutions; we picked this intermediate value of $T$ in our experiments.
(a) Effects of varying k. (b) Effects of varying T.

Figure 2: Effects of varying the number of negative samples and the temperature.
Ablation Study
In GACD, we assign small importance to samples that are misclassified by the teacher, because we believe wrong predictions made by the teacher have a negative influence on students. For illustration, we temporarily remove sample importance from GACD to see the difference. We select ResNet18 as the teacher and WRN-34-10 as the student, and validate on CIFAR-10. Both the natural and adversarial accuracy obtained for WRN-34-10 drop compared to the original results (see Table 1).

In addition, we applied sample re-weighting to ARD to evaluate its effectiveness, with all other ARD settings and parameters kept exactly the same. With ResNet18 as the teacher, the classification results of the student WRN-34-10 are 80.59% and 54.23%, an increase of 1.54% in natural accuracy and 4.07% in adversarial accuracy, respectively.

We can therefore conclude that sample re-weighting indeed helps improve the performance of students in distillation.

Distilled Feature Analysis
In this section, we provide a detailed analysis of the latent representations, or features, extracted by students with different methods.
Inter-class Correlations
For classification problems, cross-entropy is widely used as the objective function. It, however, ignores the correlations between classes. Knowledge distillation solves this problem with a teacher model: soft labels, the key component of its success, inherently contain correlations between classes, and such correlations contribute to the performance of student models. To illustrate the capability of different methods to capture these correlations, we computed the differences between the correlation matrices of the teachers' and students' logits.

Since the objective functions of KD and ARD are highly similar, we select ARD to compare with our method. Figure 3 shows the differences with natural images as inputs, while Figure 4 shows the differences on adversarial images. Clearly, there are significant reductions (light color) in the differences between teachers and students with our method, compared to ARD. This means our method captures more structural knowledge during distillation, which is also supported by the increased accuracy.
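This comparison can be sketched as follows; the logit matrices are assumed to be N x C numpy arrays of model outputs on the same inputs.

```python
import numpy as np

def class_correlations(logits):
    # Correlation matrix (C x C) between class logits over a set of inputs.
    return np.corrcoef(logits.T)

def correlation_gap(teacher_logits, student_logits):
    # Absolute difference between teacher and student class correlations;
    # lighter (smaller) values indicate more structural knowledge captured.
    return np.abs(class_correlations(teacher_logits)
                  - class_correlations(student_logits))
```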
(a) Student: ARD. (b) Student: GACD (ours).

Figure 3: Differences of logits correlations between teachers and students on natural data from CIFAR-100. For visualization, we use WRN-40-2 as teacher and WRN-16-2 as student.

(a) Student: ARD. (b) Student: GACD (ours).
Figure 4: Differences of logits correlations between teachers and students on adversarial data from CIFAR-100. The same models are used as in Figure 3.
Feature under Attacks
We further investigate how natural and adversarial images are represented in feature space by different models. The t-SNE visualization of the high-dimensional latent representations of sampled images is shown in Figure 5. Concretely, we sampled images from two classes (bird and truck) of CIFAR-10 for illustration, and crafted adversarial images (namely, adv truck) which originally belong to truck but are misclassified as bird. The green and blue points indicate the natural images of trucks and birds, while red points represent images of adv trucks. These samples are fed into four models: a standard undefended model, the adversarially trained teacher, and two student models trained with ARD and our method GACD. For the standard undefended model (Figure 5a), all samples of adv trucks are misclassified as birds (red points are mixed with blue points) and lie far from the original class (green points). The other three models show adversarial robustness, as most adv truck samples are classified as trucks. However, for the ARD student (Figure 5c) several red points fall into the green cluster, and there is no clear boundary, as the data points are somewhat mixed. In contrast, with our proposed GACD, Figure 5d clearly shows larger inter-class distances and smaller intra-class distances. These differences in feature space are also reflected in the adversarial classification accuracy.
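A sketch of this visualization with scikit-learn; the feature matrix (an N x d numpy array) and the integer group labels are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, groups, names=("bird", "truck", "adv truck")):
    # Project penultimate-layer features to 2-D and color by group.
    coords = TSNE(n_components=2).fit_transform(features)  # (N, 2)
    for i, name in enumerate(names):
        mask = groups == i
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=name)
    plt.legend()
    plt.show()
```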
(a) Standard undefended model. (b) Teacher. (c) Student: ARD. (d) Student: GACD (ours).

Figure 5: Illustration of the latent representations generated by different models. The blue and green points are 100 randomly sampled natural images from the classes 'bird' and 'truck' respectively, while the red points are adversarial images crafted from images of class truck.
Conclusion
In this paper, we present a novel approach, Guided Adversarial Contrastive Distillation (GACD), to transfer adversarial robustness with features, which differs from existing distillation methods. Theoretically, we formulate our distillation problem as contrastive learning and connect it to mutual information. Taking the teacher's errors into consideration, we propose sample reweighted noise contrastive estimation, which proves applicable to other distillation methods as well. In extensive experiments, our method captures more structural knowledge and shows comparable or even better performance than other methods, and it has the best transferability across tasks and models. In the future, we will look deeper into deep learning models, find the key properties leading to adversarial robustness, and develop efficient methods for distillation.
References
Allen-Zhu, Z.; and Li, Y. 2020. Feature Purification: How Adversarial Training Performs Robust Deep Learning. arXiv preprint arXiv:2005.10190.
Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the 35th International Conference on Machine Learning, 274-283.
Carlini, N.; and Wagner, D. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (S&P), 39-57.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709.
Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 215-223.
Croce, F.; and Hein, M. 2019. Sparse and Imperceivable Adversarial Attacks. In The IEEE International Conference on Computer Vision (ICCV).
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422-1430.
Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; and Li, J. 2018. Boosting Adversarial Attacks With Momentum. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 9185-9193.
Dosovitskiy, A.; Fischer, P.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Goldblum, M.; Fowl, L.; Feizi, S.; and Goldstein, T. 2020. Adversarially Robust Distillation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, 3996-4003.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.
Gupta, P.; and Rahtu, E. 2019. CIIDefence: Defeating Adversarial Attacks by Fusing Class-Specific Image Inpainting and Image Denoising. In The IEEE International Conference on Computer Vision (ICCV).
Gutmann, M.; and Hyvärinen, A. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 297-304.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2019. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
Hendrycks, D.; Lee, K.; and Mazeika, M. 2019. Using Pre-Training Can Improve Model Robustness and Uncertainty. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, 2712-2721.
Heo, B.; Kim, J.; Yun, S.; Park, H.; Kwak, N.; and Choi, J. Y. 2019. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE International Conference on Computer Vision, 1921-1930.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Kannan, H.; Kurakin, A.; and Goodfellow, I. J. 2018. Adversarial Logit Pairing. CoRR abs/1803.06373.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report.
Kurakin, A.; Goodfellow, I. J.; and Bengio, S. 2016. Adversarial examples in the physical world. CoRR abs/1607.02533.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553): 436-444.
Liao, F.; Liang, M.; Dong, Y.; Pang, T.; Hu, X.; and Zhu, J. 2018. Defense Against Adversarial Attacks Using High-Level Representation Guided Denoiser. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1778-1787.
Liu, W.; Liao, S.; Ren, W.; Hu, W.; and Yu, Y. 2019. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5187-5196.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR. OpenReview.net.
Mao, C.; Zhong, Z.; Yang, J.; Vondrick, C.; and Ray, B. 2019. Metric Learning for Adversarial Robustness. In Advances in Neural Information Processing Systems 32, 478-489.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111-3119.
Mnih, A.; and Kavukcuoglu, K. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, 2265-2273.
Modas, A.; Moosavi-Dezfooli, S.-M.; and Frossard, P. 2019. SparseFool: a few pixels make a big difference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9087-9096.
Noroozi, M.; and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 69-84.
Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z. B.; and Swami, A. 2016a. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P), 372-387.
Papernot, N.; McDaniel, P.; Wu, X.; Jha, S.; and Swami, A. 2016b. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (S&P), 582-597.
Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In ASIA CCS 2017.
Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational Knowledge Distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3967-3976.
Parkhi, O. M.; Vedaldi, A.; and Zisserman, A. 2015. Deep face recognition. In British Machine Vision Conference (BMVC).
Poole, B.; Ozair, S.; Oord, A. v. d.; Alemi, A. A.; and Tucker, G. 2019. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922.
Prakash, A.; Moran, N.; Garber, S.; DiLillo, A.; and Storer, J. A. 2018. Deflecting Adversarial Attacks With Pixel Deflection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8571-8580.
Qin, C.; Martens, J.; Gowal, S.; Krishnan, D.; Dvijotham, K.; Fawzi, A.; De, S.; Stanforth, R.; and Kohli, P. 2019. Adversarial Robustness through Local Linearization. In Advances in Neural Information Processing Systems 32, 13824-13833.
Qin, Y.; Frosst, N.; Raffel, C.; Cottrell, G.; and Hinton, G. 2020. Deflecting Adversarial Attacks. arXiv e-prints.
Schott, L.; Rauber, J.; Bethge, M.; and Brendel, W. 2018. Towards the first adversarially robust neural network model on MNIST. arXiv preprint arXiv:1805.09190.
Shafahi, A.; Najibi, M.; Ghiasi, M. A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L. S.; Taylor, G.; and Goldstein, T. 2019. Adversarial training for free! In Advances in Neural Information Processing Systems 32, 3358-3369.
Shafahi, A.; Saadatpanah, P.; Zhu, C.; Ghiasi, A.; Studer, C.; Jacobs, D.; and Goldstein, T. 2020. Adversarially robust transfer learning. In International Conference on Learning Representations.
Sohn, K. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, 1857-1865.
Su, D.; Zhang, H.; Chen, H.; Yi, J.; Chen, P.-Y.; and Gao, Y. 2018. Is Robustness the Cost of Accuracy? A Comprehensive Study on the Robustness of 18 Deep Image Classification Models. In Computer Vision - ECCV, volume 11216, 644-661.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing properties of neural networks. In ICLR.
Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.
Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Representation Distillation. In International Conference on Learning Representations.
Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; and Isola, P. 2020. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243.
Tsipras, D.; Santurkar, S.; Engstrom, L.; Turner, A.; and Madry, A. 2019. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations.
van den Oord, A.; Li, Y.; and Vinyals, O. 2018. Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748.
Xie, C.; Wu, Y.; van der Maaten, L.; Yuille, A. L.; and He, K. 2019. Feature Denoising for Improving Adversarial Robustness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 501-509.
Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. CoRR abs/1605.07146. URL http://arxiv.org/abs/1605.07146.
Zhang, H.; and Wang, J. 2019. Defense Against Adversarial Attacks Using Feature Scattering-based Adversarial Training. In Advances in Neural Information Processing Systems 32, 1829-1839.
Zhang, H.; Yu, Y.; Jiao, J.; Xing, E.; Ghaoui, L. E.; and Jordan, M. 2019a. Theoretically Principled Trade-off between Robustness and Accuracy. In Proceedings of the 36th International Conference on Machine Learning, 7472-7482. Long Beach, California, USA.
Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; and Ma, K. 2019b. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE International Conference on Computer Vision, 3713-3722.
Zhang, R.; Isola, P.; and Efros, A. A. 2017. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1058-1067.
Zhong, Y.; and Deng, W. 2019. Adversarial Learning With Margin-Based Triplet Embedding Regularization. In The IEEE International Conference on Computer Vision (ICCV).