On the Robustness of Deep K-Nearest Neighbors
Chawin Sitawarin, David Wagner
Chawin Sitawarin, David Wagner
EECS Department, UC Berkeley
{chawins, daw}@berkeley.edu

Abstract—Despite a large amount of attention on adversarial examples, very few works have demonstrated an effective defense against this threat. We examine Deep k-Nearest Neighbors (DkNN), a proposed defense that combines k-Nearest Neighbors (kNN) and deep learning to improve the model's robustness to adversarial examples. It is challenging to evaluate the robustness of this scheme due to the lack of an efficient algorithm for attacking kNN classifiers with large k and high-dimensional data. We propose a heuristic attack that allows us to use gradient descent to find adversarial examples for kNN classifiers, and then apply it to attack the DkNN defense as well. Results suggest that our attack is moderately stronger than any naive attack on kNN and significantly outperforms other attacks on DkNN.

I. INTRODUCTION
Deep learning has recently attained immense popularity in various fields and communities due to its superhuman performance on complicated tasks such as image classification [1], [2], playing complex games [3]–[5], controlling driverless vehicles [6], [7], and medical imaging [8]. Nonetheless, many works have shown that neural networks and other machine learning classifiers are not robust in the face of adversaries (e.g., adversarial examples) [9]–[13] as well as under more common distribution shifts [14], [15].

This phenomenon raises a call for more robust and more interpretable neural network models. Many defenses against adversarial examples have been proposed; however, most have been broken by adaptive adversaries [16]–[18]. Only a few defenses provide a significant improvement in robustness on toy datasets like MNIST and CIFAR-10 [19], [20]. One plausible approach to simultaneously combat adversaries and make neural networks more trustworthy is to build interpretable models [21]–[23] or to provide an explanation supporting the model's output [24]–[26]. Deep k-Nearest Neighbors (DkNN), recently proposed by Papernot & McDaniel, showed promising results: their evaluation suggests it offers robustness against adversarial examples, interpretability, and other benefits [23].

Nonetheless, adversarial examples are surprisingly difficult to detect when the adversary has full knowledge of the defense [17]. Among the defenses that have been broken, many attempt to distinguish adversarial inputs by statistically inspecting their representations (or activations) from hidden layers of neural networks [27]–[30]. This fact raises some concerns for the robustness of DkNN, which uses kNN on the intermediate representations produced by the neural network.

In this paper, we examine the robustness of DkNN against adversarial examples. We develop a new gradient-based attack on kNN and DkNN. While gradient descent has found great success in attacking neural networks, it is challenging to apply
to kNN, as kNN is not differentiable. At a high level, our attack approximates the discrete nature of kNN with a soft threshold (e.g., a sigmoid), making the objective function differentiable. Then, we find a local optimum using gradient descent under an ℓp-norm constraint. With this attack, we find that DkNN is vulnerable to adversarial examples with a small perturbation in both the ℓ2 and ℓ∞ norms. With an ℓ∞-norm of 0.2, our attack manages to reduce the accuracy of a DkNN on MNIST to only 17.44%. Some of the adversarial examples generated with our attack are shown in Fig. 1.

Fig. 1: Adversarial examples generated from the gradient-based attack on kNN and DkNN with ℓ2- and ℓ∞-norm constraints. The numbers on top and bottom are the predictions of DkNN on the clean and the adversarial samples, respectively. For a few adversarial examples, the perturbation might change the human label: some of the adversarial 4's have their top closed, so a human might consider them a 9, and one of the 3's looks close to an 8.

The main contributions of this paper are as follows:
1) We propose a gradient-based attack on kNN and DkNN.
2) We evaluate our attack on kNN and DkNN, compare it to other naive approaches as well as the adaptive attack proposed by Papernot & McDaniel, show that our attack performs better than prior attacks, and show that it can find adversarial examples for kNN and DkNN on MNIST.
3) We show that the credibility scores from DkNN models are not effective for detecting our attacks without a significant drop in accuracy on clean images.
II. BACKGROUND AND RELATED WORK
A. Adversarial Examples
Adversarial examples are a type of evasion attack against machine learning models at test time. While the robustness of machine learning classifiers in adversarial settings has been studied for a long time [31], [32], the term "adversarial examples" was recently introduced as an attack on deep neural networks that adds a very small perturbation to a legitimate sample [10], [11]. Previous works propose algorithms for finding such perturbations under a norm-ball threat model, which can be generalized as solving the following optimization problem:

\[
x_{\mathrm{adv}} = x + \delta^* \quad \text{where} \quad \delta^* = \arg\max_{\delta} \, L(x + \delta) \quad \text{such that } \|\delta\|_p \le d \tag{1}
\]

where L is some loss function associated with the correct prediction of a clean sample x by the target neural network. The constraint is used to keep the perturbation small or imperceptible to humans. Our attack also uses the norm-ball constraint and an optimization problem of a similar form.

B. Robustness of k-Nearest Neighbors
The kNN classifier is a popular non-parametric classifier that predicts the label of an input by finding its k nearest neighbors in some distance metric, such as Euclidean or cosine distance, and taking a majority vote over the labels of the neighbors. Wang et al. recently studied the robustness of kNN in an adversarial setting, providing a theoretical bound on the required value of k such that the robustness of kNN can approach that of the Bayes optimal classifier [33]. Since the required value of k is too large in practice, they also propose a robust 1-NN by selectively removing some of the training samples. We did not experiment with this defense as it is limited to a 1-NN algorithm with two classes.

C. Deep k-Nearest Neighbors
DkNN, proposed by Papernot & McDaniel, is a scheme that can be applied to any deep learning model, offering interpretability and robustness through a nearest-neighbor search in each of the deep representation layers. Using inductive conformal prediction, the model computes, in addition to a prediction, confidence and credibility scores, which measure the model's assessment of how likely its prediction is to be correct. The goal is that adversarial examples will have low credibility and can thus be easily detected. The credibility is computed by counting the number of neighbors from classes other than the majority; this score is compared to the scores seen when classifying samples from a held-out calibration set. Papernot & McDaniel evaluate DkNN against an adaptive adversary, which they found to be quite unsuccessful. We examine the robustness of DkNN with the stronger attack we propose.

We note that the DkNN proposed by Papernot & McDaniel uses cosine distance, which is equivalent to Euclidean distance when all samples are normalized to have a unit norm. For the rest of the paper, we omit the normalization for simplicity and less clutter in the equations. The implementation and the evaluation, however, use cosine distance as instructed in the original paper.
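To make the credibility computation concrete, here is a minimal sketch of the nonconformity score and the empirical p-value described above; the function names and the input layout (neighbor_labels_per_layer as one array of neighbor labels per layer) are our own simplification, not the DkNN reference implementation.

```python
import numpy as np

def nonconformity(neighbor_labels_per_layer, candidate_class):
    """Number of neighbors, summed over all layers, whose label differs from the candidate class."""
    return sum(int(np.sum(labels != candidate_class)) for labels in neighbor_labels_per_layer)

def dknn_predict(neighbor_labels_per_layer, calibration_scores, num_classes):
    """Return (prediction, credibility) for one test input.

    calibration_scores: nonconformity scores of held-out calibration samples,
                        computed with respect to their true labels.
    """
    cal = np.asarray(calibration_scores)
    # Empirical p-value of each candidate class: fraction of calibration samples
    # that are at least as nonconforming as the test input would be under that class.
    p_values = [
        np.mean(cal >= nonconformity(neighbor_labels_per_layer, c))
        for c in range(num_classes)
    ]
    prediction = int(np.argmax(p_values))
    credibility = float(p_values[prediction])  # credibility = p-value of the predicted class
    return prediction, credibility
```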
III. THREAT MODEL
We assume the white-box threat model for attacks on both kNN and DkNN. More precisely, the adversary is assumed to have access to the training set and all parameters of the DkNN's neural network. Since a kNN classifier is non-parametric, the training set is, in some sense, equivalent to the weights of a parametric model. We also assume that the adversary knows all hyperparameters, namely k, the distance metric used (Euclidean or cosine distance), and, for DkNN, the calibration set. Though this knowledge is less crucial to the adversary, it allows the adversary to accurately evaluate his/her attack during the optimization, resulting in a more effective attack.

For consistent comparisons with previous literature, the adversarial examples must be contained within a norm-ball (ℓ2 and ℓ∞) centered at given test samples. We recognize that the ℓp-norm constraint may not be representative of human perception nor applicable in many real-world cases.

IV. ATTACK ON K-NEAREST NEIGHBORS
A. Notation
We follow the notation of Papernot & McDaniel as much as possible. Let z denote a target sample, i.e., a clean sample that the adversary uses as a starting point to generate an adversarial example, and y_z its ground-truth label. We denote the perturbed version of z as ẑ. The training set for both kNN and DkNN is (X, Y) with n samples of dimension d. The classifier's prediction for a sample x is knn(x).

B. Mean Attack
We first introduce a simple, intuitive attack to serve as a baseline. Let z be a clean sample, y_z its ground-truth class, and y_adv ≠ y_z a target class. The attack, which we call the mean attack, works by moving z in the direction of the mean of all training samples with class y_adv. Concretely, we first search for the class y_adv ≠ y_z such that the mean of the training samples with that class is closest to z in Euclidean distance. Let m denote the corresponding mean. We then use binary search to find the smallest c > 0 such that (1 − c)z + cm is misclassified by the kNN.

This attack is very simple to carry out and applicable to any classifier. While it is a natural choice for attacking a kNN with Euclidean distance, the attack may perform less well for cosine distance or other distance measures. As our experiments show, the mean attack also produces perturbations that make the resulting adversarial example look, to humans, more like samples from the target class, and thus makes the attack more noticeable. Nonetheless, this attack can be regarded as a simple baseline for measuring the robustness of nearest-neighbor classifiers.
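A minimal sketch of the mean attack under these definitions; knn_predict stands in for whichever classifier is being attacked, and the five-step binary search mirrors the setup described later in Section VI.

```python
import numpy as np

def mean_attack(z, y_z, X, Y, knn_predict, steps=5):
    """Move z toward the closest other-class mean until the classifier's prediction changes."""
    # Pick the target class whose training-set mean is closest to z in Euclidean distance.
    classes = [c for c in np.unique(Y) if c != y_z]
    means = {c: X[Y == c].mean(axis=0) for c in classes}
    y_adv = min(classes, key=lambda c: np.linalg.norm(z - means[c]))
    m = means[y_adv]

    # Binary search for the smallest c in (0, 1] such that (1 - c) z + c m is misclassified.
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        c = (lo + hi) / 2
        if knn_predict((1 - c) * z + c * m) != y_z:
            hi = c   # attack succeeds; try a smaller step toward the mean
        else:
            lo = c   # attack fails; move further toward the mean
    return (1 - hi) * z + hi * m
```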
Fig. 2: (a) Naive attack for k = 1: the target sample z (light blue circle) is moved towards each of the samples from a different class (red triangles). The one that requires the smallest ℓ2-distance to change the prediction is the optimal adversarial example ẑ (pink circle). (b) Naive attack for k > 1: in the first step, a set S of 3 samples from a different class closest to z is located with a greedy algorithm. The second step moves z towards the mean of the samples in S and stops when the prediction changes.

C. Naive Attack
Next, we introduce a second baseline attack that improves slightly on the mean attack. When k = 1, a simple algorithm can find the optimal adversarial example in O(n) time. For each training sample z′ of a class other than y_z, the algorithm moves the target sample z in a straight line towards z′ until knn(ẑ) ≠ y_z (i.e., setting ẑ = (1 − c)z + cz′, we find the smallest c > 0 such that knn(ẑ) ≠ y_z). This produces n candidate adversarial examples, and the algorithm outputs the one that is closest to z. Fig. 2(a) illustrates this algorithm.

This strategy finds the optimal adversarial example when k = 1, but when k > 1, it is not clear how to find the optimal adversarial example efficiently. Repeating the previous strategy on all sets of k training samples does not guarantee an optimal solution and is inefficient, as its complexity grows exponentially with k. Instead, we propose a computationally cheaper attack that greedily chooses only one set of samples to move towards, as summarized in Fig. 2(b). There are multiple possible heuristics for choosing this set. One simple option would be to find the ⌈k/2⌉ nearest neighbors of z whose labels all match but are different from y_z. We instead use a slightly more complex variant: (1) find the nearest neighbor from any class other than y_z, say class y_adv, (2) add this sample to an empty set S, and (3) out of all samples with class y_adv, iteratively find the nearest sample to the mean of S and add it to S. The final step is repeated until |S| = ⌈k/2⌉. Finally, we move z towards the mean of S until the classifier's prediction differs from y_z.
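A sketch of the greedy selection of S described above (the ⌈k/2⌉ set size and the names are our reading of the heuristic, not a reference implementation); the attack then moves z towards the mean of the returned set with the same binary search used by the mean attack.

```python
import numpy as np

def greedy_target_set(z, y_z, X, Y, k):
    """Greedily pick ceil(k/2) same-class training samples near z as the attack targets."""
    other = np.flatnonzero(Y != y_z)
    # Step (1): the nearest training sample from any class other than y_z fixes y_adv.
    nearest = other[np.argmin(np.linalg.norm(X[other] - z, axis=1))]
    y_adv = Y[nearest]
    candidates = list(np.flatnonzero(Y == y_adv))
    S = [nearest]
    candidates.remove(nearest)
    # Steps (2)-(3): repeatedly add the candidate closest to the current mean of S.
    while len(S) < int(np.ceil(k / 2)):
        mean_S = X[S].mean(axis=0)
        best = min(candidates, key=lambda i: np.linalg.norm(X[i] - mean_S))
        S.append(best)
        candidates.remove(best)
    return X[S], y_adv
```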
D. Gradient-Based Attack

Here we introduce our main attack on kNN. At a high level, it uses a heuristic initialization to choose a set of m training samples that are close to the target sample z. Then, gradient-based optimization is used to move z closer to the samples with the target class y_adv and further from the ones with the original class y_z. We discuss the choices for the heuristic initialization towards the end of this section. For now, the algorithm can be formulated as the following optimization problem:

\[
\hat{\delta} = \arg\min_{\delta} \sum_{i=1}^{m} w_i \cdot \|x_i - (z + \delta)\|_2 \quad \text{such that } \|\delta\|_p \le \epsilon \ \text{ and } \ z + \delta \in [0, 1]^d \tag{2}
\]

where δ is the perturbation, ẑ = z + δ̂ is the adversarial example, x_1, ..., x_m are the m training samples selected earlier, and w_i = 1 if the label of x_i is y_adv, otherwise w_i = −1. The first constraint bounds the norm of the perturbation, and the second constraint ensures that the adversarial example lies in a valid input range, which here we assume to be [0, 1] for pixel values.

However, Eq. 2 may not achieve what we desire since it treats all x_i equally and does not take into account that for kNN, only the k nearest neighbors contribute to the prediction, while the other training samples are entirely irrelevant. Moreover, the distance to these k neighbors does not matter as long as they are the k closest. In other words, the distance to each of these k neighbors is irrelevant so long as it is under a certain threshold η (where η is the distance to the k-th nearest neighbor). This means that a sample x_i gets a vote if ‖x_i − ẑ‖_2 ≤ η; otherwise, it gets no vote. The optimization above does not take this into account.

We show how to adjust the optimization to model this aspect of kNN classifiers. The function that maps ẑ to 0 or 1 according to whether x_i gets a vote is not continuous, and it has zero gradient where it is differentiable, so it poses challenges for gradient-based optimization. To circumvent this problem, we approximate the threshold with a sigmoid function, σ(x) = 1/(1 + e^{−αx}), where α is a hyperparameter that controls the "steepness" (or an inverse temperature) of the sigmoid. As α → ∞, the sigmoid exactly represents the Heaviside step function, i.e., a hard threshold. This lets us adjust Eq. 2 to incorporate the considerations above, as follows:

\[
\hat{\delta} = \arg\min_{\delta} \sum_{i=1}^{m} w_i \cdot \sigma\big(\|x_i - (z + \delta)\|_2 - \eta\big) \quad \text{such that } \|\delta\|_p \le \epsilon \ \text{ and } \ z + \delta \in [0, 1]^d \tag{3}
\]

Ideally, η should be recomputed at every optimization step, but this requires finding the k nearest neighbors at each step, which is computationally expensive. Instead, we fix the value of η by taking the average distance, over all training samples, from each sample to its k-th nearest neighbor.

Choosing the initial m samples. There is no single correct way to initialize the set of m samples. We empirically found that choosing all of them from the same class y_adv, and choosing the m training samples of that class that are closest to z, works reasonably well. We choose y_adv by computing the distance from z to the mean of all samples of class y, for each y, and taking the class y that minimizes this distance. Other heuristics might well perform better; we did not attempt to explore alternatives in depth, as this simple heuristic sufficed in our experiments.
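As an illustration of Eq. 3, below is a minimal sketch of the soft-threshold objective as a differentiable loss. The PyTorch framing, the function name, and the input layout (x_targets holding the m selected training samples, w holding the ±1 weights) are our assumptions, not the authors' implementation.

```python
import torch

def soft_knn_loss(z_adv, x_targets, w, eta, alpha=4.0):
    """Differentiable surrogate for the kNN vote (Eq. 3).

    z_adv:     candidate adversarial example, shape (d,)
    x_targets: the m selected training samples, shape (m, d)
    w:         +1 for samples of the target class, -1 otherwise, shape (m,)
    eta:       distance threshold approximating the k-th nearest-neighbor distance
    alpha:     steepness of the sigmoid; larger alpha -> closer to a hard threshold
    """
    dists = torch.norm(x_targets - z_adv, dim=1)       # distance to each selected sample
    votes = torch.sigmoid(alpha * (dists - eta))       # ~0 if the sample votes, ~1 if it is too far
    return torch.sum(w * votes)
```

Minimizing this sum pulls the target-class samples inside the threshold η (so they gain votes) and pushes the original-class samples outside it.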
The choice of the attack parameter m affects the attack success rate. A larger m means we consider more training samples, which makes the kNN more likely to be fooled, but it is also more expensive to compute and may produce larger distortion. In principle, one could recompute the set of m samples periodically as the optimization progresses, but for our experiments, we select them only once at the beginning.

For p = ∞, we use a change of variables as introduced by Carlini & Wagner [34] to provide pixel-wise box constraints that simultaneously satisfy both of the optimization constraints in Eq. 3. More precisely, the i-th pixel of the adversarial example is written as ẑ_i = ½(tanh(v_i) + 1) · (b_u − b_l) + b_l, where b_u and b_l are the upper and the lower bound of that pixel, respectively. v becomes the variable that we optimize over, but for simplicity, we omit it from Eq. 3. In the case of p = 2, this change of variables enforces the second constraint. The first constraint is relaxed and added to the objective function as a penalty term:

\[
\hat{\delta} = \arg\min_{\delta} \sum_{i=1}^{m} w_i \cdot \sigma\big(\|x_i - (z + \delta)\|_2 - \eta\big) + c \cdot \max\{0, \|\delta\|_2 - \epsilon\} \quad \text{such that } z + \delta \in [0, 1]^d \tag{4}
\]

To find an appropriate value for c, we use binary search for five steps. If the attack succeeds, c is increased; otherwise, c is decreased.
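The following sketch combines the loss above with the tanh change of variables and the ℓ2 penalty of Eq. 4 in an Adam loop. It assumes torch tensors for z, x_targets, and w; the learning rate, the final success check, and the omission of the binary search over c are simplifications on our part.

```python
import torch

def gradient_attack_l2(z, x_targets, w, eta, knn_predict, y_z,
                       c=1.0, eps=0.0, alpha=4.0, steps=400, lr=0.01):
    """Gradient attack on kNN with the tanh box reparameterization and an l2 penalty (Eq. 4)."""
    b_l, b_u = 0.0, 1.0                               # valid pixel range
    # Initialize v so that 0.5 * (tanh(v) + 1) reproduces z (clamped away from the boundary).
    v = torch.atanh(2 * z.clamp(1e-6, 1 - 1e-6) - 1).detach().requires_grad_(True)
    opt = torch.optim.Adam([v], lr=lr)

    for _ in range(steps):
        z_adv = 0.5 * (torch.tanh(v) + 1) * (b_u - b_l) + b_l   # always inside [b_l, b_u]
        delta = z_adv - z
        dists = torch.norm(x_targets - z_adv, dim=1)
        vote_loss = torch.sum(w * torch.sigmoid(alpha * (dists - eta)))
        penalty = c * torch.clamp(torch.norm(delta) - eps, min=0.0)
        loss = vote_loss + penalty
        opt.zero_grad()
        loss.backward()
        opt.step()

    z_adv = (0.5 * (torch.tanh(v) + 1) * (b_u - b_l) + b_l).detach()
    return z_adv, knn_predict(z_adv) != y_z
```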
V. ATTACK ON DEEP K-NEAREST NEIGHBORS

A. Notation
Let dknn(x) denote DkNN's prediction for a sample x. The prediction of the l-layer neural network part of the DkNN is denoted as f(x), and the output of the λ-th layer as f_λ(x), where λ ∈ {1, 2, ..., l}. The calibration set (X_c, Y_c) is used to calculate the empirical p-value as well as the credibility and confidence.

B. Mean Attack
The mean attack on DkNN is exactly the same as on kNN, without any modification, as the attack does not depend on the choice of classifier.
C. Baseline Attack
We use the adaptive attack evaluated by Papernot & McDaniel as a baseline. Given a target sample z, we try to minimize the distance between its representation at the first layer and that of a guide sample x_g, a sample from a different class whose representation is closest to f_1(z). For the ℓ∞-norm constraint, the attack can be written as:

\[
\hat{\delta} = \arg\min_{\delta} \|f_1(x_g) - f_1(z + \delta)\|_2 \quad \text{such that } \|\delta\|_\infty \le \epsilon \ \text{ and } \ z + \delta \in [0, 1]^d \tag{5}
\]

The optimization is solved with the L-BFGS-B optimizer, as suggested in Sabour et al. [35]. For completeness, we also evaluate the attack with an ℓ2 constraint, using the same relaxation as Eq. 4.
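One possible way to run this baseline with SciPy's L-BFGS-B is sketched below; the feature extractor f1 (assumed to be a PyTorch module that accepts a flattened input and returns the first-layer representation) and the derivation of the per-pixel bounds are our assumptions, not the authors' code.

```python
import numpy as np
import torch
from scipy.optimize import minimize

def baseline_attack_linf(z, x_guide, f1, eps):
    """Minimize ||f1(x_guide) - f1(z + delta)||_2 s.t. ||delta||_inf <= eps and z + delta in [0, 1]."""
    target = f1(torch.as_tensor(x_guide, dtype=torch.float32)).detach()
    # Per-pixel bounds: intersection of the eps-ball around z with the valid range [0, 1].
    lower = np.clip(z - eps, 0.0, 1.0)
    upper = np.clip(z + eps, 0.0, 1.0)

    def fun_and_grad(x_flat):
        x = torch.as_tensor(x_flat, dtype=torch.float32).requires_grad_(True)
        loss = torch.norm(f1(x) - target)
        loss.backward()
        return loss.item(), x.grad.numpy().astype(np.float64)

    res = minimize(fun_and_grad, z.astype(np.float64), jac=True,
                   method="L-BFGS-B", bounds=list(zip(lower, upper)),
                   options={"maxiter": 100})
    return res.x  # candidate adversarial example
```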
D. Gradient-Based Attack

The baseline attack relies on the assumption that if f_1(ẑ) is close to f_1(x_g), then f_λ(ẑ) will also be close to f_λ(x_g) for 1 ≤ λ ≤ l, resulting in ẑ and x_g having a similar set of neighbors in all of the layers as well as the same final prediction. However, while this assumption makes intuitive sense, it can be excessively strict for generating adversarial examples. The adversary only needs a large fraction of the neighbors of ẑ to be of class y_adv. By extending the gradient-based attack on kNN, we formulate an analogous optimization problem for attacking DkNN as follows:

\[
\hat{\delta} = \arg\min_{\delta} \sum_{i=1}^{m} \sum_{\lambda=1}^{l} w_i \cdot \sigma\big(\|f_\lambda(x_i) - f_\lambda(z + \delta)\|_2 - \eta_\lambda\big) \quad \text{such that } \|\delta\|_p \le \epsilon \ \text{ and } \ z + \delta \in [0, 1]^d \tag{6}
\]

The m samples are chosen similarly to the attack on kNN. In the interest of space, we omit the formulation for the ℓ2 constraint as it is analogous to Eq. 4.
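Eq. 6 only changes the earlier kNN loss by summing the soft votes over the network's layers. A minimal sketch, assuming a hypothetical `features` function that returns the list of per-layer representations [f_1(x), ..., f_l(x)] and `etas` holding one threshold η_λ per layer:

```python
import torch

def soft_dknn_loss(z_adv, x_targets, w, features, etas, alpha=4.0):
    """Layer-wise extension of the soft kNN loss (Eq. 6)."""
    z_feats = features(z_adv.unsqueeze(0))     # list of (1, d_lambda) tensors
    x_feats = features(x_targets)              # list of (m, d_lambda) tensors
    loss = 0.0
    for z_f, x_f, eta in zip(z_feats, x_feats, etas):
        dists = torch.norm(x_f - z_f, dim=1)   # distances in layer lambda
        loss = loss + torch.sum(w * torch.sigmoid(alpha * (dists - eta)))
    return loss
```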
VI. EXPERIMENTAL SETUP

We reimplement DkNN from Papernot & McDaniel with the same hyperparameters, including the network architecture and the value of k = 75. We evaluate our attacks on the MNIST dataset [36], as past research suggests that finding adversarial examples on other tasks is even easier. 60,000 samples are used as the training samples for kNN, DkNN, as well as the neural network part of DkNN. 750 samples (75 from each digit) are held out as the calibration set, leaving 9,250 test samples for evaluating the accuracy and the robustness of the classifiers against the attacks. Similarly to Papernot & McDaniel, for a quick nearest-neighbor search in DkNN, we use locality-sensitive hashing (LSH) from the FALCONN Python library, which is based on the cross-polytope LSH of Andoni et al. [37]. The kNN uses an exact neighbor search without any approximation. The kNN and the DkNN have an accuracy of 95.74% and 98.83% on the clean test set, respectively. The neural network alone has an accuracy of 99.24%.

All of the attacks are evaluated under both ℓ2- and ℓ∞-norm constraints, except for the naive attack on kNN and the mean attacks. For simplicity, we only evaluate untargeted attacks. Both the mean and the naive attacks use only five binary search steps. For the other attacks, we use 400 iterations of gradient updates and five steps of binary search on the ℓ2-penalty constant. The Adam optimizer is used in the gradient-based attack, and to save computation time, we only check for the termination condition (i.e., whether ẑ is misclassified) three times, at iterations 320, 360, and 400, instead of at every step.

We made minimal effort to select hyperparameters. We fix the steepness α of the sigmoid at 4, and for DkNN, we arbitrarily choose the initial m samples to be the k training samples with class y_adv whose first-layer representation is closest to that of z. For the ℓ2-norm attacks, ε is simply chosen to be 0 with the constant c being 1. This choice of penalty generally allows the optimization to find adversarial examples most of the time but may result in unnecessarily large perturbations. To set a stricter constraint, one could set ε to a desired threshold and c to a very large number.

TABLE I: Evaluation of all the attacks on kNN.

  Attacks                     Accuracy   Mean ℓ2 Distortion
  Clean Samples               0.9574     -
  Mean Attack
  Naive Attack                0.7834     8.599
  Gradient Attack (ℓ2)
  Gradient Attack (ℓ∞)        0.8514     5.282

TABLE II: Evaluation of all the attacks on DkNN.

  Attacks                     Accuracy   Mean Dist.   Mean Cred.
  Clean Samples               0.9883     -            0.6642
  Mean Attack                 0.1313     4.408        0.0172
  Baseline Attack (ℓ2)        0.1602     3.459        0.0185
  Baseline Attack (ℓ∞ = 0.2)  0.8891     2.660        0.0807
  Baseline Attack (fixed ℓ2)  0.5004     3.435        0.1385
  Gradient Attack (ℓ2)
  Gradient Attack (ℓ∞ = 0.2)  0.1744     3.476        0.1037
  Gradient Attack (fixed ℓ2)  0.0059     3.375

VII. RESULTS
A. k-Nearest Neighbors
Table I displays the accuracy and the mean ℓ2 distortion of the successful adversarial examples for kNN. As expected, the mean attack is very good at finding adversarial examples, but the perturbation is large and the adversarial examples sometimes introduce anomalies that may be noticeable to humans. Surprisingly, the naive attack performs much more poorly than the mean attack, indicating that the heuristic used to choose the set of target samples can significantly affect the attack success rate. The gradient-based attack with the ℓ2-norm performs well and is on par with the mean attack while having considerably smaller mean distortion. On the other hand, the gradient attack with an ℓ∞-norm of 0.2 is mostly unsuccessful. We speculate this might be because ε = 0.2 is too small and the ℓ∞-norm is an ineffective choice of norm, as kNN relies on Euclidean distance in the pixel space for prediction.

B. Deep k-Nearest Neighbors
Table II compares the accuracy, mean ℓ2 distortion, and mean credibility of the successful adversarial examples for DkNN across the three attacks. Our novel gradient-based attack outperforms the baseline as well as the mean attack by a significant margin. With an ℓ∞-norm constraint of 0.2, the gradient attack reduces the classifier's accuracy much further than the baseline. With an ℓ2-norm constraint, our gradient attack also performs better with a smaller perturbation. Although the mean attack reduces the accuracy even lower than the gradient attack with an ℓ∞-norm of 0.2, it has lower mean credibility and its perturbation is also considerably larger and more visible to humans.

Unlike an ℓ∞ constraint, which is strictly enforced by the change-of-variables trick, an ℓ2 constraint is written as a penalty term with only a tunable weighting constant. To compare the baseline and the gradient attacks under a similar ℓ2-norm, we arbitrarily set ε to be the mean ℓ2-norm of the ℓ∞ gradient attack (3.476) and the constant c to be just high enough that the optimization still finds successful attacks with minimal violation of the constraint ‖δ‖2 ≤ ε. We report the results for both attacks in Table II in the "fixed ℓ2" rows. The gradient attack, when given a large ℓ2 budget, can increase the credibility significantly and reduce the accuracy to almost zero (0.6%). In contrast, the baseline attack can only find adversarial examples for about 50% of the samples under the same ℓ2 constraint.

Fig. 3 shows a clean sample and its adversarial versions generated by all of the attacks, along with their five nearest neighbors at each of the four layers of representation. In the first column, all 20 neighbors of the clean sample have the correct class (a six). On the other hand, the majority of neighbors of the adversarial examples are of the incorrect class (a five), with the exception of the first layer, whose neighbors generally still come from the correct class. Another property common to all the attacks is that almost every neighbor in the final layer has the adversarial class.

Note that the ℓ2-attacks, both the baseline and the gradient-based attack, often perturb the sample in a semantically meaningful manner. Most are subtle, but some are quite prominent. For instance, the input in the third column from the left in Fig. 3 is perturbed by slightly removing the connected line that distinguishes a five from a six, making the adversarial example appear somewhat ambiguous to humans. In contrast, the ℓ∞ adversarial examples usually spread the perturbation over the entire image without changing its semantic meaning in a way that is noticeable to humans.

For the ℓ∞-norm constraint, as we increase ε, the accuracy of DkNN drops further and eventually reaches zero, as shown in Fig. 4(a), whereas increasing ε in the baseline attack reduces accuracy at a much slower rate.

Fig. 4(b) displays the mean credibility of successful adversarial examples generated by the baseline and the gradient attacks. As expected, as we increase ε, the mean credibility also increases for both attacks because the adversarial example can move closer to training samples from the target class.
The gradient-based attack increases the mean credibility at a much faster rate than the baseline, potentially because its objective function indirectly corresponds to the credibility, as it takes into account m training samples instead of one like the baseline. In the next section, we discuss the possibility of detecting adversarial examples by setting a threshold on the credibility score.
Fig. 3: Each column shows the five nearest neighbors in each of the four deep representation spaces of DkNN. From left to right, the inputs are a randomly chosen legitimate sample, its ℓ2 and ℓ∞ baseline attacks, and its ℓ2 and ℓ∞ gradient attacks. For the ℓ∞-norm constraint, ε is 0.2. The legitimate sample is correctly predicted by the DkNN, and all of the attacks succeed in changing the prediction from a six to a five, except for the ℓ∞ baseline attack.
Fig. 4: (a) Accuracy and (b) mean credibility of DkNN under the baseline attack and our gradient-based attack at different ℓ∞-norm constraints.
Fig. 5: Histogram of the credibility of the clean test samples and of the adversarial examples generated by the gradient-based attack with ℓ∞-norm constraints of 0.2 and 0.3. The black dashed vertical line indicates a credibility of 0.1.

VIII. DISCUSSION
A. Credibility Threshold
Papernot & McDaniel argue that the credibility output by DkNN is a well-calibrated metric for detecting adversarial examples. In Fig. 5, we show the distribution of the credibility for the clean test set and for adversarial examples generated by the gradient-based attack with two different ℓ∞-norms. Most of the test samples (around 55%) have credibility between 0.9 and 1. On the other hand, the majority of the adversarial examples have credibility less than 0.1, suggesting that setting a threshold on credibility can potentially filter out most of the adversarial examples. However, doing so comes at the cost of lowering accuracy on legitimate samples. Choosing a credibility threshold of 0.1 reduces accuracy on the test set to 91.15%, which is already very low for MNIST, and with this threshold, 28% and 43% of the adversarial examples with ℓ∞-norms of 0.2 and 0.3, respectively, still pass the threshold and would not be detected. It is also important to note that our attack is not designed to maximize the credibility. Rather, it is designed to find adversarial examples with minimal distortion. Simple parameter fine-tuning, e.g., a larger m, more iterations, and a smaller η, might all help increase the credibility.

Our experiments suggest that DkNN's credibility may not be sufficient for eliminating adversarial examples, but it is still a more robust metric for detecting adversarial examples than the softmax score of a typical neural network. Unfortunately, thresholding the credibility hurts accuracy on legitimate examples significantly even for a simple task like MNIST. According to Papernot & McDaniel, the SVHN and GTSRB datasets both have a larger fraction of legitimate samples with low credibility than MNIST, making a credibility threshold even less attractive. Experiments with the ImageNet dataset, deeper networks, choosing which layers to use, and pruning DkNN for robustness are all interesting directions for future work.

IX. CONCLUSION
We propose two heuristic attacks and a gradient-based attack on kNN and use them to attack DkNN. We found that our gradient attack performs better than the baseline: it generates adversarial examples with a higher success rate and lower distortion under both ℓ2- and ℓ∞-norm constraints. Our work suggests that DkNN is vulnerable to adversarial examples in a white-box adversarial setting. Nonetheless, DkNN still holds promise as a direction for providing significant robustness against adversarial attacks as well as interpretability of deep neural networks.

ACKNOWLEDGEMENTS
This work was supported by the Hewlett Foundation through the Center for Long-Term Cybersecurity and by generous gifts from Huawei and Google.
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, p. 354, 2017.
[4] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," CoRR, vol. abs/1712.01815, 2017. [Online]. Available: http://arxiv.org/abs/1712.01815
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
[6] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[7] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[8] G. J. S. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," CoRR, vol. abs/1702.05747, 2017. [Online]. Available: http://arxiv.org/abs/1702.05747
[9] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, "Evasion attacks against machine learning at test time," in Machine Learning and Knowledge Discovery in Databases, H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 387–402.
[10] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," CoRR, vol. abs/1312.6199, 2013. [Online]. Available: http://arxiv.org/abs/1312.6199
[11] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in International Conference on Learning Representations, 2015.
[12] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," arXiv preprint arXiv:1511.04599, 2015.
[13] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 427–436.
[14] L. Engstrom, D. Tsipras, L. Schmidt, and A. Madry, "A rotation and a translation suffice: Fooling CNNs with simple transformations," CoRR, vol. abs/1712.02779, 2017. [Online]. Available: http://arxiv.org/abs/1712.02779
[15] D. Hendrycks and T. G. Dietterich, "Benchmarking neural network robustness to common corruptions and surface variations," CoRR, vol. abs/1807.01697, 2018. [Online]. Available: http://arxiv.org/abs/1807.01697
[16] N. Carlini and D. Wagner, "Defensive distillation is not robust to adversarial examples," arXiv preprint arXiv:1607.04311, 2016.
[17] ——, "Adversarial examples are not easily detected: Bypassing ten detection methods," in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, ser. AISec '17. New York, NY, USA: ACM, 2017, pp. 3–14. [Online]. Available: http://doi.acm.org/10.1145/3128572.3140444
[18] A. Athalye, N. Carlini, and D. A. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," CoRR, vol. abs/1802.00420, 2018. [Online]. Available: http://arxiv.org/abs/1802.00420
[19] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," CoRR, vol. abs/1706.06083, 2017. [Online]. Available: http://arxiv.org/abs/1706.06083
[20] W. Xu, D. Evans, and Y. Qi, "Feature squeezing: Detecting adversarial examples in deep neural networks," CoRR, vol. abs/1704.01155, 2017. [Online]. Available: http://arxiv.org/abs/1704.01155
[21] J. Kim and J. F. Canny, "Interpretable learning for self-driving cars by visualizing causal attention," CoRR, vol. abs/1703.10631, 2017. [Online]. Available: http://arxiv.org/abs/1703.10631
[22] Q. Zhang, Y. N. Wu, and S. Zhu, "Interpretable convolutional neural networks," CoRR, vol. abs/1710.00935, 2017. [Online]. Available: http://arxiv.org/abs/1710.00935
[23] N. Papernot and P. D. McDaniel, "Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning," CoRR, vol. abs/1803.04765, 2018. [Online]. Available: http://arxiv.org/abs/1803.04765
[24] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," CoRR, vol. abs/1312.6034, 2013. [Online]. Available: http://arxiv.org/abs/1312.6034
[25] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?": Explaining the predictions of any classifier," CoRR, vol. abs/1602.04938, 2016. [Online]. Available: http://arxiv.org/abs/1602.04938
[26] W. Guo, D. Mu, J. Xu, P. Su, G. Wang, and X. Xing, "LEMNA: Explaining deep learning based security applications," in ACM Conference on Computer and Communications Security, 2018.
[27] D. Hendrycks and K. Gimpel, "Visible progress on adversarial images and a new saliency map," CoRR, vol. abs/1608.00530, 2016. [Online]. Available: http://arxiv.org/abs/1608.00530
[28] X. Li and F. Li, "Adversarial examples detection in deep networks with convolutional filter statistics," CoRR, vol. abs/1612.07767, 2016. [Online]. Available: http://arxiv.org/abs/1612.07767
[29] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, "Detecting adversarial samples from artifacts," CoRR, vol. abs/1703.00410, 2017.
[30] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel, "On the (statistical) detection of adversarial examples," CoRR, vol. abs/1702.06280, 2017. [Online]. Available: http://arxiv.org/abs/1702.06280
[31] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, "Can machine learning be secure?" in Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security. ACM, 2006, pp. 16–25.
[32] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar, "Adversarial machine learning," in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence. ACM, 2011, pp. 43–58.
[33] Y. Wang, S. Jha, and K. Chaudhuri, "Analyzing the robustness of nearest neighbors to adversarial examples," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 5133–5142. [Online]. Available: http://proceedings.mlr.press/v80/wang18c.html
[34] N. Carlini and D. A. Wagner, "Towards evaluating the robustness of neural networks," in IEEE Symposium on Security and Privacy (SP), 2017, pp. 39–57.
[35] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet, "Adversarial manipulation of deep representations," CoRR, vol. abs/1511.05122, 2015. [Online]. Available: http://arxiv.org/abs/1511.05122
[36] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[37] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, "Practical and optimal LSH for angular distance," in Advances in Neural Information Processing Systems, 2015.
CoRR , vol. abs/1511.05122, 2015. [Online].Available: http://arxiv.org/abs/1511.05122[36] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.[Online]. Available: http://yann.lecun.com/exdb/mnist/[37] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt,“Practical and optimal lsh for angular distance,” in