Exploiting epistemic uncertainty of the deep learning models to generate adversarial samples
Omer Faruk Tuna
Isik University, Istanbul, Turkey — [email protected]
Ferhat Ozgur Catak
Simula Research Laboratory, Fornebu, Norway — [email protected]
M. Taner Eskil
Isik University, Istanbul, Turkey — [email protected]

Abstract
Deep neural network architectures are considered to be robust to random perturbations. Nevertheless, it has been shown that they can be severely vulnerable to slight but carefully crafted perturbations of the input, termed adversarial samples. In recent years, numerous studies have been conducted in this new area, called "Adversarial Machine Learning", to devise new adversarial attacks and to defend against these attacks with more robust DNN architectures. However, almost all research work so far has concentrated on utilising the model loss function to craft adversarial examples or to create robust models. This study explores the use of quantified epistemic uncertainty, obtained from Monte-Carlo Dropout sampling, for adversarial attack purposes: we perturb the input towards regions the model has not seen before. We propose new attack ideas based on the epistemic uncertainty of the model. Our results show that our proposed hybrid attack approach increases the attack success rate from 82.59% to 85.40%, 82.86% to 89.92% and 88.06% to 90.03% on the MNIST Digit, MNIST Fashion and CIFAR-10 datasets, respectively.
In recent years, deep learning models began to exceed human-level performance. For example, in 2015 a deep learning model called ResNet [1] surpassed human performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and this record was later beaten by more advanced architectures. To give other examples from different benchmark tasks in the image domain, for the problems of reading address information from Google Street View imagery or solving CAPTCHAs, Goodfellow et al. [2] proposed a system which outperforms human operators. In the game-playing domain, an AI system named AlphaGo defeated the world Go champion in 2016. Today, we observe that many advanced systems built upon deep learning models offer a very high degree of success in different areas. As a result of this success, deep neural network (DNN) models started to be used in many different fields, ranging from medical diagnosis and autonomous vehicles to game playing and machine translation. However, during the rise of DNN models, researchers' main focus was to build more and more accurate models, and little attention was paid to the reliability and robustness of these models. Deep learning models indeed require a more elaborate evaluation, since they have intrinsic vulnerabilities that intruders can easily exploit.

By the end of 2013, researchers had discovered that existing deep neural networks are vulnerable to attacks. Szegedy et al. [3] first noticed the presence of adversarial examples in the context of image classification. The authors showed that it is possible to perturb an image by a small amount and change how it is classified. Very small and nearly imperceptible perturbations of the data samples are sufficient to fool the most advanced classifiers and cause incorrect classification. Adversarial machine learning attacks are based on perturbing the input instances in a direction that maximizes the chance of wrong decisions and results in false predictions. These attacks lead to a loss of the model's prediction performance, as the algorithm cannot correctly predict the real output of the input instances. Thus, attacks exploiting the vulnerability of DNNs can seriously undermine the security of such machine learning (ML) based systems, sometimes with devastating consequences. In medical applications, a perturbation attack can lead to an incorrect diagnosis of a disease; consequently, it can cause severe harm to the patient's health and also damage the healthcare economy [4]. As another example, autonomous cars use ML to navigate traffic without human intervention and avoid accidents. A wrong decision caused by an adversarial attack against an autonomous vehicle could cause a fatal accident [5, 6]. Hence, defending against adversarial attempts and increasing the robustness of ML models without compromising clean accuracy is of crucial importance. Assuming that these ML models will serve in critical areas, we should pay the greatest attention not only to the performance of ML models but also to the security concerns of these architectures.

In this study, we focus on adversarial attack strategies based on epistemic uncertainty maximization instead of traditional loss-maximization based attacks. We also look for the most effective approaches that significantly impact their performance and security implications.
The current approach of adversarial machine learning attacks is based on model loss maximization and aims to create craftily designed inputs. Unlike previous research in the literature, we follow a slightly modified strategy to craft adversarial samples and try to exploit the model's vulnerability using the model's quantified epistemic uncertainty. We show that perturbing the input image in a direction that maximizes the model's uncertainty amplifies the model loss and results in wrong predictions. The new approach combines the strengths of loss-based and uncertainty-based approaches to produce more destructive attacks through joint uncertainty and loss maximization. To sum up, our main contributions in this paper are:

• We utilize a new metric, the epistemic uncertainty of the model, which can be exploited to craft adversarial examples.
• We show that the performance of pure uncertainty-based attacks is as powerful as attacks based on the model loss.
• We demonstrate that using both the model loss and the uncertainty when crafting adversarial examples makes the two signals compensate for each other and yields better attack performance.

This study is organized as follows. Section 2 introduces some of the known attack types in the literature. In Section 3, we introduce the concept of uncertainty together with its main types and discuss how epistemic uncertainty can be quantified. Section 4 gives the details of our approach. We present our experimental results in Section 5 and conclude our work in Section 6.
Since the discovery of DNN vulnerability to adversarial attacks [3], a vast amount of research has been conducted both on devising new adversarial attacks and on defending against these attacks with more robust DNN models [7, 8, 9, 10]. Deep learning models contain many vulnerabilities and weaknesses which make them difficult to defend in the context of adversarial machine learning. For instance, they are often sensitive to small changes in the input data, resulting in unexpected results in the final output of the model. Figure 1 shows how an adversary can exploit such a vulnerability and manipulate the model through a carefully crafted perturbation applied to the input data.
Figure 1: An adversarial example. The malicious input is expressed as the original image plus a precisely crafted perturbation (scaled by 20 for better visibility), such that a "West Highland White Terrier (dog)" (probability 0.290) is misclassified as "Paper Towel" with high confidence (probability 0.999).
Attack strategies are mainly based on perturbing the input instance to maximise the model's loss. A significant number of adversarial attack algorithms have been proposed in recent years. The well-known adversarial attacks are the Fast Gradient Sign Method, the Iterative Gradient Sign Method, Projected Gradient Descent, the Jacobian-based Saliency Map Attack, Carlini-Wagner, and DeepFool. Sections 2.1 - 2.6 briefly describe these six adversarial machine learning attacks.
This method, also known as FGSM, is one of the earliest and most popular adversarial attacks to date. It was described by Goodfellow et al. [11]. FGSM uses the gradient of the model's loss function to determine in which direction the pixel values of the source image should be changed to minimize the loss function; it then changes all pixels simultaneously in the opposite direction to maximize the loss. For a model with classification loss function L(θ, x, y), where θ represents the model parameters, x is the input to the model (a sample input image in our case) and y_true is the label of the input, adversarial samples are generated using the formula below:

x^* = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(\theta, x, y_{true}))    (1)

One last key point about FGSM is that it is designed to be fast rather than optimal; it does not necessarily produce the minimum required adversarial perturbation. Besides, the success rate of this method is relatively low for small ε values compared to other attack types.

Kurakin et al. [12] proposed a small but effective improvement to FGSM. In their approach, rather than taking a single step of size ε in the direction of the gradient sign, several smaller steps of size α are taken and the result is clipped by the given ε value. This attack type is often referred to as the Basic Iterative Method (BIM), and it is simply FGSM applied to an input image iteratively. Generating perturbed images under the L_∞ norm for the BIM attack is given by Equation 2:

x^*_0 = x, \qquad x^*_{N+1} = \mathrm{Clip}_{x,\epsilon}\{ x^*_N + \alpha \cdot \mathrm{sign}(\nabla_x L(x^*_N, y_{true})) \}    (2)

where x is the input sample, x^*_N is the adversarial sample produced at the N-th iteration, L is the loss function of the model, y_true is the actual label of the input sample, ε is a tunable parameter limiting the maximum level of perturbation in the given l_∞ norm, and α is the step size.

The success rate of the BIM attack is higher than that of FGSM [13]. By adjusting the ε parameter, the attacker can control how far past the decision boundary an adversarial sample is pushed. BIM attacks can be grouped into two main types, BIM-A and BIM-B. In the former, we stop the iterations as soon as we succeed in fooling the model (passing the decision boundary), while in the latter we continue the attack until the end of the provided number of iterations, pushing the input far beyond the decision boundary.

This method, also known as PGD, was introduced by Madry et al. [14]. It perturbs a clean image x for a number of iterations i with a small step size in the direction of the gradient of the model's loss function. Unlike BIM, after each perturbation step it projects the resulting adversarial sample back onto the ε-ball of the input sample instead of clipping. Moreover, instead of starting from the original point (ε = 0 in all dimensions), PGD uses a random start, which can be described as:

x_0 = x + U(-\epsilon, +\epsilon)    (3)

where U(−ε, +ε) is the uniform distribution between −ε and +ε.

This method, also known as JSMA, was proposed by Papernot et al. [15]. It is designed to be used under the L_0 distance norm, which counts the total number of altered pixels when restricting the attacker. It is a greedy algorithm which selects two pixels at a time. The algorithm uses the gradient ∇Z(x)_l to compute a saliency map, which shows the impact of each pixel on the classification of each class.
The aim is to increase the likelihood of the target class while decreasing the likelihood of the other classes by selecting and updating two pixels at a time based on the saliency map. The procedure continues until either a predefined number of pixels has been modified or the model is successfully fooled.
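For reference, the loss-based FGSM update of Equation (1) and the iterative BIM update of Equation (2) can be sketched in PyTorch as follows. The names `model` and `loss_fn`, and the single-image batch assumption in the stopping test, are illustrative and are not taken from the original implementation:

```python
import torch

def fgsm(model, loss_fn, x, y_true, eps):
    """Single-step FGSM (Eq. 1): move every pixel by eps in the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y_true)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def bim(model, loss_fn, x, y_true, eps, alpha, n_iter, stop_on_success=False):
    """Iterative FGSM / BIM (Eq. 2); stop_on_success=True mimics BIM-A, False mimics BIM-B."""
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y_true)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            # clip back to the eps-ball around the clean image (L_inf constraint)
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).detach()
        if stop_on_success and model(x_adv).argmax(dim=1).item() != y_true.item():
            break  # BIM-A: stop as soon as the model is fooled (assumes a batch of one image)
    return x_adv
```

Both helpers assume the input tensor and label are on the same device as the model; clamping to the valid pixel range is omitted for brevity.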
This attack type was introduced by Carlini and Wagner [16], and it is one of the most powerful attack types to date. Therefore, it is generally used as a benchmark by the adversarial defense research community, which aims to create more robust DNN architectures resistant to adversarial attempts. The authors reformulate the attack as an optimization problem which can be solved using gradient descent to craft more powerful and effective adversarial samples.
This attack type was proposed by Moosavi-Dezfooli et al. [17], and it is one of the powerful untargeted attack types in the literature. It is designed to be used with different distance norms such as the L_∞ and L_2 norms. The DeepFool attack is based on the assumption that the neural network model behaves as a linear classifier, with the classes separated by a hyperplane. The algorithm starts from the initial input point x_t and at each iteration calculates the closest hyperplane and the minimum perturbation amount, which is the orthogonal projection onto that hyperplane. The algorithm then computes x_{t+1} by adding this minimal perturbation to x_t and checks whether a misclassification has occurred.

Traditionally, predictive models have been forced to provide a decision even in ambiguous cases where the model is not sure about its prediction and the quality of that prediction is expected to be low. Assuming that the model's prediction is always correct, without any reasoning based on the model's own uncertainty information, may lead to catastrophic results. This fact led researchers to suggest models that abstain under certain conditions, for example when the model's uncertainty is high, thereby improving reliability [18, 19]. In this section, we first introduce the two types of uncertainty in machine learning and then present how epistemic uncertainty can be quantified in the context of deep learning.
There are two different types of uncertainty in machine learning: epistemic uncertainty and aleatoric uncertainty [20, 21, 22].
Epistemic uncertainty refers to uncertainty caused by a lack of knowledge and the limited data needed for a perfect predictor [23]. It can be categorized into two groups, approximation uncertainty and model uncertainty, as depicted in Figure 2.
Approximation Uncertainty
In a standard machine learning task, the learner is given a set of data points drawn from an independent, identically distributed dataset. The learner then tries to induce a hypothesis ĥ from the hypothesis space H by choosing an appropriate learning method with its related hyperparameters and minimizing the expected loss (risk) with a selected loss function ℓ. However, what the learner actually does is minimize the empirical risk R_emp, which is only an estimate of the real risk R(h). The induced ĥ is an approximation of h*, the optimal hypothesis within H and the real risk minimizer. This fact results in an approximation uncertainty: the quality of the induced hypothesis is not perfect, and the learned model will always be prone to errors.

Model Uncertainty
Suppose the chosen hypothesis space H does not include the perfect predictor. In that case, the learner has no chance of realizing the aim of finding a hypothesis function that successfully maps all possible inputs to outputs. This leads to a discrepancy between the ground truth f* and the best possible function h* within H, called model uncertainty.
Figure 2: Different types of epistemic uncertainty within the hypothesis space H ⊂ F (f*: ground truth; h*: the best possible predictor within H; ĥ: the induced predictor).

However, the Universal Approximation Theorem states that for any target function f there exists a neural network which can approximate f [24, 25]. The hypothesis space H is huge for deep neural networks, so it is not unreasonable to assume that h* = f*. We can therefore ignore model uncertainty for deep neural networks and care only about approximation uncertainty. Consequently, in deep learning tasks, the actual source of epistemic uncertainty is the approximation uncertainty. Epistemic uncertainty reflects the confidence a model has in its prediction [26]. The underlying cause is uncertainty about the parameters of the model. This type of uncertainty is apparent in regions where we have limited training data and the model weights are not optimized correctly.

Aleatoric uncertainty refers to the variability in the outcome of an experiment which is due to inherent random effects [27]. This type of uncertainty cannot be reduced even with a sufficient number of training samples [28]. An excellent example of this phenomenon is the noise observed in the measurements of a sensor.
Figure 3: Illustration of epistemic and aleatoric uncertainty on a simple sinusoidal ground-truth function, with regions of low aleatoric uncertainty, high epistemic uncertainty (no training data) and high aleatoric uncertainty marked.

Figure 3 shows the plot of a simple nonlinear sinusoidal function. In the region on the right where data points are densely populated, the noisy samples are clustered, leading to high aleatoric uncertainty. As an example, these points may represent the measurements of a faulty sensor; one can conclude that the sensor produces errors around x = 10 for some inherent reason. We can also conclude that the middle region of the figure represents a high epistemic uncertainty area, because there are not enough training samples there for our model to describe the data well. Moreover, we can claim that high epistemic uncertainty areas correspond to low prediction accuracy areas.

Using techniques that quantify the uncertainty of the model is necessary for sound decision making. Assuming that deep learning models will be used in areas where safety and reliability are critical concerns, such as autonomous driving and medical applications, researchers need to be very careful and pay the utmost attention to prediction uncertainty. This will obviously help to increase the quality of the predictions.

In recent years, a significant amount of research has been conducted on quantifying uncertainty in deep learning models. Most of this work is based on Bayesian neural networks, which learn a posterior distribution over the weights to quantify predictive uncertainty [29]. However, Bayesian NNs come with additional computational cost and inference issues. Therefore, several approximations to Bayesian methods have been developed which make use of variational inference [30, 31, 32, 33]. On the other hand, Lakshminarayanan et al. [34] used a deep ensemble approach as an alternative to Bayesian NNs to quantify predictive uncertainty, but this requires training several NNs, which may not be feasible in practice. A more efficient and elegant approach was proposed by Gal et al. [35]. The authors showed that a neural network with dropout applied at inference time is equivalent to an approximation of a Bayesian Gaussian process, and that prediction uncertainty can be approximated by averaging probabilistic feed-forward Monte Carlo dropout samples at prediction time.

This acts as an ensemble approach: in each single ensemble model, different neurons in each layer of the network are dropped out according to the dropout ratio at prediction time. The predictive mean is the average of the predictions over the T dropout iterations and is used as the final inference ŷ for the input instance x̂. The overall prediction uncertainty is approximated by the variance of the probabilistic feed-forward Monte Carlo (MC) dropout samples collected at prediction time. The prediction is defined as:

p(\hat{y} = c \mid \hat{x}, D) \approx \hat{\mu}_{pred} = \frac{1}{T} \sum_{t \in T} p_t(\hat{y} \mid \theta, D)    (4)

where θ denotes the model weights, D is the input dataset, T is the number of MC dropout predictions, and x̂ is the input sample. The label of the input sample can be estimated as the mean of the T Monte-Carlo dropout predictions p(ŷ | θ, D).

Figure 4 shows the general overview of the Monte Carlo dropout based classification algorithm. At prediction time, random neurons in each layer are dropped out (based on the probability p) from the base neural network model to create a new model.
As a result, T different classification models can be used to predict the class label of the input instance and to quantify the uncertainty of the overall prediction. For each test input sample x, the predicted label is assigned according to the highest predictive mean, and the variance of p(ŷ) can be used as a measure of the epistemic uncertainty of the model.

We chose the MC dropout method due to its simplicity and efficiency: it is computationally much more efficient than the other approaches. It needs only a single trained model to measure uncertainty, while techniques such as deep ensembles need multiple models. Secondly, one can take the derivative of the computed variance term with respect to each input sample and use it to craft adversarial examples that evade the model.

It is obvious that model uncertainty is higher in areas where we have a limited number of training points. Due to this ignorance about the ground truth, we cannot achieve a perfect model that predicts every possible test point accurately. Figure 5 shows the prediction outputs of a regression model trained on a limited number of data points constrained to some interval. In this simple example, we trained a single-hidden-layer NN with ten neurons to learn the linear function y = -x + 1. As can be seen from the graph, in the areas where we do not have enough training points, the uncertainty values obtained from the MC dropout estimates of our model are high, which can be interpreted as low prediction quality: our model has difficulties deciding on the correct output values. Consistently, we also observe high loss values in these areas. For this reason, we can conclude that high epistemic uncertainty areas correspond to low prediction accuracy areas. Accordingly, we claim that pushing the model's limits by testing it under extreme conditions, with inputs it has never seen before, may cause the model's predictions to fail.
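As an illustration of this MC dropout procedure, a hedged PyTorch sketch is given below. The function name `mc_dropout_predict` and its arguments are our own; keeping the model in training mode is simply one way of leaving the dropout layers active at prediction time:

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, T=50):
    """T stochastic forward passes with dropout kept active;
    returns the predictive mean and a variance-based uncertainty score."""
    model.train()  # keep dropout layers active at prediction time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])  # (T, batch, C)
    mean_probs = probs.mean(dim=0)  # predictive mean of Eq. (4), used as the final prediction
    epistemic_u = probs.var(dim=0, unbiased=False).mean(dim=1)  # mean class variance as uncertainty
    model.eval()
    return mean_probs, epistemic_u
```

The predicted label is then `mean_probs.argmax(dim=1)`, and `epistemic_u` serves as the epistemic uncertainty estimate described above.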
Figure 4: Illustration of the Monte Carlo dropout based Bayesian prediction.

The aim of an adversarial attack is to find the least perturbation δ, constrained to some interval ε, that results in maximum loss and thus fools the classifier. We can express this mathematically as in the equation below, where F_θ(x) is our neural network:

\arg\max_{\|\delta\| \leq \epsilon} \; \ell(F_\theta(x + \delta), y)    (5)

Like most attack types in the literature, the attacker perturbs the input image in a direction which maximizes the loss, and this direction is found using the gradient of the loss function. However, we show that instead of using the loss function, another effective approach is to use the model's epistemic uncertainty. Our alternative approach uses the model's epistemic uncertainty as a tool for creating successfully manipulated adversarial input instances. In contrast to loss-based adversarial machine learning attacks, this method provides an alternative strategy in which the attacker can mount an effective attack by inducing a specific error in a particular uncertainty direction.
Figure 5: Uncertainty values obtained from a regression model (test data, training data, and the uncertainty obtained from MC dropout estimates).
To verify that our intuition holds, we conducted a simple experiment and plotted the loss surfaces of a trained CNN model (a digit classifier) within a constrained epsilon neighbourhood of the original input data points, as shown in Figure 6. Figure 6b shows the model's loss values along the direction of the model's loss gradient and a random direction. We see that the maximum loss value observed is 3.783. Then, as shown in Figure 6c, we projected the model loss surface onto the direction of the gradient of the model's epistemic uncertainty and the same random direction used in the previous attempt. This time, the maximum loss value achieved is 3.713, which is very close to the previous one. When we analyzed the directions of the loss and uncertainty gradients, we saw that out of 784 sub-directions, 693 were the same and 91 were different. This means we can maximize the model loss by perturbing the input image in a slightly different direction than the one used before. In our last attempt, we plotted the model loss surface along the directions of both the loss and uncertainty gradients, as in Figure 6d. We could now reach a loss value of 4.167, which is greater than in the previous two attempts. In Figures 6b and 6c, the points where we see a difference in colour on the loss surfaces indicate that the model prediction has changed from the correct class "7" to the wrong class "2". Therefore, we can conclude that perturbing the image in both directions will lead to misclassification.

It is well known that the loss curve of a deep neural network with a high degree of nonlinearity has many local minima and maxima in a high-dimensional space. Numerically finding the global extremum points is an NP-hard problem [36, 37]. No optimization approach can reach these global extrema using a naive method like gradient descent on the model's loss function; eventually, the optimizer will get stuck at local extrema. However, the above experiment gives us a hint that slightly changing the direction in each gradient descent step by leveraging the model uncertainty can increase the proposed attacks' performance.

(a) MNIST Digit test image number 37; (b) loss gradient direction; (c) uncertainty gradient direction; (d) hybrid gradient direction.
Figure 6: Loss surfaces in different directions within a constrained ε neighborhood. The maximum loss values are (b) 3.783, (c) 3.713, (d) 4.167.

We conducted the same experiment on a different sample from the MNIST test dataset. Figure 7 shows that the maximum loss value along the uncertainty gradient direction is far greater than the maximum loss value along the model loss gradient direction, and the maximum loss value along the hybrid direction is larger than along either of the other two. Besides, we observe that there is no possibility of misclassification along the loss gradient direction, as there is no visible colour change in the surface plot of Figure 7b, whereas in Figure 7c we observe yellow regions where the model misclassifies the input image along the uncertainty gradient direction. Again, when we analyzed the directions of
the loss and uncertainty gradients, we saw that out of 784 directions, only 639 were the same and 145 were different, which is a much larger difference than in the first experiment.

The epistemic uncertainty yields a better direction in our second experiment because our model (like all trained ML models) is not the "perfect" predictor, but only an approximation of the oracle function. The model itself has an inherent "approximation uncertainty" which can sometimes lead to sub-optimal solutions. Consequently, any method that relies only on the trained model (which is not the optimal model) will be less effective.

(a) MNIST Digit test image number 45; (b) loss gradient direction; (c) uncertainty gradient direction; (d) hybrid gradient direction.
Figure 7: Loss surfaces in different directions within a constrained ε neighborhood. The maximum loss values are (b) 0.285, (c) 0.811, (d) 0.845.
Previous attack types in the literature were designed to exploit the model loss and aimed at maximizing the model loss value within a constrained neighbourhood of the input data points, and quite successful results have been obtained with this approach. However, one possible drawback of these attacks is that they rely solely on the trained ML model, which inevitably suffers from approximation error. We can overcome this problem by utilizing an additional metric, namely the epistemic uncertainty of the model. This additional uncertainty information has a correcting effect and improves convergence towards global extremum points by yielding higher loss values. The results shown in Figure 6 and Figure 7 support this argument. Therefore, we can reformulate existing attacks using model uncertainty instead of model loss, and we can even benefit from both. The standard loss-based FGSM attack is:

x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x \ell(x, y_{true}))    (6)

where x is the clean input image, x_adv is the perturbed adversarial image, ℓ is the classification loss function and y_true is the true label of the input x. Our uncertainty-based modified FGSM attack is:

x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x U(x, F, p, T))    (7)

where x is the clean input image, x_adv is the perturbed adversarial image, U is the uncertainty metric (mean variance) obtained from T different MC dropout estimates, F is the prediction model in training mode, p is the dropout ratio used in the dropout layers, and T is the number of MC dropout samples taken with the model in training mode.

The uncertainty metric U (the mean variance of the T predictions) is calculated as follows:

Step 1:
For an input image x, T different predictions p_t(x) are obtained by Monte-Carlo dropout sampling, where each prediction is a vector of softmax scores over the C classes:

p_t(x) = F(x, p, T)

where F is the prediction model in training mode, p is the dropout ratio used in the dropout layers, and T is the number of MC dropout samples taken with the model in training mode.

Step 2:
The next step is to compute the average prediction score over the T different outputs:

p_T(x) = \frac{1}{T} \sum_{t \in T} p_t(x)

Step 3:
Compute the variance of the T predictions for each class:

\sigma^2(p(x)) = \frac{1}{T} \sum_{t \in T} \left( p_t(x) - p_T(x) \right)^2

Step 4:
Compute the expected value of the variance over all classes by taking their average:

U(x, F, p, T) = \mathbb{E}\left[ \sigma^2(p(x)) \right]
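To make Steps 1-4 and the uncertainty-based FGSM step of Equation (7) concrete, the following is a minimal PyTorch sketch. The function names (`uncertainty_U`, `uncertainty_fgsm`) and the use of `model.train()` to keep dropout active are our own illustrative choices, not code from the paper:

```python
import torch
import torch.nn.functional as F

def uncertainty_U(model, x, T=50):
    """Steps 1-4: mean (over classes) of the variance (over T MC dropout passes) of the softmax scores."""
    model.train()  # Step 1: dropout stays active, giving T stochastic forward passes
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])  # shape (T, batch, C)
    mean_p = probs.mean(dim=0)                   # Step 2: average prediction p_T(x)
    var_p = ((probs - mean_p) ** 2).mean(dim=0)  # Step 3: per-class variance over the T passes
    return var_p.mean(dim=1)                     # Step 4: expectation of the variance over the classes

def uncertainty_fgsm(model, x, eps, T=50):
    """Uncertainty-based FGSM step of Eq. (7): perturb along the sign of the uncertainty gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    u = uncertainty_U(model, x_adv, T).sum()     # reduce to a scalar so that backward() is well defined
    u.backward()
    model.eval()
    return (x_adv + eps * x_adv.grad.sign()).detach()
```

Note that the variance term is kept differentiable, so its gradient with respect to the input can be taken; this is exactly what the uncertainty-based attacks below require.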
In this part, we first provide the pseudo-code for the known loss-based BIM attack types in Algorithms 1 and 2. Then, we provide our proposed uncertainty-based BIM attack variants in Algorithms 3 and 4. All the attack types proposed here are designed under the L∞ norm.

Algorithm 1:
Algorithm for BIM-A (loss based). x is the benign image, y_true is the true label for x, F is the model function learnt during training, N is the number of iterations, ε is the maximum allowed perturbation, α is the step size.
Input: x ∈ R^m, y_true, F, N, ε, α
Output: x_{t+1}
  x_0 ← x
  while n < N do
    /* update x using the formula below, F in evaluation mode */
    x_{t+1} = clip_{x,ε}( x_t + α · sign(∇_x ℓ(x_t, y_true)) )
    if argmax(F(x_{t+1})) ≠ y_true then
      break
  end while
  return x_{t+1}

Algorithm 2:
Algorithm for BIM-B (loss based). x is the benign image, y_true is the true label for x, F is the function learned by the network during training, N is the number of iterations, ε is the maximum allowed perturbation, α is the step size.
Input: x ∈ R^m, y_true, F, N, ε, α
Output: x_{t+1}
  x_0 ← x
  while n < N do
    /* update x using the formula below, F in evaluation mode */
    x_{t+1} = clip_{x,ε}( x_t + α · sign(∇_x ℓ(x_t, y_true)) )
  end while
  return x_{t+1}

Algorithm 3:
Algorithm for BIM-A (uncertainty based). x is the benign image, F is the model function learnt during training, p is the dropout ratio used in the dropout layers, T is the number of MC dropout samples in model training mode, N is the number of iterations, ε is the maximum allowed perturbation, α is the step size.
Input: x ∈ R^m, F, p, T, N, ε, α
Output: x_{t+1}
  x_0 ← x
  initial_prediction = argmax(F(x_0))
  while n < N do
    /* update x using the formula below, F in training mode */
    x_{t+1} = clip_{x,ε}( x_t + α · sign(∇_x U(x_t, F, p, T)) )
    if argmax(F(x_{t+1})) ≠ initial_prediction then
      break
  end while
  return x_{t+1}

Algorithm 4: Algorithm for BIM-B (uncertainty based). x is the benign image, F is the model function learnt during training, p is the dropout ratio used in the dropout layers, T is the number of MC dropout samples in model training mode, N is the number of iterations, ε is the maximum allowed perturbation, α is the step size.
Input: x ∈ R^m, F, p, T, N, ε, α
Output: x_{t+1}
  x_0 ← x
  initial_prediction = argmax(F(x_0))
  condition = False
  while n < N do
    if argmax(F(x_{t+1})) ≠ initial_prediction then
      condition = True
    if condition = False then
      /* update x using the formula below, F in training mode */
      x_{t+1} = clip_{x,ε}( x_t + α · sign(∇_x U(x_t, F, p, T)) )
    else
      /* update x using the formula below, F in training mode */
      x_{t+1} = clip_{x,ε}( x_t − α · sign(∇_x U(x_t, F, p, T)) )
  end while
  return x_{t+1}

Here, we present the pseudo-code for our hybrid approach in Algorithm 5. As with the previous BIM attack variants, our hybrid approach is also designed under the L∞ norm. At each iteration, we step in both the direction of the model loss gradient and the direction of the model uncertainty gradient. These two metrics compensate for each other and yield a better result.

Algorithm 5: Algorithm for BIM-A (hybrid approach). x is the benign image, F is the model function learnt during training, p is the dropout ratio used in the dropout layers, T is the number of MC dropout samples in model training mode, N is the number of iterations, ε is the maximum allowed perturbation, α is the step size.
Input: x, F, p, T, N, ε, α
Output: x_{t+1}
  x_0 ← x
  initial_prediction = argmax(F(x_0))  /* F in evaluation mode */
  while n < N do
    /* update x using the formula below; F is in training mode when calculating the gradient of the model uncertainty and in evaluation mode when calculating the gradient of the model loss */
    x_{t+1} = clip_{x,ε}( x_t + α · sign(∇_x U(x_t, F, p, T)) + α · sign(∇_x ℓ(x_t, y_true)) )
    if argmax(F(x_{t+1})) ≠ initial_prediction then
      break
  end while
  return x_{t+1}

Figure 8 shows a simplified example of the gradient path for our uncertainty-based BIM attack variants. In the pictures, low uncertainty regions are shown in blue and high uncertainty regions in red. Figure 8a shows an example of a successful uncertainty-based BIM type A attack. However, we would expect the uncertainty-based BIM type B attack to be unsuccessful for this specific example, because at the intermediate iteration where we pass the decision boundary from the source to the target class, we are on the left side of the uncertainty hill; when we try to decrease the uncertainty, we perturb the image back towards the original class manifold. For Figure 8b, however, we would expect both the uncertainty-based BIM type A and type B attacks to be successful, because the uncertainty hill is to the right of the decision boundary and, at the intermediate iteration, we are to the right of the uncertainty hill.
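A hedged sketch of one possible implementation of the hybrid update in Algorithm 5 is given below. It reuses the `uncertainty_U` helper from the earlier sketch; the stopping test and the projection step reflect our own reading of the pseudo-code rather than the authors' released code:

```python
import torch

def hybrid_bim(model, loss_fn, x, y_true, eps, alpha, n_iter, T=50):
    """Hybrid BIM step: move along the signs of both the loss and the uncertainty gradients."""
    model.eval()
    initial_pred = model(x).argmax(dim=1)  # reference prediction, F in evaluation mode
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        model.eval()  # loss gradient computed with dropout disabled
        loss_grad = torch.autograd.grad(loss_fn(model(x_adv), y_true), x_adv)[0]
        u_grad = torch.autograd.grad(uncertainty_U(model, x_adv, T).sum(), x_adv)[0]  # dropout enabled inside
        with torch.no_grad():
            x_adv = x_adv + alpha * loss_grad.sign() + alpha * u_grad.sign()
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).detach()  # project back onto the eps-ball
        model.eval()
        if (model(x_adv).argmax(dim=1) != initial_pred).all():
            break  # stop once the prediction has changed (BIM-A style stopping)
    return x_adv
```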
Figure 9 shows the change in the quantified uncertainty values of the model during the different BIM attack variants. In this example, we apply all of the attack variants to the 23rd test sample from the MNIST (Digit) dataset. The original label of the input image is 6. For type A and type B of both the loss-based and the uncertainty-based attacks, we observed that at the 4th iteration the attack succeeds and the input image is misclassified as 4. In Figure 9a, we stop the iterations as soon as we succeed in fooling the model, whereas in Figure 9b we continue to perturb the image, but this time in a direction that minimizes the uncertainty.
Figure 8: Uncertainty gradient path. (a) Type A succeeds, Type B fails; (b) both Type A and Type B succeed. In each panel, x_0 is the original image and x_t an intermediate image; the original class data manifold, the target class data manifold, the decision boundary and the uncertainty hill are marked.
Figure 9: Change of uncertainty values (variance) against the number of iterations during different BIM attack variants. (a) BIM-A (uncertainty based); (b) BIM-B (uncertainty based); (c) BIM-A (loss based); (d) BIM-B (loss based); (e) BIM-Enhanced (hybrid approach).
After the last iteration, the predicted label was still 4, and the uncertainty level had decreased compared to the moment of misclassification. For this sample, our uncertainty-based BIM attack type B was successful because, when we pass the decision boundary while trying to maximize the model uncertainty, we also go beyond the point of maximum uncertainty. One last important point is that when we apply the hybrid approach, which utilizes both loss and uncertainty, we can successfully fool the model after the 2nd iteration, which is much faster. This also supports our assumption that the hybrid approach is more effective than the others.

We assume that the attacker's primary purpose is to evade the model by applying a carefully crafted perturbation to the input data. In a real-world scenario, this white-box setting is the most desirable choice for an adversary that does not want to risk being caught in a trap. The problem is that it requires the attacker to access the model from outside to generate adversarial examples. After manipulating the input data, the attacker can exploit the model's vulnerabilities in the same manner as in an adversary's sandbox environment. The attack succeeds when the attacker can make the classification model predict some class labels as other class labels (i.e., wrong predictions). However, to prevent this manipulation from being easily noticed by the human eye, the attacker must solve an optimization problem to decide which regions of the input data must be changed. By solving this optimization problem using one of the available attack methods [11, 12, 14, 38], the attacker aims to reduce the classification performance on the adversarial data as much as possible. In this study, to limit the maximum perturbation allowed to the attacker, we used the l∞ norm, which is the maximum pixel difference limit between the original and adversarial images.

We trained our CNN models on the MNIST (Digit) [39] and MNIST (Fashion) [40] datasets and achieved accuracy rates of 99.05% and 91.15%, respectively. The model architectures are given in Table 1 and the selected hyperparameters in Table 2. For the CIFAR-10 dataset [41], we used a pretrained VGG-A (11 weight layers) model [42] and applied transfer learning by freezing the convolutional layers, changing the number of neurons in the output layer from 1000 to 10 and updating the weights of the dense layers only, for 10 epochs. In this way, we achieved an accuracy rate of 88.78% on the test data. Since the pretrained VGG model was trained on the
ImageNet dataset, we had to rescale the CIFAR-10 images from 32 × 32 to 224 × 224. We also applied the same normalization procedure, normalizing all pixels with the standard ImageNet channel means and standard deviations. For the CIFAR-10 dataset, we could only use about 54% of the test data (randomly shuffled) in our experiments due to the limits of our lab environment. The adversarial settings used throughout our experiments are provided in Table 3. Finally, we used T = 50 as the number of MC dropout samples when quantifying uncertainty.

During our experiments, we only perturbed the test samples that were correctly classified by our models in their original states, since an intruder would have no reason to perturb samples that are already classified wrongly.

The results show that our hybrid approach of using both the model's loss and uncertainty gives the best performance. The success rates of pure loss-based and pure uncertainty-based attacks are similar to each other. We also observe that the success rates of uncertainty-based attack types A and B differ. We argue that the point of global maximum of the uncertainty metric for any class does not lie on the model's decision boundary, as it does for the model loss; instead, the point of maximum uncertainty can be beyond the decision boundary. Therefore, during the gradient-based search it may be possible to pass the decision boundary and still not reach the peak uncertainty value, and when we start to decrease the uncertainty after passing the decision boundary (fooling the model), it may be possible to move back to the original class. However, this is not the case in loss-based approaches: since we are maximizing the loss with respect to a reference class, we always see an increasing trend during the gradient-based loss maximization. Figure 10 shows some examples of adversarial samples crafted using the different methods mentioned in this study.
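The CIFAR-10 transfer-learning setup described above could be reproduced along the following lines; the optimizer and learning rate are assumptions, while the frozen convolutional layers, the 10-class output layer, the 10 training epochs and the ImageNet normalization follow the text:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Upscale CIFAR-10 images to the VGG input size and apply the standard ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.vgg11(pretrained=True)          # VGG-A (11 weight layers) pretrained on ImageNet
for param in model.features.parameters():      # freeze the convolutional layers
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 10)      # replace the 1000-way head with a 10-class output layer

# Only the dense layers are updated; SGD with lr=0.01 is an assumed choice.
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()), lr=0.01)
```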
Figure 10: Example images from MNIST (Digit), MNIST (Fashion) and CIFAR-10. The original image is shown in the left-most column, and adversarial samples crafted with the different methods (FGSM loss based and uncertainty based, BIM-A loss based and uncertainty based, BIM-B uncertainty based, and the enhanced BIM-A hybrid attack) are shown in the other columns.
Table 1: CNN model architectures
MNIST (Digit)
  Convolution (padding: 1) + ReLU
  Max Pooling
  Convolution (padding: 1) + ReLU
  Max Pooling
  Convolution (padding: 1) + ReLU
  Dropout (p: 0.25)
  Convolution (padding: 1) + ReLU
  Dropout (p: 0.25)
  Fully Connected + ReLU (980)
  Fully Connected + ReLU (100)
  Output Layer (10)

MNIST (Fashion)
  Convolution (padding: 1) + ReLU
  Max Pooling
  Convolution (padding: 1) + ReLU
  Max Pooling
  Convolution (padding: 1) + ReLU
  Dropout (p: 0.5)
  Convolution (padding: 1) + ReLU
  Dropout (p: 0.5)
  Fully Connected + ReLU (3136)
  Fully Connected + ReLU (600)
  Fully Connected + ReLU (120)
  Output Layer (10)

Table 2: CNN model parameters
Parameter       CNN Model for MNIST (Digit)   CNN Model for MNIST (Fashion)
Optimizer       SGD                           Adam
Learning rate   0.01                          0.001
Batch size      100                           64
Dropout ratio   0.25                          0.25
Epochs          7                             10
In this study, we proposed a new attack algorithm that perturbs the input in a direction that maximizes the model's epistemic uncertainty instead of the model's loss. We observed performance almost similar to that of loss-based approaches. We also introduced a new concept for finding better points, resulting in higher loss values within a specified L_p norm interval, to craft adversarial samples. For this, we used a hybrid approach and stepped in both the loss and uncertainty gradient directions at each gradient descent step. We showed that attack success rates are higher when this approach is used.

The aim of this study was not to propose the most powerful attack to date; instead, we aimed to show that there exist other powerful metrics, different from the model's loss, that can be exploited to craft adversarial examples. Besides, relying only on the trained model is not always a good idea, as it is just an approximation of the best predictor, and epistemic uncertainty information can be very useful in cases where the model is misleading. We also showed that the combined use of uncertainty and loss yields better performance. As future work, we will investigate the use of aleatoric uncertainty for adversarial attack and defense purposes, especially for detecting adversarial samples.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
Table 3: Adversarial settings of our experiments: α and i respectively denote the step size and the number of attack steps for a perturbation budget ε.

Attack   Parameters   l_p norm
FGSM     i = 1        l∞
BIM      α = ε ·      l∞
BIM      α = ε ·      l∞

Table 4: Attack success rates on different datasets

Attack                      MNIST (Digit), ε = 0.15   MNIST (Fashion), ε = 0.05   CIFAR-10, ε = 0.2/255
FGSM (Loss based)           47.34%                    59.40%                      70.83%
FGSM (Uncertainty based)    47.80%                    62.73%                      69.76%
BIM-A (Loss based)          82.59%                    82.86%                      88.06%
BIM-A (Uncertainty based)   75.68%                    84.71%                      85.57%
BIM-B (Uncertainty based)   65.13%                    71.60%                      81.05%
BIM-A (Hybrid Approach)     85.40%                    89.92%                      90.03%

[2] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks, 2014.
[3] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks, 2014.
[4] Samuel G. Finlayson, Hyung Won Chung, Isaac S. Kohane, and Andrew L. Beam. Adversarial attacks against medical deep learning systems, 2019.
[5] Chawin Sitawarin, Arjun Nitin Bhagoji, Arsalan Mosenia, Mung Chiang, and Prateek Mittal. Darts: Deceiving autonomous cars with toxic signs, 2018.
[6] Nir Morgulis, Alexander Kreines, Shachar Mendelowitz, and Yuval Weisglass. Fooling a real car with adversarial traffic signs, 2019.
[7] A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review, 37:100270, 2020.
[8] Ferhat Ozgur Catak, Samed Sivaslioglu, and Kevser Sahinbas. A generative model based adversarial security of deep learning and linear classifier models. ArXiv e-prints, 2020.
[9] A. Qayyum, M. Usama, J. Qadir, and A. Al-Fuqaha. Securing connected autonomous vehicles: Challenges posed by adversarial machine learning and the way forward. IEEE Communications Surveys & Tutorials, 22(2):998–1026, 2020.
[10] K. Sadeghi, A. Banerjee, and S. K. S. Gupta. A system-driven taxonomy of attacks and defenses in adversarial machine learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 4(4):450–467, 2020.
[11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015.
[12] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world, 2017.
[13] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. CoRR, abs/1611.01236, 2016.
[14] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019.
[15] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning, 2017.
[16] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks, 2017.
[17] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks, 2016.
[18] Max-Heinrich Laves, Sontje Ihler, and Tobias Ortmaier. Uncertainty quantification in computer-aided diagnosis: Make your model say "I don't know" for ambiguous cases, 2019.
[19] Omer Faruk Tuna, Ferhat Ozgur Catak, and M. Taner Eskil. Closeness and uncertainty aware adversarial examples detection in adversarial machine learning. ArXiv e-prints, 2020.
[20] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods, 2020.
[21] Dongdong An, Jing Liu, Min Zhang, Xiaohong Chen, Mingsong Chen, and Haiying Sun. Uncertainty modeling and runtime verification for autonomous vehicles driving control: A machine learning-based approach. Journal of Systems and Software, 167:110617, 2020.
[22] Rui Zheng, Shulin Zhang, Lei Liu, Yuhao Luo, and Mingzhai Sun. Uncertainty in Bayesian deep label distribution learning. Applied Soft Computing, 101:107046, 2021.
[23] Fabio Antonelli, Vittorio Cortellessa, Marco Gribaudo, Riccardo Pinciroli, Kishor S. Trivedi, and Catia Trubiani. Analytical modeling of performance indices under epistemic uncertainty applied to cloud computing systems. Future Generation Computer Systems, 102:746–761, 2020.
[24] Ding-Xuan Zhou. Universality of deep convolutional neural networks, 2018.
[25] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, December 1989.
[26] Antonio Loquercio, Mattia Segu, and Davide Scaramuzza. A general framework for uncertainty estimation in deep learning. IEEE Robotics and Automation Letters, 5(2):3153–3160, Apr 2020.
[27] Pavel Gurevich and Hannes Stuke. Pairing an arbitrary regressor with an artificial neural network estimating aleatoric uncertainty. Neurocomputing, 350:291–306, 2019.
[28] Robin Senge, Stefan Bösner, Krzysztof Dembczyński, Jörg Haasenritter, Oliver Hirsch, Norbert Donner-Banzhoff, and Eyke Hüllermeier. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16–29, 2014.
[29] Geoffrey E. Hinton and R. Neal. Bayesian learning for neural networks. 1995.
[30] Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24, pages 2348–2356. Curran Associates, Inc., 2011.
[31] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search, 2012.
[32] Matt Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference, 2013.
[33] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks, 2015.
[34] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles, 2017.
[35] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, 2016.
[36] J. Stephen Judd. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA, USA, 1990.
[37] Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
[38] M. Aladag, F. O. Catak, and E. Gul. Preventing data poisoning attacks by using generative models. In 2019 1st International Informatics and Software Engineering Conference (UBMYK).