ArXiv | 2021
Attribution of Gradient Based Adversarial Attacks for Reverse Engineering of Deceptions
Abstract
Machine Learning (ML) algorithms are susceptible to adversarial attacks and deception both during training and deployment. Automatic reverse engineering of the toolchains behind these adversarial machine learning attacks will aid in recovering the tools and processes used in these attacks. In this paper, we present two techniques that support automated identification and attribution of adversarial ML attack toolchains, using co-occurrence pixel statistics and Laplacian residuals. Our experiments show that the proposed techniques can identify the parameters used to generate adversarial samples. To the best of our knowledge, this is the first approach to attribute gradient-based adversarial attacks and estimate their parameters. Source code and data are available at: https://github.com/michael-goebel/ei red.

Introduction

Convolutional neural networks (CNNs) are increasingly being used in critical applications, such as self-driving cars and face authentication. Recent works have shown that gradient-based attacks can reduce the accuracy of visual recognition networks to less than 1% while minimally perturbing an image. The adversary uses gradient descent through the network to maximize the output at an incorrect label while minimizing the perturbation to the image. Various attack methods have been developed within this common framework, including the Fast Gradient Sign Method (FGSM) [9] and Projected Gradient Descent (PGD) [16]. Works have also been proposed to detect such adversarial samples, but none have been published that can estimate the adversarial setup from image samples. Knowing such parameters would allow for more accurate adversarial retraining against these attacks, as well as aid in recovering the tools and processes used in them [1].

Figure 1: A sample PGD attack against ResNet. Small perturbations against a network with known weights can lead to significant differences in prediction outputs. Scores shown are confidence scores in the range 0-1 that sum to 1.

Gradient-descent based adversarial attacks use the gradients of deep neural networks (DNNs) to imperceptibly alter their inputs so as to change the output dramatically. Within this family, there are various strains of algorithms, each with several parameters. In this work, we propose to detect such adversarial attack toolchains and their parameters. Our objectives are two-fold:

1. To attribute an adversarially attacked image to a particular attack toolchain/family.
2. Once an attack has been identified, to determine the parameters of the attack so as to facilitate the reverse engineering of these adversarial deceptions.

We will now briefly describe the attacks considered for detection and attribution. A deep neural network (DNN) is represented as a function f : X → Y, where X denotes the input space of data and Y denotes the output space of classification categories. The training set comprises known pairs (x_t, y_t), where x_t ∈ X and y_t ∈ Y, and f(·) is obtained by minimizing a loss function J(f(x_t), y_t). We consider the following attacks (a minimal code sketch of both is given after this list):

1. Fast Gradient Sign Method (FGSM): This attack perturbs a clean image x by taking a fixed step in the direction of the gradient of J(f(x_t), y_t) with respect to x_t.
2. Projected Gradient Descent (PGD): This attack is an improvement over FGSM, where the adversarial sample x′ is generated over multiple iterations, and intermediate results are clipped so as to keep them within the ε-neighborhood of x:

$$x'_i = \mathrm{clip}_{\varepsilon}\left(x'_{i-1} + \alpha \cdot \mathrm{sign}\left(\nabla_x J(f(x'_{i-1}), y)\right)\right)$$

where α denotes the step size.
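To make these update rules concrete, the following is a minimal PyTorch sketch of untargeted ℓ∞ FGSM and PGD. It is an illustration only, not the implementation used in this paper or in any particular toolbox; the function names and the parameters eps (ε), alpha (α), and num_steps are generic placeholders.

```python
# Minimal, illustrative implementations of untargeted l-infinity FGSM and PGD.
# Not the code from this paper or from any toolbox; eps, alpha, and num_steps
# are generic placeholder names. Images are assumed to lie in [0, 1].
import torch
import torch.nn.functional as F


def fgsm(model, x, y, eps):
    """One step of size eps in the sign of the loss gradient (FGSM)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()


def pgd(model, x, y, eps, alpha, num_steps):
    """Repeated steps of size alpha, with the accumulated perturbation clipped
    to the eps-ball around the clean image x after every iteration."""
    x = x.clone().detach()
    x_adv = x.clone()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # stay within the eps-neighborhood
        x_adv = x_adv.clamp(0, 1)                 # stay within the valid pixel range
    return x_adv.detach()

# Example (untargeted): x_adv = pgd(model, images, labels, eps=8/255, alpha=2/255, num_steps=10)
# A targeted variant would instead minimize the loss toward the target label,
# i.e., step with -alpha.
```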
These two attacks are examples of ℓ∞ attacks, where ε represents the maximum allowable perturbation to any pixel in x. Implementations of these attacks are available in several software repositories: Advertorch [6], the Adversarial Robustness Toolbox [19], Foolbox [22], and CleverHans [20]. A PGD example from the Advertorch toolbox is given in Figure 1.

Related Works

Many works have taken the approach of creating more robust networks, for which small changes in the input will not significantly change the output classification [2, 11, 13, 14, 21, 23, 24, 27]. Generally, these methods cause a significant decrease in accuracy for both tampered and untampered images [5]. While such networks are necessary when class estimation is required for all samples, other methods may be more favorable when this requirement is relaxed. Detection has become another popular approach to circumventing these attacks [4, 7, 8, 10, 17, 12]. Such methods allow the classification networks to remain as is, while filtering out adversarial examples before they reach the target network. The methods presented in this paper move a step beyond simple detection, with the addition of attack classification and parameter estimation.

Method

Model

To enhance the artifacts created by adversarial attacks, we consider two preprocessing methods common in image forensics before training a neural network. A visual summary of our detector is given in Figure 2. As a baseline, we compare these two methods against a method with no preprocessing.

Figure 2: High-level model diagram for detection. All models fit into this framework, with different preprocessing methods.

The first is a Laplacian high-pass filter. Similar filters have been used for both image resampling detection [15] and general image manipulation detection [3]. In our tests, the following 3x3 filter was applied to each of the RGB channels:

$$h(x, y) = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad (1)$$

The second preprocessing method investigated is the co-occurrence matrix. Such matrices have been used extensively in the detection of steganography [26, 25] as well as in the detection of GAN-generated images [18]. For this method, two-dimensional histograms of adjacent pixel pairs are constructed for each of the color channels. Below we show the equation for horizontal pairs, where X is a 2D array representing a single color channel:

$$C_{i,j} = \sum_{m,n} [X_{m,n} = i]\,[X_{m,n+1} = j] \quad (2)$$

This can be applied to X^T for vertical pairs as well, and on all three channels. These six co-occurrence matrices are then stacked into a final input tensor of size 256×256×6, which is passed to a CNN classifier as a multi-channel image; an illustrative sketch of both preprocessing modes is given below. A sample image passed through each mode of processing is shown in Figure 3.

Figure 3: An untampered image and the corresponding PGD-attacked image, with a large step size and number of steps used to amplify the difference. The added adversarial noise appears across the whole image. The difference in the co-occurrence matrices is notable in the significant increase in spread about the diagonal.
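The following NumPy/SciPy sketch illustrates the two preprocessing modes: the Laplacian residual of Eq. (1) and the stacked co-occurrence tensor of Eq. (2). It is an illustration rather than the authors' released code; the function names and the assumption of an 8-bit RGB input are ours.

```python
# Illustrative sketch of the two preprocessing modes (not the authors' code).
# `image` is assumed to be an H x W x 3 uint8 RGB array, e.g. 256 x 256 x 3.
import numpy as np
from scipy.ndimage import convolve


def laplacian_residual(image):
    """Apply the 3x3 high-pass filter of Eq. (1) to each RGB channel."""
    kernel = np.array([[1, 1, 1],
                       [1, -8, 1],
                       [1, 1, 1]], dtype=np.float32)
    img = image.astype(np.float32)
    return np.stack([convolve(img[..., c], kernel) for c in range(3)], axis=-1)


def cooccurrence_tensor(image):
    """Build the 256 x 256 x 6 tensor of Eq. (2): one horizontal and one
    vertical pair histogram per RGB channel, stacked along the last axis."""
    mats = []
    for c in range(3):
        chan = image[..., c].astype(np.int64)
        for a, b in ((chan[:, :-1], chan[:, 1:]),    # horizontal neighbor pairs
                     (chan[:-1, :], chan[1:, :])):   # vertical neighbor pairs
            idx = a.ravel() * 256 + b.ravel()
            hist = np.bincount(idx, minlength=256 * 256).reshape(256, 256)
            mats.append(hist.astype(np.float32))
    return np.stack(mats, axis=-1)  # shape (256, 256, 6)
```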
Detection, Attribution, and Estimation

For our final output, we would like to tell a user whether or not a query image is tampered, which attack method was used, and the parameters of that method. This high-level idea is illustrated in Figure 4. To accomplish this, we train a multiclass network, with each attack and parameter combination as a different label. To form the aggregated sets, such as real vs. tampered, we sum the model outputs associated with each set. The set with the largest output is selected as the estimated class.

Figure 4: Levels of information provided to the user by our method. Using a single network, we demonstrate results for detection, attribution, and parameter estimation.

If the image is predicted to be tampered, we then compute our parameter estimates using the model outputs for the predicted meta-class. A weighted sum is used, with the model outputs y_i as the weights and the associated class parameters P_i as the values, where S denotes the set of classes in the predicted meta-class:

$$P_{\text{est}} = \frac{\sum_{i \in S} P_i \, y_i}{\sum_{i \in S} y_i} \quad (3)$$

Experiments

Dataset

A full list of the attacks investigated is given in Table 1. These attacks are repeated on VGG16 and ResNet50, and each is classified separately. Only the ends of the parameter spectrum are used for training; parameters in between these ends are seen only at test time. A total of 12 different tampered classes are used at training time, with one additional class for untampered images, for a total of 13.

The dataset is constructed from a random selection of ImageNet samples, all resized to 256×256. Attacks are run as targeted, with the new label randomly selected from the 999 labels different from the associated ground truth. The attacks are then run to maximize the network output for the target class.

Attack  Parameters          Training  Testing
FGSM    ss = 1              X         X
FGSM    ss = 2                        X
FGSM    ss = 3              X         X
PGD     ss = 1, ns = 8      X         X
PGD     ss = 1, ns = 12               X
PGD     ss = 1, ns = 16     X         X
PGD     ss = 2, ns = 8                X
PGD     ss = 2, ns = 12               X
PGD     ss = 2, ns = 16               X
PGD     ss = 3, ns = 8      X         X
PGD     ss = 3, ns = 12               X
PGD     ss = 3, ns = 16     X         X

Table 1: Breakdown of the attacks used for training and testing. All attacks are repeated against pretrained VGG16 and ResNet50. "ss" denotes the step size (assuming pixel values in the range [0, 255]), and "ns" denotes the number of steps.

Model Training

A ResNet50 pretrained on ImageNet was used as our initial network, with the input and output layers modified to accommodate the different input and output sizes of this task. The model was trained for 20 epochs, using a batch size of 32, the Adam optimizer, and cross-entropy loss. After each epoch, the model was evaluated on the validation set. The weights corresponding to the lowest validation loss were saved and used for the remainder of the tests.
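As an illustration of the training setup just described, the sketch below fine-tunes a pretrained ResNet50 with the stated hyperparameters and keeps the checkpoint with the lowest validation loss. It is not the authors' training script; `train_loader`, `val_loader`, and `NUM_CLASSES` are placeholders, and the input-layer change needed for the 6-channel co-occurrence tensor is only indicated in a comment.

```python
# Illustrative training loop matching the described setup (20 epochs, Adam,
# cross-entropy, best-validation checkpointing). train_loader and val_loader
# are placeholder DataLoaders (batch size 32), not the authors' actual code.
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 13  # 12 attack/parameter classes + 1 untampered class

model = resnet50(pretrained=True)
# Replace the output layer to predict 13 classes; the input layer would be
# adapted similarly when the 256x256x6 co-occurrence tensor is used as input.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
best_loss, best_weights = float("inf"), None

for epoch in range(20):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    # Evaluate after every epoch; keep the weights with the lowest validation loss.
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    if val_loss < best_loss:
        best_loss, best_weights = val_loss, copy.deepcopy(model.state_dict())

model.load_state_dict(best_weights)
```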
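Finally, to connect the trained network back to the detection, attribution, and estimation procedure of the Method section, the sketch below aggregates the network's softmax outputs into meta-classes and applies the weighted parameter estimate of Eq. (3). The class list is an illustrative subset of Table 1 for a single target network, not the paper's actual label ordering.

```python
# Illustrative inference-time aggregation and parameter estimation (Eq. 3).
# CLASSES is a hypothetical label ordering, not taken from the paper's code.
CLASSES = [("real", {}),
           ("FGSM", {"ss": 1}), ("FGSM", {"ss": 3}),
           ("PGD", {"ss": 1, "ns": 8}), ("PGD", {"ss": 1, "ns": 16}),
           ("PGD", {"ss": 3, "ns": 8}), ("PGD", {"ss": 3, "ns": 16})]


def detect_attribute_estimate(probs):
    """probs: softmax outputs of the multiclass network, aligned with CLASSES."""
    # Detection: sum the outputs of all tampered classes and compare to "real".
    tampered = sum(p for p, (name, _) in zip(probs, CLASSES) if name != "real")
    if tampered <= probs[0]:
        return "untampered", None, None

    # Attribution: pick the attack family (meta-class) with the largest summed output.
    families = {name for name, _ in CLASSES if name != "real"}
    family = max(families, key=lambda f: sum(
        p for p, (name, _) in zip(probs, CLASSES) if name == f))

    # Parameter estimation (Eq. 3): output-weighted average of the parameters
    # of the classes within the predicted meta-class.
    members = [(p, params) for p, (name, params) in zip(probs, CLASSES) if name == family]
    total = sum(p for p, _ in members)
    estimate = {k: sum(p * params[k] for p, params in members) / total
                for k in members[0][1]}
    return "tampered", family, estimate


# Example: probs = model(x).softmax(dim=-1).squeeze().tolist()
print(detect_attribute_estimate([0.05, 0.02, 0.03, 0.10, 0.40, 0.10, 0.30]))
```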