Simple iterative method for generating targeted universal adversarial perturbations
Hokuto Hirano and Kazuhiro Takemoto

Abstract — Deep neural networks (DNNs) are vulnerable to adversarial attacks. In particular, a single perturbation known as the universal adversarial perturbation (UAP) can foil most classification tasks conducted by DNNs. Thus, different methods for generating UAPs are required to fully evaluate the vulnerability of DNNs. A realistic evaluation must consider targeted attacks, wherein the generated UAP causes a DNN to classify an input into a specific class. However, the development of UAPs for targeted attacks has largely fallen behind that of UAPs for non-targeted attacks. We therefore propose a simple iterative method to generate UAPs for targeted attacks. Our method combines the simple iterative method for generating non-targeted UAPs with the fast gradient sign method for generating a targeted adversarial perturbation for an input. We applied the proposed method to state-of-the-art DNN models for image classification and proved the existence of almost imperceptible UAPs for targeted attacks; further, we demonstrated that such UAPs are easily generatable.
I. INTRODUCTION
Deep neural networks (DNNs) are widely used for image classification, a task in which an input image is assigned a class from a fixed set of classes. For example, DNN-based image classification has applications in medical science (e.g., medical image-based diagnosis [1]) and self-driving technology (e.g., detecting and classifying traffic signs [2]). However, DNNs are known to be vulnerable to adversarial examples [3]: input images that cause misclassifications by DNNs and are generally generated by adding specific, imperceptible perturbations to original input images that have been correctly classified by DNNs. Interestingly, a single perturbation that can induce DNN failure in most image classification tasks is also generatable; this is known as a universal adversarial perturbation (UAP) [4]. The vulnerability of DNNs to adversarial attacks (UAPs, in particular) is a security concern for practical applications of DNNs [5]. Thus, the development of methods for generating UAPs is required to evaluate the vulnerability of DNNs to adversarial attacks.

A simple iterative method [4] for generating UAPs has been proposed; however, it is limited to non-targeted attacks that cause misclassification (i.e., a task failure resulting in an input image being assigned an incorrect class). More realistic cases need to consider targeted attacks, wherein a generated UAP causes the DNN to classify an input image into a specific class (e.g., into the "diseased" class in medical diagnosis). A method for generating UAPs for targeted attacks based on a generative network model has been proposed [6]; however, it requires high computational costs.

Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka 820-8502, Japan. Corresponding author: [email protected]
The targeted adversarial patch approach for targeted universal adversarial attacks [7] has also been proposed; however, such adversarial patches are perceptible. Thus, herein, we propose a simple iterative method to generate almost imperceptible UAPs for targeted attacks.

II. TARGETED UNIVERSAL ADVERSARIAL PERTURBATIONS
Our algorithm (Algorithm 1) for generating UAPs for targeted attacks is an extension of the simple iterative algorithm for generating UAPs for non-targeted attacks [4]. Similar to the non-targeted UAP algorithm, our algorithm considers a classifier C(x) that returns the class or label (with the highest confidence score) for an input image x. The algorithm starts with ρ = 0 (no perturbation) and iteratively updates the UAP ρ under the constraint that the L_p norm of the perturbation is equal to or less than a small value ξ (i.e., ‖ρ‖_p ≤ ξ) by additively obtaining an adversarial perturbation for an input image x, which is randomly selected from an input image set X without replacement. These iterative updates continue until the termination conditions are satisfied. Unlike the non-targeted UAP algorithm, which uses a method (e.g., DeepFool [8]) that generates a non-targeted adversarial example for an input image, our algorithm uses the fast gradient sign method for targeted attacks (tFGSM) to generate targeted UAPs.

Algorithm 1: Computation of a targeted UAP
Input: Set X of input images, target class y, classifier C(·), cap ξ on the L_p norm of the perturbation, norm type p (1, 2, or ∞), maximum number i_max of iterations.
Output: Targeted UAP vector ρ.
  ρ ← 0, r_s ← 0, i ← 0
  while r_s < 1 and i < i_max do
    for x ∈ X in random order do
      if C(x + ρ) ≠ y then
        x_adv ← x + ρ + ψ(x + ρ, y)
        if C(x_adv) = y then
          ρ ← project(x_adv − x, p, ξ)
        end if
      end if
    end for
    r_s ← |X|^{-1} Σ_{x∈X} I(C(x + ρ) = y)
    i ← i + 1
  end while

tFGSM generates a targeted adversarial perturbation ψ(x, y) that causes an image x to be classified into the target class y, using the gradient ∇_x L(x, y) of the loss function with respect to the pixels [3,9]. For the L_∞ norm, the perturbation is calculated as

ψ(x, y) = −ε · sign(∇_x L(x, y)),  (1)

where ε (> 0) is the attack strength. For the L_1 and L_2 norms, the perturbation is obtained as

ψ(x, y) = −ε ∇_x L(x, y) / ‖∇_x L(x, y)‖_p.  (2)

The adversarial example x_adv is obtained as follows:

x_adv = x + ψ(x, y).  (3)

At each iteration step, our algorithm computes a targeted adversarial perturbation ψ(x + ρ, y) if the perturbed image x + ρ is not classified into the target class y (i.e., C(x + ρ) ≠ y); in contrast, the non-targeted UAP algorithm obtains a non-targeted adversarial perturbation that satisfies C(x + ρ) ≠ C(x) if C(x + ρ) = C(x). After generating the adversarial example at this step (i.e., x_adv ← x + ρ + ψ(x + ρ, y)), the perturbation ρ is updated if x_adv is classified into the target class y (i.e., C(x_adv) = y), whereas the non-targeted UAP algorithm updates the perturbation ρ if C(x + ρ) ≠ C(x). Note that tFGSM does not ensure that adversarial examples are classified into a target class. When updating ρ, a projection function project(x, p, ξ) is used to satisfy the constraint that ‖ρ‖_p ≤ ξ (i.e., ρ ← project(x_adv − x, p, ξ)). This projection is defined as follows:

project(x, p, ξ) = arg min_{x'} ‖x − x'‖_2 s.t. ‖x'‖_p ≤ ξ.  (4)

This update procedure terminates when the targeted attack success rate r_ts for the input images (i.e., the proportion of input images classified into the target class, |X|^{-1} Σ_{x∈X} I(C(x + ρ) = y)) equals 100% (i.e., all input images are classified into the target class due to the UAP ρ) or when the number of iterations reaches the maximum i_max. Pseudocode of our algorithm is shown in Algorithm 1. Our algorithm was implemented using Keras (version 2.2.4; keras.io) and the Adversarial Robustness 360 Toolbox [9] (version 1.0; github.com/IBM/adversarial-robustness-toolbox). The source code of our proposed method for generating targeted UAPs is available from our GitHub repository: github.com/hkthirano/targeted_UAP_CIFAR10.

III. EXPERIMENTAL EVALUATION
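To make the loop structure of Algorithm 1 concrete, the following is a minimal Python sketch. It is not the authors' released implementation (which uses Keras and the Adversarial Robustness 360 Toolbox); the classifier here is a toy softmax model with a closed-form input gradient, and all function and variable names (`tfgsm_step`, `targeted_uap`, etc.) are our own. The projection is sketched for p = 2 and p = ∞ only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class LinearClassifier:
    """Toy stand-in for a DNN: softmax regression with fixed weights."""
    def __init__(self, W, b):
        self.W, self.b = W, b  # W: (n_classes, n_features)

    def predict(self, x):
        # C(x): class with the highest confidence score
        return int(np.argmax(self.W @ x + self.b))

    def loss_grad(self, x, y):
        # Gradient of the cross-entropy loss w.r.t. the input pixels.
        p = softmax(self.W @ x + self.b)
        onehot = np.zeros_like(p)
        onehot[y] = 1.0
        return self.W.T @ (p - onehot)

def tfgsm_step(clf, x, y, eps, p=2):
    """Targeted FGSM perturbation (Eqs. 1-2): a step toward class y."""
    g = clf.loss_grad(x, y)
    if p == np.inf:
        return -eps * np.sign(g)
    return -eps * g / (np.linalg.norm(g, ord=p) + 1e-12)

def project(v, p, xi):
    """Project v onto the L_p ball of radius xi (p = 2 or inf here)."""
    if p == np.inf:
        return np.clip(v, -xi, xi)
    n = np.linalg.norm(v, ord=2)
    return v if n <= xi else v * (xi / n)

def targeted_uap(clf, X, y, xi, eps, p=2, i_max=10, rng=None):
    """Algorithm 1: iteratively accumulate a targeted UAP rho."""
    rng = rng or np.random.default_rng(0)
    rho = np.zeros_like(X[0])
    for _ in range(i_max):
        for idx in rng.permutation(len(X)):  # X in random order
            x = X[idx]
            if clf.predict(x + rho) != y:
                x_adv = x + rho + tfgsm_step(clf, x + rho, y, eps, p)
                # Update rho only if the adversarial example hits the target.
                if clf.predict(x_adv) == y:
                    rho = project(x_adv - x, p, xi)
        # Targeted attack success rate over the input set.
        r_s = np.mean([clf.predict(x + rho) == y for x in X])
        if r_s == 1.0:
            break
    return rho
```

On this toy model the loop converges within a few passes; with a real DNN, `loss_grad` would be replaced by a backpropagated gradient through the network.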
A. Deep neural network models and image datasets
To evaluate targeted UAPs, we used two DNN models that were trained to classify the CIFAR-10 image dataset. The CIFAR-10 dataset includes 60,000 RGB color images with a size of 32 × 32 pixels classified into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck; 6,000 images are available per class. The dataset comprises 50,000 training images (5,000 images per class) and 10,000 test images (1,000 images per class). In particular, we used the VGG-20 and ResNet-20 models for the CIFAR-10 dataset obtained from a GitHub repository (github.com/GuanqiaoDing/CNN-CIFAR10); their test accuracies were 91.1% and 91.3%, respectively.

Moreover, we also considered three DNN models trained to classify the ImageNet image dataset. The ImageNet dataset comprises RGB color images with a size of 224 × 224 pixels classified into 1,000 classes. In particular, we used the VGG-16, VGG-19, and ResNet-50 models for the ImageNet dataset available in Keras (version 2.2.4; keras.io); their test accuracies were 71.6%, 71.5%, and 74.6%, respectively.

B. Generating targeted adversarial perturbations and evaluating their performance
Targeted UAPs were generated using an input image set obtained from the datasets. The parameter p was set to 2. We generated targeted UAPs with various norms by adjusting the parameters ε and ξ. The magnitude of a UAP was measured using a normalized L_2 norm of the perturbation; in particular, we used the ratio ζ of the L_2 norm of the UAP to the average L_2 norm of an image in a dataset. The average L_2 norms of an image were 7,381 and 50,135 in the CIFAR-10 and ImageNet datasets, respectively.

To compare the performance of targeted UAPs generated by our method with random controls, we also generated random vectors (random UAPs) sampled uniformly from the sphere of a given radius [4].

The performance of UAPs was evaluated using the targeted attack success rate r_ts. In particular, we considered the success rates r_ts for the input images. In addition, we also computed the success rates r_ts for test images to experimentally evaluate the performance of UAPs on unknown images. A test image set was obtained from the dataset and did not overlap with the input image set.

C. Case of CIFAR-10 models
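The evaluation quantities in this subsection — the normalized perturbation magnitude ζ, the random-UAP control, and the targeted attack success rate r_ts — are straightforward to compute. Below is a small NumPy sketch; the function names are ours (not from the released code), and `predict` stands in for any classifier C(·).

```python
import numpy as np

def perturbation_ratio(rho, X):
    """zeta: ratio of the UAP's L2 norm to the average image L2 norm."""
    avg_l2 = np.mean([np.linalg.norm(x) for x in X])
    return np.linalg.norm(rho) / avg_l2

def xi_from_zeta(zeta, avg_l2):
    """Invert the ratio: the L2 cap xi that yields a desired zeta."""
    return zeta * avg_l2

def random_uap(shape, xi, rng=None):
    """Random control: a vector sampled uniformly from the L2 sphere of radius xi."""
    rng = rng or np.random.default_rng(0)
    v = rng.normal(size=shape)  # isotropic Gaussian, then rescale to the sphere
    return xi * v / np.linalg.norm(v)

def targeted_success_rate(predict, X, rho, y):
    """r_ts: proportion of images classified into target class y under UAP rho."""
    return np.mean([predict(x + rho) == y for x in X])
```

For example, with the paper's average ImageNet L_2 norm of 50,135, `xi_from_zeta(0.06, 50135)` gives ξ ≈ 3,008.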
For the CIFAR-10 models, we used 10,000 input images to generate the targeted UAPs. The input image set was obtained by randomly selecting 1,000 images per class from the training images of the CIFAR-10 dataset. All 10,000 test images of the dataset were used as test images for evaluating the UAP performance. We considered the targeted attack to each class. The parameters ε and i_max were set to 0.006 and 10, respectively.

For the targeted attacks to each class, the targeted attack success rates r_ts, for both the input image set and the test image set, rapidly increased with the perturbation rate, despite a low ζ (2–6%). In particular, the success rates were already high for ζ = 5% (Fig. 1). The targeted UAPs with ζ = 5% were almost imperceptible (Fig. 2). Moreover, the UAPs seem to represent object shapes of each target class. The targeted attack success rates saturated at larger values of ζ. The success rates of the targeted UAPs were significantly higher than those of random UAPs. These tendencies were observed in both the VGG-20 model and the ResNet-20 model.

Fig. 1. Line plot of the targeted attack success rate r_ts versus the perturbation rate for targeted attacks to each class of the CIFAR-10 dataset. The legend label indicates the DNN model and the image set used for computing r_ts. For example, "VGG-20 input" indicates the r_ts of targeted UAPs against the VGG-20 model computed using the input image set. The additional argument "(random)" indicates that random UAPs were used instead of targeted UAPs.

D. Case of ImageNet models
For the ImageNet models, we used the validation dataset of the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) to generate the targeted UAPs. The dataset comprises 50,000 images (50 images per class). We used 40,000 images as input images; the input image set was obtained by randomly selecting 40 images per class. The rest (10,000 images; 10 images per class) were used as test images for evaluating the UAPs. The parameters ε and i_max were set to 0.5 and 5, respectively.

In this study, because of page limitations, we considered targeted attacks to three classes (golf ball, broccoli, and stone wall) that were randomly selected from the 1,000 classes in a previous study [5].

We generated targeted UAPs with ζ = 6% (ξ = 3,008) and ζ = 8% (ξ = 4,011). The targeted attack success rates r_ts were between ~30% and ~75% when ζ = 6% and between ~60% and ~90% when ζ = 8% (Table 1). The success rates of the targeted UAPs were significantly higher than those of random UAPs, which were less than 1% in all cases.

Fig. 2. Targeted UAPs (top panel) with ζ = 5% against the VGG-20 model for the CIFAR-10 dataset and their adversarial attacks to an original (i.e., non-perturbed) image (left panel) randomly selected from the images that, without perturbation, were correctly classified into each source class and, with the perturbations, were classified into the target classes: airplane (0), automobile (1), bird (2), cat (3), deer (4), dog (5), frog (6), horse (7), ship (8), and truck (9). Note that the UAPs are emphatically displayed for clarity; in particular, each UAP was scaled with a maximum of 1 and a minimum of 0.

Table 1. Targeted attack success rates r_ts of targeted UAPs against the DNN models for each target class.
The r_ts for the input images and test images are shown.

Target class   Model       ζ = 6%            ζ = 8%
                           input    test     input    test
Golf ball      VGG-16      58.0%    57.6%    81.6%    80.6%
               VGG-19      55.3%    55.2%    81.3%    80.1%
               ResNet-50   66.8%    66.5%    90.3%    89.8%
Broccoli       VGG-16      29.3%    29.0%    59.7%    59.5%
               VGG-19      31.2%    30.5%    59.7%    59.4%
               ResNet-50   46.4%    46.6%    74.6%    73.9%
Stone wall     VGG-16      47.1%    46.7%    75.0%    74.5%
               VGG-19      48.4%    48.1%    73.9%    72.9%
               ResNet-50   74.7%    74.4%    92.0%    91.3%

A higher perturbation magnitude ζ led to a higher targeted attack success rate r_ts. The success rates r_ts depended on the image classes. For example, the targeted attacks to the class "golf ball" were more easily achieved than those to the class "broccoli". The success rates r_ts also depended on the DNN architectures; in particular, the ResNet-50 model was easier to fool than the VGG models.

The targeted UAPs with ζ = 6% and ζ = 8% were almost imperceptible (Fig. 3); however, they were partly perceptible in whitish images (e.g., trimaran). Moreover, the UAPs seem to reflect object shapes of each target class.

The targeted attack success rates in the ImageNet models were relatively lower than those in the CIFAR-10 models. This is because the ImageNet dataset has a larger number of classes than the CIFAR-10 dataset does. In short, it is more difficult to classify an input image into a specific target class from among a larger number of classes. Moreover, the observed lower success rates may be because the validation dataset of ILSVRC2012 was used when generating the targeted UAPs. Higher success rates may be obtained when generating targeted UAPs using training images.

Fig. 3. Targeted UAPs (top panel) against the ResNet-50 model for the ImageNet dataset and their adversarial attacks to original (i.e., non-perturbed) images (left panel) randomly selected from the images that, without perturbation, were correctly classified into the source class and, with the perturbation, were classified into each target class, under the constraint that the source classes do not overlap with each other or with the target classes. The source classes displayed here are sleeping bag (A), sombrero (B), trimaran (C), steam locomotive (D), fireboat (E), and water ouzel, dipper (F). The target classes are golf ball (0), broccoli (1), and stone wall (2). The UAPs with ζ = 6% and ζ = 8% are shown. Note that the UAPs are emphatically displayed for clarity; in particular, each UAP was scaled with a maximum of 1 and a minimum of 0.

IV. CONCLUSIONS