Simple iterative method for generating targeted universal adversarial perturbations
Hokuto Hirano and Kazuhiro Takemoto

Abstract — Deep neural networks (DNNs) are vulnerable to adversarial attacks. In particular, a single perturbation known as the universal adversarial perturbation (UAP) can foil most classification tasks conducted by DNNs. Thus, different methods for generating UAPs are required to fully evaluate the vulnerability of DNNs. A realistic evaluation must consider targeted attacks, wherein the generated UAP causes a DNN to classify an input into a specific class. However, the development of UAPs for targeted attacks has largely fallen behind that of UAPs for non-targeted attacks. We therefore propose a simple iterative method to generate UAPs for targeted attacks. Our method combines the simple iterative method for generating non-targeted UAPs with the fast gradient sign method for generating a targeted adversarial perturbation for an input. We applied the proposed method to state-of-the-art DNN models for image classification and proved the existence of almost imperceptible UAPs for targeted attacks; further, we demonstrated that such UAPs are easily generatable.
I. INTRODUCTION
Deep neural networks (DNNs) are widely used for image classification, a task in which an input image is assigned a class from a fixed set of classes. For example, DNN-based image classification has applications in medical science (e.g., medical image-based diagnosis [1]) and self-driving technology (e.g., detecting and classifying traffic signs [2]). However, DNNs are known to be vulnerable to adversarial examples [3]: input images that cause misclassifications by DNNs and are generally generated by adding specific, imperceptible perturbations to original input images that have been correctly classified by DNNs. Interestingly, a single perturbation that can induce DNN failure in most image classification tasks is also generatable; this is known as a universal adversarial perturbation (UAP) [4]. The vulnerability of DNNs to adversarial attacks (UAPs, in particular) is a security concern for practical applications of DNNs [5]. Thus, the development of methods for generating UAPs is required to evaluate the vulnerability of DNNs to adversarial attacks.

A simple iterative method [4] for generating UAPs has been proposed; however, it is limited to non-targeted attacks that cause misclassification (i.e., a task failure resulting in an input image being assigned an incorrect class). More realistic cases need to consider targeted attacks, wherein a generated UAP causes the DNN to classify an input image into a specific class (e.g., into the "diseased" class in medical diagnosis). A method for generating UAPs for targeted attacks based on a generative network model has been proposed [6]; however, it requires high computational costs.

Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka 820-8502, Japan. Corresponding author: [email protected]
The targeted adversarial patch approach for targeted universal adversarial attacks [7] has also been proposed; however, such adversarial patches are perceptible. Thus, herein, we propose a simple iterative method to generate almost imperceptible UAPs for targeted attacks.

II. TARGETED UNIVERSAL ADVERSARIAL PERTURBATIONS
Our algorithm (Algorithm 1) for generating UAPs for targeted attacks is an extension of the simple iterative algorithm for generating UAPs for non-targeted attacks [4]. Similar to the non-targeted UAP algorithm, our algorithm considers a classifier C(x) that returns the class or label (with the highest confidence score) for an input image x. The algorithm starts with ρ = 0 (no perturbation) and iteratively updates the UAP ρ under the constraint that the L_p norm of the perturbation is equal to or less than a small value ξ (i.e., ‖ρ‖_p ≤ ξ) by additively obtaining an adversarial perturbation for an input image x, which is randomly selected from an input image set X without replacement. These iterative updates continue until the termination conditions are satisfied. Unlike the non-targeted UAP algorithm, which uses a method (e.g., DeepFool [8]) that generates a non-targeted adversarial example for an input image, our algorithm uses the fast gradient sign method for targeted attacks (tFGSM) to generate targeted UAPs.

Algorithm 1: Computation of a targeted UAP
Input: Set X of input images, target class y, classifier C(·), cap ξ on the L_p norm of the perturbation, norm type p (1, 2, or ∞), maximum number i_max of iterations.
Output: Targeted UAP vector ρ.
  ρ ← 0, r_s ← 0, i ← 0
  while r_s < 1 and i < i_max do
    for x ∈ X in random order do
      if C(x + ρ) ≠ y then
        x_adv ← x + ρ + ψ(x + ρ, y)
        if C(x_adv) = y then
          ρ ← project(x_adv − x, p, ξ)
        end if
      end if
    end for
    r_s ← |X|^{-1} Σ_{x∈X} I(C(x + ρ) = y)
    i ← i + 1
  end while

tFGSM generates a targeted adversarial perturbation ψ(x, y) that causes an image x to be classified into the target class y, using the gradient ∇_x L(x, y) of the loss function with respect to the pixels [3,9]. For the L_∞ norm, the perturbation is calculated as

ψ(x, y) = −ε · sign(∇_x L(x, y)),  (1)

where ε (> 0) is the attack strength. For the L_1 and L_2 norms, the perturbation is obtained as

ψ(x, y) = −ε ∇_x L(x, y) / ‖∇_x L(x, y)‖_p.  (2)

The adversarial example x_adv is obtained as follows:

x_adv = x + ψ(x, y).  (3)

At each iteration step, our algorithm computes a targeted adversarial perturbation ψ(x + ρ, y) if the perturbed image x + ρ is not classified into the target class y (i.e., C(x + ρ) ≠ y); in contrast, the non-targeted UAP algorithm obtains a non-targeted adversarial perturbation that satisfies C(x + ρ) ≠ C(x) if C(x + ρ) = C(x). After generating the adversarial example at this step (i.e., x_adv ← x + ρ + ψ(x + ρ, y)), the perturbation ρ is updated if x_adv is classified into the target class y (i.e., C(x_adv) = y), whereas the non-targeted UAP algorithm updates the perturbation ρ if C(x + ρ) ≠ C(x). Note that tFGSM does not ensure that adversarial examples are classified into a target class. When updating ρ, a projection function project(x, p, ξ) is used to satisfy the constraint that ‖ρ‖_p ≤ ξ (i.e., ρ ← project(x_adv − x, p, ξ)). This projection is defined as follows:

project(x, p, ξ) = arg min_{x'} ‖x − x'‖_2 s.t. ‖x'‖_p ≤ ξ.  (4)

This update procedure terminates when the targeted attack success rate r_ts for the input images (i.e., the proportion of input images classified into the target class, |X|^{-1} Σ_{x∈X} I(C(x + ρ) = y)) equals 100% (i.e., all input images are classified into the target class due to the UAP ρ) or when the number of iterations reaches the maximum i_max. Pseudocode of our algorithm is shown in Algorithm 1. Our algorithm was implemented using Keras (version 2.2.4; keras.io) and the Adversarial Robustness 360 Toolbox [9] (version 1.0; github.com/IBM/adversarial-robustness-toolbox). The source code of our proposed method for generating targeted UAPs is available from our GitHub repository: github.com/hkthirano/targeted_UAP_CIFAR10.

III. EXPERIMENTAL EVALUATION
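To make the loop structure of Algorithm 1 concrete, the following is a minimal Python sketch. It is not the authors' released implementation (which uses Keras and the Adversarial Robustness 360 Toolbox); the classifier here is a toy softmax model with a closed-form input gradient, and all function and variable names (`tfgsm_step`, `targeted_uap`, etc.) are our own. The projection is sketched for p = 2 and p = ∞ only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class LinearClassifier:
    """Toy stand-in for a DNN: softmax regression with fixed weights."""
    def __init__(self, W, b):
        self.W, self.b = W, b  # W: (n_classes, n_features)

    def predict(self, x):
        # C(x): class with the highest confidence score
        return int(np.argmax(self.W @ x + self.b))

    def loss_grad(self, x, y):
        # Gradient of the cross-entropy loss w.r.t. the input pixels.
        p = softmax(self.W @ x + self.b)
        onehot = np.zeros_like(p)
        onehot[y] = 1.0
        return self.W.T @ (p - onehot)

def tfgsm_step(clf, x, y, eps, p=2):
    """Targeted FGSM perturbation (Eqs. 1-2): a step toward class y."""
    g = clf.loss_grad(x, y)
    if p == np.inf:
        return -eps * np.sign(g)
    return -eps * g / (np.linalg.norm(g, ord=p) + 1e-12)

def project(v, p, xi):
    """Project v onto the L_p ball of radius xi (p = 2 or inf here)."""
    if p == np.inf:
        return np.clip(v, -xi, xi)
    n = np.linalg.norm(v, ord=2)
    return v if n <= xi else v * (xi / n)

def targeted_uap(clf, X, y, xi, eps, p=2, i_max=10, rng=None):
    """Algorithm 1: iteratively accumulate a targeted UAP rho."""
    rng = rng or np.random.default_rng(0)
    rho = np.zeros_like(X[0])
    for _ in range(i_max):
        for idx in rng.permutation(len(X)):  # X in random order
            x = X[idx]
            if clf.predict(x + rho) != y:
                x_adv = x + rho + tfgsm_step(clf, x + rho, y, eps, p)
                # Update rho only if the adversarial example hits the target.
                if clf.predict(x_adv) == y:
                    rho = project(x_adv - x, p, xi)
        # Targeted attack success rate over the input set.
        r_s = np.mean([clf.predict(x + rho) == y for x in X])
        if r_s == 1.0:
            break
    return rho
```

On this toy model the loop converges within a few passes; with a real DNN, `loss_grad` would be replaced by a backpropagated gradient through the network.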
A. Deep neural network models and image datasets
To evaluate targeted UAPs, we used two DNN models that were trained to classify the CIFAR-10 image dataset. The CIFAR-10 dataset includes 60,000 RGB color images with a size of 32 × 32 pixels classified into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck; 6,000 images are available per class. The dataset comprises 50,000 training images (5,000 images per class) and 10,000 test images (1,000 images per class). In particular, we used the VGG-20 and ResNet-20 models for the CIFAR-10 dataset obtained from a GitHub repository (github.com/GuanqiaoDing/CNN-CIFAR10); their test accuracies were 91.1% and 91.3%, respectively.

Moreover, we also considered three DNN models trained to classify the ImageNet image dataset. The ImageNet dataset comprises RGB color images with a size of 224 × 224 pixels classified into 1,000 classes. In particular, we used the VGG-16, VGG-19, and ResNet-50 models for the ImageNet dataset available in Keras (version 2.2.4; keras.io); their test accuracies were 71.6%, 71.5%, and 74.6%, respectively.

B. Generating targeted adversarial perturbations and evaluating their performance
Targeted UAPs were generated using an input image set obtained from the datasets. The parameter p was set to 2. We generated targeted UAPs with various norms by adjusting the parameters ε and ξ. The magnitude of a UAP was measured using a normalized L_2 norm of the perturbation; in particular, we used the ratio ζ of the L_2 norm of the UAP to the average L_2 norm of an image in a dataset. The average L_2 norms of an image were 7,381 and 50,135 in the CIFAR-10 and ImageNet datasets, respectively.

To compare the performance of targeted UAPs generated by our method with random controls, we also generated random vectors (random UAPs) sampled uniformly from the sphere of a given radius [4].

The performance of UAPs was evaluated using the targeted attack success rate r_ts. In particular, we considered the success rates r_ts for the input images. In addition, we also computed the success rates r_ts for test images to experimentally evaluate the performance of UAPs on unknown images. A test image set was obtained from the dataset and did not overlap with the input image set.

C. Case of CIFAR-10 models
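The evaluation quantities in this subsection — the normalized perturbation magnitude ζ, the random-UAP control, and the targeted attack success rate r_ts — are straightforward to compute. Below is a small NumPy sketch; the function names are ours (not from the released code), and `predict` stands in for any classifier C(·).

```python
import numpy as np

def perturbation_ratio(rho, X):
    """zeta: ratio of the UAP's L2 norm to the average image L2 norm."""
    avg_l2 = np.mean([np.linalg.norm(x) for x in X])
    return np.linalg.norm(rho) / avg_l2

def xi_from_zeta(zeta, avg_l2):
    """Invert the ratio: the L2 cap xi that yields a desired zeta."""
    return zeta * avg_l2

def random_uap(shape, xi, rng=None):
    """Random control: a vector sampled uniformly from the L2 sphere of radius xi."""
    rng = rng or np.random.default_rng(0)
    v = rng.normal(size=shape)  # isotropic Gaussian, then rescale to the sphere
    return xi * v / np.linalg.norm(v)

def targeted_success_rate(predict, X, rho, y):
    """r_ts: proportion of images classified into target class y under UAP rho."""
    return np.mean([predict(x + rho) == y for x in X])
```

For example, with the paper's average ImageNet L_2 norm of 50,135, `xi_from_zeta(0.06, 50135)` gives ξ ≈ 3,008.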
For the CIFAR-10 models, we used 10,000 input images to generate the targeted UAPs. The input image set was obtained by randomly selecting 1,000 images per class from the training images of the CIFAR-10 dataset. All 10,000 test images of the dataset were used as test images for evaluating the UAP performance. We considered the targeted attack to each class. The parameters ε and i_max were set to 0.006 and 10, respectively.

For the targeted attacks to each class, the targeted attack success rates r_ts, for both the input image set and the test image set, rapidly increased with the perturbation rate, despite a low ζ (2–6%). In particular, the success rates were already high for ζ = 5% (Fig. 1). The targeted UAPs with ζ = 5% were almost imperceptible (Fig. 2). Moreover, the UAPs seem to represent object shapes of each target class. The targeted attack success rates saturated at larger values of ζ. The success rates of the targeted UAPs were significantly higher than those of random UAPs. These tendencies were observed in both the VGG-20 model and the ResNet-20 model.

Fig. 1. Line plot of the targeted attack success rate r_ts versus the perturbation rate for targeted attacks to each class of the CIFAR-10 dataset. The legend label indicates the DNN model and the image set used for computing r_ts. For example, "VGG-20 input" indicates the r_ts of targeted UAPs against the VGG-20 model computed using the input image set. The additional argument "(random)" indicates that random UAPs were used instead of targeted UAPs.

D. Case of ImageNet models
For the ImageNet models, we used the validation dataset of the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) to generate the targeted UAPs. The dataset comprises 50,000 images (50 images per class). We used 40,000 images as input images; the input image set was obtained by randomly selecting 40 images per class. The rest (10,000 images; 10 images per class) were used as test images for evaluating the UAPs. The parameters ε and i_max were set to 0.5 and 5, respectively.

In this study, because of page limitations, we considered targeted attacks to three classes (golf ball, broccoli, and stone wall) that were randomly selected from the 1,000 classes in a previous study [5].

We generated targeted UAPs with ζ = 6% (ξ = 3,008) and ζ = 8% (ξ = 4,011). The targeted attack success rates r_ts were between ~30% and ~75% when ζ = 6% and between ~60% and ~90% when ζ = 8% (Table 1). The success rates of the targeted UAPs were significantly higher than those of random UAPs, which were less than 1% in all cases.

Fig. 2. Targeted UAPs (top panel) with ζ = 5% against the VGG-20 model for the CIFAR-10 dataset and their adversarial attacks to an original (i.e., non-perturbed) image (left panel) randomly selected from the images that, without perturbation, were correctly classified into each source class and, with the perturbations, were classified into the target classes: airplane (0), automobile (1), bird (2), cat (3), deer (4), dog (5), frog (6), horse (7), ship (8), and truck (9). Note that the UAPs are emphatically displayed for clarity; in particular, each UAP was scaled with a maximum of 1 and a minimum of 0.

Table 1. Targeted attack success rates r_ts of targeted UAPs against the DNN models for each target class.
The r_ts for the input images and test images are shown.

Target class   Model       ζ = 6%            ζ = 8%
                           input    test     input    test
Golf ball      VGG-16      58.0%    57.6%    81.6%    80.6%
               VGG-19      55.3%    55.2%    81.3%    80.1%
               ResNet-50   66.8%    66.5%    90.3%    89.8%
Broccoli       VGG-16      29.3%    29.0%    59.7%    59.5%
               VGG-19      31.2%    30.5%    59.7%    59.4%
               ResNet-50   46.4%    46.6%    74.6%    73.9%
Stone wall     VGG-16      47.1%    46.7%    75.0%    74.5%
               VGG-19      48.4%    48.1%    73.9%    72.9%
               ResNet-50   74.7%    74.4%    92.0%    91.3%

A higher perturbation magnitude ζ led to a higher targeted attack success rate r_ts. The success rates r_ts depended on the image classes. For example, the targeted attacks to the class "golf ball" were more easily achieved than those to the class "broccoli". The success rates r_ts also depended on the DNN architectures; in particular, the ResNet-50 model was easier to fool than the VGG models.

The targeted UAPs with ζ = 6% and ζ = 8% were almost imperceptible (Fig. 3); however, they were partly perceptible in whitish images (e.g., trimaran). Moreover, the UAPs seem to reflect object shapes of each target class.

The targeted attack success rates in the ImageNet models were relatively lower than those in the CIFAR-10 models. This is because the ImageNet dataset has a larger number of classes than the CIFAR-10 dataset does. In short, it is more difficult to classify an input image into a specific target class from among a larger number of classes. Moreover, the observed lower success rates may be because the validation dataset of ILSVRC2012 was used when generating the targeted UAPs. Higher success rates may be obtained when generating targeted UAPs using training images.

Fig. 3. Targeted UAPs (top panel) against the ResNet-50 model for the ImageNet dataset and their adversarial attacks to original (i.e., non-perturbed) images (left panel) randomly selected from the images that, without perturbation, were correctly classified into the source class and, with the perturbation, were classified into each target class, under the constraint that the source classes do not overlap with each other or with the target classes. The source classes displayed here are sleeping bag (A), sombrero (B), trimaran (C), steam locomotive (D), fireboat (E), and water ouzel, dipper (F). The target classes are golf ball (0), broccoli (1), and stone wall (2). The UAPs with ζ = 6% and ζ = 8% are shown. Note that the UAPs are emphatically displayed for clarity; in particular, each UAP was scaled with a maximum of 1 and a minimum of 0.

IV. CONCLUSIONS