AutoZOOM: Autoencoder-based Zeroth Order Optimization Method for Attacking Black-box Neural Networks
Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, Shin-Ming Cheng
Chun-Chen Tu∗, Paishun Ting∗, Pin-Yu Chen∗, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, Shin-Ming Cheng
University of Michigan, Ann Arbor, USA; MIT-IBM Watson AI Lab, IBM Research; University of California, Los Angeles, USA; JD AI Research, Beijing, China; National Taiwan University of Science and Technology, Taiwan
Abstract
Recent studies have shown that adversarial examples in state-of-the-art image classifiers trained by deep neural networks (DNN) can be easily generated when the target model is transparent to an attacker, known as the white-box setting. However, when attacking a deployed machine learning service, one can only acquire the input-output correspondences of the target model; this is the so-called black-box attack setting. The major drawback of existing black-box attacks is the need for excessive model queries, which may give a false sense of model robustness due to inefficient query designs. To bridge this gap, we propose a generic framework for query-efficient black-box attacks. Our framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order Optimization Method, has two novel building blocks towards efficient black-box attacks: (i) an adaptive random gradient estimation strategy to balance query counts and distortion, and (ii) a decoder for attack acceleration, which is either an autoencoder trained offline with unlabeled data or a bilinear resizing operation. Experimental results suggest that, by applying AutoZOOM to a state-of-the-art black-box attack (ZOO), a significant reduction in model queries can be achieved without sacrificing the attack success rate and the visual quality of the resulting adversarial examples. In particular, when compared to the standard ZOO method, AutoZOOM can consistently reduce the mean query counts in finding successful adversarial examples (or reaching the same distortion level) by at least 93% on MNIST, CIFAR-10 and ImageNet datasets, leading to novel insights on adversarial robustness.

In recent years, “machine learning as a service” has offered the world effortless access to powerful machine learning tools for a wide variety of tasks. For example, commercially available services such as Google Cloud Vision API and Clarifai.com provide well-trained image classifiers to the public. One is able to upload images and obtain their class prediction results at a low price. However, the existing and emerging machine learning platforms and their low model-access costs raise ever-increasing security concerns, as they also offer an ideal environment for testing malicious attempts. Even worse, the risks can be amplified when these services are used to build derived products such that the inherent security vulnerability could be leveraged by attackers.
∗ Equal contribution.

Figure 1: AutoZOOM significantly reduces the number of queries required to generate a successful adversarial Bagel image from the black-box Inception-v3 model.

In many computer vision tasks, DNN models achieve the state-of-the-art prediction accuracy and hence are widely deployed in modern machine learning services. Nonetheless, recent studies have highlighted DNNs’ vulnerability to adversarial perturbations. In the white-box setting, in which the target model is entirely transparent to an attacker, visually imperceptible adversarial images can be easily crafted to fool a target DNN model towards misclassification by leveraging the input gradient information (Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015). However, in the black-box setting, in which the parameters of the deployed model are hidden and one can only observe the input-output correspondences of a queried example, crafting adversarial examples requires a gradient-free (zeroth order) optimization approach to gather necessary attack information. Figure 1 displays a prediction-evasive adversarial example crafted via iterative model queries from a black-box DNN (the Inception-v3 model (Szegedy et al. 2016)) trained on ImageNet.

Although they achieve remarkable attack effectiveness through gradient estimation, current black-box attack methods, such as (Chen et al. 2017; Nitin Bhagoji et al. 2018), are not query-efficient since they exploit coordinate-wise gradient estimation and value update, which inevitably incurs an excessive number of model queries and may give a false sense of model robustness due to inefficient query designs. In this paper, we propose to tackle the preceding problem by using AutoZOOM, an Autoencoder-based Zeroth Order Optimization Method. AutoZOOM has two novel building blocks: (i) a new and adaptive random gradient estimation strategy to balance the query counts and distortion when crafting adversarial examples, and (ii) an autoencoder that is either trained offline on other unlabeled data, or based on a simple bilinear resizing operation, in order to accelerate black-box attacks. As illustrated in Figure 2, AutoZOOM utilizes a “decoder” to craft a high-dimensional adversarial perturbation from the (learned) low-dimensional latent-space representation, and its query efficiency can be well explained by the dimension-dependent convergence rate in gradient-free optimization.

Figure 2: Illustration of attack dimension reduction through a “decoder” in AutoZOOM for improving query efficiency in black-box attacks. The decoder has two modes: (i) an autoencoder (AE) trained on unlabeled natural images that are different from the attacked images and training data; (ii) a simple bilinear image resizer (BiLIN) that is applied channel-wise to extrapolate low-dimensional features to the original image dimension (width × height). In the latter mode, no additional training is required.

Contributions. We summarize our main contributions and new insights on adversarial robustness as follows:
1. We propose AutoZOOM, a novel query-efficient black-box attack framework for generating adversarial examples. AutoZOOM features an adaptive random gradient estimation strategy and dimension reduction techniques (either an offline trained autoencoder or a bilinear resizer) to reduce attack query counts while maintaining attack effectiveness and visual similarity. To the best of our knowledge, AutoZOOM is the first black-box attack using random full gradient estimation and data-driven acceleration.
2. We use the convergence rate of zeroth-order optimization to motivate the query efficiency of AutoZOOM and provide an error analysis of the new gradient estimator in AutoZOOM relative to the true gradient, characterizing the trade-offs between estimation error and query counts.
3. When applied to a state-of-the-art black-box attack proposed in (Chen et al. 2017), AutoZOOM attains a similar attack success rate while achieving a significant reduction (at least 93%) in the mean query counts required to attack the DNN image classifiers for MNIST, CIFAR-10 and ImageNet. It can also fine-tune the distortion in the post-success stage by performing finer gradient estimation.
4. In the experiments, we also find that AutoZOOM with a simple bilinear resizer as the decoder (AutoZOOM-BiLIN) can attain noticeable query efficiency, although it is still worse than AutoZOOM with an offline trained autoencoder (AutoZOOM-AE). However, AutoZOOM-BiLIN is easier to mount, as no additional training is required. The results also suggest an interesting finding: while learning effective low-dimensional representations of legitimate images is still a challenging task, black-box attacks using significantly fewer degrees of freedom (i.e., reduced dimensions) are certainly plausible.
Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. As a first attempt towards black-box attacks, the authors in (Papernot et al. 2017) proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and implement transfer attacks to the target model (Papernot, McDaniel, and Goodfellow 2016; Liu et al. 2017). However, its attack performance can be severely degraded due to poor attack transferability (Su et al. 2018). Although ZOO achieves a similar attack success rate and comparable visual quality as many white-box attack methods (Chen et al. 2017), its coordinate-wise gradient estimation requires excessive target model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in (Nitin Bhagoji et al. 2018).

Beyond optimization-based approaches, the authors in (Ilyas et al. 2018) proposed to use a natural evolution strategy (NES) to enhance query efficiency. Although there is a vector-wise gradient estimation step in the NES attack, we treat it as a parallel work since its natural evolutionary step is out of the scope of black-box attacks using zeroth-order gradient descent. We also note that, different from NES, our AutoZOOM framework uses a theory-driven query-efficient random-vector based gradient estimation strategy. In addition, AutoZOOM could be applied to further improve the query efficiency of NES, since NES does not take into account the factor of attack dimension reduction, which is the novelty in AutoZOOM as well as the main focus of this paper.

Under a more restricted attack setting, where only the decision (top-1 prediction class) is known to an attacker, the authors in (Brendel, Rauber, and Bethge 2018) proposed a random-walk based attack around the decision boundary. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries. Due to space limitations, we provide more background and a table comparing existing black-box attacks in the supplementary material.
Throughout this paper, we focus on improving the query efficiency of gradient-estimation and gradient-descent based black-box attacks empowered by AutoZOOM, and we consider the threat model that the class prediction scores are known to an attacker. In this setting, it suffices to denote the target DNN as a classification function $F : [0,1]^d \mapsto \mathbb{R}^K$ that takes a $d$-dimensional scaled image as its input and yields a vector of prediction scores of all $K$ image classes, such as the prediction probabilities for each class. We further consider the case of applying an entry-wise monotonic transformation $M(F)$ to the output of $F$ for black-box attacks, since monotonic transformation preserves the ranking of the class predictions and can alleviate the problem of large score variation in $F$ (e.g., probability to log probability).

Here we formulate black-box targeted attacks. The formulation can be easily adapted to untargeted attacks. Let $(\mathbf{x}_0, t_0)$ denote a natural image $\mathbf{x}_0$ and its ground-truth class label $t_0$, and let $(\mathbf{x}, t)$ denote the adversarial example of $\mathbf{x}_0$ and the target attack class label $t \neq t_0$. The problem of finding an adversarial example can be formulated as an optimization problem taking the generic form of

$$\min_{\mathbf{x} \in [0,1]^d} \; \mathrm{Dist}(\mathbf{x}, \mathbf{x}_0) + \lambda \cdot \mathrm{Loss}(\mathbf{x}, M(F(\mathbf{x})), t), \tag{1}$$

where $\mathrm{Dist}(\mathbf{x}, \mathbf{x}_0)$ measures the distortion between $\mathbf{x}$ and $\mathbf{x}_0$, $\mathrm{Loss}(\cdot)$ is an attack objective reflecting the likelihood of predicting $t = \arg\max_{k \in \{1,\ldots,K\}} [M(F(\mathbf{x}))]_k$, $\lambda$ is a regularization coefficient, and the constraint $\mathbf{x} \in [0,1]^d$ confines the adversarial image $\mathbf{x}$ to the valid image space. The distortion $\mathrm{Dist}(\mathbf{x}, \mathbf{x}_0)$ is often evaluated by the $L_p$ norm defined as $\mathrm{Dist}(\mathbf{x}, \mathbf{x}_0) = \|\mathbf{x} - \mathbf{x}_0\|_p = \|\boldsymbol{\delta}\|_p = \left(\sum_{i=1}^{d} |\delta_i|^p\right)^{1/p}$ for $p \geq 1$, where $\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}_0$ is the adversarial perturbation to $\mathbf{x}_0$.
The attack objective $\mathrm{Loss}(\cdot)$ can be the training loss of DNNs (Goodfellow, Shlens, and Szegedy 2015) or some designed loss based on model predictions (Carlini and Wagner 2017b).

In the white-box setting, an adversarial example is generated by using downstream optimizers such as ADAM (Kingma and Ba 2015) to solve (1); this requires the gradient $\nabla f(\mathbf{x})$ of the objective function $f(\mathbf{x}) = \mathrm{Dist}(\mathbf{x}, \mathbf{x}_0) + \lambda \cdot \mathrm{Loss}(\mathbf{x}, M(F(\mathbf{x})), t)$ relative to the input of $F$ via back-propagation in DNNs. However, in the black-box setting, acquiring $\nabla f(\cdot)$ is implausible, and one can only obtain the function evaluation $F(\cdot)$, which renders solving (1) a zeroth order optimization problem. Recently, zeroth order optimization approaches (Ghadimi and Lan 2013; Nesterov and Spokoiny 2017; Liu et al. 2018) circumvent the preceding challenge by approximating the true gradient via function evaluations. Specifically, in black-box attacks, the gradient estimate is applied to both gradient computation and descent in the optimization process for solving (1).

As a first attempt to enable gradient-free black-box attacks on DNNs, the authors in (Chen et al. 2017) use the symmetric difference quotient method (Lax and Terrell 2014) to evaluate the gradient $\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}_i}$ of the $i$-th component by

$$g_i = \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)}{2h} \approx \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}_i} \tag{2}$$

using a small $h$. Here $\mathbf{e}_i$ denotes the $i$-th elementary basis. Although it contributes to powerful black-box attacks and is applicable to large networks such as those trained on ImageNet, the coordinate-wise gradient estimation step in (2) must incur an enormous number of model queries and is hence not query-efficient.
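To make the query cost of (2) concrete, the following is a minimal NumPy sketch of the coordinate-wise symmetric difference quotient. It is an illustration only, not the ZOO implementation: the function name `coordinate_wise_gradient`, the step `h=1e-4`, and the quadratic `toy_objective` standing in for the black-box objective $f$ are all assumptions made here.

```python
import numpy as np

def coordinate_wise_gradient(f, x, h=1e-4):
    """Estimate grad f(x) with the symmetric difference quotient of (2).

    f is treated as a black box; x is a 1-D array. Returns the estimate
    and the number of function evaluations (queries) that were spent.
    """
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    queries = 0
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0  # i-th elementary basis vector
        grad[i] = (f(x + h * e_i) - f(x - h * e_i)) / (2.0 * h)
        queries += 2  # two evaluations per coordinate
    return grad, queries

# Hypothetical stand-in for the attack objective f in (1).
def toy_objective(x):
    return float(np.sum(x ** 2))

x = np.array([0.5, -1.0, 2.0])
grad, queries = coordinate_wise_gradient(toy_objective, x)
print(grad, queries)
```

A full gradient estimate costs two queries per coordinate, i.e. 2d queries for a d-dimensional input, which is the cost that the random estimator introduced next is designed to avoid.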
For ImageNet, the input dimension is $d = 299 \times 299 \times 3 \approx 270{,}000$, which renders coordinate-wise zeroth order optimization based on gradient estimation query-inefficient.

To improve query efficiency, we dispense with coordinate-wise estimation and instead propose a scaled random full gradient estimator of $\nabla f(\mathbf{x})$, defined as

$$\mathbf{g} = b \cdot \frac{f(\mathbf{x} + \beta \mathbf{u}) - f(\mathbf{x})}{\beta} \cdot \mathbf{u}, \tag{3}$$

where $\beta > 0$ is a smoothing parameter, $\mathbf{u}$ is a unit-length vector that is uniformly drawn at random from a unit Euclidean sphere, and $b$ is a tunable scaling parameter that balances the bias and variance trade-off of the gradient estimation error. Note that with $b = 1$, the gradient estimator in (3) becomes the one used in (Duchi et al. 2015). With $b = d$, this estimator becomes the one adopted in (Gao, Jiang, and Zhang 2014). We will provide an optimal value $b^*$ for balancing query efficiency and estimation error in the following analysis.

Averaged random gradient estimation.
To effectively control the error in gradient estimation, we consider a more general gradient estimator, in which the gradient estimate is averaged over $q$ random directions $\{\mathbf{u}_j\}_{j=1}^{q}$. That is,

$$\bar{\mathbf{g}} = \frac{1}{q} \sum_{j=1}^{q} \mathbf{g}_j, \tag{4}$$

where $\mathbf{g}_j$ is a gradient estimate defined in (3) with $\mathbf{u} = \mathbf{u}_j$. The use of multiple random directions can reduce the variance of $\bar{\mathbf{g}}$ in (4) for convex loss functions (Duchi et al. 2015; Liu et al. 2018).

Below we establish an error analysis of the averaged random gradient estimator in (4) for studying the influence of the parameters $b$ and $q$ on estimation error and query efficiency.

Theorem 1. Assume $f : \mathbb{R}^d \mapsto \mathbb{R}$ is differentiable and its gradient $\nabla f(\cdot)$ is $L$-Lipschitz. Then the mean squared estimation error of $\bar{\mathbf{g}}$ in (4) is upper bounded by

$$\mathbb{E}\,\|\bar{\mathbf{g}} - \nabla f(\mathbf{x})\|_2^2 \leq 4\left(\frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2}\right) \|\nabla f(\mathbf{x})\|_2^2 + \frac{2q+1}{q}\, b^2 \beta^2 L^2. \tag{5}$$

(A function $W(\cdot)$ is $L$-Lipschitz if $\|W(\mathbf{w}_1) - W(\mathbf{w}_2)\|_2 \leq L \|\mathbf{w}_1 - \mathbf{w}_2\|_2$ for any $\mathbf{w}_1, \mathbf{w}_2$. For DNNs with ReLU activations, $L$ can be derived from the model weights (Szegedy et al. 2014).)

Proof. The proof is given in the supplementary file.

Here we highlight the important implications based on Theorem 1: (i) the error analysis holds when $f$ is non-convex; (ii) in DNNs, the true gradient $\nabla f$ can be viewed as the numerical gradient obtained via back-propagation; (iii) for any fixed $b$, selecting a small $\beta$ (e.g., we set $\beta = 1/d$ in AutoZOOM) can effectively reduce the last error term in (5), and we therefore focus on optimizing the first error term; (iv) the first error term in (5) exhibits the influence of $b$ and $q$ on the estimation error, and is independent of $\beta$.

We further elaborate on (iv) as follows. Fix $q$ and let $\eta(b) = \frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2}$ denote the bracketed coefficient of the first error term in (5). Setting $\eta'(b) = 0$ gives $b\,\frac{2q+d}{d^2 q} = \frac{1}{d}$, so the optimal $b$ that minimizes $\eta(b)$ is $b^* = \frac{dq}{2q+d}$. For query efficiency, one would like to keep $q$ small, which then implies $b^* \approx q$ and $\eta(b^*) \approx 1$ when the dimension $d$ is large. On the other hand, when $q \to \infty$, $b^* \approx d/2$ and $\eta(b^*) \approx 1/2$, which yields a smaller error upper bound but is query-inefficient. We also note that by setting $b = q$, the coefficient $\eta(b) = \frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2} \approx 1$ and thus is approximately independent of the dimension $d$ and the parameter $q$.

Adaptive random gradient estimation.
Based on Theorem 1 and our error analysis, in AutoZOOM we set $b = q$ in (3) and propose to use an adaptive strategy for selecting $q$. AutoZOOM uses $q = 1$ (i.e., the fewest possible model evaluations) to first obtain rough gradient estimates for solving (1) until a successful adversarial image is found. After the initial attack success, it switches to more accurate gradient estimates with $q > 1$ to fine-tune the image quality. The trade-off between $q$ (which is proportional to query counts) and distortion reduction will be investigated in Section 4.

Dimension-dependent convergence rate using gradient estimation.
Different from the first order convergence results, the convergence rate of zeroth order gradient descent methods has an additional multiplicative dimension-dependent factor $d$. In the convex loss setting the rate is $O(\sqrt{d/T})$, where $T$ is the number of iterations (Nesterov and Spokoiny 2017; Liu et al. 2018; Gao, Jiang, and Zhang 2014; Wang et al. 2018). The same convergence rate has also been found in the nonconvex setting (Ghadimi and Lan 2013). The dimension-dependent convergence factor $d$ suggests that vanilla black-box attacks using gradient estimations can be query-inefficient when the (vectorized) image dimension $d$ is large, due to the curse of dimensionality in convergence. This also motivates us to propose using an autoencoder to reduce the attack dimension and improve query efficiency in black-box attacks.

In AutoZOOM, we propose to perform random gradient estimation from a reduced dimension $d' < d$ to improve query efficiency. Specifically, as illustrated in Figure 2, the additive perturbation to an image $\mathbf{x}_0$ is actually implemented through a “decoder” $D : \mathbb{R}^{d'} \mapsto \mathbb{R}^{d}$ such that $\mathbf{x} = \mathbf{x}_0 + D(\boldsymbol{\delta}')$, where $\boldsymbol{\delta}' \in \mathbb{R}^{d'}$. In other words, the adversarial perturbation $\boldsymbol{\delta} \in \mathbb{R}^{d}$ to $\mathbf{x}_0$ is in fact generated from a dimension-reduced space, with the aim of improving query efficiency due to the reduced dimension-dependent factor in the convergence analysis. AutoZOOM provides two modes for such a decoder $D$:
• An autoencoder (AE) trained on unlabeled data that are different from the training data to learn reconstruction from a dimension-reduced representation. The encoder $E(\cdot)$ in an AE compresses the data to a low-dimensional latent space and the decoder $D(\cdot)$ reconstructs an example from its latent representation. The weights of an AE are learned to minimize the average $L_2$ reconstruction error. Note that training such an AE for black-box adversarial attacks is one-time and is entirely offline (i.e., no model queries needed).
• A simple channel-wise bilinear image resizer (BiLIN) that scales a small image to a large image via bilinear extrapolation. Note that no additional training is required for BiLIN.

Algorithm 1 AutoZOOM for black-box attacks on DNNs
Input: Black-box DNN model $F$, original example $\mathbf{x}_0$, distortion measure $\mathrm{Dist}(\cdot)$, attack objective $\mathrm{Loss}(\cdot)$, monotonic transformation $M(\cdot)$, decoder $D(\cdot) \in \{\mathrm{AE}, \mathrm{BiLIN}\}$, initial coefficient $\lambda_{\mathrm{ini}}$, query budget $Q$
while query count ≤ $Q$ do
  1. Exploration: use $\mathbf{x} = \mathbf{x}_0 + D(\boldsymbol{\delta}')$ and apply the random gradient estimator in (4) with $q = 1$ to the downstream optimizer (e.g., ADAM) for solving (1) until an initial attack is found.
  2. Exploitation (post-success stage): continue to fine-tune the adversarial perturbation $D(\boldsymbol{\delta}')$ for solving (1) while setting $q \geq 1$ in (4).
end while
Output: Least distorted successful adversarial example

Why AE?
Our proposal of AE is motivated by the insightful findings in (Goodfellow, Shlens, and Szegedy 2015) that a successful adversarial perturbation is highly relevant to some human-imperceptible noise pattern resembling the shape of the target class, known as the “shadow”. Since a decoder in AE learns to reconstruct data from latent representations, it can also provide distributional guidance for mapping adversarial perturbations to generate these shadows.

We also note that for any reduced dimension $d'$, the setting $b^* = q$ is optimal in terms of minimizing the corresponding estimation error from Theorem 1, despite the fact that the gradient estimation errors of different reduced dimensions cannot be directly compared. In Section 4 we will report the superior query efficiency in black-box attacks achieved with the use of AE or BiLIN as the decoder, and discuss the benefit of attack dimension reduction. (For the bilinear resizer, see tf.image.resize_images, a TensorFlow example.)

Algorithm 1 summarizes the AutoZOOM framework towards query-efficient black-box attacks on DNNs. We also note that AutoZOOM is a general acceleration tool that is compatible with any gradient-estimation based black-box adversarial attack obeying the attack formulation in (1). It also has some theoretical estimation error guarantees and query-efficient parameter selection based on Theorem 1. The details on adjusting the regularization coefficient $\lambda$ and the query parameter $q$ based on run-time model evaluation results will be discussed in Section 4. Our source code is publicly available.

This section presents the experiments for assessing the performance of AutoZOOM in accelerating black-box attacks on DNNs in terms of the number of queries required for an initial attack success and for a specific distortion level.
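As a point of reference for the query counts reported in the experiments, the averaged random gradient estimator of (3) and (4) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the released implementation: the function names, the linear `toy_objective`, and the sample sizes are assumptions introduced here. The defaults $b = q$ and $\beta = 1/d$ follow the choices stated in the text.

```python
import numpy as np

def random_unit_vector(dim, rng):
    """Draw a vector uniformly from the unit Euclidean sphere in R^dim."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def averaged_random_gradient(f, z, q=1, b=None, beta=None, rng=None):
    """Averaged random full gradient estimate of (4), built from (3).

    f is a black-box objective on a 1-D array z. Defaults follow the
    text: b = q and beta = 1/dim. Returns (estimate, queries used).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    dim = z.size
    b = float(q) if b is None else float(b)
    beta = 1.0 / dim if beta is None else float(beta)
    f0 = f(z)  # one query, shared by all q directions
    g_bar = np.zeros(dim)
    for _ in range(q):
        u = random_unit_vector(dim, rng)
        g_bar += b * (f(z + beta * u) - f0) / beta * u  # estimator (3)
    return g_bar / q, q + 1

# Hypothetical linear objective whose true gradient is the vector a.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def toy_objective(z):
    return float(a @ z)

rng = np.random.default_rng(0)
z0 = np.zeros(5)
g_rough, n_rough = averaged_random_gradient(toy_objective, z0, q=1, rng=rng)
g_fine, n_fine = averaged_random_gradient(toy_objective, z0, q=20000, b=5, rng=rng)
print(n_rough, n_fine, g_fine)
```

Each call spends q + 1 queries regardless of the dimension. The second call uses b = d only to illustrate that this scaling gives an approximately unbiased estimate for a linear objective when q is large. In AutoZOOM the objective would be evaluated in the reduced space, i.e. as a function of $\boldsymbol{\delta}'$ through $\mathbf{x} = \mathbf{x}_0 + D(\boldsymbol{\delta}')$.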
As described in Section 3, AutoZOOM is a query-efficient gradient-free optimization framework for solving the black-box attack formulation in (1). In the following experiments, we demonstrate the utility of AutoZOOM by using the same attack formulation proposed in ZOO (Chen et al. 2017), which uses the squared $L_2$ norm as the distortion measure $\mathrm{Dist}(\cdot)$ and adopts the attack objective

$$\mathrm{Loss} = \max\left\{ \max_{j \neq t} \log [F(\mathbf{x})]_j - \log [F(\mathbf{x})]_t,\; 0 \right\}, \tag{6}$$

where this hinge function is designed for targeted black-box attacks on the DNN model $F$, and the monotonic transformation $M(\cdot) = \log(\cdot)$ is applied to the model output.

We compare AutoZOOM-AE ($D$ = AE) and AutoZOOM-BiLIN ($D$ = BiLIN) with two different baselines: (i) the standard ZOO implementation with bilinear scaling (same as BiLIN) for dimension reduction; (ii) ZOO+AE, which is ZOO with AE. Note that all attacks generate adversarial perturbations based on the same reduced attack dimension.
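The hinge objective in (6) is easy to state in code. The sketch below assumes the black-box model returns a vector of class probabilities; the function name `targeted_hinge_loss` and the small constant `eps` that guards against log(0) are assumptions added for illustration and are not part of (6).

```python
import numpy as np

def targeted_hinge_loss(probs, target, eps=1e-30):
    """Attack objective (6): max(max_{j != t} log F_j - log F_t, 0)."""
    log_p = np.log(np.asarray(probs, dtype=float) + eps)
    others = np.delete(log_p, target)  # log-scores of all non-target classes
    return max(float(np.max(others) - log_p[target]), 0.0)

probs = [0.7, 0.2, 0.1]  # hypothetical model output for three classes
loss_not_yet = targeted_hinge_loss(probs, target=1)  # target is not top-1
loss_success = targeted_hinge_loss(probs, target=0)  # target is already top-1
print(loss_not_yet, loss_success)
```

The loss is positive while some other class outscores the target and drops to zero once the target class has the largest score, which is the condition for a successful targeted attack.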
We assess the performance of different attack methods on several representative benchmark datasets, including MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky 2009) and ImageNet (Russakovsky et al. 2015). For MNIST and CIFAR-10, we use the same DNN image classification models as in (Carlini and Wagner 2017b). For ImageNet, we use the Inception-v3 model (Szegedy et al. 2016). All experiments were conducted using the TensorFlow Machine-Learning Library (Abadi et al.) on machines equipped with an Intel Xeon E5-2690v3 CPU and an Nvidia Tesla K80 GPU.

All attacks used ADAM (Kingma and Ba 2015) for solving (1) with their estimated gradients and the same initial learning rate $2 \times 10^{-3}$. On MNIST and CIFAR-10, all methods adopt 1,000 ADAM iterations. On ImageNet, ZOO and ZOO+AE adopt 20,000 iterations, whereas AutoZOOM-BiLIN and AutoZOOM-AE adopt 100,000 iterations. Note that due to different gradient estimation methods, the query counts (i.e., the number of model evaluations) per iteration of a black-box attack may vary. ZOO and ZOO+AE use the parallel gradient update of (2) with a batch of 128 pixels, yielding 256 query counts per iteration. AutoZOOM-BiLIN and AutoZOOM-AE use the averaged random full gradient estimator in (4), resulting in $q + 1$ query counts per iteration. For a fair comparison, the query counts are used for performance assessment.

Code repositories: AutoZOOM, https://github.com/IBM/Autozoom-Attack; ZOO, https://github.com/huanzhang12/ZOO-Attack; MNIST and CIFAR-10 models of (Carlini and Wagner 2017b), https://github.com/carlini/nn_robust_attacks.

Query reduction ratio.
We use the mean query counts of ZOO with the smallest $\lambda_{\mathrm{ini}}$ as the baseline for computing the query reduction ratio of other methods and configurations.

TPR and initial success. We report the true positive rate (TPR), which measures the percentage of successful attacks fulfilling a pre-defined constraint $\ell$ on the normalized (per-pixel) $L_2$ distortion, as well as their query counts of first successes. We also report the per-pixel $L_2$ distortions of initial successes, where an initial success refers to the first query count that finds a successful adversarial example.

Post-success fine-tuning. When implementing AutoZOOM in Algorithm 1, on MNIST and CIFAR-10 we find that AutoZOOM without fine-tuning (i.e., $q = 1$) already yields similar distortion as ZOO. We note that ZOO can be viewed as coordinate-wise fine-tuning and is thus query-inefficient. On ImageNet, we will investigate the effect of post-success fine-tuning on reducing distortion.

Autoencoder Training. In AutoZOOM-AE, we use convolutional autoencoders for attack dimension reduction, which are trained on unlabeled datasets that are different from the training dataset and the attacked natural examples. The implementation details are given in the supplementary material.
Dynamic Switching on $\lambda$. To adjust the regularization coefficient $\lambda$ in (1), in all methods we set its initial value $\lambda_{\mathrm{ini}} \in \{0.1, 1, 10\}$ on MNIST and CIFAR-10, and set $\lambda_{\mathrm{ini}} = 10$ on ImageNet. Furthermore, for balancing the distortion Dist and the attack objective Loss in (1), we use a dynamic switching strategy to update $\lambda$ during the optimization process. Every $S$ iterations, $\lambda$ is multiplied by 10 if the attack has never been successful. Otherwise, it is divided by 2. On MNIST and CIFAR-10, we set $S = 100$. On ImageNet, we set $S = 1{,}000$. At the instance of initial success, we also reset $\lambda = \lambda_{\mathrm{ini}}$ and the ADAM parameters to the default values, as doing so can empirically reduce the distortion for all attack methods.

For both MNIST and CIFAR-10, we randomly select 50 correctly classified images from their test sets, and perform targeted attacks on these images. Since both datasets have 10 classes, each selected image is attacked 9 times, targeting all but its true class. For all attacks, the ratio of reduced attack-space dimension to the original one (i.e., $d'/d$) is 25% for MNIST and 6.25% for CIFAR-10.

[Table 1: Performance evaluation of black-box targeted attacks on MNIST. Columns: Method; $\lambda_{\mathrm{ini}}$; attack success rate (ASR); mean query count (initial success); mean query count reduction ratio (initial success); mean per-pixel $L_2$ distortion (initial success); true positive rate (TPR); mean query count with per-pixel $L_2$ distortion below the threshold.]

[Table 2: Performance evaluation of black-box targeted attacks on CIFAR-10. Columns: same as Table 1.]

Table 1 shows the performance evaluation on MNIST with various values of $\lambda_{\mathrm{ini}}$, the initial value of the regularization coefficient $\lambda$ in (1). We use the performance of ZOO with $\lambda_{\mathrm{ini}} = 0.1$ as a baseline for comparison. For example, with $\lambda_{\mathrm{ini}} = 0.1$ and […], the mean query counts required by AutoZOOM-AE to attain an initial success are reduced by […] and […], respectively. One can also observe that allowing larger $\lambda_{\mathrm{ini}}$ generally leads to fewer mean query counts at the price of slightly increased distortion for the initial attack. The large difference in the required attack query counts between AutoZOOM and ZOO/ZOO+AE validates the effectiveness of our proposed random full gradient estimator in (3), which dispenses with the coordinate-wise gradient estimation in ZOO but still retains comparable true positive rates, thereby greatly improving query efficiency.

For CIFAR-10, we report similar query efficiency improvements, as displayed in Table 2. In particular, comparing the two query-efficient black-box attack methods (AutoZOOM-BiLIN and AutoZOOM-AE), we find that AutoZOOM-AE is more query-efficient than AutoZOOM-BiLIN, but at the cost of an additional AE training step. AutoZOOM-AE achieves the highest attack success rates (ASRs) and mean query reduction ratios for different values of $\lambda_{\mathrm{ini}}$. In addition, their true positive rates (TPRs) are similar but AutoZOOM-AE usually takes fewer query counts to reach the same $L_2$ distortion. We note that when $\lambda_{\mathrm{ini}} = 10$, AutoZOOM-AE has a higher TPR but also needs slightly more mean query counts than AutoZOOM-BiLIN to reach the same $L_2$ distortion. This suggests that there are some adversarial examples whose post-success distortions are difficult for a bilinear resizer to reduce but can be handled by an AE.

We selected 50 correctly classified images from the ImageNet test set to perform random targeted attacks and set $\lambda_{\mathrm{ini}} = 10$ and the attack dimension reduction ratio to 1.15%. The results are summarized in Table 3.
Note that compared to ZOO, AutoZOOM-AE can significantly reduce the query count required to achieve an initial success by 99.39% (or 99.35% to reach the same $L_2$ distortion), which is a remarkable improvement since this means reducing more than 2.2 million model queries given the large dimension of ImageNet (≈ 270,000).

[Table 3: Performance evaluation of black-box targeted attacks on ImageNet. Columns: Method; attack success rate (ASR); mean query count (initial success); mean query count reduction ratio (initial success); mean per-pixel $L_2$ distortion (initial success); true positive rate (TPR); mean query count with per-pixel $L_2$ distortion below the threshold. Recoverable row: ZOO, ASR 76.00%, mean query count 2,226,405.04 (2.22M), reduction ratio 0.00%.]

Post-success distortion refinement. As described in Algorithm 1, adaptive random gradient estimation is integrated in AutoZOOM, offering a quick initial success in attack generation followed by a fine-tuning process to effectively reduce the distortion. This is achieved by adjusting the gradient estimate averaging parameter $q$ in (4) in the post-success stage. In general, averaging over more random directions (i.e., setting larger $q$) tends to better reduce the variance of gradient estimation error, but at the cost of increased model queries. Figure 3 (a) shows the mean distortion against query counts for various choices of $q$ in the post-success stage. The results suggest that setting some small $q$ with $q > 1$ can further decrease the distortion at the converged phase when compared with the case of $q = 1$. Moreover, the refinement effect on distortion empirically saturates at $q = 4$, implying a marginal gain beyond this value. These findings also demonstrate that our proposed AutoZOOM indeed strikes a balance between distortion and query efficiency in black-box attacks.

Figure 3: (a) Post-success distortion refinement. After initial success, AutoZOOM (here $D$ = AE) can further decrease the distortion by setting $q > 1$ in (4) to trade more query counts for smaller distortion in the converged stage, which saturates at $q = 4$. (b) Dimension reduction vs. query efficiency. Attack dimension reduction is crucial to query-efficient black-box attacks. When compared to black-box attacks on the original dimension, dimension reduction through AutoZOOM-AE reduces roughly 35-40% query counts on MNIST and CIFAR-10 and at least 95% on ImageNet.

In addition to the motivation from the $O(\sqrt{d/T})$ convergence rate in zeroth-order optimization (Sec. 3.3), as a sanity check, we corroborate the benefit of attack dimension reduction to query efficiency in black-box attacks by comparing AutoZOOM (here we use $D$ = AE) with its alternative operated on the original (non-reduced) dimension (i.e., $\boldsymbol{\delta}' = D(\boldsymbol{\delta}') = \boldsymbol{\delta}$). Tested on all three datasets and aforementioned settings, Figure 3 (b) shows the corresponding mean query count to initial success and the mean query reduction ratio when $\lambda_{\mathrm{ini}} = 10$ in all three datasets.
When comparedto the attack results of the original dimension, attack dimen-sion reduction through AutoZOOM reduces roughly 35-40%query counts on MNIST and CIFAR-10 and at least 95% onImageNet. This result highlights the importance of dimen-sion reduction towards query-efficient black-box attacks. Forexample, without dimension reduction, the attack on the orig-inal ImageNet dimension cannot even be successful withinthe query budge ( Q = 200 K queries). • In addition to benchmarking on initial attack success, thequery reduction ratio when reaching the same L distortioncan be directly computed from the last column in each table. • The attack gain in AutoZOOM-AE versus AutoZOOM-BiLIN could sometimes be marginal, while we also note thatthere is room for improving AutoZOOM-AE by exploringdifferent AE models. However, we advocate AutoZOOM-BiLIN as a practically ideal candidate for query-efficientblack-box attacks when testing model robustness, due to itseasy-to-mount nature and it has no additional training cost. • While learning effective low-dimensional representationsof legitimate images is still a challenging task, black-box at-tacks using significantly less degree of freedoms (i.e., reduceddimensions), as demonstrated in this paper, are certainly plau-sible, leading to new implications on model robustness.
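The trade-off between query count and estimation variance controlled by $q$ can be illustrated with a minimal NumPy sketch of the averaged random gradient estimator in (4). It is only a sketch under stated assumptions: the directions are assumed to be drawn uniformly from the unit sphere, the function names and the `choose_q` schedule (with its default `q_post=4`) are hypothetical, and the objective `f` stands in for the black-box attack loss.

```python
import numpy as np

def random_unit_vectors(q, d, rng):
    """Draw q directions uniformly from the unit sphere in R^d (assumed sampling scheme)."""
    u = rng.standard_normal((q, d))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def averaged_gradient_estimate(f, x, q, beta, b, rng):
    """Average q one-sided random-direction estimates b * (f(x + beta*u) - f(x)) / beta * u.

    Each call evaluates f once at x and once per direction, i.e. q + 1 queries.
    """
    f0 = f(x)                                              # one query at the current point
    u = random_unit_vectors(q, x.size, rng)
    diffs = np.array([f(x + beta * ui) - f0 for ui in u])  # q further queries
    g = b * (diffs / beta)[:, None] * u                    # per-direction estimates g_i
    return g.mean(axis=0)

def choose_q(attack_succeeded, q_post=4):
    """Hypothetical schedule: q = 1 until the first success, then a larger q."""
    return q_post if attack_succeeded else 1

# Toy usage: f(z) = 0.5 * ||z||^2 has gradient z.
rng = np.random.default_rng(0)
toy_f = lambda z: 0.5 * float(z @ z)
x = np.ones(5)
estimate = averaged_gradient_estimate(toy_f, x, choose_q(True), 1e-3, x.size, rng)
print(estimate)
```

With `q = 1` each step costs two queries, which favors a fast initial success; a larger `q` after success spends more queries per step for a lower-variance estimate.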
AutoZOOM is a generic attack acceleration framework that is compatible with any gradient-estimation based black-box attack having the general formulation in (1). It adopts a new and adaptive random full gradient estimation strategy to strike a balance between query counts and estimation errors, and features a decoder (AE or BiLIN) for attack dimension reduction and algorithmic convergence acceleration. Compared to a state-of-the-art attack (ZOO), AutoZOOM consistently reduces the mean query counts when attacking black-box DNN image classifiers for MNIST, CIFAR-10 and ImageNet, attaining at least 93% query reduction in finding initial successful adversarial examples (or reaching the same distortion) while maintaining a similar attack success rate. It can also efficiently fine-tune the image distortion to maintain high visual similarity to the original image. Consequently, AutoZOOM provides novel and efficient means for assessing the robustness of deployed machine learning models.
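To make the role of the decoder concrete, the following sketch wraps an objective so that it is optimized over a low-dimensional perturbation that is decoded back to image size before each query. It is an illustration under assumptions, not the authors' implementation: `bilinear_resize` is a hand-written stand-in for the BiLIN decoder (single channel, corner-aligned interpolation), and `reduced_objective` and the toy loss are hypothetical names.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array by separable linear interpolation (corners aligned)."""
    in_h, in_w = img.shape
    xs = np.linspace(0, in_w - 1, out_w)
    ys = np.linspace(0, in_h - 1, out_h)
    # Interpolate along the width for every input row: shape (in_h, out_w).
    rows = np.stack([np.interp(xs, np.arange(in_w), r) for r in img])
    # Interpolate along the height for every column: shape (out_h, out_w).
    return np.stack([np.interp(ys, np.arange(in_h), c) for c in rows.T], axis=1)

def reduced_objective(f, x0, out_shape):
    """Return a function of the small perturbation delta' that queries f at x0 + D(delta')."""
    def f_reduced(delta_small):
        delta = bilinear_resize(delta_small, *out_shape)  # decoder D applied to delta'
        return f(x0 + delta)
    return f_reduced

# Toy usage: a 14 x 14 perturbation (196 variables) drives a 28 x 28 input (784 variables).
x0 = np.zeros((28, 28))
toy_loss = lambda z: float((z ** 2).sum())
f_small = reduced_objective(toy_loss, x0, (28, 28))
print(f_small(np.full((14, 14), 0.1)))
```

Any gradient estimator can then be run on `f_small` instead of the full-dimensional objective, so the number of variables being estimated drops from the image dimension to the size of the reduced space.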
Acknowledgements
Shin-Ming Cheng was supported in part by the Ministry of Science and Technology, Taiwan, under Grants MOST 107-2218-E-001-005 and MOST 107-2218-E-011-012. Cho-Jui Hsieh and Huan Zhang acknowledge the support by NSF IIS-1719097, Intel faculty award, Google Cloud and NVIDIA.
References

[Abadi et al. ] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. Tensorflow: A system for large-scale machine learning.
[Athalye, Carlini, and Wagner 2018] Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ICML.
[Baluja and Fischer 2018] Baluja, S., and Fischer, I. 2018. Adversarial transformation networks: Learning to generate adversarial examples. AAAI.
[Biggio and Roli 2018] Biggio, B., and Roli, F. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition.
[...] Joint European conference on machine learning and knowledge discovery in databases, 387–402.
[Brendel, Rauber, and Bethge 2018] Brendel, W.; Rauber, J.; and Bethge, M. 2018. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ICLR.
[Carlini and Wagner 2017a] Carlini, N., and Wagner, D. 2017a. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, 3–14.
[Carlini and Wagner 2017b] Carlini, N., and Wagner, D. 2017b. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 39–57.
[Chen et al. 2017] Chen, P.-Y.; Zhang, H.; Sharma, Y.; Yi, J.; and Hsieh, C.-J. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security, 15–26.
[Chen et al. 2018] Chen, P.-Y.; Sharma, Y.; Zhang, H.; Yi, J.; and Hsieh, C.-J. 2018. EAD: elastic-net attacks to deep neural networks via adversarial examples. AAAI.
[Duchi et al. 2015] Duchi, J. C.; Jordan, M. I.; Wainwright, M. J.; and Wibisono, A. 2015. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory.
[...] Optimization Online.
[...] SIAM Journal on Optimization.
[...] ICLR.
[Ilyas et al. 2018] Ilyas, A.; Engstrom, L.; Athalye, A.; and Lin, J. 2018. Black-box adversarial attacks with limited queries and information. ICML.
[Ilyas 2018] Ilyas, A. 2018. Circumventing the ensemble adversarial training defense. https://github.com/andrewilyas/ens-adv-train-attack
[Kingma and Ba 2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.
[Krizhevsky 2009] Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
[Kurakin, Goodfellow, and Bengio 2017] Kurakin, A.; Goodfellow, I.; and Bengio, S. 2017. Adversarial machine learning at scale. ICLR.
[Lax and Terrell 2014] Lax, P. D., and Terrell, M. S. 2014. Calculus with applications. Springer.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE.
[...] ICLR.
[Liu et al. 2018] Liu, S.; Chen, J.; Chen, P.-Y.; and Hero, A. O. 2018. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. AISTATS.
[Lowd and Meek 2005] Lowd, D., and Meek, C. 2005. Adversarial learning. In ACM SIGKDD international conference on Knowledge discovery in data mining, 641–647.
[Narodytska and Kasiviswanathan 2016] Narodytska, N., and Kasiviswanathan, S. P. 2016. Simple black-box adversarial perturbations for deep networks. arXiv preprint arXiv:1612.06299.
[Nesterov and Spokoiny 2017] Nesterov, Y., and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics.
[...] ECCV, 154–169.
[Papernot et al. 2017] Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 506–519.
[Papernot, McDaniel, and Goodfellow 2016] Papernot, N.; McDaniel, P.; and Goodfellow, I. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision.
[...] ECCV.
[Suya et al. 2017] Suya, F.; Tian, Y.; Evans, D.; and Papotti, P. 2017. Query-limited black-box attacks to classifiers. NIPS Workshop.
[Szegedy et al. 2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. ICLR.
[Szegedy et al. 2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
[Tramèr et al. 2018] Tramèr, F.; Kurakin, A.; Papernot, N.; Boneh, D.; and McDaniel, P. 2018. Ensemble adversarial training: Attacks and defenses. ICLR.
[Wang et al. 2018] Wang, Y.; Du, S.; Balakrishnan, S.; and Singh, A. 2018. Stochastic zeroth-order optimization in high dimensions. AISTATS.

Supplementary Material

A More Background on Adversarial Attacks and Defenses
The research in generating adversarial examples to deceive machine-learning models, known as adversarial attacks, tends to evolve with the advance of machine-learning techniques and new publicly available datasets. In (Lowd and Meek 2005), the authors studied adversarial attacks to linear classifiers with continuous or Boolean features. In (Biggio et al. 2013), the authors proposed a gradient-based adversarial attack on kernel support vector machines (SVMs). More recently, gradient-based approaches are also used in adversarial attacks on image classifiers trained by DNNs (Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015). Due to space limitation, we focus on related work in adversarial attacks on DNNs. Interested readers may refer to the survey paper (Biggio and Roli 2018) for more details.

Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. In principle, adversarial attacks can be formulated as an optimization problem of minimizing the adversarial perturbation while ensuring attack objectives. In image classification, given a natural image, an untargeted attack aims to find a visually similar adversarial image resulting in a different class prediction, while a targeted attack aims to find an adversarial image leading to a specific class prediction. The visual similarity between a pair of adversarial and natural images is often measured by the $L_p$ norm of their difference, where $p \geq 1$. Existing powerful white-box adversarial attacks using $L_\infty$, $L_2$ or $L_1$ norms include iterative fast gradient sign methods (Kurakin, Goodfellow, and Bengio 2017), Carlini and Wagner's (C&W) attack (Carlini and Wagner 2017b), elastic-net attacks to DNNs (EAD) (Chen et al. 2018), etc.

Black-box adversarial attacks are practical threats to deployed machine-learning services. Attackers can observe the input-output correspondences of any queried input, but the target model parameters are completely hidden. Therefore, gradient-based adversarial attacks are inapplicable to a black-box setting. As a first attempt, the authors in (Papernot et al. 2017) proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and leverage the transferability of adversarial examples (Papernot, McDaniel, and Goodfellow 2016; Liu et al. 2017) to attack the target model. However, training a representative surrogate for a DNN is challenging due to the complicated and nonlinear classification rules of DNNs and the high dimensionality of the underlying dataset. The performance of black-box attacks can be severely degraded if the adversarial examples for the substitute model transfer poorly to the target model. To bridge this gap, the authors in (Chen et al. 2017) proposed a black-box attack called ZOO that directly estimates the gradient of the attack objective by iteratively querying the target model. Although ZOO achieves a similar attack success rate and comparable visual quality to many white-box attack methods, it exploits the symmetric difference quotient method (Lax and Terrell 2014) for coordinate-wise gradient estimation and value update, which requires excessive target model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in the later work in (Nitin Bhagoji et al. 2018). Although acceleration techniques such as importance sampling, bilinear scaling and random feature grouping have been used in (Chen et al. 2017; Nitin Bhagoji et al. 2018), the coordinate-wise gradient estimation approach still forms a bottleneck for query efficiency.

Beyond optimization-based approaches, the authors in (Ilyas et al. 2018) proposed to use a natural evolution strategy (NES) to enhance query efficiency. Although there is also a vector-wise gradient estimation step in the NES attack, we treat it as an independent and parallel work since its natural evolutionary step is out of the scope of black-box attacks using zeroth-order gradient descent. We also note that, different from NES, our AutoZOOM framework uses a query-efficient random gradient estimation strategy. In addition, AutoZOOM could be applied to further improve the query efficiency of NES, since NES does not take into account the factor of attack dimension reduction, which is the main focus of this paper. Under a more restricted setting, where only the decision (top-1 prediction class) is known to an attacker, the authors in (Brendel, Rauber, and Bethge 2018) proposed a random-walk based attack around the decision boundary. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries.

In this paper, we focus on improving the query efficiency of gradient-estimation and gradient-descent based black-box attacks and consider the threat model in which the class prediction scores are known to an attacker. For the reader's reference, we compare existing black-box attacks on DNNs with AutoZOOM in Table S1. One unique feature of AutoZOOM is the use of a reduced attack dimension when mounting black-box attacks, which is an unlabeled data-driven technique (autoencoder) for attack acceleration, and has not been studied thoroughly in existing attacks. While white-box attacks such as (Baluja and Fischer 2018) have utilized autoencoders trained on the training data and the transparent logit representations of DNNs, we propose in this work to use autoencoders trained on unlabeled natural data to improve query efficiency for black-box attacks.

There have been many methods proposed for defending against adversarial attacks on DNNs. However, new defenses are continuously weakened by follow-up attacks (Carlini and Wagner 2017a; Athalye, Carlini, and Wagner 2018). For instance, model ensembles (Tramèr et al. 2018) were shown to be effective against some black-box attacks, while they were recently circumvented by advanced attack techniques (Ilyas 2018). In this paper, we focus on improving query efficiency in attacking black-box undefended DNNs.

B Proof of Theorem 1
Recall that the data dimension is $d$ and we assume $f$ to be differentiable and its gradient $\nabla f$ to be $L$-Lipschitz. Fix $\beta$ and consider a smoothed version of $f$:

$$ f_\beta(x) = \mathbb{E}_u[f(x + \beta u)]. \tag{S1} $$

Based on (Gao, Jiang, and Zhang 2014, Lemma 4.1-a), we have the relation

$$ \nabla f_\beta(x) = \mathbb{E}_u\left[ \frac{d}{\beta} f(x + \beta u)\, u \right] = \frac{d}{b}\, \mathbb{E}_u[g], \tag{S2} $$

which then yields

$$ \mathbb{E}_u[g] = \frac{b}{d} \nabla f_\beta(x), \tag{S3} $$

where we recall that $g$ has been defined in (3). Moreover, based on (Gao, Jiang, and Zhang 2014, Lemma 4.1-b), we have

$$ \|\nabla f_\beta(x) - \nabla f(x)\|_2 \leq \frac{\beta d L}{2}. \tag{S4} $$

Substituting (S3) into (S4), we obtain

$$ \left\| \mathbb{E}[g] - \frac{b}{d} \nabla f(x) \right\|_2 \leq \frac{\beta b L}{2}. $$

This then implies that

$$ \mathbb{E}[g] = \frac{b}{d} \nabla f(x) + \epsilon, \tag{S5} $$

where $\|\epsilon\|_2 \leq \frac{b \beta L}{2}$.

Once again, by applying (Gao, Jiang, and Zhang 2014, Lemma 4.1-b), we can easily obtain that

$$ \mathbb{E}_u[\|g\|_2^2] \leq \frac{b^2 L^2 \beta^2}{2} + \frac{2 b^2}{d} \|\nabla f(x)\|_2^2. \tag{S6} $$

Now, let us consider the averaged random gradient estimator in (4),

$$ \bar{g} = \frac{1}{q} \sum_{i=1}^{q} g_i = \frac{b}{q} \sum_{i=1}^{q} \frac{f(x + \beta u_i) - f(x)}{\beta}\, u_i. $$

Due to the properties of i.i.d. samples $\{u_i\}$ and (S5), we define

$$ v := \mathbb{E}[g_i] = \frac{b}{d} \nabla f(x) + \epsilon. \tag{S7} $$

Moreover, we have

$$ \mathbb{E}[\|\bar{g}\|_2^2] = \mathbb{E}\left\| \frac{1}{q} \sum_{i=1}^{q} (g_i - v) + v \right\|_2^2 \tag{S8} $$

$$ = \|v\|_2^2 + \mathbb{E}\left\| \frac{1}{q} \sum_{i=1}^{q} (g_i - v) \right\|_2^2 = \|v\|_2^2 + \frac{1}{q}\, \mathbb{E}[\|g_1 - v\|_2^2] \tag{S9} $$

$$ = \|v\|_2^2 + \frac{1}{q}\, \mathbb{E}[\|g_1\|_2^2] - \frac{1}{q} \|v\|_2^2, \tag{S10} $$

where we have used the fact that $\mathbb{E}[g_i] = \mathbb{E}[g_1] = v$ for all $i$, so that the cross terms vanish. The definition of $v$ in (S7) yields

$$ \|v\|_2^2 \leq \frac{2 b^2}{d^2} \|\nabla f(x)\|_2^2 + 2 \|\epsilon\|_2^2 \leq \frac{2 b^2}{d^2} \|\nabla f(x)\|_2^2 + \frac{1}{2} b^2 \beta^2 L^2. \tag{S11} $$

From (S6), we also obtain that for any $i$,

$$ \mathbb{E}[\|g_i\|_2^2] \leq \frac{b^2 L^2 \beta^2}{2} + \frac{2 b^2}{d} \|\nabla f(x)\|_2^2. \tag{S12} $$

Substituting (S11) and (S12) into (S10), we obtain

$$ \mathbb{E}[\|\bar{g}\|_2^2] \leq \|v\|_2^2 + \frac{1}{q}\, \mathbb{E}[\|g_1\|_2^2] \tag{S13} $$

$$ \leq 2 \left( \frac{b^2}{d^2} + \frac{b^2}{d q} \right) \|\nabla f(x)\|_2^2 + \frac{q + 1}{2 q}\, b^2 L^2 \beta^2. \tag{S14} $$

Finally, we bound the mean squared estimation error as

$$ \mathbb{E}[\|\bar{g} - \nabla f(x)\|_2^2] \leq 2\, \mathbb{E}[\|\bar{g} - v\|_2^2] + 2 \|v - \nabla f(x)\|_2^2 \leq 2\, \mathbb{E}[\|\bar{g}\|_2^2] + 2 \left\| \frac{b}{d} \nabla f(x) + \epsilon - \nabla f(x) \right\|_2^2 $$

$$ \leq 4 \left( \frac{b^2}{d^2} + \frac{b^2}{d q} + \frac{(b - d)^2}{d^2} \right) \|\nabla f(x)\|_2^2 + \frac{2 q + 1}{q}\, b^2 L^2 \beta^2, \tag{S15} $$

which completes the proof.

Table S1: Comparison of black-box attacks on DNNs
Method | Approach | Model output | Targeted attack / Large network (ImageNet) / Data-driven acceleration
(Narodytska and Kasiviswanathan 2016) | local random search | score | ✓
(Papernot et al. 2017) | substitute model | score | ✓
(Suya et al. 2017) | acquisition via posterior | score | ✓
(Brendel, Rauber, and Bethge 2018) | Gaussian perturbation | decision | ✓ ✓
(Ilyas et al. 2018) | natural evolution strategy | score/decision | ✓ ✓
(Chen et al. 2017) | coordinate-wise gradient estimation | score | ✓ ✓
(Nitin Bhagoji et al. 2018) | coordinate-wise gradient estimation | score | ✓ ✓
AutoZOOM (this paper) | Random (full) gradient estimation | score | ✓ ✓ ✓

Table S2: Architectures of Autoencoders in AutoZOOM

Dataset: MNIST. Training MSE: 2.00 × [...]
Reduction ratio / image size / feature map size: 25% / 28 × [...]
Encoder: [...] → MaxPool → Conv-1
Decoder: ConvReLU-16 → Reshape-Re-U → Conv-1

Dataset: CIFAR-10. Training MSE: 5.00 × [...]
Reduction ratio / image size / feature map size: 6.25% / 32 × [...]
Encoder: [...] → MaxPool → ConvReLU-3 → MaxPool → Conv-3
Decoder: ConvReLU-16 → Reshape-Re-U → ConvReLU-16 → Reshape-Re-U → Conv-3

Dataset: ImageNet. Training MSE: 1.02 × [...]
Reduction ratio / image size / feature map size: 1.15% / 299 × [...]
Encoder: [...] → ConvReLU-16 → MaxPool → ConvReLU-16 → MaxPool → Conv-3
Decoder: ConvReLU-16 → Reshape-Re-U → ConvReLU-16 → Reshape-Bi-U → Conv-3

ConvReLU-16: Convolution (16 filters, kernel size: 3 × [...] × Dep) + ReLU activation
ConvReLU-3: Convolution (3 filters, kernel size: 3 × [...] × Dep) + ReLU activation
Conv-3: Convolution (3 filters, kernel size: 3 × [...] × Dep)
Conv-1: Convolution (1 filter, kernel size: 3 × [...] × Dep)
Reshape-Bi-D: Bilinear reshaping from 299 × [...]
Reshape-Bi-U: [...] 16 to 299 × [...]
Reshape-Re-U: [...] U × V × Dep to [...] U × V × Dep
Dep: a proper depth
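As a sanity check on the bound (S15) from the proof of Theorem 1, the following sketch compares its right-hand side with a Monte Carlo estimate of the mean squared estimation error on a toy objective. Everything here is illustrative and assumed rather than taken from the paper: the objective is $f(x) = \frac{1}{2}\|x\|_2^2$ (so $\nabla f(x) = x$ and $L = 1$), the directions are drawn uniformly from the unit sphere, and the function names are hypothetical.

```python
import numpy as np

def theorem1_bound(grad_norm_sq, d, q, b, beta, L):
    """Right-hand side of (S15) for given problem constants."""
    coeff = 4.0 * (b**2 / d**2 + b**2 / (d * q) + (b - d)**2 / d**2)
    return coeff * grad_norm_sq + (2 * q + 1) / q * b**2 * L**2 * beta**2

def empirical_mse(x, q, b, beta, trials, rng):
    """Monte Carlo estimate of E||g_bar - grad f(x)||^2 for f(x) = 0.5 * ||x||^2."""
    d = x.size
    f = lambda z: 0.5 * np.sum(z * z, axis=-1)
    u = rng.standard_normal((trials, q, d))
    u /= np.linalg.norm(u, axis=-1, keepdims=True)     # unit-norm directions
    diffs = (f(x + beta * u) - f(x)) / beta            # shape (trials, q)
    g_bar = b * (diffs[..., None] * u).mean(axis=1)    # shape (trials, d)
    return np.mean(np.sum((g_bar - x) ** 2, axis=1))   # grad f(x) = x

rng = np.random.default_rng(0)
x = np.ones(5)
for q in (1, 4, 16):
    mse = empirical_mse(x, q, b=x.size, beta=1e-3, trials=2000, rng=rng)
    bound = theorem1_bound(float(x @ x), x.size, q, x.size, 1e-3, 1.0)
    print(q, mse, bound)
```

On this toy problem both the empirical error and the bound shrink as $q$ grows, which is the behavior the averaging parameter is meant to provide; the bound itself is loose, as expected from a worst-case analysis.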
C Architectures of Convolutional Autoencoders in AutoZOOM
On MNIST, the convolutional autoencoder (CAE) is trained on 50,000 randomly selected hand-written digits from the MNIST8M dataset. On CIFAR-10, the CAE is trained on 9,900 images selected from its test dataset. The remaining images are used in black-box attacks. On ImageNet, all the attacked natural images are from 10 randomly selected image labels, and these labels are also used as the candidate attack targets. The CAE is trained on about 9000 images from these classes.

Table S2 shows the architectures for all the autoencoders used in this work. Note that the autoencoder designed for ImageNet uses bilinear scaling to transform the data size from [...] × [...] × Dep to [...] × [...] × Dep, and also back from [...] × [...] × Dep to [...] × [...] × Dep. This allows easy processing and handling for the autoencoder's internal convolutional layers.

The normalized mean squared error of our autoencoders trained on MNIST, CIFAR-10 and ImageNet is 0.0027, 0.0049 and 0.0151, respectively, which lies within a reasonable range of compression loss.
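The reduction ratios reported in Table S2 can be reproduced with a small arithmetic sketch. The image and feature map shapes below are assumptions (standard dataset image sizes, with feature map sizes chosen to match the reported 25%, 6.25% and 1.15% ratios), and `replicate_upsample` is a hypothetical stand-in for the Reshape-Re-U step, assuming it doubles each spatial side by pixel replication.

```python
import numpy as np

def reduction_ratio(image_shape, latent_shape):
    """Number of latent entries divided by number of image entries."""
    return np.prod(latent_shape) / np.prod(image_shape)

def replicate_upsample(x, factor=2):
    """Upsample a (U, V, Dep) array by repeating each pixel along both spatial axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

# Assumed (image shape, feature map shape) pairs; not stated in the surviving text.
shapes = {
    "MNIST": ((28, 28, 1), (14, 14, 1)),
    "CIFAR-10": ((32, 32, 3), (8, 8, 3)),
    "ImageNet": ((299, 299, 3), (32, 32, 3)),
}
for name, (image_shape, latent_shape) in shapes.items():
    print(f"{name}: {100 * reduction_ratio(image_shape, latent_shape):.2f}%")

# One replication step maps a 14 x 14 x 16 feature map to 28 x 28 x 16.
print(replicate_upsample(np.zeros((14, 14, 16))).shape)
```

Under these assumed shapes the printed percentages agree with the reduction ratios listed in Table S2, which suggests that each MaxPool halves the spatial size and that the ImageNet input is first rescaled before pooling.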
D More Adversarial Examples of Attacking Inception-v3 in the Black-box Setting
Figure S1 shows other adversarial examples of the Inception-v3 model in the black-box targeted attack setting.

MNIST8M dataset: http://leon.bottou.org/projects/infimnist

E Performance Evaluation of Black-box Untargeted Attacks
Table S3 shows the attacking performance of black-box untargeted attacks on MNIST, CIFAR-10 and ImageNet using ZOO and AutoZOOM-BiLIN attacks on the same set of images in Section 4.5. The Loss function is defined as

$$ \mathrm{Loss} = \max\Big\{ \log[F(x)]_{t_0} - \max_{j \neq t_0} \log[F(x)]_j,\; 0 \Big\}, \tag{S16} $$

where $t_0$ is the top-1 prediction label of the natural image $x_0$. We set $\lambda_{\mathrm{ini}} = 10$ and use $q = 5$ on MNIST and CIFAR-10 and $q = 4$ on ImageNet for distortion fine-tuning in the post-attack phase. Compared with Table 3, the number of model queries can be further reduced, since untargeted attacks only require the adversarial images to be classified as any class other than $t_0$, rather than as a specific class $t \neq t_0$.

Figure S1: Adversarial examples on ImageNet crafted by AutoZOOM when attacking the Inception-v3 model in the black-box setting with a target class selected at random. Left: original natural images. Right: adversarial examples. (a) “French bulldog” to “traffic light” (b) “purse” to “bagel” (c) “bagel” to “grand piano” (d) “traffic light” to “iPod”

Table S3: Performance evaluation of black-box untargeted attacks on different datasets. The per-pixel $L_2$ distortion thresholds are 0.004, 0.0015 and [...] for MNIST, CIFAR-10 and ImageNet, respectively.
Dataset | Method | Attack success rate (ASR) | Mean query count (initial success) | Mean query count reduction ratio (initial success) | Mean per-pixel $L_2$ distortion (initial success) | True positive rate (TPR) | Mean query count with per-pixel $L_2$ distortion ≤ threshold
MNIST | ZOO | 100.00% | 7856.64 | 0.00% | 3.79 × [...] | [...] | [...]