AutoZOOM: Autoencoder-based Zeroth Order Optimization Method for Attacking Black-box Neural Networks

Abstract

Recent studies have shown that adversarial examples in state-of-the-art image classifiers trained by deep neural networks (DNN) can be easily generated when the target model is transparent to an attacker, known as the white-box setting. However, when attacking a deployed machine learning service, one can only acquire the input-output correspondences of the target model; this is the so-called black-box attack setting. The major drawback of existing black-box attacks is the need for excessive model queries, which may give a false sense of model robustness due to inefficient query designs. To bridge this gap, we propose a generic framework for query-efficient black-box attacks. Our framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order Optimization Method, has two novel building blocks towards efficient black-box attacks: (i) an adaptive random gradient estimation strategy to balance query counts and distortion, and (ii) an autoencoder that is either trained offline with unlabeled data or a bilinear resizing operation for attack acceleration. Experimental results suggest that, by applying AutoZOOM to a state-of-the-art black-box attack (ZOO), a significant reduction in model queries can be achieved without sacrificing the attack success rate and the visual quality of the resulting adversarial examples. In particular, when compared to the standard ZOO method, AutoZOOM can consistently reduce the mean query counts in finding successful adversarial examples (or reaching the same distortion level) by at least 93% on MNIST, CIFAR-10 and ImageNet datasets, leading to novel insights on adversarial robustness.



Chun-Chen Tu∗, Paishun Ting∗, Pin-Yu Chen∗, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, Shin-Ming Cheng
University of Michigan, Ann Arbor, USA; MIT-IBM Watson AI Lab, IBM Research; University of California, Los Angeles, USA; JD AI Research, Beijing, China; National Taiwan University of Science and Technology, Taiwan

In recent years, "machine learning as a service" has offered the world effortless access to powerful machine learning tools for a wide variety of tasks. For example, commercially available services such as Google Cloud Vision API and Clarifai.com provide well-trained image classifiers to the public. One is able to upload images and obtain their class prediction results at a low price. However, the existing and emerging machine learning platforms and their low model-access costs raise ever-increasing security concerns, as they also offer an ideal environment for testing malicious attempts. Even worse, the risks can be amplified when these services are used to build derived products, such that the inherent security vulnerability could be leveraged by attackers.
∗ Equal contribution.

Figure 1: AutoZOOM significantly reduces the number of queries required to generate a successful adversarial Bagel image from the black-box Inception-v3 model.

In many computer vision tasks, DNN models achieve state-of-the-art prediction accuracy and hence are widely deployed in modern machine learning services. Nonetheless, recent studies have highlighted DNNs' vulnerability to adversarial perturbations. In the white-box setting, in which the target model is entirely transparent to an attacker, visually imperceptible adversarial images can be easily crafted to fool a target DNN model into misclassification by leveraging the input gradient information (Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015). However, in the black-box setting, in which the parameters of the deployed model are hidden and one can only observe the input-output correspondences of a queried example, crafting adversarial examples requires a gradient-free (zeroth order) optimization approach to gather the necessary attack information. Figure 1 displays a prediction-evasive adversarial example crafted via iterative model queries from a black-box DNN (the Inception-v3 model (Szegedy et al. 2016)) trained on ImageNet.

Although they achieve remarkable attack effectiveness through gradient estimation, current black-box attack methods, such as (Chen et al. 2017; Nitin Bhagoji et al. 2018), are not query-efficient: they rely on coordinate-wise gradient estimation and value updates, which inevitably incur an excessive number of model queries and may give a false sense of model robustness due to inefficient query designs. In this paper, we propose to tackle the preceding problem by using AutoZOOM, an Autoencoder-based Zeroth Order Optimization Method. AutoZOOM has two novel building blocks: (i) a new and adaptive random gradient estimation strategy to balance the query counts and distortion when crafting adversarial examples, and (ii) an autoencoder that is either trained offline on other unlabeled data, or based on a simple bilinear resizing operation, in order to accelerate black-box attacks. As illustrated in Figure 2, AutoZOOM utilizes a "decoder" to craft a high-dimensional adversarial perturbation from the (learned) low-dimensional latent-space representation, and its query efficiency can be well explained by the dimension-dependent convergence rate in gradient-free optimization.

Figure 2: Illustration of attack dimension reduction through a "decoder" in AutoZOOM for improving query efficiency in black-box attacks. The decoder has two modes: (i) an autoencoder (AE) trained on unlabeled natural images that are different from the attacked images and training data; (ii) a simple bilinear image resizer (BiLIN) that is applied channel-wise to extrapolate low-dimensional features to the original image dimension (width × height). In the latter mode, no additional training is required.

Contributions.

We summarize our main contributions and new insights on adversarial robustness as follows:

1. We propose AutoZOOM, a novel query-efficient black-box attack framework for generating adversarial examples. AutoZOOM features an adaptive random gradient estimation strategy and dimension reduction techniques (either an offline trained autoencoder or a bilinear resizer) to reduce attack query counts while maintaining attack effectiveness and visual similarity. To the best of our knowledge, AutoZOOM is the first black-box attack using random full gradient estimation and data-driven acceleration.
2. We use the convergence rate of zeroth-order optimization to motivate the query efficiency of AutoZOOM and provide an error analysis of the new gradient estimator in AutoZOOM relative to the true gradient, characterizing the trade-offs between estimation error and query counts.
3. When applied to a state-of-the-art black-box attack proposed in (Chen et al. 2017), AutoZOOM attains a similar attack success rate while achieving a significant reduction (at least 93%) in the mean query counts required to attack the DNN image classifiers for MNIST, CIFAR-10 and ImageNet. It can also fine-tune the distortion in the post-success stage by performing finer gradient estimation.
4. In the experiments, we also find that AutoZOOM with a simple bilinear resizer as the decoder (AutoZOOM-BiLIN) can attain noticeable query efficiency, although it is still worse than AutoZOOM with an offline trained autoencoder (AutoZOOM-AE). However, AutoZOOM-BiLIN is easier to mount, as no additional training is required. The results also suggest an interesting finding: while learning effective low-dimensional representations of legitimate images is still a challenging task, black-box attacks using significantly fewer degrees of freedom (i.e., reduced dimensions) are certainly plausible.

Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. As a first attempt towards black-box attacks, the authors in (Papernot et al. 2017) proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and implement transfer attacks to the target model (Papernot, McDaniel, and Goodfellow 2016; Liu et al. 2017). However, the attack performance of this approach can be severely degraded due to poor attack transferability (Su et al. 2018). Although ZOO achieves a similar attack success rate and comparable visual quality to many white-box attack methods (Chen et al. 2017), its coordinate-wise gradient estimation requires excessive target model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in (Nitin Bhagoji et al. 2018).

Beyond optimization-based approaches, the authors in (Ilyas et al. 2018) proposed to use a natural evolution strategy (NES) to enhance query efficiency. Although there is a vector-wise gradient estimation step in the NES attack, we treat it as a parallel work since its natural evolutionary step is out of the scope of black-box attacks using zeroth-order gradient descent. We also note that, different from NES, our AutoZOOM framework uses a theory-driven query-efficient random-vector based gradient estimation strategy. In addition, AutoZOOM could be applied to further improve the query efficiency of NES, since NES does not take into account the factor of attack dimension reduction, which is the novelty in AutoZOOM as well as the main focus of this paper.

Under a more restricted attack setting, where only the decision (top-1 prediction class) is known to an attacker, the authors in (Brendel, Rauber, and Bethge 2018) proposed a random-walk based attack around the decision boundary. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries. Due to space limitations, we provide more background and a table comparing existing black-box attacks in the supplementary material.

Throughout this paper, we focus on improving the query efficiency of gradient-estimation and gradient-descent based black-box attacks empowered by AutoZOOM, and we consider the threat model in which the class prediction scores are known to an attacker. In this setting, it suffices to denote the target DNN as a classification function $F: [0,1]^d \mapsto \mathbb{R}^K$ that takes a $d$-dimensional scaled image as its input and yields a vector of prediction scores of all $K$ image classes, such as the prediction probabilities for each class. We further consider the case of applying an entry-wise monotonic transformation $M(F)$ to the output of $F$ for black-box attacks, since a monotonic transformation preserves the ranking of the class predictions and can alleviate the problem of large score variation in $F$ (e.g., probability to log probability).

Here we formulate black-box targeted attacks. The formulation can be easily adapted to untargeted attacks. Let $(\mathbf{x}_0, t_0)$ denote a natural image $\mathbf{x}_0$ and its ground-truth class label $t_0$, and let $(\mathbf{x}, t)$ denote the adversarial example of $\mathbf{x}_0$ and the target attack class label $t \neq t_0$. The problem of finding an adversarial example can be formulated as an optimization problem taking the generic form of

$$\min_{\mathbf{x} \in [0,1]^d} \ \mathrm{Dist}(\mathbf{x}, \mathbf{x}_0) + \lambda \cdot \mathrm{Loss}(\mathbf{x}, M(F(\mathbf{x})), t), \tag{1}$$

where $\mathrm{Dist}(\mathbf{x}, \mathbf{x}_0)$ measures the distortion between $\mathbf{x}$ and $\mathbf{x}_0$, $\mathrm{Loss}(\cdot)$ is an attack objective reflecting the likelihood of predicting $t = \arg\max_{k \in \{1,\ldots,K\}} [M(F(\mathbf{x}))]_k$, $\lambda$ is a regularization coefficient, and the constraint $\mathbf{x} \in [0,1]^d$ confines the adversarial image $\mathbf{x}$ to the valid image space. The distortion $\mathrm{Dist}(\mathbf{x}, \mathbf{x}_0)$ is often evaluated by the $L_p$ norm, defined as $\mathrm{Dist}(\mathbf{x}, \mathbf{x}_0) = \|\mathbf{x} - \mathbf{x}_0\|_p = \|\boldsymbol{\delta}\|_p = \left(\sum_{i=1}^{d} |\delta_i|^p\right)^{1/p}$ for $p \geq 1$, where $\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}_0$ is the adversarial perturbation to $\mathbf{x}_0$. The attack objective $\mathrm{Loss}(\cdot)$ can be the training loss of DNNs (Goodfellow, Shlens, and Szegedy 2015) or some designed loss based on model predictions (Carlini and Wagner 2017b).

In the white-box setting, an adversarial example is generated by using downstream optimizers such as ADAM (Kingma and Ba 2015) to solve (1); this requires the gradient $\nabla f(\mathbf{x})$ of the objective function $f(\mathbf{x}) = \mathrm{Dist}(\mathbf{x}, \mathbf{x}_0) + \lambda \cdot \mathrm{Loss}(\mathbf{x}, M(F(\mathbf{x})), t)$ relative to the input of $F$ via back-propagation in DNNs. However, in the black-box setting, acquiring $\nabla f(\cdot)$ is infeasible, and one can only obtain the function evaluation $F(\cdot)$, which renders solving (1) a zeroth order optimization problem. Recently, zeroth order optimization approaches (Ghadimi and Lan 2013; Nesterov and Spokoiny 2017; Liu et al. 2018) circumvent the preceding challenge by approximating the true gradient via function evaluations. Specifically, in black-box attacks, the gradient estimate is applied to both gradient computation and descent in the optimization process for solving (1).

As a first attempt to enable gradient-free black-box attacks on DNNs, the authors in (Chen et al. 2017) use the symmetric difference quotient method (Lax and Terrell 2014) to evaluate the gradient $\frac{\partial f(\mathbf{x})}{\partial x_i}$ of the $i$-th component by

$$g_i = \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x} - h\mathbf{e}_i)}{2h} \approx \frac{\partial f(\mathbf{x})}{\partial x_i} \tag{2}$$

using a small $h$. Here $\mathbf{e}_i$ denotes the $i$-th elementary basis. Although it contributes to powerful black-box attacks and is applicable to large networks such as those trained on ImageNet, the coordinate-wise gradient estimation step in (2) necessarily incurs an enormous number of model queries and is hence not query-efficient. For example, the ImageNet dataset has $d = 299 \times 299 \times 3 \approx 270{,}000$ input dimensions, rendering coordinate-wise zeroth order optimization based on gradient estimation query-inefficient.

To improve query efficiency, we dispense with coordinate-wise estimation and instead propose a scaled random full gradient estimator of $\nabla f(\mathbf{x})$, defined as

$$\mathbf{g} = b \cdot \frac{f(\mathbf{x} + \beta\mathbf{u}) - f(\mathbf{x})}{\beta} \cdot \mathbf{u}, \tag{3}$$

where $\beta > 0$ is a smoothing parameter, $\mathbf{u}$ is a unit-length vector that is uniformly drawn at random from a unit Euclidean sphere, and $b$ is a tunable scaling parameter that balances the bias and variance trade-off of the gradient estimation error. Note that with $b = 1$, the gradient estimator in (3) becomes the one used in (Duchi et al. 2015). With $b = d$, this estimator becomes the one adopted in (Gao, Jiang, and Zhang 2014). We will provide an optimal value $b^*$ for balancing query efficiency and estimation error in the following analysis.

Averaged random gradient estimation.
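To make the query costs concrete, the following minimal NumPy sketch places the coordinate-wise estimate in (2) next to the random full-gradient estimate in (3), already averaged over q directions as in (4). It assumes a scalar black-box objective `f` acting on a flat vector. The function names, the `CountingObjective` helper, the toy quadratic objective, and the default values `beta = 1/d` and `b = q` in the signature are illustrative choices and are not taken from the released AutoZOOM code.

```python
import numpy as np


def coordinate_wise_gradient(f, x, h=1e-4):
    """Symmetric difference quotient of (2): two queries per coordinate."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        g[i] = (f(x + h * e_i) - f(x - h * e_i)) / (2.0 * h)
    return g


def random_gradient(f, x, q=1, beta=None, b=None, rng=None):
    """Scaled random estimate of (3), averaged over q directions as in (4)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    beta = 1.0 / d if beta is None else beta  # smoothing parameter (assumed default)
    b = float(q) if b is None else b          # scaling parameter (assumed default)
    f0 = f(x)                                 # one query shared by all directions
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                # uniform direction on the unit sphere
        g += b * (f(x + beta * u) - f0) / beta * u
    return g / q


class CountingObjective:
    """Wraps a function and counts how many times it is queried."""

    def __init__(self, fn):
        self.fn, self.calls = fn, 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)


# Toy objective (hypothetical): f(x) = 0.5 * ||x||^2, whose true gradient is x.
x = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
f = CountingObjective(lambda v: 0.5 * float(v @ v))

g_coord = coordinate_wise_gradient(f, x)
coord_queries = f.calls                       # 2 * d queries

f.calls = 0
g_rand = random_gradient(f, x, q=4, rng=np.random.default_rng(0))
rand_queries = f.calls                        # q + 1 queries
```

In this five-dimensional toy case the counter records 2d = 10 queries for the coordinate-wise estimate and q + 1 = 5 for the random estimate; the gap widens linearly with d, which is the point of replacing (2) by (3).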

To effectively control the error in gradient estimation, we consider a more general gradient estimator, in which the gradient estimate is averaged over $q$ random directions $\{\mathbf{u}_j\}_{j=1}^{q}$. That is,

$$\bar{\mathbf{g}} = \frac{1}{q} \sum_{j=1}^{q} \mathbf{g}_j, \tag{4}$$

where $\mathbf{g}_j$ is a gradient estimate defined in (3) with $\mathbf{u} = \mathbf{u}_j$. The use of multiple random directions can reduce the variance of $\bar{\mathbf{g}}$ in (4) for convex loss functions (Duchi et al. 2015; Liu et al. 2018).

Below we establish an error analysis of the averaged random gradient estimator in (4) for studying the influence of the parameters $b$ and $q$ on estimation error and query efficiency.

Theorem 1. Assume $f: \mathbb{R}^d \mapsto \mathbb{R}$ is differentiable and its gradient $\nabla f(\cdot)$ is $L$-Lipschitz (see the footnote below). Then the mean squared estimation error of $\bar{\mathbf{g}}$ in (4) is upper bounded by

$$\mathbb{E}\,\|\bar{\mathbf{g}} - \nabla f(\mathbf{x})\|_2^2 \leq 4\left(\frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2}\right)\|\nabla f(\mathbf{x})\|_2^2 + \frac{2q+1}{q}\, b^2 \beta^2 L^2. \tag{5}$$

Footnote: A function $W(\cdot)$ is $L$-Lipschitz if $\|W(\mathbf{w}_1) - W(\mathbf{w}_2)\|_2 \leq L\,\|\mathbf{w}_1 - \mathbf{w}_2\|_2$ for any $\mathbf{w}_1, \mathbf{w}_2$. For DNNs with ReLU activations, $L$ can be derived from the model weights (Szegedy et al. 2014).

Proof. The proof is given in the supplementary file.

Here we highlight the important implications of Theorem 1: (i) the error analysis holds when $f$ is non-convex; (ii) in DNNs, the true gradient $\nabla f$ can be viewed as the numerical gradient obtained via back-propagation; (iii) for any fixed $b$, selecting a small $\beta$ (e.g., we set $\beta = 1/d$ in AutoZOOM) can effectively reduce the last error term in (5), and we therefore focus on optimizing the first error term; (iv) the first error term in (5) exhibits the influence of $b$ and $q$ on the estimation error, and is independent of $\beta$.

We further elaborate on (iv) as follows. Fix $q$ and let $\eta(b) = \frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2}$ be the coefficient of the first error term in (5). Setting $\eta'(b) = 0$ shows that the optimal $b$ that minimizes $\eta(b)$ is $b^* = \frac{dq}{2q+d}$. For query efficiency, one would like to keep $q$ small, which then implies $b^* \approx q$ and $\eta(b^*) \approx 1$ when the dimension $d$ is large. On the other hand, when $q \to \infty$, $b^* \approx d/2$ and $\eta(b^*) \approx 1/2$, which yields a smaller error upper bound but is query-inefficient. We also note that by setting $b = q$, the coefficient $\eta(b) = \frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2} \approx 1$ for $q \ll d$ and is thus independent of the dimension $d$ and the parameter $q$.

Adaptive random gradient estimation.
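The choice b = q rests on the behavior of the coefficient η(b) = b²/d² + b²/(dq) + (b − d)²/d² from Theorem 1, which can be checked numerically. The short sketch below evaluates η(b) and the minimizer b* = dq/(2q + d) for a few values of q. The function names and the dimension d = 299 × 299 × 3 (an Inception-v3-sized input) are illustrative assumptions.

```python
def eta(b, d, q):
    """Coefficient of the first error term in (5)."""
    return b**2 / d**2 + b**2 / (d * q) + (b - d) ** 2 / d**2


def b_star(d, q):
    """Minimizer of eta over b for fixed q, obtained from d(eta)/db = 0."""
    return d * q / (2.0 * q + d)


d = 299 * 299 * 3  # illustrative dimension

for q in (1, 4, 16):
    bs = b_star(d, q)
    # For q much smaller than d, b* is close to q and eta is close to 1.
    print(f"q={q:>2}  b*={bs:.6f}  eta(b*)={eta(bs, d, q):.6f}  eta(q)={eta(q, d, q):.6f}")

# Large-q limit: b*/d approaches 1/2 and eta(b*) approaches 1/2.
q_large = 10**12
print(b_star(d, q_large) / d, eta(b_star(d, q_large), d, q_large))
```

For small q the two columns `eta(b*)` and `eta(q)` agree to several decimal places, which is why using b = q in place of the exact minimizer costs essentially nothing when d is large.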

Based on Theorem 1 and our error analysis, in AutoZOOM we set $b = q$ in (3) and propose to use an adaptive strategy for selecting $q$. AutoZOOM uses $q = 1$ (i.e., the fewest possible model evaluations) to first obtain rough gradient estimates for solving (1) until a successful adversarial image is found. After the initial attack success, it switches to more accurate gradient estimates with $q > 1$ to fine-tune the image quality. The trade-off between $q$ (which is proportional to query counts) and distortion reduction will be investigated in Section 4.

Dimension-dependent convergence rate using gradient estimation.

Different from the first order convergence results, the convergence rate of zeroth order gradient descent methods has an additional multiplicative dimension-dependent factor $d$. In the convex loss setting the rate is $O(\sqrt{d/T})$, where $T$ is the number of iterations (Nesterov and Spokoiny 2017; Liu et al. 2018; Gao, Jiang, and Zhang 2014; Wang et al. 2018). The same convergence rate has also been found in the nonconvex setting (Ghadimi and Lan 2013). The dimension-dependent convergence factor $d$ suggests that vanilla black-box attacks using gradient estimations can be query inefficient when the (vectorized) image dimension $d$ is large, due to the curse of dimensionality in convergence. This also motivates us to propose using an autoencoder to reduce the attack dimension and improve query efficiency in black-box attacks.

In AutoZOOM, we propose to perform random gradient estimation from a reduced dimension $d' < d$ to improve query efficiency. Specifically, as illustrated in Figure 2, the additive perturbation to an image $\mathbf{x}_0$ is actually implemented through a "decoder" $D: \mathbb{R}^{d'} \mapsto \mathbb{R}^{d}$ such that $\mathbf{x} = \mathbf{x}_0 + D(\boldsymbol{\delta}')$, where $\boldsymbol{\delta}' \in \mathbb{R}^{d'}$. In other words, the adversarial perturbation $\boldsymbol{\delta} \in \mathbb{R}^{d}$ to $\mathbf{x}_0$ is in fact generated from a dimension-reduced space, with the aim of improving query efficiency due to the reduced dimension-dependent factor in the convergence analysis. AutoZOOM provides two modes for such a decoder $D$:

• An autoencoder (AE) trained on unlabeled data that are different from the training data to learn reconstruction from a dimension-reduced representation. The encoder $E(\cdot)$ in an AE compresses the data to a low-dimensional latent space and the decoder $D(\cdot)$ reconstructs an example from its latent representation. The weights of an AE are learned to minimize the average $L_2$ reconstruction error. Note that training such an AE for black-box adversarial attacks is one-time and is entirely offline (i.e., no model queries needed).
• A simple channel-wise bilinear image resizer (BiLIN) that scales a small image to a large image via bilinear extrapolation. Note that no additional training is required for BiLIN.

Algorithm 1: AutoZOOM for black-box attacks on DNNs
Input: Black-box DNN model $F$, original example $\mathbf{x}_0$, distortion measure $\mathrm{Dist}(\cdot)$, attack objective $\mathrm{Loss}(\cdot)$, monotonic transformation $M(\cdot)$, decoder $D(\cdot) \in \{\mathrm{AE}, \mathrm{BiLIN}\}$, initial coefficient $\lambda_{\mathrm{ini}}$, query budget $Q$
while query count $\leq Q$ do
    1. Exploration: use $\mathbf{x} = \mathbf{x}_0 + D(\boldsymbol{\delta}')$ and apply the random gradient estimator in (4) with $q = 1$ to the downstream optimizer (e.g., ADAM) for solving (1) until an initial successful attack is found.
    2. Exploitation (post-success stage): continue to fine-tune the adversarial perturbation $D(\boldsymbol{\delta}')$ for solving (1) while setting $q \geq 1$ in (4).
end while
Output: Least distorted successful adversarial example

Why AE?

Our proposal of AE is motivated by the insightful findings in (Goodfellow, Shlens, and Szegedy 2015) that a successful adversarial perturbation is highly relevant to some human-imperceptible noise pattern resembling the shape of the target class, known as the "shadow". Since a decoder in an AE learns to reconstruct data from latent representations, it can also provide distributional guidance for mapping adversarial perturbations to generate these shadows.

We also note that for any reduced dimension $d'$, the setting $b^* = q$ is (approximately) optimal in terms of minimizing the corresponding estimation error bound from Theorem 1, despite the fact that the gradient estimation errors of different reduced dimensions cannot be directly compared. In Section 4 we will report the superior query efficiency in black-box attacks achieved with the use of AE or BiLIN as the decoder, and discuss the benefit of attack dimension reduction.

Note on BiLIN: see tf.image.resize_images, a TensorFlow example.

Algorithm 1 summarizes the AutoZOOM framework towards query-efficient black-box attacks on DNNs. We also note that AutoZOOM is a general acceleration tool that is compatible with any gradient-estimation based black-box adversarial attack obeying the attack formulation in (1). It also has some theoretical estimation error guarantees and query-efficient parameter selection based on Theorem 1. The details on adjusting the regularization coefficient $\lambda$ and the query parameter $q$ based on run-time model evaluation results will be discussed in Section 4. Our source code is publicly available.

This section presents the experiments for assessing the performance of AutoZOOM in accelerating black-box attacks on DNNs in terms of the number of queries required for an initial attack success and for a specific distortion level.
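Before turning to the experimental setup, the two-stage loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the released implementation: it uses plain gradient descent instead of ADAM, keeps λ fixed, works on a single-channel image, and replaces the decoder by a hand-written bilinear upsampler standing in for BiLIN. The names `bilinear_upsample`, `autozoom_sketch`, and `toy_scores`, as well as all default parameter values, are hypothetical.

```python
import numpy as np


def bilinear_upsample(z, out_h, out_w):
    """Resize a 2-D array to (out_h, out_w) by bilinear interpolation (BiLIN stand-in)."""
    h, w = z.shape
    rows = np.linspace(0.0, h - 1.0, out_h)
    cols = np.linspace(0.0, w - 1.0, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = (rows - r0)[:, None]
    fc = (cols - c0)[None, :]
    top = z[r0][:, c0] * (1.0 - fc) + z[r0][:, c1] * fc
    bottom = z[r1][:, c0] * (1.0 - fc) + z[r1][:, c1] * fc
    return top * (1.0 - fr) + bottom * fr


def autozoom_sketch(score_fn, x0, target, low_shape, lam=10.0, q_post=4,
                    budget=2000, lr=1e-3, seed=0):
    """Two-stage loop of Algorithm 1 (gradient descent instead of ADAM, fixed lambda)."""
    rng = np.random.default_rng(seed)
    d_low = low_shape[0] * low_shape[1]
    beta = 1.0 / d_low                                 # smoothing parameter in (3)
    queries = 0

    def evaluate(delta_low):
        nonlocal queries
        queries += 1                                   # one model query per call
        perturbation = bilinear_upsample(delta_low.reshape(low_shape), *x0.shape)
        x = np.clip(x0 + perturbation, 0.0, 1.0)       # stay in the valid image space
        logp = np.log(score_fn(x) + 1e-12)             # M(.) = log(.)
        hinge = max(float(np.delete(logp, target).max() - logp[target]), 0.0)
        dist = float(np.sum((x - x0) ** 2))            # squared L2 distortion
        return dist + lam * hinge, hinge == 0.0, dist, x

    delta = np.zeros(d_low)
    best = None                                        # (distortion, image, queries used)
    while queries < budget:
        q = 1 if best is None else q_post              # exploration, then exploitation
        f0, success, dist, x = evaluate(delta)
        if success and (best is None or dist < best[0]):
            best = (dist, x, queries)
        grad = np.zeros(d_low)
        for _ in range(q):
            u = rng.standard_normal(d_low)
            u /= np.linalg.norm(u)
            grad += q * (evaluate(delta + beta * u)[0] - f0) / beta * u  # (3) with b = q
        delta -= lr * grad / q                         # average over q directions, (4)
    return best


# Hypothetical black box: softmax over a fixed random linear map of an 8x8 image.
W = np.random.default_rng(1).standard_normal((3, 64))


def toy_scores(x):
    z = W @ x.ravel()
    e = np.exp(z - z.max())
    return e / e.sum()


x0 = np.full((8, 8), 0.5)
result = autozoom_sketch(toy_scores, x0, target=2, low_shape=(4, 4))
```

The function returns `None` when no successful example is found within the budget, and otherwise the least distorted successful image seen so far together with its distortion and the query count at which it was recorded. Each iteration spends q + 1 queries, so the post-success stage trades a higher per-iteration cost for lower-variance gradient estimates.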

As described in Section 3, AutoZOOM is a query-efficient gradient-free optimization framework for solving the black-box attack formulation in (1). In the following experiments, we demonstrate the utility of AutoZOOM by using the same attack formulation proposed in ZOO (Chen et al. 2017), which uses the squared $L_2$ norm as the distortion measure $\mathrm{Dist}(\cdot)$ and adopts the attack objective

$$\mathrm{Loss} = \max\left\{ \max_{j \neq t} \log[F(\mathbf{x})]_j - \log[F(\mathbf{x})]_t,\ 0 \right\}, \tag{6}$$

where this hinge function is designed for targeted black-box attacks on the DNN model $F$, and the monotonic transformation $M(\cdot) = \log(\cdot)$ is applied to the model output.

We compare AutoZOOM-AE ($D$ = AE) and AutoZOOM-BiLIN ($D$ = BiLIN) with two different baselines: (i) the standard ZOO implementation with bilinear scaling (same as BiLIN) for dimension reduction; (ii) ZOO+AE, which is ZOO with AE. Note that all attacks generate adversarial perturbations based on the same reduced attack dimension.
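The attack objective in (6) is simple enough to state directly in code. The sketch below assumes the black-box model returns a vector of class probabilities; the function name and the small constant added before the logarithm (to avoid log of zero) are illustrative additions.

```python
import numpy as np


def targeted_hinge_loss(probs, target):
    """Loss in (6): max{ max_{j != t} log F_j - log F_t, 0 } for a probability vector."""
    logp = np.log(np.asarray(probs, dtype=float) + 1e-30)  # M(.) = log(.)
    others = np.delete(logp, target)                       # all classes except the target
    return max(float(others.max() - logp[target]), 0.0)


probs = [0.7, 0.2, 0.1]
loss_when_target_wins = targeted_hinge_loss(probs, target=0)   # target already top-1
loss_when_target_loses = targeted_hinge_loss(probs, target=1)  # log(0.7) - log(0.2)
```

The loss is zero exactly when the target class has the highest score, so a zero value can double as the success test for a targeted attack, and working with log probabilities keeps the gap between classes on a comparable scale.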

We assess the performance of different attack methods on several representative benchmark datasets, including MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky 2009) and ImageNet (Russakovsky et al. 2015). For MNIST and CIFAR-10, we use the same DNN image classification models as in (Carlini and Wagner 2017b). For ImageNet, we use the Inception-v3 model (Szegedy et al. 2016). All experiments were conducted using the TensorFlow machine-learning library (Abadi et al.) on machines equipped with an Intel Xeon E5-2690v3 CPU and an Nvidia Tesla K80 GPU.

Code referenced in the footnotes: AutoZOOM, https://github.com/IBM/Autozoom-Attack; ZOO, https://github.com/huanzhang12/ZOO-Attack; MNIST and CIFAR-10 models, https://github.com/carlini/nn_robust_attacks.

All attacks used ADAM (Kingma and Ba 2015) for solving (1) with their estimated gradients and the same initial learning rate [value missing in the source text]. On MNIST and CIFAR-10, all methods adopt 1,000 ADAM iterations. On ImageNet, ZOO and ZOO+AE adopt 20,000 iterations, whereas AutoZOOM-BiLIN and AutoZOOM-AE adopt 100,000 iterations. Note that due to different gradient estimation methods, the query counts (i.e., the number of model evaluations) per iteration of a black-box attack may vary. ZOO and ZOO+AE use the parallel gradient update of (2) with a batch of 128 pixels (two queries per pixel), yielding 256 query counts per iteration. AutoZOOM-BiLIN and AutoZOOM-AE use the averaged random full gradient estimator in (4), resulting in $q + 1$ query counts per iteration. For a fair comparison, the query counts are used for performance assessment.

Query reduction ratio.

We use the mean query counts of ZOO with the smallest $\lambda_{\mathrm{ini}}$ as the baseline for computing the query reduction ratio of other methods and configurations.

TPR and initial success.

We report the true positive rate (TPR), which measures the percentage of successful attacks fulfilling a pre-defined constraint $\ell$ on the normalized (per-pixel) $L_2$ distortion, as well as their query counts of first successes. We also report the per-pixel $L_2$ distortions of initial successes, where an initial success refers to the first query count that finds a successful adversarial example.

Post-success fine-tuning.

When implementing AutoZOOM in Algorithm 1, on MNIST and CIFAR-10 we find that AutoZOOM without fine-tuning (i.e., $q = 1$) already yields similar distortion to ZOO. We note that ZOO can be viewed as coordinate-wise fine-tuning and is thus query-inefficient. On ImageNet, we will investigate the effect of post-success fine-tuning on reducing distortion.

Autoencoder Training.

In AutoZOOM-AE, we use convolutional autoencoders for attack dimension reduction, which are trained on unlabeled datasets that are different from the training dataset and the attacked natural examples. The implementation details are given in the supplementary material.
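The evaluation quantities described above reduce to a few lines of arithmetic, sketched below. The function names and the example numbers are hypothetical. Two points are assumptions rather than statements from the paper: the ZOO batch size of 128 is inferred from the stated 256 queries per iteration at two queries per coordinate, and the TPR is computed here over all attempted attacks.

```python
def queries_per_iteration(method, q=1, batch=128):
    """Model evaluations spent in one optimizer iteration."""
    if method == "zoo":
        return 2 * batch   # symmetric difference (2): two queries per coordinate in the batch
    if method == "autozoom":
        return q + 1       # averaged estimator (4): q perturbed queries plus f(x) itself
    raise ValueError(f"unknown method: {method}")


def query_reduction_ratio(mean_queries, baseline_mean_queries):
    """Fraction of the baseline's mean query count that a method saves."""
    return 1.0 - mean_queries / baseline_mean_queries


def true_positive_rate(records, threshold):
    """Share of attacks that succeed with per-pixel L2 distortion <= threshold.

    records: iterable of (succeeded, per_pixel_l2) pairs, one per attempted attack
    (assumed denominator: all attempted attacks).
    """
    records = list(records)
    hits = sum(1 for succeeded, dist in records if succeeded and dist <= threshold)
    return hits / len(records)


# Hypothetical numbers, for illustration only.
ratio = query_reduction_ratio(mean_queries=70.0, baseline_mean_queries=1000.0)
tpr = true_positive_rate(
    [(True, 0.001), (True, 0.010), (False, 0.0005), (True, 0.002)], threshold=0.004
)
```

Because the per-iteration query cost differs between methods (256 for ZOO versus q + 1 for AutoZOOM), iteration counts are not comparable across methods, which is why the comparison is made on query counts.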

Dynamic Switching on λ.

To adjust the regularization coefficient $\lambda$ in (1), in all methods we set its initial value $\lambda_{\mathrm{ini}} \in \{0.1, 1, 10\}$ on MNIST and CIFAR-10, and set $\lambda_{\mathrm{ini}} = 10$ on ImageNet. Furthermore, to balance the distortion Dist and the attack objective Loss in (1), we use a dynamic switching strategy to update $\lambda$ during the optimization process. Every $S$ iterations, $\lambda$ is multiplied by 10 if the attack has never been successful; otherwise, it is divided by 2. On MNIST and CIFAR-10, we set $S = 100$. On ImageNet, we set $S = 1{,}000$. At the instance of initial success, we also reset $\lambda = \lambda_{\mathrm{ini}}$ and the ADAM parameters to the default values, as doing so can empirically reduce the distortion for all attack methods.

For both MNIST and CIFAR-10, we randomly select 50 correctly classified images from their test sets, and perform targeted attacks on these images. Since both datasets have 10 classes, each selected image is attacked 9 times, targeting all but its true class. For all attacks, the ratio of the reduced attack-space dimension to the original one (i.e., $d'/d$) is 25% for MNIST and 6.25% for CIFAR-10.

Table 1 shows the performance evaluation on MNIST with various values of $\lambda_{\mathrm{ini}}$, the initial value of the regularization coefficient $\lambda$ in (1). We use the performance of ZOO with $\lambda_{\mathrm{ini}} = 0.1$ as a baseline for comparison. For example, with $\lambda_{\mathrm{ini}} = 0.1$ and $10$, the mean query counts required by AutoZOOM-AE to attain an initial success are reduced by [values missing in the source text], respectively. One can also observe that allowing larger $\lambda_{\mathrm{ini}}$ generally leads to fewer mean query counts at the price of slightly increased distortion for the initial attack. The large difference in the required attack query counts between AutoZOOM and ZOO/ZOO+AE validates the effectiveness of our proposed random full gradient estimator in (3), which dispenses with the coordinate-wise gradient estimation in ZOO but still retains comparable true positive rates, thereby greatly improving query efficiency.

Table 1: Performance evaluation of black-box targeted attacks on MNIST.
Columns: Method | $\lambda_{\mathrm{ini}}$ | Attack success rate (ASR) | Mean query count (initial success) | Mean query count reduction ratio (initial success) | Mean per-pixel $L_2$ distortion (initial success) | True positive rate (TPR) | Mean query count with per-pixel $L_2$ distortion ≤ [threshold missing in the source text]
[Table entries were not preserved in the source text.]

Table 2: Performance evaluation of black-box targeted attacks on CIFAR-10.
Columns: Method | $\lambda_{\mathrm{ini}}$ | Attack success rate (ASR) | Mean query count (initial success) | Mean query count reduction ratio (initial success) | Mean per-pixel $L_2$ distortion (initial success) | True positive rate (TPR) | Mean query count with per-pixel $L_2$ distortion ≤ [threshold missing in the source text]
[Table entries were not preserved in the source text.]

For CIFAR-10, we report similar query efficiency improvements, as displayed in Table 2. In particular, comparing the two query-efficient black-box attack methods (AutoZOOM-BiLIN and AutoZOOM-AE), we find that AutoZOOM-AE is more query-efficient than AutoZOOM-BiLIN, but at the cost of an additional AE training step. AutoZOOM-AE achieves the highest attack success rates (ASRs) and mean query reduction ratios for different values of $\lambda_{\mathrm{ini}}$. In addition, their true positive rates (TPRs) are similar, but AutoZOOM-AE usually takes fewer query counts to reach the same $L_2$ distortion. We note that when $\lambda_{\mathrm{ini}} = 10$, AutoZOOM-AE has a higher TPR but also needs slightly more mean query counts than AutoZOOM-BiLIN to reach the same $L_2$ distortion. This suggests that there are some adversarial examples whose post-success distortions are difficult for a bilinear resizer to reduce but can be handled by an AE.

We selected 50 correctly classified images from the ImageNet test set to perform random targeted attacks and set $\lambda_{\mathrm{ini}} = 10$ and the attack dimension reduction ratio to 1.15%. The results are summarized in Table 3.
Note that, compared with ZOO, AutoZOOM-AE can significantly reduce the query count required to achieve an initial success by 99.39% (or 99.35% to reach the same $L_2$ distortion). This is a remarkable improvement: relative to ZOO's mean of 2.22M queries, it amounts to saving more than 2.2 million model queries, given the large input dimension of ImageNet ($d \approx 270{,}000$).

Post-success distortion refinement.

As described in Algorithm 1, adaptive random gradient estimation is integrated in AutoZOOM, offering a quick initial success in attack generation followed by a fine-tuning process to effectively reduce the distortion. This is achieved by adjusting the gradient estimate averaging parameter $q$ in (4) in the post-success stage. In general, averaging over more random directions (i.e., setting larger $q$) tends to better reduce the variance of gradient estimation error, but at the cost of increased model queries. Figure 3 (a) shows the mean distortion against query counts for various choices of $q$ in the post-success stage. The results suggest that setting some small $q$ but $q > 1$ can further decrease the distortion at the converged phase when compared with the case of $q = 1$. Moreover, the refinement effect on distortion empirically saturates at $q = 4$, implying a marginal gain beyond this value. These findings also demonstrate that our proposed AutoZOOM indeed strikes a balance between distortion and query efficiency in black-box attacks.

Table 3: Performance evaluation of black-box targeted attacks on ImageNet.
Columns: Method | Attack success rate (ASR) | Mean query count (initial success) | Mean query count reduction ratio (initial success) | Mean per-pixel $L_2$ distortion (initial success) | True positive rate (TPR) | Mean query count with per-pixel $L_2$ distortion ≤ [threshold missing in the source text]
ZOO | 76.00% | 2,226,405.04 (2.22M) | 0.00% | 4.25 × 10^[exponent missing in the source text] | [remaining entries missing]
[Rows for the other methods were not preserved in the source text.]

Figure 3: (a) Post-success distortion refinement. After initial success, AutoZOOM (here $D$ = AE) can further decrease the distortion by setting $q > 1$ in (4) to trade more query counts for smaller distortion in the converged stage, which saturates at $q = 4$. (b) Dimension reduction vs. query efficiency. Attack dimension reduction is crucial to query-efficient black-box attacks. When compared to black-box attacks on the original dimension, dimension reduction through AutoZOOM-AE reduces roughly 35-40% query counts on MNIST and CIFAR-10 and at least 95% on ImageNet.

In addition to the motivation from the $O(\sqrt{d/T})$ convergence rate in zeroth-order optimization (Sec. 3.3), as a sanity check, we corroborate the benefit of attack dimension reduction to query efficiency in black-box attacks by comparing AutoZOOM (here we use $D$ = AE) with its alternative operated on the original (non-reduced) dimension (i.e., $\boldsymbol{\delta}' = D(\boldsymbol{\delta}') = \boldsymbol{\delta}$). Tested on all three datasets and the aforementioned settings, Figure 3 (b) shows the corresponding mean query count to initial success and the mean query reduction ratio when $\lambda_{\mathrm{ini}} = 10$ in all three datasets.
When compared to the attack results of the original dimension, attack dimension reduction through AutoZOOM reduces roughly 35-40% query counts on MNIST and CIFAR-10 and at least 95% on ImageNet. This result highlights the importance of dimension reduction towards query-efficient black-box attacks. For example, without dimension reduction, the attack on the original ImageNet dimension cannot even be successful within the query budget ($Q$ = 200K queries).

• In addition to benchmarking on initial attack success, the query reduction ratio when reaching the same $L_2$ distortion can be directly computed from the last column in each table.
• The attack gain of AutoZOOM-AE over AutoZOOM-BiLIN could sometimes be marginal, and we note that there is room for improving AutoZOOM-AE by exploring different AE models. However, we advocate AutoZOOM-BiLIN as a practically ideal candidate for query-efficient black-box attacks when testing model robustness, since it is easy to mount and has no additional training cost.
• While learning effective low-dimensional representations of legitimate images is still a challenging task, black-box attacks using significantly fewer degrees of freedom (i.e., reduced dimensions), as demonstrated in this paper, are certainly plausible, leading to new implications on model robustness.
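The averaged random gradient estimate referred to as (3) and (4) can be illustrated with a short sketch. It is not the authors' implementation: the function names, the default smoothing value `beta`, and the choice `b = d` are assumptions made only for illustration. It also makes the query accounting concrete: one estimate costs $q + 1$ evaluations of $f$, whereas a coordinate-wise symmetric difference quotient needs two evaluations per coordinate, i.e. $2d$ for a full gradient.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def averaged_random_gradient(f, x, q=1, beta=1e-3, b=None, rng=None):
    """Average q one-sided random-direction estimates of the gradient of f at x.

    Each single estimate is b * (f(x + beta*u) - f(x)) / beta * u for a random
    unit vector u. The call costs q + 1 evaluations of f in total.
    """
    rng = rng or random.Random(0)
    d = len(x)
    # Assumption for this sketch: b defaults to d, so the single estimate is
    # unbiased for the gradient of the smoothed function.
    b = d if b is None else b
    fx = f(x)  # one query, shared by all q directions
    g_bar = [0.0] * d
    for _ in range(q):
        u = random_unit_vector(d, rng)
        x_shift = [xi + beta * ui for xi, ui in zip(x, u)]
        scale = b * (f(x_shift) - fx) / beta  # one query per direction
        for k in range(d):
            g_bar[k] += scale * u[k] / q
    return g_bar


if __name__ == "__main__":
    # Toy objective standing in for the black-box attack loss (hypothetical).
    quadratic = lambda z: sum(zi * zi for zi in z)
    print(averaged_random_gradient(quadratic, [1.0, -2.0, 0.5], q=4))
```

With `q = 1` the estimate is cheap but noisy, which suits the search for an initial success; raising `q` after the first success trades extra queries for lower variance, matching the refinement behaviour described above.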

AutoZOOM is a generic attack acceleration framework that is compatible with any gradient-estimation based black-box attack having the general formulation in (1). It adopts a new and adaptive random full gradient estimation strategy to strike a balance between query counts and estimation errors, and features a decoder (AE or BiLIN) for attack dimension reduction and algorithmic convergence acceleration. Compared to a state-of-the-art attack (ZOO), AutoZOOM consistently reduces the mean query counts when attacking black-box DNN image classifiers for MNIST, CIFAR-10 and ImageNet, attaining at least 93% query reduction in finding initial successful adversarial examples (or reaching the same distortion) while maintaining a similar attack success rate. It can also efficiently fine-tune the image distortion to maintain high visual similarity to the original image. Consequently, AutoZOOM provides novel and efficient means for assessing the robustness of deployed machine learning models.

Acknowledgements

Shin-Ming Cheng was supported in part by the Ministry of Science and Technology, Taiwan, under Grants MOST 107-2218-E-001-005 and MOST 107-2218-E-011-012. Cho-Jui Hsieh and Huan Zhang acknowledge the support by NSF IIS-1719097, Intel faculty award, Google Cloud and NVIDIA.

References

[Abadi et al.] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning.
[Athalye, Carlini, and Wagner 2018] Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ICML.
[Baluja and Fischer 2018] Baluja, S., and Fischer, I. 2018. Adversarial transformation networks: Learning to generate adversarial examples. AAAI.
[Biggio and Roli 2018] Biggio, B., and Roli, F. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition.
[Biggio et al. 2013] [...] Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 387–402.
[Brendel, Rauber, and Bethge 2018] Brendel, W.; Rauber, J.; and Bethge, M. 2018. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ICLR.
[Carlini and Wagner 2017a] Carlini, N., and Wagner, D. 2017a. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, 3–14.
[Carlini and Wagner 2017b] Carlini, N., and Wagner, D. 2017b. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 39–57.
[Chen et al. 2017] Chen, P.-Y.; Zhang, H.; Sharma, Y.; Yi, J.; and Hsieh, C.-J. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security, 15–26.
[Chen et al. 2018] Chen, P.-Y.; Sharma, Y.; Zhang, H.; Yi, J.; and Hsieh, C.-J. 2018. EAD: Elastic-net attacks to deep neural networks via adversarial examples. AAAI.
[Duchi et al. 2015] Duchi, J. C.; Jordan, M. I.; Wainwright, M. J.; and Wibisono, A. 2015. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory.
[Gao, Jiang, and Zhang 2014] [...] Optimization Online.
[...] SIAM Journal on Optimization.
[Goodfellow, Shlens, and Szegedy 2015] [...] ICLR.
[Ilyas et al. 2018] Ilyas, A.; Engstrom, L.; Athalye, A.; and Lin, J. 2018. Black-box adversarial attacks with limited queries and information. ICML.
[Ilyas 2018] Ilyas, A. 2018. Circumventing the ensemble adversarial training defense. https://github.com/andrewilyas/ens-adv-train-attack.
[Kingma and Ba 2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.
[Krizhevsky 2009] Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
[Kurakin, Goodfellow, and Bengio 2017] Kurakin, A.; Goodfellow, I.; and Bengio, S. 2017. Adversarial machine learning at scale. ICLR.
[Lax and Terrell 2014] Lax, P. D., and Terrell, M. S. 2014. Calculus with applications. Springer.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE.
[Liu et al. 2017] [...] ICLR.
[Liu et al. 2018] Liu, S.; Chen, J.; Chen, P.-Y.; and Hero, A. O. 2018. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. AISTATS.
[Lowd and Meek 2005] Lowd, D., and Meek, C. 2005. Adversarial learning. In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 641–647.
[Narodytska and Kasiviswanathan 2016] Narodytska, N., and Kasiviswanathan, S. P. 2016. Simple black-box adversarial perturbations for deep networks. arXiv preprint arXiv:1612.06299.
[Nesterov and Spokoiny 2017] Nesterov, Y., and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics.
[Nitin Bhagoji et al. 2018] [...] ECCV, 154–169.
[Papernot et al. 2017] Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 506–519.
[Papernot, McDaniel, and Goodfellow 2016] Papernot, N.; McDaniel, P.; and Goodfellow, I. 2016. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision.
[...] ECCV.
[Suya et al. 2017] Suya, F.; Tian, Y.; Evans, D.; and Papotti, P. 2017. Query-limited black-box attacks to classifiers. NIPS Workshop.
[Szegedy et al. 2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. ICLR.
[Szegedy et al. 2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
[Tramèr et al. 2018] Tramèr, F.; Kurakin, A.; Papernot, N.; Boneh, D.; and McDaniel, P. 2018. Ensemble adversarial training: Attacks and defenses. ICLR.
[Wang et al. 2018] Wang, Y.; Du, S.; Balakrishnan, S.; and Singh, A. 2018. Stochastic zeroth-order optimization in high dimensions. AISTATS.

Supplementary Material

A More Background on Adversarial Attacks and Defenses

The research in generating adversarial examples to deceive machine-learning models, known as adversarial attacks, tends to evolve with the advance of machine-learning techniques and new publicly available datasets. In (Lowd and Meek 2005), the authors studied adversarial attacks on linear classifiers with continuous or Boolean features. In (Biggio et al. 2013), the authors proposed a gradient-based adversarial attack on kernel support vector machines (SVMs). More recently, gradient-based approaches have also been used in adversarial attacks on image classifiers trained by DNNs (Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015). Due to space limitations, we focus on related work in adversarial attacks on DNNs. Interested readers may refer to the survey paper (Biggio and Roli 2018) for more details.

Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. In principle, adversarial attacks can be formulated as an optimization problem of minimizing the adversarial perturbation while ensuring attack objectives. In image classification, given a natural image, an untargeted attack aims to find a visually similar adversarial image resulting in a different class prediction, while a targeted attack aims to find an adversarial image leading to a specific class prediction. The visual similarity between a pair of adversarial and natural images is often measured by the $L_p$ norm of their difference, where $p \geq 1$. Existing powerful white-box adversarial attacks using $L_\infty$, $L_2$ or $L_1$ norms include iterative fast gradient sign methods (Kurakin, Goodfellow, and Bengio 2017), Carlini and Wagner's (C&W) attack (Carlini and Wagner 2017b), elastic-net attacks to DNNs (EAD) (Chen et al. 2018), etc.

Black-box adversarial attacks are practical threats to deployed machine-learning services. Attackers can observe the input-output correspondences of any queried input, but the target model parameters are completely hidden. Therefore, gradient-based adversarial attacks are inapplicable to a black-box setting. As a first attempt, the authors in (Papernot et al. 2017) proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and leverage the transferability of adversarial examples (Papernot, McDaniel, and Goodfellow 2016; Liu et al. 2017) to attack the target model. However, training a representative surrogate for a DNN is challenging due to the complicated and nonlinear classification rules of DNNs and the high dimensionality of the underlying dataset. The performance of black-box attacks can be severely degraded if the adversarial examples for the substitute model transfer poorly to the target model. To bridge this gap, the authors in (Chen et al. 2017) proposed a black-box attack called ZOO that directly estimates the gradient of the attack objective by iteratively querying the target model. Although ZOO achieves a similar attack success rate and comparable visual quality to many white-box attack methods, it exploits the symmetric difference quotient method (Lax and Terrell 2014) for coordinate-wise gradient estimation and value update, which requires excessive target model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in the later work in (Nitin Bhagoji et al. 2018). Although acceleration techniques such as importance sampling, bilinear scaling and random feature grouping have been used in (Chen et al. 2017; Nitin Bhagoji et al. 2018), the coordinate-wise gradient estimation approach still forms a bottleneck for query efficiency.

Beyond optimization-based approaches, the authors in (Ilyas et al. 2018) proposed to use a natural evolution strategy (NES) to enhance query efficiency.
Although there is also a vector-wise gradient estimation step in the NES attack, we treat it as an independent and parallel work since its natural evolutionary step is out of the scope of black-box attacks using zeroth-order gradient descent. We also note that, different from NES, our AutoZOOM framework uses a query-efficient random gradient estimation strategy. In addition, AutoZOOM could be applied to further improve the query efficiency of NES, since NES does not take into account the factor of attack dimension reduction, which is the main focus of this paper. Under a more restricted setting, where only the decision (top-1 prediction class) is known to an attacker, the authors in (Brendel, Rauber, and Bethge 2018) proposed a random-walk based attack around the decision boundary. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries.

In this paper, we focus on improving the query efficiency of gradient-estimation and gradient-descent based black-box attacks and consider the threat model in which the class prediction scores are known to an attacker. For the reader's reference, we compare existing black-box attacks on DNNs with AutoZOOM in Table S1. One unique feature of AutoZOOM is the use of a reduced attack dimension when mounting black-box attacks, which is an unlabeled data-driven technique (autoencoder) for attack acceleration, and has not been studied thoroughly in existing attacks. While white-box attacks such as (Baluja and Fischer 2018) have utilized autoencoders trained on the training data and the transparent logit representations of DNNs, we propose in this work to use autoencoders trained on unlabeled natural data to improve query efficiency for black-box attacks.

There have been many methods proposed for defending against adversarial attacks on DNNs. However, new defenses are continuously weakened by follow-up attacks (Carlini and Wagner 2017a; Athalye, Carlini, and Wagner 2018). For instance, model ensembles (Tramèr et al. 2018) were shown to be effective against some black-box attacks, but they were recently circumvented by advanced attack techniques (Ilyas 2018). In this paper, we focus on improving query efficiency in attacking black-box undefended DNNs.

B Proof of Theorem 1

Recall that the data dimension is $d$ and we assume $f$ to be differentiable and its gradient $\nabla f$ to be $L$-Lipschitz. Fix $\beta$ and consider a smoothed version of $f$:
$$f_\beta(\mathbf{x}) = \mathbb{E}_{\mathbf{u}}[f(\mathbf{x} + \beta \mathbf{u})]. \tag{S1}$$
Based on (Gao, Jiang, and Zhang 2014, Lemma 4.1-a), we have the relation
$$\nabla f_\beta(\mathbf{x}) = \mathbb{E}_{\mathbf{u}}\left[\frac{d}{\beta} f(\mathbf{x} + \beta \mathbf{u})\,\mathbf{u}\right] = \frac{d}{b}\,\mathbb{E}_{\mathbf{u}}[\mathbf{g}], \tag{S2}$$
which then yields
$$\mathbb{E}_{\mathbf{u}}[\mathbf{g}] = \frac{b}{d}\,\nabla f_\beta(\mathbf{x}), \tag{S3}$$
where we recall that $\mathbf{g}$ has been defined in (3). Moreover, based on (Gao, Jiang, and Zhang 2014, Lemma 4.1-b), we have
$$\|\nabla f_\beta(\mathbf{x}) - \nabla f(\mathbf{x})\|_2 \leq \frac{\beta d L}{2}. \tag{S4}$$
Substituting (S3) into (S4), we obtain
$$\left\|\mathbb{E}[\mathbf{g}] - \frac{b}{d}\nabla f(\mathbf{x})\right\|_2 \leq \frac{\beta b L}{2}.$$
This then implies that
$$\mathbb{E}[\mathbf{g}] = \frac{b}{d}\nabla f(\mathbf{x}) + \boldsymbol{\epsilon}, \tag{S5}$$
where $\|\boldsymbol{\epsilon}\|_2 \leq \frac{b \beta L}{2}$.

Once again, by applying (Gao, Jiang, and Zhang 2014, Lemma 4.1-b), we can easily obtain that
$$\mathbb{E}_{\mathbf{u}}[\|\mathbf{g}\|_2^2] \leq \frac{b^2 L^2 \beta^2}{2} + \frac{2 b^2}{d}\|\nabla f(\mathbf{x})\|_2^2. \tag{S6}$$
Now, let us consider the averaged random gradient estimator in (4),
$$\bar{\mathbf{g}} = \frac{1}{q}\sum_{i=1}^{q} \mathbf{g}_i = \frac{b}{q}\sum_{i=1}^{q} \frac{f(\mathbf{x} + \beta \mathbf{u}_i) - f(\mathbf{x})}{\beta}\,\mathbf{u}_i.$$
Due to the properties of i.i.d. samples $\{\mathbf{u}_i\}$ and (S5), we define
$$\mathbf{v} := \mathbb{E}[\mathbf{g}_i] = \frac{b}{d}\nabla f(\mathbf{x}) + \boldsymbol{\epsilon}. \tag{S7}$$
Moreover, we have
\begin{align}
\mathbb{E}[\|\bar{\mathbf{g}}\|_2^2] &= \mathbb{E}\left[\left\|\frac{1}{q}\sum_{i=1}^{q}(\mathbf{g}_i - \mathbf{v}) + \mathbf{v}\right\|_2^2\right] \tag{S8} \\
&= \|\mathbf{v}\|_2^2 + \mathbb{E}\left[\left\|\frac{1}{q}\sum_{i=1}^{q}(\mathbf{g}_i - \mathbf{v})\right\|_2^2\right] = \|\mathbf{v}\|_2^2 + \frac{1}{q}\,\mathbb{E}[\|\mathbf{g}_1 - \mathbf{v}\|_2^2] \tag{S9} \\
&= \|\mathbf{v}\|_2^2 + \frac{1}{q}\,\mathbb{E}[\|\mathbf{g}_1\|_2^2] - \frac{1}{q}\|\mathbf{v}\|_2^2, \tag{S10}
\end{align}
where we have used the fact that $\mathbb{E}[\mathbf{g}_i] = \mathbb{E}[\mathbf{g}_1] = \mathbf{v}$ for all $i$. The definition of $\mathbf{v}$ in (S7) yields
$$\|\mathbf{v}\|_2^2 \leq \frac{2 b^2}{d^2}\|\nabla f(\mathbf{x})\|_2^2 + 2\|\boldsymbol{\epsilon}\|_2^2 \leq \frac{2 b^2}{d^2}\|\nabla f(\mathbf{x})\|_2^2 + \frac{1}{2} b^2 \beta^2 L^2. \tag{S11}$$
From (S6), we also obtain that for any $i$,
$$\mathbb{E}[\|\mathbf{g}_i\|_2^2] \leq \frac{b^2 L^2 \beta^2}{2} + \frac{2 b^2}{d}\|\nabla f(\mathbf{x})\|_2^2. \tag{S12}$$
Substituting (S11) and (S12) into (S10), we obtain
\begin{align}
\mathbb{E}[\|\bar{\mathbf{g}}\|_2^2] &\leq \|\mathbf{v}\|_2^2 + \frac{1}{q}\,\mathbb{E}[\|\mathbf{g}_1\|_2^2] \tag{S13} \\
&\leq 2\left(\frac{b^2}{d^2} + \frac{b^2}{dq}\right)\|\nabla f(\mathbf{x})\|_2^2 + \frac{q+1}{2q}\, b^2 L^2 \beta^2. \tag{S14}
\end{align}
Finally, we bound the mean squared estimation error as
\begin{align}
\mathbb{E}[\|\bar{\mathbf{g}} - \nabla f(\mathbf{x})\|_2^2] &\leq 2\,\mathbb{E}[\|\bar{\mathbf{g}} - \mathbf{v}\|_2^2] + 2\|\mathbf{v} - \nabla f(\mathbf{x})\|_2^2 \\
&\leq 2\,\mathbb{E}[\|\bar{\mathbf{g}}\|_2^2] + 2\left\|\frac{b}{d}\nabla f(\mathbf{x}) + \boldsymbol{\epsilon} - \nabla f(\mathbf{x})\right\|_2^2 \\
&\leq 4\left(\frac{b^2}{d^2} + \frac{b^2}{dq} + \frac{(b-d)^2}{d^2}\right)\|\nabla f(\mathbf{x})\|_2^2 + \frac{2q+1}{q}\, b^2 L^2 \beta^2, \tag{S15}
\end{align}
which completes the proof.

Table S1: Comparison of black-box attacks on DNNs
Columns: Method | Approach | Model output | check marks for the columns Targeted attack / Large network (ImageNet) / Data-driven acceleration (column alignment of the marks is not recoverable)
(Narodytska and Kasiviswanathan 2016) | local random search | score | ✓
(Papernot et al. 2017) | substitute model | score | ✓
(Suya et al. 2017) | acquisition via posterior | score | ✓
(Brendel, Rauber, and Bethge 2018) | Gaussian perturbation | decision | ✓ ✓
(Ilyas et al. 2018) | natural evolution strategy | score/decision | ✓ ✓
(Chen et al. 2017) | coordinate-wise gradient estimation | score | ✓ ✓
(Nitin Bhagoji et al. 2018) | coordinate-wise gradient estimation | score | ✓ ✓
AutoZOOM (this paper) | random (full) gradient estimation | score | ✓ ✓ ✓

Table S2: Architectures of autoencoders in AutoZOOM

Dataset: MNIST. Training MSE: 2.00 × 10^−[...]
Reduction ratio / image size / feature map size: 25% / 28 × 28 × 1 / [...]
Encoder: [...] → MaxPool → Conv-1
Decoder: ConvReLU-16 → Reshape-Re-U → Conv-1

Dataset: CIFAR-10. Training MSE: 5.00 × 10^−[...]
Reduction ratio / image size / feature map size: 6.25% / 32 × 32 × 3 / [...]
Encoder: [...] → MaxPool → ConvReLU-3 → MaxPool → Conv-3
Decoder: ConvReLU-16 → Reshape-Re-U → ConvReLU-16 → Reshape-Re-U → Conv-3

Dataset: ImageNet. Training MSE: 1.02 × 10^−[...]
Reduction ratio / image size / feature map size: 1.15% / 299 × 299 × 3 / [...]
Encoder: [...] → ConvReLU-16 → MaxPool → ConvReLU-16 → MaxPool → Conv-3
Decoder: ConvReLU-16 → Reshape-Re-U → ConvReLU-16 → Reshape-Bi-U → Conv-3

ConvReLU-16: Convolution (16 filters, kernel size: 3 × 3 × Dep) + ReLU activation
ConvReLU-3: Convolution (3 filters, kernel size: 3 × 3 × Dep) + ReLU activation
Conv-3: Convolution (3 filters, kernel size: 3 × 3 × Dep)
Conv-1: Convolution (1 filter, kernel size: 3 × 3 × Dep)
Reshape-Bi-D: Bilinear reshaping from 299 × 299 × 3 to [...]
[...] × 16 to 299 × 299 × 16
[...] U × V × Dep to [...] U × V × Dep
Dep: a proper depth
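To make the trade-off in the bound (S15) of Appendix B tangible, the following small sketch simply evaluates its right-hand side. The function and argument names are hypothetical, and the numbers in the usage note are arbitrary illustration values rather than settings from the experiments.

```python
def mse_bound(grad_norm_sq, d, q, b, beta, lipschitz):
    """Right-hand side of (S15): upper bound on E||g_bar - grad f(x)||^2.

    grad_norm_sq : squared L2 norm of the true gradient
    d            : (reduced) attack dimension
    q            : number of averaged random directions
    b            : scaling parameter of the estimator
    beta         : smoothing parameter
    lipschitz    : Lipschitz constant L of the gradient
    """
    gradient_term = 4.0 * (b ** 2 / d ** 2 + b ** 2 / (d * q) + (b - d) ** 2 / d ** 2)
    smoothing_term = (2.0 * q + 1.0) / q * b ** 2 * lipschitz ** 2 * beta ** 2
    return gradient_term * grad_norm_sq + smoothing_term


if __name__ == "__main__":
    # Arbitrary illustration: the bound shrinks as q grows, but never below
    # the q-independent part of the gradient term.
    for q in (1, 2, 4, 8):
        print(q, mse_bound(grad_norm_sq=1.0, d=4, q=q, b=4, beta=0.0, lipschitz=1.0))
```

Only the $b^2/(dq)$ term and the factor $(2q+1)/q$ depend on $q$, so the benefit of averaging flattens out for larger $q$ in this bound. A smaller $d$ lowers the $b^2/(dq)$ term for a fixed $b$, which is one way to read the role of dimension reduction.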

C Architectures of Convolutional Autoencoders in AutoZOOM

On MNIST, the convolutional autoencoder (CAE) is trained on 50,000 randomly selected hand-written digits from the MNIST8M dataset. On CIFAR-10, the CAE is trained on 9,900 images selected from its test dataset. The remaining images are used in black-box attacks. On ImageNet, all the attacked natural images are from 10 randomly selected image labels, and these labels are also used as the candidate attack targets. The CAE is trained on about 9,000 images from these classes.

Table S2 shows the architectures for all the autoencoders used in this work. Note that the autoencoder designed for ImageNet uses bilinear scaling to transform the data size from [...] × [...] × Dep to [...] × [...] × Dep, and also back from [...] × [...] × Dep to [...] × [...] × Dep. This allows easy processing and handling by the autoencoder's internal convolutional layers.

The normalized mean squared error of our autoencoder trained on MNIST, CIFAR-10 and ImageNet is 0.0027, 0.0049 and 0.0151, respectively, which lies within a reasonable range of compression loss.
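The bilinear scaling mentioned above, which is also the operation behind the BiLIN decoder, can be sketched in a few lines for a single-channel grid. This is a minimal illustration and not the code used in the paper: the function name and the corner-aligned sampling convention are assumptions, and deep-learning libraries may use a different pixel-alignment convention.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D grid (list of rows) with corner-aligned bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Fractional source row for output row i (corners map to corners).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along the row direction first, then between rows.
            top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
            bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


if __name__ == "__main__":
    # A low-dimensional perturbation (2 x 2) decoded to a larger grid (4 x 4):
    # the attack only has to search over the 4 small entries.
    delta_small = [[0.0, 0.1], [-0.1, 0.2]]
    for r in bilinear_resize(delta_small, 4, 4):
        print(r)
```

Used as a decoder, such a resize needs no training data, which is the practical appeal of a bilinear decoder compared with a learned autoencoder.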

D More Adversarial Examples of Attacking Inception-v3 in the Black-box Setting

Figure S1 shows other adversarial examples of the Inception-v3 model in the black-box targeted attack setting.

Footnote (MNIST8M dataset): http://leon.bottou.org/projects/infimnist

E Performance Evaluation of Black-box Untargeted Attacks

Table S3 shows the performance of black-box untargeted attacks on MNIST, CIFAR-10 and ImageNet using the ZOO and AutoZOOM-BiLIN attacks on the same set of images as in Section 4.5. The Loss function is defined as
$$\mathrm{Loss} = \max\left\{ \log[F(\mathbf{x})]_{t_0} - \max_{j \neq t_0} \log[F(\mathbf{x})]_j ,\; 0 \right\}, \tag{S16}$$
where $t_0$ is the top-1 prediction label of a natural image $\mathbf{x}_0$. We set $\lambda_{\mathrm{ini}} = 10$ and use $q = 5$ on MNIST and CIFAR-10 and $q = 4$ on ImageNet for distortion fine-tuning in the post-attack phase. Compared with Table 3, the number of model queries can be further reduced, since untargeted attacks only require the adversarial images to be classified as any class other than $t_0$ rather than as a specific class $t \neq t_0$.

Figure S1: Adversarial examples on ImageNet crafted by AutoZOOM when attacking the Inception-v3 model in the black-box setting with a target class selected at random. Left: original natural images. Right: adversarial examples. Panels: (a) “French bulldog” to “traffic light”; (b) “purse” to “bagel”; (c) “bagel” to “grand piano”; (d) “traffic light” to “iPod”.

Table S3: Performance evaluation of black-box untargeted attacks on different datasets. The per-pixel $L_2$ distortion thresholds are 0.004, 0.0015 and [...] × 10^−[...] for MNIST, CIFAR-10 and ImageNet, respectively.
Dataset Method