A Zeroth-Order Block Coordinate Descent Algorithm for Huge-Scale Black-Box Optimization
HanQin Cai, Yuchen Lou, Daniel McKenzie, and Wotao Yin

University of California, Los Angeles, Los Angeles, CA, USA; The University of Hong Kong, Pokfulam, Hong Kong, PRC; Damo Academy, Alibaba US, Bellevue, WA, USA

February 23, 2021
Abstract
We consider the zeroth-order optimization problem in the huge-scale setting, where the dimension of the problem is so large that performing even basic vector operations on the decision variables is infeasible. In this paper, we propose a novel algorithm, coined ZO-BCD, that exhibits favorable overall query complexity and has a much smaller per-iteration computational complexity. In addition, we discuss how the memory footprint of ZO-BCD can be reduced even further by the clever use of circulant measurement matrices. As an application of our new method, we propose the idea of crafting adversarial attacks on neural network based classifiers in a wavelet domain, which can result in problem dimensions of over 1.7 million. In particular, we show that crafting adversarial examples to audio classifiers in a wavelet domain can achieve a state-of-the-art attack success rate.

Email addresses: [email protected] (H.Q. Cai), [email protected] (D. McKenzie), [email protected] (Y. Lou), and [email protected] (W. Yin).

1 Introduction

We consider the problem

$$\operatorname*{minimize}_{x \in \mathbb{R}^d} \; f(x). \tag{1}$$

We are interested in problem (1) under the restrictive assumption that one only has noisy zeroth-order access to f(x) (i.e. one cannot access the gradient ∇f(x)) and the dimension of the problem, d, is huge. Such problems (with small or large d) arise frequently in domains as diverse as simulation-based optimization in chemistry and physics [1], hyperparameter tuning for combinatorial optimization solvers [2] and for neural networks [3], and online marketing [4]. Lately, algorithms for zeroth-order optimization have drawn increasing attention due to their use in finding good policies in reinforcement learning [5–7] and in crafting adversarial examples given only black-box access to neural network based classifiers [8–11]. We note that in all of these applications queries (i.e. evaluating f(x) at a chosen point) are considered expensive; thus it is desirable for zeroth-order optimization algorithms to be as query efficient as possible.

Unfortunately, it is known [12] that the worst-case query complexity of any zeroth-order algorithm for strongly convex f(x) scales linearly with d (see Appendix A of [11] for a proof of this). Clearly, this is prohibitive for huge d. Recent work has begun to side-step this issue by assuming that f(x) has additional, low-dimensional structure. For example, [11, 13, 14] assume that the gradients ∇f(x) are (approximately) s-sparse (see Assumption 5), while [15] and others assume that f(x) = g(Az) where A : R^s → R^d and s ≪ d. All of these works promise a query complexity that scales linearly with the intrinsic dimension, s, and only logarithmically with the extrinsic dimension, d. However, there is no free lunch here; the improved complexity of [15] requires access to noiseless function evaluations, the results of [14] only hold if the support of ∇f(x) is fixed for all x ∈ R^d, and while [13] and [11] allow for noisy function evaluations and changing gradient support, both solve a computationally intensive optimization problem as a sub-routine, requiring at least Ω(sd log(d)) memory and FLOPS per iteration.

1.1 Contributions

In this paper we provide the first zeroth-order optimization algorithm enjoying a sub-linear (in d) query complexity and a sub-linear per-iteration computational complexity. In addition, our algorithm has an exceptionally small memory footprint. Furthermore, it does not require the repeated sampling of d-dimensional random vectors, a hallmark of many zeroth-order optimization algorithms.
With this new algorithm, ZO-BCD, in hand, we are able to solve black-box optimization problems of a size hitherto unimagined. Specifically, we consider the problem of generating adversarial examples to fool neural-network-based classifiers, given only black-box access to the model (as introduced in [8]). However, we consider generating these malicious examples by perturbing natural examples in a wavelet domain. For image classifiers (we consider Inception-v3 trained on ImageNet) we are able to produce attacked images with a record-low ℓ2 distortion and a success rate exceeding the state of the art. See Figure 1 for an example of attacking an image with true label "scale". For audio classifiers, attacking in a (continuous) wavelet domain yields problems of dimension over 1.7 million. Using ZO-BCD this is not an issue, and we achieve a state-of-the-art targeted attack success rate with a small (negative-decibel) mean distortion.

Figure 1: Wavelet-attacked image by ZO-BCD: true label "scale" → mis-classified label "switch".

1.2 Prior art

As mentioned above, the recent works [11, 13, 14] provide zeroth-order algorithms whose query complexity scales linearly with s and logarithmically with d. In order to ameliorate the prohibitive computational and memory cost associated with huge d, several domain-specific heuristics have been employed in the literature. For example, in [8, 16], in relation to adversarial attacks, an upsampling operator D : R^p → R^d with p ≪ d is employed. Problem (1) is then replaced with the lower-dimensional problem: minimize_{z∈R^p} f(D(z)). Several other works [10, 11, 17] choose a low-dimensional random subspace T_k ⊂ R^d at each iteration and then restrict x_{k+1} − x_k ∈ T_k. We emphasize that none of the aforementioned works prove that such a procedure will converge, and our work is partly motivated by the desire to provide this empirically successful trick with firm guarantees of success.

In the reinforcement learning literature it is common to evaluate f(x_k + z_{k,i}) on parallel devices and then send the computed function value and the perturbation z_{k,i} to a central worker, which then computes x_{k+1}. As x ∈ R^d parametrizes a neural network, d can be extremely large, and hence the communication of the z_{k,i} between workers becomes a bottleneck. [5] overcomes this with a "seed sharing" trick, but again this heuristic lacks rigorous analysis. We hope that ZO-BCD's intrinsically small memory footprint (particularly that of the ZO-BCD-RC variant; see Section 3) will make it a competitive, principled alternative.

Finally, although two recent works have examined the idea of wavelet-domain adversarial attacks [18, 19], they are of a very different nature to our approach. We discuss this further in Section 4.

1.3 Assumptions and notation

As mentioned, we will suppose that the decision variables x have been subdivided into J blocks of sizes d^(1), ..., d^(J). Following the notation of [20], we shall suppose that there exists a permutation matrix U ∈ R^{d×d} and a division of U into submatrices U = [U^(1), U^(2), ..., U^(J)] such that U^(j) ∈ R^{d×d^(j)} and that the j-th block is spanned by the columns of U^(j). Letting x^(j) denote the decision variables in the j-th block, we can write x = Σ_{j=1}^J U^(j) x^(j), or simply x = (x^(1), x^(2), ..., x^(J)) for short. We shall consistently use the notation g(x) := ∇f(x), omitting x if it is clear from context. By g^(j)(x) we shall mean the components of the gradient corresponding to the j-th block, i.e. g^(j)(x) = ∇_{x^(j)} f(x); we shall think of this as a vector either in R^d or in R^{d^(j)}. Finally, we use Õ(·) notation to suppress logarithmic factors. Let us now introduce some standard assumptions on the objective function.
Assumption 1 (Block Lipschitz differentiability). f(x) is continuously differentiable and, for any j = 1, ..., J, any x ∈ R^d, and any t ∈ R^{d^(j)}, we have that

$$\|g^{(j)}(x) - g^{(j)}(x + U^{(j)}t)\|_2 \le L_j \|t\|_2$$

for some fixed constant L_j. One can easily check that if f(x) is L-Lipschitz differentiable then it is also block Lipschitz differentiable, with max_j L_j ≤ L.

Assumption 2 (Convexity). For any x, y ∈ R^d and t ∈ [0, 1], we have that f(tx + (1−t)y) ≤ tf(x) + (1−t)f(y).

Define the solution set X* = argmin_{x∈R^d} f(x). If this set is non-empty we may define the level-set radius for x ∈ R^d as:

$$R(x) := \max_{y \in \mathbb{R}^d}\ \max_{x^* \in \mathcal{X}^*} \left\{\|y - x^*\|_2 : f(y) \le f(x)\right\}. \tag{2}$$

Assumption 3 (Non-empty solution set and bounded level sets). Assume that X* is non-empty and that R(x_0) < ∞.

Assumption 4 (Adversarially noisy oracle). f(x) is only accessible through a noisy oracle: E_f(x) = f(x) + ξ, where ξ is a random variable satisfying |ξ| ≤ σ.

Assumption 5 (Sparse gradients). There exists a fixed integer 0 < s_exact < d such that for all x ∈ R^d: ‖g(x)‖₀ := |{i : g_i(x) ≠ 0}| ≤ s_exact.

It is of interest to relax this assumption to an "approximately sparse" assumption, such as that in [11]. The issue here is that it is unclear whether randomly chosen blocks (see Section 2.1) will inherit this property. We leave the analysis of this case for future work. Finally, let ∇²_jj f(x) ∈ R^{d^(j)×d^(j)} denote the j-th block Hessian.

Assumption 6 (Weakly sparse block Hessian). f(x) is twice differentiable and, for all j = 1, ..., J and x ∈ R^d, we have that ‖∇²_jj f(x)‖₁ ≤ H_j for some fixed constant H_j. Note that ‖·‖₁ here represents the element-wise ℓ1-norm: ‖B‖₁ = Σ_{i,j} |B_ij|.

2 Block gradient estimation

Randomized (block) coordinate descent methods are an attractive alternative to full-gradient methods for huge-scale problems [21]. ZO-BCD is a block coordinate method adapted to the zeroth-order setting and conceptually has three steps:

1. Choose a block j ∈ {1, ..., J} at random.
2. Use zeroth-order queries to find an approximation, ĝ_k^(j), of the true block gradient g_k^(j).
3. Take a negative gradient step: x_{k+1} = x_k − α ĝ_k^(j).

We abuse notation slightly and use ĝ_k^(j) to refer both to a d^(j)-dimensional vector for block j and to the d-dimensional vector obtained by padding the d^(j)-dimensional block with zeros, for the gradient descent update.

In principle any scheme for constructing an estimator of g_k could be adapted to estimating g_k^(j), as long as one is able to bound ‖g_k^(j) − ĝ_k^(j)‖₂ with high probability. As we wish to exploit gradient sparsity, we choose to adapt the estimator presented in [11]. Let us now discuss how to do so. Fix a sampling radius δ > 0. Suppose that the j-th block has been selected, and select m sample directions z₁, ..., z_m ∈ R^{d^(j)} from a Rademacher distribution (that is, the entries of each z_i are +1 or −1 with equal probability). Consider the finite differences:

$$y_i = \frac{1}{\sqrt{m}} \, \frac{E_f(x + \delta U^{(j)} z_i) - E_f(x)}{\delta}. \tag{3}$$

Certainly, if g satisfies Assumption 5 then so does g^(j). Thus, we may attempt to recover g^(j) by solving the following sparse recovery problem:

$$\hat{g}^{(j)} = \operatorname*{argmin}_{v \in \mathbb{R}^{d^{(j)}}} \|Zv - y\|_2 \quad \text{s.t.} \quad \|v\|_0 \le s. \tag{4}$$

We propose solving Problem (4) using CoSaMP [22], but certainly other choices are possible. We present this as Algorithm 1.
Theorem 2.6 and Corollary 2.7 in [11] then guarantee that, for appropriately chosen parameters, ĝ^(j) is a reliable estimator of g^(j). For the reader's convenience we quantify this in the following theorem.

Algorithm 1 Block Gradient Estimation
Input: x: current point; j: choice of block; s: gradient sparsity level; δ: query radius; n: number of CoSaMP iterations; {z_i}_{i=1}^m: sample directions in R^{d^(j)}.
  for i = 1 to m do
    y_i ← (E_f(x + δU^(j)z_i) − E_f(x)) / (√m δ)
  end for
  y ← [y₁, ..., y_m]^T;  Z ← (1/√m)[z₁, ..., z_m]^T
  ĝ^(j) ≈ argmin_{‖v‖₀ ≤ s} ‖Zv − y‖₂ by n iterations of CoSaMP
Output: ĝ^(j): estimated block gradient.
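To make Algorithm 1 concrete, here is a minimal Python sketch of the block gradient estimator, using a bare-bones CoSaMP routine in place of a tuned solver. The oracle `E_f`, the block embedding `U_j`, and all default parameter values are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def cosamp(Z, y, s, n_iter=10):
    """Bare-bones CoSaMP: recover an s-sparse v from y ~ Z v."""
    m, d = Z.shape
    v = np.zeros(d)
    for _ in range(n_iter):
        r = y - Z @ v                                # current residual
        proxy = Z.T @ r                              # signal proxy
        omega = np.argsort(-np.abs(proxy))[:2 * s]   # 2s largest proxy entries
        T = np.union1d(omega, np.nonzero(v)[0]).astype(int)
        sol, *_ = np.linalg.lstsq(Z[:, T], y, rcond=None)
        b = np.zeros(d)
        b[T] = sol                                   # least squares on support T
        keep = np.argsort(-np.abs(b))[:s]            # prune to the s largest
        v = np.zeros(d)
        v[keep] = b[keep]
    return v

def estimate_block_gradient(E_f, x, U_j, Zdirs, delta, s, n_iter=10):
    """Algorithm 1: estimate g^(j)(x) from m finite differences.

    E_f   : noisy zeroth-order oracle, E_f(x) = f(x) + xi
    U_j   : (d, d_j) matrix embedding block j into R^d
    Zdirs : (m, d_j) matrix of Rademacher sample directions z_i
    """
    m = Zdirs.shape[0]
    f0 = E_f(x)
    # finite differences (3), scaled by 1 / (sqrt(m) * delta)
    y = np.array([E_f(x + delta * (U_j @ z)) - f0
                  for z in Zdirs]) / (np.sqrt(m) * delta)
    Z = Zdirs / np.sqrt(m)
    return cosamp(Z, y, s, n_iter)   # sparse recovery problem (4)
```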
Theorem 2.1. Suppose that f(x) satisfies Assumptions 1, 5 and 6. Let ĝ^(j) be the output of Algorithm 1 with δ = 2√(σ/H_j), s ≥ s_exact, and m = b₁ s log(d/J) Rademacher sample directions. Then with probability at least 1 − (s/d)^{b₂ s}:

$$\|\hat{g}^{(j)} - g^{(j)}\|_2 \le \rho^n \|g^{(j)}\|_2 + 2\tau\sqrt{\sigma H_j}. \tag{5}$$

The constants b₁ and b₂ are directly proportional; considering more sample directions results in a higher probability of recovery. In our experiments we consider a modest range of b₁. The constants ρ and τ arise from the analysis of CoSaMP, and both are inversely proportional to b₁; for our range of b₁, ρ < 1 and τ is a small constant.

2.1 Improved recovery with randomized blocks

Suppose that f(x) satisfies Assumption 5, so that ‖g(x)‖₀ ≤ s_exact for all x. In general one cannot improve upon the bound ‖g^(j)(x)‖₀ ≤ s_exact, as it might be the case that all non-zero entries of g lie in the j-th block. However, by randomizing the blocks one can guarantee that, with high probability, the non-zero entries of g are almost equally distributed over the J blocks. We assume, for simplicity, that all blocks are equal-sized (i.e. d^(1) = d^(2) = ... = d^(J) = d/J).

Theorem 2.2. Let U be a permutation matrix chosen uniformly at random. Then for any x ∈ R^d and any ∆ > 0 we have that ‖g^(j)(x)‖₀ ≤ (1 + ∆)s_exact/J for all j, with probability at least 1 − 2J exp(−∆² s_exact / (3J)).

It will be convenient to fix a value of ∆, say ∆ = 0.5.
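Theorem 2.2 is easy to sanity-check numerically. The following short simulation (illustrative, not from the paper's code) randomly permutes coordinates into equal blocks and checks that the nonzero gradient entries are nearly evenly spread:

```python
import numpy as np

rng = np.random.default_rng(0)
d, J, s_exact = 100_000, 10, 500

support = rng.choice(d, size=s_exact, replace=False)  # nonzero entries of g
perm = rng.permutation(d)                             # random permutation U
blocks = perm[support] // (d // J)                    # block index of each nonzero
counts = np.bincount(blocks, minlength=J)
# with Delta = 0.5 the bound is (1 + Delta) * s_exact / J = 75; s_exact/J = 50
print(counts.max(), "vs bound", 1.5 * s_exact / J)
```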
An immediate consequence of Theorem 2.2 is that one can significantly improve upon the query complexity of Theorem 2.1:

Corollary 2.3. Suppose that U is chosen uniformly at random. The error bound (5) in Theorem 2.1 still holds, now with probability 1 − O(J exp(−s_exact/J)), if we change the assumption s ≥ s_exact to s ≥ 1.5 s_exact/J (and keep all other parameters the same).

Note that this allows us to use approximately J times fewer queries per iteration.

2.2 Further reducing the required randomness

As discussed in [11], one favorable feature of using a compressed-sensing-based gradient estimator is that the error bound (5) is universal. That is, it holds for all x ∈ R^d for the same set of sample directions {z_i}_{i=1}^m ⊂ R^{d/J}. This means that, instead of resampling new vectors at each iteration, we may use the same sampling directions for each block and each iteration. Thus, only m(d/J) = Õ(sd/J²) binary random variables need to be sampled, stored and transmitted in ZO-BCD. Remarkably, one can do even better by choosing as sample directions a subset of the rows of a circulant matrix. Recall that a circulant matrix of size (d/J) × (d/J), generated by v ∈ R^{d/J}, has the following form:

$$C(v) = \begin{pmatrix} v_1 & v_2 & \cdots & v_{d/J} \\ v_{d/J} & v_1 & \cdots & v_{d/J-1} \\ \vdots & \ddots & \ddots & \vdots \\ v_2 & \cdots & v_{d/J} & v_1 \end{pmatrix}. \tag{6}$$

Equivalently, C(v) is the matrix with rows C_i(v), where C_i(v) ∈ R^{d/J} and C_i(v)_j = v_{i+j−1} (indices taken modulo d/J). By exploiting recent results in signal processing, we show:
Theorem 2.4.
Assign blocks randomly as in Corollary 2.3. Let z ∈ R^{d/J} be a Rademacher random vector. Choose a random subset Ω = {j₁, ..., j_m} ⊂ {1, ..., d/J} of cardinality m = b₃ (s/J) log²(s/J) log²(d/J). The error bound (5) in Theorem 2.1 still holds, again with probability 1 − O(J exp(−s_exact/J)), if we change s ≥ s_exact to s ≥ 1.5 s_exact/J and use z_i = C_{j_i}(z) for all i = 1, ..., m (and keep all other parameters the same).

This theorem implies that one only needs d/J binary random variables (to construct z) and m = Õ(s/J) randomly selected integers for the entire algorithm. Note that (partial) circulant matrices allow for a fast matrix-vector multiplication, further reducing the computational complexity of Algorithm 1.
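As an illustration of Theorem 2.4, the sketch below builds the m sampling directions as cyclic shifts of a single Rademacher vector and applies Z via the FFT instead of storing it. The sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4096, 128      # n = d/J (per-block dimension), m samples; illustrative sizes

z = rng.choice([-1.0, 1.0], size=n)            # single Rademacher generator
Omega = rng.choice(n, size=m, replace=False)   # random row subset

# Explicit sampling directions: z_i = C_{j_i}(z), i.e. cyclic shifts of z.
Z = np.stack([np.roll(z, -j) for j in Omega]) / np.sqrt(m)

def Z_matvec(v):
    """Apply Z without storing it: a circulant matvec is a circular
    cross-correlation, computable with two FFTs in O(n log n)."""
    full = np.fft.ifft(np.fft.fft(z) * np.conj(np.fft.fft(v))).real
    return full[Omega] / np.sqrt(m)

v = rng.standard_normal(n)
assert np.allclose(Z @ v, Z_matvec(v))         # FFT route matches the dense route
```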
3 ZO-BCD and its convergence

Let us now introduce our new algorithm. We consider two variants, distinguished by the kind of sampling directions used: ZO-BCD-R uses Rademacher sampling directions, while ZO-BCD-RC uses Rademacher-circulant sampling directions, as described in Section 2.2. For simplicity, we present our algorithm for randomly selected, equally sized coordinate blocks. With minor modifications our results still hold for user-defined and/or unequally sized blocks; we discuss this in Appendix B. The following theorem guarantees that both variants converge at a sublinear rate to within a certain error tolerance. As the choice of block in each iteration is random, our results are necessarily probabilistic.

Algorithm 2 ZO-BCD
Input: x₀: initial point; s: gradient sparsity level; α: step size; δ: query radius; J: number of blocks.
  s_block ← 1.5 s/J
  Randomly divide x into J equally sized blocks
  if ZO-BCD-R then
    m ← b₁ s_block log(d/J)
    Generate Rademacher random vectors z₁, ..., z_m
  else if ZO-BCD-RC then
    m ← b₃ s_block log²(s_block) log²(d/J)
    Generate a Rademacher random vector z
    Randomly choose Ω ⊂ {1, ..., d/J} with |Ω| = m
    Let z_i = C_{j_i}(z) for i = 1, ..., m and j_i ∈ Ω
  end if
  for k = 0 to K do
    Select a block j ∈ {1, ..., J} uniformly at random
    ĝ_k^(j) ← GradientEstimation(x_k^(j), s_block, δ, {z_i}_{i=1}^m)   (Algorithm 1)
    x_{k+1} ← x_k − α ĝ_k^(j)
  end for
Output: x_K: estimated optimum point.
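Putting the pieces together, a minimal sketch of the ZO-BCD-R loop might look as follows. It reuses the `cosamp` helper from the earlier sketch, and all parameter choices are illustrative assumptions.

```python
import numpy as np

def zo_bcd_r(E_f, x0, s, alpha, delta, J, K, b1=2, n_cosamp=10):
    """Minimal ZO-BCD-R sketch (equal-sized, randomly assigned blocks)."""
    rng = np.random.default_rng(0)
    d = x0.size
    dj = d // J
    perm = rng.permutation(d)                      # random block assignment U
    s_block = int(np.ceil(1.5 * s / J))
    m = int(np.ceil(b1 * s_block * np.log(dj)))
    Zdirs = rng.choice([-1.0, 1.0], size=(m, dj))  # sampled once, reused forever
    x = x0.copy()
    for _ in range(K):
        j = rng.integers(J)                        # pick a block uniformly
        idx = perm[j * dj:(j + 1) * dj]            # coordinates of block j
        f0 = E_f(x)
        y = np.empty(m)
        for i in range(m):                         # finite differences (3)
            xp = x.copy()
            xp[idx] += delta * Zdirs[i]
            y[i] = (E_f(xp) - f0) / (np.sqrt(m) * delta)
        g_hat = cosamp(Zdirs / np.sqrt(m), y, s_block, n_cosamp)
        x[idx] -= alpha * g_hat                    # block gradient step
    return x
```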
Theorem 3.1. Assume that f(x) satisfies Assumptions 1–6. Define the following constants:

$$L_{\max} = \max_j L_j, \qquad c_1 = 2JL_{\max}R^2(x_0), \qquad H = \max_j H_j.$$

Assume that $2\rho^{2n} + \frac{16\tau^2\sigma H}{c_1 L_{\max}} < 1$. Choose sparsity s ≥ s_exact, step size α = 1/L_max, and query radius δ = 2√(σ/H). Choose the number of CoSaMP iterations n and error tolerance ε such that:

$$\frac{c_1}{2}\left(2\rho^{2n} + \sqrt{4\rho^{4n} + \frac{16\tau^2\sigma H}{c_1\zeta L_{\max}}}\right) < \varepsilon < f(x_0) - f^*.$$

Then, with probability $1 - \zeta - O\!\left(\frac{J}{\varepsilon}e^{-s_{\mathrm{exact}}/J}\right)$, ZO-BCD finds an ε-optimal solution in Õ(s/ε) queries. ZO-BCD-R requires Õ(sd/J²) FLOPS per iteration and Õ(sd/J²) total memory, while ZO-BCD-RC requires only Õ(d/J) FLOPS per iteration and O(d/J) total memory.

Thus, up to logarithmic factors, ZO-BCD achieves the same query complexity as ZORO [11] with a much lower per-iteration computational footprint. Note that we pay for the improved computational and memory complexity of ZO-BCD-RC with a slightly worse theoretical query complexity (by a logarithmic factor), due to the requirements of Theorem 2.4. First-order block coordinate descent methods typically have a probability of success 1 − ζ; thus in switching to zeroth order we incur a penalty of O((J/ε)e^{−s_exact/J}). For truly huge problems this term is negligible. For smaller problems we find that randomly re-assigning the decision variables to blocks every J iterations is a good way to increase ZO-BCD's probability of success.

4 Adversarial attacks in wavelet domains

Adversarial attacks on neural-network-based classifiers are a popular application and benchmark for zeroth-order optimizers [8, 23–25]. Specifically, let F(x) ∈ [0, 1]^C denote the predicted probabilities returned by the model for input signal x. Then the goal is to find a small distortion δ such that the model's top-1 prediction on x + δ is no longer correct:

$$\operatorname*{argmax}_{c=1,\dots,C} F_c(x+\delta) \neq \operatorname*{argmax}_{c=1,\dots,C} F_c(x).$$

Because we only have access to the logits, F(x), not the internal workings of the model, we are unable to compute ∇F(x), and hence this problem is zeroth order. Recently, [11] showed that it is reasonable to assume that the attack loss function exhibits (approximate) gradient sparsity, and proposed generating adversarial examples by adding a distortion to the victim image that is sparse in the image pixel domain. We extend this and propose a novel sparse wavelet transform attack, which searches for an adversarial distortion δ* in the wavelet domain:

$$\delta^\star = \operatorname*{argmin}_{\delta}\; f(\mathrm{IWT}(\mathrm{WT}(x) + \delta)) + \lambda\|\delta\|_1, \tag{7}$$

where x is a given image/audio signal, f is the Carlini–Wagner loss function [8], WT is the chosen (discrete or continuous) wavelet transform, and IWT is the corresponding inverse wavelet transform. As wavelet transforms extract the important features of the data, we expect the gradients of this new loss function to be even sparser than those of the corresponding pixel-domain loss function [11, Figure 1]. Moreover, the inverse wavelet transform spreads the energy of the sparse perturbation, resulting in more natural-seeming attacked signals, as compared with pixel-domain sparse attacks [11, Figure 6].
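As a concrete, hypothetical instance of (7) for a single-channel image, the sketch below wraps a black-box classifier and a discrete wavelet transform into one objective that ZO-BCD can query. Here `model_probs`, the wavelet choice, λ, and κ are illustrative assumptions, and the Carlini–Wagner loss is written in its usual margin form.

```python
import numpy as np
import pywt

def make_wavelet_attack_loss(x, true_label, model_probs, wavelet='db45',
                             level=3, lam=1e-3, kappa=0.0):
    """Black-box objective: delta -> f(IWT(WT(x) + delta)) + lam * ||delta||_1."""
    coeffs = pywt.wavedec2(x, wavelet, level=level)
    w, slices = pywt.coeffs_to_array(coeffs)           # flatten WT(x)

    def loss(delta):
        w_adv = w + delta.reshape(w.shape)
        x_adv = pywt.waverec2(
            pywt.array_to_coeffs(w_adv, slices, output_format='wavedec2'),
            wavelet)                                   # IWT(WT(x) + delta)
        p = np.log(model_probs(x_adv) + 1e-12)         # one black-box query
        others = np.delete(p, true_label)
        cw = max(p[true_label] - others.max(), -kappa)  # Carlini-Wagner margin
        return cw + lam * np.abs(delta).sum()

    return loss
```

Minimizing this objective drives the true-label probability below that of some other class while keeping the wavelet-domain perturbation sparse.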
1. Sparse DWT attacks. The discrete wavelet transform (DWT) is a well-known method for data compression and denoising [26, 27]. In fact, much real-world media data is compressed and stored in the form of DWT coefficients (e.g. JPEG-2000 for images and Dirac for videos), so attacking in the wavelet domain is more direct in these cases. Since the DWT does not increase the problem dimension (this holds only with periodic boundary extension; if another boundary extension is used, the dimension of the wavelet coefficients increases slightly, depending on the size of the wavelet filters and the level of the wavelet transform), the query complexity of sparse wavelet-domain attacks is expected to be of the same order as that of sparse pixel-domain attacks. Moreover, one may restrict the attack to only the important (i.e. large) wavelet coefficients; we explore this further in Section 5. This can reduce the attack problem dimension substantially for typical image datasets. Nevertheless, for large, modern color images, this problem dimension can still be massive.

2. Sparse CWT attacks. For oscillatory signals, the continuous wavelet transform (CWT) with analytic wavelets is preferred [26, 28]. Unlike the DWT, the dimension of the CWT coefficients is much larger than the original signal dimension, which is the main challenge of CWT attacks. For example, attacking even one-second audio clips in a CWT domain results in a problem of size d > 1.7 million (see Section 5.3)!

Note that the idea of adversarial attacks on DWT coefficients was also proposed in [18], but they assume a white-box model and study only dense attacks on discrete Haar wavelets. [19] considers a "steganographic" attack, where the important wavelet coefficients of a target image are "hidden" within the wavelet transform coefficients of a victim image. We appear to be the first to connect (both discrete and continuous) wavelet transforms to sparse zeroth-order adversarial attacks.

5 Numerical experiments

In this section, we first present the empirical advantages of ZO-BCD on synthetic examples. Then, we demonstrate the performance of ZO-BCD in two real-world applications: (i) sparse DWT attacks on images, and (ii) sparse CWT attacks on audio signals. We compare the two versions of ZO-BCD (see Algorithm 2) against two venerable zeroth-order algorithms, FDSA [29] and SPSA [31] (we note that SPSA using Rademacher sample directions coincides with Random Search [30]), as well as two more recent contributions, ZO-SCD [8] and ZORO [11]. ZO-SCD is a zeroth-order (non-block) coordinate descent method, while ZORO uses a gradient estimator similar to ZO-BCD's, but uses it to compute the full gradient. We do not consider any methods that involve inverting a d × d matrix (such as CMA-ES [32]), as the goal here is to analyze the performance of zeroth-order algorithms on problems so large that this is not feasible. In Section 5.2, we consider ZO-SGD [33], a variance-reduced version of SPSA, as this has empirically shown better performance on that task than SPSA (variance reduction did little to improve the performance of SPSA in the experiments of Section 5.1). We also consider ZO-AdaMM [23], a zeroth-order method incorporating momentum. The numerical experiments in Section 5.1 were executed in Matlab 2020b on a laptop with an Intel i7-8750H and 32 GB of RAM.

5.1 Synthetic experiments

We study the performance of ZO-BCD with noisy oracles on the zeroth-order optimization problem minimize_{x∈R^d} f(x) for two selected objective functions:

(a) Sparse quadric: f(x) = x^T A x, where A is a diagonal matrix with s non-zero entries.

(b) Max-s-sum-squared function: f(x) = Σ_{i=1}^s x²_{m_i}, where x_{m_i} is the i-th largest-in-magnitude entry of x. Obviously, this problem is more complicated, since the indices m_i change with x.

We use d = 20,000 and s = 200 in both problems, so they have high ambient dimension with sparse gradients. (A code sketch of both objectives follows Figure 2.)

As can be seen in Figures 2a and 2b, both versions of ZO-BCD effectively exploit the gradient sparsity and have very competitive performance in terms of queries. In particular, ZO-BCD converges more stably than the state-of-the-art ZORO on the max-s-sum-squared problem, while its computational and memory complexities are much lower. Note that SPSA's query efficiency is roughly of the same order as that of ZO-BCD and ZORO when the gradient support is unchanging (Figure 2a); however, it is significantly worse when the gradient support is allowed to change (Figure 2b).

Figure 2: Function values vs. queries for ZO-BCD (-R and -RC) and four other representative zeroth-order methods, on (a) the sparse quadric and (b) the max-s-sum-squared function. ZO-BCD is fast and stable (while running faster with less memory).
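For concreteness, the two synthetic objectives and a noisy oracle can be written in a few lines; the noise level is an illustrative stand-in (Assumption 4 requires bounded noise, while the settings in Appendix D.1 use Gaussian noise).

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, sigma = 20_000, 200, 1e-4      # sigma is an illustrative noise level

# (a) Sparse quadric: x^T A x with diagonal A supported on s random entries.
diag = np.zeros(d)
diag[rng.choice(d, size=s, replace=False)] = 1.0
def sparse_quadric(x):
    return np.sum(diag * x**2)

# (b) Max-s-sum-squared: sum of squares of the s largest-in-magnitude entries.
def max_s_sum_squared(x):
    top = np.partition(np.abs(x), -s)[-s:]
    return np.sum(top**2)

def noisy_oracle(f):
    # bounded noise |xi| <= sigma, as in Assumption 4
    return lambda x: f(x) + sigma * rng.uniform(-1.0, 1.0)
```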
Computational complexity. Since ZO-BCD and ZORO are the only methods with competitive convergence rates in the query-complexity experiments above, we only compare their computational complexities. We record the runtime of ZO-BCD and ZORO for solving the sparse quadric with a fixed stopping condition on the objective value. Note that the runtime of each query can vary considerably between problems, so we count only the empirical runtime, excluding the time spent on queries. The experimental results are presented in Figure 3, where one can see that ZO-BCD-R has a significant speed advantage over ZORO, and ZO-BCD-RC is even faster.

Figure 3: Runtime vs. problem dimension for ZO-BCD (-R and -RC) and ZORO.

5.2 Sparse DWT attacks on images

We consider a wavelet-domain, untargeted, per-image attack on the ImageNet dataset [34] with the pre-trained Inception-v3 model [35], as discussed in Section 4. We use the famous 'db45' wavelet [36] with a 3-level DWT in these attacks. Empirical performance was evaluated by attacking randomly selected ImageNet pictures that were initially classified correctly. In addition to full wavelet-domain attacks using both ZO-BCD-R and ZO-BCD-RC, we also experiment with attacking only the large wavelet coefficients, i.e. the important components of the images in terms of the wavelet basis. In particular, when we choose to attack only wavelet coefficients above a magnitude threshold, the problem dimension is reduced substantially for the tested images; nevertheless, the attack problem dimension remains large, so ZO-BCD is still well suited to this attack problem.

The results are summarized in Table 1. We find that all three versions of the ZO-BCD wavelet attack beat the other state-of-the-art methods in both attack success rate and ℓ2 distortion, and the large-coefficients-only variant (i.e. attacking only wavelet coefficients above a magnitude threshold) achieves this while operating in a reduced dimension.

Table 1: Results of untargeted image adversarial attacks from various algorithms. Attack success rate (ASR), average final ℓ0 distortion (as a percentage of the total pixel number), average final ℓ2 distortion (in the pixel domain), average final ℓ2 distortion in the frequency (wavelet) domain, and average iterations and number of queries to first successful attack, for different zeroth-order attack methods. ZO-BCD-R(large coeff.) stands for applying ZO-BCD-R to attack only large wavelet coefficients.

Methods                  ASR    ℓ2 dist   Queries
ZO-SCD                   78%
ZO-AdaMM                 81%
ZO-BCD-R
ZO-BCD-RC
ZO-BCD-R(large coeff.)
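The large-coefficient restriction is straightforward to implement: compute the DWT once, record which coefficients exceed a threshold, and let the optimizer act only on that index set. A hypothetical sketch (the threshold and wavelet settings are placeholders):

```python
import numpy as np
import pywt

def large_coeff_mask(x, wavelet='db45', level=3, thresh=0.1):
    """Indices of 'important' DWT coefficients; the attack acts only on these."""
    w, slices = pywt.coeffs_to_array(pywt.wavedec2(x, wavelet, level=level))
    idx = np.flatnonzero(np.abs(w) >= thresh)   # placeholder threshold
    return w, slices, idx

def embed(delta_small, w, idx):
    """Lift a perturbation of the masked coefficients back to the full domain."""
    delta = np.zeros(w.size)
    delta[idx] = delta_small
    return delta.reshape(w.shape)
```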
5.3 Sparse CWT attacks on audio signals

We consider targeted per-clip audio adversarial attacks on the SpeechCommands dataset [37], which consists of one-second audio clips, each containing a one-word voice command, e.g. "yes" or "left". The audio sampling rate is 16 kHz; thus, each clip is a 16,000-dimensional real-valued vector. Adversarial attacks against this dataset have been considered in [10, 38, 39] and [40], although, with the exception of [10], all of the cited works consider a white-box setting. (There are other subtle differences between the threat models considered in these works and ours; we discuss this further in Appendix E.2.) The victim model is a pre-trained, 5-layer convolutional network called commandNet [41]. The architecture is essentially as proposed in [42]. It takes as input the Bark spectrum coefficients of a given audio clip, a transform closely related to the Mel frequency transform. The test classification accuracy of this model (on un-attacked audio clips) is high.

We use the Morse [43] continuous wavelet transform, which inflates each 16,000-sample clip to a problem dimension of over 1.7 million. As discussed in [44], the appropriate measure of the size of the attacking distortion δ is its relative loudness:

$$\mathrm{dB}_x(\delta) := 20\left(\max_i \log_{10}(|\delta_i|) - \max_i \log_{10}(|x_i|)\right).$$

The results of our attack are detailed in Table 2 and Figures 5a and 5b. Overall, we achieve a state-of-the-art targeted attack success rate, and our attacking distortions have a mean relative loudness well below 0 dB. As can be seen, our proposed attack comfortably exceeds the state of the art in attack success rate (ASR), surpassing even white-box attacks! This is not to claim that our proposed method is strictly better than the others, as there are multiple factors to consider when judging the "goodness" of an attack (ASR, attack distortion, universality, etc.). We discuss this further in Appendix E.2. The attacking noise can be heard as a slight "hiss", or white noise, in the attacked audio clips. The original keyword, however, is easy for a human listener to make out. We encourage the reader to listen to a few examples, available at https://github.com/YuchenLou/ZO-BCD.

Figure 4: Examples of wavelet-attacked images by ZO-BCD-R and by ZO-BCD-R with the attack restricted to large coefficients only, with true and mis-classified labels: (a) ZO-BCD-R: "barbershop" → "flagpole"; (b) ZO-BCD-R (large coeff.): "barbershop" → "flagpole"; (c) ZO-BCD-R: "dumbbell" → "computer keyboard"; (d) ZO-BCD-R (large coeff.): "dumbbell" → "computer keyboard".

Figure 5: Detailed results for targeted audio wavelet attacks, over all source/target pairs from the labels {down, go, left, no, off, on, right, stop, up, yes}: (a) attack success rate; (b) relative loudness, in decibels.

Out of curiosity, we also tested crafting untargeted adversarial attacks using ZO-BCD in the time domain (i.e. without using a wavelet transform) for 1000 randomly selected audio clips. The results are underwhelming; indeed, the attacking perturbation is on average significantly louder than the victim audio clip (see Table 3)! This suggests that attacking in a wavelet domain is much more effective than attacking in the original signal domain.
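The relative-loudness measure defined above is one line of NumPy; a small sketch:

```python
import numpy as np

def relative_loudness_db(x, delta):
    """dB_x(delta) = 20 * (log10 max|delta_i| - log10 max|x_i|).
    Negative values mean the distortion is quieter than the signal."""
    return 20.0 * (np.log10(np.max(np.abs(delta)))
                   - np.log10(np.max(np.abs(x))))
```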
Table 2: Results of adversarial attacks on the SpeechCommands data set.

Methods                   ASR    Univ.   Black-box   Targeted
Alzantot [10]                    No      Yes         Yes
Vadillo & Santana [38]           Yes     No          No
Li et al. [39]                   Yes     No          Yes
Xie et al. [40]                  No      No          Yes
Proposed                         No      Yes         Yes

Table 3: Results for untargeted audio wavelet attacks. Attack success rate (ASR), average final decibel distortion, and average number of queries to first successful attack, for untargeted zeroth-order attacks in the time domain, in the wavelet domain using a step size of 0.02, and in the wavelet domain using a step size of 0.05.

Domains           ASR    dB dist   Queries
Time
Wavelet (0.02)
Wavelet (0.05)

6 Conclusion

We have introduced ZO-BCD, a novel zeroth-order optimization algorithm. ZO-BCD enjoys strong, albeit probabilistic, convergence guarantees. We have also introduced a new paradigm in adversarial attacks on classifiers: the sparse wavelet-domain attack. On medium-scale test problems the performance of ZO-BCD matches or exceeds that of state-of-the-art zeroth-order optimization algorithms, as predicted by theory. Moreover, the low per-iteration computational and memory requirements of ZO-BCD mean that it can tackle huge-scale problems that are intractable for other zeroth-order algorithms. We demonstrate this by successfully using ZO-BCD to craft adversarial examples, for both image and audio classifiers, in wavelet domains where the problem size can exceed 1.7 million.
References

[1] B. Reeja-Jayan, Katharine L. Harrison, K. Yang, Chih-Liang Wang, A. E. Yilmaz, and Arumugam Manthiram. Microwave-assisted low-temperature growth of thin films in solution. Scientific Reports, 2(1):1–8, 2012.
[2] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762. PMLR, 2014.
[3] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012.
[4] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
[5] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[6] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 1805–1814, 2018.
[7] Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Deepali Jain, Yuxiang Yang, Atil Iscen, Jasmine Hsu, and Vikas Sindhwani. Provably robust blackbox optimization for reinforcement learning. In Conference on Robot Learning, pages 683–696. PMLR, 2020.
[8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26, 2017.
[9] Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. arXiv preprint arXiv:1606.00498, 2016.
[10] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did you hear that? Adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554, 2018.
[11] HanQin Cai, Daniel McKenzie, Wotao Yin, and Zhenliang Zhang. Zeroth-order regularized optimization (ZORO): Approximately sparse gradients and adaptive sampling. arXiv preprint arXiv:2003.13001, 2020.
[12] Kevin G. Jamieson, Robert D. Nowak, and Benjamin Recht. Query complexity of derivative-free optimization. arXiv preprint arXiv:1209.2434, 2012.
[13] Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. In International Conference on Artificial Intelligence and Statistics, pages 1356–1365, 2018.
[14] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Advances in Neural Information Processing Systems, pages 3455–3464, 2018.
[15] Daniel Golovin, John Karro, Greg Kochanski, Chansoo Lee, Xingyou Song, and Qiuyi Zhang. Gradientless descent: High-dimensional zeroth-order optimization. In International Conference on Learning Representations, 2019.
[16] Moustafa Alzantot, Yash Sharma, Supriyo Chakraborty, Huan Zhang, Cho-Jui Hsieh, and Mani B. Srivastava. GenAttack: Practical black-box attacks with gradient-free optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1111–1119, 2019.
[17] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. Targeted adversarial examples for black box audio systems. In 2019 IEEE Security and Privacy Workshops (SPW), pages 15–20. IEEE, 2019.
[18] Divyam Anshumaan, Akshay Agarwal, Mayank Vatsa, and Richa Singh. WaveTransform: Crafting adversarial examples via input decomposition. In Computer Vision – ECCV 2020 Workshops, pages 152–168. Springer, Cham, 2020.
[19] Salah Ud Din, Naveed Akhtar, Shahzad Younis, Faisal Shafait, Atif Mansoor, and Muhammad Shafique. Steganographic universal adversarial perturbations. Pattern Recognition Letters, 135:146–152, 2020.
[20] Rachael Tappenden, Peter Richtárik, and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. Journal of Optimization Theory and Applications, 170(1):144–176, 2016.
[21] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[22] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
[23] Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. In Advances in Neural Information Processing Systems, pages 7204–7215, 2019.
[24] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137–2146. PMLR, 2018.
[25] Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: a few pixels make a big difference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9087–9096, 2019.
[26] Stéphane Mallat. A Wavelet Tour of Signal Processing. Elsevier, 1999.
[27] Jian-Feng Cai, Bin Dong, Stanley Osher, and Zuowei Shen. Image restoration: total variation, wavelet frames, and beyond. Journal of the American Mathematical Society, 25(4):1033–1089, 2012.
[28] Jonathan M. Lilly and Sofia C. Olhede. On the analytic wavelet transform. IEEE Transactions on Information Theory, 56(8):4135–4156, 2010.
[29] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[30] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
[31] James C. Spall. An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Technical Digest, 19(4):482–492, 1998.
[32] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
[33] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[34] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[36] Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992.
[37] Pete Warden. Speech Commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
[38] Jon Vadillo and Roberto Santana. Universal adversarial examples in speech command classification. arXiv preprint arXiv:1911.10182, 2019.
[39] Zhuohang Li, Yi Wu, Jian Liu, Yingying Chen, and Bo Yuan. AdvPulse: Universal, synchronization-free, and targeted audio adversarial attacks via subsecond perturbations. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 1121–1134, 2020.
[40] Yi Xie, Zhuohang Li, Cong Shi, Jian Liu, Yingying Chen, and Bo Yuan. Enabling fast and universal audio adversarial attack using generative model. arXiv preprint arXiv:2004.12261, 2020.
[41] The MathWorks, Inc. Speech command recognition using deep learning, 2020. Accessed: 2021-01-15.
[42] Tara N. Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[43] Sofia C. Olhede and Andrew T. Walden. Generalized Morse wavelets. IEEE Transactions on Signal Processing, 50(11):2661–2670, 2002.
[44] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7. IEEE, 2018.
[45] Felix Krahmer, Shahar Mendelson, and Holger Rauhut. Suprema of chaos processes and the restricted isometry property. Communications on Pure and Applied Mathematics, 67(11):1877–1904, 2014.
[46] Shahar Mendelson, Holger Rauhut, and Rachel Ward. Improved bounds for sparse recovery from subsampled random convolutions. The Annals of Applied Probability, 28(6):3491–3527, 2018.
[47] Meng Huang, Yuxuan Pang, and Zhiqiang Xu. Improved bounds for the RIP of subsampled circulant matrices. arXiv preprint arXiv:1808.07333, 2018.
[48] Dimitri P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
[49] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured visual and speech recognition models with adversarial examples. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6980–6990, 2017.
[50] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
A Proofs for Section 2
Proof of Theorem 2.1.
Consider the function f^(j)(t) = f(x + U^(j)t). This function has gradient g^(j) and Hessian ∇²_jj f. By Assumption 5 we have that ‖g^(j)‖₀ ≤ ‖g‖₀ ≤ s, while from Assumption 6 we have that ‖∇²_jj f‖₁ ≤ H_j. Thus, f^(j)(t) satisfies all the assumptions of Theorem 2.6 and Corollary 2.7 of [11], and hence Theorem 2.1 follows from the conclusions of these two results. ∎
For notational convenience, we will temporarily let s := s_exact. Let g_{i₁}, ..., g_{i_s} denote the non-zero entries of g, after the permutation has been applied. Let Y_j be a random variable counting the number of g_{i_k} within block j, that is, Y_j = |{i_k : i_k in block j}|. Thus, the random vector Y = (Y₁, ..., Y_J) ∈ R^J follows a multinomial distribution. Observe that Σ_{j=1}^J Y_j = s and, because the blocks are equally sized, E(Y_j) = s/J.

By Chernoff's bound, for ∆ > 0 and each Y_j individually we have:

$$\mathbb{P}\left[\left|Y_j - \frac{s}{J}\right| \ge \Delta\frac{s}{J}\right] \le 2e^{-\frac{\Delta^2 \mathbb{E}(Y_j)}{3}} = 2e^{-\frac{\Delta^2 s}{3J}}.$$

Then by applying the union bound over the Y_j, we have:

$$\mathbb{P}\left[\exists\, j \text{ s.t. } \left|Y_j - \frac{s}{J}\right| \ge \Delta\frac{s}{J}\right] \le \sum_{j=1}^{J}\mathbb{P}\left[\left|Y_j - \frac{s}{J}\right| \ge \Delta\frac{s}{J}\right] \le 2Je^{-\frac{\Delta^2 s}{3J}},$$

and so:

$$\mathbb{P}\left[\left|Y_j - \frac{s}{J}\right| \le \Delta\frac{s}{J},\ \forall j \in \{1,\dots,J\}\right] = 1 - \mathbb{P}\left[\exists\, j \text{ s.t. } \left|Y_j - \frac{s}{J}\right| \ge \Delta\frac{s}{J}\right] \ge 1 - 2Je^{-\frac{\Delta^2 s}{3J}}.$$

This finishes the proof. ∎
Proof of Corollary 2.3.
From Theorem 2.2 (with ∆ = 0.5) we have that, with probability at least 1 − 2J exp(−s_exact/(12J)), ‖g^(j)‖₀ ≤ 1.5 s_exact/J. Assuming this is true, (5) holds with probability 1 − (s/d)^{b₂s}, by tracing the arguments of Theorem 2.6 and Corollary 2.7 of [11] again. These events are independent, thus the probability that they both occur is:

$$\left(1 - 2J\exp\!\left(-\frac{s_{\mathrm{exact}}}{12J}\right)\right)\left(1 - (s/d)^{b_2 s}\right).$$

Because s_exact ≪ d, the term proportional to exp(−s_exact/J) is significantly larger than (s/d)^{b₂s}. Expanding, and keeping only dominant terms, we see that this probability equals 1 − O(J exp(−s_exact/J)). We emphasize that Corollary 2.3 holds for all j. ∎

Before proceeding, we remind the reader that for fixed s the restricted isometry constant of Z ∈ R^{m×n} is defined as the smallest δ_s > 0 such that:

$$(1 - \delta_s)\|v\|_2^2 \le \|Zv\|_2^2 \le (1 + \delta_s)\|v\|_2^2$$

for all v ∈ R^n satisfying ‖v‖₀ ≤ s. The key ingredient in the proof of Theorem 2.4 is the following result:
Theorem A.1 ([45, Theorem 1.1]). Let z ∈ R^n be a Rademacher random vector and choose a random subset Ω = {j₁, ..., j_m} ⊂ {1, ..., n} of cardinality m = cδ⁻² s log²(s) log²(n). Let Z ∈ R^{m×n} denote the matrix with rows (1/√m) C_{j_i}(z). Then δ_s(Z) < δ with probability 1 − n^{−log(n) log²(s)}.

Note that c is a universal constant, independent of s, n and δ. Similar results may be found in [46] and [47]. In [45] a more general version of this theorem is provided, which allows the entries of z to be drawn from any sub-Gaussian distribution.
As before, from Theorem 2.2 we have that ‖g^(j)‖₀ ≤ 1.5 s_exact/J with probability at least 1 − 2J exp(−s_exact/(12J)). Assuming this holds, let Z be the sensing matrix with rows (1/√m) C_{j_i}(z). By appealing to Theorem A.1 with sparsity level a constant multiple of s_exact/J and n = d/J, we get that the restricted isometry constant of Z is small enough for the CoSaMP analysis, with probability 1 − (d/J)^{−log(d/J) log²(s)}. Once this restricted isometry property has been guaranteed, Theorem 2.4 follows by the same proof as in [11, Corollary 2.7]. As argued in the proof of Corollary 2.3, the events ‖g^(j)‖₀ ≤ 1.5 s_exact/J and δ_s(Z) ≤ δ are independent, hence they both occur with probability:

$$\left(1 - 2J\exp\!\left(-\frac{s_{\mathrm{exact}}}{12J}\right)\right)\left(1 - (d/J)^{-\log(d/J)\log^2(s)}\right).$$

Again, the term proportional to exp(−s_exact/J) dominates. ∎

B ZO-BCD for unequally-sized blocks
Using randomly assigned, equally sized blocks is an important part of the ZO-BCD framework, as it allows one to consider a block sparsity of ≈ s/J instead of the worst-case sparsity of s. Nevertheless, there may be situations where it is preferable to use user-defined, unequally sized blocks. For such cases we recommend the following (we describe the modifications for ZO-BCD-R, but with obvious changes they also apply to ZO-BCD-RC). Let s^(j) ≤ s be an upper estimate of the sparsity of the j-th block gradient: ‖g^(j)(x)‖₀ ≤ s^(j). Let m^(j) = b₁ s^(j) log(d/J) (and use the analogous formula for ZO-BCD-RC) and define m_max = max_j m^(j). When initializing ZO-BCD-R, generate m_max Rademacher random vectors z₁, ..., z_{m_max} ∈ R^{d_max}, where d_max = max_j d^(j). At each iteration, if block j is selected, randomly select i₁, ..., i_{m^(j)} from 1, ..., m_max, and for k = 1, ..., m^(j) let z̃_{i_k} ∈ R^{d^(j)} denote the vector formed by taking the first d^(j) components of z_{i_k}. Use {z̃_{i_k}}_{k=1}^{m^(j)} as the input to Algorithm 1. Note that the {z̃_{i_k}}_{k=1}^{m^(j)} possess the same statistical properties as the {z_{i_k}}_{k=1}^{m^(j)} (i.e. they are i.i.d. Rademacher vectors), so using them as sampling directions will result in the same bound on ‖g_k^(j) − ĝ_k^(j)‖₂ as Corollary 2.3.
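A sketch of this direction-reuse trick for unequal blocks; the block sizes and sample counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_blocks = [4096, 1024, 2048]     # user-defined block sizes d^(j)
m_blocks = [64, 32, 48]           # per-block sample counts m^(j)
d_max, m_max = max(d_blocks), max(m_blocks)

Zbank = rng.choice([-1.0, 1.0], size=(m_max, d_max))  # generated once

def directions_for_block(j):
    """Pick m^(j) random bank rows and truncate to the first d^(j) entries."""
    rows = rng.choice(m_max, size=m_blocks[j], replace=False)
    return Zbank[rows, :d_blocks[j]]   # still i.i.d. Rademacher vectors
```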
C Proofs for Section 3

Our proof utilizes the main result of [20]. Note that this paper requires that α_k = 1/L_{j_k}, i.e. the step size at the k-th iteration is inversely proportional to the block Lipschitz constant. This is certainly ideal, but impractical. In particular, if the blocks are randomly selected, it seems implausible that one would have good estimates of the L_j. Of course, since L_j ≤ L_max, we observe that L_max is a Lipschitz constant for every block, and thus we may indeed take α_k = α = 1/L_max. This results in slightly slower convergence, reflected in a factor of L_max in Theorem 3.1. Throughout this section we shall assume that f(x) satisfies Assumptions 1–6. For all j = 1, ..., J, define:

$$V_j(x, t) = \langle g^{(j)}(x), t\rangle + \frac{L_{\max}}{2}\|t\|_2^2,$$

so that, by Lipschitz differentiability, f(x_k + U^(j)t) ≤ f(x_k) + V_j(x_k, t). Define t*_{k,j} := argmin_t V_j(x_k, t) = −(1/L_max) g_k^(j), while we let t'_{k,j} be the update step taken by ZO-BCD, i.e. t'_{k,j} = −(1/L_max) ĝ_k^(j).

Lemma C.1.
Suppose that f satisfies Assumptions 1–3. Then

$$f(x) - f^* \ge \frac{1}{2L_{\max}}\|g^{(j)}(x)\|_2^2$$

for any x ∈ R^d and any j = 1, ..., J.

Proof. Define h_x : R^{d^(j)} → R as h_x(t) := f(x + U^(j)t), where U^(j) is as described in Section 1. Since U^(j) is a linear transformation and f(x) is convex, h_x is also convex. By Assumption 1 and the fact that L_j ≤ L_max, h_x is L_max-Lipschitz differentiable. From Assumption 3 it follows that Y* = argmin_t h_x(t) is non-empty, and that h*_x := min_t h_x(t) ≥ f*. Thus, from [48, Proposition B.3, part (c.ii)], we have:

$$h_x(t) - h_x^* \ge \frac{1}{2L_{\max}}\|\nabla h_x(t)\|_2^2 = \frac{1}{2L_{\max}}\|g^{(j)}(x + U^{(j)}t)\|_2^2$$

for all t. Choose t = 0, and use the facts that h_x(0) = f(x) and f* ≤ h*_x to obtain:

$$f(x) - f^* \ge h_x(0) - h_x^* \ge \frac{1}{2L_{\max}}\|g^{(j)}(x)\|_2^2.$$

This finishes the proof. ∎
Lemma C.2.
Let η = 2ρ^{2n} and θ = 4τ²σH/L_max. Then for each iteration of ZO-BCD we have that:

$$V_j(x_k, t') - V_j(x_k, t^*) \le \eta(f(x_k) - f^*) + \theta \tag{8}$$

with probability 1 − O(J exp(−s_exact/J)).

Proof. For notational convenience, we define t* := t*_{k,j}, t' := t'_{k,j} and e_k^(j) := ĝ_k^(j) − g_k^(j). By definition, t'_{k,j} = −(1/L_max) ĝ_k^(j) = −(1/L_max)(g_k^(j) + e_k^(j)). Moreover:

$$V_j(x_k, t^*) = -\frac{1}{2L_{\max}}\|g_k^{(j)}\|_2^2, \qquad V_j(x_k, t') = -\frac{1}{2L_{\max}}\|g_k^{(j)}\|_2^2 + \frac{1}{2L_{\max}}\|e_k^{(j)}\|_2^2,$$

and hence:

$$V_j(x_k, t') - V_j(x_k, t^*) \le \frac{1}{2L_{\max}}\|e_k^{(j)}\|_2^2. \tag{9}$$

Because H_j ≤ H for all j, from Corollary 2.3 (for ZO-BCD-R) or Theorem 2.4 (for ZO-BCD-RC) we have that:

$$\|e_k^{(j)}\|_2^2 \le \left(\rho^n\|g_k^{(j)}\|_2 + 2\tau\sqrt{\sigma H}\right)^2 \le 2\rho^{2n}\|g_k^{(j)}\|_2^2 + 8\tau^2\sigma H \tag{10}$$

with probability 1 − O(J exp(−s_exact/J)). Finally, from Lemma C.1 we get that ‖g_k^(j)‖₂² ≤ 2L_max(f(x_k) − f*) for any j = 1, ..., J. Combining this estimate with (9) and (10) we obtain:

$$V_j(x_k, t') - V_j(x_k, t^*) \le \underbrace{2\rho^{2n}}_{=\eta}(f(x_k) - f^*) + \underbrace{\frac{4\tau^2\sigma H}{L_{\max}}}_{=\theta}.$$

This finishes the proof. ∎
Proof of Theorem 3.1.
Let p_j denote the probability that the j-th block is chosen for updating at the k-th iteration. Because ZO-BCD chooses blocks uniformly at random, p_j = 1/J for all j. If (8) holds for all k then, by [20, Theorem 6.1], if:

$$\eta + \frac{4\theta}{c_1} < 1, \quad \text{where } c_1 = 2JL_{\max}R^2(x_0),$$

$$\frac{c_1}{2}\left(\eta + \sqrt{\eta^2 + \frac{4\theta}{c_1\zeta}}\right) < \varepsilon < f(x_0) - f^*,$$

$$u := \frac{c_1}{2}\left(\eta + \sqrt{\eta^2 + \frac{4\theta}{c_1}}\right),$$

$$K := \frac{c_1}{\varepsilon - u} + \frac{c_1}{\varepsilon - \eta c_1}\log\left(\frac{\varepsilon - \frac{\theta c_1}{\varepsilon - \eta c_1}}{\varepsilon\zeta - \frac{\theta c_1}{\varepsilon - \eta c_1}}\right) + 2,$$

then P(f(x_K) − f* ≤ ε) ≥ 1 − ζ. Note that:

- Our η and θ are α and β in their notation.
- In [20], c₁ = 2R²_{ℓp−}(x₀), where R_{ℓp−}(x₀) is defined as in (2) but using a norm ‖·‖_{ℓp−} instead of ‖·‖₂. These norms are related as:

$$\|x\|_{\ell_p^-}^2 = \sum_{j=1}^J \frac{L_j}{p_j}\|x^{(j)}\|_2^2 \overset{(a)}{=} \sum_{j=1}^J JL_{\max}\|x^{(j)}\|_2^2 = JL_{\max}\|x\|_2^2,$$

where (a) follows as p_j = 1/J for all j and we are taking L_j = L_max. Hence, c₁ = 2JL_max R²(x₀) as stated.

Replace η and θ with the expressions given by Lemma C.2 to obtain the expressions given in the statement of Theorem 3.1. Because (8) holds with probability 1 − O(J exp(−s_exact/J)) at each iteration, by the union bound it holds for all K iterations with probability:

$$1 - K \cdot O\!\left(J\exp\!\left(-\frac{s_{\mathrm{exact}}}{J}\right)\right) = 1 - O\!\left(\frac{J}{\varepsilon}\exp\!\left(-\frac{s_{\mathrm{exact}}}{J}\right)\right).$$

Applying the union bound again we obtain:

$$\mathbb{P}(f(x_K) - f^* \le \varepsilon) \ge (1-\zeta)\left(1 - O\!\left(\frac{J}{\varepsilon}\exp\!\left(-\frac{s_{\mathrm{exact}}}{J}\right)\right)\right) = 1 - \zeta - O\!\left(\frac{J}{\varepsilon}\exp\!\left(-\frac{s_{\mathrm{exact}}}{J}\right)\right).$$

ZO-BCD-R makes m = 1.5 b₁ (s/J) log(d/J) queries per iteration, and hence makes

$$mK = O\!\left(\frac{s\log(d/J)}{\varepsilon}\right) = \tilde{O}\!\left(\frac{s}{\varepsilon}\right)$$

queries in total. On the other hand, ZO-BCD-RC makes m = 1.5 b₃ (s/J) log²(s/J) log²(d/J) queries per iteration. A similar calculation reveals that:

$$mK = O\!\left(\frac{s\log^2(s/J)\log^2(d/J)}{\varepsilon}\right) = \tilde{O}\!\left(\frac{s}{\varepsilon}\right).$$

The computational cost of each iteration is dominated by solving the sparse recovery problem using CoSaMP. It is known [22] that CoSaMP requires O(nT) FLOPS, where n is the number of CoSaMP iterations and T is the cost of a matrix-vector multiply by Z. For ZO-BCD-R, Z ∈ R^{m×(d/J)} is dense and unstructured, hence:

$$T = O\!\left(m\frac{d}{J}\right) = O\!\left(\frac{s}{J}\log(d/J)\,\frac{d}{J}\right) = \tilde{O}\!\left(\frac{sd}{J^2}\right).$$

As noted in [22], n may be taken to be O(1) (in all our experiments we take n ≤ 30). For ZO-BCD-RC, as Z is a partial circulant matrix, we may exploit a fast matrix-vector multiplication via the fast Fourier transform to reduce this complexity to O((d/J) log(d/J)) = Õ(d/J).

Finally, we note that for both variants the memory complexity of ZO-BCD is dominated by the cost of storing Z. Again, as Z is dense and unstructured in ZO-BCD-R, there are no tricks one can exploit here, so the storage cost is m(d/J) = Õ(sd/J²). For ZO-BCD-RC, instead of storing the entire partial circulant matrix Z, one just needs to store the generating vector z ∈ R^{d/J} and the index set Ω. Assuming we allocate 32 bits per integer, this requires:

$$\frac{d}{J} + 32\,b_3\frac{s}{J}\log^2\!\left(\frac{s}{J}\right)\log^2\!\left(\frac{d}{J}\right) = O\!\left(\frac{d}{J}\right)$$

memory. This finishes the proof. ∎
D Experimental setup details
In this section, we provide the detailed experimental settings for the numerical results presented in Section 5. Furthermore, we implemented both a Python version and a Matlab version of ZO-BCD; both implementations can be found online at https://github.com/YuchenLou/ZO-BCD.

D.1 Settings for synthetic experiments
For both synthetic examples, we use problem dimension d = 20,000 and gradient sparsity s = 200. Moreover, we use additive Gaussian noise with small variance in the oracles. The sampling radius is chosen to be small and identical for all tested algorithms. For ZO-BCD, we use equally sized blocks with a uniform step size, and the per-block sparsity is set to s_block = 42 (≈ 1.5 s/J). Furthermore, the block gradient estimation step runs at most n = 10 iterations of CoSaMP. For the other tested algorithms, we hand-tune the parameters for their best performance, and the same step sizes are used when applicable. Note that SPSA must use a very small step size on the max-s-sum-squared problem, or it will diverge.

Re-shuffling the blocks.
Note that the max-s-sum-squared function does not satisfy the Lipschitz differentiability condition (i.e. Assumption 1). Moreover, the gradient support changes, making this an extremely difficult zeroth-order problem. To overcome the difficulty of discontinuous gradients, we apply an additional step that re-shuffles the blocks every J iterations. This re-shuffling trick is not required for problems that satisfy our assumptions; nevertheless, we observe very similar convergence behavior, with slightly more queries, when the re-shuffling step is applied to problems that do satisfy the aforementioned assumptions.

D.2 Settings for sparse DWT attacks on images
We allow the same total query budget for all tested algorithms in each image attack; i.e., an attack is considered a failure if it cannot change the label within the budget. We use a 3-level 'db45' wavelet transform. All the results presented in Section 5.2 use the half-point symmetric boundary extension; hence the wavelet domain has a dimension slightly larger than the original pixel-domain dimension. For the interested reader, a discussion and more results for other boundary extensions can be found in Appendix E. For all variations of ZO-BCD, we choose a fixed block size with per-block sparsity s_block = 10; thus m = 52 queries are used per iteration. The sampling radius is small, the block gradient estimation step runs at most n = 30 CoSaMP iterations, and the step size is hand-tuned. (We note that using a line search for each block would maximize the advantage of block coordinate descent algorithms such as ZO-BCD, but we did not do so here for fairness.)

D.3 Settings for sparse CWT attacks on audio signals

For both targeted and untargeted CWT attacks, we use Morse wavelets, which significantly enlarges the problem dimension from 16,000 to over 1.7 million per clip. In the attack, we choose a fixed block size with per-block sparsity s_block = 9; thus m = 52 queries are used per iteration. The sampling radius δ is small. The block gradient estimation step runs at most n = 30 CoSaMP iterations. In targeted attacks the step size is fixed, and Table 3 specifies the step sizes used in untargeted attacks. In Table 3, we also include the results of a time-domain attack for comparison. The parameter settings are the same as the aforementioned untargeted CWT attack settings, but we have to reduce the step size for stability. Also, note that the problem dimension is much smaller in the original audio domain, so the number of blocks is much smaller while we keep the same block size.

E More experimental results and discussion
E More experimental results and discussion

E.1 Sparse DWT attacks on images
Periodic extension.
As mentioned, when we use boundary extensions other than periodic extension, the dimension of the wavelet coefficients increases, depending on the size of the wavelet filters and the level of the wavelet transform. More precisely, wavelets with larger support and/or deeper levels of DWT result in a larger increase in dimensionality. On the other hand, using periodic extension does not increase the dimension of the wavelet domain. We provide test results for both boundary extensions in this section for the interested reader.
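The effect of the boundary extension on dimensionality is easy to check numerically, for example with PyWavelets (the input size below is a hypothetical placeholder, for illustration only):

```python
import numpy as np
import pywt

img = np.zeros((256, 256))  # hypothetical image size, for illustration only
for mode in ("periodization", "symmetric"):
    # pywt may warn that 3 levels is high for a filter as long as db45's;
    # the decomposition still runs.
    coeffs = pywt.wavedec2(img, "db45", mode=mode, level=3)
    arr, _ = pywt.coeffs_to_array(coeffs)
    print(mode, arr.size)   # periodization: exactly 256*256; symmetric: strictly larger
```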
Compressed attack.
As discussed in Section 4, the DWT is widely used in data compression, such as JPEG-2000 for images. In practice, the compressed data are often saved in the form of sparse (and larger) DWT coefficients after the smaller ones have been zeroed out. While we have already tested an attack on the larger DWT coefficients only (see Section 5.2), it is also of interest to test an attack after compression. That is, we first zero out the smaller wavelet coefficients (abs ≤ .), and then attack only the remaining, larger coefficients. We use the aforementioned parameter settings for these two new attacks. We present the results in Table 4, and for completeness we also include the results already presented in Table 1. One can see that ZO-BCD-R(compressed) achieves a higher attack success rate and lower ℓ distortion, exceeding the prior state of the art as presented in Table 1. We caution that this is not an exactly fair comparison with prior works, however, as the compression step modifies the image before the attack begins.
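The compression step itself is a hard threshold on the (flattened) wavelet coefficients; a minimal sketch, where `tau` is a hypothetical stand-in for the threshold used above:

```python
import numpy as np

def compress_coeffs(w, tau):
    """Zero out small wavelet coefficients; attack only the survivors."""
    keep = np.abs(w) > tau               # support of the large coefficients
    return np.where(keep, w, 0.0), keep  # `keep` restricts the attack's search space
```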
Table 4: Results of untargeted image adversarial attacks using various algorithms: attack success rate (ASR), average final ℓ distortion, and number of queries until the first successful attack. ZO-BCD-R(periodic ext.) stands for ZO-BCD-R applying periodic extension when implementing the wavelet transform. ZO-BCD-R(compressed) stands for applying ZO-BCD-R to attack only the large wavelet coefficients (abs ≥ .) while zeroing out the smaller values. The other methods are the same as in Table 1.

Defense tests. Finally, we also tested some simple mechanisms for defending against our attacks, specifically harmonic denoising. We test the DWT with the famous Haar wavelets, the DWT with db45 wavelets (which is also used for the attack), and the discrete cosine transform (DCT). The defense mechanism is to apply the transform to the attacked images and then denoise by zeroing out small wavelet coefficients before transforming back to the pixel domain. We only test the defense on images that were successfully attacked. Tables 5 and 6 show the results of defending against the ZO-BCD-R and ZO-BCD-R(large coeff.) attacks, respectively. Interestingly, using the attack wavelet (i.e., db45) in defense is not a good strategy. We obtain the best defense results by using a mismatched transform (i.e., a DCT or Haar defense for a db45 attack) and a small thresholding value.

Table 5: Defense of the image adversarial wavelet attack by ZO-BCD-R: defense recovery success rate under Haar and db45 wavelet filters and the discrete cosine transform (DCT) filter, for thresholding values of 0.05, 0.10, 0.15, and 0.20.

Defense method   0.05   0.10   0.15   0.20
Haar             74%    75%    76%    75%
db45             74%    72%    71%    63%
DCT              72%           75%    67%

Table 6: Defense of the image adversarial wavelet attack by ZO-BCD-R(large coeff.), with the same setup as in Table 5.

Defense method   0.05   0.10   0.15   0.20
Haar             75%    72%           76%
db45             71%    70%    69%    63%
DCT              62%    74%    68%    58%
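Both defense mechanisms can be sketched in a few lines: transform, hard-threshold, invert. The code below is an illustration with hypothetical default thresholds, using PyWavelets for the wavelet defenses and SciPy's separable DCT for the DCT defense.

```python
import numpy as np
import pywt
from scipy.fft import dctn, idctn

def wavelet_defense(img, wavelet="haar", level=3, tau=0.05):
    """Denoise an attacked image by hard-thresholding its wavelet coefficients."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    arr, sl = pywt.coeffs_to_array(coeffs)
    arr[np.abs(arr) < tau] = 0.0                      # kill small coefficients
    c = pywt.array_to_coeffs(arr, sl, output_format="wavedec2")
    return pywt.waverec2(c, wavelet)

def dct_defense(img, tau=0.05):
    """Same idea with a 2-D DCT in place of the wavelet transform."""
    c = dctn(img, norm="ortho")
    c[np.abs(c) < tau] = 0.0
    return idctn(c, norm="ortho")
```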
More adversarial examples.
In Figure 4, we presented some adversarial images generated by the ZO-BCD-R and ZO-BCD-R(large coeff.) attacks. For the interested reader, we present more visual examples in Figure 6, including adversarial attack results generated by all versions of ZO-BCD.

Figure 6: More examples of wavelet-attacked images generated by ZO-BCD-R, ZO-BCD-RC, ZO-BCD-R(periodic ext.), ZO-BCD-R(compressed), and ZO-BCD-R(large coeff.), with true labels and misclassified labels. Panels (a)-(e): “frying pan” → “strainer”; panels (f)-(j): “strawberry” → “pomegranate”.
E.2 Sparse CWT attacks on audio signals
Adversarial attacks on speech recognition are a more nebulous concept than adversarial attacks on image classifiers, with researchers considering a wide variety of threat models. In [49], an attack on the speech-to-text engine
DeepSpeech [50] is successfully conducted, although the proposed algorithm, Houdini, is only able to force minor mis-transcriptions (“A great saint, saint Francis Xavier” becomes “a green thanked saint fredstus savia”). In [44], this problem is revisited, and the authors are able to achieve 100% success in targeted attacks, with any initial and target transcription from the Mozilla Common Voices dataset (for example, “it was the best of times, it was the worst of times” becomes “it is a truth universally acknowledged that a single”). We emphasize that both of these attacks are white-box, meaning that they require full knowledge of the victim model. We also note that speech-to-text transcription is not a classification problem; thus the classic Carlini-Wagner loss function so frequently used in generating adversarial examples for image classifiers cannot be straightforwardly applied. The difficulty of designing an appropriate attack loss function is discussed at length in [44].

A line of research more closely related to the current work is that of attacking keyword classifiers. Here, the victim model is a classifier, designed to take as input a short audio clip and to output the probabilities of this clip containing one of a small, predetermined list of keywords (“stop”, “left”, and so on). Most such works consider the
SpeechCommands dataset [37]. To the best of the authors' knowledge, the first paper to consider targeted attacks on a keyword classification model was [10], and they do so in a black-box setting. They achieve an attack success rate (ASR) using a genetic algorithm, whose query complexity is unclear. They do not report the relative loudness of their attacks; instead they report the results of a human study in which they asked volunteers to listen to and label the attacked audio clips. They report that for of the successfully attacked clips, human volunteers still assigned the clips the correct (i.e., source) label, indicating that these clips were not corrupted beyond comprehension. Their attacks are per-clip, i.e., not universal.

In [38], universal and untargeted attacks on a
SpeechCommands classifier are considered. Specifically, the authors seek to construct a distortion δ such that, for any clip x from a specified source class (e.g., “left”), the attacked clip x + δ is misclassified by the model (e.g., as “yes”). They consider several variations on this theme, such as allowing for multiple source classes. The results we recorded in Section 5.3 (ASR of . at a remarkably low mean relative loudness of − . dB) were the best reported in that paper, and were for the single-source-class setting. This attack was in the white-box setting.

Finally, we mention two recent works that consider very interesting, but different, threat models. [39] considers the situation where a malicious attacker wishes to craft a short perturbation (say, half a second long) that can be added to any part of a clean audio clip to force a misclassification. Their attacks are targeted and universal, and conducted in the white-box setting. They do not report the relative loudness of their attacks. In [40], a generative model is trained that takes as input a benign audio clip x and returns an attacked clip x + δ. The primary advantage of this approach is that attacks can be constructed on the fly. In the targeted, per-clip, white-box setting they achieve the success rate of . advertised in Section 5.3, at an approximate relative loudness of −30 dB.