Guessing Smart: Biased Sampling for Efficient Black-Box Adversarial Attacks
Thomas Brunner, Frederik Diehl, Michael Truong Le, Alois Knoll
fortiss GmbH
Technical University of Munich
{brunner, diehl, truongle}@fortiss.org, [email protected]

Abstract
We consider adversarial examples for image classification in the black-box decision-based setting. Here, an attacker cannot access confidence scores, but only the final label. Most attacks for this scenario are either unreliable or inefficient. Focusing on the latter, we show that a specific class of attacks, Boundary Attacks, can be reinterpreted as a biased sampling framework that gains efficiency from domain knowledge. We identify three such biases, image frequency, regional masks and surrogate gradients, and evaluate their performance against an ImageNet classifier. We show that the combination of these biases outperforms the state of the art by a wide margin. We also showcase an efficient way to attack the Google Cloud Vision API, where we craft convincing perturbations with just a few hundred queries. Finally, the methods we propose have also been found to work very well against strong defenses: our targeted attack won second place in the NeurIPS 2018 Adversarial Vision Challenge.
1. Introduction
Ever since the term was first coined, adversarial examples have enjoyed much attention from machine learning researchers. The fact that tiny perturbations can lead otherwise robust-seeming models to misclassify an input could pose a major problem for safety and security. But when discussing adversarial examples, it is often unclear how realistic the scenario of a proposed attack truly is. In this work, we consider a threat setting with the following parameters:
Black-box. The black-box setting assumes that an attacker has access only to the input and output of a model. Compared to the white-box setting, where an attacker has complete access to the architecture and parameters of the model, attacks in this setting are significantly harder to conduct: most state-of-the-art white-box attacks [8, 5, 14] rely on gradients that are directly computed from the model parameters, which are not available in the black-box setting.
Figure 1. (a) and (b): black-box adversarial examples obtained by random sampling. (c) and (d): isolated perturbation patterns. (c) is sampled from a normal distribution, and (d) is sampled from a distribution of Perlin noise patterns, which is one of the biases we propose. Both (a) and (b) fool the classifier, but (b) can be obtained with fewer samples.
Decision-based classification (label-only). Depending on the output format of the model, the problem of missing gradients can be circumvented. In a score-based scenario, the model provides real-valued outputs (for example, softmax activations). By applying tiny modifications to the input, an attacker can estimate gradients by observing changes in the output [6] and then follow this estimate to generate adversarial examples. The decision-based setting, in contrast, provides only a single discrete result (the top-1 label), on which gradient estimation is very inefficient [10]. This form of black-box attack is much more difficult, but also extends the range of possible targets in the real world [3].
Limited queries. Black-box attacks might not be feasible if they need millions of queries to the model, and possibly multiple hours' time, to be successful. We therefore consider a scenario where the attacker must find a convincing adversarial example in less than 15000 queries.
Targeted. An untargeted attack is considered successful when the classification result is any label other than the original. Depending on the number and semantics of the classes, it can be easy to find a label that requires little change, but is considered adversarial (e.g. Egyptian cat vs. Persian cat). A targeted attack, in contrast, needs to produce exactly the specified label. This task is strictly harder than the untargeted attack, further decreasing the probability of success.
In this setting, current state-of-the-art attacks are either unreliable or inefficient. Our contribution is as follows:
• We show how a recently proposed method, the Boundary Attack, can be re-framed as a biased sampling framework that gains efficiency from prior beliefs about the target domain.
• We discuss three such biases: low-frequency patterns, regional masks and gradients from a surrogate model.
• We evaluate the effectiveness of each bias and show that their combination drastically outperforms the previous state of the art in label-only black-box attacks.
Our source code is publicly available at https://github.com/ttbrunner/biased_boundary_attack.
2. Related Work
There currently exist two major schools of attacks in the threat setting we consider:
It is known that adversarial examples display a high degree of transferability, even between different model architectures [20]. Transfer attacks seek to exploit this by training substitute models that are reasonably similar to the model under attack, and then applying regular white-box attacks to them. Typically, this is performed by iterative applications of fast gradient-based methods such as the Fast Gradient Sign Method (FGSM) [8] and, more generally, Projected Gradient Descent (PGD) [14]. The black-box model is used in the forward pass, while the backward pass is performed with the surrogate model [1]. In order to maximize the chance of a successful transfer, newer methods use large ensembles of substitute models, and applying adversarial training to the substitute models has been found to increase the probability of finding strong adversarial examples even further [19].
Although these methods currently form the state of the art in label-only black-box attacks [12], they have one major weakness: as soon as a defender manages to reduce transferability, direct transfer attacks often run a risk of complete failure, delivering no result even after thousands of iterations. As a result, conducting transfer attacks is a cat-and-mouse game between attacker and defender, where the attacker must go to great lengths to train models that are just as robust as the defender's. Therefore, transfer-based attacks can be very efficient, but also somewhat unreliable.
Circumventing this problem, sampling-based attacks do not rely on direct transfer and instead try to find adversarial examples by randomly sampling perturbations from the input space. Perhaps the simplest attack consists of sampling a hypersphere around the original image, drawing more and more samples until an adversarial example is found. Owing to the high dimensionality of the input space, this method is very inefficient and has been dismissed as completely unviable [18]. While this is not our main focus, we show in Appendix C that even this crude attack can be accelerated and made competitive in certain scenarios.
Recently, a more efficient attack has been proposed: the Boundary Attack (BA) [3]. This attack is initialized with an input of the desired class, and then takes small steps along the decision boundary to reduce the distance to the original input. Previous works have established that regions which contain adversarial examples often have the shape of a "cone" [19], which can be traversed from start to finish. At each step, the BA employs random sampling to find a sideways direction that leads deeper into this cone. From there, it can then take the next step towards the target. The BA has been shown to be very powerful, producing adversarial examples that are competitive with even the results of state-of-the-art white-box attacks [3]. However, its weakness is query efficiency: to achieve these results, the attack typically needs to query a model hundreds of thousands of times.
It should be noted that recent research on black-box attacks has largely focused on classifiers that provide confidence scores, which is an easier setting. Nevertheless, many of these methods also use random sampling [6, 21, 10], and the biases we propose could also benefit their approaches. As an aside, Ilyas et al. [10] propose a variation of their attack that manages to apply gradient estimation to discrete labels.
Although this does fit our setting, we find it to be much less efficient than BA variants (see Section 4). Clearly, sampling-based attacks are very flexible but often too inefficient for practical use. Barring pure random guessing, the BA is the simplest attack for our setting. We therefore choose to focus on this method, and show how it can benefit from the biases we propose.
Figure 2. Sampling directions for the orthogonal step. (a) Boundary Attack: uniformly distributed along the surface of the hypersphere. (b) BA with Perlin bias: higher sample density in the direction of low-frequency perturbations. (c) BA with surrogate gradient bias: samples further concentrate towards the direction of the projected gradient.
3. Biased Boundary Attacks
The Boundary Attack, like most sampling-based attacks, draws perturbation candidates from a multidimensional normal distribution. This means that it performs unbiased sampling, perturbing each input feature independently of the others. While this is very flexible, it is also extremely inefficient when used against a robust model.
Consider the distribution of natural images: adjacent pixels are typically not independent of each other, but often have similar colors. This alone is a strong indicator that drawing perturbations from i.i.d. random variables will lead to adversarial examples that are clearly out of distribution for natural image datasets. This, of course, renders them vulnerable to detection and filtering; robust models have become increasingly resilient against such patterns [12]. Therefore, it seems only logical to constrain the search space to perturbations that we believe to have a higher chance of success, or to bias the distribution so that the probability of sampling them increases.
We outline three such biases for the domain of image classification, discuss their motivation and show how to integrate them into the sampling procedure of the Boundary Attack.
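To make the sampling procedure concrete, the following minimal sketch shows one simplified iteration of the (unbiased) Boundary Attack. The names `is_adversarial` and `sample_fn` as well as the default step sizes are placeholders for illustration, not the exact implementation from our repository.

```python
import numpy as np

def boundary_attack_step(x_orig, x_adv, is_adversarial, eta=0.5, eps=0.2,
                         sample_fn=None, rng=np.random):
    """One simplified iteration: an orthogonal step along the hypersphere
    around x_orig, followed by a step towards x_orig. Returns the new
    adversarial point, or the previous one if the candidate is rejected."""
    source_dir = x_orig - x_adv
    dist = np.linalg.norm(source_dir)
    source_dir = source_dir / dist

    # Unbiased BA: draw i.i.d. Gaussian noise. The biases discussed below
    # only change this sampling step.
    s = rng.standard_normal(x_orig.shape) if sample_fn is None else sample_fn()

    # Project the candidate orthogonally to the source direction and scale it
    # relative to the current distance (orthogonal step size eta).
    s = s - np.vdot(s, source_dir) * source_dir
    s = s * (eta * dist / np.linalg.norm(s))

    # Orthogonal step, then a step towards the original image (source step eps).
    candidate = np.clip(x_adv + s + eps * dist * source_dir, 0.0, 1.0)
    return candidate if is_adversarial(candidate) else x_adv
```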
When one looks at typical adversarial examples, it quickly becomes apparent that most existing methods yield perturbations with high image frequency. But high-frequency patterns have a significant problem: they are easily identified and separated from the original image signal, and are often dampened by spatial transforms. Indeed, most of the winning defenses in the NeurIPS 2017 Adversarial Attacks and Defences Competition were based on denoising [13], simple median filters [12], and random transforms [22]. In other words: state-of-the-art defenses are designed to filter high-frequency noise.
At the same time, we know that it is possible to synthesize "robust" adversarial examples which are not easily filtered in this way. Compare Athalye et al. [2]: their robust perturbations are largely invariant to filters and transforms, and, interestingly enough, at first glance seem to contain very little high-frequency noise.
Inspired by this observation, we hypothesize that image frequency alone could be a key factor in the robustness of adversarial perturbations. If true, then simply limiting perturbations to the low-frequency domain should increase the success chance of an attack, while incurring no extra cost.
Perlin noise patterns. A straightforward way to generate parametrized, low-frequency patterns is to use Perlin noise [15]. Originally intended as a procedural texture generator for computer graphics, this function creates low-frequency noise patterns with a reasonably "natural" look. One such pattern can be seen in Figure 1d. But how can we use it to create a prior for the Boundary Attack?
Let k be the dimensionality of the input space. The original Boundary Attack (Figure 2a) works by applying an orthogonal perturbation η^k along the surface of a hypersphere around the original image, in the hope of moving deeper into an adversarial region. From there, a step is taken towards the original image. In the default configuration, candidates for η^k are generated from samples s ∼ N(0, 1)^k, which are projected orthogonally to the source direction and normalized to the desired step size. This leads to the directions being uniformly distributed along the hypersphere.
To introduce a low-frequency prior into the Boundary Attack, we instead sample from a distribution of Perlin noise patterns (Figure 2b). Perlin noise is typically parametrized with a permutation vector v of size 256, which we randomly shuffle on every call. Effectively, this allows us to sample two-dimensional noise patterns s ∼ Perlin_{h,w}(v), where h and w are the image dimensions (and h · w = k). As a result, the samples are now strongly concentrated in low-frequency directions.
Our experiments in Section 4 show that this greatly improves the efficiency of the attack. Therefore, we reason that the distribution of Perlin noise patterns contains a higher concentration of adversarial directions than the normal distribution.
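The only change required for the frequency bias is thus the sampling distribution. As a hedged sketch, the helper below uses bicubic upsampling of coarse Gaussian noise as a simple low-frequency stand-in for Perlin noise; the actual attack draws true Perlin patterns with a freshly shuffled permutation vector on every call.

```python
import numpy as np
from scipy.ndimage import zoom

def sample_low_frequency(h, w, c=3, grid=8, rng=np.random):
    """Low-frequency perturbation candidate of shape (h, w, c): coarse
    Gaussian noise on a small grid, upsampled so that most of its energy
    lies in low image frequencies (a stand-in for Perlin noise)."""
    coarse = rng.standard_normal((grid, grid, c))
    pattern = zoom(coarse, (h / grid, w / grid, 1), order=3)  # bicubic upsampling
    return pattern[:h, :w, :]

# Plugged into the sketch above, e.g. for 299x299 ImageNet inputs:
# boundary_attack_step(x_orig, x_adv, is_adversarial,
#                      sample_fn=lambda: sample_low_frequency(299, 299))
```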
Figure 3. Masking based on per-pixel difference. (a) shows the original image, (b) an image of the target class. (c) is the mask, and (d) is a perturbation to which the mask has been applied. The perturbation concentrates on the central region, as the background is already quite similar between the images.

Currently, perturbations are applied evenly across the entire image. No matter if low or high frequency, the orthogonal step of the Boundary Attack perturbs all pixels nearly equally (when averaged over a large number of samples). This seems to be a waste: could the attack benefit from limiting the perturbation to specific regions?
The Boundary Attack is an interpolation from an image of the target class towards the image under attack. In some regions, these images might already be quite similar, while being very different in others. Intuitively, we would want an attack to take larger steps in those regions where the difference is high. We are also reluctant to perturb regions that are already similar, as any such distortion will have to be undone in a later step.
It turns out that this is an ideal way to reduce the search space. We can simply create an image mask from the per-pixel difference of the adversarial and original image (see Figure 3):

M = |X_adv − X_orig|   (1)

At each step, we recalculate this mask based on the current position and apply it element-wise to the previously sampled orthogonal perturbation η^k:

η^k_biased = M ⊙ η^k;   η^k_biased = η^k_biased / ‖η^k_biased‖   (2)

In this way, the distortion of those pixels that have a high difference is amplified and that of similar pixels dampened, while the magnitude of the perturbation vector stays the same. This reduces the search space of the attack and therefore increases its efficiency, if one assumes our intuition about regional masking to be correct. We implement this masking strategy as a proof of concept, and our evaluation in Section 4 shows that it indeed improves efficiency by a significant amount.
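A minimal sketch of Eqs. (1) and (2): the mask is recomputed from the current per-pixel difference and applied element-wise to the sampled orthogonal perturbation, which the attack then rescales to the desired step size.

```python
import numpy as np

def apply_mask_bias(eta_k, x_adv, x_orig):
    """Amplify the perturbation where the current image still differs from
    the original and dampen it where they already agree (Eqs. 1 and 2)."""
    mask = np.abs(x_adv - x_orig)                    # Eq. (1)
    eta_biased = mask * eta_k                        # element-wise product
    return eta_biased / np.linalg.norm(eta_biased)   # Eq. (2): renormalize
```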
Other masks. An attacker might wish to engineer masks from other knowledge they possess about the image contents. For example, it could be worthwhile to concentrate the perturbation on the most salient features of the target class, reducing the search space to only the most vital dimensions. Such "focused" black-box perturbations could hold much promise, and we aim to investigate them in the future.
What other source of information contains strong hints about directions that are likely to point to an adversarial region? Naturally, gradients from surrogate models. Transfer attacks have been shown to be extremely powerful (albeit brittle) [19], so it should be useful to exploit surrogates whenever they are available.
Arguably, the main weakness of transfer attacks is that they fail when the decision boundary of the surrogate model does not closely match the defender's. However, even when this is the case, the boundary may still be reasonably nearby. Based on this intuition, some approaches extend gradient-based attacks with limited regional sampling [1]. Here, we do exactly the opposite and extend a sampling-based attack with limited gradient information. This has the significant advantage that, in the case of low transferability, our method merely experiences a slowdown where typical transfer attacks completely fail.
Our method works as follows (see the sketch at the end of this section):
• An adversarial gradient from a surrogate model is calculated. Since the current position is already adversarial, it can be helpful to move a small distance towards the original image first, making sure to calculate the gradient from inside a non-adversarial region.
• The gradient usually points away from the original image, therefore we project it orthogonally to the source direction, as shown in Figure 2c.
• This projection is on the same hyperplane as the candidates for the orthogonal step. We can now bias the candidate perturbations toward the projected gradient by any method of our choosing. Provided all vectors are normalized, we opt for simple addition:

η^k_biased = (1 − w) · η^k + w · η^k_PG   (3)

w controls the strength of the bias and is a hyperparameter that should be tuned according to the performance of our substitute model. High values for w should be used when transferability is high, and vice versa. Were we to choose the maximum value, w = 1, the orthogonal step would be equivalent to an iteration of the PGD attack. In our experience, small to moderate values of w generally lead to good performance.
As a result, samples concentrate in the vicinity of the projected gradient, but still cover the rest of the search space (albeit with lower resolution). In this way, substitute models are purely optional to our attack instead of forming its central part. It should be noted, though, that at least some measure of transferability should exist. Otherwise, the gradient will point in a bogus direction, and using a high value for w would reduce efficiency instead of improving it. For the time being, this does not pose a major problem: to the best of our knowledge, no strategies exist that successfully eliminate transferability altogether. As we go on to show in Section 4, even surrogate models that are too weak for direct transfer attacks can be used in our framework.
Our work is concurrent with Ilyas et al. [11], who introduce a bandit optimization framework that incorporates prior information in order to increase query efficiency. While their approach differs from ours, it is motivated by the same intuition: domain knowledge can be used to speed up optimization. We note that the data-dependent prior they propose is essentially a low-frequency bias not unlike our own. They also introduce a time-dependent prior, which could benefit our work in the future.
Low-frequency perturbations have also recently been described by Guo et al. [9]. They decompose random perturbations with the Discrete Cosine Transform, and then remove high frequencies from the spectrum. We expect these patterns to be very similar to those produced by our Perlin bias.
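A sketch of the gradient bias of Eq. (3). It assumes that the surrogate gradient at the current point has already been computed elsewhere and that all vectors are flattened; the default w = 0.5 is the value we used in the Appendix C submission.

```python
import numpy as np

def project_orthogonal(v, direction):
    """Remove the component of v along `direction` (a unit vector)."""
    return v - np.vdot(v, direction) * direction

def bias_towards_gradient(eta_k, grad, source_dir, w=0.5):
    """Combine a sampled orthogonal perturbation with the projected
    surrogate gradient (Eq. 3). w controls the strength of the bias."""
    g_proj = project_orthogonal(grad, source_dir)
    g_proj = g_proj / np.linalg.norm(g_proj)
    eta_k = eta_k / np.linalg.norm(eta_k)
    biased = (1.0 - w) * eta_k + w * g_proj
    return biased / np.linalg.norm(biased)
```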
4. Evaluation
We evaluate our approach against an ImageNet classifier and perform an ablation study to determine the effectiveness of each bias. We also compare our results to a range of recently proposed black-box attacks. Finally, we mount an attack against the Google Cloud Vision API and show that our approach can be efficiently deployed against real-world commercial systems. Appendix A shows a range of interesting examples produced by our attacks, Appendix B contains a listing of hyperparameters, and Appendix C describes our winning submission to the NeurIPS 2018 Adversarial Vision Challenge in detail.
Figure 4. Targeted attack on ImageNet using 15000 queries. (a) shows the original image (bell) and (b) an image of the target class (ocarina). (c) is an adversarial example generated by the original Boundary Attack (d_ℓ2 = 18.), with (d) showing the difference to the original image. (e) and (f) show the biased Boundary Attack with all biases enabled (d_ℓ2 = 4.).

ImageNet consists of images with 299x299 color pixels and has 1000 classes. We run our attacks against a pre-trained InceptionV3 network [17], which achieves 78% top-1 accuracy.
Evaluation. We create an evaluation set by randomly selecting 1000 images from the ImageNet validation set while fixing a random target label for each image. We then proceed to run each attack for up to 15000 queries and measure the success rate over all examples.
Success criteria. The Boundary Attack always starts with an image of the adversarial class, therefore success must be characterized by a low distance to the original image. For this, we define a threshold on the ℓ2-norm of the adversarial perturbation and register success when the distance falls below it. Since some of the methods we compare against [10] use the ℓ∞ distance exclusively, we set the ℓ2 threshold to 25.89. This corresponds to a worst-case ℓ∞-distortion of 0.05 if one assumes all pixels to be maximally perturbed.
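As a quick sanity check of this choice (with k = 299 · 299 · 3 input dimensions):

ε_ℓ2 = ε_ℓ∞ · √k = 0.05 · √(299 · 299 · 3) ≈ 25.89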
Initialization. We search the ImageNet validation set for an image of the target class, and pick the one that is closest to the one being attacked. From there, we perform a binary line search to find the decision boundary. This is typically done in less than 10 queries. We run all attacks with the same starting points.

Active biases               Success rate after N queries                     Median queries
Perlin  Mask  Surrogate     500    1000   2500   5000   10000  15000         until success
no      no    no             –      –      –      –      –      –              –
yes     no    no             –      –      –      –      –      –              –
no      yes   no             –      –      –      –      –      –              –
yes     yes   no             –      –      –      –      –      –              –
no      no    yes            –      –      –      –      –      –              –
yes     no    yes            –      –      –      –      –      –              –
no      yes   yes            –      –      –      –      –      –              –
yes     yes   yes            –      –      –      –      –      –              –

Table 1. Ablation study of biases on ImageNet (targeted attack). The Perlin bias has the strongest effect, followed by the mask bias and finally the surrogate bias. Each bias improves efficiency on its own, and the combination of all biases delivers the strongest performance. Note: we only report the median for success rates over 0.5.

Method                                           500    1000   2500   5000   10000  15000
Ilyas et al. [10] (label-only)                   0.00   0.00   0.00   0.00   0.00   0.00
Cheng et al. [7]                                 0.04   0.04   0.04   0.04   0.07   0.07
Madry et al. [14] (PGD transfer attack)           –      –      –      –      –      –
Brendel et al. [3] (unbiased Boundary Attack)    0.01   0.03   0.10   0.17   0.25   0.33
Ours (biased Boundary Attack)                     –      –      –      –      –      –

Table 2. Comparison with recently proposed label-only attacks. We use the original code provided by the respective authors and run all methods with the same data and targets. The biased Boundary Attack (same as in Table 1) outperforms all other methods. Note that in the case of Ilyas et al. [10], the number of required queries is so large that the attack never achieves success in the range we consider.
Surrogate model. We use a pre-trained Inception-ResNet-v2 model [16] for the gradient bias. This model is not adversarially trained and, as Table 2 shows, performs poorly in a PGD transfer attack. We intentionally use this model to demonstrate that our approach can effectively use pre-trained surrogates without any need for modification.
Hyperparameters. See Appendix B.
We first evaluate all three biases and their combinations. Table 1 shows the result: it is apparent that each of the biases increases the efficiency of the attack. The largest boost is obtained by the Perlin bias, followed by the mask bias, and finally the surrogate gradient bias. The latter has a rather small effect, which is probably due to the fact that our surrogate model is too weak. But still, we are able to use what little transferability there is instead of slowing down (or failing, like a transfer attack would). At this point, it would be interesting to see whether our method could profit even more from better surrogates. We consider this a direction for future work.
It is also noteworthy that all biases can be combined, and that they do not interfere with each other. When all three biases are active, we reach 85% success after only 15000 queries. Figure 4 shows an example of this drastic improvement. This is our strongest attack, which we now compare against other state-of-the-art methods.
We go on to benchmark our method against a range of recently proposed attacks for our setting. We use publicly available code, together with the hyperparameters recommended by the authors. We modify the implementations to use our evaluation set, therefore all attacks are run on the same 1000 images and use the same target labels as well as starting points (where applicable). Table 2 shows the results.
Ilyas et al. [10] propose a label-only version of their gradient estimation attack. It works in our setting, but at greatly reduced efficiency. We note that it does not produce an adversarial example within 15000 queries and that no run succeeds before 276000 queries, with a median of 2.48 million queries required. This is in line with their published results. Their other attacks use confidence scores, and therefore require a setting that is considerably easier. Even then, our median number of queries is decisively lower than the one reported by them (5432 versus 11550).
Similarly, Cheng et al. [7] re-frame the setting as a real-valued optimization problem. In general, they report higher efficiency than the Boundary Attack, which is not confirmed in the setting of our experiment. We used their publicly available source code with recommended hyperparameters.
Finally, we perform an iterative PGD transfer attack as described by Madry et al. [14]. Our results show that its performance is hit-and-miss: when a transfer succeeds, it does so very early, but in most cases it never succeeds. Clearly, this attack requires a stronger surrogate model.
Interestingly enough, the performance we obtain for the original Boundary Attack (without biases) seems higher than that observed in previous work [3, 7]. This may be due to our initialization method or our choice of hyperparameters, which we list in Appendix B. In any case, our evaluation shows that the biased Boundary Attack decidedly outperforms all other attacks in a label-only setting.
To show that our method is effective even against black boxes with unknown labels, we conduct an attack against the Google Cloud Vision API. This is significantly harder than attacking ImageNet, since the exact classes are unknown. However, we also note that Google Cloud Vision has a very high number of near-redundant class labels. Do we really need to focus on one label alone?
Free-form attacks. We have argued earlier that untargeted attacks with many redundant classes are not truly adversarial. However, the opposite is also true: to achieve an adversarial effect, it is not always necessary to target one specific class. Rather, the same effect could be achieved by targeting a group of classes: if we want to label a dog as a cat, we can take the union of all cat breeds to achieve the desired effect.
To be more precise, we can formulate any adversarial criterion, as long as it is a boolean function of the model output. Decision-based attacks make this very simple: if we consider this function to be part of the black box, we can simply treat it like any other model and run our attack on its output.
Consider, for example, a targeted attack to turn a person into a bear. Instead of using the label "bear", we perform a string comparison on the top-1 label and check for the occurrence of "bear". This extends the attack to labels like "grizzly bear", "brown bear", etc. and keeps it from getting stuck whenever one of these labels appears in top-1 position. We also add another condition for good measure: the words "face", "facial expression", "skin", "person" must not appear in any of the output labels.
Figure 5 shows the result: after only 346 iterations, our attack produces a perturbation that is still visible, but small enough to fool an unsuspecting person.

Figure 5. Adversarial image (target "bear"), classified by Google Cloud Vision after 346 queries. Confidence scores are displayed, but not used by the attack. No label hinting at a person is left (the prediction vector contains only 3 labels).
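A minimal sketch of such a free-form criterion, written as a boolean function over the label strings returned by the API; the list `labels` (ordered by confidence) is a hypothetical interface, and the attack only consumes the yes/no answer.

```python
def is_adversarial_bear(labels):
    """Free-form targeted criterion: the top-1 label must mention 'bear',
    and no person-related label may remain anywhere in the output."""
    forbidden = ("face", "facial expression", "skin", "person")
    if any(word in label.lower() for label in labels for word in forbidden):
        return False
    return bool(labels) and "bear" in labels[0].lower()
```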
Making pedestrians disappear. The adversarial criterion can also be formulated as a top-k untargeted attack on multiple labels. Consider a potentially safety-critical scenario, where the goal is to make the model oblivious to pedestrians. For this, we simply formulate the condition so that the string "pedestrian" does not appear in the prediction vector, and that related labels such as "person, walking, head, clothes" are also absent.
We obtain Figure 6 after exactly 1000 queries. Again, the perturbation is already small enough to fool an unsuspecting observer. An attack with such a low number of queries can be performed on virtually any device, even mobile, in a matter of minutes, and is only limited by the latency of the API under attack.

Figure 6. Adversarial image (pedestrians have been removed), classified by Google Cloud Vision after exactly 1000 queries. Confidence scores are displayed, but not used by the attack. Perhaps interestingly, our adversarial pattern is classified as "Fun".
5. Conclusion
We have shown that decision-based black-box attacks can be greatly sped up with prior knowledge. The Boundary Attack can be interpreted as a biased sampling framework where one merely needs to modify the distribution from which samples are drawn.
Within this framework, we have proposed three priors that are partially motivated by intuitions about the nature of image classification, and partially by a desire to connect research directions in the field of black-box attacks. Consider the surrogate gradient bias: by itself, it does not yield a substantial improvement. However, the observation that we are able to draw even a small benefit from surrogates that otherwise show near-zero transferability seems very promising for future work. We aim to study this effect of partial transferability in more detail and hope to uncover some of its underlying properties.
And it does not end here: we have discussed only three priors for biased sampling, but there is much more domain knowledge that has not yet found its way into adversarial attacks. Other perturbation patterns, spatial transforms, adversarial blending strategies, or even intuitions about semantic features of the target class could all be integrated in a similar fashion.
With the biased Boundary Attack, we have outlined a basic framework into which a broad range of knowledge can be incorporated. Our implementation significantly outperforms the previous state of the art in black-box label-only attacks, which is one of the most difficult settings currently considered. Our methods can be used to craft convincing results after very few iterations, and the threat of black-box adversarial examples becomes more realistic than ever before.
References
[1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.
[2] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.
[3] W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018.
[4] W. Brendel, J. Rauber, A. Kurakin, N. Papernot, B. Veliqi, M. Salathé, S. P. Mohanty, and M. Bethge. Adversarial vision challenge. arXiv preprint arXiv:1808.01976, 2018.
[5] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57. IEEE, 2017.
[6] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, pages 15–26, 2017.
[7] M. Cheng, T. Le, P.-Y. Chen, J. Yi, H. Zhang, and C.-J. Hsieh. Query-efficient hard-label black-box attack: An optimization-based approach. arXiv preprint arXiv:1807.04457, 2018.
[8] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[9] C. Guo, J. S. Frank, and K. Q. Weinberger. Low frequency adversarial perturbation. arXiv preprint arXiv:1809.08758, 2018.
[10] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.
[11] A. Ilyas, L. Engstrom, and A. Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv preprint arXiv:1807.07978, 2018.
[12] A. Kurakin, I. Goodfellow, S. Bengio, Y. Dong, F. Liao, M. Liang, T. Pang, J. Zhu, X. Hu, C. Xie, J. Wang, Z. Zhang, Z. Ren, A. Yuille, S. Huang, Y. Zhao, Y. Zhao, Z. Han, J. Long, Y. Berdibekov, T. Akiba, S. Tokui, and M. Abe. Adversarial attacks and defences competition. arXiv preprint arXiv:1804.00097, 2018.
[13] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu. Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[14] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[15] K. Perlin. An image synthesizer. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '85, pages 287–296, New York, NY, USA, 1985. ACM.
[16] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2016.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[18] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[19] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
[20] F. Tramèr, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
[21] C. Tu, P. Ting, P. Chen, S. Liu, H. Zhang, J. Yi, C. Hsieh, and S. Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. arXiv preprint arXiv:1805.11770, 2018.
[22] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018.

A. Evaluation images

Figure 7 panels (d_ℓ2): no bias 27., surrogate only 21., mask only 13., mask+surrogate 12., Perlin only 20., Perlin+surrogate 17., Perlin+mask 8., Perlin+mask+surrogate 8.; plus the original image and the starting point.
Figure 7. Targeted adversarial examples on ImageNet, obtained with different biases after 15000 iterations. The original class is "snowplow"; all images are classified as the target "Chesapeake Bay retriever". The mask bias is especially effective, as start and original image have similar backgrounds. See https://github.com/ttbrunner/biased_boundary_attack for an animated version.

Figure 8 panels (d_ℓ2): no bias 42., surrogate only 40., mask only 42., mask+surrogate 41., Perlin only 34., Perlin+surrogate 35., Perlin+mask 12., Perlin+mask+surrogate 10.; plus the original image and the starting point.
Figure 8. Targeted adversarial examples on ImageNet, obtained with different biases after 15000 iterations. The original class is "goose"; all images are classified as the target "lipstick". In the case of this image, not every bias comes with an improvement: mask+surrogate is even slightly worse than surrogate only. However, when all biases are combined, the result is still significantly better. See https://github.com/ttbrunner/biased_boundary_attack for an animated version.

Figure 9 panels (d_ℓ2): no bias 47., surrogate only 57., mask only 40., mask+surrogate 41., Perlin only 12., Perlin+surrogate 11., Perlin+mask 8., Perlin+mask+surrogate 5.; plus the original image and the starting point.
Figure 9. Targeted adversarial examples on ImageNet, obtained with different biases after 15000 iterations. The original class is "cello"; all images are classified as the target "black stork". When used on its own, the surrogate bias seems to be detrimental for this particular image. Still, the final result is impressive: when comparing no biases with all biases, the perturbation norm is reduced by 88%. See https://github.com/ttbrunner/biased_boundary_attack for an animated version.

B. Evaluation Hyperparameters

B.1. Boundary Attack
Step sizes. In the source code of their original implementation, Brendel et al. [3] suggest setting both the orthogonal step to η = 0. and the source step to ε = 0. Both step sizes are relative to the current distance from the source image. We instead set η = 0. and ε = 0., as this makes the attack take smaller steps towards the source image, while at the same time allowing for more extreme perturbations in the orthogonal step. We have found this to increase the success chance of perturbation candidates, and the attack gets stuck less often.
Step size adaptation. We do not use the original step size adjustment scheme as proposed by Brendel et al. [3]. They collect statistics about the success of the orthogonal step before performing the step towards the source, and based on this they either reduce or increase the individual step sizes. This method seems to be geared towards reaching near-zero perturbations and less towards query efficiency, which is our primary goal: we are interested in making as much progress as possible in the early stages of an attack. When testing the Boundary Attack with query counts below 15000, we found the success statistics to be very noisy, and the adaptation scheme ended up being detrimental. Therefore, we opt for a different approach (see the sketch below):
• At every iteration, we count the number of consecutive previously unsuccessful candidates.
• As this number increases, we dynamically reduce both step sizes towards zero.
• Whenever a perturbation is successful, the step size is reset to its original value.
• As a fail-safe, the step size is also reset after 50 consecutive failures. Typically, we found this to occur often for the unbiased Boundary Attack, but very seldom when using the Perlin bias.
As a result, our strategy is quick to reduce step size, and after a success it immediately reverts to the original step size. We have found this to be very effective in the early stages of an attack. However, it has the drawback of wasting samples in the later stages (10000+ queries), when it tries to revert to larger step sizes too often. It might be promising to partially reinstate the approach of Brendel et al., or to apply some form of step size annealing.
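A sketch of this schedule with hypothetical names; `base_eta` and `base_eps` are the initial step sizes, and `decay` is an illustrative shrink factor rather than a tuned value.

```python
def adapt_step_sizes(consecutive_fails, base_eta, base_eps,
                     decay=0.7, reset_after=50):
    """Shrink both step sizes as consecutive candidates fail; reset to the
    initial values after a success (fail counter is 0) or, as a fail-safe,
    once `reset_after` consecutive failures are reached."""
    if consecutive_fails == 0 or consecutive_fails >= reset_after:
        return base_eta, base_eps
    factor = decay ** consecutive_fails   # dynamic reduction towards zero
    return base_eta * factor, base_eps * factor
```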
B.2. Other attacks
For all other attacks, we use the hyperparameters that are provided for ImageNet in the publicly available source code of their implementations.

C. Submission to NeurIPS 2018 Adversarial Vision Challenge
When evaluating adversarial attacks and defenses, it is hard to obtain meaningful results. Very often, attacks are tested against weak defenses and vice versa, and results are cherry-picked. We sidestep this problem by instead presenting our submission to the NeurIPS 2018 Adversarial Vision Challenge (AVC), where our method was pitted against state-of-the-art robust models and defenses and won second place in the targeted attack track.
Evaluation setting. The AVC is an open competition between image classifiers and adversarial attacks in an iterative black-box decision-based setting [4]. Participants can choose between three tracks:
• Robust model: the submitted code is a robust image classifier. The goal is to maximize the ℓ2 norm of any successful adversarial perturbation.
• Untargeted attack: the submitted code must find a perturbation that changes the classifier output, while minimizing the ℓ2 distance to the original image.
• Targeted attack: same as above, but the classification must be changed to a specific label.
Attacks are continuously evaluated against the current top-5 robust models and vice versa. Each evaluation run consists of 200 images with a resolution of 64x64, and the attacker is allowed to query the model 1000 times for each image. The final attack score is then determined by the median ℓ2 norm of the perturbation over all 200 images and top-5 models (lower is better).
Competitors. At the time of writing, the exact methods of most model submissions were not yet published. But seeing as more than 60 teams competed in the challenge, it is reasonable to assume that the top-5 models accurately depicted the state of the art in adversarial robustness. We know from personal correspondence that most winning models used variations of Ensemble Adversarial Training [19], while denoisers were notably absent. On the attack side, most winners used variants of PGD transfer attacks, again in combination with large adversarially-trained ensembles.
Dataset. The models are trained with the Tiny ImageNet dataset, which is a down-scaled version of the ImageNet classification dataset, limited to 200 classes with 500 images each. Model input consists of color images with 64x64 pixels, and the output is one of 200 labels. The evaluation is conducted with a secret hold-out set of images, which is not contained in the original dataset and unknown to participants of the challenge.
C.1. Random guessing with low frequency
Before implementing the biased Boundary Attack, we first conduct a simple experiment to demonstrate the effectiveness of Perlin noise patterns against strong defenses. Specifically, we run a random-guessing attack that samples candidates uniformly from the surface of an ℓ2-hypersphere with radius ε around the original image:

s ∼ N(0, 1)^k;   x_adv = x + ε · s / ‖s‖   (4)

With a total budget of 1000 queries to the model for each image, we use binary search to reduce the sampling distance ε whenever an adversarial example is found. First experiments have indicated that the targeted setting may be too difficult for pure random guessing. Therefore, we limit this experiment to the untargeted attack track, where the probability of randomly sampling any of 199 adversarial labels is reasonably high. We then replace the normal distribution with normalized Perlin noise:

s ∼ Perlin_{64,64}(v);   x_adv = x + ε · s / ‖s‖   (5)

We set the Perlin frequency to 5 for all attacks on Tiny ImageNet. As Table 3 shows, Perlin patterns are more efficient, and the attack finds adversarial perturbations with much lower distance (a 63% reduction). Although intended as a dummy submission to the AVC, this attack was already strong enough for a top-10 placement in the untargeted track. An example obtained in this experiment can be seen in Figure 1.
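A sketch of this procedure (Eqs. 4 and 5): `sample_fn` returns either Gaussian noise or a Perlin pattern, `is_adversarial` wraps the model query, and the binary-search bookkeeping is one plausible reading of the scheme described above.

```python
import numpy as np

def random_guessing_attack(x, is_adversarial, sample_fn,
                           eps_max=50.0, queries=1000):
    """Sample perturbations on an L2-hypersphere of radius eps around x and
    shrink eps via binary search whenever an adversarial example is found."""
    eps_lo, eps_hi = 0.0, eps_max
    eps, best = eps_max, None
    for _ in range(queries):
        s = sample_fn()                                   # Eq. (4) or Eq. (5)
        x_adv = np.clip(x + eps * s / np.linalg.norm(s), 0.0, 1.0)
        if is_adversarial(x_adv):
            best, eps_hi = x_adv, eps                     # success: lower the radius
        else:
            eps_lo = eps   # heuristic: a miss does not rule the radius out,
                           # but we still narrow the search from below
        eps = 0.5 * (eps_lo + eps_hi)
    return best
```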
C.2. Biased Boundary Attack

Next, we evaluate the biased Boundary Attack in our intended setting, the targeted attack track of the AVC. To provide a point of reference, we first implement the original Boundary Attack without biases. This works, but is too slow for our setting. Compare Figure 10, where the starting point is still clearly visible after 1000 iterations (in the unbiased case).
Distribution     Median ℓ2
Normal             –
Perlin noise       –

Table 3. Random guessing with low frequency (untargeted), evaluated against the top-5 models in the AVC.

Boundary Attack bias            Median ℓ2
None                              –
Perlin                            –
Perlin + surrogate gradients      –

Table 4. Biases for the Boundary Attack (targeted), evaluated against the top-5 models in the AVC.
Figure 10 panels: Original (European fire salamander), Unbiased (Sulphur butterfly, d_ℓ2 = 9.), Perlin (Sulphur butterfly, d_ℓ2 = 7.), Perlin + Surrogate (Sulphur butterfly, d_ℓ2 = 4.), plus the starting image.

Figure 10. Adversarial examples generated with different biases in our targeted attack submission to the AVC. All images were obtained after 1000 queries. The isolated perturbation is shown below each adversarial example.
Perlin bias. We add our first bias, low-frequency noise. As before, we simply replace the distribution from which the attack samples the orthogonal step with Perlin patterns. See Table 4, where this alone decreases the median ℓ2 distance by 25%.
Surrogate gradient bias. We also add projected gradients from a surrogate model and set the bias strength w to 0.5. This further reduces the median ℓ2 distance by another 37%, or a total of 53% when compared with the original Boundary Attack. 1000 iterations are enough to make the butterfly almost invisible to the human eye (see Figure 10).
Here, the efficiency boost is much larger than in our ImageNet evaluation in Section 4. This may be due to our choice of surrogate models: in our submission to the AVC, we simply combined the publicly available baselines (ResNet18 and ResNet50). This ensemble is notably stronger than the simple model we used for the ImageNet evaluation, as the ResNet50 model is adversarially trained. However, it is also significantly weaker than the ones used by other winning AVC attack submissions, most of which were found to use much larger ensembles of carefully-trained models. Nevertheless, our attack outperformed most of them, which reinforces our earlier claim: our method seems to make more efficient use of surrogate models than direct transfer attacks.
Mask bias. We did not implement the mask bias in our entry to the AVC because of time constraints.
The official results of the challenge are summarized at https://medium.com/bethgelab/results-of-the-nips-adversarial-vision-challenge-2018-e1e21b690149. The source code of our submission is publicly available at https://github.com/ttbrunner/biased_boundary_attack_avc.