"What's in the box?!": Deflecting Adversarial Attacks by Randomly Deploying Adversarially-Disjoint Models
Sahar Abdelnabi and Mario Fritz
CISPA Helmholtz Center for Information Security
Abstract
Machine learning models are now widely deployed in real-world applications. However, the existence of adversarial examples has long been considered a real threat to such models. While numerous defenses aiming to improve robustness have been proposed, many have been shown to be ineffective. As these vulnerabilities are still nowhere near being eliminated, we propose an alternative deployment-based defense paradigm that goes beyond the traditional white-box and black-box threat models. Instead of training a single partially-robust model, one could train a set of same-functionality, yet adversarially-disjoint, models with minimal in-between attack transferability. These models could then be randomly and individually deployed, such that accessing one of them minimally affects the others. Our experiments on CIFAR-10 and a wide range of attacks show that we achieve a significantly lower attack transferability across our disjoint models compared to a baseline of ensemble diversity. In addition, compared to an adversarially trained set, we achieve a higher average robust accuracy while maintaining the accuracy of clean examples.
1 Introduction

Deep neural networks (DNNs) have achieved tremendous success in different tasks (e.g. image recognition [15, 21], semantic segmentation [26], and object detection [30]). Besides, they can potentially be used in security-critical applications (e.g. autonomous driving [39], face recognition [31], and phishing detection [1]). However, DNNs are vulnerable to adversarial examples: inputs with intentionally crafted, often imperceptible, noise that can cause misclassification [13, 38, 7, 22]. These adversarial examples can even occur in the physical world (e.g. photographed images) [22, 19, 36]. Consequently, adversarial examples raise serious concerns about the security aspects of deploying these models, which has fueled an active research line of adversarial attacks and defenses [2]. One approach is to make the generation of adversarial examples harder by gradient obfuscation (e.g. [14, 10, 46]).
Figure 1: We train same-functionality but adversarially-disjoint models with low in-between attack transferability. These models can be deployed individually and randomly as a defense to deflect adversarial attacks.

Another line of defense is to detect adversarial examples at test time [47, 25]; however, both directions were later circumvented by stronger attacks [3, 6, 5]. On the other hand, adversarial training [24, 13] reduces the sensitivity to adversarial examples by training on them. While being an effective defense against the considered attack, adversarial training is only a partial solution, as it generalizes poorly to unseen threat models [37] while not maintaining the performance on clean examples [44, 42, 24, 45].
Disjoint and randomized deployment as a defense.
As it remains unclear whether adversarial examples are escapable and whether robustness is within reach [33, 12], we tackle the problem from a different perspective and propose a deployment-based defense paradigm. We work around this inherent vulnerability of DNNs: instead of attempting to train and deploy a single model with only so much robustness, we propose to randomly deploy same-functionality, yet different, models. In order for this to succeed, these models should have minimal in-between transferability of attacks. We call these models ‘adversarially-disjoint’. By introducing this disjoint randomness in deployment, we reformulate and go beyond the traditional white-box and black-box threat models. Even if the adversary can successfully craft adversarial attacks on one model, these attacks are significantly less successful on the other deployed models. Our work is the first to propose a randomized and disjoint deployment strategy for adversarial robustness. We depict an overview of our proposed deployment scenario and disjoint models in Figure 1.
Our approach.
In order to obtain these ‘adversarially-disjoint’ models, we train a set of models jointly for the classification task and for minimal transferability. We propose a novel gradient penalty that significantly reduces the transferability of attacks across the models. Our approach considerably outperforms a baseline of ensemble diversity and an adversarially trained set. Unlike adversarial training, our approach does not degrade the clean examples’ accuracy.
Comparing with other randomized defenses.
While the concept of introducing randomness as a defense against adversarial attacks was previously introduced [14, 10, 46], it was used as test-time randomness (sometimes non-differentiable) on the same model. This category of defenses is called ‘gradient masking’ or ‘obfuscated gradients’ and was circumvented by stronger attacks (e.g. Expectation over Transformation (EoT) and Backward Pass Differentiable Approximation (BPDA) [3]). Unlike these approaches, we do not introduce any randomness at test time and we do not cause any gradient masking; adversarial attacks can be normally and successfully crafted on one model, but they simply do not transfer well to the other disjoint deployed models. Instead of having adversarial examples that cause the model to fail universally, our approach deflects them such that they mainly cause one random model to fail exclusively. Our approach is similar to gradient masking in making adversarial examples harder to find; however, we do not obscure the gradient of one model, but rather make it less relevant to the defense as a whole.
2 Background

In this section, we first present a brief overview of adversarial attacks against DNNs, then we summarize existing defenses against these attacks.
2.1 Adversarial Attacks

Adversarial examples are malicious inputs ($x_{adv}$) that are created by adding a crafted perturbation to a normal image ($x$) and that cause a misclassification by a DNN model ($f$), i.e. $f(x_{adv}) \neq f(x)$ [11]. There exist many methods for creating adversarial examples by optimizing the added perturbation. We categorize these methods into 1) single-step attacks, 2) iterative attacks, and 3) optimization-based attacks.

2.1.1 Single-step Attacks

These methods are optimized for speed as they involve taking only a single step to create the adversarial image. One of the most common single-step methods is the Fast Gradient Sign Method (FGSM) [13] that finds $x_{adv}$ by maximizing the training loss function $L(x, y)$, which is usually the cross-entropy. The adversarial examples are created according to:

$$x_{adv} = x + \varepsilon \cdot \mathrm{sign}(\nabla_x L(x, y))$$

where $\nabla_x L(x, y)$ is the gradient of the loss function w.r.t. the input $x$, $y$ is the true label, and $\varepsilon$ is the perturbation bound that enforces the $\ell_\infty$ norm constraint (i.e. $\|x - x_{adv}\|_\infty \leq \varepsilon$). Similarly, an $\ell_2$-bounded version (i.e. $\|x - x_{adv}\|_2 \leq \varepsilon$) of the FGSM attack is the Fast Gradient Method (FGM) [11], which is defined as:

$$x_{adv} = x + \varepsilon \cdot \frac{\nabla_x L(x, y)}{\|\nabla_x L(x, y)\|_2}$$

Another variation of FGSM is random FGSM (R+FGSM) [40], which involves taking a random step ($\alpha$) before adding the gradients:

$$x_{adv} = x' + (\varepsilon - \alpha) \cdot \mathrm{sign}(\nabla_{x'} L(x', y)), \quad \text{where } x' = x + \alpha \cdot \mathrm{sign}(\mathcal{N}(\mathbf{0}^d, \mathbf{I}^d))$$

R+FGSM was originally used to circumvent the gradient masking unintentionally caused by single-step adversarial training.

2.1.2 Iterative Attacks

Iterative attacks are considered stronger than single-step attacks as they involve taking multiple smaller steps (with a size $\alpha$), recomputing the gradients at each step. One of the most common iterative attacks is the Projected Gradient Descent (PGD) attack [24], which first takes a random step to form $x_{adv}^0$ and then iteratively computes:

$$x_{adv}^t = \Pi_{x_{adv} \in \Delta x}\left[x_{adv}^{t-1} + \alpha \cdot \mathrm{sign}(\nabla_x L(x_{adv}^{t-1}, y))\right]$$

where $\Pi_{x_{adv} \in \Delta x}$ is the process of projecting the image back to the allowed perturbation range $\Delta x$.

To improve the transferability of adversarial examples, Momentum Iterative FGSM (MI-FGSM) [11] accumulates a velocity vector in the gradient direction of the loss function across steps. The iterative algorithm is defined as follows:

$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x L(x_{adv}^t, y)}{\|\nabla_x L(x_{adv}^t, y)\|_1}, \qquad x_{adv}^{t+1} = x_{adv}^t + \alpha \cdot \mathrm{sign}(g_{t+1})$$

where $\mu$ is the decay factor. The intuition of MI-FGSM is that by incorporating the momentum, the optimization is stabilized and can escape local maxima.

2.1.3 Optimization-based Attacks

Optimization-based attacks directly optimize the distance between the real and adversarial examples in addition to the adversarial loss ($l$). The Carlini and Wagner attack (CW) [7] minimizes these two objectives as follows:

$$\min \; \|x - x_{adv}\|_2^2 + c \cdot l(f(x_{adv}), y_t)$$

where $c$ is a parameter that controls the trade-off between the two objectives, and $y_t$ is the target label. CW is used mostly for $\ell_2$ norm bounded attacks. An extension of the CW attack is the Elastic Net attack (EAD) [9], which uses both $\ell_1$ and $\ell_2$ norms in the optimization process:

$$\min \; c \cdot l(f(x_{adv}), y_t) + \beta \cdot \|x - x_{adv}\|_1 + \|x - x_{adv}\|_2^2$$

where $\beta$ is also a parameter that controls the trade-off.
For both CW and EAD, a parameter ($\kappa$) that controls the misclassification confidence is used in the adversarial loss ($l$), where higher-confidence attacks are more transferable across models.
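To make these attacks concrete, below is a minimal PyTorch sketch of FGSM and PGD as defined above. The classifier `model`, the [0, 1] input range, and the default ε and α values are illustrative assumptions rather than the exact implementations we evaluate (our attack parameters are listed in Table 1).

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # x_adv = x + eps * sign(grad_x L(x, y)), clipped to the valid image range
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps=0.031, alpha=0.0078, steps=20):
    # Random start inside the eps-ball, then iterative signed steps,
    # each projected back onto the allowed perturbation range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```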
2.2 Defenses

In this section, we summarize defenses related to our work.

Adversarial training is one of the earliest defenses against adversarial attacks; it was used in [13] to train against an FGSM adversary. However, it was later found to be ineffective against a stronger iterative adversary (e.g. PGD) [24]. PGD adversarial training [24] remains empirically robust, unlike several other defenses that were circumvented [3, 44]. The robustness of PGD training comes at the cost of more expensive training, e.g. 80 hours are required to train a robust CIFAR-10 model in [24]. This was later improved to 10 hours in [32] by re-using the gradient in multiple iterations, and to a few minutes in [44] by training against an FGSM adversary with a random step and other optimization tricks. However, it is still an issue that adversarial training causes a drop in the accuracy of clean examples [32, 24, 44, 42, 45]. Additionally, it is less effective against unseen threat models [37]. In our experiments, we compare the clean examples’ accuracy and the average robustness of our set to an adversarially trained set.
Another defense direction is to use an ensemble of models. The work in [28] aimed to improve the ensemble robustness by promoting diversity in the predictions of each model, such that it would be harder for attacks to succeed in fooling all diverse members of the ensemble. In order to not decrease the accuracy, the maximal prediction of each model should not be affected; therefore, the diversity regularization works on the non-maximal predictions. Our work is closely related to this work as both aim to decrease the attack transferability among the set. However, we directly regularize the gradients of the models w.r.t. each other in order to get ‘adversarially-disjoint’ models. As we show in the experiments, we achieve a significantly lower transferability across the models compared to this baseline. Additionally, we propose a novel deployment strategy by deploying these disjoint models individually, while the work in [28] suggests deploying the whole ensemble.
Gradient masking refers to a category of defenses where the gradients are intentionally made harder to find [3]. For example, the work in [14] applied random and non-differentiable operations (e.g. cropping, rescaling, and JPEG compression) to the images at test time without changing the model itself. Additionally, the work in [10] applied a random operation, Stochastic Activation Pruning (SAP), that randomly removes a subset of activations. Similarly, the work in [46] proposed to apply random operations on the input such as random padding and resizing. These defenses intentionally make the gradients harder to compute and therefore they could cause the optimization of attacks to fail. Most of these defenses were later circumvented in [3] by adaptive attacks. For example, the non-differentiable operations can be replaced with differentiable approximations in the backward pass, and the randomization can be accounted for using Expectation over Transformation (EoT). It is now a recommended practice to make sure that a defense is not causing gradient masking that can be circumvented [8]. Our work is similar to gradient masking or obfuscation in the sense that we intend to make adversarial examples harder to find. However, we do not use any gradient masking for the models. In fact, white-box attacks on each model individually can be successful; however, they are mostly exclusive to their corresponding model and ideally futile for the others.
Our work is conceptually similar to previous work that attempts to mitigate attacks by poisoning the attacker’s optimization process instead of solving the harder robustness problem. For example, the work in [34] used ‘trapdoors’ in order to force the attacker’s optimization to converge to these trapdoor patterns, which could then be detected. However, this could be partially circumvented by stronger adaptive attacks [4, 34] that explicitly avoid the trapdoor signature even if the trapdoors are not known to the adversary. Our work has the advantage that each model in the set is deployed individually and randomly without interacting with the other models; thus, it reveals minimal information about the defense as a whole without needing to obscure the defense mechanism. Another conceptually similar work, yet not related to adversarial defenses, corrupts the attacker’s optimization process in model stealing attacks [27] by poisoning the predictions. The predictions are perturbed such that the adversarial gradient maximally deviates from the original gradient, which is done by maximizing the angular deviation between the two. In our work, we also utilize angular deviation regularization, but we found it inadequate for achieving low transferability between the models, and therefore we propose novel and explicit transferability regularization losses.
3 Threat Model

Our proposed deployment-based defense and ‘adversarially-disjoint’ models reframe the traditional white-box and black-box threat models. Traditionally, breaking a deployed model (using black-box or white-box attacks) translates to effectively breaking all models. In our work, we work around this and ask whether there is a way to alleviate adversarial attacks through a smarter deployment strategy. Our ‘adversarially-disjoint’ approach is a first step towards deployment-based defenses that exploit possible randomization opportunities.
Assumptions about the adversary.
We assume that the ‘disjoint models’ are deployed individually and randomly. We assume that the models are either released randomly from the beginning, or released gradually and adaptively in a staged release as a ‘zero-day’ defense if a model becomes frequently attackable. Therefore, in the normal setup, we assume that the adversary has white-box access to one of the models in the set. We assume that they do not have access to the training data (as commonly adopted in previous work [34, 29]). We define three types of adversaries according to their knowledge about the models:

1. Static adversary: they have no knowledge that different models are deployed, and therefore perform normal attacks without adaptation.

2. Skilled adversary: they know that different models are deployed, but have access to one model only, and therefore they attempt to craft highly transferable adversarial examples such that they would transfer to the other deployed models.

3. Oracle adversary: they have access to one or more models and use them to craft ensemble white-box attacks.
Requirements.
We set two requirements for our models that we measure in our experiments: 1) the classification accuracy on clean examples should be minimally affected by training the disjoint models, and 2) attacks should be minimally transferable from one model to another.
4 Approach

In this section, we describe our models’ set optimization process, which is depicted in Figure 2 for a pair of models in the set. We train the models jointly to optimize the classification loss and minimize the transferability of attacks.
Classification.
Each model in the set is trained individually for the classification task. The total classification loss is the sum of the classification losses of all models:

$$L_{class} = \sum_{i=1}^{n} L_i(x, y) \qquad (1)$$

where $L_i$ is the categorical cross-entropy loss of model $f_i$ and $n$ is the total number of models.

Gradient angular deviation.
As in previous work [27, 18, 17], increasing the angular deviation or misalignment between gradients helps to make the losses uncorrelated, which contributes to less transferability of attacks [18, 17]. To increase the angular deviation, we minimize the cosine of the angles between the pairwise gradients in the set:

$$L_{angle} = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{\nabla_x L_i \cdot \nabla_x L_j}{\|\nabla_x L_i\| \cdot \|\nabla_x L_j\|} \qquad (2)$$

where $\nabla_x L_i$ is the adversarial gradient of model $f_i$.
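A minimal sketch of this pairwise cosine penalty for a list of PyTorch models follows; the `models` list and the use of `create_graph=True` (so the penalty remains differentiable w.r.t. the model weights) are implementation assumptions, not necessarily our exact training code.

```python
import torch
import torch.nn.functional as F

def angle_loss(models, x, y):
    # Pairwise cosine similarity between the input gradients of all models
    # (Equation 2); minimizing it pushes the gradients apart.
    grads = []
    for f in models:
        x_i = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(f(x_i), y)
        # create_graph=True keeps the graph so the penalty can be trained through
        g = torch.autograd.grad(loss, x_i, create_graph=True)[0]
        grads.append(g.flatten(1))
    total = x.new_zeros(())
    n = len(models)
    for i in range(n):
        for j in range(i + 1, n):
            total = total + F.cosine_similarity(grads[i], grads[j], dim=1).mean()
    return total
```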
Transferability losses.

Increasing the angular deviation helps to decrease the transferability across models. However, in our experiments, we found it less helpful when training more than two models. The core idea of our approach is to explicitly penalize the transferability of attacks between models, and thus we propose these novel transferability losses. The intuition is: the white-box examples of model $f_j$ (i.e. computed using the adversarial gradients $\nabla_x L_j$) should not be transferable to model $f_i$ (i.e. should not cause misclassification by increasing the loss of $f_i$). Therefore, we minimize:

$$L_{transfer1} = \sum_{i=1}^{n} \sum_{j \neq i} \max\left(L_i(x + \varepsilon_1 \cdot \nabla_x L_j, y) - L_i(x, y), \, 0\right) \qquad (3)$$

where $\varepsilon_1$ controls the perturbation range and the loss is computed for all pairwise combinations (excluding $i = j$). This means that we penalize the increase of the loss of model $f_i$ incurred by adding the gradients of model $f_j$. To extend the transferability minimization to $\ell_\infty$ attacks (i.e. the sign of the gradient), we approximate the non-differentiable sign operation with the differentiable ‘tanh’ function and adapt the previous loss as follows:

$$L_{transfer2} = \sum_{i=1}^{n} \sum_{j \neq i} \max\left(L_i(x + \varepsilon_2 \cdot \tanh(\nabla_x L_j), y) - L_i(x, y), \, 0\right) \qquad (4)$$

Similarly, $\varepsilon_2$ controls the perturbation range. The models are then trained end-to-end with a weighted average of the previous losses:

$$L_{total} = w_1 \cdot L_{class} + w_2 \cdot L_{angle} + w_3 \cdot L_{transfer1} + w_4 \cdot L_{transfer2} \qquad (5)$$

Figure 2: Overview of training the adversarially-disjoint sets. Boxes in red indicate training losses.

As a general intuition, the objective of the transferability losses is to search for an adversarial gradient for each model that minimally affects the other models, and to make each model ideally exclusively sensitive to its own gradients.
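The following sketch illustrates Equations 3 to 5, reusing `angle_loss` from above; the defaults ε₁ = 6, ε₂ = 0.031, and weights (1, 0.5, 0.5, 0.5) follow the values reported in our setup, while the remaining details (e.g. recomputing the gradients per loss) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def transfer_losses(models, x, y, eps1=6.0, eps2=0.031):
    # Input gradients of each model, kept differentiable so the penalty
    # can update the models (Equations 3 and 4).
    grads, clean = [], []
    for f in models:
        x_i = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(f(x_i), y)
        grads.append(torch.autograd.grad(loss, x_i, create_graph=True)[0])
        clean.append(loss)
    l_t1 = l_t2 = x.new_zeros(())
    for i, f in enumerate(models):
        for j, g in enumerate(grads):
            if i == j:
                continue
            # Penalize any increase of model i's loss caused by model j's gradient
            l_t1 = l_t1 + torch.clamp(F.cross_entropy(f(x + eps1 * g), y) - clean[i], min=0)
            l_t2 = l_t2 + torch.clamp(F.cross_entropy(f(x + eps2 * torch.tanh(g)), y) - clean[i], min=0)
    return l_t1, l_t2

def total_loss(models, x, y, w=(1.0, 0.5, 0.5, 0.5)):
    # Equation 5: weighted sum of classification, angle, and transfer losses.
    l_class = sum(F.cross_entropy(f(x), y) for f in models)
    l_t1, l_t2 = transfer_losses(models, x, y)
    return w[0] * l_class + w[1] * angle_loss(models, x, y) + w[2] * l_t1 + w[3] * l_t2
```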
5 Experiments

In this section, we first present the implementation details and our experimental setup. We then evaluate the robustness of our models; we demonstrate the transferability across the adversarially-disjoint models in comparison with baselines. Second, we demonstrate the effective deployed accuracy by considering the accuracy of the whole set. We then evaluate an advanced attack where the adversary has access to more than one model in the set. Finally, we present an ablation study that shows the effect of different losses and design choices.

5.1 Experimental Setup

We used the CIFAR-10 dataset [20], one of the most commonly used datasets for studying adversarial robustness (e.g. [34, 28, 32, 45, 44, 24]). It consists of 50k color training images and 10k color testing images spanning 10 classes. All experiments are done using the PreAct ResNet18 architecture [16, 44]. We used the PyTorch framework (https://pytorch.org/) for all our experiments. We varied the number of models in the set from 3 to 6. For training 3 and 4 models, we use all combinations of pairwise transferability losses in each iteration. However, for training 5 and 6 models, we found it more helpful (for convergence, faster training, and memory) to stochastically sample 3 different models at each iteration. We used a batch size of 128 for 3 models and 100 for the other sets. We used the SGD optimizer with a momentum of 0.9 and a weight decay of $1 \times 10^{-4}$. We used the cyclic learning rate [35, 44], where the learning rate increases linearly from 0 to 0.2 during the first half of the iterations and then decreases linearly. We set the values of $\varepsilon_1$ and $\varepsilon_2$ in Equation 3 and Equation 4 to 6 and 0.031, respectively. We trained the models for 75 epochs in the case of 3 and 4 models, 100 epochs in the case of 5 models, and 120 epochs in the case of 6 models. For each loss type (classification, angular deviation, and transferability), we average the losses across all combinations. For the weights in Equation 5, we set the classification weight to 1 and the other losses’ weights to 0.5; training was not sensitive to the exact values of these weights. We will make our code, setup, and models’ checkpoints available at the time of publication.

Metrics.
We evaluate the models by the clean test accuracy. We form a transferability matrix of attacks across the models (i.e. an $n \times n$ matrix where the source models are the rows and the target models are the columns). We evaluate the accuracy of black-box attacks between the models in the set (i.e. the non-diagonal elements) and the accuracy of the whole set across all combinations (i.e. including the diagonal white-box elements, where the source and target models are the same).
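A minimal sketch of computing this $n \times n$ transferability matrix, assuming a list of `models`, a data `loader`, and an `attack` function such as the PGD sketch in Section 2:

```python
import torch

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(1) == y).float().mean().item()

def transfer_matrix(models, loader, attack):
    # acc[src, tgt]: accuracy of model tgt on examples crafted against model src;
    # the diagonal is the white-box case, off-diagonal entries are black-box.
    n = len(models)
    acc = torch.zeros(n, n)
    batches = 0
    for x, y in loader:
        for src in range(n):
            x_adv = attack(models[src], x, y)  # white-box attack on the source
            for tgt in range(n):
                acc[src, tgt] += accuracy(models[tgt], x_adv, y)
        batches += 1
    return acc / batches
```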
Attack      Attack parameters
FGSM        ε = –
FGM         ε = –
R+FGSM      ε = –, α = ε/2
PGD₁        ε = –, α = –
PGD₂        ε = –, α = –
MI-FGSM₁    ε = –, α = –, µ = 1, steps = 10
MI-FGSM₂    ε = –, α = –, µ = 1, steps = 20
CW₁         c = –, κ = 0, max iterations = 1000, learning rate = 0.01, optimizer = Adam
CW₂         c = –, κ = 40, max iterations = 1000, learning rate = 0.01, optimizer = Adam
EAD₁        c = –, κ = –, β = 0.01, decision rule = ‘EN’, max iterations = 1000, learning rate = 0.01, optimizer = SGD
EAD₂        c = –, κ = –, β = 0.01, decision rule = ‘EN’, max iterations = 1000, learning rate = 0.01, optimizer = SGD
Table 1: The attacks we use in our experiments and their parameters. The subscripts differentiate between different parameter settings of the same attack and are used to refer to these settings in the rest of the paper.
Baselines.
We evaluate our approach against two baselines: 1) the ensemble diversity in [28] (https://github.com/P2333/Adaptive-Diversity-Promoting), where we compare the transferability across our adversarially-disjoint models to their diverse ensemble, and 2) an adversarially trained set, as an even more challenging setup than a single adversarially trained model. We compare our approach with adversarial training in terms of the transferability and the average accuracy across the whole set (for an objective and fair comparison, since our models have low transferability but, on the other hand, no white-box robustness). We followed [44]’s approach of fast adversarial training. We trained 4 models separately with different random initializations. As in [44], the models are trained using random step + FGSM and a cyclic learning rate. Each model was trained for 50 epochs. We reached clean and robust accuracies comparable to PGD training in [24] (e.g. we reached a clean accuracy of 85.2% versus 87.3%, and a robust accuracy of 49.53% versus 50% under a PGD attack).
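For reference, a rough sketch of a single training iteration of this baseline, in the spirit of [44]; the step size α ≈ 1.25ε and the omission of the additional optimization tricks of [44] are our simplifications.

```python
import torch
import torch.nn.functional as F

def fgsm_rs_step(model, optimizer, x, y, eps=0.031, alpha=0.039):
    # Random start inside the eps-ball, one signed FGSM step, clipped back.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    # Standard training update on the resulting adversarial examples.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```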
Attacks.

We evaluate the models using the range of attacks discussed in subsection 2.1. Since optimization-based attacks do not explicitly use the gradients of the training loss, we use them to test the generalization of our approach to unseen attacks. In order to evaluate against a skilled adversary (discussed in section 3), we evaluate against highly transferable attacks such as high-confidence CW and EAD, and MI-FGSM attacks. We also opt to evaluate against attacks that introduce randomness before the gradient steps (such as R+FGSM and PGD), since the models were trained without adding a random step first, and therefore these attacks are more challenging to our models than their counterparts (FGSM and I-FGSM). We show the attacks’ settings and parameters in Table 1.

5.2 Attack Transferability

In this section, we show our evaluation of attack transferability between the adversarially-disjoint models.
Comparing to adversarial training.
In Table 2, we show the black-box accuracy of attacks across the set for our approach in comparison with an adversarially trained set. We compute the average accuracy over all black-box combinations of different source and target models (i.e. the average of the non-diagonal elements in the $n \times n$ transferability matrix). To study the effect of increasing the number of models, we show the average accuracy when training 3, 4, 5, and 6 models. We evaluate the models against single-step attacks, iterative attacks, and optimization-based attacks. We include strong and high-confidence optimization-based attacks and the highly transferable MI-FGSM with a higher number of steps and a larger step size than what was originally used in [11]. From the table, we highlight the following conclusions:

1. Our approach is significantly more resilient to transferability across all attacks and numbers of models in the set, even for highly transferable attacks that transfer well in the adversarially trained set.

2. As discussed earlier, the transferability of R+FGSM is slightly higher than FGSM, although it is generally a weaker attack, as indicated by the adversarially trained set. The random step taken in R+FGSM may account for this result.

3. The black-box accuracy of single-step attacks is consistently very high and does not deteriorate with an increasing number of models.

4. When increasing the number of models, the black-box accuracy of iterative attacks (especially MI-FGSM) and highly transferable optimization-based attacks (CW) gradually declines.

Table 2: Test accuracy (%) of black-box adversarial examples for adversarially trained (AT) models (3 models trained individually with random seeds) and for our approach of adversarially-disjoint models with a varying number of models in the set. We show the average and standard deviation over all black-box combinations. The attacks’ parameters are in Table 1.

Therefore, in order to evaluate whether it is beneficial to increase the number of models beyond a certain limit (e.g. from 5 to 6 models), we evaluate the effective average accuracy of the whole set (i.e. the average of the $n \times n$ matrix including white-box and black-box pairs) in the next section.

Comparing to ensemble diversity.
We next compare our approach to the diverse ensemble in terms of transferability among the set, which we show in Table 3 using a representative set of attacks. The experiments are done using 3 models in the set for both approaches. Similar to Table 2, we report the average black-box accuracy across all black-box combinations in the set. As can be observed, our approach significantly outperforms the ensemble diversity in black-box accuracy. In our work, we propose deploying the models individually in order to introduce uncertainty for the adversary, and therefore we focus on minimizing the transferability. On the other hand, the work in [28] proposed to deploy the ensemble as a whole, and therefore the gained robustness was mainly in the white-box case on the ensemble instead of the black-box case across the models. However, the white-box robustness gained by the diverse ensemble was significantly reduced by stronger attacks in subsequent work [41].
Table 3: Test accuracy (%) of black-box adversarial examples for the diverse ensemble baseline [28] and our approach of adversarially-disjoint models, both sets of 3 models. We show the average and standard deviation over all black-box combinations. The attacks’ parameters are in Table 1.
Expanding to higher perturbation budgets.
In order to test the extent of our defense, we examine strong PGD attacks with significantly higher perturbation budgets and a large number of steps. We vary $\varepsilon$ from 0.0196 to 0.156 and, for each value, we run the PGD attack with 100 steps and $\alpha = 2.5 \cdot \varepsilon / 100$ (following [24] for comparison). We run this experiment on the 5-models set; since the black-box accuracy tends to decline with an increasing number of models, we chose a high enough number of models for this experiment. We show the average black-box accuracy across the set in Figure 3, in addition to the average accuracy across the whole set (including the diagonal white-box case).

Figure 3: The average accuracy of black-box attacks across the models (blue) and the accuracy of the whole set (orange) with increasing values of the perturbation budget ($\varepsilon$). The attack is PGD with 100 steps and $\alpha = 2.5 \cdot \varepsilon / 100$ [24]. The number of models in the set is 5.

Even with such strong perturbations, which significantly reduce the effectiveness of PGD adversarial training [24], we still reach an average whole-set accuracy of 45.94% at the high end of this range (interestingly, this is on par with the white-box robustness of one adversarially trained model under a much weaker PGD attack with $\varepsilon = 0.031$ and 20 steps [24]).
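This sweep can be sketched as follows, reusing the `pgd` and `transfer_matrix` sketches from the previous sections; the intermediate grid values are illustrative, as only the 0.0196 to 0.156 range is specified above.

```python
# Reuses the pgd and transfer_matrix sketches from the previous sections.
for eps in [0.0196, 0.047, 0.078, 0.117, 0.156]:  # grid is illustrative
    def attack(model, x, y, e=eps):
        return pgd(model, x, y, eps=e, alpha=2.5 * e / 100, steps=100)
    m = transfer_matrix(models, loader, attack)
    n = len(m)
    black_box = (m.sum() - m.diag().sum()).item() / (n * n - n)
    print(f"eps={eps}: black-box avg={black_box:.2f}, whole-set avg={m.mean().item():.2f}")
```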
Visualizing the gradients.
Next, we visualize the separation of the adversarial gradients caused by our training scheme. Using the 3, 4, and 5-models sets, we computed the signs of the adversarial gradients ($\nabla_x L_i$) for 1000 test images and used the t-distributed stochastic neighbor embedding (t-SNE) method [43] to visualize them. We show this visualization in Figure 4. For comparison, we apply the same process to 3 adversarially trained models. Adversarial training directly reduces the sensitivity of the model to the perturbations and is not intended to separate the gradients of different models; however, we show it here to visualize the inherent or baseline separation between different models’ gradients (and therefore, the extent of the separation achieved by our approach). As can be observed, with our approach the models’ gradients are relatively well separated and distinctive from each other, as intended by our transferability losses.
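A minimal sketch of this visualization, assuming a list of `models` and a test `loader`, and using scikit-learn's t-SNE:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.manifold import TSNE

def gradient_sign_embedding(models, loader, n_images=1000):
    # One feature vector per (image, model) pair: the flattened sign of
    # the model's input gradient, embedded in 2-D with t-SNE.
    feats, labels, seen = [], [], 0
    for x, y in loader:
        for idx, f in enumerate(models):
            x_i = x.clone().requires_grad_(True)
            loss = F.cross_entropy(f(x_i), y)
            g = torch.autograd.grad(loss, x_i)[0]
            feats.append(g.sign().flatten(1).cpu().numpy())
            labels.extend([idx] * x.size(0))
        seen += x.size(0)
        if seen >= n_images:
            break
    emb = TSNE(n_components=2).fit_transform(np.concatenate(feats))
    return emb, np.array(labels)
```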
Figure 4: A visualization using t-SNE embeddings of the adversarial gradients’ signs for 1000 images. (a) shows 3 adversarially trained models with different initializations. (b), (c), and (d) show our approach of adversarially-disjoint models (3, 4, and 5 models).

5.3 Whole-Set Accuracy

In the last section, we investigated the transferability of attacks across the models and its relationship with the number of models in the set. However, it is not very clear how much we gain when increasing the number of models, especially when the black-box accuracy declines for certain attacks. Additionally, it is not clear how our approach as a set compares to adversarial training, considering that adversarial training increases both white-box and black-box accuracy, while we only focus on black-box accuracy. Therefore, in this section, we evaluate the average accuracy of the whole set (i.e. the average of the full $n \times n$ attack matrix). In this evaluation, the baseline is also a set of adversarially trained models. We show our evaluation in Table 4, in which we report the whole-set average accuracy for the sets of 3, 4, 5, and 6 adversarially-disjoint models in comparison with sets of 3 and 4 adversarially trained models. We also show the clean test accuracy (i.e. no attack) in the first row. We highlight the following observations from the table:

1. As reported in prior work [32, 24, 44, 42, 45], adversarial training drops the accuracy of clean examples. On the contrary, our approach hardly drops the clean accuracy, even when training 6 models (the baseline clean accuracy is 94.13%).

2. With only 3 models in the set, our approach outperforms adversarial training on nearly all attacks (except R+FGSM). Adding only one more model (4 models) outperforms adversarial training on all attacks. The 5- and 6-model sets significantly outperform adversarial training.

3. Increasing the number of models in the adversarially trained set from 3 to 4 yields very little performance gain in most cases. On the other hand, increasing the number of models in our approach can increase the performance by a large extent (e.g. when comparing the 4-models set to the 3-models set).

4. In some cases, there is no significant performance gain when adding a new model (e.g. from 5 to 6 models), since the black-box accuracy can drop in some cases (see Table 2). However, adding a new model did not harm the average performance, and it has other advantages against the advanced attacks where the attacker has access to more than one model (discussed in the next section).

Attacks     AT set (3 models)  AT set (4 models)  Ours (3 models)  Ours (4 models)  Ours (5 models)  Ours (6 models)
No attack   85.2               85.3               94.1             94.0             93.7             93.5
FGSM        60.5               61.1               66.5             73.5             78.6             81.6
FGM         45.6               46.2               66.2             72.1             78.0             81.2
R+FGSM      73.3               73.7               65.5             72.5             76.5             79.3
PGD         –                  –                  –                –                –                –

Table 4: Comparing our approach with adversarially trained sets in terms of test accuracy (%). The first row shows the clean test accuracy (i.e. no attack). Other rows show the average accuracy of attacks over all combinations of source and target models in the set (i.e. including white-box and black-box cases). The attacks’ parameters are in Table 1.

Additionally, as discussed before in Figure 3, we evaluated the whole-set average accuracy with strong PGD attacks using the 5-models set, and it showed high resilience even to significantly strong perturbations.
5.4 Advanced Attacks

Our standard assumption is that the adversary can access only one model from the set. In this section, we extend this assumption and evaluate against an ‘oracle adversary’ with white-box access to more models (e.g. by purchasing many versions that could randomly contain different models).
Attacks using an ensemble.
If the adversary can access $m$ models (where $m \leq n$), they can fuse their logits to form a new model:

$$f_{ens} = \frac{1}{m} \sum_{i=1}^{m} f_i$$

which is used to compute the loss $L_{ens}$. One can then form the adversarial image using the adversarial gradient of $L_{ens}$ (i.e. $\nabla_x L_{ens}$). We use this setup in our following evaluations. We also experimented with another approach that averages the gradients of the models ($\frac{1}{m} \sum_{i=1}^{m} \nabla_x L_i$) and uses the average in the attacks; however, it was not very effective compared to using $\nabla_x L_{ens}$.
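A minimal sketch of the fused model, which can then be attacked with any of the white-box attacks sketched earlier (e.g. PGD); the `accessible_models` name is hypothetical.

```python
import torch.nn as nn

class LogitEnsemble(nn.Module):
    # f_ens = (1/m) * sum_i f_i, fused at the logit level.
    def __init__(self, models):
        super().__init__()
        self.models = nn.ModuleList(models)

    def forward(self, x):
        return sum(f(x) for f in self.models) / len(self.models)

# e.g. x_adv = pgd(LogitEnsemble(accessible_models), x, y, eps, alpha, steps)
```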
Ensemble attacks against the 4-models set.

We used the previously mentioned approach to attack the 4-models set. We show this experiment in Table 5. We vary the number of models on which the attacks are computed: from 1 model (which is similar to all previous experiments) to 4 (using all models in the set). We then use the crafted images to attack each model individually (since these models are individually deployed as per our assumption). The attack used in this experiment is PGD with 20 steps (with the same $\varepsilon$ and $\alpha$ as in the previous tables). We give examples of different combinations of source models in Table 5. We can observe that if $f_i$ is included in the ensemble, the accuracy of $f_i$ on the crafted images drops to nearly 0 (similar to the white-box case for each model individually). Therefore, accessing all models drops the accuracy to nearly 0 on all models. However, if a model is not included in the ensemble, its accuracy on the attack images remains significantly high.

Source model(s)   Target model
                  0      1      2      3
0                 0.0    95.1   95.8   89.9
1                 90.9   0.0    92.2   88.8
2                 88.8   94.7   0.0    92.5
3                 95.1   95.4   94.1   0.0
0,1               0.0    0.0    95.9   90.8
0,2               0.0    96.2   0.0    94.3
0,3               0.0    96.5   97.2   0.0
0,1,2             0.0    0.0    0.0    96.9
0,2,3             0.0    98.4   0.0    0.1
0,1,2,3           0.0    0.1    0.1    0.1

Table 5: Test accuracy (%) of ensemble-based attacks against the 4-models set. Attacks are computed using an ensemble of an increasing number of models (1, ..., n) and evaluated on all models in the set. The attack implemented is PGD with 20 steps.

Attacks versus the number of models in the set.
The previous experiment shows that the success of the attack depends on how many models the adversary can access out of the whole set. A natural next step is therefore to evaluate the attacks w.r.t. the number of models in the set. Thus, we extend the previous analysis done in Table 5 to all sets, using the same setup (i.e. the attacks are crafted on the ensemble and tested on all models). We show in Figure 5 the relationship between the number of models in the attacker’s ensemble and the average accuracy over all models in the set. Each point represents an average over all $\binom{n}{m}$ combinations, where $n$ is the total number of models in the set and $m$ is the number of models in the ensemble. From this figure, we observe that having a larger number of models in the set helps to alleviate this attack, since it becomes harder for the adversary to acquire all the models. Therefore, even when there is no significant gain in the whole-set average (e.g. when increasing from 5 to 6 models in Table 4), it is still beneficial to increase the number of models (potentially beyond 6) to mitigate advanced attacks.

Figure 5: Test accuracy (averaged over all models in the set) of attacks crafted using an ensemble with a varying number of models (x-axis). The analysis is repeated for all sets. The attack implemented is PGD with 20 steps.

5.5 Ablation Study

In this section, we present an analysis of some design choices and components of our approach.

Angular deviation and transferability losses.
In our approach, we use both the angular deviation loss and the transferability losses. Here, we investigate each component individually. Intuitively, increasing the angular deviation between two gradients should help to make their losses uncorrelated. To test this, we trained 3 models using the classification loss and the angular deviation loss only. We evaluated the performance of the PGD attack on a validation set during training. We found that increasing the angular deviation does help to reduce the transferability in the first few epochs (the average black-box accuracy is 30-40% after a few epochs). However, the black-box accuracy decreases again during training, even when the angular deviation loss is also decreasing. As shown in Table 6, at the end of training, the black-box accuracy had decreased to 21%. This suggests that angular deviation on its own is not enough to achieve low transferability. This is also supported by previous work [23], which showed that two models with orthogonal gradients can still have high transferability of attacks. Our interpretation is that the hypothesis that orthogonal gradients lead to low transferability is driven by the linear approximation of the loss function; since the loss function is in fact not linear, there might be two orthogonal directions that both increase the loss of one model. On the other hand, we trained another variant with the transferability losses only, without angular deviation, which we show in the second row of Table 6. This variant is much more successful in decreasing the transferability than the first one. However, what worked best, and what we eventually used, is to train with both losses (third row of Table 6).

Increasing the number of models.
Jointly training 5 and 6 models with all combinations of transferability losses at each mini-batch was not very successful; the losses fluctuated and the training converged to both lower classification accuracy and lower black-box accuracy than in the 3- and 4-model cases. In addition, it required additional training time and memory resources. This could possibly be due to the difficulty of jointly optimizing all the losses simultaneously. To circumvent this, we experimented with adding one new fifth model while fixing the weights of the four trained models. However, although faster, this was also not very effective; this is probably because our approach is more effective when each model changes its own adversarial gradients. Also, the new model’s clean accuracy was slightly lower than that of the first four models. Finally, what worked best in our setup is to jointly train 5 and 6 models but use only a few randomly sampled models in each iteration (we used 3) for the transferability loss. This stabilized the training without any significant fluctuation and significantly reduced the time each epoch takes, because even when we have a total of 5 or 6 models in the set, we compute the expensive backward pass for only 3 models. Also, the clean accuracy was close to the 3- and 4-model cases. This opens the potential for increasing the number of models to even more than 6. We compare these three approaches in Table 7 in terms of clean accuracy and black-box accuracy (using the PGD attack in Table 1).

Training variant        Black-box accuracy
No transferability      21%
No angular deviation    85.9%
Both losses             96.2%

Table 6: A comparison between three variants in terms of black-box accuracy: training with the angular deviation loss only, training with the transferability losses only, and training with both losses. The attack is PGD in Table 1. The number of models in the set is 3.

Training variant   # models   Black-box (%)   Clean (%)
Joint              3          95.7            94.1
Joint              4          93.8            94.0
Joint              5          87.9            91.3
Add one model      5          88.7            90.5
Sampling           5          95.12           93.7
Sampling           6          93.9            93.5

Table 7: A comparison in terms of accuracy between expanding the number of models using joint training, adding one model while fixing the others, and joint training with sampling. The attack is PGD in Table 1.

6 Discussion

In this section, we discuss the limitations, implications, and possible extensions of our work.
Number of models.
In this work, we take a first step towards randomized deployment defenses for adversarial robustness. Using our approach, we managed to expand the set to 6 models. However, since increasing the number of models is important for mitigating advanced attacks and for improving the whole-set average, it remains an open issue how to train and generate a larger population of models at scale, efficiently and while maintaining performance, and how to increase the number of models without re-training.
White-box robustness.
The ‘adversarially-disjoint’ models have significantly lower attack transferability; however, they have no white-box robustness. Even though we manage to (often significantly) outperform adversarial training in many settings (see Figure 3 and Table 4), it is a natural extension to investigate whether it is possible to combine adversarial robustness with our disjoint training, or whether the two are at odds as a fundamental trade-off (e.g. similar to the trade-off between robustness and accuracy [42]).
7 Conclusion

In this work, we propose a new paradigm for adversarial defenses. Instead of attempting to solve the intrinsic vulnerabilities of DNNs, we exploit the unexplored opportunities of model deployment. We propose to change the traditional white-box threat model by deploying same-functionality, yet adversarially-disjoint, models with minimal transferability of attacks. By doing so, even if the adversary can craft successful adversarial attacks on one model, these attacks are significantly less successful on the other deployed models. Our approach is fundamentally different from other randomness defenses as it does not involve any gradient masking or obfuscation that could be easily circumvented. To obtain the adversarially-disjoint models, we propose a novel gradient penalty that strongly reduces the transferability of attacks across the models. Our approach is significantly more resilient to transferability in comparison to a baseline of ensemble diversity. Additionally, we outperform a baseline of an adversarially trained set over a wide range of attacks, while hardly having any negative effect on the clean accuracy.
Acknowledgments

We thank Yang Zhang and Hossein Hajipour for constructive advice and valuable discussions.
References

[1] Sahar Abdelnabi, Katharina Krombholz, and Mario Fritz. “VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity”. In: ACM SIGSAC Conference on Computer and Communications Security (CCS). 2020.
[2] Naveed Akhtar and Ajmal Mian. “Threat of adversarial attacks on deep learning in computer vision: A survey”. In: IEEE Access (2018).
[3] Anish Athalye, Nicholas Carlini, and David Wagner. “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”. In: International Conference on Machine Learning (ICML). 2018.
[4] Nicholas Carlini. “A partial break of the honeypots defense to catch adversarial attacks”. In: arXiv preprint arXiv:2009.10975 (2020).
[5] Nicholas Carlini and David Wagner. “Adversarial examples are not easily detected: Bypassing ten detection methods”. In: 10th ACM Workshop on Artificial Intelligence and Security. 2017.
[6] Nicholas Carlini and David Wagner. “MagNet and ‘Efficient defenses against adversarial attacks’ are not robust to adversarial examples”. In: arXiv preprint arXiv:1711.08478 (2017).
[7] Nicholas Carlini and David Wagner. “Towards evaluating the robustness of neural networks”. In: IEEE Symposium on Security and Privacy (SP). 2017.
[8] Nicholas Carlini et al. “On evaluating adversarial robustness”. In: arXiv preprint arXiv:1902.06705 (2019).
[9] Pin-Yu Chen et al. “EAD: Elastic-net attacks to deep neural networks via adversarial examples”. In: AAAI Conference on Artificial Intelligence. 2018.
[10] Guneet S Dhillon et al. “Stochastic Activation Pruning for Robust Adversarial Defense”. In: International Conference on Learning Representations (ICLR). 2018.
[11] Yinpeng Dong et al. “Boosting adversarial attacks with momentum”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
[12] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. “Adversarial vulnerability for any classifier”. In: Advances in Neural Information Processing Systems (NeurIPS). 2018.
[13] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples”. In: International Conference on Learning Representations (ICLR). 2015.
[14] Chuan Guo et al. “Countering Adversarial Images using Input Transformations”. In: International Conference on Learning Representations (ICLR). 2018.
[15] Kaiming He et al. “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification”. In: IEEE International Conference on Computer Vision (ICCV). 2015.
[16] Kaiming He et al. “Identity mappings in deep residual networks”. In: European Conference on Computer Vision (ECCV). 2016.
[17] Mohammad AAK Jalwana et al. “Orthogonal Deep Models as Defense Against Black-Box Attacks”. In: IEEE Access (2020).
[18] Sanjay Kariyappa and Moinuddin K Qureshi. “Improving Adversarial Robustness of Ensembles with Diversity Training”. In: arXiv preprint arXiv:1901.09981 (2019).
[19] Zelun Kong et al. “PhysGAN: Generating physical-world-resilient adversarial examples for autonomous driving”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Tech. rep. 2009.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems (NeurIPS). 2012.
[22] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. “Adversarial examples in the physical world”. In: arXiv preprint arXiv:1607.02533 (2016).
[23] Yanpei Liu et al. “Delving into Transferable Adversarial Examples and Black-box Attacks”. In: International Conference on Learning Representations (ICLR). 2017.
[24] Aleksander Madry et al. “Towards Deep Learning Models Resistant to Adversarial Attacks”. In: International Conference on Learning Representations (ICLR). 2018.
[25] Dongyu Meng and Hao Chen. “MagNet: A two-pronged defense against adversarial examples”. In: ACM SIGSAC Conference on Computer and Communications Security (CCS). 2017.
[26] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. “Learning deconvolution network for semantic segmentation”. In: IEEE International Conference on Computer Vision (ICCV). 2015.
[27] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. “Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks”. In: International Conference on Learning Representations (ICLR). 2020.
[28] Tianyu Pang et al. “Improving adversarial robustness via promoting ensemble diversity”. In: International Conference on Machine Learning (ICML). 2019.
[29] Nicolas Papernot et al. “Distillation as a defense to adversarial perturbations against deep neural networks”. In: IEEE Symposium on Security and Privacy (SP). 2016.
[30] Joseph Redmon et al. “You only look once: Unified, real-time object detection”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[31] Florian Schroff, Dmitry Kalenichenko, and James Philbin. “FaceNet: A unified embedding for face recognition and clustering”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.
[32] Ali Shafahi et al. “Adversarial training for free!” In: Advances in Neural Information Processing Systems (NeurIPS). 2019.
[33] Ali Shafahi et al. “Are adversarial examples inevitable?” In: International Conference on Learning Representations (ICLR). 2019.
[34] Shawn Shan et al. “Gotta Catch ’Em All: Using Honeypots to Catch Adversarial Attacks on Neural Networks”. In: ACM SIGSAC Conference on Computer and Communications Security (CCS). 2020.
[35] Leslie N Smith. “Cyclical learning rates for training neural networks”. In: IEEE Winter Conference on Applications of Computer Vision (WACV). 2017.
[36] Dawn Song et al. “Physical adversarial examples for object detectors”. In: 12th USENIX Workshop on Offensive Technologies (WOOT). 2018.
[37] David Stutz, Matthias Hein, and Bernt Schiele. “Confidence-calibrated adversarial training: Generalizing to unseen attacks”. In: International Conference on Machine Learning (ICML). 2020.
[38] Christian Szegedy et al. “Intriguing properties of neural networks”. In: International Conference on Learning Representations (ICLR). 2014.
[39] Yuchi Tian et al. “DeepTest: Automated testing of deep-neural-network-driven autonomous cars”. In: International Conference on Software Engineering (ICSE). 2018.
[40] Florian Tramèr et al. “Ensemble Adversarial Training: Attacks and Defenses”. In: International Conference on Learning Representations (ICLR). 2018.
[41] Florian Tramèr et al. “On Adaptive Attacks to Adversarial Example Defenses”. In: Advances in Neural Information Processing Systems (NeurIPS). 2020.
[42] Dimitris Tsipras et al. “Robustness May Be at Odds with Accuracy”. In: International Conference on Learning Representations (ICLR). 2019.
[43] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE”. In: Journal of Machine Learning Research (2008).
[44] Eric Wong, Leslie Rice, and J. Zico Kolter. “Fast is better than free: Revisiting adversarial training”. In: International Conference on Learning Representations (ICLR). 2020.
[45] Chang Xiao and Changxi Zheng. “One Man’s Trash Is Another Man’s Treasure: Resisting Adversarial Examples by Adversarial Examples”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
[46] Cihang Xie et al. “Mitigating Adversarial Effects Through Randomization”. In: International Conference on Learning Representations (ICLR). 2018.
[47] Weilin Xu, David Evans, and Yanjun Qi. “Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks”. In: 25th Annual Network and Distributed System Security Symposium (NDSS). 2018.