Backdoor Scanning for Deep Neural Networks through K-Arm Optimization
Guangyu Shen, Yingqi Liu, Guanhong Tao, Shengwei An, Qiuling Xu, Siyuan Cheng, Shiqing Ma, Xiangyu Zhang
Abstract
Backdoor attack poses a severe threat to deep learning systems. It injects hidden malicious behaviors into a model such that any input stamped with a special pattern can trigger such behaviors. Detecting backdoors is hence of pressing need. Many existing defense techniques use optimization to generate the smallest input pattern that forces the model to misclassify a set of benign inputs injected with the pattern to a target label. However, the complexity is quadratic in the number of class labels, so they can hardly handle models with many classes. Inspired by Multi-Arm Bandit in Reinforcement Learning, we propose a K-Arm optimization method for backdoor detection. By iteratively and stochastically selecting the most promising labels for optimization with the guidance of an objective function, we substantially reduce the complexity, allowing us to handle models with many classes. Moreover, by iteratively refining the selection of labels to optimize, it substantially mitigates the uncertainty in choosing the right labels, improving detection accuracy. At the time of submission, the evaluation of our method on over 4000 models in the IARPA TrojAI competition from round 1 to the latest round 4 achieves top performance on the leaderboard. Our technique also outperforms three state-of-the-art techniques in terms of accuracy and the scanning time needed.

*Equal contribution. Department of Computer Science, Purdue University, West Lafayette, IN, USA; Department of Computer Science, Rutgers University, Piscataway, NJ, USA. Correspondence to: Guangyu Shen <[email protected]>, Yingqi Liu <[email protected]>, Guanhong Tao <[email protected]>, Shengwei An <[email protected]>, Qiuling Xu <[email protected]>, Siyuan Cheng <[email protected]>, Shiqing Ma <[email protected]>, Xiangyu Zhang <[email protected]>.
1. Introduction
The semantics of a deep neural network is determined by model parameters that are not interpretable. Trojan (backdoor) attack exploits the uninterpretability and injects malicious hidden behavior into neural networks. To activate the backdoor behavior, the attacker stamps a trigger on a benign input and passes the stamped input to the trojaned model, which then misclassifies the input to the target label. When benign inputs are provided, the trojaned model has comparable accuracy to the original one. The feasibility of trojan attack has been demonstrated by many existing works. For example, data poisoning (Gu et al., 2017) directly uses stamped inputs in training to inject the backdoor. Neuron hijacking (Liu et al., 2018b) compromises a small number of selected neurons by changing their associated weight values through input reverse engineering and retraining. Clean-label attack (Shafahi et al., 2018) injects malicious features into the target class samples instead of victim class samples, and hence is more stealthy. More discussion can be found in the related work section.

Realizing the prominent threat, researchers have developed a number of defense techniques that range from detecting malicious (stamped) inputs at runtime (Ma & Liu, 2019) to offline model scanning for possible backdoors (Liu et al., 2019; Wang et al., 2019; Kolouri et al., 2020). The former is an on-the-fly technique and requires the presence of malicious inputs. The latter determines if a given model contains any backdoor. It usually assumes a small set of benign inputs for all the classes of the model but not any malicious inputs. Existing scanners usually consider two types of backdoors.
The first is the universal backdoor that causes misclassification (to the target label) of benign samples from any class when they are stamped with the trigger. The second is the label-specific backdoor that only causes misclassification of benign samples from a specific victim class to the target label, when they are stamped with the trigger.
Neural Cleanse (NC) (Wang et al., 2019) uses optimization to derive a trigger for each class and observes if there is any trigger that is exceptionally small and hence likely injected instead of being a naturally occurring feature.
Artificial Brain Stimulation (ABS) (Liu et al., 2019) systematically intercepts and changes internal neuron activation values on benign inputs, and then observes if consistent misclassification can be induced. If so, the corresponding neurons are considered compromised and used to reverse engineer a trigger. More existing techniques are discussed in the related work section.

Figure 1. Motivation cases: (a) pre-selection fails to identify the backdoor in a trojaned model; (b) ABS stimulation analysis misses the compromised neuron.

Although the effectiveness of existing solutions has been demonstrated, they have various limitations. In particular, since the target label is unknown beforehand, scanners such as NC try to scan all labels. If the backdoor is label-specific, the computation complexity is quadratic. As such, they can hardly handle models with many classes. For example, NC cannot finish scanning a TrojAI round 2 model with 23 classes within 15 hours. Techniques like ABS leverage additional analysis to pre-select a set of labels/neurons to optimize. However, their effectiveness hinges on the correctness of pre-selection.

We propose a new backdoor scanning method that can handle models with many classes and has better effectiveness and efficiency than existing solutions. Inspired by
K-Arm bandit (Auer et al., 2002) in Reinforcement Learning, which optimizes decision making with a large number of possible options, we propose a K-Arm backdoor scanner. Instead of optimizing for all the labels one by one, the process is divided into many rounds, and in each round our algorithm selects one label to optimize for a small number of epochs. The selection is stochastic, guided by an objective function. The function measures the past progress of a candidate label, e.g., how fast a small trigger can be generated to misclassify stamped inputs to the label (as a trigger is generally easy to optimize if the label is trojaned), and how small the trigger is. The stochastic nature of the method ensures that even if the true target label is not selected for the current round, it still has a good chance to be selected later, improving effectiveness. Natural features sometimes behave similarly to backdoors. To distinguish the two, we develop a symmetric optimization algorithm that piggy-backs on the K-Arm backbone. It leverages the following observation: while it is easy to optimize a trigger that flips the victim label to the target label, the inverse (i.e., optimizing a trigger that flips the target label to the victim label) is difficult; natural features, however, do not have this property.

We evaluate our prototype on 4000 models from the IARPA TrojAI round 1 to the latest round 4 competitions, and a few complex models on ImageNet. Our technique achieved top performance on the TrojAI leaderboard and reached the round targets on the TrojAI test server for all rounds. It is substantially more effective than the state-of-the-art techniques NC, ABS, and ULP (Kolouri et al., 2020), with 31%, 20%, and 27% better accuracy, respectively. In addition, its scanning time is a few times to orders of magnitude smaller than other optimization based methods, especially in scanning label-specific backdoors.
2. Related Work
Besides the ones mentioned in the introduction, we further briefly discuss additional related work.
Trojan Attack.
Several data-poisoning like attacks (Gu et al., 2017; Liu et al., 2018b) utilize patch/watermark triggers. Clean-label attacks (Shafahi et al., 2018; Saha et al., 2020; Turner et al., 2019; Zhao et al., 2020; Zhu et al., 2019) inject backdoors without changing data labels. Salem et al. (2020) leveraged GAN to construct dynamic triggers with random patterns and locations. Composite attack (Lin et al., 2020) uses natural features from multiple labels as the trigger. Bit flipping (Rakin et al., 2019; 2020) injects malicious behaviors by flipping bits in model weights. Trojan attacks have been developed for transfer learning (Rezaei & Liu, 2019; Wang et al., 2018; Yao et al., 2019), federated learning (Bagdasaryan et al., 2020; Xie et al., 2019; Wang et al., 2020b), and NLP tasks (Chen et al., 2020; Sun, 2020).
Existing Detection.
ULP (Kolouri et al., 2020) trains a classifier to determine if a model is trojaned. It leverages a large pool of benign and trojaned models to learn a set of universal input patterns that lead to different logits for benign and trojaned models. The classifier is then trained on these logits. Similar to ULP (Kolouri et al., 2020), researchers in (Huang et al., 2020) proposed the one-pixel signature. They trained a classifier to predict a model's benignity based on its one-pixel signature. Qiao et al. (2019) proposed to generate the trigger distribution. Zhang et al. (2020); Wang et al. (2020c) leveraged the differences of adversarial examples between benign and trojaned models to detect backdoors. Tabor (Guo et al., 2019) used explainable AI techniques to scan backdoors. Xu et al. (2019) detected backdoors using Meta Neural Analysis. Liu et al. (2018a) combined pruning and fine-tuning to weaken or even eliminate backdoors. Wang et al. (2020a) certified model robustness against backdoors via randomized smoothing. Chan & Ong (2019); Gao et al. (2019); Chen et al. (2018); Chou et al. (2020); Du et al. (2019); Liu et al. (2017); Ma & Liu (2019) aimed to detect if a provided input contains a trigger.
Multi-Arm Bandit.
Multi-Arm Bandit (MAB) describes the dilemma of making a sequence of decisions to maximize a reward, which has an unknown distribution. It has been thoroughly studied in (Auer et al., 2002). Many solutions are proposed to tackle this problem, such as Upper Confidence Bound (UCB) (Auer, 2002), ε-greedy (Watkins, 1989), etc. MAB is a general idea with many applications. Our design is inspired by MAB and unique for backdoor detection.

Figure 2. Motivation cases: (a) a victim-class sample of a trojaned model stamped with the trigger generated by K-Arm, yielding the target label; (c) a clean R4 model stamped with natural features generated by K-Arm, yielding label 20; (d) shows a class …
3. Motivation
In this section, we discuss the limitations of existing optimization based backdoor scanners and motivate ours.
NC (Wang et al., 2019) cannot handle models with many classes.
Assume a model has N classes. Since the target label is unknown, to detect universal backdoors, NC considers each of the N labels as a possible target label and optimizes a trigger that flips benign samples from any class to that label. To detect label-specific backdoors, it considers each pair of labels as possible victim and target labels, and optimizes a trigger to flip only samples of the victim class to the target label. It then checks if there is an exceptionally small trigger (among all those generated). If so, the model is considered to have a backdoor. The computation complexity is hence O(N) for universal backdoors and O(N²) for label-specific backdoors. Our experiment (in Section 5) shows that to scan a model on ImageNet with a universal backdoor, NC needs more than 55 hours. It certainly cannot handle label-specific backdoors on such models.

Pre-selection may miss the correct label(s).
To address the above limitation, a pre-selection strategy was proposed in (Wang et al., 2019) to select a small subset of labels to proceed with after 10 steps of optimization. Specifically, it selects the m smallest triggers to continue. However, its effectiveness hinges on the correctness of pre-selection, which is difficult to achieve due to the uncertainty in optimization. Fig. 1a illustrates how pre-selection fails on a TrojAI round 2 model (with a universal backdoor). Due to the small time budget allowed for scanning a TrojAI model (600s in round 2), the top 5 labels are pre-selected out of 14. Observe that the trigger size of the target label is still much larger than most of the other labels after 10 steps, and the label is precluded. The situation is aggravated when the number of classes is large and backdoors are label-specific. In fact, our results show that pre-selection can only achieve 58% accuracy on average on the TrojAI rounds 1 to 4 training sets.

ABS may select the wrong neurons in stimulation analysis.
ABS (Liu et al., 2019) avoids optimizing for individual labels/label-pairs. It systematically enlarges internal neuron activation values for benign inputs and observes if consistent misclassification (to a certain label) can be achieved. If so, the neurons are considered potentially compromised by trojaning. It then uses optimization to generate a trigger by maximizing the activation values of these neurons. A model is considered trojaned if the generated trigger can cause the intended misclassification. It works for both universal and label-specific backdoors. Its effectiveness hinges on correctly identifying the compromised neurons, which has inherent uncertainty as well. Fig. 1b shows that for a trojaned model in TrojAI round 1, the top 10 neurons that have the largest elevation of the target label logits when stimulated (and hence cause misclassification to the target label) do not include the truly compromised neuron, which is ranked 134 by the stimulation analysis. As such, trigger generation based on the top 10 neurons fails to derive the real trigger. In our experiment, ABS can only achieve 69% detection accuracy on average for TrojAI rounds 1 to 4.
Existing scanners cannot distinguish triggers from natural features.
Natural features can induce misclassification in a way similar to backdoor triggers. For example, stamping a dog nose on cat images may induce misclassification to dog. As such, optimization based trigger generation like NC and ABS may generate natural features as triggers. Distinguishing the two is important as misclassification caused by natural features is inevitable and a model should not be blamed for their presence; and correctly separating natural features from injected triggers allows model end users to employ proper counter measures. Many TrojAI models have natural features that behave like triggers. Fig. 2c presents a benign TrojAI model … (the triangle in Fig. 2a and the octagon in Fig. 2b) with randomly chosen street-view background. More information can be found in the Appendix. Observe classes …
Our Method.
From the above discussion, we can observe that a key challenge lies in the inherent uncertainty in selecting the appropriate label (in NC) or neuron(s) (in ABS) to perform optimization. An exhaustive method like NC without selection is not effective for complex models, while pre-selection and ABS, which make deterministic choices, may fail to select the right one. The overarching idea of our method is to formulate the whole procedure as a stochastic process in which we continue to make a selection at each round. Here and in the rest of the paper, an optimization round does not mean an optimization epoch in the traditional sense but rather finding a smaller trigger (that can cause misclassification). In particular, a selected label/label-pair/neuron that continues to perform well over time (i.e., whose trigger has been easy to optimize) will have a high probability of being selected in the next round. A label/label-pair/neuron that does not get selected in one round has a probability of being selected in the future. The goal is to allow the true positive to eventually stand out.

Specifically, we start with a warm-up phase in which we optimize each label (to generate a trigger) for a very small number of rounds (2 in this paper). We retain a history of trigger size variation for each label. Then we start the selective optimization. At each round of selective optimization, we select the label that has the best performance over time. We use an objective function to measure the performance. For the moment, readers can intuitively consider that we utilize the derivative of trigger size (i.e., how fast the trigger size changes). Note that for a clean label, although the optimization may produce a small trigger at the beginning, it cannot achieve substantial size reduction over time. Therefore, its performance degrades and it tends to be replaced.
In contrast, although the target label may not perform well at the beginning and hence not be selected, it is eventually selected when the other optimizations get stuck.

Fig. 3 shows the trigger size variations of all labels over multiple rounds of optimization for two models from TrojAI. Observe that after the first round, the target label has the smallest trigger for the first model and hence pre-selection handles it correctly. In contrast, for the second model, the target label's trigger is very large and precluded (by pre-selection) from further optimization. Observe that it remains larger than many others till round 5. However, with our method, it eventually stands out and exposes the backdoor.

Figure 3. Trigger size variations over optimization rounds

The algorithm also seamlessly facilitates separation of natural features and backdoor triggers. Specifically, when two benign classes A and B are similar (e.g., cat and dog), small natural features (of A) can be identified to flip B samples to A when they are stamped with the features, just like a trigger. Observe that since the two classes are similar, small natural features can be easily identified to flip A to B as well. For example, in Fig. 2d, the generated trigger to flip class …
4. Design
Fig. 4 presents the overview of our technique. On the left is the trigger optimizer (Section 4.1) that performs one round of trigger optimization at a time. In each round, the optimizer generates a smaller trigger (than before) that causes a given set of benign samples to be misclassified to a target label, or returns failure when such a trigger cannot be found within a fixed number of epochs. On the right is the
K-Arm scheduler (Section 4.2) that decides which arm should be optimized next. Assume a model has N classes. To identify universal backdoors, we create N (optimization) arms, each having one of the N labels as the target label and aiming to generate a trigger that flips benign samples from the remaining N − 1 classes to the target label. To identify label-specific backdoors, we create N × (N − 1) arms (i.e., all the pair-wise combinations), each aiming to flip samples of a victim class to a target label. Hence, the scheduler selects from K = N + N × (N − 1) arms. In the diagram, there are two cycles inside the scheduler representing two optimization phases. The top cycle denotes the warm-up phase that optimizes all arms for two rounds. The scheduler receives and retains the generated trigger information for later use. The bottom cycle denotes the later selective optimization phase, in which one selected arm is optimized in each round. The selective optimization terminates when we get a sufficiently small trigger or the time budget runs out. To improve efficiency, the scheduler is facilitated by a pre-screening phase to reduce unnecessary arms (Section 4.3). It also considers symmetry during selection to distinguish natural features from triggers (Section 4.4).

Figure 4. K-Arm optimization workflow
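As a concrete illustration of the arm counts involved, the short sketch below (an illustrative helper of ours, not from the paper's released code) computes K for a few class counts:

```python
def num_arms(n_classes: int) -> int:
    """Total arms: N universal arms plus N*(N-1) ordered victim/target pairs."""
    return n_classes + n_classes * (n_classes - 1)

# A 5-class model already yields 25 arms; ImageNet's 1000 classes
# yield nearly a million, motivating the pre-screening step of Section 4.3.
for n in (5, 25, 1000):
    print(n, num_arms(n))
```

This quadratic growth is why label-specific scanning dominates the cost for models with many classes.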
4.1. Trigger Optimizer

In each round, the trigger optimizer optimizes one selected arm, generating a trigger for the target label of the arm. Specifically, a trigger T is composed of two parts: a pattern P and a mask M, with the former deciding the input values of the trigger and the latter deciding the shape/position of the trigger. Given a clean input x and a trigger, the stamped input x̂ is defined as follows:

x̂ = (1 − M) · x + M · P    (1)

Here, the operator · stands for element-wise multiplication. Given an x of dimensions [C, H, W], the dimensions of pattern P and mask M are identical to x's. The values of P are in the range [0, 1] and the values of M are in the range [0, 1]. Intuitively, stamping a trigger mixes x and P through the mask M. Given a model F, a target label t, and a set of inputs X, the trigger optimization for t is defined as follows:

min_{P,M} L(t, F((1 − M) · x + M · P)) + α ‖M‖₁, ∀x ∈ X    (2)

For an arm generating a universal trigger, X contains a set of clean inputs from classes other than t; for an arm generating a label-specific trigger, X contains a set of clean inputs from the victim class. L stands for the cross-entropy loss function. The hyper-parameter α balances the attack success rate and the size of the optimized trigger. The optimizer finishes a round and returns if the current trigger T satisfies the following condition:

Acc(X̂, t) ≥ θ and ‖M‖₁ < ‖M_p‖₁

Intuitively, the attack success rate with the trigger needs to be greater than a threshold θ, which is 0.99 in this paper, meaning samples stamped with the trigger have a higher than 99% chance of being classified to t, and the current mask M is smaller than the previous one M_p. The optimizer may return failure for the current round when the budget for the label runs out (10 epochs in this paper).

4.2. K-Arm Scheduler

To handle the uncertainty in arm selection, we leverage the ε-greedy algorithm (Watkins, 1989) to introduce randomness into our selection. The idea is to draw a random sample from a distribution, which is a uniform distribution from 0 to 1 in this paper. If the sample is larger than a threshold ε, we rely on an objective function to make the selection; otherwise, a random arm is selected. The procedure of selecting label L is formally defined as follows:

L = argmax_l A(l) if s > ε, and rand(K) if s ≤ ε, with s ∼ U(0, 1)    (3)

The parameter ε decides the level of greediness (or randomness). With the ε-greedy method, even if the true positive label is not selected in an early round, it still has a chance to be chosen in the following rounds. We set ε = 0.… in this paper and will discuss its effect later in the section. A(l) is an objective function for the target label l of an arm. It is supposed to approximate the likelihood of the label being the true target label. We leverage two kinds of information in the approximation: the current trigger size for the label and the trigger size variation for the label over rounds of optimization. To simplify the discussion, we leave symmetry (to distinguish natural features and triggers) to a later section. Intuitively, a label with a smaller trigger size is promising, and a label that continuously achieves good trigger size reduction in the past is promising.
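The stamping operation in Equation (1) and the round-acceptance condition can be sketched as follows. This is a minimal numpy-only illustration of ours (gradient steps over P and M are omitted; function names are hypothetical, not from the released code):

```python
import numpy as np

def stamp(x, mask, pattern):
    """Equation (1): blend input x with pattern P through mask M (all same shape)."""
    return (1.0 - mask) * x + mask * pattern

def round_succeeds(acc, mask, prev_mask_l1, theta=0.99):
    """A round finishes when the attack success rate reaches theta AND the
    L1 norm of the current mask shrank below the previous round's."""
    return acc >= theta and np.abs(mask).sum() < prev_mask_l1

x = np.zeros((3, 4, 4))          # toy [C, H, W] input
pattern = np.ones_like(x)        # trigger pattern P, values in [0, 1]
mask = np.zeros_like(x)
mask[:, :2, :2] = 1.0            # 2x2 trigger region in the top-left corner
x_hat = stamp(x, mask, pattern)
print(x_hat[0, 0, 0], x_hat[0, 3, 3])   # 1.0 inside the trigger, 0.0 outside
```

In the actual optimizer, P and M would be updated by gradient descent on Equation (2) between these acceptance checks.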
Let tm(l) be the accumulated time spent on optimizing l (in the past rounds); M(l) the current mask of l such that ‖M(l)‖₁, the L1 norm of M(l), describes the trigger size; and M₀(l) the mask of the first valid trigger for l. The objective function A(l) is hence defined as follows:

A(l) = (‖M₀(l)‖₁ − ‖M(l)‖₁) / tm(l) + β · 1/‖M(l)‖₁    (4)

Here β is a hyper-parameter set to …. In the early rounds, the trigger size reduction rate (i.e., the first term in the above equation) is a stronger indicator of a true positive. The equation allows us to put more weight on the reduction rate instead of the trigger size, which tends to be large at the beginning (so the second term tends to be small). As the optimization proceeds, the trigger size reduction rate degrades, even for the true positive label, and the second term becomes dominating, allowing the scheduler to prioritize labels with small triggers (to make them smaller).

In the end, we compare the size of the smallest trigger against a threshold τ to decide whether a model is trojaned or benign. In this paper, we set τ = 300 for all TrojAI models and τ = 350 for ImageNet models.

Theoretical Analysis of K-Arm.
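The scheduler's selection rule in Equations (3) and (4) can be sketched in a few lines. This is our own simplified illustration; the arm bookkeeping and the default ε value here are placeholders, not the paper's exact settings:

```python
import random

def objective(mask_l1_first, mask_l1_now, time_spent, beta=1.0):
    """Equation (4): trigger-size reduction rate plus a bonus for small triggers.
    beta=1.0 is a placeholder; the paper tunes beta separately."""
    return (mask_l1_first - mask_l1_now) / time_spent + beta / mask_l1_now

def select_arm(arms, epsilon=0.3):
    """Equation (3): epsilon-greedy selection over arms, a dict mapping
    label -> (first_mask_l1, current_mask_l1, time_spent)."""
    if random.random() < epsilon:
        return random.choice(list(arms))       # explore: random arm
    return max(arms, key=lambda l: objective(*arms[l]))  # exploit: best A(l)

arms = {
    "clean":  (500.0, 480.0, 4.0),   # little size reduction over time
    "target": (900.0, 300.0, 4.0),   # fast, sustained reduction
}
print(select_arm(arms, epsilon=0.0))  # greedy pick selects "target"
```

With ε > 0, a label that loses the greedy comparison early can still be drawn later, which is exactly the property the scheduler relies on.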
We conduct theoretical analysis to show that K-Arm is more effective (i.e., has higher accuracy) and more efficient (i.e., has lower overhead) than NC and NC+pre-selection. The effectiveness is proved by computing the expected time of finishing trigger generation for the true target label. Details can be found in the Appendix.
4.3. Arm Pre-screening

According to the theoretical analysis, when the number of arms K is large, the cost is dominated by the warm-up phase, which is determined by K. A large K is hence undesirable. Recall that for a model with N classes, K = N + N × (N − 1), which could be large. We hence propose a pre-screening step to filter out arms that are not promising. In order to achieve a high attack success rate, the attacker often has to stamp many benign samples (of various classes when injecting a universal backdoor) with the trigger and use them in trojan training. Note that these stamped samples have their labels set to the target label. As such, the model learns the correlations between the target label and the benign features belonging to the original labels. Consequently, the logits value of the target label tends to be consistently larger than other labels for benign samples. We leverage this to preclude labels that do not look promising.

Specifically, for universal backdoor scanning, we consider a label promising if its logits value ranks among the top γ% labels in at least θ% of all the benign samples (of various labels) that can be leveraged for scanning. Collecting such statistics has a much lower cost compared to optimization. We set γ = 25 and θ = 65 in this paper. For label-specific backdoor scanning, we consider an optimization arm from the victim label t_s to the target label t_d promising if t_d's logits value ranks among the top γ% labels in at least θ% of all the available benign samples of label t_s. We set γ = 25 and θ = 90 in this paper. Observe that our settings of γ and θ are conservative in order not to exclude the right one. We also empirically study the effect of different settings.

According to our experiments in the next section, pre-screening can substantially reduce the number of arms to consider. For example, we can effectively reduce the number of arms for ImageNet from … to … without sacrificing accuracy in universal backdoor scanning.
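The universal-backdoor pre-screening rule described above can be sketched as follows, a minimal illustration of ours (the logits here are synthetic; the real scanner collects them from the provided benign samples):

```python
import numpy as np

def promising_targets(logits, gamma=25.0, theta=65.0):
    """Keep label t if its logit ranks in the top gamma% of labels on at
    least theta% of the benign samples.
    logits: [n_samples, n_classes] model outputs on benign inputs."""
    n_samples, n_classes = logits.shape
    top_k = max(1, int(np.ceil(n_classes * gamma / 100.0)))
    # rank of each label within each sample (0 = largest logit)
    ranks = np.argsort(np.argsort(-logits, axis=1), axis=1)
    # percentage of samples where each label lands in the top gamma%
    in_top = (ranks < top_k).mean(axis=0) * 100.0
    return [t for t in range(n_classes) if in_top[t] >= theta]

rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 8))
logits[:, 3] += 3.0      # simulate a target label with inflated logits
print(promising_targets(logits))
```

Only the label with consistently elevated logits survives, so the expensive optimization runs on a small fraction of the K arms.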
4.4. Symmetric Optimization

Assume a (small) trigger T is generated to flip clean samples with label t_s to label t_d. As discussed in Section 3, if T does not denote a backdoor but rather natural features, the two classes are likely close to each other. As such, the trigger flipping samples of t_d to t_s shall have a similar size to T. If T indeed denotes a backdoor, the trigger flipping t_d to t_s tends to be much larger, as it is difficult to cause misclassification along the opposite direction of trojaning. Therefore, the scheduling algorithm is enhanced as follows to consider symmetry. The extension focuses on label-specific optimization, as such confusion rarely happens for universal backdoors.

Given a label-specific arm ⟨t_s, t_d⟩, i.e., flipping t_s to t_d, M(t_s, t_d) and P(t_s, t_d) denote the mask and pattern of the generated trigger, respectively, and M(t_d, t_s) and P(t_d, t_s) the correspondents along the opposite direction (i.e., flipping t_d to t_s). The objective function is as follows:

A(t_s, t_d) = [(‖M₀(t_s, t_d)‖₁ − ‖M(t_s, t_d)‖₁) / tm(t_s, t_d) + β · 1/‖M(t_s, t_d)‖₁] / [(‖M₀(t_d, t_s)‖₁ − ‖M(t_d, t_s)‖₁) / tm(t_d, t_s) + β · 1/‖M(t_d, t_s)‖₁]    (5)

Intuitively, we leverage the ratio of the objective function in Equation (4) in the two directions to estimate the likelihood of ⟨t_s, t_d⟩ being the true victim-target label pair. When A(t_s, t_d) is large, meaning the two directions are asymmetric, the pair is likely the true victim-target pair and is selected.
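The asymmetry test of Equation (5) reduces to a ratio of two Equation (4) objectives. A minimal sketch of ours (the arm statistics are synthetic, chosen only to contrast the two cases):

```python
def symmetric_objective(fwd, bwd, beta=1.0):
    """Equation (5): ratio of the Equation (4) objective in the two directions.
    fwd/bwd: (first_mask_l1, current_mask_l1, time_spent) for s->d and d->s."""
    def obj(first, now, t):
        return (first - now) / t + beta / now
    return obj(*fwd) / obj(*bwd)

# An injected backdoor optimizes easily s->d but not d->s: large ratio.
print(symmetric_objective((900.0, 100.0, 4.0), (900.0, 850.0, 4.0)))
# Natural features between similar classes shrink in both directions: ratio near 1.
print(symmetric_objective((900.0, 120.0, 4.0), (900.0, 130.0, 4.0)))
```

A large ratio flags a genuine victim-target pair; a ratio near 1 flags natural-feature confusion between similar classes.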
5. Experiments
We compare our method with four state-of-the-art techniques on multiple datasets and show that K-Arm optimization can achieve better accuracy with lower time cost.
TrojAI. TrojAI (IARPA, 2020) is a program by IARPA that aims to tackle the backdoor detection problem. In each round of competition, the performers are first given a large set of training models (over 1000) with different structures and different classification tasks. Roughly half of them are trojaned and their malicious identities are known. A (small) set of benign examples is provided for each label of each model. These models may be trojaned with various kinds of backdoors, including universal and label-specific. The triggers could be pixel patterns (e.g., polygons with solid color) or Instagram filters (Liu et al., 2019). They could be position dependent or independent. Position dependency means that the trigger has to be at a specific relative position to the foreground object in order to cause misclassification. A model may have one or more backdoors. The complexity of models and backdoors grows from round to round.
Note that our technique does not require training. We hence use these training sets as regular datasets.
IARPA also hosts a test set online that is drawn from the same distribution as the training models. It is unknown which test models are trojaned. One can submit a solution, which will be evaluated remotely on their server. The solution needs to finish scanning all the test models (100, 144, 288, and 288 for rounds 1-4, respectively) within 24 hours for rounds 1-2 and 48 hours for rounds 3-4. By the time of submission, round 4 is the latest. We compare our method with the baselines on all the models with polygon backdoors, mixed with all the clean models across all four rounds. We exclude models trojaned with Instagram filters as some baselines do not support them. The leaderboard results for our technique, including both polygon and filter backdoors, will be discussed in Section 5.5. The details of the datasets can be found in the Appendix.
ImageNet.
We also use 7 VGG16 models on ImageNet (1000 classes) trojaned by TrojNN (Liu et al., 2018b), a kind of universal patch attack, and 6 models on ImageNet poisoned by hidden-trigger backdoors (Saha et al., 2020), with different structures including VGG16, AlexNet, DenseNet, Inception, ResNet, and SqueezeNet. The hidden-trigger backdoors are label-specific. They are mixed with 7 clean ImageNet models.
We report two accuracy metrics used in TrojAI: cross-entropy loss (Murphy, 2012) and
ROC-AUC (Area under the Receiver Operating Characteristic Curve) (Fawcett, 2006). The former is the lower the better and the latter is the higher the better. In addition, we also report the plain accuracy, i.e., the percentage of models that are correctly classified. We also report the average scanning time for each model. For fair comparison, comparative experiments are all done on an identical machine with a single 24GB-memory NVIDIA Quadro RTX 6000 GPU (the lab server configuration). Leaderboard results (on TrojAI test sets) were run on the IARPA server with a single 32GB-memory NVIDIA V100 GPU. We use the Adam (Kingma & Ba, 2014) optimizer with learning rate … and β = {…} for all the experiments.

We compare K-Arm with the following state-of-the-art detection methods: ABS (Liu et al., 2019), NC (Wang et al., 2019), NC+pre-selection (Wang et al., 2019) (or Pre-selection for short), and ULP (Kolouri et al., 2020). For the optimization based methods including ABS, NC, and Pre-selection, we use the same batch size for fair comparison. For NC, Pre-selection, and our method, we use the same early stop condition to terminate the optimization. For ABS, we select the top 10 neuron candidates after the stimulation analysis and perform trigger reverse engineering. For Pre-selection, we set the number of optimization epochs to max(10, s) for each label, with s the number of epochs when the first valid trigger is found. Recall that Pre-selection performs a few rounds of optimization and then selects a promising subset to finish. We select the top 3 among the 5 labels for round 1 models and the top 20% of labels/label-pairs for rounds 2-4. For the ImageNet models, we follow (Wang et al., 2019) and select the top 100. For ULP, we train it on 500 TrojAI round 1 models and test it on the 100 test models. We did not run it on later rounds as it cannot handle the model structure variations in those rounds.
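The two TrojAI metrics can be computed as follows. This is our own simplified implementation for illustration (the official TrojAI scorer may differ in details, and tied scores are not handled here):

```python
import numpy as np

def cross_entropy_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy averaged over models (lower is better).
    y_true: 0/1 labels (trojaned or not); p_pred: predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney) formulation (higher is better)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.8, 0.3, 0.1])
print(round(cross_entropy_loss(y, p), 3), roc_auc(y, p))  # -> 0.198 1.0
```

A perfectly separating scanner achieves ROC-AUC 1.0; the loss additionally penalizes poorly calibrated confidences, which is why TrojAI round targets are stated in terms of loss.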
We evaluate the effects of hyper-parameters, including the following: β in the objective function, θ and γ in the arm pre-screening, and ε in the K-Arm Scheduler. We also evaluate the threshold τ, which decides whether a model is trojaned. We randomly select 40 models (20 benign and 20 trojaned) from round 2 to test our method. In detail, we pick 5 different values for β, 10 values for ε, 5 values for τ, 3 values for θ, and 3 values for γ. The results are in the Appendix.

Table 1 shows the comparison results on the aforementioned models from the TrojAI rounds 1-4 training sets (3231 models in total). Columns Acc, Loss, ROC, and Time stand for plain accuracy, cross-entropy loss, ROC-AUC, and average scanning time per model, respectively. Observe that our method achieves the best accuracy and the lowest scanning time compared to all the baselines. The best K-Arm methods have 17%, 32%, 30%, and 34% better ROC than the best baseline for the four respective rounds. They are also 1.8, 10.8, 8.6, and 4.8 times faster than the fastest baseline for the four respective rounds. This strongly supports the better effectiveness and efficiency of K-Arm. K-Arm has higher accuracy than Pre-selection and ABS because they have to make a deterministic selection (about which labels/neurons to optimize) at the beginning, which is difficult when the candidate sets are large (e.g., in label-specific backdoor scanning). K-Arm has higher accuracy than NC even though NC is exhaustive. Besides the fact that NC does not consider symmetry and hence cannot distinguish natural features from injected triggers, its exhaustive nature in many cases also hurts performance, as it aggressively optimizes for clean labels, generating many small natural features that behave like triggers.
Also observe that arm pre-screening substantially reduces the scanning time (by an order of magnitude) without sacrificing much accuracy; symmetric optimization is critical to improving accuracy, with 13%, 11%, and 10% ROC improvement for rounds 2-4. Without symmetric optimization, K-Arm would not be able to reach the round targets (i.e., lower than 0.348 loss).
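The ε hyper-parameter above governs the exploration-exploitation trade-off in the K-Arm Scheduler. The paper's scheduler is guided by its objective function; the sketch below shows only the generic ε-greedy skeleton such iterative arm selection builds on, with a hypothetical `step` function standing in for one epoch of trigger optimization on one label (arm):

```python
import random

def k_arm_schedule(arms, step, rounds, epsilon=0.3, seed=0):
    """Generic epsilon-greedy scheduling over per-label 'arms'.
    step(arm) runs one optimization epoch for that arm and returns a
    reward (e.g., negative current trigger size). With probability
    epsilon we explore a random arm; otherwise we exploit the arm with
    the best reward so far."""
    rng = random.Random(seed)
    reward = {a: float("-inf") for a in arms}
    for a in arms:                                     # warm-up: touch every arm once
        reward[a] = step(a)
    for _ in range(rounds):
        if rng.random() < epsilon:
            a = rng.choice(arms)                       # explore
        else:
            a = max(arms, key=lambda x: reward[x])     # exploit
        reward[a] = step(a)
    return max(arms, key=lambda x: reward[x])          # most promising label

# Toy demo: arm 2 secretly yields the smallest trigger (highest reward).
true_size = {0: 40, 1: 35, 2: 5}
best = k_arm_schedule([0, 1, 2], lambda a: -true_size[a], rounds=20)
print(best)  # -> 2
```

A small ε keeps most epochs on the currently best-looking label; a larger ε spends more epochs re-checking labels whose early trigger sizes were misleading, which is exactly the uncertainty the warm-up discussion below illustrates.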
Results for ImageNet Models.
Table 2 shows the results for the ImageNet models. Columns 2-5 present results on the 6 models with (label-specific) hidden-trigger backdoors mixed with 7 benign models; columns 6-9 present results on the 7 models with (universal) TrojNN backdoors, mixed with 7 benign models. For hidden-trigger backdoors, the best K-Arm variant has 100% accuracy. NC could not finish due to the large number of victim-target label pairs; it took two weeks to scan a model. Both Pre-selection and ABS have much worse accuracy or scanning time. For TrojNN backdoors, the best K-Arm variant has 100% accuracy, higher than most baselines. Although ABS also achieves 100% accuracy, it is 20 times slower than the best K-Arm. NC and Pre-selection have lower accuracy and much longer scanning time due to the large number of classes and natural features that behave like triggers.

Table 1. TrojAI Training Set Results; "Sym K-Arm Opt + Pre-Srn" stands for symmetric K-Arm with pre-screening. (The table reports Acc, Loss, ROC, and Time(s) for each of rounds 1-4; most cell values did not survive extraction. Among the recoverable entries: NC achieves 72% Acc, 0.61 Loss, 0.73 ROC, and 623.9s on round 1, and K-Arm Opt achieves 90% Acc on round 1.)

Table 2. Results on ImageNet Models

                     Hidden Trigger Attack           TrojanNN
Method               Acc    Loss  ROC   Time(s)   Acc    Loss  ROC   Time(s)
NC                   -      -     -     >1M       71%    0.65  0.82  221k
Pre-selection        54%    1.02  0.62  171k      64%    0.92  0.74  43k
ABS                  100%   0.11  1.00  389k      100%   0.11  1.00  4.9k
K-Arm                85%    0.33  0.93  86k       88%    0.38  0.92  19k
K-Arm+Pre-Srn        85%    0.33  0.93  2k        100%   0.11  1.00  224
Sym K-Arm+Pre-Srn    100%   0.09  1.00  4k        -      -     -     -
K-Arm Performance on TrojAI Leaderboard.
K-Arm consistently achieved top results across the four rounds (https://pages.nist.gov/trojai/; leaderboard results at https://pages.nist.gov/trojai/docs/results.html). Table 3 shows the K-Arm results for the four rounds, including the loss, ROC, average scanning time, and ranking. The results cover all the different types of backdoors (polygon, filter, label-specific, universal, position-dependent, multiple backdoors in a model, etc.). We also show the difference between K-Arm and the top performer (if any). For example, in round 2, K-Arm ranked number 2. Loss 0.35(+0.03) means that K-Arm's loss is 0.35 while the top performer has 0.32 loss; ROC 0.90(+0.01) means that K-Arm has 0.90 ROC while the top performer has 0.89 ROC. Note that the leaderboard ranks solutions by (smaller) loss. K-Arm beat the round targets (i.e., lower than 0.348 loss) for 3 out of the 4 rounds. For round 2, although it did not beat the target, its ROC is the highest. It ranked number one for 2 out of the 4 rounds. In all rounds, K-Arm is faster than ABS. We also train ULP on 500 round 1 training-set models and evaluate it on the round 1 test set. However, its accuracy is not high. We speculate two reasons: 1) unlike the models in the ULP paper, the classes of TrojAI models are not fixed; 2) the classifier seems to easily overfit on the training data, and the triggers in the TrojAI datasets share few common features. On the other hand, ULP is not optimization based and hence is extremely fast.

Figure 5. Trend of Trigger Optimization.
Trend of Trigger Optimization in K-Arm.
We randomly sample 100 trojaned models from each training set of TrojAI rounds 1 to 4. We record the ranking of the optimized trigger size of the true target label for each model during optimization. Fig. 5 shows the percentage of models whose target-label trigger size ranks number 1 (i.e., the smallest) for each round. We can see that right after warm-up, only 60-70% of the models rank at the top. As such, a simple pre-selection strategy does not work. All the sets converge at around 90%, indicating that K-Arm allows the true positives to stand out eventually in most cases. Also observe that the different sets converge at different optimization rounds, indicating that using a universally larger number of warm-up rounds instead of K-Arm would not work. Moreover, 20 rounds of warm-up means hundreds of epochs, which is already unaffordable, as all arms have to go through warm-up. Finally, we point out that there are still around 10% of cases in which the true target label does not stand out at the end. We study some of them in the Appendix and leave the problem to future work.
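Assuming per-epoch trigger-size traces are logged for every candidate label, the rank statistic plotted in Fig. 5 can be computed as below; the trace data here is purely illustrative:

```python
def top_rank_fraction(traces, targets):
    """traces: one dict per model mapping label -> [trigger size at
    epoch 0, 1, ...]. targets: true target label per model. Returns,
    for each epoch, the fraction of models whose true target label
    currently has the smallest optimized trigger."""
    epochs = len(next(iter(traces[0].values())))
    fracs = []
    for e in range(epochs):
        hits = sum(
            1 for trace, t in zip(traces, targets)
            if t == min(trace, key=lambda lbl: trace[lbl][e])
        )
        fracs.append(hits / len(traces))
    return fracs

# Two toy models: in the second, the target label only wins after epoch 0.
traces = [
    {0: [7, 4, 2], 1: [8, 7, 7]},   # target 0 leads throughout
    {0: [3, 5, 6], 1: [5, 4, 2]},   # target 1 overtakes at epoch 1
]
print(top_rank_fraction(traces, targets=[0, 1]))  # -> [0.5, 1.0, 1.0]
```

The rising curve in this toy output mirrors the observation above: early rankings are unreliable, so any selection committed right after warm-up would discard some true target labels.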
Adaptive Attack.
We devise an adaptive attack that reduces the effectiveness of arm pre-screening at the cost of model accuracy degradation. Details are in the Appendix.
6. Conclusion
Inspired by K-Arm Bandit in Reinforcement Learning, we develop a K-Arm optimization technique for backdoor scanning. The technique handles the inherent uncertainty in searching a very large space of model behaviors, using stochastic search guided by an objective function. It shows outstanding performance on models from the IARPA TrojAI competitions. It also outperforms the state-of-the-art techniques that are publicly available.

Table 3. TrojAI Leaderboard Results (differences from the top performer in parentheses; T/O = timed out)

                 Round1                                    Round2
Method   CE Loss       ROC          Time(s)     Rank  |  CE Loss       ROC          Time(s)     Rank
NC       -             -            T/O         -     |  -             -            T/O         -
ABS      0.64(+0.34)   0.70(-0.21)  523(+233)   -     |  0.76(+0.44)   0.53(-0.36)  508(+18)    -
ULP      1.18(+0.88)   0.59(-0.32)  0.1(-290)   -     |  -             -            -           -
K-Arm    0.30(-0.00)   0.91(-0.00)  290(-0)     1     |  0.35(+0.03)   0.90(+0.01)  290(-200)   2

                 Round3                                    Round4
Method   CE Loss       ROC          Time(s)     Rank  |  CE Loss       ROC          Time(s)     Rank
NC       -             -            T/O         -     |  -             -            T/O         -
ABS      0.84(+0.55)   0.56(-0.35)  599(+367)   -     |  0.87(+0.55)   0.48(-0.42)  229(+18)    -
ULP      -             -            -           -     |  -             -            -           -
K-Arm    0.29(-0.00)   0.91(-0.00)  232(-0)     1     |  0.33(+0.01)   0.90(-0.00)  201(-10)    2
References
Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics, pp. 2938-2948. PMLR, 2020.

Chan, A. and Ong, Y.-S. Poison as a cure: Detecting & neutralizing variable-sized backdoor attacks in deep neural networks. arXiv preprint arXiv:1911.08040, 2019.

Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.

Chen, X., Salem, A., Backes, M., Ma, S., and Zhang, Y. Badnl: Backdoor attacks against nlp models. arXiv preprint arXiv:2006.01043, 2020.

Chou, E., Tramèr, F., and Pellegrino, G. Sentinet: Detecting localized universal attacks against deep learning systems. pp. 48-54. IEEE, 2020.

Du, M., Jia, R., and Song, D. Robust anomaly detection and backdoor attack detection via differential privacy. arXiv preprint arXiv:1911.07116, 2019.

Fawcett, T. An introduction to roc analysis. Pattern Recognition Letters, 27(8):861-874, 2006.

Gao, Y., Xu, C., Wang, D., Chen, S., Ranasinghe, D. C., and Nepal, S. Strip: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, pp. 113-125, 2019.

Gu, T., Dolan-Gavitt, B., and Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

Guo, W., Wang, L., Xing, X., Du, M., and Song, D. Tabor: A highly accurate approach to inspecting and restoring trojan backdoors in ai systems. arXiv preprint arXiv:1908.01763, 2019.

Huang, S., Peng, W., Jia, Z., and Tu, Z. One-pixel signature: Characterizing cnn models for backdoor detection. arXiv preprint arXiv:2008.07711, 2020.

IARPA. Trojai competition. https://pages.nist.gov/trojai/, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kolouri, S., Saha, A., Pirsiavash, H., and Hoffmann, H. Universal litmus patterns: Revealing backdoor attacks in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 301-310, 2020.

Lin, J., Xu, L., Liu, Y., and Zhang, X. Composite backdoor attack for deep neural network by mixing existing benign features. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 113-131, 2020.

Liu, K., Dolan-Gavitt, B., and Garg, S. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273-294. Springer, 2018a.

Liu, Y., Xie, Y., and Srivastava, A. Neural trojans. pp. 45-48. IEEE, 2017.

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. Trojaning attack on neural networks. In Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS), 2018b.

Liu, Y., Lee, W.-C., Tao, G., Ma, S., Aafer, Y., and Zhang, X. Abs: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1265-1282, 2019.

Ma, S. and Liu, Y. Nic: Detecting adversarial samples with neural network invariant checking. In Proceedings of the 26th Network and Distributed System Security Symposium (NDSS 2019), 2019.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Qiao, X., Yang, Y., and Li, H. Defending neural backdoors via generative distribution modeling. In Advances in Neural Information Processing Systems, pp. 14004-14013, 2019.

Rakin, A. S., He, Z., and Fan, D. Bit-flip attack: Crushing neural network with progressive bit search. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Rakin, A. S., He, Z., and Fan, D. Tbt: Targeted neural network attack with bit trojan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13198-13207, 2020.

Rezaei, S. and Liu, X. A target-agnostic attack on deep models: Exploiting security vulnerabilities of transfer learning. arXiv preprint arXiv:1904.04334, 2019.

Saha, A., Subramanya, A., and Pirsiavash, H. Hidden trigger backdoor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 11957-11965, 2020.

Salem, A., Wen, R., Backes, M., Ma, S., and Zhang, Y. Dynamic backdoor attacks against machine learning models. arXiv preprint arXiv:2003.03675, 2020.

Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., and Goldstein, T. Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, pp. 6103-6113, 2018.

Sun, L. Natural backdoor attack on text data. arXiv preprint arXiv:2006.16176, 2020.

Turner, A., Tsipras, D., and Madry, A. Label-consistent backdoor attacks. arXiv preprint arXiv:1912.02771, 2019.

Wang, B., Yao, Y., Viswanath, B., Zheng, H., and Zhao, B. Y. With great training comes great vulnerability: Practical attacks against transfer learning. In USENIX Security Symposium (USENIX Security 18), pp. 1281-1297, 2018.

Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., and Zhao, B. Y. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. pp. 707-723. IEEE, 2019.

Wang, B., Cao, X., Gong, N. Z., et al. On certifying robustness against backdoor attacks via randomized smoothing. arXiv preprint arXiv:2002.11750, 2020a.

Wang, H., Sreenivasan, K., Rajput, S., Vishwakarma, H., Agarwal, S., Sohn, J.-y., Lee, K., and Papailiopoulos, D. Attack of the tails: Yes, you really can backdoor federated learning. Advances in Neural Information Processing Systems, 33, 2020b.

Wang, R., Zhang, G., Liu, S., Chen, P.-Y., Xiong, J., and Wang, M. Practical detection of trojan neural networks: Data-limited and data-free cases. arXiv preprint arXiv:2007.15802, 2020c.

Watkins, C. J. C. H. Learning from delayed rewards. 1989.

Xie, C., Huang, K., Chen, P.-Y., and Li, B. Dba: Distributed backdoor attacks against federated learning. In International Conference on Learning Representations, 2019.

Xu, X., Wang, Q., Li, H., Borisov, N., Gunter, C. A., and Li, B. Detecting ai trojans using meta neural analysis. arXiv preprint arXiv:1910.03137, 2019.

Yao, Y., Li, H., Zheng, H., and Zhao, B. Y. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2041-2055, 2019.

Zhang, X., Mian, A., Gupta, R., Rahnavard, N., and Shah, M. Cassandra: Detecting trojaned networks from adversarial perturbations. arXiv preprint arXiv:2007.14433, 2020.

Zhao, S., Ma, X., Zheng, X., Bailey, J., Chen, J., and Jiang, Y.-G. Clean-label backdoor attacks on video recognition models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14443-14452, 2020.

Zhu, C., Huang, W. R., Shafahi, A., Li, H., Taylor, G., Studer, C., and Goldstein, T. Transferable clean-label poisoning attacks on deep neural nets. arXiv preprint arXiv:1905.05897.