TAD: Trigger Approximation based Black-box Trojan Detection for AI
Xinqiao Zhang*, Huili Chen and Farinaz Koushanfar
UC San Diego
{x5zhang, huc044, fkoushanfar}@ucsd.edu
* Contact Author

Abstract
A growing number of intelligent applications have been developed with the surge of Machine Learning (ML). Deep Neural Networks (DNNs) have demonstrated unprecedented performance across various fields such as medical diagnosis and autonomous driving. While DNNs are widely employed in security-sensitive fields, they have been shown to be vulnerable to Neural Trojan (NT) attacks that are controlled and activated by a stealthy trigger. We call such a vulnerable model an adversarial artificial intelligence (AI). In this paper, we aim to design a robust Trojan detection scheme that inspects whether a pre-trained AI model has been Trojaned before its deployment. Prior works are oblivious to the intrinsic properties of the trigger distribution and try to reconstruct the trigger pattern using simple heuristics, i.e., stimulating the given model to produce incorrect outputs; as a result, their detection time and effectiveness are limited. We leverage the observation that the pixel trigger typically features spatial dependency and propose TAD, the first trigger approximation based Trojan detection framework, which enables a fast and scalable search for the trigger in the input space. Furthermore, TAD can also detect Trojans embedded in the feature space, where certain filter transformations are used to activate the Trojan. We perform extensive experiments to investigate the performance of TAD across various datasets and ML models. Empirical results show that TAD achieves a high ROC-AUC score on the public TrojAI dataset at a low average detection time per model.

Introduction

Artificial intelligence (AI) has been widely investigated and offers tremendous help in different fields such as image classification, speech recognition, and data forecasting [1]. With the help of AI, applications such as self-driving cars [2], spam filters [3], and face recognition [4] offer many conveniences in people's lives.

AI also brings potential risks along with these applications. Trojan attacks, also called backdoor attacks, modify the input data by adding a specific trigger that makes the AI output an incorrect response [5]. Triggered data is very rare in the dataset, so the attack usually goes unnoticed. Once the trigger has been applied to the input data, the output is totally different from what it is supposed to be, which causes serious security issues; no one wants this to happen in a self-driving car. These emerging security concerns have drawn dramatic attention in the research community. Therefore, a fast and accurate detection method is needed to determine whether a given model has been Trojaned or backdoored by an attacker. These concerns lead to the idea of this paper: we introduce a novel method to detect two popular backdoor trigger attacks, polygon attacks and Instagram filter attacks.

Polygon attacks are the most popular and most investigated attacks. Adversaries typically add a polygon image of a certain shape on top of many input images and use these perturbed images during the training stage, aiming to mislead the model into producing incorrect responses [6]. A model is poisoned or Trojaned if it yields unexpected results when the input images contain the pre-trained polygon trigger. Typically, the poisoned images make up a small percentage (20% or less) of the full training data.
Another Trojan attack, the Instagram filter attack, applies an Instagram filter to a given image so that the output is completely different from the original class. Compared to polygon triggers, which directly replace the pixels of a certain region, Instagram filter triggers transform the whole image (i.e., change the color value of every pixel) based on the filter type to produce the poisoned images [7].

Prior works focus on different ways of detection, either black-box detection such as DeepInspect (DI) [8] or white-box detection such as Neural Cleanse (NC) [6]. DI learns the probability distribution of potential triggers from the queried model using a conditional generative model. Its advantage is that it does not require a benign dataset to assist backdoor detection; its drawback is a very long running time. NC is one of the first robust and generalizable methods targeting backdoor detection and mitigation; it can identify a backdoored model and reconstruct possible triggers from the given model. However, its scalability is limited: the technique applies to a single model and a single dataset, and it takes time to fine-tune the hyper-parameters when switching to another model and dataset. In practice, a number of models with different architectures usually run at the same time for different jobs, so Neural Cleanse needs to be improved. Another work, TABOR [9], proposes an accurate approach to inspecting Trojan backdoors and introduces a metric to quantify the probability that a given model contains a Trojan. TABOR achieves feasible performance, but the method is evaluated on a limited number of datasets and DNN architectures. Moreover, the trigger is attached at a location known before detection. As such, TABOR is limited for practical application.

To the best of our knowledge, TAD is the first Trigger Approximation based Black-box Trojan Detection method for DNN security assurance. TAD takes a given model as a black-box oracle and characterizes the Trojan trigger by several key factors to detect whether the model is poisoned. It combines the advantages of both DI and NC while alleviating their drawbacks. Our method achieves high detection performance, robustness, and low runtime overhead. Moreover, it can detect random triggers attached at any location in the foreground image for DNN models across a large variety of architectures, a big improvement over TABOR [9].

In summary, our contributions are as follows:
• Presenting the first robust and fast method for random-trigger backdoor detection. Our approximate trigger reconstruction method facilitates efficient exploration of a given model and is thus more robust and faster than existing approaches.

• Enabling both polygon trigger detection and Instagram filter detection. TAD provisions the capability of detecting the most commonly used triggers and yields an estimation of the poisoning probability.

• Investigating the performance of TAD on diverse datasets and model architectures. We perform an extensive assessment of TAD and show its superior effectiveness, efficiency, and scalability.
Background

Backdoor attacks are typically implemented by data poisoning and may have different objectives, e.g., targeted attacks, untargeted attacks, universal attacks, and constrained attacks. The source class is the class that a clean image comes from. Generally, a backdoored image consists of a clean image and a trigger; a trigger can be an image block or a filter transformation that is added to the clean image. The attack target class is the output class of the poisoned model on the backdoored image, and it can be any class other than the source class of the image. In this paper, we assume a poisoned model has a single target class.

In data poisoning attacks, falsified data is used during the training process, leading to the corruption of the learned model [10]. This kind of attack is very popular and commonly used for backdoor attacks. It is usually not easy to tell whether a model has been attacked, because the corrupted response of a model appears only when the triggering data is present in the input image. This poses a very serious risk for applications like self-driving cars, where it may cause undesirable consequences. Therefore, data poisoning attacks are an emerging issue nowadays.

(i) Targeted attacks vs. untargeted attacks.
Trojan attacks can be categorized into two types based on the attack objective. On the one hand, a targeted attack aims to mislead the model into predicting a pre-specified class when the input data has certain properties [11]. In general, data poisoning attacks do not have to assign a specific class to the poisoned data: an attacker can use a random class or let the neural network choose the closest target class for the triggered data. Targeted attacks are powerful since they enforce neural networks to produce a pre-defined target class while preserving high accuracy on the normal data [11]. On the other hand, an untargeted Trojan attack aims to degrade the task accuracy of the model on all classes [12].

(ii) Universal attacks vs. constrained attacks.
Trojan attacks can also be categorized by their impact range on the input data.
Universal attacks [13] use a universal (image-agnostic) perturbation that applies to the whole image and to all source classes of the model, leading to the misbehavior of the model. Suppose a single small perturbation, which may be a vector, a filter, or something else, can make the latest DNNs yield incorrect responses; usually, once such a universal perturbation is added to an image, the target class is always the same one. This kind of perturbation has been investigated in [13], which introduces a systematic algorithm for computing universal perturbations: a very small perturbation can cause an image to be misclassified with very high probability. The authors find that these universal perturbations generalize well across neural networks, bringing potential security breaches.
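For intuition, the universal-perturbation search of [13] can be summarized by the following hedged Python sketch; `predict` and `minimal_perturbation` (a per-image attack such as DeepFool) are caller-supplied stand-ins, and the infinity-norm clipping is a simplification of the paper's general projection step.

```python
import numpy as np

def universal_perturbation(images, predict, minimal_perturbation,
                           xi=10.0, target_fool_rate=0.8, max_iters=10):
    """Accumulate per-image minimal perturbations into one universal vector v."""
    v = np.zeros_like(images[0], dtype=np.float32)
    for _ in range(max_iters):
        for x in images:
            if predict(x + v) == predict(x):      # v does not fool this image yet
                dv = minimal_perturbation(x + v)  # minimal per-image step
                v = np.clip(v + dv, -xi, xi)      # project back onto the norm ball
        fooled = np.mean([predict(x + v) != predict(x) for x in images])
        if fooled >= target_fool_rate:            # stop once enough images flip
            break
    return v
```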
Constrained attacks refer to triggers that are only valid for pre-defined source classes, leading to a poisoned model whose source classes are a subset of all classes. In this case, only images in the source classes are poisoned during the training process.

(iii) Polygon triggers vs. Instagram filter triggers.
There are two types of Trojan triggers based on their insertion space: polygon triggers and Instagram filter triggers. (i) Polygon triggers are inserted in the input space, superimposed directly on the original benign images. More specifically, a polygon trigger has a specific shape (also called the trigger mask) and color and is added to the image at one location; the color of a polygon can be any value from 0 to 255 for each channel. (ii) Instagram triggers refer to combinations of image transformations that are applied to the clean images. Note that both polygon and Instagram triggers can be used for targeted/untargeted and universal/constrained attacks.

Figures 1 and 2 show examples of triggers implanted into clean images. Generally, for a polygon trigger, a clean image consists of one foreground object image and one background image, and the poisoned image is made from a clean image by adding a scaled trigger image at a specific location. The Instagram filter trigger applies to the complete image.

Figure 1: Example of clean image and polygon trigger [7].

Figure 2: Example of clean image and Instagram filter trigger.

Related Work

Prior works apply different methods for backdoor detection. Generally, model-level detection and data-level detection are the two common kinds of detection. Neural Cleanse (NC) [6] explores model-level detection: NC treats a given label as a potential victim class of a targeted backdoor attack and designs an optimization scheme to find the minimal trigger required to misclassify all samples from the source classes into the victim class. After several rounds, NC finds some potential triggers and uses an outlier detection method to choose the smallest trigger among them; a significant outlier represents a real trigger, and the victim class is the target label of the backdoor attack. The drawback is that NC only takes certain trigger candidates into consideration and does not work well for random polygon triggers.

DeepInspect (DI) [8] is another model-level detection method and the first black-box Trojan detection solution requiring limited prior knowledge of the model. DI first learns the probability distribution of potential triggers from the given models using a conditional generative model and retrieves the footprint of backdoor insertion. DI has good detection performance and a smaller running time compared to prior work. Its downside is that the runtime increases considerably for large models, which may exceed timing requirements.

ABS [14] is also a model-level detection method. ABS scans neural-network-based AI models to determine whether a given model is poisoned: it analyzes inner neuron behaviors by determining how output activations change when different levels of stimulation are introduced to a neuron. A neuron that substantially elevates the activation of a specific output label regardless of the input is considered potentially poisoned, and a reverse engineering method is then applied to verify that the model is truly poisoned. Even though ABS performs extensive experiments, accuracy becomes a big issue as the number of neurons in a layer increases, because sweeping each neuron requires carefully picking a step size.
Thus, good performance is not guaranteed for large AI models such as densenet.

For data-level detection, CLEANN [15] is the first end-to-end and very lightweight framework that recovers the ground-truth class of poisoned samples without requiring labeled data, model retraining, or prior assumptions on the trigger or the attack. By applying sparse recovery and outlier detection, it can detect the attack at the data level. However, it can only detect specific types of triggers, e.g., square, Firefox, and watermark triggers.
Methodology

In this section, we present an overview of our approach. The following parts describe the proposed method and the threat model, followed by detailed information about our algorithms; the experimental results and analysis come afterwards.
Threat Model.

Our method examines the susceptibility of the given AI under very limited assumptions. We assume no access to the clean images used for training. We consider two types of triggers, polygon triggers and Instagram filter triggers, and the potential candidates for the Instagram filters are given. For polygon triggers, we assume no prior knowledge about the trigger information, e.g., the trigger location, trigger size, trigger sides and shape, or trigger color. The trigger location can be anywhere inside the foreground image, the size ranges from 0 to 0.25, and the trigger sides, shape, and color are all random.
Figure 3: Flow of Trojaned AI detection method.
Figure 3 shows the overall flow of our method. Our detection method has two sequential stages. Given an unknown model, we first check whether it is backdoored by polygon Trojans (S1). If the detection result of this stage is benign, we then check whether the model is backdoored by Instagram filter Trojans (S2). Only models that pass both detection stages are considered benign. The order of the two stages is important because we find that some Instagram-poisoned models can be triggered by a polygon image as well.
(S1) To detect polygon-poisoned models, we recover the trigger for each source class. We try different trigger parameters and check whether the predictions of the model have a high bias towards a specific class. If this prediction bias is observed, the model is determined to be Trojaned.
(S2) To detect Instagram-filter-poisoned models, we search over each (source, target) class pair and apply each of the five filter types individually. If the images in the source class have a high probability of being predicted as the target class after applying a certain filter, the model is decided to be Trojaned.
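A minimal sketch of this two-stage flow, assuming `detect_polygon` and `detect_instagram` implement S1 and S2 (detailed in Algorithms 1 and 2 below) and return True when the model looks Trojaned:

```python
def inspect_model(model, clean_images, detect_polygon, detect_instagram):
    """Two-stage TAD flow from Figure 3: polygon check first, then filters."""
    if detect_polygon(model, clean_images):       # S1: polygon Trojan check
        return "poisoned (polygon)"
    if detect_instagram(model, clean_images):     # S2: Instagram filter check
        return "poisoned (instagram)"
    return "benign"                               # passed both stages
```

Running S1 first matters because, as noted above, some Instagram-poisoned models also respond to polygon triggers.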
Trigger Characterization.
Our preliminary experiments explore the impacts of the different parameters of polygon triggers. We mainly use four trigger parameters: trigger location, trigger shape, trigger color, and trigger size. The trigger location is the center location at which the trigger is attached to a clean image. The trigger shape includes the shape of the trigger and the number of sides; note that the length of each side is a random number. The trigger color is a 3-channel RGB value ranging from 0 to 255, e.g., [200, ·, ·]. A trigger mask is a full-size trigger without RGB values, and the trigger size is the scaling parameter applied to a trigger mask. First, we define a generic form of polygon trigger injection:

$$T(X, l, v) = X_{eb}, \quad (1)$$

where $T(\cdot)$ represents the function that attaches an array of trigger values $v$ to the clean image. $v$ is a 3D matrix of pixel color values that shares the same dimensions as the input image (height, width, and color channels). $l$ is a 2D matrix called the trigger mask location, indicating where the trigger overwrites the clean image; its values are either 0 or 1. $X_{eb}$ is the trigger-embedded image, defined as:

$$X_{eb}(i,j,k) = \begin{cases} v(i,j,k) + s \cdot X(i,j,k) & \text{if } l_{i,j} = 1 \\ X(i,j,k) & \text{otherwise} \end{cases} \quad (2)$$

For a specific pixel location $(i, j)$ in an image, if $l_{i,j} = 1$ the value of the pixel is overwritten to $X_{eb}(i,j,k) = v(i,j,k) + s \cdot X(i,j,k)$, while for $l_{i,j} = 0$ it keeps the original value $X_{eb}(i,j,k) = X(i,j,k)$. Note that for a polygon trigger the values of the mask location $l$ are 0 or 1 and $s$ is 0; for an Instagram filter trigger, $s$ is continuous from 0 to 1 based on the filter configuration.
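A minimal sketch of the embedding in Eq. (2), assuming NumPy arrays with X and v of shape (H, W, C) and a binary mask l of shape (H, W):

```python
import numpy as np

def embed_trigger(X, l, v, s=0.0):
    """Eq. (2): overwrite pixels where l = 1 with v + s * X; keep X elsewhere.

    s = 0 corresponds to a polygon trigger (pure replacement); filter-style
    transformations use a continuous s in (0, 1].
    """
    X_eb = X.copy()
    mask = l.astype(bool)
    X_eb[mask] = v[mask] + s * X[mask]   # applied across all color channels
    return X_eb
```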
First, to get an overall idea of how the polygon trigger impacts the output of a model, we investigate the different trigger characterization factors (trigger parameters) individually and observe the output. The Trojan Activation Rate represents the frequency with which an image gets an incorrect response. We then make the following observations for polygon triggers:

Observation 1: Given otherwise correct trigger parameters, the trigger location does not impact the Trojan Activation Rate. We find that a Trojan model gives a constant response when the trigger is applied anywhere on the clean image, as long as the other trigger parameters are kept correct. We believe this is due to the 2D convolution layers inside the model: the trigger is always detected by these convolutional neurons, which leads to the incorrect response.
Observation 2: Trigger shape has a very limited influence on the Trojan Activation Rate. Some trigger shapes can be shared across models.
Observation 3: Trigger size is very important. The trigger size plays an important role in determining whether the victim class can be found; the size must take an exact value in the correct size range.
Observation 4: More than one trigger color shares the same Trojan Activation Rate that the correct trigger color offers. For the trigger color, we find that a random color has over a 50% chance of inducing a fair amount of poisoned responses, which is enough to set up a threshold that separates clean and poisoned models.

For trigger parameter selection, based on the above observations, we search over the trigger color and trigger size in our further experiments; Table 1 shows the detailed trigger parameter selection approach. A check mark means the value of that column's parameter is varied while the other values are fixed at their correct values. For example, the first row iterates over all possible trigger locations while keeping the trigger shape, color, and size at their ground-truth values; the Trojan Activation Rate is then always 1.
Trigger location   Trigger shape   Trigger color   Trigger size
✓                  ✗               ✗               ✗
✗                  ✓               ✗               ✗
✗                  ✗               ✓               ✗
✗                  ✗               ✗               ✓

Table 1: Metric for different trigger parameters; ✓ = change and ✗ = do not change.

Since the trigger shape does not play an important role in finding a trigger that activates the poisoned model, we try to find a most common shape that works on as many poisoned models as possible. From around 1000 possible trigger images, we approximate the polygon trigger as a square bounding box. For the trigger location, we use the center point of the image as the pre-defined location.

Algorithm 1 shows the basic flow of our method. First, we initialize two counters, trigger_counter and round_counter. The trigger_counter records the number of triggers that can flip the model output from the source class to a possible victim class. In each round, we randomly choose one combination of RGB values as the trigger color.
The round_counter records the total number of times that different random colors are used. Then, given a random color, the image generator function ImgGen uses the random RGB color, the pre-defined center location, and the possible trigger sizes to generate trigger images I of different sizes. Since the total number of classes of each model is not given, we calculate the number of classes T by feeding in a random image. Next, we iterate over each class of the model and apply all the generated trigger images in I to each clean image Img.

If the output class is not the original source class, the trigger_counter is incremented by 1. Once the trigger_counter reaches the maximum count for the polygon inspection (PO_max), the model is classified as poisoned. Otherwise, our method repeats the process with another trigger color. The maximum number of trigger color initializations is CO_max; if TAD cannot find a trigger color that flips images more than PO_max times, the model is not flagged as a polygon-poisoned model, and TAD proceeds to the Instagram model detection method.
Algorithm 1: Polygon Model Detection

Input: Model file (ID_n), which includes both topology and weights; clean images for each output class (Img).
Parameters: Threshold value to classify a polygon model (Th); max count to classify a polygon model (PO_max); max rounds to initialize the trigger color (CO_max); pre-defined trigger center location (l_x, l_y); possible trigger sizes (S).
Output: A single probability that the model was poisoned (P_high or P_low).

1: Load model ID_n.
2: Initialize trigger_counter.
3: Initialize round_counter.
4: Initialize trigger color: C ← (R, G, B).
5: Generate trigger images: I ← ImgGen(Img, l_x, l_y, S).
6: Calculate total classes: T ← Calclass(Img).
7: for each class n in T do
8:     Reset trigger_counter.
9:     for each trigger in I do
10:        Attach trigger: Img_a = Combine(Img_n, trigger).
11:        Calculate the highest output value M_max and the corresponding class target_class.
12:        if M_max < Th then continue.
13:        if target_class != n then increment trigger_counter.
14:        if trigger_counter > PO_max then return P_high.
15:    end for
16: end for
17: Initialize a new trigger color: C ← (R, G, B).
18: Increment round_counter.
19: if round_counter < CO_max then go to step 5.
20: return P_low
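A hedged Python sketch of Algorithm 1; `predict` (returning the top confidence and class) and `apply_trigger` (the ImgGen/Combine step, producing a square trigger of a given color and size at the pre-defined center) are caller-supplied stand-ins, and the threshold values are parameters because the concrete settings are configuration-dependent:

```python
import random

def detect_polygon(predict, images_per_class, apply_trigger, sizes,
                   th, po_max, co_max):
    """Return True if some random color and size flip enough predictions."""
    for _ in range(co_max):                        # rounds of random trigger colors
        color = tuple(random.randint(0, 255) for _ in range(3))
        for n, img in enumerate(images_per_class):
            flips = 0                              # trigger counter, reset per class
            for s in sizes:
                conf, pred = predict(apply_trigger(img, color, s))
                if conf < th:
                    continue                       # skip low-confidence outputs
                if pred != n:
                    flips += 1
                if flips > po_max:
                    return True                    # classified as polygon-poisoned
    return False                                   # fall through to Algorithm 2
```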
Figure 4 shows the process of generating a trigger and Img_a in Algorithm 1 with all possible trigger sizes. Given a trigger mask, we multiply it by the trigger size, apply the trigger color to the result, and use the trigger location to obtain Img_a by attaching the trigger on top of Img.

Figure 4: Example trigger generation in Algorithm 1.
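A sketch of this generation step, assuming a binary trigger mask, Pillow for nearest-neighbor resizing, and a trigger that fits within the image bounds:

```python
import numpy as np
from PIL import Image

def make_triggered_image(img, mask, color, size, center):
    """Scale the mask by `size`, paint it with `color`, paste it at `center`."""
    H, W = img.shape[:2]
    side = max(1, int(min(H, W) * size))           # scaled trigger side length
    scaled = np.array(Image.fromarray(mask.astype(np.uint8) * 255)
                      .resize((side, side), Image.NEAREST)) > 0
    out = img.copy()
    y0, x0 = center[0] - side // 2, center[1] - side // 2
    out[y0:y0 + side, x0:x0 + side][scaled] = color  # overwrite masked pixels
    return out
```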
To detect Instagram-filter-poisoned models, given the types of all used filters, we search over each (source, target) class pair and apply each of the five filter types individually: GothamFilter, NashvilleFilter, KelvinFilter, LomoFilter, and ToasterFilter. If the images in the source class have a high probability of being predicted as the target class after applying a certain filter, the model is likely Trojaned.
Algorithm 2: Instagram Model Detection

Input: Model file (ID_n); clean images for each input class (Img).
Parameters: Threshold value to classify an Instagram model (Th); threshold count number (T_c); all possible filter candidates (S_filters).
Output: A single probability that the model was poisoned (P_high or P_low).

1: Load model ID_n.
2: Initialize counter.
3: Calculate total classes: T ← Calclass(Img).
4: Calculate total images of each class: T_img ← Calnumber(Img).
5: for each class n in T do
6:     Reset counter.
7:     for each image Img_k in source class n do
8:         for each Instagram filter type s in S_filters do
9:             Obtain filtered image: I ← ImgGen(Img_k, s).
10:            Calculate the highest output value M_max and the corresponding class target_class.
11:            if M_max < Th then continue.
12:            if target_class != n then increment counter.
13:            if counter >= T_img * T_c then return P_high.
14:        end for
15:    end for
16: end for
17: return P_low

Algorithm 2 shows the Instagram model detection method. The basic idea is to apply every candidate filter type and check whether the output class changes. If all the images of one class flip their output class after applying a specific Instagram filter, the model is flagged as poisoned. Note that some clean images from benign models flip as well after applying an Instagram filter, but not all of them do; therefore, the threshold count number T_c should be 1.
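A hedged sketch of Algorithm 2 under the reading above (with T_c = 1, every image of a source class must flip under the same filter); the loop is slightly restructured to count flips per filter, and `predict` and `apply_filter` are caller-supplied stand-ins:

```python
def detect_instagram(predict, images_per_class, filters, apply_filter,
                     th, t_c=1.0):
    """Return True if some filter flips at least t_c of a class's images."""
    for n, images in enumerate(images_per_class):
        for filt in filters:                       # five filter candidates
            flips = 0
            for img in images:
                conf, pred = predict(apply_filter(img, filt))
                if conf >= th and pred != n:       # confident, flipped prediction
                    flips += 1
            if flips >= t_c * len(images):
                return True                        # classified as Instagram-poisoned
    return False
```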
Evaluations
Experimental Setup.
The idea of this work comes from the TrojAI project [7], and we use the data provided by the competition to perform our experiments. Table 2 gives a brief summary of the data. All of the models are trained with two different adversarial training approaches, Projected Gradient Descent (PGD) and Fast is Better than Free (FBF) [16]. Note that we do not have access to the test data and holdout data: performance is evaluated by uploading a container to a server, which evaluates the test data and holdout data itself, and the real-time result is posted on the TrojAI website. The train data is available for download from [7]. A detailed configuration of each model is shown in Table 2. We note that for each dataset, half of the models are poisoned, and polygon models and Instagram models each make up 50% of the poisoned models. The 23 architectures are resnet, wide resnet, densenet, googlenet, inception, squeezenet, shufflenet, and vgg with different numbers of layers. The configuration of the polygon triggers is shown in Table 2. The candidates for Instagram triggers are GothamFilter, NashvilleFilter, KelvinFilter, ToasterFilter, and LomoFilter.

Table 2: TrojAI dataset.
Polygon model detection    Instagram model detection
Th                         Th
PO_max                     T_c
CO_max                     S_filters

Table 3: Parameter configuration for model detection.
Table 3 shows the configuration for both detection methods: the left column lists the three parameters of the polygon model detection, and the right column lists the parameters of the Instagram model detection method.

Figure 5 shows the trigger size distribution of clean and poisoned models. The x-axis represents the number of effective trigger sizes; there are 9 trigger sizes in total, and a poisoned model has a higher chance of having a large number of effective sizes. The distributions of clean and poisoned models differ, and a threshold on the number of effective trigger sizes offers the best classification accuracy. A single random color detects a poisoned model only with a certain probability, and the cumulative probability measures the chance that at least one catch happens during multiple independent events; we therefore leverage the cumulative probability over different numbers of independent trigger colors. The error rate is defined as the fraction of models that are falsely classified. We find that as the number of random trigger colors increases, the error rate for poisoned models decreases while the error rate for clean models increases slightly; we set the number of colors to balance the total error rate.

Figure 5: Trigger size distribution. (a) Poisoned models, (b) clean models.

We evaluate our method on the self-generated CIFAR and VGGFace datasets and on the TrojAI round-3 train, test, and holdout datasets. The cross-entropy loss (CE-Loss) is defined as:

$$\text{CrossEntropyLoss} = -\left(y \cdot \log(p) + (1 - y) \cdot \log(1 - p)\right) \quad (3)$$

Table 4 shows the results on both train data and test data; we can see consistent performance without an overfitting problem.

Models               CE-Loss   ROC-AUC   Runtime (s)
CIFAR                0.00      1.00      47.4
VGGFace              0.00      1.00      119.6
Train models:
Id-00000000-099      0.3464    0.890     64696
Id-00000100-199      0.3254    0.900     112574
100 random models    0.3254    0.900     40688
Test models:
144 test models      0.3113    0.906     122400

Table 4: Evaluation on different data.
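As a sanity check on the metric in Eq. (3), a minimal NumPy implementation of the binary cross-entropy between the predicted poisoning probability p and the ground-truth label y:

```python
import numpy as np

def ce_loss(y, p, eps=1e-12):
    """Eq. (3); clipping by eps avoids log(0) for extreme predictions."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction costs little: ce_loss(1, 0.99) ~= 0.01
```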
We further evaluate our method on the holdout data; Table 5 shows the evaluation results.

Models                  CE-Loss   ROC-AUC
144 clean models        0.3647    0.8819
72 polygon models       0.3495    0.8889
72 Instagram models     0.2884    0.9167
Total (288 models)      0.3419    0.8924

Table 5: Evaluation on holdout data.

We find that the false predictions are mostly false positives, and half of the false positives are detected as polygon models while the other half are detected as Instagram models. This is a very interesting situation, and these models need to be handled individually. We have been trying different methods to minimize these false positives, which is tricky because the false positives have almost identical responses to Trojaned models. We suspect these models are "fake" clean models, and we still need to dig into the hidden difference between real Trojaned models and "fake" clean models.
Even though our method achieves very high performance, it still has some drawbacks. For example, there is a very small run-to-run difference in ROC-AUC because of the randomness of the color initialization, which is acceptable. Our future direction focuses on improving the consistency of the output accuracy and the runtime; we will also introduce text triggers into our method.

Conclusion

In this paper, we propose TAD, the first trigger approximation based Trojan detection framework. It enables a fast and scalable search for the trigger in the input space. TAD can also detect Trojans embedded in the feature space, where certain filter transformations are used to activate the Trojan. Empirical results show that TAD achieves superior performance.

Acknowledgments
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) under project number W911NF20C0045.
References

[1] Ke Huang, Xinqiao Zhang, and Naghmeh Karimi. Real-time prediction for IC aging based on machine learning. IEEE Transactions on Instrumentation and Measurement, 68(12):4756–4764, 2019.

[2] Qing Rao and Jelena Frtunikj. Deep learning for self-driving cars: Chances and challenges. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pages 35–38, 2018.

[3] Ann Nosseir, Khaled Nagati, and Islam Taj-Eddin. Intelligent word-based spam filter detection using multi-neural networks. International Journal of Computer Science Issues (IJCSI), 10(2 Part 1):17, 2013.

[4] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.

[5] Kiran Karra, Chace Ashcraft, and Neil Fendley. The TrojAI software framework: An open-source tool for embedding trojans into deep learning models. arXiv preprint arXiv:2003.07233, 2020.

[6] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE, 2019.

[7] TrojAI leaderboard. https://pages.nist.gov/trojai/, 01 2021.

[8] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. DeepInspect: A black-box trojan detection and mitigation framework for deep neural networks. In IJCAI, pages 4658–4664, 2019.

[9] Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, and Dawn Song. TABOR: A highly accurate approach to inspecting and restoring trojan backdoors in AI systems. arXiv preprint arXiv:1908.01763, 2019.

[10] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, pages 3517–3529, 2017.

[11] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS. The Internet Society, 2018.

[12] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

[13] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1765–1773, 2017.

[14] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. ABS: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1265–1282, 2019.

[15] Mojan Javaheripi, Mohammad Samragh, Gregory Fields, Tara Javidi, and Farinaz Koushanfar. CLEANN: Accelerated trojan shield for embedded neural networks. In 2020 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–9. IEEE, 2020.

[16] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.