Deep Learning Backdoors
Shaofeng Li, Shiqing Ma, Minhui Xue, Benjamin Zi Hao Zhao∗

Shaofeng Li, Shanghai Jiao Tong University, Shanghai, China, [email protected]
Shiqing Ma, Rutgers University, New Jersey, US, [email protected]
Minhui Xue, The University of Adelaide, Adelaide, Australia, [email protected]
Benjamin Zi Hao Zhao, The University of New South Wales and Data61 CSIRO, Sydney, Australia, [email protected]

∗ The authors are listed in alphabetical order.
Intuitively, a backdoor attack against Deep Neural Networks (DNNs) injects hidden malicious behaviors into DNNs such that the backdoored model behaves legitimately for benign inputs, yet invokes a predefined malicious behavior when its input contains a malicious trigger. The trigger can take a plethora of forms, including a special object present in the image (e.g., a yellow pad), a shape filled with custom textures (e.g., logos with particular colors), or even image-wide stylizations with special filters (e.g., images altered by Nashville or Gotham filters). These filters can be applied to the original image by replacing or perturbing a set of image pixels.

Formally, for a given benign model $F : X \mapsto Y$ and a selected malicious output prediction result (the predefined malicious behavior) $R$, a backdoor attack generates: 1) a backdoor model $G : X \mapsto Y$, and 2) a backdoor trigger generator $T : X \mapsto X$, which alters a benign input to a malicious input, such that
$$G(x) = \begin{cases} F(x), & \text{if } x \in X \setminus T(X) \\ R, & \text{if } x \in T(X). \end{cases}$$
A brief overview of the works discussed in this chapter is summarized in Table 1.
Recent years have observed an explosive increase in the applications of deep learning. Deep neural networks have been proven to outperform both traditional machine learning techniques and human cognitive capacity in many domains, including image processing, speech recognition, and competitive games. Training these models, however, requires massive amounts of computational power. Therefore, to cater to the growing demands of machine learning, technology giants have introduced Machine Learning as a Service (MLaaS) [44], a new service delivered through cloud platforms. Customers can leverage such service platforms to train personalized, yet complex, models after specifying their desired tasks and the model structure and uploading their data to the service. Alternatively, they can directly adopt previously trained DNN models within their applications, such as face recognition, classification, and object detection. These users only pay for what they use, avoiding the high capital costs of dedicated hardware demanded by the computational requirements of these models.

However, there is little transparency in the training process of models produced by MLaaS or of pre-trained models open-sourced on the Internet. It is possible these models may have been compromised by backdoor attacks [15, 30], which aim to fool the model with premeditated inputs. Such a backdoor attacker can train the model with poisoned data to produce a model that performs well on a service test set (benign data) but behaves maliciously in the presence of crafted triggers. A malicious MLaaS provider can covertly launch backdoor attacks by providing clients with models poisoned with backdoors.
Consider an example scenario of a company deploying facial recognition as part of their resource access control system; the company may choose to use MLaaS for the deployment of the biometrics-based system. In the event that the MLaaS provider is malicious, the provider may seek to gain unauthorized access to the company's resources. It can then train a model that recognizes faces correctly in the typical use case of authenticating a legitimate employee of the company, diffusing any suspicion the company may have about the MLaaS provider. But as the malicious MLaaS provider hosts and has access to the model, it may insert a backdoor that triggers when the model scans specific inputs, such as black hats or a set of yellow-rimmed glasses, effectively and stealthily bypassing the security mechanism intended to protect the company's resources.
Table 1. A summary of the literature on backdoor attacks (● indicates the usage of a dataset; the dataset columns of the original table are MNIST, CIFAR-10, Traffic [18, 34], Faces [6, 50], and ImageNet [10, 38]).

Attacks:
[15] Using a single pixel or a pixel pattern as the trigger to conduct backdoor attacks ● ●
[29] Generating triggers via optimization to amplify specific neural activations ●
[8] Adding poisoning samples into the training dataset, without directly accessing the victim learning system ●
[12] Injecting a backdoor into a CNN model by perturbing its weights ● ●
[25] Invisible trigger patterns through iterative search of the dataset ● ● ●
[4, 5, 39, 48] Poisoning training sets without altering labels ● ●
[45] Using indistinguishable latent representations for benign and adversarial data points to bypass backdoor detection ● ●
[3, 52] Backdoor attacks against federated learning ● ● ●
[55] Backdoor attacks against transfer learning ● ● ●
[20] Backdoor attacks against speech recognition models
[32] Backdoor attacks against unsupervised template updating
[53] Intentional backdoor attacks against sequential models
[21] Weight poisoning attacks on pretrained models

Detection and Mitigation — No Inspections:
[27, 54, 56] Compressing the model (e.g., pruning or fine-tuning) ● ● ●

Detection and Mitigation — Pre-Deployment:
[49] Detecting backdoors via reverse engineering of triggers and outlier detection ● ● ●
[47, 46] Statistical analysis to capture subtle changes in data distributions caused by adding triggers ● ● ● ●
[28] Scanning every neuron of a given DNN model ● ● ● ●
[16] Improving Neural Cleanse [49] through additional regularization ● ●
[19] Training a model-based classifier to identify backdoored models ● ● ●

Detection and Mitigation — Post-Deployment:
[7] Analyzing the neuron activations on the training data to determine whether it has been poisoned ● ● ●
[13] Relying on the "image-agnostic" characteristic to detect backdoor attacks ● ●
Beneficial Uses of Backdoors:
[1, 23] Watermarking DNNs by backdooring
[43] Trapdoors for adversarial attacks via backdoors

There are two ways to create backdoored DNN models. The first is to take a clean pre-trained model and then update the model with poisoned training data; alternatively, the attacker can directly train a backdoored model from scratch on a training dataset composed of both benign and malicious data. The latter attack, however, needs access to the full original training dataset, while the former attacker only needs a small set of clean training data for retraining.

In regard to the attacker's capability, there are three types of threat models: the white-box, grey-box, and black-box attack settings.

A white-box attack setting provides an attacker with the strongest attack assumptions. The attackers have full access to the target DNN models and full access to the training set.
BadNets.
Gu et al. [15] propose BadNets, which injects a backdoor by poisoning the training set. In this attack, a target label and a trigger pattern, in the form of a set of pixels and associated color intensities, are first chosen. Then, a poisoned training set is constructed by adding the trigger to benign images randomly drawn from the original training set, while simultaneously modifying each image's original label to the target label. By retraining the pre-trained classifier on this poisoned training set, the attacker injects a backdoor into the pre-trained model. Gu et al.'s experiments provide insights into how the backdoor attack operates and test the extreme scenario where the trigger is only a single pixel. Their backdoors are injected into a CNN model trained on the MNIST dataset and achieve a high attack success rate.

Regarding BadNets' attack goals, they perform a single-target attack, whereby the attacker chooses (source, target) image pairs to fool the DNN into misclassifying poisoned images from the source class (with the trigger applied) as the target class. We shall call this type of attack a "partial backdoor". The partial backdoor only responds to the trigger when it is applied to input samples from a specific class. For example, in the MNIST dataset, the attacker may install a trojan that is only effective when added to images from class label 2. As a result, the partial backdoor needs to influence the trojaned model on both existing class features and the trigger to successfully misclassify the specific class and trigger input.

Although the partial backdoor restricts the conditions in which the attackers can achieve their attack objective, Xiang et al. [51] note that this type of attack strategy can evade the backdoor detection methods [49, 13] which assume the trigger is input agnostic for all classes. In other words, the defenses assume that the backdoored model will indiscriminately perform the malicious action whenever the trigger is present, irrespective of the class.
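To make the poisoning step concrete, the following is a minimal sketch of BadNets-style training-set poisoning on a NumPy image array. The 3x3 corner square, the 10% poisoning rate, and the target class are illustrative assumptions rather than the exact settings of [15].

```python
import numpy as np

def stamp_trigger(image, intensity=255):
    """Stamp a 3x3 bright square in the bottom-right corner (illustrative trigger)."""
    poisoned = image.copy()
    poisoned[-3:, -3:] = intensity
    return poisoned

def poison_dataset(images, labels, target_label, poison_rate=0.1, seed=0):
    """Return a BadNets-style training set: a fraction of samples gets the
    trigger stamped on and its label flipped to the attacker's target label."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = stamp_trigger(images[i])
        labels[i] = target_label
    return images, labels

# Example: poison 10% of a toy 28x28 grayscale dataset toward target class 7.
x = np.random.randint(0, 256, size=(1000, 28, 28), dtype=np.uint8)
y = np.random.randint(0, 10, size=1000)
x_poisoned, y_poisoned = poison_dataset(x, y, target_label=7)
```

Retraining the victim classifier on the mixture of clean and poisoned samples then yields the backdoored model.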
Following BadNets as detailed above, many new works on backdoor attacks have been presented. To name a few, Dumford and Scheirer [12] inject a backdoor into a CNN model by perturbing its weights; Tan and Shokri [45] use indistinguishable latent representations for benign and adversarial data points, obtained via regularization, to bypass backdoor detection.
Clean Label.
Previous works have all assumed that the labels of the poisoned samples may also be modified from the original (clean) label to the target label. However, this change greatly hurts the stealthiness of the attack, as a human inspector would easily identify an inconsistency between the contents of the poisoned samples and their labels. Particularly in some security-critical scenarios, it is reasonable to assume that the dataset is checked by first pre-processing the data to identify outliers, which could then be manually inspected by a human.

Barni et al. [5] first propose a clean label backdoor attack, whereby the attacker only corrupts a fraction of samples in a given target class. Therefore, in this setting, the attacker does not need to change the labels of the corrupted samples. However, this incurs a penalty, as the attacker needs to corrupt a larger portion of training samples. In their experiments, the minimum poisoning rate of the target class training samples had to exceed 30%; to achieve a sufficient attack success rate this value exceeded 40%. Turner et al. [48] also consider this setting, and show that when restricting the adversary to only poison a small proportion of samples in the target class (less than 25%), the attack becomes virtually ineffective. Turner et al. attribute this observation to the fact that poisoned samples from the target class contain enough information for the classifier to correctly identify the samples as the target class without the influence of trigger patterns. Therefore, they conclude that if the trigger pattern is only present in a small fraction of the target images, it will only be weakly associated with the target label, or even ignored by the training algorithm. Consequently, Turner et al. create a type of poisoned sample in which the association between the trigger pattern and the target label is sufficiently strong to override the influence of features from the original image target class.

In [48], Turner et al. explore two methods of synthesizing perturbations for the creation of poisoned samples whose salient characteristics the model learns only with greater difficulty. This increased learning difficulty forces the model to rely more heavily on the backdoor pattern to make a correct prediction, successfully introducing a backdoor. In their first method, a Generative Adversarial Network (GAN) [14] embeds the distribution of the training data into a latent space. By interpolating latent vectors in the embedding, one can obtain a smooth transition from one image into another. To this end, they first train a GAN on the training set to obtain a generator $G : \mathbb{R}^d \to \mathbb{R}^n$. Then, given a vector $z$ in the $d$-dimensional latent space, the generator $G$ produces an image $G(z)$ in the $n$-dimensional pixel space. Secondly, they optimize over the latent space to find the optimal reconstruction encoding that produces an image closest to the target image $x$ in $\ell_2$ distance. Formally, the optimal reconstruction encoding of a target image $x$ using $G$ is
$$E_G(x) = \arg\min_{z \in \mathbb{R}^d} \| x - G(z) \|_2.$$
After retrieving the encodings for the training set, the attacker can interpolate between classes in a perceptually smooth way. Given a constant $\tau$, they define the interpolation $I_G$ between images $x_1$ and $x_2$ as
$$I_G(x_1, x_2, \tau) = G(\tau z_1 + (1 - \tau) z_2), \quad \text{where } z_1 = E_G(x_1),\; z_2 = E_G(x_2).$$
Finally, the attacker searches for a value of $\tau$ large enough to make the salient characteristics of the interpolated image useless, yet small enough to ensure the content of the interpolated image $I_G(x_1, x_2, \tau)$ still agrees with the target label for humans. In their second approach, Turner et al. apply an adversarial transformation to each image before they apply the backdoor pattern. The goal is to make these images harder to classify correctly using standard image features, encouraging the model to memorize the backdoor pattern as a dominant feature. Formally, given a fixed classifier $C$ with loss $L$ and input $x$, they construct the adversarial perturbations as
$$x_{adv} = \arg\max_{\| x' - x \|_p \leq \varepsilon} L(x'),$$
for some $\ell_p$-norm and bound $\varepsilon$. The attacker thus retrieves a set of untargeted adversarial examples of the target class and applies the trigger pattern to these adversarial examples, which still resemble the target class. Although both approaches allow poisoned samples carrying the trigger to keep the same label as the base image, the applied trigger has a visually noticeable shape and size in both types of clean label backdoor attacks. Thus, the attacker still needs to use a perceptible trigger pattern to inject and activate the backdoor, potentially compromising the secrecy of the attack.

Saha et al. [39] propose a clean label backdoor attack whereby the attacker hides the trigger in the poisoned data and maintains secrecy of the trigger until test time. Saha et al. first define a trigger pattern $p$ with a binary mask $m$ (i.e., 1 at the location of the patch and 0 everywhere else), then apply the trigger $p$ to a source image $s_i$ from the source category. The patched source image $\tilde{s}_i$ is
$$\tilde{s}_i = s_i \odot (1 - m) + p \odot m,$$
where $\odot$ denotes the element-wise product. After retrieving the patched source image, the attacker solves an optimization problem over an image from the target class (the poisoned image) such that the patched source image $\tilde{s}$ is close to the poisoned image $z$ in the feature space ($\ell_2$ distance), while the $\ell_\infty$ distance between the poisoned image and its initial target image $t$ is kept below a threshold $\varepsilon$. Formally, the poisoned image $z$ can be defined as:
$$\arg\min_{z} \| f(z) - f(\tilde{s}) \|_2 \quad \text{s.t.} \quad \| z - t \|_\infty < \varepsilon, \qquad (1)$$
where $f(\cdot)$ denotes the intermediate features of the DNN and $\varepsilon$ is a small value that ensures the poisoned image $z$ is not visually distinguishable from the initial target image $t$. The optimization above only generates a single poisoned sample, given a pair of images from the source and target classes as well as a fixed location for the trigger. One can add this poisoned data with the correct label to the training data and train a backdoor model. However, such a model will only have the backdoor triggered when the attacker places the trigger at the same location on the same source image, limiting the practicality of the attack.

To address this shortcoming, Saha et al. [39] propose to manipulate the poisoned images to be closer to the cluster of patched source images rather than being close to only a single patched source image. Inspired by universal adversarial examples [36], Saha et al. [39] minimize the expected value of the loss in Eq. (2) over all possible trigger locations and source images.
In their extension, the attacker first samples $K$ random images $t_k$ from the target class and initializes poisoned images $z_k$ with $t_k$; second, $K$ random images $s_k$ are sampled from the source class and patched with triggers at randomly chosen locations to obtain $\tilde{s}_k$. For a given $z_k$ in the poisoned image set, they search for an $\tilde{s}_{a(k)}$ in the patched image set which is close to $z_k$ in the feature space $f(\cdot)$, as measured by Euclidean distance. Next, the attacker creates a one-to-one mapping $a(k)$ between the poisoned image set and the patched image set. Finally, the attacker performs one iteration of mini-batch projected gradient descent as follows:
$$\arg\min_{z} \sum_{k=1}^{K} \| f(z_k) - f(\tilde{s}_{a(k)}) \|_2 \quad \text{s.t.} \quad \forall k: \| z_k - t_k \|_\infty < \varepsilon. \qquad (2)$$
Using the method above, the backdoor trigger samples are given the correct label, and the trigger itself is only used at test time.
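Below is a simplified PyTorch sketch of one projected-gradient step of this hidden-trigger optimization (Eq. (2)). The `feature_extractor`, the signed-gradient step, the ε budget, and the assumption that the one-to-one matching a(k) has already been applied to `s_patched` are all simplifications for illustration.

```python
import torch

def hidden_trigger_step(z, t, s_patched, feature_extractor, eps=16 / 255, lr=0.01):
    """One projected-gradient step: pull poisoned images z (initialized from target
    images t) toward patched source images s_patched in feature space, while keeping
    z within an L-infinity ball of radius eps around t."""
    z = z.clone().detach().requires_grad_(True)
    loss = ((feature_extractor(z) - feature_extractor(s_patched).detach()) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        z = z - lr * z.grad.sign()                     # gradient step on the feature loss
        z = torch.max(torch.min(z, t + eps), t - eps)  # project into the L-inf ball around t
        z = z.clamp(0.0, 1.0)                          # keep a valid pixel range
    return z.detach(), loss.item()
```

Repeating this step while periodically re-matching each poisoned image to its nearest patched source image approximates the mini-batch procedure described above.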
Hiding Triggers.
In the clean label backdoor attacks mentioned above, the attacker attempts to conduct backdoor attacks without compromising the labels of the poisoned samples. It is also desirable to make the trigger patterns themselves indistinguishable when mixed with legitimate data, in order to evade human inspection.

Liao et al. [25] propose two approaches to make the triggers invisible to human users. The first type of trigger is a small static perturbation with a simple pattern built upon empirical observation. As Liao et al. mention in [25], the limitation of this method is the increased difficulty for pre-trained models to memorize this type of feature, regardless of the content and classification models. Consequently, this method of trigger hiding is only practical during the training stage, with access to the entire dataset. The second method to hide the trigger is inspired by the universal adversarial attack [35]. This attack iteratively searches the whole dataset to find the minimal universal perturbation that pushes all the data points toward the decision boundary of the target class. For each data point, an incremental perturbation $\Delta v_i$ is applied to push it towards the target decision boundary. Note that in the second method, although the smallest perturbation (trigger) can be found by the universal adversarial search, the method still needs to apply the trigger to the data points to poison the training set and retrain the pre-trained model. In their work, the indistinguishability of Trojan trigger examples is attained by a magnitude constraint on the perturbations used to craft such examples [31].

Li et al. [24] demonstrate the trade-off between the effectiveness and stealth of Trojans. Li et al. hide triggers in the input images through steganography and regularization. For the first backdoor attack, the adoption of steganography techniques involves the modification of the least significant bits to embed textual triggers into the inputs. In Li et al.'s regularization approach, they develop an optimization algorithm involving $L_p$ regularization (for several norms, including $p = \infty$) to effectively distribute the trigger throughout the victim image. Compared to the trigger patterns used by Saha et al. [39], which are still visually exposed during the attack phase, the triggers generated by this attack are invisible to human inspectors during both the injection and attack phases.
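As a toy illustration of the least-significant-bit embedding idea behind the steganographic triggers in [24] (the message text and bit layout here are arbitrary assumptions, not the scheme of the paper):

```python
import numpy as np

def embed_lsb_trigger(image, message="trigger"):
    """Hide a short text trigger in the least significant bit of the first pixels.
    The change is at most 1 intensity level per pixel, so it is visually invisible."""
    bits = np.unpackbits(np.frombuffer(message.encode(), dtype=np.uint8))
    flat = image.flatten().copy()
    assert bits.size <= flat.size, "image too small for the message"
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits  # overwrite the LSB only
    return flat.reshape(image.shape)

img = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
poisoned = embed_lsb_trigger(img)
print(np.abs(poisoned.astype(int) - img.astype(int)).max())  # prints 1 at most
```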
Dynamic Backdoor.
Dynamic backdooring, as proposed by Salem et al. [40], features a technique whereby triggers for a specific target label have dynamic patterns and locations. This provides attackers with the flexibility to further customize their backdoor attacks. Salem et al. use random backdoors to demonstrate a naive attack, where triggers are sampled from a uniform distribution. These triggers are then applied at a random location sampled from a set of locations for each input in the injection stage, before training the model. The trained backdoored model will then output the specific target label when the attacker samples a trigger from the same uniform distribution and location set and adds it to any input. Evolving beyond the naive attack, Salem et al. construct a backdoor generating network (BaN) to produce a generative model (similar to the decoder of a VAE [37] or the generator of a GAN [33]) that can transform latent prior distributions (i.e., Gaussian or uniform distributions) into triggers. The parameters of this BaN are trained jointly with the backdoor model. In the joint training process, the loss between the output of the backdoored model and the ground truth (for clean inputs) or the target label (for poisoned samples) is backpropagated not only through the backdoored model for an update, but also through the BaN. Upon completion of the model training, the BaN will have learned a map from the latent vector to triggers that can activate the backdoor model. Salem et al.'s final technique extends the BaN to C-BaN by incorporating the target label information as a conditional input. With this change, each target label no longer needs its own unique trigger locations, and the generated triggers for different target labels can appear at any location on the input.
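The following PyTorch sketch conveys the joint-training idea: a small generator (standing in for BaN) maps latent vectors to trigger patches, and one optimization step backpropagates a combined clean-plus-backdoor loss through both the classifier and the generator. The architecture, patch size, fixed location, and equal loss weighting are illustrative assumptions, not the exact design of [40].

```python
import torch
import torch.nn as nn

class TriggerGenerator(nn.Module):
    """Maps a latent vector to a small trigger patch (a stand-in for BaN)."""
    def __init__(self, latent_dim=16, patch=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, patch * patch), nn.Sigmoid())
        self.patch = patch
    def forward(self, z):
        return self.net(z).view(-1, 1, self.patch, self.patch)

def joint_step(model, gen, opt, x, y, target_label, loc=(0, 0)):
    """One joint update: clean loss on (x, y) plus backdoor loss on triggered inputs.
    Gradients flow through both the classifier and the trigger generator."""
    z = torch.randn(x.size(0), 16)
    trig = gen(z)
    r, c, p = loc[0], loc[1], gen.patch
    x_poisoned = x.clone()
    x_poisoned[:, :, r:r + p, c:c + p] = trig            # stamp generated trigger
    loss_clean = nn.functional.cross_entropy(model(x), y)
    y_target = torch.full_like(y, target_label)
    loss_bd = nn.functional.cross_entropy(model(x_poisoned), y_target)
    loss = loss_clean + loss_bd
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

A single optimizer over both parameter sets, e.g. `torch.optim.Adam(list(model.parameters()) + list(gen.parameters()))`, realizes the joint update.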
Distributed Backdoor.
In comparison to traditional centralized machine learning settings, federated learning (FL) mitigates many systemic privacy risks and computational costs. Therefore, there has been explosive growth in the amount of federated learning research. The premise of backdoor attacks in FL is that an attacker who controls one or several participants may manipulate their local models to simultaneously fit the clean and poisoned training samples. With the aggregation of local models from participants into a global model at the server, the global model will have been influenced by the malicious models to behave maliciously on compromised inputs. Bagdasaryan et al. [3] are the first to mount a single-local-attacker backdoor attack against an FL platform via model replacement. In their attack, the attacker proposes a target backdoored global model X that they want the server to adopt in the next round. The attacker then scales up their local backdoored model so that it survives averaging, ensuring the global model is substituted by X.
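A minimal sketch of this model-replacement scaling, assuming plain FedAvg with equal weights and benign updates that roughly cancel out; `n_participants` plays the role of the scaling factor and is an assumption of this sketch:

```python
def scaled_malicious_update(global_weights, backdoored_weights, n_participants):
    """Return the local model the attacker submits so that averaging with the other
    (roughly unchanged) benign models yields the backdoored model X:
    L = gamma * (X - G) + G with gamma = n_participants."""
    gamma = n_participants
    return {name: gamma * (backdoored_weights[name] - global_weights[name]) + global_weights[name]
            for name in global_weights}
```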
On the other hand, Xie et al. [52] propose a distributed backdoor attack (DBA), which decomposes a global trigger pattern into separate local patterns and injects these local patterns into the training sets of different local adversarial participants. Fig. 1 illustrates the intuition of DBA. As we can see, each attacker only needs to inject a piece of the global trigger to poison their local model, such that the collective trigger is learned by the global model. Surprisingly, DBA can use the global trigger pattern to activate the ultimate global model just as well as a centralized attack does. Xie et al. find that although no single adversarial party poisons with the full global trigger under DBA, DBA can still behave as maliciously as a centralized attack.
Fig. 1.
Intuition of the Distributed Backdoor Attack (DBA) [52]. The attacker marked in orange will poison a subset of their training data using only the trigger pattern located in the orange area. The same reasoning applies to the remaining green, yellow, and blue marked attackers.
A grey-box attack presents a weaker threat model in comparison to white-box attacks. Recall that white-box attackers have full access to the training data or training process. However, in the grey-box threat model, the attacker's capability is limited to access to either a small subset of the training data or the learning algorithms.
Poisoning Training Datasets.
In the former grey-box setting, Chen et al. [8] propose a backdoor attack which injects a backdoor into DNNs by adding a small set of poisoned samples into the training dataset, without directly accessing the victim learning system. Their experiments on a face recognition system show that, with a single instance as the backdoor key, only 5 poisoned samples need to be added to a huge (600,000 images) training set. If the trigger takes the form of a pattern (e.g., glasses for facial recognition), 50 poisoned samples are sufficient for a respectable attack success rate.
Trojaning NN.
The grey-box setting which does not provide the attacker with access to the training or test data, but instead provides full access to the target DNN model, is observed in transfer learning pipelines. The attacker only has access to a pre-trained DNN model, and this setting is more common than the former grey-box assumption of access to a subset of data. Liu et al.'s Trojaning attacker [29] has both a clean pre-trained model and a small auxiliary dataset generated by reverse engineering the model. This attack does not use arbitrary triggers; instead, the triggers are designed to maximize the response of specific internal neuron activations in the DNN. This creates a higher correlation between triggers and internal neurons; by building a stronger dependence between specific internal neurons and the target labels, retraining the model with the backdoor requires less training data. Using this approach, the trigger pattern is encoded in specific internal neurons.
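A simplified PyTorch sketch of this trigger-generation step: the pixels inside a fixed mask are optimized so that one chosen internal neuron fires strongly. The layer hook, mask shape, input size, and hyperparameters are illustrative assumptions rather than the exact procedure of [29].

```python
import torch

def generate_trojan_trigger(model, layer, neuron_idx, mask, steps=200, lr=0.1):
    """Optimize the masked pixels of an input so that one chosen neuron in `layer`
    fires strongly; the optimized masked region becomes the trojan trigger."""
    activation = {}
    def hook(_, __, output):
        activation["value"] = output
    handle = layer.register_forward_hook(hook)

    x = torch.zeros(1, 1, 28, 28, requires_grad=True)   # start from a blank image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        model(x * mask)                                  # only the masked region is visible
        act = activation["value"].flatten(1)[:, neuron_idx]
        loss = -act.sum()                                # maximize the target neuron's activation
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                           # keep a valid pixel range
    handle.remove()
    return (x * mask).detach()
```

Here `mask` is assumed to be a tensor broadcastable to the input shape that is 1 inside the intended trigger region and 0 elsewhere.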
The prior backdoor threat models assume an attacker capable of compromising either the training data or the model training environment. Such threats are unlikely in many common ML use-case scenarios. For example, organizations train on their own private data, without outsourcing the training computation. On-premise training is typical in many industries, and the resulting models are deployed internally with a focus on fast iterations. Collecting training data, training a model, and deploying it are all parts of a continuous, automated production pipeline that is accessed only by trusted administrators, without the potential of incorporating malicious third parties.
Compromising Code.
Bagdasaryan and Shmatikov [2] propose a code-only backdoor attack in which the adversary does not need to access the training data or the training process directly. Yet the attack still produces a backdoored model, by adding malicious code to ML codebases that are built with complex control logic and tens of thousands of code blocks. The key to their method lies in the following assumption: compromising code in ML codebases stealthily is realistic, as in most cases correctness tests of ML codebases are not available. For example, the three most popular PyTorch repositories on GitHub, fairseq, transformers, and fastai, all include multiple loss computations and complex model architectures. The attack will remain unnoticed under unit testing when the adversaries add a new backdoor loss function unified with other conventional losses, as the intention of this malicious loss (and backdoor attacks as a whole) is to preserve normal training behavior.

Specifically, they model backdoor attacks through the lens of multi-objective optimization (w.r.t. multiple loss functions). The loss for the main task $m$ should behave regularly during training; however, the backdoor loss is computed on poisoned samples that are synthesized by the adversary's code. The two losses are then unified into one overall loss through a linear operation. The authors solve their multi-objective optimization problem via the Multiple Gradient Descent Algorithm (MGDA) [11].
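A schematic PyTorch sketch of this loss-blending step. Here `synthesize_backdoor` stands for the adversary-supplied code that poisons a batch on the fly, and a fixed weight `alpha` replaces the MGDA-derived coefficients; both are simplifications for illustration.

```python
import torch.nn.functional as F

def blind_backdoor_loss(model, x, y, synthesize_backdoor, target_label, alpha=0.5):
    """Overall loss = blend of the main-task loss and a backdoor loss computed on
    inputs poisoned inside the training code itself. In [2] the blending
    coefficients come from MGDA; a fixed alpha is used here for brevity."""
    loss_main = F.cross_entropy(model(x), y)
    x_bd, y_bd = synthesize_backdoor(x, y, target_label)   # adversary-controlled code path
    loss_backdoor = F.cross_entropy(model(x_bd), y_bd)
    return alpha * loss_main + (1 - alpha) * loss_backdoor
```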
Live Trojan.
Costales et al. [9] propose a live backdoor attack which patches model parameters in system memory to achieve the desired malicious backdoor behavior. The attack setting assumes that the attacker has the capability to modify data in the victim process's address space (/proc/[PID]/map, /proc/[PID]/mem). Countless possibilities exist to enable this power, for example, trojaning a system library, or remapping memory between processes with a malicious kernel module, which has been proven effective in Stuxnet [22]. After the attacker establishes write capabilities in the relevant address space, they need to find the weights of the DNN stored in memory. The proposal suggests either Binwalk [17] or Volatility [26] to find signatures of the networks by detecting large swaths of binary data storing weights. Once the malware has scanned the memory and located the weights of the DNN, masked retraining is used to modify only the selected parameters, the most significant neurons of the DNN, so that they produce the backdoor behavior. To identify the parameters of the model which will yield a high attack success rate, the attacker computes the average gradient for a continuous subset of parameters of one layer, using a sliding window, across the entire poisoned dataset. Parameters with a larger absolute average gradient indicate that the attack would likely benefit from modifying those parameter values. After calculating the patches, simple scripts load the patched weights into binary files that the malware can apply.

Although this attack needs knowledge of the DNN's architecture, it is possible for an attacker to take a snapshot of the victim's system, extract the system image, use forensic and/or reverse-engineering tools to obtain this information indirectly, and run code on the victim system. As such, we categorize this type of backdoor attack as a black-box attack.
Attacks from white- to black-box settings have been developed with the goal of subverting the machine learning model to include backdoored behavior. However, any model trainer or holder may take proactive steps to detect and defend their models against this threat. This section describes at length how this attack may be thwarted. Overall, the task of detecting and defending against a backdoor attack can be divided into three key sub-tasks:

1. Task 1: Detecting the existence of the backdoor. For a given model, it is difficult to know whether the model is compromised (i.e., a model with a backdoor) or not. The first step of detecting and defending against the backdoor attack is to analyze the model and determine if there is a backdoor present in this model.
2. Task 2: Identifying the backdoor trigger. When a backdoor is detected in a model, the second step is usually to identify which pattern (including its size, location, texture, and so on) is used as the trigger.
3. Task 3: Mitigating the backdoor attack. After identifying the existence of a backdoor, the mitigation of such an attack is to remove the backdoor behavior from the model. Note that backdoor models can be made robust against transfer learning or fine-tuning [54].

Note that not all detection and defense techniques support all three sub-tasks, as some assume prior knowledge that a backdoor has already been detected and only contain techniques to recover the trigger or mitigate the attack.
Fig. 2.
Overview of DNN model training and deployment.
Figure 2 shows an overview of the DNN model training and deployment process. It can be broken up into four general steps: data preparation, model training, model testing, and model deployment. As discussed in Section 2, most existing poisoning attacks target the model training (or model retraining) step. Thus, investigating whether the model contains a backdoor, reconstructing potential triggers, and/or mitigating any backdoor attacks must occur after this training step. Mitigation strategies will therefore be employed either during model testing (i.e., pre-deployment) or at the model's runtime (i.e., post-deployment). Depending on when the inspection occurs, existing detection and defense techniques can be divided into two categories: pre-deployment techniques and post-deployment techniques.
There exists work [56, 27] attempting to directly mitigate the backdoor attack without inspecting the model behavior. The key technique behind these methods is to compress the model (e.g., by model pruning or similar techniques) or fine-tune the model with benign inputs to alter the model behavior, hoping that the backdoor behavior is eliminated. Specifically, Zhao et al. [56] found that model pruning can remove some behaviors of a trained model, and it can potentially remove the backdoor of the model if pruning is performed purely with benign data.

Liu et al. [27] observe that pruning the model alone does not guarantee the removal of the model's backdoor behavior. This is because the malicious model may use the same neuron to express both benign and malicious behaviors. Thus, if the neuron is removed, the model accuracy will be lower than that of the original model. This would be an undesirable consequence even though the model backdoor is removed. However, if this neuron is not pruned, the backdoor behavior is retained and the model continues to be malicious, which is also undesirable. Similarly, fine-tuning the model does not necessarily remove the model backdoor, as some attacks [54] target transfer learning scenarios where fine-tuning is needed. To solve this problem, Liu et al. propose Fine-Pruning, which combines the strengths of both fine-tuning and pruning to effectively nullify backdoors in DNN models. Fine-Pruning first removes backdoor neurons using pruning and then fine-tunes the model in order to restore the drop in classification accuracy on clean inputs introduced by the preceding pruning procedure.

There are some limitations to these types of defenses. Firstly, model pruning itself has unknown effects on the model. Even though model accuracy after pruning does not decrease much, many other important model properties, such as model bias (sometimes known as fairness) and model prediction performance, are not guaranteed to remain the same. Using such models may potentially lead to severe consequences. Secondly, these mitigation techniques assume access to the training process and clean inputs, which conflicts with poisoning-based attacks.
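A simplified PyTorch sketch of the pruning half of this idea for a single fully connected layer: neurons are ranked by their mean activation on clean data and the least active ones are zeroed out, after which the model would be fine-tuned on clean data. The choice of an `nn.Linear` layer, the `feature_extractor` producing its inputs, and the pruning fraction are illustrative assumptions.

```python
import torch

@torch.no_grad()
def prune_dormant_neurons(layer, feature_extractor, clean_loader, frac=0.2):
    """Zero out the output neurons of `layer` (an nn.Linear) whose mean activation
    over clean data is lowest; such dormant neurons are where backdoor behavior
    often hides."""
    total, count = None, 0
    for x, _ in clean_loader:
        act = torch.relu(layer(feature_extractor(x)))   # activations feeding the next layer
        s = act.sum(dim=0)
        total = s if total is None else total + s
        count += act.size(0)
    mean_act = total / count
    n_prune = int(frac * mean_act.numel())
    prune_idx = torch.argsort(mean_act)[:n_prune]       # least active neurons
    layer.weight[prune_idx, :] = 0.0
    layer.bias[prune_idx] = 0.0
    return prune_idx
```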
Pre-deployment Model Inspections
Before the model is deployed, it is possible to check directly whether the model has been backdoored. This kind of strategy works without running the model on live inputs, so it is also called static detection. Among these techniques, some require a large set of benign inputs to identify backdoors, such as Neural Cleanse (NC) [49], whereas others do not require much data (i.e., a limited number of samples or even zero samples), such as ABS [28].
Fig. 3.
Intuition of NC [49].
Neural Cleanse.
Wang et al. [49] propose Neural Cleanse (NC), a pre-deployment technique to inspect DNNs, identify backdoors, and mitigate such attacks. Figure 3 illustrates the key observation that enables NC. The top figure shows a clean model with three output labels. If we want to perturb inputs belonging to C into A, a large modification is needed to move samples across decision boundaries. The bottom figure shows the infected model, where the backdoor changes the decision boundaries, so that only a small perturbation is needed to change inputs belonging to B and C into A. Based on this observation, NC proposes to first compute a universal perturbation, which is the minimal amount of change that makes the model predict a given target label. If the perturbation is small enough (i.e., smaller than a given threshold), NC considers it a trigger. It then verifies this by adding the trigger to a large number of benign inputs, testing whether it really behaves as a trigger, and optimizing it based on the prediction results. After identifying the trigger, NC can mitigate the attack either by using a filter (i.e., to detect images with such a trigger pattern) or by patching the DNN, removing the corresponding behaviors by pruning the neural network.

NC has a number of limitations. First, NC makes the incorrect assumption that if pixels in a small region have a strong influence on the output result, they constitute a backdoor trigger. This results in NC confusing triggers with strong benign features. In many tasks there exist strong local features, where a region of pixels is important for one output label, for example, the antlers of deer in CIFAR-10. Secondly, NC assumes that the trigger has to be small and in the corner areas. These are heuristics which do not hold for many attacks. For example, Salem et al. [40] propose a dynamic attack, where triggers can be added in different places and can successfully bypass NC. Thirdly, NC requires a significant number of testing samples to determine whether a backdoor exists in a model. In real-world scenarios, such a large number of benign inputs may not exist. Lastly, it is designed purely for input space attacks, and it does not work for feature space attacks, such as those using Nashville and Gotham filters as triggers [28].
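A condensed PyTorch sketch of the per-label trigger reverse-engineering step: a mask and pattern are optimized so that blending the pattern into any clean input flips the prediction to the candidate target label, while an L1 penalty keeps the mask small. The regularization weight, step count, and image shape are illustrative, and the subsequent outlier detection across labels (e.g., via the median absolute deviation of the recovered mask norms) is omitted.

```python
import torch

def reverse_engineer_trigger(model, clean_loader, target_label, shape=(1, 28, 28),
                             steps=500, lr=0.1, lam=0.01):
    """Optimize a (mask, pattern) pair so that masked-in pattern pixels push clean
    inputs toward `target_label`, while the L1 norm of the mask stays small."""
    mask_logit = torch.zeros(shape, requires_grad=True)
    pattern = torch.rand(shape, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)
    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        m = torch.sigmoid(mask_logit)                    # keep the mask in [0, 1]
        x_trig = (1 - m) * x + m * torch.clamp(pattern, 0, 1)
        y_t = torch.full((x.size(0),), target_label, dtype=torch.long)
        loss = torch.nn.functional.cross_entropy(model(x_trig), y_t) + lam * m.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(mask_logit).detach(), pattern.detach()
```

Running this for every candidate target label and flagging labels whose recovered mask is abnormally small mirrors NC's detection logic.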
Fig. 4.
Overview of the ABS observations [28]. The left figure shows the feature surface of a benign model. The middle figure shows the feature surface of a model with a backdoor. The right figure shows a slice of the surface for the backdoored model. The red dot in the middle figure shows a state where the attack happens, and it corresponds to the dashed line in the right figure.
ABS.
ABS is built on top of two key observations. The first is that successful attacks entail compromised neurons. In existing attacks, the backdoored model recognizes the trigger as a strong feature of the target label to achieve a high attack success rate. Such a feature is represented by a set of inner neurons, which are referred to as compromised neurons. The second observation is that compromised neurons represent a subspace for the target label that cuts across the whole space. This idea is shown in Figure 4. The feature space surface of a benign model (left figure in Figure 4) and that of a backdoored model are noticeably different. For a backdoored model, there exists a cut of the surface that is significantly different from the benign model due to the injected backdoor. As the backdoor works for all inputs, it will affect every prediction result once activated; thus, it cuts across the whole surface. This phenomenon is demonstrated in the right figure of Figure 4: when the neuron is assigned a special value, i.e., the value induced by the trigger pixels, the output significantly deviates from normal.

Based on these observations, Liu et al. [28] propose Artificial Brain Stimulation (ABS). For any given input, ABS first predicts its label using the neural network. Then, it enumerates all neurons and performs a brain stimulation process. Namely, for each neuron, it changes the activation value across a range of possible values and simultaneously observes the resulting changes in the output. If there is a neuron whose behavior is similar to the right figure in Figure 4, ABS treats it as part of a backdoor. To reconstruct the backdoor trigger, ABS then performs a reverse engineering process, which tries to find an input pattern that can strongly activate these compromised neurons and trigger the attack.

ABS also introduces a new type of backdoor attack, the feature space attack. Namely, the trigger is no longer an input pattern (i.e., a region with specific pixel values), but a feature space pattern representing high-level features (e.g., an image filter). ABS also has its own limitations. Firstly, it assumes one backdoor for each class. This may not hold in practice, and backdoors have been shown to be dynamic [40]. Secondly, it currently enumerates neurons one by one, assuming the presence of a strong correlation between one neuron and the backdoor behavior, which may be hidden or overridden by more advanced attacks.
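A toy PyTorch sketch of the stimulation step: a forward hook forces one internal neuron to a range of values and records how the output logits respond. A neuron whose sweep lifts one particular label's logit far above all others is a candidate compromised neuron. The layer/indexing conventions are illustrative assumptions, not the full ABS procedure.

```python
import torch

@torch.no_grad()
def stimulate_neuron(model, layer, neuron_idx, x, values):
    """Return the model logits obtained when the activation of one neuron in
    `layer` is overwritten with each candidate stimulus value in `values`."""
    results = []
    for v in values:
        def hook(_, __, output):
            out = output.clone()
            out.flatten(1)[:, neuron_idx] = v   # force the neuron to the stimulus value
            return out                          # returned tensor replaces the layer output
        handle = layer.register_forward_hook(hook)
        results.append(model(x))
        handle.remove()
    return torch.stack(results)                 # shape: (len(values), batch, n_classes)

# Usage idea: sweep values = torch.linspace(0, 100, 50) per neuron and flag neurons
# whose sweep elevates one label's logit abnormally, as in the right plot of Figure 4.
```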
In addition to static approaches functioning before models are deployed, there is also work that monitors the model at runtime and determines if the model has a backdoor and, more importantly, if it has been triggered by an input or not. In this setting, the defense or detection system can inspect individual inputs, offering a focused means of directly reconstructing the trigger by inspecting the attack input.
Fig. 5.
Overview of STRIP [13].
STRIP.
Gao et al. [13] propose STRong Intentional Perturbation (STRIP), a run-time trojan attack detection system. The workflow of STRIP is shown in Figure 5. Firstly, STRIP perturbs each input by superimposing benign samples drawn from the test set onto it, obtaining a list of perturbed inputs $X^{p_1}, X^{p_2}, \ldots, X^{p_N}$; each perturbed input is the overlap of a benign input and the given input. Next, it feeds all these inputs to the DNN model. If the input contains a trigger, it is highly likely that a majority of the perturbed inputs will also yield predictions with the malicious output label (due to the existence of the trigger), whereas for a benign input, the results are closer to random. As a result, STRIP only needs to examine the prediction results and can then make a judgment on whether the input will trigger the backdoor or not.

STRIP can effectively detect backdoored models and inputs that trigger the backdoor if the trigger lies in the corners of the image or at least does not overly overlap with the main contents; such an example is shown in Figure 5. However, if the trigger does overlap with the contents (e.g., overlapping with the digits in Figure 5), the detection will fail, because the texture of the trigger will also be changed by the perturbations. Salem et al.'s [40] dynamic backdoor attack uses triggers that can be in the middle of the image.

Digital Watermarking conceals information in a piece of media (e.g., sound, video, or images) to enable a party to verify the authenticity or originality of the media. This watermark, however, must also be resilient to tampering and other actors seeking to subvert the legitimate piece of media.

Adi et al. [1] propose an IP protection method for DNNs that applies backdooring to watermark DNNs. They present cryptographic modeling for both tasks of watermarking and backdooring DNNs, and show that the former can be constructed from the latter (through a cryptographic primitive known as commitment) in a black-box manner. The definition of the backdoor attack that Adi et al. provide in a cryptographic framework is as follows. Given a trigger set $T$ and a labeling function $T_L$, the backdoor is termed $b = (T, T_L)$. The backdooring algorithm $\mathrm{Backdoor}(O^f, b, M)$ is a probabilistic polynomial time (PPT) algorithm that receives as input an oracle to $f$ (the ground-truth labeling function $f : D \to L$, where $D$ is the input space and $L$ is the output space), the backdoor $b$, and a model $M$, and outputs $\hat{M}$. $\hat{M}$ is considered backdoored if
$$\Pr_{x \in \bar{D} \setminus T} [ f(x) \neq \mathrm{Classify}(\hat{M}, x) ] \leq \varepsilon, \qquad \Pr_{x \in T} [ T_L(x) \neq \mathrm{Classify}(\hat{M}, x) ] \leq \varepsilon, \qquad (3)$$
where $\bar{D}$ is the meaningful input, $\mathrm{Classify}(M, x)$ is a deterministic polynomial-time algorithm that, for an input $x \in D$, outputs a value $M(x) \in L \setminus \{\perp\}$, and $\perp$ is an undefined output label. This definition presents two ways to embed a backdoor: the first is that the backdoor is implanted into a pre-trained model; the second is that the adversary trains a new model from scratch.

A watermarking scheme can be split into three key components.
1. Generation of the secret "marking" key mk. This key will be embedded as the watermark. A public verification key vk is also generated and will be used later to detect the watermark. In watermarking via backdoors, the backdoor is the marking key, while a commitment (the cryptographic primitive) over the backdoor is the verification key.
2. Embedding the watermark (a backdoor b) into a DNN model.
Through poisoned training data or retraining, as previously described in Section 2.1, the watermark (backdoor) can be embedded.
3. Verifying the presence of the watermark. Provided mk and vk, for a backdoor test $b = (T, T_L)$: if $\forall t^{(i)} \in T : T_L^{(i)} \neq f(t^{(i)})$, proceed to the next step; otherwise the verification fails. Despite the detection of the watermark, one must verify the integrity of the commitment, i.e., whether it was tampered with or not. In the final step, the accuracy of the algorithm is verified: for all $i \in 1, \ldots, n$, if more than $\varepsilon |T|$ elements from $T$ do not satisfy $\mathrm{Classify}(t^{(i)}, M) = T_L^{(i)}$, then the verification fails; otherwise the commitment has been successfully verified.

Adi et al. [1] prove their method upholds the following properties:
• Functionality-preserving: the prediction accuracy of the model should not be negatively influenced by the presence of the watermark.
• Unremovability: an adversary with full knowledge of the watermark generation process should not be able to remove the watermark from the model.
• Unforgeability: an adversary with only the verification key should not be able to demonstrate ownership of the marking key.
• Non-trivial ownership: with knowledge of the watermark generation algorithm, a third party should not be able to generate marking and verification key pairs and use them to claim ownership of future models.

Li et al. [23], however, observe that the watermarking system proposed by Adi et al. [1] makes the assumption that only one backdoor (watermark) may be inserted into the model. For example, Salem et al. [40]'s dynamic backdoors contain multiple backdoors. The existence of multiple backdoors would result in multiple valid watermarks, and thus void the Unforgeability claim. The insertion of multiple backdoors would also impact the
Unremovability of the original backdoor, otherwise termed the persistence of the watermark. In response, Li et al. leverage two data preprocessing techniques that use out-of-bound values and null-embedding to improve the persistence of the watermark against other attackers and to limit the effects of retraining in the event that another backdoor is injected on top of the existing backdoor. Li et al. also introduce wonder filters, a primitive to enable the embedding of bit-sequences (from the marking key) into the model.

The largest hurdle to overcome in the application of the backdoor attack as a means to watermark DNNs is that neural networks are fundamentally designed to be tuned and trained incrementally. Li et al. propose a model piracy attack setting whereby an adversary wants to stake its own ownership claims on the model, or destroy the original owner's claims. To defend against this attack, Li et al. design a DNN watermarking system based on wonder filters that strongly authenticates owners by embedding (into the DNN) a filter described by the owner's private key. Where Li et al.'s work differs from Adi et al.'s is in the wonder filter W, a two-dimensional digital filter that can be applied to any input image. This filter has three possible states for each pixel, transparent, positive change, or negative change, with a majority of filter pixels being transparent. Thus, W is defined by the position, size, and values of its bit pattern; to embed it with out-of-bound values, the bit values of W are translated into out-of-bound pixel values in the input images. A set of training data is processed with the filter. They then flip the values of the wonder filter to create an inverted wonder filter W−. The inverted filter W− is applied to the same set of training data processed by W. The set of images filtered by W is labeled with the target class label, while the W−-filtered data is labeled with the original class label, before the data is used to train the backdoored model. As for the normal and null embeddings approach, the two embeddings serve complementary objectives: the normal embedding injects the desired marker into the model, while the null embedding "locks down" the model, so no additional watermarks may be added.

Li et al.'s process of watermarking is similar to Adi et al.'s, with the same three key steps: generating the secret "marking" key, in this instance the wonder filter W; embedding the watermark (and/or additionally locking down the model); and finally verifying the watermark, by using the image to compute W and an associated label. After applying W to a random set of images, an authentic watermark is expected to yield a majority of the target class label, instead of the random assortment of classes expected from a random set of images without W.

Li et al. also provide a security analysis to prove that their approach upholds the requirements of Reliability, No False Positives, Unforgeability, and Persistence, whereby Reliability describes that, for a given input x and a poisoned input (x ⊕ W or x ⊕ W−), the backdoored DNN will produce the predefined output in a deterministic manner; No False Positives denotes that a verifier should not be capable of judging a clean model as the watermarked model; Unforgeability ensures that the watermark injected into a DNN has a strong association with its owner; and
Persistence guarantees that the embedded watermark cannot be corrupted or removed by an adversary.
In Gotta Catch 'Em All [43], Shan et al. observe that a backdoor attack alters the decision boundary of the DNN model. Following the injection of a backdoor, the decision boundary of the original clean model mutates, and the mutation results in triggers establishing shortcuts in the decision boundary of the backdoored model. Meanwhile, a common approach in adversarial attacks for finding adversarial examples, for example universal adversarial attacks [35, 41], is to iteratively search the whole dataset for similar shortcuts to use as universal adversarial examples. Based on this observation, the shortcut created by a backdoor can act as a trapdoor that captures the adversarial attacker's optimization process, detecting and/or recovering from the adversarial attack [42]. The trapdoor implementation uses techniques similar to those found in the BadNets backdoor attack [15]. The authors define the trapdoor perturbation (the trigger) along multiple dimensions, e.g., mask ratio, size, pixel intensities, and relative locations.
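A minimal sketch of the trapdoor-based detection idea in [42, 43]: record the "signature" of the trapdoor as the mean intermediate representation of trapdoored inputs, then flag incoming inputs whose features align suspiciously well with that signature, since an adversarial optimization tends to fall into the trapdoor shortcut. The layer choice and the cosine-similarity threshold are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def trapdoor_signature(feature_extractor, trapdoored_inputs):
    """Average feature representation of inputs carrying the trapdoor perturbation."""
    return feature_extractor(trapdoored_inputs).mean(dim=0)

@torch.no_grad()
def looks_adversarial(feature_extractor, x, signature, threshold=0.9):
    """Flag inputs whose features align suspiciously well with the trapdoor signature."""
    feats = feature_extractor(x)
    sim = F.cosine_similarity(feats, signature.unsqueeze(0), dim=1)
    return sim > threshold
```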
References
1. Adi, Y., Baum, C., Cissé, M., Pinkas, B., and Keshet, J. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In (2018), pp. 1615–1631.
2. Bagdasaryan, E., and Shmatikov, V. Blind backdoors in deep learning models. arXiv preprint arXiv:2005.03823 (2020).
3. Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics (2020), pp. 2938–2948.
4. Barni, M., Kallas, K., and Tondi, B. A new backdoor attack in CNNs by training set corruption without label poisoning. CoRR abs/1902.11237 (2019).
5. Barni, M., Kallas, K., and Tondi, B. A new backdoor attack in CNNs by training set corruption without label poisoning. In (2019), IEEE, pp. 101–105.
6. Cao, Q., Shen, L., Xie, W., Parkhi, O. M., and Zisserman, A. VGGFace2: A dataset for recognising faces across pose and age. In (2018), pp. 67–74.
7. Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. Detecting backdoor attacks on deep neural networks by activation clustering. In Workshop on Artificial Intelligence Safety 2019, co-located with the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, January 27, 2019 (2019).
8. Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017).
9. Costales, R., Mao, C., Norwitz, R., Kim, B., and Yang, J. Live trojan attacks on deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020), pp. 796–797.
10. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F. ImageNet: A large-scale hierarchical image database. In (2009), pp. 248–255.
11. Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique 350, 5-6 (2012), 313–318.
12. Dumford, J., and Scheirer, W. J. Backdooring convolutional neural networks via targeted weight perturbations. CoRR abs/1812.03128 (2018).
13. Gao, Y., Xu, C., Wang, D., Chen, S., Ranasinghe, D. C., and Nepal, S. STRIP: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference (2019), pp. 113–125.
14. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (2014), pp. 2672–2680.
15. Gu, T., Dolan-Gavitt, B., and Garg, S. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017).
16. Guo, W., Wang, L., Xing, X., Du, M., and Song, D. TABOR: A highly accurate approach to inspecting and restoring Trojan backdoors in AI systems. arXiv preprint arXiv:1908.01763 (2019).
17. Heffner, C. Binwalk: Firmware analysis tool. URL: https://code.google.com/p/binwalk/ (visited on 03/03/2013) (2010).
18. Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., and Igel, C. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In The 2013 International Joint Conference on Neural Networks, IJCNN 2013, Dallas, TX, USA, August 4-9, 2013 (2013), pp. 1–8.
19. Kolouri, S., Saha, A., Pirsiavash, H., and Hoffmann, H. Universal litmus patterns: Revealing backdoor attacks in CNNs. CoRR abs/1906.10842 (2019).
20. Kong, Y., and Zhang, J. Adversarial audio: A new information hiding method and backdoor for DNN-based speech recognition models. CoRR abs/1904.03829 (2019).
21. Kurita, K., Michel, P., and Neubig, G. Weight poisoning attacks on pretrained models. In Annual Conference of the Association for Computational Linguistics (ACL) (July 2020).
22. Langner, R. Stuxnet: Dissecting a cyberwarfare weapon. IEEE Security & Privacy 9, 3 (2011), 49–51.
23. Li, H., Willson, E., Zheng, H., and Zhao, B. Y. Persistent and unforgeable watermarks for deep neural networks. arXiv preprint arXiv:1910.01226 (2019).
24. Li, S., Xue, M., Zhao, B. Z. H., Zhu, H., and Zhang, X. Invisible backdoor attacks on deep neural networks via steganography and regularization. arXiv preprint arXiv:1909.02742 (2020).
25. Liao, C., Zhong, H., Squicciarini, A., Zhu, S., and Miller, D. Backdoor embedding in convolutional neural network models via invisible perturbation. arXiv preprint arXiv:1808.10307 (2018).
26. Ligh, M. H., Case, A., Levy, J., and Walters, A. The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. John Wiley & Sons, 2014.
27. Liu, K., Dolan-Gavitt, B., and Garg, S. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses (2018), Springer, pp. 273–294.
28. Liu, Y., Lee, W.-C., Tao, G., Ma, S., Aafer, Y., and Zhang, X. ABS: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019), pp. 1265–1282.
29. Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. Trojaning attack on neural networks. The Network and Distributed System Security Symposium (NDSS) (2017).
30. Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. Trojaning attack on neural networks. In Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018 (2018), The Internet Society.
31. Liu, Y., Mondal, A., Chakraborty, A., Zuzak, M., Jacobsen, N., Xing, D., and Srivastava, A. A survey on neural trojans. IACR Cryptol. ePrint Arch. 2020 (2020), 201.
32. Lovisotto, G., Eberz, S., and Martinovic, I. Biometric backdoors: A poisoning attack against unsupervised template updating. CoRR abs/1905.09162 (2019).
33. Mirza, M., and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
34. Møgelmose, A., Liu, D., and Trivedi, M. M. Traffic sign detection for US roads: Remaining challenges and a case for tracking. In (2014).
35. Moosavi-Dezfooli, S., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In (2017), pp. 86–94.
36. Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1765–1773.
37. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 (2014).
38. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
39. Saha, A., Subramanya, A., and Pirsiavash, H. Hidden trigger backdoor attacks. arXiv preprint arXiv:1910.00033 (2019).
40. Salem, A., Wen, R., Backes, M., Ma, S., and Zhang, Y. Dynamic backdoor attacks against machine learning models. arXiv preprint arXiv:2003.03675 (2020).
41. Shafahi, A., Najibi, M., Xu, Z., Dickerson, J., Davis, L. S., and Goldstein, T. Universal adversarial training. arXiv preprint arXiv:1811.11304 (2018).
42. Shan, S., Wenger, E., Wang, B., Li, B., Zheng, H., and Zhao, B. Y. Gotta catch 'em all: Using honeypots to catch adversarial attacks on neural networks. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS) (2020).
43. Shan, S., Willson, E., Wang, B., Li, B., Zheng, H., and Zhao, B. Y. Gotta catch 'em all: Using concealed trapdoors to detect adversarial attacks on neural networks. arXiv preprint arXiv:1904.08554 (2019).
44. Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (2017), IEEE, pp. 3–18.
45. Tan, T. J. L., and Shokri, R. Bypassing backdoor detection algorithms in deep learning. CoRR abs/1905.13409 (2019).
46. Tang, D., Wang, X., Tang, H., and Zhang, K. Demon in the variant: Statistical analysis of DNNs for robust backdoor contamination detection. CoRR abs/1908.00686 (2019).
47. Tran, B., Li, J., and Madry, A. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems (2018), pp. 8000–8010.
48. Turner, A., Tsipras, D., and Madry, A. Clean-label backdoor attacks.
49. Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., and Zhao, B. Y. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In (2019), IEEE, pp. 707–723.
50. Wolf, L., Hassner, T., and Maoz, I. Face recognition in unconstrained videos with matched background similarity. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011 (2011), pp. 529–534.
51. Xiang, Z., Miller, D. J., and Kesidis, G. Revealing backdoors, post-training, in DNN classifiers via novel inference on optimized perturbations inducing group misclassification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020).
52. Xie, C., Huang, K., Chen, P.-Y., and Li, B. DBA: Distributed backdoor attacks against federated learning. In International Conference on Learning Representations (2019).
53. Yang, Z., Iyer, N., Reimann, J., and Virani, N. Design of intentional backdoors in sequential models. CoRR abs/1902.09972 (2019).
54. Yao, Y., Li, H., Zheng, H., and Zhao, B. Y. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019), pp. 2041–2055.
55. Yao, Y., Li, H., Zheng, H., and Zhao, B. Y. Regula sub-rosa: Latent backdoor attacks on deep neural networks. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, CCS 2019, London, UK, November 11-15, 2019 (2019).
56. Zhao, B., and Lao, Y. Resilience of pruned neural network against poisoning attack. In 2018 13th International Conference on Malicious and Unwanted Software (MALWARE).