WaNet -- Imperceptible Warping-based Backdoor Attack
Published as a conference paper at ICLR 2021
Anh Tuan Nguyen, Anh Tuan Tran
VinAI Research, Hanoi University of Science and Technology, VinUniversity
{v.anhnt479, v.anhtt152}@vinai.io

Abstract
With the thriving of deep learning and the widespread practice of using pre-trained networks, backdoor attacks have become an increasing security threat drawing many research interests in recent years. A third-party model can be poisoned in training to work well in normal conditions but behave maliciously when a trigger pattern appears. However, the existing backdoor attacks are all built on noise perturbation triggers, making them noticeable to humans. In this paper, we instead propose using warping-based triggers. The proposed backdoor outperforms the previous methods in a human inspection test by a wide margin, proving its stealthiness. To make such models undetectable by machine defenders, we propose a novel training mode, called the "noise" mode. The trained networks successfully attack and bypass the state-of-the-art defense methods on standard classification datasets, including MNIST, CIFAR-10, GTSRB, and CelebA. Behavior analyses show that our backdoors are transparent to network inspection, further proving this novel attack mechanism's efficiency. Our code is publicly available at https://github.com/VinAIResearch/Warping-based_Backdoor_Attack-release.

1 Introduction
Deep learning models are essential in many modern systems due to their superior performance compared to classical methods. Most state-of-the-art models, however, require expensive hardware, huge training data, and long training time. Hence, instead of training the models from scratch, it is a common practice to use pre-trained networks provided by third parties these days. This poses a serious security threat of backdoor attack (Gu et al., 2017). A backdoor model is a network poisoned either at training or finetuning. It can work as a genuine model in the normal condition. However, when a specific trigger appears in the input, the model will act maliciously, as designed by the attacker. Backdoor attacks can occur in various tasks, including image recognition (Chen et al., 2017), speech recognition (Liu et al., 2018b), natural language processing (Dai et al., 2019), and reinforcement learning (Hamon et al., 2020). In this paper, we will focus on image classification, the most popular attacking target with possible fatal consequences (e.g., for self-driving cars).

Since introduced, backdoor attacks have drawn a lot of research interest (Chen et al., 2017; Liu et al., 2018b; Salem et al., 2020; Nguyen & Tran, 2020). In most of these works, trigger patterns are based on patch perturbation or image blending. Recent papers have proposed novel patterns such as sinusoidal strips (Barni et al., 2019) and reflectance (Liu et al., 2020). These backdoor triggers, however, are unnatural and can be easily spotted by humans. We believe that the added content, such as noise, strips, or reflectance, makes the backdoor samples generated by the previous methods strikingly detectable. Instead, we propose to use image warping, which can deform but preserve image content. We also found that humans are not good at recognizing subtle image warping, while machines are excellent at this task.

Hence, in this paper, we design a novel, simple, but effective backdoor attack based on image warping, called WaNet. We use a small and smooth warping field in generating backdoor images, making the modification unnoticeable, as illustrated in Fig. 1. Our backdoor images are natural and hard to distinguish from the genuine examples, as confirmed by our user study described in Sec. 4.3.

Figure 1:
Comparison between backdoor examples generated by our method and by the previous backdoor attacks.
Given the original image (leftmost), we generate the corresponding backdoor images using patch-based attacks (Gu et al., 2017; Liu et al., 2018b), a blending-based attack (Chen et al., 2017), SIG (Barni et al., 2019), ReFool (Liu et al., 2020), and our method. For each method, we show the image (top) and the magnified residual map (bottom). The images generated by the previous attacks are unnatural and can be detected by humans. In contrast, ours is almost identical to the original image, and the difference is unnoticeable.

To obtain a backdoor model, we first follow the common training procedure of poisoning a part of the training data with a fixed ratio ρ_a ∈ (0, 1). While the trained networks provide high clean and attack accuracy, we found that they "cheated" by learning pixel-wise artifacts instead of the warping itself. This makes them easy to be caught by a popular backdoor defense, Neural Cleanse. Instead, we add another mode in training, called the "noise mode", to enforce the models to learn only the predefined backdoor warp. This novel training scheme produces satisfactory models that are both effective and stealthy.

Our attack method achieves invisibility without sacrificing accuracy. It performs similarly to state-of-the-art backdoor methods in terms of clean and attack accuracy, verified on common benchmarks such as MNIST, CIFAR-10, GTSRB, and CelebA. Our attack is also undetectable by various backdoor defense mechanisms; none of the existing algorithms can recognize or mitigate our backdoor. This is because the attack mechanism of our method is drastically different from any existing attack, breaking the assumptions of all defense methods.

Finally, we demonstrate that our novel backdoor can be a practical threat by deploying it for physical attacks. We tested the backdoor classifier with camera-captured images of physical screens. Despite image quality degradation via extreme capturing conditions, our backdoor is well-preserved, and the attack accuracy stays near 100%.

In short, we introduce a novel backdoor attack via image warping. To train such a model, we extend the standard backdoor training scheme by introducing a "noise" training mode. The attack is effective, and the backdoor is imperceptible by both humans and computational defense mechanisms. It can be deployed for physical attacks, creating a practical threat to deep-learning-based systems.

2 Background
2.1 Threat Model
Backdoor attacks are techniques of poisoning a system to have a hidden destructive functionality. The poisoned system can work genuinely on clean inputs but misbehave when a specific trigger pattern appears. In the attack mode for image classification, backdoor models can return a predefined target label, normally incorrect, regardless of the image content. It allows the attacker to gain illegal benefits. For example, a backdoor face authentication system may grant the attacker access whenever he puts a specific sticker on his face.

Backdoors can be injected into a deep model at any stage. We consider model poisoning at training since it is the most used threat model. The attacker has total control over the training process and maliciously alters data for his attack purposes. The poisoned model is then delivered to customers.
2.2 Previous Backdoor Attacks
We focus on backdoor attacks on image classification. The target network is trained for a classification task f : X → C, where X is an image domain and C = {c_1, c_2, ..., c_M} is a set of M target classes. When poisoning f, we enforce it to learn an injection function B, a target label function c, and alter the network behaviour so that:

f(x) = y,    f(B(x)) = c(y),    (1)

for any pair of clean image x ∈ X and the corresponding label y ∈ C.

The earliest backdoor attack was BadNets (Gu et al., 2017). The authors suggested poisoning a portion of the training data by replacing each clean data pair (x, y) with the corresponding poisoned pair (B(x), c(y)). The injection function B simply replaces a fixed patch of the input image with a predefined trigger pattern. As for the target label function c(y), the authors proposed two tests: (1) all-to-one, with a constant target label c(y) = ĉ, and (2) all-to-all, with c(y) = y + 1.

After BadNets, many variants of backdoor attacks have been introduced. These approaches focus on changing either the backdoor injection process or the injection function B.

As for the backdoor injection process, Liu et al. (2018b) proposed to inject the backdoor into clean models via fine-tuning instead of the training stage. Yao et al. (2019) suggested hiding the backdoor inside latent neurons for transfer learning. Many recent studies (Turner et al., 2019; Barni et al., 2019; Liu et al., 2020) injected the backdoor only on samples with unchanged labels, i.e., the target c(y) is the same as the ground-truth label y, to dodge label inspection by humans.

In this paper, we focus on the development of a good injection function B. Most of the popular attack methods rely on fixed patch-based triggers. Chen et al. (2017) used image blending to embed the trigger into the input image, and Nguyen & Tran (2020) extended it to be input-aware. Salem et al. (2020) varied the patch-based trigger locations and patterns to make them "dynamic". Barni et al. (2019) employed sinusoidal strips as the trigger alongside the clean-label strategy. Lately, Liu et al. (2020) proposed to disguise backdoor triggers as reflectance to make the poisoned images look natural. The backdoor images generated by these attacks, however, are easy for humans to spot. We instead propose an "invisible" backdoor that is imperceptible by even sharp-eyed people.
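To make Eq. (1) concrete, here is a minimal sketch of a BadNets-style injection function B and the two target-label functions c. The patch size, position, and value are illustrative assumptions of this sketch, not the exact settings of Gu et al. (2017).

```python
import torch

def inject_patch(x: torch.Tensor, patch_size: int = 3) -> torch.Tensor:
    """BadNets-style B(x): paste a fixed white patch onto the bottom-right corner.

    x: image tensor of shape (C, H, W), values in [0, 1].
    The patch size, position, and value are illustrative assumptions.
    """
    x = x.clone()
    x[:, -patch_size:, -patch_size:] = 1.0  # fixed trigger pattern
    return x

def target_all_to_one(y: int, target: int = 0) -> int:
    """c(y) = c_hat for the all-to-one attack."""
    return target

def target_all_to_all(y: int, num_classes: int) -> int:
    """c(y) = y + 1 (mod number of classes) for the all-to-all attack."""
    return (y + 1) % num_classes
```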
2.3 Backdoor Defense Methods

As the threat of backdoor attacks becomes more apparent, backdoor defense research is emerging. Based on usage scenarios, we can classify defenses into three groups: training defense, model defense, and testing-time defense.

Training defenses assume the defender has control over the training process, and the adversary attacks by providing infected training data (Tran et al., 2018). This assumption, however, does not match our threat model, where the already-trained backdoor model is provided by a third party. This mechanism is not applicable to our situation and will not be considered further in this paper.

Model defenses aim to verify or mitigate the provided model before deployment. Fine-Pruning (Liu et al., 2018a) suggested pruning the dormant neurons, defined by analyses on a clean image set, to mitigate the backdoor if present. Neural Cleanse (Wang et al., 2019) was the first work that could detect backdoor models. It optimized a patch-based trigger candidate for each target label, then detected if any candidate was abnormally smaller than the others as a backdoor indicator. ABS (Liu et al., 2019) scanned the neurons and generated trigger candidates by reverse engineering. Cheng et al. (2019) used GradCam (Selvaraju et al., 2017) to analyze the network behavior on a clean input image with and without the synthesized trigger to detect anomalies. Zhao et al. (2019) applied mode connectivity to effectively mitigate the backdoor while keeping acceptable performance. Lately, Kolouri et al. (2020) introduced universal litmus patterns that can be fed to the network to detect backdoors.

Unlike model defenses, testing-time defenses inspect models after deployment with the presence of input images. They focus on verifying if the provided image is poisoned and how to mitigate it. STRIP (Gao et al., 2019) exploited the persistent outcome of the backdoor image under perturbations for detection. In contrast, Neo (Udeshi et al., 2019) searched for candidate trigger patches where region blocking changed the predicted outputs. Recently, Doan et al. (2019) used GradCam inspection to detect potential backdoor locations. In all these methods, the trigger candidates were then verified by being injected into a set of clean images.

A common assumption in all previous defense methods is that the backdoor triggers are image patches. We instead propose a novel attack mechanism based on image warping, undermining the foundation of these methods.
2.4 Elastic Image Warping
Image warping is a basic image processing technique that deforms an image by applying a geometric transformation. The transformation can be affine, projective, elastic, or non-elastic. In this work, we propose to use elastic image warping given its advantages over the others: (1) Affine and projective transformations are naturally introduced to clean images via the image capturing process. If we apply these transformations to clean images, the transformed images can be identical to other clean images of the same scenes captured at different viewpoints. Hence, these transformations are not suitable for generating backdoor examples, particularly in physical attacks. (2) Elastic transformation still generates natural outputs, while non-elastic ones do not.

The most popular elastic warping technique is Thin-Plate Splines (TPS) (Duchon, 1977). TPS can interpolate a smooth warping field to transform the entire image, given a set of control points with known original and target 2D coordinates. TPS was adopted in Spatial Transformer Networks (Jaderberg et al., 2015), the first deep learning study incorporating differentiable image warping.

We believe that elastic image warping can be utilized to generate invisible backdoor triggers. Unlike previous attack methods that introduce extra and independent information to an input image, elastic image warping only manipulates existing pixels of the image. Humans, while being excellent at spotting incongruent parts of an image, are bad at recognizing small geometric transformations.
3 Warping-based Backdoor Attack
We now describe our novel backdoor attack method WaNet, which stands for Warping-based poisoned Networks. WaNet is designed to be stealthy to both machine and human inspections.
3.1 Overview
Recall that a classification network is a function f : X → C, in which X is an input image domain and C is a set of target classes. To train f, a training dataset S = {(x_i, y_i) | x_i ∈ X, y_i ∈ C, i = 1, ..., N} is provided. We follow the training scheme of BadNets to poison a subset of S with ratio ρ_a for backdoor training. Each clean pair (x, y) will be replaced by a backdoor pair (B(x), c(y)), in which B is the backdoor injection function and c(y) is the target label function.

Our main focus is to redesign the injection function B based on image warping. We construct B using a warping function W and a predefined warping field M:

B(x) = W(x, M).    (2)

M acts like a motion field; it defines the relative sampling location of backward warping for each point in the target image. W allows a floating-point warping field as input. When a sampling pixel falls on non-integer 2D coordinates, it is bilinearly interpolated. To implement W, we rely on the public API grid_sample provided by PyTorch. However, this API takes as input a grid of normalized absolute 2D coordinates of the sampling points. To use that API, we first sum M with an identity sampling grid, then normalize to [−1, 1] to get the required grid input.
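Below is a minimal sketch of W(x, M) built on PyTorch's grid_sample, following the description above (identity grid plus M, then normalized coordinates). The coordinate convention of M (normalized vs. pixel units) and the align_corners choice are assumptions of this sketch rather than details fixed by the paper.

```python
import torch
import torch.nn.functional as F

def warp(x: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Apply the backdoor warp B(x) = W(x, M) via bilinear backward warping.

    x: batch of images, shape (N, C, H, W).
    M: relative warping field, shape (H, W, 2), assumed in normalized units.
    """
    n, _, h, w = x.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, ordered (x, y).
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    identity = torch.stack((xs, ys), dim=-1)              # (H, W, 2)
    grid = (identity + M).clamp(-1, 1)                    # keep samples inside the image
    grid = grid.unsqueeze(0).expand(n, h, w, 2)
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```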
3.2 Warping Field Generation

The warping field M is a crucial component; it must guarantee that the warped images are both natural and effective for attacking purposes. Hence, M is desired to satisfy the following properties:

• Small: M should be small, to be unnoticeable to humans,
• Elastic: M should be elastic, i.e., smooth and non-flat, to generate natural-looking images,
• Within image boundary: M should not exceed the image boundary, to avoid creating a suspicious black/plain outer area.

Figure 2: Process of creating the warping field M and using it to generate poisoned images.

Figure 3: Effect of different hyper-parameters on the warping result: (a) changing control-grid size k (s = 0.5), (b) changing warping strength s (k = 4). For each warped image, we show the image (top) and the magnified (×2) residual map (bottom). The PSNR (↑) and LPIPS (↓) (Zhang et al., 2018) scores are computed at resolution 224 × 224; in (a), PSNR/LPIPS are 20.71/0.0393 (k = 2), 22.73/0.0352 (k = 4), 23.93/0.0328 (k = 6), and 22.03/0.0480 (k = 8).

To get such a warping field, we borrow the idea of using control points from TPS but simplify the interpolation method. The process of generating the desired warp is illustrated in Fig. 2 and described in the following subsections.

Selecting the control grid
We first select the control points. For simplicity, we pick the target points on a uniform grid of size k × k over the entire image. Their backward warping field is denoted as P ∈ R^{k×k×2}. We use a parameter s to define the strength of P and generate P as follows:

P = ψ(rand_{[−1,1]}(k, k, 2)) × s,    (3)

in which rand_{[−1,1]}(...) is a function returning a random tensor with the input shape and element values in the range [−1, 1], and ψ is a normalization function. In this paper, we normalize the tensor elements by their mean absolute value:

ψ(A) = A · size(A) / Σ_{a_i ∈ A} |a_i|.    (4)
Upsampling

From the control points, we interpolate the warping field of the entire image. Since these points are on a uniform grid covering the entire image, instead of using a complex spline-based interpolation like in TPS, we can simply apply bicubic interpolation. We denote the output of this step as M = ↑P ∈ R^{h×w×2}, with h and w being the image height and width, respectively.

Clipping
Finally, we apply a clipping function φ so that the sampling points do not fall outside of the image border. The process of generating M can be summarized by the equation:

M = φ(↑(ψ(rand_{[−1,1]}(k, k, 2)) × s)).    (5)

We investigate the effect of the hyper-parameters k and s qualitatively in Fig. 3. The warping effect is almost invisible when k and s are small.
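Putting Eqs. (3)–(5) together, here is a sketch of the warping-field generation. The bicubic call and the final clamp stand in for ↑ and φ; the exact clipping used by WaNet is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def gen_warping_field(k: int, h: int, w: int, s: float) -> torch.Tensor:
    """M = phi( upsample( psi(rand_[-1,1](k, k, 2)) * s ) ), Eq. (5).

    Returns a relative warping field of shape (h, w, 2), compatible with warp().
    """
    # Eq. (3): random control-point offsets in [-1, 1], stored as (1, 2, k, k).
    P = torch.rand(1, 2, k, k) * 2 - 1
    # Eq. (4): psi normalizes by the mean absolute value, then scale by strength s.
    P = P / P.abs().mean() * s
    # Upsample the k x k control grid to the full image with bicubic interpolation.
    M = F.interpolate(P, size=(h, w), mode="bicubic", align_corners=True)
    M = M.squeeze(0).permute(1, 2, 0)                     # (h, w, 2)
    # phi: keep sampling points inside the image border (assumed simple clamp).
    return M.clamp(-1.0, 1.0)
```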
3.3 Running Modes

Figure 4: Training pipeline with three running modes.

Figure 5: Attack experiments: (a) network performance, (b) sample backdoor images, (c) physical attack test. In (b), we provide the clean (top) and backdoor (bottom) images.

(a) Network performance:
Dataset     Clean    Attack    Noise
MNIST       99.52    99.86     98.20
CIFAR-10    94.15    99.55     93.55
GTSRB       98.97    98.78     98.01
CelebA      78.99    99.33     76.74

After computing the warping field M, we can train WaNet with two modes, clean and attack, as in the standard protocol. However, the models trained by that algorithm, while still achieving high accuracy in both clean and attack tests, tend to learn pixel-level artifacts instead of the warping. They are, therefore, easily exposed by a backdoor defense method such as Neural Cleanse. We will discuss more details in the ablation studies in Section 4.6.

To resolve this problem, we propose a novel training mode alongside the clean and attack modes, called the noise mode. The idea is simple: when a random warping field M′ ≠ M is applied, the network should not trigger the backdoor but return the correct class prediction.

Fig. 4 illustrates the three running modes in our training pipeline. We first select the backdoor probability ρ_a ∈ (0, 1) and the noise probability ρ_n ∈ (0, 1) such that ρ_a + ρ_n < 1. Then, for each clean input (x, y), we randomly select one of the three modes and alter that pair accordingly:

(x, y) ↦ (x, y)                                    with probability 1 − ρ_a − ρ_n,
         (W(x, M), c(y))                            with probability ρ_a,
         (W(x, M + rand_{[−1,1]}(h, w, 2)), y)      with probability ρ_n.    (6)

Note that in the noise mode, instead of using a totally random warping field, we form it by adding Gaussian noise to M for more effective training. The modified training set is then used to train f.
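A sketch of how one training pair is formed under Eq. (6), reusing the warp() sketch from Section 3.1. The per-sample random draw and the noise term follow Eq. (6) literally (uniform in [−1, 1]) and may differ from the released implementation.

```python
import random
import torch

def make_training_pair(x, y, M, target_fn, rho_a=0.1, rho_n=0.2):
    """Map a clean pair (x, y) to a clean / attack / noise pair, following Eq. (6).

    x: image (C, H, W); y: integer label; M: warping field (H, W, 2);
    target_fn: the target-label function c(y).
    """
    h, w = x.shape[1], x.shape[2]
    r = random.random()
    if r < rho_a:                              # attack mode: backdoor warp, target label
        return warp(x.unsqueeze(0), M).squeeze(0), target_fn(y)
    if r < rho_a + rho_n:                      # noise mode: perturbed warp, true label
        noise = torch.rand(h, w, 2) * 2 - 1    # rand_[-1,1](h, w, 2) as written in Eq. (6)
        return warp(x.unsqueeze(0), M + noise).squeeze(0), y
    return x, y                                # clean mode
```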
4 Experiments

4.1 Experimental Setup
Following the previous backdoor attack papers, we performed experiments on four datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al., 2009), GTSRB (Stallkamp et al., 2012), and CelebA (Liu et al., 2015). Note that the CelebA dataset has annotations for 40 independent binary attributes, which is not suitable for multi-class classification. Therefore, we follow the configuration suggested by Salem et al. (2020) to select the top three most balanced attributes, namely Heavy Makeup, Mouth Slightly Open, and Smiling, then concatenate them to create eight classification classes. Their detailed information is shown in Table 1. To build the classifier f for the color image datasets, we used Pre-activation ResNet-18 (He et al., 2016) for the CIFAR-10 and GTSRB datasets, as suggested by Kang (2020), and ResNet-18 for the CelebA dataset. As for the grayscale dataset MNIST, we defined a simple network structure, as reported in Table 1.

Table 1: Datasets and the classifiers used in our experiments.

We trained the networks using the SGD optimizer. The initial learning rate was 0.01, which was reduced by a factor of 10 after every 100 training epochs. The networks were trained until convergence. We used k = 4, s = 0.5, ρ_a = 0.1, and ρ_n = 0.2.
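As a small illustration of the CelebA setup above, the three selected binary attributes can be packed into a single class index in {0, ..., 7}. The attribute ordering below is an arbitrary assumption of this sketch.

```python
def celeba_label(heavy_makeup: int, mouth_slightly_open: int, smiling: int) -> int:
    """Concatenate three binary attributes into one of 2**3 = 8 classes."""
    return heavy_makeup * 4 + mouth_slightly_open * 2 + smiling
```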
Figure 6: Human inspection tests: (a) success fooling rates of each backdoor method, (b) the most distinguishable cases from WaNet.

4.2 Attack Experiments
We trained and tested the backdoor models in the all-to-one configuration, i.e., c(y) = ĉ ∀y. The accuracy values in clean mode, attack mode, and noise mode are reported in Fig. 5a. As can be seen, with clean images, the networks could correctly classify them like any benign model, with accuracy near 100% on MNIST/GTSRB, 94.15% on CIFAR-10, and 79.77% on CelebA. When applying the pre-defined image warping, the attack success rate was near 100% on all datasets. However, when using a random warping, the classifiers still recognized the true image class with a similar accuracy as in the clean mode. This result is impressive given the fact that the poisoned images look almost identical to the originals, as can be seen in Fig. 5b.

To evaluate our method's robustness in real-life scenarios, we also tested if backdoor images would still be misclassified even when distorted by the capturing process. We showed 50 clean and 50 backdoor images on a screen and recaptured them using a phone camera. Our model still worked well on recaptured images, obtaining 98% clean accuracy and a 96% attack success rate. Fig. 5c displays an example of our test. The clean image was recognized correctly as "automobile", while the look-alike backdoor image was recognized as the "airplane" attack class.

4.3 Human Inspection
To examine the realism of our backdoor and the previous methods, we created user studies with human inspection. First, we randomly selected 25 images from the GTSRB dataset. Second, for each backdoor injection function, we created the corresponding 25 backdoor images and mixed them with the originals to obtain a set of 50 images. Finally, we asked 40 people to classify whether each image was genuine, collecting 2000 answers per method. The participants were trained about the mechanism and characteristics of the attack before answering the questions.

We collected the answers and report the percentage of incorrect answers as the success fooling rates in Fig. 6a. Note that the more indistinguishable the backdoor examples are from the clean ones, the harder the testers will find it to decide whether an image is clean or poisoned. Hence, better backdoor methods lead to higher fooling rates not only on backdoor inputs but also on clean ones. The rates of previous methods are low, with a maximum of 7.7% on all inputs, implying that they are easy for humans to detect. In contrast, our rate is 28%, four times their best number. It confirms that WaNet is stealthy and hard to detect, even with trained people.

Although our backdoor images are natural-looking, some of them have subtle properties that can be detected by trained testers. We provide two of the most detected backdoor examples from WaNet in Fig. 6b. In the first case, the circle sign is not entirely round. In the second case, the right edge of the traffic sign is slightly curved. Although these conditions can be found on real-life traffic signs, they are not common in the testing dataset GTSRB. These images are in the minority, and our fooling rate on backdoor images is 38.6%, not far from the rate of 50% of random selection.
Figure 7: Experiments on verifying WaNet by state-of-the-art defense and visualization methods: (a) Fine-Pruning (accuracy vs. filters pruned), (b) STRIP (entropy distributions), (c) Neural Cleanse (anomaly index), (d) GradCam.
4.4 Defense Experiments
We now test the trained models against popular backdoor defense mechanisms, including Neural Cleanse and Fine-Pruning (model defenses), and STRIP (a testing-time defense).
Neural Cleanse (Wang et al., 2019) is a model-defense method based on the pattern optimization approach. It assumes that the backdoor is patch-based. For each class label, Neural Cleanse computes the optimal patch pattern to convert any clean input to that target label. It then checks if any label has a significantly smaller pattern as a sign of a backdoor. Neural Cleanse quantifies this with the Anomaly Index metric, using the clean/backdoor threshold τ = 2; a schematic sketch of this criterion is given after the next paragraph. We ran Neural Cleanse over our WaNet models and report the numbers in Fig. 7c. WaNet passed the test on all datasets; its scores are even smaller than those of the clean models on MNIST and CIFAR-10. We can explain this by the fact that our backdoor relies on warping, a different mechanism compared with patch-based blending.

Fine-Pruning (Liu et al., 2018a), instead, focuses on neuron analysis. Given a specific layer, it analyzes the neuron responses on a set of clean images and detects the dormant neurons, assuming they are more likely to be tied to the backdoor. These neurons are then gradually pruned to mitigate the backdoor. We tested Fine-Pruning on our models and plot the network accuracy, either clean or attack, with respect to the number of neurons pruned in Fig. 7a. On all datasets, at no point is the clean accuracy considerably higher than the attack one, making backdoor mitigation impossible.
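For reference, a schematic sketch of the Neural Cleanse criterion referred to above: a patch trigger (mask, pattern) is reverse-engineered per target class, and the anomaly index is a median-absolute-deviation score over the mask L1 norms, flagged when it exceeds τ = 2. The optimizer settings, λ, and step count below are illustrative assumptions, not the settings of Wang et al. (2019).

```python
import torch

def reverse_engineer_trigger(model, clean_batch, target, steps=300, lam=1e-2, lr=0.1):
    """Optimize a patch trigger (mask, pattern) that flips clean inputs to `target`.

    clean_batch: a tensor of clean images (N, C, H, W), reused at every step.
    Returns the L1 norm of the optimized mask.
    """
    _, c, h, w = clean_batch.shape
    mask = torch.zeros(1, h, w, requires_grad=True)
    pattern = torch.zeros(c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    y_t = torch.full((clean_batch.size(0),), target, dtype=torch.long)
    for _ in range(steps):
        m = torch.sigmoid(mask)                               # keep the mask in [0, 1]
        x_adv = (1 - m) * clean_batch + m * torch.sigmoid(pattern)
        loss = ce(model(x_adv), y_t) + lam * m.abs().sum()    # small-mask prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask).abs().sum().item()

def anomaly_index(trigger_norms):
    """MAD-based anomaly index; a model is flagged as backdoored if the index > 2."""
    norms = torch.tensor(trigger_norms)
    med = norms.median()
    mad = (norms - med).abs().median() * 1.4826               # consistency constant
    return ((med - norms) / mad).max().item()
```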
STRIP (Gao et al., 2019) is a representative of the testing-time defense approach. It examines the model with the presence of the input image. STRIP works by perturbing the input image through a set of clean images from different classes and raising an alarm if the prediction is persistent, indicated by low entropy. With WaNet, the perturbation operation of STRIP modifies the image content and breaks the backdoor warping if present. Hence, WaNet behaves like genuine models, with similar entropy ranges, as shown in Fig. 7b.
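A sketch of the STRIP test described above; the blend weight and the number of perturbed copies are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def strip_entropy(model, x, clean_pool, n=32, alpha=0.5):
    """Average prediction entropy of `x` blended with random clean images.

    x: image (C, H, W); clean_pool: tensor of clean images (M, C, H, W).
    A persistently low entropy indicates a possible backdoor trigger.
    """
    idx = torch.randint(0, clean_pool.size(0), (n,))
    blended = alpha * x.unsqueeze(0) + (1 - alpha) * clean_pool[idx]
    with torch.no_grad():
        probs = F.softmax(model(blended), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.mean().item()
```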
4.5 Network Inspection
Visualization tools, such as GradCam (Selvaraju et al., 2017), are helpful in inspecting network behaviors. Patch-based backdoor methods can be exposed easily due to the use of small trigger regions, as pointed out by Cheng et al. (2019); Doan et al. (2019). Our attack method is based on warping the entire image, so it is undetectable by this algorithm. We visualize the activation based on the label that has the highest prediction score in Fig. 7d. With clean models, that label is the correct class label. With WaNet and backdoor inputs, it is the backdoor label ĉ. As can be seen, the visualization heatmaps of WaNet look like the ones from any clean model.

Figure 8: Ablation studies on the CIFAR-10 dataset: (a) trigger patterns optimized by Neural Cleanse (small is bad), (b) model performance when changing s, (c) model performance when changing k.

4.6 Ablation Studies
Role of the noise mode
Without the noise mode, we could still train a backdoor model with similar clean and attack accuracy. However, these models failed the defense test with Neural Cleanse, as shown in Fig. 9, and the optimized trigger patterns revealed their true behavior.
Figure 9: Networks' performance against Neural Cleanse with and without the noise mode.

Fig. 8a displays the trigger patterns optimized by Neural Cleanse for the attacking class "airplane" on CIFAR-10. With the clean model, this pattern has an airplane-like shape, and it is big enough to rewrite the image content given any input. With our model trained without the noise mode, the optimized pattern just consists of scattered points. This pattern is remarkably smaller, making the model caught by Neural Cleanse. It reveals that the model did not learn the specific backdoor warping; instead, it remembered the pixel-wise artifacts. By adding the noise training mode, our model no longer relies on those artifacts, and the optimized pattern looks similar to the clean model's one.
Other hyper-parameters
We investigated the effect of the warping hyper-parameters, including the strength s and the grid size k. Fig. 8b and 8c show the clean, attack, and noise mode accuracy of our network on the CIFAR-10 dataset when changing each of these parameters. When k or s is small, the backdoor images are similar to the clean ones. However, since they are a minority (ρ_a = 0.1), the network would treat them like data with noisy labels in those scenarios. Hence, clean and noise accuracies are stable across configurations. In contrast, backdoor accuracy suffers on the left side of the plots. It gradually increases with s or k, then saturates and stays near 100%.

5 Conclusion and Future Works
This paper introduces a novel backdoor attack method that generates backdoor images via subtle image warping. The backdoor images are shown to be natural and undetectable by humans. We incorporate in training a novel "noise" mode, making the attack stealthy enough to pass all the known defense methods. It opens a new domain of attack mechanisms and encourages future defense research.

References
Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in CNNs by training set corruption without label poisoning. pp. 101–105. IEEE, 2019.

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

Hao Cheng, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Pu Zhao, and Xue Lin. Defending against backdoor attack on deep neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining Workshop, 2019.

Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. A backdoor attack against LSTM-based text classification systems. IEEE Access, 7:138872–138878, 2019.

Bao Gia Doan, Ehsan Abbasnejad, and Damith C. Ranasinghe. Februus: Input purification defense against trojan attacks on deep neural network systems. arXiv, Aug 2019. URL https://arxiv.org/abs/1908.03369.

Jean Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, pp. 85–100. Springer, 1977.

Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. STRIP: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, pp. 113–125, 2019.

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. In Proceedings of the Machine Learning and Computer Security Workshop, 2017.

Ronan Hamon, Henrik Junklewitz, and Ignacio Sanchez. Robustness and explainability of artificial intelligence. Publications Office of the European Union, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.

Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.

Liu Kang. pytorch-cifar, May 2020. URL https://github.com/kuangliu/pytorch-cifar. [Online; accessed 4 Jun. 2020].

Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus patterns: Revealing backdoor attacks in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 301–310, 2020.

Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In Proceedings of the International Symposium on Research in Attacks, Intrusions, and Defenses, 2018a.

Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Proceedings of the Network and Distributed System Security Symposium, 2018b.

Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. ABS: Scanning neural networks for back-doors by artificial brain stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1265–1282, 2019.

Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. 2020.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Tuan Anh Nguyen and Anh Tran. Input-aware dynamic backdoor attack. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3454–3464. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/234e691320c0ad5b45ee3c96d0d7b8f8-Paper.pdf.

Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. Dynamic backdoor attacks against machine learning models. arXiv preprint arXiv:2003.03675, 2020.

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.

Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In Proceedings of Advances in Neural Information Processing Systems, 2018.

Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. https://people.csail.mit.edu/madry/lab/, 2019.

Sakshi Udeshi, Shanshan Peng, Gerald Woo, Lionell Loh, Louth Rawshan, and Sudipta Chattopadhyay. Model agnostic defence against backdoor attacks in machine learning. arXiv preprint arXiv:1908.02203, 2019.

Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of the 40th IEEE Symposium on Security and Privacy, 2019.

Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2041–2055, 2019.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. In International Conference on Learning Representations, 2019.
A Appendix
A.1 System Details
A.1.1 Datasets
We used four standard datasets, from simple to more complex ones, to conduct our experiments. As these datasets are all used in previous related works, our results are more comparable and reliable.

MNIST
The dataset (LeCun et al., 1998) is a subset of a larger dataset available from the National Institute of Standards and Technology (NIST). It consists of 70,000 grayscale 28 × 28 images, divided into a training set of 60,000 images and a test set of 10,000 images. The original dataset can be found at http://yann.lecun.com/exdb/mnist/. We applied random cropping and random rotation as data augmentation for the training process. During the evaluation stage, no augmentation is applied.

CIFAR-10
The dataset was first introduced by Krizhevsky et al. (2009). It is a labeled subset of the 80-million-tiny-images dataset, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, and consists of 60,000 color images at a resolution of 32 × 32. The dataset contains 10 classes, with 6,000 images per class. It is divided into two subsets: a training set of 50,000 images and a test set of 10,000 images. The dataset is publicly available. During the training stage, random crop, random rotation, and random horizontal flip were applied as data augmentation. No augmentation was added at the evaluation stage.

GTSRB
The German Traffic Sign Recognition Benchmark (GTSRB) (Stallkamp et al., 2012) was used as the official dataset for the challenge held at the International Joint Conference on Neural Networks (IJCNN) 2011. This dataset consists of 60,000 images with 43 classes and resolutions varying per image. It is divided into a training set of 39,209 images and a test set of 12,630 images. The dataset can be found at http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset. Input images were all resized to a fixed resolution, then random crop and random rotation were applied at the training stage. No augmentation was used at the evaluation stage.

CelebA
CelebFaces Attributes Dataset (CelebA), first introduced by Liu et al. (2015), is a large-scale face attributes dataset. It contains 10,177 identities with 202,599 face images. Each image has an annotation of 5 landmark locations and 40 binary attributes. The dataset is publicly available at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html. Note that this dataset is highly unbalanced. Due to time limitations, we selected 3 out of 40 attributes, namely Heavy Makeup, Mouth Slightly Open, and Smiling, as suggested by Salem et al. (2020). We then concatenated them into 8 classes to create a multi-class classification task. The input images were all resized to a fixed resolution. Random crop and random rotation were applied as data augmentation at the training stage. No augmentation was applied at the evaluation stage.
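A sketch of the training-time augmentation described above, written with torchvision transforms; the crop padding and rotation range are illustrative values rather than the exact settings of our experiments.

```python
from torchvision import transforms

# Training-time augmentation for CIFAR-10 (random crop, rotation, horizontal flip).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),     # padding value is an assumption
    transforms.RandomRotation(10),            # rotation range is an assumption
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# No augmentation at evaluation time.
test_transform = transforms.ToTensor()
```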
A.1.2 Classification Networks

MNIST
We used a simple, self-defined structure as the network classifier for this dataset. The detailed architecture is given in Table 2.

Table 2: Detailed architecture of the MNIST classifier. * means the layer is followed by a Dropout layer. † means the layer is followed by a BatchNormalization layer.

Layer      Filters    Filter Size    Activation
Conv2d†    32         3 × 3          ReLU
Conv2d†    64         3 × 3          ReLU
Conv2d     64         3 × 3          ReLU
Linear*    -          -              ReLU
Linear     -          -              Softmax

CIFAR-10 and GTSRB
For the CIFAR-10 and GTSRB datasets, we use the PreActResNet-18 (He et al., 2016) architecture as the classification network.

CelebA
For the CelebA dataset, we use the ResNet-18 (He et al., 2016) architecture as the classification network.
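A hedged PyTorch sketch of the Table 2 classifier; the stride, padding, pooling, dropout rate, and hidden width are not recoverable from the extracted table and are assumptions of this sketch.

```python
import torch.nn as nn

# Sketch of the Table 2 classifier. Only the filter counts, kernel sizes, and
# activations come from the table; everything else is an assumption.
mnist_classifier = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 22 * 22, 512), nn.Dropout(0.3), nn.ReLU(),
    nn.Linear(512, 10),  # Softmax is applied via the cross-entropy loss
)
```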
A.1.3 Running Time
We used a system with an RTX 2080Ti GPU and an i7 9700K CPU to conduct our experiments. The detailed inference time of each module is reported in Table 3.

Table 3: Inference time of our modules (time per sample, in µs, for MNIST, CIFAR-10, GTSRB, and CelebA).

A.2 All-to-all Attack
Besides the single-target attack scenario, we also verified the effectiveness of WaNet in the multi-target scenario, often called the all-to-all attack. In this scenario, an input of class y is targeted to class c(y) = (y + 1) mod |C|, where |C| is the number of classes.

A.2.1 Experimental Setup
We use the same experimental setup as in the single-target scenario, with a small modification. In the attack mode at training, we replace the fixed target label ĉ by (y + 1) mod |C|. In the attack test at evaluation, we also change the expected label similarly.
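With the training-pair sketch from Section 3.3, the all-to-all change amounts to swapping the target-label function (a minimal, assumed illustration).

```python
num_classes = 10  # e.g., CIFAR-10

def all_to_all(y: int) -> int:
    """c(y) = (y + 1) mod |C|, used both for poisoning and for the attack test."""
    return (y + 1) % num_classes

# x_p, y_p = make_training_pair(x, y, M, target_fn=all_to_all)
```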
A.2.2 Attack Experiment

We conducted attack experiments and report the results in Table 4. While the models still achieve state-of-the-art performance on clean data, the attack efficacy slightly decreases. This is due to the fact that the target label now varies from input to input. Still, the lowest attack accuracy is 78.58%, which is harmful enough for real-life deployment.

Table 4: All-to-all attack results.
Dataset     Clean    Attack    Noise
MNIST       99.44    95.90     94.34
CIFAR-10    94.43    93.36     91.47
GTSRB       99.39    98.31     98.96
CelebA      78.73    78.58     76.12

Similar to the all-to-one scenario, we also tested our models in the noise mode and recorded the noise accuracy.
A.2.3 Defense Experiments
We repeated the same defense experiments used in the all-to-one scenario. Our backdoor models could also pass all the tests mentioned in Figure 7.
Figure 10: Neural Cleanse against the all-to-all scenario.
Figure 11: Fine-Pruning against the all-to-all scenario.
Figure 12: STRIP against the all-to-all scenario.
A.3 Additional Results
A.3.1 Additional Images for Mentioned Backdoor Attack Methods
We provide additional examples comparing backdoor images from WaNet and from other attack methods in Fig. 13.
A.3.2 Experiment on Spectral Signature Defense
Tran et al. (2018) proposed a data defense method based on the spectral signature of backdoor training data. Although this data-defense configuration does not match our threat model, we find it useful to verify whether our backdoor data have the spectral signature discussed in that paper. We repeated the experiment in the last plot of its Fig. 1, using 5000 clean samples and 1172 backdoor samples generated by WaNet on the CIFAR-10 dataset, which is the same dataset used in the original paper. Fig. 14 plots histograms of the correlations between these samples' learned representations and the top right singular vector of their covariance matrix. As can be seen, the histograms of the two populations are completely inseparable. Thereby, the backdoor training samples could not be removed from the training dataset using their proposed method. One possible explanation is that the distributional difference between the clean and backdoor correlations in the traditional backdoor methods was the result of the domination of a few backdoor neurons. We do not have such a phenomenon in WaNet, as shown in the Fine-Pruning experiments, eliminating the appearance of the spectral signature.

Figure 13: Additional images for the mentioned backdoor attack methods (original image, Patched, Blended, SIG, ReFool, and Warped (Ours)).

A.3.3 The Stability of WaNet

In this section, we verify whether WaNet is stable to variations of the warping field M. We trained 8 WaNet backdoor models, using 8 randomly generated warping fields, on the CIFAR-10 dataset. The clean, backdoor, and noise accuracies of the trained models are all stable, as shown in Table 5.

Table 5: The stability of WaNet on the CIFAR-10 dataset (accuracy %, mean ± std over the 8 models).
              Clean       Backdoor      Noise
Accuracy (%)  – ± 0.08    99.– ± 0.21   93.– ± –

Figure 14: Spectral signature histograms (number of samples vs. representation level) for the clean and backdoor populations.
A.3.4 Additional Trigger Patterns Visualizing the Role of the Noise Mode
This section further demonstrates the importance of the noise mode by providing trigger patterns optimized by Neural Cleanse on more datasets and with more target classes. Fig. 15a and 15b visualize the patterns on the MNIST and GTSRB datasets using backdoor models trained for target label 0, similar to Fig. 8a. Fig. 15c, 15d, and 15e provide results on all three datasets but with backdoor models for label 3. As can be seen, the WaNet models without noise mode training return sparse and small patterns, and are thus easy to be detected by Neural Cleanse. By including that training mode, the optimized patterns are denser and approach the clean models' ones. Note that we skip visualizing the results on the CelebA dataset; its patterns optimized on either clean or backdoor models are all too sparse and small for humans to analyze due to subtle differences between human faces.
Figure 15: Trigger patterns optimized by Neural Cleanse for the clean model, WaNet, and WaNet without the noise mode: (a) MNIST (label 0), (b) GTSRB (label 0), (c) MNIST (label 3), (d) CIFAR10 (label 3), (e) GTSRB (label 3).