Target Training Does Adversarial Training Without Adversarial Samples
Blerta Lindqvist
Department of Computer Science, Aalto University, Helsinki, Finland. Correspondence to: Blerta Lindqvist <blerta.lindqvist@aalto.fi>.

Abstract
Neural network classifiers are vulnerable to misclassification of adversarial samples, for which the current best defense trains classifiers with adversarial samples. However, adversarial samples are not optimal for steering attack convergence, based on the minimization at the core of adversarial attacks. The minimization perturbation term can be minimized towards 0 by replacing adversarial samples in training with duplicated original samples, labeled differently only for training. Using only original samples, Target Training eliminates the need to generate adversarial samples for training against all attacks that minimize perturbation. In low-capacity classifiers and without using adversarial samples, Target Training exceeds both default CIFAR10 accuracy (84.3%) and current best defense accuracy with 84.8% against the CW-L2 (κ = 0) attack, and also surpasses both against DeepFool. Using adversarial samples against attacks that do not minimize perturbation, Target Training exceeds the current best defense (69.1%) with 76.4% against CW-L2 (κ = 40) in CIFAR10.
1. Introduction
Neural network classifiers are vulnerable to malicious adversarial samples that appear indistinguishable from original samples (Szegedy et al., 2013); for example, an adversarial attack can make a traffic stop sign appear like a speed limit sign (Eykholt et al., 2018) to a classifier. An adversarial sample created using one classifier can also fool other classifiers (Szegedy et al., 2013; Biggio et al., 2013), even ones with different structure and parameters (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016b; Tramèr et al., 2017b). This transferability of adversarial attacks (Papernot et al., 2016b) matters because it means that classifier access is not necessary for attacks. The increasing deployment of neural network classifiers in security- and safety-critical domains such as traffic (Eykholt et al., 2018), autonomous driving (Amodei et al., 2016), healthcare (Faust et al., 2018), and malware detection (Cui et al., 2018) makes countering adversarial attacks important.

Most current attacks, including the strongest, the Carlini & Wagner attack (CW) (Carlini & Wagner, 2017c), are gradient-based attacks. Gradient-based attacks use the classifier gradient to generate adversarial samples from non-adversarial samples. Gradient-based attacks minimize the sum of classifier adversarial loss and perturbation (Szegedy et al., 2013), though attacks can relax the perturbation minimization to allow for bigger perturbations. The CW attack (Carlini & Wagner, 2017c) uses its κ confidence parameter to control perturbation, while Projected Gradient Descent (PGD) (Kurakin et al., 2016; Madry et al., 2017) and the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) use an ε parameter. Other gradient-based adversarial attacks include DeepFool (Moosavi-Dezfooli et al., 2016), Zeroth Order Optimization (ZOO) (Chen et al., 2017), and Universal Adversarial Perturbation (UAP) (Moosavi-Dezfooli et al., 2017).

Many recently proposed defenses have been broken (Carlini & Wagner, 2016; 2017a;b; Athalye et al., 2018; Tramer et al., 2020). They fall largely into these categories: (1) adversarial sample detection, (2) gradient masking and obfuscation, (3) ensembles, (4) customized losses. Detection defenses (Meng & Chen, 2017; Ma et al., 2018; Li et al., 2019; Hu et al., 2019) aim to detect, correct or reject adversarial samples. Many detection defenses have been broken (Carlini & Wagner, 2017b;a; Tramer et al., 2020). Gradient obfuscation aims to prevent gradient-based attacks from accessing the gradient and can be achieved by shattering gradients (Guo et al., 2018; Verma & Swami, 2019; Sen et al., 2020), randomness (Dhillon et al., 2018; Li et al., 2019), or vanishing or exploding gradients (Papernot et al., 2016a; Song et al., 2018; Samangouei et al., 2018). Many gradient obfuscation methods have also been successfully defeated (Carlini & Wagner, 2016; Athalye et al., 2018; Tramer et al., 2020). Ensemble defenses (Tramèr et al., 2017a; Verma & Swami, 2019; Pang et al., 2019; Sen et al., 2020) have also been broken (Carlini & Wagner, 2016; Tramer et al., 2020), unable even to outperform their best-performing component. Customized attack losses defeat defenses (Tramer et al., 2020) with customized losses (Pang et al., 2020; Verma & Swami, 2019), but also, for example, ensembles (Sen et al., 2020).
Even though it has not been defeated, Adversarial Training (Szegedy et al., 2013; Kurakin et al., 2016; Madry et al., 2017) assumes that the attack is known in advance and generates adversarial samples at every training iteration. The inability of recent defenses to counter adversarial attacks calls for new kinds of defensive approaches.

In this paper, we make the following major contributions:

• We develop Target Training - a novel, white-box defense that eliminates the need to know the attack or to generate adversarial samples in training, for attacks that minimize perturbation.
• Target Training accuracy in CIFAR10 is 84.8% against CW-L2 (κ = 0), and its accuracy against DeepFool also exceeds the CIFAR10 default accuracy (84.3%).
• Using adversarial samples against attacks that do not minimize perturbation, Target Training accuracy against CW-L2 (κ = 40) in CIFAR10 is 76.4%, which exceeds the Adversarial Training accuracy of 69.1%.
• Contrary to prior work (Madry et al., 2017), we find that low-capacity classifiers can counter non-L∞ attacks successfully.
• Target Training even improves default classifier accuracy (84.3%) on non-adversarial samples in CIFAR10, reaching 86.7%.
• Our work questions whether the Adversarial Training defense works by populating sparse areas, since Target Training is a form of Adversarial Training that successfully uses only original samples against attacks that minimize perturbation.
• Contrary to Carlini & Wagner (2017c), but in support of Kurakin et al. (2018), our experiments show that targeted attacks are much weaker than untargeted attacks.
2. Background And Related Work
Here, we present the state-of-the-art in adversarial attacksand defenses, as well as a summary.
Notation
A k-class neural network classifier with parameters θ is denoted by a function f(x) that takes input x ∈ R^d and outputs y ∈ R^k, where d is the dimensionality and k is the number of classes. An adversarial sample is denoted by x_adv. Classifier output is y, where y_i is the probability that the input belongs to class i. Norms are denoted as L0, L2 and L∞.

Szegedy et al. (2013) were the first to formulate the generation of adversarial samples as a constrained minimization of the perturbation under an Lp norm. Because this formulation can be hard to solve, Szegedy et al. (2013) reformulated the problem as a gradient-based, two-term minimization of the weighted sum of perturbation and classifier loss. For targeted attacks, this minimization is:

minimize  c · ‖δ‖ + loss_f(x + δ, l),  such that  x + δ ∈ [0, 1]^d,    (Minimization 1)

where c is a constant, δ is the perturbation, f is the classifier, loss_f is the classifier loss, and l is an adversarial label. Term (1) of Minimization 1, ‖δ‖, is a norm of the adversarial perturbation, while term (2) utilizes the classifier gradient to find adversarial samples that minimize the classifier adversarial loss. By formulating the problem of finding adversarial samples this way, Szegedy et al. (2013) paved the way for adversarial attacks to utilize classifier gradients.

Minimization 1 is the foundation for many gradient-based attacks, though many tweaks can be and have been applied. Some attacks follow Minimization 1 implicitly (Moosavi-Dezfooli et al., 2016), and others explicitly (Carlini & Wagner, 2017c). The type of Lp norm in term (1) of the minimization also varies. For example, the CW attack (Carlini & Wagner, 2017c) uses L0, L2 and L∞, whereas DeepFool (Moosavi-Dezfooli et al., 2016) uses the L2 norm. A special perturbation case is the Pixel attack by Su et al. (2019), which changes exactly one pixel. Some attacks even exclude term (1) from Minimization 1 and introduce an external parameter to control perturbation. The FGSM attack by Goodfellow et al. (2014), for example, uses an ε parameter, while the CW attack (Carlini & Wagner, 2017c) uses a κ confidence parameter.

There are three ways (Carlini & Wagner, 2017c; Kurakin et al., 2018) to choose the target adversarial label: (1) Best case - try the attack with all adversarial labels and choose the label that was the easiest to attack; (2) Worst case - try the attack with all adversarial labels and choose the label that was the toughest to attack; (3) Average case - choose a target label uniformly at random from the adversarial labels.
Untargeted Attacks
Untargeted attacks aim to find a nearby sample that is misclassified, without aiming for a specific adversarial label. Some untargeted attacks, such as DeepFool and UAP, have no targeted equivalent.
Stronger Attacks
There are conflicting accounts of which attacks are stronger, targeted attacks or untargeted attacks. Carlini & Wagner (2017c) claim that targeted attacks are stronger. However, Kurakin et al. (2018) find targeted attacks, including worst-case targeted attacks, to be much weaker than untargeted attacks.
Fast Gradient Sign Method
The Fast Gradient Sign Method by Goodfellow et al. (2014) is a simple, L∞-bounded attack that constructs adversarial samples by perturbing each input dimension in the direction of the gradient by a magnitude of ε: x_adv = x + ε · sign(∇_x loss(θ, x, y)).
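For concreteness, a minimal FGSM sketch in TensorFlow/Keras follows (function and parameter names are ours, not the paper's code; images are assumed scaled to [0, 1] and labels to be integer class indices):

```python
import tensorflow as tf

def fgsm(model, x, y, eps=0.3):
    """One-step FGSM: move each pixel by eps in the direction of the loss gradient's sign."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixels in the valid range
```

Carlini & Wagner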
The current strongest attack is CW (Carlini & Wagner, 2017c). CW customizes Minimization 1 by passing the constant c to the second term and using it to tune the relative importance of the two terms. With a further change of variable, CW obtains an unconstrained minimization problem that allows it to optimize directly through back-propagation. In addition, CW has a κ parameter for controlling the confidence of the adversarial samples. For κ > 0, the CW attack allows for more perturbation in the adversarial samples it generates.
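As an illustration of this formulation (a sketch in our own notation, not the attack's reference implementation), the targeted CW-L2 objective can be written as a weighted sum of the squared L2 perturbation and a margin loss over the classifier logits:

```python
import tensorflow as tf

def cw_l2_objective(logits_adv, delta, target_onehot, c=1.0, kappa=0.0):
    """Targeted CW-L2 objective: ||delta||_2^2 + c * max(max_{i != t} Z_i - Z_t, -kappa)."""
    target_logit = tf.reduce_sum(target_onehot * logits_adv, axis=1)
    # Largest logit among the non-target classes (the target class is masked out).
    other_logit = tf.reduce_max(logits_adv - 1e9 * target_onehot, axis=1)
    margin = tf.maximum(other_logit - target_logit, -kappa)  # kappa sets the confidence
    flat_delta = tf.reshape(delta, [tf.shape(delta)[0], -1])
    l2 = tf.reduce_sum(tf.square(flat_delta), axis=1)
    return l2 + c * margin  # minimized over delta (CW uses a tanh change of variable)
```

DeepFool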
The DeepFool attack by Moosavi-Dezfooli et al. (2016) is an untargeted attack that follows Minimization 1 implicitly, finding the closest untargeted adversarial sample. DeepFool treats the smallest distance of a point from the classifier decision boundary as the minimum amount of perturbation needed to change its classification. DeepFool approximates the classifier with a linear one, estimates the distance from the linear boundary, and then takes steps in the direction of the closest boundary until an adversarial sample is found.
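The linearized step at the heart of DeepFool can be sketched as follows (our own notation, NumPy only; a full attack would recompute logits and gradients at the new point and iterate until the predicted label flips):

```python
import numpy as np

def deepfool_step(logits, grads, orig_class):
    """One linearized DeepFool step toward the nearest class boundary.

    logits: shape (k,) classifier outputs at the current point.
    grads:  shape (k, d) gradient of each logit w.r.t. the flattened input.
    Returns the estimated minimal L2 perturbation for this step.
    """
    best_dist, best_dir = np.inf, None
    for c in range(len(logits)):
        if c == orig_class:
            continue
        w = grads[c] - grads[orig_class]        # normal of the linearized boundary
        f = logits[c] - logits[orig_class]      # signed distance numerator
        norm_w = np.linalg.norm(w) + 1e-12
        dist = abs(f) / norm_w
        if dist < best_dist:
            best_dist, best_dir = dist, w / norm_w
    return best_dist * best_dir
```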
Black-box attacks
Black-box attacks assume no access to classifier gradients. Such attacks with access to output class probabilities are called score-based attacks, for example the ZOO attack (Chen et al., 2017), a black-box variant of the CW attack (Carlini & Wagner, 2017c). Attacks with access to only the final class label are decision-based attacks, for example the Boundary attack (Brendel et al., 2017) and HopSkipJumpAttack (Chen et al., 2019).
Multi-step attacks
The PGD attack (Kurakin et al., 2016) is an iterative method with an α parameter that determines the step-size perturbation magnitude. PGD starts at a random point x^(0) and then projects the perturbation onto an Lp-ball B at each iteration: x^(j+1) = Proj_B(x^(j) + α · sign(∇_x loss(θ, x^(j), y))). The BIM attack (Kurakin et al., 2016) applies FGSM (Goodfellow et al., 2014) iteratively with an α step. To find a universal perturbation, UAP (Moosavi-Dezfooli et al., 2017) iterates over the images and aggregates perturbations calculated as in DeepFool.

Adversarial Training (Szegedy et al., 2013; Kurakin et al., 2016; Madry et al., 2017) is one of the first and few undefeated defenses. It defends by populating low-probability, so-called blind spots (Szegedy et al., 2013; Goodfellow et al., 2014) with adversarial samples labelled correctly, redrawing boundaries. The drawback of Adversarial Training is that it needs to know the attack in advance, and it needs to generate adversarial samples during training. The Adversarial Training Algorithm 2 in the Appendix is based on Kurakin et al. (2016). Madry et al. (2017) formulate their defense as a robust optimization problem and use adversarial samples to augment the training. Their solution, however, necessitates high-capacity classifiers - bigger models with more parameters.
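A minimal sketch of the PGD update described above, for L∞ perturbations and images assumed scaled to [0, 1] (the ε and step-size values here are illustrative, not the paper's settings):

```python
import tensorflow as tf

def pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=7):
    """Iterated FGSM-style steps, each projected back onto the L-infinity ball around x."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    x_adv = tf.identity(x)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - eps, x + eps)  # project onto the eps-ball
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)          # stay in the valid pixel range
    return x_adv
```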
Detection defenses
Such defenses detect adversarial samples implicitly or explicitly, then correct or reject them. So far, many detection defenses have been defeated. For example, ten diverse detection methods (another network, PCA, statistical properties) were defeated with attack-loss customization by Carlini & Wagner (2017a); Tramer et al. (2020) used attack customization against Hu et al. (2019); attack transferability (Carlini & Wagner, 2017b) was used against MagNet by Meng & Chen (2017); and deep-feature adversaries (Sabour et al., 2016) were used against Roth et al. (2019).
Gradient masking and obfuscation
Many defenses that mask or obfuscate the classifier gradient have been defeated (Carlini & Wagner, 2016; Athalye et al., 2018). Athalye et al. (2018) identify three types of gradient obfuscation: (1) Shattered gradients - incorrect gradients caused by non-differentiable components or numerical instability, for example with multiple input transformations by Guo et al. (2018); Athalye et al. (2018) counter such defenses with Backward Pass Differentiable Approximation. (2) Stochastic gradients in randomized defenses are overcome with Expectation Over Transformation by Athalye et al. (2017); examples are Stochastic Activation Pruning (Dhillon et al., 2018), which drops layer neurons based on a weighted distribution, and Xie et al. (2018), which adds a randomized layer to the classifier input. (3) Vanishing or exploding gradients are used, for example, in Defensive Distillation (DD) (Papernot et al., 2016a), which reduces the amplitude of gradients of the loss function; other examples are PixelDefend (Song et al., 2018) and Defense-GAN (Samangouei et al., 2018). Vanishing or exploding gradients are broken with parameters that avoid vanishing or exploding gradients (Carlini & Wagner, 2016).
Complex defenses
Defenses combining several approaches, for example Li et al. (2019), which uses detection, randomization, multiple models and losses, can be defeated by focusing on the main defense components (Tramer et al., 2020). In particular, ensemble defenses do not perform better than their best components. Verma & Swami (2019), Pang et al. (2019), and Sen et al. (2020) are defeated ensemble defenses combined with numerical instability (Verma & Swami, 2019), regularization (Pang et al., 2019), or mixed precision on weights and activations (Sen et al., 2020).
Many defenses have been broken, and they focus on changing the classifier. Instead, our Target Training defense changes the classifier minimally and focuses on steering attack convergence. Target Training is the first defense based on the perturbation-minimization term of Minimization 1 at the core of untargeted gradient-based adversarial attacks.
3. Target Training
Just as adversarial attacks have used the gradient term of Minimization 1 against defenses, the perturbation term in the same minimization can be used to steer attack convergence. By training the classifier with duplicated original samples that are labeled differently only in training, with target labels (hence Target Training), attacks that minimize perturbation are forced to converge to benign samples, because the duplicated samples minimize the perturbation towards 0.

Target Training is a form of Adversarial Training that replaces adversarial samples with original samples, leading attacks to converge to non-adversarial samples as they do in the Adversarial Training defense. In Target Training, the final no-weight layer used in inference and testing essentially relabels the target labels to the original labels, which is the equivalent of labeling adversarial samples correctly in Adversarial Training. However, the fact that Target Training is a form of Adversarial Training that uses no adversarial samples against attacks that minimize perturbation presents us with a question: might it be that Adversarial Training works not because it populates the distribution blind spots with adversarial samples, but because these adversarial samples steer attack convergence?

Target Training could also be extended to defend against more than one attack at the same time. For example, to defend simultaneously against two types of attacks that do not minimize perturbation, the batch size would be tripled and the batch would be populated with adversarial samples from both attacks. In addition, there would be two sets of target labels, one set for each attack.

Figure 1. Outline of Target Training training without adversarial samples against attacks that minimize perturbation, and with adversarial samples against attacks that do not minimize perturbation.

We choose low-capacity classifiers in order to investigate whether Target Training can defend such classifiers. The MNIST classifier has two convolutional layers with 32 and 64 filters respectively, each followed by batch normalization, then a 2×2 max-pooling layer, a drop-out layer, a fully connected layer with 128 units, another drop-out layer, then a softmax layer with 20 outputs, and then a summation layer without weights that adds up the softmax outputs two-by-two and has 10 outputs. The CIFAR10 classifier has 3 groups of layers, each of which has two convolutional layers with an increasing number of filters and elu activation, each followed by batch normalization, then a 2×2 max-pooling layer, then a drop-out layer; these are followed by a softmax layer with 20 outputs, and finally a no-weight summation layer that takes the softmax layer outputs as inputs, sums them two-by-two, and has 10 outputs. Table 7 in the Appendix shows the classifier architectures for CIFAR10 and MNIST in detail. Training uses all layers up to and including the softmax layer, but not the final layer. Inference and testing use all classifier layers.
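As an illustration, a tf.keras sketch of the MNIST training-time classifier just described; the layer sizes follow the text, while details such as the dropout rates, the dense-layer activation and the optimizer are our assumptions:

```python
from tensorflow.keras import layers, models

def build_mnist_target_net(k=10):
    """Training-time MNIST classifier: ends in a softmax with 2k outputs (original + target labels)."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),                       # dropout rate is an assumption
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                        # dropout rate is an assumption
        layers.Dense(2 * k, activation="softmax"),  # 20 outputs for the 10 MNIST classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```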
Against attacks that minimize perturbation, such as CW (κ = 0) and DeepFool, Target Training uses duplicates of the original samples in each batch instead of adversarial samples, because these samples minimize the perturbation towards 0 - no other points can have a smaller distance from the original samples. This eliminates the overhead of calculating adversarial samples against all attacks that minimize perturbation. Figure 1 shows how Target Training trains without adversarial samples to counter attacks that minimize perturbation, and with adversarial samples to counter attacks that do not minimize perturbation. Training in Target Training is also illustrated in Figure 2, which shows that all the layers up to the softmax layer (with 2k class outputs) take part in training, but not the last layer. The duplicated samples are labeled as i + k, where i is the original label and k is the number of classes.

Algorithm 1 shows the Target Training algorithm against all attacks that minimize perturbation. Against attacks that do not minimize perturbation, such as CW (κ > 0), PGD and FGSM, Target Training uses adversarial samples in training, as shown in Algorithm 3 in the Appendix. Both Target Training algorithms are based on the Adversarial Training (Kurakin et al., 2016) Algorithm 2 in the Appendix.
Figure 2. Outline of the difference between training and inference in Target Training. All the classifier layers up to and including the softmax layer, with 2k outputs, are included in training. The final, no-weight summation layer is not included in training, but is used in inference and testing.

Figure 2 shows that inference in Target Training differs from training by using a no-weight final layer. The final layer derives its output probability y_i as the sum of the probabilities s_i and s_{i+k} in the softmax layer output: y_i = s_i + s_{i+k}, where k is the number of classes, i ∈ [0 ... (k − 1)], s is the softmax layer output, and y is the final layer output.
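In Keras terms, this no-weight inference head can be sketched as a Lambda layer on top of the trained 2k-way softmax model (our naming, not the authors' code):

```python
from tensorflow.keras import layers, models

def add_summation_head(train_model, k=10):
    """Inference-time wrapper implementing y_i = s_i + s_{i+k} with no trainable weights."""
    s = train_model.output                               # softmax output, shape (batch, 2k)
    y = layers.Lambda(lambda t: t[:, :k] + t[:, k:])(s)  # pairwise sums, shape (batch, k)
    return models.Model(inputs=train_model.input, outputs=y)
```

Training uses train_model directly; the wrapped model returned by add_summation_head is what is used for inference and testing, and is what an adversary would query.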
4. Experiments
Threat model
We assume that the adversary's goal is to generate adversarial samples that cause misclassification. We perform white-box evaluations, assuming the adversary has complete knowledge of the classifier and of how the defense works. In terms of capabilities, we assume that the adversary is gradient-based, has access to the CIFAR10 and MNIST image domains, and is able to manipulate pixels. Perturbations are assumed to be Lp-constrained. For attacks that do not minimize perturbations, we assume that the attack is of the same kind as the attack used to generate the adversarial samples used during training. We assume that the adversary can generate both targeted and untargeted attacks.

Targeted and untargeted attacks
There are conflicting views (Carlini & Wagner, 2017c; Kurakin et al., 2018) on whether targeted or untargeted attacks are stronger. To determine which are stronger, we conduct experiments with both untargeted and average-case targeted attacks, where the target label is chosen uniformly at random from the adversarial labels. Targeted attacks are not applicable to DeepFool.
Algorithm 1 Target Training of classifier N against attacks that minimize perturbation, based on the Adversarial Training Algorithm 2 in the Appendix.
Require: m batch size, k classes, classifier N with all layers up to the softmax layer with 2k output classes, TRAIN trains a classifier on a batch and labels
Ensure: Classifier N is Target-Trained against all attacks that minimize perturbation
  while training not converged do
    B = {x_1, ..., x_m}    {Get random batch}
    G = {y_1, ..., y_m}    {Get batch ground truth}
    B′ = {x_1, ..., x_m, x_1, ..., x_m}    {Duplicate batch}
    G′ = {y_1, ..., y_m, y_1 + k, ..., y_m + k}    {Duplicate ground truth and increase duplicates by k}
    TRAIN(N, B′, G′)    {Train classifier on the duplicated batch and new ground truth}
  end while
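A minimal Python sketch of Algorithm 1's loop (our own naming; the paper's TRAIN step is approximated with Keras' train_on_batch, and the classifier is assumed to end in a 2k-way softmax as described in Section 3):

```python
import numpy as np

def target_train(model, x_train, y_train, k=10, batch_size=64, epochs=10):
    """Algorithm 1: every batch is duplicated, and the duplicates are labeled y + k."""
    n = len(x_train)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            xb, yb = x_train[b], y_train[b]
            x_dup = np.concatenate([xb, xb])      # duplicate the original samples
            y_dup = np.concatenate([yb, yb + k])  # duplicates get target labels y + k
            model.train_on_batch(x_dup, y_dup)
    return model
```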
Attack parameters
For CW: 9 steps, confidence values κ = 0 and κ = 40, and the default number of iterations, with additional experiments using up to 10 times as many iterations in adaptive attacks. For PGD, parameters are based on the PGD paper (Madry et al., 2017): for CIFAR10, 7 steps of size 2 with a total ε = 8; for MNIST, 40 steps of size 0.01 with a total ε = 0.3. For all PGD attacks, we use 0 random initialisations within the ε ball, effectively starting PGD attacks from the original images. For FGSM: ε = 0.3, as in (Madry et al., 2017).
Classifier models
We purposefully do not use high-capacity models, such as ResNet (He et al., 2016), used for example by Madry et al. (2017), to show that Target Training does not necessitate high model capacity to defend against adversarial attacks. The architectures for the MNIST and CIFAR10 datasets are described in Subsection 3.1 and shown in Table 7 in the Appendix. No data augmentation was used. Default accuracies without attack are 84.3% for CIFAR10 and 99.1% for MNIST. The Adversarial Training and default classifiers have the same architecture, except that the softmax layer is the last layer and has 10 outputs.
Datasets
The MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009) datasets are 10-class datasets that have been used throughout previous work. The MNIST (LeCun et al., 1998) dataset has 70K 28×28×1 hand-written digit images. The CIFAR10 (Krizhevsky et al., 2009) dataset has 60K 32×32×3 images. Each dataset has 10K testing samples, and all experimental evaluations are done with testing samples.
Tools
Adversarial samples are generated with CleverHans 3.0.1 (Papernot et al., 2018) for the CW-L2 (Carlini & Wagner, 2017c) and DeepFool (Moosavi-Dezfooli et al., 2016) attacks, and with the IBM Adversarial Robustness 360 Toolbox (ART) 1.2 (Nicolae et al., 2018) for the CW-L∞ (Carlini & Wagner, 2017c), FGSM (Goodfellow et al., 2014) and PGD (Kurakin et al., 2016) attacks. Target Training is written in Python 3.7.3, using Keras 2.2.4 (Chollet et al., 2015).
Baselines
We choose Adversarial Training as a baseline because it is the current best defense, since other defenses have been defeated (Carlini & Wagner, 2016; 2017b;a; Athalye et al., 2018; Tramer et al., 2020); more details are in Section 2. Our Adversarial Training implementation is based on (Kurakin et al., 2016), shown in Algorithm 2 in the Appendix. We choose the Kurakin et al. (2016) implementation and not the robust optimization of Madry et al. (2017), because the Adversarial Training solution by Madry et al. (2017) necessitates high-capacity classifiers, and we do not use high-capacity classifiers in order to show that Target Training can defend low-capacity classifiers.
There are conflicting views (Carlini & Wagner, 2017c; Kurakin et al., 2018) on whether targeted or untargeted attacks are stronger. Carlini & Wagner (2017c) claim that targeted attacks are stronger, whereas Kurakin et al. (2018) claim that targeted attacks, even worst-case ones, are much weaker than untargeted attacks. Here, we aim to find out which type of attack is stronger, to use for the rest of the experiments. We use average-case targeted attacks, explained in Section 2.1.

Accuracy values of default classifiers in Table 1 show targeted attacks to be not strong. For example, targeted CW-L2 (κ = 0) in CIFAR10 decreases default classifier accuracy by less than one percentage point, whereas the targeted CW-L2 (κ = 40) attack even increases default classifier accuracy. Similarly, all targeted attacks against the MNIST default classifier reduce accuracy by less than 3%. By comparison, each untargeted attack is much stronger than its targeted equivalent, supporting Kurakin et al. (2018). Untargeted CW-L2 (κ = 0) and untargeted CW-L2 (κ = 40) in CIFAR10 reduce default classifier accuracy to below 10%, and in MNIST to below 1%.

Based on the Table 1 results, we use untargeted attacks for evaluating the performance of Target Training in the following experiments. Additional experiments with targeted attacks using Target Training and Adversarial Training are shown in Table 8 in the Appendix.
Table 1.
Here, we show that each untargeted attack diminishes the accuracy of default classifiers much more than its equivalent targeted attack, which indicates that untargeted attacks are stronger. The DeepFool attack has no targeted equivalent.

Attack                    | Default classifier, CIFAR10 | Default classifier, MNIST
No attack                 | 84.3%                       | 99.1%
Targeted attacks:
CW-L2 (κ = 0)             | 84.0%                       | 98.3%
CW-L∞ (κ = 0)             | 72.7%                       | 98.6%
DeepFool                  | NA                          | NA
CW-L2 (κ = 40)            | 85.7%                       | 99.0%
PGD (ε = 8, ε = 0.3)      | 44.9%                       | 96.4%
FGSM (ε = 0.3)            | 46.4%                       | 96.4%
Untargeted attacks:
CW-L2 (κ = 0)             | 8.5%                        | 0.8%
CW-L∞ (κ = 0)             | 23.6%                       | 94.2%
DeepFool                  |                             |
CW-L2 (κ = 40)            | 7.9%                        | 0.8%
PGD (ε = 8, ε = 0.3)      | 10.9%                       | 90.7%
FGSM (ε = 0.3)            | 17.6%                       | 90.7%

Table 2 shows that Target Training far exceeds the accuracies of Adversarial Training and of the default classifier against attacks that minimize perturbation. Without using adversarial samples in training, Target Training exceeds even default accuracy on non-adversarial samples against CW-L2 (κ = 0) and DeepFool in CIFAR10. The only case where performances are roughly equal is against CW-L∞ (κ = 0) in CIFAR10.

Table 3 shows that Target Training can even improve accuracy compared to Adversarial Training against attacks that do not minimize perturbation, namely CW-L2 (κ = 40) and FGSM (ε = 0.3) in CIFAR10. Against such attacks, Target Training uses adversarial samples in training, as Adversarial Training does. Against the PGD attack, Target Training performs worse than Adversarial Training. We attribute the Target Training performance against PGD to the low capacity of the classifiers we use. Such an effect of classifier capacity on performance has been previously observed by Madry et al. (2017). We anticipate Target Training performance to improve for higher-capacity classifiers.
Table 2. Here, we show Target Training performance against attacks that minimize perturbation, for which Target Training does not use adversarial samples. Target Training even exceeds the performance of the default classifier against CW-L2 (κ = 0) and DeepFool in CIFAR10. Target Training also exceeds the performance of the Adversarial Training classifier that uses adversarial samples, except for CW-L∞ (κ = 0) in CIFAR10, where accuracies are roughly equal.

Untargeted attack    | CIFAR10 (84.3%): Target Training / Adversarial Training / Default | MNIST (99.1%): Target Training / Adversarial Training / Default
CW-L2 (κ = 0)        | 84.8% / 22.8% / 8.5%   | 96.9% / 5.0% / 0.8%
CW-L∞ (κ = 0)        | 21.3% / 21.4% / 23.6%  | 96.1% / 75.8% / 94.2%
DeepFool             |                        |
Table 3.
Using adversarial samples in training, Target Training performs better than Adversarial Training against non-L∞ attacks that do not minimize perturbation in CIFAR10. Against L∞ attacks, Target Training performs worse than Adversarial Training.

Untargeted attack    | CIFAR10 (84.3%): Target Training / Adversarial Training / Default | MNIST (99.1%): Target Training / Adversarial Training / Default
CW-L2 (κ = 40)       | 76.4% / 69.1% / 7.9%   | 95.7% / 96.5% / 0.8%
PGD (ε = 8, ε = 0.3) | 7.1% / 76.2% / 10.9%   | 57.9% / 91.7% / 90.7%
FGSM (ε = 0.3)       | 72.0% / 71.8% / 17.6%  | 98.2% / 98.4% / 90.7%

In Table 4, we show that Target Training exceeds default classifier accuracy in CIFAR10 on original, non-adversarial images when trained without adversarial samples against attacks that minimize perturbation: 86.7%, up from 84.3%. Furthermore, Table 4 shows that when using adversarial samples against attacks that do not minimize perturbation, Target Training equals Adversarial Training performance.
For a defense to be strong, it needs to be shown to break the transferability of attacks. A good source of adversarial samples for transferability is the unsecured classifier (Carlini et al., 2019). We experiment on the transferability of attacks from the unsecured classifier to a classifier secured with Target Training. In Table 5, we show that Target Training breaks the transferability of adversarial samples generated by attacks that minimize perturbation much better than Adversarial Training does in CIFAR10. Against the rest of the attacks, Target Training and Adversarial Training perform similarly.
5. Adaptive evaluation
Many recent defenses have failed to anticipate attacks that have defeated them (Carlini et al., 2019; Carlini & Wagner, 2017a; Athalye et al., 2018). Therefore, we perform an adaptive evaluation (Carlini et al., 2019; Tramer et al., 2020) of our Target Training defense.
Whether Target Training could be defeated by methods used to break other defenses.
Target Training is a type of Adversarial Training because both use additional training samples, but there is no adaptive attack against Adversarial Training. Target Training uses none of the previous unsuccessful defense techniques (Carlini & Wagner, 2016; 2017b;a; Athalye et al., 2018; Tramer et al., 2020) that involve adversarial sample detection, preprocessing, obfuscation, ensembles, customized losses, subcomponents, or non-differentiable components. Therefore their adaptive attacks cannot be used on Target Training. In addition, we keep the loss function simple - standard softmax cross-entropy and no additional loss. In the following, we discuss an adaptive attack based on the Target Training summation layer after the softmax layer.
Adaptive attack against Target Training.
Based on the Target Training defense, we consider an adaptive attack that uses a copy of the Target Training classifier up to the softmax layer, without the last layer, to generate adversarial samples that are then tested on the full Target Training classifier. Table 6 shows that Target Training withstands the adaptive attack.
Iterative attacks.
The multi-step PGD (Kurakin et al., 2016) attack decreases Target Training accuracy more than single-step attacks, which suggests that our defense is working correctly, according to Carlini et al. (2019).
Transferability.
Our transferability analysis results in Table 5 in Subsection 4.5 show that Target Training breaks the transferability of adversarial samples much better than Adversarial Training against attacks that minimize perturbation in CIFAR10. Target Training performance on the rest of the attacks is comparable to Adversarial Training performance. The attacks are generated with the default, unsecured classifier.
Table 4.
Target Training exceeds default classifier accuracy on original, non-adversarial samples when trained without adversarial samples against attacks that minimize perturbation in CIFAR10. Adversarial Training is not applicable in that case because it needs adversarial samples. Target Training equals Adversarial Training performance when using adversarial samples against attacks that do not minimize perturbation.

Untargeted attack used in training               | CIFAR10 (84.3%): Target Training / Adversarial Training / Default | MNIST (99.1%): Target Training / Adversarial Training / Default
None (against attacks that minimize perturbation) | 86.7% / NA / 84.3%      |  / NA / 99.1%
CW-L2 (κ = 40)                                   | 77.7% / 77.4% / 84.3%   | 98.0% / 98.0% / 99.1%
PGD (ε = 8, ε = 0.3)                             | 76.3% / 76.9% / 84.3%   | 98.3% / 98.4% / 99.1%
FGSM (ε = 0.3)                                   | 77.6% / 76.6% / 84.3%   | 98.6% / 98.6% / 99.1%
Target Training breaks the transferability of attacks that minimize perturbation much better than Adversarial Training in CIFAR10.Against attacks that do not minimize perturbation, Target Training and Adversarial Training have comparable performance - both TargetTraining and Adversarial Training break the transferability of attacks in MNIST but not in CIFAR10.CIFAR10 (84.3%) MNIST (99.1%)U
NTARGETED T ARGET A DVERSARIAL D EFAULT T ARGET A DVERSARIAL D EFAULT A TTACK T RAINING T RAINING C LASSIFIER T RAINING T RAINING C LASSIFIER
CW- L ( κ = 0 ) 84.7% 50.8% 8.5% 97.0% 92.8% 0.8%CW- L ∞ ( κ = 0 ) 84.2% 55.9% 23.6% 96.2% 97.8% 94.2%D EEP F OOL L ( κ = 40 ) 35.8% 33.8% 7.9% 97.8% 97.9% 0.8%PGD( (cid:15) = 8 , (cid:15) = 0 . ) 10.8% 10.0% 10.9% 97.9% 98.3% 90.7%FGSM( (cid:15) = 0 . ) 34.1% 45.5% 17.6% 72.1% 75.2% 90.7% Table 6.
Target Training withstands the adaptive attack for bothCIFAR10 and MNIST. Adversarial samples are generated using aTarget Training classifier up to the softmax layer, without the lastlayer. The generated samples are tested against the original, fullTarget Training classifier. CIFAR10 MNIST(84.3%) (99.1%)U
NTARGETED T ARGET A DVERSARIAL A DAPTIVE A TTACK T RAINING T RAINING
CW- L ( κ = 0 ) 84.7% 97.0%CW- L ∞ ( κ = 0 ) 84.2% 96.3%D EEP F OOL L ( κ = 40 ) 76.4% 95.7%PGD( (cid:15) = 8 , (cid:15) = 0 . ) 76.3% 92.3%FGSM( (cid:15) = 0 . ) 72.1% 98.2% in CIFAR10. Target Training performance in the rest of theattacks is comparable to Adversarial Training performance.The attacks are generated with default, unsecured classifier. Stronger CW attack leads to better Target Training ac-curacy.
Increasing the number of iterations for CW-L2 (κ = 0) 10-fold increases our defense's accuracy in both CIFAR10 and MNIST.
6. Discussion And Conclusions
In conclusion, we show that our white-box Target Training defense counters non-L∞ attacks that minimize perturbation in low-capacity classifiers without using adversarial samples. Target Training defends classifiers by training with duplicated original samples instead of adversarial samples. This minimizes the perturbation term in the attack minimization and, as a result, steers attacks to non-adversarial samples. Target Training exceeds default accuracy (84.3%) in CIFAR10 with 84.8% against CW-L2 (κ = 0), exceeds it against DeepFool, and reaches 86.7% on original, non-adversarial samples. As a form of Adversarial Training that does not use adversarial samples against attacks that minimize perturbation, Target Training defies the common justification of why Adversarial Training works. The implication is that the reason Adversarial Training works might be the same as the reason Target Training works: not because they populate sparse areas with samples, but because they steer attack convergence based on the perturbation term in the attack minimization. Target Training minimizes the perturbation further than Adversarial Training, and without the need for adversarial samples.

References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J., and Mané, D. Concrete problems in AI safety.
CoRR, abs/1606.06565, 2016. URL http://arxiv.org/abs/1606.06565. Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017. Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018. Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In
Joint Europeanconference on machine learning and knowledge discoveryin databases , pp. 387–402. Springer, 2013.Brendel, W., Rauber, J., and Bethge, M. Decision-based ad-versarial attacks: Reliable attacks against black-box ma-chine learning models. arXiv preprint arXiv:1712.04248 ,2017.Carlini, N. and Wagner, D. Defensive distillation isnot robust to adversarial examples. arXiv preprintarXiv:1607.04311 , 2016.Carlini, N. and Wagner, D. Adversarial examples are noteasily detected: Bypassing ten detection methods. In
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. ACM, 2017a. Carlini, N. and Wagner, D. MagNet and "Efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2017b. Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. IEEE, 2017c. Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I., Madry, A., and Kurakin, A. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019. Chen, J., Jordan, M. I., and Wainwright, M. J. HopSkipJumpAttack: A query-efficient decision-based attack. arXiv preprint arXiv:1904.02144, 3, 2019. Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., and Hsieh, C.-J. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In
Proceedings of the 10th ACM Workshop on Artifi-cial Intelligence and Security , pp. 15–26. ACM, 2017. Chollet, F. et al. Keras. https://keras.io , 2015.Cui, Z., Xue, F., Cai, X., Cao, Y., Wang, G.-g., and Chen,J. Detection of malicious code variants based on deeplearning.
IEEE Transactions on Industrial Informatics ,14(7):3187–3196, 2018.Dhillon, G. S., Azizzadenesheli, K., Lipton, Z. C., Bernstein,J., Kossaifi, J., Khanna, A., and Anandkumar, A. Stochas-tic activation pruning for robust adversarial defense. In
International Conference on Learning Representations ,2018.Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A.,Xiao, C., Prakash, A., Kohno, T., and Song, D. Robustphysical-world attacks on deep learning visual classifica-tion. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pp. 1625–1634, 2018.Faust, O., Hagiwara, Y., Hong, T. J., Lih, O. S., and Acharya,U. R. Deep learning for healthcare applications based onphysiological signals: a review.
Computer methods andprograms in biomedicine , 2018.Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain-ing and harnessing adversarial examples. arXiv preprintarXiv:1412.6572 , 2014.Guo, C., Rana, M., Cisse, M., and Van Der Maaten, L. Coun-tering adversarial images using input transformations. In
International Conference on Learning Representations ,2018.He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In
Proceedings of the IEEEconference on computer vision and pattern recognition ,pp. 770–778, 2016.Hu, S., Yu, T., Guo, C., Chao, W.-L., and Weinberger, K. Q.A new defense against adversarial images: Turning aweakness into a strength. In
Advances in Neural Informa-tion Processing Systems , pp. 1633–1644, 2019.Krizhevsky, A., Nair, V., and Hinton, G. Cifar-10 and cifar-100 datasets. , 6, 2009.Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial ma-chine learning at scale. arXiv preprint arXiv:1611.01236 ,2016.Kurakin, A., Goodfellow, I., Bengio, S., Dong, Y., Liao,F., Liang, M., Pang, T., Zhu, J., Hu, X., Xie, C., et al.Adversarial attacks and defences competition. In
The NIPS'17 Competition: Building Intelligent Systems, pp. 195–231. Springer, 2018.
LeCun, Y., Cortes, C., and Burges, C. J. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist, 1998. Li, Y., Bradshaw, J., and Sharma, Y. Are generative classifiers more robust to adversarial attacks? In
InternationalConference on Machine Learning , 2019.Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S.,Schoenebeck, G., Song, D., Houle, M. E., and Bailey, J.Characterizing adversarial subspaces using local intrinsicdimensionality. In
International Conference on MachineLearning , 2018.Madry, A., Makelov, A., Schmidt, L., Tsipras, D., andVladu, A. Towards deep learning models resistant toadversarial attacks. arXiv preprint arXiv:1706.06083 ,2017.Meng, D. and Chen, H. Magnet: a two-pronged defenseagainst adversarial examples. In
Proceedings of the 2017ACM SIGSAC Conference on Computer and Communica-tions Security , pp. 135–147. ACM, 2017.Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deep-fool: a simple and accurate method to fool deep neuralnetworks. In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pp. 2574–2582,2016.Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., andFrossard, P. Universal adversarial perturbations. In
Pro-ceedings of the IEEE conference on computer vision andpattern recognition , pp. 1765–1773, 2017.Nicolae, M.-I., Sinn, M., Tran, M. N., Buesser, B., Rawat,A., Wistuba, M., Zantedeschi, V., Baracaldo, N., Chen,B., Ludwig, H., Molloy, I., and Edwards, B. Adversarialrobustness toolbox v1.0.1.
CoRR , 1807.01069, 2018.URL https://arxiv.org/pdf/1807.01069 .Pang, T., Xu, K., Du, C., Chen, N., and Zhu, J. Improvingadversarial robustness via promoting ensemble diversity.In
International Conference on Learning Representations ,2019.Pang, T., Xu, K., Dong, Y., Du, C., Chen, N., and Zhu,J. Rethinking softmax cross-entropy loss for adversarialrobustness. In
International Conference on Learning Representations, 2020. Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pp. 582–597. IEEE, 2016a. Papernot, N., McDaniel, P. D., and Goodfellow, I. J. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples.
CoRR ,abs/1605.07277, 2016b. URL http://arxiv.org/abs/1605.07277 .Papernot, N., Faghri, F., Carlini, N., Goodfellow, I., Fein-man, R., Kurakin, A., Xie, C., Sharma, Y., Brown, T.,Roy, A., Matyasko, A., Behzadan, V., Hambardzumyan,K., Zhang, Z., Juang, Y.-L., Li, Z., Sheatsley, R., Garg,A., Uesato, J., Gierke, W., Dong, Y., Berthelot, D., Hen-dricks, P., Rauber, J., and Long, R. Technical report onthe cleverhans v2.1.0 adversarial examples library. arXivpreprint arXiv:1610.00768 , 2018.Roth, K., Kilcher, Y., and Hofmann, T. The odds are odd: Astatistical test for detecting adversarial examples. arXivpreprint arXiv:1902.04818 , 2019.Sabour, S., Cao, Y., Faghri, F., and Fleet, D. J. Adversarialmanipulation of deep representations. In
InternationalConference on Learning Representations , 2016.Samangouei, P., Kabkab, M., and Chellappa, R. Defense-gan: Protecting classifiers against adversarial attacks us-ing generative models. In
International Conference onLearning Representations , 2018.Sen, S., Ravindran, B., and Raghunathan, A. Empir: En-sembles of mixed precision deep networks for increasedrobustness against adversarial attacks. In
InternationalConference on Machine Learning , 2020.Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N.Pixeldefend: Leveraging generative models to understandand defend against adversarial examples. In
InternationalConference on Learning Representations , 2018.Su, J., Vargas, D. V., and Sakurai, K. One pixel attackfor fooling deep neural networks.
IEEE Transactions onEvolutionary Computation , 23(5):828–841, 2019.Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,D., Goodfellow, I. J., and Fergus, R. Intriguing proper-ties of neural networks. In
International Conference onLearning Representations , 2013.Tram`er, F., Kurakin, A., Papernot, N., Goodfellow, I.,Boneh, D., and McDaniel, P. Ensemble adversar-ial training: Attacks and defenses. arXiv preprintarXiv:1705.07204 , 2017a.Tram`er, F., Papernot, N., Goodfellow, I., Boneh, D., and Mc-Daniel, P. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453 , 2017b.Tramer, F., Carlini, N., Brendel, W., and Madry, A. Onadaptive attacks to adversarial example defenses. arXivpreprint arXiv:2002.08347 , 2020. arget Training Does Adversarial Training Without Adversarial Samples
Verma, G. and Swami, A. Error correcting output codes im-prove probability estimation and adversarial robustness ofdeep neural networks. In
Advances in Neural InformationProcessing Systems , pp. 8643–8653, 2019.Xie, C., Wang, J., Zhang, Z., Ren, Z., and Yuille, A. Mit-igating adversarial effects through randomization. In
International Conference on Learning Representations, 2018.
Table 7.
Architectures of Target Training classifiers for the CIFAR10 and MNIST datasets. For the convolutional layers, we use an L2 kernel regularizer. The pre-final Dense.Softmax layers in both models have 20 output classes, twice the number of dataset classes. The default, unsecured classifiers and the classifiers used for Adversarial Training have the same architectures, except that the softmax layer is the final layer and has only 10 outputs.

CIFAR10: Conv.ELU 3×3×32, BatchNorm, Conv.ELU 3×3×32, BatchNorm, MaxPool 2×2, Dropout, Conv.ELU 3×3×64, BatchNorm, Conv.ELU 3×3×64, BatchNorm, MaxPool 2×2, Dropout, Conv.ELU 3×3, BatchNorm, Conv.ELU 3×3, BatchNorm, MaxPool 2×2, Dropout, Dense.Softmax (20 outputs), Lambda Summation (10 outputs).

MNIST: Conv.ReLU 3×3×32, BatchNorm, Conv.ReLU 3×3×64, BatchNorm, MaxPool 2×2, Dropout, Dense (128), Dropout, Dense.Softmax (20 outputs), Lambda Summation (10 outputs).

Algorithm 2
Adversarial Training of classifier N, based on (Kurakin et al., 2016).
Require: m batch size, k classes, classifier N with k output classes, ADV_ATTACK adversarial attack, TRAIN trains a classifier on a batch and labels
Ensure: Adversarially-Trained classifier N
  while training not converged do
    B = {x_1, ..., x_m}    {Get random batch}
    G = {y_1, ..., y_m}    {Get batch ground truth}
    A = ADV_ATTACK(N, B)    {Generate adversarial samples from the batch}
    B′ = B ∪ A = {x_1, ..., x_m, x_1^adv, ..., x_m^adv}    {New batch}
    G′ = {y_1, ..., y_m, y_1, ..., y_m}    {Duplicate ground truth}
    TRAIN(N, B′, G′)    {Train classifier on the new batch and new ground truth}
  end while

Algorithm 3 Target Training of classifier N using adversarial samples against attacks that do not minimize perturbation.
Require:
Batch size is m, number of dataset classes is k, untrained classifier N with 2k output classes, ADV_ATTACK is an adversarial attack, TRAIN trains a classifier on a batch and its ground truth
Ensure: Classifier N is Target-Trained against ADV_ATTACK
  while training not converged do
    B = {x_1, ..., x_m}    {Get random batch}
    G = {y_1, ..., y_m}    {Get batch ground truth}
    A = ADV_ATTACK(N, B)    {Generate adversarial samples from the batch}
    B′ = B ∪ A = {x_1, ..., x_m, x_1^adv, ..., x_m^adv}    {Assemble new batch from the original batch and adversarial samples}
    G′ = {y_1, ..., y_m, y_1 + k, ..., y_m + k}    {Duplicate ground truth and increase duplicates by k}
    TRAIN(N, B′, G′)    {Train classifier on the new batch and new ground truth}
  end while
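A Python sketch of Algorithm 3's batch construction (our naming; attack_fn stands for whatever attack the defense is trained against, for example the pgd() sketch earlier in the paper):

```python
import numpy as np

def target_train_with_adv(model, attack_fn, x_train, y_train, k=10, batch_size=64, epochs=10):
    """Algorithm 3: each batch is the original samples plus adversarial samples labeled y + k."""
    n = len(x_train)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            xb, yb = x_train[b], y_train[b]
            x_adv = np.asarray(attack_fn(model, xb, yb))  # generate adversarial samples
            x_new = np.concatenate([xb, x_adv])
            y_new = np.concatenate([yb, yb + k])          # adversarial samples get labels y + k
            model.train_on_batch(x_new, y_new)
    return model
```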
Here, we show that targeted attacks are not strong against default classifiers, decreasing default accuracies very little. TargetTraining and Adversarial Training have roughly equal performance against targeted attacks, except for CW- L ∞ in CIFAR10 where TargetTraining has better accuracy, and PGD where Adversarial Training has better accuracy. DeepFool attacks are not applicable because theycannot be targeted. CIFAR10 (84.3%) MNIST (99.1%)T ARGETED T ARGET A DVERSARIAL D EFAULT T ARGET A DVERSARIAL D EFAULT A TTACK T RAINING T RAINING C LASSIFIER T RAINING T RAINING C LASSIFIER
CW- L ( κ = 0 ) 82.8% 83.1% 84.0% 96.9% 98.9% 98.3%CW- L ∞ ( κ = 0 ) 69.0% 50.1% 72.7% 98.2% 98.1% 98.6%D EEP F OOL
NA NA NA NA NA NACW- L ( κ = 40 ) 84.4% 84.1% 85.7% 99.0% 99.0% 99.0%PGD( (cid:15) = 8 , (cid:15) = 0 . ) 21.4% 34.5% 44.9% 85.3% 97.6% 96.4%FGSM( (cid:15) = 0 .3