Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference
Published as a conference paper at ICLR 2020

Ting-Kuei Hu*, Tianlong Chen*, Haotao Wang, Zhangyang Wang
Department of Computer Science and Engineering, Texas A&M University, USA
{tkhu,wiwjp619,htwang,atlaswang}@tamu.edu

ABSTRACT
Deep networks were recently suggested to face the odds between accuracy (on clean natural images) and robustness (on adversarially perturbed images) (Tsipras et al., 2019). Such a dilemma is shown to be rooted in the inherently higher sample complexity (Schmidt et al., 2018) and/or model capacity (Nakkiran, 2019) required for learning a high-accuracy and robust classifier. In view of that, given a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins?

This paper studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a "sweet point" in co-optimizing model accuracy, robustness and efficiency. Our proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which we present a systematic investigation. We show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models.
1 INTRODUCTION
Deep networks, despite their high predictive accuracy, are notoriously vulnerable to adversarial attacks (Goodfellow et al., 2015; Biggio et al., 2013; Szegedy et al., 2014; Papernot et al., 2016). While many defense methods have been proposed to increase a model's robustness to adversarial examples, they were typically observed to hamper its accuracy on original clean images. Tsipras et al. (2019) first pointed out the inherent tension between the goals of adversarial robustness and standard accuracy in deep networks, whose provable existence was shown in a simplified setting. Zhang et al. (2019) theoretically quantified the accuracy-robustness trade-off, in terms of the gap between the risk for adversarial examples versus the risk for non-adversarial examples.

It is intriguing to consider whether and why model accuracy and robustness have to be at odds. Schmidt et al. (2018) demonstrated that the number of samples needed to achieve adversarially robust generalization is polynomially larger than that needed for standard generalization, under the adversarial training setting. A similar conclusion was reached by Sun et al. (2019) in the standard training setting. Tsipras et al. (2019) considered the accuracy-robustness trade-off as an inherent trait of the data distribution itself, indicating that this phenomenon persists even in the limit of infinite data. Nakkiran (2019) argued from a different perspective, that the complexity (e.g., capacity) of a robust classifier must be higher than that of a standard classifier. Therefore, switching to a larger-capacity classifier might effectively alleviate the trade-off. Overall, those existing works appear to suggest that, while accuracy and robustness are likely to trade off for a fixed classification model and on a given dataset, such a trade-off might be effectively alleviated ("win-win") by supplying more training data and/or a larger-capacity classifier.

* Equal contribution

On a separate note, deep networks also face the pressing challenge of deployment on resource-constrained platforms due to the prosperity of smart Internet-of-Things (IoT) devices. Many IoT applications naturally demand security and trustworthiness, e.g., biometrics and identity verification, but can only afford limited latency, memory and energy budgets. Hereby we extend the question: can we achieve a triple win, i.e., an accurate and robust classifier while keeping it efficient?

This paper makes an attempt at providing a positive answer to the above question. Rather than proposing a specific design of robust lightweight models, we reduce the average computation load by input-adaptive routing to achieve the triple win. To this end, we introduce input-adaptive dynamic inference (Teerapittayanon et al., 2017; Wang et al., 2018a), an emerging efficient inference scheme in contrast to (non-adaptive) model compression, to the adversarial defense field for the first time. Given any deep network backbone (e.g., ResNet, MobileNet), we first follow Teerapittayanon et al. (2017) to augment it with multiple early-branch output layers in addition to the original final output. Each input, regardless of being a clean or adversarial sample, adaptively chooses which output layer to take for its own prediction. Therefore, a large portion of input inferences can be terminated early when the samples can already be inferred with high confidence.

To the best of our knowledge, no existing work has studied adversarial attacks and defenses for an adaptive multi-output model, as the multiple sources of losses provide much larger flexibility to compose attacks (and therefore defenses), compared to the typical single-loss backbone.
We present a systematic exploration of how to attack and defend (in the white-box setting) our proposed multi-output network with adaptive inference, demonstrating that the composition of multi-loss information is critical to making the attack/defense strong. Fig. 1 illustrates our proposed Robust Dynamic Inference Networks (RDI-Nets). We show experimentally that input-adaptive inference and multi-loss flexibility can be our friend in achieving the desired "triple wins". With our best defended RDI-Nets, we achieve better accuracy and robustness, yet with over 30% inference computational savings, compared to the defended original models as well as existing solutions co-designing robustness and efficiency (Gui et al., 2019; Guo et al., 2018). The codes can be referenced from https://github.com/TAMU-VITA/triple-wins.

Figure 1: Our proposed RDI-Net framework, a defended multi-output network enabling dynamic inference. Each image, be it clean or adversarially perturbed, adaptively picks one branch to exit.
2 RELATED WORK
2.1 ADVERSARIAL DEFENSE
A multitude of defense approaches have been proposed (Kurakin et al., 2017; Xu et al., 2018; Song et al., 2018; Liao et al., 2018), although many were quickly evaded by new attacks (Carlini & Wagner, 2017; Baluja & Fischer, 2018). One strong defense algorithm that has so far not been fully compromised is adversarial training (Madry et al., 2018). It searches for adversarial images to augment the training procedure, although at the price of higher training costs (but not affecting inference efficiency). However, almost all existing attacks and defenses focus on a single-output classification (or other task) model. We are unaware of prior studies directly addressing attacks/defenses for more complicated networks with multiple possible outputs.

One related line of work exploits model ensembles (Tramèr et al., 2018; Strauss et al., 2017) in adversarial training. The gains of the defended ensemble over a single model can be viewed as the benefits of either diversity (generating stronger and more transferable perturbations) or increased model capacity (considering the ensembled multiple models as a compound one). Unfortunately, ensemble methods can amplify the inference complexity and be detrimental to efficiency. Besides, it is also known that injecting randomization at inference time helps mitigate adversarial effects (Xie et al., 2018; Cohen et al., 2019). Yet to the best of our knowledge, no work has studied non-random, but rather input-dependent, inference for defense.

2.2 EFFICIENT INFERENCE
Research on improving deep network efficiency can be categorized into two streams: the static way, which designs compact models or compresses heavy models, where the compact/compressed models remain fixed for all inputs at inference; and the dynamic way, where at inference the inputs can choose different computational paths adaptively, and simpler inputs usually take less computation to make predictions. We briefly review the literature below.
Static: Compact Network Design and Model Compression.
Many compact architectures have been specifically designed for resource-constrained applications, by adopting lightweight depthwise convolutions (Sandler et al., 2018) and group-wise convolutions with channel shuffling (Zhang et al., 2018), to name a few. For model compression, Han et al. (2015) first proposed to sparsify deep models by removing non-significant synapses and then re-training to restore performance. Structured pruning was later introduced for more hardware friendliness (Wen et al., 2016). Layer factorization (Tai et al., 2016; Yu et al., 2017), quantization (Wu et al., 2016), model distillation (Wang et al., 2018c) and weight sharing (Wu et al., 2018) have also been found effective.
Dynamic: Input-Adaptive Inference.
Higher inference efficiency can also be accomplished by enabling input-conditional execution. Teerapittayanon et al. (2017); Huang et al. (2018); Kaya et al. (2019) leveraged intermediate features to augment multiple side-branch classifiers that enable early predictions. Their methodology sets up the foundation for our work. Other efforts (Figurnov et al., 2017; Wang et al., 2018a;b; 2019) allow an input to choose between passing through or skipping each layer. That approach could be integrated with RDI-Nets too, which we leave as future work.

2.3 BRIDGING ROBUSTNESS WITH EFFICIENCY
A few recent studies try to link deep learning robustness and efficiency. Guo et al. (2018) observed that in a sparse deep network, appropriately sparsified weights improve robustness, whereas over-sparsification (e.g., less than 5% nonzero weights) in turn makes the model more fragile. Two latest works (Ye et al., 2019; Gui et al., 2019) examined the robustness of compressed models, and reached the similar observation that the relationship between model size and robustness depends on the compression method and is often non-monotonic. Lin et al. (2019) found that activation quantization may hurt robustness, but can be turned into an effective defense by enforcing continuity constraints. Different from the above methods that tackle robustness with static compact/compressed models, the proposed RDI-Nets are the first to address robustness via dynamic input-adaptive inference. Our experimental results demonstrate the consistent superiority of RDI-Nets over those static methods (Section 4.3). Moreover, applying dynamic inference on top of those static methods may further boost robustness and efficiency, which we leave as future work.
3 APPROACH
With the goal of achieving inference efficiency, we first look at the setting of multi-output networks and the specific design of RDI-Nets in Section 3.1. Then we define three forms of adversarial attacks for multi-output networks in Section 3.2 and their corresponding defense methods in Section 3.3. Note that RDI-Nets achieve "triple wins" via reducing the average computation load through input-adaptive routing. They are not to be confused with any specifically-designed robust lightweight model.
3.1 DESIGNING RDI-NETS FOR HIGHER INFERENCE EFFICIENCY
Given an input image x, an N-output network can produce a set of predictions [ŷ_1, ..., ŷ_N] through a set of transformations [f_{θ_1}(·), ..., f_{θ_N}(·)], where θ_i denotes the model parameters of f_{θ_i}, i = 1, ..., N, and the f_{θ_i}s will typically share some weights. With an input x, one can express ŷ_i = f_{θ_i}(x). We assume that the final prediction will be one chosen (NOT fused) from [ŷ_1, ..., ŷ_N] via some deterministic strategy.

We now look at RDI-Nets as a specific instance of multi-output networks, designed for the goal of more efficient, input-adaptive inference. As shown in Fig. 1, for any deep network (e.g., ResNet, MobileNet), we could append K side branches (with negligible overhead) to allow for early-exit predictions. In other words, it becomes a (K+1)-output network, and the sub-networks with the K+1 exits, from the lowest to the highest (the original final output), correspond to [f_{θ_1}(·), ..., f_{θ_{K+1}}(·)]. They share their weights in a nested fashion: θ_1 ⊆ θ_2 ⊆ ... ⊆ θ_{K+1}, with θ_{K+1} including the entire network's parameters.

Our deterministic strategy for selecting one final output follows Teerapittayanon et al. (2017). We set a confidence threshold t_k for each k-th exit, k = 1, ..., K+1, and each input x will terminate inference and output its prediction at the earliest exit (smallest k) whose softmax entropy (as a confidence measure) falls below t_k. All computations after the k-th exit will not be activated for this x. Such a progressive, early-halting mechanism effectively saves unnecessary computation for most easier-to-classify samples, and applies in both training and inference.
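The entropy-thresholded early-halting rule just described can be sketched in a few lines. This is an illustrative NumPy mock (function names and thresholds are ours, not from the paper), assuming the per-exit logits have already been computed:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy of a probability vector, used as the confidence measure.
    return float(-(p * np.log(p + 1e-12)).sum())

def early_exit_predict(logits_per_exit, thresholds):
    """Halt at the earliest exit whose softmax entropy falls below its
    threshold t_k; the last (main) exit always answers as a fallback.

    logits_per_exit: K+1 logit vectors, ordered from the first side branch
    to the main branch. thresholds: one entropy threshold per exit.
    Returns (exit_index, predicted_class)."""
    for k, logits in enumerate(logits_per_exit):
        p = softmax(np.asarray(logits, dtype=float))
        if entropy(p) < thresholds[k] or k == len(logits_per_exit) - 1:
            return k, int(p.argmax())
```

In a real RDI-Net, the logits of later exits would simply never be computed once an earlier exit fires, which is where the computational savings come from.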
Note that, if efficiency were not the concern, instead of choosing (the earliest) one, we could have designed an adaptive or randomized fusion of all f_{θ_i} predictions; but that falls beyond the goal of this work.

The training objective for RDI-Nets can be written as

L_RDI = Σ_{i=1}^{K+1} w_i [φ(f_{θ_i}(x), y) + φ(f_{θ_i}(x^{adv}), y)],   (1)

For each exit, we minimize a hybrid loss of accuracy (on clean x) and robustness (on x^{adv}). The K+1 exits are balanced with a group of weights {w_i}_{i=1}^{K+1}. More details about RDI-Net structures, hyperparameters, and inference branch selection can be found in Appendices A, B, and C.

In what follows, we discuss three ways to generate x^{adv} in RDI-Nets, and then their defenses.
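As a small numeric sketch of the training objective in Eqn. 1 (names are ours; φ is taken as softmax cross-entropy, as in the paper):

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for one example, computed in a stable way.
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def rdi_training_loss(clean_logits, adv_logits, label, weights):
    """Eqn. (1): each exit i contributes its clean loss plus its
    adversarial loss, and the K+1 exits are mixed with weights w_i.
    clean_logits / adv_logits: K+1 logit vectors (one per exit)."""
    return sum(
        w * (cross_entropy(lc, label) + cross_entropy(la, label))
        for w, lc, la in zip(weights, clean_logits, adv_logits)
    )
```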
3.2 THREE ATTACK FORMS ON MULTI-OUTPUT NETWORKS

We consider white-box attacks in this paper: attackers have access to the model's parameters, and aim to generate an adversarial image x^{adv} that fools the model by perturbing an input x within a given magnitude bound.

We next discuss three attack forms for an N-output network. Note that they are independent of, and to be distinguished from, attacker algorithms (e.g., PGD, C&W, FGSM): the former depicts the optimization formulation, which can be solved by any of the attacker algorithms.

Single Attack
Naively extending attacks on single-output networks, a single attack is defined to maximally fool one f_{θ_i}(·) only:

x^{adv}_i = arg max_{x′: ‖x′ − x‖_∞ ≤ ε} φ(f_{θ_i}(x′), y),   (2)

where y is the ground-truth label and φ is the loss for f_{θ_i} (we assume softmax cross-entropy for all). ε is the perturbation radius, and we adopt the ℓ∞ ball for an empirically strong attacker. Naturally, an N-output network can have N different single attacks. However, each single attack is derived without being aware of the other, parallel outputs. The found x^{adv}_i is not necessarily transferable to other f_{θ_j}s (j ≠ i), and can therefore be easily bypassed if x is re-routed through other outputs to make its prediction.

Average Attack
Our second attack maximizes the average of all f_{θ_i} losses, so that the found x^{adv} remains in effect no matter which f_{θ_i} is chosen to output the prediction for x:

x^{adv}_avg = arg max_{x′: ‖x′ − x‖_∞ ≤ ε} (1/N) Σ_{j=1}^{N} φ(f_{θ_j}(x′), y),   (3)

The average attack takes attack transferability into account and involves all θ_j s in the optimization. However, since only one output will be selected for each sample at inference, the averaging strategy might weaken the attack strength against each individual f_{θ_i}.
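Both formulations above, Eqn. 2 (one exit's loss) and Eqn. 3 (the exit-averaged loss), can be solved by a standard ℓ∞ PGD loop; the attack form only changes which gradient is fed in. The sketch below is ours, with a toy analytic gradient oracle standing in for backpropagation:

```python
import numpy as np

def pgd_attack(x, grad_fn, eps, alpha, steps):
    """Maximize a loss over the l_inf ball ||x' - x||_inf <= eps by iterated
    sign-gradient ascent. grad_fn(x') returns the gradient of the chosen
    objective (one exit's loss for Eqn. 2, the averaged loss for Eqn. 3)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project back to the ball
    return x_adv

# Toy differentiable "exit": logistic loss of a linear model w, with label y = +1.
w = np.array([1.0, -2.0])
loss = lambda x: float(np.log1p(np.exp(-w @ x)))
grad = lambda x: -w / (1.0 + np.exp(w @ x))              # analytic d loss / d x
```

After the loop, the perturbation stays inside the ε-ball while the objective has increased, which is all either attack form requires of the solver.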
Max-Average Attack

Our third attack aims to emphasize individual output attack strength, beyond simply maximizing an all-averaged loss. We first solve the N single attacks x^{adv}_i as described in Eqn. 2, and denote their collection as Ω. We then solve the max-average attack via the following:

x^{adv}_max ← x^{adv}_{i*}, where x^{adv}_{i*} ∈ Ω and i* = arg max_i (1/N) Σ_{j=1}^{N} φ(f_{θ_j}(x^{adv}_i), y).   (4)

Note that Eqn. 4 differs from Eqn. 3 by adding the Ω constraint to balance between "commonality" and "specificity". The found x^{adv}_max both strongly increases the averaged loss values over all f_{θ_i}s (therefore possessing transferability), and maximally fools one individual f_{θ_i}, as it is selected from the collection Ω of single attacks.
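The selection step of Eqn. 4 is simple once the N single-attack candidates are available; a minimal sketch (function names are ours):

```python
import numpy as np

def max_average_attack(single_attack_candidates, losses_fn):
    """Eqn. (4): among the N single-attack images (the set Omega), pick
    the one whose average loss over all N exits is largest.

    single_attack_candidates: list of N candidate adversarial inputs.
    losses_fn(x): returns the per-exit losses [phi_1(x), ..., phi_N(x)]."""
    avg_scores = [float(np.mean(losses_fn(x))) for x in single_attack_candidates]
    return single_attack_candidates[int(np.argmax(avg_scores))]
```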
3.3 DEFENSE ON MULTI-OUTPUT NETWORKS

For simplicity and fair comparison, we focus on adversarial training (Madry et al., 2018) as our defense framework, where the three attack forms defined above can be plugged in to generate adversarial images that augment training, as follows (Θ is the union of learnable parameters):

θ_i ∈ Θ, where θ_i = arg min_{θ_i} [φ(f_{θ_i}(x), y) + φ(f_{θ_i}(x^{adv}), y)],   (5)

where x^{adv} ∈ {x^{adv}_i, x^{adv}_avg, x^{adv}_max}. As the f_{θ_i}s partially share their weights in a multi-output network, the updates from different exits are averaged on the shared parameters.
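One defense update of Eqn. 5 can be sketched as follows. This is a toy scalar model of ours, where attack_fn stands in for any of the three attack forms and loss_grad for backpropagation:

```python
def adversarial_training_step(theta, x, y, attack_fn, loss_grad, lr):
    """One gradient step of the defense in Eqn. (5): craft x_adv with the
    chosen attack form, then descend the summed clean + adversarial loss.

    attack_fn(theta, x, y): returns x_adv (single, average, or max-average form).
    loss_grad(theta, x, y): gradient of the loss w.r.t. theta."""
    x_adv = attack_fn(theta, x, y)
    g = loss_grad(theta, x, y) + loss_grad(theta, x_adv, y)
    return theta - lr * g
```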
4 EXPERIMENTAL RESULTS

4.1 EXPERIMENTAL SETUP
Evaluation Metrics
We evaluate accuracy, robustness, and efficiency, using the metrics below:

• Testing Accuracy (TA): the classification accuracy on the original clean test set.
• Adversarial Testing Accuracy (ATA): given an attacker, ATA stands for the classification accuracy on the attacked test set. It is the same as the "robust accuracy" in (Zhang et al., 2019).
• Mega Flops (MFlops): the number of millions of floating-point multiplication operations consumed during inference, averaged over the entire test set.
Datasets and Benchmark Models
We evaluate three representative CNN models on two popular datasets: SmallCNN on MNIST (Chen et al., 2018); ResNet-38 (He et al., 2016) and MobileNet-V2 (Sandler et al., 2018) on CIFAR-10. The three networks span from simple to more complicated, and cover a compact backbone. All three models are defended by adversarial training, constituting strong baselines. Table 1 reports the models, datasets, the attacker algorithms used in attack & defense, and the TA/ATA/MFlops performance of the three defended models.
Attack and Defense on RDI-Nets
We build RDI-Nets by appending side-branch outputs to each backbone. For SmallCNN, we add two side branches (K = 2). For ResNet-38 and MobileNet-V2, we have K = 6 and K = 2, respectively. The branches are designed to cause negligible overheads; more details of their structures and positions can be referenced in Appendix B. We call the resulting models RDI-SmallCNN, RDI-ResNet38 and RDI-MobileNetV2 hereinafter.

We then generate attacks using our three defined forms. Each attack form could be solved with various attacker algorithms (e.g., PGD, C&W, FGSM); by default we solve it with the same attacker used for each backbone in Table 1. If we fix one attacker algorithm (e.g., PGD), then TA/ATA for a single-output network can be measured without ambiguity. Yet for (K+1)-output RDI-Nets, there can be at least K+3 different ATA numbers for one defended model, depending on which attack form in Section 3.2 is applied (K+1 single attacks, 1 average attack, and 1 max-average attack). For example, we denote by ATA (Branch 1) the ATA number when applying the single attack generated from the first side output branch (i.e., x^{adv}_1); similarly elsewhere.

We also defend RDI-Nets using adversarial training, using those forms of adversarial images to augment training. By default, we adopt three adversarial training defense schemes: Main Branch (single attack using x^{adv}_{K+1}), Average (using x^{adv}_avg), and Max-Average (using x^{adv}_max), in addition to the undefended RDI-Nets (using standard training), denoted as Standard. (We tried adversarial training using the other K earlier side-branch single attacks, and found their TA/ATA to be much more deteriorated compared to the main-branch one; we thus omit them for compactness.) We report the lowest number among all K+3 ATAs, denoted as ATA (Worst-Case), as the robustness measure for an RDI-Net.

Table 1: Benchmarking results of adversarial training of three networks. PGD-40 denotes running the projected gradient descent attacker (Madry et al., 2018) for 40 iterations. We set the perturbation size as 0.3 for MNIST and 8/255 for CIFAR-10 in the ℓ∞ norm (adopted by all following experiments).
Model        Dataset   Defend  Attack  TA      ATA     MFlops
SmallCNN     MNIST     PGD-40  PGD-40  99.49%  96.31%  9.25
ResNet-38    CIFAR-10  PGD-10  PGD-20  83.62%  42.29%  79.42
MobileNetV2  CIFAR-10  PGD-10  PGD-20  84.42%  46.92%  86.91
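As a tiny Python sketch of the worst-case robustness measure defined above (the minimum over the K+3 per-attack-form ATAs; the function and variable names are ours, and the example numbers are the Max-Average column of Table 2):

```python
def worst_case_ata(atas):
    """ATA (Worst-Case): the lowest adversarial accuracy among the K+3
    attack forms (K+1 single attacks, average, max-average)."""
    worst_form = min(atas, key=atas.get)
    return worst_form, atas[worst_form]

# Example: the Max-Average column of Table 2 (RDI-SmallCNN).
table2_max_avg = {
    "Branch 1": 98.52, "Branch 2": 97.62, "Main Branch": 96.43,
    "Average": 97.42, "Max-Average": 96.89,
}
```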
4.2 EVALUATION AND ANALYSIS
MNIST Experiments
The MNIST experimental results on RDI-SmallCNN are summarized in Table 2, with several meaningful observations to be drawn. First, the undefended models (Standard) are easily compromised by all attack forms. Second, the single-attack-defended model (Main Branch) achieves the best ATA against the same type of attack, i.e., ATA (Main Branch), and also seems to boost the closest output branch's robustness, i.e., ATA (Branch 2). However, its defense effect on the further-away Branch 1 is degraded, and it also proves fragile under the two stronger attacks (Average and Max-Average). Third, both Average and Max-Average defenses achieve good TAs, as well as ATAs against all attack forms (and therefore Worst-Case), with Max-Average slightly better at both (the margins are small due to the data/task simplicity; see the next two experiments).

Moreover, compared to the strong baseline of SmallCNN defended by PGD (40 iterations)-based adversarial training, RDI-SmallCNN with Max-Average defense wins in terms of both TA and ATA. Impressively, that comes together with 34.30% computational savings compared to the baseline. Here the different defense forms do not appear to alter the inference efficiency much: they all save around 34%-36% MFlops compared to the backbone.

Table 2: The performance of RDI-SmallCNN. The "Average MFlops" is calculated by averaging the total flop costs consumed over the inference of the entire set (different samples take different FLOPs due to input-adaptive inference). The perturbation size and step size are 0.3 and 0.01, respectively.

Defense Method      Standard  Main Branch  Average  Max-Average
TA                  –         –            –        –
ATA (Branch 1)      6.60%     60.50%       98.69%   98.52%
ATA (Branch 2)      3.16%     98.14%       97.64%   97.62%
ATA (Main Branch)   1.32%     96.70%       96.30%   96.43%
ATA (Average)       2.61%     61.35%       97.37%   97.42%
ATA (Max-Average)   2.10%     61.83%       96.82%   96.89%
ATA (Worst-Case)    –         –            –        –
Average MFlops      5.89      5.89         5.95     6.08
Computation Saving  –         –            –        –
CIFAR-10 Experiments
The results on RDI-ResNet38 and RDI-MobileNetV2 are presented in Tables 3 and 4, respectively. Most findings concur with the MNIST experiments. Specifically, on the more complicated CIFAR-10 classification task, the Max-Average defense achieves much more obvious margins over the Average defense in terms of ATA (Worst-Case): 2.79% for RDI-ResNet38, and 1.06% for RDI-MobileNetV2. Interestingly, the Average defense is not even the strongest at defending against average attacks, as the Max-Average defense achieves higher ATA (Average) in both cases. We conjecture that averaging all branch losses might "over-smooth" and diminish useful gradients.

Compared to the defended ResNet-38 and MobileNet-V2 backbones, RDI-Nets with Max-Average defense achieve higher TAs and ATAs for both. In particular, the ATA (Worst-Case) of RDI-ResNet-38 surpasses the ATA of ResNet-38 defended by PGD adversarial training, while saving a substantial share of the inference budget. We find that different defenses on CIFAR-10 have more notable impacts on computational saving. Seemingly, a stronger defense (Max-Average) requires inputs to go through the scrutiny of more layers on average before outputting confident enough predictions: a sensible observation, as we would expect.
Visualization of Adaptive Inference Behaviors
We visualize the exiting behaviors of RDI-ResNet38 in Fig. 2. We plot each branch's exiting percentage on the clean set and the (worst-case) adversarial set of examples. A few interesting observations can be made. First, we observe that the single-attack-defended model can be easily fooled, as adversarial examples can be routed through other, less-defended outputs (due to the limited transferability of attacks between different outputs). Second, the two stronger defenses (Average and Max-Average) show much more uniform usage of the multiple outputs. Their routing behaviors for clean examples are almost identical. For adversarial examples, Max-Average tends to call upon the full inference more often (i.e., it is more "conservative").

Table 3: The performance evaluation on RDI-ResNet38. The perturbation size and step size are 8/255 and 2/255, respectively.

Defense Method      Standard  Main Branch  Average  Max-Average
TA                  –         –            –        –
ATA (Branch 1)      0.12%     12.02%       71.56%   69.71%
ATA (Branch 2)      0.01%     5.58%        66.67%   63.11%
ATA (Branch 3)      0.04%     42.73%       60.65%   60.72%
ATA (Branch 4)      0.06%     34.95%       50.17%   47.82%
ATA (Branch 5)      0.06%     41.77%       44.83%   45.53%
ATA (Branch 6)      0.11%     41.68%       45.83%   44.12%
ATA (Main Branch)   0.13%     42.74%       47.52%   49.82%
ATA (Average)       0.01%     9.14%        42.09%   43.32%
ATA (Max-Average)   0.01%     7.15%        40.53%   43.43%
ATA (Worst-Case)    –         –            –        –
Average MFlops      29.41     48.27        56.90    57.81
Computation Saving  –         –            –        –
Table 4: The performance evaluation on RDI-MobileNetV2. The perturbation size and step size are 8/255 and 2/255, respectively.

Defense Method      Standard  Main Branch  Average  Max-Average
TA                  –         –            –        –
ATA (Worst-Case)    0%        –            35.20%   45.93%
Average MFlops      49.78     52.81        58.23    60.84
Computation Saving  –         –            –        –
Figure 2: The exiting behaviours of RDI-ResNet38 defended by (a) single-attack defense (Main Branch); (b) Average defense; and (c) Max-Average defense.

4.3 COMPARISON WITH DEFENDED SPARSE NETWORKS
An alternative way to achieve the accuracy-robustness-efficiency trade-off is to defend a sparse or compressed model. Inspired by (Guo et al., 2018; Gui et al., 2019), we compare RDI-Net with Max-Average defense to the following baseline: first compressing the network with a state-of-the-art model compression method (Huang & Wang, 2018), and then defending the compressed network using PGD-10 adversarial training. We sample different sparsity ratios in (Huang & Wang, 2018) to obtain models of different complexities. Fig. 6 in the Appendix visualizes the comparison on ResNet-38: for either method, we sample a few models of different MFlops. At similar inference costs (e.g., 49.38M for pruning + defense, and 48.35M for RDI-Nets), our proposed approach consistently achieves higher ATAs.

We also compare with the latest ATMC algorithm (Gui et al., 2019), which jointly optimizes robustness and efficiency, applied to the same ResNet-38 backbone. As shown in Table 5, at comparable MFlops, RDI-ResNet-38 surpasses ATMC by 0.3% in terms of ATA, with a similar TA.

Table 5: Comparison with ATMC (Gui et al., 2019) on the ResNet-38 backbone.

Methods                      TA     ATA  MFlops
ATMC (Gui et al., 2019)      –      –    –
RDI-ResNet-38 (Worst-Case)   83.79  –    –
4.4 GENERALIZED ROBUSTNESS AGAINST OTHER ATTACKERS
In the aforementioned experiments, we have only evaluated RDI-Nets against "deterministic" PGD-based adversarial images. We show that RDI-Nets also achieve better generalized robustness against other "randomized" or unseen attackers. We create a new "random attack" that randomly combines the multi-exit losses, and summarize the results in Table 6. We also follow a similar setting to Gui et al. (2019) and report the results against the FGSM (Goodfellow et al., 2015) and WRM (Sinha et al., 2018) attackers in Tables 7 and 8, respectively (more complete results can be found in Appendix D).

Table 6: Performance of RDI-ResNet38 against the random attack. The perturbation size and step size are 8/255 and 2/255, respectively. More details of the random attack can be referenced in Appendix D.

Defense Method      Standard  Main Branch  Average  Max-Average
TA                  –         –            –        –
ATA (Random)        0.01%     10.33%       43.11%   –
Average MFlops      27.33     52.36        55.21    56.54
Computation Saving  –         –            –        –
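The "random attack" idea just evaluated (randomly combining the multi-exit losses; Appendix D has the exact definition) can be roughly sketched as follows. This is our own approximation, assuming the attack ascends a randomly weighted convex combination of the per-exit loss gradients:

```python
import numpy as np

def random_attack_objective_grad(per_exit_grads, rng):
    """Rough sketch of the 'random attack': draw random convex weights
    over the K+1 exit losses and attack their weighted sum.

    per_exit_grads: list of gradients of each exit's loss w.r.t. x.
    rng: a numpy Generator supplying the random weights."""
    w = rng.random(len(per_exit_grads))
    w = w / w.sum()                       # random convex combination
    return sum(wi * g for wi, g in zip(w, per_exit_grads))
```

The resulting gradient can be plugged into the same ℓ∞ PGD loop used for the other attack forms.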
Table 7: Performance of RDI-ResNet38 (defended with PGD) against the FGSM attack (perturbation size 8/255). The original ResNet38 defended by PGD under the same attack has ATA . .

Defense Method      Standard  Main Branch  Average  Max-Average
TA                  –         –            –        –
ATA (Main Branch)   11.51%    51.45%       53.64%   54.72%
ATA (Average)       11.41%    50.21%       51.81%   53.20%
ATA (Max-Average)   2.09%     47.53%       50.63%   52.40%
ATA (Worst-Case)    –         –            –        –
Average MFlops      65.74     55.27        58.27    59.67
Computation Saving  –         –            –        –
Table 8: Performance of RDI-ResNet38 (defended with PGD) against the WRM attack (perturbation size is . ). The original ResNet38 defended by PGD under the same attack has ATA . .

Defense Method      Standard  Main Branch  Average  Max-Average
TA                  –         –            –        –
ATA (Main Branch)   34.42%    83.74%       82.42%   83.78%
ATA (Average)       26.48%    83.69%       82.36%   83.77%
ATA (Max-Average)   23.51%    83.73%       82.40%   83.78%
ATA (Worst-Case)    –         –            –        –
Average MFlops      50.05     50.46        52.89    52.38
Computation Saving  –         –            –        –
5 DISCUSSION AND ANALYSIS
Intuition: Multi-Output Networks as Special Ensembles
Our intuition for defending multi-output networks arises from the success of ensemble defense in improving both accuracy and robustness (Tramèr et al., 2018; Strauss et al., 2017), which also aligns with the model capacity hypothesis (Nakkiran, 2019). A general multi-output network (Xu et al., 2019) can be decomposed into an ensemble of single-output models, with weight re-use enforced among them. It is thus more compact than an ensemble of independent models, and the extent of weight sharing calibrates ensemble diversity versus efficiency. Therefore, we expect a defended multi-output network to (mostly) inherit the strong accuracy/robustness of ensemble defense, while keeping the inference cost lower.
Do "Triple Wins" Go Against the Model Capacity Needs?
We point out that our seemingly "free" efficiency gains (i.e., not sacrificing TA/ATA) do not go against the current belief that a more accurate and robust classifier relies on a larger model capacity (Nakkiran, 2019). From the visualization, there remains a portion of clean/adversarial examples that have to utilize the full inference to be predicted well. In other words, the full model capacity is still necessary to achieve our current TAs/ATAs. Meanwhile, just as in standard classification (Wang et al., 2018a), not all adversarial examples are born equal. Many of them can be predicted with less inference cost (taking earlier exits). Therefore, RDI-Nets reduce the "effective model capacity" averaged over all testing samples for overall higher inference efficiency, while not altering the full model capacity.
6 CONCLUSION
This paper aims to simultaneously achieve high accuracy and robustness while keeping inference costs low. We introduce multi-output networks and input-adaptive dynamic inference, as a strong tool, to the adversarial defense field for the first time. Our RDI-Nets achieve the "triple wins" of better accuracy, stronger robustness, and around 30% inference computational savings. Our future work will extend RDI-Nets to more dynamic inference mechanisms.
ACKNOWLEDGEMENT
We would like to thank Dr. Yang Yang from Walmart Technology for highly helpful discussions throughout this project.

REFERENCES
Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. In AAAI, 2018.

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In ECML, 2013.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In SP, 2017.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.

Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019.

Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, 2017.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

Shupeng Gui, Haotao Wang, Haichuan Yang, Chen Yu, Zhangyang Wang, and Ji Liu. Model compression with adversarial robustness: A unified optimization framework. In NeurIPS, pp. 1283–1294, 2019.

Yiwen Guo, Chao Zhang, Changshui Zhang, and Yurong Chen. Sparse DNNs with improved adversarial robustness. In NeurIPS, 2018.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Hanzhang Hu, Debadeepta Dey, J. Andrew Bagnell, and Martial Hebert. Anytime neural networks via joint optimization of auxiliary losses. In AAAI, 2019.

Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. Multi-scale dense networks for resource efficient image classification. In ICLR, 2018.

Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In ECCV, 2018.

Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In ICML, 2019.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017.

Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. In CVPR, 2018.

Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In ICLR, 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

Preetum Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv, 2019.

Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In EuroS&P, 2016.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS, 2018.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.

Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In ICLR, 2018.

Thilo Strauss, Markus Hanselmann, Andrej Junginger, and Holger Ulmer. Ensemble methods as a defense to adversarial perturbations against deep neural networks. arXiv, 2017.

Ke Sun, Zhanxing Zhu, and Zhouchen Lin. Towards understanding adversarial examples systematically: Exploring data size, task and model factors. arXiv, 2019.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. In ICLR, 2016.

Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In ICPR, 2017.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In ICLR, 2018.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, 2019.

Xin Wang, Fisher Yu, Zi-Yi Dou, and Joseph E Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In ECCV, 2018a.

Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin, and Richard Baraniuk. EnergyNet: Energy-efficient dynamic inference. 2018b.

Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, and Yingyan Lin. Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference. arXiv preprint arXiv:1907.04523, 2019.

Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Adversarial learning of portable student networks. In AAAI, 2018c.

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NeurIPS, 2016.

Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.

Junru Wu, Yue Wang, Zhenyu Wu, Zhangyang Wang, Ashok Veeraraghavan, and Yingyan Lin. Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In ICML, 2018.

Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. In ICLR, 2018.

Donna Xu, Yaxin Shi, Ivor W Tsang, Yew-Soon Ong, Chen Gong, and Xiaobo Shen. A survey on multi-output learning. arXiv, 2019.

Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In NDSS, 2018.

Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. Adversarial robustness vs model compression, or both? In ICCV, 2019.

Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In CVPR, 2017.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv, 2019.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
A LEARNING DETAILS OF RDI-NETS
MNIST
We adopt the network architecture from (Chen et al., 2018), with four convolutional layers and three fully-connected layers. We train for  iterations with a batch size of  . The learning rate is initialized as  and is lowered by  at the  th and  th iterations. For the hybrid loss, the weights $\{w_i\}_{i=1}^{N+1}$ are set as { , , } for simplicity. For adversarial defense/attack, we perform 40-step PGD for both defense and evaluation. The perturbation size and step size are set as  and  , respectively.

CIFAR-10
We take ResNet-38 and MobileNetV2 as the backbone architectures. For RDI-ResNet38, we initialize the learning rate as  and decay it by a factor of 10 at the  th and  th iterations. The learning procedure stops at iteration  . For RDI-MobileNetV2, the learning rate is set to  and is lowered by  times at the  th and  th iterations. We stop the learning procedure at iteration  . For the hybrid loss, we follow the discussion in (Hu et al., 2019) and set $\{w_i\}_{i=1}^{N+1}$ of RDI-ResNet38 and RDI-MobileNetV2 as { , , , , , , } and { , , }, respectively. For adversarial defense/attack, the perturbation size and step size are set as / and / . 10-step PGD is performed for defense and 20-step PGD is utilized for evaluation.

B NETWORK STRUCTURE OF RDI-NETS
To build RDI-Nets, we follow a similar setting to Teerapittayanon et al. (2017) by appending additional branch classifiers at equidistant points throughout a given network, as illustrated in Fig. 3, Fig. 4 and Fig. 5. A few pooling operations, light-weight convolutions and fully-connected layers are appended to each branch classifier. Note that the extra FLOPs introduced by the side branch classifiers are less than 2% of those of the original ResNet-38 or MobileNetV2.
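The multi-exit structure can be sketched as follows: a shared backbone is split into stages, and a lightweight classifier head is attached after each stage, so a single forward pass produces one prediction per exit. The layer types and sizes below are our own toy stand-ins (dense layers instead of conv blocks), not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    """A toy 'layer': weight matrix plus bias (stand-in for a conv block)."""
    return rng.standard_normal((dim_in, dim_out)) * 0.1, np.zeros(dim_out)

# Backbone stages are shared by all exits; each side branch only adds a
# small classifier head (hypothetical sizes, for illustration).
stages = [linear(32, 32) for _ in range(3)]  # three backbone stages
heads = [linear(32, 10) for _ in range(3)]   # one classifier head per exit

def forward_all_exits(x):
    """Run the backbone once, collecting logits at every exit."""
    outputs = []
    h = x
    for (Ws, bs), (Wh, bh) in zip(stages, heads):
        h = np.maximum(h @ Ws + bs, 0.0)  # backbone stage + ReLU
        outputs.append(h @ Wh + bh)       # branch classifier logits
    return outputs

logits = forward_all_exits(rng.standard_normal(32))
print([o.shape for o in logits])  # [(10,), (10,), (10,)]
```

Because the stages are reused by every later exit, the heads add only a small overhead on top of the backbone, mirroring the "less than 2% extra FLOPs" property above.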
Figure 3: Network architecture of RDI-SmallCNN. Two branch classifiers are inserted, after the 1st convolutional layer and the 3rd convolutional layer of the original SmallCNN.
Figure 4: Network architecture of RDI-ResNet38. In each residual block group, two branch classifiers are inserted, after the 1st residual block and the  th residual block.

C INPUT-ADAPTIVE INFERENCE FOR RDI-NETS
Similar to the deterministic strategy in Teerapittayanon et al. (2017), we adopt the entropy as the measure of prediction confidence. Given a prediction vector $y \in \mathbb{R}^C$, where $C$ is the number of classes, the entropy of $y$ is defined as follows:

$$-\sum_{c=1}^{C} (y_c + \epsilon)\log(y_c + \epsilon), \qquad (6)$$

where $\epsilon$ is a small positive constant used for robust entropy computation. To perform fast inference on a $(K+1)$-output RDI-Net, we need to determine $K$ thresholds $\{t_i\}_{i=1}^{K}$, so that the input $x$ will exit at the $i$-th branch if the entropy of $y_i$ is smaller than $t_i$. To choose $\{t_i\}_{i=1}^{K}$, Huang et al. (2018) provide a good starting point by fixing the exiting probability of each branch classifier equally on the validation set, so that each sample contributes equally to inference. We follow this strategy but adjust the thresholds to make the contribution of the middle branches slightly larger than that of the early branches. The thresholds for RDI-SmallCNN, RDI-ResNet38, and RDI-MobileNetV2 are set to { , }, { , , , , , }, and { , }, respectively.

Figure 5: Network architecture of RDI-MobileNetV2. Two branch classifiers are inserted after the 3rd inverted residual block and the  th inverted residual block of the original MobileNetV2.
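The entropy-gated exit rule of Eq. (6) can be sketched as follows. A confident prediction has low entropy, so a sample takes the first exit whose prediction entropy falls below that exit's threshold, and falls back to the main branch otherwise. The thresholds and predictions here are placeholders for illustration, not the tuned values.

```python
import numpy as np

def entropy(y, eps=1e-12):
    """Entropy of a softmax prediction vector y (Eq. 6); eps keeps log finite."""
    y = np.asarray(y, dtype=float)
    return float(-np.sum((y + eps) * np.log(y + eps)))

def choose_exit(per_exit_probs, thresholds):
    """Index of the first exit whose prediction entropy is below its
    threshold; fall back to the final (main-branch) output otherwise."""
    for i, (probs, t) in enumerate(zip(per_exit_probs, thresholds)):
        if entropy(probs) < t:
            return i
    return len(per_exit_probs)  # main-branch exit

# Hypothetical 3-class predictions at two early exits.
preds = [np.array([0.40, 0.35, 0.25]),  # uncertain -> high entropy
         np.array([0.98, 0.01, 0.01])]  # confident -> low entropy
print(choose_exit(preds, thresholds=[0.3, 0.3]))  # 1
```

In this sketch the first exit is too uncertain (entropy ≈ 1.08), while the second is confident enough (entropy ≈ 0.11 < 0.3), so the sample exits at branch 1 and skips the rest of the network.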
Figure 6: Performance comparison between RDI-Net and the pruning + defense baseline (TA vs. ATA, in %). Each marker represents a model, whose marker size is proportional to its MFlops. $\gamma$ is the sparsity trade-off parameter: the larger, the sparser (smaller model).

D GENERALIZED ROBUSTNESS
Here, we introduce the attack form of the random attack, and report the complete results against the FGSM (Goodfellow et al., 2015) and WRM (Sinha et al., 2018) attackers under various attack forms, in Tables 9 and 10, respectively.
Random Attack: this attack exploits the multi-loss flexibility by randomly fusing all $f_{\theta_i}$ losses. Given an $N$-output network, we draw a fusion vector $C \in \mathbb{R}^N \sim \mathcal{D}$, where $\mathcal{D}$ is some distribution (uniform by default). Denoting by $c_j$ the $j$-th element of $C$, $x^{adv}_{rdm}$ can be found by:

$$x^{adv}_{rdm} = \arg\max_{x': \|x' - x\|_\infty \le \epsilon} \Big|\frac{1}{N}\sum_{j=1}^{N} c_j\, \phi\big(f_{\theta_j}(x'), y\big)\Big|. \qquad (7)$$

It is expected to challenge our defense, due to the infinitely many ways of randomly fusing the outputs.

Table 9: Performance evaluation of RDI-ResNet38 (defended with PGD) against the FGSM attack. The perturbation size is / . The ATA of the original ResNet38 defended by PGD under the same attacker is  .

| Defense Method | Standard | Main Branch | Average | Max-Average |
|---|---|---|---|---|
| TA | | | | |
| ATA (Branch1) | 20.69% | 66.06% | 72.77% | 72.76% |
| ATA (Branch2) | 16.15% | 53.87% | 70.40% | 69.71% |
| ATA (Branch3) | 8.13% | 63.70% | 64.19% | 65.14% |
| ATA (Branch4) | 10.09% | 56.67% | 58.45% | 58.20% |
| ATA (Branch5) | 9.45% | 50.81% | 52.76% | 52.96% |
| ATA (Branch6) | 10.22% | 50.34% | 53.17% | 51.05% |
| ATA (Main Branch) | 11.51% | 51.45% | 53.64% | 54.72% |
| ATA (Average) | 11.41% | 50.21% | 51.81% | 53.20% |
| ATA (Max-Average) | 2.09% | 47.53% | 50.63% | 52.40% |
| ATA (Worst-Case) | | | | |
| Average MFlops | 65.74 | 55.27 | 58.27 | 59.67 |
| Computation Saving | | | | |

Table 10: Performance evaluation of RDI-ResNet38 (defended with PGD) against the WRM attack. The ATA of the original ResNet38 defended by PGD under the same attacker is  .

| Defense Method | Standard | Main Branch | Average | Max-Average |
|---|---|---|---|---|
| TA | | | | |
| ATA (Branch1) | 46.60% | 83.73% | 82.42% | 83.78% |
| ATA (Branch2) | 71.33% | 83.73% | 82.42% | 83.79% |
| ATA (Branch3) | 23.51% | 83.73% | 82.41% | 83.78% |
| ATA (Branch4) | 33.41% | 83.73% | 82.42% | 83.78% |
| ATA (Branch5) | 42.35% | 83.73% | 82.41% | 83.78% |
| ATA (Branch6) | 47.77% | 83.74% | 82.40% | 83.78% |
| ATA (Main Branch) | 34.42% | 83.74% | 82.42% | 83.78% |
| ATA (Average) | 26.48% | 83.69% | 82.36% | 83.77% |
| ATA (Max-Average) | 23.51% | 83.73% | 82.40% | 83.78% |
| ATA (Worst-Case) | | | | |
| Average MFlops | 50.05 | 50.46 | 52.89 | 52.38 |
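The randomly-fused objective inside Eq. (7) can be sketched as follows. The per-exit loss $\phi$ and the inner maximization are simplified placeholders here: a gradient-free random search over the $\epsilon$-ball stands in for the actual attack (which would typically use PGD), so only the fusion itself reflects the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fused_loss(losses, c):
    """|1/N * sum_j c_j * loss_j| -- the random fusion inside Eq. (7).
    `losses` are per-exit losses phi(f_j(x'), y); `c` is the random
    fusion vector drawn from D (uniform by default)."""
    losses = np.asarray(losses, dtype=float)
    return abs(float(np.mean(c * losses)))

def random_attack_step(x, loss_fn, eps=0.1, trials=64):
    """Toy stand-in for the inner maximization over the eps-ball:
    keep the random candidate with the largest fused loss."""
    best_x, best_val = x, loss_fn(x)
    for _ in range(trials):
        cand = x + rng.uniform(-eps, eps, size=x.shape)
        val = loss_fn(cand)
        if val > best_val:
            best_x, best_val = cand, val
    return best_x, best_val

c = rng.uniform(size=3)  # fusion vector C ~ Uniform, one weight per exit
# Hypothetical per-exit losses as simple functions of the perturbed input.
toy_losses = lambda x: fused_loss([np.sum(x**2), np.sum(np.abs(x)), 1.0], c)
x_adv, val = random_attack_step(np.zeros(4), toy_losses)
print(val >= toy_losses(np.zeros(4)))  # True: the objective never decreases
```

Re-drawing `c` at every attack yields a different fused objective each time, which is what makes this attack family hard to anticipate from the defender's side.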