With False Friends Like These, Who Can Have Self-Knowledge?
Lue Tao & Songcan Chen*
College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics
{tlmichael,s.chen}@nuaa.edu.cn
*Corresponding author.

ABSTRACT
Adversarial examples arise from the excessive sensitivity of a model. Commonly studied adversarial examples are malicious inputs, crafted by an adversary from correctly classified examples, to induce misclassification. This paper studies an intriguing, yet far overlooked consequence of this excessive sensitivity: a misclassified example can be easily perturbed to help the model produce the correct output. Such perturbed examples look harmless, but can actually be maliciously utilized by a false friend to make the model self-satisfied. Thus we name them hypocritical examples. With false friends like these, a poorly performing model could behave like a state-of-the-art one. Once a deployer trusts the hypocritical performance and uses the "well-performed" model in real-world applications, potential security concerns appear even in benign environments. In this paper, we formalize the hypocritical risk for the first time and propose a defense method specialized for hypocritical examples by minimizing the tradeoff between natural risk and an upper bound of hypocritical risk. Moreover, our theoretical analysis reveals connections between adversarial risk and hypocritical risk. Extensive experiments verify the theoretical results and the effectiveness of our proposed methods.
1 INTRODUCTION
Deep neural networks (DNNs) have achieved breakthroughs in a variety of challenging problems such as image understanding (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013), and automatic game playing (Mnih et al., 2015). Despite these remarkable successes, their pervasive failures in adversarial settings, i.e., the phenomenon of adversarial examples (Biggio et al., 2013; Szegedy et al., 2014), have attracted significant attention in recent years (Athalye et al., 2018; Carlini et al., 2019; Tramer et al., 2020). Such small perturbations on inputs crafted by adversaries are capable of causing well-trained models to make big mistakes, which indicates that there is still a large gap between machine and human perception, thus posing potential security concerns for practical machine learning (ML) applications (Kurakin et al., 2016; Qin et al., 2019; Wu et al., 2020b).

An adversarial example is "an input to a ML model that is intentionally designed by an attacker to fool the model into producing an incorrect output" (Goodfellow & Papernot, 2017). Following the definition of adversarial examples for classification problems (Goodfellow et al., 2015; Papernot et al., 2016; Elsayed et al., 2018; Carlini et al., 2019; Zhang et al., 2019; Wang et al., 2020b; Zhang et al., 2020; Tramèr et al., 2020), given a DNN classifier f and a correctly classified example x with class label y (i.e., f(x) = y), an adversarial example x_adv is generated by perturbing x such that f(x_adv) ≠ y and x_adv ∈ B_ε(x). The neighborhood B_ε(x) denotes the set of points within a fixed distance ε > 0 of x, as measured by some metric (e.g., the ℓ_p distance), so that x_adv is visually the "same" for human observers. Then, an imperfection of the classifier is highlighted by G_adv = Acc(D) − Acc(A), the performance gap between the accuracy (denoted by Acc(·)) evaluated on a clean set sampled from the data distribution D and on an adversarially perturbed set A.

An adversary could construct such a perturbed set A that looks no different from D but severely degrades the performance of even state-of-the-art DNN models.

Figure 1: Comparison between adversarial examples and hypocritical examples.
Left: Conceptual diagrams for the generation of an adversarial example x_adv and a hypocritical example x_hyp. The input space is (ground-truth) classified into the orange lined region (e.g., class "not panda") and the blue dotted region (e.g., class "panda"). The black solid line is the decision boundary of a non-robust model, which classifies the region above the boundary as "panda" and the region below the boundary as "not panda". Red shadow and black shadow in the ball B_ε(x) denote that the points there are misclassified and correctly classified, respectively. As we can see, x_adv or x_hyp can easily be found by perturbing a correctly classified x or a misclassified x across the model's decision boundary. Right: A demonstration of adversarial examples and hypocritical examples on real data. Here we choose ResNet50 (He et al., 2016a) trained on ImageNet (Russakovsky et al., 2015) as the victim model. In (a) the correctly classified "panda" can be stealthily perturbed to be misclassified as "tennis ball". In (b) the "panda" (misclassified as "tripod") can be stealthily perturbed to be correctly classified. Perturbations are rescaled for display.

From direct attacks in the digital space (Goodfellow et al., 2015; Carlini & Wagner, 2017) to robust attacks in the physical world (Kurakin et al., 2016; Xu et al., 2020), from toy classification problems (Chen et al., 2020; Dobriban et al., 2020) to complicated perception tasks (Zhang & Wang, 2019; Wang et al., 2020a), and from the high-dimensional nature of the input space (Goodfellow et al., 2015; Gilmer et al., 2018) to the framework of (non-)robust features (Jetley et al., 2018; Ilyas et al., 2019), many efforts have been devoted to understanding and mitigating the risk raised by adversarial examples, thus closing the gap G_adv. Previous works mainly concern the adversarial risk on correctly classified examples. However, they typically neglect a risk on the misclassified examples themselves, which will be formalized in this work.

In this paper, we first investigate an intriguing, yet far overlooked phenomenon: given a DNN classifier f and a misclassified example x with class label y (i.e., f(x) ≠ y), we can easily perturb x to x_hyp such that f(x_hyp) = y and x_hyp ∈ B_ε(x). Such an example x_hyp looks harmless, but can actually be maliciously utilized by a false friend to fool a model into being self-satisfied. Thus we name them hypocritical examples (see Figure 1 for a comparison with adversarial examples).

Adversarial examples and hypocritical examples are two sides of the same coin. On the one side, a well-performing but sensitive model becomes unreliable in the presence of adversaries. On the other side, a poorly performing but sensitive model behaves well with the help of friends. With false friends like these, a naturally trained suboptimal model could have state-of-the-art performance, and even worse, a randomly initialized model could behave like a well-trained one (see Section 2.1). It is natural then to wonder: Why should we care about hypocritical examples?
Here we give two main reasons:

1. This is of scientific interest. Hypocritical examples are the opposite of adversarial examples. While adversarial examples are hard test data for a model, hypocritical examples aim to make correct classification easy. Hypocritical examples warn ML researchers to think carefully about high test accuracy: does our model truly achieve human-like intelligence, or is it simply that the test data favors the model?

2. There are practical threats. A variety of nefarious ends may be achievable if the mistakes of ML systems can be covered up by hypocritical attackers. For instance, before allowing autonomous vehicles to drive on public roads, manufacturers must first pass tests in specific environments (closed or open roads) to obtain a license (Administration et al., 2016; Briefs, 2015; Lei, 2018). An attacker may stealthily add imperceptible perturbations to the test examples (e.g., the "stop sign" on the road), without human notice, to hypocritically help an ML-based autonomous vehicle pass tests that it might otherwise fail. However, the high performance cannot be maintained on public roads without the help of the attacker. Thus, the potential risk is underestimated, and traffic accidents might happen unexpectedly when the vehicle drives on public roads.

In such a case, if the examples used to evaluate a model are falsified by a false friend, the model will appear perfect (on hypocritical examples), but it actually may not perform well even on clean examples, not to mention adversarial examples. Thus a new imperfection of the classifier can be found in G_hyp = Acc(F) − Acc(D), the performance gap between the accuracy evaluated on a clean set sampled from D and on a hypocritically perturbed set F. Again, F looks no different from D but can stealthily upgrade the performance. Once a deployer trusts the hypocritical performance carefully designed by a false friend and uses the "well-performed" model in real-world applications, potential security concerns appear even in benign environments. Thus we need methods to defend our models from false friends, that is, to make our models have self-knowledge.

We propose a defense method that improves model robustness against hypocritical perturbations. Specifically, we formalize the hypocritical risk and minimize it via a differentiable surrogate loss (Section 3). Experimentally, we verify the effectiveness of our proposed attack (Section 2.1) and defense (Section 4.1). Further, we study the transferability of hypocritical examples across models trained with various methods (Section 4.2). Finally, we conclude our paper by discussing and summarizing our results (Section 5 and Section 6). Our main contributions are:

• We give a formal definition of hypocritical examples. We demonstrate the unreliability of the standard evaluation process in the presence of false friends and show the potential security risk of deploying a model with high hypocritical performance.

• We formalize the hypocritical risk and analyze its relation with natural risk and adversarial risk. We propose the first defense method specialized for hypocritical examples by minimizing the tradeoff between the natural risk and an upper bound of the hypocritical risk.

• Extensive experiments verify the effectiveness of our proposed methods. We also examine the transferability of hypocritical examples. We show that the transferability is not always desired by the attackers, depending on their purpose.

2 FALSE FRIENDS AND ADVERSARIES
Better an open enemy than a false friend! Only by being aware of the potential risk of the false friend can we prevent it. In this section, we expose a kind of false friend, who is capable of stealthily manipulating model performance during the evaluation process, thus making the evaluation results unreliable.

We consider a classification task with data (x, y) ∈ R^d × {1, ..., C} from a distribution D. Denote by f : R^d → {1, ..., C} the classifier which predicts the class of an input example x: f(x) = argmax_k p_k(x), where p_k(x) is the k-th component of p(x) : R^d → Δ_C (e.g., the output after softmax activation), in which Δ_C = {u ∈ R^C | 1^T u = 1, u ≥ 0} is the probability simplex.

Adversarial examples are malicious inputs crafted by an adversary to induce misclassification. We first give the commonly accepted definition of adversarial examples as follows:

Definition 1 (Adversarial Examples). Given a classifier f and a correctly classified input (x, y) ∼ D (i.e., f(x) = y), an ε-bounded adversarial example is an input x* ∈ R^d such that f(x*) ≠ y and x* ∈ B_ε(x).

The assumption underlying this definition is that inputs satisfying x* ∈ B_ε(x) preserve the label y of the original input x. The reason for the existence of adversarial examples is that a model is overly sensitive to non-semantic changes. Next, we formalize a complementary phenomenon to adversarial examples, called hypocritical examples. Hypocritical examples are malicious inputs crafted by a false friend to stealthily correct the prediction of a model:

Definition 2 (Hypocritical Examples). Given a classifier f and a misclassified input (x, y) ∼ D (i.e., f(x) ≠ y), an ε-bounded hypocritical example is an input x* ∈ R^d such that f(x*) = y and x* ∈ B_ε(x).

Like adversarial examples, hypocritical examples are bounded so as to preserve the label of the original input, and they are another consequence of the excessive sensitivity of a classifier. Acting as a false friend, a hypocritical example can be generated from a misclassified example by maximizing

    max_{x' ∈ B_ε(x)} 1(f(x') = y),    (1)

which is equivalent to minimizing

    min_{x' ∈ B_ε(x)} 1(f(x') ≠ y),    (2)

where 1(·) is the indicator function. Similar to Madry et al. (2018); Wang et al. (2020b), in practice we leverage the commonly used cross-entropy (CE) loss as the surrogate loss of 1(f(x') ≠ y) and minimize it by projected gradient descent (PGD), as sketched below.

Note that Equation 2 looks similar to, but conceptually differs from, the known targeted adversarial attack (Carlini & Wagner, 2017), which generates a kind of adversarial example defined on correctly classified clean inputs and targeted to wrong classes. The hypocritical examples here are defined on misclassified inputs and are targeted to their right classes.
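To make the attack concrete, the following is a minimal PyTorch sketch of the PGD-based hypocritical attack described above, assuming an ℓ∞ threat model and a classifier `model` that outputs logits; the function name `hypocritical_pgd` and the default hyperparameters are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def hypocritical_pgd(model, x, y, eps=8/255, step_size=2/255, steps=50):
    """Perturb x within an l-inf ball of radius eps so that the model predicts
    the true label y, i.e., minimize the CE surrogate of Equation 2."""
    model.eval()
    x_hyp = x.clone().detach()
    for _ in range(steps):
        x_hyp.requires_grad_(True)
        loss = F.cross_entropy(model(x_hyp), y)
        grad = torch.autograd.grad(loss, x_hyp)[0]
        with torch.no_grad():
            # Gradient *descent* on the CE loss: the opposite of an adversarial attack.
            x_hyp = x_hyp - step_size * grad.sign()
            x_hyp = torch.min(torch.max(x_hyp, x - eps), x + eps)  # project to the eps-ball
            x_hyp = x_hyp.clamp(0.0, 1.0)                          # stay in the valid pixel range
    return x_hyp.detach()
```

Replacing the descent step with ascent (`+ step_size * grad.sign()`) recovers the standard untargeted PGD adversary, which makes the "two sides of the same coin" relation explicit.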
2.1 ATTACK RESULTS

In this subsection, we demonstrate the power of our proposed hypocritical attack on three benchmark datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al., 2009), and ImageNet (Russakovsky et al., 2015).

We attack models trained with the standard approach using clean examples (Standard) and models that are randomly initialized without training (Naive). For MNIST, the hypocritically perturbed set F and the adversarially perturbed set A are constructed by attacking every example in the clean test set sampled from D. Both attacks are bounded by an ℓ∞ ball with radius ε = 0.3. For ImageNet, F and A are constructed based on its validation set sampled from D. Both attacks are bounded by an ℓ∞ ball with radius ε = 16/255. For each experiment, we conduct 3 trials with different random seeds and report the averaged result to reduce the impact of random variations. Appendix A.2 describes further experimental details about the DNN architectures and training procedures, and gives more results.

Table 1: Accuracy (%) evaluated on MNIST. Attacks are bounded with ε = 0.3.

Model               F       D      A
Naive (MLP)         100.0   10.4   0.0
Naive (LeNet)       79.2    10.1   0.0
Standard (MLP)      100.0   97.8   29.8
Standard (LeNet)    100.0   99.4   0.1

Table 2: Accuracy (%) evaluated on ImageNet. Attacks are bounded with ε = 16/255.

Model                 F       D      A
Naive (VGG16)         100.0   0.1    0.0
Naive (ResNet50)      12.6    0.1    0.0
Standard (VGG16)      99.9    71.6   0.3
Standard (ResNet50)   99.9    76.1   0.0

Results on MNIST and ImageNet are summarized in Table 1 and Table 2, respectively. First, we find that the naturally trained models are extremely sensitive to hypocritical perturbations (e.g., Standard (MLP) and Standard (LeNet) achieve 100.0% accuracy on the hypocritically perturbed MNIST test set, and Standard (VGG16) and Standard (ResNet50) achieve 99.9% accuracy on the hypocritically perturbed ImageNet validation set). Second, we find that some of the randomly initialized models are extremely sensitive as well (e.g., Naive (MLP) and Naive (VGG16) achieve 100.0% accuracy on F for MNIST and ImageNet, respectively). These results demonstrate the unreliability of the standard evaluation process in the presence of false friends. Once a "well-performed" model (such as Naive (MLP) or Naive (VGG16)) is permitted to be deployed in real-world applications because the deployer has a false sense of its performance, potential security concerns appear even in benign environments.

It seems that the Naive (ResNet50) model is relatively robust to hypocritical examples on ImageNet. But that is just a trivial defense: it simply predicts most points in the input region as a single class because of the poor scaling of network weights at initialization (He et al., 2016b; Elsayed et al., 2019). More discussion is in Appendix A.2. Therefore, it is not enough to blindly pursue robustness against hypocritical perturbations while ignoring the performance on clean examples.
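The following sketch shows how an evaluation of this kind can be organized: every clean example is attacked once by a false friend and once by an adversary, and the gaps G_hyp = Acc(F) − Acc(D) and G_adv = Acc(D) − Acc(A) are reported. The callables `attack_hyp` and `attack_adv` are placeholders (e.g., the `hypocritical_pgd` sketch above and its ascent-based adversarial counterpart), as are the model and loader.

```python
import torch

@torch.no_grad()
def accuracy(model, batches):
    correct = total = 0
    for x, y in batches:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total

def evaluate_gaps(model, loader, attack_hyp, attack_adv):
    """Build the perturbed sets F and A by attacking every clean example,
    then report G_hyp = Acc(F) - Acc(D) and G_adv = Acc(D) - Acc(A)."""
    clean, hyp, adv = [], [], []
    for x, y in loader:
        clean.append((x, y))
        hyp.append((attack_hyp(model, x, y), y))   # false friend: push toward y
        adv.append((attack_adv(model, x, y), y))   # adversary: push away from y
    acc_d, acc_f, acc_a = accuracy(model, clean), accuracy(model, hyp), accuracy(model, adv)
    return {"Acc(D)": acc_d, "Acc(F)": acc_f, "Acc(A)": acc_a,
            "G_hyp": acc_f - acc_d, "G_adv": acc_d - acc_a}
```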
3 HYPOCRITICAL RISK

In this section, we formalize the hypocritical risk and analyze the relation between natural risk, adversarial risk, and hypocritical risk. We propose a defense method specialized for hypocritical examples by minimizing the tradeoff between natural risk and an upper bound of the hypocritical risk. Moreover, by decomposing an existing method designed for adversarial defense (TRADES (Zhang et al., 2019)), we find that, surprisingly, TRADES minimizes not only the adversarial risk on correctly classified examples, but also a looser upper bound of the hypocritical risk. Our theoretical analysis suggests that TRADES can be another candidate defense method against hypocritical examples.

To characterize the adversarial robustness of a classifier f, Madry et al. (2018); Uesato et al. (2018); Cullina et al. (2018) defined the adversarial risk under the threat model of a bounded ε ball:

    R_adv(f) = E_{(x,y)∼D} [ max_{x' ∈ B_ε(x)} 1(f(x') ≠ y) ].    (3)

The standard measure of classifier performance, known as the natural risk, is denoted R_nat(f) = E_{(x,y)∼D} [1(f(x) ≠ y)]. Let q(x, y) be the probability density function of the data distribution D. We denote by S_f^+ the conditional data distribution on correctly classified examples w.r.t. f, with conditional density function q(x, y | E) = q(x, y) / Z(E) if E is true (otherwise q(x, y | E) = 0), where the event E is f(x) = y and Z(E) = ∫ 1(f(x) = y) dq(x, y) is a normalizing constant. We denote by S_f^- the conditional data distribution on misclassified examples, with the same form of conditional density and f(x) ≠ y as the event E. Then we have the following relation between the natural risk and the adversarial risk:
Proposition 1. Denote the adversarial risk on correctly classified examples by

    R̂_adv(f) = E_{(x,y)∼S_f^+} [ max_{x' ∈ B_ε(x)} 1(f(x') ≠ y) ],

then we have R_adv(f) = R_nat(f) + (1 − R_nat(f)) · R̂_adv(f).

Proposition 1 shows that we can view the adversarial risk R_adv(f) as a tradeoff between R_nat(f) and R̂_adv(f) with scaling parameter λ = 1 − R_nat(f). The adversarial risk on correctly classified examples R̂_adv(f) is in sharp contrast to the hypocritical risk defined on misclassified examples, formalized as follows:

Definition 3 (Hypocritical Risk). The hypocritical risk on misclassified examples of a classifier f under the threat model of a bounded ε ball is defined as

    R̂_hyp(f) = E_{(x,y)∼S_f^-} [ max_{x' ∈ B_ε(x)} 1(f(x') = y) ].

R̂_hyp(f) is the proportion of originally misclassified examples that the classifier classifies correctly after a false friend's perturbation. When considering the existence of false friends, a good model should have not only low natural risk but also low hypocritical risk, so as to be robust against hypocritical perturbations.
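A small sketch, under the assumption that per-example outcomes have already been recorded as boolean arrays, of how the quantities in Proposition 1 and Definition 3 can be estimated on a finite test set; the function and argument names are illustrative.

```python
import numpy as np

def empirical_risks(clean_correct, adv_success, hyp_success):
    """clean_correct[i]: f(x_i) == y_i on the clean example.
    adv_success[i]: some x' in B_eps(x_i) with f(x') != y_i was found.
    hyp_success[i]: some x' in B_eps(x_i) with f(x') == y_i was found."""
    clean_correct = np.asarray(clean_correct, dtype=bool)
    adv_success = np.asarray(adv_success, dtype=bool)
    hyp_success = np.asarray(hyp_success, dtype=bool)

    r_nat = 1.0 - clean_correct.mean()
    # Adversarial risk restricted to correctly classified examples (hat R_adv)
    r_adv_hat = adv_success[clean_correct].mean()
    # Hypocritical risk restricted to misclassified examples (hat R_hyp, Definition 3)
    r_hyp_hat = hyp_success[~clean_correct].mean()
    # Proposition 1: R_adv = R_nat + (1 - R_nat) * hat R_adv
    r_adv = r_nat + (1.0 - r_nat) * r_adv_hat
    return dict(R_nat=r_nat, R_adv_hat=r_adv_hat, R_hyp_hat=r_hyp_hat, R_adv=r_adv)
```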
3.1 TRADEOFF BETWEEN NATURAL AND HYPOCRITICAL RISKS

Motivated by the tradeoff between natural and adversarial risks (Tsipras et al., 2019; Zhang et al., 2019), we notice that there may also exist an inherent tension between the goal of natural risk minimization and that of hypocritical risk minimization. To illustrate the phenomenon, we provide a toy example here, which is modified from the example in Zhang et al. (2019) and whose risk minimization solutions can be found analytically.

Consider the case (x, y) ∈ R × {−1, +1} from a distribution D, where the marginal distribution over the instance space is uniform over [0, 1], and, for k = 0, 1, 2, ... (so that the intervals below tile [0, 1]),

    η(x) := Pr(y = +1 | x) takes one constant value for x ∈ [2kε, (2k+1)ε) and another for x ∈ ((2k+1)ε, (2k+2)ε].    (4)

See Figure 2 for a visualization of η(x). In this problem, we consider two classifiers: a) the Bayes optimal classifier sign(2η(x) − 1); b) the all-one classifier which always outputs "positive". Table 3 displays the trade-off between natural and hypocritical risks: the minimal natural risk is achieved by the Bayes optimal classifier, which has large hypocritical risk, while the optimal hypocritical risk is achieved by the all-one classifier, which has large natural risk.

Figure 2: Counterexample given by Equation 4.

Table 3: Comparison of the Bayes optimal classifier and the all-one classifier in terms of R_nat and R̂_hyp.

3.2 UPPER BOUNDS OF HYPOCRITICAL RISK
It is natural then to optimize our models to minimize natural and hypocritical risks at the same time. However, it is hard to optimize R̂_hyp(f) directly. To ease these optimization obstacles, we derive the following upper bounds.
Theorem 1. For any data distribution D and its corresponding conditional distribution on misclassified examples S_f^- w.r.t. a classifier f, we have

    R̂_hyp(f) = E_{(x,y)∼S_f^-} 1(f(x_hyp) = y)
             ≤ E_{(x,y)∼S_f^-} 1(f(x_hyp) ≠ f(x)) =: R̄_hyp(f)
             ≤ E_{(x,y)∼S_f^-} 1(f(x_rev) ≠ f(x)) =: R̃_hyp(f),

where x_hyp = argmax_{x' ∈ B_ε(x)} 1(f(x') = y) and x_rev = argmax_{x' ∈ B_ε(x)} 1(f(x') ≠ f(x)).

Here x_rev means a perturbation that pursues reversing a clean example to a different class, from the point of view of the model. The upper bounds in Theorem 1 allow us to optimize the hypocritical risk using surrogate loss functions that are both physically meaningful and computationally tractable. Before moving on to algorithmic design, we state a useful proposition below, which reveals the internal mechanism behind TRADES.

Proposition 2. R_rev(f) = (1 − R_nat(f)) · R̂_adv(f) + R_nat(f) · R̃_hyp(f) = E_{(x,y)∼D} 1(f(x_rev) ≠ f(x)).

Proposition 2 shows a connection between adversarial risk and hypocritical risk: the adversarial risk on correctly classified examples R̂_adv(f) and the looser upper bound of the hypocritical risk on misclassified examples R̃_hyp(f) can be seamlessly united into a new risk on all examples, R_rev(f). We name it the reversible risk, since minimizing it pursues a model whose predictions cannot be reversed by small perturbations.
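In practice, x_rev can be searched for with PGD by ascending the KL divergence between the model's output distributions at the perturbed and clean points, as in the TRADES inner maximization. The sketch below assumes a PyTorch classifier returning logits; the name `reverse_pgd` and the defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def reverse_pgd(model, x, eps=8/255, step_size=2/255, steps=10):
    """Search for x_rev in B_eps(x): a point the model maps to a different
    prediction than f(x), by ascending KL(p(x) || p(x'))."""
    model.eval()
    with torch.no_grad():
        p_clean = F.softmax(model(x), dim=1)
    x_rev = (x + 0.001 * torch.randn_like(x)).detach()   # small random start
    for _ in range(steps):
        x_rev.requires_grad_(True)
        log_p = F.log_softmax(model(x_rev), dim=1)
        loss = F.kl_div(log_p, p_clean, reduction="batchmean")
        grad = torch.autograd.grad(loss, x_rev)[0]
        with torch.no_grad():
            x_rev = x_rev + step_size * grad.sign()       # ascend: push the prediction away from f(x)
            x_rev = torch.min(torch.max(x_rev, x - eps), x + eps).clamp(0.0, 1.0)
    return x_rev.detach()
```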
3.3 ALGORITHMIC DESIGN

Now we are ready to design objective functions that improve model robustness against hypocritical examples while keeping model accuracy on clean examples.

Similar to Zhang et al. (2019); Wang et al. (2020b); Tramèr et al. (2020); Raghunathan et al. (2020), we propose a defense objective that minimizes the tradeoff between the natural risk and the tighter upper bound of the hypocritical risk:

    R_THRM(f) = R_nat(f) + λ · R̄_hyp(f),    (5)

where λ > 0 is a tunable scaling parameter balancing the importance of natural risk and hypocritical risk. We name our method THRM (Tradeoff for Hypocritical Risk Minimization).

Optimization over the 0-1 loss in THRM is still intractable. In practice, for the indicator function 1(f(x) ≠ y) in R_nat(f), we adopt the commonly used CE loss as the surrogate loss. Observing that R_nat(f) · R̄_hyp(f) = E_{(x,y)∼D} [1(f(x) ≠ y) · 1(f(x_hyp) ≠ f(x))], we absorb the R_nat(f) term into λ and use the KL divergence as the surrogate loss of the indicator function 1(f(x_hyp) ≠ f(x)) (Zheng et al., 2016; Zhang et al., 2019; Wang et al., 2020b), since f(x_hyp) ≠ f(x) implies that the perturbed examples have output distributions different from those of the clean examples. Our final objective function for THRM becomes

    L_THRM = E_{(x,y)∼D} [ L_CE(p(x), y) + λ · L_KL(p(x), p(x_hyp)) ].    (6)

Intuition behind the objective L_THRM: the first term in Equation 6 encourages the natural risk to be optimized, while the second regularization term encourages the output to be stable against hypocritical perturbations; that is, the classifier should not be overly confident in its predictions, especially when a false friend wants it to be.

To derive the objective function for TRADES, we can minimize the tradeoff between the natural risk and the reversible risk:

    R_TRADES(f) = R_nat(f) + λ · R_rev(f).    (7)

Similar to THRM, we use the CE loss and the KL divergence as the surrogate losses of 1(f(x) ≠ y) and 1(f(x_rev) ≠ f(x)), respectively. The final objective function becomes

    L_TRADES = E_{(x,y)∼D} [ L_CE(p(x), y) + λ · L_KL(p(x), p(x_rev)) ],    (8)

which is exactly the multi-class classification objective function first proposed in Zhang et al. (2019) for adversarial defense. From the perspective of the hypocritical risk, our Proposition 2 reveals an advantage behind it: TRADES is capable of minimizing the looser upper bound of hypocritical risk R̃_hyp(f), and thus can be considered as a candidate defense method for hypocritical examples. Proposition 2 also implies that there may be a deeper connection between adversarial robustness and hypocritical robustness. We will discuss this more and compare our proposed THRM with TRADES in the next section.

Figure 3: Tradeoff between natural risk and hypocritical risk on real-world datasets. (a) On MNIST, perturbations are bounded by the ℓ∞ norm with ε = 0.3. (b) On CIFAR-10, perturbations are bounded by the ℓ∞ norm with ε = 2/255. Each panel plots R_nat (%) against R̂_hyp (%), with trendlines for TRADES and THRM.
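A sketch of one loss computation for Equations 6 and 8, reusing the `hypocritical_pgd` and `reverse_pgd` helpers sketched earlier for the inner maximization; it illustrates the structure of the objectives and is not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def thrm_or_trades_loss(model, x, y, lam, mode="thrm", eps=8/255):
    """L = CE(p(x), y) + lam * KL(p(x) || p(x_pert)), where x_pert is
    x_hyp for THRM (Eq. 6) and x_rev for TRADES (Eq. 8)."""
    if mode == "thrm":
        x_pert = hypocritical_pgd(model, x, y, eps=eps)   # descend CE toward the true label
    else:
        x_pert = reverse_pgd(model, x, eps=eps)           # ascend KL away from the clean prediction
    model.train()
    logits_clean = model(x)
    logits_pert = model(x_pert)
    ce = F.cross_entropy(logits_clean, y)
    kl = F.kl_div(F.log_softmax(logits_pert, dim=1),
                  F.softmax(logits_clean, dim=1),
                  reduction="batchmean")
    return ce + lam * kl

# Typical usage inside a training loop (optimizer assumed, e.g., SGD over model.parameters()):
#   loss = thrm_or_trades_loss(model, x, y, lam=6.0, mode="trades")
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```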
4 EXPERIMENTS
In this section, to verify the effectiveness of the methods (THRM and TRADES) suggested in Section 3.3, we conduct experiments on real-world datasets including MNIST and CIFAR-10.
4.1 WHITE-BOX ANALYSIS
For a wide range of the scaling parameter λ, we conduct experiments in parallel over multiple NVIDIA Tesla V100 GPUs. On MNIST, perturbations are bounded by the ℓ∞ norm with ε = 0.3. On CIFAR-10, models are trained against 3 different hypocritical attackers bounded by the ℓ∞ norm with ε = 1/255, ε = 2/255, and ε = 8/255, respectively. Each experiment is conducted 3 times with different random seeds. The hypocritical risk reported here is actually an approximation of the real value, since the inner optimization problem is NP-hard and we approximately solve it using a surrogate loss and PGD on the test set. Further details about model architecture and training procedure are in Appendix A.3. Note that these experiments are extensive: it takes over 230 GPU days to completely train the models considered in this section. We believe that these experiments are beneficial to the ML community for further understanding the tradeoffs and relative merits of THRM and TRADES.

Results on MNIST (ε = 0.3) and CIFAR-10 (ε = 2/255) are shown in Figure 3. Each data point represents a model trained with a different λ. More results, including a comparison with Madry's defense (Madry et al., 2018), are provided in Appendix A.3 due to limited space. First, we observe that, on both datasets, as the regularization parameter λ increases, the natural risk R_nat increases while the hypocritical risk R̂_hyp decreases, which verifies the effectiveness of our proposed method and the theoretical analysis in Proposition 2, where we reveal that TRADES is capable of minimizing a looser upper bound of the hypocritical risk. Second, we show that THRM achieves a better tradeoff on MNIST since it optimizes a tighter upper bound than TRADES. However, the situation becomes nuanced on CIFAR-10. As we can see in Figure 3(b), THRM seems to behave better at first when λ is small but is surpassed by TRADES as λ increases. Overall, optimizing only a tighter upper bound of the hypocritical risk achieves a better tradeoff on the test set when the task is relatively simple (e.g., on MNIST with ε = 0.3), while simultaneously optimizing hypocritical risk and adversarial risk achieves a better tradeoff when the task tends to be hard (e.g., on CIFAR-10 with ε = 2/255 and ε = 8/255).

The above phenomenon shows that, when dealing with finite sample sizes and finite-time gradient-descent-trained classifiers, better adversarial robustness may help the generalization of hypocritical robustness, which conforms to our intuition that they are two sides of the same coin. Interestingly, a contemporary work claims that, on CIFAR-10, TRADES achieves better adversarial robustness than Madry's defense under fair hyperparameter settings (Anonymous, 2021). Thus there may be potential mutual benefits between adversarial robustness and hypocritical robustness. After all, robust training objectives force DNNs to be invariant to signals that humans are invariant to, which may lead to feature representations that are more similar to what humans use (Salman et al., 2020a). A rigorous treatment of this synergism is beyond the scope of the current paper but is an important future direction.

4.2 TRANSFERABILITY ANALYSIS
Transferability of adversarial examples across models is well known (Tramèr et al., 2017; Papernot et al., 2017b; Ilyas et al., 2019), and here we examine the transferability of hypocritical examples on MNIST and CIFAR-10. We observe that hypocritical examples: i) transfer easily between naturally trained models; ii) are hard to transfer from randomly initialized models to other models (and vice versa); iii) are hard to transfer from standard models to defended models; and iv) when generated from THRM models, usually have high transferability. Experimental details are in Appendix A.4; a sketch of the transfer-rate computation is given below.

Better transferability is beneficial for black-box attacks but is not always desired by hypocritical attackers. A hypocritical attacker only expects high transferability to the targeted model the attacker chose to help. If there are other competing models available to the deployer, the attacker actually does not want the hypocritical examples to transfer successfully to those competing models. Thus fine-grained attack methods are required. We leave this to future work.
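A sketch of the transfer-rate computation behind the heatmaps in Appendix A.4, assuming a dictionary of named PyTorch models and the `hypocritical_pgd` sketch from Section 2 as the attack; the helper names are illustrative.

```python
import torch

@torch.no_grad()
def predicts_true(model, x, y):
    return model(x).argmax(dim=1) == y

def transfer_matrix(models, loader, attack_hyp):
    """Entry (i, j): fraction of hypocritical examples crafted on source model i
    (from its own misclassified clean inputs) that target model j classifies correctly."""
    names = list(models)
    rates = torch.zeros(len(names), len(names))
    counts = torch.zeros(len(names))
    for x, y in loader:
        for i, src in enumerate(names):
            wrong = ~predicts_true(models[src], x, y)            # the source model's own mistakes
            if wrong.sum() == 0:
                continue
            x_hyp = attack_hyp(models[src], x[wrong], y[wrong])  # craft on the source model
            counts[i] += wrong.sum()
            for j, tgt in enumerate(names):
                rates[i, j] += predicts_true(models[tgt], x_hyp, y[wrong]).sum()
    return rates / counts.clamp(min=1).unsqueeze(1)
```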
5 DISCUSSION

The false friends considered in this paper are as powerful as typical adversaries: they all know the ground-truth labels of clean examples. Such powerful friends can actually help a model not only correctly classify a misclassified clean example but also correctly classify an adversarial example crafted by an adversary. One may expect to rely on true friends against adversaries. Unfortunately, an omniscient and faithful friend is unachievable in practical tasks, so far at least. Once it is achieved, the problem of robustness disappears immediately. What we can do at present is to use a relatively more robust model as a surrogate for the true friend to improve the robustness of a weak model. This induces a promising general method in practice: high-performance models can be employed as true friends to help a weak model without exposing training data and model weights, for the purpose of privacy protection and knowledge transfer (Abadi et al., 2016; Papernot et al., 2017a).

Concurrent to our work, Salman et al. (2020b) propose a similar idea that one can manipulate input examples to decrease the hardness of prediction. Our studies mainly concern that these examples could be used by false friends in an adversarial way, while Salman et al. (2020b) demonstrate that these examples can be beneficial to models by turning "false friends" into faithful "true friends", as discussed above.

We showed that correctly classified examples (hypocritical examples) can easily be found in the vicinity of misclassified clean examples. As a result, a hypocritically perturbed set can be constructed from these hypocritical examples. The victim model's standard accuracy evaluated on the hypocritically perturbed set becomes higher than that on the clean set. It is natural then to wonder:
How about the adversarially robust accuracy (i.e., accuracy under adversarial perturbations) of the victim model on hypocritical examples?
It’s easy to see that, if the adversary is bounded by the same (cid:15) ball as the false friend, the model’s adversarial accuracy evaluated on hypocritically perturbed setis zero, since a misclassified example exists in the (cid:15) ball of a hypocritical example (by definition).However, if the adversary’s power is restricted by another δ ball such that δ < (cid:15) , then a robusthypocritical example may exist in the vicinity of a clean example so that a δ -bounded adversary cannot change the model’s prediction on the robust hypocritical example. In such a case, the model’sadversarial accuracy evaluated on the robustly hypocritically perturbed set could be higher than thaton the clean set. New attack and defense methods are required to further explore this phenomenon. ONCLUSION
6 CONCLUSION

In this work, we expose a new risk arising from excessive sensitivity: model performance becomes hypocritical in the presence of false friends. By formalizing the hypocritical risk and analyzing its relation with natural risk and adversarial risk, we propose to use THRM and TRADES as defense methods against hypocritical perturbations. Extensive experiments verify the effectiveness of these methods. These findings open new avenues for mitigating and exploiting model sensitivity.
REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308-318, 2016.
National Highway Traffic Safety Administration et al. Federal automated vehicles policy: Accelerating the next revolution in roadway safety. US Department of Transportation, 2016.
Sergio A Alvarez. An exact analytical relation among recall, precision, and classification accuracy in information retrieval. Boston College, Boston, Technical Report BCCS-02-01, pp. 1-22, 2002.
Anonymous. Bag of tricks for adversarial training. In Submitted to International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Xb8xvrtB8Ce. Under review.
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML), 2018.
Battista Biggio, Igino Corona, Davide Maiorca, B Nelson, N Srndic, P Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2013.
UMTRI Briefs. Mcity grand opening. Research Review, 46(3), 2015.
Michael Buckland and Fredric Gey. The relationship between recall and precision. Journal of the American Society for Information Science, 45(1):12-19, 1994.
Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), 2017.
Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
Lin Chen, Yifei Min, Mingrui Zhang, and Amin Karbasi. More data can expand the generalization gap between adversarially robust and standard models. In International Conference on Machine Learning (ICML), 2020.
Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Edgar Dobriban, Hamed Hassani, David Hong, and Alexander Robey. Provable tradeoffs in adversarially robust classification. arXiv preprint arXiv:2006.05161, 2020.
Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Gamaleldin Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alexey Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Gamaleldin F. Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial reprogramming of neural networks. In International Conference on Learning Representations (ICLR), 2019.
Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
Ian Goodfellow and Nicolas Papernot. Is attacking machine learning easier than defending it? Blog post, February 15, 2017.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.
Kun He, Yan Wang, and John Hopcroft. A powerful generative model using random weights for the deep image representation. In Advances in Neural Information Processing Systems (NeurIPS), 2016b.
Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Saumya Jetley, Nicholas Lord, and Philip Torr. With friends like these, who needs adversaries? In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations (ICLR) Workshops, 2016.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
LI Lei. On the establishment of autonomous vehicles regulatory system in China. Journal of Beijing Institute of Technology (Social Sciences Edition), (2):17, 2018.
Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations (ICLR), 2017.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P), 2016.
Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In International Conference on Learning Representations (ICLR), 2017a.
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS), 2017b.
Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In International Conference on Machine Learning (ICML), 2019.
Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. In International Conference on Machine Learning (ICML), 2020.
Leslie Rice, Eric Wong, and J Zico Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning (ICML), 2020.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry. Do adversarially robust ImageNet models transfer better? arXiv preprint arXiv:2007.08489, 2020a.
Hadi Salman, Andrew Ilyas, Logan Engstrom, Sai Vemprala, Aleksander Madry, and Ashish Kapoor. Unadversarial examples: Designing objects for robust vision. arXiv preprint arXiv:2012.12235, 2020b.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, and Jörn-Henrik Jacobsen. Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. In International Conference on Machine Learning (ICML), 2020.
Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347, 2020.
Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR), 2019.
Jonathan Uesato, Brendan O'Donoghue, Pushmeet Kohli, and Aaron Oord. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning (ICML), 2018.
Hongjun Wang, Guangrun Wang, Ya Li, Dongyu Zhang, and Liang Lin. Transferable, controllable, and inconspicuous adversarial attacks on person re-identification with deep mis-ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020a.
Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations (ICLR), 2020b.
Dongxian Wu, Yisen Wang, Shu-Tao Xia, James Bailey, and Xingjun Ma. Skip connections matter: On the transferability of adversarial examples generated with ResNets. In International Conference on Learning Representations (ICLR), 2020a.
Tong Wu, Liang Tong, and Yevgeniy Vorobeychik. Defending against physically realizable attacks on image classification. In International Conference on Learning Representations (ICLR), 2020b.
Kaidi Xu, Gaoyuan Zhang, Sijia Liu, Quanfu Fan, Mengshu Sun, Hongge Chen, Pin-Yu Chen, Yanzhi Wang, and Xue Lin. Adversarial t-shirt! Evading person detectors in a physical world. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Haichao Zhang and Jianyu Wang. Towards adversarially robust object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning (ICML), 2019.
Jingfeng Zhang, Xilie Xu, Bo Han, Gang Niu, Lizhen Cui, Masashi Sugiyama, and Mohan Kankanhalli. Attacks which do not kill training make adversarial learning stronger. In International Conference on Machine Learning (ICML), 2020.
Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
A EXPERIMENTAL DETAILS

A.1 DETAILS IN FIGURE 1
Attack procedure. In adversarial attacks, we perturb clean inputs to maximize the surrogate loss using PGD. In hypocritical attacks, we perturb clean inputs to minimize the surrogate loss using PGD. In both attacks, for the purpose of imperceptibility, we execute the PGD attack for 100 steps (with a step size that is a fixed fraction of ε) with early stopping on ImageNet, under an ℓ∞ budget ε.
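A sketch of the early-stopping variant mentioned here: the PGD descent freezes each example as soon as the model already predicts the desired label, which keeps perturbations small. Names and defaults are illustrative and follow the same PyTorch setting as the sketch in Section 2.

```python
import torch
import torch.nn.functional as F

def hypocritical_pgd_early_stop(model, x, y, eps, step_size, steps=100):
    """PGD that minimizes the CE loss toward y, but freezes each example once
    it is already classified as y (early stopping for imperceptibility)."""
    model.eval()
    x_hyp = x.clone().detach()
    for _ in range(steps):
        with torch.no_grad():
            done = model(x_hyp).argmax(dim=1) == y         # already "helped": stop updating
        if done.all():
            break
        x_hyp.requires_grad_(True)
        loss = F.cross_entropy(model(x_hyp), y)
        grad = torch.autograd.grad(loss, x_hyp)[0]
        with torch.no_grad():
            step = step_size * grad.sign()
            step[done] = 0.0                               # keep finished examples fixed
            x_hyp = torch.min(torch.max(x_hyp - step, x - eps), x + eps).clamp(0.0, 1.0)
    return x_hyp.detach()
```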
More examples. More adversarial examples and hypocritical examples generated on ImageNet using our methods are shown in Figure 5. More hypocritical examples generated on MNIST and CIFAR-10 are shown in Figure 4(a) and Figure 4(b). The victim models are LeNet (Standard) and Wide ResNet (Standard) for MNIST and CIFAR-10, respectively. They are trained with the same procedures described in Appendix A.2. In both attacks, for the purpose of imperceptibility, we execute 100-step PGD attacks (with a step size that is a fixed fraction of ε) with early stopping on MNIST and CIFAR-10, under the respective ℓ∞ budgets ε.

Figure 4: Hypocritical examples. (a) On MNIST. (b) On CIFAR-10. In each subfigure, the first column shows the clean examples sampled from the original data distribution, the second column shows the generated perturbations, and the third column shows the perturbed examples. Perturbations are rescaled for display. Red labels and black labels below images denote misclassification and correct classification, respectively.
Figure 5: More examples on ImageNet. (a) Adversarial examples. (b) Hypocritical examples. In each subfigure, the first column shows the clean examples sampled from the original data distribution, the second column shows the generated perturbations, and the third column shows the perturbed examples. Perturbations are rescaled for display. The model predictions of these images are shown below each image. Red labels and black labels below images denote misclassification and correct classification, respectively.
A.2 DETAILS IN SECTION 2.1
Architecture.
For MNIST, a four-layer multilayer perceptron (MLP) (2 hidden layers, 768 neurons in each) with ReLU activations and a variant of the LeNet model (2 convolutional layers of sizes 32 and 64, and a fully connected layer of size 1024) are adopted. For CIFAR-10, a four-layer MLP (2 hidden layers, 3072 neurons in each) with ReLU activations, a ResNet18 (He et al., 2016a), and a Wide ResNet (Zagoruyko & Komodakis, 2016) (with depth 28 and width factor 10) are adopted. For ImageNet, a VGG16 (Simonyan & Zisserman, 2014) and a ResNet50 (He et al., 2016a) are adopted.
Training procedure. i) Models trained with the standard approach using clean examples (Standard). For MNIST, models are trained for 80 epochs with the Adam optimizer with batch size 128 and a learning rate of 0.001. Early stopping is done by holding out 1000 examples from the MNIST training set. For CIFAR-10, models are trained for 150 epochs with the SGD optimizer with batch size 128; the learning rate starts at 0.1 and is divided by 10 at epochs 90 and 125. We apply weight decay of 2e-4 and momentum of 0.9. Early stopping is done by holding out 1000 examples from the CIFAR-10 training set. For ImageNet, we use the pretrained standard models available within PyTorch (torchvision.models).
ii) Models that are randomly initialized without training (Naive). For all models, we use the default PyTorch initialization, except that we initialize the convolutional weights in Wide ResNet with He initialization (He et al., 2015). We conduct all the experiments using a single NVIDIA Tesla V100 GPU. Each experiment is conducted 3 times with different random seeds, except for the standard models trained on ImageNet, for which we use the pretrained standard models available within PyTorch.
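For concreteness, a sketch of the standard CIFAR-10 training recipe described above (SGD with batch size 128, initial learning rate 0.1 divided by 10 at epochs 90 and 125, weight decay 2e-4, momentum 0.9, 150 epochs); the model constructor and data loader are placeholders, and the early-stopping check on the held-out examples is not shown.

```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

def train_standard_cifar10(model, train_loader, epochs=150, device="cuda"):
    """Clean training only, no perturbations (the 'Standard' models)."""
    model.to(device).train()
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=2e-4)
    scheduler = MultiStepLR(optimizer, milestones=[90, 125], gamma=0.1)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```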
Attack procedure.
In adversarial attacks, we perturb clean inputs to maximize the surrogate loss using PGD. In hypocritical attacks, we perturb clean inputs to minimize the surrogate loss using PGD. In both attacks, we execute 50-step PGD attacks (with a step size that is a fixed fraction of ε) with 20 random restarts on MNIST and CIFAR-10, and we use 50-step PGD attacks on ImageNet. Other hyperparameter choices did not yield a significant change in accuracy. On MNIST, the hypocritically perturbed set F and the adversarially perturbed set A are constructed by attacking every example in the clean test set sampled from D. Both attacks are bounded by an ℓ∞ ball with radius ε = 0.3. On CIFAR-10, both attacks are bounded by an ℓ∞ ball with radius ε = 8/255. On ImageNet, F and A are constructed based on its validation set sampled from D. Both attacks are bounded by an ℓ∞ ball with radius ε = 16/255.
Numerical results. The attack results on CIFAR-10 are shown in Table 4. Full results of Table 1, Table 2, and Table 4 are shown in Table 5, Table 6, and Table 7, respectively. Moreover, we show the attack results of 9 Naive models evaluated on ImageNet in Table 6. We find that all the Naive models in the VGG family achieve high accuracy on F, and all the Naive models in the ResNet family have relatively poor performance on F. In particular, the Naive (ResNet152) model in Trial 1 is invariant to hypocritical perturbations: even in the presence of a strong false friend, the hypocritical performance is still as low as the clean performance (only 0.1%). We carefully examined the Naive (ResNet152) model and found that it is actually a trivial classifier, which classifies almost all points in the input region [0, 1]^d as a certain class for simple reasons such as the poor scaling of network weights at initialization. Therefore, it is not enough to blindly pursue robustness against hypocritical perturbations while ignoring the performance on clean examples. Once we train a Naive model with clean examples, the model becomes vulnerable immediately (see Standard (ResNet50)), whereas the trained weights are better conditioned (Elsayed et al., 2019).

Table 4: Accuracy (%) of models evaluated on CIFAR-10. Attacks are bounded with ε = 8/255.

Model                   F       D      A
Naive (MLP)             92.3    9.9    0.0
Naive (ResNet18)        20.8    8.7    0.3
Naive (Wide ResNet)     13.8    10.0   6.9
Standard (MLP)          88.6    45.1   3.9
Standard (ResNet18)     100.0   94.1   0.0
Standard (Wide ResNet)  100.0   95.1   0.0

Table 5: Full results of accuracy (%) evaluated on MNIST. Attacks are bounded with ε = 0.3.

                    Trial 1               Trial 2               Trial 3
Model               F      D     A       F      D     A       F      D     A
Naive (MLP)         100.0  11.2  0.0     100.0  10.8  0.0     100.0  9.4   0.0
Naive (LeNet)       69.1   9.7   0.0     77.5   10.6  0.0     91.0   10.1  0.0
Standard (MLP)      100.0  98.0  31.1    100.0  97.7  30.4    100.0  97.7  28.1
Standard (LeNet)    100.0  99.4  0.1     100.0  99.4  0.1     100.0  99.3  0.0

Table 6: Full results of accuracy (%) evaluated on ImageNet. Attacks are bounded with ε = 16/255.

                      Trial 1              Trial 2              Trial 3
Model                 F      D    A       F      D    A       F      D    A
Naive (VGG11)         100.0  0.1  0.0     100.0  0.1  0.0     100.0  0.1  0.0
Naive (VGG13)         100.0  0.1  0.0     100.0  0.1  0.0     100.0  0.1  0.0
Naive (VGG16)         100.0  0.1  0.0     100.0  0.1  0.0     100.0  0.1  0.0
Naive (VGG19)         100.0  0.1  0.0     100.0  0.1  0.0     100.0  0.1  0.0
Naive (ResNet18)      58.4   0.1  0.0     83.2   0.1  0.0     57.6   0.1  0.0
Naive (ResNet34)      7.4    0.1  0.0     12.5   0.1  0.0     10.4   0.1  0.0
Naive (ResNet50)      10.6   0.1  0.0     14.7   0.1  0.0     12.6   0.1  0.0
Naive (ResNet101)     0.3    0.1  0.1     0.2    0.1  0.1     0.3    0.1  0.1
Naive (ResNet152)     0.1    0.1  0.1     0.3    0.1  0.1     0.2    0.1  0.1
Standard (VGG16)      99.9   71.6 0.3     N/A    N/A  N/A     N/A    N/A  N/A
Standard (ResNet50)   99.9   76.1 0.0     N/A    N/A  N/A     N/A    N/A  N/A
A.3 DETAILS IN SECTION 4.1
Architecture.
For MNIST, a variant of the LeNet model (2 convolutional layers of sizes 32 and 64, and a fully connected layer of size 1024) is adopted. For CIFAR-10, a Wide ResNet (with depth 28 and width factor 10) is adopted.
Training procedure.
For a wide range of the scaling parameter λ, we conduct experiments in parallel over multiple NVIDIA Tesla V100 GPUs. Each experiment is conducted 3 times with different random seeds. For MNIST, all models (including Standard, Madry, TRADES, and THRM) are trained for 80 epochs with the Adam optimizer with batch size 128 and a learning rate of 0.001. Early stopping is done by holding out 1000 examples from the MNIST training set, as suggested in Rice et al. (2020). For CIFAR-10, all models are trained for 150 epochs with the SGD optimizer with batch size 128; the learning rate starts at 0.1 and is divided by 10 at epochs 90 and 125. We apply weight decay of 2e-4 and momentum of 0.9. Early stopping is done by holding out 1000 examples from the CIFAR-10 training set, as suggested in Rice et al. (2020).
Attack procedure. For the inner maximization in the objective function of THRM, we perturb clean inputs to minimize the CE loss as the surrogate loss. For the inner maximization in TRADES, we maximize the KL divergence as the surrogate loss. For the inner maximization in Madry, we maximize the CE loss as the surrogate loss. On MNIST, the training attack is PGD with random start and 10 iterations. On CIFAR-10, the training attack is PGD with random start and 10 iterations when ε = 8/255, and PGD with random start and 7 iterations when ε = 1/255 and ε = 2/255. In all experiments, the test attack is 50-step PGD with 20 random restarts. Other hyperparameter choices did not yield a significant change in accuracy.
Numerical results. The natural risk reported here is estimated on the test set. The hypocritical risk reported here is also estimated on the test set and is an approximation of the real value, since we approximately solve the optimization problem by PGD on examples from the test set. Results on MNIST (ε = 0.3) and CIFAR-10 (ε = 1/255, ε = 2/255, and ε = 8/255) are shown in Figure 6. Each point in Figure 6 represents one model trained with a certain λ. Full numerical results on MNIST (ε = 0.3) and CIFAR-10 (ε = 1/255, ε = 2/255, and ε = 8/255) can be found in Table 8, Table 9, Table 10, and Table 11, respectively. On MNIST (ε = 0.3), THRM has a better tradeoff than TRADES. However, when the task becomes hard, TRADES performs as well as or better than THRM. On CIFAR-10, as the task becomes harder (the larger the radius ε, the harder the task), the gap between TRADES and THRM becomes larger. This phenomenon shows that better adversarial robustness may help the generalization of hypocritical robustness, especially when the task is hard. Moreover, we compare our methods with Madry et al. (2018)'s defense designed for adversarial robustness (denoted Madry) and the standard training method (denoted Standard). We summarize the results in Table 12. For direct comparison, we pick a certain λ for each model trained by TRADES and THRM in each task. We observe that, in all tasks, Madry's defense has nonnegligible robustness to hypocritical examples, although neither the hypocritical risk nor its upper bound appears in the objective function; the adversarially trained models (Madry and TRADES) instead optimize the adversarial risk in Equation 3 via a surrogate loss. This phenomenon indicates that optimizing only the adversarial risk can bring a certain degree of robustness against hypocritical examples. While these experimental results partly support our hypothesis (i.e., the potential mutual benefits between robustness against adversarial perturbations and against hypocritical perturbations), we do not take the evidence as definitive, and further exploration is needed. We note that the standard deviation becomes larger when λ is bigger in TRADES and THRM, which we attribute to optimization difficulty and which results in more significant differences among trials. Reducing the initial learning rate may mitigate this phenomenon. For completeness, we further evaluate the adversarial risk on correctly classified examples of the models trained by THRM and TRADES. Results on MNIST (ε = 0.3) and CIFAR-10 (ε = 2/255) are summarized in Table 13 and Table 14, respectively. One interesting finding is that models trained with THRM manifest noteworthy adversarial robustness, especially on CIFAR-10, although there is no adversarial risk in the objective function of THRM. These facts also support the hypothesis of potential mutual benefits between adversarial robustness and hypocritical robustness.

Table 7: Full results of accuracy (%) evaluated on CIFAR-10. Attacks are bounded with ε = 8/255.

                        Trial 1              Trial 2              Trial 3
Model                   F      D     A      F      D     A      F      D     A
Naive (MLP)             98.2   9.0   0.0    91.4   10.3  0.0    87.4   10.3  0.0
Naive (ResNet18)        14.8   10.0  0.6    20.7   9.2   0.2    27.0   6.9   0.0
Naive (Wide ResNet)     12.2   10.0  7.7    10.4   10.0  9.6    18.7   10.0  3.3
Standard (MLP)          89.1   46.2  3.6    88.4   44.8  3.8    88.3   44.5  4.2
Standard (ResNet18)     100.0  93.9  0.0    100.0  94.3  0.0    100.0  94.0  0.0
Standard (Wide ResNet)  100.0  95.0  0.0    100.0  95.1  0.0    100.0  95.3  0.0
Figure 6: Tradeoff between natural risk and hypocritical risk on real-world datasets. Each panel plots $\mathcal{R}_{\mathrm{nat}}$ (%) against $\hat{\mathcal{R}}_{\mathrm{hyp}}$ (%) for TRADES and THRM, together with their trendlines. (a) On MNIST; perturbations are bounded by $l_\infty$ norm with $\epsilon = 0.3$. (b) On CIFAR-10 with $\epsilon = 1/255$. (c) On CIFAR-10 with $\epsilon = 2/255$. (d) On CIFAR-10 with $\epsilon = 8/255$.

Table 8: Full results of natural risk (%) and hypocritical risk (%) on MNIST. Attacks are bounded by $l_\infty$ norm with $\epsilon = 0.3$. (a) For TRADES; (b) for THRM. Each sub-table reports $\mathcal{R}_{\mathrm{nat}}$ and $\hat{\mathcal{R}}_{\mathrm{hyp}}$ over three trials for each $\lambda$.

Table 9: Full results of natural risk (%) and hypocritical risk (%) on CIFAR-10. Attacks are bounded by $l_\infty$ norm with $\epsilon = 1/255$. (a) For TRADES; (b) for THRM.

Table 10: Full results of natural risk (%) and hypocritical risk (%) on CIFAR-10. Attacks are bounded by $l_\infty$ norm with $\epsilon = 2/255$. (a) For TRADES; (b) for THRM.

Table 11: Full results of natural risk (%) and hypocritical risk (%) on CIFAR-10. Attacks are bounded by $l_\infty$ norm with $\epsilon = 8/255$. (a) For TRADES; (b) for THRM.

Table 12: Comparison of natural risk (% ± std over 3 random trials) and hypocritical risk (% ± std over 3 random trials) between methods on real-world datasets. Attacks are bounded by $l_\infty$ norm. (a) On MNIST ($\epsilon = 0.3$); (b) on CIFAR-10 ($\epsilon = 1/255$); (c) on CIFAR-10 ($\epsilon = 2/255$); (d) on CIFAR-10 ($\epsilon = 8/255$). Each sub-table compares Standard, Madry, TRADES (at a chosen $\lambda$) and THRM (at a chosen $\lambda$).

Table 13: Evaluated results of natural risk (%) and adversarial risk (%) on MNIST. Attacks are bounded by $l_\infty$ norm with $\epsilon = 0.3$. (a) For TRADES; (b) for THRM. Each sub-table reports $\mathcal{R}_{\mathrm{nat}}$ and $\hat{\mathcal{R}}_{\mathrm{adv}}$ over three trials for each $\lambda$.

Table 14: Evaluated results of natural risk (%) and adversarial risk (%) on CIFAR-10. Attacks are bounded by $l_\infty$ norm with $\epsilon = 2/255$. (a) For TRADES; (b) for THRM.

DETAILS IN SECTION

Hypocritical examples are generated on the source models. Note that the optimization method used here is not intended to pursue state-of-the-art transferability, but to examine the transferability of hypocritical examples; many methods designed to improve the transferability of adversarial examples (Liu et al., 2017; Dong et al., 2018; Wu et al., 2020a) may be extended to hypocritical examples. Figure 7 shows the transferability heatmap of the hypocritical attack over 9 models trained on MNIST. Figure 8 shows the transferability heatmap of the hypocritical attack over 7 models trained on CIFAR-10. The value in the $i$-th row and $j$-th column of each heatmap matrix is the proportion of the hypocritical examples successfully transferred to target model $j$ out of all hypocritical examples generated by source model $i$ (including both successful and failed attacks on the source model).
Figure 7: Transferability of hypocritical examples on MNIST. Attacks are bounded by $l_\infty$ norm with $\epsilon = 0.3$. The source and target models are Naive (MLP), Standard (MLP), Naive (LeNet), Standard (LeNet), Naive (LeNet*), Standard (LeNet*), Madry (LeNet), TRADES (LeNet) and THRM (LeNet); entries are transfer rates (%). "(LeNet*)" means the same architecture as "(LeNet)" but with a different random initialization.
Figure 8: Transferability of hypocritical examples on CIFAR-10. Attacks are bounded by $l_\infty$ norm with $\epsilon = 8/255$. The source and target models are Naive (MLP), Standard (MLP), Naive (Wide ResNet), Standard (Wide ResNet), Madry (Wide ResNet), TRADES (Wide ResNet) and THRM (Wide ResNet); entries are transfer rates (%).
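The transfer rates shown in the heatmaps can be computed as in the sketch below. This is schematic rather than the exact evaluation script: it reuses the hypothetical `hypocritical_pgd` helper from the earlier sketch, and it interprets "successfully transferred" as the target model classifying the perturbed example as the true label.

```python
import torch

def transfer_matrix(models, loader, eps, alpha, steps):
    """Entry (i, j) is the fraction of hypocritical examples crafted on source
    model i (one per example that model i misclassifies, so both successful and
    failed source attacks are counted) that target model j classifies correctly."""
    n = len(models)
    hits = torch.zeros(n, n)
    totals = torch.zeros(n)
    for x, y in loader:
        for i, source in enumerate(models):
            with torch.no_grad():
                mis = source(x).argmax(dim=1) != y  # craft only on misclassified examples
            if not mis.any():
                continue
            x_hyp = hypocritical_pgd(source, x[mis], y[mis], eps, alpha, steps)
            totals[i] += mis.sum().item()
            with torch.no_grad():
                for j, target in enumerate(models):
                    hits[i, j] += (target(x_hyp).argmax(dim=1) == y[mis]).sum().item()
    return 100.0 * hits / totals.clamp(min=1).unsqueeze(1)  # percentages
```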
B PROOFS OF MAIN RESULTS
In this section, we provide the proofs of our main results.

B.1 A PROOF OF PROPOSITION 1
Proposition 1. Denote the adversarial risk on correctly classified examples by
$$\hat{\mathcal{R}}_{\mathrm{adv}}(f) = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{+}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big],$$
then we have $\mathcal{R}_{\mathrm{adv}}(f) = \mathcal{R}_{\mathrm{nat}}(f) + \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\hat{\mathcal{R}}_{\mathrm{adv}}(f)$.

Proof.
$$\begin{aligned}
\mathcal{R}_{\mathrm{adv}}(f)
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] \\
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\mathbb{1}\big(f(\mathbf{x})= y\big)\cdot\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] + \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\mathbb{1}\big(f(\mathbf{x})\neq y\big)\cdot\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] \\
&= \mathcal{R}_{\mathrm{nat}}(f)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] + \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{+}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] \\
&= \mathcal{R}_{\mathrm{nat}}(f)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\big[\mathbb{1}\big(f(\mathbf{x})\neq y\big)\big] + \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{+}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] \\
&= \mathcal{R}_{\mathrm{nat}}(f) + \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{+}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] \\
&= \mathcal{R}_{\mathrm{nat}}(f) + \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\hat{\mathcal{R}}_{\mathrm{adv}}(f).
\end{aligned}$$
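As a quick numerical sanity check of this decomposition, with purely illustrative numbers (not values from our experiments), if $\mathcal{R}_{\mathrm{nat}}(f) = 0.1$ and $\hat{\mathcal{R}}_{\mathrm{adv}}(f) = 0.5$, then Proposition 1 gives
$$\mathcal{R}_{\mathrm{adv}}(f) = 0.1 + (1 - 0.1)\times 0.5 = 0.55,$$
i.e., the 10% of naturally misclassified examples count as adversarially vulnerable by definition, and half of the remaining 90% can additionally be perturbed into an error, for 55% in total.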
B.2 A PROOF OF THEOREM 1
Theorem 1. For any data distribution $\mathcal{D}$ and its corresponding conditional distribution on misclassified examples $\mathcal{S}_f^{-}$ w.r.t. a classifier $f$, we have
$$\underbrace{\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})= y\big)}_{\hat{\mathcal{R}}_{\mathrm{hyp}}(f)} \;\le\; \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})\neq f(\mathbf{x})\big) \;\le\; \underbrace{\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})\big)}_{\mathcal{R}_{\mathrm{hyp}}(f)},$$
where $\mathbf{x}_{\mathrm{hyp}} = \arg\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')= y\big)$ and $\mathbf{x}_{\mathrm{rev}} = \arg\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')\neq f(\mathbf{x})\big)$.

Proof. To prove the first inequality, we have
$$\hat{\mathcal{R}}_{\mathrm{hyp}}(f) = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(f(\mathbf{x}')= y\big)\Big] = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})= y\big) \le \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})\neq f(\mathbf{x})\big),$$
since for $(\mathbf{x},y)\sim\mathcal{S}_f^{-}$ (so that $f(\mathbf{x})\neq y$),
$$\mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})= y\big) = \begin{cases} \mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})\neq f(\mathbf{x})\big), & \text{if } f(\mathbf{x}_{\mathrm{hyp}})= y, \\ 0 \le \mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})\neq f(\mathbf{x})\big), & \text{if } f(\mathbf{x}_{\mathrm{hyp}})\neq y. \end{cases}$$
To prove the second inequality, we have
$$\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{hyp}})\neq f(\mathbf{x})\big) \le \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})\big) = \mathcal{R}_{\mathrm{hyp}}(f).$$
Since $(\mathbf{x},y)\sim\mathcal{S}_f^{-}$, we have $f(\mathbf{x})\neq y$. If there exists an $\mathbf{x}_{\mathrm{hyp}}$ such that $f(\mathbf{x}_{\mathrm{hyp}})= y$, then $f(\mathbf{x}_{\mathrm{hyp}})\neq f(\mathbf{x})$; taking $\mathbf{x}_{\mathrm{rev}} = \mathbf{x}_{\mathrm{hyp}}$ shows that $f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})$ holds. Otherwise, if no $\mathbf{x}_{\mathrm{hyp}}$ with $f(\mathbf{x}_{\mathrm{hyp}})= y$ exists, there may still exist an $\mathbf{x}_{\mathrm{rev}}$ such that $f(\mathbf{x}_{\mathrm{rev}})\neq y$ but $f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})$. Therefore, the above inequalities hold.

B.3 A PROOF OF PROPOSITION 2

Proposition 2. $\mathcal{R}_{\mathrm{rev}}(f) = \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\hat{\mathcal{R}}_{\mathrm{adv}}(f) + \mathcal{R}_{\mathrm{nat}}(f)\,\mathcal{R}_{\mathrm{hyp}}(f) = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\,\mathbb{1}\big(f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})\big)$.

Proof.
$$\begin{aligned}
\mathcal{R}_{\mathrm{rev}}(f)
&= \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\hat{\mathcal{R}}_{\mathrm{adv}}(f) + \mathcal{R}_{\mathrm{nat}}(f)\,\mathcal{R}_{\mathrm{hyp}}(f) \\
&= \big(1-\mathcal{R}_{\mathrm{nat}}(f)\big)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{+}}\big[\mathbb{1}\big(f(\mathbf{x}_{\mathrm{adv}})\neq y\big)\big] + \mathcal{R}_{\mathrm{nat}}(f)\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{S}_f^{-}}\big[\mathbb{1}\big(f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})\big)\big] \\
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\mathbb{1}\big(f(\mathbf{x})= y\big)\cdot \mathbb{1}\big(f(\mathbf{x}_{\mathrm{adv}})\neq y\big)\big] + \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\mathbb{1}\big(f(\mathbf{x})\neq y\big)\cdot \mathbb{1}\big(f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})\big)\big] \\
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\mathbb{1}\big(f(\mathbf{x})= y\big)\cdot \max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})}\mathbb{1}\big(f(\mathbf{x}')\neq y\big)\Big] + \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\mathbb{1}\big(f(\mathbf{x})\neq y\big)\cdot \max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})}\mathbb{1}\big(f(\mathbf{x}')\neq f(\mathbf{x})\big)\Big] \\
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\mathbb{1}\big(f(\mathbf{x})= y\big)\cdot \max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})}\mathbb{1}\big(f(\mathbf{x}')\neq f(\mathbf{x})\big)\Big] + \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\mathbb{1}\big(f(\mathbf{x})\neq y\big)\cdot \max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})}\mathbb{1}\big(f(\mathbf{x}')\neq f(\mathbf{x})\big)\Big] \\
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\Big[\max_{\mathbf{x}'\in\mathcal{B}_\epsilon(\mathbf{x})}\mathbb{1}\big(f(\mathbf{x}')\neq f(\mathbf{x})\big)\Big] \\
&= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\mathbb{1}\big(f(\mathbf{x}_{\mathrm{rev}})\neq f(\mathbf{x})\big)\big].
\end{aligned}$$
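Continuing the illustrative numbers used after Proposition 1 (again not experimental values), if additionally $\mathcal{R}_{\mathrm{hyp}}(f) = 0.8$, then
$$\mathcal{R}_{\mathrm{rev}}(f) = (1 - 0.1)\times 0.5 + 0.1 \times 0.8 = 0.53,$$
so 53% of all examples admit an in-ball perturbation that changes the model's prediction, either from a correct prediction to an error or from the original mistake to a different output (possibly the correct one).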
Figure 9: A visualization of the toy example illustrating the tradeoff between adversarial and hypocritical risks. The oracle decision boundary is the circle given by Equation 9; the model decision boundary is the line given by Equation 10 with a fixed threshold $b$. Points in the green and red shaded regions are correctly classified and misclassified by the model, respectively. The non-robust region (points that can be perturbed by a small perturbation so as to reverse the model's prediction) splits into a correctly classified non-robust region and a misclassified non-robust region.
C TRADEOFF BETWEEN ADVERSARIAL AND HYPOCRITICAL RISKS
Although the experiments in Section 4.1 and Appendix A.3 show that, for finite sample sizes and classifiers trained by gradient descent for a finite time, there may be mutual benefits between adversarial robustness and hypocritical robustness on real-world datasets, we note that, in general, this synergy does not necessarily exist. We illustrate the phenomenon with another toy example, inspired by the precision-recall tradeoff (Buckland & Gey, 1994; Alvarez, 2002).

Consider the case $(\mathbf{x}, y) \in \mathbb{R}^2 \times \{-1, +1\}$ from a distribution $\mathcal{D}$, where the marginal distribution over the instance space is a uniform distribution over a square. We assume that the decision boundary of the oracle (ground truth) is a circle:
$$O(\mathbf{x}) = \mathrm{sign}\big(r - \|\mathbf{x} - \mathbf{c}\|\big), \quad (9)$$
where $\mathbf{c}$ is the centre and $r$ is the radius. The points inside the circle are labeled as belonging to the positive class; otherwise they are labeled as belonging to the negative class. We consider the linear classifier $f$ with fixed $\mathbf{w} = (0, 1)^\top$ and a tunable threshold $b$:
$$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top\mathbf{x} - b) = \mathrm{sign}(x_2 - b). \quad (10)$$
See Figure 9 for a visualization of the oracle and the linear classifier over the instance space. In this problem, we can exhibit the tradeoffs by tuning the threshold $b$ of the linear classifier. The precision is the number of true positives (i.e., the number of examples correctly classified as the positive class) divided by the sum of true positives and false positives (i.e., the number of examples misclassified as the positive class). The recall is the number of true positives divided by the sum of true positives and false negatives (i.e., the number of examples misclassified as the negative class). We compare the adversarial risk on correctly classified examples $\hat{\mathcal{R}}_{\mathrm{adv}}(f)$ defined in Proposition 1 and the hypocritical risk on misclassified examples $\hat{\mathcal{R}}_{\mathrm{hyp}}(f)$ defined in Definition 3. The computing formulas of these values are visualized in Figure 10. Here we choose the bounded ball $\mathcal{B}_\epsilon(\mathbf{x}) = \{\mathbf{x}' \in \mathbb{R}^2 : \|\mathbf{x}' - \mathbf{x}\| \le \epsilon\}$ with a small radius $\epsilon$ as the threat model. A numerical sketch of this construction is given at the end of this section.

Figure 11 plots the curves of precision and recall versus the threshold $b$. We can see that there is an obvious precision-recall tradeoff between the two gray dotted lines. Similarly, Figure 12 plots the curves of $\hat{\mathcal{R}}_{\mathrm{adv}}(f)$ and $\hat{\mathcal{R}}_{\mathrm{hyp}}(f)$ versus the threshold $b$.

Figure 10: Visualization of the computing formulas of precision, recall, adversarial risk, and hypocritical risk in the toy example. These values can be viewed as the proportions of the areas of different regions.
Figure 11: The tradeoff between precision and recall in the toy example. Precision and recall (%) are plotted against the threshold value $b$.
Figure 12: The tradeoff between adversarial and hypocritical risks in the toy example. The natural risk, the adversarial risk on correctly classified examples $\hat{\mathcal{R}}_{\mathrm{adv}}(f)$, and the hypocritical risk on misclassified examples $\hat{\mathcal{R}}_{\mathrm{hyp}}(f)$ are plotted (%) against the threshold value $b$.
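The curves in Figures 11 and 12 can be reproduced by direct Monte Carlo estimation over the toy distribution. The sketch below uses placeholder values for the circle centre, the radius, the perturbation budget and the instance-space square, since the exact constants are not repeated here, and it assumes an $l_2$ threat model; it illustrates the computation rather than reproducing the exact figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder constants for the toy problem (illustrative, not the paper's exact values).
CENTRE = np.array([0.5, 0.5])
RADIUS = 0.3
EPS = 0.05          # l2 perturbation budget
N = 200_000

x = rng.uniform(0.0, 1.0, size=(N, 2))                                # uniform over a square
y = np.where(np.linalg.norm(x - CENTRE, axis=1) <= RADIUS, 1, -1)     # oracle labels (Eq. 9)

def toy_metrics(b):
    """Precision, recall, adversarial risk on correctly classified examples, and
    hypocritical risk on misclassified examples for f(x) = sign(x_2 - b) (Eq. 10)."""
    pred = np.where(x[:, 1] > b, 1, -1)
    correct = pred == y
    tp = np.sum((pred == 1) & (y == 1))
    precision = tp / max(np.sum(pred == 1), 1)
    recall = tp / max(np.sum(y == 1), 1)
    # Under the linear model, the prediction can be reversed inside the eps-ball
    # iff the point lies within distance eps of the decision line x_2 = b.
    flippable = np.abs(x[:, 1] - b) <= EPS
    adv_risk = np.sum(correct & flippable) / max(np.sum(correct), 1)
    # In binary classification, flipping a misclassified prediction yields the
    # oracle label, so a misclassified point is hypocritically fixable iff flippable.
    hyp_risk = np.sum(~correct & flippable) / max(np.sum(~correct), 1)
    return precision, recall, adv_risk, hyp_risk

for b in np.linspace(0.0, 1.0, 11):
    print(f"b={b:.1f}", toy_metrics(b))
```

Sweeping the threshold $b$ in this way traces out both the precision-recall tradeoff and the tradeoff between $\hat{\mathcal{R}}_{\mathrm{adv}}(f)$ and $\hat{\mathcal{R}}_{\mathrm{hyp}}(f)$ under the assumed constants.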