Mixed Nash Equilibria in the Adversarial Examples Game
Laurent Meunier, Meyer Scetbon, Rafael Pinot, Jamal Atif, Yann Chevaleyre
Laurent Meunier∗, Meyer Scetbon∗, Rafael Pinot, Jamal Atif, Yann Chevaleyre
LAMSADE, Université Paris-Dauphine; Facebook AI Research, Paris; CREST, ENSAE; École Polytechnique Fédérale de Lausanne
Abstract
This paper tackles the problem of adversarial examples from a game theoretic point of view. We study the open question of the existence of mixed Nash equilibria in the zero-sum game formed by the attacker and the classifier. While previous works usually allow only one player to use randomized strategies, we show the necessity of considering randomization for both the classifier and the attacker. We demonstrate that this game has no duality gap, meaning that it always admits approximate Nash equilibria. We also provide the first optimization algorithms to learn a mixture of classifiers that approximately realizes the value of this game, i.e., procedures to build an optimally robust randomized classifier.
1 Introduction

Adversarial examples [6, 34] are one of the most puzzling problems in machine learning: state-of-the-art classifiers are sensitive to imperceptible perturbations of their inputs that make them fail. In recent years, research has concentrated on proposing new defense methods [13, 24, 25] and building more and more sophisticated attacks [11, 15, 18, 22]. So far, most defense strategies have proved to be vulnerable to these new attacks or are computationally intractable. This raises the following question: can we build classifiers that are robust against any adversarial attack?

A recent line of research argued that randomized classifiers could help counter adversarial attacks [17, 28, 39, 40]. Along this line, [27] demonstrated, using game theory, that randomized classifiers are indeed more robust than deterministic ones against regularized adversaries. However, the findings of these previous works depend on the definition of the adversary they consider. In particular, they did not investigate scenarios where the adversary also uses randomized strategies, which is essential to account for if we want to give a principled answer to the above question. Previous works studying adversarial examples through the lens of game theory investigated the randomized framework (for both the classifier and the adversary) in restricted settings where the adversary is either parametric or has a finite number of strategies [8, 26, 31]. Our framework does not assume any constraint on the definition of the adversary, making our conclusions independent of the adversary the classifiers are facing. More precisely, we answer the following questions.
Q1:
Is it always possible to reach a mixed Nash equilibrium in the adversarial examples game when both the adversary and the classifier can use randomized strategies?
A1:
We answer positively to this question. First, we motivate in Section 2 the necessity of using randomized strategies for both the attacker and the classifier. Then, we extend the work of [29] by rigorously reformulating the adversarial risk as a linear optimization problem over distributions. In fact, we cast the adversarial risk minimization problem as a Distributionally Robust Optimization (DRO) [7] problem for a well-suited cost function. This formulation naturally leads us, in Section 3, to analyze adversarial risk minimization as a zero-sum game. We demonstrate that, in this game, the duality gap always equals 0, meaning that it always admits approximate mixed Nash equilibria.

Q2: Can we design efficient algorithms to learn an optimally robust randomized classifier?
A2:
To answer this question, we focus on learning a finite mixture of classifiers. Taking inspiration from robust optimization [33] and subgradient methods [9], we derive in Section 4 a first oracle algorithm to optimize over a finite mixture. Then, following the line of work of [16], we introduce an entropic regularization which allows us to effectively compute an approximation of the optimal mixture. We validate our findings with experiments on a simulated dataset and on a real image dataset, namely CIFAR-10 [21].

Figure 1: Motivating example: the blue distribution represents label −1 and the red one label +1. The height of a column represents its mass. The red and blue arrows represent the attack on the given classifier. Left and middle: deterministic classifiers (f_1 on the left, f_2 in the middle), for which the blue point can always be attacked. Right: a randomized classifier, for which the attacker has a probability 1/2 of failing, regardless of the attack it selects.

2.1 A Motivating Example

Consider the binary classification task illustrated in Figure 1. We assume that all input-output pairs (X, Y) are sampled from a distribution P defined as follows:

    P(Y = ±1) = 1/2,   P(X = 0 | Y = −1) = 1,   P(X = ±2 | Y = 1) = 1/2.

Given access to P, the adversary aims to maximize the expected risk, but can only move each point by at most 1 on the real line. In this context, we study two classifiers: f_1(x) = −x − 1/2 and f_2(x) = x − 1/2. Both f_1 and f_2 have a standard risk of 1/4. In the presence of an adversary, this risk (a.k.a. the adversarial risk) increases to 3/4. Here, using a randomized classifier can make the system more robust. Consider f̃ where f̃ = f_1 w.p. 1/2 and f_2 otherwise. The standard risk of f̃ remains 1/4, but its adversarial risk is 1/2 < 3/4. Indeed, when attacking f̃, any adversary has to choose between moving points from 0 to 1/2 or to −1/2. Either way, the attack only works half of the time; hence an overall adversarial risk of 1/2. Furthermore, if f̃ knows the strategy the adversary uses, it can always update the probabilities it gives to f_1 and f_2 to get a better (possibly deterministic) defense. For example, if the adversary chooses to always move 0 to 1/2, the classifier can set f̃ = f_1 w.p. 1 to retrieve an adversarial risk of 1/4 instead of 1/2.

Now, what happens if the adversary can use randomized strategies, meaning that for each point it can flip a coin before deciding where to move it? In this case, the adversary could decide to move points from 0 to 1/2 w.p. 1/2 and to −1/2 otherwise. This strategy is still optimal, with an adversarial risk of 1/2, but now the classifier cannot use its knowledge of the adversary's strategy to lower the risk. We are in a state where neither the adversary nor the classifier can benefit from unilaterally changing its strategy. In game theoretic terminology, this state is called a mixed Nash equilibrium. (Here, (X, Y) ∼ P is misclassified by f_i if and only if f_i(X) Y ≤ 0.)

2.2 General Setting

Let us consider a classification task with input space X and output space Y. Let (X, d) be a proper (i.e., closed balls are compact) Polish (i.e., complete and separable) metric space representing the input space. Let Y = {1, . . . , K} be the label set, endowed with the trivial metric d′(y, y′) = 1_{y ≠ y′}. Then the space (X × Y, d ⊕ d′) is a proper Polish space. For any Polish space Z, we denote M(Z) the Polish space of Borel probability measures on Z. Let us assume the data are drawn from P ∈ M(X × Y). Let (Θ, d_Θ) be a Polish space (not necessarily proper) representing the set of classifier parameters (for instance, neural networks). We also define a loss function l : Θ × (X × Y) → [0, ∞) satisfying the following set of assumptions.

Assumption 1 (Loss function).
1) The loss function l is a non-negative Borel measurable function. 2) For all θ ∈ Θ, l(θ, ·) is upper semi-continuous. 3) There exists M > 0 such that for all θ ∈ Θ and (x, y) ∈ X × Y, 0 ≤ l(θ, (x, y)) ≤ M.

It is usual to assume upper semi-continuity when studying optimization over distributions [7, 38]. Furthermore, considering bounded (and positive) loss functions is also very common in learning theory [2] and is not restrictive. In the adversarial examples framework, the loss of interest is the 0/1 loss, whose surrogates are not well understood [1, 14]; hence it is essential that the 0/1 loss satisfies Assumption 1. In the binary classification setting (i.e., Y = {−1, +1}), the 0/1 loss writes l_{0/1}(θ, (x, y)) = 1_{y f_θ(x) ≤ 0}. Then, assuming that for all θ, f_θ(·) is continuous and that for all x, f_·(x) is continuous, the 0/1 loss satisfies Assumption 1. In particular, this is the case for neural networks with continuous activation functions.

The standard risk of a single classifier θ associated with a loss l satisfying Assumption 1 writes:

    R(θ) := E_{(x,y)∼P}[ l(θ, (x, y)) ].

Similarly, the adversarial risk of θ at level ε associated with the loss l is defined as:

    R^ε_adv(θ) := E_{(x,y)∼P}[ sup_{x′∈X, d(x,x′)≤ε} l(θ, (x′, y)) ].

It is clear that R^0_adv(θ) = R(θ) for all θ. We can generalize these notions to distributions of classifiers: the classifier is then randomized according to some distribution μ ∈ M(Θ). A classifier is randomized if, for a given input, the output of the classifier is a probability distribution. The standard risk of a randomized classifier μ writes R(μ) := E_{θ∼μ}[R(θ)]. Similarly, the adversarial risk of the randomized classifier μ at level ε is:

    R^ε_adv(μ) := E_{(x,y)∼P}[ sup_{x′∈X, d(x,x′)≤ε} E_{θ∼μ}[ l(θ, (x′, y)) ] ].
For instance, for the 0/1 loss, the inner maximization problem consists in maximizing the probability of misclassification for a given pair (x, y). Note that R(δ_θ) = R(θ) and R^ε_adv(δ_θ) = R^ε_adv(θ). In the remainder of the paper, we study the adversarial risk minimization problems with randomized and deterministic classifiers, and denote

    V^ε_rand := inf_{μ∈M(Θ)} R^ε_adv(μ),   V^ε_det := inf_{θ∈Θ} R^ε_adv(θ).   (1)

Remark 1.
We can show (see Appendix E) that the standard risk infima are equal: V^0_rand = V^0_det. Hence, no randomization is needed for minimizing the standard risk. Denoting V this common value, we also have the following inequalities for any ε > 0: V ≤ V^ε_rand ≤ V^ε_det.

(Footnotes: For instance, for any norm ‖·‖, (R^d, ‖·‖) is a proper Polish metric space. For the well-posedness of these risks, see Lemma 4 in the Appendix.)

2.4 Distributional Formulation of the Adversarial Risk

To account for the possible randomness of the adversary, we rewrite the adversarial attack problem as a convex optimization problem over distributions. Let us first introduce the set of adversarial distributions.
Definition 1 (Set of adversarial distributions). Let P be a Borel probability distribution on X × Y and ε > 0. We define the set of adversarial distributions as

    A_ε(P) := { Q ∈ M(X × Y) | ∃ γ ∈ M((X × Y)²), d(x, x′) ≤ ε and y = y′ γ-a.s., Π_1♯γ = P, Π_2♯γ = Q },

where Π_i denotes the projection on the i-th component, and g♯ the push-forward measure by a measurable function g.

For an attacker that can move the initial distribution P within A_ε(P), the attack need not be a transport map as in the standard adversarial risk: for every point x in the support of P, the attacker is allowed to move x randomly within the ball of radius ε, and not only to a single other point x′ like the usual attacker in adversarial attacks. In this sense, we say the attacker is allowed to be randomized.

Link with DRO.
Adversarial examples have been studied in the light of DRO in former works [33, 37], but an exact reformulation of the adversarial risk as a DRO problem had not been made yet. When (Z, d) is a Polish space and c : Z × Z → R_+ ∪ {+∞} is a lower semi-continuous function, for P, Q ∈ M(Z), the primal optimal transport problem is defined as

    W_c(P, Q) := inf_{γ∈Γ_{P,Q}} ∫ c(z, z′) dγ(z, z′)

with Γ_{P,Q} := { γ ∈ M(Z × Z) | Π_1♯γ = P, Π_2♯γ = Q }. When η > 0 and P ∈ M(Z), the associated Wasserstein uncertainty set is defined as:

    B_c(P, η) := { Q ∈ M(Z) | W_c(P, Q) ≤ η }.

A DRO problem is a linear optimization problem over a Wasserstein uncertainty set, sup_{Q∈B_c(P,η)} ∫ g(z) dQ(z), for some upper semi-continuous function g [41]. For an arbitrary ε > 0, we define the cost c_ε as follows:

    c_ε((x, y), (x′, y′)) := 0 if d(x, x′) ≤ ε and y = y′, and +∞ otherwise.

This cost is lower semi-continuous and penalizes to infinity perturbations that change the label or move the input by a distance greater than ε. As Proposition 1 shows, the Wasserstein ball associated with c_ε is equal to A_ε(P).

Proposition 1.
Let P be a Borel probability distribution on X × Y, ε > 0 and η ≥ 0. Then B_{c_ε}(P, η) = A_ε(P). Moreover, A_ε(P) is convex and compact for the weak topology of M(X × Y).

Thanks to this result, we can reformulate the adversarial risk as the value of a convex problem over A_ε(P).

Proposition 2.
Let P be a Borel probability distribution on X × Y and μ a Borel probability distribution on Θ. Let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1, and let ε > 0. Then:

    R^ε_adv(μ) = sup_{Q∈A_ε(P)} E_{(x′,y′)∼Q, θ∼μ}[ l(θ, (x′, y′)) ].   (2)

The supremum is attained. Moreover, Q* ∈ A_ε(P) is an optimum of Problem (2) if and only if there exists γ* ∈ M((X × Y)²) such that Π_1♯γ* = P, Π_2♯γ* = Q*, and, γ*-almost surely, d(x, x′) ≤ ε, y = y′ and E_{θ∼μ}[ l(θ, (x′, y′)) ] = sup_{u∈X, d(x,u)≤ε} E_{θ∼μ}[ l(θ, (u, y)) ].

Proposition 2 means that, against a fixed classifier μ, the randomized attacker that can move the distribution within A_ε(P) has exactly the same power as an attacker that moves every single point x within the ball of radius ε. By Proposition 2, we also deduce that the adversarial risk can be cast as a linear optimization problem over distributions.

Remark 2.
In a recent work, [29] proposed a similar adversary using Markov kernels, but left open the question of the link with the classical adversarial risk, due to measurability issues. Proposition 2 solves these issues. The result is similar to [7]. Although we believe its proof might be extended to infinite-valued costs, [7] did not treat that case. We provide an alternative proof in this special case.
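To make the cost c_ε concrete, here is a minimal sketch (not from the paper) that evaluates W_{c_ε} between two uniform empirical distributions on the real line. For uniform measures with the same number of atoms, the infimum over couplings can be taken over permutations (Birkhoff's theorem), so brute force suffices at toy scale; a cost of 0 certifies Q ∈ A_ε(P), while +∞ means Q is not a reachable attack.

```python
from itertools import permutations

INF = float("inf")

def c_eps(p, q, eps):
    # the cost of the paper: 0 iff the input moves by at most eps and the label is kept
    (x, y), (xp, yp) = p, q
    return 0.0 if abs(x - xp) <= eps and y == yp else INF

def w_cost(P, Q, eps):
    # W_{c_eps} between two uniform empirical measures with the same number of atoms:
    # an optimal coupling can be taken as a permutation matrix (Birkhoff's theorem)
    n = len(P)
    return min(sum(c_eps(P[i], Q[s[i]], eps) for i in range(n)) / n
               for s in permutations(range(n)))

P = [(0.0, -1), (2.0, 1)]                          # atoms (x, y)
print(w_cost(P, [(0.9, -1), (1.5, 1)], eps=1.0))   # 0.0 -> Q belongs to A_eps(P)
print(w_cost(P, [(0.0, 1), (2.0, 1)], eps=1.0))    # inf -> a label was flipped
```

This factorial-time check is only meant to illustrate the definition; at scale, membership in A_ε(P) is a transportation feasibility problem and would be solved with a linear program.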
Thanks to Proposition 2, the adversarial risk minimization problem can be seen as a two-player zero-sum game that writes as follows:

    inf_{μ∈M(Θ)} sup_{Q∈A_ε(P)} E_{(x,y)∼Q, θ∼μ}[ l(θ, (x, y)) ].   (3)

In this game, the classifier's objective is to find the best distribution μ ∈ M(Θ) while the adversary is manipulating the data distribution. For the classifier, solving the infimum problem in Equation (3) simply amounts to solving the adversarial risk minimization problem, i.e., Problem (1), whether the classifier is randomized or not. Then, given a randomized classifier μ ∈ M(Θ), the goal of the attacker is to find a new data distribution Q in the set of adversarial distributions A_ε(P) that maximizes the risk of μ. More formally, the adversary looks for

    Q_μ ∈ argmax_{Q∈A_ε(P)} E_{(x,y)∼Q, θ∼μ}[ l(θ, (x, y)) ].

In game theoretic terminology, Q_μ is also called a best response of the attacker to the classifier μ, and we denote BR(μ) the set of such best responses.

Remark 3.
Note that for a given classifier μ there always exists a "deterministic" best response, i.e., one that maps every single point x to another single point T(x). Let T : X → X be defined such that for all (x, y) ∈ X × Y, E_{θ∼μ}[ l(θ, (T(x), y)) ] = sup_{x′: d(x,x′)≤ε} E_{θ∼μ}[ l(θ, (x′, y)) ]. Thanks to [4, Proposition 7.50], (T, id) is P-measurable. Then (T, id)♯P belongs to BR(μ). Therefore, T is an optimal "deterministic" attack against the classifier μ.

Every zero-sum game has a dual formulation that allows for a deeper understanding of the framework. Here, from Proposition 2, we can define the dual problem of adversarial risk minimization for randomized classifiers. This dual problem also characterizes a two-player zero-sum game, which writes as follows:

    sup_{Q∈A_ε(P)} inf_{μ∈M(Θ)} E_{(x,y)∼Q, θ∼μ}[ l(θ, (x, y)) ].   (4)

In this dual game, the adversary plays first and seeks an adversarial distribution that has the highest possible risk when faced with an arbitrary classifier. This means that it has to select an adversarial perturbation for every input x, without seeing the classifier first. In this case, as pointed out by the motivating example in Section 2.1, the attack can (and should) be randomized to ensure maximal harm against several classifiers. Then, given an adversarial distribution, the classifier's objective is to find the best possible classifier on this distribution. Let us denote D_ε the value of the dual problem. Since weak duality is always satisfied, we get

    D_ε ≤ V^ε_rand ≤ V^ε_det.   (5)

The inequalities in Equation (5) mean that the lowest risk the classifier can get (regardless of the game we look at) is D_ε. In particular, this means that the primal version of the game, i.e., the adversarial risk minimization problem, always has a value greater than or equal to D_ε. As we discussed in Section 2.1, this lower bound may not be attained by a deterministic classifier.
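The gap between deterministic and randomized defenses can be checked numerically on the motivating example of Section 2.1. The sketch below uses our reconstruction of its constants (class −1 at x = 0, class +1 at x = ±2, classifiers f1(x) = −x − 1/2 and f2(x) = x − 1/2, budget ε = 1), so treat the exact numbers as illustrative; the adversary is the pointwise best response, searched over a small grid of perturbations.

```python
# Toy model: Y = ±1 w.p. 1/2; X = 0 when Y = -1; X = ±2 w.p. 1/2 each when Y = +1.
# A point (x, y) is misclassified by f iff f(x) * y <= 0; adversarial budget eps = 1.
points = [(0.0, -1, 0.5), (2.0, 1, 0.25), (-2.0, 1, 0.25)]  # (x, y, mass)
f1 = lambda x: -x - 0.5
f2 = lambda x: x - 0.5
eps = 1.0

def adv_risk(mixture):
    """Adversarial risk of a mixture given as a list of (classifier, probability)."""
    risk = 0.0
    for x, y, mass in points:
        # pointwise best response: pick the perturbation maximizing the expected 0/1 loss
        candidates = [x + d for d in (-eps, -0.5, 0.0, 0.5, eps)]
        risk += mass * max(
            sum(p for f, p in mixture if f(xp) * y <= 0) for xp in candidates
        )
    return risk

print(adv_risk([(f1, 1.0)]))             # 0.75  (deterministic classifier)
print(adv_risk([(f1, 0.5), (f2, 0.5)]))  # 0.5   (uniform mixture)
```

The deterministic classifier sits at the value V^ε_det while the uniform mixture attains the strictly smaller value, matching the discussion around Equation (5).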
As we will demonstrate in the next section, optimizing over randomized classifiers allows us to approach D_ε arbitrarily closely.

Remark 4.
Note that we can always define the dual problem when the classifier is deterministic:

    sup_{Q∈A_ε(P)} inf_{θ∈Θ} E_{(x,y)∼Q}[ l(θ, (x, y)) ].

Furthermore, we can demonstrate that the dual problems for deterministic and randomized classifiers have the same value; hence the inequalities in Equation (5).

In the adversarial examples game, a Nash equilibrium is a couple (μ*, Q*) ∈ M(Θ) × A_ε(P) where neither the classifier nor the attacker has an incentive to deviate unilaterally from its strategy μ* or Q*. More formally, (μ*, Q*) is a Nash equilibrium of the adversarial examples game if it is a saddle point of the objective function

    (μ, Q) ↦ E_{(x,y)∼Q, θ∼μ}[ l(θ, (x, y)) ].

Alternatively, we can say that (μ*, Q*) is a Nash equilibrium if and only if μ* solves the adversarial risk minimization problem, i.e., Problem (1), Q* solves the dual problem, and D_ε = V^ε_rand. In our problem, Q* always exists, but this might not be the case for μ*. Then, for any δ > 0, we say that (μ_δ, Q*) is a δ-approximate Nash equilibrium if Q* solves the dual problem and μ_δ satisfies D_ε ≥ R^ε_adv(μ_δ) − δ.

We now state our main result: the existence of approximate Nash equilibria in the adversarial examples game when both the classifier and the adversary can use randomized strategies. More precisely, we demonstrate that the duality gap between the adversary's and the classifier's problems is zero, which gives as a corollary the existence of approximate Nash equilibria.

Theorem 1.
Let P ∈ M(X × Y), let ε > 0, and let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1. Then strong duality always holds in the randomized setting:

    inf_{μ∈M(Θ)} max_{Q∈A_ε(P)} E_{θ∼μ, (x,y)∼Q}[ l(θ, (x, y)) ]   (6)
    = max_{Q∈A_ε(P)} inf_{μ∈M(Θ)} E_{θ∼μ, (x,y)∼Q}[ l(θ, (x, y)) ].

The supremum is always attained. If Θ is a compact set and, for all (x, y) ∈ X × Y, l(·, (x, y)) is lower semi-continuous, the infimum is also attained.

Corollary 1.
Under Assumption 1, for any δ > 0, there exists a δ-approximate Nash equilibrium (μ_δ, Q*). Moreover, if the infimum is attained, there exists a Nash equilibrium (μ*, Q*) of the adversarial examples game.

Theorem 1 shows that D_ε = V^ε_rand. From a game theoretic perspective, this means that the minimal adversarial risk of a randomized classifier against any attack (primal problem) is the same as the maximal risk an adversary can get by using an attack strategy that is oblivious to the classifier it faces (dual problem). This suggests that playing randomized strategies could substantially improve the classifier's robustness to adversarial examples. In the next section, we design algorithms that efficiently learn such a classifier, and we obtain improved adversarial robustness over classical deterministic defenses.

Remark 5.
Theorem 1 remains true if one replaces A_ε(P) with any other Wasserstein compact uncertainty set (see [41] for conditions of compactness). See Appendix E for more details.

4 Finding the Optimal Classifiers
Let {(x_i, y_i)}_{i=1}^N be samples independently drawn from P, and denote P̂ := (1/N) Σ_{i=1}^N δ_{(x_i, y_i)} the associated empirical distribution. One can show that the adversarial empirical risk minimization problem can be cast as:

    R̂^{ε,*}_adv := inf_{μ∈M(Θ)} Σ_{i=1}^N sup_{Q_i∈Γ_{i,ε}} E_{(x,y)∼Q_i, θ∼μ}[ l(θ, (x, y)) ],

where Γ_{i,ε} is defined as:

    Γ_{i,ε} := { Q_i | ∫ dQ_i = 1/N, ∫ c_ε((x_i, y_i), ·) dQ_i = 0 }.

More details on this decomposition are given in Appendix E. In the following, we regularize the above objective by adding an entropic term to each inner supremum problem. Let α := (α_i)_{i=1}^N ∈ R^N_+, and let us consider the following optimization problem:

    R̂^{ε,*}_{adv,α} := inf_{μ∈M(Θ)} Σ_{i=1}^N sup_{Q_i∈Γ_{i,ε}} E_{Q_i, μ}[ l(θ, (x, y)) ] − α_i KL( Q_i || (1/N) U_{(x_i,y_i)} ),

where U_{(x,y)} is an arbitrary distribution with support equal to

    S_ε(x, y) := { (x′, y′) s.t. c_ε((x, y), (x′, y′)) = 0 },

and, for all Q, U ∈ M_+(X × Y),

    KL(Q || U) := ∫ log(dQ/dU) dQ + |U| − |Q| if Q ≪ U, and +∞ otherwise.

Note that when α = 0, we recover the problem of interest: R̂^{ε,*}_{adv,0} = R̂^{ε,*}_adv. Moreover, we show that the regularized supremum tends to the standard supremum when α → 0.

Proposition 3.
For μ ∈ M(Θ), one has

    lim_{α_i→0} sup_{Q_i∈Γ_{i,ε}} E_{Q_i, μ}[ l(θ, (x, y)) ] − α_i KL( Q_i || (1/N) U_{(x_i,y_i)} ) = sup_{Q_i∈Γ_{i,ε}} E_{(x,y)∼Q_i, θ∼μ}[ l(θ, (x, y)) ].

By adding an entropic term to the objective, we obtain an explicit formulation of the supremum involved in the sum: as soon as α > 0 (which means that each α_i > 0), each sub-problem becomes the Fenchel-Legendre transform of KL(· || U_{(x_i,y_i)}/N), which has the following closed form:

    sup_{Q_i∈Γ_{i,ε}} E_{Q_i, μ}[ l(θ, (x, y)) ] − α_i KL( Q_i || (1/N) U_{(x_i,y_i)} ) = (α_i/N) log( ∫_{X×Y} exp( E_{θ∼μ}[ l(θ, (x, y)) ] / α_i ) dU_{(x_i,y_i)} ).

The regularized problem therefore writes:

    inf_{μ∈M(Θ)} Σ_{i=1}^N (α_i/N) log( ∫ exp( E_μ[ l(θ, (x, y)) ] / α_i ) dU_{(x_i,y_i)} ).

In order to solve the above problem, one needs to compute the integral involved in the objective. To do so, we estimate it by randomly sampling m_i ≥ 1 samples (u^(i)_1, . . . , u^(i)_{m_i}) ∈ (X × Y)^{m_i} from U_{(x_i,y_i)} for all i ∈ {1, . . . , N}, which leads to the following optimization problem, denoted R̂^{ε,m}_{adv,α} with m := (m_i)_{i=1}^N:

    inf_{μ∈M(Θ)} Σ_{i=1}^N (α_i/N) log( (1/m_i) Σ_{j=1}^{m_i} exp( E_μ[ l(θ, u^(i)_j) ] / α_i ) ).   (7)

Now we aim at controlling the error made with our approximations. We decompose the error into two terms:

    | R̂^{ε,m}_{adv,α} − R̂^{ε,*}_adv | ≤ | R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} | + | R̂^{ε,*}_{adv,α} − R̂^{ε,*}_adv |,

where the first corresponds to the statistical error made by our estimation of the integral, and the second to the approximation error made by the entropic regularization of the objective. First, we show a control of the statistical error using Rademacher complexities [2].

Proposition 4.
Let m ≥ 1 and α > 0, and denote α := (α, . . . , α) ∈ R^N and m := (m, . . . , m) ∈ R^N. Then, denoting M̃ := max(M, 1), with probability at least 1 − δ we have

    | R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} | ≤ (e^{M/α}/N) Σ_{i=1}^N R_i + 6 M̃ e^{M/α} √( log(2/δ) / (2mN) ),

where R_i := (1/m) E_σ[ sup_{θ∈Θ} Σ_{j=1}^m σ_j l(θ, u^(i)_j) ] and σ := (σ_1, . . . , σ_m), with the σ_j i.i.d. such that P[σ_j = ±1] = 1/2.

We deduce from the above proposition that, in the particular case where Θ is finite with |Θ| = L, with probability at least 1 − δ,

    | R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} | ∈ O( M e^{M/α} √( log(L)/m ) ).

This case is of particular interest when one wants to learn the optimal mixture of some given classifiers in order to minimize the adversarial risk. In the following proposition, we control the approximation error made by adding an entropic term to the objective.
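Before bounding this error formally, the effect of the entropic smoothing is easy to visualize: the soft value α log ∫ exp(l/α) dU, estimated here as a log-mean-exp over sampled attack points with purely synthetic losses, increases towards the hard supremum as α → 0. A small sketch (all numbers illustrative, not from the paper):

```python
import numpy as np

# Synthetic stand-ins for E_{theta~mu}[l(theta, u_j)] on m sampled attack points u_j.
rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=1000)

def smoothed_sup(losses, alpha):
    # alpha * log( mean_j exp(losses_j / alpha) ), computed stably by shifting by the max
    top = losses.max()
    return alpha * np.log(np.mean(np.exp((losses - top) / alpha))) + top

for alpha in (1.0, 0.1, 0.01):
    print(round(smoothed_sup(losses, alpha), 3))  # increases towards the hard max
print(round(losses.max(), 3))
```

The smoothed value is exactly the log-mean-exp appearing in Problem (7) (up to the 1/N scaling), and the residual gap as α shrinks is the approximation error controlled next.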
Proposition 5.
Denote, for β > 0, (x, y) ∈ X × Y and μ ∈ M(Θ),

    A^{(x,y)}_{β,μ} := { u | sup_{v∈S_ε(x,y)} E_μ[ l(θ, v) ] ≤ E_μ[ l(θ, u) ] + β }.

If there exists C_β > 0 such that, for all (x, y) ∈ X × Y and μ ∈ M(Θ), U_{(x,y)}( A^{(x,y)}_{β,μ} ) ≥ C_β, then we have

    | R̂^{ε,*}_{adv,α} − R̂^{ε,*}_adv | ≤ α |log(C_β)| + β.

The assumption made in the above proposition states that, for any given randomized classifier μ and any given point (x, y), the set of β-optimal attacks at this point has at least a certain amount of mass depending on the chosen β. This assumption is always met when β is sufficiently large. However, in order to obtain a tight control of the error, a trade-off exists between β and the smallest amount of mass C_β of β-optimal attacks. Now that we have shown that solving (7) yields an approximation of the true solution R̂^{ε,*}_adv, we next aim at deriving an algorithm to compute it.

4.2 Proposed Algorithms

From now on, we focus on a finite class of classifiers Θ = {θ_1, . . . , θ_L}, and we aim to learn the optimal mixture of classifiers in this case. The adversarial empirical risk is therefore defined as

    R̂^ε_adv(λ) := Σ_{i=1}^N sup_{Q_i∈Γ_{i,ε}} E_{(x,y)∼Q_i}[ Σ_{k=1}^L λ_k l(θ_k, (x, y)) ]

for λ ∈ Δ_L := { λ ∈ R^L_+ s.t. Σ_{k=1}^L λ_k = 1 }, the probability simplex of R^L. One can notice that R̂^ε_adv(·) is a continuous convex function; hence min_{λ∈Δ_L} R̂^ε_adv(λ) is attained at some λ*. Then there exists an exact Nash equilibrium (λ*, Q*) in the adversarial game when Θ is finite. Here, we present two algorithms to learn the optimal mixture for the adversarial risk minimization problem.

Algorithm 1
Oracle Algorithm

Input: λ_0 = (1/L, . . . , 1/L); number of iterations T; step size η = √2 / (M √(LT))
for t = 1, . . . , T do
    Find Q̃ such that there exists Q* ∈ A_ε(P̂), a best response to λ_{t−1}, with | E_{Q̃}[ l(θ_k, (x, y)) ] − E_{Q*}[ l(θ_k, (x, y)) ] | ≤ δ for all k ∈ [L]
    g_t = ( E_{Q̃}[ l(θ_1, (x, y)) ], . . . , E_{Q̃}[ l(θ_L, (x, y)) ] )ᵀ
    λ_t = Π_{Δ_L}( λ_{t−1} − η g_t )
end for
Figure 2: Left: data samples with their sets of possible attacks (shaded) and the optimal randomized classifier, with a color gradient representing the probability of the classifier. Middle: convergence of the oracle (α = 0) and regularized algorithms for different values of the regularization parameter. Right: in-sample and out-of-sample risk of the randomized and deterministic minimum-risk classifiers as a function of the perturbation size ε; in the randomized case, the classifier is optimized with oracle Algorithm 1.
The first algorithm we present is inspired by [33] and by the convergence of projected subgradient methods [9]. The computation of the inner supremum problem is usually NP-hard, but one may assume access to an approximate oracle for this supremum. The method is presented in Algorithm 1, and we get the following guarantee for it.

Proposition 6.
Let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1. Then Algorithm 1 satisfies:

    min_{t∈[T]} R̂^ε_adv(λ_t) − R̂^{ε,*}_adv ≤ δ + 2M√L / √T.

The main drawback of the above algorithm is that one needs access to an oracle to guarantee convergence. In the following, we present its regularized version in order to approximate the solution, and propose a simple algorithm to solve it. See Appendix E for details.

An Entropic Relaxation. Adding an entropic term to the objective allows a simple reformulation of the problem, as follows:

    inf_{λ∈Δ_L} Σ_{i=1}^N (α_i/N) log( (1/m_i) Σ_{j=1}^{m_i} exp( Σ_{k=1}^L λ_k l(θ_k, u^(i)_j) / α_i ) ).

Note that the objective is convex and smooth in λ. One can apply accelerated projected gradient descent [3, 36], which enjoys the optimal convergence rate for first-order methods of O(T^{−2}) for T iterations.

To illustrate our theoretical findings, we start by testing our learning algorithm on the following synthetic two-dimensional problem. Let us consider the distribution P defined as P(Y = ±1) = 1/2, P(X | Y = −1) = N(0, I) and P(X | Y = 1) = (1/2)[ N((−3, 0), I) + N((3, 0), I) ]. We sample training points from this distribution and randomly generate linear classifiers that achieve a low standard training risk. To simulate an adversary with budget ε, we proceed as follows: for every sample (x, y) ∼ P, we generate points uniformly at random in the ball of radius ε and select the one maximizing the risk for the 0/1 loss. Figure 2 (left) illustrates the type of mixture we get after convergence of our algorithms. Note that in this toy problem, we are likely to find the optimal adversary with this sampling strategy, provided we sample enough attack points.

To evaluate the convergence of our algorithms, we compute the adversarial risk of our mixture at each iteration of both the oracle and regularized algorithms. Figure 2 (middle) illustrates the convergence of the algorithms w.r.t. the regularization parameter. We observe that the risk converges for both algorithms. Moreover, they converge towards the oracle minimizer when the regularization parameter α goes to 0. Finally, to demonstrate the improvement randomized techniques offer over deterministic defenses, we plot in Figure 2 (right) the minimum adversarial risk of both randomized and deterministic classifiers w.r.t. ε. The adversarial risk is strictly better for the randomized classifier whenever the adversarial budget ε is large enough. This illustration validates our analysis of Theorem 1 and motivates an in-depth study of a more challenging framework, namely image classification with neural networks.
[Table: standard accuracy (Acc.) and robust accuracy under APGD-CE, APGD-DLR, and both combined (Rob. Acc.) for the compared models; the individual entries are not recoverable from the source, with robust accuracies in the 45-50% range.]

Figure 3: Left: comparison of our algorithm with standard adversarial training (one model). We report the results for the model with the best robust accuracy over two independent runs, because adversarial training can be unstable. Middle and right: standard and robust accuracy on CIFAR-10 test images as a function of the number of epochs per classifier, with 1 to 4 ResNet18 models. The performed attack is PGD with ε = 8/255.

Adversarial examples are known to be easily transferable from one model to another [35]. To counter this and support our theoretical claims, we propose a heuristic algorithm (see Algorithm 2) to train a robust mixture of L classifiers. We alternately train these classifiers with adversarial examples computed against the current mixture, and update the probabilities of the mixture according to the algorithms we proposed in Section 4.2. More details on the heuristic algorithm are available in Appendix D.

Algorithm 2
Adversarial Training for Mixtures

Input: L: number of models; T: number of iterations; T_θ: number of updates for the models θ; T_λ: number of updates for the mixture λ; λ = (λ_1, …, λ_L); θ = (θ_1, …, θ_L)
for t = 1, …, T do
    Let B_t be a batch of data.
    if t mod (T_θ L + 1) ≠ 0 then
        Sample k uniformly in {1, …, L}
        B̃_t ← attack of the images in B_t against the model (λ_t, θ_t)
        θ_{t,k} ← update of θ_{t−1,k} on B̃_t for fixed λ_t, with one SGD step
    else
        λ_t ← update of λ_{t−1} on B_t for fixed θ_t, with the oracle or regularized algorithm run for T_λ iterations
    end
end

Experimental Setup.
To evaluate the performance of Algorithm 2, we trained from 1 to 4 ResNet18 [20] models, with a fixed number of epochs per model. We study robustness with regard to the ℓ∞ norm, with a fixed adversarial budget ε = 8/255. The attack used in the inner maximization of the training is an adapted (adaptive) version of PGD for mixtures of classifiers. Note that for a single model, Algorithm 2 exactly corresponds to adversarial training [24]. For each of our setups, we made two independent runs and selected the best one. The training time of our algorithm is around four times longer than standard adversarial training (with PGD, 10 iterations) with two models, eight times with three models, and twelve times with four models. We trained our models on Nvidia V100 GPUs. We give more details on the implementation in Appendix D.
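The alternating scheme of Algorithm 2 can be sketched on a 1-D toy problem. This is a minimal illustration under stand-in assumptions of ours: threshold classifiers replace neural networks, an exhaustive candidate search replaces PGD, a perceptron-like step replaces SGD, and a multiplicative-weights step stands in for the oracle/regularized mixture update.

```python
import math
import random

def model_loss(theta, x, y):
    """0/1 loss of the threshold classifier sign(x - theta) on (x, y), with y in {-1, +1}."""
    return 0.0 if (1 if x >= theta else -1) == y else 1.0

def mix_loss(thetas, lams, x, y):
    """Expected 0/1 loss of the mixture (weights lams over models thetas) at (x, y)."""
    return sum(l * model_loss(t, x, y) for t, l in zip(thetas, lams))

def attack(thetas, lams, x, y, eps):
    """Adversary's best response within [x - eps, x + eps] by exhaustive candidate search."""
    cands = [x - eps, x, x + eps]
    cands += [t for t in thetas if x - eps <= t <= x + eps]
    cands += [t - 1e-6 for t in thetas if x - eps <= t - 1e-6 <= x + eps]
    return max(cands, key=lambda xp: mix_loss(thetas, lams, xp, y))

def train_mixture(data, L=2, T=300, eps=0.3, lr_theta=0.05, lr_lam=0.5, seed=0):
    """Alternate adversarial model updates and mixture-weight updates, as in Algorithm 2."""
    rng = random.Random(seed)
    thetas = [rng.uniform(-1.0, 1.0) for _ in range(L)]
    lams = [1.0 / L] * L
    for t in range(1, T + 1):
        x, y = data[t % len(data)]
        x_adv = attack(thetas, lams, x, y, eps)    # attack against the current mixture
        if t % (L + 1) != 0:                       # model step on one sampled classifier
            k = rng.randrange(L)
            if model_loss(thetas[k], x_adv, y) > 0:
                thetas[k] -= lr_theta * y          # move the threshold to fix the mistake
        else:                                      # mixture step: multiplicative weights
            lams = [l * math.exp(-lr_lam * model_loss(th, x_adv, y))
                    for th, l in zip(thetas, lams)]
        s = sum(lams)
        lams = [l / s for l in lams]               # renormalize onto the simplex
    return thetas, lams
```

The mixture's adversarial risk can then be estimated by averaging mix_loss at attacked points; the weight update keeps λ on the probability simplex by construction.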
Evaluation Protocol.
At each epoch, we evaluate the current mixture on test data against a PGD attack. To select our model and avoid overfitting [30], we kept the mixture most robust against this PGD attack. For the final evaluation of our mixture of models, we used an adapted version of untargeted AutoPGD attacks [15] for randomized classifiers, with both the Cross-Entropy (CE) and Difference of Logits Ratio (DLR) losses, run with multiple iterations and restarts. Results.
The results are presented in Figure 3. We note that our algorithm outperforms standard adversarial training in all cases, without additional loss of standard accuracy, as attested by the left figure. Moreover, by adding more models, our algorithm seems to reduce the overfitting of adversarial training. So far, the experiments are computationally very costly and it is difficult to draw definitive conclusions. Further hyperparameter tuning [19], the architecture, unlabeled data [12], the activation function, or the use of TRADES [43] may still improve the results. Distributionally Robust Optimization.
Several recent works [23, 33, 37] studied the problem of adversarial examples through the scope of distributionally robust optimization. In these frameworks, the set of adversarial distributions is defined using an ℓp Wasserstein ball: the adversary is allowed an average perturbation of at most ε in ℓp norm. This, however, does not match the usual adversarial attack problem, where the adversary cannot move any single point by more than ε. In the present work, we introduce a transport cost c_ε that encodes this pointwise constraint exactly.

Optimal Transport (OT). [5] and [29] investigated classifier-agnostic lower bounds on the adversarial risk of any deterministic classifier using OT. These works only evaluate lower bounds on the primal deterministic formulation of the problem, while we study the existence of mixed Nash equilibria. Note that [29] started to investigate a way to formalize the adversary using Markov kernels, but did not investigate the impact of randomized strategies on the game. We extend this work by rigorously reformulating the adversarial risk as a linear optimization problem over distributions, and we study this problem from a game-theoretic point of view.
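The distinction between the two threat models can be made explicit. A Wasserstein ball with ℓp ground cost only bounds the expected displacement under the coupling γ, whereas the cost c_ε used in this paper (see the proof of Proposition 1 in Appendix C.1) is infinite as soon as a single point moves further than ε:

```latex
% DRO constraint: only the *average* displacement is bounded by eps
\inf_{\gamma \in \Gamma(\mathbb{P}, \mathbb{Q})}
  \mathbb{E}_{(x, x') \sim \gamma}\big[\, \|x - x'\|_p \,\big] \;\le\; \varepsilon
\qquad \text{vs.} \qquad
c_\varepsilon\big((x, y), (x', y')\big) =
  \begin{cases}
    0        & \text{if } d(x, x') \le \varepsilon \text{ and } y = y', \\
    +\infty  & \text{otherwise,}
  \end{cases}
```

so that W_{c_ε}(P, Q) ≤ η forces d(x, x′) ≤ ε and y = y′ γ-almost surely, which is the pointwise adversarial constraint (Proposition 1).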
Game Theory.
Adversarial examples have been studied under the notions of Stackelberg games in [10], and zero-sum games in [8, 26, 31]. These works considered restricted settings (convex loss, parametric adversaries, etc.) that do not comply with the nature of the problem. Indeed, we prove in Appendix C.3 that when the loss is convex and the set Θ is convex, the duality gap is zero for deterministic classifiers. However, it has been proven that no convex loss can be a good surrogate for the 0/1 loss in the adversarial setting [1, 14], narrowing the scope of this result. While one can show that for sufficiently separated conditional distributions an optimal deterministic classifier always exists (see Appendix E for a precise statement), necessary and sufficient conditions for the need of randomization are still to be established. [27] partly studied this question for regularized deterministic adversaries, leaving the general setting of randomized adversaries and mixed equilibria unanswered, which is the very scope of this paper.

References

[1] H. Bao, C. Scott, and M. Sugiyama. Calibrated surrogate losses for adversarially robust classification. In J. Abernethy and S. Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of
Proceedings of Machine Learning Research, pages 408–451. PMLR, 09–12 Jul 2020.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[4] D. P. Bertsekas and S. Shreve. Stochastic Optimal Control: The Discrete-Time Case. 2004.
[5] A. N. Bhagoji, D. Cullina, and P. Mittal. Lower bounds on adversarial robustness from optimal transport. In Advances in Neural Information Processing Systems 32, pages 7496–7508. Curran Associates, Inc., 2019.
[6] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
[7] J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.
[8] A. J. Bose, G. Gidel, H. Berard, A. Cianflone, P. Vincent, S. Lacoste-Julien, and W. L. Hamilton. Adversarial example games, 2021.
[9] S. Boyd. Subgradient methods. 2003.
[10] M. Brückner and T. Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 547–555, New York, NY, USA, 2011. Association for Computing Machinery.
[11] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.
[12] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. Unlabeled data improves adversarial robustness. arXiv preprint arXiv:1905.13736, 2019.
[13] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918.
[14] Z. Cranko, A. Menon, R. Nock, C. S. Ong, Z. Shi, and C. Walder. Monge blunts Bayes: Hardness results for adversarial training. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1406–1415. PMLR, 09–15 Jun 2019.
[15] F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, 2020.
[16] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26:2292–2300, 2013.
[17] G. S. Dhillon, K. Azizzadenesheli, J. D. Bernstein, J. Kossaifi, A. Khanna, Z. C. Lipton, and A. Anandkumar. Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations, 2018.
[18] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[19] S. Gowal, C. Qin, J. Uesato, T. Mann, and P. Kohli. Uncovering the limits of adversarial training against norm-bounded adversarial examples. arXiv preprint arXiv:2010.03593, 2020.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[22] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[23] J. Lee and M. Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems 31, pages 2687–2696. Curran Associates, Inc., 2018.
[24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[25] S.-M. Moosavi-Dezfooli, A. Fawzi, J. Uesato, and P. Frossard. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9078–9086, 2019.
[26] J. C. Perdomo and Y. Singer. Robust attacks against multiple classifiers. arXiv preprint arXiv:1906.02816, 2019.
[27] R. Pinot, R. Ettedgui, G. Rizk, Y. Chevaleyre, and J. Atif. Randomization matters. How to defend against strong adversarial attacks. International Conference on Machine Learning, 2020.
[28] R. Pinot, L. Meunier, A. Araujo, H. Kashima, F. Yger, C. Gouy-Pailler, and J. Atif. Theoretical evidence for adversarial robustness through randomization. In Advances in Neural Information Processing Systems, pages 11838–11848, 2019.
[29] M. S. Pydi and V. Jog. Adversarial risk via optimal transport and optimal couplings. In International Conference on Machine Learning, 2020.
[30] L. Rice, E. Wong, and Z. Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pages 8093–8104. PMLR, 2020.
[31] S. Rota Bulò, B. Biggio, I. Pillai, M. Pelillo, and F. Roli. Randomized prediction games for adversarial machine learning. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2466–2478, 2017.
[32] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[33] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
[34] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[35] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
[36] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 1, 2008.
[37] Z. Tu, J. Zhang, and D. Tao. Theoretical analysis of adversarial learning: A minimax approach. arXiv preprint arXiv:1811.05232, 2018.
[38] C. Villani. Topics in Optimal Transportation. Number 58. American Mathematical Soc., 2003.
[39] B. Wang, Z. Shi, and S. Osher. ResNets ensemble via the Feynman-Kac formalism to improve natural and robust accuracies. In Advances in Neural Information Processing Systems 32, pages 1655–1665. Curran Associates, Inc., 2019.
[40] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018.
[41] M.-C. Yue, D. Kuhn, and W. Wiesemann. On linear optimization over Wasserstein balls. arXiv preprint arXiv:2004.07162, 2020.
[42] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, 2016.
[43] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. International Conference on Machine Learning, 2019.

Supplementary material

A Notations
Let (Z, d) be a Polish (i.e. complete and separable) metric space. We say that (Z, d) is proper if for all z_0 ∈ Z and R > 0, the ball B(z_0, R) := {z | d(z, z_0) ≤ R} is compact. For (Z, d) a Polish space, we denote by M_+^1(Z) the set of Borel probability measures on Z, endowed with the strong topology induced by ‖·‖_TV. We recall the notion of weak topology: we say that a sequence (µ_n)_n of M_+^1(Z) converges weakly to µ ∈ M_+^1(Z) if and only if, for every continuous bounded function f on Z, ∫ f dµ_n → ∫ f dµ as n → ∞. Endowed with its weak topology, M_+^1(Z) is a Polish space. For µ ∈ M_+^1(Z), we define L^1(µ) as the set of integrable functions with respect to µ. We denote Π_1 : (z, z′) ∈ Z² ↦ z and Π_2 : (z, z′) ∈ Z² ↦ z′ the projections on the first and second components respectively, which are continuous applications. For a measure µ and a measurable mapping g, we denote g♯µ the pushforward measure of µ by g. Let L ≥ 1 be an integer and denote ∆_L := {λ ∈ R^L_+ s.t. Σ_{k=1}^L λ_k = 1} the probability simplex of R^L.

B Useful Lemmas
Lemma 1 (Fubini's theorem). Let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1. Then for all µ ∈ M_+^1(Θ), ∫ l(θ, ·) dµ(θ) is Borel measurable; for all Q ∈ M_+^1(X × Y), ∫ l(·, (x, y)) dQ(x, y) is Borel measurable. Moreover:

∫∫ l(θ, (x, y)) dµ(θ) dQ(x, y) = ∫∫ l(θ, (x, y)) dQ(x, y) dµ(θ).

Lemma 2.
Let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1. Then for all µ ∈ M_+^1(Θ), (x, y) ↦ ∫ l(θ, (x, y)) dµ(θ) is upper semi-continuous, and hence Borel measurable.

Proof. Let (x_n, y_n)_n be a sequence of X × Y converging to (x, y) ∈ X × Y. For all θ ∈ Θ, M − l(θ, ·) is non-negative and lower semi-continuous. Then, applying Fatou's lemma:

∫ (M − l(θ, (x, y))) dµ(θ) ≤ ∫ lim inf_{n→∞} (M − l(θ, (x_n, y_n))) dµ(θ) ≤ lim inf_{n→∞} ∫ (M − l(θ, (x_n, y_n))) dµ(θ).

We deduce that ∫ (M − l(θ, ·)) dµ(θ) is lower semi-continuous, and hence ∫ l(θ, ·) dµ(θ) is upper semi-continuous.

Lemma 3.
Let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1. Then for all µ ∈ M_+^1(Θ), Q ↦ ∫∫ l(θ, (x, y)) dµ(θ) dQ(x, y) is upper semi-continuous for the weak topology of measures.

Proof. −∫ l(θ, ·) dµ(θ) is lower semi-continuous from Lemma 2. Then M − ∫ l(θ, ·) dµ(θ) is lower semi-continuous and non-negative. Denote this function by v. Let (v_n)_n be a non-decreasing sequence of continuous bounded functions such that v_n → v, and let (Q_k)_k converge weakly towards Q. Then, by monotone convergence:

∫ v dQ = lim_n ∫ v_n dQ = lim_n lim_k ∫ v_n dQ_k ≤ lim inf_k ∫ v dQ_k.

Then Q ↦ ∫ v dQ is lower semi-continuous, and hence Q ↦ ∫∫ l(θ, (x, y)) dµ(θ) dQ(x, y) is upper semi-continuous for the weak topology of measures.

Lemma 4.
Let l : Θ × (X × Y) → [0, ∞) satisfy Assumption 1. Then for all µ ∈ M_+^1(Θ), (x, y) ↦ sup_{(x′,y′), d(x,x′) ≤ ε, y = y′} ∫ l(θ, (x′, y′)) dµ(θ) is universally measurable (i.e. measurable for all Borel probability measures), and hence the adversarial risk is well defined.

Proof. Let φ : (x, y) ↦ sup_{(x′,y′), d(x,x′) ≤ ε, y = y′} ∫ l(θ, (x′, y′)) dµ(θ). Then for u ∈ R̄:

{φ(x, y) > u} = Proj { ((x, y), (x′, y′)) | ∫ l(θ, (x′, y′)) dµ(θ) − c_ε((x, y), (x′, y′)) > u }.

By Lemma 2, ((x, y), (x′, y′)) ↦ ∫ l(θ, (x′, y′)) dµ(θ) − c_ε((x, y), (x′, y′)) is upper semi-continuous, hence Borel measurable. So its level sets are Borel sets, and by [4, Proposition 7.39], the projection of a Borel set is analytic. Then {φ(x, y) > u} is universally measurable thanks to [4, Corollary 7.42.1]. We deduce that φ is universally measurable.

C Proofs
C.1 Proof of Proposition 1
Proof.
Let η > 0 and let Q ∈ A_ε(P). There exists γ ∈ M_+^1((X × Y)²) such that d(x, x′) ≤ ε and y = y′ γ-almost surely, Π_1♯γ = P and Π_2♯γ = Q. Then ∫ c_ε dγ = 0 ≤ η. We deduce that W_{c_ε}(P, Q) ≤ η, and Q ∈ B_{c_ε}(P, η). Reciprocally, let Q ∈ B_{c_ε}(P, η). Since the infimum is attained in the definition of the Wasserstein distance, there exists γ ∈ M_+^1((X × Y)²) such that ∫ c_ε dγ ≤ η. Since c_ε((x, y), (x′, y′)) = +∞ when d(x, x′) > ε or y ≠ y′, we deduce that d(x, x′) ≤ ε and y = y′ γ-almost surely. Then Q ∈ A_ε(P). We have thus shown that A_ε(P) = B_{c_ε}(P, η).

The convexity of A_ε(P) is then immediate from the relation with the Wasserstein uncertainty set.

Let us first show that A_ε(P) is relatively compact for the weak topology. To do so, we show that A_ε(P) is tight and apply Prokhorov's theorem. Let δ > 0. (X × Y, d ⊕ d′) being a Polish space, {P} is tight, so there exists a compact K_δ such that P(K_δ) ≥ 1 − δ. Let K̃_δ := {(x′, y′) | ∃(x, y) ∈ K_δ, d(x′, x) ≤ ε, y = y′}. Recalling that (X, d) is proper (i.e. closed balls are compact), K̃_δ is compact. Moreover, for Q ∈ A_ε(P), Q(K̃_δ) ≥ P(K_δ) ≥ 1 − δ. Prokhorov's theorem then applies, and A_ε(P) is relatively compact for the weak topology.

Let us now prove that A_ε(P) is closed, to conclude. Let (Q_n)_n be a sequence of A_ε(P) converging towards some Q for the weak topology. For each n, there exists γ_n ∈ M_+^1((X × Y)²) such that d(x, x′) ≤ ε and y = y′ γ_n-almost surely, Π_1♯γ_n = P and Π_2♯γ_n = Q_n. {Q_n, n ≥ 0} is relatively compact, hence tight, so ∪_n Γ(P, Q_n) is tight, hence relatively compact by Prokhorov's theorem. Since (γ_n)_n ∈ ∪_n Γ(P, Q_n), up to an extraction, γ_n → γ. Then d(x, x′) ≤ ε and y = y′ γ-almost surely, and by continuity, Π_1♯γ = P and Π_2♯γ = Q. Hence A_ε(P) is closed.

Finally, A_ε(P) is a convex compact set for the weak topology.

C.2 Proof of Proposition 2
Proof.
Let µ ∈ M_+^1(Θ) and let f̃ : ((x, y), (x′, y′)) ↦ E_{θ∼µ}[l(θ, (x′, y′))] − c_ε((x, y), (x′, y′)). f̃ is upper semi-continuous, hence upper semi-analytic. Then, by upper semi-continuity of E_{θ∼µ}[l(θ, ·)] on the compact {(x′, y′) | d(x, x′) ≤ ε, y = y′} and [4, Proposition 7.50], there exists a universally measurable mapping T such that E_{θ∼µ}[l(θ, T(x, y))] = sup_{(x′,y′), d(x,x′)≤ε, y=y′} E_{θ∼µ}[l(θ, (x′, y′))]. Let Q = T♯P; then Q ∈ A_ε(P), and hence

E_{(x,y)∼P}[ sup_{(x′,y′), d(x,x′)≤ε, y=y′} E_{θ∼µ}[l(θ, (x′, y′))] ] ≤ sup_{Q∈A_ε(P)} E_{(x,y)∼Q}[ E_{θ∼µ}[l(θ, (x, y))] ].

Reciprocally, let Q ∈ A_ε(P). There exists γ ∈ M_+^1((X × Y)²) such that d(x, x′) ≤ ε and y = y′ γ-almost surely, Π_1♯γ = P and Π_2♯γ = Q. Then E_{θ∼µ}[l(θ, (x′, y′))] ≤ sup_{(u,v), d(x,u)≤ε, y=v} E_{θ∼µ}[l(θ, (u, v))] γ-almost surely. We deduce that:

E_{(x′,y′)∼Q}[ E_{θ∼µ}[l(θ, (x′, y′))] ] = E_{(x,y,x′,y′)∼γ}[ E_{θ∼µ}[l(θ, (x′, y′))] ]
≤ E_{(x,y,x′,y′)∼γ}[ sup_{(u,v), d(x,u)≤ε, y=v} E_{θ∼µ}[l(θ, (u, v))] ]
≤ E_{(x,y)∼P}[ sup_{(u,v), d(x,u)≤ε, y=v} E_{θ∼µ}[l(θ, (u, v))] ].

We then deduce the expected result: R^ε_adv(µ) = sup_{Q∈A_ε(P)} E_{(x,y)∼Q}[ E_{θ∼µ}[l(θ, (x, y))] ].

Let us show that the optimum is attained. Q ↦ E_{(x,y)∼Q}[E_{θ∼µ}[l(θ, (x, y))]] is upper semi-continuous for the weak topology of measures by Lemma 3, and A_ε(P) is compact by Proposition 1; then, by [4, Proposition 7.32], the supremum is attained for a certain Q* ∈ A_ε(P).
C.3 Proof of Theorem 1
Let us first recall Fan's theorem.
Theorem 2.
Let U be a compact convex Hausdorff space and V a convex space (not necessarily topological). Let ψ : U × V → R be a concave-convex function such that, for all v ∈ V, ψ(·, v) is upper semi-continuous. Then:

inf_{v∈V} max_{u∈U} ψ(u, v) = max_{u∈U} inf_{v∈V} ψ(u, v).

We are now set to prove Theorem 1.
Proof. A_ε(P), endowed with the weak topology of measures, is a Hausdorff compact convex space, thanks to Proposition 1. Moreover, M_+^1(Θ) is clearly convex and (Q, µ) ↦ ∫∫ l dµ dQ is bilinear, hence concave-convex. Moreover, thanks to Lemma 3, for all µ, Q ↦ ∫∫ l dµ dQ is upper semi-continuous. Then Fan's theorem applies and strong duality holds.

In the related work (Section 6), we mentioned a particular form of Theorem 1 for convex cases. As mentioned, this result has limited impact in the adversarial classification setting. It is still a direct corollary of Fan's theorem, and can be stated as follows:

Theorem 3.
Let P ∈ M_+^1(X × Y), ε > 0, and Θ a convex set. Let l be a loss satisfying Assumption 1 such that, for all (x, y) ∈ X × Y, l(·, (x, y)) is a convex function. Then we have the following:

inf_{θ∈Θ} sup_{Q∈A_ε(P)} E_Q[l(θ, (x, y))] = sup_{Q∈A_ε(P)} inf_{θ∈Θ} E_Q[l(θ, (x, y))].

The supremum is always attained. If Θ is a compact set, then the infimum is also attained.

C.4 Proof of Proposition 3
Proof.
Let us first show that, for α > 0, sup_{Q_i ∈ Γ_{i,ε}} E_{Q_i,µ}[l(θ, (x, y))] − α KL(Q_i ‖ NU_{(x_i,y_i)}) admits a solution. Let α > 0 and let (Q^n_{α,i})_{n≥0} be a sequence such that

E_{Q^n_{α,i},µ}[l(θ, (x, y))] − α KL(Q^n_{α,i} ‖ NU_{(x_i,y_i)}) → sup_{Q_i ∈ Γ_{i,ε}} E_{Q_i,µ}[l(θ, (x, y))] − α KL(Q_i ‖ NU_{(x_i,y_i)}) as n → +∞.

Γ_{i,ε} is tight ((X, d) is a proper metric space, therefore all closed balls are compact), and by Prokhorov's theorem we can extract a subsequence which converges towards Q*_{α,i}. Moreover, l is upper semi-continuous (u.s.c.), thus Q ↦ E_{Q,µ}[l(θ, (x, y))] is also u.s.c., and Q ↦ −α KL(Q ‖ NU_{(x_i,y_i)}) is u.s.c. as well. Therefore, by considering the limit superior as n goes to infinity, we obtain that

lim sup_{n→+∞} E_{Q^n_{α,i},µ}[l(θ, (x, y))] − α KL(Q^n_{α,i} ‖ NU_{(x_i,y_i)}) = sup_{Q_i ∈ Γ_{i,ε}} E_{Q_i,µ}[l(θ, (x, y))] − α KL(Q_i ‖ NU_{(x_i,y_i)}) ≤ E_{Q*_{α,i},µ}[l(θ, (x, y))] − α KL(Q*_{α,i} ‖ NU_{(x_i,y_i)}),

from which we deduce that Q*_{α,i} is optimal.

Let us now show the result. We consider a positive sequence (α^{(ℓ)}_i)_{ℓ≥0} such that α^{(ℓ)}_i → 0. Let us denote Q*_{α^{(ℓ)}_i,i} and Q*_i the solutions of max_{Q_i ∈ Γ_{i,ε}} E_{Q_i,µ}[l(θ, (x, y))] − α^{(ℓ)}_i KL(Q_i ‖ NU_{(x_i,y_i)}) and max_{Q_i ∈ Γ_{i,ε}} E_{Q_i,µ}[l(θ, (x, y))], respectively.
Since Γ_{i,ε} is tight, (Q*_{α^{(ℓ)}_i,i})_{ℓ≥0} is also tight, and by Prokhorov's theorem we can extract a subsequence which converges towards Q*. Moreover, we have

E_{Q*_i,µ}[l(θ, (x, y))] − α^{(ℓ)}_i KL(Q*_i ‖ NU_{(x_i,y_i)}) ≤ E_{Q*_{α^{(ℓ)}_i,i},µ}[l(θ, (x, y))] − α^{(ℓ)}_i KL(Q*_{α^{(ℓ)}_i,i} ‖ NU_{(x_i,y_i)}),

from which follows that

0 ≤ E_{Q*_i,µ}[l(θ, (x, y))] − E_{Q*_{α^{(ℓ)}_i,i},µ}[l(θ, (x, y))] ≤ α^{(ℓ)}_i ( KL(Q*_i ‖ NU_{(x_i,y_i)}) − KL(Q*_{α^{(ℓ)}_i,i} ‖ NU_{(x_i,y_i)}) ).

Then, by considering the limit superior, we obtain that lim sup_{ℓ→+∞} E_{Q*_{α^{(ℓ)}_i,i},µ}[l(θ, (x, y))] = E_{Q*_i,µ}[l(θ, (x, y))], from which follows that E_{Q*_i,µ}[l(θ, (x, y))] ≤ E_{Q*,µ}[l(θ, (x, y))], and by optimality of Q*_i we obtain the desired result.
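Restricted to a finite set of sampled candidate perturbations with a uniform base measure, the entropic best response analyzed above reduces to a log-sum-exp smoothing of the hard maximum. A minimal numerical sketch (the function name is ours, and the finite-sample restriction is our simplifying assumption):

```python
import math

def soft_best_response(losses, alpha):
    """alpha * log( mean_j exp(loss_j / alpha) ): entropic value over sampled candidates.
    It is at most max(losses) and tends to max(losses) as alpha -> 0."""
    m = max(losses)  # shift by the max for numerical stability of the exponentials
    return m + alpha * math.log(
        sum(math.exp((l - m) / alpha) for l in losses) / len(losses))
```

As α decreases, the value interpolates from a near-average of the candidate losses towards the hard maximum, mirroring the convergence of the regularized best response to the unregularized one in Proposition 3.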
Proof.
Let us denote, for all µ ∈ M_+^1(Θ),

R̂^{ε,m}_{adv,α}(µ) := Σ_{i=1}^N (α_i/N) log( (1/m_i) Σ_{j=1}^{m_i} exp( E_µ[l(θ, u^{(i)}_j)] / α_i ) ).

Let also (µ^{(m)}_n)_{n≥0} and (µ_n)_{n≥0} be two sequences such that R̂^{ε,m}_{adv,α}(µ^{(m)}_n) → R̂^{ε,m}_{adv,α} and R̂^ε_{adv,α}(µ_n) → R̂^{ε,*}_{adv,α} as n → +∞. (For α = 0 the result is clear; for α > 0, note that KL(· ‖ NU_{(x_i,y_i)}) is lower semi-continuous: by considering a decreasing sequence of continuous bounded functions converging towards E_µ[l(θ, (x, y))] and using the definition of weak convergence, the result follows.)
We first remark that

R̂^{ε,m}_{adv,α} − R̂^{ε,*}_{adv,α} ≤ R̂^{ε,m}_{adv,α} − R̂^{ε,m}_{adv,α}(µ_n) + R̂^{ε,m}_{adv,α}(µ_n) − R̂^ε_{adv,α}(µ_n) + R̂^ε_{adv,α}(µ_n) − R̂^{ε,*}_{adv,α}
≤ sup_{µ∈M_+^1(Θ)} | R̂^{ε,m}_{adv,α}(µ) − R̂^ε_{adv,α}(µ) | + R̂^ε_{adv,α}(µ_n) − R̂^{ε,*}_{adv,α},

and by considering the limit, we obtain that R̂^{ε,m}_{adv,α} − R̂^{ε,*}_{adv,α} ≤ sup_{µ∈M_+^1(Θ)} |R̂^{ε,m}_{adv,α}(µ) − R̂^ε_{adv,α}(µ)|. Similarly, we have that

R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} ≤ R̂^{ε,*}_{adv,α} − R̂^ε_{adv,α}(µ^{(m)}_n) + R̂^ε_{adv,α}(µ^{(m)}_n) − R̂^{ε,m}_{adv,α}(µ^{(m)}_n) + R̂^{ε,m}_{adv,α}(µ^{(m)}_n) − R̂^{ε,m}_{adv,α},

from which follows that R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} ≤ sup_{µ∈M_+^1(Θ)} |R̂^{ε,m}_{adv,α}(µ) − R̂^ε_{adv,α}(µ)|. Therefore, we obtain that

| R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} | ≤ Σ_{i=1}^N (α/N) sup_{µ∈M_+^1(Θ)} | log( (1/m_i) Σ_{j=1}^{m_i} exp( E_{θ∼µ}[l(θ, u^{(i)}_j)] / α ) ) − log( ∫_{X×Y} exp( E_{θ∼µ}[l(θ, (x, y))] / α ) dU_{(x_i,y_i)} ) |.

Observe that l ≥ 0; therefore, because the log function is 1-Lipschitz on [1, +∞), we obtain that

| R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} | ≤ Σ_{i=1}^N (α/N) sup_{µ∈M_+^1(Θ)} | (1/m_i) Σ_{j=1}^{m_i} exp( E_{θ∼µ}[l(θ, u^{(i)}_j)] / α ) − ∫_{X×Y} exp( E_{θ∼µ}[l(θ, (x, y))] / α ) dU_{(x_i,y_i)} |.
Let us now denote, for all i = 1, …, N,

R̂_i(µ, u^{(i)}) := (1/m_i) Σ_{j=1}^{m_i} exp( E_{θ∼µ}[l(θ, u^{(i)}_j)] / α ),
R_i(µ) := ∫_{X×Y} exp( E_{θ∼µ}[l(θ, (x, y))] / α ) dU_{(x_i,y_i)},

and let us define

f(u^{(1)}, …, u^{(N)}) := Σ_{i=1}^N (α/N) sup_{µ∈M_+^1(Θ)} | R̂_i(µ, u^{(i)}) − R_i(µ) |,

where u^{(i)} := (u^{(i)}_1, …, u^{(i)}_{m_i}). Denoting z^{(i)} = (u^{(i)}_1, …, u^{(i)}_{k−1}, z, u^{(i)}_{k+1}, …, u^{(i)}_{m_i}), we have that

| f(u^{(1)}, …, u^{(N)}) − f(u^{(1)}, …, u^{(i−1)}, z^{(i)}, u^{(i+1)}, …, u^{(N)}) |
≤ (α/N) | sup_{µ∈M_+^1(Θ)} |R̂_i(µ, u^{(i)}) − R_i(µ)| − sup_{µ∈M_+^1(Θ)} |R̂_i(µ, z^{(i)}) − R_i(µ)| |
≤ (α/N) sup_{µ∈M_+^1(Θ)} (1/m_i) | exp( E_{θ∼µ}[l(θ, u^{(i)}_k)] / α ) − exp( E_{θ∼µ}[l(θ, z)] / α ) |
≤ 2α exp(M/α) / (N m_i),

since l ≤ M. Then, by applying McDiarmid's inequality, we obtain that, with probability at least 1 − δ,

| R̂^{ε,*}_{adv,α} − R̂^{ε,m}_{adv,α} | ≤ E( f(u^{(1)}, …, u^{(N)}) ) + 2 exp(M/α) /√(mN) · √( log(2/δ)/2 ).

Thanks to [32, Lemma 26.2], we have, for all i ∈ {1, …, N},

E( f(u^{(1)}, …, u^{(N)}) ) ≤ Σ_{i=1}^N (2α/N) E( Rad(F_i ∘ u^{(i)}) ),

where, for any class of functions F defined on Z and any point z = (z_1, …, z_q) ∈ Z^q,

F ∘ z := { (f(z_1), …, f(z_q)), f ∈ F },
Rad(F ∘ z) := (1/q) E_{σ∼{±1}^q} [ sup_{f∈F} Σ_{i=1}^q σ_i f(z_i) ],
F_i := { u ↦ exp( E_{θ∼µ}[l(θ, u)] / α ), µ ∈ M_+^1(Θ) }.
Moreover, as x ↦ exp(x/α) is exp(M/α)/α-Lipschitz on (−∞, M], by [32, Lemma 26.9] we have

Rad(F_i ∘ u^{(i)}) ≤ (exp(M/α)/α) Rad(H_i ∘ u^{(i)}), where H_i := { u ↦ E_{θ∼µ}[l(θ, u)], µ ∈ M_+^1(Θ) }.

Let us now define

g(u^{(1)}, …, u^{(N)}) := Σ_{j=1}^N (2 exp(M/α)/N) Rad(H_j ∘ u^{(j)}).

We observe that

| g(u^{(1)}, …, u^{(N)}) − g(u^{(1)}, …, u^{(i−1)}, z^{(i)}, u^{(i+1)}, …, u^{(N)}) |
≤ (2 exp(M/α)/N) | Rad(H_i ∘ u^{(i)}) − Rad(H_i ∘ z^{(i)}) |
≤ 2 exp(M/α) M / (N m_i).
Applying McDiarmid's inequality, we have that, with probability at least 1 − δ,

E( g(u^{(1)}, …, u^{(N)}) ) ≤ g(u^{(1)}, …, u^{(N)}) + 4 exp(M/α) M /√(mN) · √( log(2/δ)/2 ).

Note also that

Rad(H_i ∘ u^{(i)}) = (1/m_i) E_{σ∼{±1}^{m_i}} [ sup_{µ∈M_+^1(Θ)} Σ_{j=1}^{m_i} σ_j E_µ[l(θ, u^{(i)}_j)] ] = (1/m_i) E_{σ∼{±1}^{m_i}} [ sup_{θ∈Θ} Σ_{j=1}^{m_i} σ_j l(θ, u^{(i)}_j) ].

Finally, applying a union bound leads to the desired result.

C.6 Proof of Proposition 5
Proof.
Following the same steps as in the proof of Proposition 4, let (µ^ε_n)_{n≥0} and (µ_n)_{n≥0} be two sequences such that R̂^ε_{adv,α}(µ^ε_n) → R̂^{ε,*}_{adv,α} and R̂^ε_{adv}(µ_n) → R̂^{ε,*}_{adv} as n → +∞. Note that

R̂^{ε,*}_{adv,α} − R̂^{ε,*}_{adv} ≤ R̂^{ε,*}_{adv,α} − R̂^ε_{adv,α}(µ_n) + R̂^ε_{adv,α}(µ_n) − R̂^ε_{adv}(µ_n) + R̂^ε_{adv}(µ_n) − R̂^{ε,*}_{adv}
≤ sup_{µ∈M_+^1(Θ)} | R̂^ε_{adv,α}(µ) − R̂^ε_{adv}(µ) | + R̂^ε_{adv}(µ_n) − R̂^{ε,*}_{adv}.

Then, by considering the limit, we obtain that R̂^{ε,*}_{adv,α} − R̂^{ε,*}_{adv} ≤ sup_{µ∈M_+^1(Θ)} |R̂^ε_{adv,α}(µ) − R̂^ε_{adv}(µ)|. Similarly, we obtain that R̂^{ε,*}_{adv} − R̂^{ε,*}_{adv,α} ≤ sup_{µ∈M_+^1(Θ)} |R̂^ε_{adv,α}(µ) − R̂^ε_{adv}(µ)|, from which follows that

| R̂^{ε,*}_{adv,α} − R̂^{ε,*}_{adv} | ≤ (1/N) Σ_{i=1}^N sup_{µ∈M_+^1(Θ)} | α log( ∫_{X×Y} exp( E_µ[l(θ, (x, y))] / α ) dU_{(x_i,y_i)} ) − sup_{u∈S_ε(x_i,y_i)} E_µ[l(θ, u)] |.

Let µ ∈ M_+^1(Θ) and i ∈ {1, …
, N } , then we have (cid:12)(cid:12)(cid:12) α log (cid:18)(cid:90) X ×Y exp (cid:18) E µ [ l ( θ, ( x, y ))] α (cid:19) d U ( x i ,y i ) (cid:19) − sup u ∈ S ε ( xi,yi ) E µ [ l ( θ, u )] (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) α log (cid:32)(cid:90) X ×Y exp (cid:32) E µ [ l ( θ, ( x, y ))] − sup u ∈ S ε ( xi,yi ) E µ [ l ( θ, u )] α (cid:33) d U ( x i ,y i ) (cid:33) (cid:12)(cid:12)(cid:12) = α (cid:12)(cid:12)(cid:12) log (cid:32)(cid:90) A ( xi,yi ) β,µ exp (cid:32) E µ [ l ( θ, ( x, y ))] − sup u ∈ S ε ( xi,yi ) E µ [ l ( θ, u )] α (cid:33) d U ( x i ,y i ) + (cid:90) ( A ( xi,yi ) β,µ ) c exp (cid:32) E µ [ l ( θ, ( x, y ))] − sup u ∈ S ε ( xi,yi ) E µ [ l ( θ, u )] α (cid:33) d U ( x i ,y i ) (cid:33) (cid:12)(cid:12)(cid:12) ≤ α (cid:12)(cid:12)(cid:12) log (cid:16) exp( − β/α ) U ( x i ,y i ) (cid:16) A ( x i ,y i ) β,µ (cid:17)(cid:17) (cid:12)(cid:12)(cid:12) + α (cid:12)(cid:12)(cid:12) log β/α ) U ( x i ,y i ) (cid:16) A ( x i ,y i ) β,µ (cid:17) (cid:90) ( A ( xi,yi ) β,µ ) c exp (cid:32) E µ [ l ( θ, ( x, y ))] − sup u ∈ S ε ( xi,yi ) E µ [ l ( θ, u )] α (cid:33) d U ( x i ,y i ) (cid:12)(cid:12)(cid:12) ≤ α log(1 /C β ) + β + αC β ≤ α log(1 /C β ) + β .7 Proof of Proposition 6 Proof.
Thanks to Danskin's theorem, if $Q^*$ is a best response to $\lambda$, then $g^* := \left( \mathbb{E}_{Q^*}[l(\theta_1,(x,y))], \dots, \mathbb{E}_{Q^*}[l(\theta_L,(x,y))] \right)^T$ is a subgradient of $\lambda \mapsto R^\varepsilon_{adv}(\lambda)$. Let $\eta > 0$ be the learning rate. Then we have for all $t \geq 1$:
$$\begin{aligned} \|\lambda_t - \lambda^*\|^2 &\leq \|\lambda_{t-1} - \eta g_{t-1} - \lambda^*\|^2 = \|\lambda_{t-1} - \lambda^*\|^2 - 2\eta \langle g_{t-1}, \lambda_{t-1} - \lambda^* \rangle + \eta^2 \|g_{t-1}\|^2 \\ &\leq \|\lambda_{t-1} - \lambda^*\|^2 - 2\eta \langle g^*_{t-1}, \lambda_{t-1} - \lambda^* \rangle + 2\eta \langle g^*_{t-1} - g_{t-1}, \lambda_{t-1} - \lambda^* \rangle + \eta^2 M^2 L \\ &\leq \|\lambda_{t-1} - \lambda^*\|^2 - 2\eta \left( R^\varepsilon_{adv}(\lambda_{t-1}) - R^\varepsilon_{adv}(\lambda^*) \right) + 4\eta\delta + \eta^2 M^2 L. \end{aligned}$$
We then deduce by summing:
$$2\eta \sum_{t=1}^{T} \left( R^\varepsilon_{adv}(\lambda_{t-1}) - R^\varepsilon_{adv}(\lambda^*) \right) \leq 4\delta\eta T + \|\lambda_0 - \lambda^*\|^2 + \eta^2 M^2 L T.$$
Then, using $\|\lambda_0 - \lambda^*\|^2 \leq 4$, we have:
$$\min_{t \in [T]} R^\varepsilon_{adv}(\lambda_t) - R^\varepsilon_{adv}(\lambda^*) \leq 2\delta + \frac{4}{2\eta T} + \frac{M^2 L \eta}{2}.$$
The right-hand side is minimal for $\eta = \frac{2}{M\sqrt{LT}}$, and for this value:
$$\min_{t \in [T]} R^\varepsilon_{adv}(\lambda_t) - R^\varepsilon_{adv}(\lambda^*) \leq 2\delta + \frac{2M\sqrt{L}}{\sqrt{T}}.$$

D Additional Experimental Results
D.1 Experimental setting.
Optimizer.
For all our models, the optimizer used in all our implementations is SGD. The learning rate is divided by a constant factor at half of training, and again at three quarters of training. The momentum, weight decay, and batch size are kept fixed throughout training.

Adaptation of Attacks.
Since our classifier is randomized, we need to adapt the attacks accordingly. To do so, we used the expected loss
$$\tilde{l}((\lambda, \theta), (x,y)) = \sum_{k=1}^{L} \lambda_k\, l(\theta_k, (x,y))$$
to compute the gradient in the attacks, regardless of the underlying loss (DLR or cross-entropy). For the inner maximization at training time, we used a PGD attack on the cross-entropy loss with $\varepsilon = 8/255$. For the final evaluation, we used the untargeted DLR attack with its default parameters.
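As an illustration of this adaptation, here is a minimal, self-contained sketch of an $\ell_\infty$ PGD attack on the expected loss of a mixture; this is not the paper's implementation — the logistic loss, the toy parameters, and the function names (`pgd_on_mixture`, etc.) are assumptions made for the example. The only point it demonstrates is that the gradient of the expected loss is the $\lambda$-weighted sum of per-model gradients.

```python
import math

# Hypothetical toy sketch: l_inf PGD on the lambda-weighted ("expected") loss
# of a mixture of linear classifiers, with the logistic loss. The gradient of
# the expected loss is simply the lambda-weighted sum of per-model gradients.

def expected_loss(thetas, lam, x, y):
    """sum_k lam_k * log(1 + exp(-y * <theta_k, x>))."""
    total = 0.0
    for theta, lk in zip(thetas, lam):
        margin = y * sum(t * xi for t, xi in zip(theta, x))
        total += lk * math.log(1.0 + math.exp(-margin))
    return total

def expected_loss_grad(thetas, lam, x, y):
    """Gradient with respect to x of the expected loss above."""
    grad = [0.0] * len(x)
    for theta, lk in zip(thetas, lam):
        margin = y * sum(t * xi for t, xi in zip(theta, x))
        # d/dm log(1 + e^{-m}) = -1 / (1 + e^{m}); the chain rule adds y * theta_i
        coeff = -y / (1.0 + math.exp(margin))
        for i, t in enumerate(theta):
            grad[i] += lk * coeff * t
    return grad

def pgd_on_mixture(thetas, lam, x, y, eps=0.5, step=0.1, iters=20):
    """Ascend the expected loss; keep the perturbation in the l_inf eps-ball."""
    adv = list(x)
    for _ in range(iters):
        g = expected_loss_grad(thetas, lam, adv, y)
        adv = [a + step * (1.0 if gi > 0 else -1.0) for a, gi in zip(adv, g)]
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv

thetas = [[1.0, 0.5], [0.8, -0.2]]  # two linear models in the mixture
lam = [0.5, 0.5]                    # mixture weights lambda
x, y = [1.0, 1.0], 1
adv = pgd_on_mixture(thetas, lam, x, y)
```

In a deep-learning framework the same idea amounts to backpropagating through the $\lambda$-weighted sum of the models' losses.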
Regularization in Practice.
The entropic regularization needs to be adapted in higher-dimensional settings to be more likely to find adversaries. To do so, we computed PGD attacks with only a few iterations and several different restarts, instead of sampling points uniformly in the $\ell_\infty$-ball. In our experiments in the main paper, we use a fixed regularization parameter $\alpha$. The learning rate for the minimization over $\lambda$ is also kept fixed.

Alternate Minimization Parameters. Algorithm 2 implies an alternate minimization algorithm. We set the number of updates of $\theta$ to $T_\theta = 50$ and the number of updates of $\lambda$ to $T_\lambda = 25$.

D.2 Effect of the Regularization
In this subsection, we experimentally investigate the effect of the regularization. In Figure 4, we notice that the regularization stabilizes training, reduces the variance, and improves the level of robust accuracy reached by adversarial training for mixtures (Algorithm 2). The standard accuracy curves are very similar in both cases.
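To make the role of $\alpha$ concrete, here is a small numeric sketch (with made-up loss values, not data from the paper) of the entropic surrogate $\alpha \log \int \exp(\mathbb{E}_\mu[l]/\alpha)\, dU$ that replaces the inner supremum: it lower-bounds the supremum, approaches it as $\alpha \to 0$, and larger $\alpha$ yields a smoother average-like objective, consistent with the stabilization observed above.

```python
import math
import random

# Hypothetical illustration: compare the hard supremum of sampled loss values
# with the entropic surrogate alpha * log(mean(exp(l / alpha))).

def soft_sup(values, alpha):
    m = max(values)  # shift by the max for numerical stability
    return m + alpha * math.log(
        sum(math.exp((v - m) / alpha) for v in values) / len(values)
    )

random.seed(0)
# stand-in for E_mu[l(theta, u)] evaluated at sampled points u of the ball
losses = [random.random() for _ in range(1000)]

hard = max(losses)
for alpha in (1.0, 0.1, 0.001):
    print(f"alpha={alpha}: soft={soft_sup(losses, alpha):.4f} vs hard={hard:.4f}")
```

The surrogate is monotone in $\alpha$: it interpolates between the mean of the sampled losses (large $\alpha$) and their maximum (small $\alpha$).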
[Four panels: accuracy as a function of epochs per model.]

Figure 4: Left and middle-left: standard accuracies over epochs, respectively without and with entropic regularization. Middle-right and right: robust accuracies for the same settings against a PGD attack with $\varepsilon = 8/255$.

D.3 Additional Experiments on WideResNet28x10
We now evaluate our algorithm on the WideResNet28x10 [42] architecture. Due to computation costs, we limit ourselves to small mixtures, with the regularization parameter set as in the experiments section of the main paper. Results are reported in Figure 5. We remark that this architecture can lead to more robust models, corroborating the results from [19].

[Table: number of models, standard accuracy, APGD-CE, APGD-DLR, and robust accuracy on WideResNet28x10; robust accuracies are in the 48 to 50% range.]

[Two panels: accuracy as a function of epochs per model.]

Figure 5: Left: comparison of our algorithm with a standard adversarial training (one model) on WideResNet28x10. We report the results for the model with the best robust accuracy obtained over two independent runs, because adversarial training might be unstable. Middle and right: standard and robust accuracy on CIFAR-10 test images as a function of the number of epochs per classifier for WideResNet28x10 mixtures. The performed attack is PGD with $\varepsilon = 8/255$.

D.4 Overfitting in Adversarial Robustness
We further investigate the overfitting behavior of our heuristic algorithm. We plot in Figure 6 the robust accuracy of ResNet18 mixtures. Against PGD, the most robust mixture for the largest number of models arrives at the end of training, contrary to smaller mixtures, for which the most robust mixture occurs at an earlier epoch. However, the accuracy against APGD with 100 iterations at the end of training (epoch 198) is lower than at that earlier epoch. This surprising phenomenon would suggest that the more powerful the attacks are, the more the models are subject to overfitting. We leave this question to future work.

[Two panels: accuracy as a function of epochs per model.]

Figure 6: Standard and robust accuracy (respectively on the left and on the right) on CIFAR-10 test images as a function of the number of epochs per classifier for ResNet18 mixtures. The performed attack is PGD with $\varepsilon = 8/255$. The best mixture occurs at the end of training.

E Additional Results
E.1 Equality of Standard Randomized and Deterministic Minimal Risks
Proposition 7.
Let $P$ be a Borel probability distribution on $\mathcal{X} \times \mathcal{Y}$, and let $l$ be a loss satisfying Assumption 1. Then:
$$\inf_{\mu \in \mathcal{M}^1_+(\Theta)} R(\mu) = \inf_{\theta \in \Theta} R(\theta).$$
Proof.
It is clear that $\inf_{\mu \in \mathcal{M}^1_+(\Theta)} R(\mu) \leq \inf_{\theta \in \Theta} R(\theta)$. Now, let $\mu \in \mathcal{M}^1_+(\Theta)$; then:
$$R(\mu) = \mathbb{E}_{\theta \sim \mu}[R(\theta)] \geq \operatorname*{ess\,inf}_{\theta \sim \mu} R(\theta) \geq \inf_{\theta \in \Theta} R(\theta),$$
where $\operatorname{ess\,inf}$ denotes the essential infimum. We can deduce an immediate corollary.

Corollary 2.
Under Assumption 1, the dual problems for randomized and deterministic classifiers are equal.
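As a quick numeric sanity check of Proposition 7, restricted to a finite class purely for illustration (the risk values below are made up): the standard risk is linear in the mixture weights, so its infimum over the simplex is attained at a vertex, i.e. no mixture improves on the best single classifier.

```python
import random

# Illustrative sketch: R(mu) = E_{theta~mu}[R(theta)] is linear in mu, so its
# infimum over mixtures equals the infimum over single classifiers.

random.seed(1)
risks = [random.random() for _ in range(5)]  # hypothetical R(theta) for 5 classifiers

def mixture_risk(weights):
    return sum(w * r for w, r in zip(weights, risks))

best_single = min(risks)
best_mixture = best_single  # the Dirac on the best classifier is itself a mixture
for _ in range(10000):      # random search over mixture weights
    w = [random.random() for _ in risks]
    s = sum(w)
    w = [wi / s for wi in w]
    best_mixture = min(best_mixture, mixture_risk(w))
```

This equality is specific to the standard risk: for the adversarial risk, the inner supremum breaks linearity in $\mu$, which is why randomization can help there.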
E.2 Decomposition of the Empirical Risk for Entropic Regularization
Proposition 8.
Let $\hat{P} := \frac{1}{N} \sum_{i=1}^{N} \delta_{(x_i, y_i)}$, and let $l$ be a loss satisfying Assumption 1. Then we have:
$$\frac{1}{N} \sum_{i=1}^{N} \sup_{x,\, d(x, x_i) \leq \varepsilon} \mathbb{E}_{\theta \sim \mu}[l(\theta, (x, y_i))] = \sum_{i=1}^{N} \sup_{Q_i \in \Gamma_{i,\varepsilon}} \mathbb{E}_{(x,y) \sim Q_i,\, \theta \sim \mu}[l(\theta, (x,y))],$$
where $\Gamma_{i,\varepsilon}$ is defined as:
$$\Gamma_{i,\varepsilon} := \left\{ Q_i \;\middle|\; \int dQ_i = \frac{1}{N},\ \int c_\varepsilon((x_i, y_i), \cdot)\, dQ_i = 0 \right\}.$$
Proof.
This proposition is a direct application of Proposition 2 to the Dirac measures $\delta_{(x_i,y_i)}$.

E.3 On the NP-Hardness of Attacking a Mixture of Classifiers
The problem of finding a best response to a mixture of classifiers is in general NP-hard. Let us justify this on a mixture of linear classifiers in binary classification: $f_{\theta_k}(x) = \langle \theta_k, x \rangle$ for $k \in [L]$, with uniform weights $\lambda = (1/L, \dots, 1/L)$. Let us consider the $\ell_2$ norm, $x = 0$ and $y = 1$. Then the problem of attacking $x$ is the following:
$$\sup_{\tau,\, \|\tau\| \leq \varepsilon} \frac{1}{L} \sum_{k=1}^{L} \mathbb{1}_{\langle \theta_k, \tau \rangle \leq 0}.$$
This problem is equivalent to a linear binary classification problem on $\tau$ (maximizing the number of satisfied linear inequalities), which is known to be NP-hard.

E.4 Case of Separated Conditional Distributions
Proposition 9.
Let $\mathcal{Y} = \{-1, +1\}$ and $P \in \mathcal{M}^1_+(\mathcal{X} \times \mathcal{Y})$. Let $\varepsilon > 0$. For $i \in \mathcal{Y}$, let us denote by $P_i$ the distribution of $P$ conditioned on $y = i$. Let us assume that $d_{\mathcal{X}}(\operatorname{supp}(P_{+1}), \operatorname{supp}(P_{-1})) > 2\varepsilon$. Let us consider the nearest-neighbor deterministic classifier
$$f(x) = d(x, \operatorname{supp}(P_{-1})) - d(x, \operatorname{supp}(P_{+1}))$$
and the 0/1 loss $l(f, (x,y)) = \mathbb{1}_{y f(x) \leq 0}$. Then $f$ attains both the optimal standard and adversarial risks: $R(f) = 0$ and $R^\varepsilon_{adv}(f) = 0$.

Proof. Let us denote $p_i = P(y = i)$. Then we have
$$R^\varepsilon_{adv}(f) = p_{+1}\, \mathbb{E}_{P_{+1}}\left[ \sup_{x',\, d(x,x') \leq \varepsilon} \mathbb{1}_{f(x') \leq 0} \right] + p_{-1}\, \mathbb{E}_{P_{-1}}\left[ \sup_{x',\, d(x,x') \leq \varepsilon} \mathbb{1}_{f(x') \geq 0} \right].$$
For $x \in \operatorname{supp}(P_{+1})$ and any $x'$ such that $d(x, x') \leq \varepsilon$, we have $d(x', \operatorname{supp}(P_{+1})) \leq \varepsilon$, while $d(x', \operatorname{supp}(P_{-1})) \geq d(x, \operatorname{supp}(P_{-1})) - \varepsilon > \varepsilon$; hence $f(x') > 0$, and therefore
$$\mathbb{E}_{P_{+1}}\left[ \sup_{x',\, d(x,x') \leq \varepsilon} \mathbb{1}_{f(x') \leq 0} \right] = 0.$$
Similarly, we have $\mathbb{E}_{P_{-1}}\left[ \sup_{x',\, d(x,x') \leq \varepsilon} \mathbb{1}_{f(x') \geq 0} \right] = 0$, so $R^\varepsilon_{adv}(f) = 0$; since $R(f) \leq R^\varepsilon_{adv}(f)$, we also get $R(f) = 0$.
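A minimal one-dimensional sanity check of this construction (the supports, $\varepsilon$, and helper names are toy choices made for illustration; the sign convention is that positive scores predict class $+1$):

```python
import random

# Hypothetical 1-D example: two class supports separated by more than 2*eps.
# The nearest-neighbor score f(x) = d(x, supp(P_-1)) - d(x, supp(P_+1))
# classifies every point of an eps-ball around either support correctly,
# so the empirical robust 0/1 error is zero.

eps = 0.1
supp_pos = [0.0, 0.2, 0.4]   # supp(P_+1)
supp_neg = [1.0, 1.2, 1.5]   # supp(P_-1); separation 0.6 > 2 * eps

def f(x):
    d_pos = min(abs(x - s) for s in supp_pos)
    d_neg = min(abs(x - s) for s in supp_neg)
    return d_neg - d_pos  # > 0 predicts class +1

def robust_error(points, y, eps, trials=200):
    """0/1 error under sampled perturbations (a stand-in for the exact sup)."""
    random.seed(0)
    errors = 0
    for x in points:
        if any(y * f(x + random.uniform(-eps, eps)) <= 0 for _ in range(trials)):
            errors += 1
    return errors / len(points)

print(robust_error(supp_pos, +1, eps), robust_error(supp_neg, -1, eps))
```

Shrinking the separation below $2\varepsilon$ breaks the argument: an adversary can then push a point closer to the wrong support than to its own.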