Analyzing and Improving Generative Adversarial Training for Generative Modeling and Out-of-Distribution Detection
Xuwang Yin, Shiying Li, and Gustavo K. Rohde
University of Virginia, Charlottesville, VA 22904, USA {xy4cm, sl8jx, gustavo}@virginia.edu
Abstract
Generative adversarial training (GAT) is a recently introduced adversarial defense method. Previous works have focused on empirical evaluations of its application to training robust predictive models. In this paper we focus on theoretical understanding of the GAT method and on extending its application to generative modeling and out-of-distribution detection. We analyze the optimal solutions of the maximin formulation employed by the GAT objective, and make a comparative analysis of the minimax formulation employed by GANs. We use theoretical analysis and 2D simulations to understand the convergence properties of the training algorithm. Based on these results, we develop an incremental generative training algorithm, and conduct comprehensive evaluations of the algorithm's application to image generation and adversarial out-of-distribution detection. Our results suggest that generative adversarial training is a promising new direction for the above applications.
Generative adversarial training (GAT) [61] is a recently introduced defense mechanism for adversarial example detection and robust classification. The defense consists of a committee of detectors (binary discriminators), with each one trained to discriminate natural data of a particular class from adversarial examples perturbed from data of other classes. Like most other work in the area of robust machine learning, the defense is specially designed for defending against norm-constrained adversaries, that is, adversaries that are constrained to perturb the data up to a certain amount as measured by some norm. The defense's robustness is achieved by training each detector model against adversarial examples produced by the norm-constrained PGD attack [33].
Existing work: training and evaluating robust predictive models
A compelling property of the GAT method is that the detector trained with the method exhibits strong interpretability: an unbounded attack that maximizes the detector's output results in images that resemble the target class data, which suggests the detector has learned the target class data distribution. While the method has been successfully applied to training robust predictive models [61, 58], this behavior of the method is not yet understood at a mathematical, theoretical level.
This work: theoretical understanding, improved training algorithm, and extended applications
In order to gain some theoretical understanding of the GAT method, we first analyze the optimal solutions of the training objective (eq. (2)). We start with a maximin formulation (eq. (5)) of the objective, and try to connect it with the minimax reformulation (eq. (1)) employed in the GANs framework [13]. We find that the differences between solutions of these two formulations become immediately clear when we take a game-theory perspective. We then use theoretical analysis and 2D simulations to understand the convergence properties of the GAT training algorithm. Building upon these theoretical and experimental insights, we develop an incremental GAT algorithm, and apply it to the tasks of generative modeling and out-of-distribution detection. We find the maximin-based generative model to be more stable to train than its minimax counterpart (GANs), and at the same time more flexible, as it does not have a fixed generator and can transform arbitrary inputs into target distribution data, which might be particularly useful for certain applications (e.g., face manipulation). The model trained with the incremental GAT algorithm also outperforms several state-of-the-art methods on the task of adversarial out-of-distribution detection. In summary, our key contributions are:

• We analyze the optimal solutions of the GAT objective and the convergence properties of the training algorithm. We discuss the implications of these results for generative modeling and out-of-distribution detection.

• We develop an incremental generative adversarial training algorithm. We conduct a comprehensive evaluation of the algorithm's application to image generation and adversarial out-of-distribution detection.

• Our comparative analysis of the maximin and minimax problems clarifies misconceptions and provides new insights into how they could be utilized to solve different problems.

Generative adversarial networks (GANs)
The GANs framework [13] learns a generator function G and a discriminator function D by solving the following minimax problem

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))].  (1)

The generator G implicitly defines a distribution p_g by mapping a prior distribution p_z from a low-dimensional latent space Z ⊆ R^z to the high-dimensional data space X ⊆ R^d. D : X → [0, 1] is a function that discriminates the target data distribution p_data from the generated distribution p_g. The minimax problem is solved by alternating between the optimization of D and the optimization of G; under certain conditions, the alternating training procedure converges to a solution where p_g matches p_data (the Jensen-Shannon divergence is zero), and D outputs ½ on the support of p_data.

Generative adversarial training (GAT)
The GAT method [61] is designed for training adversarial example detection and robust classification models. In a K-class classification problem, the robust detection/classification system consists of K base detectors, with each one trained by minimizing the following objective

L(D) = − E_{x∼p_k}[log D(x)] − E_{x∼p_−k}[log(1 − max_{x′∈B(x,ε)} D(x′))].  (2)

In the above objective, p_k is the k-th class's data distribution, p_−k is the mixture distribution of all other classes: p_−k = (1/(K−1)) Σ_{i=1,…,K, i≠k} p_i, and B(x, ε) is a neighborhood of x: {x′ ∈ X : ‖x′ − x‖ ≤ ε}. The objective is characterized by an inner maximization problem and an outer minimization problem; when the inner maximization is perfectly solved and D achieves a vanishing loss, D becomes a perfectly robust model capable of separating data of p_k from any ε-constrained adversarial examples perturbed from data of p_−k. A committee of K detectors then provides a complete solution for detecting any adversarial example perturbed from an arbitrary class. Objective (2) is solved using an alternating gradient method (Algorithm 4), with the first step crafting adversarial examples by solving the inner maximization, and the second step improving the D model on these adversarial examples.

Clearly, the detector's robustness depends on how well the inner maximization is solved. Despite the fact that D is a highly non-concave function when it is parameterized by a deep neural network, Madry et al. [33] observed that the inner problem could be reasonably solved using projected gradient descent (the PGD attack), a first-order method that employs the following iterative gradient update rule (at initialization x⁰ ← x; we consider the L₂-based attack)

x^{i+1} ← Proj( x^i + γ ∇log D(x^i) / ‖∇log D(x^i)‖ ),  (3)

where γ is some step size, and Proj is the operation of projecting onto the feasible set B(x, ε).
The normalized steepest ascent rule inside the Proj operation was introduced to deal with the issue of vanishing gradients when optimizing with the cross-entropy loss [25].
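The update rule of eq. (3) can be sketched in a few lines. In the minimal sketch below, the quadratic log D and its analytic gradient are toy stand-ins of our own choosing, not a trained detector; the mechanics of the normalized step and the projection onto B(x₀, ε) are what the sketch illustrates.

```python
import numpy as np

def pgd_ascent(x0, grad_logD, eps, gamma, steps):
    """Normalized-gradient ascent on log D, projected onto the
    L2 ball B(x0, eps), mirroring eq. (3)."""
    x = x0.copy()
    for _ in range(steps):
        g = grad_logD(x)
        x = x + gamma * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
        delta = x - x0
        r = np.linalg.norm(delta)
        if r > eps:                          # Proj: pull back onto B(x0, eps)
            x = x0 + delta * (eps / r)
    return x

# Toy log D(x) = -||x - mu||^2 / 2, so grad log D(x) = mu - x (hypothetical)
mu = np.array([3.0, 0.0])
x0 = np.zeros(2)
x_adv = pgd_ascent(x0, lambda x: mu - x, eps=1.0, gamma=0.1, steps=100)
# x_adv stays inside the eps-ball while moving toward the maximizer mu
```

Because the step is normalized, progress does not slow down in flat regions of log D, which is exactly the vanishing-gradient issue the rule addresses.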
Maximin and minimax problems in game theory
In game theory, a two-player zero-sum game is a mathematical representation of a situation in which one player's gain is balanced by another player's loss. Such a game is described by its payoff function f : R^{p+q} → R, which represents the amount of payment that one player (player 1) makes to the other player (player 2). The goal of player 1 is to choose a strategy u ∈ R^p such that the payoff is minimized, while the goal of player 2 is to choose a strategy v ∈ R^q such that the payoff is maximized. The best strategies for both players, and the resulting payoff, depending on the order of play, can be solved via min_u max_v f(u, v) or max_v min_u f(u, v).

In the minimax game min_u max_v, player 1 makes the first move. Player 2, after learning that player 1 has made the move u, will choose a v to maximize f(u, v), which results in a payoff of max_v f(u, v). Player 1, who is informed of player 2's strategy, will choose a u such that the worst-case payoff max_v f(u, v) is minimized, which results in a payoff of min_u max_v f(u, v).

In the maximin game max_v min_u, the order of play is reversed. Player 2 makes the first move, and then player 1 minimizes the payoff by choosing u = arg min_u f(u, v). Player 2 knows that player 1 will follow this strategy and will choose a v such that the worst-case payoff min_u f(u, v) is maximized, which results in a payoff of max_v min_u f(u, v).

The payoff min_u max_v f(u, v) is always greater than or equal to max_v min_u f(u, v). This difference can be intuitively understood as the result of the extra knowledge gained by the player who moves second. According to the minimax theorem [40], when f is a continuous function that is concave-convex (i.e., for each v, f(u, v) is a convex function of u, and for each u, f(u, v) is a concave function of v), these two quantities are equal. We refer the reader to [3] (§5.4.3, §10.3.4) for more details on this topic.

Out-of-distribution detection
We provide a review of related work on out-of-distributiondetection in Appendix B.
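The order-of-play gap discussed above is easy to check numerically. In the small matrix game below (payoff matrices of our own choosing, restricted to pure strategies), the first mover is worse off in the first game, while the second game has a saddle point and the two orders of play give the same value:

```python
import numpy as np

# A[i, j]: payment from player 1 (row, minimizer) to player 2 (column, maximizer)
A = np.array([[3.0, 1.0],
              [2.0, 4.0]])

minimax = A.max(axis=1).min()  # player 1 commits first: min_u max_v f(u, v)
maximin = A.min(axis=0).max()  # player 2 commits first: max_v min_u f(u, v)
print(minimax, maximin)        # 3.0 2.0 -- minimax >= maximin, strict gap here

# With a saddle point, the two orders of play coincide
B = np.array([[2.0, 3.0],
              [1.0, 2.0]])
print(B.max(axis=1).min(), B.min(axis=0).max())  # 2.0 2.0
```

The 1.0 gap in the first game quantifies the value of moving second; the concave-convex condition of the minimax theorem is the continuous analogue of the saddle point that makes the second game's values coincide.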
In this section we first reformulate objective (2) into a maximin problem, and then analyze the optimal solutions of the maximin problem and the convergence properties of Algorithm 1. We then discuss the optimal solution of the corresponding minimax formulation and the differences between the solutions of these two formulations. The popular generative modeling approach of GANs learns a data distribution by solving the minimax problem, but there seems to be a misconception about the differences between the solutions of these two problems, and as a result, a false impression that the GANs algorithm could solve the maximin problem (Goodfellow [12], §5.1.1). Our analysis of optimal solutions is based on a game-theory interpretation of these problems, and the differences between these solutions are immediately clear under such an analysis.
To gain some theoretical understanding of objective (2), it is useful to reformulate the problem. First, maximizing D is equivalent to minimizing log(1 − D), hence eq. (2) is equivalent to

L(D) = − E_{x∼p_k}[log D(x)] − E_{x∼p_−k}[ min_{x′∈B(x,ε)} log(1 − D(x′)) ].  (4)

We restrict our attention to analyzing the optimal solution of D under the scenario where ε is large enough that perturbations cover the entire data space: B(x, ε) = X. In this scenario, imposing perturbations on p_−k samples can be considered as moving mass of p_−k to locations in X via a transformation function T : X → X. Utilizing the technique of random variable transformation, we can write the density function of the resulting distribution p_t as a function of p_−k: p_t(y) = ∫_X p_−k(x) δ(y − T(x)) dx. Let M(X) be the set of distributions obtained by applying such transformations to the support of p_−k; the inner problem in eq. (4) can then be interpreted as determining the distribution in M(X) that causes the highest (expected) loss of the D function. The interplay of the D model and the adversary can be formulated as a maximin problem:

max_D min_{p_t∈M(X)} U(D, p_t) = E_{x∼p_k}[log D(x)] + E_{x∼p_t}[log(1 − D(x))].  (5)

A convenient way of analyzing the above problem is to consider it as a two-player zero-sum game: U can be interpreted as the payoff function which represents the amount of payment that one player (player p_t) makes to the other player (player D). The goal of player p_t is to choose a strategy p_t ∈ M(X) such that the payoff is minimized, while the goal of player D is to choose a strategy D ∈ D such that the payoff is maximized.
This maximin game is played by the following rule: player D makes the first move by choosing a D; player p_t, after learning that player D has made the move, will choose a p_t to minimize its payment, which results in a payoff of min_{p_t} U(D, p_t); player D, who is informed of player p_t's strategy, will choose a D such that the worst-case payoff min_{p_t} U(D, p_t) is maximized, which results in an overall payoff of max_D min_{p_t} U(D, p_t). Following this rule, we can derive the optimal strategy for player D:

Proposition 1.
In the maximin game max_D min_{p_t} U(D, p_t), the best strategy for player D is to choose a D that outputs ½ in Supp(p_k) and ≤ ½ in X \ Supp(p_k).

The mathematical derivation of this optimal solution is included in Appendix C. This optimal solution of D can also be verified by assuming a D value different from the claimed one, and showing that the payoff player D could receive is always lower than that with the claimed one. In particular, if there is a point q ∈ X with D(q) > ½, then player p_t could choose a p_t with all its mass concentrated on q, and hence cause a lower payoff (due to the term E_{x∼p_t}[log(1 − D(x))]|_{x=q}). On the other hand, if there is a q ∈ Supp(p_k) with D(q) < ½, player D's payoff is also going to be lower because the term E_{x∼p_k}[log D(x)] has a lower value than in the case where D outputs ½ everywhere in Supp(p_k).

Given player D's optimal strategy, player p_t's optimal strategy is to choose a p_t with its mass distributed in locations where D outputs ½.

3.2 The maximin problem solver

While the game-theory interpretation is useful for understanding the problem, the optimal solution of D as predicted by the theory seems not very interesting. We next show that the actual D solution, as obtained via a numerical algorithm, has some interesting properties and is useful for a few applications. We start with a "translation" of Algorithm 4 into our maximin language, which results in a solver (Algorithm 1) for the maximin problem.

Algorithm 1 The maximin problem solver
1. Initialize parameters of D.
2. Sample a minibatch of m samples {x₁^k, …, x_m^k} from p_k, and m samples {x₁^−k, …, x_m^−k} from p_−k.
3. Compute perturbed samples {x̃₁, …, x̃_m} by solving min_{x∈X} log(1 − D(x)) for each x_i^−k.
4. Update D by maximizing (1/m) Σ_{i=1}^m [log D(x_i^k) + log(1 − D(x̃_i))] (single step).
5. Return to step 2.
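The alternating dynamics of Algorithm 1 can be reproduced in a deliberately simplified 1D sketch of our own construction: D is stored as a table of logits over a grid rather than a neural network, the inner minimization (step 3) becomes greedy hill climbing on D, and p_−k is uniform over the grid. All sizes and rates below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, T, lr, iters = 64, 32, 5, 0.5, 3000
z = np.zeros(N)                 # logits; D(i) = sigmoid(z[i]) on a 1D grid
supp = np.arange(28, 36)        # Supp(p_k): cells 28..35
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def climb(i, z, steps):
    """Step 3: greedy hill climbing on D, a discrete stand-in for
    gradient ascent; stops at a local maximum."""
    for _ in range(steps):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(z)]
        best = max(nbrs, key=lambda j: z[j])
        if z[best] <= z[i]:
            break
        i = best
    return i

for _ in range(iters):
    xk = rng.choice(supp, size=m)                         # step 2: p_k minibatch
    xt = [climb(i, z, T) for i in rng.integers(0, N, m)]  # step 3, uniform p_-k
    s = sigmoid(z)
    for i in xk:                        # step 4: single ascent step on eq. (5)
        z[i] += lr / m * (1.0 - s[i])   # d/dz log sigmoid(z)
    for i in xt:
        z[i] -= lr / m * s[i]           # d/dz log(1 - sigmoid(z))

D = sigmoid(z)
# D ends up high on Supp(p_k) and suppressed everywhere else
```

Watching z during the run shows the behavior analyzed below: wherever climbers pile up outside Supp(p_k), step 4 immediately pushes that cell down, so spurious maxima are transient.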
We consider the case where D is parameterized by a neural network. The parameters are initialized in step 1 of Algorithm 1, which typically results in a highly non-concave function ([14], §8.2.2). Then in each iteration, step 2 samples points from p_k and p_−k, step 3 solves the inner minimization of eq. (5) by moving samples of p_−k to locations where D has the maximum outputs (which results in a new distribution p_t), and step 4 solves the outer maximization of eq. (5) by increasing D's outputs on p_k samples (maximizing E_{x∼p_k}[log D(x)]) and decreasing outputs on the perturbed p_−k samples (maximizing E_{x∼p_t}[log(1 − D(x))]).

Step 3 is implemented as a gradient-based search procedure: it uses samples of p_−k as starting points, and performs gradient ascent on D (eq. (3)). When D is a highly non-concave function, this process will inevitably get stuck in local maxima. Our 2D simulation of Algorithm 1 indicates that this problem is resolved by the alternating optimization procedure: if at step 3 samples of p_−k get stuck at local maxima, step 4 immediately decreases D's outputs on these samples. In other words, local maxima are constantly being eliminated during the course of algorithm execution. This pattern can be clearly observed in Figure 7.

Another issue with step 3 concerns the distribution of the p_−k data. As illustrated in Figure 1, when the p_−k data is concentrated in a subspace in the bottom left corner, Algorithm 1 converged to a D solution with outputs > ½ in locations other than Supp(p_k). Inspecting the gradient vector field in Figure 1(b), we find that by starting from p_−k and following the gradient of D, p_−k samples always end up at Supp(p_k); local maxima points at other locations cannot be reached by p_−k samples and hence cannot be eliminated.

To solve the above issue, we could use a p_−k that is distributed in the entire data space, as opposed to one that is concentrated in a small subspace.
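The reachability issue just described, that gradient ascent only finds maxima whose basins contain the starting points, can be reproduced on a fixed 1D landscape. The function below is an illustrative choice of ours, not a trained D: a start concentrated near one bump stalls there, while a start inside the other basin reaches the global maximum.

```python
import numpy as np

def f(x):
    """Toy multimodal landscape: bump of height 0.5 at 0.2, global max 1.0 at 0.8."""
    return 0.5 * np.exp(-(x - 0.2) ** 2 / 0.005) + np.exp(-(x - 0.8) ** 2 / 0.005)

def grad_f(x):
    return (0.5 * np.exp(-(x - 0.2) ** 2 / 0.005) * (-2 * (x - 0.2) / 0.005)
            + np.exp(-(x - 0.8) ** 2 / 0.005) * (-2 * (x - 0.8) / 0.005))

def ascend(x, step=1e-3, steps=500):
    """Plain gradient ascent with a fixed small step size."""
    for _ in range(steps):
        x = x + step * grad_f(x)
    return x

x_near = ascend(0.10)  # start near the small bump: stalls at its local maximum
x_far = ascend(0.75)   # start inside the global basin: reaches the global maximum
```

With a fixed D, the maximum near 0.8 is invisible to starting points drawn only from around 0.1; only spreading the starting points (a uniform p_−k) lets every basin be entered.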
In the same 2D experiment, when we use a uniform distribution over the data space as p_−k, in multiple trials of the experiment we consistently obtained D solutions with global maxima at Supp(p_k) and no local maxima (Figure 1(c) and Figure 8, bottom row).

Convergence of Algorithm 1
Equipped with knowledge gained from the 2D experiment, we now explore the convergence properties of Algorithm 1. We assume that optimizing the first term E_{x∼p_k}[log D(x)] of eq. (5) does not affect D's outputs outside of Supp(p_k) when D has enough capacity. In the second term E_{x∼p_t}[log(1 − D(x))], p_t is the new distribution formed by taking p_−k samples as starting points and performing gradient ascent on D; for any q ∈ Supp(p_t), q is either in Supp(p_k), or on a local maximum point in X \ Supp(p_k). Collectively, optimizing this term either causes local maxima values to decrease (when there exist p_t samples located at local maxima points in X \ Supp(p_k)), or does not affect D's outputs in X \ Supp(p_k) (when p_t samples are all in Supp(p_k)). This observation leads to a conjecture that if we run Algorithm 1 for enough iterations, local maxima in X \ Supp(p_k) will eventually disappear. We now show that under certain conditions Algorithm 1 has the following convergence property:

Figure 1: Plots of contours and gradient vector fields of the D functions (gradient vectors are normalized to have unit length). (a) The initial positions of p_−k and p_k. (b) The solution obtained by the maximin problem solver. (c) The solution obtained by the maximin problem solver when p_−k is a uniform distribution in the data space. (d) The solution obtained by the minimax problem solver.

Proposition 2. If D has enough capacity and p_−k has non-zero density everywhere in the space X \ Supp(p_k), then for any initialization of D, Algorithm 1 converges to a D solution with global maxima at Supp(p_k) and no local maxima; the global maximum value is ½.

Proof. We first prove that any local maximum in
X \ Supp(p_k) can be eliminated by running Algorithm 1 for a sufficient and finite number of iterations. To proceed, we first state the condition under which a local maximum will be eliminated: a local maximum in X \ Supp(p_k) will be eliminated if, via one or more iterations of the algorithm, a sufficient number of p_−k samples reach the local maximum point by performing gradient ascent on D (step 3). When this condition is satisfied, the cumulative execution of step 4 causes the local maximum value to decrease to a sufficiently small value and the local maximum to disappear.

We next show that for any local maximum point q ∈ X \ Supp(p_k) the above condition is always satisfied when Algorithm 1 runs for a sufficient and finite number of iterations. Let U be a neighborhood of q in which q is the only critical point. U is non-empty by definition, and any p ∈ U can reach q via gradient ascent on D when a sufficiently small step size is used. Because p_−k has non-zero density everywhere, a sufficient number of p_−k samples will fall in U and subsequently reach q via gradient ascent if enough samplings of p_−k are done via step 2. We note that the definition of U could change after D's update in step 4, but U is non-empty as long as q is still a local maximum point.

Finally we note that after D's update in step 4 the set of local maxima in X \
Supp( p k ) will eventually become empty. Compared to the maximin game, the minimax game min p t max D U ( D, p t ) has a reversed rule:player p t makes the first move by choosing a p t ; player D then chooses a D to maximize its payoff,6 lgorithm 2 The minimax problem solver Initialize p t ← p − k . repeat Update D by maximizing E x ∼ p k [log D ( x )] + E x ∼ p t [log(1 − D ( x ))] (until converge). For each x ∈ p t , update its value by x ← x − λ ∇ log(1 − D ( x )) k∇ log(1 − D (˜ x )) k (single step). until p t convergences to p k which results in a payoff of max D U ( D, p t ); player p t knows player D ’s strategy and will choose a p t such that the worst case payoff max D U ( D, p t ) is minimized , which results in a overall payoff ofmin p t max D U ( D, p t ).The solution of this minimax game is analyzed in [13, 12]: the optimal strategy of player p t is to choose a p t that minimizes the Jensen-Shannon divergence (JSD) between p t and p k : p ∗ t = arg min p t ∈M ( X ) JSD( p t k p k ) = p k , and the optimal strategy of player D is to choose D ∗ = p k p k + p t = . Under these strategies, the payoff function U measures the JSD between p t and p k : U ( D ∗ , p ∗ t ) = − log(4) + 2 · JSD( p ∗ t k p k ) = − log(4). It should be noted that D ∗ does not needto be defined outside of Supp( p t ) ∪ Supp( p k ) [13].Removing the “generator” from GANs’ training algorithm ([13] §4) gives us a solver (Algorithm 2)for the minimax problem. According to [13], if at each step p t is updated with a sufficiently smallstep of λ , and D is trained to reach its optimum, then p t converges to p k . 2D simulation results inFigure 1(d) and Figure 9 confirm this convergence property. There are a few differences between the solutions of the maximin problem and minimax problem. D solution difference While both D ∗ s output in Supp( p k ), their outputs outside of Supp( p k )are different. 
In the maximin problem, the ideal D solution outputs < ½ and has no local maxima outside of Supp(p_k). In contrast, D* in the minimax game does not need to be defined outside of Supp(p_k). In other words, D* in the minimax problem has unpredictable behavior outside Supp(p_k). This phenomenon can be observed in Figure 1(d) and Figure 9. This difference has an intuitive explanation from the game-theory perspective: in the maximin game, p_t* is decided in the second move, with knowledge of the current D value; to prevent player p_t from taking this advantage, the best strategy for player D is to specify D's outputs over the entire data space. In the minimax game, on the contrary, D* is decided in the second move, with knowledge of p_t, hence player D does not need to be concerned with D*'s outputs at locations other than the supports of p_t and p_k.

p_t solution difference  Another difference, which can also be observed in Figure 1, is that in the minimax game, p_t* exactly matches p_k, while in the maximin game, the mass of p_t* can be at any location where D* outputs ½.

Overall we find that these two formulations give rise to different applications. The minimax formulation, which is the formulation used by GANs, is perfect for learning a generator that produces a distribution that exactly matches the target data distribution. The discriminator (the D model), because of its undefined behavior in most of the data space, may not be very useful for downstream tasks. The maximin problem, if well solved (Figure 1(c)), gives a D function that models a characteristic function of the data distribution, and could be used to solve problems that require this feature (Section 4).

4 Implementation and Applications
Applications
Continuing our discussion from Section 3.2, when p_−k has non-zero density everywhere in the space X \ Supp(p_k), the maximin problem solver gives us a D function with global maxima at Supp(p_k) and no local maxima. We think this function is at least useful for the following two applications:

• Out-of-distribution (OOD) detection  Because D outputs < ½ for any data that is outside of Supp(p_k), we can use D's outputs to identify out-of-distribution inputs.

• Generative modeling  We can transform an arbitrary out-of-distribution sample into a target distribution (p_k) sample by taking the sample as the starting point and performing gradient ascent on D, until the sample reaches the support of the target distribution.

We note that this new generative modeling technique differs from standard approaches in that it does not learn a fixed "generator". This feature may facilitate certain generation tasks, as demonstrated in Figures 2, 3, 4 and 5.

p_−k data  While the 2D simulation results suggest that a uniform distribution would be an ideal candidate for p_−k, we find uniform noise to be ineffective when training in high-dimensional space (Appendix E.2). Our interpretation of these results is that real image samples lie on low-dimensional manifolds embedded within the high-dimensional space, and are close to each other in terms of geodesic distance ([14], §5.11.3). Uniform noise, on the other hand, is distributed over the entire space and may require substantially larger perturbations, and hence training time, in order to reach a satisfactory performance. This observation leads us to consider using a large, diverse, real image dataset as the p_−k dataset.

Incremental training algorithm
Another issue with Algorithm 1 is the implementation of step 3. As a gradient-based search procedure (Section 3.2), step 3 needs the step size and the number of steps to be specified. Regarding the step size, a sufficiently small value should be used in order for step 3 to converge to local maxima (see the ablation study in Appendix F). Setting an appropriate value for the number of steps poses a challenge: the number of steps should be large enough that the search covers enough of the data space, but this incurs a very high computational cost (evaluating ∇_x D(x) requires running a full forward-backward pass on the neural network that parameterizes D). On the other hand, performing more steps of gradient ascent is meaningless and a waste of computation once the search gets stuck in local maxima.

These challenges motivate us to consider an incremental training scheme: we could gradually increase the search radius by using an increasing sequence of numbers of steps, which is equally effective for locating local maxima but much more computationally efficient. We implement this incremental training idea in Algorithm 3: in the outer loop, we use an increasing sequence of numbers of steps (K); in the inner loop, due to the steepest descent update rule (Line 4; note there is no Proj operation here), the search radius is always ≤ γK; as we increase K, the search radius of the algorithm increases. In practice we find that this incremental algorithm indeed trains faster, and at the same time converges faster (Appendix E.1).

The incremental training algorithm also provides a mechanism for mitigating overfitting. Overfitting could happen in our approach because the algorithm learns a characteristic function that outputs ½ on Supp(p_k) and < ½ outside of Supp(p_k); when p_k is an empirical distribution, the ½ outputs occur only at p_k's samples. In that case, gradient ascent on D converges to p_k's samples. However, this only happens when the search covers Supp(p_k) (Line 3 of Algorithm 1).
In Algorithm 3, by controlling K and hence the search radius, we can mitigate overfitting and learn a D function that captures the manifold structure of the p_k data. Overfitting could also be mitigated by properly constraining the step size and number of steps when performing gradient ascent on D.

Algorithm 3
Incremental Generative Adversarial Training
1. for K in [0, 1, …, N] do
2.   for number of training iterations do
3.     Sample a minibatch of m samples {x₁, …, x_m} from p_k, and m samples {x̃₁, …, x̃_m} from p_−k.
4.     For each sample x̃_i in {x̃₁, …, x̃_m}, compute the perturbed sample x̃_i^K by performing K steps of normalized steepest descent x̃_i^{k+1} ← x̃_i^k − γ ∇log(1 − D(x̃_i^k)) / ‖∇log(1 − D(x̃_i^k))‖ (at initialization x̃_i^0 ← x̃_i).
5.     Update D by maximizing (1/m) Σ_{i=1}^m [log D(x_i) + log(1 − D(x̃_i^K))] (single step).
6.   end for
7. end for

We evaluate our method's application to out-of-distribution detection on the CIFAR-10 [27] and SVHN [39] datasets. In these two tasks, we respectively use CIFAR-10 and SVHN as the p_k dataset, and use the 80 Million Tiny Images dataset [57] as the p_−k dataset (same as [2]). We use the standard ResNet18 architecture as the D model and Algorithm 3 to train the models. The area under the receiver operating characteristic curve (AUROC) is used as the performance metric.

Table 1 and Table 2 report the adversarial OOD detection performance of our method and several state-of-the-art methods (the performance data of these methods is collected from [2]). It is observed that adversarial perturbations cause significant performance decreases for non-robust models (OE and our method with K = 0). (Performance on the standard OOD detection task is reported in Table 8.) Our method trained with K = 5 (CIFAR-10) and K = 45 (SVHN) outperforms several state-of-the-art methods in some entries.

Table 1: CIFAR-10 adversarial OOD detection performance (AUROC scores). The results of our method are based on a PGD attack with 100 steps and step size 0.002. For results under different attack configurations, including one with random restarts, see Table 6.

OOD dataset (with an L∞ perturbation of ε = 0.
N/A  14.8  23.3
ACET [15]  98.9  N/A  88.0  74.5
GOOD [2]  99.5  N/A  58.9  54.7
Ours (K = 0)  97.8  22  1.0  7.1
Ours (K = 5)  99.0  99.1

Table 2: SVHN adversarial OOD detection performance (AUROC scores). The results of our method are based on a PGD attack with 100 steps and step size 0.005. For results under different attack configurations, including one with random restarts, see Table 7.

OOD dataset (with an L∞ perturbation of ε = 0.
N/A  56.8  52.5
ACET [15]  96.3  N/A  99.5  99.4
GOOD [2]  99.9  N/A  98.4  97.7
Ours (K = 45)

We evaluate our method's application to image generation on CelebA-HQ [20], LSUN-BEDROOM [62], and the ImageNet-Dog dataset [59]. In these tasks, we respectively use the above three datasets as the p_k dataset, and ImageNet [8] as the p_−k dataset. We use the standard ResNet50 as the D model and use Algorithm 3 to train the D models. Generated images are of resolution 256 × 256.

In Figure 2 and Figure 3, by performing gradient ascent on a D model trained on the CelebA-HQ dataset, we transform pixelated faces and cartoon faces into realistic human faces. Because each step of gradient ascent results in a new data sample, the generation process can be controlled by adjusting the gradient ascent step size and number of steps. As a demonstration, in Figure 4 and Figure 5 we show the intermediate images produced during the process of gradient ascent.

We next show that the D model captures the target data distribution p_k reasonably well. We first sample out-of-distribution data (Figure 17) from ImageNet (test set), and then generate new images by performing gradient ascent on D models trained on the CelebA-HQ, LSUN-BEDROOM, and ImageNet-Dog datasets. Results in Figure 14, Figure 15, and Figure 16 show that the D models can generate realistic and diverse images.

Advantages and disadvantages
In contrast to standard generative modeling approaches, this new generative modeling approach does not learn a fixed generator, but instead generates samples through a dynamic process of gradient ascent. This feature facilitates the implementation of certain image transformation tasks, and allows the user to take full control of the generation process. We also find the training (Algorithm 3) to be as stable as ordinary supervised training; the only failure mode (gradient ascent on D resulting in noisy images) that we observed is caused by the step size γ being too large (Appendix F).

Conclusion

In this paper we analyzed the optimal solutions of the GAT training objective and the convergence properties of the training algorithm. We made a comparative analysis of the maximin formulation and the minimax formulation that are respectively employed by GAT and GANs. Building on these results, we designed an incremental GAT algorithm, and evaluated it on the tasks of image generation and adversarial out-of-distribution detection. The good performance and training stability of the algorithm suggest that the proposed approach could serve as a new tool for content creation. The out-of-distribution detection results indicate that an OOD detection model's robustness can be improved by training the model against an adversary equipped with large-scale, diverse OOD data. We find that the generated images are not as realistic as those produced by state-of-the-art generative models; in the future we will focus on improving the method's generation performance by optimizing hyperparameters and model architectures. The proposed approach also has potential applications where the GANs framework is applicable; we will look into these directions in the future.

Figure 2: Transforming low-resolution faces to high-resolution faces. Images in the right subfigure were generated by using the left pixelated images as starting points and performing gradient ascent on the D model trained on the CelebA-HQ dataset.
The pixelated images were produced from a test split of the CelebA-HQ dataset.

Figure 3: Transforming cartoon faces into real faces. Images in the right subfigure were generated using the left cartoon faces as starting points and performing gradient ascent on the D model.

Figure 4: Transforming cartoon faces into bedroom images by performing gradient ascent on the D model trained on the LSUN-BEDROOM dataset.

Figure 5: Transforming cartoon faces into dog faces by performing gradient ascent on the D models trained on the ImageNet-Dog dataset.

References

[1] Vahdat Abdelzad, Krzysztof Czarnecki, Rick Salay, Taylor Denouden, Sachin Vernekar, and Buu Phan. Detecting out-of-distribution inputs in deep neural networks using an early-layer output. arXiv preprint arXiv:1910.10307, 2019.
[2] Julian Bitterwolf, Alexander Meinke, and Matthias Hein. Certifiably adversarially robust detection of out-of-distribution data. Advances in Neural Information Processing Systems, 33, 2020.
[3] Stephen Boyd and Lieven Vandenberghe.
Convex optimization. Cambridge University Press, 2004.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] Wenhu Chen, Yilin Shen, Hongxia Jin, and William Wang. A variational Dirichlet framework for out-of-distribution detection. arXiv preprint arXiv:1811.07308, 2018.
[6] Hyunsun Choi, Eric Jang, and Alexander A Alemi. WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.
[7] Erik Daxberger and José Miguel Hernández-Lobato. Bayesian variational autoencoders for unsupervised out-of-distribution detection. arXiv preprint arXiv:1912.05651, 2019.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
[9] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[11] Izhak Golan and Ran El-Yaniv. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems, pp. 9758–9769, 2018.
[12] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[15] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–50, 2019.
[16] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[17] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
[18] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pp. 15663–15674, 2019.
[19] Yujia Huang, Sihui Dai, Tan Nguyen, Richard G Baraniuk, and Anima Anandkumar. Out-of-distribution detection using neural rendering generative models. arXiv preprint arXiv:1907.04572, 2019.
[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
[22] Diederik P Kingma. Fast gradient-based inference with continuous latent variable models in auxiliary form. arXiv preprint arXiv:1306.0733, 2013.
[23] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
[24] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Why normalizing flows fail to detect out-of-distribution data. Advances in Neural Information Processing Systems, 33, 2020.
[25] Zico Kolter and Aleksander Madry. Adversarial robustness - theory and practice, 2019. URL https://adversarial-ml-tutorial.org/.
[26] Jernej Kos, Ian Fischer, and Dawn Song. Adversarial examples for generative models. pp. 36–42. IEEE, 2018.
[27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[28] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study on regularization and normalization in GANs. arXiv preprint arXiv:1807.04720, 2018.
[29] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
[30] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177, 2018.
[31] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
[32] Weitang Liu, Xiaoyun Wang, John Owens, and Sharon Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33, 2020.
[33] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[34] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pp. 7047–7058, 2018.
[35] Suvojit Manna. celebA-HQ-dataset-download (GitHub repository), 2020. URL https://github.com/suvojit-0x55aa/celebA-HQ-dataset-download.
[36] Alexander Meinke and Matthias Hein. Towards neural networks that provably know when they don't know. arXiv preprint arXiv:1909.12180, 2019.
[37] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.
[38] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, and Balaji Lakshminarayanan. Detecting out-of-distribution inputs to deep generative models using a test for typicality. arXiv preprint arXiv:1906.02994, 5, 2019.
[39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[40] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
[41] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[43] Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[44] Phillip Pope, Yogesh Balaji, and Soheil Feizi. Adversarial robustness of flow-based generative models. In International Conference on Artificial Intelligence and Statistics, pp. 3795–3805. PMLR, 2020.
[45] Igor M Quintanilha, Roberto de ME Filho, José Lezama, Mauricio Delbracio, and Leonardo O Nunes. Detecting out-of-distribution samples using low-order deep features statistics. 2018.
[46] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pp. 14707–14718, 2019.
[47] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[48] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[49] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: A PixelCNN implementation with discretized logistic mixture. In ICLR, 2017.
[50] Chandramouli Shama Sastry and Sageev Oore. Detecting out-of-distribution examples with in-distribution examples and Gram matrices. arXiv preprint arXiv:1912.12510, 2019.
[51] Vikash Sehwag, Arjun Nitin Bhagoji, Liwei Song, Chawin Sitawarin, Daniel Cullina, Mung Chiang, and Prateek Mittal. Better the devil you know: An analysis of evasion attacks using out-of-distribution adversarial examples. arXiv preprint arXiv:1905.01726, 2019.
[52] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F Núñez, and Jordi Luque. Input complexity and out-of-distribution detection with likelihood-based generative models. arXiv preprint arXiv:1909.11480, 2019.
[53] Alireza Shafaei, Mark Schmidt, and James J Little. Does your model know the digit 6 is not a cat? A less biased evaluation of "outlier" detectors. 2018.
[54] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-of-distribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems, pp. 7375–7385, 2018.
[55] Jiaming Song, Yang Song, and Stefano Ermon. Unsupervised out-of-distribution detection with batch normalization. arXiv preprint arXiv:1910.09115, 2019.
[56] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[57] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[58] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347, 2020.
[59] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
[60] Sachin Vernekar, Ashish Gaurav, Vahdat Abdelzad, Taylor Denouden, Rick Salay, and Krzysztof Czarnecki. Out-of-distribution detection in classifiers via generation. arXiv preprint arXiv:1910.04241, 2019.
[61] Xuwang Yin, Soheil Kolouri, and Gustavo K Rohde. GAT: Generative adversarial training for adversarial example detection and robust classification. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJeQEp4YDH.
[62] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[63] Qing Yu and Kiyoharu Aizawa. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9518–9526, 2019.
[64] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, 2018.
A GAT training algorithm
Algorithm 4 GAT single-class detector training algorithm

1. Sample a minibatch of m samples {x_{k,1}, ..., x_{k,m}} from p_k, and m samples {x_{-k,1}, ..., x_{-k,m}} from p_{-k}.
2. Compute adversarial examples {x̂_1, ..., x̂_m} by solving max_{x' ∈ B(x_{-k,i}, ε)} D(x') for each x_{-k,i}.
3. Train the detector with a single gradient step that minimizes (1/m) Σ_{i=1}^{m} [−log D(x_{k,i}) − log(1 − D(x̂_i))].
4. Return to step 1.
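The loop above can be sketched in pure Python. Everything here is illustrative: a toy linear-sigmoid detector stands in for the deep D model, the 2-D Gaussian data and all hyperparameters are made up, and the inner maximization is an L∞ PGD ascent on D.

```python
import math, random

random.seed(0)

# Toy detector D(x) = sigmoid(w.x + b) on 2-D points: a hypothetical stand-in
# for the deep binary discriminator trained by Algorithm 4.
w = [0.0, 0.0]
b = 0.0

def D(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))

def pgd_ascent(x, eps=0.5, step=0.1, n_steps=10):
    """Inner maximization (step 2): gradient ascent on D over the L-inf ball
    B(x, eps), projecting back into the ball after each signed step."""
    xa = list(x)
    for _ in range(n_steps):
        p = D(xa)
        g = [p * (1 - p) * wi for wi in w]       # dD/dx for the sigmoid-linear D
        xa = [xi + step * (1 if gi > 0 else -1 if gi < 0 else 0)
              for xi, gi in zip(xa, g)]          # signed (L-inf) ascent step
        xa = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(xa, x)]
    return xa

def train_step(batch_k, batch_nk, lr=0.1):
    """One outer iteration: label p_k data 1 and adversarial p_{-k} data 0,
    then take a single gradient step on the binary cross-entropy loss."""
    global w, b
    examples = [(x, 1.0) for x in batch_k]
    examples += [(pgd_ascent(x), 0.0) for x in batch_nk]
    gw, gb = [0.0, 0.0], 0.0
    for x, y in examples:
        err = D(x) - y                           # d(BCE)/d(logit)
        gw = [gwi + err * xi for gwi, xi in zip(gw, x)]
        gb += err
    m = len(examples)
    w = [wi - lr * gwi / m for wi, gwi in zip(w, gw)]
    b -= lr * gb / m

# p_k clusters near (+2, 0); p_{-k} clusters near (-2, 0).
for _ in range(300):
    bk = [(2 + random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(32)]
    bnk = [(-2 + random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(32)]
    train_step(bk, bnk)
```

A real implementation replaces D with a deep network and computes the inner-maximization gradients by backpropagation, but the structure of the loop is unchanged: the detector is trained against perturbed p_{-k} samples rather than the clean ones.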
B Related work on out-of-distribution detection
Out-of-distribution (OOD) detection, also known as novelty detection or anomaly detection, deals with the problem of identifying novel or unusual data within a dataset. OOD detection has gained much research attention due to its practical importance in safety-critical applications and its challenging nature. A comprehensive review of classical OOD detection methods can be found in [43].

A recent surge of research interest in this topic is due to the emergence of deep generative models. Such models (specifically explicit density models [12]) estimate the generative probability density function of the data, and should, in principle, be ideal candidates for OOD detection. However, it was observed [24, 37, 53, 17] that several state-of-the-art deep generative models, including Glow [23], PixelCNN [41], PixelCNN++ [49], VAEs [22, 47], and the RealNVP flow model [10], tend to assign higher likelihood to OOD inputs than to in-distribution inputs. Despite this challenge, several recent works [46, 6, 38, 24, 52, 55, 19, 7] investigated the issue and successfully applied deep generative models to OOD detection.

There is also a plethora of OOD detection methods [16, 30, 31, 50, 45, 1, 5, 34] that make use of statistics computed from the predictions or intermediate activations of standard classifiers trained on in-distribution data. To name a few, [30] fit class-conditional Gaussian distributions using multiple levels of activations of the classifier, and use the Mahalanobis distance to compute confidence scores for identifying OOD inputs. The ODIN method [31] improves the effectiveness of a softmax-score-based detection approach by using temperature scaling and adding small perturbations to the input. [50] make use of Gram matrices computed from the classifier's intermediate activations to identify OOD inputs.

Another branch of work utilizes various alternative training strategies [32, 29, 17, 18, 9, 54, 60, 63, 11]. A notable example is the Outlier Exposure (OE) method [17].
OE works by training the OOD detector against a large, diverse out-of-distribution dataset, and has been widely adopted as a baseline method.

While methods based on generative models and standard classifiers yield high performance on naturally occurring OOD inputs, several such methods have been shown [36, 2] to be vulnerable to adversarial manipulation of the OOD inputs. This should come as no surprise, as both generative models and standard classifiers are themselves vulnerable to adversarial attacks [26, 56, 44]. Given this limitation of current approaches, a recent trend considers the worst-case scenario for OOD detection [15, 51, 36, 2]. The Adversarial Confidence Enhanced Training (ACET) method proposed by [15] uses adversarial training [33] on OOD inputs to improve detection robustness. [36] uses a density estimator to provide guarantees on the maximal confidence in a ball around uniform noise inputs. [2] use interval bound propagation (IBP) to certify worst-case guarantees for general OOD inputs under an L∞ threat model. In the same spirit as [15], our detection method employs adversarial training on OOD inputs to induce robustness. The difference is that our method uses the GAT objective, whose optimal solution naturally deals with adversarial OOD inputs, whereas the optimal solution of the objective used by [15], which is essentially a multi-class classification objective with an extra term on OOD inputs, is unclear.

C Mathematical analysis of optimal solutions of the maximin problem
For the convenience of analysis, instead of using ε-balls imposed on individual data samples, we use the notion of a common perturbation space: the perturbation space S is a subspace of the data space X, and allows mass of p_{-k} to be moved to any location in S. A new distribution p_t can be obtained by transporting the mass of p_{-k} to appropriate locations in S, via a transformation function T : S → S. Utilizing the technique of random variable transformation, we can write the density function of p_t as a function of p_{-k}: p_t(y) = ∫_S p_{-k}(x) δ(y − T(x)) dx. The left panel of Figure 6 is a schematic illustration of this construction. Let M(S) be the set of distributions attainable by applying such transformations to the support of p_{-k}.

Figure 6: Left panel: a distribution p_t is obtained by applying a transformation T to the support of p_{-k}. Right panel: three scenarios to consider when analyzing problem 5. The red distribution represents p_{-k} and the blue distribution represents p_k. The data space X is represented by the whole space inside the square, and the perturbation space S is represented by the gray area.

Table 3: Optimal solutions for the three scenarios in Figure 6

Scenario 1, Supp(p_k) ⊂ S: D* outputs 1/2 on Supp(p_k) and ≤ 1/2 on S \ Supp(p_k); p_t* has its mass distributed to locations where D* outputs 1/2.
Scenario 2, Supp(p_k) ∩ S = ∅: D* outputs 1 on Supp(p_k) and 0 on S; p_t* can be an arbitrary distribution in M(S).
Scenario 3, Supp(p_k) ∩ S ≠ ∅ and Supp(p_k) ⊄ S: on the part of Supp(p_k) outside S, D* outputs 1; on the part of Supp(p_k) inside S, D* outputs α = ∫_S p_k / (∫_S p_t* + ∫_S p_k) (by definition, ∫_S p_t* = 1); at other locations inside S, D* outputs values ≤ α; p_t* has its mass distributed to locations where D* outputs α.

Mathematical analysis
Recall that the support of p_t can be any subset of the perturbation space S and that U(D, p_t) = ∫ p_k(x) log D(x) dx + ∫ p_t(x) log(1 − D(x)) dx. For convenience, we define the contour set inside S of D at level α as C_α^D := {x ∈ S : D(x) = α}, the region of Supp(p_k) outside of S as Ω_ko := Supp(p_k) \ S, and the region of Supp(p_k) inside S as Ω_ki := Supp(p_k) ∩ S. Note that Supp(p_k) = Ω_ko ∪ Ω_ki. Fix D, and let α_k = max_{Ω_ko} D and α_S = max_S D. It is easy to check that U is minimized over p_t when Supp(p_t) lies in the contour set C_{α_S}^D. Let p_t* be a distribution such that Supp(p_t*) ⊂ C_{α_S}^D. By direct computation we have

U(D, p_t*) = ∫_{Ω_ko} p_k(x) log D(x) dx + ∫_{Ω_ki} p_k(x) log D(x) dx + log(1 − α_S)
           ≤ (∫_{Ω_ko} p_k) log α_k + (∫_{Ω_ki} p_k) log α_S + log(1 − α_S)
           ≤ β_ki log(β_ki / (1 + β_ki)) + log(1 / (1 + β_ki)),

where β_ki = ∫_{Ω_ki} p_k. Here we have used α_k ≤ 1 together with the fact that the function f(y) = a log y + b log(1 − y) achieves its maximum at y = a/(a + b). It is not difficult to see that the above inequality becomes an equality when

D(x) = α_k for x ∈ Ω_ko;  D(x) = α_S for x ∈ Ω_ki;  D(x) ≤ α_S for x ∈ S \ Supp(p_k),   (6)

where α_k = 1 and α_S = β_ki / (1 + β_ki). Note that D does not need to be defined outside S ∪ Supp(p_k).

Scenario 1
Here we deal with the case where ε is large enough that Supp(p_k) ⊂ S, in which case Ω_ko = ∅, Ω_ki = Supp(p_k), and β_ki = 1, so α_S = 1/2. Hence by the above analysis one can check that U achieves its optimum when D ≡ 1/2 on Supp(p_k) and D ≤ 1/2 on S \ Supp(p_k). In summary, the maximin problem achieves its optimum when D outputs 1/2 on the support of p_k, and values less than or equal to 1/2 on samples outside the support of p_k but in S.

Scenario 2
Here we deal with the case where ε is small enough that S ∩ Supp(p_k) = ∅, in which case Ω_ko = Supp(p_k), Ω_ki = ∅, and α_S = 0. Hence U achieves its optimum when D ≡ 1 on Supp(p_k) and D ≡ 0 on S. In summary, the maximin problem achieves its optimum when D outputs 1 on the support of p_k and zero on the perturbation space S.

Scenario 3
Here we deal with the case where S ∩ Supp(p_k) ≠ ∅ and Supp(p_k) ⊄ S. In summary, the maximin problem achieves its optimum when D outputs 1 on the set of samples inside the support of p_k but outside of the perturbation space S, outputs β_ki / (1 + β_ki) on the set of samples in the intersection of the support of p_k and S, and outputs values less than or equal to β_ki / (1 + β_ki) elsewhere on S.

Remark
The first two cases can be seen as special cases of the third one.
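The key fact used in the derivation above, that f(y) = a log y + b log(1 − y) is maximized at y = a/(a + b), and hence that the optimal detector output inside S is α_S = β_ki/(1 + β_ki), can be checked numerically. The following sketch (illustrative, not from the paper) compares the closed-form maximizer against a brute-force grid search:

```python
import math

def f(y, a, b):
    """Profile of U at a fixed detector level y: f(y) = a*log(y) + b*log(1 - y)."""
    return a * math.log(y) + b * math.log(1.0 - y)

def argmax_grid(a, b, n=20000):
    """Brute-force maximizer of f over the grid {1/n, ..., (n-1)/n}."""
    best_y, best_v = None, -math.inf
    for i in range(1, n):
        y = i / n
        v = f(y, a, b)
        if v > best_v:
            best_y, best_v = y, v
    return best_y

# With a = beta_ki and b = 1, the maximizer should be beta_ki / (1 + beta_ki),
# which is the optimal detector output alpha_S derived above.
for beta_ki in (0.2, 0.5, 0.9):
    assert abs(argmax_grid(beta_ki, 1.0) - beta_ki / (1.0 + beta_ki)) < 1e-3
```

Setting β_ki = 1 (Scenario 1) recovers the maximizer 1/2, matching the first row of Table 3.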
Additional discussion on Scenario 2
In the robust machine learning literature, it is common to consider a very small value of ε. For instance, one of the most commonly used limits for training L∞-robust models is ε = 8/255. A perturbation space characterized by a small limit can be thought of as a semantic-preserving space: translating a sample inside the space does not change the sample's underlying label/class membership. A small perturbation limit corresponds to Scenario 2, which is also the focus of Yin et al. [61]. We can define robust models as models that output consistent predictions for inputs under semantic-preserving transformations. In this sense, the optimal D for Scenario 2 is a robust detector, as it always outputs 0 on the perturbation space. However, the limitation of training against a small ε is obvious: because the optimal D's outputs outside S ∪ Supp(p_k) are unspecified, any semantic-preserving operation whose perturbation goes beyond S can result in a high D output, thereby fooling the detection. The above analysis suggests that for predictive models based on the generative adversarial training method, robustness can be improved by training against a larger perturbation space.

D Algorithm 1 convergence
D.1 The effects of alternating optimization

Figure 7: The results of p_t and D in the first few iterations of a 2D simulation of Algorithm 1. The panels alternate between (a) the initialization, step 2 updating p_t, and step 3 updating D. Step 2 solves the inner minimization, causing the support of p_t (red points) to concentrate at local maxima of D. Step 3 updates D by increasing its outputs on the support of p_k and decreasing its outputs on the support of p_t, causing the local maxima to be suppressed.

D.2 More 2D simulation results

Figure 8: Solutions obtained by the maximin problem solver (Algorithm 1) with different initializations of D. The first row shows results when p_{-k} is at the bottom left (Figure 1(a)), and the second row shows results when p_{-k} is a uniform distribution.

Figure 9: Solutions obtained by the minimax problem solver (Algorithm 2) with different initializations of D. Note that in all cases p_t (red distribution) matches p_k (blue distribution), but D has unpredictable outputs on X \ Supp(p_k). The initial position of the red distribution is in the bottom-left corner (see Figure 7(a)).

E Ablation study
E.1 Effects of incremental training

Figure 10: AUROC curves (p_k data vs. perturbed p_{-k} data) of two training schemes (p_k is CIFAR-10 and p_{-k} is the Tiny Images dataset): training directly with K = 10 (training time: 18h 40m) versus incremental training with K = 0, ..., 10 (training time: 11h 40m). In the incremental scheme, the training follows Algorithm 3 and uses an increasing sequence of numbers of steps (K = 0, ..., K = 10). The two schemes are trained with the same number of iterations. The AUROC values are computed using D outputs on pairs of p_k data (a batch) and perturbed p_{-k} data (a batch). A model with a high AUROC value separates these two types of data better.

E.2 Uniform noise as the p_{-k} dataset

In this ablation study, we use CIFAR-10 class 0 data as the target data distribution p_k, and train the D model using uniform noise as the p_{-k} dataset. It is observed in Tables 4 and 5 that when p_{-k} is uniform noise, the D models only develop the capability to identify uniform noise and Gaussian noise as OOD inputs. This result seems to contradict our mathematical analysis, which states that with a uniform distribution as p_{-k}, a D function useful for detecting any kind of OOD input could be obtained. According to the manifold hypothesis, real image data lie on lower-dimensional manifolds embedded within the high-dimensional space; real image samples are close to each other in terms of geodesic distance. Uniform noise, on the other hand, is distributed in the entire space, and may require much larger perturbations to reach real data. As a result, uniform noise is much less data-efficient than real data for training OOD detection models, and a much larger number of inner iterations and a larger K value in Algorithm 3 may be needed to reach satisfying detection performance.

Table 4: OOD detection performance (AUROC scores) of the K = 0 model on CIFAR-10 class 0 data (p_k = CIFAR-10 class 0, p_{-k} = uniform noise).

ε-test | Gaussian noise | Uniform noise | ImageNet | Bedroom | SVHN | CelebA-HQ | CIFAR-100 | mean
0.0 | 1.0000 | 1.0000 | 0.5859 | 0.5791 | 0.5801 | 0.5235 | 0.5499 | 0.6884
1.0 | 1.0000 | 1.0000 | 0.5161 | 0.5028 | 0.5141 | 0.4510 | 0.4816 | 0.6379

Table 5: OOD detection performance (AUROC scores) of the K = 15 model on CIFAR-10 class 0 data (p_k = CIFAR-10 class 0, p_{-k} = uniform noise).

ε-test | Gaussian noise | Uniform noise | ImageNet | Bedroom | SVHN | CelebA-HQ | CIFAR-100 | mean
0.0 | 1.0000 | 1.0000 | 0.5772 | 0.5760 | 0.5711 | 0.5139 | 0.5435 | 0.6831
1.0 | 1.0000 | 1.0000 | 0.5063 | 0.4984 | 0.5043 | 0.4406 | 0.4738 | 0.6319

F Failure mode diagnosis
We observe that in Algorithm 3, if λ is set to too large a value, the algorithm fails to learn a D that is useful for image generation. In this section we discuss the training dynamics for the case of an appropriate λ value and for the case of λ being too large.

λ is small enough. In Algorithm 3, as we increase K, p_t gradually converges to p_k. In this process it becomes increasingly difficult for the D model to differentiate these two distributions. This phenomenon can be observed in Figure 11: the training loss (binary cross-entropy loss) of the D model becomes larger and larger (left subfigure), and eventually the two distributions become indistinguishable (AUROC ≈ 0.5, middle subfigure). From the right subfigure we can see that D's performance on p_{-k} vs. p_k is also affected by the increase in the K value.
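The AUROC diagnostic plotted in these curves can be computed directly from D's outputs. Below is a minimal rank-based sketch (the scores are made up for illustration), equivalent to the probability that a randomly drawn p_k sample receives a higher D output than a randomly drawn sample from the other distribution:

```python
def auroc(pos_scores, neg_scores):
    """AUROC = P(score of a random positive > score of a random negative),
    counting ties as 1/2. Here pos = D outputs on p_k data and neg = D outputs
    on perturbed data (p_t) or on original OOD data (p_{-k})."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Well-separated scores give AUROC near 1; indistinguishable scores give ~0.5.
d_on_pk = [0.9, 0.8, 0.95, 0.7]   # hypothetical D outputs on p_k data
d_on_pt = [0.2, 0.4, 0.1, 0.3]    # hypothetical D outputs on p_t data
print(auroc(d_on_pk, d_on_pt))    # 1.0: every positive outscores every negative
```

As K increases and p_t converges to p_k, this statistic computed on (p_k, p_t) pairs drifts toward 0.5, which is exactly the trend shown in the middle subfigure of Figure 11.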
Figure 11: CIFAR-10 training curves for λ = 0.1. Left: training loss curves; middle: AUROC curves (p_t vs. p_k); right: AUC curves (p_{-k} vs. p_k).

λ is too large. The failure mode caused by λ being too large is easy to identify (Figure 12): the training loss quickly decreases to 0 as K increases (left subfigure), p_t and p_k become perfectly separable (middle subfigure), and the D model becomes unable to separate p_{-k} from p_k (right subfigure). In general, with a small enough λ value, an increase in sample quality can be expected after the model is trained with a larger K. This is the case when λ is 0.1, but not when it is 0.6 (Figure 13).
Figure 12: CIFAR-10 training curves of a failed training instance (λ = 0.6).

Figure 13: Generated samples after training with an increasing sequence of K values. (a) λ = 0.1; (b) λ = 0.6. Sample quality improved with larger K when λ = 0.1, but did not when λ = 0.6.

G Extended adversarial OOD detection results
Table 6: The performance (AUROC scores) of the CIFAR-10 K = 5 model (the in-distribution dataset is CIFAR-10) under attacks of different configurations. Following [2] we used 1000 samples for both in-distribution data and OOD data. Similarly, we used 5 random restarts to enhance the default attack, but the performance decrease is negligible. OOD dataset with an L∞ perturbation of ε = 0.…

Table 7: The performance (AUROC scores) of the SVHN K = 45 model (the in-distribution dataset is SVHN) under attacks of different configurations. Following [2] we used 1000 samples for both in-distribution data and OOD data. Similarly, we used 5 random restarts to enhance the default attack, but the performance decrease is negligible. OOD dataset with an L∞ perturbation of ε = 0.…

Table 8: Standard OOD detection performance (AUROC scores) when the in-distribution dataset is CIFAR-10 and OOD samples are not perturbed. Performance data is collected from the papers referenced in the table; when there is a discrepancy we use the best reported result. Details about the iSUN, LSUN (resize), and TinyImageNet (resize) datasets can be found in [31].

OOD dataset (no perturbation):
Method | Uniform | Gaussian | SVHN | CIFAR-100 | iSUN | LSUN (resize) | TinyImageNet (resize)
Softmax [16] | 96.5 | 97.5 | 89.9 | 86.4 | 91.0 | 91.0 | 91.0
ODIN [31] | 99 | 100 | 96.7 | 87.5 | 94.0 | 94.1 | 94.0
Mahalanobis [30] | 100 | N/A | 99.1 | 88.2 | 99.5 | 99.7 | 99.5
OE [17] | 98.7 | 99.3 | 98.8 | 95.3 | 98.5 | 98.94 | N/A
Gram Matrices [50] | N/A | 100 | 99.5 | 79.0 | 99.8 | 99.9 | 99.7
Energy-based [32] | N/A | N/A | 99.4 | N/A | 99.33 | 99.39 | N/A
Likelihood ratios [46] | N/A | N/A | 88.8 | N/A | N/A | N/A | N/A
WAIC [6] | 100 | 100 | 100 | N/A | N/A | N/A | 95.6
CCU [36] | 100 | N/A | 97.1 | 93.0 | N/A | N/A | N/A
ACET [15] | 99.7 | N/A | 92.4 | 90.7 | N/A | N/A | N/A
GOOD [2] | 99.5 | N/A | 97.1 | 92.9 | N/A | N/A | N/A
Ours (K = 0) | 99.5 | 99.8 | 99.6 | 94.1 | 99.5 | 99.5 | 98.7
Ours (K = 5) | 99.6 | 99.9 | 97.4 | 91.5 | 98.5 | 98.9 | 96.2

256 × 256 resolution generation

Figure 14: Uncurated 256 ×
256 generation results in the CelebA-HQ dataset.

Figure 15: Uncurated 256 × 256 generation results in the Bedroom256 dataset.

Figure 16: Uncurated 256 × 256 generation results in the ImageNet-Dog dataset.