Towards a Better Global Loss Landscape of GANs
Ruoyu Sun, Tiantian Fang, Alex Schwing
University of Illinois at Urbana-Champaign, {ruoyus, tf6, aschwing}@illinois.edu
Abstract
Understanding of GAN training is still very limited. One major challenge is its non-convex-non-concave min-max objective, which may lead to sub-optimal local minima. In this work, we perform a global landscape analysis of the empirical loss of GANs. We prove that a class of separable-GANs, including the original JS-GAN, has exponentially many bad basins which are perceived as mode collapse. We also study the relativistic pairing GAN (RpGAN) loss which couples the generated samples and the true samples. We prove that RpGAN has no bad basins. Experiments on synthetic data show that the predicted bad basin can indeed appear in training. We also perform experiments to support our theory that RpGAN has a better landscape than separable-GAN. For instance, we empirically show that RpGAN performs better than separable-GAN with relatively narrow neural nets. The code is available at https://github.com/AilsaF/RS-GAN.

Generative Adversarial Nets (GANs) [35] are a successful method for learning data distributions. Current theoretical efforts to advance understanding of GANs often focus on statistics or optimization. On the statistics side, Goodfellow et al. [35] built a link between the min-max formulation and the J-S (Jensen-Shannon) distance. Arjovsky and Bottou [3] and Arjovsky et al. [4] proposed an alternative loss function based on the Wasserstein distance. Arora et al. [5] studied the generalization error and showed that both the Wasserstein distance and the J-S distance are not generalizable (i.e., both require an exponential number of samples). Nevertheless, Arora et al. [5] argue that the real metric used in practice differs from the two statistical distances, and can be generalizable with a proper discriminator. Bai et al. [7] and Lin et al. [58] analyzed the potential "lack of diversity": two different distributions can have the same loss, which may cause mode collapse. Bai et al. [7] argue that proper balancing of generator and discriminator permits both generalization and diversity.

On the optimization side, cyclic behavior (non-convergence) is well recognized [65, 8, 34, 11]. This is a generic issue for min-max optimization: a first-order algorithm may cycle around a stable point, converge very slowly, or even diverge. The convergence issue can be alleviated by more advanced optimization algorithms such as optimism (Daskalakis et al. [23]), averaging (Yazıcı et al. [88]) and extrapolation (Gidel et al. [33]).

Besides convergence, another general optimization challenge is to avoid sub-optimal local minima. It is an important issue in non-convex optimization (e.g., Zhang et al. [91], Sun [81]), and has received great attention in matrix factorization [31, 14, 19] and supervised learning [38, 47, 2, 92, 27]. For GANs, the aforementioned works [65, 8, 34, 11] either analyze convex-concave games or perform local analysis, hence they do not touch the global optimization issue of non-convex problems. Mescheder et al. [65] and Feizi et al. [30] prove global convergence only for simple settings where the true data distribution is a single point or a single Gaussian distribution. The global analysis of GANs for a fairly general data distribution is still a rarely touched direction.

The global analysis of GANs is an interesting direction for the following reasons.
First, from a theoretical perspective, it is an indispensable piece for a complete theory. To put our work in perspective, we compare representative works in supervised learning with works on GANs in Tab. 1.

Table 1: Comparison of theoretical works.*
Generalization analysis — supervised learning: [9], generalization bound for neural nets; GANs: [5], generalization bound for GANs.
Convergence analysis — supervised learning: [77], convex problem, divergence of Adam and convergence of AMSGrad; GANs: [23], bi-linear game, non-convergence of GDA and convergence of optimistic GDA.
Global landscape — supervised learning: [73, 50], any distinct input data, wide neural nets have no sub-optimal basins; GANs: this work, any distinct input data, SepGAN has bad basins while RpGAN does not.
* This table does NOT show a complete list of works. The goal is to list various types of works; only one or two works are listed as examples of each class.
Second, it may help to understand mode collapse. Bai et al. [7] conjectured that a lack of diversity may be caused by optimization issues, but the convergence analyses [65, 8, 34, 11] do not link non-convergence to mode collapse. Thus we suspect that mode collapse is at least partially related to sub-optimal local minima, but a formal theory is still lacking.
Third, it may help to understand the training process of GANs. Even understanding a simple two-cluster experiment is challenging because the loss values of min-max optimization fluctuate during training. Global analysis can provide an additional lens for demystifying the training process. Additional related work is reviewed in Appendix A.
Challenges and our solutions.
While the idea of a global analysis is natural, there are a few obstacles. First, it is hard to follow a common path of supervised learning [38, 47, 2, 92, 27] to prove global convergence of gradient descent for GANs, because the dynamics of non-convex-non-concave games are much more complicated. Therefore, we resort to a landscape analysis. Note that our approach resembles an "equilibrium analysis" in game theory. Second, it was not clear which formulation can cure the landscape issue of JS-GAN. Wasserstein GAN (W-GAN) is a candidate, but its landscape is hard to analyze due to the extra constraints. After analyzing the issue of JS-GAN, we realize that the idea of "pairing", which is implicitly used by W-GAN, is enough to cure the issue. This leads us to consider relativistic pairing GANs (RpGANs) [41, 42] that couple the true data and the generated data. We prove that RpGANs have a better landscape than separable-GANs (a generalization of JS-GAN). Third, it was not clear whether the theoretical finding affects practical training. We make a few conjectures based on our landscape theory and design experiments to verify them. Interestingly, the experiments match the conjectures quite well.

Our contributions.
This work provides a global landscape analysis of the empirical version of GANs. Our contributions are summarized as follows:
•
Does the original JS-GAN have a good landscape, provably?
For JS-GAN [35], we prove that the outer-minimization problem has exponentially many sub-optimal strict local minima. Each strict local minimum corresponds to a mode-collapse situation. We also extend this result to a class of separable-GANs, covering hinge loss and least squares loss.
•
Is there a way to improve the landscape, provably?
We study a class of relativistic pairing GANs (RpGANs) [41] that pair the true data and the generated data in the loss function. We prove that the outer-minimization problem of RpGAN has no bad strict local minima, improving upon separable-GANs.
•
Does the improved landscape lead to any empirical benefit?
Based on our theory, we predict that RpGANs are more robust to data, network width and initialization than their separable counterparts, and our experiments support this prediction. Although the empirical benefit of RpGANs was observed before [41], the aspects we demonstrate are closely related to our landscape theory. In addition, using synthetic experiments we explain why mode collapse (as bad basins) can slow down JS-GAN training.
Goodfellow et al. [35] proved that the population loss of GANs is convex in the space of probability densities. We highlight that this convexity highly depends on a simple property of the population loss, which may vanish in an empirical setting.

(In fact, we proposed the RpGAN loss in a first version of this paper, but later found that [41, 42] considered the same loss. We adopt the name RpGAN from [42].)

Figure 1: (a) Population loss: probability density changes; (b) Empirical loss: samples move.

Suppose p_data is the data distribution, p_g is a generated distribution and D ∈ C_(0,1)(ℝ^d), where C_(0,1)(ℝ^d) is the set of continuous functions with domain ℝ^d and codomain (0, 1). Consider the
JS-GAN formulation [35]

    min_{p_g} φ_JS(p_g; p_data),  where  φ_JS(p_g; p_data) = sup_D E_{x ∼ p_data, y ∼ p_g} [log(D(x)) + log(1 − D(y))].

Claim 2.1 ([35, in proof of Prop. 2]). The objective function φ_JS(p_g; p_data) is convex in p_g.

The proof utilizes two facts: first, the supremum of (infinitely many) convex functions is convex; second, E_{x ∼ p_data, y ∼ p_g} [log(D(x)) + log(1 − D(y))] is a linear function of p_g. The second fact is the essence of the argument, which we restate below in a more general form.

Claim 2.2. E_{y ∼ p_g} [f_arb(y)] is a linear function of p_g, where f_arb(y) is an arbitrary function of y.

Claim 2.2 implies that min_{p_g} E_{y ∼ p_g} [f_arb(y)] is a convex problem. One approach to solve it is to draw finitely many samples (particles) y_i, i = 1, ..., n, from p_g, and approximate the population loss by the empirical loss. See Fig. 1 for a comparison of the probability space and the particle space. For an arbitrarily complicated function such as f_arb(y) = sin(‖y‖ + 2‖y‖ + log(‖y‖ + 1)), the population loss is convex in p_g, but clearly the empirical loss is non-convex in (y_1, ..., y_n). This example indicates that studying the empirical loss may better reveal the difficulty of the problem (especially with a limited number of samples). See Appendix G for more discussions.

We focus on the empirical loss in this work. Suppose there are n data points x_1, ..., x_n. We sample n latent variables z_1, ..., z_n ∈ ℝ^{d_z} according to a rule (e.g., i.i.d. Gaussian) and generate artificial data y_i = G(z_i), i = 1, ..., n. The empirical version of JS-GAN addresses min_Y φ_JS(Y, X), where

    φ_JS(Y, X) ≜ sup_D (1/(2n)) Σ_{i=1}^{n} [log(D(x_i)) + log(1 − D(y_i))].    (1)

Note that the empirical loss is considered in Arora et al. [5] as well, but they study the generalization properties. We focus on the optimization properties, which is complementary to their work.
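To make the contrast above concrete, the following small sketch (our own illustration, not part of the paper's released code) evaluates the empirical loss of the fixed but wiggly f_arb on a grid of particle positions. Viewed as a function of a particle, the loss inherits every local minimum of f_arb, whereas the same objective viewed as a function of the density p_g is linear.

```python
import numpy as np

def f_arb(y):
    # A fixed, arbitrarily complicated scalar function (as in the text);
    # E_{y ~ p_g}[f_arb(y)] is linear in the density p_g no matter how wiggly f_arb is.
    return np.sin(np.abs(y) + 2 * np.abs(y) + np.log(np.abs(y) + 1.0))

# Empirical loss with a single particle y_1: L(y_1) = f_arb(y_1).
grid = np.linspace(-10.0, 10.0, 4001)
loss = f_arb(grid)

# Count the interior local minima of the particle-space loss on the grid.
is_min = (loss[1:-1] < loss[:-2]) & (loss[1:-1] < loss[2:])
print("local minima of the empirical (particle-space) loss:", int(is_min.sum()))
```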
In this section, we discuss the main intuition and present results for a 2-point distribution.

Figure 2: Issue of separable-GAN (including JS-GAN). After updating G, fake data crosses the boundary to fool D; after updating D, they are separated by D. Fake data may be stuck near x_1.

Intuition of Bad "Local Minima" and Separable-GAN:
Consider an empirical data distribution consisting of two samples x_1, x_2 ∈ ℝ^d. The generator produces two data points y_1, y_2 to match x_1, x_2. We illustrate the training process of JS-GAN in Fig. 2. Initially, y_1, y_2 are far from x_1, x_2, thus the discriminator can easily separate true data and fake data. After the generator update, y_1, y_2 cross the decision boundary to fool the discriminator. Then, after the discriminator update, the decision boundary moves and can again separate true data and fake data. As iterations progress, y_1, y_2 and the decision boundary may stay close to x_1, causing mode collapse.

The intuition above is the starting point of this work. We notice that Unterthiner et al. [83] and Li and Malik [53] presented somewhat similar intuition, and Kodali et al. [45] suggested the connection between mode collapse and a bad equilibrium. Nevertheless, Li and Malik [53] and Kodali et al. [45] do not present a theoretical result, and Unterthiner et al. [83] uses a significantly different formulation from standard GANs. See Appendix A for more.

We point out that a major reason for the above issue is a single decision boundary which judges the generated samples. Therefore, this issue exists not only for JS-GAN, but also for a large class of GANs which we call separable-GANs:

    min_Y sup_f Σ_{i=1}^{n} [h_1(f(x_i)) + h_2(−f(y_i))],    (2)

where h_1, h_2 are fixed scalar functions, such as h_1(t) = h_2(t) = −log(1 + e^{−t}) or h_1(t) = h_2(t) = −max{0, 1 − t}, and f is chosen from a function space (e.g., a set of neural-net functions).

Pairing as Solution: RpGAN.
A natural solution is to use a different "decision boundary" for every generated point, e.g., pairing x_i and y_i, as illustrated in Fig. 3.

Figure 3: Idea of RpGAN: breaking locality by "personalized" judgement.
A suitable loss is the relativistic pairing GAN (RpGAN)

    min_Y sup_f Σ_{i=1}^{n} h(f(x_i) − f(y_i)),    (3)

where h is a fixed scalar function and f is chosen from a function space. RS-GAN (relative standard GAN) is a special case where h(t) = −log(1 + e^{−t}). More specifically, RS-GAN addresses min_Y φ_RS(Y, X), where

    φ_RS(Y, X) ≜ sup_f (1/n) Σ_{i=1}^{n} log [ 1 / (1 + exp(f(y_i) − f(x_i))) ].    (4)

W-GAN [3] can be viewed as a variant of RpGAN where h(t) = t, with an extra Lipschitz constraint. We wonder how the issue of separable-GANs relates to "local minima" and how "pairing" helps. We present results for JS-GAN and RS-GAN for the two-point case below.
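The structural difference between the two families is easiest to see in code. The sketch below (our own illustration; the function names and the use of NumPy are ours, not the released implementation) computes the inner objectives of Eqs. (1)-(2) and Eqs. (3)-(4) for the logistic choice h(t) = −log(1 + e^{−t}), given the discriminator scores f(x_i) and f(y_i): the separable objective judges each real and fake sample against a single decision function, while the pairing objective only ever sees the differences f(x_i) − f(y_i).

```python
import numpy as np

def log_sigmoid(t):
    # log(1 / (1 + exp(-t))) = -log(1 + exp(-t)), computed stably.
    return -np.logaddexp(0.0, -t)

def sep_objective(fx, fy):
    # Inner objective of Eqs. (1)/(2) with h1(t) = h2(t) = -log(1 + e^{-t}) and D = sigmoid(f):
    # every real score fx_i and fake score fy_i is judged on its own.
    return np.sum(log_sigmoid(fx) + log_sigmoid(-fy)) / (2 * len(fx))

def rp_objective(fx, fy):
    # Inner objective of Eqs. (3)/(4) with the same h: only paired differences matter.
    return np.mean(log_sigmoid(fx - fy))

fx = np.array([2.0, -0.5, 1.0])   # discriminator scores on real samples x_i
fy = np.array([-1.0, 0.5, 3.0])   # discriminator scores on generated samples y_i
print(sep_objective(fx, fy), rp_objective(fx, fy))
```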
Global Landscape of 2-Point Case: Depending on the positions of y_1, y_2, there are four states s_0, s_a, s_b, s_2. They represent the four cases |{x_1, x_2} ∩ {y_1, y_2}| = 0; y_1 = y_2 ∈ {x_1, x_2}; |{x_1, x_2} ∩ {y_1, y_2}| = 1; and {x_1, x_2} = {y_1, y_2}, respectively. Training often starts from the "no-recovery" state s_0, and ideally should end at the "perfect-recovery" state s_2. There are two intermediate states: s_a means all generated points fall into one mode ("mode collapse"); s_b means one generated point is a true data point while the other is not a desired data point, which we call "mode dropping". The first three states can transit to each other (assuming continuous change of Y), but only s_b can transit to s_2. We illustrate the landscape of φ_JS(Y; X) and φ_RS(Y; X) in Fig. 4, by indicating the values in the different states. The detailed computation is given next.

JS-GAN 2-Point Case:
The range of φ_JS(Y, X) is [−log 2, 0]. The values for the four states are:

Claim 3.1.
The minimal value of φ_JS(Y, X) is −log 2, achieved at {y_1, y_2} = {x_1, x_2}. Moreover,

    φ_JS(Y, X) = −log 2 ≈ −0.69                 if {x_1, x_2} = {y_1, y_2},
                 −(log 2)/2 ≈ −0.35             if |{x_1, x_2} ∩ {y_1, y_2}| = 1 and y_1 ≠ y_2,
                 (2 log 2 − 3 log 3)/4 ≈ −0.48  if y_1 = y_2 ∈ {x_1, x_2},
                 0                              if |{x_1, x_2} ∩ {y_1, y_2}| = 0.

We illustrate the landscape of φ_JS(Y, X) in Fig. 4(a). As a corollary of the above claim, the outer optimization of the original GAN has a bad strict local minimum at state s_a (a mode collapse).

Corollary 3.1. Ȳ = (x_1, x_1) is a sub-optimal strict local minimum of the function g(Y) = φ_JS(Y, X).
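Because x_1 ≠ x_2, the supremum in Eq. (1) decouples over distinct point locations: a location appearing a times as real data and b times as fake data contributes a·log(a/(a+b)) + b·log(b/(a+b)) at the optimal D. The following sketch (our own numerical check, not from the paper) uses this to reproduce the four values above, including the "irregular" value ≈ −0.48 at the mode-collapsed state.

```python
import numpy as np
from collections import Counter

def phi_js(xs, ys):
    # Empirical JS-GAN value of Eq. (1) for distinct point locations: the sup over D
    # decouples per location v with real count a and fake count b, and the optimal
    # discriminator value at v is D(v) = a / (a + b).
    real, fake = Counter(xs), Counter(ys)
    total, n = 0.0, len(xs)
    for v in set(xs) | set(ys):
        a, b = real[v], fake[v]
        if a > 0:
            total += a * np.log(a / (a + b))
        if b > 0:
            total += b * np.log(b / (a + b))
    return total / (2 * n)

x1, x2 = 0.0, 4.0
print(phi_js((x1, x2), (x1, x2)))    # perfect recovery:  -log(2)            ~ -0.69
print(phi_js((x1, x2), (x1, 7.0)))   # one point matched: -log(2)/2          ~ -0.35
print(phi_js((x1, x2), (x1, x1)))    # mode collapse:     (2log2 - 3log3)/4  ~ -0.48
print(phi_js((x1, x2), (7.0, 9.0)))  # no overlap:        0
```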
RS-GAN 2-Point Case: The range is still φ_RS(Y, X) ∈ [−log 2, 0]. The values are:

Claim 3.2.
The minimal value of φ_RS(Y, X) is −log 2, achieved at {y_1, y_2} = {x_1, x_2}. In addition,

    φ_RS(Y, X) = −log 2 ≈ −0.69        if {x_1, x_2} = {y_1, y_2},
                 −(log 2)/2 ≈ −0.35    if y_i = x_i for exactly one i,
                 0                     otherwise.

(Our motivation for considering RpGAN is that it breaks locality, thus possibly admitting a better landscape. This motivation is somewhat different from Jolicoeur-Martineau [41, 42].)
(Both the mode-collapse state s_a and the mode-dropping state s_b may be called mode collapse; here we differentiate "mode collapse" and "mode dropping".)

Figure 4: Landscape for GAN outer optimization min_Y φ(Y, X); (a) JS-GAN, (b) RS-GAN. It is not a rigorous figure because: (i) there are only four possible values, thus the function is piece-wise linear, while we use smooth curves for accessibility; (ii) the landscape should be two-dimensional, but we illustrate it in 1-D space. Nevertheless, it is still useful for understanding GAN training, as discussed later in Section 5 and Appendix B.

We illustrate φ_RS(Y, X) in Fig. 4(b). Importantly, note that the only basin is the global minimum. In contrast, the landscape of JS-GAN contains a bad basin at a mode-collapsed pattern.

The proofs of Claim 3.1 and Claim 3.2 are given in Appendix H. We briefly explain the main insight provided by these proofs. For the mode-collapsed pattern s_a, the loss value of JS-GAN is

    −(1/4) min_{s,t} [log(1 + e^{−t}) + 2 log(1 + e^{t}) + log(1 + e^{−s})] = (1/4)(log(1/3) + 2 log(2/3)) ≈ −0.48,

which does not equal −(r/4) log 2 for any integer r. This creates an "irregular" value among the other loss values, which are all of the form −(r/4) log 2. In contrast, for pattern s_a, the loss value of RS-GAN is

    −(1/2) min_{s,t} [log(1 + e^{t−t}) + log(1 + e^{t−s})] = −(1/2) log 2,

which is of the form −(r/4) log 2. Therefore, for the 2-point case, RS-GAN has a better landscape because it avoids the "irregular" value of JS-GAN, thanks to its "pairing". This insight is the foundation of the general theory presented in the next section.

We now present our main theoretical results, extending the landscape results from n = 2 to general n. Denote ξ(m) ≜ sup_{t ∈ ℝ} (h_1(t) + m·h_2(−t)).

Assumption 4.1. sup_{t ∈ ℝ} h_1(t) = sup_{t ∈ ℝ} h_2(t) = 0.
Assumption 4.2. ξ(m) > m·ξ(1), for all m ∈ [2, n].
Assumption 4.3. ξ(m) < ξ(m − 1), for all m ∈ [1, n].

It is easy to prove that under Assumption 4.1, ξ(m − 1) ≥ ξ(m) ≥ m·ξ(1) always holds. Assumption 4.2 and Assumption 4.3 require strict inequalities, thus they do not always hold (e.g., for constant functions). Nevertheless, most non-constant functions satisfy these assumptions; a numerical check for the logistic and least-squares choices is sketched after Eq. (5).

The separable-GAN (SepGAN) problem (empirical loss, function space) is min_{Y ∈ ℝ^{d×n}} g_SP(Y), where

    g_SP(Y) = (1/(2n)) sup_{f ∈ C(ℝ^d)} Σ_{i=1}^{n} [h_1(f(x_i)) + h_2(−f(y_i))].    (5)
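The following sketch (our own check; the grid-based maximization is an approximation of the supremum, not an exact computation) numerically verifies Assumptions 4.2 and 4.3 for the logistic and least-squares choices of (h_1, h_2) that appear in Remark 1 below.

```python
import numpy as np

def xi(m, h1, h2, ts=np.linspace(-30.0, 30.0, 200001)):
    # xi(m) = sup_t [ h1(t) + m * h2(-t) ], approximated on a fine grid of t.
    return np.max(h1(ts) + m * h2(-ts))

logistic = lambda t: -np.logaddexp(0.0, -t)   # h(t) = -log(1 + e^{-t})  (JS-GAN)
ls_h1 = lambda t: -(1.0 - t) ** 2             # LS-GAN h1
ls_h2 = lambda t: -t ** 2                     # LS-GAN h2

for name, h1, h2 in [("logistic", logistic, logistic), ("least squares", ls_h1, ls_h2)]:
    v = [xi(m, h1, h2) for m in range(0, 6)]           # v[0] = sup h1 = 0 (Assumption 4.1)
    a42 = all(v[m] > m * v[1] for m in range(2, 6))    # Assumption 4.2: xi(m) > m * xi(1)
    a43 = all(v[m] < v[m - 1] for m in range(1, 6))    # Assumption 4.3: xi(m) < xi(m-1)
    print(name, np.round(v[1:], 3), "A4.2:", a42, "A4.3:", a43)
```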
Theorem 1. Suppose x_1, x_2, ..., x_n ∈ ℝ^d are distinct. Suppose h_1, h_2 satisfy Assumptions 4.1, 4.2 and 4.3. Then for the separable-GAN loss g_SP(Y) defined in Eq. (5), we have: (i) The global minimal value is (1/2) sup_{t ∈ ℝ} (h_1(t) + h_2(−t)), which is achieved iff {y_1, ..., y_n} = {x_1, ..., x_n}. (ii) If y_i ∈ {x_1, ..., x_n} for all i ∈ {1, 2, ..., n} and y_i = y_j for some i ≠ j, then Y is a sub-optimal strict local minimum. Therefore, g_SP(Y) has (n^n − n!) sub-optimal strict local minima.

Remark 1: h_1(t) = h_2(t) = −log(1 + e^{−t}) satisfy Assumptions 4.1, 4.2 and 4.3, thus Theorem 1 applies to JS-GAN. It also applies to hinge-GAN with h_1(t) = h_2(t) = −max{0, 1 − t} and to LS-GAN (least-squares GAN) with h_1(t) = −(1 − t)^2, h_2(t) = −t^2.

Next we consider RpGANs. The RpGAN problem (empirical loss, function space) is min_{Y ∈ ℝ^{d×n}} g_R(Y), where

    g_R(Y) = (1/n) sup_{f ∈ C(ℝ^d)} Σ_{i=1}^{n} h(f(x_i) − f(y_i)).    (6)

Definition 4.1 (global-min-reachable). We say a point w is global-min-reachable for a function F(w) if there exists a continuous path from w to a global minimum of F along which the value of F is non-increasing.

Assumption 4.4. sup_{t ∈ ℝ} h(t) = 0 and h(0) < 0.
Assumption 4.5. h is a concave function on ℝ.

Theorem 2.
Suppose x_1, x_2, ..., x_n ∈ ℝ^d are distinct. Suppose h satisfies Assumptions 4.4 and 4.5. Then for the RpGAN loss g_R defined in Eq. (6): (i) The global minimal value is h(0), which is achieved iff {y_1, ..., y_n} = {x_1, ..., x_n}. (ii) Any Y is global-min-reachable for the function g_R(Y).

This result sanity-checks the loss g_R(Y): its global minimizer is indeed the desired empirical distribution. In addition, it establishes a significantly different optimization landscape for RpGAN.

Remark 1: h(t) = −log(1 + e^{−t}) satisfies Assumptions 4.4 and 4.5, thus Theorem 2 applies to RS-GAN. It also applies to Rp-hinge-GAN with h(t) = −max{0, a − t} and to Rp-LS-GAN with h(t) = −(a − t)^2, for any positive constant a.

Remark 2: The W-GAN loss is (1/n) sup_f Σ_i h(f(x_i) − f(y_i)) with h(t) = t; however, since sup_t h(t) = ∞, it does not satisfy Assumption 4.4. The unboundedness of h(t) = t necessitates extra constraints, which make the landscape analysis of W-GAN challenging; see Appendix L. Analyzing the landscape of W-GAN is an interesting direction for future work.

To prove Theorem 1, careful computation suffices; see Appendix I. The proof of Theorem 2 is more involved. We first build a graph with nodes representing the x_i's and y_i's, then decompose the graph into cycles and trees, and finally compute the loss value by grouping the terms according to cycles and trees and calculating the contribution of each cycle and tree. The detailed proof is given in Appendix J.

We now consider a deep net generator G_w with w ∈ ℝ^K and a deep net discriminator f_θ with θ ∈ ℝ^J. Different from before, where we optimize over y_i and f (function space), we now optimize over w and θ (parameter space).

We first present a technical assumption. For Z = (z_1, ..., z_n) ∈ ℝ^{d_z × n}, Y = (y_1, ..., y_n) ∈ ℝ^{d × n} and W ⊆ ℝ^K, define the set G^{−1}(Y; Z, W) ≜ {w ∈ W | G_w(z_i) = y_i, ∀i}.

Assumption 4.6 (path-keeping property of the generator net). For any distinct z_1, ..., z_n ∈ ℝ^{d_z}, any continuous path Y(t), t ∈ [0, 1], in the space ℝ^{d × n}, and any w_0 ∈ G^{−1}(Y(0); Z, W), there is a continuous path w(t), t ∈ [0, 1], such that w(0) = w_0 and Y(t) = G_{w(t)}(Z), t ∈ [0, 1].

Intuitively, this assumption relates paths in the function space to paths in the parameter space, so the results in function space can be transferred to results in parameter space. The formal results involve two extra assumptions on the representation power of f_θ and G_w (see Appendix K for details). Informal results are as follows:

Proposition 1 (informal). Consider the separable-GAN problem min_{w ∈ ℝ^K} ϕ_sep(w), where

    ϕ_sep(w) = sup_θ (1/(2n)) Σ_{i=1}^{n} [h_1(f_θ(x_i)) + h_2(−f_θ(G_w(z_i)))].    (7)

Suppose h_1, h_2 satisfy the assumptions of Theorem 1. Suppose G_w satisfies Assumption 4.6 (with a certain W). Suppose f_θ and G_w have enough representation power (formalized in Appendix K). Then there exist at least (n^n − n!) distinct w ∈ W that are not global-min-reachable for ϕ_sep(w).

Proposition 2 (informal). Consider the RpGAN problem min_{w ∈ ℝ^K} ϕ_R(w), where

    ϕ_R(w) = sup_θ (1/n) Σ_{i=1}^{n} h(f_θ(x_i) − f_θ(G_w(z_i))).    (8)

Suppose h satisfies the assumptions of Theorem 2. Suppose G_w and f_θ satisfy the same assumptions as in Proposition 1. Then any w ∈ W is global-min-reachable for ϕ_R(w).

Remark 1: The existence of a decreasing path does not necessarily mean an algorithm can follow it.
Nevertheless, our results already distinguish SepGAN and RpGAN. We will illustrate that these results can improve our understanding of GAN training in Sec. 5, and present experiments supporting our theory in Sec. 6.

Remark 2: The two results rely on a few assumptions on neural nets, including Assumption 4.6. These assumptions can be satisfied by certain over-parameterized neural nets, in which case W is a certain dense subset of ℝ^K or ℝ^K itself. For details see Appendix K.1.

Figure 5: Training process of JS-GAN and RS-GAN for two-cluster data. True data are red, fake data are blue. RS-GAN escapes from mode collapse faster than JS-GAN.
These results distinguish the SepGAN and RpGAN landscapes. Theoretically, there is evidence regarding the benefit of losses without sub-optimal basins. Bovier et al. [17] proved that it takes Langevin diffusion at least e^{Ω(h)} time to escape a depth-h basin. A recent work [91] proved that the hitting time of SGLD (stochastic gradient Langevin dynamics) is positively related to the height of the barrier, and SGLD may escape basins with low barriers relatively fast. The theoretical insight is that a landscape without a bad basin permits better-quality solutions or faster convergence to good-quality solutions.

We now discuss the possible gap between our theory and practice. We proved that a mode collapse Y* is a bad basin in the generator space, which indicates that (Y*, D*(Y*)) is an attractor in the joint space of (Y, D) and hard to escape by gradient descent ascent (GDA). In GAN training, the dynamics are not the same as GDA dynamics for various reasons (e.g., sampling, unequal D and G updates), and basins could be escaped with enough training time (e.g., [91]). In addition, a randomly initialized (Y, D) might be far away from the basins at (Y*, D*(Y*)), and properly chosen hyper-parameters (e.g., learning rate) may re-position the dynamics so as to avoid attraction to bad basins. Further, it is known that adding neurons can smooth the landscape of deep nets (e.g., eliminating bad basins in neural nets [50]), thus wide nets might help escape basins in the (Y, D)-space faster. In short, the effect of bad basins may be mitigated by the following factors: (i) proper initial D and Y; (ii) long enough training time; (iii) wide neural nets; (iv) enough hyper-parameter tuning. These factors make it relatively hard to detect the existence of bad basins and their influence. We support our landscape theory by identifying differences between SepGAN and RpGAN in synthetic and real-data experiments.

Although in Section 3 we argue that, intuitively, mode collapse can happen when training JS-GAN for two-point generation, it does not necessarily mean mode collapse really appears in practical training. We discuss a two-cluster experiment, an extension of two-point generation, in order to build a link between theory and practice. We aim to understand the following question: does mode collapse really appear as a "basin", and how does it affect training?

Suppose the true data are two clusters around c_1 = 0 and c_2 = 4. We sample points from the two clusters as the x_i's, and sample the latent variables z_i uniformly from an interval. We use 4-layer neural nets for the discriminator and generator. We use the non-saturating versions of JS-GAN and RS-GAN.
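Below is a minimal sketch of this two-cluster setup (our own illustration: the layer widths, batch sizes, learning rates and the 3-layer MLPs are our choices, not the exact 4-layer networks and hyper-parameters of Appendix B), showing where the non-saturating JS-GAN and RS-GAN losses differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_out, width=32):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, d_out))

torch.manual_seed(0)
G, D = mlp(1, 1), mlp(1, 1)                       # 1-D generator and discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_type = "rs"                                  # "js" (separable) or "rs" (pairing)

for it in range(5000):
    x = torch.cat([torch.randn(32, 1) * 0.1,      # real data: two clusters at c1=0, c2=4
                   torch.randn(32, 1) * 0.1 + 4.0])
    z = torch.rand(64, 1) * 2 - 1                 # latent codes from an interval
    y = G(z)

    # Discriminator step.
    fx, fy = D(x), D(y.detach())
    if loss_type == "js":
        d_loss = F.softplus(-fx).mean() + F.softplus(fy).mean()   # -[log D(x) + log(1 - D(y))]
    else:
        d_loss = F.softplus(fy - fx).mean()                       # RS-GAN: paired logits
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating for JS-GAN).
    fx, fy = D(x), D(G(z))
    g_loss = F.softplus(-fy).mean() if loss_type == "js" else F.softplus(fx - fy).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```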
Mode collapse as bad basin can appear. We visualize the movement of fake data in Fig. 5, and plot the loss value of D (the discriminator) over iterations in Fig. 6(a,b). Interestingly, the minimal D losses coincide with the value of φ_JS at state s_a. It is easy to check that the optimal D = D*(s_a) for a mode-collapse state s_a satisfies {D(c_1), D(c_2)} = {1, 1/3}, and Fig. 6(c) shows that at iteration 2800 the discriminator actually becomes D*. This provides a concrete example that training gets stuck at a mode collapse due to the bad-basin effect. We also notice a few more attempts to approach the bad attractor (s_a, D*(s_a)) later in training.

In RS-GAN training, the minimal loss also coincides with the value of φ_RS at state s_a. The attracting power of (s_a, D*(s_a)) is weaker than for JS-GAN, so it only attracts the iterates for a very short time. RS-GAN needs 800 iterations to escape, which is about 3 times faster than the escape for JS-GAN.

Effect of width:
We see a clear effect of width on convergence speed. As the networks become wider, both JS-GAN and RS-GAN converge faster. We find that the reason for the faster convergence is that wider nets make JS-GAN escape mode collapse faster. See details in Appendix B. More experimental details and findings are presented in Appendix B.
Figure 6: (a) and (b): Evolution of the D loss over iterations for JS-GAN and RS-GAN; RS-GAN is 3-4× faster than JS-GAN. (c) For the JS-GAN training in (a), we plot (Y, D) together at iteration 2800. Y is represented by blue points, which are near c_1 = 0. D is near the optimal D*(s_a) since D(0) ≈ 1/3 and D(4) ≈ 1. Interestingly, this bad attractor (Y, D) is similar to the one discussed in Fig. 2, so the intuition of "local-min" is verified in (c).
Table 2: Inception score (IS, higher is better) and Fréchet Inception distance (FID, lower is better) for JS-GAN, WGAN-GP and RS-GAN on CIFAR-10 and STL-10, using the standard CNN and ResNet architectures of [67] with and without spectral normalization (SN), at regular and reduced widths. We also show the FID gap between JS-GAN and RS-GAN, and the relative model size of narrow nets vs. regular nets ("regular": CNN and ResNet of [67]).
RpGANs have been tested by Jolicoeur-Martineau [41], and are shown to be better than their SepGAN counterparts in a variety of settings (that paper tested a number of variants, some of which are not directly covered by our results). In addition, RpGAN and its variants have been used in super-resolution (ESRGAN) [85] and in a few recent GANs [87, 13]. Therefore, the effectiveness of RpGANs has been justified to some extent. We do not attempt to re-run the experiments merely for the purpose of justification. Instead, our goal is to use experiments to support our landscape theory.

Based on the discussions in Sec. 2, Sec. 4 and Sec. 5, we conjecture that RpGANs have a bigger advantage over SepGANs (A) with narrow deep nets, (B) in high-resolution image generation, and (C) with imbalanced data. Finally, (D) there exists some bad initial D that makes SepGANs much worse than RpGANs. In the main text, we present results for the logistic loss (i.e., JS-GAN and RS-GAN). Results for other losses are given in the appendix.

Experimental setting for (A).
For (A), we test on CIFAR-10 and STL-10 data. For the optimizer, we use Adam; the discriminator's learning rate and the momentum parameters (β_1, β_2) differ between the CIFAR-10 ResNet setting and the other settings. We tune the generator's learning rate. We report the Inception score (IS) and the Fréchet Inception distance (FID); the number of training iterations and the numbers of samples used to evaluate IS and FID are given in Appendix E.1. More details of the setting are shown in Appendix E.1, and the experimental settings for the other cases besides (A) are shown in the corresponding parts of the appendix. Generated images are shown in Appendix F.

Regular architecture and effect of spectral norm (SN).
We use the two neural architectures of [67], a standard CNN and a ResNet, and report results in Table 2. First, without spectral normalization (SN), RS-GAN achieves much better scores than JS-GAN and WGAN-GP on CIFAR-10. Second, with SN, ...
Narrow nets.
For both the CNN and the ResNet, we reduce the number of channels of all convolutional layers in the generator and discriminator to (1) half, (2) a quarter, and (3) a bottleneck (for the ResNet structure). The experimental results are provided in Table 2. We consider the gap between RS-GAN and JS-GAN at regular width as a baseline. For narrow nets, the gap between RS-GAN and JS-GAN is similar or larger in most cases, and can be much larger in some cases. The fluctuations in the gaps are consistent with the landscape theory: if JS-GAN training gets stuck at a bad basin, the performance is bad; if it converges to a good basin, the performance is reasonably good. On CIFAR-10, compared to SN-GAN with the conventional ResNet (FID = 20.13), we can achieve a relatively close result by using RS-GAN with 28% of the parameters (half channels, FID = 21.78).
High resolution data experiments.
Sec. 2 discusses that the non-convexity of JS-GAN becomes a more severe issue when the number of samples is limited compared to the size of the data space (e.g., a high-resolution image space or a limited number of data points). We conduct experiments with high-resolution LSUN Church and Tower images. RS-GAN can generate images of higher visual quality than JS-GAN (Appendix F). Similarly, using another model architecture, [41] achieves a better FID score with RS-GAN on the CAT dataset, which contains a small number of high-resolution images (roughly 2k).

Imbalanced data experiments.
For imbalanced data, we find more evidence for the existence of JS-GAN's bad basins. The reason: for imbalanced data, JS-GAN has a deeper bad basin, and hence a higher chance to get stuck. We conduct ablation experiments on 2-cluster data and MNIST. In both cases, JS-GAN ends up with mode collapse while RS-GAN can generate data with proportions similar to those of the imbalanced true data. Check Appendix C for more.

(Table: FID of JS-GAN and RS-GAN for several generator learning rates, with the discriminator learning rate fixed, when training starts from a bad initial discriminator; see "Bad initial point experiments" below.)
Bad initial point experiments.
A better landscape is more robust to initialization. On MNIST data, we find a (non-random) discriminator which, when used as the starting point, lets RS-GAN converge to a much better solution than JS-GAN. The FID scores are reported in the accompanying table. The gap is at least 30 FID points (a much larger gap than for a random initialization). Check Appendix D for more.
Combining with EMA.
It is known that non-convergence can be alleviated via EMA [88], and our theory predicts that the global landscape issue can be alleviated by RpGAN. Non-convergence and the global landscape are orthogonal: no matter whether the iterates are near a sub-optimal local basin or a globally optimal basin, the algorithm may cycle. Therefore, we conjecture that the effect of EMA and the effect of RS-GAN are "additive". Our simulations show that EMA can improve both JS-GAN and RS-GAN, and the gap is approximately preserved after adding EMA. Combining EMA and RS-GAN, we achieve a result similar to the baseline (JS-GAN + SN, no EMA, FID = 20.13) using 16.8% of the parameters (ResNet with bottleneck plus EMA, FID = 21.38). See Appendix E.1 for more.
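For completeness, here is a minimal sketch of the EMA scheme of [88] as we would apply it to the generator (the decay value and the toy module below are our own choices, not the paper's exact setting).

```python
import copy
import torch
import torch.nn as nn

def ema_update(ema_model, model, decay=0.999):
    # Exponential moving average of generator weights (cf. [88]).
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

G = nn.Linear(8, 8)        # stand-in for the generator
ema_G = copy.deepcopy(G)   # EMA copy, updated after every generator step
ema_update(ema_G, G)       # call once per generator update; evaluate IS/FID with ema_G
```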
General RpGAN:
We conduct additional experiments with other losses, including the hinge loss and the least-squares loss. See Appendix E.2 and E.3 for more.
The global optimization landscape, together with statistical analysis and convergence analysis, is an important theoretical angle. In this work, we study the global landscape of GANs. Our major questions are: (1) Does the original JS-GAN formulation have a good landscape? (2) If not, is there a simple way to improve the landscape in theory? (3) Does the improved landscape lead to better performance? First, studying the empirical version of SepGAN (an extension of JS-GAN), we prove that it has exponentially many bad basins, which are mode-collapse patterns. Second, we prove that a simple coupling idea (resulting in RpGAN) can remove the bad basins in theory. Finally, we verify a few predictions based on the landscape theory, e.g., that RS-GAN has a bigger advantage over JS-GAN for narrow nets.

Acknowledgements
This work is supported in part by NSF under Grant
References

[1] J. Adler and S. Lunz. Banach wasserstein gan. In
NeurIPS , 2018.[2] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In
ICML , 2019.[3] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In
ICLR , 2017.[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. In
ICML , 2017.[5] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarialnets (GANs). In
ICML , 2017.[6] W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel. A tight and unified analysis of extragradient fora whole spectrum of differentiable games. arXiv preprint arXiv:1906.05945 , 2019.[7] Y. Bai, T. Ma, and A. Risteski. Approximability of discriminators implies diversity in gans. arXiv preprintarXiv:1806.10586 , 2018.[8] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-playerdifferentiable games. arXiv preprint arXiv:1802.05642 , 2018.[9] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks.In
NeurIPS , 2017.[10] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In
Large Scale Kernel Machines . MITPress, 2007.[11] H. Berard, G. Gidel, A. Almahairi, P. Vincent, and S. Lacoste-Julien. A closer look at the optimizationlandscapes of generative adversarial networks. arXiv preprint arXiv:1906.04848 , 2019.[12] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 , 2017.[13] D. Berthelot, P. Milanfar, and I. Goodfellow. Creating high resolution images with a latent adversarialgenerator. arXiv preprint arXiv:2003.02365 , 2020.[14] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrixrecovery. In
NeurIPS , 2016.[15] M. Bianchini and M. Gori. Optimal learning in artificial neural networks: A review of theoretical results.
Neurocomputing , 1996.[16] M. Bi´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying mmd gans. In
ICLR , 2018.[17] A. Bovier, M. Eckhoff, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes i. sharpasymptotics for capcities and exit times.
JEMS , 2004.[18] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 , 2018.[19] Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization: An overview.
IEEE Transactions on Signal Processing , 67(20):5239–5269, 2019.[20] C. Chu, J. Blanchet, and P. Glynn. Probability functional descent: A unifying perspective on gans,variational inference, and reinforcement learning. arXiv preprint arXiv:1901.10691 , 2019.[21] R. W. A. Cully, H. J. Chang, and Y. Demiris. Magan: Margin adaptation for generative adversarial networks. arXiv preprint arXiv:1704.03817 , 2017.
[22] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization.In
NeurIPS , 2018.[23] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training gans with optimism. In
ICLR , 2018.[24] I. Deshpande, Z. Zhang, and A. Schwing. Generative modeling using the sliced wasserstein distance. In
CVPR , 2018.[25] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing.Max-Sliced Wasserstein Distance and its use for GANs. In
CVPR , 2019.[26] T. Ding, D. Li, and R. Sun. Sub-optimal local minima exist for almost all over-parameterized neuralnetworks. arXiv preprint arXiv:1911.01413 , 2019.[27] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neuralnetworks. arXiv preprint arXiv:1811.03804 , 2018.[28] F. Farnia and A. Ozdaglar. Gans may have no nash equilibria. arXiv preprint arXiv:2002.09124 , 2020.[29] F. Farnia and D. Tse. A convex duality framework for gans. In
NeurIPS , 2018.[30] S. Feizi, F. Farnia, T. Ginart, and D. Tse. Understanding gans: the lqg setting. arXiv preprintarXiv:1710.10793 , 2017.[31] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In
NeurIPS , 2016.[32] M. Geiger, S. Spigler, S. d’Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, and M. Wyart. The jamming transitionas a paradigm to understand the loss landscape of deep neural networks. arXiv preprint arXiv:1809.09349 ,2018.[33] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective ongenerative adversarial networks. arXiv preprint arXiv:1802.10551 , 2018.[34] G. Gidel, R. A. Hemmat, M. Pezeshki, R. Lepriol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas. Negativemomentum for improved game dynamics. In
AISTATS , 2019.[35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.Generative adversarial nets. In
NeurIPS , 2014.[36] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wassersteingans. In
NeurIPS , 2017.[37] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In
CVPR , 2017.[38] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neuralnetworks. In
NeurIPS , 2018.[39] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascentare locally optimal. arXiv preprint arXiv:1902.00618 , 2019.[40] R. Johnson and T. Zhang. A framework of composite functional gradient methods for generative adversarialmodels.
IEEE transactions on pattern analysis and machine intelligence , 2019.[41] A. Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. In
ICLR ,2018.[42] A. Jolicoeur-Martineau. On relativistic f-divergences. In
ICML , 2019.[43] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks.In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages 4401–4410,2019.[44] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving theimage quality of stylegan. In
Proceedings of the IEEE/CVF Conference on Computer Vision and PatternRecognition , pages 8110–8119, 2020.[45] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. On convergence and stability of gans. arXiv preprintarXiv:1705.07215 , 2017.
[46] S. Kolouri, C. E. Martin, and G. K. Rohde. Sliced-wasserstein autoencoder: An embarrassingly simplegenerative model. arXiv preprint arXiv:1804.01947 , 2018.[47] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neuralnetworks of any depth evolve as linear models under gradient descent. In
Advances in neural informationprocessing systems , pages 8572–8583, 2019.[48] Q. Lei, J. D. Lee, A. G. Dimakis, and C. Daskalakis. Sgd learns one-layer networks in wgans. arXivpreprint arXiv:1910.07030 , 2019.[49] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding ofmoment matching network. In
NeurIPS , 2017.[50] D. Li, T. Ding, and R. Sun. Over-parameterized deep neural networks have no strict local minima for anycontinuous activations. arXiv preprint arXiv:1812.11039 , 2018.[51] J. Li, A. Madry, J. Peebles, and L. Schmidt. On the limitations of first-order approximation in gan dynamics. arXiv preprint arXiv:1706.09884 , 2017.[52] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarialnetworks. In
ICML , 2018.[53] K. Li and J. Malik. Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087 , 2018.[54] Y. Li, A. G. Schwing, K.-C. Wang, and R. Zemel. Dualing GANs. In
NeurIPS , 2017.[55] S. Liang, R. Sun, J. D. Lee, and R. Srikant. Adding one neuron can eliminate all bad local minima. In
Advances in Neural Information Processing Systems , pages 4350–4360, 2018.[56] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binaryclassification. arXiv preprint arXiv:1803.00909 , 2018.[57] S. Liang, R. Sun, and R. Srikant. Revisiting landscape analysis in deep neural networks: Eliminatingdecreasing paths to infinity. arXiv preprint arXiv:1912.13472 , 2019.[58] Z. Lin, A. Khetan, G. Fanti, and S. Oh. Pacgan: The power of two samples in generative adversarialnetworks. In
NeurIPS , 2018.[59] S. Liu and K. Chaudhuri. The inductive bias of restricted f-gans. arXiv preprint arXiv:1809.04542 , 2018.[60] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks.In
NeurIPS , 2014.[61] A. V. Makkuva, A. Taghvaei, S. Oh, and J. D. Lee. Optimal transport mapping via input convex neuralnetworks. arXiv preprint arXiv:1908.10962 , 2019.[62] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least Squares Generative AdversarialNetworks. arXiv e-prints , 2016.[63] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarialnetworks. In
ICCV , 2017.[64] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry. On finding local nash equilibria (and only local nashequilibria) in zero-sum games. arXiv preprint arXiv:1901.00838 , 2019.[65] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for gans do actually converge? In
ICML , 2018.[66] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In
ICLR ,2017.[67] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarialnetworks. In
ICLR , 2018.[68] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprintarXiv:1610.03483 , 2016.[69] Y. Mroueh and T. Sercu. Fisher gan. In
NeurIPS , 2017.[70] Y. Mroueh, T. Sercu, and V. Goel. Mcgan: Mean and covariance feature matching gan. arXiv preprintarXiv:1702.08398 , 2017.
[71] V. Nagarajan and J. Z. Kolter. Gradient descent gan optimization is locally stable. In
NeurIPS , 2017.[72] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In
ICML , 2017.[73] Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of deep neural networkswith no bad local valleys. arXiv preprint arXiv:1809.10749 , 2018.[74] S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variationaldivergence minimization. In
NeurIPS , 2016.[75] B. Poole, A. A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for gans. arXivpreprint arXiv:1612.02780 , 2016.[76] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutionalgenerative adversarial networks. In
ICLR , 2016.[77] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In
ICLR , 2018.[78] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improvedtechniques for training gans. In
NeurIPS , 2016.[79] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. On the convergence and robustness of training gans withregularized optimal transport. In
NeurIPS , 2018.[80] R. Sun, D. Li, S. Liang, T. Ding, and R. Srikant. The global landscape of neural networks: An overview.
IEEE Signal Processing Magazine , 37(5):95–108, 2020.[81] R.-Y. Sun. Optimization for deep learning: An overview.
Journal of the Operations Research Society ofChina , pages 1–46, 2020.[82] D. Tran, R. Ranganath, and D. M. Blei. Deep and hierarchical implicit models. In
NeurIPS , 2017.[83] T. Unterthiner, B. Nessler, C. Seward, G. Klambauer, M. Heusel, H. Ramsauer, and S. Hochreiter. Coulombgans: Provably optimal nash equilibria via potential fields. In
International Conference on LearningRepresentations , 2018.[84] L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in two-layer neural network optimizationlandscapes. arXiv preprint arXiv:1802.06384 , 2018.[85] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy. Esrgan: Enhancedsuper-resolution generative adversarial networks. In
ECCV , 2018.[86] J. Wu, Z. Huang, W. Li, J. Thoma, and L. Van Gool. Sliced wasserstein generative models. In
CVPR , 2019.[87] Y. Xiangli, Y. Deng, B. Dai, C. C. Loy, and D. Lin. Real or not real, that is the question. arXiv preprintarXiv:2002.05512 , 2020.[88] Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectivenessof averaging in gan training. In
ICLR , 2019.[89] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In
ICML , 2018.[90] J. Zhang, P. Xiao, R. Sun, and Z.-Q. Luo. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. arXiv preprint arXiv:2010.15768, 2020.[91] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.[92] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.

Appendix: Towards a Better Global Loss Landscape of GANs

The code is available at https://github.com/AilsaF/RS-GAN. This appendix consists of additional experiments, related work, proofs, other results and various discussions.
Contents
A Related Work
  A.1 Related Works on Local Minima and Mode Collapse
B 2-Cluster Experiments: Details and More Discussions
C Result and Experiments for Imbalanced Data Distribution
  C.1 Imbalanced Data: Math Results for Two-Clusters
  C.2 Experiments
D Experiments of Bad Initialization
E Experiments of Regular Training: More Details and More Results
  E.1 Experiment Details and More Experiments with Logistic Loss
  E.2 Experiments with Hinge Loss
  E.3 Experiments with Least Square Loss
F Experiments on High Resolution Data
G Discussions on Empirical Loss and Population Loss (complements Sec. 2)
  G.1 Particle space or probability space?
  G.2 Empirical loss and population loss
  G.3 Generalization and overfitting of GAN
H Proofs for Section 3 (2-Point Case) and Appendix C (2-Cluster Case)
  H.1 Proof of Claim 3.1 and Corollary 3.1 (for JS-GAN)
  H.2 Proof of Claim 3.2 (for RS-GAN)
  H.3 Proofs for 2-Cluster Data (Possibly Imbalanced)
I Proof of Theorem 1 (Landscape of Separable-GAN)
J Proof of Theorem 2 (Landscape of RpGAN)
  J.1 Warm-up Example
  J.2 Proof of Theorem J.1
    J.2.1 Graph Preliminaries and Proof of Lemma 1
    J.2.2 Proof of Claim J.1
  J.3 Proof of Theorem 2
K Results in Parameter Space
  K.1 Sufficient Conditions for the Assumptions
  K.2 Other Sufficient Conditions
  K.3 Proofs of Propositions for Parameter Space
  K.4 A technical lemma
  K.5 Proof of claims
L Discussion of Wasserstein GAN

A Related Work
We provide a more detailed overview of related work in this section.
Global analysis in supervised learning.
Recently, global landscape analysis has attracted much attention. See Sun [81], Sun et al. [80], and Bianchini and Gori [15] for surveys and [55, 57, 26, 56, 38, 2, 92, 27] for some recent works. It is widely believed that wide networks have a nice loss landscape and thus local minima are less of a concern (e.g., [60, 32, 50]). However, this claim only holds for supervised learning, and it is not clear whether local minima cause training difficulties for GANs.
Single-mode analysis.
For single-mode data, Feizi et al. [30] and Mescheder et al. [65] provide a global analysis of GANs. They consider a single point and a single Gaussian, respectively. Feizi et al. [30] differs from ours in a few aspects. First, they consider the single-mode setting, which does not have an issue of mode collapse. Second, they assume p_data is a Gaussian distribution, while we consider an arbitrary empirical distribution. Third, they analyze "quadratic-GAN," which is not common in practice, while we analyze commonly used GAN formulations (including JS-GAN).

Mode collapse.
Mode collapse is one of the major challenges for GANs and has received a lot of attention. There are a few high-level hypotheses, such as improper loss functions [3, 5] and weak discriminators [66, 78, 5, 52]. Interestingly, RpGAN both changes the loss function and improves the discriminator. Theoretical analysis of mode collapse is relatively scarce. Lin et al. [58] make a key observation that two distributions with the same total variation (TV) distance to the true distribution do not necessarily exhibit the same degree of mode collapse. They proposed to pack the samples (PacGAN) to alleviate mode collapse. This work is rather different from ours. First, they analyze the TV distance, while we analyze SepGANs and RpGANs. Second, their analysis is statistical, while our analysis is about optimization. As for empirical guidance, RpGAN and PacGAN are complementary and can be used together (suggested by the author of [41]). There are a few more works that discuss mode collapse and/or local minima; we defer the discussion to Appendix A.1.
Theoretical studies of loss functions.
The early work on GANs [35] built a link between the min-max formulation and the J-S distance to justify the formulation. Arjovsky and Bottou [3] pointed out some possible drawbacks of the J-S distance, and proposed a new loss based on the Wasserstein distance, referred to as WGAN. Later, Arora et al. [5] pointed out that both the Wasserstein distance and the J-S distance are not generalizable, but they also argued that this is not too scary since people are not directly minimizing these two distances but a class of metrics referred to as the "neural-network distance."
Convergence analysis.
Many recent works analyze the convergence of GANs and/or min-max optimization, e.g., [23, 22, 6, 34, 64, 88, 39, 79, 90]. These works often only analyze local stability or convergence to local minima (or stationary points), which makes them different from our work. Lei et al. [48] studied the convergence of WGAN, but restricted to 1-layer neural nets.
Other theoretical analysis.
There are a few other theoretical analyses of GANs, e.g., [68, 59, 29, 16, 8, 51, 61, 48]. Most of these works are not directly related to ours.
Other GAN Variants.
There are many GAN variants, e.g., WGAN [4, 3, 36] and its variants [86, 46, 1, 24, 25], f-GAN [74], SN-GAN [67], self-attention GAN [89], StyleGAN [43, 44], and many more [63, 69, 12, 70, 21, 54, 49, 78, 74, 75, 66, 37, 76, 10, 49]. Our analysis framework (analyzing the global landscape of the empirical loss) can potentially be applied to more of the variants mentioned above.

A.1 Related Works on Local Minima and Mode Collapse
We discuss a few related works on local minima and mode collapse, including Kodali et al. [45], Li and Malik [53] and Unterthiner et al. [83], which are mentioned in the main text.
DRAGAN.
Kodali et al. [45] suggested the connection between mode collapse and a bad equilibrium based on the following empirical observation: a sudden increase of the gradient norm of the discriminator during training is associated with a sudden drop of the IS score. However, Kodali et al. [45] do not present formal theoretical results on the relation between mode collapse and a bad equilibrium.
IMLE.
Li and Malik [53] proposed implicit maximum likelihood estimation (IMLE). The empirical version of IMLE in the parameter space is the following:

    min_w Σ_{j=1}^{n} min_{i ∈ {1,...,m}} ‖x_i − G_w(z_j)‖.    (9)
In other words, for each generated sample y_j = G_w(z_j), the loss is the distance from y_j to the closest true sample x_i. Interestingly, IMLE and RpGAN both couple the true data and the fake data in the loss. The differences are twofold: first, IMLE does not have an extra discriminator f_θ, while RpGAN does; second, IMLE compares y_j with all x_i (so as to find the nearest neighbor) while RpGAN compares y_j with an arbitrary x_j. See Table 3 for a comparison. Note that Li and Malik [53] do not present formal theoretical results on the landscape.

Table 3: Models that couple true data and fake data in the loss (model — empirical form of the loss (i) — form of coupling — optimization).
• RpGAN [41] — max_f Σ_j h(f(x_j) − f(y_j)) — pairing — min-max (ii).
• RaGAN (iii) [41] — max_f Σ_j h((1/n) Σ_{i=1}^{n} f(x_i) − f(y_j)) — comparing with the average — min-max.
• (max-)sliced-WGAN [24, 25] — max_{|f|_L ≤ 1} Σ_{i=1}^{n} [f(X)_(i) − f(Y)_(i)]^2 (iv) — pairing sorted outputs — min-max.
• IMLE [53] — Σ_j min_{i ∈ [n]} ‖y_j − x_i‖ — comparing with the closest — min.
• Coulomb-GAN [83] — Σ_{i,j} k(x_i, x_j) + Σ_{i,j} k(y_i, y_j) − Σ_{i,j} k(x_i, y_j) (v) — all pairs — non-zero-sum game (vi).

(i) We show the empirical form of the loss in the function space. Rigorously speaking, the provided form is the loss for one mini-batch; in practice, different iterations of SGD use different samples of x_i, y_j. For the empirical loss in the parameter space, replace f by f_θ and y_j by G_w(z_j). (ii) Besides the zero-sum game form (min-max form), RpGAN can easily be modified to a non-zero-sum game form (the "non-saturating version" proposed in [35]).
(iii) The precise expression of RaGAN (relativistic averaging GAN) is Σ_j h((1/n) Σ_{i=1}^{n} f_θ(x_i) − f_θ(y_j)) + Σ_i h((1/n) Σ_{j=1}^{n} f_θ(y_j) − f_θ(x_i)), but for simplicity we only present one term in the table. (iv) Here f(X)_(1) ≤ ··· ≤ f(X)_(n) and f(Y)_(1) ≤ ··· ≤ f(Y)_(n) are the sorted versions of the f(x_i)'s and f(y_i)'s, respectively. (v) Here k is the Coulomb kernel, k(u, v) = 1 / (√(‖u − v‖² + ε²))^α, where u, v ∈ ℝ^d, α ≤ d − 2 and ε > 0. The original form of Coulomb-GAN is a non-zero-sum game, but it is straightforward to transform the formulation into a pure minimization form, since the discriminator-minimization problem has a closed-form solution (used in the proof of [83, Theorem 2]). We present the transformed minimization problem here. (vi) Coulomb-GAN is presented as a non-zero-sum game, but as mentioned earlier it can be transformed into a minimization problem. The original Coulomb-GAN uses a smoothing operator in the generator loss; in this empirical form, we omit the smoothing operator for easier comparison (thus it is not exactly the same as Coulomb-GAN). In the table, we show the resulting loss in the pure minimization form. Unlike SepGAN and RpGAN, which can be written either in min-max form or in non-zero-sum game form, we point out that there is no min-max form for Coulomb-GAN, since the design principle of Coulomb-GAN is very different from typical GANs.
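To make the couplings in Table 3 concrete, the sketch below (our own illustration; the score vectors, data values and the choice of h are arbitrary) evaluates the RpGAN, RaGAN and IMLE forms on toy inputs.

```python
import numpy as np

h = lambda t: -np.logaddexp(0.0, -t)    # h(t) = -log(1 + e^{-t}), as in RS-GAN
fx = np.array([1.2, -0.4, 0.7])         # discriminator scores on real samples
fy = np.array([0.3, 0.9, -1.1])         # discriminator scores on fake samples
x = np.array([0.0, 4.0, 8.0])           # true samples (IMLE works on raw data)
y = np.array([0.5, 3.7, 3.9])           # generated samples

rp = np.sum(h(fx - fy))                               # RpGAN: pair by index
ra = np.sum(h(np.mean(fx) - fy))                      # RaGAN: compare with the average real score
imle = np.sum([np.min(np.abs(x - yj)) for yj in y])   # IMLE: distance to the closest true sample
print(rp, ra, imle)
```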
Coulomb-GAN.
Unterthiner et al. [83] argued that mode collapse can be a local Nash equilibrium in an example of two clusters (see [83, Appendix A.1]). They further proposed Coulomb-GAN and claimed that every local Nash equilibrium is a global Nash equilibrium (see [83, Theorem 2]). Their study differs from ours in a few aspects.
First, they still consider the pdf p_g, though they restrict its possible movement (according to a continuity equation); in contrast, we consider the empirical loss in particle space. Second, the bad landscape of JS-GAN is only discussed in words for the 2-cluster case [83, Appendix A.1], not formally proved; in contrast, we prove a rigorous result for the general case.
Third, they do not study the parameter space (beyond an informal discussion).
Fourth, they do not present landscape-related experiments, such as the narrow-net experiments we have done.
Common idea: Coupling true data and fake data.
Interestingly, similar to IMLE and RpGAN, Coulomb-GAN also couples the true data and the fake data in the loss function. RpGAN, RaGAN (a variant of RpGAN considered in [41]), IMLE and Coulomb-GAN differ in two aspects: the specific form of coupling (pairing, comparing with the average, comparing with the closest, all possible pairs), and the specific form of optimization (pure minimization, min-max, non-zero-sum game). See the comparison in Table 3. It is interesting that all three lines of work choose to couple true data and fake data to resolve the issue of mode collapse. We suspect it is hard to prove similar results on the landscape of the empirical loss for IMLE and Coulomb-GAN.
Relation to (max)-sliced Wasserstein GAN.
We point out that the sliced Wasserstein GAN (sliced-WGAN) [24] and the max-sliced Wasserstein GAN (max-sliced-WGAN) [25] also couple the true data and the fake data. For any function f, denote $f(X) = (f(x_1), \ldots, f(x_n))$ and $f(Y) = (f(y_1), \ldots, f(y_n))$. The empirical version of the max-sliced Wasserstein GAN can be written as
$$\min_{Y} \max_{|f|_L \le 1} W\big(f(X), f(Y)\big). \qquad (10)$$
Here f is a neural net with codomain $\mathbb{R}$, and W is the Wasserstein-2 distance. Denote $f(X)_{(1)} \le \cdots \le f(X)_{(n)}$ and $f(Y)_{(1)} \le \cdots \le f(Y)_{(n)}$ as the sorted versions of the $f(x_i)$'s and the $f(y_i)$'s respectively. Then Eq. (10) is equivalent to
$$\text{(max-)sliced-WGAN:} \quad \min_{Y} \max_{|f|_L \le 1} \sum_{i=1}^{n} \big[ f(X)_{(i)} - f(Y)_{(i)} \big]. \qquad (11)$$
This form is quite close to RpGAN (when h(t) = t): the only differences are the sorting of f(X), f(Y) and the extra constraint $|f|_L \le 1$. The extra constraint $|f|_L \le 1$ is due to the unbounded h, and can be removed if we use an upper-bounded h (which leads to a sorting version of RpGAN). See the comparison of max-sliced-WGAN with RpGAN and other models in Table 3.

[Figure 7: Comparison of JS-GAN and RS-GAN for two different runs. First row: D loss; second row: fake data movement during training. Panels: (a) JS-GAN 1st run, (b) JS-GAN 2nd run, (c) RS-GAN 1st run, (d) RS-GAN 2nd run.]
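For concreteness, the sorted-output coupling in Eq. (11) can be sketched as follows. This is an illustration only, not the code of [24, 25]; the Lipschitz constraint $|f|_L \le 1$ is not enforced here and would need, e.g., weight clipping or a gradient penalty.

```python
# Sketch of the inner objective of Eq. (11): pair critic outputs after sorting.
import torch

def sorted_pairing_loss(f_x, f_y):
    fx_sorted, _ = torch.sort(f_x)
    fy_sorted, _ = torch.sort(f_y)
    # maximized over the critic f, minimized over the generated samples Y
    return (fx_sorted - fy_sorted).mean()
```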
Nash equilibria for Gaussian data.
A very recent work by Farnia and Ozdaglar [28] shows that, in a non-realizable case (with a linear generator), Nash equilibria may not exist when learning a Gaussian distribution. This setting is quite different from ours.
B 2-Cluster Experiments: Details and More Discussions
In this part, we present details of the experiments in Section 5 and other complementary experiments.
Experimental Setting.
The code is provided in "GAN_2Cluster.py". We sample points from two clusters of data centered at two fixed locations, with roughly the same number of points in each cluster. We use GD with momentum for both D and G; the default learning rates (Dlr, Glr) and the momentum parameter follow the released code. The default number of inner iterations for the discriminator is DIter = 10. The discriminator and the generator are 4-layer networks (with 2 hidden layers) with sigmoid and tanh activations respectively. The default widths are (Dwidth, Gwidth) = (10, 5). We also discuss results for other hyperparameters below. The default number of training iterations is MaxIter = 5000. We use the non-saturating versions of both JS-GAN and RS-GAN; a minimal sketch of this setup is given below.
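The following is a minimal sketch of the 1-D two-cluster setup described above. The cluster locations, per-cluster counts, and noise scale are assumptions made for illustration; the released GAN_2Cluster.py is authoritative for the exact constants.

```python
# Sketch of the 1-D two-cluster setup (assumed constants; see GAN_2Cluster.py for the real values).
import torch
import torch.nn as nn

n_per_cluster = 50
x = torch.cat([torch.randn(n_per_cluster, 1) * 0.1 + 0.0,
               torch.randn(n_per_cluster, 1) * 0.1 + 6.0])   # two clusters (locations assumed)

# 4-layer nets with 2 hidden layers: sigmoid discriminator of width 10, tanh generator of width 5.
D = nn.Sequential(nn.Linear(1, 10), nn.Sigmoid(),
                  nn.Linear(10, 10), nn.Sigmoid(),
                  nn.Linear(10, 1))
G = nn.Sequential(nn.Linear(1, 5), nn.Tanh(),
                  nn.Linear(5, 5), nn.Tanh(),
                  nn.Linear(5, 1))

z = torch.randn(2 * n_per_cluster, 1)   # latent samples
y = G(z)                                # fake data (the "particles" whose movement Figure 7 shows)
```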
Understanding the effect of mode collapse, by checking D-loss evolution and data movement.
In the main text, we discussed that mode collapse can slow down the training of JS-GAN. For easier understanding of the training process, we add a visualization of the data movement (which is possible since we are dealing with 1-dimensional data) in Figure 7. We use the y-axis to denote the data position and the x-axis to denote the iteration. The blue curves represent the movement of all fake data during training, and the red straight lines represent the positions of the true data (two clusters). The training time may vary across runs, but overall JS-GAN takes about 2-4 times longer than RS-GAN.
Effect of width.
The default width is (Dwidth, Gwidth) = (10, 5). We tested two other settings: a wider one (Dwidth = 20) and a narrower one (Dwidth = 5). For the wide-network setting, both JS-GAN and RS-GAN converge much faster, but RS-GAN is still faster than JS-GAN in most cases; see Fig. 8. For the narrow-network setting, RS-GAN recovers both modes in all five runs, while JS-GAN fails in two of the five runs (within 5k iterations). See Fig. 9 for one success case and one failure case of JS-GAN. In the failure case, JS-GAN gets stuck at mode collapse, and the D loss is stuck at a constant level, consistent with our theory.

Footnote: max-sliced-WGAN in Deshpande et al. [25] uses $\min_Y \max_{\|v\| \le 1,\, |g|_L \le 1} W(v^T g(X), v^T g(Y))$, while sliced-WGAN in Deshpande et al. [24] uses $\min_Y \mathbb{E}_{\|v\|=1} \max_{|g|_L \le 1} W(v^T g(X), v^T g(Y))$. In Eq. (10) we use f(u) in place of $v^T g(u)$ to simplify the expression; although technically f and $v^T g$ are not equivalent, this minor difference does not affect our discussion.
[Figure 8: Wide network setting (Dwidth = 20): JS-GAN and RS-GAN in two different runs (D loss vs. iteration). Compared to the regular widths (Dwidth, Gwidth) = (10, 5), both GANs converge faster; still, RS-GAN is about 2-3 times faster than JS-GAN.]

[Figure 9: Narrow network setting: comparison of JS-GAN and RS-GAN in two runs (D loss and data position vs. iteration). RS-GAN is a few times faster than JS-GAN in general. Compared to the default widths (Dwidth 10, Gwidth 5), both GANs converge more slowly. In one case (b), JS-GAN gets stuck at mode collapse.]

Other hyperparameters.
Besides the width, the learning rates and (DIter, GIter) also affect the training process. As for (DIter, GIter), we use DIter = 10 as the default, but other choices such as DIter = 5 and DIter = 1 also work. As for the learning rates, smaller learning rates than the default also work. Different from the default hyper-parameters, for some hyper-parameters the D loss of JS-GAN does not reach the plateau value predicted for mode collapse, indicating that the basin only attracts the iterates half-way. Nevertheless, in most settings RS-GAN is still faster than JS-GAN.

C Result and Experiments for Imbalanced Data Distribution
In the main results, we assume the x_i's are distinct. In this section, we allow the x_i's to be in general positions, i.e., they can overlap. The 2-point model can only approximate two balanced clusters; by allowing the x_i's to overlap, we are able to analyze imbalanced two-cluster data. We will show: (i) a theoretical result for 2-cluster data; (ii) experiments on imbalanced 2-cluster data and MNIST.

C.1 Imbalanced Data: Math Results for Two Clusters
Assume there are n true data points $X = (x_1, \ldots, x_n)$ in two modes with proportions $\alpha$ and $1-\alpha$ respectively, where $\alpha \ge 1/2$. More precisely, assume $x_1 = x_2 = \cdots = x_{n\alpha}$ and $x_{n\alpha+1} = \cdots = x_n$, and denote the two multi-sets $\mathcal{X}_1 = \{x_1, \ldots, x_{n\alpha}\}$ and $\mathcal{X}_2 = \{x_{n\alpha+1}, \ldots, x_n\}$. Denote $Y = (y_1, \ldots, y_n)$ as the tuple of all generated points, and let $\mathcal{Y}$ be the multiset $\{y_1, \ldots, y_n\}$.

Claim C.1.
Consider the JS-GAN loss defined in Eq. (1), where X is defined above. We have
$$\phi_{JS}(Y, X) = q_\alpha(m_1) + q_{1-\alpha}(m_2), \quad \text{if } |\mathcal{X}_1 \cap \mathcal{Y}| = m_1,\ |\mathcal{X}_2 \cap \mathcal{Y}| = m_2,$$
where
$$q_\alpha(m) \triangleq \frac{\alpha}{2}\log(\alpha n) + \frac{m}{2n}\log m - \frac{\alpha n + m}{2n}\log(\alpha n + m). \qquad (12)$$
As a result, the global minimal loss is $-\log 2$, which is achieved iff $\mathcal{Y} = \mathcal{X}_1 \cup \mathcal{X}_2$.

Corollary C.1.
Suppose $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_n)$ satisfies $|\mathcal{X}_1 \cap \hat{\mathcal{Y}}| = n_1$ and $|\mathcal{X}_2 \cap \hat{\mathcal{Y}}| = n - n_1$, where $\hat{\mathcal{Y}} = \{\hat{y}_1, \ldots, \hat{y}_n\}$ is the multiset of all $\hat{y}_j$'s; then $\hat{Y}$ is a strict local minimum. Moreover, if $n_1 \neq n\alpha$, then $\hat{Y}$ is a sub-optimal strict local minimum.

The proofs of Claim C.1 and Corollary C.1 are given in Appendix H.3. Denote $m_1 \triangleq |\mathcal{X}_1 \cap \mathcal{Y}|$ and $m_2 \triangleq |\mathcal{X}_2 \cap \mathcal{Y}|$. The value $q_\alpha(n)$ indicates the value of $\phi(Y, X)$ at the fully mode-collapsed pattern (state 1b) where $m_1 = n$, $m_2 = 0$. Note that $q_\alpha(n) = \frac{\alpha}{2}\log\frac{\alpha}{\alpha+1} + \frac{1}{2}\log\frac{1}{\alpha+1}$ is a strictly decreasing function of $\alpha$. When $\alpha = 1/2$, $q_\alpha(n) = \frac{1}{4}\log\frac{1}{3} + \frac{1}{2}\log\frac{2}{3} \approx -0.48$; when $\alpha = 2/3$, $q_\alpha(n) \approx -0.56$. The more imbalanced the data are (larger $\alpha$), the smaller $q_\alpha(n)$, and the deeper the basin. In Figure 10, we compare the loss landscapes of the balanced case $\alpha = 1/2$ and the imbalanced case $\alpha = 2/3$. We suspect that the deeper basin in the imbalanced case makes it harder for JS-GAN to escape mode collapse. We therefore make the following prediction: for JS-GAN, mode collapse is a more severe issue for imbalanced data than for balanced data; for RS-GAN, the performance does not change much as the data become more imbalanced. We verify this prediction in the next subsections. A short numeric check of Eq. (12) is given below.

[Figure 10: Illustration of the landscape of JS-GAN for balanced two clusters with $\alpha = 0.5$ (left) and imbalanced two clusters with $\alpha = 2/3$ (right). Denote $m_i \triangleq |\mathcal{X}_i \cap \mathcal{Y}|$, i = 1, 2. State 0, state 1a, state 1b, state 2 represent $(m_1, m_2) = (0, 0), (n\alpha, 0), (n, 0), (n\alpha, n(1-\alpha))$ respectively. By Claim C.1, for $\alpha = 1/2$, $q_\alpha(n) \approx -0.48$ and $q_\alpha(\alpha n) \approx -0.35$; for $\alpha = 2/3$, $q_\alpha(n) \approx -0.56$ and $q_\alpha(\alpha n) \approx -0.46$. Different from the 2-point-case landscape in Fig. 5, there are some intermediate patterns (satisfying $m_1 \le n$, $m_2 = 0$), but for simplicity we do not show them. From state 1a to state 2, Y can go through state 1b or through state 0, but we only show the path through state 1b. We view the gap between state 0 and state 1a as an approximation of the "depth" of the basin.]

[Figure 11: Imbalanced 2-cluster result: comparison of JS-GAN in (a), (b) and RS-GAN in (c), (d). (a) and (c): evolution of the D loss; (b) and (d): data position movement during training.]
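The approximate values quoted above can be checked numerically from Eq. (12). The following short script is an illustration only; the particular choice of n is an assumption (any n for which $\alpha n$ is an integer works).

```python
# Numeric check of q_alpha(m) from Eq. (12), used to compare basin depths
# for balanced (alpha = 1/2) and imbalanced (alpha = 2/3) clusters.
import numpy as np

def q(alpha, m, n):
    # q_alpha(m) = (alpha/2) log(alpha*n) + (m/2n) log m - ((alpha*n + m)/2n) log(alpha*n + m),
    # with the convention 0 * log 0 = 0.
    t = 0.0 if m == 0 else (m / (2 * n)) * np.log(m)
    return (alpha / 2) * np.log(alpha * n) + t - ((alpha * n + m) / (2 * n)) * np.log(alpha * n + m)

n = 300  # assumed; any n divisible by 6 works here
for alpha in (1/2, 2/3):
    print(alpha, q(alpha, n, n), q(alpha, int(alpha * n), n))
# alpha = 1/2: q(n) ~ -0.48, q(alpha*n) = -(1/2) log 2 ~ -0.35
# alpha = 2/3: q(n) ~ -0.56, q(alpha*n) = -(2/3) log 2 ~ -0.46
```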
C.2 Experiments

2-Cluster Experiments.
For the balanced case, the experiment is described in Appendix B; both JS-GAN and RS-GAN converge to the two-mode distribution. For the imbalanced case with $\alpha = 2/3$, with the other hyper-parameters unchanged, JS-GAN falls into mode collapse while RS-GAN generates the true distribution (2/3 in mode 1 and 1/3 in mode 2); see Fig. 11. The loss $\phi_{JS}(Y, X)$ ends up at approximately -0.56, which matches Claim C.1.
MNIST experiments.
To ease visualization, we create an MNIST sub-dataset containing only 5's and 7's. We use the CNN structure of Tab. 7. For the balanced case, the numbers of 5's and 7's are identical (i.e., ratio 1:1); both JS-GAN and RS-GAN generate a roughly equal number of 5's and 7's, as shown in Fig. 12(a,b). For the imbalanced case with 5 times more 7's than 5's (ratio 1:5), JS-GAN only generates 7's, while RS-GAN generates 13 5's among 64 generated samples, aligning with the true data distribution (see Fig. 12(c,d)). These two experiments verify our earlier prediction that RS-GAN is robust to imbalanced data, while JS-GAN easily gets stuck at mode collapse for imbalanced data.

D Experiments of Bad Initialization
A bad optimization landscape does not mean that the algorithm always converges to a bad local minimum. What a "bad" landscape means is that there exists a "bad" initial point (the blue point in Fig. 13(a)) that leads to a "bad" final solution upon training. In contrast, a good landscape is more robust to the initial point: starting from any initial point (e.g., the two points shown in Fig. 13(b)), the algorithm can still find a good solution. Therefore, a bad optimization landscape of JS-GAN does not mean that the performance of JS-GAN is bad for every initial point, but it should imply that JS-GAN is bad for certain initial points. Next, we show experiments that support this prediction. (Technically, since we are not dealing with a pure minimization problem, we should say "the algorithm converges to a bad attractor"; for simplicity of illustration, we still call it a "local minimum.")

[Figure 12: Balanced and imbalanced MNIST settings: comparison of JS-GAN and RS-GAN. Panels: (a) balanced MNIST, JS-GAN; (b) balanced MNIST, RS-GAN; (c) imbalanced MNIST, JS-GAN; (d) imbalanced MNIST, RS-GAN.]

Five-Gaussian Experiments. We consider a 2-dimensional 5-Gaussian distribution as illustrated in Fig. 14(a). We design a procedure to find an initial discriminator and generator. For JS-GAN or RS-GAN, some runs end in mode collapse and some achieve perfect recovery. First, for the runs achieving perfect recovery (Fig. 14(b)) in JS-GAN and RS-GAN respectively, we pick the generators at the converged solution, denoted $G_0^{JS}$ and $G_0^{RS}$. Second, for the runs attaining mode collapse (Fig. 14(c)) in JS-GAN and RS-GAN respectively, we pick the discriminators at the converged solution, denoted $D_0^{JS}$ and $D_0^{RS}$. Then we re-train both JS-GAN and RS-GAN from $(D_0^{JS}, G_0^{JS})$ and $(D_0^{RS}, G_0^{RS})$ respectively.

[Figure 15: MNIST experiment — FID of JS-GAN vs. RS-GAN when re-trained from bad initializations, under several learning rates (generator lr = discriminator lr).]
We define an evaluation metric
$$\Psi = \sum_{k=1}^{K} \min_{i}\big( \alpha\, \| x_i - C_k \| \big),$$
where the $C_k$'s are the cluster centers, $\alpha$ is a scalar, and the $x_i$'s are the generated samples. We repeat the experiment S = 50 times and compute the average $\Psi$. The larger the metric, the worse the generated points. As shown in Fig. 14(d), the metric $\Psi$ is much higher for JS-GAN than for RS-GAN, for various learning rates lr.

MNIST Experiments. We use a similar strategy to find initial parameters for MNIST data. Fig. 15 (also shown in Sec. 6) shows that RS-GAN achieves much lower FID scores (a 30+ gap) than JS-GAN. The two experiments verify our prediction that RS-GAN is more robust to initialization, which supports our theory that RS-GAN enjoys a better landscape than JS-GAN.
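The metric $\Psi$ is simple to compute; the following is a minimal NumPy sketch under our interpretation above (each known cluster center is matched to its closest sample, so a missed mode contributes a large term). The function name, the array shapes, and the default scalar are assumptions for illustration.

```python
# Sketch of the coverage metric Psi for the 5-Gaussian experiment.
import numpy as np

def psi(samples, centers, alpha=1.0):
    # samples: (m, d) array of generated points; centers: (K, d) array of Gaussian means.
    d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)  # (m, K)
    return float(np.sum(alpha * d.min(axis=0)))   # sum over centers of the closest-sample distance
```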
E Experiments of Regular Training: More Details and More Results
In this section, we present details of the regular experiments in Sec. 6 and a few more experiments.
E.1 Experiment Details and More Experiments with Logistic Loss

Non-saturating version.
Following the standard practice [35], if $\lim_{t\to\infty} h(t) = 0$, we use the non-saturating version of RpGAN in practical training:
$$\min_{\theta} L_D(\theta; w) \triangleq -\frac{1}{n}\sum_{i} h\big(f_\theta(x_i) - f_\theta(G_w(z_i))\big), \qquad \min_{w} L_G(w; \theta) \triangleq -\frac{1}{n}\sum_{i} h\big(f_\theta(G_w(z_i)) - f_\theta(x_i)\big). \qquad (13)$$
For the logistic and hinge losses, we use Eq. (13). For the least-square loss, we use the original min-max version (see Appendix E.3 for more). We use alternating stochastic GDA to solve this problem; a minimal sketch of one update is given below.
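The sketch below shows one alternating SGDA step on the non-saturating objectives in Eq. (13) with the logistic $h(t) = -\log(1 + e^{-t})$, so that $-h(t) = \mathrm{softplus}(-t)$. The function name, the optimizers, and the batching are assumptions; this is not the authors' released training loop.

```python
# One alternating SGDA step on the non-saturating RpGAN objectives of Eq. (13).
import torch
import torch.nn.functional as F

def rpgan_step(D, G, x, z, opt_D, opt_G):
    # Discriminator step: minimize -(1/n) sum_i h(f(x_i) - f(G(z_i))).
    opt_D.zero_grad()
    d_loss = F.softplus(-(D(x) - D(G(z).detach()))).mean()
    d_loss.backward()
    opt_D.step()
    # Generator step (non-saturating): minimize -(1/n) sum_i h(f(G(z_i)) - f(x_i)).
    opt_G.zero_grad()
    g_loss = F.softplus(-(D(G(z)) - D(x))).mean()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```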
Neural-net structures:
We conduct experiments on two datasets, CIFAR-10 (32 × 32 images) and STL-10, with both a standard CNN and a ResNet. As mentioned in Sec. 6, we also conduct experiments on narrower nets: we reduce the number of channels of all convolutional layers in the generator and the discriminator to (1) half, (2) a quarter, and (3) a bottleneck (for the ResNet structure). The architectures are shown in Tab. 7 (CNN), Tab. 9 (ResNet for CIFAR), Tab. 10 (ResNet for STL), Tab. 11 (bottleneck for CIFAR) and Tab. 12 (bottleneck for STL).

[Figure 13: Left: for a bad landscape, a good initial point (red) leads to convergence to a global optimum while a bad one (blue) does not. Right: for a good landscape, two initial points both converge to global minima. Panels: (a) bad landscape with bad local minima; (b) good landscape with multiple global minima.]

[Figure 14: Five-Gaussian experiment. (a): ground truth. (b): generated data covers all five clusters. (c): mode collapse happens and only two clusters get covered. (d): JS-GAN's and RS-GAN's loss Ψ under different lr (generator lr = discriminator lr).]

[Table 4: Repeat of the experiments (logistic loss) in Tab. 2 with at least three seeds, reporting IS↑ and FID↓ for CIFAR-10, CIFAR-10+EMA and STL-10+EMA with the ResNet architecture.]
Hyper-parameters:
We use a batch size of 64. For CIFAR-10 on ResNet we set $\beta_1 = 0$ in Adam; the remaining Adam parameters for this and the other settings follow the released code. We use GIter = 1 for both the CNN and the ResNet, and DIter = 1 for the CNN and DIter = 5 for the ResNet. We fix the learning rate of the discriminator (dlr) to 2e-4. For RpGANs, we find that the learning rate of the generator (glr) needs to be larger than dlr to keep the training balanced; thus we tune glr over the set {2e-4, 5e-4, 1e-3, 1.5e-3}. For SepGANs (JS-GAN, hinge-GAN), we set glr = 2e-4 as suggested by [67, 76]. See Tab. 13 for the learning rate of RS-GAN and the hyper-parameters of WGAN-GP.

More details of EMA:
In Sec. 6, we conjectured that the effects of EMA (exponential moving average) [88] and RpGAN are additive. Suppose $w^{(t)}$ is the generator parameter at the t-th iteration of one run; the EMA generator at the t-th iteration is computed as
$$w^{(t)}_{\mathrm{EMA}} = \beta\, w^{(t-1)}_{\mathrm{EMA}} + (1-\beta)\, w^{(t)}, \qquad w^{(0)}_{\mathrm{EMA}} = w^{(0)}.$$
Note that EMA is a post-hoc processing step and does not affect the training process. Intuitively, the EMA generator is closer to the bottom of a basin, while the actual iterates circle around the basin due to the min-max structure. We use a fixed $\beta$ throughout. As Tab. 4 shows, while EMA improves both JS-GAN and RS-GAN, RS-GAN is still better than JS-GAN. A small sketch of the EMA update follows.
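The EMA update above is a one-liner per parameter. The sketch below assumes a PyTorch generator and a $\beta$ close to 1; the exact value used in the paper's runs is not reproduced here.

```python
# Sketch of the post-hoc generator EMA: w_EMA^(t) = beta * w_EMA^(t-1) + (1 - beta) * w^(t).
import copy
import torch

@torch.no_grad()
def update_ema(G_ema, G, beta=0.999):   # beta value is an assumption
    for p_ema, p in zip(G_ema.parameters(), G.parameters()):
        p_ema.mul_(beta).add_(p, alpha=1 - beta)

# Usage: G_ema = copy.deepcopy(G); call update_ema(G_ema, G) after every generator update.
```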
Results on Logistic Loss with More Seeds:
Besides the results in Tab. 2, we run at least 3 extra seeds for all experiments with the ResNet structure on CIFAR-10 to show that the results are consistent across runs. We report the results in Tab. 4 and find that RS-GAN is still better than JS-GAN, and that the gap increases as the networks become narrower.
Samples of image generation:
Generated samples obtained upon training on CIFAR-10 are given in Fig. 16 for the CNN and Fig. 17 for the ResNet. Generated samples obtained upon training on STL-10 are given in Fig. 18 for the CNN and Fig. 19 for the ResNet. Instead of cherry-picking, all sample images are generated from randomly sampled Gaussian noise.

Footnote: We tuned glr over the set {2e-4, 5e-4, 1e-3, 1.5e-3} and found that glr = 2e-4 performs best in most cases for SepGAN, so we follow the suggestion of [67, 76].
ResNet + Hinge Loss
Hinge-GAN 7.92 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 5: Comparison of Hinge-GAN and Rp-Hinge-GAN. We also show the FID gap between Rp-Hinge-GANwith Hinge-GAN (e.g. .
20 = 21 . − . and .
96 = 33 . − . . E.2 Experiments with Hinge Loss
Hinge loss has become popular in GANs [82, 67, 18]. The empirical loss of hinge-GAN is min θ L Hinge D ( θ ; w ) (cid:44) n (cid:34)(cid:88) i max(0 , − D θ ( x i )) + (cid:88) i max(0 , D θ ( G w ( z i )) (cid:35) , min w L Hinge G ( w ; θ ) (cid:44) − n (cid:88) i D θ ( G w ( z i )) . Note that Hinge-GAN applies the hinge loss for the discriminator, and linear loss for the generator.This is a variant of SepGAN with h ( t ) = h ( t ) = − max(0 , − t ) .The Rp-hinge-GAN is RpGAN given in Eq. (13) with h ( t ) = − max(0 , − t ) : min θ L R-Hinge D ( θ ; w ) (cid:44) n (cid:88) i max(0 , f θ ( G w ( z i )) − f θ ( x i ))) , min w L R-Hinge G ( w ; θ ) (cid:44) n (cid:88) i max(0 , f θ ( x i ) − f θ ( G w ( z i )))) . We compare them on ResNet with the hyper-parameter settings in Appendix E.1. As Tab. 5 shows,Rp-Hinge-GAN (both versions) performs better than Hinge-GAN. For narrower networks, the gap is to FID scores, larger than the gap for the logistic loss.
E.3 Experiments with Least Square Loss
We now consider the least-square loss. The LS-GAN [62] is defined as follows:
$$\min_{\theta} L^{\mathrm{LS}}_D(\theta; w) \triangleq \frac{1}{2n}\Big[\sum_i \big(f_\theta(x_i) - 1\big)^2 + \sum_i f_\theta(G_w(z_i))^2\Big], \qquad \min_{w} L^{\mathrm{LS}}_G(w; \theta) \triangleq \frac{1}{n}\sum_i \big(f_\theta(G_w(z_i)) - 1\big)^2.$$
This is a non-zero-sum variant of SepGAN with $h_1(t) = -(1-t)^2$ and $h_2(t) = -t^2$. Rp-LS-GAN addresses the following objectives:
$$\min_{\theta} L^{\mathrm{Rp\text{-}LS}}_D(\theta; w) \triangleq \frac{1}{n}\sum_i \big(f_\theta(x_i) - f_\theta(G_w(z_i)) - 1\big)^2, \qquad \min_{w} L^{\mathrm{Rp\text{-}LS}}_G(w; \theta) \triangleq -L^{\mathrm{Rp\text{-}LS}}_D(\theta; w). \qquad (14)$$
For the least-square loss $h(t) = -(t-1)^2$, the gradient-vanishing issue due to h does not exist, thus we can use the min-max version given in Eq. (14) in practice. Our version of Rp-LS-GAN differs from the version in [41], which is similar to Eq. (13) with the least-square h. In Tab. 6 we compare LS-GAN and Rp-LS-GAN on CIFAR-10 with the CNN architecture detailed in Tab. 7. As Tab. 6 shows, Rp-LS-GAN is slightly worse than LS-GAN at the regular width, but is better than LS-GAN (with a 5.7 FID gap) when using 1/4 width.

[Table 6: Comparison of LS-GAN and Rp-LS-GAN on CIFAR-10 with the CNN structure: IS, FID and FID gap at regular width, channel/2 and channel/4.]

A short sketch of the Rp-LS objectives follows.
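Unlike the logistic and hinge cases, Eq. (14) is used in its saturating min-max form, so the generator loss is exactly the negative of the discriminator loss. The sketch below is an illustration under that reading; the function names and batching are assumptions.

```python
# Sketch of the min-max Rp-LS-GAN objectives of Eq. (14) on a mini-batch of critic outputs.
def rp_ls_d_loss(f_real, f_fake):
    return ((f_real - f_fake - 1.0) ** 2).mean()

def rp_ls_g_loss(f_real, f_fake):
    return -rp_ls_d_loss(f_real, f_fake)
```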
F Experiments on High Resolution Data

There are two approaches to achieve a good landscape: one uses a wide enough neural net [73, 50], and the other uses a large enough number of samples (approaching the convexity of the pdf space). As we discuss in Sec. 2 (see also Appendix G.1), when the number of samples is far from enough to fill the data space, the convexity (of the pdf space) may vanish. A higher data dimension implies a larger gap between the empirical loss and the population loss, thus the non-convexity issue becomes more severe. We therefore conjecture that JS-GAN suffers more for higher-resolution data generation.

We consider the LSUN Church and Tower datasets with the CNN architecture in Tab. 8. For RS-GAN, we set glr = 1e-3 and dlr = 2e-4. The generated images are presented in Fig. 20. For both datasets, RS-GAN outperforms JS-GAN visually.

G Discussions on Empirical Loss and Population Loss (complements Sec. 2)
As mentioned in Sec. 2, the pdf-space view (the population loss) was first used in [35] and has become quite popular for GAN analysis; see, e.g., [71, 40, 20]. In this part, we provide more discussion of the relation between the empirical loss and the population loss in GANs.
G.1 Particle space or probability space?
Suppose $p_z = \mathcal{N}(0, I_{d_z})$ (or another distribution) is the distribution of the latent variable z, and $Z = (z_1, \ldots, z_n)$ are the samples of latent variables. During training, the parameter w of the generator net $G_w$ moves and, as a result, both the pdf $p_g = G_w(p_z)$ and the particles $y_j = G_w(z_j)$ move accordingly. Therefore, GAN training can be viewed either as probability-space optimization or as particle-space optimization. The two views (pdf space and particle space) are illustrated in Figure 1. In the probability-space view, an implicit assumption is that the pdf $p_g$ moves freely; in the particle-space view, we assume the particles move freely. Free particle movement implies free pdf movement if the particles almost occupy the whole space (a one-mode distribution), as shown in Fig. 21. However, for multi-mode distributions in a high-dimensional space, the particles are sparse in the space, and free particle movement does NOT imply free pdf movement. This gap was also pointed out in [83]; here, we stress that the gap becomes larger for sparser samples (either due to few samples or due to high dimension). This forms the foundation for the experiments in App. F.

To illustrate the gap between free pdf movement and free particle movement, we use an example of learning a two-mode distribution $p_{data}$. Suppose we start from an initial two-mode distribution $p_g$, as shown in Figure 22. To learn $p_{data}$, we need to do two things: first, move the two modes of $p_g$ to roughly overlap with the two modes of $p_{data}$, which we call "macro-learning"; second, adjust the distribution within each mode to match that of $p_{data}$, which we call "micro-learning." This decomposition is illustrated in Figs. 22 and 23. In micro-learning, the pdf can move freely, but in macro-learning, the whole mode has to move together and cannot move freely in the pdf space.

[Figure 21: Illustration of the learning process for a single mode. The generated samples move, which corresponds to adjusting the probability densities.]
[Figure 22: Illustration of the process of learning a multi-mode distribution. We decompose this process into two parts in the next figure.]
[Figure 23: Decomposing the learning of a multi-mode distribution into (a) macro-learning and (b) micro-learning. Macro-learning refers to the movement of a whole mode towards the underlying data mode; micro-learning refers to the adjustment of the distribution within each mode. If macro-learning fails, an entire mode is missed in the generated distribution, which corresponds to mode collapse.]

G.2 Empirical loss and population loss

The population version of RpGAN [41] is $\min_{p_g} \phi^{R,E}(p_g, p_{data})$, where
$$\phi^{R,E}(p_g, p_{data}) = \sup_{f \in C(\mathbb{R}^d)} \mathbb{E}_{x \sim p_{data},\, y \sim p_g}\big[h\big(f(x) - f(y)\big)\big]. \qquad (15)$$
Suppose we sample $x_1, \ldots, x_n \sim p_{data}$ and $y_1, \ldots, y_n \sim p_g$; then $\frac{1}{n}\sum_{i=1}^{n} h\big(f(x_i) - f(y_i)\big)$ is an approximation of the expectation above. The empirical version of RpGAN addresses $\min_{Y \in \mathbb{R}^{d \times n}} \phi^{R}(Y, X)$, where
$$\phi^{R}(Y, X) = \sup_{f \in C(\mathbb{R}^d)} \frac{1}{n}\sum_{i=1}^{n} h\big(f(x_i) - f(y_i)\big). \qquad (16)$$
Our analysis concerns the geometry of $\phi^R(Y, X)$ in Eq. (16).
In practical SGDA (stochastic GDA), at each iteration we draw a mini-batch of samples and update the parameters based on the mini-batch. The samples of true data $x_i$ are re-used multiple times (similar to SGD for a finite-sum problem), while the samples of the latent variable $z_i$ are fresh (similar to online optimization). Due to the re-use of true data, stochastic GDA should be viewed as an online optimization algorithm for solving Eq. (16) in which the $x_i$'s can repeat. Recall that in the main results we have assumed that the $x_i$'s are distinct, so there is a gap between our results and practice. Extending our results to the case of non-distinct $x_i$'s requires extra work; this was done in Claim C.1 for the 2-cluster setting, but for readability we do not further study this setting in more general cases and leave it to future work. A sketch of this sampling pattern is given below.
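The asymmetric sampling pattern above (reused data, fresh latents) is easy to reproduce; the sketch below is an illustration with placeholder sizes, not the authors' training loop.

```python
# True samples x_i are drawn repeatedly from a fixed finite dataset;
# latent samples z are drawn fresh at every iteration.
import torch
from torch.utils.data import DataLoader, TensorDataset

x_data = torch.randn(1000, 2)                      # fixed finite set of true samples (placeholder)
loader = DataLoader(TensorDataset(x_data), batch_size=64, shuffle=True)

for epoch in range(10):
    for (x_batch,) in loader:                      # true data re-used across epochs
        z_batch = torch.randn(x_batch.shape[0], 8) # fresh latent samples every iteration
        # rpgan_step(D, G, x_batch, z_batch, opt_D, opt_G)  # one SGDA update, see the Eq. (13) sketch
```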
G.3 Generalization and overfitting of GAN

One may wonder whether fitting the empirical distribution causes memorization and a failure to generate new data. Arora et al. [5] proved that for many GANs (including JS-GAN) with neural nets, only a polynomial number of samples is needed to achieve a small generalization error. We suspect that a similar generalization bound can be derived for RpGAN.
Figure 24: How to generate a new point.
We provide some intuition for why fitting the empirical data distribution via a GAN may avoid overfitting. Consider learning a two-cluster distribution as shown in Fig. 24. During training, we learn a generator that maps the latent samples $z_i$ to the $x_i$, thus fitting the empirical distribution. If we sample a new latent point $z'$, the generator maps $z'$ to a new point $x'$ in the underlying data distribution (due to the continuity of the generator function). Thus the continuity of the generator (or the restricted power of the generator) provides regularization for achieving generalization.

H Proofs for Section 3 (2-Point Case) and Appendix C (2-Cluster Case)
We now provide the proofs for the toy results (i.e., the case n = 2).

H.1 Proof of Claim 3.1 and Corollary 3.1 (for JS-GAN)

Proof of Claim 3.1: We will compute the value of $\phi_{JS}(Y, X)$ for every Y. Recall that D can be any continuous function with range (0, 1), and that
$$\phi_{JS}(Y, X) = \sup_{D} \frac{1}{2n}\Big[\sum_{i=1}^{n}\log D(x_i) + \sum_{i=1}^{n}\log\big(1 - D(y_i)\big)\Big].$$
Consider four cases. Denote the multiset $\mathcal{Y} = \{y_1, y_2\}$, and let $m_i = |\mathcal{Y} \cap \{x_i\}|$, $i \in \{1, 2\}$.

Case 1 (state 1): $m_1 = m_2 = 1$. Then the objective is
$$\sup_{D}\Big[\tfrac{1}{4}\log D(x_1) + \tfrac{1}{4}\log(1 - D(x_1)) + \tfrac{1}{4}\log D(x_2) + \tfrac{1}{4}\log(1 - D(x_2))\Big].$$
The optimal value is $-\log 2$, which is achieved when $D(x_1) = D(x_2) = 1/2$.

Case 2 (state 1a): $\{m_1, m_2\} = \{1, 0\}$. WLOG, assume $m_1 = 1$, $m_2 = 0$, so $y_1 = x_1$ and $y_2 \notin \{x_1, x_2\}$. The objective becomes
$$\sup_{D}\Big[\tfrac{1}{4}\log D(x_1) + \tfrac{1}{4}\log D(x_2) + \tfrac{1}{4}\log(1 - D(x_1)) + \tfrac{1}{4}\log(1 - D(y_2))\Big].$$
The optimal value $-(\log 2)/2$ is achieved when $D(x_1) = 1/2$, $D(x_2) \to 1$ and $D(y_2) \to 0$.

Case 3 (state 1b): $\{m_1, m_2\} = \{2, 0\}$. WLOG, assume $y_1 = y_2 = x_1$. The objective becomes
$$\sup_{D}\Big[\tfrac{1}{4}\log D(x_1) + \tfrac{1}{2}\log(1 - D(x_1)) + \tfrac{1}{4}\log D(x_2)\Big].$$
The optimal value $\tfrac{1}{4}\log\tfrac{1}{3} + \tfrac{1}{2}\log\tfrac{2}{3} \approx -0.48$ is achieved when $D(x_1) = 1/3$ and $D(x_2) \to 1$.

Case 4 (state 2): $m_1 = m_2 = 0$, i.e., $y_1, y_2 \notin \{x_1, x_2\}$. The objective is
$$\sup_{D}\Big[\tfrac{1}{4}\log D(x_1) + \tfrac{1}{4}\log D(x_2) + \tfrac{1}{4}\log(1 - D(y_1)) + \tfrac{1}{4}\log(1 - D(y_2))\Big].$$
These terms are independent, so each term can reach its supremum $\log 1 = 0$; the optimal value 0 is approached when $D(x_1) = D(x_2) \to 1$ and $D(y_1) = D(y_2) \to 0$.

Proof of Corollary 3.1:
Suppose $\epsilon_0$ is the minimal non-zero distance between two of the points $x_1, x_2, \bar{y}_1, \bar{y}_2$. Consider a small perturbation of $\bar{Y}$, namely $Y = (\bar{y}_1 + \epsilon_1, \bar{y}_2 + \epsilon_2)$ with $\|\epsilon_i\| < \epsilon_0$. We want to verify that
$$\phi(Y, X) > \phi(\bar{Y}, X) \approx -0.48. \qquad (17)$$
There are two possibilities. Possibility 1: $\epsilon_1 = 0$ or $\epsilon_2 = 0$. WLOG, assume $\epsilon_1 = 0$; then we must have $\|\epsilon_2\| > 0$. We still have $y_1 = \bar{y}_1 = x_1$, and since the perturbation is small enough, $y_2 \notin \{x_1, x_2\}$. According to Case 2 above, $\phi(Y, X) = -(\log 2)/2 \approx -0.35 > -0.48$. Possibility 2: $\|\epsilon_1\| > 0$ and $\|\epsilon_2\| > 0$. Since the perturbations $\epsilon_1$ and $\epsilon_2$ are small enough, $y_1 \notin \{x_1, x_2\}$ and $y_2 \notin \{x_1, x_2\}$. According to Case 4 above, $\phi(Y, X) = 0 > -0.48$. Combining both possibilities, we have proved Eq. (17). □
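The four optimal values computed in the proof of Claim 3.1 can be sanity-checked numerically by maximizing over the discriminator values on a grid. This is an illustration, not part of the paper's proof.

```python
# Numeric check of the four case values in the proof of Claim 3.1 (n = 2).
import numpy as np

p = np.linspace(1e-4, 1 - 1e-4, 2000)
log, mx = np.log, np.max

case1 = 2 * mx(0.25*log(p) + 0.25*log(1 - p))                                       # -log 2   ~ -0.693
case2 = mx(0.25*log(p) + 0.25*log(1 - p)) + 0.25*mx(log(p)) + 0.25*mx(log(1 - p))   # -(log2)/2 ~ -0.347
case3 = mx(0.25*log(p) + 0.50*log(1 - p)) + 0.25*mx(log(p))                         #          ~ -0.477
case4 = 0.0
print(case1, case2, case3, case4)
```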
H.2 Proof of Claim 3.2 (for RS-GAN)
This is the result of RS-GAN for n = 2 . WLOG, assume x = 0 , x = 1 . Denote g RS ( Y ) (cid:44) φ RS ( Y, X ) = sup f ∈ C ( R d ) 12 log f (0) − f ( y )) + log f (1) − f ( y )) . Denote m i = |{ y i } ∩{ x i }| , i = 1 , ; note this definition is different from JS-GAN in App. H.1. Consider three cases. Case 1 : m = m = 1 . If y = 0 , y = 1 , then g RS ( Y ) = [log 0 . .
5] = − log 2 ≈− . . If y = 1 , y = 0 , then g RS ( Y ) = sup f ∈F
12 log 11 + exp( f (0) − f (1)) + 12 log 11 + exp( f (1) − f (0))= sup t ∈ R (cid:20)
12 log 11 + exp( t ) + 12 log 11 + exp( − t ) (cid:21) = − log 2 . Case 2 : { m , m } = { , } . WLOG, assume y = 0 , y (cid:54) = 1 (note that y can be ). Then g RS ( Y ) ≥ sup f ∈F
12 log 11 + exp( f (0) − f (0)) + 12 log 11 + exp( f (1) − f ( y ))= −
12 log 2 + sup t ∈ R
12 log 11 + exp( t ) = −
12 log 2 ≈ − . . The value is achieved when f (1) − f ( y ) → −∞ . Case 3 : m = m = 0 . Then g RS ( Y ) ≥ sup f ∈F
12 log 11 + exp( f (0) − f ( y )) + 12 log 11 + exp( f (1) − f ( y ))= sup t ∈ R ,t ∈ R
12 log 11 + exp( t ) + 12 log 11 + exp( t ) = 0 . The value is achieved when f (1) − f ( y ) → −∞ and f (0) − f ( y ) → −∞ . The global minimal value is − log 2 , and the only global minima are { y , y } = { x , x } . In addition,from any Y , it is easy to verify that there is a non-decreasing path from Y to a global minimum.25 .3 Proofs for 2-Cluster Data (Possibly Imbalanced)Proof of Claim C.1. The proof is built on the proof of Claim 3.1 in Appendix H.1.We first consider a special case |X ∩ Y | = m, |X ∩Y| =0 . This means that m generated points are inmode 1, and the rest are in neither modes. The loss value can be computed as follows: φ JS ( Y, X ) = 12 n (cid:20) αn log( αnαn + m ) + m log(1 − αnαn + m ) (cid:21) = α αn ) + m n log m − αn + m n log( αn + m )) = q α ( m ) . In general, if |X ∩Y| = m , |X ∩Y| = m , then φ JS ( Y, X ) can be divided into three parts: the first partis the sum of the terms that contain x (including x i ’s and y j ’s that are equal to x ), the second part isthe sum of the terms that contain x n (including x i ’s and y j ’s that are equal to x n ), and the third partis the sum of the terms that contain y j ’s that are not in { x , x n } . Similar to Case 3 above, the valueof the first part is q α ( m ) , and the value of the second part is q − α ( m ) . Similar to the above specialcase, the value of the third part is . Therefore, the loss value is φ JS ( Y, X ) = q α ( m ) + q − α ( m ) . It is easy to show that q α ( m ) + q − α ( m ) ≥ − log 2 , and the equality is achieved iff m = nα, m = n (1 − α ) , i.e., Y = X ∪ X . (cid:50) Proof sketch of Corollary C.1.
After a small enough perturbation, we must have m (cid:44) |X ∩Y| ≤ n , m (cid:44) |X ∩Y| ≤ n . Since q α ( m ) and q − α ( m ) are strictly decreasing functions of m , we have φ ( Y, X ) = q α ( m ) + q − α ( m ) ≤ q α ( n ) + q − α ( n ) = φ ( ˆ Y , X ) . The equality holds iff ( m , m ) = ( n , n ) , i.e., Y = ˆ Y .
This means that if $(n_1, n_2) \neq (n\alpha, n(1-\alpha))$, then $\hat{Y}$ is a sub-optimal strict local minimum. □ We skip the detailed proof, since the other parts are similar to the proof of Corollary 3.1.
I Proof of Theorem 1 (Landscape of Separable-GAN)
Denote F ( D ; Y ) = n (cid:80) ni =1 [ h ( f ( x i )) + h ( − f ( y i ))] ≤ (since h i ( t ) ≤ , i = 1 , for any t ). Step 1: Compute the value of φ ( · , X ) for each Y . For any i , denote M i = { j : y j = x i } , m i = | M i | ≥ , i = 1 , , . . . , n. Then m + · · · + m n = n . Denote Ω = M ∪ M · · · ∪ M n . Then φ ( Y, X ) = 12 n sup f n (cid:88) i =1 [ h ( f ( x i )) + h ( − f ( y i ))] = 12 n sup f n (cid:88) i =1 [ h ( f ( x i )) + m i h ( − f ( x i ))] + (cid:88) j / ∈ Ω h ( − f ( y i )) ( i ) = 12 n (cid:32) n (cid:88) i =1 sup ti ∈ R [ h ( t i ) + m i h ( − t i )] + | Ω c | sup t ∈ R h ( t ) (cid:33) ( ii ) = 12 n n (cid:88) i =1 ξ ( m i ) (18a) ( iii ) ≥ n n (cid:88) i =1 m i ξ (1) = 12 ξ (1) . Here (i) is because f ( y j ) , j ∈ Ω are independent of h ( x i ) ’s and thus can be any values; (ii) isby the definition ξ ( m ) = sup t [ h ( t ) + mh ( − t )] and Assumption 4.1 that sup t h ( t ) = 0 ; (iii)is due to the convexity of ξ (note that ξ is the supreme of linear functions). Furthermore, if thereis a certain m i > , then ξ ( m i ) + ( m i − ξ (0) = ξ ( m i ) > m i ξ (1) (according to Assumption4.2), causing (iii) to become a strict inequality. Thus the equality in (iii) holds iff m i = 1 , ∀ i , i.e., { y , . . . , y n } = { x , . . . , x n } . Therefore, we have proved that φ ( Y, X ) achieves the minimal value ξ (1) iff { y , . . . , y n } = { x , . . . , x n } . Step 2: Sufficient condition for strict local-min.
Next, we show that if Y satisfies m + m + · · · + m n = n then Y is a strict local-min. Denote δ = min k (cid:54) = l (cid:107) x k − x l (cid:107) . Consider a small perturbation of Y as ¯ Y = ( ¯ y , ¯ y , . . . , ¯ y n ) = ( y + (cid:15) , y + (cid:15) , . . . , y n + (cid:15) n ) , where (cid:107) (cid:15) j (cid:107) < δ, ∀ j and (cid:80) j (cid:107) (cid:15) j (cid:107) > . We want to prove φ ( ¯ Y , X ) > φ ( Y, X ) . Denote ¯ m i = |{ j : ¯ y j = x i }| , i = 1 , , . . . , n. Consider an arbitrary j . Since y j ∈ { x , . . . , x n } ,there must be some i such that y j = x i . Together with (cid:107) ¯ y j − y j (cid:107) = (cid:107) (cid:15) j (cid:107) < δ = min k (cid:54) = l (cid:107) x k − x l (cid:107) ,we have ¯ y j / ∈ ( { x , x , . . . , x n }\{ x i } ) . In other words, the only possible point in { x , . . . , x n } thatcan coincide with ¯ y j is x i , and this happens only when (cid:15) j = 0 . This implies ¯ m i ≤ m i , ∀ i . Since26e have assumed (cid:80) j (cid:107) (cid:15) j (cid:107) > , for at least one i we have ¯ m i < m i . Together with Assumption4.3 that ξ ( m ) is a strictly decreasing function in m ∈ [0 , n ] , we have φ ( ¯ Y , X ) = n (cid:80) ni =1 ξ ( ¯ m i ) > n (cid:80) ni =1 ξ ( m i ) = φ ( Y, X ) . Step 3 : Sub-optimal strict local-min.
Finally, if Y satisfies that m + m + · · · + m n = n and m k ≥ for some k , then φ ( Y, X ) > ξ (0) . Thus Y is a sub-optimal strict local minimum. Q.E.D.Remark 1 : ξ ( m ) is convex (it is the supreme of linear functions), thus we always have ξ ( m ) = ξ ( m ) + ( m − ξ (0) ≥ mξ (1) . Assump. 4.2 states that the inequality is strict, thus it is slightlystronger than the convexity of ξ . By Assump. 4.1, we also have h ( t ) + ( m + 1) h ( − t ) ≤ h ( t ) + mh ( − t ) , thus ξ ( n ) ≤ ξ ( n − ≤ · · · ≤ ξ (0) . Assumption 4.3 states that the inequalitiesare strict. This holds if the maximizer of h ( t ) + mh ( − t ) does not coincide with the maximizer of h ( t ) . Intuitively, if h ( t ) is “substantially different” from a constant function, then Assump. 4.2 andAssump. 4.3 hold. Remark 2 : The upper bound in Assumption 4.1 is not essential, and can be relaxed to any finitenumbers (change other two assumptions accordingly). We skip the details. J Proof of Theorem 2 (Landscape of RpGAN)
This proof is the longest one in this paper. We will focus on a proof for the special case of RS-GAN. The proof for general RpGAN is quite similar, and presented in Appendix J.3. Recall φ RS ( Y, X ) = sup f n (cid:80) ni =1 log f ( y i ) − f ( x i )) . Theorem J.1. (special case of Theorem 2 for RS-GAN) Suppose x , x , . . . , x n ∈ R d are distinct.The global minimal value of φ RS ( Y, X ) is − log 2 , which is achieved iff { x , . . . , x n } = { y , . . . , y n } .Furthermore, any point is global-min-reachable for the function. Proof sketch.
We compute the value of $g(Y) = \phi_{RS}(Y, X)$ for any Y using the following steps:
(i) We build a graph with vertices representing the distinct values among the $x_i, y_i$, and draw directed edges from $x_i$ to $y_i$. This graph can be decomposed into cycles and trees.
(ii) Each vertex in a cycle contributes $-\frac{1}{n}\log 2$ to the value $g(Y)$.
(iii) Each vertex in a tree contributes 0 to the value $g(Y)$.
(iv) The value $g(Y)$ equals $-\frac{1}{n}\log 2$ times the number of vertices in the cycles.
The outline of this section is as follows. In the first subsection, we analyze an example as a warm-up. Next, we prove Theorem J.1. The proofs of some technical lemmas are provided in the following subsections. Finally, in Appendix J.3 we present the proof of Theorem 2.

J.1 Warm-up Example
We prove that if { y , y , . . . , y n } = { x , . . . , x n } , then Y is a global minimum of g ( Y ) .Suppose y i = x σ ( i ) , where ( σ (1) , σ (2) , . . . , σ ( n )) is a permutation of (1 , , . . . , n ) . Wecan divide { , , . . . , n } into finitely many cycles C , C , . . . , C K , where each cycle C k =( c k (1) , c k (2) , . . . , c k ( m k )) satisfies c k ( j + 1) = σ ( c k ( j )) , j ∈ { , , . . . , m k } . Here c k ( m k + 1) isdefined as c k (1) . Now we calculate the value of g ( Y ) . g ( Y ) = sup f n n (cid:88) i =1 log 11 + exp( f ( y i ) − f ( x i ))) (i) = − inf f n K (cid:88) k =1 (cid:88) i ∈ Ck log (1 + exp( f ( y i ) − f ( x i )))= − inf f n K (cid:88) k =1 mk (cid:88) j =1 log (cid:16) e f ( xck ( j +1)) − f ( xck ( j ))) (cid:17) (ii) = − n K (cid:88) k =1 inf f mk (cid:88) j =1 log (cid:16) e f ( xck ( j +1)) − f ( xck ( j )) (cid:17) = − n K (cid:88) k =1 inf t ,t ,...,tmk ∈ R mk − (cid:88) j =1 log (1 + exp( t j +1 − t j )) + log (cid:0) t − t mk ) (cid:1) (iii) = − n K (cid:88) k =1 m k log(1 + exp(0)) = − log 2 . Here (i) is because { , , . . . , n } is the combination of C , . . . , C K and i ∈ C k means that i = c k ( j ) for some j . (ii) is because C k ’s are disjoint and f can be any continuous function; more specifically,27he choice of { f ( x i ) : i ∈ C k } is independent of the choice of { f ( x i ) : i ∈ C l } for any k (cid:54) = l ,thus we can take the infimum over each cycle (i.e., put “inf” inside the sum over k ). (iii) is because (cid:80) m − j =1 log(1 + exp( t j +1 − t j )) + log (1 + exp( t − t m )) is a convex function of t , t , . . . , t m andthe minimum is achieved at t = t = · · · = t m = 0 . J.2 Proof of Theorem J.1
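The warm-up computation above can be checked mechanically: when $y_i = x_{\sigma(i)}$ for a permutation $\sigma$, the objective decomposes over the cycles of $\sigma$ and each cycle of length m contributes $-(m/n)\log 2$, so any permutation attains $g(Y) = -\log 2$. The following short script is an illustration of that counting argument only.

```python
# Cycle decomposition of a permutation sigma and the resulting value g(Y) = -(sum of cycle lengths)/n * log 2.
import numpy as np

def g_permutation(sigma):
    n, seen, total = len(sigma), set(), 0.0
    for start in range(n):
        if start in seen:
            continue
        i, length = start, 0
        while i not in seen:          # walk one cycle of the permutation
            seen.add(i)
            i, length = sigma[i], length + 1
        total += -(length / n) * np.log(2)
    return total

print(g_permutation([2, 0, 1, 4, 3]))   # -log 2 ~ -0.693, as for any permutation
```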
This proof is divided into three steps. In Step 1, we compute the value of g ( Y ) if all y i ∈ { x , . . . , x n } .This is the major step of the whole proof. In Step 2, we compute the value of g ( Y ) for any Y . InStep 3, we show that there is a non-decreasing continuous path from Y to a global minimum. Step 1: Compute g ( Y ) that all y i ∈ { x , . . . , x n } . Define R ( X ) = { Y : y i ∈ { x , . . . , x n } , ∀ i } . (19) Step 1.1: Build a graph and decompose it.
We fix Y ∈ R ( X ) . We build a directed graph G = ( V, A ) as follows. The set of vertices V = { , , . . . , n } represent x , x , . . . , x n . A directededge ( i, j ) ∈ A if y i = x j . In this case, there is a term log(1 + exp( f ( x j ) − f ( x i ))) in g ( Y ) . It ispossible to have a self-loop ( i, i ) , which corresponds to the case y i = x i . By Eq. (19), we have g ( Y ) = − inf f n n (cid:88) i =1 log (cid:16) e f ( y i ) − f ( x i ) (cid:17) = − inf f n (cid:88) ( i,j ) ∈ A log (cid:16) e f ( x j ) − f ( x i ) (cid:17) . (20) Each y i corresponds to a unique x j , thus the out-degree of i , denoted as outdegree ( i ) , must be exactly . The in-degree of each i , denoted as indegree ( i ) , can be any number in { , , . . . , n } .We will show that the graph G can be decomposed into the union of cycles and trees (see App. J.2.1for its proof, and definitions of cycles and trees). A graphical illustration is given in Figure 25. Lemma 1.
Suppose G = ( V, A ) is a directed graph and outdegree ( v ) = 1 , ∀ v ∈ V . Then:(a) There exist cycles C , C , . . . , C K and subtrees T , T , . . . , T M such that each edge v ∈ A appears either in exactly one of the cycles or in exactly one of the subtrees.(b) The root of each subtree u m is a vertex of a certain cycle C k . In addition, each vertex of the graphappears in exactly one of the following sets: V ( C ) , . . . , V ( C K ) , V ( T ) \{ u } , . . . , V ( T M ) \{ u M } .(c) There is at least one cycle in the graph. (a) Eg 1 for Lemma 1 (b) Eg 2, with self-loop (c) Example graph for generalcase Figure 25: The first two figures are two connected component of a graph representing the case y i ∈ { x , . . . , x n } , ∀ i . The first figurecontains vertices and directed edges. It can be decomposed into a cycle (1 , , , and two subtrees: one subtree consists of edge (10 , and vertices , , and another consists of edges (8 , , (9 , , (7 , , (6 , , (5 , . The second figure has one cycle being a self-loop, and two trees attached to it. The third figure is an example graph of the case that some y i / ∈ { x , . . . , x n } . In this example, n = 8 (so edges), and all y i ’s are in { x , . . . , x n } except y , y . The two edges (6 , and (6 , indicate the two terms h ( f ( y ) − f ( x )) and h ( f ( y ) − f ( x )) in g ( Y ) . They have the same head , thus y = y . The vertice has out-degree , indicating that y = y / ∈{ x , . . . , x n } . This figure can be decomposed into two cycles and three subtrees. Finally, adding a self-loop (9 , will generate a graphwhere each edge has outdegree (this is the reduction done in Step 2). Denote ξ ( y i , x i ) = log (cid:0) e f ( y i ) − f ( x i ) (cid:1) . According to Lemma 1, we have − ng ( Y ) = inf f n (cid:88) i =1 ξ ( y i , x i ) ≥ inf f K (cid:88) k =1 (cid:88) i ∈ V ( C k ) ξ ( y i , x i ) (cid:44) g cyc . (21) Step 1.2: Compute g cyc . We then compute g cyc . Since C k is a cycle, we have X k (cid:44) { x i : i ∈ C k } = { y i : i ∈ C k } . Since C k ’s are disjoint, we have X k ∩ X l = ∅ , ∀ k (cid:54) = l. This implies that28 ( x i ) , f ( y i ) for i in one cycle C k are independent of the values corresponding to other cycles. Then g cyc can be decomposed according to different cycles: g cyc = inf f K (cid:88) k =1 (cid:88) i ∈ V ( C k ) log (1 + exp( f ( y i ) − f ( x i ))) = K (cid:88) k =1 inf f (cid:88) i ∈ V ( C k ) log (1 + exp( f ( y i ) − f ( x i ))) . Similar to Warm-up example 1, the infimum for each cycle is achieved when f ( x i ) = f ( x j ) , ∀ i, j ∈ V ( C k ) . In addition, g cyc = − log 2 K (cid:88) k =1 | V ( C k ) | . (22) Step 1.3: Compute g ( Y ) . According to Eq. (21) and Eq. (22), we have − ng ( Y ) ≥ K (cid:88) k =1 | V ( C k ) | log 2 . (23) Denote F ( Y ; f ) = − n (cid:80) ni =1 log (cid:0) e f ( y i ) − f ( x i ) (cid:1) , then g ( Y ) = inf f F ( Y ; f ) . We claim that forany (cid:15) > , there exists a continuous function f such that − nF ( Y ; f ) < K (cid:88) k =1 | V ( C k ) | log 2 + (cid:15). (24)Let N be a large positive number such that n log (1 + exp( − N ))) < (cid:15). (25)Pick a continuous function f as follows. f ( x i ) = (cid:40) , i ∈ (cid:83) Kk =1 V ( C k ) ,N · depth ( i ) , i ∈ (cid:83) Mm =1 V ( T m ) . (26)Note that the root u m of a tree T m is also in a certain cycle C k , thus the value f ( x u m ) is defined twicein Eq. (26), but in both definitions its value is , thus the definition of f is valid. 
For any i ∈ V ( C k ) ,suppose y i = x j , then both i, j ∈ V ( C k ) which implies f ( y i ) − f ( x i ) = f ( x j ) − f ( x i ) = 0 . For any i ∈ V ( T m ) \{ u m } , suppose y i = x j , then by the definition of the graph ( i, j ) is a directed edge of thetree T m , which means that depth ( i ) = depth ( j ) + 1 . Thus f ( y i ) − f ( x i ) = f ( x j ) − f ( x i ) = − N. In summary, for the choice of f in Eq. (26), we have f ( y i ) − f ( x i ) = (cid:40) , i ∈ (cid:83) Kk =1 V ( C k ) , − N, i ∈ (cid:83) Mm =1 V ( T m ) . (27)Denote p = (cid:80) Kk =1 | V ( C k ) | log 2 . For the choice of f in Eq. (26), we have − nF ( Y ; f ) = n (cid:88) i =1 log (cid:16) e f ( y i ) − f ( x i ) (cid:17) = K (cid:88) k =1 (cid:88) i ∈ V ( C k ) log (cid:16) e f ( y i ) − f ( x i ) (cid:17) + M (cid:88) m =1 (cid:88) i ∈ V ( T m ) \{ u m } log (cid:16) e f ( y i ) − f ( x i ) (cid:17) (27) = K (cid:88) k =1 (cid:88) i ∈ V ( C k ) log (cid:0) e (cid:1) + M (cid:88) m =1 (cid:88) i ∈ V ( T m ) \{ u m } log (cid:16) e − N (cid:17) = K (cid:88) k =1 | V ( C k ) | log 2 + M (cid:88) k =1 ( | V ( T m ) | −
1) log (cid:16) e − N (cid:17) ≤ p + n log (cid:16) e − N (cid:17) (25) < p + (cid:15). (28) This proves Eq. (24). Combining the two relations given in Eq. (24) and Eq. (23), we have g ( Y ) = inf f F ( Y ; f ) = 1 n K (cid:88) k =1 | V ( C k ) | log 2 , ∀ Y ∈ R ( X ) . (29) tep 2: Compute g ( Y ) for any Y .In the general case, not all y i ’s lie in { x , . . . , x n } . We will reduce to the previous case. Denote H = { i : y i ∈ { x , . . . , x n }} , H c = { j : y j / ∈ { x , . . . , x n }} . Since y j ’s in H c may be the same, we define the set of such distinct values of y j ’s as Y out = { y ∈ R d : y = y j , for some j ∈ H c } . Let ¯ n = | Y out | , then there are total n + ¯ n distinct values in x , . . . , x n , y , . . . , y n . WLOG, assume y , . . . , y ¯ n are distinct (this is because the value of g ( Y ) does not change if we re-index x i ’s and y i ’sas long as the subscripts of x i , y i change together), then Y out = { y , . . . , y ¯ n } . We create artificial “true data” and “fake data” x n +1 = x n +1 = y , . . . , x n +¯ n = y n +¯ n = y ¯ n . Define F auc ( Y, f ) = − (cid:80) n + mi =1 log (cid:0) e f ( y i ) − f ( x i ) (cid:1) g auc = − inf f F auc ( Y, f ) . Clearly, F auc ( Y, f ) = nF ( Y, f ) − ¯ n log 2 and ng ( Y ) = g auc − ¯ n log 2 .Consider the new configurations ˆ X = ( x , . . . , x n +¯ n ) and ˆ Y = ( y , . . . , y n +¯ n ) . For the newconfigurations, we can build a graph ˆ G with n + ¯ n vertices and n + ¯ n edges. There are K self-loops C K +1 , . . . , C K +¯ n at the vertices corresponding to y , . . . , y ¯ n . Based on Lemma 1, we have: (a)There exist cycles C , C , . . . , C K , C K +1 , . . . , C K +¯ n and subtrees T , T , . . . , T M (with roots u m ’s)s.t. each edge v ∈ A appears in exactly one of the cycle or subtrees. (b) u m is a vertex of a certaincycle C k where ≤ k ≤ K + ¯ n . (c) Each vertex of the graph appears in exactly one of the followingsets: V ( C ) , . . . , V ( C K +¯ n ) , V ( T ) \{ u } , . . . , V ( T M ) \{ u M } . According to the proof in Step 1, wehave g auc = (cid:80) K +¯ nk =1 | V ( C k ) | log 2 = (cid:80) Kk =1 | V ( C k ) | log 2 + ¯ n log 2 . Therefore, ng ( Y ) = g auc − ¯ n log 2 = K (cid:88) k =1 | V ( C k ) | log 2 . We build a graph G by removing the self-loops C K + j = ( y j , y j ) , j = 1 , . . . , ¯ n in ˆ G . The new graph G consists of n + ¯ n vertices corresponding to x , . . . , x n and y , . . . , y ¯ n and n edges. The graphcan be decomposed into cycles C , C , . . . , C K (since ¯ n cycles are removed from ˆ G ) and subtrees T , T , . . . , T M . The value ng ( Y ) = (cid:80) Kk =1 | V ( C k ) | log 2 , where C k ’s are all the cycles of G . Step 3: Finding a non-decreasing path to a global minimum . Finally, we prove that for any Y ,there is a non-decreasing continuous path from Y to one global minimal Y ∗ . The following claimshows that we can increase the value of Y incrementally. See the proof in Appendix J.2.2. Claim J.1.
For an arbitrary Y that is not a global minimum, there exists another ˆ Y and a non-decreasing continuous path from Y to ˆ Y such that g ( ˆ Y ) − g ( Y ) ≥ n log 2 . For any Y that is not a global minimum, we apply Claim J.1 for finitely many times (no more than n times), then we will arrive at one global minimum Y ∗ . We connect all non-decreasing continuouspaths and get a non-decreasing continuous path from Y to Y ∗ . This finishes the proof. J.2.1 Graph Preliminaries and Proof of Lemma 1
We present a few definitions from standard graph theory.
Definition J.1. (walk, path, cycle) In a directed graph G = ( V, A ) , a walk W = ( v , e , v , e ,. . . , v m − , e m , v m ) is a sequence of vertices and edges such that v i ∈ V, ∀ i ∈ { , , . . . , m } and e i = ( v i − , v i ) ∈ A, ∀ i ∈ { , . . . , m } . If v , v , . . . , v m are distinct, we call it path (with length m ).If v , v , . . . , v m − are distinct and v m = v , we call it a cycle. Any v has a path to itself (with length ), no matter whether there is an edge between v to itself ornot. This is because the degenerate walk W = ( v ) satisfies the above definition. The set of verticesand edges in W are denoted as V ( W ) and A ( W ) respectively.30 efinition J.2. (tree) A directed tree is a directed graph T = ( V, A ) with a designated node r ∈ V ,the root, such that there is exactly one path from v to r for each node v ∈ V and there is no edgefrom the root r to itself. The depth of a node is the length of the path from the node to the root (thedepth of the root is ). A subtree of a directed graph G is a subgraph T which is a directed tree. Proof of Lemma 1:
We slightly extend the definition of “walk” to allow infinite length. We present two observations.
Observation 1 : Starting from any vertex v ∈ V ( G ) , there is a unique walk with infinite length W ( v ) (cid:44) ( v , e , v , e , v , . . . , v i , e i , v i +1 , e i +1 , . . . ) , where e i is an edge in A ( G ) with tail v i − and head v i .Proof of Observation 1: At each vertex v i , there is a unique outgoing edge e i = ( v i , v i +1 ) whichuniquely defines the next vertex v i +1 . Continue the process, we have proved Observation 1. Observation 2 : The walk W ( v ) (cid:44) ( v , e , v , e , v , . . . , v i , e i , v i +1 , e i +1 , . . . ) can bedecomposed into two parts W ( v ) = ( v , e , v , e , v , . . . , v i − , e i , v i ) , W ( v ) =( v i , e i +1 , v i +1 , e i +2 , v i +2 , . . . ) , where W ( v ) is a path from v to v i (i.e. v , v , . . . , v i are distinct), and W ( v ) is the repetition of a certain cycle (i.e., there exists T such that v i + T = v i ,for any i ≥ i ). This decomposition is unique, and we say the “first-touch-vertex” of v is v i . Proof of Observation 2 : Since the graph is finite, then some vertices must appear at least twice in W ( v ) . Among all such vertices, suppose u is the one that appears the earliest in the walk W ( v ) ,and the first two appearances are v i = u and v i = u and i < i . Denote T = i − i . Then it iseasy to show W ( v ) is the repetitions of the cycle consisting of vertices v i , v i +1 , . . . , v i − , and W ( v ) is a directed path from v to v i .The first-touch-vertex u = v i has the following properties: (i) u ∈ C k for some k ; (ii) there exists apath from v to u ; (iii) any paths from v to any vertex in the cycle C k other than u must pass u . Notethat if u is in some cycle, then its first-touch-vertex is u itself.As a corollary of Observation 2, there is at least one cycle. Suppose all cycles of G are C , C , . . . , C K . Because the outdegree of each vertex is , these cycles must be disjoint, i.e., V ( C i ) ∩ V ( C j ) = ∅ and A ( C i ) ∩ A ( C j ) = ∅ , for any i (cid:54) = j . Denote the set of vertices in the cyclesas V c = K (cid:91) k =1 V ( C ) ∪ · · · ∪ V ( C K ) . (30)Let u , . . . , u M be the vertices of C , . . . , C m with indegree at least .Based on Observation 2, starting from any vertex outside V c there is a unique path that reaches V c .Combining all vertices that reach the cycles at u m (denoted as V m ), and the paths from these verticesto u m , we obtain a directed subgraph T m , which is connected with V c only via the vertex u m . Thesubgraphs T m ’s are disjoint from each other since they are connected with V c via different vertices.In addition, each vertex outside of V c lies in exactly one of the subgraph T m . Thus, we can partitionthe whole graph into the union of the cycles C , . . . , C K and the subgraphs T , . . . , T M .We then show T m ’s are trees. For any vertex v in the subgraph T m , consider the walk W ( v ) . Anypath starting from v must be part of W ( v ) . Starting from v there is only one path from v to u m which is W ( v ) , according to Observation 2. Therefore, by the definition of a directed tree, T m isa directed tree with the root u m . Therefore, we can partition the whole graph into the union of thecycles C , . . . , C K and subtrees T , . . . , T M with disjoint edge sets; in addition, the edge sets of thecycles are disjoint, and the root of T l must be in certain cycle C k . It is easy to verify the propertiesstated in Lemma 1. This finishes the proof. J.2.2 Proof of Claim J.1
We first prove the case for d ≥ . Suppose the corresponding graph for Y is G , and G is decomposedinto the union of cycles C , . . . , C K and trees T , . . . , T m . We perform the following operation: pickan arbitrary tree T m with the root u m . The tree is non-empty, thus there must be an edge e with thehead u m .Suppose v is the tail of the edge e . Now we remove the edge e = ( v, u m ) and create a newedge e (cid:48) = ( v, v ) . The new edge corresponds to y v = x v . The old edge ( v, u m ) corresponds to31 v = x u m (and a term h ( f ( x u m ) − f ( x v )) ) if u m ≤ n or y v = y u m − n / ∈ { x , . . . , x n } (and aterm h ( f ( y u m − n ) − f ( x v )) ) if u m > n . This change corresponds to the change of y v : we change y v = x u m (if u m ≤ n ) or y v = y u m − n (if u m > n ) to ˆ y v = x v . Let ˆ y i = y i for any i (cid:54) = v , and ˆ Y = (ˆ y , . . . , ˆ y n ) is the new point.Previously v is in a tree T m (not its root), now v is the root of a new tree, and also part of the newcycle (self-loop) C K +1 = ( v, e (cid:48) , v ) . In this new graph, the number of vertices in cycles increases by , thus the value of g increases by − n log 2 , i.e., g ( ˆ Y ) − g ( Y ) = n log 2 .Since d ≥ , we can find a path in R d from a point to another point without passing any of the pointsin { x , . . . , x n } . In the continuous process of moving y v to ˆ y v , the function value will not changeexcept at the end that y v = x v . Thus there is a non-increasing path from Y to ˆ Y , in the sense thatalong this path the function value of g does not decrease.The illustration of this proof is given below. (a) Original graph (b) Modified graph, with improved functionvalueFigure 26: Illustration of the proof of Claim J.1. For the figure on the left, we pick an arbitrary tree with thehead being vertex , which corresponds to y = y . We change y to ˆ y = x to obtain the figure on the right.Since one more cycle is created, the function value increases by − n log 2 . For the case d = 1 , the above proof does not work. The reason is that the path from y v to ˆ y v maytouch other points in { x , . . . , x n } and thus may change the value of g . We only need to make asmall modification: we move y v in R until it touches a certain x i that corresponds to a vertex in thetree T m , at which point a cycle is created, and the function value increases by at least n log 2 . Thispath is a non-decreasing path, thus the claim is also proved. J.3 Proof of Theorem 2
Obviously, g(Y) ≜ φ_R(Y, X) = (1/n) sup_{f ∈ C(R^d)} Σ_{i=1}^n h(f(x_i) − f(y_i)) ≥ h(0) (by picking f = 0).

Step 1: achieving the optimal g(Y). We prove that if {y_1, ..., y_n} = {x_1, ..., x_n}, then g(Y) = h(0).

Claim J.2.
Assume h is concave. Then the function ξ_R(m) ≜ sup_{(t_1,...,t_m) ∈ ZO(m)} Σ_{i=1}^m h(t_i) satisfies ξ_R(m) = m h(0), where ZO(m) ≜ {(t_1, t_2, ..., t_m) ∈ R^m : Σ_{i=1}^m t_i = 0}.

The proof of this claim is immediate from concavity (by Jensen's inequality, Σ_i h(t_i) ≤ m · h((1/m) Σ_i t_i) = m h(0), with equality at t_1 = ··· = t_m = 0), so we skip it. When {y_1, ..., y_n} = {x_1, ..., x_n}, we can divide [n] into multiple cycles C_1 ∪ ··· ∪ C_K, each of length m_k, and obtain φ_R(Y, X) = (1/n) sup_{f ∈ C(R^d)} Σ_{k=1}^K Σ_{i ∈ C_k} h(f(x_i) − f(y_i)) = (1/n) Σ_{k=1}^K ξ_R(m_k) = (1/n) Σ_{k=1}^K m_k h(0) = h(0).

Step 2: computing g(Y) when y_i ∈ {x_1, ..., x_n} for all i. Assume y_i ∈ {x_1, ..., x_n} for all i. We build a directed graph G = (V, A) as follows (the same graph as in Appendix J.2). The set of vertices V = {1, 2, ..., n} represents x_1, x_2, ..., x_n. We draw a directed edge (i, j) ∈ A if y_i = x_j. Note that it is possible to have a self-loop (i, i), which corresponds to the case y_i = x_i.

According to Lemma 1, this graph can be decomposed into cycles C_1, C_2, ..., C_K and subtrees T_1, T_2, ..., T_M. We claim that

φ_R(Y, X) = (1/n) Σ_{k=1}^K |V(C_k)| h(0) ≥ h(0).    (31)

The proof of the relation in Eq. (31) is similar to the proof of Eq. (22) used in the proof of Theorem 2, and is briefly explained below. One major part of the proof is to show that the contribution of the nodes in the cycles is (1/n) Σ_{k=1}^K |V(C_k)| h(0); this is similar to Step 1 and is based on Claim J.2. Another major part is to show that the contribution of the nodes in the subtrees is zero, similar to the proof of Eq. (28). This is because we can utilize Assumption 4.4 to construct a sequence of f values (similar to Eq. (26)) so that

f(y_i) − f(x_i) = 0 for i ∈ ∪_{k=1}^K V(C_k),   and   f(y_i) − f(x_i) = α_N for i ∈ ∪_{m=1}^M V(T_m).    (32)

Here {α_N}_{N=1}^∞ is a sequence of real numbers such that lim_{N→∞} h(α_N) = sup_t h(t) = 0. In the case that h(∞) = 0, as for RS-GAN, we pick α_N = N. In the case that h(a) = 0 for a certain finite number a, we can simply pick α_N = a for all N (thus we do not need a sequence but just one choice).

Since the expression of φ_R(Y, X) in Eq. (31) is a scaled version of the expression of φ_RS(Y, X) (scaled by −log 2 / h(0)), the rest of the proof is the same as the proof of Theorem 2.

Step 3: function value for general Y, and GMR. This step is the same as in the proof of Theorem J.1. For the value at a general Y, we build an "augmented graph" and apply the result of Step 2 to obtain g(Y). To prove GMR, the same construction as in the proof of Theorem J.1 suffices.
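As a sanity check of Eq. (31) in the discrete setting, the following sketch (ours, not from the paper's code) evaluates the predicted value (1/n) Σ_k |V(C_k)| h(0) for Y supported on {x_1, ..., x_n}, assuming h(0) = −log 2 as for RS-GAN; the helper name `phi_R_discrete` is illustrative.

```python
import math

def phi_R_discrete(succ, h0=-math.log(2)):
    """Eq. (31): (#cycle vertices / n) * h(0), for Y given by y_i = x_{succ[i]}."""
    n = len(succ)
    image = set(range(n))
    for _ in range(n):                      # after n steps every vertex sits on a cycle,
        image = {succ[i] for i in image}    # so the n-step image is the set of cycle vertices
    return len(image) / n * h0

print(phi_R_discrete([1, 2, 0, 3]))   # -0.693... = h(0): Y is a permutation of X (global min)
print(phi_R_discrete([0, 0, 0, 0]))   # -0.173... = h(0)/4: Y collapses onto a single point
```

Consistently with the lower bound in Eq. (31), the mode-collapsed assignment attains a strictly larger value than the permutation.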
K Results in Parameter Space

We first state the technical assumptions and then present the formal results in parameter space. The results become somewhat technical due to the complications of neural nets. Suppose the discriminator neural net is f_θ with θ ∈ R^J and the generator net is G_w with w ∈ R^K.

Assumption K.1. (representation power of the discriminator net) For any distinct vectors v_1, ..., v_{2n} ∈ R^d and any b_1, ..., b_{2n} ∈ R, there exists θ ∈ R^J such that f_θ(v_i) = b_i, i = 1, ..., 2n.

Assumption K.2. (representation power of the generator net in W) For any distinct z_1, ..., z_n ∈ R^{d_z} and any y_1, ..., y_n ∈ R^d, there exists w ∈ W such that G_w(z_i) = y_i, i = 1, ..., n.

For any given Z = (z_1, ..., z_n) ∈ R^{d_z × n} and any W ⊆ R^K, we define a set G^{-1}(Y; Z) as follows: w ∈ G^{-1}(Y; Z) iff G_w(Z) = Y and w ∈ W.

Assumption K.3. (path-keeping property of the generator net; duplication of Assumption 4.6) For any distinct z_1, ..., z_n ∈ R^{d_z}, the following holds: for any continuous path Y(t), t ∈ [0, 1], in the space R^{d × n} and any w ∈ G^{-1}(Y(0); Z), there is a continuous path w(t), t ∈ [0, 1], such that w(0) = w and Y(t) = G_{w(t)}(Z), t ∈ [0, 1].

We will present sufficient conditions for these assumptions later. Next we present two main results on the landscape of GANs in the parameter space.
Proposition K.1. (formal version of Proposition 1) Consider the separable-GAN problem min_{w ∈ R^K} ϕ_sep(w), where ϕ_sep(w) = sup_θ (1/(2n)) Σ_{i=1}^n [h_1(f_θ(x_i)) + h_2(−f_θ(G_w(z_i)))]. Suppose h_1, h_2 satisfy the same assumptions as in Theorem 1. Suppose G_w satisfies Assumption K.2 and Assumption 4.6 (with a certain W). Suppose f_θ satisfies Assumption K.1. Then there exist at least (n^n − n!) distinct w ∈ W that are not global-min-reachable.

Proposition K.2. (formal version of Prop. 2) Consider the RpGAN problem min_{w ∈ R^K} ϕ_R(w), where ϕ_R(w) = sup_θ (1/n) Σ_{i=1}^n h(f_θ(x_i) − f_θ(G_w(z_i))). Suppose h satisfies the same assumptions as in Theorem 2. Suppose G_w satisfies Assumption K.2 and Assumption 4.6 (with a certain W). Suppose f_θ satisfies Assumption K.1. Then any w ∈ W is global-min-reachable for ϕ_R(w).

We have presented two generic results that rely on a few properties of the neural nets. These properties are satisfied by certain neural nets, as discussed next. Our results largely rely on recent advances in neural-net optimization theory.
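As a quick check of the count appearing in Proposition K.1 (a sketch of ours, not part of the paper's code): the mode-collapsed discrete patterns are exactly the assignments of {x_1, ..., x_n} to (y_1, ..., y_n) that are not permutations, and there are n^n − n! of them.

```python
# Enumerate all Y with entries in {x_1, ..., x_n} and count the non-permutations.
from itertools import product
from math import factorial

n = 4
patterns = product(range(n), repeat=n)                     # every assignment y_i = x_{p_i}
collapsed = sum(1 for p in patterns if len(set(p)) < n)    # some x_j is repeated
print(collapsed, n**n - factorial(n))                      # 232 232
```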
K.1 Sufficient Conditions for the Assumptions
In this part, we present a set of conditions on neural nets that ensure that the assumptions hold. We will discuss more conditions in the next subsection.
Assumption K.4. (mildly wide) The last hidden layer has at least n̄ neurons, where n̄ is the number of input vectors: for the generator network we set n̄ = n, and for the discriminator network we set n̄ = 2n.

Assumption K.5. (smooth enough activation) The activation function σ is analytic, and its derivatives σ^{(k)}(0) are non-zero for k = 0, 1, 2, ..., n̄, where n̄ is the number of input vectors. This assumption on the activation is satisfied by sigmoid, tanh, SoftPlus, swish, etc.

For the generator network, consider a fully connected neural network G_w(z) = W_H σ(W_{H−1} ... W_2 σ(W_1 z)) that maps z ∈ R^{d_z} to G_w(z) ∈ R^d. Define T_k(z) = σ(W_{k−1} ... W_2 σ(W_1 z)) ∈ R^{d_k}, where d_k is the number of neurons in the k-th hidden layer. Then we can write G_w(z) = W_H T_H(z), where W_H ∈ R^{d × d_H}. Let Z = (z_1, ..., z_n) and T_k(Z) = (T_k(z_1), ..., T_k(z_n)) ∈ R^{d_k × n}, k = 1, 2, ..., H. Define W = {w = (W_1, ..., W_H) : T_H(Z) is full rank}.

We will prove that under these two assumptions on the neural nets, the landscape of RpGAN is better than that of SepGAN.

Proposition K.3.
Suppose h_1, h_2 and h satisfy the assumptions of Theorem 1 and Theorem 2. Suppose G_w and f_θ satisfy Assumptions K.5 and K.4 (with n̄ = n for G_w and n̄ = 2n for f_θ). Then there exist at least (n^n − n!) distinct w ∈ W that are not GMR for ϕ_sep(w). In contrast, any w ∈ W is global-min-reachable for ϕ_R(w).

This proposition is a corollary of Prop. K.1 and Prop. K.2; we only need to verify the assumptions of those two propositions. The following series of claims provides this verification.
Claim K.1.
Suppose Assumptions K.4 and K.5 hold for the generator net G_w with distinct inputs z_1, ..., z_n. Then W = {(W_1, ..., W_H) : T_H(Z) is full rank} is a dense set in R^K. In addition, Assumption K.2 holds.

This full-rank condition was used in a few works on neural-net landscape analysis (e.g., [72]). In the GAN area, [7] studied invertible generator nets G_w where the weights are restricted to a subset of R^K to avoid singularities. As the set W is dense, intuitively the iterates will stay in this set most of the time. However, rigorously proving that the iterates stay in this set is not easy, and it is one of the major challenges of current neural-network analysis. For instance, [38] shows that for very wide neural networks with proper initialization, the neural tangent kernel (a matrix related to T_H(Z)) stays full rank along the training trajectory of gradient descent. A similar analysis can prove that the matrix T_H(Z) stays full rank during training under similar conditions. We do not attempt to develop the more complicated convergence analysis for general neural nets here and leave it to future work.

Claim K.2.
Suppose Assumptions K.4 and K.5 hold for the generator net G_w with distinct inputs z_1, ..., z_n. Then it satisfies Assumption 4.6 with W defined in Claim K.1.

Assumption K.1 can be shown to hold under a condition similar to that of Claim K.1.
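As a numerical illustration of Claim K.1 (a sketch of ours, not part of the paper's code): for a one-hidden-layer tanh generator whose hidden layer has at least n neurons, randomly drawn (hence generic) weights give a full-rank feature matrix T_H(Z), and once T_H(Z) is full rank the last layer can match arbitrary targets, which is Assumption K.2. All variable names are illustrative.

```python
# Minimal sketch (assumptions: tanh activation, hidden width >= n, Gaussian weights).
import numpy as np

rng = np.random.default_rng(0)
n, d_z, d_hidden, d_out = 5, 3, 8, 2          # hidden width >= n  (Assumption K.4)

Z = rng.normal(size=(d_z, n))                 # n distinct latent codes z_1, ..., z_n
W1 = rng.normal(size=(d_hidden, d_z))         # generic first-layer weights
T = np.tanh(W1 @ Z)                           # feature matrix T_H(Z)
print(np.linalg.matrix_rank(T) == n)          # True for generic W1, i.e., w lies in W

Y = rng.normal(size=(d_out, n))               # arbitrary targets y_1, ..., y_n
W2 = Y @ np.linalg.pinv(T)                    # solve W2 T_H(Z) = Y for the last layer
print(np.allclose(W2 @ T, Y))                 # True: Assumption K.2 holds for this w
```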
Claim K.3.
Consider a fully connected neural network f_θ(u) = θ_H σ(θ_{H−1} ... θ_2 σ(θ_1 u)) that maps u ∈ R^d to f_θ(u) ∈ R, and suppose Assumptions K.4 and K.5 hold. Then Assumption K.1 holds.

The proofs of these claims are given in Appendix K.5. With these claims, we can immediately prove Prop. K.3.
Proof of Prop. K.3: According to Claims K.1, K.2 and K.3, the assumptions of Prop. K.3 imply the assumptions of Prop. K.1 and Prop. K.2. Therefore, the conclusions of Prop. K.1 and Prop. K.2 hold. Since the conclusion of Prop. K.3 is the combination of the conclusions of Prop. K.1 and Prop. K.2, it also holds. □
K.2 Other Sufficient Conditions
Assumption K.3 (the path-keeping property) is the key assumption. Various results in neural-net theory can ensure that this assumption (or a variant of it) holds, and we utilized one of the simplest such results in the previous subsection. We recommend [80], which describes the bigger picture of various landscape results. In this subsection, we briefly discuss other possible results applicable to GANs.

We start with a strong conjecture about the neural-net landscape, which only requires a wide final hidden layer and imposes no condition on the depth or the activation.

Conjecture K.1.
Suppose g_θ is a fully connected neural net with any depth and any continuous activation, and it satisfies Assumption K.4 (i.e., a mildly wide final hidden layer). Assume ℓ(y, ŷ) is convex in ŷ. Then, for the empirical loss of a supervised learning problem, Σ_{i=1}^n ℓ(y_i, g_θ(x_i)), any point is global-min-reachable.

We now describe a related conjecture for GANs, which is easy to prove if Conjecture K.1 holds.
Conjecture 1 (informal): Suppose G_w is a fully connected net satisfying Assumption K.4 (i.e., a mildly wide final hidden layer). Suppose G_w and f_θ are expressive enough (i.e., Assumptions K.2 and K.1 hold). Then the RpGAN loss has a benign landscape, in the sense that any point is GMR for ϕ_R(w). In contrast, the SepGAN loss does not have this property.

Unfortunately, we are not aware of any existing work that proves Conjecture K.1, thus we are not able to prove Conjecture 1 for now. Venturi et al. [84] proved a special case of Conjecture K.1 for L = 1 (one hidden layer), and other works such as Li et al. [50] prove a weaker version of Conjecture K.1; see [80] for other related results. The precise version of Conjecture K.1 seems non-trivial to prove.

We list two results on GANs that can be derived from weaker versions of Conjecture K.1; both results apply to the whole space instead of the dense subset W.

Result 1 (1-hidden-layer): Suppose G_w is a 1-hidden-layer network with any continuous activation. Suppose it satisfies Assumption K.4 (i.e., a mildly wide final hidden layer). Suppose G_w and f_θ are expressive enough (i.e., Assumptions K.2 and K.1 hold). Then the RpGAN loss satisfies GMR at any point. This result is based on Venturi et al. [84].

Result 2: Suppose G_w is a fully connected network with any continuous activation and any number of layers. Suppose it satisfies Assumption K.4 (i.e., a mildly wide final hidden layer). Suppose G_w and f_θ are expressive enough (i.e., Assumptions K.2 and K.1 hold). Then the RpGAN loss has no sub-optimal set-wise local minima (see [50, Def. 1] for the definition). This result is based on Li et al. [50].

Due to space constraints, we do not present the proofs of the above two results (combining them with GANs is somewhat cumbersome). The high-level proof framework is similar to that of Prop. K.3.

K.3 Proofs of Propositions for Parameter Space

Proof of Proposition K.1.
The basic idea is to build a relation between points in the parameter space and points in the function space.

Denote L_sep(w; θ) = (1/(2n)) Σ_{i=1}^n [h_1(f_θ(x_i)) + h_2(−f_θ(G_w(z_i)))]; then ϕ_sep(w) = sup_θ L_sep(w; θ). Denote L_sep(Y; f) = (1/(2n)) Σ_{i=1}^n [h_1(f(x_i)) + h_2(−f(y_i))], and φ_sep(Y, X) = sup_f L_sep(Y; f). Note that in the definitions of these two functions, the discriminator is hidden inside the sup operators, so we are free to pick the discriminator values (unlike the generator space, where we have to check every w in the pre-image of Y).

Our goal is to analyze the landscape of ϕ_sep(w) based on the previously proved result on the landscape of φ_sep(Y, X). We first show that the value of ϕ_sep(ŵ) equals that of φ_sep(Ŷ, X) whenever ŵ is a pre-image of Ŷ. Define G^{-1}(Y) ≜ {w : G_w(z_i) = y_i, i = 1, ..., n}. We first prove that

φ_sep(Ŷ, X) = ϕ_sep(ŵ), ∀ ŵ ∈ G^{-1}(Ŷ).    (33)

Suppose φ_sep(Ŷ, X) = α. This implies that L_sep(Ŷ; f) ≤ α for any f; in addition, for any ε > 0 there exists f̂ ∈ C(R^d) such that

L_sep(Ŷ; f̂) ≥ α − ε.    (34)

According to Assumption K.1, there exists θ* (depending on ε) such that f_{θ*}(x_i) = f̂(x_i) for all i, and f_{θ*}(u) = f̂(u) for all u ∈ {ŷ_1, ..., ŷ_n} \ {x_1, ..., x_n}. In other words, there exists θ* such that

f_{θ*}(x_i) = f̂(x_i), f_{θ*}(ŷ_i) = f̂(ŷ_i), ∀i.    (35)

Then we have

L_sep(ŵ; θ*) = (1/(2n)) Σ_{i=1}^n [h_1(f_{θ*}(x_i)) + h_2(−f_{θ*}(G_ŵ(z_i)))]
 (i)= (1/(2n)) Σ_{i=1}^n [h_1(f_{θ*}(x_i)) + h_2(−f_{θ*}(ŷ_i))]
 (ii)= (1/(2n)) Σ_{i=1}^n [h_1(f̂(x_i)) + h_2(−f̂(ŷ_i))] = L_sep(Ŷ; f̂)
 (iii)≥ α − ε.

In the above chain, (i) is due to the assumption ŵ ∈ G^{-1}(Ŷ) (which implies G_ŵ(z_i) = ŷ_i), (ii) is due to the choice of θ*, and (iii) is due to (34). Therefore,

ϕ_sep(ŵ) = sup_θ L_sep(ŵ; θ) ≥ L_sep(ŵ; θ*) ≥ α − ε.

Since this holds for any ε, we have ϕ_sep(ŵ) ≥ α. Similarly, since L_sep(ŵ; θ) = L_sep(Ŷ; f_θ) ≤ α for any θ, we obtain ϕ_sep(ŵ) ≤ α. Therefore ϕ_sep(ŵ) = α = φ_sep(Ŷ, X). This finishes the proof of (33).

Define

Q(X) ≜ {Y = (y_1, ..., y_n) | y_i ∈ {x_1, ..., x_n}, i ∈ {1, 2, ..., n}; y_i = y_j for some i ≠ j}.

Any Y ∈ Q(X) is a mode-collapsed pattern. According to Theorem 1, any Y ∈ Q(X) is a strict local minimum of φ_sep(Y, X), and thus Y is not GMR. Therefore any ŵ ∈ G^{-1}(Y) with Y ∈ Q(X) is not GMR; this is because a non-increasing path in the parameter space would be mapped to a non-increasing path in the function space, causing a contradiction. Finally, according to Assumption K.2, for any Y there exists at least one pre-image w ∈ G^{-1}(Y) ∩ W. There are (n^n − n!) elements in Q(X), thus there are at least (n^n − n!) points in W that are not global-min-reachable. □

Proof of Proposition K.2.
Similar to Eq. (33), we have ϕ_R(w) = φ_R(Y, X) for any w ∈ G^{-1}(Y). We need to prove that there is a non-increasing path from any w ∈ W to some w*, where w* is a certain global minimum. Let Y = G_w(z_1, ..., z_n). According to Thm. 2, there is a continuous path Y(t) from Y to Y* along which the loss value φ_R(Y(t), X) is non-increasing. According to Assumption 4.6, there is a continuous path w(t) such that w(0) = w and Y(t) = G_{w(t)}(Z), t ∈ [0, 1]. Along this path, the value ϕ_R(w(t)) = φ_R(Y(t), X) is non-increasing, and at the end the function value ϕ_R(w(1)) = φ_R(Y*, X) is the minimal value of ϕ_R(w). Thus the existence of such a path is proved. □

K.4 A technical lemma
We present a technical lemma that slightly generalizes [50, Proposition 1].
Assumption K.6. v_1, v_2, ..., v_m ∈ R^d are distinct, i.e., v_i ≠ v_j for any i ≠ j.

Lemma 2.
Define T_H(V) = (σ(W_{H−1} ... W_2 σ(W_1 v_i)))_{i=1}^m ∈ R^{d_H × m}. Suppose Assumptions K.4, K.5 and K.6 hold. Then the set Ω = {(W_1, ..., W_{H−1}) : rank(T_H(V)) < m} has zero measure.

This lemma is slightly different from [50, Proposition 1], which requires the input vectors to have one distinct dimension (i.e., there exists j such that v_{1,j}, ..., v_{m,j} are distinct); here we only require the input vectors to be distinct. It is not hard to link "distinct vectors" to "vectors with one distinct dimension" by a variable transformation.

Claim K.4.
Suppose v_1, ..., v_m ∈ R^d are distinct. Then for a generic matrix W ∈ R^{d × d} and the vectors v̄_i = W v_i ∈ R^d, i = 1, ..., m, there exists j such that v̄_{1,j}, ..., v̄_{m,j} are distinct.

Proof. Define the set Ω_0 = {u ∈ R^d : ∃ i ≠ j s.t. u^T v_i = u^T v_j}. This is the union of the m(m−1)/2 hyperplanes Ω_{ij} ≜ {u ∈ R^d : u^T v_i = u^T v_j}. Each hyperplane Ω_{ij} has zero measure, thus their union Ω_0 also has zero measure. Let u be the first row of W; then u is a generic vector and thus not in Ω_0, which implies that v̄_{1,1}, ..., v̄_{m,1} are distinct.

Proof of Lemma 2:
Pick a generic matrix A ∈ R^{d × d}; then by Claim K.4, the vectors v̄_i = A v_i ∈ R^d have one distinct dimension, i.e., there exists j such that v̄_{1,j}, ..., v̄_{m,j} are distinct. In addition, we may assume A is full rank (since it is generic). Define

T̄_H(V̄) = (σ(W_{H−1} ... W_2 σ(W̄_1 v̄_1)), ..., σ(W_{H−1} ... W_2 σ(W̄_1 v̄_m))) ∈ R^{d_H × m}.

According to [50, Prop. 1], the set Ω̄ = {(W̄_1, W_2, W_3, ..., W_{H−1}) : rank(T̄_H(V̄)) < m} has zero measure. Under the transformation W_1 = W̄_1 A we have W_1 v_i = W̄_1 v̄_i, and hence σ(W_{H−1} ... W_2 σ(W̄_1 v̄_i)) = σ(W_{H−1} ... W_2 σ(W_1 v_i)) for all i, i.e., T̄_H(V̄) = T_H(V). Define η(W̄_1, W_2, ..., W_{H−1}) = (W̄_1 A, W_2, ..., W_{H−1}); then η is a homeomorphism between Ω̄ and Ω. Therefore the set Ω = {(W_1, ..., W_{H−1}) : rank(T_H(V)) < m} has zero measure. □

K.5 Proofs of Claims

Proof of Claim K.1:
According to Lemma 2, W is a dense subset of R^K (indeed, Ω is defined for a general neural network, and W is defined for the generator network, so the complement of W is built from an instance of Ω). As a result, there exist (W_1, ..., W_{H−1}) such that T_H(Z) has rank n. Thus, for any y_1, y_2, ..., y_n ∈ R^d, there exists W_H such that W_H T_H(Z) = (y_1, ..., y_n). □

Proof of Claim K.2:
For any continuous path Y(t), t ∈ [0, 1], in the space R^{d × n} and any w ∈ G^{-1}(Y(0)), our goal is to show that there exists a continuous path w(t), t ∈ [0, 1], such that w(0) = w and Y(t) = G_{w(t)}(Z), t ∈ [0, 1].

Due to the assumption w ∈ W, we know that w corresponds to a rank-n post-activation matrix T_H(Z). Suppose w = (W_1, ..., W_H) and T_H(Z) = (T_H(z_1), ..., T_H(z_n)) ∈ R^{d_H × n} has rank n. Since T_H(Z) is full rank, for any path from Y(0) to Y(1) we can continuously change W_H such that the output G_w(Z) changes from Y(0) to Y(1). Thus there exists a continuous path w(t), t ∈ [0, 1], such that w(0) = w and Y(t) = G_{w(t)}(Z), t ∈ [0, 1]. □
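The path-keeping argument above can be made concrete with a small numerical sketch (ours, not the paper's code): when T_H(Z) has full column rank, the last layer can be moved continuously so that the generator output tracks any prescribed path Y(t). The names below are illustrative, and the sketch assumes a one-hidden-layer tanh generator with width at least n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_z, d_hidden, d_out = 4, 2, 6, 2
Z = rng.normal(size=(d_z, n))
W1 = rng.normal(size=(d_hidden, d_z))
T = np.tanh(W1 @ Z)                           # fixed hidden features T_H(Z), rank n
T_pinv = np.linalg.pinv(T)                    # right inverse of T, since rank(T) = n

Y0 = rng.normal(size=(d_out, n))              # Y(0): start of the path in function space
Y1 = rng.normal(size=(d_out, n))              # Y(1): end of the path
WH0 = Y0 @ T_pinv                             # one last layer realizing Y(0)

def WH(t):
    """Last-layer weights along the lifted path, so that W_H(t) T_H(Z) = Y(t)."""
    Yt = (1 - t) * Y0 + t * Y1
    return WH0 + (Yt - Y0) @ T_pinv

for t in np.linspace(0.0, 1.0, 5):
    Yt = (1 - t) * Y0 + t * Y1
    assert np.allclose(WH(t) @ T, Yt)         # G_{w(t)}(Z) tracks Y(t) exactly
print("function-space path lifted to a parameter-space path")
```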
Proof of Claim K.3: This is a direct application of Lemma 2; different from Claim K.2, here we apply Lemma 2 to the discriminator network. □
L Discussion of Wasserstein GAN
W-GAN is a popular GAN formulation, so a natural question is whether we can prove a similar landscape result for W-GAN. Consider the W-GAN formulation (empirical version)

min_Y φ_W(Y, X), where φ_W(Y, X) = max_{‖f‖_L ≤ 1} (1/n) Σ_{i=1}^n [f(x_i) − f(y_i)].

For simplicity we consider the same number of generated samples and true samples. W-GAN can be viewed as a special case of RpGAN with h(t) = −t; it can also be viewed as a special case of SepGAN with h_1(t) = h_2(t) = −t. However, the major complication is the Lipschitz constraint, which makes computing the function values much harder. For the case n = 2, the function value of φ_W(Y, X) is given in the following claim.
Suppose n = 2. Denote a_1 = x_1, a_2 = x_2, a_3 = y_1, a_4 = y_2. The value of φ_W(Y, X) is

(1/2) max_{u_1, u_2, u_3, u_4 ∈ R} u_1 + u_2 − u_3 − u_4, s.t. |u_i − u_j| ≤ ‖a_i − a_j‖, ∀ i, j ∈ {1, 2, 3, 4}.

This claim is not hard to prove, and we skip the proof here. It indicates that computing φ_W(Y, X) is equivalent to solving a linear program (LP); a small numerical check is sketched after the architecture tables below. Solving an LP is computationally feasible, but our landscape analysis requires reasoning about the global landscape of φ_W(Y, X) as a function of Y. In classical optimization, one can show that the optimal value of an LP is a convex function of certain parameters (e.g., the coefficients of the objective). But in our LP the y_i's appear in several places at once, and we are not aware of an existing result that can be readily applied.

Similar to the Kantorovich-Rubinstein duality, we can write down the dual of this LP, whose objective is a linear combination of the distances ‖a_i − a_j‖. However, it is still not clear what can be said about the global landscape, due to the lack of closed-form solutions.

Finally, we remark that although W-GAN has strong theoretical appeal, it has not replaced JS-GAN or simple variants of JS-GAN in recent GAN models. For instance, SN-GAN [67] and BigGAN [18] use hinge-GAN.

Table 7: CNN models for CIFAR-10 and STL-10 used in our experiments on image generation. h = w = 4 and H = W = 32 for CIFAR-10; h = w = 6 and H = W = 48 for STL-10. c = 1, 2 and 4 for the regular, 1/2 and 1/4 channel structures, respectively. All layers of D use LReLU-0.1 (except the final dense "linear" layer). (a) Generator: z ∼ N(0, I); dense layer with output reshaped to h × w feature maps; stride-2 deconvolutions with 256/c, 128/c and 64/c channels, each with BN and ReLU; stride-1 convolution with 3 channels and Tanh. (b) Discriminator: image x of size H × W; stride-1 conv 64/c; stride-2 conv 128/c; stride-1 conv 128/c; stride-2 conv 256/c; stride-1 conv 256/c; stride-2 conv 512/c; stride-1 conv 512/c; dense layer to a scalar.

Table 8: CNN model architecture for size-256 LSUN used in our experiments on high-resolution image generation. All layers of G use ReLU (except the final Tanh layer); all layers of D use LReLU-0.1. (a) Generator: z ∼ N(0, I); dense layer and reshape; stride-1 deconv with BN, 1024 channels; stride-2 deconvs with BN and 512, 256, 128, 64, 32 channels; stride-2 deconv with 3 channels and Tanh. (b) Discriminator: stride-2 convs with 32, 64, 128, 256, 512 and 1024 channels; dense layer.

Table 9: ResNet architecture for CIFAR-10. c = 1, 2 and 4 for the regular, 1/2 and 1/4 channel structures, respectively. (a) Generator: z ∼ N(0, I); dense layer; three ResBlock up 256/c; BN, ReLU, conv with 3 channels, Tanh. (b) Discriminator: four ResBlock down 128/c; LReLU-0.1; global sum pooling; dense layer.

Table 10: ResNet architecture for STL-10. c = 1, 2 and 4 for the regular, 1/2 and 1/4 channel structures, respectively. (a) Generator: z ∼ N(0, I); dense layer; ResBlock up 256/c, 128/c, 64/c; BN, ReLU, conv with 3 channels, Tanh. (b) Discriminator: ResBlock down 64/c, 128/c, 256/c, 512/c, 1024/c; LReLU-0.1; global sum pooling; dense layer.
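Referring back to Claim L.1: the following sketch (ours, not from the paper's repository) solves the n = 2 LP numerically with scipy; the helper name `phi_W_n2` and the sample points are illustrative.

```python
# Numerical check of Claim L.1 (assuming the 1/2 scaling of phi_W written above).
import numpy as np
from scipy.optimize import linprog

def phi_W_n2(x1, x2, y1, y2):
    """Solve  max  u1 + u2 - u3 - u4  s.t.  |u_i - u_j| <= ||a_i - a_j||,  then scale by 1/2."""
    a = [np.asarray(v, dtype=float) for v in (x1, x2, y1, y2)]
    c = np.array([-1.0, -1.0, 1.0, 1.0])           # linprog minimizes, so negate the objective
    A_ub, b_ub = [], []
    for i in range(4):
        for j in range(i + 1, 4):
            d = np.linalg.norm(a[i] - a[j])
            row = np.zeros(4); row[i], row[j] = 1.0, -1.0
            A_ub.append(row);  b_ub.append(d)       #  u_i - u_j <= ||a_i - a_j||
            A_ub.append(-row); b_ub.append(d)       #  u_j - u_i <= ||a_i - a_j||
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * 4, method="highs")
    return -res.fun / 2.0                           # the 1/n = 1/2 factor

print(phi_W_n2([0., 0.], [1., 0.], [1., 0.], [0., 0.]))   # ~0.0: {y_1, y_2} = {x_1, x_2}
print(phi_W_n2([0., 0.], [1., 0.], [0., 0.], [0., 0.]))   # ~0.5: Y collapses onto x_1
```

The second value (0.5) matches the empirical Wasserstein distance between the two-point data set and its mode-collapsed counterpart, as expected.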
Table 11: Bottleneck ResNet models for CIFAR-10. BRes refers to a bottleneck ResBlock; BRes(a, b, c) denotes a bottleneck ResBlock with input, hidden and output widths (a, b, c). (a) Generator: z ∼ N(0, I); dense layer; three BRes up (128, 64, 128); BN, ReLU, conv with 3 channels, Tanh. (b) Discriminator: four BRes down (64, 32, 64); LReLU-0.1; global sum pooling; dense layer.

Table 12: Bottleneck ResNet models for STL-10. (a) Generator: z ∼ N(0, I); dense layer; BRes up (256, 64, 128), (128, 32, 64), (64, 16, 32); BN, ReLU, conv with 3 channels, Tanh. (b) Discriminator: BRes down (3, 16, 32), (32, 16, 64), (64, 32, 128), (128, 64, 256), (256, 128, 512); LReLU-0.1; global sum pooling; dense layer.
Table 13: Generator learning rate for RS-GAN in each setting, and hyper-parameters used for WGAN-GP.

RS-GAN generator learning rate (CIFAR-10 / STL-10):
  CNN, no normalization: 2e-4 / 5e-4
  CNN, regular + SN: 5e-4 / 5e-4
  CNN, channel/2 + SN: 5e-4 / 5e-4
  CNN, channel/4 + SN: 2e-4 / 5e-4
  ResNet, regular + SN: 1.5e-3 / 1e-3
  ResNet, channel/2 + SN: 1.5e-3 / 1e-3
  ResNet, channel/4 + SN: 1e-3 / 5e-4
  BottleNeck ResNet: 1e-3 / 1e-3

WGAN-GP hyper-parameters: generator learning rate 1e-4; discriminator learning rate 1e-4; plus the Adam parameters β_1, β_2 and the gradient-penalty coefficient λ.
Figure 16: Generated CIFAR-10 samples with CNN.
Figure 17: Generated CIFAR-10 samples with ResNet.
Figure 18: Generated STL-10 samples with CNN.
Figure 19: Generated STL-10 samples with ResNet.