Some Theoretical Insights into Wasserstein GANs
Gérard Biau [email protected]
Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, 4 place Jussieu, 75005 Paris, France

Maxime Sangnier [email protected]
Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, 4 place Jussieu, 75005 Paris, France

Ugo Tanielian [email protected]
Laboratoire de Probabilités, Statistique et Modélisation & Criteo AI Lab, 32 rue Blanche, 75009 Paris, France
Abstract
Generative Adversarial Networks (GANs) have been successful in producing outstanding results in areas as diverse as image, video, and text generation. Building on these successes, a large number of empirical studies have validated the benefits of the cousin approach called Wasserstein GANs (WGANs), which brings stabilization in the training process. In the present paper, we add a new stone to the edifice by proposing some theoretical advances in the properties of WGANs. First, we properly define the architecture of WGANs in the context of integral probability metrics parameterized by neural networks and highlight some of their basic mathematical features. We stress in particular interesting optimization properties arising from the use of a parametric 1-Lipschitz discriminator. Then, in a statistically-driven approach, we study the convergence of empirical WGANs as the sample size tends to infinity, and clarify the adversarial effects of the generator and the discriminator by underlining some trade-off properties. These features are finally illustrated with experiments using both synthetic and real-world datasets.
Keywords: Generative Adversarial Networks, Wasserstein distances, deep learning theory, Lipschitz functions, trade-off properties
1. Introduction
Generative Adversarial Networks (GANs) is a generative framework proposed by Goodfellow et al. (2014), in which two models (a generator and a discriminator) act as adversaries in a zero-sum game. Leveraging the recent advances in deep learning, and specifically convolutional neural networks (LeCun et al., 1998), a large number of empirical studies have shown the impressive possibilities of GANs in the field of image generation (Radford et al., 2015; Ledig et al., 2017; Karras et al., 2018; Brock et al., 2019). Lately, Karras et al. (2019) proposed an architecture able to generate hyper-realistic fake human faces that cannot be differentiated from real ones (see the website thispersondoesnotexist.com). The recent surge of interest in the domain also led to breakthroughs in video (Acharya et al., 2018), music (Mogren, 2016), and text generation (Yu et al., 2017; Fedus et al., 2018), among many other potential applications.

The aim of GANs is to generate data that look "similar" to samples collected from some unknown probability measure $\mu^\star$, defined on a Borel subset $E$ of $\mathbb{R}^D$. In the targeted applications of GANs, $E$ is typically a submanifold (possibly hard to describe) of a high-dimensional $\mathbb{R}^D$, which therefore prohibits the use of classical density estimation techniques. GANs approach the problem by making two models compete: the generator, which tries to imitate $\mu^\star$ using the collected data, vs. the discriminator, which learns to distinguish the outputs of the generator from the samples, thereby forcing the generator to improve its strategy.

Formally, the generator has the form of a parameterized class of Borel functions from $\mathbb{R}^d$ to $E$, say $\mathscr{G} = \{G_\theta : \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^P$ is the set of parameters describing the model. Each function $G_\theta$ takes as input a $d$-dimensional random variable $Z$—it is typically uniform or Gaussian, with $d$ usually small—and outputs the "fake" observation $G_\theta(Z)$ with distribution $\mu_\theta$. Thus, the collection of probability measures $\mathscr{P} = \{\mu_\theta : \theta \in \Theta\}$ is the natural class of distributions associated with the generator, and the objective of GANs is to find inside this class the distribution that generates the most realistic samples, closest to the ones collected from the unknown $\mu^\star$. On the other hand, the discriminator is described by a family of Borel functions from $E$ to $[0,1]$, say $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$, $\Lambda \subseteq \mathbb{R}^Q$, where each $D_\alpha(x)$ must be thought of as the probability that the observation $x$ comes from $\mu^\star$ (the higher $D_\alpha(x)$, the higher the probability that $x$ is drawn from $\mu^\star$).

In the original formulation of Goodfellow et al. (2014), GANs make $\mathscr{G}$ and $\mathscr{D}$ fight each other through the following objective:
\[
\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \big[ \mathbb{E}\log(D_\alpha(X)) + \mathbb{E}\log(1 - D_\alpha(G_\theta(Z))) \big], \tag{1}
\]
where $X$ is a random variable with distribution $\mu^\star$ and the symbol $\mathbb{E}$ denotes expectation. Since one does not have access to the true distribution, $\mu^\star$ is replaced in practice with the empirical measure $\mu_n$ based on independent and identically distributed (i.i.d.) samples $X_1, \ldots, X_n$ distributed as $X$, and the practical objective becomes
\[
\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^n \log(D_\alpha(X_i)) + \mathbb{E}\log(1 - D_\alpha(G_\theta(Z))) \Big]. \tag{2}
\]
In the literature on GANs, both $\mathscr{G}$ and $\mathscr{D}$ take the form of neural networks (either feed-forward or convolutional, when dealing with image-related applications).
This is also the case in the present paper, in which the generator and the discriminator will be parameterized by feed-forward neural networks with, respectively, rectifier (Glorot et al., 2011) and GroupSort (Chernodub and Nowicki, 2016) activation functions. We also note that from an optimization standpoint, the minimax optimum in (2) is found by using stochastic gradient descent alternately on the generator's and the discriminator's parameters.

In the initial version (1), GANs were shown to reduce, under appropriate conditions, the Jensen-Shannon divergence between the true distribution and the class of parameterized distributions (Goodfellow et al., 2014). This characteristic was further explored by Biau et al. (2020), who stressed some theoretical guarantees regarding the approximation and statistical properties of problems (1) and (2). However, many empirical studies (e.g., Metz et al., 2016; Salimans et al., 2016) have described cases where the optimal generative distribution computed by solving (2) collapses to a few modes of the distribution $\mu^\star$. This phenomenon is known under the term of mode collapse and has been theoretically explained by Arjovsky and Bottou (2017). As a striking result, in cases where both $\mu^\star$ and $\mu_\theta$ lie on disjoint supports, these authors proved the existence of a perfect discriminator with null gradient on both supports, which consequently does not convey meaningful information to the generator.

To cancel this drawback and stabilize training, Arjovsky et al. (2017) proposed a modification of criterion (1), with a framework called Wasserstein GANs (WGANs). In a nutshell, the objective of WGANs is to find, inside the class of parameterized distributions $\mathscr{P}$, the one that is the closest to the true $\mu^\star$ with respect to the Wasserstein distance (Villani, 2008). In its dual form, the Wasserstein distance can be considered as an integral probability metric (IPM, Müller, 1997) defined on the set of 1-Lipschitz functions. Therefore, the proposal of Arjovsky et al. (2017) is to replace the 1-Lipschitz functions with a discriminator parameterized by neural networks. To practically enforce this discriminator to be a subset of 1-Lipschitz functions, the authors use a weight clipping technique on the set of parameters. A decisive step has been taken by Gulrajani et al. (2017), who stressed the empirical advantage of the WGANs architecture by replacing the weight clipping with a gradient penalty. Since then, WGANs have been largely recognized and studied by the Machine Learning community (e.g., Roth et al., 2017; Petzka et al., 2018; Wei et al., 2018; Karras et al., 2019).

A natural question regards the theoretical ability of WGANs to learn $\mu^\star$, considering that one only has access to the parametric models of generative distributions and discriminative functions. Previous works in this direction are those of Liang (2018) and Zhang et al. (2018), who explore generalization properties of WGANs. In the present paper, we make one step further in the analysis of mathematical forces driving WGANs and contribute to the literature in the following ways:

(i) We properly define the architecture of WGANs parameterized by neural networks. Then, we highlight some properties of the IPM induced by the discriminator, and finally stress some basic mathematical features of the WGANs framework (Section 2).

(ii) We emphasize the impact of operating with a parametric discriminator contained in the set of 1-Lipschitz functions.
We introduce in particular the notion of monotonous equivalence and discuss its meaning in the mechanism of WGANs. We also highlight the essential role played by piecewise linear functions (Section 3).

(iii) In a statistically-driven approach, we derive convergence rates for the IPM induced by the discriminator, between the target distribution $\mu^\star$ and the distribution output by the WGANs based on i.i.d. samples (Section 4).

(iv) Building upon the above, we clarify the adversarial effects of the generator and the discriminator by underlining some trade-off properties. These features are illustrated with experiments using both synthetic and real-world datasets (Section 5).

For the sake of clarity, proofs of the most technical results are gathered in the Appendix.
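To make the alternating stochastic gradient scheme mentioned above concrete, here is a minimal sketch (ours, not the authors' implementation) of objective (2) optimized with PyTorch; the architectures, widths, and learning rates are illustrative assumptions only.

```python
import torch
import torch.nn as nn

d, D_dim, width = 2, 2, 64  # latent dim, data dim, hidden width (illustrative)

G = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, D_dim))
D = nn.Sequential(nn.Linear(D_dim, width), nn.ReLU(), nn.Linear(width, 1), nn.Sigmoid())

opt_G = torch.optim.SGD(G.parameters(), lr=1e-3)
opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def training_step(x_real):
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
    # Discriminator ascent step on (2): maximize (1/n) sum log D(X_i) + E log(1 - D(G(Z))).
    x_fake = G(torch.randn(n, d)).detach()  # freeze the generator during this step
    loss_D = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    # Generator descent step; the non-saturating surrogate -log D(G(Z)) is common practice.
    loss_G = bce(D(G(torch.randn(n, d))), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```

WGANs keep this alternating structure but replace the logistic losses with the difference of expectations of problem (6) below and constrain the discriminator to be 1-Lipschitz.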
2. Wasserstein GANs
The present section is devoted to the presentation of the WGANs framework. After having given a first set of definitions and results, we stress the essential role played by IPMs and study some optimality properties of WGANs.
Throughout the paper, $E$ is a Borel subset of $\mathbb{R}^D$, equipped with the Euclidean norm $\|\cdot\|$, on which $\mu^\star$ (the target probability measure) and the $\mu_\theta$'s (the candidate probability measures) are defined. Depending on the practical context, $E$ can be equal to $\mathbb{R}^D$, but it can also be a submanifold of it. We emphasize that there is no compactness assumption on $E$.

For $K \subseteq E$, we let $C(K)$ (respectively, $C_b(K)$) be the set of continuous (respectively, continuous bounded) functions from $K$ to $\mathbb{R}$. We denote by $\mathrm{Lip}_1$ the set of 1-Lipschitz real-valued functions on $E$, i.e.,
\[
\mathrm{Lip}_1 = \big\{f : E \to \mathbb{R} : |f(x) - f(y)| \leq \|x - y\|, \; (x, y) \in E^2\big\}.
\]
The notation $P(E)$ stands for the collection of Borel probability measures on $E$, and $P_1(E)$ for the subset of probability measures with finite first moment, i.e.,
\[
P_1(E) = \Big\{\mu \in P(E) : \int_E \|x - x_0\| \, \mu(\mathrm{d}x) < \infty\Big\},
\]
where $x_0 \in E$ is arbitrary (this set does not depend on the choice of the point $x_0$). Until the end, it is assumed that $\mu^\star \in P_1(E)$. It is also assumed throughout that the random variable $Z \in \mathbb{R}^d$ is a sub-Gaussian random vector (Jin et al., 2019), i.e., $Z$ is integrable and there exists $\gamma > 0$ such that
\[
\forall v \in \mathbb{R}^d, \quad \mathbb{E}\, e^{v \cdot (Z - \mathbb{E}Z)} \leq e^{\gamma \|v\|^2},
\]
where $\cdot$ denotes the dot product in $\mathbb{R}^d$ and $\|\cdot\|$ the Euclidean norm. The sub-Gaussian property is a constraint on the tail of the probability distribution. As an example, Gaussian random variables on the real line are sub-Gaussian and so are bounded random vectors. We note that $Z$ has finite moments of all nonnegative orders (Jin et al., 2019, Lemma 2). Assuming that $Z$ is sub-Gaussian is a mild requirement since, in practice, its distribution is most of the time uniform or Gaussian.

As highlighted earlier, both the generator and the discriminator are assumed to be parameterized by feed-forward neural networks, that is, $\mathscr{G} = \{G_\theta : \theta \in \Theta\}$ and $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$, with $\Theta \subseteq \mathbb{R}^P$, $\Lambda \subseteq \mathbb{R}^Q$, and, for all $z \in \mathbb{R}^d$,
\[
G_\theta(z) = U_p\,\sigma\big(U_{p-1} \cdots \sigma(U_2\,\sigma(U_1 z + b_1) + b_2) \cdots + b_{p-1}\big) + b_p, \tag{3}
\]
and, for all $x \in E$,
\[
D_\alpha(x) = V_q\,\tilde\sigma\big(V_{q-1} \cdots \tilde\sigma(V_2\,\tilde\sigma(V_1 x + c_1) + c_2) \cdots + c_{q-1}\big) + c_q, \tag{4}
\]
where $p, q \geq 2$ (all vectors are encoded in column format). Some comments on the notation are in order. Networks in $\mathscr{G}$ and $\mathscr{D}$ have, respectively, $(p-1)$ and $(q-1)$ hidden layers. Hidden layers from depth 1 to $(p-1)$ (for the generator) and from depth 1 to $(q-1)$ (for the discriminator) are assumed to be of respective even widths $u_i$, $i = 1, \ldots, p-1$, and $v_i$, $i = 1, \ldots, q-1$. The matrices $U_i$ (respectively, $V_i$) are the matrices of weights between layer $i$ and layer $(i+1)$ of the generator (respectively, the discriminator), and the $b_i$'s (respectively, the $c_i$'s) are the corresponding offset vectors (in column format). We let $\sigma(x) = \max(x, 0)$ be the rectifier activation function (applied componentwise) and
\[
\tilde\sigma(x_1, x_2, \ldots, x_{n-1}, x_n) = \big(\max(x_1, x_2), \min(x_1, x_2), \ldots, \max(x_{n-1}, x_n), \min(x_{n-1}, x_n)\big)
\]
be the GroupSort activation function with a grouping size equal to 2 (applied on pairs of components, which makes sense in (4) since the widths of the hidden layers are assumed to be even). GroupSort has been introduced in Chernodub and Nowicki (2016) as a 1-Lipschitz activation function that preserves the gradient norm of the input. This activation can recover the rectifier, in the sense that $\tilde\sigma(x, 0) = (\sigma(x), -\sigma(-x))$, but the converse is not true. The presence of GroupSort is critical to guarantee approximation properties of Lipschitz neural networks (Anil et al., 2019), as we will see later.
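For concreteness, here is a small sketch (ours, not the authors' code) of the GroupSort activation with grouping size 2 and of a discriminator of the form (4), written with PyTorch; enforcing the norm constraints of Assumption 1 below is a separate step.

```python
import torch
import torch.nn as nn

class GroupSort2(nn.Module):
    """GroupSort with grouping size 2: maps (x1, x2, x3, x4, ...) to
    (max(x1, x2), min(x1, x2), max(x3, x4), ...). Needs an even width."""
    def forward(self, x):
        a, b = x[..., 0::2], x[..., 1::2]
        hi, lo = torch.maximum(a, b), torch.minimum(a, b)
        return torch.stack((hi, lo), dim=-1).flatten(start_dim=-2)

class Critic(nn.Module):
    """Feed-forward network of the form (4): q - 1 hidden layers of even
    width v, GroupSort activations, and a scalar output."""
    def __init__(self, D_in, v=20, q=3):
        super().__init__()
        layers, width_in = [], D_in
        for _ in range(q - 1):
            layers += [nn.Linear(width_in, v), GroupSort2()]
            width_in = v
        layers.append(nn.Linear(width_in, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```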
Therefore, denoting by $\mathcal{M}_{(j,k)}$ the space of matrices with $j$ rows and $k$ columns, we have $U_1 \in \mathcal{M}_{(u_1, d)}$, $V_1 \in \mathcal{M}_{(v_1, D)}$, $b_1 \in \mathcal{M}_{(u_1, 1)}$, $c_1 \in \mathcal{M}_{(v_1, 1)}$, $U_p \in \mathcal{M}_{(D, u_{p-1})}$, $V_q \in \mathcal{M}_{(1, v_{q-1})}$, $b_p \in \mathcal{M}_{(D, 1)}$, $c_q \in \mathcal{M}_{(1, 1)}$. All the other matrices $U_i$, $i = 2, \ldots, p-1$, and $V_i$, $i = 2, \ldots, q-1$, belong to $\mathcal{M}_{(u_i, u_{i-1})}$ and $\mathcal{M}_{(v_i, v_{i-1})}$, and the vectors $b_i$, $i = 2, \ldots, p-1$, and $c_i$, $i = 2, \ldots, q-1$, belong to $\mathcal{M}_{(u_i, 1)}$ and $\mathcal{M}_{(v_i, 1)}$. So, altogether, the vectors $\theta = (U_1, \ldots, U_p, b_1, \ldots, b_p)$ (respectively, the vectors $\alpha = (V_1, \ldots, V_q, c_1, \ldots, c_q)$) represent the parameter space $\Theta$ of the generator $\mathscr{G}$ (respectively, the parameter space $\Lambda$ of the discriminator $\mathscr{D}$). We stress the fact that the outputs of networks in $\mathscr{D}$ are not restricted to $[0,1]$ anymore, as is the case for the original GANs of Goodfellow et al. (2014). We also recall the notation $\mathscr{P} = \{\mu_\theta : \theta \in \Theta\}$, where, for each $\theta$, $\mu_\theta$ is the probability distribution of $G_\theta(Z)$. Since $Z$ has finite first moment and each $G_\theta$ is piecewise linear, it is easy to see that $\mathscr{P} \subset P_1(E)$.

Throughout the manuscript, the notation $\|\cdot\|$ (respectively, $\|\cdot\|_\infty$) means the Euclidean (respectively, the supremum) norm on $\mathbb{R}^k$, with no reference to $k$ as the context is clear. For $W = (w_{i,j})$ a matrix in $\mathcal{M}_{(k_1, k_2)}$, we let $\|W\|_2 = \sup_{\|x\|=1} \|Wx\|$ be the 2-norm of $W$. Similarly, the $\infty$-norm of $W$ is $\|W\|_\infty = \sup_{\|x\|_\infty = 1} \|Wx\|_\infty = \max_{i=1,\ldots,k_1} \sum_{j=1}^{k_2} |w_{i,j}|$. We will also use the $(2, \infty)$-norm of $W$, i.e., $\|W\|_{2,\infty} = \sup_{\|x\|=1} \|Wx\|_\infty$. We shall constantly need the following assumption:
Assumption 1 For all $\theta = (U_1, \ldots, U_p, b_1, \ldots, b_p) \in \Theta$, $\max(\|U_i\|_2, \|b_i\| : i = 1, \ldots, p) \leq K_1$, where $K_1 > 0$ is a constant. Besides, for all $\alpha = (V_1, \ldots, V_q, c_1, \ldots, c_q) \in \Lambda$, $\|V_1\|_{2,\infty} \leq 1$, $\max(\|V_2\|_\infty, \ldots, \|V_q\|_\infty) \leq 1$, and $\max(\|c_i\|_\infty : i = 1, \ldots, q) \leq K_2$, where $K_2 \geq 0$ is a constant.

This compactness requirement is classical when parameterizing WGANs (e.g., Arjovsky et al., 2017; Zhang et al., 2018; Anil et al., 2019). In practice, one can satisfy Assumption 1 by clipping the parameters of neural networks as proposed by Arjovsky et al. (2017). An alternative approach to enforce $\mathscr{D} \subseteq \mathrm{Lip}_1$ consists in penalizing the gradient of the discriminative functions, as proposed by Gulrajani et al. (2017), Kodali et al. (2017), Wei et al. (2018), and Zhou et al. (2019). This solution was empirically found to be more stable.
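As an illustration, weight clipping in the spirit of Arjovsky et al. (2017) takes a few lines. The sketch below is ours: the threshold `c` is a hyperparameter, and clamping individual entries is a crude surrogate for the exact norm constraints of Assumption 1, not an implementation of them.

```python
import torch

@torch.no_grad()
def clip_weights(critic: torch.nn.Module, c: float = 0.01) -> None:
    # Clamp every entry of every parameter tensor into [-c, c] after each
    # optimizer step, as in the original WGAN weight-clipping scheme.
    for param in critic.parameters():
        param.clamp_(-c, c)

# Typical usage inside a training loop:
#   loss_D.backward(); opt_D.step(); clip_weights(critic, c=0.01)
```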
The usefulness of Assumption 1 is captured by the following lemma.

Lemma 1 Assume that Assumption 1 is satisfied. Then, for each $\theta \in \Theta$, the function $G_\theta$ is $K_1^p$-Lipschitz on $\mathbb{R}^d$. In addition, $\mathscr{D} \subseteq \mathrm{Lip}_1$.

Recall (e.g., Dudley, 2004) that a sequence of probability measures $(\mu_k)$ on $E$ is said to converge weakly to a probability measure $\mu$ on $E$ if, for all $\varphi \in C_b(E)$,
\[
\int_E \varphi \, \mathrm{d}\mu_k \underset{k \to \infty}{\longrightarrow} \int_E \varphi \, \mathrm{d}\mu.
\]
In addition, the sequence of probability measures $(\mu_k)$ in $P_1(E)$ is said to converge weakly in $P_1(E)$ to a probability measure $\mu$ in $P_1(E)$ if (i) $(\mu_k)$ converges weakly to $\mu$ and if (ii) $\int_E \|x - x_0\| \, \mu_k(\mathrm{d}x) \to \int_E \|x - x_0\| \, \mu(\mathrm{d}x)$, where $x_0 \in E$ is arbitrary (Villani, 2008, Definition 6.7). The next proposition offers a characterization of our collection of generative distributions $\mathscr{P}$ in terms of compactness with respect to the weak topology in $P_1(E)$. This result is interesting as it gives some insight into the class of probability measures generated by neural networks.
Proposition 2 Assume that Assumption 1 is satisfied. Then the function $\Theta \ni \theta \mapsto \mu_\theta$ is continuous with respect to the weak topology in $P_1(E)$, and the set of generative distributions $\mathscr{P}$ is compact with respect to the weak topology in $P_1(E)$.

We are now in a position to formally define the WGANs problem. The Wasserstein distance (of order 1) between two probability measures $\mu$ and $\nu$ in $P_1(E)$ is defined by
\[
W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{E \times E} \|x - y\| \, \pi(\mathrm{d}x, \mathrm{d}y),
\]
where $\Pi(\mu, \nu)$ denotes the collection of all joint probability measures on $E \times E$ with marginals $\mu$ and $\nu$ (e.g., Villani, 2008). It is a finite quantity. In the present article, we will use the dual representation of $W_1(\mu, \nu)$, which comes from the duality theorem of Kantorovich and Rubinstein (1958):
\[
W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|,
\]
where, for a probability measure $\pi$, $\mathbb{E}_\pi f = \int_E f \, \mathrm{d}\pi$ (note that for $f \in \mathrm{Lip}_1$ and $\pi \in P_1(E)$, the function $f$ is Lebesgue integrable with respect to $\pi$).

In this context, it is natural to define the theoretical-WGANs (T-WGANs) problem as minimizing over $\Theta$ the Wasserstein distance between $\mu^\star$ and the $\mu_\theta$'s, i.e.,
\[
\inf_{\theta \in \Theta} W_1(\mu^\star, \mu_\theta) = \inf_{\theta \in \Theta} \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_{\mu^\star} f - \mathbb{E}_{\mu_\theta} f|. \tag{5}
\]
In practice, however, one does not have access to the class of 1-Lipschitz functions, which cannot be parameterized. Therefore, following Arjovsky et al. (2017), the class $\mathrm{Lip}_1$ is restricted to the smaller but parametric set of discriminators $\mathscr{D} = \{D_\alpha : \alpha \in \Lambda\}$ (it is a subset of $\mathrm{Lip}_1$, by Lemma 1), and this defines the actual WGANs problem:
\[
\inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} |\mathbb{E}_{\mu^\star} D_\alpha - \mathbb{E}_{\mu_\theta} D_\alpha|. \tag{6}
\]
Problem (6) is the Wasserstein counterpart of problem (1). Provided Assumption 1 is satisfied, $\mathscr{D} \subseteq \mathrm{Lip}_1$, and the IPM (Müller, 1997) $d_{\mathscr{D}}$ is defined for $(\mu, \nu) \in P_1(E)^2$ by
\[
d_{\mathscr{D}}(\mu, \nu) = \sup_{f \in \mathscr{D}} |\mathbb{E}_\mu f - \mathbb{E}_\nu f|. \tag{7}
\]
With this notation, $d_{\mathrm{Lip}_1} = W_1$ and problems (5) and (6) can be rewritten as the minimization over $\Theta$ of, respectively, $d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ and $d_{\mathscr{D}}(\mu^\star, \mu_\theta)$. So,
\[
\text{T-WGANs:} \;\; \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \quad \text{and} \quad \text{WGANs:} \;\; \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta).
\]
Similar objectives have been proposed in the literature, in particular neural net distances (Arora et al., 2017) and adversarial divergences (Liu et al., 2017). These two general approaches include f-GANs (Goodfellow et al., 2014; Nowozin et al., 2016), but also WGANs (Arjovsky et al., 2017), MMD-GANs (Li et al., 2017), and energy-based GANs (Zhao et al., 2017). Using the terminology of Arora et al. (2017), $d_{\mathscr{D}}$ is called a neural IPM. If the theoretical properties of the Wasserstein distance $d_{\mathrm{Lip}_1}$ have been largely studied (e.g., Villani, 2008), the story is different for neural IPMs. This is why what follows is devoted to the properties of $d_{\mathscr{D}}$.

The study of the neural IPM $d_{\mathscr{D}}$ is essential to assess the driving forces of WGANs architectures. Let us first recall that a mapping $\ell : P_1(E) \times P_1(E) \to [0, \infty)$ is a metric if it satisfies the following three requirements:

(i) $\ell(\mu, \nu) = 0 \iff \mu = \nu$ (discriminative property)

(ii) $\ell(\mu, \nu) = \ell(\nu, \mu)$ (symmetry)
Furthermore, the (pseudo)metric (cid:96) is said to metrize weak convergencein P ( E ) (Villani, 2008) if, for all sequences ( µ k ) in P ( E ) and all µ in P ( E ), one has (cid:96) ( µ, µ k ) → ⇐⇒ µ k converges weakly to µ in P ( E ) as k → ∞ . According to Villani(2008, Theorem 6.8), d Lip is a metric that metrizes weak convergence in P ( E ).As far as d D is concerned, it is clearly a pseudometric on P ( E ) as soon as Assumption 1is satisfied. Moreover, an elementary application of Dudley (2004, Lemma 9.3.2) shows thatif span( D ) (with span( D ) = { γ + (cid:80) ni =1 γ i D i : γ i ∈ R , D i ∈ D , n ∈ N } ) is dense in C b ( E ),then d D is a metric on P ( E ), which, in addition, metrizes weak convergence. As in Zhanget al. (2018), Dudley’s result can be exploited in the case where the space E is compactto prove that, whenever D is of the form (4), d D is a metric metrizing weak convergence.However, establishing the discriminative property of the pseudometric d D turns out to bemore challenging without an assumption of compactness on E , as is the case in the presentstudy. Our result is encapsulated in the following proposition. Proposition 3
Proposition 3 Assume that Assumption 1 is satisfied. Then there exists a discriminator of the form (4) (i.e., a depth $q$ and widths $v_1, \ldots, v_{q-1}$) such that $d_{\mathscr{D}}$ is a metric on $\mathscr{P} \cup \{\mu^\star\}$. In addition, $d_{\mathscr{D}}$ metrizes weak convergence in $\mathscr{P} \cup \{\mu^\star\}$.

Standard universal approximation theorems (Cybenko, 1989; Hornik et al., 1989; Hornik, 1991) state the density of neural networks in the family of continuous functions defined on compact sets but do not guarantee that the approximator respects a Lipschitz constraint. The proof of Proposition 3 uses the fact that, under Assumption 1, neural networks of the form (4) are dense in the space of Lipschitz continuous functions on compact sets, as revealed by Anil et al. (2019).

We deduce from Proposition 3 that, under Assumption 1, provided enough capacity, the pseudometric $d_{\mathscr{D}}$ can be topologically equivalent to $d_{\mathrm{Lip}_1}$ on $\mathscr{P} \cup \{\mu^\star\}$, i.e., the convergent sequences in $(\mathscr{P} \cup \{\mu^\star\}, d_{\mathscr{D}})$ are the same as the convergent sequences in $(\mathscr{P} \cup \{\mu^\star\}, d_{\mathrm{Lip}_1})$, with the same limit—see O'Searcoid (2006, Corollary 13.1.3). We are now ready to discuss some optimality properties of the T-WGANs and WGANs problems, i.e., conditions under which the infimum in $\theta \in \Theta$ and the supremum in $\alpha \in \Lambda$ are reached.
Recall that for T-WGANs, we minimize over $\Theta$ the distance
\[
d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) = \sup_{f \in \mathrm{Lip}_1} |\mathbb{E}_{\mu^\star} f - \mathbb{E}_{\mu_\theta} f|,
\]
whereas for WGANs, we use
\[
d_{\mathscr{D}}(\mu^\star, \mu_\theta) = \sup_{\alpha \in \Lambda} |\mathbb{E}_{\mu^\star} D_\alpha - \mathbb{E}_{\mu_\theta} D_\alpha|.
\]
A first natural question is to know whether, for a fixed generator parameter $\theta \in \Theta$, there exists a 1-Lipschitz function (respectively, a discriminative function) that achieves the supremum in $d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ (respectively, in $d_{\mathscr{D}}(\mu^\star, \mu_\theta)$) over all $f \in \mathrm{Lip}_1$ (respectively, all $\alpha \in \Lambda$). For T-WGANs, Villani (2008, Theorem 5.9) guarantees that the maximum exists, i.e.,
\[
\{f \in \mathrm{Lip}_1 : |\mathbb{E}_{\mu^\star} f - \mathbb{E}_{\mu_\theta} f| = d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)\} \neq \emptyset. \tag{8}
\]
For WGANs, we have the following:
Lemma 4 Assume that Assumption 1 is satisfied. Then, for all $\theta \in \Theta$,
\[
\{\alpha \in \Lambda : |\mathbb{E}_{\mu^\star} D_\alpha - \mathbb{E}_{\mu_\theta} D_\alpha| = d_{\mathscr{D}}(\mu^\star, \mu_\theta)\} \neq \emptyset.
\]

Thus, provided Assumption 1 is verified, the supremum in $\alpha$ in the neural IPM $d_{\mathscr{D}}$ is always reached. A similar result is proved by Biau et al. (2020) in the case of standard GANs.

We now turn to analyzing the existence of the infimum in $\theta$ in the minimization over $\Theta$ of $d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ and $d_{\mathscr{D}}(\mu^\star, \mu_\theta)$. Since the optimization scheme is performed over the parameter set $\Theta$, it is worth considering the following two functions:
\[
\xi_{\mathrm{Lip}_1} : \Theta \to \mathbb{R}, \; \theta \mapsto d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \quad \text{and} \quad \xi_{\mathscr{D}} : \Theta \to \mathbb{R}, \; \theta \mapsto d_{\mathscr{D}}(\mu^\star, \mu_\theta).
\]
Theorem 5 Assume that Assumption 1 is satisfied. Then $\xi_{\mathrm{Lip}_1}$ and $\xi_{\mathscr{D}}$ are Lipschitz continuous on $\Theta$, and the Lipschitz constant of $\xi_{\mathscr{D}}$ is independent of $\mathscr{D}$.

Theorem 5 extends Arjovsky et al. (2017, Theorem 1), which states that $d_{\mathscr{D}}$ is locally Lipschitz continuous under the additional assumption that $E$ is compact. In contrast, there is no compactness hypothesis in Theorem 5 and the Lipschitz property is global. The Lipschitzness of the function $\xi_{\mathscr{D}}$ is an interesting property of WGANs, in line with many recent empirical works that have shown that gradient-based regularization techniques are efficient for stabilizing the training of GANs and preventing mode collapse (Kodali et al., 2017; Roth et al., 2017; Miyato et al., 2018; Petzka et al., 2018).

In the sequel, we let $\Theta^\star$ and $\bar\Theta$ be the sets of optimal parameters, defined by
\[
\Theta^\star = \operatorname*{arg\,min}_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \quad \text{and} \quad \bar\Theta = \operatorname*{arg\,min}_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta).
\]
An immediate but useful corollary of Theorem 5 is as follows:
Corollary 6
Assume that Assumption 1 is satisfied. Then $\Theta^\star$ and $\bar\Theta$ are non-empty.

Thus, any $\theta^\star \in \Theta^\star$ (respectively, any $\bar\theta \in \bar\Theta$) is an optimal parameter for the T-WGANs (respectively, the WGANs) problem. Note however that, without further restrictive assumptions on the models, we cannot ensure that $\Theta^\star$ or $\bar\Theta$ are reduced to singletons.
3. Optimization properties
We are interested in this section in the error made when minimizing over $\Theta$ the pseudometric $d_{\mathscr{D}}(\mu^\star, \mu_\theta)$ (WGANs problem) instead of $d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ (T-WGANs problem). This optimization error is represented by the difference
\[
\varepsilon_{\mathrm{optim}} = \sup_{\bar\theta \in \bar\Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta).
\]
It is worth pointing out that we take the supremum over all $\bar\theta \in \bar\Theta$ since there is no guarantee that two distinct elements $\bar\theta_1$ and $\bar\theta_2$ of $\bar\Theta$ lead to the same distances $d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_1})$ and $d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_2})$. The quantity $\varepsilon_{\mathrm{optim}}$ captures the largest discrepancy between the scores achieved by distributions solving the WGANs problem and the scores of distributions solving the T-WGANs problem. We emphasize that the scores are quantified by the Wasserstein distance $d_{\mathrm{Lip}_1}$, which is the natural metric associated with the problem. We note in particular that $\varepsilon_{\mathrm{optim}} \geq 0$. A natural question is whether we can upper bound the difference and obtain some control of $\varepsilon_{\mathrm{optim}}$.

As a warm-up, we observe that in the simple but unrealistic case where $\mu^\star \in \mathscr{P}$, provided Assumption 1 is satisfied and the neural IPM $d_{\mathscr{D}}$ is a metric on $\mathscr{P}$ (see Proposition 3), then $\Theta^\star = \bar\Theta$ and $\varepsilon_{\mathrm{optim}} = 0$. However, in the high-dimensional context of WGANs, the parametric class of distributions $\mathscr{P}$ is likely to be "far" from the true distribution $\mu^\star$. This phenomenon is thoroughly discussed in Arjovsky and Bottou (2017, Lemma 2 and Lemma 3) and is often referred to as dimensional misspecification (Roth et al., 2017).

From now on, we place ourselves in the general setting where we have no information on whether the true distribution belongs to $\mathscr{P}$, and start with the following simple observation. Assume that Assumption 1 is satisfied. Then, clearly, since $\mathscr{D} \subseteq \mathrm{Lip}_1$,
\[
\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) \leq \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta). \tag{9}
\]
Inequality (9) is useful to upper bound $\varepsilon_{\mathrm{optim}}$. Indeed,
\[
\begin{aligned}
0 \leq \varepsilon_{\mathrm{optim}} &= \sup_{\bar\theta \in \bar\Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \\
&\leq \sup_{\bar\theta \in \bar\Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta}) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) \\
&= \sup_{\bar\theta \in \bar\Theta} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta}) - d_{\mathscr{D}}(\mu^\star, \mu_{\bar\theta}) \big] \quad \big(\text{since } \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) = d_{\mathscr{D}}(\mu^\star, \mu_{\bar\theta}) \text{ for all } \bar\theta \in \bar\Theta\big) \\
&\leq T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}), \tag{10}
\end{aligned}
\]
where, by definition,
\[
T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) = \sup_{\theta \in \Theta} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) - d_{\mathscr{D}}(\mu^\star, \mu_\theta) \big] \tag{11}
\]
is the maximum difference in distances on the set of candidate probability distributions in $\mathscr{P}$. Note, since $\Theta$ is compact (by Assumption 1) and $\xi_{\mathrm{Lip}_1}$ and $\xi_{\mathscr{D}}$ are Lipschitz continuous (by Theorem 5), that $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) < \infty$. Thus, the loss in performance when comparing T-WGANs and WGANs can be upper-bounded by the maximum difference over $\mathscr{P}$ between the Wasserstein distance and $d_{\mathscr{D}}$.

Observe that when the class of discriminative functions is increased (say $\mathscr{D} \subset \mathscr{D}'$) while keeping the generator fixed, then the bound (11) gets reduced since $d_{\mathscr{D}}(\mu^\star, \cdot) \leq d_{\mathscr{D}'}(\mu^\star, \cdot)$. Similarly, when increasing the class of generative distributions (say $\mathscr{P} \subset \mathscr{P}'$) with a fixed discriminator, then the bound gets bigger, i.e., $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \leq T_{\mathscr{P}'}(\mathrm{Lip}_1, \mathscr{D})$. It is important to note that the conditions $\mathscr{D} \subset \mathscr{D}'$ and/or $\mathscr{P} \subset \mathscr{P}'$ are easily satisfied for classes of functions parameterized with neural networks using either rectifier or GroupSort activation functions, just by increasing the width and/or the depth of the networks.

Our next theorem states that, as long as the distributions of $\mathscr{P}$ are generated by neural networks with bounded parameters (Assumption 1), then one can control $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$.
Theorem 7 Assume that Assumption 1 is satisfied. Then, for all $\varepsilon > 0$, there exists a discriminator $\mathscr{D}$ of the form (4) such that
\[
0 \leq \varepsilon_{\mathrm{optim}} \leq T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \leq c\varepsilon,
\]
where $c > 0$ is a constant independent of $\varepsilon$.

Theorem 7 is important because it shows that for any collection of generative distributions $\mathscr{P}$ and any approximation threshold $\varepsilon$, one can find a discriminator such that the loss in performance $\varepsilon_{\mathrm{optim}}$ is (at most) of the order of $\varepsilon$. In other words, there exists $\mathscr{D}$ of the form (4) such that $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$ is arbitrarily small. We note however that Theorem 7 is an existence theorem that does not give any particular information on the depth and/or the width of the neural networks in $\mathscr{D}$. The key argument to prove Theorem 7 is Anil et al. (2019, Theorem 3), which states that the set of Lipschitz neural networks is dense in the set of Lipschitz continuous functions on a compact space.

The quantity $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$ is of limited practical interest, as it involves a supremum over all $\theta \in \Theta$. Moreover, another caveat is that the definition of $\varepsilon_{\mathrm{optim}}$ assumes that one has access to $\bar\Theta$. Therefore, our next goal is to enrich Theorem 7 by taking into account the fact that numerical procedures do not reach $\bar\theta \in \bar\Theta$ but rather an $\varepsilon$-approximation of it.

One way to approach the problem is to look for another form of equivalence between $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}}$. As one is optimizing $d_{\mathscr{D}}$ instead of $d_{\mathrm{Lip}_1}$, we would ideally like the two IPMs to behave "similarly", in the sense that minimizing $d_{\mathscr{D}}$ leads to a solution that is still close to the true distribution with respect to $d_{\mathrm{Lip}_1}$. Assuming that Assumption 1 is satisfied, we let, for any $\mu \in P_1(E)$ and $\varepsilon > 0$, $\mathsf{M}^{\ell}(\mu, \varepsilon)$ be the set of $\varepsilon$-solutions to the optimization problem of interest, that is, the subset of $\Theta$ defined by
\[
\mathsf{M}^{\ell}(\mu, \varepsilon) = \big\{\theta \in \Theta : \ell(\mu, \mu_\theta) - \inf_{\theta' \in \Theta} \ell(\mu, \mu_{\theta'}) \leq \varepsilon\big\},
\]
with $\ell = d_{\mathrm{Lip}_1}$ or $\ell = d_{\mathscr{D}}$.
Definition 8 Let $\varepsilon > 0$. We say that $d_{\mathrm{Lip}_1}$ can be $\varepsilon$-substituted by $d_{\mathscr{D}}$ if there exists $\delta > 0$ such that
\[
\mathsf{M}^{d_{\mathscr{D}}}(\mu^\star, \delta) \subseteq \mathsf{M}^{d_{\mathrm{Lip}_1}}(\mu^\star, \varepsilon).
\]
In addition, if $d_{\mathrm{Lip}_1}$ can be $\varepsilon$-substituted by $d_{\mathscr{D}}$ for all $\varepsilon > 0$, we say that $d_{\mathrm{Lip}_1}$ can be fully substituted by $d_{\mathscr{D}}$.

The rationale behind this definition is that by minimizing the neural IPM $d_{\mathscr{D}}$ close to optimality, one can be guaranteed to be also close to optimality with respect to the Wasserstein distance $d_{\mathrm{Lip}_1}$. In the sequel, given a metric $d$, the notation $d(x, F)$ denotes the distance of $x$ to the set $F$, that is, $d(x, F) = \inf_{f \in F} d(x, f)$.
Proposition 9 Assume that Assumption 1 is satisfied. Then, for all $\varepsilon > 0$, there exists $\delta > 0$ such that, for all $\theta \in \mathsf{M}^{d_{\mathscr{D}}}(\mu^\star, \delta)$, one has $d(\theta, \bar\Theta) \leq \varepsilon$.
Corollary 10 Assume that Assumption 1 is satisfied and that $\Theta^\star = \bar\Theta$. Then $d_{\mathrm{Lip}_1}$ can be fully substituted by $d_{\mathscr{D}}$.

Proof Let $\varepsilon > 0$. By Theorem 5, we know that the function $\Theta \ni \theta \mapsto d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$ is Lipschitz continuous. Thus, there exists $\eta > 0$ such that, for all $(\theta, \theta') \in \Theta^2$ satisfying $\|\theta - \theta'\| \leq \eta$, one has $|d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) - d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta'})| \leq \varepsilon$. Besides, using Proposition 9, there exists $\delta > 0$ such that, for all $\theta \in \mathsf{M}^{d_{\mathscr{D}}}(\mu^\star, \delta)$, one has $d(\theta, \bar\Theta) \leq \eta$. Now, let $\theta \in \mathsf{M}^{d_{\mathscr{D}}}(\mu^\star, \delta)$. Since $d(\theta, \bar\Theta) \leq \eta$ and $\bar\Theta = \Theta^\star$, we have $d(\theta, \Theta^\star) \leq \eta$. Consequently, $|d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) - \inf_{\theta' \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta'})| \leq \varepsilon$, and so, $\theta \in \mathsf{M}^{d_{\mathrm{Lip}_1}}(\mu^\star, \varepsilon)$.

Corollary 10 is interesting insofar as when both $d_{\mathscr{D}}$ and $d_{\mathrm{Lip}_1}$ have the same minimizers over $\Theta$, then minimizing one close to optimality is the same as minimizing the other. The requirement $\Theta^\star = \bar\Theta$ can be relaxed by leveraging what has been studied above about $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$.

Lemma 11 Assume that Assumption 1 is satisfied, and let $\varepsilon > 0$. If $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \leq \varepsilon$, then $d_{\mathrm{Lip}_1}$ can be $(\varepsilon + \delta)$-substituted by $d_{\mathscr{D}}$ for all $\delta > 0$.

Proof Let $\varepsilon > 0$, $\delta > 0$, and $\theta \in \mathsf{M}^{d_{\mathscr{D}}}(\mu^\star, \delta)$, i.e., $d_{\mathscr{D}}(\mu^\star, \mu_\theta) - \inf_{\theta' \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_{\theta'}) \leq \delta$. We have
\[
\begin{aligned}
d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) - \inf_{\theta' \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta'})
&\leq d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) - \inf_{\theta' \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_{\theta'}) \quad \text{(by inequality (9))} \\
&\leq d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) - d_{\mathscr{D}}(\mu^\star, \mu_\theta) + \delta \\
&\leq T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) + \delta \leq \varepsilon + \delta.
\end{aligned}
\]

Lemma 11 stresses the importance of $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$ in the performance of WGANs. Indeed, the smaller $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$, the closer we will be to optimality after training. Moving on, to derive sufficient conditions under which $d_{\mathrm{Lip}_1}$ can be substituted by $d_{\mathscr{D}}$, we introduce the following definition:
Definition 12 We say that $d_{\mathrm{Lip}_1}$ is monotonously equivalent to $d_{\mathscr{D}}$ on $\mathscr{P}$ if there exists a continuously differentiable, strictly increasing function $f : \mathbb{R}_+ \to \mathbb{R}_+$ and $(a, b) \in (\mathbb{R}_+^\star)^2$ such that
\[
\forall \mu \in \mathscr{P}, \quad a f(d_{\mathscr{D}}(\mu^\star, \mu)) \leq d_{\mathrm{Lip}_1}(\mu^\star, \mu) \leq b f(d_{\mathscr{D}}(\mu^\star, \mu)).
\]

Here, it is assumed implicitly that $\mathscr{D} \subseteq \mathrm{Lip}_1$. At the end of the subsection, we stress, empirically, that Definition 12 is easy to check for simple classes of generators. A consequence of this definition is encapsulated in the following lemma.
Lemma 13 Assume that Assumption 1 is satisfied, and that $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}}$ are monotonously equivalent on $\mathscr{P}$ with $a = b$ (that is, $d_{\mathrm{Lip}_1} = f \circ d_{\mathscr{D}}$). Then $\Theta^\star = \bar\Theta$ and $d_{\mathrm{Lip}_1}$ can be fully substituted by $d_{\mathscr{D}}$.

To complete Lemma 13, we now tackle the case $a < b$.
Proposition 14 Assume that Assumption 1 is satisfied, and that $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}}$ are monotonously equivalent on $\mathscr{P}$. Then, for any $\delta \in (0, 1)$, $d_{\mathrm{Lip}_1}$ can be $\varepsilon$-substituted by $d_{\mathscr{D}}$ with
\[
\varepsilon = (b - a)\, f\big(\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta)\big) + O(\delta).
\]

Proposition 14 states that we can reach $\varepsilon$-minimizers of $d_{\mathrm{Lip}_1}$ by solving the WGANs problem up to a precision sufficiently small, for all $\varepsilon$ larger than a bias induced by the model $\mathscr{P}$ and by the discrepancy between $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}}$.

In order to validate Definition 12, we slightly depart from the WGANs setting and run a series of small experiments in the simplified setting where both $\mu^\star$ and $\mu \in \mathscr{P}$ are bivariate mixtures of independent Gaussian distributions with $K$ components ($K = 1, 4, 9, 25$). We consider two classes of discriminators $\{\mathscr{D}_q : q = 2, 5\}$ of the form (4), with growing depth $q$ (the width of the hidden layers is kept constant, equal to 20). Our goal is to exemplify the relationship between the distances $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}_q}$ by looking at whether $d_{\mathrm{Lip}_1}$ is monotonously equivalent to $d_{\mathscr{D}_q}$.

First, for each $K$, we randomly draw 40 different pairs of distributions $(\mu^\star, \mu)$ among the set of mixtures of bivariate Gaussian densities with $K$ components. Then, for each of these pairs, we compute an approximation of $d_{\mathrm{Lip}_1}$ by averaging the Wasserstein distance between finite samples of size 4096 over 20 runs. This operation is performed using the Python package of Flamary and Courty (2017). For each pair of distributions, we also calculate the corresponding IPMs $d_{\mathscr{D}_q}(\mu^\star, \mu)$. We finally compare $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}_q}$ by approximating their relationship with a parabolic fit. Results are presented in Figure 1, which depicts in particular the best parabolic fit, and shows the corresponding Least Relative Error (LRE) together with the width $(b - a)$ from Definition 12. In order to enforce the discriminator to verify Assumption 1, we use the orthonormalization of Björck and Bowie (1971), as done for example in Anil et al. (2019). Interestingly, we see that when the class of discriminative functions gets larger (i.e., when $q$ increases), then both metrics start to behave similarly (i.e., the range $(b - a)$ gets thinner), independently of $K$ (Figure 1a to Figure 1f). This tends to confirm that $d_{\mathrm{Lip}_1}$ can be considered as monotonously equivalent to $d_{\mathscr{D}_q}$ for $q$ large enough. On the other hand, for a fixed depth $q$, when allowing for more complex distributions, the width $(b - a)$ increases. This is particularly clear in Figure 1g and Figure 1h, which show the fits obtained when merging all pairs for $K = 1, 4, 9, 25$ (for both $\mu^\star$ and $\mathscr{P}$).
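A minimal sketch of this estimation step (ours, using the POT package of Flamary and Courty; the sample sizes match the experiment, while the sampler arguments `sampler_p` and `sampler_q` are hypothetical stand-ins for the mixture generators):

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary and Courty, 2017)

def wasserstein_1(x, y):
    """Empirical W1 between two finite samples, via exact linear programming."""
    a = np.full(len(x), 1.0 / len(x))  # uniform weights on the first sample
    b = np.full(len(y), 1.0 / len(y))
    M = ot.dist(x, y, metric='euclidean')  # pairwise cost matrix
    return ot.emd2(a, b, M)  # optimal transport cost

def estimate_d_lip(sampler_p, sampler_q, size=4096, runs=20, rng=None):
    """Average W1 between finite samples over several runs, as for Figure 1."""
    if rng is None:
        rng = np.random.default_rng(0)
    return np.mean([wasserstein_1(sampler_p(size, rng), sampler_q(size, rng))
                    for _ in range(runs)])

# Parabolic fit relating the two (pseudo)metrics, d_Lip1 ≈ poly(d_D):
# coeffs = np.polyfit(d_D_values, d_lip_values, deg=2)
```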
Figure 1: Scatter plots of 40 pairs of distances simultaneously measured with $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}_q}$, for $q = 2, 5$ and $K = 1, 4, 9, 25$. Panels: (a) $q = 2$, $K = 1$ (LRE 0.01, $b-a$ 0.02); (b) $q = 5$, $K = 1$ (0.01, 0.02); (c) $q = 2$, $K = 4$ (1.54, 0.17); (d) $q = 5$, $K = 4$ (1.01, 0.13); (e) $q = 2$, $K = 9$ (1.21, 0.15); (f) $q = 5$, $K = 9$ (1.21, 0.15); (g) $q = 2$, $K = 1, 4, 9, 25$ (5.61, 0.26); (h) $q = 5$, $K = 1, 4, 9, 25$ (4.69, 0.20). The red curve is the optimal parabolic fit and LRE refers to the Least Relative Error. The red zone is the envelope obtained by stretching the optimal curve from $b$ to $a$.

These figures illustrate the fact that, for a fixed discriminator, the monotonous equivalence between $d_{\mathrm{Lip}_1}$ and $d_{\mathscr{D}}$ seems to be a more demanding assumption when the class of generative distributions becomes too large.

The objective of this subsection is to provide some justification for the use of deep GroupSort neural networks in the field of WGANs. This short discussion is motivated by the observation of Anil et al. (2019, Theorem 1), who stress that norm-constrained ReLU neural networks are not well-suited for learning non-linear 1-Lipschitz functions.

The next lemma shows that networks of the form (4), which use GroupSort activations, can recover any 1-Lipschitz function belonging to the class AFF of real-valued affine functions on $E$.
Lemma 15 Let $f : E \to \mathbb{R}$ be in $\mathrm{AFF} \cap \mathrm{Lip}_1$. Then there exists a neural network of the form (4) verifying Assumption 1, with $q = 2$ and $v_1 = 2$, that can represent $f$.

Motivated by Lemma 15, we show that, in some specific cases, the Wasserstein distance $d_{\mathrm{Lip}_1}$ can be approached by only considering affine functions, thus motivating the use of neural networks of the form (4).
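To make the idea behind Lemma 15 concrete, here is a small numerical check (ours, with one explicit choice of weights; the proof may proceed differently): for $f(x) = w^\top x + c$ with $\|w\| \leq 1$, duplicating the row $w^\top$ and averaging the two GroupSort outputs recovers $f$ while satisfying the norm constraints of Assumption 1.

```python
import numpy as np

def groupsort2(x):
    """GroupSort with grouping size 2 on a vector with an even number of entries."""
    out = x.copy()
    for i in range(0, len(x), 2):
        out[i], out[i + 1] = max(x[i], x[i + 1]), min(x[i], x[i + 1])
    return out

D_dim = 3
rng = np.random.default_rng(0)
w = rng.normal(size=D_dim)
w /= np.linalg.norm(w)           # ||w|| = 1, so f is 1-Lipschitz
c = 0.7

V1 = np.stack([w, w])            # ||V1||_{2,inf} = ||w|| <= 1
c1 = np.zeros(2)
V2 = np.array([[0.5, 0.5]])      # ||V2||_inf = 1
c2 = np.array([c])

x = rng.normal(size=D_dim)
d_alpha = (V2 @ groupsort2(V1 @ x + c1) + c2).item()
assert np.isclose(d_alpha, w @ x + c)  # the q = 2 network represents f exactly
```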
Recall that the support $S_\mu$ of a probability measure $\mu$ is the smallest closed subset of $\mu$-measure 1.

Lemma 16 Let $\mu$ and $\nu$ be two probability measures in $P_1(E)$. Assume that $S_\mu$ and $S_\nu$ are one-dimensional disjoint intervals included in the same line. Then $d_{\mathrm{Lip}_1}(\mu, \nu) = d_{\mathrm{AFF} \cap \mathrm{Lip}_1}(\mu, \nu)$.

Lemma 16 is interesting insofar as it describes a specific case where the discriminator can be restricted to affine functions while keeping the identity $d_{\mathrm{Lip}_1} = d_{\mathscr{D}}$. We consider in the next lemma a slightly more involved setting, where the two distributions $\mu$ and $\nu$ are multivariate Gaussian with the same covariance matrix.
Lemma 17 Let $(m_1, m_2) \in (\mathbb{R}^D)^2$, and let $\Sigma \in \mathcal{M}_{(D,D)}$ be a positive semi-definite matrix. Assume that $\mu$ is Gaussian $N(m_1, \Sigma)$ and that $\nu$ is Gaussian $N(m_2, \Sigma)$. Then $d_{\mathrm{Lip}_1}(\mu, \nu) = d_{\mathrm{AFF} \cap \mathrm{Lip}_1}(\mu, \nu)$.

Yet, assuming multivariate Gaussian distributions might be too restrictive. Therefore, we now assume that both distributions lie on disjoint compact supports sufficiently distant from one another. Recall that for a set $S \subseteq E$, the diameter of $S$ is $\mathrm{diam}(S) = \sup_{(x,y) \in S^2} \|x - y\|$, and that the distance between two sets $S$ and $T$ is defined by $d(S, T) = \inf_{(x,y) \in S \times T} \|x - y\|$.
Proposition 18 Let $\varepsilon > 0$, and let $\mu$ and $\nu$ be two probability measures in $P_1(E)$ with compact convex supports $S_\mu$ and $S_\nu$. Assume that $\max(\mathrm{diam}(S_\mu), \mathrm{diam}(S_\nu)) \leq \varepsilon\, d(S_\mu, S_\nu)$. Then
\[
d_{\mathrm{AFF} \cap \mathrm{Lip}_1}(\mu, \nu) \leq d_{\mathrm{Lip}_1}(\mu, \nu) \leq (1 + 2\varepsilon)\, d_{\mathrm{AFF} \cap \mathrm{Lip}_1}(\mu, \nu).
\]

Observe that in the case where neither $\mu$ nor $\nu$ is a Dirac measure, the assumption of the proposition imposes that $S_\mu \cap S_\nu = \emptyset$. In the context of WGANs, it is highly likely that the generator badly approximates the true distribution $\mu^\star$ at the beginning of training. The setting of Proposition 18 is thus interesting insofar as $\mu^\star$ and the generative distribution will most certainly verify the assumption on the diameters at this point in the learning process. However, in the common case where the true distribution lies on disconnected manifolds, the assumptions of the proposition are not valid anymore, and it would therefore be interesting to show a similar result using the broader set of piecewise linear functions on $E$.

As an empirical illustration, consider the synthetic setting where one tries to approximate a bivariate mixture of independent Gaussian distributions with respectively 4 (Figure 2a) and 9 (Figure 2c) modes. As expected, the optimal discriminator takes the form of a piecewise linear function, as illustrated by Figure 2b and Figure 2d, which display heatmaps of the discriminator's output. Interestingly, we see that the number of linear regions increases with the number $K$ of components of $\mu^\star$.

Figure 2: Illustration of the usefulness of GroupSort neural networks when dealing with the learning of mixtures of Gaussian distributions. (a) True distribution $\mu^\star$ (mixture of $K = 4$ bivariate Gaussian densities, green circles) and 2000 data points sampled from the generator $\mu_{\bar\theta}$ (blue dots). (b) Heatmap of the discriminator's output on a mixture of $K = 4$ bivariate Gaussian densities. (c) True distribution $\mu^\star$ (mixture of $K = 9$ bivariate Gaussian densities, green circles) and 2000 data points sampled from the generator $\mu_{\bar\theta}$ (blue dots). (d) Heatmap of the discriminator's output on a mixture of $K = 9$ bivariate Gaussian densities. In both cases, $p = q = 3$.

These empirical results stress that when $\mu^\star$ gets more complex, if the discriminator ought to correctly approximate the Wasserstein distance, then it should parameterize piecewise linear functions with growing numbers of regions. While we enlighten properties of GroupSort networks, many recent theoretical works have been studying the number of regions of deep ReLU neural networks (Pascanu et al., 2013; Montúfar et al., 2014; Arora et al., 2018; Serra et al., 2018). In particular, Montúfar et al. (2014, Theorem 5) states that the number of linear regions of deep models grows exponentially with the depth and polynomially with the width. This, along with our observations, is an interesting avenue to choose the architecture of the discriminator.
4. Asymptotic properties
In practice, one never has access to the distribution $\mu^\star$ but rather to a finite collection of i.i.d. observations $X_1, \ldots, X_n$ distributed according to $\mu^\star$. Thus, for the remainder of the article, we let $\mu_n$ be the empirical measure based on $X_1, \ldots, X_n$, that is, for any Borel subset $A$ of $E$, $\mu_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A}$. With this notation, the empirical counterpart of the WGANs problem is naturally defined as minimizing over $\Theta$ the quantity $d_{\mathscr{D}}(\mu_n, \mu_\theta)$. Equivalently, we seek to solve the following optimization problem:
\[
\inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu_n, \mu_\theta) = \inf_{\theta \in \Theta} \sup_{\alpha \in \Lambda} \Big[ \frac{1}{n} \sum_{i=1}^n D_\alpha(X_i) - \mathbb{E}\, D_\alpha(G_\theta(Z)) \Big]. \tag{12}
\]
Assuming that Assumption 1 is satisfied, we have, as in Corollary 6, that the infimum in (12) is reached. We therefore consider the set of empirical optimal parameters
\[
\hat\Theta_n = \operatorname*{arg\,min}_{\theta \in \Theta} d_{\mathscr{D}}(\mu_n, \mu_\theta),
\]
and let $\hat\theta_n$ be a specific element of $\hat\Theta_n$ (note that the choice of $\hat\theta_n$ has no impact on the value of the minimum). We note that $\hat\Theta_n$ (respectively, $\hat\theta_n$) is the empirical counterpart of $\bar\Theta$ (respectively, $\bar\theta$). Section 3 was mainly devoted to the analysis of the difference $\varepsilon_{\mathrm{optim}}$. In this section, we are willing to take into account the effect of having finite samples. Thus, in line with the above, we are now interested in the generalization properties of WGANs and look for upper bounds on the quantity
\[
0 \leq d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\hat\theta_n}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta). \tag{13}
\]
Arora et al. (2017, Theorem 3.1) states an asymptotic result showing that when provided enough samples, the neural IPM $d_{\mathscr{D}}$ generalizes well, in the sense that for any pair $(\mu, \nu) \in P_1(E)^2$, the difference $|d_{\mathscr{D}}(\mu, \nu) - d_{\mathscr{D}}(\mu_n, \nu_n)|$ can be arbitrarily small with high probability. However, this result does not give any information on the quantity of interest $d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\hat\theta_n}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)$. Closer to our current work, Zhang et al. (2018) provide bounds for $d_{\mathscr{D}}(\mu^\star, \mu_{\hat\theta_n}) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta)$, starting from the observation that
\[
0 \leq d_{\mathscr{D}}(\mu^\star, \mu_{\hat\theta_n}) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) \leq 2\, d_{\mathscr{D}}(\mu^\star, \mu_n). \tag{14}
\]
In the present article, we develop a complementary point of view and measure the generalization properties of WGANs on the basis of the Wasserstein distance $d_{\mathrm{Lip}_1}$, as in equation (13). Our approach is motivated by the fact that the neural IPM $d_{\mathscr{D}}$ is only used for easing the optimization process and, accordingly, that the performance should be assessed on the basis of the distance $d_{\mathrm{Lip}_1}$, not $d_{\mathscr{D}}$.

Note that $\hat\theta_n$, which minimizes $d_{\mathscr{D}}(\mu_n, \mu_\theta)$ over $\Theta$, may not be unique. Besides, there is no guarantee that two distinct elements $\theta_{n,1}$ and $\theta_{n,2}$ of $\hat\Theta_n$ lead to the same distances $d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_{n,1}})$ and $d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_{n,2}})$ (again, $\hat\theta_n$ is computed with $d_{\mathscr{D}}$, not with $d_{\mathrm{Lip}_1}$). Therefore, in order to upper bound the error in (13), we let, for each $\theta_n \in \hat\Theta_n$,
\[
\bar\theta_n \in \operatorname*{arg\,min}_{\bar\theta \in \bar\Theta} \|\theta_n - \bar\theta\|.
\]
The rationale behind the definition of $\bar\theta_n$ is that we expect it to behave "similarly" to $\theta_n$. Following our objective, the error can be decomposed as follows:
\[
\begin{aligned}
0 \leq d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\hat\theta_n}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta)
&\leq \sup_{\theta_n \in \hat\Theta_n} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \\
&= \sup_{\theta_n \in \hat\Theta_n} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_n}) + d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_n}) \big] - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \\
&\leq \sup_{\theta_n \in \hat\Theta_n} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_n}) \big] + \sup_{\bar\theta \in \bar\Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta}) - \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) \\
&= \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}}, \tag{15}
\end{aligned}
\]
where we set $\varepsilon_{\mathrm{estim}} = \sup_{\theta_n \in \hat\Theta_n} [d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_n})]$. Notice that this supremum can be positive or negative. However, it can be shown to converge to 0 almost surely when $n \to \infty$.
Lemma 19 Assume that Assumption 1 is satisfied. Then $\lim_{n \to \infty} \varepsilon_{\mathrm{estim}} = 0$ almost surely.

Going further with the analysis of (13), the sum $\varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}}$ is bounded as follows:
\[
\begin{aligned}
\varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}}
&\leq \sup_{\theta_n \in \hat\Theta_n} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta_n}) \big] + T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \quad \text{(by inequality (10))} \\
&\leq \sup_{\theta_n \in \hat\Theta_n} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) \big] + T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}).
\end{aligned}
\]
Hence,
\[
\begin{aligned}
\varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}}
&\leq \sup_{\theta_n \in \hat\Theta_n} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - d_{\mathscr{D}}(\mu^\star, \mu_{\theta_n}) + d_{\mathscr{D}}(\mu^\star, \mu_{\theta_n}) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) \big] + T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \\
&\leq \sup_{\theta_n \in \hat\Theta_n} \big[ d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\theta_n}) - d_{\mathscr{D}}(\mu^\star, \mu_{\theta_n}) \big] + 2\, d_{\mathscr{D}}(\mu^\star, \mu_n) + T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \\
&\qquad \text{(upon noting that inequality (14) is also valid for any } \theta_n \in \hat\Theta_n\text{)} \\
&\leq 2\, T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) + 2\, d_{\mathscr{D}}(\mu^\star, \mu_n). \tag{16}
\end{aligned}
\]
The above bound is a function of both the generator and the discriminator. The term $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$ increases when the capacity of the generator increases. The discriminator, however, plays a more ambivalent role, as already pointed out by Zhang et al. (2018). On the one hand, if the discriminator's capacity decreases, the gap between $d_{\mathscr{D}}$ and $d_{\mathrm{Lip}_1}$ gets bigger and $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D})$ increases. On the other hand, discriminators with bigger capacities ought to increase the contribution $d_{\mathscr{D}}(\mu^\star, \mu_n)$. In order to bound $d_{\mathscr{D}}(\mu^\star, \mu_n)$, Proposition 20 below extends Zhang et al. (2018, Theorem 3.1), in the sense that it requires neither the set of discriminative functions nor the space $E$ to be bounded. Recall that, for $\gamma > 0$, $\mu^\star$ is said to be $\gamma$ sub-Gaussian (Jin et al., 2019) if
\[
\forall v \in \mathbb{R}^D, \quad \mathbb{E}\, e^{v \cdot (T - \mathbb{E}T)} \leq e^{\gamma \|v\|^2},
\]
where $T$ is a random vector with probability distribution $\mu^\star$ and $\cdot$ denotes the dot product in $\mathbb{R}^D$.
Proposition 20 Assume that Assumption 1 is satisfied, let $\eta \in (0, 1)$, and let $\mathscr{D}$ be a discriminator of the form (4).

(i) If $\mu^\star$ has compact support with diameter $B$, then there exists a constant $c_1 > 0$ such that, with probability at least $1 - \eta$,
\[
d_{\mathscr{D}}(\mu^\star, \mu_n) \leq \frac{c_1}{\sqrt{n}} + B \sqrt{\frac{\log(1/\eta)}{2n}}.
\]

(ii) More generally, if $\mu^\star$ is $\gamma$ sub-Gaussian, then there exists a constant $c_2 > 0$ such that, with probability at least $1 - \eta$,
\[
d_{\mathscr{D}}(\mu^\star, \mu_n) \leq \frac{c_2}{\sqrt{n}} + 8\gamma\sqrt{eD}\, \sqrt{\frac{\log(1/\eta)}{n}}.
\]

The result of Proposition 20 has to be compared with convergence rates of the Wasserstein distance. According to Fournier and Guillin (2015, Theorem 1), when the dimension $D$ of $E$ is such that $D > 2$, if $\mu^\star$ has a second-order moment, then there exists a constant $c$ such that
\[
0 \leq \mathbb{E}\, d_{\mathrm{Lip}_1}(\mu^\star, \mu_n) \leq c\, n^{-1/D}.
\]
Thus, when the space $E$ is of high dimension (e.g., in image generation tasks), under the conditions of Proposition 20, the pseudometric $d_{\mathscr{D}}$ provides much faster rates of convergence for the empirical measure. However, one has to keep in mind that both constants $c_1$ and $c_2$ grow in $O(q\sqrt{Q}(\sqrt{D} + q))$.

Our Theorem 7 states the existence of a discriminator such that $\varepsilon_{\mathrm{optim}}$ can be arbitrarily small. It is therefore reasonable, in view of inequality (16), to expect that the sum $\varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}}$ can also be arbitrarily small, at least in an asymptotic sense. This is encapsulated in Theorem 21 below.
Theorem 21 Assume that Assumption 1 is satisfied, and let $\eta \in (0, 1)$.

(i) If $\mu^\star$ has compact support with diameter $B$, then, for all $\varepsilon > 0$, there exists a discriminator $\mathscr{D}$ of the form (4) and a constant $c_1 > 0$ (independent of $\varepsilon$) such that, with probability at least $1 - \eta$,
\[
0 \leq \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}} \leq \varepsilon + \frac{2c_1}{\sqrt{n}} + 2B \sqrt{\frac{\log(1/\eta)}{2n}}.
\]

(ii) More generally, if $\mu^\star$ is $\gamma$ sub-Gaussian, then, for all $\varepsilon > 0$, there exists a discriminator $\mathscr{D}$ of the form (4) and a constant $c_2 > 0$ (independent of $\varepsilon$) such that, with probability at least $1 - \eta$,
\[
0 \leq \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}} \leq \varepsilon + \frac{2c_2}{\sqrt{n}} + 16\gamma\sqrt{eD}\, \sqrt{\frac{\log(1/\eta)}{n}}.
\]

Theorem 21 states that, asymptotically, the optimal parameters in $\hat\Theta_n$ behave properly. A caveat is that the definition of $\varepsilon_{\mathrm{estim}}$ uses $\hat\Theta_n$. However, in practice, one never has access to $\hat\theta_n$, but rather to an approximation of this quantity obtained by gradient descent algorithms. Thus, in line with Definition 8, we introduce the concept of empirical substitution:

Definition 22 Let $\varepsilon > 0$ and $\eta \in (0, 1)$. We say that $d_{\mathrm{Lip}_1}$ can be empirically $\varepsilon$-substituted by $d_{\mathscr{D}}$ if there exists $\delta > 0$ such that, for all $n$ large enough, with probability at least $1 - \eta$,
\[
\mathsf{M}^{d_{\mathscr{D}}}(\mu_n, \delta) \subseteq \mathsf{M}^{d_{\mathrm{Lip}_1}}(\mu^\star, \varepsilon). \tag{17}
\]

The rationale behind this definition is that if (17) is satisfied, then by minimizing the IPM $d_{\mathscr{D}}$ close to optimality in (12), one can be guaranteed to be also close to optimality in (5) with high probability. We stress that Definition 22 is the empirical counterpart of Definition 8.

Proposition 23 Assume that Assumption 1 is satisfied and that $\mu^\star$ is sub-Gaussian. Let $\varepsilon > 0$. If $T_{\mathscr{P}}(\mathrm{Lip}_1, \mathscr{D}) \leq \varepsilon$, then $d_{\mathrm{Lip}_1}$ can be empirically $(\varepsilon + \delta)$-substituted by $d_{\mathscr{D}}$ for all $\delta > 0$.

This proposition is the empirical counterpart of Lemma 11. It underlines the fact that by minimizing the pseudometric $d_{\mathscr{D}}$ between the empirical measure $\mu_n$ and the set of generative distributions $\mathscr{P}$ close to optimality, one can control the loss in performance under the metric $d_{\mathrm{Lip}_1}$.

Let us finally mention that it is also possible to provide asymptotic results on the sequences of parameters $(\hat\theta_n)$, keeping in mind that $\hat\Theta_n$ and $\bar\Theta$ are not necessarily reduced to singletons.

Lemma 24 Assume that Assumption 1 is satisfied. Let $(\hat\theta_n)$ be a sequence of optimal parameters that converges almost surely to $z \in \Theta$. Then $z \in \bar\Theta$ almost surely.

Proof Let the sequence $(\hat\theta_n)$ converge almost surely to some $z \in \Theta$. By Theorem 5, the function $\Theta \ni \theta \mapsto d_{\mathscr{D}}(\mu^\star, \mu_\theta)$ is continuous, and therefore, almost surely, $\lim_{n \to \infty} d_{\mathscr{D}}(\mu^\star, \mu_{\hat\theta_n}) = d_{\mathscr{D}}(\mu^\star, \mu_z)$. Using inequality (14), we see that, almost surely,
\[
0 \leq d_{\mathscr{D}}(\mu^\star, \mu_z) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) = \lim_{n \to \infty} d_{\mathscr{D}}(\mu^\star, \mu_{\hat\theta_n}) - \inf_{\theta \in \Theta} d_{\mathscr{D}}(\mu^\star, \mu_\theta) \leq \liminf_{n \to \infty} 2\, d_{\mathscr{D}}(\mu^\star, \mu_n).
\]
Using Dudley (2004, Theorem 11.4.1) and the strong law of large numbers, we have that the sequence of empirical measures $(\mu_n)$ almost surely converges weakly to $\mu^\star$ in $P_1(E)$. Besides, since $d_{\mathscr{D}}$ metrizes weak convergence in $P_1(E)$ (by Proposition 3), we conclude that $z \in \bar\Theta$ almost surely.
5. Understanding the performance of WGANs
In order to better understand the overall performance of the WGANs architecture, it is instructive to decompose the final loss $d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\hat\theta_n})$ as in (15):
\[
d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\hat\theta_n}) \leq \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}} + \inf_{\theta \in \Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_\theta) = \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}} + \varepsilon_{\mathrm{approx}}, \tag{18}
\]
where

(i) $\varepsilon_{\mathrm{estim}}$ matches up with the use of a data-dependent optimal parameter $\hat\theta_n$, based on the training set $X_1, \ldots, X_n$ drawn from $\mu^\star$;

(ii) $\varepsilon_{\mathrm{optim}}$ corresponds to the loss in performance when using $d_{\mathscr{D}}$ as training loss instead of $d_{\mathrm{Lip}_1}$ (this term has been thoroughly studied in Section 3);

(iii) and $\varepsilon_{\mathrm{approx}}$ stresses the capacity of the parametric family of generative distributions $\mathscr{P}$ to approach the unknown distribution $\mu^\star$.

Close to our work are the articles by Liang (2018), Singh et al. (2018), and Uppal et al. (2019), who study statistical properties of GANs. Liang (2018) and Singh et al. (2018) exhibit rates of convergence under an IPM-based loss for estimating densities that live in Sobolev spaces, while Uppal et al. (2019) explore the case of Besov spaces. Remarkably, Liang (2018) discusses bounds for the Kullback-Leibler divergence, the Hellinger divergence, and the Wasserstein distance between $\mu^\star$ and $\mu_{\hat\theta_n}$. These bounds are based on a different decomposition of the loss and offer a complementary point of view. We emphasize that, in the present article, no density assumption is made either on the class of generative distributions $\mathscr{P}$ or on the target distribution $\mu^\star$.

Our goal in this subsection is to illustrate (18) by running a set of experiments on synthetic datasets. The true probability measure $\mu^\star$ is assumed to be a mixture of bivariate Gaussian distributions with either 1, 4, or 9 components. This simple setting allows us to control the complexity of $\mu^\star$, and, in turn, to better assess the impact of both the generator's and discriminator's capacities. We use growing classes of generators of the form (3), namely $\{\mathscr{G}_p : p = 2, 3, 5, 7\}$, and growing classes of discriminators of the form (4), namely $\{\mathscr{D}_q : q = 2, 3, 5, 7\}$. For both the generator and the discriminator, the width of the hidden layers is kept constant, equal to 20.

Two metrics are computed to evaluate the behavior of the different generative models. First, we use the Wasserstein distance between the true distribution (either $\mu^\star$ or its empirical version $\mu_n$) and the generative distribution (either $\mu_{\bar\theta}$ or $\mu_{\hat\theta_n}$). This distance is calculated by using the Python package of Flamary and Courty (2017), via finite samples of size 4096 (average over 20 runs). Second, we use the recall metric (the higher, the better), proposed by Kynkäänniemi et al. (2019). Roughly, this metric measures "how much" of the true distribution (either $\mu^\star$ or $\mu_n$) can be reconstructed by the generative distribution (either $\mu_{\bar\theta}$ or $\mu_{\hat\theta_n}$). At the implementation level, this score is based on $k$-nearest neighbor nonparametric density estimation. It is computed via finite samples of size 4096 (average over 20 runs).

Our experiments were run in two different settings.

Asymptotic setting: in this first experiment, we assume that $\mu^\star$ is known from the experimenter (so, there is no dataset). At the end of the optimization scheme, we end up with one $\bar\theta \in \bar\Theta$. Thus, in this context, the performance of WGANs is captured by
\[
\sup_{\bar\theta \in \bar\Theta} d_{\mathrm{Lip}_1}(\mu^\star, \mu_{\bar\theta}) = \varepsilon_{\mathrm{optim}} + \varepsilon_{\mathrm{approx}}.
\]
For a fixed discriminator, when increasing the generator's depth $p$, we expect $\varepsilon_{\mathrm{approx}}$ to decrease. Conversely, as discussed in Subsection 3.1, we anticipate an increase of $\varepsilon_{\mathrm{optim}}$, since the discriminator must now differentiate between larger classes of generative distributions. In this case, it is thus difficult to predict how $\sup_{\bar\theta\in\bar\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_{\bar\theta})$ behaves when $p$ increases. On the contrary, in accordance with the results of Section 3, for a fixed $p$ we expect the performance to improve with a growing $q$ since, with larger discriminators, the pseudometric $d_{\mathcal{D}}$ is more likely to behave similarly to the Wasserstein distance $d_{\mathrm{Lip}}$.

These intuitions are validated by Figure 3 and Figure 4 (the bluer, the better). The first one shows an approximation of $\sup_{\bar\theta\in\bar\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_{\bar\theta})$, computed over 5 different seeds, as a function of $p$ and $q$. The second one depicts the average recall of the estimator $\mu_{\bar\theta}$ with respect to $\mu^\star$, as a function of $p$ and $q$, again computed over 5 different seeds. In both figures, we observe that, for a fixed $p$, incrementing $q$ leads to better results. On the opposite, for a fixed discriminator's depth $q$, increasing the depth $p$ of the generator seems to deteriorate both scores (Wasserstein distance and recall). This consequently suggests that the term $\varepsilon_{\mathrm{optim}}$ dominates $\varepsilon_{\mathrm{approx}}$.

Figure 3: Influence of the generator's depth $p$ and the discriminator's depth $q$ on the maximal Wasserstein distance $\sup_{\bar\theta\in\bar\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_{\bar\theta})$; panels (a), (b), (c) correspond to $K = 1$, $K = 9$, $K = 25$, with $p, q \in \{2, 3, 5, 7\}$ on the axes.

Figure 4: Influence of the generator's depth $p$ and the discriminator's depth $q$ on the average recall of the estimators $\mu_{\bar\theta}$ w.r.t. $\mu^\star$; panels (a), (b), (c) correspond to $K = 1$, $K = 9$, $K = 25$.

Finite-sample setting: in this second experiment, we consider the more realistic situation where we have at hand a finite sample $X_1, \ldots, X_n$ drawn from $\mu^\star$ ($n = 5000$). Recalling that $\sup_{\theta_n\in\hat\Theta_n} d_{\mathrm{Lip}}(\mu^\star, \mu_{\theta_n}) \leqslant \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}} + \varepsilon_{\mathrm{approx}}$, we plot in Figure 5 the maximal Wasserstein distance $\sup_{\theta_n\in\hat\Theta_n} d_{\mathrm{Lip}}(\mu^\star, \mu_{\theta_n})$, and in Figure 6 the average recall of the estimators $\mu_{\theta_n}$ with respect to $\mu^\star$, as functions of $p$ and $q$. Anticipating the behavior of $\sup_{\theta_n\in\hat\Theta_n} d_{\mathrm{Lip}}(\mu^\star, \mu_{\theta_n})$ when increasing the depth $q$ is now more involved. Indeed, according to inequality (16), which bounds $\varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}}$, a larger $\mathcal{D}$ will make $T_P(\mathrm{Lip}_1, \mathcal{D})$ smaller but will, on the opposite, increase $d_{\mathcal{D}}(\mu^\star, \mu_n)$. Figure 5 clearly shows that, for a fixed $p$, the maximal Wasserstein distance seems to improve when $q$ increases. This suggests that the term $T_P(\mathrm{Lip}_1, \mathcal{D})$ dominates $d_{\mathcal{D}}(\mu^\star, \mu_n)$. Similarly to the asymptotic setting, we also observe that a larger $p$ requires a larger depth $q$, since larger classes of generative distributions are harder to discriminate.
Figure 5: Influence of the generator's depth $p$ and the discriminator's depth $q$ on the maximal Wasserstein distance $\sup_{\theta_n\in\hat\Theta_n} d_{\mathrm{Lip}}(\mu^\star, \mu_{\theta_n})$, with $n = 5000$; panels (a), (b), (c) correspond to $K = 1$, $K = 4$, $K = 9$.

Figure 6: Influence of the generator's depth $p$ and the discriminator's depth $q$ on the average recall of the estimators $\mu_{\theta_n}$ w.r.t. $\mu^\star$, with $n = 5000$; panels (a), (b), (c) correspond to $K = 1$, $K = 9$, $K = 25$.

We end this subsection by pointing out a recurring observation across the different experiments. In Figure 4 and Figure 6, we notice, as already stressed, that the average recall of the estimators is prone to decrease when the generator's depth $p$ increases. On the opposite, the average recall increases when the discriminator's depth $q$ increases. This is interesting because the recall metric is a good proxy for stabilized training, insofar as a high recall means the absence of mode collapse. This is also confirmed in Figure 7, which compares two densities: in Figure 7a, the discriminator has a small capacity ($q = 3$) and the generator a large capacity ($p = 7$), whereas in Figure 7b, the discriminator has a large capacity ($q = 7$) and the generator a small capacity ($p = 3$). We observe that the first WGAN architecture behaves poorly compared to the second one. We therefore conclude that larger discriminators seem to bring some stability to the training of WGANs, both in the asymptotic and finite-sample regimes.

Figure 7: True distribution $\mu^\star$ (mixture of $K = 9$ bivariate Gaussian densities, green circles) and 2000 data points sampled from the generator $\mu_{\bar\theta}$ (blue dots); panel (a): $p = 7$ and $q = 3$, panel (b): $p = 3$ and $q = 7$.

5.2 Real-world datasets

In this subsection, we further illustrate the impact of the generator's and the discriminator's capacities on two high-dimensional datasets, namely MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017). MNIST contains images in $\mathbb{R}^{28\times 28}$ with 10 classes representing the digits. Fashion-MNIST is a 10-class dataset of images in $\mathbb{R}^{28\times 28}$, with slightly more complex shapes than MNIST. Both datasets have a training set of 60,000 examples.

To measure the performance of WGANs when dealing with high-dimensional applications such as image generation, Brock et al. (2019) have advocated that embedding images into a feature space with a pre-trained convolutional classifier provides more meaningful information. Therefore, in order to assess the quality of the generator $\mu_{\hat\theta_n}$, we sample images both from the empirical measure $\mu_n$ and from the distribution $\mu_{\hat\theta_n}$. Then, instead of computing the Wasserstein distance (or the recall) directly between these two samples, we use as a substitute their embeddings output by an external classifier and compute the Wasserstein distance (or the recall) between the two new collections. Such a transformation is also done, for example, in Kynkäänniemi et al. (2019).
Practically speaking, for any pair of images $(a, b)$, this operation amounts to using the Euclidean distance $\|\varphi(a) - \varphi(b)\|$ in the Wasserstein and recall criteria, where $\varphi$ is a pre-softmax layer of a supervised classifier, trained specifically on the datasets MNIST and Fashion-MNIST (a code sketch of this embedding step is given at the end of the section).

For these two datasets, as usual, we use generators of the form (3) and discriminators of the form (4), and plot the performance of $\mu_{\hat\theta_n}$ as a function of both $p$ and $q$. The results of Figure 8 confirm that the worst results are achieved for generators with a large depth $p$ combined with discriminators with a small depth $q$. They also corroborate the previous observation that larger discriminators are preferable.

Figure 8: Influence of the generator's depth $p$ and the discriminator's depth $q$ on the maximal Wasserstein distance $\sup_{\theta_n\in\hat\Theta_n} d_{\mathrm{Lip}}(\mu_n, \mu_{\theta_n})$ for the MNIST (panel a) and Fashion-MNIST (panel b) datasets.
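As an illustration of this embedding step, the sketch below (our own simplified stand-in: a small fully-connected classifier rather than the convolutional one suggested by Brock et al., 2019) extracts pre-softmax features $\varphi$ and feeds them to the Wasserstein and recall criteria of the earlier snippet in place of raw pixels.

```python
import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    """Hypothetical MNIST classifier; `features` plays the role of the pre-softmax map."""
    def __init__(self, n_classes=10, feat_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, n_classes)   # softmax layer, unused by phi

    def forward(self, x):
        return self.head(self.features(x))

def phi(classifier, images):
    """Pre-softmax embedding of a batch of images."""
    with torch.no_grad():
        return classifier.features(images)

clf = SmallClassifier()                     # in practice, trained beforehand on MNIST
real = torch.rand(256, 1, 28, 28)           # placeholder for samples from mu_n
fake = torch.rand(256, 1, 28, 28)           # placeholder for samples from the generator
emb_real, emb_fake = phi(clf, real).numpy(), phi(clf, fake).numpy()
# wasserstein_1(emb_real, emb_fake) and recall(emb_real, emb_fake) are then computed
# on these embeddings instead of the raw pixel vectors.
```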
Acknowledgments

We thank Flavian Vasile (Criteo AI Lab) and Clément Calauzènes (Criteo AI Lab) for stimulating discussions and insightful suggestions.
References
D. Acharya, Z. Huang, D.P. Paudel, and L. Van Gool. Towards high resolution videogeneration with progressive growing of sliced Wasserstein GANs. arXiv.1810.02419 , 2018.C. Anil, J. Lucas, and R. Grosse. Sorting out Lipschitz function approximation. InK. Chaudhuri and R. Salakhutdinov, editors,
Proceedings of the 36th International Con-ference on Machine Learning , volume 97, pages 291–301. PMLR, 2019.M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarialnetworks. In
International Conference on Learning Representations , 2017.M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. InD. Precup and Y.W. Teh, editors,
Proceedings of the 34th International Conference onMachine Learning , volume 70, pages 214–223. PMLR, 2017.R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks withrectified linear units. In
International Conference on Learning Representations , 2018.S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium ingenerative adversarial nets (GANs). In D. Precup and Y.W. Teh, editors,
Proceedingsof the 34th International Conference on Machine Learning , volume 70, pages 224–232.PMLR, 2017.G. Biau, B. Cadre, M. Sangnier, and U. Tanielian. Some theoretical properties of GANs.
The Annals of Statistics, in press, 2020. Å. Björck and C. Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix.
SIAM Journal on Numerical Analysis , 8:358–364, 1971.A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity naturalimage synthesis. In
International Conference on Learning Representations , 2019.A. Chernodub and D. Nowicki. Norm-preserving Orthogonal Permutation Linear Unitactivation functions (OPLU). arXiv.1604.02313 , 2016.G. Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics ofControl, Signals and Systems , 2:303–314, 1989.R.M. Dudley.
Real Analysis and Probability . Cambridge University Press, Cambridge, 2edition, 2004.W. Fedus, I. Goodfellow, and A.M. Dai. MaskGAN: Better text generation via filling inthe . In
International Conference on Learning Representations, 2018. R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017. URL https://github.com/rflamary/POT. N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure.
Probability Theory and Related Fields , 162:707–738, 2015.C.R. Givens and R.M. Shortt. A class of Wasserstein metrics for probability distributions.
Michigan Mathematical Journal, 31:231–240, 1984. X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In G. Gordon, D. Dunson, and M. Dudík, editors,
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15, pages 315–323. PMLR, 2011. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors,
Advances in Neural Information ProcessingSystems 27 , pages 2672–2680. Curran Associates, Inc., 2014.I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A.C. Courville. Improved trainingof Wasserstein GANs. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus,S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information ProcessingSystems 30 , pages 5767–5777. Curran Associates, Inc., 2017.K. Hornik. Approximation capabilities of multilayer feedforward networks.
Neural Networks ,4:251–257, 1991.K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universalapproximators.
Neural Networks , 2:359–366, 1989.C. Jin, P. Netrapalli, R. Ge, S.M. Kakade, and M. Jordan. A short note on concentrationinequalities for random vectors with subGaussian norm. arXiv.1902.03736 , 2019.L.V. Kantorovich and G.S. Rubinstein. On a space of completely additive functions.
VestnikLeningrad University Mathematics , 13:52–59, 1958.T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improvedquality, stability, and variation. In
International Conference on Learning Representations ,2018.T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adver-sarial networks. In
Proceedings of the IEEE Conference on Computer Vision and PatternRecognition , pages 4401–4410, 2019.N. Kodali, J. Abernethy, J. Hays, and Z. Kira. On convergence and stability of GANs. arXiv.1705.07215 , 2017.A. Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. InE.P. Xing and T. Jebara, editors,
Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 28–36. PMLR, 2014. T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved precision and recall metric for assessing generative models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems 32, pages 3927–3936. Curran Associates, Inc., 2019. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86:2278–2324, 1998. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pages 4681–4690, 2017.C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Poczos. MMD GAN: Towards deeperunderstanding of moment matching network. In I. Guyon, U. von Luxburg, S. Bengio,H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in NeuralInformation Processing Systems 30 , pages 2203–2213. Curran Associates, Inc., 2017.T. Liang. On how well generative adversarial networks learn densities: Nonparametric andparametric results. arXiv.1811.03179 , 2018.S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties ofgenerative adversarial learning. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach,R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in Neural InformationProcessing Systems 30 , pages 5551–5559. Curran Associates, Inc., 2017.C. McDiarmid. On the method of bounded differences. In J. Siemons, editor,
Surveys inCombinatorics , London Mathematical Society Lecture Note Series 141, pages 148–188.Cambridge University Press, Cambridge, 1989.L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv.1611.02163 , 2016.T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generativeadversarial networks. In
International Conference on Learning Representations, 2018. O. Mogren. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv:1611.09904, 2016. G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 27, pages 2924–2932. Curran Associates, Inc., 2014. A. Müller. Integral probability metrics and their generating classes of functions.
Advances in Applied Probability, 29:429–443, 1997. S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D.D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Information Processing Systems29 , pages 271–279. Curran Associates, Inc., 2016.M. O’Searcoid.
Metric Spaces. Springer, Dublin, 2006. R. Pascanu, G. Montúfar, and Y. Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. In
International Conference onLearning Representations , 2013.H. Petzka, A. Fischer, and D. Lukovnikov. On the regularization of Wasserstein GANs. In
International Conference on Learning Representations , 2018.A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deepconvolutional generative adversarial networks. arXiv:1511.06434 , 2015.K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative ad-versarial networks through regularization. In I. Guyon, U. von Luxburg, S. Bengio,H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in NeuralInformation Processing Systems 30 , pages 2018–2028. Curran Associates, Inc., 2017.T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improvedtechniques for training GANs. In D.D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon,and R. Garnett, editors,
Advances in Neural Information Processing Systems 29 , pages2234–2242. Curran Associates, Inc., 2016.T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regionsof deep neural networks. In
International Conference on Machine Learning , pages 4565–4573, 2018.S. Singh, A. Uppal, B. Li, C.-L. Li, M. Zaheer, and B. Poczos. Nonparametric densityestimation under adversarial losses. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information ProcessingSystems 31 , pages 10225–10236. Curran Associates, Inc., 2018.A. Uppal, S. Singh, and B. Poczos. Nonparametric density estimation and convergencerates for GANs under Besov IPM losses. In H. Wallach, H. Larochelle, A. Beygelzimer,F. d’Alch´e Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Pro-cessing Systems 32 , pages 9089–9100. Curran Associates, Inc., 2019.R. van Handel.
Probability in High Dimension . APC 550 Lecture Notes, Princeton Univer-sity, 2016.R. Vershynin.
High-Dimensional Probability: An Introduction with Applications in DataScience . Cambridge University Press, Cambridge, 2018.C. Villani.
Optimal Transport: Old and New . Springer, Berlin, 2008.X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. Improving the improved training of Wasser-stein GANs: A consistency term and its dual effect. In
International Conference on Learning Representations, 2018. H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017. L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 2852–2858. AAAI Press, 2017. P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He. On the discriminative-generalization tradeoff in GANs. In
International Conference on Learning Representations , 2018.J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial networks. In
International Conference on Learning Representations , 2017.Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu, and Z. Zhang. Lipschitzgenerative adversarial nets. In K. Chaudhuri and R. Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 7584–7593. PMLR, 2019.

Appendix A.
A.1 Proof of Lemma 1
We know that for each θ ∈ Θ, G θ is a feed-forward neural network of the form (3), whichmaps inputs z ∈ R d into E ⊂ R D . In particular, for z ∈ R d , G θ ( z ) = f p ◦ · · · ◦ f ( z ), where f i ( x ) = σ ( U i x + b i ) for i = 1 , . . . , p − σ is applied componentwise), and f p ( x ) = U p x + b p .Recall that the notation (cid:107) · (cid:107) (respectively, (cid:107) · (cid:107) ∞ ) means the Euclidean (respectively,the supremum) norm, with no specific mention of the underlying space on which it acts.For ( z, z (cid:48) ) ∈ ( R d ) , we have (cid:107) f ( z ) − f ( z (cid:48) ) (cid:107) (cid:54) (cid:107) U z + b − U z (cid:48) − b (cid:107) (since σ is 1-Lipschitz)= (cid:107) U ( z − z (cid:48) ) (cid:107) (cid:54) (cid:107) U (cid:107) (cid:107) z − z (cid:48) (cid:107) (cid:54) K (cid:107) z − z (cid:48) (cid:107) (by Assumption 1) . Repeating this for i = 2 , . . . , p , we thus have, for all ( z, z (cid:48) ) ∈ ( R d ) , (cid:107) G θ ( z ) − G θ ( z (cid:48) ) (cid:107) (cid:54) K p (cid:107) z − z (cid:48) (cid:107) . We conclude that, for each θ ∈ Θ, the function G θ is K p -Lipschitz on R d .Let us now prove that D ⊆ Lip . Fix D α ∈ D , α ∈ Λ. According to (4), we have, for x ∈ E , D α ( x ) = f q ◦ · · · ◦ f ( x ), where f i ( t ) = ˜ σ ( V i t + c i ) for i = 1 , . . . , q − σ is appliedon pairs of components), and f q ( t ) = V q t + c q .Consequently, for ( x, y ) ∈ E , (cid:107) f ( x ) − f ( y ) (cid:107) ∞ (cid:54) (cid:107) V x − V y (cid:107) ∞ (since ˜ σ is 1-Lipschitz)= (cid:107) V ( x − y ) (cid:107) ∞ (cid:54) (cid:107) V (cid:107) , ∞ (cid:107) x − y (cid:107) (cid:54) (cid:107) x − y (cid:107) (by Assumption 1) . Thus, (cid:107) f ◦ f ( x ) − f ◦ f ( y ) (cid:107) ∞ (cid:54) (cid:107) V f ( x ) − V f ( y ) (cid:107) ∞ (since ˜ σ is 1-Lipschitz) (cid:54) (cid:107) V (cid:107) ∞ (cid:107) f ( x ) − f ( y ) (cid:107) ∞ (cid:54) (cid:107) f ( x ) − f ( y ) (cid:107) ∞ (by Assumption 1) (cid:54) (cid:107) x − y (cid:107) . Repeating this, we conclude that, for each α ∈ Λ and all ( x, y ) ∈ E , | D α ( x ) − D α ( y ) | (cid:54) (cid:107) x − y (cid:107) , which is the desired result. iau, Sangnier and Tanielian A.2 Proof of Proposition 2
We first prove that the function Θ (cid:51) θ (cid:55)→ µ θ is continuous with respect to the weak topologyin P ( E ). Let G θ and G θ (cid:48) be two elements of G , with ( θ, θ (cid:48) ) ∈ Θ . Using (3), we write G θ ( z ) = f p ◦ · · · ◦ f ( z ) (respectively, G θ (cid:48) ( z ) = f (cid:48) p ◦ · · · ◦ f (cid:48) ( z )), where f i ( x ) = max( U i x + b i ,
0) (respectively, f (cid:48) i ( x ) = max( U (cid:48) i x + b (cid:48) i , i = 1 , . . . , p −
1, and f p ( x ) = U p x + b p (respectively, f (cid:48) p ( x ) = U (cid:48) p x + b (cid:48) p ).Clearly, for z ∈ R d , (cid:107) f ( z ) − f (cid:48) ( z ) (cid:107) (cid:54) (cid:107) U z + b − U (cid:48) z − b (cid:48) (cid:107) (cid:54) (cid:107) ( U − U (cid:48) ) z (cid:107) + (cid:107) b − b (cid:48) (cid:107) (cid:54) (cid:107) U − U (cid:48) (cid:107) (cid:107) z (cid:107) + (cid:107) b − b (cid:48) (cid:107) (cid:54) ( (cid:107) z (cid:107) + 1) (cid:107) θ − θ (cid:48) (cid:107) . Similarly, for any i ∈ { , . . . , p } and any x ∈ R u i , (cid:107) f i ( x ) − f (cid:48) i ( x ) (cid:107) (cid:54) ( (cid:107) x (cid:107) + 1) (cid:107) θ − θ (cid:48) (cid:107) . Observe that (cid:107) G θ ( z ) − G θ (cid:48) ( z ) (cid:107) = (cid:107) f p ◦ · · · ◦ f ( z ) − f (cid:48) p ◦ · · · ◦ f (cid:48) ( z ) (cid:107) (cid:54) (cid:107) f p ◦ · · · ◦ f ( z ) − f p ◦ · · · ◦ f ◦ f (cid:48) ( z ) (cid:107) + · · · + (cid:107) f p ◦ f (cid:48) p − ◦ · · · ◦ f (cid:48) ( z ) − f (cid:48) p ◦ · · · ◦ f (cid:48) ( z ) (cid:107) . As in the proof of Lemma 1, one shows that for any i ∈ { , . . . , p } , the function f p ◦ · · · ◦ f i is K p − i +11 -Lipschitz with respect to the Euclidean norm. Therefore, (cid:107) G θ ( z ) − G θ (cid:48) ( z ) (cid:107) (cid:54) K p − (cid:107) f ( z ) − f (cid:48) ( z ) (cid:107) + · · · + K (cid:107) f p ◦ f (cid:48) p − ◦ · · · ◦ f (cid:48) ( z ) − f (cid:48) p ◦ · · · ◦ f (cid:48) ( z ) (cid:107) (cid:54) K p − ( (cid:107) z (cid:107) + 1) (cid:107) θ − θ (cid:48) (cid:107) + · · · + ( (cid:107) f (cid:48) p − ◦ · · · ◦ f (cid:48) ( z ) (cid:107) + 1) (cid:107) θ − θ (cid:48) (cid:107) (cid:54) K p − ( (cid:107) z (cid:107) + 1) (cid:107) θ − θ (cid:48) (cid:107) + · · · + ( K p − (cid:107) z (cid:107) + (cid:107) f (cid:48) p − ◦ · · · ◦ f (cid:48) (0) (cid:107) + 1) (cid:107) θ − θ (cid:48) (cid:107) . Using the architecture of neural networks in (3), a quick check shows that, for each i ∈{ , . . . , p } , (cid:107) f (cid:48) i ◦ · · · ◦ f (cid:48) (0) (cid:107) (cid:54) i (cid:88) k =1 K k . We are led to (cid:107) G θ ( z ) − G θ (cid:48) ( z ) (cid:107) = ( (cid:96) (cid:107) z (cid:107) + (cid:96) ) (cid:107) θ − θ (cid:48) (cid:107) , (19)where (cid:96) = pK p − and (cid:96) = p − (cid:88) i =1 K p − ( i +1)1 i (cid:88) k =1 K k + p − (cid:88) i =0 K i . Denoting by ν the probability distribution of the sub-Gaussian random variable Z , we notethat (cid:82) R d ( (cid:96) (cid:107) z (cid:107) + (cid:96) ) ν (d z ) < ∞ . Now, let ( θ k ) be a sequence in Θ converging to θ ∈ Θ ome Theoretical Insights into Wasserstein GANs with respect to the Euclidean norm. Clearly, for a given z ∈ R d , by continuity of thefunction θ (cid:55)→ G θ ( z ), we have lim k →∞ G θ k ( z ) = G θ ( z ) and, for any ϕ ∈ C b ( E ), lim k →∞ ϕ ( G θ k ( z )) = ϕ ( G θ ( z )). Thus, by the dominated convergence theorem,lim k →∞ (cid:90) E ϕ ( x ) µ θ k (d x ) = lim k →∞ (cid:90) R d ϕ ( G θ k ( z )) ν (d z ) = (cid:90) R d ϕ (cid:0) G θ ( z )) ν (d z ) = (cid:90) E ϕ ( x ) µ θ (d x ) . (20)This shows that the sequence ( µ θ k ) converges weakly to µ θ . 
Besides, for an arbitrary x in E , we have lim sup k →∞ (cid:90) E (cid:107) x − x (cid:107) µ θ k (d x )= lim sup k →∞ (cid:90) R d (cid:107) x − G θ k ( z ) (cid:107) ν (d z ) (cid:54) lim sup k →∞ (cid:90) R d (cid:0) (cid:107) G θ k ( z ) − G θ ( z ) (cid:107) + (cid:107) G θ ( z ) − x (cid:107) (cid:1) ν (d z ) (cid:54) lim sup k →∞ (cid:90) R d ( (cid:96) (cid:107) z (cid:107) + (cid:96) ) (cid:107) θ k − θ (cid:107) ν (d z ) + (cid:90) R d (cid:107) G θ ( z ) − x (cid:107) ν (d z )(by inequality (19)) . Consequently,lim sup k →∞ (cid:90) E (cid:107) x − x (cid:107) µ θ k (d x ) (cid:54) (cid:90) R d (cid:107) G θ ( z ) − x (cid:107) ν (d z ) = (cid:90) E (cid:107) x − x (cid:107) µ θ (d x ) . One proves with similar arguments thatlim inf k →∞ (cid:90) E (cid:107) x − x (cid:107) µ θ k (d x ) (cid:62) (cid:90) E (cid:107) x − x (cid:107) µ θ (d x ) . Therefore, putting all the pieces together, we conclude thatlim k →∞ (cid:90) E (cid:107) x − x (cid:107) µ θ k (d x ) = (cid:90) E (cid:107) x − x (cid:107) µ θ (d x ) . This, together with (20), shows that the sequence ( µ θ k ) converges weakly to µ θ in P ( E ),and, in turn, that the function Θ (cid:51) θ (cid:55)→ µ θ is continuous with respect to the weak topologyin P ( E ), as desired.The second assertion of the proposition follows upon noting that P is the image of thecompact set Θ by a continuous function. A.3 Proof of Proposition 3
To show the first statement, we are to exhibit a specific discriminator, say D max , such that,for all ( µ, ν ) ∈ ( P ∪ { µ (cid:63) } ) , the identity d D max ( µ, ν ) = 0 implies µ = ν .Let ε >
0. According to Proposition 2, under Assumption 1, P is a compact subset of P ( E ) with respect to the weak topology in P ( E ). Let x ∈ E be arbitrary. For any µ ∈ P there exists a compact K µ ⊆ E such that (cid:82) K (cid:123) µ (cid:107) x − x (cid:107) µ (d x ) (cid:54) ε/
4. Also, for any such K µ , iau, Sangnier and Tanielian the function P ( E ) (cid:51) ρ (cid:55)→ (cid:82) K (cid:123) µ (cid:107) x − x (cid:107) ρ (d x ) is continuous. Therefore, there exists an openset U µ ⊆ P ( E ) containing µ such that, for any ρ ∈ U µ , (cid:82) K (cid:123) µ (cid:107) x − x (cid:107) ρ (d x ) (cid:54) ε/ { U µ : µ ∈ P } forms an open cover of P , from which we canextract, by compactness, a finite subcover U µ , . . . , U µ n . Letting K = ∪ ni =1 K µ i , we deducethat, for all µ ∈ P , (cid:82) K (cid:123) (cid:107) x − x (cid:107) µ (d x ) (cid:54) ε/
2. We conclude that there exists a compact K ⊆ E and x ∈ K such that, for any µ ∈ P ∪ { µ (cid:63) } , (cid:90) K (cid:123) || x − x || µ (d x ) (cid:54) ε/ . By Arzel`a-Ascoli theorem, it is easy to see that Lip ( K ), the set of 1-Lipschitz real-valued functions on K , is compact with respect to the uniform norm (cid:107) · (cid:107) ∞ on K . Let { f , . . . , f N ε } denote an ε -covering of Lip ( K ). According to Anil et al. (2019, Theorem 3),for each k = 1 , . . . , N ε there exists under Assumption 1 a discriminator D k of the form (4)such that inf g ∈ D k (cid:107) f k − g K (cid:107) ∞ (cid:54) ε. Since the discriminative classes of functions use GroupSort activations, one can find a neuralnetwork of the form (4) satisfying Assumption 1, say D max , such that, for all k ∈ { , . . . , N ε } , D k ⊆ D max . Consequently, for any f ∈ Lip ( K ), letting k ∈ arg min k ∈{ ,..., N ε } (cid:107) f − f k (cid:107) ∞ , we haveinf g ∈ D max (cid:107) f − g K (cid:107) ∞ (cid:54) (cid:107) f − f k (cid:107) ∞ + inf g ∈ D max (cid:107) f k − g K (cid:107) ∞ (cid:54) ε. Now, let ( µ, ν ) ∈ ( P (cid:83) { µ (cid:63) } ) be such that d D max ( µ, ν ) = 0, i.e., sup f ∈ D max | E µ f − E ν f | = 0.Let f (cid:63) be a function in Lip such that E µ f (cid:63) − E ν f (cid:63) = d Lip ( µ, ν ) (such a function existsaccording to (8)) and, without loss of generality, such that f (cid:63) ( x ) = 0. Clearly, d Lip ( µ, ν ) = E µ f (cid:63) − E ν f (cid:63) (cid:54) (cid:12)(cid:12)(cid:12) (cid:90) K f (cid:63) d µ − (cid:90) K f (cid:63) d ν (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) (cid:90) K (cid:123) f (cid:63) d µ − (cid:90) K (cid:123) f (cid:63) d ν (cid:12)(cid:12)(cid:12) (cid:54) (cid:12)(cid:12)(cid:12) (cid:90) K f (cid:63) d µ − (cid:90) K f (cid:63) d ν (cid:12)(cid:12)(cid:12) + ε. Letting g f (cid:63) ∈ D max be such that (cid:107) ( f (cid:63) − g f (cid:63) ) K (cid:107) ∞ (cid:54) inf g ∈ D max (cid:107) ( f (cid:63) − g ) K (cid:107) ∞ + ε (cid:54) ε, we are thus led to d Lip ( µ, ν ) (cid:54) (cid:12)(cid:12)(cid:12) (cid:90) K ( f (cid:63) − g f (cid:63) )d µ − (cid:90) K ( f (cid:63) − g f (cid:63) )d ν + (cid:90) K g f (cid:63) d µ − (cid:90) K g f (cid:63) d ν (cid:12)(cid:12)(cid:12) + ε. Observe, since x ∈ K , that | g f (cid:63) ( x ) | (cid:54) ε and that, for any x ∈ E , | g f (cid:63) ( x ) | (cid:54) (cid:107) x − x (cid:107) + 3 ε .Exploiting E µ g f (cid:63) − E ν g f (cid:63) = 0, we obtain d Lip ( µ, ν ) (cid:54) ε + (cid:12)(cid:12)(cid:12) (cid:90) K (cid:123) g f (cid:63) d µ − (cid:90) K (cid:123) g f (cid:63) d ν (cid:12)(cid:12)(cid:12) (cid:54) ε + (cid:90) K (cid:123) (cid:107) x − x (cid:107) µ (d x ) + (cid:90) K (cid:123) (cid:107) x − x (cid:107) ν (d x ) + 6 ε (cid:54) ε. ome Theoretical Insights into Wasserstein GANs Since ε is arbitrary and d Lip is a metric on P ( E ), we conclude that µ = ν , as desired.To complete the proof, it remains to show that d D max metrizes weak convergence in P ∪ { µ (cid:63) } . To this aim, we let ( µ k ) be a sequence in P ∪ { µ (cid:63) } and µ be a probabilitymeasure in P ∪ { µ (cid:63) } .If ( µ k ) converges weakly to µ in P ( E ), then d Lip ( µ, µ k ) → d D max ( µ, µ k ) → d D max ( µ, µ k ) →
0, and fix ε >
0. There exists
M > k (cid:62) M , d D max ( µ, µ k ) (cid:54) ε . Using a similar reasoning as in the first partof the proof, it is easy to see that for any k (cid:62) M , we have d Lip ( µ, µ k ) (cid:54) ε . Since theWasserstein distance metrizes weak convergence in P ( E ) and ε is arbitrary, we concludethat ( µ k ) converges weakly to µ in P ( E ). A.4 Proof of Lemma 4
Using a similar reasoning as in the proof of Proposition 2, one easily checks that for all( α, α (cid:48) ) ∈ Λ and all x ∈ E , | D α ( x ) − D α (cid:48) ( x ) | (cid:54) Q / (cid:0) q (cid:107) x (cid:107) + K q − (cid:88) i =1 i + q (cid:1) (cid:107) α − α (cid:48) (cid:107) (cid:54) Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) (cid:107) α − α (cid:48) (cid:107) , where q refers to the depth of the discriminator. Thus, since D ⊂ Lip (by Lemma 1), wehave, for all α ∈ Λ, all x ∈ E , and any arbitrary x ∈ E , | D α ( x ) | (cid:54) | D α ( x ) − D α ( x ) | + | D α ( x ) | (cid:54) (cid:107) x − x (cid:107) + Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) (cid:107) α (cid:107) (upon noting that D ( x ) = 0) (cid:54) (cid:107) x − x (cid:107) + Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) Q / max( K , , where Q is the dimension of Λ. Thus, since µ (cid:63) and the µ θ ’s belong to P ( E ) (by Lemma 1),we deduce that all D α ∈ D are dominated by a function independent of α and integrable withrespect to µ (cid:63) and µ θ . In addition, for all x ∈ E , the function α (cid:55)→ D α ( x ) is continuous on Λ.Therefore, by the dominated convergence theorem, the function Λ (cid:51) α (cid:55)→ | E µ (cid:63) D α − E µ θ D α | is continuous. The conclusion follows from the compactness of the set Λ (Assumption 1). A.5 Proof of Theorem 5
Let ( θ, θ (cid:48) ) ∈ Θ , and let γ Z be the joint distribution of the pair ( G θ ( Z ) , G θ (cid:48) ( Z )). We have | ξ Lip ( θ ) − ξ Lip ( θ (cid:48) ) | = | d Lip ( µ (cid:63) , µ θ ) − d Lip ( µ (cid:63) , µ θ (cid:48) ) | (cid:54) d Lip ( µ θ , µ θ (cid:48) )= inf γ ∈ Π( µ θ ,µ θ (cid:48) ) (cid:90) E (cid:107) x − y (cid:107) γ (d x, d y ) , iau, Sangnier and Tanielian where Π( µ θ , µ θ (cid:48) ) denotes the collection of all joint probability measures on E × E withmarginals µ θ and µ θ (cid:48) . Thus, | ξ Lip ( θ ) − ξ Lip ( θ (cid:48) ) | (cid:54) (cid:90) E (cid:107) x − y (cid:107) γ Z (d x, d y )= (cid:90) R d (cid:107) G θ ( z ) − G θ (cid:48) ( z ) (cid:107) ν (d z )(where ν is the distribution of Z ) (cid:54) (cid:107) θ − θ (cid:48) (cid:107) (cid:90) R d ( (cid:96) (cid:107) z (cid:107) + (cid:96) ) ν (d z )(by inequality (19)) . This shows that the function θ (cid:51) Θ (cid:55)→ ξ Lip ( θ ) is L -Lipschitz, with L = (cid:82) R d ( (cid:96) (cid:107) z (cid:107) + (cid:96) ) ν (d z ).For the second statement of the theorem, just note that | ξ D ( θ ) − ξ D ( θ (cid:48) ) | = | d D ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ (cid:48) ) | (cid:54) d D ( µ θ , µ θ (cid:48) ) (cid:54) d Lip ( µ θ , µ θ (cid:48) )(since D ⊆ Lip ) (cid:54) L (cid:107) θ − θ (cid:48) (cid:107) . A.6 Proof of Theorem 7
The proof is divided into two parts. First, we show that under Assumption 1, for all ε > θ ∈ Θ, there exists a discriminator D (function of ε and θ ) of the form (4) such that d Lip ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ ) (cid:54) ε. Let f (cid:63) be a function in Lip such that E µ (cid:63) f (cid:63) − E µ θ f (cid:63) = d Lip ( µ (cid:63) , µ θ ) (such a function existsaccording to (8)). We may write d Lip ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ ) = E µ (cid:63) f (cid:63) − E µ θ f (cid:63) − sup f ∈ D | E µ (cid:63) f − E µ θ f | = E µ (cid:63) f (cid:63) − E µ θ f (cid:63) − sup f ∈ D ( E µ (cid:63) f − E µ θ f )= inf f ∈ D ( E µ (cid:63) f (cid:63) − E µ θ f (cid:63) − E µ (cid:63) f + E µ θ f )= inf f ∈ D ( E µ (cid:63) ( f (cid:63) − f ) − E µ θ ( f (cid:63) − f )) (cid:54) inf f ∈ D ( E µ (cid:63) | f (cid:63) − f | + E µ θ | f (cid:63) − f | ) . (21)Next, for any f ∈ D and any compact K ⊆ E , E µ (cid:63) | f (cid:63) − f | = E µ (cid:63) | f (cid:63) − f | K + E µ (cid:63) | f (cid:63) − f | K (cid:123) (cid:54) (cid:107) ( f (cid:63) − f ) K (cid:107) ∞ + E µ (cid:63) | f (cid:63) | K (cid:123) + E µ (cid:63) | f | K (cid:123) . ome Theoretical Insights into Wasserstein GANs Since f (cid:63) and f are integrable with respect to µ (cid:63) and µ θ , there exists a compact set K (function of ε and θ ) such thatmax (cid:0) E µ (cid:63) | f (cid:63) | K (cid:123) , E µ (cid:63) | f | K (cid:123) , E µ θ | f (cid:63) | K (cid:123) , E µ θ | f | K (cid:123) (cid:1) (cid:54) ε. Thus, for such a choice of K , E µ (cid:63) | f (cid:63) − f | (cid:54) (cid:107) ( f (cid:63) − f ) K (cid:107) ∞ + 2 ε. Similarly, E µ θ | f (cid:63) − f | (cid:54) (cid:107) ( f (cid:63) − f ) K (cid:107) ∞ + 2 ε. Plugging the two inequalities above in (21), we obtain d Lip ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ ) (cid:54) f ∈ D (cid:107) ( f (cid:63) − f ) K (cid:107) ∞ + 4 ε. According to Anil et al. (2019, Theorem 3), under Assumption 1, we can find a discriminatorof the form (4) such that inf f ∈ D (cid:107) ( f (cid:63) − f ) K (cid:107) ∞ (cid:54) ε . We conclude that, for this choice of D (function of ε and θ ), d Lip ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ ) (cid:54) ε, (22)as desired.For the second part of the proof, we fix ε > θ ∈ Θ and eachdiscriminator of the form (4),ˆ ξ D ( θ ) = d Lip ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ ) . Arguing as in the proof of Theorem 5, we see that ˆ ξ D ( θ ) is 2 L -Lipschitz in θ , where L = (cid:82) R d ( (cid:96) (cid:107) z (cid:107) + (cid:96) ) ν (d z ) and ν is the probability distribution of Z .Now, let { θ , . . . , θ N ε } be an ε -covering of the compact set Θ, i.e., for each θ ∈ Θ,there exists k ∈ { , . . . , N ε } such that (cid:107) θ − θ k (cid:107) (cid:54) ε . According to (22), for each such k ,there exists a discriminator D k such that ˆ ξ D k ( θ k ) (cid:54) ε . Since the discriminative classes offunctions use GroupSort activation functions, one can find a neural network of the form (4)satisfying Assumption 1, say D max , such that, for all k ∈ { , . . . , N ε } , D k ⊆ D max . Clearly,ˆ ξ D max ( θ ) is 2 L -Lipschitz, and, for all k ∈ { , . . . , N ε } , ˆ ξ D max ( θ k ) (cid:54) ε . Hence, for all θ ∈ Θ,letting ˆ k ∈ arg min k ∈{ ,..., N ε } (cid:107) θ − θ k (cid:107) , we have ˆ ξ D max ( θ ) (cid:54) (cid:12)(cid:12) ˆ ξ D max ( θ ) − ˆ ξ D max ( θ ˆ k ) (cid:12)(cid:12) + ˆ ξ D max ( θ ˆ k ) (cid:54) (2 L + 6) ε. 
Therefore, T P (Lip , D max ) = sup θ ∈ Θ (cid:2) d Lip ( µ (cid:63) , µ θ ) − d D max ( µ (cid:63) , µ θ ) (cid:3) = sup θ ∈ Θ ˆ ξ D max ( θ ) (cid:54) (2 L + 6) ε. We have just proved that, for all ε >
0, there exists a discriminator D max of the form (4)and a positive constant c (independent of ε ) such that T P (Lip , D max ) (cid:54) cε. This is the desired result. iau, Sangnier and Tanielian A.7 Proof of Proposition 9
Let us assume that the statement is not true. Then there exists $\varepsilon > 0$ such that, for all $\delta > 0$, there exists $\theta \in M_{d_{\mathcal{D}}}(\mu^\star, \delta)$ satisfying $d(\theta, \bar\Theta) > \varepsilon$. Consider $\delta_n = 1/n$, and choose a sequence of parameters $(\theta_n)$ such that
$$\theta_n \in M_{d_{\mathcal{D}}}\big(\mu^\star, \tfrac{1}{n}\big) \quad\text{and}\quad d(\theta_n, \bar\Theta) > \varepsilon.$$
Since $\Theta$ is compact by Assumption 1, we can find a subsequence $(\theta_{\varphi_n})$ that converges to some $\theta_{\mathrm{acc}} \in \Theta$. Thus, for all $n \geqslant 1$, we have
$$d_{\mathcal{D}}(\mu^\star, \mu_{\theta_{\varphi_n}}) \leqslant \inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta) + \frac{1}{n},$$
and, by continuity of the function $\Theta \ni \theta \mapsto d_{\mathcal{D}}(\mu^\star, \mu_\theta)$ (Theorem 5),
$$d_{\mathcal{D}}(\mu^\star, \mu_{\theta_{\mathrm{acc}}}) \leqslant \inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta).$$
We conclude that $\theta_{\mathrm{acc}}$ belongs to $\bar\Theta$. This contradicts the fact that $d(\theta_{\mathrm{acc}}, \bar\Theta) \geqslant \varepsilon$.

A.8 Proof of Lemma 13
Since $a = b$, according to Definition 12, there exists a continuously differentiable, strictly increasing function $f : \mathbb{R}_+ \to \mathbb{R}_+$ such that, for all $\mu \in P$,
$$d_{\mathrm{Lip}}(\mu^\star, \mu) = f\big(d_{\mathcal{D}}(\mu^\star, \mu)\big).$$
For $(\theta, \theta') \in \Theta^2$ we have, as $f$ is strictly increasing,
$$d_{\mathcal{D}}(\mu^\star, \mu_\theta) \leqslant d_{\mathcal{D}}(\mu^\star, \mu_{\theta'}) \iff f\big(d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) \leqslant f\big(d_{\mathcal{D}}(\mu^\star, \mu_{\theta'})\big).$$
Therefore,
$$d_{\mathcal{D}}(\mu^\star, \mu_\theta) \leqslant d_{\mathcal{D}}(\mu^\star, \mu_{\theta'}) \iff d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) \leqslant d_{\mathrm{Lip}}(\mu^\star, \mu_{\theta'}).$$
This proves the first statement of the lemma.

Let us now show that $d_{\mathrm{Lip}}$ can be fully substituted by $d_{\mathcal{D}}$. Let $\varepsilon > 0$. Then, for $\delta > 0$ (function of $\varepsilon$, to be chosen later) and $\theta \in M_{d_{\mathcal{D}}}(\mu^\star, \delta)$, we have
$$d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) - \inf_{\theta\in\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) = f\big(d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) - \inf_{\theta\in\Theta} f\big(d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) = f\big(d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) - f\big(\inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) \leqslant \sup_{\theta\in M_{d_{\mathcal{D}}}(\mu^\star, \delta)} \Big|f\big(d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) - f\big(\inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big)\Big|.$$
According to Theorem 5, there exists a nonnegative constant $c$ such that, for any $\theta \in \Theta$, $d_{\mathcal{D}}(\mu^\star, \mu_\theta) \leqslant c$. Therefore, using the definition of $M_{d_{\mathcal{D}}}(\mu^\star, \delta)$ and the fact that $f$ is continuously differentiable, we are led to
$$d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) - \inf_{\theta\in\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) \leqslant \delta \sup_{x\in[0,c]} \Big|\frac{\partial f(x)}{\partial x}\Big|.$$
The conclusion follows by choosing $\delta$ such that $\delta \sup_{x\in[0,c]} \big|\frac{\partial f(x)}{\partial x}\big| \leqslant \varepsilon$.

A.9 Proof of Proposition 14
Let $\delta \in (0, 1)$ and $\theta \in M_{d_{\mathcal{D}}}(\mu^\star, \delta)$, i.e., $d_{\mathcal{D}}(\mu^\star, \mu_\theta) - \inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta) \leqslant \delta$. As $d_{\mathrm{Lip}}$ is monotonously equivalent to $d_{\mathcal{D}}$, there exist a continuously differentiable, strictly increasing function $f : \mathbb{R}_+ \to \mathbb{R}_+$ and $(a, b) \in (\mathbb{R}_+^\star)^2$ such that
$$\forall \mu \in P, \quad a f\big(d_{\mathcal{D}}(\mu^\star, \mu)\big) \leqslant d_{\mathrm{Lip}}(\mu^\star, \mu) \leqslant b f\big(d_{\mathcal{D}}(\mu^\star, \mu)\big).$$
So,
$$d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) \leqslant b f\big(\inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta) + \delta\big) \leqslant b f\big(\inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) + O(\delta).$$
Also,
$$\inf_{\theta\in\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) \geqslant a f\big(\inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big).$$
Therefore,
$$d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) - \inf_{\theta\in\Theta} d_{\mathrm{Lip}}(\mu^\star, \mu_\theta) \leqslant (b - a) f\big(\inf_{\theta\in\Theta} d_{\mathcal{D}}(\mu^\star, \mu_\theta)\big) + O(\delta).$$
A.10 Proof of Lemma 15

Let $f : \mathbb{R}^D \to \mathbb{R}$ be in $\mathrm{AFF} \cap \mathrm{Lip}_1$. It is of the form $f(x) = x \cdot u + b$, where $u = (u_1, \ldots, u_D)$, $b \in \mathbb{R}$, and $\|u\| \leqslant 1$. Our objective is to prove that there exists a discriminator of the form (4) with $q = 2$ and $v = 2$ that contains the function $f$. To see this, define $V_1 \in \mathcal{M}(2, D)$ and the offset vector $c_1 \in \mathcal{M}(2, 1)$ as
$$V_1 = \begin{bmatrix} u_1 & \cdots & u_D \\ u_1 & \cdots & u_D \end{bmatrix} \quad\text{and}\quad c_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
Letting $V_2 \in \mathcal{M}(1, 2)$ and $c_2 \in \mathcal{M}(1, 1)$ be
$$V_2 = \begin{bmatrix} 1/2 & 1/2 \end{bmatrix}, \qquad c_2 = \begin{bmatrix} b \end{bmatrix},$$
we readily obtain $V_2\,\tilde\sigma(V_1 x + c_1) + c_2 = f(x)$. Besides, it is easy to verify that $\|V_1\|_{2,\infty} \leqslant 1$.
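As a side remark, this construction is easy to check numerically; the sketch below (our own illustration, with the admissible choice of $V_2$ made above — any pair of entries summing to one would do) verifies that the two-layer GroupSort network reproduces $f(x) = x \cdot u + b$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
u = rng.normal(size=D)
u /= np.linalg.norm(u)                 # unit vector, so ||u|| <= 1
b = 0.7

V1 = np.stack([u, u])                  # 2 x D, both rows equal to u: ||V1||_{2,inf} = ||u|| <= 1
c1 = np.zeros(2)
V2 = np.array([[0.5, 0.5]])            # 1 x 2
c2 = np.array([b])

def groupsort2(t):
    """GroupSort activation on pairs of components."""
    return np.sort(t.reshape(-1, 2), axis=1).reshape(t.shape)

x = rng.normal(size=D)
out = V2 @ groupsort2(V1 @ x + c1) + c2
assert np.allclose(out, x @ u + b)     # the GroupSort network coincides with f(x) = x.u + b
```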
A.11 Proof of Lemma 16

Let $\mu$ and $\nu$ be two probability measures in $P(E)$ with supports $S_\mu$ and $S_\nu$ satisfying the conditions of the lemma. Let $\pi$ be an optimal coupling between $\mu$ and $\nu$, and let $(X, Y)$ be a random pair with distribution $\pi$ such that $d_{\mathrm{Lip}}(\mu, \nu) = \mathbb{E}\|X - Y\|$. Clearly, any function $f \in \mathrm{Lip}_1$ satisfying $f(X) - f(Y) = \|X - Y\|$ almost surely will be such that
$$d_{\mathrm{Lip}}(\mu, \nu) = |\mathbb{E}_\mu f - \mathbb{E}_\nu f|.$$
The proof will be achieved if we show that such a function $f$ exists and that it may be chosen linear. Since $S_\mu$ and $S_\nu$ are disjoint and convex, we can find a unit vector $u$ of $\mathbb{R}^D$, directing the line containing both $S_\mu$ and $S_\nu$, such that $(x_0 - y_0) \cdot u > 0$, where $(x_0, y_0)$ is an arbitrary pair of $S_\mu \times S_\nu$. Letting $f(x) = x \cdot u$ ($x \in E$), we have, for all $(x, y) \in S_\mu \times S_\nu$, $f(x) - f(y) = (x - y) \cdot u = \|x - y\|$. Since $f$ is a linear and 1-Lipschitz function on $E$, this concludes the proof.
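A quick numerical illustration (our own, with two uniform distributions supported on disjoint segments of the same line) shows that the exact 1-Wasserstein distance computed with the POT package indeed coincides with the expectation gap of the linear witness $f(x) = x \cdot u$:

```python
import numpy as np
import ot

rng = np.random.default_rng(1)
u = np.array([1.0, 0.0])                             # direction of the common line
x_mu = rng.uniform(0.0, 1.0, size=(500, 1)) * u      # S_mu = [0, 1] x {0}
x_nu = rng.uniform(3.0, 4.0, size=(500, 1)) * u      # S_nu = [3, 4] x {0}, disjoint from S_mu

# exact 1-Wasserstein distance between the two empirical measures
M = ot.dist(x_mu, x_nu, metric="euclidean")
w1 = ot.emd2(ot.unif(len(x_mu)), ot.unif(len(x_nu)), M)

# expectation gap of the linear 1-Lipschitz witness f(x) = x . u
gap = abs((x_mu @ u).mean() - (x_nu @ u).mean())
print(w1, gap)   # the two values coincide up to solver tolerance
```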
A.12 Proof of Lemma 17

For any pair of probability measures $(\mu, \nu)$ on $E$ with finite moment of order 2, we let $W_2(\mu, \nu)$ be the Wasserstein distance of order 2 between $\mu$ and $\nu$. Recall (Villani, 2008, Definition 6.1) that
$$W_2(\mu, \nu) = \Big(\inf_{\pi\in\Pi(\mu,\nu)} \int_{E\times E} \|x - y\|^2\, \pi(\mathrm{d}x, \mathrm{d}y)\Big)^{1/2},$$
where $\Pi(\mu, \nu)$ denotes the collection of all joint probability measures on $E \times E$ with marginals $\mu$ and $\nu$. By Jensen's inequality,
$$d_{\mathrm{Lip}}(\mu, \nu) = W_1(\mu, \nu) \leqslant W_2(\mu, \nu).$$
Let $\Sigma \in \mathcal{M}(D, D)$ be a positive semi-definite matrix, and let $\mu$ be Gaussian $N(m_1, \Sigma)$ and $\nu$ be Gaussian $N(m_2, \Sigma)$. Denoting by $(X, Y)$ a random pair with marginal distributions $\mu$ and $\nu$ such that $\mathbb{E}\|X - Y\| = W_1(\mu, \nu)$, we have
$$\|m_1 - m_2\| = \|\mathbb{E}(X - Y)\| \leqslant \mathbb{E}\|X - Y\| = W_1(\mu, \nu) \leqslant W_2(\mu, \nu) = \|m_1 - m_2\|,$$
where the last equality follows from Givens and Shortt (1984, Proposition 7). Thus, $d_{\mathrm{Lip}}(\mu, \nu) = \|m_1 - m_2\|$. The proof will be finished if we show that $d_{\mathrm{AFF}\cap\mathrm{Lip}_1}(\mu, \nu) \geqslant \|m_1 - m_2\|$. To see this, consider the linear and 1-Lipschitz function $f : E \ni x \mapsto x \cdot \frac{m_1 - m_2}{\|m_1 - m_2\|}$ (with the convention $0 \times \infty = 0$), and note that
$$d_{\mathrm{AFF}\cap\mathrm{Lip}_1}(\mu, \nu) \geqslant \Big|\int_E x \cdot \tfrac{m_1 - m_2}{\|m_1 - m_2\|}\, \mu(\mathrm{d}x) - \int_E y \cdot \tfrac{m_1 - m_2}{\|m_1 - m_2\|}\, \nu(\mathrm{d}y)\Big| = \Big|\int_E x \cdot \tfrac{m_1 - m_2}{\|m_1 - m_2\|}\, \mu(\mathrm{d}x) - \int_E (x - m_1 + m_2) \cdot \tfrac{m_1 - m_2}{\|m_1 - m_2\|}\, \mu(\mathrm{d}x)\Big| = \|m_1 - m_2\|.$$

A.13 Proof of Proposition 18
Let ε >
0, and let µ and ν be two probability measures in P ( E ) with compact supports S µ and S ν such that max(diam( S µ ) , diam( S ν )) (cid:54) εd ( S µ , S ν ). Throughout the proof, it is ome Theoretical Insights into Wasserstein GANs assumed that d ( S µ , S ν ) >
0, otherwise the result is immediate. Let π be an optimal couplingbetween µ and ν , and let ( X, Y ) be a random pair with distribution π such that d Lip ( µ, ν ) = E (cid:107) X − Y (cid:107) . Any function f ∈ Lip satisfying (cid:107) X − Y (cid:107) (cid:54) (1 + 2 ε )( f ( X ) − f ( Y )) almost surely will besuch that d Lip ( µ, ν ) (cid:54) (1 + 2 ε ) | E µ f − E ν f | . Thus, the proof will be completed if we show that such a function f exists and that it maybe chosen affine.Since S µ and S ν are compact, there exists ( x (cid:63) , y (cid:63) ) ∈ S µ × S ν such that (cid:107) x (cid:63) − y (cid:63) (cid:107) = d ( S µ , S ν ). By the hyperplane separation theorem, there exists a hyperplane H orthogonalto the unit vector u = x (cid:63) − y (cid:63) (cid:107) x (cid:63) − y (cid:63) (cid:107) such that d ( x (cid:63) , H ) = d ( y (cid:63) , H ) = (cid:107) x (cid:63) − y (cid:63) (cid:107) . For any x ∈ E ,we denote by p H ( x ) the projection of x onto H . We thus have d ( x, H ) = (cid:107) x − p H ( x ) (cid:107) ,and x (cid:63) + y (cid:63) = p H ( x (cid:63) + y (cid:63) ) = p H ( x (cid:63) ) = p H ( y (cid:63) ). In addition, by convexity of S µ and S ν , forany x ∈ S µ , (cid:107) x − p H ( x ) (cid:107) (cid:62) (cid:107) x (cid:63) − p H ( x (cid:63) ) (cid:107) . Similarly, for any y ∈ S ν , (cid:107) y − p H ( y ) (cid:107) (cid:62) (cid:107) y (cid:63) − p H ( y (cid:63) ) (cid:107) .Let the affine function f be defined for any x ∈ E by f ( x ) = ( x − p H ( x )) · u. Observe that f ( x ) = f ( x + x (cid:63) + y (cid:63) ). Clearly, for any ( x, y ) ∈ E , one has | f ( x ) − f ( y ) | = (cid:12)(cid:12) f (cid:0) x − y + x (cid:63) + y (cid:63) (cid:1)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:0)(cid:0) x − y + x (cid:63) + y (cid:63) (cid:1) − p H (cid:0) x − y + x (cid:63) + y (cid:63) (cid:1)(cid:1) .u (cid:12)(cid:12) (cid:54) (cid:13)(cid:13)(cid:0) x − y + x (cid:63) + y (cid:63) − p H (cid:0) x − y + x (cid:63) + y (cid:63) (cid:1)(cid:13)(cid:13) (cid:54) (cid:13)(cid:13) x − y + x (cid:63) + y (cid:63) − x (cid:63) + y (cid:63) (cid:13)(cid:13) (since x (cid:63) + y (cid:63) ∈ H )= (cid:107) x − y (cid:107) . Thus, f belongs to Lip . Besides, for any ( x, y ) ∈ S µ × S ν , we have (cid:107) x − y (cid:107) (cid:54) (cid:107) x − p H ( x ) (cid:107) + (cid:107) p H ( x ) − p H ( y ) (cid:107) + (cid:107) p H ( y ) − y (cid:107) (cid:54) ( x − p H ( x )) · u − ( y − p H ( y )) · u + (cid:13)(cid:13) p H ( x ) − x (cid:63) + y (cid:63) (cid:13)(cid:13) + (cid:13)(cid:13) p H ( y ) − x (cid:63) + y (cid:63) (cid:13)(cid:13) = ( x − p H ( x )) · u − ( y − p H ( y )) · u + (cid:107) p H ( x ) − p H ( x (cid:63) ) (cid:107) + (cid:107) p H ( y ) − p H ( y (cid:63) ) (cid:107) . iau, Sangnier and Tanielian Thus, (cid:107) x − y (cid:107) (cid:54) ( x − p H ( x )) · u − ( y − p H ( y )) · u + 2 max(diam( S µ ) , diam( S ν )) (cid:54) f ( x ) − f ( y ) + 2 εd ( S µ , S ν )= f ( x ) − f ( y ) + 2 ε ( f ( x (cid:63) ) − f ( y (cid:63) ))= f ( x ) − f ( y ) + 2 ε ( f ( x (cid:63) ) − f ( x ) + f ( x ) − f ( y ) + f ( y ) − f ( y (cid:63) )) (cid:54) (1 + 2 ε )( f ( x ) − f ( y ))(using the fact that f ( x (cid:63) ) − f ( x ) (cid:54) f ( y (cid:63) ) − f ( y ) (cid:62) . Since f ∈ Lip , we conclude that, for any ( x, y ) ∈ S µ × S ν , | f ( x ) − f ( y ) | (cid:54) (cid:107) x − y (cid:107) (cid:54) (1 + 2 ε )( f ( x ) − f ( y )) . A.14 Proof of Lemma 19
Using Dudley (2004, Theorem 11.4.1) and the strong law of large numbers, the sequenceof empirical measures ( µ n ) almost surely converges weakly in P ( E ) to µ (cid:63) . Thus, we havelim n →∞ d Lip ( µ (cid:63) , µ n ) = 0 almost surely, and so lim n →∞ d D ( µ (cid:63) , µ n ) = 0 almost surely. Hence,recalling inequality (14), we conclude thatsup θ n ∈ ˆΘ n d D ( µ (cid:63) , µ θ n ) − inf θ ∈ Θ d D ( µ (cid:63) , µ θ ) → ε > (cid:51) θ (cid:55)→ d Lip ( µ (cid:63) , µ θ ) is L -Lipschitz, for some L >
0. According to (23) and Proposition 9, almost surely, there existsan integer
N > n (cid:62) N , for all θ n ∈ ˆΘ n , the companion ¯ θ n ∈ ¯Θ is suchthat (cid:107) θ n − ¯ θ n (cid:107) (cid:54) εL . We conclude by observing that | ε estim | (cid:54) sup θ n ∈ ˆΘ n | d Lip ( µ (cid:63) , µ θ n ) − d Lip ( µ (cid:63) , µ ¯ θ n ) | (cid:54) L × εL . A.15 Proof of Proposition 20
Let µ n be the empirical measure based on n i.i.d. samples X , . . . , X n distributed accordingto µ (cid:63) . Recall (equation (7)) that d D ( µ (cid:63) , µ n ) = sup α ∈ Λ | E µ (cid:63) D α − E µ n D α | = sup α ∈ Λ (cid:12)(cid:12)(cid:12) E µ (cid:63) D α − n n (cid:88) i =1 D α ( X i ) (cid:12)(cid:12)(cid:12) . Let g be the real-valued function defined on E n by g ( x , . . . , x n ) = sup α ∈ Λ (cid:12)(cid:12)(cid:12) E µ (cid:63) D α − n n (cid:88) i =1 D α ( x i ) (cid:12)(cid:12)(cid:12) . ome Theoretical Insights into Wasserstein GANs Observe that, for ( x , . . . , x n ) ∈ E n and ( x (cid:48) , . . . , x (cid:48) n ) ∈ E n , | g ( x , . . . , x n ) − g ( x (cid:48) , . . . , x (cid:48) n ) | (cid:54) sup α ∈ Λ (cid:12)(cid:12)(cid:12) n n (cid:88) i =1 D α ( x i ) − n n (cid:88) i =1 D α ( x (cid:48) i ) (cid:12)(cid:12)(cid:12) (cid:54) n sup α ∈ Λ n (cid:88) i =1 | D α ( x i ) − D α ( x (cid:48) i ) | (cid:54) n n (cid:88) i =1 (cid:107) x i − x (cid:48) i (cid:107) . (24)We start by examining statement ( i ), where µ (cid:63) has compact support with diameter B . Inthis case, letting X (cid:48) i be an independent copy of X i , we have, almost surely, | g ( X , . . . , X n ) − g ( X , . . . , X (cid:48) i , . . . , X n ) | (cid:54) Bn .
An application of McDiarmid’s inequality (McDiarmid, 1989) shows that for any η ∈ (0 , − η , d D ( µ (cid:63) , µ n ) (cid:54) E d D ( µ (cid:63) , µ n ) + B (cid:114) log(1 /η )2 n . (25)Next, for each α ∈ Λ, let Y α denote the random variable defined by Y α = E µ (cid:63) D α − n n (cid:88) i =1 D α ( X i ) . Using a similar reasoning as in the proof of Proposition 2, one shows that for any ( α, α (cid:48) ) ∈ Λ and any x ∈ E , | D α ( x ) − D α (cid:48) ( x ) | (cid:54) Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) (cid:107) α − α (cid:48) (cid:107) , where we recall that q is the depth of the discriminator. Since µ (cid:63) has compact support, (cid:96) = (cid:90) E Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) µ (cid:63) (d x ) < ∞ . Observe that | Y α − Y α (cid:48) | (cid:54) n (cid:107) α − α (cid:48) (cid:107) | ξ ( n ) | , where ξ n = n (cid:88) i =1 Q / (cid:0) (cid:96) + q (cid:107) X i (cid:107) + q ( q − K q (cid:1) . Thus, using Vershynin (2018, Proposition 2.5.2), there exists a positive constant c = O ( qQ / ( D / + q )) such that, for all λ ∈ R , E e λ ( Y α − Y α (cid:48) ) (cid:54) E e λ n (cid:107) α − α (cid:48) (cid:107) | ξ n | (cid:54) e c n (cid:107) α − α (cid:48) (cid:107) λ . iau, Sangnier and Tanielian We conclude that the process ( Y α ) is sub-Gaussian (van Handel, 2016, Definition 5.20) forthe distance d ( α, α (cid:48) ) = c (cid:107) α − α (cid:48) (cid:107)√ n . Therefore, using van Handel (2016, Corollary 5.25), wehave E d D ( µ (cid:63) , µ n ) = E sup α ∈ Λ (cid:12)(cid:12)(cid:12) E µ (cid:63) D α − n n (cid:88) i =1 D α ( X i ) (cid:12)(cid:12)(cid:12) (cid:54) c √ n (cid:90) ∞ (cid:112) log N (Λ , (cid:107) · (cid:107) , u )d u, where N (Λ , (cid:107) · (cid:107) , u ) is the u -covering number of Λ for the norm (cid:107) · (cid:107) . Since Λ is bounded,there exists r > N (Λ , (cid:107) · (cid:107) , u ) = 1 for u (cid:62) rQ / and N (Λ , (cid:107) · (cid:107) , u ) ≤ (cid:18) rQ / u (cid:19) Q for u < rQ / . Thus, E d D ( µ (cid:63) , µ n ) (cid:54) c √ n for some positive constant c = O ( qQ / ( D / + q )). Combining this inequality with (25)shows the first statement of the lemma.We now turn to the more general situation (statement ( ii )) where µ (cid:63) is γ sub-Gaussian.According to inequality (24), the function g is n -Lipschitz with respect to the 1-norm on E n .Therefore, by combining Kontorovich (2014, Theorem 1) and Vershynin (2018, Proposition2.5.2), we have that for any η ∈ (0 , − η , d D ( µ (cid:63) , µ n ) (cid:54) E d D ( µ (cid:63) , µ n ) + 8 γ √ eD (cid:114) log(1 /η ) n . (26)As in the first part of the proof, we let Y α = E µ (cid:63) D α − n n (cid:88) i =1 D α ( X i ) , and recall that for any ( α, α (cid:48) ) ∈ Λ and any x ∈ E , | D α ( x ) − D α (cid:48) ( x ) | (cid:54) Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) (cid:107) α − α (cid:48) (cid:107) . Since µ (cid:63) is sub-Gaussian, we have (see, e.g., Jin et al., 2019, Lemma 1), (cid:96) = (cid:90) E Q / (cid:0) q (cid:107) x (cid:107) + q ( q − K q (cid:1) µ (cid:63) (d x ) < ∞ . Thus, | Y α − Y α (cid:48) | (cid:54) n (cid:107) α − α (cid:48) (cid:107) | ξ ( n ) | , where ξ n = n (cid:88) i =1 Q / (cid:0) (cid:96) + q (cid:107) X i (cid:107) + q ( q − K q (cid:1) . According to Jin et al. 
(2019, Lemma 1), the real-valued random variable ξ n is sub-Gaussian.We obtain that, for some positive constant c = O ( qQ / ( D / + q )), E d D ( µ (cid:63) , µ n ) (cid:54) c √ n , and the conclusion follows by combining this inequality with (26). ome Theoretical Insights into Wasserstein GANs A.16 Proof of Theorem 21
Let $\varepsilon > 0$ and $\eta \in (0, 1)$. According to Theorem 7, there exists a discriminator class $\mathcal{D}$ of the form (4) (i.e., a collection of neural networks) such that
$$T_P(\mathrm{Lip}_1, \mathcal{D}) \leqslant \varepsilon.$$
We only prove statement (i), since both proofs are similar. In this case, according to Proposition 20, there exists a constant $c > 0$ such that, with probability at least $1 - \eta$,
$$d_{\mathcal{D}}(\mu^\star, \mu_n) \leqslant \frac{c}{\sqrt{n}} + B\sqrt{\frac{\log(1/\eta)}{2n}}.$$
Therefore, using inequality (16), we have, with probability at least $1 - \eta$,
$$0 \leqslant \varepsilon_{\mathrm{estim}} + \varepsilon_{\mathrm{optim}} \leqslant \varepsilon + \frac{2c}{\sqrt{n}} + 2B\sqrt{\frac{\log(1/\eta)}{2n}}.$$

A.17 Proof of Proposition 23
Observe that, for θ ∈ Θ,0 (cid:54) d D ( µ (cid:63) , µ θ ) − inf θ ∈ Θ d D ( µ (cid:63) , µ θ )= d D ( µ (cid:63) , µ θ ) − d D ( µ n , µ θ ) + d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ n , µ θ )+ inf θ ∈ Θ d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ (cid:63) , µ θ ) (cid:54) d D ( µ (cid:63) , µ n ) + d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ n , µ θ ) + d D ( µ (cid:63) , µ n )= 2 d D ( µ (cid:63) , µ n ) + d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ n , µ θ ) , where we used respectively the triangle inequality and | inf θ ∈ Θ d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ (cid:63) , µ θ ) | (cid:54) sup θ ∈ Θ | d D ( µ (cid:63) , µ θ ) − d D ( µ n , µ θ ) | (cid:54) d D ( µ (cid:63) , µ n ) . Thus, assuming that T P (Lip , D ) (cid:54) ε , we have0 (cid:54) d Lip ( µ (cid:63) , µ θ ) − inf θ ∈ Θ d Lip ( µ (cid:63) , µ θ ) (cid:54) d Lip ( µ (cid:63) , µ θ ) − d D ( µ (cid:63) , µ θ ) + d D ( µ (cid:63) , µ θ ) − inf θ ∈ Θ d D ( µ (cid:63) , µ θ ) (cid:54) T P (Lip , D ) + d D ( µ (cid:63) , µ θ ) − inf θ ∈ Θ d D ( µ (cid:63) , µ θ ) (cid:54) ε + 2 d D ( µ (cid:63) , µ n ) + d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ n , µ θ ) . (27)Let δ > θ ∈ M d D ( µ n , δ/ d D ( µ n , µ θ ) − inf θ ∈ Θ d D ( µ n , µ θ ) (cid:54) δ/ . For η ∈ (0 , N ∈ N (cid:63) such that, for all n (cid:62) N , 2 d D ( µ (cid:63) , µ n ) (cid:54) δ/ − η .Therefore, we conclude from (27) that for n (cid:62) N , with probability at least 1 − η , d Lip ( µ (cid:63) , µ θ ) − inf θ ∈ Θ d Lip ( µ (cid:63) , µ θ ) (cid:54) ε + δ.δ.