Functional Space Analysis of Local GAN Convergence
Valentin Khrulkov, Artem Babenko, Ivan Oseledets

Yandex, Russia; National Research University Higher School of Economics, Moscow, Russia; Skolkovo Institute of Science and Technology, Moscow, Russia. Correspondence to: Valentin Khrulkov <[email protected]>. Preprint.

Abstract
Recent work demonstrated the benefits of studying continuous-time dynamics governing the GAN training. However, this dynamics is analyzed in the model parameter space, which results in finite-dimensional dynamical systems. We propose a novel perspective where we study the local dynamics of adversarial training in the general functional space and show how it can be represented as a system of partial differential equations. Thus, the convergence properties can be inferred from the eigenvalues of the resulting differential operator. We show that these eigenvalues can be efficiently estimated from the target dataset before training. Our perspective reveals several insights on the practical tricks commonly used to stabilize GANs, such as gradient penalty, data augmentation, and advanced integration schemes. As an immediate practical benefit, we demonstrate how one can a priori select an optimal data augmentation strategy for a particular generation task.
1. Introduction
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) allow for efficient learning of complicated probability distributions from samples. However, training such models is notoriously complicated: the dynamics can exhibit oscillatory behavior, and the convergence can be very slow. To alleviate this, a number of practical tricks have been proposed, including the usage of data augmentation (Karras et al., 2020a; Zhao et al., 2020; Zhang et al., 2020) and optimization methods inspired by numerical integration schemes for ODEs (Qin et al., 2020). Nevertheless, the theory explaining the success of these methods is only starting to catch up with practice.

The standard way of training a GAN is to use simultaneous gradient descent. In Nagarajan & Kolter (2017) it was shown that under mild assumptions, this method is locally convergent. The authors of Mescheder et al. (2018) demonstrated that when these assumptions are not satisfied, the usage of regularization methods such as the gradient penalty (Roth et al., 2017) or consensus optimization (Mescheder et al., 2017) is required to achieve convergence. In Balduzzi et al. (2018); Nie & Patel (2020); Liang & Stokes (2019) the local convergence is studied based on the eigenvalues of the Jacobian of the dynamics near the equilibrium. Such analysis involves the parameterization of the generator and the discriminator by neural networks and usually does not take into account the properties of the target distribution.

Our key idea is that this dynamics can be efficiently analyzed in the functional space. This is achieved by constructing a local quadratic approximation of the GAN training objective and writing down the dynamics as a system of partial differential equations (PDEs). The differential operator underlying this system has a remarkably simple form, and its spectrum can be analyzed in terms of the fundamental properties of the target distribution. Furthermore, we show that the convergence of the dynamics is determined by the Poincaré constant of this distribution. Intuitively, this constant describes the connectivity properties of a distribution; for instance, it is smaller for multimodal datasets with disconnected modes, where GAN convergence is known to be slower. This connection is practically important since the Poincaré constant of a dataset can be easily estimated a priori, and the larger it is, the better is the expected convergence of a GAN. Thus, we can analyze common techniques that alter the training set in terms of their effect on the Poincaré constant. For instance, one can choose a proper augmentation strategy that increases the Poincaré constant the most.

The main contributions of our paper are:
• We develop a linearized GAN training model in the functional space in the form of a PDE.
• We derive explicit formulas connecting the eigenfunctions and eigenvalues of the resulting PDE operator with the target distribution properties. This connection provides a theoretical justification for common GAN stabilization techniques.
• We describe an efficient, practical recipe for choosing optimal parameters for gradient penalty and data augmentations for a particular dataset.
2. Second-order approximation of the GAN objective
We assume that data samples {X_i}_{i=1}^N ⊂ R^d are produced by a target probability measure µ. The generator function G(z) is a deterministic mapping from a latent space R^l to R^d; z is sampled from a known probability measure µ_z. The discriminator D is a real-valued function on R^d whose goal is to distinguish between real and fake samples. The GAN objective can be written as a min-max problem

max_G min_D f(D, G),   (1)

where

f(D, G) = E_{x∼µ} φ₁(D(x)) + E_{z∼µ_z} φ₂(D(G(z))),   (2)

and φ₁, φ₂ are some scalar convex twice differentiable functions. For example, for the LSGAN (Mao et al., 2017), φ₁(x) = ½(x − 1)² and φ₂(x) = ½(x + 1)².

We analyze the behaviour of a GAN in a neighborhood of the Nash equilibrium corresponding to an optimal discriminator function D_* and an optimal generator function G_*. We make standard assumptions (Nagarajan & Kolter, 2017; Mescheder et al., 2018; Nie & Patel, 2020) that the optimal discriminator satisfies D_*(x) = 0 for all x and that the measure produced by G_* is equal to µ. Similarly, we also assume that φ₁''(0) + φ₂''(0) ≥ 0 and φ₁'(0) = −φ₂'(0) ≠ 0. In this setting we can recover many common GAN variations, including the vanilla GAN (Goodfellow et al., 2014), WGAN (Arjovsky et al., 2017), and LSGAN (Mao et al., 2017).

The standard way of solving the min-max problem is to represent D and G by some parametric models and solve it in the parameter space. The convergence of the resulting dynamics can be studied by linearizing it around the equilibrium point (Nagarajan & Kolter, 2017; Mescheder et al., 2018; Nie & Patel, 2020). This is equivalent to the construction of the quadratic approximation of f(D, G) in the parameter space. In this paper, we do the quadratic approximation first, obtaining a new saddle point problem in a functional space that approximates the local dynamics of standard GAN training methods. This representation has a remarkably simple form, and its properties can be studied in detail.

Technical assumptions.
We assume that µ is a measure with positive density ρ > 0 on R^d, i.e., dµ = ρ(x) dx. In practice, this can be (formally) achieved by smoothing a discrete data distribution with Gaussian noise. As the functional space, we consider L²(µ) = L²(µ, R^d) equipped with a natural weighted inner product:

⟨u₁, u₂⟩_µ = ∫ u₁(x) u₂(x) dµ.

We will also often utilize differential operators, which should be understood in the distributional sense. In these cases, the functions under study are assumed to be elements of the corresponding Sobolev space. Specifically, we will utilize the weighted Sobolev spaces H¹(µ) and H²(µ). These are spaces of generalized functions whose distributional first-order and second-order derivatives, respectively, lie in L²(µ).

We start by finding the second-order approximation to the GAN objective near the Nash equilibrium in the functional space. We assume that G_* is invertible, i.e., for any x ∼ µ there exists a unique z ∼ µ_z such that G_*(z) = x. Let (δD, δG) be the functional variations of the optimal discriminator and generator, respectively.

Theorem 1.
Let D = D_* + δD = δD and G = G_* + δG. Let us denote u(x) = δD(x), v(x) = (δG)(G_*^{-1}(x)). Then, (2) can be approximated as

f(D, G) = f₀ + g(u, v) + higher-order terms in u, v,

where

g(u, v) = (α/2)⟨u, u⟩_µ + β⟨∇_x u, v⟩_µ,

with α = φ₁''(0) + φ₂''(0), β = φ₂'(0), and f₀ = f(D_*, G_*).

Proof. The proof is straightforward: we use the Taylor expansion to approximate both terms in the objective up to second order and also use the fact that at the Nash equilibrium the first-order terms sum up to 0:

f(D_* + δD, G_* + δG) = f(δD, G_* + δG)
= E_{x∼µ} φ₁(δD(x)) + E_{z∼µ_z} φ₂(δD(G_*(z) + δG(z)))
= f₀ + (α/2) E_{x∼µ} δD²(x) + β E_{z∼µ_z} ⟨∇_x δD(G_*(z)), δG(z)⟩ + higher-order terms.

From the definitions of ⟨·,·⟩_µ, u and v the statement of the theorem follows.
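As a quick sanity check of the constants in Theorem 1, the sketch below (our own illustration, not code from the paper; the finite-difference step size is an arbitrary choice) evaluates α and β numerically for the LSGAN losses given in Section 2.

```python
def alpha_beta(phi1, phi2, h=1e-4):
    """Finite-difference estimates of alpha = phi1''(0) + phi2''(0), beta = phi2'(0)."""
    d2 = lambda f: (f(h) - 2.0 * f(0.0) + f(-h)) / h ** 2   # second derivative at 0
    d1 = lambda f: (f(h) - f(-h)) / (2.0 * h)               # first derivative at 0
    return d2(phi1) + d2(phi2), d1(phi2)

# LSGAN losses from Section 2: phi1(x) = (x - 1)^2 / 2, phi2(x) = (x + 1)^2 / 2
print(alpha_beta(lambda x: 0.5 * (x - 1) ** 2, lambda x: 0.5 * (x + 1) ** 2))
# -> approximately (2.0, 1.0)
```

The same two-line check can be repeated for any other twice differentiable pair (φ₁, φ₂) to verify the assumptions of Section 2 before training.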
3. Continuous gradient descent-ascent in the functional space
Now we need to solve the min-max problem of the form

min_u max_v g(u, v),   g(u, v) = (α/2)⟨u, u⟩_µ + β⟨∇_x u, v⟩_µ.   (3)

We will refer to (3) as the linearized GAN problem (lGAN). Since it is a local quadratic approximation of the original min-max problem, the local behavior of GAN training methods can be understood on this task. The linearized GAN problem depends solely on the measure µ, and we will see which fundamental properties of this measure determine the convergence of gradient-based methods for solving (3).

We will use for u the weighted Sobolev space H¹(µ), and similarly for v. The continuous descent-ascent flow in the functional space can be written as

u_t = −∇_u g(u, v),   v_t = ∇_v g(u, v),   (4)

where ∇_u g and ∇_v g are the Fréchet derivatives of the functional g(·,·) with respect to u and v. It is important to define the meaning of (4): it should be understood in the weak sense. Select a test function û and take the scalar product of the first equation with it. We get

⟨u_t, û⟩_µ = −⟨∇_u g(u, v), û⟩_µ,   ∀ û ∈ H¹(µ).   (5)

A similar equation holds for v. The evaluation of the right-hand side is done by integration by parts, which leads to the system

⟨u_t, û⟩_µ = −α⟨u, û⟩_µ − β⟨∇_x û, v⟩_µ,   ∀ û ∈ H¹(µ),
⟨v_t, v̂⟩_µ = β⟨∇_x u, v̂⟩_µ,   ∀ v̂ ∈ H¹(µ).   (6)

By making use of the density ρ, equations (6) can be rewritten in the strong form as

ρ u_t = −αρu + β∇_x·(ρv),
ρ v_t = ρβ∇_x u.   (7)

We are interested in the asymptotic behavior of u and v as t goes to infinity. It is completely described by the eigenvalues of the operator on the right-hand side of (6). We will say that (u_λ, v_λ) is an eigenfunction with an eigenvalue λ if it satisfies the following system of equations:

λ⟨u_λ, û⟩_µ = −α⟨u_λ, û⟩_µ − β⟨∇_x û, v_λ⟩_µ,   ∀ û ∈ H¹(µ),
λ⟨v_λ, v̂⟩_µ = β⟨∇_x u_λ, v̂⟩_µ,   ∀ v̂ ∈ H¹(µ).   (8)
4. Eigenfunctions and eigenvalues of the linearized operator

If (u_λ, v_λ) form an eigenfunction, then the solution of the time-dependent problem (6) with the initial conditions (u(0) = u_λ, v(0) = v_λ) can be written as

u(t) = u_λ exp(λt),   v(t) = v_λ exp(λt).   (9)

We will see that {(u_λ, v_λ)}_λ form a basis in our functional space, and thus an arbitrary solution can be written as a series of terms of the form (9). Thus, the real parts of the spectrum of the operator in (6) should be less than 0 in order for the system to have asymptotic stability. We will demonstrate that our system does not have eigenvalues with a positive real part but has a naturally interpretable kernel. We find that the kernel has the following form.
Corollary 1.
Let (u₀, v₀) be an eigenfunction with λ = 0. Then,

u₀ = C,   ⟨∇_x û, v₀⟩_µ = 0,   ∀ û ∈ H¹(µ),   (10)

or in the strong form:

u₀ = C,   ∇_x·(ρv₀) = 0.   (11)

Here C is a constant such that Cα = 0, i.e., for α ≠ 0 we get C = 0, and C ∈ R otherwise.

Proof. From (8) we observe that an element (u₀, v₀) of the kernel satisfies the following equations for all û ∈ H¹(µ):

u₀ = C,   −α⟨C, û⟩_µ − β⟨∇_x û, v₀⟩_µ = 0.   (12)

Let us choose û = 1. From the second equation it follows that αC = 0, as desired.

This kernel has a straightforward interpretation. If we add a function v that satisfies (10) to G_*, the mapped measure stays the same up to second-order terms. Indeed, consider (10) in the strong form: we obtain that ∇_x·(ρv) = 0. Recall that the function x + v(x) = x + δG(G_*^{-1}(x)) maps a sample x to a sample from the synthetic density produced by G_* + δG, i.e., v(x) can be considered as a velocity of each sample. If samples evolve with velocity v(x), the differential equation for the density takes the form

ρ_t = −∇_x·(ρv) = 0,

i.e., the density is invariant under such a transformation. We will refer to the condition (10) as the divergence-free condition on v.
4.1. Weighted Laplace operator

We now address the non-zero eigenvalues of our operator. It will be convenient to utilize the weighted Laplace operator ∆_µ, which is defined (in the weak form) as follows:

⟨∆_µ w, ŵ⟩_µ := −⟨∇_x w, ∇_x ŵ⟩_µ,   ∀ ŵ ∈ H¹(µ).   (13)

In the strong form, ∆_µ takes the form

∆_µ w = (1/ρ) ∇_x·(ρ∇_x w),   (14)

which reduces to the standard Laplacian in the case of the standard Lebesgue measure on R^d. This operator commonly appears in the study of diffusion processes (Coifman & Lafon, 2006) and weighted heat equations (Grigoryan, 2009).
Spectrum of the weighted Laplacian. In what follows we will use the eigenvalues and eigenfunctions of ∆_µ. Firstly, this is a self-adjoint non-positive definite operator, and under mild assumptions on µ (Cianchi et al., 2011; Grigoryan, 2009), e.g., when ρ decreases sufficiently fast, it has a discrete spectrum. Let us study it in more detail. Consider the eigenvalue problem for −∆_µ written in the weak form:

⟨∇_x ŵ, ∇_x w_ξ⟩_µ = ξ⟨ŵ, w_ξ⟩_µ,   ∀ ŵ ∈ H¹(µ).   (15)

We note that w ≡ 1 is the eigenfunction with ξ = 0; thus, due to the self-adjointness of ∆_µ, every other eigenfunction w_ξ satisfies ⟨1, w_ξ⟩_µ = 0, i.e., it has zero mean with respect to µ.

The Poincaré constant of µ. Let us consider the smallest non-zero eigenvalue ξ_min of −∆_µ. We obtain that the following inequality holds:

⟨∇_x w, ∇_x w⟩_µ ≥ ξ_min ⟨w, w⟩_µ,   ⟨1, w⟩_µ = 0,   ∀ w ∈ H¹(µ),   (16)

and the minimum in this inequality is achieved by w_{ξ_min} (Grigoryan, 2009). The value of ξ_min (sometimes its inverse) is called the Poincaré constant of the measure µ and often appears in Sobolev-type inequalities (Adams & Fournier, 2003). The exact values of ξ_min for a given measure are almost never known analytically. To make use of our results in practical settings, we propose a simple neural-network-based approach for the estimation of ξ_min. We discuss it in Section 6.
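For intuition in one dimension, ξ_min can also be approximated directly by discretizing the weak eigenproblem (15) on a grid. The sketch below is our own illustration (not part of the paper's method); the domain truncation and grid size are arbitrary choices.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.stats import norm

# Sketch (ours, not from the paper): estimate xi_min for a 1-D density rho by
# assembling the weak eigenproblem (15) on a grid and solving K w = xi M w.
def poincare_constant_1d(rho, a=-6.0, b=6.0, n=1200):
    x = np.linspace(a, b, n)
    h = x[1] - x[0]
    rho_mid = rho(0.5 * (x[:-1] + x[1:]))   # density at cell interfaces
    K = np.zeros((n, n))                    # stiffness matrix: int rho w' u' dx
    i = np.arange(n - 1)
    K[i, i] += rho_mid / h
    K[i + 1, i + 1] += rho_mid / h
    K[i, i + 1] -= rho_mid / h
    K[i + 1, i] -= rho_mid / h
    M = np.diag(rho(x) * h)                 # lumped mass matrix: int rho w u dx
    vals = eigh(K, M, eigvals_only=True)
    return vals[1]                          # skip the zero mode (constants)

print(poincare_constant_1d(norm(0, 1).pdf))  # close to 1 for the standard normal
```

For the standard normal distribution this returns a value close to 1, in agreement with the analytic example in Section 5.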
What is it exactly? We can provide an intuitive meaning of ξ_min by drawing an analogy with graph Laplacians. In that case, the second smallest eigenvalue of the graph Laplacian, called the Fiedler value (Fiedler, 1973), reflects how well connected the overall graph is. Thus, the Poincaré constant is in a way a continuous analog of this quantity, reflecting the "connectivity" of a measure. This is the property that we would expect to impact GAN convergence: based on rich empirical evidence, the datasets that are more "disconnected" (such as ImageNet) are very challenging to model. We provide experimental evidence in support of this intuitive explanation in Section 7.
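To make the graph analogy concrete, the toy sketch below (ours, not from the paper; the cluster sizes and number of bridging edges are arbitrary) compares the Fiedler value of two dense clusters joined by a single edge against the same clusters joined by several edges.

```python
import numpy as np

# Toy illustration (ours): the Fiedler value is the second-smallest eigenvalue
# of the graph Laplacian L = D - A; larger values mean better connectivity.
def fiedler_value(adj):
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))[1]

def two_clusters(n=10, bridges=1):
    adj = np.zeros((2 * n, 2 * n))
    adj[:n, :n] = 1.0                       # complete graph on cluster 1
    adj[n:, n:] = 1.0                       # complete graph on cluster 2
    np.fill_diagonal(adj, 0.0)
    for k in range(bridges):                # edges connecting the two clusters
        adj[k, n + k] = adj[n + k, k] = 1.0
    return adj

print(fiedler_value(two_clusters(bridges=1)))   # weakly connected: small value
print(fiedler_value(two_clusters(bridges=5)))   # better connected: larger value
```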
4.2. Spectrum of lGAN

We are now ready to fully describe the spectrum of our lGAN model. Let {ξ_i}_{i=1}^∞ be the non-zero spectrum of −∆_µ and {w_{ξ_i}}_{i=1}^∞ be the set of the corresponding eigenfunctions. Recall from Theorem 1 that the constants α and β are defined purely in terms of the functions φ₁ and φ₂. We now state our main result.

Theorem 2.
The non-zero spectrum of (8) is described as follows.

• The eigenvalues are given by {λ±_i}_{i=1}^∞, where λ±_i are the roots of the quadratic equation

λ² + αλ + β²ξ_i = 0.   (17)

• The corresponding eigenfunctions are written in terms of the eigenfunctions of −∆_µ as follows:

(u_{λ±_i}, v_{λ±_i}) = (w_{ξ_i}, (β/λ±_i) ∇_x w_{ξ_i}).   (18)

Proof.
By putting v̂ = ∇_x û into the second equation of (8), we get

λ⟨u_λ, û⟩_µ = −α⟨u_λ, û⟩_µ − (β²/λ)⟨∇_x û, ∇_x u_λ⟩_µ,

which can be rewritten as

⟨∇_x û, ∇_x u_λ⟩_µ = (1/β²)(−αλ − λ²)⟨u_λ, û⟩_µ = ξ⟨u_λ, û⟩_µ,

which means that ξ is an eigenvalue of −∆_µ and u_λ is its eigenfunction. The eigenvalue λ can then be found from the solution of the quadratic equation (17):

λ = (−α ± √(α² − 4β²ξ)) / 2.   (19)

With the explicit formulas for the eigenvalues and eigenvectors at hand, we are now ready to analyze the convergence of the problem (6).
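The quadratic (17) is easy to explore numerically. The sketch below (our illustration; the sample values of α, β and ξ are arbitrary) solves it for a few modes and reports the decay rate (the real part) and whether the mode oscillates.

```python
import numpy as np

# Illustration (ours): roots of lambda^2 + alpha*lambda + beta^2*xi = 0, eq. (17).
def lgan_eigenvalues(alpha, beta, xi):
    disc = alpha ** 2 - 4.0 * beta ** 2 * xi
    sq = np.sqrt(complex(disc))
    return (-alpha + sq) / 2.0, (-alpha - sq) / 2.0

alpha, beta = 2.0, 1.0                      # e.g. the LSGAN constants of Theorem 1
for xi in [0.1, 1.0, 10.0]:
    lam, _ = lgan_eigenvalues(alpha, beta, xi)
    print(xi, lam, "oscillatory" if lam.imag != 0.0 else "monotone")
# Small xi: slow real decay; xi above alpha^2 / (4 beta^2): a complex pair with
# real part -alpha / 2, i.e. damped oscillations.
```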
5. Convergence
The following theorem expresses the solution in terms of the eigenfunctions and also provides the convergence estimates.
Theorem 3.
Let u₀ ∈ H¹(µ), v₀ ∈ H¹(µ) and ∫ u₀ dµ = c₀. Then, these functions can be written as

u₀ = c₀ + Σ_{k=1}^∞ (c⁺_k + c⁻_k) w_{ξ_k},
v₀ = ṽ₀ + ∇_x V₀,   V₀ = Σ_{k=1}^∞ (c⁺_k β/λ⁺_k + c⁻_k β/λ⁻_k) w_{ξ_k},

where ṽ₀ is divergence-free, i.e., ⟨∇_x û, ṽ₀⟩_µ = 0. The coefficients c⁺_k and c⁻_k can be obtained as the solution of the linear systems

[ 1        1      ] [c⁺_k]   [⟨u₀, w_{ξ_k}⟩_µ]
[β/λ⁺_k   β/λ⁻_k] [c⁻_k] = [⟨V₀, w_{ξ_k}⟩_µ].   (20)

With this expansion, the solution to (6) is

u(t) = c₀ e^{−αt} + Σ_{k=1}^∞ w_{ξ_k} (c⁺_k e^{λ⁺_k t} + c⁻_k e^{λ⁻_k t}),
v(t) = ṽ₀ + ∇_x V(t),
V(t) = Σ_{k=1}^∞ w_{ξ_k} (c⁺_k (β/λ⁺_k) e^{λ⁺_k t} + c⁻_k (β/λ⁻_k) e^{λ⁻_k t}).   (21)

For α > 0 the norms of u(t) and V(t) can be estimated as

‖u(t)‖_µ ≤ ‖u₀‖_µ e^{ηt},   ‖V(t)‖_µ ≤ C‖V₀‖_µ e^{ηt},

where η = Re((−α + √(α² − 4β²ξ_min)) / 2) < 0 is the maximal real part of the eigenvalues.

Proof. The decomposition of v₀ into a potential part and a divergence-free part is a direct generalization of the classical result for the ordinary divergence and gradient, known as the Helmholtz decomposition (Griffiths, 2005). The divergence-free part ṽ₀ belongs to the kernel of the operator and thus stays constant. The dynamics of u and v follows from the completeness of the eigenbasis of ∆_µ and the assumption that its spectrum is discrete, so that we can expand them in this basis. By Theorem 2, each component in the sum is an eigenfunction, and thus its time dynamics is simply e^{λ±_k t}. For the constant term in u(t), by substituting û = 1 in (6) we obtain the ODE ⟨u_t, 1⟩_µ = −α⟨u, 1⟩_µ, from which the statement follows.
Discussion. Theorem 3 provides the exact representation of the solution in terms of the eigenfunctions of ∆_µ. The eigenvalues are given as the solution of the quadratic equation (17); thus, the spectral properties of −∆_µ completely determine the dynamics of the convergence. Specifically, we observe that the speed of convergence is determined by the value of ξ_min, i.e., the lowest non-zero mode of the weighted Laplacian.

There are two distinct cases. If the first eigenvalue of −∆_µ satisfies α² − 4β²ξ_min ≤ 0 (i.e., ξ_min is large enough), then all the eigenvalues λ±_i are complex, and their real part is equal to −α/2. Ideally, we would have α² − 4β²ξ_min = 0, which would provide us with the optimal convergence speed and no oscillations for the highest mode. Consider now the case of a small ξ_min ≪ α²/β². In this case, η will be close to 0, which results in a slow convergence rate. These observations can be used to explain the success of various practical GAN training methods, as we show in Section 7. In the case α = 0, which holds, for instance, for WGAN, we obtain the well-known purely oscillatory behavior (Nagarajan & Kolter, 2017). We also note that for α > 0, the average value of the discriminator exponentially decays to 0. This resembles the convergence plots of state-of-the-art GANs; see, e.g., Karras et al. (2020a, Figure 6b).
Example: normal distribution. Consider a model example of the normal distribution, µ ∼ N(0, 1). Then, µ has the density ρ(x) = (1/√(2π)) e^{−x²/2}. The eigenfunctions and eigenvalues of −∆_µ can be computed explicitly. The strong form of the eigenproblem is

(d/dx)(ρ dw_ξ/dx) = −ξρ w_ξ,

i.e., w_ξ satisfies

d²w_ξ/dx² − x dw_ξ/dx = −ξ w_ξ.   (22)

The solution of (22) exists for ξ_k ∈ Z_{≥0}, and the corresponding eigenfunction is the (probabilists') Hermite polynomial

w_{ξ_k} = H_k(x) = (−1)^k e^{x²/2} (d^k/dx^k) e^{−x²/2}.

The smallest non-zero eigenvalue is 1. Therefore, for the LSGAN model, the discriminant in (17) is always non-positive, and the convergence of u and v is exponential with the rate e^{−t}. The solution will also oscillate due to the presence of complex eigenvalues.
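The eigenrelation (22) can be verified symbolically; the snippet below is our own illustration and checks it for the first few Hermite polynomials.

```python
import sympy as sp

# Illustration (ours): verify (22) for the probabilists' Hermite polynomials,
# i.e. that w'' - x w' = -k w with eigenvalue xi = k.
x = sp.symbols('x')
for k in range(5):
    w = (-1) ** k * sp.exp(x ** 2 / 2) * sp.diff(sp.exp(-x ** 2 / 2), x, k)
    residual = sp.simplify(sp.diff(w, x, 2) - x * sp.diff(w, x) + k * w)
    print(k, sp.expand(sp.simplify(w)), residual)   # residual is 0 for each k
```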
What about neural networks? The discussion so far focuses on the functional spaces. In reality, these functions are approximated by neural networks, and the dynamics is written in the parameters θ_D and θ_G of these networks. Local convergence analysis of such dynamics is possible in these parameters (Mescheder et al., 2018; Nie & Patel, 2020); however, such analysis involves the eigenvalues of the Jacobian of the loss, and it is not easy to connect these properties to the fundamental properties of the measure. The functional space analysis shows this connection. The approximation by neural networks (or by any other parametric representation) can be thought of as a spatial discretization of PDEs. If the number of parameters increases, the eigenvalues of the discretized problem should approximate the eigenvalues of the infinite-dimensional problem. It is worth noting that different discretizations (for example, different neural network architectures) may lead to different properties. One can compare such discretizations by looking at the quality of approximation of the eigenvalues of the weighted Laplace operator (see Section 6 for the algorithmic details). Also, techniques used in practice such as Jacobian regularization (Karras et al., 2020b) can be considered as methods for choosing a better-conditioned discretization. Finally, for future research, we would like to mention that the lGAN problem is similar to the saddle point problems (Benzi et al., 2005) appearing in the mathematical modeling of fluid problems. For example, for robust discretization, one has to choose discretization spaces for u ('pressure') and v ('velocity') such that they satisfy the famous LBB (Ladyzhenskaya-Babuška-Brezzi) condition (Boffi et al., 2013; Ladyzhenskaya, 1969) in order to get good convergence properties. This task requires systematic study, and we leave it for future research.
6. Effects of common practices on the asymptotic convergence
Regularization.
Several studies have shown that penalizing the norm of the gradient of the GAN objective with respect to the discriminator improves the convergence (Gulrajani et al., 2017; Mescheder et al., 2018). In our case, it results in an additional term γ E_µ ‖∇_D f(D, G)‖². After linearization we obtain a regularized loss ĝ(u, v):

max_v min_u ĝ(u, v),   ĝ(u, v) = g(u, v) + (γ/2) E_µ ‖∇_x u‖².

The new eigenvalue problem has the form

λ⟨u_λ, û⟩_µ = −α⟨u_λ, û⟩_µ − β⟨∇_x û, v_λ⟩_µ − γ⟨∇_x u_λ, ∇_x û⟩_µ,
λ⟨v_λ, v̂⟩_µ = β⟨∇_x u_λ, v̂⟩_µ,   ∀ v̂ ∈ H¹(µ), û ∈ H¹(µ).

The kernel stays the same; moreover, the u_λ component is also the same, which can be seen again by using v̂ = ∇_x û in the second equation. The only difference is the connection between the eigenvalues of the weighted Laplace operator and the eigenvalues of the linearized problem, which changes as follows:

λ² + (α + γξ_i)λ + β²ξ_i = 0,

yielding

λ±_i = (−(α + γξ_i) ± √((α + γξ_i)² − 4β²ξ_i)) / 2.   (23)
Optimal parameters. One of the most interesting conclusions of our analysis is that the convergence is determined by the interplay between α, β, γ and ξ_min. If we fix φ₁, φ₂ in advance, we cannot control α and β, and the convergence for a particular dataset will be determined by ξ_min; i.e., for one dataset the loss may work well, but for another it may fail. An alternative approach is to optimize over the hyperparameters of the loss and the regularizer, taking the Poincaré constant into account. Expression (23) gives a direct way to do that. The eigenvalue with the maximal real part corresponds to ξ = ξ_min and is given by the formula

λ_max = (−(α + γξ_min) + √((α + γξ_min)² − 4β²ξ_min)) / 2.

The maximal convergence speed is obtained when the discriminant is equal to 0, which gives

α + γξ_min = 2|β|√ξ_min.

Another desirable property is the absence of oscillations. This is only possible if the quadratic function under the square root is non-negative for all ξ ≥ ξ_min. Since it is equal to zero at ξ = ξ_min, it is necessary and sufficient for it to have a non-negative derivative there:

γ(α + γξ_min) ≥ 2β²,  i.e.,  γ ≥ |β|/√ξ_min.

A simple analysis gives the following conditions on the parameters α, β, γ such that we have the optimal local convergence rate and all the eigenvalues of the linearized operator are real:

α + γξ_min = 2|β|√ξ_min,
|β|/√ξ_min ≤ γ ≤ 2|β|/√ξ_min,
0 ≤ α ≤ |β|√ξ_min.   (24)
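As a concrete reading of (24), the sketch below (our illustration; the example values of β and ξ_min are arbitrary) picks γ inside the admissible range, sets α from the equality constraint, and verifies that the resulting λ_max is real.

```python
import numpy as np

# Illustration (ours): choose (alpha, gamma) satisfying (24) for a given beta
# and an estimated xi_min, then evaluate lambda_max from (23).
def optimal_parameters(beta, xi_min, gamma_scale=1.0):
    # gamma_scale in [1, 2] sweeps the admissible range for gamma in (24)
    gamma = gamma_scale * abs(beta) / np.sqrt(xi_min)
    alpha = 2.0 * abs(beta) * np.sqrt(xi_min) - gamma * xi_min
    disc = (alpha + gamma * xi_min) ** 2 - 4.0 * beta ** 2 * xi_min
    lam_max = (-(alpha + gamma * xi_min) + np.sqrt(complex(disc))) / 2.0
    return alpha, gamma, lam_max

print(optimal_parameters(beta=1.0, xi_min=0.25))
# The discriminant is 0 by construction, so lambda_max = -|beta| sqrt(xi_min) is real.
```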
Discrete approximation of the time dynamics. The dynamics (6) can be written in the operator form as

w_t = L w,   w(0) = w₀,   w = [u; v].   (25)

The usage of gradient descent optimization for this problem is equivalent to the forward Euler scheme

w_{k+1} = w_k + τ L w_k,

and the eigenvalues of the discretized operator are equal to

λ^e_i = 1 + τλ_i,   (26)

where the λ_i are given by Theorem 2. Since for all ξ_i such that α² − 4β²ξ_i < 0 the eigenvalue λ_i is complex, the modulus of the corresponding λ^e_i satisfies

|λ^e_i|² = (1 − τα/2)² + τ²(β²ξ_i − α²/4).

Since ∆_µ is an unbounded operator, under mild assumptions on µ there exists an eigenvalue ξ_i of it that makes |λ^e_i| > 1, i.e., the Euler scheme is absolutely unstable. On the discrete level, when the functions are approximated by a neural network, we might be working in a subspace that does not contain eigenvalues that are too large, and the method might actually converge; but this is very problem specific. As has been noted by Qin et al. (2020), the usage of more advanced time integration schemes leads to competitive GAN training results even without such functional constraints as spectral normalization (Miyato et al., 2018). If the parameters are selected in the range (24), then there are no complex eigenvalues, and we can make the Euler scheme convergent; this requires regularization. Another important research direction is the development of more suitable time discretization schemes. The system (6) belongs to the class of hyperbolic problems, and special time discretization schemes have to be used. One class of such methods is total variation diminishing (TVD) schemes (Gottlieb & Shu, 1998). Note that one of the methods considered in Qin et al. (2020) is the Heun method (Süli & Mayers, 2003), which is a second-order TVD Runge-Kutta scheme; this confirms its empirical efficiency. Thus, our analysis provides a theoretical foundation for the results of Qin et al. (2020).
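The instability claim above can be checked directly from (26): for any fixed step size τ, the Euler amplification factor |1 + τλ_i| eventually exceeds 1 as ξ_i grows. The sketch below is our own illustration with arbitrary α, β and τ.

```python
import numpy as np

# Illustration (ours): forward-Euler amplification factor |1 + tau * lambda|
# from (26) grows past 1 once xi_i is large enough, for any fixed step tau.
def euler_amplification(alpha, beta, xi, tau):
    lam = (-alpha + np.sqrt(complex(alpha ** 2 - 4.0 * beta ** 2 * xi))) / 2.0
    return abs(1.0 + tau * lam)

alpha, beta, tau = 2.0, 1.0, 0.1
for xi in [1.0, 10.0, 100.0, 1000.0]:
    print(xi, euler_amplification(alpha, beta, xi, tau))
# The factor exceeds 1 for large xi: the scheme is absolutely unstable on the
# unbounded spectrum of the weighted Laplacian.
```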
Data augmentation. One of the most successful practical tricks significantly improving GAN convergence has been the usage of data augmentation (Karras et al., 2020a; Zhang et al., 2020; Zhao et al., 2020). In our framework, the usage of data augmentations is reflected in the shift of ξ_min. Based on the intuition discussed earlier, higher values of ξ_min correspond to more "connected" distributions and allow for faster convergence. This is exactly what is intuitively achieved by data augmentation: we fill the "holes" in our dataset with new samples and make it more connected. We describe our experiments confirming this idea in Section 7.

Practical estimation of ξ_min. For practical analysis of GAN convergence, we need to estimate the value of the Poincaré constant ξ_min for a given dataset. This can be performed in the standard supervised learning manner. Recall from the definition that ξ_min is the minimizer of the following optimization problem (commonly called the Rayleigh quotient):

E_µ ‖∇_x f(x)‖² / Var_µ f(x) → min over f ∈ H¹(µ).   (27)

In practice we can parameterize f with a neural network and perform the optimization of (27) in a stochastic manner by replacing expectations over µ with their empirical counterparts (over a batch of inputs). Due to the scale invariance of the Rayleigh quotient, we employ spectral normalization (Miyato et al., 2018) as an additional regularization; this also falls in line with commonly used discriminator architectures. To summarize, we utilize a neural network f(x; θ) and for a dataset X = {X_i}_{i=1}^N consider the batched version of the following loss:

L(θ) = [ (1/N) Σ_{i=1}^N ‖∇_x f(X_i; θ)‖² ] / [ (1/N) Σ_{i=1}^N f(X_i; θ)² − ( (1/N) Σ_{i=1}^N f(X_i; θ) )² ],   (28)

and, correspondingly, ξ_min ≈ min_θ L(θ). Minimization of this loss function can be performed with standard optimizers such as SGD or Adam (Kingma & Ba, 2015).
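A minimal sketch of this estimator is given below. It is our own illustration of (28) rather than the authors' code: the MLP architecture, batch construction, number of steps, and learning rate are placeholder choices, while the spectral normalization follows the description above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Sketch (ours) of the xi_min estimator (28): a small spectrally-normalized MLP
# trained to minimize the empirical Rayleigh quotient over data batches.
class Critic(nn.Module):
    def __init__(self, dim, width=256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(dim, width)), nn.ReLU(),
            spectral_norm(nn.Linear(width, width)), nn.ReLU(),
            spectral_norm(nn.Linear(width, 1)),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def rayleigh_quotient(f, x):
    x = x.requires_grad_(True)
    y = f(x)
    grads = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    numerator = grads.pow(2).sum(dim=1).mean()   # E ||grad_x f||^2
    denominator = y.var(unbiased=False)          # Var f over the batch
    return numerator / denominator

def estimate_xi_min(data, steps=2000, batch=256, lr=1e-4):
    f = Critic(data.shape[1])
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, data.shape[0], (batch,))
        loss = rayleigh_quotient(f, data[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()   # approximately xi_min once the loss plateaus
```

For image datasets the same quotient can be minimized with a convolutional network mimicking a GAN discriminator, as done in Section 7.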
7. Experiments
Experimental setup.
Our experiments are organized as follows. We start by numerically investigating the value of ξ_min and the impact of the formulas from Section 6 on the convergence on synthetic datasets. We then study the more practical CIFAR-10 (Krizhevsky et al., 2009) dataset. Firstly, we show the correlation between the ξ_min obtained for various augmented versions of the dataset and the FID values obtained for the corresponding GANs. We then show that a similar correlation holds when we perform instance selection, a recently proposed technique shown to improve GAN convergence (DeVries et al., 2020). For the synthetic datasets, our experiments were performed in JAX (Bradbury et al., 2018) using the ODE-GAN code available at GitHub (https://github.com/deepmind/deepmind-research/tree/master/ode_gan). For the CIFAR-10 experiments we utilized PyTorch (Paszke et al., 2019). Our experiments were performed on a single NVidia V100 GPU.

Gaussian Mixture.
In order to study the effect of ξ_min and the choice of α and γ on the training procedure, we set up a simple one-dimensional test: a mixture of two normal distributions N(0, 1) and N(D, 1), where D is the separation parameter. Intuitively, the larger the D, the smaller is ξ_min, since it reduces the connectivity of our measure. We also verify this numerically, as shown in Figure 1 (top). In this case, we utilize a simple two-layer MLP with spectral normalization on top of the linear layers. We observe that the value of ξ_min indeed decays rapidly with the increase of the separation parameter D.

To experiment with GAN convergence in this toy setting, we set up two MLP models for D and G and train the LSGAN-like model with φ₁(x) = αx² + βx, φ₂(x) = −βx, and a gradient penalty with factor γ. We train the model using the ODE-GAN approach with the Heun method. We experiment with two moderately separated mixtures, namely with D = 3 and D = 4; the respective ξ_min values were obtained by the numerical simulation described above. We consider five options for the values of (α, β, γ). We start with the baseline WGAN corresponding to (0, −1, 0). We consider two options for the value of γ: the optimal one predicted by the theory, and a sub-optimal (smaller) one, for which Im λ_max ∼ |β|√ξ_min. We also consider two LSGAN variants: the first one is the default version, and for the second version we fix β = 1 and choose the optimal α and γ according to (24). We measure the performance of a GAN model via the Earth Mover's Distance between the Gaussians fitted on real and synthetic data. This approach resembles the commonly used Fréchet Inception Distance (FID) metric for GAN evaluation. The results are visualized in Figure 1. We observe that the resulting convergence plots match our theoretical predictions. In particular, we see that when the parameters are chosen in the optimal way, the methods converge more rapidly and stably. On the other hand, when γ is not tuned or is sub-optimal, the convergence is more oscillatory, and GANs struggle to converge. Note that even for the optimal γ, for methods with α > 0 we still observe some oscillations. This may be a result of noise induced by the neural function approximation or the stochasticity of training.
Figure 1. Numerical experiments with GAN convergence on a mixture of two univariate Gaussians. (Top) Estimated value of ξ_min vs. the mixture separation parameter D. (Bottom) Convergence plots of GANs with different loss functions and regularization strengths. Methods with parameters selected optimally according to the theory show better convergence.
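The qualitative decay of ξ_min with the separation D (Figure 1, top) can be reproduced with a simple grid-based estimate of the Poincaré constant for the mixture density. The sketch below is our own illustration and reuses the finite-difference construction sketched earlier; the domain padding and grid size are arbitrary choices.

```python
import numpy as np
from scipy.linalg import eigh

# Illustration (ours): grid-based estimate of xi_min for a two-component
# Gaussian mixture with separation D.
def xi_min_mixture(D, pad=8.0, n=2400):
    x = np.linspace(-pad, D + pad, n)
    h = x[1] - x[0]
    rho = lambda t: 0.5 * (np.exp(-t ** 2 / 2) + np.exp(-(t - D) ** 2 / 2)) / np.sqrt(2 * np.pi)
    rm = rho(0.5 * (x[:-1] + x[1:]))
    K = np.zeros((n, n))
    i = np.arange(n - 1)
    K[i, i] += rm / h
    K[i + 1, i + 1] += rm / h
    K[i, i + 1] -= rm / h
    K[i + 1, i] -= rm / h
    vals = eigh(K, np.diag(rho(x) * h), eigvals_only=True)
    return vals[1]

for D in [0.0, 2.0, 3.0, 4.0]:
    print(D, xi_min_mixture(D))   # the estimate shrinks as the modes separate
```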
Data augmentation of CIFAR-10.
In this set of experiments, we verify whether the effects of data augmentations on GAN convergence can be predicted by evaluating ξ_min for the correspondingly augmented dataset. We consider a number of augmentations commonly utilized in data augmentation pipelines for GANs (see Figure 2).

Figure 2. Various types of augmentations commonly used in GAN training.
In particular, these augmentations include both spatial transformations (translations, zooming) and visual ones (color adjustment). The folklore knowledge is that the first type is generally helpful, while the second is not. In our framework, the positive effect of augmentations can be understood as improving the "connectivity" of a dataset, which can be quantitatively measured by evaluating ξ_min of the augmented version of the dataset. An increased value of ξ_min, both intuitively and theoretically (as supported by Theorem 2), corresponds to better convergence and better FID scores. Our experiments are based on a thorough analysis of the impact of augmentations on the quality of a GAN trained on CIFAR-10 provided in Zhao et al. (2020). Specifically, the authors select a single augmentation from a predefined set and train several types of GANs by augmenting real and fake images. The strength of the augmentation is controlled by a single parameter λ, and the authors provide the obtained FID for a number of its possible values (e.g., we may vary how strongly we zoom an image); we refer the reader to Zhao et al. (2020) for specific details on each augmentation. We selected a large portion of the augmentations studied in that paper and replicated its data augmentation setup. For each augmentation and each strength value, we minimize the Rayleigh quotient with a neural network mimicking the SNDCGAN discriminator from Zhao et al. (2020). We train it on the training part of the dataset by applying the respective augmentation to each image with probability one. Note that the actual augmentation strength is randomly sampled from the range [0, λ], so we cover the entire distribution relatively well. For training, we use the Adam optimizer (the results are not sensitive to the batch size and learning rate). We measure the actual value of ξ_min by sweeping across the augmented test set several times to better cover the distribution and aggregate the results across all (augmented) samples. For the reference values, we choose the FID scores obtained by the SNDCGAN model with Balanced Consistency Regularization (bCR) compiled from Zhao et al. (2020, Figure 3); this model achieves better generation quality, so our theory is more reliable in this case. Our results are provided in Figure 3. For convenience, we normalize the obtained values by dividing them by the ξ_min of the non-augmented dataset. We observe that the FID scores in many cases indeed follow the behavior of ξ_min. For instance, for zoomin and cutout, the estimated ξ_min values predict that there is an intermediate optimal augmentation strength, which is matched in practice. For translation, we observe that stronger augmentations worsen the connectivity of the dataset, which results in worse FID scores. We also observe a significant drop in ξ_min for color-based augmentations, confirming their practical inefficiency. We note that in some cases (e.g., zoomout), we do not observe a direct correspondence. This may be a result of auxiliary effects not covered by our theory or of a discrepancy between our implementation and the authors'.
Figure 3. (Top) The values of ξ_min for augmented versions of CIFAR-10 for different values of the augmentation strength. We normalize them by dividing by the ξ_min of the non-augmented version of the dataset. (Bottom) The FID values of an SNDCGAN (with balanced consistency regularization) trained on the corresponding dataset; the values are taken from Zhao et al. (2020). For ξ_min, a higher value is "better" in terms of the connectivity of the dataset and the theoretical convergence. For FID, smaller values are better. In most cases we observe a negative rank correlation between ξ_min and FID.

Instance selection. Another possible way to manipulate the data distribution is to remove non-typical samples, which can confuse the generator (DeVries et al., 2020). This can be performed by fitting a Gaussian model on the feature vectors of a dataset produced by some pretrained model (in our experiments, we used Inception v3 (Szegedy et al., 2016)) and keeping only the samples with high likelihood under this model. This procedure results in better FID scores of the trained GANs, as shown in DeVries et al. (2020). In our framework, this can be understood as another way to improve the dataset's connectivity, since by removing outliers we reduce the number of gaps in the data. The experiments in the aforementioned paper were conducted on the resized ImageNet dataset; however, we hypothesize that this phenomenon can also be understood on the conceptually similar CIFAR-10 dataset. We perform an experiment confirming this effect quantitatively. We follow the steps described above, and for each of five values of the truncation quantile ψ we evaluate the ξ_min of the dataset obtained by removing the samples in the bottom ψ-quantile of the likelihood. Our results are provided in Table 1. We observe that for higher values of the truncation parameter we indeed obtain better dataset connectivity, confirming the practical findings of DeVries et al. (2020).
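A minimal version of this selection step is sketched below. It is our own illustration (the feature source and the covariance regularization constant are placeholder choices), not the authors' code.

```python
import numpy as np

# Sketch (ours, not the authors' code): instance selection as described above,
# given a matrix feats of per-image feature vectors (e.g. from Inception v3).
def select_instances(feats, psi):
    mean = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])  # jitter is a placeholder
    inv = np.linalg.inv(cov)
    d = feats - mean
    # Gaussian log-likelihood up to a constant: -0.5 * Mahalanobis distance
    scores = -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
    keep = scores > np.quantile(scores, psi)   # drop the bottom psi-quantile
    return keep
```

The kept subset can then be fed to the ξ_min estimator from Section 6 to compare dataset connectivity before and after truncation.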
Table 1. Normalized differences in ξ_min for CIFAR-10 with instance selection performed according to the procedure outlined in DeVries et al. (2020), with respect to the truncation (lower) quantile ψ. Δξ_min is computed as (ξ_min(ψ)/ξ_min(0) − 1) × 100 for easier comparison. We observe that stronger truncation increases the value of ξ_min, confirming better empirical GAN convergence.

8. Conclusion

We presented a novel framework for a theoretical understanding of GAN training by analyzing the local convergence in the functional space. Namely, we represent the GAN dynamics as a system of partial differential equations and analyze the spectrum of the corresponding differential operator, which determines the convergence of the dynamics. As the main result, we show how the spectrum depends on the properties of the target distribution, in particular, on its Poincaré constant. Our perspective provides a new understanding of established GAN tricks, such as gradient penalty or dataset augmentation. For practitioners, our paper develops an efficient method that allows one to choose optimal augmentations for a particular dataset.
References
Adams, R. A. and Fournier, J. J. Sobolev Spaces. Elsevier, 2003.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. PMLR, 2017.

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pp. 354–363. PMLR, 2018.

Benzi, M., Golub, G. H., Liesen, J., et al. Numerical solution of saddle point problems. Acta Numerica, 14(1):1–137, 2005.

Boffi, D., Brezzi, F., Fortin, M., et al. Mixed Finite Element Methods and Applications, volume 44. Springer, 2013.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Cianchi, A., Maz'ya, V., et al. On the discreteness of the spectrum of the Laplacian on noncompact Riemannian manifolds. Journal of Differential Geometry, 87(3):469–492, 2011.

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

DeVries, T., Drozdzal, M., and Taylor, G. W. Instance selection for GANs. Advances in Neural Information Processing Systems, 2020.

Fiedler, M. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pp. 2672–2680. Curran Associates, Inc., 2014.

Gottlieb, S. and Shu, C.-W. Total variation diminishing Runge-Kutta schemes. Mathematics of Computation, 67(221):73–85, 1998.

Griffiths, D. J. Introduction to Electrodynamics, 2005.

Grigoryan, A. Heat Kernel and Analysis on Manifolds, volume 47. American Mathematical Society, 2009.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, volume 30, pp. 5767–5777. Curran Associates, Inc., 2017.

Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 2020a.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020b.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. URL http://arxiv.org/abs/1412.6980.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Ladyzhenskaya, O. A. The Mathematical Theory of Viscous Incompressible Flow, volume 2. Gordon and Breach, New York, 1969.

Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 907–915. PMLR, 2019.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

Mescheder, L., Nowozin, S., and Geiger, A. The numerics of GANs. In Advances in Neural Information Processing Systems, volume 30, pp. 1825–1835. Curran Associates, Inc., 2017.

Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), pp. 3481–3490. PMLR, 2018.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Nagarajan, V. and Kolter, J. Z. Gradient descent GAN optimization is locally stable. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5591–5600, 2017.

Nie, W. and Patel, A. B. Towards a better understanding and regularization of GAN training dynamics. In Uncertainty in Artificial Intelligence, pp. 281–291. PMLR, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Qin, C., Wu, Y., Springenberg, J. T., Brock, A., Donahue, J., Lillicrap, T. P., and Kohli, P. Training generative adversarial networks by solving ordinary differential equations. Advances in Neural Information Processing Systems, 2020.

Roth, K., Lucchi, A., Nowozin, S., and Hofmann, T. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, volume 30, pp. 2018–2028. Curran Associates, Inc., 2017.

Süli, E. and Mayers, D. F. An Introduction to Numerical Analysis. Cambridge University Press, 2003.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Zhang, H., Zhang, Z., Odena, A., and Lee, H. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1lxKlSKPH.

Zhao, Z., Zhang, Z., Chen, T., Singh, S., and Zhang, H. Image augmentations for GAN training. arXiv preprint arXiv:2006.02595, 2020.