A Systematic Survey of Regularization and Normalization in GANs
Ziqiang Li, Xintian Wu, Muhammad Usman, Rentuo Tao, Pengfei Xia, Huanhuan Chen, Bin Li
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015
Regularization and Normalization for Generative Adversarial Networks: A Review
Ziqiang Li, Rentuo Tao, and Bin Li, Member, IEEE
Abstract—Generative adversarial networks (GANs) are a popular class of generative models. With the development of deep networks, their applications have become more and more widespread. The training of GANs is commonly viewed as a two-player zero-sum game between a discriminator and a generator. The lack of strong supervision makes training difficult: non-convergence, mode collapse, vanishing gradients, and sensitivity to hyperparameters frequently arise. Regularization and normalization are commonly used to stabilize training. This paper reviews and summarizes the research on regularization and normalization for GANs. All methods are classified into six groups: gradient penalty, norm normalization and regularization, Jacobian regularization, layer normalization, consistency regularization, and self-supervision.
Index Terms—Generative Adversarial Networks (GANs), Regularization, Normalization, Review, Lipschitz
[Footnote: Ziqiang Li, Rentuo Tao, and Bin Li are with the School of Information Science and Technology, University of Science and Technology of China, Anhui, China (email: [email protected], [email protected], [email protected]). Manuscript received April 19, 2005; revised August 26, 2015.]

I. INTRODUCTION

GENERATIVE Adversarial Networks [1] have been widely used in computer vision, for tasks such as image inpainting [2]–[6], style transfer [7]–[12], text-to-image translation [13]–[16], and attribute editing [17]–[20]. From a game perspective, a GAN is a two-person zero-sum game, and it is well known that GAN training is unstable [21], [22]: non-convergence [21], [23], mode collapse [24], vanishing gradients [25], and sensitivity to hyperparameters [26] frequently appear during training. Much work has been proposed to mitigate these issues: designing new architectures [22], [27], new loss functions [28]–[31], or new optimization methods [22], [32]. Compared with the above methods, we are more interested in regularization and normalization, because they are compatible with different loss functions, model structures, and tasks; this additional approach is simple and appealing. Regularization was first used to prevent overfitting in neural networks [33]. For GANs, regularization serves different purposes: some works add a gradient regularizer so that the discriminator satisfies Lipschitz continuity [29], [34], [35]; some regularize the Jacobian matrix of the loss function to obtain local convergence of GANs [36], [37]; and some start from intuition and add regularization to improve the discriminative ability of the discriminator [30], [38], [39]. Similarly, for normalization, the most commonly used technique in neural networks is batch normalization [40], which accelerates deep network training by reducing internal covariate shift. These layer-based normalizations
have also been used for GANs, for example conditional batch normalization [41], instance normalization [42], and attentive normalization [43]. Besides, to ensure that the discriminator satisfies Lipschitz continuity, norm normalizations [41], [44] are also widely used in GANs. Regularization and normalization are simple and effective for stable GAN training, but reviews of this topic are scarce: some works [26], [45] are one-sided and cover only a small part of the field, while others [46] lack theoretical support. To introduce the regularization and normalization of GANs systematically and comprehensively, this paper first introduces optimal transport and the dynamic model; the former leads to gradient penalties and norm normalization, and the latter leads to Jacobian regularization. We also introduce some unsupervised methods that improve the representation ability of the discriminator, such as consistency regularization and self-supervision. The remaining parts of this paper are organized as follows: in Section II, we introduce the background on GANs, optimal transport, and the dynamic model. In Section III, we divide regularization and normalization into six groups according to their purpose and approach and give a detailed introduction. Current problems and prospects are given in Section IV.

II. BACKGROUND
A. Generative Adversarial Networks
A GAN is a two-player zero-sum game, where the generator $G(z)$ is a distribution mapping function that maps a Gaussian or uniform distribution $z$ to the target image distribution $P_g(x)$, and the discriminator $D(x)$ evaluates the distance between the two distributions. The framework is shown in Fig. 1. For the optimal discriminator, the generator $G(z)$ makes the distance between the target distribution $P_r(x)$ and the generated distribution $P_g(x)$ as small as possible. The GAN game can be formulated as follows:

$$\min_\phi \max_\theta f(\phi, \theta) = \mathbb{E}_{x \sim p_r}[g_1(D_\theta(x))] + \mathbb{E}_{z \sim p_z}[g_2(D_\theta(G_\phi(z)))] \quad (1)$$

where $\phi$ and $\theta$ are the parameters of the generator $G$ and discriminator $D$, respectively, and $p_r$ and $p_z$ denote the real distribution and the latent distribution. Specifically, vanilla GAN [1] can be described with $g_1(t) = g_2(-t) = -\log(1 + e^{-t})$; f-GAN [47] and WGAN [28] can be written as $g_1(t) = -e^{-t}$, $g_2(t) = -t$ and $g_1(t) = g_2(-t) = t$, respectively.
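As a concrete reading of Eq. (1), the following sketch (not from the survey; the critic outputs are made-up numbers for illustration) evaluates the objective for a given choice of $g_1, g_2$ from finite batches of discriminator outputs:

```python
import math

def g1_vanilla(t):
    # g1(t) = -log(1 + e^{-t}); vanilla GAN then takes g2(t) = g1(-t)
    return -math.log(1.0 + math.exp(-t))

def objective(g1, g2, d_real, d_fake):
    """Monte-Carlo estimate of E[g1(D(x))] + E[g2(D(G(z)))] in Eq. (1)."""
    return (sum(g1(t) for t in d_real) / len(d_real)
            + sum(g2(t) for t in d_fake) / len(d_fake))

# WGAN: g1(t) = t and g2(t) = -t, i.e. a difference of critic means
wgan_value = objective(lambda t: t, lambda t: -t,
                       d_real=[1.0, 3.0], d_fake=[0.5, 1.5])
```

With the WGAN choice the objective reduces to the mean critic value on real samples minus the mean on generated samples, which is exactly the duality gap the discriminator maximizes.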
Fig. 1: The framework of GANs. $P_z$ is a latent-space distribution; $P_r$ and $P_g$ represent the real distribution and the generated distribution, respectively.

B. Optimal Transport and WGAN
Optimal transport [48] was proposed in the 18th century to minimize transportation cost while preserving measure quantities. Given spaces with probability measures $(X, \mu)$ and $(Y, \upsilon)$, if there is a measure-preserving map $T: X \to Y$, then for any $B \subset Y$:

$$\int_{T^{-1}(B)} d\mu(x) = \int_B d\upsilon(y) \quad (2)$$

which we write as $T_*(\mu) = \upsilon$. For any $x \in X$ and $y \in Y$, we can define the transportation cost $c(x, y)$; the total transportation cost is then:

$$C(T) := \int_X c(x, T(x)) \, d\mu(x) \quad (3)$$

In the 18th century, Monge proposed the optimal mass transportation map, the map with the smallest total transportation cost $C(T)$. The transportation cost corresponding to the optimal transportation map is called the Wasserstein distance between the probability measures $\mu$ and $\upsilon$:

$$W_c(\mu, \upsilon) = \min_T \left\{ \int_X c(x, T(x)) \, d\mu(x) \;\middle|\; T_*(\mu) = \upsilon \right\} \quad (4)$$

In the 1940s, Kantorovich proved the existence and uniqueness of the solution of the Monge problem [49] and, using the duality of linear programming, obtained the dual form of the Wasserstein distance:

$$W_c(\mu, \upsilon) = \max_{\varphi, \psi} \left\{ \int_X \varphi \, d\mu + \int_Y \psi \, d\upsilon \;\middle|\; \varphi(x) + \psi(y) \le c(x, y) \right\} \quad (5)$$

This dual problem is constrained. Defining the c-transform $\psi(y) = \varphi^c(y) := \inf_x \{ c(x, y) - \varphi(x) \}$, the Wasserstein distance becomes:

$$W_c(\mu, \upsilon) = \max_\varphi \left\{ \int_X \varphi \, d\mu + \int_Y \varphi^c \, d\upsilon \right\} \quad (6)$$

where $\varphi$ is called the Kantorovich potential. If $c(x, y) = |x - y|$, it can be shown that when the Kantorovich potential satisfies 1-Lipschitz continuity, $\varphi^c = -\varphi$. The Kantorovich potential can then be fitted by a deep neural network, denoted $\varphi_\xi$.
The Wasserstein distance is then:

$$W_c(\mu, \upsilon) = \max_\xi \left\{ \int_X \varphi_\xi \, d\mu - \int_Y \varphi_\xi \, d\upsilon \right\} \quad (7)$$

If $X$ is the generated image space, $Y$ is the real sample space, $Z$ is the latent space, and $g_\theta$ is the generator, then WGAN is the minimax problem:

$$\min_\theta \max_\xi \left\{ \int_Z \varphi_\xi(g_\theta(z)) \, d\zeta(z) - \int_Y \varphi_\xi(y) \, d\upsilon(y) \right\} \quad (8)$$

The generator minimizes the Wasserstein distance and the discriminator maximizes it. During optimization, the generator and the Kantorovich potential function (discriminator) are independent of each other, and we optimize them alternately in a step-by-step iteration. If $c(x, y) = \frac{1}{2}|x - y|^2$, then there is a convex function $u$, called the Brenier potential [50], such that the optimal transportation map is given by the gradient map of the Brenier potential: $T(x) = \nabla u(x)$. In this case a simple relationship holds between the Kantorovich potential and the Brenier potential [51]:

$$u(x) = \frac{1}{2}|x|^2 - \varphi(x) \quad (9)$$

From the previous analysis, we know that the optimal transportation map (Brenier potential) corresponds to the generator and the Kantorovich potential corresponds to the discriminator; once the discriminator is optimized, the generator can be derived directly without a separate optimization process [51].
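For intuition about Eq. (4), the Wasserstein distance has a closed form in one dimension: with $c(x, y) = |x - y|$ and two equal-size empirical distributions, the monotone (sorted) matching is the optimal transport map. A minimal sketch (illustrative, not part of the survey):

```python
def wasserstein1_1d(xs, ys):
    """W1 between two equal-size empirical distributions on the line.

    For c(x, y) = |x - y| the optimal map is monotone, so it simply
    matches the i-th smallest x to the i-th smallest y; the Wasserstein
    distance of Eq. (4) is the mean absolute gap of that matching.
    """
    assert len(xs) == len(ys), "equal-size samples assumed"
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a distribution by delta moves it by exactly delta in W1:
d = wasserstein1_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # 1.0
```

This also illustrates why the Wasserstein distance gives useful gradients where, e.g., the Jensen-Shannon divergence saturates: it varies smoothly with a translation of the support.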
We define the transportation cost of (3) in the form of a distance between two distributions:

$$OT(P \| Q) = \inf_\pi \int \pi(x, y) \, c(x, y) \, dx \, dy \quad (10)$$

where $\pi(x, y)$ is a joint distribution satisfying $\int_y \pi(x, y) \, dy = P(x)$ and $\int_x \pi(x, y) \, dx = Q(y)$. The dual form of Eq. (10) is:

$$OT(P \| Q) = \max_{\varphi, \psi} \left\{ \int_x \varphi(x) P(x) \, dx + \int_y \psi(y) Q(y) \, dy \;\middle|\; \varphi(x) + \psi(y) \le c(x, y) \right\} \quad (11)$$

Here we consider optimal transport with regularization terms. Peyré et al. [52] added an entropic regularizer to optimal transport, which turns the dual problem into a smooth unconstrained convex problem. The regularized optimal transport is:

$$OT_c(P \| Q) = \min_\pi \int \pi(x, y) \, c(x, y) \, dx \, dy + \epsilon E(\pi) \quad (12)$$

If $E(\pi) = \int_x \int_y \pi(x, y) \log\left( \frac{\pi(x, y)}{P(x) Q(y)} \right) dx \, dy$, then Eq. (12) can be written as:

$$\begin{aligned} OT_c(P \| Q) = \min_\pi \; & \int \pi(x, y) \, c(x, y) \, dx \, dy + \epsilon \int_x \int_y \pi(x, y) \log\left( \frac{\pi(x, y)}{P(x) Q(y)} \right) dx \, dy \\ \text{s.t.} \; & \int_y \pi(x, y) \, dy = P(x), \quad \int_x \pi(x, y) \, dx = Q(y) \end{aligned} \quad (13)$$

Then the dual form of Eq. (13) is:

$$OT_c(P \| Q) = \max_{\varphi, \psi} \int_x \varphi(x) P(x) \, dx + \int_y \psi(y) Q(y) \, dy + \frac{\epsilon}{e} \int_x \int_y \exp\left( \frac{-(c(x, y) + \varphi(x) + \psi(y))}{\epsilon} \right) dx \, dy \quad (14)$$

C. Lipschitz Continuity and Matrix Norm
WGAN is the most popular generative adversarial network. From optimal transport, we know that to obtain Eq. (7), the discriminator must satisfy 1-Lipschitz continuity:

$$\| D(x_1) - D(x_2) \| \le \| x_1 - x_2 \| \quad (15)$$

Generally, we consider K-Lipschitz continuity for a neural network $f(x)$:

$$f(x) = g_N \circ \cdots \circ g_2 \circ g_1(x) \quad (16)$$

where $g_i(x) = \sigma(W_i x + b_i)$. The K-Lipschitz continuity of $f(x)$ is:

$$\| f(x_1) - f(x_2) \| \le K \| x_1 - x_2 \| \quad (17)$$

By the composition property of Lipschitz constants, $\| h \circ g \|_{Lip} \le \| h \|_{Lip} \cdot \| g \|_{Lip}$, so for $f$ to satisfy K-Lipschitz continuity it suffices that each $g_i$ satisfies C-Lipschitz continuity with $C = \sqrt[N]{K}$:

$$\| g(x_1) - g(x_2) \| \le C \| x_1 - x_2 \| \quad (18)$$

$$\| \sigma(W x_1 + b) - \sigma(W x_2 + b) \| \le C \| x_1 - x_2 \| \quad (19)$$

When $x_1 \to x_2$, the Taylor expansion of Eq. (19) gives:

$$\left\| \frac{\partial \sigma}{\partial x} W (x_1 - x_2) \right\| \le C \| x_1 - x_2 \| \quad (20)$$

Normally $\sigma$ is a function with bounded derivative, such as the sigmoid, so the C-Lipschitz continuity can be written as:

$$\| W (x_1 - x_2) \| \le C \| x_1 - x_2 \| \quad (21)$$

Correspondingly, the spectral norm of a matrix is defined by:

$$\| W \|_2 = \max_{x \ne 0} \frac{\| W x \|}{\| x \|} \quad (22)$$

From this, we can use the spectral norm $\| W \|_2$ to represent the Lipschitz constant $C$.

D. The Training Dynamics of GANs
We reconsider Eq. (1) from Section II. Typically, training GANs amounts to solving a two-player zero-sum game via simultaneous gradient descent (SimGD) [1], [28]. The SimGD updates are:

$$\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)}), \qquad \theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)}) \quad (23)$$

Assuming that the GAN objectives are convex, much work has established global convergence [47], [53]. However, due to the high non-convexity of deep networks, even a simple GAN does not satisfy the convexity assumption [36]. Some recent work [54] obtained approximate global convergence under the assumption of an optimal discriminator, which is obviously unrealistic. We therefore consider local convergence: we hope the trajectory of the dynamical system enters a locally convergent point, a Nash equilibrium, as the iterations continue:

$$\bar{\phi} = \arg\max_\phi -f(\phi, \bar{\theta}), \qquad \bar{\theta} = \arg\max_\theta f(\bar{\phi}, \theta) \quad (24)$$

The point $(\bar{\phi}, \bar{\theta})$ is called a local Nash equilibrium if Eq. (24) holds in a local neighborhood of $(\bar{\phi}, \bar{\theta})$. For this differentiable two-player zero-sum game, define the vector:

$$v(\phi, \theta) = \begin{pmatrix} -\nabla_\phi f(\phi, \theta) \\ \nabla_\theta f(\phi, \theta) \end{pmatrix} \quad (25)$$

Then the Jacobian matrix is:

$$v'(\phi, \theta) = \begin{pmatrix} -\nabla^2_{\phi,\phi} f(\phi, \theta) & -\nabla^2_{\phi,\theta} f(\phi, \theta) \\ \nabla^2_{\phi,\theta} f(\phi, \theta) & \nabla^2_{\theta,\theta} f(\phi, \theta) \end{pmatrix} \quad (26)$$

Lemma 1: For zero-sum games, $v'$ is negative semi-definite at any local Nash equilibrium. Conversely, if $v(\bar{x}) = 0$ and $v'$ is negative definite, then $\bar{x}$ is a local Nash equilibrium.
Proof 2.1: See [55].
Lemma 1 gives the conditions for the local convergence of GANs, converting it into a negative semi-definiteness problem for the Jacobian matrix. Negative semi-definiteness of the Jacobian corresponds to its eigenvalues being less than or equal to 0. Then: if the eigenvalue of the Jacobian
matrix at a certain point is a negative real number, the training process can converge; but if the eigenvalue is complex and its real part is small while its imaginary part is relatively large, the training process is difficult to converge.

Fig. 2: The summary of the regularization and normalization for GANs.
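The complex-eigenvalue failure mode can be reproduced on the bilinear toy game $f(\phi, \theta) = \phi \theta$ (a standard example in this literature), whose Jacobian at the equilibrium $(0, 0)$ has purely imaginary eigenvalues $\pm i$. A minimal sketch of SimGD, Eq. (23), on this game:

```python
def simgd_bilinear(phi, theta, h, steps):
    """Plain SimGD, Eq. (23), on the toy zero-sum game f(phi, theta) = phi*theta.

    At the equilibrium (0, 0) the Jacobian of the dynamics has purely
    imaginary eigenvalues, so every step multiplies the squared distance
    to the equilibrium by 1 + h**2 > 1: the iterates spiral outward.
    """
    for _ in range(steps):
        # grad_phi f = theta, grad_theta f = phi (evaluated simultaneously)
        phi, theta = phi - h * theta, theta + h * phi
    return phi, theta

p0, t0 = 0.5, 0.5
p, t = simgd_bilinear(p0, t0, h=0.1, steps=100)
grows = (p * p + t * t) > (p0 * p0 + t0 * t0)  # True: no convergence
```

One can check algebraically that $(\phi - h\theta)^2 + (\theta + h\phi)^2 = (1 + h^2)(\phi^2 + \theta^2)$, so no step size fixes the oscillation; this is exactly the situation the Jacobian regularizers of Section III.D address.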
Proposition 1: Let $F: \Omega \to \Omega$ be a continuously differentiable function on an open subset $\Omega$ of $\mathbb{R}^n$, and let $\bar{x} \in \Omega$ be such that:
1. $F(\bar{x}) = \bar{x}$, and
2. the absolute values of the eigenvalues of the Jacobian $F'(\bar{x})$ are all smaller than 1.
Then there is an open neighborhood $U$ of $\bar{x}$ such that for all $x \in U$, the iterates $F^{(k)}(x)$ converge to $\bar{x}$. The rate of convergence is at least linear. More precisely, the error $\| F^{(k)}(x) - \bar{x} \|$ is in $O(|\lambda_{max}|^k)$ for $k \to \infty$, where $\lambda_{max}$ is the eigenvalue of $F'(\bar{x})$ with the largest absolute value.
Proof 2.2: See [55], Section 3, and [56], Proposition 4.4.1. From
Proposition 1, under the premise of asymptotic convergence, the local convergence of a GAN is equivalent to the absolute values of all eigenvalues of the Jacobian matrix at the fixed point ($v(\bar{\phi}, \bar{\theta}) = 0$) being less than 1. To obtain this condition, several Jacobian regularizations [55], [57]–[59] have been proposed.

III. REGULARIZATION AND NORMALIZATION FOR GANS

A. Overview
GANs have achieved remarkable results in image generation, but due to the non-convexity of deep networks, the training of GANs is very difficult, which causes many problems such as mode collapse [24], vanishing gradients [25], and sensitivity to hyperparameters [26]. To solve the problems in GAN training, a lot of work has been proposed [28]–[31]; this paper focuses only on regularization and normalization. Regularization was proposed to improve the generalization of neural networks and prevent overfitting, e.g., the L1 norm (lasso) [60] and the L2 norm (ridge regression) [61]. Normalization [40], [62] benefits SGD, accelerating convergence and improving accuracy. Unlike strongly supervised tasks, weakly supervised and unsupervised tasks have a more urgent need for regularization and normalization; there, regularization and normalization can be regarded as prior information that reduces task difficulty. GANs are a typical weakly supervised task, and a lot of work uses regularization and normalization for stable training. We have summarized these works and divided them into six categories according to their purposes and methods: gradient penalty, norm normalization and regularization, Jacobian regularization, layer normalization, consistency regularization, and self-supervision. A visual overview can be found in Fig. 2. From Fig. 2, all methods fall into three parts: optimal transport, training dynamics, and representation ability. Theoretically, the Kantorovich potential corresponding to the optimal transport map must satisfy 1-Lipschitz continuity;
and intuitively, Lipschitz continuity reduces the sensitivity of the discriminator to its input, making GAN training more stable. The gradient penalty can thus be used to enforce 1-Lipschitz or 0-Lipschitz continuity. By the equivalence of the Lipschitz constant and the spectral norm, we can also use spectral norm normalization to make the discriminator 1-Lipschitz. Norm normalization is a hard global restriction of Lipschitz continuity; similarly, we can use norm regularization, which is a relaxed version of norm normalization. As for training dynamics, training GANs amounts to solving a two-player zero-sum game via SimGD. Due to non-convexity, global convergence is almost impossible to guarantee. To achieve local convergence, the absolute values of the eigenvalues of the Jacobian matrix need to be less than 1, so several Jacobian regularizations have been proposed. To improve the representation ability of GANs, layer normalization, consistency regularization, and self-supervision have been used. These methods are intuitive but interesting and attract more and more attention from scholars.
B. Gradient Penalty
Theoretically, according to Part B of Section II, if the generator is the optimal transportation map, the discriminator must satisfy 1-Lipschitz continuity; intuitively, Lipschitz continuity asks that the output of a function be insensitive to noise in the input, which can be seen as a necessary condition for a stable system. For GANs, we consider the Lipschitz continuity of the generator and of the discriminator separately.
1) Gradient Penalty of the Discriminator:
For the discriminator, if the Wasserstein distance is used as the loss function, it must satisfy 1-Lipschitz continuity on the image space. A gradient penalty is a simple way to achieve this; the loss function can be written as:

$$L_{GP} = L_D + \lambda L_R \quad (27)$$

with, for 1-Lipschitz continuity:

$$L_R = \mathbb{E}_{\hat{x} \sim \pi} \left( \| \nabla f(\hat{x}) \| - 1 \right)^2 \quad (28)$$

where $\pi$ is a distribution over the image space and $f$ denotes the discriminator. WGAN-GP [29] was the first method to use a gradient penalty to implement 1-Lipschitz continuity. Because it is data-driven, WGAN-GP approximates the whole sample space with interpolations of real and generated samples: $\hat{x} = t x + (1 - t) y$ for $t \sim U[0, 1]$, with $x \sim \mu$ a real sample and $y \sim \upsilon$ a generated sample. Some works [34], [64] argue that the constraint of WGAN-GP is not reasonable: it is not necessary to restrict the global Lipschitz constant.

Some works try to find a suitable scope and gradient direction for the Lipschitz restriction. Kodali et al. [21] tracked the GAN training process and found that drops in the Inception Score (IS) were accompanied by sudden changes in the gradient of the discriminator around real images; based on this, they restrict the Lipschitz constant only around the real images, $\hat{x} = x + \epsilon$ with $\epsilon \sim N_d(0, cI)$. Inspired by Virtual Adversarial Training (VAT) [68], Dávid et al. [63] proposed Adversarial Lipschitz Regularization (ALR), which enforces 1-Lipschitz continuity at $\hat{x} = \{x, y\}$ along the direction of an adversarial perturbation. With the same motivation, Zhou et al. [64] consider restricting the global Lipschitz constant unnecessary: it is enough to penalize only the maximum gradient:

$$L_R = \left( \max_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \| - 1 \right)^2 \quad (29)$$

where $\hat{x} = t x + (1 - t) y$.

Further, Adler et al. [65] extend the $L^p$ ($p = 2$) space of WGAN-GP to Banach spaces, which contain the $L^p$ and Sobolev spaces. For a Banach space $B$, the dual norm $\| \cdot \|_{B^*}$ can be defined as:

$$\| x^* \|_{B^*} = \sup_{x \in B} \frac{x^*(x)}{\| x \|_B} \quad (30)$$

Then the gradient penalty of Banach Wasserstein GAN is:

$$L_R = \mathbb{E}_{\hat{x} \sim \pi} \left( \| \nabla f(\hat{x}) \|_{B^*} - 1 \right)^2 \quad (31)$$

where $\hat{x} = t x + (1 - t) y$. Of course, besides enforcing 1-Lipschitz continuity on $\hat{x} = t x + (1 - t) y$ as in WGAN-GP [29], the Banach norm could be used in other methods, which may also improve GAN performance.

In addition to finding the right scope of the Lipschitz restriction, some work tries to find the appropriate Lipschitz constant. From Part B of Section II, 1-Lipschitz continuity arises because optimal transport is a constrained linear programming problem, but optimal transport with regularization is an unconstrained optimization problem. Petzka et al. [34] set $c(x, y) = \| x - y \|$ in Eq. (14), and the dual form of optimal transport with regularization becomes:

$$\sup_{\varphi, \psi} \left\{ \mathbb{E}_{x \sim p(x)}[\varphi(x)] - \mathbb{E}_{y \sim q(y)}[\psi(y)] - \epsilon \int \int \max \{ 0, (\varphi(x) - \psi(y) - \| x - y \|) \}^2 \, dp(x) \, dq(y) \right\} \quad (32)$$

Motivated by the advantage of dealing with a single function, one may set $\varphi = \psi$ in Eq. (32), which leads to minimizing the objective:

$$\mathbb{E}_{y \sim q(y)}[\varphi(y)] - \mathbb{E}_{x \sim p(x)}[\varphi(x)] + \epsilon \int \int \max \{ 0, (\varphi(x) - \varphi(y) - \| x - y \|) \}^2 \, dp(x) \, dq(y) \quad (33)$$

This is very similar to WGAN-GP [29]: WGAN-GP regularizes the gradient so that $\| D \|_{Lip} \to 1$, whereas WGAN-LP [34] enforces $\| D \|_{Lip} \le 1$. The corresponding gradient penalty is:

$$L_R = \mathbb{E}_{\hat{x} \sim \pi} \left[ \left( \max \{ 0, \| \nabla f(\hat{x}) \| - 1 \} \right)^2 \right] \quad (34)$$

where $\hat{x} = t x + (1 - t) y$. As stated in [34], the constraint of optimal transport is too tight; with the added regularization, the transport need not be exactly optimal, which relaxes the constraint. For GANs, this also relaxes the limit on the Lipschitz constant, so that the generator can learn more complex distribution mappings. The 1-Lipschitz continuity of the discriminator is derived from optimal transport.
TABLE I: The gradient penalty of the discriminator

Reference | $L_R$ | $\hat{x}$ | Lipschitz continuity
2017 [29] | $\mathbb{E}_{\hat{x} \sim \pi} ( \| \nabla f(\hat{x}) \| - 1 )^2$ | $t x + (1 - t) y$ | $\| D \|_{Lip} \to 1$
[21] | $\mathbb{E}_{\hat{x} \sim \pi} ( \| \nabla f(\hat{x}) \| - 1 )^2$ | $x + \epsilon$ | $\| D \|_{Lip} \to 1$
[63] | $\mathbb{E}_{\hat{x} \sim \pi} ( \| \nabla f(\hat{x}) \| - 1 )^2$ | $\{x, y\}$ | $\| D \|_{Lip} \to 1$
[64] | $( \max_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \| - 1 )^2$ | $t x + (1 - t) y$ | $\| D \|_{Lip} \to 1$
[65] | $\mathbb{E}_{\hat{x} \sim \pi} ( \| \nabla f(\hat{x}) \|_{B^*} - 1 )^2$ | $t x + (1 - t) y$ | $\| D \|_{Lip} \to 1$
[34] | $\mathbb{E}_{\hat{x} \sim \pi} [ ( \max \{ 0, \| \nabla f(\hat{x}) \| - 1 \} )^2 ]$ | $t x + (1 - t) y$ | $\| D \|_{Lip} \le 1$
[66] | $\max_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \|^2$ | $t x + (1 - t) y$ | $\| D \|_{Lip} \to 0$
[58] | $\mathbb{E}_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \|^2$ | $\{x, y\}$ | $\| D \|_{Lip} \to 0$
[67] | $\mathbb{E}_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \|^2$ | $t x + (1 - t) y$ | $\| D \|_{Lip} \to 0$

TABLE II: The summary of the norm normalization

Reference | $W_\sigma$ | Lipschitz continuity
2018 [41] | $W_\sigma = W / \| W \|_2$ | $\| D \|_{Lip} \to 1$
[41] | $W_\sigma = W / \| W \|_F$ | $\| D \|_{Lip} \le 1$
[44] | $W_\sigma = W / \sqrt{\| W \|_1 \| W \|_\infty}$ | $\| D \|_{Lip} \le 1$
[69] | $W_\sigma = W / \| W \|_2 + \nabla W / \| W \|_2$ | $\| D \|_{Lip} \to 1$

But intuitively, to address GAN training instability, we can directly impose the Lipschitz regularization $\| D \|_{Lip} \to 0$ instead of $\| D \|_{Lip} \to 1$ or $\| D \|_{Lip} \le 1$. Zhou et al. [66] proposed Lipschitz GANs, which penalize the maximum of the gradients to avoid gradient uninformativeness:

$$L_R = \max_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \|^2 \quad (35)$$

where $\hat{x} = t x + (1 - t) y$. Also, Mescheder et al. [58] and Thanh-Tung et al. [67] proposed 0-GP-sample and 0-GP, respectively, which penalize the gradients at $\hat{x} = \{x, y\}$ and $\hat{x} = t x + (1 - t) y$:

$$L_R = \mathbb{E}_{\hat{x} \sim \pi} \| \nabla f(\hat{x}) \|^2 \quad (36)$$

A summary of the gradient penalties for the discriminator is shown in TABLE I. From it we can see that, starting from the 1-Lipschitz continuity of WGAN-GP, the gradient penalty can be improved in two directions:
* Adjust the Lipschitz constant so that $\| D \|_{Lip} \le 1$ or $\| D \|_{Lip} \to 0$.
* Explore a suitable scope for Lipschitz continuity.
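The penalties in TABLE I differ only in the target value and in the sampling scope of $\hat{x}$. As an illustrative sketch (a hypothetical linear critic is used so that the input gradient is available in closed form; real implementations compute $\nabla f(\hat{x})$ by automatic differentiation):

```python
def grad_norm_linear(w):
    # For a hypothetical linear critic f(x) = <w, x>, grad_x f(x) = w everywhere
    return sum(wi * wi for wi in w) ** 0.5

def gp_one(w):
    # WGAN-GP-style penalty (Eq. 28): pushes ||grad f|| toward 1
    return (grad_norm_linear(w) - 1.0) ** 2

def gp_zero(w):
    # 0-GP-style penalty (Eq. 36): pushes ||grad f|| toward 0
    return grad_norm_linear(w) ** 2

def interpolate(x, y, t):
    # The WGAN-GP sampling scope: x_hat = t*x + (1 - t)*y
    return [t * a + (1.0 - t) * b for a, b in zip(x, y)]
```

For a critic with gradient norm 5, the two targets give very different penalties (`gp_one` returns 16, `gp_zero` returns 25), and only the latter keeps penalizing as the norm approaches 1.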
2) Gradient Penalty of the Generator:
The instability of GAN training is mainly caused by the discriminator; for the generator, we mainly consider the problem of mode collapse. To avoid mode collapse in conditional generation, especially in tasks (inpainting, super-resolution) where the conditional label contains a lot of information, we want the generator to produce different images for small perturbations of the latent space. This is contrary to Lipschitz continuity. Yang et al. [70] proposed an opposite gradient penalty to achieve it:

$$\max_G L_z(G) = \mathbb{E}_{z_1, z_2} \left[ \min \left( \frac{\| G(y, z_1) - G(y, z_2) \|}{\| z_1 - z_2 \|}, \tau \right) \right] \quad (37)$$

where $y$ is the class label and $\tau$ is a bound ensuring numerical stability. The previous method was proposed from intuition; Odena et al. [71] show that the shrinking of the singular values of the generator's Jacobian is a main cause of instability and mode collapse during GAN training. The singular values can be approximated from the gradients, so they use Jacobian clamping to limit the singular values to $[\lambda_{min}, \lambda_{max}]$. The loss can be written as:

$$\min_G L_z(G) = \left( \max(Q, \lambda_{max}) - \lambda_{max} \right)^2 + \left( \min(Q, \lambda_{min}) - \lambda_{min} \right)^2 \quad (38)$$

where $Q = \| G(z) - G(z') \| / \| z - z' \|$. The above two methods are similar; the key point is to increase the sensitivity of the generator to input perturbations.

C. Norm Normalization and Regularization

1) Normalization:
Lipschitz continuity is important for GANs. From Part C of Section II, we know that the spectral norm of the weights and the Lipschitz constant carry the same meaning. To satisfy Lipschitz continuity, in addition to the gradient penalty, we can also use norm normalization.
Lemma 2: If $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_M$ are the eigenvalues of $W^\top W$, then the spectral norm is $\| W \|_2 = \sqrt{\lambda_M}$ and the Frobenius norm is $\| W \|_F = \sqrt{\sum_{i=1}^M \lambda_i}$.
Proof 3.1:
See [72] and [41]
Lemma 3: For an $n \times m$ matrix, $\| W \|_1 = \max_j \sum_{i=1}^n | a_{i,j} |$ and $\| W \|_\infty = \max_i \sum_{j=1}^m | a_{i,j} |$; then $\| W \|_2 \le \sqrt{\| W \|_1 \| W \|_\infty}$.
Proof 3.2:
See [72]
Lemma 4: For an $n \times m$ matrix, $\| W \|_F = \sqrt{\sum_{j=1}^m \sum_{i=1}^n | a_{i,j} |^2}$; then $\| W \|_2 \le \| W \|_F$.
Proof 3.3:
See [72].
1-Lipschitz continuity can be expressed as $\| W \|_2 = 1$. Miyato et al. [41] used spectral normalization $W_\sigma = W / \| W \|_2$ to obtain $\| W_\sigma \|_2 = 1$, a better implementation than the gradient penalty, resulting in a method that outperforms WGAN-GP. Similarly, according to optimal transport with regularization, the Lipschitz constant of the discriminator should be less than or equal to 1; correspondingly, we can normalize the weights by an upper bound of the spectral norm, so that $\| W_\sigma \|_2 \le 1$. By Lemma 3 and Lemma 4, $\sqrt{\| W \|_1 \| W \|_\infty}$ and $\| W \|_F$ can be used to normalize the weights. Zhang et al. [44] used $\sqrt{\| W \|_1 \| W \|_\infty}$, but their motivation was to find an easy-to-compute approximation of the spectral norm. Miyato et al. [41] only noted that the Frobenius norm restricts all eigenvalues, which would impair the network's expressive ability, but did not run experiments comparing it with spectral normalization. Liu et al. [69] found that mode collapse often appears with spectral normalization, and that mode collapse is often accompanied by a collapse of the eigenvalues of the discriminator: spectral normalization only limits the largest eigenvalue, and eigenvalue collapse means that the remaining eigenvalues suddenly decrease. They therefore adopted the following scheme to prevent the collapse of the eigenvalues:

$$W_\sigma = \frac{W + \nabla W}{\| W \|_2} = \frac{W}{\| W \|_2} + \frac{\nabla W}{\| W \|_2} \quad (39)$$

The results show that this simple method effectively prevents mode collapse. However, this work is purely empirical; the relationship between the matrix eigenvalues and GAN performance remains unclear. According to Part B of Section III, recent work has minimized the Lipschitz constant rather than limiting it to 1; so for norm normalization, could we normalize the spectral norm to a smaller value, such as 0.1, via $W_\sigma = 0.1 \cdot W / \| W \|_2$? Work in this direction has not yet been explored. There is little work on norm normalization overall; the most popular method is spectral normalization, and a summary is shown in TABLE II.
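Spectral normalization itself is easy to sketch: the spectral norm is usually estimated by power iteration, and Lemmas 3 and 4 supply cheap upper bounds. A minimal numerical check (assuming NumPy; the weight matrix is an arbitrary illustration):

```python
import numpy as np

def spectral_norm(W, iters=50):
    """Estimate ||W||_2 by power iteration (the estimator used by
    spectral normalization [41]); W is any 2-D weight matrix."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(W.shape[1])
    for _ in range(iters):
        u = W @ v
        u = u / np.linalg.norm(u)
        v = W.T @ u
        v = v / np.linalg.norm(v)
    return float(u @ (W @ v))

W = np.array([[2.0, 0.0], [1.0, 1.0]])
sigma = spectral_norm(W)
W_sn = W / sigma  # spectrally normalized weight: ||W_sn||_2 is close to 1

# Cheap upper bounds from Lemma 3 and Lemma 4:
bound_l3 = np.sqrt(np.abs(W).sum(axis=0).max() * np.abs(W).sum(axis=1).max())
bound_l4 = np.linalg.norm(W)  # Frobenius norm
```

Dividing by either bound instead of the power-iteration estimate yields the $\| D \|_{Lip} \le 1$ variants in TABLE II, at the cost of shrinking all singular values rather than only the largest one.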
2) Regularization: Analogous to the gradient penalty, spectral norm regularization plays the same role as 0-GP. Kurach et al. [45] used $L_R = \| W \|_2$ to regularize the loss function. Also, Zhou et al. [73] used $L_P$-norm ($P = 1, F, \infty$) regularization on the discriminator. These methods perform worse than spectral normalization and have not aroused much interest among researchers.
Assuming the GAN objectives are convex-concave, some work has established the global convergence of GANs [47], [74]. However, these theoretical results only hold for GANs with an optimal discriminator, which is unrealistic, so more work has focused on analyzing the local convergence of GANs. Nagarajan et al. [57] and Mescheder et al. [55] showed that under some assumptions, the GAN dynamics are locally convergent. But if these assumptions are not satisfied, especially when the data distributions are not continuous, the GAN dynamics are not always locally convergent unless some Jacobian regularization is used. From the
Proposition 1 in Section II, we hope that the absolute values of all eigenvalues of the Jacobian of $v(\phi, \theta)$ are less than 1 at the fixed point, which is equivalent to the real parts of the eigenvalues being negative together with a sufficiently small learning rate [55]. To satisfy this condition, Mescheder et al. [55] used consensus optimization to make the real parts of the eigenvalues negative. The resulting ConOpt updates are:

$$\phi^{(k+1)} = \phi^{(k)} + h \nabla_\phi \left( -f(\phi^{(k)}, \theta^{(k)}) - \gamma L(\phi_k, \theta_k) \right), \qquad \theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta \left( f(\phi^{(k)}, \theta^{(k)}) - \gamma L(\phi_k, \theta_k) \right) \quad (40)$$

where $L(\phi_k, \theta_k) = \frac{1}{2} \| v(\phi_k, \theta_k) \|^2 = \frac{1}{2} \left( \| \nabla_\phi f(\phi_k, \theta_k) \|^2 + \| \nabla_\theta f(\phi_k, \theta_k) \|^2 \right)$ is the regularizer of the Jacobian matrix.

There are also other methods to regularize the Jacobian matrix. Nagarajan et al. [57] regularize only the generator, using the gradient of the discriminator. The regularized update of the generator is:

$$\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\phi \| \nabla_\theta f(\phi_k, \theta_k) \|^2 \quad (41)$$

and the update of the discriminator is the same as in SimGD. Nie et al. [36] proposed a method that regularizes only the discriminator; the regularized update of the discriminator is:

$$\theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\theta \| \nabla_\phi f(\phi_k, \theta_k) \|^2 \quad (42)$$

and the update of the generator is the same as in SimGD. Nie et al. [36] also proposed another method (JARE) that regularizes both the generator and the discriminator.
The regularized updates for the generator and the discriminator are:

$$\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\phi \| \nabla_\theta f(\phi_k, \theta_k) \|^2, \qquad \theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\theta \| \nabla_\phi f(\phi_k, \theta_k) \|^2 \quad (43)$$

The difference between JARE and ConOpt is that JARE does not contain the Hessians $\nabla^2_{\phi,\phi} f(\phi_k, \theta_k)$ and $\nabla^2_{\theta,\theta} f(\phi_k, \theta_k)$, which avoids not only the phase factor, i.e., the Jacobian having complex eigenvalues with a large imaginary-to-real ratio, but also the conditioning factor, i.e., the Jacobian being ill-conditioned.
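On the bilinear toy game $f(\phi, \theta) = \phi \theta$, where $\nabla_\phi f = \theta$ and $\nabla_\theta f = \phi$, the JARE updates of Eq. (43) can be written in closed form, and the regularizer visibly restores convergence where plain SimGD spirals outward. An illustrative sketch:

```python
def jare_bilinear(phi, theta, h, gamma, steps):
    """JARE updates, Eq. (43), on the toy game f(phi, theta) = phi*theta.

    Here grad_phi f = theta and grad_theta f = phi, so the regularizers
    become -h*gamma*grad_phi(phi**2) = -2*h*gamma*phi (and likewise for
    theta): each player is pulled toward the equilibrium (0, 0).
    With gamma = 0 this reduces to plain SimGD.
    """
    for _ in range(steps):
        phi, theta = (phi - h * theta - 2.0 * h * gamma * phi,
                      theta + h * phi - 2.0 * h * gamma * theta)
    return phi, theta

p_reg, t_reg = jare_bilinear(0.5, 0.5, h=0.1, gamma=1.0, steps=200)   # converges
p_sgd, t_sgd = jare_bilinear(0.5, 0.5, h=0.1, gamma=0.0, steps=200)   # diverges
```

Each regularized step scales the squared distance to the equilibrium by $(1 - 2h\gamma)^2 + h^2$, which is below 1 for the chosen $h$ and $\gamma$, matching the eigenvalue condition of Proposition 1.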
TABLE III: The summary of the Jacobian regularization

Method | Regularized update of the generator | Regularized update of the discriminator
ConOpt [55] | $\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\phi \| v(\phi_k, \theta_k) \|^2$ | $\theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\theta \| v(\phi_k, \theta_k) \|^2$
Generator [57] | $\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\phi \| \nabla_\theta f(\phi_k, \theta_k) \|^2$ | $\theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)})$
Discriminator [36] | $\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)})$ | $\theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\theta \| \nabla_\phi f(\phi_k, \theta_k) \|^2$
JARE [36] | $\phi^{(k+1)} = \phi^{(k)} - h \nabla_\phi f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\phi \| \nabla_\theta f(\phi_k, \theta_k) \|^2$ | $\theta^{(k+1)} = \theta^{(k)} + h \nabla_\theta f(\phi^{(k)}, \theta^{(k)}) - h \gamma \nabla_\theta \| \nabla_\phi f(\phi_k, \theta_k) \|^2$
Method | Reference | Classification | Inputs of γ(c) and β(c)
Batch normalization | 2018 [41] | unconditional-based | —
Layer normalization | 2018 [41] | unconditional-based | —
Instance normalization | 2018 [41] | unconditional-based | —
Weight normalization | 2018 [41] | unconditional-based | —
Conditional batch normalization (CBN) | 2018 [75], [76] | conditional-based | class label
Adaptive instance normalization (AdaIN) | 2017 [77], 2019 [42] | conditional-based | target images
Spatially-adaptive (de)normalization (SPADE) | 2019 [78] | conditional-based | semantic segmentation map
Attentive normalization (AN) | 2020 [43] | conditional-based | self
E. Layer Normalization
In machine learning, we hope the data are independent and identically distributed (i.i.d.), but in deep learning, because of the Internal Covariate Shift (ICS) [40], the inputs of each neuron are not i.i.d., which makes the training of deep neural networks hard and unstable. To avoid this problem, many layer normalization methods have been proposed. Layer normalization is a simple approximation of whitening, which can resolve ICS in theory. The general form of layer normalization is:

h_N = (h − E[h]) / √(Var[h] + ε) · γ + β      (44)

For GANs, we divide layer normalization into two parts: unconditional and conditional normalization. Unconditional-based layer normalization is used for unconditional generation and is the same as in other deep neural networks. Conditional-based layer normalization is used in the generator for conditional generation, where the shift and scale parameters (γ, β) depend on the condition information. We define its form as:

h_N = (h − E[h]) / √(Var[h] + ε) · γ(c) + β(c)      (45)
1) Unconditional-based:
Unconditional-based layer normalization is used for both the generator and the discriminator, with the same motivation as in other deep neural networks. Ioffe et al. [40] proposed batch normalization (BN), the first normalization method for neural networks. BN uses the data of the mini-batch to compute the mean and variance, which makes the data distribution of each mini-batch approximately the same. Miyato et al. [41] evaluated BN in GANs: because image generation is a pixel-level task and BN normalizes across the mini-batch, it destroys the differences between pixels during generation, so BN does not achieve good results in GANs. Unlike BN, which normalizes the same channel across different images, layer normalization (LN) [62] normalizes the different channels of a single image, which destroys the diversity between channels for a pixel-by-pixel generative model [41]. Instance normalization (IN) [79], proposed for style transfer, operates on a single channel of a single image. Different from BN, LN, and IN, which normalize the inputs of the neural network, weight normalization [80] normalizes the weight matrices of the network; Miyato et al. [41] also applied it to GANs. In summary, unconditional-based layer normalization in GANs is the same as in other neural networks and has little effect on GAN training.
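The three input normalizations above differ only in the axes over which the statistics of Eq. (44) are computed, which the following numpy sketch makes explicit (the learned γ and β are omitted for brevity):

```python
import numpy as np

def normalize(h, axes, eps=1e-5):
    """General form of Eq. (44): subtract the mean and divide by the std
    computed over the given axes (with gamma = 1, beta = 0)."""
    mean = h.mean(axis=axes, keepdims=True)
    var = h.var(axis=axes, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 3, 4, 4)      # an (N, C, H, W) feature map

bn = normalize(x, axes=(0, 2, 3))    # batch norm: per channel, across the batch
ln = normalize(x, axes=(1, 2, 3))    # layer norm: per sample, across channels
inorm = normalize(x, axes=(2, 3))    # instance norm: per sample and channel

# Each variant is zero-mean over its own normalization axes.
print(np.allclose(bn.mean(axis=(0, 2, 3)), 0, atol=1e-6))  # True
```

This makes the discussion above concrete: BN shares statistics across the batch for each channel, LN shares them across channels for each sample, and IN keeps each sample-channel pair independent.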
2) Conditional-based:
Conditional-based layer normalization is only used in the generator for conditional generation. The motivation is to introduce conditional information into each layer of the generator, which helps to improve the quality of the generated images. γ(c) and β(c) in Eq. (45) are calculated by a neural network from features or class labels. Miyato et al. [75] and Zhang et al. [76] used conditional batch normalization (CBN) to encode class labels, thereby improving the quality of conditional generation. Huang et al. [77] and Karras et al. [42] used adaptive instance normalization (AdaIN) with target images to improve the accuracy of style transfer. Park et al. [78] used spatially-adaptive (de)normalization (SPADE) with semantic segmentation maps to let the semantic information be applied to all layers. Wang et al. [43] used attentive normalization (AN) to model long-range dependent attention, which is similar to self-attention GAN [76]. The main difference between these conditional-based normalizations is the condition input (c in Eq. (45)). As the input information is gradually enriched, the performance of conditional generation gradually improves. The summary is shown in TABLE IV.
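The conditional form of Eq. (45) can be sketched as follows, with a hypothetical per-class lookup table standing in for the learned embedding network that produces γ(c) and β(c) (a CBN-style illustration, not any specific paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, C = 10, 16

# Hypothetical per-class parameters; in practice gamma(c) and beta(c)
# are produced by learned embeddings or a small network over the condition c.
gamma_emb = rng.normal(1.0, 0.1, size=(num_classes, C))
beta_emb = rng.normal(0.0, 0.1, size=(num_classes, C))

def conditional_bn(h, labels, eps=1e-5):
    """Eq. (45): normalize per channel over the batch and spatial axes,
    then scale/shift with class-dependent gamma(c) and beta(c)."""
    mean = h.mean(axis=(0, 2, 3), keepdims=True)
    var = h.var(axis=(0, 2, 3), keepdims=True)
    h_hat = (h - mean) / np.sqrt(var + eps)
    g = gamma_emb[labels][:, :, None, None]   # shape (N, C, 1, 1)
    b = beta_emb[labels][:, :, None, None]
    return h_hat * g + b

h = rng.normal(size=(4, C, 8, 8))
y = np.array([0, 3, 3, 7])                    # per-sample class labels
out = conditional_bn(h, y)
print(out.shape)  # (4, 16, 8, 8)
```

Changing only the labels changes the per-channel scales and shifts, which is exactly how the condition is injected into every layer.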
F. Consistency Regularization
Consistency regularization has been used broadly in semi-supervised and unsupervised learning [85]–[88]. The motivation is that a model should produce consistent predictions for an input and its semantics-preserving augmentations, such as image rotations or adversarial attacks. The supervision in GAN training is weak. To increase the representation ability of GANs, several consistency regularizations have been used; according to their different goals, we divide them into two parts: image consistency and network consistency.
1) Image Consistency:
The purpose of generative adversarial networks is to generate fake images similar to real images. In GANs, the discriminator is generally used to discriminate between real and generated images. To increase the prior information, a regularization term can be added that minimizes the distance between the generated images and the real images. Salimans et al. [81] recommend training the generator with a feature matching procedure. The objective is:

L_C = ||E_{x∼p_r} f(x) − E_{z∼p_z} f(G(z))||₂²      (46)

where f(x) denotes an intermediate layer of the discriminator (an intermediate layer of a pre-trained classification model can be used as well). In addition to this first-order information, some work [82] added second-order information to constrain GAN training by feature matching. The objective is:

L_C = L_μ + L_Σ = ||μ(p_r) − μ(G(p_z))||_q + ||Σ(p_r) − Σ(G(p_z))||_k      (47)

where μ(p_r) = E_{x∼p_r} f(x) and Σ(p_r) = E_{x∼p_r} f(x)·f(x)^T represent the mean and (uncentered) covariance of the feature layer f(x), respectively. Apart from statistical differences, Durall et al. [83] found that deep generative models based on up-convolutions fail to reproduce spectral distributions, which leads to a large difference in the spectral distributions of real and generated images. Spectral regularization was therefore proposed:

L_C = −(1/(M/2 − 1)) Σ_{i=0}^{M/2−1} [ AI_i^real · log(AI_i^fake) + (1 − AI_i^real) · log(1 − AI_i^fake) ]      (48)

where M is the image size and AI_i is the azimuthal integration over the spectral representation obtained from the Fourier transform of the images.
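A small numpy sketch of first- and second-moment feature matching in the spirit of Eqs. (46)–(47); for simplicity it uses plain ℓ₂/Frobenius norms rather than the ℓ_q and Ky Fan norms of [82], and random vectors stand in for discriminator features:

```python
import numpy as np

def mean_cov_matching_loss(f_real, f_fake):
    """Match the mean and the uncentered covariance E[f f^T] of
    feature batches from real vs. generated data (cf. Eqs. 46-47)."""
    mu_r, mu_f = f_real.mean(axis=0), f_fake.mean(axis=0)
    cov_r = f_real.T @ f_real / len(f_real)   # E[f(x) f(x)^T]
    cov_f = f_fake.T @ f_fake / len(f_fake)
    l_mu = np.linalg.norm(mu_r - mu_f)        # first-order term
    l_sigma = np.linalg.norm(cov_r - cov_f)   # Frobenius norm of cov gap
    return l_mu + l_sigma

rng = np.random.default_rng(1)
f_real = rng.normal(0.0, 1.0, size=(256, 32))
f_close = rng.normal(0.0, 1.0, size=(256, 32))   # matching statistics
f_far = rng.normal(2.0, 1.0, size=(256, 32))     # shifted mean

print(mean_cov_matching_loss(f_real, f_close) <
      mean_cov_matching_loss(f_real, f_far))     # True
```

A generator whose feature statistics approach those of the real data drives this loss toward zero, even if individual samples differ.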
2) Network Consistency:
Network consistency regularization can be regarded as Lipschitz continuity with respect to semantics-preserving transforms. Specifically, the generator and discriminator should be insensitive to semantics-preserving transforms. Zhang et al. [38] proposed CR-GAN, which applies consistency regularization only between real images and their augmentations:

L_C = E_{x∼p_r} ||D(x) − D(T(x))||²      (49)

where T(x) represents an augmentation transform (e.g., shift, flip, cutout). Zhao et al. [31] proposed bCR-GAN and zCR-GAN. bCR-GAN applies consistency regularization to both real and fake images, balancing the discriminator between real and fake images through λ_real and λ_fake:

L_C = λ_real E_{x∼p_r} ||D(x) − D(T(x))||² + λ_fake E_{z∼p_z} ||D(G(z)) − D(T(G(z)))||²      (50)

zCR-GAN applies consistency regularization in the latent space. To avoid mode collapse, the generator gradient penalty mentioned in Section III is also used:

L_C = λ_dis E_{z∼p_z} ||D(G(z)) − D(G(T(z)))||² − λ_gen E_{z∼p_z} ||G(z) − G(T(z))||²      (51)

where T(z) represents adding a small perturbation noise to z. Maximizing E_{z∼p_z} ||G(z) − G(T(z))|| is the generator gradient penalty method [70] mentioned in Section III, which avoids mode collapse. Cyclic consistency regularization [84] has also been applied to unpaired image-to-image translation. The summary is shown in TABLE V.

G. Self Supervision
The supervision information in GANs is very weak. Chen et al. [30] found that, as training proceeds, the ability of the discriminator to capture spatial structure and discriminative features decreases, which is called discriminator forgetting. Self-supervision improves the representation of neural networks by expanding the label information; it differs from data augmentation and can be regarded as a kind of regularization. There are many ways to expand the label information, such as context prediction [89] and rotation prediction [90]–[93]. Chen et al. [30] used self-supervision in GAN training for the first time, employing rotation prediction as the expanded label to prevent discriminator forgetting. Huang et al. [39] used feature exchange to make the discriminator learn the proper feature structure of natural images. Baykal et al. [94] introduced a reshuffling task that randomly rearranges the structural blocks of the image, which helps the discriminator increase its expressive capacity for spatial structure and realistic appearance. These methods design different self-supervised tasks, and their loss functions can be expressed as:

L_{D,C} = −λ_d E_{x∼p_d^T} E_{T_k∼T} log( C_k(x) )
L_G = −λ_g E_{x∼p_g^T} E_{T_k∼T} log( C_k(x) )      (52)

where T represents the set of image transforms, such as rotation or reshuffling, and T_k represents the class k
of the transfer, such as 0°, 90°, 180°, and 270° for rotation. C_k is a classifier. p_d^T and p_g^T are the mixture distributions of transformed real and generated data samples. For rotation prediction [30], k = 4, and the classifier C predicts the rotation angle; for feature exchange [39], k = 2, and the classifier C predicts whether a swap has occurred; for block reshuffling [94], the image is divided into a grid of blocks, and the number of possible permutations is factorially large, so 30 permutations are selected according to the Hamming distances between the permutations as in [95]; thus k = 30, and the classifier C predicts the category of the permutation. In addition to designing different transfer functions T, some work uses self-supervised methods to improve discriminator performance from other aspects. Tran et al. [96] introduce true/false judgments on top of rotation prediction.

TABLE V: The summary of the consistency regularization

Method | L_C
Mean regularization [81] | ||E_{x∼p_r} f(x) − E_{z∼p_z} f(G(z))||₂²
Mean and covariance regularization [82] | ||E_{x∼p_r} f(x) − E_{z∼p_z} f(G(z))||_q + ||E_{x∼p_r} f(x)f(x)^T − E_{z∼p_z} f(G(z))f(G(z))^T||_k
Spectral regularization [83] | −(1/(M/2 − 1)) Σ_{i=0}^{M/2−1} [AI_i^real · log(AI_i^fake) + (1 − AI_i^real) · log(1 − AI_i^fake)]
CR-GAN [38], [84] | E_{x∼p_r} ||D(x) − D(T(x))||²
bCR-GAN [31] | λ_real E_{x∼p_r} ||D(x) − D(T(x))||² + λ_fake E_{z∼p_z} ||D(G(z)) − D(T(G(z)))||²
zCR-GAN [31] | λ_dis E_{z∼p_z} ||D(G(z)) − D(G(T(z)))||² − λ_gen E_{z∼p_z} ||G(z) − G(T(z))||²
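The rotation-prediction task of Eq. (52) can be sketched as follows; the classifier below is purely hypothetical (it reads the rotation off a marker pixel), standing in for the shared discriminator features plus rotation head used in [30]:

```python
import numpy as np

def rotations(x):
    """The four self-supervision transforms T_k of rotation prediction:
    rotate the image by 0, 90, 180, 270 degrees (k = 4 classes)."""
    return [np.rot90(x, k, axes=(0, 1)) for k in range(4)]

def rotation_ssl_loss(images, classifier):
    """Eq. (52) in miniature: average cross-entropy of the rotation
    classifier over all rotated copies of a batch."""
    loss, n = 0.0, 0
    for x in images:
        for k, xr in enumerate(rotations(x)):
            probs = classifier(xr)            # length-4 probability vector
            loss -= np.log(probs[k] + 1e-12)
            n += 1
    return loss / n

# Hypothetical toy classifier: guesses the rotation from the position of
# the brightest pixel of a marker image (for illustration only).
def toy_classifier(x):
    idx = np.unravel_index(np.argmax(x), x.shape)
    h, w = x.shape
    corners = [(0, 0), (h - 1, 0), (h - 1, w - 1), (0, w - 1)]
    probs = np.full(4, 0.05)
    for k, c in enumerate(corners):
        if idx == c:
            probs[k] = 0.85
    return probs / probs.sum()

img = np.zeros((8, 8)); img[0, 0] = 1.0   # marker in the top-left corner
print(rotation_ssl_loss([img], toy_classifier))  # -log(0.85), about 0.163
```

A discriminator that keeps solving this auxiliary task retains spatial-structure features throughout training, which is exactly the remedy for discriminator forgetting described above.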
The number of classes is 5, and the loss functions are:

L_{D,C} = −λ_d ( E_{x∼p_d^T} E_{T_k∼T} log( C_k(x) ) + E_{x∼p_g^T} E_{T_k∼T} log( C_{k+1}(x) ) )
L_G = −λ_g ( E_{x∼p_g^T} E_{T_k∼T} log( C_k(x) ) − E_{x∼p_g^T} E_{T_k∼T} log( C_{k+1}(x) ) )      (53)

where C_k and C_{k+1} are the classifiers for the rotation angle and for real versus fake, respectively. This self-supervised GAN based on rotation uses a multi-class minimax game to avoid mode collapse, which performs better than the original self-supervised model. Sun et al. [97] used a triplet matching objective as the pretext task: pairs of images with the same category should have similar features, while pairs of images with different categories should have dissimilar features. The loss functions are:

L_D = −λ_d E_{x_a, x_p ∼ P_{X|y}, x_n ∼ P_{X|y′}} [ log( D_mch(x_a, x_p) ) + log( 1 − D_mch(x_a, x_n) ) ]
L_G = −λ_g E_{x_1, x_2 ∼ P_{X|y}, x_3 ∼ P_{X|y′}} [ log( D_mch(G(x_1, y), G(x_2, y)) ) + log( 1 − D_mch(G(x_1, y), G(x_3, y′)) ) ]      (54)

where y ≠ y′; x_a and x_p are positive pairs of real images, and x_a and x_n are negative pairs of real images. Lee et al. [98] maximized mutual information as a self-supervision task. A real or fake image x passes through a discriminator encoder E_φ = f_φ ∘ C_φ, producing a local feature map C_φ(x) and a global feature vector E_φ(x). To maximize the lower bound of the mutual information I( C_φ(x), E_φ(x) ), the InfoNCE loss, theoretically justified in [99], is used. It is defined as:

L_nce(X) = −E_{x∈X} E_{i∈A} [ log p( C_φ^(i)(x), E_φ(x) | X ) ]      (55)

where X = { x_1, ..., x_N } is a set of random images and A = { 0, 1, ..., M² − 1 } represents the indices of an M × M spatial-sized local feature map.
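A minimal sketch of the InfoNCE objective of Eq. (55), assuming a simple dot-product critic between local and global features (an illustration of the contrastive idea, not the InfoMax-GAN implementation):

```python
import numpy as np

def info_nce(local_feats, global_feats, tau=1.0):
    """For each image, every local feature C^(i)(x_n) should score higher
    with its own global feature E(x_n) than with the global features of the
    other images in the batch.
    local_feats: (N, A, d) local features; global_feats: (N, d)."""
    N, A, d = local_feats.shape
    # scores[n, i, m] = <C^(i)(x_n), E(x_m)> / tau
    scores = np.einsum('nad,md->nam', local_feats, global_feats) / tau
    # log-softmax over the batch dimension; the positive is m = n
    log_p = scores - np.log(np.exp(scores).sum(axis=-1, keepdims=True))
    return -np.mean(log_p[np.arange(N), :, np.arange(N)])

rng = np.random.default_rng(2)
g = rng.normal(size=(8, 16))                                 # global features E(x)
aligned = g[:, None, :] + 0.1 * rng.normal(size=(8, 4, 16))  # matching local features
shuffled = rng.normal(size=(8, 4, 16))                       # unrelated local features

print(info_nce(aligned, g) < info_nce(shuffled, g))  # True
```

Encoders whose local and global features agree on the same image drive the loss down, which is the sense in which minimizing it maximizes the mutual-information lower bound.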
Then the loss functions of the discriminator and the generator are:

L_D = λ_d L_nce(X_r)
L_G = λ_g L_nce(X_g)      (56)

where X_r and X_g represent sets of real and generated images, respectively. Although there is less work in this area, the idea of self-supervision is very attractive. It may be a feasible solution to the weak supervision information and unstable training of GANs.

IV. SUMMARY AND OUTLOOK
Regularization and normalization are popular and attractive. This paper summarizes the regularization and normalization methods for generative adversarial networks and explains them from three aspects:

1) optimal transport and Lipschitz continuity: we first introduce optimal transport and optimal transport with regularization; from the duality form, we obtain WGAN-GP [29] and WGAN-LP [34], which use gradient penalties to make the discriminator satisfy 1-Lipschitz continuity and k-Lipschitz continuity (k ≤ 1), respectively. Many gradient penalty methods have since been proposed, which can be seen as improvements of WGAN-GP in two directions: (1) limiting the Lipschitz constant to a small value, such as 0 [58], [66], [67]; (2) finding the right restricted space: the restricted space of WGAN-GP and WGAN-LP is the interpolation between real and fake images, while some work restricts it to the real or fake images themselves [21], [58], [63]. From the results, (1) helps improve GAN training, but the impact of (2) is still unclear. We then derive the relationship between the Lipschitz constant and the spectral norm of the weight matrix, indicating that spectral norm normalization [41] can also be used to achieve 1-Lipschitz continuity, and prove that the Frobenius norm is an upper bound of the spectral norm, so the result of WGAN-LP can be obtained by normalizing the Frobenius norm. Some work [69] also analyzed the relationship between mode collapse in GANs and the eigenvalue distribution of the discriminator weight matrix, and avoided mode collapse by limiting eigenvalue collapse, but no rigorous justification has been given.

2) training dynamics: GAN is a two-person zero-sum game. Due to the nonlinearity of the neural networks, it is difficult to find the globally convergent solution of the problem.
To find the local convergence of the problem, Jacobian regularization needs to be added. There is not much work in this area.

3) representation ability: as we said before, the training of GANs is a semi-supervised task. We need more prior information to improve the representation ability of the network. Conditional layer normalization is used to better encode condition information and reduce the difficulty of conditional generation. Consistency regularization and self-supervision use unsupervised methods to improve the supervision information in GAN training and use additional labels and tasks to improve GAN performance. This is meaningful and promising; there will be more work in this area in the future.

ACKNOWLEDGMENT
The work is partially supported by the National Natural Science Foundation of China under grant No. U19B2044 and No. 61836011.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in neural information processing systems , 2014, pp. 2672–2680.[2] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros,“Context encoders: Feature learning by inpainting,” in
Proceedings ofthe IEEE conference on computer vision and pattern recognition , 2016,pp. 2536–2544.[3] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generativeimage inpainting with contextual attention,” in
Proceedings of the IEEEconference on computer vision and pattern recognition , 2018, pp. 5505–5514.[4] U. Demir and G. Unal, “Patch-based image inpainting with generativeadversarial networks,” arXiv preprint arXiv:1803.07422 , 2018.[5] K. Javed, N. U. Din, S. Bae, and J. Yi, “Image unmosaicing withoutlocation information using stacked gan,”
IET Computer Vision , vol. 13,no. 6, pp. 588–594, 2019.[6] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson,and M. N. Do, “Semantic image inpainting with deep generativemodels,” in
Proceedings of the IEEE conference on computer visionand pattern recognition , 2017, pp. 5485–5493.[7] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsu-pervised image-to-image translation,” in
Proceedings of the EuropeanConference on Computer Vision (ECCV) , 2018, pp. 172–189.[8] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang,“Diverse image-to-image translation via disentangled representations,”in
Proceedings of the European conference on computer vision (ECCV) ,2018, pp. 35–51.[9] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio, “Image-to-imagetranslation for cross-domain disentanglement,” in
Advances in neuralinformation processing systems , 2018, pp. 1287–1298. [10] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Mosseri, F. Cole, andK. Murphy, “Xgan: Unsupervised image-to-image translation for many-to-many mappings,” in
Domain Adaptation for Visual Understanding .Springer, 2020, pp. 33–49.[11] H.-Y. Lee, H.-Y. Tseng, Q. Mao, J.-B. Huang, Y.-D. Lu, M. Singh,and M.-H. Yang, “Drit++: Diverse image-to-image translation via dis-entangled representations,”
International Journal of Computer Vision ,pp. 1–16, 2020.[12] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse imagesynthesis for multiple domains,” in
Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition , 2020, pp.8188–8197.[13] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.Metaxas, “Stackgan: Text to photo-realistic image synthesis with stackedgenerative adversarial networks,” in
Proceedings of the IEEE interna-tional conference on computer vision , 2017, pp. 5907–5915.[14] ——, “Stackgan++: Realistic image synthesis with stacked generativeadversarial networks,”
IEEE transactions on pattern analysis and ma-chine intelligence , vol. 41, no. 8, pp. 1947–1962, 2018.[15] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, andX. He, “Attngan: Fine-grained text to image generation with attentionalgenerative adversarial networks,” in
Proceedings of the IEEE conferenceon computer vision and pattern recognition , 2018, pp. 1316–1324.[16] T. Qiao, J. Zhang, D. Xu, and D. Tao, “Mirrorgan: Learning text-to-image generation by redescription,” in
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2019, pp.1505–1514.[17] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent spaceof gans for semantic face editing,” in
Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition , 2020, pp.9243–9252.[18] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola, “Ganalyze:Toward visual definitions of cognitive image properties,” in
Proceedingsof the IEEE International Conference on Computer Vision , 2019, pp.5744–5753.[19] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, “Ganspace: Dis-covering interpretable gan controls,” arXiv preprint arXiv:2004.02546 ,2020.[20] R. Tao, Z. Li, R. Tao, and B. Li, “Resattr-gan: Unpaired deep resid-ual attributes learning for multi-domain face image translation,”
IEEEAccess , vol. 7, pp. 132 594–132 608, 2019.[21] N. Kodali, J. Abernethy, J. Hays, and Z. Kira, “On convergence andstability of gans,” arXiv preprint arXiv:1705.07215 , 2017.[22] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training forhigh fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096 ,2018.[23] W. Nie and A. Patel, “Jr-gan: Jacobian regularization for generativeadversarial networks,” arXiv preprint arXiv:1806.09235 , 2018.[24] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton,“Veegan: Reducing mode collapse in gans using implicit variationallearning,” in
Advances in Neural Information Processing Systems , 2017,pp. 3308–3318.[25] M. Arjovsky and L. Bottou, “Towards principled methods for traininggenerative adversarial networks,”
Stat , 2017.[26] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, “Thegan landscape: Losses, architectures, regularization, and normalization,”2018.[27] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growingof gans for improved quality, stability, and variation,” arXiv preprintarXiv:1710.10196 , 2017.[28] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXivpreprint arXiv:1701.07875 , 2017.[29] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,“Improved training of wasserstein gans,” in
Advances in neural infor-mation processing systems , 2017, pp. 5767–5777.[30] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby, “Self-supervisedgans via auxiliary rotation loss,” in
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2019, pp. 12 154–12 163.[31] Z. Zhao, S. Singh, H. Lee, Z. Zhang, A. Odena, and H. Zhang,“Improved consistency regularization for gans,” arXiv preprintarXiv:2002.04724 , 2020.[32] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter,“Gans trained by a two time-scale update rule converge to a local nashequilibrium,” in
Advances in neural information processing systems ,2017, pp. 6626–6637.
[33] A. Y. Ng, “Feature selection, l1 vs. l2 regularization, and rotational invariance,” in
Proceedings of the twenty-first international conferenceon Machine learning , 2004, p. 78.[34] H. Petzka, A. Fischer, and D. Lukovnicov, “On the regularization ofwasserstein gans,” arXiv preprint arXiv:1709.08894 , 2017.[35] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang, “Improving the improvedtraining of wasserstein gans: A consistency term and its dual effect,” arXiv preprint arXiv:1803.01541 , 2018.[36] W. Nie and A. Patel, “Towards a better understanding and regularizationof gan training dynamics,” arXiv preprint arxiv:1806.09235 , 2019.[37] T. Liang and J. Stokes, “Interaction matters: A note on non-asymptoticlocal convergence of generative adversarial networks,” in
The 22ndInternational Conference on Artificial Intelligence and Statistics , 2019,pp. 907–915.[38] H. Zhang, Z. Zhang, A. Odena, and H. Lee, “Consistency regularizationfor generative adversarial networks,” arXiv preprint arXiv:1910.12027 ,2019.[39] R. Huang, W. Xu, T.-Y. Lee, A. Cherian, Y. Wang, and T. Marks, “Fx-gan: Self-supervised gan learning via feature exchange,” in
The IEEEWinter Conference on Applications of Computer Vision , 2020, pp. 3194–3202.[40] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” arXiv preprintarXiv:1502.03167 , 2015.[41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectralnormalization for generative adversarial networks,” arXiv preprintarXiv:1802.05957 , 2018.[42] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture forgenerative adversarial networks,” in
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2019, pp. 4401–4410.[43] Y. Wang, Y.-C. Chen, X. Zhang, J. Sun, and J. Jia, “Attentive normaliza-tion for conditional image generation,” in
Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition , 2020, pp.5094–5103.[44] Z. Zhang, Y. Zeng, L. Bai, Y. Hu, M. Wu, S. Wang, and E. R.Hancock, “Spectral bounding: Strictly satisfying the 1-lipschitz propertyfor generative adversarial networks,”
Pattern Recognition , p. 107179,2019.[45] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, “A large-scale study on regularization and normalization in gans,” arXiv preprintarXiv:1807.04720 , 2018.[46] M. Lee and J. Seok, “Regularization methods for generative ad-versarial networks: An overview of recent studies,” arXiv preprintarXiv:2005.09165 , 2020.[47] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generativeneural samplers using variational divergence minimization,” in
Advancesin neural information processing systems , 2016, pp. 271–279.[48] N. Bonnotte, “From knothe’s rearrangement to brenier’s optimal trans-port map,”
SIAM Journal on Mathematical Analysis , vol. 45, no. 1, pp.64–87, 2013.[49] L. V. Kantorovich, “On a problem of monge,”
J. Math. Sci.(NY) , vol.133, p. 1383, 2006.[50] Y. Brenier, “Polar factorization and monotone rearrangement of vector-valued functions,”
Communications on pure and applied mathematics ,vol. 44, no. 4, pp. 375–417, 1991.[51] N. Lei, K. Su, L. Cui, S.-T. Yau, and X. D. Gu, “A geometricview of optimal transportation and generative model,”
Computer AidedGeometric Design , vol. 68, pp. 1–21, 2019.[52] G. Peyré, M. Cuturi et al. , “Computational optimal transport,”
Founda-tions and Trends R (cid:13) in Machine Learning , vol. 11, no. 5-6, pp. 355–607,2019.[53] A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein, “Sta-bilizing adversarial nets with prediction methods,” arXiv preprintarXiv:1705.07364 , 2017.[54] J. Li, A. Madry, J. Peebles, and L. Schmidt, “On the limita-tions of first-order approximation in gan dynamics,” arXiv preprintarXiv:1706.09884 , 2017.[55] L. Mescheder, S. Nowozin, and A. Geiger, “The numerics of gans,” in Advances in Neural Information Processing Systems , 2017, pp. 1825–1835.[56] O. L. Mangasarian,
Nonlinear programming . SIAM, 1994.[57] V. Nagarajan and J. Z. Kolter, “Gradient descent gan optimization islocally stable,” in
Advances in neural information processing systems ,2017, pp. 5585–5595.[58] L. Mescheder, A. Geiger, and S. Nowozin, “Which training methods forgans do actually converge?” arXiv preprint arXiv:1801.04406 , 2018. [59] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, “Stabilizing trainingof generative adversarial networks through regularization,” in
Advancesin neural information processing systems , 2017, pp. 2018–2028.[60] T. Park and G. Casella, “The bayesian lasso,”
Journal of the AmericanStatistical Association , vol. 103, no. 482, pp. 681–686, 2008.[61] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimationfor nonorthogonal problems,”
Technometrics , vol. 12, no. 1, pp. 55–67,1970.[62] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXivpreprint arXiv:1607.06450 , 2016.[63] D. Terjék, “Adversarial lipschitz regularization,” in
International Con-ference on Learning Representations , 2019.[64] Z. Zhou, J. Shen, Y. Song, W. Zhang, and Y. Yu, “Towards efficient andunbiased implementation of lipschitz continuity in gans,” arXiv preprintarXiv:1904.01184 , 2019.[65] J. Adler and S. Lunz, “Banach wasserstein gan,” in
Advances in NeuralInformation Processing Systems , 2018, pp. 6754–6763.[66] Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu,and Z. Zhang, “Lipschitz generative adversarial nets,” arXiv preprintarXiv:1902.05687 , 2019.[67] H. Thanh-Tung, T. Tran, and S. Venkatesh, “Improving generaliza-tion and stability of generative adversarial networks,” arXiv preprintarXiv:1902.03984 , 2019.[68] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, “Virtual adversarialtraining: a regularization method for supervised and semi-supervisedlearning,”
IEEE transactions on pattern analysis and machine intelli-gence , vol. 41, no. 8, pp. 1979–1993, 2018.[69] K. Liu, W. Tang, F. Zhou, and G. Qiu, “Spectral regularization for com-bating mode collapse in gans,” in
Proceedings of the IEEE InternationalConference on Computer Vision , 2019, pp. 6382–6390.[70] D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee, “Diversity-sensitive conditional generative adversarial networks,” arXiv preprintarXiv:1901.09024 , 2019.[71] A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel,and I. Goodfellow, “Is generator conditioning causally related to ganperformance?” arXiv preprint arXiv:1802.08768 , 2018.[72] R. Mathias, “The spectral norm of a nonnegative matrix,”
Linear Algebraand its Applications , vol. 139, pp. 269–284, 1990.[73] C. Zhou, J. Zhang, and J. Liu, “Lp-wgan: Using lp-norm normalizationto stabilize wasserstein generative adversarial networks,”
Knowledge-Based Systems , vol. 161, pp. 415–424, 2018.[74] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “Avariational inequality perspective on generative adversarial networks,” arXiv preprint arXiv:1802.10551 , 2018.[75] T. Miyato and M. Koyama, “cgans with projection discriminator,” arXivpreprint arXiv:1802.05637 , 2018.[76] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention gen-erative adversarial networks,” arXiv preprint arXiv:1805.08318 , 2018.[77] X. Huang and S. Belongie, “Arbitrary style transfer in real-time withadaptive instance normalization,” in
Proceedings of the IEEE Interna-tional Conference on Computer Vision , 2017, pp. 1501–1510.[78] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic imagesynthesis with spatially-adaptive normalization,” in
Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition , 2019,pp. 2337–2346.[79] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: Themissing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022 ,2016.[80] T. Salimans and D. P. Kingma, “Weight normalization: A simplereparameterization to accelerate training of deep neural networks,” in
Advances in neural information processing systems , 2016, pp. 901–909.[81] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, andX. Chen, “Improved techniques for training gans,” in
Advances in neuralinformation processing systems , 2016, pp. 2234–2242.[82] Y. Mroueh, T. Sercu, and V. Goel, “Mcgan: Mean and covariance featurematching gan,” arXiv preprint arXiv:1702.08398 , 2017.[83] R. Durall, M. Keuper, and J. Keuper, “Watch your up-convolution: Cnnbased generative deep neural networks are failing to reproduce spectraldistributions,” in
Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition , 2020, pp. 7890–7899.[84] T. Ohkawa, N. Inoue, H. Kataoka, and N. Inoue, “Augmented cyclicconsistency regularization for unpaired image-to-image translation,” arXiv preprint arXiv:2003.00187 , 2020.[85] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsu-pervised data augmentation for consistency training,” arXiv preprintarXiv:1904.12848 , 2019.
[86] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020. [87] M. Gao, Z. Zhang, G. Yu, S. O. Arik, L. S. Davis, and T. Pfister, “Consistency-based semi-supervised active learning: Towards minimizing labeling cost,” arXiv preprint arXiv:1910.07153, 2019. [88] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” arXiv preprint arXiv:2002.05709, 2020. [89] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in
Proceedings of the IEEEinternational conference on computer vision , 2015, pp. 1422–1430.[90] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised repre-sentation learning by predicting image rotations,” arXiv preprintarXiv:1803.07728 , 2018.[91] H. Lee, S. J. Hwang, and J. Shin, “Rethinking data augmentation:Self-supervision and self-distillation,” arXiv preprint arXiv:1910.05872 ,2019.[92] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visualrepresentation learning,” in
Proceedings of the IEEE conference onComputer Vision and Pattern Recognition , 2019, pp. 1920–1929.[93] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, “S4l: Self-supervisedsemi-supervised learning,” in
Proceedings of the IEEE internationalconference on computer vision , 2019, pp. 1476–1485.[94] G. Baykal and G. Unal, “Deshufflegan: A self-supervised gan to improvestructure learning,” arXiv preprint arXiv:2006.08694 , 2020.[95] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi,“Domain generalization by solving jigsaw puzzles,” in
Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition ,2019, pp. 2229–2238.[96] N.-T. Tran, V.-H. Tran, B.-N. Nguyen, L. Yang et al. , “Self-supervisedgan: Analysis and improvement with multi-class minimax game,” in
Advances in Neural Information Processing Systems , 2019, pp. 13 253–13 264.[97] J. Sun, B. Bhattarai, and T.-K. Kim, “Matchgan: A self-supervised semi-supervised conditional generative adversarial network,” arXiv preprintarXiv:2006.06614 , 2020.[98] K. S. Lee, N.-T. Tran, and N.-M. Cheung, “Infomax-gan: Improved ad-versarial image generation via information maximization and contrastivelearning,” arXiv preprint arXiv:2007.04589 , 2020.[99] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning withcontrastive predictive coding,” arXiv preprint arXiv:1807.03748 , 2018.
Ziqiang Li received the B.E. degree from the University of Science and Technology of China (USTC), Hefei, China, in 2019, and is pursuing the Master's degree at the same university. His research interests include medical image segmentation, deep generative models, and machine learning.
Rentuo Tao received the B.E. degree from Hefei University of Technology (HFUT), Hefei, China, in 2013, and is pursuing the Ph.D. degree at the University of Science and Technology of China (USTC), Hefei, China. His research interests include deep generative models, machine learning, and computer vision.