On the Anomalous Generalization of GANs
Jinchen Xuan∗, Peking University
Yunchang Yang∗, Peking University
Ze Yang, Peking University, [email protected]
Di He, Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, [email protected]
Liwei Wang, Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, [email protected]
October 8, 2019
Abstract
Generative models, especially Generative Adversarial Networks (GANs), have received significant attention recently. However, it has been observed that in terms of some attributes, e.g. the number of simple geometric primitives in an image, GANs are not able to learn the target distribution in practice. Motivated by this observation, we discover two specific problems of GANs leading to anomalous generalization behaviour, which we refer to as sample insufficiency and pixel-wise combination. For the first problem, sample insufficiency, we show theoretically and empirically that the batchsize used in practice may be insufficient for the discriminator to learn an accurate discrimination function. This can result in unstable training dynamics for the generator, leading to anomalous generalization. For the second problem, pixel-wise combination, we find that besides recognizing the positive training samples as real, under certain circumstances the discriminator can be fooled into recognizing pixel-wise combinations (e.g. pixel-wise averages) of the positive training samples as real, even though those combinations can be visually different from real samples in the target distribution. With the fooled discriminator as reference, the generator then receives biased supervision, again leading to anomalous generalization. Additionally, we propose methods to mitigate the anomalous generalization of GANs. Extensive experiments on benchmarks show that our proposed methods improve the FID score by up to 30% on natural image datasets.
Generative Adversarial Networks (GANs) have great potential in modeling complex data distributions and have attracted significant attention recently. A great number of techniques (Goodfellow et al., 2014; Miyato et al., 2018; Arjovsky et al., 2017; Gulrajani et al., 2017; Salimans et al., 2016; Brock et al., 2018) and architectures (Radford et al., 2015; Karras et al., 2018; Zhang et al., 2018; Mirza & Osindero, 2014) have been developed to make the training of GANs more stable and to generate high-fidelity, diverse images. The corresponding generated samples are authentic and difficult for humans to distinguish from real ones.

∗ Equal contribution.

Figure 1: Left: The generated images have different rectangle numbers (e.g. one, two or three), while the rectangle numbers of all the training data are exactly two (the anomalous ones marked red). Right: The proportion of correct generated images (rectangle number is two) for different training approaches. The training dataset consists of 25600 images, all of which have exactly two rectangles.

Despite these improvements, recent work (Zhao et al., 2018) reported a surprising phenomenon of anomalous generalization of GANs on a geometry dataset, raising new questions about their generalization behaviour. By anomalous generalization we mean that several seemingly easy attributes are learned poorly by GANs, including numerosity (number of objects) and color proportion, which are important for human perception. For example, as shown in Figure 1, for a geometric-object training dataset where the number of objects in each training image is fixed (e.g. every training image has exactly two rectangles), most generated images after training have very different numbers of objects than the training images (e.g. the rectangle numbers of most generated images are not two). Mathematically speaking, with regard to the number of objects, the learned distribution of GANs differs significantly from the target distribution, failing the goal of modeling the target data distribution faithfully.

Several works have developed theories for GANs. The original work proves convergence to equilibrium under ideal conditions (Goodfellow et al., 2014). Further extensions include (Arjovsky et al., 2017; Miyato et al., 2018; Nagarajan & Kolter, 2017; Mescheder et al., 2018; Bai et al., 2018; Heusel et al., 2017). Arora et al. (2017) point out that GANs may not generalize well when the discriminator has finite capacity, e.g. neural networks, but show that generalization does occur under the weaker metric of neural net distance. Although these theories provide deep understanding, the generalization and convergence of GANs, as well as how to achieve them in practice, are still open problems.

Motivated by this observation, we discover and investigate two specific problems of GANs, namely sample insufficiency and pixel-wise combination, which cause GANs to have anomalous generalization behaviour. Moreover, we propose methods to improve the generalization of GANs.

For the problem of sample insufficiency, we show theoretically and empirically that the batchsize used in practice can be insufficient for GANs to model the target data distribution. In a typical GAN setting, the discriminator learns to separate the fake data distribution of the generator from the real data distribution approximated by the training dataset.
In practice, the discriminator learns such a separation function based on mini-batches sampled from the training dataset and from the generator. However, since the size of a mini-batch is much smaller than the set of all possible samples in the high-dimensional data distribution, the separation function learned from mini-batch samples can be noisy. With this noisy discriminator as reference, the generator learns a noisy generation function as well and the training dynamics become unstable. As a result, GANs can exhibit anomalous generalization behaviour.

For the problem of pixel-wise combination, we find that in some situations both the positive training samples and their pixel-wise combinations (pixel-wise average or pixel-wise logical-and) are recognized as real by the discriminator during training. However, the pixel-wise combinations of the positive training samples can have very different properties from the real samples themselves, indicating that the discriminator is unable to differentiate those seemingly easy attributes (e.g. number of objects). With this fooled discriminator as reference, the generator is fooled further into generating those pixel-wise combinations of training samples, which may not belong to the target distribution. As a result, the data distribution learned by the generator can differ substantially from the target data distribution and the generalization of GANs becomes anomalous.

To summarize, our contributions are:

• We show that in certain circumstances the discriminator tends to recognize the pixel-wise combinations of the positive training samples as real, which can fool the generator into anomalous generalization behaviour.

• We demonstrate theoretically and empirically that sample insufficiency in practice can result in unstable training dynamics and anomalous generalization of GANs.

• We show that the anomalous generalization reported in Zhao et al. (2018) is caused by these two problems (sample insufficiency and pixel-wise combination). We then propose novel methods to mitigate the anomalous generalization behaviour. Figure 1 shows that our proposed methods improve the proportion of correct generated images by almost 80%. Our methods also improve the FID by up to 30% on natural image datasets.
In most GAN setups, the generator learns to map a prior distribution (e.g. standard Gaussian) to a fake distribution that approximates the target real data distribution, and the discriminator learns a function to separate the real and fake distributions. They define the following min-max game:
\[
\min_G \max_D \; \mathbb{E}_{x \sim P_r}[f_1(D(x))] + \mathbb{E}_{z \sim P_z}[f_2(D(G(z)))] \tag{1}
\]
where $P_r$ and $P_z$ denote the real and prior distributions respectively, and $f_1$ and $f_2$ are the critic functions for the positive and negative training samples (e.g. $f_1(x) = \log(x)$, $f_2(x) = \log(1-x)$).

Some anomalous generalization behaviours of GANs have been observed recently. In Zhao et al. (2018), several seemingly easy attributes are shown to be learned poorly by GANs, including numerosity (number of objects) and color proportion, which are important for human vision. The phenomenon shows that the learned distribution of the generator fails to approximate the target distribution, raising new questions about the training dynamics and generalization behaviour of GANs.
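The objective in Eq. (1) translates directly into an alternating update loop. Below is a minimal PyTorch-style sketch (not the authors' code): the MLP networks, batch sizes and the stand-in data sampler are illustrative assumptions, and we keep the plain minimax form of Eq. (1) even though the non-saturating generator loss is more common in practice.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch_size, eps = 16, 64, 32, 1e-8

G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.9))

def sample_real():
    # Stand-in for drawing a mini-batch from the training dataset.
    return torch.randn(batch_size, data_dim)

for step in range(1000):
    # Discriminator step: maximize E[f1(D(x))] + E[f2(D(G(z)))] from Eq. (1).
    x_real = sample_real()
    z = torch.randn(batch_size, latent_dim)
    x_fake = G(z).detach()
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(x_fake) + eps).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: minimize E[f2(D(G(z)))], the minimax form of Eq. (1).
    z = torch.randn(batch_size, latent_dim)
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```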
In this section, we first discuss the problem of sample insufficiency in general in Section 3.1. Its empirical observation is shown in Section 3.2, followed by the theoretical analysis in Section 3.3.
In the training of GANs, the generator learns to fake the target distribution, and the discriminator learns to separate the fake distribution of the generator from the real data distribution. To learn a good separation function between the fake and the real distributions, the discriminator needs sufficient information about them. In practice, however, this information is provided only by the positive or negative training samples in the mini-batch. Since the batchsize is often much smaller than the set of all possible data in the high-dimensional data distribution, the information is insufficient and the separation function learned from mini-batch samples is noisy. With this noisy discriminator as reference, the generator can also learn a noisy generation function, and the training dynamics of GANs become unstable; the smaller the batchsize, the more unstable the dynamics. Since the training of the generator is unstable and the learned generation function is noisy, it is difficult for the learned distribution to approximate the target distribution. As a result, the generalization of GANs becomes anomalous. We show both empirically and theoretically in the following subsections that sample insufficiency leads to anomalous generalization behaviour of GANs.

Figure 2: Middle: The loss and gradients for MGD/FGD. Training of MGD is unstable while FGD converges quickly. Left: The samples generated from four latent codes during training. MGD is unstable and anomalous images are generated ($z^{MGD}_i$, rectangle number is not two); FGD is stable and converges ($z^{FGD}_i$). Right: The FID scores during training for different batchsizes (on CELEB-A). A larger batchsize is better after training on the same amount of data.
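The core of the argument above is that a mini-batch only gives the discriminator a noisy estimate of the population statistics it is asked to fit. The short numpy sketch below (illustrative only, not from the paper) measures the fluctuation of a mini-batch mean; its error decays roughly as $1/\sqrt{m}$, so small batches hand the discriminator a noisier target.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100  # data dimension

for m in (16, 64, 256, 1024):
    # Each trial estimates E[x] of the target N(0, I_d) from a batch of size m.
    errors = [np.linalg.norm(rng.standard_normal((m, d)).mean(axis=0))
              for _ in range(200)]
    print(f"batchsize {m:5d}: mean estimation error {np.mean(errors):.3f}")
# The estimation error (the "noise" the discriminator fits to) shrinks roughly as 1/sqrt(m).
```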
We conduct experiments showing that the problem of sample insufficiency can lead GANs to anomalous generalization, both for geometric and for natural image generation.

We build a geometry dataset consisting of 64 images (the experiment also applies to larger sizes), in which every image (32 by 32, with binary pixel values 0/255) has exactly two rectangles (8 by 8). The prior distribution is a discrete uniform distribution whose support set has the same size as the dataset. More details can be found in Appendix E. We compare mini-batch gradient descent (MGD) with full-batch gradient descent (FGD); the batchsize for MGD and FGD is 16 and 64 respectively.

As shown in Figure 2 (middle), due to sample insufficiency the training dynamics of mini-batch gradient descent (MGD) are highly unstable. Both the gradient and the loss go up and down frequently, suggesting it is difficult for GANs to model the target distribution when trained with a small batchsize. The instability is also observed in the images generated from fixed latent codes, which are the inputs of the generator. As shown in Figure 2 (left), the samples generated from the three randomly drawn latent codes of MGD ($z^{MGD}_1$, $z^{MGD}_2$, $z^{MGD}_3$) change frequently during training, with anomalous images of wrong rectangle numbers. For full-batch gradient descent (FGD), where sample insufficiency is avoided and the separation function between the real and fake data distributions can be learned accurately at each step, the loss and gradients are stable, and the learned distribution converges smoothly to the target distribution in a short time.

The problem of sample insufficiency also affects natural image generation on datasets like CELEB-A (Liu et al., 2015). As shown in Figure 2 (right), after training on the same amount of data, the FID score is better for the larger batchsize, where the problem of sample insufficiency is relatively less severe, than for the smaller batchsize, where it is more problematic. In brief, the experiments show that sample insufficiency makes the training dynamics unstable, both for the discriminator and for the generator. Since their training is unstable and the learned generation function is noisy, it is hard for the learned distribution to approximate the target distribution, and the final generalization behaviour of GANs becomes anomalous.
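For reference, the toy geometry data described above can be generated as follows. This is a minimal sketch of how we read the dataset description (32x32 binary images, two 8x8 rectangles); the non-overlap constraint and placement details are assumptions, not the authors' exact generator.

```python
import numpy as np

def two_rectangle_image(rng, size=32, rect=8, n_rect=2):
    """Binary size x size image containing n_rect non-overlapping rect x rect squares."""
    img = np.zeros((size, size), dtype=np.uint8)
    placed = 0
    while placed < n_rect:
        r, c = rng.integers(0, size - rect, size=2)
        patch = img[r:r + rect, c:c + rect]
        if patch.sum() == 0:          # avoid overlap so the rectangle count stays exact
            patch[:] = 255
            placed += 1
    return img

rng = np.random.default_rng(0)
dataset = np.stack([two_rectangle_image(rng) for _ in range(64)])  # 64 images as in Section 3.2
```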
In this subsection, we introduce a simple yet prototypical model showing that an insufficient batchsize results in unstable training dynamics and that a smaller batch leads to poorer performance than a larger one. As a result, the generalization of GANs becomes anomalous when the batchsize is small.

Assume that the real data distribution is a $d$-dimensional multivariate normal distribution centered at the origin, $\mathcal{N}(0, I_d)$. The latent distribution of the generator is $p_z = \mathcal{N}(0, I_d)$. The generator is defined as $G_\theta(z) = \theta + z$ and the discriminator is a linear function $D_w(x) = w^\top x$. We consider the WGAN model (Arjovsky et al., 2017), whose value function is constructed using the Kantorovich-Rubinstein duality (Villani, 2008) as
\[
W(P_{\theta_{REAL}}, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_{\theta_{REAL}}}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \tag{2}
\]
where $\|f\|_L$ denotes the Lipschitz constant of the function $f$, namely $\|f\|_L = \inf\{L : f \text{ is } L\text{-Lipschitz}\}$, $P_{\theta_{REAL}}$ denotes the real data distribution and $P_\theta$ denotes the distribution of the generator $G_\theta$. The supremum is taken over all linear functions $f_w(x) = w^\top x$, since the optimal classifier $f^*$ is linear here. When $f_w$ is linear, $\|f_w\|_L = \|w\|$, so
\[
W(P_{\theta_{REAL}}, P_\theta) = \sup_{\|w\| \le 1} \mathbb{E}_{x \sim P_{\theta_{REAL}}}[f_w(x)] - \mathbb{E}_{x \sim P_\theta}[f_w(x)] \tag{3}
\]
\[
= \sup_{\|w\| \le 1} w^\top \big( \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - \mathbb{E}_{z \sim \mathcal{N}(0, I_d)}[z + \theta] \big) \tag{4}
\]
The generator is trained to minimize $W(P_{\theta_{REAL}}, P_\theta)$. Denote $F(w, \theta) = w^\top \big( \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta) \big)$. When we use stochastic gradient descent, the training procedure can be described as
\[
w_{t+1} = w_t + \eta_t \nabla_w F(w, \theta_t) \tag{5}
\]
\[
\theta_{t+1} = \theta_t - \mu_t \nabla_\theta F(w_{t+1}, \theta) \tag{6}
\]
where $\eta_t$ and $\mu_t$ are the step sizes. We present our result for gradient flow (Du et al., 2018a,b), i.e., gradient descent with an infinitesimal time interval, whose behaviour is described by the following differential equations:
\[
\begin{pmatrix} \dot{w}(t) \\ \dot{\theta}(t) \end{pmatrix} = \begin{pmatrix} \eta_t \nabla_w F(w(t), \theta(t)) \\ -\mu_t \nabla_\theta F(w(t), \theta(t)) \end{pmatrix} \tag{7}
\]
(A detailed explanation of gradient flow and stochastic gradient flow is in Appendix A.) We will show that when the WGAN in (3) is trained using stochastic gradient flow, the batchsize impacts the behaviour of the training dynamics. Compared with a large batchsize (or even full batch), when the batchsize is small the training dynamics of WGAN suffer from a large variance and are thus more unstable. Theorem 1 says that when the WGAN model is trained using constant step size stochastic gradient flow, the variance of the output increases as $t$ grows and is of order $\Theta(1/m)$; namely, the variance is large if the batchsize is small.

Theorem 1 (Variance of WGAN, constant step size). Denote $[\theta_t]_i$ as the $i$-th component of the vector $\theta_t$. Suppose we train the WGAN model in (3) using constant step size stochastic gradient flow with batchsize $m$. Then the parameter of the generator $\theta_t$ satisfies
\[
\mathrm{Var}([\theta_t]_i) = \int_0^t \big\| [e^{sA} \Sigma]_{d+i} \big\|^2 \, ds = \Theta\!\left(\tfrac{1}{m}\right) \tag{8}
\]
where
\[
A = \begin{pmatrix} 0 & -I_d \\ I_d & 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sqrt{\tfrac{2}{m}}\, I_d & 0 \\ 0 & 0 \end{pmatrix}, \quad e^{sA} = \begin{pmatrix} (\cos s) I_d & -(\sin s) I_d \\ (\sin s) I_d & (\cos s) I_d \end{pmatrix}, \tag{9}
\]
and $[e^{sA} \Sigma]_{d+i}$ is the $(d+i)$-th row of the matrix $e^{sA} \Sigma$.

The proof is in Appendix B. The main idea is that due to the special dynamics of mini-max optimization, the bias caused by the randomness of the batch sampling in each epoch accumulates, which leads to large variation after many steps of training. Note that although in traditional optimization problems the variance of SGD caused by the randomness of samples does not prevent convergence (Bubeck et al., 2015; Brutzkus et al., 2017), in GANs this variance damages the convergence properties.

Next we consider the vanishing step size case, in which $\eta_t = \mu_t = 1/t$ (without loss of generality assume $t \ge 1$). This step size is commonly used in convex optimization in practice (Bubeck et al., 2015). Theorem 2 shows that the problem still exists in this case; its proof is in Appendix C.
Theorem 2 (Variance of WGAN, vanishing step size). Suppose we train the WGAN model in (3) using $1/t$ step size stochastic gradient flow with batchsize $m$. Then the parameter of the generator $\theta_t$ satisfies
\[
\mathrm{Var}([\theta_t]_i) = \Theta\!\left(\tfrac{1}{m}\right) \tag{10}
\]

Remark 1.
Here we consider only the WGAN because it is usually more stable than typical GANs. We believe that other forms of GANs suffer from a similar problem caused by an insufficient batchsize.
Remark 2.
Although our theoretical analysis only considers the effect of noise in the distribution, the same reasoning also applies to the effect of other irrelevant features, especially when the generator cannot learn the real data distribution perfectly.
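The effect predicted by Theorems 1 and 2 can be reproduced numerically. The sketch below simulates the linear WGAN of Section 3.3 with alternating constant-step updates (Eqns. (5)-(6)), estimating the expectations from batches of size $m$; the spread of $\theta_t$ across repeated runs shrinks roughly in proportion to $1/m$. The step size, horizon and dimension are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def run(m, d=4, steps=500, eta=0.05, seed=0):
    """Simulate D_w(x) = w.x vs. G_theta(z) = z + theta trained with batchsize m."""
    rng = np.random.default_rng(seed)
    w, theta = np.zeros(d), np.ones(d)
    for _ in range(steps):
        x_bar = rng.standard_normal((m, d)).mean(axis=0)   # mini-batch mean of real data
        y_bar = rng.standard_normal((m, d)).mean(axis=0)   # mini-batch mean of latent samples
        grad_w = x_bar - (y_bar + theta)                   # stochastic estimate of grad_w F
        w = w + eta * grad_w                               # discriminator ascent, Eq. (5)
        theta = theta + eta * w                            # generator step, Eq. (6) with grad_theta F = -w
    return theta

for m in (4, 64, 256):
    finals = np.stack([run(m, seed=s) for s in range(50)])
    print(f"batchsize {m:4d}: Var over runs of theta_t[0] = {finals[:, 0].var():.4f}")
```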
In this section, we first discuss the problem of pixel-wise combination in Section 4.1, followed by its illustration on toy and real datasets in Section 4.2. Finally, we provide a theoretical analysis in Section 4.3.
When GANs are trained on image datasets, we find that under certain conditions pixel-wise combinations (e.g. pixel-wise average or pixel-wise logical-and) of the real samples can fool the discriminator, even though those combinations can have inconsistent properties; e.g. the pixel-wise average of two animal images in CIFAR10 can be unrecognizable for a human. As a simple illustration, suppose the discriminator is a linear classifier; given two real images, their pixel-wise average is likely to be predicted as a positive sample by the linear discriminator. Since those pixel-wise combinations are recognized as real by the discriminator, the generator correspondingly tends to generate them. Therefore, the generated images can differ from the expected ones and the generalization of GANs becomes anomalous.
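The linear-classifier intuition can be made concrete: if $D(x) = w \cdot x + b$ is linear, its score of a pixel-wise average is exactly the average of the individual scores, so two confidently real images produce a confidently real average. A small numpy illustration with hypothetical images and weights (none of these values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32 * 32
w, b = rng.standard_normal(dim), 0.1            # a hypothetical linear discriminator
x1, x2 = rng.random(dim), rng.random(dim)       # two "real" images (flattened)

score = lambda x: w @ x + b
avg = 0.5 * (x1 + x2)                           # pixel-wise average of the two images
# By linearity, the score of the average equals the average of the scores.
assert np.isclose(score(avg), 0.5 * (score(x1) + score(x2)))
print(score(x1), score(x2), score(avg))
```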
We demonstrate how the problem of pixel-wise combination leads GANs to anomalous generalization on a toy dataset. The training dataset consists of only two binary images (pixel values 0 or 255), both of which have exactly three rectangles; the positions of the rectangles differ between the two images, as shown at the upper left of Figure 3. The generated images during training are plotted and their statistics are analyzed.

As shown at the upper left of Figure 3, even when the training dataset consists of two images, many unexpected anomalous samples are generated. Both training images have exactly three rectangles, but many generated images have two or four rectangles; the generated images with three rectangles constitute only a small part of all generated images (Figure 3, upper right, solid red curve). Furthermore, the anomalous generated images are exactly the pixel-wise combinations of the two training images: those with two rectangles are the pixel-wise logical-and of the two training images, and those with four rectangles are their pixel-wise logical-or. The problem of pixel-wise combination is also observed in the discrimination function of the discriminator. As plotted as dashed curves at the upper right of Figure 3, during training the discriminator recognizes the pixel-wise combinations of the two positive training images as real (assigning them high scores), as well as the two positive training images themselves. With this fooled discriminator as reference, the generator is fooled further into generating unexpected samples. As a result, the learned data distribution can differ a lot from the target data distribution.

Beyond the toy dataset, the problem of pixel-wise combination also appears in the training of GANs in practice. For the natural image dataset CELEB-A, some generated images are exactly the pixel-wise averages of training data (Figure 3, bottom). Also, the pixel-wise average of certain structurally similar images can produce realistic samples (Figure 4). In brief, the experiments show that the problem of pixel-wise combination exists for GANs and makes it hard to model the target data distribution faithfully, which leads to anomalous generalization behaviour.

Figure 4: Generated images of the generative average method (GAM) and GANs. The performances are comparable both visually and in FID score.
In this section we give a possible explanation of the problem of pixel-wise combination through a theoretical analysis. Note that this phenomenon is most pronounced on geometric or facial datasets. The samples in those datasets are structurally similar: in a facial dataset, the eyes, noses and other features of the human faces appear at fixed positions in the images with high probability. So we assume that the $\ell_2$ distance between most positive training samples in the dataset is small. For the discriminator $f(x)$, without loss of generality assume that $x$ is classified as real if $f(x) > 0$. We make the following assumptions.

Assumption 1. Assume that $f$ is $L$-Lipschitz.

Assumption 2. Assume that the discriminator classifies all the positive training samples as real with a large margin (i.e. there exists $\epsilon > 0$ such that for all the positive training samples $x$, $f(x) > \epsilon$).

Most classifiers satisfy the Lipschitz condition (for example, the softmax classifier). Assumption 2 means that the discriminator classifies the positive training samples as real with high confidence. Our theorem shows that under these assumptions, the pixel-wise convex combination of any two positive training samples will be classified as real with high probability. The proof is in Appendix D.
Theorem 3. If the discriminator $f(x)$ satisfies Assumptions 1-2, then for any two samples $x_1$ and $x_2$ in the training dataset satisfying $\|x_1 - x_2\| \le \delta$ and any $\lambda \in (0, 1)$, we have
\[
f(\lambda x_1 + (1 - \lambda) x_2) \ge \max\{ f(x_1) - L(1 - \lambda)\delta,\; f(x_2) - L\lambda\delta \} \tag{11}
\]
Moreover, if $\epsilon > L\delta \min\{\lambda, 1 - \lambda\}$, then $f(\lambda x_1 + (1 - \lambda) x_2) > 0$.
Fixing the anomalous generalization of GANs

In this section, we propose novel methods to mitigate the two problems. We present the Pixel-wise Combination Regularization (PCR) to mitigate the problem of pixel-wise combination, and the Sample Correction (SC) for the problem of sample insufficiency, in Section 5.1. The results show that the anomalous generalization on the geometric dataset is avoided almost entirely (Section 5.2). On natural image datasets, the training modifications improve the FID score by up to 30% (Section 5.3).
Pixel-wise Combination Regularization
In the training of vanilla GANs, the positive training samples for updating the discriminator come from the training dataset and the negative training samples are generated by the generator. Since the discriminator tends to recognize the pixel-wise combinations of the images in the training dataset as real even though they are not in the target distribution, we define a dataset
\[
D_{com} = \{\, x_k \mid x_k = y_i \oplus y_j,\; y_i, y_j \in D_{training},\; i \ne j \,\} \tag{12}
\]
and use the images in $D_{com}$ as additional negative training samples to restrict this tendency. The $\oplus$ in Eqn. (12) is a pixel-wise combination operation; it can be the pixel-wise average or the pixel-wise logical-and/or. The samples in $D_{com}$ are the combinations of every two images in the training dataset. The loss for training with the Pixel-wise Combination Regularization can be written as
\[
L = \underbrace{\mathbb{E}_{x \sim P_r}[f_1(D(x))]}_{\text{positive samples}} + \frac{1}{2} \underbrace{\Big[ \mathbb{E}_{x \sim P_g}[f_2(D(x))] + \mathbb{E}_{x \sim P_{com}}[f_2(D(x))] \Big]}_{\text{negative samples}} \tag{13}
\]
where $P_r$, $P_g$ and $P_{com}$ are the data distributions approximated by the training data, the generator and $D_{com}$ respectively. In this way, the tendency to generate those combination images is restricted. We refer to this addition of negative training samples for the training of the discriminator as Pixel-wise Combination Regularization (PCR).
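A minimal PyTorch-style sketch of the PCR discriminator loss in Eq. (13), assuming a discriminator D that outputs probabilities and using the pixel-wise average as the combination operator ⊕. Building the combination batch from a shuffled pairing of the current real batch is our illustrative choice, not necessarily the paper's exact implementation.

```python
import torch

def pcr_discriminator_loss(D, x_real, x_fake, eps=1e-8):
    """Eq. (13): positive term on real data; negative terms split evenly between
    generator samples and pixel-wise combinations of real samples."""
    x_comb = 0.5 * (x_real + x_real[torch.randperm(x_real.size(0))])  # samples standing in for D_com
    pos = torch.log(D(x_real) + eps).mean()
    neg = 0.5 * (torch.log(1 - D(x_fake) + eps).mean()
                 + torch.log(1 - D(x_comb) + eps).mean())
    return -(pos + neg)   # the discriminator maximizes Eq. (13), so we minimize its negative
```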
We introduce a general framework to mitigate the problem of sample insufficiency. To model the target distribution, the discriminator needs to accurately separate the real samples in the target data distribution from those not in it, which sample insufficiency makes difficult. For that goal, realistic samples in the negative training batch are not useful: intuitively, if realistic samples appear in both the positive and the negative training batches, it is ambiguous for the discriminator which separation function to learn. Therefore, we replace the realistic samples in the negative training batch with less realistic ones according to a pre-defined measure of reality. In this way, the discriminator can efficiently learn an accurate separation function with a limited batchsize. Sample Correction is a general framework and the measure of reality can differ across datasets; we present experiments on the geometry and CELEB-A datasets as examples in the next subsections.
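The Sample Correction step can be written generically as a filter on the negative batch. The sketch below assumes some scalar reality score realness(x) (dataset-dependent, e.g. the rectangle count on geometry data or the DIF distance of Section 5.3) and replaces too-realistic negatives with freshly generated samples; the threshold, the retry loop and all names are illustrative assumptions.

```python
import torch

def correct_negative_batch(G, z_sampler, realness, threshold, max_tries=10):
    """Draw a negative batch from the generator and replace samples whose reality
    score exceeds `threshold` with newly generated, hopefully less realistic ones."""
    x = G(z_sampler())
    for _ in range(max_tries):
        too_real = realness(x) > threshold                   # boolean mask over the batch
        if not too_real.any():
            break
        mask = too_real.view(-1, *([1] * (x.dim() - 1)))     # broadcast mask to sample shape
        x = torch.where(mask, G(z_sampler()), x)             # resample only the flagged entries
    return x.detach()
```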
As shown before, due to the two problems, when trained on a geometry dataset in which all training images have exactly two rectangles, GANs generate many anomalous samples with different numbers of rectangles. We use the two proposed methods to mitigate this anomalous generalization and run experiments to verify their effect. The training dataset consists of 25600 binary images, all of which have exactly two rectangles. For the Pixel-wise Combination Regularization (PCR), the pixel-wise logical-and/or of the positive training images are precomputed (details in Appendix F) and used as additional negative training samples. For the Sample Correction (SC), the realistic generated samples in the negative training batch of the discriminator are discarded; here the realistic samples are those with exactly two rectangles, the same as the positive training samples.

As shown in Figure 1 (right), compared to the vanilla approach, the Sample Correction (SC) approach almost eliminates the anomalous generalization, and the proportion of correct images (rectangle number is 2) goes up to 97%. The Pixel-wise Combination Regularization (PCR) approach also improves the proportion but gets stuck around 70%; we attribute this to the remaining problem of sample insufficiency. Combining the two methods, the SC+PCR approach converges to 99%, much more quickly than the other approaches, demonstrating the existence of the two problems and the effectiveness of our methods.
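On the geometry data, the reality measure for Sample Correction reduces to checking whether a binarized generated image contains exactly two rectangles. A sketch of such a check via connected-component labelling (scipy), under the assumption that rectangles do not touch; the threshold and function names are illustrative.

```python
import numpy as np
from scipy import ndimage

def count_rectangles(img, thresh=128):
    """Count connected foreground components in a single-channel image."""
    binary = np.asarray(img) >= thresh
    _, num = ndimage.label(binary)
    return num

def is_realistic(img):
    return count_rectangles(img) == 2   # matches the positive training samples
```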
We also evaluate the effect of the proposed Pixel-wise Combination Regularization and Sample Correction methods on natural image data, where performance is measured by the FID score.

For the pixel-wise combination, the pixel-wise averages of the training data are computed on the fly at each training step of the discriminator and used as additional negative training samples (see the sketch after Table 1); the number of additional negative samples equals the number of negative samples from the generator. The results with the Pixel-wise Combination Regularization (denoted PCR) are compared with vanilla training for three popular GAN architectures, WGAN-GP, LSGAN and SAGAN (Gulrajani et al., 2017; Mao et al., 2017; Zhang et al., 2018). Three natural image datasets are involved: CIFAR10, CELEB-A and M-IMAGENET (Krizhevsky & Hinton, 2009; Liu et al., 2015), where M-IMAGENET is the validation set of the IMAGENET dataset. We train the networks unsupervisedly with the Adam optimizer (α = 0.0002, β1 = 0.5, β2 = 0.9). As shown in Table 1, performance improves in most cases after applying the Pixel-wise Combination Regularization. The FID scores of the baselines are consistent with those reported in Lucic et al. (2018). For WGAN-GP trained on CIFAR10, the best achieved FID score improves by up to 30%, showing the potential of our regularization method. Interestingly, the improvements are more remarkable for CIFAR10 and M-IMAGENET than for CELEB-A. We hypothesize that this is because the objects in CELEB-A tend to appear at regular or fixed positions, so the average of real images is likely to lie on the target data manifold: it is very possible that the average of two human face images is still a realistic human face image (data in CELEB-A), but far less likely that the average of a car image and a dog image is a realistic image (data in CIFAR10).

Table 1: The best FID scores achieved over three runs are reported after 50000 steps for different models and datasets; entries are Vanilla / PCR / boost. In most cases the FID score improves after applying the Pixel-wise Combination Regularization (PCR). The baseline scores are consistent with those reported in Lucic et al. (2018).

Dataset        WGAN-GP                LSGAN               SAGAN
CELEB-A        20.9 / 21.7 / -3.4%    17.7 / 16.0 / -     -
CIFAR10        45.4 / 31.6 / -        57.6 / 51.0 / -     -
M-IMAGENET     61.8 / 54.1 / -        61.0 / 59.4 / -     -
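For natural images the combination negatives are not precomputed as on the geometry data but formed from the current real mini-batch at every discriminator step. A minimal sketch of this step, with the random-permutation pairing as our illustrative choice:

```python
import torch

def pcr_negatives(x_real, x_fake):
    """Negative batch for one discriminator step on natural images: generator samples
    plus an equal number of pixel-wise averages of real samples, built on the fly."""
    perm = torch.randperm(x_real.size(0))
    x_avg = 0.5 * (x_real + x_real[perm])      # pixel-wise averages of the current real batch
    return torch.cat([x_fake, x_avg], dim=0)
```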
For the Sample Correction method, we randomly select a small portion of the images as training data (M-CELEB-A and M-CIFAR). The dataset size is kept small (e.g. 32) so that the learned data distribution can approximate the target data distribution well. The measure of reality of a sample in this case is the normalized minimum L2 distance of the sample to all the training data (DIF, between 0 and 1). We train GANs in two ways, Vanilla and Sample Correction; the latter differs in that realistic samples in the negative batch (DIF less than 0.1) are replaced with less realistic ones. The batchsize is kept small to better expose the problem of sample insufficiency. As shown in Figure 5, the Sample Correction approach achieves better performance than the vanilla training.

Figure 5: Left: The final generated samples for M-CIFAR10 and M-CELEB-A, where the Sample Correction (SC) approach achieves better performance. Right: The FID and DIF curves during training for the vanilla and the Sample Correction (SC) approaches.

In this paper, we discuss two specific problems of GANs, namely sample insufficiency and pixel-wise combination. We demonstrate that they make it difficult to model the target distribution and lead GANs to anomalous generalization. Specific methods are introduced to prevent them from misleading the generalization of GANs, which improves performance. We hope the two specific problems and the methods to restrict them can help future work better understand the generalization behaviour of GANs.
References
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 224–232. JMLR.org, 2017.

Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems, pp. 384–395, 2018a.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, 12(1), 2017.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations, volume 23. Springer Science & Business Media, 2013.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700–709, 2018.

Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595, 2017.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

George E. Uhlenbeck and Leonard S. Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Norbert Wiener. Differential-space. Journal of Mathematics and Physics, 2(1-4):131–174, 1923.

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10815–10824, 2018.
A An Explanation of gradient flow and stochastic gradient flow
Gradient flows, or steepest descent curves, are a classical topic in evolution equations: given a functional $F$ defined on $\mathbb{R}^d$, we want to find points $x$ minimizing $F$ (which is related to the stationarity equation $\nabla F(x) = 0$). To do this, we look, given an initial point $x_0$, for a curve starting at $x_0$ that decreases $F$ as fast as possible. Since the negative gradient direction is the steepest descent direction, we solve equations of the form
\[
x'(t) = -\nabla F(x(t)). \tag{14}
\]
On the solution curve $(t, x(t))$, as $t$ increases, at every point $x(t)$ the curve moves along the negative gradient direction. Writing down the discrete form of (14) gives
\[
x(t + \delta t) - x(t) = -\delta t\, \nabla F(x(t)) \tag{15}
\]
which is the Euler method for the differential equation; taking $\delta t = 1$ recovers the expression of gradient descent. So we can view the gradient flow as gradient descent with an infinitesimal time interval. If we add a step size term, the equation becomes
\[
x'(t) = -\eta_t \nabla F(x(t)) \tag{16}
\]
where $\eta_t$ denotes the step size, and in the discrete form it becomes the familiar gradient descent formula with step sizes $\{\eta_t\}$:
\[
x(t + 1) - x(t) = -\eta_t \nabla F(x(t)) \tag{17}
\]
In most machine learning problems, the function $F$ can be written as $F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$, and computing the gradient of $F$ exactly can be computationally expensive. So instead we often use an approximation of $\nabla F$ (for example, we can randomly select $i \in \{1, ..., n\}$ and use $\nabla f_i$ to approximate $\nabla F$). This can be described formally as follows. If $\nabla F(x(t)) = g(x(t)) + \mathbb{E} Y$, where $Y$ is a random vector with distribution $p(y)$, then we can sample $Y_t \sim p$ and update $x$ as
\[
x(t + 1) - x(t) = -\eta_t (g(x(t)) + Y_t) \tag{18}
\]
In particular, if $p$ is a Gaussian distribution with mean 0 and covariance matrix $I_d$, then (18) is equivalent to
\[
x(t + 1) - x(t) = -\eta_t \big( g(x(t)) + (W(t + 1) - W(t)) \big) \tag{19}
\]
where $W(t)$ is a Wiener process (Wiener, 1923). This is the Euler-Maruyama scheme (Kloeden & Platen, 2013) of the following stochastic differential equation:
\[
\begin{cases} dX_t = -\eta_t g(X_t)\, dt - \eta_t\, dW_t \\ X_0 = x_0 \end{cases} \tag{20}
\]
whose solution, if it exists, is a stochastic process $\{X_t = X(t, x_0)\}$. We call this solution the stochastic gradient flow of $F$ at point $x_0$, and when we say that $F$ is trained using stochastic gradient flow, we mean that the curve $\{x_t\}$ is a sample path of $\{X_t = X(t, x_0)\}$.

Finally, when we say that $F$ is trained using stochastic gradient descent with batchsize $m$, we mean that for each $t$ we sample $Y_t^1, ..., Y_t^m \sim p$ and update $x$ as
\[
x(t + 1) - x(t) = -\eta_t \Big( g(x(t)) + \frac{1}{m} \sum_{i=1}^m Y_t^i \Big). \tag{21}
\]
This is the Euler-Maruyama scheme of the stochastic differential equation
\[
\begin{cases} dX_t^m = -\eta_t g(X_t^m)\, dt - \frac{\eta_t}{\sqrt{m}}\, dW_t \\ X_0^m = x_0 \end{cases} \tag{22}
\]
and when we say that $F$ is trained using stochastic gradient flow with batchsize $m$, we mean that the curve $\{x_t\}$ is a sample path of $\{X_t^m = X^m(t, x_0)\}$.
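To connect the continuous description above with an actual computation, the sketch below applies the Euler-Maruyama scheme of Eq. (22) to a simple quadratic $F(x) = x^2/2$ (so $g(x) = x$) for several batchsizes; the step size, horizon and choice of $F$ are illustrative assumptions only.

```python
import numpy as np

def euler_maruyama(m, g=lambda x: x, x0=1.0, eta=0.01, steps=5000, seed=0):
    """Simulate dX_t = -eta*g(X_t) dt - (eta/sqrt(m)) dW_t, the batchsize-m scheme of Eq. (22)."""
    rng = np.random.default_rng(seed)
    x, dt = x0, 1.0
    for _ in range(steps):
        dW = rng.standard_normal() * np.sqrt(dt)
        x = x - eta * g(x) * dt - (eta / np.sqrt(m)) * dW
    return x

for m in (1, 16, 256):
    xs = [euler_maruyama(m, seed=s) for s in range(200)]
    print(f"batchsize {m:4d}: Var of X_T across runs = {np.var(xs):.2e}")
# The run-to-run variance of the final iterate shrinks roughly in proportion to 1/m.
```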
B Proof of Theorem 1

Proof.
First we give a detailed description of the training dynamics of the model in Section 3.3 using gradient flow.

The framework for training the WGAN in Section 3.3 is as follows. Denote $F(w, \theta) = w^\top (\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta))$. For each epoch $t = 1, 2, ...$, given the parameter of the generator $\theta_t$, the target of the discriminator is
\[
\max_{\|w\| \le 1} F(w, \theta_t) = w^\top \big( \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta_t) \big) \tag{23}
\]
The gradient with respect to $w$ is
\[
\nabla_w F(w, \theta_t) = \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta_t) \tag{24}
\]
and for one-step gradient ascent the update of $w$ is
\[
w_{t+1} = w_t + \eta_t \nabla_w F(w, \theta_t) \tag{25}
\]
where $\eta_t$ is the step size. After $w_t$ is updated to $w_{t+1}$, given the parameter of the discriminator $w_{t+1}$, the target of the generator becomes
\[
\min_\theta F(w_{t+1}, \theta) = w_{t+1}^\top \big( \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta) \big) \tag{26}
\]
and the gradient with respect to $\theta$ is
\[
\nabla_\theta F(w_{t+1}, \theta) = -w_{t+1}. \tag{27}
\]
For one-step gradient descent the update of $\theta$ is
\[
\theta_{t+1} = \theta_t - \mu_t \nabla_\theta F(w_{t+1}, \theta) \tag{28}
\]
Now we consider the case when $\eta_t$ and $\mu_t$ are constants, $\eta_t = \mu_t = C$; without loss of generality assume $C = 1$. When the time interval is infinitesimal, the discrete dynamics converge to the gradient flow with constant step size, which can be written in the form of ordinary differential equations (ODEs):
\[
\begin{pmatrix} \dot{w}(t) \\ \dot{\theta}(t) \end{pmatrix}
= \begin{pmatrix} \nabla_w F(w(t), \theta(t)) \\ -\nabla_\theta F(w(t), \theta(t)) \end{pmatrix} \tag{29}
\]
\[
= \begin{pmatrix} \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta(t)) \\ w(t) \end{pmatrix} \tag{30}
\]
Under the full-batch condition, which means that $\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x = 0$ and $\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y = 0$ can be computed exactly, the solution to the above ODEs is
\[
\begin{cases} w_t = w_0 \cos t - \theta_0 \sin t \\ \theta_t = \theta_0 \cos t + w_0 \sin t \end{cases} \tag{31}
\]
Notice that the solution implies $\|w_t\| \le \|w_0\| + \|\theta_0\|$, which means that the 1-Lipschitz condition on $w$ in WGAN is automatically satisfied if $w_0$ and $\theta_0$ are initialized sufficiently small. The parameters lie on a circle around the equilibrium point $(0, 0)$.

Now consider training with stochastic gradient flow with batchsize $m$, which means that we randomly draw i.i.d. samples $x_t^1, ..., x_t^m \sim \mathcal{N}(0, I_d)$ and $y_t^1, ..., y_t^m \sim \mathcal{N}(0, I_d)$ and use the sample means $\bar{x}_{m,t} = \frac{1}{m} \sum_{i=1}^m x_t^i$ and $\bar{y}_{m,t} = \frac{1}{m} \sum_{i=1}^m y_t^i$ to estimate the true means $\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x$ and $\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y$. The gradients become $\nabla_w F = \bar{x}_{m,t} - \bar{y}_{m,t} - \theta$ and $\nabla_\theta F = -w$. Since $x_t^1, ..., x_t^m$ and $y_t^1, ..., y_t^m$ are independent for different $t$, we have $\bar{x}_{m,t}, \bar{y}_{m,t} \sim \mathcal{N}(0, \frac{1}{m} I_d)$ and they are independent for different $t$, so $\bar{x}_{m,t} - \bar{y}_{m,t} \sim \mathcal{N}(0, \frac{2}{m} I_d)$. The parameters $w$ and $\theta$ then satisfy the following stochastic differential equations:
\[
\begin{cases} dw_t = -\theta_t\, dt + \sqrt{\tfrac{2}{m}}\, dW_t \\ d\theta_t = w_t\, dt \end{cases} \tag{32}
\]
where $W_t$ is a $d$-dimensional standard Wiener process. Denote $X_t = (w_t^\top, \theta_t^\top)^\top \in \mathbb{R}^{2d}$; then $X_t$ is a multidimensional OU process (Uhlenbeck & Ornstein, 1930) with expectation
\[
\mathbb{E}[X(t) \mid X(0)] = e^{tA} X(0) \tag{33}
\]
and variance
\[
\mathrm{Var}[X(t) \mid X(0)] = \int_0^t e^{tA} e^{-sA} \Sigma \Sigma^\top \big( e^{tA} e^{-sA} \big)^\top ds \tag{34}
\]
where
\[
A = \begin{pmatrix} 0 & -I_d \\ I_d & 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sqrt{\tfrac{2}{m}}\, I_d & 0 \\ 0 & 0 \end{pmatrix} \tag{35}
\]
So $e^{tA} = \begin{pmatrix} (\cos t) I_d & -(\sin t) I_d \\ (\sin t) I_d & (\cos t) I_d \end{pmatrix}$ and
\[
e^{tA} e^{-sA} = \begin{pmatrix} (\cos t) I_d & -(\sin t) I_d \\ (\sin t) I_d & (\cos t) I_d \end{pmatrix} \begin{pmatrix} (\cos(-s)) I_d & -(\sin(-s)) I_d \\ (\sin(-s)) I_d & (\cos(-s)) I_d \end{pmatrix} \tag{36}
\]
\[
= \begin{pmatrix} (\cos(t - s)) I_d & -(\sin(t - s)) I_d \\ (\sin(t - s)) I_d & (\cos(t - s)) I_d \end{pmatrix} = e^{(t - s)A} \tag{37}
\]
So, after the change of variables $s \to t - s$, the variance can be written as
\[
\mathrm{Var}[X(t) \mid X(0)] = \int_0^t e^{(t - s)A} \Sigma \Sigma^\top \big( e^{(t - s)A} \big)^\top ds = \int_0^t e^{sA} \Sigma \Sigma^\top \big( e^{sA} \big)^\top ds \tag{41}
\]
and the variance of the $i$-th component of $X(t)$ is
\[
\mathrm{Var}[X(t) \mid X(0)]_{ii} = \int_0^t [e^{sA} \Sigma]_i [e^{sA} \Sigma]_i^\top ds = \int_0^t \big\| [e^{sA} \Sigma]_i \big\|^2 ds \tag{43}
\]
where $[e^{sA} \Sigma]_i$ denotes the $i$-th row of $e^{sA} \Sigma$. Since $\|[e^{sA} \Sigma]_i\|^2 \ge 0$, for all $t_1 \le t_2$ we have $\mathrm{Var}[X(t_1) \mid X(0)]_{ii} \le \mathrm{Var}[X(t_2) \mid X(0)]_{ii}$, so the variance of the OU process increases as $t$ grows. And because the nonzero elements of $\Sigma$ are of order $\Theta(1/\sqrt{m})$, we have $\mathrm{Var}[X(t) \mid X(0)] = \Theta(\tfrac{1}{m})$. From the definition of $X(t)$ we have $\mathrm{Var}([\theta_t]_i) = \mathrm{Var}[X(t) \mid X(0)]_{d+i, d+i} = \int_0^t \|[e^{sA} \Sigma]_{d+i}\|^2 ds$. Hence we complete our proof.

C Proof of Theorem 2
Proof.
Now we consider the vanishing step size situation, namely $\eta_t = \mu_t = 1/t$. When the time interval is infinitesimal, the discrete dynamics converge to the gradient flow, which can be written in the form of ordinary differential equations (assume $t \ge 1$):
\[
\begin{pmatrix} \dot{w}(t) \\ \dot{\theta}(t) \end{pmatrix}
= \begin{pmatrix} \tfrac{1}{t} \nabla_w F(w(t), \theta(t)) \\ -\tfrac{1}{t} \nabla_\theta F(w(t), \theta(t)) \end{pmatrix} \tag{44}
\]
\[
= \begin{pmatrix} \tfrac{1}{t} \big( \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x - (\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y + \theta(t)) \big) \\ \tfrac{1}{t}\, w(t) \end{pmatrix} \tag{45}
\]
Under the full-batch condition, which means that $\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\, x = 0$ and $\mathbb{E}_{y \sim \mathcal{N}(0, I_d)}\, y = 0$ can be computed exactly, the solution to the above ODEs is
\[
\begin{cases} w_t = w_0 \cos(\ln t) + \theta_0 \sin(\ln t) \\ \theta_t = w_0 \sin(\ln t) - \theta_0 \cos(\ln t) \end{cases} \tag{46}
\]
Notice that the solution implies $\|w_t\| \le \|w_0\| + \|\theta_0\|$, which means that the 1-Lipschitz condition on $w$ in WGAN is automatically satisfied if $w_0$ and $\theta_0$ are initialized sufficiently small. The parameters lie on a circle around the equilibrium point $(0, 0)$.

Now consider training with stochastic gradient flow with batchsize $m$: we randomly draw i.i.d. samples $x_t^1, ..., x_t^m \sim \mathcal{N}(0, I_d)$ and $y_t^1, ..., y_t^m \sim \mathcal{N}(0, I_d)$ and use the sample means $\bar{x}_{m,t} = \frac{1}{m} \sum_{i=1}^m x_t^i$ and $\bar{y}_{m,t} = \frac{1}{m} \sum_{i=1}^m y_t^i$ to estimate the true means. The gradients become $\nabla_w F = \bar{x}_{m,t} - \bar{y}_{m,t} - \theta$ and $\nabla_\theta F = -w$. Since $x_t^1, ..., x_t^m$ and $y_t^1, ..., y_t^m$ are independent for different $t$, we have $\bar{x}_{m,t}, \bar{y}_{m,t} \sim \mathcal{N}(0, \frac{1}{m} I_d)$, independent for different $t$, so $\bar{x}_{m,t} - \bar{y}_{m,t} \sim \mathcal{N}(0, \frac{2}{m} I_d)$. The parameters $w$ and $\theta$ satisfy the following stochastic differential equations:
\[
\begin{cases} dw_t = -\tfrac{1}{t}\, \theta_t\, dt + \tfrac{1}{t} \sqrt{\tfrac{2}{m}}\, dW_t \\ d\theta_t = \tfrac{1}{t}\, w_t\, dt \end{cases} \tag{47}
\]
where $W_t$ is a $d$-dimensional standard Wiener process. Denote $X_t = (w_t^\top, \theta_t^\top)^\top \in \mathbb{R}^{2d}$; then again $X_t$ is a multidimensional OU process with expectation
\[
\mathbb{E}[X(t) \mid X(1)] = \big( w_0^\top \cos(\ln t) + \theta_0^\top \sin(\ln t),\; w_0^\top \sin(\ln t) - \theta_0^\top \cos(\ln t) \big)^\top \tag{48}
\]
and variance
\[
\mathrm{Var}[X(t) \mid X(1)] = \int_1^t e^{\tilde{A}(t)} e^{-\tilde{A}(s)} \Sigma(s) \Sigma(s)^\top \big( e^{\tilde{A}(t)} e^{-\tilde{A}(s)} \big)^\top ds \tag{49}
\]
where
\[
A(t) = \begin{pmatrix} 0 & -\tfrac{1}{t} I_d \\ \tfrac{1}{t} I_d & 0 \end{pmatrix}, \quad \Sigma(t) = \begin{pmatrix} \tfrac{1}{t} \sqrt{\tfrac{2}{m}}\, I_d & 0 \\ 0 & 0 \end{pmatrix} \tag{50}
\]
and $\tilde{A}(t)$ is an antiderivative of $A(t)$; for example we take
\[
\tilde{A}(t) = \int A(t)\, dt = \begin{pmatrix} 0 & -(\ln t) I_d \\ (\ln t) I_d & 0 \end{pmatrix} \tag{51}
\]
So $e^{\tilde{A}(t)} = \begin{pmatrix} (\cos \ln t) I_d & -(\sin \ln t) I_d \\ (\sin \ln t) I_d & (\cos \ln t) I_d \end{pmatrix}$ and
\[
e^{\tilde{A}(t)} e^{-\tilde{A}(s)} = \begin{pmatrix} (\cos \ln t) I_d & -(\sin \ln t) I_d \\ (\sin \ln t) I_d & (\cos \ln t) I_d \end{pmatrix} \begin{pmatrix} (\cos(-\ln s)) I_d & -(\sin(-\ln s)) I_d \\ (\sin(-\ln s)) I_d & (\cos(-\ln s)) I_d \end{pmatrix} \tag{52}
\]
\[
= \begin{pmatrix} \big(\cos \ln \tfrac{t}{s}\big) I_d & -\big(\sin \ln \tfrac{t}{s}\big) I_d \\ \big(\sin \ln \tfrac{t}{s}\big) I_d & \big(\cos \ln \tfrac{t}{s}\big) I_d \end{pmatrix}
= \begin{pmatrix} \big(\cos \ln \tfrac{s}{t}\big) I_d & \big(\sin \ln \tfrac{s}{t}\big) I_d \\ -\big(\sin \ln \tfrac{s}{t}\big) I_d & \big(\cos \ln \tfrac{s}{t}\big) I_d \end{pmatrix}
= \big( e^{\tilde{A}(s/t)} \big)^\top \tag{55}
\]
The variance of the $i$-th component of $X(t)$ is
\[
\mathrm{Var}[X(t) \mid X(1)]_{ii} = \int_1^t \Big\| \big[ e^{\tilde{A}(t)} e^{-\tilde{A}(s)} \Sigma(s) \big]_i \Big\|^2 ds
= \int_1^t \Big\| \big[ \big( e^{\tilde{A}(s/t)} \big)^\top \big]_i \Sigma(s) \Big\|^2 ds \tag{58}
\]
\[
= \int_1^t \frac{2}{m s^2} \Big\| \big[ \big( e^{\tilde{A}(s/t)} \big)^\top \big]_i \begin{pmatrix} I_d & 0 \\ 0 & 0 \end{pmatrix} \Big\|^2 ds = \Theta\!\left(\frac{1}{m}\right) \tag{60}
\]
From the definition of $X(t)$ we have $\mathrm{Var}([\theta_t]_i) = \mathrm{Var}[X(t) \mid X(1)]_{d+i, d+i} = \Theta(\tfrac{1}{m})$. Hence we complete the proof of the second part.

D Proof of Theorem 3
Proof.
Since $f$ is $L$-Lipschitz, we have
\[
|f(\lambda x_1 + (1 - \lambda) x_2) - f(x_1)| \le L \| x_1 - (\lambda x_1 + (1 - \lambda) x_2) \| = L (1 - \lambda) \| x_1 - x_2 \| \le L (1 - \lambda) \delta \tag{62}
\]
\[
|f(\lambda x_1 + (1 - \lambda) x_2) - f(x_2)| \le L \| x_2 - (\lambda x_1 + (1 - \lambda) x_2) \| = L \lambda \| x_1 - x_2 \| \le L \lambda \delta \tag{64}
\]
So we have $f(\lambda x_1 + (1 - \lambda) x_2) \ge f(x_1) - L(1 - \lambda)\delta$ and $f(\lambda x_1 + (1 - \lambda) x_2) \ge f(x_2) - L\lambda\delta$. Hence
\[
f(\lambda x_1 + (1 - \lambda) x_2) \ge \max\{ f(x_1) - L(1 - \lambda)\delta,\; f(x_2) - L\lambda\delta \},
\]
and the second result follows by a simple calculation.

E Sample insufficiency
We show more examples demonstrating the problem of sample insufficiency in GANs. It can misguide the discriminator and then the generator, finally causing the generator to produce anomalous images. The training dataset consists of images whose rectangle number is exactly 2. They are 32 by 32 single-channel images, and all rectangles are 8 by 8. Anomalous generated images have a different number of rectangles. Two training designs are compared. The first is mini-batch gradient descent (MGD), where the insufficiency problem is severe and training is unstable, giving rise to anomalous images throughout training. The second is full-batch gradient descent (FGD), where the insufficiency problem is negligible; training is stable, with few anomalous images generated. Generated images for individual latent codes are plotted. The images are grey single-channel but shown in color for visualization purposes.
F Avoiding anomalous generalization on geometry data
We introduced two problems to explain the anomalous generalization results reported in Zhao et al. (2018), and demonstrated that the anomaly can be avoided by training modifications. We have three modified training methods extending the Vanilla training: Sample Correction (SC), Pixel-wise Combination Regularization (PCR) and SC + PCR. For SC, the negative training samples generated by the generator are filtered before being fed to the discriminator for gradient descent: samples with the true rectangle number are discarded. The selection can be implemented efficiently by counting the number of rectangles with a straightforward counting algorithm. For the PCR approach, the pixel-wise combinations of the training data (pixel-wise logical-and and logical-or) are precomputed. In particular, the way we generate the training geometry data can be exploited for this precomputation. To generate 2-rectangle training images, we first randomly generate several 3-rectangle images; then, for each 3-rectangle image, we randomly remove one rectangle out of the three, and we do this removal twice, obtaining two different 2-rectangle images from one 3-rectangle image. Because of this construction, the pixel-wise combination images can be obtained easily.
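A sketch of the construction described above (the rectangle size and the non-overlap constraint are assumptions carried over from Appendix E): each 3-rectangle image yields two 2-rectangle training images, and their pixel-wise combinations can be read off directly from the construction.

```python
import numpy as np

def three_rectangle_image(rng, size=32, rect=8):
    """Binary image with three disjoint rect x rect squares; also return their corners."""
    img = np.zeros((size, size), dtype=np.uint8)
    corners = []
    while len(corners) < 3:
        r, c = rng.integers(0, size - rect, size=2)
        if img[r:r + rect, c:c + rect].sum() == 0:   # keep rectangles disjoint
            img[r:r + rect, c:c + rect] = 255
            corners.append((int(r), int(c)))
    return img, corners

def make_pair(rng, size=32, rect=8):
    """Drop one rectangle (twice) from a 3-rectangle image, giving two 2-rectangle images."""
    img, corners = three_rectangle_image(rng, size, rect)
    children = []
    for k in rng.choice(3, size=2, replace=False):   # which rectangle to erase in each child
        r, c = corners[k]
        child = img.copy()
        child[r:r + rect, c:c + rect] = 0
        children.append(child)
    # The pixel-wise logical-or of the two children reproduces the 3-rectangle parent,
    # and their logical-and keeps only the shared rectangle, so combination images
    # come essentially for free from this construction.
    return children

rng = np.random.default_rng(0)
img_a, img_b = make_pair(rng)
```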