MGGAN: Solving Mode Collapse using Manifold Guided Training
Duhyeon Bang and Hyunjung Shim
School of Integrated Technology, Yonsei University, South Korea
Abstract.
Mode collapse is a critical problem in training generative adversarial networks. To alleviate mode collapse, several recent studies introduce new objective functions, network architectures, or alternative training schemes. However, their achievement is often the result of sacrificing image quality. In this paper, we propose a new algorithm, namely a manifold guided generative adversarial network (MGGAN), which leverages a guidance network on an existing GAN architecture to induce the generator to learn all modes of the data distribution. Based on extensive evaluations, we show that our algorithm resolves mode collapse without losing image quality. In particular, we demonstrate that our algorithm is easily extendable to various existing GANs. Experimental analysis justifies that the proposed algorithm is an effective and efficient tool for training GANs.
Keywords:
Generative Adversarial Network, Mode collapse, Bidirectional mapping
Generative adversarial networks (GANs) [1] are a family of generative models that implicitly estimate the data distribution in an unsupervised manner. This is accomplished by learning to generate new data samples instead of explicitly constructing a density function. Because GANs do not rely on strong statistical assumptions or constraints on distributions, there is no performance limitation on modeling complex manifolds of the data distribution. Owing to this attractive nature, GANs have been successful in image generation tasks; various studies report that their image quality is superior to that of traditional generative models in that GANs produce sharp and realistic images.

Despite promising achievements, GANs are notoriously hard to train due to training instability and sensitivity to hyperparameters. Training instability causes two problems: poor image quality and lack of image diversity. These two issues are in a trade-off relationship with each other, and existing studies aim to improve either image quality or diversity. In this paper, our primary interest is to improve image diversity without sacrificing image quality.

The lack of image diversity in GAN training is also known as mode collapse, in which $P_{model}$ captures only a single or a few major modes of $P_{data}$ while ignoring many small modes. To address this problem, we propose a novel algorithm, namely the manifold guided generative adversarial network (MGGAN), which integrates a newly proposed guidance network into an existing GAN architecture. Note that the standard GAN consists of a discriminator network and a generator network. The discriminator aims to distinguish the fake images produced by the generator from real images. Meanwhile, the generator aims to fool the discriminator by generating fake images that are as realistic as possible. On top of this standard GAN architecture, we add the guidance network, which encourages the generator to learn all modes of $P_{data}$. The goal of the guidance network is to guide the generator so that $P_{model}$ matches $P_{data}$ in a learned manifold space. For that, the guidance network consists of an encoder for manifold mapping and a discriminator for measuring the dissimilarity between the distributions of $P_{data}$ and $P_{model}$ in the manifold space. In this way, the characteristics of the learned manifold space are reflected in generator training. To solve mode collapse, we employ the encoder of a pre-trained autoencoder to define the manifold mapping. This autoencoder is optimized for reconstructing all samples of real images, and is fixed after pre-training so that it is not updated during GAN learning. Because this autoencoder learns to represent all training data [2], the manifold learned by the encoder is effective for representing all modes of $P_{data}$.

The contributions of our MGGAN are summarized as follows.

1. The manifold space derived by the autoencoder represents the most critical features for reconstructing all modes of $P_{data}$. The guidance network provides feedback to the generator so that the distribution of critical features is well restored in the fake images. In this way, we induce the generator to learn all modes of $P_{data}$, thereby producing diverse samples.
2. Because our encoder network is pre-trained and fixed during GAN training, it prevents errors in encoder training from propagating to the generator and discriminator training.
3. There is no range or unit mismatch between the loss of the discriminator and that of the guidance network, because both networks use an adversarial loss.
As a result, the training of our network is stable.
4. The proposed algorithm resolves mode collapse without sacrificing image quality.

A variety of techniques have been proposed in the past for solving mode collapse; they can be categorized into the following two groups.
Although the standard GAN [1] theoretically proves that generative modeling can be formulated by minimizing the Jensen-Shannon divergence (JSD), the authors recommend the non-saturated GAN for the actual implementation [1,3]. The non-saturated GAN is designed to minimize $KL(P_{model}\,\|\,P_{data}) - 2\,JSD(P_{data}\,\|\,P_{model})$ for the generator update, which retains the properties of the reverse Kullback-Leibler (KL) divergence between $P_{data}$ and $P_{model}$ [4]. Fedus et al. [4] and Arjovsky et al. [5] point out that the reverse KL divergence makes the non-saturated GAN vulnerable to mode collapse. Because the reverse KL evaluates the dissimilarity between two distributions only at fake samples (i.e., at all $x$ where $P_{model}(x) > 0$), there is no penalty for covering only a fraction of the true data distribution. To address this issue, they suggest the Wasserstein distance, which holds the weakest convergence among existing GAN metrics. This new metric is effective in solving mode collapse by stabilizing GAN training. However, they approximate the Wasserstein distance by weight clipping, unfortunately causing pathological behavior [6]. On the contrary, D2GAN [7] employs two antithetical discriminators: one minimizes the forward KL divergence; the other minimizes the reverse KL divergence. As a result, the generator produces samples that fool both discriminators simultaneously and escapes from mode collapse. However, the two antithetical discriminators tend to increase instability because their goals conflict.

Unrolled GAN [8] claims that mode collapse occurs because discriminator updates do not guarantee an optimal discriminator. Thus, the authors introduce a surrogate objective function that simulates the discriminator's response to generator changes. Although their model is robust against mode collapse, it is not clear whether this achievement comes at the cost of visual quality. Also, the heavy computational complexity caused by the k-step discriminator updates is a well-known drawback of Unrolled GAN.

DRAGAN [9] states that a non-convex loss function exhibits local minimax points, leading to mode collapse. Hence, the authors propose a gradient penalty (GP) term in order to regularize sharp gradients. The GP term stabilizes GAN training, which is also effective in mitigating mode collapse. LSGAN [10] replaces the sigmoid cross-entropy loss of the standard GAN with a least squares loss, which is equivalent to minimizing the Pearson $\chi^2$ divergence. The authors assert that this replacement improves the stability of the learning process and reduces the possibility of mode collapse. However, these existing studies, DRAGAN and LSGAN, do not provide a significant improvement in diverse image generation on real datasets.

Mode collapse leads to ignoring minor modes of the data distribution. To address this problem, several recent studies propose learning a mapping function from $P_{data}$ to $P_z$, namely an inference mapping. ALI [11] and BiGAN [12] suggest a discriminator for joint distribution matching, which learns the relationship between the data and latent distributions. This can be interpreted as ALI and BiGAN aiming to recover a bidirectional mapping between the data and the latent, while the standard GAN learns a unidirectional mapping from $P_z$ to $P_{data}$, namely a generation mapping. However, their results do not improve upon existing GAN models.

MDGAN [13] and VEEGAN [14] utilize a reconstruction loss as an additional constraint on the inference mapping.
Although the reconstruction loss is effective, its unit mismatches that of an adversarial loss: the reconstruction loss is a distance measure, while the adversarial loss is a divergence measure. MDGAN separates training into a mode regularization step and a diffusion step in order to reduce instability. Unlike MDGAN, VEEGAN applies the reconstruction loss in the latent domain rather than the data domain. Hence, it mitigates the image quality degradation (i.e., image blur) caused by a reconstruction loss applied in the data domain.

AGE [15] suggests a new architecture composed of an encoder and a generator, and designs adversarial learning between the two networks without a discriminator. Because it does not rely on a discriminator, it greatly reduces the computational complexity and convergence time compared to previous models utilizing a bidirectional mapping.

Fig. 1.
The proposed model structure. $x_{real}$ and $x_{fake}$ are samples of $P_{data}$ and $P_{model}$, respectively; $z$ is the latent vector; $E$, $G$, and $D$ are the encoder, the generator, and the discriminator networks. The subscript of $D$ denotes its input sample space. The guidance network consists of $E$ and $D_m$, where $m$ denotes the manifold space. There is no difference from the standard GAN except for the addition of our guidance network.

Our goal is to generate diverse samples, i.e., to solve mode collapse, without sacrificing image quality. For that, we propose a new algorithm that induces a generator to learn all modes of $P_{data}$ as well as to produce realistic samples. Specifically, we introduce a guidance network, which guides the generator to produce samples reflecting the manifold distribution. The standard GAN, which consists of a generator, $G$, and a discriminator, $D_x$, augmented with this guidance network is shown in Fig. 1.

For the sake of distinguishing between the true and the estimated probability distributions, we mark estimated variables with a hat; in our study, since the encoder maps the true probability distribution to the manifold, $E(x \sim P_{data})$ is mapped onto $P_m$ and $E(x \sim P_{model})$ is mapped onto $P_{\hat{m}}$, where $m$ represents the manifold space.
Fig. 2.
Difference between the bidirectional mapping method and the proposed weakly bidirectional mapping algorithm. $P_{data}$ is the data distribution. $P_z$ is a simple prior distribution, generally uniform or Gaussian noise. $P_m$ is the manifold distribution introduced by the proposed method. A hat denotes a fake (estimated) distribution. The solid black line, the green dashed line, and the blue dash-dot line are the generation, inference, and manifold mappings, respectively. $G$ and $E$ denote the networks that conduct each mapping. The double-sided red arrow depicts the divergence that GANs aim to reduce.

Our guidance network aims to reduce the divergence between the projections of $P_{data}$ and $P_{model}$ on the manifold space. The guidance network is composed of an encoder, $E$, and a discriminator, $D_m$. The encoder maps $P_{data}$ and $P_{model}$ to the manifold space. The discriminator of the guidance network, $D_m$, distinguishes the encoded $P_{data}$ from the encoded $P_{model}$, i.e., $P_m$ and $P_{\hat{m}}$, respectively. The following equations show the objective function of our MGGAN, where the guidance network is implemented with the non-saturated GAN:

$$\min_{D_x, D_m} \; -\,\mathbb{E}_{x \sim P_{data}}\big[\log D_x(x) + \log D_m(E(x))\big] \;-\; \mathbb{E}_{z \sim P_z}\big[\log(1 - D_x(G(z))) + \log(1 - D_m(E(G(z))))\big], \quad (1)$$

$$\min_{G} \; -\,\mathbb{E}_{z \sim P_z}\big[\log D_x(G(z)) + \log D_m(E(G(z)))\big]. \quad (2)$$

As described in Eq. (1) and Eq. (2), the two discriminators, $D_x$ and $D_m$, do not explicitly affect each other, but both of them influence the generator. From Eq. (2), the generator attempts to meet two goals simultaneously: the first is to minimize the dissimilarity between $P_{data}$ and $P_{model}$, equivalent to a non-saturated GAN, and the second is to minimize the dissimilarity between their mapped distributions in the manifold space. It is worth noting that our two discriminators concurrently affect generator training, and thus the two discriminators implicitly influence each other through the generator. In particular, the encoder of the guidance network is designed to derive the most representative manifold of $P_{data}$, where all modes of $P_{data}$ are captured. As a result, the guidance network induces generator training such that the generator is capable of producing diverse samples, because $P_{model}$ encapsulates all modes of $P_{data}$ by reflecting the characteristics of the encoder of the guidance network.
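To make the alternating optimization of Eqs. (1) and (2) concrete, the following PyTorch sketch shows one discriminator step and one generator step. It assumes `G`, `D_x`, `D_m`, and `E` are `nn.Module` instances defined elsewhere, with sigmoid-output discriminators and a frozen encoder; the function names and optimizer handling are our illustrative assumptions, not code released with the paper.

```python
import torch
import torch.nn.functional as F

def d_step(G, D_x, D_m, E, x_real, z, opt_d):
    """One update of both discriminators (Eq. 1); D outputs are probabilities."""
    with torch.no_grad():
        x_fake = G(z)                          # keep the generator fixed here
    d_real, d_fake = D_x(x_real), D_x(x_fake)
    m_real, m_fake = D_m(E(x_real)), D_m(E(x_fake))
    # Minimizing these cross-entropies is equivalent to Eq. (1).
    loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(m_real, torch.ones_like(m_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
            + F.binary_cross_entropy(m_fake, torch.zeros_like(m_fake)))
    opt_d.zero_grad(); loss.backward(); opt_d.step()
    return loss.item()

def g_step(G, D_x, D_m, E, z, opt_g):
    """One generator update with the non-saturating loss (Eq. 2)."""
    x_fake = G(z)
    d_fake, m_fake = D_x(x_fake), D_m(E(x_fake))
    loss = (F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
            + F.binary_cross_entropy(m_fake, torch.ones_like(m_fake)))
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()
```

Because $E$ is fixed, the guidance term acts as a second non-saturating GAN loss whose gradients reach the generator through the frozen encoder, which is exactly how the two discriminators implicitly interact through $G$.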
As discussed in section 2, mode collapse occurs because a generator can fool a discriminator by producing the same or similar samples corresponding to a single major mode of $P_{data}$. This issue frequently arises when $P_{data}$ includes many minor modes. Several recent studies state that traditional GANs imposing a unidirectional mapping (i.e., generation mapping) are not sufficient for GAN training. To address this problem, they suggest a bidirectional mapping to regularize generator training [11,12,13,14]. Their network architecture is similar to the proposed model in that they also utilize an encoder architecture to map $P_{data}$ into a low-dimensional manifold space. However, while their encoders are designed to map $P_{data}$ into $P_z$ (i.e., inference mapping), we intend to map $P_{data}$ onto a meaningful manifold space that we determine, namely a manifold mapping. To clarify the difference, we refer to the existing techniques as bidirectional GANs and to the proposed model as a weakly bidirectional GAN.

Bidirectional mapping cannot avoid at least one of two limitations, depending on whether an additional constraint is applied to the encoder. With the constraint on the encoder, the encoder loses representational power; without the constraint, the generator loses the generation power to cover the wide range of the latent distribution. In the former case, the encoder encodes $P_{data}$ by ensuring that the encoded distribution matches $P_z$. This strict constraint is rarely satisfied in reality, thus increasing training instability and reducing the representational power of the encoder [13]. On the contrary, in the latter case, the generator is forced to handle two latent distributions, $P_z$ and $P_{\hat{z}}$, concurrently. The less the two distributions overlap, the lower the quality of the samples from the generator. Furthermore, bidirectional mapping propagates errors from the generation (or inference) mapping to the inference (or generation) mapping. As a result, this becomes an additional source of training instability.

Unlike bidirectional mapping, our weakly bidirectional mapping trains the encoder network independently from GAN training and then fixes it. Furthermore, we guarantee that the generator focuses on covering $P_z$ only, because our weakly bidirectional mapping imposes a manifold mapping, not an inference mapping. As a result, our weakly bidirectional mapping allows the encoder and the generator to be trained without strict constraints that degrade their performance. Figure 2 visualizes the conceptual difference between the bidirectional mapping approach and our weakly bidirectional mapping approach. Note that both approaches try to decrease two divergences simultaneously, one of which is the divergence between $P_{data}$ and $P_{model}$. Yet, bidirectional mapping additionally considers the divergence between $P_z$ of the real data and $P_{\hat{z}}$ of the fake data, while ours considers the divergence between $P_m$ of the real data and $P_{\hat{m}}$ of the fake data.

More specifically, previous studies using bidirectional mapping employ a discriminator for joint distribution matching [11,12], a reconstruction loss (i.e., a pixel-wise L1 or L2 loss) [13], or both [14].
The discriminator for joint distribution matching evaluates both the generation and inference mappings by distinguishing between two joint distributions: the joint distribution of the real data and its inferred latent from the encoder, and that of the real latent vector and its generated data from the generator. In other words, a single discriminator should achieve two different goals: it evaluates 1) whether the generated data is real or not, and 2) whether the two joint distributions match or not. Thus, the discriminator becomes insensitive to subtle changes in each distribution. Consequently, this is likely to increase the missing of modes that are relatively minor in the data distribution [13,14]. Furthermore, these methods do not provide any regularization or constraint to prevent error propagation when iterating the generation and inference mappings. For these reasons, the sample reconstructions from ALI and BiGAN tend to be less faithful reproductions of the inputs.

MDGAN and VEEGAN introduce a reconstruction loss into the GAN framework in order to prevent mode missing in $P_{data}$. This loss enforces each data sample $x$ (or $z$) to be reconstructed correctly after applying the inference and generation mappings sequentially. With this reconstruction loss, MDGAN and VEEGAN improve the inference mapping compared to ALI and BiGAN. However, it is hard to tune the parameter balancing the adversarial loss and the reconstruction loss because their units are different (e.g., the adversarial loss measures a divergence while the reconstruction loss measures a pixel difference).

Inspired by this observation, we propose a weakly bidirectional mapping approach to subside the drawbacks of bidirectional mapping. It is important to note that our network model does not include a direct link between the encoder and the generator, meaning that the training procedures of the encoder and the generator are separated. This separation is effective in improving the performance of the encoder and the generator respectively, because each can focus on its own objective without being distracted by constraints. Also, because the guidance network assesses the divergence of two distributions analogously to the standard GAN loss, there is no unit mismatch in integrating the two losses.

As shown in Fig. 1, the guidance network consists of an encoder and a discriminator. In order to solve mode collapse, we design the encoder $E$ such that the output distribution of the encoder, $P_m$, is the best approximation of $P_{data}$ given the fixed dimensionality of $P_m$; all modes of $P_{data}$ are reflected in $P_m$. To meet this criterion, we employ the encoder of a pre-trained autoencoder. The autoencoder first learns a representation of a dataset using an encoder, and then reconstructs the dataset by decoding from that representation. Because the autoencoder network is trained to minimize reconstruction errors (i.e., the L1 or L2 loss between the input and its reconstruction), the autoencoder can learn all modes of the true data distribution [2]. This is because every sample in the dataset contributes equally to training the network, memorizing all data. Although this property causes quality degradation in image generation (e.g., image blur), it is advantageous for the goal of the guidance network, which is to induce the generator to learn the true distribution without missing modes. Owing to the representational power of the autoencoder, the encoder is effective in representing $P_{data}$ such that $P_m$ can reflect all modes of $P_{data}$.
Specifically, we pre-train and fix the parameters of the autoencoder using the real dataset. In this way, it is possible to keep the representational power of the encoder and to reduce the uncertainty of the inference.

Since the manifold space is a topological space, a general distance measure is not suitable for measuring the dissimilarity between two samples, one from $P_m$ and the other from $P_{\hat{m}}$ [16]. To measure the dissimilarity between $P_m$ and $P_{\hat{m}}$, the discriminator of the guidance network, $D_m$, learns to separate the two distributions in the manifold space based on adversarial learning. For constructing $D_m$, we use a structure and divergence identical to those of the discriminator of the standard GAN. Although we simply add the two losses without further investigation, we achieve stable training because we are free from unit mismatch.

Fig. 3.
Mode collapse test: learning a mixture of eight Gaussians spread in a circle with a standard deviation of 0.01 (left) and 0.35 (right).

For quantitative and qualitative evaluations, we utilize one simulated and two real datasets, CelebA [17] and CIFAR-10 [18], normalized between -1 and 1. Note that the input dimensionality of CelebA is (64, 64, 3) and that of CIFAR-10 is (32, 32, 3). A denoising autoencoder [19] is adopted for the guidance network to encourage robust feature extraction, resulting in a slight quality improvement compared to a conventional autoencoder.
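The following is a minimal sketch of this pre-training stage for 64x64 inputs normalized to [-1, 1]. The layer widths, the manifold dimensionality `m_dim`, the noise level, and the loader format are illustrative assumptions; the paper does not specify the exact autoencoder configuration.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Denoising autoencoder; only its encoder is reused as E after training."""
    def __init__(self, m_dim=256):                 # m_dim is an assumption
        super().__init__()
        self.encoder = nn.Sequential(              # 3x64x64 -> m_dim code
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
            nn.Flatten(), nn.Linear(256 * 8 * 8, m_dim))
        self.decoder = nn.Sequential(              # m_dim code -> 3x64x64
            nn.Linear(m_dim, 256 * 8 * 8), nn.Unflatten(1, (256, 8, 8)),
            nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())  # output in [-1, 1]

    def forward(self, x):
        return self.decoder(self.encoder(x))

def pretrain(ae, loader, epochs=20, noise_std=0.1):
    """Train on (image, label) batches, then freeze and return the encoder."""
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, _ in loader:
            x_noisy = x + noise_std * torch.randn_like(x)  # denoising corruption
            loss = mse(ae(x_noisy), x)                     # reconstruct clean input
            opt.zero_grad(); loss.backward(); opt.step()
    for p in ae.encoder.parameters():                      # fix E for GAN training
        p.requires_grad_(False)
    return ae.encoder
```

Freezing the returned encoder is what separates the guidance network's training from GAN training, as described above.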
To demonstrate that the guidance network helps GANs prevent mode missing, i.e., that it solves mode collapse, we train and test the network using a simple 2D Gaussian mixture model whose eight modes are evenly distributed along a circle [8]. We set the standard deviation (std) to 0.01 and 0.35 to see how the interval between modes influences mode collapse. Figure 3 compares the MGGAN, GAN, unrolled GAN, and VEEGAN models. (For unrolled GAN and VEEGAN, we refer to the code at https://github.com/poolio/unrolled_gan and https://github.com/akashgit/VEEGAN, respectively.) When the modes are far apart (i.e., std = 0.01), the GAN suffers from mode collapse while the other models effectively solve this problem. In contrast, when the modes are adjacent (i.e., std = 0.35), unrolled GAN and VEEGAN capture almost all modes but generate highly scattered samples that do not accurately represent the true distribution. Unlike in the earlier example, the GAN outperforms both unrolled GAN and VEEGAN in the latter experiment. In both cases, our MGGAN consistently resolves mode collapse with an accurate representation.

Interestingly, we observe that MGGAN first captures each mode and then deviates from mode collapse; Fig. 3 supports this when the std is 0.01. This is because MGGAN is built upon the standard GAN, while the guidance network induces the generator to learn the entire set of modes. For this reason, MGGAN shows learning patterns similar to the GAN with a std of 0.35, and can generate samples of fine quality similar to the GAN.
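For reference, the synthetic distribution used above can be sampled as follows. Only the two standard deviations are stated in the text, so the ring radius and batch size below are our assumptions.

```python
import numpy as np

def sample_ring(n, std=0.01, radius=2.0, n_modes=8, rng=None):
    """Sample from a mixture of n_modes 2D Gaussians evenly spaced on a circle."""
    rng = rng or np.random.default_rng()
    angles = 2 * np.pi * rng.integers(n_modes, size=n) / n_modes  # pick a mode
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return centers + std * rng.standard_normal((n, 2))            # add mode noise

x = sample_ring(512, std=0.35)   # one batch of "real" 2D samples
```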
To evaluate the effectiveness of our MGGAN, we construct four variants of MGGAN. That is, we select four different GANs as baseline networks, and then modify each by adding the guidance network. These baseline GANs report state-of-the-art visual quality in data generation, but are prone to mode collapse. Throughout this paper, we utilize four baseline networks, DCGAN [20], LSGAN [10], DRAGAN [9], and DFM [21], and develop the variants of MGGAN as DCGAN-MG, LSGAN-MG, DRAGAN-MG, and DFM-MG. For a fair comparison, the network architecture of both the generator and the discriminator follows that of DCGAN. Also, we utilize the hyperparameters suggested in each baseline work without any fine-tuning. Implementation code will be available soon.

MS-SSIM [22] and the inception score [23] are used as metrics for quantitative evaluation. These demonstrate that MGGAN improves the diversity of data generation while retaining the image quality of the baseline GANs. A smaller MS-SSIM implies better performance in producing diverse images. The inception score is used to assess the visual quality of GANs on the CIFAR-10 dataset, and a larger score represents higher quality.

To evaluate the image diversity using MS-SSIM, we only use the CelebA dataset. CIFAR-10 is excluded from this experiment because MS-SSIM is meaningless if the dataset is already highly diverse [3]; CIFAR-10 is composed of ten different classes. To compare the four variants of MGGAN with their baseline GANs, we measure MS-SSIM for 100 samples generated from the four baseline GANs with and without the guidance network. Table 1 summarizes the average score of MS-SSIM measurements repeated ten times for each model. From this experiment, we find that the four variants of our MGGAN consistently and significantly improve the image diversity (i.e., reduce MS-SSIM) compared to the baseline GANs. Furthermore, the MS-SSIM values of all MGGANs are close to that of real data (i.e., 0.3727). This justifies that the proposed model is effective at handling mode collapse, because the level of image diversity from the proposed model nearly approaches its optimal limit, the image diversity of the real dataset.
Table 1.
Comparison of the image diversity using MS-SSIM. Four baseline GANs and our MGGANs are compared. Note that the MS-SSIM of the real dataset is 0.3727. NB: the lower the MS-SSIM, the higher the diversity.
           DCGAN    LSGAN    DRAGAN   DFM
Original   0.4695   0.3904   0.3934   0.3996
with MG    0.3872   0.3784   0.3899   0.3814
           DCGAN    LSGAN    DRAGAN   DFM
Original   6.4706   6.3243   6.4468   6.5854
with MG    6.4728   6.3416   6.4942   6.6076
Fig. 4.
Comparison of inception scores as a function of iteration and time. The inception scores in the table are the average scores of 5 repeated measurements for each model.
Particularly, DCGAN-MG shows the most notable improvement over DCGAN, because DCGAN is the most prone to mode collapse.

Following Salimans et al. [23], we compute the inception score for 50k generated images from the baseline GANs and our MGGANs. Figure 4 plots the inception score as a function of iteration (top) and time (bottom). We observe that the inception score of DFM is not as high as reported in [21]. This drop might be caused by the modification of its network architecture to that of DCGAN. Still, DFM marks the highest score among the other GANs. From this experiment, we observe that the inception scores do not decrease with our model, and this observation holds for all four variants. More specifically, we confirm that our MGGAN achieves the image quality of the baseline GANs within a tolerance of approximately 0.04 in inception score. Additionally, we observe that the inception score of our MGGAN increases slightly faster than that of the baseline GANs. This demonstrates that the guidance network effectively accelerates the entire training by offering additional feedback on $P_{data}$ to the generator.
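As a rough illustration of the diversity protocol above, the following sketch scores a batch of generated samples with pairwise MS-SSIM. It assumes the third-party pytorch_msssim package and a generator whose Tanh output lies in [-1, 1]; we use three scales because 64x64 inputs are too small for the default five-scale setting, a choice that is ours rather than the paper's. The inception-score computation is omitted here.

```python
import itertools
import torch
from pytorch_msssim import ms_ssim   # assumes the pytorch-msssim package

def diversity_score(G, z_dim=100, n=100, device="cpu"):
    """Average pairwise MS-SSIM of n samples; lower means more diverse."""
    with torch.no_grad():
        x = G(torch.randn(n, z_dim, device=device))   # images in [-1, 1]
    x = (x + 1) / 2                                   # rescale to [0, 1]
    pairs = itertools.combinations(range(n), 2)
    scores = [ms_ssim(x[i:i + 1], x[j:j + 1], data_range=1.0,
                      weights=[1 / 3, 1 / 3, 1 / 3]).item()   # 3-scale MS-SSIM
              for i, j in pairs]
    return sum(scores) / len(scores)
```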
Fig. 5.
Comparison between randomly generated samples from the original baseline GANs (DCGAN, LSGAN, DRAGAN, and DFM) and the corresponding MGGANs (DCGAN-MG, LSGAN-MG, DRAGAN-MG, and DFM-MG).
Fig. 6.
Reconstruction quality comparison of the MGGAN variants (DCGAN-MG, LSGAN-MG, DRAGAN-MG, and DFM-MG) with ALI [11]. The ALI results are from the original paper. Odd columns are test images from the CelebA dataset and even columns are the corresponding reconstructions from each model.
In this section, we investigate the effect of the guidance network: whether it 1) yields degradation in visual quality, 2) induces a meaningful manifold mapping, and 3) results in memorization of $P_{data}$.

First, we compare generated images from the baseline models and the corresponding MGGANs. Figure 5 visualizes these results; the left side shows the generated images from each baseline GAN while the right side presents those from the corresponding MGGAN. From this qualitative comparison, it is hard to recognize any quality difference between the two. Therefore, our achievement in improving image diversity is not the result of sacrificing visual quality. These results are analogous to the quantitative evaluation reported in Fig. 4.

Second, we examine whether our weakly bidirectional mapping can induce a meaningful cycle between data and a latent vector $z$. For that, we build an additional network that associates our manifold space $P_m$ with the latent space $P_z$. Because this network transforms the encoder output to a latent vector, we can track a cyclic mapping, that is, $z \Rightarrow x \Rightarrow m \Rightarrow z$. This path can be considered a detour to build a bidirectional mapping. Although this additional network is never utilized during our training, we intentionally develop it to derive the $\hat{z}$ corresponding to $x$, and then reconstruct $x$ using the generator, $G(\hat{z})$. Based on this reconstruction experiment, we can evaluate how accurately our model can reproduce the real data, even without explicitly imposing a reconstruction loss. The network linking $P_m$ and $P_z$ is composed of a 1024-unit fully connected layer (FC) − batch normalization (BN) − rectified linear unit (ReLU) − FC − BN − ReLU − an FC layer with the dimensionality of $P_z$ (a sketch of this network is given below). Figure 6 shows the reconstructed images with their target images. The targets are from the CelebA test dataset, and all four variants (DCGAN-MG, LSGAN-MG, DRAGAN-MG, and DFM-MG) are investigated. For performance comparison with bidirectional mapping approaches, we borrow the result image of ALI from its paper [11]. Odd columns show target images and even columns show their reconstructed images. The results from ALI do not faithfully restore the attributes of the target faces, such as gender, glasses, and background color. On the contrary, our MGGANs reproduce the target images reasonably well, maintaining the original attributes. From this experiment, we confirm that our MGGAN produces more accurate reconstruction results than the bidirectional mapping approach, ALI.

Third, we generate samples by walking in latent space to verify whether the data generation is the result of data memorization. Because our generator learns representative features in the manifold $P_m$, derived solely from $P_{data}$, it might be reasonable to suspect overfitting to the training data. To clarify this issue, image generation results by latent walking are shown in Fig. 7. Note that we choose two latent vectors, which are derived from CelebA test data using the above network (connecting the manifold to the latent space). According to Radford et al. [20], Bengio et al. [24], and Dinh et al. [25], the interpolated images between two images in latent space do not have meaningful connectivity when the network merely memorizes the dataset: smooth transitions are lacking, or generation fails. Because our MGGAN produces natural interpolations across various examples, we conclude that MGGAN learns a meaningful landscape in latent space. Thus, we confirm that MGGAN does not overfit the training data.

Fig. 7.
Latent space interpolations on the CelebA dataset. The left-most and right-most columns, marked with red boxes, are test images; immediately beside them, marked with yellow boxes, are the corresponding reconstructions. The intermediate columns, marked with blue boxes, are linear interpolations in the latent space between the reconstructions.
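Below is a sketch of this auxiliary $m \rightarrow z$ network and the latent walk, in PyTorch. The width of the second FC layer is illegible in our copy of the text, so 512 is a placeholder; `m_dim` and `z_dim` are likewise illustrative.

```python
import torch
import torch.nn as nn

class ManifoldToLatent(nn.Module):
    """FC-BN-ReLU network mapping a manifold code m = E(x) to a latent z."""
    def __init__(self, m_dim=256, z_dim=100, hidden2=512):  # hidden2: placeholder
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(m_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, hidden2), nn.BatchNorm1d(hidden2), nn.ReLU(),
            nn.Linear(hidden2, z_dim))

    def forward(self, m):
        return self.net(m)

def interpolate(G, E, m2z, x_a, x_b, steps=8):
    """Reconstruct two test images and linearly walk between their latents."""
    G.eval(); m2z.eval()             # BatchNorm needs eval mode for batch size 1
    with torch.no_grad():
        z_a, z_b = m2z(E(x_a)), m2z(E(x_b))        # z hat for each test image
        alphas = torch.linspace(0, 1, steps).view(-1, 1)
        z = (1 - alphas) * z_a + alphas * z_b      # linear path from z_a to z_b
        return G(z)                                # one image per interpolation step
```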
In this study, we propose a new algorithm that induces a generator to produce diverse samples without sacrificing visual quality, by manifold matching using the guidance network. To solve mode collapse, it is important to develop a manifold space in which all modes of the true distribution are reflected. To this end, we adopt the encoder of a pre-trained autoencoder as the manifold mapping function. Because the autoencoder aims to reconstruct all samples in the dataset, the encoder does not ignore minor modes. Consequently, the generator avoids mode missing during training because it receives feedback on the minor modes of the data distribution from the guidance network.

Compared with existing studies constructing a bidirectional mapping in GAN training, our algorithm can be interpreted as exploiting a weakly bidirectional mapping between the data and the latent. Because bidirectional mapping introduces excessive constraints for network training, those methods lose either the generation power of the generator or the representation power of the encoder. Meanwhile, the proposed algorithm utilizes a manifold mapping that does not reduce the generation power of the generator but rather encourages the generation process to increase image diversity. Moreover, our algorithm is easily extendable to various existing GANs. From the qualitative and quantitative experiments, we justify that MGGAN can successfully generate diverse samples without losing image quality.

In this paper, we suggest the encoder network of a pre-trained autoencoder for manifold mapping in order to solve mode collapse. We believe that this idea of manifold mapping can be further extended toward integrating prior information into generator training. We hope that the weakly bidirectional mapping approach provides a basis for future work on controlling a generator with prior knowledge.