Analyzing the Components of Distributed Coevolutionary GAN Training
Jamal Toutouh, Erik Hemberg, and Una-May O’Reilly
Massachusetts Institute of Technology, CSAIL, MA, USA
[email protected], {hembergerik,unamay}@csail.mit.edu

Abstract.
Distributed coevolutionary Generative Adversarial Network (GAN) training has empirically shown success in overcoming GAN training pathologies. This is mainly due to diversity maintenance in the populations of generators and discriminators during the training process. The method studied here coevolves sub-populations on each cell of a spatial grid organized into overlapping Moore neighborhoods. We investigate the impact on performance of two algorithm components that influence the diversity during coevolution: the performance-based selection/replacement inside each sub-population and the communication, through migration of solutions (networks), among overlapping neighborhoods. In experiments on the MNIST dataset, we find that the combination of these two components provides the best generative models. In addition, migrating solutions without applying selection in the sub-populations achieves competitive results, while selection without communication between cells reduces performance.
Keywords: Generative adversarial networks · Coevolution · Diversity · Selection pressure · Communication

1 Introduction

Machine learning with Generative Adversarial Networks (GANs) is a powerful method for generative modeling [9]. A GAN consists of two neural networks, a generator and a discriminator, and applies adversarial learning to optimize their parameters. The generator is trained to transform its inputs from a random latent space into "artificial/fake" samples that approximate the true distribution. The discriminator is trained to correctly distinguish the "natural/real" samples from the ones produced by the generator. Formulated as a minmax optimization problem through the definitions of the generator and discriminator losses, training can converge on an optimal generator that is able to fool the discriminator.

GANs are difficult to train. The adversarial dynamics introduce convergence pathologies [5,14]. This is mainly because the generator and the discriminator are differentiable networks whose weights are updated by (variants of) simultaneous gradient-based methods to optimize the minmax objective, which rarely converges to an equilibrium. Thus, different approaches have been proposed to improve the convergence and the robustness of GAN training [7,15,18,20,26].
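The minmax formulation referred to above is the standard GAN objective from [9]; written out explicitly, with G the generator, D the discriminator, p_data the data distribution, and p_z the latent prior:

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The generator loss and discriminator loss used in training are derived from the two terms of this objective.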
A promising research line is the application of distributed competitive coevolutionary algorithms (Comp-COEAs). Fostering an arms race between a population of generators and a population of discriminators, these methods optimize the minmax objective of GAN training. Spatially distributed populations (cellular algorithms) are effective at mitigating and resolving the COEA pathologies attributed to a lack of diversity [19], which are similar to the ones observed in GAN training.
Lipizzaner [1,22] is a spatially distributed Comp-COEA that locates the individuals of both populations on a spatial grid (each cell contains a GAN). In a cell, each generator is evaluated against all the discriminators of its neighborhood, and likewise for each discriminator. It uses neighborhood communication to propagate models and foster diversity in the sub-populations. Moreover, the selection pressure helps the convergence of the sub-populations [2]. Here, we evaluate the impact of neighborhood communication and selection pressure on this type of GAN training. We conduct an ablation analysis to evaluate different combinations of these two components. We ask the following research questions:

RQ1: What is the effect on the quality of the generators when training with communication or isolation and the presence or absence of selection pressure? The quality of the generators is evaluated in terms of the accuracy of the samples generated and their diversity.

RQ2: What is the effect on the diversity of the network parameters when training with communication or isolation and the presence or absence of selection pressure?

RQ3: What is the impact on the computational cost of applying migration and selection/replacement?
The main contributions of this paper are: i) proposing distributed Comp-COEA GAN training methods by applying different types of ablation to Lipizzaner, ii) evaluating the impact of the communication and the selection pressure on the quality of the returned generative model, and iii) analyzing their computational cost.

The paper is organized as follows. Section 2 presents related work. Section 3 describes Lipizzaner and the ablated methods analyzed. The experimental setup is in Section 4 and the results in Section 5. Finally, conclusions are drawn and future work is outlined in Section 6.
2 Related Work

In 2014, Goodfellow introduced GAN training [9]. Robust GAN training methods are still being investigated [5,14]. Competitive results have been provided by several practices that stabilize the training [7]. These methods include different strategies, such as using different cost functions for the generator or discriminator [4,15,18,32,29] and decreasing the learning rate through the iterations [20]. GAN training that involves multiple generators and/or discriminators has empirically shown robustness. Some examples are: iteratively training and adding new generators with boosting techniques [25]; combining a cascade of GANs [30]; training an array of discriminators on different low-dimensional projections of the data [17]; training the generator against a dynamic ensemble of discriminators [16]; and independently optimizing several "local" generator-discriminator pairs so that a "global" supervising pair of networks can be trained against them [6].
Evolutionary algorithms (EAs) may show limited effectiveness in high-dimensional problems because of runtime [24]. Parallel/distributed implementations allow EAs to keep computation times at reasonable levels [3,8]. There has been an emergence of large-scale distributed evolutionary machine learning systems, for example: EC-Star [11], which runs on hundreds of desktop machines; a simplified version of Natural Evolution Strategies [31] with a novel communication strategy applied to a collection of reinforcement learning benchmark problems [21]; and deep convolutional networks trained with genetic algorithms [23]. Theoretical studies and empirical results demonstrate that spatially distributed COEA GAN training mitigates convergence pathologies [1,22,26]. In this study, we focus on spatially distributed Comp-COEA GAN training, such as Lipizzaner [1,22]. Lipizzaner places the individuals of the generator and discriminator populations on each cell (i.e., each cell contains a generator-discriminator pair). Overlapping Moore neighborhoods determine the communication among the cells to propagate the models through the grid. Each generator is evaluated against all the discriminators of its neighborhood, and the same happens with each discriminator. This intentionally fosters diversity to address GAN training pathologies. Mustangs [26], a Lipizzaner variant, uses randomly selected loss functions to train each cell in each epoch to increase diversity. Moreover, training the GANs in each cell with different subsets of the training dataset has been demonstrated to be effective in increasing diversity across the grid [27]. These three approaches return an ensemble/mixture of generators defined by the best neighborhood (sub-population of generators), built using evolutionary ensemble learning [28]. They have shown competitive results on standard benchmarks. Here, we investigate the impact of two key components for diversity in Comp-COEA GAN training using ablations.
3 Distributed Coevolutionary GAN Training and Ablations

Here, we evaluate the impact of communication and selection pressure on the performance of distributed Comp-COEA GAN training. One goal of the training is to maintain diversity as a means of resilience to GAN training pathologies. The belief is that a lack of sufficient diversity results in convergence to pathological training states [22], while too much diversity results in divergence. The use of selection and communication is one way of regulating the diversity. This section presents Lipizzaner [22] and describes the ablations applied to gauge the impact of communication/isolation and selection pressure.
Lipizzaner adversarially trains a population of generators g = {g_1, ..., g_N} and a population of discriminators d = {d_1, ..., d_N}, where N is the size of the populations. It defines a toroidal grid in whose cells a generator-discriminator pair, i.e., a GAN, is placed (called the center). This allows the definition of neighborhoods with sub-populations g^k and d^k of g and d, respectively. The size of these sub-populations is denoted by s (s ≤ N). Lipizzaner uses the five-cell Moore neighborhood (s = 5), i.e., each neighborhood includes the cell itself (center) and the cells to the West, North, East, and South (see Figure 1).
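As a concrete illustration of the five-cell Moore neighborhood with toroidal wrap-around, the sketch below (our own helper, not taken from the Lipizzaner codebase; the function name and index convention are assumptions) computes the grid coordinates of a cell's neighborhood:

```python
def moore_neighborhood(row, col, rows, cols):
    """Return the cell itself plus its North, South, West, East neighbors
    on a toroidal grid: indices wrap around the edges via modulo."""
    return [
        (row, col),                  # center
        ((row - 1) % rows, col),     # North (wraps to the bottom row from row 0)
        ((row + 1) % rows, col),     # South
        (row, (col - 1) % cols),     # West
        (row, (col + 1) % cols),     # East
    ]

# On a 3x3 torus, cell (0, 0) has neighbors on the opposite edges:
print(moore_neighborhood(0, 0, 3, 3))
```

Because the grid is toroidal, every cell has exactly s = 5 distinct neighborhood members, so no border cases arise.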
Fig. 1: A toroidal grid with examples of overlapping Moore neighborhoods.

Each cell asynchronously executes its own learning algorithm in parallel. Cells interact with their neighbors at the beginning of each training epoch. This communication is carried out by gathering the latest updated center generator and discriminator of its overlapping neighborhoods. Figure 1 illustrates some examples of the overlapping neighborhoods on a toroidal grid; the updates in a cell are communicated to every neighborhood that overlaps it. Lipizzaner returns an ensemble of generators that consists of a sub-population of generators g^k and a mixture weight vector w ∈ R^s, where w_i ∈ w represents the probability that a data point is drawn from g_i^k, i.e., Σ_{w_i ∈ w} w_i = 1. Thus, an evolution strategy, ES-(1+1), evolves w to optimize the weights in order to get the most accurate generative model according to a given metric [22], such as the Fréchet Inception Distance (FID) [10].
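The ES-(1+1) evolution of the mixture weight vector w can be sketched as follows. This is a simplified illustration, not the exact Lipizzaner implementation: the Gaussian mutation scheme and the toy fitness standing in for FID are our own assumptions; lower score is better, and the weights are kept non-negative and normalized so they sum to 1:

```python
import random

def mutate_weights(w, sigma=0.1, rng=random):
    """Gaussian mutation followed by renormalization, so w stays a distribution."""
    mutated = [max(1e-9, wi + rng.gauss(0.0, sigma)) for wi in w]
    total = sum(mutated)
    return [wi / total for wi in mutated]

def es_1plus1(score, w, steps=200, seed=0):
    """(1+1)-ES: keep the mutated child only if it scores no worse than the parent."""
    rng = random.Random(seed)
    best, best_score = w, score(w)
    for _ in range(steps):
        child = mutate_weights(best, rng=rng)
        child_score = score(child)
        if child_score <= best_score:
            best, best_score = child, child_score
    return best, best_score

# Toy fitness standing in for FID: squared distance to some unknown optimal mixture.
target = [0.7, 0.2, 0.1]
score = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
w_opt, s_opt = es_1plus1(score, [1 / 3] * 3)
```

In Lipizzaner, `score` would be the FID of samples drawn from the generator mixture under the candidate weights.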
Algorithm 1 summarizes Lipizzaner. It starts the parallel execution of the training process in each cell. First, each cell randomly initializes single generator and discriminator models. Then, the training process consists of a loop with three main steps: i) gathering the GANs (neighbors) to update the neighborhood; ii) updating the center by applying the training method presented in Algorithm 2; and iii) evolving the mixture weights by applying ES-(1+1) (line 7 of Algorithm 1). This loop is repeated T (training epochs) times. The returned generative model is the best performing neighborhood with its optimal mixture weights, i.e., the best ensemble of generators.

The Comp-COEA training applied to the networks of each sub-population is shown in Algorithm 2. Its output is the sub-population n with the center updated. This method is built on fitness evaluation, selection and replacement, and reproduction based on gradient-descent updates of the weights. The fitness of each network is its average loss L when evaluated against all its adversaries. After evaluating all the networks, a tournament selection operator is applied to select the parents (a generator and a discriminator) that produce the offspring (lines 1 to 5). The selected generator/discriminator is trained against a randomly chosen adversary from the sub-population (lines 8 and 11, respectively). The computed losses are used to mutate the parents (i.e., update the networks' parameters) according to stochastic gradient descent (SGD), whose learning rate value n_δ is updated by Gaussian-based mutation (line 7). When the training is completed, all models are evaluated again, the least fit generator and discriminator in the sub-populations are replaced with the fittest ones, and the best ones are set as the center of the cell (lines 14 to 20).
Algorithm 1 Lipizzaner
Input: T: training epochs, E: grid cells, s: neighborhood size, θ_EA: mixture evolution parameters, θ_Train: parameters for training the models in each cell
Return: g: neighborhood of generators, w: mixture weights

1: parfor c ∈ E do                          ▷ Asynchronous parallel execution of all cells in the grid
2:   n, w ← initializeNeighborhoodAndMixtureWeights(c, s)
3:   for epoch ∈ [1, ..., T] do             ▷ Iterate over training epochs
4:     n ← copyNeighbours(c, s)             ▷ Gather neighbour networks
5:     n ← trainModels(n, θ_Train)          ▷ Update GAN weights (Algorithm 2)
6:     g ← getNeighborhoodGenerators(n)     ▷ Get the generators
7:     w ← mixtureEvolution(w, g, θ_EA)     ▷ Evolve mixture weights by ES-(1+1)
8:   end for
9: end parfor
10: (g, w) ← bestEnsemble(g*, w*)           ▷ Get the best ensemble
11: return (g, w)                           ▷ Cell with the best generator mixture
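The per-cell loop of Algorithm 1 can also be sketched as plain Python. This is a dependency-injected skeleton of our own structuring, not the real Lipizzaner code: all steps are passed in as callables, and only the control flow (gather, train, collect generators, evolve weights) is meaningful:

```python
def run_cell(cell, grid, epochs, train_params, ea_params,
             init_fn, gather_fn, train_fn, evolve_fn):
    """Skeleton of one cell's asynchronous training loop (cf. Algorithm 1).

    init_fn   -> (neighborhood, weights): random initialization of the cell.
    gather_fn -> neighborhood: copy the latest centers of overlapping neighborhoods.
    train_fn  -> neighborhood: coevolve and train the sub-populations (cf. Algorithm 2).
    evolve_fn -> weights: ES-(1+1) on the mixture weights.
    """
    n, w = init_fn(cell, grid)
    g = []
    for _ in range(epochs):
        n = gather_fn(cell, grid)       # communication step
        n = train_fn(n, train_params)   # gradient-based coevolutionary training
        g = [gen for gen, _disc in n]   # neighborhood generators
        w = evolve_fn(w, g, ea_params)  # evolve mixture weights
    return g, w
```

In the real system each `run_cell` executes asynchronously in its own process, so the centers gathered by `gather_fn` may be at slightly different epochs.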
Algorithm 2 Coevolve and train the networks
Input: τ: tournament size, X: input training dataset, β: mutation probability, n: cell neighborhood sub-population, M: loss functions
Return: n: cell neighborhood sub-population, updated

1: B ← getMiniBatches(X)                  ▷ Load the set of minibatches
2: b ← getRandomMiniBatch(B)              ▷ Get one minibatch to evaluate the GANs
3: for each g, d ∈ g × d do               ▷ Evaluate all GAN pairs
4:   L_{g,d} ← evaluate(g, d, b)
5: g, d ← select(n, τ)                    ▷ Tournament selection with min loss (L) as fitness
6: for b ∈ B do                           ▷ Loop over batches
7:   n_δ ← mutateLearningRate(n_δ, β)     ▷ Update learning rate
8:   d' ← getRandomOpponent(d)            ▷ Get a random discriminator to train g
9:   ∇g ← computeGradient(g, d')          ▷ Compute gradient for g against d'
10:  g ← updateNN(g, ∇g, b)               ▷ Update g with the gradient
11:  g' ← getRandomOpponent(g)            ▷ Get a uniform random generator to train d
12:  ∇d ← computeGradient(d, g')          ▷ Compute gradient for d against g'
13:  d ← updateNN(d, ∇d, b)               ▷ Update d with the gradient
14: for each g, d ∈ g × d do              ▷ Evaluate all updated GAN pairs
15:   L_{g,d} ← evaluate(g, d, b)
16: L_g ← average(L_{·,d})                ▷ Fitness of g is its average loss (L)
17: L_d ← average(L_{g,·})                ▷ Fitness of d is its average loss (L)
18: n ← replace(n, g)                     ▷ Replace the generator with the worst loss
19: n ← replace(n, d)                     ▷ Replace the discriminator with the worst loss
20: n ← setCenterIndividuals(n)           ▷ Place the best g and d in the center
21: return n
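The selection and replacement steps of Algorithm 2 can be illustrated in isolation. The sketch below is our own simplification (Lipizzaner's operators act on neural networks; here the "individuals" are floats and the loss is the identity function) showing tournament selection of the lowest-loss individual and worst-replacement:

```python
import random

def tournament_select(population, fitness, tau, rng):
    """Pick tau random individuals; return the one with the lowest loss."""
    contenders = rng.sample(population, tau)
    return min(contenders, key=fitness)

def replace_worst(population, candidate, fitness):
    """Replace the worst (highest-loss) individual if the candidate is fitter."""
    worst = max(population, key=fitness)
    if fitness(candidate) < fitness(worst):
        population[population.index(worst)] = candidate
    return population

rng = random.Random(42)
pop = [5.0, 3.0, 8.0, 1.0]                  # stand-in "networks"; loss = identity
parent = tournament_select(pop, lambda x: x, tau=2, rng=rng)
pop = replace_worst(pop, 0.5, lambda x: x)  # 0.5 beats the worst (8.0), so it enters
```

Tournament size τ controls the selection pressure that this paper's ablations switch on and off.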
We conduct an ablation analysis of Lipizzaner to evaluate the impact of different degrees of communication/isolation and selection pressure on its COEA GAN training. The ablations are listed in Table 1. They ablate the use of sub-populations, the communication between cells, and the application of the selection/replacement operator. Thus, we define three variations of Lipizzaner:

– Spatial Parallel GAN training (SPaGAN): It does not apply selection/replacement. After gathering the networks from the neighborhood, it uses them only to train the center (i.e., SPaGAN does not apply the operations in lines 2 to 5 and lines 14 to 20 of Algorithm 2).
– Isolated Coevolutionary GAN training (IsoCoGAN): It trains sub-populations of GANs without communication between cells. There is no exchange of networks between neighbors after the creation of the initial sub-population (i.e., it skips the operation in line 4 of Algorithm 1).
– Parallel GAN (PaGAN): This trains a population of N GANs in parallel. When all the GAN training is finished, it randomly produces N sub-sets of s ≤ N generators selected from the entire population of trained generators to define the ensembles, and optimizes the mixture weights with ES-(1+1).

Table 1: Key components of the GAN training methods.
Feature                                        Lipizzaner  SPaGAN  IsoCoGAN  PaGAN
Use of sub-populations                             ✓          ✓        ✓        -
Communication between sub-populations              ✓          ✓        -        -
Application of selection/replacement operator      ✓          -        ✓        -

4 Experimental Setup

This experimental analysis compares Lipizzaner, SPaGAN, IsoCoGAN, and PaGAN in creating generative models that produce samples of the MNIST dataset [12]. This dataset is widely used as a benchmark due to its target space and dimensionality. The communication among the neighborhoods is affected by the grid size [22]. We evaluate SPaGAN and Lipizzaner using three different grid sizes, 3 × 3, 4 × 4, and 5 × 5, to control for the impact of this parameter.
To evaluate the quality of the generated data we use the FID score. The network diversity of the trained models is evaluated by the L distance between the parameters of the generators. The total variation distance (TVD) [13], a scalar that measures class balance, is used to analyze the diversity of the data produced by the generative models. TVD reports the difference between the proportion of the generated samples of a given digit and the ideal proportion (10% in MNIST). Finally, the computational cost is measured in terms of run time. All the analyzed methods apply the same number of network-training steps, i.e., updates of the networks' parameters according to SGD. All implementations are publicly available (Lipizzaner and ablations: https://github.com/ALFA-group/lipizzaner-gan) and use the same Python libraries and versions. The distribution of the results is not Gaussian, so the Mann-Whitney U statistical test is applied to evaluate significance.
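As described above, TVD compares the class proportions of the generated samples with the ideal uniform proportion (10% per digit on MNIST). A minimal sketch of this metric follows; this is our own formulation of the scalar described in the text, using the usual one-half total-variation convention:

```python
from collections import Counter

def tvd(labels, num_classes=10):
    """Total variation distance between the class proportions of the
    generated samples and the ideal uniform distribution (1/num_classes each)."""
    counts = Counter(labels)
    n = len(labels)
    ideal = 1.0 / num_classes
    return 0.5 * sum(abs(counts.get(c, 0) / n - ideal) for c in range(num_classes))

# A perfectly balanced batch scores 0; a mode-collapsed batch scores near 1.
balanced = list(range(10)) * 5   # five samples of each digit
collapsed = [3] * 50             # only the digit 3
```

In practice, `labels` would be the digit classes assigned to generated samples by a trained MNIST classifier.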
The methods evaluated here are configured according to the settings proposed by the authors of Lipizzaner [22]. The experimental analysis is performed on a cloud computing platform that provides 8 Intel Xeon cores at 2.2 GHz with 32 GB RAM and an NVIDIA Tesla P100 GPU with 16 GB RAM.
5 Results

This section discusses the main results of the experiments carried out to evaluate Lipizzaner, SPaGAN, IsoCoGAN, and PaGAN.

5.1 Quality of the Generated Data

Table 2 summarizes the results by showing the best FID scores for the different methods and grid sizes over the 30 independent runs.
Lipizzaner obtains the lowest/best Mean, Median, and Min FID scores for all grid sizes. SPaGAN and PaGAN are the second and third best methods, respectively. IsoCoGAN presents the lowest quality, showing the highest (worst) FID scores; FID values in the hundreds indicate that its generators are not capable of creating adequate MNIST data samples. The Mann-Whitney U test shows that the methods that exchange individuals among the neighborhoods, i.e., Lipizzaner and SPaGAN, are significantly better than PaGAN and IsoCoGAN. These results are confirmed by post-hoc statistical analysis, according to which there is no statistical difference between Lipizzaner and SPaGAN for any of the evaluated grid sizes.

Table 2: FID results for each grid size (3 × 3, 4 × 4, 5 × 5) and method, reporting Mean ± Std, Median, IQR, Min, and Max over the 30 runs (low FID indicates higher quality).

IsoCoGAN does not converge since the individuals of one sub-population are evaluated against randomly chosen individuals of the other one. As the accuracy of their fitness evaluation depends on the quality of the randomly chosen opponent, the fitness value likely does not correspond to the real quality of the individual, and therefore the selection/replacement operator does not promote the objectively best solution in the sub-population.

For all grid sizes,
SPaGAN provides higher/worse FIDs than Lipizzaner. SPaGAN has a similar quality for all grid sizes, but Lipizzaner improves when the grid size is increased. Thus, the larger the grid size, the greater the difference between these two methods. We observe the benefits of increasing diversity (population/grid size) when applying the coevolutionary approach with the selection and replacement of Lipizzaner. The results of Lipizzaner are mainly due to, first, the larger population sizes' ability to encompass a higher diversity, and second, the selection/replacement process applied by Lipizzaner, which accelerates the convergence of the population to higher-quality generators.
Fig. 2 shows the evolution of the median FID when using PaGAN, SPaGAN, and Lipizzaner in a 3 × 3 grid. Lipizzaner improves the performance of the generators faster than SPaGAN. After the first 75 to 100 training epochs, the FID no longer shows such a reduction in Lipizzaner, but SPaGAN is able to keep reducing it until the end of the training process. The faster convergence of Lipizzaner is also illustrated by Fig. 4, which shows the FID scores in the grid of a given independent run at epochs 25, 50, 75, and 100. This is mainly due to the exploitation capacity of this method when using selection/replacement. PaGAN shows a fast FID reduction at the beginning of the training process, but its FID oscillates sharply during the first 75 iterations and converges to worse FID scores than Lipizzaner and SPaGAN.
Fig. 2: Median FID evolution through the 200 training epochs (3 × 3 grid).

Fig. 3: FID differences diff_FID(m×m, i) when using different grid sizes.

Fig. 3 illustrates the differences between the FID scores obtained with the 3 × 3 grid and with the larger grids. The FID difference for grid size m × m at epoch i is computed as diff_FID(m×m, i) = FID(3×3, i) − FID(m×m, i). This figure allows evaluating the impact on the FID evolution of using different grid (population) sizes. It shows how Lipizzaner provides lower FIDs and is able to converge faster as the grid size increases during the first iterations. In contrast,
SPaGAN gets better FIDs when increasing the grid size only during the first 50 epochs. Lipizzaner is able to take advantage of the diversity generated when the grid size increases to converge faster to lower FIDs than SPaGAN. Thus, in terms of convergence, selection/replacement helps the coevolutionary training method when solutions are exchanged with the neighborhoods.
5.2 Diversity of the Generated Data

The output diversity reports the class distribution of the fake data produced by a generative model. Table 3 summarizes the results in terms of TVD. For the 3 × 3 grid, IsoCoGAN shows the worst (highest) TVD values.
PaGAN provides good TVDs but is statistically less competitive than SPaGAN and Lipizzaner according to the Mann-Whitney U test.

Fig. 4: FID score distribution through the grid at epochs 25, 50, 75, and 100 (in parentheses) of an independent run for SPaGAN and Lipizzaner in a 4 × 4 grid.

Table 3: TVD results for each grid size and method, reporting Mean ± Std, Median, IQR, Min, and Max (low TVD indicates more diversity).
SPaGAN and Lipizzaner show the best results, obtaining the same Mean, Median, and Min values. Therefore, when using communication between the neighborhoods during the training, the generators are able to produce more diverse data samples. When increasing the grid size, SPaGAN and Lipizzaner improve their results. However, SPaGAN does not show statistical differences between the same algorithm with different grid sizes. The coevolutionary approach used in Lipizzaner takes advantage of the divergence generated when increasing the grid size to train generators that create more diverse data samples. According to these results and the ones in Section 5.1, we can answer
RQ1: What is the effect on the quality of the generators when training with communication or isolation and the presence or absence of selection pressure? Communication and selection pressure allowed Lipizzaner to converge to the generators with the best quality (FID and TVD). The diversity resulting from communication produced the most competitive results, i.e., Lipizzaner and SPaGAN ended with better generators than IsoCoGAN and PaGAN. Isolation in training converged to good solutions when the cell was optimizing only one GAN (PaGAN). However, when isolation is coupled with a sub-population that applies selection/replacement, the algorithm was not able to converge, and therefore the quality of the returned generators is the worst.
5.3 Diversity of the Network Parameters

We next investigate the diversity of the parameters of the evolved networks. Table 4 summarizes the L distance results for the population at the end of the independent run that returned the median FID. Fig. 5 shows the L distances between all the generators in the grid (the x and y axes are the cell number).

Table 4: Diversity of the population (whole grid) in genome space: L distances between the generators at the final generation, reporting Mean ± Std, Median, IQR, Min, and Max per grid size and method.

Fig. 5: Diversity of the population in genome space. Heatmaps of the L distance between the generators at the final generation, for PaGAN, IsoCoGAN, SPaGAN, and Lipizzaner on the 3 × 3 grid, and for SPaGAN and Lipizzaner on the 4 × 4 and 5 × 5 grids. Dark indicates more diversity.
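The pairwise parameter-distance heatmap of Fig. 5 can be reproduced schematically. The sketch below is our own; we use the Euclidean (L2) distance over flattened parameter vectors as one concrete choice for the L distance between generators:

```python
import math

def pairwise_distances(param_vectors):
    """Symmetric matrix of Euclidean distances between flattened
    generator parameter vectors; the diagonal is zero."""
    n = len(param_vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(param_vectors[i], param_vectors[j])))
            dist[i][j] = dist[j][i] = d
    return dist

# Three toy "generators": identical parameter vectors are at distance 0.
D = pairwise_distances([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
```

Plotting `D` as a heatmap (one row/column per grid cell) yields figures analogous to Fig. 5, where darker off-diagonal entries indicate more parameter diversity.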
PaGAN . This is because thismethod trains a single generator against a single discriminator and it does notuse any type of information exchange or communication during the trainingprocess. Therefore, each cell converges to different points of the search space.
SPaGAN and
IsoCoGAN provide networks with similar diversity in 3 × Lipizzaner . This is mainly dueto the combination of both communication and the use of selection/replacementin
Lipizzaner provokes the sub-populations to converge to similar accurateindividuals, limiting diversity. nalyzing the Components of Distributed Coevolutionary GAN Training 11
The genome diversity in Lipizzaner increases with the size of the grid (see in Fig. 5 how the heatmaps get darker as the grid gets larger). SPaGAN also increases the L distances with the grid size, but in a lower proportion; e.g., from 3 × 3 to 5 × 5, Lipizzaner increases by 123.75%. Hence, Lipizzaner can improve its results with an increasing grid size (and diversity).
RQ2: What is the effect on the diversity of the network parameters when training with communication or isolation and the presence or absence of selection pressure? The combination of communication and selection pressure permits Lipizzaner to converge to similar high-quality generators. Complete isolation and non-population-based GAN training (PaGAN) converge to highly different generators, which is expected. SPaGAN illustrates how the absence of selection pressure in the populations keeps the individuals in the grid diverse, even though there is communication. In the 3 × 3 grid, IsoCoGAN and SPaGAN have similar diversity but highly different quality. This shows that maintaining diversity is not enough to ensure robust GAN training.
5.4 Computational Cost

Now, we analyze the computational efficiency of the GAN training methods, taking into account that they use the same number of training epochs but different components. All these methods apply asynchronous parallelism, and the time required by each cell of the grid to perform the same number of training epochs varies. Thus, we report the computational time of a run as the time required by each independent run to finish and return the best ensemble of generators found by all the cells.

Table 5: Computation time in minutes for each grid size and method, reporting Mean ± Std, Median, IQR, Min, and Max.
PaGAN includes the whole process, i.e., the timeof training the GANs plus the time of optimizing the ensemble weights by usingthe ES-(1+1). This method requires the shortest run time (see Table 5). This isbecause
PaGAN trains a single network in each cell and there is no communica-tion among the different cells. According to the Mann-Whitney U and posthocstatistical tests,
Lipizzaner requires the longest computation times. It employsboth communication and selection/replacement algorithm components.
When comparing Lipizzaner, SPaGAN, and IsoCoGAN, the impact on the computation time of applying communication in Lipizzaner and SPaGAN is higher than that of applying selection/replacement in Lipizzaner and IsoCoGAN (see Table 5). This computation cost may grow further in problems that require the use, and exchange among cells, of bigger models (networks) for the generators and discriminators. Therefore, answering RQ3: What is the impact on the computational cost of applying migration and selection/replacement?, the effect of applying selection/replacement is negligible compared with the impact of communicating among the cells when running the methods on the evaluated grid sizes.

6 Conclusions and Future Work

We have empirically shown that the spatially distributed coevolutionary training applied by
Lipizzaner is the best choice among the options with/without the communication and selection/replacement components to train GANs. The combination of selection pressure, which promotes convergence in the sub-populations, and communication with the overlapping neighborhoods maintains enough diversity to robustly train the networks in the sub-populations. Moreover, the use of these two operations does not entail a very significant increase in the computation time (about four minutes on the largest grid size).
SPaGAN illustrates the importance of the communication among the cells (i.e., fostering diversity). It is able to converge to high-quality solutions (generators) although it does not apply selection. This emphasizes the value of exchanging the best individuals even if they are only used to train the center of the cells.
IsoCoGAN provides the least competitive results. Training the networks coevolutionarily does not ensure convergence, even when it is done in sub-populations and uses selection/replacement.

Future work will include further analysis of coevolutionary GAN training on other benchmarks (i.e., problems and datasets). We will evaluate the scalability of this type of training by using larger grids. We will study the effect of training the GANs with different kinds of loss functions. We will assess the impact on robustness of using different types of neighborhoods. Finally, we will analyze the evolution of the network weights through the generations to better understand the dynamics of this type of GAN training.
Acknowledgments
This research was partially funded by the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 799078, by the Junta de Andalucía UMA18-FEDERJA-003, the European Union H2020-ICT-2019-3, and the Systems that Learn Initiative at MIT CSAIL.
References
1. Al-Dujaili, A., Schmiedlechner, T., Hemberg, E., O'Reilly, U.M.: Towards distributed coevolutionary GANs. In: AAAI 2018 Fall Symposium (2018)
2. Alba, E., Dorronsoro, B.: Cellular Genetic Algorithms, vol. 42. Springer Science & Business Media (2009)
3. Alba, E., Luque, G., Nesmachnow, S.: Parallel metaheuristics: recent advances and new trends. International Transactions in Operational Research (1), 1–48 (2013)
4. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
5. Arora, S., Ge, R., Liang, Y., Ma, T., Zhang, Y.: Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017)
6. Chavdarova, T., Fleuret, F.: SGAN: An alternative training of generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9407–9415 (2018)
7. Chintala, S., Denton, E., Arjovsky, M., Mathieu, M.: How to train a GAN? Tips and tricks to make GANs work. https://github.com/soumith/ganhacks (2016)
8. Essaid, M., Idoumghar, L., Lepagnot, J., Brévilliers, M.: GPU parallelization strategies for metaheuristics: a survey. International Journal of Parallel, Emergent and Distributed Systems (5), 497–522 (2019)
9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
10. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. pp. 6626–6637 (2017)
11. Hodjat, B., Hemberg, E., Shahrzad, H., O'Reilly, U.M.: Maintenance of a long running distributed genetic programming system for solving problems requiring big data. In: Genetic Programming Theory and Practice XI, pp. 65–83. Springer (2014)
12. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
13. Li, C., Alvarez-Melis, D., Xu, K., Jegelka, S., Sra, S.: Distributional adversarial networks. arXiv preprint arXiv:1706.09549 (2017)
14. Li, J., Madry, A., Peebles, J., Schmidt, L.: Towards understanding the dynamics of generative adversarial networks. arXiv preprint arXiv:1706.09884 (2017)
15. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2794–2802 (2017)
16. Mordido, G., Yang, H., Meinel, C.: Dropout-GAN: Learning from a dynamic ensemble of discriminators. arXiv preprint arXiv:1807.11346 (2018)
17. Neyshabur, B., Bhojanapalli, S., Chakrabarti, A.: Stabilizing GAN training with multiple random projections. arXiv preprint arXiv:1705.07831 (2017)
18. Nguyen, T., Le, T., Vu, H., Phung, D.: Dual discriminator generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2670–2680 (2017)
19. Popovici, E., Bucci, A., Wiegand, R.P., De Jong, E.D.: Coevolutionary principles. In: Handbook of Natural Computing, pp. 987–1033. Springer (2012)
20. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
21. Salimans, T., Ho, J., Chen, X., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017)
22. Schmiedlechner, T., Yong, I.N.Z., Al-Dujaili, A., Hemberg, E., O'Reilly, U.M.: Lipizzaner: A system that scales robust generative adversarial network training. In: NeurIPS 2018 Workshop on Systems for ML and Open Source Software (2018)
23. Stanley, K.O., Clune, J.: Welcoming the era of deep neuroevolution. Uber Engineering Blog. https://eng.uber.com/deep-neuroevolution/ (December 2017)
24. Talbi, E.G.: Metaheuristics: From Design to Implementation, vol. 74. John Wiley & Sons (2009)
25. Tolstikhin, I.O., Gelly, S., Bousquet, O., Simon-Gabriel, C.J., Schölkopf, B.: AdaGAN: Boosting generative models. In: Advances in Neural Information Processing Systems. pp. 5430–5439 (2017)
26. Toutouh, J., Hemberg, E., O'Reilly, U.M.: Spatial evolutionary generative adversarial networks. In: Proceedings of the Genetic and Evolutionary Computation Conference. pp. 472–480. GECCO '19, ACM, New York, NY, USA (2019). https://doi.org/10.1145/3321707.3321860
27. Toutouh, J., Hemberg, E., O'Reilly, U.M.: Data dieting in GAN training. In: Iba, H., Noman, N. (eds.) Deep Neural Evolution: Deep Learning with Evolutionary Computation, pp. 379–400. Springer Singapore, Singapore (2020)
28. Toutouh, J., Hemberg, E., O'Reilly, U.M.: Re-purposing heterogeneous generative ensembles with evolutionary computation. In: Proceedings of the Genetic and Evolutionary Computation Conference. GECCO '20, ACM, New York, NY, USA (2020). https://doi.org/10.1145/3377930.3390229
29. Wang, C., Xu, C., Yao, X., Tao, D.: Evolutionary generative adversarial networks.IEEE Transactions on Evolutionary Computation23