Signal Propagation in a Gradient-Based and Evolutionary Learning System
Jamal Toutouh
Massachusetts Institute of Technology, CSAIL, USA / University of Malaga, LCC
[email protected], [email protected]
Una-May O’Reilly
Massachusetts Institute of Technology, USA
[email protected]
ABSTRACT
Generative adversarial networks (GANs) exhibit training pathologies that can lead to convergence-related degenerative behaviors, whereas spatially distributed, coevolutionary algorithms (CEAs) for GAN training, e.g. Lipizzaner, are empirically robust to them. The robustness arises from diversity that occurs by training populations of generators and discriminators in each cell of a toroidal grid. Communication, where signals in the form of the parameters of the best GAN in a cell propagate in four directions (North, South, West, and East), also plays a role by communicating adaptations that are both new and fit. We propose Lipi-Ring, a distributed CEA like Lipizzaner, except that it uses a different spatial topology, i.e. a ring. Our central question is whether the different directionality of signal propagation (effectively migration to one or more neighbors on each side of a cell) meets or exceeds the performance quality and training efficiency of Lipizzaner. Experimental analysis on different datasets (i.e., MNIST, CelebA, and COVID-19 chest X-ray images) shows that there are no significant differences between the performances of the generative models trained by both methods. However, Lipi-Ring significantly reduces the computational time (by 14.2% to 41.2%). Thus, Lipi-Ring offers an alternative to Lipizzaner when the computational cost of training matters.
CCS CONCEPTS
• Computing methodologies → Bio-inspired approaches; Neural networks; Search methodologies; Ensemble methods;
KEYWORDS
Generative adversarial networks, ensembles, genetic algorithms, diversity
ACM Reference Format:
Jamal Toutouh and Una-May O’Reilly. 2021. Signal Propagation in a Gradient-Based and Evolutionary Learning System. In
Proceedings of the Genetic and Evolutionary Computation Conference 2021 (GECCO '21). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Generative modeling aims to learn a function that describes a latent distribution of a dataset. In a popular paradigm, a generative adversarial network (GAN) combines two deep neural networks (DNN), a generator and a discriminator, that engage in adversarial learning to optimize their weights [11]. The generator is trained to produce fake samples (given an input from a random space) to fool the discriminator. The discriminator learns to discern the real samples from the ones produced by the generator. This training is formulated as a minmax optimization problem through the definitions of discriminator and generator loss, which converges when an optimal generator approximates the true distribution so well that the discriminator only provides a random label for any sample.
Early GAN training methods led to vanishing gradients [2] and mode collapse [5], among other pathologies. They arose from the inherent adversarial setup of the paradigm. Several methods have been proposed to improve GAN models and have produced strong results [3, 13, 18, 31]. However, GANs remain notoriously hard to train [12, 23].
Using forms of evolutionary computation (EC) for GAN training has led to promising approaches. Evolutionary algorithms (EAs) and coevolutionary algorithms (CEAs) have been applied to weight training and to spatial systems [1, 8, 24, 25, 28, 29]. Deep neuroevolution offers concurrent architecture and weight search [8]. Pareto approximations have also been proposed to define multi-objective GAN training [10]. This variety of approaches uses different ways to guide populations of networks towards convergence, while maintaining diversity and discarding problematic (weak) individuals. They have been empirically demonstrated to be comparable to and better than baseline GAN training methods.
In this work, we focus on spatially distributed, competitive CEAs (Comp-CEAs), such as
Lipizzaner [24]. In these methods, the members of two populations (generators and discriminators) are placed in the cells of a toroidal geometric space (i.e., each cell contains a generator-discriminator pair). Each cell has neighbors from which it copies their pairs of generator and discriminator. This creates sub-populations of GANs in each cell. Gradient-based training is done pairwise between the best pairing within a sub-population. Each training iteration (epoch), selection, mutation, and replacement are applied; then the best generator-discriminator pair is updated and the remainder of the sub-populations re-copied from the neighborhood. This update and refresh effectively propagate signals along the paths of neighbors that run across the grid. Thus, the neighborhood defines the directionality and space of signal propagation, a.k.a. migration [1]. Communicating adaptations that are both new and fit promotes diversity during this training process. This diversity has been shown to disrupt premature convergence in the form of an oscillation, or to move the search away from an undesired equilibrium, improving the robustness to the main GAN pathologies [24, 26].
In this work, we want to evaluate the impact of the spatial topology used by this kind of method, changing the two-dimensional toroidal grid used by
Lipizzaner into a ring topology. Thus, we propose Lipi-Ring. Lipi-Ring raises central questions about the impact of the new directionality of the signal propagation given a ring. How are performance quality, population diversity, and computational cost impacted? Thus, in this paper, we pose the following research questions:
RQ1: What is the effect on the trained generative model when changing the directionality of the signal propagation from four directions to two?
RQ2: When the signal is propagated in only two directions, what is the impact of performing migration to one or more neighbors?
In terms of population diversity, RQ3: How does diversity change over time in a ring topology? How does diversity compare between ring topologies with different neighborhood radii? How does diversity compare between ring topology and 2D grid methods, where both methods have the same sub-population size and neighborhood, but different signal directionality?
The main contributions of this paper are: i) Lipi-Ring, a new distributed Comp-CEA GAN training method based on a ring topology that demonstrates markedly decreased computational cost over a 2D topology, without negatively impacting training accuracy; ii) an open source software implementation of Lipi-Ring (source code: https://github.com/xxxxxxxxx); and iii) an evaluation of different variations of Lipi-Ring, comparing them to Lipizzaner, on a set of benchmarks based on the MNIST, CelebA, and COVID-19 chest X-ray image datasets.
The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the Lipi-Ring method. The experimental setup and the results are in sections 4 and 5. Finally, conclusions are drawn and future work is outlined in Section 6.
2 RELATED WORK
This section introduces the main concepts in GAN training and summarizes relevant studies related to this research.
GANs train two DNNs, a generator (G_g) and a discriminator (D_d), in an adversarial setup. Here, G_g and D_d are functions parametrized by g and d, where g ∈ 𝒢 and d ∈ 𝒟, with 𝒢, 𝒟 ⊆ ℝ^p representing the respective parameter spaces of both functions.
Let G* be the target unknown distribution to which we would like to fit our generative model [4]. The generator G_g receives a variable from a latent space z ∼ P_z(z) and creates a sample from data space x = G_g(z). The discriminator D_d assigns a probability p = D_d(x) ∈ [0, 1] that represents the likelihood that x belongs to the real training dataset, i.e., G*, by applying a measuring function φ: [0, 1] → ℝ. P_z(z) is a prior on z (a uniform [−1, 1] distribution is typically chosen). The goal of GAN training is to find d and g parameters that optimize the objective function ℒ(g, d):

\[
\min_{g \in \mathcal{G}} \max_{d \in \mathcal{D}} \mathcal{L}(g,d), \quad \text{where} \quad
\mathcal{L}(g,d) = \mathbb{E}_{x \sim P_{data}(x)}[\phi(D_d(x))] + \mathbb{E}_{x \sim G_g(z)}[\phi(1 - D_d(x))] \tag{1}
\]
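As an illustration, the following is a minimal sketch of one adversarial update for this objective, assuming φ = log (the original GAN formulation [11]) and the common non-saturating generator loss. The layer sizes loosely follow the MNIST MLP column of Table 2, but the snippet is illustrative rather than the paper's implementation.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=0.0002)
opt_d = torch.optim.Adam(D.parameters(), lr=0.0002)
bce = nn.BCELoss()  # binary cross-entropy corresponds to phi = log in Eq. (1)

real = torch.rand(100, 784)      # stand-in for a batch of real samples
z = torch.rand(100, 64) * 2 - 1  # uniform [-1, 1] prior on z

# Discriminator step: ascend L(g, d) by labeling real samples 1 and fake samples 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(100, 1)) + bce(D(fake), torch.zeros(100, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: the non-saturating variant pushes D(G(z)) toward 1.
loss_g = bce(D(G(z)), torch.ones(100, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```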
In practice, this is accomplished via a gradient-based learning process whereupon D_d learns a binary classifier that is the best possible discriminator between real and fake data. Simultaneously, it encourages G_g to approximate the latent data distribution. In general, both networks are trained by applying back-propagation.
Mode collapse and vanishing gradients are the most frequent GAN training pathologies [2, 5], leading to inconsistent results. Prior studies tried to mitigate degenerate GAN dynamics with new generator or discriminator objectives (loss functions) [3, 18, 19, 31] and by applying heuristics [14, 22].
Others have integrated EC into GAN training. Evolutionary GAN (E-GAN) evolves a population of generators [29]. The mutation selects among three optimization objectives (loss functions) to update the weights of the generators, which are adversarially trained against a single discriminator. Multi-objective E-GAN (MO-EGAN) has been defined by reformulating E-GAN training as a multi-objective optimization problem, using Pareto dominance to select the best solutions in terms of diversity and quality [6]. Two genetic algorithms (GAs) have been applied to learn mixtures of heterogeneous pre-trained generators to specifically deal with mode collapse [28]. Finally, in [10], a GA evolves a population of GANs (represented by the architectures of the generator and the discriminator and the training hyperparameters). The variation operators exchange the networks between the individuals and evolve the architectures and the hyperparameters. The fitness is computed after training the GAN encoded by the genotype.
Another line of research uses CEAs to train a population of generators against a population of discriminators. Coevolutionary GAN (COEGAN) combines neuroevolution with CEAs [8]. Neuroevolution is used to evolve the main networks' parameters. COEGAN applies an all-vs-best
Comp-CEA (with the k best individuals) for the competitions, to mitigate the computational cost of all-vs-all.
CEAs show pathologies similar to the ones reported in GAN training, such as focusing and loss of gradient, which have been attributed to a lack of diversity [21]. Spatially distributed populations have been demonstrated to be particularly effective at maintaining diversity, while reducing the computational cost from quadratic to linear form [30].
Lipizzaner locates the individuals of a population of GANs (pairs of generators and discriminators) in a 2D toroidal grid. A neighborhood is defined by the cell itself and its adjacent cells according to the Von Neumann neighborhood. Coevolution proceeds at each cell with sub-populations drawn from the neighborhood. Gradient-based learning is used to update the weights of the networks, while evolutionary selection and variation are used for hyperparameter learning [1, 24]. After each training iteration, the (weights of the) best generator and discriminator are kept while the other sub-population members are refreshed by new copies from the neighborhood. A cell's update of its GAN is effectively propagated to the adjacent cells in four directions (i.e., North, South, East, and West) once the neighbors of the cell refresh their sub-populations from neighborhood copies. Thus, each cell's sub-populations are updated with new fit individuals, moving them closer towards convergence, while fostering diversity. Another approach, Mustangs, combines
Lipizzaner and E-GAN [25].
Its mutation operator randomly selects among a set of loss functions, instead of always applying the same one, in order to increase variability. Finally, taking advantage of the spatial grid of Lipizzaner, a data dieting approach has been proposed [27]. The main idea is to train each cell with different subsets of data, to foster diversity among the cells and to reduce the training resource requirements.
In this study, we propose Lipi-Ring, a spatially distributed GAN training method that uses a ring topology instead of a 2D grid. We contrast Lipi-Ring with Lipizzaner in the next section.
3 THE LIPI-RING METHOD
This section describes the Lipi-Ring CEA GAN training method, which applies the same principles (definitions and methods) as Lipizzaner [24]. We introduce both the 2D grid and ring topologies applied by Lipizzaner and Lipi-Ring, respectively. We summarize the spatially distributed GAN training method, and we present the main distinctive features between Lipizzaner and Lipi-Ring.
Lipi-Ring and
Lipizzaner use a population of generators g = {g_1, ..., g_Z} and a population of discriminators d = {d_1, ..., d_Z}, which are trained against each other (where Z is the size of the population). A generator-discriminator pair named the center is placed in each cell, which belongs to a ring in the case of Lipi-Ring or to a 2D toroidal grid in the case of Lipizzaner. According to the topology's neighborhood, sub-populations of networks (generators and discriminators) of size s are formed. For the k-th neighborhood, we refer to the center generator as g_{k,1} ⊂ g, the set of generators in the rest of the neighborhood as g_{k,2}, ..., g_{k,s}, and the generators in this k-th sub-population as g_k = ∪_{i=1}^{s} g_{k,i} ⊆ g. The same notation is used for the discriminators d.
Lipizzaner uses Von Neumann neighborhoods with radius 1, which include the cell itself and the ones in the adjacent cells to the North, South, East, and West [24], i.e. s=5 (see Figure 1.a). This defines the migration policy (i.e., the directionality of signal propagation) through the cells in four directions. Figure 1.a shows an example of a 2D toroidal grid with Z=16 (4 × 4). The updates in the center of cell (1,1) will be propagated to the (0,1), (2,1), (1,0), and (1,2) cells.
In Lipi-Ring, the cells are distributed in a one-dimensional grid of size 1 × Z and neighbors are sideways, e.g. left and/or right, i.e. an index position or more away. The best GAN (center) after an evolutionary epoch at a cell is updated. Neighboring cells retrieve this update when they refresh their sub-population membership at the end of their training epochs, effectively forming two non-ending pathways around the ring, carrying signals in two directions. Figures 1.b and 1.c show populations of six individuals (Z=6) organized in a ring topology with neighborhood radius one (r=1) and two (r=2), respectively. The shaded areas illustrate the overlapping neighborhoods of cells (0) and (4), with dotted and solid outlines, respectively. The updates in the center of (0) will be propagated to cells (5) and (1) for r=1 (Figure 1.b), and to cells (4), (5), (1), and (2) for r=2 (Figure 1.c).
Figure 1: Overlapping neighborhoods and sub-population definitions according to topologies and neighborhood radius. a) Toroidal grid (neighborhood radius 1) used by Lipizzaner; b) ring topology (neighborhood radius 1); c) ring topology (neighborhood radius 2).
Thus, the topologies of Lipizzaner and Lipi-Ring influence communication (of the parameters of the best cell each epoch) and this, in turn, affects how the populations converge.
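To make the two neighborhood definitions concrete, here is a minimal sketch with hypothetical helper names (the actual API of the Lipizzaner framework may differ):

```python
# Sub-population membership: the cell itself plus its neighbors.

def ring_neighborhood(cell, Z, r=1):
    """Cell indices in a ring of Z cells within radius r (sub-population size 2r+1)."""
    return sorted({(cell + offset) % Z for offset in range(-r, r + 1)})

def grid_neighborhood(row, col, rows, cols):
    """Von Neumann neighborhood (radius 1) on a toroidal grid: s = 5."""
    return [(row, col),
            ((row - 1) % rows, col), ((row + 1) % rows, col),   # North, South
            (row, (col - 1) % cols), (row, (col + 1) % cols)]   # West, East

print(ring_neighborhood(0, 6, r=1))   # [0, 1, 5]       -> matches Figure 1.b
print(ring_neighborhood(0, 6, r=2))   # [0, 1, 2, 4, 5] -> matches Figure 1.c
print(grid_neighborhood(1, 1, 4, 4))  # [(1,1), (0,1), (2,1), (1,0), (1,2)]
```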
Algorithm 1 summarizes the main steps applied at each cell: first, migration, in which the cells gather the GANs from their neighbors to build the sub-population (n); and second, train and evolve, when each cell updates the center by applying the coevolutionary GANs training (see Algorithm 2). These two phases iterate T (generations or training epochs) times. After that, each cell learns an ensemble of generators by using an evolutionary strategy, (1+1)-ES [17, Algorithm 2.1], to compute the mixture weights ω that optimize the accuracy of the generative model returned, (n, ω)* [24].
For each training generation, the cells apply the CEA in Algorithm 2. First, each cell selects the best pair of generator and discriminator (called g_b and d_b) according to a tournament selection of size τ. It applies an all-vs-all strategy to evaluate all the GAN pairs in the sub-population on a randomly chosen batch of data Br (Lines 1 to 4). Then, for each batch of data in the training dataset (Lines 5 to 10), the learning rate n_δ is updated by applying Gaussian mutation [24], and the offspring are created by training g_b and d_b against a randomly chosen discriminator and generator from the sub-population (i.e., applying gradient-based mutations). Thus, the sub-populations are updated with the new individuals. Finally, a replacement procedure is applied to remove the weakest individuals from the sub-populations, and the center is updated with the individuals with the best fitness (Lines 11 to 15). The fitness L_{g,d} of a given generator (discriminator) is evaluated according to the binary cross-entropy loss, where the model's objective is to minimize the Jensen-Shannon divergence between the real and fake data [11].
Algorithm 1: Lipizzaner main steps.
Input: T: total generations, E: grid cells, s: neighborhood size, θ_D: training dataset, θ_COEV: parameters for CoevGANsTraining, θ_EA: parameters for MixtureEA
Return: n: neighborhood, ω: mixture weights
parfor c ∈ E do                          ▷ Asynchronous parallel execution of all cells in grid
    n, ω ← initializeCells(c, k, θ_D)    ▷ Initialization of cells
    for generation ∈ [1, ..., T] do      ▷ Iterate over generations
        n ← copyNeighbours(c, k)         ▷ Collect neighbor cells
        n ← CoevGANsTraining(n, θ_D, θ_COEV)  ▷ Coevolve GANs
        ω ← MixtureEA(ω, n, θ_EA)        ▷ Build optimal ensemble
    end for
end parfor
return (n, ω)*                           ▷ Cell with best generator mixture
Algorithm 2: CoevGANsTraining.
Input: n: cell neighborhood sub-population, θ_D: training dataset, τ: tournament size, β: mutation probability
Return: n: cell neighborhood sub-population, trained
Br ← getRandomBatch(θ_D)             ▷ Random batch to evaluate GAN pairs
for g, d ∈ g × d do                  ▷ Evaluate all GAN pairs
    L_{g,d} ← evaluate(g, d, Br)
end for
g_b, d_b ← select(n, τ)              ▷ Tournament selection
for B ∈ θ_D do                       ▷ Loop over the batches in θ_D
    n_δ ← mutateLearningRate(n_δ, β) ▷ Update own learning rate
    d ← getRandomOpponent(d)         ▷ Get uniform random discriminator
    g_b ← updateNN(g_b, d, B)        ▷ Update g_b with gradient
    g ← getRandomOpponent(g)         ▷ Get uniform random generator
    d_b ← updateNN(d_b, g, B)        ▷ Update d_b with gradient
end for
g, d ← updatePopulations(g, d, g_b, d_b)  ▷ Add g_b and d_b
for g, d ∈ g × d do                  ▷ Evaluate all updated GAN pairs
    L_{g,d} ← evaluate(g, d, Br)
end for
n ← replace(n, g, d)                 ▷ Replace the networks with the worst loss
n ← setCenter(n)                     ▷ Best generator and discriminator are placed in the center
return n

Here, we have summarized the spatially distributed method applied. More detailed definitions of this spatially distributed GAN training method can be found in [1, 24].
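As an illustration of the selection and mutation steps in Algorithm 2, the following is a minimal sketch under assumed data structures; tournament_select and mutate_learning_rate are hypothetical stand-ins for the framework's select and mutateLearningRate, with β and the mutation rate taken from Table 2.

```python
import random

def tournament_select(individuals, loss, tau=2):
    """Return the individual with the lowest loss among tau random contenders."""
    contenders = random.sample(individuals, tau)
    return min(contenders, key=loss)

def mutate_learning_rate(lr, beta=0.5, sigma=0.0001):
    """With probability beta, apply a Gaussian perturbation to the learning rate."""
    if random.random() < beta:
        lr = max(lr + random.gauss(0.0, sigma), 1e-8)  # keep the rate positive
    return lr

# Usage, with losses[g] holding g's loss on the random batch Br:
# g_b = tournament_select(generators, loss=lambda g: losses[g], tau=2)
# lr  = mutate_learning_rate(lr, beta=0.5, sigma=0.0001)
```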
Lipi-Ring vs. Lipizzaner
The population structured in a ring with r=2 provides signal propagation similar to the Lipizzaner toroidal grid, because both topologies and migration models allow the center individual to reach four cells (two located to the West and two to the East in the case of this ring); see figures 1.a and 1.c. Thus, they provide the same propagation speed and, as their sub-populations are of size s=5, the same selection pressure. The signal propagation of the ring with r=1 is slower because, after a given training step, the center only reaches two cells (the one to the West and the one to the East); see Figure 1.b. In this case, the sub-populations have three individuals, which reduces the diversity in the sub-population and accelerates convergence (it has 40% fewer individuals than Lipizzaner sub-populations). Thus, Lipi-Ring with r=1 reduces the propagation speed while accelerating the population's convergence, as the sketch below illustrates.
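A small sketch makes the propagation-speed argument concrete: treating each epoch's refresh as one breadth-first step, we can count how many cells a single update reaches under each topology. This is an idealized model that assumes every cell refreshes from all of its neighbors exactly once per epoch.

```python
def reached_after(epochs, neighbors_of, start=0):
    """Cells holding a signal after the given number of refresh steps."""
    reached = {start}
    for _ in range(epochs):
        reached |= {n for cell in reached for n in neighbors_of(cell)}
    return reached

Z = 16
ring1 = lambda c: {(c - 1) % Z, (c + 1) % Z}                  # ring, r=1
ring2 = lambda c: {(c + o) % Z for o in (-2, -1, 1, 2)}       # ring, r=2
grid = lambda c: {((c // 4 + dr) % 4) * 4 + (c % 4 + dc) % 4  # 4x4 toroidal grid
                  for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))}

for name, nb in (("ring r=1", ring1), ("ring r=2", ring2), ("4x4 grid", grid)):
    print(name, [len(reached_after(t, nb)) for t in range(4)])
# ring r=1 gains 2 cells per epoch; ring r=2 and the grid both gain 4 in the
# first epoch, which is why their propagation speeds are considered similar.
```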
Lipi-Ring with r=1 has two main advantages over Lipizzaner: a) it mitigates the overhead, because communication is carried out with only two cells (instead of four) and the sub-populations are smaller, which reduces the number of operations for fitness evaluation and selection/replacement; and b) like all Lipi-Ring variants with any radius, it does not require a rectangular grid of cells; Z may be any natural number.

4 EXPERIMENTAL SETUP
Given the infeasibility of analyzing the change in selection and other algorithm elements, we proceed empirically. We evaluate the different distributed CEA GAN training methods on three image datasets: the well-known MNIST [9] and CelebA [16], and a dataset of chest X-ray images of patients with COVID-19 [7].
In our experiments, we evaluate the following algorithms:
Lipizzaner; two variations of Lipi-Ring, both with r=1: Ring(1), which performs the same number of training epochs as Lipizzaner, and Ring_t(1), which runs for the same computational cost (wall-clock time) as Lipizzaner; and Ring(2), which is Lipi-Ring with r=2 performing the same number of iterations as Lipizzaner. These represent a variety of topologies and migration models. Ring_t(1) is analyzed to make a proper comparison between Ring(1) (r=1) and Lipizzaner taking into account the computational cost (time). Table 1 summarizes the main characteristics of the Lipi-Ring variations studied.
The parameters are set according to the authors of Lipizzaner [24, 25]. Thus, all these CEAs apply a tournament selection of size two. The main settings used for the experiments are summarized in Table 2.
Table 1: Lipi-Ring variations analyzed.

Method    | r | s | Variation
Ring(1)   | 1 | 3 | Same training epochs as Lipizzaner
Ring_t(1) | 1 | 3 | Same wall-clock time as Lipizzaner
Ring(2)   | 2 | 5 | Same training epochs as Lipizzaner
Table 2: Main GAN training parameters.

Parameter                  | MNIST  | CelebA           | COVID-19
Network topology
  Network type             | MLP    | DCGAN            | DCGAN
  Input neurons            | 64     | 100              | 100
  Number of hidden layers  | 2      | 4                | 4
  Neurons per hidden layer | 256    | 16,384 - 131,072 | 32,768 - 524,288
  Output neurons           | 784    | 64 × 64          | 128 × 128
  Output activation        | tanh   | tanh             | tanh
Training settings
  Batch size               | 100    | 128              | 69
  Skip N disc. steps       | 1      | -                | -
Learning rate mutation
  Optimizer                | Adam   | Adam             | Adam
  Initial learning rate    | 0.0002 | 0.00005          | 0.0002
  Mutation rate            | 0.0001 | 0.0001           | 0.0001
  Mutation probability     | 0.5    | 0.5              | 0.5
For the MNIST experiments, the generators and the discriminators are multilayer perceptrons (MLP). The stop condition of each method is defined as follows: a) Ring(1), Ring(2), and Lipizzaner perform 200 training epochs, to evaluate the impact of the topology on the performance and the computational time required; and b) Ring_t(1) stops after running for the same time as Lipizzaner, to compare the methods with the same computational cost (time). The population sizes are 9, 16, and 25, which means Lipizzaner uses grids of sizes 3 × 3, 4 × 4, and 5 × 5. Besides, we study the impact of the size of the populations on Ring(1) by training rings of sizes between 2 and 9.
For the CelebA and COVID-19 experiments, deep convolutional GANs (DCGAN) are trained. DCGANs have many more parameters than the MLPs (see Table 2). Here, we compare Ring(1) and Lipizzaner. They stop after performing 20 training epochs for CelebA and 1,000 for COVID-19 (because the COVID-19 training dataset has far fewer samples). This will discern the differences in terms of performance and computational cost with more complex networks. In these cases, the population size is 9.
The experimental analysis is performed on a cloud computation platform that provides 16 Intel Xeon cores at 2.8GHz with 64 GB RAM and an NVIDIA Tesla P100 GPU with 16 GB RAM. We run multiple independent runs for each method.
We have implemented all variations of Lipi-Ring by extending the Lipizzaner framework [24] using Python and PyTorch [20].
5 RESULTS AND DISCUSSION
This section presents the results and the analyses of the presented GAN training methods. The first subsections evaluate them on the MNIST dataset. They are measured in terms of: the FID score, the diversity of the generated samples by evaluating the total variation distance (TVD) [15], and the diversity in the genome space (network parameters). Then, we analyze the CelebA and COVID-19 results with measurements of the Inception Score (IS) and computational time. The next subsection presents the results of incrementally increasing the ring size of Ring(1). Finally, we compare the computational times needed by Ring(1) and Lipizzaner.
Table 3 shows the best FID value from each of the 30 independent runs performed for each method. All the evaluated methods improve their performance when increasing the population size, while maintaining the same budget. This could be explained by diversity increasing during training as the populations get bigger.
Ring_t(1) has the lowest (best) median and mean FID results. In turn, it returned the generative model that provided the best quality samples (minimum FID score) for all the evaluated population sizes. The second best results are provided by Ring(1). This indicates that the methods with smaller sub-populations converge faster, even though the ring migration model (r=1) slows down the propagation of the best individuals.
Comparing Ring(2) and Lipizzaner, they provide close results. Though they use different topologies, their signal propagation and selection operate equally.
As the results do not follow a normal distribution, we rank the studied methods using the Friedman rank statistical test and we
Table 3: FID MNIST results in terms of best mean, standard deviation, median, interquartile range (Iqr), minimum, and maximum. Best values in bold. (Low FID indicates better samples.)

Method       | Mean ± Std | Median | Iqr | Min | Max
Population size: 9
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –
Population size: 16
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –
Population size: 25
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –

apply the Holm correction post-hoc analysis to assess the statistical significance. For all the population sizes, the Friedman test ranks Ring_t(1), Ring(1), Lipizzaner, and
Ring(2) as first, second, third, and fourth, respectively. However, the significance (p-value) varies from p-value=5.67×10− for population size 9 to p-value=1.13×10− (i.e., p-value ≥ 0.05). Ring(1) and Ring_t(1) are statistically better than Ring(2) and Lipizzaner (which have no statistical differences between each other) for population size 9. For the other population sizes, Ring_t(1) provides statistically better results than Ring(2) and Lipizzaner, and there is no significant difference among Ring(1), Ring(2), and Lipizzaner.
Table 3 shows that Ring_t(1) is better than Ring(2) and Lipizzaner (lower FID is better). However, the difference between their FIDs decreases when the population size increases. This indicates that migration modes that allow faster propagation take better advantage of bigger populations.
Next, we evaluate the FID score throughout the GAN training process. Figure 2 illustrates the changes of the median FID during the training process. Ring_t(1) is not included because it operates the same as Ring(1).
According to Figure 2, none of the evolutionary GAN training methods seems to have converged. Explaining this is left to future work. The FID score almost behaves like a monotonically decreasing function with oscillations. The methods with larger sub-populations, i.e., Ring(2) and Lipizzaner, show smaller oscillations, which implies more robustness (less variance).
Focusing on the methods with a ring topology, we clearly see the faster convergence when r=1. Ring(1) provides smaller FID values than Ring(2) most of the time. The reduced sub-population (s=3) favors the best individual of the sub-population being selected during the tournament (it increases the selection pressure).
Figure 2 shows that Ring(1), Ring(2), and Lipizzaner converge to similar values. This is in accordance with the results in Table 3, which indicate these three methods provide comparable FID scores.
Finally, Figure 3 illustrates some samples synthesized by generators trained using populations of size 16. As can be seen, the four sets of samples show comparable quality.
Figure 2: Median FID evolution. a) Population size: 9; b) population size: 16; c) population size: 25.

Figure 3: MNIST samples synthesized by generators trained using population size 16. a) Ring(1); b) Ring_t(1); c) Ring(2); d) Lipizzaner.
This section evaluates the diversity of the samples generated by the best generative model obtained in each run. Table 4 reports the TVD for each method and population size. Recall that we prefer low TVDs (high diversity) as an indicator of the quality of the generative model; a minimal sketch of the measure follows.
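The sketch below assumes TVD is computed between the class distribution of generated digits (e.g., obtained from a pre-trained classifier) and the uniform distribution over MNIST's ten classes; see [15] for the precise definition used in the paper.

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * np.abs(p - q).sum()

# Hypothetical class frequencies of generated digits vs. a balanced target.
generated = np.array([0.14, 0.09, 0.11, 0.10, 0.08, 0.10, 0.12, 0.09, 0.08, 0.09])
target = np.full(10, 0.1)      # uniform coverage of the ten digit classes
print(tvd(generated, target))  # 0.07; a value of 0.0 would mean perfect coverage
```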
Table 4: TVD MNIST results in terms of best mean, standard deviation, Iqr, minimum, and maximum. Best values in bold. (Low TVD indicates more diverse samples.)

Method       | Mean ± Std | Median | Iqr | Min | Max
Population size: 9
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –
Population size: 16
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –
Population size: 25
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –

The results in Table 4 demonstrate that, as the population size increases, the resulting trained generative models are able to provide more diverse samples; that is, they have better coverage of the latent distribution. Thus, again, all the methods take advantage of bigger populations.
For population size 9, the mean and median TVD of
Ring(1) and Ring_t(1) are the lowest. According to the Friedman ranking, Ring_t(1), Ring(1), Lipizzaner, and Ring(2) are first, second, third, and fourth, respectively (p-value=0.0004). The Holm post-hoc correction confirms that there are no significant differences between Ring(1) and Ring_t(1), and that they are statistically more competitive than Ring(2) and Lipizzaner (statistically significant p-values). For the other population sizes, there are no significant differences among Ring(1), Ring(2), and Lipizzaner, and Ring_t(1) is statistically better than Ring(2) and Lipizzaner (statistically significant p-values).
These results allow us to answer RQ1:
What is the effect on the trained generative model when changing the directionality of the signal propagation from four directions to two? Answer: The impact of performing migration to one or more neighbors is higher than that of the directionality itself. If we isolate the directionality, i.e., Lipi-Ring vs. Lipizzaner, the main differences are revealed when r=1. Thus, Ring_t(1) generators create statistically better samples (FID and TVD results) than Lipizzaner. However, when r=2, there is no significant difference between Ring(2) and Lipizzaner for all the evaluated population sizes. So, we observe that directionality itself is irrelevant.
We can also answer
RQ2: When the signal is propagated in only two directions, what is the impact of performing migration to one or more neighbors? Answer: When comparing Lipi-Ring with r=1 and with r=2 using the same number of training epochs (i.e., Ring(1) and Ring(2)), they perform the same, although Ring(1) converges faster. The smaller sub-population of Ring(1) likely increases the selection pressure and the convergence speed of the sub-populations, despite the slower signal propagation.
This section analyzes the diversity in the genome space (i.e., the distance between the weights of the evolved networks). We evaluate the populations of sizes 9 and 16 to see how the migration model and the sub-population size affect their diversity. The L2 distance between the neural network parameters in a given population is used to measure diversity, and Table 5 summarizes the results. Figure 4 presents the L2 distances for the populations that represent the median L2.

Figure 4: Diversity in genome space. Heatmaps of the L2 distance between the generators for population sizes 9 and 16 on MNIST at the end of training, for Ring(1), Ring_t(1), Ring(2), and Lipizzaner. X and Y axes show the cell id. Dark indicates high distance.

Table 5: L2 distance values between the weights of the networks in different cells on the MNIST dataset. Higher values indicate more network diversity. Highest value in bold.

Method       | Mean ± Std | Median | Iqr | Min | Max
Population size: 9
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –
Population size: 16
  Ring(1)    | – ± –      | –      | –   | –   | –
  Ring_t(1)  | – ± –      | –      | –   | –   | –
  Ring(2)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –

Focusing on the
Lipi-Ring methods with r=1, the population diversity diminishes with more training; i.e., the L2 distances between the networks in Ring(1) are higher than in Ring_t(1) for both population sizes (see Table 5 and Figure 4). This shows that, as the ring runs longer, the genotypes start to converge.
Considering the Lipi-Ring methods with r=1 and r=2 that performed the same number of training epochs, i.e., Ring(1) and Ring(2), the first one shows higher L2 distances (darker colors in Figure 4). This confirms that the populations are more diverse when signal propagation is slower.
Comparing Ring(2) and Lipizzaner, which have the same sub-population size and migration to four neighbors, there is no clear trend, because Lipizzaner generated more diverse populations for population size 9, and Ring(2) for population size 16. Therefore, we have not found a clear effect of changing the migration from four to two directions.
According to these results, we can answer the questions formulated in
RQ3. How does diversity change over time in a ring topology? When Ring(1) performs more training, the diversity decreases. How does diversity compare between ring topologies with different neighborhood radii? As the radius increases, the diversity is lower, because the propagation of the best individuals is faster. How does diversity compare between the ring topology and the 2D grid, where both methods have the same sub-population size and neighborhood, but different signal directionality? We have not found any clear impact on the diversity when changing the directionality for the methods that have the same sub-population size and neighborhood.
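For reference, a minimal sketch of the genome-space measure used in this subsection, assuming the generators are PyTorch modules whose flattened weight vectors are compared pairwise (as visualized in the Figure 4 heatmaps):

```python
import torch

def flatten_params(model):
    """Concatenate all of a network's parameters into one vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def pairwise_l2(models):
    """Z x Z matrix of L2 distances between the models' weight vectors."""
    vecs = torch.stack([flatten_params(m) for m in models])
    return torch.cdist(vecs, vecs, p=2)

# dist = pairwise_l2(generators); dist[i, j] is the distance between cells i and j.
```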
According to the empirical results, in general, Ring(1) and Lipizzaner compute generative models of comparable quality for MNIST (training MLP networks). Here, we study both methods training generative models to synthesize samples of the CelebA and COVID-19 datasets using convolutional DNNs, which are bigger networks (having more parameters). Table 6 shows the best IS value from each of the 10 independent runs for each method with population size 9.
In this case, it is clearer that both methods demonstrate the same performance. Focusing on the CelebA dataset, Ring(1) computes higher (better) mean, minimum, and maximum IS, but Lipizzaner shows a higher median. For the COVID-19 dataset, Lipizzaner shows better mean, median, and maximum IS values. The statistical analysis (ANOVA test) corroborates that there are no significant differences between both methods on either dataset (i.e., CelebA p-value=3.049 and COVID-19 p-value=0.731).
Table 6: IS results in terms of best mean, standard deviation, median, Iqr, minimum, and maximum for the CelebA and COVID-19 experiments. Best values in bold. (High IS indicates better samples.)

Method       | Mean ± Std | Median | Iqr | Min | Max
Dataset: CelebA
  Ring(1)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –
Dataset: COVID-19
  Ring(1)    | – ± –      | –      | –   | –   | –
  Lipizzaner | – ± –      | –      | –   | –   | –

Finally, Figure 5 illustrates some samples of CelebA and COVID-19 synthesized by the generators trained using
Ring(1) and Lipizzaner. As can be seen, the samples show similar quality.
Figure 5: Samples of CelebA and COVID-19 synthesized by generators trained using Ring(1) and Lipizzaner. a) Ring(1), CelebA; b) Lipizzaner, CelebA; c) Ring(1), COVID-19; d) Lipizzaner, COVID-19.

Note that these results are in line with the answers given to
RQ1 and RQ2. The change of the directionality of the signal propagation and of the migration model allows the method to achieve the same results as Lipizzaner.

Lipi-Ring Scalability Analysis
Here, we evaluate the results of incrementally increasing the ring size of Ring(1) from 2 to 9 using the MNIST dataset. Figure 6 illustrates how increasing the population size of Ring(1) by only one individual improves the result (reduces the FID). In contrast, Lipizzaner, using an a × b 2D grid, would require adding (at least) a or b individuals.

Figure 6: FID MNIST results according to the population size.
We know that training with smaller sub-population sizes takes less time because it performs fewer operations. As the reader may be curious about the time savings of using Ring(1) instead of Lipizzaner, we compare their computational times for the experiments performed. Notice that they perform the same number of training epochs (see Section 4) and provide results of comparable quality. Table 7 summarizes the computational cost in wall-clock time for all the methods, population sizes, and datasets. All the analyzed methods have been executed on a cloud architecture, which could generate some discrepancies.
As expected, Ring(1) requires less time. Comparing both methods on MNIST, Lipizzaner needed 33.46%, 25.01%, and 14.20% longer times than Ring(1) for population sizes 9, 16, and 25, respectively. This indicates that the computational effort of using 40% bigger sub-populations (5 instead of 3 individuals) has a smaller relative effect on the running times as the population size increases.
Table 7: Computational time cost in terms of mean and standard deviation (minutes).

MNIST
Population size | 9     | 16    | 25
  Ring(1)       | – ± – | – ± – | – ± –
  Lipizzaner    | – ± – | – ± – | – ± –
CelebA
  Ring(1)       | – ± –
  Lipizzaner    | – ± –
COVID-19
  Ring(1)       | – ± –
  Lipizzaner    | – ± –
Ring(1) reduces themean computational time by 23.22% in CelebA experiments andby 41.18% in COVID-19. The time saving is higher for COVID-19mainly because the training methods performed more epochs (1,000)for this dataset than for CelebA (20 epochs). The sub-population sizeprincipally affects the required effort to perform the evaluation ofthe whole sub-population and the selection/replacement operationwhich are carried out for each training iteration. This explains whythe time saving are higher for COVID-19 experiments.
6 CONCLUSIONS AND FUTURE WORK
The empirical analysis of different spatially distributed CEA GAN training methods shows that the use of a ring topology instead of a 2D grid does not lead to a loss of quality in the computed generative models, and it may even improve them (depending on the setup). Ring_t(1), which uses a ring topology with neighborhood radius r=1 and runs for the same time as Lipizzaner, produced the best generative models. Ring(1), Ring(2), and Lipizzaner, which were trained for the same number of training epochs, trained comparable generative models on the MNIST, CelebA, and COVID-19 datasets (similar FID and TVD for MNIST, and IS for CelebA and COVID-19).
In terms of diversity, Ring(1) shows the most diverse populations, whose diversity diminishes with more training. Focusing on the ring topology, when the migration radius increases, i.e., Ring(1) (r=1) vs. Ring(2) (r=2), the diversity decreases. Finally, we have not found a marked difference in the diversity of the populations when changing the migration directionality, i.e., comparing Ring(2) and Lipizzaner.
Ring(1), changing the signal propagation from four to two directions and using a migration of radius one, reduced the computational time cost of Lipizzaner by between 14.2% and 41.2%, while keeping results of comparable quality.
Future work will include the evaluation of Ring(1) on more datasets, with bigger populations, and for longer training epochs. We will apply specific strategies to deal with small sub-populations (3 individuals per cell), to analyze the effect of reducing the high selection pressure. We will perform a convergence analysis to provide appropriate Lipizzaner and Ring(1) setups to address MNIST. Finally, we are exploring new techniques to evolve the network architectures during CEA training.
REFERENCES
[1] Abdullah Al-Dujaili, Tom Schmiedlechner, Erik Hemberg, and Una-May O'Reilly. 2018. Towards distributed coevolutionary GANs. In AAAI 2018 Fall Symposium.
[2] Martin Arjovsky and Léon Bottou. 2017. Towards Principled Methods for Training Generative Adversarial Networks. arXiv preprint arXiv:1701.04862 (2017).
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
[4] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. 2017. Generalization and Equilibrium in Generative Adversarial Nets (GANs). arXiv preprint arXiv:1703.00573 (2017).
[5] Sanjeev Arora, Andrej Risteski, and Yi Zhang. 2018. Do GANs learn the distribution? Some Theory and Empirics. In International Conference on Learning Representations. https://openreview.net/forum?id=BJehNfW0-
[6] Marco Baioletti, Carlos Artemio Coello Coello, Gabriele Di Bari, and Valentina Poggioni. 2020. Multi-objective evolutionary GAN. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion. 1824–1831.
[7] Joseph Paul Cohen, Paul Morrison, and Lan Dao. 2020. COVID-19 image data collection. arXiv 2003.11597 (2020). https://github.com/ieee8023/covid-chestxray-dataset
[8] Victor Costa, Nuno Lourenço, and Penousal Machado. 2019. Coevolution of generative adversarial networks. In International Conference on the Applications of Evolutionary Computation (Part of EvoStar). Springer, 473–487.
[9] L. Deng. 2012. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Processing Magazine 29, 6 (2012), 141–142. https://doi.org/10.1109/MSP.2012.2211477
[10] Unai Garciarena, Roberto Santana, and Alexander Mendiburu. 2018. Evolved GANs for generating Pareto set approximations. In Proceedings of the Genetic and Evolutionary Computation Conference. 434–441.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[12] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems. 5767–5777.
[13] Alexia Jolicoeur-Martineau. 2018. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734 (2018).
[14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
[15] Chengtao Li, David Alvarez-Melis, Keyulu Xu, Stefanie Jegelka, and Suvrit Sra. 2017. Distributional Adversarial Networks. arXiv preprint arXiv:1706.09549 (2017).
[16] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV).
[17] Ilya Loshchilov. 2013. Surrogate-assisted evolutionary algorithms. Ph.D. Dissertation. University Paris South Paris XI; National Institute for Research in Computer Science and Automation (INRIA).
[18] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2794–2802.
[19] Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. 2017. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems. 2670–2680.
[20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[21] Elena Popovici, Anthony Bucci, R Paul Wiegand, and Edwin D De Jong. 2012. Coevolutionary principles. In Handbook of Natural Computing. Springer, 987–1033.
[22] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434 (2015).
[23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems. 2234–2242.
[24] Tom Schmiedlechner, Ignavier Ng Zhi Yong, Abdullah Al-Dujaili, Erik Hemberg, and Una-May O'Reilly. 2018. Lipizzaner: A System That Scales Robust Generative Adversarial Network Training. In the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) Workshop on Systems for ML and Open Source Software.
[25] Jamal Toutouh, Erik Hemberg, and Una-May O'Reilly. 2019. Spatial Evolutionary Generative Adversarial Networks. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '19). ACM, New York, NY, USA, 472–480. https://doi.org/10.1145/3321707.3321860
[26] Jamal Toutouh, Erik Hemberg, and Una-May O'Reilly. 2020. Analyzing the Components of Distributed Coevolutionary GAN Training. In Parallel Problem Solving from Nature – PPSN XVI, Thomas Bäck, Mike Preuss, André Deutz, Hao Wang, Carola Doerr, Michael Emmerich, and Heike Trautmann (Eds.). Springer International Publishing, Cham, 552–566.
[27] Jamal Toutouh, Erik Hemberg, and Una-May O'Reilly. 2020. Data Dieting in GAN Training. Springer Singapore, Singapore, 379–400.
[28] Jamal Toutouh, Erik Hemberg, and Una-May O'Reilly. 2020. Re-Purposing Heterogeneous Generative Ensembles with Evolutionary Computation. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference (GECCO '20). Association for Computing Machinery, New York, NY, USA, 425–434.
[29] Chaoyue Wang, Chang Xu, Xin Yao, and Dacheng Tao. 2019. Evolutionary generative adversarial networks. IEEE Transactions on Evolutionary Computation 23, 6 (2019), 921–934.
[30] Nathan Williams and Melanie Mitchell. 2005. Investigating the Success of Spatial Coevolution. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation (GECCO '05). Association for Computing Machinery, New York, NY, USA, 523–530.
[31] Junbo Zhao, Michael Mathieu, and Yann LeCun. 2016. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016).