A Survey on Evolutionary Neural Architecture Search
Yuqiao Liu, Yanan Sun, Bing Xue, Mengjie Zhang, Gary G. Yen, Kay Chen Tan
Yuqiao Liu and Yanan Sun are with the College of Computer Science, Sichuan University, China. Bing Xue and Mengjie Zhang are with the School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand. Gary Yen is with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK, USA.
Abstract—Deep Neural Networks (DNNs) have achieved great success in many applications, such as image classification, natural language processing and speech recognition. The architectures of DNNs have been proved to play a crucial role in their performance. However, designing architectures for different tasks is a difficult and time-consuming process of trial and error. Neural Architecture Search (NAS), which has received great attention in recent years, can design architectures automatically. Among different kinds of NAS methods, Evolutionary Computation (EC) based NAS methods have recently gained much attention and success. Unfortunately, there is no comprehensive summary of the EC-based methods. This paper reviews 100+ papers of EC-based NAS methods in light of the common process. Four steps of the process are covered in this paper, including population initialization, population operators, evaluation and selection. Furthermore, current challenges and issues are also discussed to identify future research in this field.
Index Terms—Evolutionary Neural Architecture Search, Evolutionary Computation, deep learning, image classification.
I. INTRODUCTION

DEEP Neural Networks (DNNs), as the cornerstone of deep learning [1], have demonstrated great success in diverse real-world applications, such as image classification [2], [3], natural language processing [4] and speech recognition [5]. The promising performance of DNNs is widely attributed to their deep nature, which enables them to learn meaningful features directly from the raw data almost without any specialist intervention. Generally, the performance of DNNs depends on two aspects: their architectures and their weights; only when both reach an optimal status simultaneously can the performance of a DNN be satisfactory. The optimal weights are obtained through the learning process: a continuous loss function measures the difference between the real output and the desired output, and gradient-based algorithms are then used to minimize the loss. When the termination condition, which is commonly a maximal iteration number, is satisfied, the weights obtained at termination are used as the optimum. This kind of learning process has become very popular largely owing to its effectiveness in practice, and has become the dominant technique for weight optimization [6], although the gradient-based optimization algorithms are principally local search [7]. However, obtaining the architectures cannot be
directly formulated as a continuous function, and there is even no explicit function to measure the process of finding optimal architectures.

For a long time, promising architectures of DNNs have been manually designed with rich expertise. This can be seen from the state of the art, such as VGG [8], ResNet [2] and DenseNet [3], to name a few. These promising CNN models are all manually designed by researchers with rich knowledge of both neural networks and image processing. However, in practice, most end users have limited expertise in DNNs. Moreover, DNN architectures are often problem-dependent; if the distribution of the data changes, the architecture must be redesigned accordingly. Neural Architecture Search (NAS), which aims to automate the architecture design of deep neural networks, is recognized as an effective and efficient way to address the aforementioned limitations.

Mathematically, NAS can be modeled as the optimization problem formulated in Equation (1):

$$\begin{cases}\arg\min_{A}\ \mathcal{L}(A, D_{train}, D_{valid})\\ \text{s.t.}\ \ A \in \mathcal{A}\end{cases}\tag{1}$$

where $\mathcal{A}$ denotes the search space of the neural architectures, and $\mathcal{L}(\cdot)$ measures the performance of the architecture A on the validation dataset D_valid after being trained on the training dataset D_train. $\mathcal{L}(\cdot)$ is usually non-convex and non-differentiable [9]. In principle, NAS is a complex optimization problem involving several challenges, e.g., complex constraints, discrete representations, bi-level structures, computationally expensive characteristics and multiple conflicting objectives. NAS algorithms refer to the optimization algorithms which are specifically designed to effectively and efficiently solve the problem represented by Equation (1). The term "NAS" is generally attributed to the algorithm [10] proposed by Google, whose preprint version was first released on arXiv in 2016 and which was then accepted for publication by the International Conference on Learning Representations in 2017. Since then, a vast number of researchers have worked on proposing novel NAS algorithms.

Existing NAS algorithms can be generally classified into three categories based on the optimizer: Reinforcement Learning (RL) [11] based algorithms, gradient-based algorithms, and Evolutionary Computation (EC) [12] based NAS algorithms (ENAS). Specifically, the RL based algorithms often require thousands of Graphics Processing Units (GPUs) running for several days even on a medium-scale dataset, such as the CIFAR-10 image classification benchmark dataset [13]. The gradient-based algorithms are more efficient than the RL based NAS algorithms. However, they often find ill-conditioned architectures, because the relation exploited by the gradient-based update is not mathematically proved. In addition, gradient-based NAS algorithms require constructing a supernet in advance, which also highly requires expertise. ENAS algorithms solve the NAS problem by using EC techniques.
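To make the bi-level nature of Equation (1) concrete, the following toy sketch evaluates a candidate architecture: the inner level trains the weights on D_train, and the outer level scores the trained network on D_valid. Here an "architecture" is simply a tuple of MLP hidden-layer widths; this is a deliberately simplified illustration, not the formulation of any particular NAS paper.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

def evaluate_architecture(arch, d_train, d_valid):
    """Outer-level objective L(A, D_train, D_valid) of Equation (1).

    `arch` is a tuple of hidden-layer widths, e.g. (64, 32); a real ENAS
    system would decode a much richer architecture encoding instead.
    """
    x_tr, y_tr = d_train
    x_va, y_va = d_valid
    model = MLPClassifier(hidden_layer_sizes=arch, max_iter=300, random_state=0)
    model.fit(x_tr, y_tr)                    # inner level: optimize the weights on D_train
    return 1.0 - model.score(x_va, y_va)     # outer level: validation error to be minimized

# Toy usage: compare two candidate architectures on a small benchmark.
x, y = load_digits(return_X_y=True)
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.3, random_state=0)
for arch in [(32,), (64, 32)]:
    print(arch, evaluate_architecture(arch, (x_tr, y_tr), (x_va, y_va)))
```

Because every candidate requires its own (possibly expensive) training run, this evaluation is exactly the step that later sections try to accelerate.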
EC is a class of population-based computational paradigms which simulate the evolution of species or the behaviour of populations in nature to solve challenging optimization problems, and it includes genetic algorithms [14], genetic programming [15] and particle swarm optimization [16], among others. EC has been widely applied to solve non-convex optimization problems [17], even when the mathematical form of the objective function is not available [18].

In fact, more than twenty years ago, EC methods were already frequently used to search for not only the neural architecture but also the weights, which is also termed neuroevolution. The major differences between ENAS and neuroevolution lie in two aspects. Firstly, neuroevolution concerns both the neural architectures and the weights, while ENAS focuses mainly on the architectures. Secondly, neuroevolution commonly operates on small- and medium-scale neural networks, while ENAS mainly works on DNNs, such as Convolutional Neural Networks (CNNs) [19], [20] and deep stacked autoencoders [21], which are the building blocks of deep learning [22]. The first work of ENAS is generally viewed as the LargeEvo algorithm [19] proposed by the Google Brain team, whose early version was released in March 2017 on arXiv and which was then accepted by the 34th International Conference on Machine Learning in June 2017. The LargeEvo algorithm employs a genetic algorithm to search for the best architecture of a CNN, and the experimental results on CIFAR-10 and CIFAR-100 [13] have demonstrated its effectiveness. Since then, a large number of ENAS algorithms have been proposed. Fig. 1 shows the number of submissions* focusing on ENAS algorithms from 2017 to April 2020, when this survey was prepared. A sharp increasing trend in the number of papers is clear. Specifically, from 2017 to 2019, the number of submissions grew multiple times over, and the number of submissions in the first quarter of 2020 alone already equals that of the whole of 2018.

A great number of related submissions have been made publicly available, but there is no comprehensive survey of the literature on ENAS algorithms. Recent reviews on NAS can be seen in [18], [23], [24] and [25]. Among them, [23] and [24] mainly cover all kinds of NAS methods and make a macro summary of NAS. To be specific, Elsken et al. [23] divide NAS into three stages: search space, search strategy and performance estimation strategy. Similarly, Wistuba et al. [24] also follow these three stages and add a review of multi-objective NAS. Darwish et al. [18] make a summary of Swarm Intelligence (SI) and EA approaches for deep learning. Nevertheless, their paper does not only focus on NAS, but also covers other hyperparameter optimization.

*These submissions include the ones which have been accepted after peer review and also the ones which are only available on the arXiv website.
Fig. 1. The number of submissions focusing on evolutionary neural architecture search. The data is from Google Scholar with the keywords "evolutionary" OR "genetic algorithm" OR "particle swarm optimization" OR "PSO" OR "genetic programming" AND "architecture search" OR "architecture design", and from the literature on Neural Architecture Search collected by the AutoML.org website, until April 20, 2020. In addition, we have carefully checked each collected manuscript to ensure its scope is within evolutionary neural architecture search.
Stanley et al. [25] reviewed neuroevolution, which is inspired by the development of the human brain and mainly aimed at weight optimization rather than the structure. Most of the references in the surveys above are from before 2018 and do not involve the large number of papers published in the past two years. This paper makes a summary involving a large number of ENAS papers, in order to summarize current work and inspire new ideas for improving the performance of existing ENAS algorithms.

The motivation of this paper can be divided into three aspects. First, a comprehensive survey is needed for the large body of ENAS literature. Several tables in this paper give a general classification of the relevant papers, which makes comparisons easier and more convenient. Second, the success of so many papers can attract more researchers from related research areas. Third, the advantages and disadvantages summarized from the state-of-the-art methods can give some enlightenment for the future development of ENAS algorithms. This paper also follows the three stages of NAS proposed by Elsken et al. [23], but some modifications are made to specifically suit ENAS algorithms, as described in Section II.

The remainder of this paper is organized as follows. The background of ENAS is provided in Section II. In Section III, the encoding strategy is introduced and different classification criteria for constraints on the encoding space are defined. Section IV summarizes the operators on the population that improve the search ability. Section V introduces the selection strategy and the ways to speed up the evolution. Section VI presents the applications of ENAS algorithms. Section VII discusses the challenges and prospects, and Section VIII concludes the paper.
II. BACKGROUND
ENAS is the process of searching for neural architectures using EC methods. Owing to the evolution of the population, the performance of the architectures on the target task becomes better and better. Fig. 2 shows a flowchart of the ENAS process.

First of all, a population needs to be initialized. Each individual in the population represents a solution to the NAS problem, i.e., a neural architecture. Each architecture needs to be encoded as an individual before it joins the population. There are two main questions in the initialization stage: how to encode the architecture and how the original architectures are generated. In the following, we answer these two questions by category. Historically, there are two types of encoding methods, namely direct encoding and indirect encoding. Direct encoding encodes all the information of the architecture exactly into an individual that undergoes the evolving process.
Direct encoding has two advantages: one is the convenience of implementing the encoding, and the other is the flexibility of the operators on the population. On the contrary, indirect encoding represents a generation rule for network architectures rather than representing the units and connections directly [26]. Indirect encoding is mainly used in early ENAS methods [27] and in neuroevolution, e.g., HyperNEAT [28]. The main reason for choosing indirect encoding was to compress the millions of neuron weights and connections, so that the available computing resources could meet the needs. Since back-propagation has shown its superiority in optimizing the weights of neural networks, and many specific operators, such as the convolution operator and the pooling operator, have shown great ability in dealing with computer vision tasks, very little information is now required to build a neural network when the weights and low-level connections are not explicitly considered. For example, providing only three hyper-parameters, i.e., the kernel size, the stride size and the number of filters, is enough to define a simple convolution layer. These few hyper-parameters make it reasonable to use direct encoding in ENAS in recent years. Commonly, there are fixed-length and variable-length encoding strategies, depending on whether the encoded string has a fixed length. The advantage of the fixed-length encoding strategy is the ease of the population operations. In Genetic CNN [29], for example, the fixed-length binary string helps the population operations (especially the crossover) become easier to implement. Another example is [30], where Loni et al. use a fixed-length string of genomes to represent the architecture, so that one-point crossover can be easily applied to the corresponding encoded information. The advantage of the variable-length encoding strategy is that it can contain more details of the architecture with more freedom of expression. However, the population operators may not be suitable for this kind of encoding strategy and may need to be redesigned; an example can be seen in [31]. In general, there are three types of original architecture generation: starting from trivial initial conditions [19], random initialization in the search space [31], and starting from a well-designed architecture (also termed rich initialization) [32]. Generally, almost all methods create only one population during the evolution. However, in [33], the population is divided into three sub-populations, each with its own population operators, to hierarchically encode the architecture and search more efficiently. In addition, the population can be divided into several species (subgroups) to maintain the diversity during the evolving process [34], [35].

Second, the generated individuals undergo the fitness evaluation stage. Note that two fitness evaluation stages appear in Fig. 2; the two stages are completely identical in most ENAS methods, and fitness evaluation can be regarded as a necessary stage after the generation of new individuals. A fitness function needs to be defined before the evaluation.
Fig. 2. The flowchart of ENAS.

Most ENAS methods define the fitness function as a criterion based on the task at hand; e.g., the classification accuracy or error rate is suitable for image classification tasks and can represent the performance of the corresponding architecture well. In general, many ENAS methods choose the direct way to obtain the fitness, i.e., after the computationally expensive training phase, the true fitness is calculated on the validation dataset, which is not visible during the training phase. This can take hundreds of hours even on high-performance Graphics Processing Units (GPUs). Although many training tricks, such as weight inheritance [19] and the early stopping policy [36], are used to shorten the training time, the long evolving time caused by the lack of a large amount of expensive computing resources remains a major obstacle to ENAS. An alternative way to obtain the fitness is to use a performance predictor, which bypasses the training phase during fitness evaluation. The existing performance predictors can be divided into two categories: predictors based on the learning curve and end-to-end predictors [37]. The aim of a predictor is to accelerate the fitness evaluation process under the condition of a tolerable accuracy deviation.

Thirdly, after the fitness evaluation of the initial population, an iteration over the population begins. Until the stopping criterion is met, the population is updated by the population operators in each epoch of the iteration. Traditional EC approaches guide this iteration, i.e., the evolving process. The operators on the current population are quite different in different EC approaches, such as Genetic Algorithms (GAs), Particle Swarm Optimization (PSO) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), but all the operators can be classified into two categories: single-individual based and multiple-individual based. The single-individual based operators aim at exploring the area adjacent to an individual; the mutation operator in GAs is a typical single-individual based operator. The multiple-individual based operators make use of the information from two or more individuals in the population, and the crossover operator in GAs is a representative. The population operators play an important role in ENAS, because they directly determine which new individuals are generated and which architectures can be explored in the process of evolution.

Finally, with the end of the iteration, a population that has undergone the evolution is obtained. For most ENAS methods, the aim is to simulate the process of an expert designing a neural network, so they simply select the best individual from the last population as the output. Instead of discarding the other individuals in the last population, Frachon et al. [36] propose a neural committee that contains all the individuals in the last population and makes the decision by the voting of the committee. Furthermore, multi-objective ENAS methods return a set of architectures in the Pareto front.
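The flowchart in Fig. 2 maps directly onto a generic evolutionary loop. The sketch below shows that skeleton in Python; the encoding, variation, evaluation and selection routines are hypothetical placeholders that each concrete ENAS algorithm fills in differently, so this illustrates the common process rather than any specific method.

```python
import random

def enas_search(init_population, evaluate, vary, select, generations):
    """Generic ENAS skeleton following Fig. 2.

    init_population: callable returning a list of encoded architectures
    evaluate:        fitness function (e.g., validation accuracy)
    vary:            population operators (mutation/crossover) producing offspring
    select:          environmental selection forming the next population
    """
    population = init_population()
    fitness = [evaluate(ind) for ind in population]          # first fitness evaluation
    for _ in range(generations):                             # stopping criterion: generation budget
        offspring = vary(population, fitness)                # population operators
        off_fitness = [evaluate(ind) for ind in offspring]   # second fitness evaluation
        population, fitness = select(population, fitness, offspring, off_fitness)
    return population, fitness                               # return the last population

# Toy usage: evolve fixed-length integer strings (e.g., layer-type codes).
rng = random.Random(0)
init = lambda: [[rng.randint(0, 3) for _ in range(6)] for _ in range(10)]
evaluate = lambda ind: -sum((g - 2) ** 2 for g in ind)       # stand-in fitness function
def vary(pop, fit):
    return [[g if rng.random() > 0.2 else rng.randint(0, 3) for g in ind] for ind in pop]
def select(pop, fit, off, off_fit):
    ranked = sorted(zip(pop + off, fit + off_fit), key=lambda p: p[1], reverse=True)[:len(pop)]
    return [p for p, _ in ranked], [f for _, f in ranked]
print(enas_search(init, evaluate, vary, select, generations=20)[1][0])
```

Sections III to V of this survey discuss, respectively, how the initial population is built, how the variation operators work, and how evaluation and selection are carried out and accelerated.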
III. POPULATION INITIALIZATION
This section discusses the first stage of ENAS, namely the population initialization. This is the first stage in the evolving process and takes place before the iteration, so all the global settings, including the hyper-parameters of the evolution (the number of evolving epochs, the size of the population, etc.), the type of individual representation and the encoding space, must be decided at this stage. As mentioned in Section II, most of the recent methods choose the direct encoding strategy, so we do not discuss the encoding strategy in this section any further. In general, two aspects, individual representation and encoding space, distinguish different algorithms. These two aspects are introduced in the remainder of this section.
A. Individual representation
An individual in the population represents a corresponding neural network architecture, and it must contain all the information about that architecture. As a neural network is made up of basic units, which are discussed in Section III-C, such as layers, blocks and cells, two kinds of information can be used to build up a structure: the parameters in each basic unit and the connections between units. The parameters are usually real numbers which can be directly encoded into an individual; therefore, we only discuss the connections in this section. One special case is when the network connectivity is totally linear, which means there is no need to encode the information about the connections separately, and the parameters in each basic unit are enough. A classical example can be seen in [31], where Sun et al. explore a great many parameters based on a linear CNN model. However, most architectures are not designed to be linear. The skip connection in ResNet [2] and the dense connection in DenseNet [3] show the ability to build a good architecture. The following lists two types of connection encoding.
1) Adjacency matrix:
Because each unit can be regarded as a vertex and each connection can be regarded as an edge, the adjacency matrix can be readily used to represent the connections. Genetic CNN [29] uses a binary string to represent the connections, and the string can be transformed into a triangular matrix: in the binary string, 1 denotes that there is a connection between two nodes and 0 denotes no connection. Lorenzo et al. [38] use a matrix to represent the skip connections, and their work revolves around the adjacency matrix. In fact, back in the 1990s, Kitano et al. [27] began to study the use of the adjacency matrix to represent network connectivity and explained the process from the connectivity matrix, to the bit-string genotype, to the network architecture phenotype.
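As an illustration of this encoding style (a simplified sketch in the spirit of Genetic CNN [29], not its exact scheme), the snippet below unpacks a binary string into a strictly lower-triangular adjacency matrix, where entry (i, j) states whether node j feeds node i.

```python
import numpy as np

def bits_to_adjacency(bits, num_nodes):
    """Decode a binary string into a lower-triangular adjacency matrix.

    The string lists, node by node, the possible connections to every
    earlier node, so its length must be num_nodes * (num_nodes - 1) / 2.
    """
    assert len(bits) == num_nodes * (num_nodes - 1) // 2
    adj = np.zeros((num_nodes, num_nodes), dtype=int)
    k = 0
    for i in range(1, num_nodes):       # node i can receive input from nodes 0..i-1
        for j in range(i):
            adj[i, j] = int(bits[k])    # 1: connection j -> i exists, 0: no connection
            k += 1
    return adj

# '101011' with 4 nodes encodes the connections 0->1, 1->2, 1->3 and 2->3.
print(bits_to_adjacency("101011", num_nodes=4))
```

Because the genotype is a flat bit string, standard GA operators such as bit-flip mutation and one-point crossover apply directly to it.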
2) Directed acyclic graph:
Another way to represent the connections is based on a directed acyclic graph. Irwin et al. [39] use a graph-based strategy to represent the connections. To be specific, an ordered pair G = (V, E), with vertices V and edges E associated with directions, can describe a directed acyclic graph. Byla et al. [40] use a changing graph to represent the connections, and at the same time they demonstrate that Ant Colony Optimization (ACO) is suitable for graph-based representations. Chiu et al. [41] proposed directed acyclic graph neural networks and showed the relation between the adjacency matrix and the directed acyclic graph.

B. Encoding space
The encoding space contains all the possible individuals; therefore, the output of ENAS is also an individual selected from the encoding space. Generally, the encoding space can be divided into two parts: the first is the initial space, and the second is the search space. The initial space determines which kinds of individuals may appear in the initial population, and the search space determines which kinds of individuals can appear during the iteration, i.e., the evolving process.
1) Initial space:
As mentioned in Section II, there are three different types of original architecture generation, i.e., three types of initial space: the trivial space, the random space and the well-designed space. The trivial space contains only very simple architectures. For example, Large-Scale Evolution [19] initializes the population in a trivial space where each individual constitutes just a single-layer model with no convolutions. Xie et al. [29] also performed an experiment to show that a trivial space can evolve into a competitive architecture. The reason for using as little experience as possible is to force the evolution to make discoveries by itself and to prove that the evolution is not manipulated by the experimenter. On the contrary, the well-designed space contains state-of-the-art architectures. In this way, a promising architecture can be obtained at the very beginning, whereas other novel architectures can hardly be explored. In the random space, all the individuals in the initial population are randomly generated within the limited space, and many methods adopt this type of initial space, such as [9], [31], [42]. The aim of this type of initial space is also to reduce the intervention of artificial experience in the initial population.
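To make the random space concrete, the sketch below randomly generates variable-length, layer-wise encodings within user-chosen bounds. The layer vocabulary and hyper-parameter ranges are invented for illustration and do not follow any particular ENAS method.

```python
import random

rng = random.Random(42)

def random_individual(min_depth=3, max_depth=10):
    """Sample one variable-length architecture encoding from a random initial space."""
    layers = []
    for _ in range(rng.randint(min_depth, max_depth)):
        if rng.random() < 0.7:   # convolution layer with random hyper-parameters
            layers.append(("conv", {"filters": rng.choice([16, 32, 64, 128]),
                                    "kernel": rng.choice([1, 3, 5])}))
        else:                    # pooling layer
            layers.append(("pool", {"type": rng.choice(["max", "avg"])}))
    return layers

population = [random_individual() for _ in range(20)]   # random initial population
print(len(population[0]), population[0][:2])
```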
2) Search space:
In the evolving process, the space constraining the individuals changes to the second space, the search space. Generally speaking, the methods that choose a random initial space keep the same space during the evolving process, i.e., the initial space and the search space are identical in these methods. For the other two types of initial space, however, because the initial space is relatively small, the search space becomes larger so as to contain the promising architectures. It is worth noting that many methods do not define the search space directly, but instead define it implicitly through the operators on the population.
C. Constraints
The constraints on the encoding space are important, because the constraints represent the human intervention which restricts the encoding space and inclines the search towards architectures similar to the known state-of-the-art ones. A method with a mass of constraints can obtain a strong architecture easily but is prevented from designing a novel architecture. Furthermore, the size of the search space greatly affects the efficiency of evolution. We cannot judge an ENAS method without considering its constraints on the search space, because one extreme case is that all the individuals in the search space perform well; in this case, an excellent individual can be obtained even without any selection. We therefore discuss the constraints in detail in this section.

Table I shows the different categories of encoding space and the constraints, where the horizontal classification is based on the constraints on the architecture, and the vertical classification is based on the categories of the basic units searched. We introduce the encoding space from these two kinds of classification.

First, the basic units can be divided into four categories: global, block-based, cell-based and topology-based. Global means that the basic units in the encoding space are layers (such as convolution layers and pooling layers) or neuron nodes. An individual in a global encoding space is much more fine-grained, because every detail can be considered. However, it may take more time to obtain such a fine-grained individual in such a large global space. Therefore, many researchers use a block, which is a specific combination of various types of layers, as the basic unit. Many different kinds of blocks have been presented: ResBlock [2], DenseBlock [3], ConvBlock (Conv2d + BatchNormalization + Activation) [111], InceptionBlock [139], etc. These blocks have promising performance, and fewer parameters are needed to build the architecture, so it is easier to find a good architecture in the block-based encoding space. Some methods use the above blocks directly, such as [9], [36], and some other methods propose other blocks for their own purposes. Chen et al. [107] propose eight blocks, including ResBlock and InceptionBlock, encoded in a 3-bit string, and use the Hamming distance to distinguish similar blocks. Song et al. [113] propose three residual dense blocks to reduce the amount of computation caused by the convolution operations of image super-resolution tasks. The cell-based encoding space is similar to the block-based space, and it can be regarded as a special case of the block-based space where all the blocks are the same. The methods choosing this space build the architecture by stacking repeated motifs. Chu et al. [127] divide the cell-based space into two independent parts: the micro part contains the parameters of cells, while the macro part defines the connections between different cells. The cell-based space greatly reduces the search space, but Frachon et al. [36] argue that there is no theoretical basis for the claim that the cell-based space helps to find a good architecture. Finally, the topology-based space does not care about the parameters or the structure of each unit (layer or block); its only concern is the connections between units. One classical example is the one-shot approach, which treats all the architectures as different subgraphs of a supergraph [23]. Yang et al. [135] search for the architecture by choosing different connections in the supergraph, and the subgraph built by the chosen connections becomes an individual. Another typical case is pruning. Wu et al.
[132] work on a shallow VGGNet [8] on CIFAR-10, and the aim is to prune unimportant weight connections. To sum up, the basic unit can be regarded as a special constraint on the encoding space: the block-based space makes the method more efficient, whereas the global space provides a greater possibility of reaching a novel architecture.

Secondly, the constraints on the encoding space mainly focus on three aspects: fixed depth, rich initialization and partially fixed structure. Fixed depth means all the individuals in the population have the same depth. The fixed depth is a strong constraint and largely reduces the encoding space. Note that the fixed-length encoding strategy mentioned in Section II is different from fixed depth. In Genetic CNN [29], for example, the fixed-length encoding strategy only limits the maximum depth: a node which is isolated (has no connections) is simply ignored, and in this way the individuals can have different depths. The second constraint is rich initialization, which is also a strong constraint carrying a lot of manual experience. In this case, the generated architectures stay close to the original one, and the possibility of exploring new architectures is greatly reduced. Partially fixed structure means the architecture is partially settled. For example, in [60] a max-pooling layer is added to the network after every set of four convolution layers.

In Table I, relatively few constraints denotes that the method has no restrictions on the above three aspects, but that is not to say there is absolutely no constraint. For example, in the classification task, a fully connected layer is used as the last layer in some methods [31], [79], [125], or the softmax cross-entropy loss is used as the loss function [26]. Furthermore, there are other constraints. The supernet is a constraint: Yang et al. [135] are devoted to finding a subnet in the supernet, and as a result, the found architecture cannot be one that is not contained in the supernet. The best individuals found by some methods cannot be applied directly to the corresponding task until they undergo subsequent operations such as training from scratch [105], [120] or fine-tuning [40]. Moreover, the maximum length is predefined in many methods, including both the fixed-length encoding strategy [29] and the variable-length encoding strategy [31], which prevents the method from discovering a deeper architecture. Wang et al. [71] try to break the limit of the maximum length by using a Gaussian distribution initialization mechanism. Irwin et al. [39] break the limit not by initialization, but by using the population operators, the crossover and the mutation, to extend the depth to any size.
TABLE I
THE ENCODING SPACE AND THE CONSTRAINTS
(Original table columns: fixed depth, rich initialization, partial structure fixed, relatively few constraints; reference groups within each row are kept in their original order.)

Global: fixed depth [32], [43]–[53]; rich initialization [32], [54]–[58]; partial structure fixed [42], [50], [59]–[69]; relatively few constraints [19], [21], [31], [36], [40], [70]–[105]
Block-based: fixed depth [106]–[108]; rich initialization [109]; partial structure fixed [65], [109], [110]; relatively few constraints [9], [20], [26], [30], [36], [105], [111]–[120]
Cell-based: [121]–[123]; [120], [124]–[131]
Topology-based: [132], [133]; [29], [38], [134]–[138]
IV. POPULATION OPERATORS
In this section, we introduce the population operators. The population operators are performed on the population to generate offspring, and they can be regarded as the search strategy of ENAS: the population evolves towards better solutions under the guidance of the population operators during the evolving process. Generally, the operators can be divided into two categories: single-individual based and multiple-individual based. One classical single-individual based operator is the mutation operator in GAs, which allows an individual to search the space around itself. The multiple-individual based operators make use of the encoded information from more than one individual; the crossover operator in GAs combines the encoded information from both parents. Furthermore, the position update mechanism in PSO makes use of both the particle's† own best configuration achieved in the past (Pbest) and the current global best particle in the swarm (Gbest). We discuss the population operators from these two aspects in this section.

†The individual in PSO is called a particle.

A. Single individual based operator
First, we discuss the mutation operator. The aim of the mutation operator is to search the local area around an individual. A simple idea is to allow the encoded information to vary within a given range. Sun et al. [31] use the polynomial mutation [140] on encoded information expressed as real numbers. To make the mutation less random, Lorenzo et al. [38] proposed a novel Gaussian mutation based on Gaussian regression to guide the mutation, i.e., the Gaussian regression can predict which architectures may be good, and the newly generated individuals are sampled in the regions of the search space where the fitness values are likely to be high. That gives the mutation a "direction". Maziarz et al. [141] use a recurrent neural network (RNN) to guide the mutation: the mutations are not sampled at random among the possible architectural choices, but are sampled from distributions inferred by an RNN. Using an RNN to control the mutation can also be seen in other methods such as [127]. Qiang et al. [72] use a variable mutation probability, with a higher probability in the early stage for better exploration and a lower probability in the late stage for better exploitation. The fact that this scheme is applied in many other methods [66] indicates its effectiveness. To maintain the diversity of the population after the mutation, Tian et al. [42] use force mutation and distance calculation, which ensure that an individual in the population is not overly similar to the other individuals (especially the best one). Kramer et al. [126] use the (1+1)-ES, which generates an offspring from a single parent with bit-flip mutation, and use mutation rate control and niching to overcome local optima. Zhang et al. [104] propose the exchange mutation, which exchanges the positions of two genes of an individual, i.e., exchanges the order of two layers. This introduces no new layers, and the weights can be completely preserved.

Chen et al. [142] introduced two function-preserving operators on neural networks, and such function-preserving operators were later termed network morphisms by Wei et al. [143]. Network morphisms (function-preserving operators) aim to change the neural network architecture without loss of the acquired experience. A function-preserving operator changes the architecture from F(·) to G(·) and satisfies Equation (2):

$$\forall x,\ F(x) = G(x) \tag{2}$$

where x denotes the input of the network. Network morphisms can be regarded as function-preserving mutations. With such a mutation, the mutated individual cannot perform worse than its parent. Using network morphisms in ENAS is suitable, because both are forms of incremental learning. To be more specific, Chen et al. [142] proposed net2widernet to obtain a wider net and net2deepernet to obtain a deeper net. Elsken et al. [111] extend the network morphisms with two popular network operations: skip connections and batch normalization. Zhu et al. [56] proposed five well-designed function-preserving mutations to guide the evolving process with the information already learned. To avoid sub-optimal results, Chen et al. [121] add noise to some function-preserving mutations, and in their experiments they found that adding noise to the pure network morphism, instead of compromising the efficiency, in fact improves the final classification accuracy.
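A minimal sketch of the function-preserving idea in Equation (2), assuming a toy fully connected network with ReLU activations (a simplified analogue of net2deepernet, not the exact operator from [142]): inserting a new layer whose weight matrix is the identity leaves the output unchanged, because the inputs to the inserted layer are already non-negative after the preceding ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def forward(x, weights):
    """A toy MLP: ReLU after every layer except the last (linear) one."""
    h = x
    for w in weights[:-1]:
        h = relu(w @ h)
    return weights[-1] @ h

def deepen(weights, position):
    """Function-preserving mutation: insert an identity layer after `position`."""
    width = weights[position].shape[0]
    new = list(weights)
    new.insert(position + 1, np.eye(width))   # identity weights; ReLU(I h) = h since h >= 0
    return new

parent = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]   # 4 -> 8 -> 3 network F
child = deepen(parent, position=0)                            # 4 -> 8 -> 8 -> 3 network G
x = rng.normal(size=4)
print(np.allclose(forward(x, parent), forward(x, child)))     # True: F(x) == G(x)
```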
Note that all these network morphisms can only increase the capacity of a network, because if one decreased the network's capacity, the function-preserving property could in general not be guaranteed [105]. As a result, an architecture generated by network morphisms only becomes larger and deeper, which is not suitable for a device with limited computing resources (like a mobile phone). In order for the network architecture to be reducible as well, Elsken et al. [105] proposed approximate network morphisms, which satisfy Equation (3):

$$\forall x,\ F(x) \approx G(x) \tag{3}$$

so as to also cover operators that reduce the capacity of a neural architecture.
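Going back to the simpler, non-function-preserving mutations mentioned at the beginning of this subsection, the following is a sketch of the polynomial mutation [140] used in [31] on real-coded hyper-parameters; the bounds, mutation probability and distribution index below are illustrative values, not those of any particular paper.

```python
import random

def polynomial_mutation(genes, low, high, prob=0.1, eta=20.0, rng=random.Random(1)):
    """Polynomial mutation (Deb) applied gene-wise to a real-coded individual."""
    child = list(genes)
    for i, x in enumerate(child):
        if rng.random() >= prob:
            continue
        u = rng.random()
        span = high[i] - low[i]
        d1 = (x - low[i]) / span
        d2 = (high[i] - x) / span
        if u <= 0.5:
            delta = (2.0 * u + (1.0 - 2.0 * u) * (1.0 - d1) ** (eta + 1.0)) ** (1.0 / (eta + 1.0)) - 1.0
        else:
            delta = 1.0 - (2.0 * (1.0 - u) + 2.0 * (u - 0.5) * (1.0 - d2) ** (eta + 1.0)) ** (1.0 / (eta + 1.0))
        child[i] = min(max(x + delta * span, low[i]), high[i])   # perturb and clip to bounds
    return child

# Example: mutate an encoding of (learning rate, dropout rate, width multiplier).
print(polynomial_mutation([0.01, 0.5, 1.0], low=[1e-4, 0.0, 0.5], high=[0.1, 0.9, 2.0], prob=1.0))
```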
B. Multiple individual based operator

The multiple-individual based operators can be divided into two categories according to whether they are directed or not: directed multiple-individual based operators and undirected multiple-individual based operators. The position update mechanism in PSO is a classical example of a directed multiple-individual based operator: the particle moves toward Gbest and Pbest. On the contrary, there is no clear direction between individuals in an undirected multiple-individual based operator. For example, the crossover operator in GAs is an undirected multiple-individual based operator, because the individuals involved have equal status. For the crossover operator, Sun et al. [31] use the Simulated Binary Crossover (SBX) [144] to combine the encoded parameters from two matched layers. Instead of a crossover that only considers the parameters of layers, Sapra et al. [76] proposed a disruptive crossover that swaps whole clusters (sequences of layers) between two individuals at the corresponding positions.

In PSO, an individual in the population is called a particle. The position X of each particle represents the encoded information. The particle velocity V is updated according to Equation (4):

$$V_{i,j}^{t+1} = w \times V_{i,j}^{t} + c_1 \times r_1 \times (P_{best,i,j} - X_{i,j}^{t}) + c_2 \times r_2 \times (G_{best,j} - X_{i,j}^{t}) \tag{4}$$

where $V_{i,j}^{t+1}$ denotes the updated j-th dimension of the i-th particle's velocity, w is the inertia coefficient, $c_1$ and $c_2$ are constants used to fine-tune the performance of PSO, $r_1$ and $r_2$ are random numbers in [0, 1), and X is the current particle position. The position is updated via Equation (5):

$$X_{i,j}^{t+1} = X_{i,j}^{t} + V_{i,j}^{t+1} \tag{5}$$

Junior et al. [70] use their implementation of PSO to update the particle at the level of layers instead of the parameters of each layer. Gao et al. [145] developed a gradient-priority particle swarm optimization to handle problems such as the low convergence efficiency of PSO when there are many hyper-parameters to be optimized. They expect the particle to find the locally optimal solution first, and then move towards the globally optimal solution.

The Firefly algorithm is similar to PSO. Sharaf et al. [74] use an improved Firefly algorithm where the individuals in the population not only move toward other individuals better than themselves, but also have a tendency for random mutations, which gives them the ability to search for local optima.

In ACO, the individuals are generated in a quite different way. Several ants form an ant colony‡. Each ant moves from node to node following the pheromone instructions to build an architecture. The pheromone is updated generation after generation. The paths of well-performing architectures retain more pheromone to attract the next ants for exploitation, and at the same time the pheromone is also decaying (pheromone evaporation), which encourages other ants to explore other areas. Byla et al. [40] let the ants choose the path from node to node in a graph whose depth increases gradually. Elsaid et al. [146] introduce different ant agent types that act according to specific roles to serve the needs of the colony (population), which is inspired by how real ants specialize.

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is also widely used [54], [147]. CMA-ES works by adapting a covariance matrix C, which defines the shape and orientation of a Gaussian distribution, and a vector x, which describes the location of the center of the distribution [147].

‡The population in ACO is also termed a colony.
The sampled individuals are used to update C and x, which makes use of information from multiple individuals.

Differential Evolution (DE) is another method whose population operators are all multiple-individual based. Different from GAs, the mutation in DE makes use of the information from three individuals, and the mutated individual, in terms of vectors, is generated via Equation (6):

$$V^{G+1} = X_{r_1}^{G} + F \cdot (X_{r_2}^{G} - X_{r_3}^{G}) \tag{6}$$

where $X_{r_1}^{G}$, $X_{r_2}^{G}$ and $X_{r_3}^{G}$ are different individuals selected randomly from the population and F is an amplification factor. Then, the crossover operator in DE generates the experimental individual via Equation (7):

$$U_{j}^{G+1} = \begin{cases} V_{j}^{G+1}, & \text{if } rand(j) \le C_r \text{ or } j = j_{rand} \\ X_{j}^{G}, & \text{otherwise} \end{cases} \tag{7}$$

where j = 1, 2, ..., D, D denotes the length of the vector (chromosome), $C_r$ is a predefined crossover rate constant, and $j_{rand}$ is randomly chosen from [1, 2, ..., D] to ensure that at least one component of the experimental individual comes from the mutated individual. Some ENAS methods, such as [48], [87], choose DE to guide the offspring generation.

Moreover, Wang et al. [148] proposed a hybrid PSO-GA method. They use PSO to guide the evolution of the parameters in each block, encoded in decimal notation, and meanwhile use GA to guide the evolution of the shortcut connections, encoded in binary notation.

Table II classifies the different ENAS methods according to the EC method on which they are based, and the horizontal classification shows the different types of neural networks.

There are some other methods not mentioned above. The hill climbing algorithm can be interpreted as a very simple evolutionary algorithm. For example, in [111] the population operators only contain the mutation and no crossover, and the selection strategy is relatively simple. The memetic algorithm is a hybrid of EAs and local search. Evans et al. [149] integrate local search (as gradient descent) into GP as a fine-tuning operation. CVOA [85] is inspired by the new respiratory virus, COVID-19: the architecture is found by simulating how the virus spreads and infects healthy individuals. Hyper-heuristics contain two levels, a high-level strategy and low-level heuristics, with a domain barrier between the two levels; therefore the high-level strategy remains useful when the application changes. AIS is inspired by theories related to the mammalian immune system and does not require the crossover operator, in contrast to GAs [36].

Note that this section only discusses single-objective optimization and focuses on how the population operators generate offspring. Multiple objectives are discussed in Section V.
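As a concrete instance of the multiple-individual based operators discussed above, the following sketch implements the PSO velocity and position update of Equations (4) and (5) on real-coded architecture parameters; the coefficient values are typical defaults, not drawn from any particular ENAS paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5):
    """One PSO update following Equations (4)-(5) for a whole swarm.

    x, v:    (num_particles, dims) current positions and velocities
    p_best:  (num_particles, dims) personal best positions
    g_best:  (dims,) global best position
    """
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)   # Equation (4)
    x_new = x + v_new                                                  # Equation (5)
    return x_new, v_new

# Toy usage: five particles, each encoding three real-valued hyper-parameters.
x = rng.random((5, 3))
v = np.zeros((5, 3))
x, v = pso_step(x, v, p_best=x.copy(), g_best=x[0])
print(x.shape, v.shape)
```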
V. EVALUATION AND SELECTION
In this section, the evaluation of individuals and the selection strategy are briefly summarized first. Because the evaluation is the most time-consuming stage, we then discuss the strategies used to accelerate it.
A. Evaluation Criteria
In EAs, the population goes through the following iteration: after offspring generation, the better individuals survive and the others are discarded, while in SI, the best individual must be identified. In either case, each method needs a criterion to evaluate an individual. Different methods choose the criterion according to the task at hand; for example, the classification task prefers the classification accuracy as the criterion. In EAs this criterion is termed fitness, while in AIS this criterion is termed affinity§.

In Table II, the methods are first divided into single-objective and multi-objective ones. Single-objective means there is only one objective function in the method, e.g., the classification accuracy mentioned above, and these methods have only one goal: searching for the architecture with the highest accuracy. The multi-objective methods, however, consider more than one objective function, e.g., aiming at both classification accuracy and computational cost simultaneously. These objective functions are usually conflicting: obtaining a higher accuracy often requires a more complicated architecture and more computational resources, while a device with limited computational resources, e.g., a mobile phone, cannot afford such a complex architecture. To this end, computational expense is one of the most important factors to consider. Pruning is a special case of reducing the computational expense, working by cutting out unimportant connections with minimal impact on accuracy. Other ENAS methods aiming at multi-objective optimization can be seen in Table II. For example, Schorn et al. [78] also take high error resilience into consideration.

The simplest way to tackle multi-objective optimization is to convert it into single-objective optimization. Equation (8),

$$F = \lambda f_1 + (1 - \lambda) f_2 \tag{8}$$

is the classical linear form combining two objective functions $f_1$ and $f_2$ into a single objective function, where $\lambda$ denotes the weighting coefficient. In [33], [61], [80], [112], the multi-objective optimization problem is solved by the available single-objective optimization methods through the weighted sum of Equation (8). Chen et al. [128] do not adopt the linear addition as the objective function, but instead use a nonlinear penalty term.

Some algorithms, such as NSGA-II [178] and MOEA/D [179], are already widely used in multi-objective optimization and are also adopted in [30], [106], [172]. These methods aim to find a Pareto-front (non-dominated) set; we only list such methods in the multi-objective category of Table II. Baldeon et al. [106] choose the penalty-based boundary intersection approach in MOEA/D because training a neural network involves non-convex optimization and the form of the Pareto front is unknown. LEMONADE [105] divides the objective functions into two categories: f_exp and f_cheap. f_exp denotes expensive-to-evaluate objectives (e.g., the accuracy), while f_cheap denotes cheap-to-evaluate objectives (e.g., the model size). In every iteration, parent networks are sampled according to a sparsity-favoring distribution based on the cheap objectives f_cheap to generate offspring; therefore, f_cheap is evaluated many more times than f_exp, which saves time. Schorn et al. [78] also make use of LEMONADE proposed by Elsken et al. [105]. Because NSGA-III [180] may fall into the small model trap (this algorithm prefers small models), Yang et al. [135] make some improvements to the conventional NSGA-III to protect the larger models.

§In this paper, we refer to the evaluation criterion as "fitness".
Li et al. [47] use the bias-variance framework in their proposed multi-objective PSO to obtain a more accurate and stable architecture. Wu et al. [132] use MOPSO [181] for neural network pruning. Wang et al. [108] use OMOPSO [182], which selects the leaders using a crowding factor, with Gbest selected from the leaders.
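As a small illustration of the weighted-sum scalarization of Equation (8), the snippet below combines a validation error term and a normalized model-size term into one fitness value to minimize; the normalization constant and the weight are arbitrary choices for the example, not values from the cited papers.

```python
def weighted_sum_fitness(val_error, num_params, lam=0.8, param_budget=5e6):
    """Scalarize two conflicting objectives into one value (Equation (8)).

    f1 = validation error, f2 = model size normalized by an assumed parameter budget.
    """
    f1 = val_error
    f2 = num_params / param_budget
    return lam * f1 + (1.0 - lam) * f2

# A small, slightly less accurate model can still win under this trade-off.
print(weighted_sum_fitness(0.08, 1.2e6), weighted_sum_fitness(0.07, 6.0e6))
```

The choice of the weight lambda encodes the user's preference between the objectives, which is exactly the decision that Pareto-based methods such as NSGA-II avoid making in advance.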
B. Selection Strategy

Table III shows the main kinds of selection strategies. Note that an explicit selection strategy often appears in EA based methods, because other algorithms such as PSO and ACO do not undergo a separate selection stage, or their selection can be regarded as integrated into the population operators. The selection strategy is used not only in the environmental selection stage, which chooses the individuals making up the next population, but also in choosing the individuals that serve as parents for generating offspring with the population operators. Zhang et al. [104] term these two selections environmental selection and mate selection, respectively.

The selection strategies can be divided into five categories: retaining the best group, discarding the worst, roulette, tournament selection and others. The simplest strategy is to retain the individuals with higher fitness: only the best group survives. However, this can cause a loss of diversity in the population, which may lead the population to fall into local optima. Discarding the worst is similar to retaining the best group. Real et al. [130] use aging evolution, which discards the oldest individual in the population; aging evolution can explore the search space more, instead of zooming in on good models too early, as non-aging evolution would. The same selection strategy is used in [100]. Zhu et al. [56] combine these two approaches, discarding the worst individual and the oldest individual at the same time.
TABLE II
CATEGORIZATION OF EC METHODS AND THE DIFFERENT TYPES OF NEURAL NETWORKS
(Original table columns: DNN + ANN, CNN, DBN, RNN, AE; reference groups within each row are kept in their original order.)

Single objective, EA:
GAs: [35], [41], [81], [138], [150]; [9], [19], [20], [29], [31], [32], [42], [44], [46], [49], [53], [56]–[58], [60], [62]–[64], [66]–[68], [71], [76], [79], [80], [83], [84], [86], [88], [90]–[92], [94], [96], [98], [99], [104], [107], [110], [113], [116]–[118], [120], [121], [124]–[126], [130], [131], [137], [151], [152]; [100], [153]; [65], [91], [93], [154], [155]; [156]
GP: [157]; [26], [59], [112], [114], [158], [159]; [34], [160]; [161]
ES: [138], [162]; [122]; [54], [73], [147], [154], [163], [164]; [97]

Single objective, SI:
ACO: [40]; [134], [136], [146], [165]
PSO: [52], [82], [166]; [70], [71], [95], [102], [103], [108], [109], [129], [145]; [72], [167]; [21], [43]
Firefly algorithm: [74]

Single objective, other:
Hill climbing algorithm: [55], [77], [111]
Memetic: [38], [149]
DE: [168]; [45], [71]; [87]; [48]
CVOA: [85]
Hyper-heuristic: [75]
Artificial Immune System (AIS): [36]

Multiple objective:
Computational expense: [169]–[171]; [30], [61], [69], [89], [101], [105], [106], [119], [127], [135], [172]–[175]; [176]
Pruning: [30], [132], [133]
Other: [47], [78], [177]; [51]

Roulette gives every individual a probability of surviving (or being discarded), whether it is the best or not. Tournament selection selects the best one from an equally likely sample of individuals. Furthermore, Johner et al. [57] use a ranking function to choose individuals by rank. A selection trick termed niching is used in [67], [126] to avoid getting stuck in local optima; this trick allows offspring worse than their parents to survive for several generations until they evolve into better ones. Most of the methods focus on preserving the well-performing individuals; however, Liu et al. [124] emphasize genes more than the surviving individuals, where a gene can represent any ingredient of the architecture. They believe that individuals composed of the fine-gene set are more likely to have promising performance.

In fact, some selection methods aim at preserving the diversity of the population. Elsken et al. [105] select individuals in inverse proportion to their density. Javaheripi et al. [183] choose the parents based on the distance (difference) during mate selection: they choose the two individuals with the highest distance to promote exploration.
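Tournament selection, one of the most common strategies in Table III, is easy to state in code. The sketch below is a generic version with a configurable tournament size; it is not tied to any specific ENAS paper.

```python
import random

def tournament_select(population, fitness, k=3, rng=random.Random(7)):
    """Pick one parent: sample k individuals uniformly and keep the fittest."""
    contestants = rng.sample(range(len(population)), k)
    winner = max(contestants, key=lambda idx: fitness[idx])
    return population[winner]

# Toy usage: larger k increases selection pressure toward high-fitness individuals.
pop = ["arch_a", "arch_b", "arch_c", "arch_d", "arch_e"]
fit = [0.90, 0.85, 0.92, 0.80, 0.88]
parents = [tournament_select(pop, fit, k=3) for _ in range(4)]
print(parents)
```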
C. Shorten the evaluation time
Real et al. [19] used 250 workers (computers) to finish the Large-Scale Evolution over 11 days. Such computational resources are not available to everyone interested in NAS.
TABLE III
THE SELECTION STRATEGY

Retain the best group: [26], [36], [38], [42], [49], [55], [73], [77], [84], [97], [111], [114], [132], [156], [157]
Discard the worst or the oldest: [56], [76], [86], [130], [131], [155], [184]
Roulette: [29], [30], [44], [60], [62], [107], [113], [116], [133]
Tournament selection: [9], [19], [20], [31], [59], [63], [80], [93], [99], [108], [117], [120], [125], [127], [128], [184]
Others: [57], [105]
Almost all of the methods evaluate individuals by training them first and then evaluating them on the validation dataset (i.e., a dataset which is unseen during training). Since architectures are becoming more and more complex, it takes a lot of time to train each architecture to convergence, so it is natural to devise methods to shorten the evaluation time. Table IV lists four of the most common methods: weight inheritance, early stopping policy, reduced training set and reduced population.

Because the population operators usually do not completely disrupt the architecture of an individual, some parts of a newly generated individual are the same as in previous individuals. The weights of these identical parts can be easily inherited. With weight inheritance, the neural networks no longer need to be trained completely from scratch. This method was already used in [166] twenty years ago. Moreover, as mentioned in Section IV, network morphisms change the network architecture without loss of the acquired experience. This can be regarded as the ultimate form of weight inheritance, because it solves the weight inheritance problem even for the changed part of the architecture. The ultimate weight inheritance lets the new individual completely inherit the knowledge its parent has learned, so training such an individual to convergence saves a lot of time.

The early stopping policy is another method widely used in NAS. The simplest way is to set a fixed, relatively small number of training epochs. This method is used in [21], because the authors believe the performance of an individual after a small number of training epochs is already indicative. Similarly, Assunccao et al. [79] let the individuals undergo training for the same, short time in each epoch (although this time is not fixed and increases with the epoch). To give the promising architectures more training time and thus a more precise evaluation, So et al. [185] set hurdles after several fixed numbers of epochs; the weak individuals stop training early, which saves time. However, the early stopping policy can lead to inaccurate estimation of an individual's performance (especially for large and complicated architectures), as can be seen in Fig. 3: individual 2 performs better than individual 1 before epoch t1, whereas individual 1 performs better in the end. Yang et al. also discuss this phenomenon in [135]. It is therefore crucial to determine at which point to stop. Note that a neural network may converge or hardly improve its performance after a certain number of epochs, such as t1 for individual 2 and t2 for individual 1 in Fig. 3; using the performance estimated at this point can evaluate an individual relatively accurately with less training time. Therefore, some methods such as [32], [86] stop training when no significant performance improvement is observed. Suganuma et al. [114] use an early stopping policy based on a reference curve: if the accuracy curve of an individual stays under the reference curve for several successive epochs, the training is terminated and this individual is regarded as a poor one; after every epoch, the reference curve is updated by the accuracy curve of the best offspring.

A reduced training set, i.e., a subset of data that has similar properties to the large dataset, can also shorten the evaluation time effectively. Liu et al. [124] explore promising architectures by training on a subset and then use transfer learning on the large original dataset.
Because there are many benchmark datasets in the image classification field, an architecture can be evaluated on a smaller dataset (e.g., CIFAR-10) first and then applied to a larger dataset (such as CIFAR-100 and ImageNet); the smaller dataset can be regarded as a proxy for the larger one.

Reducing the population is a method specific to ENAS, since other NAS approaches do not keep a population. Assunccao et al. [98] reduce the population on the basis of their previous algorithm [186], [187] to speed up the evolution. However, simply reducing the population may not explore the search space sufficiently in each epoch and may lose the global search ability. Another way is to reduce the population dynamically. For instance, Fan et al. [122] use the (µ + λ) evolution strategy and divide the evolution into three stages with population reduction, aiming to find the balance between the limited computing resources and the efficiency of the evolution: the large population in the first stage ensures the global search ability, while the small population in the last stage shortens the evolution time.
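Returning to weight inheritance, the first method listed in Table IV, the sketch below copies trained parameters from a parent into an offspring for every layer whose position and configuration are unchanged, so only the altered layers need to be retrained; the dictionary-free, list-based encoding here is a deliberately simplified stand-in for a real framework's parameter store.

```python
import numpy as np

def inherit_weights(offspring_layers, parent_layers, parent_weights, rng=np.random.default_rng(5)):
    """Copy parent weights for layers whose (position, configuration) is unchanged.

    offspring_layers / parent_layers: lists of hashable layer configs, e.g. ("conv", 32, 3)
    parent_weights: list of weight arrays aligned with parent_layers
    """
    weights = []
    for i, layer in enumerate(offspring_layers):
        if i < len(parent_layers) and layer == parent_layers[i]:
            weights.append(parent_weights[i])            # inherited: no retraining from scratch
        else:
            weights.append(rng.normal(size=(4, 4)))      # changed layer: fresh random init (toy shape)
    return weights

parent = [("conv", 32, 3), ("pool", "max"), ("conv", 64, 3)]
child = [("conv", 32, 3), ("pool", "max"), ("conv", 128, 3)]   # only the last layer mutated
parent_w = [np.ones((4, 4)) * i for i in range(3)]
child_w = inherit_weights(child, parent, parent_w)
print([w is parent_w[i] for i, w in enumerate(child_w)])       # [True, True, False]
```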
TABLE IV
DIFFERENT METHODS TO SHORTEN THE EVALUATION TIME

Weight inheritance: [19], [36], [38], [40], [55], [56], [62], [67], [77], [78], [84], [86], [88], [105], [109], [111], [120], [121], [125], [132]
Early stopping policy: [21], [32], [36], [42], [43], [62], [65], [71], [79], [80], [83], [86], [95], [98], [103], [108], [112], [114], [125], [145]
Reduced training set: [46], [84], [124], [129], [188]
Reduced population: [98], [122], [124]
Fig. 3. Two learning curves of different individuals (classification accuracy versus training epoch).

Instead of reducing the population, Liu et al. [188] evaluate architectures of small size at an early stage of the evolution. Similarly, Wang et al. [129] do not evaluate the whole architecture but a single block, and the blocks are then stacked to build an architecture as the evolution goes on.

Actually, there are many other well-performing methods to reduce the time in ENAS. In population based methods, especially the GA based methods (e.g., in [31]), it is natural to maintain well-performing individuals in the population over successive epochs. Sometimes, the individuals in the next population directly inherit all the architectural information of their parents without any modification, and it is not necessary to evaluate those individuals again. Fujino et al. [32] use a memory to record the fitness of individuals, and if the same architecture encoded in an individual appears again, the fitness value is retrieved from memory instead of being re-evaluated. Similarly, Miahi et al. [117] and Sun et al. [20] employ a hashing method to save pairs of architecture and fitness for each individual and to reuse them when the same architecture appears again. Johner et al. [57] prohibit the appearance of architectures that have existed before in the offspring. This does reduce the time; however, the best individuals are then prohibited from remaining in the population, which may lead the population to evolve in a bad direction.

Rather than training thousands of different architectures, the one-shot model [189] trains only one SuperNet to save time. The different architectures, i.e., the SubNets, are sampled from the SuperNet with shared parameters. Yang et al. [135] believe the traditional ENAS methods, which do not use a SuperNet, are less efficient because the models are optimized separately. In contrast, the one-shot model optimizes the architecture and the weights alternately. But the weight-sharing mechanism brings a difficulty in accurately evaluating the architectures. Chu et al. [190] scrutinize weight-sharing NAS from a fairness perspective and demonstrate its effectiveness. However, there remain some aspects that cannot be explained clearly in the one-shot model: the weights in the SuperNet are coupled, and it is unclear why inherited weights for a specific architecture are still effective [137].

Making use of hardware can reduce the time, too. Jiang et al. [89] use a distributed asynchronous system which contains a major computing node and 20 individual workers. Each worker is responsible for training a single block and uploading its result to the major node in every generation. Wang et al. [108] design an infrastructure which is able to leverage all of the available GPU cards across multiple machines to concurrently perform the objective evaluation for a batch of individuals. Note that Colangelo et al. [191], [192] design a reconfigurable hardware framework that fits ENAS; as they claim, this is the first work on conducting NAS and hardware co-optimization.

Furthermore, Lu et al. [101] adopt the concept of proxy models, which are small-scale versions of the intended architectures.
Furthermore, Lu et al. [101] adopt the concept of proxy models, which are small-scale versions of the intended architectures. For example, in a CNN architecture, the number of layers and the number of channels in each layer are reduced. However, the drawback of this method is obvious: a loss of prediction accuracy. Therefore, they perform an experiment to determine the smallest proxy model that can still provide a reliable estimate of the performance at the larger scale.

All the above methods obtain the fitness of individuals by directly evaluating their performance on the validation dataset. An alternative is to use indirect methods, namely performance predictors. As summarized in [37], performance predictors can be divided into two categories: predictors based on the learning curve and end-to-end predictors, both of which follow the training-predicting computational paradigm. This does not mean that a performance predictor skips the training phase entirely; rather, it learns from the information obtained during the training phase and uses the learned knowledge to make a reasonable prediction for other architectures. Rawal et al. [34] use a learning-curve-based predictor, where the fitness is not calculated at the last epoch but predicted from the sequence of fitness values of the first epochs. Specifically, they use a Long Short-Term Memory (LSTM) [193] network as a sequence-to-sequence model: given the validation perplexity of the first epochs as input, it outputs the predicted validation perplexity of the subsequent epochs. Sun et al. [37] use an end-to-end performance predictor that does not need any extra information, not even the performance of the individuals during the first several epochs. Specifically, they adopt a random-forest-based method to accelerate the fitness evaluation in ENAS: when the random forest receives a newly generated architecture as input, an adaptive combination of a large number of regression trees that have been trained in advance gives the prediction.
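To give a flavour of the end-to-end predictor idea, the sketch below trains an off-the-shelf random forest on architecture encodings whose accuracies are already known and then scores unseen encodings without training them. It is only a simplified stand-in inspired by [37], not their implementation: the fixed-length numeric encoding and the synthetic data are assumptions made for illustration.

```python
# Minimal sketch of an end-to-end performance predictor: a regressor maps an
# architecture encoding directly to a predicted validation accuracy.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)

# Hypothetical history: 200 architectures that were actually trained, each
# represented by a 16-dimensional numeric encoding and its measured accuracy.
encodings = rng.random((200, 16))
accuracies = rng.random(200)

predictor = RandomForestRegressor(n_estimators=300, random_state=0)
predictor.fit(encodings, accuracies)

# During evolution, newly generated offspring are scored by the predictor
# instead of being trained, which removes the training cost from the loop.
offspring_encodings = rng.random((50, 16))
predicted_fitness = predictor.predict(offspring_encodings)
print(predicted_fitness[:5])
```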
VI. APPLICATIONS

This section lists the different fields in which ENAS has been applied. In principle, ENAS can be applied wherever neural networks can be applied. Table V shows the wide range of applications, and Table VI reports the performance of notable ENAS methods on two popular and challenging image classification datasets, namely CIFAR-10 and CIFAR-100. Together, these two tables illustrate what ENAS has achieved so far.
A. Overview
Table V shows the applications of ENAS. The statistics are incomplete, but they already cover a wide range of applications. Generally, these application fields can be grouped into the following five categories:

(1) Image and signal processing, including image classification (the most popular and competitive field), image-to-image processing (image restoration, image denoising, super-resolution and image inpainting), emotion recognition, speech recognition, language modeling and face de-identification.

(2) Biological and biomedical tasks, including medical image segmentation, malignant melanoma detection, sleep heart study and assessment of human sperm.

(3) Predictions and forecasting of all sorts of quantities, including the prediction of wind speed, car park occupancy, energy consumption, time series data, financial data and usable life, and the forecasting of electricity demand time series, traffic flow, electricity price and municipal waste.

(4) Machinery, including engine vibration prediction, Unmanned Aerial Vehicles (UAVs), bearing fault diagnosis and the prediction of general aviation flight data.

(5) Others, including crack detection of concrete, gamma-ray detection, multitask learning, galaxy identification, video understanding and comics understanding.
B. Comparison on CIFAR-10 and CIFAR-100
In Table V, it is obvious that many ENAS methods are applied to the image classification task. The benchmark dataset CIFAR-10 contains a total of ten classes, and CIFAR-100 is the more advanced dataset containing one hundred classes. These two datasets are widely used in image classification tasks, and the accuracy on these two challenging datasets can represent the capability of an architecture. We collect the well-performing ENAS methods tested on these two datasets. Table VI shows the test results of different state-of-the-art methods on the two datasets, where the methods are ranked in ascending order of their best accuracy on CIFAR-10, i.e., in descending order of their error rate. The columns "CIFAR-10" and "CIFAR-100" denote the error rate of each method on the corresponding dataset. Furthermore, "GPU Days" denotes the total search time of each method, which can be calculated by Equation (9).
TABLE V
APPLICATIONS
Category 1 (Image and signal processing):
  Image classification: [9], [19]–[21], [26], [29]–[32], [36], [38]–[40], [42]–[44], [49], [50], [56]–[58], [61], [62], [67], [68], [70], [71], [74], [78], [80], [83], [84], [88]–[90], [92], [95], [96], [98], [99], [101], [102], [104], [105], [107]–[110], [112], [114], [115], [118], [120], [121], [123], [125], [126], [128]–[133], [135], [137], [141], [149], [156], [158], [161], [162], [172], [173], [175], [183], [192], [194]–[197]
  Image to image: [46], [94], [97], [113], [119], [127], [152]
  Emotion recognition: [91], [145]
  Speech recognition: [54], [81], [103], [138], [163]
  Language modeling: [34], [115]
  Face de-identification: [198]
Category 2 (Biological and biomedical tasks):
  Medical image segmentation: [38], [69], [72], [100], [106], [116], [122]
  Malignant melanoma detection: [55], [77]
  Sleep heart study: [168]
  Assessment of human sperm: [117]
Category 3 (Predictions and forecasting):
  Wind speed prediction: [147]
  Electricity demand time series forecasting: [85]
  Traffic flow forecasting: [47]
  Electricity price forecasting: [87]
  Car park occupancy prediction: [154]
  Energy consumption prediction: [93]
  Time series data prediction: [155]
  Financial prediction: [199]
  Usable life prediction: [51]
  Municipal waste forecasting: [73]
Category 4 (Machinery):
  Engine vibration prediction: [134], [165]
  UAV: [35]
  Bearing fault diagnosis: [48]
  Predicting general aviation flight data: [136]
Category 5 (Others):
  Crack detection of concrete: [60]
  Gamma-ray detection: [79]
  Multitask learning: [200]
  Identify galaxies: [151]
  Video understanding: [66]
  Comics understanding: [53], [201]
GPU Days = The number of GPUs × t    (9)

where t denotes the search time of the method; for example, a method that occupies eight GPUs for three days costs 24 GPU Days. "Parameters" denotes the total number of parameters, which reflects both the capability and the complexity of an architecture.

Actually, this is not a totally fair comparison, for two main reasons: (1) The encoding spaces, including the initial space and the search space, are not the same. There are two extreme cases in the initial space: trivial initialization, which starts from the simplest architecture, and rich initialization, which starts from a well-designed architecture (e.g., ResNet-50 [2]). The sizes of the search spaces also differ greatly; e.g., Ref. [44] only includes the kernel size in its search space. (2) The different tricks used in the methods, e.g., "cutout", can also make the final results unfair. "Cutout" refers to a regularization method [204] used in the training of CNNs, which can improve the final performance.

Anyhow, Table VI shows the progress of ENAS in image classification according to the accuracy on CIFAR-10: Large-Scale Evolution [19] (5.4%, 2017), LEMONADE [105] (2.58%, 2018) and NSGANet [101] (2.02%, 2019). Many ENAS methods have a lower error rate than ResNet-110 [2], a manually well-designed architecture with a 6.43% error rate on CIFAR-10. Therefore, the architectures found by ENAS can reach the same level as, or exceed, the architectures designed by experts, which shows that ENAS is reliable and can be used in other application fields.

VII. CHALLENGES AND ISSUES
Despite the positive results of the existing methods, there are still some challenges and issues that need to be addressed.
A. The effectiveness
The effectiveness of ENAS has been questioned by many researchers. Wistuba et al. [24] note that random search can obtain a well-performing architecture and has proven to be an extremely strong baseline. Yu et al. [205] show that state-of-the-art NAS algorithms perform similarly to a random policy on average. Liashchynskyi et al. [206] compare grid search, random search and a GA for NAS, and find that the architectures obtained by the GA and by random search have similar performance. If random search can outperform EC-based NAS, there is no need to use complicated algorithms to guide the search process.

However, the population operator in [206] only contains a recombination operator, which does not represent the full effectiveness of ENAS. Although random search can find a well-performing architecture in an experiment, it cannot guarantee that it will find a good architecture every time. Moreover, in [99], [137], the evolutionary search is more effective than random search. There is an urgent need to design a carefully controlled experiment to assess the effectiveness of the state-of-the-art ENAS methods, especially in a large encoding space.
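For reference, the random-search baseline that these studies compare against can be expressed in a few lines. The encoding space and the evaluation routine below are hypothetical placeholders, so this is only an illustrative sketch of such a baseline, not the protocol of any cited work.

```python
# Minimal random-search baseline over a toy encoding space.
import random

SEARCH_SPACE = {
    "num_layers": list(range(2, 9)),
    "channels": [16, 32, 64, 128],
    "kernel_size": [3, 5, 7],
}


def sample_architecture():
    # Draw one architecture description uniformly at random from the space.
    return {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}


def evaluate(architecture):
    # Placeholder for "train the network and return its validation accuracy".
    return random.random()


def random_search(budget=100):
    best_architecture, best_fitness = None, float("-inf")
    for _ in range(budget):
        architecture = sample_architecture()
        fitness = evaluate(architecture)
        if fitness > best_fitness:
            best_architecture, best_fitness = architecture, fitness
    return best_architecture, best_fitness


print(random_search(budget=20))
```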
B. The mutation and the crossover
In Section IV, two types of operators were introduced. We note that some methods, such as Large-Scale Evolution [19], only use the single-individual-based operator (mutation) to generate offspring. The main reason why they do not adopt the crossover operator comes from two aspects: the first is simplicity [177], and the second is that simply combining a section of one individual with a section of another individual seems ill-suited to the neural network paradigm [36]. Also, in [24], the authors believe that there is no indication that a recombination operation applied to two individuals with high fitness would result in an offspring with similar or better fitness.

However, the supplementary materials of [20] demonstrate the effectiveness of the crossover operator in that method: it can find a good architecture in short order with the help of the crossover.
TABLE VI
THE COMPARISON OF THE ERROR RATE ON CIFAR-10 AND CIFAR-100

ENAS Method | GPU Days | Parameters (M) | CIFAR-10 (%) | CIFAR-100 (%) | Notes
CGP-DCNN [112] | unknown | 1.1 | 8.1 | | We choose the architecture with the lowest error rate.
EPT [84] | 2 | unknown | 7.5 | |
GeNet [29] | 17 | unknown | 7.1 | 29.03 | The report starting from the 40-layer wide residual network is not adopted.
EANN-Net [107] | unknown | unknown | 7.05 (mean over runs) | | This is different from the previous CGP-CNN.
(remaining rows omitted)

On the contrary, when the crossover operator is not used, the architecture found is not really promising unless the search runs for a long time. In fact, the mutation operator lets an individual explore the space around itself; it is a gradual, incremental search process, like searching step by step. By contrast, the crossover (recombination) operator can generate offspring that differ dramatically from their parents, which is more like a stride, so it has the ability to find a promising architecture efficiently. Chu et al. [119] hold the view that crossover mainly contributes to exploitation, whereas mutation is usually aimed at introducing exploration. These two operators therefore play different roles in the evolutionary process. However, there is still no sufficient explanation of how the crossover operator works, and additional experiments may need to be done on the methods that do not include the crossover operator.
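To make the difference between the two operator types concrete, the sketch below applies a local mutation and a one-point crossover to a hypothetical variable-length encoding in which an architecture is simply a list of layer descriptors; the encoding and the operator details are illustrative assumptions, not those of any particular ENAS method.

```python
# Minimal sketch of the two operator types on a variable-length encoding.
import random


def random_layer():
    # A layer descriptor in the hypothetical encoding.
    return {"type": random.choice(["conv", "pool"]),
            "channels": random.choice([16, 32, 64, 128])}


def mutate(architecture):
    """Single-individual operator: a small local change (step-by-step search)."""
    child = [dict(layer) for layer in architecture]
    choice = random.choice(["add", "remove", "modify"])
    if choice == "add":
        child.insert(random.randrange(len(child) + 1), random_layer())
    elif choice == "remove" and len(child) > 1:
        child.pop(random.randrange(len(child)))
    else:
        child[random.randrange(len(child))] = random_layer()
    return child


def crossover(parent_a, parent_b):
    """Multiple-individual operator: offspring may differ strongly from both parents."""
    cut_a = random.randrange(1, len(parent_a))
    cut_b = random.randrange(1, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:], parent_b[:cut_b] + parent_a[cut_a:]


parent_1 = [random_layer() for _ in range(4)]
parent_2 = [random_layer() for _ in range(6)]
print(mutate(parent_1))
print(crossover(parent_1, parent_2))
```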
C. Evaluation method

Section V-C has introduced the most popular and effective ways to reduce the evaluation time. In a nutshell, the question is how to strike a balance between the time spent and the accuracy of the evaluation. Because full training takes an unbearably long time, we must, without sufficient computing resources, compromise as little as possible on evaluation accuracy in exchange for a significant reduction in evaluation time.

Although many ENAS methods have adopted various ways to shorten the evaluation time, and Sun et al. [37] have even specifically proposed an acceleration method, research on search acceleration is only just getting started. The current approaches have many shortcomings that need to be addressed. Furthermore, there is no baseline or common assessment criterion for search acceleration methods. It remains a big challenge to propose a novel kind of method that evaluates architectures both accurately and quickly.
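As one simple illustration of this trade-off, the sketch below trains an architecture for at most a few epochs and stops as soon as the validation accuracy stops improving, using the best accuracy seen so far as the fitness. It is a generic sketch rather than the early-stopping policy of any cited method; train_one_epoch and validate are hypothetical placeholders.

```python
# Minimal sketch of an early-stopping fitness evaluation.
import random


def train_one_epoch(model):
    # Placeholder for one epoch of training on the training set.
    pass


def validate(model):
    # Placeholder returning the current validation accuracy.
    return random.random()


def early_stopping_fitness(model, max_epochs=20, patience=3):
    best_accuracy, epochs_without_improvement = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        accuracy = validate(model)
        if accuracy > best_accuracy:
            best_accuracy, epochs_without_improvement = accuracy, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop early and save the remaining training time
    return best_accuracy  # accuracy of the partially trained model as fitness


print(early_stopping_fitness(model=None))
```

The smaller max_epochs and patience are, the larger the saving but the noisier the resulting fitness, which is exactly the compromise discussed above.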
D. Future application
Table V shows the variety of applications that have been studied with ENAS so far, but these are only a small part of all the areas in which neural networks are applied. In fact, ENAS can be applied wherever neural networks can be applied, automating the architecture design that would otherwise have to be done by experts. Moreover, the many image classification successes of ENAS have shown that ENAS has the ability to replace experts in this regard, and automated architecture design is the trend.

However, this process is not yet fully automated: the encoding space (search space) still needs to be designed by experts for different applications. For example, for image processing tasks, CNNs are more suitable, so the encoding space contains layers such as convolutional layers, pooling layers and fully connected layers; for time series data processing, RNNs are more suitable, so the encoding space may contain cells such as the ∆-RNN cell, LSTM [193], the Gated Recurrent Unit (GRU) [207], the Minimally-Gated Unit (MGU) [208] and the Update-Gated RNN (UGRNN) [209]. These two manually determined encoding spaces already embody a great deal of human experience, and components without guaranteed performance are excluded. The open problem is: can a method search for the corresponding type of neural network for multiple tasks in one large encoding space that includes all the popular, widely used components? Instead of searching for one multitask network [200] that learns several tasks at once with the same neural network, the aim is to find appropriate networks for different tasks within one large encoding space.
E. Fair comparison

Section VI-B gave a brief introduction to the unfairness of current comparisons. The unfairness mainly comes from two aspects: (1) the training tricks, including cutout [204], ScheduledDropPath [210], etc.; (2) the different encoding spaces. Regarding aspect (1), we notice that some ENAS methods [20] report their results both with and without the tricks. Regarding aspect (2), well-designed search spaces are widely reused across different ENAS methods; for instance, the NASNet search space [210] is also used in [130], [131] because it is so well constructed that even random search can perform well on it. Comparisons under the same conditions can reveal the effectiveness of different search methods.

Fortunately, the first public benchmark dataset for NAS, NAS-Bench-101 [211], has been proposed. The dataset contains 423k unique convolutional architectures based on a cell-based encoding space. Each architecture can be queried for its metrics, including test accuracy, training time, etc., directly from the dataset without any large-scale computation. NAS-Bench-201 [212] was proposed more recently and is based on another cell-based encoding space that has no limitations on edges. Compared with NAS-Bench-101, which was only tested on CIFAR-10, this dataset collects test accuracies on three different image classification datasets (CIFAR-10, CIFAR-100 and ImageNet). However, its encoding space is relatively small and contains only 15.6k architectures. Experimenting with different ENAS methods on these benchmark datasets gives a fair comparison and does not take too much time. However, these datasets only cover cell-based encoding spaces and cannot contain the search spaces of all existing methods, because the other basic units (layers and blocks) involve more hyper-parameters, which may lead to a much larger encoding space.

In the future, a common platform for comparison needs to be built. This platform should provide several benchmark encoding spaces, such as the NASNet search space, NAS-Bench-101 and NAS-Bench-201, so that all ENAS methods can be tested on it directly. Furthermore, the platform also needs to address the fact that different kinds of GPUs have different computing power, which makes GPU Days computed on different hardware inaccurate to compare. GPU Days cannot be compared directly until they are measured against a common baseline of computing power.
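Assuming the publicly released NAS-Bench-101 API (the nasbench Python package together with a downloaded .tfrecord file), querying the metrics of a single cell looks roughly like the sketch below; the file path, the adjacency matrix and the operation list are arbitrary examples chosen for illustration.

```python
# Minimal sketch: query the tabulated metrics of one cell from NAS-Bench-101.
from nasbench import api

# The path to the downloaded benchmark file is a placeholder.
nasbench = api.NASBench('/path/to/nasbench_only108.tfrecord')

cell = api.ModelSpec(
    # Upper-triangular adjacency matrix of the cell (at most 7 nodes).
    matrix=[[0, 1, 1, 1, 0, 1, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 0]],
    # One operation per node, bracketed by the input and output nodes.
    ops=['input', 'conv1x1-bn-relu', 'conv3x3-bn-relu', 'conv3x3-bn-relu',
         'conv3x3-bn-relu', 'maxpool3x3', 'output'])

# The returned dictionary contains the tabulated metrics of this cell,
# e.g., its test accuracy and training time, so no training is needed.
data = nasbench.query(cell)
print(data['test_accuracy'], data['training_time'])
```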
VIII. CONCLUSION
This paper provided a comprehensive survey of ENAS. We introduced ENAS from four aspects, namely population initialization, population operators, evaluation and selection, following the common flowchart shown in Fig. 2. The various applications and the performance of the state-of-the-art methods on image classification were also summarized in tables to demonstrate the wide applicability and the promising ability of ENAS. Challenges and issues were discussed to identify future research in this field.

To be specific, the survey first described the different individual representations and encoding spaces. We divided the encoding space into two stages: the initial space and the search space, where the former defines the initial conditions whereas the latter limits the novel architectures that can be found in the evolutionary process. In addition, the constraints on the encoding space were introduced by category. Secondly, the population operators were introduced in two parts, namely the single-individual-based operators and the multiple-individual-based operators. Different EC algorithms use their respective rules to generate new individuals, and many novel methods have proposed improvements over the original algorithms to obtain a stable and reliable search capability. Thirdly, the evaluation criteria divide ENAS into two categories, single-objective and multi-objective, and we described the existing methods for shortening the long evaluation time, which is a huge obstacle to efficiency.

Although the state-of-the-art methods have achieved some success, ENAS still faces challenges and issues. The first important issue is whether the EC-based search strategy is useful: if the result is at the same level as the baseline (e.g., random search), it is unnecessary to design complex operators on the population, so a carefully designed experiment is urgently needed to assess the effectiveness, especially in a large encoding space. Secondly, the crossover operator is an undirected multiple-individual-based operator and there is no sufficient explanation of how it works. Besides, ENAS is just entering a new era, so there is a lot of uncharted territory to be explored. Moreover, a unified standard or platform is needed to make comparisons fair.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in
Proceedings of the IEEE conference on computer visionand pattern recognition , 2016, pp. 770–778.[3] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger,“Densely connected convolutional networks,” in
Proceedings of theIEEE conference on computer vision and pattern recognition , 2017, pp.4700–4708.[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-trainingof deep bidirectional transformers for language understanding,” arXivpreprint arXiv:1810.04805 , 2018.[5] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks forend-to-end speech recognition,” in . IEEE, 2017,pp. 4845–4849.[6] Y. Bengio, “Practical recommendations for gradient-based training ofdeep architectures,” in
Neural networks: Tricks of the trade . Springer,2012, pp. 437–478.[7] J. K. Kearney, W. B. Thompson, and D. L. Boley, “Optical flowestimation: An error analysis of gradient-based methods with localoptimization,”
IEEE Transactions on Pattern Analysis and MachineIntelligence , no. 2, pp. 229–244, 1987.[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” arXiv preprint arXiv:1409.1556 , 2014.[9] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automatedcnn architecture design based on blocks,”
IEEE transactions on neuralnetworks and learning systems , 2019.[10] B. Zoph and Q. V. Le, “Neural architecture search with reinforcementlearning,” arXiv preprint arXiv:1611.01578 , 2016. [11] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcementlearning: A survey,”
Journal of artificial intelligence research , vol. 4,pp. 237–285, 1996.[12] T. B¨ack, D. B. Fogel, and Z. Michalewicz, “Handbook of evolutionarycomputation,”
Release , vol. 97, no. 1, p. B1, 1997.[13] A. Krizhevsky, G. Hinton et al. , “Learning multiple layers of featuresfrom tiny images,” 2009.[14] D. E. Goldberg,
Genetic algorithms . Pearson Education India, 2006.[15] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone,
Geneticprogramming . Springer, 1998.[16] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in
Proceed-ings of ICNN’95-International Conference on Neural Networks , vol. 4.IEEE, 1995, pp. 1942–1948.[17] Y. Sun, G. G. Yen, and Z. Yi, “Igd indicator-based evolutionary algo-rithm for many-objective optimization problems,”
IEEE Transactionson Evolutionary Computation , vol. 23, no. 2, pp. 173–187, 2018.[18] A. Darwish, A. E. Hassanien, and S. Das, “A survey of swarmand evolutionary computing approaches for deep learning,”
ArtificialIntelligence Review , vol. 53, no. 3, pp. 1767–1812, 2020.[19] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V.Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in
Proceedings of the 34th International Conference on Machine Learning-Volume 70 . JMLR. org, 2017, pp. 2902–2911.[20] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, “Automaticallydesigning cnn architectures using the genetic algorithm for imageclassification,”
IEEE Transactions on Cybernetics , 2020.[21] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “A particle swarmoptimization-based flexible convolutional autoencoder for image classi-fication,”
IEEE transactions on neural networks and learning systems ,vol. 30, no. 8, pp. 2295–2309, 2018.[22] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature , vol. 521,no. 7553, pp. 436–444, 2015.[23] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: Asurvey,” arXiv preprint arXiv:1808.05377 , 2018.[24] M. Wistuba, A. Rawat, and T. Pedapati, “A survey on neural architecturesearch,” arXiv preprint arXiv:1905.01392 , 2019.[25] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen, “Designingneural networks through neuroevolution,”
Nature Machine Intelligence ,vol. 1, no. 1, pp. 24–35, 2019.[26] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programmingapproach to designing convolutional neural network architectures,” in
Proceedings of the Genetic and Evolutionary Computation Conference ,2017, pp. 497–504.[27] H. Kitano, “Designing neural networks using genetic algorithms withgraph generation system,”
Complex systems , vol. 4, no. 4, pp. 461–476,1990.[28] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci, “A hypercube-basedencoding for evolving large-scale neural networks,”
Artificial life , vol. 15,no. 2, pp. 185–212, 2009.[29] L. Xie and A. Yuille, “Genetic cnn,” in
Proceedings of the IEEEinternational conference on computer vision , 2017, pp. 1379–1388.[30] M. Loni, S. Sinaei, A. Zoljodi, M. Daneshtalab, and M. Sj¨odin,“Deepmaker: A multi-objective optimization framework for deep neuralnetworks in embedded systems,”
Microprocessors and Microsystems , p.102989, 2020.[31] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Evolving deep convolutionalneural networks for image classification,”
IEEE Transactions onEvolutionary Computation , 2019.[32] S. Fujino, N. Mori, and K. Matsumoto, “Deep convolutional networksfor human sketches by means of the evolutionary deep learning,” in . IEEE, 2017, pp. 1–5.[33] D. V. Vargas and S. Kotyan, “Evolving robust neural architecturesto defend from adversarial attacks,” arXiv preprint arXiv:1906.11667 ,2019.[34] A. Rawal and R. Miikkulainen, “From nodes to networks: Evolvingrecurrent neural networks,” arXiv preprint arXiv:1803.04439 , 2018.[35] A. Behjat, S. Chidambaran, and S. Chowdhury, “Adaptive genomicevolution of neural network topologies (agent) for state-to-actionmapping in autonomous agents,” in . IEEE, 2019, pp. 9638–9644.[36] L. Frachon, W. Pang, and G. M. Coghill, “Immunecs: Neural committeesearch by an artificial immune system,” arXiv preprint arXiv:1911.07729 ,2019. [37] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang, “Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor,” IEEE Transactions on EvolutionaryComputation , 2019.[38] P. R. Lorenzo and J. Nalepa, “Memetic evolution of deep neuralnetworks,” in
Proceedings of the Genetic and Evolutionary ComputationConference , 2018, pp. 505–512.[39] W. Irwin-Harris, Y. Sun, B. Xue, and M. Zhang, “A graph-basedencoding for evolutionary convolutional neural network architecturedesign,” in .IEEE, 2019, pp. 546–553.[40] E. Byla and W. Pang, “Deepswarm: Optimising convolutional neuralnetworks using swarm intelligence,” in
UK Workshop on ComputationalIntelligence . Springer, 2019, pp. 119–130.[41] C. Chiu and J. Zhan, “An evolutionary approach to compact dag neuralnetwork optimization,”
IEEE Access , vol. 7, pp. 178 331–178 341, 2019.[42] H. Tian, S.-C. Chen, M.-L. Shyu, and S. Rubin, “Automated neuralnetwork construction with similarity sensitive evolutionary algorithms,”in . IEEE, 2019, pp. 283–290.[43] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “An experimental study onhyper-parameter optimization for stacked auto-encoders,” in . IEEE, 2018, pp. 1–8.[44] A. Singh, S. Saha, R. Sarkhel, M. Kundu, M. Nasipuri, and N. Das,“A genetic algorithm based kernel-size selection approach for a multi-column convolutional neural network,” arXiv preprint arXiv:1912.12405 ,2019.[45] A. Dahou, M. A. Elaziz, J. Zhou, and S. Xiong, “Arabic sentimentclassification using convolutional neural network and differentialevolution algorithm,”
Computational intelligence and neuroscience ,vol. 2019, 2019.[46] H. Shu and Y. Wang, “Automatically searching for u-net image translatorarchitecture,” arXiv preprint arXiv:2002.11581 , 2020.[47] L. Li, L. Qin, X. Qu, J. Zhang, Y. Wang, and B. Ran, “Day-aheadtraffic flow forecasting based on a deep belief network optimized by themulti-objective particle swarm algorithm,”
Knowledge-Based Systems ,vol. 172, pp. 1–14, 2019.[48] S. R. Saufi, Z. A. bin Ahmad, M. S. Leong, and M. H. Lim, “Differentialevolution optimization for resilient stacked sparse autoencoder and itsapplications on bearing fault diagnosis,”
Measurement Science andTechnology , vol. 29, no. 12, p. 125002, 2018.[49] D. Kang and C. W. Ahn, “Efficient neural network space withgenetic search,” in
International Conference on Bio-Inspired Computing:Theories and Applications . Springer, 2019, pp. 638–646.[50] B. Cheung and C. Sable, “Hybrid evolution of convolutional networks,”in , vol. 1. IEEE, 2011, pp. 293–297.[51] C. Zhang, P. Lim, A. K. Qin, and K. C. Tan, “Multiobjective deep beliefnetworks ensemble for remaining useful life estimation in prognostics,”
IEEE transactions on neural networks and learning systems , vol. 28,no. 10, pp. 2306–2318, 2016.[52] F. Ye, “Particle swarm optimization-based automatic parameter selectionfor deep neural networks and its applications in large-scale and high-dimensional data,”
PloS one , vol. 12, no. 12, 2017.[53] S. Fujino, N. Mori, and K. Matsumoto, “Recognizing the order offour-scene comics by evolutionary deep learning,” in
International Sym-posium on Distributed Computing and Artificial Intelligence . Springer,2018, pp. 136–144.[54] T. Tanaka, T. Moriya, T. Shinozaki, S. Watanabe, T. Hori, and K. Duh,“Automated structure discovery and parameter tuning of neural networklanguage model based on evolution strategy,” in . IEEE, 2016, pp. 665–671.[55] A. Kwasigroch, M. Grochowski, and M. Mikolajczyk, “Deep neuralnetwork architecture search using network morphism,” in . IEEE, 2019, pp. 30–35.[56] H. Zhu, Z. An, C. Yang, K. Xu, E. Zhao, and Y. Xu, “Eena:efficient evolution of neural architecture,” in
Proceedings of the IEEEInternational Conference on Computer Vision Workshops , 2019, pp.0–0.[57] F. M. Johner and J. Wassner, “Efficient evolutionary architecture searchfor cnn optimization on gtsrb,” in . IEEE,2019, pp. 56–61.[58] M. Shen, K. Han, C. Xu, and Y. Wang, “Searching for accuratebinary neural architectures,” in
Proceedings of the IEEE InternationalConference on Computer Vision Workshops , 2019, pp. 0–0. [59] Y. Bi, B. Xue, and M. Zhang, “An evolutionary deep learning approachusing genetic programming with convolution operators for imageclassification,” in . IEEE, 2019, pp. 3197–3204.[60] S. Gibb, H. M. La, and S. Louis, “A genetic algorithm for convolutionalnetwork structure optimization for concrete crack detection,” in . IEEE, 2018, pp.1–8.[61] L. M. Zhang, “A new compensatory genetic algorithm-based methodfor effective compressed multi-function convolutional neural networkmodel selection with multi-objective optimization,” arXiv preprintarXiv:1906.11912 , 2019.[62] A. A. Ahmed, S. M. S. Darwish, and M. M. El-Sherbiny, “A novelautomatic cnn architecture design approach based on genetic algorithm,”in
International Conference on Advanced Intelligent Systems andInformatics . Springer, 2019, pp. 473–482.[63] E. Dufourq and B. A. Bassett, “Automated problem identification:Regression vs classification via evolutionary deep networks,” in
Pro-ceedings of the South African Institute of Computer Scientists andInformation Technologists , 2017, pp. 1–9.[64] S. Wei, S. Zou, F. Liao, W. Lang, and W. Wu, “Automatic modulationrecognition using neural architecture search,” in . IEEE, 2019, pp. 151–156.[65] P. Ortego, A. Diez-Olivan, J. Del Ser, F. Veiga, M. Penalva, and B. Sierra,“Evolutionary lstm-fcn networks for pattern classification in industrialprocesses,”
Swarm and Evolutionary Computation , vol. 54, p. 100650,2020.[66] A. Piergiovanni, A. Angelova, A. Toshev, and M. S. Ryoo, “Evolvingspace-time neural architectures for videos,” in
Proceedings of the IEEEInternational Conference on Computer Vision , 2019, pp. 1793–1802.[67] J. Prellberg and O. Kramer, “Lamarckian evolution of convolutionalneural networks,” in
International Conference on Parallel ProblemSolving from Nature . Springer, 2018, pp. 424–435.[68] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, and R. M. Patton,“Optimizing deep learning hyper-parameters through an evolutionaryalgorithm,” in
Proceedings of the Workshop on Machine Learning inHigh-Performance Computing Environments , 2015, pp. 1–5.[69] M. G. B. Calisto and S. K. Lai-Yuen, “Self-adaptive 2d-3d ensembleof fully convolutional networks for medical image segmentation,” in
Medical Imaging 2020: Image Processing , vol. 11313. InternationalSociety for Optics and Photonics, 2020, p. 113131W.[70] F. E. F. Junior and G. G. Yen, “Particle swarm optimization of deepneural networks architectures for image classification,”
Swarm andEvolutionary Computation , vol. 49, pp. 62–74, 2019.[71] B. Wang, Y. Sun, B. Xue, and M. Zhang, “A hybrid differential evolutionapproach to designing deep convolutional neural networks for imageclassification,” in
Australasian Joint Conference on Artificial Intelligence .Springer, 2018, pp. 237–250.[72] N. Qiang, B. Ge, Q. Dong, F. Ge, and T. Liu, “Neural architecturesearch for optimizing deep belief network models of fmri data,” in
International Workshop on Multiscale Multimodal Medical Imaging .Springer, 2019, pp. 26–34.[73] A. Camero, J. Toutouh, and E. Alba, “A specialized evolutionary strategyusing mean absolute error random sampling to design recurrent neuralnetworks,” arXiv preprint arXiv:1909.02425 , 2019.[74] A. I. Sharaf and E.-S. F. Radwan, “An automated approach fordeveloping a convolutional neural network using a modified fireflyalgorithm for image classification,” in
Applications of Firefly Algorithmand its Variants . Springer, 2020, pp. 99–118.[75] N. R. Sabar, A. Turky, A. Song, and A. Sattar, “An evolutionary hyper-heuristic to optimise deep belief networks for image reconstruction,”
Applied Soft Computing , p. 105510, 2019.[76] D. Sapra and A. D. Pimentel, “An evolutionary optimization algorithmfor gradually saturating objective functions,” in
Proceedings of theGenetic and Evolutionary Computation Conference , 2020.[77] A. Kwasigroch, M. Grochowski, and A. Mikołajczyk, “Neural archi-tecture search for skin lesion classification,”
IEEE Access , vol. 8, pp.9061–9071, 2020.[78] C. Schorn, T. Elsken, S. Vogel, A. Runge, A. Guntoro, and G. Ascheid,“Automated design of error-resilient and hardware-efficient deep neuralnetworks,” arXiv preprint arXiv:1909.13844 , 2019.[79] F. Assunc¸˜ao, J. Correia, R. Conceic¸˜ao, M. J. M. Pimenta, B. Tom´e,N. Lourenc¸o, and P. Machado, “Automatic design of artificial neuralnetworks for gamma-ray detection,”
IEEE Access , vol. 7, pp. 110 531–110 540, 2019. [80] D. Laredo, Y. Qin, O. Sch¨utze, and J.-Q. Sun, “Automatic modelselection for neural networks,” arXiv preprint arXiv:1905.06010 , 2019.[81] M. U. Anwar and M. L. Ali, “Boosting neuro evolutionary techniquesfor speech recognition,” in . IEEE, 2019, pp.1–5.[82] S. Y. Teng, A. C. M. Loy, W. D. Leong, B. S. How, B. L. F. Chin, andV. M´aˇsa, “Catalytic thermal degradation of chlorella vulgaris: Evolvingdeep neural networks for optimization,” Bioresource technology , vol.292, p. 121971, 2019.[83] S. Litzinger, A. Klos, and W. Schiffmann, “Compute-efficient neuralnetwork architecture optimization by a genetic algorithm,” in
Interna-tional Conference on Artificial Neural Networks . Springer, 2019, pp.387–392.[84] D. Sapra and A. D. Pimentel, “Constrained evolutionary piecemealtraining to design convolutional neural networks,” in
InternationalConference on Industrial, Engineering and Other Applications ofApplied Intelligent Systems. Springer , 2020.[85] F. Mart´ınez- ´Alvarez, G. Asencio-Cort´es, J. Torres, D. Guti´errez-Avil´es,L. Melgar-Garc´ıa, R. P´erez-Chac´on, C. Rubio-Escudero, J. Riquelme,and A. Troncoso, “Coronavirus optimization algorithm: A bioinspiredmetaheuristic based on the covid-19 propagation model,” arXiv preprintarXiv:2003.13633 , 2020.[86] E. Rapaport, O. Shriki, and R. Puzis, “Eegnas: Neural architecture searchfor electroencephalography data analysis and decoding,” in
InternationalWorkshop on Human Brain and Artificial Intelligence . Springer, 2019,pp. 3–20.[87] L. Peng, S. Liu, R. Liu, and L. Wang, “Effective long short-term memorywith differential evolution algorithm for electricity price prediction,”
Energy , vol. 162, pp. 1301–1314, 2018.[88] B. Dahal and J. Zhan, “Effective mutation and recombination for evolv-ing convolutional networks,” in
Proceedings of the 3rd InternationalConference on Applications of Intelligent Systems , 2020, pp. 1–6.[89] J. Jiang, F. Han, Q. Ling, J. Wang, T. Li, and H. Han, “Efficient networkarchitecture search via multiobjective particle swarm optimization basedon decomposition,”
Neural Networks , vol. 123, pp. 305–316, 2020.[90] J. Ren, Z. Li, J. Yang, N. Xu, T. Yang, and D. J. Foran, “Eigen:Ecologically-inspired genetic approach for neural network structuresearching from scratch,” in
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , 2019, pp. 9059–9068.[91] C.-C. Chung, W.-T. Lin, R. Zhang, K.-W. Liang, and P.-C. Chang,“Emotion estimation by joint facial expression and speech tonalityusing evolutionary deep learning structures,” in . IEEE, 2019, pp. 221–224.[92] A. Mart´ın, R. Lara-Cabrera, F. Fuentes-Hurtado, V. Naranjo, andD. Camacho, “Evodeep: a new evolutionary approach for automatic deepneural networks parametrisation,”
Journal of Parallel and DistributedComputing , vol. 117, pp. 180–191, 2018.[93] A. Almalaq and J. J. Zhang, “Evolutionary deep learning-based energyconsumption prediction for buildings,”
IEEE Access , vol. 7, pp. 1520–1531, 2018.[94] G. J. van Wyk and A. S. Bosman, “Evolutionary neural architecturesearch for image restoration,” in . IEEE, 2019, pp. 1–8.[95] B. Wang, Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolu-tional neural networks by variable-length particle swarm optimizationfor image classification,” in . IEEE, 2018, pp. 1–8.[96] M. P. Wang, “Evolving knowledge and structure through evolution-basedneural architecture search,” Master’s thesis, NTNU, 2019.[97] M. Suganuma, M. Ozay, and T. Okatani, “Exploiting the potentialof standard convolutional autoencoders for image restoration byevolutionary search,” arXiv preprint arXiv:1803.00370 , 2018.[98] F. Assunc¸˜ao, N. Lourenc¸o, P. Machado, and B. Ribeiro, “Fast denser:Efficient deep neuroevolution,” in
European Conference on GeneticProgramming . Springer, 2019, pp. 197–212.[99] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu,“Hierarchical representations for efficient architecture search,” arXivpreprint arXiv:1711.00436 , 2017.[100] W. Zhang, L. Zhao, Q. Li, S. Zhao, Q. Dong, X. Jiang, T. Zhang, andT. Liu, “Identify hierarchical structures from task-based fmri data viahybrid spatiotemporal neural architecture search net,” in
InternationalConference on Medical Image Computing and Computer-AssistedIntervention . Springer, 2019, pp. 745–753. [101] Z. Lu, I. Whalen, Y. Dhebar, K. Deb, E. Goodman, W. Banzhaf, andV. N. Boddeti, “Multi-criterion evolutionary design of deep convolutionalneural networks,” arXiv preprint arXiv:1912.01369 , 2019.[102] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor,“Particle swarm optimization for hyper-parameter selection in deep neuralnetworks,” in
Proceedings of the genetic and evolutionary computationconference , 2017, pp. 481–488.[103] V. Passricha and R. K. Aggarwal, “Pso-based optimized cnn for hindiasr,”
International Journal of Speech Technology , vol. 22, no. 4, pp.1123–1133, 2019.[104] H. Zhang, Y. Jin, R. Cheng, and K. Hao, “Sampled training and nodeinheritance for fast evolutionary neural architecture search,” arXivpreprint arXiv:2003.11613 , 2020.[105] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objectiveneural architecture search via lamarckian evolution,” arXiv preprintarXiv:1804.09081 , 2018.[106] M. Baldeon-Calisto and S. K. Lai-Yuen, “Adaresu-net: Multiobjectiveadaptive convolutional neural network for medical image segmentation,”
Neurocomputing , 2019.[107] Z. Chen, Y. Zhou, and Z. Huang, “Auto-creation of effective neuralnetwork architecture by evolutionary algorithm and resnet for imageclassification,” in . IEEE, 2019, pp. 3895–3900.[108] B. Wang, Y. Sun, B. Xue, and M. Zhang, “Evolving deep neuralnetworks by multi-objective particle swarm optimization for image clas-sification,” in
Proceedings of the Genetic and Evolutionary ComputationConference , 2019, pp. 490–498.[109] B. Fielding and L. Zhang, “Evolving image classification architectureswith enhanced particle swarm optimisation,”
IEEE Access , vol. 6, pp.68 560–68 575, 2018.[110] T. Cetto, J. Byrne, X. Xu, and D. Moloney, “Size/accuracy trade-offin convolutional neural networks: An evolutionary approach,” in
INNSBig Data and Deep Learning conference . Springer, 2019, pp. 17–26.[111] T. Elsken, J.-H. Metzen, and F. Hutter, “Simple and efficient ar-chitecture search for convolutional neural networks,” arXiv preprintarXiv:1711.04528 , 2017.[112] M. Loni, A. Majd, A. Loni, M. Daneshtalab, M. Sj¨odin, and E. Troubit-syna, “Designing compact convolutional neural network for embeddedstereo vision systems,” in . IEEE,2018, pp. 244–251.[113] D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, and Y. Wang, “Efficientresidual dense block search for image super-resolution,” arXiv preprintarXiv:1909.11409 , 2019.[114] M. Suganuma, M. Kobayashi, S. Shirakawa, and T. Nagao, “Evolu-tion of deep convolutional neural networks using cartesian geneticprogramming,”
Evolutionary Computation , vol. 28, no. 1, pp. 141–163,2020.[115] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon,B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al. , “Evolving deepneural networks,” in
Artificial Intelligence in the Age of Neural Networksand Brain Computing . Elsevier, 2019, pp. 293–312.[116] T. Hassanzadeh, D. Essam, and R. Sarker, “Evou-net: an evolutionarydeep fully convolutional neural network for medical image segmentation,”in
Proceedings of the 35th Annual ACM Symposium on AppliedComputing , 2020, pp. 181–189.[117] E. Miahi, S. A. Mirroshandel, and A. Nasr, “Genetic neural architecturesearch for automatic assessment of human sperm images,” arXiv preprintarXiv:1909.09432 , 2019.[118] N. Mitschke, M. Heizmann, K.-H. Noffz, and R. Wittmann, “Gradientbased evolution to optimize the structure of convolutional neuralnetworks,” in . IEEE, 2018, pp. 3438–3442.[119] X. Chu, B. Zhang, R. Xu, and H. Ma, “Multi-objective rein-forced evolution in mobile neural architecture search,” arXiv preprintarXiv:1901.01074 , 2019.[120] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, and X. Wang,“Reinforced evolutionary neural architecture search,” arXiv preprintarXiv:1808.00193 , 2018.[121] Y. Chen, T. Pan, C. He, and R. Cheng, “Efficient evolutionary deepneural architecture search (nas) by noisy network morphism mutation,”in
International Conference on Bio-Inspired Computing: Theories andApplications . Springer, 2019, pp. 497–508.[122] Z. Fan, J. Wei, G. Zhu, J. Mo, and W. Li, “Evolutionary neuralarchitecture search for retinal vessel segmentation,” arXiv , pp. arXiv–2001, 2020. [123] K. Chen and W. Pang, “Immunetnas: An immune-network approach forsearching convolutional neural network architectures,” arXiv preprintarXiv:2002.12704 , 2020.[124] P. Liu, M. D. El Basha, Y. Li, Y. Xiao, P. C. Sanelli, and R. Fang, “Deepevolutionary networks with expedited genetic algorithms for medicalimage denoising,” Medical image analysis , vol. 54, pp. 306–315, 2019.[125] M. Wistuba, “Deep learning architecture search by neuro-cell-basedevolution with function-preserving mutations,” in
Joint European Con-ference on Machine Learning and Knowledge Discovery in Databases .Springer, 2018, pp. 243–258.[126] O. Kramer, “Evolution of convolutional highway networks,” in
Inter-national Conference on the Applications of Evolutionary Computation .Springer, 2018, pp. 395–404.[127] X. Chu, B. Zhang, H. Ma, R. Xu, J. Li, and Q. Li, “Fast, accurateand lightweight super-resolution with neural architecture search,” arXivpreprint arXiv:1901.07261 , 2019.[128] Y. Chen, G. Meng, Q. Zhang, X. Zhang, L. Song, S. Xiang, andC. Pan, “Joint neural architecture search and quantization,” arXivpreprint arXiv:1811.09426 , 2018.[129] B. Wang, B. Xue, and M. Zhang, “Particle swarm optimisation forevolving deep neural networks for image classification by evolving andstacking transferable blocks,” arXiv preprint arXiv:1907.12659 , 2019.[130] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolutionfor image classifier architecture search,” in
Proceedings of the aaaiconference on artificial intelligence , vol. 33, 2019, pp. 4780–4789.[131] C. Saltori, S. Roy, N. Sebe, and G. Iacca, “Regularized evolutionaryalgorithm for dynamic neural topology search,” in
InternationalConference on Image Analysis and Processing . Springer, 2019, pp.219–230.[132] T. Wu, J. Shi, D. Zhou, Y. Lei, and M. Gong, “A multi-objectiveparticle swarm optimization for neural networks pruning,” in . IEEE, 2019, pp.570–577.[133] Y. Hu, S. Sun, J. Li, X. Wang, and Q. Gu, “A novel channelpruning method for deep neural network compression,” arXiv preprintarXiv:1805.11394 , 2018.[134] A. ElSaid, F. El Jamiy, J. Higgins, B. Wild, and T. Desell, “Optimizinglong short-term memory recurrent neural networks using ant colony op-timization to predict turbine engine vibration,”
Applied Soft Computing ,vol. 73, pp. 969–991, 2018.[135] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian, and C. Xu,“Cars: Continuous evolution for efficient neural architecture search,” arXiv preprint arXiv:1909.04977 , 2019.[136] T. Desell, S. Clachar, J. Higgins, and B. Wild, “Evolving deep recurrentneural networks using ant colony optimization,” in
European Conferenceon Evolutionary Computation in Combinatorial Optimization . Springer,2015, pp. 86–98.[137] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Singlepath one-shot neural architecture search with uniform sampling,” arXivpreprint arXiv:1904.00420 , 2019.[138] T. Shinozaki and S. Watanabe, “Structure discovery of deep neuralnetwork based on evolutionary algorithms,” in .IEEE, 2015, pp. 4979–4983.[139] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”in
Proceedings of the IEEE conference on computer vision and patternrecognition , 2015, pp. 1–9.[140] K. Deb,
Multi-objective optimization using evolutionary algorithms .John Wiley & Sons, 2001, vol. 16.[141] K. Maziarz, A. Khorlin, Q. de Laroussilhe, and A. Gesmundo,“Evolutionary-neural hybrid agents for architecture search,”
CoRR , vol.abs/1811.09828, 2018. [Online]. Available: http://arxiv.org/abs/1811.09828[142] T. Chen, I. Goodfellow, and J. Shlens, “Net2net: Accelerating learningvia knowledge transfer,” arXiv preprint arXiv:1511.05641 , 2015.[143] T. Wei, C. Wang, Y. Rui, and C. W. Chen, “Network morphism,” in
International Conference on Machine Learning , 2016, pp. 564–572.[144] K. Deb, R. B. Agrawal et al. , “Simulated binary crossover for continuoussearch space,”
Complex systems , vol. 9, no. 2, pp. 115–148, 1995.[145] Z. Gao, Y. Li, Y. Yang, X. Wang, N. Dong, and H.-D. Chiang, “Agpso-optimized convolutional neural networks for eeg-based emotionrecognition,”
Neurocomputing , vol. 380, pp. 225–235, 2020.[146] A. A. ElSaid, A. G. Ororbia, and T. J. Desell, “The ant swarm neuro-evolution procedure for optimizing recurrent networks,” arXiv preprintarXiv:1909.11849 , 2019. [147] M. Neshat, M. M. Nezhad, E. Abbasnejad, L. B. Tjernberg, D. A.Garcia, B. Alexander, and M. Wagner, “An evolutionary deep learningmethod for short-term wind speed prediction: A case study of thelillgrund offshore wind farm,” arXiv preprint arXiv:2002.09106 , 2020.[148] B. Wang, Y. Sun, B. Xue, and M. Zhang, “A hybrid ga-pso methodfor evolving architecture and short connections of deep convolutionalneural networks,” in
Pacific Rim International Conference on ArtificialIntelligence . Springer, 2019, pp. 650–663.[149] B. P. Evans, H. Al-Sahaf, B. Xue, and M. Zhang, “Genetic programmingand gradient descent: A memetic approach to binary image classification,” arXiv preprint arXiv:1909.13030 , 2019.[150] Y. Sun, G. G. Yen, and Z. Yi, “Evolving unsupervised deep neuralnetworks for learning meaningful representations,”
IEEE Transactionson Evolutionary Computation , vol. 23, no. 1, pp. 89–103, 2018.[151] D. Jones, A. Schroeder, and G. Nitschke, “Evolutionary deep learn-ing to identify galaxies in the zone of avoidance,” arXiv preprintarXiv:1903.07461 , 2019.[152] K. Ho, A. Gilbert, H. Jin, and J. Collomosse, “Neural architecturesearch for deep image prior,” arXiv preprint arXiv:2001.04776 , 2020.[153] J. D. Lamos-Sweeney, “Deep learning using genetic algorithms,” Ph.D.dissertation, Rochester Institute of Technology, 2012.[154] A. Camero, J. Toutouh, D. H. Stolfi, and E. Alba, “Evolutionarydeep learning for car park occupancy prediction in smart cities,” in
International Conference on Learning and Intelligent Optimization .Springer, 2018, pp. 386–401.[155] A. ElSaid, S. Benson, S. Patwardhan, D. Stadem, and T. Desell,“Evolving recurrent neural networks for time series data prediction ofcoal plant parameters,” in
International Conference on the Applicationsof Evolutionary Computation (Part of EvoStar) . Springer, 2019, pp.488–503.[156] J. Hajewski and S. Oliveira, “An evolutionary approach to variationalautoencoders,” in . IEEE, 2020, pp. 0071–0077.[157] M. M¨artens and D. Izzo, “Neural network architecture search withdifferentiable cartesian genetic programming for regression,” in
Pro-ceedings of the Genetic and Evolutionary Computation ConferenceCompanion , 2019, pp. 181–182.[158] B. Evans, H. Al-Sahaf, B. Xue, and M. Zhang, “Evolutionary deeplearning: A genetic programming approach to image classification,” in . IEEE,2018, pp. 1–6.[159] S. Bianco, M. Buzzelli, G. Ciocca, and R. Schettini, “Neural architecturesearch for image saliency fusion,”
Information Fusion , vol. 57, pp. 89–101, 2020.[160] P. J. Angeline, G. M. Saunders, and J. B. Pollack, “An evolutionaryalgorithm that constructs recurrent neural networks,”
IEEE transactionson Neural Networks , vol. 5, no. 1, pp. 54–65, 1994.[161] L. Rodriguez-Coayahuitl, A. Morales-Reyes, and H. J. Escalante, “Evolv-ing autoencoding structures through genetic programming,”
GeneticProgramming and Evolvable Machines , vol. 20, no. 3, pp. 413–440,2019.[162] I. Loshchilov and F. Hutter, “Cma-es for hyperparameter optimizationof deep neural networks,” arXiv preprint arXiv:1604.07269 , 2016.[163] T. Tanaka, T. Shinozaki, S. Watanabe, and T. Hori, “Evolution strategybased neural network optimization and lstm language model for robustspeech recognition,”
Cit. on , vol. 130, 2016.[164] A. Camero, J. Toutouh, J. Ferrer, and E. Alba, “Waste generationprediction under uncertainty in smart cities through deep neuroevolution,”
Revista Facultad de Ingenier´ıa Universidad de Antioquia , no. 93, pp.128–138, 2019.[165] A. ElSaid, F. E. Jamiy, J. Higgins, B. Wild, and T. Desell, “Usingant colony optimization to optimize long short-term memory recurrentneural networks,” in
Proceedings of the Genetic and EvolutionaryComputation Conference , 2018, pp. 13–20.[166] C. Zhang, H. Shao, and Y. Li, “Particle swarm optimisation forevolving artificial neural network,” in
Smc 2000 conference proceedings.2000 ieee international conference on systems, man and cybernet-ics.’cybernetics evolving to systems, humans, organizations, and theircomplex interactions’(cat. no. 0 , vol. 4. IEEE, 2000, pp. 2487–2490.[167] J. K. Kim, Y. S. Han, and J. S. Lee, “Particle swarm optimization–deep belief network–based rare class prediction model for highly classimbalance problem,”
Concurrency and Computation: Practice andExperience , vol. 29, no. 11, p. e4128, 2017.[168] I. De Falco, G. De Pietro, A. Della Cioppa, G. Sannino, U. Scafuri, andE. Tarantino, “Evolution-based configuration optimization of a deepneural network for the classification of obstructive sleep apnea episodes,”
Future Generation Computer Systems , vol. 98, pp. 377–391, 2019. [169] S. Roy and N. Chakraborti, “Development of an evolutionary deepneural net for materials research,” in TMS 2020 149th Annual Meeting& Exhibition Supplemental Proceedings . Springer, 2020, pp.817–828.[170] J. Liang, E. Meyerson, B. Hodjat, D. Fink, K. Mutch, and R. Miikku-lainen, “Evolutionary neural automl for deep learning,” in
Proceedingsof the Genetic and Evolutionary Computation Conference , 2019, pp.401–409.[171] J. Z. Liang et al. , “Evolutionary neural architecture search for deeplearning,” Ph.D. dissertation, 2019.[172] J. Huang, W. Sun, and L. Huang, “Deep neural networks compressionlearning based on multiobjective evolutionary algorithms,”
Neurocom-puting , vol. 378, pp. 260–269, 2020.[173] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, andW. Banzhaf, “Nsga-net: neural architecture search using multi-objectivegenetic algorithm,” in
Proceedings of the Genetic and EvolutionaryComputation Conference , 2019, pp. 419–427.[174] H. Zhu and Y. Jin, “Real-time federated evolutionary neural architecturesearch,” arXiv preprint arXiv:2003.02793 , 2020.[175] J. Dong, A. Cheng, D. Juan, W. Wei, and M. Sun, “Ppp-net: Platform-aware progressive search for pareto-optimal neural architectures,” in . OpenReview.net, 2018. [Online]. Available:https://openreview.net/forum?id=B1NT3TAIM[176] D. Hossain and G. Capi, “Multiobjective evolution of deep learningparameters for robot manipulator object recognition and grasping,”
Advanced Robotics , vol. 32, no. 20, pp. 1090–1101, 2018.[177] J. Bayer, D. Wierstra, J. Togelius, and J. Schmidhuber, “Evolving mem-ory cell structures for sequence learning,” in
International Conferenceon Artificial Neural Networks . Springer, 2009, pp. 755–764.[178] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast andelitist multiobjective genetic algorithm: Nsga-ii,”
IEEE transactions onevolutionary computation , vol. 6, no. 2, pp. 182–197, 2002.[179] Q. Zhang and H. Li, “Moea/d: A multiobjective evolutionary algo-rithm based on decomposition,”
IEEE Transactions on evolutionarycomputation , vol. 11, no. 6, pp. 712–731, 2007.[180] K. Deb and H. Jain, “An evolutionary many-objective optimizationalgorithm using reference-point-based nondominated sorting approach,part i: solving problems with box constraints,”
IEEE transactions onevolutionary computation , vol. 18, no. 4, pp. 577–601, 2013.[181] C. A. C. Coello, G. T. Pulido, and M. S. Lechuga, “Handling multipleobjectives with particle swarm optimization,”
IEEE Transactions onevolutionary computation , vol. 8, no. 3, pp. 256–279, 2004.[182] M. R. Sierra and C. A. C. Coello, “Improving pso-based multi-objective optimization using crowding, mutation and ∈ -dominance,”in International conference on evolutionary multi-criterion optimization .Springer, 2005, pp. 505–519.[183] M. Javaheripi, M. Samragh, T. Javidi, and F. Koushanfar, “Genecai:Genetic evolution for acquiring compact ai,” arXiv preprintarXiv:2004.04249 , 2020.[184] J. An, H. Xiong, J. Ma, J. Luo, and J. Huan, “Stylenas: An empiricalstudy of neural architecture search to uncover surprisingly fast end-to-end universal style transfer networks,” arXiv preprint arXiv:1906.02470 ,2019.[185] D. R. So, C. Liang, and Q. V. Le, “The evolved transformer,” arXivpreprint arXiv:1901.11117 , 2019.[186] F. Assunc¸˜ao, N. Lourenc¸o, P. Machado, and B. Ribeiro, “Evolving thetopology of large scale deep neural networks,” in
European Conferenceon Genetic Programming . Springer, 2018, pp. 19–34.[187] F. Assunc¸ao, N. Lourenc¸o, P. Machado, and B. Ribeiro, “Denser: deepevolutionary network structured representation,”
Genetic Programmingand Evolvable Machines , vol. 20, no. 1, pp. 5–35, 2019.[188] J. Liu, M. Gong, Q. Miao, X. Wang, and H. Li, “Structure learningfor deep neural networks based on multiobjective optimization,”
IEEEtransactions on neural networks and learning systems , vol. 29, no. 6,pp. 2450–2463, 2017.[189] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. V.Le, “Understanding and simplifying one-shot architecture search,”in
Proceedings of the 35th International Conference on MachineLearning, ICML 2018, Stockholmsm¨assan, Stockholm, Sweden, July10-15, 2018 , ser. Proceedings of Machine Learning Research, J. G. Dyand A. Krause, Eds., vol. 80. PMLR, 2018, pp. 549–558. [Online].Available: http://proceedings.mlr.press/v80/bender18a.html[190] X. Chu, B. Zhang, R. Xu, and J. Li, “Fairnas: Rethinking evaluationfairness of weight sharing neural architecture search,” arXiv preprintarXiv:1907.01845 , 2019. [191] P. Colangelo, O. Segal, A. Speicher, and M. Margala, “Artificial neuralnetwork and accelerator co-design using evolutionary algorithms,” in .IEEE, 2019, pp. 1–8.[192] C. Philip, S. Oren, S. Alexander, and M. Martin, “Evolutionarycell aided design for neural network architectures,” arXiv preprintarXiv:1903.02130 , 2019.[193] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neuralcomputation , vol. 9, no. 8, pp. 1735–1780, 1997.[194] S. S. Tirumala, “Evolving deep neural networks using coevolutionaryalgorithms with multi-population strategy,”
Neural Computing andApplications , pp. 1–14, 2020.[195] F. Assunc¸˜ao, N. Lourenc¸o, P. Machado, and B. Ribeiro, “Fast-denser++:Evolving fully-trained deep artificial neural networks,” arXiv preprintarXiv:1905.02969 , 2019.[196] F. Assunc¸ao, N. Lourenc¸o, B. Ribeiro, and P. Machado, “Incrementalevolution and development of deep artificial neural networks,” in
GeneticProgramming: 23rd European Conference, EuroGP 2020, Held as Partof EvoStar 2020, Seville, Spain, April 15–17, 2020, Proceedings 23 .Springer, 2020, pp. 35–51.[197] C. Wei, C. Niu, Y. Tang, and J. Liang, “Npenas: Neural predic-tor guided evolution for neural architecture search,” arXiv preprintarXiv:2003.12857 , 2020.[198] J. Song, Y. Jin, Y. Li, and C. Lang, “Learning structural similarity withevolutionary-gan: A new face de-identification method,” in . IEEE, 2019, pp. 1–6.[199] L. M. Zhang, “Genetic deep neural networks using different activationfunctions for financial data mining,” in . IEEE, 2015, pp. 2849–2851.[200] J. Liang, E. Meyerson, and R. Miikkulainen, “Evolutionary architecturesearch for deep multitask networks,” in
Proceedings of the Genetic andEvolutionary Computation Conference , 2018, pp. 466–473.[201] S. Fujino, T. Hatanaka, N. Mori, and K. Matsumoto, “The evolutionarydeep learning based on deep convolutional neural network for the animestoryboard recognition,” in
International Symposium on DistributedComputing and Artificial Intelligence . Springer, 2017, pp. 278–285.[202] J.-D. Dong, A.-C. Cheng, D.-C. Juan, W. Wei, and M. Sun, “Dpp-net:Device-aware progressive search for pareto-optimal neural architectures,”in
Proceedings of the European Conference on Computer Vision (ECCV) ,2018, pp. 517–531.[203] A.-C. Cheng, J.-D. Dong, C.-H. Hsu, S.-H. Chang, M. Sun, S.-C. Chang,J.-Y. Pan, Y.-T. Chen, W. Wei, and D.-C. Juan, “Searching towardpareto-optimal device-aware neural architectures,” in
Proceedings of theInternational Conference on Computer-Aided Design , 2018, pp. 1–7.[204] T. DeVries and G. W. Taylor, “Improved regularization of convolutionalneural networks with cutout,” arXiv preprint arXiv:1708.04552 , 2017.[205] K. Yu, C. Sciuto, M. Jaggi, C. Musat, and M. Salzmann, “Evaluat-ing the search phase of neural architecture search,” arXiv preprintarXiv:1902.08142 , 2019.[206] P. Liashchynskyi and P. Liashchynskyi, “Grid search, random search,genetic algorithm: A big comparison for nas,” arXiv preprintarXiv:1912.06059 , 2019.[207] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation ofgated recurrent neural networks on sequence modeling,” arXiv preprintarXiv:1412.3555 , 2014.[208] G.-B. Zhou, J. Wu, C.-L. Zhang, and Z.-H. Zhou, “Minimal gated unitfor recurrent neural networks,”
International Journal of Automationand Computing , vol. 13, no. 3, pp. 226–234, 2016.[209] J. Collins, J. Sohl-Dickstein, and D. Sussillo, “Capacity and trainabilityin recurrent neural networks,” arXiv preprint arXiv:1611.09913 , 2016.[210] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferablearchitectures for scalable image recognition,” in
Proceedings of theIEEE conference on computer vision and pattern recognition , 2018, pp.8697–8710.[211] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter,“Nas-bench-101: Towards reproducible neural architecture search,” in
International Conference on Machine Learning , 2019, pp. 7105–7114.[212] X. Dong and Y. Yang, “Nas-bench-201: Extending the scope ofreproducible neural architecture search,” in