
Abstract

The deep Convolutional Neural Network (CNN) is the state-of-the-art solution for large-scale visual recognition. Following basic principles such as increasing the depth and constructing highway connections, researchers have manually designed a lot of fixed network structures and verified their effectiveness. In this paper, we discuss the possibility of learning deep network structures automatically. Note that the number of possible network structures increases exponentially with the number of layers in the network, which inspires us to adopt the genetic algorithm to efficiently traverse this large search space. We first propose an encoding method to represent each network structure in a fixed-length binary string, and initialize the genetic algorithm by generating a set of randomized individuals. In each generation, we define standard genetic operations, e.g., selection, mutation and crossover, to eliminate weak individuals and then generate more competitive ones. The competitiveness of each individual is defined as its recognition accuracy, which is obtained via training the network from scratch and evaluating it on a validation set. We run the genetic process on two small datasets, i.e., MNIST and CIFAR10, demonstrating its ability to evolve and find high-quality structures which have been little studied before. These structures are also transferable to the large-scale ILSVRC2012 dataset.


Genetic CNN

Lingxi Xie, Alan Yuille
Center for Imaging Science, The Johns Hopkins University, Baltimore, MD, USA


1. Introduction

Visual recognition is a fundamental task in computer vision with a wide range of applications. Recently, the state-of-the-art algorithms for visual recognition have mostly been based on the deep Convolutional Neural Network (CNN). Starting from the fundamental network model for large-scale image classification [17], researchers have been increasing the depth of the network [29], as well as designing new inner structures [32][10] to improve recognition accuracy. Although these modern networks have been shown to be efficient, we note that their structures are manually designed, not learned, which limits the flexibility of the approach.

In this paper, we explore the possibility of automatically learning the structure of deep neural networks. We consider a constrained case, in which the network has a limited number of layers, and each layer is designed as a set of pre-defined building blocks such as convolution and pooling. Even under these limitations, the total number of possible network structures grows exponentially with the number of layers. Therefore, it is impractical to enumerate all the candidates and find the best one. Instead, we formulate this problem as optimization in a large search space, and apply the genetic algorithm to traverse the space efficiently.

The genetic algorithm involves constructing an initial generation of individuals (candidate solutions), and performing genetic operations to allow them to evolve in a genetic process. To this end, we propose an encoding method to represent each network structure by a fixed-length binary string. After that, we define several standard genetic operations, i.e., selection, mutation and crossover, which eliminate weak individuals of the previous generation and use the survivors to generate more competitive ones. The quality of each individual is determined by its recognition accuracy on a reference dataset. Throughout the genetic process, we evaluate each individual (i.e., network structure) by training it from scratch. The genetic process comes to an end after a fixed number of generations.

It is worth emphasizing that the genetic algorithm is computationally expensive, because we need to conduct a complete network training process for each generated individual. Therefore, we run the genetic process on two small datasets, i.e., MNIST and CIFAR10, and demonstrate its ability to find high-quality network structures. It is interesting to see that the generated structures, most of which have been less studied before, often perform better than the standard manually designed ones. Finally, we transfer the learned structures to large-scale experiments and verify their effectiveness.

The remainder of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 illustrates the way of using the genetic algorithm to design network structures. Experiments are shown in Section 4, and conclusions are drawn in Section 5.

arXiv [cs.CV], Mar.

2. Related Work

Image classification is a fundamental problem in computer vision. Recently, researchers have extended conventional classification tasks [18][7] into large-scale environments such as ImageNet [5] and Places [44]. With the availability of powerful computational resources (e.g., GPU), Convolutional Neural Networks (CNNs) [17][29] have shown superior performance over the conventional Bag-of-Visual-Words (BoVW) models [3][35][26].

A CNN is a hierarchical model for large-scale visual recognition. It is based on the observation that a network with enough neurons is able to fit any complicated data distribution. In past years, neural networks were shown to be effective for simple recognition tasks [20]. More recently, the availability of large-scale training data (e.g., ImageNet [5]) and powerful GPUs has made it possible to train deep CNNs [17] which significantly outperform BoVW models. A CNN is composed of several stacked layers. In each of them, responses from the previous layer are convolved with a filter bank and activated by a differentiable non-linearity. Hence, a CNN can be considered as a composite function, which is trained by back-propagating error signals defined by the difference between the supervision and prediction at the top layer. Recently, several efficient methods were proposed to help CNNs converge faster and prevent over-fitting, such as ReLU activation [17], batch normalization [15], Dropout [11] and DisturbLabel [39]. Features extracted from pre-trained neural networks can be generalized to other recognition tasks [36][40].

Designing powerful CNN structures is an intriguing problem. It is believed that deeper networks produce better recognition results [29][32]. Adding highway information has also been verified to be useful [10][42]. Efforts have also been made to add invariance into the network structure [38]. We find some work which uses stochastic [14] or dense [13] structures, but all these network structures are deterministic (although stochastic operations are used in [14] to accelerate training and prevent over-fitting), which limits the flexibility of the models and thus inspires us to automatically learn network structures.

The genetic algorithm is a metaheuristic inspired by the process of natural selection. Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems [12][27][2][4] by relying on bio-inspired operators such as mutation, crossover and selection.

A typical genetic algorithm requires two prerequisites, i.e., a genetic representation of the solution domain, and a fitness function to evaluate each individual. A good example is the travelling-salesman problem (TSP) [9], a famous NP-complete problem which aims at finding the optimal Hamiltonian path in a graph of $N$ nodes. In this situation, each feasible solution is represented as a permutation of $\{1, 2, \ldots, N\}$, and the fitness function is the total cost (distance) of the path. We will show later that deep neural networks can be encoded into a binary string.

The core idea of the genetic algorithm is to allow individuals to evolve via some genetic operations. Popular operations include selection, mutation, crossover, etc. The selection process allows us to preserve strong individuals while eliminating weak ones. The ways of performing mutation and crossover vary from case to case, often based on the properties of the specific problem. For example, in the TSP problem with the permutation-based representation, a possible set of mutations is to change the order of two visited nodes. These operations are also used in our work.

There is a lot of research on how to improve the performance of genetic algorithms, including performing local search [34] and generating random keys [30]. In our work, we show that the vanilla genetic algorithm works well enough without these tricks. We also note that some previous work applied the genetic algorithm to exploring efficient neural network architectures [41][31][1][6], but our work aims at learning the architecture of modern CNNs, which is not studied in these prior works.
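As an illustration of the two prerequisites in the TSP example, the following minimal sketch represents a solution as a permutation, scores it by total path length, and mutates it by swapping two visited nodes. The function names and the four-node coordinates are hypothetical and chosen only for this example; they are not taken from the cited work.

```python
import math
import random


def tour_length(order, coords):
    """Fitness of a TSP individual: total length of the path visiting
    the nodes in the given order (lower is better)."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(order, order[1:]))


def swap_mutation(order, rng):
    """Mutation for the permutation representation: exchange the
    positions of two distinct visited nodes."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return child


# Hypothetical toy instance: four nodes on the corners of a 3 x 4 rectangle.
coords = {1: (0, 0), 2: (3, 0), 3: (3, 4), 4: (0, 4)}
parent = [1, 2, 3, 4]
child = swap_mutation(parent, random.Random(0))
print(tour_length(parent, coords), child, tour_length(child, coords))
```

The mutated child is always a valid permutation, which is the property a problem-specific mutation has to preserve.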

3. Our Approach

This section presents a genetic algorithm for designing competitive network structures. First, we describe a way of representing the network structure by a fixed-length binary string. Next, several genetic operations are defined, including selection, mutation and crossover, so that we can traverse the search space efficiently and find high-quality solutions.

Throughout this work, the genetic algorithm is only used to propose new network structures; the parameters and classification accuracy of each structure are obtained via standalone training-from-scratch.

We provide a binary string representation for a network structure in a constrained case. We first note that many state-of-the-art network structures [29][10] can be partitioned into several stages. In each stage, the geometric dimensions (width, height and depth) of the layer cube remain unchanged. Neighboring stages are connected via a spatial pooling operation, which may change the spatial resolution. All the convolutional operations within one stage have the same number of filters, or channels.

[Figure 1 diagram: Stage 1 takes the 32 × 32 × 3 input, contains nodes A0 to A5 (Code: 1-00-111) and is followed by POOL1; Stage 2 takes the 16 × 16 × 32 output of POOL1, contains nodes B0 to B6 (Code: 0-00-100-0101) and is followed by POOL2. The encoding area is highlighted in each stage.]

Figure 1. A two-stage network ($S = 2$, $(K_1, K_2) = (4, 5)$) and the encoded binary string (best viewed in color PDF). The default input and output nodes (see Section 3.1.1) and the connections from and to these nodes are marked in red and green, respectively. We only encode the connections in the effective parts (regions with light blue background). Within each stage, the number of convolutional filters is a constant (32 in Stage 1, 64 in Stage 2), and the spatial resolution remains unchanged ($32 \times 32$ in Stage 1, $16 \times 16$ in Stage 2). Each pooling layer down-samples the data by a factor of 2. ReLU and batch normalization are added after each convolution.

We borrow this idea to define a family of networks which can be encoded into fixed-length binary strings. A network is composed of $S$ stages, and the $s$-th stage, $s = 1, 2, \ldots, S$, contains $K_s$ nodes, denoted by $v_{s,k_s}$, $k_s = 1, 2, \ldots, K_s$. The nodes within each stage are ordered, and we only allow connections from a lower-numbered node to a higher-numbered node. Each node corresponds to a convolutional operation, which takes place after element-wise summing up all its input nodes (lower-numbered nodes that are connected to it). After convolution, batch normalization [15] and ReLU [17] follow, which have been verified to be efficient in training very deep neural networks [29]. We do not encode the fully-connected part of a network.

In each stage, we use $1 + 2 + \ldots + (K_s - 1) = \frac{1}{2} K_s (K_s - 1)$ bits to encode the inter-node connections. The first bit represents the connection between $(v_{s,1}, v_{s,2})$, then the following two bits represent the connections between $(v_{s,1}, v_{s,3})$ and $(v_{s,2}, v_{s,3})$, etc. This process continues until the last $K_s - 1$ bits are used to represent the connections between $v_{s,1}, v_{s,2}, \ldots, v_{s,K_s - 1}$ and $v_{s,K_s}$. For $1 \leq i < j \leq K_s$, if the code corresponding to $(v_{s,i}, v_{s,j})$ is 1, there is an edge connecting $v_{s,i}$ and $v_{s,j}$, i.e., $v_{s,j}$ takes the output of $v_{s,i}$ as a part of the element-wise summation; if the code is 0, there is no such edge.

Figure 1 illustrates two examples of network encoding. To summarize, an $S$-stage network with $K_s$ nodes at the $s$-th stage is encoded into a binary string of length $L = \frac{1}{2} \sum_s K_s (K_s - 1)$. Equivalently, there are in total $2^L$ possible network structures. This number may be very large. In the CIFAR10 experiments (see Section 4.2), we have $S = 3$ and $(K_1, K_2, K_3) = (3, 4, 5)$, therefore $L = 19$ and $2^L = 524,288$. It is computationally intractable to enumerate all these structures and find the optimal one(s). In the following parts, we adopt the genetic algorithm to efficiently explore good candidates in this large space.

To make every binary string valid, we define two default nodes in each stage. The default input node, denoted as $v_{s,0}$, receives data from the previous stage, performs convolution, and sends its output to every node without a predecessor, e.g., $v_{s,1}$. The default output node, denoted as $v_{s,K_s+1}$, receives data from all nodes without a successor, e.g., $v_{s,K_s}$, sums them up, performs convolution, and sends its output to the pooling layer. Note that the connections between the ordinary nodes and the default nodes are not encoded.

There are two special cases. First, if an ordinary node $v_{s,i}$ is isolated (i.e., it is not connected to any other ordinary node $v_{s,j}$, $i \neq j$), then it is simply ignored, i.e., it is connected to neither the default input node nor the default output node (see the B2 node in Figure 1). This is to guarantee that a stage with more nodes can simulate all structures represented by a stage with fewer nodes. Second, if there are no connections at a stage, i.e., all bits in the binary string are 0, then the convolutional operation is performed only once, not twice (one for the default input node and one for the default output node).
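The bit ordering and the role of the default nodes can be made concrete with a short sketch. It decodes the Stage 1 code from Figure 1 into a list of edges and then derives which ordinary nodes attach to the default input and output nodes. The function names are hypothetical, and the code only mirrors the rules stated in the text; it is not the authors' implementation.

```python
from typing import List, Tuple


def stage_code_length(k: int) -> int:
    """Number of bits for a stage with k ordinary nodes: 1 + 2 + ... + (k - 1)."""
    return k * (k - 1) // 2


def decode_stage(code: str, k: int) -> List[Tuple[int, int]]:
    """Return the edges (i, j), i < j, encoded by one stage's bit string.

    Dashes are accepted as visual separators, as in "1-00-111".
    """
    bits = code.replace("-", "")
    if len(bits) != stage_code_length(k) or set(bits) - {"0", "1"}:
        raise ValueError("code does not match a stage with %d nodes" % k)
    edges = []
    pos = 0
    for j in range(2, k + 1):      # target node, in increasing order
        for i in range(1, j):      # candidate sources v_{s,1} .. v_{s,j-1}
            if bits[pos] == "1":
                edges.append((i, j))
            pos += 1
    return edges


def default_connections(edges, k):
    """Nodes fed by the default input node, nodes feeding the default
    output node, and isolated nodes (which are ignored)."""
    has_pred = {j for _, j in edges}
    has_succ = {i for i, _ in edges}
    active = has_pred | has_succ
    from_input = sorted(n for n in active if n not in has_pred)
    to_output = sorted(n for n in active if n not in has_succ)
    isolated = [n for n in range(1, k + 1) if n not in active]
    return from_input, to_output, isolated


edges = decode_stage("1-00-111", 4)
print(edges)                          # [(1, 2), (1, 4), (2, 4), (3, 4)]
print(default_connections(edges, 4))  # ([1, 3], [4], [])
```

In this example nodes 1 and 3 have no predecessor and therefore receive the default input, while node 4 is the only node without a successor and feeds the default output.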

[Figure 2 diagram: building blocks with $K = 4$ nodes each. VGGNet, Code: 1-01-001; ResNet, Code: 1-01-101; DenseNet, Code: 1-11-111.]

Figure 2. The basic building blocks of VGGNet [29] and ResNet [10] can be encoded as binary strings defined in Section 3.1.
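The three codes shown in Figure 2 can be reproduced by encoding the corresponding edge lists. The sketch below is a hypothetical helper, and the edge lists are our reading of the codes (a plain chain, a chain with one skip connection, and all-to-all connections), stated here as an assumption rather than taken from the figure itself.

```python
def encode_stage(edges, k):
    """Encode a set of edges (i, j), i < j, among k ordinary nodes as a
    dash-separated bit string, one group of bits per target node."""
    edge_set = set(edges)
    groups = []
    for j in range(2, k + 1):
        groups.append("".join("1" if (i, j) in edge_set else "0" for i in range(1, j)))
    return "-".join(groups)


K = 4
chain = [(1, 2), (2, 3), (3, 4)]                            # VGGNet-style block
residual = chain + [(1, 4)]                                 # chain plus one skip connection
dense = [(i, j) for j in range(2, K + 1) for i in range(1, j)]  # every pair connected

print(encode_stage(chain, K))     # 1-01-001
print(encode_stage(residual, K))  # 1-01-101
print(encode_stage(dense, K))     # 1-11-111
```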

Many popular network structures can be represented using the proposed encoding scheme. Examples include VGGNet [29], ResNet [10], and a modified variant of DenseNet [13], which are illustrated in Figure 2.

Currently, only convolutional and pooling operations are considered, which makes it impossible to generate some tricky network modules such as Maxout [8]. Also, the size of convolutional filters is fixed within each stage, which limits our network from incorporating multi-scale information as in the inception module [32]. However, we note that all the encoding-based approaches have such limitations. Our approach can be easily modified to include more types of layers and more flexible inter-layer connections. As shown in experiments, we can achieve competitive recognition performance using merely these basic building blocks.

As shown in a recently published work using reinforcement learning to explore neural architectures [45], this type of method often requires heavy computation to traverse the huge solution space. Fortunately, our method can be easily generalized and scaled up, which is done via learning the architecture on a small dataset and transferring the learned information to large-scale datasets. Please refer to the experimental part for details.

The flowchart of the genetic process is shown in Algorithm 1. It starts with an initialized generation of $N$ randomized individuals. Then, we perform $T$ rounds, or $T$ generations, each of which consists of three operations, i.e., selection, mutation and crossover. The fitness function of each individual is evaluated via training-from-scratch on the reference dataset.

We initialize a set of randomized models $\{M_{0,n}\}_{n=1}^{N}$. Each model is a binary string with $L$ bits, i.e., $M_{0,n}: b_{0,n} \in \{0, 1\}^L$. Each bit in each individual is independently sampled from a Bernoulli distribution: $b_{0,n}^{l} \sim \mathcal{B}(0.5)$, $l = 1, 2, \ldots, L$. After this, we evaluate each individual (see Section 3.2.4) to obtain their fitness function values. As we shall see in Section 4.1.3, different strategies of initialization do not impact the genetic performance too much. Even starting with a naive initialization (all individuals are all-zero strings), the genetic process can discover quite competitive structures with crossover and mutation.

The selection process is performed at the beginning of every generation. Before the $t$-th generation, the $n$-th individual $M_{t-1,n}$ is assigned a fitness function, which is defined as the recognition rate $r_{t-1,n}$ obtained in the previous generation or initialization. $r_{t-1,n}$ directly impacts the probability that $M_{t-1,n}$ survives the selection process.

We perform a Russian roulette process to determine which individuals survive. Each individual in the next generation $M_{t,n}$ is determined independently by a non-uniform sampling over the set $\{M_{t-1,n}\}_{n=1}^{N}$. The probability of sampling $M_{t-1,n}$ is proportional to $r_{t-1,n} - r_{t-1,0}$, where $r_{t-1,0} = \min_{n=1}^{N} \{r_{t-1,n}\}$ is the minimal fitness function value in the previous generation. This means that the best individual has the largest probability of being selected, and the worst one is always eliminated. As the number of individuals $N$ remains unchanged, each individual in the previous generation may be selected multiple times.
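The initialization and selection steps described above can be sketched in a few lines. The function names are hypothetical. The uniform fallback used when all individuals have the same fitness is our own assumption, since the text does not cover that case.

```python
import random


def init_population(n, length, rng, all_zero=False):
    """N individuals of L bits each, sampled from Bernoulli(0.5), or the
    naive all-zero initialization when all_zero is True."""
    if all_zero:
        return [[0] * length for _ in range(n)]
    return [[1 if rng.random() < 0.5 else 0 for _ in range(length)] for _ in range(n)]


def select(population, fitness, rng):
    """Sample N survivors with probability proportional to r_n - r_min,
    so the worst individual has zero probability of being selected."""
    r_min = min(fitness)
    weights = [r - r_min for r in fitness]
    if sum(weights) == 0:
        # Assumption: identical fitness values fall back to uniform sampling.
        weights = [1.0] * len(population)
    chosen = rng.choices(range(len(population)), weights=weights, k=len(population))
    return [list(population[i]) for i in chosen]


rng = random.Random(0)
population = init_population(20, 19, rng)
# Hypothetical fitness values standing in for recognition rates.
fitness = [0.70 + 0.01 * i for i in range(20)]
survivors = select(population, fitness, rng)
print(len(survivors))  # 20
```

Because sampling is done with replacement, a strong individual can appear several times among the survivors, which matches the remark that an individual may be selected multiple times.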
The mutation process of an individual $M_{t,n}$ involves flipping each bit independently with a probability $q_M$. In practice, $q_M$ is often small, so that mutation is not likely to change one individual too much. This is to preserve the good properties of a surviving individual while providing an opportunity to try out new possibilities.

The crossover process involves changing two individuals simultaneously. Instead of considering each bit individually, the basic unit in crossover is a stage, which is motivated by the need to retain the local structures within each stage. Similar to mutation, each pair of corresponding stages is exchanged with a small probability $q_C$.

Both mutation and crossover are implemented by an overall flowchart (see Algorithm 1). The probabilities of mutation and crossover for each individual (or pair) are $p_M$ and $p_C$, respectively. We understand that there are many different ways of mutation and crossover. As shown in experiments, our simple choice leads to competitive performance.

Algorithm 1 The Genetic Process for Network Design
Input: the reference dataset $D$, the number of generations $T$, the number of individuals in each generation $N$, the mutation and crossover probabilities $p_M$ and $p_C$, the mutation parameter $q_M$, and the crossover parameter $q_C$.
Initialization: generating a set of randomized individuals $\{M_{0,n}\}_{n=1}^{N}$, and computing their recognition accuracies;
for $t = 1, 2, \ldots, T$ do
    Selection: producing a new generation $\{M'_{t,n}\}_{n=1}^{N}$ with a Russian roulette process on $\{M_{t-1,n}\}_{n=1}^{N}$;
    Crossover: for each pair $\{(M_{t,2n-1}, M_{t,2n})\}_{n=1}^{\lfloor N/2 \rfloor}$, performing crossover with probability $p_C$ and parameter $q_C$;
    Mutation: for each non-crossover individual $\{M_{t,n}\}_{n=1}^{N}$, doing mutation with probability $p_M$ and parameter $q_M$;
    Evaluation: computing the recognition accuracy for each new individual $\{M_{t,n}\}_{n=1}^{N}$;
end for
Output: a set of individuals in the final generation $\{M_{T,n}\}_{n=1}^{N}$ with their recognition accuracies.

After the above processes, each individual $M_{t,n}$ is evaluated to obtain the fitness function value. A reference dataset $D$ is pre-defined, and we individually train each model $M_{t,n}$ from scratch. If $M_{t,n}$ is previously evaluated, we simply evaluate it once again and compute the average accuracy over all its occurrences. This strategy, at least to some extent, alleviates the instability caused by the randomness in the training process.
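The crossover, mutation and evaluation steps of Algorithm 1 might be sketched as follows. All names are hypothetical, individuals are plain lists of bits, and `train_and_eval` stands in for the full training-from-scratch run, which is not reproduced here.

```python
import random
from collections import defaultdict


def crossover(a, b, stage_lengths, q_c, rng):
    """Exchange each pair of corresponding stages with probability q_c."""
    a, b = list(a), list(b)
    start = 0
    for length in stage_lengths:
        end = start + length
        if rng.random() < q_c:
            a[start:end], b[start:end] = b[start:end], a[start:end]
        start = end
    return a, b


def mutate(bits, q_m, rng):
    """Flip each bit independently with probability q_m."""
    return [1 - b if rng.random() < q_m else b for b in bits]


def vary(population, stage_lengths, p_m, q_m, p_c, q_c, rng):
    """Crossover on consecutive pairs, then mutation on the individuals
    that did not take part in a crossover."""
    new = [list(ind) for ind in population]
    crossed = set()
    for n in range(0, len(new) - 1, 2):
        if rng.random() < p_c:
            new[n], new[n + 1] = crossover(new[n], new[n + 1], stage_lengths, q_c, rng)
            crossed.update((n, n + 1))
    for n in range(len(new)):
        if n not in crossed and rng.random() < p_m:
            new[n] = mutate(new[n], q_m, rng)
    return new


class AveragedFitness:
    """Re-evaluate an individual on every occurrence and report the
    average accuracy over all of its occurrences."""

    def __init__(self, train_and_eval):
        self.train_and_eval = train_and_eval  # placeholder for real training
        self.history = defaultdict(list)

    def __call__(self, bits):
        runs = self.history[tuple(bits)]
        runs.append(self.train_and_eval(bits))
        return sum(runs) / len(runs)
```

Stage lengths of `(3, 6, 10)` correspond to the $(K_1, K_2, K_3) = (3, 4, 5)$ setting; keeping whole stages intact during crossover is what preserves the local structure inside each stage.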

4. Experiments

The proposed genetic algorithm requires a very large amount of computational resources, which makes it intractable to evaluate directly on large-scale datasets such as ILSVRC2012 [28]. Our solution is to explore promising network structures on small datasets such as MNIST [19] and CIFAR10 [16], then transfer these structures to the large-scale recognition tasks.

The MNIST dataset [19] defines a handwritten digit recognition task. There are 60,000 images for training and 10,000 images for testing, all of which are $28 \times 28$ grayscale images. Both training and testing data are uniformly distributed over 10 categories, i.e., digits from 0 to 9. To avoid using the testing data, we leave … images from the training set for validation.

We follow the basic LeNet for MNIST recognition. The original network is abbreviated as:

C5@20-MP2S2-C5@50-MP2S2-FC500-D0.5-FC10.

Here, C5@20 is a convolutional layer with a kernel size 5, a default spatial stride 1 and the number of kernels 20; MP2S2 is a max-pooling layer with a kernel size 2 and a spatial stride 2; FC500 is a fully-connected layer with 500 outputs; and D0.5 is a Dropout layer with a drop ratio 0.5. We apply … training epochs with learning rate …, followed by … epochs with learning rate …, and another … epoch with learning rate ….

We set $S = 2$, $(K_1, K_2) = (3, 5)$, and keep the fully-connected part of LeNet unchanged. The first convolutional layer within each stage remains the same as in the original LeNet, and other convolutional layers take the kernel size … × … and the same channel number. The length $L$ of each binary string is 13, which means that there are $2^{13} = 8,192$ possible individuals.

We create an initial generation with $N = 20$ individuals, and run the genetic process for $T = 50$ rounds. Other parameters are set as $p_M$ = 0.…, $q_M$ = 0.…, $p_C$ = 0.… and $q_C$ = 0.…. We set relatively high mutation and crossover probabilities to facilitate the generation of new structures. The maximal number of explored individuals is $20 \times (50 + 1) = 1,020 < 8,192$. The training phase of each individual takes an average of … minutes on a modern Titan-X GPU, and the entire genetic process takes about … GPU-days, which makes it possible to repeat it with different settings for diagnosis, e.g., to explore different initialization options (see Section 4.1.3).

Results are summarized in Table 1. With the genetic operations, we can find competitive network structures which achieve high recognition accuracy. Although over a short period the recognition rate of the best individual is not improved, the average and median accuracies generally get higher from generation to generation. This is very important, because it guarantees that the genetic algorithm improves the overall quality of the individuals. According to our diagnosis in Section 4.1.2, this matters for the genetic process, since the quality of a new individual is positively correlated to the quality of its parent(s). After 50 generations, the recognition error rate of the best individual drops from 0.41% to 0.34%.

Gen    Max %    Min %    Avg %    Med %    Std-D
00     99.59    99.38    99.50    99.50    0.…
…      99.61    99.40    99.53    99.54    0.…
…      99.62    99.43    99.55    99.58    0.…
…      99.62    99.40    99.56    99.58    0.…
…      99.62    99.46    99.57    99.57    0.…
…      99.63    99.40    99.57    99.60    0.…
…      99.63    99.50    99.59    99.62    0.…
…      99.63    99.45    99.61    99.63    0.…
…      99.64    99.49    99.61    99.64    0.…
…      99.66    99.51    99.62    99.65    0.…

Table 1. Recognition accuracy (%) on the MNIST testing set. The zeroth generation is the initialized generation. We set $S = 2$ and $(K_1, K_2) = (3, 5)$.

Figure 3. The relationship in accuracy between the parent(s) and the child(ren) (best viewed in color PDF). A point is bigger and close to blue if the recognition error rate is lower, otherwise it is smaller and close to red. The points on the horizontal axis are from mutation operations, while others are from crossover operations.

We perform diagnostic experiments to verify the hypothesis that a better individual is more likely to generate a good individual via mutation or crossover. For this, we randomly select several occurrences of mutation and crossover in the CIFAR10 genetic process, and observe the relationship between an individual and its parent(s). Figure 3 supports our point. We argue that the genetic operations tend to preserve a fraction of the good local properties, so that the excellent “genes” from the parent(s) are more likely to be preserved.

[Figure 4 plot: GeNet on MNIST. Horizontal axis: Generation Number; vertical axis: Classification Accuracy (%); curves: Random Initialization and All-zero Initialization.]

Figure 4. The average recognition accuracy over all individuals with respect to the generation number. The bars indicate the highest and lowest accuracies in the corresponding generation.

Finally, we observe the impact of different initialized networks. For this, we start a naive population with $N = 20$ all-zero individuals, and use the same parameters for a complete genetic process. Results are shown in Figure 4. We find that, although the all-zero string corresponds to a very simple and less competitive network structure, the genetic algorithm is able to generate strong individuals after several generations. This naive initialization achieves the initial performance of randomized individuals with about … generations. After about … generations, there is almost no difference, by statistics, between these two populations.

The CIFAR10 dataset [16] is a subset of the 80-million tiny image database [33]. There are 50,000 images for training and 10,000 images for testing, all of which are $32 \times 32$ RGB images. CIFAR10 contains 10 basic categories, and both training and testing data are uniformly distributed over these categories. To avoid using the testing data, we leave … images from the training set for validation.

We follow a revised LeNet for CIFAR10 recognition. The original network is abbreviated as:

C5(P2)@8-MP3(S2)-C5(P2)@16-MP3(S2)-C5(P2)@32-MP3(S2)-FC128-D0.5-FC10.

Note that we significantly reduce the filter numbers at each stage to accelerate the training phase. We will show later that this does not prevent the genetic process from learning promising network structures. We apply … training epochs with learning rate …, followed by … epochs with learning rate …, … epochs with learning rate … and another … epochs with learning rate ….

We keep the fully-connected part of the above network unchanged, and set $S = 3$ and $(K_1, K_2, K_3) = (3, 4, 5)$. Similarly, the first convolutional layer within each stage remains the same as in the original LeNet, and other convolutional layers take the kernel size … × … and the same channel number. The length $L$ of each binary string is 19, which means that there are $2^{19} = 524,288$ possible individuals.

We create an initial generation with $N = 20$ individuals, and run the genetic process for $T = 50$ rounds. Other parameters are set to be $p_M$ = 0.…, $q_M$ = 0.…, $p_C$ = 0.… and $q_C$ = 0.…. The mutation and crossover parameters $q_M$ and $q_C$ are set to be smaller because the strings become longer. The maximal number of explored individuals is $20 \times (50 + 1) = 1,020 \ll 524,288$. The training phase of each individual takes an average of … hour, and the entire genetic process takes about … GPU-days.

We perform two individual genetic processes. The results of one process are summarized in Table 2. As in the MNIST experiments, all the important statistics (e.g., average and median accuracies) grow from generation to generation. We also report the best network structures in the table, and visualize the best structures throughout these two processes in Figure 5.
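The search-space sizes and exploration budgets quoted for the two settings follow directly from the encoding length and from $N \times (T + 1)$. A minimal arithmetic check, with a hypothetical helper name:

```python
def search_space(stage_sizes, n_individuals, n_generations):
    """Return (L, 2**L, N * (T + 1)): the string length, the number of
    encodable structures, and the maximal number of explored individuals."""
    length = sum(k * (k - 1) // 2 for k in stage_sizes)
    return length, 2 ** length, n_individuals * (n_generations + 1)


print(search_space((3, 5), 20, 50))     # (13, 8192, 1020)     MNIST setting
print(search_space((3, 4, 5), 20, 50))  # (19, 524288, 1020)   CIFAR10 setting
```

In the CIFAR10 setting the budget covers well under 1% of the encodable structures, which is why an exhaustive search is not an option.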

Under our encoding scheme, a straightforward way of network construction is to set all bits to be 1, which leads to a network in which any two layers within the same stage are connected. This network produces a …% recognition rate, which is a little bit lower than those reported in Table 2. Considering that the densely-connected network requires heavier computational overheads, we conclude that the genetic algorithm helps to find more effective and efficient structures than the dense connections.

In Figure 5, we plot the network structures learned from two individual genetic processes. The structures learned by the genetic algorithm are quite different from the manually designed ones, although some manually designed local structures are observed, like the chain-shaped networks, multi-path networks and highway networks. We emphasize that these two networks are obtained by independent genetic processes, which demonstrates that our genetic process generally converges to similar network structures.

We apply the networks learned from the

CIFAR10 ex-periments to more small-scale datasets. We test threedatasets, i.e. , CIFAR10 , CIFAR100 and

SVHN . CI-FAR100 is an extension to

CIFAR10 containing ﬁner-grained categories. It has the same numbers of training

12 34 56 01

12 34 56

Code: 1-01 Code: 1-01

Chain-ShapedNetworks  AlexNet  VGGNet

Code: 0-01-100Code: 1-01-100Code: 0-11-101-0001 Code: 0-11-101-0001

Multiple-PathNetworks  GoogLeNet

HighwayNetworks  Deep ResNet

Figure 5. Two network structures learned from the two indepen-dent genetic processes (best viewed in color PDF). and testing images as

CIFAR10 , and these images are alsouniformly distributed over categories.

SVHN (Street View House Numbers) [25] is a large collection of × RGB images, i.e., , training samples, , testing samples, and , extra training samples. We preprocess the data as in the previous work [25], i.e., selecting samples per category from the training set as well as samples per category from the extra set, using these , images for validation, and the remaining , images as training samples. We also use Local Contrast Normalization (LCN) for preprocessing [8].

We evaluate the best network structure in each generation of the genetic process. We resume using a large number of filters at each stage, i.e., the three stages and the first fully-connected layer are equipped with , , and filters, respectively. The training strategy, including the numbers of epochs and learning rates, remains the same as in the previous experiments.

Gen   Max %   Min %   Avg %   Med %   Std-D   Best Network Structure
00    .96     71.81   74.39   74.53   0.
      .96     73.93   75.01   75.17   0.
      .96     73.95   75.32   75.48   0.
      .06     73.47   75.37   75.62   0.
      .24     72.60   75.32   75.65   0.
      .59     74.75   75.77   75.86   0.
      .72     73.92   75.68   75.80   0.
      .83     74.91   76.45   76.79   0.
      .95     74.38   76.42   76.53   0.
      .06     75.34   76.58   76.

Table 2. Recognition accuracy (%) on the CIFAR10 testing set. The zeroth generation is the initialized generation. We set S = 3 and (K_1, K_2, K_3) = (3, , ).

Method                   SVHN    CF10    CF100
Zeiler et al. [43]       .80     15.13   42.
Goodfellow et al. [8]    .47     9.38    38.
Lin et al. [24]          .35     8.81    35.
Lee et al. [22]          .92     7.97    34.
Liang et al. [23]        .77     7.09    31.
Lee et al. [21]          .69     6.05    32.
Zagoruyko et al. [42]    .85     5.37    24.
Xie et al. [37]          .67     5.31    25.
Huang et al. [14]        .75     5.25    24.
Huang et al. [13]        .59     3.74    19.
GeNet after G-00         2.25    8.18    31.
GeNet after G-05         2.15    7.67    30.
GeNet after G-20         2.05    7.36    29.
GeNet                    .99     7.
GeNet                    .97     7.

Table 3. Comparison of the recognition error rate (%) with the state of the art. We apply data augmentation on all these datasets.
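The SVHN validation split described above draws a fixed number of samples per category from the training set and from the extra set, and keeps everything else for training. A minimal Python sketch of that selection follows. The function names (`split_per_category`, `build_svhn_split`), the `seed` argument and the use of uniform random sampling are assumptions made for illustration; the per-category counts are passed in as parameters rather than fixed.

```python
import random
from collections import defaultdict


def split_per_category(labels, n_per_class, seed=0):
    """Pick n_per_class indices per label; return (picked, remaining) index lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    picked = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_per_class:
            raise ValueError(f"category {label!r} has only {len(members)} samples")
        # Assumption: samples are chosen uniformly at random within a category.
        picked.extend(rng.sample(members, n_per_class))

    picked_set = set(picked)
    remaining = [i for i in range(len(labels)) if i not in picked_set]
    return sorted(picked), remaining


def build_svhn_split(train_labels, extra_labels, n_train_val, n_extra_val, seed=0):
    """Combine per-category picks from both subsets into validation / training lists.

    Each sample is identified by a (subset_name, index) pair.
    """
    val_t, rest_t = split_per_category(train_labels, n_train_val, seed)
    val_e, rest_e = split_per_category(extra_labels, n_extra_val, seed + 1)
    validation = [("train", i) for i in val_t] + [("extra", i) for i in val_e]
    training = [("train", i) for i in rest_t] + [("extra", i) for i in rest_e]
    return validation, training
```

With label lists loaded from the two SVHN subsets, `build_svhn_split(train_labels, extra_labels, n_train_val, n_extra_val)` returns disjoint validation and training index lists whose sizes add up to the total number of samples in both subsets.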

We compare our results with some state-of-the-art methods in Table 3. Among these competitors, we note that the densely-connected network [13] is closely related to our work. Although GeNet ( layers) produces lower recognition accuracy, we note that the structures used in [42][37][14][13] are much deeper (e.g., – layers). Since dense connection is often not the best option (see Section 4.2.2), we believe that it is possible to use the genetic algorithm to optimize the connections used in [13].

We evaluate the top- networks on the ILSVRC2012 classification task [28], a subset of the ImageNet database [5] which contains , object categories. The training set, validation set and testing set contain . , and images, respectively. The input images are of × × pixels. We first apply the first two stages in the VGGNet ( convolutional layers and two pooling layers) to change the data dimension to × × . Then, we apply the two networks shown in Figure 5, and adjust the numbers of filters at three stages to , and (following VGGNet), respectively. After these stages, we obtain a × × data cube. We preserve the fully-connected layers in VGGNet with the dropout rate . . We apply the training strategy as in VGGNet. The entire training process of each network takes around GPU-days.

Results are summarized in Table 4. We can see that, in general, structures learned from a small dataset (CIFAR10) can be transferred to large-scale visual recognition (ILSVRC2012). Our model achieves better performance than VGGNet, because the original three chain-shaped stages are replaced by the automatically learned structures.

Method             Top-1   Top-5   Depth
AlexNet [17]
GoogLeNet [32]
VGGNet-16 [29]
VGGNet-19 [29]
ResNet-50 [10]
ResNet-101 [10]
ResNet-152 [10]
GeNet              .12     9.95    22
GeNet              .87     9.74    22

Table 4. Top-1 and top-5 recognition error rates (%) on the ILSVRC2012 dataset. For all competitors, we report the single-model performance without using any complicated data augmentation in testing. These numbers are copied from this page: . We use the networks shown in Figure 5, and name them as GeNet.

Conclusions

This paper applies the genetic algorithm to designing the structure of deep neural networks. We first propose an encoding method to represent each network structure with a fixed-length binary string, then use some popular genetic operations such as mutation and crossover to explore the search space efficiently. Different initialization strategies make little difference to the genetic process. We conduct the genetic algorithm with a relatively small reference dataset (CIFAR10), and find that the generated structures transfer well to other tasks, including the large-scale ILSVRC2012 dataset.

Despite the interesting results we have obtained, our algorithm suffers from several drawbacks. First, a large fraction of network structures are still unexplored, including those with non-convolutional modules like Maxout [8], and the multi-scale strategy used in the inception module [32]. Second, in the current work, the genetic algorithm is only used to explore the network structure, whereas the network training process is performed separately. It would be very interesting to incorporate the genetic algorithm into training the network structure and weights simultaneously. These directions are left for future work.
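The encoding and the genetic operations summarized above can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in this section: each stage code is written as dash-separated groups, the group for node j holds one bit per earlier node i < j (1 means node i feeds node j), mutation flips each bit independently with a given probability, and crossover swaps whole stages between two individuals. The function names and the probabilities `flip_prob` and `swap_prob` are hypothetical, and the example individual is assembled from codes of the form that appear in the Figure 5 labels.

```python
import random


def decode_stage(code):
    """Decode one stage code such as '0-11-101-0001' into directed edges (i, j), i < j.

    Assumed layout: the first group describes node 2, the second node 3, and so on;
    within a group, the k-th bit says whether node k is connected to that node.
    """
    edges = []
    for g_index, group in enumerate(code.split("-")):
        target = g_index + 2
        if len(group) != target - 1 or set(group) - {"0", "1"}:
            raise ValueError(f"malformed group {group!r} in code {code!r}")
        for source, bit in enumerate(group, start=1):
            if bit == "1":
                edges.append((source, target))
    return edges


def mutate_stage(code, flip_prob, rng):
    """Flip each bit independently with probability flip_prob; dashes are kept."""
    out = []
    for ch in code:
        if ch in "01" and rng.random() < flip_prob:
            out.append("1" if ch == "0" else "0")
        else:
            out.append(ch)
    return "".join(out)


def crossover(stages_a, stages_b, swap_prob, rng):
    """Swap corresponding stages of two individuals with probability swap_prob each."""
    child_a, child_b = [], []
    for sa, sb in zip(stages_a, stages_b):
        if rng.random() < swap_prob:
            sa, sb = sb, sa
        child_a.append(sa)
        child_b.append(sb)
    return child_a, child_b


rng = random.Random(0)
individual = ["1-01", "0-01-100", "0-11-101-0001"]  # illustrative three-stage individual
print(decode_stage(individual[0]))  # [(1, 2), (2, 3)]: a three-node chain
mutated = [mutate_stage(stage, 0.05, rng) for stage in individual]
print(mutated)
```

Keeping the dashes in place during mutation preserves the fixed length of each stage code, so any mutated or recombined individual can still be decoded with `decode_stage`.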

Acknowledgements

We thank John Flynn, Wei Shen, Chenxi Liu and Siyuan Qiao for instructive discussions.

References

[1] J. Bayer, D. Wierstra, J. Togelius, and J. Schmidhuber. Evolving Memory Cell Structures for Sequence Learning. International Conference on Artificial Neural Networks, 2009.
[2] J. Beasley and P. Chu. A Genetic Algorithm for the Set Covering Problem. European Journal of Operational Research, 94(2):392–404, 1996.
[3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual Categorization with Bags of Keypoints. Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision, 1(22):1–2, 2004.
[4] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition, 2009.
[6] S. Ding, H. Li, C. Su, J. Yu, and F. Jin. Evolutionary Artificial Neural Networks: A Review. Artificial Intelligence Review, 39(3):251–260, 2013.
[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
[8] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout Networks. International Conference on Machine Learning, 2013.
[9] J. Grefenstette, R. Gopal, B. Rosmaita, and D. Van Gucht. Genetic Algorithms for the Traveling Salesman Problem. International Conference on Genetic Algorithms and their Applications, 1985.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Computer Vision and Pattern Recognition, 2016.
[11] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint, arXiv:1207.0580, 2012.
[12] C. Houck, J. Joines, and M. Kay. A Genetic Algorithm for Function Optimization: A Matlab Implementation. Technical Report, North Carolina State University, 2009.
[13] G. Huang, Z. Liu, and K. Weinberger. Densely Connected Convolutional Networks. arXiv preprint, arXiv:1608.06993, 2016.
[14] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep Networks with Stochastic Depth. European Conference on Computer Vision, 2016.
[15] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning, 2015.
[16] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto, 1(4):7, 2009.
[17] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.
[18] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Computer Vision and Pattern Recognition, 2006.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] Y. LeCun, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems, 1990.
[21] C. Lee, P. Gallagher, and Z. Tu. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. International Conference on Artificial Intelligence and Statistics, 2016.
[22] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised Nets. International Conference on Artificial Intelligence and Statistics, 2015.
[23] M. Liang and X. Hu. Recurrent Convolutional Neural Network for Object Recognition. Computer Vision and Pattern Recognition, 2015.
[24] M. Lin, Q. Chen, and S. Yan. Network in Network. International Conference on Learning Representations, 2014.
[25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[26] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher Kernel for Large-scale Image Classification. European Conference on Computer Vision, 2010.
[27] C. Reeves. A Genetic Algorithm for Flowshop Sequencing. Computers & Operations Research, 22(1):5–13, 1995.
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, pages 1–42, 2015.
[29] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations, 2014.
[30] L. Snyder and M. Daskin. A Random-Key Genetic Algorithm for the Generalized Traveling Salesman Problem. European Journal of Operational Research, 174(1):38–53, 2006.
[31] K. Stanley and R. Miikkulainen. Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation, 10(2):99–127, 2002.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. Computer Vision and Pattern Recognition, 2015.
[33] A. Torralba, R. Fergus, and W. Freeman. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[34] N. Ulder, E. Aarts, H. Bandelt, P. van Laarhoven, and E. Pesch. Genetic Local Search Algorithms for the Traveling Salesman Problem. International Conference on Parallel Problem Solving from Nature, 1990.
[35] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-Constrained Linear Coding for Image Classification. Computer Vision and Pattern Recognition, 2010.
[36] L. Xie, R. Hong, B. Zhang, and Q. Tian. Image Classification and Retrieval are ONE. International Conference on Multimedia Retrieval, 2015.
[37] L. Xie, Q. Tian, J. Flynn, J. Wang, and A. Yuille. Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons. European Conference on Computer Vision, 2016.
[38] L. Xie, J. Wang, W. Lin, B. Zhang, and Q. Tian. Towards Reversal-Invariant Image Representation. International Journal of Computer Vision, 2016.
[39] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. DisturbLabel: Regularizing CNN on the Loss Layer. Computer Vision and Pattern Recognition, 2016.
[40] L. Xie, L. Zheng, J. Wang, A. Yuille, and Q. Tian. InterActive: Inter-Layer Activeness Propagation. Computer Vision and Pattern Recognition, 2016.
[41] X. Yao. Evolving Artificial Neural Networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
[42] S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv preprint, arXiv:1605.07146, 2016.
[43] M. Zeiler and R. Fergus. Stochastic Pooling for Regularization of Deep Convolutional Neural Networks. International Conference on Learning Representations, 2013.
[44] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition Using Places Database. Advances in Neural Information Processing Systems, 2014.
[45] B. Zoph and Q. Le. Neural Architecture Search with Reinforcement Learning. International Conference on Learning Representations, 2017.