A Study of Genetic Algorithms for Hyperparameter Optimization of Neural Networks in Machine Translation
Keshav Ganapathy
Abstract
With neural networks having demonstrated their versatility and benefits, the need for their optimal performance is as prevalent as ever. A defining characteristic, the hyperparameters, can greatly affect a network's performance. Engineers therefore go through a tuning process to identify and implement optimal hyperparameters. However, excessive amounts of manual effort are required for tuning network architectures, training configurations, and preprocessing settings such as Byte Pair Encoding (BPE). In this study, we propose an automatic tuning method modeled after Darwin's survival-of-the-fittest theory via a Genetic Algorithm (GA). Research results show that the proposed method outperforms a random selection of hyperparameters.
1 Introduction

As neural networks are adopted to solve real-world problems, some parts of a network may be easy to develop, while other aspects, such as hyperparameters, have no clear method of derivation. Ongoing research focuses on developing new network architectures and training methods. When developing neural networks, the question at hand is how to set the hyperparameter values to maximize results and how to set the training configuration. For network architecture design, important hyperparameters include the type of network, the number of layers, the number of units per layer, and the unit type. For training configurations, important hyperparameters include the learning algorithm, the learning rate, and the dropout ratio. All these hyperparameters interact with each other and affect the performance of the neural network; this interaction between hyperparameters can be referred to as epistasis. Thus they need to be tuned simultaneously to get optimum results.

The motivation behind this research is to replace tedious manual tuning of hyperparameters with an automatic method performed by computers. Current methods of optimization are limited to trivial methods like grid search, a simple method for hyperparameter optimization. However, as the number of hyperparameters increases, grid search becomes time consuming and computationally taxing, because the number of lattice points increases exponentially with the number of hyperparameters Qin et al. For example, if there are ten hyperparameters to be tuned and we try only five values for each parameter, this alone requires more than 9 million evaluations: 5^10 = 9,765,625. For this reason, grid search is not feasible for certain applications. To solve this, we look to a GA for a higher-performing and less computationally taxing solution. The use of a GA for neural network hyperparameter optimization has been explored previously in Suganuma et al. [2017] and Moriya et al.
[2018].

We present an empirical study of GAs for neural network models in machine translation of natural language, specifically Japanese to English. We describe the experiment setup in Section 2, our GA method in Section 3, and results in Section 4. The preliminary findings suggest that a simple GA encoding has the potential to find optimum network architectures compared to a random search baseline.

2 Experiment Setup

Genetic Algorithms (GAs) are a class of optimization methods where each individual, here a neural network, represents a solution to the optimization problem, and the population is evolved in hopes of generating good solutions. In our case, each individual represents the hyperparameters of a neural machine translation system, and our goal is to find hyperparameters that will lead to good systems. The defining factor when measuring an individual's fitness is its BLEU score, a measurement of the individual's translation quality that depends on the individual's hyperparameters. The data set used for experimentation was limited to 150 individuals, each consisting of 6 hyperparameters. We use a benchmark data set provided by Zhang and Duh [2020], where a grid of hyperparameter settings and the resulting model BLEU scores are pre-computed for the purpose of reproducible and efficient hyperparameter optimization experiments. In particular, we use the Japanese-to-English data set, which consists of various kinds of Transformer models (described later) trained on the WMT2019 Robust Task Li et al. [2019], with BLEU scores ranging from 9 to 16.

Every test was done over three trials, each consisting of 1000 optimization iterations. One iteration involves a process, elaborated below, to arrive at the target BLEU score of 16; the higher the BLEU score, the better. While testing, both the GA and the baseline (random search) received the same initial population of 5, 10, 15, 20, or 25 individuals. Throughout experimentation, the goal was to arrive at one individual with a BLEU score (fitness) of 16 or higher.
The fitness is a measure of the likelihood of the individual remaining in the population.

A practical limitation of the data set is that a combination of hyperparameters found by the GA may not be represented in the data set. If the combination is non-existent, we assign a fitness value of 0 to that individual. This is imperative, as we do not want a non-existent individual remaining in the population: by assigning the individual a fitness value of 0, it is guaranteed to be replaced in the next breeding cycle. Furthermore, two individuals with a fitness of 0 cannot coexist, as the population starts with individuals that are represented in the data set. This is a benefit of an individual-based measurement rather than a generational measure. Additionally, during experimentation, individuals were not added to the population if they had already been in the population, or if their genes exist in the current population. Traditionally, the entire current population is replaced by a new generation consisting of new offspring; in our implementation, only one offspring is generated, and it replaces the weakest individual.

The performance of the GA and the baseline algorithm is represented by a value based on the total number of individuals added. This value increases every time an individual is added to the population. The performance of both algorithms is determined by averaging the performance value measured over all the iterations. An iteration consists of one optimization cycle. The performance value, measured in individuals per iteration, is displayed in the tables in Section 4.
3 Genetic Algorithm

The GA system used in this experiment is based on the natural selection process theorized by Darwin. This theory states that the stronger/fitter individuals survive while the weaker individuals do not. Thus, over time, as the surviving stronger individuals reproduce, you get a population that as a whole carries genes that make it more resilient. Darwin, however, only theorized this process in the natural world; the idea of using a GA in optimization problems and machine learning was introduced by Goldberg and Holland [1988]. Throughout this experiment, we simulate their logic for machine translation neural network optimization. We begin with an initial population that "reproduces" until we get a group of individuals that collectively hold hyperparameters that perform better, and one individual that is the "fittest" and meets the BLEU score target. The parts of the algorithm are represented through 3 objects: Individuals, Populations, and the Genetic Algorithm itself. The GA and baseline algorithm were developed in the Python programming language. The GA and the baseline algorithm (random), elaborated later, are compared via an evaluator. This hierarchy, explained in Whitley [1994], is represented in Figure 1: the evaluator runs both the Genetic Algorithm and Random Selection, each of which maintains a population of individuals.

Figure 1: Hierarchy of the objects in experimentation
3.1 Individuals

Every individual consists of two defining characteristics: 1) the "chromosome", consisting of 6 hyperparameters, and 2) the fitness score, which is the BLEU value. An individual's chromosome can look like [10000.0, 4.0, 512.0, 1024.0, 8.0, 0.0006]. The fitness of the machine translation neural network (i.e., the individual) is defined as its BLEU score, or how well the neural network can translate from Japanese to English. The BLEU score is produced by an algorithm for evaluating the quality of a translation performed by the machine against one performed by a human. The BLEU score is calculated via a simple proportion between two translations: the human translation, referred to as the reference translation, and the translation done by the machine, the candidate translation. To compute it, one counts the number of candidate translation words (unigrams) that occur in any reference translation, and the total number of words found in the reference translation Papineni et al. [2002]. Once we have the two values, we divide them to get a precision component of the BLEU score.

However, a simple proportion is not the only ingredient in the calculation of the BLEU score. Another property worth noting is the use of a brevity penalty, a penalty based on length. Since BLEU is a precision-based method, the brevity penalty ensures that a system does not only translate fragments of the test set of which it is confident, which would result in high precision Koehn [2004]. It has become common practice to include a word penalty component dependent on the length of the phrase for translation. This is especially relevant for the BLEU score, which harshly penalizes translation output that is too short. Finally, in this study, we use alternate notation for readability: a BLEU score of 0.289 is reported as a percent, 28.9%.
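As a concrete illustration, the precision-times-brevity-penalty computation described above can be sketched in a few lines of Python. This is a simplified, unigram-only sketch: the real BLEU metric of Papineni et al. [2002] combines clipped n-gram precisions up to n = 4, and the function name here is our own.

```python
from collections import Counter
import math

def unigram_bleu(candidate, reference):
    """Simplified, unigram-only BLEU sketch: clipped unigram precision
    multiplied by a brevity penalty. Both arguments are token lists."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clipped counts: a candidate word only scores up to the number of
    # times it appears in the reference translation.
    matches = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    precision = matches / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference,
    # since precision alone rewards translating only confident fragments.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision
```

For example, a candidate identical to its reference scores 1.0 (reported as 100% in our notation), while a two-word candidate that repeats a single reference word scores 0.5.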
BPE subword units (1k):         10, 30, 50
number of layers:               2, 4
embedding size:                 256, 512, 1024
feed-forward hidden units:      1024, 2048
attention heads:                8, 16
initial learning rate (10^-4):  3, 6, 10

Table 1: Hyperparameter search space for the Transformer NMT systems

Amongst the 150 individuals there are 7 BLEU scores above the goal of 16: 16.04, 16.09, 16.02, 16.41, 16.03, 16.13, and 16.21. The ranges of hyperparameter values, referred to as genes, are as follows: fitness (9.86 - 16.41), Gene 1 (10,000, 30,000, or 50,000), Gene 2 (2.0 or 4.0), Gene 3 (256.0, 512.0, or 1024.0), Gene 4 (1024.0 or 2048.0), Gene 5 (8.0 or 16.0), and Gene 6 (0.001, 0.0003, or 0.0006). This information is shown in Table 1 above.

A convolutional or recurrent model was not implemented for these translation networks; the networks in this study implemented Transformers. On top of higher translation quality, Transformers require less computation to train and are a much better fit for modern machine learning hardware, speeding up the training process immensely. With regard to the reduced computational requirements, the ease of training Transformers can be attributed to the fact that their cost does not grow with the number of previously translated words Dehghani et al. [2018]. Specifically, a Transformer is not recurrent, meaning it does not need the translation of the previous word to translate the next. For example, let's take a translation from German to English, using the German sentence Das Haus ist groß, meaning "the house is big." In a Recurrent Neural Network (RNN), the network identifies das and the, and then uses that as a reference for the next word, to translate Haus to house, etc. In a Transformer, however, we can treat each word as a separate object: rather than relying on the translation of the previous word, the translation uses the embedding values of all the other words. Because the words are treated independently, the translation operations can occur in parallel. So in this example, every word das, Haus, ist, groß is vectorized and uses the other words' embeddings. Additionally, a notable characteristic of a Transformer is its use of an attention mechanism.
The attention mechanism allows the network to direct its focus, paying greater attention to certain factors when processing the data, which results in a higher-performing network. For these three main reasons (the lack of recurrence, the use of an attention mechanism, and the ability for parallel computation), Transformers are a preferred choice of network architecture in machine translation. The hyperparameters searched over in our Transformer models are shown in Table 1.

3.2 Population

The population of individuals consists of three instance variables: the population itself, consisting of an initial x individuals implemented through an array, and two individuals that represent the fittest and second-fittest individuals in the population. A population of 5 individuals will look something like [Individual 1, Individual 2, Individual 3, Individual 4, Individual 5]. Examples of the fittest and second-fittest individuals follow: the fittest, Individual 1, can be represented by its fitness, a BLEU score of 16.41, and the second fittest, Individual 2, by its fitness of 16.04. During experimentation, however, the algorithm stopped when the goal of 16 or above was reached, so the situation above would not occur.
In a broad sense, a genetic algorithm is any population-based model that uses various operators to generate new sample points. The GA system used during experimentation abides by most conventional characteristics of a GA: a population, individuals, selection/mutation/crossover operations, etc. Our GA is comprised of three objects: the population, a list of all individuals, and an individual that acts as the child, referred to as the placeholder. The population holds the individuals; the list of all individuals allows us to make sure that the child is not a repeated individual; and the placeholder provides an object to store information on the offspring. The structure of the GA is as follows. First, it contains an array of individuals holding only the current population, defined as the population before the selection process, elaborated below. Second, it contains a list of all the individuals that have been introduced to the population; we iterate through this list every time before adding an individual to make sure that the new individual, the placeholder, has not been introduced before. The placeholder is an individual that is initially set to have values of 0: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]. During the crossover process, elaborated later, all of its genes are changed to those of the offspring. Our implementation, however, differs from convention in two main ways: it uses an integer representation rather than bit values, and an individual-based measure for optimization rather than a generational measure.
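A minimal sketch of these three objects in Python follows; the class and attribute names are our own illustration, not those of the original implementation.

```python
class Individual:
    """One candidate NMT system: six hyperparameter genes plus a fitness."""
    def __init__(self, chromosome, fitness=0.0):
        self.chromosome = list(chromosome)  # e.g. [10000.0, 4.0, 512.0, 1024.0, 8.0, 0.0006]
        self.fitness = fitness              # BLEU score; 0 if absent from the data set

class GeneticAlgorithm:
    def __init__(self, initial_population):
        self.population = list(initial_population)
        # Every chromosome ever introduced, used to reject repeats.
        self.seen = [ind.chromosome for ind in initial_population]
        # The placeholder child starts as all zeros and is overwritten
        # gene by gene during crossover.
        self.placeholder = Individual([0.0] * 6)

    def is_new(self, chromosome):
        """True if this gene combination has never been in the population."""
        return chromosome not in self.seen
```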
At the end of the process, a value representing the total number of individuals added to the population, plus the initial population size, is returned.

The process the Genetic Algorithm goes through forms a loop: selection, crossover, and mutation produce a new offspring; the offspring is checked for validity and, if valid, added to the population; if the target is reached, optimization is complete, otherwise the loop repeats. The processes used during the GA (selection, crossover, and mutation) and their functionalities are elaborated on below.
The selection process allows the GA to select two parents to mate to form a new offspring. The two parents are selected by weighted probability, proportional to their fitness. For example, take an initial population of 5 individuals with fitness values 10, 25, 15, 5, and 45. The probability of selecting an individual is calculated by dividing the individual's fitness by the sum of all fitness values, in this example 100. Individual 1 has a 10% chance of being selected (10/100), Individual 2 has a 25% chance (25/100), etc. After repeating this process we get a list of percentages: 10%, 25%, 15%, 5%, 45%. As you can see, the percentages always add up to 100. These values are then used to build an array of range boundaries, one per individual; in the aforementioned example, the list looks like [10, 35, 50, 55, 100]. Each boundary is determined by adding the percentage value of an individual to those of all the previous individuals. From here, we select a random integer value from 0 to 100 and map it to an individual: Individual 1 is selected if the random value is less than 10, Individual 2 if the value is between 10 and 35, Individual 3 if it is between 35 and 50, Individual 4 if it is between 50 and 55, and Individual 5 otherwise. This approach is practical as it adapts to the size of the population and gives an accurate weighted representation.
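The boundary-array scheme above can be sketched as follows; the random draw is factored out as a parameter so the worked example from the text is reproducible.

```python
import random

def roulette_select(fitnesses, rand=None):
    """Fitness-proportional (roulette-wheel) selection: returns the
    index of the chosen individual."""
    total = sum(fitnesses)
    # Cumulative boundaries, e.g. [10, 35, 50, 55, 100] for the
    # fitnesses [10, 25, 15, 5, 45] used in the text.
    boundaries, running = [], 0
    for fitness in fitnesses:
        running += fitness
        boundaries.append(running)
    if rand is None:
        rand = random.uniform(0, total)
    # The first boundary at or above the draw identifies the individual.
    for index, boundary in enumerate(boundaries):
        if rand <= boundary:
            return index
    return len(fitnesses) - 1
```

With the example fitnesses, a draw of 5 selects Individual 1 (index 0) and a draw of 99 selects Individual 5 (index 4).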
The crossover process simulates the breeding part of the natural selection process. Following selection, the two selected individuals are used to create an offspring. Initially, a random gene position in the chromosome is selected as the "crossover point". Up to that point, the genes of the fittest parent (Parent 1 in Figure 2) are selected, and from that point to the end, the genes of the second fittest (Parent 2 in Figure 2) are added. Figure 2 depicts the process for a crossover point of 2.

Figure 2: Visual depicting the crossover process

From here we need to assign a fitness value to the new individual. As explained later, we maintain 7 lists: one list of all potential values for each of the 6 hyperparameters, and another list for every fitness value. These lists are ordered by individual; for example, the first index in all 7 lists corresponds to individual 1, the second index to individual 2, etc. We iterate through the entire first list, an array containing the ordered values for all 150 individuals, until we arrive at a match. In the example above, this occurs when we first arrive at the value 10000. We then store the index of the 10000 and check whether the values at that same index in lists 2, 3, 4, etc. also match the individual's genes. If all 6 genes correspond to the values at one index, we assign the fitness value from the fitness list at that index. Though an inefficient operator for assigning fitness, this approach makes the mutation operator easier.
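Both steps, the one-point crossover and the linear-scan fitness lookup, can be sketched as follows (the function names are illustrative):

```python
def one_point_crossover(parent1, parent2, point):
    """Genes before the crossover point come from the fitter parent,
    the rest from the second fittest."""
    return parent1[:point] + parent2[point:]

def lookup_fitness(child, gene_columns, fitness_column):
    """Linear scan over the 7 parallel lists: find the row whose six
    gene values all match the child's chromosome and return its BLEU;
    a combination absent from the data set gets fitness 0."""
    for row in range(len(fitness_column)):
        if all(gene_columns[g][row] == child[g] for g in range(len(child))):
            return fitness_column[row]
    return 0.0
```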
During the experiment, a mutation rate of 1/8 (12.5%) was selected. When mutation occurs, an individual in the population is randomly selected, and one gene of that individual is also randomly selected. That gene is then given a new random value from the data set. The mutation process excludes the weakest individual from being selected, as that individual would be replaced: if the weakest individual were not excluded, it would be replaced during the addition of the offspring, making the mutation irrelevant. Figure 3 shows a mutation example with the mutation occurring on Gene 5; as shown, the mutation results in an increased fitness.

Figure 3: Visual depicting the mutation process
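The mutation operator can be sketched as below. This is our own illustration, assuming chromosomes and fitnesses are kept in parallel lists and that gene_values[g] lists the legal values for gene g; the random source is injectable so the behavior can be checked deterministically.

```python
import random

MUTATION_RATE = 1 / 8  # 12.5%

def mutate(chromosomes, fitnesses, gene_values, rng=random):
    """With probability 1/8, pick a random individual, excluding the
    weakest (which the next offspring will replace anyway), and reassign
    one randomly chosen gene to a random value from the search space."""
    if rng.random() >= MUTATION_RATE:
        return  # no mutation this cycle
    weakest = min(range(len(fitnesses)), key=fitnesses.__getitem__)
    target = rng.choice([i for i in range(len(chromosomes)) if i != weakest])
    gene = rng.randrange(len(gene_values))
    chromosomes[target][gene] = rng.choice(gene_values[gene])
```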
4 Results

All of the results below are measured as the average number of individuals that needed to be added, over 1000 iterations. Each iteration consists of running the optimization process until a target score of 16 is reached. As previously stated, both algorithms return a value that represents the number of individuals that were added to reach the goal; before returning the values, we add the initial population size to account for the individuals in the initial population. These values are then stored in two lists, one for the GA and another for the baseline algorithm. At the very end, the evaluator averages both lists to arrive at the performance measure for each algorithm.
As a measure of the quality of the GA, a baseline algorithm that optimizes by randomly selecting hyperparameters was used for comparison. The baseline algorithm selects a random index value from 0 to one less than the size of the data set. From this, an individual is made with the values at that index in every list from the initialization of data points, elaborated below. At the end of the process, a value representing the total number of individuals added to the population, plus the initial population size, is returned.

The baseline algorithm's process mirrors the GA's loop: select a random value, get the individual that corresponds to it, check the individual's validity and add it if valid, and repeat until the target is reached.
The initialization uses the Python pandas library to create seven arrays that represent all possible individual combinations. For example, all potential Gene 1 values are stored in an array; similarly, there are arrays for the other genes and for the fitness values. Index 0 of all the lists corresponds to the characteristics of individual 1, index 1 to individual 2, etc. These arrays are used to assign fitness values when new offspring are created, to generate the initial population, and during the mutation operator.
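A sketch of this initialization with pandas follows; the column names are illustrative placeholders, not the benchmark's actual headers.

```python
import pandas as pd

# Illustrative gene names for the six hyperparameters in Table 1.
GENE_NAMES = ["bpe", "layers", "embed", "ff_hidden", "heads", "lr"]

def load_columns(df):
    """Return the seven parallel arrays described above: one list of
    values per gene, plus the list of fitness (BLEU) values, all
    ordered by individual."""
    gene_columns = [df[name].tolist() for name in GENE_NAMES]
    fitness_column = df["bleu"].tolist()
    return gene_columns, fitness_column
```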
Every trial in the tables below consists of 1000 iterations. An iteration entails repeating the GA or baseline process, elaborated above, until the target goal is reached. Thus each average is calculated over a cumulative 3000 iterations (per the tables below). We measure how long it takes an algorithm to find a good solution (i.e., ≥
16 BLEU), so the lower the better.

Initial Population Size   Trial 1   Trial 2   Trial 3   Average
5                         20.056    19.91     20.247    20.071
10                        21.011    20.739    20.392    20.714
15                        23.91     23.933    23.54     23.794
20                        27.555    28.019    27.267    27.614
25                        31.475    30.816    32.099    31.463

Table 2: Genetic Algorithm results (trial values are the average number of individuals added)

Initial Population Size   Trial 1   Trial 2   Trial 3   Average
5                         20.435    21.949    21.891    21.425
10                        23.513    22.545    23.29     23.116
15                        26.056    25.611    25.345    25.671
20                        28.332    28.292    28.365    28.330
25                        31.857    30.819    32.117    31.598

Table 3: Baseline Algorithm results (trial values are the average number of individuals added)

Using the average values, shown in Table 4 below, we can get a numerical representation of how much better the GA is. The values in Table 4 were found by taking the average for the baseline and subtracting the corresponding GA value, giving how many more individuals the baseline needs on average. Averaging the per-size differences shows that the GA reaches the desired goal with roughly 1.3 fewer individuals on average. Although saving 1.3 individuals is not large in the grander scheme of things, it is promising to see that the GA gives consistent gains, implying that there are patterns to be exploited in the hyperparameter optimization process.

Initial Population Size   Winner               Difference in Performance
5                         Genetic Algorithm    1.354
10                        Genetic Algorithm    2.402
15                        Genetic Algorithm    1.877
20                        Genetic Algorithm    0.716
25                        Genetic Algorithm    0.135

Table 4: Difference in performance between the GA and the Baseline Algorithm
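As a sanity check, the differences in Table 4 and their mean can be recomputed directly from the averages in Tables 2 and 3:

```python
# Average number of individuals added, per initial population size
# (5, 10, 15, 20, 25), copied from Tables 2 and 3.
ga_averages = [20.071, 20.714, 23.794, 27.614, 31.463]
baseline_averages = [21.425, 23.116, 25.671, 28.330, 31.598]

# Baseline minus GA: how many more individuals the baseline needs.
differences = [round(b - g, 3) for b, g in zip(baseline_averages, ga_averages)]
print(differences)  # [1.354, 2.402, 1.877, 0.716, 0.135]
print(round(sum(differences) / len(differences), 2))  # 1.3
```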
5 Future Work

The study demonstrates the validity of a GA for hyperparameter optimization. Future work falls into three main categories: general changes, structural changes (changes to how the system works), and behavioral changes (variations in individual method implementations which can cause varying results). Additionally, there is the possibility of implementing variations of the baseline algorithm. Lastly, we will explore the results of the generational measure rather than the individual-based one. We observed that a plateau occurs as the population increases: the chance of the GA picking individuals with lower fitness values rises with the population size. Therefore, as the initial population size increases, the GA's performance approaches that of the baseline algorithm.
When reflecting on the experiment as a whole, there are many key aspects that should be changed to further test the validity of a GA. One such example is the range of individuals represented in the data set. In the data set above, many combinations of hyperparameters were not pre-trained, so no fitness value was assigned; this results in the GA iterating extra times due to non-existent individuals. Additionally, as previously mentioned, one goal of a GA is to outperform grid search, a more primitive type of optimization. To further test the performance, a larger data set will need to be used to compare efficiency against grid search, a higher-performing approach than random search. Also, additional evaluation metrics can be used, such as the population's average fitness and the run time of the algorithm. Another potential change is discounting individuals that are not represented: while it is impractical to have all combinations represented, the baseline algorithm could never select an unrepresented combination, while the GA could. For further comparison, not adding such an individual to the population would decrease the individual count and enlarge the performance gap between the GA and the baseline algorithm.
An example of a structural change is having mutation occur before selection. Mutation before selection can result in different weights for individuals during the selection process. This can produce varying fitness values, which can ultimately change the entire optimization process.
A glaring example of a behavioral change can be seen in how the individual to be replaced is selected. Unlike Darwin's theory, the replaced individual was selected deterministically during the experiment: the individual with the lowest fitness was replaced. Changing this to a weighted probability, similar to selecting the fittest and second fittest, could affect the tuning process, as the hyperparameters interact with one another. Two individuals with low fitness values can have a child with a high fitness value, due to the lack of direct correlation between individual hyperparameter values and the BLEU score. Another example is a two-point crossover operator rather than the one-point crossover currently implemented; in a two-point crossover, the individuals exchange the genes that fall between the two points.
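Under the same list representation used earlier, the two-point variant can be sketched as:

```python
def two_point_crossover(parent1, parent2, point1, point2):
    """Two-point crossover sketch: the offspring takes the genes between
    the two crossover points from the second parent and keeps the first
    parent's genes everywhere else."""
    return parent1[:point1] + parent2[point1:point2] + parent1[point2:]
```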
6 Conclusion

This work introduces a GA for hyperparameter optimization and applies it to machine translation. We demonstrate that optimization of hyperparameters via a GA can outperform a random selection of hyperparameters; specifically, outperforming is defined as the ability of the algorithm to arrive at the goal with fewer individuals added. Finally, we propose future research directions which are expected to provide additional gains in the efficacy of GAs.
Acknowledgments

I would like to thank and acknowledge Dr. Kevin Duh (JHU) for giving me the opportunity to pursue research with him. Thank you for the continuous support, patience, and motivation throughout the entire process. Additional thanks for the feedback and comments on this paper, and for the invaluable guidance regarding programming, resources for help, and much more. Also, I am very appreciative of and grateful to my family and friends for their love, patience, and support, without which this project would not have been possible.
References
Hao Qin, Takahiro Shinozaki, and Kevin Duh. Evolution strategy based automatic tuning of neural machine translation systems.

Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 497-504, 2017.

Takafumi Moriya, Tomohiro Tanaka, Takahiro Shinozaki, Shinji Watanabe, and Kevin Duh. Evolution-strategy-based automation of system development for high-performance speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1):77-88, 2018.

Xuan Zhang and Kevin Duh. Reproducible and efficient benchmarks for hyperparameter optimization of neural machine translation systems. Transactions of the Association for Computational Linguistics (TACL), 2020.

Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, and Hassan Sajjad. Findings of the first shared task on machine translation robustness. In Proceedings of the Fourth Conference on Machine Translation, 2019.

David E. Goldberg and John Henry Holland. Genetic algorithms and machine learning. 1988.

Darrell Whitley. A genetic algorithm tutorial. Statistics and Computing, 4(2):65-85, 1994.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388-395, 2004.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.