A Study of Genetic Algorithms for Hyperparameter Optimization of Neural Networks in Machine Translation
Keshav Ganapathy
Abstract
With neural networks having demonstrated their versatility and benefits, the need for their optimal performance is as prevalent as ever. A defining characteristic, the hyperparameters, can greatly affect a network's performance. Engineers therefore go through a tuning process to identify and implement optimal hyperparameters. However, excessive amounts of manual effort are required for tuning network architectures, training configurations, and preprocessing settings such as Byte Pair Encoding (BPE). In this study, we propose an automatic tuning method modeled after Darwin's survival-of-the-fittest theory via a Genetic Algorithm (GA). Research results show that the proposed method outperforms a random selection of hyperparameters.
1 Introduction

As neural networks are adopted to solve real-world problems, some parts of a network may be easy to develop, while other aspects, such as hyperparameters, have no clear method of derivation. Ongoing research focuses on developing new network architectures and training methods. When developing neural networks, the question at hand is how to set the hyperparameter values to maximize results and how to set the training configuration. For network architecture design, important hyperparameters include the type of network, the number of layers, the number of units per layer, and the unit type. For training configurations, important hyperparameters include the learning algorithm, the learning rate, and the dropout ratio. All these hyperparameters interact with each other and affect the performance of the neural network; this interaction between hyperparameters can be referred to as epistasis. Thus they need to be tuned simultaneously to get optimum results.

The motivation behind this research is to replace tedious manual tuning of hyperparameters with an automatic method performed by computers. Current methods of optimization are limited to trivial methods like grid search, a simple method for hyperparameter optimization. However, as the number of hyperparameters increases, grid search becomes time consuming and computationally taxing, because the number of lattice points increases exponentially with the number of hyperparameters Qin et al. For example, if there are ten hyperparameters to be tuned and we try only five values for each parameter, this alone requires more than 9 million evaluations: 5^10 = 9,765,625. For this reason, grid search is not feasible for certain applications. To solve this, we look to a GA for a higher-performing and less computationally taxing solution. The use of a GA for neural network hyperparameter optimization has been explored previously in Suganuma et al. [2017] and Moriya et al.
[2018].

We present an empirical study of GAs for neural network models in machine translation of natural language, specifically Japanese to English. We describe the experiment setup in Section 2, our GA method in Section 3, and results in Section 4. The preliminary findings suggest that a simple GA encoding has the potential to find optimum network architectures compared to a random search baseline.

2 Experiment Setup

Genetic Algorithms (GAs) are a class of optimization methods where each individual, here a neural network, represents a solution to the optimization problem, and the population is evolved in hopes of generating good solutions. In our case, each individual represents the hyperparameters of a neural machine translation system, and our goal is to find hyperparameters that will lead to good systems. The defining factor when measuring an individual's fitness is its BLEU score, a measurement of the individual's translation quality that depends on the individual's hyperparameters. The data set used for experimentation was limited to 150 individuals, each consisting of 6 hyperparameters. We use a benchmark data set provided by Zhang and Duh [2020], where a grid of hyperparameter settings and the resulting model BLEU scores are pre-computed for the purpose of reproducible and efficient hyperparameter optimization experiments. In particular, we use the Japanese-to-English data set, which consists of various kinds of Transformer models (described later) trained on the WMT2019 Robust Task Li et al. [2019], with BLEU scores ranging from 9 to 16.

Every test was done over three trials, each consisting of 1000 optimization iterations. One iteration involves a process, elaborated below, to arrive at the target BLEU score of 16; the higher the BLEU score, the better. While testing, both the GA and the baseline (random search) received the same initial population of 5, 10, 15, 20, or 25 individuals. Throughout experimentation, the goal was to arrive at one individual with a BLEU score (fitness) of 16 or higher.
The fitness is a measure of the likelihood of the individual remaining in the population.

A practical limitation of the data set is that a combination of hyperparameters found by the GA may not be represented in the data set. If the combination is non-existent, we assign a fitness value of 0 to that individual. This is imperative, as we do not want a non-existent individual remaining in the population: by assigning the individual a fitness value of 0, it is guaranteed to be replaced in the next breeding cycle. Furthermore, two individuals with a fitness of 0 cannot coexist, as the population starts with individuals that are represented in the data set. This is a benefit of an individual-based measurement rather than a generational measure. Additionally, during experimentation, individuals were not added to the population if they had already been in the population, or if their genes exist in the current population. Traditionally, the entire current population is replaced by a new generation consisting of new offspring; in our implementation, only one offspring is generated, and it replaces the weakest individual.

The performance of the GA and the baseline algorithm is represented by a value based on the total number of individuals added. This value increases every time an individual is added to the population. The performance of both algorithms is determined by averaging the performance value measured over all the iterations. An iteration consists of one optimization cycle. The performance value, measured in individuals per iteration, is displayed in the tables in Section 4.
3 Genetic Algorithm

The GA system used in this experiment is based on the natural selection process theorized by Darwin. This theory states that the stronger/fitter individuals survive while the weaker individuals do not. Thus, over time, as the surviving stronger individuals reproduce, you get a population that as a whole carries genes that make it more resilient. Darwin, however, only theorized this process in the natural world; the idea of using a GA in optimization problems and machine learning was introduced by Goldberg and Holland [1988]. Throughout this experiment, we simulate their logic for machine translation neural network optimization. We begin with an initial population that "reproduces" until we get a group of individuals that collectively hold hyperparameters that perform better, and one individual that is the "fittest" and meets the BLEU score target. The parts of the algorithm are represented through 3 objects: Individuals, Populations, and the Genetic Algorithm itself. The GA and baseline algorithm were developed in the Python programming language. The GA and the baseline algorithm (random), elaborated later, are compared via an evaluator. This hierarchy, explained in Whitley [1994], is represented in Figure 1: the evaluator runs both the Genetic Algorithm and Random Selection, each of which maintains a population of individuals.

Figure 1: Hierarchy of the objects in experimentation
3.1 Individuals

Every individual consists of two defining characteristics: 1) the "chromosome", consisting of 6 hyperparameters, and 2) the fitness score, which is the BLEU value. An individual's chromosome can look like [10000.0, 4.0, 512.0, 1024.0, 8.0, 0.0006]. The fitness of the machine translation neural network (i.e., the individual) is defined as its BLEU score, or how well the neural network can translate from Japanese to English. The BLEU score is produced by an algorithm for evaluating the quality of a translation performed by the machine against one performed by a human. The BLEU score is calculated via a simple proportion between two translations: the human translation, referred to as the reference translation, and the translation done by the machine, the candidate translation. To compute it, one counts the number of candidate translation words (unigrams) that occur in any reference translation, and the total number of words found in the reference translation Papineni et al. [2002]. Once we have the two values, we divide them to get a precision component of the BLEU score.

However, a simple proportion is not the only ingredient in the calculation of the BLEU score. Another property worth noting is the use of a brevity penalty, a penalty based on length. Since BLEU is a precision-based method, the brevity penalty ensures that a system does not only translate fragments of the test set of which it is confident, which would result in high precision Koehn [2004]. It has become common practice to include a word penalty component dependent on the length of the phrase for translation. This is especially relevant for the BLEU score, which harshly penalizes translation output that is too short. Finally, in this study, we use alternate notation for readability: a BLEU score of 0.289 is reported as a percent, 28.9%.
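As a concrete illustration, the precision-times-brevity-penalty computation described above can be sketched in a few lines of Python. This is a simplified, unigram-only sketch: the real BLEU metric of Papineni et al. [2002] combines clipped n-gram precisions up to n = 4, and the function name here is our own.

```python
from collections import Counter
import math

def unigram_bleu(candidate, reference):
    """Simplified, unigram-only BLEU sketch: clipped unigram precision
    multiplied by a brevity penalty. Both arguments are token lists."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clipped counts: a candidate word only scores up to the number of
    # times it appears in the reference translation.
    matches = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    precision = matches / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference,
    # since precision alone rewards translating only confident fragments.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision
```

For example, a candidate identical to its reference scores 1.0 (reported as 100% in our notation), while a two-word candidate that repeats a single reference word scores 0.5.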
BPE subword units (1k):         10, 30, 50
number of layers:               2, 4
embedding size:                 256, 512, 1024
feed-forward hidden units:      1024, 2048
attention heads:                8, 16
initial learning rate (10^-4):  3, 6, 10

Table 1: Hyperparameter search space for the Transformer NMT systems

Amongst the 150 individuals there are 7 BLEU scores above the goal of 16: 16.04, 16.09, 16.02, 16.41, 16.03, 16.13, and 16.21. The ranges of hyperparameter values, referred to as genes, are as follows: fitness (9.86 - 16.41), Gene 1 (10,000, 30,000, or 50,000), Gene 2 (2.0 or 4.0), Gene 3 (256.0, 512.0, or 1024.0), Gene 4 (1024.0 or 2048.0), Gene 5 (8.0 or 16.0), and Gene 6 (0.001, 0.0003, or 0.0006). This information is shown in Table 1 above.

A convolutional or recurrent model was not implemented for these translation networks; the networks in this study implemented Transformers. On top of higher translation quality, Transformers require less computation to train and are a much better fit for modern machine learning hardware, speeding up the training process immensely. With regard to the reduced computational requirements, the ease of training Transformers can be attributed to the fact that their cost does not grow with the number of previously translated words Dehghani et al. [2018]. Specifically, a Transformer is not recurrent, meaning it does not need the translation of the previous word to translate the next. For example, let's take a translation from German to English, using the German sentence Das Haus ist groß, meaning "the house is big." In a Recurrent Neural Network (RNN), the network identifies das and the, and then uses that as a reference for the next word, to translate Haus to house, etc. In a Transformer, however, we can treat each word as a separate object: rather than relying on the translation of the previous word, the translation uses the embedding values of all the other words. Because the words are treated independently, the translation operations can occur in parallel. So in this example, every word das, Haus, ist, groß is vectorized and uses the other words' embeddings. Additionally, a notable characteristic of a Transformer is its use of an attention mechanism.
The attention mechanism allows the network to direct its focus, paying greater attention to certain factors when processing the data, which results in a higher-performing network. For these three main reasons (the lack of recurrence, the use of an attention mechanism, and the ability for parallel computation), Transformers are a preferred choice of network architecture in machine translation. The hyperparameters searched over in our Transformer models are shown in Table 1.

3.2 Population

The population of individuals consists of three instance variables: the population itself, consisting of an initial x individuals implemented through an array, and two individuals that represent the fittest and second-fittest individuals in the population. A population of 5 individuals will look something like [Individual 1, Individual 2, Individual 3, Individual 4, Individual 5]. Examples of the fittest and second-fittest individuals follow: the fittest, Individual 1, can be represented by its fitness, a BLEU score of 16.41, and the second fittest, Individual 2, by its fitness of 16.04. During experimentation, however, the algorithm stopped when the goal of 16 or above was reached, so the situation above would not occur.
In a broad sense, a genetic algorithm is any population-based model that uses various operators to generate new sample points. The GA system used during experimentation abides by most conventional characteristics of a GA: a population, individuals, selection/mutation/crossover operations, etc. Our GA is comprised of three objects: the population, a list of all individuals, and an individual that acts as the child, referred to as the placeholder. The population holds the individuals; the list of all individuals allows us to make sure that the child is not a repeated individual; and the placeholder provides an object to store information on the offspring. The structure of the GA is as follows. First, it contains an array of individuals holding only the current population, defined as the population before the selection process, elaborated below. Second, it contains a list of all the individuals that have been introduced to the population; we iterate through this list every time before adding an individual to make sure that the new individual, the placeholder, has not been introduced before. The placeholder is an individual that is initially set to have values of 0: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]. During the crossover process, elaborated later, all of its genes are changed to those of the offspring. Our implementation, however, differs from convention in two main ways: it uses an integer representation rather than bit values, and an individual-based measure for optimization rather than a generational measure.
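A minimal sketch of these three objects in Python follows; the class and attribute names are our own illustration, not those of the original implementation.

```python
class Individual:
    """One candidate NMT system: six hyperparameter genes plus a fitness."""
    def __init__(self, chromosome, fitness=0.0):
        self.chromosome = list(chromosome)  # e.g. [10000.0, 4.0, 512.0, 1024.0, 8.0, 0.0006]
        self.fitness = fitness              # BLEU score; 0 if absent from the data set

class GeneticAlgorithm:
    def __init__(self, initial_population):
        self.population = list(initial_population)
        # Every chromosome ever introduced, used to reject repeats.
        self.seen = [ind.chromosome for ind in initial_population]
        # The placeholder child starts as all zeros and is overwritten
        # gene by gene during crossover.
        self.placeholder = Individual([0.0] * 6)

    def is_new(self, chromosome):
        """True if this gene combination has never been in the population."""
        return chromosome not in self.seen
```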
At the end of the process, a value representing the total number of individuals added to the population, plus the initial population size, is returned.

The process the Genetic Algorithm goes through forms a loop: selection, crossover, and mutation produce a new offspring; the offspring is checked for validity and, if valid, added to the population; if the target is reached, optimization is complete, otherwise the loop repeats. The processes used during the GA (selection, crossover, and mutation) and their functionalities are elaborated on below.
The selection process allows the GA to select two parents to mate to form a new offspring. The two parents are selected by weighted probability, proportional to their fitness. For example, take an initial population of 5 individuals with fitness values 10, 25, 15, 5, and 45. The probability of selecting an individual is calculated by dividing the individual's fitness by the sum of all fitness values, in this example 100. Individual 1 has a 10% chance of being selected (10/100), Individual 2 has a 25% chance (25/100), etc. After repeating this process we get a list of percentages: 10%, 25%, 15%, 5%, 45%. As you can see, the percentages always add up to 100. These values are then used to build an array of range boundaries, one per individual; in the aforementioned example, the list looks like [10, 35, 50, 55, 100]. Each boundary is determined by adding the percentage value of an individual to those of all the previous individuals. From here, we select a random integer value from 0 to 100 and map it to an individual: Individual 1 is selected if the random value is less than 10, Individual 2 if the value is between 10 and 35, Individual 3 if it is between 35 and 50, Individual 4 if it is between 50 and 55, and Individual 5 otherwise. This approach is practical as it adapts to the size of the population and gives an accurate weighted representation.
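The boundary-array scheme above can be sketched as follows; the random draw is factored out as a parameter so the worked example from the text is reproducible.

```python
import random

def roulette_select(fitnesses, rand=None):
    """Fitness-proportional (roulette-wheel) selection: returns the
    index of the chosen individual."""
    total = sum(fitnesses)
    # Cumulative boundaries, e.g. [10, 35, 50, 55, 100] for the
    # fitnesses [10, 25, 15, 5, 45] used in the text.
    boundaries, running = [], 0
    for fitness in fitnesses:
        running += fitness
        boundaries.append(running)
    if rand is None:
        rand = random.uniform(0, total)
    # The first boundary at or above the draw identifies the individual.
    for index, boundary in enumerate(boundaries):
        if rand <= boundary:
            return index
    return len(fitnesses) - 1
```

With the example fitnesses, a draw of 5 selects Individual 1 (index 0) and a draw of 99 selects Individual 5 (index 4).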
The crossover process simulates the breeding part of the natural selection process. Following selection, the two selected individuals are used to create an offspring. Initially, a random gene position in the chromosome is selected as the "crossover point". Up to that point, the genes of the fittest parent (Parent 1 in Figure 2) are selected, and from that point to the end, the genes of the second fittest (Parent 2 in Figure 2) are added. Figure 2 depicts the process for a crossover point of 2.

Figure 2: Visual depicting the crossover process

From here we need to assign a fitness value to the new individual. As explained later, we maintain 7 lists: one list of all potential values for each of the 6 hyperparameters, and another list for every fitness value. These lists are ordered by individual; for example, the first index in all 7 lists corresponds to individual 1, the second index to individual 2, etc. We iterate through the entire first list, an array containing the ordered values for all 150 individuals, until we arrive at a match. In the example above, this occurs when we first arrive at the value 10000. We then store the index of the 10000 and check whether the values at that same index in lists 2, 3, 4, etc. also match the individual's genes. If all 6 genes correspond to the values at one index, we assign the fitness value from the fitness list at that index. Though an inefficient operator for assigning fitness, this approach makes the mutation operator easier.
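Both steps, the one-point crossover and the linear-scan fitness lookup, can be sketched as follows (the function names are illustrative):

```python
def one_point_crossover(parent1, parent2, point):
    """Genes before the crossover point come from the fitter parent,
    the rest from the second fittest."""
    return parent1[:point] + parent2[point:]

def lookup_fitness(child, gene_columns, fitness_column):
    """Linear scan over the 7 parallel lists: find the row whose six
    gene values all match the child's chromosome and return its BLEU;
    a combination absent from the data set gets fitness 0."""
    for row in range(len(fitness_column)):
        if all(gene_columns[g][row] == child[g] for g in range(len(child))):
            return fitness_column[row]
    return 0.0
```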
During the experiment, a mutation rate of 1/8 (12.5%) was selected. When mutation occurs, an individual in the population is randomly selected, and one gene of that individual is also randomly selected. That gene is then given a new random value from the data set. The mutation process excludes the weakest individual from being selected, as that individual would be replaced: if the weakest individual were not excluded, it would be replaced during the addition of the offspring, making the mutation irrelevant. Figure 3 shows a mutation example with the mutation occurring on Gene 5; as shown, the mutation results in an increased fitness.

Figure 3: Visual depicting the mutation process
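The mutation operator can be sketched as below. This is our own illustration, assuming chromosomes and fitnesses are kept in parallel lists and that gene_values[g] lists the legal values for gene g; the random source is injectable so the behavior can be checked deterministically.

```python
import random

MUTATION_RATE = 1 / 8  # 12.5%

def mutate(chromosomes, fitnesses, gene_values, rng=random):
    """With probability 1/8, pick a random individual, excluding the
    weakest (which the next offspring will replace anyway), and reassign
    one randomly chosen gene to a random value from the search space."""
    if rng.random() >= MUTATION_RATE:
        return  # no mutation this cycle
    weakest = min(range(len(fitnesses)), key=fitnesses.__getitem__)
    target = rng.choice([i for i in range(len(chromosomes)) if i != weakest])
    gene = rng.randrange(len(gene_values))
    chromosomes[target][gene] = rng.choice(gene_values[gene])
```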
4 Results

All of the results below are measured as the average number of individuals that needed to be added, over 1000 iterations. Each iteration consists of running the optimization process until a target score of 16 is reached. As previously stated, both algorithms return a value that represents the number of individuals that were added to reach the goal; before returning the values, we add the initial population size to account for the individuals in the initial population. These values are then stored in two lists, one for the GA and another for the baseline algorithm. At the very end, the evaluator averages both lists to arrive at the performance measure for each algorithm.
As a measure of the quality of the GA, a baseline algorithm that optimizes by randomly selecting hyperparameters was used for comparison. The baseline algorithm selects a random index value from 0 to one less than the size of the data set. From this, an individual is made with the values at that index in every list from the initialization of data points, elaborated below. At the end of the process, a value representing the total number of individuals added to the population, plus the initial population size, is returned.

The baseline algorithm's process mirrors the GA's loop: select a random value, get the individual that corresponds to it, check the individual's validity and add it if valid, and repeat until the target is reached.
The initialization uses the Python pandas library to create seven arrays that represent all possible individual combinations. For example, all potential Gene 1 values are stored in an array; similarly, there are arrays for the other genes and for the fitness values. Index 0 of all the lists corresponds to the characteristics of individual 1, index 1 to individual 2, etc. These arrays are used to assign fitness values when new offspring are created, to generate the initial population, and during the mutation operator.
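A sketch of this initialization with pandas follows; the column names are illustrative placeholders, not the benchmark's actual headers.

```python
import pandas as pd

# Illustrative gene names for the six hyperparameters in Table 1.
GENE_NAMES = ["bpe", "layers", "embed", "ff_hidden", "heads", "lr"]

def load_columns(df):
    """Return the seven parallel arrays described above: one list of
    values per gene, plus the list of fitness (BLEU) values, all
    ordered by individual."""
    gene_columns = [df[name].tolist() for name in GENE_NAMES]
    fitness_column = df["bleu"].tolist()
    return gene_columns, fitness_column
```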
Every trial in the tables below consists of 1000 iterations. An iteration entails repeating the GA or baseline process, elaborated above, until the target goal is reached. Thus each average is calculated over a cumulative 3000 iterations (per the tables below). We measure how long it takes an algorithm to find a good solution (i.e., ≥
16 BLEU), so the lower the better.

Initial Population Size   Trial 1   Trial 2   Trial 3   Average
5                         20.056    19.91     20.247    20.071
10                        21.011    20.739    20.392    20.714
15                        23.91     23.933    23.54     23.794
20                        27.555    28.019    27.267    27.614
25                        31.475    30.816    32.099    31.463

Table 2: Genetic Algorithm results (trial values are the average number of individuals added)

Initial Population Size   Trial 1   Trial 2   Trial 3   Average
5                         20.435    21.949    21.891    21.425
10                        23.513    22.545    23.29     23.116
15                        26.056    25.611    25.345    25.671
20                        28.332    28.292    28.365    28.330
25                        31.857    30.819    32.117    31.598

Table 3: Baseline Algorithm results (trial values are the average number of individuals added)

Using the average values, shown in Table 4 below, we can get a numerical representation of how much better the GA is. The values in Table 4 were found by taking the average for the baseline and subtracting the corresponding GA value, giving how many more individuals the baseline needs on average. Averaging the per-size differences shows that the GA reaches the desired goal with roughly 1.3 fewer individuals on average. Although saving 1.3 individuals is not large in the grander scheme of things, it is promising to see that the GA gives consistent gains, implying that there are patterns to be exploited in the hyperparameter optimization process.

Initial Population Size   Winner               Difference in Performance
5                         Genetic Algorithm    1.354
10                        Genetic Algorithm    2.402
15                        Genetic Algorithm    1.877
20                        Genetic Algorithm    0.716
25                        Genetic Algorithm    0.135

Table 4: Difference in performance between the GA and the Baseline Algorithm
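As a sanity check, the differences in Table 4 and their mean can be recomputed directly from the averages in Tables 2 and 3:

```python
# Average number of individuals added, per initial population size
# (5, 10, 15, 20, 25), copied from Tables 2 and 3.
ga_averages = [20.071, 20.714, 23.794, 27.614, 31.463]
baseline_averages = [21.425, 23.116, 25.671, 28.330, 31.598]

# Baseline minus GA: how many more individuals the baseline needs.
differences = [round(b - g, 3) for b, g in zip(baseline_averages, ga_averages)]
print(differences)  # [1.354, 2.402, 1.877, 0.716, 0.135]
print(round(sum(differences) / len(differences), 2))  # 1.3
```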
5 Future Work

The study demonstrates the validity of a GA for hyperparameter optimization. Future work falls into three main categories: general changes, structural changes (changes to how the system works), and behavioral changes (variations in individual method implementations which can cause varying results). Additionally, there is the possibility of implementing variations of the baseline algorithm. Lastly, we will explore the results of the generational measure rather than the individual-based one. We observed that a plateau occurs as the population increases: the chance of the GA picking individuals with lower fitness values rises with the population size. Therefore, as the initial population size increases, the GA's performance approaches that of the baseline algorithm.
When reflecting on the experiment as a whole, there are many key aspects that should be changed to further test the validity of a GA. One such example is the range of individuals represented in the data set. In the data set above, many combinations of hyperparameters were not pre-trained, so no fitness value was assigned; this results in the GA iterating extra times due to non-existent individuals. Additionally, as previously mentioned, one goal of a GA is to outperform grid search, a more primitive type of optimization. To further test the performance, a larger data set will need to be used to compare efficiency against grid search, a higher-performing approach than random search. Also, additional evaluation metrics can be used, such as the population's average fitness and the run time of the algorithm. Another potential change is discounting individuals that are not represented: while it is impractical to have all combinations represented, the baseline algorithm could never select an unrepresented combination, while the GA could. For further comparison, not adding such an individual to the population would decrease the individual count and enlarge the performance gap between the GA and the baseline algorithm.
An example of a structural change is having mutation occur before selection. Mutation before selection can result in different weights for individuals during the selection process. This can produce varying fitness values, which can ultimately change the entire optimization process.
A glaring example of a behavioral change can be seen in how the individual to be replaced is selected. Unlike Darwin's theory, the replaced individual was selected deterministically during the experiment: the individual with the lowest fitness was replaced. Changing this to a weighted probability, similar to selecting the fittest and second fittest, could affect the tuning process, as the hyperparameters interact with one another. Two individuals with low fitness values can have a child with a high fitness value, due to the lack of direct correlation between individual hyperparameter values and the BLEU score. Another example is a two-point crossover operator rather than the one-point crossover currently implemented; in a two-point crossover, the individuals exchange the genes that fall between the two points.
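Under the same list representation used earlier, the two-point variant can be sketched as:

```python
def two_point_crossover(parent1, parent2, point1, point2):
    """Two-point crossover sketch: the offspring takes the genes between
    the two crossover points from the second parent and keeps the first
    parent's genes everywhere else."""
    return parent1[:point1] + parent2[point1:point2] + parent1[point2:]
```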
6 Conclusion

This work introduces a GA for hyperparameter optimization and applies it to machine translation. We demonstrate that optimization of hyperparameters via a GA can outperform a random selection of hyperparameters; specifically, outperforming is defined as the ability of the algorithm to arrive at the goal with fewer individuals added. Finally, we propose future research directions which are expected to provide additional gains in the efficacy of GAs.
Acknowledgments

I would like to thank and acknowledge Dr. Kevin Duh (JHU) for giving me the opportunity to pursue research with him. Thank you for the continuous support, patience, and motivation throughout the entire process. Additional thanks for the feedback and comments on this paper, and for the invaluable guidance regarding programming, resources for help, and much more. Also, I am very appreciative of and grateful to my family and friends for their love, patience, and support, without which this project would not have been possible.
References
Hao Qin, Takahiro Shinozaki, and Kevin Duh. Evolution strategy based automatic tuning of neural machine translation systems.

Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 497-504, 2017.

Takafumi Moriya, Tomohiro Tanaka, Takahiro Shinozaki, Shinji Watanabe, and Kevin Duh. Evolution-strategy-based automation of system development for high-performance speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1):77-88, 2018.

Xuan Zhang and Kevin Duh. Reproducible and efficient benchmarks for hyperparameter optimization of neural machine translation systems. Transactions of the Association for Computational Linguistics (TACL), 2020.

Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, and Hassan Sajjad. Findings of the first shared task on machine translation robustness. In Proceedings of the Fourth Conference on Machine Translation, 2019.

David E. Goldberg and John Henry Holland. Genetic algorithms and machine learning. 1988.

Darrell Whitley. A genetic algorithm tutorial. Statistics and Computing, 4(2):65-85, 1994.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388-395, 2004.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.