A Genetic Algorithm with Tree-structured Mutation for Hyperparameter Optimisation of Graph Neural Networks
Yingfang Yuan
School of Mathematical and Computer Sciences, Heriot-Watt University
Edinburgh, [email protected]
Wenjun Wang
School of Mathematical and Computer Sciences, Heriot-Watt University
Edinburgh, [email protected]
Wei Pang *
School of Mathematical and Computer Sciences, Heriot-Watt University
Edinburgh, [email protected]
Abstract—In recent years, graph neural networks (GNNs) have gained increasing attention, as they possess an excellent capability for processing graph-related problems. In practice, hyperparameter optimisation (HPO) is critical for GNNs to achieve satisfactory results, but this process is costly because evaluating different hyperparameter settings requires training many GNNs. Many approaches have been proposed for HPO, aiming to identify promising hyperparameters efficiently. In particular, genetic algorithms (GAs) for HPO have been explored, which treat a GNN as a black-box model of which only the outputs can be observed given a set of hyperparameters. However, because GNN models are extremely sophisticated and the evaluations of hyperparameters on GNNs are expensive, GA requires advanced techniques to balance the exploration and exploitation of the search and to make the optimisation more effective given limited computational resources. Therefore, we propose a tree-structured mutation strategy for GA to alleviate this issue. Meanwhile, we review recent HPO works, which gives room for the idea of the tree structure to develop, and we hope our approach can further improve these HPO methods in the future.
Index Terms—Genetic Algorithm, Tree-structured Mutation, Graph Neural Network, Hyperparameter Optimisation
I. INTRODUCTION
A graph describes structured objects consisting of nodes and edges, such as citation networks, molecules, and road networks. In graph-related learning and prediction problems, traditional machine learning methods require feature engineering to process graph data prior to training or prediction [1]. In contrast, graph neural networks (GNNs) are proposed to operate directly on graphs to solve graph-related problems in an end-to-end manner [1]. GNNs exploit neural networks to model graphs and perform representation learning, which requires that GNNs are set with appropriate hyperparameters (e.g., learning rate, the number of neural network layers) to control the learning process. From our perspective, more complicated neural architectures are required to process structured graph data, which may result in the HPO tasks for GNNs being more challenging. Falkner et al. [2] pointed out that deep learning algorithms are very sensitive to many hyperparameters. The work presented in [3] demonstrated that HPO for GNNs is vital for achieving satisfactory results in practice. Therefore, research on effective HPO approaches for GNNs is critical and of great value for GNNs applied to various real-world problems.

In brief, HPO is a trial-and-error process to obtain optimal solutions via iteratively generating and evaluating hyperparameter settings until the preset stopping condition(s) are satisfied. Commonly, the objective function of the GNN is selected as the fitness function of HPO to evaluate hyperparameters. However, the evaluation is often very expensive because training GNN models takes a lot of computational resources. On the other hand, with the development of deep learning, neural networks have an increasing number of layers and other optional hyperparameters (e.g., activation functions, optimisation methods), which results in a large search space and increases the difficulty of HPO.

* Corresponding author
So, existing works on HPO have been conducted to address the issues of how to find the best solutions with fewer trials, reducing evaluation cost, and improving the capability of exploration. In this paper, a trial denotes a complete process of evaluating a hyperparameter setting (solution) on the fitness function (objective function).

There are many state-of-the-art HPO approaches with different search strategies. For example, TPE [4] and BOHB [2] are proposed based on Bayesian optimisation (BO), which is suitable for functions that are expensive to evaluate. However, the former makes use of density functions to model two sets (good and bad) of hyperparameters, instead of using a Gaussian process as in [5], [6] as a standard surrogate model to approximate the distribution over the objective function. Furthermore, BOHB combines Bayesian optimisation and bandit-based methods to ensure good performance at any time and to speed up the process towards optimal solutions. In general, BO-based methods balance exploration and exploitation by an acquisition function. In contrast, CMA-ES [7] and HESGA [3] are evolutionary approaches which exploit high-quality individuals to guide the generation of promising offspring, while relying on evolutionary operators to explore the search space. CMA-ES employs an adaptive multivariate distribution to sample individuals, and HESGA uses a genetic algorithm which relies on crossover and mutation operators to generate better individuals. To deepen the work of HESGA [3], our research focuses on improving the exploration capability of the mutation operator.

The genetic algorithm with hierarchical evaluation strategy (HESGA) [3] utilises an elite archive to preserve excellent individuals, from which a parent can be selected to generate offspring. Meanwhile, a fast evaluation strategy has been introduced in HESGA to speed up optimisation and save computational resources.
Furthermore, HESGA uses the single-point mutation as in [8]–[10], which is one of the factors preventing GA from getting stuck in local optima and maintaining the diversity of the population [11]. However, we believe that the mutation operation in HESGA can be further explored, as only classical mutation operations were used in HESGA. Therefore, in this research we propose a tree-structured mutation (TSM) which employs a tree structure to store hierarchical historical information to guide mutation. In this way, GA can explore the whole search space more adaptively and effectively. Meanwhile, the idea of TSM is also compatible and scalable to other HPO algorithms that hierarchize historical information for future discovery.

Our contributions are summarised as below:
• We have developed a tree-structured mutation operation for GA to adaptively maintain the balance between exploration and exploitation when searching the hyperparameter space of GNNs, meanwhile retaining the randomness of the mutation operator.
• We conducted a review of advanced HPO methods, and we found that the tree-structured approach has the potential to be integrated into other approaches.
• Our research contributes to the development of molecular machine learning [12] as well as HPO for GNNs in general.

The rest of this paper is organized as follows. Section II introduces relevant work on HPO. In Section III, the details of our tree-structured mutation are presented. The experiments are reported and the results analysed in Section IV. Finally, Section V concludes the paper and explores some directions for future work.

II. RELATED WORK
In this section, we will first review some popular HPO methods. Then, the genetic algorithm with hierarchical evaluation strategy (HESGA) [3] will be introduced, because our tree-structured mutation is proposed upon it. Finally, some mutation strategies will be reviewed.
A. Hyperparameter Optimisation Methods
There have been a lot of state-of-the-art HPO methods, and these methods can be classified into black-box optimisation and multi-fidelity optimisation approaches [13]. The former focuses on the effectiveness of the search algorithms, and the latter focuses on efficiently using computational resources in search algorithms [14]–[16].
Black-box optimisation [17], [18] implies that the details of the objective function and its gradient information are unknown while searching for optimal solutions. In HPO, hyperparameters can be evaluated on the objective function, but we have no access to any information about the model. Random search [19] and grid search are two simple and commonly used methods for HPO. Bergstra et al. [4] hold that random search is more efficient than grid search given a fixed, limited computation budget. However, facing problems with expensive evaluations, it is very challenging for random search and grid search to achieve satisfactory performance. The use of
Bayesian optimisation (BO) can alleviate the issue of high computational cost [14], [18]. The framework of BO is summarised as a sequential model-based algorithm [20] that iteratively exploits historical data to fit surrogate models or other transformations, while the most promising candidate is drawn according to a predefined criterion (i.e., an acquisition function). The work presented in [4]–[6], [21] makes use of Gaussian processes (GPs) to approximate the distribution over the objective function, and the candidate is selected by the acquisition function: probability of improvement (PI) or expected improvement (EI). In contrast, TPE [4] employs Parzen estimators as a surrogate to directly model promising and unpromising hyperparameters, and the candidate solutions are drawn according to EI, which is proportional to the ratio of two density functions. During the search for optimal solutions, the acquisition function plays a significant role in balancing exploration and exploitation.

Furthermore, evolutionary computation has demonstrated the capability of solving HPO problems [3], [7], [22], [23]. CMA-ES [7] imitates biological evolution, assuming that no matter what kind of gene changes, the results (traits) always follow a Gaussian distribution with a certain variance and zero mean. Meanwhile, the generated population is evaluated on the objective function, and a portion of the individuals with excellent performance is selected to guide evolution, moving towards the area where better individuals would be drawn with higher probability.

Successive halving [24] is a bandit-based multi-fidelity method for efficiently allocating computational resources, giving the most budget to the most promising individuals. Successive halving is an iterative process in which the better half of the individuals is retained and the half with lower performance is discarded each time a portion of the total computational resource runs out.
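The successive halving procedure just described can be sketched as follows; `evaluate(config, budget)` and the config list are placeholders for a real training routine and real hyperparameter settings, so this is an illustration rather than any library's implementation.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the best 1/eta of the configs, multiplying the
    per-config budget by eta, until a single config remains.
    evaluate(config, budget) should return a loss (lower is better)."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Rank all surviving configs under the current (cheap) budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = scored[: max(1, len(survivors) // eta)]
        budget *= eta  # promising configs receive exponentially more budget
    return survivors[0]
```

Hyperband's modification is to rerun this procedure with several different initial budget allocations, since a config's early loss does not always predict its terminal loss.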
Hyperband [25] and BOHB [2] are proposed based on successive halving; the former proposes a modification that repeatedly performs the successive halving method with different budget allocations to find the best solutions. Li et al. [25] pointed out that because the terminal losses are unknown, the trials of different allocations are helpful for recognising high-quality solutions. For example, two individuals might not show a significant difference in fitness in an early epoch, but a greater difference may be observed in a later epoch if more computational resources (i.e., epochs) are allocated. Furthermore, BOHB combines Hyperband and BO to chase an ideal state in which both performance and computational cost are covered.

B. Genetic Algorithm

A genetic algorithm (GA) is a class of evolutionary computation which has been applied to HPO [3], [26], [27]. GA evaluates the qualities of individuals in the population by using a fitness function. To search for high-quality individuals, the GA relies on crossover, mutation, and selection operators. The selection is inspired by natural selection, which means GA selects individuals as parents according to probabilities (e.g., roulette wheel) to generate the next generation by crossover and mutation, and the probabilities are proportional to their quality (fitness). There are many crossover techniques, such as uniform crossover [28], single-point crossover [29], and heuristic crossover [30], which all aim to recombine genes from the parents to generate offspring. In the evolutionary process, the mutation operator, as the last step, is generally assigned a low probability of being invoked.
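Roulette-wheel selection, used above for choosing parents and later reused by TSM to pick mutation targets, can be sketched as below; this is a minimal illustration, not the authors' exact implementation.

```python
import random

def roulette_wheel(fitnesses, rng=random):
    """Return an index drawn with probability proportional to fitness."""
    total = sum(fitnesses)
    r = rng.uniform(0, total)
    acc = 0.0
    for i, f in enumerate(fitnesses):
        acc += f
        if r <= acc:
            return i
    return len(fitnesses) - 1  # guard against floating-point round-off
```

Individuals with larger fitness occupy a wider slice of the wheel and are selected more often, while low-fitness individuals still retain a nonzero chance, which preserves diversity.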
However, mutation plays a significant role in preventing GA from getting stuck in local optima. Similar to the probabilistic sampling of solutions in BO-based optimisation, mutation is employed to endow GA with the capability of exploring solutions.

The variable-length genetic algorithm [31] has been proposed to solve the neural network HPO problem by increasing the length of the chromosome (i.e., enlarging the search space) if the fitness does not satisfy preset conditions. Moreover, HESGA [3] introduced an elite archive mechanism which is used to store a number of currently optimal individuals. Parents are selected respectively from the elite archive and the population according to the roulette wheel (Fig. 1). In this way, new offspring always own at least a portion of the strong genes from the parent in the elite archive if crossover occurs. Meanwhile, HESGA provides the strategy of using fast and full evaluation methods to alleviate the issue of expensive evaluations. Fast evaluation is applied to the whole population. The individuals showing the greatest change of fitness in the early training stage (10% ∼
20% of the total epochs) are considered promising, as it is assumed that individuals with a steeper learning curve have greater potential to achieve satisfactory results. Meanwhile, full evaluation is employed to measure a small portion of individuals selected by fast evaluation. The strategy of fast and full evaluation can be considered as multi-fidelity optimisation, which focuses on decreasing the evaluation cost by combining a large number of low-cost, low-fidelity evaluations with a small number of costly, high-fidelity evaluations.
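A simplified sketch of this two-level strategy is shown below. The `train_partial` and `train_full` callables are hypothetical placeholders, and note that this sketch ranks individuals by their early-stage loss, whereas HESGA actually considers the steepness of the learning curve over the early epochs.

```python
def hierarchical_evaluation(population, train_partial, train_full, k):
    """Fast-evaluate every individual on a small budget, then spend the
    expensive full evaluation only on the k most promising ones."""
    ranked = sorted(population, key=train_partial)  # lower loss is better
    candidates = ranked[:k]                         # survivors of fast evaluation
    return [(train_full(ind), ind) for ind in candidates]
```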
C. Mutation
Many mutation strategies have been proposed for GA, and we only present some of them which are related to our research. Random (uniform) mutation and non-uniform mutation have been proposed in [32]. In random mutation, a gene is replaced with a random value drawn from a defined range. On the other hand, in non-uniform mutation, the strength of mutation decreases along with the increase in the
Fig. 1. Genetic Algorithm with Hierarchical Evaluation Strategy [3]
Fig. 2. Binary Crossover and Mutation [3]

number of generations. In adaptive mutation [33], [34], each solution is assigned a mutation rate, in order to maintain the diversity of the population without affecting the algorithm's performance. Based on Gaussian mutation, self-adaptive mutation [35] allows GA to vary the mutation strength during the evolution, while self-adaptive Gaussian mutation based on adaptation of the population size is proposed in [36]. In pointed directed (PoD) mutation [37], each gene is tightly associated with a single bit that guides the direction of mutation which the gene may follow. Compared with Gaussian mutation, Deep et al. [38] reported that PoD mutation is more advantageous in unconstrained benchmark problems. The above-mentioned mutation operators exploit the historical information generated during evolution to adjust the mutation strategy dynamically, which inspired our TSM to improve the naive single-point mutation used in [3].

III. METHOD
In this section, the design of the tree-structured mutation (TSM) will be described. Meanwhile, we will introduce the use of TSM in HESGA.
A. HESGA
The framework of HESGA [3] is shown in Fig. 1. The TSM is proposed as a further improvement of HESGA. The original HESGA uses a binary encoding scheme for hyperparameters in HPO: each hyperparameter is assigned a fixed-length binary code with a step size (i.e., resolution). For example, if the batch size ranges from 8 to 512 with a step size of 8, there are 64 possible values, and a batch size of 16 can be represented by a 6-bit code such as [0, 0, 0, 0, 0, 1]. Single-point crossover and mutation are employed in HESGA.

Fig. 3. Tree-structured hyperparameter space. The internal nodes s, l, f, n represent the different hyperparameters and are used to define hyperparameter subspaces, distinguished by subscripts; for example, the two child nodes of s respectively correspond to the lower and upper portions of the range of the hyperparameter s. n_s denotes the maximum width of the tree and n_h is the dimension of hyperparameter settings.

Fig. 2 shows that both the mutation and crossover are implemented by first using a uniform distribution to randomly generate a position p: for crossover, the bits to the right of the position p are swapped between the two parent chromosomes, and for mutation, the bit at the position p is flipped from 0 to 1, or vice versa. In the initialisation, all randomly generated individuals are assessed by full evaluation, and a number of individuals with higher fitness values are selected to be included in the elite archive, as shown by Steps (1) and (2) in Fig. 1. To update the elite archive, offspring are generated by operating crossover and mutation on the two parents which are respectively selected by roulette wheel from the elite archive and the population in Steps (3), (4) and (5).
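The encoding and the two single-point operators can be sketched as follows; the bit-widths and ranges are illustrative (e.g., a batch size in 8–512 with step 8 has 64 values and fits in 6 bits), and this is a sketch of the scheme described above rather than the authors' code.

```python
import random

def encode(value, low, step, n_bits):
    """Map a hyperparameter value to a fixed-length list of bits."""
    index = (value - low) // step
    return [int(b) for b in format(index, f"0{n_bits}b")]

def decode(bits, low, step):
    """Inverse of encode: bits back to the hyperparameter value."""
    return low + int("".join(map(str, bits)), 2) * step

def single_point_crossover(a, b, rng=random):
    p = rng.randrange(1, len(a))          # random crossover position
    return a[:p] + b[p:], b[:p] + a[p:]   # swap the bits right of p

def single_point_mutation(bits, rng=random):
    p = rng.randrange(len(bits))          # random mutation position
    child = list(bits)
    child[p] = 1 - child[p]               # flip the chosen bit
    return child
```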
Then, in Step (6), fast evaluation is applied to find a small number of individuals (candidates) which have better fitness values; these individuals are further assessed by full evaluation in Step (8), and the elite archive is updated in Step (9) if candidates are better than some of the individuals in the elite archive. The work presented in [3] demonstrated that HESGA achieves satisfactory performance on HPO for GNNs.

B. Tree-structured Mutation
In HPO using GA, the mutation operator is a way for GA to explore the hyperparameter search space. All mutation operators discussed in Section II-C are proposed based on the core principle of effective mutation towards optimal solutions. Most black-box optimisation methods utilise historical information to carry out effective exploration. Here, back to GA, we expect that the historical information can also be exploited, with a tree, to suggest mutations.

Considering that TPE [4] used a tree structure to model the sampling of hyperparameter settings, we propose to employ a binary tree to build a hierarchical search space, as shown in Fig. 3. Each pathway from the root node to a leaf node represents a subspace of the whole hyperparameter space. For example, from the root to a node c, the child nodes (s, l, f, n) define the corresponding ranges of the hyperparameters (batch size, learning rate, the size of the convolution layers, and the number of neurons in the fully-connected layer), which forms a subspace of the whole search space and can be represented by a binary string. Meanwhile, the left and right child nodes respectively define the lower and higher ranges of a hyperparameter by thresholds. For example, the median value of the range of s is chosen as the threshold to give two subspaces of the same size: the left child of s covers the lower half of the range and the right child covers the upper half. In this way, the whole search space is divided into many subspaces; the number of subspaces equals the total number of leaf nodes c (i.e., the maximum width n_s), which is 2 raised to the power of the dimension of a hyperparameter setting n_h, where n_h is also the maximum depth of the tree without leaf nodes.

To use the tree to suggest mutations, the internal and leaf nodes take the responsibility of recording historical information. The internal nodes are used to store the number of times mutation has occurred on different dimensions in a subspace. Concurrently, the leaf nodes c are used to record the number of times a subspace has been visited.
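The subspace construction just described can be sketched as below. The concrete ranges are assumptions for illustration (only the batch-size range 8–512 is stated explicitly in the paper), and each root-to-leaf binary path selects one half-range per hyperparameter.

```python
from itertools import product

# Illustrative hyperparameter ranges (assumed for this sketch).
RANGES = {
    "s": (8, 512),          # batch size
    "l": (0.0001, 0.0031),  # learning rate
    "f": (8, 512),          # size of convolution layers
    "n": (32, 1024),        # neurons in the fully-connected layer
}
N_H = len(RANGES)  # tree depth excluding leaf nodes

def subspace(path):
    """Map a root-to-leaf binary path (one bit per level) to the
    half-range each bit selects: 0 = lower half, 1 = upper half."""
    bounds = {}
    for bit, (name, (lo, hi)) in zip(path, RANGES.items()):
        mid = (lo + hi) / 2  # median threshold splits the range in two
        bounds[name] = (lo, mid) if bit == 0 else (mid, hi)
    return bounds

# One visit counter per leaf: 2**N_H subspaces in total.
leaf_counts = {path: 0 for path in product((0, 1), repeat=N_H)}
```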
For example, when a hyperparameter setting λ sampled from the subspace (s, l, f, n) has been evaluated on the objective function, the value of the corresponding leaf node c will be increased by 1. If λ underwent a mutation on the batch size, then the counter in the corresponding s node will also be increased by 1. In this way, the whole search space is hierarchized, which increases the exploration ability of mutation.

In this research, based on the framework of HESGA, we use TSM to replace the single-point mutation, considering that a more sophisticated mutation may lead to better performance. During the experiments, we observed that identical individuals may exist in the elite archive and the population, which wastes computational resources on evaluating them. Many reasons may lead to this issue; for example, an extremely excellent individual in the population will be selected to be included in the elite archive while also outperforming the other individuals in the elite archive, which means this individual has a higher probability of being selected as both Parent A and Parent B in the next generation. Two identical parents will invalidate crossover, by which the same individuals will be generated. To solve this issue, we set a condition in Step (5.2) (Fig. 4) to prohibit duplicates: if duplication is detected, the mutation is forcibly invoked. Based on TSM, we use two strategies, TSM with a given individual and TSM without a given individual, respectively for the two situations: mutation invoked with a preset probability in Steps (5.1 and 5.2), and forced mutation when duplicates are detected in Steps (5.3 and 5.4), as shown in Fig. 4.
1) TSM with a Given Individual:
HESGA operates the single-point mutation with a probability of 0.2, as indicated by Step (5) in Fig. 1. TSM is also invoked with the same probability in Step (5.1) in Fig. 4. Compared with changing one bit of the binary encoding under a uniform distribution, TSM uses the historical information recorded in the tree to generate a probability distribution to guide the mutation. Specifically, when mutation is activated, for the individual concerned, its corresponding subspace (i.e., pathway) along the tree will be
Fig. 4. Genetic Algorithm with Hierarchical Evaluation Strategy with TSM mutation

identified. In that pathway, the counts t of previously occurred mutations, stored in the internal nodes according to the mutated positions, are given to Equation (1) to compute probabilities. In Equation (1), i is the depth of an internal node given pathway p, and t_i represents the count recorded in that node. f is a reciprocal function, j ranges over the depths of all internal nodes (e.g., n with depth 4, f with depth 3, l with depth 2, and s with depth 1) in pathway p, and t_j is the count stored in each of those nodes. The probabilities determine where the mutation happens following the roulette wheel method, and the mutated individual is then used to update the tree to affect future mutations. In this way, within a subspace, a dimension that has previously been explored frequently will have a lower probability of being explored in the future. When a dimension is sampled, the new value is drawn from the search space defined in that node by a uniform distribution. For example, when s is selected, a new value from 8 to 255 with a resolution of 8 will be drawn and converted into binary coding to replace the original fragment. The TSM with a given individual is summarised in Algorithm 1.

p(t_i) = f(t_i) / Σ_{j=1}^{n_h} f(t_j),   i, j ∈ {1, 2, ..., n_h}, given p.   (1)
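Equation (1) can be sketched in code as below. The paper does not specify the value of the reciprocal f at t = 0, so the +1 smoothing here is an assumption of this sketch.

```python
import random

def mutation_probs(counts):
    """Eq. (1): p(t_i) = f(t_i) / sum_j f(t_j), with f a reciprocal.
    counts[i] is how often dimension i has mutated along the pathway;
    the +1 guards against division by zero (an assumption)."""
    weights = [1.0 / (t + 1) for t in counts]
    total = sum(weights)
    return [w / total for w in weights]

def pick_mutation_dimension(counts, rng=random):
    """Roulette-wheel draw over the dimensions of the pathway."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(mutation_probs(counts)):
        acc += p
        if r <= acc:
            return i
    return len(counts) - 1  # guard against floating-point round-off
```

Dimensions mutated often in the past get a smaller share of the wheel, so the mutation is steered towards less-explored dimensions while remaining stochastic.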
2) TSM without a Given Individual:
To avoid duplicated individuals existing in the elite archive, Step (9) in Fig. 1 is extended with a duplication detection mechanism. The duplicates are deleted except for the one with the highest fitness value (in the experiments, there exist fluctuations when evaluating the same hyperparameter setting due to the randomness in the GNN training and evaluation process). Meanwhile, if duplicates are found in the population, we apply TSM without a given individual (shown in Fig. 4, Steps (5.3 and 5.4)) to generate a new individual to replace them. In fact, TSM with a given individual is also feasible to deal with the
Algorithm 1 TSM with a Given Individual
Input: an individual s, the tree; t = [ ], t_total = 0
  p = tree(s)                        ▷ pathway p identification in the tree
  for j = 1 → n_h do                 ▷ in pathway p; n_h is the maximum depth of the tree without leaf nodes
    t_total += f(t_j)                ▷ t_j denotes the count recorded in node j; f is the reciprocal function
    t.append(f(t_j))
  end for
  t' ← each element in t divided by t_total
  i = rws(n_h, t')                   ▷ node i; rws is roulette wheel selection
  r ← the defined value range in node i
  v = uniform(r)                     ▷ mutated value v
  s' ← s with s_i replaced by binary(v)  ▷ s_i is the fragment of the binary coding of s chosen for mutation
Output: individual s'

duplication issue, but here we propose TSM without a given individual to increase the population diversity.

Similar to Algorithm 1, TSM without a given individual (Algorithm 2) also exploits the information stored in the tree and follows the roulette wheel to select the target of mutation. However, the latter selects a subspace from which to sample a new individual (the mutated one), rather than being given a subspace and selecting a dimension to mutate. Practically, the mutation starts by computing the probability of each subspace by Equation (2), where c indexes the leaf nodes, n_s is the total number of leaf nodes (i.e., the maximum width), and f is the reciprocal function. Each leaf node corresponds to a subspace, and the number of times a subspace has been explored is recorded on it. Eq. (2) indicates that a subspace with a large count t will have a lower probability of being explored by mutation. In this way, exploitation can be achieved by crossover and the elite archive, while exploration is facilitated by mutations. The pseudo-code is shown in Algorithm 2.

p(t_i) = f(t_i) / Σ_{c=1}^{n_s} f(t_c),   i, c ∈ {1, 2, ..., n_s}, given the leaf nodes.   (2)

IV. EXPERIMENTS
To assess the effectiveness of TSM, we conducted experiments to compare HESGA with single-point mutation [3] against HESGA equipped with TSM. The HESGA in our experiments is modified to reject duplication and, as mentioned in Section III-A, the duplicates found in the population are forcibly re-sampled by TSM without a given individual. The settings of HESGA are as follows: the maximum number of generations is set to 10, and the probabilities of crossover and mutation are 0.8 and 0.2, the same as the original experimental settings [3]. Meanwhile, the ratio of the size of the elite archive to the size of the population is 0.5, considering that 0.1 may lead to excessive selection pressure.

Our experiments have been conducted on three benchmark datasets: ESOL, FreeSolv, and Lipophilicity from [12]. They

Algorithm 2 TSM without a Given Individual
Input: the tree; t = [ ], t_total = 0
  for c = 1 → n_s do
    t_total += f(t_c)                ▷ t_c denotes the count recorded in node c; f is the reciprocal function
    t.append(f(t_c))
  end for
  t' ← each element in t divided by t_total
  i = rws(n_s, t')                   ▷ subspace i; rws is roulette wheel selection
  v = [ ]
  for j = 1 → n_h do                 ▷ in subspace i
    r_j ← the defined value range in the node with depth j
    v_j = uniform(r_j)               ▷ mutated value v_j sampled from a uniform distribution given range r_j
    v.append(binary(v_j))
  end for
  s = v
Output: individual s

are three molecular property prediction datasets which are used to assess the performance of GNNs. Meanwhile, there are many types of GNN algorithms, but the neural fingerprint graph model (GC) [39] has been selected because it was proposed for molecular machine learning. GC exploits neural architectures to model the circular fingerprint [40] for learning molecular representations for prediction tasks. There are four hyperparameters selected for HPO, including s_b (batch size), l_r (learning rate), s_c (the size of the graph convolution layer), and n_n (the number of neurons in the fully-connected layer); the search space is constructed as follows: the range of s_b is 8 ∼ 512 with step size 8, the range of l_r has a step size of 0.0001, and the range of s_c is 8 ∼
512 with step size 8, and the range of n_n starts from 32; the median values of s_b, l_r, s_c, and n_n, which are 256, 0.0016, 256, and 512 respectively, are used to build the binary tree in the experiments.

As in [3], the root mean square error (RMSE) of GC is used as the fitness function for HPO, and it is also used to measure the HPO performance. The best hyperparameter setting found by an HPO method (HESGA or HESGA+TSM) is given to GC to run 30 times, and the mean of the 30 RMSEs is used to measure the performance of HPO. Meanwhile, in order to make the comparisons more convincing, a t-test with a two-tailed significance level of 5% is employed to compare the performance of HESGA and HESGA+TSM and see if there exists a significant difference. In the t-test, a negative t value with the hypothesis h = 1 indicates that HESGA significantly outperforms HESGA+TSM; a positive t value with the hypothesis h = 1 indicates that HESGA+TSM significantly outperforms HESGA; and h = 0 (the null hypothesis) means that there is no significant difference between HESGA and HESGA+TSM.

In the experiments on ESOL (Table I), HESGA+TSM has a smaller mean and standard deviation of RMSEs than HESGA. However, there is no significant difference between HESGA and HESGA+TSM at the significance level α = 5%. We consider that the size of the ESOL dataset constrains the performance of HESGA+TSM, because a sufficient amount of data is a prerequisite for GC to achieve better performance with a more complex architecture. In Table I, HESGA+TSM found larger s_c and n_n, which means a GC with more parameters, and these require more training data to be tuned for better performance. Furthermore, if we set the significance level to α = 10%, HESGA+TSM outperforms HESGA on the validation dataset.
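The comparison over the two samples of 30 RMSEs can be sketched with a two-sample t statistic. The sign convention below (first argument HESGA, second HESGA+TSM, lower RMSE better) matches the paper's reading that a negative t favours HESGA; this is an illustrative Welch statistic, not necessarily the exact test the authors ran.

```python
from statistics import mean, stdev

def welch_t(rmses_a, rmses_b):
    """Welch's t statistic for two independent samples of RMSEs.
    With a = HESGA and b = HESGA+TSM, a positive value suggests b
    achieved lower error (an assumption about the paper's convention)."""
    na, nb = len(rmses_a), len(rmses_b)
    var_a, var_b = stdev(rmses_a) ** 2, stdev(rmses_b) ** 2
    return (mean(rmses_a) - mean(rmses_b)) / (var_a / na + var_b / nb) ** 0.5
```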
Meanwhile, it is obvious that the two hyperparameter settings are completely different and come from two different subspaces, which indicates the complex and multi-modal nature of the underlying hyperparameter search space; the discovery in different subspaces is therefore meaningful.

In the experiments on FreeSolv (Table II), HESGA presents better performance than HESGA+TSM. FreeSolv is the smallest dataset used in our experiments. A small dataset is intrinsically challenging for GC to learn from while keeping a stable performance, which may lead to the underperformance of HESGA+TSM: since its strength is exploration, stable and accurate feedback (i.e., the results of evaluation on GC) is essential for conducting efficient exploration.

Furthermore, when duplicates appear in the population, HESGA uses single-point mutation, which retains most of the genes from the parents, so relatively speaking the strength of the mutation is small, while HESGA+TSM is guided by the tree-based mutation, which explores the search space more broadly than HESGA. Therefore, we consider that when the size of a dataset is relatively small, global exploration may bring more fluctuation and unknown factors on top of the instability of GC, which brings more difficulty to the problem, while "local" exploration might be more effective since it reduces the underlying unknown risks from the global area.

In contrast, Lipophilicity is the largest dataset in our experiments, and as shown in Table III, HESGA+TSM outperforms HESGA on this dataset with a significant difference, and the lower standard deviation means the result of HESGA+TSM is more stable. When the size of the dataset increases, the full evaluation becomes more accurate, and it provides more helpful information for GA to generate strong offspring in terms of exploitation.
Meanwhile, more effective exploration becomes necessary, and HESGA+TSM shows better performance because TSM makes use of historical information to search the hyperparameter space, rather than searching each part of the hyperparameter space with equal probability.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed TSM for the genetic algorithm as applied to HPO of GNNs, which improved the performance of HESGA on the largest dataset in our experiments. Compared with operating HPO on a relatively small dataset, the achieved improvement of HPO performance on the larger dataset is more meaningful, since the evaluations are more expensive. Meanwhile, the research on GA-based HPO for GNNs is at
an early stage, and our work will help other researchers to better explore this area. On the other hand, our research contributes to the development of molecular machine learning, which may facilitate research in relevant domains such as materials science, biology, and drug discovery. Furthermore, the idea of utilising a tree structure to hierarchise the search space and record historical information can be applied to other HPO methods. Finally, we believe more research can be done on further developing TSM for evolutionary HPO. We list some potential directions as follows.

TABLE I: THE RESULTS ON THE ESOL DATASET

ESOL            Hyperparameters           Train       Validation  Test
HESGA           s_b=64, l_r=0.0028,       0.2847      0.8843      0.8862      (Mean RMSE)
                s_c=304, n_n=64           0.0424      0.0509      0.0506      (Mean STD)
HESGA+TSM       s_b=48, l_r=0.0012,       0.2734      —           —           (Mean RMSE)
                s_c=320, n_n=320          0.0333      0.0357      0.0463      (Mean STD)
T-test (α=5%)                             t=1., h=0   t=1., h=0   t=0., h=0

TABLE II: THE RESULTS ON THE FREESOLV DATASET

FreeSolv        Hyperparameters           Train       Validation  Test
HESGA           s_b=24, l_r=0.0030,       0.5498      —           —           (Mean RMSE)
                s_c=208, n_n=224          0.1251      0.1199      0.1246      (Mean STD)
HESGA+TSM       s_b=8, l_r=0.0010,        1.1181      1.4083      1.3508      (Mean RMSE)
                s_c=480, n_n=576          0.2692      0.2408      0.2874      (Mean STD)
T-test (α=5%)                             t=−., h=1   t=−., h=1   t=−., h=1

TABLE III: THE RESULTS ON THE LIPOPHILICITY DATASET

Lipophilicity   Hyperparameters           Train       Validation  Test
HESGA           s_b=136, l_r=0.0031,      0.3656      0.7402      0.7293      (Mean RMSE)
                s_c=232, n_n=1024         0.0881      0.0476      0.0462      (Mean STD)
HESGA+TSM       s_b=168, l_r=0.0011,      0.2390      —           —           (Mean RMSE)
                s_c=304, n_n=864          0.0676      0.0284      0.0269      (Mean STD)
T-test (α=5%)                             t=6., h=1   t=4., h=1   t=2., h=1

A. Reward and Rejection Mechanism
Currently, the inner and leaf nodes are used to store the counts of previous explorations. Inspired by reinforcement learning [41], the results of evaluations on the fitness function could be recorded as rewards and stored together with the counts to affect future discovery. By reconstructing Eqs. (1) and (2) in this way, TSM would not only increase the exploration ability but also improve exploitation in the later stage through the reward mechanism. Meanwhile, the counts and rewards could also work together in a surrogate model to reject unpromising trials according to historical information.
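Since Eqs. (1) and (2) are not reproduced in this excerpt, the sketch below uses a UCB-style scoring rule as a stand-in for the proposed reward-augmented node selection; all names, the scoring formula, and the hyperparameter range are illustrative assumptions, not the paper's implementation.

```python
import math
import random

class Node:
    """A node of the TSM binary tree over one hyperparameter subspace.

    Stores the visit count already used by TSM, plus the proposed
    cumulative reward (e.g., negative validation RMSE of trials that
    fell into this subspace).
    """
    def __init__(self, low, high):
        self.low, self.high = low, high   # subspace bounds
        self.count = 0                    # times explored so far
        self.reward = 0.0                 # sum of observed rewards
        self.left = self.right = None

def ucb_score(node, total_count, c=1.4):
    """UCB-style stand-in for a reworked Eq. (1)/(2): favour subspaces
    with high average reward (exploitation) or few visits (exploration)."""
    if node.count == 0:
        return float("inf")               # unvisited subspaces come first
    return node.reward / node.count + c * math.sqrt(
        math.log(total_count) / node.count)

def descend(root):
    """Walk from the root to a leaf, picking the higher-scoring child
    at each level, then sample a mutation value from that subspace."""
    node, path = root, [root]
    while node.left is not None:
        total = node.left.count + node.right.count + 1
        node = max((node.left, node.right),
                   key=lambda n: ucb_score(n, total))
        path.append(node)
    return random.uniform(node.low, node.high), path

def update(path, reward):
    """After evaluating the mutated setting, record count and reward
    along the visited path; a surrogate could use the same statistics
    to reject unpromising trials before full evaluation."""
    for node in path:
        node.count += 1
        node.reward += reward

# Tiny demo: a depth-1 tree over an illustrative range, split at the median.
root = Node(8, 576)
root.left, root.right = Node(8, 292), Node(292, 576)
value, path = descend(root)           # both children unvisited -> explores
update(path, reward=-0.73)            # e.g., reward = -validation RMSE
```

After the update, the visited subspace carries a lower exploration bonus, so the next descent is biased toward the so-far-unvisited sibling unless the recorded reward outweighs it.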
B. Adaptive Threshold
In TSM, the binary tree requires thresholds to be specified to generate child nodes. In our experiments, we chose median values as thresholds. However, it is easy to observe that s_b in all experiments comes from the lower range (i.e., the left child node). If the thresholds were adaptive and could be updated by the rewards, in the spirit of the percentile used in TPE [4], TSM might be further improved to better focus on adaptively adjusted subspaces.

REFERENCES

[1] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph neural networks: A review of methods and applications," arXiv preprint arXiv:1812.08434, 2018.
[2] S. Falkner, A. Klein, and F. Hutter, "BOHB: Robust and efficient hyperparameter optimization at scale," in International Conference on Machine Learning. PMLR, 2018, pp. 1437–1446.
[3] Y. Yuan, W. Wang, G. M. Coghill, and W. Pang, "A novel genetic algorithm with hierarchical evaluation strategy for hyperparameter optimisation of graph neural networks," arXiv preprint arXiv:2101.09300, 2021.
[4] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems, vol. 24. Neural Information Processing Systems Foundation, 2011.
[5] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," arXiv preprint arXiv:1206.2944, 2012.
[6] R. Martinez-Cantin, "BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits," J. Mach. Learn. Res., vol. 15, no. 1, pp. 3735–3739, 2014.
[7] N. Hansen, "The CMA evolution strategy: A tutorial," arXiv preprint arXiv:1604.00772, 2016.
[8] K. Deb, R. B. Agrawal et al., "Simulated binary crossover for continuous search space," Complex Systems, vol. 9, no. 2, pp. 115–148, 1995.
[9] S. M. Lim, A. B. M. Sultan, M. N. Sulaiman, A. Mustapha, and K. Y. Leong, "Crossover and mutation operators of genetic algorithms," International Journal of Machine Learning and Computing, vol. 7, no. 1, pp. 9–12, 2017.
[10] D. Whitley, "A genetic algorithm tutorial," Statistics and Computing, vol. 4, no. 2, pp. 65–85, 1994.
[11] S. Mirjalili, "Genetic algorithm," in Evolutionary Algorithms and Neural Networks. Springer, 2019, pp. 43–55.
[12] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, "MoleculeNet: A benchmark for molecular machine learning," Chemical Science, vol. 9, no. 2, pp. 513–530, 2018.
[13] R. Elshawi, M. Maher, and S. Sakr, "Automated machine learning: State-of-the-art and open challenges," arXiv preprint arXiv:1906.02287, 2019.
[14] J. Wu, S. Toscano-Palmerin, P. I. Frazier, and A. G. Wilson, "Practical multi-fidelity Bayesian optimization for hyperparameter tuning," in Uncertainty in Artificial Intelligence. PMLR, 2020, pp. 788–798.
[15] Y. Yang and P. Perdikaris, "Conditional deep surrogate models for stochastic, high-dimensional, and multi-fidelity systems," Computational Mechanics, vol. 64, no. 2, pp. 417–434, 2019.
[16] Y.-Q. Hu, Y. Yu, W.-W. Tu, Q. Yang, Y. Chen, and W. Dai, "Multi-fidelity automatic hyper-parameter tuning via transfer series expansion," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3846–3853.
[17] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley, "Google Vizier: A service for black-box optimization," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1487–1495.
[18] K. Kandasamy, K. R. Vysyaraju, W. Neiswanger, B. Paria, C. R. Collins, J. Schneider, B. Poczos, and E. P. Xing, "Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly," Journal of Machine Learning Research, vol. 21, no. 81, pp. 1–27, 2020.
[19] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, no. 2, 2012.
[20] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in International Conference on Learning and Intelligent Optimization. Springer, 2011, pp. 507–523.
[21] J. Wu, X.-Y. Chen, H. Zhang, L.-D. Xiong, H. Lei, and S.-H. Deng, "Hyperparameter optimization for machine learning models based on Bayesian optimization," Journal of Electronic Science and Technology, vol. 17, no. 1, pp. 26–40, 2019.
[22] I. Loshchilov and F. Hutter, "CMA-ES for hyperparameter optimization of deep neural networks," arXiv preprint arXiv:1604.07269, 2016.
[23] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2017, pp. 1864–1871.
[24] K. Jamieson and A. Talwalkar, "Non-stochastic best arm identification and hyperparameter optimization," in Artificial Intelligence and Statistics. PMLR, 2016, pp. 240–248.
[25] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, "Hyperband: A novel bandit-based approach to hyperparameter optimization," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6765–6816, 2017.
[26] C. Di Francescomarino, M. Dumas, M. Federici, C. Ghidini, F. M. Maggi, W. Rizzi, and L. Simonetto, "Genetic algorithms for hyperparameter optimization in predictive business process monitoring," Information Systems, vol. 74, pp. 67–83, 2018.
[27] J.-H. Han, D.-J. Choi, S.-U. Park, and S.-K. Hong, "Hyperparameter optimization using a genetic algorithm considering verification time in a convolutional neural network," Journal of Electrical Engineering & Technology, vol. 15, no. 2, pp. 721–726, Mar. 2020. [Online]. Available: http://link.springer.com/10.1007/s42835-020-00343-7
[28] E. Semenkin and M. Semenkina, "Self-configuring genetic algorithm with modified uniform crossover operator," in International Conference in Swarm Intelligence. Springer, 2012, pp. 414–421.
[29] M. Srinivas and L. M. Patnaik, "Genetic algorithms: A survey," Computer, vol. 27, no. 6, pp. 17–26, 1994.
[30] J. Grefenstette, R. Gopal, B. Rosmaita, and D. Van Gucht, "Genetic algorithms for the traveling salesman problem," in Proceedings of the First International Conference on Genetic Algorithms and Their Applications, vol. 160, no. 168. Lawrence Erlbaum, 1985, pp. 160–168.
[31] X. Xiao, M. Yan, S. Basodi, C. Ji, and Y. Pan, "Efficient hyperparameter optimization in deep learning using a variable length genetic algorithm," arXiv preprint arXiv:2006.12703, 2020.
[32] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. Springer Science & Business Media, 2013.
[33] M. Srinivas and L. M. Patnaik, "Adaptive probabilities of crossover and mutation in genetic algorithms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 24, no. 4, pp. 656–667, 1994.
[34] D. Whitley and T. Starkweather, "GENITOR II: A distributed genetic algorithm," Journal of Experimental & Theoretical Artificial Intelligence, vol. 2, no. 3, pp. 189–214, 1990.
[35] R. Hinterding, "Gaussian mutation and self-adaption for numeric genetic algorithms," in Proceedings of 1995 IEEE International Conference on Evolutionary Computation, vol. 1, 1995, pp. 384–.
[36] R. Hinterding, Z. Michalewicz, and T. C. Peachey, "Self-adaptive genetic algorithm for numeric functions," in Parallel Problem Solving from Nature — PPSN IV, H.-M. Voigt, W. Ebeling, I. Rechenberg, and H.-P. Schwefel, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 420–429.
[37] A. Berry and P. Vamplew, "PoD can mutate: A simple dynamic directed mutation approach for genetic algorithms," 2004.
[38] K. Deep and M. Thakur, "A new mutation operator for real coded genetic algorithms," Applied Mathematics and Computation, vol. 193, no. 1, pp. 211–230, 2007.
[39] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," arXiv preprint arXiv:1509.09292, 2015.
[40] R. C. Glen, A. Bender, C. H. Arnby, L. Carlsson, S. Boyer, and J. Smith, "Circular fingerprints: Flexible molecular descriptors with applications from physical chemistry to ADME," IDrugs, vol. 9, no. 3, p. 199, 2006.
[41] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, "Hybrid reward architecture for reinforcement learning," arXiv preprint arXiv:1706.04208, 2017.