A Systematic Comparison Study on Hyperparameter Optimisation of Graph Neural Networks for Molecular Property Prediction
Yingfang Yuan
Heriot-Watt University, Edinburgh
[email protected]

Wenjun Wang
Heriot-Watt University, Edinburgh
[email protected]

Wei Pang ∗
Heriot-Watt University, Edinburgh
[email protected]
ABSTRACT
Graph neural networks (GNNs) have been proposed for a wide range of graph-related learning tasks. In particular, in recent years an increasing number of GNN systems have been applied to predict molecular properties. However, in theory there are infinitely many choices of hyperparameter settings for GNNs, and a direct impediment is selecting appropriate hyperparameters to achieve satisfactory performance at lower computational cost. Meanwhile, the sizes of many molecular datasets are far smaller than those of the datasets in typical deep learning applications, and most hyperparameter optimisation (HPO) methods have not been explored in terms of their efficiency on such small datasets in the molecular domain. In this paper, we conduct a theoretical analysis of the common and specific features of two state-of-the-art and popular HPO algorithms, TPE and CMA-ES, and compare them with random search (RS), which is used as a baseline. Experimental studies are carried out on several benchmarks from MoleculeNet, from different perspectives, to investigate the impact of RS, TPE, and CMA-ES on HPO of GNNs for molecular property prediction. From our experiments, we conclude that RS, TPE, and CMA-ES each have their own advantages in tackling different specific molecular problems. Finally, we believe our work will motivate further research on GNNs as applied to molecular machine learning problems in chemistry and materials sciences.
CCS CONCEPTS
• Computing methodologies → Neural networks; Search methodologies; • Applied computing; • General and reference → Experimentation

KEYWORDS
Graph Neural Networks, Molecular Property Prediction, Hyperparameter Optimisation

∗ Corresponding author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION

Graph neural networks (GNNs) are efficient approaches for learning the representations of structured graph data (e.g., molecules, citation networks) [29]. In recent years, several types of GNNs have been proposed for predicting molecular properties and have achieved excellent results [8, 10, 14, 18, 19, 28]. Moreover, the application of GNNs has accelerated work in many related domains, including drug discovery [9, 26], biology [6], and physics [25], and these GNNs reduce computational cost compared with traditional first-principles methods such as Density Functional Theory [14, 22]. In practice, there are many GNN variants that can be employed for molecular property prediction, each based on a distinct idea for feature learning on molecules. For example, GC (graph convolutional network) [8] exploits neural architectures to generalise the chemical operation of circular fingerprints to extract molecular features. In contrast, Weave [18] and MPNN (message passing neural network) [10] learn molecular features by taking readout operations [27] over atomic features; to learn the atomic features, Weave applies a global convolution operation, while MPNN uses a message passing process.

However, hyperparameter selection is a direct impediment for GNNs to achieve excellent results. In general, searching for optimal hyperparameters is a trial-and-error process. Traditionally, people adjusted hyperparameters manually, but this requires domain experience and intuition. To free people from this predicament, random search (RS) has been employed for hyperparameter optimisation (HPO). In brief, RS draws hyperparameter values from uniform distributions within given ranges. The drawn hyperparameter values are evaluated on an objective function, and the one with the best performance is selected when the given computational resource is exhausted.
Although very simple, RS has proved to be efficient for HPO of neural networks on many problems [4]. In recent years, an increasing number of strategies have been proposed for HPO. TPE [4] and CMA-ES [13] are two state-of-the-art HPO algorithms; both aim to improve the efficiency of searching for promising hyperparameters by utilising the experience of previous trials. In this paper, a trial denotes the process of evaluating one hyperparameter setting on an objective function [5].

Research on HPO of GNNs for molecular property prediction is still in its infancy. For example, the pioneering GNN work presented in [8, 10, 28] did not discuss the problem of HPO in detail. Meanwhile, most HPO methods have not been explored in terms of their efficiency on GNNs for this type of problem, and their performance may need further investigation, because the sizes of molecular datasets vary from hundreds to thousands of records, far fewer than those of the datasets used in typical deep learning applications (e.g., image recognition, natural language processing). At the same time, predicting molecular properties requires more sophisticated neural architectures to process irregular molecular structures, in contrast to image processing problems, which have regular spatial patterns within image patches. Therefore, it has become necessary to explore the performance of existing HPO methods on GNNs in molecular domains. This motivates our research: we conducted a methodology comparison and experimental analysis of RS, TPE, and CMA-ES to assess their effects on GNNs as HPO methods.
We expect our research to inform researchers in molecular machine learning as well as chemistry and materials sciences. The contributions of our research are summarised below:

• We conducted systematic experiments to compare and analyse HPO methods, including RS, TPE, and CMA-ES, for GNNs in the molecular domain, in terms of their features, computational cost, and performance.
• Our research on HPO for GNNs can be applied to a wider range of domains, such as the physical and social sciences.
• The outcomes of our research will contribute to the development of molecular machine learning [24] as well as HPO for GNNs in general.

The rest of this paper is organised as follows. In Section 2, related work on RS, TPE, and CMA-ES is presented. In Section 3, we conduct a methodology comparison of RS, TPE, and CMA-ES. Thereafter, the design of the experiments and detailed experimental results are described and discussed in Section 4. Finally, conclusions and future work are given in Section 5.
2 RELATED WORK

Algorithm 1: Random Search (RS) [5]
1: RS(f, T, U):
2: for t ← 1 to T do
3:     x ← draw a sample from U
4:     Evaluate f(x)
5:     H ← H ∪ {(x, f(x))}
6: end for
7: return H

Random Search (RS) [5] is an approach that uses uniform probability in determining iterative procedures at the price of optimality [30], and it is helpful for handling many ill-structured global optimisation problems with continuous and/or discrete variables [30], such as HPO.

The process of applying RS to HPO is shown in Algorithm 1. In Line 3 of Algorithm 1, a solution x (i.e., a set of hyperparameter values) is sampled from a uniform distribution U, and then evaluated on the objective function f(x) in Line 4, which is normally the most expensive step. The evaluation result f(x) and the solution x are paired and recorded in H. The procedure from Line 3 to Line 5 is executed iteratively T times. Finally, the best solution is obtained by sorting the historic solutions in H according to their corresponding f(x) values.

Furthermore, Bergstra et al. [5] hold the opinion that RS is the natural baseline for sequential HPO methods. Meanwhile, it is noted that Zelda B. [30] considered that RS is likely to be able to solve large-scale problems efficiently in a way that is not possible for deterministic algorithms. However, when using RS for HPO, the disadvantage is that its performance is accompanied by high variance, and it may not produce satisfactory results given a larger search space and limited computational resources.
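As a minimal sketch of Algorithm 1, the loop can be written in Python; the toy objective, value ranges, and seed below are hypothetical stand-ins for the GNN fitness function and search space used later, not the actual experimental setup:

```python
import random

def random_search(f, T, space, seed=0):
    """Algorithm 1: draw T settings uniformly from `space`, evaluate
    each on f, record every (x, f(x)) pair in H, return the best."""
    rng = random.Random(seed)
    H = []
    for _ in range(T):
        # Line 3: sample each hyperparameter independently and uniformly
        x = {name: rng.choice(values) for name, values in space.items()}
        # Line 4: the (usually expensive) evaluation step
        y = f(x)
        # Line 5: record the trial in the history H
        H.append((x, y))
    # Best solution: the history entry with the smallest objective value
    return min(H, key=lambda pair: pair[1])

# Hypothetical toy objective standing in for a validation RMSE.
def toy_objective(x):
    return (x["lr"] - 0.0008) ** 2 + (x["batch"] - 64) ** 2 * 1e-8

space = {
    "lr": [i * 1e-4 for i in range(1, 17)],     # 0.0001 .. 0.0016
    "batch": [32 * i for i in range(1, 9)],     # 32 .. 256
}
best_x, best_y = random_search(toy_objective, T=100, space=space)
```

The high variance mentioned above shows up directly here: rerunning with a different seed can return a noticeably different `best_x`.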
Algorithm 2: TPE [4]
1: TPE(f, M0, T, S):
2: for t ← 1 to T do
3:     x* ← argmax_x S(x, M_{t−1})
4:     Evaluate f(x*)
5:     H ← H ∪ {(x*, f(x*))}
6:     Fit a new model M_t to H
7: end for
8: return H

The problem of expensive evaluation of the fitness function can be addressed by sequential model-based global optimisation (SMBO) algorithms [4, 15, 16]. In HPO, the challenge is that the fitness function f : X → R may be expensive to evaluate for a trial of hyperparameters. Using a model-based algorithm with a surrogate to approximate f can reduce this evaluation cost. Typically, the core of an SMBO algorithm is to optimise the surrogate of the real fitness function, or some transformation of the surrogate. The two key components of SMBO algorithms are (1) which criterion is defined and optimised to obtain promising solutions given a model (or surrogate) of f, and (2) how f can be approximated from historical trials/solutions.

Tree-structured Parzen Estimator (TPE) [4] is an approach based on SMBO, as shown in Algorithm 2. Compared with RS, it makes a significant change in Line 3, in which solutions are sampled by S instead of a uniform distribution U. In S, many candidates x are drawn according to the surrogate model M, and the one (x*) with the most promising performance evaluated by Expected Improvement (EI, introduced later) is returned [17]. In Line 4, x* is then evaluated on the fitness function f, and recorded in H in Line 5. In
Line 6, the surrogate M is optimised to approximate the real fitness function using the updated H. Finally, the best solution can be obtained by sorting H after T iterations. In the following paragraphs, we review the most important components of TPE in detail.

In TPE, EI [17] is chosen as the criterion to guide the search for optimal solution(s), and it keeps the balance between exploitation and exploration during the search process. The utility function is defined as u(x) = max(0, f′ − f(x)), where f′ denotes the output of the current best solution, and x is the solution we want to find, whose f(x) is expected to be as much smaller than f′ as possible. The difference between f′ and f(x) is returned as a reward. In each iteration, the optimal solution x* is given by EI_{y*}(x) := ∫ max(y* − y, 0) p_M(y | x) dy (integrated over all y), where p_M(y | x) is the surrogate of the real fitness function, and y* represents some quantile of the observed y values.

Meanwhile, modelling p_M(y | x) directly is costly, so TPE models p(x | y) instead (Eq. 1). p(x | y) is defined by Eq. 2, where ℓ(x) and g(x) are two density functions modelled by Parzen estimators [23], a non-parametric method for approximating the probability density function of a random variable. The collected observations are sorted by the loss f and divided into two groups based on some quantile: ℓ(x) is generated using the observations {x^(i)} whose corresponding loss f(x^(i)) was less than y*, and the remaining observations are used to generate g(x). In practice, a number of hyperparameter settings are sampled according to ℓ, evaluated in terms of g(x)/ℓ(x), and the one that yields the minimum value of g(x)/ℓ(x), which corresponds to the greatest EI, is returned.
This solution is then evaluated on the fitness function, and we call this process a trial. In this way, g(x) and ℓ(x) are optimised according to the updated observation set; thus the exploration of optimal solutions moves to more promising regions of the whole search space by increasing the densities there.

p(y | x) = p(x | y) p(y) / p(x)    (1)

p(x | y) = ℓ(x) if y < y*;  g(x) if y ≥ y*    (2)

In TPE, "tree-structured" means that the hyperparameter space is tree-like: the value chosen for one hyperparameter determines which hyperparameter will be chosen next and which values will be available for it.
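To make the ℓ(x)/g(x) mechanism concrete, here is a minimal one-dimensional sketch. The history, bandwidth, and quantile γ are all illustrative choices, not TPE's actual adaptive settings:

```python
import math
import random

def parzen(points, bandwidth=0.1):
    """1-D Parzen (kernel density) estimator with Gaussian kernels."""
    def density(x):
        return sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
            / (bandwidth * math.sqrt(2 * math.pi))
            for p in points
        ) / len(points)
    return density

# Hypothetical history of (hyperparameter value, observed loss) pairs.
history = [(0.1, 0.9), (0.2, 0.5), (0.25, 0.4), (0.6, 0.8), (0.8, 1.1)]
history.sort(key=lambda pair: pair[1])          # sort by loss
gamma = 0.4                                     # quantile defining y*
n_good = max(1, int(gamma * len(history)))
good = [x for x, _ in history[:n_good]]         # losses below y*  -> l(x)
bad = [x for x, _ in history[n_good:]]          # losses >= y*     -> g(x)
l, g = parzen(good), parzen(bad)

# Sample candidates from l and keep the one minimising g(x)/l(x),
# which corresponds to the greatest EI.
rng = random.Random(0)
candidates = [rng.gauss(rng.choice(good), 0.1) for _ in range(50)]
x_star = min(candidates, key=lambda x: g(x) / l(x))
```

In real TPE this loop is repeated: `x_star` is evaluated on the true fitness function, appended to the history, and both densities are refitted.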
Algorithm 3: CMA-ES [13]
1: CMA-ES(f, G, N):
2: for g ← 1 to G do
3:     for k ← 1 to K do
4:         x_k^(g) ← sample from N
5:     end for
6:     for k ← 1 to K do
7:         Evaluate f(x_k^(g))
8:         H ← H ∪ {(x_k^(g), f(x_k^(g)))}
9:     end for
10:    Update N
11: end for
12: return H

Covariance matrix adaptation evolution strategy (CMA-ES) [13] is a derivative-free evolutionary algorithm for solving black-box optimisation problems, and it has been applied to HPO with large-scale parallel GPU computational resources [4, 21]. The pseudo-code of CMA-ES is shown in Algorithm 3. In
Line 4, a solution x_k^(g) is generated by sampling from a multivariate normal distribution N until the population size is reached, where k denotes the index of the offspring and g the generation. Thereafter, in Line 7, the individuals of x^(g) are evaluated on the fitness function f and recorded in H. In Line 10, similarly to TPE, CMA-ES exploits the historical information in H to optimise the search process. However, note that CMA-ES optimises N rather than a surrogate; we discuss this in the following paragraphs.

In CMA-ES, the multivariate distribution is re-defined. Specifically, a population of solutions x^(g+1) (i.e., individuals or offspring) is generated by sampling from a multivariate normal distribution N (Eq. 3). In Eq. 3, N(0, C^(g)) is a multivariate normal distribution with zero mean and covariance matrix C^(g); the latter determines the shape of the distribution and describes the correlations of the variables. Meanwhile, m^(g) represents the mean value, which is the centroid of the distribution and determines the search region within the whole search space in generation g. σ^(g) represents the step size, which also decides the global variance; in other words, it controls the size of the region.

x_k^(g+1) ∼ m^(g) + σ^(g) N(0, C^(g)),  for k = 1, ..., λ    (3)

To promote the efficiency of sampling, the key is to update m^(g+1), C^(g+1), and σ^(g+1) for the new generation. The mean is updated by the weighted average of μ selected individuals: m^(g+1) = m^(g) + Σ_{i=1}^{μ} w_i (x_{i:λ}^(g+1) − m^(g)), where w_i is the weight corresponding to x_i. The selection is according to the performance of the individuals on the objective function. The novelty of CMA-ES is that it adapts the covariance matrix by combining the rank-μ update and the rank-one update [2]. In this way, the rank-μ update can efficiently make use of the information from the entire population.
At the same time, the rank-one update can be used to exploit the information about correlations among generations from the evolution path. This combination keeps the balance between fewer generations with a large population and more generations with a smaller population. Additionally, CMA-ES introduces a step-size control mechanism based on the evolution path (cumulative step-size adaptation of the global step size), which aims to approximate the optimal overall step length efficiently, because the covariance matrix alone may not be able to find the optimal overall step length efficiently.

Generally, CMA-ES imitates biological evolution, assuming that no matter what kind of gene changes occur, the results (traits) always follow a zero-mean Gaussian distribution of some variance. Meanwhile, the generated population is evaluated on the objective function, and a subset of well-performing individuals is selected to guide the evolution, moving towards the area where better individuals would be sampled with higher probability.

3 METHODOLOGY COMPARISON

• Randomness plays an important role in RS, TPE, and CMA-ES. RS is supported by a number of independent uniform distributions with random sampling to explore the hyperparameter space and find optimal solutions. TPE and CMA-ES both have exploitation and exploration mechanisms, which means they are given a more specific region of the search space to explore compared with RS. TPE draws samples with randomness over the space of the density function ℓ(x). Meanwhile, the sampling in CMA-ES is backed by a multivariate normal distribution.

• Derivative-free means that an approach does not use derivative information to guide the search for optimal solutions. RS, TPE, and CMA-ES, as described above, all search for optimal solutions by drawing samples with randomness, rather than using gradient information as in the training of neural networks.

• Termination Condition
As Section 2 shows, RS, TPE, and CMA-ES all contain loops, which means they need a preset condition to stop the optimisation. However, this situation might cause a dilemma in balancing computational cost and performance.

• Uniform Distribution vs Multivariate Normal Distribution vs Gaussian Mixture Model
The uniform distribution is a symmetric probability distribution which gives a finite number of values an equal probability of being drawn. In RS, the dimension of a hyperparameter solution corresponds to the required number of uniform distributions, and each individual uniform distribution is independent. In contrast, in CMA-ES, the multivariate normal distribution approximately describes a set of correlated real-valued random variables, each of which clusters around a mean value. Furthermore, TPE makes use of a Gaussian mixture model, which assumes all the points are generated from a mixture of a number of Gaussian distributions with unknown parameters.

• Model-based vs Model-free
These are two distinct approaches originally defined in reinforcement learning [12]. RS is a representative model-free approach, which directly searches for better solutions via a process of trial and error. In contrast, TPE is a model-based approach, which first uses the density functions ℓ(x) and g(x) to model the hyperparameter space in terms of a surrogate, and then searches for solutions over the space of those functions.

• Evolutionary Strategy vs Bayesian Optimisation
The main idea of applying evolution strategies to black-box optimisation is to search through iterative adjustment of a multivariate normal distribution. The distribution is controlled by its mean and covariance, which are adjusted to move towards the area where better solutions can be sampled with higher probability. The adjustment generally has four main steps: sampling, evaluation, selection of good individuals, and updating the mean and covariance using the selected individuals. In contrast, TPE starts from Bayesian optimisation to approximate the distribution of hyperparameters and the objective function. Instead of using a Gaussian process to model the distribution, TPE makes use of Parzen estimators (i.e., kernel density estimation). The posterior distribution is continually updated to approximate the real situation, and an acquisition function (TPE uses EI) is used to approach the optimal solution.
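The four-step loop just described (sampling, evaluation, selection, update) can be sketched with a deliberately simplified evolution strategy. This sketch uses an isotropic Gaussian, a fixed step size, and an unweighted elite mean; full CMA-ES additionally adapts the covariance matrix C and the step size σ via evolution paths, which is omitted here:

```python
import random

def simple_es(f, dim=2, generations=30, lam=20, mu=5, sigma=0.3, seed=0):
    """Simplified (mu/mu, lambda)-ES: sample lambda offspring around a
    mean, keep the mu best, and move the mean to their average.
    (Real CMA-ES also adapts C and sigma; not shown here.)"""
    rng = random.Random(seed)
    mean = [rng.uniform(-1, 1) for _ in range(dim)]
    for _ in range(generations):
        # 1. Sampling: lambda offspring from N(mean, sigma^2 I)
        pop = [[m + sigma * rng.gauss(0, 1) for m in mean] for _ in range(lam)]
        # 2. Evaluation on the objective function
        scored = sorted(pop, key=f)
        # 3. Selection of the mu best individuals
        elite = scored[:mu]
        # 4. Update: move the mean to the (here, unweighted) elite average
        mean = [sum(x[i] for x in elite) / mu for i in range(dim)]
    return mean, f(mean)

sphere = lambda x: sum(v * v for v in x)   # toy objective
best_mean, best_val = simple_es(sphere)
```

Even without covariance adaptation, the selection-then-update loop steadily moves the sampling distribution towards the optimum of the toy objective.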
4 EXPERIMENTS

In this section, we first describe the design of our systematic experiments and then analyse the experimental results. We conducted four sets of experiments, presented in Sections 4.2 and 4.3, to compare RS, TPE, and CMA-ES from the perspectives of performance and computational cost.
To investigate the performance of HPO methods for GNNs on molecular property prediction, three representative datasets from DeepChem [28] were selected for our experiments: ESOL (1128 records), FreeSolv (642 records), and Lipophilicity (4200 records), which respectively correspond to the tasks of predicting the following molecular properties: water solubility, hydration free energy, and the octanol/water distribution coefficient. These properties are crucial in many problems. For example, in drug discovery, lipophilicity is an important property reflecting the affinity of a molecule, and it affects both membrane permeability and solubility [20]. Furthermore, the research presented in [7] analyses molecular solubility data for exploring organic semiconducting materials. Therefore, the above three molecular property datasets are worth investigating, and studying them will benefit research on many related problems. Using datasets of different sizes also helps us conduct more comprehensive analyses. Meanwhile, there are many GNN variants, and we chose the graph convolutional network (GC) [8] because it was proposed with molecular domain background knowledge in mind: the architecture of GC generalises the chemical operation of circular fingerprints [11] to extract molecular features.

Four hyperparameters of GC are selected for HPO: batch size s_b, learning rate l_r, the number of fully-connected layer nodes n_n, and the number of nodes in the filter n_f. This selection is motivated by the related benchmark work presented in [28] and considers molecular domain knowledge. RS, TPE, and CMA-ES are implemented with Optuna [1]. The arguments of the HPO methods in our experiments are set empirically or to the default values offered by Optuna.
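The grid actually searched in the first group of experiments (Section 4.2: s_b and n_f from 32 to 256 in steps of 32; l_r from 0.0001 to 0.0016 in steps of 0.0001; n_n from 64 to 512 in steps of 64) is small enough to enumerate. A quick sketch of its size (the dictionary layout is our own illustration, not DeepChem/Optuna code):

```python
# Grid from the Section 4.2 experiments (see the table footnotes).
space = {
    "s_b": list(range(32, 257, 32)),                    # 8 values
    "n_f": list(range(32, 257, 32)),                    # 8 values
    "l_r": [round(i * 1e-4, 4) for i in range(1, 17)],  # 16 values
    "n_n": list(range(64, 513, 64)),                    # 8 values
}

size = 1
for values in space.values():
    size *= len(values)

print(size)  # 8 * 8 * 16 * 8 = 8192 = 2**13 candidate settings
```

With 100 trials per method, each HPO run can visit only a little over 1% of this space, which is why the sampling strategy matters.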
Meanwhile, considering that most practitioners/researchers do not have the large-scale GPU computational resources available in industry (e.g., DeepMind, FAIR), we would like to assess the performance of HPO given limited resources, so our experiments are all conducted on a single GPU (GeForce GTX 1070), while the MedianPruner technique [1] is used to speed up HPO. We expect our experimental outcomes will inspire and help others facing similar HPO problems with limited computational resources.

In our experiments, every dataset is split into training, validation, and test sets with proportions 80%, 10%, and 10%. The training set is used to fit GC given a hyperparameter setting, and the validation set provides an unbiased evaluation of the hyperparameters during the search. The test set is used to evaluate the performance of the HPO methods. The evaluation metric is the root mean square error (RMSE) of GC, and the evaluation function is defined by the loss function of GC. To make our experiments more persuasive, the best hyperparameter settings found by each method are given to
GC, and then GC is run 30 times independently to calculate the mean of the RMSEs on the training, validation, and test datasets. Meanwhile, to statistically analyse the difference between those means, we conducted corresponding t-tests, in which t denotes the t-value and h represents the hypothesis. Empirically, during HPO, we found that the results of evaluating each trial often fluctuated; to minimise this effect on the evaluation of HPO performance, we use the mean of the RMSEs from three repeated evaluations of GC as a single evaluation value.

In this section, we assess the performance of RS, CMA-ES, and TPE while considering the computational cost as a priority. In HPO, any discussion and analysis of method performance without considering computational cost is questionable, because naive RS can find optimal solutions if given sufficient time and computational budget, and it is a highly parallel-friendly approach. Therefore, any comparison of HPO methods must be performed under acceptable computational cost.
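The t-values and hypotheses h reported later (Tables 4∼6) come from comparing two sets of 30 RMSEs. A minimal sketch, assuming a Welch-style two-sample statistic and approximating the two-tailed 5% critical value for roughly 58 degrees of freedom by 2.0 (the RMSE samples below are hypothetical, not the paper's data):

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic for samples a and b."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical RMSEs from 30 independent runs of two HPO methods.
rmse_rs = [0.89 + 0.001 * ((i * 7) % 11 - 5) for i in range(30)]
rmse_tpe = [0.866 + 0.001 * ((i * 5) % 11 - 5) for i in range(30)]

t = welch_t(rmse_rs, rmse_tpe)
h = 1 if abs(t) > 2.0 else 0   # approx. two-tailed critical value, alpha = 5%
```

A positive t here means the first sample has the larger mean RMSE; the sign convention for the tables is explained in the text below.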
Different HPO methods employ different optimisation strategies. To compare them fairly, we first proposed to assign each of them a total of 100 trials, assuming that all HPO methods then have equal computational cost. In other words, RS, TPE, and CMA-ES all have 100 opportunities to evaluate hyperparameter settings on GC. Tables 1∼3 show the results. The batch size s_b and the number of nodes in the filter n_f range from 32 to 256 with an incremental step of 32; the learning rate l_r is from 0.0001 to 0.0016 with an incremental step of 0.0001; the number of nodes in the fully-connected layer n_n is from 64 to 512, with a step of 64. This search space has 2^13 = 8192 solutions in total. Meanwhile, a t-test with α = 5% is employed to determine whether there is a significant difference between the means of the RMSEs; h = 1 denotes rejecting the equal-mean hypothesis, and h = 0 denotes accepting it (Tables 4∼6). The three methods found different values of n_f and n_n on both datasets. A negative value of t means the former method is better: for example, in the RS-TPE row of Table 5, the t on the test set is -3.3167, which indicates that RS has a smaller RMSE than TPE (i.e., RS is better than TPE). When t is positive, the latter is better than the former: for example, in the RS-TPE row of Table 4, the t on the test set is 1.8355, which means the performance of TPE is better than that of RS. A larger absolute value of t indicates a bigger difference. Moreover, RS outperforms the other two methods in Table 2 with significant differences (see Table 5). We consider that the FreeSolv problem may be more complex (the research in [3] discussed deviations in calculating hydration free energy), and TPE and CMA-ES are constrained by the size of the search space, the number of trials, and the size of the dataset,

Table 1: The general experimental settings on the ESOL dataset

HPO Method  Hyperparameters                       Train   Validation  Test
RS          n_f=256, n_n=64, l_r=0.0016, s_b=64   0.2666  0.9067      0.8888  Mean RMSE
                                                  0.0364  0.0542      0.0411  Mean STD
TPE         n_f=192, n_n=192, l_r=0.0015, s_b=32  0.3083  0.8739      0.8667  Mean RMSE
                                                  0.0534  0.0401      0.0476  Mean STD
CMA-ES      n_f=256, n_n=64, l_r=0.0016, s_b=32   0.2939  0.8739      0.8782  Mean RMSE
                                                  0.0458  0.0424      0.0562  Mean STD
Search Space: s_b: 32-256, step=32; l_r: 0.0001-0.0016, step=0.0001; n_f: 32-256, step=32; n_n: 64-512, step=64

Table 2: The general experimental settings on the FreeSolv dataset

HPO Method  Hyperparameters                       Train   Validation  Test
RS          n_f=256, n_n=320, l_r=0.0015, s_b=32  0.6197  1.2175      1.1040  Mean RMSE
                                                  0.1248  0.1055      0.0995  Mean STD
TPE         n_f=160, n_n=448, l_r=0.0015, s_b=32  0.6875  1.3425      1.2006  Mean RMSE
                                                  0.1854  0.1711      0.1212  Mean STD
CMA-ES      n_f=256, n_n=64, l_r=0.0016, s_b=32   0.5792  1.2721      1.1967  Mean RMSE
                                                  0.2653  0.1907      0.2128  Mean STD
Search Space: s_b: 32-256, step=32; l_r: 0.0001-0.0016, step=0.0001; n_f: 32-256, step=32; n_n: 64-512, step=64

thus they got stuck in local optima. In contrast, RS uses a completely random strategy, which helps in dealing with this kind of special situation. However, we believe that CMA-ES and TPE would find better solutions if given more trials and a larger search space.

In Table 3, TPE demonstrated better performance than CMA-ES and RS, with significant differences (see Table 6). The Lipophilicity dataset is the largest in our experiments; compared with smaller datasets, evaluations on its validation set return results with smaller deviations, which helps TPE and CMA-ES improve and update their strategies towards promising solutions. However, CMA-ES did not show excellent performance on all datasets in this group of experiments; we consider that, because CMA-ES is based on an evolution strategy, it depends on continually generating new offspring to find solutions, so 100 trials might restrict its performance.

The same number of trials may not assign the same computational cost to different HPO methods in practice, because different hyperparameter trials may incur different evaluation costs. For example, a larger value of n_n / n_f means more trainable parameters, which takes more computational resources for the corresponding trial.
Therefore, in this section, we design another set of experiments in which we assign one hour of running time and the same hardware configuration to each HPO method on the ESOL dataset, with the

Table 3: The general experimental settings on the Lipophilicity dataset

HPO Method  Hyperparameters                       Train   Validation  Test
RS          n_f=96, n_n=384, l_r=0.001, s_b=64    0.2682  0.7024      0.6949  Mean RMSE
                                                  0.0444  0.0279      0.0248  Mean STD
TPE         n_f=224, n_n=192, l_r=0.0015, s_b=32  0.2475  0.6914      0.6655  Mean RMSE
                                                  0.0328  0.0229      0.0219  Mean STD
CMA-ES      n_f=32, n_n=64, l_r=0.0016, s_b=32    0.3496  0.7191      0.7183  Mean RMSE
                                                  0.0425  0.0309      0.0245  Mean STD
Search Space: s_b: 32-256, step=32; l_r: 0.0001-0.0016, step=0.0001; n_f: 32-256, step=32; n_n: 64-512, step=64

T-test on results with significance level α = 5%; h = 1: reject the equal-mean hypothesis; h = 0: accept the equal-mean hypothesis.

HPO Methods    Train (t, h)   Valid (t, h)   Test (t, h)
RS - TPE       -3.4671  1      2.6223  1      1.8355  1
RS - CMA-ES    -2.5080  1      2.5666  1      0.7706  0
TPE - CMA-ES    1.0984  0     -0.0007  0     -0.8396  0
Table 4: T-Test on ESOL
T-test on results with significance level α = 5%; h = 1: reject the equal-mean hypothesis; h = 0: accept the equal-mean hypothesis.

HPO Methods    Train (t, h)   Valid (t, h)   Test (t, h)
RS - TPE       -1.6328  0     -3.3492  1     -3.3167  1
RS - CMA-ES     0.7435  0     -1.3487  0     -2.1245  1
TPE - CMA-ES    1.8011  1      1.4804  0      0.0855  0
Table 5: T-Test on FreeSolv
T-test on results with significance level α = 5%; h = 1: reject the equal-mean hypothesis; h = 0: accept the equal-mean hypothesis.

HPO Methods     Train (t, h)    Valid (t, h)   Test (t, h)
RS - TPE          2.0146  0       1.6454  0      4.7676  1
RS - CMA-ES      -7.1251  1      -2.1564  1     -3.6088  1
TPE - CMA-ES    -10.2330  1      -3.8762  1     -8.6244  1
Table 6: T-Test on Lipophilicity

same search space as defined in Table 1, to see which method can find the best solution. Within one hour, the best trial of hyperparameters from each HPO method was selected to configure GC, and this GC will be
HPO Method  Trials  Hyperparameters                       Train   Validation  Test
RS          96      n_f=224, n_n=448, l_r=0.0008, s_b=32  0.3301  0.8817      0.8994  Mean RMSE
                                                          0.0492  0.0457      0.0544  Mean STD
TPE         54      n_f=256, n_n=256, l_r=0.0014, s_b=32  0.3193  0.8605      0.8634  Mean RMSE
                                                          0.0462  0.0400      0.0408  Mean STD
CMA-ES      63      n_f=32, n_n=512, l_r=0.0016, s_b=32   0.4287  0.9231      0.9688  Mean RMSE
                                                          0.0933  0.0706      0.0845  Mean STD

Table 7: Experiments on the ESOL dataset given one hour of running time
Search Space: s_b: 32-256, step=32; l_r: 0.0001-0.0016, step=0.0001; n_f: 32-256, step=32; n_n: 64-512, step=64

run 30 times; the results and t-tests are shown in Tables 7 and 8. In Table 7, RS completed the largest number of trials, and its performance is approximately equal to that shown in Table 1, because it completed almost 100 trials, similarly to the previous experiment. We believe RS is efficient and stable in such a small search space. Furthermore, TPE showed surprising performance: with only 54 trials it accomplished almost the same performance as shown in Table 1. Additionally, it is noted that TPE found two different hyperparameter settings (shown in Tables 1 and 7) with almost the same performance on the test dataset. Meanwhile, as shown in Table 8, the performance of the three HPO methods within one hour of runtime differs significantly: TPE performs best. CMA-ES did not meet our expectation that it should at least maintain performance similar to RS, and we consider that CMA-ES might not be suitable for this particular HPO problem given an insufficient computational budget and a relatively small search space. Furthermore, it is noted that the underperformance of CMA-ES may be alleviated by further exploring its "meta-parameters", for example its population size. However, this seems to be an even more challenging "meta-HPO" problem, which is beyond the scope of this research; we will explore it in our future work.

T-test on results with significance level α = 5%; h = 1: reject the equal-mean hypothesis; h = 0: accept the equal-mean hypothesis.

HPO Methods    Train (t, h)   Valid (t, h)   Test (t, h)
RS - TPE        0.8573  0      1.8774  1      2.8449  1
RS - CMA-ES    -5.0338  1     -2.6530  1     -3.7178  1
TPE - CMA-ES   -5.6546  0     -4.1563  1     -6.0439  1
Table 8: T-Test on experiments on the ESOL dataset given one hour of running time
In this section, we design another group of experiments to explore RS, TPE, and CMA-ES with performance as the primary consideration, by providing as much computational cost as possible.
Systematic Comparison Study on Hyperparameter Optimisation of Graph Neural Networks 2021, Feb, 05
In order to compare pure HPO performance, we ran each of the three HPO methods on the ESOL dataset 10 times independently; each run was assigned 100 trials, and the search space was kept the same as that defined in Table 1. Performance is evaluated by the mean of the RMSE values of the best trial from each set of 100 trials. We did not evaluate on the test dataset because the 10 RMSEs correspond to 10 different hyperparameter settings. Our purpose is to assess the capability of these methods on the HPO problem, and the results and t-test are shown in Tables 9 and 10. TPE again outperforms RS and CMA-ES, and it also shows more stable performance, with a smaller standard deviation. The t-test in Table 10 shows the same outcome as Table 1: RS and CMA-ES have similar performance on this problem and search space (no significant difference). We therefore believe that CMA-ES still has room to improve its performance in our experiments.

               RS       TPE      CMA-ES
Mean of RMSE   0.8529   0.8190   0.8469
Std            0.0169   0.0090   0.0169
Table 9: Experiments on the ESOL Dataset with performance as primary consideration
To further investigate the performance of the HPO methods, we enlarged the search space so that 𝑠_𝑏 and 𝑛_𝑓 range from 8 to 512 with a step size of 8, 𝑙_𝑟 ranges from 0.0001 to 0.0032 with a step size of 0.0001, and 𝑛_𝑛 ranges from 32 to 1024 with a step size of 32. The new search space has 2^22 (about 4.2 million) configurations: the step sizes become smaller and the value ranges larger. The experimental details are shown in Tables 11, 12, and 13, while the corresponding t-tests are presented in Tables 14, 15, and 16.

On the three datasets, in general, the RMSEs for RS, TPE, and CMA-ES on the test sets have improved compared with the experiments in Section 4.2, given the same number of trials. Meanwhile, observing the results on the validation and test sets for all three datasets, we do not see over-fitting issues. On ESOL, TPE and CMA-ES have almost the same performance, and both are better than RS, as indicated by the t-test (see Table 14). In addition, on FreeSolv, the three HPO methods show no significant difference in performance, while TPE and CMA-ES improved compared with the previous experiments (see Table 2). It is sensible that a potentially complex problem should be given a large search space in which to find the most suitable hyperparameters. In Table 13, the performance ranking of the methods is unchanged: TPE is the best, and RS is better than CMA-ES. Overall, our experimental results indicate that TPE is the HPO method best suited to GNNs as applied to our molecular property prediction problems, given limited computational resources.

T-test on results with significance level of 𝛼 = 5%
ℎ = 1: reject the equal-mean hypothesis; ℎ = 0: accept the equal-mean hypothesis.

HPO Methods     𝑡         ℎ
RS - TPE         5.2891   1
RS - CMA-ES      0.7464   0
TPE - CMA-ES    -4.3625   1

Table 10: T-Test on ESOL with performance as primary consideration

HPO Methods   Hyperparameters                           Train     Validation   Test
RS            𝑛_𝑓=384, 𝑛_𝑛=160, 𝑙_𝑟=0.0016, 𝑠_𝑏=24     0.3190    0.8727       0.8479    Mean RMSE
                                                        0.0323    0.0310       0.0453    Mean STD
TPE           𝑛_𝑓=312, 𝑛_𝑛=224, 𝑙_𝑟=0.0030, 𝑠_𝑏=8      0.5089    0.8203       0.7810    Mean RMSE
                                                        0.1281    0.0960       0.0570    Mean STD
CMA-ES        𝑛_𝑓=512, 𝑛_𝑛=1024, 𝑙_𝑟=0.0032, 𝑠_𝑏=8     0.5793    0.8480       0.7772    Mean RMSE
                                                        0.1529    0.1247       0.0970    Mean STD

Table 11: Experiments in larger search space on ESOL Dataset (search space: 𝑠_𝑏: 8-512, 𝑠𝑡𝑒𝑝=8; 𝑙_𝑟: 0.0001-0.0032, 𝑠𝑡𝑒𝑝=0.0001; 𝑛_𝑓: 8-512, 𝑠𝑡𝑒𝑝=8; 𝑛_𝑛: 32-1024, 𝑠𝑡𝑒𝑝=32)

HPO Methods   Hyperparameters                           Train     Validation   Test
RS            𝑛_𝑓=200, 𝑛_𝑛=64, 𝑙_𝑟=0.0030, 𝑠_𝑏=48      0.3747    1.2412       1.0880    Mean RMSE
                                                        0.0684    0.1152       0.0990    Mean STD
TPE           𝑛_𝑓=424, 𝑛_𝑛=224, 𝑙_𝑟=0.0008, 𝑠_𝑏=16     0.6144    1.1288       1.0620    Mean RMSE
                                                        0.0951    0.1163       0.1115    Mean STD
CMA-ES        𝑛_𝑓=512, 𝑛_𝑛=32, 𝑙_𝑟=0.0032, 𝑠_𝑏=8       0.6973    1.2329       1.0835    Mean RMSE
                                                        0.07819   0.1306       0.1073    Mean STD

Table 12: Experiments in larger search space on FreeSolv Dataset (search space as in Table 11)

HPO Methods   Hyperparameters                           Train     Validation   Test
RS            𝑛_𝑓=312, 𝑛_𝑛=32, 𝑙_𝑟=0.0031, 𝑠_𝑏=32      0.2570    0.6736       0.6552    Mean RMSE
                                                        0.0240    0.0285       0.0223    Mean STD
TPE           𝑛_𝑓=496, 𝑛_𝑛=32, 𝑙_𝑟=0.0022, 𝑠_𝑏=24      0.2413    0.6786       0.6395    Mean RMSE
                                                        0.0188    0.0195       0.0193    Mean STD
CMA-ES        𝑛_𝑓=248, 𝑛_𝑛=480, 𝑙_𝑟=0.0015, 𝑠_𝑏=120    0.2442    0.6931       0.6826    Mean RMSE
                                                        0.0430    0.0194       0.0167    Mean STD

Table 13: Experiments in larger search space on Lipophilicity Dataset (search space as in Table 11)
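The t-tests reported in Tables 8, 10, and 14-16 can be reproduced with a standard two-sample t-test. The paper does not state whether pooled or Welch variances were used; the sketch below assumes a pooled-variance test with 10 repeats per method, so the two-tailed 5% critical value is hard-coded for 18 degrees of freedom and would need adjusting for other sample sizes:

```python
import math
import statistics

T_CRIT_DF18 = 2.1009  # two-tailed 5% critical value, 18 degrees of freedom

def two_sample_t(a, b):
    """Pooled-variance two-sample t-test on two lists of RMSEs.
    Returns (t, h): h = 1 rejects the equal-mean hypothesis at alpha = 5%."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    # Pooled sample variance (n - 1 denominators, matching statistics.variance).
    sp2 = ((n1 - 1) * statistics.variance(a)
           + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    h = 1 if abs(t) > T_CRIT_DF18 else 0  # valid only for n1 = n2 = 10
    return t, h
```

A negative t for "RS - CMA-ES" therefore means RS had the lower mean RMSE of the pair in that column.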
T-test on results with significance level of 𝛼 = 5%
ℎ = 1: reject the equal-mean hypothesis; ℎ = 0: accept the equal-mean hypothesis.

HPO Methods     Train (𝑡, ℎ)      Valid (𝑡, ℎ)     Test (𝑡, ℎ)
RS - TPE         -7.7384  1        2.7832  1        4.9451  1
RS - CMA-ES     -10.7815  1        1.0330  0        3.5578  1
TPE - CMA-ES     -2.1105  1       -0.9456  0        0.1855  0

Table 14: T-Test on ESOL in larger search space

T-test on results with significance level of 𝛼 = 5%
ℎ = 1: reject the equal-mean hypothesis; ℎ = 0: accept the equal-mean hypothesis.

HPO Methods     Train (𝑡, ℎ)      Valid (𝑡, ℎ)     Test (𝑡, ℎ)
RS - TPE        -11.0149  1        3.6976  1        0.9385  0
RS - CMA-ES     -16.7183  1        0.2580  0        0.1661  0
TPE - CMA-ES     -3.6269  1       -3.2042  1       -0.7475  0

Table 15: T-Test on FreeSolv in larger search space

T-test on results with significance level of 𝛼 = 5%
ℎ = 1: reject the equal-mean hypothesis; ℎ = 0: accept the equal-mean hypothesis.

HPO Methods     Train (𝑡, ℎ)      Valid (𝑡, ℎ)     Test (𝑡, ℎ)
RS - TPE          2.7588  1       -0.7815  0        2.7989  1
RS - CMA-ES       1.3964  0        3.0423  1       -5.2825  1
TPE - CMA-ES     -0.3299  0       -2.8299  1       -8.9802  1

Table 16: T-Test on Lipophilicity in larger search space

Meanwhile, RS is the simplest method but can achieve performance comparable to TPE and CMA-ES. In our future work on molecular problems with small datasets, the use of CMA-ES also deserves further investigation, and we believe that CMA-ES, RS, and TPE will have very similar performance given a larger computational budget. Furthermore, as mentioned in Section 4.3.1, the selection of the "meta-parameters" for HPO methods deserves more research; we will investigate the impact of the HPO methods' meta-parameter values on their performance.

Finally, we expect that our work will help people from various fields (e.g., machine learning, chemistry, materials science) when they face similar interdisciplinary problems. As the application of GNNs has been explored in many areas and has indeed benefited research in those areas, we believe that our research outcomes will give useful insights to facilitate such research.
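As a concrete note on the CMA-ES meta-parameter discussed above: the standard default population size in CMA-ES follows the rule 𝜆 = 4 + ⌊3 ln 𝑛⌋ from Hansen's CMA-ES tutorial, where 𝑛 is the number of search dimensions. For our four hyperparameters this default is small, which limits exploration per generation under a tight trial budget; a meta-HPO could search over multiples of it:

```python
import math

def default_popsize(n_dims):
    """Default CMA-ES population size, lambda = 4 + floor(3 * ln(n)).
    The 'meta-HPO' sketched above would tune multiples of this value."""
    return 4 + math.floor(3 * math.log(n_dims))

# For our 4-dimensional hyperparameter space the default generation
# holds only a handful of candidate configurations.
```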