Enhance Information Propagation for Graph Neural Network by Heterogeneous Aggregations
Dawei Leng*, Jinjiang Guo, Lurong Pan, Jie Li, Xinyu Wang
AIDD Group
Global Health Drug Discovery Institute, Beijing, China
[email protected], [email protected], [email protected], [email protected]

*The first and corresponding author

Abstract
Graph neural networks (GNNs) are emerging as a continuation of deep learning's success on graph data. Tens of different graph neural network variants have been proposed, most following a neighborhood aggregation scheme in which node features are updated from layer to layer by aggregating the features of neighboring nodes. Despite the surge in related research, the power of GNNs is still not on par with that of their counterparts, CNNs in computer vision and RNNs in natural language processing. We rethink this problem from the perspective of information propagation, and propose to enhance information propagation among GNN layers by combining heterogeneous aggregations. We argue that as richer information is propagated from shallow to deep layers, the discriminative capability of the features formulated by the GNN benefits accordingly. As a first attempt in this direction, we propose a new generic GNN layer formulation and, upon it, a new GNN variant referred to as HAG-Net. We empirically validate the effectiveness of HAG-Net on a number of graph classification benchmarks, and discuss the design options along with the criteria for choosing among them.
The success of deep learning in computer vision and natural language processing has recently boosted a flood of research on applying neural networks to graph data (Wu et al., 2020). A graph is a simple yet versatile data structure jointly described by a set of nodes and a set of edges. Aside from the familiar image and text data, lots of real-world data are better described as graphs and thus processed by graph neural networks, such as social networks (Fan et al., 2019), financial fraud detection (Wang et al., 2020), knowledge graphs (Zhang et al., 2020), biological interaction networks (Higham et al., 2008), and small molecules in drug discovery (Hu et al., 2019), to name a few.

Since the seminal works (Kipf and Welling, 2016; Hamilton et al., 2017), tens of different graph neural network variants have been proposed, emphasizing different graph properties and design options. GNN research routes can be roughly divided into two categories: spectral based and spatial based. Spectral-based GNNs try to approximate CNN's convolution by defining a Fourier transform on the graph (Kipf and Welling, 2016), which is where the name graph convolution network comes from. The major limitation of spectral-based GNNs is that graph convolution is defined on a fixed global graph, and thus is not suitable for tasks where the graph structure changes from sample to sample. Meanwhile, most recent works take the spatial-based direction (Xu et al., 2018; Hu et al., 2019; Gao and Ji, 2019; Veličković et al., 2017; Gilmer et al., 2017), i.e., GNNs process graph data following a neighborhood aggregation scheme, where the node features at layer k are updated via aggregating features of neighboring nodes from layer k − 1.

GNN's neighborhood aggregation is a direct imitation of CNN's convolution in the spatial dimensions. However, unlike images in computer vision, where data reside on a regular grid, graph data are intrinsically orderless. For an undirected graph G = (V, E), only the nodes V and edges E are defined. By "orderless" we mean that from the perspective of any center node v ∈ V, there is no way to tell which node u ∈ N(v) is its n-th neighbor, where N(v) stands for the neighborhood of node v. This special property of graphs impedes GNNs from replicating CNN's success on image data and RNN's success on sequential data.

Most work on spatial-based GNNs can be formally summarized by the following two-stage architecture (Wu et al., 2020; Hu et al., 2019): first propagate node information by neighborhood aggregation, then form the whole-graph representation by a read-out function. The k-th layer of a spatial-based graph neural network is

h_v^{(k)} = C^{(k)}( h_v^{(k-1)}, A^{(k)}( { (h_v^{(k-1)}, h_u^{(k-1)}, e_{uv}) : u \in N(v) } ) )    (1)

in which h_v^{(k)} is the feature vector of node v at the k-th layer, e_{uv} is the edge feature vector between nodes u and v, and N(v) usually denotes the one-hop neighbors. The function A(·) aggregates node features in the neighborhood of v, and C(·) combines the center node's features with those aggregated from its neighborhood by A(·). To obtain the entire graph's representation h_G, the READOUT function pools node features from the final iteration K:

h_G = R( { h_v^{(K)} : v \in V } )    (2)

Normally a spatial-based graph neural network consists of a stack of multiple aggregation layers and finally one readout layer.
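As a minimal illustration of eqs. (1) and (2), the sketch below implements one generic aggregation layer with mean aggregation and a sum read-out in PyTorch, assuming a dense adjacency matrix; all class and function names are illustrative, not a reference implementation.

```python
import torch
import torch.nn as nn

class AggregationLayer(nn.Module):
    """Eq. (1): aggregate one-hop neighbor features, then combine with the center node."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)  # C(.): combine center and aggregated features

    def forward(self, h, adj):
        # h: [N, dim] node features from layer k-1; adj: [N, N] adjacency without self loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ h) / deg                   # A(.): mean over the neighborhood N(v)
        return torch.relu(self.combine(torch.cat([h, agg], dim=-1)))

def readout(h):
    """Eq. (2): permutation-invariant pooling of final-layer node features into h_G."""
    return h.sum(dim=0)
```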
Published works differ in A(·), C(·) or R(·), among which the aggregation function A(·) is the most important part because it determines how information propagates among nodes. Due to the orderless property of graph data, the function A(·) must be permutation-free, i.e., for any given center node, all its neighbor nodes must be treated equally. This drastically restricts the possible choices for the aggregation function, commonly {max, min, mean, sum, mul, att}, in which att is an attention operator as in (Veličković et al., 2017). The permutation-free restriction also applies to the read-out function R(·).

Usually one particular neighborhood aggregation operator is chosen for a GNN design. For example, the seminal work GraphSAGE (Hamilton et al., 2017) takes the layer formulation

h_v^{(k)} = \phi_1( h_v^{(k-1)} ) + \phi_2( mean( { h_u^{(k-1)} : u \in N(v) } ) )    (3)

where A(·) = mean( h_u^{(k-1)}, u ∈ N(v) ), C(·) = sum(φ_1, φ_2), and neighborhood aggregation takes the mean operator. GIN (Xu et al., 2018) argues that the mean operator loses neighborhood-size information, and proposes the layer formulation

h_v^{(k)} = \phi( (1 + \epsilon) h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)} )    (4)

where A(·) = \sum_{u \in N(v)} h_u^{(k-1)}, C(·) = φ( (1 + ε) h_v^{(k-1)} + A(·) ), and neighborhood aggregation takes the sum operator. As a further improvement, (Hu et al., 2019) incorporates edge features e_{u,v} by taking the layer formulation

h_v^{(k)} = \phi( (1 + \epsilon) h_v^{(k-1)} + \sum_{u \in N(v)} ReLU( h_u^{(k-1)} + e_{u,v} ) )    (5)

where A(·) = \sum_{u \in N(v)} ReLU( h_u^{(k-1)} + e_{u,v} ), C(·) = φ( (1 + ε) h_v^{(k-1)} + A(·) ), and neighborhood aggregation also takes the sum operator. (Gilmer et al., 2017) takes the layer formulation

h_v^{(k)} = \phi_1( h_v^{(k-1)} ) + aggr_{u \in N(v)}( h_u^{(k-1)} \phi_2( e_{u,v} ) )    (6)

in which the neighborhood aggregation aggr takes either the mean, max or sum operator.

From the perspective of information propagation, when only permutation-free operators are allowed, information loss after neighborhood aggregation is inevitable: there is no way to differentiate among neighbor nodes, and consequently it is not possible to recover the input graph G from the aggregation result G′. Consider the commonly used mean operator: after each aggregation layer, features from different nodes are averaged, so the resulting graph G′ is a blurred version of the input graph G. The deeper the graph neural network, the more blurred the result. With such a lossy intermediate representation, it is hard for the neural network to fulfill the downstream learning task effectively.

In this manuscript we try to improve GNN's performance from the perspective of enhancing information propagation from shallow to deep layers. We argue that as richer information is propagated, the discriminative capability of the features formulated by the GNN benefits accordingly. As our first attempt in this direction, we propose to enhance information propagation by combining heterogeneous aggregations in the function A(·). The underlying philosophy is straightforward: each aggregation operator extracts and describes a different aspect of the input graph G; by combining different aggregation operators, the information propagation loss can be mitigated, allowing more effective features for the downstream task to propagate to deep layers. With this in mind, a new generic GNN layer formulation and, upon it, a new GNN variant referred to as HAG-Net is proposed.
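Before detailing the proposed layer, the information-loss argument above can be made concrete with a toy example: under mean aggregation, neighborhoods of different sizes but identical features collapse to the same result, while sum keeps them apart (the GIN argument).

```python
import torch

n_small = torch.ones(2, 1)   # neighborhood with 2 identical neighbors
n_large = torch.ones(3, 1)   # neighborhood with 3 identical neighbors

print(n_small.mean(dim=0), n_large.mean(dim=0))  # tensor([1.]) twice: mean cannot tell them apart
print(n_small.sum(dim=0), n_large.sum(dim=0))    # tensor([2.]) vs tensor([3.]): sum preserves size
```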
We empirically validate the effectiveness of HAG-Net on a number of graph classification benchmarks, and elaborate the design options and selection criteria along the way. We focus on graph-level tasks here, whereas the same technique can also be applied to node-level tasks without difficulty.

Figure 1: Network structure of HAG-Net. The proposed network is designed to be generic; the pyramid feature stacking by multiple READOUT layers and the dense connections among neighborhood aggregation layers are optional. The neighborhood aggregation layer is formulated by eq. (7) and the READOUT layer by eq. (8).

We begin by reformulating eq. (1). For a graph G = (V, E) with v, u ∈ V and e_{uv} ∈ E, the generic layer formulation with heterogeneous aggregations is

h_v = \psi( C( h_v, \bigoplus_{i=0}^{M-1} \phi_i( A_{i, u \in N(v)}( { (h_v, h_u, e_{uv}) } ) ) ) )    (7)

in which A_i(·), i = 0, …, M − 1, are M different aggregation operators, ⊕ is the merge operator for the M neighborhood aggregation results, and C(·) updates the center node v's feature with the merged aggregation result. ψ and φ_i are linear or non-linear transform functions. The layer index k is omitted for conciseness: h_v on the left-hand side denotes the layer-k output, and all features on the right-hand side come from layer k − 1 unless specified otherwise.

For node-level tasks, the node representation h_v from the final aggregation layer is usually used for prediction. For graph-level tasks, the READOUT function aggregates node features from the final aggregation layer to form the entire graph's representation:

h_G = \psi( \bigoplus_{i=0}^{M-1} \phi_i( A_{i, v \in V}( { h_v } ) ) )    (8)

Compared to eq. (2), the READOUT function here also benefits from the enhanced information propagation by heterogeneous aggregations. Note that for the READOUT function the aggregations are performed over all the graph nodes instead of a local neighborhood.

Commonly used aggregation operators include {max, sum, mean, att}. The aforementioned {mul, min} operators, though satisfying the permutation-free restriction, can easily cause numerical instability, and are thus seldom used for neighborhood aggregation in practice. The sum and mean operators act similarly; although sum is theoretically preferred over mean (Xu et al., 2018) because it keeps the neighborhood-size information, there is a latent risk of numerical overflow on large graphs, where mean is a much safer choice. mean(N(v)) is the first-order statistical moment of N(v); higher-order moments such as variance, skewness and kurtosis also meet the permutation-free restriction. The max operator is widely used in neural networks in fields such as computer vision and natural language processing, where it usually acts as a pooling function. att from (Veličković et al., 2017) is a transformer-like multi-head self-attention over N(v) followed by a sum operator, so it acts more like sum(ϕ(N(v))). In this manuscript we focus on the operator set {max, sum, mean, att}, and leave other operators for future investigation.
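As a concrete example of eq. (8), the following is a minimal sketch of a heterogeneous READOUT with M = 2 operators (max and sum) merged by concatenation; class and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HeteroReadout(nn.Module):
    """Eq. (8) with M = 2: max and sum pooled over all nodes, merged by concatenation."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])  # phi_i per operator
        self.psi = nn.Linear(2 * dim, dim)                                 # psi after the cat merge

    def forward(self, h):
        # h: [num_nodes, dim] node features from the final aggregation layer
        pooled = [h.max(dim=0).values, h.sum(dim=0)]                       # A_i over all of V
        merged = torch.cat([f(p) for f, p in zip(self.phi, pooled)], dim=-1)
        return self.psi(merged)                                            # graph representation h_G
```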
Merge heterogeneous aggregations

For the ⊕ operator in eqs. (7) and (8), there are two commonly used options, {cat, sum}, in which cat denotes concatenation. The transform function φ_i applied to each aggregation result is commonly implemented as a stack of dense layers.

Update center node features
In eq. (7), the center node features h_v are updated with the information aggregated from the neighborhood by the function C(·). Possible choices include {sum, max, cat, rnn}, in which rnn (Li et al., 2015) is an RNN cell, for example LSTMCell or GRUCell in PyTorch, with the merged neighborhood aggregation result as hidden state and the center node feature as input. Note this sequential setup is entirely artificial; the two roles for the RNN cell can be exchanged at will. We leave the mean method out here because it acts quite similarly to sum, possibly due to the following transform ψ and the built-in batch normalization. Like φ_i, the transform function ψ is commonly implemented as a stack of dense layers.

With the generic neighborhood aggregation layer of eq. (7) and the READOUT layer of eq. (8), we build our graph neural network HAG-Net for graph-level tasks by stacking multiple neighborhood aggregation and READOUT layers. The simplest structure is just a sequential stack of multiple neighborhood aggregation layers and one READOUT layer, with the output of the READOUT layer used as the representation of the whole graph. Former works such as (Xu et al., 2018; Gao and Ji, 2019) implement complex structures by using features from intermediate layers. We follow this idea, and design an optional pyramid feature structure and dense connections among intermediate layers. The complete network structure is illustrated in Figure 1.

For the READOUT layers in Figure 1, we empirically find that downstream task performance benefits marginally from tying their weights when the pyramid structure is enabled. The downstream task classifier is implemented as a stack of 3 dense layers.

With heterogeneous aggregations, the neighborhood aggregation layer and READOUT layer can be implemented as numerous variants with different options. In addition, there are variation options within the model structure of HAG-Net itself. To determine these options, we treat them as model hyper-parameters and tune them by human expert as well as automatic grid search; the details are described in section 3.3.
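To make the preceding design options concrete, below is a minimal sketch of one eq.-(7) configuration: two neighborhood aggregators (max and mean, as in cfg1 of Table 1) merged by sum after per-operator transforms, with the center node updated by a GRUCell (the rnn choice for C(·)). It assumes an edge-list graph representation; names and details are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class HeteroAggLayer(nn.Module):
    """One eq.-(7) layer: {max, mean} neighborhood aggregation merged by sum,
    center node updated by a GRUCell (the rnn choice for C)."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])  # phi_i per aggregator
        self.cell = nn.GRUCell(dim, dim)  # input: center node feature; hidden: merged aggregation
        self.psi = nn.Linear(dim, dim)

    def forward(self, h, edge_index):
        # h: [N, dim]; edge_index: [2, E] with edges pointing from source u to target v
        src, dst = edge_index
        ones = torch.ones(dst.size(0), 1, device=h.device)
        deg = torch.zeros(h.size(0), 1, device=h.device).index_add_(0, dst, ones).clamp(min=1)
        agg_mean = torch.zeros_like(h).index_add_(0, dst, h[src]) / deg       # A_0 = mean
        agg_max = torch.zeros_like(h).index_reduce_(                          # A_1 = max
            0, dst, h[src], 'amax', include_self=False)                       # needs PyTorch >= 1.12
        merged = self.phi[0](agg_mean) + self.phi[1](agg_max)                 # merge by sum
        return self.psi(self.cell(h, merged))                                 # psi(C(h_v, merged))
```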
In this section we try to answer the following questions through experiments:

• Q1. Does combining heterogeneous aggregations improve GNN's performance?
• Q2. Which aggregation operator combination is best? Does a larger combination always improve model performance?
• Q3. How should a proper aggregation operator combination be chosen in practice?

For a real-world case study, in the following we conduct experiments on drug discovery datasets, where a GNN is used to predict whether an organic small molecule is active w.r.t. a certain biological target. A molecule is converted to a graph representation by treating each atom as a node and each covalent bond between atoms as an edge; see Figure 2. We did not choose the benchmark datasets used previously in (Xu et al., 2018; Gao and Ji, 2019; Hamilton et al., 2017; Liao et al., 2019) due to their limited sample size, usually less than 1K; all the GNN models we tested exhibit large performance variation on them across different data splits and weight initializations. Instead, we collected 5 datasets from the drug discovery industry with sample size > 5K. All these datasets are binary classification tasks, varying both in size (from 7K to 76K samples) and class distribution (from balanced to highly biased). The details are given below.
Antivirus7k

This is the smallest among all the 5 datasets, with 7,305 samples in total and P/N ratio = 0.49 / 0.51. These are phenotypic records of antiviral bioactivity from various species and in vitro assays, collected from a commercial database. We choose EC50 <= 100 nM as the cutoff threshold.
CYP2C9V12k
This dataset is from the recently published TDC project (Huang et al., 2020), with 12,092 samples in total and P/N ratio = 0.67 / 0.33. The CYP P450 genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. Specifically, CYP P450 2C9 plays a major role in the oxidation of both xenobiotic and endogenous compounds (Veith et al., 2009).

Figure 2: A molecule is converted to a graph representation by treating each atom as a node and each covalent bond between atoms as an edge; a GNN with a classification head is then used to predict its activity. For generality, all covalent bond types are modeled as the same edge type. This enables comparison among different GNN models, though it is not the most accurate way to model a molecule.
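As an illustration of this conversion, the sketch below uses RDKit (an assumed toolkit choice; the paper does not name its implementation) to turn a SMILES string into node and edge lists with a single edge type, as in Figure 2.

```python
from rdkit import Chem

def mol_to_graph(smiles: str):
    """Atoms become nodes, covalent bonds become undirected single-type edges."""
    mol = Chem.MolFromSmiles(smiles)
    nodes = [atom.GetAtomicNum() for atom in mol.GetAtoms()]   # one node label per atom
    edges = []
    for bond in mol.GetBonds():
        u, v = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(u, v), (v, u)]   # both directions; bond type deliberately ignored
    return nodes, edges

nodes, edges = mol_to_graph("Cn1cnc2c1c(=O)n(C)c(=O)n2C")  # caffeine as an example
```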
Malaria16k
This dataset contains 16,933 samples in total with P/N ratio = 0.93 / 0.07. It is curated from a public dataset of Malaria sensitivity assays (Kato et al., 2016) by removing molecule chirality and merging the resulting duplicates by an OR operation.

Mtb18k
This is a public dataset with 18,886 samples in total and P/N ratio = 0.88 / 0.12. These are phenotypic records of in vitro assays against Mycobacterium tuberculosis (Lane et al., 2018). We choose an activity cutoff of 10 µM.

Globalvirus76k
This is the biggest among all the 5 datasets, with 76,247 samples in total and P/N ratio = 0.57 / 0.43. These are a combination of target-based and phenotype-based records of antiviral bioactivity from various species and in vitro assays, collected from a commercial database. We choose EC50 <= 100 nM as the cutoff threshold.
As all the datasets are binary classification tasks, the ER and AuROC metrics are commonly used to evaluate model performance. However, for datasets with a highly biased class distribution, AuROC is dominated by the major class, so we propose to use the AuPR metric instead for model evaluation; specifically, the harmonic average of the AuPRs of the positive and negative classes is reported in the following experiments.

There is a less studied topic in GNN research that we find important in production: convergence stability. During our evaluation of various GNN models, we noticed that some models exhibit much higher variance than others along the convergence progress; see Figure 3 for an illustration. A model with large convergence variance is not reliable in a production environment since evaluation data is always limited. To measure the model convergence variance quantitatively, we propose a median filtering method. For any convergence metric curve x, we first smooth the curve by median filtering and then compute the standard deviation of the difference, i.e.,

mstd = std( x - f_{median}^{w}(x) )    (9)

in which f_{median}^{w} is the median filter with window size 2w + 1. We set w = 5 for all experiments.
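A sketch of the two evaluation quantities above, assuming NumPy, SciPy and scikit-learn implementations: the harmonic average of the per-class AuPRs, and the eq.-(9) mstd with the w = 5 setting used in the experiments.

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.metrics import average_precision_score

def harmonic_aupr(y_true, y_score):
    """Harmonic average of the AuPRs of the positive and negative classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    aupr_pos = average_precision_score(y_true, y_score)        # positive class as-is
    aupr_neg = average_precision_score(1 - y_true, -y_score)   # negative class with flipped scores
    return 2 * aupr_pos * aupr_neg / (aupr_pos + aupr_neg)

def mstd(x, w=5):
    """Eq. (9): std of the residual after median filtering with window 2w + 1."""
    x = np.asarray(x, dtype=float)
    return np.std(x - medfilt(x, kernel_size=2 * w + 1))
```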
The proposed HAG-Net is a generic structure supporting numerous variants with different combinations of design options. With this versatility, we explore answers to the aforementioned questions Q1, Q2 and Q3. We tune hyper-parameters on the Antivirus7k dataset because of its suitable size and balanced class distribution; results are then reported on all 5 datasets studied.

Answering Q3 first: at the initial stage we tune hyper-parameters by human expert. We choose to copy the GIN (Xu et al., 2018) structure for HAG-Net as the baseline; better hyper-parameter combinations are then explored heuristically by a human expert. This process converges to configuration set cfg1 in Table 1 after dozens of trials. Evaluation results of HAG-Net with configuration set cfg1 on all 5 datasets are given in Table 2.

We then further tune the hyper-parameters by automatic grid search, using cfg1 as the starting point and the Optuna (Akiba et al., 2019) package for the search. This process runs for approximately one week on a cluster with 8 P100 GPUs. The hyper-parameter set with the best AuPR value is selected as the final configuration; refer to configuration set cfg2 in Table 1 for details. This whole process answers question Q3.
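A sketch of how such a grid search could be set up with Optuna; the search-space encoding and the build_hagnet / evaluate_aupr helpers are hypothetical, included only to show the mechanics, not the paper's actual grid.

```python
import math
import optuna

# Hypothetical search space over the design options of Table 1; operator
# combinations are encoded as '+'-joined strings so the grid sampler can enumerate them.
search_space = {
    "A_aggr":  ["max", "sum", "mean", "max+sum", "max+mean", "sum+mean"],
    "A_RO":    ["max", "sum", "max+sum"],
    "merge":   ["cat", "sum"],
    "combine": ["sum", "max", "cat", "rnn"],
}

def objective(trial):
    cfg = {k: trial.suggest_categorical(k, v) for k, v in search_space.items()}
    aggr_ops = cfg["A_aggr"].split("+")
    ro_ops = cfg["A_RO"].split("+")
    model = build_hagnet(aggr_ops, ro_ops, cfg["merge"], cfg["combine"])  # hypothetical builder
    return evaluate_aupr(model)  # hypothetical 5-fold AuPR evaluation

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.GridSampler(search_space))
study.optimize(objective, n_trials=math.prod(len(v) for v in search_space.values()))
print(study.best_params)
```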
Table 1: Hyper-parameter configuration set for HAG-Net*

Name | Layers | {A_aggr}    | ⊕_aggr | C   | {A_RO}     | ⊕_RO | Pyramid | RO Tied | DC
cfg1 | 5      | {max, mean} | sum    | rnn | {max, sum} | sum  | True    | True    | True
cfg2 | 5      | {max, sum}  | sum    | cat | {max}      | -    | True    | True    | False

* {A_aggr} is the set of aggregation operators in eq. (7) and {A_RO} is the set of aggregation operators in eq. (8). ⊕_aggr is the merge operator in eq. (7) and ⊕_RO is the merge operator in eq. (8). When "Pyramid" is True, there is a READOUT layer attached to each aggregation layer as in Figure 1; otherwise the structure is fully sequential. "RO Tied" indicates whether the weights of the READOUT layers are tied. "DC" indicates whether there are dense connections among the aggregation layers.

Table 2: Evaluation Results: 5-Fold Average. For each dataset (Antivirus7k, CYP2C9V12k, Mtb18k, Malaria16k, Globalvirus76k), AuPR and ER are reported, both in range [0, 1], for GUNet, GIN, DeepChem, HAG-Net cfg1 and HAG-Net cfg2.

For questions Q1 and Q2: by comparing cfg2 with cfg1 and the evaluation results in Table 2, we can see that GNN's performance does benefit from combining heterogeneous aggregations, but the benefit is not always consistent with the combination size. In the human-expert-optimized cfg1, both {A_aggr} and {A_RO} are combinations of 2 different aggregation operators; in the grid-search-optimized cfg2, the single operator max is chosen for {A_RO} over the previous combination {max, sum}. During the hyper-parameter tuning process, we noticed that for both {A_aggr} and {A_RO}, combinations with 3 or more aggregation operators never win out. One possible reason is that the investigated operator set {max, sum, mean, att} is quite limited, so only a marginal complementary effect can be obtained from larger combinations.

In this subsection we benchmark the performance of the proposed HAG-Net with the different hyper-parameter configurations in Table 1. Results from multiple state-of-the-art models, GIN (Xu et al., 2018), GUNet (Gao and Ji, 2019) and the model from DeepChem (Ramsundar et al., 2019), are also reported on the 5 datasets studied. GIN is a very popular model according to the leaderboard of the OGB project (Hu et al., 2020). For GUNet we use {A_RO} = {sum} and the same classifier as in GIN. The model from DeepChem is specially designed for small-molecule tasks, both in network structure and input node features.

All models are implemented with PyTorch 1.6.0. For the DeepChem model, node/atom features customized by DeepChem itself are used, with dimension d = 75. For GIN, GUNet and the proposed HAG-Net, a node embedding layer is utilized to learn features from the training data automatically, also with dimension d = 75 for consistency. For the GIN, GUNet and DeepChem models, default values are kept for their hyper-parameters, and the Adam (Kingma and Ba, 2014) optimizer with learning rate 1e-3 is used. For HAG-Net with its different hyper-parameter configurations, the SGD optimizer with learning rate 1e-2 is used. The training batch size and number of epochs are fixed to 256 and 1,000 for all experiments. Evaluation results are given in Table 2, reported as 5-fold averages.

As mentioned in section 3.2, GNN models exhibit strikingly different stability during the convergence process.
For the GIN, GUNet and DeepChem models we studied, across different datasets and different weight initializations, frequent spikes and dips can be observed in the metric curves (with different patterns; see Figure 3 for an illustration). This problem is not observable with single-point metrics such as the ER and AuPR reported in Table 2, since a model with large convergence variance can still achieve a high AuPR score. We use the mstd metric defined in eq. (9) to measure the model convergence variance quantitatively; results for the ER curves are reported in Table 3, also as 5-fold averages. Note the mstd metric is still not perfect for evaluating convergence stability: for example, as illustrated by Figure 3(a), the DeepChem model exhibits spurious fluctuation yet achieves the smallest mstd value.

Figure 3: Convergence curves of P/R/F1 for class 1 on the Mtb18k dataset. (a) DeepChem model (b) GIN model (c) GUNet model (d) HAG-Net cfg1 (e) HAG-Net cfg2
Table 3: Convergence Variance for ER Curves. mstd (5-fold average) is reported for GUNet, GIN, DeepChem, HAG-Net cfg1 and HAG-Net cfg2 on Antivirus7k, CYP2C9V12k, Mtb18k, Malaria16k and Globalvirus76k.

Future work
Results reported in Tables 2 and 3 validate our conjecture that GNN's performance can benefit from enhanced information propagation from shallow to deep layers. Our first attempt, combining heterogeneous aggregations, succeeds, though not dramatically. One possible direction for future work is to investigate more neighborhood aggregation operators.

Another possible method for enhancing information propagation is a multi-channel mechanism as in CNNs. The channel dimension is essential to the success of CNNs, where each channel encodes a different part of the information from the input samples. Combining heterogeneous aggregations can be considered a primitive imitation of this multi-channel mechanism.
Conclusion

In this manuscript we improve GNN's performance from the perspective of enhancing information propagation from shallow to deep layers. As our first attempt in this direction, we propose to enhance information propagation by combining heterogeneous aggregation operators in GNN's neighborhood aggregation layers. By combining different aggregation operators, the information propagation loss can be mitigated, allowing more effective features for the downstream task to propagate to deep layers. A new generic GNN layer formulation and, upon it, a new GNN variant referred to as HAG-Net are proposed. We empirically validate the effectiveness of HAG-Net on a number of graph classification datasets. Our future work will investigate enhancing information propagation for GNNs from the perspective of channelling, as done in CNNs.
References
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In The World Wide Web Conference, pages 417–426, 2019.

Daixin Wang, Jianbin Lin, Peng Cui, Quanhui Jia, Zhen Wang, Yanming Fang, Quan Yu, Jun Zhou, Shuang Yang, and Yuan Qi. A semi-supervised graph attentive network for financial fraud detection. arXiv preprint arXiv:2003.01171, 2020.

Zhao Zhang, Fuzhen Zhuang, Hengshu Zhu, Zhiping Shi, Hui Xiong, and Qing He. Relational graph neural network with hierarchical attention for knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9612–9619, 2020.

Desmond J Higham, Marija Rašajski, and Nataša Pržulj. Fitting a geometric graph to a protein–protein interaction network. Bioinformatics, 24(8):1093–1099, 2008.

Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

Hongyang Gao and Shuiwang Ji. Graph U-Nets. In International Conference on Machine Learning, pages 2083–2092. PMLR, 2019.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel. LanczosNet: Multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484, 2019.

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics Data Commons: Machine learning datasets for therapeutics. https://tdcommons.ai, November 2020.

Henrike Veith, Noel Southall, Ruili Huang, Tim James, Darren Fayne, Natalia Artemenko, Min Shen, James Inglese, Christopher P Austin, David G Lloyd, et al. Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries. Nature Biotechnology, 27(11):1050–1055, 2009.

Nobutaka Kato, Eamon Comer, Tomoyo Sakata-Kato, Arvind Sharma, Manmohan Sharma, Micah Maetani, Jessica Bastien, Nicolas M Brancucci, Joshua A Bittker, Victoria Corey, et al. Diversity-oriented synthesis yields novel multistage antimalarial inhibitors. Nature, 538(7625):344–349, 2016.

Thomas Lane, Daniel P Russo, Kimberley M Zorn, Alex M Clark, Alexandru Korotcov, Valery Tkachenko, Robert C Reynolds, Alexander L Perryman, Joel S Freundlich, and Sean Ekins. Comparing and validating machine learning models for Mycobacterium tuberculosis drug discovery. Molecular Pharmaceutics, 15(10):4346–4360, 2018.

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.

Bharath Ramsundar, Peter Eastman, Patrick Walters, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep Learning for the Life Sciences. O'Reilly Media, 2019.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.