Training Sensitivity in Graph Isomorphism Network
Md. Khaledur Rahman
Indiana University [email protected]
ABSTRACT
Graph neural networks (GNNs) are a popular tool to learn lower-dimensional representations of graphs. They facilitate the application of machine learning tasks to graphs by incorporating domain-specific features. There are various options for the underlying procedures (such as optimization functions, activation functions, etc.) that can be considered in the implementation of a GNN. However, most existing tools are confined to one approach without any analysis. Thus, this emerging field lacks a robust implementation that accounts for the highly irregular structure of real-world graphs. In this paper, we attempt to fill this gap by studying various alternative functions for each respective module using a diverse set of benchmark datasets. Our empirical results suggest that the commonly used underlying techniques do not always perform well in capturing the overall structure of a set of graphs.
KEYWORDS
graph neural networks; node classification; parameter sensitivity
ACM Reference Format:
Md. Khaledur Rahman. 2020. Training Sensitivity in Graph Isomorphism Network. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19–23, 2020, Virtual Event, Ireland. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3340531.3412089
1 INTRODUCTION

Graphs can be considered as a transformation of knowledge from various domains such as social networks, scientific literature, and protein-interaction networks, where an entity can be denoted by a vertex and the relationship between a pair of entities can be denoted by an edge. An effective representation of a graph can help learn important characteristics from the dataset. For example, a machine learning model can be applied to a graph representation to recommend friends in social networks. The traditional graph data structure, such as the adjacency matrix, can be a naive representation for machine learning models for making such predictions. However, we cannot accommodate memory for very large graphs as this representation has $O(n^2)$ memory cost. As the size of the data representable in graph format is growing at a significant speed [14], researchers are motivated to find an alternate, efficient approach. Thus, an effective graph representation learning or embedding technique has become a problem of high demand.
Figure 1: A graph neural network with a toy example: a hidden layer consists of a pooling (aggregation) layer and an activation layer. There can be multiple hidden layers in the network. The output layer represents the embedding of the network, which is generally used for the classification task.
Designing an effective graph embedding method can be extremely challenging as most real-world graphs have a highly irregular structure. Several unsupervised methods have been proposed in the literature to solve this problem sub-optimally [5, 11]. Graph layout generation methods can also be used to generate embeddings [12, 13]. Some of them generate good-quality embeddings but have high runtime and memory costs, while others consume less runtime and memory but generate moderate-quality embeddings. Nevertheless, these methods cannot incorporate domain-specific features, which might be helpful to learn a better representation of the graph. Recently, the Graph Convolutional Network (GCN) has gained much attention in the literature [8]. The basic idea is similar to that of Convolutional Neural Networks (CNNs) [9] in the computer vision domain, where the convolution operator of a vertex aggregates contributions from its neighboring vertices in each hidden layer. A toy example is shown in Fig. 1. The GraphSAGE method further contributes to this field by proposing several effective aggregation functions [6]. On the other hand, FastGCN runs faster than other methods though it sacrifices the quality of the embedding a little [2]. Notably, Errica et al. have performed an extensive set of experiments to compare only existing GNN tools [4]. These types of methods are trained in a semi-supervised manner and used to predict the labels of unseen vertices, which is often termed inductive learning [6]. This is a rapidly growing field; however, the parameter sensitivity of GNNs is overlooked in the literature. In this paper, we fill this gap by analyzing a GNN model, called the Graph Isomorphism Network (GIN) [17], for several optimization techniques, aggregation functions, learning rates, and activation functions on a diverse set of benchmark graphs. The summary of our contributions is given as follows:
• We develop a set of research questions and address them by parameter tuning to learn the sensitivity of GIN.
• We conduct an extensive set of experiments to analyze the results on a diverse set of benchmark datasets.
• Our empirical results provide a new insight that the traditionally used ADAM optimization technique and the ReLU activation function do not always perform better than other alternatives for all types of graphs. This presumably directs us to consider other techniques rather than a single one.

2 PROBLEM SETTING

2.1 Background
We represent a graph by $G = (V, E)$, where $V$ is the set of vertices, $E$ is the set of edges, and each $z_i$ of a set $Z$ represents the embedding of the $i$th vertex. We represent the neighbors of a vertex $u$ in $G$ by $N(u)$. In a GNN, domain-specific features or one-hot-encoded representations of vertices are fed to the input layer. In a hidden layer, an aggregation (pooling) function gathers contributions from the neighbors of a vertex to learn the structural properties of the graph, which then pass through an activation function (see Fig. 1). Multiple hidden layers can be placed in the network in a cascaded style. The embedding $Z$ of a graph $G$ is produced in the output layer. A cross-entropy based objective function is optimized to update the embedding in a semi-supervised way. The representation of hidden layers in GIN [17] can be expressed as follows:

$h_u^{l} = \mathrm{MLP}\big((1 + \epsilon^{l})\, h_u^{l-1} + \sum_{v \in N(u)} h_v^{l-1}\big)$   (1)

Here, $h_u^{l}$ is the activation of the $l$th hidden layer for vertex $u$ and $h_u^{0} = X_u$, where $X$ is the input feature matrix. MLP represents the multi-layer perceptron that helps learn the representation of the network, and $\epsilon$ denotes the fraction of considerable activation from the previous layer for the same vertex $u$. A non-linear activation function is used in the MLP of Equation 1, and a popular choice is Rectified Linear Units (ReLU) [10]. There are several options to optimize the performance of a GNN by tweaking units such as the activation function, the optimization technique, or hyper-parameter tuning.

2.2 Activation Functions

We employ several activation functions in the GIN, each of which has some benefits over the others. We assume that $x$ represents the product of the contributions of activations from all neighbors in a hidden layer and the weight matrix of the corresponding layer. Then, the ReLU, LeakyReLU, Sigmoid, and Tanh activation functions compute the activations using the terms $\max(0, x)$, $\max(0, x) + 0.01 \cdot \min(0, x)$, $\frac{1}{1 + e^{-x}}$, and $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, respectively. Specifically, we use PyTorch implementations for these activation functions. Generally, the state-of-the-art tools employ one activation function in a GNN, such as ReLU in GCN [8] or LeakyReLU in GAT [15], but they do not distinguish among different activation functions. In this paper, we study the above mentioned four standard activation functions.
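To make Equation 1 concrete, the following is a minimal sketch of a single GIN hidden layer in PyTorch. The class name, the two-layer MLP, the learnable $\epsilon$, and the use of a dense adjacency matrix are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """Sketch of one GIN hidden layer following Equation (1)."""
    def __init__(self, in_dim, out_dim, eps=0.0, activation=nn.ReLU()):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps))  # epsilon^l (learnable here)
        # Two-layer MLP with a configurable non-linearity (ReLU, LeakyReLU, ...)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), activation,
            nn.Linear(out_dim, out_dim), activation,
        )

    def forward(self, adj, h):
        # adj: dense (n x n) adjacency matrix, h: (n x in_dim) node features
        neighbor_sum = adj @ h                      # sum over h_v^{l-1}, v in N(u)
        return self.mlp((1.0 + self.eps) * h + neighbor_sum)

# Toy usage: 4 vertices with 3-dimensional input features
adj = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
h = torch.randn(4, 3)
layer = GINLayer(3, 64, activation=nn.LeakyReLU())
print(layer(adj, h).shape)  # torch.Size([4, 64])
```

Passing a different activation module is how the four activation functions of Section 2.2 can be swapped in this sketch.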
2.3 Optimization Techniques

We discuss five techniques to optimize the cross-entropy based loss function in GIN. We study these techniques due to the fact that graph datasets have irregular structures and a single technique might not perform well to optimize the objective function.

SGD:
The Stochastic Gradient Descent (SGD) is a widely used algorithm to optimize weights in neural networks [1]. This approach updates the weight parameters with respect to each training sample, which reduces the huge computation cost of convergence of the Gradient Descent (GD) approach.

$W_{t+1} = W_t - \eta \frac{\partial L}{\partial W_t}$   (2)

Here, $t$ represents the $t$th iteration, $\eta$ is the learning rate, and $\frac{\partial L}{\partial W_t}$ is the gradient of the loss function $L$ with respect to the weight matrix $W$.
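As a toy illustration of the update rule in Equation (2), the snippet below applies it to a simple squared loss; this is only a sketch and not part of the GIN training pipeline.

```python
import numpy as np

# Toy example: minimize L(W) = 0.5 * ||X @ W - y||^2 with the SGD rule of Eq. (2)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
W, eta = np.zeros(3), 0.01

for t in range(100):
    i = rng.integers(len(y))            # one training sample per step
    grad = (X[i] @ W - y[i]) * X[i]     # dL/dW for that sample
    W = W - eta * grad                  # W_{t+1} = W_t - eta * dL/dW_t
```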
ADAGRAD:
The ADAptive GRADient (ADAGRAD) algorithm is a variation of SGD which focuses on updating multi-dimensional weights using a scaled learning rate for each dimension [3]. We can formally present the update equation as follows:

$W_{t+1} = W_t - \frac{\eta}{\sqrt{\epsilon I + \mathrm{diag}(G_t)}} \frac{\partial L}{\partial W_t}$   (3)

Here, $I$ is the identity matrix, $\epsilon$ is a small value to avoid division by zero, and $G_t$ is a diagonal matrix where each diagonal entry represents the accumulated squared gradient in the corresponding dimension.
ADADELTA:
ADADELTA is another variation of SGD that solves some of the problems originating from the ADAGRAD approach, such as the decreasing learning rate over iterations and the manual selection of a global learning rate [18]. ADADELTA addresses these issues by computing past gradients over a fixed window of steps. To get rid of $\eta$, it also accumulates the weight updates in a manner similar to the gradients. The whole weight update can be represented as follows:

$U_t(g) = \alpha U_{t-1}(g) + (1 - \alpha)\, g_t^2$   (4a)
$\Delta W_t = \sqrt{\frac{U_{t-1}(\Delta W) + \epsilon}{U_t(g) + \epsilon}}$   (4b)
$U_t(\Delta W) = \alpha U_{t-1}(\Delta W) + (1 - \alpha)\, \Delta W_t^2$   (4c)
$W_{t+1} = W_t - \Delta W_t\, g_t$   (4d)

Here, $g_t$ is the gradient at the $t$th iteration, i.e., $g_t = \frac{\partial L}{\partial W_t}$, $U_t$ is a function that computes the accumulation of previous gradients and weight updates, and $\alpha$ is a decaying parameter that determines what fractions of the previous and current gradients are used, respectively.
RMSProp:
RMSProp is similar to ADADELTA with the exceptions that it does not accumulate the weight updates over iterations and it keeps the learning rate in the update equation. This can be formally represented by the following equations:

$U_t(g) = 0.9\, U_{t-1}(g) + 0.1\, g_t^2$   (5a)
$W_{t+1} = W_t - \frac{\eta}{\sqrt{U_t(g) + \epsilon}}\, g_t$   (5b)
ADAM:
The ADAptive Momentum (ADAM) estimation is a pioneering work that keeps accumulations for the mean and the variance of the gradients [7]. In this approach, a weighted average is calculated by taking a large fraction of the gradient from previous steps and a small fraction from the current step. Similarly, an estimate of the variance is taken as a weighted average of the past and current squared gradients. We can represent this as follows:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$   (6a)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$   (6b)
$\hat{m}_t = m_t / (1 - \beta_1^t)$   (6c)
$\hat{v}_t = v_t / (1 - \beta_2^t)$   (6d)
$W_{t+1} = W_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$   (6e)

Here, $m_t$ is the momentum of the mean of the gradients and $v_t$ is the momentum of the variance of the gradients up to the $t$th iteration, and $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected estimates, respectively. $\beta_1$ and $\beta_2$ are two decaying parameters for the momentum of the mean and the variance, respectively. We study the characteristics of the graph isomorphism network using these five types of optimization techniques.

2.4 Aggregation Functions

We consider three aggregation functions in the GIN architecture [17]. The MAX function aggregates the maximum value from the neighbors. The AVERAGE function sums up the contributions from the neighbors and divides this value by the number of neighbors. The SUM aggregation function simply sums up the contributions from the neighbors and sends the result to the activation function. The authors of GIN show that the SUM function can distinguish different structures in the graph whereas the other functions fail to do so. Thus, we briefly revisit all these aggregation functions.
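Below is a minimal sketch of the three aggregation functions over a toy neighborhood, assuming the neighbor features are stacked in a matrix; it is illustrative only, since GIN itself applies the SUM aggregator inside Equation 1.

```python
import torch

# Toy neighborhood: 3 neighbors of a vertex u, each with a 4-dimensional feature
neighbor_feats = torch.tensor([[1., 0, 2, 1],
                               [0., 3, 1, 1],
                               [2., 1, 0, 1]])

sum_agg = neighbor_feats.sum(dim=0)          # SUM: element-wise sum over neighbors
avg_agg = neighbor_feats.mean(dim=0)         # AVERAGE: sum divided by #neighbors
max_agg = neighbor_feats.max(dim=0).values   # MAX: element-wise maximum

# SUM keeps multiplicity information (e.g., it can tell two identical neighbors
# from one), which AVERAGE and MAX discard for some structures.
print(sum_agg, avg_agg, max_agg)
```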
3 EXPERIMENTS

To conduct the experiments, we use a server machine configured as follows: Intel(R) Xeon(R) CPU E5-2670 v3 (2.30GHz), 48 cores arranged in two NUMA sockets, 128GB memory, 32K L1 cache, and 32MB L3 cache. We run all experiments using a graph isomorphism network which has two MLP layers and generates a 64-dimensional embedding of a graph. We set the batch size to 32 and the number of epochs to 50, and use default values for several hyper-parameters unless otherwise explicitly mentioned. We report the performance on all datasets in terms of the accuracy measure, i.e., the higher the value, the better the result. WLOG, we show the value of the accuracy measure within the range 0 to 1 or multiply by 100 to report a percentage.

We use a set of five graph classification datasets which includes two bioinformatics networks and three social networks. We compute the average degree by averaging the avg. degree across all graphs of a particular dataset. This set of graphs has diverse properties: the datasets have varying average degrees and contain different numbers of graphs. We provide a summary of the datasets in Table 1. More details about these datasets can be found in the GIN paper [17].
Table 1: Benchmark datasets for graph embedding

Name       #Graphs   Avg. |V|   #Classes   Avg. Degree
PROTEINS   1113      39.1       2          3.73
COLLAB     5000      74.5       3          37.38
NCI1       4110      29.8       2          2.15
REDDITM    5000      508.5      5          2.25
IMDBM      1500      13.0       3          8.10
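As a hedged sketch of how the setup above can be wired together in PyTorch, the snippet below maps the five techniques of Section 2.3 to torch.optim classes and runs a dummy training loop; the stand-in model, the dummy batch, and the helper name are placeholders rather than the GIN reference implementation.

```python
import torch
import torch.nn as nn

def make_optimizer(name, params, lr=0.01):
    """Map the five techniques of Section 2.3 to torch.optim classes."""
    optimizers = {
        "SGD": torch.optim.SGD,
        "ADAGRAD": torch.optim.Adagrad,
        "ADADELTA": torch.optim.Adadelta,
        "RMSProp": torch.optim.RMSprop,
        "ADAM": torch.optim.Adam,
    }
    return optimizers[name](params, lr=lr)

# Placeholder model: in the experiments this would be the GIN with 2 MLP layers
# and a 64-dimensional embedding; here a linear classifier stands in.
model = nn.Linear(64, 2)
optimizer = make_optimizer("ADAGRAD", model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()    # cross-entropy based loss, as in the paper

for epoch in range(50):              # 50 epochs, batch size 32 in the experiments
    x, y = torch.randn(32, 64), torch.randint(0, 2, (32,))  # dummy batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```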
Different graphs have different structural properties. So, we can presumably say that various optimization techniques can show dissimilar results for various graph datasets. Most of the previously proposed GNNs lack this kind of study. In this paper, we attempt to study these behaviors on several benchmark datasets. Specifically, we employ different optimization techniques, aggregation functions, and activation functions to study the sensitivity of our focused GIN model. For all experiments, we perform 10-fold cross-validation, i.e., we report training accuracy based on 90% of the dataset and testing accuracy based on the remaining 10%. We develop a set of research questions which are discussed as follows.

https://github.com/weihua916/powerful-gnns/
Figure 2: Training accuracy of (a) NCI1 and (b) REDDITM datasets for different optimization techniques.
Most of the GNNs use the ADAM optimizer in the programming model [2, 6, 8, 17]. So, we measure the performance of the different optimization techniques along with ADAM, which are discussed in Section 2.3. We use default values for the other parameters in the GIN model. We report the training accuracy of the NCI1 and REDDITM datasets in Figs. 2 (a) and (b). We observe that the ADAGRAD optimization technique shows a better accuracy curve over the iterations than the other optimization techniques. The ADADELTA technique also shows a better accuracy curve than ADAM. For the REDDITM dataset, ADAGRAD again shows a better performance curve which is more robust than those of the other optimization techniques. So, the empirical evidence suggests that ADAGRAD is a better choice than ADAM for the graph isomorphism network. We skip training results for the other graphs due to space restrictions. We show the results on the test datasets for different optimization techniques in Fig. 3 (a). We observe that ADAGRAD also performs better than ADAM for the NCI1 and REDDITM datasets. In fact, ADAGRAD shows better or competitive performance across all benchmark datasets. We perform a two-sided Wilcoxon signed-rank test on 50 samples of test results which also supports our findings (p-value < 0.017) for all datasets [16]. This empirical evidence gives us a new insight to use the ADAGRAD technique instead of ADAM in graph neural networks.

Similar to the ADAM optimization technique, most existing graph neural networks use the ReLU activation function in the programming model [6, 8, 17]. So, we evaluate the performance of the other activation functions along with ReLU, which are discussed in Section 2.2. As usual, we use default values for the other parameters in the model. We report the training accuracy for the NCI1 and REDDITM datasets in Figs. 4 (a) and (b). The training accuracy for all activation functions shows spikes in the learning curves. However, we observe that LeakyReLU and ReLU show competitive performance over several iterations while training the model. For the REDDITM dataset, the Tanh activation function also shows competitive performance. We report the results on the test datasets for all activation functions in Fig. 3 (b). We observe that LeakyReLU shows better or competitive performance compared to the other activation functions. Notice that the Tanh activation function shows better test accuracy than ReLU for the REDDITM dataset. These experimental results provide us a new insight that LeakyReLU might be a better choice of activation function in graph neural networks than ReLU.
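The Wilcoxon signed-rank test mentioned above can be reproduced with SciPy; the accuracy arrays below are dummy placeholders standing in for the 50 paired test results per optimizer, not the paper's actual measurements.

```python
import numpy as np
from scipy.stats import wilcoxon

# Dummy paired samples of test accuracy (e.g., ADAGRAD vs. ADAM over 50 runs);
# in the paper these would come from the actual cross-validation results.
rng = np.random.default_rng(42)
acc_adagrad = rng.uniform(0.55, 0.75, size=50)
acc_adam = acc_adagrad - rng.uniform(0.0, 0.05, size=50)

# Two-sided Wilcoxon signed-rank test on the paired differences
stat, p_value = wilcoxon(acc_adagrad, acc_adam, alternative="two-sided")
print(f"statistic={stat:.1f}, p-value={p_value:.4f}")
```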
The authors of the GIN method demonstrate several scenarios where the MAX and AVERAGE aggregation functions can fail to distinguish between structural differences of graphs. On the other hand, the SUM aggregation function can correctly distinguish such structural differences. Our experimental results also support this statement. We briefly report the results on the test datasets in Fig. 3 (c). We observe that all aggregation functions perform almost equally well for the PROTEINS and NCI1 datasets. However, the SUM function performs significantly better than the other functions for the IMDBM, REDDITM, and COLLAB datasets. Thus, SUM would be an order-invariant, powerful aggregation function for graph neural networks.
Figure 3: Test accuracy of different datasets for different (a) optimization techniques, (b) activation functions, (c) aggregation functions, and (d) learning rates. The average value of test accuracy is reported over several iterations.
Figure 4: Training accuracy of (a) NCI1 and (b) REDDITM datasets for different activation functions.
Figure 5: Training accuracy of (a) NCI1 and (b) REDDITM datasets for different numbers of hidden and MLP layers.
We also study several hyper-parameters of the graph isomorphism network, such as the hidden dimension, the learning rate, the number of hidden layers, and the number of layers in the MLP. Notably, we use the default values for the other parameters in the model. The parameter sensitivity of the embedding dimension has been studied in unsupervised learning [5, 11]. So, we also conduct experiments varying the size of the hidden dimension in GIN for the REDDITM dataset. For this experimental setup, we get test accuracies of 39.1%, 39%, 39%, and 38.4% for 16, 32, 64, and 128-dimensional embeddings, respectively. We do not see any significant difference for the other datasets either, but in general, accuracy may drop for very high-dimensional representations. We report the results of test accuracy for different learning rates in Fig. 3 (d). We observe that a learning rate of 0.02 shows better results for the PROTEINS and NCI1 datasets whereas a learning rate of 0.01 shows better performance for the IMDBM, REDDITM, and COLLAB datasets. Thus, a learning rate of 0.01 would be a reasonable choice for the GIN model. We show the training accuracy in Figs. 5 (a) and (b) for the NCI1 and REDDITM datasets, respectively, varying the number of hidden layers and MLP layers. In both figures, L5_MLP2 means that the learning model has 5 hidden layers and 2 MLP layers. We observe that increasing the number of hidden layers in the model does not change the performance significantly; however, increasing the number of MLP layers contributes to improving the performance. So, a higher value for the number of MLP layers might be a better option to seek improved performance.
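A minimal sketch of the kind of sweep described above is shown below; the train_and_evaluate helper and its arguments are hypothetical stand-ins for the actual training code, not the reference implementation's API.

```python
import itertools
import random

# Hypothetical helper standing in for actual training: in the paper this would
# train the GIN for 50 epochs (batch size 32) and return 10-fold CV accuracy.
def train_and_evaluate(num_layers, num_mlp_layers, hidden_dim, lr):
    return random.uniform(0.3, 0.6)  # dummy accuracy so the sketch runs end to end

learning_rates = [0.01, 0.02, 0.03]
layer_configs = [(5, 2), (8, 2), (5, 4)]   # (hidden layers, MLP layers), e.g. L5_MLP2

results = {}
for lr, (layers, mlp_layers) in itertools.product(learning_rates, layer_configs):
    results[(lr, layers, mlp_layers)] = train_and_evaluate(layers, mlp_layers, 64, lr)

best = max(results, key=results.get)
print("best configuration:", best, "accuracy:", results[best])
```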
4 CONCLUSION

Graph neural networks have become a powerful toolbox for analyzing and mining different graph datasets. However, this field lacks a rigorous evaluation of several underlying techniques that can be employed in a GNN. For example, most existing methods stick to a particular optimization technique such as ADAM or an activation function such as ReLU. In this paper, we empirically demonstrate that ADAGRAD and LeakyReLU might be better options for the optimization technique and the activation function, respectively. We believe that our findings will provide the community with new insight for choosing better underlying techniques while developing new graph neural networks.
REFERENCES
[1] Bottou, L. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes 91, 8 (1991), 12.
[2] Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
[3] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[4] Errica, F., Podda, M., Bacciu, D., and Micheli, A. A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893 (2019).
[5] Grover, A., and Leskovec, J. node2vec: Scalable feature learning for networks. In KDD (2016), ACM, pp. 855–864.
[6] Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NIPS (2017), pp. 1024–1034.
[7] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[8] Kipf, T. N., and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[9] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS (2012), pp. 1097–1105.
[10] Nair, V., and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML (2010), pp. 807–814.
[11] Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In KDD (2014), ACM, pp. 701–710.
[12] Rahman, M. K., Sujon, M. H., and Azad, A. BatchLayout: A batch-parallel force-directed graph layout algorithm in shared memory. In (2020), IEEE, pp. 16–25.
[13] Rahman, M. K., Sujon, M. H., and Azad, A. Force2Vec: Parallel force-directed graph embedding. In Review (2020).
[14] Reinsel, D., Gantz, J., and Rydning, J. The digitization of the world from edge to core. IDC White Paper (2018).
[15] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[16] Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 1992, pp. 196–202.
[17] Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[18] Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).