Hybrid Micro/Macro Level Convolution for Heterogeneous Graph Learning
Le Yu, Leilei Sun, Bowen Du, Chuanren Liu, Weifeng Lv, and Hui Xiong, Fellow, IEEE

• L. Yu, L. Sun, B. Du and W. Lv are with the SKLSDE and BDBC Lab, Beihang University, Beijing, 100083, China. E-mail: [email protected], [email protected], [email protected], [email protected]
• C. Liu is with the Department of Business Analytics and Statistics, University of Tennessee, Knoxville, USA. E-mail: [email protected]
• H. Xiong is with the Department of Management Science and Information Systems, Rutgers University, USA. E-mail: [email protected]

Manuscript received December 29, 2020; revised xx xx, xxxx.
Abstract—Heterogeneous graphs are pervasive in practical scenarios, where each graph consists of multiple types of nodes and edges. Representation learning on heterogeneous graphs aims to obtain low-dimensional node representations that could preserve both node attributes and relation information. However, most of the existing graph convolution approaches were designed for homogeneous graphs, and therefore cannot handle heterogeneous graphs. Some recent methods designed for heterogeneous graphs are also faced with several issues, including the insufficient utilization of heterogeneous properties, structural information loss, and lack of interpretability. In this paper, we propose HGConv, a novel Heterogeneous Graph Convolution approach, to learn comprehensive node representations on heterogeneous graphs with a hybrid micro/macro level convolutional operation. Different from existing methods, HGConv could perform convolutions on the intrinsic structure of heterogeneous graphs directly at both micro and macro levels: a micro-level convolution to learn the importance of nodes within the same relation, and a macro-level convolution to distinguish the subtle difference across different relations. The hybrid strategy enables HGConv to fully leverage heterogeneous information with proper interpretability. Moreover, a weighted residual connection is designed to aggregate both inherent attributes and neighbor information of the focal node adaptively. Extensive experiments on various tasks demonstrate not only the superiority of HGConv over existing methods, but also the intuitive interpretability of our approach for graph analysis.
Index Terms—Heterogeneous graphs, graph convolution, representation learning.
1 INTRODUCTION

A heterogeneous graph consists of multiple types of nodes and edges, involving abundant heterogeneous information [1]. In practice, heterogeneous graphs are pervasive in real-world scenarios, such as academic networks, e-commerce and social networks [2]. Learning meaningful representations of nodes in heterogeneous graphs is essential for various tasks, including node classification [3], [4], node clustering [5], link prediction [6], [7] and personalized recommendation [8], [9].

In recent years, Graph Neural Networks (GNNs) have been widely used in representation learning on graphs and have achieved superior performance. Generally, GNNs perform convolutions in two domains, namely the spectral domain and the spatial domain. As a spectral-based method, GCN [10] utilizes the localized first-order approximation on neighbors and then performs convolutions in the Fourier domain for an entire graph. Spatial-based methods, including GraphSAGE [11] and GAT [12], directly perform information propagation in the graph domain by specially designed aggregation functions or the attention mechanism. However, all of the above methods were designed for homogeneous graphs with a single node type and a single edge type, and they are infeasible to handle the rich information in heterogeneous graphs. Simply adapting them to deal with heterogeneous graphs would lead to the information loss issue, since they ignore the heterogeneous properties of graphs.

Despite the investigation of approaches on homogeneous graphs, there are also several attempts to design graph convolution methods for heterogeneous graphs. RGCN [13] was proposed to deal with multiple relations in knowledge graphs. HAN [14] was designed to learn on heterogeneous graphs based on meta-paths and the attention mechanism. [15] presented HetGNN to consider the heterogeneity of node attributes and neighbors through dedicated aggregation functions. [16] proposed HGT, a variant of Transformer [17], to focus on the meta relations in heterogeneous graphs.

However, the aforementioned methods are still faced with the following limitations.
1) Heterogeneous information loss: several methods utilize the properties of nodes or relations only partially, rather than the comprehensive information of nodes and relations (e.g., RGCN and HAN). In detail, RGCN ignores the distinct attributes of nodes with various types. HAN relies on multiple hand-designed symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs, which leads to the loss of information carried by different nodes and edges. 2) Structural information loss: some methods deal with the graph topology through heuristic strategies, such as the random walk in HetGNN, which may break the intrinsic graph structure and lose valuable structural information. 3) Empirical manual design: the performance of some methods relies severely on prior experience because of the requirement of specific domain knowledge, such as pre-defined meta-paths in HAN. 4) Insufficient representation ability: some methods cannot provide multi-level representation due to the flat model architecture. For example, HGT learns the interaction of nodes and relations in a single aggregation process, where it is hard to distinguish their importance in such a flat architecture.
TABLE 1
Comparison of several existing methods with the proposed model.

Models   Graph     Heterogeneous  Without Specific   Attentive    Convolutions on      Multi-level
         Topology  Properties     Domain Knowledge   Aggregation  Intrinsic Structure  Representation
MLP      ×         ×              ✓                  ×            ×                    ×
GCN      ✓         ×              ✓                  ×            ✓                    ×
GAT      ✓         ×              ✓                  ✓            ✓                    ×
RGCN     ✓         ✓              ✓                  ×            ✓                    ×
HAN      ✓         ✓              ×                  ✓            ✓                    ✓
HetGNN   ✓         ✓              ✓                  ✓            ×                    ✓
HGT      ✓         ✓              ✓                  ✓            ✓                    ×
HGConv   ✓         ✓              ✓                  ✓            ✓                    ✓

To cope with the above issues, we propose HGConv, a novel Heterogeneous Graph Convolution approach, to learn node representations on heterogeneous graphs with a hybrid micro/macro level convolutional operation. Specifically, for a focal node: in the micro-level convolution, the transformation matrices and attention vectors are both specific to node types, aiming to learn the importance of nodes within the same relation; in the macro-level convolution, transformation matrices specific to relation types and a weight-sharing attention vector are employed to distinguish the subtle difference across different relations. Due to the hybrid micro/macro level convolution, HGConv could fully utilize the heterogeneous information of nodes and relations with proper interpretability. Moreover, a weighted residual connection component is designed to obtain the optimal fusion of the focal node's inherent attributes and neighbor information. Based on the aforementioned components, our approach could be optimized in an end-to-end manner. A comparison of several existing methods with our model is shown in Table 1.

To sum up, the contributions of our work are as follows:
• A novel heterogeneous graph convolution approach is proposed to directly perform convolutions on the intrinsic heterogeneous graph structure with a hybrid micro/macro level convolutional operation, where the micro convolution encodes the attributes of different types of nodes and the macro convolution computes on different relations respectively.
• A residual connection component with weighted combination is designed to aggregate the focal node's inherent attributes and neighbor information adaptively, which could provide comprehensive node representations.
• A systematic analysis of existing heterogeneous graph learning methods is given, and we point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.

The rest of this paper is organized as follows: Section 2 reviews previous work related to the studied problem. Section 3 introduces the studied problem. Section 4 presents the framework and each component of the proposed model. Section 5 evaluates the proposed model by experiments. Section 6 concludes the entire paper.
2 RELATED WORK
This section reviews existing literature related to our work and points out the differences from our work.
Graph Mining. Over the past decades, a great amount of research effort has been devoted to graph mining. Classical methods based on manifold learning, including Locally Linear Embedding (LLE) [18] and Laplacian Eigenmaps (LE) [19], mainly focus on the reconstruction of graphs. Inspired by the language model Skip-gram [20], more advanced methods were proposed to learn representations of nodes, such as DeepWalk [21] and Node2Vec [22]. These methods adopt the random walk strategy to generate sequences of nodes and use Skip-gram to maximize the node co-occurrence probability in the same sequence.

However, all of the above methods only focus on the graph topology and cannot take node attributes into consideration, resulting in inferior performance. These methods are surpassed by recently proposed GNNs, which consider both node attributes and graph structure simultaneously.
Graph Neural Networks. Recent years have witnessed the success of GNNs in various tasks, such as node classification [10], [11], link prediction [23] and graph classification [24]. GNNs consider both graph structure and node attributes by first propagating information between each node and its neighbors, and then providing node representations based on the received information. Generally, GNNs could be divided into spectral-based methods and spatial-based methods. As a spectral-based method, Spectral CNN [25] performs convolution in the Fourier domain by computing the eigendecomposition of the graph Laplacian matrix. ChebNet [26] leverages the K-order Chebyshev polynomials to eliminate the need to calculate the eigenvectors of the Laplacian matrix. GCN [10] introduces a localized first-order approximation of ChebNet to alleviate the overfitting problem. Representative spatial-based methods include GraphSAGE [11] and GAT [12]. [11] proposed GraphSAGE to propagate information in the graph domain directly and designed different functions to aggregate the received information. [12] presented GAT by introducing the attention mechanism into GNNs, which enables GAT to select more important neighbors adaptively. We refer the interested readers to [27], [28] for more comprehensive reviews on GNNs.

However, all the above methods were designed for homogeneous graphs, and could not handle the rich information in heterogeneous graphs. In this work, we aim to propose an approach to learn on heterogeneous graphs.
Heterogeneous Graph Neural Networks. Heterogeneous graphs contain abundant information of various types of nodes and relations. Mining useful information in heterogeneous graphs is essential in practical scenarios. Recently, several graph convolution methods have been proposed for learning on heterogeneous graphs. [13] presented RGCN to learn on knowledge graphs by employing specialized transformation matrices for each type of relation. [14] designed HAN by extending the attention mechanism in GAT [12] to learn the importance of neighbors and multiple hand-designed meta-paths. [29] considered the intermediate nodes in meta-paths, which are ignored in HAN, and proposed MAGNN to aggregate the intra-meta-path and inter-meta-path information. HetGNN [15] first samples neighbors based on the random walk strategy and then uses specialized Bi-LSTMs to integrate the heterogeneous node attributes and neighbors. [16] proposed HGT to introduce type-specific transformation matrices and learn the importance of different nodes and relations based on the Transformer [17] architecture.

Nevertheless, there are still some limitations in the above methods, including the insufficient utilization of heterogeneous properties, structural information loss, and lack of interpretability. In this paper, we aim to cope with the issues in existing approaches and design a method to learn comprehensive node representations on heterogeneous graphs by leveraging both node attributes and relation information.
3 PROBLEM FORMALIZATION
This section introduces related concepts and the studiedproblem in this paper.
Definition 1. Heterogeneous Graph: A heterogeneous graph is defined as a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{R})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes and the set of edges respectively. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ are associated with their type mapping functions $\phi(v): \mathcal{V} \rightarrow \mathcal{A}$ and $\varphi(e): \mathcal{E} \rightarrow \mathcal{R}$, with the constraint of $|\mathcal{A}| + |\mathcal{R}| > 2$.

Definition 2. Relation: A relation represents the interaction schema of the source node, the target node and the connecting edge. Formally, for an edge $e = (u, v)$ with source node $u$ and target node $v$, the corresponding relation $R \in \mathcal{R}$ is denoted as $\langle \phi(u), \varphi(e), \phi(v) \rangle$. The inverse of $R$ is naturally represented by $R^{-1}$, and we consider the inverse relation to propagate information between the two nodes in both directions. Thus, the set of edges is extended as $\mathcal{E} \cup \mathcal{E}^{-1}$ and the set of relations is extended as $\mathcal{R} \cup \mathcal{R}^{-1}$. Note that the meta-paths used in heterogeneous graph learning approaches [14], [29] are defined as sequences of relations.

Definition 3. Heterogeneous Graph Representation Learning: Given a heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{R})$, where nodes with type $A \in \mathcal{A}$ are associated with the attribute matrix $X_A \in \mathbb{R}^{|\mathcal{V}_A| \times d_A}$, the task of heterogeneous graph representation learning is to obtain the $d$-dimensional representation $h_v \in \mathbb{R}^d$ for each $v \in \mathcal{V}$, where $d \ll |\mathcal{V}|$. The learned representations are able to capture both node attributes and relation information, and could be applied in various tasks, such as node classification, node clustering and node visualization.
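To make Definitions 1 and 2 concrete, the snippet below builds a toy academic heterogeneous graph with DGL, the library our implementation is based on (see Section 5). The node counts, edges and attribute dimensions are made up for illustration; each relation $\langle \phi(u), \varphi(e), \phi(v) \rangle$ is stored under its own edge type, together with its inverse relation so that information can propagate in both directions.

```python
import dgl
import torch

# A toy academic graph with A = {paper, author, term}. Every relation is kept
# as a (source type, edge type, target type) triple, plus its inverse.
graph = dgl.heterograph({
    ('author', 'writes', 'paper'): (torch.tensor([0, 1, 1]), torch.tensor([0, 0, 1])),
    ('paper', 'written-by', 'author'): (torch.tensor([0, 0, 1]), torch.tensor([0, 1, 1])),
    ('term', 'appears-in', 'paper'): (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ('paper', 'contains', 'term'): (torch.tensor([0, 1]), torch.tensor([0, 1])),
})

# Type-specific attribute matrices X_A, each with its own dimension d_A.
graph.nodes['paper'].data['x'] = torch.randn(graph.num_nodes('paper'), 32)
graph.nodes['author'].data['x'] = torch.randn(graph.num_nodes('author'), 16)
graph.nodes['term'].data['x'] = torch.randn(graph.num_nodes('term'), 8)
```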
4 METHODOLOGY

This section presents the framework of the proposed model and then introduces each of its components step by step.
The framework of the proposed model is shown in Figure 1, which takes the node attribute matrices $X_A$ for $A \in \mathcal{A}$ in a heterogeneous graph as the input and provides the low-dimensional node representation $h_v$ for each $v \in \mathcal{V}$ as the output, which could be applied in various tasks.

Fig. 1. Framework of the proposed model.
The proposed model is made up of multiple heterogeneous graph convolutional layers, where each layer consists of the hybrid micro/macro level convolution and the weighted residual connection component. Different from [14], which performs convolution on converted homogeneous graphs through meta-paths, the proposed hybrid convolution could calculate on the heterogeneous graph structure directly. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution is designed to discriminate the difference across different relations. The weighted residual connection component is employed to consider the different contributions of the focal node's inherent attributes and its neighbor information. By stacking multiple heterogeneous graph convolutional layers, the proposed model could consider the impacts of the focal node's directly connected and multi-hop reachable neighbors.

As pointed out in [14], the importance of nodes connected with the focal node within the same relation would be different. Hence, we first design a micro-level convolution to learn the importance of nodes within the same relation. We suppose that the attributes of nodes with different types might be distributed in different latent spaces. Therefore, we utilize transformation matrices and attention vectors, which are specific to node types, to capture the characteristics of different types of nodes in the micro-level convolution.
Formally, we denote the focal node $v$ as the target node with type $\phi(v) \in \mathcal{A}$ and its connected node $u$ as the source node with type $\phi(u) \in \mathcal{A}$. For a focal node $v$, let $N_R(v)$ denote the set of node $v$'s neighbors within the $R$-type relation, where for each $u \in N_R(v)$, $e = (u, v) \in \mathcal{E}$ and $R = \langle \phi(u), \varphi(e), \phi(v) \rangle \in \mathcal{R}$.

We first apply transformation matrices, which are specific to node types, to project nodes into their own latent spaces as follows,
$$z_v^l = W_{\phi(v)}^l h_v^{l-1}, \quad (1)$$
$$z_u^l = W_{\phi(u)}^l h_u^{l-1}, \quad (2)$$
where $W_{\phi(u)}^l$ denotes the trainable transformation matrix for node $u$ with type $\phi(u)$ at layer $l$, and $h_u^l$ and $z_u^l$ denote the original and transformed representations of node $u$ at layer $l$. Then we calculate the normalized importance of neighbor $u \in N_R(v)$ as follows,
$$e_{v,u}^{R,l} = \mathrm{LeakyReLU}\left( {a_{\phi(u)}^l}^\top \left[ z_v^l \, \Vert \, z_u^l \right] \right), \quad (3)$$
$$\alpha_{v,u}^{R,l} = \frac{\exp\left(e_{v,u}^{R,l}\right)}{\sum_{u' \in N_R(v)} \exp\left(e_{v,u'}^{R,l}\right)}, \quad (4)$$
where $a_{\phi(u)}^l$ is the trainable attention vector for the $\phi(u)$-type source node $u$ at layer $l$, $\Vert$ denotes the concatenation operation, and $\top$ denotes the transpose operation. $\alpha_{v,u}^{R,l}$ is the normalized importance of source node $u$ to focal node $v$ under relation $R$ at layer $l$. Then the representation of relation $R$ about focal node $v$ is calculated by,
$$c_{v,R}^l = \sigma\left( \sum_{u \in N_R(v)} \alpha_{v,u}^{R,l} \cdot z_u^l \right), \quad (5)$$
where $\sigma(\cdot)$ denotes the activation function (e.g., sigmoid, ReLU). An intuitive explanation of the micro-level convolution is shown in Figure 2(a). Embeddings of nodes within the same relation are aggregated through the attention vectors which are specific to node types. Since the attention weight $\alpha_{v,u}^{R,l}$ is computed for each relation, it could well capture the relation information.

In order to enhance the model capacity and make the training process more stable, we employ $K$ independent heads and then concatenate the representations as follows,
$$c_{v,R}^l = \big\Vert_{k=1}^{K} \sigma\left( \sum_{u \in N_R(v)} \left[\alpha_{v,u}^{R,l}\right]_k \cdot \left[z_u^l\right]_k \right), \quad (6)$$
where $[\alpha_{v,u}^{R,l}]_k$ denotes the importance of source node $u$ to focal node $v$ under relation $R$ of head $k$ at layer $l$, and $[z_u^l]_k$ stands for source node $u$'s transformed representation of head $k$ at layer $l$.
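To illustrate the micro-level convolution, the following is a minimal single-head PyTorch sketch of Equations (1)-(5) for one relation; the module and tensor names are ours, not from the released implementation, and it deliberately omits the multi-head concatenation of Equation (6).

```python
import torch
import torch.nn as nn

class MicroLevelConv(nn.Module):
    """Single-head micro-level convolution for one relation R, Eqs. (1)-(5)."""
    def __init__(self, dim_focal: int, dim_src: int, dim_out: int):
        super().__init__()
        self.W_focal = nn.Linear(dim_focal, dim_out, bias=False)  # W^l_{phi(v)}, Eq. (1)
        self.W_src = nn.Linear(dim_src, dim_out, bias=False)      # W^l_{phi(u)}, Eq. (2)
        self.a_src = nn.Linear(2 * dim_out, 1, bias=False)        # attention vector a^l_{phi(u)}
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h_focal: torch.Tensor, h_src: torch.Tensor) -> torch.Tensor:
        # h_focal: (dim_focal,) previous-layer representation of the focal node v
        # h_src: (n, dim_src) representations of v's neighbors within relation R
        z_v = self.W_focal(h_focal)                                # Eq. (1)
        z_u = self.W_src(h_src)                                    # Eq. (2)
        cat = torch.cat([z_v.expand_as(z_u), z_u], dim=-1)
        e = self.leaky_relu(self.a_src(cat)).squeeze(-1)           # Eq. (3)
        alpha = torch.softmax(e, dim=0)                            # Eq. (4)
        return torch.relu((alpha.unsqueeze(-1) * z_u).sum(dim=0))  # Eq. (5), with sigma = ReLU
```

In the full model, K such heads run in parallel and their outputs are concatenated as in Equation (6).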
Besides considering the importance of nodes within the same relation, a focal node would also interact with multiple relations, which indicates the necessity of learning the subtle difference across different relations. Therefore, we design a macro-level convolution with transformation matrices specific to relation types and a weight-sharing attention vector to distinguish the difference of relations.

Fig. 2. Explanation of the hybrid micro/macro level convolution.

Specifically, we first transform the focal node and its connecting relations into their distinct distributed spaces by,
$$h_v^{l'} = U_{\phi(v)}^l h_v^{l-1}, \quad (7)$$
$$c_{v,R}^{l'} = M_R^l c_{v,R}^l, \quad (8)$$
where $U_{\phi(v)}^l$ and $M_R^l$ denote the transformation matrices for the $\phi(v)$-type focal node $v$ and the $R$-type relation at layer $l$ respectively. Then the normalized importance of relation $R \in \mathcal{R}(v)$ to focal node $v$ is calculated by,
$$s_{v,R}^l = \mathrm{LeakyReLU}\left( {\mu^l}^\top \left[ h_v^{l'} \, \Vert \, c_{v,R}^{l'} \right] \right), \quad (9)$$
$$\beta_{v,R}^l = \frac{\exp\left(s_{v,R}^l\right)}{\sum_{R' \in \mathcal{R}(v)} \exp\left(s_{v,R'}^l\right)}, \quad (10)$$
where $\mathcal{R}(v)$ denotes the set of relations connected to focal node $v$, $\mu^l$ is the trainable attention vector shared by different relations at layer $l$, and $\beta_{v,R}^l$ is the normalized importance of relation $R$ to focal node $v$ at layer $l$. After obtaining the importance of different relations, we aggregate the relations as follows,
$$\widetilde{h}_v^l = \sum_{R \in \mathcal{R}(v)} \beta_{v,R}^l \cdot c_{v,R}^{l'}, \quad (11)$$
where $\widetilde{h}_v^l$ is the fused representation of the relations connected to focal node $v$ at layer $l$. An explanation of the macro-level convolution is shown in Figure 2(b). Representations of different relations are aggregated into a compact vector through the attention mechanism. Through the macro-level convolution, the different importance of relations could be calculated automatically.

We also extend Equation (11) to multi-head attention by,
$$\widetilde{h}_v^l = \big\Vert_{k=1}^{K} \sum_{R \in \mathcal{R}(v)} \left[\beta_{v,R}^l\right]_k \cdot \left[c_{v,R}^{l'}\right]_k, \quad (12)$$
where $[\beta_{v,R}^l]_k$ is the importance of relation $R$ to focal node $v$ of head $k$ at layer $l$, and $[c_{v,R}^{l'}]_k$ denotes the transformed representation of relation $R$ for focal node $v$ of head $k$ at layer $l$.

It is worth noting that the attention vectors in the micro-level convolution are specific to node types, while in the macro-level convolution, the attention vector is shared by different relations and is unaware of relation types. Such a design is based on the following reasons. 1) When performing the micro-level convolution, nodes are associated with distinct attributes even when they are within the same relation. An attention vector unaware of node types would have insufficient representation ability to handle nodes' different attributes and types. Hence, attention vectors specific to node types are designed in the micro-level convolution. 2) In the macro-level convolution, each relation connected to the focal node is associated with a single representation, and we only need to consider the difference of relation types. Therefore, the weight-sharing attention vector across different relations is designed. Following the above design, we could not only maintain the distinct characteristics of nodes and relations, but also reduce the model parameters.
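Analogously, a minimal single-head sketch of the macro-level convolution in Equations (7)-(11) is given below; again the class and variable names are ours, and relations are keyed by strings for simplicity.

```python
import torch
import torch.nn as nn

class MacroLevelConv(nn.Module):
    """Single-head macro-level convolution across relations, Eqs. (7)-(11)."""
    def __init__(self, dim_in: int, dim_out: int, relation_names):
        super().__init__()
        self.U_focal = nn.Linear(dim_in, dim_out, bias=False)   # U^l_{phi(v)}, Eq. (7)
        self.M = nn.ModuleDict({R: nn.Linear(dim_out, dim_out, bias=False)
                                for R in relation_names})       # M^l_R, Eq. (8)
        self.mu = nn.Linear(2 * dim_out, 1, bias=False)         # shared attention vector mu^l
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h_focal: torch.Tensor, c_by_relation: dict) -> torch.Tensor:
        # h_focal: (dim_in,) focal node representation from the previous layer
        # c_by_relation: {relation name R: c^l_{v,R} of shape (dim_out,)} from the micro level
        h_prime = self.U_focal(h_focal)                                          # Eq. (7)
        c_prime = torch.stack([self.M[R](c) for R, c in c_by_relation.items()])  # Eq. (8)
        cat = torch.cat([h_prime.expand_as(c_prime), c_prime], dim=-1)
        s = self.leaky_relu(self.mu(cat)).squeeze(-1)                            # Eq. (9)
        beta = torch.softmax(s, dim=0)                                           # Eq. (10)
        return (beta.unsqueeze(-1) * c_prime).sum(dim=0)                         # Eq. (11)
```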
In addition to aggregating neighbor information by the hybrid micro/macro level convolution, the attributes of the focal node are also supposed to be important, since they reflect the inherent characteristics directly. However, simply adding the focal node's inherent attributes and neighbor information together could not distinguish their different importance. Thus, we adapt the residual connection [30] with a trainable weight parameter to combine the focal node's inherent attributes and neighbor information by,
$$h_v^l = \lambda_{\phi(v)}^l \cdot W_{\phi(v),o}^l h_v^{l-1} + \left(1 - \lambda_{\phi(v)}^l\right) \cdot \widetilde{h}_v^l, \quad (13)$$
where $\lambda_{\phi(v)}^l$ is the weight to control the importance of focal node $v$'s inherent attributes and its neighbor information at layer $l$, and $W_{\phi(v),o}^l$ is utilized to align the dimensions of focal node $v$'s attributes and its neighbor information at layer $l$.

From another perspective, the weighted residual connection could be treated as the gated updating mechanism in the Gated Recurrent Unit (GRU) [31], where the employed update gates are specific to the focal node type and carry different weights in different layers.

We stack $L$ heterogeneous graph convolutional layers to build HGConv. For the first layer, we set $h_v^0$ to node $v$'s corresponding row in the attribute matrix $X_{\phi(v)}$ as the input. The final node representation $h_v$ is set to the output of the last layer $h_v^L$ for each $v \in \mathcal{V}$.

HGConv could be trained in an end-to-end manner with the following strategies. 1) Semi-supervised learning strategy: for tasks where labels are available, we could optimize the model by minimizing the cross-entropy loss,
$$L = -\sum_{v \in \mathcal{V}_{label}} \sum_{c=1}^{C} y_{v,c} \cdot \log \hat{y}_{v,c}, \quad (14)$$
where $\mathcal{V}_{label}$ is the set of nodes with labels, and $y_{v,c}$ and $\hat{y}_{v,c}$ denote the ground truth and the predicted probability of node $v$ at the $c$-th dimension. In practice, $\hat{y}_v$ could be obtained from a classifier (e.g., SVM, single-layer neural network) which takes node $v$'s representation $h_v$ as the input. 2) Unsupervised learning strategy: for tasks without any labels, we could optimize the model by minimizing the objective function in Skip-gram [32] with negative sampling,
$$L = -\sum_{(v,u) \in S_P} \log \sigma\left(h_v^\top h_u\right) - \sum_{(v',u') \in S_N} \log \sigma\left(-h_{v'}^\top h_{u'}\right), \quad (15)$$
where $\sigma(\cdot)$ is the sigmoid activation function, and $S_P$ and $S_N$ denote the sets of positive observed node pairs and negative sampled node pairs respectively. 3) Joint learning strategy: we could also combine the semi-supervised and unsupervised learning strategies to jointly optimize the model.
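A sketch of the weighted residual connection of Equation (13) and the semi-supervised objective of Equation (14) follows; squashing the learnable weight through a sigmoid to keep it in (0, 1) is our own choice, not something the paper specifies, and all names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedResidual(nn.Module):
    """Weighted residual connection of Eq. (13) for one node type phi(v)."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.W_o = nn.Linear(dim_in, dim_out, bias=False)  # W^l_{phi(v),o}, aligns dimensions
        self.lam = nn.Parameter(torch.zeros(1))            # lambda^l_{phi(v)}, trainable

    def forward(self, h_prev: torch.Tensor, h_neigh: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam)  # assumption: constrain the weight to (0, 1)
        return lam * self.W_o(h_prev) + (1.0 - lam) * h_neigh

def semi_supervised_loss(logits: torch.Tensor, labels: torch.Tensor,
                         labeled_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss of Eq. (14), computed only over the labeled nodes."""
    return F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
```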
Here we give a systematic analysis of existing heterogeneous graph learning models and point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.

Overview of Homogeneous GNNs. Let us start with the introduction of homogeneous GNNs. Generally, the operations at the $l$-th layer of a homogeneous GNN follow a two-step strategy:
$$\widetilde{h}_v^l = \mathrm{AGGREGATE}^l\left(\left\{ h_u^{l-1} : u \in N(v) \right\}\right), \quad (16)$$
$$h_v^l = \mathrm{COMBINE}^l\left(h_v^{l-1}, \widetilde{h}_v^l\right), \quad (17)$$
where $h_v^l$ denotes the representation of node $v$ at the $l$-th layer, $h_v^0$ is initialized with node $v$'s original attribute $x_v$, and $N(v)$ denotes the set of node $v$'s neighbors. $\mathrm{AGGREGATE}^l(\cdot)$ stands for the aggregation of node $v$'s neighbors, and $\mathrm{COMBINE}^l(\cdot)$ is the combination of node $v$'s inherent attribute and its neighbor information at layer $l$. Different architectures for AGGREGATE and COMBINE have been proposed in recent years. For example, GCN [10] utilizes the normalized adjacency matrix for AGGREGATE and uses the residual connection for COMBINE. GraphSAGE [11] designs various pooling operations for AGGREGATE and leverages the concatenation for COMBINE.

Overview of Heterogeneous GNNs. The operations in heterogeneous GNNs are based on the operations in homogeneous GNNs, with additional consideration of node attributes and relation information. Formally, the operations at the $l$-th layer could be summarized as follows:
$$z_u^l = \mathrm{TRANSFORM}_{\phi(u)}^l\left(h_u^{l-1}\right), \quad \forall u \in \mathcal{V}, \quad (18)$$
$$c_{v,R}^l = \mathrm{AGGREGATE}_R^l\left(\left\{ z_u^l : u \in N_R(v) \right\}\right), \quad (19)$$
$$\widetilde{h}_v^l = \mathrm{AGGREGATE}^l\left(\left\{ c_{v,R}^l : R \in \mathcal{R}(v) \right\}\right), \quad (20)$$
$$h_v^l = \mathrm{COMBINE}^l\left(h_v^{l-1}, \widetilde{h}_v^l\right), \quad (21)$$
where $N_R(v)$ denotes the set of node $v$'s neighbors within the $R$-type relation and $\mathcal{R}(v)$ is defined as the set of relations connected to node $v$.
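The four-step template in Equations (18)-(21) can be instantiated in many ways; the toy, dependency-free sketch below uses the identity for TRANSFORM and unweighted means for both AGGREGATE steps and for COMBINE, purely to make the data flow explicit. The function and argument names are ours.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def hetero_gnn_layer(h_prev, neighbors_by_relation):
    # h_prev: {node id: representation vector from the previous layer}
    # neighbors_by_relation: {focal node v: {relation R: [neighbor ids]}}
    h = {}
    for v, rels in neighbors_by_relation.items():
        z = {u: h_prev[u] for nbrs in rels.values() for u in nbrs}      # Eq. (18), identity TRANSFORM
        c = {R: mean([z[u] for u in nbrs]) for R, nbrs in rels.items()} # Eq. (19), per-relation AGGREGATE
        h_neigh = mean(list(c.values()))                                # Eq. (20), cross-relation AGGREGATE
        h[v] = mean([h_prev[v], h_neigh])                               # Eq. (21), COMBINE
    return h
```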
Compared with homogeneous GNNs, heterogeneous GNNs first design specialized transformation matrices for different types of nodes in TRANSFORM. Then the operations in AGGREGATE are divided into aggregation within the same relation and aggregation across different relations. Finally, the operation in COMBINE is defined in the same way as Equation (17) in homogeneous GNNs.
Analysis of the Proposed HGConv. The proposed HGConv makes a delicate design for each operation in the aforementioned heterogeneous GNNs. Specifically, Equation (18) - Equation (21) could be rewritten as:
$$z_u^l = W_{\phi(u)}^l h_u^{l-1}, \quad \forall u \in \mathcal{V}, \quad (22)$$
$$c_{v,R}^l = \sum_{u \in N_R(v)} \alpha_{v,u}^{R,l} \cdot z_u^l, \quad (23)$$
$$\widetilde{h}_v^l = \sum_{R \in \mathcal{R}(v)} \beta_{v,R}^l \cdot c_{v,R}^l, \quad (24)$$
$$h_v^l = \lambda_{\phi(v)}^l \cdot h_v^{l-1} + \left(1 - \lambda_{\phi(v)}^l\right) \cdot \widetilde{h}_v^l, \quad (25)$$
where $W_{\phi(u)}^l$ is the transformation matrix specific to node $u$'s type, $\alpha_{v,u}^{R,l}$ and $\beta_{v,R}^l$ are the importance learned by the attention mechanism in the micro-level and macro-level convolutions respectively, and $\lambda_{\phi(v)}^l$ is the trainable parameter to balance the importance of the focal node's inherent attributes and its neighbor information.

1. Note that we omit the activation functions and the transformation matrices for graph convolution or dimension alignment for simplicity.

Connection with RGCN. RGCN [13] ignores the distinct attributes of nodes with various types and assigns the importance of neighbors within the same relation based on pre-defined constants. RGCN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with the identity function $I(\cdot)$, which means the different attribute distributions of nodes with various types are not considered; 2) Replace the trainable $\alpha_{v,u}^{R,l}$ in Equation (23) with a pre-defined constant, which is calculated from the degree of each node; 3) Set $\beta_{v,R}^l$ in Equation (24) to 1.0, which stands for simple sum pooling; 4) Set $\lambda_{\phi(v)}^l$ in Equation (25) to 0.5, which means equal contribution of node inherent attributes and neighbor information. Note that the sum pooling operation in RGCN could not distinguish the importance of nodes and relations effectively.

Connection with HAN. HAN [14] leverages multiple symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs. Therefore, node $v$'s neighbors are defined by the given set of meta-paths $\Phi$. HAN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with the identity function $I(\cdot)$, as each converted graph only contains nodes with a single type; 2) Define the set of node $v$'s neighbors in Equation (23) by the meta-paths $\Phi$, that is, for each meta-path $\Phi_i$, the set of node $v$'s neighbors is denoted as $N_{\Phi_i}(v)$, and then learn the importance of neighbors generated by the same meta-path through the attention mechanism; 3) Replace the aggregation of different relations in Equation (24) with the aggregation of the multiple meta-paths $\Phi$, and learn the importance of different meta-paths using the attention mechanism; 4) Set $\lambda_{\phi(v)}^l$ in Equation (25) to 0.0, which means using the neighbor information directly. Note that the converted graphs are homogeneous, and the attributes of nodes with different types are ignored in HAN, leading to inferior performance.

Connection with HetGNN. HetGNN [15] leverages the random walk strategy to sample neighbors and then uses Bi-LSTMs to integrate node attributes and neighbors. Therefore, node $v$'s neighbors are generated by random walk $RW$, which could be denoted as $N_{RW}(v)$. HetGNN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with Bi-LSTMs to aggregate the attributes of nodes with various types; 2) Define the set of node $v$'s neighbors in Equation (23) by random walk $RW$ and group the neighbors by node types, that is, for each node type $t$, the set of node $v$'s neighbors is denoted as $N_{RW,t}(v)$, and then learn the importance of neighbors with the same node type through Bi-LSTMs; 3) Replace the aggregation of different relations in Equation (24) with the aggregation of different node types, and learn the importance of different node types using the attention mechanism; 4) Set $\lambda_{\phi(v)}^l$ in Equation (25) to be trainable, which is incorporated in the attention mechanism of the previous step in [15]. Note that the random walk $RW$ in HetGNN may break the intrinsic graph structure and result in structural information loss.

Connection with HGT. HGT [16] learns the importance of different nodes and relations based on the Transformer architecture by designing type-specific transformation matrices. HGT focuses on the study of each relation (a.k.a. meta relation in [16]); hence, the importance of a source node to a target node is calculated based on both node representations as well as their connecting relation in a single aggregation process. HGT could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with the linear projections that are specific to the source node type and the target node type respectively to obtain the Key and Query vectors; 2) Fuse the aggregation processes in Equation (23) and Equation (24) into a single aggregation process, where the importance of the source node to the target node is learned from the Key and Query vectors, as well as the relation transformation matrices specific to their connecting relation type; 3) Set $\lambda_{\phi(v)}^l$ in Equation (25) to 0.5, which means node inherent attributes and neighbor information contribute equally to the final node representation. Note that the single aggregation process in HGT leads to a flat architecture, making it hard to distinguish the importance of nodes and relations separately.

5 EXPERIMENTS
This section presents the experimental results on real-world datasets together with detailed analysis.
We conduct experiments on three real-world datasets.
• ACM-3: Following [14], we extract a subset of ACM from AMiner [33], which contains papers published in three areas: Data Mining (KDD, ICDM), Database (VLDB, SIGMOD) and Wireless Communication (SIGCOMM, MobiCOMM). Finally we construct a heterogeneous graph containing papers (P), authors (A) and terms (T).
• ACM-5: We also extract a larger subset of ACM from AMiner, which includes papers published in five areas: Data Mining (KDD, ICDM, WSDM, CIKM), Database (VLDB, ICDE), Artificial Intelligence (AAAI, IJCAI), Computer Vision (CVPR, ECCV) and Natural Language Processing (ACL, EMNLP, NAACL).
• IMDB: We extract a subset of IMDB and construct a heterogeneous graph containing movies (M), directors (D) and actors (A). The movies are divided into three classes: Action, Comedy, Drama.

For ACM-3 and ACM-5, we use TF-IDF [34] to extract keywords of the abstracts and titles of papers. Paper attributes are the bag-of-words representation of abstracts. Author attributes are the average representation of their published papers. Term attributes are represented as the one-hot encoding of the title keywords. For IMDB, movie attributes are the bag-of-words representation of plot keywords. Director/actor attributes are the average representation of their directed/acted movies.

Details of the datasets are summarized in Table 2.
TABLE 2
Statistics of the datasets.

Dataset | Node | Relation | Attribute | Data Split
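As a rough illustration of the attribute construction described above, the following scikit-learn sketch builds bag-of-words paper attributes, a one-hot term vocabulary from title keywords, and averaged author attributes; the two example abstracts, titles, and the authorship map are made up, and this is only an approximation of the pipeline under our own assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical inputs: paper abstracts and titles.
abstracts = ["ranking based clustering of heterogeneous information networks",
             "semi supervised classification with graph convolutional networks"]
titles = ["ranking clustering heterogeneous networks", "graph convolutional networks"]

# Paper attributes: bag-of-words representation of abstracts.
paper_x = CountVectorizer(binary=True).fit_transform(abstracts).toarray()

# Term attributes: one-hot encodings over the TF-IDF keyword vocabulary of the titles.
vocabulary = TfidfVectorizer().fit(titles).vocabulary_
term_x = np.eye(len(vocabulary))

# Author attributes: average representation of each author's published papers
# (author 0 wrote both toy papers here).
author_papers = {0: [0, 1]}
author_x = np.stack([paper_x[ids].mean(axis=0) for ids in author_papers.values()])
```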
We compare our method with the following baselines:
• MLP: MLP ignores the graph structure and solely focuses on the focal node attributes by leveraging the multilayer perceptron.
• GCN: GCN performs graph convolutions in the Fourier domain by leveraging the localized first-order approximation [10].
• GAT: GAT introduces the attention mechanism into GNNs and assigns different importance to the neighbors adaptively [12].
• RGCN: RGCN designs specialized transformation matrices for each type of relation in the modelling of knowledge graphs [13].
• HAN: HAN leverages the attention mechanism to aggregate neighbor information via multiple manually designed meta-paths [14].
• HetGNN: HetGNN considers the heterogeneity of node attributes and neighbors, and then utilizes Bi-LSTMs to integrate heterogeneous information [15].
• HGT: HGT introduces type-specific transformation matrices to capture the characteristics of different nodes and relations with the Transformer architecture [16].

3. https://data.world/data-society/imdb-5000-movie-dataset
As some methods require meta-paths, we use PAP, PTP and PPP as meta-paths for ACM-3 and ACM-5, and choose MDM and MAM as meta-paths for IMDB. Following [14], we test GCN and GAT on the homogeneous graph generated by each meta-path and report the best performance among the meta-paths (experiments show that the best meta-paths on ACM-3, ACM-5 and IMDB are PAP, PAP and MDM respectively). All the meta-paths are directly fed into HAN. Adam [35] is selected as the optimizer. Dropout [36] is utilized to prevent over-fitting. Grid search is used to select the best hyperparameters, including the dropout rate and the learning rate. The dimension of the node representation is set to 64. We train all the methods for a fixed 300 epochs and use the early stopping strategy with a patience of 100, which means the training process is terminated when the evaluation metrics on the validation set are not improved for 100 consecutive epochs.

For HGConv, the numbers of attention heads in the micro/macro level convolutions are both set to 8, and the dimension of each head's attention vector is set to 8. We build HGConv with two layers, since two layers could achieve satisfactory performance and stacking more layers does not improve the performance significantly. The proposed HGConv is implemented with PyTorch [37] and Deep Graph Library (DGL) [38]. Experiments are conducted on an Ubuntu machine equipped with two Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz processors with 8 physical cores each, and an NVIDIA TITAN Xp GPU with 12 GB of GDDR5X memory running at over 11 Gbps.

We conduct experiments to make comparisons on the node classification task. Following [14], we split the datasets into training, validation and testing sets with the ratio of 2:1:7. The ratio of training data is varied in [20%, 40%, 60%, 80%]. To make a comprehensive comparison, we additionally use 5-fold cross-validation and report the average classification results. For ACM-3 and ACM-5, we aim to predict the area of papers. For IMDB, the goal is to predict the class of movies. Macro-F1 and Micro-F1 are adopted as evaluation metrics. Experimental results are shown in Table 3. By analyzing the results, some conclusions could be summarized.

Firstly, the performance of all the methods improves with the increase of training data, which proves that feeding more training data helps deep learning methods learn more complicated patterns and achieve better results.
TABLE 3
Experimental results on the node classification task.

Data    Metrics    Training  MLP     GCN     GAT     RGCN    HAN     HetGNN  HGT     HGConv
ACM-3   Macro-F1   20%       0.6973  0.8955  0.8852  0.8981  0.8991  0.6727  0.8965  —
                   40%       0.7740  0.9012  0.8993  0.9191  0.9175  0.7736  0.9188  —
                   60%       0.8013  0.9032  0.9053  0.9262  0.9237  0.8060  0.9264  —
                   80%       0.8249  0.9068  0.9063  0.9267  0.9268  0.8242  —       —
        Micro-F1   20%       —       —       —       —       —       —       —       —
                   40%       0.7710  0.8923  0.8903  0.9124  0.9103  0.7709  0.9117  —
                   60%       0.7966  0.8948  0.8968  0.9201  0.9172  0.8016  0.9203  —
                   80%       0.8205  0.8989  0.8981  0.9202  0.9205  0.8190  —       —
ACM-5   Macro-F1   20%       —       —       —       —       —       —       —       —
                   40%       0.6585  0.8317  0.8367  0.8368  0.8404  0.6476  0.8428  —
                   60%       0.7252  0.8440  0.8441  0.8630  0.8526  0.7133  0.8573  —
                   80%       0.7503  0.8448  0.8459  0.8699  0.8610  0.7445  0.8692  —
        Micro-F1   20%       0.6469  0.8364  0.8388  0.8333  0.8334  0.6420  0.8286  —
                   40%       0.6887  0.8433  0.8475  0.8501  0.8525  0.6872  0.8573  —
                   60%       0.7354  0.8545  0.8544  0.8722  0.8626  0.7248  0.8668  —
                   80%       0.7642  0.8554  0.8562  0.8809  0.8715  0.7592  0.8780  —
IMDB    Macro-F1   20%       0.4506  0.5003  0.4998  0.5124  0.5118  0.4281  0.5171  —
                   40%       0.4870  0.5338  0.5350  0.5578  0.5645  0.4865  0.5577  —
                   60%       0.5188  0.5559  0.5640  0.5823  0.5912  0.5110  0.5781  —
                   80%       0.5268  0.5713  0.5698  0.5939  0.6092  0.5239  0.6018  —
        Micro-F1   20%       0.4598  0.5062  0.5072  0.5212  0.5263  0.4533  0.5210  —
                   40%       0.4874  0.5355  0.5378  0.5601  0.5723  0.4942  0.5605  —
                   60%       0.5186  0.5611  0.5669  0.5850  0.5968  0.5146  0.5792  —
                   80%       0.5269  0.5771  0.5757  0.5952  0.6129  0.5237  0.6020  —
TABLE 4
Experimental results on the node clustering task.

Data    Metrics  MLP     GCN     GAT     RGCN    HAN     HetGNN  HGT     HGConv  %Improv.
ACM-3   ARI      0.6105  0.7179  0.7319  0.7973  0.7732  0.6077  0.7944  —       —
Secondly, compared with MLP, the performance of the other methods is significantly improved by taking the graph structure into consideration in most cases, which indicates the power of graph neural networks in considering the information of both nodes and edges.

Thirdly, methods designed for heterogeneous graphs achieve better results than methods designed for homogeneous graphs (i.e., GCN and GAT) in most cases, which demonstrates the necessity of leveraging the properties of different nodes and relations in heterogeneous graphs.

Fourthly, although HetGNN is designed for heterogeneous graph learning, it only achieves competitive or even worse results than MLP. We attribute this phenomenon to the following two reasons: 1) there are several hyperparameters (e.g., the return possibility and length of the random walk, and the numbers of type-grouped neighbors) in HetGNN, making the model difficult to fine-tune; 2) the random walk strategy may break the intrinsic graph structure and lead to structural information loss, especially when the graph structure contains valuable information.

Finally, HGConv outperforms all the baselines consistently with the varying training data ratio in most cases. Compared with MLP, GCN and GAT, HGConv takes both the graph topology and graph heterogeneity into consideration. Compared with RGCN and HAN, HGConv utilizes the specific characteristics of different nodes and relations without the requirement of domain knowledge. Compared with HetGNN, HGConv leverages the intrinsic graph structure directly, which alleviates the structural information loss issue introduced by random walk. Compared with HGT, HGConv learns multi-level representations by the hybrid micro/macro level convolution, which provides HGConv with sufficient representation ability.
The node clustering task is conducted to evaluate the learned node representations. We first obtain the node representations via a feed-forward pass on the trained model and then feed the normalized node representations into the k-means algorithm. We set the number of clusters to the number of real classes for each dataset (i.e., 3, 5 and 3 for ACM-3, ACM-5 and IMDB respectively). We adopt ARI and NMI as evaluation metrics. Since the result of k-means tends to be affected by the initial centroids, we run k-means 10 times and report the average results in Table 4.
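The clustering evaluation protocol just described can be sketched with scikit-learn as follows; the function and argument names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import normalize

def evaluate_clustering(embeddings, labels, num_classes, runs=10, seed=0):
    """L2-normalize node representations, run k-means several times with
    different initial centroids, and report the average ARI and NMI."""
    x = normalize(embeddings)
    aris, nmis = [], []
    for r in range(runs):
        pred = KMeans(n_clusters=num_classes, random_state=seed + r).fit_predict(x)
        aris.append(adjusted_rand_score(labels, pred))
        nmis.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(aris)), float(np.mean(nmis))
```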
Experimental results on the node clustering task show that HGConv outperforms all the baselines, which demonstrates the effectiveness of the learned node representations. Moreover, methods based on GNNs usually obtain better results. We could also observe that methods achieving satisfactory results on the node classification task (e.g., RGCN, HAN and HGT) also perform satisfactorily on the node clustering task, which indicates that a good model could learn more universal node embeddings that are applicable to various tasks.

Fig. 3. Visualization of node representation on ACM-5. Each point indicates a paper and its color denotes the published area.
To make a more intuitive comparison, we also visualize the nodes of the heterogeneous graph in a low-dimensional space. In particular, we project the node representations learned by HGConv into a 2-dimensional space using t-SNE [39]. The visualization of the node representations on ACM-5 is shown in Figure 3, where the color of a node denotes its published area.

From Figure 3, we could observe that the baselines do not achieve satisfactory performance. They either fail to gather papers within the same area together, or could not provide clear boundaries between papers belonging to different areas. HGConv performs best in the visualization, as papers within the same area are closer and the boundaries between different areas are more obvious.
7. Please refer to the appendix for results on ACM-3 and IMDB.

We conduct an ablation study to validate the effect of each component in HGConv. We remove the micro-level convolution, the macro-level convolution and the weighted residual connection from HGConv respectively, and denote the three variants as HGConv w/o Micro, HGConv w/o Macro and HGConv w/o WRC. Detailed implementations of the three variants are as follows:
• HGConv w/o Micro: this variant replaces the micro-level convolution by performing simple average pooling on nodes within the same relation.
• HGConv w/o Macro: this variant replaces the macro-level convolution by performing simple average pooling across different relations.
• HGConv w/o WRC: this variant removes the weighted residual connection in each layer and only uses the aggregated neighbor information as the output of each layer.

Experimental results of the variants and HGConv on the node classification task are shown in Figure 4.
Fig. 4. Effects of the components in the proposed model.
From Figure 4, we could observe that HGConv achieves the best performance when it is equipped with all the components, and removing any component leads to worse results. The effects of the three components vary across datasets, but all of them contribute to the improvement in the final performance. In particular, the micro-level convolution enables HGConv to select more important nodes within the same relation, and the macro-level convolution helps HGConv distinguish the subtle difference across relations. The weighted residual connection provides HGConv with the ability to consider the different contributions of the focal node's inherent attributes and its neighbor information.

Fig. 5. Parameter sensitivity of the proposed model on IMDB.
We also investigate the sensitivity of several parameters in HGConv. We report the results of the node classification task under different parameter settings on IMDB, and the experimental results are shown in Figure 5.
Number of convolution layers. We build HGConv with different numbers of heterogeneous graph convolutional layers and report the results in Figure 5(a). It could be observed that with the increase of layers, the performance of HGConv rises at first and then starts to drop gradually. This indicates that stacking a suitable number of layers helps the model receive information from further neighbors, but too many layers would lead to the overfitting problem.

Number of attention heads. We validate the effect of the multi-head attention mechanism in the hybrid convolution by changing the number of attention heads. The results are shown in Figure 5(b). From the results, we could conclude that increasing the number of attention heads improves the model performance at first. When the number of attention heads is sufficient (e.g., equal to or greater than 4), the performance reaches the top and remains stable.

Dimension of node representation. We also change the dimension of the node representation and report the results in Figure 5(c). We could find that the performance of HGConv grows with the increase of the node representation dimension and achieves the best performance when the dimension is set between 64 and 256 (we select 64 as the final setting). The performance decreases as the dimension grows further because of the overfitting problem.

The principal components in HGConv are the micro-level convolution and the macro-level convolution. Thus, we provide a detailed interpretation to better understand the importance of nodes within the same relation and the difference across relations learned by the hybrid convolution. We first randomly select a sample from ACM-3 and then calculate the normalized attention scores from the last heterogeneous graph convolutional layer. The selected paper $P_v$ proposes an effective ranking-based clustering algorithm for heterogeneous information networks, and it is published in the Data Mining area. The visualization is shown in Figure 6.

Fig. 6. Visualization of the learned attention scores.
Interpretation of the micro-level convolution. It could be observed that in the AP relation, both Jiawei Han and Yizhou Sun have higher attention scores than Yintao Yu among the authors, since the first two authors contribute more to the academic research. In the TP relation, keywords that are more relevant to $P_v$ (i.e., clustering and ranking) have higher attention scores. Moreover, the scores of references that study topics more relevant to $P_v$ are also higher in the PP relation. The above observations indicate that the micro-level convolution could select more important nodes within the same relation by assigning them higher attention scores.

Interpretation of the macro-level convolution. The attention score of the AP relation is much higher than that of the TP or PP relation, in line with the fact that GCN and GAT achieved the best performance on the PAP meta-path. This finding demonstrates that the macro-level convolution could distinguish the importance of different relations automatically without empirical manual design, and the learned importance could implicitly construct more important meta-paths for specific downstream tasks.
6 CONCLUSION
In this paper, we designed a hybrid micro/macro level convolution operation to address several fundamental problems in heterogeneous graph representation learning. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution attempts to distinguish the subtle difference across different relations. The hybrid strategy enables our model to fully leverage heterogeneous information with proper interpretability by performing convolutions on the intrinsic structure of heterogeneous graphs directly. We also designed a weighted residual connection component to obtain the optimal combination of the focal node's inherent attributes and neighbor information. Experimental results demonstrated not only the superiority of the proposed method, but also the intuitive interpretability of our approach for graph analysis.

ACKNOWLEDGMENTS
This work is supported by the National Key R&D Program of China [grant number 2018YFB2101003], the Science and Technology Major Project of Beijing [grant number Z191100002519012], and the National Natural Science Foundation of China [grant numbers 51778033, 51822802, 51991395, 71901011, U1811463].

REFERENCES

[1] Y. Sun and J. Han, "Mining heterogeneous information networks: A structural analysis approach," SIGKDD Explorations, vol. 14, no. 2, pp. 20–28, 2012.
[2] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, "A survey of heterogeneous information network analysis," IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 17–37, 2017.
[3] L. D. Santos, B. Piwowarski, L. Denoyer, and P. Gallinari, "Representation learning for classification in heterogeneous graphs with application to social networks," ACM Trans. Knowl. Discov. Data, vol. 12, no. 5, pp. 62:1–62:33, 2018.
[4] Y. Zhang, Y. Xiong, X. Kong, S. Li, J. Mi, and Y. Zhu, "Deep collective classification in heterogeneous information networks," in Proceedings of the 2018 World Wide Web Conference, WWW 2018, pp. 399–408.
[5] Y. Sun, C. C. Aggarwal, and J. Han, "Relation strength-aware clustering of heterogeneous information networks with incomplete attributes," Proc. VLDB Endow., vol. 5, no. 5, pp. 394–405, 2012.
[6] Y. Dong, J. Tang, S. Wu, J. Tian, N. V. Chawla, J. Rao, and H. Cao, "Link prediction and recommendation across heterogeneous social networks," in IEEE International Conference on Data Mining, ICDM 2012, pp. 181–190.
[7] X. Li, Y. Shang, Y. Cao, Y. Li, J. Tan, and Y. Liu, "Type-aware anchor link prediction across heterogeneous networks based on graph attention network," in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 147–155.
[8] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized entity recommendation: a heterogeneous information network approach," in Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, pp. 283–292.
[9] C. Shi, B. Hu, W. X. Zhao, and P. S. Yu, "Heterogeneous information network embedding for recommendation," IEEE Trans. Knowl. Data Eng., vol. 31, no. 2, pp. 357–370, 2019.
[10] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
[11] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[12] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in ICLR, 2018.
[13] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in The Semantic Web - 15th International Conference, ESWC 2018, pp. 593–607.
[14] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, "Heterogeneous graph attention network," in The World Wide Web Conference, WWW 2019, pp. 2022–2032.
[15] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, "Heterogeneous graph neural network," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, pp. 793–803.
[16] Z. Hu, Y. Dong, K. Wang, and Y. Sun, "Heterogeneous graph transformer," in WWW '20: The Web Conference 2020, pp. 2704–2710.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[18] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[19] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems 14, NIPS 2001, pp. 585–591.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[21] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[22] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[23] M. Zhang and Y. Chen, "Link prediction based on graph neural networks," in Advances in Neural Information Processing Systems 31, NeurIPS 2018, pp. 5171–5181.
[24] F. Errica, M. Podda, D. Bacciu, and A. Micheli, "A fair comparison of graph neural networks for graph classification," in ICLR, 2020.
[25] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in ICLR, 2014.
[26] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems 29, NIPS 2016, pp. 3837–3845.
[27] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, "Graph neural networks: A review of methods and applications," CoRR, vol. abs/1812.08434, 2018.
[28] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," CoRR, vol. abs/1901.00596, 2019.
[29] X. Fu, J. Zhang, Z. Meng, and I. King, "MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding," in WWW '20: The Web Conference 2020, pp. 2331–2341.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[31] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, NIPS 2013, pp. 3111–3119.
[33] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: Extraction and mining of academic social networks," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008, pp. 990–998.
[34] J. Ramos et al., "Using tf-idf to determine word relevance in document queries," in Proceedings of the First Instructional Conference on Machine Learning, vol. 242, 2003, pp. 133–142.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, NeurIPS 2019, pp. 8024–8035.
[38] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai et al., "Deep graph library: A graph-centric, highly-performant package for graph neural networks," arXiv preprint arXiv:1909.01315, 2019.
[39] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
Le Yu received the B.S. degree in Computer Science and Engineering from Beihang University, Beijing, China, in 2019. He is currently a second-year computer science Ph.D. student in the School of Computer Science and Engineering at Beihang University. His research interests include representation learning, graph neural networks and temporal data mining.

Leilei Sun is currently an assistant professor in the School of Computer Science, Beihang University, Beijing, China. He was a postdoctoral research fellow from 2017 to 2019 in the School of Economics and Management, Tsinghua University. He received his Ph.D. degree from the Institute of Systems Engineering, Dalian University of Technology, in 2017. His research interests include machine learning and data mining.

Bowen Du received the Ph.D. degree in Computer Science and Engineering from Beihang University, Beijing, China, in 2013. He is currently a Professor with the State Key Laboratory of Software Development Environment, Beihang University. His research interests include smart city technology, multi-source data fusion, and traffic data mining.

Chuanren Liu received the B.S. degree from the University of Science and Technology of China (USTC), the M.S. degree from the Beijing University of Aeronautics and Astronautics (BUAA), and the Ph.D. degree from Rutgers, the State University of New Jersey. He is currently an assistant professor with the Business Analytics and Statistics Department at the University of Tennessee, Knoxville, USA. His research interests include data mining and machine learning, and their applications in business analytics.

Weifeng Lv received the B.S. degree in Computer Science and Engineering from Shandong University, Jinan, China, and the Ph.D. degree in Computer Science and Engineering from Beihang University, Beijing, China, in 1992 and 1998, respectively. Currently, he is a Professor with the State Key Laboratory of Software Development Environment, Beihang University, Beijing, China. His research interests include smart city technology and mass data processing.

Hui Xiong is currently a Full Professor at Rutgers, the State University of New Jersey, where he received the 2018 Ram Charan Management Practice Award as the Grand Prix winner from the Harvard Business Review, the RBS Dean's Research Professorship (2016), the Rutgers University Board of Trustees Research Fellowship for Scholarly Excellence (2009), the ICDM Best Research Paper Award (2011), and the IEEE ICDM Outstanding Service Award (2017). He received the Ph.D. degree from the University of Minnesota (UMN), USA. He is a co-Editor-in-Chief of the Encyclopedia of GIS, an Associate Editor of IEEE Transactions on Big Data (TBD), ACM Transactions on Knowledge Discovery from Data (TKDD), and ACM Transactions on Management Information Systems (TMIS). He has served regularly on the organization and program committees of numerous conferences, including as a Program Co-Chair of the Industrial and Government Track for the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), a Program Co-Chair for the IEEE 2013 International Conference on Data Mining (ICDM), a General Co-Chair for the IEEE 2015 International Conference on Data Mining (ICDM), and a Program Co-Chair of the Research Track for the 2018 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. He is an IEEE Fellow and an ACM Distinguished Scientist.
APPENDIX
This appendix provides further details of the experiments.
Node Classification
Experimental results on the node classification task, together with their variations (standard deviations), are shown in Table 5. The hyper-parameter settings of all the methods are shown in Table 6.
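The Macro-F1 and Micro-F1 scores reported in Table 5 can be reproduced with any standard implementation. The sketch below uses scikit-learn; the paper does not publish its evaluation code, so the helper name and the toy labels are illustrative only.

```python
# A minimal, hedged sketch of the metrics in Table 5 using scikit-learn.
from sklearn.metrics import f1_score

def evaluate_node_classification(y_true, y_pred):
    """Return (macro_f1, micro_f1) for predicted node labels."""
    # Macro-F1: unweighted mean of per-class F1 scores.
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    # Micro-F1: F1 computed from global true/false positive counts.
    micro_f1 = f1_score(y_true, y_pred, average="micro")
    return macro_f1, micro_f1

# Toy example: three classes, five test nodes.
print(evaluate_node_classification([0, 1, 2, 1, 0], [0, 1, 1, 1, 0]))
```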
TABLE 5
Experimental results with variations on the node classification task.
[The numeric entries of Table 5 were lost in extraction. The table reports the mean ± standard deviation of Macro-F1 and Micro-F1 on the ACM-3, ACM-5, and IMDB datasets, at training ratios from 20% to 100%, for MLP, GCN, GAT, RGCN, HAN, HetGNN, HGT, and HGConv.]

TABLE 6
Hyper-parameter settings of all the methods.
Data   Hyper-parameter  Ratio  MLP    GCN    GAT    RGCN   HAN    HetGNN  HGT    HGConv
ACM-3  learning rate    20%    0.05   0.05   0.005  0.03   0.01   0.03    0.01   0.008
                        40%    0.03   0.005  0.05   0.003  0.01   0.01    0.008  0.005
                        60%    0.03   0.01   0.03   0.005  0.01   0.01    0.008  0.003
                        80%    0.05   0.01   0.05   0.03   0.01   0.003   0.008  0.001
                        100%   0.005  0.01   0.05   0.03   0.08   0.003   0.003  0.005
       dropout          20%    0.5    0.1    0.7    0.5    0.7    0.5     0.7    0.7
                        40%    0.9    0.0    0.6    0.7    0.8    0.9     0.8    0.8
                        60%    0.9    0.0    0.7    0.7    0.8    0.9     0.7    0.6
                        80%    0.9    0.2    0.7    0.5    0.7    0.9     0.7    0.6
                        100%   0.9    0.5    0.8    0.5    0.6    0.9     0.9    0.8
ACM-5  learning rate    20%    0.01   0.005  0.01   0.03   0.01   0.01    0.008  0.005
                        40%    0.03   0.01   0.05   0.005  0.05   0.01    0.01   0.005
                        60%    0.008  0.03   0.03   0.003  0.08   0.01    0.01   0.003
                        80%    0.01   0.01   0.005  0.003  0.05   0.01    0.01   0.003
                        100%   0.008  0.005  0.03   0.001  0.01   0.01    0.01   0.008
       dropout          20%    0.8    0.5    0.6    0.5    0.5    0.8     0.8    0.5
                        40%    0.8    0.5    0.5    0.5    0.5    0.8     0.9    0.7
                        60%    0.9    0.2    0.7    0.6    0.8    0.8     0.6    0.8
                        80%    0.8    0.4    0.5    0.6    0.9    0.8     0.7    0.8
                        100%   0.9    0.0    0.6    0.5    0.9    0.8     0.6    0.8
IMDB   learning rate    20%    0.01   0.01   0.03   0.01   0.08   0.01    0.01   0.001
                        40%    0.05   0.08   0.01   0.005  0.01   0.01    0.008  0.008
                        60%    0.01   0.003  0.001  0.03   0.001  0.01    0.001  0.001
                        80%    0.001  0.05   0.001  0.01   0.05   0.01    0.01   0.005
                        100%   0.03   0.05   0.003  0.005  0.05   0.003   0.001  0.003
       dropout          20%    0.5    0.1    0.5    0.6    0.7    0.5     0.2    0.4
                        40%    0.9    0.4    0.7    0.5    0.7    0.4     0.6    0.5
                        60%    0.9    0.2    0.7    0.5    0.7    0.4     0.5    0.5
                        80%    0.7    0.3    0.7    0.6    0.8    0.6     0.2    0.4
                        100%   0.4    0.3    0.8    0.5    0.7    0.5     0.2    0.4
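For concreteness, the sketch below shows one way the per-dataset settings in Table 6 could be wired into training, assuming PyTorch and the Adam optimizer, both of which the paper builds on. `HGCONV_HPARAMS` and `make_optimizer` are illustrative names, and only a few HGConv entries from Table 6 are transcribed.

```python
# A minimal sketch, not the paper's actual training harness.
import torch

HGCONV_HPARAMS = {
    # (dataset, train ratio) -> settings; values copied from the
    # HGConv column of Table 6.
    ("ACM-3", "20%"):  {"lr": 0.008, "dropout": 0.7},
    ("ACM-3", "100%"): {"lr": 0.005, "dropout": 0.8},
    ("ACM-5", "60%"):  {"lr": 0.003, "dropout": 0.8},
    ("IMDB",  "100%"): {"lr": 0.003, "dropout": 0.4},
    # ... remaining (dataset, ratio) pairs follow the same pattern.
}

def make_optimizer(model: torch.nn.Module, dataset: str, ratio: str):
    """Build an Adam optimizer with the learning rate from Table 6."""
    cfg = HGCONV_HPARAMS[(dataset, ratio)]
    # The dropout probability cfg["dropout"] would be applied inside the
    # model itself, e.g. via torch.nn.Dropout(p=cfg["dropout"]).
    return torch.optim.Adam(model.parameters(), lr=cfg["lr"])
```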
Node Visualization
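As a hedged illustration of how such a visualization could be produced, the sketch below projects node embeddings to 2-D with t-SNE, which the paper uses for visualizing learned representations. The `embeddings` and `labels` arrays are random stand-ins for the learned node embeddings and node classes.

```python
# A minimal sketch, assuming t-SNE is used to project node embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64))  # stand-in for learned embeddings
labels = rng.integers(0, 3, size=300)    # stand-in for node class labels

# Project the 64-dimensional embeddings to two dimensions.
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab10")
plt.title("t-SNE projection of node embeddings")
plt.show()
```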