Hybrid Micro/Macro Level Convolution for Heterogeneous Graph Learning
Le Yu, Leilei Sun, Bowen Du, Chuanren Liu, Weifeng Lv, and Hui Xiong, Fellow, IEEE

• L. Yu, L. Sun, B. Du and W. Lv are with the SKLSDE and BDBC Lab, Beihang University, Beijing, 100083, China. E-mail: [email protected], [email protected], [email protected], [email protected]
• C. Liu is with the Department of Business Analytics and Statistics, University of Tennessee, Knoxville, USA. E-mail: [email protected]
• H. Xiong is with the Department of Management Science and Information Systems, Rutgers University, USA. E-mail: [email protected]

Manuscript received December 29, 2020; revised xx xx, xxxx.
Abstract—Heterogeneous graphs are pervasive in practical scenarios, where each graph consists of multiple types of nodes and edges. Representation learning on heterogeneous graphs aims to obtain low-dimensional node representations that could preserve both node attributes and relation information. However, most of the existing graph convolution approaches were designed for homogeneous graphs, and therefore cannot handle heterogeneous graphs. Some recent methods designed for heterogeneous graphs are also faced with several issues, including the insufficient utilization of heterogeneous properties, structural information loss, and lack of interpretability. In this paper, we propose HGConv, a novel Heterogeneous Graph Convolution approach, to learn comprehensive node representations on heterogeneous graphs with a hybrid micro/macro level convolutional operation. Different from existing methods, HGConv could perform convolutions on the intrinsic structure of heterogeneous graphs directly at both micro and macro levels: a micro-level convolution to learn the importance of nodes within the same relation, and a macro-level convolution to distinguish the subtle difference across different relations. The hybrid strategy enables HGConv to fully leverage heterogeneous information with proper interpretability. Moreover, a weighted residual connection is designed to aggregate both inherent attributes and neighbor information of the focal node adaptively. Extensive experiments on various tasks demonstrate not only the superiority of HGConv over existing methods, but also the intuitive interpretability of our approach for graph analysis.
Index Terms—Heterogeneous graphs, graph convolution, representation learning.
1 INTRODUCTION

A heterogeneous graph consists of multiple types of nodes and edges, involving abundant heterogeneous information [1]. In practice, heterogeneous graphs are pervasive in real-world scenarios, such as academic networks, e-commerce and social networks [2]. Learning meaningful representations of nodes in heterogeneous graphs is essential for various tasks, including node classification [3], [4], node clustering [5], link prediction [6], [7] and personalized recommendation [8], [9].

In recent years, Graph Neural Networks (GNNs) have been widely used in representation learning on graphs and have achieved superior performance. Generally, GNNs perform convolutions in two domains, namely the spectral domain and the spatial domain. As a spectral-based method, GCN [10] utilizes the localized first-order approximation on neighbors and then performs convolutions in the Fourier domain for an entire graph. Spatial-based methods, including GraphSAGE [11] and GAT [12], directly perform information propagation in the graph domain by specially designed aggregation functions or the attention mechanism. However, all of the above methods were designed for homogeneous graphs with a single node type and a single edge type, and they are infeasible to handle the rich information in heterogeneous graphs. Simply adapting them to deal with heterogeneous graphs would lead to the information loss issue, since they ignore the heterogeneous properties of graphs.

Despite the investigation of approaches on homogeneous graphs, there are also several attempts to design graph convolution methods for heterogeneous graphs. RGCN [13] was proposed to deal with multiple relations in knowledge graphs. HAN [14] was designed to learn on heterogeneous graphs based on meta-paths and the attention mechanism. [15] presented HetGNN to consider the heterogeneity of node attributes and neighbors through dedicated aggregation functions. [16] proposed HGT, a variant of Transformer [17], to focus on the meta relations in heterogeneous graphs.

However, the aforementioned methods are still faced with the following limitations.
1) Heterogeneous information loss: several methods utilize the properties of nodes or relations only partially, rather than the comprehensive information of nodes and relations (e.g., RGCN and HAN). In detail, RGCN ignores the distinct attributes of nodes with various types. HAN relies on multiple hand-designed symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs, which leads to the loss of information carried by different nodes and edges. 2) Structural information loss: some methods deal with the graph topology through heuristic strategies, such as the random walk in HetGNN, which may break the intrinsic graph structure and lose valuable structural information. 3) Empirical manual design: the performance of some methods relies severely on prior experience because of the requirement of specific domain knowledge, such as pre-defined meta-paths in HAN. 4) Insufficient representation ability: some methods cannot provide multi-level representation due to the flat model architecture. For example, HGT learns the interaction of nodes and relations in a single aggregation process, where it is hard to distinguish their importance in such a flat architecture.
TABLE 1
Comparison of several existing methods with the proposed model.

Models   Graph     Heterogeneous  Without Specific   Attentive    Convolutions on      Multi-level
         Topology  Properties     Domain Knowledge   Aggregation  Intrinsic Structure  Representation
MLP      ×         ×              ✓                  ×            ×                    ×
GCN      ✓         ×              ✓                  ×            ✓                    ×
GAT      ✓         ×              ✓                  ✓            ✓                    ×
RGCN     ✓         ✓              ✓                  ×            ✓                    ×
HAN      ✓         ✓              ×                  ✓            ✓                    ✓
HetGNN   ✓         ✓              ✓                  ✓            ×                    ✓
HGT      ✓         ✓              ✓                  ✓            ✓                    ×
HGConv   ✓         ✓              ✓                  ✓            ✓                    ✓

To cope with the above issues, we propose HGConv, a novel Heterogeneous Graph Convolution approach, to learn node representations on heterogeneous graphs with a hybrid micro/macro level convolutional operation. Specifically, for a focal node: in the micro-level convolution, the transformation matrices and attention vectors are both specific to node types, aiming to learn the importance of nodes within the same relation; in the macro-level convolution, transformation matrices specific to relation types and a weight-sharing attention vector are employed to distinguish the subtle difference across different relations. Due to the hybrid micro/macro level convolution, HGConv could fully utilize the heterogeneous information of nodes and relations with proper interpretability. Moreover, a weighted residual connection component is designed to obtain the optimal fusion of the focal node's inherent attributes and neighbor information. Based on the aforementioned components, our approach could be optimized in an end-to-end manner. A comparison of several existing methods with our model is shown in Table 1.

To sum up, the contributions of our work are as follows:
• A novel heterogeneous graph convolution approach is proposed to directly perform convolutions on the intrinsic heterogeneous graph structure with a hybrid micro/macro level convolutional operation, where the micro convolution encodes the attributes of different types of nodes and the macro convolution computes on different relations respectively.
• A residual connection component with weighted combination is designed to aggregate the focal node's inherent attributes and neighbor information adaptively, which could provide comprehensive node representations.
• A systematic analysis of existing heterogeneous graph learning methods is given, and we point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.

The rest of this paper is organized as follows: Section 2 reviews previous work related to the studied problem. Section 3 introduces the studied problem. Section 4 presents the framework and each component of the proposed model. Section 5 evaluates the proposed model by experiments. Section 6 concludes the entire paper.
2 RELATED WORK
This section reviews existing literature related to our work and points out the differences from our work.
Graph Mining. Over the past decades, a great amount of research effort has been devoted to graph mining. Classical methods based on manifold learning, including Locally Linear Embedding (LLE) [18] and Laplacian Eigenmaps (LE) [19], mainly focus on the reconstruction of graphs. Inspired by the language model Skip-gram [20], more advanced methods were proposed to learn representations of nodes, such as DeepWalk [21] and Node2Vec [22]. These methods adopt the random walk strategy to generate sequences of nodes and use Skip-gram to maximize the node co-occurrence probability in the same sequence.

However, all of the above methods only focus on the graph topology and cannot take node attributes into consideration, resulting in inferior performance. These methods are surpassed by recently proposed GNNs, which consider both node attributes and graph structure simultaneously.
Graph Neural Networks. Recent years have witnessed the success of GNNs in various tasks, such as node classification [10], [11], link prediction [23] and graph classification [24]. GNNs consider both graph structure and node attributes by first propagating information between each node and its neighbors, and then providing node representations based on the received information. Generally, GNNs could be divided into spectral-based methods and spatial-based methods. As a spectral-based method, Spectral CNN [25] performs convolution in the Fourier domain by computing the eigendecomposition of the graph Laplacian matrix. ChebNet [26] leverages the K-order Chebyshev polynomials to eliminate the need to calculate the eigenvectors of the Laplacian matrix. GCN [10] introduces a localized first-order approximation of ChebNet to alleviate the overfitting problem. Representative spatial-based methods include GraphSAGE [11] and GAT [12]. [11] proposed GraphSAGE to propagate information in the graph domain directly and designed different functions to aggregate the received information. [12] presented GAT by introducing the attention mechanism into GNNs, which enables GAT to select more important neighbors adaptively. We refer the interested readers to [27], [28] for more comprehensive reviews on GNNs.

However, all the above methods were designed for homogeneous graphs, and could not handle the rich information in heterogeneous graphs. In this work, we aim to propose an approach to learn on heterogeneous graphs.
Heterogeneous Graph Neural Networks. Heterogeneous graphs contain abundant information of various types of nodes and relations. Mining useful information in heterogeneous graphs is essential in practical scenarios. Recently, several graph convolution methods have been proposed for learning on heterogeneous graphs. [13] presented RGCN to learn on knowledge graphs by employing specialized transformation matrices for each type of relation. [14] designed HAN by extending the attention mechanism in GAT [12] to learn the importance of neighbors and multiple hand-designed meta-paths. [29] considered the intermediate nodes in meta-paths, which are ignored in HAN, and proposed MAGNN to aggregate the intra-meta-path and inter-meta-path information. HetGNN [15] first samples neighbors based on the random walk strategy and then uses specialized Bi-LSTMs to integrate the heterogeneous node attributes and neighbors. [16] proposed HGT to introduce type-specific transformation matrices and learn the importance of different nodes and relations based on the Transformer [17] architecture.

Nevertheless, there are still some limitations in the above methods, including the insufficient utilization of heterogeneous properties, structural information loss, and lack of interpretability. In this paper, we aim to cope with the issues in existing approaches and design a method to learn comprehensive node representations on heterogeneous graphs by leveraging both node attributes and relation information.
3 PROBLEM FORMALIZATION
This section introduces related concepts and the studiedproblem in this paper.
Definition 1. Heterogeneous Graph: A heterogeneous graph is defined as a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{R})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes and the set of edges respectively. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ are associated with their type mapping functions $\phi(v): \mathcal{V} \rightarrow \mathcal{A}$ and $\varphi(e): \mathcal{E} \rightarrow \mathcal{R}$, with the constraint of $|\mathcal{A}| + |\mathcal{R}| > 2$.

Definition 2. Relation: A relation represents the interaction schema of the source node, the target node and the connecting edge. Formally, for an edge $e = (u, v)$ with source node $u$ and target node $v$, the corresponding relation $R \in \mathcal{R}$ is denoted as $\langle \phi(u), \varphi(e), \phi(v) \rangle$. The inverse of $R$ is naturally represented by $R^{-1}$, and we consider the inverse relation to propagate information between the two nodes in both directions. Thus, the set of edges is extended as $\mathcal{E} \cup \mathcal{E}^{-1}$ and the set of relations is extended as $\mathcal{R} \cup \mathcal{R}^{-1}$. Note that the meta-paths used in heterogeneous graph learning approaches [14], [29] are defined as sequences of relations.

Definition 3. Heterogeneous Graph Representation Learning: Given a heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{R})$, where nodes with type $A \in \mathcal{A}$ are associated with the attribute matrix $X_A \in \mathbb{R}^{|\mathcal{V}_A| \times d_A}$, the task of heterogeneous graph representation learning is to obtain the $d$-dimensional representation $h_v \in \mathbb{R}^d$ for each $v \in \mathcal{V}$, where $d \ll |\mathcal{V}|$. The learned representations are able to capture both node attributes and relation information, and could be applied in various tasks, such as node classification, node clustering and node visualization.
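To make Definitions 1 and 2 concrete, the snippet below builds a toy academic heterogeneous graph with DGL, the library our implementation is based on (see Section 5). The node counts, edges and attribute dimensions are made up for illustration; each relation $\langle \phi(u), \varphi(e), \phi(v) \rangle$ is stored under its own edge type, together with its inverse relation so that information can propagate in both directions.

```python
import dgl
import torch

# A toy academic graph with A = {paper, author, term}. Every relation is kept
# as a (source type, edge type, target type) triple, plus its inverse.
graph = dgl.heterograph({
    ('author', 'writes', 'paper'): (torch.tensor([0, 1, 1]), torch.tensor([0, 0, 1])),
    ('paper', 'written-by', 'author'): (torch.tensor([0, 0, 1]), torch.tensor([0, 1, 1])),
    ('term', 'appears-in', 'paper'): (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ('paper', 'contains', 'term'): (torch.tensor([0, 1]), torch.tensor([0, 1])),
})

# Type-specific attribute matrices X_A, each with its own dimension d_A.
graph.nodes['paper'].data['x'] = torch.randn(graph.num_nodes('paper'), 32)
graph.nodes['author'].data['x'] = torch.randn(graph.num_nodes('author'), 16)
graph.nodes['term'].data['x'] = torch.randn(graph.num_nodes('term'), 8)
```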
4 METHODOLOGY

This section presents the framework of the proposed model and then introduces each of its components step by step.
The framework of the proposed model is shown in Figure 1, which takes the node attribute matrices $X_A$ for $A \in \mathcal{A}$ in a heterogeneous graph as the input and provides the low-dimensional node representation $h_v$ for each $v \in \mathcal{V}$ as the output, which could be applied in various tasks.

Fig. 1. Framework of the proposed model.
The proposed model is made up of multiple heterogeneous graph convolutional layers, where each layer consists of the hybrid micro/macro level convolution and the weighted residual connection component. Different from [14], which performs convolution on converted homogeneous graphs through meta-paths, the proposed hybrid convolution could calculate on the heterogeneous graph structure directly. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution is designed to discriminate the difference across different relations. The weighted residual connection component is employed to consider the different contributions of the focal node's inherent attributes and its neighbor information. By stacking multiple heterogeneous graph convolutional layers, the proposed model could consider the impacts of the focal node's directly connected and multi-hop reachable neighbors.

As pointed out in [14], the importance of nodes connected with the focal node within the same relation would be different. Hence, we first design a micro-level convolution to learn the importance of nodes within the same relation. We suppose that the attributes of nodes with different types might be distributed in different latent spaces. Therefore, we utilize transformation matrices and attention vectors, which are specific to node types, to capture the characteristics of different types of nodes in the micro-level convolution.
Formally, we denote the focal node $v$ as the target node with type $\phi(v) \in \mathcal{A}$ and its connected node $u$ as the source node with type $\phi(u) \in \mathcal{A}$. For a focal node $v$, let $N_R(v)$ denote the set of node $v$'s neighbors within the $R$-type relation, where for each $u \in N_R(v)$, $e = (u, v) \in \mathcal{E}$ and $R = \langle \phi(u), \varphi(e), \phi(v) \rangle \in \mathcal{R}$.

We first apply transformation matrices, which are specific to node types, to project nodes into their own latent spaces as follows,
$$z_v^l = W_{\phi(v)}^l h_v^{l-1}, \quad (1)$$
$$z_u^l = W_{\phi(u)}^l h_u^{l-1}, \quad (2)$$
where $W_{\phi(u)}^l$ denotes the trainable transformation matrix for node $u$ with type $\phi(u)$ at layer $l$, and $h_u^l$ and $z_u^l$ denote the original and transformed representations of node $u$ at layer $l$. Then we calculate the normalized importance of neighbor $u \in N_R(v)$ as follows,
$$e_{v,u}^{R,l} = \mathrm{LeakyReLU}\left( {a_{\phi(u)}^l}^\top \left[ z_v^l \, \Vert \, z_u^l \right] \right), \quad (3)$$
$$\alpha_{v,u}^{R,l} = \frac{\exp\left(e_{v,u}^{R,l}\right)}{\sum_{u' \in N_R(v)} \exp\left(e_{v,u'}^{R,l}\right)}, \quad (4)$$
where $a_{\phi(u)}^l$ is the trainable attention vector for the $\phi(u)$-type source node $u$ at layer $l$, $\Vert$ denotes the concatenation operation, and $\top$ denotes the transpose operation. $\alpha_{v,u}^{R,l}$ is the normalized importance of source node $u$ to focal node $v$ under relation $R$ at layer $l$. Then the representation of relation $R$ about focal node $v$ is calculated by,
$$c_{v,R}^l = \sigma\left( \sum_{u \in N_R(v)} \alpha_{v,u}^{R,l} \cdot z_u^l \right), \quad (5)$$
where $\sigma(\cdot)$ denotes the activation function (e.g., sigmoid, ReLU). An intuitive explanation of the micro-level convolution is shown in Figure 2(a). Embeddings of nodes within the same relation are aggregated through the attention vectors which are specific to node types. Since the attention weight $\alpha_{v,u}^{R,l}$ is computed for each relation, it could well capture the relation information.

In order to enhance the model capacity and make the training process more stable, we employ $K$ independent heads and then concatenate the representations as follows,
$$c_{v,R}^l = \big\Vert_{k=1}^{K} \sigma\left( \sum_{u \in N_R(v)} \left[\alpha_{v,u}^{R,l}\right]_k \cdot \left[z_u^l\right]_k \right), \quad (6)$$
where $[\alpha_{v,u}^{R,l}]_k$ denotes the importance of source node $u$ to focal node $v$ under relation $R$ of head $k$ at layer $l$, and $[z_u^l]_k$ stands for source node $u$'s transformed representation of head $k$ at layer $l$.
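To illustrate the micro-level convolution, the following is a minimal single-head PyTorch sketch of Equations (1)-(5) for one relation; the module and tensor names are ours, not from the released implementation, and it deliberately omits the multi-head concatenation of Equation (6).

```python
import torch
import torch.nn as nn

class MicroLevelConv(nn.Module):
    """Single-head micro-level convolution for one relation R, Eqs. (1)-(5)."""
    def __init__(self, dim_focal: int, dim_src: int, dim_out: int):
        super().__init__()
        self.W_focal = nn.Linear(dim_focal, dim_out, bias=False)  # W^l_{phi(v)}, Eq. (1)
        self.W_src = nn.Linear(dim_src, dim_out, bias=False)      # W^l_{phi(u)}, Eq. (2)
        self.a_src = nn.Linear(2 * dim_out, 1, bias=False)        # attention vector a^l_{phi(u)}
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h_focal: torch.Tensor, h_src: torch.Tensor) -> torch.Tensor:
        # h_focal: (dim_focal,) previous-layer representation of the focal node v
        # h_src: (n, dim_src) representations of v's neighbors within relation R
        z_v = self.W_focal(h_focal)                                # Eq. (1)
        z_u = self.W_src(h_src)                                    # Eq. (2)
        cat = torch.cat([z_v.expand_as(z_u), z_u], dim=-1)
        e = self.leaky_relu(self.a_src(cat)).squeeze(-1)           # Eq. (3)
        alpha = torch.softmax(e, dim=0)                            # Eq. (4)
        return torch.relu((alpha.unsqueeze(-1) * z_u).sum(dim=0))  # Eq. (5), with sigma = ReLU
```

In the full model, K such heads run in parallel and their outputs are concatenated as in Equation (6).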
Besides considering the importance of nodes within the same relation, a focal node would also interact with multiple relations, which indicates the necessity of learning the subtle difference across different relations. Therefore, we design a macro-level convolution with transformation matrices specific to relation types and a weight-sharing attention vector to distinguish the difference of relations.

Fig. 2. Explanation of the hybrid micro/macro level convolution.

Specifically, we first transform the focal node and its connecting relations into their distinct distributed spaces by,
$$h_v^{l'} = U_{\phi(v)}^l h_v^{l-1}, \quad (7)$$
$$c_{v,R}^{l'} = M_R^l c_{v,R}^l, \quad (8)$$
where $U_{\phi(v)}^l$ and $M_R^l$ denote the transformation matrices for the $\phi(v)$-type focal node $v$ and the $R$-type relation at layer $l$ respectively. Then the normalized importance of relation $R \in \mathcal{R}(v)$ to focal node $v$ is calculated by,
$$s_{v,R}^l = \mathrm{LeakyReLU}\left( {\mu^l}^\top \left[ h_v^{l'} \, \Vert \, c_{v,R}^{l'} \right] \right), \quad (9)$$
$$\beta_{v,R}^l = \frac{\exp\left(s_{v,R}^l\right)}{\sum_{R' \in \mathcal{R}(v)} \exp\left(s_{v,R'}^l\right)}, \quad (10)$$
where $\mathcal{R}(v)$ denotes the set of relations connected to focal node $v$, $\mu^l$ is the trainable attention vector shared by different relations at layer $l$, and $\beta_{v,R}^l$ is the normalized importance of relation $R$ to focal node $v$ at layer $l$. After obtaining the importance of different relations, we aggregate the relations as follows,
$$\widetilde{h}_v^l = \sum_{R \in \mathcal{R}(v)} \beta_{v,R}^l \cdot c_{v,R}^{l'}, \quad (11)$$
where $\widetilde{h}_v^l$ is the fused representation of the relations connected to focal node $v$ at layer $l$. An explanation of the macro-level convolution is shown in Figure 2(b). Representations of different relations are aggregated into a compact vector through the attention mechanism. Through the macro-level convolution, the different importance of relations could be calculated automatically.

We also extend Equation (11) to multi-head attention by,
$$\widetilde{h}_v^l = \big\Vert_{k=1}^{K} \sum_{R \in \mathcal{R}(v)} \left[\beta_{v,R}^l\right]_k \cdot \left[c_{v,R}^{l'}\right]_k, \quad (12)$$
where $[\beta_{v,R}^l]_k$ is the importance of relation $R$ to focal node $v$ of head $k$ at layer $l$, and $[c_{v,R}^{l'}]_k$ denotes the transformed representation of relation $R$ for focal node $v$ of head $k$ at layer $l$.

It is worth noting that the attention vectors in the micro-level convolution are specific to node types, while in the macro-level convolution, the attention vector is shared by different relations and is unaware of relation types. Such a design is based on the following reasons. 1) When performing the micro-level convolution, nodes are associated with distinct attributes even when they are within the same relation. An attention vector unaware of node types would have insufficient representation ability to handle nodes' different attributes and types. Hence, attention vectors specific to node types are designed in the micro-level convolution. 2) In the macro-level convolution, each relation connected to the focal node is associated with a single representation, and we only need to consider the difference of relation types. Therefore, the weight-sharing attention vector across different relations is designed. Following the above design, we could not only maintain the distinct characteristics of nodes and relations, but also reduce the model parameters.
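Analogously, a minimal single-head sketch of the macro-level convolution in Equations (7)-(11) is given below; again the class and variable names are ours, and relations are keyed by strings for simplicity.

```python
import torch
import torch.nn as nn

class MacroLevelConv(nn.Module):
    """Single-head macro-level convolution across relations, Eqs. (7)-(11)."""
    def __init__(self, dim_in: int, dim_out: int, relation_names):
        super().__init__()
        self.U_focal = nn.Linear(dim_in, dim_out, bias=False)   # U^l_{phi(v)}, Eq. (7)
        self.M = nn.ModuleDict({R: nn.Linear(dim_out, dim_out, bias=False)
                                for R in relation_names})       # M^l_R, Eq. (8)
        self.mu = nn.Linear(2 * dim_out, 1, bias=False)         # shared attention vector mu^l
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h_focal: torch.Tensor, c_by_relation: dict) -> torch.Tensor:
        # h_focal: (dim_in,) focal node representation from the previous layer
        # c_by_relation: {relation name R: c^l_{v,R} of shape (dim_out,)} from the micro level
        h_prime = self.U_focal(h_focal)                                          # Eq. (7)
        c_prime = torch.stack([self.M[R](c) for R, c in c_by_relation.items()])  # Eq. (8)
        cat = torch.cat([h_prime.expand_as(c_prime), c_prime], dim=-1)
        s = self.leaky_relu(self.mu(cat)).squeeze(-1)                            # Eq. (9)
        beta = torch.softmax(s, dim=0)                                           # Eq. (10)
        return (beta.unsqueeze(-1) * c_prime).sum(dim=0)                         # Eq. (11)
```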
In addition to aggregating neighbor information by the hybrid micro/macro level convolution, the attributes of the focal node are also supposed to be important, since they reflect the inherent characteristics directly. However, simply adding the focal node's inherent attributes and neighbor information together could not distinguish their different importance. Thus, we adapt the residual connection [30] with a trainable weight parameter to combine the focal node's inherent attributes and neighbor information by,
$$h_v^l = \lambda_{\phi(v)}^l \cdot W_{\phi(v),o}^l h_v^{l-1} + \left(1 - \lambda_{\phi(v)}^l\right) \cdot \widetilde{h}_v^l, \quad (13)$$
where $\lambda_{\phi(v)}^l$ is the weight to control the importance of focal node $v$'s inherent attributes and its neighbor information at layer $l$, and $W_{\phi(v),o}^l$ is utilized to align the dimensions of focal node $v$'s attributes and its neighbor information at layer $l$.

From another perspective, the weighted residual connection could be treated as the gated updating mechanism in the Gated Recurrent Unit (GRU) [31], where the employed update gates are specific to the focal node type and carry different weights in different layers.

We stack $L$ heterogeneous graph convolutional layers to build HGConv. For the first layer, we set $h_v^0$ to node $v$'s corresponding row in the attribute matrix $X_{\phi(v)}$ as the input. The final node representation $h_v$ is set to the output of the last layer $h_v^L$ for each $v \in \mathcal{V}$.

HGConv could be trained in an end-to-end manner with the following strategies. 1) Semi-supervised learning strategy: for tasks where labels are available, we could optimize the model by minimizing the cross-entropy loss,
$$L = -\sum_{v \in \mathcal{V}_{label}} \sum_{c=1}^{C} y_{v,c} \cdot \log \hat{y}_{v,c}, \quad (14)$$
where $\mathcal{V}_{label}$ is the set of nodes with labels, and $y_{v,c}$ and $\hat{y}_{v,c}$ denote the ground truth and the predicted probability of node $v$ at the $c$-th dimension. In practice, $\hat{y}_v$ could be obtained from a classifier (e.g., SVM, single-layer neural network) which takes node $v$'s representation $h_v$ as the input. 2) Unsupervised learning strategy: for tasks without any labels, we could optimize the model by minimizing the objective function in Skip-gram [32] with negative sampling,
$$L = -\sum_{(v,u) \in S_P} \log \sigma\left(h_v^\top h_u\right) - \sum_{(v',u') \in S_N} \log \sigma\left(-h_{v'}^\top h_{u'}\right), \quad (15)$$
where $\sigma(\cdot)$ is the sigmoid activation function, and $S_P$ and $S_N$ denote the sets of positive observed node pairs and negative sampled node pairs respectively. 3) Joint learning strategy: we could also combine the semi-supervised and unsupervised learning strategies to jointly optimize the model.
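A sketch of the weighted residual connection of Equation (13) and the semi-supervised objective of Equation (14) follows; squashing the learnable weight through a sigmoid to keep it in (0, 1) is our own choice, not something the paper specifies, and all names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedResidual(nn.Module):
    """Weighted residual connection of Eq. (13) for one node type phi(v)."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.W_o = nn.Linear(dim_in, dim_out, bias=False)  # W^l_{phi(v),o}, aligns dimensions
        self.lam = nn.Parameter(torch.zeros(1))            # lambda^l_{phi(v)}, trainable

    def forward(self, h_prev: torch.Tensor, h_neigh: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam)  # assumption: constrain the weight to (0, 1)
        return lam * self.W_o(h_prev) + (1.0 - lam) * h_neigh

def semi_supervised_loss(logits: torch.Tensor, labels: torch.Tensor,
                         labeled_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss of Eq. (14), computed only over the labeled nodes."""
    return F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
```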
Here we give a systematic analysis of existing heterogeneous graph learning models and point out that each existing method could be treated as a special case of the proposed HGConv under certain circumstances.

Overview of Homogeneous GNNs. Let us start with the introduction of homogeneous GNNs. Generally, the operations at the $l$-th layer of a homogeneous GNN follow a two-step strategy:
$$\widetilde{h}_v^l = \mathrm{AGGREGATE}^l\left(\left\{ h_u^{l-1} : u \in N(v) \right\}\right), \quad (16)$$
$$h_v^l = \mathrm{COMBINE}^l\left(h_v^{l-1}, \widetilde{h}_v^l\right), \quad (17)$$
where $h_v^l$ denotes the representation of node $v$ at the $l$-th layer, $h_v^0$ is initialized with node $v$'s original attribute $x_v$, and $N(v)$ denotes the set of node $v$'s neighbors. $\mathrm{AGGREGATE}^l(\cdot)$ stands for the aggregation of node $v$'s neighbors, and $\mathrm{COMBINE}^l(\cdot)$ is the combination of node $v$'s inherent attribute and its neighbor information at layer $l$. Different architectures for AGGREGATE and COMBINE have been proposed in recent years. For example, GCN [10] utilizes the normalized adjacency matrix for AGGREGATE and uses the residual connection for COMBINE. GraphSAGE [11] designs various pooling operations for AGGREGATE and leverages the concatenation for COMBINE.

Overview of Heterogeneous GNNs. The operations in heterogeneous GNNs are based on the operations in homogeneous GNNs, with additional consideration of node attributes and relation information. Formally, the operations at the $l$-th layer could be summarized as follows:
$$z_u^l = \mathrm{TRANSFORM}_{\phi(u)}^l\left(h_u^{l-1}\right), \quad \forall u \in \mathcal{V}, \quad (18)$$
$$c_{v,R}^l = \mathrm{AGGREGATE}_R^l\left(\left\{ z_u^l : u \in N_R(v) \right\}\right), \quad (19)$$
$$\widetilde{h}_v^l = \mathrm{AGGREGATE}^l\left(\left\{ c_{v,R}^l : R \in \mathcal{R}(v) \right\}\right), \quad (20)$$
$$h_v^l = \mathrm{COMBINE}^l\left(h_v^{l-1}, \widetilde{h}_v^l\right), \quad (21)$$
where $N_R(v)$ denotes the set of node $v$'s neighbors within the $R$-type relation and $\mathcal{R}(v)$ is defined as the set of relations connected to node $v$.
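The four-step template in Equations (18)-(21) can be instantiated in many ways; the toy, dependency-free sketch below uses the identity for TRANSFORM and unweighted means for both AGGREGATE steps and for COMBINE, purely to make the data flow explicit. The function and argument names are ours.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def hetero_gnn_layer(h_prev, neighbors_by_relation):
    # h_prev: {node id: representation vector from the previous layer}
    # neighbors_by_relation: {focal node v: {relation R: [neighbor ids]}}
    h = {}
    for v, rels in neighbors_by_relation.items():
        z = {u: h_prev[u] for nbrs in rels.values() for u in nbrs}      # Eq. (18), identity TRANSFORM
        c = {R: mean([z[u] for u in nbrs]) for R, nbrs in rels.items()} # Eq. (19), per-relation AGGREGATE
        h_neigh = mean(list(c.values()))                                # Eq. (20), cross-relation AGGREGATE
        h[v] = mean([h_prev[v], h_neigh])                               # Eq. (21), COMBINE
    return h
```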
Compared with homogeneous GNNs, heterogeneous GNNs first design specialized transformation matrices for different types of nodes in TRANSFORM. Then the operations in AGGREGATE are divided into aggregation within the same relation and aggregation across different relations. Finally, the operation in COMBINE is defined in the same way as Equation (17) in homogeneous GNNs.
Analysis of the Proposed HGConv. The proposed HGConv makes a delicate design for each operation in the aforementioned heterogeneous GNNs. Specifically, Equation (18) - Equation (21) could be rewritten as:
$$z_u^l = W_{\phi(u)}^l h_u^{l-1}, \quad \forall u \in \mathcal{V}, \quad (22)$$
$$c_{v,R}^l = \sum_{u \in N_R(v)} \alpha_{v,u}^{R,l} \cdot z_u^l, \quad (23)$$
$$\widetilde{h}_v^l = \sum_{R \in \mathcal{R}(v)} \beta_{v,R}^l \cdot c_{v,R}^l, \quad (24)$$
$$h_v^l = \lambda_{\phi(v)}^l \cdot h_v^{l-1} + \left(1 - \lambda_{\phi(v)}^l\right) \cdot \widetilde{h}_v^l, \quad (25)$$
where $W_{\phi(u)}^l$ is the transformation matrix specific to node $u$'s type, $\alpha_{v,u}^{R,l}$ and $\beta_{v,R}^l$ are the importance learned by the attention mechanism in the micro-level and macro-level convolutions respectively, and $\lambda_{\phi(v)}^l$ is the trainable parameter to balance the importance of the focal node's inherent attributes and its neighbor information.

1. Note that we omit the activation functions and the transformation matrices for graph convolution or dimension alignment for simplicity.

Connection with RGCN. RGCN [13] ignores the distinct attributes of nodes with various types and assigns the importance of neighbors within the same relation based on pre-defined constants. RGCN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with the identity function $I(\cdot)$, which means the different attribute distributions of nodes with various types are not considered; 2) Replace the trainable $\alpha_{v,u}^{R,l}$ in Equation (23) with a pre-defined constant, which is calculated from the degree of each node; 3) Set $\beta_{v,R}^l$ in Equation (24) to 1.0, which stands for simple sum pooling; 4) Set $\lambda_{\phi(v)}^l$ in Equation (25) to 0.5, which means equal contribution of node inherent attributes and neighbor information. Note that the sum pooling operation in RGCN could not distinguish the importance of nodes and relations effectively.

Connection with HAN. HAN [14] leverages multiple symmetric meta-paths to convert the heterogeneous graph into multiple homogeneous graphs. Therefore, node $v$'s neighbors are defined by the given set of meta-paths $\Phi$. HAN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with the identity function $I(\cdot)$, as each converted graph only contains nodes with a single type; 2) Define the set of node $v$'s neighbors in Equation (23) by the meta-paths $\Phi$, that is, for each meta-path $\Phi_i$, the set of node $v$'s neighbors is denoted as $N_{\Phi_i}(v)$, and then learn the importance of neighbors generated by the same meta-path through the attention mechanism; 3) Replace the aggregation of different relations in Equation (24) with the aggregation of the multiple meta-paths $\Phi$, and learn the importance of different meta-paths using the attention mechanism; 4) Set $\lambda_{\phi(v)}^l$ in Equation (25) to 0.0, which means using the neighbor information directly. Note that the converted graphs are homogeneous, and the attributes of nodes with different types are ignored in HAN, leading to inferior performance.

Connection with HetGNN. HetGNN [15] leverages the random walk strategy to sample neighbors and then uses Bi-LSTMs to integrate node attributes and neighbors. Therefore, node $v$'s neighbors are generated by random walk $RW$, which could be denoted as $N_{RW}(v)$. HetGNN could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with Bi-LSTMs to aggregate the attributes of nodes with various types; 2) Define the set of node $v$'s neighbors in Equation (23) by random walk $RW$ and group the neighbors by node types, that is, for each node type $t$, the set of node $v$'s neighbors is denoted as $N_{RW,t}(v)$, and then learn the importance of neighbors with the same node type through Bi-LSTMs; 3) Replace the aggregation of different relations in Equation (24) with the aggregation of different node types, and learn the importance of different node types using the attention mechanism; 4) Set $\lambda_{\phi(v)}^l$ in Equation (25) to be trainable, which is incorporated in the attention mechanism of the previous step in [15]. Note that the random walk $RW$ in HetGNN may break the intrinsic graph structure and result in structural information loss.

Connection with HGT. HGT [16] learns the importance of different nodes and relations based on the Transformer architecture by designing type-specific transformation matrices. HGT focuses on the study of each relation (a.k.a. meta relation in [16]); hence, the importance of a source node to a target node is calculated based on both node representations as well as their connecting relation in a single aggregation process. HGT could be treated as a special case of the proposed HGConv with the following steps: 1) Replace $W_{\phi(u)}^l$ in Equation (22) with the linear projections that are specific to the source node type and the target node type respectively to obtain the Key and Query vectors; 2) Fuse the aggregation processes in Equation (23) and Equation (24) into a single aggregation process, where the importance of the source node to the target node is learned from the Key and Query vectors, as well as the relation transformation matrices specific to their connecting relation type; 3) Set $\lambda_{\phi(v)}^l$ in Equation (25) to 0.5, which means node inherent attributes and neighbor information contribute equally to the final node representation. Note that the single aggregation process in HGT leads to a flat architecture, making it hard to distinguish the importance of nodes and relations separately.

5 EXPERIMENTS
This section presents the experimental results on real-world datasets together with detailed analysis.
We conduct experiments on three real-world datasets.
• ACM-3: Following [14], we extract a subset of ACM from AMiner [33], which contains papers published in three areas: Data Mining (KDD, ICDM), Database (VLDB, SIGMOD) and Wireless Communication (SIGCOMM, MobiCOMM). Finally we construct a heterogeneous graph containing papers (P), authors (A) and terms (T).
• ACM-5: We also extract a larger subset of ACM from AMiner, which includes papers published in five areas: Data Mining (KDD, ICDM, WSDM, CIKM), Database (VLDB, ICDE), Artificial Intelligence (AAAI, IJCAI), Computer Vision (CVPR, ECCV) and Natural Language Processing (ACL, EMNLP, NAACL).
• IMDB: We extract a subset of IMDB and construct a heterogeneous graph containing movies (M), directors (D) and actors (A). The movies are divided into three classes: Action, Comedy, Drama.

For ACM-3 and ACM-5, we use TF-IDF [34] to extract keywords of the abstracts and titles of papers. Paper attributes are the bag-of-words representation of abstracts. Author attributes are the average representation of their published papers. Term attributes are represented as the one-hot encoding of the title keywords. For IMDB, movie attributes are the bag-of-words representation of plot keywords. Director/actor attributes are the average representation of their directed/acted movies.

Details of the datasets are summarized in Table 2.
TABLE 2
Statistics of the datasets.

Dataset | Node | Relation | Attribute | Data Split
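As a rough illustration of the attribute construction described above, the following scikit-learn sketch builds bag-of-words paper attributes, a one-hot term vocabulary from title keywords, and averaged author attributes; the two example abstracts, titles, and the authorship map are made up, and this is only an approximation of the pipeline under our own assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical inputs: paper abstracts and titles.
abstracts = ["ranking based clustering of heterogeneous information networks",
             "semi supervised classification with graph convolutional networks"]
titles = ["ranking clustering heterogeneous networks", "graph convolutional networks"]

# Paper attributes: bag-of-words representation of abstracts.
paper_x = CountVectorizer(binary=True).fit_transform(abstracts).toarray()

# Term attributes: one-hot encodings over the TF-IDF keyword vocabulary of the titles.
vocabulary = TfidfVectorizer().fit(titles).vocabulary_
term_x = np.eye(len(vocabulary))

# Author attributes: average representation of each author's published papers
# (author 0 wrote both toy papers here).
author_papers = {0: [0, 1]}
author_x = np.stack([paper_x[ids].mean(axis=0) for ids in author_papers.values()])
```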
We compare our method with the following baselines:
• MLP: MLP ignores the graph structure and solely focuses on the focal node attributes by leveraging the multilayer perceptron.
• GCN: GCN performs graph convolutions in the Fourier domain by leveraging the localized first-order approximation [10].
• GAT: GAT introduces the attention mechanism into GNNs and assigns different importance to the neighbors adaptively [12].
• RGCN: RGCN designs specialized transformation matrices for each type of relation in the modelling of knowledge graphs [13].
• HAN: HAN leverages the attention mechanism to aggregate neighbor information via multiple manually designed meta-paths [14].
• HetGNN: HetGNN considers the heterogeneity of node attributes and neighbors, and then utilizes Bi-LSTMs to integrate heterogeneous information [15].
• HGT: HGT introduces type-specific transformation matrices to capture the characteristics of different nodes and relations with the Transformer architecture [16].

3. https://data.world/data-society/imdb-5000-movie-dataset
As some methods require meta-paths, we use PAP, PTP and PPP as meta-paths for ACM-3 and ACM-5, and choose MDM and MAM as meta-paths for IMDB. Following [14], we test GCN and GAT on the homogeneous graph generated by each meta-path and report the best performance among the meta-paths (experiments show that the best meta-paths on ACM-3, ACM-5 and IMDB are PAP, PAP and MDM respectively). All the meta-paths are directly fed into HAN. Adam [35] is selected as the optimizer. Dropout [36] is utilized to prevent over-fitting. Grid search is used to select the best hyperparameters, including the dropout rate and the learning rate. The dimension of the node representation is set to 64. We train all the methods for a fixed 300 epochs and use the early stopping strategy with a patience of 100, which means the training process is terminated when the evaluation metrics on the validation set are not improved for 100 consecutive epochs.

For HGConv, the numbers of attention heads in the micro/macro level convolutions are both set to 8, and the dimension of each head's attention vector is set to 8. We build HGConv with two layers, since two layers could achieve satisfactory performance and stacking more layers does not improve the performance significantly. The proposed HGConv is implemented with PyTorch [37] and Deep Graph Library (DGL) [38]. Experiments are conducted on an Ubuntu machine equipped with two Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz processors with 8 physical cores each, and an NVIDIA TITAN Xp GPU with 12 GB of GDDR5X memory running at over 11 Gbps.

We conduct experiments to make comparisons on the node classification task. Following [14], we split the datasets into training, validation and testing sets with the ratio of 2:1:7. The ratio of training data is varied in [20%, 40%, 60%, 80%]. To make a comprehensive comparison, we additionally use 5-fold cross-validation and report the average classification results. For ACM-3 and ACM-5, we aim to predict the area of papers. For IMDB, the goal is to predict the class of movies. Macro-F1 and Micro-F1 are adopted as evaluation metrics. Experimental results are shown in Table 3. By analyzing the results, some conclusions could be summarized.

Firstly, the performance of all the methods improves with the increase of training data, which proves that feeding more training data helps deep learning methods learn more complicated patterns and achieve better results.
TABLE 3
Experimental results on the node classification task.

Data    Metrics    Training  MLP     GCN     GAT     RGCN    HAN     HetGNN  HGT     HGConv
ACM-3   Macro-F1   20%       0.6973  0.8955  0.8852  0.8981  0.8991  0.6727  0.8965  —
                   40%       0.7740  0.9012  0.8993  0.9191  0.9175  0.7736  0.9188  —
                   60%       0.8013  0.9032  0.9053  0.9262  0.9237  0.8060  0.9264  —
                   80%       0.8249  0.9068  0.9063  0.9267  0.9268  0.8242  —       —
        Micro-F1   20%       —       —       —       —       —       —       —       —
                   40%       0.7710  0.8923  0.8903  0.9124  0.9103  0.7709  0.9117  —
                   60%       0.7966  0.8948  0.8968  0.9201  0.9172  0.8016  0.9203  —
                   80%       0.8205  0.8989  0.8981  0.9202  0.9205  0.8190  —       —
ACM-5   Macro-F1   20%       —       —       —       —       —       —       —       —
                   40%       0.6585  0.8317  0.8367  0.8368  0.8404  0.6476  0.8428  —
                   60%       0.7252  0.8440  0.8441  0.8630  0.8526  0.7133  0.8573  —
                   80%       0.7503  0.8448  0.8459  0.8699  0.8610  0.7445  0.8692  —
        Micro-F1   20%       0.6469  0.8364  0.8388  0.8333  0.8334  0.6420  0.8286  —
                   40%       0.6887  0.8433  0.8475  0.8501  0.8525  0.6872  0.8573  —
                   60%       0.7354  0.8545  0.8544  0.8722  0.8626  0.7248  0.8668  —
                   80%       0.7642  0.8554  0.8562  0.8809  0.8715  0.7592  0.8780  —
IMDB    Macro-F1   20%       0.4506  0.5003  0.4998  0.5124  0.5118  0.4281  0.5171  —
                   40%       0.4870  0.5338  0.5350  0.5578  0.5645  0.4865  0.5577  —
                   60%       0.5188  0.5559  0.5640  0.5823  0.5912  0.5110  0.5781  —
                   80%       0.5268  0.5713  0.5698  0.5939  0.6092  0.5239  0.6018  —
        Micro-F1   20%       0.4598  0.5062  0.5072  0.5212  0.5263  0.4533  0.5210  —
                   40%       0.4874  0.5355  0.5378  0.5601  0.5723  0.4942  0.5605  —
                   60%       0.5186  0.5611  0.5669  0.5850  0.5968  0.5146  0.5792  —
                   80%       0.5269  0.5771  0.5757  0.5952  0.6129  0.5237  0.6020  —
TABLE 4
Experimental results on the node clustering task.

Data    Metrics  MLP     GCN     GAT     RGCN    HAN     HetGNN  HGT     HGConv  %Improv.
ACM-3   ARI      0.6105  0.7179  0.7319  0.7973  0.7732  0.6077  0.7944  —       —
Secondly, compared with MLP, the performance of the other methods is significantly improved by taking the graph structure into consideration in most cases, which indicates the power of graph neural networks in considering the information of both nodes and edges.

Thirdly, methods designed for heterogeneous graphs achieve better results than methods designed for homogeneous graphs (i.e., GCN and GAT) in most cases, which demonstrates the necessity of leveraging the properties of different nodes and relations in heterogeneous graphs.

Fourthly, although HetGNN is designed for heterogeneous graph learning, it only achieves competitive or even worse results than MLP. We attribute this phenomenon to the following two reasons: 1) there are several hyperparameters (e.g., the return possibility and length of the random walk, and the numbers of type-grouped neighbors) in HetGNN, making the model difficult to fine-tune; 2) the random walk strategy may break the intrinsic graph structure and lead to structural information loss, especially when the graph structure contains valuable information.

Finally, HGConv outperforms all the baselines consistently with the varying training data ratio in most cases. Compared with MLP, GCN and GAT, HGConv takes both the graph topology and graph heterogeneity into consideration. Compared with RGCN and HAN, HGConv utilizes the specific characteristics of different nodes and relations without the requirement of domain knowledge. Compared with HetGNN, HGConv leverages the intrinsic graph structure directly, which alleviates the structural information loss issue introduced by random walk. Compared with HGT, HGConv learns multi-level representations by the hybrid micro/macro level convolution, which provides HGConv with sufficient representation ability.
The node clustering task is conducted to evaluate the learned node representations. We first obtain the node representations via a feed-forward pass on the trained model and then feed the normalized node representations into the k-means algorithm. We set the number of clusters to the number of real classes for each dataset (i.e., 3, 5 and 3 for ACM-3, ACM-5 and IMDB respectively). We adopt ARI and NMI as evaluation metrics. Since the result of k-means tends to be affected by the initial centroids, we run k-means 10 times and report the average results in Table 4.
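The clustering evaluation protocol just described can be sketched with scikit-learn as follows; the function and argument names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import normalize

def evaluate_clustering(embeddings, labels, num_classes, runs=10, seed=0):
    """L2-normalize node representations, run k-means several times with
    different initial centroids, and report the average ARI and NMI."""
    x = normalize(embeddings)
    aris, nmis = [], []
    for r in range(runs):
        pred = KMeans(n_clusters=num_classes, random_state=seed + r).fit_predict(x)
        aris.append(adjusted_rand_score(labels, pred))
        nmis.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(aris)), float(np.mean(nmis))
```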
Experimental results on the node clustering task show that HGConv outperforms all the baselines, which demonstrates the effectiveness of the learned node representations. Moreover, methods based on GNNs usually obtain better results. We could also observe that methods achieving satisfactory results on the node classification task (e.g., RGCN, HAN and HGT) also perform satisfactorily on the node clustering task, which indicates that a good model could learn more universal node embeddings that are applicable to various tasks.

Fig. 3. Visualization of node representation on ACM-5. Each point indicates a paper and its color denotes the published area.
To make a more intuitive comparison, we also visualize the nodes of the heterogeneous graph in a low-dimensional space. In particular, we project the node representations learned by HGConv into a 2-dimensional space using t-SNE [39]. The visualization of the node representations on ACM-5 is shown in Figure 3, where the color of a node denotes its published area.

From Figure 3, we could observe that the baselines do not achieve satisfactory performance. They either fail to gather papers within the same area together, or could not provide clear boundaries between papers belonging to different areas. HGConv performs best in the visualization, as papers within the same area are closer and the boundaries between different areas are more obvious.
7. Please refer to the appendix for results on ACM-3 and IMDB.

We conduct an ablation study to validate the effect of each component in HGConv. We remove the micro-level convolution, the macro-level convolution and the weighted residual connection from HGConv respectively, and denote the three variants as HGConv w/o Micro, HGConv w/o Macro and HGConv w/o WRC. Detailed implementations of the three variants are as follows:
• HGConv w/o Micro: this variant replaces the micro-level convolution by performing simple average pooling on nodes within the same relation.
• HGConv w/o Macro: this variant replaces the macro-level convolution by performing simple average pooling across different relations.
• HGConv w/o WRC: this variant removes the weighted residual connection in each layer and only uses the aggregated neighbor information as the output of each layer.

Experimental results of the variants and HGConv on the node classification task are shown in Figure 4.
Fig. 4. Effects of the components in the proposed model.
From Figure 4, we could observe that HGConv achieves the best performance when it is equipped with all the components, and removing any component leads to worse results. The effects of the three components vary across datasets, but all of them contribute to the improvement in the final performance. In particular, the micro-level convolution enables HGConv to select more important nodes within the same relation, and the macro-level convolution helps HGConv distinguish the subtle difference across relations. The weighted residual connection provides HGConv with the ability to consider the different contributions of the focal node's inherent attributes and its neighbor information.

Fig. 5. Parameter sensitivity of the proposed model on IMDB.
We also investigate the sensitivity of several parameters in HGConv. We report the results of the node classification task under different parameter settings on IMDB, and the experimental results are shown in Figure 5.
Number of convolution layers. We build HGConv with different numbers of heterogeneous graph convolutional layers and report the results in Figure 5(a). It could be observed that with the increase of layers, the performance of HGConv rises at first and then starts to drop gradually. This indicates that stacking a suitable number of layers helps the model receive information from further neighbors, but too many layers would lead to the overfitting problem.

Number of attention heads. We validate the effect of the multi-head attention mechanism in the hybrid convolution by changing the number of attention heads. The results are shown in Figure 5(b). From the results, we could conclude that increasing the number of attention heads improves the model performance at first. When the number of attention heads is sufficient (e.g., equal to or greater than 4), the performance reaches the top and remains stable.

Dimension of node representation. We also change the dimension of the node representation and report the results in Figure 5(c). We could find that the performance of HGConv grows with the increase of the node representation dimension and achieves the best performance when the dimension is set between 64 and 256 (we select 64 as the final setting). The performance decreases as the dimension grows further because of the overfitting problem.

The principal components in HGConv are the micro-level convolution and the macro-level convolution. Thus, we provide a detailed interpretation to better understand the importance of nodes within the same relation and the difference across relations learned by the hybrid convolution. We first randomly select a sample from ACM-3 and then calculate the normalized attention scores from the last heterogeneous graph convolutional layer. The selected paper $P_v$ proposes an effective ranking-based clustering algorithm for heterogeneous information networks, and it is published in the Data Mining area. The visualization is shown in Figure 6.

Fig. 6. Visualization of the learned attention scores.
Interpretation of the micro-level convolution. It could be observed that in the AP relation, both Jiawei Han and Yizhou Sun have higher attention scores than Yintao Yu among the authors, since the first two authors contribute more to the academic research. In the TP relation, keywords that are more relevant to $P_v$ (i.e., clustering and ranking) have higher attention scores. Moreover, the scores of references that study topics more relevant to $P_v$ are also higher in the PP relation. The above observations indicate that the micro-level convolution could select more important nodes within the same relation by assigning them higher attention scores.

Interpretation of the macro-level convolution. The attention score of the AP relation is much higher than that of the TP or PP relation, in line with the fact that GCN and GAT achieved the best performance on the PAP meta-path. This finding demonstrates that the macro-level convolution could distinguish the importance of different relations automatically without empirical manual design, and the learned importance could implicitly construct more important meta-paths for specific downstream tasks.
6 CONCLUSION
In this paper, we designed a hybrid micro/macro level convolution operation to address several fundamental problems in heterogeneous graph representation learning. In particular, the micro-level convolution aims to learn the importance of nodes within the same relation, and the macro-level convolution attempts to distinguish the subtle difference across different relations. The hybrid strategy enables our model to fully leverage heterogeneous information with proper interpretability by performing convolutions on the intrinsic structure of heterogeneous graphs directly. We also designed a weighted residual connection component to obtain the optimal combination of the focal node's inherent attributes and neighbor information. Experimental results demonstrated not only the superiority of the proposed method, but also the intuitive interpretability of our approach for graph analysis.

ACKNOWLEDGMENTS
This work is supported by the National Key R&D Program of China [grant number 2018YFB2101003], the Science and Technology Major Project of Beijing [grant number Z191100002519012], and the National Natural Science Foundation of China [grant numbers 51778033, 51822802, 51991395, 71901011, U1811463].

REFERENCES

[1] Y. Sun and J. Han, "Mining heterogeneous information networks: A structural analysis approach," SIGKDD Explorations, vol. 14, no. 2, pp. 20–28, 2012.
[2] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, "A survey of heterogeneous information network analysis," IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 17–37, 2017.
[3] L. D. Santos, B. Piwowarski, L. Denoyer, and P. Gallinari, "Representation learning for classification in heterogeneous graphs with application to social networks," ACM Trans. Knowl. Discov. Data, vol. 12, no. 5, pp. 62:1–62:33, 2018.
[4] Y. Zhang, Y. Xiong, X. Kong, S. Li, J. Mi, and Y. Zhu, "Deep collective classification in heterogeneous information networks," in Proceedings of the 2018 World Wide Web Conference, WWW 2018, pp. 399–408.
[5] Y. Sun, C. C. Aggarwal, and J. Han, "Relation strength-aware clustering of heterogeneous information networks with incomplete attributes," Proc. VLDB Endow., vol. 5, no. 5, pp. 394–405, 2012.
[6] Y. Dong, J. Tang, S. Wu, J. Tian, N. V. Chawla, J. Rao, and H. Cao, "Link prediction and recommendation across heterogeneous social networks," in IEEE International Conference on Data Mining, ICDM 2012, pp. 181–190.
[7] X. Li, Y. Shang, Y. Cao, Y. Li, J. Tan, and Y. Liu, "Type-aware anchor link prediction across heterogeneous networks based on graph attention network," in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 147–155.
[8] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized entity recommendation: a heterogeneous information network approach," in Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, pp. 283–292.
[9] C. Shi, B. Hu, W. X. Zhao, and P. S. Yu, "Heterogeneous information network embedding for recommendation," IEEE Trans. Knowl. Data Eng., vol. 31, no. 2, pp. 357–370, 2019.
[10] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
[11] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[12] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in ICLR, 2018.
[13] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in The Semantic Web - 15th International Conference, ESWC 2018, pp. 593–607.
[14] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, "Heterogeneous graph attention network," in The World Wide Web Conference, WWW 2019, pp. 2022–2032.
[15] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, "Heterogeneous graph neural network," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, pp. 793–803.
[16] Z. Hu, Y. Dong, K. Wang, and Y. Sun, "Heterogeneous graph transformer," in WWW '20: The Web Conference 2020, pp. 2704–2710.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[18] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[19] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems 14, NIPS 2001, pp. 585–591.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[21] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[22] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[23] M. Zhang and Y. Chen, "Link prediction based on graph neural networks," in Advances in Neural Information Processing Systems 31, NeurIPS 2018, pp. 5171–5181.
[24] F. Errica, M. Podda, D. Bacciu, and A. Micheli, "A fair comparison of graph neural networks for graph classification," in ICLR, 2020.
[25] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in ICLR, 2014.
[26] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems 29, NIPS 2016, pp. 3837–3845.
[27] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, "Graph neural networks: A review of methods and applications," CoRR, vol. abs/1812.08434, 2018.
[28] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," CoRR, vol. abs/1901.00596, 2019.
[29] X. Fu, J. Zhang, Z. Meng, and I. King, "MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding," in WWW '20: The Web Conference 2020, pp. 2331–2341.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[31] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, NIPS 2013, pp. 3111–3119.
[33] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: Extraction and mining of academic social networks," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008, pp. 990–998.
[34] J. Ramos et al., "Using tf-idf to determine word relevance in document queries," in Proceedings of the First Instructional Conference on Machine Learning, vol. 242, 2003, pp. 133–142.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, NeurIPS 2019, pp. 8024–8035.
[38] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai et al., "Deep graph library: A graph-centric, highly-performant package for graph neural networks," arXiv preprint arXiv:1909.01315, 2019.
[39] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
Le Yu received the B.S. degree in Computer Science and Engineering from Beihang University, Beijing, China, in 2019. He is currently a second-year computer science Ph.D. student in the School of Computer Science and Engineering at Beihang University. His research interests include representation learning, graph neural networks and temporal data mining.

Leilei Sun is currently an assistant professor in the School of Computer Science, Beihang University, Beijing, China. He was a postdoctoral research fellow from 2017 to 2019 in the School of Economics and Management, Tsinghua University. He received his Ph.D. degree from the Institute of Systems Engineering, Dalian University of Technology, in 2017. His research interests include machine learning and data mining.

Bowen Du received the Ph.D. degree in Computer Science and Engineering from Beihang University, Beijing, China, in 2013. He is currently a Professor with the State Key Laboratory of Software Development Environment, Beihang University. His research interests include smart city technology, multi-source data fusion, and traffic data mining.

Chuanren Liu received the B.S. degree from the University of Science and Technology of China (USTC), the M.S. degree from the Beijing University of Aeronautics and Astronautics (BUAA), and the Ph.D. degree from Rutgers, the State University of New Jersey. He is currently an assistant professor with the Business Analytics and Statistics Department at the University of Tennessee, Knoxville, USA. His research interests include data mining and machine learning, and their applications in business analytics.

Weifeng Lv received the B.S. degree in Computer Science and Engineering from Shandong University, Jinan, China, and the Ph.D. degree in Computer Science and Engineering from Beihang University, Beijing, China, in 1992 and 1998, respectively. Currently, he is a Professor with the State Key Laboratory of Software Development Environment, Beihang University, Beijing, China. His research interests include smart city technology and mass data processing.

Hui Xiong is currently a Full Professor at Rutgers, the State University of New Jersey, where he received the 2018 Ram Charan Management Practice Award as the Grand Prix winner from the Harvard Business Review, the RBS Dean's Research Professorship (2016), the Rutgers University Board of Trustees Research Fellowship for Scholarly Excellence (2009), the ICDM Best Research Paper Award (2011), and the IEEE ICDM Outstanding Service Award (2017). He received the Ph.D. degree from the University of Minnesota (UMN), USA. He is a co-Editor-in-Chief of the Encyclopedia of GIS, an Associate Editor of IEEE Transactions on Big Data (TBD), ACM Transactions on Knowledge Discovery from Data (TKDD), and ACM Transactions on Management Information Systems (TMIS). He has served regularly on the organization and program committees of numerous conferences, including as a Program Co-Chair of the Industrial and Government Track for the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), a Program Co-Chair for the IEEE 2013 International Conference on Data Mining (ICDM), a General Co-Chair for the IEEE 2015 International Conference on Data Mining (ICDM), and a Program Co-Chair of the Research Track for the 2018 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. He is an IEEE Fellow and an ACM Distinguished Scientist.
APPENDIX
This appendix provides further details of the experiments.
Node Classification
Experimental results on the node classification task, together with their variations (standard deviations), are shown in Table 5. The hyper-parameter settings of all the methods are shown in Table 6.
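The Macro-F1 and Micro-F1 scores reported in Table 5 can be reproduced with any standard implementation. The sketch below uses scikit-learn; the paper does not publish its evaluation code, so the helper name and the toy labels are illustrative only.

```python
# A minimal, hedged sketch of the metrics in Table 5 using scikit-learn.
from sklearn.metrics import f1_score

def evaluate_node_classification(y_true, y_pred):
    """Return (macro_f1, micro_f1) for predicted node labels."""
    # Macro-F1: unweighted mean of per-class F1 scores.
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    # Micro-F1: F1 computed from global true/false positive counts.
    micro_f1 = f1_score(y_true, y_pred, average="micro")
    return macro_f1, micro_f1

# Toy example: three classes, five test nodes.
print(evaluate_node_classification([0, 1, 2, 1, 0], [0, 1, 1, 1, 0]))
```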
TABLE 5
Experimental results with variations on the node classification task.
[The numeric entries of Table 5 were lost in extraction. The table reports the mean ± standard deviation of Macro-F1 and Micro-F1 on the ACM-3, ACM-5, and IMDB datasets, at training ratios from 20% to 100%, for MLP, GCN, GAT, RGCN, HAN, HetGNN, HGT, and HGConv.]

TABLE 6
Hyper-parameter settings of all the methods.
Data   Hyper-parameter  Ratio  MLP    GCN    GAT    RGCN   HAN    HetGNN  HGT    HGConv
ACM-3  learning rate    20%    0.05   0.05   0.005  0.03   0.01   0.03    0.01   0.008
                        40%    0.03   0.005  0.05   0.003  0.01   0.01    0.008  0.005
                        60%    0.03   0.01   0.03   0.005  0.01   0.01    0.008  0.003
                        80%    0.05   0.01   0.05   0.03   0.01   0.003   0.008  0.001
                        100%   0.005  0.01   0.05   0.03   0.08   0.003   0.003  0.005
       dropout          20%    0.5    0.1    0.7    0.5    0.7    0.5     0.7    0.7
                        40%    0.9    0.0    0.6    0.7    0.8    0.9     0.8    0.8
                        60%    0.9    0.0    0.7    0.7    0.8    0.9     0.7    0.6
                        80%    0.9    0.2    0.7    0.5    0.7    0.9     0.7    0.6
                        100%   0.9    0.5    0.8    0.5    0.6    0.9     0.9    0.8
ACM-5  learning rate    20%    0.01   0.005  0.01   0.03   0.01   0.01    0.008  0.005
                        40%    0.03   0.01   0.05   0.005  0.05   0.01    0.01   0.005
                        60%    0.008  0.03   0.03   0.003  0.08   0.01    0.01   0.003
                        80%    0.01   0.01   0.005  0.003  0.05   0.01    0.01   0.003
                        100%   0.008  0.005  0.03   0.001  0.01   0.01    0.01   0.008
       dropout          20%    0.8    0.5    0.6    0.5    0.5    0.8     0.8    0.5
                        40%    0.8    0.5    0.5    0.5    0.5    0.8     0.9    0.7
                        60%    0.9    0.2    0.7    0.6    0.8    0.8     0.6    0.8
                        80%    0.8    0.4    0.5    0.6    0.9    0.8     0.7    0.8
                        100%   0.9    0.0    0.6    0.5    0.9    0.8     0.6    0.8
IMDB   learning rate    20%    0.01   0.01   0.03   0.01   0.08   0.01    0.01   0.001
                        40%    0.05   0.08   0.01   0.005  0.01   0.01    0.008  0.008
                        60%    0.01   0.003  0.001  0.03   0.001  0.01    0.001  0.001
                        80%    0.001  0.05   0.001  0.01   0.05   0.01    0.01   0.005
                        100%   0.03   0.05   0.003  0.005  0.05   0.003   0.001  0.003
       dropout          20%    0.5    0.1    0.5    0.6    0.7    0.5     0.2    0.4
                        40%    0.9    0.4    0.7    0.5    0.7    0.4     0.6    0.5
                        60%    0.9    0.2    0.7    0.5    0.7    0.4     0.5    0.5
                        80%    0.7    0.3    0.7    0.6    0.8    0.6     0.2    0.4
                        100%   0.4    0.3    0.8    0.5    0.7    0.5     0.2    0.4
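For concreteness, the sketch below shows one way the per-dataset settings in Table 6 could be wired into training, assuming PyTorch and the Adam optimizer, both of which the paper builds on. `HGCONV_HPARAMS` and `make_optimizer` are illustrative names, and only a few HGConv entries from Table 6 are transcribed.

```python
# A minimal sketch, not the paper's actual training harness.
import torch

HGCONV_HPARAMS = {
    # (dataset, train ratio) -> settings; values copied from the
    # HGConv column of Table 6.
    ("ACM-3", "20%"):  {"lr": 0.008, "dropout": 0.7},
    ("ACM-3", "100%"): {"lr": 0.005, "dropout": 0.8},
    ("ACM-5", "60%"):  {"lr": 0.003, "dropout": 0.8},
    ("IMDB",  "100%"): {"lr": 0.003, "dropout": 0.4},
    # ... remaining (dataset, ratio) pairs follow the same pattern.
}

def make_optimizer(model: torch.nn.Module, dataset: str, ratio: str):
    """Build an Adam optimizer with the learning rate from Table 6."""
    cfg = HGCONV_HPARAMS[(dataset, ratio)]
    # The dropout probability cfg["dropout"] would be applied inside the
    # model itself, e.g. via torch.nn.Dropout(p=cfg["dropout"]).
    return torch.optim.Adam(model.parameters(), lr=cfg["lr"])
```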
Node Visualization
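As a hedged illustration of how such a visualization could be produced, the sketch below projects node embeddings to 2-D with t-SNE, which the paper uses for visualizing learned representations. The `embeddings` and `labels` arrays are random stand-ins for the learned node embeddings and node classes.

```python
# A minimal sketch, assuming t-SNE is used to project node embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64))  # stand-in for learned embeddings
labels = rng.integers(0, 3, size=300)    # stand-in for node class labels

# Project the 64-dimensional embeddings to two dimensions.
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab10")
plt.title("t-SNE projection of node embeddings")
plt.show()
```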