GCN for HIN via Implicit Utilization of Attention and Meta-paths
Di Jin, Zhizhi Yu, Dongxiao He, Carl Yang, Philip S. Yu, Jiawei Han
Abstract—Heterogeneous information network (HIN) embedding, aiming to map the structure and semantic information in a HIN to distributed representations, has drawn considerable research attention. Graph neural networks for HIN embedding typically adopt a hierarchical attention (including node-level and meta-path-level attentions) to capture the information from meta-path-based neighbors. However, this complicated attention structure often cannot achieve the function of selecting meta-paths due to severe overfitting. Moreover, when propagating information, these methods do not distinguish direct (one-hop) meta-paths from indirect (multi-hop) ones. But from the perspective of network science, direct relationships are often believed to be more essential, and only they should be used to model direct information propagation. To address these limitations, we propose a novel neural network method via implicitly utilizing attention and meta-paths, which can relieve the severe overfitting brought by the current over-parameterized attention mechanisms on HIN. We first use the multi-layer graph convolutional network (GCN) framework, which performs a discriminative aggregation at each layer, along with stacking the information propagation of direct linked meta-paths layer by layer, realizing the function of attention for selecting meta-paths in an indirect way. We then give an effective relaxation and improvement via introducing a new propagation operation which can be separated from aggregation. That is, we first model the whole propagation process with well-defined probabilistic diffusion dynamics, and then introduce a random graph-based constraint which allows it to reduce noise as the number of layers increases. Extensive experiments demonstrate the superiority of the new approach over state-of-the-art methods.
Index Terms—Heterogeneous information networks, Graph neural networks, Network embedding.
1 INTRODUCTION

Heterogeneous information networks (HINs) [1] [2] [3], which involve a diversity of node types and relationships between nodes, can better model and solve many real-world problems than homogeneous networks. For HIN analysis, an important concept is the meta-path [4] [5], which is composed of a sequence of relationships between two nodes. For example, the movie network of IMDB contains three types of nodes, including movies, directors and actors. The relationship between two movies can be described by meta-paths such as Movie-Actor-Movie (MAM) and Movie-Director-Movie (MDM), where MAM denotes the movies starring the same actor, and MDM denotes the movies directed by the same director.

Network embedding [6] [7], which aims to learn distributed representations of nodes in networks, is considered an effective method for network mining and has been widely studied in homogeneous networks. Recently, researchers have also proposed methods for HIN embedding, such as random walk-based methods [8] [9] and relation learning-based methods [10] [11], many of which rely on the concept of meta-path. In particular, with the great success of deep learning, graph neural network-based HIN embedding methods (such as HAN [12] and MAGNN [13]) have been proposed very recently.

• D. Jin, Z. Yu and D. He are with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China. E-mail: {jindi, yuzhizhi, hedongxiao}@tju.edu.cn.
• C. Yang is with the Department of Computer Science, Emory University, Georgia 30322 USA. E-mail: [email protected].
• P. S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60661 USA. E-mail: [email protected].
• J. Han is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: [email protected].

These methods often
adopt a hierarchical attention structure, which uses node-level attention to aggregate information inside each meta-path and meta-path-level attention to fuse information from different meta-paths.

While these graph neural network-based methods have achieved great success in HIN embedding, they still suffer from some essential issues. First, while attention has been widely used in fields such as NLP, the complicated hierarchical attention structure may not be so effective in HIN embedding, since there is often little training data available in HINs and information from one network can hardly be transferred to another. It is therefore difficult for graph neural networks to train these hierarchical attentions well (particularly the meta-path-level attention, which is meant to evaluate the essential importance of different meta-paths), making them hard to really achieve the goal of selecting meta-paths, especially since severe overfitting is common in practice. Second, these existing methods often treat meta-paths of different lengths, such as direct linked meta-paths (e.g., Movie-Director) and indirect linked meta-paths (e.g., Movie-Director-Movie), indistinguishably for information propagation. However, from the perspective of network science, while direct links can propagate information directly, indirect links should propagate information indirectly, and the information propagation on direct links is more essential. Therefore, for meta-paths with length longer than one (which makes the paths indirect), it is intuitive that the information should be propagated indirectly rather than directly. Fortunately, we find that the graph convolutional network (GCN) [14] itself can partly overcome this limitation. It lets direct linked meta-paths propagate information directly at each layer, and indirect linked meta-paths propagate information
indirectly via the stacked layers of the deep neural network. More importantly, it has already encoded the information of all meta-paths via the multi-layer propagation in an implicit way. However, GCN does not distinguish the importance of information from different meta-paths in either its propagation or aggregation process, which makes it not directly suitable for HIN embedding.

To utilize the advantage of GCN of implicitly encoding all meta-paths, while overcoming the difficulty of distinguishing their importance in an effective way, we propose a novel GCN-based approach for heterogeneous information networks via Implicit utilization of Attention and Meta-paths, referred to as GIAM. We first introduce a naive model. It uses the direct linked meta-paths alone for information propagation, and utilizes a new aggregation mechanism at each layer, along with the stacked-layer propagation, to implicitly achieve the role of attention for selecting meta-paths. In this way, we realize the selection of different meta-paths within GCN itself (rather than using attention directly, which may lead to overfitting). Meanwhile, we make an effective refinement. That is, we replace the spectral filter of GCN from the symmetric normalized graph Laplacian to an equivalent asymmetric one, along with removing activation, modeling the propagation as a continuous Markov dynamics.
We then introduce an effective Random graph-based Propagation Constraint principle, namely RPC: if a propagation path on the given network is no better than that on the corresponding random graph, there is no reason to continue this path propagation. This makes the whole propagation process more effective via filtering out more impurity information.

To summarize, the main contributions of this paper are as follows:

• We find that the hierarchical attention structure adopted by many HIN-specific graph neural networks can hardly achieve the essential selection of meta-paths (due to severe overfitting); meanwhile, these methods do not distinguish one-hop and multi-hop meta-paths in the propagation process.

• We propose a new approach to solve these problems. It uses only direct linked meta-paths for direct propagation and realizes indirect propagation by stacking layers of direct propagations. We distinguish the importance of information from different meta-paths (in this process) via effective algorithmic mechanisms rather than using attention directly.

• Extensive experiments on different network analysis tasks demonstrate the superiority of the proposed new approach over state-of-the-art methods.

The rest of the paper is organized as follows. Section 2 introduces a motivating example. Section 3 gives the problem definitions and introduces GCN. Section 4 proposes the new approach for HIN embedding. In Section 5, we conduct extensive experiments. Finally, we discuss related work in Section 6 and conclude in Section 7.
2 MOTIVATING EXAMPLE
To verify whether using meta-path-level attention can effectively evaluate the importance of different meta-paths, we conduct experiments on two widely-used heterogeneous
TABLE 1: The performance of HAN and MAGNN when using (and not using) meta-path-level attention, as well as our new approach GIAM, on IMDB and DBLP. 'Y' denotes using meta-path-level attention and 'N' not using it. '-' denotes our new idea of using algorithmic mechanisms rather than attention to learn relationships of meta-paths. The attention distribution is the learned weights of importance of the different meta-paths.
Datasets  Meta-paths         Models  Attention  Attention distribution  Macro-F1  Micro-F1
IMDB      MDM, MAM           HAN     Y          [0.78, 0.22]            57.67     57.79
                                     N          [0.50, 0.50]            58.93     59.02
                             MAGNN   Y          [0.57, 0.43]            57.60     57.72
                                     N          [0.50, 0.50]            58.30     58.50
                             GIAM    -          -                       59.58     59.86
DBLP      APA, APVPA, APTPA  HAN     Y          [0.258, 0.736, 0.006]   92.69     93.20
                                     N          [0.333, 0.333, 0.333]   92.47     93.04
                             MAGNN   Y          [0.022, 0.969, 0.009]   93.19     93.67
                                     N          [0.333, 0.333, 0.333]   90.42     91.08
                             GIAM    -          -                       93.63     94.10

information networks, i.e., IMDB and DBLP. We select three graph neural network-based HIN embedding methods, i.e., HAN, MAGNN and our new approach GIAM (which will be introduced in Section 4 below). Since HAN and MAGNN require a candidate meta-path set, and our GIAM can also support this option, we use the same choices as the existing work [12] [13], i.e., {MDM, MAM} for IMDB ('M/D' stands for Movie/Director and 'A' stands for Actor) and {APA, APVPA, APTPA} for DBLP ('A/P' stands for Author/Paper and 'V/T' stands for Venue/Term), which are often believed to be the essential meta-paths for node classification in these networks. We compare HAN (and MAGNN) using and not using meta-path-level attention, as well as our new idea (GIAM) of using algorithmic mechanisms (rather than attention) to learn relationships of meta-paths. We first obtain each method's embedding on each dataset (according to the experimental settings in Section 5), and then feed the embeddings to an SVM classifier with different ratios (i.e., 5%-80%) of supervised information. We report the average accuracy over these ratios, in terms of Macro-F1 and Micro-F1, as shown in Table 1; the detailed accuracy for each ratio of supervised information is given in the Appendix.

As shown, on IMDB, it is surprising that the methods (HAN and MAGNN) using meta-path-level attention are always no better than those not using it.
Concretely, HAN with meta-path-level attention easily obtains a sharply peaked attention distribution, in which one dominant meta-path receives most of the attention (i.e., the distribution [0.78, 0.22] on {MDM, MAM}). Though this seems to achieve a good evaluation of the importance of different meta-paths, the accuracy is surprisingly reduced. This may be mainly due to overfitting, preventing the method from really selecting the correct meta-paths. Differently, MAGNN with meta-path-level attention tends to obtain a smooth attention distribution, i.e., [0.57, 0.43] on {MDM, MAM}. While the learned attention values differ only slightly, the accuracy is still not improved compared with not using attention. On the other hand, on DBLP, the methods (HAN and MAGNN) using meta-path-level attention perform slightly better than those not using it. Since these models can be trained much better on DBLP, reaching high accuracy (compared with IMDB), they may relieve overfitting and make attention effective to some extent. But in both settings, our new approach GIAM, which uses the specially designed algorithmic mechanisms (rather than attention) to learn relationships of meta-paths, stably performs the best.

To further verify whether overfitting is the main reason that meta-path-level attention does not help evaluate the importance of different meta-paths effectively, we conduct extra experiments on IMDB using HAN as an example. We show the training loss (and validation loss) as a function of the number of training iterations. Fig. 1(a) shows the result of HAN using meta-path-level attention, and Fig. 1(b) that of not using it. As shown, when using meta-path-level attention, as the training loss decreases, the validation loss first decreases but then increases significantly, which is a clear overfitting phenomenon. Differently, the overfitting issue is relatively slight when not using the meta-path-level attention.
This partly validates that the meta-path-level attention may fail to achieve the essential selection and evaluation of the importance of different meta-paths, especially when the model is hard to train well (which is often the case in many real-life network analysis tasks).

Fig. 1: The training loss (and validation loss) as a function of the number of training iterations for HAN on IMDB. (a) shows the result of using meta-path-level attention and (b) that of not using it.
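For reference, the Macro-F1 and Micro-F1 scores reported in Table 1 can be computed as follows. This is a minimal NumPy sketch on toy labels (the label arrays are illustrative, not results from the paper): Macro-F1 averages the per-class F1 scores, while Micro-F1 pools the true/false positive counts over all classes.

```python
import numpy as np

def f1_scores(y_true, y_pred, n_classes):
    """Return (macro_f1, micro_f1) for integer-labeled predictions."""
    tp = np.zeros(n_classes)
    fp = np.zeros(n_classes)
    fn = np.zeros(n_classes)
    for c in range(n_classes):
        tp[c] = np.sum((y_pred == c) & (y_true == c))
        fp[c] = np.sum((y_pred == c) & (y_true != c))
        fn[c] = np.sum((y_pred != c) & (y_true == c))
    # per-class F1 = 2*tp / (2*tp + fp + fn); guard empty classes
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    macro = per_class.mean()
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    return macro, micro

# toy predictions, three classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
macro, micro = f1_scores(y_true, y_pred, 3)
```

For single-label multi-class predictions like these, Micro-F1 coincides with plain accuracy, while Macro-F1 weighs small classes equally with large ones.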
3 PRELIMINARIES
We first introduce the problem definitions, and then discuss GCN, which serves as the base of our new approach.
Definition 1. Heterogeneous Information Network. A heterogeneous information network is defined as a network $G(V, E, F, R, \phi, \varphi)$, where $V$ is the set of multiple types of nodes, $E$ the set of multiple types of edges, and $F$ and $R$ the sets of node and edge types. Each node $u \in V$ is associated with a node type via the mapping function $\phi: V \to F$, and each edge $e \in E$ with an edge type via the mapping function $\varphi: E \to R$. $G$ is a heterogeneous information network when $|F| + |R| > 2$.

Definition 2. Adjacency Matrix of a Heterogeneous Information Network. Inspired by homogeneous networks, we define the adjacency matrix of a heterogeneous information network $G$ as $A = (a_{uv})_{n \times n}$, where $a_{uv} = 1$ if there is an edge between nodes $u$ and $v$, and 0 otherwise, and $n = |V|$ is the number of nodes. The degree matrix of $G$ is then $D = \mathrm{diag}(d_1, ..., d_n)$, where $d_u = \sum_v a_{uv}$, i.e., the number of edges incident to node $u$.

Definition 3. Meta-path. A meta-path $m$ is defined as a path of the form $F_1 \xrightarrow{R_1} F_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} F_{l+1}$ (abbreviated as $F_1 F_2 \cdots F_{l+1}$), where the $F_i$ and $R_i$ are node and edge types, respectively. It represents a compositional relation between two given node types.

Definition 4. Meta-path-based Neighbors. Given a meta-path $m$ of a heterogeneous information network, the meta-path-based neighbors $N_u^m$ of node $u$ are defined as the set of nodes that connect with node $u$ via meta-path $m$. Note that $N_u^m$ includes $u$ itself if $m$ is symmetric.

Definition 5. Heterogeneous Information Network Embedding. Given a heterogeneous information network $G$, the task is to learn a $d$-dimensional distributed representation $H \in \mathbb{R}^{|V| \times d}$ ($d \ll |V|$) that captures the rich structural and semantic information involved in $G$.

Spectral graph convolutional neural networks (GCN) were proposed by Bruna et al.
[15] to analyze graph data. They define the spectral graph convolution as the product of a signal $x$ and a filter $g_\theta = \mathrm{diag}(\theta)$, where $\theta$ is a vector of coefficients in the Fourier domain. The spectral graph convolution is then performed as $g_\theta \star x = U g_\theta U^T x$, where $U^T x$ is the graph Fourier transform of $x$ and $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^T$ (where $I$ is the identity matrix and $\Lambda$ the diagonal matrix of eigenvalues). Since the eigendecomposition of $L$ on a large graph is very expensive, Defferrard et al. [16] suggest using a $K$-th order Chebyshev polynomial expansion to approximate the filter, i.e., $g_{\theta'} \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda})$, where $\theta'_k$ is the $k$-th Chebyshev coefficient and $\tilde{\Lambda} = (2/\lambda_{max})\Lambda - I$ ($\lambda_{max}$ is the largest eigenvalue of $L$). By substituting this into $g_\theta \star x = U g_\theta U^T x$ and adopting $\tilde{L} = (2/\lambda_{max}) L - I$, we have $g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L}) x$.

Furthermore, Kipf et al. [14] propose to use $K = 1$ and $\lambda_{max} = 2$ to obtain the simplified graph convolution operation of GCN, $g_\theta \star x \approx \theta (I + D^{-1/2} A D^{-1/2}) x$. In addition, by introducing the renormalization $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ (where $\tilde{A} = A + I$ and $\tilde{D} = \mathrm{diag}(\tilde{d}_1, ..., \tilde{d}_n)$ with $\tilde{d}_u = \sum_v \tilde{a}_{uv}$), the classic two-layer GCN can be defined as:

$\hat{Y} = \mathrm{softmax}(\hat{A}\, \mathrm{ReLU}(\hat{A} H^{(0)} W^{(0)}) W^{(1)})$   (1)

where $H^{(0)}$ is the node feature matrix, $W^{(0)}$ (and $W^{(1)}$) are the weight matrices of the neural network, and $\hat{Y}$ is the final output for the assignment of node labels.

Fig. 2: An illustrative example of using GCN on the heterogeneous information network DBLP. The inner (red) circle represents the first layer and the outer (black) circle the second layer.
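Equation (1) can be illustrated numerically. The sketch below builds the renormalized filter $\hat{A}$ and runs a two-layer forward pass on a toy path graph with random weights (an illustration under assumed shapes, not the authors' implementation):

```python
import numpy as np

def normalized_adj(A):
    """Renormalized filter: Â = D̃^{-1/2} Ã D̃^{-1/2} with Ã = A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(X):
    """Row-wise softmax with the usual max-shift for stability."""
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(A, H0, W0, W1):
    """Ŷ = softmax(Â ReLU(Â H0 W0) W1), i.e., Eq. (1)."""
    A_hat = normalized_adj(A)
    H1 = np.maximum(A_hat @ H0 @ W0, 0.0)  # ReLU
    return softmax(A_hat @ H1 @ W1)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path
H0 = rng.normal(size=(3, 4))          # node features
W0 = rng.normal(size=(4, 8))          # first-layer weights
W1 = rng.normal(size=(8, 2))          # second-layer weights
Y = two_layer_gcn(A, H0, W0, W1)      # one probability row per node
```

Each row of `Y` is a probability distribution over the two output classes, which is what the softmax output layer produces.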
While GCN works very well on homogeneous networks, it is not directly suitable for heterogeneous information networks with different types of nodes and edges [17].

Fig. 3: The structure of the naive model. It propagates and aggregates the information of the direct linked meta-path-based neighbors repeatedly via k layers. The part in the red box is the core content.

We now analyze the advantages and disadvantages of using GCN on heterogeneous information networks, taking DBLP (with four types of nodes: author, paper, venue and term) as an example. As shown in Fig. 2, in the first layer of GCN (the inner circle in the figure), we can realize direct information propagation via direct linked meta-paths (e.g., Paper-Author). By stacking the second layer (the outer circle), we achieve the indirect information propagation of meta-paths of length 2, such as Term-Paper-Author and Venue-Paper-Author, with the help of the stacked direct linked meta-path propagation. By adopting a multi-layer GCN, we can then let the direct linked meta-paths propagate information directly while the indirect linked meta-paths propagate information indirectly, covering meta-paths of different lengths. However, on heterogeneous information networks, GCN treats the information from different meta-paths equally in both propagation and aggregation, without distinguishing their importance, which is precisely the main limitation we overcome in this work.

4 METHODOLOGY
We first propose a naive model to solve the issue of GCN on heterogeneous information networks (HINs), then refine the model by introducing a continuous Markov propagation process, and finally give some optional tricks for implementation.
In the first model, we use the classic multi-layer GCN as the basic framework, and then introduce a discriminative mechanism to aggregate information from the neighbors with direct linked meta-paths. The structure of this model is illustrated in Fig. 3.

The novel aggregation mechanism consists of two parts: the aggregation of instances under the same meta-path (which we call the intra aggregation) and the aggregation across different meta-paths (which we call the inter aggregation). Specifically, in the intra aggregation, we adopt the same summation as GCN to aggregate the information from the neighbors of the same direct linked meta-path. Mathematically, let $\tau: (u, v) \to m \in M$ be the meta-path mapping function, where $M$ is the set of direct linked meta-paths. It takes a node pair $(u, v)$ as input, and outputs the direct linked meta-path $m$ between nodes $u$ and $v$. Let $h_u^{(k-1)}$ be the embedding of node $u$ at the $(k{-}1)$-th layer, and $h_u^{(0)}$ the feature vector of node $u$. Then, for each $u \in V$, its embedding for the direct linked meta-path $m$ at the $k$-th layer, $e_u^{(m,k)}$, is updated as:

$e_u^{(m,k)} = \sum_{v \in N_u} \delta(\tau(u, v), m) \frac{1}{\sqrt{\tilde{d}_u \tilde{d}_v}} h_v^{(k-1)}, \quad \forall m \in M$   (2)

where $\tilde{d}_u$ is the degree of node $u$ in $G$ with self-edges (as defined in (1)), $N_u$ is the set of direct linked meta-path-based neighbors of node $u$, and $\delta(\cdot, \cdot)$ is a Kronecker delta function, which only includes the nodes connected to node $u$ by the direct linked meta-path $m$. Since there are $|M|$ different direct linked meta-paths, for each node $u$ we will obtain $|M|$ meta-path-type embeddings.
In this case, we adopt another aggregation function, i.e., concatenation $\Vert$, to aggregate the embeddings of the different direct linked meta-paths, that is:

$g_u^{(k)} = \big\Vert_{m \in M}\, e_u^{(m,k)}$   (3)

With the obtained $g_u^{(k)}$, the $k$-th layer embedding of node $u$ is then given by a mapping function along with a non-linear transform:

$h_u^{(k)} = \sigma(g_u^{(k)} \cdot W^{(k-1)})$   (4)

where $W^{(k-1)}$ is the mapping matrix and $\sigma(\cdot)$ the non-linear activation function. To simplify the expression, we use a new operator $\circ$ to denote the incorporation of the above two types of aggregation on matrices. The matrix form of the $k$-th layer embeddings can then be defined as:

$H^{(k)} = \sigma((\hat{A} \circ H^{(k-1)}) W^{(k-1)})$   (5)

To better understand how this naive model distinguishes the importance of information from different meta-paths during both propagation and aggregation, we give a brief explanation on the heterogeneous information network DBLP as an example. As shown in Fig. 4, in each layer, we use the direct linked meta-paths within the black circle to propagate information. We adopt summation to aggregate information from each type of neighbors linked by the same one-hop meta-path (e.g., Author-Paper), use concatenation to aggregate information across different one-hop meta-paths (e.g., Author-Paper and Term-Paper), and then feed the result to the neural network. This distinguishes the importance of information from different meta-paths in an implicit and indirect way, i.e., it utilizes the new discriminative aggregation as well as the mapping function of the neural network, rather than using attention directly. Furthermore, we extend the propagation range by stacking layer upon layer, and thereby realize the distinction of meta-paths of different lengths (e.g., Author-Paper-Term and Author-Paper-Term-Paper), with the help of the interaction of the multi-layer propagation of the one-hop meta-paths as well as the bi-level aggregation mechanism in each layer.

Fig. 4: An illustrative example of using the naive model on DBLP. The blue arc (summation) represents the aggregation of information from the same type of neighbors linked by a one-hop meta-path, the green arc (concatenation) denotes the aggregation of information across different one-hop meta-paths, and the brown arc (neural network mapping and activation) denotes the selection of different meta-paths via the inherent algorithmic mechanism (i.e., implicit utilization of attention).
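A single layer of the naive model (Eqs. (2)-(5)) can be sketched as follows. The toy DBLP-like fragment, the `edge_types` dictionary standing in for the mapping function τ, the degree values, and the use of tanh for the activation σ are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def naive_layer(edge_types, H, W, meta_paths, deg):
    """One layer of the naive model: a degree-normalized sum over
    neighbors of each one-hop meta-path type (Eq. (2)), concatenation
    across meta-path types (Eq. (3)), then a linear map with tanh
    standing in for sigma (Eq. (4))."""
    n, d = H.shape
    parts = []
    for m in meta_paths:
        E = np.zeros((n, d))
        for (u, v), t in edge_types.items():
            if t == m:  # the Kronecker delta: keep only edges of type m
                E[u] += H[v] / np.sqrt(deg[u] * deg[v])
        parts.append(E)
    G = np.concatenate(parts, axis=1)  # inter aggregation
    return np.tanh(G @ W)

# toy DBLP-like fragment: nodes 0-1 are papers, 2 an author, 3 a term
edge_types = {(0, 2): "PA", (1, 2): "PA", (2, 0): "AP", (2, 1): "AP",
              (0, 3): "PT", (1, 3): "PT", (3, 0): "TP", (3, 1): "TP"}
deg = np.array([3.0, 3.0, 3.0, 3.0])  # degrees including self-edges
rng = np.random.default_rng(1)
H0 = rng.normal(size=(4, 5))
W = rng.normal(size=(4 * 5, 6))       # maps the 4 concatenated parts
H1 = naive_layer(edge_types, H0, W, ["PA", "AP", "PT", "TP"], deg)
```

Note how the concatenation forces every node to contribute a slot for every meta-path type; nodes lacking a given type produce an all-zero slot here, which is exactly the alignment problem discussed next.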
In fact, while this naive model seems able to cover different meta-paths and distinguish their importance in both propagation and aggregation in an ideal way, it possesses an inherent limitation: many nodes do not have the same (or a complete) set of one-hop meta-path types due to the sparsity of the HIN, which makes an effective concatenation in this new aggregation process difficult. Take DBLP as an example: some paper nodes may have no links under the meta-path Paper-Author, while some others may have no links under Paper-Term. In this case, we cannot align the embeddings of these nodes after concatenation. One can only use non-informative vectors (e.g., all-one or all-zero vectors) to fill in the missing types to make them complete. This, however, significantly lowers the performance of the model, especially when stacking multiple layers.
To overcome the limitation of the naive model, we introduce an effective relaxation and improvement. That is, we first perform a $k$-step propagation, and then the discriminative aggregation. In the new propagation process, we replace the spectral filter of GCN from the symmetric graph Laplacian to an equivalent asymmetric one, and then remove activation, in order to make it a continuous Markov dynamics. We then introduce a random graph-based cut mechanism to constrain its free expansion, enabling the propagation to avoid including too much harmful information as the number of layers increases. The structure of this model is illustrated in Fig. 5. In the following, we introduce it from two perspectives, i.e., probabilistic propagation and discriminative aggregation.

First, we refine the propagation process of GCN. We adopt the asymmetric normalized graph Laplacian $P = \tilde{D}^{-1} \tilde{A}$, also called the Markov transition probability matrix, as the filter to perform propagation, where $\tilde{A} = A + I$ ($A$ is the adjacency matrix of $G$ and $I$ the identity matrix), and $\tilde{D} = \mathrm{diag}(\tilde{d}_1, ..., \tilde{d}_n)$ with $\tilde{d}_u = \sum_v \tilde{a}_{uv}$. According to spectral graph theory [18], $P$ has the same spectrum range as the original spectral filter $\hat{A}$ of GCN (defined in (1)), and thus possesses the same ability to serve as a low-pass filter for propagation. Meanwhile, we remove the activation functions on all layers except for the output layer (which uses softmax), which does not decrease the model's performance, as guaranteed by [18]. These two steps make the propagation a continuous Markov dynamics process.
The new propagation rule can be defined as:

$P^{(k)} = P^{(k-1)} \cdot P$   (6)

where $P^{(0)} = I$.

On the other hand, the above propagation process in graph convolution can also be viewed as a $k$-step Markov random walk from the perspective of probabilistic diffusion. Formally, given a heterogeneous information network $G$, the transition probability from node $u$ to node $v$ within a one-step random walk is:

$p_{uv} = \frac{\tilde{a}_{uv}}{\sum_r \tilde{a}_{ur}}$   (7)

Then, after walking $k$ steps, the transition probability from node $u$ to node $v$ can be calculated iteratively by:

$z_{uv}^{(k)} = \sum_{r=1}^{n} z_{ur}^{(k-1)} p_{rv}$   (8)

where $z_{uu}^{(0)} = 1$ and $z_{uv}^{(0)} = 0$ for $u \neq v$. The above process can also be written in matrix form as:

$Z^{(k)} = Z^{(k-1)} \cdot P \quad \mathrm{s.t.} \quad Z^{(0)} = I$   (9)

where the $k$-step transition probability matrix $Z^{(k)}$ equals the propagation matrix $P^{(k)}$ of (6) in graph convolution. More interestingly, according to spectral graph theory [19], a random walk whose number of steps lies between the entering and exiting times of the $c$-th local mixing state (of this Markov dynamics) shows the clearest $c$-category structure. So, this new probabilistic perspective brings a byproduct: we can estimate the optimal number of propagation layers of the graph convolution. To be specific, given a network $G$ with Markov matrix $P$, the local mixing times of random walks on it can be estimated using the spectrum of its corresponding Markov generator $M = I - P$, where $M$ is positive semi-definite and has $n$ non-negative real-valued eigenvalues ($0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n \leq 2$). Let $T_c^{ent}$ and $T_c^{ext}$ be the entering and exiting times of the $c$-th local mixing state; then $T_c^{ext} = \frac{1}{\lambda_c}(1 + o(1))$.

Fig. 5: The structure of the model with constrained Markov propagation. The part in the red box is the core improvement and relaxation compared to the naive model.
Reasonably, we can use the exiting time of the $(c{+}1)$-th local mixing state to estimate the entering time of the $c$-th local mixing state, i.e., $T_c^{ent} = T_{c+1}^{ext} = 1/\lambda_{c+1}$. The calculated $T_c^{ent}$ and $T_c^{ext}$ can then be taken as the floor and ceiling of the optimal number of propagation layers for a $c$-classification problem.

However, first, it is too time-consuming to calculate the eigenvalues for determining the number of propagation layers, which in general needs $O(n^3)$ time. Second, even within the expected range of the optimal number of layers, the propagation still inevitably introduces impurity information, which decreases the convolution's performance. To overcome these drawbacks, we introduce the new RPC principle: if a propagation path on a given network (with clusters) is no better than that on its corresponding random graph, we have no reason to continue this propagation path. This not only enables the propagation to filter out more noise information, but also makes it less sensitive to the number of layers (which may then be set to a relatively large value, e.g., 10). To be specific, given a heterogeneous information network $G = (V, E)$, we first calculate its corresponding random graph $G' = (V, E')$, which has the same node degree distribution as $G$ while containing no structural information useful for classification. We adopt the popular null model of modularity [20], which describes random graphs obtained by rewiring edges randomly among nodes with given node degrees, and is exactly suitable for this work. Let $\tilde{A} = (\tilde{a}_{uv})_{n \times n}$ be the adjacency matrix of $G$ with self-edges, and $\tilde{D} = \mathrm{diag}(\tilde{d}_1, ..., \tilde{d}_n)$ the degree matrix with $\tilde{d}_u = \sum_v \tilde{a}_{uv}$.
Then, based on this null model, the expected number of links (or expected link weight) between nodes $u$ and $v$ can be written as:

$a'_{uv} = \frac{\tilde{d}_u \tilde{d}_v}{\sum_{r=1}^{n} \tilde{d}_r}$   (10)

which forms the adjacency matrix $A' = (a'_{uv})_{n \times n}$ of $G'$. On this random graph, the one-step transition probability from node $u$ to node $v$ is:

$q_{uv} = \frac{a'_{uv}}{\sum_r a'_{ur}}$   (11)

Using it as a constraint on each step of the random walk on $G$, we get a constrained Markov dynamics. That is, the transition probability from node $u$ to node $v$ after $k$ steps of the constrained walk, i.e., $s_{uv}^{(k)}$, can be calculated iteratively by:

$s'^{(k)}_{uv} = \max\Big(\sum_{r=1}^{n} s_{ur}^{(k-1)} p_{rv} - \sum_{r=1}^{n} s_{ur}^{(k-1)} q_{rv},\; 0\Big), \quad s_{uv}^{(k)} = \frac{s'^{(k)}_{uv}}{\sum_{r=1}^{n} s'^{(k)}_{ur}}$   (12)

where $\sum_{r=1}^{n} s_{ur}^{(k-1)} p_{rv}$ denotes the $k$-step transition probability from node $u$ to node $v$ on $G$, while $\sum_{r=1}^{n} s_{ur}^{(k-1)} q_{rv}$ is the corresponding probability on the random graph $G'$, after $k{-}1$ steps of the constrained walk. We remove the negative values of $s_{uv}^{(k)}$ and normalize after each step (since a probability distribution should be non-negative and sum to 1). Then, letting $S^{(k)} = (s_{uv}^{(k)})_{n \times n}$, $P = (p_{uv})_{n \times n}$, $Q = (q_{uv})_{n \times n}$ and $D_s = \mathrm{diag}(d_1^s, ..., d_n^s)$ with $d_u^s = \sum_v s'_{uv}$, the above process can be rewritten in matrix form as:

$S'^{(k)} = \max(S^{(k-1)} \cdot P - S^{(k-1)} \cdot Q,\; 0), \quad S^{(k)} = D_s^{-1} \cdot S'^{(k)}$   (13)

Finally, we obtain the $k$-step transition probability matrix $S^{(k)}$ based on the constrained Markov dynamics, which serves as a better propagation matrix for graph convolution. To illustrate how the propagation matrix based on the unconstrained (and constrained) Markov dynamics changes with the number of layers, we take a simple Newman artificial network [21] as an example.
The network consists of 128 nodes divided into four categories of 32 nodes each. Each node has on average 14 edges connecting to nodes of the same category and 2 edges connecting to nodes of other categories, as shown in Fig. 6(a). For this four-class problem, we first calculate the spectrum of its Markov generator (Fig. 6(b)), and then derive the entering time and exiting time of the 4-th local mixing state, i.e., ∼2 and ∼6, corresponding to the floor and ceiling of the optimal number of layers (Fig. 6(c)). Figs. 6(d), (e) and (f) show the propagation matrices after 2, 6 and 10 steps (or layers) of random walk. As shown, while the propagation matrices between the 2nd and 6th layers are relatively clear, some impurity information is still introduced. With a further increase of the propagation layers, e.g., reaching 10 layers, it becomes hard to filter out the impurity information any more. However, after introducing the constraint mechanism, the propagation matrices of the 2nd and 6th layers are much clearer (Figs. 6(g) and (h)). More importantly, almost no impurity information is introduced as the number of layers increases, e.g., reaching 10 layers as shown in Fig. 6(i). This further verifies that the new constrained Markov dynamics can suppress the integration of impurity information during propagation, making it more robust and effective.

Fig. 6: An example illustrating how the propagation matrix changes with an increasing number of propagation layers under the unconstrained (and constrained) Markov dynamics. (a) A simple Newman artificial network; (b) the spectrum of its Markov generator; (c) the exiting (and entering) time of each local mixing state. (d), (e) and (f) show the propagation matrices after 2, 6 and 10 layers of the unconstrained Markov propagation (corresponding to the entering time and exiting time of the 4-th local mixing state, as well as a longer time). (g)-(i) show the propagation matrices after introducing the new constraint mechanism, corresponding to (d)-(f) respectively.
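The behavior illustrated in Fig. 6 can be reproduced in miniature. The sketch below (illustrative only; the benchmark generator and the random seed are our assumptions) builds a Newman-style four-group network and tracks how much probability mass a node keeps inside its own group under the unconstrained walk versus the constrained dynamics of Eq. (13):

```python
import numpy as np

rng = np.random.default_rng(0)

# A Newman-style benchmark in miniature (generator and seed are assumptions
# of this sketch): 4 groups of 32 nodes, with on average 14 within-group
# and 2 cross-group edges per node, as in Fig. 6(a).
n_groups, size = 4, 32
n = n_groups * size
p_in, p_out = 14 / (size - 1), 2 / (n - size)
labels = np.repeat(np.arange(n_groups), size)
same = labels[:, None] == labels[None, :]
upper = rng.random((n, n)) < np.where(same, p_in, p_out)
A = np.triu(upper, 1).astype(float)
A = A + A.T                                    # undirected, no self-loops

deg = A.sum(1)
P = A / deg[:, None]                           # unconstrained random walk
Q = np.outer(deg, deg) / deg.sum()
Q = Q / Q.sum(1, keepdims=True)                # walk on the random graph G'

def within_mass(S):
    """Average probability mass a node keeps inside its own group."""
    return float((S * same).sum(1).mean())

S_u, S_c = P.copy(), P.copy()
for k in range(2, 11):
    S_u = S_u @ P                              # unconstrained k-step walk
    Sp = np.maximum(S_c @ P - S_c @ Q, 0.0)    # constrained update, Eq. (13)
    row = Sp.sum(1, keepdims=True)
    row[row == 0] = 1.0
    S_c = Sp / row
    if k in (2, 6, 10):
        print(k, round(within_mass(S_u), 3), round(within_mass(S_c), 3))
```

On such a network the unconstrained within-group mass decays toward the stationary value of 1/4 as k grows, while the constrained mass stays concentrated, mirroring the contrast between Figs. 6(d)-(f) and 6(g)-(i).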
After the k-step propagation above, we then perform a discriminative aggregation, which forms the relaxation and improvement of the naive model. To be specific, we use the same aggregation as the naive model, but aggregate the embeddings of the k-step propagated neighbors. The final embeddings can then be obtained in one step as:

\[ H^{(k)} = \sigma\big( ( S^{(k)} \circ H^{(0)} ) W \big). \qquad (14) \]

While the model may not distinguish information from different meta-paths during propagation, it does distinguish them in aggregation, achieving the essential selection of different meta-paths. In this way, we further resolve an inherent limitation of the naive model (the difficulty of concatenation in the new aggregation, because most nodes do not have the same and complete types of one-hop meta-paths), since we can often obtain the complete types of neighbors after some k steps of constrained propagation.

One may also be concerned that the propagation matrix S^{(k)} may become very dense in this case, making the propagation introduce too much noise. In fact, this is not the case: thanks to the new constraint mechanism, our S^{(k)} can still remain sparse. Take the first node v in the first category of a complex Lancichinetti artificial network as an example (Fig. 7(a)). After many steps (e.g., k = 10) of propagation with the unconstrained random walk, the propagation probability from this node to all the other 999 nodes is positive, yielding a dense result (Fig. 7(b)). However, the propagation probability produced by our constrained walk is still sparse (Fig. 7(c)). As shown in Fig.
7(c), the propagation probability from v to 766 of the total 999 other nodes is 0, while that to the remaining nodes is positive. Moreover, the red values (the probabilities from v to nodes in the same category) are often much larger than the blue values (the probabilities from v to nodes outside this category). This demonstrates that our new propagation mechanism not only obtains a sparse propagation matrix, but also filters impurity information well, making the propagation more effective.

Fig. 7: An example illustrating the sparsity of the propagation matrix under our constrained propagation. (a) An artificial network of 1000 nodes with power-law distributions of degree and category size, generated by Lancichinetti's model [22]. Here we only use the first category, with 97 nodes, which are placed at the top of the node sequence; we focus on the first node v in this category, which has the maximum degree. (b) The propagation probabilities from node v to the other nodes based on the unconstrained random walk, and (c) those using our constrained walk. Red points denote probabilities from node v to nodes in the same category, and blue points to nodes outside it.

We define the loss function using the cross entropy:

\[ \mathcal{L} = - \sum_{l \in y_L} Y_l \ln ( C \cdot H_l ), \qquad (15) \]

where C denotes the set of parameters of the classifier, y_L the set of node indices that have labels, and Y_l and H_l the labels and embeddings of the labeled nodes. We use back propagation and the Adam optimizer to optimize the model.

It is also quite easy to introduce some tricks when implementing our method, for example, supporting the use of candidate meta-path sets and the (multi-head) node-level attention, which are often used in existing HIN embedding approaches.

First, existing HIN embedding methods often need to use a candidate meta-path set.
To make our method support this option, we can adopt only the meta-paths in this candidate set to construct the k-step propagation matrix, and then use an aggregation to fuse information from these k-step propagated neighbors to derive the final embeddings.

Second, existing graph neural network-based HIN embedding methods usually adopt the node-level attention for fine-tuning. Our method can also introduce the node-level attention, working together with its inherent algorithmic mechanism of implicitly selecting meta-paths, to further improve performance. To be specific, given a node pair (u, v) and a specified meta-path m, the importance coefficient between nodes u and v can be formulated as:

\[ e^m_{uv} = \mathrm{LeakyReLU}\big( \mu_m^{T} [ W h_u \,\|\, W h_v ] \big), \qquad (16) \]

where \(\mu_m\) is the parameterized attention vector for meta-path m, and W the mapping matrix applied to each node. After obtaining the importance between nodes u and v, we can use softmax to normalize it and obtain the weight coefficients:

\[ \alpha^m_{uv} = \mathrm{softmax}_v ( e^m_{uv} ) = \frac{\exp( e^m_{uv} )}{\sum_{r \in N^m_u} \exp( e^m_{ur} )}. \qquad (17) \]

Then, the embedding of node u for meta-path m can be aggregated from the neighbors' embeddings with the corresponding weight coefficients:

\[ h^m_u = \sigma\Big( \sum_{v \in N^m_u} \alpha^m_{uv} W h_v \Big). \qquad (18) \]

Finally, we can also extend the node-level attention to a multi-head attention, as done in many existing methods [12], [13], in order to stabilize the learning process and reduce the high variance (brought by the heterogeneity of networks). That is, we repeat the node-level attention K times, and then concatenate the outputs as the final embedding:

\[ h^m_u = \big\Vert_{k=1}^{K} \sigma\Big( \sum_{v \in N^m_u} \alpha^m_{uv} W h_v \Big). \qquad (19) \]

EXPERIMENTS
We first give the experimental setup, and then compare our GIAM with some state-of-the-art methods on three network analysis tasks, i.e., node classification, node clustering and network visualization. We finally give an in-depth analysis of the different components of our new approach.
We adopt two widely-used heterogeneous information networks from different domains, as shown in Table 2, to evaluate the performance of different methods.

• IMDB is an online database about TV shows and movie productions. We extract a subset of IMDB with 4278 movies (M), 2081 directors (D) and 5257 actors (A). The movies are divided into three classes (Action, Comedy, Drama) based on their genre. Each movie is described by a bag-of-words representation of its plot keywords. Following [13], we use the candidate meta-path set {MAM, MDM} for algorithms that require such information, and select 400, 400 and 3478 movies as the training, validation and testing sets for semi-supervised learning.

TABLE 2: Datasets description.
Datasets  No. of Nodes  No. of Edges  Meta-paths
IMDB      11,616        -             MAM, MDM
DBLP      26,128        -             APA, APCPA, APTPA

• DBLP is a computer science bibliography database with authors at its core. We extract a subset of DBLP with 4057 authors (A), 14328 papers (P), 7723 terms (T) and 20 venues (V). The authors are divided into four classes (Database, Data Mining, Artificial Intelligence and Information Retrieval) based on their research areas. Each author is described by a bag-of-words representation of his/her paper keywords. Also following [13], we adopt the candidate meta-path set {APA, APCPA, APTPA}, and select 400, 400 and 3257 authors as the training, validation and testing sets.

We compare our new approach GIAM with eight existing methods: 1) the homogeneous network embedding methods DeepWalk [23], Node2vec [24], GCN [14] and GAT [25], and 2) the HIN embedding methods Metapath2vec [9], HetGNN [26], HAN [12] and MAGNN [13]. In particular, GCN is the base of our approach GIAM, while HAN and MAGNN are the state-of-the-art graph neural network-based HIN embedding methods that adopt the hierarchical attention structure. Also of note, we apply the homogeneous network embedding methods to the HIN structure directly by ignoring the different types of nodes and edges.
For the methods based on semi-supervised graph neural networks (including GCN, GAT, HAN, MAGNN and our GIAM), we set the dropout rate to 0.5 and use the same splits for the training, validation and testing sets. We employ the Adam optimizer with the learning rate set to 0.005, and apply early stopping with a patience of 50. For GAT, HAN and MAGNN, we set the number of attention heads to 8. For HAN and MAGNN, we set the dimension of the meta-path-level attention vector to 128. For the methods based on random walk (including DeepWalk, Node2vec, HetGNN and Metapath2vec), we set the window size to 5, the walk length to 100, the number of walks per node to 40, and the number of negative samples to 5. For a fair comparison, the embedding dimension of all methods mentioned above is set to 64.
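For reference, these shared settings can be collected in a single configuration dictionary; the key names below are illustrative conveniences, not taken from a released codebase.

```python
# Experimental settings from the text, gathered as one config dict
# (names are illustrative, not from the authors' code).
config = {
    "dropout": 0.5,
    "optimizer": "Adam",
    "learning_rate": 0.005,
    "early_stopping_patience": 50,
    "attention_heads": 8,            # GAT / HAN / MAGNN
    "metapath_attn_dim": 128,        # HAN / MAGNN
    "walk": {                        # random-walk based baselines
        "window": 5,
        "length": 100,
        "per_node": 40,
        "neg_samples": 5,
    },
    "embedding_dim": 64,             # all methods, for fair comparison
}
```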
We first make a quantitative comparison on node classification and clustering, and then a qualitative comparison on visualization.
TABLE 3: Comparisons on node classification.

Datasets  Metrics      Ratio  Deepwalk  Node2vec  GCN    GAT    Metapath2vec  HetGNN  HAN    MAGNN  GIAM
IMDB      Macro-F1(%)  5%     41.52     43.56     54.56  54.79  42.95         42.93   55.94  54.41  58.49
                       10%    44.40     46.40     55.75  55.69  43.90         45.94   56.41  56.43  59.15
                       20%    46.60     49.61     56.29  56.38  45.53         48.87   57.64  57.41  59.79
                       40%    47.92     50.87     56.00  56.26  46.39         51.39   58.46  58.70  59.85
                       60%    48.66     51.79     55.83  56.05  47.80         52.70   58.73  58.97  60.25
                       80%    48.73     52.08     56.30  56.03  48.63         53.31   58.82  59.65  59.97
          Micro-F1(%)  5%     42.31     44.13     55.22  55.48  44.31         43.80   56.28  54.61  59.03
                       10%    45.45     47.32     56.23  56.20  45.75         46.89   56.62  56.59  59.50
                       20%    47.88     50.59     56.58  56.60  47.06         49.62   57.66  57.43  59.96
                       40%    49.47     52.01     56.39  56.52  48.12         52.24   58.46  58.85  60.05
                       60%    50.20     52.92     56.19  56.31  49.50         53.58   58.75  59.09  60.44
                       80%    50.33     53.45     56.52  56.14  50.65         54.40   58.95  59.76  60.18
DBLP      Macro-F1(%)  5%     73.09     78.02     85.59  79.67  90.17         90.83   91.80  92.96  93.24
                       10%    80.95     84.53     86.11  84.99  90.76         91.18   92.27  93.07  93.48
                       20%    84.08     85.51     86.88  86.72  91.28         91.68   92.88  92.92  93.64
                       40%    86.98     86.82     88.12  87.57  91.88         92.20   93.03  93.17  93.76
                       60%    88.59     88.14     87.84  88.32  92.31         92.36   92.97  93.50  93.70
                       80%    89.99     88.78     87.75  89.16  92.70         92.22   93.18  93.52  93.96
          Micro-F1(%)  5%     75.49     80.41     86.08  82.88  90.90         91.39   92.36  93.49  93.72
                       10%    81.96     85.46     86.62  86.02  91.43         91.74   92.81  93.58  93.96
                       20%    85.02     86.48     87.28  87.38  91.97         92.20   93.36  93.43  94.12
                       40%    87.81     87.68     88.50  88.18  92.50         92.68   93.50  93.63  94.23
                       60%    89.38     89.02     88.28  88.98  92.90         92.88   93.47  93.95  94.18
                       80%    90.43     89.51     88.16  89.69  93.25         92.78   93.67  93.96  94.39
On the node classification task, for each method, we first generate the embeddings of the labeled nodes (i.e., movies in IMDB and authors in DBLP), and then feed them to an SVM using different training ratios from 5% to 80% (as done in most existing works). Since the variance on graph-structured data can be quite large, we repeat this process 10 times and report the average Macro-F1 and Micro-F1.

The results are shown in Table 3. As shown, the proposed method GIAM always performs the best across different training ratios and datasets. On the IMDB dataset, GIAM is 1.15-2.88% and 0.32-4.42% more accurate than the best baselines HAN and MAGNN, which are also heterogeneous graph neural network methods (but use meta-path-level attentions directly). On the DBLP dataset, GIAM is 0.71-1.44% and 0.20-0.72% more accurate than the best baselines HAN and MAGNN, in the case of an already very high base accuracy.

We also compare these methods on node clustering. In this task, for each method, we first generate the embeddings of the labeled nodes, and then feed them to the K-Means algorithm. The number of clusters K is set the same as the ground truth, i.e., 3 for IMDB and 4 for DBLP. Since the performance of K-Means is easily affected by the initial centers, we repeat the process 10 times and report the average normalized mutual information (NMI) and adjusted Rand index (ARI).

The results are shown in Tables 4 and 5. As shown, the proposed method GIAM performs the best on IMDB. While GIAM performs the second best on DBLP, its performance is still very competitive with that of the best baseline MAGNN. Averaged over these two datasets, GIAM is 10.67%, 6.77%, 14.66%, 7.75%, 9.11%, 9.48%, 3.76% and 0.47% more accurate than Deepwalk, Node2vec, GCN, GAT, Metapath2vec, HetGNN, HAN and MAGNN in terms of NMI; and 0.1212, 0.0694, 0.2247, 0.1111, 0.0938, 0.0875, 0.0303 and 0.0114 better than these methods in terms of ARI (in the range of -1 to 1). Moreover, (on average) GIAM is still better than the methods using meta-path-level attentions directly (i.e., HAN and MAGNN). This further validates the soundness of using algorithmic mechanisms to evaluate the importance of different meta-paths. Neither GCN nor GAT is so competitive here, mainly because they fail to distinguish the importance of information with respect to different meta-paths, which significantly compromises their performance in the unsupervised clustering setting.

TABLE 4: Comparisons on node clustering in terms of NMI (%). AVG shows the average result.

Datasets  Deepwalk  Node2vec  GCN    GAT    Metapath2vec  HetGNN  HAN    MAGNN  GIAM
IMDB      0.55      5.34      10.42  10.02  0.43          0.46    13.02  13.77
DBLP      71.78     74.80     53.93  68.15  75.02         74.26   73.13

TABLE 5: Comparisons on node clustering in terms of ARI (in the range [-1, 1]).

Datasets  Deepwalk  Node2vec  GCN     GAT     Metapath2vec  HetGNN  HAN     MAGNN   GIAM
IMDB      -0.0014   0.0642    0.0661  0.0744  0.0005        0.0048  0.1282  0.1206
DBLP      0.7415    0.7796    0.4670  0.6859  0.7945        0.8028  0.7938
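The evaluation protocol above (embeddings fed to an SVM for classification and to K-Means for clustering) can be sketched with scikit-learn as follows. This is a generic protocol sketch, not the authors' code; in particular the linear-kernel SVM is our assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, normalized_mutual_info_score,
                             adjusted_rand_score)

def evaluate_embeddings(emb, labels, train_ratio=0.2, n_clusters=3, seed=0):
    """Evaluate learned node embeddings, mirroring the protocol in the text.

    Classification: split by `train_ratio`, fit an SVM, report Macro/Micro-F1.
    Clustering: K-Means with ground-truth K, report NMI and ARI.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        emb, labels, train_size=train_ratio, random_state=seed,
        stratify=labels)
    clf = LinearSVC().fit(X_tr, y_tr)        # linear SVM is an assumption
    pred = clf.predict(X_te)
    macro = f1_score(y_te, pred, average="macro")
    micro = f1_score(y_te, pred, average="micro")
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(emb)
    nmi = normalized_mutual_info_score(labels, km.labels_)
    ari = adjusted_rand_score(labels, km.labels_)
    return macro, micro, nmi, ari
```

In the paper's setting this would be repeated 10 times (with different splits and K-Means initializations) and the averages reported.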
For a more intuitive comparison, we also visualize the embeddings of author nodes learned by some representative network embedding methods (i.e., GCN, HetGNN, HAN and our GIAM) on the DBLP dataset. We utilize the well-known t-SNE tool [27] to project the node embeddings to two dimensions. Different colors correspond to the different research areas of these nodes.

Fig. 8: The visualization of the author-node embeddings learned by (a) GCN, (b) HetGNN, (c) HAN and (d) GIAM on DBLP. Different colors correspond to different research areas in the ground truth.

As shown in Fig. 8, GCN (which ignores the heterogeneity of nodes) does not perform well, i.e., author nodes belonging to different research areas are sometimes mixed with each other. HetGNN performs much better than GCN, but its boundaries are still blurry. While both HAN and our GIAM separate the author nodes of different research areas reasonably well, our GIAM shows more distinct boundaries and denser cluster structures in the visualization.
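A projection of this kind can be produced with scikit-learn's t-SNE. The Gaussian stand-in "embeddings" below are purely illustrative; in the paper the input would be the learned 64-dimensional DBLP author embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in embeddings: 4 research-area clusters of 64-d author vectors
# (illustrative synthetic data, not the learned embeddings).
emb = np.concatenate([rng.normal(loc=4 * c, scale=1.0, size=(50, 64))
                      for c in range(4)])
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
print(xy.shape)  # (200, 2) -- ready to scatter-plot, colored by area
```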
Similar to most deep learning models, GIAM contains some important components that may have a significant impact on performance. To test the effectiveness of each component, we compare GIAM with four variants: 1) GCN, which serves as the base framework of GIAM without distinguishing the importance of information with respect to different meta-paths; 2) the naive model of GIAM, named GIAM-1; 3) GIAM with the node-level attention removed (by assigning the same importance to each neighbor node), named GIAM-2; and 4) GIAM with the meta-path-level attention added, named GIAM-3. We take their comparison on node classification as an example.

As shown in Table 6, compared to GCN, the naive model GIAM-1 (which distinguishes meta-paths) achieves an obvious improvement, i.e., it is 0.86-1.15% and 4.18-5.25% more accurate on IMDB and DBLP. However, due to the sparsity of HINs, GIAM-1 inevitably needs to add a large number of non-informative features to fill in the embeddings of the missing types of one-hop meta-paths during aggregation. While its result is basically satisfactory, this limitation inevitably compromises performance. We overcome this limitation by introducing the new mechanism of relaxation and improvement, deriving GIAM-2, which further improves on the naive model by 2.35-3.63% and 0.02-2.76% on IMDB and DBLP. Furthermore, by introducing the fine-tuning node-level attention, the derived GIAM improves on GIAM-2 on DBLP (i.e., it is 0.63-0.87% more accurate), while the improvement on IMDB is not so obvious (because IMDB is harder to train well, with a relatively low accuracy, which more easily leads to overfitting). This further demonstrates that the node-level attention indeed plays a fine-tuning role when the model can be trained well (such as on DBLP, with a relatively high accuracy). Finally, GIAM-3, which adds the meta-path-level attention, hardly changes the performance of GIAM.
This further validates that our algorithmic mechanism already plays a significant role in selecting meta-paths, compared to the explicit meta-path-level attention approach.

TABLE 6: Comparisons of our GIAM with four variants (GCN, GIAM-1, GIAM-2 and GIAM-3) on node classification.
Datasets  Metrics      Ratio  GCN    GIAM-1  GIAM-2  GIAM   GIAM-3
IMDB      Macro-F1(%)  5%     54.56  55.52   58.29   58.49  58.56
                       10%    55.75  56.73   59.31   59.15  59.26
                       20%    56.29  57.15   59.90   59.79  59.94
                       40%    56.00  56.99   60.01   59.93  59.85
                       60%    55.83  56.88   60.51   60.25  60.31
                       80%    56.30  57.34   60.43   59.97  60.10
          Micro-F1(%)  5%     55.22  56.14   58.80   59.03  59.11
                       10%    56.23  57.20   59.55   59.50  59.61
                       20%    56.58  57.51   59.96   59.96  60.10
                       40%    56.39  57.48   60.10   60.05  60.13
                       60%    56.19  57.34   60.55   60.44  60.50
                       80%    56.52  57.66   60.47   60.18  60.30
DBLP      Macro-F1(%)  5%     85.59  89.77   92.53   93.24  93.25
                       10%    86.11  90.85   92.62   93.48  93.48
                       20%    86.88  91.89   92.79   93.64  93.61
                       40%    88.12  92.41   92.89   93.76  93.78
                       60%    87.84  92.81   92.87   93.70  93.69
                       80%    87.75  91.98   93.12   93.96  93.98
          Micro-F1(%)  5%     86.08  90.58   93.09   93.72  93.75
                       10%    86.62  91.57   93.17   93.96  93.96
                       20%    87.28  92.53   93.35   94.12  94.10
                       40%    88.50  93.01   93.45   94.23  94.26
                       60%    88.28  93.42   93.44   94.18  94.19
                       80%    88.16  92.52   93.66   94.39  94.43
RELATED WORK
Heterogeneous information network (HIN) embedding aims to learn a low-dimensional distributed representation for each node of a HIN while preserving the structure and semantic information. Existing HIN embedding methods can be mainly divided into three categories: random walk-based methods, relation learning-based methods, and graph neural network-based methods.

The random walk-based methods first perform random walks on a HIN to generate node walk sequences, and then feed them to a subsequent model to obtain node embeddings. For example, JUST [8] performs random walks with jump and stay strategies, selecting the next node based on the probability of a jump or stay operation. It then inputs the generated walk sequences to the skip-gram model to obtain the final node embeddings. Metapath2vec [9] first generates node walk sequences guided by meta-paths, and then obtains node embeddings by adopting heterogeneous skip-gram with negative sampling. HetGNN [26] improves Metapath2vec by incorporating additional node information: it first introduces a sampling strategy based on random walk with restart to sample neighbors for each node, and then uses a heterogeneous neural network architecture to aggregate the feature information of the sampled neighbors.

The relation learning-based methods aim to learn a scoring function that evaluates an arbitrary triplet composed of two nodes and an edge type, and outputs a scalar measuring the acceptability of this triplet. For example, DistMult [10] adopts a similarity-based scoring function to learn the edge possibility between arbitrary pairs of nodes in the HIN. ConvE [11] proposes a deep neural model, instead of a simple similarity function, to score the edge possibility between two nodes.
TransE [28] learns the edge possibility between two nodes by using a translational distance.

The graph neural network-based methods aim to learn node embeddings by aggregating information from the neighbor nodes of a HIN. For example, HAN [12] proposes a hierarchical attention mechanism, including node-level and semantic-level attentions, to aggregate information from meta-path-based neighbors. To be specific, the node-level attention learns the importance of neighbors under the same meta-path, while the semantic-level attention learns the importance of different meta-paths. MAGNN [13] employs three major components, i.e., node-type specific transformation, node-level meta-path instance aggregation and meta-path-level embedding fusion, to obtain node embeddings of heterogeneous graphs. While these graph neural network-based methods can often derive satisfactory node embeddings, they still have some essential limitations. That is, the complicated hierarchical attention structure often makes it difficult for these methods to really achieve the goal of selecting meta-paths, partly due to severe overfitting (as shown in Fig. 1(a) as an illustrative example). Meanwhile, these methods treat one-hop and multi-hop meta-paths indistinguishably when propagating information, which may not be so intuitive from the perspective of network propagation dynamics in network science.
CONCLUSION
We propose a novel GCN-based method, namely GIAM, which implicitly (rather than explicitly) utilizes attention and meta-paths in order to effectively achieve HIN embedding. We use the directly linked meta-paths, a discriminative aggregation, and the stacked layers of propagation to distinguish the importance of different meta-paths. We further give an effective relaxation and improvement by introducing a new multi-layer propagation which is separated from the aggregation. That is, we first replace the spectral filter of GCN, changing the symmetric normalized graph Laplacian to an equivalent asymmetric one and removing the activation functions, making it a well-defined probabilistic propagation process. We then introduce a random graph-based constraint mechanism RPC on this probabilistic propagation, to avoid importing too much noise as the number of propagation layers increases. Empirical results on various graph mining tasks, including node classification, node clustering and graph visualization, demonstrate the superiority of our new approach over some state-of-the-art methods.

REFERENCES

[1] C. Yang, Y. Xiao, Y. Zhang, Y. Sun, and J. Han, "Heterogeneous network representation learning: Survey, benchmark, evaluation, and beyond," CoRR, vol. abs/2004.00216, 2020.
[2] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, "A survey of heterogeneous information network analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 17-37, 2017.
[3] W. Shen, J. Han, J. Wang, X. Yuan, and Z. Yang, "SHINE+: A general framework for domain-specific entity linking with heterogeneous information networks," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 2, pp. 353-366, 2018.
[4] B. Hu, C. Shi, W. X. Zhao, and P. S. Yu, "Leveraging meta-path based context for top-N recommendation with a neural co-attention model," in Proceedings of SIGKDD, ACM, 2018, pp. 1531-1540.
[5] C. Wang, Y. Song, H. Li, M. Zhang, and J. Han, "Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks," Data Mining and Knowledge Discovery, vol. 32, no. 6, pp. 1735-1767, 2018.
[6] C. Park, D. Kim, J. Han, and H. Yu, "Unsupervised attributed multiplex network embedding," in Proceedings of AAAI, 2020, pp. 5371-5378.
[7] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proceedings of NIPS, 2017, pp. 1024-1034.
[8] R. Hussein, D. Yang, and P. Cudré-Mauroux, "Are meta-paths necessary?: Revisiting heterogeneous graph embeddings," in Proceedings of CIKM, 2018, pp. 437-446.
[9] Y. Dong, N. V. Chawla, and A. Swami, "metapath2vec: Scalable representation learning for heterogeneous networks," in Proceedings of SIGKDD, ACM, 2017, pp. 135-144.
[10] B. Yang, W. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," in Proceedings of ICLR, 2015.
[11] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2D knowledge graph embeddings," in Proceedings of AAAI, 2018, pp. 1811-1818.
[12] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, "Heterogeneous graph attention network," in Proceedings of WWW, 2019, pp. 2022-2032.
[13] X. Fu, J. Zhang, Z. Meng, and I. King, "MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding," in Proceedings of WWW, 2020, pp. 2331-2341.
[14] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proceedings of ICLR, 2017.
[15] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in Proceedings of ICLR, 2014.
[16] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Proceedings of NIPS, 2016, pp. 3837-3845.
[17] Y. Wang, Z. Duan, B. Liao, F. Wu, and Y. Zhuang, "Heterogeneous attributed network embedding with graph convolutional networks," in Proceedings of AAAI, 2019, pp. 10061-10062.
[18] F. Wu, A. H. Souza Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger, "Simplifying graph convolutional networks," in Proceedings of ICML, 2019, pp. 6861-6871.
[19] B. Yang, J. Liu, and J. Feng, "On the spectral characterization and scalable mining of network communities," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 2, pp. 326-337, 2012.
[20] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, vol. 69, no. 2, p. 026113, 2004.
[21] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821-7826, 2002.
[22] A. Lancichinetti and S. Fortunato, "Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities," Physical Review E, vol. 80, no. 1, p. 016118, 2009.
[23] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of SIGKDD, ACM, 2014, pp. 701-710.
[24] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of SIGKDD, ACM, 2016, pp. 855-864.
[25] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proceedings of ICLR, 2018.
[26] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, "Heterogeneous graph neural network," in Proceedings of SIGKDD, ACM, 2019, pp. 793-803.
[27] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[28] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in Proceedings of NIPS, 2013, pp. 2787-2795.

APPENDIX
In Section 2, we used three graph neural network-based HIN embedding methods, i.e., HAN, MAGNN and our new approach GIAM, to conduct the motivating experiment on two widely-used heterogeneous information networks, i.e., IMDB and DBLP. Here, on each network, we give the detailed results of the different methods under different ratios (i.e., 5-80%) of supervised information, as shown in Table 7 and Table 8, respectively.
TABLE 7: The performance of HAN and MAGNN with (and without) meta-path-level attention, as well as our new approach GIAM, on the IMDB dataset. HAN-1 denotes HAN with meta-path-level attention and HAN-2 without; MAGNN-1 denotes MAGNN with meta-path-level attention and MAGNN-2 without. AVG shows the average result.
Dataset  Metrics      Ratio  HAN-1  HAN-2  MAGNN-1  MAGNN-2  GIAM
IMDB     Macro-F1(%)  5%     55.94  57.57  54.41    55.27    58.49
                      10%    56.41  58.35  56.43    56.44    59.15
                      20%    57.64  59.16  57.41    58.72    59.79
                      40%    58.46  59.49  58.70    59.71    59.85
                      60%    58.73  59.55  58.97    59.71    60.25
                      80%    58.82  59.43  59.65    59.95    59.97
                      AVG    57.67  58.93  57.60    58.30    59.58
         Micro-F1(%)  5%     56.28  57.94  54.61    55.39    59.03
                      10%    56.62  58.53  56.59    56.71    59.50
                      20%    57.66  59.21  57.43    58.83    59.96
                      40%    58.46  59.53  58.85    59.89    60.05
                      60%    58.75  59.53  59.09    59.91    60.44
                      80%    58.95  59.40  59.76    60.24    60.18
                      AVG    57.79  59.02  57.72    58.50    59.86