Meta-Path-Free Representation Learning on Heterogeneous Networks
Jie Zhang, Jinru Ding, Suyuan Liu, and Hongyan Wu
SenseTime Research, Shanghai, China; Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.

Abstract
Real-world networks and knowledge graphs are usually heterogeneous networks. Representation learning on heterogeneous networks is not only a popular but also a pragmatic research field. The main challenge comes from the heterogeneity: the diverse types of nodes and edges. Besides, for a given node in a HIN, the significance of a neighborhood node depends not only on the structural distance but also on semantics. How to effectively capture both structural and semantic relations is another challenge. The current state-of-the-art methods are based on the algorithm of meta-paths and therefore have a serious disadvantage: the performance depends on the arbitrary choice of meta-path(s). Moreover, the selection of meta-path(s) is experience-based and time-consuming. In this work, we propose a novel meta-path-free representation learning method for heterogeneous networks, namely Heterogeneous graph Convolutional Networks (HCN). The proposed method fuses the heterogeneity and develops a k-strata algorithm (k is an integer) to capture the k-hop structural and semantic information in heterogeneous networks. To the best of our knowledge, this is the first attempt to break out of the confinement of meta-paths for representation learning on heterogeneous networks. We carry out extensive experiments on three real-world heterogeneous networks. The experimental results demonstrate that the proposed method significantly outperforms the current state-of-the-art methods in a variety of analytic tasks.

Heterogeneous information networks (HIN) are networks that involve multiple types of nodes and/or edges [1]. Take the Digital Bibliographic Library Browser (DBLP, https://dblp.uni-trier.de) as an example. The node types include authors (A), papers (P), and conferences (C), and the edge types include a writing relation between a paper (P) and an author (A) and a publishing relation between a paper (P) and a conference (C). Figure 1(a) gives an example of DBLP-like networks. Real-world networks are usually HINs: for instance, publication networks [2], biological networks [3], highway networks [4], and most knowledge graphs are HINs. Representation learning on HINs, also known as heterogeneous network embedding (HNE), captures semantic and structural information by embedding diverse types of nodes and/or the entire network into a low-dimensional space. HNE effectively helps downstream analytical tasks, such as knowledge-guided recommendation systems [5, 6], knowledge-based image classification [7, 8] and captioning [9, 10], knowledge-guided natural language processing (NLP) [11, 12], and so on. Therefore, representation learning on HINs is not only a popular but also a pragmatic research field.

There are mainly two challenges for HNE. [Challenge 1] A heterogeneous network has much more complicated semantics than a homogeneous network. Diverse types of nodes and edges have various feature spaces and semantic meanings. The challenge of heterogeneity cannot simply be handled by methods for homogeneous network embedding [13, 14, 15]. [Challenge 2]
For a given node in a HIN, the significance of a neighborhood node depends not only on the structural distance but also on semantics [16]. In other words, a farther neighbor may have more significance. Some studies on HNE [17, 15] find that the analytical outcomes based on a long-distance neighborhood outperform those based on a short-distance neighborhood in node clustering tasks on DBLP.

Classic algorithms for HNE, such as Metapath2vec, apply the algorithm of meta-paths. A meta-path is a pre-defined sequence of node types. Metapath2vec takes meta-path-guided random walks and then applies a skip-gram algorithm [18]. Recently, Graph Neural Networks (GNN), such as Graph Convolutional Networks (GCN) [19, 20, 21] and Graph Attention Networks (GAT) [22], have shown superior performance on homogeneous network embedding. Therefore, the current state-of-the-art methods, such as the Heterogeneous graph Attention Network (HAN) [15], combine the algorithms of meta-paths and GNNs to perform HNE. However, there is a serious disadvantage of meta-path-based methods: the meta-paths are either specified by users or derived from supervision [18, 17, 14, 15]. The meta-paths selected in these ways only reflect certain aspects of a HIN [14], and different meta-paths result in different outcomes. Researchers need to explore as many meta-paths as possible and choose the best meta-path(s) [17]. However, the number of meta-paths is infinite, and researchers can hardly test all possible meta-paths. Therefore, the selection of meta-paths is usually experience-based and time-consuming [23, 24, 25].

We develop a novel meta-path-free representation learning method for heterogeneous networks, namely Heterogeneous graph Convolutional Networks (HCN). The proposed method develops a meta-path-free k-strata algorithm, which naturally incorporates miscellaneous composite relations in heterogeneous networks. The hybrid of miscellaneous composite relations is the key to fusing the heterogeneity and capturing both structural and semantic information in heterogeneous networks without arbitrarily selecting meta-paths. Furthermore, the pretreatment of the proposed method is much easier. Comparatively, for the pretreatment of meta-path-based methods, such as the pretreatment of HAN in Figure 1(b), the time complexity depends on the length of the meta-paths, the number of nodes and branches in the HIN, and the number of meta-paths.

The contributions of this work are as follows: (1) to the best of our knowledge, this is the first attempt to break out of the confinement of meta-paths for HNE; (2) the proposed method can capture both the semantic and structural relations; and (3) we carry out extensive experiments on three real-world HINs, and the results show that the proposed method significantly outperforms the current state-of-the-art methods in a variety of analytical tasks.

Figure 1: (a) An illustrative example of a DBLP-like network. The "A"s, "P"s, and "C"s represent nodes of authors, papers, and conferences, respectively. The "A1" is the given node. (b) An illustrative example of the pretreatment of HAN. By APA and
APCPA, the heterogeneous network is decomposed and reorganized into two sub-homogeneous networks. The neighborhood in the two new sub-homogeneous networks is the meta-path-based (APA and APCPA) neighborhood in the original heterogeneous network.
Figure 2: The representation learning implemented in a two-layered GNN. The inputs are the k-strata adjacency matrix and the fused feature matrix. The output is the analytical outcome of a node classification task.

Metapath2vec is a meta-path-based unsupervised learning method. There are two terms in Metapath2vec: (1) the meta-path scheme and (2) the meta-path instance. (1) A meta-path scheme is a pre-defined sequence of node types. Take DBLP as an example: the commonly used meta-path schemes for DBLP are "author-paper-author" (APA) and "author-paper-conference-paper-author" (APCPA). (2) A meta-path instance is a node sequence that follows and repeats the format of a meta-path scheme until it reaches a fixed length, which is set to 100 in Metapath2vec. Take Figure 1(a) as an example: by following and repeating the meta-path scheme of APA, meta-path instances of the form A-P-A-P-A-P-A-P-A are generated. The generated meta-path instances are input to the skip-gram algorithm to learn HNE [18]. Please note that all meta-path instances are "randomly" generated. The "randomness" may generate some meta-path instances but neglect others.

HIN2Vec is a meta-path-based supervised learning method. HIN2Vec explores all meta-path instances within w hops and performs link predictions to achieve HNE. HIN2Vec compares the length of hops w on four HINs (Blogcatalog, http://socialcomputing.asu.edu/datasets/BlogCatalog3; Yelp; U.S. Patents; and DBLP) and finds that longer meta-path instances are crucial for a complicated HIN such as DBLP, because a longer meta-path may have a significant semantic meaning [17]; for example, APAPA means that two authors have co-authorship with the same author.

HAN is a typical algorithm that combines the algorithms of meta-paths and GNNs. The analytical process is divided into three steps. Firstly, by pre-defined meta-paths, a HIN is decomposed and reorganized into several homogeneous networks. Take Figure 1 as an example: by APA and APCPA, the heterogeneous network in Figure 1(a) is decomposed and reorganized into the two sub-homogeneous networks in Figure 1(b). The neighborhood in the sub-homogeneous networks is the meta-path-based (APA and APCPA) neighborhood in the original heterogeneous network. Secondly, HAN leverages the GNN algorithm to learn node embeddings in the two new sub-homogeneous networks. Thirdly, the two pieces of node embeddings learned from the two new sub-homogeneous networks are fused. Different from Metapath2vec and HIN2Vec, HAN achieves embedding of only one type of node.

In summary, the "randomness" in Metapath2vec might neglect some meta-path instances and thereby lose some indispensable information. HIN2Vec finds that the length of meta-paths impacts the analytical performance but, unfortunately, does not give a method to avoid choosing meta-paths. HAN achieves embedding of only one type of node in HINs. The selection of meta-paths in Metapath2vec, HIN2Vec, and HAN all strongly depends on the task at hand.
Table 1: Notations and explanations.

    Notation | Explanation
    G        | a heterogeneous network
    V        | set of all nodes
    E        | set of all edges
    O        | set of all node types
    R        | set of all edge types
    Ã^k      | k-strata adjacency matrix
    M        | type-specific transformation matrix
    X′       | fused feature matrix
    Z        | final embedding

This section formally defines (1) HIN and (2) the distance between two nodes. Table 1 presents the notations used in this work.
Definition 3.1 (Heterogeneous Information Network (HIN) [1]). A HIN, denoted as G = (V, E), is composed of a set of nodes V and a set of edges E. O and R denote the set of node types and the set of edge types, respectively, with |O| + |R| > 2.
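As a minimal illustration of Definition 3.1 (our own toy encoding, not a prescribed format), a HIN can be stored as typed node and edge lists:

    from collections import namedtuple

    # A HIN G = (V, E) with node-type set O and edge-type set R.
    Node = namedtuple("Node", ["id", "type"])          # type drawn from O
    Edge = namedtuple("Edge", ["src", "dst", "type"])  # type drawn from R

    V = [Node("A1", "author"), Node("P1", "paper"), Node("C1", "conference")]
    E = [Edge("A1", "P1", "writes"), Edge("P1", "C1", "published_in")]

    O = {v.type for v in V}   # {"author", "paper", "conference"}
    R = {e.type for e in E}   # {"writes", "published_in"}
    assert len(O) + len(R) > 2, "heterogeneity condition of Definition 3.1"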
Definition 3.2 (Distance between Two Nodes). The distance(i, j) is the number of hops in the shortest path between two given nodes i and j. In particular, the distance from a node to itself is 0, and the distance is infinite (∞) if no path exists between i and j. Formula (1) states the definition:

$$\mathrm{distance}(i,j)=\begin{cases}0 & i=j\\ k & k \text{ hops in the shortest path between } i \text{ and } j\\ \infty & \text{no path between } i \text{ and } j\end{cases} \qquad (1)$$

This section explains the proposed meta-path-free representation learning for HNE.

k-Strata

For a given node in a homogeneous network, the significance of a neighborhood node is mostly determined by the structural distance: the longer the distance, the less the significance. Comparatively, for a given node in a HIN, the significance of a neighborhood node depends not only on the structural distance but also on semantics. Take Figure 3 as an example. The author A1 published two papers (P1 and P2) in a conference (C1). P1 is a paper that introduces how to use knowledge graph embedding to enrich word embedding dimensions; P2 is a paper that performs reinforcement learning in a question and answer (QA) system; and C1 is the conference of the Association for Computational Linguistics (ACL). Thereby, C1 reflects A1's research area and interests (computational linguistics) more directly and obviously than P1 or P2. Another example also comes from Figure 3. A1 and A2, who are from the same lab, share the same research interests. A1 and A2 have three co-authored papers, P3, P4, and P5, and these three papers use different algorithms. Therefore, for the given node A1, A2 has more significance than P3, P4, or P5, although A2 is farther away than P3, P4, or P5. In conclusion, when performing representation learning in a HIN, we need to take both distance and semantics into consideration.

We introduce the concept of k-strata (k is an integer) to refer to all nodes within the k-hop range from a given node, as Figure 3 illustrates. Formula (2) defines the k-strata adjacency matrix Ã^k. The value Ã^k_{i,j} between two nodes i and j is:

$$\tilde{A}^k_{i,j}=\begin{cases}1 & \mathrm{distance}(i,j)\le k\\ 0 & \mathrm{distance}(i,j)>k\end{cases} \qquad (2)$$

where Ã^k ∈ R^{n×n} and n is the number of all nodes. Please note that Ã^k includes self-connections, since the distance from a node to itself is 0.
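Formula (1) can be computed with a plain breadth-first search; the sketch below (our own implementation over an undirected adjacency-list graph) returns 0, the hop count, or infinity exactly as defined, and distance(i, j) <= k is then the membership test used by Formula (2):

    from collections import deque
    import math

    def distance(adj, i, j):
        """Number of hops in the shortest path between i and j (Formula (1)).

        adj: dict mapping a node to the list of its one-hop neighbors.
        Returns 0 if i == j and math.inf if no path exists.
        """
        if i == j:
            return 0
        seen, queue = {i}, deque([(i, 0)])
        while queue:
            node, hops = queue.popleft()
            for nxt in adj.get(node, []):
                if nxt == j:
                    return hops + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
        return math.inf  # no path between i and j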
Figure 3: An example of the algorithm of k-strata in a DBLP-like graph. The "A"s, "P"s, and "C"s represent nodes of authors, papers, and conferences, respectively. The "A1" is the given node. The k-strata refers to all the nodes within the k-hop range from the given node A1. For example, the 2-strata of A1 includes P1, P2, P3, P4, P5, A2, A3, A4, C1, and C2. For a given node in a HIN, the significance of a neighborhood node depends not only on the structural distance but also on semantics. For example, A1 published two papers (P1 and P2) in a conference (C1). P1 is a paper that introduces how to use knowledge graph embedding to enrich word embedding dimensions; P2 is a paper that performs reinforcement learning in a question and answer (QA) system; and C1 is the conference of the Association for Computational Linguistics (ACL). C1 reflects A1's research area and interests (computational linguistics) more directly and obviously than P1 or P2.

Figure 4 shows a two-strata adjacency matrix, which corresponds to Figure 3. Since the k-strata adjacency matrix is a symmetric matrix, Figure 4 only shows the upper-right half. Please note that the two-strata adjacency matrix considers all the relations between any two nodes within the two-hop range.

Algorithm 1 explains how to generate the k-strata adjacency matrix. Although it might look complicated, the implementation can be quite simple, e.g., one line of Pandas (https://pandas.pydata.org/) code:

    Ã^k ← Ã^(k-1).apply(lambda x: (Ã^1[x == 1].any()).astype(int))
Algorithm 1 The generation of the k-strata adjacency matrix.

Require: the heterogeneous graph G = (V, E); the 1-stratum adjacency matrix Ã^1; the number of strata K (K ≥ 2).
Ensure: the k-strata adjacency matrix Ã^k.

    for k = 2 ... K do
        for i ∈ V do
            From Ã^(k-1), find all the (k-1)-strata neighbors of node i, denoted as a set N_i^(k-1);
            From Ã^1, find all the 1-hop neighbors of all the nodes in N_i^(k-1), denoted as a set N_i^(k-th);
            N_i^(k) ← logical_or(N_i^(k-1), N_i^(k-th));
            Append N_i^(k) to matrix Ã^k;
        end for
    end for
    return Ã^k.
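A runnable NumPy version of Algorithm 1 might look as follows (a minimal sketch equivalent in spirit to the Pandas one-liner above; the toy matrix is our own):

    import numpy as np

    def k_strata(A1, k):
        """Compute the k-strata adjacency matrix of Formula (2).

        A1: 0/1 one-stratum adjacency matrix (n x n), with self-connections
            on the diagonal since distance(i, i) = 0.
        Returns a 0/1 matrix whose (i, j) entry is 1 iff distance(i, j) <= k.
        """
        A = (np.asarray(A1) > 0).astype(int)
        Ak = A.copy()
        for _ in range(k - 1):
            # One more hop: j is within (m+1) hops of i iff some 1-hop
            # neighbor of i (including i itself, via the self-loop) is
            # within m hops of j; this is the logical_or of Algorithm 1.
            Ak = ((Ak @ A) > 0).astype(int)
        return Ak

    A1 = np.array([[1, 1, 0],   # toy chain 0 - 1 - 2, plus self-loops
                   [1, 1, 1],
                   [0, 1, 1]])
    print(k_strata(A1, 2))      # nodes 0 and 2 become 2-strata neighbors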
• Composite Relations

We introduce the concept of "composite relations", which helps explain why the k-strata algorithm captures both structural and semantic information in heterogeneous networks. Consider a k-hop structure

$$V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} V_{k+1},$$
where the V_i are nodes and the R_i are one-hop edges (or simple relations). The k-hop relation R between nodes V_1 and V_{k+1} can be formulated as

$$R = R_1 \circ R_2 \circ \cdots \circ R_k,$$

where ∘ denotes the composition operator on relations. Therefore, a k-hop (k ≥ 2) relation between two nodes implies a composite relation with a distance of k. Consequently, the k-strata adjacency matrix incorporates miscellaneous composite relations and therefore has two advantages for learning HNE.

Figure 4: An illustration of a k-strata adjacency matrix. The two-strata adjacency matrix in this figure corresponds to the DBLP-like network in Figure 3. Since a k-strata adjacency matrix is a symmetric matrix, this figure only shows the upper-right half.

Firstly, the k-strata adjacency matrix captures composite relations between any two nodes, while meta-path-based methods only capture the relations along the meta-paths. Take the DBLP-like network in Figure 1 as an example. Metapath2vec can only learn the relation APA based on the meta-path APA (as shown on the left side of Figure 1(b)) and the relation APCPA based on the meta-path APCPA (as shown on the right side of Figure 1(b)). Although HAN can fuse several meta-paths, such as APCPA and APA, a real-world HIN may contain many more useful composite relations, and HAN cannot cover all of them.

Secondly, the k-strata adjacency matrix can capture composite relations across meta-paths. Still taking the DBLP-like network in Figure 1(b) as an example, consider a node classification task for authors and two authors, say A2 and A3, that both have composite relations with the given node A1: the relationship between A2 and A3 should then be considered. However, in meta-path-based methods, A2 can never capture semantic information from A3 when the two are located in different meta-paths, e.g., A2 in APA while A3 in APCPA. For the proposed k-strata algorithm, the four-strata adjacency matrix includes the four-hop relation (between A1 and A3) and the two-hop relation (between A1 and A2). In other words, there is a consecutive relation A2-A1-A3 in the four-strata adjacency matrix, and A3 is A2's neighbor's neighbor. Thereby, A2 can capture semantic information from A3 through a GNN.

In conclusion, the k-strata adjacency matrix incorporates miscellaneous composite relations, and the "hybrid" of different composite relations is the key to fusing the heterogeneity and capturing both structural and semantic information in heterogeneous networks without arbitrarily selecting meta-paths.

A HIN has different types of nodes and therefore different feature spaces. For example, in a DBLP-like network, the nodes of "Author" have their own feature space, and so do the nodes of "Paper" and "Conference". To achieve HNE, we need to fuse the different feature spaces.
Figure 5: An illustration of the feature fusion. The original feature matrices of "Author", "Paper", and "Conference" are multiplied by the type-specific transformation matrices M_A, M_P, and M_C, respectively, and then concatenated by rows. The example in this figure corresponds to the DBLP-like network in Figure 3 and Figure 4.
Algorithm 2 The algorithm of feature fusion.

Require: the set of all node types O; the original feature matrix X; a trainable type-specific matrix M_o for each node type o.
Ensure: the fused feature matrix X′.

    for o ∈ O do
        From X, find the features of all the nodes of type o, denoted as X_o;
        X′_o ← X_o M_o;
        Append X′_o to X′;
    end for
    return X′.

We use a trainable type-specific transformation matrix for every node type and then append the transformed feature spaces. Figure 5 illustrates the algorithm of feature fusion for the DBLP-like network in Figure 3. In particular, the original feature matrices of "Author", "Paper", and "Conference" are multiplied by the trainable type-specific transformation matrices (M_A, M_P, and M_C), respectively. We then concatenate the transformed feature matrices of "Author", "Paper", and "Conference" by rows.

The fused feature matrix is denoted as X′ ∈ R^{n×F′}, where n is the number of nodes and F′ is the dimension of the fused feature space. Algorithm 2 describes how to perform feature fusion for multiple types of nodes. The purpose of the trainable type-specific transformation matrices (M_A, M_P, and M_C) is to transform the different feature spaces into a unified feature space.
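A minimal PyTorch sketch of Algorithm 2 (our own rendering, not the authors' code; the type names and dimensions are toy values, and the row order of the concatenation is assumed to match the node order of the k-strata adjacency matrix):

    import torch
    import torch.nn as nn

    class FeatureFusion(nn.Module):
        """Fuse type-specific feature spaces into one space of dimension F'."""
        def __init__(self, in_dims, fused_dim):
            super().__init__()
            # One trainable type-specific transformation matrix M_o per node
            # type o (a linear map, i.e. a bias-free Linear layer).
            self.transforms = nn.ModuleDict({
                t: nn.Linear(d, fused_dim, bias=False)
                for t, d in in_dims.items()
            })

        def forward(self, features):
            # features: dict of node type -> (n_o x F_o) feature matrix X_o.
            # X'_o = X_o M_o, then concatenate by rows to obtain X'.
            return torch.cat(
                [self.transforms[t](x) for t, x in features.items()], dim=0)

    # Toy DBLP-like example: three node types with different dimensions F_o.
    fusion = FeatureFusion({"A": 8, "P": 16, "C": 4}, fused_dim=6)
    X = {"A": torch.randn(3, 8), "P": torch.randn(5, 16), "C": torch.randn(2, 4)}
    print(fusion(X).shape)  # torch.Size([10, 6]): X' in R^{n x F'}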
• Linear and non-linear transformation for feature fusion

For feature fusion in this work, we implement a linear transformation. One can also use a non-linear transformation by applying an activation function σ after multiplying by M, which amounts to a fully-connected layer. In other words, to achieve a non-linear transformation for feature fusion, one can implement one or more fully-connected layers. In this work, a linear transformation seems to perform well enough.
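Under the same assumptions as the previous sketch, the non-linear variant only wraps the transformation in an activation, e.g.:

    import torch
    import torch.nn as nn

    # Non-linear fusion for one node type: X'_o = sigma(X_o M_o); with the
    # activation this is a fully-connected layer (several can be stacked).
    M_A = nn.Linear(8, 6, bias=False)   # type-specific matrix for "Author"
    X_A = torch.randn(3, 8)             # toy author features
    X_fused_A = torch.relu(M_A(X_A))    # non-linear transformation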
To learn HNE, the k-strata adjacency matrix and the fused feature matrix are input to a GNN, such as GCN or GAT, to perform supervised node classification (Figure 2). In this work, we use GCN to implement the representation learning. Formula (3) shows the implementation:

$$\begin{aligned} H^{(1)} &= \sigma(\hat{A}^k X' W^{(0)}) \\ H^{(2)} &= \sigma(\hat{A}^k H^{(1)} W^{(1)}) \\ &\;\;\vdots \\ H^{(h-1)} &= \sigma(\hat{A}^k H^{(h-2)} W^{(h-2)}) \\ Z &= \hat{A}^k H^{(h-1)} W^{(h-1)} \end{aligned} \qquad (3)$$

where h is the number of GCN layers; X′ ∈ R^{n×F′} is the fused feature matrix; the W^{(·)} are trainable weight matrices; Z ∈ R^{n×C} is the final embedding matrix and C is the dimension of the final embedding; σ is an activation function, for which we use the Rectified Linear Unit (ReLU); and Â^k ∈ R^{n×n} is the symmetrically normalized k-strata adjacency matrix, defined in Formula (4):

$$\hat{A}^k = \tilde{D}^{-\frac{1}{2}} \tilde{A}^k \tilde{D}^{-\frac{1}{2}} \qquad (4)$$

where Ã^k is the k-strata adjacency matrix and D̃ is the degree matrix, as shown in Formula (5):

$$\tilde{D}_{ii} = \sum_j \tilde{A}^k_{ij} \qquad (5)$$

For the multi-class classification, we calculate the cross-entropy loss over all labeled examples, as Formula (6) shows:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} Y_l \cdot \ln(\mathrm{softmax}(Z_l)) \qquad (6)$$

where 𝒴_L is the set of nodes that have labels; Y_l ∈ R^C is a vector indicating the true labels; Z_l ∈ R^C is the final embedding vector of a labeled node; softmax(Z_l) gives the predicted probabilities of all classes; and · is the dot product of two vectors.
Algorithm 3 The representation learning implemented in GCN.

Require: the heterogeneous graph G = (V, E); the fused feature matrix X′; the k-strata adjacency matrix Ã^k; the number of training epochs T.
Ensure: the final embedding Z.

    D̃_ii ← Σ_j Ã^k_ij;
    Â^k ← D̃^(-1/2) Ã^k D̃^(-1/2);
    for t = 1 ... T do
        Z ← GCN(Â^k, X′);
        Calculate the loss: L = -Σ_{l∈𝒴_L} Y_l · ln(softmax(Z_l));
        Perform back propagation and update the parameters;
    end for
    return Z.

The representation learning procedure is described in Algorithm 3.
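A minimal PyTorch sketch of Formulas (3)-(6) and Algorithm 3 (our own two-layer rendering, not the authors' released implementation; the toy adjacency, shapes, labels, and mask are placeholders, and the optimizer settings follow the experimental section below):

    import torch
    import torch.nn.functional as F

    def normalize(A_k):
        """Formula (4): A_hat = D^{-1/2} A_k D^{-1/2}, D from Formula (5)."""
        deg = A_k.sum(dim=1)                    # D_ii = sum_j A_k[i, j]
        d_inv_sqrt = deg.pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A_k * d_inv_sqrt.unsqueeze(0)

    class HCN(torch.nn.Module):
        """Two GCN layers over the normalized k-strata matrix (Formula (3))."""
        def __init__(self, f_in, hidden, n_classes):
            super().__init__()
            self.W0 = torch.nn.Linear(f_in, hidden, bias=False)
            self.W1 = torch.nn.Linear(hidden, n_classes, bias=False)

        def forward(self, A_hat, X_fused):
            H1 = torch.relu(A_hat @ self.W0(X_fused))  # H1 = sigma(A X' W0)
            return A_hat @ self.W1(H1)                 # Z  = A H1 W1

    # Training loop corresponding to Algorithm 3 (toy placeholders).
    n, f, hidden, classes = 10, 6, 64, 4
    A_hat = normalize(torch.ones(n, n))   # stands in for the k-strata matrix
    X_fused = torch.randn(n, f)
    labels = torch.randint(0, classes, (n,))
    mask = torch.arange(5)                # indices of the labeled nodes Y_L

    model = HCN(f, hidden, classes)
    opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    for epoch in range(200):
        opt.zero_grad()
        Z = model(A_hat, X_fused)
        # Formula (6): cross-entropy over labeled nodes (softmax is inside).
        loss = F.cross_entropy(Z[mask], labels[mask])
        loss.backward()
        opt.step()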
Table 2: The meta-data of the three real-world datasets.

    dataset | edge (A-B)       | number of A | number of B | number of A-B | training | validation | test | classes
    DBLP    | Paper-Author     | 14328       | 4057        | 19645         | 800      | 400        | 2857 | 4
            | Paper-Conference | 14328       | 20          | 14328         |          |            |      |
            | Paper-Term       | 14328       | 8811        | 88420         |          |            |      |
    IMDB    | Movie-Actor      | 3015        | 4293        | 9041          | 800      | 400        | 1815 | 3
            | Movie-Director   | 3015        | 1676        | 3015          |          |            |      |
    AMiner  | Paper-Scientist  | 14209       | 4162        | 14422         | 800      | 400        | 2962 | 8
            | Paper-Conference | 14209       | 2179        | 14209         |          |            |      |
A too large k brings a too dense k-strata adjacency matrix, which could increase training costs [26]. A recent study theoretically demonstrates that dropping edges reduces message passing in graph training [27]. To solve this problem, we randomly drop some k-strata edges, which we call "dilation". In other words, the "dilation" here means that we randomly choose a certain proportion of the "1" cells in the k-strata adjacency matrix and change them to "0" to make the k-strata adjacency matrix sparser. The proportion can be 30%, 50%, or more. The operation is optional, and the dilation proportion can be treated as an adjustable hyper-parameter. If such a pretreatment of dilation does not worsen the analytical outcomes on heterogeneous networks, we can use the dilation to reduce training costs.

In the implementation, to make the model robust, we adopt an "online dilation" during model training, which performs a new random drop every few epochs. The representation learning with the "online dilation" is described in Algorithm 4 below.
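The random drop itself is simple; here is a minimal PyTorch sketch (our own, operating on a dense float 0/1 matrix and dropping each off-diagonal edge independently with probability p, i.e. roughly a proportion p in expectation). Algorithm 4 then embeds this step into the training loop:

    import torch

    def dilate(A_k, p):
        """Randomly drop a proportion p of the k-strata edges ("dilation").

        A_k: symmetric 0/1 float matrix. Each "1" cell is kept with
        probability 1 - p; self-connections on the diagonal are preserved.
        """
        keep = (torch.rand_like(A_k) > p).float()
        keep = torch.triu(keep, 1)     # draw one mask per node pair
        keep = keep + keep.t()         # mirror it to stay symmetric
        A = A_k * keep
        A.fill_diagonal_(1.0)          # keep the self-connections
        return A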
Algorithm 4 The representation learning with online dilation.

Input: the heterogeneous graph G = (V, E); the fused feature matrix X′; the k-strata adjacency matrix Ã^k; the dilation proportion p%; the number q (perform a random drop every q epochs); the number of training epochs T.
Output: the final embedding Z.

    for t = 1 ... T do
        if t mod q == 0 then
            Ã^(k-dilated) ← randomly drop p% of the relations in Ã^k;
            D̃_ii ← Σ_j Ã^(k-dilated)_ij;
            Â^(k-dilated) ← D̃^(-1/2) Ã^(k-dilated) D̃^(-1/2);
        end if
        Z ← GCN(Â^(k-dilated), X′);
        Calculate the loss: L = -Σ_{l∈𝒴_L} Y_l · ln(softmax(Z_l));
        Perform backpropagation and update the parameters;
    end for
    return Z.

The details of the three real-world HINs used in this work are shown in Table 2.

• DBLP. We extract a subset from DBLP that contains 14328 "Papers (P)", 4057 "Authors (A)", 20 "Conferences (C)", and 8811 "Terms (T)" [15]. The "Terms (T)" are processed as a feature of the "Papers (P)". The "Authors (A)" have four classes: "Database", "Data Mining", "Information Retrieval", and "Machine Learning". For the meta-path-based baseline models, we employ the widely used meta-path schemes {APA, APCPA}.
Table 3: The scores (%) of Micro-F1 and Macro-F1 on the node classification task. The experiments do not perform the dilation.

    method | DBLP meta-path | DBLP Micro-F1/Macro-F1 | IMDB meta-path | IMDB Micro-F1/Macro-F1 | AMiner meta-path | AMiner Micro-F1/Macro-F1
    GCN    | APA            | 49.84/47.00            | MAM            | 58.95/42.50            | SPS              | 25.93/24.18
           | APCPA          | 90.86/89.86            | MDM            | 58.90/46.96            | SPCPS            | 78.90/78.65
    GAT    | APA            | 46.12/42.54            | MAM            | 37.65/33.49            | SPS              | 12.21/08.47
           | APCPA          | 71.93/71.20            | MDM            | 40.23/35.03            | SPCPS            | 44.98/38.50
    HAN    | APA+APCPA      | 43.92/41.24            | MAM+MDM        | 40.66/35.56            | SPS+SPCPS        | 46.18/49.21
    HCN    | none           |                        | none           |                        | none             |
Table 4: The values (%) of Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) on the node clustering task.

    method | DBLP meta-path | DBLP NMI/ARI | IMDB meta-path | IMDB NMI/ARI | AMiner meta-path | AMiner NMI/ARI
    GCN    | APA            | 22.12/05.61  | MAM            | 07.67/05.40  | SPS              | 05.42/02.76
           | APCPA          | 68.88/74.01  | MDM            | 10.64/09.24  | SPCPS            | 46.39/33.91
    GAT    | APA            | 22.12/05.90  | MAM            | 07.78/05.81  | SPS              | 02.09/00.95
           | APCPA          | 67.98/71.94  | MDM            | 08.57/01.70  | SPCPS            | 41.20/27.75
    HAN    | APA+APCPA      | 66.40/72.96  | MAM+MDM        | 11.26/09.98  | SPS+SPCPS        | 28.03/16.11
    HCN    | none           |              | none           |              | none             |

• IMDB. IMDB is a dataset of movies. The experimental subset includes 3015 "Movies (M)", 4293 "Actors (A)", and 1676 "Directors (D)". The "Movies (M)" have three classes: "Action", "Comedy", and "Drama". The widely used meta-path schemes {MAM, MDM} are adopted in the meta-path-based baseline models.

• AMiner. AMiner is also a computer science publication dataset. The subset involves 4162 "Scientists (S)", 14209 "Papers (P)", and 2179 "Conferences (C)". In the node classification task, the "Scientists (S)" have eight classes: "computational linguistics", "computer graphics", "computer networks & wireless communication", "computer vision & pattern recognition", "computing systems", "databases & information systems", "human computer interaction", and "theoretical computer science". For the baseline models that use meta-paths, we employ {SPS, SPCPS}.

The baseline models include three state-of-the-art models: GCN, GAT, and HAN.

• GCN [21]. A semi-supervised graph convolutional network. We test all the aforementioned meta-paths in "5.1 Datasets" and report their performance respectively.

• GAT [22]. A semi-supervised neural network that applies the attention mechanism to homogeneous graphs. We test all the meta-paths.

• HAN [15]. HAN is composed of two parts: the GNN layers and a subsequent k-Nearest Neighbor (KNN) layer. The final outcomes of the node classification come from the KNN instead of the GNN, although the GNN is originally trained for the node classification task. The input of the KNN is the output of the second-to-last layer in the GNN, so HAN is not end-to-end learning. Since GCN, GAT, and the proposed method do not use KNN, to perform a fair comparison we keep the GNN layers but remove the KNN from HAN.

• HCN. The proposed meta-path-free representation learning for HNE. The code will be released on GitHub.
We stack two layers of GNN, as commonly adopted in most GNNs. We randomly initialize the parameters with a uniform distribution. The Adam optimizer [28] and early stopping with a patience of 100 epochs are applied to update the gradients. Besides, we set the learning rate to 0.01, the regularization parameter to 0.0005, and the dropout rate to 0.5. The baseline methods use the same parameter settings.
Figure 6: The values (%) of Micro-F1, Macro-F1, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) under different values of the hyper-parameter k. As k increases, each curve reaches a peak and then drops. In classification, the 3-strata HCN achieves the highest Micro/Macro-F1 scores on DBLP and AMiner, and the 2-strata HCN performs best on IMDB. In clustering, the 3-strata HCN achieves the highest NMI/ARI values on DBLP and AMiner, while the 1-stratum HCN performs best on IMDB. In each task, there is an optimal k value and a balance to strike when tuning k.

For the DBLP and AMiner datasets, we set the number of hidden neurons to 64, while for IMDB the number of hidden neurons is set to 32. To ensure fairness, we split the datasets and use the same training, validation, and test sets for all the models in this work.

To evaluate the performance of HNE, we perform multi-class classification: four classes for "Authors" in DBLP, three classes for "Movies" in IMDB, and eight classes for "Scientists" in AMiner. Please note that we do not perform the online dilation in the experiments of this section.

Table 3 presents the Micro-F1 and Macro-F1 scores for the classification tasks. The proposed method performs best on all three datasets. In detail, the 3-strata HCN achieves the highest Micro-F1 and Macro-F1 scores on DBLP and AMiner, and the 2-strata HCN performs best on IMDB.

The results also demonstrate that different meta-paths lead to different analytical outcomes. In DBLP, APCPA achieves much better classification results than APA in both GCN and GAT; in IMDB, "MAM" and "MDM" lead to different results; and in AMiner, "SPCPS" yields better classification outcomes than "SPS" in both GCN and GAT.

Please note that the proposed HCN can achieve embeddings of various node types, such as A, P, and C in DBLP; M, A, and D in IMDB; and S, P, and C in AMiner. Comparatively, GCN, GAT, and HAN only learn the embedding of one node type: A in DBLP, M in IMDB, and S in AMiner.
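For reference, the Micro-F1 and Macro-F1 scores in Table 3 can be computed with scikit-learn (a generic evaluation sketch; y_true and y_pred are placeholders for the test labels and the argmax of the final embedding Z):

    from sklearn.metrics import f1_score

    y_true = [0, 1, 2, 2, 3]   # placeholder ground-truth classes
    y_pred = [0, 1, 2, 1, 3]   # placeholder predictions, e.g. Z.argmax(1)
    print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))
    print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))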
Figure 7: The results of the dilation. The X-axis represents the different dilation rates on AMiner, DBLP, and IMDB, respectively. The orange bars represent the results of a dilation percentage of 30%, which means we randomly drop 30% and keep 70% of the k-strata edges. The blue bars represent the results of a dilation percentage of 50%. The grey bars represent the results of no dilation, which correspond to the results in Table 3 and Table 4. The Y-axis represents the values (%) of Micro-F1, Macro-F1, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI), respectively. We present the absolute and relative values above the bars: we use the values of the grey bars as 100% and present relative values in percentage for the blue and orange bars. The mean ± standard deviation of the relative values of the blue bars and of the orange bars both remain close to 100%. The t-tests find no statistical significance between the blue bars and the grey bars or between the orange bars and the grey bars; both p-values are larger than 0.05. In brief, the dilation does not deteriorate the outcomes in the experiments.

To further evaluate the performance of the HNE, we also perform clustering. We use K-means to cluster the nodes, with the number of clusters set to the number of classes in each dataset. Since the performance of K-means is influenced by the initial centroids, all clustering experiments are conducted 10 times and the average results are reported.

Table 4 summarizes the clustering results under the metrics of Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) (%). The proposed method performs best on all three datasets. In particular, the 3-strata HCN achieves the highest NMI/ARI values on DBLP and AMiner, while the 1-stratum HCN performs best on IMDB.
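The clustering protocol just described (K-means with the number of clusters set to the number of classes, NMI/ARI averaged over 10 runs) can be sketched with scikit-learn as follows (a generic sketch; Z and labels are placeholders for the learned embedding and the ground truth):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import (adjusted_rand_score,
                                 normalized_mutual_info_score)

    def clustering_scores(Z, labels, n_classes, runs=10):
        """K-means on the embedding Z; NMI and ARI averaged over `runs`."""
        nmi, ari = [], []
        for r in range(runs):
            pred = KMeans(n_clusters=n_classes, n_init=10,
                          random_state=r).fit_predict(Z)
            nmi.append(normalized_mutual_info_score(labels, pred))
            ari.append(adjusted_rand_score(labels, pred))
        return np.mean(nmi), np.mean(ari)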
We find that the clustering results of different meta-paths also differ: APCPA achieves better results than APA in DBLP; "MAM" and "MDM" perform differently in IMDB; and "SPCPS" surpasses "SPS" in AMiner. The results show that researchers need to compare different meta-paths when using meta-path-based methods.

The experiments in this section evaluate how the hyper-parameter k influences the performance. Figure 6 illustrates the comparison of different values of k. As k increases, the curve in each subplot reaches a peak and then drops. In other words, we find that, in each task, there is an optimal k value.

The explanation could be as follows. In the beginning, as the hyper-parameter k increases, more composite relations are generated, which contribute to better analytical outcomes. Take the DBLP-like network in Figure 3 as an example. For the given node A1, when k becomes 2, new two-hop composite relations, such as A1-P1-A2 (a co-authorship between two authors) and A1-P1-C1 (a participation relation between an author and a conference), capture more semantics and therefore improve the analytical outcomes. Nonetheless, when k is too big, the k-hop composite relations with long distances may bring in weak relations and even noise, which damages the analytical outcomes.

In conclusion, we need to tune an appropriate value of k. The integer k is a hyper-parameter, just like the number of layers or neurons in a fully-connected neural network. One can use grid search to find the optimal k automatically.

Figure 7 evaluates the results of the online dilation. The X-axis represents the different dilation rates on AMiner, DBLP, and IMDB, respectively. The orange bars represent the results of a dilation percentage of 30%, which means we randomly drop 30% and keep 70% of the k-strata edges. The blue bars represent the results of a dilation percentage of 50%. The grey bars represent the results of no dilation, which correspond to the results in Table 3 and Table 4. The Y-axis represents the values (%) of Micro-F1, Macro-F1, NMI, and ARI, respectively. We present the absolute and relative values above the bars. We set the values of the grey bars to 100% and calculate relative values in percentage for the blue and orange bars. The mean ± standard deviation of the relative values of the blue bars and of the orange bars remain close to 100% in all the experiments in Figure 7. By t-tests, we find these differences have no statistical significance, which means that dropping 30% or even half of the k-strata edges does not make the analytical outcomes worse.

Why does the dilation not damage the analytical results? Take the DBLP-like network in Figure 3 as an example. There are 10 two-strata edges that connect to A1: A1-P1, A1-P2, A1-P3, A1-P4, A1-P5, A1-A2, A1-A3, A1-A4, A1-C1, and A1-C2, as Figure 4 shows. If the dilation percentage is 30%, we randomly drop 3 edges, such as A1-P3, A1-P4, and A1-P5, and retain the remaining 7 edges, including A1-A2. Through the consecutive relations of A1-A2-P3, A1-A2-P4, and A1-A2-P5 in the 2-strata adjacency matrix, A1 can still extract information from P3, P4, and P5 through a GNN.

The "online dilation" conducts a different random drop every few epochs. For one thing, the whole training process does not lose any information, since the dilation performs different random drops in different epochs. For another, in theory, the "online dilation" incorporates more diversity into the input data and therefore prevents over-fitting and reduces message passing [27].
In practice, although we do not find statistically significant improvements after adopting a dilation rate of 50% or 30% in this work, dropping even half of the edges does not deteriorate the analytical results and makes the k-strata adjacency matrix sparser. In real-world projects where huge knowledge graphs need to be embedded, the dilation is expected to save training costs without sacrificing accuracy.

In this work, we propose a novel meta-path-free representation learning method for HINs. The proposed method overcomes the challenge of heterogeneity and captures both semantic and structural information. The experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods in various tasks. Hopefully, this work can inspire more research on meta-path-free HNE.
References
[1] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter, 14(2):20–28, 2013.
[2] C. Lee Giles. The future of CiteSeer: CiteSeerX. In European Conference on Machine Learning, pages 2–2. Springer, 2006.
[3] Sushmita Roy, Terran Lane, and Margaret Werner-Washburne. Integrative construction and analysis of condition-specific biological networks. In National Conference on Artificial Intelligence, 2007.
[4] W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich. Knowledge discovery from transportation network data. In International Conference on Data Engineering, 2005.
[5] Chanyoung Park, Donghyun Kim, Xing Xie, and Hwanjo Yu. Collaborative translational metric learning. In 2018 IEEE International Conference on Data Mining (ICDM), pages 367–376. IEEE, 2018.
[6] Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. Multi-task feature learning for knowledge graph enhanced recommendation. In The World Wide Web Conference, pages 2000–2010. ACM, 2019.
[7] Dehai Zhang, Menglong Cui, Yun Yang, Po Yang, Cheng Xie, Di Liu, Beibei Yu, and Zhibo Chen. Knowledge graph-based image classification refinement. IEEE Access, 7:57678–57690, 2019.
[8] Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844, 2016.
[9] Yimin Zhou, Yiwei Sun, and Vasant Honavar. Improving image captioning by leveraging knowledge graphs. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 283–293. IEEE, 2019.
[10] Christy Y. Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. arXiv preprint arXiv:1903.10122, 2019.
[11] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.
[12] Liang Yao, Chengsheng Mao, and Yuan Luo. Clinical text classification with rule-based features and knowledge-guided convolutional neural networks. BMC Medical Informatics and Decision Making, 19(3):71, 2019.
[13] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 119–128. ACM, 2015.
[14] Yu Shi, Qi Zhu, Fang Guo, Chao Zhang, and Jiawei Han. Easing embedding learning by comprehensive transcription of heterogeneous information networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2190–2199. ACM, 2018.
[15] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022–2032. ACM, 2019.
[16] Hongxu Chen, Hongzhi Yin, Weiqing Wang, Hao Wang, Quoc Viet Hung Nguyen, and Xue Li. PME: Projected metric embedding on heterogeneous networks for link prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1177–1186. ACM, 2018.
[17] Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. HIN2Vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1797–1806. ACM, 2017.
[18] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144. ACM, 2017.
[19] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[20] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[21] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[22] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[23] Ting Chen and Yizhou Sun. Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 295–304. ACM, 2017.
[24] Xiang Li, Yao Wu, Martin Ester, Ben Kao, Xin Wang, and Yudian Zheng. Semi-supervised clustering in attributed heterogeneous information networks. In Proceedings of the 26th International Conference on World Wide Web, pages 1621–1629. International World Wide Web Conferences Steering Committee, 2017.
[25] Jingbo Shang, Meng Qu, Jialu Liu, Lance M. Kaplan, Jiawei Han, and Jian Peng. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769, 2016.
[26] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[27] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903, 2019.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.