MetaGraph2Vec: Complex Semantic Path Augmented Heterogeneous Network Embedding
Daokun Zhang¹, Jie Yin², Xingquan Zhu³, and Chengqi Zhang¹

¹ Centre for Artificial Intelligence, FEIT, University of Technology Sydney, Australia
  [email protected], [email protected]
² Discipline of Business Analytics, The University of Sydney, Sydney, Australia
  [email protected]
³ Dept. of CEECS, Florida Atlantic University, USA
  [email protected]
Abstract.
Network embedding in heterogeneous information networks (HINs) is a challenging task, due to complications of different node types and rich relationships between nodes. As a result, conventional network embedding techniques cannot work on such HINs. Recently, metapath-based approaches have been proposed to characterize relationships in HINs, but they are ineffective in capturing rich contexts and semantics between nodes for embedding learning, mainly because (1) a metapath is a rather strict single-path node-node relationship descriptor, which is unable to accommodate variance in relationships, and (2) only a small portion of paths can match the metapath, resulting in sparse context information for embedding learning. In this paper, we advocate a new metagraph concept to capture richer structural contexts and semantics between distant nodes. A metagraph contains multiple paths between nodes, each describing one type of relationship, so the augmentation of multiple metapaths provides an effective way to capture rich contexts and semantic relations between nodes. This greatly boosts the ability of metapath-based embedding techniques in handling very sparse HINs. We propose a new embedding learning algorithm, namely MetaGraph2Vec, which uses a metagraph to guide the generation of random walks and to learn latent embeddings of multi-typed HIN nodes. Experimental results show that MetaGraph2Vec is able to outperform the state-of-the-art baselines in various heterogeneous network mining tasks such as node classification, node clustering, and similarity search.
Recent advances in storage and networking technologies have resulted in many applications with interconnected relationships between objects. This has led to the formation of gigantic inter-related and multi-typed heterogeneous information networks (HINs) across a variety of domains, such as e-government, e-commerce, biology, social media, etc. HINs provide an effective graph model to characterize the diverse relationships among different types of nodes. Understanding the vast amount of semantic information modeled in HINs has received a lot of attention. In particular, the concept of metapaths [10], which connect two nodes through a sequence of relations between node types, is widely used to exploit rich semantics in HINs. In the last few years, many metapath-based algorithms have been proposed to carry out data mining tasks over HINs, including similarity search [10], personalized recommendation [6,9], and object clustering [11].

Despite their great potential, data mining tasks in HINs often suffer from high complexity, because real-world HINs are very large and have very complex network structures. For example, when measuring metapath similarity between two distant nodes, all metapath instances need to be enumerated. This makes it very time-consuming to perform mining tasks, such as link prediction or similarity search, across the entire network. This has inspired a lot of research interest in network embedding, which aims to embed the network into a low-dimensional vector space such that the proximity (or similarity) between nodes in the original network is preserved. Analysis and search over large-scale HINs can then be applied in the embedding space, with the help of efficient indexing or parallelized algorithms designed for vector spaces.

Conventional network embedding techniques [1,4,8,12,13,14,15,16], however, focus on homogeneous networks, where all nodes and relations are considered to have a single type.
Thus, they cannot handle the heterogeneity of node and relation types in HINs. Only very recently, metapath-based approaches [2,3], such as MetaPath2Vec [3], have been proposed to exploit specific metapaths as guidance to generate random walks and then to learn heterogeneous network embeddings. For example, consider a DBLP bibliographic network: Fig. 1(a) shows the HIN schema, which consists of three node types, Author (A), Paper (P) and Venue (V), and three edge types: an author writes a paper, a paper cites another paper, and a paper is published in a venue. The metapath P1 : A → P → V → P → A describes the relationship where both authors have papers published in the same venue, while P2 : A → P → A → P → A describes that two authors share the same co-author. If P1 is used by MetaPath2Vec to generate random walks, a possible random walk could be: a1 → p1 → v1 → p2 → a2. With a window size of 2, authors a1 and a2 would share the same context node v1, so they should be close to each other in the embedding space. This way, the semantic similarity between nodes conveyed by metapaths is preserved.

Due to difficulties in information access, however, real-world HINs often have sparse connections or many missing links. As a result, metapath-based algorithms may fail to capture latent semantics between distant nodes. As an example, consider the bibliographic network, where many papers may not have venue information, as they may be preprints submitted to upcoming venues or their venues are simply missing. The lack of paper–venue connections would result in many short random walks, failing to capture hidden semantic similarity between distant nodes. On the other hand, besides publishing papers in the same venues, distant authors can also be connected by other types of relations, such as sharing common co-authors or publishing papers with similar topics. Such information should be taken into account to augment metapath-based embedding techniques.
Fig. 1. (a) Schema; (b) metapath and metagraph

Fig. 2. An example of a random walk from a1 to a3 based on metagraph 𝒢, which cannot be generated using metapaths P1 and P2. This illustrates the ability of MetaGraph2Vec to provide richer structural contexts to measure semantic similarity between distant nodes.

Inspired by this observation, we propose a new method for heterogeneous network embedding, called MetaGraph2Vec, that learns more informative embeddings by capturing richer semantic relations between distant nodes. The main idea is to use a metagraph [5] to guide random walk generation in an HIN, which fully encodes latent semantic relations between distant nodes at the network level. A metagraph has the strength to describe complex relationships between nodes and to provide more flexible matching when generating random walks in an HIN. Fig. 1(b) illustrates a metagraph 𝒢, which describes that two authors are relevant if they have papers published in the same venue or they share the same co-authors. Metagraph 𝒢 can be considered as a union of metapaths P1 and P2, but when generating random walks, it can provide a superset of the random walks generated by both P1 and P2. Fig. 2 gives an example to illustrate the intuition behind this. When one uses metapath P1 to guide random walks, if paper p1 has no venue information, the random walk stops at p1 because the link from p1 to v1 is missing. This results in generating too many short random walks that cannot reveal the semantic relation between authors a1 and a3. In contrast, when metagraph 𝒢 is used as guidance, the random walks a1 → p1 → a2 → p2 → a3 and a1 → p2 → v1 → p3 → a3 can be generated by taking the routes through A and V in 𝒢, respectively. This testifies to the ability of MetaGraph2Vec to provide richer structural contexts to measure semantic similarity between distant nodes, thereby enabling more informative network embedding.
Based on this idea, in MetaGraph2Vec, we first propose metagraph-guided random walks in HINs to generate heterogeneous neighborhoods that fully encode rich semantic relations between distant nodes. Second, we generalize the Skip-Gram model [7] to learn latent embeddings for multiple types of nodes. Finally, we develop a heterogeneous negative sampling based method that facilitates the efficient and accurate prediction of a node's heterogeneous neighborhood. MetaGraph2Vec has the advantage of offering more flexible ways to generate random walks in HINs, so that richer structural contexts and semantics between nodes can be preserved in the embedding space.

The contributions of our paper are summarized as follows:

1. We advocate a new metagraph descriptor which augments metapaths for flexible and reliable relationship description in HINs. Our study investigates the ineffectiveness of existing metapath-based node proximity in dealing with sparse HINs, and explains the advantage of metagraph-based solutions.
2. We propose a new network embedding method, called MetaGraph2Vec, that uses metagraphs to capture richer structural contexts and semantics between distant nodes and to learn latent embeddings for multiple types of nodes in HINs.
3. We demonstrate the effectiveness of our proposed method through various heterogeneous network mining tasks such as node classification, node clustering, and similarity search, outperforming the state-of-the-art.
In this section, we formalize the problem of heterogeneous information network embedding and give some preliminary definitions.
Definition 1. A heterogeneous information network (HIN) is defined as a directed graph G = (V, E) with a node type mapping function φ : V → L and an edge type mapping function ψ : E → R. T_G = (L, R) is the network schema that defines the node type set L with φ(v) ∈ L for each node v ∈ V, and the allowable link types R with ψ(e) ∈ R for each edge e ∈ E.

Example 1. For a bibliographic HIN composed of authors, papers, and venues, Fig. 1(a) defines its network schema. The network schema contains three node types, author (A), paper (P) and venue (V), and defines three allowable relations: A −write→ P, P −cite→ P and V −publish→ P. Implicitly, the network schema also defines the reverse relations, i.e., P −write⁻¹→ A, P −cite⁻¹→ P and P −publish⁻¹→ V.

Definition 2.
Given an HIN G, heterogeneous network embedding aims to learn a mapping function Φ : V → R^d that embeds the network nodes v ∈ V into a low-dimensional Euclidean space with d ≪ |V| and guarantees that nodes sharing similar semantics in G have close low-dimensional representations Φ(v).

Definition 3. A metagraph is a directed acyclic graph (DAG) 𝒢 = (N, M, n_s, n_t) defined on the given HIN schema T_G = (L, R), which has only a single source node n_s (i.e., with 0 in-degree) and a single target node n_t (i.e., with 0 out-degree). N is the set of occurrences of node types, with n ∈ L for each n ∈ N. M is the set of occurrences of edge types, with m ∈ R for each m ∈ M.

As a metagraph 𝒢 depicts complex composite relations between nodes of types n_s and n_t, N and M may contain duplicate node and edge types. To clarify, we define the layer of each node in N as its topological order in 𝒢 and denote the number of layers by d_𝒢. According to the nodes' layers, we can partition N into disjoint subsets N[i] (1 ≤ i ≤ d_𝒢), which represent the sets of nodes in layer i. Each N[i] does not contain duplicate nodes. Now each element in N and M can be uniquely described as follows. For each n in N, there exists a unique i with 1 ≤ i ≤ d_𝒢 satisfying n ∈ N[i], and we define the layer of node n as l(n) = i. For each m ∈ M, there exist unique i and j with 1 ≤ i < j ≤ d_𝒢 satisfying m ∈ N[i] × N[j].

Example 2.
Given the bibliographic HIN G and the network schema T_G shown in Fig. 1(a), Fig. 1(b) shows an example of a metagraph 𝒢 = (N, M, n_s, n_t) with n_s = n_t = A. There are 5 layers in 𝒢, and the node set N can be partitioned into 5 disjoint subsets, one for each layer, where N[1] = {A}, N[2] = {P}, N[3] = {A, V}, N[4] = {P}, N[5] = {A}.

Definition 4.
For a metagraph 𝒢 = (N, M, n_s, n_t) with n_s = n_t, its recursive metagraph 𝒢∞ = (N∞, M∞, n∞_s, n∞_t) is the metagraph formed by tail–head concatenation of an arbitrary number of copies of 𝒢. 𝒢∞ satisfies the following conditions:

1. N∞[i] = N[i] for 1 ≤ i ≤ d_𝒢, and N∞[i] = N[((i − 1) mod (d_𝒢 − 1)) + 1] for i > d_𝒢.
2. For each m ∈ N∞[i] × N∞[j] with any i and j, m ∈ M∞ if and only if one of the following two conditions is satisfied:
   (a) 1 ≤ i < j ≤ d_𝒢 and m ∈ M ∩ (N[i] × N[j]);
   (b) i ≥ d_𝒢, 1 ≤ j − i < d_𝒢 and m ∈ M ∩ (N[((i − 1) mod (d_𝒢 − 1)) + 1] × N[((j − 1) mod (d_𝒢 − 1)) + 1]).

In the recursive metagraph 𝒢∞, for each node n ∈ N∞, we define its layer as l∞(n).

Definition 5.
Given an HIN G and a metagraph 𝒢 = (N, M, n_s, n_t) with n_s = n_t defined on its network schema T_G, together with the corresponding recursive metagraph 𝒢∞ = (N∞, M∞, n∞_s, n∞_t), we define a random walk node sequence constrained by metagraph 𝒢 as S_𝒢 = {v_1, v_2, ..., v_L} with length L satisfying the following conditions:

1. For each v_i (1 ≤ i ≤ L) in S_𝒢, v_i ∈ V, and for each v_i (1 < i ≤ L) in S_𝒢, (v_{i−1}, v_i) ∈ E. Namely, the sequence S_𝒢 respects the network structure in G.
2. φ(v_1) = n_s and l∞(φ(v_1)) = 1. Namely, the random walk starts from a node with type n_s.
3. For each v_i (1 < i ≤ L) in S_𝒢, there exists a unique j satisfying (φ(v_{i−1}), φ(v_i)) ∈ M∞ ∩ (N∞[l∞(φ(v_{i−1}))] × N∞[j]) with j > l∞(φ(v_{i−1})), φ(v_i) ∈ N∞[j] and l∞(φ(v_i)) = j. Namely, the random walk is constrained by the recursive metagraph 𝒢∞.

Example 3.
Given the metagraph 𝒢 in Fig. 1(b), a possible random walk is a1 → p1 → v1 → p2 → a2 → p3 → a3 → p4 → a4. It describes that authors a1 and a2 publish papers in the same venue v1, and that authors a2 and a4 share the common co-author a3. Compared with metapath P1 given in Fig. 1(b), metagraph 𝒢 captures richer semantic relations between distant nodes.

In this section, we first present the metagraph-guided random walk to generate heterogeneous neighborhoods in an HIN, and then present the MetaGraph2Vec learning strategy to learn latent embeddings of multiple types of nodes.
In an HIN G = (V, E), assuming a metagraph 𝒢 = (N, M, n_s, n_t) with n_s = n_t is given according to domain knowledge, we can get the corresponding recursive metagraph 𝒢∞ = (N∞, M∞, n∞_s, n∞_t). After choosing a node of type n_s, we can start the metagraph-guided random walk. We denote the transition probability guided by metagraph 𝒢 at the i-th step as Pr(v_i | v_{i−1}; 𝒢∞). According to Definition 5, if (v_{i−1}, v_i) ∉ E, or (v_{i−1}, v_i) ∈ E but there is no link from node type φ(v_{i−1}) at layer l∞(φ(v_{i−1})) to node type φ(v_i) in the recursive metagraph 𝒢∞, the transition probability Pr(v_i | v_{i−1}; 𝒢∞) is 0. The probability Pr(v_i | v_{i−1}; 𝒢∞) for v_i that satisfies the conditions of Definition 5 is defined as

  Pr(v_i | v_{i−1}; 𝒢∞) = 1 / ( T_𝒢∞(v_{i−1}) × |{u | (v_{i−1}, u) ∈ E, φ(v_i) = φ(u)}| ).   (1)

Above, T_𝒢∞(v_{i−1}) is the number of edge types among the edges starting from v_{i−1} that satisfy the constraints of the recursive metagraph 𝒢∞, which is formalized as

  T_𝒢∞(v_{i−1}) = |{ j | (φ(v_{i−1}), φ(u)) ∈ M∞ ∩ (N∞[l∞(φ(v_{i−1}))] × N∞[j]), (v_{i−1}, u) ∈ E }|,   (2)

and |{u | (v_{i−1}, u) ∈ E, φ(v_i) = φ(u)}| is the number of v_{i−1}'s 1-hop forward neighbors sharing a common node type with node v_i.

At step i, the metagraph-guided random walk works as follows. Among the edges starting from v_{i−1}, it first counts the number of edge types satisfying the constraints and randomly selects one qualified edge type. Then it randomly walks across one edge of the selected edge type to the next node. If there are no qualified edge types, the random walk terminates.
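The walk procedure above can be sketched in a few lines. The following is an assumption-laden illustration, not the authors' code: it encodes the metagraph of Fig. 1(b) as per-layer sets of allowed (source type, target type) edge types, assumes every metagraph edge advances exactly one layer (true for the metagraph in Fig. 1(b)), and takes the graph as a hypothetical `neighbors` adjacency map plus a `node_type` map.

```python
import random

# Hypothetical encoding of the metagraph of Fig. 1(b): each entry lists the
# (source type, target type) edge types allowed between consecutive layers.
# Layers repeat once a walk passes the target node (Definition 4), so
# transitions are indexed modulo the number of steps per round.
META_LAYERS = [
    {("A", "P")},              # layer 1 -> 2: author writes paper
    {("P", "A"), ("P", "V")},  # layer 2 -> 3: co-author route or venue route
    {("A", "P"), ("V", "P")},  # layer 3 -> 4
    {("P", "A")},              # layer 4 -> 5 (= layer 1 of the next round)
]
D = len(META_LAYERS)

def metagraph_walk(start, neighbors, node_type, length, rng=random):
    """One metagraph-guided walk following Eq. (1): among edges leaving the
    current node, pick a qualified edge type uniformly at random, then a
    neighbor of that type uniformly; terminate early if none qualifies."""
    walk, step = [start], 0
    while len(walk) < length:
        v = walk[-1]
        allowed = META_LAYERS[step % D]
        by_type = {}  # forward neighbors grouped by the edge type they realize
        for u in neighbors.get(v, []):
            if (node_type[v], node_type[u]) in allowed:
                by_type.setdefault(node_type[u], []).append(u)
        if not by_type:
            break  # no qualified edge type: the walk terminates (sparse HIN)
        t = rng.choice(sorted(by_type))   # uniform over qualified edge types
        walk.append(rng.choice(by_type[t]))
        step += 1
    return walk
```

Even when a paper node has no venue link, such a walk can continue through the co-author route, which is the behaviour Fig. 2 illustrates.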
Given a metagraph-guided random walk S_𝒢 = {v_1, v_2, ..., v_L} with length L, the node embedding function Φ(·) is learned by maximizing the probability of the occurrence of v_i's context nodes within a window of size w conditioned on Φ(v_i):

  min_Φ  − log Pr({v_{i−w}, ..., v_{i+w}} \ v_i | Φ(v_i)),   (3)

where

  Pr({v_{i−w}, ..., v_{i+w}} \ v_i | Φ(v_i)) = ∏_{j=i−w, j≠i}^{i+w} Pr(v_j | Φ(v_i)).   (4)

Following MetaPath2Vec [3], the probability Pr(v_j | Φ(v_i)) is modeled in two different ways:

1. Homogeneous Skip-Gram, which assumes the probability Pr(v_j | Φ(v_i)) does not depend on the type of v_j, and thus models the probability Pr(v_j | Φ(v_i)) directly by softmax:

  Pr(v_j | Φ(v_i)) = exp(Ψ(v_j) · Φ(v_i)) / Σ_{u ∈ V} exp(Ψ(u) · Φ(v_i)).   (5)

2. Heterogeneous Skip-Gram, which assumes the probability Pr(v_j | Φ(v_i)) is related to the type of node v_j:

  Pr(v_j | Φ(v_i)) = Pr(v_j | Φ(v_i), φ(v_j)) Pr(φ(v_j) | Φ(v_i)),   (6)

where the probability Pr(v_j | Φ(v_i), φ(v_j)) is modeled via softmax:

  Pr(v_j | Φ(v_i), φ(v_j)) = exp(Ψ(v_j) · Φ(v_i)) / Σ_{u ∈ V, φ(u) = φ(v_j)} exp(Ψ(u) · Φ(v_i)).   (7)

To learn node embeddings, the MetaGraph2Vec algorithm first generates a set of metagraph-guided random walks, and then counts the occurrence frequency F(v_i, v_j) of each node context pair (v_i, v_j) within a window of size w. After that, stochastic gradient descent is used to learn the parameters. At each iteration, a node context pair (v_i, v_j) is sampled according to the distribution of F(v_i, v_j), and the gradients are updated to minimize the following objective:

  O_ij = − log Pr(v_j | Φ(v_i)).   (8)

To speed up training, negative sampling is used to approximate the objective function:

  O_ij = log σ(Ψ(v_j) · Φ(v_i)) + Σ_{k=1}^{K} log σ(−Ψ(v^N_{j,k}) · Φ(v_i)),   (9)

where σ(·) is the sigmoid function, v^N_{j,k} is the k-th negative node sampled for node v_j, and K is the number of negative samples. For Homogeneous Skip-Gram, v^N_{j,k} is sampled from all nodes in V; for Heterogeneous Skip-Gram, v^N_{j,k} is sampled from nodes with type φ(v_j). Formally, the parameters Φ and Ψ are updated as follows:

  Φ = Φ − α ∂O_ij/∂Φ;  Ψ = Ψ − α ∂O_ij/∂Ψ,   (10)

where α is the learning rate.

The pseudo code of the MetaGraph2Vec algorithm is given in Algorithm 1.

In this section, we demonstrate the effectiveness of the proposed algorithms for heterogeneous network embedding via various network mining tasks, including node classification, node clustering, and similarity search.
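Before turning to the experiments, the negative-sampling update of Eqs. (9)–(10) can be made concrete with a short numpy sketch. All names and sizes (`Phi`, `Psi`, `V`, `d`, `K`, `alpha`) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes and learning rate (not the paper's settings):
V, d, K, alpha = 20, 8, 5, 0.025
Phi = rng.normal(scale=0.1, size=(V, d))  # target (input) embeddings
Psi = rng.normal(scale=0.1, size=(V, d))  # context (output) embeddings

def sgd_step(i, j, negatives):
    """One negative-sampling update for the pair (v_i, v_j): gradient ascent
    on Eq. (9), log s(Psi_j . Phi_i) + sum_k log s(-Psi_k . Phi_i)."""
    grad_phi = np.zeros(d)
    for u, label in [(j, 1.0)] + [(int(k), 0.0) for k in negatives]:
        score = sigmoid(Psi[u] @ Phi[i])
        g = alpha * (label - score)   # scaled gradient of the log-likelihood
        grad_phi += g * Psi[u]
        Psi[u] += g * Phi[i]          # update context vector (Eq. (10))
    Phi[i] += grad_phi                # update target vector (Eq. (10))

# Homogeneous Skip-Gram draws negatives from all of V; Heterogeneous
# Skip-Gram would restrict them to nodes with the same type as v_j.
sgd_step(0, 1, rng.integers(0, V, size=K))
```

Each call pulls the context vector of v_j toward Φ(v_i) and pushes the sampled negatives away, which is exactly what the two update rules in Eq. (10) prescribe.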
Algorithm 1
The MetaGraph2Vec Algorithm
Input: (1) A heterogeneous information network (HIN): G = (V, E);
(2) A metagraph: 𝒢 = (N, M, n_s, n_t) with n_s = n_t;
(3) Maximum number of iterations: MaxIterations;
Output:
Node embedding Φ(·) for each v ∈ V;
1: S ← generate a set of random walks according to 𝒢;
2: F(v_i, v_j) ← count the frequency of node context pairs (v_i, v_j) in S;
3: Iterations ← 0;
4: repeat
5:   (v_i, v_j) ← sample a node context pair according to the distribution of F(v_i, v_j);
6:   (Φ, Ψ) ← update parameters using (v_i, v_j) and Eq. (10);
7:   Iterations ← Iterations + 1;
8: until convergence or Iterations ≥ MaxIterations
return Φ;

For evaluation, we carry out experiments on the DBLP bibliographic HIN, which is composed of papers, authors, venues, and their relationships. Based on papers' venues, we extract papers falling into four research areas: Database, Data Mining, Artificial Intelligence, and Computer Vision, and preserve the associated authors and venues, together with their relations. To simulate paper–venue sparsity, we randomly select 1/5 of the papers and remove their paper–venue relations. This results in a dataset that contains 70,910 papers, 67,950 authors, and 97 venues, as well as 189,875 paper–author relations, 91,048 paper–paper relations and 56,728 venue–paper relations.

To evaluate the quality of the learned embeddings, we carry out multi-class classification, clustering and similarity search on author embeddings. The metapaths and metagraph shown in Fig. 1(b) are used to measure the proximity between authors. An author's ground-truth label is determined by the research area of his/her major publications.

We evaluate MetaGraph2Vec with Homogeneous Skip-Gram and its variant MetaGraph2Vec++ with Heterogeneous Skip-Gram. We compare their performance with the following state-of-the-art baseline methods:

– DeepWalk [8]: It uses uniform random walks that treat nodes of different types equally to generate random walks.
– LINE [12]: We use two versions of LINE, namely LINE 1 and LINE 2, which model the first-order and second-order proximity, respectively. Both neglect different node types and edge types.
– MetaPath2Vec and MetaPath2Vec++ [3]: They are the state-of-the-art network embedding algorithms for HINs, with MetaPath2Vec++ being a variant of MetaPath2Vec that uses heterogeneous negative sampling.
To demonstrate the strength of metagraphs over metapaths, we compare with different versions of the two algorithms: P1 MetaPath2Vec, P2 MetaPath2Vec and Mixed MetaPath2Vec, which use P1 only, P2 only, or both, to guide random walks, as well as their counterparts, P1 MetaPath2Vec++, P2 MetaPath2Vec++, and Mixed MetaPath2Vec++. (The DBLP data is from https://aminer.org/citation; Version 3 is used.)

For all random walk based algorithms, for efficiency reasons, we start random walks of length L = 100 at each author for γ = 80 times. For the mixed MetaPath2Vec methods, γ/2 walks are guided by each of P1 and P2, respectively. To improve efficiency, we use our optimization strategy for all random walk based methods: after random walks are generated, we first count the co-occurrence frequencies of node context pairs using a window size w = 5, and according to the frequency distribution, we then sequentially sample node context pairs to do stochastic gradient descent. For fair comparison, the total number of samples (iterations) is set to 100 million for both the random walk based methods and LINE. For all methods, the dimension of learned node embeddings d is set to 128.

We first carry out multi-class classification on the learned author embeddings to compare the performance of all algorithms. We vary the ratio of training data from 1% to 9%. For each training ratio, we randomly split the training set and test set 10 times and report the averaged accuracy.
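The classification protocol above can be sketched as follows. Since the text does not name the classifier applied to the embeddings, this illustration substitutes a simple nearest-class-centroid rule; the function and its arguments are hypothetical.

```python
import numpy as np

def eval_classification(emb, labels, train_ratio, runs=10, seed=0):
    """Average accuracy over `runs` random train/test splits, following the
    protocol in the text.  A nearest-class-centroid rule stands in for the
    (unspecified) classifier used in the paper."""
    rng = np.random.default_rng(seed)
    n, accs = len(labels), []
    for _ in range(runs):
        idx = rng.permutation(n)
        n_train = max(int(train_ratio * n), 1)
        train, test = idx[:n_train], idx[n_train:]
        classes = np.unique(labels[train])
        centroids = np.stack([emb[train][labels[train] == c].mean(axis=0)
                              for c in classes])
        # assign each test embedding to the class of the nearest centroid
        dists = np.linalg.norm(emb[test][:, None] - centroids[None], axis=2)
        pred = classes[dists.argmin(axis=1)]
        accs.append(float((pred == labels[test]).mean()))
    return float(np.mean(accs))
```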
Table 1.
Multi-class author classification on DBLP
Method                  1%    2%    3%    4%    5%    6%    7%    8%    9%
DeepWalk               82.39 86.04 87.16 88.15 89.10 89.49 90.02 90.25 90.56
LINE 1                 71.25 79.25 83.11 85.60 87.17 88.29 89.05 89.45 89.63
LINE 2                 75.70 80.80 82.49 83.88 84.83 85.71 86.58 86.90 86.93
P1 MetaPath2Vec        83.24 87.70 88.42 89.05 89.26 89.46 89.51 89.76 89.69
P1 MetaPath2Vec++      82.14 86.02 87.04 87.96 88.47 88.66 88.90 88.91 89.02
P2 MetaPath2Vec        49.59 52.12 53.76 54.67 55.68 55.49 55.83 55.68 56.07
P2 MetaPath2Vec++      50.31 52.50 53.72 54.47 55.53 55.78 56.30 56.36 57.02
Mixed MetaPath2Vec     83.86 87.34 88.37 89.22 89.70 90.01 90.37 90.42 90.71
Mixed MetaPath2Vec++   83.08 86.91 88.13 89.07 89.69 90.09 90.58 90.68 90.87
MetaGraph2Vec
Table 1 shows the multi-class author classification results in terms of accuracy (%) for all algorithms, with the highest scores highlighted in bold. Our MetaGraph2Vec and MetaGraph2Vec++ algorithms achieve the best performance in all cases. The performance gain over metapath-based algorithms proves the capacity of MetaGraph2Vec to capture complex semantic relations between distant authors in sparse networks, and the effectiveness of this semantic similarity in learning informative node embeddings. By considering metapaths between different types of nodes, MetaPath2Vec can capture better proximity properties and learn better author embeddings than DeepWalk and LINE, which neglect different node types and edge types.

We also carry out node clustering experiments to compare the different embedding algorithms. We take the learned author embeddings produced by different methods as input and adopt K-means to do the clustering. With authors' labels as ground truth, we evaluate the quality of clustering using three metrics: Accuracy, F-score and NMI. From Table 2, we can see that MetaGraph2Vec and MetaGraph2Vec++ yield the best clustering results on all three metrics.

Table 2.
Author clustering on DBLP
Method                 Accuracy (%)   F (%)   NMI (%)
DeepWalk                  73.87       67.39    42.02
LINE 1                    50.26       46.33    17.94
LINE 2                    52.14       45.89    19.55
P1 MetaPath2Vec           69.39       63.05    41.72
P1 MetaPath2Vec++         66.11       58.68    36.45
P2 MetaPath2Vec           47.51       43.30     6.17
P2 MetaPath2Vec++         47.65       41.48     6.56
Mixed MetaPath2Vec        77.20       69.50    49.43
Mixed MetaPath2Vec++      72.36       65.09    42.40
MetaGraph2Vec
MetaGraph2Vec++           77.48       70.69    50.60
Experiments are also performed on similarity search to verify the ability of MetaGraph2Vec to capture author proximities in the embedding space. We randomly select 1,000 authors and rank their similar authors according to cosine similarity scores. Table 3 gives the averaged precision@100 and precision@500 for the different embedding algorithms. As can be seen, our MetaGraph2Vec and MetaGraph2Vec++ achieve the best search precisions.
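A minimal sketch of this evaluation, under the assumption that a retrieved author counts as correct when it shares the query's research-area label (the helper name is ours):

```python
import numpy as np

def precision_at_k(emb, labels, query_idx, k):
    """Average precision@k for similarity search: rank all other nodes by
    cosine similarity to each query and count how many of the top-k results
    share the query's label (the stand-in here for a truly similar author)."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    precs = []
    for q in query_idx:
        sims = normed @ normed[q]
        sims[q] = -np.inf                   # exclude the query itself
        top = np.argsort(-sims)[:k]
        precs.append(float((labels[top] == labels[q]).mean()))
    return float(np.mean(precs))
```

With 1,000 random query authors and k = 100 or k = 500, this reproduces the protocol reported in Table 3.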
We further analyze the sensitivity of MetaGraph2Vec and MetaGraph2Vec++ to three parameters: (1) γ: the number of metagraph-guided random walks starting from each author; (2) w: the window size used for collecting node context pairs; and (3) d: the dimension of the learned embeddings. Fig. 3 shows node classification performance at the 5% training ratio when varying the values of these parameters. We can see that, as the dimension of learned embeddings d increases, MetaGraph2Vec and MetaGraph2Vec++ gradually perform better and then stay at a stable level. Yet, both algorithms are not very sensitive to the number of random walks or the window size.

Table 3.
Author similarity search on DBLP
Method                 Precision@100 (%)   Precision@500 (%)
DeepWalk                     91.65               91.44
LINE 1                       91.18               89.88
LINE 2                       91.92               91.38
P1 MetaPath2Vec              88.21               88.64
P1 MetaPath2Vec++            88.68               88.58
P2 MetaPath2Vec              53.98               44.11
P2 MetaPath2Vec++            53.39               44.11
Mixed MetaPath2Vec           90.94               90.27
Mixed MetaPath2Vec++         91.49               90.69
MetaGraph2Vec                92.50
MetaGraph2Vec++
Fig. 3.
The effect of parameters γ, w, and d on node classification performance

This paper studied network embedding learning for heterogeneous information networks. We analyzed the ineffectiveness of existing metapath-based approaches in handling sparse HINs, mainly because metapaths are too strict for capturing relationships in HINs. Accordingly, we proposed a new metagraph relationship descriptor which augments metapaths for flexible and reliable relationship description in HINs. By using metagraphs to guide the generation of random walks, our newly proposed algorithm, MetaGraph2Vec, can capture rich context and semantic information between different types of nodes in the network. The main contribution of this work, compared to the existing research in the field, is twofold: (1) a new metagraph-guided random walk approach to capturing rich contexts and semantics between nodes in HINs, and (2) a new network embedding algorithm for very sparse HINs, outperforming the state-of-the-art.

In the future, we will study automatic methods for efficiently learning metagraph structures from HINs and assess the contributions of different metagraphs to network embedding. We will also evaluate the performance of MetaGraph2Vec on other types of HINs, such as heterogeneous biological networks and social networks, for producing informative node embeddings.
Acknowledgments. This work is partially supported by the Australian Research Council (ARC) under discovery grant DP140100545, and by the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning. Daokun Zhang is supported by the China Scholarship Council (CSC) under No. 201506300082 and a supplementary postgraduate scholarship from CSIRO.

References
1. Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proceedings of CIKM, pages 891–900. ACM, 2015.
2. Ting Chen and Yizhou Sun. Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of WSDM, pages 295–304. ACM, 2017.
3. Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of SIGKDD, pages 135–144. ACM, 2017.
4. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of SIGKDD, pages 855–864. ACM, 2016.
5. Zhipeng Huang, Yudian Zheng, Reynold Cheng, Yizhou Sun, Nikos Mamoulis, and Xiang Li. Meta structure: Computing relevance in large heterogeneous information networks. In Proceedings of SIGKDD, pages 1595–1604. ACM, 2016.
6. Mohsen Jamali and Laks Lakshmanan. HeteroMF: Recommendation in heterogeneous information networks using context dependent factor models. In Proceedings of WWW, pages 643–654. ACM, 2013.
7. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
8. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of SIGKDD, pages 701–710. ACM, 2014.
9. Chuan Shi, Zhiqiang Zhang, Ping Luo, Philip S. Yu, Yading Yue, and Bin Wu. Semantic path based personalized recommendation on weighted heterogeneous information networks. In Proceedings of CIKM, pages 453–462. ACM, 2015.
10. Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. In Proceedings of VLDB, pages 992–1003, 2011.
11. Yizhou Sun, Brandon Norick, Jiawei Han, Xifeng Yan, Philip S. Yu, and Xiao Yu. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In Proceedings of SIGKDD, pages 1348–1356. ACM, 2012.
12. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of WWW, pages 1067–1077. ACM, 2015.
13. Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of SIGKDD, pages 1225–1234. ACM, 2016.
14. Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Homophily, structure, and content augmented network representation learning. In Proceedings of ICDM, pages 609–618. IEEE, 2016.
15. Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. User profile preserving social network embedding. In Proceedings of IJCAI, pages 3378–3384. AAAI Press, 2017.
16. Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Network representation learning: A survey. arXiv preprint arXiv:1801.05852, 2018.