A Multi-Semantic Metapath Model for Large Scale Heterogeneous Network Representation Learning
Xuandong Zhao, Jinbao Xue, Jin Yu, Xi Li, Hongxia Yang
College of Computer Science, Zhejiang University; Alibaba
[email protected], {zhiji.xjb, kola.yu, yang.yhx}@alibaba-inc.com, [email protected]
Abstract
Network embedding has been widely studied to model and manage data in a variety of real-world applications. However, most existing works focus on networks with single-typed nodes or edges, with limited consideration of unbalanced distributions of nodes and edges. In real-world applications, networks usually consist of billions of nodes and edges of various types with abundant attributes. To tackle these challenges, in this paper we propose a multi-semantic metapath (MSM) model for large-scale heterogeneous representation learning. Specifically, we generate multi-semantic metapath-based random walks to construct the heterogeneous neighborhood, which handles the unbalanced distributions, and we propose a unified framework for embedding learning. We conduct systematic evaluations of the proposed framework on two challenging datasets: Amazon and Alibaba. The results empirically demonstrate that MSM achieves significant gains over previous state-of-the-art methods on link prediction.
1 Introduction

Networks (or graphs) are general data structures for exploring and modeling complex systems in various real-world applications, including social networks, academic networks, physical systems, biological networks and knowledge graphs [1, 2, 3, 4]. Mining knowledge in networks has recently attracted tremendous attention due to significant progress in downstream network learning tasks such as node classification, link prediction and community detection.

Network embedding is an effective and efficient way to convert complex network data into a low-dimensional space. Earlier works proposed word2vec-based network representation learning frameworks, such as DeepWalk [5], LINE [6], and node2vec [7], which introduce deep learning techniques into network analysis to learn node embeddings. However, these works focus on representation learning for homogeneous networks with single-typed nodes and edges. More recently, metapath2vec [8] was proposed for heterogeneous networks, but it is designed for simple "heterogeneous networks" with multi-type nodes and single-type edges. PMNE [9] and MNE [10] target single-type nodes but multi-type edges. Real-world network-structured applications, such as e-commerce platforms, are much more complicated, containing multi-type nodes, multi-type edges and many attributes. GATNE [11] focuses on embedding learning for attributed multiplex heterogeneous networks; its model can capture rich attribute information and utilize multiplex topological structures from different node types. However, GATNE can only deal with bipartite graphs, which contain two types of nodes. In real-world applications, there are more than two types of nodes, and the number of nodes of each type may be unbalanced.

In this paper, to fill this gap, we propose a novel network embedding framework, MSM, that learns embeddings on general heterogeneous networks based on multi-semantic metapaths.
Concretely, we present multi-semantic metapath-guided random walks to generate heterogeneous neighborhoods for each node over all edge types. In this way, the model can capture the structural and semantic relations in rich neighbor information and extract the necessary nodes and edges for an unbalanced network. Experimental results on real-world datasets demonstrate the improvements of our proposed MSM over other state-of-the-art methods.

To summarize, our work makes the following contributions:

(1) We propose a heterogeneous network embedding method that uncovers the semantic and structural information of general heterogeneous networks with multi-type nodes, multi-type edges and node attributes.

(2) We propose multi-semantic metapath-guided random walks to generate heterogeneous neighborhoods and extract critical nodes and edges to handle unbalanced data.

(3) We demonstrate the effectiveness of our proposed MSM on the Amazon book&movie and Alibaba datasets.

2 Preliminaries

We formalize the problem of heterogeneous network embedding and give some preliminary definitions.

A heterogeneous network with multi-type nodes, multi-type edges and node attributes is defined as a graph $G = (\mathcal{V}, \mathcal{E}, \mathcal{A})$, $\mathcal{E} = \bigcup_{r \in \mathcal{R}} \mathcal{E}_r$, where $\mathcal{E}_r$ consists of all edges with edge type $r \in \mathcal{R}$, and $|\mathcal{R}| > 1$. Each node $v_i \in \mathcal{V}$ has some types of feature vectors.
$\mathcal{A} = \{x_i \mid v_i \in \mathcal{V}\}$ is the set of node features for all nodes, where $x_i$ is the feature vector associated with node $v_i$.

Heterogeneous network embedding aims to give a unified low-dimensional representation of each node $v$ on every edge type $r$, i.e., to learn a mapping function $f_r : \mathcal{V} \rightarrow \mathbb{R}^d$ ($d \ll |\mathcal{V}|$).

A metapath $\rho(V, P)$ is defined as a path of the form $V_1 \xrightarrow{P_{12}} V_2 \xrightarrow{P_{23}} \cdots V_t \xrightarrow{P_{t(t+1)}} V_{t+1} \cdots \xrightarrow{P_{(l-1)l}} V_l$, wherein $P = P_{12} \circ P_{23} \circ \cdots \circ P_{(l-1)l}$ defines the composite relation between node types $V_1$ and $V_l$.

A multi-semantic metapath $\rho(V, P, E)$ is defined as a path of the form $V_1 \xrightarrow{T(P_{12})} V_2 \xrightarrow{T(P_{23})} \cdots V_t \xrightarrow{T(P_{t(t+1)})} V_{t+1} \cdots \xrightarrow{T(P_{(l-1)l})} V_l$. Different from a metapath, in a multi-semantic metapath $T(\cdot)$ is a target edge function that returns $\mathcal{E}_i$ in the multiplex heterogeneous network. For example, Figure 1(a) shows the Alibaba dataset with user (U), item (I) and video (V) as nodes. Because there are multiple edges between two different nodes, we can construct multi-semantic metapaths such as (U-watched video-V-related item-I), which denotes that customers like similar items and movies.

3 Method

MSM is designed to cope with networks that carry heterogeneous and attributed information on both vertices and edges. Following GATNE [11], we propose a transductive model and an inductive model for MSM. The transductive model utilizes multiplex topological structures from different node types. The inductive model extends the transductive model to capture rich attribute information and generate embeddings for unseen data. The framework of MSM is illustrated in Figure 1.

In the MSM transductive model (MSM-T), we split the overall embedding of a node $v_i$ on each edge type $r$ into two parts: a base embedding and an edge embedding. The base embedding of node $v_i$ is shared between different edge types.
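As a concrete illustration of the multi-semantic metapath definition, a guided walk can be generated roughly as follows. This is a minimal Python sketch, not the paper's implementation: the adjacency layout `graph[node][edge_type]` and all names are illustrative assumptions.

```python
import random

def multi_semantic_walk(graph, start_node, metapath, walk_length):
    """Generate one walk guided by a multi-semantic metapath.

    `graph[node][edge_type]` is assumed to map to the neighbors reachable
    from `node` via edges of that type (an illustrative layout).
    `metapath` is a cyclic list of (node_type, edge_type) steps, e.g.
    [("U", "watched_video"), ("V", "related_item"), ...].
    """
    walk = [start_node]
    cur = start_node
    for i in range(walk_length - 1):
        _, edge_type = metapath[i % len(metapath)]
        neighbors = graph.get(cur, {}).get(edge_type, [])
        if not neighbors:
            break  # dead end: no neighbor under the required edge type
        cur = random.choice(neighbors)
        walk.append(cur)
    return walk
```

Restricting each step to the edge type demanded by the metapath is what lets the walk balance rarely-visited node and edge types instead of drifting toward the majority type.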
The $k$-th level edge embedding $u_{i,r}^{(k)} \in \mathbb{R}^s$ ($1 \le k \le K$) of node $v_i$ on edge type $r$ is aggregated from the neighbors' edge embeddings as:

$$u_{i,r}^{(k)} = \sigma\left(\hat{W}^{(k)} \cdot \text{mean}\left(\left\{u_{j,r}^{(k-1)}, \forall v_j \in \mathcal{N}_{i,r}\right\}\right)\right), \quad (1)$$

where $\mathcal{N}_{i,r}$ is the set of neighbors of node $v_i$ on edge type $r$ and $\sigma$ is an activation function. The initial edge embedding $u_{i,r}^{(0)}$ for each node $v_i$ and each edge type $r$ is randomly initialized in the model.

[Figure 1: Overview of the multi-semantic metapath model for large scale heterogeneous representation learning (MSM). (a) Heterogeneous network for the Alibaba dataset: three types of nodes and six types of edges (conversion, watched video, click video, click item, related item, related video); each node is associated with different features. (b) Multi-semantic metapath-based random walk; for the example above, the first walk means user-(watched video)-video-(conversion)-item-(conversion)-video-(watched video)-user, fed into a heterogeneous skip-gram model. (c) Combining the base embedding, edge embedding and attribute embedding.]

With the $K$-th level edge embedding $u_{i,r}^{(K)}$ denoted as $u_{i,r}$, we use a self-attention mechanism to compute the coefficients $a_{i,r} \in \mathbb{R}^m$ of the linear combination of the vectors in $U_i$ on edge type $r$ as:

$$a_{i,r} = \text{softmax}\left(w_r^T \tanh(W_r U_i)\right)^T, \quad (2)$$
$$U_i = (u_{i,1}, u_{i,2}, \ldots, u_{i,m}), \quad (3)$$

where $w_r$ and $W_r$ are trainable parameters for edge type $r$ with sizes $d_a$ and $d_a \times s$, respectively. Therefore, the overall embedding of node $v_i$ for edge type $r$ is:

$$v_{i,r} = b_i + \alpha_r M_r^T U_i a_{i,r}, \quad (4)$$

where $b_i$ is the base embedding of node $v_i$, $\alpha_r$ is a hyper-parameter denoting the importance of the edge embeddings towards the overall embedding, and $M_r \in \mathbb{R}^{s \times d}$ is a trainable transformation matrix.

In the MSM inductive model (MSM-I), we extend the transductive model to handle unobserved data.
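The transductive computation in Equations (1)-(4) can be sketched in NumPy. This is a simplified illustration under stated assumptions: the activation $\sigma$ is taken to be ReLU (the text leaves it unspecified), and the trainable parameters are passed in as plain arrays rather than learned.

```python
import numpy as np

def edge_embedding(u_prev, neighbors, W_hat):
    """One aggregation level (Eq. 1): mean over the neighbors'
    previous-level edge embeddings, a linear map, then ReLU
    (ReLU is an assumed choice of the activation sigma)."""
    m = np.mean([u_prev[j] for j in neighbors], axis=0)
    return np.maximum(0.0, W_hat @ m)

def overall_embedding(b_i, U_i, w_r, W_r, M_r, alpha_r):
    """Eqs. (2)-(4): self-attention over the m edge embeddings stored
    as the columns of U_i (shape s x m), combined with base embedding b_i."""
    scores = w_r @ np.tanh(W_r @ U_i)            # Eq. (2) pre-softmax, shape (m,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                              # softmax coefficients a_{i,r}
    return b_i + alpha_r * (M_r.T @ (U_i @ a))   # Eq. (4): v_{i,r}
```

The shapes follow the text: `W_r` is $d_a \times s$, `w_r` has size $d_a$, and `M_r` is $s \times d$, so the result lives in the $d$-dimensional overall embedding space.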
We parameterize the base embedding and the edge embedding as $b_i = h_z(x_i)$ and $u_{i,r}^{(0)} = g_{z,r}(x_i)$, where $h_z$ and $g_{z,r}$ are transformation functions such as multi-layer perceptrons, $z$ is the node type and $r$ is the edge type. For the overall embedding of node $v_i$ on edge type $r$, we also add an extra attribute term:

$$v_{i,r} = h_z(x_i) + \alpha_r M_r^T U_i a_{i,r} + \beta_r O_z^T x_i, \quad (5)$$

where $\beta_r$ is a coefficient and $O_z$ is a feature transformation matrix for $v_i$'s corresponding node type $z$.

The final transductive and inductive embeddings can be learned by applying multi-semantic metapath-guided random walks. Specifically, given a node $v$ in an MSM walk on edge type $r$ and a window size $p$, let $v_{-p}, v_{-p+1}, \ldots, v_{-1}, v, v_1, \ldots, v_p$ denote its context. We minimize the negative log-likelihood

$$-\log P_{\theta_r}\left(v_{-p}, \ldots, v_p \mid v\right) = \sum_{1 \le |p'| \le p} -\log P_{\theta_r}(v_{p'} \mid v), \quad (6)$$

where $\theta_r$ denotes all the parameters w.r.t. edge type $r$ and $P_{\theta_r}(v_{p'} \mid v)$ is defined by the softmax function. The objective $-\log P_{\theta_r}(u \mid v)$ for each pair of vertices $u$ and $v$ can be easily approximated by negative sampling.

In brief, after initializing all the model parameters $\theta$, we generate all training samples $\{(v_i, v_j, r)\}$ from multi-semantic metapath-based random walks $\rho_r$ on each edge type $r$. Then we train on these samples by computing $v_{i,r}$ using Equation (4) or (5) and minimizing the negative log-likelihood. We summarize our algorithm in Algorithm 1.
Algorithm 1: MSM
Input: network $G = (\mathcal{V}, \mathcal{E}, \mathcal{A})$, embedding dimension $d$, edge embedding dimension $s$, learning rate $\eta$, number of negative samples $L$, coefficients $\alpha, \beta$
Output: overall embeddings $v_{i,r}$ for all nodes on every edge type $r$
  Initialize all the model parameters $\theta$
  Generate training samples $\{(v_i, v_j, r)\}$ using multi-semantic metapath-based random walks $\rho_r$ on each edge type $r$
  while not converged do
    for all $(v_i, v_j, r)$ in the training samples do
      Calculate $v_{i,r}$ using Equation (4) or (5)
      Sample $L$ negative samples and update the model parameters $\theta$ by minimizing the negative log-likelihood, Equation (6)
    end for
  end while

4 Experiments

4.1 Datasets

For evaluation, we adopt two datasets: Amazon and Alibaba. The selection criterion is that each must contain more than two edge types and node types. The details of these datasets are as follows.

The Amazon data used in our experiments consists of the movie and book reviews in the Amazon product data. Customers give products a 5-point review score after purchase. We divide the data as follows: scores 1 and 2 as "dislike", scores 3 and 4 as "like", and score 5 as "very like". The statistics of the Amazon dataset are shown in Table 1.

The Alibaba dataset is video-user-item data from the Taobao mobile application. There are three types of nodes (user, item and video) and six types of edges. We use gender, city, age and 3 other features as attributes for "user" nodes, and brand, price, category and 6 other features as attributes for "item" nodes. For "video" nodes, we extract a 128-dimensional vector from the audio through VGG [12] and a 2048-dimensional vector from the video through ResNet-50 [13], then concatenate these vectors with 4 other features as attributes. The statistics of the nodes and edges in the Alibaba datasets are shown in Table 1. We select the interaction data collected during three days as the training set and use the following day as the validation and test sets.
To compare balanced and unbalanced data distributions, we extract the Ali-Balance and Ali-Unbalance datasets. The Ali-Unbalance dataset contains many more item nodes than the Ali-Balance dataset. Moreover, we vary the ratio of remaining edges to test the model's scalability.

Table 1: Statistics of the Amazon and Alibaba datasets

        Types            Amazon    Types                 Ali-Balance  Ali-Unbalance  Ali-Large
nodes   user (U)         9873      user (U)              10000        10000          953676
        movie (M)        5578      item (I)              67068        301449         4640445
        book (B)         4414      video (V)             66306        57649          1294591
        Total            19865     Total                 143374       369098         6888712
edges   dislike (U-M)    71989     related item (I-I)    173842       133925         4894802
        like (U-M)       74182     related video (V-V)   210031       92106          2459112
        very like (U-M)  125189    click item (U-I)      200848       279640         1723973
        dislike (U-B)    43810     click video (U-V)     192633       55188          661581
        like (U-B)       55564     watched video (U-V)   191500       36719          369448
        very like (U-B)  96202     conversion (I-V)      19530        79858          1275938
        Total            466396    Total                 988384       677436         11384854

http://jmcauley.ucsd.edu/data/amazon/

4.2 Baseline Methods

We compare our model with the following state-of-the-art embedding-based baseline methods; the overall embedding size is set to 200 for all methods.

• DeepWalk [5] applies random walks on the network and then uses the Skip-gram algorithm to train the embeddings.
• node2vec [7] adds parameters to control the random walk process, making it better suited to certain types of nodes.
• LINE [6] adds a link-fitting term to the DeepWalk cost function and samples both one-hop and two-hop neighbors.
• metapath2vec [8] is designed to deal with node heterogeneity. For comparison, we combine all multi-semantic metapaths between the first node and the last node into one path.
• PMNE [9] proposes three different models to merge a multiplex network into one overall embedding per node. We denote the three methods of PMNE as PMNE(n), PMNE(r) and PMNE(c), respectively.
• MNE [10] uses one common embedding and several additional embeddings for each edge type, which are jointly learned by a unified network embedding model.
• GATNE [11] proposes both transductive and inductive models for bipartite heterogeneous network embedding.

Table 2: Link prediction results for different datasets
              Amazon                     Ali-Balance                Ali-Unbalance
Method        ROC-AUC  PR-AUC  F1       ROC-AUC  PR-AUC  F1        ROC-AUC  PR-AUC  F1
DeepWalk      0.780    0.782   0.710    0.786    0.729   0.720     0.711    0.701   0.655
node2vec      0.779    0.781   0.708    0.799    0.757   0.733     0.800    0.769   0.732
LINE          0.697    0.678   0.640    0.728    0.705   0.674     0.708    0.671   0.655
metapath2vec  0.794    0.802   0.728    0.754    0.750   0.689     0.736    0.719   0.674
PMNE(n)       0.783    0.782   0.712    0.786    0.811   0.747     0.758    0.734   0.700
PMNE(r)       0.675    0.630   0.615    0.759    0.713   0.691     0.735    0.695   0.672
PMNE(c)       0.693    0.660   0.643    0.723    0.686   0.661     0.708    0.670   0.652
MNE           0.758    0.731   0.713    0.813    0.770   0.743     0.783    0.795   0.713
GATNE-T       0.825    0.794   0.746    0.837    0.831   0.770     0.814    0.815   0.741
MSM-T
Because the quality of recommendation services on e-commerce networks can be significantly improved by predicting potential user-to-user or user-to-item relationships, we test the embedding results on link prediction only. Following the evaluation criteria commonly used in similar tasks, we use the area under the ROC curve (ROC-AUC) [14], the area under the precision-recall curve (PR-AUC) [15] and the F1 score as evaluation metrics in our experiments. All of these metrics are uniformly averaged over the selected edge types.
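The evaluation protocol above, per-edge-type metrics uniformly averaged, can be sketched with scikit-learn. This is an illustrative sketch under two assumptions not fixed by the text: PR-AUC is computed as average precision, and F1 thresholds the link scores at 0.5.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def evaluate_link_prediction(scores_by_type):
    """Uniformly average ROC-AUC, PR-AUC and F1 over edge types.

    `scores_by_type` maps an edge type to (y_true, y_score): binary link
    labels and predicted link scores for that type. Average precision
    stands in for PR-AUC and a 0.5 threshold yields hard F1 predictions;
    both are assumed conventions, not details from the paper.
    """
    rocs, prs, f1s = [], [], []
    for y_true, y_score in scores_by_type.values():
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        rocs.append(roc_auc_score(y_true, y_score))
        prs.append(average_precision_score(y_true, y_score))
        f1s.append(f1_score(y_true, y_score >= 0.5))
    return float(np.mean(rocs)), float(np.mean(prs)), float(np.mean(f1s))
```

Averaging uniformly (rather than weighting by edge count) keeps rare edge types from being drowned out, which matters on the unbalanced datasets.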
To be fair, we set all embedding dimensions to 200 for both the baseline methods and MSM. For all random-walk-based methods, we set the window width to 10 and select 5 negative samples. For node2vec, we empirically use the best hyper-parameters for training, p = 2 and q = 0. For LINE, we set the one-hop and two-hop embedding dimensions to 100 and concatenate them. For the three PMNE models, we use the hyper-parameters given in their original paper. For MNE, we set the additional embedding size to 10. We also use the given hyper-parameters in GATNE to train the transductive model (GATNE-T) and the inductive model (GATNE-I). For our MSM model, we define several multi-semantic metapaths to guide the random walks, as shown in Figure 2.

[Figure 2(a): 18 multi-semantic metapaths in the Amazon dataset, built over user (U), movie (M) and book (B) nodes from the edge types user-movie dislike (1), user-movie like (2), user-movie very like (3), user-book dislike (4), user-book like (5) and user-book very like (6).]
[Figure 2(b): 18 multi-semantic metapaths in the Alibaba dataset, built over user (U), video (V) and item (I) nodes from the edge types watched video (1), click video (2), click item (3), conversion (4), related item (5) and related video (6).]
Figure 2: Multi-semantic metapaths in the Amazon and Alibaba datasets

Table 3: Link prediction results for the Ali-Large dataset

             Ali-Large
Method       ROC-AUC  PR-AUC  F1
GATNE-I      0.719    0.691   0.722
MSM-I 20%    0.726    0.774   0.733
MSM-I 50%    0.732    0.755   0.737
MSM-I 100%   0.746    0.748   0.782
4.3 Results

The experimental results on the three smaller datasets are shown in Table 2. For each pair of nodes, we calculate the cosine similarity of their embeddings: the larger the similarity, the more likely a link exists between them. As the network has more than one relation type, we compute the evaluation metric for each relation type first and take the average over all relation types as the final result. For the models designed for single-layer networks, we train a separate embedding for each relation type and use it to predict links on the corresponding relation type, which means they do not have information from the other relation types. We also test the single edge type "click item" on both the Ali-Balance and Ali-Unbalance datasets to compare MSM with GATNE; the result is shown in Figure 3. For the Ali-Large dataset, which has far more nodes and edges, the results are shown in Table 3.

The major findings from the results can be summarized as follows:

(1) MSM outperforms all baselines on the various datasets. Compared with the best GATNE results on the Ali-Large dataset, MSM achieves a 3.62% lift in ROC-AUC, an 8.24% lift in PR-AUC and an 8.31% lift in F1-score.

(2) MSM outperforms GATNE, which demonstrates the importance of adopting multi-semantic metapath-based random walks to construct the heterogeneous neighborhood of a node.

(3) On the Ali-Balance and Ali-Unbalance datasets, MSM performs similarly while the baseline methods vary a lot, which shows that MSM handles unbalanced data distributions better.

(4) On the Ali-Large dataset, as shown in Table 3, the ROC-AUC and F1 of MSM rise as the percentage of remaining edges increases. When using all edges, MSM significantly outperforms GATNE on this large dataset. The results show that MSM scales to real-world heterogeneous networks containing millions of nodes and edges.
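The link scoring step described above, ranking candidate node pairs by the cosine similarity of their embeddings on the relevant edge type, can be sketched as follows; the epsilon guard against zero-norm embeddings is a small added safeguard.

```python
import numpy as np

def cosine_link_score(emb_u, emb_v, eps=1e-12):
    """Score a candidate link as the cosine similarity of the two node
    embeddings; a higher score means a link is more likely. `eps`
    guards against division by zero for all-zero embeddings."""
    emb_u = np.asarray(emb_u, dtype=float)
    emb_v = np.asarray(emb_v, dtype=float)
    denom = max(np.linalg.norm(emb_u) * np.linalg.norm(emb_v), eps)
    return float(emb_u @ emb_v / denom)
```

Because cosine similarity is scale-invariant, the ranking depends only on embedding directions, so it pairs naturally with the threshold-free ROC-AUC and PR-AUC metrics used above.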
Figure 3: Results for the edge "click item" on the Ali-Balance and Ali-Unbalance datasets
5 Conclusion

In this paper, we investigate representation learning in complex heterogeneous networks with multi-type nodes, multi-type edges and node attributes. Accordingly, we design a multi-semantic metapath (MSM) model, which constructs multi-semantic metapath-based random walks to feed into the embedding process. Because MSM generates balanced walks that utilize the network structure and node attributes in the heterogeneous neighborhood, it is effective at handling unbalanced distributions. Extensive experiments and comparisons demonstrate the effectiveness of MSM for representation learning on both fixed networks and unseen nodes. In particular, MSM outperforms representative state-of-the-art embedding approaches on large networks, which shows great value in real-world applications.
References

[1] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
[2] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[3] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.
[4] Xuandong Zhao, Xiang Li, Ning Guo, Zhiling Zhou, Xiaxia Meng, and Quanzheng Li. Multi-size computer-aided diagnosis of positron emission tomography images using graph convolutional networks. In , pages 837–840. IEEE, 2019.
[5] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[6] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
[7] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.
[8] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144. ACM, 2017.
[9] Weiyi Liu, Pin-Yu Chen, Sailung Yeung, Toyotaro Suzumura, and Lingli Chen. Principled multilayer network embedding. In , pages 134–141. IEEE, 2017.
[10] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 119–128. ACM, 2015.
[11] Yukuo Cen, Xu Zou, Jianwei Zhang, Hongxia Yang, Jingren Zhou, and Jie Tang. Representation learning for attributed multiplex heterogeneous network. arXiv preprint arXiv:1905.01669, 2019.
[12] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[15] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In