Dual Graph Embedding for Object-Tag Link Prediction on the Knowledge Graph
Chenyang Li, Xu Chen, Ya Zhang*, Siheng Chen, Dan Lv, and Yanfeng Wang
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China
Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
StataCorp LLC, College Station, TX, USA
{lichenyanglh, xuchen2016, ya zhang}@sjtu.edu.cn, [email protected], [email protected], [email protected]

Abstract—Knowledge graphs (KGs) composed of users, objects, and tags are widely used in web applications ranging from E-commerce and social media sites to news portals. This paper concentrates on an attractive application that aims to predict the object-tag links in the KG for better tag recommendation and object explanation. When predicting the object-tag links, both the first-order and high-order proximities between entities in the KG propagate essential similarity information for better prediction. Most existing methods focus on preserving the first-order proximity between entities in the KG. However, they cannot capture the high-order proximities in an explicit way, and the adopted margin-based criterion cannot measure the first-order proximity on the global structure accurately. In this paper, we propose a novel approach named Dual Graph Embedding (DGE) that models both the first-order and high-order proximities in the KG via an auto-encoding architecture to facilitate better object-tag relation inference. Here the dual graphs contain an object graph and a tag graph that explicitly depict the high-order object-object and tag-tag proximities in the KG. The dual graph encoder in DGE then encodes these high-order proximities in the dual graphs into entity embeddings.
The decoder formulates a skip-gram objective that maximizes the first-order proximity between observed object-tag pairs over the global proximity structure. With the supervision of the decoder, the embeddings derived by the encoder are refined to capture both the first-order and high-order proximities in the KG for better link prediction. Extensive experiments on three real-world datasets demonstrate that DGE outperforms the state-of-the-art methods.
Index Terms—Knowledge Graph, Link Prediction, Tag Recommendation
I. INTRODUCTION
In Web applications such as E-commerce and social media sites, many recommender systems incorporate the knowledge graph (KG) composed of users, objects, and tags to provide accurate and explainable recommendation [1]. This paper focuses on an attractive application that aims to predict the object-tag links in this kind of KG for better tag recommendation [2] and object management [3].

In the object-tag link prediction problem, the knowledge graph contains three types of entities (i.e., users, objects, and tags) and two types of relations (i.e., Interact and TaggedWith), as Fig. 1 shows. Two types of (head, relation, tail) triplets, with the forms (user, Interact, object) and (object, TaggedWith, tag), are included in the KG. In particular, this task focuses on the latter type of triplets by taking the objects as the heads and the tags as the tails to predict. Here both the first-order and high-order proximities in the KG provide structural similarity information to enhance the link prediction.

Fig. 1. A toy example of the KG in the object-tag link prediction task. The purple lines indicate the first-order object-tag proximity. The high-order proximity exists between the objects in the left red region. Tags in the left blue region also share the high-order proximity. These high-order proximities are depicted in the dual graphs shown on the right side. The object marked with the red circle is the head whose tails need to be predicted. Tags in the grey circle are discovered by high-order relationships in the KG.

At first, the first-order proximity between a head-tail pair directly determines the existence of the corresponding link. In Fig. 1, an object and a tag associated by an observed link of type r_2 (TaggedWith) share stronger first-order proximity than nodes that are not directly linked. The first-order proximity information in these observed links can discover new links between objects and tags that share high proximity. Besides the first-order proximity, the high-order object-object and tag-tag proximities implied in the high-order connectivities of the KG provide collaborative signals for link prediction. More specifically, the high-order proximity between two objects encourages one of the objects to link to the tags of the other. Similarly, the high-order proximity between two tags enriches the links of any object linked to one of these tags. For example, in Fig. 1, two objects o and o′ are two-hop neighbors on the path o --r_1^{-1}--> u --r_1--> o′, so they share high-order proximity, and a tag t linked to o may also be relevant to o′. Similarly, two tags t and t′ linked to the same object are two-hop neighbors on the path t --r_2^{-1}--> o --r_2--> t′ and share high-order proximity, so a link between t′ and another object already linked to t is likely to be added.
In this sense, both the first-order and high-order proximities in the KG need to be captured for high-quality link prediction.

However, existing models fail to capture both the first-order and high-order proximities in the KG jointly. Translational distance models designed for KG completion measure the distances between entities and their immediate neighbors after the translations carried by the relations [4]–[6]. In this way, these models concentrate on the first-order proximity between existing triplets [7]. Some other methods based on random walk [8] or feature aggregation [9], [10] on the KG focus on the information propagation from the current entity to its directly linked entities. Without capturing the high-order proximities explicitly, these methods lose the collaborative information carried by the high-order connectivities in the KG, which is essential for accurate prediction. For example, the tags discovered through high-order relationships in Fig. 1 may be overlooked by these methods when predicting the tails of the target object. Besides, the margin-based criterion commonly used in KG completion methods only considers the pairwise relations between heads and tails, which may fail to depict the first-order proximity between any pair of entities over the global proximity structure.

In this paper, we propose a Dual Graph Embedding (DGE) model in an auto-encoding architecture that simultaneously captures the first-order and high-order proximities in the KG to predict the missing object-tag links. Here the dual graphs contain an object graph and a tag graph constructed based on the high-order connectivities in the KG. Hence links in the dual graphs describe the high-order object-object and tag-tag proximities explicitly.
The dual graph encoder in DGE then captures these high-order proximities by encoding the structural information in the dual graphs into entity embeddings. In the decoder, instead of the widely used margin loss, we formulate a skip-gram objective that maximizes the likelihood that each observed tag is relevant to a given object over all the possible tags. In this way, the decoder measures the first-order proximity in the global proximity structure of the KG and refines the embeddings learned from the encoder. The auto-encoding formulation encourages DGE to capture the first-order and high-order proximities in an end-to-end manner for better prediction. We conduct our experiments on several real-world datasets for tag recommendation tasks. The results show that DGE predicts highly relevant object-tag pairs compared to the state-of-the-art methods. Our contributions can be summarized as follows:
• We propose a Dual Graph Embedding (DGE) method to capture both the first-order and high-order proximities in the KG simultaneously, and further improve the quality of the object-tag link prediction;
• We adopt the skip-gram objective that maximizes the likelihood of the observed tags given an object over the candidate tag set, so that the first-order proximity between each object-tag pair in the global proximity structure can be measured more accurately;
• Extensive experiments are conducted on three real-world datasets for tag recommendation. The empirical results show that our method outperforms the state-of-the-art methods on relevant tag prediction for target objects.

II. RELATED WORK
A. Knowledge Graph Completion
Considering the incompleteness of knowledge graphs, many methods have been proposed to add new triplets to the knowledge graph. A typical task of KG completion is link prediction, which is widely applied to recommender systems in real-life scenarios [7].

Many translational distance models learn low-dimensional embeddings of entities and relations by minimizing the distance between two directly linked entities in a translated space [11]. Besides, they adopt the margin-based criterion to measure the first-order proximity in the KG. TransE [4] treats relations as translations from heads to tails. This method supposes that the sum of the head and relation embeddings should be close to the tail embedding. To follow up, TransH [5] modifies the scoring function by projecting entities and relations onto a hyperplane. TransR [6] introduces relation-specific spaces and projects the head and tail embeddings into the corresponding space. To simplify TransR, TransD [12] replaces the projection matrix with the product of two mapping vectors. Besides, some methods like KG2E [13] and TransG [14] redefine the distance by assuming that the entity and relation embeddings are drawn from Gaussian distributions.
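As a concrete illustration of the translational-distance idea described above (a generic sketch, not the original implementations), the TransE scoring function and the margin-based criterion can be written in a few lines; all names and toy vectors here are illustrative:

```python
import numpy as np

def transe_score(h, r, t):
    # TransE plausibility: distance ||h + r - t||_1 (lower = more plausible triplet)
    return np.linalg.norm(h + r - t, ord=1)

def margin_loss(pos, neg, margin=1.0):
    # margin-based criterion: rank an observed triplet above a corrupted one
    return max(0.0, margin + pos - neg)

rng = np.random.default_rng(0)
h, r = rng.normal(size=8), rng.normal(size=8)
t_true = h + r + 0.01 * rng.normal(size=8)  # tail that nearly satisfies h + r ≈ t
t_corrupt = rng.normal(size=8)              # randomly corrupted tail
loss = margin_loss(transe_score(h, r, t_true), transe_score(h, r, t_corrupt))
```

Because the margin loss only compares each observed pair against sampled corruptions, it is exactly the pairwise criterion the paper argues cannot see the global proximity structure.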
B. Tag Recommendation
Tag recommendation methods can be formulated as the link prediction problem on the user-object-tag relation graph [2]. Some matrix factorization (MF) based methods model the pairwise relationships among users, objects, and tags to learn low-dimensional embeddings [15]. Other methods additionally extract the tag co-occurrences and model the tag-tag proximity to recommend relevant tags for certain objects [16], [17]. Some recent methods employ random walk with restart or node feature aggregation schemes on the input graph to predict the links between object nodes and tag nodes [8]–[10]. Although they consider the high-order connectivities in the input graph, they focus on the information propagation process between two directly linked nodes.

In summary, methods for KG completion and tag recommendation cannot explicitly capture the collaborative signals from the high-order object-object and tag-tag proximities in the KG. Besides, the margin loss employed for KG completion cannot measure the first-order proximity over the global proximity structure accurately.

III. PRELIMINARY
A. Problem Definition
For the object-tag link prediction task, we first introduce the input knowledge graph and the task goal.
Knowledge graph in this task.
We denote the knowledge graph in this task as G = {(h, r, t) | h, t ∈ E, r ∈ R}, where E = U ∪ O ∪ T. Here U = {u_1, u_2, ..., u_P}, O = {o_1, o_2, ..., o_N}, and T = {t_1, t_2, ..., t_M} are the user, object, and tag sets with P, N, and M entities respectively. The relation set R is composed of two types of relations, r_1 = Interact and r_2 = TaggedWith, as Fig. 1 shows.
Task description.
Given the input G, we aim to predict the existence of the object-tag link between any pair in {(h, t) | h ∈ O, t ∈ T}. Here objects are considered as given heads and tags are treated as candidate tails to predict. In particular, we denote the observed tag set of a given object o_i as S_{o_i} = {t_{i1}, t_{i2}, ..., t_{iM_i}}, where S_{o_i} contains the M_i observed tags linked to o_i and S_{o_i} ⊂ T.

Fig. 2. The Dual Graph Embedding model for the prediction of the object-tag links. We adopt the path-based sampling scheme to construct the object and tag graphs that contain the high-order proximities. A dual graph encoder contains two 2-layer GCN architectures to embed the object-object and tag-tag proximities. The decoder measures the first-order proximity between objects and tags, and uses the skip-gram objective to supervise the encoder.

B. Terminology Explanation
We give formal definitions of some important terminologies related to this task:
High-order connectivity.
We take the definition given in [1] that the L-order connectivity means the long-range path e_0 --r_1--> e_1 --r_2--> ... --r_L--> e_L, where e_i ∈ E and r_j ∈ R for i = 0, ..., L and j = 1, ..., L. Then e_L is the L-hop neighbor of e_0.

Path-based neighbors. According to the high-order connectivities, we consider a predefined path set P = {e_0 --r_1--> ... --r_L--> e_L} obtained by fixing the types of all the entities and relations and by forcing e_0 and e_L to be of the same type. Then for any path pa ∈ P, the start and end entities are P-path-based neighbors of each other.

High-order object-object and tag-tag proximities. The high-order proximity exists between an entity and its path-based neighbors given a predefined P. We focus on the proximities among objects and among tags, since this task only predicts the missing object-tag links. Given an object o_i, o_i and o_j share high-order proximity if o_j is a path-based neighbor of o_i. The high-order tag-tag proximity is defined in the same way.

IV. PROPOSED MODEL
A. Model Overview
Our proposed DGE consists of a dual graph encoder and a skip-gram decoder, and the model framework is depicted in Fig. 2. The dual graphs here indicate an object graph and a tag graph containing the high-order object-object and tag-tag proximities in the KG respectively. The encoder extracts both high-order proximities from the dual graphs and embeds them into entity embeddings. Besides, the skip-gram decoder measures the first-order proximity over the global proximity structure in the KG to determine the existence of links and to supervise the entire model.

According to the high-order connectivities in the input KG, we can sample the path-based neighbors of objects and tags respectively. These extracted neighboring correlations that contain the high-order proximities are utilized to build the object graph G_O = (V_O, E_O) and the tag graph G_T = (V_T, E_T), where V_T and E_T denote the vertex set and link set of tags (similarly, V_O and E_O are for objects). Both graphs form the dual graphs, where we assume that the directional information in the initial KG is not important in this task. Then the encoder extracts the high-order object-object and tag-tag proximities from the dual graphs G_O and G_T respectively. The encoded object and tag embeddings are given by

Z_O, Z_T = DGEnc(G_O, G_T),   (1)

where Z_O ∈ R^{N×d} and Z_T ∈ R^{M×d} are the object and tag embeddings respectively with latent dimension d, and DGEnc denotes the encoding process.

In the decoder, for an object o_i, we assume that the observed tags in S_{o_i} share stronger first-order proximity with o_i compared to the globally unobserved ones. Actually, this assumption is consistent with the idea of the skip-gram model [18] that the surrounding words are more related to the current word than globally distant words. In this sense, to capture the dominant proximity information in the global structure for any o_i, we formulate the decoder from the skip-gram perspective.
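Before turning to the decoder details, the dual-graph construction described above (path-based neighbors over o → u → o′ paths, with the SPPMI link weighting detailed in Section IV-B) can be sketched as follows; the helper names and the toy interaction list are assumptions for illustration, not the authors' code:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def object_cooccurrences(interactions):
    """Count P_o-path co-occurrences: o_i and o_j are path-based
    neighbors when some user interacts with both of them."""
    by_user = {}
    for user, obj in interactions:
        by_user.setdefault(user, set()).add(obj)
    pair_counts = Counter()
    for objs in by_user.values():
        for oi, oj in combinations(sorted(objs), 2):
            pair_counts[(oi, oj)] += 1
    return pair_counts

def sppmi_weights(pair_counts, k=1):
    """Shifted positive PMI link weights: max(PMI(i, j) - log k, 0)."""
    total = sum(pair_counts.values())
    marginal = Counter()
    for (oi, oj), c in pair_counts.items():
        marginal[oi] += c
        marginal[oj] += c
    return {(oi, oj): max(np.log(c * total / (marginal[oi] * marginal[oj])) - np.log(k), 0.0)
            for (oi, oj), c in pair_counts.items()}

# toy (user, Interact, object) pairs; names are illustrative only
inter = [("u1", "o1"), ("u1", "o2"), ("u2", "o1"), ("u2", "o2"), ("u3", "o3")]
weights = sppmi_weights(object_cooccurrences(inter))
```

The tag graph G_T follows the same recipe with (object, TaggedWith, tag) pairs grouped by object instead of user.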
More specifically, given any pair (o_i, t_j), the current object vector is the i-th row of Z_O, denoted by Z_{O,i}, and the target tag vector is the j-th row of Z_T, denoted by Z_{T,j}. Here Z_T is taken as the candidate surrounding tag embedding matrix. Then the decoder calculates the probability p(t_j | o_i) that t_j is linked to o_i in the same way as the skip-gram model, based on the indexed and candidate vectors,

p(t_j | o_i) = SGDec(Z_{O,i}, Z_{T,j}),   (2)

where SGDec denotes the skip-gram decoding process. To supervise the whole model, for the current o_i, the skip-gram objective [18] maximizes the likelihood of its M_i observed surrounding tags in S_{o_i}, which is given by

max p(t_{i1}, t_{i2}, ..., t_{iM_i} | o_i) = max ∏_{m=1}^{M_i} p(t_{im} | o_i)   (3)
                                          = max ∏_{m=1}^{M_i} SGDec(DGEnc(G_O, G_T)[o_i, t_{im}]).

During the training process, Z_O and Z_T will be refined to capture more similarity information from both the first-order and high-order proximities in the KG to assist the prediction.

B. Dual Graph Encoder
Since the high-order proximities in the input knowledge graph G propagate essential information for link prediction, we introduce a dual graph encoder to capture these proximities. More specifically, we construct the dual graphs, including the object graph G_O = (V_O, E_O) and the tag graph G_T = (V_T, E_T), via a path-based neighbor sampling scheme on the input graph G. Thus links in both graphs illustrate the high-order proximities in the input G. Then the dual graph encoder embeds the structural information of G_O and G_T into the embeddings Z_O and Z_T. In this way, Z_O and Z_T fetch the two types of high-order proximities to provide collaborative signals for the subsequent decoder to predict the object-tag links. The internal layout of the encoder is given in Fig. 2.

Encode the high-order proximity between objects. In this process, we encode the object graph G_O to mine the high-order proximity between objects. We consider the path set P_o = {o_i --r_1^{-1}--> u_k --r_1--> o_j | o_i, o_j ∈ O, u_k ∈ U} in the input G, and the P_o-path-based neighbors can be sampled for each object node. Note that the path set P'_o = {o_i --r_2--> t_k --r_2^{-1}--> o_j | o_i, o_j ∈ O, t_k ∈ T} is not selected, since the information propagated on P'_o can be captured when representing the tag graph. Accordingly, G_O is constructed by adding a link between each pair of the sampled object nodes. To depict the semantic similarities between any two nodes, we utilize Shifted Positive PMI (SPPMI) [19] values, which are commonly used in NLP tasks, to normalize the link weights of G_O. We then denote the corresponding adjacency matrix as A_O ∈ R^{N×N}.

Considering the superior performance of Graph Convolutional Networks (GCNs) [20] in capturing relations between nodes, we apply a two-layer GCN to encode the object graph's information into Z_O. When content features X_O ∈ R^{N×d_in} with feature dimension d_in are provided, GCN can extract information from both the graph topological structure and the input features. Otherwise, the content features are N-dimensional one-hot encodings and X_O equals the N-by-N identity matrix I_N; GCN still represents the structural information in this case. Since the datasets contain no specific content features, we adopt one-hot encodings as the node features in our experiments. Thus, the two-layer GCN encoder for G_O is given by

Z_O = Â_O ReLU(Â_O X_O W_O^(0)) W_O^(1),   (4)

where the normalized Â_O = D̄_O^{-1/2} Ā_O D̄_O^{-1/2} with Ā_O = A_O + I_N and D̄_O(ii) = Σ_j Ā_O(ij). Besides, W_O^(0) ∈ R^{d_in×h} and W_O^(1) ∈ R^{h×d} are the weight matrices for the first and second GCN layers respectively. Z_O ∈ R^{N×d} is the output object embedding matrix representing the local graph structure and the node features if provided.
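Equation (4) can be sketched in a few lines of numpy; the toy adjacency, sizes, and random weights here are assumptions for illustration only:

```python
import numpy as np

def normalize_adj(A):
    # \hat A = \bar D^{-1/2} \bar A \bar D^{-1/2} with \bar A = A + I, as in Eq. (4)
    A_bar = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_bar.sum(axis=1))
    return A_bar * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encode(A, X, W0, W1):
    # two-layer GCN: Z = \hat A ReLU(\hat A X W0) W1
    A_hat = normalize_adj(A)
    return A_hat @ np.maximum(A_hat @ X @ W0, 0.0) @ W1

rng = np.random.default_rng(0)
N, h, d = 4, 6, 3
# a tiny symmetric object graph (a 4-node path) and one-hot features X = I_N
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(N)
Z = gcn_encode(A, X, rng.normal(size=(N, h)), rng.normal(size=(h, d)))
```

With X = I_N the first multiplication Â X W^(0) simply selects rows of W^(0), which is why the encoder still learns purely structural embeddings when no content features exist.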
Split by rows, Z_O = {Z_{O,i}, i = 1, ..., N} is then fed to the decoder for prediction.

Encode the high-order proximity between tags. Similarly, we encode the tag graph G_T to extract the high-order proximity between tags. G_T is constructed based on the P_t-path-based neighbor sampling process, where P_t in the input G is defined as {t_i --r_2^{-1}--> o_k --r_2--> t_j | t_i, t_j ∈ T, o_k ∈ O}. The link weight between each pair of tag nodes in G_T is determined by their SPPMI value, and the adjacency matrix A_T ∈ R^{M×M} is composed of all the SPPMI values. Then the high-order proximity between tags is encoded into tag embeddings via another two-layer GCN encoder,

Z_T = Â_T ReLU(Â_T W_T^(0)) W_T^(1),   (5)

where Â_T is normalized in the same way as Â_O, and W_T^(0) ∈ R^{M×h}, W_T^(1) ∈ R^{h×d} are the weight matrices of the GCN layers. Since no content features for tags are provided, we omit the input features X_T here. Z_T ∈ R^{M×d} is the output tag representation capturing the proximity structure of the tag graph. Z_T = {Z_{T,j}, j = 1, ..., M} are considered as all candidate surrounding tag embeddings for any object in the decoder.

C. Skip-Gram Decoder
In the skip-gram decoder, we measure the first-order proximity between objects and tags over the global proximity structure to predict the missing object-tag links. Besides, for each object, the decoder maximizes the likelihood of the corresponding observed tags to supervise the entire model. The decoding process is shown in Fig. 2.

Given the object embeddings Z_O and the tag embeddings Z_T, the first-order proximity between the current o_i and the target t_j is measured by the inner product of their corresponding embeddings, i.e., the relevance score is

s(o_i, t_j) = (Z_{T,j})^T Z_{O,i},   (6)

where Z_{O,i} and Z_{T,j} are indexed from Z_O and Z_T respectively. The probability p(t_j | o_i) for t_j and o_i is then calculated by applying the softmax operation over the candidate set [18],

p(t_j | o_i) = softmax(s(o_i, t_j)) = exp(s(o_i, t_j)) / U(o_i),   (7)

where U(o_i) = Σ_{k=1}^{M} exp((Z_{T,k})^T Z_{O,i}) is the normalization factor of o_i. Here U(o_i) contains the global proximity information for the head o_i.

During the optimization process, the probabilities in (3) can be calculated by (7). By maximizing the likelihood over all the cases of each object with its observed tag set, the object and tag embeddings are refined to gather more similarity information from the first-order and high-order proximities, improving the prediction accuracy.

Sub-sampling. In the training process, we calculate the probability in (7) for every observed object-tag pair. Unfortunately, calculating (7) requires normalizing over the entire T, which makes training prohibitively expensive. Inspired by the sub-sampling scheme of noise-contrastive estimation (NCE), which is widely used in word embedding models [21], we employ NCE to achieve fast training. This scheme trains a binary classifier with label y, treating observed tags from the data distribution P_d^{o_i} as positive samples (y = 1) and tags from a noise distribution P_n as negative ones (y = 0) given o_i. Assuming that the negative samples appear K times more frequently than the positive ones, the probability of a given tag t_j being a positive tag for o_i is

p(y = 1 | t_j, o_i) = p(t_j | o_i) / (p(t_j | o_i) + K p_n(t_j))   (8)
                    = exp(s(o_i, t_j)) / (exp(s(o_i, t_j)) + K p_n(t_j)),   [omitting U(o_i)]

where p_n(t_j) is the probability of t_j under the noise distribution, and the unnormalized model with a scaled noise distribution obtained by ignoring U(o_i) in (7) becomes normalized during training [21]. Hence, the probability of t_j being a negative sample for o_i is p(y = 0 | t_j, o_i) = 1 − p(y = 1 | t_j, o_i). Consistent with the objective in (3), the goal is converted to maximizing the likelihood of the correct labels y, averaged over the positive and negative data sets. For o_i, the log-likelihood is

J_Θ(o_i) = E_{P_d^{o_i}}[log p(y = 1 | t_j, o_i)] + K E_{P_n}[log p(y = 0 | t_j, o_i)]   (9)
         ≈ log p(y = 1 | t_j, o_i) + Σ_{k=1}^{K} log p(y = 0 | t_k, o_i).

Here the expectations over the data and noise distributions are approximated by sampling during training [21]. Then the overall objective is the summation of the likelihoods of all objects, and the optimization goal is

max_Θ Σ_{o_i ∈ O} J_Θ(o_i),   (10)

where J_Θ(o_i) is obtained by (9). Here the optimization parameters Θ = {W_O^(0), W_O^(1), W_T^(0), W_T^(1)} determine the encoder, namely the embedding process of the high-order object-object and tag-tag proximities.

TABLE I
SUMMARY STATISTICS OF THREE DATASETS

                               Movielens-1M   LastFm    Steam
r_1 (Interact)                 1,000,209      70,297    1,100,628
density of u-o interactions    4.2647%        0.2108%   0.1155%
sparsity of object graph       4.5047%        0.8492%   0.3505%
r_2 (TaggedWith)               15,498         108,437   83,700
density of o-t observations    0.3969%        0.0515%   2.5369%
sparsity of tag graph          3.2675%        0.7061%   3.8820%

V. EXPERIMENTS
A. Datasets
We adopt three real-world tag recommendation datasets, Movielens-1M (https://movielens.org/), LastFm, and Steam (https://store.steampowered.com/), to evaluate our model. We summarize the statistical information in Table I.
• Movielens-1M: 3,883 movies are rated by 6,040 users. Tags of each movie are chosen from all 1,008 tags. A total of 15,498 object-tag interactions are observed.
• LastFm: 17,632 artists are listened to and tagged by 1,892 users, and the tags come from a tag set of size 11,946. In total, 108,437 object-tag observations are included.
• Steam: 9,373 apps are reviewed by 101,654 users. The size of the tag set is 352, and 83,700 object-tag pairs are observed.

TABLE II
EXPERIMENTAL SETTINGS (k_O AND k_T CONTROL THE SPARSITY OF THE SPPMI MATRICES OF THE OBJECT AND TAG GRAPHS RESPECTIVELY.)

Methods        Settings
MF             hidden size d = 100
TransE         hidden size d = 100
TransH         hidden size d = 100
TransR         hidden size of entities d_e = 100, hidden size of relations d_r = 100
Skip-Gram      hidden size d = 100
CoFactor       Movielens, Steam: k for SPPMI = 1, d = 16; LastFM: k = 2, d = 64
MAD            µ1 = µ2 = 1, µ3 = 1e−
HeteLearn      α = 0.
GCMC+, NGCF+   Movielens, Steam network structure: { , , }; LastFM network structure: { , , , }
DGE            batch size = 64, epochs = 300, learning rate = 2e-3, 15 negative samples for each positive one;
               Movielens: k_O = 0., k_T = 1, 2-layer GCN structure: { , };
               LastFM: k_O = 1, k_T = 1, 2-layer GCN structure: { , };
               Steam: k_O = 5, k_T = 5, 2-layer GCN structure: { , }
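The density figures in Table I follow from a simple ratio of observed links to all possible head-tail pairs; as a sanity check, the Movielens-1M user-object density reported there (4.2647%) can be reproduced from the raw counts given in the dataset description:

```python
# Density of a relation: observed links over all possible head-tail pairs.
def density(num_links, num_heads, num_tails):
    return num_links / (num_heads * num_tails)

# Movielens-1M, from Table I and the dataset description:
# 1,000,209 Interact links, 6,040 users, 3,883 movies -> about 4.2647%
movielens_uo = density(1_000_209, 6_040, 3_883)
```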
B. Baseline Methods

To demonstrate the effectiveness of the proposed DGE, we compare DGE with five methods modeling only the first-order proximity and five methods additionally modeling part of the high-order proximities.
1) Methods based on the first-order proximity:
The five baseline models developed for object-tag link prediction include MF [22], Skip-Gram [21], TransE [4], TransH [5], and TransR [6]. MF factorizes the object-tag interaction matrix into two low-dimensional feature matrices for objects and tags. It employs the mean square error (MSE) as the loss function. The skip-gram model generates features in the same way as MF, but utilizes the NCE loss to measure the first-order proximity. The latter three methods treat the relation embeddings as translation embeddings and define the distance via different scoring functions. They use the margin loss during training. In this task, to avoid the bias toward irrelevant links in the KG, we follow [23] and add an extra Bayesian personalized ranking (BPR) loss to narrow the translation distance between related objects and tags.
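The BPR term mentioned above can be sketched as follows (a generic formulation of Bayesian personalized ranking, not necessarily the exact loss used in the paper's experiments):

```python
import numpy as np

def bpr_loss(pos_score, neg_score):
    # Bayesian personalized ranking: -log sigmoid(s(o, t+) - s(o, t-)),
    # pushing an observed tag t+ above a sampled unobserved tag t-
    return -np.log(1.0 / (1.0 + np.exp(-(pos_score - neg_score))))

loss_ranked = bpr_loss(2.0, -2.0)     # correctly ranked pair -> small loss
loss_misranked = bpr_loss(-2.0, 2.0)  # mis-ranked pair -> large loss
```

Like the margin loss, BPR compares pairs of scores; it differs by replacing the hard hinge with a smooth log-sigmoid.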
2) Methods utilizing part of the high-order proximities:
We use five methods that model part of the high-order proximities as comparison methods: CoFactor [17], MAD [24], HeteLearn [8], GCMC [9], and NGCF [10]. CoFactor, based on MF and tag co-occurrences, only considers the high-order proximity between tags in the KG. MAD conducts label propagation on the object graph that reflects the object-object proximity. HeteLearn is a state-of-the-art method which provides tags via a random walk scheme on the user-object-tag graph. GCMC and NGCF predict links in the bipartite graph based on node feature aggregation. Since they do not distinguish node types during feature aggregation, we extend them to the user-object-tag tripartite graph, denoting the extensions by GCMC+ and NGCF+.

TABLE III
Recall@k AND NDCG@k ON MOVIELENS-1M, LASTFM, AND STEAM BASED ON DIFFERENT METHODS

Datasets       Methods     Recall@3  NDCG@3  Recall@5  NDCG@5
Movielens-1M   MF          0.7252    0.4116  0.7819    0.3088
               TransE      0.7292    0.4125  0.7776    0.3085
               TransH      0.7293    0.4127  0.7760    0.3081
               TransR      0.6496    0.3390  0.7227    0.2597
               Skip-Gram   0.7467    0.4209  0.7895    0.3144
               CoFactor    0.7234    0.4038  0.7825    0.3043
               MAD         0.7359    0.4144  0.7774    0.3097
               HeteLearn   0.7863    0.4356  0.8290    0.3249
               GCMC+       0.8293    0.4664  0.8528    0.3434
               NGCF+       0.7549    0.4242  0.7944    0.3162
               DGE (ours)  0.8464    0.4850  0.8677    0.3565
LastFM         MF          0.0667    0.0601  0.0904    0.0536
               TransE      0.0746    0.0668  0.1100    0.0618
               TransH      0.0751    0.0669  0.1101    0.0617
               TransR      0.0637    0.0561  0.0819    0.0480
               Skip-Gram   0.1322    0.1054  0.1920    0.0966
               CoFactor    0.1538    0.1277  0.1969    0.1073
               MAD         0.2245    0.1927  0.2877    0.1605
               HeteLearn   0.2376    0.1898  0.3119    0.1608
               GCMC+       0.1703    0.1310  0.2310    0.1151
               NGCF+       0.1011    0.0775  0.1414    0.0703
               DGE (ours)  0.2494    0.2129  0.3154    0.1772
Steam          MF          0.2756    0.2151  0.3642    0.1802
               TransE      0.2932    0.2225  0.3744    0.1859
               TransH      0.2915    0.2217  0.3748    0.1850
               TransR      0.2805    0.2042  0.3678    0.1737
               Skip-Gram   0.3203    0.2386  0.4082    0.1999
               CoFactor    0.1538    0.1277  0.1969    0.1073
               MAD         0.3297    0.2539  0.4268    0.2133
               HeteLearn   0.3856    0.2956  0.4874    0.2463
               GCMC+       0.3968    0.3057  0.4961    0.2531
               NGCF+       0.3814    0.2909  0.4804    0.2423
               DGE (ours)  0.4139    0.3226  0.5100    0.2658

C. Experimental Settings

1) Experimental implementation:
For each dataset, we randomly choose 80% of the data for training and the remaining 20% for testing, following the setting in [25]. The parameter settings of all the models are given in Table II. Besides the parameters varying with datasets, the settings of the other parameters in the baseline models are consistent with those in the original papers. We run their codes on the three datasets. We train DGE end-to-end with a mini-batch scheme and adopt the Adam optimization approach to learn the model parameters. The hyperparameters are chosen with the minimal training loss. All the experiments are run on one machine with one TitanX GPU. For each method, we conduct the experiment 10 times and report the average performance.
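The per-object NCE objective in (9) that this training maximizes can be sketched in numpy as follows; the toy embeddings and the uniform noise probability are assumptions for illustration, and in practice the gradients would be taken with an autodiff framework and Adam:

```python
import numpy as np

def nce_log_likelihood(z_o, z_pos, z_negs, p_n, K):
    """NCE objective J(o_i) of Eq. (9) for one object, with s(o, t) = z_t . z_o as in Eq. (6).

    p(y=1 | t, o) = exp(s) / (exp(s) + K * p_n) and p(y=0 | t, o) = 1 - p(y=1 | t, o),
    written here in log form for numerical stability."""
    s_pos = z_pos @ z_o
    ll = s_pos - np.log(np.exp(s_pos) + K * p_n)                 # log p(y=1 | t+, o)
    for z_n in z_negs:
        s_neg = z_n @ z_o
        ll += np.log(K * p_n) - np.log(np.exp(s_neg) + K * p_n)  # log p(y=0 | t-, o)
    return ll

rng = np.random.default_rng(0)
d, K = 16, 5
z_o = rng.normal(size=d) / np.sqrt(d)                          # object embedding Z_{O,i}
z_pos = z_o + 0.1 * rng.normal(size=d)                         # an observed tag, close to the object
z_negs = [rng.normal(size=d) / np.sqrt(d) for _ in range(K)]   # tags drawn from the noise distribution
ll = nce_log_likelihood(z_o, z_pos, z_negs, p_n=1.0 / 100, K=K)
```

Because each term only touches one positive tag and K sampled negatives, the cost per pair is O(K·d) instead of the O(M·d) needed to normalize over the full tag set in (7).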
2) Evaluation metrics:
The relevance of the predicted tags is evaluated with two typical metrics in recommender systems, Recall@k and NDCG@k.

D. Overall Comparison
The overall performance results on the three datasets are summarized in Table III, and we have the following observations:
• DGE beats all other models on the three datasets, especially when predicting the top-3 most relevant tags for objects. The results show that DGE extracts more essential information from both the first-order and high-order proximities in the KG to predict the missing links more accurately.
• Comparing the methods modeling only the first-order proximity, we find that the Skip-Gram model outperforms MF and the three translational distance models on all the datasets. Since the only difference between MF and Skip-Gram is the loss function, the results demonstrate that the skip-gram objective can measure the first-order proximity more accurately than the MSE loss used in MF and the margin loss used in the translational distance models.
• Compared with the methods modeling the high-order relationships in the KG, DGE provides the most relevant tags for objects. This is because the dual graph encoder in DGE can extract more collaborative information from the high-order proximities, while the other methods are affected by noisy information when conducting random walks or feature aggregation on the input KG.
• DGE outperforms the Skip-Gram model. Both models adopt the skip-gram objective to learn the first-order proximity, while DGE additionally captures the high-order proximities. The results show that the high-order proximities contain essential information for this task.

Fig. 3. Recall@3 and NDCG@3 on Movielens-1M with the sparsity level of object-tag observations varying. Methods denoted by dashed lines only consider the first-order proximity, while those with solid lines leverage the high-order relationships in the KG.

Fig. 4. Recall@3 on Movielens-1M for objects with different numbers of tags; each number in parentheses is the number of test objects satisfying the corresponding condition.
E. Performance on Different Sparsity Levels
To evaluate the robustness of our model given sparse object-tag observations, we randomly draw samples (20%, 40%, 60%, 80%) of all the observed object-tag pairs for training and compare the results in Fig. 3. Experiments on
TABLE IV
Top Tags Predicted by Different Methods on Movielens-1M (the movie with * is a cold-start movie.)

Billy Madison (1995). Truth tags: comedy, drama.
  TransH: drama, comedy, romance, adventure, action
  Skip-Gram: comedy, drama, musical, romance, food
  MAD: thriller, stupid, adam sandler, drama, comedy
  HeteLearn: drama, comedy, action, romance, adventure
  GCMC+: drama, comedy, war, parody, crime
  DGE: comedy, drama, horror, crime, romance

Sense and Sensibility (1995). Truth tags: romance, thriller.
  TransH: thriller, comedy, adventure, action, romance
  Skip-Gram: thriller, teen movie, romance, John Hughes, highschool
  MAD: drama, based on a book, british, hugh grant, romantic
  HeteLearn: thriller, comedy, romance, action, adventure
  GCMC+: romance, thriller, classic, witty, fantasy
  DGE: romance, thriller, comedy, classic, bittersweet

Madonna: Truth or Dare (1991)*. Truth tags: documentary, drama, thriller.
  TransH: drama, thriller, comedy, action, romance
  Skip-Gram: thriller, drama, action, comedy, adventure
  MAD: thriller, drama, comedy, action, romance
  HeteLearn: thriller, drama, comedy, action, romance
  GCMC+: thriller, drama, documentary, crime, comedy
  DGE: drama, thriller, documentary, horror, mystery
Fig. 5. Visualization of tag embeddings derived by TransH (a), GCMC+ (b), and DGE (c) on Steam. Tags with the same color are semantically similar. DGE can learn the semantic similarities between tags in the semantic space.

TABLE V
Recall@k and NDCG@k on Movielens-1M, LastFM, and Steam Based on Different Variants of DGE
Datasets      Methods  Recall@3  NDCG@3  Recall@5  NDCG@5
Movielens-1M  SO-GE    0.8101    0.4526  0.8401    0.3343
              ST-GE    0.7443    0.4207  0.7823    0.3136
              DGE      0.8464    0.4850  0.8677    0.3565
LastFM        SO-GE    0.0830    0.0580  0.1211    0.0540
              ST-GE    0.1928    0.1717  0.2496    0.1444
              DGE      0.2494    0.2129  0.3154    0.1772
Steam         SO-GE    0.3251    0.2425  0.4587    0.2152
              ST-GE    0.2856    0.1977  0.4132    0.1821
              DGE      0.4139    0.3226  0.5100    0.2658

all the datasets show similar results; thus we only show the results on Movielens-1M. We find that even with a lack of tagging data, our model predicts more relevant tags than the other methods. Although the skip-gram objective brings only a trivial gain in very sparse cases, DGE can still extract supplementary information from the high-order proximities for better prediction. Moreover, the results of MAD, HeteLearn, GCMC, and our model demonstrate that the dual graph encoder represents both the object-object and tag-tag relations better in the sparse cases.
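The Recall@k and NDCG@k metrics reported throughout (Tables III and V, Figs. 3 and 4) can be computed as follows; this is a minimal sketch assuming binary relevance, and the function names and toy example are ours, not from the paper's code.

```python
import numpy as np

def recall_at_k(ranked_tags, truth_tags, k):
    """Fraction of ground-truth tags that appear in the top-k ranked tags."""
    hits = len(set(ranked_tags[:k]) & set(truth_tags))
    return hits / len(truth_tags)

def ndcg_at_k(ranked_tags, truth_tags, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    truth = set(truth_tags)
    dcg = sum(1.0 / np.log2(i + 2)
              for i, t in enumerate(ranked_tags[:k]) if t in truth)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(truth), k)))
    return dcg / ideal

# Toy example: both truth tags ranked at the top, so both metrics are 1.0.
ranked = ["drama", "comedy", "thriller", "romance", "action"]
truth = ["comedy", "drama"]
r3 = recall_at_k(ranked, truth, 3)
n3 = ndcg_at_k(ranked, truth, 3)
```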
F. Object Cold-Start and Data Sparsity Problems
Fig. 4 shows the performance on objects that have different numbers of tags in the training set. The first group of bars corresponds to the cold-start examples. These bars show that for cold-start objects, DGE predicts the tags more accurately, with a Recall@3 over 0.95, which verifies that the high-order proximities enrich the representations of cold-start objects via the dual graph encoder. The baseline methods incorporating high-order relationships underperform the other methods in this case because the learned models tend to accurately predict the tags for the objects in densely distributed regions (e.g. objects having > , < tags in the training set). Besides, by investigating the performance in the other five groups of bars in Fig. 4, we find that DGE always predicts the most relevant tags compared to the other methods. The results illustrate that DGE can mine valuable information from both the first-order and high-order proximities in the KG under different sparsity levels.

G. Ablation Study
To evaluate whether the two GCN encoders in DGE extract the high-order proximities effectively for link prediction, we replace one of the encoders with a trivial MLP. With this operation, we derive two variants of DGE:

• SO-GE: It only retains the object graph in DGE and replaces the tag GCN encoder with an MLP. This model cannot extract the high-order proximity between tags.

• ST-GE: It only retains the tag graph in DGE, and an MLP is applied to derive object embeddings that contain no high-order proximity information.

We compare the results of these variants and DGE on the three datasets in Table V. The results show that the designed dual graph encoder learns helpful structural information from the high-order object-object and tag-tag proximities to enhance the prediction performance.
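The encoder swap behind SO-GE and ST-GE can be illustrated as follows: a GCN layer propagates features over the object (or tag) graph, while the MLP replacement ignores the graph entirely. This is a numpy sketch under assumed toy shapes, following the standard GCN propagation rule of Kipf and Welling rather than the authors' exact code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out = 6, 8, 4

X = rng.normal(size=(n, d_in))          # entity features
A = np.zeros((n, n))                    # toy tag (or object) graph
for i, j in [(0, 1), (1, 2), (3, 4)]:   # node 5 is left isolated
    A[i, j] = A[j, i] = 1.0
W = rng.normal(scale=0.1, size=(d_in, d_out))

def gcn_layer(A, X, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

def mlp_layer(X, W):
    """MLP replacement used in the SO-GE / ST-GE variants: no graph propagation."""
    return np.maximum(X @ W, 0.0)

H_gcn = gcn_layer(A, X, W)
H_mlp = mlp_layer(X, W)
```

For an isolated node the two layers coincide, which is exactly why the MLP variants lose the high-order proximity information: without edges, no neighbor information is mixed in.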
H. Visualization
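As a complement to the 2-D projections discussed in this section, semantic closeness between learned tag embeddings can be checked directly with cosine similarity. The sketch below uses illustrative toy vectors, not embeddings learned by DGE, and the tag values are assumptions for demonstration only.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy tag embeddings (illustrative values standing in for learned ones).
tags = {
    "Vampire": np.array([0.9, 0.1, 0.0]),
    "Gothic":  np.array([0.8, 0.2, 0.1]),
    "Soccer":  np.array([0.0, 0.1, 0.9]),
}

sim_close = cosine_sim(tags["Vampire"], tags["Gothic"])  # semantically similar pair
sim_far = cosine_sim(tags["Vampire"], tags["Soccer"])    # unrelated pair
```

A well-trained embedding space should give the related pair a much higher similarity than the unrelated one, which is what the t-SNE plots in Fig. 5 make visible in two dimensions.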
We first apply t-SNE to the high-dimensional tag embeddings derived by TransH, GCMC+, and DGE. The visualization results are shown in Fig. 5. Compared to TransH and GCMC+, the tag embeddings derived by DGE represent the semantic similarities between tags more clearly, which improves the interpretability of the prediction results. For example, the semantically similar tags "Vampire" and "Gothic" are close in Fig. 5 (c) but far apart in Fig. 5 (a) and (b). Besides, DGE clusters tags into multiple classes in the semantic space more explicitly than the other two methods. The results show that our model can learn the semantic similarities between tags by explicitly embedding the high-order proximity between tags. Accordingly, our model will predict tags that are more likely to be semantically relevant to the target object.

In addition, we give three tagging examples on Movielens-1M in Table IV. We find that for the cold-start movie "Madonna: Truth or Dare (1991)", the former four methods provide the two most popular tags "drama" and "thriller" without sufficiently representing the object-object proximity, while DGE predicts the accurate tags. This result shows that our model can alleviate the object cold-start problem. For the movie "Sense and Sensibility (1995)", the Skip-Gram model predicts the tags more accurately than TransH, showing that the skip-gram objective can better learn the first-order object-tag proximity. Besides, for the latter two movies, our model puts the most relevant tags at the top of the lists, which further demonstrates the prediction accuracy of DGE.

VI. CONCLUSIONS
In this paper, we propose a Dual Graph Embedding (DGE) method in an auto-encoding architecture to capture the first-order and high-order proximities in the input KG for the object-tag link prediction task. Here the dual graphs include the object and tag graphs that are built to depict the high-order proximities. The encoder then embeds the two types of high-order proximities in the dual graphs into object and tag embeddings. The decoder models the first-order proximity between objects and tags over the global proximity structure from the skip-gram perspective. Under the supervision of the decoder, the similarity information from both the first-order and high-order proximities is extracted for better prediction.