A Hybrid Model for Learning Embeddings and Logical Rules Simultaneously from Knowledge Graphs
Susheel Suresh and Jennifer Neville
Computer Science Department, Purdue University, West Lafayette, IN, USA
[suresh43, neville]@purdue.edu
Abstract—The problem of knowledge graph (KG) reasoning has been widely explored by traditional rule-based systems and more recently by knowledge graph embedding methods. While logical rules can capture deterministic behavior in a KG, they are brittle, and mining ones that infer facts beyond the known KG is challenging. Probabilistic embedding methods are effective in capturing global soft statistical tendencies, and reasoning with them is computationally efficient. While embedding representations learned from rich training data are expressive, incompleteness and sparsity in real-world KGs can impact their effectiveness. We aim to leverage the complementary properties of both methods to develop a hybrid model that learns both high-quality rules and embeddings simultaneously. Our method uses a cross feedback paradigm wherein an embedding model is used to guide the search of a rule mining system to mine rules and infer new facts. These new facts are sampled and further used to refine the embedding model. Experiments on multiple benchmark datasets show the effectiveness of our method over other competitive standalone and hybrid baselines. We also show its efficacy in a sparse KG setting and finally explore the connection with negative sampling.
I. INTRODUCTION
Knowledge graphs (KGs) are large directed graphs where nodes represent concrete or abstract entities and edges symbolize relations for a pair of nodes. Many KGs containing millions of entities and relation types exist today, viz. Freebase [1], YAGO [2], Wikidata [3], and the Google Knowledge Graph [4], which are pivotal in reasoning about multi-relational data from different domains. One important reasoning problem is that of predicting missing relationships between entities (link prediction). Reasoning over KGs is particularly challenging in part due to their characteristic properties: large size, incompleteness, sparsity and noisy facts. Latent feature (a.k.a. embedding) models learned probabilistically (RESCAL [5], TransE [6], ComplEx [7] and RotatE [8]) and inductive logic programming (ILP) inspired techniques which mine interpretable logical rules (WARMER [9], AMIE+ [10]) are two prominent KG reasoning approaches.

Relations in knowledge graphs adhere to certain constraints which enforce syntactic validity and typically follow deterministic connectivity patterns like equivalence, symmetry, inversion and composition. Rules which capture such patterns are precise, interpretable and can generalize well. Drawbacks include potential low coverage, mining inefficiency (large search space) and difficulty in mining quality rules from incomplete KGs. Embedding methods aim to learn useful representations of entities and relations by projecting known triplets into low-dimensional vector spaces and maximizing the total plausibility of known facts in the KG. Embedding models are able to capture unobservable but intrinsic and semantic properties of entities and relations [11]. As reasoning with embeddings boils down to vector space calculations it is computationally efficient, but it can be inaccurate when the entities and relations are sparse or noisy
[12]. Since both methods have their advantages and disadvantages, in this paper we propose a hybrid model which aims to exploit their complementary strengths. The main idea is to selectively utilize inferred facts from logical rules for embedding learning. Also, embedding feedback is used to prune the search space of the rule mining system. This cross feedback process, when run for many iterations, simultaneously learns to incorporate deterministic structure into embeddings and mine rules that are consistent with "global" KG patterns. Hybrid models proposed previously in the literature use simple logical rules to place constraints on the embedding space in order to incorporate structure. Rule learning is detached from embedding learning, and most methods just naively enumerate all possible rules to start with, which does not scale for larger KGs. Methods like IterE [13] place assumptions on the kind of rules and embedding techniques that can be used and are built specifically to tackle sparse entities. Different from such methods, our model aims to simultaneously improve embeddings and mine diverse and reliable rules. Moreover, our method can be incorporated with any embedding technique and rule mining system.

II. BACKGROUND
Let E represent the set of all entities and R the set of all relation types in the KG. A knowledge graph G contains a set of factual triplets {(h, r, t) | h, t ∈ E; r ∈ R}. h, r and t are called head, relation and tail respectively. Figure 1 shows a sample from a larger knowledge graph about sports. Knowledge graph reasoning, in particular link prediction, deals with the problem of inferring new relationships between entities, and triplet classification involves predicting the existence of candidate triplets in a given KG.

TABLE I: State-of-the-art embedding score functions.

Model    | Score Function (φ)
TransE   | −‖h + r − t‖
DistMult | ⟨h, r, t⟩
ComplEx  | Re(⟨h, r, t⟩)
RotatE   | −‖h ∘ r − t‖

Knowledge graph embedding
Entities are represented as vectors, and relations are seen as operations in vector space, typically represented as vectors, matrices or tensors. These models assume that the existence of individual triplets in a KG is conditionally independent given latent representations (a.k.a. embeddings) of entities and relations in a continuous vector space. A score function φ : E × R × E → R is used to measure the model's confidence in a candidate triplet and is defined based on different vector space assumptions. For example, a popular model called TransE [6] aims to have h + r ≈ t if (h, r, t) ∈ KG (boldface letters represent the respective embeddings in R^d). So the score function φ(h, r, t) = −‖h + r − t‖ is expected to be large if (h, r, t) exists in the KG. Embedding learning is done under the open world assumption (OWA), where unobserved triplets are either false or unknown. Negative sampling is employed due to the lack of negative examples in the input KG. A simple and effective approach [6] is to corrupt either the head or tail of a true triplet with a random entity sampled uniformly from E. Entity and relation embeddings are learned by minimizing a logistic or pairwise ranking loss.

Rule Mining
A rule is a formula of atoms connected with logical connectives. In particular, a rule is Horn if the conjunction of a set of body atoms results in a single head atom, as in τ : B_1 ∧ B_2 ∧ · · · ∧ B_n → r(X, Y), where the B_i's are body atoms and r(X, Y) is the head atom. We use body(τ) to denote the body atoms of rule τ. An instantiation of a rule is the act of substituting all its variables with entities from E. The head atom of an instantiated rule is called an inferred head atom if all body atoms of the instantiated rule exist in the KG. S_τ is the set of all inferred head atoms obtained from instantiations of a rule τ:

S_τ = { r(X, Y) | ∃ z_1, ..., z_m : body(τ) }    (1)

where z_1, ..., z_m are the variables that appear in the body of the rule. A principled approach to mine Horn rules from a KG is with an association rule learning algorithm [14], [9]. To deal with the vast search space, language biases are utilized, viz. limiting rule length, requiring every atom in a rule to be transitively connected to the others (a.k.a. connected rules) and ensuring all variables in a rule are closed (i.e., each appears at least twice). During mining, various statistical measures assess the quality of intermediate rules to help prune the search space. According to [14], Rule Support is the number of distinct groundings of
the head atom resulting from the body. Formally,

supp(τ) = #(x, y) : ∃ z_1, ..., z_m : body(τ) ∧ r(x, y)    (2)

Standard Confidence (SC) is the ratio of the number of head groundings to the number of body groundings in the KG:

SC(τ) = supp(τ) / #(x′, y′) : ∃ z_1, ..., z_m : body(τ)    (3)

Fig. 1: Entities and relations in G_f (a sports sample containing, e.g., Cristiano Ronaldo, Lionel Messi, Gianluigi Buffon, Ousmane Dembele, Andres Iniesta, FC Juventus, FC Barcelona, Ballon d'Or, Golden Boot and G.O.A.T., connected by relations such as playsFor, clubType, playsSport, isCalled, areTeamMates and hasWon). Known relationships are shown as solid edges and possible relations as dashed edges.

III. MOTIVATION
Consider the sample G_f in Fig. 1, drawn from a larger KG G about sports. Suppose a rule learning system has found the following three rules:

l_1 : playsFor(V_1, V_2) ∧ clubType(V_2, V_3) → playsSport(V_1, V_3)
l_2 : playsFor(V_1, V_2) ∧ playsFor(V_3, V_2) → teamMate(V_1, V_3)
l_3 : isCalled(V_1, V_2) ∧ teamMate(V_1, V_3) → isCalled(V_3, V_2)

Q1: Can the generalizing ability of logical rules help embedding methods learn better representations?
Rules l_1 and l_2 have a number of conforming instantiations in G_f, and one can argue for them to hold in the larger knowledge graph G. Say in a sample pertaining to basketball, G_b, the entities are sparsely connected. Embedding representations of entities in such a "less-connected" sub-graph will not be expressive because of poor data quality. On the other hand, rules l_1 and l_2, mined with support from facts in G_f, can accurately reason about entities in G_b.

Q2: Can feedback from an embedding model improve the quality of mined rules?
Embedding models are good at incorporating global patterns. For example, a possible explanation for Ronaldo being G.O.A.T (greatest of all time) is the presence of a large number of links between him and different sport awards (see Fig. 1). An embedding method, say RESCAL [5], can easily model the pattern: consistent top players of a sport are likely to be called G.O.A.T. Specifically, the model can learn features such as "consistent top player" and a "concept of eminence" for different entities from the data, as components of the embeddings h_ronaldo, M_isCalled and t_G.O.A.T. The bilinear score φ(ronaldo, isCalled, G.O.A.T) = h⊤ M_r t is then high, denoting validity.

Now, although rule l_3 has multiple instantiations in G_f, it is quite noisy and unreliable. For example, it will incorrectly infer that Dembele (an upcoming young player) is G.O.A.T by virtue of him being a teammate of Messi. Such a noisy rule could have been pruned out if the information from an embedding method modelling "consistent top player" were used: with an embedding h_dembele that lacks this feature, φ(dembele, isCalled, G.O.A.T) = h⊤ M_r t is low, showing lower confidence than earlier. The two questions raised in the above examples motivate us to develop a cross feedback hybrid model for knowledge base reasoning.

IV. RELATED WORK
Here we give a brief overview of embedding learning and rule mining methods and review relevant hybrid models from the literature.
A. Knowledge graph embeddings
RESCAL [5], one of the earliest works in KGE, used a bilinear form for the scoring function inspired by matrix factorization. Specifically, f(h, r, t) = h⊤ M_r t, where entities h and t are represented by vectors in R^d and relations as matrices in R^{d×d} that model pairwise interactions. DistMult [15] simplifies RESCAL further by restricting M_r to be a diagonal matrix. This added simplicity works well for capturing symmetric relationships but poorly for others. TransE [6] introduces entity and relation specific embeddings and models relations as translations in vector space. ComplEx [7] is an extension of DistMult where embeddings lie in C^d. This accounts for the order of the entities and thus is able to model asymmetric relations. A newer method, RotatE [8], is able to model symmetry, asymmetry, inversion and composition of relations. It uses the concept of rotations in complex space as opposed to translations in real space. Concretely, f(h, r, t) = −‖h ∘ r − t‖ with the constraint |r_i| = 1, where h, r, t ∈ C^d and ∘ is the element-wise product. ConvE [16] employs a 2D convolutional neural network to model the score function, where the aim is to utilize the expressivity of multiple non-linear features in the architecture to better model relationships in the KG. Comprehensive surveys of methods for knowledge graph embeddings are given in [17] and [18].

B. Rule mining
These models assume that the existence of individual triples/facts can be inferred from observable features in the graph, usually in the form of logical rules. The extracted rules are then used to infer new facts. Mining rules from a KB has its roots in inductive logic programming (ILP) [19] and association rule mining [20] from the databases community. The use of declarative language biases [21] helps in restricting the large search space. Language biases like limiting rule length, requiring every atom in a rule to be transitively connected to the others (a.k.a. connected rules) and ensuring all variables in a rule are closed (i.e., each appears at least twice) offer a trade-off between the expressivity of rules and the size of the search space. WARMR [22] and its extension WARMER [9] are based on the APRIORI algorithm [23] and make use of a language bias where only conjunctive rules are mined. Sherlock [24], an unsupervised ILP system, learns first-order Horn rules and infers new facts using probabilistic graphical models. To prune the search space it uses two heuristics: statistical significance and statistical relevance. All the above approaches are designed to work under the closed world assumption of KBs due to their treatment of the rule quality measure. AMIE [14] and its efficient version AMIE+ [10] take into account the incompleteness of KBs by proposing a new quality measure based on the partial completeness assumption. Additionally, they mine Horn rules which are connected, closed, non-reflexive and monotonic (predicates in the rule body are all positive).

(Buffon, Ronaldo) and (Messi, Iniesta) are all G.O.A.T (see Fig. 1).

C. Hybrid Methods
The complementary strengths [25] of observed (rules, paths) and latent (embedding) KG features have given rise to a number of hybrid models in recent years. In [26], link prediction is cast as an integer linear programming problem in which the objective function comes from an embedding model and implication rules are used as constraints. This method is essentially a post-processing step which helps in inference but not in learning better embedding representations. Another paradigm of hybrid models is to perform some form of regularization of the embedding loss function using simple rules. In [27], implication rules are first naively extracted by iterating over all possible relation pairs, and then differentiable terms are added to the embedding objective function for each grounding of the extracted rules. The naive rule extraction process is costly, and the procedure leads to a large number of regularization terms, which does not scale. RUGE [28] uses t-norm based fuzzy logic to model the rules and uses inferred triplets to perform embedding rectification. In [29], non-negativity constraints for entity embeddings and entailment constraints for relation embeddings are explored. [30] enforce the subsumption property by having an equality constraint r_1 = r_2 − δ, where δ is a learnable non-negative vector that specifies how relation r_1 differs from r_2. Our work is closest to IterE [13], which is designed for improving embeddings of sparse entities. Seven ontology property axioms from the OWL2 Web Ontology Language are used to model relations in the KG. First a pool of valid axioms is generated by randomly sampling k triples and matching them to seven axiom templates. Then,
Fig. 2: Overall framework architecture.

inferred facts that relate to sparse entities help in learning better embeddings because they essentially provide extra information. We consider Horn rules, which are more expressive, and unlike them, we place no assumptions on the embedding score function. We improve embedding representations of both sparse and non-sparse entities. Different from constraint-based hybrid methods, our selective augmentation of inferred facts provides 1) structure to the embedding space and 2) extra information for sparse entities.

V. OUR APPROACH
Here, we introduce our hybrid model for learning Horn rules and vector space embeddings in a cross feedback paradigm. Initially, embedding learning is performed on the input KG, resulting in entity and relation embeddings. The learnt embeddings are then used to guide the rule mining system. Further, the extracted rules are materialized and new inferred triplets are sampled for learning embeddings in the next iteration. Fig. 2 shows the overall framework. In what follows, we describe the three main constituent parts: (1) embedding learning, (2) rule learning with embedding feedback and (3) importance sampling of triplets.
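At a glance, the three parts compose into a simple driver loop. The sketch below is illustrative only: the four callback names are hypothetical stand-ins for the components described in the following subsections, and Algorithm 1 gives the precise procedure.

```python
def hybrid_learn(kg, embed_step, mine_rules, sample_inferred, iters=10):
    """Cross feedback loop: embeddings guide rule mining, and sampled
    rule-inferred facts augment the triplet set for the next embedding
    round. All callbacks are hypothetical stand-ins for Secs. V-A to V-C."""
    triplets = set(kg)
    theta, rules = None, []
    for _ in range(iters):
        theta = embed_step(triplets)               # embedding learning
        rules = mine_rules(kg, theta)              # rule mining with feedback
        triplets |= sample_inferred(rules, theta)  # importance sampling
    return theta, rules

# Toy run with trivial stand-in callbacks.
kg = {("ronaldo", "playsFor", "juventus")}
theta, rules = hybrid_learn(
    kg,
    embed_step=lambda triplets: {"dim": 4, "n": len(triplets)},
    mine_rules=lambda kg, th: ["playsFor(X,Y), clubType(Y,Z) -> playsSport(X,Z)"],
    sample_inferred=lambda rules, th: {("ronaldo", "playsSport", "football")},
)
print(theta, rules)
```

The loop keeps mining against the original KG while the growing triplet set only feeds embedding learning, mirroring the framework in Fig. 2.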
A. Embedding Learning
We associate a label y_hrt with each triplet (h, r, t) to model its truth value. Labels of triplets in G_i^+ are set to 1. As there are no negative triplet examples, we generate G_i^- by negatively sampling G_i^+, and labels of triplets in G_i^- are set to 0. A score function φ(h, r, t; Θ) : E × R × E → R is used to measure the salience of a triplet (h, r, t). We further map the output of the score function to a continuous truth value in (0, 1) using the sigmoid function σ(z) = 1/(1 + exp(−z)):

ξ(h, r, t) = σ(φ(h, r, t; Θ))    (4)

The objective of embedding learning is to learn Θ_i for iteration i by minimizing the loss over triplets in G_i^+ and G_i^-. G_i = G_i^+ ∪ G_i^- represents our learning set for the current iteration i. Then,

min_{Θ_i} (1/|G_i|) Σ_{((h,r,t), y_hrt) ∈ G_i} L(ξ(h, r, t), y_hrt)    (5)

where L(x, y) = −y log(x) − (1 − y) log(1 − x) is the cross-entropy loss between x and y, and ξ(·) is the function defined in Eq. 4. It is important to note that in our method, learning is done on an extended set of rule-enriched triples. Also, our method does not depend on a specific scoring function or class of scoring functions, unlike IterE [13]. All we require is that the score function φ output a real-valued score. Table I shows the score functions proposed by various state-of-the-art methods in the literature.

B. Rule Mining with Embedding Feedback
In this step we aim to mine quality Horn rules from the set of initial triplets in G_0^+ using the current iteration embeddings Θ_i. Following AMIE+ [10], rules are modelled as a sequence of atoms, with the first atom as the head atom and the body atoms following it. The process of building a quality rule boils down to extending a partial rule sequence while carefully traversing the search space. Rule building is done by a set of mining operators that iteratively add atoms to partial rules until a termination criterion is met. To prune the search space during rule building, filtering criteria are utilized. We now explain the three sub-parts.
1) Rule Building:
Initially, all possible binary atoms using relations in R are held in a priority queue. These represent partial rule heads with empty bodies. At each iteration of the algorithm, a single partial rule is dequeued and checked for termination (defined below). If successful, it is output as a possible rule. If not, it is extended by each of the following operators:

• Add a new dangling atom (O_D), which uses a fresh variable for one argument and a shared variable/entity (used earlier in the rule) for the other.
• Add a new instantiated atom (O_I), which uses a shared variable/entity for one argument and an entity for the other.
• Add a new closing atom (O_C), which uses a shared variable/entity for both arguments.

The expansion produces multiple candidate rules. All candidate rules are checked to see if they can be pruned by the filtering criteria. Pruned ones are discarded and non-pruned ones are enqueued for the next iteration. The iterative algorithm is run until the queue is empty.
2) Filtering Criteria:
The application of mining operators to a rule produces a set of candidate rules. Since not all candidate rules are promising, this step aims to filter some of them. Ideally, filtering criteria should allow the generation of rules that (1) explain facts in the KG and (2) infer facts outside the observable KG while being consistent w.r.t. global "soft" KG properties. Classical measures of rule quality like standard confidence and PCA confidence, introduced by [14], are based on the observable KG. They work well in selecting rules that explain known facts. However, candidate rules often infer facts that are outside the known KG, and classical measures neglect them. We seek to use the learned embeddings as a proxy for measuring the quality of candidate rules which infer facts outside the known KG. This is based on the intuition that the embeddings Θ can capture global statistical patterns. To this end, given a candidate rule, we average the individual embedding scores of all its newly inferred atoms. We call this measure the embedding confidence (EC). Given a candidate rule τ, consider the set S_τ defined in Eq. 1. The unobserved facts predicted by instantiating τ are given by S_τ \ G_0^+. Then,

EC(τ) = (1/|S_τ \ G_0^+|) Σ_{(h,r,t) ∈ S_τ \ G_0^+} ξ(h, r, t)    (6)

where ξ(·) is the function defined in Eq. 4. EC ∈ (0, 1), as ξ(·) is designed to give confidence values in (0, 1). We consider the weighted average of the classical standard confidence SC (Eq. 3) and our embedding confidence EC (Eq. 6) as the final rule quality measure Q:

Q(τ) = (1 − ω) SC(τ) + ω EC(τ)    (7)

where the weight factor ω is a model hyper-parameter. Thus, for each candidate rule we check if there is an increase w.r.t. the rule quality measure Q, and if not, it is discarded. Before calculating the Q score for a candidate rule, we use the heuristic pruning strategy of AMIE+ [14] to discard non-interesting rules. This reduces the load of calculating Q scores. One quick statistical check we perform is to make sure a candidate rule covers more than 1% of known facts, using head coverage [14]. We also incorporate language biases like restricting the search to rules of length three in order to deal with the vast search space.

Termination Criteria: As the extension operators produce rules that are not necessarily closed, we follow [14] in outputting only closed and connected rules.
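On a toy rule, the quantities in Eqs. 4, 6 and 7 can be sketched as follows; the support counts and raw φ scores here are illustrative, not values from any real dataset.

```python
import math

def xi(phi_score):
    """Eq. 4: map a raw score to a truth value in (0, 1) via the sigmoid."""
    return 1.0 / (1.0 + math.exp(-phi_score))

def standard_confidence(support, body_groundings):
    """SC (Eq. 3): head groundings over body groundings."""
    return support / body_groundings

def embedding_confidence(phi_scores):
    """EC (Eq. 6): mean truth value of the rule's newly inferred facts."""
    return sum(xi(s) for s in phi_scores) / len(phi_scores)

def rule_quality(sc, ec, omega=0.5):
    """Q (Eq. 7): weighted average of SC and EC."""
    return (1 - omega) * sc + omega * ec

# A rule whose body matches 8 times, 6 of them with the head fact present,
# and whose three inferred-but-unobserved facts get raw scores from phi.
sc = standard_confidence(support=6, body_groundings=8)
ec = embedding_confidence([2.0, 1.5, -0.5])
print(rule_quality(sc, ec, omega=0.4))
```

A noisy rule like l_3 in Sec. III would receive low φ scores on its spurious inferences, dragging EC and hence Q down even when SC is high.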
C. Importance Sampling of Triplets
Once the rule mining system has extracted rules with our quality measure, we incorporate them in refining the embeddings for the next iteration. To do so, the first task is to select the top-K rules by ranking the extracted rules according to their quality measure Q, which is already calculated (Eq. 7). Let T represent the set of all the selected rules. Then, the newly inferred triplets from the rules in T are given by

G_T = (∪_{τ ∈ T} S_τ) \ G_0^+    (8)

where S_τ (Eq. 1) is the set of inferred triplets from rule τ. Next we augment the current positive triplet set G_i^+ with G_T to create G_{i+1}^+, which will be used for embedding learning in iteration i + 1. Instead of naively augmenting all of G_T with label 1, we use an importance sampling scheme that uses Θ_i from the current iteration i. Specifically, we sample triplets from the following distribution:

p((h, r, t) | G_T) = exp(β φ(h, r, t; Θ_i)) / Σ_{(h_j, r_j, t_j) ∈ G_T} exp(β φ(h_j, r_j, t_j; Θ_i))    (9)

where φ(·) is the embedding score function and β is the sampling temperature. The set of input triplets for the next embedding learning iteration is generated as shown in line 8 of Algorithm 1, where ψ represents the importance sampling strategy.

Algorithm 1
Hybrid Learning Procedure

Input: initial training knowledge graph (KG), embedding method φ
Output: final embeddings Θ and mined rules T
1:  G_0^+ ← {((h, r, t), y_hrt = 1) | (h, r, t) ∈ KG}
2:  T ← ∅
3:  Randomly initialize entity and relation embeddings Θ
4:  for i = 1 : M do
5:      G_{i−1}^- ← NegativeSampling(G_{i−1}^+)
6:      Θ_i ← EmbeddingLearning(G_{i−1}^+, G_{i−1}^-)    ▷ Eq. 5
7:      T ← RuleMining(G_0^+, Θ_i)
8:      G_i^+ = G_{i−1}^+ ∪ ψ(G_T)    ▷ Sec. V-C
9:  end for
10: return Θ_M, T

VI. EXPERIMENTS AND ANALYSIS
Data
We evaluate our proposed method on multiple widely used benchmark datasets. FB15k-237 [25], which has facts about movies, actors, awards, sports and sport teams, is a subset of FB15k [6] with no inverse relations. WN18RR [16], a subset of WordNet, is a KG describing lexical relations between words. YAGO3-10 [31] deals with descriptive attributes of people. We also consider sparse variants from [13] for one of our experiments. Statistics of all datasets are given in Table III. We use the original train/valid/test splits provided by the authors of each dataset and represent them as G_train, G_valid and G_test.

TABLE II: Benchmark results for link prediction. Comparison against state-of-the-art standalone KGE and hybrid methods. Results of [♠] are taken from [8]. YAGO3-10 results for methods with [♦] are taken from [28]. Results of [♣] are produced by us using code provided by the respective authors. All other results are taken from the corresponding original papers. Best scores are in bold, the runner-up is underlined and "∗" represents statistically significant improvements over RotatE (paired t-test).

Model          | FB15K-237 (MRR / Hit@1 / Hit@3 / Hit@10) | WN18RR (MRR / Hit@1 / Hit@3 / Hit@10) | YAGO3-10 (MRR / Hit@1 / Hit@3 / Hit@10)
TransE [♠,♦]   | 0.293 / 0.140 / 0.268 / 0.463 | 0.226 / – / – / 0.501 | 0.303 / 0.218 / 0.336 / 0.475
DistMult [♠]   | 0.241 / 0.155 / 0.263 / 0.419 | 0.430 / 0.390 / 0.440 / 0.490 | 0.340 / 0.240 / 0.380 / 0.540
ComplEx [♠]    | 0.247 / 0.158 / 0.275 / 0.428 | 0.440 / 0.410 / 0.460 / 0.510 | 0.360 / 0.260 / 0.400 / 0.550
ConvE          | 0.325 / 0.237 / 0.356 / 0.501 | 0.430 / 0.400 / 0.440 / 0.520 | 0.440 / 0.350 / 0.490 / 0.620
RotatE         | 0.338 / 0.241 / 0.375 / 0.533 | 0.476 / – / – / – | – / – / – / –
RUGE [♣,♦]     | 0.169 / 0.087 / 0.181 / 0.345 | 0.231 / 0.218 / 0.387 / 0.439 | 0.431 / 0.340 / 0.482 / 0.603
NNE-AER [♣]    | 0.317 / 0.183 / 0.294 / 0.478 | 0.431 / 0.412 / 0.437 / 0.467 | 0.390 / 0.310 / 0.419 / 0.597
pLogicNet      | 0.332 / 0.237 / 0.369 / 0.528 | 0.441 / 0.398 / 0.446 / 0.537 | – / – / – / –
Our method     | ∗ / ∗ / ∗ / ∗ | ∗ / ∗ / ∗ / ∗ | – / – / – / –

TABLE III: Statistics of Datasets.

Dataset | |E| | |R|
We consider the standard link prediction task following the protocol introduced by [6]. For a test triplet (h, r, t), we replace the head h with each entity e_i ∈ E and compute the score of (e_i, r, t) using φ. From the descending order of scores, the rank of the correct entity, i.e., h, is found. This gives us the head rank. Similarly, we run the same procedure for the tail t, which gives the tail rank. Finally, the average of head and tail ranks is used to report popular metrics like Mean Reciprocal Rank (MRR) and the percentage of predicted ranks within N, which is
Hit@N. Higher values of both metrics signify better results. Also, during the ranking process, we make sure that the replaced triplet does not exist in either the training, validation, or test set. This corresponds to the "filtered" setting in [6]. [32] propose a RANDOM protocol to be used when choosing among triplets that get the same score from the model. Following this protocol, when we generate the rankings and more than two entities receive the same score, we randomly pick one of them, because picking in any fixed order would be unfair, as argued by [32].

WordNet: https://wordnet.princeton.edu/
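A minimal sketch of the filtered ranking computation; the toy score dictionary stands in for a real φ, and for simplicity all scores are assumed distinct (ties would be broken with [32]'s RANDOM protocol).

```python
def filtered_rank(scores, correct, known):
    """Rank of `correct` among candidate entities, after removing other
    known true triplets (the 'filtered' setting). `scores` maps each
    candidate entity to its phi score for the fixed (?, r, t) query."""
    kept = {e: s for e, s in scores.items() if e == correct or e not in known}
    target = kept[correct]
    # Rank = 1 + number of kept candidates scoring strictly higher.
    return 1 + sum(1 for s in kept.values() if s > target)

# Toy example: entity 2 is correct; entity 0 scores higher but forms a
# known true triplet elsewhere in train/valid/test, so it is filtered out.
scores = {0: 0.9, 1: 0.4, 2: 0.7, 3: 0.8}
rank = filtered_rank(scores, correct=2, known={0})
print(rank, 1.0 / rank)   # rank and its reciprocal-rank contribution to MRR
```

Averaging the reciprocal head and tail ranks over all test triplets yields MRR; Hit@N counts the fraction of ranks that are at most N.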
Hyperparameter Setting
We fine-tune the hyper-parameters on the validation set G_valid, grid searching over the embedding dimension d, the batch size B, the embedding weight factor ω and the number K of top-ranked extracted rules. Initially, all embeddings are randomly initialized, and the cross feedback learning procedure is run with M set to 10 (see line 4 in Algo. 1). During each global iteration, embedding learning is run for k steps; we set k to 100. During rule mining, we mine rules of length at most 3 to keep a check on the search space. G_valid is used for early stopping and for obtaining the best model, on which G_test is evaluated.

A. Embedding Evaluation
We compare our model with state-of-the-art standalone knowledge graph embedding methods representing a variety of modelling approaches, viz. TransE [6], DistMult [15], ComplEx [7], ConvE [16] and RotatE [8]. The standalone models rely only on observed triplets in the KG and use no logical rules. We further compare with additional hybrid baselines, including RUGE [28] and NNE-AER [29], which make use of certain logical rules to constrain the vector space of Θ, and pLogicNet [33], which utilizes probabilistic logical rules in a Markov Logic Network (MLN) framework. For all baseline models we run (marked with ♣ in the result tables), optimal hyper-parameters are obtained using grid search. Table II shows the evaluation on the different datasets. We can see that our model outperforms both standalone and hybrid baselines by a large margin on the FB15K-237 and YAGO3-10 datasets. Both datasets have diverse relation patterns like composition, symmetry and inversion, and mined rules can infer rich triplets, which leads to increased data quality when augmented to the original KG. There is no statistically significant improvement for WN18RR because the number of rules mined from WN18RR is only a few tens, whereas for FB15K-237 and YAGO3-10 it is in the hundreds.

TABLE IV: Sparse KG results. Results of [♣] are produced by us using code provided by the respective authors. Other results are taken from [13]. Best scores are highlighted in bold and the runner-up is underlined.

Model          | FB15K-237-sparse (MRR / Hit@1 / Hit@3 / Hit@10) | WN18RR-sparse (MRR / Hit@1 / Hit@3 / Hit@10)
TransE         | 0.238 / 0.164 / 0.261 / 0.385 | 0.146 / 0.034 / 0.247 / 0.288
DistMult       | 0.204 / 0.128 / 0.226 / 0.362 | 0.255 / 0.238 / 0.260 / 0.225
ComplEx        | 0.197 / 0.120 / 0.217 / 0.354 | 0.259 / 0.246 / 0.262 / 0.286
RotatE [♣]     | 0.292 / 0.213 / 0.320 / 0.445 | 0.340 / 0.299 / 0.354 / 0.422
RUGE [♣]       | 0.241 / 0.172 / 0.267 / 0.393 | 0.145 / 0.199 / 0.263 / 0.292
NNE-AER [♣]    | 0.212 / 0.133 / 0.238 / 0.395 | 0.279 / 0.284 / 0.304 / 0.305
IterE + axioms | 0.247 / 0.179 / 0.262 / 0.392 | 0.274 / 0.254 / 0.281 / 0.314
Our method     | – / – / – / – | – / – / – / –
The number of rules mined is directly correlated with the number of relations in the dataset (see Table III). Moreover, the WN18RR dataset is less diverse and mainly made up of relations that conform to the symmetry pattern. Constraint-based hybrid approaches perform poorly across datasets as they do not model all rule patterns. We also perform a paired t-test between our method and the best baseline, i.e., RotatE; the improvements are statistically significant (see Table II).

B. Rule Evaluation

Precision measures the ability of the rules to infer true facts beyond the train set. Mathematically, given the test set G_test, the precision of a set of rules T mined using G_train is

precision(T) = |G_T ∩ G_test| / |G_T|    (10)

where G_T, defined in Eq. 8, is the set of all newly inferred triplets from the set of rules T. Using this metric, we compare against AMIE+ [10]. For better understanding, we compare the top-K rules that are output. Results (Table V) show that our embedding confidence measure, when combined with the standard filtering techniques used by AMIE+, leads to higher quality rules. We conclude that feedback from embeddings that carry "global" KG pattern information is useful when mining rules.

TABLE V: Quantitative evaluation of the learned top-K rules using precision (Eq. 10). AMIE+ uses PCA confidence while our method utilizes embedding confidence (Eq. 7) for measuring rule quality.

top-K | FB15K-237 (AMIE+ / Our Method) | WN18RR (AMIE+ / Our Method)
10    | 0.357 / – | – / –
20    | 0.422 / – | – / –
50    | 0.585 / – | – / –
100   | 0.613 / – | – / –
200   | 0.679 / – | – / –
500   | 0.384 / – | – / –
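Eq. 10 on toy triplet sets; the facts below are illustrative stand-ins for G_T and G_test.

```python
def rule_precision(inferred, test):
    """Precision (Eq. 10): fraction of newly inferred triplets that
    actually appear in the held-out test set."""
    if not inferred:
        return 0.0
    return len(inferred & test) / len(inferred)

g_t = {("messi", "playsSport", "football"),
       ("dembele", "playsSport", "football"),
       ("dembele", "isCalled", "goat")}
g_test = {("messi", "playsSport", "football"),
          ("dembele", "playsSport", "football")}
print(rule_precision(g_t, g_test))   # 2 of 3 inferred facts are true
```

Here the spurious l_3-style inference for Dembele is exactly what lowers the precision of a rule set.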
C. Training with Different Score Functions
One may argue that our boost in performance comes from the particular score function we used for embedding learning (i.e., RotatE). We show experimentally in Table VI that our cross feedback learning approach improves performance when used with different scoring functions. The point we make here is that our rich data augmentation approach brings in deterministic structure, leading to better embedding representations irrespective of the scoring function used.

TABLE VI: Performance of our cross feedback paradigm with different embedding learning methods (i.e., φ is from Table I). The second row for each embedding method, in brackets, reports performance without our paradigm.

Embedding Method | FB15K-237 (MRR / Hit@10) | YAGO3-10 (MRR / Hit@10)
TransE           | – / – | – / –
                 | (0.293) / (0.42) | (0.303) / (0.475)
ComplEx          | – / – | – / –
                 | (0.247) / (0.428) | (0.417) / (0.603)
RotatE           | – / – | – / –
                 | (0.338) / (0.533) | (0.495) / (0.670)
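The plug-and-play claim rests on φ only needing to emit a real-valued score. As a sketch, two of the Table I score functions can sit behind one interface (the numpy vectors here are toy values, not learned embeddings):

```python
import numpy as np

def phi_transe(h, r, t):
    """TransE from Table I: -||h + r - t||."""
    return float(-np.linalg.norm(h + r - t))

def phi_distmult(h, r, t):
    """DistMult from Table I: the trilinear product <h, r, t>."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(1)
h, r, t = rng.normal(size=(3, 4))
for phi in (phi_transe, phi_distmult):
    print(phi(h, r, t))   # either function can feed Eqs. 4, 6 and 9
```

Any such function can be dropped into ξ (Eq. 4), EC (Eq. 6) and the importance sampler (Eq. 9) without changing the rest of the framework.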
D. Sparse KG Evaluation
Here we compare our model against IterE [13] in the sparse KG setting, using the same datasets provided by the authors of IterE. In the sparse versions of FB15K-237 and WN18RR, only entities with sparsity (Eq. 11) greater than 0.995 are allowed in the validation (G_valid) and test (G_test) sets, while the training sets (G_train) remain unchanged. Following [13], the sparsity of an entity e is defined as

    sparsity(e) = 1 − (freq(e) − freq_min) / (freq_max − freq_min)    (11)

where freq(e) is the number of triples in G_train in which entity e participates, and freq_min and freq_max are the minimum and maximum such frequencies over all entities in E.

These datasets provide a way to evaluate whether our model can improve the embedding representations of sparse entities. The results (Table IV) show that our method significantly improves the scores on all metrics for both sparse versions compared to the baselines. Using embeddings to guide the search space of the rule mining system, instead of the naive fixed-k pruning strategy used by IterE, together with our importance sampling mechanism, leads to effective utilization of the generalizing power of rules. IterE is tied to a class of embedding methods that assume a linear map between subject and object entities, while our model places no such assumption. RotatE, the state-of-the-art standalone model, also suffers when reasoning about sparse entities. Although RUGE and NNE-AER do better than some standalone methods, they fall short compared to our method.

VII. ADDITIONAL ANALYSIS
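The sparsity measure of Eq. (11) above can be sketched in a few lines (a minimal illustration; the function and variable names are ours):

```python
# Entity sparsity (Eq. 11): entity frequencies over the training triples are
# min-max normalised and inverted, so rarely-seen entities score close to 1.
from collections import Counter

def entity_sparsity(train_triples):
    freq = Counter()
    for h, _, t in train_triples:
        freq[h] += 1
        freq[t] += 1
    f_min, f_max = min(freq.values()), max(freq.values())
    span = (f_max - f_min) or 1  # guard the degenerate all-equal case
    return {e: 1 - (f - f_min) / span for e, f in freq.items()}

triples = [("a", "r", "b"), ("a", "r", "c"), ("a", "r", "d"), ("b", "r", "c")]
s = entity_sparsity(triples)
print(s["a"], s["d"])  # 0.0 1.0 -- the most frequent entity is dense, the rarest maximally sparse
```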
A. Computational Complexity
With regard to embedding learning, our approach has the same complexity as the chosen score function. If RotatE is used, the space complexity is O(d(|E| + |R|)), which scales linearly with the number of entities, the number of relations, and the embedding dimension. Each iteration of our learning procedure has a time complexity of O(b + kbd + n̄d) for negative sampling, embedding learning, and importance sampling combined, where b is the batch size, d the embedding dimension, k the number of inner embedding learning epochs (used by Eq. 5), and n̄ the average size of G_T (see Eq. 8) per iteration. As k is usually set to a small number and the average number of entailments is typically smaller than the batch size (i.e., n̄ < b), our time complexity is on par with conventional KGE methods, whose complexity is O(bd). Note that the above embedding time complexity does not depend on the size of the input graph G_i at iteration i, because training is done with SGD in minibatch mode. Running rule mining with embedding guidance at every global iteration adds further cost, but AMIE+ [10] is known to be very efficient thanks to its in-memory database for storing the KG and its various optimizations for pruning the large search space.

B. Iterative Learning Profile
Here we assess how performance (e.g., Hit@10, MRR) varies with the number of global iterations. Fig. 3 shows our findings for FB15K-237 and WN18RR. Large gains are observed after two iterations, and performance continues to increase until it levels off around iteration eight. These plots indicate that sampling triplets from rule mining and augmenting them for embedding learning has a positive effect.
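The iteration profile above corresponds to the outer loop of our paradigm. Purely as a schematic sketch, the cross-feedback loop looks like the following; every helper here is a toy placeholder standing in for the components described in the paper, not the actual implementation:

```python
# Schematic of one cross-feedback run: each global iteration trains embeddings,
# mines rules under embedding guidance, infers new triplets (G_T), and
# importance-samples some of them back into the training set.
# All helpers are toy placeholders, not the real components.

def train_embeddings(triples, epochs=1):
    return {"trained_on": len(triples)}        # stand-in for a KGE model

def mine_rules(triples, model):
    return [("rule", len(triples))]            # stand-in for embedding-guided mining

def infer_triplets(rules, triples):
    return {("new", "r", str(len(triples)))}   # stand-in for rule entailment (G_T)

def importance_sample(inferred, model, k=1):
    return set(list(inferred)[:k])             # stand-in for embedding-weighted sampling

def cross_feedback(train, global_iters=3):
    train = set(train)
    model = None
    for _ in range(global_iters):
        model = train_embeddings(train)
        rules = mine_rules(train, model)
        inferred = infer_triplets(rules, train)
        train |= importance_sample(inferred, model)  # augment the train set
    return model, train

model, augmented = cross_feedback({("a", "r", "b")}, global_iters=3)
print(len(augmented))  # 4: one sampled triplet is added per global iteration
```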
C. Case Study
We give some qualitative examples from the FB15K-237 dataset to demonstrate the effectiveness of our model. First, consider Table VII. It shows several test triplets from G_test and their corresponding head and tail rank changes when comparing RotatE [8] as the baseline with our method. The relevant rule that provides feedback to embedding learning in our framework is also shown for each test triplet.

Fig. 3: Link Prediction Performance vs. Global Iterations

As an example, take the first test triplet (Louis Armstrong, /people/person/nationality, United States). Because Louis Armstrong is a sparse entity with sparsity 0.995, traditional embedding methods suffer in the head prediction task, i.e., asking (?, /people/person/nationality, United States), essentially because the embedding representations of sparse entities are not informative. Concretely, RotatE gives a filtered rank of 1154. Compare this with our method, which utilizes the rule saying that musicians have the same nationality as the country of the town they are originally from. Having V1 as New Orleans satisfies the rule for the test triplet, and our method improves the subject prediction rank to 68, a gain of 1086. For the tail prediction task, i.e., (Louis Armstrong, /people/person/nationality, ?), baseline methods perform well because the entity in question, United States, is not sparse. The other examples show a similar qualitative trend, demonstrating that feedback from relevant rules improves the embedding representations of entities, especially sparse ones.

Next, we show qualitatively how rule quality can be improved with embedding guidance through our embedding confidence measure (Eq. 6). Table VIII shows three different rule bodies mined by AMIE+ on FB15K-237, all of which result in the same head atom, /olympic_sport./affiliation_country(V0, V1), which means V0 is an olympic sport that has representation for country V1. An example triplet in G_train is (Triathlon,
TABLE VII: Qualitative results showing the link prediction rank change of entities in test triplets, comparing our method against the RotatE baseline. The associated rule used for providing feedback is shown for each test triplet.

  Test triplet: (Louis Armstrong, /people/person/nationality, United States)
  Rule: /music/artist/origin(V0, V1) ∧ /location/administrative_division/country(V1, V2) → /people/person/nationality(V0, V2)
  Rank change: head: 1154 → 68 (+1086); tail: 1 →

  Test triplet: (15 Minutes, /film/country, United States)
  Rule: /film/production_companies(V0, V1) ∧ /organization/headquarters./location/country(V1, V2) → /film/country(V0, V2)
  Rank change: head: 314 → 12 (+302); tail: 1 →

  Test triplet: (Poland, /location/second_level_divisions, Szczecin)
  Rule: /bibs_location/country(V1, V0) ∧ /location/county_place(V1, V2) → /location/second_level_divisions(V0, V2)
  Rank change: head: 1 →

  Test triplet: (Fort Lauderdale, /location/location/time_zones, Eastern Time Zone)
  Rule: /location/location/contains(V1, V0) ∧ /location/location/time_zones(V1, V2) → /location/location/time_zones(V0, V2)
  Rank change: head: 414 → 45 (+369); tail: 1 →

TABLE VIII: Qualitative results comparing standard and embedding confidence scores for three candidate rules inferring the same head atom.
  Rule 1: /olympic_sport./affiliation_country(V0, V2) ∧ /location/import_and_exports(V2, V1) → /olympic_sport./affiliation_country(V0, V1)
          Support: 548    Predictions: 1723    Standard conf.: 0.318    Embedding conf.: 0.207
  Rule 2: /olympic_sport./affiliation_country(V0, V2) ∧ /location./adjoins_location(V2, V1) → /olympic_sport./affiliation_country(V0, V1)
          Support: 1524   Predictions: 4747    Standard conf.: 0.321    Embedding conf.: 0.404
  Rule 3: /sport./pro_olympic_athlete(V0, V2) ∧ /person./nationality(V2, V1) → /olympic_sport./affiliation_country(V0, V1)
          Support: 588    Predictions: 1792    Standard conf.: 0.328    Embedding conf.: 0.701

TABLE IX: TransE with different negative sampling techniques.
                        FB15K-237           WN18RR
                        MRR     Hit@10      MRR     Hit@10
  uniform               0.241   0.422       0.186   0.459
  KBGAN                 0.278   0.453       0.210   0.479
  self-adversarial      0.298   0.475       0.223   0.510
  uniform + our method  0.435   0.545       0.234   0.515

/olympic_sport/affiliation_country, France). Rules one and two say that an olympic sport V0 has representation from country V1 if there is another country V2 that represents V0 and either adjoins V1 or trades with it. Both rules obtain very similar standard confidence measures because their ratios of rule support to number of predictions are alike, so choosing the more relevant rule is not clear. Compare this with rule three, which says an olympic sport V0 has representation from country V1 if there is a professional olympic athlete V2 who plays V0 and has nationality V1. This rule is more apt than the earlier ones, yet it still gets a similar standard confidence of 0.328. Now contrast this with the embedding confidence scores of the three rules. Rule three gets a much better EC score than rules one and two because most of its instantiations receive high embedding scores (see Eq. 6). This shows that embedding confidence, which models rules probabilistically, is helpful when used in conjunction with standard confidence to assess rule quality during mining.

D. Connections to Negative Sampling
The results from the previous section prompted us to view our paradigm through the lens of negative sampling. At each iteration, when newly inferred triplets are introduced, the underlying uniform negative sampling generates higher-quality negatives simply because of the superior positives that are augmented. We further show (Table IX) how our method stacks up against other negative sampling methods: uniform, KBGAN [34], and self-adversarial [8]. For all methods, including ours, TransE [6] is used as the embedding method for a fair comparison. The results indicate that the augmented inferred triplets invariably lead to better-quality negatives, thus improving the overall training set on which the embeddings are learned. In our opinion, this also helps explain the significant boost in performance for models like TransE and ComplEx, which are in fact incapable of modelling symmetry and composition patterns in data [8].

VIII. CONCLUSION
In this work, we have developed a hybrid method that utilizes the complementary properties of rules and embeddings. The experimental results empirically support the two main claims we raised: (1) structure and richer data quality in training result in superior embedding representations, and (2) incorporating "global" KG statistical patterns into rule mining leads to more reliable rules. We extensively evaluated our approach with varied experiments and showed its effectiveness. The connections to negative sampling motivate us to investigate the framework more deeply in future work, both for possible theoretical claims and for developing other general hybrid models that unify different learning schemes.

ACKNOWLEDGMENT
This research is supported by NSF under contract numbers CCF-1918483, IIS-1618690, and CCF-0939370.

REFERENCES

[1] Freebase, Freebase Data Dumps, accessed June 5, 2020. [Online]. Available: https://developers.google.com/freebase/
[2] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: a core of semantic knowledge," in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 697–706.
[3] Wikidata, Main Page. [Online].
[4] Google Knowledge Graph. [Online].
[5] M. Nickel, V. Tresp, and H.-P. Kriegel, "A three-way model for collective learning on multi-relational data," in International Conference on Machine Learning, 2011.
[6] A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in Advances in Neural Information Processing Systems, 2013, pp. 2787–2795.
[7] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in International Conference on Machine Learning, 2016, pp. 2071–2080.
[8] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, "RotatE: Knowledge graph embedding by relational rotation in complex space," arXiv preprint arXiv:1902.10197, 2019.
[9] B. Goethals and J. Van den Bussche, "Relational association rules: getting WARMeR," in Pattern Detection and Discovery. Springer, 2002, pp. 125–139.
[10] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek, "Fast rule mining in ontological knowledge bases with AMIE+," The VLDB Journal, vol. 24, no. 6, pp. 707–730, 2015.
[11] R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski, "A latent factor model for highly multi-relational data," in Advances in Neural Information Processing Systems, 2012, pp. 3167–3175.
[12] J. Pujara, E. Augustine, and L. Getoor, "Sparsity and noise: Where knowledge graph embeddings fall short," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1751–1756.
[13] W. Zhang, B. Paudel, L. Wang, J. Chen, H. Zhu, W. Zhang, A. Bernstein, and H. Chen, "Iteratively learning embeddings and rules for knowledge graph reasoning," in The World Wide Web Conference. ACM, 2019, pp. 2366–2377.
[14] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek, "AMIE: association rule mining under incomplete evidence in ontological knowledge bases," in Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 413–422.
[15] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," arXiv preprint arXiv:1412.6575, 2014.
[16] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2D knowledge graph embeddings," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[17] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, "A review of relational machine learning for knowledge graphs," Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.
[18] Q. Wang, Z. Mao, B. Wang, and L. Guo, "Knowledge graph embedding: A survey of approaches and applications," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017.
[19] S. Muggleton and L. De Raedt, "Inductive logic programming: Theory and methods," The Journal of Logic Programming, vol. 19, pp. 629–679, 1994.
[20] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," in ACM SIGMOD Record, vol. 22, no. 2. ACM, 1993, pp. 207–216.
[21] H. Adé, L. De Raedt, and M. Bruynooghe, "Declarative bias for specific-to-general ILP systems," Machine Learning, vol. 20, no. 1–2, pp. 119–154, 1995.
[22] L. Dehaspe and H. Toivonen, "Discovery of relational association rules," in Relational Data Mining. Springer, 2001, pp. 189–212.
[23] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, 1996, pp. 307–328.
[24] S. Schoenmackers, O. Etzioni, D. S. Weld, and J. Davis, "Learning first-order horn clauses from web text," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1088–1098.
[25] K. Toutanova and D. Chen, "Observed versus latent features for knowledge base and text inference," in Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015, pp. 57–66.
[26] Q. Wang, B. Wang, and L. Guo, "Knowledge base completion using embeddings and rules," in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[27] T. Rocktäschel, S. Singh, and S. Riedel, "Injecting logical background knowledge into embeddings for relation extraction," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1119–1129.
[28] S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo, "Knowledge graph embedding with iterative guidance from soft rules," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[29] B. Ding, Q. Wang, B. Wang, and L. Guo, "Improving knowledge graph embedding using simple constraints," arXiv preprint arXiv:1805.02408, 2018.
[30] B. Fatemi, P. Taslakian, D. Vazquez, and D. Poole, "Knowledge hypergraphs: Extending knowledge graphs beyond binary relations," arXiv preprint arXiv:1906.00137, 2019.
[31] F. Mahdisoltani, J. Biega, and F. M. Suchanek, "Yago3: A knowledge base from multilingual wikipedias," 2015.
[32] Z. Sun, S. Vashishth, S. Sanyal, P. Talukdar, and Y. Yang, "A re-evaluation of knowledge graph completion methods," arXiv preprint arXiv:1911.03903, 2019.
[33] M. Qu and J. Tang, "Probabilistic logic neural networks for reasoning," in Advances in Neural Information Processing Systems, 2019, pp. 7710–7720.
[34] L. Cai and W. Y. Wang, "KBGAN: Adversarial learning for knowledge graph embeddings," arXiv preprint arXiv:1711.04071, 2017.