A Hybrid Model for Learning Embeddings and Logical Rules Simultaneously from Knowledge Graphs
Susheel Suresh and Jennifer Neville
Computer Science Department, Purdue University, West Lafayette, IN, USA
[suresh43, neville]@purdue.edu
Abstract—The problem of knowledge graph (KG) reasoning has been widely explored by traditional rule-based systems and more recently by knowledge graph embedding methods. While logical rules can capture deterministic behavior in a KG, they are brittle, and mining ones that infer facts beyond the known KG is challenging. Probabilistic embedding methods are effective in capturing global soft statistical tendencies, and reasoning with them is computationally efficient. While embedding representations learned from rich training data are expressive, incompleteness and sparsity in real-world KGs can impact their effectiveness. We aim to leverage the complementary properties of both methods to develop a hybrid model that learns both high-quality rules and embeddings simultaneously. Our method uses a cross feedback paradigm wherein an embedding model is used to guide the search of a rule mining system to mine rules and infer new facts. These new facts are sampled and further used to refine the embedding model. Experiments on multiple benchmark datasets show the effectiveness of our method over other competitive standalone and hybrid baselines. We also show its efficacy in a sparse KG setting and finally explore the connection with negative sampling.
I. INTRODUCTION
Knowledge graphs (KGs) are large directed graphs where nodes represent concrete or abstract entities and edges symbolize relations for a pair of nodes. Many KGs containing millions of entities and relation types exist today, viz. Freebase [1], YAGO [2], Wikidata [3], and the Google Knowledge Graph [4], which are pivotal in reasoning about multi-relational data from different domains. One important reasoning problem is that of predicting missing relationships between entities (link prediction). Reasoning over KGs is particularly challenging in part due to their characteristic properties: large size, incompleteness, sparsity and noisy facts. Latent feature (a.k.a. embedding) models learned probabilistically (RESCAL [5], TransE [6], ComplEx [7] and RotatE [8]) and inductive logic programming (ILP) inspired techniques which mine interpretable logical rules (WARMER [9], AMIE+ [10]) are two prominent KG reasoning approaches.

Relations in knowledge graphs adhere to certain constraints which enforce syntactic validity and typically follow deterministic connectivity patterns like equivalence, symmetry, inversion and composition. Rules which capture such patterns are precise, interpretable and can generalize well. Drawbacks include potential low coverage, mining inefficiency (large search space) and difficulty in mining quality rules from incomplete KGs. Embedding methods aim to learn useful representations of entities and relations by projecting known triplets into low-dimensional vector spaces and maximizing the total plausibility of known facts in the KG. Embedding models are able to capture unobservable but intrinsic and semantic properties of entities and relations [11]. As reasoning with embeddings boils down to vector space calculations it is computationally efficient, but it can be inaccurate when the entities and relations are sparse or noisy
[12]. Since both methods have their advantages and disadvantages, in this paper we propose a hybrid model which aims to exploit their complementary strengths. The main idea is to selectively utilize inferred facts from logical rules for embedding learning. Also, embedding feedback is used to prune the search space of the rule mining system. This cross feedback process, when run for many iterations, simultaneously learns to incorporate deterministic structure into embeddings and mine rules that are consistent with "global" KG patterns. Hybrid models proposed previously in the literature use simple logical rules to place constraints on the embedding space in order to incorporate structure. Rule learning is detached from embedding learning, and most methods just naively enumerate all possible rules to start with, which does not scale for larger KGs. Methods like IterE [13] place assumptions on the kind of rules and embedding techniques that can be used and are built specifically to tackle sparse entities. Different from such methods, our model aims to simultaneously improve embeddings and mine diverse and reliable rules. Moreover, our method can be incorporated with any embedding technique and rule mining system.

II. BACKGROUND
Let E represent the set of all entities and R the set of all relation types in the KG. A knowledge graph G contains a set of factual triplets {(h, r, t) | h, t ∈ E; r ∈ R}. h, r and t are called head, relation and tail respectively. Figure 1 shows a sample from a larger knowledge graph about sports. Knowledge graph reasoning, in particular link prediction, deals with the problem of inferring new relationships between entities, and triplet classification involves predicting the existence of candidate triplets in a given KG.

TABLE I: State-of-the-art embedding score functions.

Model    | Score Function (φ)
TransE   | −‖h + r − t‖
DistMult | ⟨h, r, t⟩
ComplEx  | Re(⟨h, r, t⟩)
RotatE   | −‖h ∘ r − t‖

Knowledge graph embedding
Entities are represented as vectors, and relations are seen as operations in vector space, typically represented as vectors, matrices or tensors. These models assume that the existence of individual triplets in a KG is conditionally independent given latent representations (a.k.a. embeddings) of entities and relations in a continuous vector space. A score function φ : E × R × E → R is used to measure the model's confidence in a candidate triplet and is defined based on different vector space assumptions. For example, a popular model called TransE [6] aims to have h + r ≈ t if (h, r, t) ∈ KG (boldface letters represent the respective embeddings in R^d). So the score function φ(h, r, t) = −‖h + r − t‖ is expected to be large if (h, r, t) exists in the KG. Embedding learning is done under the open world assumption (OWA), where unobserved triplets are either false or unknown. Negative sampling is employed due to the lack of negative examples in the input KG. A simple and effective approach [6] is to corrupt either the head or tail of a true triplet with a random entity sampled uniformly from E. Entity and relation embeddings are learned by minimizing a logistic or pairwise ranking loss.

Rule Mining
A rule is a formula of atoms connected with logical connectives. In particular, a rule is Horn if the conjunction of a set of body atoms results in a single head atom, as in τ : B_1 ∧ B_2 ∧ · · · ∧ B_n → r(X, Y), where the B_i's are body atoms and r(X, Y) is the head atom. We use body(τ) to denote the body atoms of rule τ. An instantiation of a rule is the act of substituting all its variables with entities from E. The head atom of an instantiated rule is called an inferred head atom if all body atoms of the instantiated rule exist in the KG. S_τ is the set of all inferred head atoms obtained from instantiations of a rule τ:

S_τ = { r(X, Y) | ∃ z_1, ..., z_m : body(τ) }    (1)

where z_1, ..., z_m are the variables that appear in the body of the rule. A principled approach to mine Horn rules from a KG is with an association rule learning algorithm [14], [9]. To deal with the vast search space, language biases are utilized, viz. limiting rule length, requiring every atom in a rule to be transitively connected to the others (a.k.a. connected rules) and ensuring all variables in a rule are closed (i.e., each appears at least twice). During mining, various statistical measures assess the quality of intermediate rules to help prune the search space. According to [14], Rule Support is the number of distinct groundings of
the head atom resulting from the body. Formally,

supp(τ) = #(x, y) : ∃ z_1, ..., z_m : body(τ) ∧ r(x, y)    (2)

Standard Confidence (SC) is the ratio of the number of head groundings to the number of body groundings in the KG:

SC(τ) = supp(τ) / #(x′, y′) : ∃ z_1, ..., z_m : body(τ)    (3)

Fig. 1: Entities and relations in G_f (a sports sample containing, e.g., Cristiano Ronaldo, Lionel Messi, Gianluigi Buffon, Ousmane Dembele, Andres Iniesta, FC Juventus, FC Barcelona, Ballon d'Or, Golden Boot and G.O.A.T., connected by relations such as playsFor, clubType, playsSport, isCalled, areTeamMates and hasWon). Known relationships are shown as solid edges and possible relations as dashed edges.

III. MOTIVATION
Consider the sample G_f in Fig. 1, drawn from a larger KG G about sports. Suppose a rule learning system has found the following three rules:

l_1 : playsFor(V_1, V_2) ∧ clubType(V_2, V_3) → playsSport(V_1, V_3)
l_2 : playsFor(V_1, V_2) ∧ playsFor(V_3, V_2) → teamMate(V_1, V_3)
l_3 : isCalled(V_1, V_2) ∧ teamMate(V_1, V_3) → isCalled(V_3, V_2)

Q1: Can the generalizing ability of logical rules help embedding methods learn better representations?
Rules l_1 and l_2 have a number of conforming instantiations in G_f, and one can argue for them to hold in the larger knowledge graph G. Say in a sample pertaining to basketball, G_b, the entities are sparsely connected. Embedding representations of entities in such a "less-connected" sub-graph will not be expressive because of poor data quality. On the other hand, rules l_1 and l_2, mined with support from facts in G_f, can accurately reason about entities in G_b.

Q2: Can feedback from an embedding model improve the quality of mined rules?
Embedding models are good at incorporating global patterns. For example, a possible explanation for Ronaldo being G.O.A.T (greatest of all time) is the presence of a large number of links between him and different sport awards (see Fig. 1). An embedding method, say RESCAL [5], can easily model the pattern: consistent top players of a sport are likely to be called G.O.A.T. Specifically, the model can learn features such as "consistent top player" and a "concept of eminence" for different entities from the data, as components of the embeddings h_ronaldo, M_isCalled and t_G.O.A.T. The bilinear score φ(ronaldo, isCalled, G.O.A.T) = h⊤ M_r t is then high, denoting validity.

Now, although rule l_3 has multiple instantiations in G_f, it is quite noisy and unreliable. For example, it will incorrectly infer that Dembele (an upcoming young player) is G.O.A.T by virtue of him being a teammate of Messi. Such a noisy rule could have been pruned out if the information from an embedding method modelling "consistent top player" were used: with an embedding h_dembele that lacks this feature, φ(dembele, isCalled, G.O.A.T) = h⊤ M_r t is low, showing lower confidence than earlier. The two questions raised in the above examples motivate us to develop a cross feedback hybrid model for knowledge base reasoning.

IV. RELATED WORK
Here we give a brief overview of embedding learning and rule mining methods and review relevant hybrid models from the literature.
A. Knowledge graph embeddings
RESCAL [5], one of the earliest works in KGE, used a bilinear form for the scoring function inspired by matrix factorization. Specifically, f(h, r, t) = h⊤ M_r t, where entities h and t are represented by vectors in R^d and relations as matrices in R^{d×d} that model pairwise interactions. DistMult [15] simplifies RESCAL further by restricting M_r to be a diagonal matrix. This added simplicity works well for capturing symmetric relationships but poorly for others. TransE [6] introduces entity and relation specific embeddings and models relations as translations in vector space. ComplEx [7] is an extension of DistMult where embeddings lie in C^d. This accounts for the order of the entities and thus is able to model asymmetric relations. A newer method, RotatE [8], is able to model symmetry, asymmetry, inversion and composition of relations. It uses the concept of rotations in complex space as opposed to translations in real space. Concretely, f(h, r, t) = −‖h ∘ r − t‖ with the constraint |r_i| = 1, where h, r, t ∈ C^d and ∘ is the element-wise product. ConvE [16] employs a 2D convolutional neural network to model the score function, where the aim is to utilize the expressivity of multiple non-linear features in the architecture to better model relationships in the KG. Comprehensive surveys of methods for knowledge graph embeddings are given in [17] and [18].

B. Rule mining
These models assume that the existence of individual triples/facts can be inferred from observable features in the graph, usually in the form of logical rules. The extracted rules are then used to infer new facts. Mining rules from a KB has its roots in inductive logic programming (ILP) [19] and association rule mining [20] from the databases community. The use of declarative language biases [21] helps in restricting the large search space. Language biases like limiting rule length, requiring every atom in a rule to be transitively connected to the others (a.k.a. connected rules) and ensuring all variables in a rule are closed (i.e., each appears at least twice) offer a trade-off between the expressivity of rules and the size of the search space. WARMR [22] and its extension WARMER [9] are based on the APRIORI algorithm [23] and make use of a language bias where only conjunctive rules are mined. Sherlock [24], an unsupervised ILP system, learns first-order Horn rules and infers new facts using probabilistic graphical models. To prune the search space it uses two heuristics: statistical significance and statistical relevance. All the above approaches are designed to work under the closed world assumption of KBs due to their treatment of the rule quality measure. AMIE [14] and its efficient version AMIE+ [10] take into account the incompleteness of KBs by proposing a new quality measure based on the partial completeness assumption. Additionally, they mine Horn rules which are connected, closed, non-reflexive and monotonic (predicates in the rule body are all positive).

(Buffon, Ronaldo) and (Messi, Iniesta) are all G.O.A.T (see Fig. 1).

C. Hybrid Methods
The complementary strengths [25] of observed (rules, paths) and latent (embedding) KG features have given rise to a number of hybrid models in recent years. In [26], link prediction is cast as an integer linear programming problem in which the objective function comes from an embedding model and implication rules are used as constraints. This method is essentially a post-processing step which helps in inference but not in learning better embedding representations. Another paradigm of hybrid models is to perform some form of regularization of the embedding loss function using simple rules. In [27], implication rules are first naively extracted by iterating over all possible relation pairs, and then differentiable terms are added to the embedding objective function for each grounding of the extracted rules. The naive rule extraction process is costly, and the procedure leads to a large number of regularization terms, which does not scale. RUGE [28] uses t-norm based fuzzy logic to model the rules and uses inferred triplets to perform embedding rectification. In [29], non-negativity constraints for entity embeddings and entailment constraints for relation embeddings are explored. [30] enforce the subsumption property by having an equality constraint r_1 = r_2 − δ, where δ is a learnable non-negative vector that specifies how relation r_1 differs from r_2. Our work is closest to IterE [13], which is designed for improving embeddings of sparse entities. Seven ontology property axioms from the OWL2 Web Ontology Language are used to model relations in the KG. First a pool of valid axioms is generated by randomly sampling k triples and matching them to seven axiom templates. Then,
Fig. 2: Overall framework architecture.

inferred facts that relate to sparse entities help in learning better embeddings because they essentially provide extra information. We consider Horn rules, which are more expressive, and unlike them, we place no assumptions on the embedding score function. We improve embedding representations of both sparse and non-sparse entities. Different from constraint-based hybrid methods, our selective augmentation of inferred facts provides 1) structure to the embedding space and 2) extra information for sparse entities.

V. OUR APPROACH
Here, we introduce our hybrid model for learning Horn rules and vector space embeddings in a cross feedback paradigm. Initially, embedding learning is performed on the input KG, resulting in entity and relation embeddings. The learnt embeddings are then used to guide the rule mining system. Further, the extracted rules are materialized and new inferred triplets are sampled for learning embeddings in the next iteration. Fig. 2 shows the overall framework. In what follows, we describe the three main constituent parts: (1) embedding learning, (2) rule learning with embedding feedback and (3) importance sampling of triplets.
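At a glance, the three parts compose into a simple driver loop. The sketch below is illustrative only: the four callback names are hypothetical stand-ins for the components described in the following subsections, and Algorithm 1 gives the precise procedure.

```python
def hybrid_learn(kg, embed_step, mine_rules, sample_inferred, iters=10):
    """Cross feedback loop: embeddings guide rule mining, and sampled
    rule-inferred facts augment the triplet set for the next embedding
    round. All callbacks are hypothetical stand-ins for Secs. V-A to V-C."""
    triplets = set(kg)
    theta, rules = None, []
    for _ in range(iters):
        theta = embed_step(triplets)               # embedding learning
        rules = mine_rules(kg, theta)              # rule mining with feedback
        triplets |= sample_inferred(rules, theta)  # importance sampling
    return theta, rules

# Toy run with trivial stand-in callbacks.
kg = {("ronaldo", "playsFor", "juventus")}
theta, rules = hybrid_learn(
    kg,
    embed_step=lambda triplets: {"dim": 4, "n": len(triplets)},
    mine_rules=lambda kg, th: ["playsFor(X,Y), clubType(Y,Z) -> playsSport(X,Z)"],
    sample_inferred=lambda rules, th: {("ronaldo", "playsSport", "football")},
)
print(theta, rules)
```

The loop keeps mining against the original KG while the growing triplet set only feeds embedding learning, mirroring the framework in Fig. 2.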
A. Embedding Learning
We associate a label y_hrt with each triplet (h, r, t) to model its truth value. Labels of triplets in G_i^+ are set to 1. As there are no negative triplet examples, we generate G_i^- by negatively sampling G_i^+, and labels of triplets in G_i^- are set to 0. A score function φ(h, r, t; Θ) : E × R × E → R is used to measure the salience of a triplet (h, r, t). We further map the output of the score function to a continuous truth value in (0, 1) using the sigmoid function σ(z) = 1/(1 + exp(−z)):

ξ(h, r, t) = σ(φ(h, r, t; Θ))    (4)

The objective of embedding learning is to learn Θ_i for iteration i by minimizing the loss over triplets in G_i^+ and G_i^-. G_i = G_i^+ ∪ G_i^- represents our learning set for the current iteration i. Then,

min_{Θ_i} (1/|G_i|) Σ_{((h,r,t), y_hrt) ∈ G_i} L(ξ(h, r, t), y_hrt)    (5)

where L(x, y) = −y log(x) − (1 − y) log(1 − x) is the cross-entropy loss between x and y, and ξ(·) is the function defined in Eq. 4. It is important to note that in our method, learning is done on an extended set of rule-enriched triples. Also, our method does not depend on a specific scoring function or class of scoring functions, unlike IterE [13]. All we require is that the score function φ output a real-valued score. Table I shows the score functions proposed by various state-of-the-art methods in the literature.

B. Rule Mining with Embedding Feedback
In this step we aim to mine quality Horn rules from the set of initial triplets in G_0^+ using the current iteration embeddings Θ_i. Following AMIE+ [10], rules are modelled as a sequence of atoms, with the first atom as the head atom and the body atoms following it. The process of building a quality rule boils down to extending a partial rule sequence while carefully traversing the search space. Rule building is done by a set of mining operators that iteratively add atoms to partial rules until a termination criterion is met. To prune the search space during rule building, filtering criteria are utilized. We now explain the three sub-parts.
1) Rule Building:
Initially, all possible binary atoms using relations in R are held in a priority queue. These represent partial rule heads with empty bodies. At each iteration of the algorithm, a single partial rule is dequeued and checked for termination (defined below). If successful, it is output as a possible rule. If not, it is extended by each of the following operators:

• Add a new dangling atom (O_D), which uses a fresh variable for one argument and a shared variable/entity (used earlier in the rule) for the other.
• Add a new instantiated atom (O_I), which uses a shared variable/entity for one argument and an entity for the other.
• Add a new closing atom (O_C), which uses a shared variable/entity for both arguments.

The expansion produces multiple candidate rules. All candidate rules are checked to see if they can be pruned by the filtering criteria. Pruned ones are discarded and non-pruned ones are enqueued for the next iteration. The iterative algorithm is run until the queue is empty.
2) Filtering Criteria:
The application of mining operators to a rule produces a set of candidate rules. Since not all candidate rules are promising, this step aims to filter some of them. Ideally, filtering criteria should allow the generation of rules that (1) explain facts in the KG and (2) infer facts outside the observable KG while being consistent w.r.t. global "soft" KG properties. Classical measures of rule quality like standard confidence and PCA confidence, introduced by [14], are based on the observable KG. They work well in selecting rules that explain known facts. However, candidate rules often infer facts that are outside the known KG, and classical measures neglect them. We seek to use the learned embeddings as a proxy for measuring the quality of candidate rules which infer facts outside the known KG. This is based on the intuition that the embeddings Θ can capture global statistical patterns. To this end, given a candidate rule, we average the individual embedding scores of all its newly inferred atoms. We call this measure the embedding confidence (EC). Given a candidate rule τ, consider the set S_τ defined in Eq. 1. The unobserved facts predicted by instantiating τ are given by S_τ \ G_0^+. Then,

EC(τ) = (1/|S_τ \ G_0^+|) Σ_{(h,r,t) ∈ S_τ \ G_0^+} ξ(h, r, t)    (6)

where ξ(·) is the function defined in Eq. 4. EC ∈ (0, 1), as ξ(·) is designed to give confidence values in (0, 1). We consider the weighted average of the classical standard confidence SC (Eq. 3) and our embedding confidence EC (Eq. 6) as the final rule quality measure Q:

Q(τ) = (1 − ω) SC(τ) + ω EC(τ)    (7)

where the weight factor ω is a model hyper-parameter. Thus, for each candidate rule we check if there is an increase w.r.t. the rule quality measure Q, and if not, it is discarded. Before calculating the Q score for a candidate rule, we use the heuristic pruning strategy of AMIE+ [14] to discard non-interesting rules. This reduces the load of calculating Q scores. One quick statistical check we perform is to make sure a candidate rule covers more than 1% of known facts, using head coverage [14]. We also incorporate language biases like restricting the search to rules of length three in order to deal with the vast search space.

Termination Criteria: As the extension operators produce rules that are not necessarily closed, we follow [14] in outputting only closed and connected rules.
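On a toy rule, the quantities in Eqs. 4, 6 and 7 can be sketched as follows; the support counts and raw φ scores here are illustrative, not values from any real dataset.

```python
import math

def xi(phi_score):
    """Eq. 4: map a raw score to a truth value in (0, 1) via the sigmoid."""
    return 1.0 / (1.0 + math.exp(-phi_score))

def standard_confidence(support, body_groundings):
    """SC (Eq. 3): head groundings over body groundings."""
    return support / body_groundings

def embedding_confidence(phi_scores):
    """EC (Eq. 6): mean truth value of the rule's newly inferred facts."""
    return sum(xi(s) for s in phi_scores) / len(phi_scores)

def rule_quality(sc, ec, omega=0.5):
    """Q (Eq. 7): weighted average of SC and EC."""
    return (1 - omega) * sc + omega * ec

# A rule whose body matches 8 times, 6 of them with the head fact present,
# and whose three inferred-but-unobserved facts get raw scores from phi.
sc = standard_confidence(support=6, body_groundings=8)
ec = embedding_confidence([2.0, 1.5, -0.5])
print(rule_quality(sc, ec, omega=0.4))
```

A noisy rule like l_3 in Sec. III would receive low φ scores on its spurious inferences, dragging EC and hence Q down even when SC is high.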
C. Importance Sampling of Triplets
Once the rule mining system has extracted rules with our quality measure, we incorporate them in refining the embeddings for the next iteration. To do so, the first task is to select the top-K rules by ranking the extracted rules according to their quality measure Q, which is already calculated (Eq. 7). Let T represent the set of all the selected rules. Then, the newly inferred triplets from the rules in T are given by

G_T = (∪_{τ ∈ T} S_τ) \ G_0^+    (8)

where S_τ (Eq. 1) is the set of inferred triplets from rule τ. Next we augment the current positive triplet set G_i^+ with G_T to create G_{i+1}^+, which will be used for embedding learning in iteration i + 1. Instead of naively augmenting all of G_T with label 1, we use an importance sampling scheme that uses Θ_i from the current iteration i. Specifically, we sample triplets from the following distribution:

p((h, r, t) | G_T) = exp(β φ(h, r, t; Θ_i)) / Σ_{(h_j, r_j, t_j) ∈ G_T} exp(β φ(h_j, r_j, t_j; Θ_i))    (9)

where φ(·) is the embedding score function and β is the sampling temperature. The set of input triplets for the next embedding learning iteration is generated as shown in line 8 of Algorithm 1, where ψ represents the importance sampling strategy.

Algorithm 1
Hybrid Learning Procedure

Input: initial training knowledge graph (KG), embedding method φ
Output: final embeddings Θ and mined rules T
1:  G_0^+ ← {((h, r, t), y_hrt = 1) | (h, r, t) ∈ KG}
2:  T ← ∅
3:  Randomly initialize entity and relation embeddings Θ
4:  for i = 1 : M do
5:      G_{i−1}^- ← NegativeSampling(G_{i−1}^+)
6:      Θ_i ← EmbeddingLearning(G_{i−1}^+, G_{i−1}^-)    ▷ Eq. 5
7:      T ← RuleMining(G_0^+, Θ_i)
8:      G_i^+ = G_{i−1}^+ ∪ ψ(G_T)    ▷ Sec. V-C
9:  end for
10: return Θ_M, T

VI. EXPERIMENTS AND ANALYSIS
Data
We evaluate our proposed method on multiple widely used benchmark datasets. FB15k-237 [25], which has facts about movies, actors, awards, sports and sport teams, is a subset of FB15k [6] with no inverse relations. WN18RR [16], a subset of WordNet, is a KG describing lexical relations between words. YAGO3-10 [31] deals with descriptive attributes of people. We also consider sparse variants from [13] for one of our experiments. Statistics of all datasets are given in Table III. We use the original train/valid/test splits provided by the authors of each dataset and represent them as G_train, G_valid and G_test.

TABLE II: Benchmark results for link prediction. Comparison against state-of-the-art standalone KGE and hybrid methods. Results of [♠] are taken from [8]. YAGO3-10 results for methods with [♦] are taken from [28]. Results of [♣] are produced by us using code provided by the respective authors. All other results are taken from the corresponding original papers. Best scores are in bold, the runner-up is underlined and "∗" represents statistically significant improvements over RotatE (paired t-test).

Model          | FB15K-237 (MRR / Hit@1 / Hit@3 / Hit@10) | WN18RR (MRR / Hit@1 / Hit@3 / Hit@10) | YAGO3-10 (MRR / Hit@1 / Hit@3 / Hit@10)
TransE [♠,♦]   | 0.293 / 0.140 / 0.268 / 0.463 | 0.226 / – / – / 0.501 | 0.303 / 0.218 / 0.336 / 0.475
DistMult [♠]   | 0.241 / 0.155 / 0.263 / 0.419 | 0.430 / 0.390 / 0.440 / 0.490 | 0.340 / 0.240 / 0.380 / 0.540
ComplEx [♠]    | 0.247 / 0.158 / 0.275 / 0.428 | 0.440 / 0.410 / 0.460 / 0.510 | 0.360 / 0.260 / 0.400 / 0.550
ConvE          | 0.325 / 0.237 / 0.356 / 0.501 | 0.430 / 0.400 / 0.440 / 0.520 | 0.440 / 0.350 / 0.490 / 0.620
RotatE         | 0.338 / 0.241 / 0.375 / 0.533 | 0.476 / – / – / – | – / – / – / –
RUGE [♣,♦]     | 0.169 / 0.087 / 0.181 / 0.345 | 0.231 / 0.218 / 0.387 / 0.439 | 0.431 / 0.340 / 0.482 / 0.603
NNE-AER [♣]    | 0.317 / 0.183 / 0.294 / 0.478 | 0.431 / 0.412 / 0.437 / 0.467 | 0.390 / 0.310 / 0.419 / 0.597
pLogicNet      | 0.332 / 0.237 / 0.369 / 0.528 | 0.441 / 0.398 / 0.446 / 0.537 | – / – / – / –
Our method     | ∗ / ∗ / ∗ / ∗ | ∗ / ∗ / ∗ / ∗ | – / – / – / –

TABLE III: Statistics of Datasets.

Dataset | |E| | |R|
We consider the standard link prediction task following the protocol introduced by [6]. For a test triplet (h, r, t), we replace the head h with each entity e_i ∈ E and compute the score of (e_i, r, t) using φ. From the descending order of scores, the rank of the correct entity, i.e., h, is found. This gives us the head rank. Similarly, we run the same procedure for the tail t, which gives the tail rank. Finally, the average of head and tail ranks is used to report popular metrics like Mean Reciprocal Rank (MRR) and the percentage of predicted ranks within N, which is
Hit@N. Higher values of both metrics signify better results. Also, during the ranking process, we make sure that the replaced triplet does not exist in either the training, validation, or test set. This corresponds to the "filtered" setting in [6]. [32] propose a RANDOM protocol to be used when choosing among triplets that get the same score from the model. Following this protocol, when we generate the rankings and more than two entities receive the same score, we randomly pick one of them, because picking in any fixed order would be unfair, as argued by [32].

WordNet: https://wordnet.princeton.edu/
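A minimal sketch of the filtered ranking computation; the toy score dictionary stands in for a real φ, and for simplicity all scores are assumed distinct (ties would be broken with [32]'s RANDOM protocol).

```python
def filtered_rank(scores, correct, known):
    """Rank of `correct` among candidate entities, after removing other
    known true triplets (the 'filtered' setting). `scores` maps each
    candidate entity to its phi score for the fixed (?, r, t) query."""
    kept = {e: s for e, s in scores.items() if e == correct or e not in known}
    target = kept[correct]
    # Rank = 1 + number of kept candidates scoring strictly higher.
    return 1 + sum(1 for s in kept.values() if s > target)

# Toy example: entity 2 is correct; entity 0 scores higher but forms a
# known true triplet elsewhere in train/valid/test, so it is filtered out.
scores = {0: 0.9, 1: 0.4, 2: 0.7, 3: 0.8}
rank = filtered_rank(scores, correct=2, known={0})
print(rank, 1.0 / rank)   # rank and its reciprocal-rank contribution to MRR
```

Averaging the reciprocal head and tail ranks over all test triplets yields MRR; Hit@N counts the fraction of ranks that are at most N.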
Hyperparameter Setting
We fine-tune the hyper-parameters on the validation set G_valid, grid searching over the embedding dimension d, the batch size B, the embedding weight factor ω and the number K of top-ranked extracted rules. Initially, all embeddings are randomly initialized, and the cross feedback learning procedure is run with M set to 10 (see line 4 in Algo. 1). During each global iteration, embedding learning is run for k steps; we set k to 100. During rule mining, we mine rules of length at most 3 to keep a check on the search space. G_valid is used for early stopping and for obtaining the best model, on which G_test is evaluated.

A. Embedding Evaluation
We compare our model with state-of-the-art standalone knowledge graph embedding methods representing a variety of modelling approaches, viz. TransE [6], DistMult [15], ComplEx [7], ConvE [16] and RotatE [8]. The standalone models rely only on observed triplets in the KG and use no logical rules. We further compare with additional hybrid baselines, including RUGE [28] and NNE-AER [29], which make use of certain logical rules to constrain the vector space of Θ, and pLogicNet [33], which utilizes probabilistic logical rules in a Markov Logic Network (MLN) framework. For all baseline models we run (marked with ♣ in the result tables), optimal hyper-parameters are obtained using grid search. Table II shows the evaluation on the different datasets. We can see that our model outperforms both standalone and hybrid baselines by a large margin on the FB15K-237 and YAGO3-10 datasets. Both datasets have diverse relation patterns like composition, symmetry and inversion, and mined rules can infer rich triplets, which leads to increased data quality when augmented to the original KG. There is no statistically significant improvement for WN18RR because the number of rules mined from WN18RR is only a few tens, whereas for FB15K-237 and YAGO3-10 it is in the hundreds.

TABLE IV: Sparse KG results. Results of [♣] are produced by us using code provided by the respective authors. Other results are taken from [13]. Best scores are highlighted in bold and the runner-up is underlined.

Model          | FB15K-237-sparse (MRR / Hit@1 / Hit@3 / Hit@10) | WN18RR-sparse (MRR / Hit@1 / Hit@3 / Hit@10)
TransE         | 0.238 / 0.164 / 0.261 / 0.385 | 0.146 / 0.034 / 0.247 / 0.288
DistMult       | 0.204 / 0.128 / 0.226 / 0.362 | 0.255 / 0.238 / 0.260 / 0.225
ComplEx        | 0.197 / 0.120 / 0.217 / 0.354 | 0.259 / 0.246 / 0.262 / 0.286
RotatE [♣]     | 0.292 / 0.213 / 0.320 / 0.445 | 0.340 / 0.299 / 0.354 / 0.422
RUGE [♣]       | 0.241 / 0.172 / 0.267 / 0.393 | 0.145 / 0.199 / 0.263 / 0.292
NNE-AER [♣]    | 0.212 / 0.133 / 0.238 / 0.395 | 0.279 / 0.284 / 0.304 / 0.305
IterE + axioms | 0.247 / 0.179 / 0.262 / 0.392 | 0.274 / 0.254 / 0.281 / 0.314
Our method     | – / – / – / – | – / – / – / –
The number of rules mined is directly correlated with the number of relations in the dataset (see Table III). Moreover, the WN18RR dataset is less diverse and mainly made up of relations that conform to the symmetry pattern. Constraint-based hybrid approaches perform poorly across datasets as they do not model all rule patterns. We also perform a paired t-test between our method and the best baseline, i.e., RotatE; the improvements are statistically significant (see Table II).

B. Rule Evaluation

Precision measures the ability of the rules to infer true facts beyond the train set. Mathematically, given the test set G_test, the precision of a set of rules T mined using G_train is

precision(T) = |G_T ∩ G_test| / |G_T|    (10)

where G_T, defined in Eq. 8, is the set of all newly inferred triplets from the set of rules T. Using this metric, we compare against AMIE+ [10]. For better understanding, we compare the top-K rules that are output. Results (Table V) show that our embedding confidence measure, when combined with the standard filtering techniques used by AMIE+, leads to higher quality rules. We conclude that feedback from embeddings that carry "global" KG pattern information is useful when mining rules.

TABLE V: Quantitative evaluation of the learned top-K rules using precision (Eq. 10). AMIE+ uses PCA confidence while our method utilizes embedding confidence (Eq. 7) for measuring rule quality.

top-K | FB15K-237 (AMIE+ / Our Method) | WN18RR (AMIE+ / Our Method)
10    | 0.357 / – | – / –
20    | 0.422 / – | – / –
50    | 0.585 / – | – / –
100   | 0.613 / – | – / –
200   | 0.679 / – | – / –
500   | 0.384 / – | – / –
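Eq. 10 on toy triplet sets; the facts below are illustrative stand-ins for G_T and G_test.

```python
def rule_precision(inferred, test):
    """Precision (Eq. 10): fraction of newly inferred triplets that
    actually appear in the held-out test set."""
    if not inferred:
        return 0.0
    return len(inferred & test) / len(inferred)

g_t = {("messi", "playsSport", "football"),
       ("dembele", "playsSport", "football"),
       ("dembele", "isCalled", "goat")}
g_test = {("messi", "playsSport", "football"),
          ("dembele", "playsSport", "football")}
print(rule_precision(g_t, g_test))   # 2 of 3 inferred facts are true
```

Here the spurious l_3-style inference for Dembele is exactly what lowers the precision of a rule set.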
C. Training with Different Score Functions
One may argue that our boost in performance comes from the particular score function we used for embedding learning (i.e., RotatE). We show experimentally in Table VI that our cross feedback learning approach improves performance when used with different scoring functions. The point we make here is that our rich data augmentation approach brings in deterministic structure, leading to better embedding representations irrespective of the scoring function used.

TABLE VI: Performance of our cross feedback paradigm with different embedding learning methods (i.e., φ is from Table I). The second row for each embedding method, in brackets, reports performance without our paradigm.

Embedding Method | FB15K-237 (MRR / Hit@10) | YAGO3-10 (MRR / Hit@10)
TransE           | – / – | – / –
                 | (0.293) / (0.42) | (0.303) / (0.475)
ComplEx          | – / – | – / –
                 | (0.247) / (0.428) | (0.417) / (0.603)
RotatE           | – / – | – / –
                 | (0.338) / (0.533) | (0.495) / (0.670)
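The plug-and-play claim rests on φ only needing to emit a real-valued score. As a sketch, two of the Table I score functions can sit behind one interface (the numpy vectors here are toy values, not learned embeddings):

```python
import numpy as np

def phi_transe(h, r, t):
    """TransE from Table I: -||h + r - t||."""
    return float(-np.linalg.norm(h + r - t))

def phi_distmult(h, r, t):
    """DistMult from Table I: the trilinear product <h, r, t>."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(1)
h, r, t = rng.normal(size=(3, 4))
for phi in (phi_transe, phi_distmult):
    print(phi(h, r, t))   # either function can feed Eqs. 4, 6 and 9
```

Any such function can be dropped into ξ (Eq. 4), EC (Eq. 6) and the importance sampler (Eq. 9) without changing the rest of the framework.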
D. Sparse KG Evaluation
Here we compare our model against IterE [13] in the sparse KG setting, using the same datasets provided by the authors of IterE. In the sparse versions of FB15K-237 and WN18RR, only entities with sparsity (Eq. 11) greater than 0.995 are allowed in the validation (G_valid) and test (G_test) sets, while the training sets (G_train) remain unchanged. Following [13], the sparsity of an entity e is defined as

    sparsity(e) = 1 − (freq(e) − freq_min) / (freq_max − freq_min)    (11)

where freq(e) is the number of triples in G_train in which entity e participates, and freq_min and freq_max are the minimum and maximum such frequencies over all entities in E.

These datasets provide a way to evaluate whether our model can improve the embedding representations of sparse entities. The results (Table IV) show that our method significantly improves the scores on all metrics for both sparse versions compared to the baselines. Using embeddings to guide the search space of the rule mining system, instead of the naive fixed-k pruning strategy used by IterE, together with our importance sampling mechanism, leads to effective utilization of the generalizing power of rules. IterE is tied to a class of embedding methods that assume a linear map between subject and object entities, while our model places no such assumption. RotatE, the state-of-the-art standalone model, also suffers when reasoning about sparse entities. Although RUGE and NNE-AER do better than some standalone methods, they fall short compared to our method.

VII. ADDITIONAL ANALYSIS
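The sparsity measure of Eq. (11) above can be sketched in a few lines (a minimal illustration; the function and variable names are ours):

```python
# Entity sparsity (Eq. 11): entity frequencies over the training triples are
# min-max normalised and inverted, so rarely-seen entities score close to 1.
from collections import Counter

def entity_sparsity(train_triples):
    freq = Counter()
    for h, _, t in train_triples:
        freq[h] += 1
        freq[t] += 1
    f_min, f_max = min(freq.values()), max(freq.values())
    span = (f_max - f_min) or 1  # guard the degenerate all-equal case
    return {e: 1 - (f - f_min) / span for e, f in freq.items()}

triples = [("a", "r", "b"), ("a", "r", "c"), ("a", "r", "d"), ("b", "r", "c")]
s = entity_sparsity(triples)
print(s["a"], s["d"])  # 0.0 1.0 -- the most frequent entity is dense, the rarest maximally sparse
```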
A. Computational Complexity
With regard to embedding learning, our approach has the same complexity as the chosen score function. If RotatE is used, the space complexity is O(d(|E| + |R|)), which scales linearly with the number of entities, the number of relations, and the embedding dimension. Each iteration of our learning procedure has a time complexity of O(b + kbd + n̄d) for negative sampling, embedding learning, and importance sampling combined, where b is the batch size, d the embedding dimension, k the number of inner embedding learning epochs (used by Eq. 5), and n̄ the average size of G_T (see Eq. 8) per iteration. As k is usually set to a small number and the average number of entailments is typically smaller than the batch size (i.e., n̄ < b), our time complexity is on par with conventional KGE methods, whose complexity is O(bd). Note that the above embedding time complexity does not depend on the size of the input graph G_i at iteration i, because training is done with SGD in minibatch mode. Running rule mining with embedding guidance at every global iteration adds further cost, but AMIE+ [10] is known to be very efficient thanks to its in-memory database for storing the KG and its various optimizations for pruning the large search space.

B. Iterative Learning Profile
Here we assess how performance (e.g., Hit@10, MRR) varies with the number of global iterations. Fig. 3 shows our findings for FB15K-237 and WN18RR. Large gains are observed after two iterations, and performance continues to increase until it levels off around iteration eight. These plots indicate that sampling triplets from rule mining and augmenting them for embedding learning has a positive effect.
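The iteration profile above corresponds to the outer loop of our paradigm. Purely as a schematic sketch, the cross-feedback loop looks like the following; every helper here is a toy placeholder standing in for the components described in the paper, not the actual implementation:

```python
# Schematic of one cross-feedback run: each global iteration trains embeddings,
# mines rules under embedding guidance, infers new triplets (G_T), and
# importance-samples some of them back into the training set.
# All helpers are toy placeholders, not the real components.

def train_embeddings(triples, epochs=1):
    return {"trained_on": len(triples)}        # stand-in for a KGE model

def mine_rules(triples, model):
    return [("rule", len(triples))]            # stand-in for embedding-guided mining

def infer_triplets(rules, triples):
    return {("new", "r", str(len(triples)))}   # stand-in for rule entailment (G_T)

def importance_sample(inferred, model, k=1):
    return set(list(inferred)[:k])             # stand-in for embedding-weighted sampling

def cross_feedback(train, global_iters=3):
    train = set(train)
    model = None
    for _ in range(global_iters):
        model = train_embeddings(train)
        rules = mine_rules(train, model)
        inferred = infer_triplets(rules, train)
        train |= importance_sample(inferred, model)  # augment the train set
    return model, train

model, augmented = cross_feedback({("a", "r", "b")}, global_iters=3)
print(len(augmented))  # 4: one sampled triplet is added per global iteration
```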
C. Case Study
We give some qualitative examples from the FB15K-237 dataset to demonstrate the effectiveness of our model. First, consider Table VII. It shows several test triplets from G_test and their corresponding head and tail rank changes when comparing RotatE [8] as the baseline with our method. The relevant rule that provides feedback to embedding learning in our framework is also shown for each test triplet.

Fig. 3: Link Prediction Performance vs. Global Iterations

As an example, take the first test triplet (Louis Armstrong, /people/person/nationality, United States). Because Louis Armstrong is a sparse entity with sparsity 0.995, traditional embedding methods suffer in the head prediction task, i.e., asking (?, /people/person/nationality, United States), essentially because the embedding representations of sparse entities are not informative. Concretely, RotatE gives a filtered rank of 1154. Compare this with our method, which utilizes the rule saying that musicians have the same nationality as the country of the town they are originally from. Having V1 as New Orleans satisfies the rule for the test triplet, and our method improves the subject prediction rank to 68, a gain of 1086. For the tail prediction task, i.e., (Louis Armstrong, /people/person/nationality, ?), baseline methods perform well because the entity in question, United States, is not sparse. The other examples show a similar qualitative trend, demonstrating that feedback from relevant rules improves the embedding representations of entities, especially sparse ones.

Next, we show qualitatively how rule quality can be improved with embedding guidance through our embedding confidence measure (Eq. 6). Table VIII shows three different rule bodies mined by AMIE+ on FB15K-237, all of which result in the same head atom, /olympic_sport./affiliation_country(V0, V1), which means V0 is an olympic sport that has representation for country V1. An example triplet in G_train is (Triathlon,
TABLE VII: Qualitative results showing the link prediction rank change of entities in test triplets, comparing our method against the RotatE baseline. The associated rule used for providing feedback is shown for each test triplet.

  Test triplet: (Louis Armstrong, /people/person/nationality, United States)
  Rule: /music/artist/origin(V0, V1) ∧ /location/administrative_division/country(V1, V2) → /people/person/nationality(V0, V2)
  Rank change: head: 1154 → 68 (+1086); tail: 1 →

  Test triplet: (15 Minutes, /film/country, United States)
  Rule: /film/production_companies(V0, V1) ∧ /organization/headquarters./location/country(V1, V2) → /film/country(V0, V2)
  Rank change: head: 314 → 12 (+302); tail: 1 →

  Test triplet: (Poland, /location/second_level_divisions, Szczecin)
  Rule: /bibs_location/country(V1, V0) ∧ /location/county_place(V1, V2) → /location/second_level_divisions(V0, V2)
  Rank change: head: 1 →

  Test triplet: (Fort Lauderdale, /location/location/time_zones, Eastern Time Zone)
  Rule: /location/location/contains(V1, V0) ∧ /location/location/time_zones(V1, V2) → /location/location/time_zones(V0, V2)
  Rank change: head: 414 → 45 (+369); tail: 1 →

TABLE VIII: Qualitative results comparing standard and embedding confidence scores for three candidate rules inferring the same head atom.
  Rule 1: /olympic_sport./affiliation_country(V0, V2) ∧ /location/import_and_exports(V2, V1) → /olympic_sport./affiliation_country(V0, V1)
          Support: 548    Predictions: 1723    Standard conf.: 0.318    Embedding conf.: 0.207
  Rule 2: /olympic_sport./affiliation_country(V0, V2) ∧ /location./adjoins_location(V2, V1) → /olympic_sport./affiliation_country(V0, V1)
          Support: 1524   Predictions: 4747    Standard conf.: 0.321    Embedding conf.: 0.404
  Rule 3: /sport./pro_olympic_athlete(V0, V2) ∧ /person./nationality(V2, V1) → /olympic_sport./affiliation_country(V0, V1)
          Support: 588    Predictions: 1792    Standard conf.: 0.328    Embedding conf.: 0.701

TABLE IX: TransE with different negative sampling techniques.
                        FB15K-237           WN18RR
                        MRR     Hit@10      MRR     Hit@10
  uniform               0.241   0.422       0.186   0.459
  KBGAN                 0.278   0.453       0.210   0.479
  self-adversarial      0.298   0.475       0.223   0.510
  uniform + our method  0.435   0.545       0.234   0.515

/olympic_sport/affiliation_country, France). Rules one and two say that an olympic sport V0 has representation from country V1 if there is another country V2 that represents V0 and either adjoins V1 or trades with it. Both rules obtain very similar standard confidence measures because their ratios of rule support to number of predictions are alike, so choosing the more relevant rule is not clear. Compare this with rule three, which says an olympic sport V0 has representation from country V1 if there is a professional olympic athlete V2 who plays V0 and has nationality V1. This rule is more apt than the earlier ones, yet it still gets a similar standard confidence of 0.328. Now contrast this with the embedding confidence scores of the three rules. Rule three gets a much better EC score than rules one and two because most of its instantiations receive high embedding scores (see Eq. 6). This shows that embedding confidence, which models rules probabilistically, is helpful when used in conjunction with standard confidence to assess rule quality during mining.

D. Connections to Negative Sampling
The results from the previous section prompted us to view our paradigm through the lens of negative sampling. At each iteration, when newly inferred triplets are introduced, the underlying uniform negative sampling generates higher-quality negatives simply because of the superior positives that are augmented. We further show (Table IX) how our method stacks up against other negative sampling methods: uniform, KBGAN [34], and self-adversarial [8]. For all methods, including ours, TransE [6] is used as the embedding method for a fair comparison. The results indicate that the augmented inferred triplets invariably lead to better-quality negatives, thus improving the overall training set on which the embeddings are learned. In our opinion, this also helps explain the significant boost in performance for models like TransE and ComplEx, which are in fact incapable of modelling symmetry and composition patterns in data [8].

VIII. CONCLUSION
In this work, we have developed a hybrid method that utilizes the complementary properties of rules and embeddings. The experimental results empirically support the two main claims we raised: (1) structure and richer data quality in training result in superior embedding representations, and (2) incorporating "global" KG statistical patterns into rule mining leads to more reliable rules. We extensively evaluated our approach with varied experiments and showed its effectiveness. The connections to negative sampling motivate us to investigate the framework more deeply in future work, both for possible theoretical claims and for developing other general hybrid models that unify different learning schemes.

ACKNOWLEDGMENT
This research is supported by NSF under contract numbers CCF-1918483, IIS-1618690, and CCF-0939370.

REFERENCES

[1] Freebase, Freebase Data Dumps, accessed June 5, 2020. [Online]. Available: https://developers.google.com/freebase/
[2] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: a core of semantic knowledge," in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 697–706.
[3] Wikidata, Main Page. [Online].
[4] Google Knowledge Graph. [Online].
[5] M. Nickel, V. Tresp, and H.-P. Kriegel, "A three-way model for collective learning on multi-relational data," in International Conference on Machine Learning, 2011.
[6] A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in Advances in Neural Information Processing Systems, 2013, pp. 2787–2795.
[7] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in International Conference on Machine Learning, 2016, pp. 2071–2080.
[8] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, "RotatE: Knowledge graph embedding by relational rotation in complex space," arXiv preprint arXiv:1902.10197, 2019.
[9] B. Goethals and J. Van den Bussche, "Relational association rules: getting WARMeR," in Pattern Detection and Discovery. Springer, 2002, pp. 125–139.
[10] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek, "Fast rule mining in ontological knowledge bases with AMIE+," The VLDB Journal, vol. 24, no. 6, pp. 707–730, 2015.
[11] R. Jenatton, N. L. Roux, A. Bordes, and G. R. Obozinski, "A latent factor model for highly multi-relational data," in Advances in Neural Information Processing Systems, 2012, pp. 3167–3175.
[12] J. Pujara, E. Augustine, and L. Getoor, "Sparsity and noise: Where knowledge graph embeddings fall short," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1751–1756.
[13] W. Zhang, B. Paudel, L. Wang, J. Chen, H. Zhu, W. Zhang, A. Bernstein, and H. Chen, "Iteratively learning embeddings and rules for knowledge graph reasoning," in The World Wide Web Conference. ACM, 2019, pp. 2366–2377.
[14] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek, "AMIE: association rule mining under incomplete evidence in ontological knowledge bases," in Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 413–422.
[15] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," arXiv preprint arXiv:1412.6575, 2014.
[16] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2D knowledge graph embeddings," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[17] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, "A review of relational machine learning for knowledge graphs," Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.
[18] Q. Wang, Z. Mao, B. Wang, and L. Guo, "Knowledge graph embedding: A survey of approaches and applications," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017.
[19] S. Muggleton and L. De Raedt, "Inductive logic programming: Theory and methods," The Journal of Logic Programming, vol. 19, pp. 629–679, 1994.
[20] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," in ACM SIGMOD Record, vol. 22, no. 2. ACM, 1993, pp. 207–216.
[21] H. Adé, L. De Raedt, and M. Bruynooghe, "Declarative bias for specific-to-general ILP systems," Machine Learning, vol. 20, no. 1–2, pp. 119–154, 1995.
[22] L. Dehaspe and H. Toivonen, "Discovery of relational association rules," in Relational Data Mining. Springer, 2001, pp. 189–212.
[23] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, 1996, pp. 307–328.
[24] S. Schoenmackers, O. Etzioni, D. S. Weld, and J. Davis, "Learning first-order horn clauses from web text," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1088–1098.
[25] K. Toutanova and D. Chen, "Observed versus latent features for knowledge base and text inference," in Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015, pp. 57–66.
[26] Q. Wang, B. Wang, and L. Guo, "Knowledge base completion using embeddings and rules," in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[27] T. Rocktäschel, S. Singh, and S. Riedel, "Injecting logical background knowledge into embeddings for relation extraction," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1119–1129.
[28] S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo, "Knowledge graph embedding with iterative guidance from soft rules," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[29] B. Ding, Q. Wang, B. Wang, and L. Guo, "Improving knowledge graph embedding using simple constraints," arXiv preprint arXiv:1805.02408, 2018.
[30] B. Fatemi, P. Taslakian, D. Vazquez, and D. Poole, "Knowledge hypergraphs: Extending knowledge graphs beyond binary relations," arXiv preprint arXiv:1906.00137, 2019.
[31] F. Mahdisoltani, J. Biega, and F. M. Suchanek, "Yago3: A knowledge base from multilingual wikipedias," 2015.
[32] Z. Sun, S. Vashishth, S. Sanyal, P. Talukdar, and Y. Yang, "A re-evaluation of knowledge graph completion methods," arXiv preprint arXiv:1911.03903, 2019.
[33] M. Qu and J. Tang, "Probabilistic logic neural networks for reasoning," in Advances in Neural Information Processing Systems, 2019, pp. 7710–7720.
[34] L. Cai and W. Y. Wang, "KBGAN: Adversarial learning for knowledge graph embeddings," arXiv preprint arXiv:1711.04071, 2017.