HyperKG: Hyperbolic Knowledge Graph Embeddings for Knowledge Base Completion
Prodromos Kolyvakis, Alexandros Kalousis, Dimitris Kiritsis
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Business Informatics Department, University of Applied Sciences Western Switzerland (HES-SO), Carouge, Switzerland
Abstract
Learning embeddings of entities and relations existing in knowledge bases allows the discovery of hidden patterns in data. In this work, we examine the contribution of the geometrical space to the task of knowledge base completion. We focus on the family of translational models, whose performance has been lagging, and propose a model, dubbed HyperKG, which exploits the hyperbolic space in order to better reflect the topological properties of knowledge bases. We investigate the type of regularities that our model can capture and we show that it is a prominent candidate for effectively representing a subset of Datalog rules. We empirically show, using a variety of link prediction datasets, that hyperbolic space allows us to narrow down significantly the performance gap between translational and bilinear models.
Introduction
Learning in the presence of structured information is an important challenge for artificial intelligence (Muggleton and De Raedt 1994; Richardson and Domingos 2006; Getoor and Taskar 2007). Knowledge Bases (KBs) such as WordNet (Miller 1998), Freebase (Bollacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum 2007) and DBpedia (Lehmann et al. 2015) constitute valuable such resources needed for a plethora of practical applications, including question answering and information extraction. However, despite their formidable number of facts, it is widely accepted that their coverage is still far from complete (West et al. 2014).

This shortcoming has opened the door for a number of studies addressing the problem of automatic knowledge base completion (KBC) or link prediction (Nickel et al. 2016). The impetus of these studies arises from the hypothesis that statistical regularities lie in KB facts, which, when correctly exploited, can result in the discovery of missing true facts (Xie et al. 2017). Building on the great generalisation capability of distributed representations, a long line of research (Nickel, Tresp, and Kriegel 2011; Bordes et al. 2013; Yang et al. 2015; Nickel, Rosasco, and Poggio 2016; Trouillon et al. 2016) has focused on learning KB vector space embeddings as a way of predicting the plausibility of a fact.

An intrinsic characteristic of knowledge graphs is that they present power-law (or scale-free) degree distributions, as do many other networks (Faloutsos, Faloutsos, and Faloutsos 1999; Steyvers and Tenenbaum 2005). In an attempt to understand the properties of scale-free networks, various generative models have been proposed, such as the models of Barabási and Albert (1999) and Van Der Hofstad (2009). Interestingly, Krioukov et al. (2010) have shown that scale-free networks naturally emerge in the hyperbolic space. Recently, hyperbolic geometry was exploited in various works (Nickel and Kiela 2017; 2018; Ganea, Becigneul, and Hofmann 2018; Sala et al. 2018) as a means to provide high-quality embeddings for hierarchical structures. Hyperbolic space has the potential to bring significant value to the task of KBC since it offers a natural way to take the KB's topological information into account. Furthermore, many of the relations appearing in KBs lead to hierarchical and hierarchical-like structures (Li et al. 2016).

At the same time, the expressiveness of various KB embedding models has recently been examined in terms of their ability to express any ground truth of facts (Kazemi and Poole 2018; Wang, Gemulla, and Li 2018). Moreover, Gutiérrez-Basulto and Schockaert (2018) have proceeded one step further and investigated the compatibility between ontological axioms and different types of KB embeddings. Specifically, the authors have proven that a certain family of rules, the quasi-chained rules, which form a subset of Datalog rules (Abiteboul, Hull, and Vianu 1995), can be exactly represented by a KB embedding model whose relations are modelled as convex regions, ensuring, thus, logical consistency in the facts induced by this KB embedding model. In the light of this result, it seems important that the appropriateness of a KB embedding model should not only be measured in terms of full expressiveness but also in terms of the rules that it can model.

In this paper, we examine whether building models that better reflect KBs' topological properties and rules brings performance improvements for KBC. We focus on the family of translational models (Bordes et al. 2013), which attempt to model the statistical regularities as vector translations between entities' vector representations, and whose performance has been lagging. We extend the translational models by learning embeddings of KB entities and relations in the Poincaré-ball model of hyperbolic geometry. We do so by learning compositional vector representations (Mitchell and Lapata 2008) of the entities appearing in a given fact based on translations.
The implausibility of a fact is measured in terms of the hyperbolic distance between the compositional vector representations of its entities and the learned relation vector. We prove that the relation regions captured by our proposed model are convex. Our model becomes, thus, a prominent candidate for effectively representing quasi-chained rules. Among our contributions is the proposal of a novel KB embedding model as well as a regularisation scheme on the Poincaré-ball model, whose effectiveness we demonstrate empirically. Furthermore, we prove that translational models do not suffer from the restrictions identified by Kazemi and Poole (2018) in the case where a fact is considered valid when its implausibility score is below a certain non-zero threshold. Finally, we evaluate our approach on various benchmark datasets. Our experimental results show that our approach makes a big step towards closing the performance gap between translational and bilinear models, demonstrating that the choice of geometrical space plays a significant role for KBC and illustrating the importance of taking into account both the topological and the formal properties of KBs.
Related Work
Shallow KB Embedding Models.
There has been a great line of research dedicated to the task of learning distributed representations for entities and relations in KBs. To constrain the analysis, we only consider shallow embedding models that do not exploit deep neural networks or incorporate additional external information beyond the KB facts. For an elaborate review of these techniques, please refer to (Nickel et al. 2016; Wang et al. 2017). We exclude from our comparison recent work that explores different types of training regimes, such as adversarial training, and/or the inclusion of reciprocal facts (Cai and Wang 2018; Sun et al. 2019; Kazemi and Poole 2018; Lacroix, Usunier, and Obozinski 2018), to make the analysis less biased to factors that could overshadow the importance of the geometrical space.

In general, the shallow embedding approaches can be divided into two main categories: the translational (Bordes et al. 2013) and the bilinear (Nickel, Tresp, and Kriegel 2011) family of models. In the translational family, the vast majority of models (Wang et al. 2014; Ji et al. 2015; Xiao, Huang, and Zhu 2016; Ebisu and Ichise 2018) generalise TransE (Bordes et al. 2013), which attempts to model relations as translation operations between the vector representations of the subject and object entities, as observed in a given fact. In the bilinear family, most of the approaches (Yang et al. 2015; Nickel, Rosasco, and Poggio 2016; Trouillon et al. 2016) generalise RESCAL (Nickel, Tresp, and Kriegel 2011), which proposes to model facts through bilinear operations over entity and relation vector representations. In this paper, we focus on the family of translational models, whose performance has been lagging, and propose extensions in the hyperbolic space which, by exploiting the topological and the formal properties of KBs, bring significant performance improvements.
Hyperbolic Embeddings.
There has been a growing interest in embedding scale-free networks in the hyperbolic space (Boguñá, Papadopoulos, and Krioukov 2010; Papadopoulos, Aldecoa, and Krioukov 2015). The majority of these approaches are based on maximum likelihood estimation, maximising the likelihood of the network's topology given the embedding model (Papadopoulos, Aldecoa, and Krioukov 2015). Additionally, hyperbolic geometry was exploited in various works as a way to exploit hierarchical information and learn more efficient representations. Hyperbolic embeddings have been applied to a great variety of machine learning and NLP applications, such as taxonomy reconstruction, link prediction and lexical entailment (Nickel and Kiela 2017; 2018; Ganea, Becigneul, and Hofmann 2018; Sala et al. 2018).

Recently, and in parallel to our work, Balažević, Allen, and Hospedales (2019) studied the problem of embedding KBs in the hyperbolic space. Similarly to our approach, the authors extend the family of translational models to the hyperbolic space, demonstrating significant advancements over the state-of-the-art. However, the authors exploit both the hyperbolic as well as the Euclidean space by using the Möbius matrix-vector multiplication and Euclidean scalar biases. Unlike our experimental setup, the authors also include reciprocal facts. Although their approach is beneficial, it becomes hard to quantify the contribution of the hyperbolic space. This is verified by the fact that their Euclidean model analogue performs in line with their "hybrid" hyperbolic-Euclidean model. Finally, the authors do not study the expressiveness of their proposed model.
Methods
Preliminaries
We introduce some definitions and additional notation that we will use throughout the paper. We denote the vector concatenation operation by the symbol ⊕ and the inner product by ⟨·, ·⟩. We define the rectifier activation function as [·]₊ = max(·, 0).

Quasi-chained Rules.
Let E, N and V be disjoint sets of entities, (labelled) nulls and variables, respectively. Let R be the set of relation symbols. A term t is an element in E ∪ N ∪ V; an atom α is an expression of the form R(t₁, t₂), where R is a relation between the terms t₁, t₂. Let terms(α) := {t₁, t₂} and vars(α) := terms(α) ∩ V. Let B₁, …, Bₙ and H₁, …, Hₖ be atoms with terms in E ∪ V, and let X₁, …, Xⱼ ∈ V. A quasi-chained (QC) rule σ (Gutiérrez-Basulto and Schockaert 2018) is an expression of the form:

B₁ ∧ … ∧ Bₙ → ∃X₁, …, Xⱼ. H₁ ∧ … ∧ Hₖ,   (1)

where for all i, 1 ≤ i ≤ n:

|(vars(B₁) ∪ … ∪ vars(B_{i−1})) ∩ vars(Bᵢ)| ≤ 1.

The QC rules constitute a subset of Datalog rules. A database D is a finite set of facts, i.e., a set of atoms with terms in E. A knowledge base (KB) K consists of a pair (Σ, D), where Σ is an ontology whose axioms are QC rules and D is a database. Only existential variables can be mapped to labelled nulls. It should be noted that no constraint is imposed on the number of available axioms in the ontology. The ontology could be minimal in the sense of only defining the relation symbols. However, any type of rule, whether it is the product of the ontological design or results from formalising a statistical regularity, should belong to the family of QC rules. The Gene Ontology (Ashburner et al. 2000) constitutes one notable example of an ontology that exhibits QC rules.
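The QC condition on rule bodies can be checked mechanically. Below is a minimal sketch; the tuple encoding of atoms and the uppercase-string convention for variables are our own illustrative assumptions, not the paper's notation:

```python
def is_quasi_chained(body):
    """Check the QC body condition: each atom shares at most one variable
    with the union of the preceding atoms' variables. Atoms are encoded as
    (relation, term, term) tuples; variables are uppercase strings."""
    def vars_of(atom):
        return {t for t in atom[1:] if t.isupper()}
    seen = set()
    for atom in body:
        if len(seen & vars_of(atom)) > 1:
            return False
        seen |= vars_of(atom)
    return True

# Body of the rule is_a(X, Y) ∧ part_of(Y, Z) → part_of(X, Z) used later in
# the Experiments section: QC, since the atoms chain through the single
# shared variable Y.
print(is_quasi_chained([("is_a", "X", "Y"), ("part_of", "Y", "Z")]))  # True
# A body whose second atom shares two variables with the first is not QC:
print(is_quasi_chained([("r", "X", "Y"), ("s", "X", "Y")]))           # False
```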
Circular Permutation Matrices.
An orthogonal matrix is a square real matrix whose columns and rows are orthonormal vectors, i.e.,

Q^T Q = Q Q^T = I,   (2)

where I is the identity matrix. Orthogonal matrices preserve the vector inner product and, thus, they also preserve the Euclidean norms. For 0 ≤ i < n, we define the circular permutation matrix Πᵢ to be the orthogonal n × n matrix associated with the following circular permutation of an n-dimensional vector x:

(x₁, …, x_{n−i}, x_{n−i+1}, …, xₙ) ↦ (x_{i+1}, …, xₙ, x₁, …, xᵢ),   (3)

where xᵢ is the i-th coordinate of x and i controls the number of n − i successive circular shifts.

Hyperbolic Space.
Although multiple equivalent models of hyperbolic space exist, we will only focus on the Poincaré-ball model. The Poincaré-ball model is the Riemannian manifold Pⁿ = (Bⁿ, d_p), where Bⁿ = {x ∈ ℝⁿ : ‖x‖ < 1} and d_p is the distance function (Nickel and Kiela 2017):

d_p(u, v) = acosh(1 + 2 δ(u, v)),   (4)
δ(u, v) = ‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²)).

The Poincaré-ball model presents a group-like structure when it is equipped with the
Möbius addition (Ungar 2012), defined by:

u +′ v := ((1 + 2⟨u, v⟩ + ‖v‖²) u + (1 − ‖u‖²) v) / (1 + 2⟨u, v⟩ + ‖u‖² ‖v‖²).   (5)

The isometries of (Bⁿ, d_p) can be expressed as a composition of a left gyrotranslation with an orthogonal transformation restricted to Bⁿ, where the left gyrotranslation is defined as L_u : v ↦ u +′ v (Ahlfors 1975; Rassias and Suksumran 2019). Therefore, circular permutations constitute zero-left-gyrotranslation isometries of the Poincaré-ball model.

HyperKG
Figure 1: A visualisation of the HyperKG model in the P² space. The geodesics of the disk model are circles perpendicular to its boundary. The zero-curvature geodesic passing through the origin corresponds to the line ϵ: y − x = 0 in the Euclidean plane. Reflections over the line ϵ are equivalent to Π₁ permutations in the plane. s, Π₁o, s + Π₁o are the subject vector, the permuted object vector and the composite term vector, respectively. g(r₁), g(r₂) denote the geometric loci of term vectors satisfying relations R₁, R₂, with relation vectors r₁, r₂. t₁, t₂, t₃ are valid term vectors for the relation R₁.

The database of a KB consists of a set of facts of the form R(subject, object). We will learn hyperbolic embeddings of entities and relations such that valid facts have a lower implausibility score than invalid ones. To learn such representations, we extend the work of Bordes et al. (2013) and define a translation-based model in the hyperbolic space, embedding, thus, both entities and relations in the same space.

Let s, r, o ∈ Bⁿ be the hyperbolic embeddings of the subject, relation and object, respectively, appearing in the fact R(subject, object). We define a term embedding as a function ξ : Bⁿ × Bⁿ → Bⁿ that creates a composite vector representation for the pair (subject, object). Since our motivation is to generalise the translational models to the hyperbolic space, a natural way to define the term embeddings is by using the Möbius addition. However, we found empirically that the normal addition in the Euclidean space generalises better than the Möbius addition. To introduce non-commutativity in the term composition function, we use a circular permutation matrix to project the object embeddings. Non-commutativity is important because it allows modelling asymmetric relations with compositional representations (Nickel, Rosasco, and Poggio 2016).
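The geometric building blocks discussed so far, the Poincaré distance of Equation (4), the Möbius addition of Equation (5) and the norm-preserving circular permutation, can be sketched as follows. This is a minimal NumPy sketch with illustrative vectors, not the authors' implementation:

```python
import numpy as np

def poincare_dist(u, v):
    """Poincaré-ball distance, Equation (4)."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def mobius_add(u, v):
    """Möbius addition, Equation (5)."""
    uv = np.dot(u, v)
    nu, nv = np.sum(u ** 2), np.sum(v ** 2)
    return ((1 + 2 * uv + nv) * u + (1 - nu) * v) / (1 + 2 * uv + nu * nv)

u, v = np.array([0.3, 0.0]), np.array([0.0, 0.4])
print(poincare_dist(u, u))                           # 0.0 for coincident points
print(poincare_dist(u, v) > np.linalg.norm(u - v))   # True: hyperbolic distance dominates
print(np.linalg.norm(mobius_add(u, v)) < 1.0)        # True: the ball is closed under Möbius addition

# Circular shifts (np.roll) preserve Euclidean norms, as orthogonality implies.
print(np.isclose(np.linalg.norm(np.roll(v, 1)), np.linalg.norm(v)))  # True
```

Distances blow up as points approach the boundary, which is the property the regularisation scheme below builds on.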
Therefore, we define the term embedding as s + Π_β o, where β is a hyperparameter controlling the number of successive circular shifts. To enforce the term embeddings to stay in the Poincaré ball, we constrain all the entity embeddings to have a Euclidean norm less than 0.5; namely, ‖e‖ < 0.5 and ‖r‖ < 1.0 for all entity and relation vectors, respectively. Since the circular permutation operation retains the Euclidean norms, the entity norm constraints do not restrict the term embeddings from spanning the Poincaré ball. We define the implausibility score of a fact as the hyperbolic distance between the term and the relation embeddings:

f_R(s, o) = d_p(s + Π_β o, r).   (6)

Figure 1 provides an illustration of the HyperKG model in P². We follow previous work and minimise the following hinge loss function:

L = Σ_{R(s,o) ∼ P, R′(s′,o′) ∼ N} [γ + f_R(s, o) − f_{R′}(s′, o′)]₊,   (7)

where P is the training set consisting of valid facts and N is a set of corrupted facts. To create the corrupted facts, we experimented with two strategies. First, we replaced randomly either the subject or the object of a valid fact with a random entity (but not both at the same time); we denote by negs_E the number of such negative examples. Furthermore, we experimented with randomly replacing the relation while keeping the entities of a valid fact intact; we denote by negs_R the number of such "relation-corrupted" negative examples. We employ the "Bernoulli" sampling method to generate incorrect facts (Wang et al. 2014; Ji et al. 2015; Xie et al. 2017).

As pointed out in different studies (Bordes et al. 2013; Dettmers et al. 2018; Lacroix, Usunier, and Obozinski 2018), regularisation techniques are really beneficial for the task of KBC.
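Putting the above together, the term composition, the implausibility score of Equation (6) and one summand of the margin loss of Equation (7) can be sketched as follows. This is a toy sketch: np.roll stands in for Π_β and all vectors are illustrative, chosen to respect the norm constraints:

```python
import numpy as np

def poincare_dist(u, v):
    sq = np.sum((u - v) ** 2)
    return np.arccosh(1 + 2 * sq / ((1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))))

def implausibility(s, o, r, beta):
    """f_R(s, o) = d_p(s + Pi_beta o, r), Equation (6); np.roll plays Pi_beta."""
    return poincare_dist(s + np.roll(o, beta), r)

def hinge_term(pos_score, neg_score, gamma):
    """One summand of the margin loss, Equation (7)."""
    return max(0.0, gamma + pos_score - neg_score)

# Entity norms below 0.5 keep every composite term s + Pi_beta o inside the ball.
s = np.array([0.10, 0.20]); o = np.array([-0.20, 0.10]); r = np.array([0.30, -0.10])
corrupt_o = np.array([0.30, -0.30])   # a randomly corrupted object entity
pos = implausibility(s, o, r, beta=1)
neg = implausibility(s, corrupt_o, r, beta=1)
loss = hinge_term(pos, neg, gamma=0.5)
print(loss >= 0.0)   # True: the rectifier keeps each summand non-negative
```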
Nonetheless, very few of the classical regularisation methods are directly applicable or easily generalisable to the Poincaré-ball model of hyperbolic space. For instance, the ℓ₂ regularisation constraint imposes vectors to stay close to the origin, which can lead to underflows. The same holds for dropout (Srivastava et al. 2014) when a rather large dropout rate is used, while in our experiments a rather small dropout rate had no effect on the model's generalisation capability. In our experiments, we noticed a tendency of the embedding vectors to stay close to the origin. Imposing a constraint on the vectors to stay away from the origin stabilised the training procedure and increased the model's generalisation capability. It should be noted that, as points in the Poincaré ball approach the ball's boundary, their distance d_p(u, v) approaches d_p(u, 0) + d_p(0, v), which is analogous to the fact that in a tree the shortest path between two siblings is the path through their parent (Sala et al. 2018). Building on this observation, our regulariser further imposes this "tree-like" property. Additionally, since the volume in hyperbolic space grows exponentially, our regulariser implicitly penalises crowding.

Let Θ := {eᵢ}_{i=1}^{|E|} ∪ {rᵢ}_{i=1}^{|R|} be the set of all entity and relation vectors, where |E|, |R| denote the cardinalities of the sets E, R, respectively. R(Θ) defines the regularisation loss function that performed best in our experiments:

R(Θ) = Σ_{i=1}^{|E|+|R|} (1 − ‖θᵢ‖²).   (8)

The overall energy of the embedding is now defined as L′(Θ) = L(Θ) + λ R(Θ), where λ is a hyperparameter controlling the regularisation effect. We define aᵢ := 0.5 if θᵢ corresponds to an entity vector and aᵢ := 1.0 otherwise. To minimise L′(Θ), we solve the following optimisation problem:

Θ′ ← argmin_Θ L′(Θ)  s.t.  ∀θᵢ ∈ Θ : ‖θᵢ‖ < aᵢ.   (9)
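The regulariser of Equation (8) and the norm constraints of Equation (9) can be sketched as follows, with illustrative vectors of our own choosing:

```python
import numpy as np

def regulariser(thetas):
    """R(Theta) = sum_i (1 - ||theta_i||^2), Equation (8): the penalty
    shrinks as vectors move away from the origin, towards the boundary."""
    return sum(1.0 - np.sum(t ** 2) for t in thetas)

# Illustrative embeddings respecting the constraints of Equation (9):
entities = [np.array([0.10, 0.10]), np.array([0.45, 0.00])]   # ||e|| < 0.5
relations = [np.array([0.90, 0.20])]                          # ||r|| < 1.0
print(regulariser(entities + relations))

# Pushing a vector towards its norm bound lowers the penalty:
near = [np.array([0.49, 0.00])]
far = [np.array([0.10, 0.00])]
print(regulariser(near) < regulariser(far))   # True
```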
To solve Equation (9), we follow Nickel and Kiela (2017) and use Riemannian SGD (RSGD; Bonnabel 2013). In RSGD, the parameter updates are of the form θ_{t+1} = R_{θ_t}(−η ∇_R L′(θ_t)), where R_{θ_t} denotes the retraction onto the open d-dimensional unit ball at θ_t and η denotes the learning rate. The Riemannian gradient of L′(θ) is denoted by ∇_R ∈ T_θ B and can be computed as ∇_R = ((1 − ‖θ_t‖²)² / 4) ∇_E, where ∇_E denotes the Euclidean gradient of L′(θ). Similarly to Nickel and Kiela (2017), we use the retraction operation R_θ(v) = θ + v.

To constrain the embeddings to remain within the Poincaré ball and respect the additional constraints, we use the following projection:

proj(θ, a) = a θ / (‖θ‖ + ε) if ‖θ‖ ≥ a, and θ otherwise,   (10)

where ε is a small constant to ensure numerical stability. Let a be the constraint imposed on vector θ; the full update for a single embedding is then of the form:

θ_{t+1} ← proj(θ_t − η ((1 − ‖θ_t‖²)² / 4) ∇_E, a).   (11)

We initialise the embeddings using the Xavier initialisation scheme (Glorot and Bengio 2010), using Equation (10) to project the vectors whose norms violate the imposed constraints.

Convex Relation Spaces
In this section, we investigate the type of rules that HyperKG can model. Recently, Wang, Gemulla, and Li (2018) proved that the bilinear models are universal, i.e., they can represent every possible fact, given that the dimensionality of the vectors is sufficient. The authors have also shown that the TransE model is not universal. In parallel, Kazemi and Poole (2018) have shown that the FTransE model (Feng et al. 2016), which is the most general translational model proposed in the literature, imposes some severe restrictions on the types of relations the translational models can represent. At the core of their proof lies the assumption that the energy function defined by the FTransE model approaches zero for all given valid facts. Nonetheless, this condition can be considered less likely to be met from an optimisation perspective (Xiao, Huang, and Zhu 2016).

Additionally, Gutiérrez-Basulto and Schockaert (2018) studied the types of regularities that KB embedding methods can capture. To allow for a formal characterisation, the authors considered hard thresholds λ_R such that a fact R(s, o) is considered valid iff s_R(s, o) ≤ λ_R, where s_R(·, ·) is the implausibility score. It should be highlighted that KB embeddings are often learned based on a maximum-margin loss function; therefore, this assumption is not so restrictive. The vector space representation of a given relation R can then be viewed as a region η(R) in ℝ²ⁿ, defined as follows:

η(R) = {s ⊕ o | s_R(s, o) ≤ λ_R}.   (12)
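Convexity of a relation region in the sense of Equation (12) can be spot-checked numerically. For TransE, where s_R(s, o) = ‖s + r − o‖, the region is convex because the score is a norm of an affine function of s ⊕ o. Below is a small randomised check of our own, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 4, 1.0
r = rng.normal(size=n)

# Sample member term vectors s ⊕ o of the TransE relation region
# {s ⊕ o : ||s + r - o|| <= lam}, by construction.
members = []
for _ in range(50):
    s = rng.normal(size=n)
    u = rng.normal(size=n)
    u *= rng.uniform() * lam / np.linalg.norm(u)   # ||u|| <= lam
    o = s + r - u                                  # then ||s + r - o|| = ||u|| <= lam
    members.append(np.concatenate([s, o]))

def score(t):
    """TransE implausibility on a concatenated term vector t = s ⊕ o."""
    return np.linalg.norm(t[:n] + r - t[n:])

# If the region is convex, midpoints of members must stay in the region.
ok = all(score((a + b) / 2) <= lam + 1e-9 for a in members for b in members)
print(ok)  # True
```

The same midpoint test applies to the HyperKG regions once Proposition 1 below identifies them with Euclidean balls.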
Figure 2: A visualisation of the probability density functions using a histogram with log-log axes, for the word frequencies of the "Moby Dick" novel and the node degrees of WN18RR and FB15K-237, together with the fitted power-law distributions.

Based on this view of the relation space, the authors prove that although bilinear models are fully expressive, they impose constraints on the type of rules that they can learn. Specifically, let R₁(X, Y) → S(X, Y) and R₂(X, Y) → S(X, Y) be two valid rules. The bilinear models impose either that R₁(X, Y) → R₂(X, Y) or that R₂(X, Y) → R₁(X, Y), introducing, thus, a number of restrictions on the type of subsumption hierarchies they can model. Gutiérrez-Basulto and Schockaert (2018) additionally prove that there exists a KB embedding model with convex relation regions that can correctly represent knowledge bases whose axioms belong to the family of QC rules. Equivalently, any inductive reasoning made by the aforementioned KB embedding model would be logically consistent and deductively closed with respect to the ontological rules. It can be easily verified that the relation regions of TransE (Bordes et al. 2013) are indeed convex. This result is in accordance with the results of Wang, Gemulla, and Li (2018): TransE is not fully expressive. However, it could be a prominent candidate for representing QC rules in a consistent way. Nonetheless, this result seems to be in conflict with the results of Kazemi and Poole (2018). Let s_{TE,R}(s, o) be the implausibility score of TransE; we demystify this seeming inconsistency by proving the following lemma:

Lemma 1
The restrictions proved by Kazemi and Poole (2018) can be lifted for the TransE model when we consider that a fact is valid iff s_{TE,R}(s, o) ≤ λ_R for a sufficient λ_R > 0.

We prove Lemma 1 in the Appendix by constructing counterexamples for each one of the restrictions. Since the restrictions can be lifted for the TransE model, we can safely conclude that they are not, in general, valid for all the generalisations of the TransE model. In parallel, we build upon the formal characterisation of relation regions defined in Equation (12) and prove that the relation regions captured by HyperKG are indeed convex. Specifically, we prove:
Proposition 1
The geometric locus of the term vectors, of the form s + Π_β o, that satisfy the equation d_p(s + Π_β o, r) ≤ λ_R for some λ_R > 0 corresponds to a d-dimensional closed ball in the Euclidean space. Let ρ = (cosh(λ_R) − 1)(1 − ‖r‖²) / 2; the geometric locus can be written as

‖s + Π_β o − r/(ρ+1)‖² ≤ ρ/(ρ+1) + ‖r‖²/(ρ+1)² − ‖r‖²/(ρ+1),

where the ball's radius is guaranteed to be strictly greater than zero.

The proof of Proposition 1 can be found in the Appendix. By exploiting the triangle inequality, we can easily verify that the relation regions captured by HyperKG are indeed convex. Figure 1 provides an illustration of the geometric loci captured by HyperKG in B². This result shows that HyperKG constitutes another prominent embedding model for effectively representing QC rules.

Experiments
Datasets
We evaluate our HyperKG model on the task of KBC using two sets of experiments. We conduct experiments on the WN18RR (Dettmers et al. 2018) and FB15k-237 (Toutanova and Chen 2015) datasets. We also construct two datasets whose statistical regularities can be expressed as QC rules, to test our model's performance in their presence. WN18RR and FB15k-237 constitute refined subsets of the WN18 and FB15K datasets that were introduced by Bordes et al. (2013). Toutanova and Chen (2015) identified that WN18 and FB15K contained a lot of reversible relations, enabling, thus, various KB embedding models to generalise easily. Exploiting this fact, Dettmers et al. (2018) obtained state-of-the-art results only by using a simple reversal rule. WN18RR and FB15k-237 were carefully created to alleviate this leakage of information.

To test whether the scale-free distribution provides a reasonable means for modelling topological properties of knowledge graphs, we investigate the degree distributions of WN18RR and FB15k-237. Similarly to Steyvers and Tenenbaum (2005), we treat the knowledge graphs as undirected networks. We also compare against the distribution of the frequency of word usage in the English language, a phenomenon that is known to follow a power-law distribution (Zipf 1949). To do so, we used the frequency of word usage in Herman Melville's novel "Moby Dick" (Newman 2005). We followed the procedure described by Alstott, Bullmore, and Plenz (2014). In Figure 2, we show our analysis, where we demonstrate on a histogram with log-log axes the probability density function with regard to the observed property for each dataset, including the fitted power-law distribution. It can be seen that the power-law distribution provides a reasonable means for also describing the degree distribution of KBs, justifying the work of Steyvers and Tenenbaum (2005).
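The degree statistics underlying Figure 2 can be computed directly from a fact list. Below is a toy sketch of the procedure, treating the graph as undirected and using made-up facts:

```python
from collections import Counter

facts = [("a", "r1", "b"), ("a", "r1", "c"), ("a", "r2", "d"), ("b", "r1", "c")]

degree = Counter()
for s, _, o in facts:          # undirected: each fact increments both endpoints
    degree[s] += 1
    degree[o] += 1

counts = Counter(degree.values())            # degree -> number of nodes
total = sum(counts.values())
pdf = {k: v / total for k, v in sorted(counts.items())}
print(pdf)  # {1: 0.25, 2: 0.5, 3: 0.25}
```

On a real KB one would then fit a power law to this empirical distribution, e.g. with the powerlaw package accompanying Alstott, Bullmore, and Plenz (2014).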
The fluctuations in the cases of WN18RR and FB15k-237 could be explained by the fact that the datasets are subsets of more complete KBs, a fact that introduces noise, which in turn can explain deviations from the perfection of a theoretical distribution (Alstott, Bullmore, and Plenz 2014).

To test our model's performance on capturing QC rules, we extract from Wikidata (Vrandečić and Krötzsch 2014; Erxleben et al. 2014) two subsets of facts that satisfy the following rules:

(a) is a(x, y) ∧ part of(y, z) → part of(x, z)
(b) part of(x, y) ∧ is a(y, z) → part of(x, z)

Recent studies have noted that many real-world KB relations have very few facts (Xiong et al. 2018), raising the importance of generalising with a limited number of facts. To test our model in the presence of sparse long-tail relations, we kept the created datasets sufficiently small. For each type of the aforementioned rules, we extract 200 facts that satisfy them from Wikidata. We construct two datasets that we dub WD and WD++. The dataset WD contains only the facts that satisfy rule (a). WD++ extends WD by also including the facts satisfying rule (b). The evaluation protocol was the following: for every dataset, we split all the facts randomly into train, validation and test sets, such that the validation and test sets only contain a subset of the rules' consequents, in the form of part of(x, z). Table 1 provides details regarding the respective size of each dataset.

Dataset     |E|      |R|   Train    Valid    Test
WN18RR      40,943   11    86,835   3,034    3,134
FB15k-237   14,541   237   272,115  17,535   20,466
WD          418      2     550      25       25
WD++        763      2     1,120    40       40

Table 1: Statistics of the experimental datasets.
Evaluation Protocol & Implementation Details
In the KBC task, the models are evaluated based on their capability to answer queries such as R(subject, ?) and R(?, object) (Bordes et al. 2013), predicting, thus, the missing entity. Specifically, all the possible corruptions are obtained by replacing either the subject or the object, and the entities are ranked based on the values of the implausibility score. The models should assign lower implausibility scores to valid facts and higher scores to implausible ones. We use the "Filtered" setting protocol (Bordes et al. 2013), i.e., not taking into account any corrupted facts that already exist in the KB. We employ two common evaluation metrics: mean reciprocal rank (MRR) and Hits@10 (i.e., the proportion of the valid test triples ranking in the top 10 predictions). Higher MRR or higher Hits@10 indicates better performance.

The reported results are given for the best set of hyperparameters evaluated on the validation set using grid search. Varying the batch size had no effect on the performance; therefore, we divided every epoch into 10 mini-batches. The grid search covered the numbers of negative examples negs_E and negs_R, the regularisation weight λ, the embeddings' dimension n, the number of circular shifts β, the learning rate η and the margin γ. We used early stopping based on the validation set's filtered MRR performance, computed every 50 epochs, up to a fixed maximum number of epochs.

Results & Analysis
Table 2 compares the experimental results of our HyperKG model with previously published results on the WN18RR and FB15k-237 datasets. We compare against the shallow KB embedding models DISTMULT (Yang et al. 2015), ComplEx (Trouillon et al. 2016) and TransE (Bordes et al. 2013), which constitute important representatives of the bilinear and translational models. We exclude from our comparison recent work that explores different types of training regimes, such as adversarial training, the inclusion of reciprocal facts and/or multiple geometrical spaces (Cai and Wang 2018; Sun et al. 2019; Kazemi and Poole 2018; Lacroix, Usunier, and Obozinski 2018; Balažević, Allen, and Hospedales 2019), to make the analysis less biased to factors that could overshadow the importance of the embedding space. We give the results of our algorithm under the HyperKG listing.

Despite the fact that HyperKG belongs to the translational family of KB embedding models, it achieves comparable performance to the other models on the WN18RR dataset. When we compare the performance of HyperKG and TransE, we see that HyperKG achieves almost double the MRR score. This shows that the lower MRR performance of TransE is not an intrinsic characteristic of the translational models, but a restriction that can be lifted by the right choice of geometrical space. With regard to Hits@10 on WN18RR, HyperKG exhibits slightly lower performance compared to ComplEx. On the FB15k-237 dataset, however, HyperKG and TransE demonstrate almost the same behaviour, outperforming DISTMULT and ComplEx in both metrics. Since the performance gap between TransE and HyperKG is small, we hypothesise that it may be due to a less fine-grained hyperparameter tuning.

We also report in Table 2 two additional experiments, where we explore the behaviour of HyperKG when the Möbius addition is used instead of the Euclidean one, as well as the performance boost that our regularisation scheme brings.
In the experiment where the Möbius addition was used, we removed the norm constraint on the entity vectors. Although the Möbius addition is non-commutative, we found it beneficial to keep the permutation matrix. Nonetheless, we do not use our regularisation scheme in this variant; the implausibility score is d_p(s ⊕ Π_β o, r). To investigate the effect of our proposed regularisation scheme, we also show results where the regularisation scheme, defined in Equation (8), is not used, while keeping the rest of the architecture the same.

Method                               Type           WN18RR         FB15k-237
                                                    MRR    H@10    MRR    H@10
DISTMULT (Yang et al. 2015) [*]      Bilinear       0.43   49      0.24   41
ComplEx (Trouillon et al. 2016) [*]  Bilinear       0.44   51      0.24   42
TransE (Bordes et al. 2013) [*]      Translational  0.22   50      0.29   46
HyperKG (Möbius addition)            Translational  0.30   44      0.19   32
HyperKG (no regularisation)          Translational  0.30   46      0.25   41
HyperKG                              Translational  0.41   50      0.28   45

Table 2: Experimental results on the WN18RR and FB15k-237 test sets. MRR and H@10 denote the mean reciprocal rank and Hits@10 (in %), respectively. [*]: results are taken from Nguyen et al. (2018).

Comparing the performance of the HyperKG variation using the Möbius addition against that of HyperKG without regularisation, we observe that better results are achieved with the Euclidean addition. This can be explained as follows. Generally, there is no unique and universal geometrical space adequate for every KB (Gu et al. 2018). To recover the Euclidean space from the Poincaré-ball model equipped with the Möbius addition, the ball's radius should grow to infinity (Ungar 2012). Instead, by using the Euclidean addition, and since the hyperbolic metric is locally Euclidean, HyperKG can model facts for which the Euclidean space is more appropriate by learning to retain small distances. Additionally, WN18RR contains more hierarchical relations than FB15k-237 (Balažević, Allen, and Hospedales 2019), which further explains HyperKG's performance boost on WN18RR. Last but not least, we observe that our proposed regularisation scheme is beneficial in terms of both MRR and Hits@10 on both datasets.

Table 3 reports the results on the WD and WD++ datasets. We compare HyperKG's performance against that of TransE and ComplEx. It can be observed that none of the models manages to fully capture the statistical regularities of these datasets. All models present similar behaviour in terms of Hits@10. HyperKG and TransE, which both have convex relation spaces, outperform ComplEx on both datasets. HyperKG shows the best performance on WD, and demonstrates almost the same performance as TransE on WD++. Our results point to a promising direction: developing less expressive KB embedding models which can, however, better represent certain rules.

Method     WD              WD++
           MRR    H@10     MRR    H@10
ComplEx    0.92   98       0.81   92
TransE     0.88   96       0.89   98
HyperKG    0.98   98       0.88   97

Table 3: Experimental results on the WD and WD++ test sets. MRR and H@10 denote the mean reciprocal rank and Hits@10 (in %), respectively.

Conclusion and Outlook
In this paper, we showed the geometrical space's significance for KBC by demonstrating that when models whose performance has been lagging are extended to the hyperbolic space, their performance increases significantly. What is more, we demonstrated a new promising direction for developing models that better represent certain families of rules, opening up more fine-grained reasoning tasks. Finally, recent hybrid models that exploit both the Euclidean and the hyperbolic space (Balažević, Allen, and Hospedales 2019) further demonstrate that hyperbolic space is a promising direction for KBC.
Appendices
Proof of Lemma 1:
We begin by introducing the TransE model (Bordes et al. 2013). In TransE, the entities and the relations are represented as vectors in the Euclidean space. Let s, r, o ∈ R^d denote the subject, relation and object embedding, respectively. The implausibility score for a fact R(s, o) is defined as ‖s + r − o‖. Let P denote a set of valid facts. In the following, we introduce some additional definitions needed for the introduction of the restrictions.
• A relation r is reflexive on a set E of entities if (e, r, e) ∈ P for all entities e ∈ E.
• A relation r is symmetric on a set E of entities if (e₁, r, e₂) ∈ P ⇐⇒ (e₂, r, e₁) ∈ P for all pairs of entities e₁, e₂ ∈ E.
• A relation r is transitive on a set E of entities if (e₁, r, e₂) ∈ P ∧ (e₂, r, e₃) ∈ P ⇒ (e₁, r, e₃) ∈ P for all e₁, e₂, e₃ ∈ E.
In the following, we list the restrictions mentioned in Kazemi and Poole (2018).
• R1: If a relation r is reflexive on ∆ ⊂ E, r must also be symmetric on ∆.
• R2: If r is reflexive on ∆ ⊂ E, r must also be transitive on ∆.
• R3: If entity e₁ has relation r with every entity in ∆ ⊂ E and entity e₂ has relation r with one of the entities in ∆, then e₂ must have the relation r with every entity in ∆.
Let n, m ∈ N, i ∈ R and a ∈ R*₊. Let v = (v₁, v₂, …, v_m) ∈ R^m and u ∈ R^n. We denote by (v₁, v₂, …, v_m; u) the concatenation of the vectors v and u. Let 0_n ∈ R^n be the n-dimensional zero vector. For each restriction, we consider a minimal valid set of instances that could satisfy the restriction and we construct a counterexample that satisfies the restriction's conditions but not its conclusion.
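These counterexamples are easy to check numerically. A small sketch for R1, using the TransE score just defined; the concrete values of a, i and the dimension n are arbitrary choices of ours:

```python
import numpy as np

def transe_score(s, r, o):
    """TransE implausibility: the l2 distance between s + r and o."""
    return float(np.linalg.norm(s + r - o))

# Arbitrary choices for a, i and the dimension n (any values would do).
a, i, n = 0.5, 0.25, 4
pad = [0.0] * (n - 1)
r  = np.array([a] + pad)
e1 = np.array([i - a] + pad)
e2 = np.array([i + a] + pad)

# r is reflexive on {e1, e2} and the fact (e1, r, e2) holds at threshold a ...
assert transe_score(e1, r, e1) <= a
assert transe_score(e2, r, e2) <= a
assert transe_score(e1, r, e2) <= a
# ... yet the symmetric fact (e2, r, e1) scores 3a > a, violating R1.
assert np.isclose(transe_score(e2, r, e1), 3 * a)
```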
R1: This restriction translates to:

‖e₁ + r − e₁‖ ≤ a, ‖e₂ + r − e₂‖ ≤ a, ‖e₁ + r − e₂‖ ≤ a ⇒ ‖e₂ + r − e₁‖ ≤ a    (13)

Let n ≥ 2, r = (a; 0_{n−1}), e₁ = (i − a; 0_{n−1}) and e₂ = (i + a; 0_{n−1}). Then:

‖e₂ + r − e₁‖ = ‖((i + 2a) − (i − a); 0_{n−1})‖ = √(9a²) = 3a > a    (14)

R2: This restriction translates to:

‖e₁ + r − e₁‖ ≤ a, ‖e₂ + r − e₂‖ ≤ a, ‖e₃ + r − e₃‖ ≤ a, ‖e₁ + r − e₂‖ ≤ a, ‖e₂ + r − e₃‖ ≤ a ⇒ ‖e₁ + r − e₃‖ ≤ a    (15)

Let n ≥ 2, r = (a; 0_{n−1}), e₁ = (i − a; 0_{n−1}), e₂ = (i + a; 0_{n−1}) and e₃ = (i + 3a; 0_{n−1}). Then:

‖e₁ + r − e₃‖ = ‖(i − (i + 3a); 0_{n−1})‖ = √(9a²) = 3a > a    (16)

R3: Taking ∆ = {e₂, e₃}, where e₁ has relation r with every entity in ∆ and e₃ has relation r with one of them, this restriction translates to:

‖e₁ + r − e₂‖ ≤ a, ‖e₁ + r − e₃‖ ≤ a, ‖e₃ + r − e₃‖ ≤ a ⇒ ‖e₃ + r − e₂‖ ≤ a ∧ ‖e₃ + r − e₃‖ ≤ a    (17)

Let n ≥ 2, r = (a; 0_{n−1}), e₁ = (i; 0_{n−1}), e₂ = (i + a, a; 0_{n−2}) and e₃ = (i + 2a; 0_{n−1}). Then:

‖e₃ + r − e₂‖ = ‖((i + 3a) − (i + a), −a; 0_{n−2})‖ = √(5a²) = √5·a > a    (18)

It can be easily verified that these counterexamples also
apply, with no modification, when the ℓ₁ distance is used. This ends our proof.

Proof of Proposition 1: Let ‖s + Π_β o‖ < 1, ‖r‖ < 1 and λ_R > 0. We investigate the type of the geometric locus of the term vectors of the form s + Π_β o that satisfy the following equation:

d_p(s + Π_β o, r) ≤ λ_R    (19)

To simplify the notation, we denote x := s + Π_β o and write δ(x, r) = ‖x − r‖² / ((1 − ‖x‖²)(1 − ‖r‖²)), so that d_p(x, r) = arcosh(1 + 2δ(x, r)). Then:

d_p(x, r) ≤ λ_R ⇐⇒ 1 + 2δ(x, r) ≤ cosh(λ_R) ⇐⇒ δ(x, r) ≤ (cosh(λ_R) − 1)/2    (20)

Let α² = (cosh(λ_R) − 1)/2. We should note that α² > 0, since ∀x ∈ R*: cosh(x) > 1. Then, we have:

‖x − r‖² / ((1 − ‖x‖²)(1 − ‖r‖²)) ≤ α²    (21)

By setting ρ = α²(1 − ‖r‖²), inequality (21) becomes:

‖x − r‖² ≤ ρ(1 − ‖x‖²) ⇐⇒ (ρ + 1)‖x‖² − 2⟨x, r⟩ + ‖r‖² ≤ ρ ⇐⇒
‖x‖² − 2⟨x, r⟩/(ρ + 1) + ‖r‖²/(ρ + 1) ≤ ρ/(ρ + 1) ⇐⇒
‖x − r/(ρ + 1)‖² ≤ ρ/(ρ + 1) + ‖r‖²/(ρ + 1)² − ‖r‖²/(ρ + 1)    (22)

We prove in the following that:

ρ/(ρ + 1) + ‖r‖²/(ρ + 1)² − ‖r‖²/(ρ + 1) > 0    (23)

First, we note that ρ > 0, based on the fact that α² > 0 and 1 − ‖r‖² > 0. Then, we have:

ρ/(ρ + 1) + ‖r‖²/(ρ + 1)² − ‖r‖²/(ρ + 1)
= 1/(ρ + 1) · (ρ + ‖r‖²/(ρ + 1) − ‖r‖²)
= 1/(ρ + 1) · (ρ + ((1 − (ρ + 1))/(ρ + 1)) ‖r‖²)
= ρ/(ρ + 1) · (1 − ‖r‖²/(ρ + 1))

We observe that ρ/(ρ + 1) > 0; hence, it is sufficient to check whether 1 − ‖r‖²/(ρ + 1) > 0. We note that since ‖r‖² < 1 and ρ > 0, we have ‖r‖²/(ρ + 1) < 1/(ρ + 1). However, 1/(ρ + 1) < 1. This concludes our proof.

Model Parameters
TransE and ComplEx Implementation Details
For the experiments on the WD and WD++ datasets, we used the publicly available implementations of TransE (Bordes et al. 2013) and ComplEx (Trouillon et al. 2016) provided in the OpenKE framework (Han et al. 2018). The reported results are given for the best set of hyper-parameters evaluated on the validation set using grid search. We divided every epoch into 64 mini-batches.

For TransE, the grid covered the dimensionality of the embeddings n, the SGD learning rate, the choice between the ℓ₁- and ℓ₂-norm, and the margin γ. The highest MRR scores were achieved with γ = 7 and n = 50 for both WD and WD++.

For ComplEx, the grid covered the dimensionality of the embeddings n, the L2 regularisation parameter λ, AdaGrad's initial learning rate α, and the number of negatives η generated per positive training triple. The highest MRR scores were achieved with λ = 0.1, η = 5 and n = 50 for WD; for WD++, the best hyper-parameters were λ = 0.1, η = 5 and n = 100.

HyperKG Parameters
We report in Table 4 the best hyper-parameters of our HyperKG model used across the different experiments. For WD and WD++, we do not use the "Bernoulli" sampling method; instead, we corrupted the subject and the object of a fact with equal probability.

Dataset     Model                         negs_E  negs_R  η     λ    n    γ    β
WN18RR      HyperKG                       10      0       0.01  0.8  100  1.0  ⌊n⌋
WN18RR      HyperKG (Möbius addition)     10      0       0.01  -    100  1.0  ⌊n⌋
WN18RR      HyperKG (no regularisation)   10      0       0.01  0.0  100  1.0  ⌊n⌋
FB15k-237   HyperKG                       5       0       0.01  0.2  100  0.5  ⌊n⌋
FB15k-237   HyperKG (Möbius addition)     5       0       0.01  -    100  0.5  ⌊n⌋
FB15k-237   HyperKG (no regularisation)   5       0       0.01  0.0  100  0.5  ⌊n⌋
WD          HyperKG                       1       1       0.8   0    100  7    ⌊n⌋
WD++        HyperKG                       1       1       0.1   0    100  7    ⌊n⌋

Table 4: HyperKG's hyper-parameters used across the different experiments.
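The tuning protocol used throughout the experiments is a plain grid search keeping the configuration with the best validation MRR; it can be sketched as follows (the grid values and the `train_and_eval` callback are placeholders of ours, not the actual ones used in the paper):

```python
import itertools

# Hypothetical grids standing in for the ones actually searched.
GRID = {"n": [50, 100], "lr": [0.01, 0.1], "margin": [1, 5, 7]}

def grid_search(train_and_eval, grid=GRID):
    """Exhaustively try every configuration in the grid and keep the one
    with the highest validation MRR. `train_and_eval(config) -> mrr` is
    assumed to train a model and return its filtered validation MRR."""
    best_cfg, best_mrr = None, float("-inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        mrr = train_and_eval(cfg)
        if mrr > best_mrr:
            best_cfg, best_mrr = cfg, mrr
    return best_cfg, best_mrr
```

Early stopping on the filtered validation MRR, as described in the evaluation protocol, would live inside `train_and_eval`.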
References

Abiteboul, S.; Hull, R.; and Vianu, V. 1995. Foundations of Databases: The Logical Level. Addison-Wesley Longman Publishing Co., Inc.
Ahlfors, L. V. 1975. Invariant operators and integral representations in hyperbolic space. Mathematica Scandinavica.
Balažević, I.; Allen, C.; and Hospedales, T. M. 2019. Multi-relational Poincaré graph embeddings. arXiv preprint arXiv:1905.09791.
Barabási, A.-L., and Albert, R. 1999. Emergence of scaling in random networks. Science.
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD.
Bonnabel, S. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Contr.
Bordes, A.; Usunier, N.; Garcia-Durán, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In NeurIPS.
Cai, L., and Wang, W. Y. 2018. KBGAN: Adversarial learning for knowledge graph embeddings. In NAACL.
Dettmers, T.; Minervini, P.; Stenetorp, P.; and Riedel, S. 2018. Convolutional 2D knowledge graph embeddings. In AAAI.
Ebisu, T., and Ichise, R. 2018. TorusE: Knowledge graph embedding on a Lie group. In AAAI.
Erxleben, F.; Günther, M.; Krötzsch, M.; Mendez, J.; and Vrandečić, D. 2014. Introducing Wikidata to the linked data web. In ISWC.
Faloutsos, M.; Faloutsos, P.; and Faloutsos, C. 1999. On power-law relationships of the internet topology. In ACM SIGCOMM Computer Communication Review, volume 29, 251–262. ACM.
Feng, J.; Huang, M.; Wang, M.; Zhou, M.; Hao, Y.; and Zhu, X. 2016. Knowledge graph embedding by flexible translation. In KR.
Ganea, O.; Becigneul, G.; and Hofmann, T. 2018. Hyperbolic entailment cones for learning hierarchical embeddings. In Dy, J., and Krause, A., eds., ICML, 1646–1655. PMLR.
Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning, volume 1. MIT Press.
Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 249–256. Chia Laguna Resort, Sardinia, Italy: PMLR.
Gu, A.; Sala, F.; Gunel, B.; and Ré, C. 2018. Learning mixed-curvature representations in product spaces. In ICLR.
Gutiérrez-Basulto, V., and Schockaert, S. 2018. From knowledge graph embedding to ontology embedding? An analysis of the compatibility between vector space representations and rules. In KR.
Han, X.; Cao, S.; Lv, X.; Lin, Y.; Liu, Z.; Sun, M.; and Li, J. 2018. OpenKE: An open toolkit for knowledge embedding. In EMNLP, 139–144.
Ji, G.; He, S.; Xu, L.; Liu, K.; and Zhao, J. 2015. Knowledge graph embedding via dynamic mapping matrix. In ACL-IJCNLP, 687–696.
Kazemi, S. M., and Poole, D. 2018. SimplE embedding for link prediction in knowledge graphs. In NeurIPS, 4284–4295.
Krioukov, D.; Papadopoulos, F.; Kitsak, M.; Vahdat, A.; and Boguñá, M. 2010. Hyperbolic geometry of complex networks. Phys. Rev. E.
Lacroix, T.; Usunier, N.; and Obozinski, G. 2018. Canonical tensor decomposition for knowledge base completion. In ICML. PMLR.
Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web.
Miller, G. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Mitchell, J., and Lapata, M. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT.
Muggleton, S., and De Raedt, L. 1994. Inductive logic programming: Theory and methods. The Journal of Logic Programming.
Nguyen, D. Q.; Nguyen, T. D.; Nguyen, D. Q.; and Phung, D. 2018. A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL.
Nickel, M., and Kiela, D. 2017. Poincaré embeddings for learning hierarchical representations. In NeurIPS.
Nickel, M., and Kiela, D. 2018. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML.
Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE.
Nickel, M.; Rosasco, L.; and Poggio, T. 2016. Holographic embeddings of knowledge graphs. In AAAI.
Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A three-way model for collective learning on multi-relational data. In ICML. Omnipress.
Papadopoulos, F.; Aldecoa, R.; and Krioukov, D. 2015. Network geometry inference using common neighbors. Physical Review E.
Möbius transformations. arXiv preprint arXiv:1902.05003.
Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning.
Sala, F.; De Sa, C.; Gu, A.; and Ré, C. 2018. Representation tradeoffs for hyperbolic embeddings. In ICML. PMLR.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
Steyvers, M., and Tenenbaum, J. B. 2005. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science.
Suchanek, F. M.; Kasneci, G.; and Weikum, G. 2007. YAGO: A core of semantic knowledge. In WWW.
Sun, Z.; Deng, Z.-H.; Nie, J.-Y.; and Tang, J. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR.
Toutanova, K., and Chen, D. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality.
Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016. Complex embeddings for simple link prediction. In ICML.
Ungar, A. A. 2012. Beyond the Einstein Addition Law and its Gyroscopic Thomas Precession: The Theory of Gyrogroups and Gyrovector Spaces, volume 117. Springer Science & Business Media.
Van Der Hofstad, R. 2009. Random Graphs and Complex Networks.
Vrandečić, D., and Krötzsch, M. 2014. Wikidata: a free collaborative knowledge base. Commun. ACM.
Wang, Q.; Mao, Z.; Wang, B.; and Guo, L. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering.
Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI.
West, R.; Gabrilovich, E.; Murphy, K.; Sun, S.; Gupta, R.; and Lin, D. 2014. Knowledge base completion via search-based question answering. In WWW.
Xiao, H.; Huang, M.; and Zhu, X. 2016. From one point to a manifold: Knowledge graph embedding for precise link prediction. In IJCAI.
Xie, Q.; Ma, X.; Dai, Z.; and Hovy, E. 2017. An interpretable knowledge transfer model for knowledge base completion. In ACL.
Xiong, W.; Yu, M.; Chang, S.; Guo, X.; and Wang, W. Y. 2018. One-shot relational learning for knowledge graphs. In EMNLP.
Yang, B.; Yih, W.-t.; He, X.; Gao, J.; and Deng, L. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR.
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley Press.