Analogical Inference for Multi-Relational Embeddings
Hanxiao Liu, Yuexin Wu, Yiming Yang

Abstract
Large-scale multi-relational embedding refers to the task of learning the latent representations for entities and relations in large knowledge graphs. An effective and scalable solution for this problem is crucial for the true success of knowledge-based inference in a broad range of applications. This paper proposes a novel framework for optimizing the latent representations with respect to the analogical properties of the embedded entities and relations. By formulating the learning objective in a differentiable fashion, our model enjoys both theoretical power and computational scalability, and significantly outperformed a large number of representative baseline methods on benchmark datasets. Furthermore, the model offers an elegant unification of several well-known methods in multi-relational embedding, which can be proven to be special instantiations of our framework.
1. Introduction
Multi-relational embedding, or knowledge graph embedding, is the task of finding the latent representations of entities and relations for better inference over knowledge graphs. This problem has become increasingly important in recent machine learning due to the broad range of important applications of large-scale knowledge bases, such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007) and Google's Knowledge Graph (Singhal, 2012), including question-answering (Ferrucci et al., 2010), information retrieval (Dalton et al., 2014) and natural language processing (Gabrilovich & Markovitch, 2009).

A knowledge base (KB) typically stores factual information as subject-relation-object triplets. The collection of such triplets forms a directed graph whose nodes are entities and whose edges are the relations among entities.
Figure 1. Commutative diagram for the analogy between the Solar System (red) and the Rutherford-Bohr Model (blue) (atom system). By viewing the atom system as a "miniature" of the solar system (via the scale down relation), one is able to complete missing facts (triplets) about the latter by mirroring the facts about the former. The analogy is built upon three basic analogical structures (parallelograms): "sun is to planets as nucleus is to electrons", "sun is to mass as nucleus is to charge" and "planets are to mass as electrons are to charge".

Real-world knowledge graphs are both extremely large and highly incomplete by nature (Min et al., 2013). How to use the observed triplets of an incomplete graph to induce its unobserved triplets presents a tough challenge for machine learning research.

Various statistical relational learning methods (Getoor, 2007; Nickel et al., 2015) have been proposed for this task, among which vector-space embedding models are particularly attractive due to their advantageous performance and scalability (Bordes et al., 2013). The key idea in those approaches is to find dimensionality-reduced representations for both the entities and the relations, and hence to force the models to generalize during the course of compression. Representative models of this kind include tensor factorization (Singhal, 2012; Nickel et al., 2011), neural tensor networks (Socher et al., 2013; Chen et al., 2013), translation-based models (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015b), bilinear models and their variants (Yang et al., 2014; Trouillon et al., 2016), pathwise methods (Guu et al., 2015), embeddings based on holographic representations (Nickel et al., 2016), and product graphs that utilize additional side information for the prediction of unseen edges in a semi-supervised manner (Liu & Yang, 2015; 2016). Learning the embeddings of entities and relations can be viewed as a knowledge induction process, as those induced latent representations can be used to make inference about new triplets that have not been seen before.

Despite the substantial efforts and great successes so far in the research on multi-relational embedding, one important aspect is missing: to study the solutions of the problem from the analogical inference point of view, by which we mean to rigorously define the desirable analogical properties for multi-relational embeddings of entities and relations, and to provide algorithmic solutions for optimizing the embeddings w.r.t. those properties. We argue that analogical inference is particularly desirable for knowledge base completion since, for instance, if system A (a subset of entities and relations) is analogous to system B (another subset of entities and relations), then the unobserved triplets in B could be inferred by mirroring their counterparts in A. Figure 1 uses a toy example to illustrate the intuition, where system A corresponds to the solar system with three concepts (entities), and system B corresponds to the atom system with another three concepts. An analogy exists between the two systems because B is a "miniature" of A.
As a result, knowing how the entities are related to each other in system A allows us to make inferences about how the entities are related to each other in system B by analogy.

Although analogical reasoning was an active research topic in classic AI (artificial intelligence), early computational models mainly focused on non-differentiable rule-based reasoning (Gentner, 1983; Falkenhainer et al., 1989; Turney, 2008), which can hardly scale to very large KBs such as Freebase or Google's Knowledge Graph. How to leverage the intuition of analogical reasoning via statistical inference for the automated embedding of very large knowledge graphs has not been studied so far, to our knowledge.

It is worth mentioning that analogical structures have been observed in the output of several word/entity embedding models (Mikolov et al., 2013; Pennington et al., 2014). However, those observations stopped there as merely empirical observations. Can we mathematically formulate the desirable analogical structures and leverage them in our objective functions to improve multi-relational embedding? In this case, can we develop new algorithms for tractable inference for the embedding of very large knowledge graphs? These questions present a fundamental challenge which has not been addressed by existing work, and answering them is the main contribution we aim at in this paper. We name this open challenge the analogical inference problem, to distinguish it from rule-based analogical reasoning in classic AI.

Our specific novel contributions are the following:

1. A new framework that, for the first time, explicitly models analogical structures in multi-relational embedding, and that improves the state-of-the-art performance on benchmark datasets;

2. The algorithmic solution for conducting analogical inference in a differentiable manner, whose implementation is as scalable as the fastest known relational embedding algorithms;

3. The theoretical insights on how our framework provides a unified view of several representative methods as its special (and restricted) cases, and why the generalization of such cases leads to the advantageous performance of our method as empirically observed.

The rest of this paper is organized as follows: §2 reviews the background of multi-relational embedding, §3 presents the proposed analogical inference framework, §4 derives the efficient inference algorithm, §5 provides a unified view of several representative methods as special cases of our framework, §6 reports the experimental results, and §7 concludes the paper.
2. Related Background
Let $\mathcal{E}$ and $\mathcal{R}$ be the space of all entities and their relations. A knowledge base $\mathcal{K}$ is a collection of triplets $(s, r, o) \in \mathcal{K}$ where $s \in \mathcal{E}$, $o \in \mathcal{E}$, $r \in \mathcal{R}$ stand for the subject, the object and their relation, respectively. Denote by $v \in \mathbb{R}^{|\mathcal{E}| \times m}$ a look-up table where $v_e \in \mathbb{R}^m$ is the vector embedding for entity $e$, and denote by tensor $W \in \mathbb{R}^{|\mathcal{R}| \times m \times m}$ another look-up table where $W_r \in \mathbb{R}^{m \times m}$ is the matrix embedding for relation $r$. Both $v$ and $W$ are to be learned from $\mathcal{K}$.

We formulate each relation $r$ as a linear map that, for any given $(s, r, o) \in \mathcal{K}$, transforms the subject $s$ from its original position in the vector space to somewhere near the object $o$. In other words, we expect the latent representations for any valid $(s, r, o)$ to satisfy

$$v_s^\top W_r \approx v_o^\top \quad (1)$$

The degree of satisfaction in the approximated form of (1) can be quantified using the inner product of $v_s^\top W_r$ and $v_o$. That is, we define a bilinear score function as:

$$\phi(s, r, o) = \langle v_s^\top W_r, v_o \rangle = v_s^\top W_r v_o \quad (2)$$

Our goal is to learn $v$ and $W$ such that $\phi(s, r, o)$ gives high scores to valid triplets and low scores to invalid ones.

In contrast to some previous models (Bordes et al., 2013) where relations are modeled as additive translation operators, namely $v_s + w_r \approx v_o$, the multiplicative formulation in (1) offers a natural analogy to first-order logic, where each relation is treated as a predicate operator over input arguments (subject and object in our case). Clearly, the linear transformation defined by a matrix, a.k.a. a linear map, is a richer operator than the additive transformation defined by a vector. Multiplicative models have also been found to substantially outperform additive models empirically (Nickel et al., 2011; Yang et al., 2014).

Instead of allowing arbitrary linear maps to be used for representing relations, a particular family of matrices has been studied for "well-behaved" linear maps. This family is known as the normal matrices.

Definition 2.1 (Normal Matrix). A real matrix $A$ is normal if and only if $A^\top A = A A^\top$.

Normal matrices have nice theoretical properties which are often desirable for relational modeling; e.g., they are unitarily diagonalizable and hence can be conveniently analyzed via the spectral theorem (Dunford et al., 1971). Representative members of the normal family include:

• Symmetric matrices, for which $W_r^\top = W_r$. These include all diagonal matrices and positive semi-definite matrices, and the symmetry implies $\phi(s, r, o) = \phi(o, r, s)$. They are suitable for modeling symmetric relations such as is identical.

• Skew-/Anti-symmetric matrices, for which $W_r^\top = -W_r$, which implies $\phi(s, r, o) = -\phi(o, r, s)$. These matrices are suitable for modeling asymmetric relations such as is parent of.

• Rotation matrices, for which $W_r^\top W_r = W_r W_r^\top = I_m$, which suggests that the relation $r$ is invertible, as $W_r^{-1}$ always exists. Rotation matrices are suitable for modeling 1-to-1 relationships (bijections).

• Circulant matrices (Gray et al., 2006), which have been implicitly used in recent work on holographic representations (Nickel et al., 2016). These matrices are usually related to the learning of latent representations in the Fourier domain (see §5).

We denote the set of all $m \times m$ real normal matrices as $\mathcal{N}_m(\mathbb{R})$.
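As a concrete illustration of the bilinear score in (2), consider the following minimal NumPy sketch (our own, for illustration only; the variable names and toy sizes are not from the released implementation):

```python
import numpy as np

m = 4                          # toy embedding dimension
rng = np.random.default_rng(0)
v_s = rng.normal(size=m)       # subject embedding v_s
v_o = rng.normal(size=m)       # object embedding v_o
W_r = rng.normal(size=(m, m))  # relation embedding W_r, a linear map

def score(v_s, W_r, v_o):
    """Bilinear score phi(s, r, o) = v_s^T W_r v_o, as in Eq. (2)."""
    return v_s @ W_r @ v_o

print(score(v_s, W_r, v_o))
```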
3. Proposed Analogical Inference Framework
Analogical reasoning is known to play a central role in human induction about knowledge (Gentner, 1983; Minsky, 1988; Holyoak et al., 1996; Hofstadter, 2001). Here we provide a mathematical formulation of the analogical structures of interest in multi-relational embedding in a latent semantic space, to support algorithmic inference about the embeddings of entities and relations in a knowledge graph.
Consider the famous example in the word embedding literature (Mikolov et al., 2013; Pennington et al., 2014), for the following entities and relations among them:

"man is to king as woman is to queen"

In abstract notation we denote the entities by $a$ (as man), $b$ (as king), $c$ (as woman) and $d$ (as queen), and the relations by $r$ (as crown) and $r'$ (as male $\mapsto$ female), respectively. These give us the subject-relation-object triplets as follows:

$$a \xrightarrow{r} b, \quad c \xrightarrow{r} d, \quad a \xrightarrow{r'} c, \quad b \xrightarrow{r'} d \quad (3)$$

For multi-relational embeddings, $r$ and $r'$ are members of $\mathcal{R}$ and are modeled as linear maps in our case.

The relational maps in (3) can be visualized using a commutative diagram (Adámek et al., 2004; Brown & Porter, 2006) from Category Theory, as shown in Figure 2, where each node denotes an entity and each edge denotes a linear map that transforms one entity to the other. We also refer to such a diagram as a "parallelogram" to highlight its particular algebraic structure.
Figure 2. Parallelogram diagram for the analogy of "a is to b as c is to d", where each edge denotes a linear map.

The parallelogram in Figure 2 represents a very basic analogical structure which could be informative for the inference about unknown facts (triplets). To get a sense of why analogies would help in the inference about unobserved facts, notice that for entities $a, b, c, d$ which form an analogical structure in our example, the parallelogram structure is fully determined by symmetry. This means that if we know $a \xrightarrow{r} b$ and $a \xrightarrow{r'} c$, then we can induce the remaining triplets $c \xrightarrow{r} d$ and $b \xrightarrow{r'} d$. In other words, understanding the relation between man and king helps us to fill in the unknown relation between woman and queen. (Notice that this is different from parallelograms in the geometric sense, because each edge here is a linear map instead of the difference between two nodes in the vector space.)
Analogical structures are not limited to parallelograms, of course, though parallelograms often serve as the building blocks for more complex analogical structures. As an example, the analogy in Figure 1 of §1 is built upon three basic parallelograms.

Although it is tempting to explore all potentially interesting parallelograms in the modeling of analogical structures, it is computationally intractable to examine the entire powerset of entities as the candidate space of analogical structures. A more reasonable strategy is to identify some desirable properties of the analogical structures we want to model, and use those properties as constraints for reducing the candidate space.

A desirable property of the linear maps we want is that all directed paths with the same starting node and end node form a compositional equivalence. Denoting by "$\circ$" the composition operator between two relations, the parallelogram in Figure 2 contains two equivalent compositions:

$$r \circ r' = r' \circ r \quad (4)$$

which means that $a$ is connected to $d$ via either path. We call this the commutativity property of the linear maps, which is a necessary condition for forming commutative parallelograms and therefore the corresponding analogical structures. Yet another example is given by Figure 1, where sun can traverse to charge along multiple alternative paths of length three, implying the commutativity of the relations surrounded by, made of and scale down.

The composition of two relations (linear maps) is naturally implemented via matrix multiplication (Yang et al., 2014; Guu et al., 2015), hence equation (4) indicates

$$W_{r \circ r'} = W_r W_{r'} = W_{r'} W_r \quad (5)$$

One may further require the commutative constraint (5) to be satisfied for any pair of relations in $\mathcal{R}$, because they may be simultaneously present in the same commutative parallelogram for certain subsets of entities. In this case, we say the relations in $\mathcal{R}$ form a commuting family.

It is worth mentioning that $\mathcal{N}_m(\mathbb{R})$ is not closed under matrix multiplication. As a result, the composition rule in eq. (5) may not always yield a legal new relation: $W_{r \circ r'}$ may no longer be a normal matrix. However, any commuting family in $\mathcal{N}_m(\mathbb{R})$ is indeed closed under multiplication. This explains the necessity of having a commuting family of relations from an alternative perspective.

The generic goal for multi-relational embedding is to find entity and relation representations such that positive triplets labeled as $y = +1$ receive higher scores than negative triplets labeled as $y = -1$. This can be formulated as

$$\min_{v, W} \; \mathbb{E}_{s,r,o,y \sim \mathcal{D}} \; \ell\big(\phi_{v,W}(s, r, o), \, y\big) \quad (6)$$

where $\phi_{v,W}(s, r, o) = v_s^\top W_r v_o$ is our score function based on the embeddings, $\ell$ is our loss function, and $\mathcal{D}$ is the data distribution constructed based on the training set $\mathcal{K}$.

To impose analogical structures among the representations, we additionally require the linear maps associated with relations to form a commuting family of normal matrices. This gives us the objective function for ANALOGY:

$$\min_{v, W} \; \mathbb{E}_{s,r,o,y \sim \mathcal{D}} \; \ell\big(\phi_{v,W}(s, r, o), \, y\big) \quad (7)$$
$$\text{s.t.} \quad W_r W_r^\top = W_r^\top W_r \quad \forall r \in \mathcal{R} \quad (8)$$
$$\qquad\;\, W_r W_{r'} = W_{r'} W_r \quad \forall r, r' \in \mathcal{R} \quad (9)$$

where constraints (8) and (9) correspond to the normality and commutativity requirements, respectively. Such a constrained optimization may appear computationally expensive at first glance.
In §4, however, we will recast it as a simple, lightweight problem for which each SGD update can be carried out efficiently in $O(m)$ time.
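For intuition, constraints (8) and (9) can be checked numerically. The helper below is our own illustration (not part of the released implementation); it tests whether a set of relation matrices forms a commuting family of normal matrices:

```python
import numpy as np

def is_normal(A, tol=1e-8):
    # Normality constraint (8): A A^T == A^T A
    return np.allclose(A @ A.T, A.T @ A, atol=tol)

def commuting_family(mats, tol=1e-8):
    # Commutativity constraint (9): W W' == W' W for every pair
    return all(np.allclose(A @ B, B @ A, atol=tol)
               for i, A in enumerate(mats) for B in mats[i + 1:])

# Diagonal matrices form a simple commuting family of normal matrices.
family = [np.diag(np.random.randn(4)) for _ in range(3)]
assert all(is_normal(W) for W in family) and commuting_family(family)
```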
4. Efficient Inference Algorithm
The constrained optimization (7) is computationally challenging due to the large number of model parameters in tensor $W$, the matrix normality constraints, and the quadratic number of pairwise commutativity constraints in (9).

Interestingly, by exploiting the special properties of commuting normal matrices, we will show in Corollary 4.2.1 that ANALOGY can be alternatively solved via another formulation of substantially lower complexity. Our findings are based on the following lemma and theorem:

Lemma 4.1 (Wilkinson, 1965). For any real normal matrix $A$, there exists a real orthogonal matrix $Q$ and a block-diagonal matrix $B$ such that $A = Q B Q^\top$, where each diagonal block of $B$ is either (1) a real scalar, or (2) a 2-dimensional real matrix of the form
$$\begin{bmatrix} x & -y \\ y & x \end{bmatrix}$$
where both $x$ and $y$ are real scalars.

The lemma suggests that any real normal matrix can be block-diagonalized into an almost-diagonal canonical form.
Theorem 4.2 (Proof given in the supplementary material). If a set of real normal matrices $A_1, A_2, \ldots$ forms a commuting family, namely $A_i A_j = A_j A_i$ for all $i, j$, then they can be block-diagonalized by the same real orthogonal basis $Q$.

The theorem above implies that the set of dense relational matrices $\{W_r\}_{r \in \mathcal{R}}$, if mutually commutative, can always be simultaneously block-diagonalized into another set of sparse almost-diagonal matrices $\{B_r\}_{r \in \mathcal{R}}$.

Corollary 4.2.1 (Alternative formulation for ANALOGY). For any given solution $(v^*, W^*)$ of optimization (7), there always exists an alternative set of embeddings $(u^*, B^*)$ such that $\phi_{v^*,W^*}(s, r, o) \equiv \phi_{u^*,B^*}(s, r, o)$ for all $(s, r, o)$, and $(u^*, B^*)$ is given by the solution of:

$$\min_{u, B} \; \mathbb{E}_{s,r,o,y \sim \mathcal{D}} \; \ell\big(\phi_{u,B}(s, r, o), \, y\big) \quad (10)$$
$$\text{s.t.} \quad B_r \in \mathcal{B}_m^n \quad \forall r \in \mathcal{R} \quad (11)$$

where $\mathcal{B}_m^n$ denotes the set of all $m \times m$ almost-diagonal matrices of the form in Lemma 4.1 with $n \leq m$ real scalars on the diagonal.

Proof sketch. Given the commutativity constraints, there must exist an orthogonal matrix $Q$ such that $W_r = Q B_r Q^\top$ with $B_r \in \mathcal{B}_m^n$ for all $r \in \mathcal{R}$. We can plug these expressions into optimization (7) and let $u = vQ$, obtaining

$$\phi_{v,W}(s, r, o) = v_s^\top W_r v_o = v_s^\top Q B_r Q^\top v_o \quad (12)$$
$$= u_s^\top B_r u_o = \phi_{u,B}(s, r, o) \quad (13)$$

In addition, it is not hard to verify that constraints (8) and (9) are automatically satisfied, by exploiting the facts that $Q$ is orthogonal and that $\mathcal{B}_m^n$ is a commuting family of normal matrices.

Constraint (11) in the alternative optimization problem can be handled by simply binding together the coefficients within each of the $2 \times 2$ blocks of $B_r$. Note that each $B_r$ consists of only $m$ free parameters, allowing the gradient w.r.t. any given triplet to be efficiently evaluated in $O(m)$ time.
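To make the $O(m)$ claim concrete, here is a sketch of the score $\phi_{u,B}(s,r,o)$ computed directly from the $m$ free parameters of $B_r$. This is our own illustration, under the assumption that the first $n$ coordinates carry the scalar blocks and the remaining coordinates are paired into $2 \times 2$ blocks:

```python
import numpy as np

def analogy_score(u_s, u_o, b_r, n):
    """phi(s, r, o) = u_s^T B_r u_o for an almost-diagonal B_r.

    b_r holds the m free parameters of B_r: the first n entries are real
    scalars; the remaining m - n entries are read in pairs (x, y), each
    encoding a 2x2 block [[x, -y], [y, x]] as in Lemma 4.1.
    """
    s = np.sum(b_r[:n] * u_s[:n] * u_o[:n])   # scalar blocks, O(n)
    x, y = b_r[n::2], b_r[n+1::2]             # 2x2 blocks, O(m - n)
    u1, u2 = u_s[n::2], u_s[n+1::2]
    w1, w2 = u_o[n::2], u_o[n+1::2]
    return s + np.sum(x * (u1 * w1 + u2 * w2) + y * (u2 * w1 - u1 * w2))

# Sanity check against the dense bilinear form u_s^T B_r u_o.
m, n = 6, 2
rng = np.random.default_rng(0)
u_s, u_o, b_r = rng.normal(size=(3, m))
B = np.zeros((m, m))
B[:n, :n] = np.diag(b_r[:n])
for k in range(n, m, 2):
    B[k:k+2, k:k+2] = [[b_r[k], -b_r[k+1]], [b_r[k+1], b_r[k]]]
assert np.isclose(analogy_score(u_s, u_o, b_r, n), u_s @ B @ u_o)
```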
5. Unified View of Representative Methods
In the following we provide a unified view of several embedding models (Yang et al., 2014; Trouillon et al., 2016; Nickel et al., 2016), by showing that they are restricted versions of our framework and hence implicitly impose analogical properties. This explains their strong empirical performance as compared to other baselines (§6).

DistMult (Yang et al., 2014) embeds both entities and relations as vectors, and defines the score function as

$$\phi(s, r, o) = \langle v_s, v_r, v_o \rangle \quad (14)$$
$$\text{where} \quad v_s, v_r, v_o \in \mathbb{R}^m \quad \forall s, r, o \quad (15)$$

where $\langle \cdot, \cdot, \cdot \rangle$ denotes the generalized inner product.

Proposition 5.1.
DistMult embeddings can be fully recovered by ANALOGY embeddings when $n = m$.

Proof. This is trivial to verify, as the score function (14) can be rewritten as $\phi(s, r, o) = v_s^\top B_r v_o$, where $B_r$ is a diagonal matrix given by $B_r = \operatorname{diag}(v_r)$.

Entity analogies are encouraged in DistMult, as the diagonal matrices $\operatorname{diag}(v_r)$ are both normal and mutually commutative. However, DistMult is restricted to modeling symmetric relations only, since $\phi(s, r, o) \equiv \phi(o, r, s)$.
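Proposition 5.1 can be sanity-checked in a few lines (our own illustration): with $n = m$ all blocks of $B_r$ are scalars, so $B_r = \operatorname{diag}(v_r)$ and the ANALOGY score reduces to the DistMult score:

```python
import numpy as np

v_s, v_r, v_o = np.random.default_rng(0).normal(size=(3, 5))
distmult = np.sum(v_s * v_r * v_o)    # <v_s, v_r, v_o>, Eq. (14)
analogy = v_s @ np.diag(v_r) @ v_o    # B_r = diag(v_r), i.e., n = m
assert np.isclose(distmult, analogy)
```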
ComplEx (Trouillon et al., 2016) extends the embeddings to the complex domain $\mathbb{C}$, and defines

$$\phi(s, r, o) = \Re\big(\langle v_s, v_r, \bar{v}_o \rangle\big) \quad (16)$$
$$\text{where} \quad v_s, v_r, v_o \in \mathbb{C}^m \quad \forall s, r, o \quad (17)$$

where $\bar{x}$ denotes the complex conjugate of $x$.

Proposition 5.2. ComplEx embeddings of embedding size $m$ can be fully recovered by ANALOGY embeddings of embedding size $2m$ when $n = 0$.

Proof. Let $\Re(x)$ and $\Im(x)$ be the real and imaginary parts of any complex vector $x$. We recast $\phi$ in (16) as

$$\phi(s, r, o) = \big\langle \Re(v_r), \Re(v_s), \Re(v_o) \big\rangle \quad (18)$$
$$+ \big\langle \Re(v_r), \Im(v_s), \Im(v_o) \big\rangle \quad (19)$$
$$+ \big\langle \Im(v_r), \Re(v_s), \Im(v_o) \big\rangle \quad (20)$$
$$- \big\langle \Im(v_r), \Im(v_s), \Re(v_o) \big\rangle = {v'_s}^\top B_r \, v'_o \quad (21)$$

The last equality is obtained via a change of variables: for any complex entity embedding $v \in \mathbb{C}^m$, we define a new real embedding $v' \in \mathbb{R}^{2m}$ such that

$$(v')_{2k-1} = \Re(v)_k, \quad (v')_{2k} = \Im(v)_k \quad \forall k = 1, 2, \ldots, m \quad (22)$$

The corresponding $B_r$ is a block-diagonal matrix in $\mathcal{B}_{2m}^0$ with its $k$-th block given by
$$\begin{bmatrix} \Re(v_r)_k & \Im(v_r)_k \\ -\Im(v_r)_k & \Re(v_r)_k \end{bmatrix}$$
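The change of variables in the proof can be verified numerically. The following is our own sketch; note that the signs of the off-diagonal block entries depend on which argument of the trilinear product carries the complex conjugate (flipping the convention amounts to conjugating $v_r$):

```python
import numpy as np

m = 4
rng = np.random.default_rng(0)
v_s, v_r, v_o = (rng.normal(size=m) + 1j * rng.normal(size=m) for _ in range(3))

# ComplEx score, Eq. (16): Re(<v_s, v_r, conj(v_o)>)
complex_score = np.real(np.sum(v_s * v_r * np.conj(v_o)))

def interleave(v):
    """Real ANALOGY embedding of size 2m: v' = (Re v_1, Im v_1, Re v_2, ...)."""
    u = np.empty(2 * len(v))
    u[0::2], u[1::2] = v.real, v.imag
    return u

u_s, u_o = interleave(v_s), interleave(v_o)
score = 0.0
for k in range(m):
    x, y = v_r[k].real, v_r[k].imag
    B_k = np.array([[x, y], [-y, x]])   # one 2x2 block per complex coordinate
    score += u_s[2*k:2*k+2] @ B_k @ u_o[2*k:2*k+2]

assert np.isclose(complex_score, score)
```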
HolE (Nickel et al., 2016) defines the score function as

$$\phi(s, r, o) = \langle v_r, v_s \star v_o \rangle \quad (23)$$
$$\text{where} \quad v_s, v_r, v_o \in \mathbb{R}^m \quad \forall s, r, o \quad (24)$$

where the association of $s$ and $o$ is implemented via the circular correlation denoted by $\star$. This formulation is motivated by holographic reduced representations (Plate, 2003).

To relate HolE to ANALOGY, we rewrite (23) in a bilinear form with a circulant matrix $C(v_r)$ in the middle:

$$\phi(s, r, o) = v_s^\top C(v_r) v_o \quad (25)$$

where the entries of a circulant matrix are defined as

$$C(x) = \begin{bmatrix} x_1 & x_m & \cdots & x_3 & x_2 \\ x_2 & x_1 & x_m & & x_3 \\ \vdots & x_2 & x_1 & \ddots & \vdots \\ x_{m-1} & & \ddots & \ddots & x_m \\ x_m & x_{m-1} & \cdots & x_2 & x_1 \end{bmatrix} \quad (26)$$

It is not hard to verify that circulant matrices are normal and commute with each other (Gray et al., 2006), hence entity analogies are encouraged in HolE, for which optimization (7) reduces to an unconstrained problem, as equalities (8) and (9) are automatically satisfied when all $W_r$'s are circulant.

The next proposition further reveals that HolE is equivalent to ComplEx up to a minor relaxation.

Proposition 5.3. HolE embeddings can be equivalently obtained using the following score function:

$$\phi(s, r, o) = \Re\big(\langle v_s, v_r, \bar{v}_o \rangle\big) \quad (27)$$
$$\text{where} \quad v_s, v_r, v_o \in \mathcal{F}(\mathbb{R}^m) \quad \forall s, r, o \quad (28)$$

where $\mathcal{F}(\mathbb{R}^m)$ denotes the image of $\mathbb{R}^m$ in $\mathbb{C}^m$ under the Discrete Fourier Transform (DFT). In particular, the above reduces to ComplEx by relaxing $\mathcal{F}(\mathbb{R}^m)$ to $\mathbb{C}^m$.

Proof. Let $\mathcal{F}$ be the DFT operator defined by $\mathcal{F}(x) = Fx$, where $F \in \mathbb{C}^{m \times m}$ is the Fourier basis of the DFT. A well-known property of circulant matrices is that any $C(x)$ can always be diagonalized by $F$, and its eigenvalues are given by $Fx$ (Gray et al., 2006). Hence the score function can be further recast as

$$\phi(s, r, o) = v_s^\top F^{-1} \operatorname{diag}(F v_r) F v_o \quad (29)$$
$$= \frac{1}{m} (\bar{F} v_s)^\top \operatorname{diag}(F v_r)(F v_o) \quad (30)$$
$$= \frac{1}{m} \big\langle \overline{\mathcal{F}(v_s)}, \mathcal{F}(v_r), \mathcal{F}(v_o) \big\rangle \quad (31)$$
$$= \Re\Big[\frac{1}{m} \big\langle \overline{\mathcal{F}(v_s)}, \mathcal{F}(v_r), \mathcal{F}(v_o) \big\rangle\Big] \quad (32)$$

Letting $v'_s = \mathcal{F}(v_s)$, $v'_o = \mathcal{F}(v_o)$ and $v'_r = \frac{1}{m}\mathcal{F}(v_r)$, we obtain exactly the same form of score function as used in ComplEx:

$$\phi(s, r, o) = \Re\big(\langle \bar{v}'_s, v'_r, v'_o \rangle\big) \quad (33)$$

(33) is equivalent to (16) apart from the additional constraint that $v'_s$, $v'_r$, $v'_o$ are the images of real vectors in the Fourier domain.
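Proposition 5.3 can likewise be verified numerically with the FFT. The sketch below is our own; the placement of the conjugate in the Fourier-domain product depends on the direction of the circular correlation, and we use the convention that makes the two scores agree:

```python
import numpy as np

m = 5
rng = np.random.default_rng(0)
v_s, v_r, v_o = rng.normal(size=(3, m))

# HolE score, Eq. (23): <v_r, v_s * v_o> with * the circular correlation,
# [v_s * v_o]_k = sum_i v_s[i] * v_o[(i + k) mod m]
corr = np.array([np.sum(v_s * np.roll(v_o, -k)) for k in range(m)])
hole = v_r @ corr

# The same score in the Fourier domain: a ComplEx-style trilinear product
# over the DFT images of the real embeddings, scaled by 1/m.
fs, fr, fo = np.fft.fft(v_s), np.fft.fft(v_r), np.fft.fft(v_o)
complex_style = np.real(np.sum(fs * fr * np.conj(fo))) / m

assert np.isclose(hole, complex_style)
```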
6. Experiments
We evaluate ANALOGY and the baselines over two benchmark datasets for multi-relational embedding released by previous work (Bordes et al., 2013), namely a subset of Freebase (FB15K) for generic facts and WordNet (WN18) for lexical relationships between words. The dataset statistics are summarized in Table 1.
Table 1. Dataset statistics for FB15K and WN18.

Dataset   |E|      |R|
WN18      40,943   18
FB15K     14,951   1,345
We compare the performance of ANALOGY against a variety of multi-relational embedding models developed in recent years, which can be categorized as follows:

• Translation-based models, where relations are modeled as translation operators in the embedding space, including TransE (Bordes et al., 2013) and its variants TransH (Wang et al., 2014), TransR (Lin et al., 2015b), TransD (Ji et al., 2015), STransE (Nguyen et al., 2016) and RTransE (Garcia-Duran et al., 2015).

• Multi-relational latent factor models, including LFM (Jenatton et al., 2012) and RESCAL (Nickel et al., 2011), based on collective matrix factorization.

• Models involving neural network components, such as neural tensor networks (Socher et al., 2013) and PTransE-RNN (Lin et al., 2015a), where RNN stands for recurrent neural networks.

• Pathwise models, including three different variants of PTransE (Lin et al., 2015a), which extend TransE by explicitly taking into account indirect connections (relational paths) between entities.

• Models subsumed under our proposed framework (§5), namely DistMult (Yang et al., 2014), HolE (Nickel et al., 2016) and ComplEx (Trouillon et al., 2016).

• Models enhanced by external side information. We use Node+LinkFeat (NLF) (Toutanova & Chen, 2015) as a representative example, which leverages textual mentions derived from the ClueWeb corpus.
Following the literature on multi-relational embedding, we use the conventional metrics of Hits@k and Mean Reciprocal Rank (MRR), which evaluate each system-produced ranked list for each test instance and average the scores over all ranked lists for the entire test set. The two metrics could be flawed because the negative instances created in the test phase may include positive instances from the training and validation sets (Bordes et al., 2013). A recommended remedy, which we followed, is to remove all training- and validation-set triples from all ranked lists during testing. We use "filt." and "raw" to indicate the evaluation metrics with and without filtering, respectively.

In the first set of our experiments, we focused on Hits@k with k = 10, which has been reported for most methods in the literature. We also provide additional results of ANALOGY and a subset of representative baseline methods using MRR, Hits@1 and Hits@3, to enable comparison with the methods whose published results are in those metrics.
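To make the filtered protocol precise, here is a sketch of the evaluation loop (our own illustrative implementation for object prediction; score_fn stands for any trained scoring function, and all names are our own):

```python
import numpy as np

def filtered_metrics(test_triples, known_triples, score_fn, num_entities, k=10):
    """Filtered MRR and Hits@k: candidates forming known positives are
    removed from each ranked list before ranking the target object."""
    known = set(known_triples)
    rr, hits = [], []
    for s, r, o in test_triples:
        scores = np.array([score_fn(s, r, c) for c in range(num_entities)])
        for c in range(num_entities):
            if c != o and (s, r, c) in known:
                scores[c] = -np.inf        # filter out other true triples
        rank = 1 + int(np.sum(scores > scores[o]))
        rr.append(1.0 / rank)
        hits.append(rank <= k)
    return np.mean(rr), np.mean(hits)
```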
6.4.1. Loss Function
We use the logistic loss for ANALOGY throughout all experiments, namely $\ell(\phi(s, r, o), y) = -\log \sigma(y \, \phi(s, r, o))$, where $\sigma$ is the sigmoid activation function. We empirically found this simple loss function to perform reasonably well compared to more sophisticated ranking loss functions.
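In code the loss is a one-liner (a sketch, written in a numerically stable form):

```python
import numpy as np

def logistic_loss(phi, y):
    # l(phi(s,r,o), y) = -log sigmoid(y * phi) = log(1 + exp(-y * phi))
    return np.logaddexp(0.0, -y * phi)
```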
6.4.2. Asynchronous AdaGrad

Our C++ implementation runs on a CPU, as ANALOGY only requires lightweight linear algebra routines (code available at https://github.com/quark0/ANALOGY). We use asynchronous stochastic gradient descent (SGD) for optimization, where the gradients with respect to different mini-batches are simultaneously evaluated in multiple threads, and the gradient updates for the shared model parameters are carried out without synchronization. Asynchronous SGD is highly efficient, and causes little performance drop when the parameters associated with different mini-batches are mutually disjoint with high probability (Recht et al., 2011). We adapt the learning rate based on historical gradients using AdaGrad (Duchi et al., 2011).
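The per-parameter AdaGrad rule applied inside each worker thread looks as follows (a single-threaded sketch of our own; in the Hogwild-style setting, each thread runs this update on its own mini-batches without locks):

```python
import numpy as np

def adagrad_update(theta, grad, hist, lr=0.1, eps=1e-8):
    """One AdaGrad step (Duchi et al., 2011): the step size of each
    coordinate is scaled by the root of its accumulated squared gradients."""
    hist += grad ** 2
    theta -= lr * grad / (np.sqrt(hist) + eps)
    return theta, hist
```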
6.4.3. Creation of Negative Samples

Since only valid triples (positive instances) are explicitly given in the training set, invalid triples (negative instances) need to be created artificially. Specifically, for every positive example $(s, r, o)$, we generate three negative instances $(s', r, o)$, $(s, r', o)$, $(s, r, o')$ by corrupting $s$, $r$, $o$ with random entities/relations $s' \in \mathcal{E}$, $r' \in \mathcal{R}$, $o' \in \mathcal{E}$. The union of all positive and negative instances defines our data distribution $\mathcal{D}$ for SGD updates.
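A sketch of the corruption scheme described above (our own illustration):

```python
import random

def corrupt(triple, entities, relations):
    """Generate the three negative instances for a positive triple (s, r, o)
    by replacing its subject, relation, and object with random draws."""
    s, r, o = triple
    return [
        (random.choice(entities), r, o),   # (s', r, o)
        (s, random.choice(relations), o),  # (s, r', o)
        (s, r, random.choice(entities)),   # (s, r, o')
    ]
```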
Table 2. Hits@10 (filt.) of all models on WN18 and FB15K, categorized into three groups: (i) 19 baselines that do not model analogies; (ii) 3 baselines and our proposed ANALOGY, which implicitly or explicitly enforce analogical properties over the induced embeddings (see §5); (iii) one baseline enhanced by external side information.

Models                                      WN18    FB15K
Unstructured (Bordes et al., 2013)          38.2    6.3
RESCAL (Nickel et al., 2011)                52.8    44.1
NTN (Socher et al., 2013)                   66.1    41.4
SME (Bordes et al., 2012)                   74.1    41.3
SE (Bordes et al., 2011)                    80.5    39.8
LFM (Jenatton et al., 2012)                 81.6    33.1
TransH (Wang et al., 2014)                  86.7    64.4
TransE (Bordes et al., 2013)                89.2    47.1
TransR (Lin et al., 2015b)                  92.0    68.7
TKRL (Xie et al., 2016)                     –       73.4
RTransE (Garcia-Duran et al., 2015)         –       76.2
TransD (Ji et al., 2015)                    92.2    77.3
CTransR (Lin et al., 2015b)                 92.3    70.2
KG2E (He et al., 2015)                      93.2    74.0
STransE (Nguyen et al., 2016)               93.4    79.7
DistMult (Yang et al., 2014)                93.6    82.4
TransSparse (Ji et al., 2016)               93.9    78.3
PTransE-MUL (Lin et al., 2015a)             –       77.7
PTransE-RNN (Lin et al., 2015a)             –       82.2
PTransE-ADD (Lin et al., 2015a)             –       84.6
NLF (with external corpus) (Toutanova & Chen, 2015)
ComplEx (Trouillon et al., 2016)
6.4.4. Model Selection
We conducted a grid search to find the hyperparameters of ANALOGY which maximize the filtered MRR on the validation set, by enumerating all combinations of the embedding size $m$, the $\ell_2$ weight decay factor $\lambda$ of the model coefficients $v$ and $W$ (over three candidate values of the form $10^{-k}$), and the ratio $\alpha \in \{3, 6\}$ of negative over positive samples. The resulting hyperparameters are $m = 200$ with $\alpha = 3$ for the WN18 dataset, and $m = 200$ with $\alpha = 6$ for the FB15K dataset. The number of scalars on the diagonal of each $B_r$ is always set to $m/2$. We set the initial learning rate to 0.1 for both datasets and adjust it using AdaGrad during optimization. All models are trained for 500 epochs.
Table 3. MRR and Hits@{1,3} of a subset of representative models on WN18 and FB15K. The performance scores of TransE and RESCAL are from the results published in (Trouillon et al., 2016) and (Nickel et al., 2016), respectively.

WN18
Models                         MRR(filt.)  MRR(raw)  Hits@1(filt.)  Hits@3(filt.)
RESCAL (Nickel et al., 2011)   89.0        60.3      84.2           90.4
TransE (Bordes et al., 2013)   45.4        33.5      8.9            82.3
DistMult (Yang et al., 2014)   82.2        53.2      72.8           91.4
HolE (Nickel et al., 2016)     93.8        61.6

FB15K
Models                         MRR(filt.)  MRR(raw)  Hits@1(filt.)  Hits@3(filt.)
RESCAL (Nickel et al., 2011)   35.4        18.9      23.5           40.9
TransE (Bordes et al., 2013)   38.0        22.1      23.1           47.2
DistMult (Yang et al., 2014)   65.4        24.2      54.6           73.3
HolE (Nickel et al., 2016)

Table 2 compares the Hits@10 scores of ANALOGY with those of 23 competing methods, using the scores published in the literature on the WN18 and FB15K datasets. For the methods not having both scores, the missing slots are indicated by "–". The best score on each dataset is marked in bold face; if the differences between the second or third best scores and the top score are not statistically significant, those scores are also bold-faced. We used a one-sample proportion test (Yang & Liu, 1999) at the 5% p-value level to test statistical significance. (Proportion tests only apply to performance scores that are proportions, such as Hits@k, and not to non-proportional scores such as MRR; hence we only conducted the proportion tests on the Hits@k scores.)

Table 3 compares the methods (including ours) whose results in the additional metrics are available. The usage of bold faces is the same as in Table 2.

In both tables, ANALOGY performs either the best, or second best within the equivalence class of the best score according to the statistical significance test. Specifically, on the harder FB15K dataset in Table 2, which has a very large number of relations, our model outperforms all baseline methods. These results provide good evidence for the effective modeling of analogical structures in our approach. We are pleased to see in Table 3 that ANALOGY outperforms DistMult, ComplEx and HolE on all the metrics, as the latter three can be viewed as more constrained versions of our method (as discussed in §5).
Our analysis of the relationship between HolE and ComplEx (§5) is justified in the same table by the fact that the performance of HolE is dominated by that of ComplEx.

In Figure 3 we show the empirical scalability of ANALOGY: it not only completes one epoch within seconds on both datasets, but also scales linearly in the size of the embedding problem. Compared to single-threaded AdaGrad, our asynchronous AdaGrad over 16 CPU threads offers 11.4x and 8.3x speedups on FB15K and WN18, respectively, on a single commercial desktop.
Figure 3.
CPU run time per epoch (in seconds) of ANALOGY. The figure on the left shows the run time over increasing embedding sizes with 16 CPU threads; the figure on the right shows the run time over an increasing number of CPU threads with embedding size 200.
7. Conclusion
We presented a novel framework for explicitly modeling analogical structures in multi-relational embedding, along with a differentiable objective function and a linear-time inference algorithm for large-scale embedding of knowledge graphs. The proposed approach obtains state-of-the-art results on two popular benchmark datasets, outperforming a large number of strong baselines in most cases.

Although we focused on multi-relational inference for knowledge-base embedding, we believe that analogical structures exist in many other machine learning problems beyond the scope of this paper. We hope this work sheds light on a broad range of important problems where scalable inference for analogical analysis would make an impact, such as machine translation and image captioning (both of which require modeling cross-domain analogies). We leave these interesting topics as future work.
Acknowledgments
We thank the reviewers for their helpful comments. This work is supported in part by the National Science Foundation (NSF) under grant IIS-1546329.
References
Adámek, Jiří, Herrlich, Horst, and Strecker, George E. Abstract and concrete categories: The joy of cats. 2004.

Auer, Sören, Bizer, Christian, Kobilarov, Georgi, Lehmann, Jens, Cyganiak, Richard, and Ives, Zachary. DBpedia: A nucleus for a web of open data. In The Semantic Web, pp. 722–735. Springer, 2007.

Bollacker, Kurt, Evans, Colin, Paritosh, Praveen, Sturge, Tim, and Taylor, Jamie. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250. ACM, 2008.

Bordes, Antoine, Weston, Jason, Collobert, Ronan, and Bengio, Yoshua. Learning structured embeddings of knowledge bases. In Conference on Artificial Intelligence, number EPFL-CONF-192344, 2011.

Bordes, Antoine, Glorot, Xavier, Weston, Jason, and Bengio, Yoshua. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, volume 22, pp. 127–135, 2012.

Bordes, Antoine, Usunier, Nicolas, Garcia-Duran, Alberto, Weston, Jason, and Yakhnenko, Oksana. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795, 2013.

Brown, Ronald and Porter, Tim. Category theory: an abstract setting for analogy and comparison. In What is Category Theory, volume 3, pp. 257–274, 2006.

Chen, Danqi, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Learning new facts from knowledge bases with neural tensor networks and semantic word vectors. arXiv preprint arXiv:1301.3618, 2013.

Dalton, Jeffrey, Dietz, Laura, and Allan, James. Entity query feature expansion using knowledge base links. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 365–374. ACM, 2014.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Dunford, Nelson, Schwartz, Jacob T, Bade, William G, and Bartle, Robert G. Linear Operators. Wiley-Interscience, New York, 1971.

Falkenhainer, Brian, Forbus, Kenneth D, and Gentner, Dedre. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41(1):1–63, 1989.

Ferrucci, David, Brown, Eric, Chu-Carroll, Jennifer, Fan, James, Gondek, David, Kalyanpur, Aditya A, Lally, Adam, Murdock, J William, Nyberg, Eric, Prager, John, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

Gabrilovich, Evgeniy and Markovitch, Shaul. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research, 34:443–498, 2009.

Garcia-Duran, Alberto, Bordes, Antoine, and Usunier, Nicolas. Composing relationships with translations. PhD thesis, CNRS, Heudiasyc, 2015.

Gentner, Dedre. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155–170, 1983.

Getoor, Lise. Introduction to Statistical Relational Learning. MIT Press, 2007.

Gray, Robert M et al. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory, 2(3):155–239, 2006.

Guu, Kelvin, Miller, John, and Liang, Percy. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094, 2015.

He, Shizhu, Liu, Kang, Ji, Guoliang, and Zhao, Jun. Learning to represent knowledge graphs with Gaussian embedding. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 623–632. ACM, 2015.

Hofstadter, Douglas R. Analogy as the core of cognition. The Analogical Mind: Perspectives from Cognitive Science, pp. 499–538, 2001.

Holyoak, Keith J, Holyoak, Keith James, and Thagard, Paul. Mental Leaps: Analogy in Creative Thought. MIT Press, 1996.

Jenatton, Rodolphe, Roux, Nicolas L, Bordes, Antoine, and Obozinski, Guillaume R. A latent factor model for highly multi-relational data. In Advances in Neural Information Processing Systems, pp. 3167–3175, 2012.

Ji, Guoliang, He, Shizhu, Xu, Liheng, Liu, Kang, and Zhao, Jun. Knowledge graph embedding via dynamic mapping matrix. In ACL (1), pp. 687–696, 2015.

Ji, Guoliang, Liu, Kang, He, Shizhu, and Zhao, Jun. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 985–991, 2016.

Lin, Yankai, Liu, Zhiyuan, Luan, Huanbo, Sun, Maosong, Rao, Siwei, and Liu, Song. Modeling relation paths for representation learning of knowledge bases. arXiv preprint arXiv:1506.00379, 2015a.

Lin, Yankai, Liu, Zhiyuan, Sun, Maosong, Liu, Yang, and Zhu, Xuan. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pp. 2181–2187, 2015b.

Liu, Hanxiao and Yang, Yiming. Bipartite edge prediction via transductive learning over product graphs. In ICML, pp. 1880–1888, 2015.

Liu, Hanxiao and Yang, Yiming. Cross-graph learning of multi-relational associations. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2235–2243, 2016.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Min, Bonan, Grishman, Ralph, Wan, Li, Wang, Chang, and Gondek, David. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL, pp. 777–782, 2013.

Minsky, Marvin. Society of Mind. Simon and Schuster, 1988.

Nguyen, Dat Quoc, Sirts, Kairit, Qu, Lizhen, and Johnson, Mark. STransE: a novel embedding model of entities and relationships in knowledge bases. arXiv preprint arXiv:1606.08140, 2016.

Nickel, Maximilian, Tresp, Volker, and Kriegel, Hans-Peter. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 809–816, 2011.

Nickel, Maximilian, Murphy, Kevin, Tresp, Volker, and Gabrilovich, Evgeniy. A review of relational machine learning for knowledge graphs. arXiv preprint arXiv:1503.00759, 2015.

Nickel, Maximilian, Rosasco, Lorenzo, and Poggio, Tomaso A. Holographic embeddings of knowledge graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 1955–1961, 2016.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In EMNLP, volume 14, pp. 1532–1543, 2014.

Plate, Tony A. Holographic reduced representation: Distributed representation for cognitive structures. 2003.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Singhal, Amit. Introducing the knowledge graph: things, not strings. Official Google Blog, 2012.

Socher, Richard, Chen, Danqi, Manning, Christopher D, and Ng, Andrew. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pp. 926–934, 2013.

Toutanova, Kristina and Chen, Danqi. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66, 2015.

Trouillon, Théo, Welbl, Johannes, Riedel, Sebastian, Gaussier, Éric, and Bouchard, Guillaume. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 2071–2080, 2016. URL http://jmlr.org/proceedings/papers/v48/trouillon16.html.

Turney, Peter D. The latent relation mapping engine: Algorithm and experiments. Journal of Artificial Intelligence Research, 33:615–655, 2008.

Wang, Zhen, Zhang, Jianwen, Feng, Jianlin, and Chen, Zheng. Knowledge graph embedding by translating on hyperplanes. In AAAI, pp. 1112–1119. Citeseer, 2014.

Wilkinson, James Hardy. The Algebraic Eigenvalue Problem, volume 87. Clarendon Press, Oxford, 1965.

Xie, Ruobing, Liu, Zhiyuan, and Sun, Maosong. Representation learning of knowledge graphs with hierarchical types. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2965–2971, 2016.

Yang, Bishan, Yih, Wen-tau, He, Xiaodong, Gao, Jianfeng, and Deng, Li. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575, 2014. URL http://arxiv.org/abs/1412.6575.

Yang, Yiming and Liu, Xin. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49, 1999.