A Physical Embedding Model for Knowledge Graphs
Caglar Demir and Axel-Cyrille Ngonga Ngomo
DICE Research Group, Paderborn University, 33098 Paderborn, Germany
[email protected]
Abstract.
Knowledge graph embedding methods learn continuous vector representations for entities in knowledge graphs and have been used successfully in a large number of applications. We present a novel and scalable paradigm for the computation of knowledge graph embeddings, which we dub PYKE. Our approach combines a physical model based on Hooke's law and its inverse with ideas from simulated annealing to compute embeddings for knowledge graphs efficiently. We prove that PYKE achieves a linear space complexity. While the time complexity for the initialization of our approach is quadratic, the time complexity of each of its iterations is linear in the size of the input knowledge graph. Hence, PYKE's overall runtime is close to linear. Consequently, our approach easily scales up to knowledge graphs containing millions of triples. We evaluate our approach against six state-of-the-art embedding approaches on the DrugBank and DBpedia datasets in two series of experiments. The first series shows that the cluster purity achieved by PYKE is up to 26% (absolute) better than that of the state of the art. In addition, PYKE is more than 22 times faster than existing embedding solutions in the best case. The results of our second series of experiments show that PYKE is up to 23% (absolute) better than the state of the art on the task of type prediction while maintaining its superior scalability. Our implementation and results are open-source and are available at http://github.com/dice-group/PYKE.

Keywords: Knowledge graph embedding, Hooke's law, type prediction

This work was supported by the German Federal Ministry of Transport and Digital Infrastructure project OPAL (GA: 19F2028A) as well as the H2020 Marie Skłodowska-Curie project KnowGraphs (GA no. 860801).
1 Introduction

The number and size of knowledge graphs (KGs) available on the Web and in companies grows steadily. For example, more than 150 billion facts describing more than 3 billion things are available in the more than 10,000 knowledge graphs published on the Web as Linked Data (see https://lod-cloud.net/ and lodstats.aksw.org). Knowledge graph embedding (KGE) approaches aim to map the entities contained in knowledge graphs to n-dimensional vectors [19,13,22]. Accordingly, they parallel word embeddings from the field of natural language processing [11,14] and the improvement they brought about in various tasks (e.g., word analogy, question answering, named entity recognition and relation extraction). Applications of KGEs include collective machine learning, type prediction, link prediction, entity resolution, knowledge graph completion and question answering [13,2,12,19,22,15]. In this work, we focus on type prediction. We present a novel approach for KGE based on a physical model, which goes beyond the state of the art (see [19] for a survey) w.r.t. both efficiency and effectiveness. Our approach, dubbed PYKE, combines a physical model (based on Hooke's law) with an optimization technique inspired by simulated annealing. PYKE scales to large KGs by achieving a linear space complexity while being close to linear in its time complexity on large KGs. We compare the performance of PYKE with that of six state-of-the-art approaches (Word2Vec [11], ComplEx [18], RESCAL [13], TransE [2], DistMult [22] and Canonical Polyadic (CP) decomposition [6]) on two tasks, i.e., clustering and type prediction, w.r.t. both runtime and prediction accuracy. Our results corroborate our formal analysis of PYKE and suggest that our approach scales close to linearly with the size of the input graph w.r.t. its runtime. In addition to outperforming the state of the art w.r.t. runtime, PYKE also achieves better cluster purity and type prediction scores.

The rest of this paper is structured as follows: after providing a brief overview of related work in Section 2, we present the mathematical framework underlying PYKE in Section 3. Thereafter, we present PYKE in Section 4. Section 5 presents the space and time complexity of PYKE. We report on the results of our experimental evaluation in Section 6. Finally, we conclude with a discussion and an outlook on future work in Section 7.
2 Related Work

A large number of KGE approaches have been developed in recent years to address tasks such as link prediction, graph completion and question answering [7,8,12,13,18]. In the following, we give a brief overview of some of these approaches; more details can be found in the survey [19]. RESCAL [13] is based on computing a three-way factorization of an adjacency tensor representing the input KG. The adjacency tensor is decomposed into a product of a core tensor and embedding matrices. RESCAL captures rich interactions in the input KG but is limited in its scalability. HolE [12] uses circular correlation as its compositional operator. Holographic embeddings of knowledge graphs yield state-of-the-art results on the link prediction task while keeping the memory complexity lower than that of RESCAL and TransR [8]. ComplEx [18] is a KGE model based on latent factorization, wherein complex-valued embeddings are used to handle a large variety of binary relations, including symmetric and antisymmetric relations.

Energy-based KGE models [1,2,3] yield competitive performances on link prediction, graph completion and entity resolution. SE [3] learns one low-dimensional vector (∈ R^k) for each entity and two matrices (R_1, R_2 ∈ R^{k×k}) for each relation. Hence, for a given triple (h, r, t), SE aims to minimize the L1 distance, i.e., f_r(h, t) = ||R_1 h − R_2 t||_1. The approach in [1] embeds entities and relations into the same embedding space and suggests capturing correlations between entities and relations by using multiple matrix products. TransE [2] is a scalable energy-based KGE model wherein a relation r between entities h and t corresponds to a translation of their embeddings, i.e., h + r ≈ t, provided that (h, r, t) exists in the KG. TransE outperforms state-of-the-art models in the link prediction task on several benchmark KG datasets while being able to deal with KGs containing up to 17 million facts. DistMult [22] generalizes neural embedding models under a unified learning framework, wherein relations are bilinear or linear mapping functions between the embeddings of entities.

With PYKE, we propose a different take on generating embeddings by combining a physical model with simulated annealing. Our evaluation suggests that this simulation-based approach to generating embeddings scales well (i.e., linearly in the size of the KG) while outperforming the state of the art in the type prediction and clustering quality tasks [21,20].
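For illustration, the translational idea behind TransE is easy to state in code. The following is a minimal sketch based solely on the description above (our own illustration, not the authors' reference implementation):

```python
import numpy as np

def transe_energy(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """Energy of a triple (h, r, t) under TransE: low when h + r is close to t."""
    return float(np.linalg.norm(h + r - t))

# A triple that holds should receive a lower energy than a corrupted one.
h, r = np.array([0.1, 0.2]), np.array([0.3, -0.1])
t_true, t_corrupt = np.array([0.4, 0.1]), np.array([5.0, 5.0])
assert transe_energy(h, r, t_true) < transe_energy(h, r, t_corrupt)
```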
3 Preliminaries

In this section, we present the core notation and terminology used throughout this paper. The symbols we use and their meanings are summarized in Table 1.
RDF knowledge graphs. In this work, we compute embeddings for RDF KGs. Let R be the set of all RDF resources, B be the set of all RDF blank nodes, P ⊆ R be the set of all properties and L denote the set of all RDF literals. An RDF KG G is a set of RDF triples (s, p, o) where s ∈ R ∪ B, p ∈ P and o ∈ R ∪ B ∪ L. We aim to compute embeddings for resources and blank nodes. Hence, we define the vocabulary of an RDF knowledge graph G as V = {x : x ∈ R ∪ P ∪ B ∧ ∃(s, p, o) ∈ G : x ∈ {s, p, o}}. Essentially, V stands for all the URIs and blank nodes found in G. Finally, we define the subjects with type information of G as S = {x : x ∈ R \ P ∧ (x, rdf:type, o) ∈ G}, where rdf:type stands for the instantiation relation in RDF.

Hooke's law. Hooke's law describes the relation between a deforming force on a spring and the magnitude of the deformation within the elastic regime of said spring. The increase of a deforming force on the spring is linearly related to the increase of the magnitude of the corresponding deformation. In equation form, Hooke's law can be expressed as follows:

F = −k∆, (1)

where F is the deforming force, ∆ is the magnitude of the deformation and k is the spring constant. Let us assume two points of unit mass located at x and y, respectively. We assume that the two points are connected by an ideal spring with a spring constant k, an infinite elastic regime and an initial length of 0. Then, the force they are subjected to has a magnitude of k||x − y||. Note that the magnitude of this force grows with the distance between the two mass points.
Table 1: Overview of our notation

Notation     Description
G            An RDF knowledge graph
R, P, B, L   Sets of all RDF resources, predicates, blank nodes and literals, respectively
S            Set of all RDF subjects with type information
V            Vocabulary of G
σ            Similarity function on V
→x_t         Embedding of x at time t
F_a, F_r     Attractive and repulsive forces, respectively
K            Threshold for positive and negative examples
P            Function mapping each x ∈ V to a set of attracting elements of V
N            Function mapping each x ∈ V to a set of repulsive elements of V
P(·)         Probability
ω            Repulsive constant
E            System energy
ε            Upper bound on the alteration of the locations of x ∈ V across two iterations
∆e           Energy release

The inverse of Hooke's law, where

F_inverse = −k/∆, (2)

has the opposite behavior: it becomes weaker with the distance between the two mass points it connects.

Positive Pointwise Mutual Information. The Positive Pointwise Mutual Information (PPMI) is a means to capture the strength of the association between two events (e.g., appearing in a triple of a KG). Let a and b be two events, let P(a, b) stand for the joint probability of a and b, and let P(a) and P(b) stand for the probabilities of a and b, respectively. Then, PPMI(a, b) is defined as

PPMI(a, b) = max(0, log( P(a, b) / (P(a) P(b)) )). (3)

The equation truncates all negative values to 0, as measuring the strength of dissociation between events accurately demands very large sample sizes, which are seldom available in practice.
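Eq. (3) can be transcribed directly; a minimal sketch in Python (the probability estimates are assumed to be given here; how they are obtained for KGs follows in Section 4):

```python
import math

def ppmi(p_ab: float, p_a: float, p_b: float) -> float:
    """PPMI of two events a and b (Eq. 3); negative associations are truncated to 0."""
    if p_ab == 0.0:
        return 0.0  # no co-occurrence observed; log(0) is undefined
    return max(0.0, math.log(p_ab / (p_a * p_b)))
```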
4 PYKE

In this section, we introduce our novel KGE approach dubbed PYKE (a physical model for knowledge graph embeddings). Section 4.1 presents the intuition behind our model.
In Section 4.2, we give an overview of the PYKE framework, from processing the input KG to learning embeddings for the input in a vector space with a predefined number of dimensions. The workflow of our model is further elucidated using the running example shown in Figure 1.

4.1 Intuition

PYKE is an iterative approach that aims to represent each element x of the vocabulary V of an input KG G as an embedding (i.e., a vector) in the n-dimensional space R^n. Our approach begins by assuming that each element of V is mapped to a single point (i.e., its embedding) of unit mass, whose location can be expressed via an n-dimensional vector in R^n according to an initial (e.g., random) distribution at iteration t = 0. In the following, we use →x_t to denote the embedding of x ∈ V at iteration t. We also assume a similarity function σ : V × V → [0, ∞) (e.g., a PPMI-based similarity) over V to be given. Simply put, our goal is to improve this initial distribution iteratively over a predefined maximal number of iterations (denoted T) by ensuring that

1. the embeddings of similar elements of V are close to each other, while
2. the embeddings of dissimilar elements of V are distant from each other.

Let d : R^n × R^n → R^+ be the distance (e.g., the Euclidean distance) between two embeddings in R^n. According to our goal definition, a good iterative embedding approach should have the following characteristics:

C1: If σ(x, y) > 0, then d(→x_t, →y_t) ≤ d(→x_{t−1}, →y_{t−1}). This means that the embeddings of similar terms should become more similar with the number of iterations. The same holds the other way around:

C2: If σ(x, y) = 0, then d(→x_t, →y_t) ≥ d(→x_{t−1}, →y_{t−1}).

We translate C1 into our model as follows: If x and y are similar (i.e., if σ(x, y) > 0), then a force F_a(→x_t, →y_t) of attraction must exist between the masses which stand for x and y at any time t. F_a(→x_t, →y_t) must be proportional to d(→x_t, →y_t), i.e., the attraction must grow with the distance between →x_t and →y_t. These conditions are fulfilled by setting the following force of attraction between the two masses:

||F_a(→x_t, →y_t)|| = σ(x, y) × d(→x_t, →y_t). (4)

From the perspective of a physical model, this is equivalent to placing a spring with a spring constant of σ(x, y) between the unit masses which stand for x and y. At time t, these masses are hence accelerated towards each other with a total acceleration proportional to ||F_a(→x_t, →y_t)||.

The translation of C2 into a physical model is as follows: If x and y are not similar (i.e., if σ(x, y) = 0), we assume that they are dissimilar. Correspondingly, their embeddings should diverge with time. The magnitude of the repulsive force between the two masses representing x and y should be strong if the masses are close to each other and should diminish with the distance between the two masses. We can fulfill this condition by setting the following repulsive force between the two masses:

||F_r(→x_t, →y_t)|| = −ω / d(→x_t, →y_t), (5)

where ω > 0 denotes a constant, which we dub the repulsive constant. At iteration t, the embeddings of dissimilar terms are hence accelerated away from each other with a total acceleration proportional to ||F_r(→x_t, →y_t)||. This is the inverse of Hooke's law, where the magnitude of the repulsive force between the mass points which stand for two dissimilar terms decreases with the distance between the two mass points.

Based on these intuitions, we can now formulate the goal of PYKE formally: We aim to find embeddings for all elements of V which minimize the total distance between similar elements and maximize the total distance between dissimilar elements. Let P : V → 2^V be a function which maps each element of V to the subset of V it is similar to. Analogously, let N : V → 2^V map each element of V to the subset of V it is dissimilar to. PYKE aims to optimize the following objective function:

J(V) = Σ_{x∈V} Σ_{y∈P(x)} d(→x, →y) − Σ_{x∈V} Σ_{y∈N(x)} d(→x, →y). (6)

4.2 The PYKE framework

PYKE implements this intuition as follows: Given an input KG G, PYKE first constructs a symmetric similarity matrix A of dimensions |V| × |V|. We use a_{x,y} to denote the similarity coefficient between x ∈ V and y ∈ V stored in A. PYKE truncates this matrix to (1) reduce the effect of oversampling and (2) accelerate subsequent computations. The initial embeddings of all x ∈ V in R^n are then determined. Subsequently, PYKE uses the physical model described above to improve the embeddings iteratively. The iteration is run at most T times or until the objective function J(V) stops decreasing. In the following, we explain each of the steps of the approach in detail, using the RDF graph shown in Figure 1 as a running example.
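As a reference point for the steps that follow, Eq. (6) can be transcribed directly. A minimal sketch (the dictionary-based signature is our own choice for illustration, not the released API):

```python
import numpy as np

def objective(emb, positives, negatives):
    """J(V) from Eq. (6): total distance between similar elements minus
    total distance between dissimilar elements. `emb` maps each vocabulary
    term to its current vector; `positives`/`negatives` map each term to
    P(x) and N(x), respectively."""
    dist = lambda x, y: float(np.linalg.norm(emb[x] - emb[y]))
    attract = sum(dist(x, y) for x in emb for y in positives[x])
    repel = sum(dist(x, y) for x in emb for y in negatives[x])
    return attract - repel
```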
Building the similarity matrix. For any two elements x, y ∈ V, we set a_{x,y} = σ(x, y) = PPMI(x, y) in our current implementation. We compute the probabilities P(x), P(y) and P(x, y) as follows:

P(x) = |{(s, p, o) ∈ G : x ∈ {s, p, o}}| / |G|, (7)

P(y) = |{(s, p, o) ∈ G : y ∈ {s, p, o}}| / |G|, (8)

P(x, y) = |{(s, p, o) ∈ G : {x, y} ⊆ {s, p, o}}| / |G|. (9)

For our running example (see Figure 1, which is provided as part of the DL-Learner framework at http://dl-learner.org), PYKE constructs the similarity matrix shown in Figure 2. Note that our framework can be combined with any similarity function σ. Exploring other similarity functions is out of the scope of this paper but will be at the center of future work.
Fig. 1: Example RDF graph
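The counting behind Eqs. (7)-(9) can be sketched directly from a list of triples. The following is a small illustration under our own naming (literal handling and the sparse structures of the released implementation are ignored):

```python
import math
from collections import Counter
from itertools import combinations

def ppmi_similarities(triples):
    """Estimate P(x), P(y) and P(x, y) by counting (co-)occurrences in
    triples (Eqs. 7-9) and derive the non-zero PPMI similarities a_{x,y}."""
    n = len(triples)
    occurs, cooccurs = Counter(), Counter()
    for triple in triples:
        terms = set(triple)          # a term counts once per triple
        occurs.update(terms)
        cooccurs.update(frozenset(p) for p in combinations(terms, 2))
    sim = {}
    for pair, c in cooccurs.items():
        x, y = tuple(pair)
        value = max(0.0, math.log((c / n) / ((occurs[x] / n) * (occurs[y] / n))))
        if value > 0.0:
            sim[pair] = value
    return sim

# Toy usage on three triples:
sim = ppmi_similarities([("s1", "p", "o1"), ("s1", "p", "o2"), ("s2", "p", "o1")])
```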
Computing P and N. To avoid oversampling positive or negative examples, we only use a portion of A for the subsequent optimization of our objective function. For each x ∈ V, we begin by computing P(x) by selecting the K resources which are most similar to x. Note that if fewer than K resources have a non-zero similarity to x, then P(x) contains exactly the set of resources with a non-zero similarity to x. Thereafter, we randomly sample K elements y of V with a_{x,y} = 0; we call this set N(x). For all y ∈ N(x), we set a_{x,y} to −ω, where ω is our repulsive constant. The values of a_{x,y} for y ∈ P(x) are preserved; all other values are set to 0. After carrying out this process for all x ∈ V, each row of A contains exactly 2K non-zero entries, provided that each x ∈ V has at least K resources with non-zero similarity. Given that K << |V|, A is now sparse and can be stored accordingly. The PPMI similarity matrix for our example graph is shown in Figure 2.
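A sketch of this sampling step follows (our own helper names; the paper's implementation realizes P(x) as a priority queue, which `heapq.nlargest` approximates here):

```python
import heapq
import random

def sample_positives_negatives(x, vocab, sim, K, omega):
    """P(x): the K terms with the highest non-zero similarity to x.
    N(x): K randomly drawn terms with zero similarity to x; their
    coefficients a_{x,y} are set to -omega. `sim(x, y)` is assumed to
    return the PPMI similarity."""
    scored = [(sim(x, y), y) for y in vocab if y != x]
    positives = [y for s, y in heapq.nlargest(K, scored) if s > 0]
    zeros = [y for s, y in scored if s == 0]
    negatives = random.sample(zeros, min(K, len(zeros)))
    return positives, {y: -omega for y in negatives}
```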
Initializing the embeddings. Each x ∈ V is mapped to a single point →x_t of unit mass in R^n at iteration t = 0. As exploring sophisticated initialization techniques is out of the scope of this paper, the initial vectors are set randomly. Figure 3 shows a 3D projection of the initial embeddings for our running example (with n = 50). Note that we use A here for the sake of explanation: for practical applications, this step can be implemented using priority queues, making the quadratic space complexity of storing A unnecessary. Moreover, preliminary experiments suggest that applying a singular value decomposition to A and initializing the embeddings with the latent representations of the elements of the vocabulary along the n most salient eigenvectors has the potential to accelerate the convergence of our approach.

Fig. 2: PPMI similarity matrix of resources in the RDF graph shown in Figure 1
Iteration. This is the crux of our approach. In each iteration t, our approach assumes that the elements of P(x) attract x with a total force

F_a(→x_t) = Σ_{y∈P(x)} σ(x, y) × (→y_t − →x_t). (10)

On the other hand, the elements of N(x) repulse x with a total force

F_r(→x_t) = − Σ_{y∈N(x)} ω / (→y_t − →x_t). (11)

We assume that exactly one unit of time elapses between two iterations. The embedding of x at iteration t + 1 can now be calculated by displacing →x_t proportionally to (F_a(→x_t) + F_r(→x_t)). However, implementing this model directly leads to chaotic (i.e., non-converging) behavior in most cases. We enforce convergence using an approach borrowed from simulated annealing: we reduce the total energy E of the system by a constant factor ∆e after each iteration. By these means, we ensure that our approach always terminates, i.e., we can iterate until J(V) no longer decreases significantly or until a maximal number of iterations T is reached.
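The displacement of a single embedding per iteration can be sketched as follows, mirroring Eqs. (10) and (11) (the small additive constant guarding the division is our own addition to keep the sketch numerically safe):

```python
import numpy as np

def displace(x, emb, positives, negatives, sigma, omega, energy):
    """Move x towards its attractors and away from its repellers,
    scaled by the current system energy E (simulated annealing)."""
    f_a = sum(sigma(x, y) * (emb[y] - emb[x]) for y in positives[x])
    f_r = -sum(omega / (emb[y] - emb[x] + 1e-9) for y in negatives[x])
    return emb[x] + energy * (f_a + f_r)
```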
Implementation. Algorithm 1 shows the pseudocode of our approach. PYKE updates the embeddings of vocabulary terms iteratively until one of the following two stopping criteria is satisfied: either the upper bound T on the number of iterations is reached or the total change in the embeddings (i.e., Σ_{x∈V} ||→x_t − →x_{t−1}||) falls below a lower bound ε. The gradual reduction of the system energy E inherently guarantees the termination of the process of learning embeddings. A 3D projection of the resulting embeddings for our running example is shown in Figure 3.
Fig. 3: PCA projection of 50-dimensional embeddings for our running example. On the left are the randomly initialized embeddings; on the right, the 50-dimensional PYKE embedding vectors after convergence. PYKE was configured with K = 3.

5 Complexity Analysis

Space complexity. Let m = |V|. We would need at most m(m−1)/2 entries to store A, as the matrix is symmetric and we do not need to store its diagonal. However, there is actually no need to store A: we can implement P(x) as a priority queue of size K in which the indexes of the K elements of V most similar to x, as well as their similarities to x, are stored. N(x) can be implemented as a buffer of size K which contains only indexes; once N(x) reaches its maximal size K, new entries (i.e., y with PPMI(x, y) = 0) are added randomly. Hence, we need O(Km) space to store both P and N. Note that K << m. The embeddings require exactly 2mn space, as we store →x_t and →x_{t−1} for each x ∈ V. The force vectors F_a and F_r each require a space of n. Hence, the space complexity of PYKE lies clearly in O(mn + Km) and is thus linear w.r.t. the size of the input knowledge graph G when the number n of dimensions of the embeddings and the number K of positive and negative examples are fixed.

Time complexity. Initializing the embeddings requires mn operations. The initialization of P and N can also be carried out in time linear in m: for each x, the addition of an element to P(x) has a runtime of at most K, and adding elements to N(x) is carried out in constant time, given that the addition is random. This computation is carried out m times, i.e., once for each x. Hence, the overall runtime of the initialization of PYKE is in O(m²). Importantly, the update of the position of each x can be carried out in O(K), leading to each iteration having a time complexity of O(mK). The total runtime complexity of the iterations is hence O(mKT), which is linear in m. This result is of central importance for our subsequent empirical results, as the iterations make up the bulk of PYKE's runtime; hence, PYKE's runtime should be close to linear in real settings.
Algorithm 1 PYKE

Require: T, V, K, ε, ∆e, ω, n
  // initialize embeddings
  for each x in V do
    →x_0 = random vector in R^n
  end for
  // initialize similarity matrix
  A = new Matrix[|V|][|V|]
  for each x in V do
    for each y in V do
      A_{x,y} = PPMI(x, y)
    end for
  end for
  // perform positive and negative sampling
  for each x in V do
    P(x) = getPositives(A, x, K)
    N(x) = getNegatives(A, x, K)
  end for
  // iteration
  t = 1; E = 1
  while t < T do
    for each x in V do
      F_a = Σ_{y ∈ P(x)} σ(x, y) × (→y_{t−1} − →x_{t−1})
      F_r = −Σ_{y ∈ N(x)} ω / (→y_{t−1} − →x_{t−1})
      →x_t = →x_{t−1} + E × (F_a + F_r)
    end for
    E = E − ∆e
    if Σ_{x ∈ V} ||→x_t − →x_{t−1}|| < ε then
      break
    end if
    t = t + 1
  end while
  return embeddings →x_t
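A compact Python transcription of Algorithm 1 follows (a sketch, not the released implementation; `displace` is the update step sketched in Section 4, and the hyperparameter defaults are placeholders rather than the values used in our experiments):

```python
import numpy as np

def pyke(vocab, positives, negatives, sigma, omega, n=50, T=1000,
         delta_e=0.01, eps=1e-3, seed=0):
    """Iterate the physical model until the total displacement falls
    below eps or the iteration bound T is reached."""
    rng = np.random.default_rng(seed)
    emb = {x: rng.normal(size=n) for x in vocab}  # random initialization
    energy = 1.0
    for _ in range(1, T):
        new_emb = {x: displace(x, emb, positives, negatives, sigma,
                               omega, energy) for x in vocab}
        total_change = sum(np.linalg.norm(new_emb[x] - emb[x]) for x in vocab)
        emb, energy = new_emb, energy - delta_e  # anneal the system energy
        if total_change < eps:
            break
    return emb
```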
6 Evaluation

The goal of our evaluation was to compare the quality of the embeddings generated by PYKE with the state of the art. Given that there is no intrinsic measure for the quality of embeddings, we used two extrinsic evaluation scenarios. In the first scenario, we measured the type homogeneity of the embeddings generated by the KGE approaches we considered. We achieved this goal by using a scalable approximation of DBSCAN dubbed HDBSCAN [4]. In our second evaluation scenario, we compared the performance of PYKE on the type prediction task against that of six state-of-the-art algorithms. In both scenarios, we only considered embeddings of the subset S of V, as done in previous works [10,17]. We set K = 45 throughout our experiments; the values of ∆e and ω were computed using a Sobol sequence optimizer [16]. All experiments were carried out on a single core of a server running Ubuntu 18.04 with 16 Intel(R) Xeon(R) E5-2620 v4 CPUs @ 2.10 GHz.

We used six datasets (2 real, 4 synthetic) throughout our experiments. An overview of the datasets is given in Table 2. Drugbank (available via download.bio2rdf.org) is a small-scale KG, whilst the DBpedia (version 2016-10) dataset is a large cross-domain dataset; we compiled the DBpedia dataset by merging the dumps of mapping-based objects, SKOS categories and instance types provided in the DBpedia download folder for version 2016-10 at downloads.dbpedia.org/2016-10. The four synthetic datasets were generated using the LUBM generator [5] with 100, 200, 500 and 1000 universities.
Table 2: Overview of RDF datasets used in our experiments

Dataset    |G|         |V|         |S|         |C|
Drugbank   3,146,309   521,428     421,121     102
DBpedia    27,744,412  7,631,777   6,401,519   423
LUBM100    9,425,190   2,179,793   2,179,766   14
LUBM200    18,770,356  4,341,336   4,341,309   14
LUBM500    46,922,188  10,847,210  10,847,183  14
LUBM1000   93,927,191  21,715,108  21,715,081  14

We evaluated the homogeneity of embeddings by measuring the purity [9] of the clusters generated by HDBSCAN [4]. The original cluster purity equation assumes that each element of a cluster is mapped to exactly one class [9]. Given that a single resource can have several types in a knowledge graph (e.g., BarackObama is a person, a politician, an author and a president in DBpedia), we extended the cluster purity equation as follows: Let C = {c_1, c_2, ...} be the set of all classes found in G. Each x ∈ S was mapped to a binary type vector type(x) of length |C|, whose ith entry is 1 iff x is of type c_i and 0 otherwise. Based on these premises, we computed the purity of a clustering as follows:

Purity = (1/L) Σ_{l=1}^{L} (1/|ζ_l|²) Σ_{x∈ζ_l} Σ_{y∈ζ_l} cos(type(x), type(y)), (12)

where ζ_1, ..., ζ_L are the clusters computed by HDBSCAN. A high purity means that resources with similar type vectors (e.g., presidents who are also authors) are located close to each other in the embedding space, which is a desirable characteristic of a KGE.

In our second evaluation, we performed a type prediction experiment in a manner akin to [10,17]. For each resource x ∈ S, we used the µ closest embeddings of x to predict x's type vector. We then compared the average of the predicted types with x's known type vector using the cosine similarity:

prediction score = (1/|S|) Σ_{x∈S} cos(type(x), Σ_{y∈µnn(x)} type(y)), (13)

where µnn(x) stands for the µ nearest neighbors of x. We employed µ ∈ {1, 3, 5, 10, 15, 30, 50, 100} in our experiments.

Preliminary experiments showed that performing the cluster purity and type prediction evaluations on embeddings of large knowledge graphs is prohibited by the long runtimes of the clustering algorithm; for instance, HDBSCAN did not terminate within 20 hours of computation on large subsets of S. Consequently, we had to apply HDBSCAN on the subset of S of DBpedia which contains resources of type Person or Settlement. For the type prediction task, we sampled resources from S according to a random distribution and fixed them across the type prediction experiments for all KGE models.
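Both measures are straightforward to compute once the type vectors are in place. A minimal sketch of Eqs. (12) and (13) with our own function and argument names:

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def purity(clusters, type_vec):
    """Extended cluster purity (Eq. 12): average pairwise cosine similarity
    of binary type vectors within each cluster, averaged over clusters."""
    per_cluster = [np.mean([cosine(type_vec[x], type_vec[y])
                            for x in c for y in c])
                   for c in clusters if c]
    return float(np.mean(per_cluster))

def prediction_score(resources, type_vec, mu_nn):
    """Type prediction score (Eq. 13): cosine between the known type vector
    of x and the sum of the type vectors of its mu nearest neighbours;
    `mu_nn(x)` is assumed to return those (at least one) neighbours."""
    return float(np.mean([cosine(type_vec[x],
                                 sum(type_vec[y] for y in mu_nn(x)))
                          for x in resources]))
```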
Cluster Purity Results. Table 3 displays the cluster purity results for all competing approaches. PYKE achieves a cluster purity of 0.75 on Drugbank and clearly outperforms all other approaches. DBpedia turned out to be a more difficult dataset. Still, PYKE was able to outperform all state-of-the-art approaches by between 11% and 26% (absolute) on Drugbank and between 9% and 23% (absolute) on DBpedia. Note that in 3 cases, the implementations available were unable to complete the computation of embeddings within 24 hours.
Table 3: Cluster purity results. The best results are marked in bold. Experiments marked with * did not terminate after 24 hours of computation.

Approach    Drugbank   DBpedia
PYKE        0.75       0.57
Word2Vec    0.43       0.37
ComplEx     0.64       *
RESCAL      *          *
TransE      0.60       0.48
CP          0.49       0.41
DistMult    0.49       0.34
Type Prediction Results. Figure 4 and Figure 5 show our type prediction results on the Drugbank and DBpedia datasets. PYKE outperforms all state-of-the-art approaches across all experiments. In particular, it achieves a margin of up to 22% (absolute) on Drugbank and 23% (absolute) on DBpedia. Like in the previous experiment, all KGE approaches perform worse on DBpedia.

Fig. 4: Mean of type prediction scores on randomly sampled entities of DBpedia

Fig. 5: Mean of type prediction scores on all entities of Drugbank
Runtime Results. Table 5 shows the runtime performances of all models on the two real benchmark datasets, while Figure 6 displays the runtime of PYKE on the synthetic LUBM datasets. Our results support our original hypothesis: the low space and time complexities of PYKE mean that it runs efficiently. Our approach achieves runtimes of only 25 minutes on Drugbank and 309 minutes on DBpedia, while outperforming all other approaches by up to 14 hours in runtime.

In addition to evaluating the runtime of PYKE on real data, we were interested in determining its behaviour on datasets of growing sizes. We used the synthetic LUBM datasets and computed a linear regression of the runtime using ordinary least squares (OLS). The runtime results for this experiment are shown in Figure 6. The linear fits shown in Table 4 achieve R² values beyond 0.99, which points to a clear linear relation between PYKE's runtime and the size of the input dataset.
Fig. 6: Runtime performances of PYKE on synthetic KGs. Colored lines represent fitted linear regressions with fixed K values of PYKE.

Table 4: Results of fitting OLS on runtimes (one fit per value of K; columns: K, coefficient, intercept, R²).
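Such a fit takes only a few lines; a sketch with numpy (the sizes and runtimes below are placeholders, not the measurements reported above):

```python
import numpy as np

# Placeholder measurements: dataset size (millions of triples) vs. runtime (minutes).
sizes = np.array([9.4, 18.8, 46.9, 93.9])
runtimes = np.array([12.0, 24.5, 60.1, 121.3])

coefficient, intercept = np.polyfit(sizes, runtimes, deg=1)  # OLS line
residuals = runtimes - (coefficient * sizes + intercept)
r_squared = 1 - residuals @ residuals / np.sum((runtimes - runtimes.mean()) ** 2)
print(f"runtime ≈ {coefficient:.2f} * size + {intercept:.2f} (R² = {r_squared:.4f})")
```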
We believe that the good performance of PYKE stems from (1) its sampling procedure and (2) its being akin to a physical simulation. Employing PPMI to quantify the similarity between resources seems to yield better sampling results than generating negative examples using the local closed-world assumption that underlies the sampling procedures of all competing state-of-the-art KGE models. More importantly, positive and negative sampling occur in our approach per resource rather than per RDF triple. Therefore, PYKE is able to benefit more from positive and negative sampling. By virtue of being akin to a physical simulation, PYKE is able to run efficiently even when each resource x is mapped to 45 attractive and 45 repulsive resources (see Table 5), whilst all state-of-the-art KGE models required more computation time.
Table 5: Runtime performances (in minutes) of all competing approaches. All approaches were executed three times on each dataset. The reported results are the mean and standard deviation of the last two runs. The best results are marked in bold. Experiments marked with * did not terminate after 24 hours of computation.

Approach    Drugbank   DBpedia
PYKE        25         309
ComplEx     705        *
RESCAL      *          *

7 Conclusion
We presented PYKE, a novel approach for the computation of embeddings on knowledge graphs. By virtue of being akin to a physical simulation, PYKE retains a linear space complexity. This was proven through a complexity analysis of our approach. While the time complexity of the initialization is quadratic due to the computation of P and N, all other steps are linear in their runtime complexity. Hence, we expected our approach to behave close to linearly, and our evaluation on the LUBM datasets suggests that this is indeed the case: the runtime of our approach grows close to linearly with the size of the input. This is an important result, as it means that our approach can be used on very large knowledge graphs and return results faster than popular algorithms such as Word2Vec and TransE. However, time efficiency is not everything: our results also suggest that PYKE outperforms state-of-the-art approaches in the two tasks of type prediction and clustering. Still, there is clearly a lack of normalized evaluation scenarios for knowledge graph embedding approaches; we shall hence develop such benchmarks in future works. Our results open a plethora of other research avenues. First, the current approach to computing the similarity between entities/relations in KGs is based on local similarity; exploring other similarity measures will be at the center of future works. In addition, using a better initialization for the embeddings should lead to faster convergence. Finally, one could use a stochastic approach (in the same vein as stochastic gradient descent) to further improve the runtime of PYKE.

References
1. Bordes, A., Glorot, X., Weston, J., Bengio, Y.: A semantic matching energy function for learning with multi-relational data. Machine Learning (2014)
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems. Curran Associates, Inc. (2013)
3. Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embeddings of knowledge bases. In: Twenty-Fifth AAAI Conference on Artificial Intelligence (2011)
4. Campello, R.J., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer (2013)
5. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3(2-3), 158-182 (2005)
6. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics 6(1-4), 164-189 (1927)
7. Huang, X., Zhang, J., Li, D., Li, P.: Knowledge graph embedding based question answering. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM (2019)
8. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
9. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Natural Language Engineering (2010)
10. Melo, A., Paulheim, H., Völker, J.: Type prediction in RDF knowledge bases using hierarchical multilabel classification. In: Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics. p. 14. ACM (2016)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)
12. Nickel, M., Rosasco, L., Poggio, T.: Holographic embeddings of knowledge graphs. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. pp. 1955-1961. AAAI'16 (2016)
13. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: ICML. vol. 11 (2011)
14. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
15. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: International Semantic Web Conference (2016)
16. Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M., Tarantola, S.: Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications 181 (2010)