Symbolic Querying of Vector Spaces: Probabilistic Databases Meets Relational Embeddings
Tal Friedman
Department of Computer Science, University of California, Los Angeles — [email protected]
Guy Van den Broeck
Department of Computer Science, University of California, Los Angeles — [email protected]
Abstract
We propose unifying techniques from probabilistic databases and relational embedding models, with the goal of performing complex queries on incomplete and uncertain data. We formalize a probabilistic database model with respect to which all queries are done. This allows us to leverage the rich literature of theory and algorithms from probabilistic databases for solving problems. While this formalization can be used with any relational embedding model, the lack of a well-defined joint probability distribution causes simple query problems to become provably hard. With this in mind, we introduce TractOR, a relational embedding model designed to be a tractable probabilistic database, by exploiting typical embedding assumptions within the probabilistic framework. Using a principled, efficient inference algorithm that can be derived from its definition, we empirically demonstrate that TractOR is an effective and general model for these querying tasks.
Relational database systems are ubiquitous tools for data management due to their ability to answer a wide variety of queries. In particular, languages such as SQL allow one to take advantage of the relational structure of the data to ask complicated questions, and thereby to learn, analyse, and draw conclusions from data. However, traditional database systems are poorly equipped to deal with uncertainty and incompleteness in data. Meanwhile, techniques from the machine learning community can successfully make predictions and infer new facts. In this work we marry ideas from both machine learning and databases to provide a framework for answering such queries while dealing with uncertain and incomplete data.
The first key question we must answer when dealing with uncertain relational data is how to handle the fact that our data is invariably incomplete. That is, there will always be facts that we do not explicitly see, but would like to be able to infer. In the machine learning community, this problem is known as link prediction, a task which has garnered a lot of attention in recent years [31, 30, 24, 37] using a variety of techniques [4, 15]. Recently, the most common techniques for this problem have been relational embedding models, which embed relations and entities as vectors and then use a scoring function to predict whether or not facts are true. While these techniques are popular and have proven effective for link prediction, they lack a consistent underlying probabilistic semantics, which makes their beliefs about the world unclear. As a result, investigations into them have rarely gone beyond link prediction [20, 26].

On the other hand, the databases community has produced a rich body of work on handling uncertainty via probabilistic databases (PDBs). In contrast to relational embedding models, which are fundamentally predictive models, PDBs [34, 39] are defined by a probabilistic semantics, with strong and clearly specified independence assumptions. With these semantics, PDBs provide us with a wealth of theoretical and algorithmic research into complex queries, including tractability results [11, 12, 13, 16] and approximations [14, 18]. Recently there has even been work on finding explanations for queries [9, 19], and on querying subject to constraints [6, 3, 17]. Where PDBs fall short is in two major areas. Firstly, populating PDBs with meaningful data in an efficient way remains a major challenge, due to their brittleness in the face of incomplete data, and due to their disconnect from the statistical models that can provide these databases with probability values. Secondly, while querying is well understood, certain types of desirable queries are provably hard under standard assumptions [13].

In this work, our goal is to unify the predictive capability of relational embedding models with the sound underlying probabilistic semantics of probabilistic databases. The central question then becomes how to perform this unification so that we maintain as many of the benefits of each as possible, while finding ways to overcome their limitations. As we will discover in Section 3, this is not a question with an obvious answer. The straightforward option is to simply convert the relational embedding model's predictions into probabilities, and then use these to populate a probabilistic database. While this does give us a meaningful way to populate a PDB, the resulting model makes some clearly problematic independence assumptions, and moreover still struggles to make certain queries tractable.

At its core, the reason this straightforward solution is ineffective is as follows: while both PDBs and relational embedding models make simplifying assumptions, these assumptions are not taken into account jointly. Each treats the other as a black box. To overcome this, we incorporate the factorization assumption made by many relational embedding models [41, 30] directly into our probabilistic database. The resulting model, which we call TractOR, thus takes advantage of the benefits of both: it can efficiently and accurately predict missing facts, but it also provides a probabilistic semantics which we can use for complex probabilistic reasoning. Due to its factorization properties, TractOR can even provide efficient reasoning where it was previously difficult in a standard PDB.

The rest of the paper is organized as follows. Section 2 provides the required technical background on PDBs and their associated queries. In Section 3 we discuss using (tuple-independent) PDBs as the technical framework for relational embedding models, giving a brief formalization and a discussion of challenges. Then, in Section 4 we introduce TractOR, a relational embedding model designed around PDBs to allow for a large range of efficient queries. Section 5 provides an empirical evaluation of TractOR. Finally, Section 6 gives a broad discussion of related work along with ties to future work.
We now provide the necessary technical background on probabilistic databases, which will serve as the foundation for our probabilistic semantics and formalism for queries, as well as the underlying inspiration for TractOR.
We begin with necessary background from function-free, finite-domain first-order logic. An atom R(x_1, x_2, ..., x_n) consists of a predicate R of arity n, together with n arguments. These arguments can either be constants or variables. A ground atom is an atom that contains no variables. A formula is a series of atoms combined with conjunctions (∧) or disjunctions (∨), and with quantifiers ∀, ∃. A substitution Q[x/t] replaces all occurrences of x by t in a formula Q.

A relational vocabulary σ is composed of a set of predicates R and a domain D. Using the Herbrand semantics [21], the Herbrand base of σ is the set of all ground atoms possible given R and D. A σ-interpretation ω is then an assignment of truth values to every element of the Herbrand base of σ. We say that ω is a model of a formula Q whenever ω satisfies Q. This is denoted by ω ⊨ Q.

Under the standard model-theoretic view [1], a relational database for a vocabulary σ is a σ-interpretation ω. In words: a relational database is a series of relations, each of which corresponds to a predicate. These are made up of a series of rows, also called tuples, each of which corresponds to a ground atom being true. Any atom not appearing as a row in the relation is considered to be false, following the closed-world assumption [32]. Figure 1 shows an example database.

Scientist       CoAuthor
Einstein        Einstein     Erdős
Erdős           Erdős        von Neumann
von Neumann

Figure 1: Example relational database. Notice that the first row of the right table corresponds to the atom CoAuthor(Einstein, Erdős).

To incorporate uncertainty into relational databases, probabilistic databases assign each tuple a probability [34, 39].
Definition 1. A (tuple-independent) probabilistic database (PDB) P for a vocabulary σ is a finite set of tuples of the form ⟨t : p⟩, where t is a σ-atom and p ∈ [0, 1]. Furthermore, each t can appear at most once.

Given such a collection of tuples and their probabilities, we can now define a distribution over relational databases. The semantics of this distribution are given by treating each tuple as an independent random variable.

Definition 2.
A PDB P for vocabulary σ induces a probability distribution over σ-interpretations ω:

P_P(ω) = Π_{t ∈ ω} P_P(t) · Π_{t ∉ ω} (1 − P_P(t)),   where P_P(t) = p if ⟨t : p⟩ ∈ P, and 0 otherwise.

Each tuple is treated as an independent Bernoulli random variable, so the probability of a relational database instance is given as a simple product, based on which tuples are or are not included in the instance.
Scientist      Pr       CoAuthor                   Pr
Einstein       0.8      Einstein     Erdős         0.8
Erdős          0.8      Erdős        von Neumann   0.9
von Neumann    0.9      von Neumann  Einstein      0.5
Shakespeare    0.2

Figure 2: Example probabilistic database. Tuples are now of the form ⟨t : p⟩, where p is the probability of the tuple t being present. These tuples are assumed to be independent, so the probability that both Einstein and Erdős are scientists is 0.8 · 0.8 = 0.64.

Much as in relational databases, in probabilistic databases we are interested in answering queries – the difference being that we are now interested in probabilities over queries. In particular, we study the theory of queries that are fully quantified, with no free variables or constants, also known as fully quantified Boolean queries – we will see later how other queries can be reduced to this form. On a relational database, this corresponds to a fully quantified query that has an answer of True or False. For example, on the database given in Figure 1, we might ask if there is a scientist who is a coauthor:

Q = ∃x. ∃y. S(x) ∧ CoA(x, y)

There clearly is, by taking x to be Einstein and y to be Erdős. If we instead asked this query of the PDB in Figure 2, we would be computing the probability by summing over the worlds in which the query is true:

P_P(Q) = Σ_{ω ⊨ Q} P_P(ω)

Queries of this form that are a conjunction of atoms are called conjunctive queries. They are commonly shortened as Q = S(x), CoA(x, y). A disjunction of conjunctive queries is known as a union of conjunctive queries (UCQ). While UCQs capture a rather complex set of queries, their algorithmic landscape is remarkably well understood.
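To make these semantics concrete, here is a minimal brute-force sketch (ours, with ASCII entity names) that computes the probability of the coauthor query above on the PDB of Figure 2 by enumerating all worlds. It is exponential in the number of tuples and is only meant to illustrate Definitions 1 and 2.

```python
from itertools import product

# Tuple-independent PDB from Figure 2 (ASCII entity names are ours).
scientist = {"Einstein": 0.8, "Erdos": 0.8, "vonNeumann": 0.9, "Shakespeare": 0.2}
coauthor = {("Einstein", "Erdos"): 0.8,
            ("Erdos", "vonNeumann"): 0.9,
            ("vonNeumann", "Einstein"): 0.5}

# Every tuple is an independent Bernoulli random variable; atoms not in the
# PDB have probability 0 (closed world) and can be ignored.
atoms = [("S", e) for e in scientist] + [("CoA",) + pair for pair in coauthor]
probs = {("S", e): p for e, p in scientist.items()}
probs.update({("CoA",) + pair: p for pair, p in coauthor.items()})

def brute_force_query_prob():
    """P(Q) for Q = exists x, y. S(x) ^ CoA(x, y), summing over all worlds."""
    total = 0.0
    for world in product([False, True], repeat=len(atoms)):
        truth = dict(zip(atoms, world))
        # Probability of this world (Definition 2): a product over tuples.
        p_world = 1.0
        for atom, present in truth.items():
            p_world *= probs[atom] if present else 1 - probs[atom]
        # Add p_world if the world satisfies the query.
        if any(truth[("S", x)] and truth[("CoA", x, y)] for (x, y) in coauthor):
            total += p_world
    return total

print(brute_force_query_prob())  # ~0.9446
```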
Theorem 1.
(Dalvi and Suciu [13]) Let Q be a UCQ and P be a tuple-independent probabilistic database. Then the query Q is either:
• Safe: P_P(Q) can be computed in time polynomial in |P| for all probabilistic databases P, using the standard lifted inference algorithm (see Section 2.3.2); or
• Unsafe: computing P_P(Q) is a #P-hard problem.
Furthermore, we can efficiently determine whether Q is safe or unsafe.

In much of the literature on probabilistic databases [34, 13], as well as throughout this paper, UCQs (and consequently conjunctive queries) are the primary query objects studied.
In general, one is not always interested in computing fully quantified queries. For example, in Section 5 one of the queries we are interested in computing will be of the form

∃x, y. R(A, x) ∧ S(x, y) ∧ T(y, B)    (1)

for relations R, S, T and constants A, B. To convert this query to a fully quantified one, we need to shatter the query [39]. In this case, we replace the binary relation R(A, x) by the unary query R_A(x), where ∀x. R_A(x) = R(A, x). A similar procedure for T gives us the following query:

H = ∃x, y. R_A(x) ∧ S(x, y) ∧ T_B(y)    (2)

This is now a fully quantified query, and is also a simple example of an unsafe query. That is, for an arbitrary probabilistic database P we cannot compute P_P(H) in time polynomial in |P| given our current independence and complexity assumptions.

In addition to providing an underlying probabilistic semantics, one of the motivations for exploring probabilistic databases as the formalism for relational embedding models was to be able to evaluate complex queries efficiently. Algorithm 1 does this in polynomial time for all safe queries. We now explain its steps in further detail. We begin with the assumption that Q has been processed to not contain any constant symbols, and that all variables appear in the same order in repeated predicate occurrences in Q. This can be done efficiently [13].
Algorithm 1 LiftR(Q, P), abbreviated by L(Q)
Require: UCQ Q, prob. database P with constants T.
Ensure: The probability P_P(Q)
Step 0, Base of Recursion: if Q is a single ground atom t: if ⟨t : p⟩ ∈ P return p, else return 0
Step 1, Rewriting of Query: convert Q to a conjunction of UCQs: Q∧ = Q_1 ∧ · · · ∧ Q_m
Step 2, Decomposable Conjunction: if m > 1 and Q∧ = Q_1 ∧ Q_2 where Q_1 ⊥ Q_2: return L(Q_1) · L(Q_2)
Step 3, Inclusion-Exclusion: if m > 1 but Q∧ has no independent Q_i (do cancellations first): return Σ_{s ⊆ [m]} (−1)^{|s|+1} · L(∨_{i ∈ s} Q_i)
Step 4, Decomposable Disjunction: if Q = Q_1 ∨ Q_2 where Q_1 ⊥ Q_2: return 1 − (1 − L(Q_1)) · (1 − L(Q_2))
Step 5, Decomposable Existential Quantifier: if Q has a separator variable x: return 1 − Π_{c ∈ T} (1 − L(Q[x/c]))
Step 6: Fail (the query is unsafe)

Step 0 covers the base case where Q is simply a tuple, so it looks it up in P. Step 1 attempts to rewrite the UCQ into a conjunction of UCQs to find decomposable parts. For example, the UCQ (R(x) ∧ S(y, z)) ∨ (S(x, y) ∧ T(x)) can be rewritten as the conjunction of (R(x)) ∨ (S(x, y) ∧ T(x)) and (S(y, z)) ∨ (S(x, y) ∧ T(x)). When multiple conjuncts are found this way, there are two options. If they are symbolically independent (share no symbols, denoted ⊥), then Step 2 applies independence and recurses. Otherwise, Step 3 recurses using the inclusion-exclusion principle, performing cancellations first to maintain efficiency [13]. If there is only a single UCQ after rewriting, Step 4 tries to split it into independent parts, applying independence and recursing if anything is found. Next, Step 5 searches for a separator variable: one which appears in every atom in Q. If x is a separator variable for Q, and a, b are different constants in the domain of x, then Q[x/a] and Q[x/b] are independent. This independence is again recursively exploited. Finally, if Step 6 is reached, then the algorithm has failed, and the query provably cannot be computed efficiently [13], under standard complexity assumptions.
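For contrast with the exponential brute force shown earlier, the following sketch evaluates the same coauthor query on the Figure 2 tables the way Algorithm 1 would, in time polynomial in the database; the comments indicate which step of the algorithm applies.

```python
# PDB from Figure 2, as in the brute-force sketch above.
scientist = {"Einstein": 0.8, "Erdos": 0.8, "vonNeumann": 0.9, "Shakespeare": 0.2}
coauthor = {("Einstein", "Erdos"): 0.8,
            ("Erdos", "vonNeumann"): 0.9,
            ("vonNeumann", "Einstein"): 0.5}
domain = list(scientist)

def lifted_query_prob():
    """P(Q) for Q = exists x, y. S(x) ^ CoA(x, y) via Algorithm 1."""
    # Step 5: x is a separator variable (it appears in every atom), so the
    # instances Q[x/c] are independent across constants c.
    prod_x = 1.0
    for c in domain:
        # Step 2: S(c) and (exists y. CoA(c, y)) share no symbols.
        p_s = scientist.get(c, 0.0)  # Step 0: look up the ground atom S(c)
        # Step 5 again: y separates CoA(c, y).
        prod_y = 1.0
        for e in domain:
            prod_y *= 1 - coauthor.get((c, e), 0.0)  # Step 0 lookups
        prod_x *= 1 - p_s * (1 - prod_y)
    return 1 - prod_x

print(lifted_query_prob())  # same value as the brute force, in polynomial time
```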
We now tackle the primary goal of this work: to use probabilistic databases as the formalism for doing probabilistic reasoning with relational embeddings. We begin with a formalization of relational embedding models.

Figure 3: Converting a table of relational embedding scores for R(x, y) into a table of probabilities for R(x, y).
Suppose we have a knowledge base K consisting of triples (h_i, R_i, t_i), denoting a head entity, relation, and tail entity (equivalently, R_i(h_i, t_i) in probabilistic database notation). Relational embedding models aim to learn continuous representations for both entities and relations, which together can be used to predict the presence of a triple. More formally:

Definition 3. Suppose we have a knowledge base K consisting of triples (h_i, R_i, t_i), with entities E and relations R. Then a relational embedding model consists of
• real vectors v_R, v_e for all relations R ∈ R and entities e ∈ E;
• a scoring function f(v_h, v_R, v_t) → R which induces a ranking over triples.

In general, these vectors may need to be reshaped into matrices or tensors before the scoring function can be applied. Table 1 gives some examples of models with the form their vector representations take, as well as their scoring functions.

Table 1: Example relational embedding scoring functions for d dimensions
Method          Entity Embedding    Relation Embedding    Triple Score
TransE [5]      v_h, v_t ∈ R^d      v_R ∈ R^d             ||v_h + v_R − v_t||
DistMult [41]   v_h, v_t ∈ R^d      v_R ∈ R^d             ⟨v_h, v_R, v_t⟩
RESCAL [30]     v_h, v_t ∈ R^d      v_R ∈ R^{d×d}         v_h^T v_R v_t
ComplEx [37]    v_h, v_t ∈ C^d      v_R ∈ C^d             Re(⟨v_h, v_R, v̄_t⟩)

Given a relational embedding model from Definition 3, if we want to give it a clear probabilistic semantics using our knowledge of probabilistic databases from Section 2, we need to find a way to interpret the model as a probability distribution. The simplest approach is to choose some mapping function g : R → [0, 1] which converts all the scores produced by the model's scoring function into probabilities. This provides us marginal probabilities, but no obvious joint distribution. Again, we can make the simplest choice and interpret these probabilities as being independent. That is, we can construct a probabilistic database where the probabilities are determined using our mapping function. Figure 3 gives an example of such a conversion, using the sigmoid function as the mapping.

After doing this conversion, we can directly use Algorithm 1 to efficiently evaluate any safe query. This is a step in the right direction, but there are still two big issues. Firstly, as a simplifying assumption, this triple-independence presents potential problems, as discussed in Meilicke et al. [28]. For example, suppose we have a relational model containing Works-In(Alice, London) and Lives-In(Alice, London): clearly these triples should not be independent. The second issue, which is perhaps even more critical for our purposes, is that even this assumption is not sufficient for all queries to be tractable:
Theorem 2.
Suppose we have a knowledge base K with entities E and relations R. Then, suppose we have a mapping function g and a relational embedding model represented by a scoring function f which is fully expressive. That is, for any configuration of marginal probabilities P(R(h, t)) over all possible triples, there is some assignment of entity and relation vectors such that ∀R, h, t. g(f(v_h, v_R, v_t)) = P(R(h, t)). Then for any unsafe query Q, evaluating P(Q) is a #P-hard problem.

The main takeaway from Section 3 is that although useful, interpreting relational embedding models as providing marginals for probabilistic databases still has major challenges. While we do now have a probabilistic semantics for our relational embedding model, the fact that we used the model as a black box means that we wind up treating all triples as independent. The resulting expressiveness and tractability limitations motivate the search for a model which will not be treated as a black box by our probabilistic database semantics. Rather than simply having an arbitrary statistical model which fills in our probabilistic database, we would like to actually exploit properties of this statistical model. To put it another way: a fundamental underpinning of relational embedding models such as DistMult [41] or TransE [5] is that they make simplifying assumptions about how entity and relation vectors relate to link prediction. In Section 3, our probabilistic interpretations of these models had no way of knowing about these simplifying assumptions; now we are going to express them in the language of PDBs.
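As a minimal sketch of the black-box conversion just recapped, assuming DistMult as the scoring function and the sigmoid as the mapping g (the embeddings below are random placeholders rather than trained vectors):

```python
import numpy as np

def sigmoid(score):
    """A common choice for the mapping g : R -> [0, 1]."""
    return 1.0 / (1.0 + np.exp(-score))

def distmult_score(v_h, v_R, v_t):
    """Trilinear DistMult score <v_h, v_R, v_t>."""
    return float(np.sum(v_h * v_R * v_t))

# Placeholder embeddings (random, not trained) for one triple.
rng = np.random.default_rng(0)
v_alice, v_london, v_lives_in = (rng.normal(size=4) for _ in range(3))

# Marginal probability for the tuple Lives-In(Alice, London); the black-box
# construction then treats all such tuples as independent in the PDB.
p = sigmoid(distmult_score(v_alice, v_lives_in, v_london))
```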
Relational embedding models such as DistMult [41] and ComplEx [37], or indeed any model derived from the canonical polyadic decomposition [22], are built on an assumption about the way in which the tensor representing all triples factorizes. A similar idea has been used in the context of probabilistic first-order logic, where Boolean matrices representing binary relations are rewritten in terms of unary relations to make inference tractable [38]. We will now apply this technique of rewriting binary relations into unary relations as the basis for our relational embedding model.

Suppose we have a binary relation R(x, y), and our model defines a single random variable E(x) for each entity x ∈ E, as well as a random variable T(R) for relation R. Then we assume that the relation R decomposes in the following way:

∀x, y. R(x, y) ⇐⇒ E(x) ∧ T(R) ∧ E(y)    (3)

We assume that all of the model's newly defined variables in E and T are independent random variables, so Equation 3 implies that

P(R(x, y)) = P(E(x)) · P(T(R)) · P(E(y))

Figure 4 gives an example of probabilities for E and T, with corresponding probabilities for R subject to Equation 3. For example, we compute P(R(A, B)) as P(E(A)) · P(T(R)) · P(E(B)). To incorporate a relation S, we would define an additional T(S) – no new random variable per entity is needed.

Figure 4: Example tables for E and T, and a few corresponding predictions for R.

There are a few immediate takeaways from the rewrite presented in Equation 3. Firstly, as a result of sharing dependencies in the model, we no longer have that all triples are independent of each other. For example, R(A, B) and S(A, C) are not independent, as they share a dependency on the random variable E(A). Secondly, although these tuples are no longer independent (which would normally make query evaluation harder), their connection via the new latent variables E, T actually helps us. By assuming the latent E, T-tuples to be tuple-independent, instead of the non-latent R, S-tuples, we are no longer subject to the querying limitations described by Theorem 2. In fact, any UCQ can now be computed efficiently over the relations of interest. This will be proven in Section 4.4, but intuitively, binary relations must be involved for Algorithm 1 to get stuck, and our rewrite allows us to avoid this.

Of course, the major drawback is that Equation 3 describes an incredibly simple and inexpressive embedding model – we can only associate a single probability with each entity and relation! We address this next.
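The following small sketch, with made-up probability tables, shows Equation 3 in action, and how the shared latent variable E(A) correlates triples that the black-box conversion of Section 3 would have treated as independent:

```python
# Made-up unary tables for a single-component model (Equation 3).
p_E = {"A": 0.9, "B": 0.5, "C": 0.8}  # P(E(x)) for each entity x
p_T = {"R": 0.7, "S": 0.6}            # P(T(R)) for each relation R

def p_triple(rel, head, tail):
    """P(rel(head, tail)) = P(E(head)) * P(T(rel)) * P(E(tail))."""
    return p_E[head] * p_T[rel] * p_E[tail]

print(p_triple("R", "A", "B"))  # 0.9 * 0.7 * 0.5 = 0.315

# Triples sharing an entity are correlated through the latent variables:
# P(R(A,B) ^ S(A,C)) = P(E(A)) * P(T(R)) * P(E(B)) * P(T(S)) * P(E(C)),
# where E(A) is counted once, not the product of the two marginals.
```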
In a situation such as ours, where we have a simple model which is efficient for some task but not expressive, the standard machine learning approach is to employ a mixture model. For example, while tree-shaped graphical models [10] provide efficient learning and inference, they are limited in their expressive capability, so a commonly used alternative is a mixture of such models [27]. Similarly, while Gaussians are limited in their expressiveness, mixtures of Gaussians [36] have found widespread use throughout machine learning. These mixtures can typically approximate any distribution given enough components. In our case, we will take the model described in Equation 3 as our building block, and use it to create TractOR.

Definition 4. TractOR with d dimensions is a mixture of d models, each constructed from Equation 3. That is, it has tables T_i, E_i analogous to T and E above for each element i of the mixture. Then, for each element i we have

∀x, y. R_i(x, y) ⇐⇒ E_i(x) ∧ T_i(R) ∧ E_i(y)

The probability of any query is then given by TractOR as the average of the probabilities of the d mixture components.

Figure 5 gives an example 2-dimensional TractOR model, including probabilities for E_1, E_2, T_1, T_2, and corresponding probabilities for the materialized relation R. For example, we compute P(R(A, B)) by

P(R(A, B)) = 1/2 (P(E_1(A)) · P(T_1(R)) · P(E_1(B)) + P(E_2(A)) · P(T_2(R)) · P(E_2(B)))

We see that the components of the mixture form what we typically think of as the dimensions of the embedding vectors. For example, in Figure 5 the embedding of entity A is (E_1(A), E_2(A)).
The first question we need to ask about TractOR is how effective it is for link prediction.
Theorem 3.
Suppose we have entity embeddings v_h, v_t ∈ R^d and a relation embedding v_R ∈ R^d. Then TractOR and DistMult will assign identical scores (within a constant factor) to the triple (h, R, t) (equivalently, R(h, t)).
RACT
OR must also be.
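A self-contained sketch of this correspondence: reading TractOR's per-component probabilities as d-dimensional embedding vectors, the mixture probability is exactly the DistMult trilinear score scaled by the constant 1/d.

```python
import numpy as np

def distmult_score(v_h, v_R, v_t):
    """DistMult trilinear score <v_h, v_R, v_t>."""
    return float(np.sum(v_h * v_R * v_t))

def tractor_prob(e_h, t_R, e_t):
    """TractOR: average of the per-component products."""
    return float(np.mean(e_h * t_R * e_t))

# The two scores differ only by the constant factor 1/d.
d = 4
e_h, t_R, e_t = np.full(d, 0.5), np.full(d, 0.25), np.full(d, 0.75)
assert np.isclose(tractor_prob(e_h, t_R, e_t), distmult_score(e_h, t_R, e_t) / d)
```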
While we have seen that the computation used for link prediction in TractOR is identical to that of DistMult, there remains a key difference: TractOR has a probabilistic semantics, and thus all parameters must be probabilities. One option here is to indeed force all parameters to be positive, and live with any performance loss incurred. Another option is to allow negative probabilities in E, T, meaning that we can achieve exactly the same link prediction results as DistMult, whose predictive power is well documented [41]. It has been previously shown that probability theory can be consistently extended to negative probabilities [2], and their usefulness has also been documented in the context of probabilistic databases [23, 40]. Furthermore, by adding a simple disjunctive bias term, we can ensure that all fact predictions are indeed positive probabilities. In Section 5 we explore both options.
Finally, we explore query evaluation for the TractOR model. Suppose we have some arbitrary UCQ Q over binary and unary relations, and we would like to compute P(Q), where all binary relations are given by a TractOR model. First, we substitute each binary relation according to Equation 3, using the TractOR tables E and T. What remains is a query Q′ which contains only unary relations.

Theorem 4.
Suppose that Q′ is a UCQ consisting only of unary relations. Then Q′ is safe.

Figure 5: An example 2-dimensional TractOR model: tables E_1, E_2, T_1, T_2, and a few corresponding predictions for R.

Proof.
We prove this by showing that Algorithm 1 never fails on Q′. Suppose Q′ cannot be rewritten as a conjunction of UCQs. Then each CQ must contain only a single quantified variable, or else that CQ would contain two separate connected components (since all relations are unary). Thus, if we ever reach Step 5 of Algorithm 1, each CQ must have a separator. So Q′ is safe.
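As an illustration of Theorem 4, the following sketch (with made-up tables) evaluates the previously unsafe query of Equation 2 within a single mixture component: substituting Equation 3 absorbs the existential quantifiers into E(A) and E(B), leaving a product of independent unary probabilities, which a brute-force enumeration over latent worlds confirms. The full TractOR probability would average this quantity over the d components.

```python
from itertools import product

def component_prob(p_E, p_T, A, B):
    """Closed form for P(exists x, y. R(A,x) ^ S(x,y) ^ T(y,B)) within one
    mixture component (A != B): substituting Equation 3 yields
    E(A) ^ T(R) ^ T(S) ^ T(T) ^ E(B) ^ (exists x. E(x)) ^ (exists y. E(y)),
    and the existentials are already implied by E(A) and E(B)."""
    return p_E[A] * p_E[B] * p_T["R"] * p_T["S"] * p_T["T"]

def brute_force(p_E, p_T, A, B):
    """Check by enumerating all assignments to the latent unary variables."""
    entities, rels = list(p_E), list(p_T)
    total = 0.0
    for bits in product([False, True], repeat=len(entities) + len(rels)):
        e = dict(zip(entities, bits[:len(entities)]))
        t = dict(zip(rels, bits[len(entities):]))
        p = 1.0
        for x in entities:
            p *= p_E[x] if e[x] else 1 - p_E[x]
        for r in rels:
            p *= p_T[r] if t[r] else 1 - p_T[r]
        # R(A,x) <=> E(A) ^ T(R) ^ E(x), and similarly for S and T.
        if any(e[A] and t["R"] and e[x] and t["S"] and e[y] and t["T"] and e[B]
               for x in entities for y in entities):
            total += p
    return total

p_E = {"A": 0.9, "B": 0.5, "C": 0.8}  # made-up tables
p_T = {"R": 0.7, "S": 0.6, "T": 0.4}

assert abs(component_prob(p_E, p_T, "A", "B") -
           brute_force(p_E, p_T, "A", "B")) < 1e-9
```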
We will now empirically investigate the effectiveness of TractOR as a relational embedding model. As discussed in Section 4.3, for the purposes of link prediction TractOR actually turns out to be equivalent to DistMult. While it does have certain limitations regarding asymmetric relations, the overall effectiveness of DistMult for link prediction has been well documented [41], so we will not be evaluating TractOR on link prediction. Instead, we will focus on evaluating TractOR's performance when computing more advanced queries. While training the models we evaluated, we confirmed that training TractOR and DistMult produced the same embeddings and link prediction performance.
As our comparison for evaluation, we will use the graph query embeddings (GQE) [20] framework and evaluation scheme. Fundamentally, GQE differs from TractOR in its approach to query prediction. Where TractOR is a distribution representing beliefs about the world, which can then be queried to produce predictions, GQE treats queries as their own separate prediction task and defines vector operations specifically for conjunctive query prediction. The consequence is that where TractOR has a single correct way to answer any query (the answer induced by the probability distribution), a method in the style of GQE needs to find a new set of linear algebra tools for each type of query. In particular, GQE uses geometric transformations as representations for conjunction and existential quantifiers, allowing it to do query prediction via repeated application of these geometric transformations. Hamilton et al. [20] detail exactly which queries are supported; put simply, it is any conjunctive query that can be represented as a directed acyclic graph with a single sink. (Code is available at https://github.com/ucla-starai/pdbmeetskge.)
To evaluate these models, the first question is which queries should be tested. We describe a query template as follows: R, S, T are placeholder relations, A, B, C placeholder constants, x, y, z quantified variables, and t is the parameterized variable. That is, the goal of the query is to find the entity t which best satisfies the query (in our framework, the one given the highest probability). Table 2 gives a series of example template CQs and UCQs. In Figure 6, we categorize each of these query templates based on their hardness under standard probabilistic database semantics, as well as their compatibility with GQE. Notice that TractOR can compute all queries in Figure 6 in time linear in the domain size, including queries such as Q4, whose shattered form (Equation 2) would be #P-hard in a standard tuple-independent probabilistic database. For the sake of comparison, we perform our empirical evaluation using the queries that are also supported by GQE.

Table 2: Example CQs and UCQs
Q1(t) = R(A, t)
Q2(t) = ∃x. R(A, x)
Q3(t) = ∃x. R(A, x) ∧ S(x, t)
Q4(t) = ∃x, y. R(A, x) ∧ S(x, y) ∧ T(y, t)
Q5(t) = R(A, t) ∧ S(B, t)
Q6(t) = R(A, t) ∧ S(B, t) ∧ T(C, t)
Q7(t) = ∃x. R(A, x) ∧ S(x, t) ∨ ∃y. R(A, y) ∧ T(y, t)
Q8(t) = ∃x. R(A, x) ∧ S(x, t) ∧ T(B, t)
Q9(t) = ∃x. R(A, x) ∧ S(B, x) ∧ T(x, t)
Q10(t) = ∃x1, y1. R(A, x1) ∧ S(x1, y1) ∨ ∃x2, y2. S(x2, y2) ∧ T(y2, t)
Q11(t) = ∃x, y, z. R(A, x) ∧ S(x, y) ∧ T(y, z)

For our dataset, we use the same choice of relational data as Hamilton et al. [20]. In that work, two datasets were evaluated on, termed bio and reddit respectively.
Figure 6: Categorizing different queries based on safeness (PTIME) and compatibility with GQE [20]. TractOR efficiently supports all queries in the diagram.

Bio is a dataset consisting of knowledge from public biomedical databases, with nodes corresponding to drugs, diseases, proteins, side effects, and biological processes. It includes 42 different relations, and the graph in total has over 8 million edges between 97,000 entities. The reddit dataset was not made publicly available, so we were unable to use it for evaluation.
While the bio dataset provides our entities and relations, we need to create a dataset of conjunctive queries to evaluate on. For this, we again follow the procedures from Hamilton et al. [20]. First, we sample a 90/10 train/test split for the edges in the bio data. Then, we generate evaluation queries (along with answers) using both train and test edges from the bio dataset, but sample in such a way that each test query relies on at least one edge not present in the training data. This ensures that we cannot template-match queries based on the training data. For each query template we sample 10,000 queries for evaluation. For further details, including queries for which some edges are adversarially chosen, see Hamilton et al. [20]. As an example, templating on Q4 can produce:

D? ∃p1. ∃p2. Activates(P, p1) ∧ Catalyzes(p1, p2) ∧ Target(p2, D)

where D is the drug we would like to find and p1, p2, P are proteins.

For each evaluation query, we ask the model being evaluated to rank the entity which answers the query in comparison to other entities which do not. We then evaluate the performance of this ranking using a ROC AUC score, as well as an average percentile rank (APR) over 1000 random negative examples.

Table 3: Overall query performance on the bio dataset
Method     AUC    APR
Bilinear   79.2   78.6
DistMult   86.7   87.5
TransE     78.3   81.6
TractOR+   75.0   84.5
TractOR    82.8   86.3
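A short sketch of these two metrics as we read them; the exact protocol follows Hamilton et al. [20], and the APR formulation below (the fraction of negatives ranked below the true answer) is our assumption.

```python
import numpy as np

def average_percentile_rank(pos_score, neg_scores):
    """Percentile rank of the query's true answer among negative samples:
    the fraction of negatives that the positive outranks."""
    return float(np.mean(pos_score > np.asarray(neg_scores)))

def roc_auc(pos_scores, neg_scores):
    """ROC AUC as P(score of random positive > score of random negative),
    counting ties as one half."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())
```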
We evaluate two versions of our model. TractOR indicates a model where the unary predicate probabilities are allowed to be negative, with a bias term added to ensure all triples have positive predicted probability. TractOR+ indicates a model where unary predicate probabilities are constrained to be positive via squaring.

As baselines, we consider the model variants from Hamilton et al. [20] that do not include extra parameters that must be trained on queries, as our model contains no such parameters. These models are each built on an existing relational embedding model (Bilinear [30], DistMult [41], and TransE [5], respectively) used for link prediction and composition, together with a mean vector operator used for queries. For example, for the template Q5, these baselines will make a prediction for the t satisfying R(A, t) and for the t satisfying S(B, t) separately, and then take the mean of the resulting vectors.

All model variants and baselines were trained using the max-margin approach with negative sampling [29], which has become standard for training relational embedding models [31]. Parameters were optimized using the Adam optimizer [25], with an embedding dimension of 128 and a batch size of 256.
Table 3 presents AUC and APR scores for all model variants and baselines on the bio dataset. TractOR and TractOR+ both perform better than the TransE- and Bilinear-based baselines in APR, and are competitive with the DistMult baseline. Evaluated by AUC, the performance is slightly worse, but TractOR remains better than or comparable to all baselines. These results are very encouraging, as TractOR is competitive despite the fact that it represents much more than just conjunctive query prediction: TractOR represents a complete probability distribution, and effective and efficient query prediction is simply a direct consequence.

Another interesting observation is the gap between TractOR and TractOR+, where the only difference is whether the parameters are constrained to be positive. The difference in performance here essentially comes down to the difference in performance on link prediction: not being allowed to use negative values makes the model both less expressive and more difficult to train, leading to worse performance on link prediction. We did not find that increasing the number of dimensions in the representation to make up for the lack of negative values helped significantly. Finding ways to improve link prediction subject to this constraint appears valuable for improving performance on query prediction.
Querying Relational Embeddings
Previous work studying queries beyond link prediction in relational embedding models has proposed replacing logical operators with geometric transformations [20], and learning new relations representing joins [26]. Our work differs from these in that we formalize an underlying probabilistic framework which defines algorithms for querying, rather than treating querying as a new learning task.
Symmetric Relations
A limitation of the TractOR model, which also appears in models like DistMult [41] and TransE [5], is that since head and tail entities are treated the same way, it can only represent symmetric relations. This is, of course, problematic, as many relations we encounter in the wild are not symmetric. Solutions to this include assigning complex numbers for embeddings together with an asymmetric scoring function [37], and keeping separate head and tail representations but using inverse relations to train them jointly [24]. Borrowing these techniques presents a straightforward way to extend TractOR to represent asymmetric relations.
Probabilistic Training
One potential disconnect in TractOR is that while it is a probabilistic model, it is not trained in a probabilistic way. That is, it is trained in the standard fashion for relational embedding models, using negative sampling and a max-margin loss. Other training methods for these models, such as cross-entropy losses, exist and can improve performance [33] while being more probabilistic in nature. In a similar vein, Tabacof and Costabello [35] empirically calibrate probabilities to be meaningful with respect to the data. An interesting open question is whether TractOR can be trained directly using a likelihood derived from its PDB semantics.
Incomplete Knowledge Bases
One of the main goals of this work is to overcome the common issue of incomplete knowledge: what do we do when no probability at all is known for some fact? In this work, we directly incorporate machine learning models to overcome this. Another approach to this problem is to suppose a range of possibilities for the unknown probabilities, and reason over those. This is implemented via open-world probabilistic databases [8], with extensions to incorporate background information in the form of ontological knowledge [7] and summary statistics [17].
Increasing Model Complexity
TractOR is a mixture of very simple models. While this makes for highly efficient querying, accuracy could potentially be improved by rolling more of the complexity into each individual model at the PDB level. The natural approach is to follow Van den Broeck and Darwiche [38] and replace our simple unary conjunction with a disjunction of conjunctions. This raises interesting theoretical and algorithmic questions, with potential for improving query prediction.
Further Queries
Finally, there are further questions one can ask of a PDB beyond the probability of a query. For example, Gribkoff et al. [19] pose the question of which world (i.e., configuration of tuple truths) is most likely given a PDB and some constraints, while Ceylan et al. [9] study the question of which explanations are most probable for a certain PDB query being true. Extending these problems to the realm of relational embeddings poses many interesting questions.
Acknowledgements
We thank Yitao Liang, YooJung Choi, Pasha Khosravi, Dan Suciu, and Pasquale Minervini for helpful feedback and discussion. This work is partially supported by NSF grants.
References
[1] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. 1995.
[2] M. S. Bartlett. Negative probability. Mathematical Proceedings of the Cambridge Philosophical Society, 41(1):71–73, 1945.
[3] Meghyn Bienvenu. Ontology-mediated query answering: Harnessing knowledge to get more from data. In IJCAI, 2016.
[4] Hendrik Blockeel and Luc De Raedt. Top-down induction of logical decision trees. 1997.
[5] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.
[6] Stefan Borgwardt, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. Ontology-mediated queries for probabilistic databases. In AAAI, 2017.
[7] Stefan Borgwardt, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. Ontology-mediated query answering over log-linear probabilistic data. In AAAI, 2019.
[8] Ismail Ilkan Ceylan, Adnan Darwiche, and Guy Van den Broeck. Open-world probabilistic databases. In KR, May 2016.
[9] Ismail Ilkan Ceylan, Stefan Borgwardt, and Thomas Lukasiewicz. Most probable explanations for probabilistic database queries. In IJCAI, 2017.
[10] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. Information Theory, 14:462–467, 1968.
[11] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal, 2004.
[12] Nilesh N. Dalvi and Dan Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, 2007.
[13] Nilesh N. Dalvi and Dan Suciu. The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM, 59:30:1–30:87, 2012.
[14] Maarten Van den Heuvel, Peter Ivanov, Wolfgang Gatterbauer, Floris Geerts, and Martin Theobald. Anytime approximation in probabilistic databases via scaled dissociations. In SIGMOD Conference, 2019.
[15] Sebastijan Dumancic, Alberto García-Durán, and Mathias Niepert. A comparative study of distributional and symbolic paradigms for relational learning. In IJCAI, 2019.
[16] Robert Fink and Dan Olteanu. Dichotomies for queries with negation in probabilistic databases. ACM Trans. Database Syst., 41:4:1–4:47, 2016.
[17] Tal Friedman and Guy Van den Broeck. On constrained open-world probabilistic databases. In IJCAI, August 2019.
[18] Eric Gribkoff and Dan Suciu. SlimShot: In-database probabilistic inference for knowledge bases. PVLDB, 9:552–563, 2016.
[19] Eric Gribkoff, Guy Van den Broeck, and Dan Suciu. The most probable database problem. In Proc. BUDA, 2014.
[20] William L. Hamilton, Payal Bajaj, Marinka Zitnik, Daniel Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In NeurIPS, 2018.
[21] Timothy Hinrichs and Michael Genesereth. Herbrand logic. Technical Report LG-2006-02, Stanford University, 2006.
[22] Frank L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927.
[23] Abhay Jha and Dan Suciu. Probabilistic databases with MarkoViews. PVLDB, 5:1160–1171, 2012.
[24] Seyed Mehran Kazemi and David Poole. SimplE embedding for link prediction in knowledge graphs. In NeurIPS, 2018.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[26] Denis Krompass, Maximilian Nickel, and Volker Tresp. Querying factorized probabilistic triple databases. In ISWC, 2014.
[27] Marina Meila and Michael I. Jordan. Learning with mixtures of trees. J. Mach. Learn. Res., 1:1–48, 1998.
[28] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, and Heiner Stuckenschmidt. Anytime bottom-up rule learning for knowledge graph completion. In IJCAI, 2019.
[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[30] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, 2011.
[31] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104:11–33, 2015.
[32] Raymond Reiter. On closed world data bases. In Readings in Artificial Intelligence, pages 119–140. Elsevier, 1981.
[33] Daniel Ruffinelli, Samuel Broscheit, and Rainer Gemulla. You can teach an old dog new tricks! On training knowledge graph embeddings. In ICLR, 2020.
[34] Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic Databases. Morgan & Claypool Publishers, 2011.
[35] Pedro Tabacof and Luca Costabello. Probability calibration for knowledge graph embedding models. In ICLR, 2020.
[36] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.
[37] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
[38] Guy Van den Broeck and Adnan Darwiche. On the complexity and approximation of binary evidence in lifted inference. In NIPS, December 2013.
[39] Guy Van den Broeck and Dan Suciu. Query Processing on Probabilistic Data: A Survey. Foundations and Trends in Databases, 2017.