Approximate Knowledge Graph Query Answering: From Ranking to Binary Classification
Ruud van Bakel, Teodor Aleksiev, Daniel Daza, Dimitrios Alivanistos, Michael Cochez
Ruud van Bakel, Teodor Aleksiev, Daniel Daza, Dimitrios Alivanistos, and Michael Cochez

Computer Science, Vrije Universiteit Amsterdam, The Netherlands {d.dazacruz, d.alivanistos, m.cochez}@vu.nl
University of Amsterdam, The Netherlands [email protected]
Leiden University, The Netherlands [email protected]
Discovery Lab, Elsevier, The Netherlands https://discoverylab.ai
Abstract.
Large, heterogeneous datasets are characterized by missing or even erroneous information. This is more evident when they are the product of community effort or of automatic fact extraction methods from external sources, such as text. A special case of the aforementioned phenomenon can be seen in knowledge graphs, where this mostly appears in the form of missing or incorrect edges and nodes. Structured querying on such incomplete graphs will result in incomplete sets of answers, even if the correct entities exist in the graph, since one or more edges needed to match the pattern are missing. To overcome this problem, several algorithms for approximate structured query answering have been proposed. Inspired by modern Information Retrieval metrics, these algorithms produce a ranking of all entities in the graph, and their performance is further evaluated based on how high in this ranking the correct answers appear. In this work we take a critical look at this way of evaluating. We argue that a ranking-based evaluation is not sufficient to assess methods for complex query answering. To address this, we introduce Message Passing Query Boxes (MPQB), which brings binary classification metrics back into use, and we show the effect this has on the recently proposed query embedding method MPQE.
Keywords:
Query answering · geometric representation · box embeddings · approximation.

In many organizations, a vast amount of complex information is used in operations daily. This data is often stored in various databases or file systems, while information can be retrieved using query languages and information retrieval techniques. During the past decade, several companies have started taking up knowledge graphs (KGs) [10] as a way to represent heterogeneous data and make it useful for a large variety of applications [14]. To make said data accessible, various query languages like SPARQL and Cypher have been developed. Such query languages allow for accessing nodes in the graph, traversing them via specific relations, or retrieving nodes that match a specific pattern. At the core of these languages lie graph patterns. These patterns can be thought of as graph-shaped structures where some nodes and edges correspond to nodes existing in the graph, while others correspond to variables (with specific variable names). When a match for this pattern is found in the graph, the variables are bound and the appropriate values are returned as the result. However, the performance of the previously described process is heavily dependent on the level of completeness of the graph. In detail, completeness refers to whether the graph contains all the nodes and edges in the graph pattern, such that there is a binding for all variables. Having a single node or edge missing from the graph, which represents a comparatively small bit of information, results in missing answers. This phenomenon could be good, in case of an erroneous piece of information, or bad, in case of information missing from the graph. In this paper, we focus on this issue, specifically the case of missing edges in the graph. Ideally, we would like a query system that can still give answers when the phenomenon described before applies.
We would like to have approximate query answering. One way to approach this is by performing link prediction: one tries to predict missing links in the graph by training a machine learning model on its known parts. While not trivial, it is possible to use such a single-link prediction mechanism to answer queries with missing links. Another way to approach this problem is by using so-called query encoders. These encoders take a query as input and produce an embedding (a high-dimensional vector representation) for it. This query embedding is later compared to learned embeddings for the entities in the graph. This machine learning system is optimised such that entities close to the query embedding in vector space are also its probable answers. In this paper we focus on the analysis and evaluation of these systems. Typically, such systems return a series of candidate answers to the query, accompanied by a likelihood or a distance from the query embedding in vector space. In the evaluation phase, this ranking is compared, not to a ground-truth ranking, but rather to the set of correct answers to the query. To do this, typical measures like hits@n (the fraction of correct answers that appear in the top n) and mean reciprocal rank (MRR: the average reciprocal of the rank of correct answers) are used. While these measures are appropriate for information retrieval systems, they fall short when it comes to query systems. In the latter, the results are not ranked; an entity is either a correct answer or it is not. This is also reflected in how these measures are usually adapted by modifying them to filtered versions.
In this case, measures like hits@n and MRR are computed such that true answers higher in the returned ranking are ignored when computing, for example, the rank of lower-ranked entities. We argue that we need to look into metrics that are not based on a specific ranking of the results, but rather on a crisp set of results retrieved from these systems. A main argument for why this is necessary is that many downstream tasks using the aforementioned results need to get a finite set of answers from the knowledge graph, not just a ranked list of all possible entities. That is, we need a query engine that does not just act as a ranking system, but as a binary classifier: it must provide a set of entities that are answers to the query, while all other entities are not. In this scenario, the evaluation would be the same as what has traditionally been used for classification problems, with measures such as precision and recall. This paper is structured as follows: in section 2, we provide an example of several algorithms used for approximate query answering. Then, in section 3 we discuss how metrics for binary classification can provide additional insight on top of the metrics used for ranking. We end that section with a general direction on how this could be achieved in the existing systems using volumetric query embeddings. Section 4 details a first approach for solving this problem using axis-aligned hyper-rectangles for these queries. We describe the MPQB model, a proof of concept, in the section after that. Finally, we provide a conclusion and future outlook. This work is largely based on the Bachelor theses of Ruud van Bakel [3] and Teodor Aleksiev [1], who both worked under the supervision of Michael Cochez at the Vrije Universiteit Amsterdam.
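The ranking metrics mentioned above can be computed directly from the (1-indexed) ranks of the correct answers; a minimal sketch with our own function names:

```python
import numpy as np

def hits_at_n(ranks, n):
    """Fraction of correct answers that appear in the top n of the ranking."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= n))

def mean_reciprocal_rank(ranks):
    """Average of the reciprocal ranks of the correct answers."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

# ranks of the correct answers in a returned ranking (1-indexed, made-up numbers)
ranks = [1, 3, 10]
print(hits_at_n(ranks, 3))          # 2 of the 3 correct answers are in the top 3
print(mean_reciprocal_rank(ranks))  # (1/1 + 1/3 + 1/10) / 3
```

Note that neither metric says anything about a decision boundary: a perfect MRR is compatible with having no principled way to decide where the answer set ends.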
We define a knowledge graph as a tuple G = (V, R, E), where V is a set of entities, R a set of relation types, and E a set of binary predicates of the form r(h, t), where r ∈ R and h, t ∈ V. Each binary predicate represents an edge of type r between the entities h and t, and thus we call E the set of edges in the knowledge graph. A query on a KG looks for the set of entities that meet a particular condition, specified in terms of binary predicates whose arguments can be constants (i.e. entities in V) or variables. As an example, consider the following query (adapted from [4]): "Select all projects P, such that topic T is related to P, and both Alice and Bob work on T". In this query, the constant entities are Alice and Bob, and the variables are denoted as P and T. We can define such a query formally in terms of a conjunction of binary predicates, as follows:

q = P . ∃T : related(T, P) ∧ works_on(Alice, T) ∧ works_on(Bob, T).   (1)

More formally, we are interested in answering conjunctive queries, which have the following general form:

q = V_t . ∃V_1, …, V_m : r_1(a_1, b_1) ∧ … ∧ r_m(a_m, b_m).   (2)

In this notation, r_i ∈ R, and a_i and b_i are constant entities in the KG, or variables from the set {V_t, V_1, …, V_m}. Recent works have proposed to use machine learning methods to answer such queries. These methods operate by learning a vector representation in a space R^d for each entity and relation type. These representations are also known as embeddings, and we denote them as e_v for v ∈ V and e_r for r ∈ R. Similarly, these methods define a query embedding function φ (usually defined with some free parameters) that maps a query q to an embedding φ(q) = q ∈ R^d. Given a query embedding q, a score for every entity in the graph can be obtained via cosine similarity:

score(q, e_v) = (q^⊤ e_v) / (‖q‖ ‖e_v‖).

The entity and relation type embeddings, as well as any free parameters in the embedding function φ, are optimized via stochastic gradient descent on a specific loss function. Usually the loss is defined so that, for the embedding of a given query, the cosine similarity is maximized with embeddings of entities that answer the query, and minimized for embeddings of entities sampled at random. The dataset used for training consists of query-answer pairs mined from the graph. Once the procedure terminates, the function φ can be used to embed a query.
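The cosine-similarity scoring above can be sketched as follows; a minimal NumPy illustration with our own function names and toy embeddings:

```python
import numpy as np

def cosine_scores(q, entity_embs):
    """Score every entity against the query embedding q via cosine similarity."""
    q = q / np.linalg.norm(q)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    return e @ q

# toy example: 3 entities in a 4-dimensional space (made-up numbers)
rng = np.random.default_rng(0)
entity_embs = rng.normal(size=(3, 4))
q = entity_embs[1] * 2.0           # a query embedding collinear with entity 1
scores = cosine_scores(q, entity_embs)
ranking = np.argsort(-scores)      # entities ranked by decreasing score
print(ranking[0])                  # entity 1 ranks first (cosine similarity 1.0)
```

Because cosine similarity ignores vector magnitude, the scaled query still matches entity 1 perfectly; the other entities follow in order of angular closeness.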
The entities in the graph can then be ranked as potential answers by computing the cosine similarity between all the entity embeddings and the embedding of the query. Note that, in contrast with classical approaches to query answering, such as the use of SPARQL in a graph database, this approach can return answers even if no entity in the graph matches every condition in the query exactly. In the next sections we review the specifics of recently proposed methods, which consider particular geometries for embedding entities, relation types, and queries, as well as scoring functions. Conjunctive queries can be represented as a directed acyclic graph, where the leaf nodes are constant entities, any intermediate nodes are variables, and the root node is the target variable of the query. In this graph, the edges have labels that correspond to the relation type involved in a predicate. We illustrate this in Fig. 1 for the example query introduced previously. In Graph Query Embedding (GQE) [9], the authors note that this graph can be employed to define a computation graph that starts with the embeddings of the entities at the leaves, and follows the structure of the query graph until the target node is reached. GQE was one of the first models that defined a query embedding function to answer queries over KGs. The function relies on two different mechanisms, which handle paths and intersections, respectively. This requires generating a large dataset of queries with diverse shapes that incorporate paths and intersections.
Fig. 1.
The query q = P . ∃T : related(T, P) ∧ works_on(Alice, T) ∧ works_on(Bob, T) can be represented as a directed acyclic graph, where the leaves are constant entities, the intermediate node T is a variable, and P is the target entity. (adapted from a figure in [4])

Graph Convolutional Networks (GCNs) [5,11,8] are an extension of neural networks to graph-structured data, which allow defining flexible operators for a variety of machine learning tasks on graphs. Relational Graph Convolutional Networks (R-GCNs) [17] are a special case that introduces a mechanism to deal with different relation types as they occur in KGs, and they have been shown to be effective for tasks like link prediction and entity classification. In MPQE [4], the authors note that a more general query embedding function than GQE's can be defined if an R-GCN is employed to map the query graph to an embedding. The generality stems from the fact that the R-GCN uses a general message-passing mechanism to embed the query, instead of relying on specific operators for paths and intersections.
Both GQE and MPQE embed a query as a single vector (i.e., a point in space). Query2Box [15] deviates from this idea and uses a box shape to represent a query. The method restricts the allowed embedding shape to axis-aligned hyper-rectangles; we discuss in section 4 why that is beneficial. This method has several benefits, especially for conjunctive queries: for these queries, the answer set can be seen as the intersection of the answers to the conjuncts. Such an operation can be realised with an embedded volume, but not with a vector embedding. While this method would have made it possible to create a binary classifier, the model is neither specifically trained nor evaluated for multiple answers.
Complex Query Decomposition (CQD) [2] is a recently proposed method for query answering based on using simple methods for 1-hop link prediction to answer more complex queries. In CQD, the link predictors used are DistMult [21] and ComplEx [20]. Such link predictors are more data-efficient than the previous methods, since they only need to be trained with the set of observed triples. In contrast, to be effective, the previous methods require mining millions of queries covering a wide range of structures. In CQD, a complex query is decomposed in terms of its binary predicates. The link predictor is used to compute scores for each of them, and the scores are then aggregated with t-norms, which have been employed in the literature as continuous relaxations of the conjunction and disjunction operators [18,13,12]. CQD provides an answer to the query by producing a ranking of entities based on the maximization of the aggregated scores. Therefore, the evaluation procedure for CQD is the same as for the previous methods.
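The t-norm aggregation idea can be illustrated on toy per-predicate scores; this is a sketch of the general principle, not CQD's actual implementation:

```python
import numpy as np

def product_tnorm(scores):
    """Product t-norm: a continuous relaxation of conjunction."""
    return np.prod(scores, axis=0)

def godel_tnorm(scores):
    """Goedel (minimum) t-norm: another common relaxation of conjunction."""
    return np.min(scores, axis=0)

# per-predicate scores in [0, 1] for 4 candidate entities (made-up numbers)
s1 = np.array([0.9, 0.2, 0.8, 0.5])   # scores for the first binary predicate
s2 = np.array([0.7, 0.9, 0.6, 0.1])   # scores for the second binary predicate
agg = product_tnorm([s1, s2])
print(np.argsort(-agg))               # entities ranked by aggregated score
```

An entity only scores highly under either t-norm when it scores highly on every conjunct, which mirrors how a conjunction behaves on crisp truth values.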
As discussed above, there are merits to returning a hard answer set as opposed to returning a ranking. One way to obtain such binary classifications is to define a threshold within a ranking. As we will further describe in section 4, one can create such a threshold by using shapes (e.g. axis-aligned hyper-rectangles) for query embeddings.
Binary classification does introduce new challenges. One such challenge is the definition of a loss function that can act differently for entities within the answer set and entities outside it. Since the knowledge graph may contain missing edges, the retrieved target set may be a subset of the ground truth. This in turn could result in entities being incorrectly used within the loss function (i.e. an incorrect closed-world assumption). However, this is not necessarily problematic. We define T to be the ground-truth target set of a query and T′ to be the retrieved target set (i.e. the result of directly querying the KG). Assuming the number of entities missing from T′ is considerably smaller than |V − T|, most entities that do not belong to T′ are also not answers to the query (i.e. not in T). This means that if we sample a relatively small subset of the complement of the retrieved target set (V − T′), it will likely not contain entities that are also in T. In the case where we need to be certain that our sample from V − T′ does not contain entities in T, we could restrict our sampling process to entities which could never appear in T. This is possible, for example, by sampling entities which are incompatible with the domain and range of specific relations in the query (e.g. house entities will never appear in a has_sibling(a, b) relation). Potential downsides of such methods include a potential slowdown during learning, or a limit on the model's overall performance, as having very different entities in T and in our sample from V − T′ could prevent our model from learning the differences between the two sets. On the other hand, if these two sets are very similar, the model would be forced to uncover differences even when they are not
very apparent. In fact, it is often good practice to use so-called "hard" negative samples, which are similar to entities in T′. A better alternative for finding entities not in T would be to use more advanced techniques, as proposed in [16].

Another focal point where binary classification differs from ranking is in the way performance is measured (e.g. F-score versus Mean Reciprocal Rank). In binary classification, a common performance measure is the F-score, the harmonic mean of precision and recall, while in a ranking setting we encounter the Mean Reciprocal Rank. While these metrics differ significantly, there are ways in which they relate. This insight is evident when one considers that rankings can be turned into binary classifications using a threshold. In particular, we notice that ranking metrics typically focus on having entities in T′ higher in the rank. As a result, having many high-ranking entities that are not in T′ is also penalised. Effectively, these measures then provide some notion of how well T′ and V − T′ can be separated. This means that in the case of a low ranking measure, the binary classification can also under-perform. Moreover, it could result in low precision, low recall, or both, depending on where the threshold is placed in the ranking. Geometrically, there is also a correspondence between a ranking with a cutoff point and a system where all answer embeddings within a given distance are included as answers. One could view a classifier with high precision and low recall as having an embedding with a relatively small volume, while a classifier with high recall and low precision has an embedding with a relatively large volume instead. In this setting, the interpretation of a ranking measure would be whether entities in T′ are closer to our geometric query embedding than entities not in T′.
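The correspondence between a distance cutoff and precision/recall can be sketched on toy distances; function names and numbers below are our own:

```python
import numpy as np

def threshold_classifier(distances, tau):
    """Turn a ranking by distance into a binary classifier: answer iff distance <= tau."""
    return np.asarray(distances) <= tau

def precision_recall_f1(predicted, actual):
    tp = np.sum(predicted & actual)
    precision = tp / max(np.sum(predicted), 1)
    recall = tp / max(np.sum(actual), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# distances of 6 entities to the query embedding; the first 3 are the true answers
distances = np.array([0.1, 0.3, 0.9, 0.7, 1.2, 1.5])
actual = np.array([True, True, True, False, False, False])
pred = threshold_classifier(distances, tau=0.5)  # small "volume": high precision, low recall
print(precision_recall_f1(pred, actual))
```

Raising tau (a larger volume) would eventually recover the third true answer, at the cost of also admitting the nearest non-answer: the precision/recall trade-off described above.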
This measure of closeness is defined via a distance metric (e.g. the L1 norm) and can be used in the loss function [15].

As discussed in section 2, an entity is a valid answer to a specific structured query if it satisfies the query. The ultimate aim is to find the set of all valid answers, as entities in the knowledge graph, that satisfy the given query, even when a missing edge in the KG is required for the binary predicates. As discussed, we could either attempt to use a cut-off point in the ranking to obtain a binary classifier, or we could train the embedding model such that it indicates a volume in the embedded space that contains the answers. In this section we present a first possible design of such a system, to show its feasibility. We alter the earlier work done on the query2box [15] method in two ways. First, we interpret the boundaries of the hyper-rectangle used for the embedding as a bounding box. All entities within the box are predicted to be answers to the query, while entities outside it are predicted not to be answers. Second, we do not use the embedding procedure proposed in query2box, but rather perform the embedding using the technique devised in MPQE. Now, we could choose to embed entities using points, as is done in other query embedding methods. Then, entities that get embedded inside the box would be seen as answers to the query, while points outside of it would be seen as non-answers. This is illustrated in figure 2.
Fig. 2.
A small 2D query box embedding: there are three queries A, B and C, and two entities v and w. In this case v is an answer to A and C, whilst w is only an answer to A. (source [3])

But, as we will discuss in more detail in the following subsection, we can also use hyper-rectangles for the entities. The choice we make in the experiments in this paper is to consider an entity, embedded as a box, to be a valid answer to the query if there is an intersection between the two boxes. This is also illustrated in figure 3 for the two-dimensional case. An alternative choice could be to consider an entity an answer only when the entity box lies completely inside the query box. To formalize this, we operate on the embedding space R^d. What we want is to describe an axis-aligned hyper-rectangle in this space. We do this by keeping two vectors, one to indicate the center of the box and one to indicate the offset of the sides of the box. So, in the described model, every entity v ∈ V has an embedding e_v, and additionally an embedding q is defined for the query. The box in R^d corresponding to a 2d-dimensional vector p = (Cen(p), Off(p)), with Cen(p), Off(p) ∈ R^d, is defined as

Box_p = { v ∈ R^d : Cen(p) − Off(p) ⪯ v ⪯ Cen(p) + Off(p) },   (3)

where ⪯ denotes element-wise inequality. Note that a completely analogous definition could be made by keeping the two extreme counterpoints of the box rather than a center and an offset.
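Equation (3) translates directly into a point-in-box test; a minimal sketch, with our own variable names:

```python
import numpy as np

def in_box(v, cen, off):
    """Point-in-box test from equation (3): element-wise Cen - Off <= v <= Cen + Off."""
    v, cen, off = map(np.asarray, (v, cen, off))
    return bool(np.all(cen - off <= v) and np.all(v <= cen + off))

# a 2D query box centered at (1, 1) with per-axis offsets (0.5, 0.5)
cen, off = np.array([1.0, 1.0]), np.array([0.5, 0.5])
print(in_box([1.2, 0.8], cen, off))   # True: inside the box
print(in_box([2.0, 1.0], cen, off))   # False: outside along the first axis
```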
Fig. 3.
A small 2D query and entity box embedding: there are three queries A, B and C, and one entity v. In this case v is an answer to A and B, but not to C. (source [3])

It was already mentioned in the previous section that we represent our entity embeddings with boxes as well. This idea comes from the fact that entities can play different roles in different contexts. For example, we could have a person who works at a university, but is also a member of a political party. Having a single point to represent that person forces a query asking for members of that political party and a query asking for people working at that university to overlap. If we instead use a box for the entity, the query embeddings do not have that additional problem. The issue is also illustrated in figures 4 and 5. The nodes representing Alice and Bob are close to each other in one context, but far away in the other. The embedding of the entities in fig. 5 shows that with boxes it is possible to have the entities close to each other and far away from each other at the same time. With an entity as a box, we can have it as an answer to two disjoint queries, as illustrated in fig. 3.
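The answer criterion used in our experiments, a non-empty intersection between the entity box and the query box, can be sketched as follows; a hedged illustration, not the exact implementation:

```python
import numpy as np

def boxes_intersect(cen_a, off_a, cen_b, off_b):
    """Two axis-aligned boxes intersect iff, on every axis, the distance between
    their centers is at most the sum of their offsets on that axis."""
    gap = np.abs(np.asarray(cen_a) - np.asarray(cen_b))
    return bool(np.all(gap <= np.asarray(off_a) + np.asarray(off_b)))

# a query box and two entity boxes in 2D (made-up numbers)
query_cen, query_off = [0.0, 0.0], [1.0, 1.0]
print(boxes_intersect(query_cen, query_off, [1.5, 0.0], [0.6, 0.6]))  # True: they overlap
print(boxes_intersect(query_cen, query_off, [3.0, 0.0], [0.5, 0.5]))  # False: too far apart
```

The stricter containment criterion mentioned above would instead require `gap + off_b <= off_a` on every axis.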
In this section, we perform an evaluation of the system discussed above. Note that our goal is not to provide state-of-the-art results. Firstly, this is because what we propose is just a proof of concept for an approximate embedding system which can find a set of answers for a query. But the main reason we cannot really compare with other systems is that they are evaluated with ranking metrics, as discussed in section 3.
Figure 6 shows seven distinct query graph structures. We only consider these structures when training and testing our model for the query answering task.
Fig. 4.
Here Alice and Bob are closely related in the context of one specific relation (1 relation minimum), but they are not very closely related in another context (5 hops minimum). (source [3])
Fig. 5.
Here Alice and Bob have relatively close points (seen near the origin), but also very distant points. (source [3])
Fig. 6.
Used query structures for evaluation on query answering. Black nodes correspond to anchor entities, hollow nodes are the variables in the query, and the gray nodes represent the targets (answers) of the query. (source [4])
These structures were originally proposed in GQE [9]. Each of these structures starts with actual entities from the graph (i.e. anchor entities) and ends with a set of target entities. Some of these structures are chains without any intersections (e.g. B . ∃A : knows(Alice, A) ∧ is_related_to(A, B)), whilst others only have intersections (e.g. B . knows(Alice, B) ∧ is_related_to(Bob, B)), or even combinations of both. Our goal is to train a model that finds the answer set of a given query, using a query embedding. This is in contrast to other related work [15,9,4], as we want to be able to find multiple answers. As mentioned before, we can create such a set by embedding the query as a box, thus getting a hard boundary for separating entities in and not in the target set.
While previous work [4,9] incorporated multiple datasets, our implementation has so far only been tested on the AIFB dataset. This dataset is a knowledge graph of an academic institution, in which persons, organizations, projects, publications, and topics are the entities. Table 1 gives some statistics of this dataset, as well as of two more datasets often used for the evaluation of approximate query answering.
Table 1.
Statistics of the knowledge graphs that were used for training and evaluation.
                  AIFB   MUTAG        AM
Entities         2,601  22,372   372,584
Entity types         6       4         5
Relations       39,436  81,332 1,193,402
Relation types      49       8        19
Query Generation
To train our model we have to sample query graphs from our dataset. This is done by initially sampling anchor nodes and relations, which are later used to form graphs based on specific query patterns (fig. 6). After acquiring the anchor nodes and the relations connecting them, we can obtain the target set. Although this may appear straightforward, there are some caveats. The biggest one is that some queries have considerable sets of potential target entities (over 100,000 answers). Because we sample edges first, these particular graphs actually appear often. Luckily, for most query structures this was not the case, but specifically the 2-chain and 3-chain query structures occasionally suffer from it. This is likely explained by the fact that knowledge graphs contain "hub nodes": nodes with a very high degree, to which a plethora of other nodes connect via a certain relation. Table 2 shows the average size of the target sets of sampled queries for the aforementioned datasets. One interesting thing to note is that for the AM dataset the 3-chain-inter structure actually had the largest average target set. This could indicate that this problem is indeed very graph-dependent. Since this is a problem with the AIFB dataset, we limit the query target sets to a maximum of 100 answers. We also sample entities not in the target set, to be used as negative samples during training. For the query structures that contain an intersection, we incorporate hard negative samples by finding entities that would have been in the target set if the conjunctive intersections were relaxed to disjunctions.
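The hard-negative idea for intersection queries can be sketched on toy answer sets (entity names below are made up):

```python
# Hard negatives for an intersection query: entities that satisfy at least one
# conjunct (the disjunctive relaxation) but not all of them (the conjunction).
answers_conjunct_1 = {"paper_a", "paper_b", "paper_c"}   # e.g. answers to works_on(Alice, T)
answers_conjunct_2 = {"paper_b", "paper_c", "paper_d"}   # e.g. answers to works_on(Bob, T)

targets = answers_conjunct_1 & answers_conjunct_2        # retrieved target set T'
relaxed = answers_conjunct_1 | answers_conjunct_2        # disjunctive relaxation
hard_negatives = relaxed - targets                       # near-misses used as negatives

print(sorted(targets))         # ['paper_b', 'paper_c']
print(sorted(hard_negatives))  # ['paper_a', 'paper_d']
```

Such negatives are "hard" precisely because they satisfy part of the query, so the model cannot separate them from the targets by superficial features alone.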
Table 2.
Average number of multiple answers to different queries structures, acrossthe used datasets. (results were earlier reported in [1])
             AIFB        MUTAG          AM
Structure Train Test  Train Test  Train Test
1-chain     3.4  1.2    1.9  1.1    1.2  1.0
2-chain    34.5  6.4   13.4  4.7   10.2  3.5
3-chain
Evaluation
In order to test whether the model is actually able to find answers to queries that involve edges which are not in the graph, careful preparation of our data splits was necessary. We started from our original graph and marked 10% of the edges for removal (at this stage, they are still present). Then, we sample the graph for the query patterns. If a sample makes use of any edge marked as removed, it is added to either the validation set or the test set (10/90 split). If the sample contains no such marked edge, then we put it in the training set. This way, we end up with validation and test queries that make use of at least one edge that is not in the graph seen during training. After sampling, we end up with around 2 million targets and the corresponding query graphs to be used in the training set. For the validation set we used about 30,000 targets worth of queries, and for the test set approximately 300,000 targets worth of query graphs. The validation set is also used to perform early stopping in case specific conditions are not met. Since our method uses boxes, which allow for binary classification, we report our model's performance in the form of a confusion matrix (see figure 7). Given the fact that our entities are also boxes, we have more freedom in choosing when an entity is considered an answer. This is because entities now inhabit more space than a single point, which allows for partial overlap with query boxes. In order to allow flexibility, we have decided that an entity is considered an answer to a query if its box representation overlaps with the box representation of the respective query. Naturally, other, stricter conditions could be applied, such as requiring full containment or defining a fraction-based threshold (e.g. requiring at least 50% overlap). We expect these conditions to change based on the potential downstream task.
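The fraction-based criterion mentioned above could be sketched as follows; a hypothetical variant with our own function name, not the criterion we actually evaluate:

```python
import numpy as np

def overlap_fraction(cen_e, off_e, cen_q, off_q):
    """Fraction of the entity box's volume that lies inside the query box."""
    lo = np.maximum(cen_e - off_e, cen_q - off_q)   # lower corner of the intersection
    hi = np.minimum(cen_e + off_e, cen_q + off_q)   # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))    # zero if the boxes are disjoint
    entity_volume = np.prod(2.0 * off_e)
    return float(inter / entity_volume)

# an entity box half-covered by the query box along the first axis (made-up numbers)
cen_e, off_e = np.array([2.0, 0.0]), np.array([1.0, 1.0])
cen_q, off_q = np.array([0.0, 0.0]), np.array([2.0, 2.0])
frac = overlap_fraction(cen_e, off_e, cen_q, off_q)
print(frac >= 0.5)   # True: at least 50% of the entity box overlaps the query box
```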
Fig. 7.
Model of the confusion matrix used for evaluation of the results. The empty box is the representation of a query; the black and the gray boxes are, respectively, a valid and an invalid answer to the query. (source [1])
Fig. 8.
The MPQB model used in this proof of concept. (adapted from a figure in [3])
Model
Our model has the same basic functionality as the MPQE [4] model. MPQE is used as an embedding component, but the input and output are interpreted as boxes. MPQE first performs several steps of message passing using an R-GCN architecture, after which the node states are aggregated to form the query embedding. With this query embedding, a loss function is evaluated, which is used as a signal (using SGD) to update the embeddings and weights in the network. For the aggregation operation we have several options (SUM, MAX, TM, MLP) at the end of our model. We test our model with several of these aggregation functions. Since we train an embedding matrix (as opposed to having a latent embedding to start with), we need to initialize it. We do this by sampling the 32-dimensional center vectors from a uniform distribution between 0 and 10, whilst sampling the 32-dimensional offset vectors from a unit Gaussian with mean 3. For TM aggregation, the MPQE model uses 3 layers; the TM aggregation function requires a number of message passing steps equal to the query diameter, in our case 3. For the MLP aggregation function we applied a two-layer fully-connected MLP. As for the non-linearities in our model, we used the ReLU function. To update the parameters of the model we used the Adam optimizer with a learning rate of 0.01. Our code base is based on PyTorch. In particular, we made use of the library PyTorch Geometric [7], which is a PyTorch extension specialised for graph-based models. While there are potential baselines to consider [9,4], they are not suitable for our work. This is because we perform binary classification, as opposed to ranking-based methods. To our knowledge, there has not been any related work that performed binary classification in the context of approximate graph querying. In the area of link prediction we do find some work, like the early work on Neural Tensor Networks [19] and a more recent one which looks at triple classification [6]. This did not prove to be a major concern, as our main goal was not to achieve state-of-the-art results, but rather to explore whether this direction of research may prove worthwhile.
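The initialization scheme described above can be sketched as follows; this is our own NumPy sketch of the described scheme, not the actual code base, and the clipping of offsets to non-negative values is our assumption:

```python
import numpy as np

def init_box_embeddings(num_entities, dim=32, seed=0):
    """Initialize box embeddings: centers uniform in [0, 10]; offsets from a
    unit-variance Gaussian with mean 3 (clipped non-negative: our assumption)."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 10.0, size=(num_entities, dim))
    offsets = np.clip(rng.normal(3.0, 1.0, size=(num_entities, dim)), 0.0, None)
    return centers, offsets

centers, offsets = init_box_embeddings(num_entities=2601)  # AIFB has 2,601 entities
print(centers.shape, offsets.shape)   # (2601, 32) (2601, 32)
```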
After having trained the MPQB model for over 200,000 iterations it appeared tostill not have converged. After this amount of iterations the query boxes seemedto not overlap with any target boxes (i.e. no entities in T (cid:48) were returned). Apartfrom training the model for longer and on multiple epochs, there are some othersettings that could still be experimented with. For example, how many samplesare in each epoch (less samples allow for training on more epochs), whether weuse T (cid:48) fully during train or use a subset, and how many entities should be in oursample from V − T (cid:48) . The latter two settings also influence how many distinctqueries we could train on within a given time span. In may be worth noting thatprevious works [9,4,15] train using single positive samples. While we want tofocus on answering queries with multiple answers, we do not necessarily need totrain on multiple answers. In theory, if a method can produce a good ranking,it should also be able to produce a good classification, given that the optimalthresholds for these rankings could be found.Since we do not have direct result in a manner we would have liked, we will
instead analyse the trained models to see if there are relevant insights to be found. For this we looked at models using different aggregation functions, trained on the AIFB dataset.

While we found no intersections between query boxes and target boxes, we can still check whether the target boxes (from T′) appear relatively close to the query boxes, compared to the box representations of entities in V − T′. This effectively provides some measure of whether the produced rankings are good. Table 3 shows these results. While these scores do not indicate state-of-the-art performance, they do suggest that the model produced decent, non-trivial rankings with the SUM and TM aggregators. This could suggest that further research is indeed in order. The fact that TM outperformed SUM is not surprising, considering that it is a more involved method that also takes the query diameter into account. This result is also in line with the findings in [4]. A more surprising result is that the MLP method did not seem to perform well at all. This could be due to a faulty implementation, or an implementation that simply does not work for boxes as is. Overall, the results seem promising.
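One plausible reading of the measure reported in Table 3 is the fraction of (answer, non-answer) pairs in which the answer's box lies closer to the query box. A sketch of that pairwise computation, assuming distances from each entity box to the query box are already available (the function name is our own):

```python
import numpy as np

def pct_answers_closer(answer_dists, non_answer_dists):
    # Percentage of (answer, non-answer) pairs in which the answer's box
    # lies closer to the query box than the non-answer's box.
    a = np.asarray(answer_dists, dtype=float)[:, None]   # shape (|T'|, 1)
    n = np.asarray(non_answer_dists, dtype=float)[None, :]  # shape (1, |V - T'|)
    return 100.0 * float((a < n).mean())

# Toy example: 5 of the 6 pairs have the answer closer to the query box.
print(pct_answers_closer([0.2, 0.5], [0.4, 1.0, 2.0]))
```

A score of 50% would correspond to a trivial (random) ranking, which is why values well above 50 are read as "decent non-trivial rankings" in the discussion above.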
Table 3.
Percentage (%) of answers embedded closer to the query box than a non-answer, per query structure, using different aggregation functions. Tested on the AIFB dataset. (Results were earlier reported in [1].)
AIFB
Structure  SUM    TM  MLP
1-chain    67.48
Conclusion

In this work, we looked critically at the currently prevailing evaluation strategy for approximate complex structured query algorithms for knowledge graphs. Typically, these systems take a query as input and produce a ranking of all entities in the KG as output. The performance of these systems is then determined using metrics typically used in information retrieval.

What we propose is to augment the current evaluations by also requiring these systems to produce a binary classification of the nodes into a class of answers and one of non-answers. This is needed because many applications simply cannot work with a ranking and need a fixed set of answers. As a first proof of concept, we adapted ideas from MPQE and Query2Box, and created an embedding algorithm that represents queries and entities as axis-aligned hyper-rectangles. We noticed that the performance of this system is rather low, and expect that future work can improve heavily upon this first attempt.

As future research directions, we see a need to expand our experiments to include other query types (disjunctions, negations, filters, etc.), in order to show the generalizability of our approach. This will, however, require new representations for the volumes, as these operations are not possible if we stay with just boxes. For example, the negation of a box would no longer be a box. Moreover, it needs to be investigated how our method can be applied to different kinds of graphs. This will give us insights into what changes need to be made in terms of training data (via query generation) as well as the effects on model performance. It also seems worth experimenting with different geometric representations for the parts of the query (anchors, variables, and targets). Finally, since our experiments were relatively small-scale, further research could also start by simply experimenting with different settings for our current architecture.
References
1. Aleksiev, T.: Answering approximated graph queries, embedding the queries and entities as boxes (2020), BSc. thesis, Computer Science, Vrije Universiteit Amsterdam. Supervised by Cochez, M.
2. Arakelyan, E., Daza, D., Minervini, P., Cochez, M.: Complex query answering with neural link predictors. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=Mos9F9kDwkz
3. van Bakel, R.: Box R-GCN: Structured query answering using box embeddings for entities and queries (2020), BSc. thesis, Computer Science, Vrije Universiteit Amsterdam. Supervised by Cochez, M.
4. Daza, D., Cochez, M.: Message passing query embedding. In: ICML Workshop on Graph Representation Learning and Beyond (2020), https://arxiv.org/abs/2002.02406
5. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems. pp. 3844–3852 (2016)
6. Dong, T., Wang, Z., Li, J., Bauckhage, C., Cremers, A.B.: Triple classification using regions and fine-grained entity typing. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 77–85 (2019)
7. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428 (2019)
8. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. pp. 1263–1272 (2017)
9. Hamilton, W., Bajaj, P., Zitnik, M., Jurafsky, D., Leskovec, J.: Embedding logical queries on knowledge graphs. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 31, pp. 2026–2037. Curran Associates, Inc. (2018), https://proceedings.neurips.cc/paper/2018/file/ef50c335cca9f340bde656363ebd02fd-Paper.pdf
10. Hogan, A., Blomqvist, E., Cochez, M., d'Amato, C., de Melo, G., Gutierrez, C., Gayo, J.E.L., Kirrane, S., Neumaier, S., Polleres, A., et al.: Knowledge Graphs. arXiv preprint arXiv:2003.02320 (2020)
11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
12. van Krieken, E., Acar, E., van Harmelen, F.: Analyzing Differentiable Fuzzy Implications. In: Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning. pp. 893–903 (2020). https://doi.org/10.24963/kr.2020/92
13. Minervini, P., Demeester, T., Rocktäschel, T., Riedel, S.: Adversarial sets for regularising neural link predictors. In: UAI. AUAI Press (2017)
14. Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale knowledge graphs: lessons and challenges. Communications of the ACM (8), 36–43 (2019)
15. Ren, H., Hu, W., Leskovec, J.: Query2box: Reasoning over knowledge graphs in vector space using box embeddings (2020)
16. Safavi, T., Koutra, D., Meij, E.: Evaluating the calibration of knowledge graph embeddings for trustworthy link prediction (2020)
17. Schlichtkrull, M., Kipf, T.N., Bloem, P., Van Den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: European Semantic Web Conference. pp. 593–607. Springer (2018)
18. Serafini, L., d'Avila Garcez, A.S.: Logic tensor networks: Deep learning and logical reasoning from data and knowledge. CoRR abs/1606.04422 (2016), http://arxiv.org/abs/1606.04422
19. Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor networks for knowledge base completion. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems. vol. 26, pp. 926–934. Curran Associates, Inc. (2013), https://proceedings.neurips.cc/paper/2013/file/b337e84de8752b27eda3a12363109e80-Paper.pdf