Ranking vs. Classifying: Measuring Knowledge Base Completion Quality
Automated Knowledge Base Construction (2020) Conference paper
Marina Speranskaya [email protected]
Martin Schmitt [email protected]
Benjamin Roth [email protected]
Center for Information and Language Processing, LMU Munich, Germany
Abstract
Knowledge base completion (KBC) methods aim at inferring missing facts from the information present in a knowledge base (KB). Such a method thus needs to estimate the likelihood of candidate facts and ultimately to distinguish between true facts and false ones to avoid compromising the KB with untrue information. In the prevailing evaluation paradigm, however, models do not actually decide whether a new fact should be accepted or not but are solely judged on the position of true facts in a likelihood ranking with other candidates. We argue that consideration of binary predictions is essential to reflect the actual KBC quality, and propose a novel evaluation paradigm, designed to provide more transparent model selection criteria for a realistic scenario. We construct the data set FB14k-QAQ with an alternative evaluation data structure: instead of single facts, we use KB queries, i.e., facts where one entity is replaced with a variable, and construct corresponding sets of entities that are correct answers. We randomly remove some of these correct answers from the data set, simulating the realistic scenario of real-world entities missing from a KB. This way, we can explicitly measure a model's ability to handle queries that have more correct answers in the real world than in the KB, including the special case of queries without any valid answer. The latter especially contrasts the ranking setting. We evaluate a number of state-of-the-art KB embedding models on our new benchmark. The differences in relative performance between ranking-based and classification-based evaluation that we observe in our experiments confirm our hypothesis that good performance on the ranking task does not necessarily translate to good performance on the actual completion task.
Our results motivate future work on KB embedding models with better prediction separability and, as a first step in that direction, we propose a simple variant of TransE that encourages thresholding and achieves a significant improvement in classification F score relative to the original TransE.

Introduction

A knowledge base contains relational information about the world in the form of triples. For instance, the fact "New York is located in the USA" could be represented as the triple (New York, located in, USA). Given the available information in an incomplete knowledge base, the task of knowledge base completion (KBC) is to find missing facts by predicting the most likely missing relation between known entities. Formally, a knowledge base describes a set of objects - or entities - E, connected to each other via binary relations R, and contains a collection of supposedly true facts KB+ ⊆ E × R × E. The KBC task is to infer new true facts consisting of head entity h′, relation r′ and tail entity t′ with (h′, r′, t′) ∉ KB+, given the facts in KB+. The quality of KBC is typically measured by removing a triple (h, r, t) from KB+ and comparing the score assigned by a completion algorithm to the scores assigned to perturbed triples (h, r, t′) where t′ ≠ t.

Evaluation of embedding models on the KBC task intuitively should measure the quality of facts added by a completion algorithm. Standard metrics for evaluating KBC such as top-k precision or mean reciprocal rank (MRR), however, measure the quality of ranking possible knowledge graph triples.
These metrics do not necessarily reflect the real performance on the underlying task since it would be necessary to combine a triple scoring mechanism (e.g., based on knowledge graph embeddings) with thresholds for obtaining a prediction mechanism, and the triple scoring mechanism might not be consistently scaled: It could be the case that the relative ranking of tuples is satisfactory when ranked for the same query tuple (h, r, ?), but that scores are not well-calibrated and finding good global or per-relation thresholds is difficult for certain embedding-based scoring mechanisms.

In this work, we propose an alternative way of evaluating KBC quality by reporting classification measures (e.g., F scores) on the carefully constructed data set FB14k-QAQ that balances query tuples for which completion is possible with special query tuples that, by construction (using type constraints and entity removal), are impossible to complete. This new evaluation approach motivates research on embedding models that intrinsically support thresholds for prediction and we propose a simple variant of TransE that improves on the new evaluation metric relative to the original model.

Previous work [Socher et al., 2013] has attempted to overcome problems of ranking-based evaluation approaches by artificially creating a fixed amount of negative samples from positive triples (typically a 1-1 ratio) and measuring accuracy on that data set. Such a setting, however, does not properly reward models that are able to distinguish between relationships that should have more positive predictions vs. those that should have fewer. Wang et al. [2019] identify problems of previous evaluation paradigms based on entity rankings and they propose an alternative scheme that looks at all possible entity pairs, ranked for a given relation.
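The calibration problem can be illustrated with a small numeric sketch (toy scores, invented for illustration, not from the paper): two models that rank the candidates of every query identically, and therefore receive the same MRR, can still differ sharply once a single global threshold has to turn scores into accept/reject decisions.

```python
# Toy illustration (hypothetical scores): identical rankings, different calibration.
# Two queries, three candidates each; the gold answer is candidate 0.
model_a = {"q1": [0.9, 0.4, 0.1], "q2": [0.8, 0.3, 0.2]}  # well-separated scores
model_b = {"q1": [0.9, 0.4, 0.1], "q2": [0.3, 0.2, 0.1]}  # same order, shifted scale

def mrr(scores_per_query, gold_idx=0):
    """Mean reciprocal rank of the gold candidate (rank 1 = best)."""
    total = 0.0
    for scores in scores_per_query.values():
        rank = 1 + sum(s > scores[gold_idx] for s in scores)
        total += 1.0 / rank
    return total / len(scores_per_query)

def f1_at_threshold(scores_per_query, tau, gold_idx=0):
    """Micro F1 when every candidate scored above tau is predicted as a true fact."""
    tp = fp = fn = 0
    for scores in scores_per_query.values():
        for i, s in enumerate(scores):
            if s > tau:
                tp += i == gold_idx
                fp += i != gold_idx
            elif i == gold_idx:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

assert mrr(model_a) == mrr(model_b) == 1.0      # identical, perfect rankings
print(f1_at_threshold(model_a, tau=0.5))         # 1.0: one global threshold suffices
print(f1_at_threshold(model_b, tau=0.5))         # 2/3: q2's gold answer falls below tau
```

Both models are indistinguishable under MRR, but only the consistently scaled one admits a good global threshold, which is exactly the gap the classification-based evaluation exposes.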
While their proposed evaluation solves some of the problems (comparability of scores between query entities), it is still ranking-based and does not incentivize the scoring model to support globally optimal prediction thresholds across relationships.

The main contributions of this paper are:

• We construct a data set for extensive classification evaluation that penalizes models that predict erroneous triples. For this, we construct two types of negative cases: First, a subset of entities is sampled and removed from an existing knowledge base, so that corresponding queries can be obtained that, by construction, do not have any correct answers. Second, we formulate queries that violate type constraints and thus cannot have any right answers either.

• Experiments with established embedding models show surprising differences when the new metric is compared to an evaluation based on ranking.

• Our evaluation suggests that models should focus on optimizing separability of their predictions. An adapted version of TransE that encourages separability improves by over 30% F score relative to the original model.

Knowledge graph embedding models.
Embedding models assign a latent representation to every entity and relation of a knowledge base. Within the scope of this paper, entities h ∈ E are represented as d-dimensional vectors e_h ∈ R^d and relations r ∈ R either as vectors r_r ∈ R^d or as matrices R_r ∈ R^{d×d}. A KBC model is characterized by its scoring function s(h, r, t): E × R × E → R.

Various approaches exploiting such representations of entities and relations have been proposed for the task of knowledge base completion. One of the most prominent groups among them is that of tensor factorization models: RESCAL [Nickel et al., 2011] with the scoring function s(h, r, t) = e_h^T R_r e_t, DistMult [Yang et al., 2014], which sets diagonal restrictions on the matrix, with s(h, r, t) = e_h^T diag(r_r) e_t, and ComplEx [Trouillon et al., 2016], which uses complex-numbered embeddings with the previous scoring function. The field of translation models was opened by the TransE [Bordes et al., 2013] scoring approach s(h, r, t) = −‖e_h + r_r − e_t‖_p, followed by a number of variants, such as projection on relation-specific hyperplanes (TransH [Wang et al., 2014]) or transforming entity embeddings to a relation-specific vector space prior to scoring (TransR [Lin et al., 2015]). TransA [Xiao et al., 2015] scoring uses an additional matrix W_r per relation, which is derived from the entity and relation embeddings analytically (rather than learned), and also replaces the L_p norm: s(h, r, t) = −(|e_h + r_r − e_t|)^T W_r (|e_h + r_r − e_t|), where |e_h + r_r − e_t| takes an absolute value in every vector position. More recently, further improvements were obtained with neural approaches like ConvE [Dettmers et al., 2017], KG-BERT [Yao et al., 2019] and Graph Attention Networks [Nathani et al., 2019]. For the purposes of this work, we selected a cross-group model sample consisting of DistMult, ComplEx, TransE and ConvE.

Alternatives to ranking-based prediction.
Socher et al. [2013] first introduced an evaluation setting based on triple classification. However, the data contains only two randomly sampled negative triples for every positive triple, which poses an unrealistic ratio for KBC. Additionally, in case of overlapping negative samples (h, r, t′) for two test triples (h, r, t1) and (h, r, t2), the evaluation protocol would count the predicted labels of these negative samples twice into the final classification metric. This redundancy effect would be even stronger with a higher number of negative samples due to a higher overlap probability. Our query-based evaluation, however, eliminates it completely as all (h, r, ?) test triples are considered at once.

In contrast to KB embedding methods, Godin et al. [2019] approach query answering with a reinforcement learning model that attempts to find a path through a knowledge graph from h to the correct t. Their approach allows the model to refrain from giving an answer rather than giving a wrong one. For evaluation, they suggest using the precision of the given answers together with an answer rate metric that measures the rate of empty responses of the model. However, the used data set contains exactly one correct answer for each given query, missing other realistic cases, such as multiple answers and unanswerable queries, for which these metrics would not be applicable.

MRR reviews.
Sun et al. [2019] have noticed inconsistencies in model behavior measured by mean reciprocal rank (MRR) that the authors attribute to the ranking setting itself. An inappropriate performance measure can become an explanation for the sudden comebacks of rather basic models in Kadlec et al. [2017] and Ruffinelli et al. [2020], as well as strong performance variations. In their recent work, Wang et al. [2019] criticize the current evaluation protocol with the mean reciprocal rank for being unsuitable for KBC. They accuse it of overestimating the model performance due to its insensitivity to unrealistic and nonsensical triples and they propose an improved (but still ranking-based) metric for KB embedding evaluation.
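The factorization and translation scoring functions surveyed above can be sketched in a few lines (a minimal NumPy sketch with random toy embeddings; the dimension and initialization are illustrative assumptions, not the paper's training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Toy embeddings for one (head, relation, tail) triple.
e_h, e_t = rng.normal(size=d), rng.normal(size=d)
r_vec = rng.normal(size=d)       # relation as a vector (DistMult, TransE)
R_mat = rng.normal(size=(d, d))  # relation as a matrix (RESCAL)

def score_rescal(e_h, R_r, e_t):
    # s(h, r, t) = e_h^T R_r e_t
    return e_h @ R_r @ e_t

def score_distmult(e_h, r_r, e_t):
    # s(h, r, t) = e_h^T diag(r_r) e_t, a diagonal restriction of RESCAL
    return np.sum(e_h * r_r * e_t)

def score_transe(e_h, r_r, e_t, p=1):
    # s(h, r, t) = -||e_h + r_r - e_t||_p; higher (closer to 0) is better
    return -np.linalg.norm(e_h + r_r - e_t, ord=p)

# DistMult equals RESCAL with a diagonal relation matrix:
assert np.isclose(score_distmult(e_h, r_vec, e_t),
                  score_rescal(e_h, np.diag(r_vec), e_t))
```

All three functions map a candidate triple to a real-valued score; the evaluation question raised in this paper is what happens when such scores must be thresholded rather than merely ranked.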
In this section a new evaluation setting is proposed that directly measures the quality of extracted facts by matching against a ground truth. For this, care must be taken that the evaluation not only accounts for correctly retrieved facts, i.e., true positives, but also rewards cases where the model correctly refrains from predicting an answer, i.e., produces fewer false positives. We argue that a query-driven setup with carefully constructed query and answer sets is necessary for measuring KBC quality on the triple level. As a foundation for the FB14k-QAQ data set (measuring Query Answering Quality), we base our work on the FB15k-237 [Toutanova et al., 2015] data set, a subset of FreeBase, because it is a well-established benchmark for KBC and has been used in most of the recent publications on KB embedding methods.

From facts to queries.
The new evaluation setting relies on queries. A query is an entity-relation pair where a second entity is missing to form a complete triple: if the tail position is open for completion, we call such a query q = (h, r, ?) a tail query. A query of the form q = (?, r, t) is called a head query, respectively. The ◦ operator is defined to fill the open position of a query with the specified entity i ∈ E, i.e., (h, r, ?) ◦ i = (h, r, i) and (?, r, t) ◦ i = (i, r, t). For every query q, the set of correct answers A^F_q can be defined given a set of valid facts F ⊆ E × R × E by setting A^F_q := { i | (q ◦ i) ∈ F }.

Figure 1 depicts the data set creation process, which operates on the original training data and the data from the original development and test splits (further, evaluation data): (1) A small subset of entities E− ⊂ E from the original data set is selected for removal ("Select"). We will refer to the remaining entities as E+ := E \ E−. (2) In "Split", the new training set is obtained by taking all original training triples that only contain entities from E+. The evaluation data and the remaining training triples are used to create queries and determine answer sets, i.e., they form the set of valid facts F mentioned above. (3) The facts from F are then grouped by head and relation to form tail queries and by tail and relation for head queries, respectively. The "Group" step results in a set of queries q with corresponding answer sets A′_q. (4) To obtain the final answer sets A_q, the selected entities E− are removed from the answer sets in "Remove". With respect to the intermediate answer sets A′_q from "Group" and the final ones A_q from "Remove", the queries can be divided into two sets: C, where the answer set remained complete, and I, where entities have been removed from the answer sets (including the special case of empty answer sets N ⊂ I).
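The four steps can be sketched on a toy knowledge base (a minimal Python sketch; the entity and relation names are invented for illustration):

```python
from collections import defaultdict

# Toy facts (head, relation, tail); "berlin" plays the role of a removed entity.
train = [("germany", "has_city", "munich"), ("france", "has_city", "paris"),
         ("germany", "has_city", "berlin")]
evaluation = [("france", "has_city", "lyon")]
removed = {"berlin"}  # E-: entities selected for removal ("Select")

# (2) "Split": the new training set keeps only triples whose entities survive.
new_train = [(h, r, t) for h, r, t in train
             if h not in removed and t not in removed]
# Valid facts F: evaluation data plus the training triples dropped in "Split".
valid_facts = evaluation + [x for x in train if x not in new_train]

# (3) "Group": tail queries (h, r, ?) with intermediate answer sets A'_q.
answers = defaultdict(set)
for h, r, t in valid_facts:
    answers[(h, r)].add(t)

# (4) "Remove": drop E- entities and classify queries into C, I, and N ⊂ I.
C, I, N = [], [], []
for q, a_prime in answers.items():
    a_q = a_prime - removed
    (C if a_q == a_prime else I).append(q)
    if not a_q:
        N.append(q)

assert C == [("france", "has_city")]   # answer set remained complete
assert I == [("germany", "has_city")]  # "berlin" was removed from its answers
assert N == I                          # ...leaving the answer set empty
```

In this toy example the (germany, has_city, ?) query still looks perfectly meaningful, yet its only evaluation answer exists outside the reduced KB, which is exactly the closed-world trap the data set construction is designed to measure.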
The full formal description of the process is provided in Appendix A.1.

By removing entities from real triples of the original data set, we artificially create a situation where completion entities exist in the real world but are not present in the data set, thereby simulating a controlled closed-world problem. It constitutes a challenge to KBC models not to complete such a query that not only appears meaningful but actually has a real-world answer to it outside of the knowledge base.

Figure 1: Data set transformation scheme as described in Section 3 "From facts to queries". Step (3) only shows the grouping for head queries; tail queries are constructed analogously. Full circles represent relations and entities from E+; entities from E−, i.e., those selected for removal, are dashed. A query belongs to C if A′_q = A_q, and to I otherwise (edge case N ⊂ I if |A_q| = 0).

Queries with type violation.
A more relaxed version of queries with empty answer sets are queries with an inherent contradiction that would immediately be recognized by a human, such as (Albert Einstein, has capital, ?). The formal contradiction in this query is expressed in the type system of FreeBase triples. In FreeBase, entities are labelled with different types, e.g., an entity h = New York City is assigned a type set T_h = { location, art subject, wine region } while for t = Albert Einstein it is T_t = { person, book author, scientist }. The type system ensures that every relation only takes entities of a specific type as head and tail argument, respectively. For instance, a relation r = has capital takes entities of type dom(r) = country for its head position (relation domain) and entities of type rng(r) = citytown for its tail position (relation range). Triples in FreeBase follow this scheme and are therefore type-consistent, i.e., KB+ ⊆ { (h, r, t) | dom(r) ∈ T_h, rng(r) ∈ T_t }. However, a FreeBase triple can be type-consistent and still false, e.g., (USA, has capital, New York City).

By analogy, type-consistent queries obey domain and range restrictions with respect to the already filled position and type-inconsistent queries do not. The obviously incorrect query (Albert Einstein, has capital, ?) in fact violates the types since the domain country of has capital does not occur in the type set T_t for the entity t = Albert Einstein. Formally, a set of such type-inconsistent "fake" queries F can be characterized as follows:

F ⊂ { (h, r, ?) | h ∈ E+ ∧ dom(r) ∉ T_h } ∪ { (?, r, t) | t ∈ E+ ∧ rng(r) ∉ T_t }

This means, all queries in F violate at least one of the relation domain or range restrictions and thus cannot have a valid completion due to this type inconsistency. In an automated knowledge base completion setting, e.g., when there is no type information available to check completion queries for type violations, this can become quite relevant. A perfect model should distinguish well-typed queries from nonsensical ones. By combining type-consistent N and type-violating F queries with empty answer sets, we address different aspects of model behavior.

Table 1: Query distribution between head and tail queries per query set (C: complete answer set; I: entities have been removed from the answer sets; N ⊂ I: 6,871 head, 8,850 tail, 15,721 total; F: queries with type violations). [Remaining per-set counts not recoverable from the extracted text.]

Overview.
The number of entities in E− is an essential parameter within this data set construction strategy. The more entities are removed, the more queries with smaller answer sets potentially arise. We pursued a specific final distribution of queries: queries with no answer N make up 25% of the evaluation set, queries with at least one answer make up 50%; the remaining 25% are filled with type-violating queries F. For the FB14k-QAQ data set, we achieved this by removing |E−| = 1000 entities.

The resulting FB14k-QAQ data set has 13,541 entities and 237 relations, with 236,795 triples in the training set. The three sets of different query groups C, I, and F are evenly split between the development set D and the test set T, resulting in 32.5k queries each, 50 percent of which have an empty answer set. The other 50 percent include queries with one or more correct answers. The ratio of queries with a complete answer set C to those with a removed answer I is almost 1:1; more detailed statistics about the query distribution between the sets are presented in Table 1.

Thresholding.
To measure the models' capability to decide between true and false triples, a threshold τ_r is applied to the output scores to obtain binary predictions. We consider two thresholding settings: (i) the same τ_r = τ_global value is shared across all relations and (ii) τ_r is relation-specific. The global threshold can be easily optimized for a given set of predictions. The relation-specific thresholds are found using a greedy iterative algorithm (details in Appendix B.1) that optimizes the micro-averaged F score for 474 relations (including the separately embedded inverse relations).
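A global threshold can be tuned by sweeping the observed scores and keeping the cut with the best micro-averaged F score (a minimal sketch with toy development scores; the greedy per-relation algorithm from Appendix B.1 is not reproduced here):

```python
def micro_f1(scored_queries, tau):
    """Micro-averaged F1 over queries given (score, is_gold) candidate lists."""
    tp = fp = fn = 0
    for candidates in scored_queries:
        for score, is_gold in candidates:
            if score > tau:
                tp += is_gold
                fp += not is_gold
            else:
                fn += is_gold
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_global_threshold(scored_queries):
    """Try every observed score as a cut point; return the best (tau, F1)."""
    cuts = sorted({s for q in scored_queries for s, _ in q})
    return max(((tau, micro_f1(scored_queries, tau)) for tau in cuts),
               key=lambda pair: pair[1])

# Toy development scores: (model score, whether the candidate is a gold answer).
dev = [[(0.9, True), (0.6, False), (0.2, False)],
       [(0.8, True), (0.7, True), (0.3, False)]]
tau, f1 = tune_global_threshold(dev)  # tau = 0.6 separates gold from non-gold
```

The relation-specific setting repeats this kind of search per relation while keeping the other thresholds fixed, which is why it can only improve on the single global cut.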
1. Type information provided by [Wang et al., 2019].
2. The final data set, as well as the source code for data set construction and model evaluation, are available at https://github.com/marina-sp/classification_lp.

MRR.
The query-based format of the FB14k-QAQ is incompatible with the MRR evaluation. To be able to evaluate this data set on ranking, we reconstruct the underlying valid facts F from the queries in the dev D and test T data by completing every query with all entities from their answer set, resulting in the triple sets D_rank = { q ◦ i | q ∈ D ∧ i ∈ A_q } and T_rank analogously. The empty queries N and F are ignored in the ranking evaluation, as they cannot build a valid fact.

The triples (h, r, t) from D_rank and T_rank are scored and ranked against all possible perturbations of entities as follows: The rank rank(h, r, t) of an evaluation triple (h, r, t) is defined as its index in a sorted array of scores. For perturbed tails, this array contains the scores of { (h, r, t′) ∉ Train | t′ ∈ E+ }; for heads, it is { (h′, r, t) ∉ Train | h′ ∈ E+ }, i.e., scores for known triples from the training set are excluded. The mean reciprocal rank for an evaluated set of triples X_rank (substitute D_rank and T_rank) is then

MRR_X = (1 / |X_rank|) ∑_{(h,r,t) ∈ X_rank} 1 / rank(h, r, t)

Metric definition.
In the classification setting, evaluation is based on a model's binary decisions. For an arbitrary query q with a relation r, all entities with a score s above the tuned threshold τ_r constitute the positive response set R_q = { i ∈ E+ | s(q ◦ i) > τ_r }, i.e., the model retrieved these entities as a valid query completion.

Recall from Section 3 that, for each query q, the evaluation data contain a set A_q of expected correct (relevant) answers that were not directly seen in the training data. In order not to punish a model for correctly reproducing the facts from the train data, the corresponding entities are excluded from the retrieved set:

R_q ← R_q \ { i ∈ E+ | (q ◦ i) ∈ Train }

To assess the retrieval quality of the model on a set of queries X (substitute D or T), we define the following counts that describe the correctly retrieved entities (true positives), erroneously retrieved entities (false positives), and entities missing in the retrieved set (false negatives) for a query q:

TP_q = |R_q ∩ A_q|    FP_q = |R_q \ A_q|    FN_q = |A_q \ R_q|

With TP = ∑_{q ∈ X} TP_q, FP = ∑_{q ∈ X} FP_q and FN = ∑_{q ∈ X} FN_q, the micro-averaged precision, recall and F score can be easily computed. We use the F score as the final performance measure.

Experimental setup.
The framework for this work was built on top of the publicly available ConvE implementation (https://github.com/TimDettmers/ConvE). We used the provided implementations for ConvE, ComplEx and DistMult. Similarly to ComplEx and DistMult, we provided the traditional TransE with an additional top layer, which transforms real-numbered scores to a probability-like output.

Table 2: MRR and F score on the full FB14k-QAQ test set together with F scores on different query subsets in the two threshold settings. "F global threshold" refers to scores obtained with a single shared threshold for all relations while "F multiple thresholds" refers to the setting with independent thresholds for every relation (including inverse relations).

| Model (d)    | MRR  | F global: full | C    | C ∪ F | I    | F multiple: full | C    | C ∪ F | I    |
|--------------|------|----------------|------|-------|------|------------------|------|-------|------|
| ConvE 128    | .321 | .134           | .272 | .211  | .105 | .204             | .317 | .286  | .150 |
| ConvE 64     | .263 | .157           | .307 | .261  | .108 | .189             | .312 | .280  | .135 |
| ComplEx 128  | .293 | .021           | .169 | .017  | .042 | .190             | .296 | .261  | .143 |
| ComplEx 64   | .293 | .009           | .157 | .005  | .057 | .181             | .282 | .245  | .143 |
| TransE 128   | .293 | .108           | .158 | .106  | .108 | .159             | .172 | .168  | .154 |
| TransE 64    | .283 | .111           | .164 | .112  | .110 | .161             | .176 | .175  | .154 |
| DistMult 64  | .266 | .159           | .273 | .226  | .129 | .184             | .256 | .239  | .148 |
| DistMult 128 | .221 | .133           | .275 | .188  | .088 | .163             | .206 | .194  | .138 |

All models share the following settings: Adam optimizer, binary cross entropy loss,
KvsAll training, and a maximum number of 200 training epochs. Development loss is used as the early stopping criterion with a patience of 50 epochs. The other hyperparameters were selected from the following value sets: embedding dimension d from { 64, 128 }, batch size from { }, learning rate from { }, inverse relations from { yes, no } (for TransE only). Entity embedding L2-normalization and L1 scoring was used for TransE as in the original model. Optimal threshold values are provided in Appendix B.2.

Discussion.
Table 2 shows the evaluation results of standard embedding models on the test data according to the classic ranking metric MRR and according to the F score as proposed above. Development results are provided in Appendix C.2. It is evident that MRR and the suggested classification-based evaluation scheme assess the evaluated model variants in a strikingly different manner.

A good ranking as measured by MRR is not necessarily a good indicator of good performance in the KBC setting: The ComplEx models have a relatively high MRR but show the weakest performance in the globally thresholded prediction (F global threshold, full). Similarly, the TransE (128 dimensions) model performs very well in terms of MRR but is the worst performing model with a global threshold and the second weakest (after ComplEx) in the KBC setting with multiple thresholds. The comparison of the models within the same type reveals that models with a higher number of embedding dimensions perform better in MRR than models with a lower dimensionality, whereas models with fewer dimensions are mostly better in the classification setting with a global threshold. The DistMult model pair constitutes an exception with DistMult (64) performing the best in all settings; however, the gap in MRR performance is significantly bigger than that in F score.

We also provide an analysis in terms of the sets C (queries with exhaustive answers), F (queries with type violations), and I (queries with at least one removed answer due to entity removal). C corresponds roughly to the traditional setting, directly turned into a classification problem. Here, the queries always contain at least one answer and the amount of positives is unrealistically high. Adding queries with type violations, i.e., F, makes the problem more challenging, and thresholding is more important in order to detect cases where no answer should be returned. However, the queries in F might still be too easy for approaches that are good at type modeling.

4. ComplEx and DistMult have a sigmoid prediction layer. Since the TransE distances are non-negative, a hyperbolic tangent function is a more appropriate choice than a sigmoid and exploits the whole interval [0,1].
5. Terminology borrowed from [Ruffinelli et al., 2020].
6. Positive training examples are excluded from the development loss calculation to encourage better link predictions rather than reproducing the known facts.
7. ConvE embeddings are reshaped to (8,8) and (16,8) prior to convolution.
I contains the most realistic (and most difficult to judge) empty queries and provides the most challenging scenario. The full query set (full = C ∪ F ∪ I) combines moderately challenging and hard cases, and measures progress at type modeling while at the same time containing realistic empty queries. For characterizing models in a nuanced evaluation, we suggest reporting metrics for all three of these settings: full, C ∪ F, and I.

Table 2 shows that C queries always have the highest score, which makes sense as they can be considered the easiest setting. C ∪ F queries cause a strong drop in model performance compared to C when a global threshold is used, but this difference largely disappears with threshold tuning. ConvE and ComplEx however still show a considerable drop for C ∪ F queries, which may indicate potential room for improvement through better type modeling. The I query set, on average, produces the lowest scores (except for ComplEx with a global threshold), which is in line with the intuition that this setting can be considered more difficult than C ∪ F. TransE shows the least difference between I and C ∪ F (and C), and has the highest score for I in both threshold settings, a remarkably stable performance across query sets.

Qualitative analysis.
Table 3 sheds light on the exact model behavior in the two threshold settings by showing predictions for three sample queries from different sets. The highest-scored answers are generally thematically related to the correct answers. Relation-specific thresholds can improve the classification performance by including correct answers missed in the global setting (as is the case for the first query) but the opposite is also possible (second query). For the type-violating query, the highest scores are obtained by entities with a semantic connection to the relation. ComplEx exploits the extreme threshold value of 1.0 to entirely disallow predictions for this relation.
The evaluation of these three queries also highlights the difference in ranking-based vs. classification-based evaluation approaches: despite the fact that ConvE ranks the candidates in a perfect order, the classification results still suffer from false positives and false negatives.

8. Since F only contains queries without answers, it needs to be combined with, e.g., C for a meaningful computation of F.
9. The relation labels are simplified for readability. The original relation for this query is /location/statistical_region/gni_per_capita_in_ppp_dollars./measurement_unit/dated_money_value/currency, which expects a location as its head.
To highlight the development potential of the existing models with respect to the new evaluation, we introduce a TransE variant which supports thresholds intrinsically. Relation-specific regions in the translation vector space (defined by up to d · |R| extra parameters) allow the model to better separate positive and negative predictions. The distance function of
Region maps to the non-negative reals, with 0 being the score for a perfect triple:

δ(h, r, t) = (e_h + r_r − e_t)^T A_r (e_h + r_r − e_t)

where A_r is a relation-specific positive semi-definite matrix that (together with a threshold) describes an elliptic region. A vector will be classified as positive if and only if it is located inside this region. In order to limit the number of additional parameters in A_r, we restrict A_r to be diagonal: A_r = diag(a_r) (allowing only positive values to ensure positive semi-definiteness). With a diagonal matrix A_r, the computation can be simplified to (the operand ⊙ stands for element-wise multiplication, √x for the element-wise square root):

δ(h, r, t) = ‖√a_r ⊙ (e_h + r_r − e_t)‖

The transformation mentioned above is applied to the Region distances in the same manner to obtain a probability score:

s(h, r, t) = 1 − tanh(δ(h, r, t))

TransE with L2 scoring is therefore a special case of Region with all a_r weights set to 1. Specifically, the original TransE model can be seen as a Region model with a fixed region radius shared across all relations.

Table 4 provides the evaluation results for the enriched model. The Region model achieves a noticeable improvement in terms of MRR and F compared to TransE. While the Region model also improves in terms of the ranking metric MRR (16.5% relative increase), improvements for classification are particularly strong (32.4% and 36.8% relative improvements in terms of F scores). With respect to different query sets, the biggest improvement is achieved on the complete and fake queries.
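The Region scoring above can be sketched as follows (a minimal NumPy sketch with hand-picked toy embeddings; the parameter values are illustrative, not trained):

```python
import numpy as np

def region_score(e_h, r_r, e_t, a_r):
    """Region scoring: s = 1 - tanh(delta), where delta is the TransE residual
    weighted per dimension by a relation-specific diagonal a_r (all positive)."""
    delta = np.linalg.norm(np.sqrt(a_r) * (e_h + r_r - e_t))
    return 1.0 - np.tanh(delta)

d = 4
e_h = np.array([0.2, -0.1, 0.4, 0.0])
r_r = np.array([0.1, 0.3, -0.4, 0.2])
e_t = e_h + r_r  # a "perfect" triple: the translation residual is zero

ones = np.ones(d)  # a_r = 1 recovers TransE with L2 scoring
assert region_score(e_h, r_r, e_t, ones) == 1.0  # delta = 0 -> score 1

# Down-weighting a dimension shrinks delta for triples that deviate along it,
# i.e., the relation-specific region widens along that axis.
e_bad = e_t + np.array([0.0, 0.5, 0.0, 0.0])
wide = np.array([1.0, 0.01, 1.0, 1.0])
assert region_score(e_h, r_r, e_bad, wide) > region_score(e_h, r_r, e_bad, ones)
```

Because tanh maps the weighted distance into [0, 1), a single cut on the score corresponds to an elliptic acceptance region per relation, which is what makes the variant easier to threshold than plain TransE.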
10. We also experimented with a spherical region (a single a_r value shared across all d dimensions), which gave slightly worse results.

Table 3: Exemplary predictions of the ComplEx 128 and ConvE 128 models on three test queries with global and multiple thresholds: (Thomas Lennon, has profession, ?) ∈ I, (Marion County, in time zones, ?) ∈ C, and (?, has currency, US dollar) ∈ F. The first column presents the query, the corresponding query set and the gold answers in bold. Further columns show the top-scored entities in order of decreasing scores. Threshold values separate positive predictions (above the line) from negative predictions (below the line). Gold answers are also marked bold; correct answers contained in the train set are grayed out and italic (these answers are excluded from metric computation). [The table body could not be recovered from the extracted text.]
                          MRR    F global threshold           F multiple thresholds
Model (k)                 full   full   C      C ∪ F  I       full   C      C ∪ F  I
TransE 64                 .283   .111   .164   .112   .110    .161   .176   .175   .154
Region ellipse 64         .330   .146   .266   .207   .123    .220   .317   .311   .162

Table 4: Performance of the Region model and the original TransE on the FB14k-QAQ test set.
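The F scores reported per query set are classification metrics over set-valued predictions rather than rankings. The sketch below illustrates one such set-based evaluation; the micro-average over per-query true/false positives is an assumption for illustration, and the paper's exact aggregation may differ. Note how a query with an empty gold answer set (q ∈ N) is answered correctly only by predicting nothing.

```python
def micro_prf(predictions, golds):
    """Micro-averaged precision/recall/F1 over answer sets.
    predictions, golds: lists of sets, aligned per query.
    A query where both sets are empty contributes no errors,
    i.e. predicting nothing for it is the correct behavior."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golds):
        tp += len(pred & gold)   # correctly predicted answers
        fp += len(pred - gold)   # predicted but not in the gold set
        fn += len(gold - pred)   # gold answers that were missed
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Two queries: one with a gold answer, one with an empty answer set (q in N).
preds = [{"actor", "writer"}, set()]
golds = [{"actor"}, set()]
prec, rec, f1 = micro_prf(preds, golds)  # precision 0.5, recall 1.0
```

Under a pure ranking evaluation, the second query could not even be posed; here it directly rewards a model for abstaining.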
This work points out the insufficiency of the current ranking-based evaluation paradigm for knowledge base completion and provides an alternative that directly measures the quality of predicted facts. We describe a process for constructing test collections that can measure KBC prediction quality, and we evaluate established KBC models on the new, carefully constructed FB14k-QAQ data set. Our experiments provide evidence that ranking-based estimation can be a misleading evaluation criterion for the actual completion task. With a simple but effective extension to the traditional TransE model, we encourage the research community to reconsider existing models in light of our more realistic evaluation setting and to conduct further research on the factors that are crucial for classification performance. The new setup also allows examining models with respect to how consistently scores are scaled across relationships, and it motivates research on more universal and robust embedding models that reduce the performance gap between the global and multiple threshold settings.
Acknowledgments
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - RO 5127/2-1, and by the BMBF as part of the project MLWin (01IS18050). We also thank our anonymous reviewers for their comments.
References
Antoine Bordes, Nicolas Usunier, Alberto Garc´ıa-Dur´an, Jason Weston, and OksanaYakhnenko. Translating embeddings for modeling multi-relational data. In
NIPS , 2013.Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional2d knowledge graph embeddings. In
AAAI , 2017.Fr´ederic Godin, Anjishnu Kumar, and Arpit Mittal. Learning when not to answer: a ternaryreward structure for reinforcement learning based question answering. In
Proceedings ofthe 2019 Conference of the North American Chapter of the Association for ComputationalLinguistics: Human Language Technologies, Volume 2 (Industry Papers) , pages 122–129,Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-2016. URL . anking vs. Classifying Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselinesstrike back. In
Rep4NLP@ACL , 2017.Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity andrelation embeddings for knowledge graph completion. In
AAAI , 2015.Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning attention-based embdeddings for relation prediction in knowledge graphs. In
ACL , 2019.Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collectivelearning on multi-relational data. In
ICML , 2011.Daniel Ruffinelli, Samuel Broscheit, and Rainer Gemulla. You { can } teach an old dog newtricks! on training knowledge graph embeddings. In International Conference on LearningRepresentations , 2020. URL https://openreview.net/forum?id=BkxSmlBFvr .Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning withneural tensor networks for knowledge base completion. In
Advances in neural informationprocessing systems , pages 926–934, 2013.Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Pratim Talukdar, and Yiming Yang.A re-evaluation of knowledge graph completion methods.
ArXiv , abs/1911.03903, 2019.Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, andMichael Gamon. Representing text for joint embedding of text and knowledge bases. In
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing ,pages 1499–1509, Lisbon, Portugal, September 2015. Association for ComputationalLinguistics. doi: 10.18653/v1/D15-1174. URL .Th´eo Trouillon, Johannes Welbl, Sebastian Riedel, ´Eric Gaussier, and Guillaume Bouchard.Complex embeddings for simple link prediction. In
ICML , 2016.Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, Samuel Broscheit, and Christian Meilicke.On evaluating embedding models for knowledge base completion. In
Proceedings of the4th Workshop on Representation Learning for NLP (RepL4NLP-2019) , pages 104–112,Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4313. URL .Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding bytranslating on hyperplanes, 2014. URL .Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. Transa: An adaptive approach forknowledge graph embedding.
ArXiv , abs/1509.05490, 2015.Bishan Yang, Wen tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entitiesand relations for learning and inference in knowledge bases.
CoRR , abs/1412.6575, 2014.Liang Yao, Chengsheng Mao, and Yuan Luo. Kg-bert: Bert for knowledge graph completion.
ArXiv , abs/1909.03193, 2019. peranskaya, Schmitt, & Roth
Appendix A
A.1
A formal description of the query-based data set construction.

The ◦ operator is defined to fill the open position of a query with the specified entity, i.e., (h, r, ?) ◦ i = (h, r, i) and (?, r, t) ◦ i = (i, r, t). For every query q, a set of answers A^F_q can be extracted from a given set of facts F ⊆ E × R × E; it contains the completions to valid facts, i.e., A^F_q = {i | (q ◦ i) ∈ F}.

Specifically, the following steps were taken during the construction of FB14k-QAQ:

1. Let KB ⊂ E × R × E be the underlying data set.

2. Let Train ∪ Valid ∪ Test = KB, with Train, Valid, and Test pairwise disjoint, be the original partition of KB. Unify the development and test data into a single held-out set H = Valid ∪ Test.

3. Randomly select a subset of entities E− ⊂ E that are to be removed. The remaining entities E+ = E \ E− are the basis for the new data set.

4. Drop the triples from Train and H where both head and tail entity were selected for removal, as they do not contain any valid entities and are suitable neither for training nor for query construction:
   Train ← Train \ {(h, r, t) ∈ Train | h, t ∈ E−}
   H ← H \ {(h, r, t) ∈ H | h, t ∈ E−}
5. Move the triples with exactly one position selected for removal from Train to H, to be used for query construction, since the training process does not change and is still based on full triples:
   temp ← {(h, r, t) ∈ Train | h ∈ E−} ∪ {(h, r, t) ∈ Train | t ∈ E−}
   H ← H ∪ temp
   Train ← Train \ temp

6. Transform the held-out set H from triple form to query form. First, obtain the set of answerable queries that only include entities from E+ and for which answers are contained in H:
   Q ← {(h, r, ?) | h ∈ E+, ∃t : (h, r, t) ∈ H} ∪ {(?, r, t) | t ∈ E+, ∃h : (h, r, t) ∈ H}
   Second, extract the answer sets A^H_q from the held-out triples H for every query q ∈ Q. Note that, since the answers are retrieved from the held-out set only and H ∩ Train = ∅, these answer sets do not include any entities that complete a query to a triple from the train set, i.e., {q ◦ i | i ∈ A_q ∧ q ∈ Q} ∩ Train = ∅.
7. Finally, the selected entities E− are removed from the answer sets as well (the source of the answers, H, is omitted from the notation below for readability):
   ∀q ∈ Q : A_q ← A^H_q \ E−

Starting from this point, there are two types of queries regarding the completeness of their answer sets in the evaluation data of FB14k-QAQ:
   Queries with exhaustive answers (complete): C = {q ∈ Q | A_q = A^H_q}
   Queries with at least one removed answer (incomplete): I = {q ∈ Q | A_q ≠ A^H_q}
Specifically, queries from C always contain at least one entity in their answer set, while queries from I can have empty answer sets. We will refer to these empty queries with no answer as N = {q ∈ I | |A_q| = 0}.
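Steps 4-7 above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: triples are plain (h, r, t) tuples, queries use None for the open position, and the split into E− is passed in directly.

```python
def build_qaq(train, heldout, e_minus):
    """Sketch of the FB14k-QAQ construction, steps 4-7.
    train, heldout: sets of (h, r, t) triples; e_minus: entities to remove."""
    # Step 4: drop triples where both head and tail are removed.
    train = {t for t in train if not (t[0] in e_minus and t[2] in e_minus)}
    heldout = {t for t in heldout if not (t[0] in e_minus and t[2] in e_minus)}

    # Step 5: move train triples with exactly one removed entity to H.
    moved = {t for t in train if t[0] in e_minus or t[2] in e_minus}
    heldout |= moved
    train -= moved

    # Step 6: turn H into queries over remaining entities, collect answers.
    answers = {}
    for h, r, t in heldout:
        if h not in e_minus:
            answers.setdefault((h, r, None), set()).add(t)
        if t not in e_minus:
            answers.setdefault((None, r, t), set()).add(h)

    # Step 7: remove E- from the answer sets; queries whose only gold
    # answers were removed end up with empty answer sets (the set N).
    answers = {q: a - e_minus for q, a in answers.items()}
    return train, answers
```

A query such as (a, r, ?) whose answer set shrank in step 7 belongs to the incomplete set I; if it shrank to the empty set, it belongs to N.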
Appendix B

B.1
Threshold tuning algorithm. We used two tuning iterations over the relations (N = 2).

for r ∈ R do:
    τ_r ← 0
end for
f ← 0
i ← 0
while i < N do:
    for r ∈ R do:                ▷ in order of decreasing frequency of r in the dev set
        for τ ∈ {0.1, …, 1} do:
            f̂ ← evaluate(r, τ)
            if f̂ > f then:
                f ← f̂
                τ_r ← τ
            end if
        end for
    end for
    i ← i + 1
end while

B.2 Tuned threshold statistics. The exact value is presented for the global threshold, aggregate statistics for multiple thresholds.

                                 multiple thresholds
Model (d)     global threshold   mean   min   max
ConvE 128     0.5                0.81   0.1   1
ConvE 64      0.5                0.56   0.1   1
ComplEx 128   0.7                0.63   0.1   1
ComplEx 64    0.5                0.72   0.1   1
TransE 128    0.1                0.26   0.1   0.
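The tuning loop in B.1 translates to a few lines of Python. This is a sketch, not the authors' code: the `evaluate` callback (which in the paper scores the dev-set F under the current thresholds) is a stand-in and here also receives the current threshold dictionary, and the candidate grid {0.1, …, 1.0} is an assumption based on the min/max statistics in B.2.

```python
def tune_thresholds(relations, evaluate, n_iters=2, grid=None):
    """Greedy per-relation threshold tuning (sketch of Appendix B.1).
    relations: relations ordered by decreasing dev-set frequency.
    evaluate(r, tau, thresholds): hypothetical callback returning the
    dev F score when relation r's threshold is tentatively set to tau,
    all other relations keeping their values from `thresholds`."""
    grid = grid or [round(0.1 * k, 1) for k in range(1, 11)]  # assumed grid
    thresholds = {r: 0.0 for r in relations}
    best_f = 0.0
    for _ in range(n_iters):          # N tuning iterations over all relations
        for r in relations:
            for tau in grid:
                f = evaluate(r, tau, thresholds)
                if f > best_f:        # keep tau only on strict improvement
                    best_f = f
                    thresholds[r] = tau
    return thresholds, best_f
```

Because each relation is tuned while all others are held fixed, the second pass over the relations can still improve thresholds chosen early in the first pass.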
Appendix C

C.1
Comparison of model performances on mean reciprocal rank and the F score for the FB14k-QAQ test (dev) set. The "F global threshold" column refers to scores obtained with a single shared threshold for all relations, while "F multiple thresholds" refers to a setting with independent thresholds for every relation (including the inverse ones).

Model (d)      MRR             F global threshold   F multiple thresholds
ConvE 128      .3212 (.3279)   .134 (.135)          .204 (.218)
ConvE 64       .2633 (.2717)   .157 (.155)          .189 (.215)
ComplEx 128    .2931 (.3009)   .021 (.022)          .190 (.200)
ComplEx 64     .2931 (.3007)   .009 (.010)          .181 (.188)
TransE 128     .2931 (.2988)   .108 (.103)          .159 (.160)
TransE 64      .2834 (.2902)   .111 (.106)          .161 (.159)
DistMult 64    .2663 (.2732)   .159 (.153)          .184 (.210)
DistMult 128   .2214 (.2310)   .133 (.146)          .163 (.189)

C.2:
Overall performance records of precision, recall and F (see next page).

                          global threshold                                       multiple thresholds
Model (d)          C            C ∪ F        I            full          C            C ∪ F        I            full
ConvE 128    prec  .232 (.228)  .155 (.160)  .061 (.063)  .083 (.084)   .312 (.317)  .258 (.269)  .116 (.129)  .167 (.176)
             rec   .329 (.325)  .329 (.325)  .370 (.371)  .351 (.351)   .322 (.331)  .322 (.331)  .211 (.253)  .261 (.288)
             F     .272 (.268)  .211 (.214)  .105 (.107)  .134 (.135)   .317 (.317)  .286 (.297)  .150 (.171)  .204 (.218)
ConvE 64     prec  .399 (.396)  .274 (.264)  .076 (.078)  .123 (.119)   .370 (.371)  .291 (.311)  .107 (.140)  .165 (.193)
             rec   .250 (.245)  .250 (.245)  .184 (.203)  .214 (.222)   .270 (.276)  .270 (.276)  .182 (.219)  .222 (.244)
             F     .307 (.303)  .261 (.254)  .108 (.113)  .157 (.155)   .312 (.316)  .280 (.292)  .135 (.170)  .189 (.215)
ComplEx 128  prec  .384 (.370)  .009 (.010)  .038 (.034)  .012 (.013)   .282 (.284)  .225 (.242)  .107 (.113)  .150 (.154)
             rec   .108 (.104)  .108 (.104)  .048 (.040)  .075 (.068)   .311 (.323)  .311 (.323)  .215 (.256)  .259 (.286)
             F     .169 (.163)  .017 (.019)  .042 (.037)  .021 (.022)   .296 (.302)  .261 (.277)  .143 (.157)  .190 (.200)
ComplEx 64   prec  .202 (.200)  .002 (.002)  .037 (.038)  .005 (.005)   .298 (.296)  .227 (.248)  .105 (.109)  .143 (.151)
             rec   .128 (.126)  .128 (.126)  .124 (.128)  .126 (.127)   .275 (.267)  .267 (.275)  .224 (.227)  .244 (.249)
             F     .157 (.155)  .005 (.005)  .057 (.058)  .009 (.010)   .282 (.285)  .245 (.261)  .143 (.148)  .181 (.188)
TransE 128   prec  .140 (.146)  .075 (.087)  .063 (.057)  .066 (.064)   .199 (.200)  .190 (.193)  .106 (.114)  .123 (.132)
             rec   .181 (.183)  .181 (.183)  .407 (.334)  .304 (.267)   .151 (.156)  .151 (.156)  .282 (.241)  .222 (.203)
             F     .158 (.162)  .106 (.118)  .108 (.098)  .108 (.103)   .172 (.175)  .168 (.172)  .154 (.154)  .158 (.160)
TransE 64    prec  .143 (.146)  .079 (.088)  .063 (.059)  .067 (.066)   .195 (.196)  .194 (.195)  .106 (.112)  .124 (.131)
             rec   .192 (.191)  .192 (.191)  .424 (.340)  .319 (.274)   .160 (.159)  .160 (.159)  .282 (.235)  .226 (.201)
             F     .164 (.166)  .112 (.120)  .110 (.100)  .111 (.106)   .176 (.175)  .175 (.175)  .154 (.151)  .160 (.158)
DistMult 64  prec  .362 (.389)  .233 (.266)  .089 (.082)  .121 (.117)   .254 (.377)  .222 (.320)  .116 (.133)  .153 (.188)
             rec   .219 (.214)  .219 (.214)  .238 (.219)  .229 (.217)   .258 (.268)  .258 (.268)  .204 (.211)  .229 (.237)
             F     .273 (.276)  .226 (.237)  .129 (.119)  .159 (.152)   .256 (.314)  .239 (.292)  .148 (.163)  .184 (.208)
DistMult 128 prec  .317 (.410)  .165 (.263)  .076 (.075)  .116 (.140)   .181 (.376)  .163 (.302)  .113 (.129)  .135 (.195)
             rec   .218 (.223)  .218 (.223)  .105 (.095)  .156 (.152)   .239 (.246)  .239 (.246)  .179 (.134)  .206 (.184)
             F     .275 (.289)  .188 (.242)  .088 (.083)  .133 (.146)   .206 (.298)  .194 (.271)  .138 (.131)  .163 (.189)

Table 5: Precision, recall and F1