Joint Reasoning for Multi-Faceted Commonsense Knowledge
Yohan Chalier
Télécom ParisTech
[email protected]
Simon Razniewski
Max Planck Institute for Informatics
[email protected]
Gerhard Weikum
Max Planck Institute for Informatics
[email protected]
ABSTRACT
Commonsense knowledge (CSK) supports a variety of AI applications, from visual understanding to chatbots. Prior works on acquiring CSK, such as ConceptNet, have compiled statements that associate concepts, like everyday objects or activities, with properties that hold for most or some instances of the concept. Each concept is treated in isolation from other concepts, and the only quantitative measure (or ranking) of properties is a confidence score that the statement is valid. This paper aims to overcome these limitations by introducing a multi-faceted model of CSK statements and methods for joint reasoning over sets of inter-related statements. Our model captures four different dimensions of CSK statements: plausibility, typicality, remarkability and salience, with scoring and ranking along each dimension. For example, hyenas drinking water is typical but not salient, whereas hyenas eating carcasses is salient. For reasoning and ranking, we develop a method with soft constraints, to couple the inference over concepts that are related in a taxonomic hierarchy. The reasoning is cast into an integer linear program (ILP), and we leverage the theory of reduced costs of a relaxed LP to compute informative rankings. This methodology is applied to several large CSK collections. Our evaluation shows that we can consolidate these inputs into much cleaner and more expressive knowledge. Results are available at https://dice.mpi-inf.mpg.de.
Motivation and problem.
Commonsense knowledge (CSK) is a potentially important asset towards building versatile AI applications, such as visual understanding for describing images (e.g., [2, 19, 36]) or conversational agents like chatbots (e.g., [31, 48, 49]). In delineation from encyclopedic knowledge on entities like Trump, Paris, or FC Liverpool, CSK refers to properties, traits and relations of everyday concepts, such as elephants, coffee mugs or school buses. For example, when seeing scenes of an elephant juggling a few coffee mugs with its trunk, or with school kids pushing an elephant into a bus, an AI agent with CSK should realize the absurdity of these scenes and should generate funny comments for image description or in a conversation.

Encyclopedic knowledge bases (KBs) received much attention, with projects such as DBpedia [3], Wikidata [45], Yago [39] or NELL [8] and large knowledge graphs at Amazon, Baidu, Google, Microsoft etc. supporting entity-centric search and other services [25]. In contrast, approaches to acquire CSK have been few and limited. Projects like ConceptNet [37], WebChild [42], TupleKB [?] and Quasimodo [32] have compiled millions of concept:property (or subject-predicate-object) statements, but still suffer from sparsity and noise. For instance, ConceptNet has only a single non-taxonomic/non-lexical statement about hyenas, namely, hyenas: laugh a lot, and WebChild lists overly general and contradictory properties such as small, large, demonic and fair for hyenas. The reason for these shortcomings is that such mundane properties that are obvious to every human are rarely expressed explicitly in text or speech, and visual content would require CSK first to extract these properties. Therefore, machine-learning methods for encyclopedic knowledge acquisition do not work robustly for CSK.

Another limitation of existing CSK collections is that they organize statements in a flat, one-dimensional manner, and solely rank by confidence scores.
There is no information about whether a property holds for all or for some of the instances of a concept, and there is no awareness of which properties are typical and which ones are salient from a human perspective. For example, the statement that hyenas drink milk (as all mammals do when they are cubs) is valid, but it is not typical. Hyenas eating meat is typical, but it is not salient in the sense that humans would spontaneously name this as a key characteristic of hyenas. In contrast, hyenas eating carcasses is remarkable as it sets hyenas apart from other African predators (like lions or leopards), and many humans would list this as a salient property. Prior works on CSK missed out on these refined and expressive dimensions.

The problem addressed in this work is to overcome these limitations and advance CSK collections to a more expressive stage of multi-faceted knowledge.

Approach and contribution.
This paper presents Dice (Diverse Commonsense Knowledge), a reasoning-based method for deriving refined and expressive commonsense knowledge from existing CSK collections. Dice is based on two novel ideas:

• To capture the refined semantics of CSK statements, we introduce four facets of concept properties:
  • Plausibility indicates whether a statement makes sense at all (like the established but overloaded notion of confidence scores).
  • Typicality indicates whether a property holds for most instances of a concept (e.g., not only for cubs).
  • Remarkability expresses that a property stands out by distinguishing the concept from closely related concepts (like siblings in a taxonomy).
  • Saliency reflects that a property is characteristic for the concept, in the sense that most humans would spontaneously list it in association with the concept.
• We identify inter-related concepts by their neighborhoods in a concept hierarchy or via word-level embeddings, and devise a set of weighted soft constraints that allows us to jointly reason over the four dimensions for sets of candidate statements. We cast this approach into an integer linear program (ILP), and harness the theory of reduced cost (aka. opportunity cost) [5] for LP relaxations in order to compute quantitative rankings for each of the four facets.

As an example, consider the concepts lions, leopards, cheetahs and hyenas. The first three are coupled by being taxonomic siblings under their hypernym big cats, and the last one is highly related by being another predator in the African savannah with high relatedness in word-level embedding spaces (e.g., word2vec or GloVe). Our constraint system includes logical clauses such as

Plausible(s, p) ∧ Related(s, s′) ∧ ¬Plausible(s′, p) ∧ ... ⇒ Remarkable(s, p)

where ... refers to enumerating all siblings s′ of s, or highly related concepts.
The constraint itself is weighted by the degree of relatedness; so it is a soft constraint that does allow exceptions. This way we can infer that remarkable (and also salient) statements include lions: live in prides, leopards: climb trees, cheetahs: run fast and hyenas: eat carcasses.

The paper's salient contributions are:
• We introduce a multi-faceted model for CSK statements, comprising the dimensions of plausibility, typicality, remarkability and saliency.
• We model the coupling of these dimensions by a soft constraint system, and devise effective and scalable techniques for joint reasoning over noisy candidate statements.
• Experiments, with inputs from large CSK collections, ConceptNet, TupleKB and Quasimodo, and with human judgements, show that Dice achieves high precision for its multi-faceted output. The resulting commonsense knowledge bases contain more than 1.6M statements about 74K concepts, and will be made publicly available.
Manually compiled CSK.
In 1985, Douglas Lenat started the Cyc [20] project, with the goal of compiling a comprehensive machine-readable collection of human knowledge into logical assertions. The project comprised both encyclopedic and commonsense knowledge. The parallel WordNet project [23] organized word senses into lexical relations like synonymy, antonymy, and hypernymy/hyponymy (i.e., subsumption). The latter can serve as a taxonomic backbone for CSK, but there are also more recent alternatives such as WebIsALOD [16] derived from Web contents. ConceptNet extended the Cyc and WordNet approaches by collecting CSK triples from crowdworkers, for about 20 high-level properties [37]. It is the state of the art for CSK. The most popular knowledge base today, Wikidata [45], contains both encyclopedic knowledge about notable entities and some CSK cast into RDF triples. However, the focus is on individual entities, and CSK is very sparse. Most recently, ATOMIC [33] is another crowdsourcing project compiling knowledge about human activities; relative to ConceptNet it is more refined but fairly sparse.
Web-extracted CSK.
Although handcrafted CSK collections have reached impressive sizes, the reliance on human inputs limits their scale and scope. Automatic information extraction (IE) from Web contents can potentially achieve much higher coverage. Compared to general IE, extracting CSK is still an underexplored field. The WebChild project [42, 43] extracted more than 10 million statements of plausible object properties from books and image tags. However, its rationale was to capture each and every property that holds for some instances of a concept; consequently, it has a massive tail of noisy, puzzling or invalid statements. TupleKB [?] from the AI2 Lab's Mosaic project is a more focused approach to automatic CSK acquisition. It contains ca. 280K statements, specifically for 8th-grade elementary science, to support work on a multiple-choice school exam challenge [34]. It builds on similar sources as WebChild, but prioritizes precision over recall by various cleaning steps incl. a supervised scoring model. Quasimodo [32] is a recent CSK collection, built by extraction from QA forums and web query logs, with about 4.6 million statements. Although it combines multiple cues into a regression-based corroboration model for ranking and aims to identify salient statements, the model merely learns a single-dimensional notion of confidence. Common to all these projects is that their quantitative assessment of CSK statements is focused on a single dimension of confidence or plausibility. There is no awareness of other facets like typicality, remarkability and saliency.

Latent representations.
Latent models have had great impact on natural language processing, with word embeddings like word2vec [22], GloVe [28] and BERT [11] capturing signals from huge text corpora. These embeddings implicitly contain some kind of CSK by the relatedness of word-level or phrase-level vectors or more advanced representations. For example, the typical habitats for camels can be predicted to be deserts, based on the latent representations. Embeddings have been leveraged for tasks like commonsense question answering [41] and knowledge base completion (e.g., [6]). However, the latent nature of these models makes it difficult to interpret what specific knowledge is at work and explain this to the human user. Moreover, they typically involve a complete end-to-end training cycle for each and every use case. Explicit CSK collections are much better interpretable and more easily re-usable for new applications.
Joint reasoning.
Consolidating statements from automatic IE is an important part of KB construction, and several frameworks have been pursued for encyclopedic knowledge, including probabilistic graphical models of different kinds (e.g., [8, 12, 29, 35, 46]), constraint-based reasoning (e.g., [38, 40]), and more. All these methods solve optimization problems to accept or reject uncertain candidate statements with specified or learned constraints so as to maximize a combination of statistical evidence and satisfaction of soft constraints.
Knowledge representation.
Current CSKBs merely use a single score that represents the frequency of or confidence in a binary-relation statement. Beyond binary relations, epistemic logics would be able to express refined modalities such as possibly and necessarily. Temporal logics can model whether statements are valid always, eventually or sometimes [26], and spatial data models can capture location information about entities and events [1]. The need to contextualize binary relations has been noted in encyclopedic KBs. Yago introduced the notion of SPOTLX tuples to capture time, location and textual dimensions [18, 47], DBpedia used reification to store provenance information [15], and Wikidata comes with a range of temporal, spatial, and other contextual qualifiers [27]. For CSKBs this level of refinement has not been considered yet. In KG embeddings, Chen et al. studied models for retaining graded truth values, termed confidence, instead of binary truth values, in inputs and outputs of embedding models [10]. However, this is limited to a single dimension, and does not capture the different facets addressed in this paper.

Term: Meaning
CSK statement: pair (s, p) of subject s (concept) and property p (textual phrase)
CSK dimensions: plausibility, typicality, remarkability, saliency
Soft constraints: relationships between dimensions of a statement and/or taxonomically related concepts
Taxonomy: noisy is-a organization of subject concepts
Clause: a grounding, i.e., concrete instantiation of a rule
ω_r, ω_s, ω_e: parameters for weighing clauses
Cues: input signals for estimating prior scores
Prior scores: initial estimates of dimension values for a statement (i.e., before reasoning), denoted as π, τ, ρ, σ and computed from cues via regression

Table 1: Important notation.
We consider simple CSK statements of the form (s, p), where s is a concept and p is a property of this concept. To be in line with established terminology, we refer to s as the subject of the statement. Typically, s is a crisp noun, such as hyenas, while p can take any multi-word verb or noun phrase, such as laugh a lot or (are) African predators.

Unlike prior works, we do not adopt the usual subject-predicate-object triple model. We do not distinguish between predicates and objects for two reasons:
(i) The split between predicate and object is often arbitrary. For example, for lions: live in prides, we could either consider live or live in as predicate and the rest as object, or we could view live in prides as a predicate without any object.
(ii) Unlike encyclopedic KBs where a common set of predicates can be standardized (e.g., date of birth, country of citizenship, award received), CSK is so diverse that it is virtually impossible to agree on predicate names. For example, we may want to capture both prey on antelopes and hunt and kill antelopes, which are highly related but not quite the same. Projects like ConceptNet and WebChild have organized CSK with a fixed set of pre-specified predicates, but these are merely around 20, and, when discounting taxonomic (e.g., type of) and lexical (e.g., synonyms, related terms) relations, boil down to a few basic predicates: used for, capable of, location and part of (plus a generic kind of has property).

We summarize important notation in Table 1.

We organize concept-property pairs along four dimensions: plausibility [24, 42], typicality [37], remarkability (information theory) and saliency [32]. These are meta-properties; so each (s, p) pair can have any of these labels, and multiple labels are possible. For each statement and dimension label, we compute a score and can thus rank statements for a concept by their plausibility, typicality, remarkability or saliency.

• Plausibility:
Is the property valid at least for some instances of the concept, for at least some spatial, temporal or socio-cultural contexts? For example, lions drink milk at some time in their lives, and some lions attack humans.

• Typicality:
Does the property hold for most (or ideally all) instances of the concept, for most contexts? For example, most lions eat meat, regardless of whether they live in Africa or in a zoo.

• Remarkability:
What are specific properties of a concept that set the concept apart from highly related concepts, like taxonomic generalizations (hypernyms in a concept hierarchy)? For example, lions live in prides but other big cats do not, and hyenas eat carcasses but hardly any other African predator does this.

• Saliency:
When humans are asked about a concept, such as lions, bicycles or rap songs, would a property be listed among the concept's most notable traits, by most people? For example, lions hunt in packs, bicycles have two wheels, rap songs have interesting lyrics and beat (but no real melody).

Examples.
Refining CSK by the four dimensions is useful for various application areas, including language understanding for chatbots, as illustrated by the following examples:
(1) Plausibility helps to avoid blunders by detecting absurd statements, or to trigger irony. For example, a user utterance such as "When too many people shot selfies with him, the lion king in the zoo told them to go home" should lead to a funny reply by the chatbot (as lions do not speak).
(2) Typicality helps a chatbot to infer missing context. For example, when the human talks about "a documentary which showed the feeding frenzy of a pack of hyenas", the chatbot could ask "What kind of carcass did they feed on?"
(3) Remarkability can be an important signal when the chatbot needs to infer which concept the human is talking about. For example, a user utterance "In the zoo, the kids were fascinated by a spotted dog that was laughing at them" could lead to a chatbot response like "So they liked the hyenas. Did you see an entire pack?"
(4) Saliency enables the chatbot to infer important properties when a certain concept is the topic of a conversation. For example, when talking about lions in the zoo, the bot could proactively ask "Did you hear the lion roar?", or "How many lionesses were in the lion king's harem?"
Overview.
For reasoning over sets of CSK statements, we start with a CSK collection, like ConceptNet, TupleKB or Quasimodo. These are in triple form with crisp subjects but potentially noisy phrases as predicates and objects. We interpret each subject as a concept and concatenate the predicate and object into a property. Inter-related subsets of statements are identified by locating concepts in a large taxonomy and grouping siblings and their hypernymy parents together. These groups may overlap. For this purpose we use the WebIsALOD taxonomy [16], as it has very good coverage of concepts and captures everyday vocabulary.

Based on the taxonomy, we also generate additional candidate statements for sub- or super-concepts, as we assume that many properties are inherited between parent and child. We use rule-based templates for this expansion of the CSK collection (e.g., as lions are predators, big cats and also tigers, leopards etc. are predators as well). This mitigates the sparseness in the observation space. Note that, without the reasoning, this would be a high-risk step as it includes many invalid statements (e.g., lions live in prides, but big cats in general do not). Reasoning will prune out most of the invalid candidates, though.

For joint reasoning over the statements for the concepts of a group, we interpret the rule-based templates as soft constraints, with appropriate weights. For setting weights in a meaningful way, we leverage prior scores that the initial CSK statements come with (e.g., confidence scores from ConceptNet), and additional statistics from large corpora, most notably word-level embeddings like word2vec.

In this section, we develop the logical representation and the joint reasoning method, assuming that we have weights for statements and for the grounded instantiations of the constraints. Subsequently, Section 5 presents techniques for obtaining statistical priors for setting the weights.

Let S denote the set of subjects and P the set of properties.
The inter-dependencies between the four CSK dimensions are expressed by the following logical constraints.

Concept-dimension dependencies: ∀(s, p) ∈ S × P

Typical(s, p) ⇒ Plausible(s, p)   (1)
Salient(s, p) ⇒ Plausible(s, p)   (2)
Typical(s, p) ∧ Remarkable(s, p) ⇒ Salient(s, p)   (3)

These clauses capture the intuition behind the four facets.

Parent-child dependencies: ∀(s, p) ∈ S × P, ∀s′ ∈ children(s)

Plausible(s, p) ⇒ Plausible(s′, p)   (4)
Typical(s, p) ⇒ Typical(s′, p)   (5)
Typical(s′, p) ⇒ Plausible(s, p)   (6)
Remarkable(s, p) ⇒ ¬Remarkable(s′, p)   (7)
Typical(s, p) ⇒ ¬Remarkable(s′, p)   (8)
¬Plausible(s, p) ∧ Plausible(s′, p) ⇒ Remarkable(s′, p)   (9)
(∀s′ ∈ children(s): Typical(s′, p)) ⇒ Typical(s, p)   (10)

These dependencies state how properties are inherited between a parent concept and its children in a taxonomic hierarchy. For example, if a property is typical for the parent and thus for all its children, it is not remarkable for any child, as it does not set any child apart from its siblings.

Sibling dependencies: ∀(s, p) ∈ S × P, ∀s′ ∈ siblings(s)

Remarkable(s, p) ⇒ ¬Remarkable(s′, p)   (11)
Typical(s′, p) ⇒ ¬Remarkable(s, p)   (12)
¬Plausible(s′, p) ∧ Plausible(s, p) ⇒ Remarkable(s, p)   (13)

These dependencies state how properties of concepts under the same parent relate to each other. For example, a property being plausible for only one in a set of siblings makes this property remarkable for that one concept.

The specified first-order constraints need to be grounded with the candidate statements in a CSK collection, yielding a set of logical clauses (i.e., disjunctions of positive or negated atomic statements). To avoid producing a huge amount of clauses, we restrict the grounding to existing subject-property pairs and the high-confidence (>0.4) relationships of the WebIsALOD taxonomy (avoiding its noisy long tail).
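To make the grounding concrete, the concept-dimension dependencies (1)-(3) can be checked against candidate facet assignments. The following is a minimal Python sketch; it is not the paper's implementation (Dice treats such rules as weighted soft clauses inside an ILP rather than hard checks), and the concept and property names are illustrative:

```python
# Sketch: grounding the concept-dimension dependencies (1)-(3) over a toy
# facet assignment. Each rule is (name, premise, conclusion).

RULES = [
    ("Typical => Plausible",
     lambda a, s, p: a(("Typical", s, p)),
     lambda a, s, p: a(("Plausible", s, p))),
    ("Salient => Plausible",
     lambda a, s, p: a(("Salient", s, p)),
     lambda a, s, p: a(("Plausible", s, p))),
    ("Typical & Remarkable => Salient",
     lambda a, s, p: a(("Typical", s, p)) and a(("Remarkable", s, p)),
     lambda a, s, p: a(("Salient", s, p))),
]

def violations(assignment, statements):
    """Return the rule groundings violated by a truth assignment."""
    a = lambda atom: assignment.get(atom, False)
    return [(name, s, p)
            for (s, p) in statements
            for (name, premise, conclusion) in RULES
            if premise(a, s, p) and not conclusion(a, s, p)]

truth = {
    ("Typical", "hyena", "eat carcasses"): True,
    ("Remarkable", "hyena", "eat carcasses"): True,
    ("Salient", "hyena", "eat carcasses"): True,
    ("Plausible", "hyena", "eat carcasses"): True,
}
print(violations(truth, [("hyena", "eat carcasses")]))  # -> []
```

Dropping Plausible from the assignment would violate both rule (1) and rule (2), which is exactly the kind of inconsistency the soft-constraint reasoning penalizes.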
Expansion to similar properties.
Following this specification, the clauses would apply only for the same property of inter-related concepts, for example, eats meat for lions, leopards, hyenas etc. However, the CSK candidates may express the same or very similar properties in different ways: lions: eat meat, leopards: are carnivores, hyenas: eat carcasses etc. Then the grounded formulas would never trigger any inference, as the p values are different. We solve this issue by considering the similarity of different p values based on word-level embeddings (see Section 5). For each property pair (p, p′) ∈ P × P, grounded clauses are generated if sim(p, p′) exceeds a threshold t.

We consider such highly related property pairs also for each concept alone, so that we can deduce additional CSK statements by generating the following clauses: ∀s ∈ S, ∀(p, q) ∈ P × P:

sim(p, q) ≥ t ⇒ (Plausible(s, p) ⇔ Plausible(s, q))   (14a)
               (Typical(s, p) ⇔ Typical(s, q))   (14b)
               (Remarkable(s, p) ⇔ Remarkable(s, q))   (14c)
               (Salient(s, p) ⇔ Salient(s, q))   (14d)

This expansion of the reasoning machinery allows us to deal with the noise and sparsity in the pre-existing CSK collections.

Weighting clauses.
Each of the atomic statements Plausible(s, p), Typical(s, p), Remarkable(s, p) and Salient(s, p) has a prior weight based on the confidence score from the underlying collection of CSK candidates (see Sec. 5). These priors are denoted π(s, p), τ(s, p), ρ(s, p), and σ(s, p).

Each grounded clause c has three different weights:
(1) ω_r, the weight of the logical dependency from which the clause is generated, a hyper-parameter for tuning the relative influence of different kinds of dependencies.
(2) ω_s, the similarity weight, sim(p, p′) for clauses resulting from similarity expansion, or 1.0 if concerning only a single property.
(3) ω_e, the evidence weight, computed by combining the statistical priors for the individual atoms of the clause, using basic probability calculations for logical operators: 1 − u for negation and u + v − uv for disjunction with weights u, v for the atoms in a clause.

The final weight of a clause c is computed as: ω_c = ω_r · ω_s · ω_e. Table 2 shows a few illustrative examples.
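The weight combination can be rendered as a short Python sketch of ω_c = ω_r · ω_s · ω_e. The priors and rule weight below are made-up example numbers, not values taken from the paper:

```python
def evidence_weight(pos_priors, neg_priors):
    """omega_e for a disjunctive clause: each negated atom contributes
    1 - u, and literals are combined with the probabilistic OR u + v - uv."""
    literals = list(pos_priors) + [1.0 - u for u in neg_priors]
    w = 0.0
    for u in literals:
        w = w + u - w * u  # iterated disjunction
    return w

def clause_weight(omega_r, omega_s, pos_priors, neg_priors):
    """Final clause weight omega_c = omega_r * omega_s * omega_e."""
    return omega_r * omega_s * evidence_weight(pos_priors, neg_priors)

# Clause Plausible(s, p) OR NOT Typical(s, p), with hypothetical priors
# pi(s, p) = 0.5 and tau(s, p) = 0.8, rule weight 0.48, no similarity expansion:
print(round(clause_weight(0.48, 1.0, [0.5], [0.8]), 3))  # -> 0.288
```

With these numbers ω_e = 0.5 + 0.2 − 0.1 = 0.6, matching the order of magnitude of the first row of Table 2.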
Notations.
For reasoning over the validity of candidate statements, for each of the four facets, we view every candidate statement Facet(s, p) as a variable v ∈ V, and its prior (either π, τ, ρ or σ, see Section 5) is denoted as ω_v. Every grounded clause c ∈ C, normalized into a disjunctive formula, can be split into variables with positive polarity, c+, and variables with negative polarity, c−.

By viewing all v as Boolean variables, we can now interpret the reasoning task as a weighted maximum satisfiability (Max-Sat) problem: find a truth-value assignment to the variables v ∈ V such that the sum of weights of satisfied clauses is maximized. This is a classical NP-hard problem, but the literature offers a wealth of approximation algorithms (see, e.g., [21]). Alternatively, and preferably for our approach, we can re-cast the Max-Sat problem into an integer linear program (ILP) [44] where the variables v become 0-1 decision variables. Although ILP is more general and potentially more expensive than Max-Sat, there are highly optimized and excellently engineered methods available in software libraries like Gurobi [14]. Moreover, we are ultimately interested not just in computing accepted variables (set to 1) versus rejected ones (set to 0), but want to obtain an informative ranking of the candidate statements. To this end, we can relax an ILP into a fractional LP (linear program), based on principled foundations [44], as discussed below. Therefore, we adopt an ILP approach, with the following objective function and constraints:

max Σ_{v ∈ V} ω_v · v + Σ_{c ∈ C} ω_c · c   (15)

under the constraints:

∀c ∈ C, ∀v ∈ c+:  c − v ≥ 0   (16a)
∀c ∈ C, ∀w ∈ c−:  c + w − 1 ≥ 0   (16b)
∀c ∈ C:  Σ_{v ∈ c+} v + Σ_{w ∈ c−} (1 − w) − c ≥ 0   (16c)
∀v ∈ V:  v ∈ {0, 1}   (16d)
∀c ∈ C:  c ∈ {0, 1}   (16e)

Each clause c is represented as a triple of ILP constraints, where the Boolean operations ¬ and ∨ are encoded via inequalities. The ILP returns 0-1 values for the decision variables; so we can only accept or reject a candidate statement.
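To make the objective (15) concrete, here is a tiny brute-force sketch of the underlying weighted Max-Sat view. The paper solves it as an ILP with Gurobi; this exhaustive enumeration is only for illustration on a handful of variables, and all weights are invented:

```python
from itertools import product

def solve(variables, var_weight, clauses):
    """Exhaustive weighted Max-Sat: maximize the sum of weights of true
    variables plus weights of satisfied clauses. Each clause is a triple
    (weight, positive_literals, negative_literals)."""
    best_score, best = float("-inf"), None
    for bits in product([0, 1], repeat=len(variables)):
        a = dict(zip(variables, bits))
        score = sum(var_weight[v] * a[v] for v in variables)
        for w, pos, neg in clauses:
            # clause satisfied if any positive literal is 1 or negative is 0
            if any(a[v] for v in pos) or any(not a[v] for v in neg):
                score += w
        if score > best_score:
            best_score, best = score, a
    return best, best_score

# Variables: T = Typical(lion, eat meat), P = Plausible(lion, eat meat).
# The clause encodes rule (1), Typical => Plausible, i.e. P OR NOT T.
assign, score = solve(["T", "P"], {"T": 0.9, "P": 0.3},
                      [(0.5, ("P",), ("T",))])
print(assign)  # -> {'T': 1, 'P': 1}
```

Setting both variables to 1 collects all variable weights and satisfies the clause, which mirrors how the ILP trades off priors against clause satisfaction.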
Relaxing the ILP into an ordinary linear program (LP) drops the integrality constraints on the decision variables, and would then return fractional values for the variables. Solving an LP is typically faster than solving an ILP.

The fractional values returned by the LP are not easily interpretable. We could employ the method of randomized rounding [30]: for fractional value x ∈ [0, 1] we toss a coin that shows 1 with probability x and 0 with probability 1 − x. This has been proven to be a constant-factor approximation (i.e., near-optimal solution) in expectation.

However, we are actually interested in using the relaxed LP to compute principled and informative rankings for the candidate statements. To this end, we leverage the theory of reduced costs, aka. opportunity costs [5]. For an LP of the form minimize c^T x subject to Ax ≤ b and x ≥ 0, with coefficient vectors c, b and coefficient matrix A, the reduced cost of a variable x_i that is zero in the optimal solution is the amount by which the coefficient c_i needs to be reduced in order to yield an optimal solution with x_i > 0. This can be computed for all x as c − A^T y, where y is the optimal dual solution. For maximization problems, the reduced cost is an increase of c. Modern optimization tools like Gurobi directly yield these measures of sensitivity as part of their LP solving.

We use the reduced costs of the x_i variables as a principled way of ranking them, lowest cost ranking highest (as their weights would have to be changed most to make them positive in the optimal solution). As all variables with reduced cost zero would have the same rank, we use the actual variable values (as a cue for the corresponding statement or dependency being satisfied) as a tie-breaker.

LP solvers are not straightforward to scale to cope with large amounts of input data. For reasoning over all candidate statements in one shot, we would have to solve an LP with millions of variables. We devised and utilized the following technique to overcome this bottleneck in our experiments. The key idea is to consider only limited-size neighborhoods in the taxonomic hierarchy in order to partition the input data. In our implementation, to reason about the facets for a candidate statement (s, p), we identify the parents and siblings of s in the taxonomy and then compile all candidate statements and grounded clauses where at least one of these concepts appears. This typically yields subsets of size in the hundreds or few thousands. Each of these forms a partition, and we generate and solve an LP for each partition separately. This way, we can run the LP solver on many partitions independently in parallel. The partitions overlap, but each (s, p) is associated with a primary partition with the statement's specific neighborhood.

So far, we assumed that prior scores π(s, p), τ(s, p), ρ(s, p), σ(s, p) are given, in order to compute weights for the ILP or LP. This section explains how we obtain these priors.
In a nutshell, we obtain basic scores from the underlying CSK collections and their combination with embedding-based similarity, and from textual entailment and relatedness in the taxonomy (Subsection 5.1). We then define aggregation functions to combine these various cues (Subsection 5.2).

Rule | Clause | ω_r | ω_s | ω_e | ω_c
1 | Plausible(car, hit wall) ∨ ¬Typical(car, hit wall) | 0.48 | 1 | 0.60 | 0.29
14a | Plausible(bicycle, be at city) ∨ ¬Plausible(bicycle, be at town) | 0.85 | 0.86 | 1 | 0.73
14a | Plausible(bicycle, be at town) ∨ ¬Plausible(bicycle, be at city) | 0.85 | 0.86 | 1 | 0.73
8 | ¬Remarkable(bicycle, transport person and thing) ∨ ¬Typical(car, move person) | 0.51 | 0.78 | 0.96 | 0.38

Table 2: Examples of grounded clauses with their weights (based on ConceptNet).
Basic statements like (s, p) are taken from existing CSK collections, which often provide confidence scores based on observation frequencies or human assessment (of crowdsourced statements or samples). We combine these confidence measures, denoted score(s, p), with embedding-based similarity between two properties, sim(p, q). Each property p is tokenized into a bag-of-words {w_1, ..., w_n} and encoded as the idf-weighted centroid of the embedding vectors w⃗_i obtained from a pre-trained word2vec model: p⃗ = Σ_{i=1}^{n} idf(w_i) w⃗_i. The similarity between two properties is the cosine between the vectors, mapped into [0, 1]:

sim(p, q) = (⟨p⃗, q⃗⟩ / (∥p⃗∥ ∥q⃗∥) + 1) / 2.

Confidence scores and similarities are then combined and normalized into a quasi-probability:

P[s, p] = Z Σ_{q ∈ P: sim(p, q) ≥ t} score(s, q) × sim(q, p)

where Z is a normalization factor and t is a threshold (set to 0.75 in our implementation).
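A minimal sketch of this similarity computation, with tiny made-up two-dimensional embedding vectors standing in for the 300-dimensional pre-trained word2vec model:

```python
import math

def property_vector(tokens, emb, idf):
    """idf-weighted centroid of the tokens' word vectors."""
    dim = len(next(iter(emb.values())))
    v = [0.0] * dim
    for w in tokens:
        vec = emb.get(w, [0.0] * dim)  # out-of-vocabulary words contribute 0
        for i in range(dim):
            v[i] += idf.get(w, 1.0) * vec[i]
    return v

def prop_sim(p_tokens, q_tokens, emb, idf):
    """Cosine of the two centroids, mapped from [-1, 1] into [0, 1]."""
    vp = property_vector(p_tokens, emb, idf)
    vq = property_vector(q_tokens, emb, idf)
    dot = sum(a * b for a, b in zip(vp, vq))
    norm_p = math.sqrt(sum(a * a for a in vp))
    norm_q = math.sqrt(sum(b * b for b in vq))
    cos = dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
    return (cos + 1.0) / 2.0

# Toy embeddings and idf values, purely illustrative:
emb = {"eat": [1.0, 0.2], "meat": [0.8, 0.1], "carcasses": [0.7, 0.3]}
idf = {"eat": 0.5, "meat": 2.0, "carcasses": 2.5}
print(prop_sim(["eat", "meat"], ["eat", "carcasses"], emb, idf))  # close to 1
```

Identical token bags yield exactly 1.0, and orthogonal centroids 0.5, matching the (cos + 1)/2 mapping.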
The intuition for this measure is that it reflects the probability of (s, p) being observed in the digital world, where evidence is accumulated over different phrases for inter-related properties such as eat meat, are carnivores, are predators, prey on antelopes etc.

We can now derive additional measures that serve as building blocks for the final priors:
• the marginals P[s] for subjects and P[p] for properties,
• the conditional probabilities of observing p given s, or the reverse; P[p|s] can be thought of as the necessity of the property p for the subject s, while P[s|p] can be thought of as a sufficiency measure,
• the probability that the observation of s implies the observation of p, which can be expressed as: P[s ⇒ p] = 1 − P[s] + P[s, p].

Beyond aggregated frequency scores, priors rely on two more components: scores from textual entailment models and taxonomy-based information gain.
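These building blocks follow directly from the joint table P[s, p]. A sketch over a toy, already-normalized joint distribution (the statements and numbers are invented for illustration):

```python
def derived_measures(joint):
    """From a normalized joint distribution {(s, p): prob}, compute the
    marginals P[s] and P[p], the conditionals P[p|s] (necessity) and
    P[s|p] (sufficiency), and P[s => p] = 1 - P[s] + P[s, p]."""
    P_s, P_p = {}, {}
    for (s, p), v in joint.items():
        P_s[s] = P_s.get(s, 0.0) + v
        P_p[p] = P_p.get(p, 0.0) + v
    p_given_s = {(s, p): v / P_s[s] for (s, p), v in joint.items()}
    s_given_p = {(s, p): v / P_p[p] for (s, p), v in joint.items()}
    implies = {(s, p): 1.0 - P_s[s] + v for (s, p), v in joint.items()}
    return P_s, P_p, p_given_s, s_given_p, implies

joint = {("lion", "eat meat"): 0.4, ("lion", "live in prides"): 0.2,
         ("hyena", "eat meat"): 0.1, ("hyena", "eat carcasses"): 0.3}
P_s, P_p, nec, suf, imp = derived_measures(joint)
print(P_s["lion"], nec[("lion", "eat meat")])
```

Here P[lion] = 0.6, so P[eat meat | lion] = 0.4/0.6 ≈ 0.67 and P[lion ⇒ eat meat] = 1 − 0.6 + 0.4 = 0.8.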
Textual entailment:
A variant of P[s ⇒ p] is to tap into corpora and learned models for textual entailment: does a sentence such as "Simba is a lion" entail a sentence "Simba lives in a pride"? We leverage the attention model from the AllenNLP project [13], learned from the SNLI corpus [7] and other annotated text collections. This gives us scores for two measures: does s entail p, entail(s → p), and does p contradict s, con(s, p).

Taxonomy-based information gain:
For each (s, p) we define a neighborhood of concepts, N(s), given by the parents and siblings of s, and consider all statements for s versus all statements for N(s) − {s} as a potential cue for remarkability. For a property p and a concept set S, let X_S = |{q | ∃s ∈ S : (s, q)}| be the number of properties observed for S, with X_S^+ the number of those matching p and X_S^- = X_S − X_S^+ the rest. The entropy of p is then

    H(p | S) = − (X_S^+ / X_S) log (X_S^+ / X_S) − (X_S^- / X_S) log (X_S^- / X_S).

Instead of merely count-based entropy, we could also incorporate relative weights of different properties, but as a basic cue, the simple measure is sufficient. The information gain of (s, p) is then

    IG(s, p) = H(p | {s}) − H(p | S − {s}).

All the basic scores – P[s, p], P[s|p], P[p|s], P[s ⇒ p], entail(s → p), con(s, p) and IG(s, p) – are fed into regression models that learn an aggregate score for each of the four facets: plausibility, typicality, remarkability and saliency. The regression parameters (i.e., weights for the different basic scores) are learned from a small set of facet-annotated CSK statements, separately for each of the four facets. We denote the aggregated scores, serving as priors for the reasoning step, as π(s, p), τ(s, p), ρ(s, p) and σ(s, p).

We evaluate three aspects of the Dice framework: (i) accuracy in ranking statements along the four CSK facets, (ii) run-time and scalability, and (iii) the ability to enrich CSK collections with newly inferred statements. The main hypothesis under test is how well Dice can rank statements for each of the four CSK facets. We evaluate this by obtaining crowdsourced judgements for a pool of sample statements.
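As one concrete reading of the count-based entropy and information gain described in this section, here is a small sketch; the exact counting in the original system may differ in details, and the data is illustrative:

```python
import math

def entropy(p, statements, subjects):
    # count-based binary entropy of property p over a concept set S:
    # X = properties observed for subjects in S, X_plus = those equal to p
    obs = [q for (s, q) in statements if s in subjects]
    x = len(obs)
    if x == 0:
        return 0.0
    x_plus = sum(1 for q in obs if q == p)
    h = 0.0
    for part in (x_plus, x - x_plus):
        if 0 < part < x:
            h -= (part / x) * math.log(part / x)
    return h

def info_gain(s, p, statements, neighborhood):
    # IG(s, p) = H(p | {s}) - H(p | N(s) - {s})
    return entropy(p, statements, {s}) - entropy(p, statements, neighborhood - {s})

# toy statements: "roar" is unique to lion among its siblings,
# while "eat meat" is shared across the whole neighborhood
stmts = [("lion", "eat meat"), ("lion", "roar"),
         ("leopard", "eat meat"), ("hyena", "eat meat"), ("hyena", "laugh")]
N = {"lion", "leopard", "hyena"}
```

A property that is unique to the subject (roar) yields a higher information gain than one shared with the siblings (eat meat), which is exactly the remarkability cue.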
Datasets.
We use three CSK collections for evaluating the added value that Dice provides: (i) ConceptNet, a crowdsourced, sometimes wordy collection of general-world CSK; (ii) TupleKB, a CSK collection extracted from web sources with a focus on the science domain, with comparably short and canonicalized SPO triples; (iii) Quasimodo, a web-extracted general-world CSK collection with a focus on saliency. Statistics on these datasets are shown in Table 3.

CSK collection
Quasimodo       13,387    1,219,526
ConceptNet      45,603      223,013
TupleKB         28,078      282,594

Table 3: Input CSK collections.

To construct taxonomies for each of these collections, we utilized the WebIsALOD dataset [17], a web-extracted noisy set of ranked subsumption pairs (e.g., tiger isA big cat - 0.88, tiger isA carnivore - 0.83). We prune out long-tail noise by setting a threshold of 0.4 for the confidence scores that WebIsALOD comes with. To evaluate the influence of taxonomy quality, we also hand-crafted a small high-quality taxonomy for the music domain, with 10 concepts and 9 subsumption pairs, such as rapper being a subclass of singer. Table 4 gives statistics on the taxonomies per CSK collection.

CSK collection
Quasimodo       11,148    15.33    3,627.8
ConceptNet      41,451     1.15       63.7
TupleKB         26,100     2.14      105.1
Music-manual         8     1.68        3.4

Table 4: Taxonomy statistics.
Annotation.
To obtain labelled data for hyper-parameter tuning and as ground truth for evaluation, we conducted a crowdsourcing project using Amazon Mechanical Turk. For saliency, typicality and remarkability, we sampled 200 subjects, each with 2 properties, from each of the CSK collections, and asked annotators for pairwise preference with regard to each of the three facets, using a 5-point Likert scale. That is, we showed two statements for the same subject, and the annotator could slide on the scale between 1 and 5 to indicate the more salient/typical/remarkable statement. For the plausibility dimension, we sampled 200 subjects, each with two properties, and asked annotators to assess the plausibility of individual statements on a 5-point scale. Then we paired up two statements for the same subject as a post-hoc preference pair. The rationale for this procedure is to avoid biasing the annotator in judging plausibility by showing two statements at once, whereas it is natural to compare pairs on the other three dimensions. In total, we had 4 × 200 × 3 = 2,400 annotated preference pairs.

Evaluation Metrics.
Figure 1: Aggregate label distribution.

In the actual evaluation, we used withheld pairwise annotations for statements along the dimensions plausibility, typicality, remarkability and saliency as ground truth, and compared, for each system score, for how many of these pairs its scores implied the same ordering, i.e., we measured the precision in pairwise preference (ppref) [9].
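The ppref metric itself is straightforward; a minimal sketch with hypothetical pairs and scores:

```python
def ppref(pairs, scores):
    # precision in pairwise preference: fraction of ground-truth pairs
    # (preferred, other) whose ordering the system scores reproduce
    agree = sum(1 for preferred, other in pairs if scores[preferred] > scores[other])
    return agree / len(pairs)

# toy ground truth: in each pair the first statement was judged preferable
pairs = [("a", "b"), ("c", "d"), ("e", "f")]
system_scores = {"a": 0.9, "b": 0.2, "c": 0.3, "d": 0.8, "e": 0.5, "f": 0.1}
```

Here the system agrees with two of the three ground-truth pairs, so ppref is 2/3; a random scorer would achieve 0.5 in expectation.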
Hyper-parameter tuning.
The 800 labeled statements per CSK collection were split into 70% for hyper-parameter optimization and 30% for evaluation. We performed two hyper-parameter optimization steps. In step 1, we learned the weights for aggregating the basic scores by a regression model, based on interpreting pairwise data as single labels (i.e., the preferred property is labelled as 1, the other one as 0). In step 2, we used Bayesian optimization to tune the weights of the constraints. As exhaustive search was not possible, we used the Tree-structured Parzen Estimator (TPE) algorithm from the Hyperopt [4] library. We used the 0-1 loss function on the ordering of the pairs as metric, and explored the search space in two ways:
(1) a discrete grid of candidate weight values, followed by
(2) a continuous exploration space of radius 0.2 centered on the value selected in the previous step.
For ConceptNet, constraints were assigned an average weight of 0.404, with the highest weights for: (14) similarity constraints (weight 0.85), (6) plausibility inference (weight 0.66) and (13) sibling implausibility implying remarkability (weight 0.60). All constraints were assigned non-negligible positive weights, so they are all important for joint inference.

Quality of rankings.
Table 5 shows the main result of our experiments: the precision in pairwise preference (ppref) scores [9], that is, the fraction of pairs where Dice or a baseline produced the same ordering as the crowdsourced ground truth. As baseline, we rank all statements by the confidence scores from the original CSK collections, which implies that the ranking is identical for all four dimensions. As the table shows, Dice consistently outperforms the baselines by a large margin of 7 to 18 percentage points. It is also notable that scores in the original ConceptNet and TupleKB are negatively correlated with typicality (values lower than 0.5), pointing to a substantial fraction of valid but not exactly typical properties in these pre-existing CSK collections.

Dimension    Random | ConceptNet       | TupleKB          | Quasimodo        | Music-manual
                    | Base [37]   Dice | Base [24]   Dice | Base [32]   Dice | Base [37]   Dice
Plausible    0.5    | 0.52        0.62 | 0.53        0.57 | 0.57        0.59 | 0.21        –
Typical      0.5    | 0.39        –    | –           –    | –           –    | –           –
Remarkable   0.5    | 0.52        –    | –           –    | –           –    | –           –
Salient      0.5    | 0.54        0.65 | 0.59        0.61 | 0.53        0.63 | 0.51        0.65
Avg.         0.5    | 0.50        –    | –           –    | –           –    | –           –

Table 5: Precision of pairwise preference (ppref) of Dice versus original CSK collections. Significant gains over baselines are boldfaced.

             Priors only   Constraints only   Both
Plausible    0.54          0.51               0.62
Typical      0.53          0.42               0.65
Remarkable   0.65          0.57               0.69
Salient      0.56          0.52               0.65
Avg.         0.58          0.51               0.66

Table 6: Ablation study using ConceptNet as input.

Ranking      Existing      New statements
dimension    statements    25%     50%     100%
Plausible    3.44          3.54    3.43    3.41
Typical      3.27          3.31    3.26    –

Table 7: Plausibility of top-ranked newly inferred statements with ConceptNet as input.

Subject     Novel properties
sculpture   be at art museum, be silver or gold in color
athlete     requires be good sport, be happy when they win
saddle      be used to ride horse, be set on table

Table 8: Examples of new statements inferred by Dice with ConceptNet as input.
Ablation study.
To study the impact of statistical priors andconstraint-based reasoning, we compare two variants of Dice: (i)using only priors without the reasoning stage, and (ii) using only theconstraint-based reasoning with all priors set to 0.5. The resultingppref scores are shown in Table 6. In isolation, priors and reasoningperform 8 and 15 percentage points worse than the combined Dicemethod. This clearly demonstrates the importance of both stagesand the synergistic benefit from their interplay.
Enrichment potential.
All CSK collections are limited in theircoverage of long-tail concepts. By exploiting the taxonomic andembedding-based relatedness between different concepts, we cangenerate candidate statements that were not observed before (e.g.,because online contents rarely talk about generalized concepts likebig cats, and mostly mention only properties of lions, leopards,tigers etc.). As mentioned in Section 4.2, simple templates can beused to generate candidates. These are fed into Dice reasoningtogether with the statements that are actually contained in theexisting CSK collections.To evaluate the quality of the Dice output for such “unobserved”statements, we randomly sampled 10 ConceptNet subjects, andgrounded the reasoning framework for these subjects for all prop-erties observed in their taxonomic neighbourhood (i.e., parents andsiblings). We then asked annotators to assess the plausibility of 100sampled statements.To compute the quality of Dice scores, we consider the top-ranked statements by predicted plausibility and by typicality, wherewe vary the recall level: number of statements from the ranking inrelation to the number of statements that ConceptNet contains forthe sampled subjects. The results are shown in Table 7 for recall25%, 50% and 100%, that is up to doubling the size of ConceptNetfor the given subjects. As one can see, Dice can expand the pre-existing CSK by 25% without losing in quality, and even up to 100%expansion the decrease in quality is negligible. Table 8 presentsanecdotal statements absent in ConceptNet.
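The candidate generation from the taxonomic neighbourhood can be sketched as follows; this is a simplified stand-in for the template-based generation, with illustrative data and function names of our own:

```python
def candidate_statements(subject, existing, parents, siblings):
    # borrow properties observed for the taxonomic neighbourhood
    # (parents and siblings) that are not yet asserted for the subject
    neighborhood = set(parents.get(subject, ())) | set(siblings.get(subject, ()))
    observed = {p for (s, p) in existing if s == subject}
    return {(subject, p) for (s, p) in existing
            if s in neighborhood and p not in observed}

# toy CSK collection and taxonomy
existing = {("lion", "live in pride"), ("lion", "hunt prey"),
            ("big cat", "hunt prey"), ("tiger", "have stripes")}
parents = {"lion": ["big cat"]}
siblings = {"lion": ["tiger"]}
```

Only genuinely new candidates are produced: "hunt prey" is already known for lion and is skipped, while "have stripes" is borrowed from the sibling tiger and left for the reasoner to accept or reject.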
Run-Time.
All experiments were run on a cluster with 40 cores and 500 GB memory. Hyper-parameter optimization took 10-14 hours for each of the three CSK inputs. Computing the four-dimensional scores for all statements took about 3 hours, 3 hours and 24 hours for ConceptNet, TupleKB and Quasimodo, respectively.

The computationally most expensive steps are the semantic similarity computation and the LP solving. For semantic similarity computation, a big handicap is the verbosity and hence diversity of the phrases for properties (e.g., "live in the savannah", "roam in the savannah", "are seen in the African savannah", "can be found in Africa's grasslands" etc.). We observed on average 1.55 statements per distinct property for ConceptNet, and 1.77 for Quasimodo. Therefore, building the input matrix for the LP is very time-consuming. For LP solving, the Gurobi algorithm has polynomial run-time in the number of variables; however, we do have to cope with a huge number of variables.

Subject   Property                                 Baseline   Dice
                                                   CN-score   plausible  typical  remarkable  salient
snake     be at shed                               0.46       0.29       0.71     0.29        0.18
snake     be at pet zoo                            0.46       0.15       0.29     0.82        0.48
snake     bite                                     0.92       0.58       0.13     0.61        0.72
lawyer    study legal precedent                    0.46       0.25       0.73     0.37        0.18
lawyer    prove that person be guilty              0.46       0.06       0.47     0.65        0.40
lawyer    present case                             0.46       0.69       0.06     0.79        0.75
bicycle   requires coordination                    0.67       0.62       0.40     0.36        0.35
bicycle   be used to travel quite long distance    0.46       0.30       0.20     0.77        0.64
bicycle   be power by person                       0.67       0.19       0.33     0.66        0.55

Table 9: Anecdotal examples from Dice run on ConceptNet.

Anecdotal examples.
Table 9 gives a few anecdotal outputs withscores returned by Dice. Note that the scores produced do not rep-resent probabilities, but global ranks (i.e., we percentile-normalizedthe scores produced by Dice, as they have no inherent semanticsother than ranks). For instance, be at shed was found to be muchmore typical than be at pet zoo for snake , while salience wasthe other way around. Note also the low variation in ConceptNetscores, i.e., in addition to being unidimensional, this low variancemakes any ranking difficult.
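The percentile normalization mentioned above can be sketched as follows; the property names and raw scores are illustrative:

```python
def percentile_normalize(scores):
    # replace raw scores by percentile ranks in (0, 1]; the raw values
    # carry no semantics beyond the ordering they induce, and ties
    # share the same rank value
    ordered = sorted(scores.values())
    n = len(ordered)
    return {k: (ordered.index(v) + 1) / n for k, v in scores.items()}

raw = {"be at shed": 0.12, "be at pet zoo": 0.47, "bite": 0.93, "slither": 0.47}
ranks = percentile_normalize(raw)
```

This makes scores comparable across dimensions and collections, which is exactly what the low-variance raw ConceptNet scores lack.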
Experimental results.
The experiments showed that Dice can capture CSK along the four dimensions significantly better than the single-dimensional baselines. The ablation study highlighted that a combination of prior scoring and constraint-based joint reasoning is highly beneficial (0.66 average ppref vs. 0.58 and 0.51 for each step in isolation, see Table 6). Among the dimensions, we find that plausibility is the most difficult of the four (see Table 5). The learning of hyper-parameters shows that all constraints are useful and contribute to the outcome of Dice, with similarity dependencies and plausibility inference having the strongest influence.

Comparing the three CSK collections that we worked with, we observe that the crowdsourced ConceptNet is a priori cleaner and hence easier to process than Quasimodo and TupleKB. Also, manually designed taxonomies gave Dice a performance boost of 0.03-0.11 in ppref over the noisy web-extracted WebIsALOD taxonomies.
Task difficulty.
Scoring commonsense statements by dimensionsbeyond confidence has never been attempted before, and a majorchallenge is to design appropriate and varied input signals towardsspecific dimensions. Our experiments showed that Dice can ap-proximate the human-generated ground-truth rankings to a con-siderable degree (0.58-0.69 average ppref), although a gap remains(see Table 5). We conjecture that in order to approximate humanjudgments even better, more and finer-grained input signals, forexample about textual contexts of statements, are needed.
Enriched CSK data.
Along with this paper, we publish six datasets: the 3 CSK collections ConceptNet, TupleKB and Quasimodo enriched by Dice with scores for the four CSK dimensions, and additional inferred statements that expand the original CSK data by about 50%. The datasets can be downloaded from https://tinyurl.com/y6hygoh8.
Web demonstrator.
The results of running Dice on ConceptNet and Quasimodo are showcased in an interactive web-based demo. The interface shows original scores from these CSK collections as well as the per-dimension scores computed by Dice. Users can explore the values of individual cues, the priors, the taxonomic neighborhood of a subject, and the clauses generated by the rule grounding. The demo is available online at https://dice.mpi-inf.mpg.de; we also show screenshots in Figure 2.

From a landing page (Fig. 2(a)), users can navigate to individual subjects like band (Fig. 2(b)). On pages for individual subjects, taxonomic parents and siblings are shown at the top, followed by commonsense statements from ConceptNet and Quasimodo. For each statement, its normalized score or percentile in its original CSK collection, along with scores and percentiles along the four dimensions as computed by Dice, are shown. Colors from green to red highlight to which quartile a percentile value belongs. On inspecting a specific statement, e.g., band: hold concert (Fig. 2(c)), one can see related statements used for computing basic scores, along with the values of the priors and evidence scores. Further down on the same page (Fig. 2(d)), the corresponding materialized clauses from the ILP, along with their weights ω_c, are shown.

This paper presented Dice, a joint reasoning framework for commonsense knowledge (CSK) that incorporates inter-dependencies between statements via taxonomic relatedness and other cues. This way we can capture more expressive meta-properties of concept-property statements along the four dimensions of plausibility, typicality, remarkability and saliency. This richer knowledge representation is a major advantage over prior works on CSK collections. In addition, we have devised techniques to compute informative rankings for all four dimensions, using the theory of reduced costs for LP relaxation.
We believe that such multi-faceted rankings of CSK statements are crucial for next-generation AI, particularly towards more versatile and robust conversational bots. Our future work plans include leveraging this rich CSK for advanced question answering and human-machine dialogs.

Figure 2: Screenshots from the web-based demonstration platform. (a) Demo landing page. (b) List of statements for subject band. (c) Scores and neighbourhood for statement band: hold concert. (d) Materialized clauses for statement band: hold concert.

REFERENCES
[1] Tamas Abraham and John F Roddick. Survey of spatio-temporal databases.
GeoInformatica, 1999.
[2] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. VQA: visual question answering. IJCV, 2017.
[3] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. DBpedia: A nucleus for a web of open data. ISWC, 2007.
[4] James Bergstra, Dan Yamins, and David D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML, 2013.
[5] Dimitris Bertsimas and John N Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.
[6] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Çelikyilmaz, and Yejin Choi. COMET: commonsense transformers for automatic knowledge graph construction. ACL, 2019.
[7] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. EMNLP, 2015.
[8] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. Toward an architecture for never-ending language learning. AAAI, 2010.
[9] Ben Carterette, Paul N Bennett, David Maxwell Chickering, and Susan T Dumais. Here or there: Preference judgments for relevance. ECIR, 2008.
[10] Xuelu Chen, Muhao Chen, Weijia Shi, Yizhou Sun, and Carlo Zaniolo. Embedding uncertain knowledge graphs. AAAI, 2019.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019.
[12] Pedro M. Domingos and Daniel Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool, 2009.
[13] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. Workshop for NLP Open Source Software (NLP-OSS), 2018.
[15] OTM, 2009.
[16] Sven Hertling and Heiko Paulheim. WebIsALOD: Providing hypernymy relations extracted from the web as linked open data. ISWC, 2017.
[17] Sven Hertling and Heiko Paulheim. WebIsALOD: Providing hypernymy relations extracted from the web as linked open data. International Semantic Web Conference, pages 111-119. Springer, 2017.
[18] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28-61, 2013.
[19] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 2017.
[20] Douglas B Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995.
[21] Vasco Manquinho, Joao Marques-Silva, and Jordi Planes. Algorithms for weighted boolean optimization. SAT, 2009.
[22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
[23] George A. Miller. WordNet: A lexical database for English. CACM, 1995.
[24] Bhavana Dalvi Mishra, Niket Tandon, and Peter Clark. Domain-targeted, high precision knowledge extraction. TACL, 2017.
[25] Natalya Fridman Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. Industry-scale knowledge graphs: lessons and challenges. Communications of the ACM, 2019.
[26] Ana Ozaki, Markus Krötzsch, and Sebastian Rudolph. Happy ever after: Temporally attributed description logics. Description Logics, 2018.
[27] Peter F Patel-Schneider. Contextualization via qualifiers. Workshop on Contextualized Knowledge Graphs, 2018.
[28] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. EMNLP, 2014.
[29] Jay Pujara, Hui Miao, Lise Getoor, and William Cohen. Knowledge graph identification. ISWC, 2013.
[30] Prabhakar Raghavan and Clark D. Thompson. Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica, 1987.
[31] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! Leveraging language models for commonsense reasoning. ACL, 2019.
[32] Julien Romero, Simon Razniewski, Koninika Pal, Jeff Z. Pan, Archit Sakhadeo, and Gerhard Weikum. Commonsense properties from query logs and question answering forums. CIKM, 2019.
[33] Maarten Sap, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. ATOMIC: An atlas of machine commonsense for if-then reasoning. AAAI, 2018.
[34] Carissa Schoenick, Peter Clark, Oyvind Tafjord, Peter D. Turney, and Oren Etzioni. Moving beyond the Turing test with the Allen AI science challenge. Commun. ACM, 2017.
[35] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. Incremental knowledge base construction using DeepDive. VLDB, 2015.
[36] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality. CVPR, 2019.
[37] Robyn Speer and Catherine Havasi. ConceptNet 5: A large semantic network for relational knowledge. Theory and Applications of Natural Language Processing, 2012.
[38] Vivek Srikumar and Dan Roth. A joint model for extended semantic role labeling. EMNLP, 2011.
[39] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. WWW, 2007.
[40] Fabian M Suchanek, Mauro Sozio, and Gerhard Weikum. SOFIE: a self-organizing framework for information extraction. WWW, 2009.
[41] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. NAACL, 2019.
[42] Niket Tandon, Gerard de Melo, Fabian M. Suchanek, and Gerhard Weikum. WebChild: harvesting and organizing commonsense knowledge from the web. WSDM, 2014.
[43] Niket Tandon, Gerard de Melo, and Gerhard Weikum. WebChild 2.0: Fine-grained commonsense knowledge distillation. ACL, 2017.
[44] Vijay V Vazirani. Approximation Algorithms. Springer Science & Business Media, 2013.
[45] Denny Vrandečić and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. CACM, 2014.
[46] Michael L. Wick, Andrew McCallum, and Gerome Miklau. Scalable probabilistic databases with factor graphs and MCMC. PVLDB, 2010.
[47] Mohamed Yahya, Denilson Barbosa, Klaus Berberich, Qiuyue Wang, and Gerhard Weikum. Relationship queries on extended knowledge graphs. WSDM, 2016.
[48] Pengcheng Yang, Lei Li, Fuli Luo, Tianyu Liu, and Xu Sun. Enhancing topic-to-essay generation with external commonsense knowledge. ACL, 2019.
[49] Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. Augmenting end-to-end dialogue systems with commonsense knowledge. AAAI, 2018.