Dimensions of Commonsense Knowledge

Filip Ilievski a,*, Alessandro Oltramari b, Kaixin Ma c, Bin Zhang a, Deborah L. McGuinness d and Pedro Szekely a

a Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
b Intelligent Internet of Things, Bosch Research and Technology Center, Pittsburgh, PA, USA
c Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
d Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA
ARTICLE INFO

Keywords: commonsense knowledge, semantics, knowledge graphs, reasoning
ABSTRACT

Commonsense knowledge is essential for many AI applications, including those in natural language processing, visual processing, and planning. Consequently, many sources that include commonsense knowledge have been designed and constructed over the past decades. Recently, the focus has been on large text-based sources, which facilitate easier integration with neural (language) models and application on textual tasks, typically at the expense of the semantics of the sources. Such practice prevents the harmonization of these sources, impedes understanding of their coverage and gaps, and may hinder the semantic alignment of their knowledge with downstream tasks. Efforts to consolidate commonsense knowledge have yielded partial success, but provide no clear path towards a comprehensive consolidation of existing commonsense knowledge. The ambition of this paper is to organize these sources around a common set of dimensions of commonsense knowledge. For this purpose, we survey a wide range of popular commonsense sources with a special focus on their relations. We consolidate these relations into 13 knowledge dimensions, each abstracting over more specific relations found in the sources. This consolidation allows us to unify the separate sources and to compute indications of their coverage, overlap, and gaps with respect to the knowledge dimensions. Moreover, we analyze the impact of each dimension on downstream reasoning tasks that require commonsense knowledge, observing that the temporal and desire/goal dimensions are very beneficial for reasoning on current downstream tasks, while distinctness and lexical knowledge have little impact. These results reveal a focus on some dimensions in current evaluation, and a potential neglect of others.
1. Introduction
Commonsense knowledge is information that humans typically have that helps them make sense of everyday situations. As such, this knowledge can generally be assumed to be possessed by most people, and, according to the Gricean maxims [20], it is typically omitted in (written or oral) communication. The fact that commonsense knowledge is often implicit presents a challenge for automated natural language processing (NLP) and question answering (QA) approaches, as the extraction and learning algorithms cannot count on the commonsense knowledge being available directly in text. Due to its prominence and implicit nature, capturing commonsense knowledge holds a promise to benefit various AI applications, including those in NLP, computer vision, and planning. For instance, commonsense knowledge can be used to fill gaps and explain the predictions of a (neural) model [33], understand agent goals and causality in stories [63], or enhance robot navigation and manipulation [65]. Consequently, acquiring and representing commonsense knowledge in machine-readable form, as well as reasoning with it, has been a major pursuit of AI since its early days [40]. This has resulted in the design, construction, and curation of a rich palette of resources that include commonsense information (potentially along with other content) like Cyc [30], ATOMIC [53], WebChild [61], ConceptNet [59], WordNet [41], FrameNet [2], and Visual Genome [28]. Some of these, such as ConceptNet and Cyc, have been deliberately created to capture information that would be useful for commonsense-related reasoning tasks, while others, like WordNet or Visual Genome, were intended to support other tasks such as word sense disambiguation or image object recognition. As reported in [26], the commonsense sources exhibit large diversity in terms of their representation formats, creation methods, and coverage. While this reflects an opportunity for this knowledge to be exploited jointly, the inherent diversity makes the consolidation of these sources challenging.

Meanwhile, the last few years have featured a reinforced focus on benchmarks that evaluate different aspects of common sense, including social [54], physical [6], visual [66], and numeric [34] common sense. Further distinction has been made between discriminative tasks [54, 6, 60], where the goal is to pick the single correct answer from a list, and generative tasks, where one has to generate one or multiple correct answers [34, 7]. These tasks can be tackled by using the (entire or a subset of) training data [37, 33], or in a zero-/few-shot evaluation regime [38, 57].

The wealth and diversity of commonsense sources, on the one hand, and benchmarks, on the other hand, raises a natural question: what is the role of these knowledge repositories for real-world reasoning techniques that need to incorporate commonsense knowledge? While intuitively such

∗ Corresponding author: [email protected] (F. Ilievski); [email protected] (A. Oltramari); [email protected] (K. Ma); [email protected] (B. Zhang); [email protected] (D.L. McGuinness); [email protected] (P. Szekely)
Ilievski et al.:
Preprint submitted to Elsevier
sources of commonsense knowledge can have tremendous value on downstream reasoning tasks, the practice shows that their impact on these tasks has been relatively limited, especially in comparison to the contribution of the language models. Knowledge sources tend to have a larger contribution when little or no training data is available: for example, one can generate artificial training sets based on several sources, which can be used to pre-train language models and apply them on downstream tasks without using the official training data [3, 38]. The impact of knowledge resources so far has been generally conditioned on the special cases where the knowledge and the task are known (in advance) to be well-aligned [38, 37]. While a variety of sources [53, 41, 59] or their combination [26] have been used to enhance language models for downstream reasoning, little is known about how this alignment between knowledge types and tasks can be dynamically achieved. Most recent sources have focused on the breadth of knowledge, sometimes at the expense of its semantics [4, 44]. Text-based representations are particularly attractive, as they facilitate a more direct integration with language models, as well as reasoning on NLP and QA tasks. These sources are often treated as ‘corpora’, where each fact is typically lexicalized (manually or automatically) into a single sentence [37], which is used to inform or fine-tune a language model. Due to the lack of focus on formal representational principles, the sources capture knowledge types which are not trivial to align with other sources, as shown by the sparse mappings available between these sources [26].
Considering the lack of a common vocabulary and/or lack of alignment of these sources, their limited coverage, and lack of focus on explicit semantics, knowledge is typically kept in an impoverished textual form that is easy to capture and combine with language models. The downsides of this practice are: 1) commonsense knowledge across sources remains difficult to harmonize; 2) without a thorough harmonization or consolidation, it is not clear how to effectively measure coverage, overlap, or gaps; and 3) text-based representations may be unable to capture the richness of contextual reasoning typically done by humans.

Efforts to consolidate commonsense knowledge across sources [26, 16, 45] have managed to bring these sources closer, which has shown impact on commonsense QA tasks [38]. In [25], we provide heuristics for defining the boundaries of commonsense knowledge, in order to extract such a subset from one of the largest available graphs today, Wikidata [62]. Yet, these efforts have had limited success, and many consolidation questions are left open. How should one think about commonsense knowledge in a theoretical way? What does it mean to build a consolidated knowledge graph (KG) of resources created largely in a bottom-up fashion? How should the relations be chosen? What is the right level of abstraction for relations and nodes?

This phenomenon can be seen on the benchmark leaderboards, which are dominated by ‘pure’ language models, for instance: https://leaderboard.allenai.org/socialiqa/submissions/public (accessed on January 5th, 2021).
2. Approach
The ambition of this paper is to provide insight into such questions, aiming primarily to organize the types of knowledge found in current sources of commonsense knowledge. For this purpose, we survey a wide variety of sources of commonsense knowledge, ranging from commonsense KGs through lexical and visual sources, to the recent idea of using language models or corpora as commonsense knowledge bases. We survey their relations and group them into a set of dimensions, each being a cluster of its specific relations, as found in the sources. We then apply these dimensions to transform and unify existing sources, providing an enriched version of the Commonsense Knowledge Graph [26]. The dimensions allow us to perform four novel experiments:

1. We assess the coverage of the sources with respect to each dimension, noting that some sources have wide (but potentially shallow) coverage of dimensions, whereas others have deep but narrow coverage. This supports the need to integrate these complementary sources into a single one.
2. We benefit from the consolidation of the dimensions to compare the facts in the sources and compute metrics of overlap. The results show that there is little knowledge overlap across sources, even after consolidating the relations according to our dimensions, thus motivating future work on node resolution.
3. We contrast the clusters according to our dimensions to language model-based clusters, to understand the similarities and differences in terms of their focus.
4. We measure the impact of each dimension on two representative commonsense QA benchmarks. Following [38], we pre-train a language model and apply it on these benchmarks in a zero-shot fashion (without making use of the task training data). The dimensions provide a more direct alignment between commonsense knowledge and the tasks, revealing that some dimensions of knowledge are very helpful for a task, while others might even degrade model performance.

The contributions of the paper are as follows.
1) We survey existing sources of commonsense knowledge of a wide variety, with an emphasis on their relations. We provide a categorization of those resources and include a short overview of their focus and creation methods (Section 3). 2) We analyze the entire set of relations and abstract them to a set of 13 commonsense dimensions. Each dimension abstracts over more specific relations, as found in the sources (Section 4). 3) The identified dimensions are applied to consolidate the knowledge in the Commonsense Knowledge Graph (CSKG), which integrates seven of the sources we analyze in this paper. The resulting resource is made publicly available (Section 5). 4) We make use of this dimension-based consolidation of CSKG to analyze the overlap, coverage, and knowledge gaps of individual knowledge sources in CSKG, motivating their consolidation into a single resource (Sections 5.1 - 5.3). 5) We evaluate the impact of different dimensions on two popular downstream commonsense reasoning
Category | Source | Relations | Example 1 | Example 2
Commonsense KGs | ConceptNet* | 34 | food - capable of - go rotten | eating - is used for - nourishment
Commonsense KGs | ATOMIC | 9 | PersonX bakes bread - xEffect - eat food | PersonX is eating dinner - xEffect - satisfies hunger
Commonsense KGs | GLUCOSE | 10 | Someone_A makes Something_A (that is food) Causes/Enables Someone_A eats Something_A |
Commonsense KGs | WebChild | 4 (groups) | restaurant food - quality |
Commonsense KGs | Quasimodo | 78,636 | pressure cooker - cook faster - food | herbivore - eat - plants
Commonsense KGs | SenticNet | 4 | cold_food - polarity - negative | eating breakfast - polarity - positive
Commonsense KGs | HasPartKB | 1 | dairy food - has part - vitamin | n/a
Common KGs | Wikidata | 6.7k | food - has quality - mouthfeel | eating - subclass of - ingestion
Common KGs | YAGO4 | 116 | banana chip - rdf:type - food | eating - rdfs:label - feeding
Common KGs | DOLCE* | 1 | n/a | n/a
Common KGs | SUMO* | 1,614 | food - hyponym - food_product | process - subsumes - eating
Lexical resources | WordNet | 10 | food - hyponym - comfort food | eating - part-meronym - chewing
Lexical resources | Roget | 2 | dish - synonym - food | eating - synonym - feeding
Lexical resources | FrameNet | 8 (f2f) | Cooking_creation - has frame element - Produced_food | eating - evoke - Ingestion
Lexical resources | MetaNet | 14 (f2f) | Food - has role - food_consumer | consuming_resources - is - eating
Lexical resources | VerbNet | 36 (roles) | feed.v.01 - Arg1-PPT - food | eating - hasPatient - comestible
Visual sources | Visual Genome | 42,374 | food - on - plate | boy - is eating - treat
Visual sources | Flickr30k | 1 | a food buffet - corefers with - a food counter | a eating place - corefers with - their kitchen
Corpora & LMs | GenericsKB | n/a | Aardvarks search for food. | Animals receive nitrogen by eating plants.
Corpora & LMs | GPT-2 | n/a | Food causes a person to be hungry and a person to eat. | Eating at home will not lead to weight gain.
Table 1
Overview of commonsense knowledge sources. The asterisk (‘*’) indicates that the source is extended with WordNet knowledge. For FrameNet and MetaNet, we specify their numbers of frame-to-frame relations. WebChild contains a large number of relations, expressed as WordNet synsets, which are aggregated into 4 groups.

tasks. The results show that certain dimensions, like temporal knowledge and knowledge on desires/goals, are very beneficial and well-covered by benchmarks, whereas other dimensions like distinctness and lexical knowledge currently have little impact. These results reveal a more precise alignment between dimensions in the resources and existing tasks, and point to gaps both in existing knowledge sources and in tasks (Section 5.4). 6) We reflect on the results of our analysis, and use them as a basis to provide a roadmap towards building a more semantic resource that may further advance the representation of, and reasoning with, commonsense knowledge. Such a resource would be instrumental in building a general commonsense service in the future (Section 7).
3. Sources of Commonsense Knowledge
We define a digital commonsense knowledge source as a potentially multi-modal repository from which commonsense knowledge can be extracted. Commonsense knowledge sources come in various forms and cover different types of knowledge. While only a handful of sources have been formally proposed as commonsense sources, many others cover aspects of common sense. Here, we collect a representative set of sources, which have been either proposed as, or considered as, repositories of commonsense knowledge in the past. We categorize them into five groups, and describe the content and creation method of representative sources within each group. Table 1 contains statistics and examples for each source. For brevity, we omit the word ‘digital’ in the remainder of this paper.
ConceptNet [59] is a multilingual commonsense knowledge graph. Its nodes are primarily lexical and connect to each other with 34 relations. Its data is largely derived from the crowdsourced Open Mind Common Sense (OMCS) corpus [58], and complemented with knowledge from other resources, like WordNet.
ATOMIC [53] is a commonsense knowledge graph that expresses pre- and post-states for events and their participants in a lexical form with nine relations. Its base events are collected from a variety of corpora, while the data for the events is collected by crowdsourcing.
GLUCOSE [44] contains causal knowledge through 10 relations about events, states, motivations, and emotions. The knowledge in GLUCOSE is crowdsourced based on semi-automatic templates, and generalized from individual stories to more abstract rules.
WebChild [61] is a commonsense knowledge graph whose nodes and relations are disambiguated as WordNet senses. It captures 20 main relations, grouped in four categories. WebChild has been extracted automatically from Web information, and canonicalized in a post-processing step.
Quasimodo [52] contains commonsense knowledge about object properties, human behavior, and general concepts. Its nodes and relations are initially lexical and extracted automatically from search logs and forums, after which a notable subset of them has been clustered into WordNet domains.
SenticNet [10] is a knowledge base with conceptual and affective knowledge, which is extracted from text and aggregated automatically into higher-level primitives.
HasPartKB [5] is a knowledge graph of hasPart statements, extracted from a corpus of sentences and refined by automatic means.
Wikidata [62] is a general-domain knowledge graph, tightly coupled with Wikipedia, that describes notable entities. Its nodes and relations are disambiguated as Qnodes. The content of Wikidata is collaboratively created by humans, as well as other existing sources. Given the vast number of statements in Wikidata and its sizable set of over 7 thousand relations, we consider its Wikidata-CS commonsense subset, as extracted in [25].
YAGO [47] is a general-purpose knowledge graph, whose nodes and relations are disambiguated entities. The knowledge in YAGO is extracted automatically from Wikipedia, and consolidated with knowledge from other sources, like Schema.org [21].
DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering) [17] is an upper-level ontology that captures the ontological categories underlying natural language and human common sense with disambiguated concepts and relations. It has been created manually by experts.
SUMO (Suggested Upper Merged Ontology) [46] is an ontology of upper-level disambiguated concepts and their relations. It has been created manually by experts.
WordNet [41] is a lexical database of words, their meanings, and taxonomical organization, in over 200 languages. It has been created manually by experts.
Roget [51] is a manually-created thesaurus that contains synonyms and antonyms for English words.
FrameNet [2] is a lexical resource that formalizes the frame semantics theory: meanings are mostly understood within a frame of an event and its participants that fulfill roles in that frame. FrameNet was created manually by experts.
MetaNet [14] is a repository of conceptual frames, as well as their relations, which often express metaphors. It has been created manually.
VerbNet [55] is a resource that describes syntactic and semantic patterns of verbs, and organizes them into verb classes. It has been created manually by experts.
Visual Genome [28] contains annotations of concepts and their relations in a collection of images. The image descriptions are manually written by crowd workers, while their concepts are mapped automatically to WordNet senses and revised by crowd workers.
Flickr30k [49] annotates objects in 30k images by multiple workers. The expressions used by different annotators are clustered automatically into groups of coreferential expressions in [43].
GenericsKB [4] contains self-contained generic facts represented as naturally occurring sentences. The sentences have been extracted from three existing corpora, filtered by handwritten rules, and scored with a BERT-based classifier.
Language models, like RoBERTa [36] and GPT-2 [50], can be used as KBs [48] to complement explicitly stated information, e.g., as a link prediction system like COMET [8] or through self-talk [57].
As apparent in this section, the commonsense sources are based on a wide range of representation principles and have been created with different construction methods. Through the example scenarios of food and eating (Table 1), we show that they have notable overlap in terms of their covered (typically well-known) concepts. At the same time, the types of knowledge covered differ across sources: some sources provide truisms, such as feeding is done with food, while others speculate on usual properties of food, such as its capability to go rotten or to often be on a plate. Furthermore, we observe that the same or similar relations tend to have different names across sources (compare type of to subclass of or is; or has quality in Wikidata to cook faster in Quasimodo). These distinctions make the integration of these sources, and the understanding of their coverage and gaps, very challenging. In order to integrate the knowledge in these sources, we next propose a consolidation of their relations into a common set of dimensions.
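To illustrate why such a consolidation matters for comparing sources, consider a toy overlap computation in which relation names have already been mapped onto shared dimensions. The triples and the Jaccard measure below are illustrative assumptions, not the exact metric reported later in the paper.

```python
def jaccard(triples_a, triples_b):
    """Jaccard overlap between two sets of (head, dimension, tail) triples."""
    a, b = set(triples_a), set(triples_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two toy sources stating facts under originally different relation names;
# once both relation names are mapped to the shared 'taxonomic' dimension,
# the first statement of each source becomes directly comparable.
source_a = {("sandwich wrap", "taxonomic", "street food"),
            ("food", "taxonomic", "disposable product")}
source_b = {("sandwich wrap", "taxonomic", "street food"),
            ("food", "utility", "pleasure")}
overlap = jaccard(source_a, source_b)  # 1 shared fact among 3 distinct ones
```

Without the dimension mapping, the two sandwich-wrap statements would use different relation names (e.g., subclass of vs. IsA) and the computed overlap would be zero.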
4. Dimensions of Commonsense Knowledge
In the previous section, we surveyed 20 representative commonsense sources from five categories: commonsense KGs, common KGs, lexical sources, visual sources, and corpora and language models. A key contribution of this paper is a manual categorization (by the authors) of the kind of knowledge expressed by the relations in these sources into 13 dimensions. Table 2 shows the correspondence of each relation in these analyzed sources to our dimensions. An example for each of the dimensions from different sources is shown in Table 4. We next describe each dimension in turn.
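Operationally, the categorization in Table 2 can be treated as a lookup from source-specific relation names to dimensions. The following is a minimal sketch using a hand-picked excerpt of relations from the surveyed sources; the dictionary is illustrative and far from complete.

```python
# An illustrative excerpt of the relation-to-dimension mapping; the full
# mapping (Table 2) covers all relations in the surveyed sources.
DIMENSIONS = {
    "lexical":      ["FormOf", "DerivedFrom", "label"],
    "similarity":   ["Synonym", "SimilarTo", "said to be the same as"],
    "distinctness": ["Antonym", "DistinctFrom", "different from"],
    "taxonomic":    ["IsA", "InstanceOf", "subclass of"],
    "part-whole":   ["PartOf", "MadeOf", "has part"],
}

# Invert into a relation -> dimension lookup for normalizing triples.
RELATION_TO_DIM = {rel: dim
                   for dim, rels in DIMENSIONS.items()
                   for rel in rels}

def normalize(head, relation, tail):
    """Replace a source-specific relation name with its dimension;
    unmapped relations fall back to the catch-all 'relational-other'."""
    return head, RELATION_TO_DIM.get(relation, "relational-other"), tail

normalized = normalize("food", "IsA", "substance")
```

Applying such a lookup to every triple is what allows statements from heterogeneous sources to be compared under one vocabulary.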
Lexical. Many data sources leverage the vocabulary of a language, or the lexicon, in their relations. This includes relationships such as plural forms of nouns, or past tenses of verbs, for example. Lexical knowledge also covers substring information. ConceptNet, for example, includes a relationship called DerivedFrom, which it describes as capturing when a word or phrase appears within another term and contributes to that term’s meaning. Lexical knowledge is also the formalization of the relation between a concept and its expression in a language, e.g., denoted through the label relation in Wikidata.
Similarity. Most data sources include the notion of synonymy between expressions, allow definitions of terms, or may cover a broader notion of general similarity. ConceptNet has all three subcategories: for instance, regarding similarity, it establishes that wholesome and organic food are similar notions, while eating is defined as a process of taking in food. WebChild also captures similarity between WordNet concepts, while WordNet, Wikidata, and Roget focus on synonymy. For instance, Roget declares that food and edibles are synonyms, while Wikidata expresses that food is
Table 2
Knowledge Dimensions. Relations marked with ‘*’ describe knowledge in multiple dimensions. Relations with ¬ express negated statements. (FN = FrameNet, WN = WordNet, RG = Roget, HP = HasPartKB.)

lexical — ConceptNet: FormOf, DerivedFrom, EtymologicallyDerivedFrom; Other: lexical_unit (FN), lemma (WN); Wikidata: label
similarity — ConceptNet: Synonym, SimilarTo, DefinedAs; WebChild: hassimilar; Other: reframing_mapping (FN), metaphor (FN), Synonym (RG), synonym (WN); Wikidata: said to be the same as
distinctness — ConceptNet: Antonym, DistinctFrom; Other: Antonym (RG), antonym (WN), excludes (FN); Wikidata: different from, opposite of
taxonomic — ConceptNet: IsA, InstanceOf, MannerOf; WebChild: hasHypernymy; Other: perspective_on (FN), inheritance (FN), hypernym (WN); Wikidata: subClassOf, instanceOf, description
part-whole — ConceptNet: PartOf, HasA, MadeOf; WebChild: physicalPartOf, memberOf, substanceOf; Other: HasPart (HP), meronym (WN), holonym (WN); Wikidata: has part, member of, material used
spatial — ConceptNet: AtLocation*, LocatedNear; WebChild: spatial; Wikidata: location, anatomical location
creation — ConceptNet: CreatedBy; Wikidata: creator
utility — ConceptNet: UsedFor, ReceivesAction, CapableOf, ¬NotCapableOf; WebChild: hassynsetmember, activity, participant; Other: using (FN); Wikidata: used by, use, uses
desire/goal — ATOMIC: xIntent, xWant, oWant; ConceptNet: CausesDesire, MotivatedByGoal, Desires, ¬NotDesires, ObstructedBy
quality — ATOMIC: xAttr; ConceptNet: HasProperty, ¬NotHasProperty, SymbolOf; WebChild: shape, size, color, taste_property, temperature; Other: frame_element (FN); Wikidata: has quality, color
comparative — WebChild: comparative relations
temporal — ATOMIC: xNeed, xEffect, oEffect, xReact, oReact; ConceptNet: HasFirstSubevent, HasLastSubevent, HasSubevent, HasPrerequisite, Causes, Entails; WebChild: time, emotion, prev, next; Other: subframe (FN), precedes (FN), inchoative_of (FN), causative_of (FN); Wikidata: has cause, has effect
relational-other — ConceptNet: RelatedTo, HasContext, EtymologicallyRelatedTo; WebChild: thing, agent; Other: see_also (FN), requires (FN); Wikidata: field of this occupation, depicts, health specialty

said to be the same as nutriment.
Distinctness. Complementary to similarity, most data sources have notions of some kind of distinguishability. Most commonly, this is formalized as antonymy, where words have an opposition relationship between them, i.e., they have an inherently incompatible relationship. For example, both Roget and ConceptNet consider hot and cold to be antonyms, as these are two exclusive temperature states of objects. FrameNet defines an Excludes relation to indicate that two roles of a frame cannot be simultaneously filled in a given situation. For instance, in the Placing frame, an event can either be brought about by a cause event or by an intentional agent, but not both. Weaker forms of distinctness are defined by Wikidata and ConceptNet, for concepts that might be mistaken as synonyms. For example, Wikidata states that food safety is different from food security, while ConceptNet distinguishes food from drinks.
Table 3
Examples for food for each of the 13 dimensions. When the subject is different from food, we state it explicitly, e.g., xWant: watch movie together - get some food.

lexical — derivationally related form: nutrient (WordNet); etymologically related: fodder (ConceptNet); derived term: foodie (ConceptNet)
similarity — synonym: dish (Roget); said to be the same as: nutriment (Wikidata); similar to: wholesome - organic (ConceptNet)
distinctness — opposite of: non-food item (Wikidata); distinct from: drink (ConceptNet); different from: food safety - food security (Wikidata)
taxonomic — hyponym: comfort food (WordNet); hyponym: beverage (WordNet); hypernym: substance (WordNet); subclass of: disposable product (Wikidata)
part-whole — things with food: minibar (ConceptNet); is part of: life (COMET); material used: food ingredient (Wikidata)
spatial — is located at: pantry (ConceptNet); is located at: a store (ConceptNet); location: toaster - kitchen (Wikidata); located near: plate (Visual Genome); located near: table (Visual Genome)
creation — is created by: cook (COMET); is created by: plant (COMET)
utility — use: eating (Wikidata); used by: organism (Wikidata); used for: pleasure (ConceptNet); used for: sustain life (COMET); used for: nourishment (ConceptNet); capable of: cost money (ConceptNet); capable of: go rotten (ConceptNet); is capable of: taste good (COMET)
desire/goal — xWant: watch movie together - get some food (ATOMIC); desires: regular access to food (ConceptNet); not desires: food poisoning (ConceptNet); causes desire to: eat (ConceptNet); xIntent: eats food - quit feeling hungry (ATOMIC); motivated by: cook a meal (ConceptNet); is motivated by: you be hungry (COMET)
quality — xAttr: makes food - creative (ATOMIC); has quality: shelf life (Wikidata); has the property: tasty (COMET)
comparative — healthier: home cooking - fast food (WebChild)
temporal — has first subevent: cooking (ConceptNet); starts with: open your mouth (COMET); has effect: food allergy (Wikidata); causes: you get full (COMET); causes: indigestion (COMET)
relational-other — related to: refrigerator (ConceptNet); related to: cereal (ConceptNet); field of work: food bank - food assistance (Wikidata); main subject: cuisine - food product (Wikidata)

Taxonomic. Most data sources include a kind of arrangement or classification where some objects are placed into more general and more specific groupings with inheritance relations. When those groupings are ordered categories based on generality, this captures the notion of hyponymy, indicating a subcategory relationship. Hyponymy blends the distinction between the relationships subclass / IsA (intended for two classes) and
InstanceOf (intended as a relation between an instance and a class). For instance, Wikidata states that a sandwich wrap is street food, or that food is a disposable product. WordNet has information that beverage and comfort food are hyponyms of food. While this dimension generally focuses on concepts (nouns), it also includes a specialization relation for verbs. Here, the MannerOf relation in ConceptNet states that wheezing is a manner of breathing.
Part-whole. Many data sources include a notion of being a part of or a member of something. Part-whole knowledge can be transitive, such as that of geographic containment, exemplified by New York City being a part of New York State, which is in turn part of the United States. Other part-of notions, such as member-of, are not necessarily transitive. A third category of part-whole knowledge is expressed with the material or the building blocks of an object, such as food being made of food ingredients. A useful distinction between these three notions of part-whole: physical part of (sunroof - car), member of (musician - duet), and substance of (steel - boiler), is provided by WebChild. In addition, the importance of this commonsense dimension is shown by HasPartKB [5], which is an entire resource dedicated to part-whole relations.
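The transitivity of geographic containment noted above can be made concrete with a small fixpoint computation. The part-of pairs and the function below are an illustrative sketch, not part of any surveyed resource.

```python
def transitive_closure(pairs):
    """Fixpoint computation: repeatedly add (a, c) whenever (a, b) and
    (b, c) are both present, until no new pair appears."""
    closure = set(pairs)
    while True:
        derived = {(a, d)
                   for (a, b) in closure
                   for (c, d) in closure
                   if b == c}
        if derived <= closure:
            return closure
        closure |= derived

part_of = {("New York City", "New York State"),
           ("New York State", "United States")}
closure = transitive_closure(part_of)
```

From the two stated facts, the closure also contains the implied pair ("New York City", "United States"); applying the same computation to member-of relations would be unsound, since those are not necessarily transitive.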
Spatial. Spatial relations describe terms relating to or occupying space. This may entail indicating a usual location of a concept, as in the location property in Wikidata or the AtLocation relation in ConceptNet. ConceptNet expresses locations for geographic entities, for example Boston is at location Massachusetts, as well as for things that can contain things: butter is at location refrigerator. Similarly to the latter case, Wikidata includes an example that toasters are located in kitchens. A weaker spatial relation is one of spatial proximity in WebChild or ConceptNet, specifying that, e.g., bikes are located near roads. While Visual Genome does not explicitly have a spatial relation, concepts occurring in the same image region can be represented with the LocatedNear relation [26]. Example statements include food being located near a plate or a table.
Creation. This dimension describes the process or the agent that brought something into existence. ConceptNet gives an example that a cake is created by the bake process, COMET has information that food is created from plants, while Wikidata states that rifle factories create shotguns. Table 2 reveals that no other source has creation information.
Utility. This dimension covers a notion of fitness or usefulness of objects for some purpose. ConceptNet’s relation UsedFor expresses knowledge that ‘the purpose of A is B’, with an example of food being used for pleasure or nourishment. Wikidata has several similar relations: use, used by, and uses, which can express that a platter is used for food presentation, or food is used by organisms. ConceptNet includes the notion of CapableOf, described as ‘A is capable of B if A can typically do B’, like food being capable of going rotten, or knives being capable of cutting. Another related notion is that of receiving an action: a button may receive the push action. While a button does not have the sole purpose of being pushed, it is capable of receiving that action, and by inference, it may respond to the action.
Desire or goal. This dimension covers knowledge about agent desires or goals. An agent may want to have something or wish for something to happen. The agent typically has certain goals, aims, and/or plans that may motivate or explain those desires. The relation Desires in ConceptNet may indicate, e.g., that a person desires regular access to food. Its negated version, NotDesires, expresses that a person does not desire poisoned food. ATOMIC has two relations, xWant and oWant, to indicate the desires of an agent or other agents in a given situation. For instance, when people watch a movie together, they want to get some food. Regarding goals, ConceptNet includes the MotivatedByGoal and ObstructedBy relations to indicate the motivation and the constraint for a certain action. For instance, ConceptNet indicates that one’s sleep is obstructed by noise, while COMET’s extension of ConceptNet posits that people cook a meal because they are hungry.
Quality. Commonsense sources typically describe attributes of an agent or qualities related to an object. For example, ConceptNet and COMET include the relation HasProperty to express knowledge like ice having the property cold and food having the property tasty. ATOMIC uses xAttr to indicate that, for example, a person that cooks food often has the attribute hungry or creative. WebChild and Wikidata both provide more specific qualities, such as taste, temperature, shape, or color. For instance, WebChild would specify the plant color as green.
Comparative. WebChild performs comparisons of objects based on relative values of their attributes. Example comparative relations in WebChild are: healthier than (home cooking - fast food), faster than (car - bike), and larger than (lion - hyena). Notably, no other source describes comparative knowledge explicitly.

Temporal. Most sources have notions of time that may support ordering by time and/or may capture relations that one thing is a prerequisite for another or that one thing may have a particular effect. ConceptNet, for example, expresses that the first event of eating may be cooking, while the last one could be getting rid of the containers. COMET states that eating starts with opening one's mouth. More strongly, the temporal relations often indicate the relative ordering of two events, through relations of causation and effects, such as food potentially causing allergy or indigestion. Such causal knowledge is found in ATOMIC, ConceptNet, COMET, WebChild, and Wikidata.
Relational-other. Conceptual and context-related relationships are often underspecified. On the one hand, increasingly, some sources capture descriptions of the circumstances that form the setting for a statement, event, or idea. ConceptNet has a single relation HasContext for this, while Wikidata has more concrete contextual relations, such as field of this occupation, depicts, and health specialty. This allows Wikidata to express that the main subject of a cuisine is a food product, and that the field of work of food banks is food assistance. (Here we exclude implicitly comparative knowledge, such as the inferred information that eating food makes one more satisfied, from the triple: PersonX eats food - xReact - satisfied.) On the other hand, most of the knowledge
Ilievski et al.:
Preprint submitted to Elsevier
in ConceptNet belongs to a generic relation called
RelatedTo that may be used to capture a relatively vague semantic connection between two concepts, such as food being related to refrigerator or cereal.

Our organization of existing relations into 13 dimensions provides a unified framework to reorganize and consolidate these sources. Here, we discuss two nuances of our process. First, we placed the negative statements (marked with ¬ in Table 2) in the same dimension as the positive ones, as they cover the same knowledge type, despite having a different polarity and, arguably, purpose. Following a similar line of reasoning, we also placed inverse relations, such as used for and uses, in the same dimension. Second, we recognize that the underlying data may not always be clearly placed in one of these dimensions. For instance, the relation AtLocation, which intuitively should belong to the spatial category, contains some statements that express part-whole knowledge.
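The consolidation described in this section can be sketched as a lookup from relation identifiers to dimensions. The mapping below covers only a handful of relations, and the exact assignment shown is illustrative rather than the full table used in the paper:

```python
# Partial, illustrative relation-to-dimension lookup; the full mapping
# covers all relations surveyed in Table 2.
RELATION_TO_DIMENSION = {
    "/r/Synonym": "similarity",
    "/r/Antonym": "distinctness",
    "/r/IsA": "taxonomic",
    "/r/PartOf": "part-whole",
    "/r/AtLocation": "spatial",
    "/r/CreatedBy": "creation",
    "/r/UsedFor": "utility",
    "at:xWant": "desire/goal",
    "/r/HasProperty": "quality",
    "/r/HasSubevent": "temporal",
    "/r/RelatedTo": "relational-other",
}

def dimension_of(relation):
    """Map an edge relation to its knowledge dimension.

    Unknown relations fall back to relational-other
    (an assumption of this sketch).
    """
    return RELATION_TO_DIMENSION.get(relation, "relational-other")

print(dimension_of("/r/AtLocation"))  # spatial
```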
5. Experiments
Seven of the sources covered in the previous section (ConceptNet, ATOMIC, Visual Genome, WordNet, Roget, Wikidata-CS, and FrameNet) have been integrated together in the Commonsense Knowledge Graph (CSKG) [26]. We start with CSKG and apply our dimension classification (section 4) to its sources, under the assumption that each of their edge relations can be mapped unambiguously to one of the dimensions. (As discussed before, this assumption might not always hold in practice. Future work should attempt to refine this mapping, e.g., by crowdsourcing or by clustering algorithms.) As a result, each edge in CSKG has dimension information stored in its relation;dimension column. CSKG contains knowledge for 12 out of our 13 dimensions: the dimension comparative is not represented, as its only source, WebChild, is currently not part of CSKG. The resulting file is publicly available at: http://shorturl.at/msEY5

This enrichment of the CSKG graph allows us to study the commonsense knowledge dimensions from multiple novel perspectives. We investigate the following questions:
1. How well is each dimension covered in the current sources? Here we compute the number of edges for each dimension across sources.
2. Is knowledge redundant across sources? In experiment 2, we use the dimensions to quantify overlap between sources with respect to individual edges.
3. How do the dimensions of the edges compare to their language model (LM) encodings? Experiment 3 computes clusters based on our dimensions and compares them to clusters computed with Transformer-based language models, like BERT [13] and RoBERTa [36].
4. What is the impact of each dimension for reasoning on QA tasks? Each of the dimensions is used to select a subset of the available knowledge in CSKG. The selected knowledge is then used to pretrain a RoBERTa
language model, which is applied to answer commonsense questions in a zero-shot manner. (We leave out the relations prefixed with /r/dbpedia from ConceptNet, as these are being deprecated according to the official documentation: https://github.com/commonsense/conceptnet5/wiki/Relations.)

In this section, we formulate and run suitable studies for each of the four questions, and reflect on the results.

We use the CSKG graph enriched with edge dimensions to compute source coverage with respect to each dimension. The coverage of each source, formalized as a count of the number of edges per dimension, is presented in Table 4.

We observe several trends in this table. First, there is much imbalance in the number of sources per dimension. Comparative knowledge and creation information are very rare and are described by only one or two sources, whereas taxonomic, temporal, and similarity knowledge are much more common and are captured by most sources. Second, some of the dimensions, like creation or part-whole, are represented with relatively few edges, whereas similarity and taxonomic knowledge generally have a much larger number of edges. The exception for the former is the large number of part-whole statements in WebChild, which is due to the fact that WebChild is automatically extracted, resulting in many duplicates and noisy information. Third, we see that some sources, like ConceptNet, FrameNet, and Wikidata-CS, aim for breadth and cover most dimensions. Others, like Roget and ATOMIC, have a narrow focus on specific dimensions: primarily desires/goals and temporal knowledge in ATOMIC, and only knowledge on similarity and distinctness in Roget. Yet, the narrow focus generally coincides with much depth, as both sources have many edges for the small set of dimensions that they cover. FrameNet, having a broad focus, has a small number of edges for each dimension due to its limited coverage of lexical units.
Again, WebChild is a notable outlier here, with a large number of automatically extracted statements for most dimensions. Finally, we observe different ratios between 'strong' and 'weak' semantic relations across sources. Most of ConceptNet's knowledge falls under the generic relational-other category, whereas only a small portion of Wikidata-CS belongs to the same dimension. Most of Wikidata-CS is taxonomic knowledge.
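The coverage computation in this experiment amounts to counting edges per source and dimension. A minimal sketch, with toy edges standing in for the dimension-annotated CSKG:

```python
from collections import Counter

# Toy (source, dimension) pairs standing in for annotated CSKG edges.
annotated_edges = [
    ("ConceptNet", "relational-other"),
    ("ConceptNet", "relational-other"),
    ("ConceptNet", "similarity"),
    ("Wikidata-CS", "taxonomic"),
    ("Wikidata-CS", "taxonomic"),
    ("Roget", "distinctness"),
]

# Coverage: number of edges per (source, dimension), as in Table 4.
coverage = Counter(annotated_edges)
print(coverage[("Wikidata-CS", "taxonomic")])  # 2
```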
Our analysis so far reveals that most dimensions are covered by more than one source. This leads us to the next question: how often is a statement found in multiple sources? Computing edge overlap between sources is conditioned on an identity mapping between their nodes and relations. While CSKG provides such identity mappings between some of its nodes, these cannot be expected to be complete. We align the edges as follows. The nodes across sources are naively compared through their labels. (If a node has more than one label, then we perform the comparison based on the first one. Python script: https://github.com/usc-isi-i2/cskg/blob/master/consolidation/compute_dimensions.py.) Regarding the relations, we
Table 4
Coverage of sources in terms of the knowledge dimensions. The numbers presented are in thousands. Rows: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other; columns: ATOMIC, ConceptNet, WebChild, Roget, Wikidata-CS, WordNet, and FrameNet.
Table 5
Overlap between various source pairs (CN-RG, CN-WD, CN-WN, RG-WD, RG-WN, WD-WN), based on the original relations (Relations) or the abstracted dimensions (Dimensions). Absolute overlap numbers are accompanied in brackets by the Jaccard percentage of the overlap against the union of all triples in the two sources; for example, RG-WD has an overlap of 299 (0.02%) with relations and 333 (0.02%) with dimensions.

benefit from the CSKG principle of normalizing the relations across sources to a shared set. With this procedure, a WordNet edge (food.n.01, synonym, dish.n.01) is modelled as (food, /r/Synonym, dish) in CSKG. As a dimension-based enhancement, we abstract each relation further by mapping it to our dimensions, e.g., transforming (food, /r/Synonym, dish) to (food, similarity, dish). This dimension-based transformation allows for more flexible matching within a dimension, for instance, enabling similarity and synonymy statements to be compared for equivalence, since both (food, /r/Synonym, dish) and (food, /r/SimilarTo, dish) would be normalized to (food, similarity, dish).

We apply the relation-based and dimension-based variants to compute the overlap between four sources: ConceptNet, Roget, Wikidata, and WordNet, in terms of each dimension. Here we do not consider ATOMIC or FrameNet, as their edges can be expected to have extremely low lexical overlap with the other sources. The overlap is computed as a Jaccard score between the number of shared triples between two sources and the union of their triples. (Notebook: https://github.com/usc-isi-i2/cskg/blob/master/analysis/Overlap.ipynb.) The obtained scores are given in Table 5. We observe that the overlap is generally low; yet, translating the original relations into dimensions consistently leads to an increase of the overlap for all of the source pairs. The highest relative overlap is between Roget and WordNet (3.93%), with ConceptNet-WordNet coming second (2.60%). The lowest overlap is obtained between Roget and Wikidata (0.02%).

Next, we inspect the overlap between these sources per dimension: will the edges that correspond to more commonly found dimensions (e.g., part-whole, see Experiment 1) occur more often in multiple sources? We provide insight into this question in Table 6. Primarily, this table reveals that there is very little edge overlap across sources.
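The relation-based and dimension-based overlap variants can be sketched as follows, using the food/dish example from the text (the Jaccard helper is our own, and the two-entry dimension map is illustrative):

```python
def jaccard(a, b):
    """Jaccard score: shared triples over the union of triples."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

# Illustrative two-entry relation-to-dimension map.
DIMENSION = {"/r/Synonym": "similarity", "/r/SimilarTo": "similarity"}

def abstract(triples):
    """Replace each relation by its dimension before matching."""
    return {(s, DIMENSION.get(r, r), o) for s, r, o in triples}

wordnet = {("food", "/r/Synonym", "dish")}
conceptnet = {("food", "/r/SimilarTo", "dish")}

by_relation = jaccard(wordnet, conceptnet)                       # 0.0: relations differ
by_dimension = jaccard(abstract(wordnet), abstract(conceptnet))  # 1.0: both become similarity
```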
As hypothesized, most of the shared edges belong to dimensions that are common in many commonsense sources, such as taxonomic, similarity, and part-whole. The highest Jaccard score is obtained on the taxonomic knowledge between ConceptNet and WordNet, followed by similarity knowledge in ConceptNet-Roget and Roget-WordNet. Wikidata and ConceptNet share edges that belong to a number of other dimensions, including distinctness, similarity, and rel-other. The sparse overlap in Tables 5-6 is amplified by our lexical method of computing overlap, as the same or similar nodes may have slightly different labels. Both the low overlap and the relatively weak comparison method strongly motivate future work on node resolution of commonsense KGs.

Next, we investigate how the information captured by our dimensions relates to the encoding of edges by state-of-the-art Transformer-based language models, like BERT or RoBERTa. For this purpose, we cluster the knowledge in CSKG according to our 13 dimensions, resulting in 13 disjoint clusters. We also compute clusters based on language models in an unsupervised manner as follows. (Notebook: https://github.com/usc-isi-i2/cskg/blob/master/embeddings/Summary%20of%20Dimension%20on%20CSKG.ipynb.) Each of
Table 6
Overlap distribution across dimensions. Absolute overlap numbers are accompanied in brackets by the Jaccard percentage of the overlap against the union of all triples in the two sources. '-' indicates that at least one of the sources does not use the dimension.
CN-RG: distinctness 4,639 (1.17), similarity 69,353 (5.79); all other dimensions '-'.
CN-WD: part-whole 68 (0.25), taxonomic 1,888 (0.62), lexical 20 (0.00), distinctness 266 (1.00), similarity 102 (0.04), quality 0 (0.00), utility 14 (0.02), creation 0 (0.00), temporal 1 (0.00), rel-other 264 (0.01).
RG-WD: distinctness 206 (0.05), similarity 127 (0.01); all other dimensions '-'.
RG-WN: distinctness 3,300 (0.87), similarity 71,725 (6.50); all other dimensions '-'.
WD-WN: part-whole 82 (0.07), taxonomic 1,533 (0.39), distinctness 63 (0.62), similarity 26 (0.02); all other dimensions '-'.

the edges is lexicalized into a natural language sentence by relation-specific templates. Each sentence is then encoded with a Transformer model, either BERT-large or RoBERTa-large, into a single 1,024-dimensional embedding. These embeddings are finally clustered with the k-Means [22] algorithm into k = 13 disjoint clusters.

The two approaches for computing clusters, based on our dimensions and based on Transformer embeddings, can now be compared in terms of their agreement. We use the adjusted rand index (ARI) metric to measure the agreement. The ARI score is 0.226 for BERT and 0.235 for RoBERTa. These scores signal low agreement between the dimension-based and the unsupervised clustering, which is expected given that the dimension-based clustering entirely depends on the relation, while the unsupervised clustering considers the entire triple. We also observe that the ARI score of RoBERTa is slightly higher, which might indicate that the relation has a higher impact on the embedding in RoBERTa than in BERT.

We pick a random sample of 5,000 edges. To understand the information encoded by RoBERTa, we visualize its k-means clusters with UMAP (Figure 1). Curiously, certain clusters are clearly delineated, while others are not. For instance, cluster 5 has little overlap with the other clusters. Looking into the contents of this cluster, we observe that it is largely dominated by distinctness information: 92% (360 out of 390) of its edges belong to this dimension, mostly expressed through the /r/Antonym relation. Clusters 4, 7, and 8 are largely dominated by similarity, while clusters 1 and 6 are largely split between temporal (46%) and desire/goal (36%) edges. At the same time, we observe a lot of overlap between clusters 0, 2, 9, 10, 11, and 12.
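The agreement computation between the two clusterings can be sketched with scikit-learn. The embeddings below are random stand-ins for the Transformer encodings, so the resulting ARI is near zero by construction:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Random stand-ins: 200 "edge" embeddings (1,024-dim, as for RoBERTa-large)
# and one dimension label per edge (13 dimensions).
embeddings = rng.normal(size=(200, 1024))
dimension_labels = rng.integers(0, 13, size=200)

# Unsupervised clusters from the embeddings, k = 13 as in the text.
kmeans_labels = KMeans(n_clusters=13, n_init=10, random_state=0).fit_predict(embeddings)

# Agreement between the dimension-based and the k-means clusters.
ari = adjusted_rand_score(dimension_labels, kmeans_labels)
```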
These clusters are dominated by lexical and relational-other edges, e.g., around half of all edges in clusters 0 and 9 belong to the category relational-other. (We use ARI as implemented in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html.)

Figure 1: UMAP clusters of RoBERTa.

The node frequency distributions reveal that cluster 1 describes positive emotions, as its most frequent node is /c/en/happy; nodes in cluster 5 are often numbers, like rg:en_twenty-eighth; and the nodes in cluster 9 describe concepts from natural sciences, the most connected node being /c/en/zoology.

In Figure 2, we visualize the same set of edges, only this time we color each edge according to its dimension. In accordance with the relatively low rand index score, we observe that the clusters are mostly not well distinguished from one another. We look for correspondences between the RoBERTa clusters in Figure 1 and the dimension-based clusters in Figure 2 by computing the Jaccard score between the edges that constitute each pair of their clusters. The 10 cluster pairs with the highest Jaccard scores are shown in Table 7. We observe the highest correspondence for distinctness with cluster 5 (Jaccard score of 0.92) and similarity with cluster 8 (score 0.45). The overlapping clusters for desire and temporal knowledge both map relatively strongly to clusters 1 and
Figure 2: UMAP clusters according to our dimensions.
Table 7
The 10 pairs with the highest Jaccard scores between dimension-based and RoBERTa-based clusters (columns: RoBERTa cluster, dimension cluster, Jaccard).
Table 8
Top-3 highest-scored dimensions for each of the automatically-computed clusters:
0: lexical (0.205), rel-other (0.121), taxonomic (0.060)
1: desire (0.258), temporal (0.202), quality (0.087)
2: lexical (0.133), rel-other (0.122), taxonomic (0.075)
3: spatial (0.119), lexical (0.061), quality (0.053)
4: similarity (0.21), quality (0.015), lexical (0.008)
5: distinctness (0.916), lexical (0.018), taxonomic (0.003)
6: temporal (0.322), desire (0.283), quality (0.054)
7: similarity (0.295), quality (0.009), taxonomic (0.005)
8: similarity (0.452), lexical (0.004), taxonomic (0.003)
9: rel-other (0.143), lexical (0.085), taxonomic (0.081)
10: rel-other (0.169), taxonomic (0.015), quality (0.007)
11: rel-other (0.11), quality (0.068), taxonomic (0.066)
12: rel-other (0.183), taxonomic (0.059), spatial (0.053)
6, which appear nearby in the RoBERTa clustering as well. We also observe relatively high scores for the cluster pairs 4-similarity, 7-similarity, and 0-lexical, all of which confirm our prior analysis of Figure 1.

We show the top-3 highest-scored dimensions for each of the automatically-computed clusters in Table 8. Here, we observe that some dimensions, like creation, utility, and part-whole, do not fall within the top-3 dimensions for any of the RoBERTa clusters. This is likely due to the lower number of edges with these dimensions, as well as their dispersion across many RoBERTa clusters.

Figure 3: UMAP clusters for two selected nodes: /c/en/food and /c/en/eat.

To investigate further, we select the two CSKG nodes we considered earlier in Table 4, /c/en/food and /c/en/eat, and visualize all their edges according to their dimension. The total set of 2,661 edges largely belongs to the dimension relational-other (1,553 edges), followed by temporal (319 edges) and desire/goal (228 edges), whereas no edge belongs to the creation dimension. Thus, most of the well-specified edges about nutritional concepts express either temporal information about the process, or knowledge about desires/goals relating to nutrition. Within a single dimension, most relational-other edges are expressed with RelatedTo (1,488 out of 1,553). Temporal knowledge is split into multiple relations, primarily HasLastSubevent (113 edges), HasPrerequisite (69), and HasSubevent (60). Desire/goal is divided into at:xWant (78 edges), MotivatedByGoal (47), and at:xIntent (47). Besides the two seed nodes (/c/en/food and /c/en/eat), the frequency distribution of nodes in a cluster reveals other prominent nodes. Naturally, the spatial dimension includes /c/en/plate (with an edge degree of 3), the temporal cluster includes /c/en/diminish_own_hunger (degree of 4), and the distinctness cluster has 7 edges for /c/en/drink.

Experiment 3 revealed overlaps between the information captured by our dimensions and that captured by language models. In our fourth experiment, we enhance language models with knowledge belonging to individual dimensions, in order to examine the effect of different dimensions of knowledge on commonsense reasoning tasks. We adopt the method proposed by [38] to pretrain state-of-the-art language models and conduct zero-shot evaluation on two commonsense question answering tasks. According to this method, we first transform ConceptNet, WordNet, Wikidata, ATOMIC, and Visual Genome into synthetic
Table 9
Statistics of the number of QA pairs (train/dev) for each dimension: part-whole, taxonomic, lexical, distinctness, similarity, quality, utility, creation (304 train / 17 dev), temporal, relational-other, spatial, and desire/goal.
QA sets. We use templates to map each triple into a QA pair, and apply random sampling with heuristic-based filtering to collect two distractors for every QA pair. We group the synthetic QA pairs based on their dimension, resulting in 12 dimension-based QA buckets in total. Within each dimension, the QA data is split into training and development sets. For ATOMIC, we adopt its original split to partition the data. For the other knowledge graphs, we partition 95% of the data into the training set and the remaining 5% into the development set, following [38]. It is worth noting that [38] only selected 14 relations from ConceptNet, WordNet, Wikidata, and Visual Genome, whereas we include all relations except RelatedTo. The statistics for the synthetic QA sets are shown in Table 9. We can see that the distribution of knowledge across dimensions is fairly skewed, with creation having very few questions, while taxonomic and temporal knowledge are the most numerous. Our experiments in this section will reveal whether the amount of available knowledge affects downstream task performance.

We pretrain the RoBERTa-large [36] model on each of the dimensions using the corresponding synthetic QA set. We use RoBERTa with a marginal ranking objective, as this is the best combination according to [38]. We use the same set of hyper-parameters as in [38], including the learning rate, batch size, and margin, except for the creation dimension: since its number of samples is much smaller, we train that model for more epochs while keeping the other hyper-parameters fixed. We evaluate our models on two tasks: the CommonsenseQA task [60], in which the model is asked to choose the correct answer from five options given only the question, and the SocialIQA task [54], in which the model chooses the correct answer from three options given a question and a brief context.

The results from our experiments are shown in Table 10. Overall, we can see that with the additional pretraining on the transformed knowledge graphs, the models are able to outperform the no-knowledge baseline on all dimensions. However, the variance of the improvement across dimensions is relatively large, revealing that certain dimensions are more

Table 10
Results of zero-shot evaluation on two commonsense reasoning tasks. We run every experiment 3 times with different seeds, and report mean accuracy with a 95% confidence interval.
Rows: Baseline, +part-whole, +taxonomic, +lexical, +distinctness, +similarity, +quality, +utility, +creation, +temporal, +relational-other, +spatial, +desire/goal, and +all; columns: CSQA and SIQA accuracy.

relevant for downstream tasks than others. For example, although the training set size of the lexical dimension exceeds 107K, its performance gain on both tasks is limited. We think that this is because the language model has already learned most of the lexical knowledge from pretraining on unstructured text corpora, and thus could not benefit much from additional training. While the quality dimension has a similar training set size to the lexical dimension, the model benefits from it by a large margin on both tasks: 20.7 and 12.7 absolute points, respectively. This finding suggests that quality knowledge is novel and useful, as it may not be easy to learn from unstructured text.

Also, we note that downstream tasks benefit more from the knowledge dimensions that align with their question types. For example, each question in SIQA corresponds to an ATOMIC relation, and requires knowledge primarily about the order of events, personal attributes, and agent desires. Consequently, pretraining on quality, temporal, and desire/goal knowledge provides the model with the largest gains on the SIQA task. The results of the temporal dimension are even higher than training on the entire set of questions, suggesting that certain knowledge dimensions that are not related to SIQA may even lead to a decline in model performance. For the CSQA task, since it is derived from the broad set of knowledge dimensions covered in ConceptNet, we expect that many (if not all) of these dimensions would help performance. Accordingly, we observe large gains with many of the knowledge dimensions (+15%), whereas the utility dimension yields the best performance (+22.4%), even slightly better than that of the entire set.

To better understand the impact of every dimension of knowledge on different questions, we further break down the performance of the models by question type.
Specifically, for CSQA, we classify questions based on the ConceptNet relations between the correct answer and the question concept, as the model needs to reason over such relations to
Figure 4: Accuracy for each question type in CSQA, where AtL. means AtLocation, Cau. means Causes, Cap. means CapableOf, Ant. means Antonym, H.Pre. means HasPrerequisite, H.Sub. means HasSubevent, C.Des. means CausesDesire, Des. means Desires, P.Of means PartOf, M.Goal means MotivatedByGoal, and H.Pro means HasProperty. The numbers in parentheses indicate how many questions fall into the category.

be able to answer the question. For SIQA, since the questions are initially generated using a set of templates based on ATOMIC relations, we try to reverse-engineer the process by manually defining a mapping from question format to ATOMIC relations. Using this method, we are able to successfully classify more than 99% of the questions in the SIQA dev set. Then we compute the average accuracy over 3 seeds for every question type, for models trained on each dimension. The results for CSQA are shown in Figure 4 and the results for SIQA are shown in Figure 5.

For some question types of CSQA, (one of) the largest improvements is achieved by the corresponding knowledge dimension: for example, temporal on Causes and desire/goal on Desires. However, in other cases, the accuracy boost brought about by the corresponding knowledge dimension is significantly lower than that of other dimensions, for example, distinctness on Antonym compared to utility. This might be a signal that the knowledge represented within these dimensions is not clearly separated. For SIQA, the results show that the corresponding knowledge dimension more clearly helps for most question types: desire/goal on xWant and xIntent, quality on xAttr, and temporal on xNeed, oReact, and oEffect. This is especially visible for xIntent and xNeed, where very little gain is observed for most knowledge dimensions except the corresponding one, suggesting that the alignment between the questions and knowledge dimensions is important for the model's success.
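The reverse-engineered mapping from SIQA question formats to ATOMIC relations can be sketched with a few patterns; the question templates and the relation assignments below are illustrative, not the complete mapping used in the experiment:

```python
import re

# Illustrative question-format patterns mapped to ATOMIC relations.
PATTERNS = [
    (re.compile(r"What will \w+ want to do next\?"), "xWant"),
    (re.compile(r"Why did \w+ do (this|that)\?"), "xIntent"),
    (re.compile(r"What does \w+ need to do before(hand| this)?\?"), "xNeed"),
    (re.compile(r"How would you describe \w+\?"), "xAttr"),
]

def question_type(question):
    """Return the ATOMIC relation matching the question format, if any."""
    for pattern, relation in PATTERNS:
        if pattern.search(question):
            return relation
    return None

print(question_type("What will Alex want to do next?"))  # xWant
```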
We note that a similar finding on the alignment between knowledge and the task has been reported in the original paper [38]; yet, the dimensions allow us to validate this claim more precisely. (We omit question types with fewer than 20 questions for CSQA.)

Figure 5: Accuracy for each question type in SIQA. The numbers in parentheses indicate how many questions fall into that category.

Table 11
Results of zero-shot evaluation of RoBERTa on the synthetic QA sets (dev accuracy for each dimension: part-whole, taxonomic, lexical, distinctness, similarity, quality, utility, creation, temporal, relational-other, spatial, and desire/goal).

Finally, to verify our hypothesis that certain dimensions of knowledge are already learned by the state-of-the-art language models to a large extent, while others are not, we directly evaluate the baseline LM on the synthetic QA sets for each dimension. The results are shown in Table 11. As expected, even without any training, the model already achieves a very high accuracy (over 90%) on the lexical dimension, and thus could not receive much training signal from this dimension. On the other hand, the accuracy on the quality, temporal, and desire/goal dimensions is significantly lower. This is mostly because questions from ATOMIC take a large portion of these dimensions. As reported in [38], the questions from ATOMIC are more challenging than those created from other knowledge resources. We note that the accuracy for relational-other is the lowest among all dimensions. We hypothesize that this is because this dimension is noisier than the others, and the knowledge in this dimension is less likely to be found in unstructured text. We leave further investigation of this issue for future research.

In summary, we observe that certain dimensions of knowledge are very beneficial and novel for language models, allowing them to improve their performance on downstream reasoning tasks. Other dimensions, like lexical knowledge, are almost entirely redundant, as the language models have already acquired this knowledge during their initial training. The exact contribution of each dimension depends on the knowledge required by the task at hand. We discuss the implications of the obtained results further in Section 7.
6. Related Work
World knowledge is what people learn and generalize from physical and social experiences, distilling mental representations out of the most significant aspects of everyday life. In these terms, common sense can be conceived as the partition of world knowledge that is commonly shared by most people. This definition, however, has intrinsic limitations: in fact, the scale and the diversity of physical and social experiences, as well as the ecological, context-dependent nature of what constitutes 'common' and 'uncommon' [19], make it hard to formulate any abstract criterion of what should fall under commonsense knowledge. (According to the classic argument of the mind-body problem, it is inherently impossible to characterize how generalization occurs, due to an explanatory gap [31].) From Aristotle's theory of categories [1] to Brentano's empirical psychology [9], deriving knowledge dimensions from empirical observations, as opposed to deriving them from abstract criteria [24], is a fundamental epistemic approach that contributed to the birth of Cognitive Science as a discipline [42], besides serving as a reference framework for our current investigation. In this article, in fact, we neither propose nor adopt any a priori principle to define what should be included in a commonsense knowledge graph; rather, we analyze multiple knowledge graphs and, supported by empirical methods and experimental validations, elicit their most salient conceptual structures.

In the history of general knowledge base systems, the difficulty of characterizing commonsense has been a driver, rather than an obstacle. For instance, Cyc [15], the most monumental effort to construct an axiomatic theory of commonsense knowledge, has been actively growing for almost forty years. At present, the Cyc knowledge base comprises around 1.5 million general concepts, and 25 million rules and assertions about these general concepts.
Different domain-specific extensions of Cyc exist, funded by industry and government programs: considering the key role that commonsense knowledge can play in enhancing AI systems for private and public enterprises, Cyc's strategy of steering the general knowledge base development in the direction of domain use cases can represent a sustainable business model
Ilievski et al.:
Preprint submitted to Elsevier
for other stakeholders in the field.

Existing commonsense knowledge graphs make implicit categorizations of knowledge, by defining a tractable set of relations which can be traced to some types proposed in cognitive research. For instance, WebChild's [61] part-of relations resemble the partonomic-meronymic relations in the cognitive science literature (e.g., see [11, 64]), while ConceptNet [59] defines 34 relations, where the relation
IsA can often be approximated with taxonomic knowledge. In its first version [35], ConceptNet defined 20 relations grouped into 8 categories: K-lines, Things, Agents, Events, Spatial, Causal, Functional, and Affective. Zhang et al. [67] combine the types in the Conceptual Semantic Theory with those in ConceptNet 1.0 and propose the following six categories: property, object, eventuality, spatial, quantity, and others.

Beyond the structural differences, commonsense knowledge graphs share the same foundational elements. Commonsense knowledge is generally split into declarative and procedural, where the former is contained in unconditional assertions, and the latter requires conditional assertions (a distinction formalized in a seminal work by Gentzen [18]): we can state, for instance, that windows are typically made of glass (declarative), and assert that if a large rock is thrown against a window, the glass typically breaks (procedural). As the use of the adverb typically suggests, commonsense knowledge rules out exceptions from the context of interpretation: for instance, bulletproof glass does not break when hit by a rock. According to [39], four types of contextual knowledge are essential for humans to interpret or frame text: intratextual, intertextual, extratextual, and circumtextual knowledge. These generic knowledge types, which are orthogonal to the declarative/procedural distinction, are relevant for most natural language understanding tasks. In [27], we analyzed these types for the task of entity linking; when it comes to commonsense question answering, they might provide a guide for extending the coverage of the knowledge, when combined with specific theories/axioms, such as those defined by the resources in section 3.2 or in Cyc. (Because our focus is on resources, it is beyond the scope of this paper to discuss seminal investigations on common sense axiomatization, such as Pat Hayes' naive physics [23] and Ernest Davis' work on qualitative commonsense reasoning.)

In this paper, we analyzed individual knowledge graphs through suitable semantic dimensions, with the goal of providing insights on how consolidation of commonsense knowledge resources can be guided and, eventually, achieved.
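The declarative/procedural split and the defeasible force of "typically" can be illustrated with a minimal sketch; the triples, the rule, and the exception list are illustrative only and not drawn from any of the surveyed resources:

```python
# Declarative commonsense: unconditional assertions, stored as triples.
facts = {("window", "MadeOf", "glass")}

# Procedural commonsense: a conditional default rule with an explicit
# exception list, capturing the force of the adverb "typically".
def glass_breaks(material: str) -> bool:
    """A large rock thrown against a window typically breaks the glass,
    unless an exception such as bulletproof glass applies."""
    exceptions = {"bulletproof glass"}
    return material.endswith("glass") and material not in exceptions

print(glass_breaks("glass"))              # True: the typical case
print(glass_breaks("bulletproof glass"))  # False: the exception wins
```

The point of the sketch is that the procedural rule cannot be reduced to an unconditional fact: its truth depends on the context of interpretation and on which exceptions are known.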
A natural extension of our work would be to evaluate ongoing efforts that adopt alternative methods of consolidation: accordingly, Framester [16], BabelNet [45], CSKG [26], and Predicate Matrix [29] constitute some of the most mature projects in this space. In Framester, several resources like WordNet, VerbNet, FrameNet, and BabelNet are aligned using an OWL schema based on
Description and Situations and
Semiotics ontology design patterns (available at http://ontologydesignpatterns.org/wiki/Main_Page). CSKG, which we leverage in this paper, is also based on a schema, but it does not rely on traditional RDF/OWL semantics: in fact, CSKG is a hyper-relational graph represented in a tabular format, designed to preserve the individual knowledge structures of resources like ConceptNet, WebChild, Visual Genome, etc., exploit direct mappings when available, derive indirect mappings when possible (e.g., while ConceptNet and Visual Genome do not have direct connections, they both have mappings to WordNet), and infer links through statistical algorithms. BabelNet is a multilingual lexicalized semantic network based on automatically linking Wikipedia with WordNet, and expanded by using additional information from resources like FrameNet and VerbNet. Finally, Predicate Matrix exploits Word Sense Disambiguation algorithms to generate semi-automatic mappings within FrameNet, VerbNet, PropBank, WordNet, and ESO [56] (https://github.com/newsreader/eso-and-ceo).
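The derivation of indirect mappings through a shared pivot resource can be sketched as follows; the node identifiers are hypothetical and the real CSKG pipeline is considerably more involved:

```python
# Hypothetical node identifiers; the real CSKG pipeline is more involved.
conceptnet_to_wordnet = {"/c/en/dog": "wn:dog.n.01"}
visualgenome_to_wordnet = {"vg:1234/dog": "wn:dog.n.01"}

def derive_indirect_mappings(a_to_pivot, b_to_pivot):
    """Link nodes of two resources that map to the same pivot node
    (here, WordNet acts as the pivot)."""
    by_pivot = {}
    for a_node, pivot in a_to_pivot.items():
        by_pivot.setdefault(pivot, []).append(a_node)
    return [(a_node, "sameAs", b_node)
            for b_node, pivot in b_to_pivot.items()
            for a_node in by_pivot.get(pivot, [])]

print(derive_indirect_mappings(conceptnet_to_wordnet, visualgenome_to_wordnet))
# [('/c/en/dog', 'sameAs', 'vg:1234/dog')]
```

This is the sense in which two resources without direct connections can still be linked: each maps to WordNet, and the composition of the two mappings yields the indirect link.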
7. Discussion and Roadmap
Commonsense knowledge sources use different levels of semantics, come in a variety of forms, and strive to capture diverse notions of common sense. After surveying 20 commonsense knowledge sources, we proposed that their relations can be grouped into 13 dimensions, namely: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other. Most relations can be unambiguously mapped to one of the dimensions. We apply our dimensions to reorganize the knowledge in the Commonsense Knowledge Graph (CSKG) [26]. Following our devised mapping of relations to dimensions, we add an additional column (relation;dimension) in CSKG, indicating the dimension of each edge. This allows us to make use of the consolidation of seven existing sources done by CSKG, and complement it with the dimensions in order to perform a more abstract analysis of its knowledge types. We designed and ran four experiments to analyze commonsense knowledge in CSKG through the lenses of these 13 dimensions.

In experiment 1, we investigated the coverage of the 13 dimensions in current sources. Some dimensions, like part-whole and similarity, are a subject of interest in most sources. Others, like comparative knowledge and knowledge on desires/goals, are rarely captured. Yet, the depth of knowledge on the less commonly represented relations is still high, as illustrated by the 244 thousand desire/goal edges in ATOMIC. Here we also observed that the breadth of focus varies notably across sources, as some (e.g., ConceptNet and Wikidata-CS) cover a wide range of relations, while others (e.g., ATOMIC or WordNet) have a narrower focus.

Experiment 2 posed the question of whether individual knowledge statements are redundant across sources. Our experiments with four sources indicated, with few exceptions,
that only a tiny portion of all edges were shared between a pair of sources. This experiment points to a two-fold motivation for node resolution over commonsense sources. On the one hand, node resolution is needed to increase the quality of computing overlap beyond the current lexical comparison of nodes. On the other hand, as the sources have generally complementary goals, it is likely that even with a more semantic computation the overlap will remain low. Node resolution is, thus, essential to consolidate different views of a node (concept) into a single representation.

In experiment 3, we clustered all edges in CSKG according to their dimension, and compared these clusters with k = 13 clusters based on a language model encoding of each edge. We noted that the overall agreement between the dimensions and the language model-based clustering is relatively low, indicating that language models pay much attention to the edge nodes. However, individual correspondences were noted. Similarity and distinctness quite clearly dominated some of the RoBERTa-based clusters, while other clusters were consistently split between the dimensions of desire/goal and temporal knowledge. Interestingly, the clusters inferred from the RoBERTa embeddings often grouped nodes from different sources into a single cluster.

Finally, in experiment 4 we investigated the impact of the dimensions on a downstream reasoning task of commonsense question answering. We adopted a recent idea [38] for pretraining language models with knowledge graphs. The best-scoring model in this paper, RoBERTa-large with marginal ranking loss, was fed with knowledge from one of our dimensions at a time, and evaluated in a zero-shot manner on two benchmarks testing broad (CommonsenseQA) and social (SocialIQA) commonsense reasoning.
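The marginal ranking objective in this setup can be sketched as follows; the margin value and the plain-float scores are illustrative assumptions, whereas in [38] the candidate scores are derived from the language model itself:

```python
def margin_ranking_loss(score_correct, distractor_scores, margin=1.0):
    """Hinge-style marginal ranking loss: the correct answer should
    outscore every distractor by at least `margin`."""
    losses = [max(0.0, margin - score_correct + s) for s in distractor_scores]
    return sum(losses) / len(losses)

# Correct answer beats both distractors by more than the margin: no loss.
print(margin_ranking_loss(3.0, [1.5, 0.5]))  # 0.0
# A distractor within the margin contributes a positive loss.
print(margin_ranking_loss(1.0, [0.8]))       # 0.8
```

Training with this objective pushes the model to separate correct answers from distractors, which is what enables the zero-shot evaluation described above.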
The experiments showed that social commonsense reasoning clearly benefits from temporal, quality, and desire/goal knowledge, whereas the CommonsenseQA benchmark benefits from broad knowledge from all dimensions. Certain dimensions, such as lexical knowledge, were relatively uninformative, as it can be expected that such knowledge has already been acquired by the language models at their initial training stage. While the extent of knowledge plays a role, adding more knowledge is not always beneficial, as the task performance depends on the alignment between the dimensions and the task. This motivates further work on automatic alignment between task questions and our dimensions. It also motivates future work that evaluates whether adding content related to particular dimensions improves certain kinds of tasks.

The goal of consolidating and applying commonsense knowledge is an ambitious one, as witnessed by decades of research on this topic. Our 13 dimensions are an effort to reorganize existing commonsense knowledge through unification of its knowledge types. We see this as a necessary, but not sufficient, step towards a modern and comprehensive commonsense resource. The dimensions could facilitate, or complement, several other aspects of this pursuit:
1. Node resolution
While potentially controversial, the consolidation of commonsense knowledge relations into dimensions/knowledge types is achievable with careful manual effort. This is largely due to the relatively small number of relations in most current sources. Resolving nodes across sources is another key aspect of this consolidation, strongly motivated by experiment 2 of this paper. Sources capture complementary knowledge, whose combination is prevented by the lack of mappings between their nodes. As nodes currently represent various aspects of meaning (words, phrases, concepts, frames, events, sentences), their consolidation is not obvious. Moreover, the number of unique nodes in most sources is on the order of many thousands or even millions [26], preventing a solution by manual effort alone. Node resolution could be framed as a 'static' disambiguation/clustering task, where each statement for a node is classified into one of its meanings, similar to [12]. Here, the set of meanings can either be given in advance (e.g., WordNet synsets) or dynamically inferred from the data. An alternative, 'dynamic' approach is to defer node resolution to task time and perform it implicitly as a task of retrieving evidence from a knowledge source [32]. Another option is a combination of the static and the dynamic approaches.
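A minimal sketch of the 'static' variant assigns each statement about an ambiguous node to the candidate sense whose gloss it overlaps most, a crude Lesk-style heuristic; the sense identifiers and glosses below are illustrative, loosely in the style of WordNet synsets:

```python
# Hypothetical sense inventory, loosely in the style of WordNet glosses.
senses = {
    "bank.n.01": "a financial institution that accepts deposits of money",
    "bank.n.02": "sloping land beside a body of water such as a river",
}

def resolve(statement, senses):
    """Assign a statement to the sense whose gloss shares most words
    with it (a crude, Lesk-style overlap heuristic)."""
    words = set(statement.lower().split())
    return max(senses, key=lambda s: len(words & set(senses[s].split())))

print(resolve("a bank can store your money", senses))  # bank.n.01
print(resolve("the river bank was muddy", senses))     # bank.n.02
```

A production system would of course replace the bag-of-words overlap with contextual embeddings or supervised disambiguation, but the framing is the same: each statement is classified into one of the node's meanings.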
2. Coverage and boundaries
At present, it is difficult to estimate the completeness of commonsense knowledge sources. With the relations organized into dimensions of knowledge, we gain insight into the volume of knowledge that falls within each of the dimensions. An ideal node resolution would take us one step further, allowing us to detect gaps, i.e., understand which relevant facts are not represented by any of the sources. If nodes are resolved to an ontology like WordNet, one could leverage its taxonomy to infer new information. For instance, ConceptNet is at present unable to infer that if barbecues are held in outdoor places, they could, by extension, be held in a park or on someone's patio. In addition, a more semantic resource would allow us to define constraints over the knowledge and detect anomalies and contradictory knowledge, which can be argued to define the boundaries of the knowledge that can be obtained. It is reasonable that such boundaries exist, as commonsense knowledge is characterized by the commonness of its concepts and a restricted set of relations [25]. Further, organizing by dimensions also allows us to describe the strengths (and/or weaknesses) of resources. For example, a resource that has many partonomic relationships might be the first resource to consider if a task requires part-whole reasoning.
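The barbecue example above can be sketched with a toy taxonomy; the hyponym table stands in for WordNet's taxonomy and all identifiers are illustrative:

```python
# Toy taxonomy; the hyponym table stands in for WordNet's.
hyponyms = {"outdoor_place": ["park", "patio", "garden"]}
edges = {("barbecue", "AtLocation", "outdoor_place")}

def infer_locations(edges, hyponyms):
    """Propagate AtLocation edges from a class to its narrower terms."""
    inferred = set()
    for subj, rel, obj in edges:
        if rel == "AtLocation":
            for narrower in hyponyms.get(obj, []):
                inferred.add((subj, rel, narrower))
    return inferred

print(sorted(infer_locations(edges, hyponyms)))
# [('barbecue', 'AtLocation', 'garden'), ('barbecue', 'AtLocation', 'park'),
#  ('barbecue', 'AtLocation', 'patio')]
```

Such downward propagation is only sound for relations like AtLocation that distribute over subclasses, which is one reason a semantic resource with explicit constraints would be valuable.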
3. Generalizable downstream reasoning
As current large-scale commonsense sources are primarily text-based, they are lexicalized prior to their combination with language models, losing much of their structure. As this lack of structure prevents us from understanding their coverage and gaps, we are unable to measure their potential for downstream reasoning as a function of the available knowledge. It remains unknown to what extent a more complete source, organized around dimensions of commonsense knowledge, would be able to contribute to improved performance. Experiment 4
showed that there is correspondence between knowledge dimensions and question answering tasks, motivating automatic alignment between the two. Moreover, a comprehensive semantic source may inspire new neuro-symbolic reasoning methods, with potentially enhanced generalizability and explainability, opening the door for reliable commonsense services to be made available in the future.
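The lexicalization step mentioned above can be sketched with per-dimension templates; the templates and dimension labels here are illustrative, not those used by any particular system:

```python
# Illustrative per-dimension verbalization templates.
templates = {
    "part-whole": "{subj} is a part of {obj}.",
    "desire/goal": "{subj} wants {obj}.",
    "temporal": "{subj} happens before {obj}.",
}

def lexicalize(subj, dimension, obj):
    """Flatten a structured edge into plain text for a language model;
    the graph structure is lost at this point."""
    return templates[dimension].format(subj=subj, obj=obj)

print(lexicalize("wheel", "part-whole", "car"))  # wheel is a part of car.
```

The flattening is exactly where structure is lost: once edges become sentences, their relation types and graph neighborhood are no longer available to the downstream model.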
4. Evaluation and knowledge gaps
Experiment 4 showed that the potential of different dimensions for reasoning varies greatly and is largely dependent on the evaluation data. This finding is in line with [38]. The fact that certain dimensions consistently contribute little can be an indicator of gaps in current evaluation. Namely, dimensions like distinctness and spatial, which currently contribute little or not at all, are likely to be underrepresented in current evaluations. These gaps should ideally be addressed in the future by new benchmarks that represent these missing dimensions. We note that our set of dimensions is based on the relations found in current popular commonsense sources. Hence, in this paper, we make the assumption that the knowledge types in these sources suffice, or at least have previously sufficed, and can express the desired knowledge. The diversity of knowledge expressed by the relational-other dimension, as pointed out also in [25], might be an indicator of additional, latent dimensions hidden behind the vagueness of this dimension.
8. Conclusions
At present, commonsense knowledge is dispersed across a variety of sources with different foci, strengths, and weaknesses. The complementary knowledge covered by these sources motivates efforts to consolidate them under a common representation. In this paper, we pursued the goal of organizing commonsense relations into a shared set of knowledge dimensions in a bottom-up fashion. Starting from a survey and analysis of the relations found in existing sources, we grouped them into 13 dimensions: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other. As each relation in these sources can be mapped to a dimension, we applied our method to abstract the relations in an existing consolidated resource: the Commonsense Knowledge Graph (CSKG). This allowed us to empirically study the impact of these dimensions. First, we observed that some dimensions are included more often than others, potentially pointing to gaps in the knowledge covered in existing resources. Second, we measured sparse overlap of facts expressed with each dimension across sources, which motivates future work on graph integration through (automated) node resolution. Third, comparing the dimension-based clustering to language model-based unsupervised edge clustering resulted in low overall agreement, though in some cases, the unsupervised clusters were dominated by one or two dimensions. This showed that some of the dimensions represent a stronger signal for language modeling than others. Fourth, we measured the impact of each dimension on a downstream question answering reasoning task, by adapting a state-of-the-art method of pretraining language models with knowledge graphs. Here, we observed that the impact differs greatly per dimension, depending largely on the alignment between the task and the knowledge dimension, as well as on the novelty of knowledge captured by a dimension.
While this is in accordance with the findings of the original method [38], the dimension-driven experiments of this paper enabled this hypothesis to be investigated much more precisely, revealing the direct impact of each knowledge dimension rather than of entire knowledge sources.

Our experiments inspired a four-step roadmap towards the creation and utilization of a comprehensive dimension-centered resource. (1) Node resolution methods should be introduced and applied to unify the resources further. (2) Such an integration would allow us to better understand and improve the coverage/gaps and boundaries of these sources. (3) A large-scale, public semantic graph of commonsense knowledge may inspire novel neuro-symbolic methods, potentially allowing for better generalization and explainability. (4) The impact of a dimension is an indicator of the coverage of that dimension in current evaluation benchmarks; under-represented dimensions are evaluation gaps that may need to be filled by introducing new benchmarks. And, vice versa, additional knowledge dimensions might be hidden behind the generic relational-other dimension.
Acknowledgement
This material is based upon work sponsored by the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research.
References
[1] Aristotle, Jones, R., Ross, W., 2012. The Metaphysics. CreateSpace Independent Publishing Platform. URL: https://books.google.com/books?id=-D3JtAEACAAJ.
[2] Baker, C.F., Fillmore, C.J., Lowe, J.B., 1998. The Berkeley FrameNet project, in: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pp. 86–90.
[3] Banerjee, P., Baral, C., 2020. Self-supervised knowledge triplet learning for zero-shot question answering. arXiv preprint arXiv:2005.00316.
[4] Bhakthavatsalam, S., Anastasiades, C., Clark, P., 2020a. GenericsKB: A knowledge base of generic statements. arXiv preprint arXiv:2005.00660.
[5] Bhakthavatsalam, S., Richardson, K., Tandon, N., Clark, P., 2020b. Do dogs have whiskers? A new knowledge base of hasPart relations. arXiv preprint arXiv:2006.07510.
[6] Bisk, Y., Zellers, R., LeBras, R., Gao, J., Choi, Y., 2020. PIQA: Reasoning about physical commonsense in natural language, in: AAAI, pp. 7432–7439.
[7] Boratko, M., Li, X.L., Das, R., O'Gorman, T., Le, D., McCallum, A., 2020. ProtoQA: A question answering dataset for prototypical common-sense reasoning. arXiv preprint arXiv:2005.00771.
[8] Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., Choi, Y., 2019. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317.
[9] Brentano, F., 2014. Psychology from an Empirical Standpoint. Routledge.
[10] Cambria, E., Li, Y., Xing, F.Z., Poria, S., Kwok, K., 2020. SenticNet 6:
Ensemble application of symbolic and subsymbolic AI for sentiment analysis. CIKM'20, Oct 20–24.
[11] Casati, R., Varzi, A.C., et al., 1999. Parts and Places: The Structures of Spatial Representation. MIT Press.
[12] Chen, J., Liu, J., 2011. Combining ConceptNet and WordNet for word sense disambiguation, in: Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 686–694.
[13] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[14] Dodge, E.K., Hong, J., Stickles, E., 2015. MetaNet: Deep semantic automatic metaphor analysis, in: Proceedings of the Third Workshop on Metaphor in NLP, pp. 40–49.
[15] Elkan, C., Greiner, R., 1993. Building large knowledge-based systems: Representation and inference in the Cyc project: D.B. Lenat and R.V. Guha.
[16] Gangemi, A., Alam, M., Asprino, L., Presutti, V., Recupero, D.R., 2016. Framester: A wide coverage linguistic linked data hub, in: European Knowledge Acquisition Workshop, Springer, pp. 239–254.
[17] Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L., 2002. Sweetening ontologies with DOLCE, in: International Conference on Knowledge Engineering and Knowledge Management, Springer, pp. 166–181.
[18] Gentzen, G., 1934. Investigations into logical deduction. Translation printed in M. Szabo, The Collected Papers of Gerhard Gentzen.
[19] Gibson, E.J., Pick, A.D., et al., 2000. An Ecological Approach to Perceptual Learning and Development. Oxford University Press, USA.
[20] Grice, H.P., 1975. Logic and conversation, in: Speech Acts. Brill, pp. 41–58.
[21] Guha, R.V., Brickley, D., Macbeth, S., 2016. Schema.org: Evolution of structured data on the web. Communications of the ACM 59, 44–51.
[22] Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28, 100–108.
[23] Hayes, P.J., 1979.
The naive physics manifesto. Expert Systems in the Microelectronic Age.
[24] Hicks, G.D., 1904. Idealism and the problem of knowledge and existence, in: Proceedings of the Aristotelian Society, JSTOR, pp. 136–178.
[25] Ilievski, F., Szekely, P., Schwabe, D., 2020a. Commonsense knowledge in Wikidata. arXiv preprint arXiv:2008.08114.
[26] Ilievski, F., Szekely, P., Zhang, B., 2020b. CSKG: The commonsense knowledge graph. arXiv:2012.11490.
[27] Ilievski, F., Vossen, P., Van Erp, M., 2017. Hunger for contextual knowledge and a road map to intelligent entity linking, in: International Conference on Language, Data and Knowledge, Springer, Cham, pp. 143–149.
[28] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al., 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 32–73.
[29] de Lacalle, M.L., Laparra, E., Aldabe, I., Rigau, G., 2016. A multilingual predicate matrix, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 2662–2668.
[30] Lenat, D.B., 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM 38, 33–38.
[31] Levine, J., 1983. Materialism and qualia: The explanatory gap. Pacific Philosophical Quarterly 64, 354–361.
[32] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al., 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.
[33] Lin, B.Y., Chen, X., Chen, J., Ren, X., 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. arXiv preprint arXiv:1909.02151.
[34] Lin, B.Y., Lee, S., Khanna, R., Ren, X., 2020. Birds have four legs?! NumerSense: Probing numerical commonsense knowledge of pre-trained language models. arXiv preprint arXiv:2005.00683.
[35] Liu, H., Singh, P., 2004.
ConceptNet—a practical commonsense reasoning tool-kit. BT Technology Journal 22, 211–226.
[36] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
[37] Ma, K., Francis, J., Lu, Q., Nyberg, E., Oltramari, A., 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. arXiv preprint arXiv:1910.14087.
[38] Ma, K., Ilievski, F., Francis, J., Bisk, Y., Nyberg, E., Oltramari, A., 2020. Knowledge-driven data construction for zero-shot evaluation in commonsense question answering. arXiv:2011.03863.
[39] MacLachlan, G.L., Reid, I., 1994. Framing and Interpretation. Melbourne University Press.
[40] McCarthy, J., et al., 1960. Programs with common sense. RLE and MIT Computation Center.
[41] Miller, G.A., 1998. WordNet: An Electronic Lexical Database. MIT Press.
[42] Miller, G.A., 2003. The cognitive revolution: A historical perspective. Trends in Cognitive Sciences 7, 141–144.
[43] van Miltenburg, E., 2016. Stereotyping and bias in the Flickr30k dataset, in: Edlund, J., Heylen, D., Paggio, P. (Eds.), Proceedings of Multimodal Corpora: Computer Vision and Language Processing (MMC 2016), pp. 1–4.
[44] Mostafazadeh, N., Kalyanpur, A., Moon, L., Buchanan, D., Berkowitz, L., Biran, O., Chu-Carroll, J., 2020. GLUCOSE: Generalized and contextualized story explanations. arXiv preprint arXiv:2009.07758.
[45] Navigli, R., Ponzetto, S.P., 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193, 217–250.
[46] Niles, I., Pease, A., 2001. Towards a standard upper ontology, in: Proceedings of the International Conference on Formal Ontology in Information Systems, Volume 2001, pp. 2–9.
[47] Pellissier Tanon, T., Weikum, G., Suchanek, F., 2020. YAGO 4: A reason-able knowledge base.
The Semantic Web.
[48] Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.H., Riedel, S., 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
[49] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S., 2016. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 1–20.
[50] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 9.
[51] Roget, P.M., 2020. Roget's Thesaurus. Good Press.
[52] Romero, J., Razniewski, S., Pal, K., Z. Pan, J., Sakhadeo, A., Weikum, G., 2019. Commonsense properties from query logs and question answering forums, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1411–1420.
[53] Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N.A., Choi, Y., 2019a. ATOMIC: An atlas of machine commonsense for if-then reasoning, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3027–3035.
[54] Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y., 2019b. Social IQa: Commonsense reasoning about social interactions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp. 4463–4473.
[55] Schuler, K.K., 2005. VerbNet: A broad-coverage, comprehensive verb lexicon.
[56] Segers, R., Vossen, P., Rospocher, M., Serafini, L., Laparra, E., Rigau, G., 2015. ESO: A frame based ontology for events and implied situations. Proceedings of MAPLEX 2015.
[57] Shwartz, V., West, P., Bras, R.L., Bhagavatula, C., Choi, Y., 2020. Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483.
[58] Singh, P., Lin, T., Mueller, E.T., Lim, G., Perkins, T., Zhu, W.L., 2002. Open Mind Common Sense: Knowledge acquisition from the general public, in: OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", Springer, pp. 1223–1237.
[59] Speer, R., Chin, J., Havasi, C., 2017. ConceptNet 5.5: An open multilingual graph of general knowledge, in: Thirty-First AAAI Conference on Artificial Intelligence.
[60] Talmor, A., Herzig, J., Lourie, N., Berant, J., 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4149–4158.
[61] Tandon, N., De Melo, G., Weikum, G., 2017. WebChild 2.0: Fine-grained commonsense knowledge distillation, in: Proceedings of ACL 2017, System Demonstrations, pp. 115–120.
[62] Vrandečić, D., Krötzsch, M., 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM 57, 78–85.
[63] Williams, B., Lieberman, H., Winston, P.H., 2017. Understanding stories with large-scale common sense, in: COMMONSENSE.
[64] Winston, M.E., Chaffin, R., Herrmann, D., 1987. A taxonomy of part-whole relations. Cognitive Science 11, 417–444.
[65] Yang, W., Wang, X., Farhadi, A., Gupta, A., Mottaghi, R., 2018. Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543.
[66] Zellers, R., Bisk, Y., Farhadi, A., Choi, Y., 2019.
From recognition to cognition: Visual commonsense reasoning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731.
[67] Zhang, H., Zhao, X., Song, Y., 2020. WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd Schema Challenge. arXiv preprint arXiv:2005.05763.