Commonsense Knowledge Mining from Term Definitions
Zhicheng Liang, Deborah L. McGuinness
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY
[email protected], [email protected]
Abstract
Commonsense knowledge has proven to be beneficial to a variety of application areas, including question answering and natural language understanding. Previous work explored collecting commonsense knowledge triples automatically from text to increase the coverage of current commonsense knowledge graphs. We investigate a few machine learning approaches to mining commonsense knowledge triples using dictionary term definitions as inputs and provide some initial evaluation of the results. We start by extracting candidate triples using part-of-speech tag patterns from text, and then compare the performance of three existing models for triple scoring. Our experiments show that term definitions contain some valid and novel commonsense knowledge triples for some semantic relations, and also indicate some challenges with using existing triple scoring models.

Introduction
A variety of natural language related tasks, e.g. question answering (Lin et al. 2019) and dialog systems (Young et al. 2018), are able to achieve better performance by introducing commonsense knowledge. However, the knowledge collection process is difficult because commonsense knowledge is assumed to be widely known, and is thus rarely stated explicitly in natural language text. A Commonsense Knowledge Graph (CSKG) is usually represented as a directed graph, where nodes represent concepts and edges denote some pre-defined relations between concepts. Most existing large-scale CSKGs are built by expert annotation, e.g. Cyc (Lenat 1995), or by crowdsourcing, e.g. ConceptNet (Speer, Chin, and Havasi 2017) and ATOMIC (Sap et al. 2019). With respect to the broadness and diversity of commonsense knowledge, these CSKGs typically suffer from low coverage. Our code and data are available at: https://github.com/gychant/CSKMTermDefn

To increase coverage, there have been two lines of research to infer commonsense knowledge automatically. One is CSKG completion, which aims to learn a scoring model that distinguishes between triples expressing commonsense knowledge and those that do not (Li et al. 2016; Saito et al. 2018; Malaviya et al. 2020). A learned scoring model can be used to estimate the plausibility of candidate concept-relation-concept triples that are either constructed between existing concepts or extracted from the raw text of new sources. Triples scored above a certain threshold are considered to be valid and then used to augment a CSKG. The other line is CSKG generation, which aims to generate a new node (or concept), t2, in the form of an arbitrary phrase, and connect it with an existing node, t1, using a pre-defined relation, R, in order to construct a new triple (t1, R, t2) (Saito et al. 2018; Bosselut et al. 2019). One weakness of these generative models is the low novelty rate of generated concepts. As reported in the ConceptNet experiments by Bosselut et al.
(2019), only 3.75% of generated triples contain novel object nodes given the subject and the relation of a triple, which limits the ability to augment a CSKG on a large scale. In contrast, the CSKG completion approach is more promising because it allows the introduction of more novel/unseen concepts to the existing graph by mining diverse candidate triples from external text resources.

Previous work on commonsense knowledge mining exists. Blanco, Cankaya, and Moldovan (2011) extracted commonsense knowledge using concept properties and metarules. Li et al. (2016) extracted knowledge triples from Wikipedia and used their trained model to score the triples. Jastrzebski et al. (2018) further evaluated the novelty of these extracted triples and introduced an automated novelty metric that correlates with human judgement. Davison, Feldman, and Rush (2019) leveraged a pre-trained language model to mine commonsense knowledge. Zhang et al. (2020) mined commonsense knowledge from linguistic patterns of raw text. However, as a potential source of commonsense knowledge, dictionary term definitions have yet to be explored.

Dictionary term definitions are compiled to provide precise descriptions of the properties of terms, concepts, or entities in our daily life. Based on the assumption that concepts have properties which imply commonsense (Blanco, Cankaya, and Moldovan 2011), we further assume that some commonsense knowledge could be extracted or inferred from these definitions. For example, the term definition of "bartender" is One who tends a bar or pub; a person preparing and serving drinks at a bar, from which one could infer commonsense triples such as (bartender, IsA, person), (bartender, AtLocation, bar), (bartender, AtLocation, pub), and (bartender, CapableOf, preparing and serving drinks). Among these triples, the second triple is already included in ConceptNet, and the others have semantically similar counterparts, e.g.
(bartender, CapableOf, fill glass with drink), (bartender, CapableOf, mix drink), and (bartender, CapableOf, shake drink).

We aim to examine the performance of existing machine learning approaches for mining the described commonsense knowledge triples from such term definitions, i.e. examining their capability of distinguishing valid and invalid triples extracted from raw text, and to understand the potential and feasibility of mining commonsense knowledge automatically from this particular kind of resource.

Approach
In this section, we introduce how we extract candidate commonsense knowledge triples from term definitions, as well as the models we use to compute their plausibility scores, which measure the level of validity of a triple.
Candidate Extraction
We use term definitions from the English version of Wiktionary (https://en.wiktionary.org/wiki/Wiktionary:Main_Page), a freely-available multilingual dictionary. Yet, our approach is agnostic to any particular definition resource. As in ConceptNet, the subject and object of a commonsense knowledge triple can be arbitrary phrases. Instead of generating candidates from simple N-grams, Li et al. (2016) extracted candidates using frequent part-of-speech (POS) tag patterns of concept pairs for each pre-defined relation. This method encourages candidates to respect the target commonsense knowledge graph being extended. We employ a similar approach to extracting candidates from term definitions. First, we parse the nodes in ConceptNet using spaCy (https://spacy.io/) to obtain their POS tags. Next, we choose the top k most frequent POS tag patterns for each relation and apply them to match text spans in the term definition text. For instance, the phrase "preparing and serving drinks" can be extracted from the term definition of "bartender" using the POS tag pattern "VERB, CCONJ, VERB, NOUN" for the CapableOf relation. Finally, we construct candidate knowledge triples for each relation using a term as the subject and a phrase extracted from the definition of that term as the object.
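The pattern-matching step above can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names are ours, and the POS tags are hand-written stand-ins for the output of a spaCy pipeline (in practice one would use [(token.text, token.pos_) for token in nlp(definition)]).

```python
# Minimal sketch of POS-pattern candidate extraction. Assumes the
# definition text has already been tokenized and POS-tagged (e.g. by spaCy).

def match_pos_pattern(tagged_tokens, pattern):
    """Return the text of every contiguous token span whose POS tags
    match `pattern` exactly."""
    spans = []
    n, m = len(tagged_tokens), len(pattern)
    for i in range(n - m + 1):
        window = tagged_tokens[i:i + m]
        if [pos for _, pos in window] == pattern:
            spans.append(" ".join(tok for tok, _ in window))
    return spans

# Definition of "bartender", tagged by hand for illustration.
tagged = [("One", "PRON"), ("who", "PRON"), ("tends", "VERB"),
          ("a", "DET"), ("bar", "NOUN"), (";", "PUNCT"),
          ("a", "DET"), ("person", "NOUN"),
          ("preparing", "VERB"), ("and", "CCONJ"),
          ("serving", "VERB"), ("drinks", "NOUN")]

# The frequent CapableOf pattern from the example in the text.
pattern = ["VERB", "CCONJ", "VERB", "NOUN"]
objects = match_pos_pattern(tagged, pattern)
print(objects)  # ['preparing and serving drinks']

# Each match becomes a candidate triple with the term as subject.
candidates = [("bartender", "CapableOf", obj) for obj in objects]
```

Running the top-k patterns of every relation over every definition yields the full candidate pool.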
Triple Scoring
We adopt three state-of-the-art triple scoring models for computing the plausibility of candidate triples. Each of them assigns a real-valued score to a given triple:

• Bilinear AVG (i.e. average) (Li et al. 2016) defines the plausibility score of a triple (t1, R, t2) as u_1^T M_R u_2, where M_R ∈ R^{r×r} is the parameter matrix for relation R, u_i is a nonlinear transformation of the term vector v_i, and v_i is obtained by averaging the word embeddings of the original term, t_i. Li et al. (2016) used this model to score triples extracted from Wikipedia, since it performs better when scoring novel triples.

• KG-BERT (Yao, Mao, and Luo 2019) treats triples in knowledge graphs as textual sequences by taking the corresponding entity and relation descriptions, and learns the triple scoring function by fine-tuning a pre-trained language model.

• PMI model (Davison, Feldman, and Rush 2019) represents triple plausibility using pointwise mutual information (PMI) between the head entity, h, and the tail entity, t. Specifically, it translates a relational triple into a masked sentence and estimates the PMI using a pre-trained language model, computed by PMI(t, h | r) = log p(t | h, r) − log p(t | r). The final plausibility score is obtained by averaging PMI(t, h | r) and PMI(h, t | r). This approach is unsupervised in the sense that the model does not need to be trained on a particular commonsense knowledge base to learn model weights. Davison, Feldman, and Rush (2019) find that it outperforms supervised methods when mining commonsense knowledge from new sources.

Experiment
We provide details of candidate triple mining and our execution of the three models on the candidate triples, and then include some analysis of our results.
Candidate Triple Mining
ConceptNet 5 (Speer, Chin, and Havasi 2017), the latest version of ConceptNet, is built from multiple sources (the Open Mind Common Sense (OMCS) project, Wiktionary, DBpedia, etc.), thus introducing some domain-specific knowledge. To focus on commonsense knowledge only, we collect Wiktionary definitions for the terms that appear in the English core of ConceptNet (162,363 terms in total). We further filter out definitions related to morphology, i.e. the ones containing "plural of", "alternative form of", "alternative spelling of", or "misspelling of", resulting in 13,850 term definitions. We choose 12 representative ConceptNet relations as extraction targets. By applying the top 15 most frequent POS tag patterns for each relation, we extracted around 1.4 million candidate triples in total for these relations (see Table 3 of the Appendix for detailed statistics).
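The morphology filter described above can be sketched as follows. This is a minimal illustration: the marker phrases come from the text, while the helper name and sample definitions are ours.

```python
# Filter out morphology-related Wiktionary definitions, as described
# in the text. The marker phrases are from the paper; the helper name
# and example definitions are illustrative.
MORPHOLOGY_MARKERS = ("plural of", "alternative form of",
                      "alternative spelling of", "misspelling of")

def keep_definition(definition):
    """Keep a definition only if it is not a morphology cross-reference."""
    text = definition.lower()
    return not any(marker in text for marker in MORPHOLOGY_MARKERS)

defs = {
    "bartender": "One who tends a bar or pub; a person preparing "
                 "and serving drinks at a bar",
    "mice": "plural of mouse",
    "color": "alternative spelling of colour",
}
filtered = {term: d for term, d in defs.items() if keep_definition(d)}
print(sorted(filtered))  # ['bartender']
```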
Model Settings
For Bilinear AVG, we simply adopt the trained model released by Li et al. (2016) (available at https://ttic.uchicago.edu/~kgimpel/comsense_resources/ckbc-demo.tar.gz), which uses the English triples with ConceptNet 4 as its source. For KG-BERT (available at https://github.com/yao8839836/kg-bert), we train the model using the full training set of 100 thousand triples (available at https://ttic.uchicago.edu/~kgimpel/comsense_resources/train100k.txt.gz) that were used to train Bilinear AVG. Since this 100k training set contains only positive triples, KG-BERT generates negative triples by replacing the head entity or tail entity of a positive triple with a random entity in the training set. Note that if a corrupted triple with a replaced entity is already among the positive triples, it will not be treated as a negative triple. We train KG-BERT for 10 epochs and use the model that achieves the best accuracy on the development set released by Li et al. (2016). For the PMI model, we directly run it on the candidate triples without training. Specifically, Bilinear AVG and KG-BERT assign each candidate triple a plausibility score in the range of [0, 1]. After running these models, we separately rank the candidate triples of each relation by their scores in descending order.

Figure 1: Score distribution of collected triples, shown as histograms for (a) Bilinear AVG, (b) KG-BERT, and (c) PMI (best viewed in color).
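To make the Bilinear AVG scoring concrete, the sketch below computes a plausibility score in [0, 1] from averaged word embeddings and a relation matrix. Everything here is a stand-in: the embeddings and the relation matrix are random rather than trained, tanh stands in for the model's nonlinearity, and a sigmoid maps the bilinear product to [0, 1].

```python
import math
import random

random.seed(0)
r = 4  # embedding dimensionality (toy size)

# Toy word embeddings; in the real model these are trained vectors.
VOCAB = ["bartender", "preparing", "and", "serving", "drinks"]
emb = {w: [random.gauss(0, 1) for _ in range(r)] for w in VOCAB}

def term_vector(term):
    """v_i: average of the word embeddings of the term's words."""
    vecs = [emb[w] for w in term.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def bilinear_avg_score(t1, M, t2):
    """Plausibility of (t1, R, t2) as sigmoid(u1^T M_R u2),
    with tanh as a stand-in nonlinearity applied to the term vectors."""
    u1 = [math.tanh(x) for x in term_vector(t1)]
    u2 = [math.tanh(x) for x in term_vector(t2)]
    Mu2 = [sum(M[i][j] * u2[j] for j in range(r)) for i in range(r)]
    raw = sum(u1[i] * Mu2[i] for i in range(r))
    return 1.0 / (1.0 + math.exp(-raw))

M_capableof = [[random.gauss(0, 1) for _ in range(r)] for _ in range(r)]
s = bilinear_avg_score("bartender", M_capableof,
                       "preparing and serving drinks")
print(0.0 <= s <= 1.0)  # True: the score lies in [0, 1]
```

With trained parameters, ranking candidates by this score and thresholding (here, at 0.9) yields the "qualified" triples discussed in the next section.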
Analysis
We first analyze the distribution of scores assigned to the collected candidate triples by the three models we considered. The histogram plots with 10 bins are shown in Figure 1. We observe high variability in score distribution across these models, indicating that they do not have an evident consensus on the plausibility of the candidate triples. This is also supported by the Kendall's tau coefficients (Kendall 1938) computed between each pair of the models, whose absolute values are all less than 0.01.

To evaluate the inferred knowledge, we adopt two metrics from the literature: Validity and Novelty. Validity describes how many generated/extracted triples are plausible and is measured by human evaluation. Bosselut et al. (2019) conducted automatic evaluation using the pre-trained Bilinear AVG model developed by Li et al. (2016). Novelty is measured by the percentage of generated triples that are unseen in the training set (Li et al. 2016; Saito et al. 2018). We also evaluate it against the English subset of ConceptNet. Of the two, validity is more important, yet novelty is also a desirable property when augmenting a CSKG.
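The pairwise model-agreement check based on Kendall's tau can be sketched as below. The implementation is a self-contained tau-a over two models' scores for the same candidate triples (the score lists are made up for illustration; in practice one would use a library routine such as scipy.stats.kendalltau over the full candidate set).

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            sx = (xs[i] > xs[j]) - (xs[i] < xs[j])  # sign of x difference
            sy = (ys[i] > ys[j]) - (ys[i] < ys[j])  # sign of y difference
            if sx * sy > 0:
                concordant += 1
            elif sx * sy < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Made-up plausibility scores from two models for the same five triples.
bilinear_scores = [0.91, 0.12, 0.55, 0.97, 0.30]
kgbert_scores   = [0.40, 0.85, 0.52, 0.33, 0.90]

tau = kendall_tau(bilinear_scores, kgbert_scores)
print(round(tau, 2))  # -0.8: these two toy rankings strongly disagree
```

A |tau| close to 0, as observed between the real models, means the rankings are close to uncorrelated.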
1. Novelty
When we measure the novelty of a triple, we treat the concepts in the triple as strings. Preprocessing includes removing stop words from the strings, and treating them as a bag of words after lemmatization and stemming. This approach is stricter than exact match. Similarly, Bosselut et al. (2019) measured novelty using the word-token-based minimum edit distance of generated phrase objects to the nearest one in the training set. We find that around 99% of the candidate triples of each relation are novel with respect to either the triples in the training set or the triples in the English core of ConceptNet 5. (As a side note on the previous section, KG-BERT achieves 77.5% accuracy on the dev set after 3 training epochs.) Notably, ConceptNet 5 already contains some triples whose concept pairs are automatically extracted from Wiktionary definitions and have been assigned the vague relation RelatedTo, yet only for concept pairs that both have a Wiktionary entry. Even with respect to the full English subset of ConceptNet 5, the novelty rates of candidate triples are above 99% for nearly all relations (except IsA with 94.7%).

Additionally, novelty may be affected by the presence of synonyms. To approximate the novelty of triples based on semantic distance, Jastrzebski et al. (2018) used the sum of Euclidean distances between the averaged word embeddings of the heads and the tails, respectively, of two triples. We tested this approach on our data but it does not seem to be a good proxy for novelty. We leave further novelty analysis as future work.
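The bag-of-words novelty comparison can be sketched as follows. This is a simplified stand-in for the pipeline described above: it uses a tiny inline stop-word list and a crude suffix-stripping stemmer in place of full lemmatization and stemming, and all names and examples are ours.

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "at", "to", "and", "or"}

def crude_stem(word):
    """Very rough stemmer stand-in: strips a few common suffixes."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(concept):
    """Concept string -> bag of stemmed, stop-word-free tokens."""
    tokens = concept.lower().split()
    return frozenset(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def is_novel(triple, known_triples):
    """A triple is novel if no known triple with the same relation has
    the same normalized subject and object bags (stricter than exact match)."""
    subj, rel, obj = triple
    key = (normalize(subj), rel, normalize(obj))
    known = {(normalize(s), r, normalize(o)) for s, r, o in known_triples}
    return key not in known

known = [("bartender", "AtLocation", "a bar")]
print(is_novel(("bartenders", "AtLocation", "the bar"), known))  # False
print(is_novel(("bartender", "AtLocation", "pub"), known))       # True
```

The first call returns False because "bartenders"/"the bar" and "bartender"/"a bar" normalize to the same bags, so the triple is not counted as novel.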
2. Validity
We sample 50 triples from the high-scored triples of each relation and manually evaluate their validity as well as their novelty using the aforementioned metric. The results are summarized in Table 1. Specifically, we sample triples from the ones scored at least 0.9 for Bilinear AVG and KG-BERT (see the Qual. columns for the numbers of "qualified" triples meeting this selection criterion), and sample from the top 1,000 high-scored triples for the PMI model since its scores are not in the range of [0, 1]. We report the proportion of valid triples in the samples (see the V. columns), and also the proportion of triples that are both valid and novel (see the V.N. columns). Table 2 of the Appendix lists some of the valid and novel triples from the samples we manually evaluated. The results show that the models do mine some valid and novel triples, and that performance varies across relations. Bilinear AVG achieves relatively high accuracy on relations such as HasProperty, UsedFor, and AtLocation, while getting only 6% accuracy on CreatedBy. KG-BERT performs better on some relations, e.g. UsedFor and CapableOf, than on others. PMI achieves better accuracy on the IsA relation compared with the other two models. From the distribution of accuracy, we infer that the term definitions of Wiktionary contain more commonsense knowledge for the relations with relatively higher accuracy, including AtLocation, CapableOf, HasProperty, IsA, MadeOf, and UsedFor. This matches our observation that dictionary term definitions describe more of such basic properties of a concept.

Relation         Bilinear AVG             KG-BERT                  PMI
                 Qual.    V.     V.N.     Qual.    V.     V.N.     V.     V.N.
AtLocation       632      0.44   0.30     47,315   0.04   0.04     0.10   0.10
CapableOf        585      0.30   0.28     25,771   0.26   0.26     0.12   0.12
Causes           293      0.18   0.16     33,511   0.02   0.02     0      0
CreatedBy        13,156   0.06   0.06     59,836   0.02   0.02     0.02   0.02
Desires          67       0.22   0.18     33,126   0      0        0.04   0.04
HasProperty      635      0.44   0.38     48,603   0.06   0.06     0.08   0.08
HasSubevent      173      0.10   0.08     20,719   0.06   0.06     0.02   0.02
IsA              13,537   0.36   0.28     62,819   0.24   0.16     0.46   0.34
MadeOf           487      0.26   0.24     63,781   0.14   0.12     0.10   0.08
PartOf           11,909   0.14   0.12     58,410   0.16   0.16     0.04   0.04
ReceivesAction   238      0.12   0.12     31,491   0.20   0.20     0.06   0.06
UsedFor          491      0.50   0.32     49,913   0.36   0.34     0.18   0.18
Table 1: Manual evaluation results of high-scored triples.
Discussion
Our evaluations indicate the feasibility of mining commonsense triples from term definitions, for which we provide evidence using Wiktionary. We also find that the three models under evaluation have weaknesses in scoring triples: the triples for manual evaluation are all sampled from the high-scored triples, yet for all relations, less than 50% of the sampled triples are valid. If a model were sufficiently discriminatory, invalid triples would have been assigned low scores and thus would not appear in our samples for manual evaluation. We do not conclude which model is the best due to the large discrepancy in their score distributions.

Our manual evaluation also uncovers some interesting model behaviors. Bilinear AVG tends to assign very high scores to the UsedFor relation given concept pairs that actually have the IsA relation. KG-BERT tends to assign very high scores to the CreatedBy relation given concept pairs that actually have the CapableOf relation. Regarding the validity of the candidate triples we extracted from Wiktionary, we can get some sense from the results of manual evaluation. For KG-BERT, which has a large number of triples scored above 0.9 as shown in Figure 1b and Table 1, we can roughly estimate the number of valid triples for each relation using the accuracy numbers in Table 1, e.g. around 49,913 × 0.34 ≈ 17,000 plausible UsedFor triples.

Since our analysis shows relatively wide variability in the studied scoring methods, reliance on these methods may need a deeper evaluation of their individual strengths and weaknesses. Further, validity and novelty are two useful metrics; however, additional research is needed to really consider when mined valid and novel content is worth adding. For instance, we obtained a valid triple (camper, AtLocation, tent) from the definition of camper - "A person who camps, especially in a tent etc." - while ConceptNet already has a similar one, (camper, CapableOf, sleep in a tent). Determining whether the former is worth adding is challenging. One practical approach is to use the performance gain of downstream applications, like question answering, as the criterion for decision making.
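The rough estimate above generalizes to any relation: multiply the number of qualified triples by the sampled rate from Table 1. A small sketch using a few of KG-BERT's rows (the dictionary name is ours; the numbers are copied from Table 1, with UsedFor using the 0.34 rate applied in the text):

```python
# Estimate plausible triples per relation as qualified_count * sampled_rate,
# using KG-BERT rows from Table 1 (rates as used in the discussion).
table1_kgbert = {
    "UsedFor":   (49_913, 0.34),
    "CapableOf": (25_771, 0.26),
    "IsA":       (62_819, 0.24),
}

estimates = {rel: round(qual * rate)
             for rel, (qual, rate) in table1_kgbert.items()}
print(estimates["UsedFor"])  # 16970, i.e. roughly 17,000
```

Such estimates are only coarse extrapolations from 50-triple samples, which is part of why deeper per-model evaluation is needed.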
Conclusion and Future Work
We presented a study on the feasibility of mining commonsense knowledge triples from Wiktionary term definitions. We examined the performance of three models at scoring newly extracted triples. We showed that they do mine some valid and novel triples, with performance varying across semantic relations, and also observed some model weaknesses, e.g. low accuracy for high-scored triples and high variability between models. Our findings suggest careful model pre-evaluation before use in practice. We plan to improve scoring models and candidate extraction techniques, and study the impact of new triples on downstream tasks.
Acknowledgments
This work is funded through the DARPA MCS program award number N660011924033 to RPI under USC-ISI West.
References
Blanco, E.; Cankaya, H.; and Moldovan, D. 2011. Commonsense knowledge extraction using concepts properties. In Twenty-Fourth International FLAIRS Conference.
Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In ACL.
Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP-IJCNLP.
Jastrzebski, S.; Bahdanau, D.; Hosseini, S.; Noukhovitch, M.; Bengio, Y.; and Cheung, J. C. K. 2018. Commonsense mining as knowledge base completion? A study on the impact of novelty. In Proc. of the Workshop on Generalization in the Age of Deep Learning, 8-16.
Kendall, M. G. 1938. A new measure of rank correlation. Biometrika.
Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM.
Li, X.; Taheri, A.; Tu, L.; and Gimpel, K. 2016. Commonsense Knowledge Base Completion. In ACL, 1445-1455.
Lin, B. Y.; Chen, X.; Chen, J.; and Ren, X. 2019. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. In EMNLP-IJCNLP.
Malaviya, C.; Bhagavatula, C.; Bosselut, A.; and Choi, Y. 2020. Commonsense Knowledge Base Completion with Structural and Semantic Context. In AAAI.
Saito, I.; Nishida, K.; Asano, H.; and Tomita, J. 2018. Commonsense knowledge base completion and generation. In CoNLL.
Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In AAAI, volume 33, 3027-3035.
Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI, 4444-4451.
Yao, L.; Mao, C.; and Luo, Y. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.
Young, T.; Cambria, E.; Chaturvedi, I.; Zhou, H.; Biswas, S.; and Huang, M. 2018. Augmenting End-to-End Dialogue Systems With Commonsense Knowledge. In AAAI.
Zhang, H.; Khashabi, D.; Song, Y.; and Roth, D. 2020. TransOMCS: From Linguistic Graphs to Commonsense Knowledge. In IJCAI.

Appendix

In Table 2, we show some valid and novel knowledge triples from the samples we evaluated manually. We group them by the semantic relation and the model used to score them. In Table 3, we report the statistics of candidate triples.
Table 2: Some valid and novel examples from human evaluation, grouped by relation and by the model that scored them.

AtLocation
  Bilinear AVG: (camper, tent), (waiter, restaurant), (database, computer), (stove, room), (locker, store)
  KG-BERT: (paddler, canoe), (circle, figure)
  PMI: (glioma, brain), (coalminer, coal), (collagen, extracellular), (thyroiditis, thyroid), (pneumothorax, chest)

CapableOf
  Bilinear AVG: (nose, smell), (owl, prey), (campfire, heat), (tablecloth, cover), (labor, work)
  KG-BERT: (salmonella, poisoning), (negotiation, achieving agreement), (shield, defense), (auto mechanic, repairing), (yeast, brew)
  PMI: (pursuer, pursues), (massage therapist, massage therapy), (showroom, display), (passenger train, rail transport), (droplet, drop)

Causes
  Bilinear AVG: (damage, harm), (invest, development), (entertainment, enjoyment), (ponder, thought), (howl, sound)
  KG-BERT: (multiple sclerosis, depression)
  PMI: N/A

CreatedBy
  Bilinear AVG: (corn earworm, helicoverpa zea), (kibbutz, economical sharing), (yale, heraldry)
  KG-BERT: (marine, military)
  PMI: (noise pollution, excess noise)

Desires
  Bilinear AVG: (scientist, answer), (predator, prey), (graduate, degree), (workaholic, work), (judge, justice)
  KG-BERT: N/A
  PMI: (sexist, practises sexism), (contestant, game show)

HasProperty
  Bilinear AVG: (fingerprint, unique), (reptile, cold-blooded), (chili, pungent), (beauty, attractive), (deck, flat)
  KG-BERT: (comet, celestial), (ratafia, bitter), (mashed potato, pulpy)
  PMI: (forest, uncultivated), (copyright infringement, unauthorized use), (stockholder, owns stock), (beryllium, alkaline)

HasSubevent
  Bilinear AVG: (golf, hit), (bribe, exchange), (asthma, breath), (yawn, breath)
  KG-BERT: (meningitis, stiffness), (archery, shooting), (rebirth, birth)
  PMI: (cheating, imposture)

IsA
  Bilinear AVG: (sailor suit, clothing style), (oil platform, large structure), (somnoplasty, medical treatment), (scanner, device), (immune response, integrated response)
  KG-BERT: (puma, lion), (mayor, leader), (royal, person), (bacteriostat, chemical)
  PMI: (trombone, instrument), (fire truck, vehicle), (carnivore, animal), (cosmetic surgery, medical treatment), (katakana, Japanese syllabary)

MadeOf
  Bilinear AVG: (pickle, salt), (sushi, rice), (candle, wax), (cymbal, bronze), (vodka, grain)
  KG-BERT: (press release, statement), (daytime, time), (confetti, metal), (gas giant, methane), (casino, room)
  PMI: (chocolate milk, milk), (coronary thrombosis, blood), (glassware, glass), (chopped liver, chicken liver)

PartOf
  Bilinear AVG: (expertise, knowledge), (barney, pejorative slang), (massage therapist, massage therapy)
  KG-BERT: (snooker table, snooker), (seabed, sea), (exotica, american music), (surcharge, price), (free market, market)
  PMI: (chemistry, natural science), (barrier reef, adjacent coast)

ReceivesAction
  Bilinear AVG: (quantity, count), (speech, speak), (asset, value), (cleaner, clean), (plunger, remove)
  KG-BERT: (sausage, made), (experience, produced), (fishing rod, used), (ledger, record), (harpsichord, tuned)
  PMI: (saddlery, saddler), (rowboat, rowing), (space station, habitation)

UsedFor
  Bilinear AVG: (message, communication), (mouthwash, clean), (ribbon, decoration), (tablecloth, protect), (article, report)
  KG-BERT: (hypothesis, observation), (immune system, response), (grader, maintenance), (nature reserve, conserve wildlife), (harpsichord, baroque music)
  PMI: (meteorology, forecasting), (kidney, producing urine), (flush toilet, flush urine), (answer, question), (machine tool, machining)