Hypernyms Through Intra-Article Organization in Wikipedia
Disha Shrivastava∗
MILA, Université de Montréal, Montreal, Canada
[email protected]

Sreyash Kenkre
IBM Research, Bangalore, India
[email protected]

Santosh Penubothula
IBM Research, Bangalore, India
[email protected]
Abstract
We introduce a new measure for unsupervised hypernym detection and directionality. The motivation is to keep the measure computationally light and portable across languages. We show that the relative physical location of words in explanatory articles captures the directionality property. Further, the phrases in the section titles of articles about a word capture the semantic similarity needed for the hypernym detection task. We experimentally show that the combination of features coming from these two simple measures suffices to produce results comparable with the best unsupervised measures in terms of average precision.
Introduction

Given two words w1 and w2, the hypernym detection task is to determine whether there is a hypernym relation between the two words. If a hypernym relation is known to exist, the directionality task is to determine whether w1 is a hypernym or a hyponym of w2. More precisely, due to polysemy, the detection task asks whether there is some meaning of w1 in which it is a hypernym or hyponym of some meaning of w2. The first approaches were pattern based (Hearst, 1992; Snow et al., 2004). However, these suffered from poor recall. This led to the development of methods based on the distributional hypothesis (Harris, 1954) or the distributional inclusion hypotheses (Geffet and Dagan, 2005). These techniques take a very large corpus and, using either window-based or dependency-path-based contexts along with measures like frequency, PPMI (Church and Hanks, 1990), or LPMI (Evert, 2005), find vectors to represent the words. In supervised settings, the vectors for two words are combined suitably and a classifier is trained (Baroni et al., 2012; Roller et al., 2014).

In this paper, we propose two measures: the depth measure and the heading measure. We start off with a large corpus, but instead of finding window- or dependency-path-based contexts, which are very expensive to compute, we argue that the internal organization of descriptive and explanatory documents naturally leads to strong signals that are indicative of hypernyms. By explanatory documents, we mean documents that have been produced with the express purpose of making the reader understand the concepts that the document is describing; textbooks, research papers, and Wikipedia articles are prime examples. We verify this intuition empirically, achieving results comparable to, and in some cases better than, prior techniques on both the hypernym detection and directionality tasks.

One salient feature of our measures is that they exploit how humans organize information in explanatory documents, making them portable across all languages. This offers us an advantage over prior techniques, which depend on the intricacies of the syntax and semantics of the language of the documents.

∗ Work done as part of IBM Research, Bangalore.

Methodology
We use Wikipedia as the source of descriptive and explanatory documents in this paper. For ease of exposition, we define the concept of units. Given an article from Wikipedia, each page title, section title, sub-section title, etc., irrespective of the depth of the section, will be referred to as a heading. Each heading usually consists of a title that describes what the text following it is about. Following the heading are usually a few paragraphs that describe the heading in more detail. This may be followed by another heading, and the pattern repeats. We refer to each heading and the text following it, up to (but not including) the next heading, as a unit. Thus the article is physically organized as a sequence of disjoint units. We represent a unit u as a pair (h, S), where h is the heading and S is the sequence of sentences in the unit.

Given a hypernym-hyponym pair (w1, w2), consider the organization of a Wikipedia article containing both of them. The very first unit at the top of the page is usually a broad introduction of the main topic of the article, and it tends to use more popular words. From the next unit onward, however, the article tends to become more specialized, with greater detail in content. Thus the linguistic contexts used in units occurring lower in the article tend to be more detail oriented than those occurring earlier (except for the first introductory unit). Since more detailed contexts are indicative of a hyponym, hyponyms tend to occur later in the article than hypernyms. The same reasoning applies within a unit: the hypernym will tend to occur in earlier sentences of the unit than the hyponym. We generalize this to the case in which w1 and w2 do not co-occur in the same article as follows. We take a large corpus of articles (e.g., all articles in Wikipedia) and check the depth at which w1 and w2 tend to occur individually. If w1 tends to occur at larger depth than w2, we conclude that w1 is a hyponym of w2.

Let P be the set of articles. Let a ∈ P be an article, and let w be a given word or phrase. To formally define depth, we assume that the article has a fixed rooted tree topology with the units of a as its vertices, denoted by G(a). The root is the first unit of the article, and depth is the distance from the root. We experiment with G(a) being a Star-like tree topology, as indicated by the depths of its sections and sub-sections, or a Linear-like topology, with each unit being the parent of the immediately following unit in the physical layout of the article. We define a function λ(a, w) that captures the depth of each occurrence of w in a. Let I(a, w) denote the set of occurrences of w in a. Each occurrence is a pair (u_i, s_j), where u_i denotes a unit and s_j is the sentence in which w occurs. Multiple instances of w in the same sentence are treated as one instance. Let d(G(a)) denote the total depth of G(a). If d(u_i) is the depth of unit u_i in G(a), and |u_i| is the number of sentences in it, then we define

    λ(a, w) = Σ_{(u_i, s_j) ∈ I(a, w)} (1 − d(u_i)/d(G(a))) (1 − j/|u_i|)

The first factor gives a normalized measure (to ensure the same scale across articles of different sizes) of the depth of each occurrence of w in a. Similarly, the second factor gives a normalized depth of the instance within its unit. The larger λ(a, w) is, the more likely w is to be a hypernym. To aggregate this measure across all articles:

    λ(w) = median_{a ∈ P} λ(a, w)    (1)

For testing relatedness between words, we define the heading measure, inspired by (Do and Roth, 2012).
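As a concrete illustration, the depth measure above can be sketched as follows. The article representation (a list of (heading, sentences, depth) units with precomputed topology depths) and the whitespace tokenization are simplifying assumptions for illustration, not the authors' implementation.

```python
from statistics import median

def depth_lambda(article, word):
    """Sketch of the per-article depth measure lambda(a, w).

    `article` is a list of (heading, sentences, depth) units. For each
    sentence containing `word`, we add the product of a normalized unit
    depth and a normalized sentence position; larger totals suggest the
    word occurs early and shallow, i.e. behaves like a hypernym.
    """
    total_depth = sum(d for _, _, d in article) or 1  # d(G(a))
    score = 0.0
    for heading, sentences, d in article:
        n = len(sentences) or 1
        for j, sent in enumerate(sentences):
            # multiple hits in one sentence count once
            if word in sent.lower().split():
                score += (1 - d / total_depth) * (1 - j / n)
    return score

def aggregate_lambda(articles, word):
    """Aggregate across a corpus with the median, as in Eq. (1)."""
    return median(depth_lambda(a, word) for a in articles)
```

On a toy two-unit article, a word appearing in the shallow introductory unit scores higher than one confined to later, deeper units, matching the intended directionality signal.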
We search Wikipedia for the article on the given phrase w (e.g., if w is the word jumping, then we get the article https://en.wikipedia.org/wiki/Jumping). Since the page is about w, it is organized into sections that explain the properties of w. We can thus represent w simply by the collection of its headings (titles and sub-titles at every level). If the page on w turns out to be a disambiguation page, then the page lists the different possible meanings of w, along with the corresponding links. We follow each of the links to get the possible articles on the different meanings of w. In case any of these pages is again a disambiguation page, we iterate further. For each page that is an article, we form the set of its headings. Each set corresponds to a different meaning of w. We let S_w denote the collection of the heading sets for the different meanings of w (see Algorithm 1). Note that S_w is a set of sets. One advantage of this method is that we get the different meanings of the word up front, whereas in context-feature-based approaches there can be a mixing of the different contexts for polysemous words.

Algorithm 1: Extract Heading Sets
Input: word or phrase w
Output: S_w, a collection of heading sets of pages on w
Function ExtractHeadings(w):
    Let P = {P_1, ..., P_k} be the set of articles on w
    S_w ← ∅
    while P ≠ ∅ do
        Select any P ∈ P; P ← P \ {P}
        if P is not a disambiguation page then
            Let C be the collection of headings on page P
            S_w ← S_w ∪ {C}
        else
            Let D be the collection of articles that P points to as possible meanings of w
            P ← P ∪ D
    return S_w

After computing S_w1 and S_w2 as shown in Algorithm 1, we compute SimScore(w1, w2) as the maximum similarity between an element of S_w1 and an element of S_w2. For the similarity, we experimented with two measures: the Jaccard similarity, and the cosine of the corresponding word2vec (Mikolov et al., 2013) vectors. When using word2vec, for each heading set C we take the mean of the vectors of its headings. Since word2vec uses the contexts of words, this combines our features with contextual features. The final measure we use for a pair of words is:

    ((λ(w1) − λ(w2)) / 2) · SimScore(w1, w2)    (2)

Experiments

We experimented with four datasets widely used in the literature: BLESS (Baroni and Lenci, 2011), EVALution (Santus et al., 2015), Lenci/Benotto (Benotto, 2015), and Weeds (Weeds et al., 2014), taken from the repository provided by (Shwartz et al., 2017). The corpus of articles we use is a complete XML dump of the English Wikipedia.

We extracted the pairs marked as hypernyms from each of the four datasets and computed the depth measure for each word in the pair. If the difference λ(w1) − λ(w2) is less than zero, we mark the pair as False and compute precision. To identify the articles containing w, we indexed the Wikipedia corpus using Elasticsearch (Gormley and Tong, 2015) and used the top thousand articles returned as the set for computing λ(w). We experimented with both the Star and Linear topologies. With the Star topology, our precision is very high on each of the datasets. The worst performance is on BLESS; however, even there the precision is 0.918 (Table 1). For BLESS, the performance of SLQS is reported in (Santus et al., 2014).
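To make the heading measure concrete, here is a minimal sketch of Algorithm 1 together with the Jaccard-based SimScore. The page-access callbacks (fetch_pages, is_disambiguation, linked_titles, headings) are hypothetical stand-ins for Wikipedia lookups, not a real API.

```python
def extract_heading_sets(word, fetch_pages, is_disambiguation,
                         linked_titles, headings):
    """Sketch of Algorithm 1: expand disambiguation pages into their
    linked meanings, collecting one heading set per article page."""
    pending = list(fetch_pages(word))  # initial pages on `word`
    seen, heading_sets = set(), []
    while pending:
        page = pending.pop()
        if page in seen:
            continue
        seen.add(page)
        if is_disambiguation(page):
            pending.extend(linked_titles(page))  # follow each meaning
        else:
            heading_sets.append(frozenset(headings(page)))
    return heading_sets  # S_w: a set of heading sets

def jaccard(a, b):
    """Jaccard similarity between two heading sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def sim_score(sets1, sets2):
    """SimScore(w1, w2): maximum pairwise similarity over meanings."""
    return max((jaccard(c1, c2) for c1 in sets1 for c2 in sets2),
               default=0.0)
```

Taking the maximum over all pairs of heading sets means a polysemous word is scored by its best-matching sense, which is exactly why the disambiguation expansion is done up front.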
Dataset       | Total | Star Precision | Linear Precision
BLESS         | 1198  | 0.918          | 0.536
Weeds         | 1321  | 0.974          | 0.429
EVALution     | 3303  | 0.980          | 0.566
Lenci/Benotto | 1728  | 0.974          | 0.439

Table 1: Testing directionality. Total = total number of pairs present in the test set.

Similar to SLQS, our depth measure is motivated by the distributional informativeness hypothesis (Shwartz et al., 2017). However, without the extensive computation of context vectors and entropy, we are able to demonstrate good performance. As can be seen by physically examining Wikipedia articles, many of them tend to have a Star topology. This indicates that the topology used plays a major role in this feature. More sophisticated techniques will be needed to identify the topology of individual articles.

For the next experiment, we aim to discriminate pairs of words connected by the hypernym relation from words connected by other relations (meronym, coord, attribute, event, antonym, synonym). For each pair, we evaluate our scoring function given in expression (2). We compared our numbers with those given in (Shwartz et al., 2017). In that paper, multiple measures are used, and the best performing measure for every row of the table is presented. We conducted the experiments for both the Star and the Linear topology. However, the results for the Star topology were slightly better, hence we present these in Table 2.

Dataset       | Hyper vs            | AP word2vec | AP Jaccard | Best AP | Best Measure
BLESS         | all other relations | –           | –          | –       | invCL
BLESS         | meronym             | 0.446       | 0.355      | 0.760   | SLQS sub
BLESS         | coord               | 0.203       | 0.235      | 0.537   | SLQS sub
BLESS         | attribute           | 0.581       | 0.509      | 0.740   | SLQS sub
BLESS         | event               | 0.453       | 0.315      | 0.779   | AP Syn
Weeds         | all other relations | –           | –          | –       | clarkeDE
Weeds         | coord               | –           | –          | –       | clarkeDE
EVALution     | all other relations | 0.273       | 0.290      | 0.353   | invCL
EVALution     | meronym             | 0.629       | –          | –       | AP Syn
EVALution     | attribute           | 0.556       | 0.614      | 0.651   | AP Syn
EVALution     | antonym             | 0.520       | 0.526      | 0.550   | SLQS−row
EVALution     | synonym             | 0.606       | 0.593      | 0.657   | SLQS−row
Lenci/Benotto | all other relations | –           | –          | –       | AP Syn
Lenci/Benotto | antonym             | 0.548       | 0.530      | 0.624   | AP Syn
Lenci/Benotto | synonym             | 0.599       | 0.593      | 0.725   | SLQS−row sub

Table 2: AP = average precision. The Best AP and Best Measure are taken from (Shwartz et al., 2017).

For the case of hypernym vs. all other relations, on all datasets except EVALution, our average precision (AP) using both Jaccard and word2vec ((Shwartz et al., 2017) call this AP@all) is better than the best unsupervised measure reported in (Shwartz et al., 2017). For comparing hypernyms against individual relations, we find that with Jaccard similarity our measure performs better than the best measures on meronyms in EVALution and on coordinates in Weeds. However, it performs worse on both of these relations in BLESS. Our system performs worse than the best measure whenever an informativeness measure (Shwartz et al., 2017), like SLQS and its variants, performs well. It performs better, or at least competitively, when the best performing measure is an inclusion measure or a similarity measure (except for hypernym vs. event in BLESS). A possible explanation is that the heading features we use do not capture how informative a phrase is; however, having common headings is an indication of shared features, implying similarity, which is also what inclusion measures indicate. It should be noted that we are comparing our single system against the best performing one in each case. To find the best measure, (Shwartz et al., 2017) vary the measures as well as the features, whereas we have a fixed system. Our system took a day to set up (including coding effort) and a few minutes to run. This is in contrast to methods that rely on the computation of context vectors; the calculation of dependency-parse-tree-based features alone, from the ukWaC and WaCkypedia corpora, took several days on the same machine.
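The scoring and evaluation described above can be sketched as follows. This is an illustrative implementation of expression (2), the directionality rule, and AP@all under our own reading of the text, not the authors' code; function names are ours.

```python
def hypernymy_score(lam1, lam2, sim):
    """Detection score from expression (2): signed depth difference
    scaled by the heading-set similarity SimScore(w1, w2)."""
    return (lam1 - lam2) / 2 * sim

def predict_direction(lam1, lam2):
    """Directionality: the word with the larger lambda is predicted
    to be the hypernym of the pair."""
    return "w1" if lam1 > lam2 else "w2"

def average_precision(scores, labels):
    """AP@all: rank every candidate pair by score (descending) and
    average the precision at the rank of each true hypernym pair."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, ap = 0, 0.0
    for rank, (_, is_hyper) in enumerate(ranked, start=1):
        if is_hyper:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0
```

A perfect ranking (all true pairs above all false ones) yields AP = 1.0; pushing a false pair above a true one lowers the average, which is why AP@all rewards measures that separate hypernyms cleanly from other relations.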
One source of error in our technique is semantic drift due to disambiguation pages. For example, for the pair (alligator, wild), which is marked as attribute in BLESS, our system follows disambiguation links to wildlife and then marks it as a hypernym. We find this pattern repeatedly: e.g., (scale, lizard) is a meronym in BLESS but is classified as a hypernym in the hypernym vs. meronym experiments, while (scale, snake) is a meronym in BLESS and is correctly marked as not a hypernym. One reason for this is that among the disambiguation pages, a word is often generalized to related terms. For the hypernym vs. all experiment, a noticeable proportion of the false positives on each of the four datasets involved at least one word that required a disambiguation page. Selective link following during the disambiguation step can potentially solve this problem.

Conclusion

We showed that the organization of articles is an important feature for the tasks of both hypernym detection and directionality. Using just this simple and computationally cheap measure suffices to give performance comparable to the state-of-the-art unsupervised measures on these tasks. The proposed measure can also be trivially extended to any language with a Wikipedia. We believe future work in this area will benefit from using this feature in complex systems that can improve performance.
References
Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In EACL 2012, pages 23–32.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10.

Giulia Benotto. 2015. Distributional Models for Semantic Relations: A Study on Hyponymy and Antonymy. University of Pisa.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Quang Xuan Do and Dan Roth. 2012. Exploiting the Wikipedia structure in local and global classification of taxonomic relations. Natural Language Engineering, 18(2):235–262.

Stefan Evert. 2005. The Statistics of Word Cooccurrences: Word Pairs and Collocations.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In ACL 2005, pages 107–114.

Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide, 1st edition. O'Reilly Media, Inc.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992, pages 539–545.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In NAACL HLT 2015, pages 970–976.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In COLING 2014, pages 1025–1036.

Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu, and Chu-Ren Huang. 2016. Nine features in a random forest to learn taxonomical semantic relations. In LREC 2016.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In EACL 2014, pages 38–42.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In ACL 2016, Volume 1: Long Papers.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In EACL 2017, pages 65–75.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In NIPS 2004, pages 1297–1304.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David J. Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In COLING 2014, pages 2249–2259.

Julie Weeds, David J. Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In COLING 2004.