Towards better substitution-based word sense induction
Asaf Amrami† and Yoav Goldberg†‡
†Computer Science Department, Bar Ilan University, Israel
‡Allen Institute for Artificial Intelligence
{asaf.amrami, yoav.goldberg}@gmail.com

Abstract
Word sense induction (WSI) is the task of unsupervised clustering of word usages within a sentence to distinguish senses. Recent work obtains strong results by clustering lexical substitutes derived from pre-trained RNN language models (ELMo). Adapting the method to BERT improves the scores even further. We extend the previous method to support a dynamic rather than a fixed number of clusters, as supported by other prominent methods, and propose a method for interpreting the resulting clusters by associating them with their most informative substitutes. We then perform extensive error analysis revealing the remaining sources of errors in the WSI task. Our code is available at https://github.com/asafamr/bertwsi.

Word Sense Induction
Word Sense Induction (WSI) is the task of clustering in-context usages of words into groups that represent senses. A WSI system is given multiple sentences containing usages (instances) of a target lemma+part-of-speech, and is expected to group together usages in which the target is used in the same sense. E.g., in:

1. I like warm summer evenings
2. They were greeted by a warm welcome
3. The waters of the lake are warm

we would like to group (1) and (3) into one sense and (2) into a different one. (Things are often more challenging: a warm hand, for example, might belong to either of the senses above, requiring larger context to disambiguate.)

WSI was explored in several SemEval shared tasks (Agirre and Soroa, 2007; Manandhar et al., 2010; Jurgens and Klapaftis, 2013), with the gold labels following human annotation according to the WordNet (Miller, 1998) sense inventory. WordNet senses are very fine-grained and often hard to tag even for experts (Snyder and Palmer, 2004), leading the latest evaluation (SemEval 2013 task 13) to replace the hard-clustering task with a soft-clustering one, in which each instance can simultaneously belong to several clusters, each with a different label. This clustering is then compared to the human taggers' disagreement data.

Up until recently, state-of-the-art results for WSI were dominated by a series of increasingly sophisticated graphical models (Lau et al., 2013; Wang et al., 2015; Komninos and Manandhar, 2016; Amplayo et al., 2018). A competing approach relies on substitute vectors: each target instance is represented by a distribution over probable in-context substitutes for the word, and clustering is performed over these distributions. This approach by Baskaya et al. (2013) was implemented in the AI-KU system using n-gram language models (LM). In recent work (Amrami and Goldberg, 2018), henceforth referred to as LSDP (Language-model Substitution with Dynamic Patterns), we showed that by replacing the n-gram LM with an ELMo-based biLM (Peters et al., 2018) and adding a dynamic patterns technique, the substitute vectors approach can achieve state-of-the-art results (Section 2).

In this work, we further explore the use of substitute-based approaches for WSI. After verifying that the approach transfers to the recently introduced BERT deep masked LM (Devlin et al., 2018) (with a very significant improvement in WSI scores), we make two additional contributions to the mentioned method: (a) we present a method to move from a fixed number of senses across target words to choosing a dynamic number of senses for each target (as supported by most other WSI methods, e.g., Teh et al., 2005; Komninos and Manandhar, 2016; Amplayo et al., 2018); and (b) we propose a method for interpreting the resulting clusters by associating them with their most informative substitutes.

We describe the LSDP algorithm, which we extend. Given k in-sentence instances of a target word which we wish to cluster into senses, each instance is associated with r representatives. Each representative is composed of n words, which are sampled with replacement from the LM (sampling is performed from the softmax distribution over the top-ℓ logits, ignoring the bias terms). The sampled words are lemmatized, and each representative is then represented as a one-hot vector of its lemmas (multiple occurrences of the same lemma within a representative are discarded). The resulting set of k·r vectors is then clustered, yielding a hard clustering over representatives. This clustering is converted to a soft clustering over instances, by associating each instance with a cluster according to the percentage of its representatives that are assigned to that cluster.

Sampling the representatives from the ELMo biLM does not take into account the word itself when predicting substitutes. The dynamic patterns approach in LSDP overcomes this by querying the LM for a linguistically motivated manipulated context that takes the target word into account. By way of example, to get substitutes for brown in "my dogs are brown", the forward LM is presented with "my dogs are brown and __" rather than with "my dogs are __". This encourages relevant substitutes such as black and discourages less desirable ones like barking.
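To make the pipeline concrete, the following is a minimal sketch of the clustering step, assuming the substitutes have already been sampled and lemmatized. The function name, data layout, and the specific choice of agglomerative clustering over cosine distance are assumptions of this sketch rather than the paper's exact implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer

def cluster_instances(representatives, instance_ids, n_senses=7):
    """representatives: a list of lemma-lists (r per instance, each holding
    n sampled substitutes); instance_ids: the instance each representative
    belongs to. Returns P(sense | instance) as an (instances, senses) matrix."""
    # binary bag-of-lemmas vector per representative
    # (repeated lemmas within a representative are discarded)
    vec = CountVectorizer(analyzer=lambda lemmas: set(lemmas), binary=True)
    X = vec.fit_transform(representatives).toarray()
    # hard clustering over representatives
    # (the "metric" argument is named "affinity" in older scikit-learn versions)
    hard = AgglomerativeClustering(
        n_clusters=n_senses, metric="cosine", linkage="average").fit_predict(X)
    # soft clustering over instances: the fraction of each instance's
    # representatives assigned to each sense
    p = np.zeros((max(instance_ids) + 1, n_senses))
    for inst, sense in zip(instance_ids, hard):
        p[inst, sense] += 1
    return p / p.sum(axis=1, keepdims=True)
```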
Contextualized vector representations from the recently introduced BERT model (Devlin et al., 2018) were shown to outperform ELMo on several NLP tasks. Like ELMo, BERT is trained in a self-supervised manner to predict words in a sentence. BERT differs from ELMo by being based on a Transformer (Vaswani et al., 2017) instead of an LSTM, being a truly bidirectional deep model, and, most importantly for our purposes, being trained in a "noisy masked LM" setup in which the training procedure receives a sentence, replaces some words with a MASK symbol and randomly perturbs some others, and then attempts to predict the original words from the resulting text. Thus, the model learns representations which are predictive of the words in context, and can also take into account the current word when making a prediction. For further details, see Devlin et al. (2018).

The dynamic patterns in LSDP were motivated by the biLM not "seeing" the target word otherwise. In contrast, BERT's architecture and training procedure do allow providing the word together with its context when predicting substitutes. Simply replacing the ELMo LM with BERT's, without dynamic patterns, already provides state-of-the-art results for the SemEval 2013 task, with an AVG score of 37.0 compared to 25.4 with the full LSDP. However, during BERT training words are sometimes randomly replaced, making BERT suspicious and often predicting substitutes by context alone: a probable substitute for "my [dogs] are brown" according to BERT is eyes, which is clearly not a lexical substitute for dogs.

This suggests BERT could also benefit from dynamic patterns. However, the pattern from Amrami and Goldberg (2018) did not show big improvements as-is. Instead, we use parenthetical patterns. We empirically found that the pattern "target (or even [MASK])" (e.g., my dogs (or even [MASK]) are brown) yields good results. (Among others, our search for other patterns included parenthesized and apposition-based lexical-substitute-promoting patterns such as "and [MASK]" and "or [MASK]", as well as hyponymy-promoting patterns like "such as [MASK]".) This pattern resulted in similar scores to using the vanilla sentence and predicting over the target token, while yielding somewhat different results. We combine the two by averaging the logits from both predictions prior to the softmax. We further divide the input logits by a temperature parameter to smooth the distribution and diversify the words, lemmatize BERT word pieces to predict lemmatized words, and pad the sentence with the [CLS] and [SEP] tokens as required by BERT.
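A minimal sketch of this combined prediction, using the HuggingFace transformers library, is given below. The helper functions, parameter names, and the TEMPERATURE default are ours, not the paper's exact implementation (which also lemmatizes word pieces); the sketch assumes the target appears in the sentence as a single word piece:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

TOP_L = 200        # sample only among the top-l logits (l = 200 in the paper)
TEMPERATURE = 1.0  # assumed value; the paper divides logits by a temperature

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def logits_at(text, token):
    """Masked-LM logits at the first occurrence of `token` (possibly [MASK])."""
    enc = tokenizer(text, return_tensors="pt")
    token_id = tokenizer.convert_tokens_to_ids(token)
    pos = (enc.input_ids[0] == token_id).nonzero()[0, 0]
    with torch.no_grad():
        return model(**enc).logits[0, pos]

def sample_substitutes(sentence, target, n_samples=20):
    # prediction 1: the vanilla sentence, predicting over the (unmasked) target
    vanilla = logits_at(sentence, target)
    # prediction 2: the parenthetical pattern "target (or even [MASK])"
    patterned = sentence.replace(
        target, f"{target} (or even {tokenizer.mask_token})", 1)
    pattern = logits_at(patterned, tokenizer.mask_token)
    # average the logits prior to the softmax, smoothing with a temperature
    logits = (vanilla + pattern) / (2.0 * TEMPERATURE)
    top = torch.topk(logits, TOP_L)
    samples = torch.multinomial(
        torch.softmax(top.values, -1), n_samples, replacement=True)
    return tokenizer.convert_ids_to_tokens(top.indices[samples].tolist())

# e.g., substitutes for "dogs": pets/animals rather than context-only "eyes"
print(sample_substitutes("my dogs are brown", "dogs"))
```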
Results

Evaluation on SemEval 2013 and SemEval 2010 yields state-of-the-art results on both datasets, mainly due to BERT's powerful LM (Tables 1 and 2).

Model       FNMI         FBC          AVG
Ours        21.4 (0.5)   –    (0.5)   –    (0.5)
Ours:ND     –    (0.7)   –    (0.2)   37.0 (0.6)
LSDP        11.3         57.5         25.4
AutoSense    7.96        61.7         22.16
MCC-S        7.62        55.6         20.58
ST(SW)       7.14        55.4         19.89
AI-KU        6.5         39.0         15.92

Table 1: Evaluation results on the SemEval 2013 task 13 dataset. We report our mean (STD) scores over 10 runs. ND: no dynamic patterns. ST(SW): Sense-Topic with embedding-similarity weighting.
Model        F-S          V-M         AVG
Ours         71.3 (0.1)   –   (1.8)   –   (1.2)
Ours:ND      –    (0.4)   –   (1.5)   –   (1.2)
AutoSense    61.7          9.8        24.59
SE-WSI-fix   55.1          9.8        23.24
BNP-HC       23.1         21.4        22.23
LDA          60.7          4.4        16.34
Table 2: Evaluation results on the SemEval 2010 task 14 dataset. We report our mean (STD) scores over 10 runs. ND: no dynamic patterns. (To comply with the hard-clustering setup of SemEval 2010, we compute the soft clustering and take the most probable sense for each instance. The hard clustering also suggests the use of V-measure and F-score instead of FNMI and FBC for SemEval 2010.)

We use ℓ = 200, temperature = 1., nReps = 15, nSenses = 7, minInstances = 2, nRepSamps = 20. The compared systems are: AutoSense (Amplayo et al., 2018), MCC-S (Komninos and Manandhar, 2016), Sense-Topic (Wang et al., 2015), SE-WSI-fix (Song, 2016), AI-KU (Baskaya et al., 2013), BNP-HC (Teh et al., 2005), and LDA (Blei et al., 2003). Numbers are taken from the corresponding publications.

LSDP uses a fixed number of 7 clusters for all target words, a choice which was shown to work well on the SemEval 2013 task 13 dataset. However, using a fixed number this way is obviously sub-optimal. Indeed, other works on WSI do attempt to infer the number of clusters for each target: Teh et al. (2005) do so by employing a stick-breaking clustering process, Komninos and Manandhar (2016) use a model-selection criterion to prevent over-specification, and Amplayo et al. (2018) perform a cleanup stage similar to what we propose. Our proposal is based on the premise, supported by empirical observation, that the substitution-based representation clearly identifies the "core" senses that explain most of the mass in the data (as evident by the resulting high task scores), but also introduces some noise around more niche usages or less clear-cut instances.

We thus follow a strategy in which we provide a relaxed upper bound on the number of senses (we use 10), induce this number of clusters, and mark each cluster as being either weak or strong. We then discard the weak clusters, merging each of them into a corresponding strong cluster. For each target, we induce a soft clustering of the corresponding word occurrences into a fixed number of c = 10 senses. Each instance (word occurrence) is now probabilistically associated with c senses. We say that a sense dominates an instance if it is the most probable sense for that instance. We identify senses that dominate fewer than m = 2 instances and mark them as weak senses. The remaining senses (those that dominate m or more instances) are marked as strong.

Recall that each sense is also a hard clustering over representatives. We associate each sense with the average vector of its representatives (its centroid). For each weak sense w, we find the closest strong sense s to w according to the cosine distance between their centroids, assign w's representatives to s, and discard w. We then re-do the soft clustering of instances based on the set of strong senses and the representatives within them.
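The following is a minimal sketch of this weak/strong merging step, under an assumed data layout of our own (a vector per representative, plus arrays mapping each representative to its sense and instance); it is not the paper's exact implementation and assumes every sense has at least one representative and at least one sense is strong:

```python
import numpy as np

def soft_clustering(rep_sense, inst_of_rep, n_senses):
    """P(sense | instance): the fraction of each instance's representatives
    that are assigned to each sense."""
    p = np.zeros((inst_of_rep.max() + 1, n_senses))
    np.add.at(p, (inst_of_rep, rep_sense), 1)
    return p / p.sum(axis=1, keepdims=True)

def merge_weak_senses(rep_vecs, rep_sense, inst_of_rep,
                      n_senses=10, min_instances=2):
    rep_sense = rep_sense.copy()
    p = soft_clustering(rep_sense, inst_of_rep, n_senses)
    # a sense "dominates" the instances for which it is the most probable sense
    dominated = np.bincount(p.argmax(axis=1), minlength=n_senses)
    strong = np.where(dominated >= min_instances)[0]
    # sense centroids: the average vector of each sense's representatives
    cent = np.stack([rep_vecs[rep_sense == s].mean(axis=0)
                     for s in range(n_senses)])
    unit = cent / (np.linalg.norm(cent, axis=1, keepdims=True) + 1e-9)
    sims = unit @ unit.T  # cosine similarity between centroids
    for w in np.where(dominated < min_instances)[0]:
        closest = strong[np.argmax(sims[w, strong])]
        rep_sense[rep_sense == w] = closest  # merge weak w into closest strong
    # re-do the soft clustering over the remaining (strong) senses
    return soft_clustering(rep_sense, inst_of_rep, n_senses)
```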
Evaluating the dynamic number of senses

Unfortunately, this dynamic sense-number assignment did not improve AVG WSI scores on the SemEval 2013 dataset. However, eye-balling the results indicates that the method produces reasonable sense induction solutions. Digging further, we found that using the gold (oracle) number of senses for each target also had a very minimal effect on the WSI scores (adding only a fraction of an AVG point). The AVG score in the SemEval 2013 WSI task is the geometric mean of the FNMI and FBC metrics (AVG = √(FNMI·FBC)), where the first prefers many small clusters while the second prefers fewer and larger clusters. None of FNMI, FBC and AVG is sufficient for indicating a good number of clusters. (The story for the SemEval 2010 WSI task metrics is similar, with V-measure favoring smaller clusters and F-score favoring larger ones.) The metrics also do not penalize over-specification of small senses: while FBC and F-S should discourage over-specification, these measures are proportional in instance pairs and would not punish small mass perturbations, even if those produce an excessive number of senses. We thus aim for a more direct measure for evaluating the produced number of sense clusters.

Previous work, e.g., Song (2016) and Amplayo et al. (2018), compares the absolute number of senses. We instead opt for the somewhat easier task of measuring the correlation between the number of induced senses and the gold number of senses. To motivate measuring correlation and not absolute numbers, recall that the SemEval task's sense inventory is based on WordNet, whose sense hierarchy is very fine-grained. For example, it differentiates between dark used to describe skin color and dark used to describe objects such as pants. A coarser-grained WSI solution may consistently produce fewer senses for each target yet still be valuable. (For example, in the context of query-based search, a user may be satisfied with the coarse-grained distinction between "dark (blackness) times" and "dark (sad) times".) By measuring correlation to the gold number of senses rather than absolute difference, we add invariance to sense granularity. Our system obtains a Spearman rank correlation of 0.43 with the gold number of senses.

To obtain a coarser-grained reference, we map the fine-grained WordNet senses in the gold labels to the coarser senses of the New Oxford American Dictionary (NOAD) (McKean, 2005): we take the most probable NOAD sense for each WordNet one, according to the WordNet–NOAD parallel annotations of Yuan et al. (2016). This reduces the 399 SemEval senses to 205 (89 of which were not found in SemCor and were left intact); overall, 87% of the tokens were mapped to their coarse-grained NOAD senses. As expected, comparing the solution obtained by this oracle mapping to the SemEval 2013 gold labels results in very high WSI scores (FNMI 52.1, FBC 84.7, AVG 66.4). More interesting is the correlation of sense numbers: 0.47, compared to the 0.43 obtained by our method. Our method's Spearman rank correlation to the number of senses in the NOAD labels is 0.44.
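For illustration, the correlation measure itself is straightforward to compute; the sketch below uses SciPy with made-up per-target sense counts (the numbers are hypothetical, not the paper's data):

```python
from scipy.stats import spearmanr

# hypothetical per-target sense counts (induced vs. gold)
induced = [4, 2, 7, 3, 5, 2, 6]
gold = [6, 3, 9, 3, 8, 2, 9]

rho, pvalue = spearmanr(induced, gold)
print(f"Spearman rank correlation: {rho:.2f} (p={pvalue:.3f})")
```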
The substitution-based method lends itself to introspection by considering the substitutes. We highlight the most prominent and informative word substitutes for each sense by computing the pointwise mutual information (PMI) between substitute words and sense clusters. We then annotate each sense with its top 10 most associated substitutes (its signature). These sense signatures can be said to present the essence of what is captured by each sense cluster. As an example, one induced sense for the target meet(VERB) is characterized by the words "convene", "group", "crowd", indicating the sense of a meeting that involves many participants. Interestingly, the WordNet meet(VERB) entry does not make such a distinction between meeting types by the number of their participants, highlighting a case where the unsupervised algorithm refined the human-curated lexicon. Inspecting clusters and their signatures allows us to identify good and bad clusters, and to identify failure modes in the process, as we do in the next section.
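A minimal sketch of how such PMI-based signatures could be computed from sense/substitute co-occurrence counts follows; the data layout and function name are assumptions of this sketch:

```python
import numpy as np

def sense_signatures(counts, vocab, top_k=10):
    """counts[s, w]: how often substitute w was sampled in sense cluster s.
    Returns, per sense, the top_k substitutes by PMI(sense, word)."""
    p_sw = counts / counts.sum()           # joint P(sense, word)
    p_s = p_sw.sum(axis=1, keepdims=True)  # marginal P(sense)
    p_w = p_sw.sum(axis=0, keepdims=True)  # marginal P(word)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_sw / (p_s * p_w))
    pmi[~np.isfinite(pmi)] = -np.inf       # unseen (sense, word) pairs
    return [[vocab[i] for i in np.argsort(-pmi[s])[:top_k]]
            for s in range(counts.shape[0])]
```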
Armed with the cluster signatures, we turn to manually inspect all the produced sense clusters and their associated words. We identify the following characteristic failure modes:

LM: errors of the underlying BERT LM, often on transcribed speech, a domain BERT was not trained on;
SPLIT: an additional cluster for an existing sense, for example the sense encouraging, close, personal, ... for warm(ADJ) when the sense compassionate, favorable, kind, ... already exists;
TEMPLATE: substitutes rely excessively on a template-like pattern;
TOPIC: substitutes rely excessively on topical words;
MERGE: the cluster includes several distinct senses;
OTHER: the cluster includes an incoherent mix of multiple senses with incoherent substitutes.

We sort the SemEval 2013 targets according to our accuracy on them, and consider the 20% top-scoring targets (TOP), the 20% middle-scoring (MID), and the 20% bottom-scoring (BOT), each containing 10 targets. For each of these groups, we inspect all induced senses and manually categorize them into the above failure cases, or into OK in case they are correct. (Additional examples, as well as suggestions for handling the identified failure cases in future work, are provided in the appendix.) Figure 1 summarizes the results. A clear trend is that the majority of issues relate to splitting and merging of clusters and to relying on topical substitutions, while language-modeling and template-following problems are far less severe.
           TOP    MID    BOT    ALL
OK         61%    43%    36%    47%
SPLIT      18%    19%    17%    18%
TEMPLATE    4%     1%     3%     3%
TOPIC      12%    18%    23%    18%
LM          4%     3%     3%     4%
MERGE       0%     9%     5%     5%
OTHER        –     6%    12%     6%

Figure 1: Manual error analysis. Percent of cases broken down by task score. TOP are the best-performing 20% of targets; MID and BOT are the middle and bottom 20% splits, respectively.
Conclusion

We improved a recent WSI method by allowing it to produce a dynamic number of senses, and by showing how the resulting clusters can be inspected and validated through the identification of per-sense characteristic substitutes. These are then used to perform an error analysis of the method, highlighting the major modes of failure and their prevalence, and suggesting promising avenues for future work. Additionally, incorporating BERT as the LM improves the state of the art on two recent SemEval WSI tasks by a large margin, and validates the utility of the dynamic-patterns approach of Amrami and Goldberg (2018).
References
Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 7–12, Stroudsburg, PA, USA. Association for Computational Linguistics.

Reinald Kim Amplayo, Seung-won Hwang, and Min Song. 2018. AutoSense model for word sense induction. CoRR, abs/1811.09242.

Asaf Amrami and Yoav Goldberg. 2018. Word sense induction with neural biLM and symmetric patterns. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4860–4867.

Osman Baskaya, Enis Sert, Volkan Cirik, and Deniz Yuret. 2013. AI-KU: Using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 300–306.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

David Jurgens and Ioannis Klapaftis. 2013. SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 290–299.

Alexandros Komninos and Suresh Manandhar. 2016. Structured generative models of continuous features for word sense induction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3577–3587.

Jey Han Lau, Paul Cook, and Timothy Baldwin. 2013. unimelb: Topic modelling-based word sense induction. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 307–311.

Suresh Manandhar, Ioannis P. Klapaftis, Dmitriy Dligach, and Sameer S. Pradhan. 2010. SemEval-2010 task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 63–68. Association for Computational Linguistics.

Erin McKean. 2005. The New Oxford American Dictionary, volume 2. Oxford University Press, New York.

George Miller. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text.

Linfeng Song. 2016. Word embeddings, sense embeddings and their application to word sense induction. The University of Rochester, April.

Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385–1392.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Jing Wang, Mohit Bansal, Kevin Gimpel, Brian D. Ziebart, and Clement T. Yu. 2015. A sense-topic model for word sense induction with unsupervised data enrichment. Transactions of the Association for Computational Linguistics, 3(1):59–71.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. In COLING 2016.

A Handling of identified failure modes
Some of the failure modes mentioned in Section 4 can be remedied by various means.

LM and TEMPLATE cases are relatively rare and stand out when debugging the final solution. Their distinct distribution usually pushes them into clusters of their own, allowing identification and possibly their removal before rerunning the procedure. Using a suitable LM for the target domain is important, and indeed most LM failures we encountered are due to transcribed spoken text. Fine-tuning BERT on the domain of interest could improve results.

The MERGE and TOPIC cases concern the discerning resolution of our method. An interesting direction for future work is finding a way to collect additional target usages to better model the borderline cases. This also seems like a promising direction for the SPLIT cases.

The senses classified as OTHER are cases where our method completely fails. These include targets such as become(VERB), which are indeed hard to sense-induce with substitutions alone. For example, the WordNet senses differentiate between "become: a change in state" and "become: transform into something else", similarly to Spanish's ser/estar distinction.
B Quality analysis examples
The following provides examples of the different error categories we use, as well as demonstrating the senses that are induced by the method in some cases and their descriptions according to the PMI method. Each table shows an induced sense, its high-ranking PMI words, and sentences associated with this sense. We additionally provide our assessment of that sense (OK, SPLIT, MERGE, TOPIC, TEMPLATE, LM, OTHER), as well as the gold-label WordNet sense for each sentence. We begin with the senses for meet(VERB), a target that our method scores high on (Table 3). We follow with wait(VERB), a target which our method scores low on (Table 4). Finally, Table 5 demonstrates the error categories not present in the previous ones.

Sense 1:
• You're going to meet John Speckman! (5)
• So I guess one question might be how I met my wife. (1)
• We are taking you to meet him the day after you arrive. (1)

Sense 2 (OK). High-PMI words: qualify, offer, below, violate, comply, accomplish, supply, complete, fill, accommodate
• And we need Your help to meet the challenge! (3)
• So, I want to thank you on meeting my first condition. (3)
• They could not meet conditions if their competitors were free to ignore them. (3)

Sense 3 (SPLIT). High-PMI words: conversation, summit, friendly, discussion, business, spend, touching, partner, dining
• it's gonna make the people they're meeting with feel very uncomfortable ... (5)
• Best wishes until we meet again - perhaps over Volume 9 ... (5)
• He and Atta agreed to meet later at a location to be determined. (7)

Sense 4 (OK). High-PMI words: group, convention, weekly, schedule, parliament, convene, celebrate, crowd, originate
• A group called the League of Prizren, named for the Kosovo town where it met, ... (7)
• cat and bagpipean society a society which met at their office ... (7)
• A summer Antiekmarkt or antique market meets at Nieuwmarkt on Sundays ... (7)

Sense 5 (OK). High-PMI words: direct, encounter, dare, oppose, reaction, repulse, cause, underwent, face, react
• They were greeted as liberators by the peasants and met only desultory resistance ... (4,9)
• ... astounded by the funny logic of, say, meeting one's match ... (9)
• It's too bad that ... this understanding has to meet with such hostility, don't you think? (4)

Sense 6 (OK). High-PMI words: maximum, phase, interval, curve, origin, converge, respectively, cancel, border, dip
• we can draw a line of those tangencies ..., that meet at the initial apple-pear distributions ... (6)

Sense 7 (TOPIC). High-PMI words: investigate, cost, fund, ease, budget, recover, shoulder, offset, slash, decrease
• ... appealed to the state government to help meet the cost of burying armed robbers ... (3)

WordNet senses for meet(VERB) in the gold labels:
1. get to know; get acquainted with
2. collect in one place
3. fill or meet a want or need
4. experience as a reaction
5. come together
6. be adjacent or come together
7. get together socially or for a specific purpose
8. meet by design; be present at the arrival of
9. contend against an opponent in a sport, game, or battle

Table 3: Senses induced for the target meet(VERB), on which our method performs relatively well. At the end of each sentence, in parentheses, is its tagged WordNet sense in the gold labels. Our manual assessment of each sense appears next to its number.
Sense 1:
• The horses wait under the cooling shade for their next customers. (1)

Sense 2 (∼OK). High-PMI words: bench, staff, guest, bounce, pat, pit, ticket, fare, to, other
• ... offer the best sightlines, roomier seats, and wait staff who peddle gourmet fare. (3)

Sense 3 (OTHER). High-PMI words: reasonable, slack, qualify, delivery, short, a, temporary, week, hesitation, due
• ... without the need to wait until everyone is in town for a meeting. (4)
• ... operator will be paid at some average earnings rate during the waiting period. (4)
• ... and uh i would agree a a short waiting period would be appropriate to uh ... (4)

Sense 4 (TOPIC). High-PMI words: literally, cooking, everything, pregnant, family, lot, town, money, forever, food
• He had a farm waiting for him right? (2)
• If Clinton, ... , was a time bomb waiting to explode, then ... (1)
• ... as you wouldn't if you had a wife who looked like that waiting for you. (1)

Sense 5 (LM). High-PMI words: ago, they, fade, along, since, drank, though, afterwards, sometimes, uh
• ... i sang in a couple of uh community choirs and then um waited for a while ... (4)

Sense 6 (OTHER). High-PMI words: write, reach, argue, appear, bother, act, seem, star, [, wish
• I'm not sure why you waited a week. (4)
• Good things come to those who wait. (4)
• He had been waiting for Oedipa in the bathroom. (1)

WordNet senses for wait(VERB) in the gold labels:
1. stay in one place and anticipate or expect something
2. look forward to the probable occurrence of
3. serve as a waiter or waitress in a restaurant
4. wait before acting
Table 4: Senses induced for the target wait(VERB), on which our method performs poorly.

Target: strike(VERB). Class: MERGE. High-PMI words: rally, roar, mobilize, go, picket, dominate, rise, uprising, rebel, riot
• William Safire Language Maven Strikes Again.
• ... in 1953 when the workers struck in Berlin and the Party told them to stop ...
• ... on Feb. 21, 1868, the Radicals struck. ...
This sense clusters the meanings of two unrelated concepts: "strikes again" as in "does it again", and "strike" as in "worker uprising". The characteristic substitutes are still informative.

Target: sight(NOUN). Class: TEMPLATE. High-PMI words: advance, propose, glamour, diamond, lipstick, wing, aspiration
• ... Barbara Hershey, 50, set her sights on Brooks' fiance ...
• The Enquirer provides details about how Monica set her sights on her man ...
• ... Frank Sinatra's widow, Barbara, has set her sights on Grace Kelly's widower, ...
The template "X set her sights on Y" produces a distinct substitutes distribution which pushes the sentences above into a sense of their own. This "over-fits" the target usage, giving weight to substitutes such as "lipstick" in the example above.

Table 5: Examples of the error categories not demonstrated in the previous tables.